Using candidates for prediction (Fonduer Prediction Pipeline) #263
I'd leave the 1st question to @senwu or @lukehsiao. For the 2nd question, technically yes: we can separate the training and test pipelines, but it is not straightforward to do so as of today, as I described in #259.
No, you don't have to parse the entire corpus again in the test phase, assuming the entire corpus includes train, dev, and test. In the training phase, save the feature keys in a file like below:
In the test phase, load the feature keys from the file to the database as below:
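The original snippets are not shown above; here is a minimal sketch of the save/load round trip. The key names are hypothetical placeholders (in Fonduer they would be collected from the FeatureKey table), and the commented call assumes a Featurizer with the upsert_keys method (added in 0.7.0):

```python
import pickle

# Training phase: persist the feature key names. In Fonduer these would
# come from the database, e.g.
#   key_names = [k.name for k in session.query(FeatureKey).all()]
# (the names below are hypothetical placeholders)
key_names = ["CORE_WORD_SEQ_[revenue]", "TAB_ROW_NUM_[3]"]
with open("feature_keys.pkl", "wb") as f:
    pickle.dump(key_names, f)

# Test phase: reload the keys and register them in the new database
with open("feature_keys.pkl", "rb") as f:
    loaded_keys = pickle.load(f)
# featurizer.upsert_keys(loaded_keys)  # Featurizer#upsert_keys, 0.7.0+
```

The important point is that only the key *names* need to survive between the two phases; the test-phase featurizer then produces feature matrices with the same columns the model was trained on.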
You may have more questions about the 2nd question. I'm happy to help you as much as I can.
Thanks @HiromuHota . I will definitely try this out. I do have more queries on the prediction capabilities of fonduer. Please help me get some clarity on these.
Hi @atulgupta9, thanks for your interest in our research! Let me try to answer your questions, and I hope it helps you understand more about Fonduer.
We currently provide two prediction models in Fonduer (logistic regression and LSTM). You can tune those two models pretty easily, and we provide a simple interface for that (see here). Of course, you can also customize your own prediction model based on your needs (for example, if you want to use BERT to extract features, here are some references). In order to achieve better quality, there are many factors you might want to consider: we recently published another paper which shares some best practices; you can find it here.
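As a rough illustration of that tuning interface: hyperparameters are passed as keyword arguments to the model's train method. The parameter names and values below are assumptions for illustration and should be checked against the Fonduer version in use:

```python
# Illustrative hyperparameter choices (names/values are assumptions,
# not Fonduer defaults -- check your version's train() signature):
train_kwargs = {
    "n_epochs": 50,    # number of training epochs
    "lr": 0.001,       # learning rate
    "batch_size": 64,  # minibatch size
}

# With one of Fonduer's discriminative models (e.g. logistic regression
# or LSTM), the call would look roughly like:
# disc_model.train((train_cands, F_train), train_marginals, **train_kwargs)
```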
One of the design goals of Fonduer is to address the data variety in richly formatted data. To address that, we propose a unified data model that unifies data with different structures (like HTML). We keep improving our data model to handle more cases (we now support plain text, CSV, TSV, HTML, PDF, etc.).
Yes, the prediction model tries to predict the true candidates based on your labeled data and the useful signals in the data.
I assume you mean multiple relations (correct me if I am wrong). Right now, Fonduer supports candidate extraction for multiple relations, but the learning part only supports a single relation. We will soon release a new machine learning module to support learning multiple relations simultaneously.
Thanks @senwu for the speedy response. It will surely help me proceed. Please can we keep this thread open? I will try to post some snippets to let you guys understand the use case we are trying to tackle with Fonduer and get your input on our approach.
Use Case Discussion

We have sourced financial documents from the internet; these documents have no definite structure or repeating pattern. We will confine this discussion to the extraction and prediction of revenue terms. Below are some snaps of the documents we are handling; each snap is from a different document. Although the above images reflect a specific structure, we have about 299 documents, many of them containing information in images or running text rather than tables. The entire set is divided as follows:

Code

Extracting the mentions

For extracting the revenue mentions, we captured revenue-related tags in a CSV and are using it:
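The tag-driven matching idea can be sketched in a self-contained way. The CSV contents and tag names below are hypothetical; in Fonduer the tags would typically feed a dictionary-based matcher handed to the MentionExtractor:

```python
import csv
import io
import re

# Hypothetical CSV of revenue-related tags, one per row
csv_text = "tag\nrevenue\nnet sales\ntotal revenue\n"
tags = [row["tag"] for row in csv.DictReader(io.StringIO(csv_text))]

# Build a case-insensitive pattern over the tags; a Fonduer dictionary
# matcher would play this role in the real pipeline
pattern = re.compile("|".join(re.escape(t) for t in tags), re.IGNORECASE)

print(bool(pattern.search("Total Revenue for FY2019 was $12.3 million")))  # True
```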
After extraction we got:

Getting the candidates

For candidate extraction, we further refined those tags and limited them using throttlers.
So basically we are looking for sentences that contain those keywords, and the $ amount.
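That throttling rule can be sketched as a plain predicate over sentence text. The keyword list is illustrative; in Fonduer, a throttler actually receives a candidate tuple and would inspect the candidate's sentence text rather than a raw string:

```python
import re

KEYWORDS = re.compile(r"\b(revenue|net sales|total revenue)\b", re.IGNORECASE)
DOLLAR_AMOUNT = re.compile(r"\$\s?\d[\d,]*(\.\d+)?")

def keep_sentence(text: str) -> bool:
    """Keep only sentences with a revenue keyword AND a $ amount."""
    return bool(KEYWORDS.search(text) and DOLLAR_AMOUNT.search(text))

print(keep_sentence("Revenue was $1,200.5 million"))     # True
print(keep_sentence("Revenue grew strongly this year"))  # False: no $ amount
```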
Featurizer phase
We got:

Gold labels generation

Created our own function to load and store the labels, as directed in the hardware tutorial.

Candidate Labelling

In order to label such varied data, we decided to do this process manually, so the candidate sentences were taken out and manually labelled.
We noticed most of the files contained two or three true candidates, and most candidates were false. There is a huge disparity in their numbers. Should this pose a problem? Is abstaining from voting a solution, as per the 'Automating the Generation of Hardware Component Knowledge Bases' research paper? What criteria could be used for abstaining from voting (all candidates are manually labelled here)? Then we ran the generative model to get the train marginals, though I doubt this is required here, as we never abstained from voting; I am really unclear why it is needed. But since the learning model needed it, we used it.
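On the "why are marginals needed" point: the discriminative model consumes per-candidate probabilities, and with fully manual labels (no abstains) those are just hard 0/1 values, so the generative label model adds nothing. A minimal sketch with hypothetical labels:

```python
# Hypothetical manual gold labels: 1 = true candidate, 0 = false
gold = [1, 0, 0, 1, 0]

# The discriminative trainer expects P(y=1) per candidate ("marginals").
# With manual, non-abstaining labels these are simply hard probabilities,
# so running a generative model over them is unnecessary.
train_marginals = [float(y) for y in gold]
print(train_marginals)  # [1.0, 0.0, 0.0, 1.0, 0.0]
```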
Learning Phase

The model used was Sparse Logistic Regression.
The results were not very impressive.
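One note on judging results on such imbalanced data: plain accuracy is misleading when true candidates are rare, so precision/recall/F1 over the positive class are more informative. A small self-contained scorer (the gold/predicted labels are made up for illustration):

```python
def prf1(gold, pred):
    """Precision/recall/F1 for the positive (true-candidate) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predictions on a skewed test set (few true candidates)
gold = [1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 0, 1, 0, 0, 0, 0, 0]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```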
In our second attempt,
Expectations & Queries
Sorry for such a long write-up. Please take your time to go through this and help me. Thanks.
@senwu @HiromuHota, can we use the generated feature size as an estimate of how the model will behave? If so, what is an appropriate range within which we can state that the predictions will be good?
Hi @atulgupta9, thanks for the description of your use case, and I am sorry for the late response (a lot of deadlines this week). Your problem is a very good example/use case for Fonduer. Let me try to answer your questions here:
No, this is not a problem; it is actually a very common case when extracting knowledge from richly formatted data. One general question you might care about is how to generate your candidates, since it's super easy to generate many negative candidates and miss positive candidates.
This is a good question and very important for users who want to provide weak supervision in our framework. The
As I mentioned before, there are several parts you want to consider:
Yes, this is a pretty good example and starting point. It will be more powerful if you can add more documents and let the model learn more.
I think for each phase, you can do a sanity check to make sure it matches your expectations.
This is a good question, but I don't have a good answer for it. Fonduer generates all multi-modality features from the documents based on the feature library. One thing you can check is whether the generated features can be a good indicator or not. Sen
Hi @senwu, thanks for all the help. Just want to have your quick thoughts on this: I am using two sessions (initialized using Fonduer Meta), one for training and the other for prediction. Should the previous session be terminated before initiating a new one? I guess it's something to do with SQLAlchemy.
Hi @atulgupta9, sorry for the late response! I think that might be a SQLAlchemy issue with inserting info into the database. One potential solution is to reduce the parallelism. Sen
Guys, I am trying to build API endpoints for automatically calling the Fonduer built-in functions. I have two separate API endpoints: one for mention/candidate extraction and one for training. These routes may be called any number of times under different projects, so you can assume that every time the API is called, the mentions/candidates and the DB we are referring to would be different. Now, I have these variables. Depending on the number of candidates the user has specified, we will have that many variables. For training the model, I would need these candidates for featurization and labelling. But then I realized that I will have to store either the candidate_<candidate_name> variable defined above or the candidate_extractor that was defined to get the candidates. I tried pickling the candidate_<candidate_name> variable but ended up with an error like this. Can you guys help me find a way to do this? Any help would be appreciated. Thanks
As I mentioned in #259 (comment), it is hard to pickle a dynamically created class (e.g., mention/candidate subclasses in Fonduer).
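The failure can be reproduced without Fonduer: pickle serializes instances by reference to an importable class name, which dynamically created classes (like those returned by Fonduer's mention_subclass/candidate_subclass factories) don't have. A sketch, with the Fonduer-side workaround (re-create and re-query rather than pickle) shown in comments; the RevenueCand/Revenue/Amount names are hypothetical:

```python
import pickle

# Simulate a dynamically created class whose name is not importable
# (Fonduer's subclass factories create classes the same way)
Dynamic = type("NotBoundToThisName", (object,), {})

try:
    pickle.dumps(Dynamic())
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False

print(picklable)  # False: pickle can't locate the class by name

# Workaround: don't pickle candidates. Re-create the subclass with the
# same name/arguments in the new process and re-query the database, e.g.
#   Cand = candidate_subclass("RevenueCand", [Revenue, Amount])
#   cands = session.query(Cand).filter(Cand.split == 0).all()
```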
I am getting this error when I try to redefine an existing relation. Is there any solution for this?
Currently, candidate subclasses as well as mention subclasses can be defined only once. That said, the ability to redefine candidate/mention subclasses might be useful, especially during development.
Unfortunately, we don't support modifying candidate subclasses at runtime, but the current system supports creating new candidate subclasses.
There's no upsert_keys function in version 0.5.0. May I ask what the replacement function's name is?
Featurizer#upsert_keys was added in 0.7.0.
Scenario:
For my use case I have a set of financial documents.
The entire document set is divided into train, dev, and test. The documents are parsed, and the mentions and candidates are extracted with some rules.
The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, as per the normal Fonduer pipeline demonstrated in the hardware tutorial.
Problems & Questions
From my initial analysis and usage following the hardware tutorial, I could not obtain good results.
In the current scenario, for a new document that I feed in for prediction, the entire corpus will have to be parsed to extract the mentions and candidates and store the feature keys.
Please correct me, if that won't be the case and help me with a snippet to showcase the separation.