Blocks and container added for Text Dataset #205
Blocks and containers are fully added (similar to the tabular datasets). I am currently working on the models (reading the papers), adding the recipes, and exploring other libraries. I also have some doubts about what needs to be added here.
I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data is not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.
Yep, I was working on this as a draft PR. I will add a container that works for textual datasets; I was just experimenting and checking the results with the TableDataset. The blocks and the main Text.jl are added for now. I have been reading the paper "Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015)", which uses the amazon_full_csv and ag_news_csv datasets to train the model, so that I am well versed in the training method. Simultaneously, I am working on recipes.jl to load the datasets.
So I have made a couple of changes. I really wanted to fit the text data (e.g. amazon_review_full_csv, ag_news_csv, etc.) into a tabular data format, as the official fastai tutorial for text datasets does it that way. Most of the text datasets to be implemented are news datasets without any headers/column names, but they consist mainly of three columns: "rating", "title" and "news". I therefore added those headers in recipes.jl, which I was working on, so that the data is easier for the user to understand while developing the tutorial and for visualization. I have tested it locally and it is working perfectly. Currently, encodings (I will add more encodings specific to text while working on the model for training), containers and blocks are added for the text dataset. I am now writing tests for these blocks and containers, along with the training implementation for these models. With this format all of the text datasets can be added. I would appreciate some reviews so that I can improve it further.
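For reviewers who want to picture the header step described above, here is a minimal, language-neutral sketch in Python using only the standard library. It is not the recipes.jl code from this PR. The function name `load_headerless_csv` and the sample rows are hypothetical; only the three column names come from the comment above.

```python
import csv
import io

# Column names quoted in the comment above; the raw files carry no header row.
COLUMNS = ["rating", "title", "news"]


def load_headerless_csv(stream, columns=COLUMNS):
    """Read a headerless CSV stream into a list of dicts keyed by `columns`.

    Hypothetical helper for illustration only, not part of the PR.
    """
    rows = []
    for record in csv.reader(stream):
        if not record:
            # Skip blank lines.
            continue
        if len(record) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(record)}"
            )
        rows.append(dict(zip(columns, record)))
    return rows


# Made-up sample rows in the same three-column layout (assumption).
SAMPLE = (
    '"3","Some title","Some body text, with a comma"\n'
    '"1","Another title","More text"\n'
)

rows = load_headerless_csv(io.StringIO(SAMPLE))
```

Attaching the names at load time keeps the raw files untouched, and every later step (visualization, encodings) can refer to columns by name rather than by position.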
Is there a specific tutorial you are targeting here? It would be helpful to reference that as we review.
The tutorial to be made is inspired by the official fastai text tutorial: https://docs.fast.ai/tutorial.text.html
To all the maintainers: is there a need to add a text transformation/cleaning module to this package, as there is in the fastai Python package? If we are working with JuliaText it might not be required, and if need be we can add any functions we require to the existing JuliaText repos.
Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by These should definitely be separate PRs though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs.
Great, I am reading through the paper "Character-level Convolutional Networks for Text
This particular task seems like a classification task on table data. Does it need a separate dataset recipe type, or can it just reuse the table stuff? Like Lorenz suggested, I think the transforms, etc. should be left out of this PR and only the recipes added. This PR has added a lot of recipes which is great! But the current |
Registered the NLP (text) datasets to be added in the upcoming months. Added functions for the blocks of the text dataset.
All the registered NLP datasets, along with their forthcoming models, will be added. Exploring JuliaText, MLUtils and other packages along with FastAI concepts so that these datasets work well with Flux. As almost all the text datasets are in CSV format, it will be easy to load them and create the corresponding container; working on further concepts to implement these text datasets.
Currently I have added the entire basic structure of the text data, comprising the blocks and the containers. I have researched a lot over the past week (understanding the FastAI docs and codebase). Currently working on adding the textrow block along with recipes.jl.
Also currently working on two datasets, "imdb" and "amazon_review_full"; as they have different folder structures, different blocks would be required. Also going through the two papers that built state-of-the-art models for these two datasets and working on their implementation. Any reviews thus far will be appreciated.
Reopened PR #100; I needed to delete that repo due to a merging issue.