Blocks and container added for Text Dataset #205

Closed
wants to merge 19 commits into from

arcAman07

Registered the NLP (text) datasets to be added in the upcoming months. Added functions for the blocks of the text dataset.
All the registered NLP datasets, along with their forthcoming models, will be added. I am exploring JuliaText, MLUtils, and other packages along with FastAI concepts so that these datasets work well with Flux. As almost all the text datasets are in CSV format, it will be easy to load them and create the corresponding container. I am working on further concepts to implement these text datasets.

So far I have added the basic structure of the text data, comprising the blocks and the containers. I have spent the past week researching (understanding the FastAI docs and codebase). I am currently working on adding a textrow block along with recipes.jl.
I am also working on two datasets, "imdb" and "amazon_review_full"; they have different folder structures, so different blocks will be required. I am also going through the two papers that built state-of-the-art models for these two datasets and working on their implementation. Any reviews thus far would be appreciated.

This reopens PR #100; I had to delete that repo due to a merge issue.

@arcAman07
Author

The blocks and container are fully added (similar to the tabular datasets). I am currently working on the models (reading the papers), adding the recipes, and exploring other libraries. I also have some questions about what needs to be added here.

@darsnack
Member

I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data is not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.

@arcAman07
Author

> I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data is not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.

Yep, I was treating this as a draft PR for now. I will add a container that works for the textual datasets; I was just experimenting and looking at the results with TableDataset. The blocks and the main Text.jl are added for now. I have been reading the paper "Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015)", which uses the amazon_full_csv and ag_news_csv datasets to train its model, so that I am well versed in the training method. Simultaneously, I am working on recipes.jl to load the datasets.

@arcAman07
Author

arcAman07 commented Mar 26, 2022

So I have made a couple of changes. I really wanted to fit the text data (e.g. amazon_review_full_csv, ag_news_csv, etc.) into a tabular format, since even the official fastai text tutorial does it that way. Most of the text datasets to be implemented are news datasets without any headers or column names, but they mostly consist of three columns: "rating", "title", and "news". I therefore added those headers in recipes.jl, so that the data is easier for the user to understand in the tutorial and in visualizations. I have tested it locally and it is working perfectly. Encodings (I will add more text-specific encodings while working on the model for training), containers, and blocks are now added for the text dataset. I am currently writing tests for these blocks and containers, along with the training implementation for these models. With this format, all of the text datasets can be added. I would appreciate some reviews so that I can improve it further.
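As a rough illustration of the header-assignment step described above, here is a minimal sketch in Python using only the standard library. It is not the code from this PR and not FastAI.jl API: the function name `load_headerless_csv` and the sample rows are made up for illustration, and only the three column names come from the comment.

```python
import csv
import io

# Column names from the comment above; the CSV files themselves have no header row.
COLUMNS = ["rating", "title", "news"]


def load_headerless_csv(stream, columns=COLUMNS):
    """Read a headerless CSV stream into a list of dicts keyed by `columns`.

    Hypothetical helper for illustration only.
    """
    # Passing `fieldnames` tells DictReader not to treat the first row as a header.
    reader = csv.DictReader(stream, fieldnames=columns)
    return list(reader)


# Made-up sample rows standing in for a real dataset file.
sample = io.StringIO(
    '"3","Some headline","Body of the first article."\n'
    '"1","Another headline","Body, with a comma."\n'
)
rows = load_headerless_csv(sample)

for row in rows:
    print(row["rating"], "|", row["title"], "|", row["news"])
```

Taking the column names as an argument, rather than fixing them inside the loader, keeps the same function usable for datasets whose columns differ.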

@arcAman07
Author

[Two screenshots; images not preserved]

@arcAman07
Author

[Two screenshots; images not preserved]

This is for the ag_news_csv dataset.

@darsnack
Member

Is there a specific tutorial you are targeting here? It would be helpful to reference that as we review.

@arcAman07
Author

The inspiration for the tutorial is the official fastai text tutorial: https://docs.fast.ai/tutorial.text.html
The data is visualized in a tabular format (the classes and the text are shown, which the TextClassificationRecipe struct can provide), and the rest of the tutorial deals with training the model and its visualizations. I plan to start with the news datasets, as the paper I referred to earlier in the PR covers the architecture required for training models on these datasets.

@arcAman07
Author

To all the maintainers: is there a need to add a text transformation/cleaning module to this package, as there is in the fastai Python package? If we are working with JuliaText it might not be required, and if need be we can add the functions we require to the existing JuliaText repos.

@lorenzoh
Member

Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by Encodings to work with the rest of high-level FastAI.jl machinery.

These should definitely be separate PRs though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs.

@arcAman07
Author

arcAman07 commented Mar 31, 2022

> Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by Encodings to work with the rest of high-level FastAI.jl machinery.
>
> These should definitely be separate PRs though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs.

Great, I am reading through the paper "Character-level Convolutional Networks for Text Classification" to implement the architecture for training on the various news datasets used here, and going through JuliaText and its packages, which we can use for text transformations and encodings. I have currently added the blocks and container to load the recipes, which are working well (in a tabular way, just as in the official fastai tutorial). I would love some feedback so that I can finish this PR in its entirety and start working on the encodings and tutorial in another PR.

@darsnack
Member

darsnack commented Apr 1, 2022

This particular task seems like a classification task on table data. Does it need a separate dataset recipe type, or can it just reuse the table stuff?

Like Lorenz suggested, I think the transforms, etc. should be left out of this PR and only the recipes added. This PR has added a lot of recipes which is great! But the current loadrecipe appears to be hardcoding column names, etc. I would suggest rewriting the datasets to use the existing tabular recipes, then separately think about a text classification task that has a TableRow + Label block. You can look at the tabular classification task as an example.
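To make the suggestion about avoiding hardcoded column names concrete, the following is a language-neutral sketch written in Python, not FastAI.jl code. The class `TableRecipe`, its fields, and the sample data are hypothetical; the idea is only that the recipe receives the column names and the label column as configuration, and each sample comes out as a (row, label) pair, loosely analogous to a TableRow + Label block.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRecipe:
    """Hypothetical recipe: column names are configuration, not hardcoded."""

    columns: tuple
    label_column: str

    def __post_init__(self):
        if self.label_column not in self.columns:
            raise ValueError(f"{self.label_column!r} is not one of {self.columns}")

    def load(self, stream):
        """Return a list of (row, label) pairs read from a headerless CSV stream."""
        reader = csv.DictReader(stream, fieldnames=list(self.columns))
        samples = []
        for record in reader:
            label = record[self.label_column]
            # The remaining columns form the input row.
            row = {k: v for k, v in record.items() if k != self.label_column}
            samples.append((row, label))
        return samples


# Made-up data; the column names follow the ones mentioned earlier in the thread.
recipe = TableRecipe(columns=("rating", "title", "news"), label_column="rating")
data = io.StringIO('"3","Some headline","Body of the article."\n')
samples = recipe.load(data)
print(samples[0])
```

With this shape, a dataset with different columns needs only a different `TableRecipe(...)` value, not a new loader.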

This pull request was closed.

3 participants