Blocks and container added for Text Dataset #205

arcAman07 · 2022-03-20T18:12:25Z

Registered the NLP (text) datasets that will be added in the upcoming months, and added functions for the blocks of the text dataset.
All the registered NLP datasets, along with their forthcoming models, will be added. I am exploring JuliaText, MLUtils, and other packages along with FastAI concepts so that these datasets work well with Flux. As almost all the text datasets are in CSV format, it will be easy to load them and create the corresponding container. I am working on the further concepts needed to implement these text datasets.

So far I have added the basic structure of the text data, comprising the blocks and the containers. I have spent the past week researching this (working through the FastAI docs and codebase). I am currently adding the textrow block along with recipes.jl.
I am also working on two datasets, "imdb" and "amazon_review_full". They have different folder structures, so different blocks will be required. I am also going through the two papers that built state-of-the-art models for these two datasets and working on their implementation. Any reviews thus far will be appreciated.

This reopens PR #100; I needed to delete that repo due to a merging issue.

arcAman07 · 2022-03-21T05:02:09Z

The blocks and container are fully added (similar to the tabular datasets). I am currently working on the models (reading the papers), adding the recipes, and exploring other libraries. I also have some questions about what needs to be added here.

darsnack · 2022-03-21T22:20:30Z

I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data is not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.

arcAman07 · 2022-03-22T15:58:12Z

> I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data is not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.

Yep, I was working on this as a draft PR. I will add a container that works for the textual dataset; I was just experimenting and seeing the results with the TableDataset. The blocks and the main Text.jl are added currently. I have been reading the paper "Character-level Convolutional Networks for Text Classification" (Advances in Neural Information Processing Systems 28, NIPS 2015), which uses the amazon_full_csv and ag_news_csv datasets to train the model, so that I am well versed in the training method. I am simultaneously working on recipes.jl to load the datasets.

arcAman07 · 2022-03-26T14:52:05Z

So I have made a couple of changes. I wanted to fit the text data (e.g. amazon_review_full_csv, ag_news_csv, etc.) into a tabular format, as the official fastai tutorial for text datasets does it that way. Most of the text datasets to be implemented are news datasets that ship without headers or column names but consist mainly of three columns: "rating", "title" and "news". I added those headers in recipes.jl so that the data is easier for the user to understand, both in the tutorial and in visualizations. I have tested it locally and it works as expected. Encodings (I will add more text-specific encodings while working on the model for training), containers, and blocks are now added for the text dataset. I am currently writing tests for these blocks and containers, along with the training implementation for these models. With this format, all of the text datasets can be added. I would appreciate some reviews so that I can improve it further.
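
For readers following along, the idea of attaching column names to a headerless CSV can be sketched independently of FastAI.jl. The snippet below is a minimal, language-neutral illustration in Python, not the code in this PR: the sample rows, the `COLUMNS` constant, and the `load_headerless_csv` helper are all hypothetical, and only the three column names come from the comment above.

```python
import csv
import io

# Hypothetical sample rows in the headerless three-column layout described above.
SAMPLE = (
    '"3","Example headline one","Example body text for the first item."\n'
    '"4","Example headline two","Example body text for the second item."\n'
)

# Column names taken from the comment above; the files themselves carry none.
COLUMNS = ["rating", "title", "news"]


def load_headerless_csv(stream, columns):
    """Read a headerless CSV and return one dict per row, keyed by `columns`."""
    # Passing `fieldnames` tells DictReader to treat the first line as data,
    # not as a header row.
    reader = csv.DictReader(stream, fieldnames=columns)
    return list(reader)


rows = load_headerless_csv(io.StringIO(SAMPLE), COLUMNS)
for row in rows:
    print(row["rating"], "|", row["title"])
```

Passing the column names in as an argument, rather than fixing them inside the loader, keeps the same function usable for datasets whose columns differ.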

arcAman07 · 2022-03-26T14:52:49Z

arcAman07 · 2022-03-26T14:57:40Z

This is for the ag_news_csv dataset.

darsnack · 2022-03-29T20:20:40Z

Is there a specific tutorial you are targeting here? It would be helpful to reference that as we review.

arcAman07 · 2022-03-29T20:40:15Z

The tutorial is inspired by the official fastai text tutorial: https://docs.fast.ai/tutorial.text.html
There, the data is visualized in a tabular format (the classes and the text are shown, which the TextClassificationRecipe struct can do here), and the rest of the tutorial deals with training the model and its visualizations. I plan to start with the news datasets, as the paper I referred to earlier in the PR covers the architecture required for training models on these datasets.

arcAman07 · 2022-03-30T09:32:50Z

To all the maintainers: is there a need to add a text transformation/cleaning module to this package, as is present in the fastai Python package? If we are working with JuliaText, it might not be required, and if need be we can add the functions we require to the existing JuliaText repos.

lorenzoh · 2022-03-31T07:59:17Z

Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by Encodings to work with the rest of high-level FastAI.jl machinery.

These should definitely be separate PRs, though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs.

arcAman07 · 2022-03-31T08:15:24Z

> Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by Encodings to work with the rest of high-level FastAI.jl machinery.
>
> These should definitely separate PRs though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs

Great. I am reading through the paper "Character-level Convolutional Networks for Text Classification" to implement the architecture for training on the various news datasets used here, and going through JuliaText and its packages, which we can use for text transformations and encodings. I have currently added the blocks and container to load the recipes, which are working well (in a tabular way, just like the official fastai tutorial). I would love some feedback so that I can finish this PR in its entirety and start working on the encodings and tutorial in another PR.

darsnack · 2022-04-01T17:28:13Z

This particular task seems like a classification task on table data. Does it need a separate dataset recipe type, or can it just reuse the table stuff?

Like Lorenz suggested, I think the transforms, etc. should be left out of this PR and only the recipes added. This PR has added a lot of recipes which is great! But the current loadrecipe appears to be hardcoding column names, etc. I would suggest rewriting the datasets to use the existing tabular recipes, then separately think about a text classification task that has a TableRow + Label block. You can look at the tabular classification task as an example.
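
The suggestion above, treating each sample as a table row plus a label instead of hardcoding column names in the loader, can be sketched roughly as follows. This is a language-neutral illustration in Python under stated assumptions, not FastAI.jl's API: `TextClassificationSpec` and `split_row` are hypothetical names, and the example row is made up.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TextClassificationSpec:
    """Hypothetical description of which columns are inputs and which is the label."""
    text_columns: Tuple[str, ...]
    label_column: str


def split_row(row: Dict[str, str], spec: TextClassificationSpec) -> Tuple[Dict[str, str], str]:
    """Split one table row into (input columns, label) according to `spec`."""
    features = {name: row[name] for name in spec.text_columns}
    return features, row[spec.label_column]


# The column names live in the spec supplied by the caller, not in the loader.
spec = TextClassificationSpec(text_columns=("title", "news"), label_column="rating")
row = {"rating": "3", "title": "Example title", "news": "Example body text."}
features, label = split_row(row, spec)
print(features, label)
```

With the column roles supplied per dataset, a single generic table loader can serve every dataset, and the task definition decides what counts as input and what counts as target.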

arcAman07 added 2 commits March 20, 2022 23:39

881d084 Blocks and container added for Text Dataset
1b1427b Blocks and container added for Text Dataset
2ab81a3 Merge branch 'FluxML:master' into master

arcAman07 and others added 14 commits March 23, 2022 11:44

54736a7 Merge branch 'FluxML:master' into master
c19fb4c Merge branch 'FluxML:master' into master
6be2728 Merge branch 'FluxML:master' into master
deceab8 Updated code
461e618 Updated code
c4ed346 Updated code
a79cc32 Updated code
2fba24e Updated code
6b803af Code added to load news datasets
5488e03 Code added to load news datasets
7517523 Code added to load news datasets
b49bb2a Code added to load news datasets
ac14522 encodings added for testing
e2a383c encodings added for testing
85977e9 Added text file support
acd5627 More recipes added which uses similar model infrastructure

This pull request was closed.
