Add Container and Block for Text #207

Merged
merged 25 commits into from May 12, 2022

Chandu-4444
Contributor

I made a first attempt at a simple recipe for text data, based on the ImageFolders dataset recipe. It currently works for imdb and similarly structured datasets. Any feedback is highly appreciated.

@Chandu-4444 Chandu-4444 changed the title Add basic Text module and sample recipe. Add basic Container and Block for Text Mar 26, 2022
@Chandu-4444 Chandu-4444 changed the title Add basic Container and Block for Text Add Container and Block for Text Mar 26, 2022
@Chandu-4444
Contributor Author

julia> using FastAI

julia> name, recipe = finddatasets(blocks=(Any, Any), name="imdb")[1]
Pair{String, FastAI.Datasets.DatasetRecipe}("imdb", TextFolders(FastAI.Datasets.parentname, false, FastAI.Text.var"#2#4"()))

julia> data, blocks = loadrecipe(recipe, datasetpath("imdb"))
((mapobs(loadfile, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…]), mapobs(parentname, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju])), (TextBlock(), Label{String}(["neg", "pos"])))

julia> text, class = obs = getobs(data, 1000)
("Every movie I have PPV'd because Leonard Maltin praised it to the skies has blown chunks! 
Every single one! 
When will I ever learn?<br /><br />Evie is a raving Old Bag who thinks nothing of saying she's dying of breast cancer to get her way! 
Laura is an insufferable Medusa filled with  The Holy Spirit (and her hubby's protégé)! 
Caught between these harpies is Medusa's dumb-as-a-rock boy who has been pressed into weed-pulling servitude by The Old Bag!<br /><br />
As I said, when will I ever learn?<br /><br />
I was temporarily lifted out of my malaise when The Old Bag stuck her head in a sink, but, unfortunately, she did not die. 
I was temporarily lifted out of my malaise again when Medusa got mowed down, but, unfortunately, she did not die. 
It should be a capital offense to torture audiences like this!<br /><br />
Without Harry Potter to kick him around, Rupert Grint is just a pair of big blue eyes that practically bulge out of its sockets.  
Julie Walters's scenery-chewing (especially the scene when she \"plays\" God) is even more shameless than her character.
<br /><br />
At least this Harold bangs some bimbo instead of Maude. 
For that, I am truly grateful. And if you're reading this Mr. Maltin, you owe me \$3.99!", "neg")
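
The output above pairs each file with a label produced by `parentname` and reports the classes as `Label{String}(["neg", "pos"])`. A minimal sketch of that folder-based labelling, assuming `parentname` returns the name of the directory that directly contains a file. `parentname_sketch` and the example paths are hypothetical stand-ins, not the code in this PR:

```julia
# Assumption: the label of a sample is the name of its parent directory.
parentname_sketch(path::AbstractString) = basename(dirname(path))

# Illustrative paths that follow an imdb/<split>/<label>/<file>.txt layout.
paths = [
    joinpath("imdb", "train", "neg", "0_3.txt"),
    joinpath("imdb", "train", "pos", "0_9.txt"),
    joinpath("imdb", "test", "neg", "0_2.txt"),
]

# One label per file, plus the sorted set of distinct classes.
labels = map(parentname_sketch, paths)
classes = sort(unique(labels))
```

Here `labels` holds one entry per path and `classes` plays the role of the class list carried by the `Label` block.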

@Chandu-4444
Contributor Author

Chandu-4444 commented Mar 30, 2022

I have started adding functions that replace words starting with an uppercase letter, or written entirely in uppercase, with special tokens such as xxmaj and xxup. All the remaining preprocessing utilities can be taken from JuliaText.
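
A minimal sketch of this kind of case-marking rule, assuming that an all-uppercase word becomes `xxup` followed by the lowercased word, and a capitalized word becomes `xxmaj` followed by the lowercased word. The names `mark_case` and `mark_case_tokens` are hypothetical and are not the functions added in this PR. Leaving single-character words untouched is a simplifying choice made here:

```julia
const TK_UP = "xxup"    # marker for words written entirely in uppercase
const TK_MAJ = "xxmaj"  # marker for words that start with an uppercase letter

# Hypothetical helper: rewrite one whitespace-delimited word.
function mark_case(word::AbstractString)
    length(word) > 1 || return String(word)
    letters = filter(isletter, word)
    isempty(letters) && return String(word)
    if all(isuppercase, letters)
        return string(TK_UP, " ", lowercase(word))
    elseif isuppercase(first(word))
        return string(TK_MAJ, " ", lowercase(word))
    end
    return String(word)
end

# Apply the rule to every word of a text, splitting on whitespace.
mark_case_tokens(text::AbstractString) = join(map(mark_case, split(text)), " ")

marked = mark_case_tokens("I LOVED This movie")
```

The marker tokens keep the casing information available to a model even though the vocabulary itself is lowercased.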

Member

@lorenzoh lorenzoh left a comment

Thanks for the PR!

I've left some comments.

Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.

src/Text/blocks/text.jl (outdated, resolved)
src/Text/Text.jl (outdated, resolved)
src/Text/recipes.jl (outdated, resolved)
src/Text/Text.jl (outdated, resolved)
Chandu-4444 and others added 2 commits March 31, 2022 21:11
Sure, I just forgot to remove them from export after I tried testing them.

Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
@Chandu-4444
Contributor Author

Chandu-4444 commented Mar 31, 2022

> Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.

Yes, fastai has a tutorial that uses this dataset: https://docs.fast.ai/tutorial.text.html. The tutorial focuses on sentiment analysis. The first part takes a language model (AWD-LSTM) that was pre-trained on Wikipedia to predict the next word, and uses it directly to predict the sentiment of a given review. The second part uses the ULMFiT approach, which first fine-tunes the language model on the IMDb dataset and then uses the fine-tuned model to predict the sentiment. They achieved SOTA with the second method.

I'll commit the suggested changes and build on them.

In parallel, I'll start reading the AWD-LSTM paper (https://arxiv.org/abs/1708.02182) to understand how the model works in more depth. After that, the plan is to go through the ULMFiT paper (https://arxiv.org/abs/1801.06146).

src/Text/recipes.jl (outdated, resolved)
src/Text/recipes.jl (outdated, resolved)
src/Text/recipes.jl (outdated, resolved)
src/Text/transform.jl (outdated, resolved)
src/datasets/containers.jl (outdated, resolved)
Chandu-4444 and others added 4 commits April 1, 2022 22:13
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Member

@lorenzoh lorenzoh left a comment

Left a comment regarding the block type, but other than that and Brian's open suggestion, this looks good to me!

src/Text/blocks/text.jl (outdated, resolved)
@lorenzoh
Member

Sorry for letting this sit!

Tests were failing due to issues that should be fixed on master, so merging master into this should make the CI green.

The last thing that would be good to have is some tests.

@Chandu-4444
Contributor Author

Sure! Will synchronise it with master and add some tests.

@Chandu-4444
Contributor Author

Umm... To write tests for TextFolders(), I need access to the IMDb dataset. I remember Lorenz mentioning that large datasets shouldn't be used for testing, as they might overload the CI system. Other recipes have smaller datasets that mirror the structure of the original, larger ones, but I couldn't find one for IMDb. (There is one available as a CSV file, but I need an IMDb-like directory structure to test the recipe.) Is there any workaround?
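
One possible way to get an IMDb-like directory structure without downloading the dataset is to generate a tiny one in a temporary directory. The sketch below uses only Julia's standard file functions. The helper names `make_tiny_textfolders` and `textfiles`, the `<split>/<label>/<n>.txt` layout, and the sample texts are all assumptions made for illustration, not part of this PR:

```julia
# Hypothetical fixture: writes <root>/<subset>/<label>/<n>.txt with tiny reviews.
function make_tiny_textfolders(root::AbstractString)
    samples = Dict("neg" => ["Dull and far too long."],
                   "pos" => ["A warm, funny film."])
    for subset in ("train", "test"), (label, texts) in samples
        dir = joinpath(root, subset, label)
        mkpath(dir)
        for (i, text) in enumerate(texts)
            write(joinpath(dir, "$(i).txt"), text)
        end
    end
    return root
end

# Collect every .txt file below `root`, sorted for a stable order.
function textfiles(root::AbstractString)
    files = String[]
    for (dir, _, names) in walkdir(root)
        for name in names
            endswith(name, ".txt") && push!(files, joinpath(dir, name))
        end
    end
    return sort(files)
end

root = make_tiny_textfolders(mktempdir())
files = textfiles(root)
labels = [basename(dirname(f)) for f in files]
```

A recipe test could then be pointed at `root` instead of the real dataset path, which keeps CI free of large downloads.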

@ToucheSir
Member

I wouldn't worry about testing the bits that require file IO for now; mostly just test the helper functionality.

@Chandu-4444
Contributor Author

That sounds good!

src/Textual/blocks/text.jl (outdated, resolved)
src/Textual/recipes.jl (outdated, resolved)
src/Textual/transform.jl (resolved)
Chandu-4444 and others added 3 commits April 23, 2022 14:16
Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
Add tests for text transforms
@Chandu-4444 Chandu-4444 requested a review from lorenzoh May 4, 2022 14:20
@lorenzoh lorenzoh merged commit 2f227aa into FluxML:master May 12, 2022