
Adding IMDB, Twitter and Stanford Sentiment datasets #18

Merged (40 commits) on Jul 5, 2019

@ComputerMaestro (Contributor) commented Apr 16, 2019

I have added just the datadeps. I will be adding the files, and once that is done I will let you know @oxinabox.

ComputerMaestro added some commits Apr 16, 2019

.travis.yml Outdated
# push coverage results to Coveralls
- julia -e 'cd(Pkg.dir("CorpusLoaders")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
- julia -e 'cd(Pkg.dir("MultiResolutionIterators")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'

@oxinabox (Collaborator) commented Apr 22, 2019

This is not correct.

@ComputerMaestro (Author, Contributor) commented Apr 22, 2019

Sorry for this silly one.

src/IMDB.jl Outdated
paths = Dict{String, Vector{String}}()
folder, subfolder = split(category, "_")
paths = joinpath.(joinpath(datadep"IMDB movie reviews dataset", "aclImdb\\$folder\\$subfolder"),
readdir(joinpath(datadep"IMDB movie reviews dataset", "aclImdb\\$folder\\$subfolder")))

@oxinabox (Collaborator) commented Apr 22, 2019

prefer:

basepath = joinpath(datadep"IMDB movie reviews dataset", "aclImdb\\$folder\\$subfolder")
paths = joinpath.(basepath, readdir(basepath))
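The pattern reads the directory once and broadcasts joinpath over the entries, yielding full paths rather than bare file names. A quick sanity check with a throwaway directory (the file names here are made up):

```julia
# Sanity check of the `basepath` pattern using a temporary directory.
mktempdir() do basepath
    touch(joinpath(basepath, "a.txt"))
    touch(joinpath(basepath, "b.txt"))

    # Broadcast joinpath over the directory listing.
    paths = joinpath.(basepath, readdir(basepath))

    @assert all(isfile, paths)   # full paths, not bare file names
    @assert length(paths) == 2
end
```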
src/IMDB.jl Outdated
:sentences => 2,
:words => 3, :tokens => 3,
:characters => 4
]

@oxinabox (Collaborator) commented Apr 22, 2019

Indentation is off here.

src/IMDB.jl Outdated
IMDB(category="train_pos") = IMDB(datadep"IMDB movie reviews dataset", category)

MultiResolutionIterators.levelname_map(::Type{IMDB}) = [
:Document => 1,

@oxinabox (Collaborator) commented Apr 22, 2019

You mean :document, right?

@ComputerMaestro (Contributor, Author) commented Apr 22, 2019

I have made some changes. Please check the .travis.yml.

@ComputerMaestro (Contributor, Author) commented Apr 24, 2019

Some more adjustments are needed; I will do them shortly.

ComputerMaestro added some commits Apr 26, 2019

@ComputerMaestro (Contributor, Author) commented Apr 26, 2019

I will write tests for IMDB and Twitter as soon as possible. For SST-2 I will make another PR.

@ComputerMaestro (Contributor, Author) commented Apr 26, 2019

I have added the tests as well.

@oxinabox (Collaborator) commented Apr 26, 2019

Thanks. Tests seem to be failing; could you take a look at that?

@ComputerMaestro (Contributor, Author) commented Apr 26, 2019

Sure @oxinabox, I will look at it and will let you know when it is rectified.

ComputerMaestro added some commits Apr 26, 2019

@@ -1,3 +1,6 @@
using CSV
using DataFrames

@oxinabox (Collaborator) commented Apr 26, 2019

Why have these moved?

@ComputerMaestro (Author, Contributor) commented Apr 26, 2019

These are used when Twitter.jl is being used, so I thought to keep them in Twitter.jl. Should I move them back to CorpusLoaders.jl?

@oxinabox (Collaborator) commented Apr 26, 2019

Yes, all imports in src should be in the main file. It makes it easier to keep track of what is in scope and which packages you depend on. (I tried the other way and it eventually becomes a mess.)
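A sketch of the layout being suggested (hypothetical file names, assuming the package's usual single-module structure): every `using` lives in the main module file, and the per-corpus files included below use those packages without importing anything themselves.

```julia
# src/CorpusLoaders.jl -- hypothetical sketch of the suggested layout.
module CorpusLoaders

using CSV        # needed by the Twitter loader
using DataDeps   # needed by the *_DataDeps.jl registration files

# The included files rely on the imports above; they contain no
# `using` statements of their own.
include("Twitter.jl")
include("IMDB.jl")

end # module
```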

@ComputerMaestro (Contributor, Author) commented Apr 26, 2019

ERROR: LoadError: InitError: could not open file /home/travis/build/JuliaText/CorpusLoaders.jl/test/WikiCorpus_DataDeps.jl

I am not able to work out the cause of this error on Travis. The problem is that WikiCorpus_DataDeps.jl lives inside the src folder and is referred to inside the init function of the CorpusLoaders module; the tests load CorpusLoaders from the test folder, and since that folder doesn't contain WikiCorpus_DataDeps.jl, it cannot be opened inside the init function.

@oxinabox (Collaborator) commented Apr 26, 2019

@Ayushk4 how is JuliaText/WordTokenizers.jl#13 going? It would be good to be using that here.

file, polarity = fileMap[split(category, "_")[1]], polarityMap[split(category, "_")[2]]
path = joinpath(datadep"Twitter Sentiment Dataset", "$file")
dataframe = CSV.read(path, header=0)
Twitter(dataframe.Column6[dataframe.Column1 .== polarity])

@oxinabox (Collaborator) commented Apr 26, 2019

Suggested change
Twitter(dataframe.Column6[dataframe.Column1 .== polarity])
return Twitter(dataframe.Column6[dataframe.Column1 .== polarity])
@oxinabox (Collaborator) commented Apr 26, 2019

I am pretty sure we do not need the DataFrames.jl dependency, only the CSV one. CSV has a dependency on DataFrames, but we don't ever call any methods from the DataFrames namespace.

@ComputerMaestro (Contributor, Author) commented Apr 26, 2019

I have removed DataFrames.jl.

@ComputerMaestro (Contributor, Author) commented May 29, 2019

@oxinabox, I have structured the Stanford sentiment dataset in a different way. I have removed DelimitedFiles.

@Ayushk4 referenced this pull request May 30, 2019: Support for CoNLL Corpora #20 (Merged)

@ComputerMaestro (Contributor, Author) commented May 31, 2019

@oxinabox, is this fine now, or are there any changes left?

@oxinabox (Collaborator) commented May 31, 2019

I'm pretty busy and am hoping @aviks can take a look.

@ComputerMaestro (Contributor, Author) commented Jun 1, 2019

I have three more datasets: WikiText-103, WikiText-2, and the Yelp dataset. Should we add them as well?

@oxinabox (Collaborator) commented Jun 3, 2019

In a separate PR. Small PRs are much easier to review and can be merged faster.

@ComputerMaestro (Contributor, Author) commented Jun 16, 2019

Is this fine, or are any changes left?

]

function load(dataset::StanfordSentimentTreebank)
Channel(ctype=@NestedVector(Any, 1), csize=4) do docs

@oxinabox (Collaborator) commented Jun 16, 2019

This line is wrong. Saying @NestedVector(Any, 1) is exactly the same as saying Vector{Any}.

We want to properly declare the type of the returned objects.

I would also like to know your rationale behind csize=4.

@oxinabox (Collaborator) commented Jun 17, 2019

An alternative, if it would not be too much data, would be to not bother with the Channels. They are only there to stop things that might otherwise use all the RAM, so if this file is not too big they are not needed.

But there are also a number of other problems with the above code, including the use of file for two different variables and not using eachline. Note the other changes in the code below.

function load(dataset::StanfordSentimentTreebank)
    map(eachline(dataset.filepath)) do line
        doc_raw, sentiment_raw = split(line, '|')
        sents = [intern.(tokenize(sent)) for sent in split_sentences(doc_raw)]
        sentiment = parse(Float64, sentiment_raw)
        [sents, sentiment]  # TODO: rethink this return type.
    end
end

@ComputerMaestro (Author, Contributor) commented Jun 25, 2019

I thought of such a return type so that sentences and their sentiments can be taken together. If you have any other suggestion, please let me know. Or is it fine to output a 2-D array with the first column as arrays of tokens and the second as their respective sentiments?

@ComputerMaestro (Author, Contributor) commented Jun 25, 2019

> This line is wrong. Saying @NestedVector(Any, 1) is exactly the same as saying Vector{Any}.
>
> We want to properly declare the type of the returned objects.
>
> I would also like to know your rationale behind csize=4.

Sorry for that, I think I misunderstood the csize thing.

line = join([dict[id, 1], sentiments[parse(Int, dict[id, 2])+1, 2]], '|')
write(fdataset, line, "\n")
end
close(fdataset)

@oxinabox (Collaborator) commented Jun 17, 2019

The following style is preferred:

local dict
open("dictionary.txt", "r") do fdict
	dict = split.(readlines(fdict), '|')[1:end-1]
	dict = permutedims(hcat(dict...))
end

local sentiments
open("sentiment_labels.txt", "r") do fsenti
	sentiments = split.(readlines(fsenti), '|')[2:end-1]
	sentiments = permutedims(hcat(sentiments...))
end

open("sa_dataset.txt", "w") do fdataset
	for id = 1:size(dict, 1)
		line = join([dict[id, 1], sentiments[parse(Int, dict[id, 2])+1, 2]], '|')
		write(fdataset, line, "\n")
	end
end
@test typeof(doc[2]) <: Number
end

docs = permutedims(reshape(hcat(docs...), (length(docs[1]), length(docs))))

@oxinabox (Collaborator) commented Jun 17, 2019

Isn't this the same as:

Suggested change
docs = permutedims(reshape(hcat(docs...), (length(docs[1]), length(docs))))
docs = permutedims(hcat(docs...))
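It is: when every element of docs has the same length, hcat already produces a matrix of that shape, so the reshape is a no-op. A quick check with made-up data (three 2-element vectors standing in for docs):

```julia
# Hypothetical data standing in for `docs`: three documents, two fields each.
docs = [[1, 2], [3, 4], [5, 6]]

# Original expression: hcat gives a 2x3 matrix, reshape is a no-op here.
a = permutedims(reshape(hcat(docs...), (length(docs[1]), length(docs))))

# Simplified expression.
b = permutedims(hcat(docs...))

@assert a == b
@assert a == [1 2; 3 4; 5 6]
```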

@ComputerMaestro (Author, Contributor) commented Jun 25, 2019

Yeah, silly me.

function Twitter(category="train_pos")
fileMap = Dict("train" => "training.1600000.processed.noemoticon.csv", "test" => "testdata.manual.2009.06.14.csv")
polarityMap = Dict("pos" => 4, "neg" => 0, "neu" => 2)
file, polarity = fileMap[split(category, "_")[1]], polarityMap[split(category, "_")[2]]

@oxinabox (Collaborator) commented Jun 17, 2019

Suggested change

file, polarity = fileMap[split(category, "_")[1]], polarityMap[split(category, "_")[2]]
file_id, polarity_id = split(category, "_")
file = fileMap[file_id]
polarity = polarityMap[polarity_id]
@oxinabox (Collaborator) commented Jun 17, 2019

> Is this fine, or are any changes left?

Sorry this is taking so long. The overall quality of the code, in terms of cleanliness and idiomatic programming, is still not great.

I think after the comments I made are addressed it will be acceptable.

ComputerMaestro added some commits Jun 25, 2019

@@ -6,7 +6,7 @@ function Twitter(category="train_pos")
fileMap = Dict("train" => "training.1600000.processed.noemoticon.csv", "test" => "testdata.manual.2009.06.14.csv")
polarityMap = Dict("pos" => 4, "neg" => 0, "neu" => 2)

@oxinabox (Collaborator) commented Jun 26, 2019

Prefer polarity_map and file_map (snake_case).

@ComputerMaestro (Contributor, Author) commented Jun 27, 2019

I checked that all the above tests are failing due to a problem in the TokTok tokenizer in WordTokenizers v0.5.2; they pass with WordTokenizers v0.4.0. I will ask Ayush.

@Ayushk4 referenced this pull request Jun 27, 2019: Toktok fix patch #31 (Merged)

@Ayushk4 (Member) commented Jun 29, 2019

For record-keeping purposes: this also addresses #13 in a way.

@ComputerMaestro force-pushed the ComputerMaestro:master branch from 2d3424b to b7e19da on Jun 30, 2019

@oxinabox merged commit 5d12a44 into JuliaText:master on Jul 5, 2019

2 checks passed:
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@oxinabox (Collaborator) commented Jul 5, 2019

Thanks! Let's get this in.
