The task is to predict sentiment labels for 3.txt using the given datasets.
- Python 3.x
- Jupyter
- numpy
- pandas
- sklearn
- Keras
- plotly, matplotlib, seaborn (for visualizations)
- GloVe embeddings for vectorization: download glove.6B.zip (a loading sketch follows this list)
- A GPU to run the GRU; without a GPU, change CuDNNGRU to GRU in model1.ipynb
- eli5 (for debugging models)
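A minimal sketch of loading the GloVe vectors into an embedding matrix, as referenced in the list above (the file name assumes glove.6B.zip has been unzipped locally; `word_index` would come from the Keras tokenizer):

```python
import numpy as np

EMBED_DIM = 300

# Parse glove.6B.300d.txt into a word -> vector lookup.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def build_embedding_matrix(word_index, embed_dim=EMBED_DIM):
    """Map tokenizer word indices to GloVe vectors; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix
```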
Combined the two text files, 1.txt and 2.txt, to create a training dataframe.
43.4% of the labels are 0 (negative) and 56.6% are 1 (positive).
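A sketch of how the training dataframe could be assembled and the class balance checked (the tab-separated `label`/`text` layout of 1.txt and 2.txt is an assumption):

```python
import pandas as pd

# Each line is assumed to look like "<label>\t<review text>";
# adjust sep/columns if the real files differ.
frames = [
    pd.read_csv(path, sep="\t", header=None, names=["label", "text"])
    for path in ("1.txt", "2.txt")
]
train_df = pd.concat(frames, ignore_index=True)

# Class balance: roughly 0.434 for label 0 and 0.566 for label 1.
print(train_df["label"].value_counts(normalize=True))
```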
Visualized some of the important, frequently recurring words. These plots help reveal any skew or bias in the dataset: most of the reviews turn out to be about the Harry Potter and Da Vinci Code movies. The first column covers bad reviews and the second covers good reviews.
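One hedged way to produce those side-by-side plots, assuming the `wordcloud` package and the `train_df` from the sketch above (the exact visualization used may differ):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, label, title in zip(axes, (0, 1), ("Bad reviews", "Good reviews")):
    # Join all reviews for this label into one corpus string.
    text = " ".join(train_df.loc[train_df["label"] == label, "text"])
    ax.imshow(WordCloud(width=800, height=500).generate(text))
    ax.set_title(title)
    ax.axis("off")
plt.show()
```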
Bigram plot to see which relevant words appear together.
Trigram plot to surface a few more word patterns and help decide which model to use.
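The n-gram counts behind these plots could be computed with sklearn's CountVectorizer (a sketch; the top-20 cutoff is arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range, n=20):
    """Return the n most frequent n-grams across the corpus."""
    vec = CountVectorizer(ngram_range=ngram_range).fit(texts)
    counts = vec.transform(texts).sum(axis=0).A1  # total count per n-gram
    ranked = sorted(zip(vec.get_feature_names_out(), counts),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

print(top_ngrams(train_df["text"], (2, 2)))  # bigrams
print(top_ngrams(train_df["text"], (3, 3)))  # trigrams
```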
The base model was a TfidfVectorizer plus logistic regression. Weights for some words: green marks positive contributions and red marks negative ones.
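A minimal version of that base model, with eli5 rendering the weight table (in a Jupyter notebook, positive words show green and negative words red):

```python
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF features + logistic regression as the baseline classifier.
vec = TfidfVectorizer()
X = vec.fit_transform(train_df["text"])

clf = LogisticRegression()
clf.fit(X, train_df["label"])

# Renders an HTML table of per-word weights in a notebook.
eli5.show_weights(clf, vec=vec, top=20)
```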
The final model is an ensemble of a GRU with no pretrained embeddings and a GRU with GloVe embeddings (trained on 6 billion tokens, 300 dimensions).
The output from the model is in 3.txt.
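A sketch of the two GRU branches and the averaging ensemble (layer sizes, the equal 0.5 weights, and the padded inputs `X_test` are assumptions; `embedding_matrix` comes from the GloVe sketch above, and `threshold` from the F1 search described below):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

MAX_WORDS, MAX_LEN, EMBED_DIM = 20000, 100, 300

def build_gru(embedding_matrix=None):
    """GRU classifier; pass the GloVe matrix for the pretrained branch."""
    model = Sequential()
    if embedding_matrix is not None:
        model.add(Embedding(embedding_matrix.shape[0], EMBED_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN, trainable=False))
    else:
        model.add(Embedding(MAX_WORDS, EMBED_DIM, input_length=MAX_LEN))
    model.add(GRU(64))  # swap in CuDNNGRU(64) when a GPU is available
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model_plain = build_gru()
model_glove = build_gru(embedding_matrix)

# ... fit both models on the padded training sequences ...

# Average the two probability streams, then threshold into 0/1 labels.
probs = 0.5 * model_plain.predict(X_test) + 0.5 * model_glove.predict(X_test)
np.savetxt("3.txt", (probs > threshold).astype(int), fmt="%d")
```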
First I used CuDNNGRU, as GRUs are good at learning long-term dependencies in sentences. I tried various other models, but the GRU had the best precision-recall score. The model was good, yet it treated "da vinci" as a positive word even though it is a name. To handle unknown words, make sense of names like Da Vinci, and capture relations between words, I decided to ensemble it with a model that uses word embeddings.
I used the F1 score because it can be interpreted as a weighted average of precision and recall, where 1 is the best value and 0 the worst; the relative contributions of precision and recall to the F1 score are equal.
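In symbols, the F1 score is the harmonic mean of precision and recall:

```
F1 = 2 * (precision * recall) / (precision + recall)
```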
So I used it to determine the best threshold for converting the predicted probabilities into labels (reference here).
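A sketch of that threshold search on the held-out predictions (the 0.01 grid step and the `y_val`/`val_probs` names are assumptions; `val_probs` would be the ensemble's probabilities on the validation set):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, probs):
    """Scan candidate thresholds and keep the one maximizing validation F1."""
    grid = np.arange(0.1, 0.9, 0.01)
    scores = [f1_score(y_true, (probs > t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))], max(scores)

# y_val / val_probs come from the validation split.
threshold, score = best_threshold(y_val, val_probs)
```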
The final accuracy was 0.9454 or 94.54% on the validation set.
- More data: the dataset had only ~8k samples.
- Use shop review data instead of the given movie review data.
- Use bigger embeddings, e.g. the GloVe 840B (840 billion token) 300-dimensional embeddings.