The task is to predict sentiment labels for 3.txt using the given datasets.
- Python 3.x
- Jupyter
- numpy
- pandas
- sklearn
- Keras
- plotly, matplotlib, seaborn (for visualizations)
- GloVe embeddings for vectorization: download glove.6B.zip (a loading sketch follows this list)
- A GPU to run the GRU; without a GPU, change CuDNNGRU to GRU in model1.ipynb
- eli5 (for debugging models)
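A minimal sketch of loading the GloVe vectors into an embedding matrix, as referenced in the list above (the file name assumes glove.6B.zip has been unzipped locally; `word_index` would come from the Keras tokenizer):

```python
import numpy as np

EMBED_DIM = 300

# Parse glove.6B.300d.txt into a word -> vector lookup.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def build_embedding_matrix(word_index, embed_dim=EMBED_DIM):
    """Map tokenizer word indices to GloVe vectors; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix
```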
Combined the two text files, 1.txt and 2.txt, to create a training dataframe.
43.4% of the labels are 0 (negative) and 56.6% are 1 (positive).
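A sketch of how the training dataframe could be assembled and the class balance checked (the tab-separated `label`/`text` layout of 1.txt and 2.txt is an assumption):

```python
import pandas as pd

# Each line is assumed to look like "<label>\t<review text>";
# adjust sep/columns if the real files differ.
frames = [
    pd.read_csv(path, sep="\t", header=None, names=["label", "text"])
    for path in ("1.txt", "2.txt")
]
train_df = pd.concat(frames, ignore_index=True)

# Class balance: roughly 0.434 for label 0 and 0.566 for label 1.
print(train_df["label"].value_counts(normalize=True))
```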
Visualized some of the important, frequently recurring words. These plots help reveal any skew or bias in the dataset: most of the reviews turn out to be about the Harry Potter and Da Vinci Code movies. The first column covers bad reviews and the second covers good reviews.
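One hedged way to produce those side-by-side plots, assuming the `wordcloud` package and the `train_df` from the sketch above (the exact visualization used may differ):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, label, title in zip(axes, (0, 1), ("Bad reviews", "Good reviews")):
    # Join all reviews for this label into one corpus string.
    text = " ".join(train_df.loc[train_df["label"] == label, "text"])
    ax.imshow(WordCloud(width=800, height=500).generate(text))
    ax.set_title(title)
    ax.axis("off")
plt.show()
```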
Bigram plot to see which relevant words appear together.
Trigram plot to surface a few more word patterns and help decide which model to use.
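The n-gram counts behind these plots could be computed with sklearn's CountVectorizer (a sketch; the top-20 cutoff is arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range, n=20):
    """Return the n most frequent n-grams across the corpus."""
    vec = CountVectorizer(ngram_range=ngram_range).fit(texts)
    counts = vec.transform(texts).sum(axis=0).A1  # total count per n-gram
    ranked = sorted(zip(vec.get_feature_names_out(), counts),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

print(top_ngrams(train_df["text"], (2, 2)))  # bigrams
print(top_ngrams(train_df["text"], (3, 3)))  # trigrams
```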
The base model was a TfidfVectorizer plus logistic regression. Weights for some words: green marks positive contributions and red marks negative ones.
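A minimal version of that base model, with eli5 rendering the weight table (in a Jupyter notebook, positive words show green and negative words red):

```python
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF features + logistic regression as the baseline classifier.
vec = TfidfVectorizer()
X = vec.fit_transform(train_df["text"])

clf = LogisticRegression()
clf.fit(X, train_df["label"])

# Renders an HTML table of per-word weights in a notebook.
eli5.show_weights(clf, vec=vec, top=20)
```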
The final model is an ensemble of a GRU with no pretrained embeddings and a GRU with GloVe embeddings (trained on 6 billion tokens, 300 dimensions).
The output from the model is in 3.txt.
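A sketch of the two GRU branches and the averaging ensemble (layer sizes, the equal 0.5 weights, and the padded inputs `X_test` are assumptions; `embedding_matrix` comes from the GloVe sketch above, and `threshold` from the F1 search described below):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

MAX_WORDS, MAX_LEN, EMBED_DIM = 20000, 100, 300

def build_gru(embedding_matrix=None):
    """GRU classifier; pass the GloVe matrix for the pretrained branch."""
    model = Sequential()
    if embedding_matrix is not None:
        model.add(Embedding(embedding_matrix.shape[0], EMBED_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN, trainable=False))
    else:
        model.add(Embedding(MAX_WORDS, EMBED_DIM, input_length=MAX_LEN))
    model.add(GRU(64))  # swap in CuDNNGRU(64) when a GPU is available
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model_plain = build_gru()
model_glove = build_gru(embedding_matrix)

# ... fit both models on the padded training sequences ...

# Average the two probability streams, then threshold into 0/1 labels.
probs = 0.5 * model_plain.predict(X_test) + 0.5 * model_glove.predict(X_test)
np.savetxt("3.txt", (probs > threshold).astype(int), fmt="%d")
```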
First I used CuDNNGRU, as GRUs are good at learning long-term dependencies in sentences. I tried various other models, but the GRU had the best precision-recall score. The model was good, yet it treated "da vinci" as a positive word even though it is a name. To handle unknown words, make sense of names like Da Vinci, and capture relations between words, I decided to ensemble it with a model that uses word embeddings.
I used the F1 score because it can be interpreted as a weighted average of precision and recall, where 1 is the best value and 0 the worst; the relative contributions of precision and recall to the F1 score are equal.
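In symbols, the F1 score is the harmonic mean of precision and recall:

```
F1 = 2 * (precision * recall) / (precision + recall)
```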
So I used it to determine the best threshold for converting the predicted probabilities into labels (reference here).
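A sketch of that threshold search on the held-out predictions (the 0.01 grid step and the `y_val`/`val_probs` names are assumptions; `val_probs` would be the ensemble's probabilities on the validation set):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, probs):
    """Scan candidate thresholds and keep the one maximizing validation F1."""
    grid = np.arange(0.1, 0.9, 0.01)
    scores = [f1_score(y_true, (probs > t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))], max(scores)

# y_val / val_probs come from the validation split.
threshold, score = best_threshold(y_val, val_probs)
```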
The final accuracy was 0.9454 or 94.54% on the validation set.
- More data: the dataset had only ~8k samples.
- Use shop review data instead of the given movie review data.
- Use bigger embeddings, e.g. the GloVe 840B (840 billion token) 300-dimensional embeddings.