IMDb-Reviews-Sentiment-Analysis

Overview

In this project I tackle the problem of Sentiment Analysis using various Machine Learning and Deep Learning approaches. I used the IMDb Movie Reviews Dataset, which contains positive and negative reviews, for two reasons. First, it has 25,000 training instances and 25,000 test instances, so it was feasible to train the models on a local machine. Second, there is a Sentiment Analysis Leaderboard for it, so I could see how my approaches compare to state-of-the-art models. The proposed approaches are the following:

Machine Learning approach using preprocessing and feature extraction techniques with different classifiers (specifically SVM and Naive Bayes).
LSTM with trained embeddings.
Transfer Learning with BERT.

After some hyperparameter tuning, the best of these approaches was the last one, which reached an accuracy of 89.5%. Compared with the state-of-the-art models as of 09/07/2022, this approach ranks 27th on the aforementioned Leaderboard.

Manual

In order to run the experiments you should follow these steps:

Create and activate a virtual environment, then run pip3 install -r requirements.txt to install the libraries the code needs.
Unzip imdb.zip, which contains the IMDb reviews dataset.
For the Machine Learning experiments, run ./code/ml_approach.py using Python 3.
For the LSTM experiments, either run the ./code/lstm_experiments.sh script, which runs the Grid Search presented in the report, or run the ./code/lstm_approach.py script to run a single experiment. Pass the -h parameter to see the usage of that file.
For the BERT experiments, either run the ./code/bert_experiments.sh script, which runs the Grid Search presented in the report, or run the ./code/bert_approach.py script to run a single experiment. Pass the -h parameter to see the usage of that file.
The ./code/{bert|lstm}_logs directories contain the output of each experiment.
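
The unzip step can also be done from Python. The following is a minimal sketch using the standard-library zipfile module. The function name extract_dataset and the default of extracting into the current directory are assumptions, since the location where the scripts expect the data is not stated here.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path="imdb.zip", dest="."):
    """Extract the dataset archive into dest and return the archive's member names.

    The default destination is an assumption; adjust it to wherever the
    scripts in ./code expect to find the data.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling extract_dataset() from the repository root would extract imdb.zip next to the code directory.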

Further Work

In the future, I could use other pre-trained models, bigger models, or ensembles. However, I believe there is still room for improvement with the current models in how the data is handled: at the moment I essentially feed the data in as is (apart from basic preprocessing), whereas a more sophisticated approach would exploit the structure of the reviews and address the models' weaknesses. Some thoughts are the following:

Try a bidirectional LSTM. This is expected to perform better because, for each word, it considers not only the words before it but also the words after it. The "leftmost" and "rightmost" outputs then need to be taken and passed to a Linear layer.
Find some way to handle reviews with mixed sentiment, for example:

"Jack Nicholson rocks! Movie sucks..." or "Jack Nicholson rocks, but movie sucks" or "the structure is good, but product doesn't work"

Without a suitable algorithm the classifier will be confused by such reviews, and this includes the existing approaches. Truncating tokens won't work because I may lose information about the part I care about. With TF-IDF, the positive and negative parts of a sentence simply neutralize each other.

One option is to predict each sentence/phrase separately and then use an ensemble to draw the final prediction, or pass the per-sentence predictions to a Logistic Regression classifier, for instance.
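
The sentence-level idea can be sketched as follows. This is only an illustration under stated assumptions: toy_sentence_score is a hypothetical stand-in for a trained sentence classifier (it is a tiny hand-written lexicon, not one of the models in this project), the regex-based splitter is one possible way to segment a review, and the three features returned by part_features are just one example of a fixed-length input for something like Logistic Regression.

```python
import re

# Hypothetical toy lexicon standing in for a trained sentence-level classifier.
POSITIVE = {"rocks", "good", "great"}
NEGATIVE = {"sucks", "bad", "awful"}


def split_into_parts(review):
    """Split a review at sentence punctuation and at the contrastive word 'but'."""
    parts = re.split(r"[.!?]+|,?\s+but\s+", review, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p and p.strip()]


def toy_sentence_score(text):
    """Return +1, -1 or 0 for a piece of text; a real model would go here."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)


def part_features(review):
    """Turn per-part scores into fixed-length features for a downstream classifier."""
    scores = [toy_sentence_score(p) for p in split_into_parts(review)]
    n = max(len(scores), 1)  # avoid division by zero for empty reviews
    return {
        "positive_fraction": scores.count(1) / n,
        "negative_fraction": scores.count(-1) / n,
        "last_part_score": scores[-1] if scores else 0,
    }
```

With this toy lexicon, scoring "Jack Nicholson rocks, but movie sucks" as a whole gives 0, while the two parts score +1 and -1 separately, so the mixed structure stays visible to whatever combines the per-part predictions.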

I should be careful with stopword removal because "not" is also a stopword, and removing it can flip the meaning of a phrase.
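
A minimal sketch of stopword removal that keeps negations is shown here. The stopword set is a tiny hypothetical sample (real stopword lists are much longer), and the set of negation words to keep is an assumption.

```python
# Tiny illustrative stopword sample; real stopword lists are much longer.
STOPWORDS = {"the", "is", "a", "was", "this", "not", "no", "and", "it"}

# Negation words to keep even if they appear in the stopword list (assumption).
NEGATIONS = {"not", "no", "never", "nor"}


def remove_stopwords(tokens, keep_negations=True):
    """Drop stopwords from tokens, optionally keeping negation words."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        is_protected = keep_negations and lowered in NEGATIONS
        if lowered in STOPWORDS and not is_protected:
            continue
        kept.append(token)
    return kept
```

For the tokens of "this movie is not good", plain removal leaves only "movie" and "good", which reads as positive, while keeping negations leaves "movie", "not" and "good".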

Finally, I can evaluate how my proposed approaches perform on different kinds of data, such as product reviews, tweets, etc.

ManosL/IMDb-Reviews-Sentiment-Analysis
