Data Science Exam

The objective is to perform a sentiment analysis task, analyzing users' textual reviews, to understand if a comment includes a positive or negative mood.
In other words, the goal is to build a robust classification model that is able to predict the sentiment contained in a given text.

Dataset

The dataset for this competition has been specifically scraped from the tripadvisor.it Italian web site. It contains 41077 textual reviews written in the Italian language.
The dataset is provided as textual files with multiple lines. Each line is composed of two fields: text and class. The text field contains the review written by the user, while the class field contains a label that can get the following values:

pos: if the review shows a positive sentiment.
neg: if the review shows a negative sentiment.

The dataset can be downloaded from here.

Data exploration

Understand the dataset and the distribution of the data: count the number of elements, check for missing data, mean and standard deviation review length, classes distribution.

Preprocessing

Preprocess the data before feeding it to some machine learning algorithm.
This part consists of the following steps: transform the words in lower case, remove not meaningful words (stop words, too long and too short words), try to balance the classes, calculate the TF-IDF of the words, split the dataset in training and test sets.

Classification

Apply some machine learning algorithms to train a classifier in order to predict the class of a sentence.
Many classifiers were used: Support Vector Machine, Naive Bayes, K-Neighbors Classifier, Random Forest Classifier, Decision Tree Classifier, Deep Neural Network, SGD Classifier, Logistic Regression.

Result

The best model was Linear Support Vector Classifier with a F1 weighted score of 0.9696.

Cloud words of the most common words:

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Images		Images
.gitignore		.gitignore
Data-Science-Exam-Report.pdf		Data-Science-Exam-Report.pdf
ListNegativeWord.csv		ListNegativeWord.csv
ListPositiveWord.csv		ListPositiveWord.csv
README.md		README.md
development.csv		development.csv
evaluation.csv		evaluation.csv
sample_submission.csv		sample_submission.csv
solution.ipynb		solution.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.ipynb_checkpoints

.ipynb_checkpoints

Images

Images

.gitignore

.gitignore