### 💾💻📊 Data Science - MMI Portfolio No. 3
# 💥 Sentiment Analysis (NLP + Machine Learning) 💥



An extremely common dataset for benchmarking, method development, and tutorials is **MNIST**, which consists of handwritten digits. In reference to this dataset, several variants have appeared, such as sign-MNIST (photos of hand gestures for sign language), audioMNIST (audio files of spoken digits), and **fashionMNIST**, made of low-resolution photos of 10 types of clothes. We will here use this to explore different possibilities for using dimensionality reduction techniques.

Please complete the following exercises:
## 1. Data Exploration and Cleaning
- Select an equal number of reviews for all possible ratings (1 to 5).
- Use suitable Python libraries to detect the language for every restaurant review, for instance `langdetect`.
- Show a distribution of detected languages.
- Continue to work only with English entries.

Hint: This could take a little while, so better use `tqdm` or similar to show the progress.

## 2. TF-IDF + Logistic Regression vs Linear Regression
- Use the `TfidfVectorizer` to create vectors of all remaining reviews.
- Train a Logistic Regression model on the 5 rating classes.
- Train a Linear Regression model on the rating (1 to 5).
- Compare both models: which works better?
- Also compare single tokens vs. 1 + 2-grams (`ngram_range=(1, 2)`): which works better?
- Show the 10 most relevant words (or ngrams) for predicting high or low ratings.

## 3. Language Model --> word vectors + Logistic Regression vs. Random Forest
- Use spaCy and a larger English language model (`spacy.load("en_core_web_lg")`) to create vectors for each review. This might take a while, so use a progress bar.
- Train a Logistic Regression model as well as a RandomForestClassifier on the review embeddings to predict the 5 rating classes.
- Evaluate the performance of both models, and also compare to the TF-IDF case before.



## General instructions
- The final notebook should be executable in the correct order (this means it should work if you do `Kernel` --> `Restart kernel and run all cells...`)
- Providing code and plots alone is not enough; you should document and comment where necessary. The focus is less on small code-related details (commenting those is welcome but not required) and more on explaining what you do, why you do it, and what you observe.

More specifically:
- Please briefly comment on the changes you make to the data, in particular if you apply complex operations or if your changes depend on a certain choice you have to make.
- Please add descriptions and/or interpretations to the results you generate (for instance tables, plots). This doesn't have to be a lot of text. For simple, easy-to-understand results, a brief sentence can be enough. For more complex results, you might want to add a bit more explanation.

---
Please add your Name here
## Name: Kevin Zielke

---

## Imports and helper function
Use this part to import the main libraries used in this notebook.  
Also add more complex helper functions to this part (if you use any).

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

from tqdm.notebook import tqdm

# add imports if anything is missing
# for instance, feel free to use other plotting libraries (e.g. seaborn, plotly...)

## Data download and import
The following analysis should be done with data from TripAdvisor, namely restaurant reviews from **Barcelona**. You can find the respective data (`Barcelona_reviews.csv`) in the [TripAdvisor data](https://zenodo.org/records/6583422) record on Zenodo.


In [3]:
filename = "../../Datasets/TripAdvisor/Barcelona_reviews.csv"
data = pd.read_csv(filename)
data = data.drop(["Unnamed: 0"], axis=1)
data.head()


| | parse_count | restaurant_name | rating_review | sample | review_id | title_review | review_preview | review_full | date | city | url_restaurant | author_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chalito_Rambla | 1 | Negative | review_774086112 | Terrible food Terrible service | Ok, this place is terrible! Came here bc we’ve... | Ok, this place is terrible! Came here bc we’ve... | October 12, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_0 |
| 1 | 2 | Chalito_Rambla | 5 | Positive | review_739142140 | The best milanesa in central Barcelona | This place was a great surprise. The food is d... | This place was a great surprise. The food is d... | January 14, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_1 |
| 2 | 3 | Chalito_Rambla | 5 | Positive | review_749758638 | Family bonding | The food is excellent.....the ambiance is very... | The food is excellent.....the ambiance is very... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_2 |
| 3 | 4 | Chalito_Rambla | 5 | Positive | review_749732001 | Best food | The food is execellent ,affortable price for p... | The food is execellent ,affortable price for p... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_3 |
| 4 | 5 | Chalito_Rambla | 5 | Positive | review_749691057 | Amazing Food and Fantastic Service | Mr Suarez,The food at your restaurant was abso... | Mr Suarez,The food at your restaurant was abso... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_4 |


In [None]:
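# Some cleaning: a few malformed rows carry the city name in the rating column
mask = data["rating_review"] == "Barcelona_Catalonia"
# Drop those rows; .copy() avoids a SettingWithCopyWarning on the assignment below
data = data[~mask].copy()
# Cast the ratings to integers (the malformed rows prevent a numeric dtype on import)
data["rating_review"] = data["rating_review"].astype(int)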

## 1. Data Exploration and Cleaning

Hint: This could take a little while, so better use `tqdm` or similar to show the progress.

### 1.1 - Select equal number of reviews for all possible ratings (1 to 5).
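
One possible way to balance the classes is to downsample every rating to the size of the rarest one. The sketch below assumes the column name `rating_review` from the data preview; the `toy` frame and its values are made up for illustration and stand in for the notebook's `data`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical values).
toy = pd.DataFrame({
    "rating_review": [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5],
    "review_full": [f"review {i}" for i in range(14)],
})

# The rarest rating limits how many reviews each class can contribute.
n_per_class = toy["rating_review"].value_counts().min()

# Draw the same number of rows from every rating group.
balanced = (
    toy.groupby("rating_review")
    .sample(n=n_per_class, random_state=0)
    .reset_index(drop=True)
)

print(balanced["rating_review"].value_counts().sort_index())
```

Fixing `random_state` keeps the selection reproducible when the notebook is restarted and run from top to bottom.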

### 1.2 - Use suitable Python libraries to detect the language for every restaurant review, for instance `langdetect`.
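
A minimal sketch of a detection loop, assuming `langdetect` is used as suggested. `detect_languages` and `fake_detect` are hypothetical helpers introduced here: the detector is passed in as an argument so that the example runs without `langdetect` installed, and the real call is shown in the comments only.

```python
import pandas as pd


def detect_languages(texts, detect_fn, progress=None):
    """Apply `detect_fn` to every text; failed detections become "unknown"."""
    iterable = progress(texts) if progress is not None else texts
    languages = []
    for text in iterable:
        try:
            languages.append(detect_fn(text))
        except Exception:
            # langdetect raises an exception on empty or symbol-only text
            languages.append("unknown")
    return languages


# In the notebook, one would pass the real detector and a progress bar:
#   from langdetect import DetectorFactory, detect
#   DetectorFactory.seed = 0  # makes langdetect's results reproducible
#   data["language"] = detect_languages(data["review_full"], detect, tqdm)


def fake_detect(text):
    """Stand-in detector (hypothetical) so the sketch is self-contained."""
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "comida" in text else "en"


toy = pd.DataFrame({"review_full": ["Great food", "La comida es buena", "  "]})
toy["language"] = detect_languages(toy["review_full"], fake_detect)
print(toy)
```

Catching the exception per review keeps one unreadable entry from aborting a run that may take several minutes.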

### 1.3 - Show a distribution of detected languages.
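
The distribution can be summarized with `value_counts` before plotting. This sketch uses a made-up `languages` series in place of the detected language column.

```python
import pandas as pd

# Hypothetical detection results standing in for the real language column.
languages = pd.Series(["en", "es", "en", "fr", "en", "es"], name="language")

counts = languages.value_counts()                 # absolute frequencies
shares = languages.value_counts(normalize=True)   # relative frequencies

summary = pd.DataFrame({"count": counts, "share": shares.round(2)})
print(summary)

# In the notebook, counts.plot(kind="bar") would draw this as a bar chart;
# a logarithmic y-axis may help if one language dominates.
```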

### 1.4 - Continue to work only with English entries.
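
Filtering by language can unbalance the rating classes again, so it may be worth checking the class counts afterwards. The sketch assumes a `language` column holding the detected codes; the `toy` values are invented.

```python
import pandas as pd

# Hypothetical frame with ratings and detected languages.
toy = pd.DataFrame({
    "rating_review": [1, 1, 2, 3, 4, 5, 5],
    "language": ["en", "es", "en", "en", "fr", "en", "en"],
})

# Keep English reviews only and renumber the rows.
english = toy[toy["language"] == "en"].reset_index(drop=True)

# Check how many reviews remain per rating after the filter.
per_rating = english["rating_review"].value_counts().sort_index()
print(per_rating)
```

In this toy case rating 4 disappears entirely, which is the kind of side effect that would argue for balancing the classes after the language filter rather than before it.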

## 2. TF-IDF + Logistic Regression vs Linear Regression

### 2.1 - Use the `TfidfVectorizer` to create vectors of all remaining reviews.

### 2.2 - Train a Logistic Regression model on the 5 rating classes
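
The classification variant could look roughly like this. The ten reviews and their ratings are invented, and `max_iter=1000` is an assumption made to avoid convergence warnings, not a tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented mini-corpus with two reviews per rating.
texts = [
    "terrible food rude staff", "awful never again",
    "bad service cold food", "disappointing and slow",
    "average nothing special", "okay but pricey",
    "good food nice staff", "tasty and friendly",
    "excellent amazing dinner", "perfect best paella ever",
]
ratings = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# The pipeline vectorizes the text and then fits a 5-class classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, ratings)

new_review = ["amazing excellent paella"]
pred = clf.predict(new_review)          # predicted rating class
proba = clf.predict_proba(new_review)   # one probability per rating class
print(pred, proba.round(2))
```

Wrapping both steps in one pipeline means the vectorizer is refit together with the model, which simplifies later cross-validation.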

### 2.3 - Train a Linear Regression model on the rating (1 to 5).
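
For the regression variant, the rating is treated as a number rather than a class. The sketch below uses invented reviews; rounding and clipping the output to the range 1 to 5 is one possible way to turn the continuous prediction back into a rating.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Invented mini-corpus; ratings are used as numeric targets here.
texts = [
    "terrible food rude staff", "awful never again",
    "bad service cold food", "disappointing and slow",
    "average nothing special", "okay but pricey",
    "good food nice staff", "tasty and friendly",
    "excellent amazing dinner", "perfect best paella ever",
]
ratings = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

reg = make_pipeline(TfidfVectorizer(), LinearRegression())
reg.fit(texts, ratings)

# Raw predictions are real numbers and can fall outside the 1-5 range.
raw = reg.predict(["amazing dinner", "rude staff and cold food"])

# Round to the nearest star and clip to valid ratings for comparison.
as_rating = np.clip(np.rint(raw), 1, 5).astype(int)
print(raw.round(2), as_rating)
```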

### 2.4 - Compare both models: which works better?
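
Because one model outputs classes and the other real numbers, a fair comparison needs shared metrics. One option, sketched here with invented predictions, is to report accuracy (after rounding the regression output) alongside the mean absolute error in stars.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical test labels and model outputs, for illustration only.
y_true = np.array([1, 2, 3, 4, 5, 5])
pred_logreg = np.array([1, 3, 3, 4, 5, 1])                    # class predictions
pred_linreg_raw = np.array([1.4, 2.2, 3.1, 3.4, 4.6, 4.2])    # continuous output

# Map the regression output to valid star ratings.
pred_linreg = np.clip(np.rint(pred_linreg_raw), 1, 5).astype(int)

acc_logreg = accuracy_score(y_true, pred_logreg)
acc_linreg = accuracy_score(y_true, pred_linreg)
mae_logreg = mean_absolute_error(y_true, pred_logreg)
mae_linreg = mean_absolute_error(y_true, pred_linreg)

print(f"accuracy  logreg={acc_logreg:.2f}  linreg={acc_linreg:.2f}")
print(f"MAE       logreg={mae_logreg:.2f}  linreg={mae_linreg:.2f}")
```

In this made-up case both models reach the same accuracy, but the classifier's errors are larger in stars, which accuracy alone would hide.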

### 2.5 - Also compare single tokens vs. 1 + 2-grams (`ngram_range=(1, 2)`): which works better?
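
The parameter that controls this in `TfidfVectorizer` is `ngram_range`. The sketch below, on two invented reviews, shows how adding bigrams changes the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up reviews where word order matters.
reviews = ["not good at all", "very good food"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(reviews)
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(reviews)

print(len(unigrams.vocabulary_), sorted(unigrams.vocabulary_))
print(len(uni_bigrams.vocabulary_), sorted(uni_bigrams.vocabulary_))
```

With bigrams, "not good" becomes a feature of its own, so the model can separate it from "very good". The price is a much larger vocabulary, which may call for `min_df` or `max_features` on the full dataset.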

### 2.6 - Show the 10 most relevant words (or ngrams) for predicting high or low ratings.


## 3. Language Model --> word vectors + Logistic Regression vs. Random Forest

### 3.1 - Use spaCy and a larger English language model (`spacy.load("en_core_web_lg")`) to create vectors for each review. This might take a while, so use a progress bar.
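
A sketch of the embedding step, assuming each review is represented by spaCy's `doc.vector` (by default the average of its token vectors). `embed_reviews`, `FakeDoc`, and `FakeNLP` are hypothetical names: the stand-in pipeline only exists so the example runs without the large model, and the real call appears in the comments.

```python
import numpy as np


def embed_reviews(texts, nlp, progress=None):
    """Return a (n_reviews, dim) array built from each document's `.vector`."""
    texts = list(texts)
    docs = nlp.pipe(texts)  # streams documents, faster than calling nlp() per text
    if progress is not None:
        docs = progress(docs, total=len(texts))
    return np.vstack([doc.vector for doc in docs])


# In the notebook, the real pipeline and a progress bar would be used:
#   nlp = spacy.load("en_core_web_lg")
#   X_embed = embed_reviews(data["review_full"], nlp, progress=tqdm)


class FakeDoc:
    """Stand-in for a spaCy Doc with a 3-dimensional toy 'embedding'."""

    def __init__(self, text):
        words = text.split()
        n_title = sum(word.istitle() for word in words)
        self.vector = np.array([len(text), len(words), n_title], dtype=float)


class FakeNLP:
    """Stand-in for a spaCy pipeline exposing only `.pipe`."""

    def pipe(self, texts):
        return (FakeDoc(text) for text in texts)


embeddings = embed_reviews(["Great tapas", "slow service and cold food"], FakeNLP())
print(embeddings.shape)
```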

### 3.2 - Train a Logistic Regression model as well as a RandomForestClassifier on the review embeddings to predict the 5 rating classes.
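
Both classifiers accept the dense embedding matrix directly. Since no real embeddings are available here, the sketch generates synthetic 20-dimensional vectors with one cluster per rating; the hyperparameters (`max_iter`, `n_estimators`) are assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for review embeddings: one noisy cluster per rating.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(1, 6), 40)
centers = rng.normal(size=(5, 20))
X = centers[y - 1] + 0.5 * rng.normal(size=(200, 20))

# A stratified split keeps all five ratings equally represented in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and record its accuracy on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Using the same split for both models (and for the TF-IDF models earlier) keeps the later comparison fair.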

### 3.3 - Evaluate the performance of both models, and also compare to the TF-IDF case before.
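
Beyond a single accuracy number, a confusion matrix and a per-class report show where a model confuses neighbouring ratings. The labels and predictions below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical test labels and predictions.
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
y_pred = np.array([1, 2, 2, 2, 3, 4, 4, 5, 5, 5])
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 per rating class.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Share of predictions that are at most one star off.
within_one = float(np.mean(np.abs(y_true - y_pred) <= 1))
print(f"within one star: {within_one:.2f}")
```

Applying the same report to the TF-IDF models and the embedding models on an identical test split makes the comparison between the two feature types direct.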