# Sentiment Analysis

As part of the movie recommendation idea, we want to give the end user an additional data point to weigh when considering our recommendation.

We have therefore decided to add a sentiment analysis based on movie reviews. It uses a simple logistic regression model, trained on a [sentiment-classified IMDB movie review dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). As part of this, the movie reviews are preprocessed and then vectorized using TF-IDF.
 

### Initial imports and data setup

In [1]:
import pandas as pd
import numpy as np
import nltk
import re 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score




In [11]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nojan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nojan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nojan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nojan\AppData\Roaming\nltk_data...


In [4]:
# Load the data
df=pd.read_csv("training_data.csv") 

In [5]:
# Reviewing base characteristics of the data
print(df.head())


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [6]:
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
None


In [7]:
print(df.isnull().sum())

review       0
sentiment    0
dtype: int64


The dataset contains 50,000 reviews, each labeled positive or negative. There are no missing values, so no rows need to be dropped or filled in.

### Preprocessing
The next step is to preprocess the text to reduce the number of features (the dimensionality), i.e. to clean up the text and boil it down to the bare minimum the model needs.

In [8]:
from bs4 import BeautifulSoup


In [9]:
# Collect the raw review texts in a list
reviews = df['review'].tolist()

# Preprocessing helpers

def simple_preprocess_text(corpus):
    # Remove HTML tags
    corpus = [BeautifulSoup(text, "html.parser").get_text() for text in corpus]

    # Remove urls
    corpus = [re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE) for text in corpus]
    
    # Remove non-alphabetic characters
    corpus = [re.sub("[^a-zA-Z]", " ", text) for text in corpus]

    # Convert to lowercase
    corpus = [text.lower() for text in corpus]

    return corpus

def preprocessing_lemmatization(corpus):
    # Tokenize the text (split it into words)
    corpus = [word_tokenize(text) for text in corpus]

    # Remove stop words (The, a, on, etc)
    stop_words = set(stopwords.words("english"))
    corpus = [[word for word in text if word not in stop_words] for text in corpus]

    # Lemmatization: reduce each word to its dictionary base form
    # (WordNetLemmatizer treats every word as a noun unless a POS tag is passed)
    lemmatizer = nltk.stem.WordNetLemmatizer()
    corpus = [[lemmatizer.lemmatize(word) for word in text] for text in corpus]

    # Join the words back into one string
    corpus = [" ".join(text) for text in corpus]

    return corpus


In [12]:
# EXAMPLE of pre-processing some text
sample_text = ["<p>It's a great movie. I love it! <a href='http://www.google.com'>Google</a></p>"]
print(sample_text)

test= simple_preprocess_text(sample_text)
print(test)
test= preprocessing_lemmatization(test)

print(test)

["<p>It's a great movie. I love it! <a href='http://www.google.com'>Google</a></p>"]
['it s a great movie  i love it  google']
['great movie love google']


In [13]:
processed_text= simple_preprocess_text(reviews)
processed_text= preprocessing_lemmatization(processed_text)




As shown in the example, the text is now cleaned up, and only the words most relevant to the sentiment analysis are kept.

The next step is to use TF-IDF to turn the text into numerical features. TF stands for term frequency: how often a word appears in a document. IDF stands for inverse document frequency, a weighting that shrinks as a word appears in more of the documents (in our case, the reviews), so that words common to almost every review count for less than rare, distinctive ones.

#### TODO: maybe include math of the 2 weighting techniques
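
As a rough sketch of that math: with raw counts as term frequency, scikit-learn's default settings weight a term $t$ in document $d$ as $\mathrm{tf}(t, d) \cdot \left(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)$, where $n$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing $t$, and then scale each document vector to unit L2 length. The hand-rolled version below only illustrates the arithmetic on three made-up reviews. The function name `tfidf` and the whitespace tokenization are assumptions for illustration; the notebook itself uses `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Simplification: split on whitespace instead of using a real tokenizer
    tokenized = [doc.split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Scale to unit length; an empty document keeps an empty vector
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Made-up, already preprocessed example reviews
docs = ["great movie love", "bad movie", "great acting great story"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second toy review, "bad" occurs in one document and "movie" in two, so "bad" receives the higher weight even though both appear once in that review.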

In [14]:
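# TF-IDF vectorization,
# which is a way to represent text data as a matrix of numbers.
# max_features=5000 keeps only the 5,000 most frequent terms in the corpus.
vector = TfidfVectorizer(max_features=5000)
# .toarray() converts the sparse result to a dense array, which GaussianNB below requires
X = vector.fit_transform(processed_text).toarray()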
# We will also include the sentiment column as the target variable
y = df['sentiment'].values


In [15]:
# Split the data into training and testing sets
# 80% of the data will be used for training and 20% for testing
# The random_state parameter is used to ensure that the data is split in the same way every time the code is run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



We will now train and test several model types to evaluate which one suits our purposes best.

In [16]:
models = {}

In [17]:
# Train a logistic regression model
models['Logistic Regression'] = LogisticRegression()
models['Logistic Regression'].fit(X_train, y_train)

In [19]:
# Train a SVM model
from sklearn.svm import LinearSVC
models['SVM'] = LinearSVC()  # converges faster than SVC with a linear kernel
models['SVM'].fit(X_train, y_train)



In [20]:
# Train a Naive Bayes model
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()
models['Naive Bayes'].fit(X_train, y_train)

In [24]:
# Train a Random Forest model
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier(n_jobs=-1)
models['Random Forest'].fit(X_train, y_train)

In [None]:
# Train a Gradient Boosting model
#from sklearn.ensemble import GradientBoostingClassifier
#models['Gradient Boosting'] = GradientBoostingClassifier()
#models['Gradient Boosting'].fit(X_train, y_train)

# TAKES A VERY LONG TIME: 17+ MINUTES

In [27]:

for model in models:
    y_pred = models[model].predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {model}")
    print(f"Accuracy of sentiment prediction: {accuracy}")
    print()


Model: Logistic Regression
Accuracy of sentiment prediction: 0.8878

Model: SVM
Accuracy of sentiment prediction: 0.8828

Model: Naive Bayes
Accuracy of sentiment prediction: 0.7965

Model: Random Forest
Accuracy of sentiment prediction: 0.849

Model: Gradient Boosting
Accuracy of sentiment prediction: 0.8118



The results show that the simple logistic regression performs best on this dataset, with an accuracy of 0.8878, narrowly ahead of the SVM at 0.8828.
This is the one we will continue with.
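
The comparison above reports accuracy only, although `precision_score`, `recall_score` and `f1_score` are imported at the top of the notebook. To make those three definitions concrete, here is a minimal hand-rolled sketch on made-up labels. The helper `binary_metrics` and the toy data are assumptions for illustration, not results from the models above.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 4 positive and 4 negative reviews
y_true = ["positive"] * 4 + ["negative"] * 4
y_pred = ["positive", "positive", "positive", "negative",
          "positive", "positive", "negative", "negative"]

for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

On the real test split, the scikit-learn functions would presumably be called with `pos_label="positive"`, since the labels in this dataset are strings rather than 0/1.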

In [35]:
import pickle 

# Load our own review data; pd.read_pickle accepts an open binary file handle
with open('reviews.pkl', 'rb') as file:
    custom_data = pd.read_pickle(file)



In [42]:
custom_data = pd.DataFrame.from_dict(custom_data)

In [53]:
test = list(custom_data["tt0111161"])

In [54]:
test= simple_preprocess_text(test)
test= preprocessing_lemmatization(test)

In [69]:
custom_data["tt0103873"]

0     None
1     None
2     None
3     None
4     None
5     None
6     None
7     None
8     None
9     None
10    None
11    None
12    None
13    None
14    None
15    None
16    None
17    None
18    None
19    None
20    None
21    None
22    None
23    None
24    None
Name: tt0103873, dtype: object

In [None]:
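sentiments = {}  # title ID -> array of predicted labels, one per review
bad_data = []    # title IDs without usable reviews
for title in custom_data:
    temp = list(custom_data[title])
    # Skip titles whose first entry is None (e.g. tt0103873 above)
    if temp[0] is None:
        bad_data.append(title)
        continue
    # Apply the same preprocessing and the already fitted vectorizer as in training
    temp = simple_preprocess_text(temp)
    temp = preprocessing_lemmatization(temp)
    temp = vector.transform(temp).toarray()
    sentiment = models['Logistic Regression'].predict(temp)
    sentiments[title] = sentiment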
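
The loop leaves one predicted label per review for each title. One possible way to turn that into a single data point for the end user is the share of positive reviews per title. The sketch below assumes that is the desired summary; the helper `positive_share` and the example predictions are hypothetical.

```python
from collections import Counter


def positive_share(predictions, positive="positive"):
    """Return the fraction of predictions equal to `positive`, or None if empty."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        return None  # no reviews, so no score can be given
    return counts[positive] / total


# Hypothetical predictions, shaped like the `sentiments` dict built above
example_sentiments = {
    "tt0111161": ["positive", "positive", "negative", "positive"],
}

scores = {title: positive_share(labels)
          for title, labels in example_sentiments.items()}
print(scores)
```

Titles collected in `bad_data` have no predictions at all, so they would need a separate fallback, for example showing no sentiment score.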