<a href="https://colab.research.google.com/github/ThomasDarrieumerlou/Project_applied/blob/main/Project_applied.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Movie Recommendation Based on Review Analysis

## 1. <a name="1">Reading the dataset</a>

We will use the __pandas__ library to read our dataset.

In [None]:
!pip install transformers torch scikit-learn

In [3]:
import pandas as pd

#https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv

dataset = pd.read_csv('https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/master/IMDB-Dataset.csv', header=0)
print('The shape of the dataset is:', dataset.shape)

The shape of the dataset is: (50000, 2)


In [4]:
dataset.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [5]:
dataset["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [6]:
print(dataset.isna().sum())

review       0
sentiment    0
dtype: int64


## 4. <a name="4">Train - Validation Split</a>

We will split the dataset into training (80%) and validation (20%) sets. Before that, we download the NLTK tokenizer and stop-word resources used to clean the text.

In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Next, we clean the review text: lowercasing, stripping HTML tags, removing stop words (while keeping negations), and stemming.

In [8]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list=[]
    for sent in texts:

        # Treat missing (non-string) values as empty text
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence=[]

        sent = sent.lower() # Lowercase
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of spaces and tabs (raw string avoids an invalid-escape warning)
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup

        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list
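
One detail of the cleaning step is worth checking: `process_text` replaces HTML tags with an empty string, so words separated only by a tag (for example `production.<br /><br />The`) end up glued together. The sketch below uses only the standard library and two hypothetical helper names (`strip_tags_glue`, `strip_tags_spaced`) to contrast that behavior with replacing tags by a space before collapsing whitespace. It only illustrates the difference; the results reported in this notebook were produced with the original function.

```python
import re

TAG_RE = re.compile(r'<.*?>')    # same non-greedy tag pattern as process_text
SPACE_RE = re.compile(r'\s+')

def strip_tags_glue(text):
    """Mimics process_text: collapse whitespace first, then drop tags entirely."""
    text = SPACE_RE.sub(' ', text.lower().strip())
    return TAG_RE.sub('', text)

def strip_tags_spaced(text):
    """Hypothetical variant: turn tags into spaces, then collapse whitespace."""
    text = TAG_RE.sub(' ', text.lower())
    return SPACE_RE.sub(' ', text).strip()

sample = "A wonderful little production.<br /><br />The filming technique is great."
glued = strip_tags_glue(sample)
spaced = strip_tags_spaced(sample)
print(glued)
print(spaced)
```

With the first helper, `production.` and `the` are merged into a single token candidate, which the tokenizer may or may not split again; the second keeps them apart.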

Logistic regression

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(dataset[["review"]],
                                                  dataset["sentiment"],
                                                  test_size=0.20,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
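
The split above shuffles the rows but does not force both classes to appear in the same proportion on each side. If exact class balance in the validation set matters, `train_test_split` accepts a `stratify` argument. The sketch below shows it on a toy list of 20 made-up reviews (the data and variable names are placeholders, not the IMDB dataset).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data: 10 positive and 10 negative placeholder reviews
reviews = [f"review {i}" for i in range(20)]
labels = ['positive'] * 10 + ['negative'] * 10

# stratify=labels keeps the class proportions identical in both subsets
r_train, r_val, l_train, l_val = train_test_split(
    reviews, labels,
    test_size=0.20,
    shuffle=True,
    random_state=324,
    stratify=labels,
)

train_counts = Counter(l_train)
val_counts = Counter(l_val)
print(train_counts, val_counts)
```

On the real data this would correspond to passing `stratify=dataset["sentiment"]`; whether it changes the reported accuracy noticeably is not something this notebook measures.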

In [10]:
print("Processing the reviewText fields")
X_train["review"] = process_text(X_train["review"].tolist())
X_val["review"] = process_text(X_val["review"].tolist())

Processing the reviewText fields


In [11]:
text_features = ['review']

model_features = text_features

model_target = 'sentiment'

In [31]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

### COLUMN_TRANSFORMER ###
##########################

numerical_processor = Pipeline([
    ('num_scaler', MinMaxScaler())
])


text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(binary=True, max_features=500))
])

# text_precessor_1 = Pipeline([
#     ('text_vect_1', CountVectorizer(binary=True, max_features=150))
# ])
data_preprocessor = ColumnTransformer([
    ('text_pre_0', text_processor_0, text_features[0]),
    #('text_pre_1', text_precessor_1, text_features[1])
])

### PIPELINE ###
################
pipeline = Pipeline([
    ('data_preprocessing', data_preprocessor),
    ('logistic_regression', LogisticRegression(penalty = 'l2',
                              C = 0.1))
])

from sklearn import set_config
set_config(display='diagram')
pipeline

In [32]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# The classes here are the string labels 'negative' and 'positive'
class_labels = np.unique(y_train)  # unique labels found in the training target

# Computing class weights
# 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies
class_weights = compute_class_weight('balanced', classes=class_labels, y=y_train)

# Creating a dictionary to pass to the model
class_weights_dict = {class_labels[i]: class_weights[i] for i in range(len(class_labels))}


# Assembling the Full Pipeline
# This pipeline integrates the data preprocessing steps and a machine learning model (Logistic Regression).
# Logistic Regression is used for binary classification (here predicting 'sentiment': positive or negative).
# Hyperparameters such as 'penalty', 'C', 'max_iter' and 'solver' are set for the logistic regression model.
# The pipeline allows for streamlined processing and prediction.
pipeline = Pipeline([
    ('data_preprocessing', data_preprocessor),
    ('logistic_regression', LogisticRegression(penalty = 'l2', class_weight=class_weights_dict,
                              C = 0.1, max_iter=1000, solver='saga'))
])

# Visualizing the Pipeline
# Visual representation of the pipeline is enabled for clarity, especially useful for more complex pipelines.
from sklearn import set_config
set_config(display='diagram')
pipeline
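
To make the `'balanced'` mode concrete: scikit-learn documents it as `n_samples / (n_classes * count_of_class)`. The standard-library sketch below reproduces that formula with a hypothetical helper, `balanced_class_weights`. The training counts used (20,042 negative and 19,958 positive) are an inference from the 25,000 reviews per class minus the validation supports shown in the classification report further down, so treat them as an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Assumed training label counts (inferred, see the note above)
y_example = ['negative'] * 20042 + ['positive'] * 19958
weights = balanced_class_weights(y_example)
print({label: round(w, 3) for label, w in weights.items()})
```

Because the dataset is almost perfectly balanced, both weights are very close to 1, so `class_weight` should have little effect on this particular model.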

In [33]:
pipeline.fit(X_train, y_train.values)

In [34]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Making Predictions on the Validation Dataset
# This line uses the previously defined and fitted pipeline to make predictions on the validation dataset.
# The 'predict' method applies all preprocessing steps to the validation data and then uses the trained model for prediction.
val_predictions = pipeline.predict(X_val)

# Printing the Confusion Matrix
# The confusion matrix is a table used to evaluate the performance of the classification model.
# It shows the actual vs. predicted values, helping to understand the cases of true positives, true negatives, false positives, and false negatives.
print(confusion_matrix(y_val.values, val_predictions))

# Printing the Classification Report
# The classification report provides key metrics about the performance of the classifier.
# This includes precision, recall, f1-score for each class, and a support count showing the number of true instances in each class.
print(classification_report(y_val.values, val_predictions))

# Calculating and Printing the Accuracy
# Accuracy is the ratio of correctly predicted observations to the total observations.
# High accuracy means the model performs well on the validation data.
# It's a quick way to see how well the model is performing, especially for balanced datasets.
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[4185  773]
 [ 749 4293]]
              precision    recall  f1-score   support

    negative       0.85      0.84      0.85      4958
    positive       0.85      0.85      0.85      5042

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

Accuracy (validation): 0.8478
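
As a cross-check, the figures in the classification report can be recomputed by hand from the confusion matrix, whose rows are true labels and whose columns are predicted labels. The sketch below uses only the matrix printed above; the helper name `precision_recall` is made up for the illustration.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[4185, 773],
      [749, 4293]]
labels = ['negative', 'positive']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total   # correct predictions lie on the diagonal

def precision_recall(cm, i):
    """Precision and recall for class index i."""
    tp = cm[i][i]
    predicted_as_i = sum(row[i] for row in cm)   # column sum
    actually_i = sum(cm[i])                      # row sum
    return tp / predicted_as_i, tp / actually_i

print(f"accuracy: {accuracy:.4f}")
for i, label in enumerate(labels):
    p, r = precision_recall(cm, i)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```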


Validation accuracy for different numbers of features (`max_features` in `CountVectorizer`):

RUN 1 => `max_features=50`

Accuracy (validation): 0.7093

---
RUN 2 => `max_features=150`

Accuracy (validation): 0.7563

---
RUN 3 => `max_features=200`

Accuracy (validation): 0.7875

---
RUN 4 => `max_features=500`

Accuracy (validation): 0.8478

KNN PART

We use the Hugging Face Transformers library to load a pre-trained BERT model and tokenize the text data.

In [17]:
from transformers import BertTokenizer, BertModel
import torch
from tqdm import tqdm

import pandas as pd
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(dataset["review"], dataset['sentiment'], test_size=0.2, random_state=42)

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# BERT-based Classifier
# Load BERT tokenizer and model, move to GPU
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

# Tokenize and encode the text data, move to GPU
max_length = 128
X_train_tokens = tokenizer(list(train_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)
X_test_tokens = tokenizer(list(test_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)

# Calculate BERT embeddings for the text data
def get_bert_embeddings(tokens):
    embeddings = []
    for i in tqdm(range(len(tokens['input_ids']))):
        with torch.no_grad():
            output = bert_model(input_ids=tokens['input_ids'][i].unsqueeze(0), attention_mask=tokens['attention_mask'][i].unsqueeze(0))
        embeddings.append(output[0].squeeze().mean(dim=0).cpu().numpy())
    return embeddings

X_train_bert_embeddings = get_bert_embeddings(X_train_tokens)
X_test_bert_embeddings = get_bert_embeddings(X_test_tokens)

cuda


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

100%|██████████| 40000/40000 [09:16<00:00, 71.81it/s]
100%|██████████| 10000/10000 [02:20<00:00, 71.03it/s]
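
One caveat about `get_bert_embeddings`: it averages the hidden states over every position, including padding tokens. A common alternative is to average only over real tokens using the attention mask. The NumPy sketch below shows the arithmetic on a tiny hand-made array; the function name `masked_mean_pool` is hypothetical, and the same computation would apply to the torch tensors returned by the model. The accuracies reported in this notebook were obtained with the plain mean, not with this variant.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # guard against all-padding rows
    return summed / counts

# One sequence of three positions; the last one plays the role of padding
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)
print(pooled)
print(plain)
```

In the toy input the padded position drags the plain mean far from the two real token vectors, while the masked mean ignores it.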


In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train_bert_embeddings, train_labels)

NB = GaussianNB()
NB.fit(X_train_bert_embeddings, train_labels)

# Train a classifier on BERT embeddings (you can use any classifier of your choice)
# Here, we'll use Logistic Regression as an example
lr = LogisticRegression(max_iter=500)
lr.fit(X_train_bert_embeddings, train_labels)

#rf=RandomForestClassifier()
#rf.fit(train_embeddings, train_labels)
#xgb=GradientBoostingClassifier()
#xgb.fit(train_embeddings, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
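
The warning above offers two remedies: raise `max_iter` or scale the features. A minimal sketch of the second option, assuming standardization suits the embeddings, wraps the classifier in a pipeline with `StandardScaler`. The toy matrix stands in for the BERT embeddings and the variable names are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the embedding matrix: the second column has a much larger scale
X_toy = [[0, 1000], [1, 1100], [2, 900], [10, 1000], [11, 1100], [12, 900]]
y_toy = ['negative'] * 3 + ['positive'] * 3

# Standardize each feature to zero mean and unit variance before fitting
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_toy, y_toy)

toy_predictions = list(scaled_lr.predict(X_toy))
print(toy_predictions)
```

Applied to this notebook, the pipeline would be fitted on `X_train_bert_embeddings` and `train_labels`; whether it removes the warning or changes accuracy here has not been verified.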


Evaluate the KNN, Naive Bayes, and logistic regression classifiers on the sentiment classification test set.

In [25]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test data
predictions = knn.predict(X_test_bert_embeddings)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f'KNN Accuracy: {accuracy * 100:.2f}%')
# Naive Bayes
predictions = NB.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Naive Bayes Accuracy: {accuracy * 100:.2f}%')

# Logistic regression
predictions = lr.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Logistic Regression Accuracy: {accuracy * 100:.2f}%')

KNN Accuracy: 78.89%
Naive Bayes Accuracy: 75.49%
Logistic Regression Accuracy: 83.24%


In [26]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test data
predictions = knn.predict(X_test_bert_embeddings)
# Create a classification report
class_report = classification_report(test_labels, predictions, target_names=['negative', 'positive'])

# Create a confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)

# Print the classification report and confusion matrix
print("Classification Report:")
print(class_report)

print("\nConfusion Matrix:")
print(conf_matrix)

#predictions = rf.predict(test_embeddings)
predictions = NB.predict(X_test_bert_embeddings)
# Create a classification report
class_report = classification_report(test_labels, predictions, target_names=['negative', 'positive'])

# Create a confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)

# Print the classification report and confusion matrix
print("Classification Report:")
print(class_report)

print("\nConfusion Matrix:")
print(conf_matrix)

#predictions = rf.predict(test_embeddings)
predictions = lr.predict(X_test_bert_embeddings)
# Create a classification report
class_report = classification_report(test_labels, predictions, target_names=['negative', 'positive'])

# Create a confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)

# Print the classification report and confusion matrix
print("Classification Report:")
print(class_report)

print("\nConfusion Matrix:")
print(conf_matrix)


Classification Report:
              precision    recall  f1-score   support

    negative       0.74      0.88      0.80      4961
    positive       0.85      0.70      0.77      5039

    accuracy                           0.79     10000
   macro avg       0.80      0.79      0.79     10000
weighted avg       0.80      0.79      0.79     10000


Confusion Matrix:
[[4351  610]
 [1501 3538]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.74      0.78      0.76      4961
    positive       0.77      0.73      0.75      5039

    accuracy                           0.75     10000
   macro avg       0.76      0.76      0.75     10000
weighted avg       0.76      0.75      0.75     10000


Confusion Matrix:
[[3869 1092]
 [1359 3680]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83      4961
    positive       0.84      0.83      0.83      5039

    accuracy         

Results:

- KNN Accuracy: 76.91%
- Naive Bayes Accuracy: 75.49%
- Logistic Regression Accuracy: 83.26%

---
Test with KNN, `n_neighbors=10`
- Accuracy: 0.77

Test logistic regression with `max_iter=1000`
- Accuracy: 0.83

---
Test with KNN, `n_neighbors=50`
- Accuracy: 0.79

Test logistic regression with `max_iter=500`
- Accuracy: 0.83