<a href="https://colab.research.google.com/github/Loai-AL-Sabahi/Movies-Reviews-Sentiment-Analysis-Machine-Learning-AI/blob/main/Assignment_2_Classification_(Language_Processing).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Title: Assignment 2 (Part A) Classification Using Sentiment Analysis**
# **Name: Loai AL-sabahi**
**Matric Number: A21MJ4003**





*   **Introduction:**
Sentiment analysis is a natural language processing (NLP) technique for determining the sentiment or opinion expressed in text. This notebook applies several classification algorithms to label movie reviews as either positive or negative. The dataset contains movie reviews along with their corresponding sentiment labels.

**Importing essential libraries and reading the data**

In [None]:
# Importing libraries
import pandas as pd
import nltk
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

from sklearn.neighbors import KNeighborsClassifier  # First model
from sklearn.linear_model import LogisticRegression  # Second model
from sklearn.tree import DecisionTreeClassifier  # Third model
from sklearn.ensemble import RandomForestClassifier  # Fourth model
from sklearn.naive_bayes import GaussianNB  # Fifth model
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis  # Sixth model


The code above imports the libraries for data manipulation (pandas), natural language processing (nltk), and string handling (string). From scikit-learn it imports the helpers used later (train_test_split, TfidfVectorizer, accuracy_score) and the six classifiers (KNeighborsClassifier, LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GaussianNB, QuadraticDiscriminantAnalysis).

**Read the Data**

In [None]:
# Read the data from the CSV file
data = pd.read_csv('/content/drive/MyDrive/loai python/assignments/Assignment 2/IMDB Dataset.csv')
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
1994,One of the worst movies ever made... If you ca...,negative
1995,"Feeling Minnesota, directed by Steven Baigelma...",negative
1996,THE CELL (2000) Rating: 8/10<br /><br />The Ce...,positive
1997,"This movie, despite its list of B, C, and D li...",negative


The cell above reads a CSV file of IMDB movie reviews and their sentiment labels (positive or negative) into a pandas DataFrame.

**Preprocessing:**

1. Tokenize
2. Remove punctuation
3. Remove symbols
4. Remove stop words


In [None]:
# Build a translation table that deletes punctuation, digits, and tag brackets
punct_map = dict.fromkeys(map(ord, string.punctuation + string.digits + "<>" + ","))

In [None]:
# Uncomment on the first run to download the required NLTK resources
# nltk.download('stopwords')
# nltk.download('punkt')

In [None]:
# Define the stop words set
stop_words = set(nltk.corpus.stopwords.words('english'))

In [None]:
# Define a function to preprocess the review column
def preprocess_review(review):
    # Tokenize the review
    tokens = nltk.tokenize.word_tokenize(review)
    # Delete punctuation, digits, and tag brackets from each token
    tokens = [token.translate(punct_map) for token in tokens]
    # Remove stop words (case-sensitive match against NLTK's lowercase list)
    tokens = [token for token in tokens if token not in stop_words]
    # Return the tokens
    return tokens

In [None]:
# Apply the function to the review column
data['review'] = data['review'].apply(preprocess_review)
# Replace the values in the sentiment column
data['sentiment'] = data['sentiment'].replace({'positive': 1, 'negative': 0})

In [None]:
data

Unnamed: 0,review,sentiment
0,"[One, reviewers, mentioned, watching, , Oz, ep...",1
1,"[A, wonderful, little, production, , , br, , ,...",1
2,"[I, thought, wonderful, way, spend, time, hot,...",1
3,"[Basically, family, little, boy, , Jake, , thi...",0
4,"[Petter, Mattei, , Love, Time, Money, , visual...",1
...,...,...
1994,"[One, worst, movies, ever, made, , If, get, mo...",0
1995,"[Feeling, Minnesota, , directed, Steven, Baige...",0
1996,"[THE, CELL, , , , Rating, , , , br, , , , br, ...",1
1997,"[This, movie, , despite, list, B, , C, , D, li...",0


The code above defines a function, preprocess_review, that tokenizes each review, strips punctuation, digits, and tag brackets from the tokens, and removes stop words. The function is then applied to the 'review' column of the DataFrame. The sentiment labels ('positive' and 'negative') are also converted to numerical values (1 and 0).
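The output above shows two side effects of this approach: tokens that consisted only of punctuation or digits remain in the list as empty strings, and capitalized stop words such as "A" and "I" are kept. A minimal sketch of a stricter variant follows. It is not the notebook's method: it uses `str.split` as a stand-in for `word_tokenize` and a tiny made-up stop-word set (`DEMO_STOP_WORDS`) so that it runs without NLTK data, and the name `clean_tokens` is hypothetical.

```python
import string

# Deletion table equivalent to the notebook's: punctuation and digits map to None.
punct_map = dict.fromkeys(map(ord, string.punctuation + string.digits))

# Hypothetical, tiny stop-word set used only for this demo (not NLTK's list).
DEMO_STOP_WORDS = {"a", "the", "is", "i", "this"}

def clean_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split instead of word_tokenize.
    tokens = text.split()
    # Delete punctuation and digits inside each token.
    tokens = [token.translate(punct_map) for token in tokens]
    # Drop empty strings and stop words, comparing in lowercase.
    return [t for t in tokens if t and t.lower() not in DEMO_STOP_WORDS]

print(clean_tokens("A wonderful little production , 10 / 10 !"))
```

Whether lowercasing helps depends on the task; it is shown here only because the stop-word list is lowercase.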

**Convert Tokens Back to Text**

In [None]:
# Convert tokens back to text
data['review'] = data['review'].apply(' '.join)

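This cell converts each list of tokens back into a single string by joining the tokens with spaces. This is necessary because TfidfVectorizer expects text strings as input. Empty tokens left over from preprocessing show up as runs of consecutive spaces in the output below.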

In [None]:
data

Unnamed: 0,review,sentiment
0,One reviewers mentioned watching Oz episode h...,1
1,A wonderful little production br br The...,1
2,I thought wonderful way spend time hot summer ...,1
3,Basically family little boy Jake thinks zomb...,0
4,Petter Mattei Love Time Money visually stunn...,1
...,...,...
1994,One worst movies ever made If get movies with...,0
1995,Feeling Minnesota directed Steven Baigelmann ...,0
1996,THE CELL Rating br br The Cell lik...,1
1997,This movie despite list B C D list celebs ...,0


**TF-IDF Vectorization:**

TF-IDF (Term Frequency-Inverse Document Frequency) vectorization converts the text into numerical features suitable for machine learning models. Each term is weighted by how often it occurs in a review and down-weighted if it occurs in many reviews.
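To make the weighting concrete, here is a small hand calculation in plain Python. It assumes the smoothed formula documented as TfidfVectorizer's default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The two toy documents, the whitespace tokenization, and the helper name `tfidf_rows` are simplifications made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw weight = term count * smoothed inverse document frequency.
        raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, count in tf.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

rows = tfidf_rows(["great movie", "bad movie"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In this toy corpus "movie" appears in both documents, so it receives a lower weight than "great" or "bad", which each appear in only one.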

In [None]:
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

In [None]:
# Transform the text data into TF-IDF features
X = tfidf_vectorizer.fit_transform(data['review'])
y = data['sentiment']

**Model Training and Evaluation:**
Six different classification algorithms are implemented and evaluated:
1. K-Nearest Neighbors (KNN)
2. Logistic Regression
3. Decision Tree Classifier
4. Random Forest Classifier
5. Gaussian Naive Bayes
6. Quadratic Discriminant Analysis (QDA)

The data is split once into training and test sets. Each model is then trained on the training set and used to predict the labels of the test set.

The accuracy of each model is calculated with the accuracy_score metric from scikit-learn and printed.

**Train-Test Split**

In [None]:
# Split the data into training and testing sets (adjust test_size and random_state as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The code above splits the data into training and testing sets with the train_test_split function: 80% of the data is used for training (X_train, y_train) and 20% for testing (X_test, y_test). Setting random_state=42 makes the split reproducible.
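The idea behind a seeded 80/20 split can be sketched in a few lines of plain Python. This is only an illustration of shuffling with a fixed seed and slicing, not how train_test_split is implemented internally; the helper name `split_indices` is hypothetical.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    # Shuffle row indices with a fixed seed so the split is repeatable.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_test shuffled indices form the test set, the rest the training set.
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, which is the property random_state provides in the notebook.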

**KNeighborsClassifier**

In [None]:
# Initialize the KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)

The K-Nearest Neighbors (KNN) classifier is a type of instance-based learning algorithm. It classifies a data point based on the majority class of its k-nearest neighbors in the feature space. In this case, n_neighbors=5 means that the algorithm considers the 5 nearest neighbors to make predictions.
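The majority-vote rule described above can be sketched in plain Python. This is a toy version with Euclidean distance on made-up 2-D points, not scikit-learn's implementation; `knn_predict` and the sample data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Take the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up data: three points near the origin (class 0), three near (1, 1) (class 1).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.1, 0.1), k=5))
```

With k=5 the query's five nearest neighbors include all three class-0 points and two class-1 points, so the vote goes to class 0.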

In [None]:
# Train the classifier
knn_classifier.fit(X_train, y_train)

The fit method is used to train the KNN classifier on the training data (X_train, y_train).

In [None]:
# Predict on the test set
y_pred = knn_classifier.predict(X_test)

The predict method is applied to the test set (X_test) to make predictions based on the trained model.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy for KNeighborsClassifier: {accuracy:.2f}")

Accuracy for KNeighborsClassifier: 0.67


The accuracy of the model is evaluated by comparing the predicted labels (y_pred) with the actual labels from the test set (y_test). The accuracy_score function from scikit-learn is used for this purpose.
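Accuracy is simply the fraction of test labels that the model predicted correctly. A minimal sketch of the same calculation, using made-up labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    # Count the positions where the prediction matches the actual label.
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)

# Three of the five predictions match, so the accuracy is 3/5.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```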

**Logistic Regression**

In [None]:
# Initialize the LogisticRegression classifier
logistic_regression = LogisticRegression(max_iter=1000)

Logistic Regression is a linear model for binary classification. It models the probability that a given instance belongs to a particular class. The max_iter parameter sets the maximum number of iterations for the optimization algorithm.
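How a linear score becomes a class probability can be sketched as follows. The weights, bias, and feature values are made up for illustration; in the notebook they would be learned by fit from the TF-IDF features.

```python
import math

def predict_proba_positive(weights, bias, features):
    # Linear score: bias plus the weighted sum of the feature values.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid squashes the score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two terms, e.g. "great" (positive) and "bad" (negative).
weights = [2.0, -3.0]
p = predict_proba_positive(weights, 0.0, [0.8, 0.0])
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

A review whose score is positive gets a probability above 0.5 and is labeled 1 (positive); a negative score gives a probability below 0.5 and the label 0.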

In [None]:
# Train the classifier
logistic_regression.fit(X_train, y_train)

The fit method is used to train the Logistic Regression model on the training data (X_train, y_train).

In [None]:
# Predict on the test set
y_pred = logistic_regression.predict(X_test)

The predict method is applied to the test set (X_test) to make predictions based on the trained Logistic Regression model.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of LogisticRegression: {accuracy:.2f}")

Accuracy of LogisticRegression: 0.88


As with the KNN model, the accuracy of the Logistic Regression model is evaluated using the accuracy_score function.

**Decision Tree Classifier**

In [None]:
# Initialize the DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

A Decision Tree is a non-linear model that recursively splits the data based on features to create a tree-like structure, where leaves represent the class labels.

In [None]:
# Train the classifier
decision_tree.fit(X_train, y_train)

The fit method is used to train the Decision Tree classifier on the training data (X_train, y_train).

In [None]:
# Predict on the test set
y_pred = decision_tree.predict(X_test)

The predict method is applied to the test set (X_test) to make predictions based on the trained Decision Tree model.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of DecisionTreeClassifier: {accuracy:.2f}")

Accuracy of DecisionTreeClassifier: 0.71


As before, the accuracy of the Decision Tree model is evaluated using the accuracy_score function.

**Random Forest Classifier**

In [None]:
# Initialize the RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)

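Random Forest is an ensemble learning method that trains many decision trees, each on a random sample of the training data, and combines their predictions. Conceptually, the predicted class is the one most trees vote for; scikit-learn's implementation averages the trees' predicted class probabilities instead of counting hard votes.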

In [None]:
# Train the classifier
random_forest.fit(X_train, y_train)

The fit method trains the Random Forest classifier on the training data (X_train, y_train). The n_estimators=100 argument passed at initialization specifies that the ensemble contains 100 decision trees.

In [None]:
# Predict on the test set
y_pred = random_forest.predict(X_test)

The predict method is applied to the test set (X_test) to make predictions based on the ensemble of decision trees.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of RandomForestClassifier: {accuracy:.2f}")

Accuracy of RandomForestClassifier: 0.82


The accuracy of the Random Forest model is evaluated using the accuracy_score function.

**GaussianNB**

In [None]:
# Initialize the GaussianNB classifier
gaussian_nb = GaussianNB()

Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that, within each class, each feature follows a Gaussian distribution.
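As a rough illustration of the Gaussian assumption, the sketch below scores a sample against per-class feature means and variances and picks the class with the highest log-probability. The statistics, priors, and the function names are invented for the example; GaussianNB estimates the real values from the training data.

```python
import math

def gaussian_log_pdf(x, mean, var):
    # Log-density of a normal distribution with the given mean and variance.
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def gnb_predict(x, class_stats, priors):
    scores = {}
    for label, stats in class_stats.items():
        # Start from the log prior, then add one log-likelihood per feature.
        score = math.log(priors[label])
        for value, (mean, var) in zip(x, stats):
            score += gaussian_log_pdf(value, mean, var)
        scores[label] = score
    # Return the class with the highest total log-probability.
    return max(scores, key=scores.get)

# Made-up (mean, variance) pairs for two features in each class.
class_stats = {0: [(0.1, 0.01), (0.6, 0.04)], 1: [(0.7, 0.04), (0.1, 0.01)]}
priors = {0: 0.5, 1: 0.5}
print(gnb_predict([0.65, 0.15], class_stats, priors))
```

Summing per-feature log-likelihoods is where the "naive" independence assumption enters.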

In [None]:
# Train the classifier
gaussian_nb.fit(X_train.toarray(), y_train)

The fit method trains the Gaussian Naive Bayes classifier on the training data (X_train, y_train). Because GaussianNB requires dense input, toarray() first converts the sparse TF-IDF matrix to a dense NumPy array.

In [None]:
# Predict on the test set
y_pred = gaussian_nb.predict(X_test.toarray())

The predict method is applied to the test set (X_test) to make predictions based on the trained Gaussian Naive Bayes model.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of GaussianNB: {accuracy:.2f}")

Accuracy of GaussianNB: 0.60


The accuracy of the Gaussian Naive Bayes model is evaluated using the accuracy_score function.

**Quadratic Discriminant Analysis**

In [None]:
# Initialize the QDA classifier
qda_classifier = QuadraticDiscriminantAnalysis()

Quadratic Discriminant Analysis is a classification algorithm that models each class with its own Gaussian distribution, including a separate covariance matrix per class, which yields quadratic decision boundaries between the classes.

In [None]:
# Train the classifier
qda_classifier.fit(X_train.toarray(), y_train)



The fit method trains the Quadratic Discriminant Analysis classifier on the training data (X_train, y_train). As with GaussianNB, toarray() converts the sparse TF-IDF matrix to a dense NumPy array.

In [None]:
# Predict on the test set
y_pred = qda_classifier.predict(X_test.toarray())

The predict method is applied to the test set (X_test) to make predictions based on the trained Quadratic Discriminant Analysis model.

In [None]:
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Binary Classification using QDA: {accuracy:.2f}")

Accuracy of Binary Classification using QDA: 0.53


The accuracy of the Quadratic Discriminant Analysis model is evaluated using the accuracy_score function.

**Testing the models**

In [None]:
def predict_sentiment(user_review):
    # Preprocess the user input
    processed_review = preprocess_review(user_review)
    processed_review = ' '.join(processed_review)

    # Transform the processed review using the same TF-IDF vectorizer
    user_input = tfidf_vectorizer.transform([processed_review])

    # Predict sentiment using the trained model
    # UNCOMMENT THE MODEL THAT YOU WANT TO USE
    # prediction = knn_classifier.predict(user_input)
    prediction = logistic_regression.predict(user_input)
    # prediction = decision_tree.predict(user_input)
    # prediction = random_forest.predict(user_input)
    # prediction = gaussian_nb.predict(user_input.toarray())
    # prediction = qda_classifier.predict(user_input.toarray())


    # Map the prediction to a sentiment label
    sentiment = 'positive' if prediction[0] == 1 else 'negative'

    return sentiment

The predict_sentiment function takes a user's movie review as input. The review is first preprocessed with the same preprocess_review function used on the dataset and joined back into a single string.

The processed review is then transformed into a TF-IDF representation (user_input) with the same fitted vectorizer (tfidf_vectorizer). Calling transform rather than fit_transform reuses the vocabulary and weights learned when the vectorizer was fitted.

The sentiment is predicted with the logistic regression model (logistic_regression). The commented-out lines show the other trained models; to switch, comment out the active line and uncomment the one you want. The GaussianNB and QDA lines call toarray() because those models were trained on dense arrays.

The prediction is then mapped to a human-readable label ('positive' for 1, 'negative' for 0), and the function returns that label.

In [None]:
# Take user input for a review
user_review = input("Enter your movie review: ")

# Predict sentiment for the user input
result = predict_sentiment(user_review)
print(f"The model predicts this review as: {result}")

Enter your movie review: This movie is so bad
The model predicts this review as: negative


The code takes user input for a movie review using the input function.

The predict_sentiment function is then called with the user's input, and the result is stored in the variable result.

Finally, the predicted sentiment is printed to the console.

**Conclusion: Logistic Regression as the Optimal Model**

Of the six classification models evaluated for sentiment analysis on movie reviews, Logistic Regression performed best on this test split. Its accuracy of 0.88 exceeded that of the other five models, whose accuracies ranged from 0.53 (QDA) to 0.82 (Random Forest), making it the most reliable of the tested models at distinguishing positive from negative reviews.
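For a side-by-side view, the accuracies printed in the cells above can be collected and ranked. The numbers below are copied from this notebook's outputs; the dictionary and variable names are only illustrative.

```python
# Accuracies as reported by the evaluation cells in this notebook.
reported_accuracy = {
    "KNeighborsClassifier": 0.67,
    "LogisticRegression": 0.88,
    "DecisionTreeClassifier": 0.71,
    "RandomForestClassifier": 0.82,
    "GaussianNB": 0.60,
    "QuadraticDiscriminantAnalysis": 0.53,
}

# Sort from highest to lowest accuracy and print a small table.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<32}{acc:.2f}")

best_model = ranking[0][0]
print(f"Best model: {best_model}")
```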