<a href="https://colab.research.google.com/github/Montaser778/NLP_IMDB/blob/main/Lab6_NLP_IMDB_Montaser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 6: AI in Natural Language Processing (NLP)**
## **Objective**
This lab applies Natural Language Processing (NLP) techniques using **TF-IDF** and **sentiment analysis**.

## **Dataset Justification**
We selected the **IMDb Movie Reviews dataset** because it is widely used for **sentiment analysis** tasks and contains **a balanced set of positive and negative reviews**.  
The full dataset is large (50,000 reviews), making it suitable for training **machine learning models** on real-world text data. To keep the lab fast, this notebook loads only the first 5,000 rows, so the class balance of that subset is approximate rather than exact. The dataset also allows us to evaluate **Natural Language Processing (NLP)** techniques such as **TF-IDF** together with **classification models like Logistic Regression and Naïve Bayes**.


## **1. Import Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

## **2. Load and Explore Dataset**

In [None]:
import pandas as pd

file_path = "IMDB Dataset.csv"

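# Load only the first 5,000 rows; malformed lines are skipped.
try:
    # pandas >= 1.3 uses the on_bad_lines parameter
    df = pd.read_csv(file_path, encoding="utf-8", nrows=5000, on_bad_lines='skip')
except TypeError:
    # Older pandas versions use error_bad_lines instead
    df = pd.read_csv(file_path, encoding="utf-8", nrows=5000, error_bad_lines=False)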

df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
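
Beyond `df.head()`, a quick look at the label distribution and missing values can be a useful check, since only part of the file is loaded. The following is a minimal sketch, not part of the original notebook: `toy_df` is a hypothetical stand-in for `df` (same column names) so that the cell runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the loaded DataFrame (same column names as the CSV).
toy_df = pd.DataFrame({
    "review": ["Great film", "Terrible plot", "Loved it", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Number of reviews per label; a large gap would indicate class imbalance.
label_counts = toy_df["sentiment"].value_counts()

# Number of missing values per column.
missing_counts = toy_df.isna().sum()

print(label_counts)
print(missing_counts)
```

On the real data, the same two calls would be applied to `df` instead of `toy_df`.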


## **3. Data Preprocessing**

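### Text Cleaning in the IMDb Dataset

In this section, we define a function to clean the text in the `review` column of the IMDb dataset.  
The cleaning process includes:
- Removing URLs  
- Replacing special (non-word) characters with spaces  
- Removing numbers  
- Converting all text to lowercase  
- Collapsing extra whitespace

Note that HTML tags such as `<br />` are not removed as a unit: only their punctuation is stripped, so the token `br` remains in the cleaned text. Apostrophes are also replaced, so `there's` becomes `there s`.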

### **3.1 Text Cleaning**

In [None]:
import pandas as pd
import re

# Function to clean text
def clean_text(text):
    """
    Cleans text by:
    - Removing URLs
    - Removing special characters
    - Removing numbers
    - Converting text to lowercase
    - Removing extra spaces
    """
    text = str(text)  # Ensure text is a string
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters (punctuation, symbols) with a space
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply text cleaning
df['cleaned_text'] = df['review'].apply(clean_text)

# Display the cleaned text
df[['review', 'cleaned_text']].head()

|   | review | cleaned_text |
|---|--------|--------------|
| 0 | One of the other reviewers has mentioned that ... | one of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. `<br /><br />`The... | a wonderful little production br br the filmin... |
| 2 | I thought this was a wonderful way to spend ti... | i thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... | basically there s a family where a little boy ... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | petter mattei s love in the time of money is a... |
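
The `br br` tokens in row 1 of the cleaned output come from `<br />` tags whose angle brackets were stripped by the special-character step. One possible variant, sketched below, removes HTML tags before the other steps. The function name `clean_text_strip_html` and the sample string are assumptions made for illustration; the results reported later in this lab were produced with the original `clean_text`.

```python
import re

def clean_text_strip_html(text):
    """Hypothetical variant of clean_text that drops HTML tags first."""
    text = str(text)                              # ensure text is a string
    text = re.sub(r'<[^>]+>', ' ', text)          # remove HTML tags such as <br />
    text = re.sub(r'http\S+|www\S+', '', text)    # remove URLs
    text = re.sub(r'\W', ' ', text)               # replace non-word characters with a space
    text = re.sub(r'\d+', '', text)               # remove numbers
    text = text.lower()                           # convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()      # collapse extra whitespace
    return text

# Made-up sample text, not taken from the dataset.
sample = "Great movie!<br /><br />Visit http://example.com for 10 more reviews."
cleaned = clean_text_strip_html(sample)
print(cleaned)  # great movie visit for more reviews
```

Tag removal runs before URL removal so that a tag directly following a URL is not swallowed by the `\S+` pattern.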


## **4. Feature Extraction using TF-IDF**

### Text Vectorization using TF-IDF

After cleaning the text data, the next step is to convert it into numerical format so that machine learning models can process it.  
We will use **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization, which gives a term a high weight when it occurs often in a given review but appears in few reviews overall, and a low weight when it is common across the whole corpus.

Key points:
- We use `TfidfVectorizer` to extract features from the cleaned text.
- We limit the vocabulary to the 5,000 terms with the highest frequency across the corpus (`max_features=5000`).
- The sentiment labels are mapped to binary values: `positive` → 1 and `negative` → 0.
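
To make the weighting concrete, here is a small sketch on a made-up three-sentence corpus (`toy_corpus` is an assumption, not part of the dataset). With scikit-learn's default settings, the inverse document frequency of a term t is computed as ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. Each row is then L2-normalized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
toy_corpus = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Map each term to its learned inverse document frequency.
idf_by_term = {
    term: toy_vectorizer.idf_[index]
    for term, index in toy_vectorizer.vocabulary_.items()
}

print(toy_matrix.shape)  # (documents, vocabulary size)
print("idf('the')      =", idf_by_term["the"])
print("idf('terrible') =", idf_by_term["terrible"])
```

A term such as `the`, which appears in every sentence, receives the lowest possible IDF, while `terrible`, which appears in only one, is weighted more heavily.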

In [None]:
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(df['cleaned_text'])
y = df['sentiment'].map({'positive': 1, 'negative': 0})  # Convert labels to binary

## **5. Train Sentiment Analysis Model**
### **5.1 Split Data into Train & Test Sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
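
In the cells above, the vectorizer is fitted on all reviews before the split, so the vocabulary and IDF weights are influenced by the test reviews. A common alternative, sketched below under assumed toy data, is to split the raw text first and fit TF-IDF on the training portion only. The names `toy_texts`, `toy_labels`, and `split_vectorizer` are hypothetical. The `stratify` argument is optional and keeps the class ratio the same in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical cleaned reviews and labels standing in for df['cleaned_text'] and y.
toy_texts = [
    "great movie", "loved the acting", "wonderful story", "really enjoyable film",
    "terrible plot", "boring and slow", "awful acting", "waste of time",
    "great fun", "bad script",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# Split the raw text first, preserving the class ratio in both parts.
text_train, text_test, y_train_toy, y_test_toy = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

split_vectorizer = TfidfVectorizer(max_features=5000)
X_train_toy = split_vectorizer.fit_transform(text_train)  # learn vocabulary and IDF on training text only
X_test_toy = split_vectorizer.transform(text_test)        # reuse the fitted vocabulary for test text

print(X_train_toy.shape, X_test_toy.shape)
```

The accuracy figures reported in this lab come from the original fit-then-split order, so they are not directly comparable with results from this variant.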

### **5.2 Train Model (Logistic Regression & Naïve Bayes)**

### Training Classification Models

Now that the text data has been vectorized, we can train machine learning models to classify the movie reviews as **positive** or **negative**.

In this section, we will train two popular classification models:

1. **Logistic Regression**  
   - A widely used linear model for binary classification problems.
   - It estimates the probability that a given input belongs to a particular class.

2. **Multinomial Naïve Bayes**  
   - A probabilistic model based on Bayes’ theorem.
   - It treats the features of a document as conditionally independent given the class, which makes it a fast and widely used baseline for text classification.

In [None]:
# Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Naïve Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
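
Once fitted, both models can score unseen text, provided the text goes through the same vectorizer that was used for training. The sketch below illustrates this on a made-up corpus. All names prefixed with `demo_` or `toy_` are assumptions for illustration, not objects from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = positive, 0 = negative.
toy_reviews = [
    "a great film", "wonderful acting and story", "great and wonderful",
    "a terrible film", "boring acting and story", "terrible and boring",
]
toy_sentiment = [1, 1, 1, 0, 0, 0]

demo_vectorizer = TfidfVectorizer()
X_demo = demo_vectorizer.fit_transform(toy_reviews)

demo_logistic = LogisticRegression().fit(X_demo, toy_sentiment)
demo_nb = MultinomialNB().fit(X_demo, toy_sentiment)

# New text must be transformed (not fitted) with the same vectorizer.
X_new = demo_vectorizer.transform(["a wonderful and great film"])

# Each row holds [P(negative), P(positive)], following the order of classes_.
proba_logistic = demo_logistic.predict_proba(X_new)
proba_nb = demo_nb.predict_proba(X_new)

print("Logistic Regression:", proba_logistic)
print("Naive Bayes:", proba_nb)
```

With the real models, the equivalent call would be `vectorizer.transform([clean_text(new_review)])` followed by `logistic_model.predict_proba(...)`, where `new_review` is a hypothetical input string.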

## **6. Model Predictions & Evaluation**

After training the models, we now evaluate their performance using the test set.  
We will compare both **Logistic Regression** and **Naïve Bayes** based on:

- **Accuracy Score**: The fraction of test-set predictions that are correct.
- **Classification Report**: Provides precision, recall, F1-score, and support for each class.
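
As a reminder of how the report's columns are derived, the sketch below computes precision, recall, and F1 for the positive class by hand from a confusion matrix and checks them against scikit-learn. The label vectors `y_true_toy` and `y_pred_toy` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 4 positive and 6 negative reviews.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)   # of predicted positives, how many are correct
recall_manual = tp / (tp + fn)      # of actual positives, how many are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(f1_manual, f1_score(y_true_toy, y_pred_toy))
```

The **support** column in the report is simply the number of true instances of each class in the test set.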

In [None]:
# Predictions
y_pred_logistic = logistic_model.predict(X_test)
y_pred_nb = nb_model.predict(X_test)

# Accuracy & Classification Report
logistic_acc = accuracy_score(y_test, y_pred_logistic)
nb_acc = accuracy_score(y_test, y_pred_nb)

logistic_report = classification_report(y_test, y_pred_logistic)
nb_report = classification_report(y_test, y_pred_nb)

# Display Results
print("Logistic Regression Accuracy:", logistic_acc)
print("\nLogistic Regression Report:\n", logistic_report)
print("\nNaïve Bayes Accuracy:", nb_acc)
print("\nNaïve Bayes Report:\n", nb_report)

Logistic Regression Accuracy: 0.876

Logistic Regression Report:
               precision    recall  f1-score   support

           0       0.89      0.88      0.88       530
           1       0.86      0.87      0.87       470

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000


Naïve Bayes Accuracy: 0.842

Naïve Bayes Report:
               precision    recall  f1-score   support

           0       0.84      0.87      0.85       530
           1       0.85      0.81      0.83       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000



## **7. Conclusion**
- The **Logistic Regression** and **Naïve Bayes** models were trained successfully on a 5,000-review subset of the IMDb dataset.
- **TF-IDF** was used to convert text data into numerical representations.
- **Performance comparison**: on the 1,000-review test set, Logistic Regression reached an accuracy of 0.876, compared with 0.842 for Naïve Bayes.
- Further improvements could include **deep learning approaches** (e.g., Transformers, BERT).