# Movie Reviews Sentiment Analysis (NLP Project)
This notebook demonstrates how to perform sentiment analysis on IMDB movie reviews.
We will go through the following steps:
- Load and explore the dataset
- Preprocess the text data (cleaning, removing noise, stemming)
- Convert text into numerical features (Bag of Words)
- Train Naive Bayes classifiers
- Evaluate the models


## 1. Importing Required Libraries
We start by importing the necessary Python libraries for data processing, NLP, and machine learning.

In [None]:
import numpy as np
import pandas as pd
import re
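import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
import joblib

# One-time download of the NLTK data used below.
# Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)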


## 2. Loading and Exploring the Dataset
Let's load the IMDB dataset, then check its shape, its first few rows, and how the two sentiment classes are distributed.

In [None]:
dataset = pd.read_csv('IMDB.csv')
dataset.shape  # Checking shape (rows, columns)

In [None]:
dataset.head()  # Displaying first few rows of the dataset

In [None]:
dataset['sentiment'].value_counts()  # Checking class distribution

## 3. Encoding Target Labels
Convert the sentiment labels from strings to binary (positive=1, negative=0).

In [None]:
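# Assign the result back: calling replace(..., inplace=True) on a selected column may not update the DataFrame
dataset['sentiment'] = dataset['sentiment'].map({'positive': 1, 'negative': 0})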
dataset.head()

## 4. Text Cleaning
We will clean the text in the following steps:
1. Remove HTML tags
2. Remove special characters
3. Convert to lowercase
4. Remove stopwords
5. Perform stemming
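
To see what steps 1 to 4 do to a single review, here is a dependency-free sketch on a made-up sentence. The stopword set is a tiny hand-picked stand-in for NLTK's English list, `demo_clean` is a hypothetical helper, and stemming is left out because it needs NLTK.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
DEMO_STOPWORDS = {"the", "a", "was", "and", "it", "is", "but"}

def demo_clean(text):
    text = re.sub(r"<.*?>", " ", text)                            # 1. strip HTML tags
    text = "".join(ch if ch.isalnum() else " " for ch in text)   # 2. drop special characters
    text = text.lower()                                           # 3. lowercase
    return [w for w in text.split() if w not in DEMO_STOPWORDS]   # 4. drop stopwords

sample = "The plot was GREAT!<br /><br />But the ending... meh."
tokens = demo_clean(sample)
print(tokens)  # ['plot', 'great', 'ending', 'meh']
```

The order matters: lowercasing comes before stopword removal because the stopword list is lowercase, so "The" would otherwise slip through.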

In [None]:
# Remove HTML tags (each tag becomes a space so neighbouring words do not merge)
def clean(text):
    return re.sub(r'<.*?>', ' ', text)
dataset['review'] = dataset['review'].apply(clean)
dataset['review'][0]

In [None]:
# Remove special characters: every non-alphanumeric character becomes a space
def is_special(text):
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)
dataset['review'] = dataset['review'].apply(is_special)
dataset['review'][0]

In [None]:
# Convert to lowercase
dataset['review'] = dataset['review'].str.lower()
dataset['review'][0]

In [None]:
# Remove stopwords (after this step each review is a list of tokens)
stop_words = set(stopwords.words('english'))  # build the set once, not once per review
def rem_stopwords(text):
    words = word_tokenize(text)
    return [w for w in words if w not in stop_words]
dataset['review'] = dataset['review'].apply(rem_stopwords)
dataset['review'][0]

In [None]:
# Perform stemming and join the tokens back into a single string
ss = SnowballStemmer('english')  # create the stemmer once and reuse it
def stem_txt(text):
    return " ".join(ss.stem(w) for w in text)
dataset['review'] = dataset['review'].apply(stem_txt)
dataset['review'][0]

## 5. Feature Extraction using Bag of Words (BoW)
We convert the cleaned text into a matrix of word counts using `CountVectorizer`, keeping only the 2,000 most frequent terms (`max_features=2000`).

In [None]:
cv = CountVectorizer(max_features=2000)
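# toarray() makes the matrix dense, which uses more memory but is needed because GaussianNB does not accept sparse input
X = cv.fit_transform(dataset['review']).toarray()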
y = dataset['sentiment'].values
X.shape, y.shape
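
To make the count matrix concrete, the sketch below builds one by hand for three made-up, already-stemmed reviews. It is a simplified approximation of what `CountVectorizer` does with `max_features`, not its actual implementation; `top_k` is a hypothetical name standing in for `max_features`.

```python
from collections import Counter

# Made-up stemmed reviews
docs = ["great movi great act", "bad movi", "great plot bad end"]

# Corpus-wide term frequencies decide which words make it into the vocabulary
totals = Counter(word for doc in docs for word in doc.split())
top_k = 3  # plays the role of max_features
vocab = sorted(word for word, _ in totals.most_common(top_k))

# One row per document, one column per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)   # ['bad', 'great', 'movi']
print(matrix)  # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

Words outside the vocabulary ("act", "plot", "end") are simply dropped, which is the trade-off `max_features` makes between matrix size and detail.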

## 6. Train-Test Split
Split the dataset into training and testing sets, holding out 20% of the reviews for testing. Fixing `random_state` makes the split reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
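
Conceptually, the split is a seeded shuffle followed by a cut. The sketch below illustrates that idea on row indices with the standard library only; `simple_split` is a hypothetical helper, and because scikit-learn uses its own random number generator, the rows it selects will differ from these.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=9):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed, same order every run
    n_test = math.ceil(n_samples * test_size)  # size of the held-out part
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```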

## 7. Model Training
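Train three Naive Bayes models: Gaussian, Multinomial, and Bernoulli. They differ in how they model each feature: Gaussian treats the counts as continuous values, Multinomial models the counts directly, and Bernoulli only looks at whether a word is present or absent.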

In [None]:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()
gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)
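
For intuition about what `MultinomialNB` learns, here is a minimal from-scratch sketch on a made-up count matrix. It assumes Laplace smoothing with `alpha = 1.0` (the scikit-learn default); `fit_mnb` and `predict_mnb` are hypothetical helpers written for this illustration, not part of any library.

```python
import math

# Toy count matrix over the vocabulary ['bad', 'great', 'movi'] (made-up data)
X_toy = [[0, 2, 1], [1, 0, 1], [0, 1, 1], [2, 0, 0]]
y_toy = [1, 0, 1, 0]
alpha = 1.0  # Laplace smoothing

def fit_mnb(X, y):
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        word_totals = [sum(col) for col in zip(*rows)]  # per-word counts within class c
        denom = sum(word_totals) + alpha * len(word_totals)
        model[c] = (
            math.log(len(rows) / len(y)),                          # log prior
            [math.log((t + alpha) / denom) for t in word_totals],  # log likelihoods
        )
    return model

def predict_mnb(model, x):
    # Score = log prior + sum over words of count * log likelihood
    scores = {c: prior + sum(n * lp for n, lp in zip(x, loglik))
              for c, (prior, loglik) in model.items()}
    return max(scores, key=scores.get)

toy_model = fit_mnb(X_toy, y_toy)
print(predict_mnb(toy_model, [0, 1, 1]))  # 'great' + 'movi' -> 1
print(predict_mnb(toy_model, [1, 0, 0]))  # only 'bad' -> 0
```

Smoothing keeps a word that never appears in one class (such as "great" in the negative rows here) from forcing that class's probability to zero.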

## 8. Save Trained Models
Save all three models using `joblib`.

In [None]:
joblib.dump(gnb, 'MRSA_gnb.pkl')
joblib.dump(mnb, 'MRSA_mnb.pkl')
joblib.dump(bnb, 'MRSA_bnb.pkl')
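
A saved classifier can only score new reviews if the text is vectorized with the same fitted vocabulary, so it is worth saving the `CountVectorizer` too. The sketch below shows one possible way to do that on made-up data: it bundles both objects into a single file and reloads them. The file name, dictionary keys, and `demo_` variables are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-cleaned reviews standing in for the real training data
train_docs = ["great movi love", "bad movi hate", "love great act", "hate bad plot"]
train_labels = [1, 0, 1, 0]

demo_cv = CountVectorizer()
demo_model = MultinomialNB().fit(demo_cv.fit_transform(train_docs), train_labels)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name: one bundle holding both fitted objects
    path = os.path.join(tmp, "demo_bundle.pkl")
    joblib.dump({"vectorizer": demo_cv, "model": demo_model}, path)
    bundle = joblib.load(path)

# New text must go through the same cleaning steps before transform()
new_X = bundle["vectorizer"].transform(["love great plot"])
prediction = bundle["model"].predict(new_X)
print(prediction)
```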

## 9. Model Evaluation
Make predictions and evaluate the accuracy of each model.

In [None]:
accuracy_score(y_test, gnb.predict(X_test))

In [None]:
accuracy_score(y_test, mnb.predict(X_test))

In [None]:
accuracy_score(y_test, bnb.predict(X_test))
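
Accuracy is the share of test reviews whose predicted label matches the true one. The sketch below reproduces that computation by hand and also counts the four outcome types, which shows what a single accuracy figure can hide. The labels are invented for illustration and say nothing about how the three models actually perform.

```python
# Made-up true labels and predictions for six reviews
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Break the result down by outcome type
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positive reviews predicted positive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negative reviews predicted negative
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negative reviews predicted positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positive reviews predicted negative

print(f"accuracy={accuracy:.3f} tp={tp} tn={tn} fp={fp} fn={fn}")
# accuracy=0.667 tp=2 tn=2 fp=1 fn=1
```

On the real test set, `sklearn.metrics.confusion_matrix` gives the same breakdown without the manual counting.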