<a href="https://colab.research.google.com/github/Innov8iveGuru/Python/blob/main/SentimentAnalysisNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Set up the environment and import all necessary libraries

In [1]:
# Import necessary libraries
import nltk
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Download NLTK data (if not already installed)
nltk.download('stopwords')
nltk.download('punkt')

# Import stopwords
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


2. Load the IMDb dataset, which contains 50,000 reviews labeled as either positive or negative.

In [2]:
from keras.datasets import imdb

# Load the dataset (only keeping the top 10,000 most frequent words)
# x_train and x_test contain the reviews (encoded as word indices)
# y_train and y_test contain the sentiment labels (1 = positive, 0 = negative)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Let's explore the dataset
print(f'Training data: {len(x_train)} reviews')
print(f'Testing data: {len(x_test)} reviews')
print(f'Example review (encoded): {x_train[0]}')
print(f'Label (1 = positive, 0 = negative): {y_train[0]}')

# Get the word index from IMDb to decode reviews back to words
word_index = imdb.get_word_index()

# Invert the index to get an index->word mapping
index_word = {i: word for word, i in word_index.items()}

# Example: decoding the first review
decoded_review = ' '.join([index_word.get(i - 3, '?') for i in x_train[0]])
print(f'Decoded review: {decoded_review}')


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Training data: 25000 reviews
Testing data: 25000 reviews
Example review (encoded): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 
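
The `i - 3` offset in the decoding step is easy to misread. The minimal sketch below uses a made-up three-word vocabulary (`toy_word_index` is hypothetical, not the real IMDb index) to show why it is needed: Keras reserves the encoded values 0, 1, and 2 for padding, start-of-sequence, and out-of-vocabulary markers, so every real word index is shifted up by 3 in the encoded reviews.

```python
# Hypothetical vocabulary in the same format as imdb.get_word_index():
# word -> rank, where rank 1 is the most frequent word.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Invert it to get rank -> word.
toy_index_word = {i: word for word, i in toy_word_index.items()}

def decode(encoded_review):
    # Encoded values 0, 1, and 2 are reserved markers; after subtracting 3
    # they fall outside the vocabulary and are rendered as '?'.
    return " ".join(toy_index_word.get(i - 3, "?") for i in encoded_review)

# 1 = start marker, 4 = "the", 5 = "movie", 2 = unknown word, 6 = "great"
encoded = [1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)  # ? the movie ? great
```

This is also why the decoded IMDb reviews begin with a `?` and contain `?` wherever a word fell outside the top 10,000.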

3. Data Preprocessing
We’ll follow these steps for preprocessing:

i) Tokenization: Convert text into individual words (tokens).

ii) Stop Word Removal: Remove words that don’t add much meaning, such as "the," "is," etc.

iii) Vectorization: Use techniques like Bag of Words or TF-IDF to convert the text into numerical features.

The dataset ships already tokenized and encoded as word indices, so we first decode each review back to text and then let TfidfVectorizer handle tokenization, stop-word removal, and vectorization in a single step.
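
To make the three steps concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The corpus, the stop-word list, and the whitespace tokenizer are made up for illustration. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization mirror scikit-learn's defaults, but this is a sketch of the idea and not a drop-in replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Tiny illustrative corpus and stop-word list (both made up for this sketch).
docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great story",
]
stop_words = {"the", "was", "and", "a"}

def tokenize(text):
    # i) Tokenization and ii) stop-word removal
    return [w for w in text.lower().split() if w not in stop_words]

tokenized = [tokenize(d) for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    # iii) Vectorization: term count * IDF, then L2-normalize the vector.
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
for doc, vec in zip(docs, vectors):
    print(doc, "->", [round(v, 3) for v in vec])
```

In the second document, "terrible" receives a higher weight than "movie" because it appears in fewer documents, which is the property that makes TF-IDF useful for separating classes.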

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Decode a review from word indices back to text.
# Indices are offset by 3 because 0, 1, and 2 are reserved (padding, start, unknown).
def decode_review(encoded_review):
    return ' '.join([index_word.get(i - 3, '?') for i in encoded_review])

# Convert the training and test datasets into decoded form (text format)
x_train_text = [decode_review(review) for review in x_train]
x_test_text = [decode_review(review) for review in x_test]

# Vectorize the data using TF-IDF (Term Frequency-Inverse Document Frequency)
tfidf_vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')

# Fit and transform the training data, and transform the test data
x_train_tfidf = tfidf_vectorizer.fit_transform(x_train_text)
x_test_tfidf = tfidf_vectorizer.transform(x_test_text)

# Check the shape of the transformed data
print(f'Training data shape: {x_train_tfidf.shape}')
print(f'Testing data shape: {x_test_tfidf.shape}')


Training data shape: (25000, 9477)
Testing data shape: (25000, 9477)


4. Model Building (Naive Bayes)
We will use Multinomial Naive Bayes from scikit-learn. The model is designed for word counts, but in practice it also works well with fractional features such as TF-IDF weights.
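
As a rough illustration of what the classifier computes, the sketch below implements multinomial Naive Bayes from scratch on word counts with Laplace smoothing (`alpha=1.0`, the same default smoothing value scikit-learn uses). The training examples and helper names (`log_posterior`, `predict`) are made up for this sketch; scikit-learn's implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (tokens, label), with 1 = positive, 0 = negative.
train = [
    (["great", "movie", "great", "acting"], 1),
    (["loved", "great", "story"], 1),
    (["terrible", "movie", "boring"], 0),
    (["boring", "terrible", "acting"], 0),
]

vocab = sorted({w for tokens, _ in train for w in tokens})
class_docs = Counter(label for _, label in train)

# Per-class word counts.
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    # log P(class) + sum of log P(word | class), with Laplace smoothing.
    log_prob = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in tokens:
        if w not in vocab:
            continue  # ignore words never seen in training
        p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        log_prob += math.log(p)
    return log_prob

def predict(tokens):
    # Pick the class with the highest (unnormalized) log posterior.
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["great", "story"]))   # 1
print(predict(["boring", "movie"]))  # 0
```

Smoothing matters because a word that never appears in one class would otherwise give that class a probability of zero, no matter what the other words say.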

In [4]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the Multinomial Naive Bayes model
nb_model = MultinomialNB()

# Train the model on the training data
nb_model.fit(x_train_tfidf, y_train)

# Make predictions on the test data
y_pred = nb_model.predict(x_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Naive Bayes Model Accuracy: {accuracy * 100:.2f}%')

# Print a classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Print the confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Naive Bayes Model Accuracy: 83.64%
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84     12500
           1       0.86      0.81      0.83     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000

Confusion Matrix:
 [[10846  1654]
 [ 2437 10063]]
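
The figures in the classification report can be recomputed by hand from the confusion matrix. The short sketch below does so for the Naive Bayes result above, using scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are chosen for this sketch only.

```python
# Confusion matrix from the Naive Bayes run above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[10846, 1654],
      [2437, 10063]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # share of predicted positives that are correct
recall_pos = tp / (tp + fn)     # share of actual positives that were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy:        {accuracy:.4f}")       # 0.8364
print(f"precision (1):   {precision_pos:.2f}")  # 0.86
print(f"recall (1):      {recall_pos:.2f}")     # 0.81
print(f"precision (0):   {precision_neg:.2f}")  # 0.82
print(f"recall (0):      {recall_neg:.2f}")     # 0.87
```

The asymmetry is visible here: the model misses more positive reviews (2,437 false negatives) than it mislabels negative ones (1,654 false positives), which is why recall for class 1 is lower than for class 0.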


To improve accuracy, we train a logistic regression model on the same TF-IDF features.
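
For orientation, logistic regression scores a review as the sigmoid of a weighted sum of its features and predicts positive when that probability exceeds 0.5. The sketch below shows the prediction step only. The weights, bias, and feature values are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z):
    # Squash any real-valued score into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for a few words (not taken from a real model).
weights = {"great": 2.1, "boring": -2.4, "movie": 0.1}
bias = 0.0

def predict_proba(features):
    # features: word -> TF-IDF weight for one review
    z = bias + sum(weights.get(w, 0.0) * v for w, v in features.items())
    return sigmoid(z)

p_pos = predict_proba({"great": 0.8, "movie": 0.6})
p_neg = predict_proba({"boring": 0.9, "movie": 0.4})
print(f"P(positive) for a 'great movie' review:  {p_pos:.2f}")
print(f"P(positive) for a 'boring movie' review: {p_neg:.2f}")
```

Unlike Naive Bayes, which estimates each word's contribution independently, logistic regression fits all weights jointly, which often helps when features are correlated.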

In [5]:
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression model
lr_model = LogisticRegression(max_iter=200)  # Increasing max_iter for convergence

# Train the model on the training data
lr_model.fit(x_train_tfidf, y_train)

# Make predictions on the test data
y_pred_lr = lr_model.predict(x_test_tfidf)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Model Accuracy: {accuracy_lr * 100:.2f}%')

# Print a classification report
print("Classification Report (Logistic Regression):\n", classification_report(y_test, y_pred_lr))

# Print the confusion matrix
print("Confusion Matrix (Logistic Regression):\n", confusion_matrix(y_test, y_pred_lr))

Logistic Regression Model Accuracy: 88.12%
Classification Report (Logistic Regression):
               precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Confusion Matrix (Logistic Regression):
 [[10996  1504]
 [ 1466 11034]]
