
In [581]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [582]:
import pandas as pd

# Purpose: Safely load the dataset from the specified Google Drive path.
try:
    # Attempt to read the Excel file. Handles FileNotFoundError if the path is incorrect.
    df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/P597 DATASET.xlsx')
except FileNotFoundError:
    print("Error: The file 'P597 DATASET.xlsx' was not found. Please check the exact path and drive mount status.")
    # Create an empty DataFrame to prevent subsequent code errors, or exit the program
    df = pd.DataFrame()
except Exception as e:
    print(f"An unexpected error occurred during file loading: {e}")
    df = pd.DataFrame()

# Proceed only if the DataFrame loaded successfully
if not df.empty:
    print(f"Dataset loaded successfully with {len(df)} rows and {len(df.columns)} columns.")
    display(df.head())

Dataset loaded successfully with 1440 rows and 3 columns.


Unnamed: 0,title,rating,body
0,Horrible product,1,Very disappointed with the overall performance...
1,Camera quality is not like 48 megapixel,3,Camera quality is low
2,Overall,4,"Got the mobile on the launch date,Battery must..."
3,A big no from me,1,1. It doesn't work with 5.0GHz WiFi frequency....
4,Put your money somewhere else,1,"Not worth buying....faulty software, poor disp..."


In [583]:
df.to_csv('/content/drive/MyDrive/Colab Notebooks/P597 DATASET.csv', index=False)

* **Handling Missing Values:** Check for and address any missing values in the relevant columns.

In [584]:
# Check for missing values
print("Missing values before handling:")
print(df[['body']].isnull().sum())

# Decide on a strategy for missing values if any are found.
# For example, drop rows with a missing review body:
# df.dropna(subset=['body'], inplace=True)

# Or fill missing review bodies with an empty string:
# df['body'] = df['body'].fillna('')

# Re-check for missing values after handling
# print("\nMissing values after handling:")
# print(df[['body']].isnull().sum())

Missing values before handling:
body    0
dtype: int64


* **Handling Duplicate Rows:** Identify and remove any duplicate rows that might skew the analysis.

In [585]:
# Check for duplicate rows
# Exclude columns with list types from duplicate check
columns_to_check = [col for col in df.columns if not isinstance(df[col].iloc[0], list)]
print("Number of duplicate rows before handling:", df.duplicated(subset=columns_to_check).sum())

# Remove duplicate rows
df.drop_duplicates(subset=columns_to_check, inplace=True)

# Re-check for duplicate rows
print("Number of duplicate rows after handling:", df.duplicated(subset=columns_to_check).sum())

Number of duplicate rows before handling: 0
Number of duplicate rows after handling: 0


# 1. NLP Sentiment Analysis

## A. Data Preprocessing



*  **Cleaning:** Remove HTML tags, special characters, and multiple spaces.



In [586]:
import re
import pandas as pd # Already imported, but good practice to show dependency

# Function: clean_text
# Purpose: Preprocesses text by removing non-alphanumeric noise elements and handling edge cases.
def clean_text(text):
    # Robust Error Handling: Check for missing (NaN) or non-string inputs
    if pd.isnull(text) or not isinstance(text, str):
        return ""
    try:
        # 1. Remove HTML tags (e.g., <body>, <br/>)
        text = re.sub(r'<.*?>', '', text)
        # 2. Remove special characters (keep only A-Z, a-z, 0-9, and spaces)
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # 3. Replace multiple spaces with a single space and strip leading/trailing whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    except Exception as e:
        # Log the error but return an empty string to keep the pipeline moving
        print(f"Error cleaning text '{text[:50]}...': {e}")
        return ""

df['cleaned_body'] = df['body'].apply(clean_text)
display(df[['body', 'cleaned_body']].head())

Unnamed: 0,body,cleaned_body
0,Very disappointed with the overall performance...,Very disappointed with the overall performance...
1,Camera quality is low,Camera quality is low
2,"Got the mobile on the launch date,Battery must...",Got the mobile on the launch dateBattery must ...
3,1. It doesn't work with 5.0GHz WiFi frequency....,1 It doesnt work with 50GHz WiFi frequency 24G...
4,"Not worth buying....faulty software, poor disp...",Not worth buyingfaulty software poor display q...
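The output above shows a side effect of deleting punctuation outright: words separated only by punctuation are glued together ("date,Battery" becomes "dateBattery", "buying....faulty" becomes "buyingfaulty"). A minimal sketch of one alternative, assuming it is acceptable to replace punctuation with a space instead; `clean_text_spaced` is a hypothetical name, not part of the notebook:

```python
import re

def clean_text_spaced(text):
    """Variant of clean_text that replaces noise with spaces so words stay separated."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags become a space
    text = re.sub(r"['’]", "", text)              # drop apostrophes: "doesn't" -> "doesnt"
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # remaining punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

print(clean_text_spaced("Got the mobile on the launch date,Battery must"))
print(clean_text_spaced("Not worth buying....faulty software, poor display"))
```

The trade-off is that numbers containing separators are split as well ("5.0GHz" becomes "5 0GHz"), so which behaviour is preferable depends on how much numeric detail matters for the sentiment task.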




*  **Normalization:** Convert all text to lowercase.



In [587]:
df['cleaned_body'] = df['cleaned_body'].str.lower()
display(df[['body', 'cleaned_body']].head())

Unnamed: 0,body,cleaned_body
0,Very disappointed with the overall performance...,very disappointed with the overall performance...
1,Camera quality is low,camera quality is low
2,"Got the mobile on the launch date,Battery must...",got the mobile on the launch datebattery must ...
3,1. It doesn't work with 5.0GHz WiFi frequency....,1 it doesnt work with 50ghz wifi frequency 24g...
4,"Not worth buying....faulty software, poor disp...",not worth buyingfaulty software poor display q...




*   **Tokenization:** Split the text into individual words or tokens.



In [588]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [589]:
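# 'punkt' only covers tokenization; the stop-word, POS-tagging, and lemmatization
# cells below need their own resources (far lighter than nltk.download('all')).
# Recent NLTK releases rename some of these (e.g. 'punkt_tab',
# 'averaged_perceptron_tagger_eng'); adjust if a LookupError appears.
for resource in ['stopwords', 'wordnet', 'averaged_perceptron_tagger']:
    nltk.download(resource, quiet=True)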

In [590]:
from nltk.tokenize import word_tokenize

df['tokens'] = df['cleaned_body'].apply(word_tokenize)
display(df[['cleaned_body', 'tokens']].head())

Unnamed: 0,cleaned_body,tokens
0,very disappointed with the overall performance...,"[very, disappointed, with, the, overall, perfo..."
1,camera quality is low,"[camera, quality, is, low]"
2,got the mobile on the launch datebattery must ...,"[got, the, mobile, on, the, launch, datebatter..."
3,1 it doesnt work with 50ghz wifi frequency 24g...,"[1, it, doesnt, work, with, 50ghz, wifi, frequ..."
4,not worth buyingfaulty software poor display q...,"[not, worth, buyingfaulty, software, poor, dis..."




*   **Stop Word Removal:** Eliminate common words like "the," "a," "is," which don't usually add significant sentiment.



In [591]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df['tokens_no_stopwords'] = df['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
display(df[['tokens', 'tokens_no_stopwords']].head())

Unnamed: 0,tokens,tokens_no_stopwords
0,"[very, disappointed, with, the, overall, perfo...","[disappointed, overall, performance, samsung]"
1,"[camera, quality, is, low]","[camera, quality, low]"
2,"[got, the, mobile, on, the, launch, datebatter...","[got, mobile, launch, datebattery, must, appre..."
3,"[1, it, doesnt, work, with, 50ghz, wifi, frequ...","[1, doesnt, work, 50ghz, wifi, frequency, 24gh..."
4,"[not, worth, buyingfaulty, software, poor, dis...","[worth, buyingfaulty, software, poor, display,..."
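Row 4 above shows that the standard stop list also removes "not", which flips the apparent meaning of "not worth buying". A minimal sketch of stop-word removal that keeps negations, using a tiny hand-written stop list as a stand-in for NLTK's English list; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are illustrative names:

```python
# A tiny illustrative stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "is", "with", "it", "on", "not", "no", "very"}
# Words that carry sentiment polarity and may be worth keeping.
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(tokens, keep_negations=True):
    """Drop stop words, optionally preserving negation words."""
    blocked = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in blocked]

tokens = ["not", "worth", "buyingfaulty", "software"]
print(remove_stopwords(tokens, keep_negations=False))  # negation is lost
print(remove_stopwords(tokens))                        # negation is kept
```

With the notebook's own variables, the same idea would amount to subtracting a negation set from `stop_words` before filtering.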




*   **Lemmatization/Stemming:** Reduce words to their root form (e.g., "running" to "run").



In [592]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map a Penn Treebank POS tag to the POS constant expected by WordNetLemmatizer."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # Penn Treebank adjective tags (JJ, JJR, JJS) start with "J", not "A"
    tag_dict = {"N": wordnet.NOUN,
                "V": wordnet.VERB,
                "J": wordnet.ADJ,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

df['lemmatized_tokens'] = df['tokens_no_stopwords'].apply(lambda tokens: [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens])
display(df[['tokens_no_stopwords', 'lemmatized_tokens']].head())

Unnamed: 0,tokens_no_stopwords,lemmatized_tokens
0,"[disappointed, overall, performance, samsung]","[disappointed, overall, performance, samsung]"
1,"[camera, quality, low]","[camera, quality, low]"
2,"[got, mobile, launch, datebattery, must, appre...","[get, mobile, launch, datebattery, must, appre..."
3,"[1, doesnt, work, 50ghz, wifi, frequency, 24gh...","[1, doesnt, work, 50ghz, wifi, frequency, 24gh..."
4,"[worth, buyingfaulty, software, poor, display,...","[worth, buyingfaulty, software, poor, display,..."
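The lemmatizer needs a WordNet part-of-speech code, while `nltk.pos_tag` returns Penn Treebank tags, so the first letter of the tag has to be translated. A minimal standalone sketch of that translation; the constants are written out as plain strings (they match `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ`, `wordnet.ADV`), and `penn_to_wordnet` is a hypothetical helper name:

```python
# WordNet POS constants are single letters:
# wordnet.NOUN == "n", wordnet.VERB == "v", wordnet.ADJ == "a", wordnet.ADV == "r".
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

def penn_to_wordnet(penn_tag):
    """Translate a Penn Treebank tag (e.g. 'JJ', 'VBD') to a WordNet POS code."""
    first = penn_tag[:1].upper()
    # Adjective tags are JJ/JJR/JJS, verbs VB*, nouns NN*, adverbs RB*.
    return {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}.get(first, NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Anything outside the four groups (numbers, determiners, and so on) falls back to noun, which is also the lemmatizer's own default.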


## B. Sentiment Labeling



*   **Binary Classification (Positive/Negative):** A common approach is to classify 4 and 5-star ratings as Positive (1) and 1 or 2-star ratings as Negative (0). You may choose to drop 3-star (Neutral) reviews or categorize them separately.



In [593]:
def categorize_sentiment(rating):
    if rating in [4, 5]:
        return 1  # Positive
    elif rating in [1, 2]:
        return 0  # Negative
    else:
        return 2  # Neutral (rating 3)

df['sentiment_categorized'] = df['rating'].apply(categorize_sentiment)

display(df[['rating', 'sentiment_categorized']].head())
display(df['sentiment_categorized'].value_counts().rename('Sentiment Counts (Categorized)'))

Unnamed: 0,rating,sentiment_categorized
0,1,0
1,3,2
2,4,1
3,1,0
4,1,0


Unnamed: 0_level_0,Sentiment Counts (Categorized)
sentiment_categorized,Unnamed: 1_level_1
1,729
0,512
2,199
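The cell above keeps 3-star reviews as a third class. The other option mentioned in the text, dropping them to obtain a strictly binary task, can be sketched as follows; this is an illustration in plain Python with a hypothetical `binary_sentiment` helper, not the approach the notebook goes on to use:

```python
def binary_sentiment(rating):
    """Return 1 for 4-5 stars, 0 for 1-2 stars, and None for 3 stars."""
    if rating in (4, 5):
        return 1
    if rating in (1, 2):
        return 0
    return None  # neutral reviews are marked for removal

ratings = [1, 3, 4, 1, 1]  # ratings of the first five reviews shown earlier
labelled = [(r, binary_sentiment(r)) for r in ratings]
binary_only = [(r, s) for r, s in labelled if s is not None]
print(binary_only)
```

On the notebook's DataFrame this would correspond to filtering out the rows where `rating` equals 3 before vectorization.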


## C. Feature Engineering (Vectorization)

Convert the text data into a numerical format that a machine learning model can understand.



*   **Bag-of-Words (BoW) / CountVectorizer:** Counts the frequency of words.



In [594]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert the list of tokens back to strings for CountVectorizer
df['lemmatized_text'] = df['lemmatized_tokens'].apply(lambda tokens: ' '.join(tokens))

count_vectorizer = CountVectorizer(max_features=5000) # You can adjust max_features
count_matrix = count_vectorizer.fit_transform(df['lemmatized_text'])

print("CountVectorizer matrix shape:", count_matrix.shape)

CountVectorizer matrix shape: (1440, 5000)
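To make the idea of a count matrix concrete, here is a minimal plain-Python sketch of bag-of-words on two made-up documents. It only mirrors the counting step; `CountVectorizer` additionally applies its own tokenization rules and the `max_features` cap.

```python
from collections import Counter

docs = ["camera quality low", "camera good battery good"]  # toy documents

# The vocabulary is the sorted set of all words; each word gets one column.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bow_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why the notebook's matrix has 5000 columns for 1440 reviews.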




*   **TF-IDF (Term Frequency-Inverse Document Frequency):** Weights words by their importance (frequent in one document but rare across the corpus).



In [595]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000) # You can adjust max_features
tfidf_matrix = tfidf_vectorizer.fit_transform(df['lemmatized_text'])

print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (1440, 5000)
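The weighting can be worked through by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses two made-up documents; it illustrates the formula and is not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalized TF-IDF rows with smoothed IDF for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(["camera quality low", "camera good battery good"])
for word, weight in zip(vocab, rows[0]):
    print(f"{word}: {weight:.3f}")
```

In the first document, "camera" appears in both documents and therefore receives a lower weight than "quality" or "low", which appear in only one.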




*   Use the SMOTE (Synthetic Minority Over-sampling Technique) method to balance the training data.



In [596]:
# **Improvement 6: Addressing Class Imbalance using SMOTE**
# SMOTE oversamples the minority class(es) by generating synthetic samples,
# providing a more balanced training dataset.

from collections import Counter
from imblearn.over_sampling import SMOTE
# Note: X_train and y_train come from the train_test_split cell in section D
# (Model Training and Evaluation); run that cell before this one.

# Check the distribution before SMOTE
print("Original class distribution (y_train):", Counter(y_train))

# Initialize SMOTE with a random state for reproducibility
smote = SMOTE(random_state=42)

# Apply SMOTE only to the training data. Use X_train from the previous cell.
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Check the distribution after SMOTE
print("Resampled class distribution (y_resampled):", Counter(y_resampled))

# Update: Use X_resampled and y_resampled for training the traditional ML models.
# You would typically use these resampled dataframes in the cell where you train your model.
# This cell now just demonstrates the resampling step.

Original class distribution (y_train): Counter({1: 596, 0: 401, 2: 155})
Resampled class distribution (y_resampled): Counter({1: 596, 2: 596, 0: 596})
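In the output above, class 2 grows from 155 to 596 samples, so 441 of its training points are synthetic. The core of how such points are generated can be sketched in plain Python: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between them. This is a simplified illustration with hypothetical helper names, not imblearn's implementation (which, among other things, uses `k_neighbors=5` by default).

```python
import math
import random

def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]

def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = nearest_neighbors(x, [m for m in minority if m is not x], k)
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy 2-D minority class
for point in smote_sketch(minority, n_new=3):
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two real minority samples, resampling must be applied to the training split only; otherwise information from test samples leaks into training.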




*   Use Randomized Search Cross-Validation to efficiently find optimal hyperparameters, which inherently includes cross-validation.



In [597]:
# **Improvement 4 & 5: Hyperparameter Tuning and Cross-validation**
# We use RandomizedSearchCV with 5-fold cross-validation (cv=5) to find the best
# hyperparameters for our Logistic Regression classifier.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Define the base model
model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)

# Define the search space for hyperparameters
# 'C': Inverse of regularization strength
# 'penalty': Regularization norm
param_dist = {
    'C': uniform(loc=0.1, scale=10), # Search C between 0.1 and 10.1
    'penalty': ['l1', 'l2']
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=50,                  # Number of parameter settings that are sampled (tune this for time vs. performance)
    cv=5,                       # Use 5-fold cross-validation
    scoring='f1_weighted',      # Optimize for weighted F1-score (Improvement 3)
    verbose=1,
    n_jobs=-1,                  # Use all available cores
    random_state=42
)

# Fit the random search to the resampled training data (if you used SMOTE)
print("Starting Randomized Search Cross-Validation...")
# Note: X_resampled and y_resampled come from the SMOTE cell above.
# If not using SMOTE, fit on X_train and y_train from the train_test_split cell in section D.
random_search.fit(X_resampled, y_resampled)

# Output the best results
print("\n--- Hyperparameter Tuning Results ---")
print(f"Best Weighted F1 Score (5-fold CV): {random_search.best_score_:.4f}")
print(f"Best Hyperparameters: {random_search.best_params_}")
print("-------------------------------------")

# The final, optimized model to use for prediction and final evaluation
best_model = random_search.best_estimator_

# Use the best model for final prediction on the held-out split from train_test_split:
# y_pred = best_model.predict(X_test)
# Then evaluate best_model on y_test.

Starting Randomized Search Cross-Validation...
Fitting 5 folds for each of 50 candidates, totalling 250 fits

--- Hyperparameter Tuning Results ---
Best Weighted F1 Score (5-fold CV): 0.8820
Best Hyperparameters: {'C': np.float64(8.424426408004217), 'penalty': 'l2'}
-------------------------------------
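The log line "50 candidates, totalling 250 fits" follows directly from the search setup: each of the 50 sampled parameter settings is trained once per fold. A minimal sketch of the sampling step in plain Python, assuming the same search space as above (`uniform(loc=0.1, scale=10)` covers the interval from 0.1 to 10.1); `sample_candidates` is a hypothetical helper, and the actual scoring is left to `RandomizedSearchCV`:

```python
import random

def sample_candidates(n_iter, seed=42):
    """Draw n_iter random hyperparameter settings from the search space."""
    rng = random.Random(seed)
    return [
        {"C": rng.uniform(0.1, 10.1), "penalty": rng.choice(["l1", "l2"])}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(50)
n_folds = 5
total_fits = len(candidates) * n_folds  # one model per candidate per fold
print(candidates[0])
print("Total fits:", total_fits)
```

Unlike a grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which is what makes a continuous range for `C` practical.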




*   Implement a basic unit test class to ensure your preprocessing functions work reliably.



In [598]:
# **Improvement 9: Unit Testing for Preprocessing Functions**
# Unit tests confirm that the text cleaning step is robust against different types of noise.

import unittest
# Assuming your 'clean_text' function from Improvement 2 is defined earlier.

class TestPreprocessing(unittest.TestCase):

    def test_html_removal(self):
        # Test case: Ensure HTML tags are completely removed.
        self.assertEqual(clean_text("Review with <b>bold</b> text and <br> line break."),
                         "Review with bold text and line break")

    def test_special_char_and_spacing_removal(self):
        # Test case: Remove punctuation, non-alphanumeric chars, and fix spacing.
        self.assertEqual(clean_text("This is great!!! It cost $1,000. \n\n Extra space."),
                         "This is great It cost 1000 Extra space")

    def test_empty_and_null_input(self):
        # Test case: Handle missing or non-string inputs gracefully.
        self.assertEqual(clean_text(None), "")
        self.assertEqual(clean_text(12345), "")

# Run the tests in the Colab environment
print("--- Running Unit Tests for `clean_text` ---")
unittest.main(argv=['first-arg-is-ignored'], exit=False)

...
----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK


--- Running Unit Tests for `clean_text` ---


<unittest.main.TestProgram at 0x79cf6ad079e0>

In [599]:
import pickle
import os

# Define the exact path where Streamlit expects the file
VECTORIZER_PATH = '/content/drive/MyDrive/Colab Notebooks/tfidf_vectorizer.pkl'

# Ensure the directory exists (optional, but good practice)
os.makedirs(os.path.dirname(VECTORIZER_PATH), exist_ok=True)

# The fitted vectorizer object is 'tfidf_vectorizer' (created in the TF-IDF cell above)
print("Saving TF-IDF Vectorizer...")
try:
    with open(VECTORIZER_PATH, 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    print(f"Vectorizer successfully saved to: {VECTORIZER_PATH}")
except Exception as e:
    print(f"Error saving vectorizer: {e}")

Saving TF-IDF Vectorizer...
Vectorizer successfully saved to: /content/drive/MyDrive/Colab Notebooks/tfidf_vectorizer.pkl


## D. Model Training and Evaluation



> Algorithms:





*   **Traditional ML:** Logistic Regression, Naive Bayes, Support Vector Machines (SVM), or Random Forest are good starting points for text classification.



In [600]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Assuming you want to use the TF-IDF matrix and the categorized sentiment labels
X = tfidf_matrix
y = df['sentiment_categorized']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.7604166666666666
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.83      0.85       111
           1       0.69      0.95      0.80       133
           2       0.00      0.00      0.00        44

    accuracy                           0.76       288
   macro avg       0.52      0.59      0.55       288
weighted avg       0.66      0.76      0.70       288







*   **Extract and highlight the key metrics** (F1-score, Precision, Recall) from the classification report for a clearer discussion, especially for the minority class.



In [601]:
# **Improvement 3: Enhanced Model Evaluation Metrics Display**
# Explicitly highlight key metrics like Weighted F1-Score and the performance
# on the minority class ('Neutral').

from sklearn.metrics import classification_report

# 'y_test' holds the true labels and 'y_pred' the model predictions.
# target_names must follow the sorted label order produced by categorize_sentiment:
# 0 = Negative, 1 = Positive, 2 = Neutral.
report = classification_report(
    y_test,
    y_pred,
    target_names=['Negative', 'Positive', 'Neutral'],
    output_dict=True
)

# Overall Performance Summary
weighted_f1 = report['weighted avg']['f1-score']
accuracy = report['accuracy']
print(f"\n--- Overall Key Performance Metrics ---")
print(f"Overall Accuracy: {accuracy*100:.2f}%")
print(f"Overall Weighted F1-Score: {weighted_f1:.4f} (Primary metric for imbalanced data)")
print("---------------------------------------")

# Performance on Minority Class ('Neutral', label 2)
neutral_metrics = report['Neutral']
print(f"\nMetrics for Minority Class (Neutral):")
print(f"  Precision: {neutral_metrics['precision']:.4f}")
print(f"  Recall: {neutral_metrics['recall']:.4f}")
print(f"  F1-Score: {neutral_metrics['f1-score']:.4f}")
print(f"  Support: {neutral_metrics['support']} samples")


--- Overall Key Performance Metrics ---
Overall Accuracy: 76.04%
Overall Weighted F1-Score: 0.6995 (Primary metric for imbalanced data)
---------------------------------------

Metrics for Minority Class (Neutral):
  Precision: 0.0000
  Recall: 0.0000
  F1-Score: 0.0000
  Support: 44.0 samples
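For reference, the per-class figures in a classification report can be recomputed by hand from the true and predicted labels. A minimal sketch in plain Python, assuming the notebook's label encoding (0 = Negative, 1 = Positive, 2 = Neutral); the toy labels are made up for illustration and `per_class_metrics` is a hypothetical helper:

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 0]  # the model never predicts class 2
for label, name in [(0, "Negative"), (1, "Positive"), (2, "Neutral")]:
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A class that is never predicted gets zero precision, recall, and F1, which is the pattern class 2 shows in the Logistic Regression report above.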






*   **Deep Learning:** For more complex patterns, consider models like Recurrent Neural Networks (RNNs), LSTMs, or pre-trained models from Hugging Face Transformers (e.g., BERT/RoBERTa).





*   Implement a simple LSTM (Long Short-Term Memory) Deep Learning model for contextual text analysis.



In [602]:
# **Improvement 7: Exploring More Advanced Models - Simple LSTM**
# LSTMs are effective for sequence data like text, capturing contextual information.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import class_weight
import numpy as np

# --- 1. Deep Learning Data Preparation ---
MAX_WORDS = 10000  # Max words in vocabulary
MAX_LEN = 100      # Max sequence length (to pad/truncate reviews)

# Assuming your columns are 'cleaned_body' and 'sentiment_categorized'
X = df['cleaned_body'].astype(str)
y = df['sentiment_categorized']

# Labels are already integers (0 = Negative, 1 = Positive, 2 = Neutral); LabelEncoder keeps this order
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Tokenization and Sequence Conversion
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
padded_sequences = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')

# Split data (Stratified split maintains class proportions)
X_train_dl, X_test_dl, y_train_dl, y_test_dl = train_test_split(
    padded_sequences, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

# Convert labels to categorical for the final layer (3 classes)
y_train_cat = tf.keras.utils.to_categorical(y_train_dl, num_classes=3)
y_test_cat = tf.keras.utils.to_categorical(y_test_dl, num_classes=3)

# Calculate class weights for imbalanced data (Alternative to SMOTE for DL)
weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_dl),
    y=y_train_dl
)
class_weights = {i: weights[i] for i in range(len(weights))}


# --- 2. Build the Simple LSTM Model ---
EMBEDDING_DIM = 128

model_lstm = Sequential([
    Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=MAX_LEN),
    LSTM(64),
    Dropout(0.5), # Regularization
    Dense(3, activation='softmax')
])

model_lstm.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

model_lstm.summary()

# --- 3. Train the Model ---
NUM_EPOCHS = 10
BATCH_SIZE = 32

history = model_lstm.fit(
    X_train_dl,
    y_train_cat,
    epochs=NUM_EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_test_dl, y_test_cat),
    class_weight=class_weights, # Applying class weights
    verbose=1
)

# --- 4. Evaluate and Save (Example) ---
loss, accuracy = model_lstm.evaluate(X_test_dl, y_test_cat, verbose=0)
print(f"\nLSTM Model Test Accuracy: {accuracy*100:.2f}%")

# To use this model in Streamlit, you must save the model and the tokenizer:
# model_lstm.save('lstm_sentiment_model.h5')
# import pickle
# with open('lstm_tokenizer.pickle', 'wb') as handle:
#     pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)



Epoch 1/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 7s 115ms/step - accuracy: 0.2936 - loss: 1.1102 - val_accuracy: 0.5035 - val_loss: 1.0875
Epoch 2/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 3s 87ms/step - accuracy: 0.4063 - loss: 1.0935 - val_accuracy: 0.4965 - val_loss: 1.0881
Epoch 3/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 3s 80ms/step - accuracy: 0.4423 - loss: 1.1042 - val_accuracy: 0.4861 - val_loss: 1.0850
Epoch 4/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 3s 70ms/step - accuracy: 0.4665 - loss: 1.0650 - val_accuracy: 0.1493 - val_loss: 1.1197
Epoch 5/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 3s 77ms/step - accuracy: 0.3338 - loss: 1.0716 - val_accuracy: 0.1597 - val_loss: 1.1128
Epoch 6/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 4s 120ms/step - accuracy: 0.4203 - loss: 1.0483 - val_accuracy: 0.3750 - val_loss: 1.1048
Epoch 7/10
36/36 ━━

In [603]:
# Assuming the trained LSTM model is named 'model_lstm'
import os
import pickle

# Define the exact path where Streamlit expects the file
MODEL_PATH = '/content/drive/MyDrive/Colab Notebooks/lstm_sentiment_model.h5'
TOKENIZER_PATH = '/content/drive/MyDrive/Colab Notebooks/lstm_tokenizer.pickle'

# Ensure the directory exists (optional, but good practice)
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

print("Saving LSTM Model...")
try:
    # Use the Keras/TensorFlow save method for .h5 files
    model_lstm.save(MODEL_PATH)
    print(f"LSTM Model successfully saved to: {MODEL_PATH}")
except Exception as e:
    print(f"Error saving LSTM model: {e}")

print("Saving Tokenizer...")
try:
    with open(TOKENIZER_PATH, 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"Tokenizer successfully saved to: {TOKENIZER_PATH}")
except Exception as e:
    print(f"Error saving tokenizer: {e}")



Saving LSTM Model...
LSTM Model successfully saved to: /content/drive/MyDrive/Colab Notebooks/lstm_sentiment_model.h5
Saving Tokenizer...
Tokenizer successfully saved to: /content/drive/MyDrive/Colab Notebooks/lstm_tokenizer.pickle
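The LSTM cell above handles class imbalance with `compute_class_weight(class_weight='balanced', ...)` rather than SMOTE. The "balanced" heuristic is `n_samples / (n_classes * class_count)`, which can be checked by hand. The sketch below applies it to the full-dataset label counts reported earlier (729 / 512 / 199); the notebook computes the weights on the training split only, so its actual values will differ slightly. `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Full-dataset counts from the sentiment labeling step: 1 -> 729, 0 -> 512, 2 -> 199.
labels = [1] * 729 + [0] * 512 + [2] * 199
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

The rarest class (2, Neutral) receives the largest weight, so each of its samples contributes more to the loss during training.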


Convert the TF-IDF matrix to a format suitable for deep learning models (e.g., dense arrays).




*   Convert the sparse TF-IDF matrix and the target variable to dense NumPy arrays for deep learning models and display their shapes.





In [604]:
import numpy as np

# .toarray() returns a plain NumPy ndarray; .todense() would return the legacy np.matrix type
X_dense = tfidf_matrix.toarray()
y_dense = np.array(y)

print("Dense TF-IDF matrix shape:", X_dense.shape)
print("Target variable shape:", y_dense.shape)

Dense TF-IDF matrix shape: (1440, 5000)
Target variable shape: (1440,)
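Densifying is cheap at this scale but grows with rows times columns, so it is worth estimating before converting a larger corpus. A rough back-of-the-envelope sketch, assuming 8-byte float64 values (the default dtype of `TfidfVectorizer`); `dense_size_mb` is a hypothetical helper:

```python
def dense_size_mb(n_rows, n_cols, bytes_per_value=8):
    """Approximate memory needed for a dense array, in megabytes (10**6 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e6

print(f"{dense_size_mb(1440, 5000):.1f} MB")       # this notebook's matrix
print(f"{dense_size_mb(1_000_000, 5000):.1f} MB")  # the same vocabulary with a million reviews
```

For much larger datasets, keeping the matrix sparse or feeding the network in batches would likely be the safer option.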


Build a simple neural network model using a library like Keras or PyTorch.




*   Define and compile a simple neural network model using Keras for multi-class classification.






In [605]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Convert target variable to categorical
y_categorical = to_categorical(y_dense)

# Split the data again for the neural network
from sklearn.model_selection import train_test_split
X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(X_dense, y_categorical, test_size=0.2, random_state=42)

# Define the model
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train_nn.shape[1],)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax')) # 3 units for 3 sentiment categories

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()





*   The model has been defined and compiled. The next step is to train the model on the prepared data.





In [606]:
# Train the model
history = model.fit(X_train_nn, y_train_nn,
                    epochs=10, # You can adjust the number of epochs
                    batch_size=32, # You can adjust the batch size
                    validation_split=0.2) # Use a validation split to monitor performance

Epoch 1/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - accuracy: 0.4054 - loss: 1.0819 - val_accuracy: 0.4892 - val_loss: 1.0025
Epoch 2/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.5486 - loss: 0.9543 - val_accuracy: 0.5368 - val_loss: 0.8903
Epoch 3/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6645 - loss: 0.8383 - val_accuracy: 0.7229 - val_loss: 0.7472
Epoch 4/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.8017 - loss: 0.6433 - val_accuracy: 0.7749 - val_loss: 0.6409
Epoch 5/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.8459 - loss: 0.4534 - val_accuracy: 0.7706 - val_loss: 0.6003
Epoch 6/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.8479 - loss: 0.3644 - val_accuracy: 0.7662 - val_loss: 0.6117
Epoch 7/10
29/29 ━━━━



*   Evaluate the trained neural network model on the test set to assess its performance.





In [607]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test_nn, y_test_nn, verbose=0)

print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Get predictions
y_pred_nn = model.predict(X_test_nn)
y_pred_classes_nn = np.argmax(y_pred_nn, axis=1)
y_true_classes_nn = np.argmax(y_test_nn, axis=1)

# Print classification report
print("Classification Report (Neural Network):\n", classification_report(y_true_classes_nn, y_pred_classes_nn))

Test Loss: 0.8162
Test Accuracy: 0.7188
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Classification Report (Neural Network):
               precision    recall  f1-score   support

           0       0.79      0.80      0.79       111
           1       0.72      0.84      0.78       133
           2       0.30      0.14      0.19        44

    accuracy                           0.72       288
   macro avg       0.60      0.59      0.59       288
weighted avg       0.68      0.72      0.69       288
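The evaluation cell relies on two conversions: `to_categorical` turns an integer label into a one-hot vector for training, and `np.argmax` turns a vector of class probabilities back into a label. A minimal plain-Python sketch of that round trip, with hypothetical helper names and a made-up softmax output:

```python
def one_hot(label, num_classes=3):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def argmax(values):
    """Return the index of the largest value (the predicted class)."""
    return max(range(len(values)), key=values.__getitem__)

print(one_hot(2))               # label 2 (Neutral) as a training target
probs = [0.10, 0.70, 0.20]      # made-up softmax output for one review
print(argmax(probs))            # index of the most probable class
```

Applying `argmax` to both the predictions and the one-hot test labels recovers integer labels on each side, which is the form `classification_report` expects.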



## Summary:

### Data Analysis Key Findings
* The sparse TF-IDF matrix was successfully converted to a dense NumPy array with a shape of (1440, 5000).
* The target variable was converted to a NumPy array with a shape of (1440,).
* A Keras Sequential model was built, compiled, and trained for multi-class sentiment classification.
* The trained neural network model achieved a test accuracy of approximately 71.88%.
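* The classification report for the neural network showed uneven performance across the sentiment classes: F1-scores of 0.79 for class 0 (Negative) and 0.78 for class 1 (Positive), but only 0.19 for class 2 (Neutral; precision 0.30, recall 0.14).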

### Insights or Next Steps
* The neural network model shows signs of overfitting: by epoch 6, training accuracy had reached about 0.85 while validation accuracy plateaued near 0.77 and validation loss started to rise (0.6003 to 0.6117). Consider stronger regularization, early stopping, or a larger dataset.
* Class 2 (Neutral) has significantly lower precision and recall than the other classes. It is also the smallest class (199 of 1,440 reviews), so class imbalance is a likely contributor, but further investigation is needed to confirm the cause.


# 3. Deployment (Acceptance Criterion)

To make your results accessible and interactive, you need to deploy the trained model in a web application.



*   **Pros of Streamlit:** Requires only Python knowledge, is well suited to data science/ML projects, is fast to prototype with, and handles UI components (text input, buttons, plots) easily.





*   **Process:**





1.   Save your trained model (e.g., using pickle or joblib).



In [608]:
import pickle
import os

# Define the exact path where Streamlit expects the file
VECTORIZER_PATH = '/content/drive/MyDrive/Colab Notebooks/tfidf_vectorizer.pkl'

# Ensure the directory exists (optional, but good practice)
os.makedirs(os.path.dirname(VECTORIZER_PATH), exist_ok=True)

# Save the TF-IDF vectorizer
print("Saving TF-IDF Vectorizer...")
try:
    with open(VECTORIZER_PATH, 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    print(f"TF-IDF vectorizer successfully saved to: {VECTORIZER_PATH}")
except Exception as e:
    print(f"Error saving vectorizer: {e}")

Saving TF-IDF Vectorizer...
TF-IDF vectorizer successfully saved to: /content/drive/MyDrive/Colab Notebooks/tfidf_vectorizer.pkl
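
Before relying on the saved file in the app, it can be worth confirming that a pickled object loads back unchanged. The sketch below shows the round trip with a plain dictionary standing in for the fitted vectorizer and a temporary directory standing in for the Drive folder; both stand-ins, and the helper names `save_pickle` and `load_pickle`, are assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Create the parent directory if needed, then pickle obj to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load and return the object pickled at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the fitted vectorizer (hypothetical content).
stand_in = {"vocabulary": {"good": 0, "bad": 1}, "max_features": 5000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifacts", "tfidf_vectorizer.pkl")
    save_pickle(stand_in, path)
    restored = load_pickle(path)
    size_bytes = os.path.getsize(path)

print("Round trip OK:", restored == stand_in)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.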


In [609]:
import os

# Define the exact path where Streamlit expects the file
MODEL_PATH = '/content/drive/MyDrive/Colab Notebooks/sentiment_model.h5'

# Ensure the directory exists (optional, but good practice)
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

# Save the Keras model
print("Saving Neural Network Model...")
try:
    model.save(MODEL_PATH)
    print(f"Neural network model successfully saved to: {MODEL_PATH}")
except Exception as e:
    print(f"Error saving neural network model: {e}")



Saving Neural Network Model...
Neural network model successfully saved to: /content/drive/MyDrive/Colab Notebooks/sentiment_model.h5





2.   Create a Streamlit script (`app.py`) that loads the model and vectorizer, accepts user input (a new review), preprocesses the text, runs the prediction, and displays the sentiment result.



In [610]:
%%writefile app.py
# **Improvement 8: Streamlit UI Enhancement**
import os  # Path checks for the saved artifacts
import pickle  # Load the pickled TF-IDF vectorizer
import re  # Regex-based text cleaning

import altair as alt  # Probability bar chart
import pandas as pd
import streamlit as st
from tensorflow.keras.models import load_model

# --- Define the correct paths to the saved files in Google Drive ---
VECTORIZER_PATH = '/content/drive/MyDrive/Colab Notebooks/tfidf_vectorizer.pkl'
MODEL_PATH = '/content/drive/MyDrive/Colab Notebooks/sentiment_model.h5'
# Removed TOKENIZER_PATH as we are focusing on the TF-IDF + Keras model

# --- Load the saved model and vectorizer ---
vectorizer = None
model = None
# Removed tokenizer loading

st.write("Attempting to load files from Google Drive...")

if not os.path.exists('/content/drive/MyDrive/Colab Notebooks/'):
    st.error("Error: Google Drive not mounted or the 'Colab Notebooks' folder does not exist.")
else:
    try:
        with open(VECTORIZER_PATH, 'rb') as f:
            vectorizer = pickle.load(f)
        st.success(f"TF-IDF vectorizer loaded successfully from: {VECTORIZER_PATH}")
    except FileNotFoundError:
        st.error(f"Error: TF-IDF vectorizer not found at {VECTORIZER_PATH}. Please ensure the file exists.")
    except Exception as e:
        st.error(f"Error loading TF-IDF vectorizer from {VECTORIZER_PATH}: {e}")

    try:
        # Custom objects might be needed if you used custom layers/functions
        model = load_model(MODEL_PATH)
        st.success(f"Neural network model loaded successfully from: {MODEL_PATH}")
    except FileNotFoundError:
        st.error(f"Error: Neural network model not found at {MODEL_PATH}. Please ensure the file exists.")
    except Exception as e:
        st.error(f"Error loading Keras model from {MODEL_PATH}: {e}")

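# Define the label map. The keys must match the integer encoding of the
# target used during training; verify the order before deploying.
LABEL_MAP = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}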

# Define text cleaning function (should match the one used during training)
def clean_text(text):
    if pd.isnull(text) or not isinstance(text, str):
        return ""
    try:
        text = re.sub(r'<.*?>', '', text)
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    except Exception as e:
        return ""

# Define padding length (not strictly needed for TF-IDF, but keep if you switch models)
# MAX_LEN = 100 # Or whatever MAX_LEN you used for padding

# --- Enhanced UI/UX and Prediction Logic ---

st.title("Sentiment Analysis Project Demo 📊")
st.markdown("Enter a customer review (text) below to instantly classify its sentiment (Negative, Neutral, or Positive).")
st.markdown("---")

user_input = st.text_area("Enter review here:", "")

# Proceed only if there is input and the required models are loaded
if user_input and vectorizer is not None and model is not None:

    # 1. Prediction and Error Handling (Improvement 2: Streamlit Error Handling)
    try:
        # Preprocess the input text
        cleaned_input = clean_text(user_input)

        # Use TF-IDF for feature extraction.
        # toarray() returns a plain ndarray; todense() returns the legacy np.matrix type.
        input_vector = vectorizer.transform([cleaned_input]).toarray()

        # Get prediction probabilities from the Keras model
        prediction_proba = model.predict(input_vector)[0]


        if prediction_proba is not None:
            # 2. Determine the predicted class
            predicted_class_index = int(prediction_proba.argmax())  # Plain int for the dict lookup
            predicted_sentiment = LABEL_MAP[predicted_class_index]

            # 3. Display the primary prediction with clear formatting
            st.subheader("Analysis Result")
            if predicted_sentiment == 'Positive':
                st.success(f"**Predicted Sentiment:** {predicted_sentiment} 🎉 (Confidence: {prediction_proba.max():.2%})")
            elif predicted_sentiment == 'Negative':
                st.error(f"**Predicted Sentiment:** {predicted_sentiment} 😔 (Confidence: {prediction_proba.max():.2%})")
            else:
                st.warning(f"**Predicted Sentiment:** {predicted_sentiment} 🤔 (Confidence: {prediction_proba.max():.2%})")

            # 4. Visualize Prediction Probabilities
            st.subheader("Prediction Probability Distribution")

            proba_df = pd.DataFrame({
                'Sentiment': list(LABEL_MAP.values()),
                'Probability': prediction_proba
            }).sort_values(by='Probability', ascending=False)

            # Create a visually engaging bar chart
            chart = alt.Chart(proba_df).mark_bar().encode(
                x=alt.X('Probability', axis=alt.Axis(format='.0%')),
                y=alt.Y('Sentiment', sort='-x'),
                color=alt.condition(
                    alt.datum.Sentiment == predicted_sentiment,
                    alt.value('#28a745'),  # Green for predicted class
                    alt.value('steelblue')
                ),
                tooltip=['Sentiment', alt.Tooltip('Probability', format='.2%')]
            ).properties(
                title='Model Confidence Across Classes'
            )
            st.altair_chart(chart, use_container_width=True)

    except Exception as e:
        st.error(f"An unexpected error occurred during prediction: {e}")

Overwriting app.py
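
Because Streamlit reruns the whole script on each interaction, it helps to check the non-UI pieces in isolation first. The sketch below copies the cleaning regexes and `LABEL_MAP` from `app.py` and adds a hypothetical helper, `top_label`, that picks the most probable class from a list of probabilities. The sample review and the probability values are made up for illustration and are not model output.

```python
import re

# Copied from app.py; must match the label encoding used in training.
LABEL_MAP = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}


def clean_text(text):
    """Strip HTML tags and punctuation, then collapse whitespace."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


def top_label(probabilities):
    """Return (label, probability) for the highest-probability class."""
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABEL_MAP[index], probabilities[index]


cleaned = clean_text("<b>Great</b>   phone!!! Battery lasts 2 days.")
label, confidence = top_label([0.10, 0.25, 0.65])  # Made-up probabilities

print(cleaned)
print(f"{label} ({confidence:.2%})")
```

Keeping these helpers free of Streamlit calls means they can be exercised in a notebook cell or a unit test without starting the app.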


In [611]:
%%writefile requirements.txt
streamlit
tensorflow
scikit-learn
pandas
altair
nltk

Overwriting requirements.txt
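
The requirements file lists packages without versions, so a later install may pull releases that differ from the ones the model and vectorizer were saved with. A minimal sketch for producing pinned lines from the current environment follows, using the standard-library `importlib.metadata`. The helper name `pinned_requirements` is hypothetical, and packages that are not installed are left unpinned.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' for installed packages, bare 'name' otherwise."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(name)  # Not installed here: leave unpinned
    return lines


packages = ["streamlit", "tensorflow", "scikit-learn", "pandas", "nltk"]
lines = pinned_requirements(packages)
print("\n".join(lines))
```

The printed lines can be pasted into `requirements.txt` once the environment is known to work.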


In [612]:
!pip install streamlit





# Project Management

In [613]:
# **Improvement: Version Control (Git)**

# Key Steps:
# 1.  **Initialize:** `git init`
# 2.  **Tracking:** `git add .`
# 3.  **Committing:** `git commit -m "Initial sentiment analysis project setup"`
# 4.  **Ongoing:** Track changes, create branches for new features, and push to a shared remote to manage collaboration effectively.

# Example Git commands (can be run in separate cells with '!' prefix)
# !git init
# !git add .
# !git commit -m "Initial sentiment analysis project setup"