# Sentiment Analysis on Movie Reviews

## Table of Contents
1. [Load Data](#Load-Data-And-Analysis)
2. [Preprocessing](#preprocessing)
3. [Model Selection](#model-selection)
4. [Visualization](#visualization)
5. [Summary & Key Insights](#Notebook-summary-&-key-insights)

# Load Data And Analysis 

In [6]:
import kagglehub

path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print("Path to dataset files:", path)

Path to dataset files: C:\Users\AIJimmy\.cache\kagglehub\datasets\lakshmi25npathi\imdb-dataset-of-50k-movie-reviews\versions\1


In [7]:
import os
import pandas as pd

# Build the CSV path in a platform-independent way
df = pd.read_csv(os.path.join(path, "IMDB Dataset.csv"))
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [8]:
print("value counts:", df['sentiment'].value_counts())
print("----------------------------------------")
print("Missing values:", df.isnull().sum())
print("----------------------------------------")
print("Dataset shape:", df.shape)
print("----------------------------------------")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

value counts: sentiment
positive    25000
negative    25000
Name: count, dtype: int64
----------------------------------------
Missing values: review       0
sentiment    0
dtype: int64
----------------------------------------
Dataset shape: (50000, 2)
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


# Preprocessing

In [9]:
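# Lowercased copy used only for the character inventory below; df['review'] is left unchanged
reviews = [review.lower() for review in df['review'].values]
# Encode labels as integers; the dtype guard keeps the cell safe to re-run
if df['sentiment'].dtype == 'object':
    df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0}).values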
all_text = ' '.join(reviews)
unique_chars = set(all_text)
print("Unique characters:", sorted(unique_chars))
print("Number of unique characters:", len(unique_chars))
df.head()

Unique characters: ['\x08', '\t', '\x10', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x80', '\x84', '\x85', '\x8d', '\x8e', '\x91', '\x95', '\x96', '\x97', '\x9a', '\x9e', '\xa0', '¡', '¢', '£', '¤', '¦', '§', '¨', '©', 'ª', '«', '\xad', '®', '°', '³', '´', '·', 'º', '»', '½', '¾', '¿', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ğ', 'ı', 'ō', 'ż', 'א', 'ג', 'ו', 'י', 'כ', 'ל', 'מ', 'ן', 'ר', '–', '‘', '’', '“', '”', '…', '″', '₤', '▼', '★', '、', '\uf0b7', '，']
Number of unique characters: 162


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [10]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\AIJimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AIJimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\AIJimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\AIJimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [11]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from num2words import num2words
from nltk.corpus import wordnet
import contractions
from nltk.corpus import stopwords
import re 

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [12]:
# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [13]:
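# Clean, tokenize, and lemmatize a single review
def preprocess_text(text):
    text = contractions.fix(text)               # expand contractions, e.g. "don't" -> "do not"
    text = text.replace('.', ' . ')             # pad periods with spaces
    text = re.sub(r'<[^>]*>', '', text)         # strip HTML tags such as <br />
    text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)  # replace everything except ASCII letters and digits with a space
    # NOTE: iterating over `text` yields characters, so digits are spelled out one at a time ("10" -> "onezero")
    text = "".join(num2words(int(ch)) if ch.isdigit() else ch for ch in text)
    word_tokens = word_tokenize(text)
    # Stopword matching is case-sensitive, so capitalized forms such as "The" are kept
    text = [w for w in word_tokens if w not in stop_words]
    tagged = nltk.tag.pos_tag(text)
    lemmatized_words = []

    for word, tag in tagged:
        # Fall back to NOUN when the Treebank tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        lemmatized_words.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
    return ' '.join(lemmatized_words)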
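
The digit conversion in `preprocess_text` walks over the characters of the string, so a multi-digit number is spelled digit by digit. A minimal sketch of the difference between per-character and per-token conversion follows. `spell_number` is a hypothetical stand-in for `num2words` so the example runs without extra packages, and it only covers 0 to 10.

```python
# Hypothetical stand-in for num2words, limited to 0..10 for illustration
_SMALL_NUMBERS = {
    0: "zero", 1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
}


def spell_number(token):
    """Spell out a digit string; fall back to the digits when out of range."""
    return _SMALL_NUMBERS.get(int(token), token)


def digits_per_character(text):
    # Iterating over a string yields single characters, as in the notebook's loop
    return "".join(spell_number(ch) if ch.isdigit() else ch for ch in text)


def digits_per_token(text):
    # Splitting on whitespace first converts each number as a whole
    return " ".join(spell_number(tok) if tok.isdigit() else tok for tok in text.split())


sample = "rated 7 out of 10"
print(digits_per_character(sample))
print(digits_per_token(sample))
```

If the per-token variant were used with the real `num2words`, its output for larger numbers contains hyphens and extra words (for example `twenty-one`), so it would probably need to run before the non-alphanumeric cleanup or be followed by another cleaning pass.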

In [14]:
df['cleaned_review'] = df['review'].apply(preprocess_text)
cleaned_reviews = df['cleaned_review'].values
all_text = ' '.join(cleaned_reviews)
unique_chars = set(all_text)
print("Unique characters:", sorted(unique_chars))
print("Number of unique characters:", len(unique_chars))
print("Example:", cleaned_reviews[0])

Unique characters: [' ', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique characters: 53
Example: One reviewer mention watch one Oz episode hook They right exactly happen The first thing strike Oz brutality unflinching scene violence set right word GO Trust show faint hearted timid This show pull punch regard drug sex violence Its hardcore classic use word It call OZ nickname give Oswald Maximum Security State Penitentary It focus mainly Emerald City experimental section prison cell glass front face inwards privacy high agenda Them City home many Aryans Muslims gangsta Latinos Christians Italians Irish scuffle death stare dodgy dealing shady agreement never far away I would say main appeal show due fact go show would dare Forget pretty picture paint mainstream a

# Model selection

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# NOTE: fitting on the full dataset before splitting lets the IDF weights see the test reviews
vectors = vectorizer.fit_transform(df['cleaned_review'])
X_train, X_test, y_train, y_test = train_test_split(
    vectors, df['sentiment'], test_size=0.2, random_state=42
)

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

models_and_params = [
    {
        'name' : 'Logistic Regression',
        'model' : LogisticRegression(),
        'params' : {
            'classifier__C' : [0.01, 0.1, 1, 10],
            'classifier__max_iter' : [100, 200, 300],
            'classifier__solver' : ['liblinear', 'saga'],
        }
    },
    {
        'name' : 'Naive Bayes',
        'model' : MultinomialNB(),
        'params' : {
            'classifier__alpha' : [0.01, 0.1, 1, 10],
        }
    }
]

for model_info in models_and_params:
    model = Pipeline(
        steps=[('classifier', model_info['model'])]
    )
    grid_search = GridSearchCV(model, model_info['params'], cv=5)
    grid_search.fit(X_train, y_train)
    print(f"Best parameters for {model_info['name']}: {grid_search.best_params_}")
    model_info['best_model'] = grid_search.best_estimator_

for model_info in models_and_params:
    print(f"Evaluating {model_info['name']}")
    y_pred = model_info['best_model'].predict(X_test)
    print(classification_report(y_test, y_pred))


Best parameters for Logistic Regression: {'classifier__C': 10, 'classifier__max_iter': 200, 'classifier__solver': 'saga'}
Best parameters for Naive Bayes: {'classifier__alpha': 1}
Evaluating Logistic Regression
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4961
           1       0.89      0.91      0.90      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

Evaluating Naive Bayes
              precision    recall  f1-score   support

           0       0.85      0.88      0.86      4961
           1       0.88      0.84      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



In [17]:
clf = LogisticRegression(C=10, max_iter=200, solver='saga')
clf.fit(X_train, y_train)

In [18]:
import pickle
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)
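
The saved files can be reloaded later for inference. The sketch below shows one way to do that. The helper names `load_artifacts` and `predict_sentiment` are hypothetical, and a toy vectorizer and model stand in for the notebook's fitted `vectorizer` and `clf` so the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def load_artifacts(vectorizer_path, model_path):
    """Load the pickled vectorizer and classifier (only unpickle files you trust)."""
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return vectorizer, model


def predict_sentiment(cleaned_texts, vectorizer, model):
    """Return 1 (positive) or 0 (negative) for each already-cleaned review."""
    return model.predict(vectorizer.transform(cleaned_texts))


# Toy stand-ins for the notebook's fitted `vectorizer` and `clf`
toy_texts = ["great movie", "great film", "bad movie", "bad film"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = Path(tmp) / "vectorizer.pkl"
    model_path = Path(tmp) / "model.pkl"
    vec_path.write_bytes(pickle.dumps(toy_vectorizer))
    model_path.write_bytes(pickle.dumps(toy_model))
    vectorizer, model = load_artifacts(vec_path, model_path)
    predictions = predict_sentiment(["great film", "bad movie"], vectorizer, model)

print(predictions.tolist())
```

New reviews should go through the same `preprocess_text` step before `predict_sentiment`, since the vectorizer's vocabulary was built from cleaned text.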

# Visualization

In [19]:
Num_of_features = 10
coefs = clf.coef_
coefs = coefs.ravel()
feature_names = vectorizer.get_feature_names_out()
pairs = list(zip(feature_names, coefs))
top_pos = sorted(pairs, key=lambda x: x[1], reverse=True)[:Num_of_features]
top_neg = sorted(pairs, key=lambda x: x[1])[:Num_of_features]
print("Top positive features:", top_pos)
print("Top negative features:", top_neg)

Top positive features: [('excellent', np.float64(11.058472595532075)), ('great', np.float64(10.868374841586272)), ('seven', np.float64(10.66307307157192)), ('perfect', np.float64(9.262220967193551)), ('wonderful', np.float64(8.852451030952583)), ('brilliant', np.float64(8.802715179701718)), ('hilarious', np.float64(8.540779200401234)), ('refresh', np.float64(8.521265395548962)), ('highly', np.float64(8.073636292643107)), ('best', np.float64(7.999910007613406))]
Top negative features: [('waste', np.float64(-14.105350395872716)), ('bad', np.float64(-13.507386489383984)), ('awful', np.float64(-13.350315684036415)), ('worst', np.float64(-11.085566023715314)), ('poor', np.float64(-11.0072290120967)), ('disappointment', np.float64(-10.45735763941926)), ('bore', np.float64(-10.165097050966425)), ('terrible', np.float64(-9.625499990815955)), ('horrible', np.float64(-9.416235057575374)), ('boring', np.float64(-9.312301787769174))]


In [29]:
import plotly.express as px

features = [*top_pos, *top_neg]

coef_df = pd.DataFrame(features, columns=['feature', 'value'])
coef_df = coef_df.sort_values('value')

fig = px.bar(
    coef_df,
    x='value',
    y='feature',
    orientation='h',
    color='value',
    color_continuous_scale='RdYlGn',
    title='Feature Importance Bar Chart',
    range_color=[coef_df['value'].min(), coef_df['value'].max()],
    labels={'value': 'Coefficient'},
    height=900
)
fig.update_layout(yaxis_title='Feature', xaxis_title='Coefficient', plot_bgcolor='rgba(240,240,245,1)')
fig.show()

In [26]:

import pandas as pd
import plotly.express as px

# Collect the words of the plotted features
feature_names = [feature[0] for feature in features]
feature_counts = {}
for review in df['cleaned_review']:
    for feature in feature_names:
        # NOTE: str.count matches substrings, so 'bad' is also counted inside 'badly'
        feature_counts[feature] = feature_counts.get(feature, 0) + review.count(feature)
sorted_dict = dict(sorted(feature_counts.items(), key=lambda item: item[1], reverse=True))
print("Feature counts:", sorted_dict)


Feature counts: {'bad': 25288, 'great': 19126, 'best': 11752, 'seven': 9012, 'poor': 5146, 'perfect': 4701, 'waste': 4342, 'excellent': 3933, 'bore': 3686, 'wonderful': 3620, 'awful': 3504, 'terrible': 3109, 'brilliant': 2710, 'horrible': 2324, 'hilarious': 2202, 'highly': 1932, 'boring': 1557, 'worst': 1035, 'disappointment': 823, 'refresh': 467}
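
Because `str.count` matches substrings, the counts above include hits inside longer words. A minimal sketch of whole-token counting with `collections.Counter` follows. The function name `count_tokens` and the toy reviews are illustrative, and tokens are lowercased on the assumption that the counts should line up with `TfidfVectorizer`'s default case folding.

```python
from collections import Counter


def count_tokens(reviews, vocabulary):
    """Count whole-token occurrences of each vocabulary word across reviews."""
    wanted = set(vocabulary)
    counts = Counter()
    for review in reviews:
        # Whitespace tokens, lowercased to mirror the vectorizer's case folding
        counts.update(tok for tok in review.lower().split() if tok in wanted)
    return {word: counts[word] for word in vocabulary}


toy_reviews = ["bad badly bad", "bore bored boring"]
toy_vocabulary = ["bad", "bore", "boring"]

substring_counts = {w: sum(r.count(w) for r in toy_reviews) for w in toy_vocabulary}
token_counts = count_tokens(toy_reviews, toy_vocabulary)
print("substring:", substring_counts)
print("token:    ", token_counts)
```

On the toy input the substring approach reports 3 for `bad` and 2 for `bore`, while token counting reports 2 and 1.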


In [35]:
count_df = pd.DataFrame(list(feature_counts.items()), columns=['feature', 'count'])
count_df = count_df.sort_values('count', ascending=True)

fig = px.bar(
    count_df,
    x='count',
    y='feature',
    orientation='h',
    color_discrete_sequence=px.colors.qualitative.Plotly,
    title='Feature Counts Bar Chart',
    height=600
)
fig.update_layout(
    yaxis_title='Feature',
    xaxis_title='Count',
)
fig.show()


# Notebook summary & key insights

## Task
Sentiment analysis on IMDB 50k movie reviews (binary: positive / negative).

## Data & Checks
- **Source:** lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.
- Performed value counts, checked for missing values and basic info.
- Mapped sentiment to 0/1.

## Preprocessing
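- Contraction expansion, HTML tag removal, and removal of every character other than ASCII letters and digits.
- Digit-to-word conversion (each digit is spelled out separately), NLTK tokenization, and stopword removal.
- POS tagging and WordNet lemmatization.
- Lowercasing is applied only to the copy used for the character inventory. `preprocess_text` keeps the original case, so capitalized stopwords such as "The" survive; case is folded later by `TfidfVectorizer`'s default `lowercase=True`.
- **Result:** The character inventory shrinks from 162 to 53 unique characters, at the risk of removing sentiment-bearing tokens (for example "not", which is an NLTK stopword) through over-cleaning.
- **Insight:** Always verify there are no empty reviews after aggressive cleaning (drop or fill empties before vectorizing).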
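
The empty-review check mentioned above could look like the following sketch. The helper name `drop_empty_reviews` is hypothetical, and the toy frame stands in for the notebook's `df` with its `cleaned_review` column.

```python
import pandas as pd


def drop_empty_reviews(frame, column="cleaned_review"):
    """Return the frame without empty or whitespace-only rows, plus the number dropped."""
    is_empty = frame[column].fillna("").str.strip().eq("")
    return frame.loc[~is_empty].reset_index(drop=True), int(is_empty.sum())


# Toy stand-in for df after preprocessing
toy = pd.DataFrame({
    "cleaned_review": ["great film", "", "   ", None],
    "sentiment": [1, 0, 1, 0],
})
kept, n_dropped = drop_empty_reviews(toy)
print(f"dropped {n_dropped} empty reviews, kept {len(kept)}")
```

Running the check before `train_test_split` keeps the labels aligned with the remaining rows.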

## Vectorization
- Applied `TfidfVectorizer` on cleaned reviews.
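- **Recommendation:** Include the vectorizer in the `Pipeline` so that its hyperparameters (`ngram_range`, `min_df`, `max_df`, `max_features`) can be tuned and so that IDF weights are learned from the training folds only. In this notebook the vectorizer was fitted on all reviews before the split.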
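
A minimal sketch of that recommendation, assuming raw cleaned strings are passed to the search instead of a precomputed matrix. The step name `tfidf`, the grid values, and the toy data are illustrative choices, not settings from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for df['cleaned_review'] and df['sentiment']
toy_texts = [
    "great movie", "great movie wonderful", "great movie excellent", "great movie loved it",
    "bad movie", "bad movie awful", "bad movie terrible", "bad movie hated it",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),                       # refitted on each training fold
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "classifier__C": [0.1, 1, 10],
}

search = GridSearchCV(text_pipeline, param_grid, cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

With this layout, `train_test_split` would be applied to the text column itself, and the fitted `search.best_estimator_` can be pickled as a single object instead of a separate vectorizer and model.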

## Modeling & Selection
- Train/test split: 80/20 (`random_state=42`).
- Models: Logistic Regression (grid search over `C`, `max_iter`, `solver`) and MultinomialNB (`alpha`).
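- Used 5-fold `GridSearchCV` with a `Pipeline` that wraps only the classifier.
- Best settings: Logistic Regression with `C=10`, `max_iter=200`, `solver='saga'`; MultinomialNB with `alpha=1`.
- **Insight:** `best_estimator_` is a `Pipeline`, which has no `coef_` attribute of its own. Read coefficients from `pipeline.named_steps['classifier']` to avoid an `AttributeError`.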

## Evaluation
- Used `classification_report` to show precision, recall, and F1 per class on the 10,000-review test set: Logistic Regression reached 0.89 accuracy and Naive Bayes 0.86.
- **Recommendation:** Add confusion matrix, ROC/AUC, and cross-validation score summaries for more robust model comparison.
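
A small sketch of the suggested extra metrics, using made-up labels and scores. In the notebook, `y_true` would be `y_test` and `y_score` would presumably come from `clf.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up labels and positive-class probabilities for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

# Threshold the scores at 0.5 to obtain hard predictions
y_pred = [int(score >= 0.5) for score in y_score]

# Rows are true classes, columns are predicted classes: [[tn, fp], [fn, tp]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# ROC AUC uses the continuous scores, not the thresholded predictions
auc = roc_auc_score(y_true, y_score)

print(cm)
print(f"ROC AUC: {auc:.3f}")
```

The confusion matrix shows which class the errors fall on, while ROC AUC summarizes ranking quality independently of the 0.5 threshold.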

## Feature Inspection & Visualization
- For Logistic Regression: Map `vectorizer.get_feature_names_out()` to `coef_` to get feature words and coefficients.
- For MultinomialNB: Rank class-indicative words by the difference between the two class rows of `clf.feature_log_prob_` (positive row minus negative row).
- Visualized top features as horizontal bar charts (Plotly).
- **Insight:** Interpret coefficients carefully: both magnitude and sign matter. `seven` ranking third among the positive features likely reflects ratings such as 7/10 being spelled out digit by digit. Consider applying frequency filtering (`min_df`) to exclude rare or noisy tokens.
