# **Product Review Sentiment Analysis**
The goal of this project is to analyse customer reviews to determine their sentiment (positive, negative or neutral) based on the text content of the reviews and associated metadata. This will help in understanding customer feedback, identifying product strengths and weaknesses and improving the overall customer experience by offering actionable insights.


---

### **Key Steps in the Project**

1. **Data Collection**: Utilise the provided dataset containing customer reviews and associated metadata, such as star ratings and product categories.
2. **Data Preprocessing**: Handle missing data, clean text (e.g., remove noise and irrelevant symbols) and standardise review content.
3. **Sentiment Labelling**: Use star ratings to label reviews as very negative (1 star), negative (2 stars), neutral (3 stars), positive (4 stars) or very positive (5 stars).
4. **Exploratory Data Analysis (EDA)**: Visualise the distribution of sentiments across different product categories and review ratings.
5. **Feature Extraction and Text Processing**: Convert review text into numerical features using tokenization, stopword removal and TF-IDF vectorization.
6. **Model Selection and Model Training**: Train machine learning models (logistic regression, multinomial naive Bayes and a linear SVM) to predict review sentiment from the extracted features.
7. **Model Evaluation**: Assess model performance using accuracy, precision, recall and F1 score to ensure reliable sentiment predictions.
8. **Performance Metrics**: Analyse metrics like confusion matrix and detailed performance scores to understand the strengths and weaknesses of the model.
9. **Building a Sentiment Analysis Dashboard with Streamlit and Real-Time Sentiment Prediction**: Create an interactive dashboard to visualise sentiment distribution and allow users to input reviews for real-time sentiment predictions.
10. **Business Insights and Recommendations**: Provide actionable insights and recommendations to businesses to improve product offerings based on patterns identified in customer feedback.

### **Step 1: Importing Required Libraries and Data Collection** ###

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
import joblib
import warnings
from collections import Counter
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

In [None]:
# Load the dataset with the correct delimiter
file_path = "Amazon Product Review.txt"
data = pd.read_csv(file_path, delimiter=",", encoding="utf-8")
# Rename columns if necessary
data.columns = data.columns.str.strip()  # Remove any leading/trailing spaces

# Print dataset info
print("✅ Dataset loaded successfully!")
print(f"Shape: {data.shape}")
print("\n🧱 Columns:")
print(data.columns.tolist())
print("\n📊 Missing values:")
print(data.isnull().sum())

# Optional: Preview the data
print("\n🔍 Sample data:")
print(data.head(3))
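
The later cells read the columns `star_rating`, `review_headline`, `review_body` and `product_category`, so it can help to confirm they exist right after loading. The following is a minimal sketch, assuming a hypothetical helper `missing_columns` and a toy frame that stands in for the real dataset; neither is part of the original notebook.

```python
import pandas as pd

# Columns that the later steps of this notebook read from
REQUIRED_COLUMNS = ["star_rating", "review_headline", "review_body", "product_category"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({"star_rating": [5], "review_body": ["Great"]})
print(missing_columns(toy))
```

On the real data, an empty list means every later cell can find the columns it needs.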

### **Step 2: Data Preprocessing** ###

In [None]:
# Drop rows with missing values in 'review_headline' or 'review_body'
data.dropna(subset=['review_headline', 'review_body'], inplace=True)

# Build the stopword set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

# Function to clean text data
def clean_text(text):
    text = str(text)                       # Guard against non-string values
    text = re.sub(r'<.*?>', ' ', text)     # Replace HTML tags with a space so words are not glued together
    text = re.sub(r'[^\w\s]', '', text)    # Remove punctuation
    text = re.sub(r'\d+', '', text)        # Remove numbers
    text = text.lower()                    # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in STOP_WORDS])
    return text

# Apply the clean_text function to 'review_headline' and 'review_body'
data['cleaned_review_headline'] = data['review_headline'].apply(clean_text)
data['cleaned_review_body'] = data['review_body'].apply(clean_text)

# Clean extra spaces
data['cleaned_review_headline'] = data['cleaned_review_headline'].str.replace(r'\s+', ' ', regex=True).str.strip()
data['cleaned_review_body'] = data['cleaned_review_body'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Verify the new columns
print("Cleaned Review Headline and Body:")
print(data[['cleaned_review_headline', 'cleaned_review_body']].head())
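
To see what the cleaning steps do to a single review, the sketch below traces one sample string through them. It is illustrative only: `clean_text_demo` and `DEMO_STOPWORDS` are hypothetical names, the tiny stopword set stands in for NLTK's English list, and tags are replaced with a space (an assumption about the desired behaviour) so that words on either side of a `<br />` are not merged.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead
DEMO_STOPWORDS = {"the", "is", "a", "and", "it", "this"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    text = re.sub(r"<.*?>", " ", text)     # HTML tags -> space
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    text = re.sub(r"\d+", "", text)        # drop digits
    text = text.lower()
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "Works great.<br />Bought 2 and it is sturdy."
print(clean_text_demo(sample))  # works great bought sturdy
```

If the tag were replaced with an empty string instead, the sample would yield the merged token `greatbought`, which TF-IDF would then treat as a separate rare word.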

### **Step 3: Sentiment Labelling** ###

In [None]:
# Step 3: Sentiment Labelling

# Define a mapping from star ratings to sentiment labels
sentiment_map = {
    1: 'very negative',
    2: 'negative',
    3: 'neutral',
    4: 'positive',
    5: 'very positive'
}

# Apply mapping
data['sentiment_label'] = data['star_rating'].map(sentiment_map)

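# Handle unexpected values (assign the result; in-place fillna on a column selection is unreliable in recent pandas)
data['sentiment_label'] = data['sentiment_label'].fillna('unknown')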

# Verify the result
print("\n✅ Sentiment labels based on star ratings:")
print(data[['star_rating', 'sentiment_label']].head())
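
When five classes are too fine-grained, star ratings can instead be collapsed into three groups: negative (1-2 stars), neutral (3 stars) and positive (4-5 stars). The sketch below shows one way to do that; `star_to_sentiment3` is a hypothetical helper, not part of the original notebook.

```python
def star_to_sentiment3(star_rating):
    """Collapse a 1-5 star rating into negative / neutral / positive."""
    if star_rating in (1, 2):
        return "negative"
    if star_rating == 3:
        return "neutral"
    if star_rating in (4, 5):
        return "positive"
    return "unknown"  # anything outside 1-5, including missing values

print([star_to_sentiment3(s) for s in [1, 2, 3, 4, 5, 0]])
```

With pandas this could be applied as `data['star_rating'].map(star_to_sentiment3)` and stored in a new column, for example one named `sentiment_3class` (a hypothetical name).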


### **Step 4: Exploratory Data Analysis (EDA)** ###

In [None]:
# Count the sentiment labels for each star rating
sentiment_counts_by_rating = (
    data.groupby('star_rating')['sentiment_label']
    .value_counts()
    .unstack(fill_value=0)
    .reindex(columns=['very negative', 'negative', 'neutral', 'positive', 'very positive'])
)

# Display the result
print("\n✅ Sentiment Counts by Star Rating:")
print(sentiment_counts_by_rating)


In [None]:
# Define the correct order for sentiment categories
sentiment_order = ['very negative', 'negative', 'neutral', 'positive', 'very positive']

# Reindex the columns of sentiment_counts_by_rating to match the desired order
sentiment_counts_by_rating = sentiment_counts_by_rating.reindex(columns=sentiment_order)

# Define a custom color palette for 5 categories
custom_colors = ['#B22222',  # very negative (dark red)
                 '#FF6347',  # negative (tomato red)
                 '#FFD700',  # neutral (gold)
                 '#32CD32',  # positive (lime green)
                 '#006400']  # very positive (dark green)

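# Plot the distribution of sentiment across star ratings
# (dark style so the white and light-gray labels below stay readable)
plt.style.use('dark_background')
# DataFrame.plot creates its own figure, so the size is passed here
sentiment_counts_by_rating.plot(
    kind='bar',
    stacked=True,
    color=custom_colors,
    edgecolor='black',
    figsize=(16, 7)
)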
plt.title('Sentiment Distribution Across Star Ratings', fontsize=16, color='white')
plt.xlabel('Star Rating', fontsize=14, color='lightgray')
plt.ylabel('Number of Reviews', fontsize=14, color='lightgray')
plt.xticks(rotation=0, color='lightgray')
plt.yticks(color='lightgray')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(
    title='Sentiment',
    fontsize=12,
    bbox_to_anchor=(1, 1),
    loc='upper left',
    frameon=False
)
plt.tight_layout()
plt.show()

In [None]:

# Define the correct order for sentiment categories
sentiment_order = ['very negative', 'negative', 'neutral', 'positive', 'very positive']

# Group and reindex to ensure correct order
sentiment_counts_by_category = data.groupby('product_category')['sentiment_label'].value_counts().unstack().fillna(0)
sentiment_counts_by_category = sentiment_counts_by_category.reindex(columns=sentiment_order)

# Use the same custom colors as before
custom_colors = ['#B22222',  # very negative (dark red)
                 '#FF6347',  # negative (tomato red)
                 '#FFD700',  # neutral (gold)
                 '#32CD32',  # positive (lime green)
                 '#006400']  # very positive (dark green)

plt.style.use('dark_background')  # Set dark background
# DataFrame.plot creates its own figure, so the size is passed here
sentiment_counts_by_category.plot(
    kind='bar',
    stacked=True,
    color=custom_colors,
    edgecolor='black',
    figsize=(18, 8)
)
plt.title('Sentiment Distribution Across Product Categories', fontsize=16, color='white')
plt.xlabel('Product Category', fontsize=14, color='lightgray')
plt.ylabel('Number of Reviews', fontsize=14, color='lightgray')
plt.xticks(rotation=45, ha='right', color='lightgray')
plt.yticks(color='lightgray')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(title='Sentiment', fontsize=12, bbox_to_anchor=(1, 1), loc='upper left', frameon=False)
plt.tight_layout()
plt.show()
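
Raw counts let large categories dominate the chart, so a share-based view can make categories of different sizes easier to compare. The sketch below uses `pd.crosstab` with `normalize="index"` on a small toy frame; the category names and values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real 'product_category' and 'sentiment_label' columns
toy = pd.DataFrame({
    "product_category": ["Books", "Books", "Books", "Toys", "Toys"],
    "sentiment_label": ["positive", "positive", "negative", "negative", "positive"],
})

# normalize="index" divides each row by its total, giving per-category shares
shares = pd.crosstab(toy["product_category"], toy["sentiment_label"], normalize="index")
print(shares.round(2))
```

Each row of `shares` sums to 1, so a stacked bar chart of it shows the sentiment mix per category rather than its volume.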

In [None]:
# 3. Star Rating Distribution by Sentiment
# Count plot for sentiment distribution across star ratings
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='star_rating', hue='sentiment_label', palette='Set2')
plt.title('Sentiment Distribution Across Star Ratings', fontsize=18)
plt.xlabel('Star Rating', fontsize=14)
plt.ylabel('Number of Reviews', fontsize=14)
plt.legend(title='Sentiment', fontsize=12)
plt.xticks(rotation=45)
plt.show()

In [None]:
# 4. Perform Basic Text Analysis - Word Cloud
# Combine all cleaned reviews into a single string
all_reviews = ' '.join(data['cleaned_review_body'])

# Generate a word cloud with a unique color scheme
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='black',
    colormap='Spectral',
    contour_color='white',
    contour_width=1,
    random_state=0  # Ensures reproducibility
).generate(all_reviews)

# Display the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Hide axes
plt.title('Word Cloud of Reviews', fontsize=16, color='white')
plt.show()

In [None]:
# 5. Most Common Words
# Tokenize the cleaned review body
all_words = ' '.join(data['cleaned_review_body']).split()

# Count the frequency of each word
word_counts = Counter(all_words)

# Get the most common words (e.g., top 20)
most_common_words = word_counts.most_common(20)

# Create a DataFrame from the most common words
common_words_df = pd.DataFrame(most_common_words, columns=['Word', 'Frequency'])

# Plot the most common words with a vibrant color palette and custom styling
plt.figure(figsize=(12, 6))
sns.barplot(x='Frequency', y='Word', data=common_words_df, hue='Word', palette='magma', errorbar=None, legend=False)
plt.title('Most Common Words in Reviews', fontsize=18, fontweight='bold', color='white')
plt.xlabel('Frequency', fontsize=14, color='lightgray')
plt.ylabel('Words', fontsize=14, color='lightgray')
plt.grid(axis='x', color='gray', linestyle='--', linewidth=0.7)
plt.xticks(color='lightgray')
plt.yticks(color='lightgray')
plt.gca().set_facecolor('black')  # Set the background color to black
plt.show()
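
A single frequency list mixes words from all sentiment classes. Counting per class usually shows more clearly which words go with which sentiment. The following is a minimal sketch on invented toy pairs; on the real data the pairs would come from the `sentiment_label` and `cleaned_review_body` columns.

```python
from collections import Counter, defaultdict

# Toy (label, cleaned text) pairs standing in for the real columns
toy_reviews = [
    ("very positive", "love product works great"),
    ("very positive", "great value love"),
    ("very negative", "broke after week waste money"),
    ("very negative", "waste money returned"),
]

# One Counter per sentiment label
counts_by_label = defaultdict(Counter)
for label, text in toy_reviews:
    counts_by_label[label].update(text.split())

for label, counter in counts_by_label.items():
    print(label, counter.most_common(2))
```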


### **Step 5: Feature Extraction and Text Processing** ###

In [None]:
# 'cleaned_review_body' was already created with clean_text in Step 2,
# so it is reused here; it is only rebuilt if the column is missing
if 'cleaned_review_body' not in data.columns:
    data['cleaned_review_body'] = data['review_body'].apply(clean_text)

In [None]:
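# Map each sentiment label to an integer class code
sentiment_map = {
    'very negative': 0,
    'negative': 1,
    'neutral': 2,
    'positive': 3,
    'very positive': 4
}
data['sentiment_code'] = data['sentiment_label'].map(sentiment_map)

# Rows labelled 'unknown' in Step 3 have no code; drop them so y contains no NaN
data = data.dropna(subset=['sentiment_code']).reset_index(drop=True)
data['sentiment_code'] = data['sentiment_code'].astype(int)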

In [None]:
# Improved TF-IDF vectorizer (fit once)
vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.9,
    stop_words='english'
)
X = vectorizer.fit_transform(data['cleaned_review_body'])
y = data['sentiment_code']

# Optional: preview the TF-IDF matrix and the features starting with "love"
print("Shape of TF-IDF matrix:", X.shape)

# Densify only the first five rows; X.toarray() on the full matrix can exhaust memory
tfidf_df = pd.DataFrame(X[:5].toarray(), columns=vectorizer.get_feature_names_out())

love_words = [word for word in tfidf_df.columns if word.startswith('love')]
tfidf_love_df = tfidf_df[love_words]
print("TF-IDF features starting with 'love':\n", tfidf_love_df)
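
To make the TF-IDF weights less of a black box, the sketch below recomputes them by hand for a three-document toy corpus. It follows the formula `TfidfVectorizer` uses with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document). The corpus and the names `doc_freq` and `tfidf_row` are invented for illustration.

```python
import math
from collections import Counter

docs = ["great product", "great price", "bad product"]
tokenised = [d.split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = {w: sum(w in doc for doc in tokenised) for w in vocab}
# Smoothed IDF, as applied by TfidfVectorizer when smooth_idf=True (its default)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
for doc, tokens in zip(docs, tokenised):
    print(doc, [round(v, 3) for v in tfidf_row(tokens)])
```

In the second document, `price` appears in only one document and so receives a larger weight than `great`, which appears in two.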


### **Step 6: Model Selection and Model Training** ###

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Split the data into training and test sets with stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Apply SMOTE oversampling on training data only
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Candidate models (LogisticRegression and the metric functions are imported in Step 1)
lr_model = LogisticRegression(max_iter=1000)
nb_model = MultinomialNB()
svm_model = LinearSVC(max_iter=1000)
models = {
    'Logistic Regression': lr_model,
    'Naive Bayes': nb_model,
    'SVM': svm_model
}

# Fit each model on the balanced training data and score it on the untouched test set
results = {}
predictions = {}
for model_name, model in models.items():
    model.fit(X_train_balanced, y_train_balanced)
    y_pred = model.predict(X_test)
    predictions[model_name] = y_pred
    results[model_name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
        'f1': f1_score(y_test, y_pred, average='weighted', zero_division=0),
        'report': classification_report(y_test, y_pred, zero_division=0)
    }

# Keep the per-model predictions under the names used in later cells
y_pred_lr = predictions['Logistic Regression']
y_pred_nb = predictions['Naive Bayes']
y_pred_svm = predictions['SVM']

# Optionally, print results summary
for model_name, metrics in results.items():
    print(f"=== {model_name} ===")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Precision: {metrics['precision']:.4f}")
    print(f"Recall: {metrics['recall']:.4f}")
    print(f"F1-score: {metrics['f1']:.4f}")
    print(metrics['report'])
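
In the cells above, the vectorizer is fitted on every review before the train/test split, so the vocabulary and IDF weights are influenced by the test reviews. One common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and fit it on the training texts only. The sketch below shows the idea on invented toy data with two classes. SMOTE is left out for brevity; the `imblearn.pipeline.Pipeline` class accepts samplers as steps if oversampling is needed inside the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus: 1 = positive, 0 = negative (illustrative only)
toy_texts = ["love great", "great value", "love works", "great love value",
             "waste money", "broke waste", "money broke", "waste broke money"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the vectorizer never sees the test reviews
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # fitted on training texts only
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_train, toy_y_train)
toy_predictions = toy_pipeline.predict(toy_test)
print(list(zip(toy_test, toy_predictions)))
```

A fitted pipeline can also be saved as a single object, which keeps the vectorizer and the model in sync.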

### **Step 7: Model Evaluation** ###

In [None]:
# Compare the metrics stored in the `results` dictionary for all trained models

# Extract key metrics into a DataFrame for easy comparison
comparison_df = pd.DataFrame({
    model: {
        'Accuracy': metrics['accuracy'],
        'Precision': metrics['precision'],
        'Recall': metrics['recall'],
        'F1 Score': metrics['f1']
    } for model, metrics in results.items()
}).T  # Transpose to have models as rows

print("Model Performance Comparison:\n")
print(comparison_df)

# Find the best model by F1 score
best_model = comparison_df['F1 Score'].idxmax()
best_f1 = comparison_df.loc[best_model, 'F1 Score']

print(f"\nBest model based on F1 Score: {best_model} (F1 Score = {best_f1:.4f})")

# Optionally, print the classification report for the best model
print(f"\nClassification Report for {best_model}:\n")
print(results[best_model]['report'])
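
The scores above use `average='weighted'`, which weights each class's score by its number of test samples. On imbalanced data this can hide weak performance on rare classes, so it is worth knowing how it differs from the unweighted (macro) average. The sketch below computes both by hand on invented labels; `per_class_f1` is a hypothetical helper.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score of one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy imbalanced labels: eight samples of class 4, two of class 0
toy_true = [4, 4, 4, 4, 4, 4, 4, 4, 0, 0]
toy_pred = [4, 4, 4, 4, 4, 4, 4, 4, 4, 0]

toy_classes = sorted(set(toy_true))
f1s = {c: per_class_f1(toy_true, toy_pred, c) for c in toy_classes}
support = {c: toy_true.count(c) for c in toy_classes}

macro_f1 = sum(f1s.values()) / len(toy_classes)
weighted_f1 = sum(f1s[c] * support[c] for c in toy_classes) / len(toy_true)
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")  # macro=0.804 weighted=0.886
```

The weighted score is higher because the well-predicted majority class dominates it. Reporting macro F1 alongside it gives a fairer picture of the minority classes.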


In [None]:
# Generate the confusion matrix for Naive Bayes
# (labels fixes the row/column order so it matches the tick labels below)
conf_matrix = confusion_matrix(y_test, y_pred_nb, labels=[0, 1, 2, 3, 4])

# Define the class labels for display
labels = ['Very Neg', 'Neg', 'Neutral', 'Pos', 'Very Pos']

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)

plt.title('Confusion Matrix - Naive Bayes')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### **Step 8: Performance Metrics** ###

In [None]:
# Check if the model is a Naive Bayes model (numpy is already imported as np in Step 1)
if hasattr(nb_model, 'feature_log_prob_'):
    # Get feature names from the vectorizer
    feature_names = vectorizer.get_feature_names_out()

    # Get log probabilities (shape: n_classes x n_features)
    class_labels = nb_model.classes_
    log_probs = nb_model.feature_log_prob_

    # Calculate the difference between "very positive" and "very negative"
    class_index_positive = np.where(class_labels == 4)[0][0]  # 'very positive'
    class_index_negative = np.where(class_labels == 0)[0][0]  # 'very negative'

    # Calculate the difference in log-probability
    importance = log_probs[class_index_positive] - log_probs[class_index_negative]

    # Create a DataFrame for feature importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance
    })

    # Sort by absolute importance
    importance_df['AbsImportance'] = np.abs(importance_df['Importance'])
    importance_df = importance_df.sort_values(by='AbsImportance', ascending=False)

    # Display top 10 most important features
    print("Top 10 most important features for classifying sentiment (very pos vs. very neg):")
    print(importance_df.head(10))

    # Plot
    plt.figure(figsize=(10, 6))
    sns.barplot(data=importance_df.head(10), x='Importance', y='Feature', hue='Feature', palette='coolwarm', legend=False)
    plt.title('Top 10 Discriminative Features (Very Positive vs Very Negative)')
    plt.xlabel('Log Probability Difference')
    plt.ylabel('Feature')
    plt.show()
else:
    print("The model does not have feature_log_prob_ (maybe not a Naive Bayes model).")
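
The "importance" computed above is a difference of log-probabilities, so exponentiating it gives a likelihood ratio: how many times more likely a feature is under one class than the other. The sketch below works this through by hand with the smoothed estimate that `MultinomialNB` uses for `feature_log_prob_`, `log((N_ci + alpha) / (N_c + alpha * n_features))`. The vocabulary and counts are invented for illustration.

```python
import math

toy_vocab = ["love", "great", "refund", "broken"]
# Hypothetical summed feature weights per class (illustrative values only)
class_counts = {
    "very positive": [30.0, 25.0, 1.0, 0.5],
    "very negative": [2.0, 3.0, 20.0, 28.0],
}
alpha = 1.0  # additive smoothing, the MultinomialNB default

def feature_log_prob(counts, alpha):
    """Smoothed log P(feature | class) for one class."""
    total = sum(counts) + alpha * len(counts)
    return [math.log((c + alpha) / total) for c in counts]

log_pos = feature_log_prob(class_counts["very positive"], alpha)
log_neg = feature_log_prob(class_counts["very negative"], alpha)

diffs = {}
for word, lp, ln in zip(toy_vocab, log_pos, log_neg):
    diffs[word] = lp - ln
    print(f"{word:>7}: diff={diffs[word]:+.2f}, likelihood ratio={math.exp(diffs[word]):.2f}")
```

Positive differences mark features that favour "very positive" and negative ones mark features that favour "very negative". The measure only compares these two classes and ignores the three in between.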


In [None]:
# Save both model and vectorizer
joblib.dump(nb_model, 'sentiment_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
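
When the saved files are loaded again for prediction, new text has to go through the same cleaning and the same fitted vectorizer as the training data, and the integer code has to be mapped back to a label. The sketch below shows that flow. `predict_sentiment`, `CODE_TO_LABEL`, `simple_cleaner` and the two stub classes are hypothetical; the stubs only stand in for the objects that `joblib.load('sentiment_model.pkl')` and `joblib.load('vectorizer.pkl')` would return, so that the example runs without the files.

```python
import re

# Inverse of the label-to-code mapping used in Step 5
CODE_TO_LABEL = {0: "very negative", 1: "negative", 2: "neutral",
                 3: "positive", 4: "very positive"}

def predict_sentiment(text, model, vectorizer, cleaner):
    """Clean one review, vectorise it and map the predicted code to a label."""
    features = vectorizer.transform([cleaner(text)])
    code = int(model.predict(features)[0])
    return CODE_TO_LABEL.get(code, "unknown")

# Stand-ins for the loaded vectorizer and model (illustrative only)
class StubVectorizer:
    def transform(self, texts):
        return [[t.count("love")] for t in texts]

class StubModel:
    def predict(self, rows):
        return [4 if row[0] > 0 else 2 for row in rows]

def simple_cleaner(text):
    # Simplified stand-in for the notebook's clean_text
    return re.sub(r"[^\w\s]", "", text).lower()

print(predict_sentiment("LOVE it!!!", StubModel(), StubVectorizer(), simple_cleaner))
```

In practice `cleaner` would be the notebook's `clean_text`, so that prediction-time input matches what the vectorizer saw during fitting.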