# **Sentiment Classification of Customer Reviews for Product Insights**

The goal of this project is to classify customer reviews as positive or negative using TF-IDF features and Logistic Regression, and to examine the keywords most strongly associated with each class. Understanding which experiences drive reviews can then inform customer experience (CX) strategy development.

>**Dataset**: Yelp Polarity Reviews
>
>**References**: Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems (pp. 649–657).

## **Step 1: Dataset Load and Preprocessing**

In [None]:
!pip install datasets
import re
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import shap
from datasets import load_dataset
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from wordcloud import WordCloud

In [None]:
dataset = load_dataset("yelp_polarity", split="train[:20%]")  # Load the first 20% of the training split
dataset = dataset.shuffle(seed=42)  # Shuffle for randomness

# Convert to pandas DataFrame
df = pd.DataFrame(dataset)
df.head()

In [None]:
# Modifying text for consistency
def clean_text(text):
    text = text.lower() # convert all text to lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text) # remove punctuation
    text = re.sub(r"\s+", " ", text).strip() # remove extra whitespace
    return text

df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head()
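
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch that applies the same three operations to a made-up review (the sample string is hypothetical, not taken from the dataset). Note that removing punctuation also removes apostrophes and slashes, so "Didn't" becomes "didnt" and "10/10" becomes "1010".

```python
import re
import string

def clean_text(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical sample review used only for illustration
sample = "  The pizza was GREAT!!!  Didn't wait long, 10/10.  "
cleaned_sample = clean_text(sample)
print(cleaned_sample)
# the pizza was great didnt wait long 1010
```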


## **Step 2: Exploratory Data Analysis: Visualising the Overall Distribution of Positive and Negative Reviews**

In [None]:
sns.set(style="whitegrid")

# Create count plot
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='label', data=df, palette=['#FF6B6B', '#4ECDC4'])

# Labels
ax.set_xticks([0, 1])  # fix tick positions before relabelling them
ax.set_xticklabels(['Negative', 'Positive'], fontsize=12)
ax.set_title("Sentiment Distribution", fontsize=16, fontweight='bold')
ax.set_xlabel("Sentiment", fontsize=12)
ax.set_ylabel("Number of Reviews", fontsize=12)

# Annotate each bar with its count (patch heights are floats, so cast to int)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 200, f'{int(height):,}',
            ha="center", fontsize=11)

plt.tight_layout()
plt.show()

## **Step 3: Model Development and Evaluation Metrics**

TF-IDF (Term Frequency–Inverse Document Frequency) is an interpretable statistic that weights each word by how often it appears in a review, scaled down by how common the word is across all reviews. Words that are frequent in one review but rare overall receive high weights, while words that appear everywhere receive low weights. The Logistic Regression model then learns which of these weighted words predict positive or negative sentiment.
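
The weighting described above can be sketched in plain Python. This is a minimal illustration on a hypothetical three-document corpus, assuming whitespace tokenisation, the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation (which, to my understanding, match the defaults of scikit-learn's `TfidfVectorizer`). It is not the vectorizer's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the Yelp data
docs = [
    "the food was great",
    "the service was slow",
    "great great service",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    # Term count multiplied by the smoothed inverse document frequency
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document, "food" occurs in only one review and therefore receives a higher weight than "the", "was", or "great", each of which occurs in two.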

---



In [None]:
# Split data, using the cleaned text so training matches the preprocessing applied in predict_sentiment
X_train, X_test, y_train, y_test = train_test_split(df["clean_text"], df["label"], test_size=0.2, random_state=42)

# TF-IDF + Logistic Regression
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
model = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(vectorizer, model)
pipeline.fit(X_train, y_train)

# Get predictions
y_pred = pipeline.predict(X_test)

# Get precision, recall, and F1-score
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
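
To show what the columns of the report mean, the following sketch computes precision, recall, and F1 by hand for the positive class. The labels and predictions are hypothetical toy values, and `binary_metrics` is an illustrative helper rather than part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy labels: 1 = positive, 0 = negative
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```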

## Example: Running the Model with a Sample Review

Replace the text passed to `predict_sentiment` with a review of your own to see how the model classifies it as positive or negative.

In [None]:
def predict_sentiment(text):
    cleaned = clean_text(text)
    vectorized = vectorizer.transform([cleaned])
    prediction = model.predict(vectorized)[0]
    return "Positive" if prediction == 1 else "Negative"

predict_sentiment("We were left waiting for so long.")
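
A label alone hides how confident the model is. The sketch below returns the estimated probability of the positive class alongside the label, using `predict_proba` on a fitted pipeline. The toy reviews, `toy_pipeline`, and `predict_with_confidence` are hypothetical stand-ins for the notebook's fitted pipeline; with the real model, the input would first be passed through `clean_text` if the model was trained on cleaned text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the fitted notebook pipeline
toy_reviews = [
    "great food and friendly staff",
    "delicious meal, fantastic service",
    "awful food and rude staff",
    "worst meal, terrible service",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_pipeline.fit(toy_reviews, toy_labels)

def predict_with_confidence(text, fitted_pipeline):
    """Return the predicted label and the estimated probability of the positive class."""
    proba = fitted_pipeline.predict_proba([text])[0]
    # classes_ gives the column order of predict_proba
    positive_col = list(fitted_pipeline.classes_).index(1)
    p_positive = float(proba[positive_col])
    label = "Positive" if p_positive >= 0.5 else "Negative"
    return label, p_positive

label, p_positive = predict_with_confidence("rude staff and awful service", toy_pipeline)
print(label, round(p_positive, 3))
```

Probabilities near 0.5 indicate reviews the model is unsure about, which may be worth reading manually.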

# **Insights: Discovering Trends in Reviews**

## Most Frequently Occurring Positive and Negative Words

These are the keywords with the largest positive and negative model coefficients, meaning they push predictions most strongly towards each class. They can inform business strategy.


In [None]:
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

coef_df = pd.DataFrame({
    'word': feature_names,
    'coefficient': coefficients
})

top_pos = coef_df.sort_values(by='coefficient', ascending=False).head(40)
top_neg = coef_df.sort_values(by='coefficient').head(40)

top_words = pd.concat([top_pos, top_neg])
top_words['sentiment'] = ['Positive'] * 40 + ['Negative'] * 40

plt.figure(figsize=(8, 12))
sns.barplot(data=top_words, x='coefficient', y='word', hue='sentiment',
            palette={'Positive': 'green', 'Negative': 'red'})
plt.axvline(0, color='black', linewidth=0.5)
plt.title("Top 40 Positive and Top 40 Negative Words Influencing Sentiment Prediction")
plt.xlabel("Model Coefficient")
plt.ylabel("Word")
plt.tight_layout()
plt.show()
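
To see how coefficients like those plotted above turn into a prediction, the following sketch combines per-word TF-IDF weights with coefficients through the logistic function. The coefficients, intercept, and TF-IDF weights are hypothetical values chosen for illustration, not taken from the fitted model.

```python
import math

# Hypothetical coefficients and intercept, for illustration only
toy_coefficients = {"delicious": 2.5, "friendly": 1.8, "rude": -2.2, "bland": -1.9}
toy_intercept = 0.1

def positive_probability(tfidf_weights):
    """Combine TF-IDF weights with coefficients into P(positive)."""
    score = toy_intercept + sum(
        toy_coefficients.get(word, 0.0) * weight
        for word, weight in tfidf_weights.items()
    )
    # The logistic function maps the log-odds score to a probability
    return 1.0 / (1.0 + math.exp(-score))

p_good = positive_probability({"delicious": 0.7, "friendly": 0.5})
p_bad = positive_probability({"rude": 0.8, "bland": 0.4})
print(f"P(positive | good review) = {p_good:.3f}")
print(f"P(positive | bad review)  = {p_bad:.3f}")
```

A word's impact on a given review therefore depends on both its coefficient and its TF-IDF weight in that review.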

# **SHAP Values**

SHAP values indicate how much each keyword contributes to the model's prediction for an individual review, relative to the model's average prediction.
>**Positive SHAP Value**: The feature pushes the model’s prediction towards the positive class.
>
>**Negative SHAP Value**: The feature pushes the model’s prediction towards the negative class.
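
For a linear model, and assuming features are treated as independent, each SHAP value has a simple closed form: the coefficient multiplied by how far the feature value deviates from its mean over the background data. The sketch below works through this on hypothetical numbers in log-odds space; it illustrates the idea and is not the `shap` library's implementation.

```python
# Hypothetical coefficients, intercept, and TF-IDF rows for three features
coefs = [2.0, -1.5, 0.5]
intercept = -0.2
background = [
    [0.0, 0.4, 0.1],
    [0.6, 0.0, 0.3],
    [0.3, 0.2, 0.2],
]

def margin(row):
    """Linear model output in log-odds space."""
    return intercept + sum(c * x for c, x in zip(coefs, row))

n_rows = len(background)
feature_means = [sum(r[j] for r in background) / n_rows for j in range(len(coefs))]

def linear_shap(row):
    """SHAP value per feature: coefficient times deviation from the background mean."""
    return [c * (x - m) for c, x, m in zip(coefs, row, feature_means)]

row = background[1]
shap_row = linear_shap(row)
base_value = sum(margin(r) for r in background) / n_rows

# The SHAP values plus the average output reconstruct this row's output
print(shap_row, base_value, margin(row))
```

In this example the second feature has a negative coefficient but is absent from the row, so its SHAP value is positive: the absence of a negative word pushes the prediction towards the positive class.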

In [None]:
# Explain the first 100 test reviews
X_test_tfidf = vectorizer.transform(X_test[:100])
explainer = shap.Explainer(model, X_test_tfidf)
shap_values = explainer(X_test_tfidf)

feature_names = vectorizer.get_feature_names_out()
shap.summary_plot(shap_values, X_test_tfidf, feature_names=feature_names, show=True)

# Rank the 40 words with the largest absolute coefficients for the listing below
coefficients = model.coef_[0]
top_idx = np.argsort(np.abs(coefficients))[-40:]
top_words = [(feature_names[i], coefficients[i]) for i in reversed(top_idx)]

## Top words influencing predictions

In [None]:
for word, coef in top_words:
    print(f"{word}: {coef:.3f} ({'Positive' if coef > 0 else 'Negative'})")

# **Key Results and Recommendations**

**Strategic Recommendations**

* **Improve Consistency in Service Quality**:
The presence of words like “rude” and “unfortunately” points to inconsistent or disappointing service interactions. Implement customer service training and real-time feedback mechanisms to ensure consistently positive experiences.

* **Prioritize Food Quality and Taste**:
Words like “bland,” “tasteless,” and “gross” indicate dissatisfaction with food preparation. Standardize food preparation procedures and monitor kitchen performance closely to maintain taste and quality.

* **Leverage Positive Themes in Marketing**:
Highlight top positive sentiment drivers such as “delicious,” “awesome,” “friendly,” and “fantastic” in branding and promotions. Reinforcing these strengths can attract new customers and build loyalty.

* **Monitor and Respond to Negative Trends Early**:
Use sentiment analysis tools to flag reviews containing highly negative keywords (e.g., “awful,” “worst,” “disgusting”) for immediate escalation. This allows the team to respond quickly and mitigate reputational risk.
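
The escalation idea in the last recommendation can be sketched with a simple keyword match. This is a minimal illustration: `ESCALATION_KEYWORDS`, `flag_for_escalation`, and the sample reviews are hypothetical, and a production system would likely combine this with the classifier's predictions.

```python
import re

# Keywords taken from the recommendation above; extend as needed
ESCALATION_KEYWORDS = {"awful", "worst", "disgusting"}

def flag_for_escalation(review, keywords=ESCALATION_KEYWORDS):
    """Return the sorted escalation keywords found in a review (empty list if none)."""
    tokens = set(re.findall(r"[a-z']+", review.lower()))
    return sorted(tokens & set(keywords))

# Hypothetical incoming reviews used only for illustration
incoming_reviews = [
    "Worst service I have ever had, and the soup was awful.",
    "Lovely patio and friendly staff.",
]

for review in incoming_reviews:
    hits = flag_for_escalation(review)
    if hits:
        print(f"ESCALATE {hits}: {review}")
# ESCALATE ['awful', 'worst']: Worst service I have ever had, and the soup was awful.
```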