# Consumer Reviews Summarization - Project Part 2


[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/Ariamestra/ConsumerReviews/blob/main/project_part2.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ariamestra/ConsumerReviews/blob/main/project_part2.ipynb)


## 1. Introduction
The goal of this project is to build a baseline model, using a Naive Bayes classifier, that summarizes customer reviews so that potential buyers can assess a product quickly. The system condenses each review and its rating into a concise, single-sentence comment, and labels that comment as positive, neutral, or negative in line with the review's original rating.<br>
<br>
**Data** <br>
The dataset was sourced from Kaggle, specifically the [Consumer Review of Clothing Product](https://www.kaggle.com/datasets/jocelyndumlao/consumer-review-of-clothing-product) dataset. It contains customer reviews from Amazon covering a range of clothing products. Each record includes the review text and rating, along with columns for product type, materials, construction, color, finishing, and durability.<br>
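
The rating-to-category alignment described in the introduction can be sketched as a small helper. This is a minimal illustration, assuming the 1-2 / 3 / 4-5 split used later in the notebook. The name `rating_to_sentiment` and the explicit handling of missing ratings are assumptions, not part of the original code.

```python
import math

def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment category (hypothetical helper)."""
    # Assumption: a missing rating gets no label instead of a default category.
    if rating is None or (isinstance(rating, float) and math.isnan(rating)):
        return None
    if rating < 3:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"

for r in [1.0, 3.0, 5.0, float("nan")]:
    print(r, "->", rating_to_sentiment(r))
```

Returning `None` for a missing rating keeps unlabeled rows visible, so they can be dropped before training rather than being silently assigned to a class.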



## Prep

In [20]:
# Import all of the Python Modules/Packages 
import pandas as pd
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import matplotlib.pyplot as plt
import seaborn as sns
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from nltk.probability import FreqDist
from heapq import nlargest
from sklearn.metrics import classification_report, accuracy_score


# Download the NLTK resources on first run (uncomment the lines below if needed)
#nltk.download('stopwords')
#nltk.download('punkt')

data_URL = 'https://raw.githubusercontent.com/Ariamestra/ConsumerReviews/main/Reviews.csv'
df = pd.read_csv(data_URL)
print(f"Shape: {df.shape}")
df.head()

Shape: (49338, 9)


| | Title | Review | Cons_rating | Cloth_class | Materials | Construction | Color | Finishing | Durability |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | Absolutely wonderful - silky and sexy and comf... | 4.0 | Intimates | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | | Love this dress! it's sooo pretty. i happene... | 5.0 | Dresses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3.0 | Dresses | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5.0 | Pants | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5.0 | Blouses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [21]:
# Count the number of nulls in reviews
number_of_nulls = df['Review'].isnull().sum()
print(f"Number of nulls in the reviews: {number_of_nulls}")

# Calculate the number of nulls in rating
number_of_nulls_in_ratings = df['Cons_rating'].isnull().sum()
print(f"Number of nulls in the ratings: {number_of_nulls_in_ratings}")

original_count = df.shape[0]
# Drop rows with nulls in the review or rating columns.
# Note: later cells keep working on df, so df_cleaned is only used for these counts.
df_cleaned = df.dropna(subset=['Review', 'Cons_rating'])
cleaned_count = df_cleaned.shape[0] # Number of rows after dropping nulls
rows_dropped = original_count - cleaned_count

print(f"Number of rows dropped: {rows_dropped}")

# Get the shape after dropping null values
df_shape_after_dropping = df_cleaned.shape

print(f"Shape of the DataFrame after dropping rows: {df_shape_after_dropping}")

Number of nulls in the reviews: 831
Number of nulls in the ratings: 214
Number of rows dropped: 1043
Shape of the DataFrame after dropping rows: (48295, 9)
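
The three counts above are consistent with each other: 831 + 214 - 1043 = 2, so two rows are missing both the review and the rating. A minimal sketch of that inclusion-exclusion check on a toy table follows. It uses plain Python, and the rows are invented for illustration.

```python
# Toy rows standing in for (Review, Cons_rating); None marks a missing value.
rows = [
    ("great fit", 5.0),
    (None, 4.0),          # missing review only
    ("too small", None),  # missing rating only
    (None, None),         # missing both
    ("runs large", 3.0),
]

null_reviews = sum(1 for review, _ in rows if review is None)
null_ratings = sum(1 for _, rating in rows if rating is None)
kept = [row for row in rows if row[0] is not None and row[1] is not None]
dropped = len(rows) - len(kept)

# Rows missing both fields are counted twice when the two null counts are added.
missing_both = null_reviews + null_ratings - dropped

print(null_reviews, null_ratings, dropped, missing_both)

# The same identity applied to the counts reported above.
print(831 + 214 - 1043)
```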


In [29]:
# Calculate the length of each review in terms of word count
df['Review_length'] = df['Review'].astype(str).apply(lambda x: len(x.split()))

# Keep only reviews with at least 20 words
df_filtered = df[df['Review_length'] >= 20]
df = df_filtered.copy()  # copy so later column assignments do not modify a slice

# Longest and shortest reviews 
longest_review_row = df_filtered.loc[df_filtered['Review_length'].idxmax()]
longest_review = longest_review_row['Review']
longest_review_length = longest_review_row['Review_length']

shortest_review_row = df_filtered.loc[df_filtered['Review_length'].idxmin()]
shortest_review = shortest_review_row['Review']
shortest_review_length = shortest_review_row['Review_length']

df_filtered_shape = df_filtered.shape

print(f"Longest review length: {longest_review_length} words")
print(f"Shortest review length: {shortest_review_length} words")
print(f"Shape of the DataFrame after dropping rows below 20 words: {df_filtered_shape}")

Longest review length: 668 words
Shortest review length: 20 words
Shape of the DataFrame after dropping rows below 20 words: (32925, 14)
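
One side effect of the length filter is worth spelling out. In Python, `str(float('nan'))` is `'nan'`, and `astype(str)` typically does the same to a missing review. That string counts as one word, so rows with a null review also fall below the 20-word threshold. A minimal sketch in plain Python follows, where `word_count` and the sample values are illustrative assumptions.

```python
MIN_WORDS = 20  # threshold used in the cell above

def word_count(text):
    """Count whitespace-separated words after coercing the value to str."""
    return len(str(text).split())

reviews = [
    "short review",
    float("nan"),             # a missing review
    " ".join(["word"] * 20),  # exactly 20 words
    " ".join(["word"] * 25),
]

kept = [r for r in reviews if word_count(r) >= MIN_WORDS]
print([word_count(r) for r in reviews])
print(len(kept))
```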


In [23]:
# Lowercase, tokenize, keep alphabetic tokens only (drops punctuation and numbers),
# and remove stop words
stop_words = set(stopwords.words('english'))
df['Processed_Reviews'] = df['Review'].apply(
    lambda x: ' '.join(
        word for word in word_tokenize(str(x).lower())
        if word.isalpha() and word not in stop_words
    )
)

print("Done")

Done
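
The preprocessing step lowercases each review, keeps only alphabetic tokens, and drops stop words. A rough stand-in that runs without NLTK data is sketched below. The regex tokenizer and the tiny `STOP_WORDS` set are simplifying assumptions and will not match `word_tokenize` or NLTK's English stop-word list exactly.

```python
import re

# Assumption: a tiny illustrative stop-word set, far smaller than NLTK's list.
STOP_WORDS = {"this", "is", "it", "and", "the", "a", "i", "to", "so"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and remove stop words."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(preprocess("Love this dress! It is sooo pretty, and the fit is perfect."))
```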


In [24]:
# Labels derived from the rating:
#   Negative = ratings 1 and 2
#   Neutral  = rating 3
#   Positive = ratings 4 and 5
# Note: a missing (NaN) rating fails both comparisons and is therefore labeled 'Positive'.
df['Sentiment_Summary'] = df['Cons_rating'].apply(lambda x: 'Negative' if x < 3 else ('Neutral' if x == 3 else 'Positive'))
y = df['Sentiment_Summary']

# Split into training and test sets, then fit TF-IDF on the training text only
X_train, X_test, y_train, y_test = train_test_split(df['Processed_Reviews'], y, test_size=0.2, random_state=42)
# min_df=0.05 keeps only terms that appear in at least 5% of the training reviews,
# so the vocabulary may end up far smaller than max_features
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', max_df=0.95, min_df=0.05)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

print("Done")

X_train shape: (26340,)
X_test shape: (6585,)
y_train shape: (26340,)
y_test shape: (6585,)
Done
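
The `min_df=0.05` setting keeps only terms that occur in at least 5% of the training documents, which can shrink the vocabulary far below `max_features`. The sketch below reimplements that document-frequency cut-off in plain Python on an invented corpus to show the effect. `surviving_terms` is a hypothetical helper, not a scikit-learn function, and it ignores the vectorizer's other options.

```python
from collections import Counter

def surviving_terms(docs, min_df=0.05, max_df=0.95):
    """Return terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )

# Invented corpus: 100 short "reviews"; rare words appear in only a few of them.
docs = ["dress fits well"] * 60 + ["dress runs small"] * 37 + ["zipper broke"] * 3

print(surviving_terms(docs, min_df=0.05))  # rare terms are pruned
print(surviving_terms(docs, min_df=0.01))  # a looser cut-off keeps them
```

In this toy corpus the words that only appear in the complaint ("zipper", "broke") are the ones pruned at 5%. If negative reviews rely on similarly rare vocabulary, a high `min_df` may remove much of the signal that separates the classes.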


Train the Naive Bayes classifier

In [26]:
# Model training (TODO: fix error)
model = MultinomialNB()

model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)

# Evaluate the model
# TODO: add positive, neutral and negative to chart
print(metrics.classification_report(y_test, y_pred))
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print('-' * 100)

# Summarization
df_tfidf = tfidf_vectorizer.transform(df['Processed_Reviews'])
df['Predicted_Sentiment'] = model.predict(df_tfidf)
summary = df['Predicted_Sentiment'].value_counts(normalize=True) * 100
print(summary)




              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00       919
     Neutral       0.00      0.00      0.00       844
    Positive       0.73      1.00      0.85      4822

    accuracy                           0.73      6585
   macro avg       0.24      0.33      0.28      6585
weighted avg       0.54      0.73      0.62      6585

Accuracy: 0.7322703113135915
----------------------------------------------------------------------------------------------------
Predicted_Sentiment
Positive    99.996963
Negative     0.003037
Name: proportion, dtype: float64
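
The report above matches what a classifier that always answers `Positive` would score on this test split. The sketch below recomputes the headline numbers from the support counts alone. It is an arithmetic check under that always-Positive assumption, not a rerun of the model.

```python
# Support counts taken from the classification report above.
support = {"Negative": 919, "Neutral": 844, "Positive": 4822}
total = sum(support.values())

# Assumption: every test review is predicted "Positive".
precision = {label: 0.0 for label in support}
recall = {label: 0.0 for label in support}
precision["Positive"] = support["Positive"] / total
recall["Positive"] = 1.0

def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

f1_scores = {label: f1(precision[label], recall[label]) for label in support}

accuracy = support["Positive"] / total
macro_f1 = sum(f1_scores.values()) / len(support)
weighted_f1 = sum(f1_scores[label] * support[label] for label in support) / total

print(f"accuracy    = {accuracy:.2f}")
print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```

Because accuracy equals the share of Positive reviews (4822 / 6585), accuracy alone says little here. The recall of 0.00 for Negative and Neutral is the more informative signal.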


Baseline model

In [27]:
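# Sentiment label derived from the rating (not from the model's predictions)
df['Sentiment_Summary'] = df['Cons_rating'].apply(lambda x: 'Negative' if x < 3 else ('Neutral' if x == 3 else 'Positive'))

# Display the first 5 reviews with their rating-based sentiment label
# (printed under the heading "Predicted Sentiment")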
for index, row in df.head(5).iterrows():
    print("Original Review:", row['Review'])
    print("Predicted Sentiment:", row['Sentiment_Summary'])
    print('-' * 100)


Original Review: Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Predicted Sentiment: Positive
----------------------------------------------------------------------------------------------------
Original Review: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
Predicted Sentiment: Neutral
-

In [28]:
# Summarize each review by extracting its highest-scoring sentences
# TODO: make the summary take the sentiment/rating into account

def summarize_review(review, num_sentences=2):
    """Return the num_sentences sentences with the highest word-frequency score."""
    # Note: stop words and punctuation tokens are counted like any other token
    words = word_tokenize(review.lower())
    freqTable = FreqDist(words)
    sentences = sent_tokenize(review)
    sentenceValue = dict()

    for sentence in sentences:
        for word, freq in freqTable.items():
            # Substring match: short tokens such as 'a' match almost every sentence
            if word in sentence.lower():
                sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq

    # Pick the top-scoring sentences and join them in score order
    summary_sentences = nlargest(num_sentences, sentenceValue, key=sentenceValue.get)
    summary = ' '.join(summary_sentences)
    return summary

df['Review'] = df['Review'].astype(str)

# Apply the summarization function to your reviews
df['Summarized_Review'] = df['Review'].apply(summarize_review)

# Print the first 5 original and summarized reviews
for index, row in df.head(5).iterrows():
    original_review = row['Review']
    summarized_review = row['Summarized_Review']

    # Calculate word count
    original_word_count = len(original_review.split())
    summarized_word_count = len(summarized_review.split())

    print("Original Review:", original_review)
    print("Original Review Word Count:", original_word_count)
    print("Summarized Review:", summarized_review)
    print("Summarized Review Word Count:", summarized_word_count)
    print('-' * 100)



Original Review: Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Original Review Word Count: 62
Summarized Review: i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. would definitely be a true midi on someone who is truly petite.
Summarized Review Word Count: 36
----------------------------------------------------------------------------------------------------
Original Review: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable 
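
Two details of `summarize_review` may skew its scores. First, `word in sentence.lower()` is a substring test, so a token such as `a` matches almost every sentence. Second, stop words and punctuation are counted like any other token. A variation that scores whole words only, skips stop words, and returns a single sentence is sketched below. The regex-based splitting, the small `STOP_WORDS` set, and the name `summarize_one_sentence` are assumptions made so the example runs without NLTK data; it is not the notebook's method.

```python
import re
from collections import Counter

# Assumption: a small illustrative stop-word set instead of NLTK's list.
STOP_WORDS = {"i", "it", "the", "a", "and", "is", "this", "but", "to", "was", "in", "so"}

def summarize_one_sentence(review):
    """Return the sentence whose non-stop-word tokens are most frequent overall."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]
    if not sentences:
        return ""

    def content_words(text):
        return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

    freq = Counter(content_words(review))
    # Score each sentence by summing the review-wide frequency of its words.
    scores = [sum(freq[w] for w in content_words(s)) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best]

review = (
    "The fabric is soft. The fit is perfect and the fabric feels great. "
    "Shipping was slow."
)
print(summarize_one_sentence(review))
```

Summing frequencies still favors long sentences. Dividing each score by the number of content words in the sentence is one common adjustment.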

## Conclusion