# Consumer Reviews Summarization - Project Part 2


[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/Ariamestra/ConsumerReviews/blob/main/project_part2.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ariamestra/ConsumerReviews/blob/main/project_part2.ipynb)


## 1. Introduction
My goal in this project is to develop a baseline model, built on a Naive Bayes classifier, that summarizes customer reviews so that potential buyers can scan them quickly when assessing a product. The system will condense the essential content of each review and its associated rating into a concise, single-sentence comment. Each comment will be categorized as positive, neutral, or negative, in line with the review's original rating.

**Data** <br>
The dataset comes from Kaggle: the [Consumer Review of Clothing Product](https://www.kaggle.com/datasets/jocelyndumlao/consumer-review-of-clothing-product) dataset. It contains customer reviews from Amazon, with feedback from buyers about different products. Each record holds the customer's review and rating along with the product type, material, construction, color, finish, and durability.<br>
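
As a rough illustration of the positive/neutral/negative categorization described above, the sketch below maps a star rating to a label. The function name `rating_to_sentiment` and the cut-offs (above 3 is positive, exactly 3 is neutral, below 3 is negative) are assumptions made for this example, not something the dataset prescribes.

```python
def rating_to_sentiment(rating: float) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The cut-offs are an assumption for illustration:
    above 3 is positive, exactly 3 is neutral, below 3 is negative.
    """
    if rating > 3:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"


# Quick check on one rating from each band
for stars in (5.0, 3.0, 1.0):
    print(stars, rating_to_sentiment(stars))
```

A three-way label like this would need a multi-class target; a binary target (positive vs. everything else) is the simpler starting point for a baseline.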



## Prep

In [32]:
# import all of the python modules/packages you'll need here
import pandas as pd
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import seaborn as sns
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


# Download the NLTK resources (uncomment these two lines on the first run)
#nltk.download('stopwords')
#nltk.download('punkt')

data_URL = 'https://raw.githubusercontent.com/Ariamestra/ConsumerReviews/main/Reviews.csv'
df = pd.read_csv(data_URL)
print(f"Shape: {df.shape}")
df.head()

Shape: (49338, 9)


| | Title | Review | Cons_rating | Cloth_class | Materials | Construction | Color | Finishing | Durability |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | Absolutely wonderful - silky and sexy and comf... | 4.0 | Intimates | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | | Love this dress! it's sooo pretty. i happene... | 5.0 | Dresses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3.0 | Dresses | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5.0 | Pants | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5.0 | Blouses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [33]:
# Check for the number of missing values in each column
print("Find all of the nulls:")
print(df.isnull().sum())

Find all of the nulls:
Title            3968
Review            831
Cons_rating       214
Cloth_class        16
Materials       43597
Construction    43595
Color           43596
Finishing       43601
Durability      43604
dtype: int64
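
Raw null counts are easier to judge as a share of the 49,338 rows: the five attribute columns are each roughly 88% empty, while `Review` and `Cons_rating` are missing in under 2% of rows. A minimal sketch of that calculation, using a small made-up DataFrame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Toy stand-in for the reviews table; the values are invented for illustration
toy = pd.DataFrame({
    "Review": ["great fit", None, "too small", "love it"],
    "Cons_rating": [5.0, 4.0, None, 5.0],
    "Materials": [None, None, 1.0, None],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values
missing_share = toy.isnull().mean() * 100
print(missing_share.round(1))
```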


In [34]:
# Count the number of nulls in reviews
number_of_nulls = df['Review'].isnull().sum()
print(f"Number of nulls in the review column: {number_of_nulls}")

# Calculate the number of nulls in rating
number_of_nulls_in_ratings = df['Cons_rating'].isnull().sum()
print(f"Number of nulls in the rating column: {number_of_nulls_in_ratings}")

original_count = df.shape[0]
df_cleaned = df.dropna(subset=['Review', 'Cons_rating']) # Drop rows with nulls in reviews and ratings columns
cleaned_count = df_cleaned.shape[0] # Number of rows after dropping nulls
rows_dropped = original_count - cleaned_count

print(f"Number of rows dropped: {rows_dropped}")

# Get the shape after dropping null values
df_shape_after_dropping = df_cleaned.shape

print(f"Shape of the DataFrame after dropping rows: {df_shape_after_dropping}")

Number of nulls in the review column: 831
Number of nulls in the rating column: 214
Number of rows dropped: 1043
Shape of the DataFrame after dropping rows: (48295, 9)
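
The two null counts add up to 831 + 214 = 1045, yet only 1043 rows were dropped. The difference means that 2 rows are missing both the review and the rating, so they are counted once in each column but dropped only once. The sketch below checks the same relationship on a small made-up DataFrame (`toy` is hypothetical):

```python
import pandas as pd

# Toy data: row 3 is missing both fields, rows 1 and 2 are missing one each
toy = pd.DataFrame({
    "Review": ["nice", None, "ok", None, "runs small"],
    "Cons_rating": [5.0, 4.0, None, None, 2.0],
})

null_reviews = toy["Review"].isnull().sum()
null_ratings = toy["Cons_rating"].isnull().sum()
null_both = (toy["Review"].isnull() & toy["Cons_rating"].isnull()).sum()

rows_dropped = len(toy) - len(toy.dropna(subset=["Review", "Cons_rating"]))

# Inclusion-exclusion: rows missing either field = A + B - (rows missing both)
print(rows_dropped, null_reviews + null_ratings - null_both)
```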


In [35]:
# Continue with the cleaned rows so that missing reviews and ratings do not reach the model
df = df_cleaned.copy()

# Calculate the length of each review in terms of word count
df['Review_length'] = df['Review'].astype(str).apply(lambda x: len(x.split()))

# Filter out reviews that are shorter than 10 words
df_filtered = df[df['Review_length'] >= 10]

# Inspect the longest and shortest reviews in the filtered DataFrame
longest_review_row = df_filtered.loc[df_filtered['Review_length'].idxmax()]
longest_review = longest_review_row['Review']
longest_review_length = longest_review_row['Review_length']

shortest_review_row = df_filtered.loc[df_filtered['Review_length'].idxmin()]
shortest_review = shortest_review_row['Review']
shortest_review_length = shortest_review_row['Review_length']

print(f"Longest review length: {longest_review_length} words")
print(f"Shortest review length: {shortest_review_length} words")

Longest review length: 668 words
Shortest review length: 10 words


In [36]:
# Tokenization and stop-word removal: lowercase, keep alphabetic tokens, drop English stop words
stop_words = set(stopwords.words('english'))
df['Processed_Reviews'] = df['Review'].apply(
    lambda x: ' '.join(word for word in word_tokenize(str(x).lower()) if word.isalpha() and word not in stop_words)
)
print("Done")

Done
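
To make the preprocessing step concrete without needing the NLTK downloads, here is a minimal stand-in. It is a sketch under stated assumptions: a regular expression replaces `word_tokenize`, and `STOP_WORDS` is a tiny hand-picked set rather than NLTK's full English list, so the output can differ slightly from the cell above. The helper name `preprocess` is hypothetical.

```python
import re

# Tiny hand-picked stop-word set, for illustration only (NLTK's English list is much longer)
STOP_WORDS = {"this", "it", "s", "is", "the", "and", "a"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    return " ".join(t for t in tokens if t not in stop_words)


print(preprocess("Love this dress! It's sooo pretty."))

# A missing review passed through str() turns into the literal token "nan"
print(preprocess(float("nan")))
```

The second call shows why rows with missing reviews are worth removing before this step: `str()` silently converts a missing value into a word that then enters the vocabulary.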


In [37]:
# Feature extraction: TF-IDF weights, limited to the 5,000 most frequent terms
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['Processed_Reviews'])  # keep the sparse matrix; a dense array of this size is memory-heavy
y = (df['Cons_rating'] > 3).astype(int)  # ratings above 3 count as positive (1), all others as negative (0)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Done")

Done
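
The sketch below repeats the feature-extraction and split steps on a made-up eight-review corpus, to show what the objects look like. The corpus and labels are invented, and `stratify=labels` is an optional extra (an assumption, not part of the cell above) that keeps the class proportions the same in both splits.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented, already-preprocessed reviews: four positive (1), four negative (0)
docs = [
    "love dress pretty", "perfect fit love",
    "great quality fabric", "flattering comfortable shirt",
    "poor quality fabric", "runs small returned",
    "cheap thin material", "disappointed returned dress",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(docs)  # SciPy sparse matrix: one row per review, one column per term
print(sparse.issparse(X), X.shape)

# stratify keeps the positive/negative ratio equal across train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)
print(X_train.shape, X_test.shape, sorted(y_test))
```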


Train the Naive Bayes classifier
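
For reference, the multinomial Naive Bayes decision rule as it is usually stated, where $x_i$ is the weight of term $i$ in a review (here a TF-IDF value), $N_{ci}$ is the total weight of term $i$ in training reviews of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter:

$$
\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$
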

In [38]:
# Model training
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(metrics.classification_report(y_test, y_pred))

# Summarization: share of reviews predicted positive (1) vs. negative (0), in percent
df['Predicted_Sentiment'] = model.predict(X)
summary = df['Predicted_Sentiment'].value_counts(normalize=True) * 100
print(summary)

Baseline model
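
A minimal end-to-end sketch of such a baseline on made-up data, assuming scikit-learn is installed. The training and test reviews are invented for illustration, and wrapping the two steps in `make_pipeline` is one possible design choice rather than the notebook's own method.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training reviews: 1 = positive, 0 = negative
train_docs = [
    "love dress pretty", "perfect fit love",
    "great quality fabric", "flattering comfortable shirt",
    "poor quality fabric", "runs small returned",
    "cheap thin material", "disappointed returned dress",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes on the result
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Invented held-out reviews
test_docs = ["love perfect fit", "poor cheap material"]
test_labels = [1, 0]
pred = model.predict(test_docs)

print(metrics.classification_report(test_labels, pred, zero_division=0))
```

Keeping the vectorizer inside the pipeline means the IDF weights are learned from the training text alone, so nothing about the test reviews leaks into the features.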

## Conclusion