Product reviews with multinomial Naive Bayes

OBJECTIVES:
     The primary objective of this project is to classify product reviews using the Multinomial Naive Bayes (MNB) algorithm, focusing on sentiment analysis. MNB is particularly suited for text-based tasks where features represent word counts or frequencies, making it ideal for analyzing product reviews. The project begins by collecting a dataset of product reviews, which may include review text, sentiment labels (e.g., positive, negative), or star ratings.

Preprocessing steps like tokenization, stopword removal, and vectorization (using Bag of Words or TF-IDF) are applied to convert raw text into numerical features. The MNB model is trained on these features to classify reviews into positive, negative, or neutral sentiments, or into star rating categories.

The model’s performance is evaluated using metrics such as accuracy, precision, recall, and F1-score, and further optimized through hyperparameter tuning and feature selection (e.g., using n-grams). Visualization techniques will be used to analyze the distribution of sentiments and identify patterns in the review data. Ultimately, the project aims to provide insights into customer sentiments and help businesses make data-driven decisions based on product feedback.

For a product reviews classification project using **Multinomial Naive Bayes**, several publicly available datasets can serve as excellent sources of data:

### 1. **Amazon Product Reviews**:
   - **Source**: [Amazon Customer Reviews Dataset](https://registry.opendata.aws/amazon-reviews/)
   - **Description**: A comprehensive dataset containing millions of product reviews from Amazon customers. It includes product titles, review text, star ratings (1 to 5 stars), and categories across various industries like electronics, books, and clothing.
   - **Use Case**: Sentiment analysis, classification of review categories, or star rating prediction.

### 2. **Yelp Reviews**:
   - **Source**: [Yelp Open Dataset](https://www.yelp.com/dataset)
   - **Description**: The Yelp dataset provides reviews for businesses like restaurants, retail stores, and more. It includes review text, ratings, business details, and user feedback.
   - **Use Case**: Sentiment classification, opinion mining, or category classification based on reviews.

### 3. **IMDb Movie Reviews**:
   - **Source**: [IMDb Movie Reviews](https://ai.stanford.edu/~amaas/data/sentiment/)
   - **Description**: A popular dataset used for sentiment analysis, consisting of 50,000 movie reviews categorized as positive or negative.
   - **Use Case**: Binary sentiment classification (positive/negative).

### 4. **Kaggle Datasets**:
   - **Source**: [Kaggle Product Reviews](https://www.kaggle.com/datasets)
   - **Description**: Kaggle hosts numerous datasets for product reviews, including specialized ones like electronics, clothing, or cosmetics reviews.
   - **Use Case**: Various use cases, including star rating prediction and sentiment analysis.

These datasets provide diverse data for building, training, and evaluating the Multinomial Naive Bayes classifier.

**Import** **Libraries** **and Dataset**

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv(r'https://raw.githubusercontent.com/YBIFoundation/MachineLearning/main/Dataset/Women%20Clothing%20E-Commerce%20Review.csv')
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review,Rating,Recommended,Positive Feedback,Division,Department,Category
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Clothing ID        23486 non-null  int64 
 1   Age                23486 non-null  int64 
 2   Title              19676 non-null  object
 3   Review             22641 non-null  object
 4   Rating             23486 non-null  int64 
 5   Recommended        23486 non-null  int64 
 6   Positive Feedback  23486 non-null  int64 
 7   Division           23472 non-null  object
 8   Department         23472 non-null  object
 9   Category           23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB


In [16]:
df.shape

(23486, 10)

**Missing** **Values**

In [17]:
df.isna().sum()

Unnamed: 0,0
Clothing ID,0
Age,0
Title,3810
Review,845
Rating,0
Recommended,0
Positive Feedback,0
Division,14
Department,14
Category,14


In [18]:
df[df['Review']=='']=np.NaN
df['Review'].fillna("No Review",inplace=True)
df.isna().sum()

Unnamed: 0,0
Clothing ID,0
Age,0
Title,3810
Review,0
Rating,0
Recommended,0
Positive Feedback,0
Division,14
Department,14
Category,14


In [19]:
df["Review"]

Unnamed: 0,Review
0,Absolutely wonderful - silky and sexy and comf...
1,Love this dress! it's sooo pretty. i happene...
2,I had such high hopes for this dress and reall...
3,"I love, love, love this jumpsuit. it's fun, fl..."
4,This shirt is very flattering to all due to th...
...,...
23481,I was very happy to snag this dress at such a ...
23482,"It reminds me of maternity clothes. soft, stre..."
23483,"This fit well, but the top was very see throug..."
23484,I bought this dress for a wedding i have this ...


**Define target(y) and feature (x)**

In [20]:
df.columns

Index(['Clothing ID', 'Age', 'Title', 'Review', 'Rating', 'Recommended',
       'Positive Feedback', 'Division', 'Department', 'Category'],
      dtype='object')

In [25]:
x=df['Review']
y=df["Rating"]
df["Rating"].value_counts()


Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
5,13131
4,5077
3,2871
2,1565
1,842


**Train Test Split**

In [27]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x, y,train_size=0.7, stratify=y,random_state = 2529)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(16440,) (7046,) (16440,) (7046,)


**Get Feature Extraction test to Tokens**

In [32]:
from sklearn.feature_extraction.text import CountVectorizer # CountVectorizer is in the sklearn.feature_extraction.text module
cv=CountVectorizer(lowercase=True, analyzer='word', ngram_range=(2, 3), stop_words= 'english', max_features=5000)
X_train =cv.fit_transform(x_train) # make sure you are using the correct variable name x_train
cv.get_feature_names_out()



array(['10 12', '10 bought', '10 fit', ..., 'yellow color', 'yoga pants',
       'zipper little'], dtype=object)

In [33]:
X_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [34]:
X_test=cv.fit_transform(x_test)
cv.get_feature_names_out()

array(['10 12', '10 dress', '10 fit', ..., 'years come', 'years old',
       'yoga pants'], dtype=object)

In [35]:
X_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

**Get Model Train**

In [36]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(X_train,y_train)

**Get Model Prediction**

In [37]:
y_pred=model.predict(X_test)
y_pred.shape

(7046,)

In [38]:
y_pred

array([1, 5, 5, ..., 5, 5, 5])

**Get Model Probability For Each Predicted Class**

In [39]:
model.predict_proba(X_test)

array([[0.71118473, 0.02625165, 0.15465118, 0.01496876, 0.09294369],
       [0.02416867, 0.04769471, 0.35268622, 0.16185007, 0.41360034],
       [0.03582725, 0.06660584, 0.12226277, 0.21618005, 0.55912409],
       ...,
       [0.02320281, 0.08950939, 0.08962183, 0.16719203, 0.63047394],
       [0.01167675, 0.00202714, 0.08539004, 0.34347398, 0.55743209],
       [0.03959824, 0.05612822, 0.00688869, 0.1560574 , 0.74132745]])

**Get Model Evaluation**

In [40]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,y_pred))

[[  15   13   45   36  144]
 [  43   43   86   85  213]
 [ 116   78  113  166  388]
 [ 166  108  194  336  719]
 [ 371  272  349  722 2225]]


In [44]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.02      0.06      0.03       253
           2       0.08      0.09      0.09       470
           3       0.14      0.13      0.14       861
           4       0.25      0.22      0.23      1523
           5       0.60      0.56      0.58      3939

    accuracy                           0.39      7046
   macro avg       0.22      0.21      0.21      7046
weighted avg       0.42      0.39      0.40      7046



In [45]:
df['Rating'].value_counts()

Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
5,13131
4,5077
3,2871
2,1565
1,842


In [48]:
df.replace({'Rating': {1:0, 2:0, 3:0, 4:1, 5:1}}, inplace=True)
y=df['Rating']
x=df['Review']

**Train Test Split**

In [49]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x, y,train_size=0.7, stratify=y,random_state = 2529)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(16440,) (7046,) (16440,) (7046,)


**Get Feature Text Extraction To Tokens**

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(lowercase=True, analyzer='word', ngram_range=(2, 3), stop_words= 'english', max_features=5000)
x_train =cv.fit_transform(x_train)
x_test=cv.fit_transform(x_test)

**Get Model Re-Train**

In [54]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(x_train,y_train)

**Get Model Prediction**

In [55]:
y_pred=model.predict(x_test)
y_pred.shape

(7046,)

In [58]:
y_pred

array([0, 0, 0, ..., 0, 0, 0])

**Get Model Evaluation**

In [59]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,y_pred))

[[7046]]


In [66]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7046

    accuracy                           1.00      7046
   macro avg       1.00      1.00      1.00      7046
weighted avg       1.00      1.00      1.00      7046



This project focuses on using **Multinomial Naive Bayes (MNB)** for **product review classification**, primarily for **sentiment analysis** or categorizing reviews based on ratings or product types. Multinomial Naive Bayes is particularly effective for text classification problems where the features represent word counts or word frequencies.

### Project Workflow:

1. **Data Collection**:
   The first step involves gathering a dataset of product reviews. Popular sources include Amazon, Yelp, or IMDb, which provide reviews with associated metadata like review text, star ratings, and product categories.

2. **Data Preprocessing**:
   Since Naive Bayes requires numerical input, the review text must be converted into a format that the model can process. Preprocessing involves:
   - **Tokenization**: Splitting review text into individual words (tokens).
   - **Stopwords Removal**: Removing common words that do not contribute much meaning (e.g., "the", "is").
   - **Vectorization**: Converting the text into numerical features using techniques like **Bag of Words (BoW)** or **TF-IDF (Term Frequency-Inverse Document Frequency)**. These techniques count word occurrences or measure the importance of words in a review.

3. **Model Training**:
   The Multinomial Naive Bayes classifier is then trained on the processed dataset. The goal is to predict the sentiment (positive, negative, or neutral) or other categories like product types or star ratings based on the review text.

4. **Model Evaluation**:
   The model's performance is measured using evaluation metrics like **accuracy**, **precision**, **recall**, and **F1-score**. A **confusion matrix** helps analyze the model’s predictions for different sentiment classes or categories.

5. **Optimization**:
   Hyperparameter tuning, feature selection (like using **n-grams** for capturing word sequences), and cross-validation can improve the model's accuracy. By adjusting parameters, the model becomes more robust in classifying unseen data.

6. **Insights and Visualization**:
   The project ends by analyzing the results and visualizing the distribution of sentiments across different products or categories. For example, identifying common words in positive vs. negative reviews or which product categories have more positive or negative reviews.

### Goal:
The ultimate goal of the project is to provide meaningful insights into customer feedback, enabling businesses to understand customer sentiment, identify areas for improvement, and make data-driven decisions based on product reviews.