
## **Title**:
#**Women's Clothing Review Prediction with Multinomial Naïve Bayes**

#**Objective**:
The objective of this project is to build a Multinomial Naïve Bayes model that predicts the rating of a women's clothing review from its text. The rating serves as a proxy for sentiment: the model is first trained on the five rating levels (1 to 5) and then on a binary split into poor (ratings 1 to 3) and good (ratings 4 and 5). Such a model can help retailers and brands understand customer opinions about their products, which can inform customer service, targeted marketing, and product offerings. The focus is on text processing and classification: converting review text into n-gram counts and predicting the rating class from them.

#**Data Source:**
The dataset for this project, titled "Women Clothing E-Commerce Reviews," can be found in the following location:

- Dataset Name: Women Clothing E-Commerce Reviews
- Source: GitHub (YBIFoundation)
- URL: https://raw.githubusercontent.com/YBIFoundation/MachineLearning/main/Dataset/Women%20Clothing%20E-Commerce%20Review.csv

This dataset contains customer reviews, including the review text, ratings, and other attributes related to women's clothing purchased online. The review text and ratings are used to train and evaluate the model.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
import matplotlib.pyplot as plt

In [6]:
import seaborn as sns

In [10]:
import pandas as pd

# Use the raw link to the CSV file
url = "https://raw.githubusercontent.com/YBIFoundation/MachineLearning/main/Dataset/Women%20Clothing%20E-Commerce%20Review.csv"
df = pd.read_csv(url)

# Now you can check the dataframe
print(df.head())


   Clothing ID  Age                    Title  \
0          767   33                      NaN   
1         1080   34                      NaN   
2         1077   60  Some major design flaws   
3         1049   50         My favorite buy!   
4          847   47         Flattering shirt   

                                              Review  Rating  Recommended  \
0  Absolutely wonderful - silky and sexy and comf...       4            1   
1  Love this dress!  it's sooo pretty.  i happene...       5            1   
2  I had such high hopes for this dress and reall...       3            0   
3  I love, love, love this jumpsuit. it's fun, fl...       5            1   
4  This shirt is very flattering to all due to th...       5            1   

   Positive Feedback        Division Department   Category  
0                  0       Initmates   Intimate  Intimates  
1                  4         General    Dresses    Dresses  
2                  0         General    Dresses    Dresses  
3   

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Clothing ID        23486 non-null  int64 
 1   Age                23486 non-null  int64 
 2   Title              19676 non-null  object
 3   Review             22641 non-null  object
 4   Rating             23486 non-null  int64 
 5   Recommended        23486 non-null  int64 
 6   Positive Feedback  23486 non-null  int64 
 7   Division           23472 non-null  object
 8   Department         23472 non-null  object
 9   Category           23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB


In [14]:
df.shape

(23486, 10)

Missing Values

Replace missing values in the Review column with the placeholder text "No Review"

In [15]:
df.isna().sum()

Clothing ID             0
Age                     0
Title                3810
Review                845
Rating                  0
Recommended             0
Positive Feedback       0
Division               14
Department             14
Category               14
dtype: int64


In [16]:
# Mark empty review strings as missing, touching only the Review column
df.loc[df['Review']=="", 'Review']=np.nan

In [17]:
df['Review']=df['Review'].fillna("No Review")

In [18]:
df.isna().sum()

Clothing ID             0
Age                     0
Title                3810
Review                  0
Rating                  0
Recommended             0
Positive Feedback       0
Division               14
Department             14
Category               14
dtype: int64
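
The missing-value handling above can be tried on a small frame first. This is a minimal sketch, assuming a toy DataFrame named `toy` with made-up reviews (it is not part of the dataset); it uses plain assignment for the fill, which avoids the chained `inplace=True` pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration).
toy = pd.DataFrame({"Review": ["Love this dress", "", None, "Runs small"]})

# Treat empty strings as missing, touching only the Review column.
toy.loc[toy["Review"] == "", "Review"] = np.nan

# Fill missing reviews with a placeholder via assignment.
toy["Review"] = toy["Review"].fillna("No Review")

print(toy["Review"].tolist())
```

Note that a placeholder such as "No Review" still becomes a training example with a rating attached; dropping those rows instead is another reasonable option.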


In [19]:
df['Review']

0        Absolutely wonderful - silky and sexy and comf...
1        Love this dress! it's sooo pretty. i happene...
2        I had such high hopes for this dress and reall...
3        I love, love, love this jumpsuit. it's fun, fl...
4        This shirt is very flattering to all due to th...
                               ...
23481    I was very happy to snag this dress at such a ...
23482    It reminds me of maternity clothes. soft, stre...
23483    This fit well, but the top was very see throug...
23484    I bought this dress for a wedding i have this ...


Define Target (y) and Feature (X)

In [20]:
df.columns

Index(['Clothing ID', 'Age', 'Title', 'Review', 'Rating', 'Recommended',
       'Positive Feedback', 'Division', 'Department', 'Category'],
      dtype='object')

In [22]:
x = df['Review']

In [23]:
y = df['Rating']

In [24]:
df['Rating'].value_counts()

Rating  count
5       13131
4        5077
3        2871
2        1565
1         842


Train Test Split

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
# Hold out 70% of the rows for testing (test_size=0.7), stratified by rating
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.7,stratify=y,random_state=2529)

In [27]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((7045,), (16441,), (7045,), (16441,))
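
To see what `test_size=0.7` and `stratify` do, the split can be reproduced on synthetic labels. This is a minimal sketch, assuming made-up arrays `texts` and `labels` with an 80/20 class imbalance; only the `train_test_split` arguments are taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 1, 20 of class 0.
labels = np.array([1] * 80 + [0] * 20)
texts = np.array([f"review {i}" for i in range(100)])

# Same arguments as the notebook: 70% held out, stratified on the label.
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.7, stratify=labels, random_state=2529
)

print(len(tr_x), len(te_x))       # sizes of the training and test parts
print(tr_y.mean(), te_y.mean())   # share of class 1 in each part
```

Stratification keeps the class shares nearly identical in both parts. With `test_size=0.7` the model trains on the smaller part, which is why the notebook ends up with 7,045 training rows and 16,441 test rows.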

Convert Review Text to Token Counts

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
# Count the 5,000 most frequent 2- and 3-word n-grams, lowercased, with English stop words removed
cv = CountVectorizer(lowercase=True,analyzer='word',ngram_range=(2,3),stop_words='english',max_features=5000)

In [30]:
X_train=cv.fit_transform(X_train)

In [31]:
cv.get_feature_names_out()

array(['10 12', '10 bought', '10 fit', ..., 'yes runs', 'yoga pants',
       'zipper little'], dtype=object)

In [32]:
X_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [33]:
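# Caution: fit_transform refits the vocabulary on the test reviews, so these columns do not match the training columns; cv.transform(X_test) would reuse the training vocabulary.
X_test=cv.fit_transform(X_test)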

In [34]:
cv.get_feature_names_out()

array(['0p fit', '10 12', '10 dress', ..., 'years old', 'yellow color',
       'yoga pants'], dtype=object)

In [35]:
X_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Train the Model

In [36]:
from sklearn.naive_bayes import MultinomialNB

In [37]:
model=MultinomialNB()

In [39]:
model.fit(X_train,y_train)

Get Model Prediction

In [40]:
y_pred=model.predict(X_test)

In [41]:
y_pred.shape

(16441,)

In [42]:
y_pred

array([3, 4, 2, ..., 5, 4, 3])

Get Probability of Each Predicted Class

In [43]:
model.predict_proba(X_test)

array([[2.50243939e-01, 1.21534354e-01, 4.79327448e-01, 1.27392454e-02,
        1.36155013e-01],
       [7.07567572e-02, 4.58818848e-02, 2.63088820e-01, 6.12325394e-01,
        7.94714345e-03],
       [2.01162807e-01, 3.38200504e-01, 2.95746505e-01, 9.24481824e-02,
        7.24420016e-02],
       ...,
       [3.91842500e-03, 2.98245742e-03, 3.09715894e-04, 7.72959108e-03,
        9.85059811e-01],
       [1.61530659e-01, 3.51476805e-02, 5.52149762e-02, 5.43909305e-01,
        2.04197379e-01],
       [1.26442837e-01, 4.89792929e-02, 3.28750971e-01, 2.68648730e-01,
        2.27178169e-01]])
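
Each row of the `predict_proba` output holds one probability per class, in the order given by `model.classes_`. The sketch below illustrates this on a tiny made-up corpus; the names `docs`, `ratings`, and `new_counts` are hypothetical and the data is not from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews with two rating levels, for illustration only.
docs = ["love it", "great fit love", "poor quality", "bad fit poor", "great quality"]
ratings = [5, 5, 1, 1, 5]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(docs), ratings)

new_counts = vec.transform(["love the fit", "poor fabric"])
proba = clf.predict_proba(new_counts)

# Columns follow clf.classes_; each row is a probability distribution.
print(clf.classes_)
print(proba.sum(axis=1))

# predict() returns the class whose column holds the largest probability.
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(labels_from_proba, clf.predict(new_counts))
```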

Get Model Evaluation

In [44]:
from sklearn.metrics import confusion_matrix,classification_report

In [45]:
print(confusion_matrix(y_test,y_pred))

[[  37   76   66  130  280]
 [  83  166  182  203  462]
 [ 196  285  342  396  791]
 [ 386  392  457  695 1624]
 [ 885  837  963 1584 4923]]


In [46]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.02      0.06      0.03       589
           2       0.09      0.15      0.12      1096
           3       0.17      0.17      0.17      2010
           4       0.23      0.20      0.21      3554
           5       0.61      0.54      0.57      9192

    accuracy                           0.37     16441
   macro avg       0.23      0.22      0.22     16441
weighted avg       0.42      0.37      0.39     16441
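
The figures in the classification report can be recomputed from the confusion matrix, which makes the link between the two outputs explicit. A minimal sketch using only NumPy, with the matrix copied from the output above:

```python
import numpy as np

# Five-class confusion matrix copied from the output above
# (rows are true ratings 1 to 5, columns are predicted ratings).
cm = np.array([
    [37, 76, 66, 130, 280],
    [83, 166, 182, 203, 462],
    [196, 285, 342, 396, 791],
    [386, 392, 457, 695, 1624],
    [885, 837, 963, 1584, 4923],
])

correct = np.diag(cm)                  # correctly classified reviews per class
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(accuracy, 2))
```

The rounded values match the report: for example, rating 1 has 37 correct predictions out of 589 true cases (recall 0.06) and out of 1,587 predicted cases (precision 0.02).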



Recategorize Ratings as Poor (0) and Good (1)

In [47]:
df['Rating'].value_counts()

Rating  count
5       13131
4        5077
3        2871
2        1565
1         842


Map ratings 1, 2, and 3 to 0 and ratings 4 and 5 to 1

In [48]:
df.replace({'Rating':{1:0,2:0,3:0,4:1,5:1}},inplace=True)
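
The same recoding can be written as a threshold or as an explicit mapping. This is a minimal sketch on a made-up Series named `ratings`; the `counts` dictionary repeats the rating counts shown above to check how many reviews land in each new class.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([1, 2, 3, 4, 5, 5, 4], name="Rating")

# Ratings 1 to 3 become 0 (poor); ratings 4 and 5 become 1 (good).
binary = (ratings >= 4).astype(int)

# Equivalent explicit mapping, closer to the replace() call in the notebook.
mapped = ratings.map({1: 0, 2: 0, 3: 0, 4: 1, 5: 1})

# Class sizes implied by the rating counts printed above.
counts = {5: 13131, 4: 5077, 3: 2871, 2: 1565, 1: 842}
good = counts[4] + counts[5]
poor = counts[1] + counts[2] + counts[3]

print(binary.tolist(), mapped.tolist())
print(good, poor)
```

The binary target is imbalanced: 18,208 reviews are good and 5,278 are poor, so roughly 78% of the data belongs to class 1.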

In [49]:
y=df['Rating']

In [50]:
x=df['Review']

Train Test Split

In [51]:
from sklearn.model_selection import train_test_split

In [52]:
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.7,stratify=y,random_state=2529)

In [58]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((7045,), (16441,), (7045,), (16441,))

Convert Review Text to Token Counts

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
cv=CountVectorizer(lowercase=True,analyzer='word',ngram_range=(2,3),stop_words='english',max_features=5000)

In [61]:
X_train=cv.fit_transform(X_train)

In [62]:
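# Caution: fit_transform refits the vocabulary on the test reviews, so these columns do not match the training columns; cv.transform(X_test) would reuse the training vocabulary.
X_test=cv.fit_transform(X_test)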

Retrain the Model

In [63]:
from sklearn.naive_bayes import MultinomialNB

In [64]:
model=MultinomialNB()

In [65]:
model.fit(X_train,y_train)

Get Model Prediction

In [66]:
y_pred=model.predict(X_test)

In [67]:
y_pred.shape

(16441,)

In [68]:
y_pred

array([0, 0, 1, ..., 1, 1, 0])

Get Model Evaluation

In [69]:
from sklearn.metrics import confusion_matrix,classification_report

In [70]:
print(confusion_matrix(y_test,y_pred))

[[1125 2570]
 [2756 9990]]


In [71]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.29      0.30      0.30      3695
           1       0.80      0.78      0.79     12746

    accuracy                           0.68     16441
   macro avg       0.54      0.54      0.54     16441
weighted avg       0.68      0.68      0.68     16441
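
With an imbalanced binary target, accuracy is easier to judge next to a trivial baseline. The sketch below, using only NumPy and the confusion matrix printed above, compares the model's accuracy with the accuracy of always predicting the most common class.

```python
import numpy as np

# Binary confusion matrix copied from the output above
# (rows are true classes 0 and 1, columns are predicted classes).
cm = np.array([[1125, 2570],
               [2756, 9990]])

accuracy = np.trace(cm) / cm.sum()

# Accuracy of a trivial model that always predicts the most common true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 3), round(majority_baseline, 3))
```

Here the baseline is higher than the model's accuracy, so the 0.68 figure on its own overstates how well the classifier separates poor from good reviews.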



### Conclusion:

The project implemented a Multinomial Naïve Bayes model on the "Women Clothing E-Commerce Reviews" dataset, using the review text to predict the rating: first the five rating levels, then a binary split into poor (ratings 1 to 3) and good (ratings 4 and 5). Preprocessing consisted of filling missing reviews with the placeholder "No Review" and vectorizing the text with CountVectorizer, which lowercases the text, removes English stop words, and counts the 5,000 most frequent two- and three-word n-grams. No TF-IDF weighting was applied.

Multinomial Naïve Bayes is an efficient and widely used choice for text classification, but the results here are weak. On the 70% hold-out set, the five-class model reached 0.37 accuracy (macro F1 0.22). The binary model reached 0.68 accuracy (macro F1 0.54), which is below the roughly 0.78 that always predicting class 1 would achieve, since 12,746 of the 16,441 test reviews belong to that class. A likely cause is that the vectorizer was refit on the test set (fit_transform instead of transform), so the test features do not share the vocabulary the model was trained on. No cross-validation was performed.
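
A cross-validation run for this kind of model could be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the `docs` and `labels` lists are made up, and the pipeline uses a default `CountVectorizer` rather than the notebook's n-gram settings. Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews and binary labels (1 = good, 0 = poor), for illustration only.
docs = [
    "love this dress", "great fit and soft fabric", "perfect color love it",
    "so comfortable and flattering", "beautiful top great quality",
    "fits well love the color", "poor quality fabric", "runs small and itchy",
    "returned it bad fit", "cheap material very disappointed",
    "too tight and unflattering", "color faded bad quality",
]
labels = [1] * 6 + [0] * 6

# The pipeline refits the vectorizer inside each training fold, so the
# held-out fold is always transformed with the training vocabulary.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, docs, labels, cv=3)

print(scores, scores.mean())
```

On the real data, the same pipeline could be passed `df['Review']` and the binary rating column in place of the toy lists.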

The project illustrates how sentiment analysis can serve e-commerce businesses by turning raw customer feedback into insight on opinions and satisfaction. By predicting review sentiment, businesses can better understand customer needs, address concerns, and make informed decisions about product development and customer service.