<a href="https://colab.research.google.com/github/Nehagound/Machine-Learning/blob/main/Women_Cloth_Reviews_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Women's Clothing Reviews Prediction with Multinomial Naive Bayes


**Objective**

The objective is to build a Multinomial Naive Bayes model that predicts customer sentiment, expressed as the review's rating, from the text of women's clothing reviews. The workflow covers text cleaning, vectorization, model training, and evaluation. Classifying reviews at scale helps brands spot trends in customer feedback, improve products, and make data-driven decisions about their offerings.



**Data Source:** YBIFoundation/ProjectHub-MachineLearning, Women Clothing E-Commerce Review dataset


**Import Library**



In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

**Import Dataset**

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/YBIFoundation/ProjectHub-MachineLearning/main/Women%20Clothing%20E-Commerce%20Review.csv")

**Describe Data**

The dataset contains 23,486 customer reviews of women's clothing. Each row holds the free-text `Review` (comments on fit, quality, style, and overall satisfaction), an optional `Title`, a `Rating` from 1 to 5, a `Recommended` flag, a `Positive Feedback` value, the reviewer's `Age`, and the product's `Clothing ID`, `Division`, `Department`, and `Category`. The rating serves as the sentiment label. Before modeling, the review text is tokenized and converted into numerical features with `CountVectorizer`; `TfidfVectorizer` is a common alternative.

**Data Visualisation**

In [None]:
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review,Rating,Recommended,Positive Feedback,Division,Department,Category
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Clothing ID        23486 non-null  int64 
 1   Age                23486 non-null  int64 
 2   Title              19676 non-null  object
 3   Review             22641 non-null  object
 4   Rating             23486 non-null  int64 
 5   Recommended        23486 non-null  int64 
 6   Positive Feedback  23486 non-null  int64 
 7   Division           23472 non-null  object
 8   Department         23472 non-null  object
 9   Category           23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB


In [None]:
df.shape

(23486, 10)

**Data Preprocessing**

In [None]:
df.isna().sum()

Unnamed: 0,0
Clothing ID,0
Age,0
Title,3810
Review,845
Rating,0
Recommended,0
Positive Feedback,0
Division,14
Department,14
Category,14


In [None]:
# Treat empty strings as missing, then fill every missing review with a placeholder.
# Assigning the result back avoids a chained inplace call, which pandas warns will stop working.
df['Review'] = df['Review'].replace("", np.nan).fillna("No Review")
df.isna().sum()

Unnamed: 0,0
Clothing ID,0
Age,0
Title,3810
Review,0
Rating,0
Recommended,0
Positive Feedback,0
Division,14
Department,14
Category,14


In [None]:
df['Review']

Unnamed: 0,Review
0,Absolutely wonderful - silky and sexy and comf...
1,Love this dress! it's sooo pretty. i happene...
2,I had such high hopes for this dress and reall...
3,"I love, love, love this jumpsuit. it's fun, fl..."
4,This shirt is very flattering to all due to th...
...,...
23481,I was very happy to snag this dress at such a ...
23482,"It reminds me of maternity clothes. soft, stre..."
23483,"This fit well, but the top was very see throug..."
23484,I bought this dress for a wedding i have this ...


**Defining Target Variable(y) and Feature Variable (X)**

In [None]:
df.columns

Index(['Clothing ID', 'Age', 'Title', 'Review', 'Rating', 'Recommended',
       'Positive Feedback', 'Division', 'Department', 'Category'],
      dtype='object')

In [None]:
x=df['Review']
y=df['Rating']
df['Rating'].value_counts()

Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
5,13131
4,5077
3,2871
2,1565
1,842


**Train Test Split**

In [None]:
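from sklearn.model_selection import train_test_split
# test_size=0.7 holds out 70% of the rows for testing and trains on the remaining 30%;
# stratify=y keeps the rating distribution the same in both splits.
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.7,stratify=y,random_state=2529)
x_train.shape,x_test.shape,y_train.shape,y_test.shape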

((7045,), (16441,), (7045,), (16441,))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(lowercase=True,analyzer='word',ngram_range=(2,3),stop_words='english',max_features=5000)
x_train=cv.fit_transform(x_train)
cv.get_feature_names_out()

array(['10 12', '10 bought', '10 fit', ..., 'yes runs', 'yoga pants',
       'zipper little'], dtype=object)

In [None]:
x_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
# Reuse the training vocabulary; fit_transform would relearn it from the test set and misalign the columns.
# The outputs recorded below came from a run that used fit_transform here, so a re-run will differ.
x_test=cv.transform(x_test)

In [None]:
x_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

**Modeling**

In [None]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(x_train,y_train)

In [None]:
y_pred=model.predict(x_test)
y_pred.shape

(16441,)

In [None]:
y_pred

array([3, 4, 2, ..., 5, 4, 3])

In [None]:
model.predict_proba(x_test)

array([[2.50243939e-01, 1.21534354e-01, 4.79327448e-01, 1.27392454e-02,
        1.36155013e-01],
       [7.07567572e-02, 4.58818848e-02, 2.63088820e-01, 6.12325394e-01,
        7.94714345e-03],
       [2.01162807e-01, 3.38200504e-01, 2.95746505e-01, 9.24481824e-02,
        7.24420016e-02],
       ...,
       [3.91842500e-03, 2.98245742e-03, 3.09715894e-04, 7.72959108e-03,
        9.85059811e-01],
       [1.61530659e-01, 3.51476805e-02, 5.52149762e-02, 5.43909305e-01,
        2.04197379e-01],
       [1.26442837e-01, 4.89792929e-02, 3.28750971e-01, 2.68648730e-01,
        2.27178169e-01]])

**Model Evaluation**

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,y_pred))

[[  37   76   66  130  280]
 [  83  166  182  203  462]
 [ 196  285  342  396  791]
 [ 386  392  457  695 1624]
 [ 885  837  963 1584 4923]]


In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.02      0.06      0.03       589
           2       0.09      0.15      0.12      1096
           3       0.17      0.17      0.17      2010
           4       0.23      0.20      0.21      3554
           5       0.61      0.54      0.57      9192

    accuracy                           0.37     16441
   macro avg       0.23      0.22      0.22     16441
weighted avg       0.42      0.37      0.39     16441



In [None]:
df['Rating'].value_counts()

Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
5,13131
4,5077
3,2871
2,1565
1,842


In [None]:
# Collapse the five ratings into two classes: 1-3 become 0 (low), 4-5 become 1 (high).
df.replace({'Rating': {1:0,2:0,3:0,4:1,5:1}},inplace=True)
y=df['Rating']
x=df['Review']

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.7,stratify=y,random_state=2529)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((7045,), (16441,), (7045,), (16441,))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(lowercase=True,analyzer='word',ngram_range=(2,3),stop_words='english',max_features=5000)
x_train=cv.fit_transform(x_train)
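# Transform only: the test set must reuse the vocabulary fitted on the training set.
# The metrics recorded below came from a run that used fit_transform here, so a re-run will differ.
x_test=cv.transform(x_test)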

In [None]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(x_train,y_train)

**Model Prediction**

In [None]:
y_pred=model.predict(x_test)
y_pred.shape

(16441,)

In [None]:
y_pred

array([0, 0, 1, ..., 1, 1, 0])

**Model Evaluation**

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,y_pred))

[[1125 2570]
 [2756 9990]]


In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.29      0.30      0.30      3695
           1       0.80      0.78      0.79     12746

    accuracy                           0.68     16441
   macro avg       0.54      0.54      0.54     16441
weighted avg       0.68      0.68      0.68     16441
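
The figures in the classification report can be recovered by hand from the binary confusion matrix printed above. The sketch below assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Binary confusion matrix as printed above: rows = true class, columns = predicted class.
cm = np.array([[1125, 2570],
               [2756, 9990]])

# Precision divides correct predictions by everything predicted as that class (column sums).
precision = np.diag(cm) / cm.sum(axis=0)
# Recall divides correct predictions by everything truly in that class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Accuracy is the share of all reviews on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2), recall.round(2), f1.round(2), round(float(accuracy), 2))
```

The rounded values match the report: precision 0.29 and 0.80, recall 0.30 and 0.78, F1 0.30 and 0.79, and accuracy 0.68.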



**Explanation**

This project builds a model that predicts the rating of a women's clothing review with Multinomial Naive Bayes. The dataset's `Review` column supplies the text and the `Rating` column (1 to 5) supplies the label; the second experiment collapses ratings 1 to 3 into class 0 and ratings 4 and 5 into class 1. Missing reviews are filled with the placeholder "No Review". Lowercasing, tokenization, and English stop-word removal are handled by `CountVectorizer`. Stemming and lemmatization are not applied here, though either could be added to standardize the text further.

Next, `CountVectorizer` transforms each review into a vector of counts. In this notebook the features are the 5,000 most frequent word bigrams and trigrams (`ngram_range=(2,3)`, `max_features=5000`). `TfidfVectorizer` is a common alternative that down-weights n-grams appearing in many reviews, so the vector reflects importance rather than raw frequency.

The Multinomial Naive Bayes model, which is well suited to count features such as these, is then trained on the vectors. It estimates the probability of each feature given each class, combines those estimates with the class priors, and assigns a new review to the class with the highest resulting probability.
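
A small worked example may help show what the model learns. The count matrix below is hypothetical (two made-up features, four made-up reviews); with the default `alpha=1.0`, each per-class word probability is (count + 1) / (total count in class + number of features).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: columns are counts of the words ["love", "return"].
X_toy = np.array([[3, 0], [2, 1], [0, 2], [0, 3]])
y_toy = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative (made-up labels)

nb = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
nb.fit(X_toy, y_toy)

# exp(feature_log_prob_) is the smoothed P(word | class), one row per class (0, then 1).
word_probs = np.exp(nb.feature_log_prob_)
print(word_probs)

# A new review that contains "love" twice and "return" never.
new_review = np.array([[2, 0]])
print(nb.predict(new_review), nb.predict_proba(new_review).round(3))
```

For class 0 the counts are 0 and 5, giving probabilities 1/7 and 6/7; for class 1 they are 5 and 1, giving 6/8 and 2/8. The new review is therefore assigned to class 1.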

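Evaluation here uses a single stratified hold-out split, a confusion matrix, and per-class precision, recall, and F1-score. In the run recorded above, the five-class model reaches an accuracy of 0.37 and the binary model 0.68. The binary model recovers only 30% of the class-0 reviews, and its accuracy is below the 0.78 that always predicting class 1 would score on this test set, so accuracy alone overstates its usefulness. Cross-validation would give a more stable estimate of performance than a single split.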
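
Cross-validation is not run in the cells above. As a minimal sketch of how it could be set up, the example below wraps the vectorizer and the classifier in one scikit-learn pipeline, so the vocabulary is refitted on the training portion of every fold. The eight reviews and their labels are invented for illustration, and the fold count is chosen only to suit such a tiny sample.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented reviews and labels (1 = positive, 0 = negative), for illustration only.
texts = [
    "love this dress fits great",
    "great fabric love the color",
    "fits great very comfortable",
    "love the color very flattering",
    "poor quality returned it",
    "runs small returned it",
    "cheap fabric poor fit",
    "runs small poor quality",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits the vectorizer only on each fold's training portion.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=folds, scoring="accuracy")

print(scores, scores.mean())
```

On the real data the same pattern would take `df['Review']` and `df['Rating']` in place of the toy lists, with more folds.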

Accurate classification helps brands understand customer opinions, identify trends, improve products, and raise satisfaction. Multinomial Naive Bayes is fast to train and scales to large review collections, which makes it a practical starting point for this kind of sentiment analysis in the fashion industry.