**Women's Clothing E-Commerce Reviews**

![Image](https://moldkansascity.com/wp-content/uploads/2016/03/customer-reviews.png)

Welcome to my 4th kernel on Kaggle. In this notebook, I explore and analyze the review data. All reviews are valuable, and a mix of positive and negative reviews helps to improve consumer trust in the opinions they read. Reviews are also an important factor in increasing conversions.
[Here](https://econsultancy.com/blog/9366-ecommerce-consumer-reviews-why-you-need-them-and-how-to-use-them) is a good article on the topic: 'E-commerce consumer reviews: why you need them and how to use them'.

In this notebook, I explored the following:

1. Which age groups wrote the most, the fewest, and very few reviews
2. Which kinds of product (Department names) each age group bought
3. Which kinds of product (Class names) each age group bought
4. The number or percentage of Class names in each Department
5. The number of Department names in each Division
6. The number of Class names in each Division
7. The frequency of words in the Review Text column
8. Wordcloud of the Review Text column
9. How many reviews in the Review Text column are positive, negative, and neutral, based on the sentiment polarity value
10. Wordcloud of the positive reviews
11. Wordcloud of the negative reviews
12. Multinomial Naive Bayes to predict which reviews carry a rating of 5 and which a rating of 1
13. Multinomial Naive Bayes to predict which products are recommended and which are not

**Keywords: Multinomial Naive Bayes, WordCloud, TextBlob, Word Frequency, StopWords, Sentiments, NLP, NLTK**


*Required Libraries:-*

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling
%matplotlib inline

*Importing file into dataframe*

In [None]:
df_rough=pd.read_csv("../input/Womens Clothing E-Commerce Reviews.csv",index_col=False)
column_contain=['Clothing ID','Age','Title','Review Text','Rating','Recommended IND','Positive Feedback Count','Division Name','Department Name','Class Name']

In [None]:
df=pd.DataFrame(data=df_rough,columns=column_contain)
df.info()

*This dataset has 23486 entries and 10 columns. Some entries are missing in columns such as Title, Division Name, Department Name, and Class Name.*
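
*A minimal sketch of how the missing entries per column could be counted. The small frame below is made up for illustration; in the notebook the same call would be applied to `df`.*

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review data (hypothetical values).
toy = pd.DataFrame({
    "Title": ["Love it", np.nan, "Too small", np.nan],
    "Review Text": ["Great dress", "Nice top", np.nan, "Runs large"],
    "Rating": [5, 4, 2, 3],
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```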

In [None]:
df.head()

In [None]:
df.describe()

*Is there any correlation between a user's rating and the review length?*

In [None]:
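# Note: astype(str) turns missing reviews (NaN) into the literal string 'nan',
# so those rows get a length of 3 rather than being skipped.
df['Review Text']=df['Review Text'].astype(str)
df['Review Length']=df['Review Text'].apply(len)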

In [None]:
g = sns.FacetGrid(data=df, col='Rating')
g.map(plt.hist, 'Review Length', bins=50)

*From the above chart, we can say that users gave a rating of 5 most often. In fact, far fewer users gave a rating of 1 or 2.*

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x='Rating', y='Review Length', data=df)

*From the above boxplot, we can conclude that reviews with a rating of 3 or 4 tend to be longer.*

In [None]:
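# numeric_only=True keeps text columns out of the mean (required on recent pandas versions)
rating = df.groupby('Rating').mean(numeric_only=True)
rating.corr()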

In [None]:
sns.heatmap(data=rating.corr(), annot=True)

The above correlation map shows that most of the columns are only weakly correlated. Columns like Review Length and Positive Feedback Count are slightly correlated. The value of about -0.93, however, is a strong negative correlation rather than an absence of one: as Age grows, the length of the review decreases. Keep in mind that these coefficients are computed from the five per-rating means, not from individual reviews, so they describe the averages only.
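
*To check the relationship on individual reviews rather than on per-rating averages, the correlation can be computed on the rows directly. This is only a sketch on made-up numbers; the toy frame and its values are assumptions, not the dataset.*

```python
import pandas as pd

# Hypothetical values standing in for the real columns.
toy = pd.DataFrame({
    "Rating":        [1, 1, 2, 3, 3, 4, 5, 5],
    "Age":           [25, 60, 41, 33, 52, 47, 29, 38],
    "Review Length": [120, 300, 310, 420, 380, 400, 150, 260],
})

# Row-level Pearson correlation between all numeric columns.
row_level = toy.corr()

# Correlation of the per-rating means, as done in the cell above:
# only one point per rating remains, so the coefficients rest on very few points.
mean_level = toy.groupby("Rating").mean(numeric_only=True).corr()

print(row_level.loc["Age", "Review Length"])
print(mean_level.loc["Age", "Review Length"])
```

*The two numbers generally differ, which is why a coefficient taken from the grouped means should not be read as a statement about individual reviewers.*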

In [None]:
df.head()

In [None]:
df.groupby(['Rating', pd.cut(df['Age'], np.arange(0,100,10))])\
       .size()\
       .unstack(0)\
       .plot.bar(stacked=True)

*From the above barplot, we can say that the 10-20 age group gave the fewest ratings. This is not surprising, since teenagers are probably less engaged with online shopping and reviews. The 30-40 age group gave more 5 ratings than any other age group; in fact, this group wrote most of the reviews and ratings. Similarly, the age group above 70 contributed few reviews.*
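
*The age groups above come from `pd.cut` with edges built by `np.arange(0,100,10)`. The sketch below, on hypothetical ages, shows how the bins behave: each bin is closed on the right, and because the last edge is 90, any age above 90 is left unassigned (NaN) and would not appear in the plot.*

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 25, 34, 39, 40, 41, 67, 83])  # hypothetical ages

# Edges 0, 10, ..., 90 give right-closed bins (0, 10], (10, 20], ..., (80, 90]
bins = np.arange(0, 100, 10)
age_group = pd.cut(ages, bins)

# Count how many ages land in each bin.
counts = age_group.value_counts(sort=False)
print(counts)

# An age beyond the last edge falls outside every bin and becomes NaN.
too_old = pd.cut(pd.Series([95]), bins)
print(too_old)
```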

In [None]:
# Pass figsize to plot.bar: a separate plt.figure() call would only create an empty figure
df.groupby(['Department Name', pd.cut(df['Age'], np.arange(0,100,10))])\
       .size()\
       .unstack(0)\
       .plot.bar(stacked=True, figsize=(15,15))

*In the above barplot, I want to concentrate on the department and the age group. Women aged 20-70 were the most active online shoppers. We can conclude that they focused mostly on the Tops and Dresses departments, somewhat less on Bottoms, and least on the Trend department.*

In [None]:
# Pass figsize to plot.bar: a separate plt.figure() call would only create an empty figure
df.groupby(['Class Name', pd.cut(df['Age'], np.arange(0,100,10))])\
       .size()\
       .unstack(0)\
       .plot.bar(stacked=True, figsize=(15,15))


In [None]:
z=df.groupby(by=['Department Name'],as_index=False).count().sort_values(by='Class Name',ascending=False)

plt.figure(figsize=(10,10))
sns.set_style("whitegrid")
ax = sns.barplot(x=z['Department Name'],y=z['Class Name'], data=z)
plt.xlabel("Department Name")
plt.ylabel("Count")
plt.title("Counts Vs Department Name")

*The above barplot shows that the Tops department has the most entries, around 10500, followed by the Dresses department with around 6000 entries.*

In [None]:
w=df.groupby(by=['Division Name'],as_index=False).count().sort_values(by='Class Name',ascending=False)

plt.figure(figsize=(10,10))
sns.set_style("whitegrid")
ax = sns.barplot(x=w['Division Name'],y=w['Class Name'], data=w)
plt.xlabel("Division Name")
plt.ylabel("Count")
plt.title("Counts Vs Division Name")

*In our dataset, there are 3 divisions: General, General Petite, and Intimates. Going by the number of entries, General division products sold more than General Petite and Intimates products. There are around 14K entries in the General division, 8K in the General Petite division, and around 1600 in the Intimates division.*

In [None]:
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
import nltk
from nltk.corpus import stopwords
from nltk import sent_tokenize, word_tokenize
from wordcloud import WordCloud, STOPWORDS
import re

top_N = 100
# Join all reviews into one lowercase string
a = df['Review Text'].str.lower().str.cat(sep=' ')

# Replace punctuation and numbers with spaces, leaving only letters
b = re.sub('[^A-Za-z]+', ' ', a)

# Build a combined stopword set (a set makes membership tests fast)
stop_words = set(get_stop_words('en'))
stop_words.update(stopwords.words('english'))

# Remove all the stopwords from the text
word_tokens = word_tokenize(b)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Remove words of one or two characters
without_single_chr = [word for word in filtered_sentence if len(word) > 2]

# Remove numeric tokens (none should remain after the regex above)
cleaned_data_title = [word for word in without_single_chr if not word.isnumeric()]

# Calculate frequency distribution
word_dist = nltk.FreqDist(cleaned_data_title)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])

plt.figure(figsize=(10,10))
sns.set_style("whitegrid")
ax = sns.barplot(x="Word",y="Frequency", data=rslt.head(7))

*The above barplot shows the frequency of the most common words in the Review Text column. The word "dress" appears most often. The word "love" comes second, which is an indicator of positive reviews.*
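
*The counting step can be sketched with the standard library alone. The sentence and the tiny stopword set below are made up for illustration; the notebook itself combines two much larger stopword lists and uses NLTK's tokenizer.*

```python
import re
from collections import Counter

text = "Love this dress! The dress fits well, and I love the color."

# Hypothetical mini stopword list; the notebook combines two larger lists.
stop_words = {"this", "the", "and", "i"}

# Lowercase, keep letters only, then split on whitespace.
tokens = re.sub("[^a-z]+", " ", text.lower()).split()

# Drop stopwords and words of one or two characters.
kept = [w for w in tokens if w not in stop_words and len(w) > 2]

# The two most frequent remaining words with their counts.
top_words = Counter(kept).most_common(2)
print(top_words)
```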

In [None]:
def wc(data,bgcolor,title):
    plt.figure(figsize = (100,100))
    # Named `cloud` so the local variable does not shadow the function name
    cloud = WordCloud(background_color = bgcolor, max_words = 1000,  max_font_size = 50)
    cloud.generate(' '.join(data))
    plt.imshow(cloud)
    plt.title(title)
    plt.axis('off')

In [None]:
wc(cleaned_data_title,'black','Most Used Words')

*The above wordcloud shows the most frequently used words in the Review Text column.*

In [None]:
from textblob import TextBlob

bloblist_desc = list()

df_review_str=df['Review Text'].astype(str)
for row in df_review_str:
    blob = TextBlob(row)
    bloblist_desc.append((row,blob.sentiment.polarity, blob.sentiment.subjectivity))
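
# Build the frame once, after the loop, instead of on every iteration.
# The columns are named after what TextBlob returns: polarity and subjectivity.
df_polarity_desc = pd.DataFrame(bloblist_desc, columns = ['Review','polarity','subjectivity'])

def f(row):
    # Label a review by the sign of its polarity score
    if row['polarity'] > 0:
        val = "Positive Review"
    elif row['polarity'] == 0:
        val = "Neutral Review"
    else:
        val = "Negative Review"
    return val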

df_polarity_desc['Sentiment_Type'] = df_polarity_desc.apply(f, axis=1)

plt.figure(figsize=(10,10))
sns.set_style("whitegrid")
ax = sns.countplot(x="Sentiment_Type", data=df_polarity_desc)


*According to the above graph, most reviews are positive, but this depends on the polarity threshold: I treated any polarity value > 0 as a Positive Review, exactly 0 as Neutral, and anything below 0 as Negative.*
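
*The same labelling rule can also be written without a row-wise `apply`. The sketch below uses `np.select` on a few hypothetical polarity scores; the frame and its values are assumptions for illustration, and the thresholds are the ones described above.*

```python
import numpy as np
import pandas as pd

# Hypothetical polarity scores in [-1, 1], e.g. as returned by TextBlob.
scores = pd.DataFrame({"polarity": [0.45, 0.0, -0.2, 0.05]})

conditions = [scores["polarity"] > 0, scores["polarity"] == 0]
labels = ["Positive Review", "Neutral Review"]

# Rows matching neither condition fall through to the default label.
scores["Sentiment_Type"] = np.select(conditions, labels, default="Negative Review")
print(scores)
```

*Note that a barely positive score such as 0.05 still counts as positive under this rule, which is one reason the positive class is so large.*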

In [None]:
positive_reviews=df_polarity_desc[df_polarity_desc['Sentiment_Type']=='Positive Review']
negative_reviews=df_polarity_desc[df_polarity_desc['Sentiment_Type']=='Negative Review']

In [None]:
wc(positive_reviews['Review'],'black','Most Used Words in Positive Reviews')

*The above wordcloud is only for the positive reviews.*

In [None]:
wc(negative_reviews['Review'],'black','Most Used Words in Negative Reviews')

*The above wordcloud is only for the negative reviews.*

In [None]:
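import string
# Build the stopword set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))

def text_process(review):
    # Drop punctuation characters, then rejoin into a single string
    nopunc=[char for char in review if char not in string.punctuation]
    nopunc=''.join(nopunc)
    # Compare in lowercase, but return words in their original casing
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]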

In [None]:
df['Review Text'].head(5).apply(text_process)

*This is how the above function works: it removes punctuation, splits the text into words, and drops stopwords, comparing each word in lowercase. The words that remain keep their original casing.*
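
*A self-contained sketch of the same steps, runnable without NLTK. The function name `text_process_demo` and the small stopword set are hypothetical stand-ins, used only to make the behaviour visible on one sentence.*

```python
import string

# Hypothetical mini stopword set standing in for NLTK's English list.
STOP = {"i", "the", "is", "and", "it", "this"}

def text_process_demo(review):
    # Drop punctuation character by character, then rejoin.
    no_punct = "".join(ch for ch in review if ch not in string.punctuation)
    # Compare in lowercase, but keep each word's original casing.
    return [w for w in no_punct.split() if w.lower() not in STOP]

tokens = text_process_demo("I LOVE this dress, and it is SO comfy!")
print(tokens)
```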

**Vectorization**

*At this point, the review text can be turned into a list of tokens (with no punctuation or stopwords).
We can use Scikit-learn’s CountVectorizer to convert the text collection into a matrix of token counts. You can picture the result as a 2-D matrix in which each row is a review and each column is a unique word.*
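
*A small sketch of what the count matrix looks like. The three reviews are made up, and the default analyzer is used here instead of `text_process`, so this only illustrates the layout of the matrix.*

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["love this dress", "dress runs small", "love love the fit"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(reviews)

# One row per review, one column per vocabulary word.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
print(matrix.toarray())
```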

In [None]:
df=df.dropna(axis=0,how='any')
rating_class = df[(df['Rating'] == 1) | (df['Rating'] == 5)]
X_review=rating_class['Review Text']
y=rating_class['Rating']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer=CountVectorizer(analyzer=text_process).fit(X_review)

In [None]:
print(len(bow_transformer.vocabulary_))

*The number above is the size of the vocabulary stored in the vectorizer (based on X_review).*

In [None]:
X_review = bow_transformer.transform(X_review)

**Training Data and Test Data**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)

**Train our model**

*To predict the rating of a review, we will use the Multinomial Naive Bayes algorithm, since it works well with text data.*
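
*For intuition, here is a hand-rolled sketch of the scoring rule behind Multinomial Naive Bayes: the log prior of a class plus the summed log likelihoods of the words, with add-one smoothing (`alpha=1.0`, which is also the default in scikit-learn's `MultinomialNB`). The training set and the helper names `log_score` and `predict` are hypothetical; this is not the library's implementation.*

```python
import math
from collections import Counter

# Tiny hypothetical training set: (tokens, rating).
train = [
    (["love", "dress"], 5),
    (["love", "fit"], 5),
    (["poor", "fit"], 1),
]
vocab = {w for tokens, _ in train for w in tokens}

def log_score(tokens, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    docs = [t for t, y in train if y == label]
    counts = Counter(w for t in docs for w in t)
    total = sum(counts.values())
    score = math.log(len(docs) / len(train))
    for w in tokens:
        score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # Pick the rating whose score is highest.
    return max((1, 5), key=lambda label: log_score(tokens, label))

print(predict(["love", "love"]))
print(predict(["poor"]))
```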

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)

**Testing the model**

In [None]:
predict=nb.predict(X_test)

*Now that we have the predictions, the most important task is to evaluate the model against the actual ratings (stored in y_test) using confusion_matrix and classification_report from Scikit-learn.*

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

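*Whoa! Our model scored about 95% on the test set. This suggests the business can predict from the review text whether a user liked the product or not. Since 5 ratings are far more common than 1 ratings, the per-class precision and recall in the report deserve a look as well.*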
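
*To see why the per-class numbers matter, here is a sketch that derives accuracy, recall, and precision from a confusion matrix by hand. The counts are invented for illustration and are not the notebook's results; the layout follows scikit-learn, where rows are true labels and columns are predicted labels.*

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels, ordered [1, 5].
cm = [[30, 20],
      [10, 940]]

tn, fp = cm[0]   # true rating 1: predicted as 1 / predicted as 5
fn, tp = cm[1]   # true rating 5: predicted as 1 / predicted as 5

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_rating_1 = tn / (tn + fp)      # share of real 1-star reviews caught
precision_rating_1 = tn / (tn + fn)   # share of predicted 1-star that are right

print(accuracy, recall_rating_1, precision_rating_1)
```

*With these made-up counts the overall accuracy is high even though a large share of the 1-star reviews is missed, which is the pattern to watch for when one class dominates.*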

**Now let's test the model with the data**

In [None]:
rating_positive=df['Review Text'][3]
rating_positive

*First, I want to test with a positive review. I have chosen the above review, whose rating is 5, so the model should predict a rating of 5.*

In [None]:
rating_positive_transformed = bow_transformer.transform([rating_positive])
nb.predict(rating_positive_transformed)[0]

*Second, I want to test with a negative review. I have chosen the review below, whose rating is 1, so the model should predict a rating of 1.*

In [None]:
rating_negative=df['Review Text'][61]
rating_negative

In [None]:
rating_negative_transformed = bow_transformer.transform([rating_negative])
nb.predict(rating_negative_transformed)[0]

**Now, I want to predict which product is recommended and which is not**

In [None]:
X_predict_recommend=df['Review Text']
y_recommend=df['Recommended IND']

bow_transformer=CountVectorizer(analyzer=text_process).fit(X_predict_recommend)

X_predict_recommend = bow_transformer.transform(X_predict_recommend)

X_train, X_test, y_train, y_test = train_test_split(X_predict_recommend, y_recommend, test_size=0.3, random_state=101)

nb = MultinomialNB()
nb.fit(X_train, y_train)

predict_recommendation=nb.predict(X_test)


print(confusion_matrix(y_test, predict_recommendation))
print('\n')
print(classification_report(y_test, predict_recommendation))

*For the second prediction (Product Recommendation vs Review Text), the model scored about 87%. Now let's test it on the two reviews.*
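
*One possible refinement, sketched here on a made-up mini corpus: wrapping the vectorizer and the classifier in a scikit-learn pipeline. The vocabulary is then learned from the training reviews only, and the recommendation model lives in its own object instead of overwriting the earlier `bow_transformer` and `nb`. The texts, labels, and the name `model` are assumptions for illustration.*

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini corpus; 1 = recommended, 0 = not recommended.
texts = ["love it", "great fit", "love the fabric", "perfect dress",
         "poor quality", "returned it", "too small", "poor fit"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# stratify keeps the class balance the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=101, stratify=labels)

# The vectorizer is fitted inside the pipeline, on training text only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_tr, y_tr)

predictions = model.predict(X_te)
print(list(predictions))
```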

In [None]:
rating_positive

In [None]:
rating_positive_transformed = bow_transformer.transform([rating_positive])
nb.predict(rating_positive_transformed)[0]

*In the above block, our model predicted correctly: it was a positive review and the product was recommended, hence we got the value 1.*

In [None]:
rating_negative

In [None]:
rating_negative_transformed = bow_transformer.transform([rating_negative])
nb.predict(rating_negative_transformed)[0]

*For the above scenario, our model predicted correctly: it was a negative review and the product was not recommended, hence we got the value 0.*

**Thank you so much for viewing this work. If you like this, please upvote and do comment.**