## INTRODUCTION

### Welcome to my first, and one of my favourite, NLP analyses!

In these times of quarantine, online shopping has proven to be one of the greatest stress-busters for women!😍 <br>
While all of us are on our journeys to becoming shopaholics or broke 😁, let us analyze the importance of CUSTOMER REVIEWS while shopping online!

Reportedly, **61%** of customers read online reviews before making a purchase decision, which makes reviews essential for *e-commerce sites*. Also, according to Reevoo, reviews produce an average **18%** uplift in sales. Hence ***USER REVIEWS*** are proven sales drivers, and something the majority of customers will want to see before deciding to make a purchase.

<img src="https://media.giphy.com/media/cqw80XStn460U/giphy.gif">



**Customer segmentation**, on the other hand, helps in targeted marketing, new customer acquisitions and hence more successful campaigns. 
Today we are going to perform Customer Segmentation by clustering the valuable **Customer Reviews**.

I have used the simplest methods I could throughout this notebook, to keep it easy to follow for beginners such as myself!

In this project you will also get a glimpse of:
- Text preprocessing for NLP (stopword removal, lemmatization, part-of-speech tagging)
- Sentiment analysis (VADER sentiment)
- Dimensionality reduction (PCA + t-SNE)
- Clustering (K-means)
- Topic Modeling (LDA)
<br><br>
working together to provide **meaningful insights**!
<br>
LET'S DIVE IN!

### 1. Importing data and libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import seaborn as sns
import time

# nltk
import nltk
from nltk.corpus import stopwords
stoplist= stopwords.words('english')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer= WordNetLemmatizer()
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings('ignore')

# Enable logging
import logging
logging.basicConfig(level= logging.INFO)

# You will find more libraries as they come in use!

In [None]:
df= pd.read_csv("/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.columns= df.columns.str.replace(" ", "_")
df.head()

### 2. Feature Engineering

In [None]:
# Relation between Division name, Department name and Class name? 
df[['Division_Name','Department_Name','Class_Name']].groupby(['Division_Name','Department_Name','Class_Name']).agg('count')

In [None]:
# Since the reviews are our main content, drop rows where 'Review_Text' is null
df.dropna(subset=['Review_Text'], inplace=True)
df.shape

In [None]:
# Review word count
df['rev_word_count']= df['Review_Text'].apply(lambda x: len(x.strip().split()))

# Unique word count
df['unique_word_count']= df['Review_Text'].apply(lambda x: len(set(str(x).split())))

In [None]:
# Bucketing rare Clothing IDs (fewer than 200 reviews) into a single default id

clothing_id_to_combine=[]
for val, cnt in df.Clothing_ID.value_counts().items():   # .iteritems() was removed in pandas 2.0
    # If that Clothing_ID appears in less than ~1% (~200 rows) of the total data, club it into the '000' (default) id
    if cnt < 200:
        clothing_id_to_combine.append(val)

print("# of clothing ID's clubbed: ",len(clothing_id_to_combine))

df['new_clothingID']= df.Clothing_ID.apply(lambda x: '000' if x in clothing_id_to_combine else x)
df.new_clothingID.value_counts(normalize=True)

### 3. Sentiment Analysis

Calculating Title and Review sentiment scores using [VADER](https://github.com/cjhutto/vaderSentiment) Sentiment! <br>
VADER (Valence Aware Dictionary and sentiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

In [None]:
%pip install vaderSentiment

In [None]:
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer= SentimentIntensityAnalyzer()

# Try it out!
vs= analyzer.polarity_scores("Vader sentiment looks interesting, I have high hopes!")
print(vs)

The **compound score** obtained from valence scores is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
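To make the normalization step a bit more concrete, here is a minimal sketch of how an unbounded sum of valence scores can be squashed into the [-1, +1] range. As far as I understand the vaderSentiment source, it uses this form with a constant `alpha = 15`; treat both the function name `normalize_compound` and the default value as assumptions rather than the library's documented API.

```python
import math

def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1].

    alpha=15 is an assumption about VADER's default, not a documented guarantee.
    """
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))

# Small sums stay close to zero, large sums approach +/-1.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_compound(raw), 4))
```

The useful property is that the mapping is monotonic and symmetric, so a review with a larger raw sum never ends up with a smaller compound score.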

In [None]:
df['review_sentiment']= [analyzer.polarity_scores(line)['compound'] for line in df['Review_Text']]

After analyzing certain Title sentiments, we notice that titles with 0 scores consist of positive, negative and neutral sentiments:

*Negative sentiments with 0 score*:<br>
Falls flat  >>>>> 0.0<br>
Runs short  >>>>> 0.0<br>

*Positive sentiments with 0 score*:<br>
Must have  >>>>> 0.0<br>
Comfy  >>>>> 0.0<br>
Stylish and versatile!  >>>>> 0.0 <br>

*Neutral sentiments with 0 score*:<br>
Simple, stylish, lovely-runs a bit big  >>>>> 0.0<br>
Some things you should know...  >>>>> 0.0<br>
Mehh  >>>>> 0.0<br>
Neutral blue  >>>>> 0.0<br>

**Hence we assign a score of 0.0 to 'no title' in the cell below.**

In [None]:
# The Title provides extra insight into the customer's sentiment while writing the review, so we also compute a Title sentiment score along with the review sentiment score.
# However, 13% of reviews don't have a Title. Hence we fill the null values with 'no title' and assign them a 0 (neutral) sentiment.

df['Title']= df['Title'].fillna('no title')   # assign back instead of a chained inplace fillna
df['title_sentiment']= df['Title'].apply(lambda x: analyzer.polarity_scores(x)['compound'] if str(x)!= 'no title' else 0.0)

for index, row in df[100:120].iterrows():
    print(row['Title']," >>>>>", row['title_sentiment'])

In [None]:
df['total_sentiment_score']= df['title_sentiment']+ df['review_sentiment']

In [None]:
# Golden rule: Save up the original dataframe before encoding!
df_orig= df.copy()
df_orig.shape 

# df= df_orig.copy()

### Feature Encoding and prepping up our data for clustering!

In [None]:
# Dropping text columns - we have already used them to calculate the total sentiment score

df.drop(columns=['Review_Text','Title','Clothing_ID','review_sentiment','title_sentiment'], inplace=True)

In [None]:
# Cast the categorical columns from object to category dtype (object columns are very slow to process)

cat_cols= ['Division_Name','Department_Name','Class_Name','new_clothingID']
for col in cat_cols:
    print(col," has categories:", df[col].nunique())
    df[col]= df[col].astype('category')

In [None]:
# Binary encoding our categorical columns

import category_encoders as ce

be= ce.BinaryEncoder(cols= cat_cols,drop_invariant=True).fit(df) 

df= be.transform(df)
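
If binary encoding is new to you, the idea can be sketched without any library: each category gets an integer, and that integer is then written out in binary, one digit per column. The helper `binary_encode` and the sample class names below are purely illustrative, and the exact column layout produced by `category_encoders` may differ from this sketch.

```python
def binary_encode(values):
    """Map each category to 1, 2, 3, ... in order of first appearance,
    then write that integer in binary, one digit per output column."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Number of binary digits needed for the largest code.
    width = max(codes.values()).bit_length()
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]

# Hypothetical sample of class names, for illustration only.
classes = ["Dresses", "Knits", "Blouses", "Dresses", "Pants", "Jeans"]
for name, row in zip(classes, binary_encode(classes)):
    print(name, row)
```

Five categories need only three columns here, whereas one-hot encoding would need five. That is the main reason to prefer it for high-cardinality columns.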

## Clustering Begins!

After some experimentation, it is evident that PCA followed by t-SNE gives the most identifiable and clear clusters in low dimensions. Refer to the [example](https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b) below:

![image.png](attachment:image.png)
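
The pipeline further down passes `n_components=0.90` to PCA, which asks for as many components as are needed to explain 90% of the variance. A minimal sketch of that selection rule, assuming the explained-variance ratios are already sorted in descending order (the helper name and the numbers are made up for illustration):

```python
from itertools import accumulate

def components_for_variance(explained_ratios, target=0.90):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio exceeds `target`."""
    for k, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative > target:
            return k
    return len(explained_ratios)

# Hypothetical explained-variance ratios, sorted in descending order.
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]
print(components_for_variance(ratios))
```

Running PCA first keeps most of the signal while handing t-SNE fewer dimensions to work with, which is why the two are commonly chained.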

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy import stats
from sklearn.cluster import KMeans
import pylab as pl
%matplotlib inline
import matplotlib.pyplot as plt

Note: While tuning t-SNE, the intuition behind perplexity (K) is how many neighbors each data point can “sense”, and a common rule of thumb is K ~ N^(1/2). Hence in our case K ~ 150 (the pipeline below uses a slightly higher value, 170).
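
The square-root rule of thumb from the note is easy to check for any dataset size. The sketch below uses a round, hypothetical N of 22,500 rows (chosen only because it is consistent with K ~ 150); plug in `len(df)` for the real figure.

```python
import math

def perplexity_rule_of_thumb(n_points: int) -> int:
    """Square-root heuristic for a starting t-SNE perplexity."""
    return round(math.sqrt(n_points))

# 22_500 is a hypothetical round number of reviews, not the exact dataset size.
print(perplexity_rule_of_thumb(22_500))
```

This is only a starting point: it is worth trying a few values around it and comparing the resulting plots.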

In [None]:
pca_tsne= Pipeline([("pca", PCA(n_components= 0.90, random_state=33)),
                    ("tsne", TSNE(n_components=2,
                                  perplexity= 170,
                                  random_state=33, 
                                  learning_rate= 350, 
                                  n_iter= 5000,
                                  n_jobs=-1,
                                  n_iter_without_progress=150,
                                  verbose=1))])
t0= time.time()
df_pca_tsne_reduced= pca_tsne.fit_transform(df)
t1= time.time()

print("pca+tsne took:{:.1f}s ".format(t1-t0))

In [None]:
sns.set(rc= {'figure.figsize': (13,13)})
sns.scatterplot(x=df_pca_tsne_reduced[:,0], y=df_pca_tsne_reduced[:,1])   # keyword x/y are required by newer seaborn versions
plt.show()

## K-Means Clustering

One can easily spot the clusters above. 
Now let us have the opinion of K-Means as well! 
We will color by K-Means clustering to figure out if both the algorithms agree on the clustering!

In [None]:
review_data_std = stats.zscore(df_pca_tsne_reduced)
review_data_std = np.array(review_data_std)

sns.set(rc= {'figure.figsize': (7,7)})
number_of_clusters = range(1,20)

t0= time.time()
# n_jobs is no longer accepted by KMeans (removed in scikit-learn 1.0)
kmeans = [KMeans(n_clusters=i,max_iter=1000,random_state=33) for i in number_of_clusters]
score = [-1*kmeans[i].fit(df_pca_tsne_reduced).score(df_pca_tsne_reduced) for i in range(len(kmeans))]
t1= time.time()

pl.plot((number_of_clusters),score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

print("Plotting the Elbow curve took:{:.1f}s ".format(t1-t0))
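
For readers wondering what the elbow curve actually plots: `KMeans.score` returns the negative of the K-Means objective, so `-1 * score` is the within-cluster sum of squared distances (often called inertia). Here is a minimal sketch of that quantity on four hypothetical 2-D points, with the labels and centroids supplied by hand:

```python
def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Two obvious groups, far apart on the x-axis (toy data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = inertia(points, [0, 0, 0, 0], [(5.0, 1.0)])
two_clusters = inertia(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)
```

The sharp drop from one to two clusters is the "elbow". Beyond the natural number of groups, adding more clusters reduces the value only slowly.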

In [None]:
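# n_jobs is no longer accepted by KMeans (removed in scikit-learn 1.0)
k_means_test = KMeans(n_clusters=3, max_iter=1500, random_state=33, verbose=1)

# Fit the model; -1 * score is the within-cluster sum of squares (lower is better)
-1*k_means_test.fit(df_pca_tsne_reduced).score(df_pca_tsne_reduced)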
y_pred= k_means_test.labels_

# Assigning cluster labels to each data point
df_orig['klabels'] = k_means_test.labels_

In [None]:
# Analyzing 
size_of_each_cluster= df_orig.groupby('klabels').size().reset_index()
size_of_each_cluster.columns = ['klabels','number_of_points']
size_of_each_cluster['percentage'] = (size_of_each_cluster['number_of_points']/np.sum(size_of_each_cluster['number_of_points']))*100

print(size_of_each_cluster)

In [None]:
palette = sns.hls_palette(3, l=.4, s=.9)

sns.set(rc= {'figure.figsize': (13,13)})
sns.scatterplot(x=df_pca_tsne_reduced[:,0], y=df_pca_tsne_reduced[:,1], hue= y_pred, legend='full', palette=palette)
plt.title("t-sne with KMeans labels")
plt.show()

As we can see, the K-Means clusters closely match the groups visible in the PCA + t-SNE projection. Together they have produced some classic clustering. <br>
Apart from clustering the reviews together, we would also like to understand the ***'meaning of each cluster'***. This can be achieved via **TOPIC MODELING**. <br>

Hence, we will now attempt to find the most significant words in each cluster. K-Means clustered the reviews but did not label the topics. Through topic modeling we will find out what the most important terms for each cluster are. This adds more meaning to each cluster by giving **keywords** to quickly identify its themes.

## Topic Modeling- Latent Dirichlet Allocation(LDA) 

For topic modeling, we will use the famous LDA (Latent Dirichlet Allocation) algorithm. In LDA, each document can be described by a distribution of topics and each topic can be described by a distribution of words.
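
To see how the two distributions fit together, here is a minimal sketch: the probability of a word in a document is the topic-weighted average of that word's probability in each topic. The topic names, words and numbers are all made up for illustration and are not the output of the model trained later.

```python
def word_probability(word, doc_topics, topic_words):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

# Hypothetical topic -> word distributions (each sums to 1).
topic_words = {
    "sizing": {"small": 0.5, "petite": 0.3, "fabric": 0.2},
    "material": {"fabric": 0.6, "soft": 0.3, "small": 0.1},
}
# Hypothetical document -> topic distribution (sums to 1).
doc_topics = {"sizing": 0.75, "material": 0.25}

print(word_probability("fabric", doc_topics, topic_words))
```

Training LDA amounts to finding the two sets of distributions that best explain the word counts observed in the reviews.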

STEP 1: Preprocessing text - Tokenizing sentences, stopwords removal and lemmatization

In [None]:
def get_pos_tag(tag):
    """Map a Penn Treebank POS tag to the WordNet POS constant used for lemmatization"""
    
    if tag.startswith('J'):
        return wordnet.ADJ    # adjectives are lemmatized as adjectives, not as nouns
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN   # nouns and the default case

In [None]:
import re
def preprocess(text):
    """ 1. Tokenizes the text
        2. Removes punctuation and words of 3 letters or fewer
        3. Converts to lowercase
        4. Lemmatizes words
        5. Removes stopwords
    """   
    punctuation= list(string.punctuation)
    doc_tokens= nltk.word_tokenize(text)
    word_tokens= [word.lower() for word in doc_tokens if not (word in punctuation or len(word)<=3)]
    
    # Lemmatize    
    pos_tags=nltk.pos_tag(word_tokens)
#     print(pos_tags)
    doc_words=[wordnet_lemmatizer.lemmatize(word, pos=get_pos_tag(tag)) for word, tag in pos_tags]
    doc_words= [word for word in doc_words if word not in stoplist]
    
    return doc_words

df_clean = df_orig['Review_Text'].apply(preprocess)
df_clean.head()

STEP 2: DATA CLEANING- PROCURE ONLY NOUNS AND ADJECTIVES TO OBTAIN MEANINGFUL TOPICS!

In [None]:
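# Adding business stopwords to exclude.
# Note: preprocess() has already run above, so these extra terms only affect the bigram filtering further down.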

common_terms= ["wear","look","ordered","color","purchase","order"]

stoplist= stoplist+ common_terms

In [None]:
# Tried multiple parts of speech and obtained best topic results using Nouns and Adjectives!
def get_nouns_adjs(series):
    """Keep only the nouns (NN, NNS) and adjectives (JJ) from a list of tokens"""
    
    pos_tags= nltk.pos_tag(series)
    all_adj_nouns= [word for (word, tag) in pos_tags if tag in ("NN", "NNS", "JJ")] 
    return all_adj_nouns

df_nouns_adj = df_clean.apply(get_nouns_adjs)

Step 3: Add bigrams to your corpus using the Phrases model from gensim
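
Phrase detection decides whether two neighbouring words should be glued into a bigram by scoring how much more often they appear together than their individual counts would suggest. The sketch below follows what I understand to be gensim's default scoring formula; the function name `phrase_score` and all the counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=10):
    """Collocation score: high when the pair co-occurs far more often
    than the individual word counts would suggest."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts for a pair of words that often appear together.
score = phrase_score(count_a=200, count_b=150, count_ab=110, vocab_size=9000)
print(score, score > 20)  # compare against a threshold of 20
```

Raising `threshold` yields fewer but stronger bigrams, while `min_count` discards pairs that are simply too rare to trust.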

In [None]:
# Importing gensim related libraries
import gensim
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora import Dictionary
from gensim.models import Phrases
from collections import Counter
from gensim.models import Word2Vec

In [None]:
docs= list(df_nouns_adj)
phrases = gensim.models.Phrases(docs, min_count=10, threshold=20)
bigram_model = gensim.models.phrases.Phraser(phrases)

In [None]:
def make_bigrams(texts):
    return [bigram_model[doc] for doc in texts]

# Form Bigrams
data_words_bigrams = make_bigrams(docs)

In [None]:
# Check out the most frequent bigrams:
bigram_counter1= Counter()
for key in phrases.vocab.keys():
    if key not in stoplist:   # reuse the list built earlier instead of reloading the stopwords for every key
        if len(str(key).split('_'))>1:
            bigram_counter1[key]+=phrases.vocab[key]

for key, counts in bigram_counter1.most_common(20):
    print(key,">>>>", counts)

**Feeding the bigrams into a Word2Vec model produces more meaningful bigrams**

In [None]:
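# Note: size= and iter= are gensim 3.x argument names (gensim 4 renamed them to vector_size= and epochs=)
w2vmodel = Word2Vec(bigram_model[docs], size=100, sg=1, hs= 1, seed=33, iter=35)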
bigram_counter = Counter()

for key in w2vmodel.wv.vocab.keys():
    if key not in stoplist:
        if len(str(key).split("_")) > 1:
            bigram_counter[key] += w2vmodel.wv.vocab[key].count

for key, counts in bigram_counter.most_common(30):
    print(key,">>>>> " ,counts)

**Check out some cool stuff from the bigram model!**

In [None]:
# Words most often mentioned along with the word 'pregnant'
w2vmodel.wv.most_similar(positive= ['pregnant'])

In [None]:
# Which color is to 'work' as 'white' is to 'wedding'
w2vmodel.wv.most_similar(['work','white'], ['wedding'], topn=5)

In [None]:
w2vmodel.wv.most_similar(['price','steal'], ['discount'], topn=5)

In [None]:
# What is a 'deal_breaker', if 'quality' is 'worth_penny'
w2vmodel.wv.most_similar(positive=["deal_breaker","quality"], negative=["worth_penny"], topn=3)
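
Analogy queries like the ones above boil down to vector arithmetic: add the "positive" word vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. Here is a simplified sketch on hand-made 3-D vectors. The vectors are toy values rather than anything learned from the reviews, and gensim's real implementation has details this version skips.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scored = [
        (sum(q * x for q, x in zip(query, unit(vec))), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return [word for _, word in sorted(scored, reverse=True)[:topn]]

# Toy axes: [wedding-ness, work-ness, colour-ness] (hypothetical values).
vectors = {
    "wedding": [1.0, 0.0, 0.0],
    "work": [0.0, 1.0, 0.0],
    "white": [1.0, 0.0, 1.0],
    "black": [0.0, 1.0, 1.0],
    "red": [0.5, 0.5, 1.0],
}
print(analogy(vectors, positive=["work", "white"], negative=["wedding"]))
```

In the toy space, removing the "wedding" direction from "white" and adding "work" lands closest to "black".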

Step 4: Create a dictionary and corpus for input to our LDA model. Filter out the most common and uncommon words. 
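
The filtering in this step works on document frequency, that is, the number of reviews a token appears in, rather than on raw counts. Below is a minimal sketch of that rule, assuming a token is kept when it appears in at least `no_below` documents and in at most a fraction `no_above` of them. The helper name and the toy documents are made up, and gensim's own method also caps the vocabulary size, which is ignored here.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents and in
    no more than a fraction `no_above` of all documents."""
    # set(doc) ensures each token is counted once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus: 'dress' is in every review, 'fit' in two, the rest in one.
docs = [
    ["dress", "fit", "small"],
    ["dress", "fabric"],
    ["dress", "fit"],
    ["dress", "zipper"],
]
print(sorted(filter_by_doc_frequency(docs, no_below=2, no_above=0.6)))
```

Words that occur everywhere cannot tell topics apart, and words that occur almost nowhere give the model too little evidence, so both ends are trimmed.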

In [None]:
dictionary= Dictionary(data_words_bigrams)

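# Filter out words that occur in fewer than 20 documents, or in more than 60% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.6)
# Build the bag-of-words corpus from the bigram-augmented documents, so the bigrams actually reach LDA
corpus = [dictionary.doc2bow(doc) for doc in data_words_bigrams]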

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Step 5: Train your LDA model- Topic Modeling

In [None]:
from gensim.models.ldamulticore import LdaMulticore

t0= time.time()
passes= 150
np.random.seed(1) # setting up random seed to get the same results
ldamodel= LdaMulticore(corpus, 
                    id2word=dictionary, 
                    num_topics=4, 
#                   alpha='asymmetric', 
                    chunksize= 4000, 
                    batch= True,
                    minimum_probability=0.001,
                    iterations=350,
                    passes=passes)                    

t1= time.time()
print("time for",passes," passes: ",(t1-t0)," seconds")

STEP 6: *Ta-Daa!* Here are your Topics!

In [None]:
ldamodel.show_topics(num_words=25, formatted=False)

<!-- **Well, what are the Topics saying??**

**TOPIC 0**- *Top wear* <br>
(Items) Top, sweater, shirt, jacket, tank, tee ; (And related stuff): color(white, black, blue), look, fabric(soft, material), price (sale), fit, quality <br>

**TOPIC 1**- *Attributes of clothes*
size(small, medium, large, petite, big, short, little,  length, regular,  lb, true), fit (tight, true), fabric, color, fittings/sizing in different body parts(arm, waist, shoulder, bottom, side), return if not a good one <br>

**TOPIC 2**- *Lower wear*
(Items) dress, jean, pant, skirt, boot (And related stuff): color, work/casual, length-short, size-true, season- summer, fall, material <br> -->

Storing the major topic against each review!
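
Picking the major topic is just an argmax over each review's topic distribution. A minimal sketch, assuming the distribution arrives as a list of `(topic_id, probability)` pairs (the helper name and the numbers are illustrative):

```python
def major_topic(doc_topics):
    """Return the topic id with the highest probability, or None if the list is empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Hypothetical topic distribution for one review.
print(major_topic([(0, 0.10), (1, 0.05), (2, 0.15), (3, 0.70)]))
```

Keeping only the top topic throws away the rest of the mixture, which is fine for a quick summary but worth remembering when a review is split almost evenly between two topics.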

In [None]:
lda_corpus= ldamodel[corpus]

**[Note](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py)**

Calling **model[corpus]** only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.

In [None]:
# Obtaining the main topic for each review:

all_topics = ldamodel.get_document_topics(corpus)
num_docs = len(all_topics)

all_topics_csr= gensim.matutils.corpus2csc(all_topics)
all_topics_numpy= all_topics_csr.T.toarray()

major_topic= [np.argmax(arr) for arr in all_topics_numpy]
df_orig['major_lda_topic']= major_topic

**Analyze K-means Clustering against Topic Labeling**

In [None]:
sns.set(rc= {'figure.figsize': (5,3)})
sns.set_style('darkgrid')

df_orig.major_lda_topic.value_counts().plot(kind='bar')

In [None]:
df_orig.groupby(['klabels'])['major_lda_topic'].value_counts(ascending=False, normalize=True)
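
If the grouped `value_counts(normalize=True)` output looks cryptic, the same idea can be sketched in plain Python: for each cluster, count how often each topic is the major one and divide by the cluster size. The labels below are toy values, not the notebook's results.

```python
from collections import Counter, defaultdict

def topic_share_by_cluster(cluster_labels, topic_labels):
    """For each cluster, return the fraction of its members assigned to each topic."""
    counts = defaultdict(Counter)
    for cluster, topic in zip(cluster_labels, topic_labels):
        counts[cluster][topic] += 1
    return {
        cluster: {topic: n / sum(c.values()) for topic, n in c.most_common()}
        for cluster, c in counts.items()
    }

# Toy labels: four reviews in cluster 0, two in cluster 1.
shares = topic_share_by_cluster([0, 0, 0, 0, 1, 1], [3, 3, 2, 1, 2, 2])
print(shares)
```

Shares like these are where the per-cluster topic percentages in the insights section come from.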

### Deriving Conclusions- Looking at the data

In [None]:
num_cols= ['Age','Positive_Feedback_Count','rev_word_count', 'unique_word_count','total_sentiment_score']

cat_cols= ['major_lda_topic','Division_Name','Department_Name','Class_Name']

cluster1= df_orig.loc[(df_orig.klabels==0)]
cluster2= df_orig.loc[(df_orig.klabels==1)]
cluster3= df_orig.loc[(df_orig.klabels==2)]

**Cluster 1 Analysis**

In [None]:
pd.DataFrame((cluster1.Rating.value_counts()*100)/df_orig.Rating.value_counts()).plot(kind='bar')

In [None]:
print('Visualizing numerical features:')
for i, col in enumerate(num_cols):
    plt.figure(i)
    sns.distplot(cluster1[col])


In [None]:
print('Visualizing categorical features:')
for i, col in enumerate(cat_cols):
    plt.figure(i)
    chart= sns.countplot(x=cluster1[col], order= cluster1[col].value_counts().index)
    chart.set_xticklabels(chart.get_xticklabels(),rotation=90)

In [None]:
# Cluster 2 Analysis
print('Visualizing numerical features:')
for i, col in enumerate(num_cols):
    plt.figure(i)
    sns.distplot(cluster2[col])

In [None]:
pd.DataFrame((cluster2.Rating.value_counts()*100)/df_orig.Rating.value_counts()).plot(kind='bar')

In [None]:
print('Visualizing categorical features:')
for i, col in enumerate(cat_cols):
    plt.figure(i)
    chart= sns.countplot(x=cluster2[col], order= cluster2[col].value_counts().index)
    chart.set_xticklabels(chart.get_xticklabels(),rotation=90)

In [None]:
# Cluster 3 Analysis
pd.DataFrame((cluster3.Rating.value_counts()*100)/df_orig.Rating.value_counts()).plot(kind='bar')

In [None]:
print('Visualizing numerical features:')
for i, col in enumerate(num_cols):
    plt.figure(i)
    sns.distplot(cluster3[col])


In [None]:
print('Visualizing categorical features:')
for i, col in enumerate(cat_cols):
    plt.figure(i)
    chart= sns.countplot(x=cluster3[col], order= cluster3[col].value_counts().index)
    chart.set_xticklabels(chart.get_xticklabels(),rotation=90)

## INSIGHTS- CLUSTER AND TOPIC ANALYSIS

<img src="https://media.giphy.com/media/uBfr9DFs9vc40/giphy.gif">

**Repeating this analysis for each cluster and summarizing the graphs, we obtain the following observations:**

**Cluster 1 contains reviews:**
- Written by young/middle-aged women (age group 25-40), in descriptive reviews (50-80 words).
- Related mainly to topic 3 (31%), followed by topics 2 and 1.

**Cluster 2 contains reviews:**
- Written by middle-aged/elderly women (age group 35-60), in rather concise reviews (10-40 words).
- Related mainly to topic 2 (~40%), followed by topics 3 and 1.

**Cluster 3 contains reviews:**
- Written by women of all age groups (25-60), in very detailed reviews (80-110 words).
- Related mainly to topic 3 (~37%), followed by topics 0 and 1.


**Understanding Topics:**

**Topic 0:**
- Reviews relate to dresses, jackets and skirts.
- Key terms: casual/work wear, look, color, fit.

**Topic 1:**
- Reviews relate to bottomwear such as jeans, pants, denim and skirts.
- Key terms: stretch, skinny, short.

**Topic 2:**
- Reviews relate to topwear: shirts, sweaters, jackets.
- Key terms: fabric, white, material, color, sleeve, arm, blue, boxy.

**Topic 3:**
- Reviews relate to size-related issues and returns!
- Key terms: small, petite, large, medium, regular, little, short, tight.
- Body areas mentioned: waist, bust, hip, shoulder!

<img src="https://media.giphy.com/media/lD76yTC5zxZPG/giphy.gif">


**Here my little attempt at NLP comes to an end.**

**Thanks for reading my kernel!**

**If you have any doubts or suggestions for improving my analysis, do let me know in the comments!**

**If you liked my kernel, do upvote!**

**Take care!**

References:

Please find below a few amazing blogs and notebooks:

https://www.kaggle.com/adhok93/understanding-age-wise-sentiments-using-k-means <br>
https://www.kaggle.com/maksimeren/covid-19-literature-clustering/notebook <br>
https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/4-Topic-Modeling.ipynb <br>
https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b<br>
https://distill.pub/2016/misread-tsne/ <br>
Understanding LDA:<br>
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/ <br>