<a href="https://colab.research.google.com/github/KeGuo627/Review-Analysis-and-Topic-Modelling/blob/main/Review_Analysis_and_Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# E-Commerce Consumer Review Analysis

To explore trends in customer reviews from an anonymized women’s clothing e-commerce platform, I use unsupervised learning to cluster the reviews and identify the latent topics they discuss.

## Contents

* Part 0: Loading Data

* Part 1: Data Overview

* Part 2: Tokenizing and Stemming

* Part 3: TF-IDF

* Part 4: Data Clustering

* Part 5: Topic Modeling 

# Part 0: Loading Data

In [None]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
#https://drive.google.com/file/d/1sx3J_82AKSM8b6JsQBe8lekYU33K-83v/view?usp=sharing

In [None]:
file = drive.CreateFile({'id':'1sx3J_82AKSM8b6JsQBe8lekYU33K-83v'})
file.GetContentFile('review.csv')  # save the file locally as review.csv

In [None]:
import numpy as np
import pandas as pd
import nltk
# import gensim

from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Load data into dataframe
review_df = pd.read_csv('review.csv')

# Part 1: Data Overview

In [None]:
review_df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [None]:
review_df[review_df.isnull().any(axis=1)].head()
print(review_df.isnull().sum())

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64


In [None]:
# Remove rows with missing values
review_df.dropna(subset=['Title','Review Text','Division Name','Department Name','Class Name'],inplace=True)

In [None]:
review_df.reset_index(inplace=True, drop=True)

In [None]:
review_df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
1,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
2,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
3,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
4,6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits


In [None]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19662 entries, 0 to 19661
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               19662 non-null  int64 
 1   Clothing ID              19662 non-null  int64 
 2   Age                      19662 non-null  int64 
 3   Title                    19662 non-null  object
 4   Review Text              19662 non-null  object
 5   Rating                   19662 non-null  int64 
 6   Recommended IND          19662 non-null  int64 
 7   Positive Feedback Count  19662 non-null  int64 
 8   Division Name            19662 non-null  object
 9   Department Name          19662 non-null  object
 10  Class Name               19662 non-null  object
dtypes: int64(6), object(5)
memory usage: 1.7+ MB


In [None]:
# use the first 1000 reviews as our working sample
review_data = review_df.loc[:999, 'Review Text'].tolist()
print(type(review_data))
print(review_data[0])
print(type(review_data[0]))

<class 'list'>
I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
<class 'str'>


# Part 2: Tokenizing and Stemming

We load the stop-word list and the stemmer from the NLTK library.
Stop words are words like "a", "the", or "in" that carry little meaning on their own.
Stemming reduces a word to its root form by stripping suffixes, e.g. "hopes" becomes "hope".
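
As a rough illustration of the filtering steps only (not the stemming), here is a minimal sketch in plain Python. The names `TOY_STOPWORDS` and `filter_tokens` are hypothetical, the stop-word set is a tiny hand-picked stand-in for NLTK's list, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
# Stand-ins (assumptions): a tiny stop-word set and whitespace splitting,
# in place of NLTK's English stop-word list and nltk.word_tokenize.
TOY_STOPWORDS = {"i", "this", "it", "is", "the", "a", "and", "so", "but"}

def filter_tokens(text, stopwords=TOY_STOPWORDS):
    """Lowercase the tokens, drop stop words, keep purely alphabetic tokens."""
    tokens = [word.lower() for word in text.split()]
    tokens = [t for t in tokens if t not in stopwords]
    return [t for t in tokens if t.isalpha()]

kept = filter_tokens("I love this dress but it runs small , size 4 fits")
print(kept)
```

The comma and the number are dropped by the `isalpha()` check, and the stop words by the set lookup; a stemmer would then be applied to what remains.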

In [None]:
# Use nltk's English stopwords.
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append("'s")
stopwords.append("'m")
stopwords.append("n't")
stopwords.append("br")

print ("We use " + str(len(stopwords)) + " stop-words from nltk library.")
print (stopwords[:10])

We use 183 stop-words from nltk library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


Tokenize and stem the reviews.

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# tokenization and stemming
def tokenization_and_stemming(text):
    tokens = []
    # tokenize the document, lowercase each token, and drop stop words
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())

    filtered_tokens = []

    # keep only purely alphabetic tokens (drops numbers, punctuation, and mixed tokens)
    for token in tokens:
        if token.isalpha():
            filtered_tokens.append(token)

    # debug output: the filtered tokens before stemming
    # (very verbose when this function is used as a vectorizer's tokenizer)
    print(filtered_tokens)
    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [None]:
tokenization_and_stemming(review_data[0])

['high', 'hopes', 'dress', 'really', 'wanted', 'work', 'initially', 'ordered', 'petite', 'small', 'usual', 'size', 'found', 'outrageously', 'small', 'small', 'fact', 'could', 'zip', 'reordered', 'petite', 'medium', 'overall', 'top', 'half', 'comfortable', 'fit', 'nicely', 'bottom', 'half', 'tight', 'layer', 'several', 'somewhat', 'cheap', 'net', 'layers', 'imo', 'major', 'design', 'flaw', 'net', 'layer', 'sewn', 'directly', 'zipper', 'c']


['high',
 'hope',
 'dress',
 'realli',
 'want',
 'work',
 'initi',
 'order',
 'petit',
 'small',
 'usual',
 'size',
 'found',
 'outrag',
 'small',
 'small',
 'fact',
 'could',
 'zip',
 'reorder',
 'petit',
 'medium',
 'overal',
 'top',
 'half',
 'comfort',
 'fit',
 'nice',
 'bottom',
 'half',
 'tight',
 'layer',
 'sever',
 'somewhat',
 'cheap',
 'net',
 'layer',
 'imo',
 'major',
 'design',
 'flaw',
 'net',
 'layer',
 'sewn',
 'direct',
 'zipper',
 'c']

This gives the token list for the first review: tokenization and stop-word filtering keep the meaningful words, and stemming reduces each one to its root.

# Part 3: TF-IDF

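TF: Term Frequency, how often a term occurs in a document.

IDF: Inverse Document Frequency, which down-weights terms that occur in many documents.

TF-IDF = TF × IDF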
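
To make the weighting concrete, here is a small worked sketch in plain Python. It assumes the smoothed IDF that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The toy documents and the helper name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption): three already-tokenized "reviews".
docs = [
    ["dress", "fit", "small"],
    ["dress", "love", "love"],
    ["jean", "fit", "comfort"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document every term occurs once, so the ranking comes from IDF alone: "small" appears in only one document and therefore gets a larger weight than "dress" or "fit", which each appear in two.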

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# define vectorizer parameters
# TfidfVectorizer builds the tf-idf matrix
# max_df: ignore terms that appear in more than this fraction of documents
# min_df: ignore terms that appear in fewer than this fraction of documents
# max_features: keep at most this many terms, ranked by corpus frequency
# use_idf: if False, only tf is computed
# stop_words: built-in stop-word list
# tokenizer: function used to tokenize each document
# ngram_range: (min_n, max_n), e.g. (1, 3) includes 1-grams, 2-grams, and 3-grams
tfidf_model = TfidfVectorizer(max_df=0.99, max_features=1000,
                              min_df=0.01, stop_words='english',
                              use_idf=True, tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix = tfidf_model.fit_transform(review_data)  # fit the vectorizer to the reviews
print(tfidf_matrix.shape)
print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " reviews and " + str(tfidf_matrix.shape[1]) + " terms.")

In [None]:
# check the parameters
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_stemming>,
 'use_idf': True,
 'vocabulary': None}

In [None]:
print(tfidf_matrix.shape)
tfidf_matrix.todense()

(1000, 431)


matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.20445267,
         0.19200209],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

TF-IDF matrices are sparse because each review uses only a small part of the vocabulary, so most entries are zero. todense() converts the sparse matrix to a dense one for display. Next, save the terms selected by the TF-IDF vectorizer.

In [None]:
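# words
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_selected_words = tfidf_model.get_feature_names_out().tolist()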

In [None]:
# print out words
print(len(tf_selected_words))
tf_selected_words

431


['abl',
 'absolut',
 'actual',
 'ad',
 'add',
 'addit',
 'ador',
 'agre',
 'alreadi',
 'alway',
 'amaz',
 'ankl',
 'anoth',
 'anyth',
 'appear',
 'appropri',
 'area',
 'arm',
 'armhol',
 'arriv',
 'away',
 'awkward',
 'bad',
 'baggi',
 'bare',
 'base',
 'basic',
 'beauti',
 'belt',
 'best',
 'better',
 'big',
 'bigger',
 'bit',
 'black',
 'blazer',
 'blous',
 'blue',
 'bodi',
 'boot',
 'bought',
 'boxi',
 'bra',
 'brand',
 'bright',
 'bust',
 'busti',
 'button',
 'buy',
 'ca',
 'came',
 'cami',
 'camisol',
 'cardigan',
 'casual',
 'cheap',
 'chest',
 'classic',
 'clean',
 'close',
 'cloth',
 'coat',
 'cold',
 'collar',
 'color',
 'come',
 'comfi',
 'comfort',
 'complet',
 'compliment',
 'consid',
 'construct',
 'cool',
 'coral',
 'cotton',
 'cover',
 'coverag',
 'cozi',
 'cream',
 'crop',
 'cuff',
 'curvi',
 'cut',
 'cute',
 'dark',
 'day',
 'decid',
 'deep',
 'definit',
 'delic',
 'denim',
 'depend',
 'design',
 'differ',
 'disappoint',
 'dot',
 'drape',
 'dress',
 'dri',
 'easi',
 'e

In summary, the vectorizer keeps 431 stemmed terms (features) from the 1000 reviews after the min_df and max_df filters, and the feature names are returned in alphabetical order.

# Part 4: Data Clustering

In this part, I plan to apply K-means to cluster customer reviews.

In [None]:
# k-means clustering
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
print(clusters)
print(len(clusters))

[0, 2, 2, 1, 0, 3, 0, 1, 1, 1, 3, 0, 3, 3, 2, 2, 1, 0, 3, 3, 1, 1, 1, 4, 3, 2, 4, 2, 2, 2, 2, 2, 0, 2, 3, 2, 0, 4, 0, 3, 2, 2, 3, 2, 3, 2, 1, 2, 2, 2, 2, 2, 0, 2, 3, 3, 3, 3, 0, 1, 2, 0, 1, 1, 2, 2, 2, 0, 3, 0, 2, 1, 2, 2, 0, 2, 2, 2, 2, 3, 2, 0, 2, 2, 3, 0, 0, 1, 1, 2, 1, 0, 2, 2, 0, 3, 1, 3, 2, 3, 2, 2, 2, 3, 0, 4, 2, 2, 3, 0, 3, 0, 3, 0, 1, 3, 0, 3, 2, 2, 3, 0, 0, 4, 3, 2, 3, 2, 3, 2, 1, 2, 0, 3, 0, 3, 2, 4, 1, 0, 2, 3, 4, 0, 4, 3, 3, 1, 0, 3, 3, 4, 3, 2, 0, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 0, 3, 1, 2, 2, 1, 0, 2, 1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 2, 1, 0, 0, 2, 0, 4, 2, 2, 0, 1, 2, 3, 3, 3, 2, 4, 2, 2, 2, 1, 1, 0, 2, 3, 0, 2, 1, 2, 2, 2, 0, 3, 0, 3, 2, 1, 2, 2, 2, 3, 0, 0, 3, 4, 2, 0, 1, 2, 0, 1, 3, 0, 1, 3, 1, 0, 2, 0, 0, 2, 0, 2, 2, 2, 0, 3, 2, 0, 3, 3, 0, 0, 2, 4, 0, 2, 1, 2, 0, 0, 2, 3, 3, 2, 0, 2, 2, 2, 2, 4, 0, 3, 2, 3, 0, 4, 2, 0, 3, 2, 3, 2, 0, 2, 0, 2, 3, 2, 3, 3, 1, 1, 2, 1, 0, 1, 4, 0, 3, 4, 2, 1, 1, 4, 3, 1, 0, 3, 2, 0, 3, 4, 1, 3, 1, 1, 4, 3, 4, 1, 1, 1, 2, 1, 

In [None]:
# create a DataFrame that pairs each review with its cluster label
product = { 'review': review_df[:1000]["Review Text"], 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])

In [None]:
frame.head(10)

Unnamed: 0,review,cluster
0,I had such high hopes for this dress and reall...,0
1,"I love, love, love this jumpsuit. it's fun, fl...",2
2,This shirt is very flattering to all due to th...,2
3,"I love tracy reese dresses, but this one is no...",1
4,I aded this in my basket at hte last mintue to...,0
5,"I ordered this in carbon for store pick up, an...",3
6,I love this dress. i usually get an xs but it ...,0
7,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1
8,Dress runs small esp where the zipper area run...,1
9,More and more i find myself reliant on the rev...,1


In [None]:
print ("Number of reviews included in each cluster:")
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


Unnamed: 0,cluster
2,320
0,222
3,203
1,162
4,93


In [None]:
print(km.cluster_centers_.shape)
km.cluster_centers_

(5, 431)


array([[0.00334889, 0.00920344, 0.01065908, ..., 0.0077852 , 0.0019494 ,
        0.00220814],
       [0.00503465, 0.01467896, 0.0071495 , ..., 0.00808421, 0.00934747,
        0.00725462],
       [0.00315978, 0.00944195, 0.00562876, ..., 0.00125132, 0.        ,
        0.00266847],
       [0.00130687, 0.00357555, 0.00346853, ..., 0.00555173, 0.00463092,
        0.00986587],
       [0.0083452 , 0.00229507, 0.01011697, ..., 0.01794535, 0.00242186,
        0.00673463]])

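The centroid matrix has five rows (clusters) and 431 columns (features).
In k-means, each cluster is represented by its centroid. Each entry of km.cluster_centers_ is the mean TF-IDF weight of a term over the reviews in that cluster, so the largest entries indicate the terms that characterize the cluster.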
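
Reading keywords off a centroid amounts to ranking its entries. A minimal sketch in plain Python, with made-up centroid values and a hypothetical helper `top_terms` (the notebook itself does this with NumPy's argsort):

```python
def top_terms(centroid, terms, k=3):
    """Return the k terms with the largest weights in one centroid."""
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:k]]

# Toy vocabulary and centroids (assumptions, not values from the notebook).
toy_terms = ["dress", "fit", "jean", "size", "soft"]
toy_centroids = [
    [0.05, 0.30, 0.01, 0.40, 0.02],  # a cluster dominated by sizing terms
    [0.02, 0.04, 0.35, 0.03, 0.25],  # a cluster dominated by jeans/softness
]

top = [top_terms(c, toy_terms, k=2) for c in toy_centroids]
print(top)
```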

In [None]:
print ("<Review Text clustering result by K-means>")
print("Sort it in decreasing-order and get the top k items")
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print ("Cluster " + str(i) + " words:", end='')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        Cluster_keywords_summary[i].append(tf_selected_words[ind])
        print (tf_selected_words[ind] + ",", end='')
    print ()
    
    cluster_reviews = frame[frame.cluster==i].review.tolist()
    print ("Cluster " + str(i) + " reviews (" + str(len(cluster_reviews)) + " reviews): ")
    print (", ".join(cluster_reviews))
    print()

<Review Text clustering result by K-means>
Sort it in decreasing-order and get the top k items
Cluster 0 words:size,fit,small,order,petit,tri,
Cluster 0 reviews (222 reviews): 
I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c, I aded this in my basket at hte last mintue to see what it would look like in person. (store pick up). i went with teh darkler color only because i am so pale :-) hte color is really gorgeous, and turns out it mathced everythiing i was trying on with it prefectly. it is a little baggy on me and hte xs is hte msallet siz

# Part 5: Topic Modeling

In this part, we use Latent Dirichlet Allocation (LDA) to explore the latent topics in the reviews.

In [None]:
# Use LDA for clustering
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tfidf_model_lda = CountVectorizer(max_df=0.99, max_features=1000,
                                 min_df=0.01, stop_words='english',
                                 tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(review_data)  # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix_lda.shape[0]) + \
      " reviews and " + str(tfidf_matrix_lda.shape[1]) + " terms.")

Earlier, LDA required integer inputs, which made CountVectorizer the better fit; that limitation no longer exists. CountVectorizer produces raw term counts only (TF), without the IDF weighting.
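
For comparison with the TF-IDF weights used earlier, the following sketch builds a plain count matrix in pure Python. The toy documents are assumptions; the point is only that every entry is an integer term count with no IDF factor.

```python
from collections import Counter

# Toy corpus (assumption): two already-tokenized "reviews".
docs = [
    ["dress", "fit", "fit"],
    ["jean", "soft"],
]

# Alphabetically sorted vocabulary, one column per term.
vocab = sorted({term for doc in docs for term in doc})

# One row per document, one integer count per vocabulary term.
count_matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
print(count_matrix)
```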

In [None]:
# document topic matrix for tfidf_matrix_lda
#lda = LatentDirichletAllocation(n_components=5) (five topic)
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(1000, 5)
[[0.15889189 0.00621107 0.1623613  0.34318052 0.32935523]
 [0.01879995 0.92542524 0.01858721 0.01855581 0.01863179]
 [0.01725382 0.81030837 0.13860684 0.01693375 0.01689723]
 ...
 [0.00984008 0.00969034 0.0097531  0.51456502 0.45615146]
 [0.00802853 0.00799707 0.00794259 0.96820069 0.00783112]
 [0.23231697 0.73835404 0.00974716 0.0098667  0.00971512]]


In [None]:
# topics and words matrix
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

(5, 431)
[[ 6.54299409  0.20737545  0.20387779 ...  7.14310776  1.19765222
   0.20077805]
 [ 0.20160746  9.16783556  3.33188781 ...  4.64307443  0.20000383
   1.61977563]
 [ 3.8101164   0.20483163  5.32119595 ...  5.98792013  0.202072
   0.20271638]
 [ 0.51338646 24.21583059  2.94495207 ... 10.0195487  11.19835733
   7.41299174]
 [ 3.93189559  0.20412678 21.19808638 ...  0.20634898  0.20191462
  11.5637382 ]]


components_[i, j] can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic.
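
To read the pseudocounts as a per-topic word distribution, each row can be divided by its sum. A minimal sketch in plain Python with made-up pseudocounts (`normalize_rows` and `toy_components` are hypothetical names):

```python
def normalize_rows(matrix):
    """Scale each row so that it sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

# Toy pseudocounts (assumption): 2 topics over a 3-word vocabulary.
toy_components = [
    [6.0, 0.2, 1.8],
    [0.5, 7.5, 2.0],
]

topic_word_dist = normalize_rows(toy_components)
for row in topic_word_dist:
    print([round(p, 3) for p in row])
```

After normalization, each row gives the probability of every word under that topic.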

In [None]:
print(tfidf_matrix_lda.shape)
tfidf_matrix_lda.todense()

(1000, 431)


matrix([[0, 0, 0, ..., 0, 1, 1],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [None]:
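# note: this refits lda in place on the TF-IDF matrix, overwriting lda.components_
lda_output_tfidf = lda.fit_transform(tfidf_matrix)
print(lda_output_tfidf.shape)
lda_output_tfidf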

(1000, 5)


array([[0.03647182, 0.80968035, 0.03710467, 0.03647929, 0.08026388],
       [0.05368116, 0.05513883, 0.0564762 , 0.78015821, 0.0545456 ],
       [0.05071878, 0.79454108, 0.05253338, 0.05112328, 0.05108347],
       ...,
       [0.038859  , 0.84319372, 0.03934108, 0.03921215, 0.03939404],
       [0.03734438, 0.84904353, 0.03814229, 0.03742717, 0.03804263],
       [0.04046663, 0.83686749, 0.04135545, 0.04041379, 0.04089663]])

In [None]:
tfidf_matrix.todense()

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.20445267,
         0.19200209],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [None]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(review_data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,topic
Doc0,0.16,0.01,0.16,0.34,0.33,3
Doc1,0.02,0.93,0.02,0.02,0.02,1
Doc2,0.02,0.81,0.14,0.02,0.02,1
Doc3,0.01,0.01,0.01,0.97,0.01,3
Doc4,0.29,0.01,0.42,0.27,0.01,2
Doc5,0.01,0.26,0.72,0.01,0.01,2
Doc6,0.95,0.01,0.01,0.01,0.01,0
Doc7,0.67,0.01,0.01,0.17,0.15,0
Doc8,0.01,0.01,0.01,0.39,0.58,4
Doc9,0.01,0.01,0.01,0.95,0.01,3


In [None]:
df_document_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
3,264
1,235
2,177
0,164
4,160


LDA distributes the reviews more evenly across topics than k-means does across clusters.
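
One way to quantify this, offered as a sketch rather than as part of the original analysis, is the normalized entropy of the group sizes: it equals 1 for a perfectly even split and shrinks as the split becomes more lopsided. The counts are the ones reported in the two value_counts() outputs; the function name is hypothetical.

```python
import math

def normalized_entropy(counts):
    """Shannon entropy of the size distribution, divided by its maximum ln(k)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(counts))

kmeans_sizes = [320, 222, 203, 162, 93]   # cluster sizes from the k-means run
lda_sizes = [264, 235, 177, 164, 160]     # dominant-topic counts from the LDA run

print("k-means:", round(normalized_entropy(kmeans_sizes), 3))
print("LDA:    ", round(normalized_entropy(lda_sizes), 3))
```

The LDA split scores higher on this measure, which agrees with the observation above.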

In [None]:
# topic-word matrix
print(lda.components_)
review_df_topic_words = pd.DataFrame(lda.components_)

# column and index
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
review_df_topic_words.columns = tfidf_model_lda.get_feature_names_out()
review_df_topic_words.index = topic_names

review_df_topic_words.head()

[[0.20025138 0.24391517 0.20036274 ... 0.20024865 0.20024428 0.20031208]
 [3.80712634 6.57772543 7.16396662 ... 6.43219774 3.31044448 4.93102611]
 [0.20084443 1.09406888 0.20049834 ... 0.20096635 0.20006203 0.20263053]
 [0.20256754 0.20037757 0.2053595  ... 0.20018222 0.20075534 0.20033214]
 [0.20080415 1.2657702  0.20054145 ... 0.20070327 0.20085807 0.61415914]]


Unnamed: 0,abl,absolut,actual,ad,add,addit,ador,agre,alreadi,alway,amaz,ankl,anoth,anyth,appear,appropri,area,arm,armhol,arriv,away,awkward,bad,baggi,bare,base,basic,beauti,belt,best,better,big,bigger,bit,black,blazer,blous,blue,bodi,boot,...,uniqu,use,usual,versatil,version,vest,vibrant,waist,wait,want,wardrob,warm,wash,way,wear,weather,wed,week,weight,weird,went,white,wide,winter,wish,woman,wonder,wool,wore,work,worn,worri,worth,wrinkl,xl,xs,xxs,year,zip,zipper
Topic0,0.200251,0.243915,0.200363,0.200301,0.200301,0.200338,0.202291,0.200271,0.20033,0.20145,0.200302,0.20029,0.200287,0.200305,0.200252,0.200292,0.200836,1.203493,0.200714,0.200333,0.200264,0.20026,0.200239,0.200261,0.200676,0.200231,0.200273,0.200463,0.200274,0.200333,0.200378,0.20027,0.200332,0.200357,0.200396,0.200231,0.20836,0.200291,0.218652,0.200273,...,0.20048,0.200336,0.200338,0.200335,0.20027,0.200152,0.200291,0.20032,0.200247,0.200363,0.200306,0.200774,0.200286,0.200331,0.200673,0.200289,0.200252,0.200299,0.200287,0.200273,0.203408,0.200446,0.200293,0.200301,0.200946,2.993939,0.709461,1.044494,0.200309,0.200437,0.2003,0.200284,0.201952,0.200659,0.200244,0.20028,0.200236,0.200249,0.200244,0.200312
Topic1,3.807126,6.577725,7.163967,3.679775,4.903629,2.076713,7.39759,3.60377,4.571343,4.861381,6.038996,4.441293,7.78889,4.93581,6.670162,2.818703,8.15103,12.109346,3.677887,4.043253,3.350771,3.712848,5.27986,4.533112,0.207133,4.105296,0.206329,25.634908,3.5496,5.262155,8.646336,17.256605,4.643564,21.971293,10.3504,3.002615,14.473371,12.174565,10.188122,4.055293,...,7.412438,3.908392,20.221383,3.468899,3.26099,0.203417,5.335507,19.147709,8.367846,19.236927,1.940341,5.413671,6.027866,14.572351,38.855415,4.423042,3.366471,2.797675,9.220861,5.844599,11.386192,9.855449,8.107696,6.211942,6.985074,0.203907,5.071773,4.475607,6.44907,21.837915,9.032237,3.279405,0.247536,3.732351,4.38961,17.567646,4.479346,6.432198,3.310444,4.931026
Topic2,0.200844,1.094069,0.200498,0.476077,1.384946,2.315763,0.834545,0.20007,0.286883,0.583102,1.945338,0.635613,0.972683,1.340771,0.200065,0.203685,0.200064,0.200069,0.610017,0.656821,0.200067,0.200066,0.200062,0.203143,0.200072,0.20006,1.285897,1.424493,0.20007,0.877877,0.51623,0.201282,0.406963,1.324631,4.75061,0.200061,0.201339,1.30283,0.201448,2.838311,...,0.465889,0.991092,1.113491,2.847571,0.202081,0.200475,0.200361,0.66847,0.200483,0.200555,2.227064,3.376168,0.269057,0.786358,4.750255,0.200226,0.200069,0.200078,0.204076,0.20007,0.200738,2.358583,0.200075,1.500727,2.877001,0.761416,0.200078,0.202704,1.967041,0.214284,1.699529,0.200074,1.490734,0.201372,0.201513,1.692176,0.200061,0.200966,0.200062,0.202631
Topic3,0.202568,0.200378,0.20536,0.200514,0.429179,0.200919,0.200241,0.200597,0.870351,0.20022,0.200224,0.200215,0.200413,0.200503,0.200605,0.552446,0.200441,0.200472,0.200228,0.202967,0.204219,0.200193,0.202285,0.200388,0.200204,0.200168,0.200202,0.200932,0.200587,0.200246,0.200552,0.200522,0.200238,0.200299,0.201019,0.200168,0.200398,0.200574,0.200228,0.200659,...,1.241535,0.200251,0.200627,0.200397,0.200199,4.543054,0.200652,0.200349,0.201108,0.20022,0.205904,0.200559,0.200407,0.315991,1.622903,0.20024,0.200191,0.200765,0.50879,0.200339,0.20067,0.200274,0.200325,0.202411,0.200641,0.20017,0.529408,0.200868,0.200442,0.368092,1.714372,0.200213,0.212206,0.210449,0.201622,0.200595,0.200172,0.200182,0.200755,0.200332
Topic4,0.200804,1.26577,0.200541,0.200654,0.200368,0.536724,0.217554,1.045603,0.753117,0.200101,0.202631,0.200108,0.202265,0.201161,0.200542,0.200113,0.21175,0.201672,0.200817,1.395131,0.200093,0.202963,0.201479,0.200097,3.308506,0.200181,2.983357,0.574328,0.229899,0.235441,1.824827,0.200244,0.202092,0.504176,0.891745,0.20008,0.716026,0.205153,0.485803,0.200099,...,0.467463,0.840432,0.333592,0.208208,0.200357,0.200683,0.665565,0.200882,0.200822,0.55616,0.200482,0.20046,3.781699,2.322595,0.94924,0.716852,0.200357,0.67483,0.20038,0.2031,0.200199,0.200213,1.96387,0.201623,0.200264,0.201011,0.200292,0.203014,1.292829,1.264899,0.200137,0.483374,5.082132,0.200339,0.200624,0.201865,0.200082,0.200703,0.200858,0.614159


In [None]:
# return the top n_words keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    # each row of components_ holds one topic's word weights
    for topic_words_weights in lda_model.components_:
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model=tfidf_model_lda, lda_model=lda, n_words=20)        

review_df_topic_words = pd.DataFrame(topic_keywords)
review_df_topic_words.columns = ['Word '+str(i) for i in range(review_df_topic_words.shape[1])]
review_df_topic_words.index = ['Topic '+str(i) for i in range(review_df_topic_words.shape[0])]
review_df_topic_words

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Word 15,Word 16,Word 17,Word 18,Word 19
Topic 0,woman,sweater,arm,cute,wool,coral,wonder,coat,green,half,hole,issu,bra,shirt,size,short,larger,boxi,absolut,someth
Topic 1,size,dress,fit,look,love,like,wear,color,small,order,fabric,perfect,great,littl,beauti,nice,petit,tri,realli,skirt
Topic 2,love,great,jean,comfort,soft,pant,color,super,bought,cozi,tee,casual,comfi,fit,black,wear,dress,flatter,look,perfect
Topic 3,vest,noth,support,time,compliment,mani,worn,comfort,wear,everi,fun,outfit,knit,love,day,uniqu,soft,night,light,design
Topic 4,suit,worth,dri,cheap,wash,dress,cute,feel,qualiti,poor,stiff,bare,fabric,dot,basic,hand,person,price,love,return


The table above lists the top 20 keywords for each topic.
In LDA, each document is assumed to be characterized by a particular set of topics.

LDA assumes:
1. Each document is a mixture of a small number of topics.
2. Each word's occurrence is attributable to one of the document's topics.
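
The mixture view can be checked on the first three rows of the document-topic matrix printed earlier: each row is a probability distribution over the five topics, and the dominant topic is the index of its largest entry. A minimal sketch in plain Python (`dominant_topic` is a hypothetical helper; the notebook uses np.argmax for the same step):

```python
# First three rows of lda_output as printed in the notebook.
doc_topic = [
    [0.15889189, 0.00621107, 0.1623613, 0.34318052, 0.32935523],
    [0.01879995, 0.92542524, 0.01858721, 0.01855581, 0.01863179],
    [0.01725382, 0.81030837, 0.13860684, 0.01693375, 0.01689723],
]

def dominant_topic(row):
    """Index of the topic with the largest share in one document."""
    return max(range(len(row)), key=lambda k: row[k])

dominant = [dominant_topic(row) for row in doc_topic]
print(dominant)
```

The first review is spread over several topics, while the second is almost entirely one topic.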

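Because LDA is randomly initialized, topic numbering and keywords change between runs, so the summary below may not line up with the table above.

Topic0: dress, fabric, fit

Topic1: cozi, warm, sweater

Topic2: dress, little, work

Topic3: jean, comfort, pant

Topic4: coat, wool, green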