# Topic Modeling
Prepared by: Yifan Ren, Ricardo Lu, and Dr. Yilu Zhou

Welcome to Lab 4: Topic Modeling. This will be the last lab of the semester. We are going to talk about three latent methods for <b>dimension reduction</b> and <b>topic modeling</b>:
1. Latent Semantic Analysis (LSA or LSI)
2. Latent Dirichlet Allocation (LDA)
3. Correlated LDA Topic Model (Optional)


We highly recommend reading the following article to learn more about the first two models (LSA and LDA): https://towardsdatascience.com/2-latent-methods-for-dimension-reduction-and-topic-modeling-20ff6d7d547

In the same folder, we provide an IPython notebook on regular expressions for your reference. Let's get started!

In [1]:
import pandas as pd 
import gensim
from gensim import corpora,models

## Preprocessing 

In [2]:
# Read data
# use read_csv to read csv file, not read_table
df = pd.read_csv('fashion.csv')
df

Unnamed: 0,year,season,brand,author of review,location,time,review text
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ..."
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...
...,...,...,...,...,...,...,...
429,2016,Spring,Zo Jordan,Maya Singer,LONDON,"September 19, 2015","Water, water, everywhere, / nor any drop to dr..."
430,2016,Spring,Zuhair Murad,Amy Verner,PARIS,"October 4, 2015","From a new Paris showroom, Zuhair Murad came a..."
431,2016,Spring,1205,Luke Leitch,LONDON,"September 19, 2015",Fashion and Instagram are such (often sacchari...
432,2016,Spring,3.1 Phillip Lim,Maya Singer,NEW YORK,"September 14, 2015",Let other New York City fashion designers toas...


In [3]:
#convert all review text into list format
docs = df['review text'].tolist()
docs[0]

'Detachment was the word of the day at A Dtacher (yes, like the labels name, bien sr). Designer Mona Kowalska loves the high concept, and one imagines that today detachment included being unconcerned with the gaze of others. Kowalskas woman, both as she appears on the runway and the real world, dresses for herself. Her intensely arty bend, and taste for clothes that match it, make A Dtacher a cultishly beloved brand among certain shoppers. This season, Kowalska presented them with a lineup of relatively playful offerings.\rThe collection opened with a pair of midi dresses in an Indonesian-inspired floral print, which reemerged later imagined with allover Pop white polka dots. Elsewhere came cardigans in an uncanny kind of amoxicillin pink that you imagined the A Dtacher woman wearing with tongue firmly in cheek (they had Kawakubo-esque allover holes, to boot). The popcorn knits were pretty fun, too.\rThe choice to use hardier materials lent dresses eccentric volumes, but also led to a 

In [4]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]
    
# Remove stopwords (build the set once instead of reloading the list for every token).
stop_words = set(stopwords.words('english'))
docs = [[token for token in doc if token not in stop_words] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
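
To see what these preprocessing steps do to a single review, here is a minimal sketch that uses only the standard library. The tiny `STOP_WORDS` set and the sample sentence are made up for illustration; the lab itself relies on NLTK's `RegexpTokenizer` and its English stop-word list.

```python
import re

# Hypothetical mini stop-word list for illustration; the lab uses NLTK's English list.
STOP_WORDS = {"the", "a", "of", "in", "was", "with"}

def preprocess(text):
    """Lowercase, tokenize on word characters, and drop numbers, stop words, and 1-char tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        tok for tok in tokens
        if not tok.isnumeric() and tok not in STOP_WORDS and len(tok) > 1
    ]

sample = "The collection opened with a pair of midi dresses in 2 prints"
tokens = preprocess(sample)
print(tokens)
```

Only the content-bearing words survive, which is what we want to feed into the topic models.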

In [5]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [6]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
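
The idea behind bigram detection can be sketched with a plain counter: count adjacent token pairs and keep the frequent ones. This is a simplification under our own assumptions. gensim's `Phrases` also applies a scoring threshold on top of `min_count`, so it will usually keep fewer pairs than this count-only version. The helper name `frequent_bigrams` and the toy documents are made up.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Count adjacent token pairs over all documents and keep those seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {"_".join(pair): n for pair, n in counts.items() if n >= min_count}

toy_docs = [
    ["new", "york", "designer", "show"],
    ["show", "new", "york", "runway"],
    ["new", "look", "runway"],
]
bigrams = frequent_bigrams(toy_docs, min_count=2)
print(bigrams)
```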

In [7]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 10 documents, or in more than 70% of the documents.
# This step would be necessary for a larger corpus.
# dictionary.filter_extremes(no_below=10, no_above=0.7)
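
If you do enable the filter, the following sketch shows the document-frequency logic we assume it applies: a token is kept when it appears in at least `no_below` documents and in no more than a fraction `no_above` of them. The function below is a hypothetical stand-in written in plain Python, not gensim's implementation, and the toy documents are made up.

```python
from collections import Counter

def filter_extremes(docs, no_below=2, no_above=0.7):
    """Keep tokens found in at least no_below documents and at most a fraction no_above of them."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["dress", "silk", "show"],
    ["dress", "denim", "show"],
    ["dress", "silk", "runway"],
    ["dress", "knit", "runway"],
]
kept = filter_extremes(toy_docs, no_below=2, no_above=0.7)
print(sorted(kept))
```

Here "dress" is dropped because it appears in every document, while "denim" and "knit" are dropped because each appears in only one.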

In [8]:
# docs[0]

## Generate Term Document Matrix

In [9]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 13812
Number of documents: 434
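
As a rough picture of what the bag-of-words step produces, the sketch below builds `(token_id, count)` pairs by hand. The ids here are assigned in order of first appearance, which is our own assumption and need not match the ids gensim's `Dictionary` assigns. The toy documents are made up.

```python
from collections import Counter

toy_docs = [["dress", "silk", "dress"], ["denim", "jean", "dress"]]

# Assign an integer id to each unique token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [doc2bow(doc) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse vector: only tokens that actually occur are stored.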


In [10]:
# Generate a list of unique tokens, ordered by token ID.
sort_token = sorted(dictionary.items(), key=lambda k: k[0])
unique_token = [token for (ID, token) in sort_token]  # keep str tokens so column names are readable

In [11]:
import numpy as np
matrix = gensim.matutils.corpus2dense(corpus,num_terms=len(dictionary),dtype = 'int')
matrix = matrix.T #transpose the matrix 

#convert the numpy matrix into pandas data frame
matrix_df = pd.DataFrame(matrix, columns=unique_token)

In [12]:
#write matrix dataframe into csv
matrix_df.to_csv('Term_Document_matrix.csv')

## LDA model 

In [13]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 100
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

In [14]:
lda = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
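
It can help to keep in mind the generative story that LDA assumes: each document draws its own topic mixture from a Dirichlet distribution, and each word is produced by first picking a topic and then picking a word from that topic. Training works backwards from the observed words to the hidden mixtures. The sketch below simulates the forward direction only. The two topics, their word probabilities, and the `alpha` values are made up for illustration and are not learned from the lab data.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions over a tiny vocabulary (not learned from the lab data).
topics = [
    {"dress": 0.5, "silk": 0.3, "gown": 0.2},    # an "evening wear" topic
    {"denim": 0.5, "jean": 0.3, "jacket": 0.2},  # a "casual wear" topic
]

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha):
    """Follow LDA's generative story: pick topic proportions, then a topic and a word per position."""
    theta = sample_dirichlet(alpha)  # document-level topic mixture
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(topics)), weights=theta)[0]  # topic for this word
        vocab, probs = zip(*topics[z].items())
        words.append(random.choices(vocab, weights=probs)[0])
    return theta, words

theta, words = generate_document(n_words=8, alpha=[0.5, 0.5])
print(theta, words)
```

A small `alpha` tends to give documents dominated by one topic; larger values give more even mixtures.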

In [15]:
# Print the top 10 words of each topic (the topic-word matrix, V).
import re
for i, topic in lda.print_topics(10):
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"', topic)))
    print('\n')

Top 10 words for topic #1:
"collection","dress","one","show","season","new","look","year","designer","way"


Top 10 words for topic #2:
"dress","designer","one","like","collection","show","look","season","woman","spring"


Top 10 words for topic #3:
"dress","collection","like","show","designer","look","one","woman","new","said"


Top 10 words for topic #4:
"new","dress","designer","like","one","collection","show","season","spring","look"


Top 10 words for topic #5:
"dress","collection","one","fashion","look","designer","said","like","spring","clothes"


Top 10 words for topic #6:
"collection","dress","like","designer","look","new","one","piece","skirt","also"


Top 10 words for topic #7:
"collection","look","anderson","dress","said","black","pant","new","lee","bag"


Top 10 words for topic #8:
"collection","one","dress","piece","season","designer","show","clothes","like","look"


Top 10 words for topic #9:
"collection","dress","look","show","designer","new","one","print","like","way"

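The printing loop relies on a small regular expression to pull the words out of each topic string. The sketch below runs it on a made-up string in the `weight*"word" + ...` format that `print_topics()` returns, and also shows a variant with a capture group in case you want the words without the surrounding quotes.

```python
import re

# Made-up topic string in the 'weight*"word" + ...' format that print_topics() returns.
topic = '0.012*"dress" + 0.010*"collection" + 0.008*"silk"'

with_quotes = re.findall(r'".*?"', topic)       # pattern used in the lab; keeps the quotes
without_quotes = re.findall(r'"(.*?)"', topic)  # capture group returns only the word

print(",".join(with_quotes))
print(",".join(without_quotes))
```

The `?` makes the match non-greedy, so each quoted word is matched separately instead of everything between the first and last quote.
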

In [16]:
top_topics = lda.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

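
To give a feel for what a coherence score measures, here is a minimal sketch of the UMass formulation, which scores a topic by how often its top words appear together in the same documents. We assume this formulation for illustration only; gensim's `top_topics` may differ in smoothing and averaging details, so the numbers will not match its output. The toy documents and word lists are made up.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence for one topic: sum of log((D(w_i, w_j) + 1) / D(w_j)) over ranked word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            # w_j is ranked above w_i; it must occur in at least one document.
            score += math.log((doc_count(top_words[i], top_words[j]) + 1) / doc_count(top_words[j]))
    return score

toy_docs = [
    ["dress", "silk", "gown"],
    ["dress", "silk"],
    ["denim", "jean"],
    ["denim", "jean", "dress"],
]
coherent = umass_coherence(["dress", "silk"], toy_docs)
mixed = umass_coherence(["dress", "jean"], toy_docs)
print(coherent, mixed)
```

Scores closer to zero indicate words that co-occur more often, so the "dress, silk" topic scores higher than "dress, jean".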

In [17]:
# Generate U Matrix for LDA model
corpus_lda = lda[corpus] #transform lda model

#convert corpus_lda to numpy matrix
U_matrix_lda = gensim.matutils.corpus2dense(corpus_lda,num_terms=10).T

#write U_matrix into pandas dataframe and output
U_matrix_lda_df = pd.DataFrame(U_matrix_lda)
U_matrix_lda_df.to_csv('U_matrix_lda.csv')

In [18]:
print(matrix_df.shape)
print(U_matrix_lda_df.shape)

(434, 13812)
(434, 10)


See what we have achieved! We reduced the number of features from 13,812 to 10!
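
The conversion from the model's sparse output to a dense document-topic matrix can be sketched as follows. The transformed corpus gives each document as a list of `(topic_id, weight)` pairs, and topics with very small weight may be left out, so missing entries are filled with zeros. The helper `to_dense` and the sample vectors are made up; in the lab, `corpus2dense` does this work.

```python
def to_dense(sparse_docs, num_topics):
    """Expand sparse [(topic_id, weight), ...] vectors into rows of length num_topics."""
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, weight in doc:
            row[topic_id] = weight
        rows.append(row)
    return rows

# Made-up sparse topic vectors for three documents and four topics.
sparse_docs = [[(0, 0.9), (2, 0.1)], [(1, 1.0)], [(0, 0.25), (3, 0.75)]]
dense = to_dense(sparse_docs, num_topics=4)
print(len(dense), len(dense[0]))  # rows = documents, columns = topics
```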

## LSI model 

In [19]:
# Tfidf Transformation 
tfidf = models.TfidfModel(corpus) #fit tfidf model
corpus_tfidf = tfidf[corpus]      #transform tfidf model
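
For intuition, the sketch below computes TF-IDF by hand on a toy bag-of-words corpus. It assumes the common weighting of raw term count times `log2(N / df)`, followed by scaling each document to unit length, which is how we understand gensim's default settings; check the `TfidfModel` documentation if you need the exact behaviour. The toy corpus is made up.

```python
import math
from collections import Counter

def tfidf(corpus_bow):
    """Weight each (token_id, count) by log2(N / df) and scale every document to unit length."""
    n_docs = len(corpus_bow)
    doc_freq = Counter(token_id for doc in corpus_bow for token_id, _ in doc)
    weighted = []
    for doc in corpus_bow:
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        vec = [(tid, w) for tid, w in vec if w > 0]  # tokens in every document get weight 0
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Toy bag-of-words corpus: token 0 appears in every document, tokens 1 and 2 in one each.
toy_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
toy_tfidf = tfidf(toy_corpus)
print(toy_tfidf)
```

Token 0 disappears because a word found in every document carries no information for telling documents apart.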

In [20]:
# Train LSI model.
from gensim.models import LsiModel

lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

import re
for i,topic in lsi.print_topics(10):
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"',topic)))
    print('\n')

Top 10 words for topic #1:
"show","new","woman","season","print","silk","white","brand","black","jacket"


Top 10 words for topic #2:
"versace","show","model","fashion","cotton","denim","jumpsuit","people","knit","graphic"


Top 10 words for topic #3:
"denim","jean","osborne","brand","dkny","chow","gown","vintage","red","black_white"


Top 10 words for topic #4:
"lee","johnson","biker","he","anderson","gown","valli","shoulder","flower","jacket"


Top 10 words for topic #5:
"lee","lim","shirt","webb","jean","pop","pom","twist","taylor","japanese"


Top 10 words for topic #6:
"osborne","chow","dkny","wang","giorgetti","lee","denim","walker","jean","lim"


Top 10 words for topic #7:
"osborne","de","johnson","wu","dkny","he","chow","comme","scott","lee"


Top 10 words for topic #8:
"johnson","wang","chiuri","osborne","chow","anderson","piccioli","african","font","dkny"


Top 10 words for topic #9:
"wang","denim","giorgetti","woman","webb","versace","pom","taits","sweater","reference"


Top
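
LSI is built on a truncated singular value decomposition (SVD) of the weighted document-term matrix. As a minimal sketch of that idea, the code below finds only the largest singular value and its vectors with power iteration on a tiny made-up matrix (rows are documents, columns are terms). gensim uses a more scalable algorithm and returns several dimensions at once, so treat this as an illustration of the math, not of the library's internals.

```python
import math

def top_singular_triplet(A, iterations=100):
    """Return (sigma, u, v) for the largest singular value of A via power iteration."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit start vector; fine for non-negative matrices
    for _ in range(iterations):
        u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Made-up document-term counts: documents 0 and 1 share two terms, document 2 uses two others.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
sigma, u, v = top_singular_triplet(A)
print(round(sigma, 4), [round(x, 4) for x in u], [round(x, 4) for x in v])
```

The first latent dimension loads on the two terms shared by documents 0 and 1, and document 2 gets a coordinate near zero on it.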

In [21]:
# Generate U Matrix for LSI model
corpus_lsi = lsi[corpus_tfidf] #transform lsi model

#convert corpus_lsi to numpy matrix
U_matrix_lsi = gensim.matutils.corpus2dense(corpus_lsi,num_terms=10).T

#write U_matrix into pandas dataframe and output
pd.DataFrame(U_matrix_lsi).to_csv('U_matrix_lsi.csv')

## Correlated LDA Topic Model (Optional)

In [23]:
pip install tomotopy

Collecting tomotopy
  Downloading tomotopy-0.11.1-cp38-cp38-macosx_10_14_x86_64.whl (13.7 MB)
Installing collected packages: tomotopy
Successfully installed tomotopy-0.11.1
Note: you may need to restart the kernel to use updated packages.


In [24]:
import tomotopy as tp

In [25]:
ctm = tp.CTModel(k=10)
for doc in docs:
    ctm.add_doc(doc)
for i in range(0, 500, 10):
    ctm.train(10)

In [26]:
U_matrix_lda_df = pd.DataFrame([doc.get_topic_dist() for doc in ctm.docs])
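
What sets the correlated topic model apart from plain LDA is that topic proportions are allowed to be correlated across documents. One rough, descriptive way to look for such relationships is to compute the Pearson correlation between columns of a document-topic matrix. The sketch below does this on made-up proportions. It is our own diagnostic on the output matrix, not the covariance the model estimates internally.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up document-topic proportions: 4 documents (rows) x 3 topics (columns).
doc_topics = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
columns = list(zip(*doc_topics))  # one tuple of proportions per topic
corr_01 = pearson(columns[0], columns[1])
corr_02 = pearson(columns[0], columns[2])
print(round(corr_01, 3), round(corr_02, 3))
```

In this toy data, topics 0 and 1 tend to appear together, while topics 0 and 2 tend to exclude each other.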

In [27]:
# Mimic gensim's print_topics() output: a list of (topic_id, '0.012*"word" + ...') pairs.
def imitate_print(ctm, k=10):
    return [(i, " + ".join(f'{round(p, 3)}*"{w}"' for w, p in ctm.get_topic_words(i))) for i in range(k)]

In [28]:
import re
for i,topic in imitate_print(ctm):
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"',topic)))
    print('\n')

Top 10 words for topic #1:
"also","back","runway","high","fabric","make","denim","work","gown","suit"


Top 10 words for topic #2:
"black","would","long","around","see","feel","little","idea","always","short"


Top 10 words for topic #3:
"show","spring","shirt","girl","coat","time","used","material","perhaps","often"


Top 10 words for topic #4:
"made","first","woman","style","point","go","many","trouser","body","floral"


Top 10 words for topic #5:
"season","way","silk","brand","well","hand","york","label","new_york","best"


Top 10 words for topic #6:
"new","said","knit","day","though","leather","time","felt","really","something"


Top 10 words for topic #7:
"like","piece","fashion","pant","today","line","cut","take","inspired","backstage"


Top 10 words for topic #8:
"one","designer","white","could","even","model","came","thats","another","bit"


Top 10 words for topic #9:
"dress","collection","print","clothes","jacket","theme","cotton","thing","le","come"


Top 10 words for topic #

In [29]:
U_matrix_lda_df.to_csv('U_matrix_ctm.csv')