<a href="https://colab.research.google.com/github/IggyZhao/Social-Media-Popularity-and-Marketing-Strategies-by-Iggy/blob/master/Social_Media_Popularity_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Clustering and Topic Modeling

In this project, I use unsupervised learning models to cluster unlabeled documents into groups, visualize the results, and identify their latent topics.

## Contents

* [Part 1: Load Data](#Part-1:-Load-Data)
* [Part 2: Tokenizing and Stemming](#Part-2:-Tokenizing-and-Stemming)
* [Part 3: TF-IDF](#Part-3:-TF-IDF)
* [Part 4: K-means clustering](#Part-4:-K-means-clustering)
* [Part 5: Topic Modeling - Latent Dirichlet Allocation](#Part-5:-Topic-Modeling---Latent-Dirichlet-Allocation)


# Part 0: Setup Google Drive Environment

In [None]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Load the data from Google Drive

In [None]:
file = drive.CreateFile({'id': '12DFPuZqbsvJ79XdS1ssrgZMNFCDyeP88'})  # Google Drive file ID of the dataset
file.GetContentFile('data.csv')  # download it locally as data.csv

# https://drive.google.com/file/d/1DVmZ1wQIsAZXSsJFbkG8Jz8aJzoXpTEn/view?usp=sharing

# Part 1: Load Data

Import Packages

In [None]:
import numpy as np
import pandas as pd
import nltk
import gensim   # for LDA

from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Load data into dataframe
df = pd.read_csv('data.csv', encoding= 'unicode_escape')

# https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s

In [None]:
df.head(10) 

In [None]:
# Drop rows whose text is missing
df.dropna(subset=['text'],inplace=True)

In [None]:
df.describe()

statistic,has_hashtag,display_text_width,is_popular
count,7391.0,7391.0,7391.0
mean,0.193208,97.668786,0.151536
std,0.394841,50.305006,0.358594
min,0.0,4.0,0.0
25%,0.0,63.0,0.0
50%,0.0,89.0,0.0
75%,0.0,126.0,0.0
max,1.0,288.0,1.0


In [None]:
# use the full 'text' column as the corpus (slice it to work with a smaller sample)
data = df.loc[:, 'text'].tolist()

# Part 2: Tokenizing and Stemming

Load stopwords and stemmer function from NLTK library.
Stop words are words like "a", "the", or "in" which don't convey significant meaning.
Stemming is the process of breaking a word down into its root.

In [None]:
# Use nltk's English stopwords.
stopwords = nltk.corpus.stopwords.words('english')  # default English stopwords
# Add a few extra tokens that show up as noise in this dataset
stopwords.append("â€™")
stopwords.append("'m")
stopwords.append("n't")
stopwords.append("br")

print ("We use " + str(len(stopwords)) + " stop-words from nltk library.")
print (stopwords[:20])

We use 183 stop-words from nltk library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


Use our defined functions to analyze (i.e. tokenize, stem) tweets.

In [None]:
from nltk.stem.snowball import SnowballStemmer
# REGULAR EXPRESSION
import re

stemmer = SnowballStemmer("english")

# tokenization and stemming
def tokenization_and_stemming(text):
    tokens = []
    # exclude stop words and tokenize the document, generate a list of string 
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())

    filtered_tokens = []
    
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):  # regular-expression check: keep only tokens containing at least one letter, dropping emoji, numbers, and punctuation
            filtered_tokens.append(token)
            
    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
    

In [None]:
tokenization_and_stemming(data[0])  # data[0]

['turn',
 'inanim',
 'object',
 'think',
 "'d",
 'roll',
 'duct',
 'tape',
 'fix',
 'stuff',
 'great',
 'way',
 'realli',
 'trust',
 'anyth']

In [None]:
from nltk.stem import WordNetLemmatizer
import re  # regular expressions, used to filter tokens below
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

# tokenization and lemmatization
def tokenization_and_lemmatizer(text):
    tokens = []
    # exclude stop words and tokenize the document, generate a list of string 
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())

    filtered_tokens = []
    
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):  
            filtered_tokens.append(token)
            
    # lemmatization
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in filtered_tokens]
    return lemmatized

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
tokenization_and_lemmatizer(data[0])

['turned',
 'inanimate',
 'object',
 'think',
 "'d",
 'roll',
 'duct',
 'tape',
 'fix',
 'stuff',
 'great',
 'way',
 'really',
 'trust',
 'anything']

Lemmatization yields more readable tokens than stemming here ('really' and 'anything' rather than 'realli' and 'anyth'). For the difference between the two, see:
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

# Part 3: TF-IDF

TF: Term Frequency

IDF: Inverse Document Frequency
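
As a rough illustration of how the two parts combine, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed IDF that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips the L2 normalization step. The corpus and the helper name `tf_idf` are made up for illustration and are not part of the project data.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not from the dataset): each document is a token list.
docs = [
    ["netflix", "new", "season"],
    ["new", "chicken", "order"],
    ["netflix", "watch", "netflix"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document (no normalization)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights

weights = tf_idf(docs)
for i, w in enumerate(weights):
    print(i, {t: round(v, 3) for t, v in w.items()})
```

A term that appears in only one document ("season") gets a higher IDF than one shared by two documents ("netflix"), which is the effect the IDF factor is meant to have.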



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
# TfidfVectorizer builds the TF-IDF matrix
# max_df: ignore terms whose document frequency is above this threshold
# min_df: ignore terms whose document frequency is below this threshold
# max_features: keep at most this many terms
# use_idf: if False, only TF is computed
# stop_words: built-in stop-word list
# tokenizer: function used to tokenize each document
# ngram_range: (min_n, max_n), e.g. (1, 3) includes 1-grams, 2-grams, and 3-grams
tfidf_model = TfidfVectorizer(max_df=0.99, max_features=250,    # drop terms found in more than 99% of documents; keep at most 250 terms
                                 min_df=0.01, stop_words='english',  # drop terms found in fewer than 1% of documents; they are too rare to be useful
                                 use_idf=True, tokenizer=tokenization_and_stemming, ngram_range=(1,1))  # use_idf=False would give TF only; ngram_range=(1,1) keeps unigrams only

# fit_transform combines fit (learn the vocabulary and IDF weights) and transform (turn the documents into a numeric matrix)
tfidf_matrix = tfidf_model.fit_transform(data)  # fit the vectorizer to the tweets

print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " reviews and " + str(tfidf_matrix.shape[1]) + " terms.")


In total, there are 7391 reviews and 94 terms.


In [None]:
# check the parameters
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 250,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_stemming>,
 'use_idf': True,
 'vocabulary': None}

Save the terms identified by TF-IDF.

In [None]:
# words
tf_selected_words = list(tfidf_model.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names() instead

In [None]:
# print out the selected words
tf_selected_words

["'d",
 "'s",
 'account',
 'alway',
 'amp',
 'away',
 'best',
 'better',
 'big',
 'ca',
 'check',
 'chicken',
 'click',
 'come',
 'contact',
 'countri',
 'day',
 'dm',
 'dog',
 'e-mail',
 'enter',
 'episod',
 'everi',
 'favorit',
 'feel',
 'film',
 'friend',
 'good',
 'got',
 'great',
 'guy',
 'happen',
 'happi',
 'hear',
 'help',
 'hope',
 'http',
 'https',
 'inform',
 'it\x89ûª',
 'kfc',
 'know',
 'let',
 'life',
 'like',
 'littl',
 'live',
 'll',
 'locat',
 'look',
 'love',
 'lovethemad',
 'make',
 'moosejaw',
 'moosejaw.com',
 'moosejawmad',
 'movi',
 'na',
 'need',
 'netflix',
 'new',
 'order',
 'peopl',
 'perfect',
 'pleas',
 'reach',
 'realli',
 'right',
 'say',
 'season',
 'seen',
 'someth',
 'sorri',
 'start',
 'store',
 'stori',
 'sure',
 'tell',
 'thank',
 'thing',
 'think',
 'time',
 'today',
 'tri',
 'use',
 've',
 'want',
 'watch',
 'way',
 'whattowatchonnetflix',
 'win',
 'work',
 'world',
 'year']

# Part 4: K-means clustering

In [None]:
# k-means clustering
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 3

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

## 4.1. Analyze K-means Result

In [None]:
# create a DataFrame pairing each tweet with its cluster label
product = { 'text': df[:].text, 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['text', 'cluster'])

In [None]:
frame.head(10)

index,text,cluster
0,If I were turned into an inanimate object I th...,1
1,4 years of training? No thanks. I'm gonna star...,1
2,Eye of the Tiger playing as I push a shopping ...,1
3,Back to the Future Part II was ahead of its ti...,1
4,Three days until Valentine's Day which means t...,1
5,Someone needs to make a heated blanket that sm...,1
6,"Playing a new game I call ""guess how much chan...",1
7,I'm just glad we've all agreed to let somethin...,1
8,@C_A_resist That's what we like to hear. We'll...,1
9,Asked the guy across from my what I should twe...,1


In [None]:
print ("Number of texts included in each cluster:") 
frame['cluster'].value_counts().to_frame() 

Number of texts included in each cluster:


cluster,count
1,4893
2,1269
0,1229


In [None]:
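km.cluster_centers_  # print the K-means cluster centroids

# Each centroid can be read as a pseudo-document that represents its cluster:
# the larger a value, the more important that term is, so the largest values
# point to the most characteristic words of the cluster.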

array([[1.60393978e-03, 5.95083445e-02, 0.00000000e+00, 4.28434701e-03,
        1.94379299e-02, 3.34135909e-03, 2.77079203e-02, 3.05778557e-03,
        5.73380302e-03, 4.03888885e-03, 1.34812271e-02, 9.82976626e-03,
        1.33138740e-02, 2.80057259e-02, 6.12479359e-04, 0.00000000e+00,
        2.19055051e-02, 3.83569614e-04, 3.50652605e-03, 0.00000000e+00,
        8.59568378e-03, 1.68138870e-02, 7.97668393e-03, 7.64486378e-03,
        5.36606985e-03, 1.39341792e-02, 1.11785587e-02, 1.27627478e-02,
        1.01755241e-02, 3.84777812e-03, 6.04474830e-03, 6.71065763e-03,
        1.21079313e-02, 2.86379390e-03, 7.48757004e-03, 2.80970967e-03,
        2.38945896e-03, 5.50655244e-01, 9.46667002e-04, 6.02302550e-03,
        1.50880235e-02, 1.26779391e-02, 5.96307279e-03, 1.19695800e-02,
        1.91318832e-02, 3.72334223e-03, 6.04084407e-03, 3.20863205e-03,
        1.14668440e-03, 1.10341579e-02, 3.31280226e-02, 1.49696709e-02,
        1.32738094e-02, 9.91447662e-03, 0.00000000e+00, 1.482559

In [None]:
km.cluster_centers_.shape

(3, 94)

In [None]:
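# For each centroid, pick the 6 terms with the largest TF-IDF weight out of the 94 terms.
# K-means only produces the clusters; what they mean has to be interpreted
# using domain knowledge and an understanding of the content.
# In other words, the analysis after clustering is the most important part.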

print ("<Document clustering result by K-means>")

# km.cluster_centers_ denotes the importances of each items in centroid.
# We need to sort it in decreasing-order and get the top k items.
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print ("Cluster " + str(i) + " words:", end='')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        Cluster_keywords_summary[i].append(tf_selected_words[ind])
        print (tf_selected_words[ind] + ",", end='')
    print ()
    
    cluster_text = frame[frame.cluster==i].text.tolist()
    print ("Cluster " + str(i) + " text (" + str(len(cluster_text)) + " text): ")
    print (", ".join(cluster_text))
    print ()
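
The sort-and-slice step above can be checked on a small example. This is a minimal sketch with made-up centroids and term names (they stand in for `km.cluster_centers_` and `tf_selected_words`, and are not values from the fitted model):

```python
import numpy as np

# Hypothetical centroids: 2 clusters x 4 terms (made-up TF-IDF weights).
terms = np.array(["netflix", "chicken", "watch", "order"])
centers = np.array([
    [0.60, 0.01, 0.30, 0.02],
    [0.02, 0.50, 0.01, 0.40],
])

# argsort is ascending, so reverse each row to get descending weight order.
order = centers.argsort(axis=1)[:, ::-1]

top_n = 2
top_terms = {i: terms[order[i, :top_n]].tolist() for i in range(len(centers))}
print(top_terms)
```

Each row of `order` holds column indices, so indexing `terms` with its first `top_n` entries returns the highest-weighted words for that cluster.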

# Part 5: Topic Modeling - Latent Dirichlet Allocation

In [None]:
# Use LDA for clustering
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=3)

# LDA expects integer word counts as input

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# LDA requires integer values
tfidf_model_lda = CountVectorizer(max_df=0.99, max_features=500,
                                 min_df=0.01, stop_words='english',
                                 tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data)  # fit the count vectorizer to the tweets (raw counts, despite the variable name)

print ("In total, there are " + str(tfidf_matrix_lda.shape[0]) + \
      " reviews and " + str(tfidf_matrix_lda.shape[1]) + " terms.")


In total, there are 7391 reviews and 94 terms.
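
To make the switch from `TfidfVectorizer` to `CountVectorizer` concrete, the sketch below compares the two on a toy corpus (the three documents are invented for illustration). It only shows that one produces integer counts and the other float weights, which is why the count matrix is the one passed to LDA here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents (hypothetical, not from the dataset).
docs = ["new season on netflix", "order chicken today", "watch netflix today"]

counts = CountVectorizer().fit_transform(docs)  # sparse matrix of raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse matrix of float weights

print(counts.dtype, counts.toarray()[0])
print(tfidf.dtype)
```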


In [None]:
# document topic matrix for tfidf_matrix_lda
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)


(7391, 3)
[[0.05555895 0.75982874 0.18461231]
 [0.04177652 0.91515796 0.04306553]
 [0.11220793 0.77666032 0.11113175]
 ...
 [0.9046772  0.04763193 0.04769087]
 [0.91660329 0.04167638 0.04172033]
 [0.91660329 0.04167638 0.04172033]]


In [None]:
# topics and words matrix
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

# shape (3, 94): for each of the 3 topics, every one of the 94 terms has a weight indicating how important the term is to that topic

(3, 94)
[[3.37074610e-01 3.35563699e-01 1.08133040e+03 3.34676640e-01
  3.35848682e-01 3.36085306e-01 3.34663353e-01 7.21129206e+01
  3.36266078e-01 3.36412586e-01 3.35037516e-01 3.39356239e-01
  3.35760503e-01 3.35948638e-01 5.41330978e+02 5.42328022e+02
  3.35308423e-01 1.55732515e+03 3.35400652e-01 3.36145377e-01
  3.36291416e-01 3.35096965e-01 3.35462092e-01 3.35282942e-01
  3.35765267e-01 3.36533259e-01 3.35474874e-01 3.35515273e-01
  3.36692752e-01 3.35533536e-01 3.38582563e-01 1.07165648e+02
  3.35723879e-01 9.65321652e+02 3.37832619e-01 3.37405938e-01
  3.34911465e-01 3.36396107e-01 3.37302385e+02 3.35799663e-01
  6.21299246e+02 3.36377368e-01 3.35601133e-01 3.36748693e-01
  1.96113805e+02 3.38104649e-01 3.37569296e-01 3.37677251e-01
  2.94152520e+00 1.75702293e+03 3.35480432e-01 3.35083782e-01
  8.17321562e+02 3.36954529e-01 3.38383469e-01 3.34780296e-01
  3.35395720e-01 3.35031763e-01 3.35675142e-01 3.34406887e-01
  3.36016840e-01 3.40574367e-01 3.36076437e-01 3.35252446e-01
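
The raw values in `lda.components_` are unnormalized pseudo-counts, so they are easier to compare after each row is normalized into a distribution over terms. A minimal sketch, using a made-up matrix as a stand-in for `lda.components_`:

```python
import numpy as np

# Made-up stand-in for lda.components_: 2 topics x 3 terms of pseudo-counts.
components = np.array([
    [0.3, 1081.3, 72.1],
    [55.4, 0.3, 80.3],
])

# Divide each row by its sum so every topic becomes a probability distribution over terms.
topic_term_probs = components / components.sum(axis=1, keepdims=True)
print(topic_term_probs.round(3))
```

After normalization each row sums to 1, and the largest entry in a row marks the term that topic favors most.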


In [None]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head(10)

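# Assign each document the topic with the highest weight.
# Compared with K-means, LDA splits the documents more evenly across groups.
# The table below has one row per document (7391 rows) and one probability per topic (3 columns);
# taking the largest one gives the document's cluster.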

document,Topic0,Topic1,Topic2,topic
Doc0,0.06,0.76,0.18,1
Doc1,0.04,0.92,0.04,1
Doc2,0.11,0.78,0.11,1
Doc3,0.08,0.82,0.1,1
Doc4,0.06,0.06,0.89,2
Doc5,0.37,0.37,0.26,0
Doc6,0.26,0.35,0.38,2
Doc7,0.08,0.49,0.43,1
Doc8,0.28,0.48,0.24,1
Doc9,0.11,0.12,0.77,2


In [None]:
df_document_topic['topic'].value_counts().to_frame()

topic,count
0,2937
2,2542
1,1912


In [None]:
# topic-word matrix
print(lda.components_)
df_topic_words = pd.DataFrame(lda.components_)

# column and index
df_topic_words.columns = tfidf_model_lda.get_feature_names_out()  # on scikit-learn < 1.0 use get_feature_names() instead
df_topic_words.index = topic_names

df_topic_words.head()



Unnamed: 0,'d,'s,account,alway,amp,away,best,better,big,ca,check,chicken,click,come,contact,countri,day,dm,dog,e-mail,enter,episod,everi,favorit,feel,film,friend,good,got,great,guy,happen,happi,hear,help,hope,http,https,inform,itûª,...,moosejaw.com,moosejawmad,movi,na,need,netflix,new,order,peopl,perfect,pleas,reach,realli,right,say,season,seen,someth,sorri,start,store,stori,sure,tell,thank,thing,think,time,today,tri,use,ve,want,watch,way,whattowatchonnetflix,win,work,world,year
Topic0,0.337075,0.335564,1081.330395,0.334677,0.335849,0.336085,0.334663,72.112921,0.336266,0.336413,0.335038,0.339356,0.335761,0.335949,541.330978,542.328022,0.335308,1557.325152,0.335401,0.336145,0.336291,0.335097,0.335462,0.335283,0.335765,0.336533,0.335475,0.335515,0.336693,0.335534,0.338583,107.165648,0.335724,965.321652,0.337833,0.337406,0.334911,0.336396,337.302385,0.3358,...,0.338383,0.33478,0.335396,0.335032,0.335675,0.334407,0.336017,0.340574,0.336076,0.335252,1928.322479,541.330709,0.337117,817.125802,0.335998,0.335415,0.334213,0.335896,1828.324785,0.336075,134.263785,0.3346,0.338759,0.342856,0.337639,0.335354,0.335689,0.335396,0.336591,559.812491,0.339244,0.336437,0.335525,0.335192,0.335822,0.334666,0.335097,0.335537,0.334536,0.335936
Topic1,55.377025,49.003679,0.334607,80.310159,0.360351,69.93754,0.354305,63.261623,62.659151,84.240468,215.936889,93.280791,7.584927,22.625063,0.334312,0.336029,1.319027,0.338176,0.366164,0.341142,73.242615,4.405434,4.606498,0.350852,110.738341,10.125223,0.389968,240.00106,55.115178,189.996731,35.736554,0.384976,92.526412,0.339588,15.647035,22.568096,228.322622,0.390653,0.359176,1.196363,...,0.345565,0.357316,16.283738,87.196715,156.256339,0.454737,130.732367,10.774847,106.562399,92.077113,0.339077,0.334739,55.120425,2.455814,0.368886,20.118383,89.242321,103.062079,0.337303,94.529811,0.382473,0.365818,35.179064,0.364527,84.452809,181.867326,165.921254,131.208055,43.420126,286.702039,31.176991,85.666383,122.439873,128.186805,8.234179,668.321856,52.561117,128.930506,1.055113,88.816618
Topic2,36.2859,985.660758,0.334998,0.355164,117.3038,4.726375,139.311031,21.625456,14.004583,0.423119,0.728073,0.379853,68.079313,123.038988,0.33471,0.33595,293.345664,0.336671,131.298436,97.322713,27.421094,82.259469,89.05804,79.313865,6.925893,74.538243,79.274557,0.663425,134.548129,0.667735,76.924863,76.449377,46.137864,0.33876,73.015132,56.094498,0.342466,1677.272951,0.338439,80.467837,...,108.316051,114.307904,122.380866,1.468253,0.407986,284.210856,213.931616,73.884579,25.101525,0.587635,0.338444,0.334552,51.542457,2.418384,138.295116,98.546202,0.423465,1.602025,0.337913,20.134114,0.353743,94.299582,49.482177,129.292617,26.209551,0.79732,0.743057,139.456549,153.243282,0.485469,55.483765,84.99718,61.224602,211.478004,89.429999,0.343478,51.103786,0.733957,104.610351,104.847446


In [None]:
# print top n keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    words = np.array(tfidf_model.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names() instead
    topic_words = []
    # for each topic, we have words weight
    for topic_words_weights in lda_model.components_:
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model=tfidf_model_lda, lda_model=lda, n_words=15)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

topic,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,pleas,sorri,look,dm,account,hear,make,right,kfc,tri,countri,contact,reach,inform,like
Topic 1,whattowatchonnetflix,like,tri,good,http,check,great,thing,ll,think,need,time,new,work,watch
Topic 2,https,'s,day,netflix,new,watch,love,know,today,time,best,say,got,dog,tell
