# Understanding Topic Models

I am creating this notebook to understand the numbers associated with topics in both NMF and LDA topic models. With LDA, we are seeing very large numbers attached to the keywords within a given topic, and we would like to know why. That question also reveals that I don't understand what the number attached to a given word within a topic means in an NMF topic model. Is it, for example, simply the TF-IDF value? Unknown.

For this exploration, I have created a toy corpus of 10 texts drawn from a larger collection of Louisiana treasure legends. I am keeping the code simple, using only the stop words and tokenization built into scikit-learn. My plan is to generate the TF matrix needed for LDA and the TF-IDF matrix needed for NMF, then label them and save them as CSV files. After that, I will fit the LDA and NMF topic models, using 2 or 3 components, and break out the H and W matrices. I hope to convert those arrays to dataframes, attach the labels we need to see what's going on, and save those to CSV files as well. With any luck, I can fold everything into an Excel workbook, and we can be done with this.

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Load the corpus from a small collection of files
# =-=-=-=-=-=-=-=-=-=-= 

import glob

file_list = glob.glob('../texts/tentexts' + '/*.txt')

corpus = []

for file in file_list:
    with open(file) as f_input:
        corpus.append(f_input.read().replace('\n', ' '))
        
print("Corpus is a {} of {} items.".format(type(corpus), len(corpus)))

Corpus is a <class 'list'> of 10 items.


In [2]:
# =-=-=-=-=-=-=-=-=-=-=
# Glean the filenames from the glob list
# (We will get the feature names later from the fitted vectorizer.)
# =-=-=-=-=-=-=-=-=-=-= 

filenames = [s.replace('../texts/tentexts/', '') for s in file_list]
docs = [s.replace('.txt', '') for s in filenames]
print(docs)

['anc-088', 'anc-089', 'anc-090', 'anc-091', 'lau-013', 'lau-014', 'loh-157', 'loh-158', 'loh-159', 'loh-160']
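
The string replacement in the cell above depends on the path prefix matching `../texts/tentexts/` exactly, and `glob.glob` does not promise any particular file order. Here is a minimal sketch of an alternative, assuming the same one-file-per-text layout: `pathlib` derives the label from each path, and sorting keeps labels and texts aligned from run to run. The helper name `load_corpus`, the temporary directory, and its three file names are stand-ins, not part of the original notebook.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    """Return (doc_ids, texts) for every .txt file in folder, sorted by name."""
    paths = sorted(Path(folder).glob("*.txt"))
    doc_ids = [p.stem for p in paths]          # 'anc-088.txt' -> 'anc-088'
    texts = [p.read_text(encoding="utf-8").replace("\n", " ") for p in paths]
    return doc_ids, texts

# Stand-in corpus so the sketch runs anywhere (file names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["loh-157", "anc-088", "lau-013"]:
        Path(tmp, name + ".txt").write_text("gold buried\nby the tree", encoding="utf-8")
    demo_ids, demo_texts = load_corpus(tmp)

print(demo_ids)
```

Because the ids and the texts come from the same sorted list, row labels in the later dataframes cannot drift out of step with the rows themselves.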


In [3]:
# =-=-=-=-=-=-=-=-=-=-=
# Parameters, imports, and functions for both LDA and NMF
# =-=-=-=-=-=-=-=-=-=-= 

# Import
# Not sure if we still need this for this code: used to save arrays to CSVs,
# but now we are using a dataframe to do that. (See commented out code below.)
import numpy as np 
import pandas as pd

# Parameters
n_features = 1000
n_components = 3
n_top_words = 10
#stopwords = re.split('\s+', open('../data/stopwords_all.txt', 'r').read().lower())

# Build one label per component so the labels always match n_components
topic_labels = ["Topic {}".format(i) for i in range(n_components)]


# Ye olde "let's see the topic" function
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
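
The slice `[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. Here is a small sketch of what it does, using made-up weights for four words (the arrays are illustrative, not taken from either model): `argsort` orders the indices from smallest to largest weight, and the negative-step slice walks that ordering backwards for `n_top` steps.

```python
import numpy as np

words = np.array(["weird", "went", "buried", "bull"])   # hypothetical vocabulary
topic = np.array([0.5, 9.3, 6.0, 0.9])                  # hypothetical weights for one topic
n_top = 2

order = topic.argsort()            # indices from smallest to largest weight
top_idx = order[:-n_top - 1:-1]    # the last n_top indices, largest weight first
top_words = words[top_idx].tolist()
print(top_words)
```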

In [4]:
# =-=-=-=-=-=-=-=-=-=-=
# LDA
# =-=-=-=-=-=-=-=-=-=-= 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(corpus)

tf_array = tf.toarray()
# np.savetxt("../outputs/tentexts_tf.csv", tf_array.astype(int), fmt='%d', delimiter=",")
# print("A tf array of {} has been saved to CSV.".format(tf.shape))

lda = LatentDirichletAllocation(n_components=n_components, 
                                max_iter=20,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

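# Newer scikit-learn releases use get_feature_names_out();
# older releases used get_feature_names() instead.
tf_feature_names = tf_vectorizer.get_feature_names_out()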
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0: said went money gold man came little know buried told
Topic #1: young far time dollars went came things deep gold told
Topic #2: said man like went don know just shovel tree house



In [6]:
# =-=-=-=-=-=-=-=-=-=-=
# Create dataframes of TF, H, and W
# =-=-=-=-=-=-=-=-=-=-= 

# Create TF dataframe
df_tf = pd.DataFrame(data= tf_array, index = docs, columns = tf_feature_names)

# Uncomment to glimpse dataframe
# df_tf.head(10)

# Save TF dataframe to CSV file
df_tf.to_csv('../outputs/tf_frame.csv', sep=',')

# Get W, the document-topic matrix (DTM), and H, the word-topic matrix (WTM)
lda_W = lda.transform(tf)
lda_H = lda.components_

df_lda_DTM = pd.DataFrame(data= lda_W, index = docs, columns = topic_labels)
df_lda_DTM.to_csv('../outputs/lda_W.csv', sep=',')
print(df_lda_DTM)

          Topic 0   Topic 1   Topic 2
anc-088  0.406158  0.004617  0.589225
anc-089  0.013331  0.010670  0.975999
anc-090  0.013037  0.010633  0.976331
anc-091  0.982436  0.008419  0.009145
lau-013  0.004778  0.004369  0.990853
lau-014  0.002260  0.002088  0.995652
loh-157  0.990568  0.004557  0.004875
loh-158  0.983257  0.007948  0.008795
loh-159  0.983085  0.007575  0.009340
loh-160  0.994152  0.002728  0.003120
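
One way to read the table above: each row behaves like a probability distribution over the three topics, since `LatentDirichletAllocation.transform` returns normalized document-topic weights. Here is a quick check, with three rows copied from the printed output, that the rows sum to 1 and that `argmax` picks out the dominant topic for each text (the variable names are only for this sketch):

```python
import numpy as np

# Three rows copied from the df_lda_DTM output above
rows = {
    "anc-088": [0.406158, 0.004617, 0.589225],
    "anc-089": [0.013331, 0.010670, 0.975999],
    "loh-157": [0.990568, 0.004557, 0.004875],
}
weights = np.array(list(rows.values()))

row_sums = weights.sum(axis=1)        # each should be 1, up to print rounding
dominant = weights.argmax(axis=1)     # column index of the largest weight per row

for doc, total, k in zip(rows, row_sums, dominant):
    print("{}: sum={:.6f}, dominant=Topic {}".format(doc, total, k))
```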


In [7]:
df_lda_WTM = pd.DataFrame(data = lda_H, index = topic_labels, columns = tf_feature_names)
# Save H under its own name so it does not overwrite the W file written above
df_lda_WTM.to_csv('../outputs/lda_H.csv', sep=',')
print(df_lda_WTM)

         american     asked       big    branch   brother     built      bull  \
Topic 0  1.970122  1.595890  2.329375  1.847106  1.881802  2.609794  0.862688   
Topic 1  0.526371  0.559859  0.512988  0.582659  0.566128  0.566889  0.543268   
Topic 2  0.476247  0.874584  1.553584  1.231478  2.672460  0.518236  4.396376   

           buried      came     chain    ...        weird      went      wife  \
Topic 0  6.030735  6.670249  1.231581    ...     0.511647  9.284730  3.653556   
Topic 1  0.546082  0.609306  0.582465    ...     0.515740  0.615540  0.562227   
Topic 2  0.589048  2.007795  1.152837    ...     1.873471  7.523704  0.931034   

             wind     woods   working      yard      yeah     years     young  
Topic 0  0.542825  0.765247  1.224481  1.231473  2.597735  3.920248  1.869225  
Topic 1  0.548399  0.534635  0.481469  0.525065  0.527799  0.544141  0.637189  
Topic 2  3.959836  5.125925  1.277567  2.586199  1.264871  0.575798  1.935628  

[3 rows x 132 columns]
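
On the question of why these values are so much larger than probabilities: the scikit-learn documentation describes `components_[i, j]` as a pseudo-count, roughly the number of times word `j` was assigned to topic `i`, so the values track the amount of text rather than staying between 0 and 1. Dividing each row by its sum turns the pseudo-counts into a word distribution for that topic. Here is a minimal sketch on a four-sentence stand-in corpus (the sentences and the names `toy_docs` and `toy_lda` are invented for illustration; in this notebook the same normalization would apply to `lda.components_`):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented sentences, standing in for the real corpus
toy_docs = [
    "the man buried the gold under the tree",
    "they said the gold was buried near the house",
    "the wind blew through the woods all night",
    "he went into the woods when the wind came",
]
counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

pseudo_counts = toy_lda.components_                  # shape (n_topics, n_words), unnormalized
topic_word = pseudo_counts / pseudo_counts.sum(axis=1, keepdims=True)

print(pseudo_counts.sum(axis=1))   # row totals scale with the size of the corpus
print(topic_word.sum(axis=1))      # after normalization each row sums to 1
```

Read this way, a value such as 9.28 for "went" in Topic 0 says something about how often the word was attributed to that topic, not how probable it is; the normalized row is the better basis for comparing words across topics.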


In [8]:
# =-=-=-=-=-=-=-=-=-=-=
# NMF
# =-=-=-=-=-=-=-=-=-=-= 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(corpus)

tfidf_array = tfidf.toarray()
# np.savetxt("../outputs/tentexts_tfidf.csv", tfidf_array, delimiter=",")
# print("A tf-idf array of {} has been saved to CSV.".format(tfidf.shape))

nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

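# Newer scikit-learn releases use get_feature_names_out();
# older releases used get_feature_names() instead.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()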
print_top_words(nmf, tfidf_feature_names, n_top_words)

Topic #0: said man went woods little like just controller know money
Topic #1: house buried supposedly wife lot town gold lived came story
Topic #2: yeah saw wasn water place things went years end money



In [9]:
# =-=-=-=-=-=-=-=-=-=-=
# TFIDF and NMF's H and W Matrices
# =-=-=-=-=-=-=-=-=-=-= 

# Label the columns with the TF-IDF vectorizer's own vocabulary
tfidf_df = pd.DataFrame(data= tfidf_array, index = docs, columns = tfidf_feature_names)
tfidf_df.to_csv('../outputs/tfidf_frame.csv', sep=',')

In [10]:
# Get W, the document-topic matrix (DTM), and H, the word-topic matrix (WTM)
nmf_DTM = nmf.transform(tfidf)
nmf_WTM = nmf.components_

df_nmf_DTM = pd.DataFrame(data= nmf_DTM, index = docs, columns = topic_labels)
df_nmf_DTM.to_csv('../outputs/nmf_DTM.csv', sep=',')
print(df_nmf_DTM)

          Topic 0   Topic 1   Topic 2
anc-088  0.685349  0.000000  0.000000
anc-089  0.679691  0.000000  0.021044
anc-090  0.628755  0.000000  0.000000
anc-091  0.019669  0.000000  1.263780
lau-013  0.543181  0.000000  0.000000
lau-014  0.596777  0.244314  0.000000
loh-157  0.007589  1.245402  0.000000
loh-158  0.348560  0.119015  0.000000
loh-159  0.391789  0.000000  0.247836
loh-160  0.717665  0.064456  0.000000


In [11]:
df_nmf_WTM = pd.DataFrame(data = nmf_WTM, index = topic_labels, columns = tfidf_feature_names)
df_nmf_WTM.to_csv('../outputs/nmf_WTM.csv', sep=',')
print(df_nmf_WTM)

         american     asked       big    branch   brother     built      bull  \
Topic 0   0.01775  0.024334  0.048008  0.033074  0.075008  0.010696  0.114641   
Topic 1   0.00000  0.000000  0.000000  0.000000  0.000000  0.032530  0.000000   
Topic 2   0.00000  0.000000  0.048711  0.011160  0.000000  0.000000  0.000000   

           buried      came     chain    ...        weird      went      wife  \
Topic 0  0.000000  0.093302  0.000000    ...     0.012572  0.206358  0.017950   
Topic 1  0.214652  0.129069  0.000000    ...     0.000000  0.000000  0.148566   
Topic 2  0.050739  0.000000  0.073587    ...     0.000000  0.151543  0.000000   

             wind     woods  working      yard      yeah     years     young  
Topic 0  0.068009  0.189816  0.01299  0.014969  0.000000  0.022978  0.051542  
Topic 1  0.000000  0.000000  0.00000  0.032976  0.000000  0.000000  0.000000  
Topic 2  0.000000  0.000000  0.00000  0.000000  0.279275  0.148864  0.000000  

[3 rows x 132 columns]
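
These weights do not appear to be TF-IDF values themselves. NMF looks for two non-negative matrices, W (documents by topics) and H (topics by words), whose product approximates the TF-IDF matrix, so an entry of H is a factor weight whose scale depends on the matching column of W. Here is a minimal sketch on an invented four-sentence corpus (the sentences, the `init` and `max_iter` settings, and the names `toy_docs` and `toy_nmf` are assumptions for illustration, not the settings used above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Invented sentences, standing in for the real corpus
toy_docs = [
    "the man buried the gold under the tree",
    "they said the gold was buried near the house",
    "the wind blew through the woods all night",
    "he went into the woods when the wind came",
]
X = TfidfVectorizer().fit_transform(toy_docs).toarray()

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(X)     # documents x topics
H = toy_nmf.components_          # topics x words

approx = W @ H                   # NMF's reconstruction of the TF-IDF matrix
error = np.linalg.norm(X - approx)   # Frobenius norm of what is left over

print(X.shape, W.shape, H.shape)
print(error, toy_nmf.reconstruction_err_)
```

To interpret a single cell of H, multiply it by a document's weight in W: the result is that topic's contribution to the document's approximated TF-IDF value for that word.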


In [12]:
tfidf_df.head(10)

,american,asked,big,branch,brother,built,bull,buried,came,chain,...,weird,went,wife,wind,woods,working,yard,yeah,years,young
anc-088,0.0,0.098376,0.07652,0.0,0.0,0.0,0.295129,0.0,0.206159,0.0,...,0.0,0.102743,0.196753,0.0,0.258203,0.0,0.0,0.0,0.0,0.0
anc-089,0.0,0.0,0.146204,0.0,0.0,0.0,0.0,0.0,0.131299,0.0,...,0.0,0.196306,0.0,0.187963,0.164446,0.0,0.0,0.0,0.0,0.0
anc-090,0.0,0.0,0.0,0.0,0.13965,0.0,0.0,0.0,0.0,0.0,...,0.0,0.093754,0.0,0.0,0.471223,0.0,0.0,0.0,0.0,0.0
anc-091,0.0,0.0,0.108179,0.0,0.0,0.0,0.0,0.108179,0.0,0.139077,...,0.0,0.217876,0.0,0.0,0.0,0.0,0.0,0.417232,0.243352,0.0
lau-013,0.0,0.0,0.0,0.0,0.163115,0.0,0.314556,0.0,0.0,0.104852,...,0.104852,0.164259,0.0,0.0,0.0,0.0,0.091733,0.0,0.0,0.183466
lau-014,0.0,0.0,0.0,0.04773,0.0,0.0,0.0,0.0,0.0,0.0,...,0.04773,0.074773,0.0,0.190921,0.0,0.04773,0.083517,0.04773,0.0,0.0
loh-157,0.0,0.0,0.0,0.0,0.0,0.071141,0.0,0.316247,0.227206,0.0,...,0.0,0.0,0.243945,0.0,0.0,0.0,0.071141,0.0,0.0,0.0
loh-158,0.140196,0.0,0.109049,0.0,0.0,0.122655,0.0,0.109049,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.24531,0.0
loh-159,0.0,0.0,0.0,0.296788,0.115426,0.0,0.0,0.0,0.103659,0.0,...,0.0,0.232471,0.0,0.0,0.0,0.148394,0.0,0.0,0.0,0.129827
loh-160,0.071295,0.071295,0.0,0.0,0.055455,0.062374,0.0,0.055455,0.099604,0.0,...,0.0,0.223377,0.0,0.0,0.0,0.0,0.0,0.0,0.062374,0.062374
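
For reading the values in this table: with its default settings, `TfidfVectorizer` multiplies each raw count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit Euclidean length, which is why no entry exceeds 1. Here is a minimal sketch that reproduces this by hand on a made-up count matrix and compares the result with `TfidfTransformer` (the counts are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Invented counts: 3 documents (rows) by 3 words (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1        # smoothed inverse document frequency
weighted = counts * idf                          # raw count times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

library = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, library))
```

Because of the row normalization, a TF-IDF value depends on every other word in the same document, so the same count of the same word can yield different values in different texts.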
