# Understanding Topic Models

For this exploration, I have created a toy corpus of 10 texts drawn from a larger collection of Louisiana treasure legends. I am keeping the code simple, using only the stop words and tokenization built into scikit-learn. My plan is to generate the term-frequency (TF) matrix needed for LDA and the TF-IDF matrix needed for NMF, label them, and save them as CSV files. I will then generate the LDA and NMF topic models, using 2 or 3 components, and break out the W (document-topic) and H (topic-word) matrices. I hope to convert those arrays to dataframes, attach the labels we need to see what's going on, and save those to CSV files as well. With any luck, I can fold everything into an Excel workbook, and we can be done with this.
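
Since the plan leans on the W and H matrices, here is a minimal sketch of how their shapes relate. It uses random numbers rather than the corpus. The sizes (10 documents, 123 features, 3 topics) match this notebook, but the names `V`, `W`, and `H` are just conventional NMF notation and are my own labels here, not anything scikit-learn defines.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, n_terms, n_topics = 10, 123, 3

# Stand-ins for the real matrices (random, for shape illustration only)
W = rng.random((n_docs, n_topics))    # document-topic weights
H = rng.random((n_topics, n_terms))   # topic-term weights

# NMF looks for non-negative W and H whose product approximates
# the document-term matrix V
V_approx = W @ H

print(W.shape, H.shape, V_approx.shape)
```

LDA returns arrays of the same two shapes, although there the product is not meant as a literal reconstruction of the counts.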

On why the numbers attached to features in the LDA components are so large (they are unnormalized pseudo-counts, not probabilities), see this question on [SO](https://stackoverflow.com/questions/35140117/how-to-interpret-lda-components-using-sklearn).
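
As far as I understand it, dividing each row of the LDA components by its row sum turns those large values into a word distribution per topic. The sketch below is an illustration under that assumption. It uses a small hand-typed array (three words only, with values copied from the LDA topic-word output later in this notebook), so the resulting proportions are illustrative and not the real ones.

```python
import numpy as np

# Weights for three words (bull, buried, water), copied from the
# LDA topic-word output later in this notebook. A real row has 123 entries.
components = np.array([
    [4.688339, 1.908744, 0.538923],   # Topic 0
    [0.554175, 1.166469, 3.325225],   # Topic 1
    [0.574554, 4.021302, 1.987316],   # Topic 2
])

# Divide each row by its own sum so every topic sums to 1
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist.round(3))
print(topic_word_dist.sum(axis=1))
```

With the fitted model, the same division would be applied to `lda.components_` in place of the hand-typed array.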

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Load the corpus from a small collection of files
# =-=-=-=-=-=-=-=-=-=-= 

import glob

# glob returns files in arbitrary order, so sort for a reproducible corpus
file_list = sorted(glob.glob('../texts/tentexts/*.txt'))

corpus = []

for file in file_list:
    with open(file) as f_input:
        corpus.append(f_input.read().replace('\n', ' '))
        
print("Corpus is a {} of {} items.".format(type(corpus), len(corpus)))

Corpus is a <class 'list'> of 10 items.


In [28]:
# =-=-=-=-=-=-=-=-=-=-=
# Glean the filenames from the glob list
# (We will get the feature names later from the vectorizer.)
# =-=-=-=-=-=-=-=-=-=-= 

filenames = [s.replace('../texts/tentexts/', '') for s in file_list]
docs = [s.replace('.txt', '').upper() for s in filenames]
print(docs)

['ANC-088', 'ANC-089', 'ANC-090', 'ANC-091', 'LAU-013', 'LAU-014', 'LOH-157', 'LOH-158', 'LOH-159', 'LOH-160']
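
The `replace` calls above depend on the exact directory prefix. A slightly more forgiving sketch, assuming the same file layout, uses `pathlib` to take the stem of each path. The two paths and the names `example_paths` and `example_docs` are typed in by hand for illustration; I am assuming the files on disk are named in lowercase.

```python
from pathlib import Path

# Hand-typed stand-ins for two entries of file_list
example_paths = ['../texts/tentexts/anc-088.txt',
                 '../texts/tentexts/loh-160.txt']

# .stem drops both the directory and the .txt extension
example_docs = [Path(p).stem.upper() for p in example_paths]
print(example_docs)
```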


In [3]:
# =-=-=-=-=-=-=-=-=-=-=
# Parameters, imports, and functions for both LDA and NMF
# =-=-=-=-=-=-=-=-=-=-= 

# Import
import numpy as np # not sure if this is still needed
import pandas as pd
from sklearn.feature_extraction import text

# Parameters
n_features = 1000
n_components = 3
n_top_words = 10
#stopwords = re.split('\s+', open('../data/stopwords_all.txt', 'r').read().lower())

stop_added = ["said", "went", "know", "come", "came", "saw", "yeah", "like", "don"]
stop_words = text.ENGLISH_STOP_WORDS.union(stop_added)
# print(stop_words) # To check if words have been added

# Create labels for the topics array later
topic_labels = ["Topic " + str(i) for i in range(n_components)]


# Ye olde "let's see the topic" function made as readable as possible
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "{:d}: ".format(topic_idx)
        message += " ".join([feature_names[i] + ' ' + str(round(topic[i], 2)) + ','
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
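
The slice `[:-n_top_words - 1:-1]` in the function above is compact but easy to misread. Here is a small sketch, with made-up weights for four words, showing that it picks the indices of the largest values in descending order, the same as reversing the sorted indices and taking the first few. The names `topic`, `n_top`, `top_slice`, and `top_two_step` are only for this illustration.

```python
import numpy as np

topic = np.array([0.5, 4.7, 1.9, 0.6])   # made-up weights for four words
n_top = 2

# argsort() gives indices from smallest to largest weight;
# the slice walks backward from the end and stops after n_top items
top_slice = topic.argsort()[:-n_top - 1:-1]

# The same result in two steps: reverse, then take the first n_top
top_two_step = topic.argsort()[::-1][:n_top]

print(top_slice, top_two_step)
```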

In [4]:
# =-=-=-=-=-=-=-=-=-=-=
# LDA
# =-=-=-=-=-=-=-=-=-=-= 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words=stop_words)

tf = tf_vectorizer.fit_transform(corpus)

tf_array = tf.toarray()
# np.savetxt("../outputs/tentexts_tf.csv", tf_array.astype(int), fmt='%d', delimiter=",")
# print("A tf array of {} has been saved to CSV.".format(tf.shape))

lda = LatentDirichletAllocation(n_components=n_components, 
                                max_iter=20,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

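# get_feature_names() was removed in scikit-learn 1.2; on versions
# before 1.0, call tf_vectorizer.get_feature_names() instead
tf_feature_names = tf_vectorizer.get_feature_names_out()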
print_top_words(lda, tf_feature_names, n_top_words)

0: man 11.61, just 8.01, money 7.41, little 6.03, woods 5.37, told 5.28, old 4.73, later 4.69, bull 4.69, coming 4.69,
1: man 10.06, tree 6.05, house 5.33, got 4.61, shovel 3.99, look 3.96, right 3.96, guy 3.93, money 3.93, water 3.33,
2: gold 6.19, house 4.71, buried 4.02, lot 3.37, supposedly 3.27, money 2.74, man 2.7, little 2.66, wife 2.64, time 2.63,



In [5]:
# =-=-=-=-=-=-=-=-=-=-=
# Create dataframes of TF, H, and W
# =-=-=-=-=-=-=-=-=-=-= 

# Create TF dataframe
df_tf = pd.DataFrame(data= tf_array, index = docs, columns = tf_feature_names)

# Uncomment to glimpse dataframe
# df_tf.head(10)

# Save TF dataframe to CSV file
df_tf.to_csv('../outputs/tf_frame.csv', sep=',')

# Get the W array (document-topic matrix, DTM) and the H array
# (topic-word matrix, saved below as WTM)
lda_W = lda.transform(tf)
lda_H = lda.components_

df_lda_DTM = pd.DataFrame(data= lda_W, index = docs, columns = topic_labels)
df_lda_DTM.to_csv('../outputs/lda_DTM.csv', sep=',')
print(df_lda_DTM)

          Topic 0   Topic 1   Topic 2
anc-088  0.987878  0.006237  0.005885
anc-089  0.966938  0.017512  0.015550
anc-090  0.969784  0.014724  0.015493
anc-091  0.012094  0.974399  0.013506
lau-013  0.988502  0.006191  0.005307
lau-014  0.003010  0.994051  0.002939
loh-157  0.005482  0.005508  0.989011
loh-158  0.980826  0.009206  0.009968
loh-159  0.009861  0.009769  0.980371
loh-160  0.992503  0.003673  0.003824
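
Each row of the document-topic table is dominated by a single topic, so one quick way to read it is to pull out the index of the largest weight per document. This is a minimal sketch using three rows copied from the output above. The names `labels`, `weights`, and `dominant` are mine.

```python
import numpy as np

# Three rows copied from the document-topic output above
labels = ['anc-088', 'anc-091', 'loh-157']
weights = np.array([
    [0.987878, 0.006237, 0.005885],
    [0.012094, 0.974399, 0.013506],
    [0.005482, 0.005508, 0.989011],
])

# Index of the largest weight in each row
dominant = weights.argmax(axis=1)

for label, topic_idx in zip(labels, dominant):
    print(label, "-> Topic", topic_idx)
```

On the full dataframe, `df_lda_DTM.idxmax(axis=1)` should give the same answer by column label.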


In [6]:
df_lda_WTM = pd.DataFrame(data = lda_H, index = topic_labels, columns = tf_feature_names)
df_lda_WTM.to_csv('../outputs/lda_WTM.csv', sep=',')
print(df_lda_WTM)

         american     asked       big    branch   brother     built      bull  \
Topic 0  1.959685  1.922182  2.660974  0.513983  3.277904  1.917683  4.688339   
Topic 1  0.526864  0.615052  1.217397  1.207246  0.560899  0.544669  0.554175   
Topic 2  0.583336  0.556125  0.585222  1.922700  1.228667  1.208030  0.574554   

           buried     chain     chest    ...        water       way     weird  \
Topic 0  1.908744  1.238650  2.609155    ...     0.538923  2.625584  1.247875   
Topic 1  1.166469  1.207228  0.523462    ...     3.325225  1.925063  1.275115   
Topic 2  4.021302  0.526372  0.545059    ...     1.987316  1.987423  0.514261   

             wife      wind     woods   working      yard     years     young  
Topic 0  1.918669  1.208791  5.374326  0.540871  1.262748  2.591511  2.561572  
Topic 1  0.597039  3.310254  0.559850  1.218500  1.908640  1.880450  0.554208  
Topic 2  2.640340  0.553687  0.549913  1.164151  1.244924  0.602379  1.230378  

[3 rows x 123 columns]


In [9]:
# =-=-=-=-=-=-=-=-=-=-=
# NMF
# =-=-=-=-=-=-=-=-=-=-= 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words=stop_words)

tfidf = tfidf_vectorizer.fit_transform(corpus)

tfidf_array = tfidf.toarray()
# np.savetxt("../outputs/tentexts_tfidf.csv", tfidf_array, delimiter=",")
# print("A tf-idf array of {} has been saved to CSV.".format(tfidf.shape))

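# Note: `alpha` was removed in scikit-learn 1.2; newer releases use
# `alpha_W` and `alpha_H` instead, and their results may differ from
# the output shown here.
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)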

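# get_feature_names() was removed in scikit-learn 1.2; on versions
# before 1.0, call tfidf_vectorizer.get_feature_names() instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()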
print_top_words(nmf, tfidf_feature_names, n_top_words)

0: man 0.36, just 0.18, house 0.18, controller 0.17, told 0.15, got 0.15, bull 0.14, old 0.13, coming 0.13, gold 0.12,
1: wasn 0.25, place 0.24, things 0.23, water 0.23, years 0.19, supposedly 0.12, end 0.12, money 0.11, true 0.11, chain 0.1,
2: woods 0.37, little 0.36, money 0.15, used 0.14, grave 0.13, supposed 0.13, seen 0.13, mother 0.12, longer 0.12, did 0.12,



In [10]:
# =-=-=-=-=-=-=-=-=-=-=
# TFIDF and NMF's H and W Matrices
# =-=-=-=-=-=-=-=-=-=-= 

# Label the columns with the TF-IDF vectorizer's own feature names
tfidf_df = pd.DataFrame(data= tfidf_array, index = docs, columns = tfidf_feature_names)
tfidf_df.to_csv('../outputs/tfidf_frame.csv', sep=',')
# tfidf_df.head(10) # To see the first few rows of the tfidf array

# Get the W array (document-topic matrix, DTM) and the H array
# (topic-word matrix, WTM)
nmf_DTM = nmf.transform(tfidf)
nmf_WTM = nmf.components_

df_nmf_DTM = pd.DataFrame(data= nmf_DTM, index = docs, columns = topic_labels)
df_nmf_DTM.to_csv('../outputs/nmf_DTM.csv', sep=',')
print(df_nmf_DTM)

          Topic 0   Topic 1   Topic 2
anc-088  0.705351  0.000000  0.000000
anc-089  0.641042  0.000000  0.000000
anc-090  0.000000  0.000000  1.193289
anc-091  0.000000  1.255067  0.000000
lau-013  0.549254  0.051384  0.000000
lau-014  0.712449  0.000000  0.000000
loh-157  0.494898  0.118592  0.000000
loh-158  0.456827  0.000000  0.059042
loh-159  0.215030  0.214393  0.246293
loh-160  0.658819  0.000000  0.110136
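
Unlike the LDA rows, which each sum to roughly 1, these NMF weights are not on a common scale (one row tops out above 1.2, another spreads about 0.2 across all three topics). One option for comparing the two models, which is my assumption and not something this notebook does, is to divide each row by its sum. The sketch uses three rows copied from the output above. The names `nmf_rows`, `row_sums`, `safe_sums`, and `proportions` are for illustration only.

```python
import numpy as np

# Three rows copied from the NMF document-topic output above
nmf_rows = np.array([
    [0.705351, 0.000000, 0.000000],   # anc-088
    [0.000000, 1.255067, 0.000000],   # anc-091
    [0.215030, 0.214393, 0.246293],   # loh-159
])

row_sums = nmf_rows.sum(axis=1, keepdims=True)

# Guard against all-zero rows before dividing
safe_sums = np.where(row_sums == 0, 1.0, row_sums)
proportions = nmf_rows / safe_sums

print(proportions.round(3))
```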


In [11]:
df_nmf_WTM = pd.DataFrame(data = nmf_WTM, index = topic_labels, columns = tfidf_feature_names)
df_nmf_WTM.to_csv('../outputs/nmf_WTM.csv', sep=',')
print(df_nmf_WTM)

         american     asked       big    branch   brother     built      bull  \
Topic 0  0.025807  0.029743  0.061154  0.020184  0.036086  0.033742  0.135488   
Topic 1  0.000000  0.000000  0.066460  0.008693  0.000000  0.000000  0.000000   
Topic 2  0.000000  0.000000  0.000000  0.016023  0.111004  0.000000  0.000000   

           buried     chain     chest    ...        water       way     weird  \
Topic 0  0.071519  0.000000  0.062882    ...     0.017865  0.045446  0.019096   
Topic 1  0.088629  0.103527  0.000000    ...     0.231232  0.076397  0.000000   
Topic 2  0.000000  0.000000  0.000000    ...     0.002343  0.000000  0.000000   

             wife      wind     woods   working      yard    years     young  
Topic 0  0.084598  0.094021  0.082012  0.008862  0.040743  0.03211  0.049947  
Topic 1  0.000000  0.000000  0.000000  0.000000  0.000000  0.19483  0.000000  
Topic 2  0.000000  0.000000  0.373551  0.000000  0.000000  0.00000  0.000000  

[3 rows x 123 columns]
