# Latent Dirichlet Allocation

In [27]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [28]:
def TextNormalizer(text):
    """Return a cleaned copy of a Series of strings; the input Series is left untouched."""
    # only lowercase letters, apostrophes and spaces survive the filter
    whitelist = set("'abcdefghijklmnopqrstuvwxyz ")

    def clean(doc):
        doc = doc.strip().lower()
        # filtered characters (newlines included) are dropped, not replaced by spaces
        doc = ''.join(filter(whitelist.__contains__, doc))
        return doc.strip()

    return text.apply(clean)

data['clean_text'] = TextNormalizer(data.text)

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [30]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
import gensim

# note: the NLTK 'stopwords' and 'wordnet' resources must be downloaded beforehand
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words set (set membership tests are faster than list lookups)
en_stop = set(stopwords.words('english'))

# create a WordNet lemmatizer (this cell lemmatizes rather than stems)
lemmatizer = WordNetLemmatizer()

# list for tokenized documents in loop
texts = []

# loop through document list
for doc in data.clean_text:
    
    # lowercase and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]
    
    # lemmatize tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stopped_tokens]
    
    # add tokens to list
    texts.append(lemmatized_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a bag-of-words corpus (list of (term id, count) pairs per document)
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)

In [79]:
str_match = [s for s in tokens if "ll" in s]
print(str_match)

['william', 'still', 'really', 'allwill', 'dallas']
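
A token such as 'allwill' in the output above is probably two words glued together: the whitelist filter deletes every character it does not recognise, newlines included, instead of replacing it with a space. The sketch below contrasts that behaviour with a variant that keeps word boundaries. The name `normalize_keep_boundaries` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def normalize_keep_boundaries(text):
    """Lowercase, replace every disallowed character with a space, collapse whitespace."""
    text = text.lower()
    # anything that is not a lowercase letter, apostrophe or space becomes a space
    text = re.sub(r"[^a-z' ]", " ", text)
    # collapse the runs of spaces created by the substitution
    return re.sub(r"\s+", " ", text).strip()

sample = "we are all\nwill"

# behaviour of the whitelist filter: the newline is dropped and the words merge
whitelist = set("'abcdefghijklmnopqrstuvwxyz ")
glued = "".join(filter(whitelist.__contains__, sample))

# boundary-preserving variant: the newline becomes a space
separated = normalize_keep_boundaries(sample)

print(glued)
print(separated)
```

Swapping this in would change the vocabulary, so the topic outputs shown in this notebook would no longer match exactly.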


In [72]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(data.clean_text)
vector = vectorizer.transform(data.clean_text)
lda=LDA(n_components=2)
lda.fit_transform(vector)


array([[0.0061606 , 0.9938394 ],
       [0.00410111, 0.99589889],
       [0.00387489, 0.99612511],
       ...,
       [0.00461547, 0.99538453],
       [0.02715464, 0.97284536],
       [0.98987665, 0.01012335]])
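
Each row of the array above is one document's distribution over the topics, so a row sums to 1 and the largest entry marks the dominant topic. A minimal sketch on a made-up four-document corpus (all `toy_*` names and the documents themselves are assumptions for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical miniature corpus with two obvious themes
toy_docs = [
    "god church jesus faith believe",
    "hockey team game season nhl",
    "church faith god christian bible",
    "game player team play hockey",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# random_state is fixed only to make the sketch repeatable
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)  # shape: (n_documents, n_topics)

# every row is a probability distribution over topics
print(doc_topics.sum(axis=1))

# index of the most probable topic for each document
dominant_topic = doc_topics.argmax(axis=1)
print(dominant_topic)
```

Topic indices are arbitrary labels: which theme ends up as topic 0 or 1 depends on the random initialisation.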

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [70]:
print(ldamodel.print_topics(num_topics=4, num_words=10))

[(0, '0.010*"god" + 0.006*"one" + 0.006*"would" + 0.005*"people" + 0.004*"christian" + 0.004*"jesus" + 0.004*"say" + 0.003*"think" + 0.003*"church" + 0.003*"know"'), (1, '0.008*"team" + 0.008*"game" + 0.005*"hockey" + 0.004*"player" + 0.004*"university" + 0.004*"play" + 0.004*"would" + 0.004*"year" + 0.003*"go" + 0.003*"nhl"')]


In [63]:
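import numpy as np

def print_topic(model, count_vectorizer, n_top_words):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    words = count_vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print('\nTopic #%d:' % idx)
        # normalise the topic's word weights so they sum to 1, then list the top words
        print([(words[i], (topic[i] / np.sum(topic)).round(4)) for i in topic.argsort()[:-n_top_words - 1:-1]])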

In [50]:
# lda.components_ has one row of word weights per topic (two here); keep the second row
_, topic = lda.components_


In [53]:
import numpy as np
np.sum(topic)

112280.10773156365
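
The sum above is far from 1 because the rows of `components_` hold unnormalised word weights (pseudo-counts), not probabilities. Dividing a row by its sum, as `print_topic` does, turns it into a word distribution. A small sketch with invented numbers (the `toy_*` values are hypothetical):

```python
import numpy as np

# hypothetical pseudo-count row for one topic over a four-word vocabulary
toy_words = np.array(["god", "church", "team", "game"])
toy_topic = np.array([30.0, 10.0, 5.0, 5.0])

# normalise so the weights sum to 1
word_probs = toy_topic / toy_topic.sum()

# same slicing idiom as print_topic: indices of the n largest weights, largest first
n_top_words = 2
top_idx = toy_topic.argsort()[:-n_top_words - 1:-1]

top_words = [(str(toy_words[i]), float(word_probs[i])) for i in top_idx]
print(top_words)
```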

In [71]:
number_words = 10
print_topic(lda, vectorizer, number_words)


Topic #0:
[('god', 0.0039), ('don', 0.0038), ('church', 0.0032), ('like', 0.0031), ('think', 0.003), ('people', 0.0028), ('time', 0.0027), ('university', 0.0026), ('just', 0.0026), ('hockey', 0.0025)]

Topic #1:
[('people', 0.0058), ('god', 0.0055), ('truth', 0.0044), ('know', 0.0043), ('think', 0.0038), ('don', 0.0035), ('article', 0.0034), ('university', 0.0033), ('believe', 0.0032), ('christians', 0.0031)]

Topic #2:
[('god', 0.015), ('jesus', 0.0072), ('people', 0.0054), ('christ', 0.0042), ('does', 0.0041), ('believe', 0.004), ('hell', 0.0035), ('think', 0.0034), ('say', 0.0033), ('church', 0.0033)]

Topic #3:
[('team', 0.0085), ('game', 0.0063), ('hockey', 0.006), ('play', 0.0053), ('games', 0.0042), ('nhl', 0.0041), ('season', 0.0039), ('year', 0.0037), ('period', 0.0034), ('university', 0.0033)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [None]:
text = """The National Basketball Association (NBA) is the premier basketball league in the world.
Created on June 6, 1946 under the name of BAA (Basketball Association of America),
the league was renamed NBA in 1949 after its merger with the NBL (National Basketball League).
It is one of the four major professional leagues in American sports, alongside the NFL (American football),
MLB (baseball) and NHL (ice hockey). The headquarters of the NBA are located in the Olympic Tower at 645 5th Avenue in New York.
In 2015, NBA players were the highest paid athletes in the world."""
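
The two steps described above (vectorize, then predict) could look roughly like the sketch below. It is self-contained, so it trains a throwaway model on a made-up corpus; the `demo_*` names, `train_docs` and `new_text` are assumptions for illustration. In the notebook itself the same two calls would presumably go through the already fitted `vectorizer` and `lda`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical training corpus standing in for the notebook's cleaned emails
train_docs = [
    "god church jesus faith believe",
    "hockey team game season nhl",
    "church faith god christian bible",
    "game player team play hockey",
]

demo_vectorizer = CountVectorizer(stop_words="english")
demo_counts = demo_vectorizer.fit_transform(train_docs)
demo_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(demo_counts)

new_text = "The hockey team won the game last season"

# step 1: vectorize with the fitted vocabulary; transform expects an iterable of documents
new_counts = demo_vectorizer.transform([new_text])

# step 2: infer the topic distribution of the new document
new_topics = demo_lda.transform(new_counts)  # shape: (1, n_topics)
predicted_topic = int(new_topics.argmax(axis=1)[0])

print(new_topics)
print(predicted_topic)
```

Words that the vectorizer never saw during fitting are ignored, so a text about basketball can only be scored through the vocabulary it shares with the training emails.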