Lambda School Data Science

*Unit 4, Sprint 1, Module 4*

---

# Topic Modeling (Prepare)

# Latent Dirichlet Allocation (LDA) Models (Prepare)
<a id="#p1"></a>

## Overview
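LDA is a "generative probabilistic model": it assumes every document is a mixture of a small number of topics, and every topic is a probability distribution over words. Fitting the model means inferring those hidden ("latent") topics from the words we actually observe.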

Let's play with a model available [here](https://lettier.com/projects/lda-topic-modeling/).
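
To make "generative" concrete: LDA imagines that each document is written by first drawing a mixture of topics, then choosing every word by picking a topic from that mixture and a word from that topic. The sketch below simulates that story using only the standard library. The two topics, their word probabilities, and the `alpha` value are made up for illustration; fitting LDA amounts to running this story in reverse.

```python
import random

# Hypothetical topics: each maps words to probabilities (invented for illustration).
TOPICS = {
    "courtship": {"marriage": 0.4, "ball": 0.3, "letter": 0.3},
    "school": {"teacher": 0.5, "lesson": 0.3, "pupil": 0.2},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate one document following the LDA generative story."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = rng.choices(names, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return theta, words

theta, words = generate_document(n_words=10)
print(theta, words)
```

A small `alpha` tends to produce documents dominated by one topic; a large `alpha` produces more even mixtures.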

## Follow Along

## Challenge 

# Estimating LDA Models with Gensim (Learn)
<a id="#p1"></a>

## Overview
### A Literary Introduction: *Jane Austen V. Charlotte Bronte*
Despite being born nearly forty years apart, Jane Austen and Charlotte Bronte are often pitted against one another in an imagined battle for literary supremacy. The battle centers on the topics of education for women, courting, and marriage. The authors' similar backgrounds naturally draw comparisons, but the modern fascination is probably due to the novelty of British women publishing novels during the early 19th century.

Can we help close a literary battle for supremacy and simply acknowledge that the authors addressed different topics and that each deserves recognition as an excellent author in her own right?

We're going to apply Latent Dirichlet Allocation, a machine learning algorithm for topic modeling, to each author's novels to compare the distribution of topics in their work.

In [1]:
import numpy as np
import gensim
import os
import re

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

from gensim.models.ldamulticore import LdaMulticore

import pandas as pd

### Novel Data
I grabbed the novel data pre-split into a bunch of smaller chunks.

In [2]:
path = './data/austen-brontë-split'

In [3]:
STOPWORDS = set(STOPWORDS).union(set(['said', 'mr', 'mrs']))

def tokenize(text):
    return [token for token in simple_preprocess(text) \
            if token not in STOPWORDS]

In [4]:
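def gather_data(path_to_data):
    """Read every .txt file in the directory and return a list of token lists."""
    data = []

    # Use the function's own argument rather than the global `path`
    for f in os.listdir(path_to_data):
        if f.endswith('.txt'):
            with open(os.path.join(path_to_data, f)) as t:
                text = t.read().strip('\n')
                data.append(tokenize(text))

    return data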

In [5]:
tokens = gather_data(path)

In [6]:
"this is a sample string with a \n newline character".replace('\n', '')

'this is a sample string with a  newline character'

In [7]:
len(tokens)

813

## Follow Along

### Text Preprocessing
**Challenge**: update the function `tokenize` with any technique you have learned so far this week. 

In [8]:
# Apply the same .txt filter as gather_data so titles line up with tokens
titles = [t[:-4] for t in os.listdir(path) if t.endswith('.txt')]

In [9]:
len(titles)

813

In [10]:
titles[:5]

['Austen_Emma0000',
 'Austen_Emma0001',
 'Austen_Emma0002',
 'Austen_Emma0003',
 'Austen_Emma0004']

In [46]:
STOPWORDS = set(STOPWORDS).union(set(['said', 'mr', 'mrs', 'miss']))

def tokenize(text):
    return [token for token in simple_preprocess(text) \
            if token not in STOPWORDS]

In [47]:
tokenize("Hello World! This a test of the tokenization method")

['hello', 'world', 'test', 'tokenization', 'method']

### Author DataFrame


In [48]:
df = pd.DataFrame(index=titles)

In [49]:
df.head()

Austen_Emma0000
Austen_Emma0001
Austen_Emma0002
Austen_Emma0003
Austen_Emma0004


In [50]:
df['tokens'] = tokens

In [51]:
# Titles look like 'Austen_Emma0000': author, then book name, then a 4-digit section number
df['author'] = [title.split('_')[0] for title in df.index]
df['book'] = [title.split('_')[1][:-4] for title in df.index]
df['section'] = [int(title[-4:]) for title in df.index]

In [52]:
df['author'] = df['author'].map({'Austen':1, 'CBronte':0})

In [53]:
df.author.value_counts()

0    441
1    372
Name: author, dtype: int64

In [54]:
df.head()

| | tokens | author | book | section |
|---|---|---|---|---|
| Austen_Emma0000 | [emma, jane, austen, volume, chapter, emma, wo... | 1 | Emma | 0 |
| Austen_Emma0001 | [taylor, wish, pity, weston, thought, agree, p... | 1 | Emma | 1 |
| Austen_Emma0002 | [behaved, charmingly, body, punctual, body, be... | 1 | Emma | 2 |
| Austen_Emma0003 | [native, highbury, born, respectable, family, ... | 1 | Emma | 3 |
| Austen_Emma0004 | [mention, handsome, letter, weston, received, ... | 1 | Emma | 4 |


### Streaming Documents
Here we use a Python feature that may be new to you: the `yield` statement, which turns our function into a generator. This lets us iterate over a bunch of documents one at a time without reading them all into memory at once. You can see how we use this function later on.

In [55]:
def doc_stream(path):
    for f in os.listdir(path):
        # Check the extension before opening, so non-text files are never touched
        if f.endswith('.txt'):
            with open(os.path.join(path, f)) as t:
                text = t.read().strip('\n')
                tokens = tokenize(text)
                yield tokens
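
One consequence of `yield` is worth seeing in isolation: a generator produces items lazily and can be consumed only once, which is why later cells call `doc_stream(path)` afresh each time they need the documents. A minimal sketch, with a toy in-memory list standing in for the files (the `toy_stream` name and its data are hypothetical):

```python
def toy_stream(docs):
    """Yield one tokenized document at a time, as doc_stream does for files."""
    for doc in docs:
        yield doc.lower().split()

docs = ["Emma Woodhouse", "Jane Eyre", "Lucy Snowe"]

stream = toy_stream(docs)
first = next(stream)      # only the first document has been processed so far
rest = list(stream)       # consumes everything that is left
leftover = list(stream)   # the generator is now exhausted

print(first)     # ['emma', 'woodhouse']
print(rest)      # [['jane', 'eyre'], ['lucy', 'snowe']]
print(leftover)  # []
```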

In [56]:
streaming_data = doc_stream(path)

In [57]:
next(streaming_data)

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 'handsome',
 'clever',
 'rich',
 'comfortable',
 'home',
 'happy',
 'disposition',
 'unite',
 'best',
 'blessings',
 'existence',
 'lived',
 'nearly',
 'years',
 'world',
 'little',
 'distress',
 'vex',
 'youngest',
 'daughters',
 'affectionate',
 'indulgent',
 'father',
 'consequence',
 'sister',
 'marriage',
 'mistress',
 'house',
 'early',
 'period',
 'mother',
 'died',
 'long',
 'ago',
 'indistinct',
 'remembrance',
 'caresses',
 'place',
 'supplied',
 'excellent',
 'woman',
 'governess',
 'fallen',
 'little',
 'short',
 'mother',
 'affection',
 'sixteen',
 'years',
 'taylor',
 'woodhouse',
 'family',
 'governess',
 'friend',
 'fond',
 'daughters',
 'particularly',
 'emma',
 'intimacy',
 'sisters',
 'taylor',
 'ceased',
 'hold',
 'nominal',
 'office',
 'governess',
 'mildness',
 'temper',
 'hardly',
 'allowed',
 'impose',
 'restraint',
 'shadow',
 'authority',
 'long',
 'passed',
 'away',
 'living',
 'frien

### Gensim LDA Topic Modeling

In [58]:
# A Dictionary Representation of all the words in our corpus
id2word = corpora.Dictionary(doc_stream(path))

In [59]:
id2word.token2id['england']

3985

In [60]:
id2word.doc2bow(tokenize("This is a sample message Darcy England England England"))

[(2752, 1), (3985, 3), (6600, 1), (6817, 1)]
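
The output above is a list of `(token_id, count)` pairs. A rough pure-Python picture of that mapping is sketched below. It is not gensim's implementation: the `token2id` table and its ids are invented, and `to_bow` is a hypothetical helper. Like `doc2bow` with its default settings, it skips tokens that are not in the dictionary.

```python
from collections import Counter

# Hypothetical vocabulary; real ids come from corpora.Dictionary.
token2id = {"darcy": 0, "england": 1, "message": 2, "sample": 3}

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

tokens = ["sample", "message", "darcy", "england", "england", "england", "zeppelin"]
bow = to_bow(tokens, token2id)
print(bow)  # [(0, 1), (1, 3), (2, 1), (3, 1)]
```

Word order is discarded; only how often each known token occurs survives, which is all LDA uses.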

In [61]:
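import sys
# Note: getsizeof reports only the shallow size of the Dictionary object,
# not the token-to-id mappings it holds, so this understates the real footprint
sys.getsizeof(id2word)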

56

In [62]:
len(id2word.keys())

22094

In [63]:
# Let's remove very rare and very common tokens from the dictionary
id2word.filter_extremes(no_below=10, no_above=0.95)

In [64]:
len(id2word.keys())

4923
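
`filter_extremes` prunes the vocabulary by document frequency: `no_below=10` drops tokens that appear in fewer than 10 documents, and `no_above=0.95` drops tokens that appear in more than 95% of them. A small stand-alone sketch of the same idea on toy data follows (the `filter_by_doc_freq` helper is hypothetical, and it ignores gensim's additional `keep_n` cap):

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a fraction `no_above` of docs."""
    # set(doc) makes each token count once per document, however often it repeats
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["emma", "little", "harriet"],
    ["emma", "little", "weston"],
    ["jane", "little", "rochester"],
    ["jane", "little", "school"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['emma', 'jane']
```

Here "little" is dropped for appearing everywhere, and the one-off names are dropped for appearing in a single document.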

In [65]:
# A bag-of-words (bow) representation of our corpus
# Note: the raw text is streamed one document at a time; only the sparse bow vectors are kept in memory
corpus = [id2word.doc2bow(text) for text in doc_stream(path)]

In [66]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   random_state=723812,
                   num_topics = 15,
                   passes=10,
                   workers=4
                  )

In [67]:
lda.print_topics()

[(0,
  '0.013*"lady" + 0.012*"collins" + 0.010*"catherine" + 0.009*"elizabeth" + 0.009*"sir" + 0.007*"john" + 0.006*"little" + 0.006*"room" + 0.005*"st" + 0.005*"day"'),
 (1,
  '0.029*"harriet" + 0.018*"emma" + 0.010*"elton" + 0.008*"good" + 0.008*"man" + 0.008*"think" + 0.007*"knightley" + 0.006*"thought" + 0.006*"martin" + 0.006*"little"'),
 (2,
  '0.016*"emma" + 0.010*"weston" + 0.009*"thing" + 0.008*"jane" + 0.008*"knightley" + 0.008*"know" + 0.007*"think" + 0.007*"little" + 0.007*"elton" + 0.006*"good"'),
 (3,
  '0.009*"thousand" + 0.007*"sisters" + 0.006*"justice" + 0.005*"burns" + 0.005*"marry" + 0.005*"pounds" + 0.004*"know" + 0.004*"home" + 0.004*"diana" + 0.004*"like"'),
 (4,
  '0.007*"little" + 0.006*"like" + 0.005*"thought" + 0.004*"know" + 0.004*"good" + 0.004*"madame" + 0.004*"day" + 0.004*"time" + 0.003*"long" + 0.003*"hand"'),
 (5,
  '0.009*"know" + 0.008*"rochester" + 0.006*"time" + 0.006*"school" + 0.006*"john" + 0.005*"long" + 0.005*"life" + 0.004*"st" + 0.004*"god" 

In [68]:
words = [re.findall(r'"([^"]*)"', t[1]) for t in lda.print_topics()]

In [69]:
topics = [' '.join(t[0:5]) for t in words]

In [70]:
# Use topic_id rather than id, which would shadow the built-in function
for topic_id, t in enumerate(topics):
    print(f'Topic {topic_id}: {t}')
    print("\n")

Topic 0: lady collins catherine elizabeth sir


Topic 1: harriet emma elton good man


Topic 2: emma weston thing jane knightley


Topic 3: thousand sisters justice burns marry


Topic 4: little like thought know good


Topic 5: know rochester time school john


Topic 6: wickham uncle darcy came lydia


Topic 7: sir jane like rochester little


Topic 8: st john diana felt hannah


Topic 9: room rochester like little jane


Topic 10: elizabeth bingley bennet darcy collins


Topic 11: elinor marianne jennings sister willoughby


Topic 12: mdlle monsieur henri english mademoiselle


Topic 13: know think elizabeth good sister


Topic 14: edward sir elinor marianne mason




## Challenge 

You will apply an LDA model to a customer review dataset to practice the fitting and estimation of LDA. 

# Interpret LDA Results (Learn)
<a id="#p3"></a>

## Overview

## Follow Along

### Topic Distance Visualization

In [71]:
# In pyLDAvis 3.3 and later, this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [72]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)



### Overall Model / Documents

In [73]:
lda[corpus[0]]

[(2, 0.7180853), (13, 0.27344456)]

In [74]:
distro = [lda[d] for d in corpus]

In [75]:
distro[0]

[(2, 0.7184448), (13, 0.27373424)]

In [76]:
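# lda[...] omits topics whose probability falls below the model's minimum threshold,
# so fill those in with 0 to get one value per topic for every document
def update(doc):
    d_dist = {k: 0 for k in range(lda.num_topics)}
    for topic_id, prob in doc:
        d_dist[topic_id] = prob
    return d_dist

new_distro = [update(d) for d in distro]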

In [77]:
df = pd.DataFrame.from_records(new_distro, index=titles)
df.columns = topics
df['author'] = df.reset_index()['index'].apply(lambda x: x.split('_')[0]).tolist()

In [81]:
df.head().T

| topic | Austen_Emma0000 | Austen_Emma0001 | Austen_Emma0002 | Austen_Emma0003 | Austen_Emma0004 |
|---|---|---|---|---|---|
| lady collins catherine elizabeth sir | 0 | 0 | 0 | 0 | 0 |
| harriet emma elton good man | 0 | 0 | 0 | 0 | 0 |
| emma weston thing jane knightley | 0.718895 | 0.991422 | 0.979414 | 0.564227 | 0.997689 |
| thousand sisters justice burns marry | 0 | 0 | 0 | 0 | 0 |
| little like thought know good | 0 | 0 | 0 | 0 | 0 |
| know rochester time school john | 0 | 0 | 0 | 0 | 0 |
| wickham uncle darcy came lydia | 0 | 0 | 0 | 0 | 0 |
| sir jane like rochester little | 0 | 0 | 0 | 0 | 0 |
| st john diana felt hannah | 0 | 0 | 0 | 0 | 0 |
| room rochester like little jane | 0 | 0 | 0 | 0 | 0 |


In [80]:
df.groupby('author').mean().T

| topic | Austen | CBronte |
|---|---|---|
| lady collins catherine elizabeth sir | 0.010347 | 0.002799 |
| harriet emma elton good man | 0.083192 | 0.000859 |
| emma weston thing jane knightley | 0.289165 | 0.004567 |
| thousand sisters justice burns marry | 5e-05 | 0.004178 |
| little like thought know good | 0.007407 | 0.619619 |
| know rochester time school john | 0.0 | 0.022861 |
| wickham uncle darcy came lydia | 0.00268 | 0.0 |
| sir jane like rochester little | 0.0 | 0.045805 |
| st john diana felt hannah | 0.012202 | 0.015029 |
| room rochester like little jane | 0.007939 | 0.218273 |


## Challenge
### *Can we see if one of the authors focuses more on men than women?*

*  Use spaCy for text preprocessing
*  Extract the named entities from the documents using spaCy (the command is fairly straightforward)
*  Create a unique list of names from the authors' texts (you'll find that there are different types of named entities, not all of them people)
*  Label the names with genders (you can do this by hand, or you can use the US census name lists)
*  Customize your processing to replace each proper name with its gender from the previous step's lookup table
*  Then follow the rest of the LDA flow


# Selecting the Number of Topics (Learn)
<a id="#p4"></a>

## Overview

## Follow Along

In [82]:
from gensim.models.coherencemodel import CoherenceModel

def compute_coherence_values(dictionary, corpus, path, limit, start=2, step=3, passes=5):
    """
    Compute u_mass coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    path : path to input texts (not needed for u_mass, which works from the corpus)
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics
    passes: the number of times the whole sweep over topic counts is repeated

    Returns:
    -------
    coherence_values : list of dicts, one per fitted model, with keys
                       'pass', 'num_topics', and 'coherence_score'
    """

    coherence_values = []

    for iter_ in range(passes):
        for num_topics in range(start, limit, step):
            model = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=4)
            coherencemodel = CoherenceModel(model=model, dictionary=dictionary, corpus=corpus, coherence='u_mass')
            coherence_values.append({'pass': iter_,
                                     'num_topics': num_topics,
                                     'coherence_score': coherencemodel.get_coherence()
                                    })

    return coherence_values

In [84]:
# Can take a long time to run.
coherence_values = compute_coherence_values(dictionary=id2word,
                                            corpus=corpus,
                                            path=path,
                                            start=2,
                                            limit=40,
                                            step=10,
                                            passes=1)

In [85]:
topic_coherence = pd.DataFrame.from_records(coherence_values)

In [86]:
topic_coherence.head()

| | coherence_score | num_topics | pass |
|---|---|---|---|
| 0 | -0.67531 | 2 | 0 |
| 1 | -0.697647 | 12 | 0 |
| 2 | -0.768982 | 22 | 0 |
| 3 | -0.880607 | 32 | 0 |


In [87]:
import seaborn as sns

ax = sns.lineplot(x="num_topics", y="coherence_score", data=topic_coherence)

In [101]:
# Print the coherence scores
for m, cv in zip(topic_coherence['num_topics'], topic_coherence['coherence_score']):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

Num Topics = 2  has Coherence Value of -0.6753
Num Topics = 12  has Coherence Value of -0.6976
Num Topics = 22  has Coherence Value of -0.769
Num Topics = 32  has Coherence Value of -0.8806
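
To read these scores it helps to see what a UMass-style coherence measures: for a topic's top words, how often the less probable word of each pair shows up in the same documents as the more probable one. The sketch below follows the commonly cited formulation with +1 smoothing on toy data. It is an illustration under stated assumptions, not gensim's code: the `umass_coherence` helper is hypothetical, and as far as I can tell gensim uses a different smoothing constant and averages over pairs and topics, so the numbers will not match the ones above. Only the direction carries over: scores closer to zero mean the top words co-occur more often.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence (+1 smoothing) for one topic's top words.

    Assumes top_words is ordered from most to least probable and that
    every word occurs in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations preserves order, so `earlier` is always the more probable word
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

docs = [
    ["emma", "harriet", "elton"],
    ["emma", "harriet"],
    ["emma", "weston"],
    ["rochester", "jane"],
]
coherent = umass_coherence(["emma", "harriet"], docs)
mixed = umass_coherence(["emma", "rochester"], docs)
print(coherent, mixed)
```

The "emma harriet" pair scores higher than "emma rochester" because the first two words share documents and the second two never do.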


In [89]:
lda[id2word.doc2bow(tokenize("This is a sample document to score with a topic distribution."))]

[(0, 0.02225876),
 (1, 0.02225863),
 (2, 0.022258686),
 (3, 0.02225863),
 (4, 0.6883784),
 (5, 0.02225868),
 (6, 0.02225863),
 (7, 0.02225869),
 (8, 0.022258861),
 (9, 0.02225863),
 (10, 0.02225863),
 (11, 0.02225863),
 (12, 0.02225863),
 (13, 0.022258695),
 (14, 0.022258874)]

## Challenge
### *Can we see if one of the authors focuses more on men than women?*

*  Use spaCy for text preprocessing
*  Extract the named entities from the documents using spaCy (the command is fairly straightforward)
*  Create a unique list of names from the authors' texts (you'll find that there are different types of named entities, not all of them people)
*  Label the names with genders (you can do this by hand, or you can use the US census name lists)
*  Customize your processing to replace each proper name with its gender from the previous step's lookup table
*  Then follow the rest of the LDA flow

In [90]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [91]:
test = "Ned asked me a question about England today."

In [92]:
doc = nlp(test)

for token in doc:
    print(token.text, token.lemma_, token.pos_)

Ned Ned PROPN
asked ask VERB
me -PRON- PRON
a a DET
question question NOUN
about about ADP
England England PROPN
today today NOUN
. . PUNCT


In [93]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Ned PERSON
England GPE
today DATE


In [94]:
def doc_stream(path):
    # Redefined to yield raw text, since spaCy does its own tokenization
    for f in os.listdir(path):
        if f.endswith('.txt'):
            with open(os.path.join(path, f)) as t:
                text = t.read().strip('\n')
                yield text

def get_people(docstream):
    
    ppl = []
    
    for d in docstream:
        
        doc = nlp(d)
        
        for ent in doc.ents:
            
            if ent.label_ == "PERSON":
                ppl.append(ent.lemma_)
                
    return set(ppl)

In [102]:
people = get_people(doc_stream(path))

In [103]:
doc = nlp(next(doc_stream(path)))

In [104]:
doc.ents[0].lemma_

'JANE AUSTEN'
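
For the remaining challenge steps (labeling names and swapping them for gender tokens), one possible shape is a lookup table applied to the tokens. The sketch below is only that: the `NAME_TO_GENDER` entries are hand-labeled examples rather than output from the cells above, and `replace_names` is a hypothetical helper that works on already-tokenized, lowercased text.

```python
# Hand-labeled lookup table (illustrative); a real one would be built
# from the PERSON entities extracted above.
NAME_TO_GENDER = {
    "emma": "female",
    "harriet": "female",
    "knightley": "male",
    "rochester": "male",
}

def replace_names(tokens, lookup=NAME_TO_GENDER):
    """Swap each known name for its gender label; leave other tokens unchanged."""
    return [lookup.get(tok, tok) for tok in tokens]

tokens = ["emma", "thought", "knightley", "handsome"]
print(replace_names(tokens))  # ['female', 'thought', 'male', 'handsome']
```

The relabeled tokens can then go through the same dictionary, bag-of-words, and LDA steps as before. Surnames shared by characters of different genders would need extra handling that this sketch does not attempt.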

# Sources

### *References*
* [Blei, Ng, and Jordan's paper on LDA](https://ai.stanford.edu/~ang/papers/jair03-lda.pdf)
* On [Coherence](https://pdfs.semanticscholar.org/1521/8d9c029cbb903ae7c729b2c644c24994c201.pdf)

### *Resources*

* [Gensim](https://radimrehurek.com/gensim/): Python package for topic modeling, NLP, word vectorization, and a few other things. Well maintained and well documented.
* [Topic Modeling with Gensim](http://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#11createthedictionaryandcorpusneededfortopicmodeling): A kind of cookbook for LDA with gensim. Excellent overview, but you need to be aware of missing import statements and assumed prior knowledge.
* [Chinese Restaurant Process](https://en.wikipedia.org/wiki/Chinese_restaurant_process): That really obscure stats thing I mentioned...
* [PyLDAvis](https://github.com/bmabey/pyLDAvis): Library for visualizing the topic model and performing some exploratory work. Works well. Has a direct parallel implementation in R as well.
* [Rare Technologies](https://rare-technologies.com/): The people that made & maintain gensim and a few other libraries.
* [Jane Austen v. Charlotte Bronte](https://www.literaryladiesguide.com/literary-musings/jane-austen-charlotte-bronte-different-alike/)