## textacy, with the Inaugural Addresses . . . 

[textacy Quick Start](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html)

[textacy API documentation](https://chartbeat-labs.github.io/textacy/api_reference.html#)

[textacy github repo](https://github.com/chartbeat-labs/textacy)


## . . . and topic modeling . . . 

[Mallet](http://mallet.cs.umass.edu/), the Java topic modeling package we use most often.

[David Mimno explains Topic Modeling](https://vimeo.com/53080123).  Mimno is the maintainer of Mallet.

[Ben Schmidt applies topic modeling to ship logs](http://sappingattention.blogspot.com/2012/11/when-you-have-mallet-everything-looks.html)

[Scott Weingart, "Topic Modeling for Humanists: A Guided Tour"](http://www.scottbot.net/HIAL/index.html@p=19113.html)

[Mining the Dispatch](http://dsl.richmond.edu/dispatch/), an exemplary application of topic modeling to a set of historical newspaper data.

[My toy topic modeller](https://talus.artsci.wustl.edu/malletTalk/toyTopicModeller.py), written in Python.

[My github repo for "understanding_mallet"](https://github.com/spenteco/understanding_mallet)

## But first we have to get textacy re-installed

### First, is it already installed?

If the **correct, latest** version is installed, the next cell should output:

    0.6.0
    2.0.11

In [17]:
import textacy, spacy

print textacy.__version__
print spacy.__version__

nlp = spacy.load('en')

0.6.0
2.0.9
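
A slightly more defensive version of this check, as a sketch: it reports a missing package instead of stopping with an ImportError. The helper name `installed_version` is my own invention, not part of textacy or spacy, and the snippet uses `print(...)` with a single argument so it runs under Python 2 and 3.

```python
import importlib

def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, '__version__', 'unknown')

for name in ['textacy', 'spacy']:
    print('%s: %s' % (name, installed_version(name)))
```

A result of `None` corresponds to the "ImportError" case described below.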



### If the previous cell returned "dotted" numbers, but not the right one . . . 

. . . uninstall textacy.  You should be able to run the next cell in both Mac and Windows.

### If the previous cell returned "ImportError: No module named textacy" . . . 

. . . or a similar message, then you can skip the next cell.

In [12]:
#!conda uninstall -y textacy

Solving environment: - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - 

### The problems seem to be related to PATH

### For Mac only

First, the next cell will return something like

    /home/spenteco/anaconda2/bin/python
    
### For Windows

I don't have a Windows machine with a functional Anaconda for testing this (we have Anaconda<br/>
on Windows, but all the installations are broken).  In any case, you don't need to run the<br/>
"which" command in the next cell (and can't, because it's not a Windows command!).


In [5]:
!which python

/Users/JGuyton/anaconda/bin/python


### Reinstall textacy and the spacy models

### For Mac

We're taking the path (everything through "bin/" for the first, textacy install command;<br/>everything for the second command) to specify which "pip" and which "python" we want<br/>to use (there are likely several on your computer).  In the following cell, modify the commands:

!**/home/spenteco/anaconda2/bin/**pip install textacy

!**/home/spenteco/anaconda2/bin/python** -m spacy download en

So that the bolded parts match whatever was output by the "which" command.
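
One way to avoid copying the path by hand, as a sketch: `sys.executable` is the Python that the notebook kernel itself is running, and `python -m pip` runs the pip that belongs to that same Python. This assumes the kernel's Python is the one you want to install into. The function name `install_commands` is hypothetical, and the snippet only prints the commands rather than running them.

```python
import sys

def install_commands(python_path=None):
    """Build the two install commands for a given Python interpreter."""
    # Default to the interpreter running this notebook kernel.
    python_path = python_path or sys.executable
    return [
        [python_path, '-m', 'pip', 'install', 'textacy'],
        [python_path, '-m', 'spacy', 'download', 'en'],
    ]

for command in install_commands():
    # Print the commands rather than running them; paste them after "!" in a cell.
    print(' '.join(command))
```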

### For Windows

I'm working from a memory of the one time I saw this sort of package-installation problem in Windows.  But, I think you can:

1.  Find "Anaconda Prompt" in the start menu, and open it.

2.  In the "Anaconda Prompt", run:

    pip install textacy

    python -m spacy download en

### When you're done, restart the notebook Kernel (Mac and Windows)

See the menu at the top of the Jupyter notebook page.


In [15]:
# !/Users/JGuyton/anaconda/bin/pip install textacy

!/Users/JGuyton/anaconda/bin/python -m spacy download en

# DONT FORGET TO RESTART THE KERNEL WHEN YOU'RE DONE . . . 

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 15.1MB/s 

    Linking successful
    /Users/JGuyton/anaconda/lib/python2.7/site-packages/en_core_web_sm -->
    /Users/JGuyton/anaconda/lib/python2.7/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')



### Did it work?

The next cell should display:

    0.6.0
    2.0.11


In [18]:
import textacy, spacy

print textacy.__version__
print spacy.__version__

nlp = spacy.load('en')

0.6.0
2.0.9


## Back to Text Analysis . . . 

. . . using the Inaugural Address corpus, which has these files:

In [20]:
!ls corpora/inaugural_addresses

10_adams_john_quincy_1825.txt  37_roosevelt_franklin_1933.txt
11_jackson_1829.txt            38_roosevelt_franklin_1937.txt
12_jackson_1833.txt            39_roosevelt_franklin_1941.txt
13_van_buren_1837.txt          3_adams_john_1797.txt
14_harrison_1841.txt           40_roosevelt_franklin_1945.txt
15_polk_1845.txt               41_truman_1949.txt
16_taylor_1849.txt             42_eisenhower_1953.txt
17_pierce_1853.txt             43_eisenhower_1957.txt
18_buchanan_1857.txt           44_kennedy_1961.txt
19_lincoln_1861.txt            45_johnson_1965.txt
1_washington_1789.txt          46_nixon_1969.txt
20_lincoln_1865.txt            47_nixon_1973.txt
21_grant_1869.txt              48_carter_1977.txt

### Process the corpus through Spacy

We're creating "corpus_input", which is a list of dictionaries.  Each dictionary<br/> 
in "corpus_input" represents one inaugural address.  For each inaugural address,<br/>
we're keeping the year, the president, the raw text, and the "selected_text".

"Selected text" is the lemma form of the words in the inaugural address.  I'm<br/>
dropping stopwords, punctuation, whitespace, and pronouns.  I collect the lemmas/tokens for<br/>
"selected_text", then join them into one string, with one space between lemmas/tokens.

In [3]:
import spacy
nlp = spacy.load('en')

In [5]:
import glob, codecs, re

corpus_input = []

for path_to_file in glob.glob('corpora/inaugural_addresses/*.txt'):
    
    f = path_to_file.split('/')[-1].replace('.txt', '')
    
    year = int(f.split('_')[-1])
    president = '_'.join(f.split('_')[1:-1])
    
    raw_text = re.sub('\s+', ' ', codecs.open(path_to_file, 'r', encoding='utf-8').read())
    
    selected_tokens = []
    doc = nlp(unicode(raw_text))
    for t in doc:
        if t.is_stop == False and t.is_punct == False and t.is_space == False:
            if t.lemma_ != '-PRON-':
                selected_tokens.append(t.lemma_)
    
    corpus_input.append({'year': year, 
                   'president': president, 
                   'raw_text': raw_text, 
                   'selected_text': ' '.join(selected_tokens)})
    
print 'Done!'

Done!
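
The file-name handling in the cell above can be pulled out into a small helper, sketched here. The name `parse_file_name` is my own, and using `os.path` instead of splitting on '/' is a substitution I'm making so that the same code also copes with Windows-style paths.

```python
import os

def parse_file_name(path_to_file):
    """Split e.g. '10_adams_john_quincy_1825.txt' into (1825, 'adams_john_quincy')."""
    # os.path handles the path separator for the current operating system.
    stem = os.path.splitext(os.path.basename(path_to_file))[0]
    parts = stem.split('_')
    year = int(parts[-1])               # the last piece is the year
    president = '_'.join(parts[1:-1])   # drop the leading sequence number
    return year, president

print(parse_file_name('corpora/inaugural_addresses/10_adams_john_quincy_1825.txt'))
```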


### Inspecting corpus_input

Does it contain what I expect it to contain?

In [6]:
corpus_input.sort(key = lambda t: t['year'])

print
print 'len(corpus_input)', len(corpus_input)

print
print corpus_input[0]['year'], corpus_input[0]['president']
print
print corpus_input[0]['raw_text'][:200]
print
print corpus_input[0]['selected_text'][:200]


len(corpus_input) 58

1789 washington

Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was 

fellow citizens senate house representatives among vicissitude incident life event fill great anxiety notification transmit order receive 14th day present month on hand summon country voice hear vener


In [7]:
print
for text in corpus_input:
    print text['year'], text['president'], 'len(raw_text)', len(text['raw_text'])


1789 washington len(raw_text) 8610
1793 washington len(raw_text) 770
1797 adams_john len(raw_text) 13862
1801 jefferson len(raw_text) 10091
1805 jefferson len(raw_text) 12892
1809 madison len(raw_text) 6991
1813 madison len(raw_text) 7147
1817 monroe len(raw_text) 19864
1821 monroe len(raw_text) 26285
1825 adams_john_quincy len(raw_text) 17724
1829 jackson len(raw_text) 6785
1833 jackson len(raw_text) 7031
1837 van_buren len(raw_text) 23370
1841 harrison len(raw_text) 49677
1845 polk len(raw_text) 28660
1849 taylor len(raw_text) 6594
1853 pierce len(raw_text) 20043
1857 buchanan len(raw_text) 16774
1861 lincoln len(raw_text) 20930
1865 lincoln len(raw_text) 3904
1869 grant len(raw_text) 6450
1873 grant len(raw_text) 7695
1877 hayes len(raw_text) 14881
1881 garfield len(raw_text) 17711
1885 cleveland len(raw_text) 10106
1889 harrison len(raw_text) 26127
1893 cleveland len(raw_text) 12304
1897 mckinley len(raw_text) 23618
1901 mckinley len(raw_text) 13398
1905 roosevelt_theodore len(raw

### Pass corpus_input to textacy

It's basically one operation.  In fact, **I think** it's possible to go<br/>
from-the-disk-to-corpus in one step, without processing the texts through<br/>
spacy . . . I think that would look something like:

    corpus = textacy.Corpus(
                u'en', 
                texts = [unicode(open(f).read()) for f in sorted(glob.glob('corpora/inaugural_addresses/*.txt'))],
                metadatas = [{'file_name': f.split('/')[-1]} for f in sorted(glob.glob('corpora/inaugural_addresses/*.txt'))])
                
Note that I'm simplifying "metadatas".

Why go straight from the disk to textacy, without going through spacy first?<br/>
Because **I seem to recall** that the textacy vectorizer supports a wide range<br/>
of "filters", which provide much of the drop-stopword, etc function which is<br/>
in the spacy code in the cell above.

Note also that we could wrap "unicode(open(f).read())" in a function which<br/>
performed the NLP word selection and lemmatization processes on a single<br/>
text, which would effectively get us the same thing.

Nevertheless, I wanted to demonstrate passing the data through spacy<br/>
because you may at some point want to use some other nlp package to filter,<br/>
lemmatize, etc, so I wanted to demonstrate how to do that.
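
The wrapper function mentioned above might look something like the following sketch. The name `select_lemmas` is hypothetical. It expects anything that yields tokens with the spacy attributes used earlier (`lemma_`, `is_stop`, `is_punct`, `is_space`); the `FakeToken` stand-ins exist only so the example runs without loading a spacy model.

```python
from collections import namedtuple

def select_lemmas(tokens):
    """Join the lemmas of tokens that are not stopwords, punctuation, whitespace or pronouns."""
    selected = []
    for t in tokens:
        if t.is_stop or t.is_punct or t.is_space:
            continue
        if t.lemma_ == '-PRON-':
            continue
        selected.append(t.lemma_)
    return ' '.join(selected)

# Stand-in tokens so the sketch runs without spacy; with spacy, pass nlp(text) instead.
FakeToken = namedtuple('FakeToken', ['lemma_', 'is_stop', 'is_punct', 'is_space'])
example = [
    FakeToken('fellow', False, False, False),
    FakeToken('the', True, False, False),
    FakeToken(',', False, True, False),
    FakeToken('-PRON-', False, False, False),
    FakeToken('citizen', False, False, False),
]
print(select_lemmas(example))
```

With spacy loaded, the call would be `select_lemmas(nlp(unicode(raw_text)))`, which should produce the same "selected_text" string as the loop in the earlier cell.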

### The textacy docs for corpus

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#working-with-many-texts

https://chartbeat-labs.github.io/textacy/api_reference.html#module-textacy.corpus


In [8]:
import textacy

corpus = textacy.Corpus(
            u'en', 
            texts = [unicode(i['selected_text']) for i in corpus_input],
            metadatas = [{'year': i['year'], 'president': i['president']} for i in corpus_input])

print 'len(corpus)', len(corpus)

print
print type(corpus)
#print dir(corpus)

print
print type(corpus[0])
#print dir(corpus[0])

len(corpus) 58

<class 'textacy.corpus.Corpus'>

<class 'textacy.doc.Doc'>


### Inspecting the resulting textacy corpus . . . 

 . . . comparing it to the "corpus_input" data.  Did I goof up<br/>
 anything?  **It's been known to happen.**
 
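 Note that corpus.n_sents seems small, probably because "selected_text" has had its punctuation removed, which leaves spacy few sentence boundaries to find.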

In [9]:
# Collect every token from every address, then de-duplicate for the unique count.
test_all_tokens = []
for c in corpus_input:
    test_all_tokens += re.split('\s+', c['selected_text'])

test_n_unique_tokens = list(set(test_all_tokens))

for a in range(0, 5):
    
    print
    print corpus_input[a]['year'], corpus_input[a]['president'], corpus_input[a]['selected_text'][:50]
    
    print corpus[a].metadata, corpus[a].text[:50]
    
print
print 'docs in corpus', corpus.n_docs
print 'sentences in corpus', corpus.n_sents
print 'tokens in corpus', corpus.n_tokens
print
print 'len(test_all_tokens)', len(test_all_tokens)
print 'len(test_n_unique_tokens)', len(test_n_unique_tokens)


1789 washington fellow citizens senate house representatives among
{'president': 'washington', 'year': 1789} fellow citizens senate house representatives among

1793 washington call voice country execute function chief magistra
{'president': 'washington', 'year': 1793} call voice country execute function chief magistra

1797 adams_john when perceive early time middle course america rem
{'president': 'adams_john', 'year': 1797} when perceive early time middle course america rem

1801 jefferson call undertake duty executive office country avail
{'president': 'jefferson', 'year': 1801} call undertake duty executive office country avail

1805 jefferson proceeding fellow citizen qualification constituti
{'president': 'jefferson', 'year': 1805} proceeding fellow citizen qualification constituti

docs in corpus 58
sentences in corpus 437
tokens in corpus 60569

len(test_all_tokens) 60531
len(test_n_unique_tokens) 6675


### Keywords, too easy.

Note the strange import.  Simply importing "textacy" doesn't work for this.

In [10]:
import textwrap
import textacy.keyterms

for doc in corpus[:5]:
    print
    print doc.metadata['year'], doc.metadata['president']
    print
    top_words = []
    for w in textacy.keyterms.textrank(doc, n_keyterms=20):
        top_words.append(w[0])
    print '\n'.join(textwrap.wrap(', '.join(top_words), 80))


1789 washington

government, public, duty, country, great, citizen, hand, present, people,
nation, nature, liberty, united, circumstance, measure, happiness,
qualification, view, care, character

1793 washington

oath, country, solemn, execute, present, function, witness, chief, subject,
magistrate, punishment, occasion, constitutional, proper, injunction, high,
sense, instance, entertain, distinguished

1797 adams_john

people, nation, government, country, good, constitution, state, power, justice,
legislature, public, mind, honor, foreign, great, spirit, peace, party, citizen,
form

1801 jefferson

government, good, man, principle, peace, safety, opinion, power, country,
public, form, nation, confidence, error, happiness, honest, right, citizen, law,
fellow

1805 jefferson

public, state, interest, duty, law, citizen, constitution, limit, good, country,
time, nation, place, false, power, reason, fellow, mind, foreign, press


### Corpus to document-term matrix . . . 

Lots of settings.  The doc is pretty good:

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus

https://chartbeat-labs.github.io/textacy/api_reference.html#vectorizers

(Note that in the API, see **textacy.vsm.vectorizers.Vectorizer**)
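
The three weighting schemes named in the next cell's comments (raw counts, relative frequency, TF-IDF) can be illustrated on a toy corpus. This is a sketch of the textbook definitions only; textacy's own formulas may differ in details such as IDF smoothing and how document length normalization is applied, and the function names here are my own.

```python
import math
from collections import Counter

docs = [
    ['peace', 'nation', 'peace'],
    ['nation', 'law'],
    ['nation', 'peace', 'debt', 'debt'],
]

def raw_counts(doc):
    """How many times each term occurs in the document."""
    return dict(Counter(doc))

def relative_frequencies(doc):
    """Counts divided by document length, so the values sum to 1."""
    total = float(len(doc))
    return dict((term, n / total) for term, n in Counter(doc).items())

def tf_idf(doc, all_docs):
    """Counts scaled down for terms that appear in many documents."""
    n_docs = float(len(all_docs))
    weights = {}
    for term, n in Counter(doc).items():
        # Document frequency: how many documents contain the term at least once.
        df = sum(1 for d in all_docs if term in d)
        weights[term] = n * math.log(n_docs / df)
    return weights

print(raw_counts(docs[0]))
print(relative_frequencies(docs[0]))
print(tf_idf(docs[0], docs))
```

In this unsmoothed version, a term that appears in every document (here "nation") gets a TF-IDF weight of zero.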


In [11]:
# FOR TF-IDF WEIGHTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, apply_dl=True)

# FOR RAW WORD COUNTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False)

# FOR RELATIVE FREQUENCY
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=True)

doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True) 
                                               for doc in corpus))

print
print repr(doc_term_matrix)
#print
#print dir(doc_term_matrix)
print
print doc_term_matrix.shape


<58x6494 sparse matrix of type '<type 'numpy.float64'>'
	with 32492 stored elements in Compressed Sparse Row format>

(58, 6494)


### Inspect the document-term matrix

Convert the sparse matrix to a dense matrix (i.e., one that stores every zero explicitly).

Inspect in various ways.  Does it look reasonable?

In [12]:
dense_doc_term_matrix = doc_term_matrix.todense()

print
print repr(dense_doc_term_matrix)
print
print dense_doc_term_matrix.shape

list_doc_term_matrix = dense_doc_term_matrix.tolist()

print
print list_doc_term_matrix[0][:100]
print
print len(list_doc_term_matrix), len(list_doc_term_matrix[0])
print
for a in range(len(list_doc_term_matrix[0][:750])):
    if list_doc_term_matrix[0][a] > 0:
        print vectorizer.id_to_term[a], list_doc_term_matrix[0][a], ';',
print


matrix([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.03205853,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ..., 
        [ 0.        ,  0.37300192,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.29688261,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.19857537,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

(58, 6494)

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04170288281141495, 0.04170288281141495, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0

###  Let's do some actual text analysis

We do three things here:

1.  Create a vectorizer, then use it to create a document-term matrix.
2.  Topic model using the document-term matrix.
3.  List the words associated with each resulting topic.

Lots of experimentation with Vectorizer parameters.  Raw word counts seem to work best.

And lots of experimentation with "n_topics" in creating the topic model.  20 seemed<br/>
reasonable for this demonstration.
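
The vectorizer in this section also uses `min_df` and `max_df` to trim the vocabulary. As I understand it, integer values are document counts: a term is kept only if it appears in at least `min_df` and at most `max_df` documents. The following is a plain-Python sketch of that idea, not textacy's implementation, and `filter_by_document_frequency` is a hypothetical name.

```python
def filter_by_document_frequency(docs, min_df=2, max_df=26):
    """Keep terms that occur in at least min_df and at most max_df documents."""
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return sorted(term for term, n in df.items() if min_df <= n <= max_df)

toy_docs = [
    ['government', 'peace', 'debt'],
    ['government', 'peace'],
    ['government', 'tariff'],
]
print(filter_by_document_frequency(toy_docs, min_df=2, max_df=2))
```

Here "government" is dropped for appearing in too many documents, and "debt" and "tariff" for appearing in too few, which is the same effect a low `max_df` has on words like "government" and "america" in the inaugural addresses.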

### The docs:

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus

https://chartbeat-labs.github.io/textacy/api_reference.html#topic-models

In [13]:
# FOR TF-IDF WEIGHTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, apply_dl=True,  min_df=2, max_df=40)

# FOR RAW WORD COUNTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False,  min_df=2, max_df=40)
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False)

# FOR RAW WORD COUNT.   LOW max_df VALUE REMOVES THINGS LIKE 'government' AND 'america' AND SEEMS TO GIVE THE BEST RESULT
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False,  min_df=2, max_df=26)

# FOR RELATIVE FREQUENCY
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=True)

doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True) 
                                               for doc in corpus))

model = textacy.TopicModel('lda', n_topics=20)
model.fit(doc_term_matrix)

doc_topic_matrix = model.transform(doc_term_matrix)

print 
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print('topic', topic_idx, ':', ' '.join(top_terms))




('topic', 0, ':', u'today child democracy dream woman job promise friend remember problem')
('topic', 1, ':', u'dream special hero today help weapon moral economic problem worthy')
('topic', 2, ':', u'revenue increase direct economic desire providence reflect extend nature judgment')
('topic', 3, ':', u'federal enforcement extend increase civilization business ability promote opinion community')
('topic', 4, ':', u'opinion regard exist member object circumstance federal grant blessing period')
('topic', 5, ':', u'democracy body sacred person continent early stock 1789 disruption clothe')
('topic', 6, ':', u'pay object debt revenue defense dollar proper case regard trade')
('topic', 7, ':', u'civilization today opinion supreme industrial business promote add evil prove')
('topic', 8, ':', u'offense false occasion draw extend pay whatsoever truth cover cease')
('topic', 9, ':', u'set wish counsel industrial child problem process wrong affair use')
('topic', 10, ':', u'renewal today seas

### What are the document-topic percentages?

I.e., which topics make up what percentage of each document?

In [48]:
from IPython.display import display, Markdown

def make_printable(topic_pcts):
    printable_pcts = []
    for pct in topic_pcts:
        formatted_pct = '%.2f' % pct
        if formatted_pct == '0.00':
            formatted_pct = '    '
        printable_pcts.append(formatted_pct)
    return printable_pcts

topic_headings = []
for a in range(len(doc_topic_matrix[0])):
    topic_headings.append(str(a).rjust(4))
    
results =[['  ', '    ', ' '.ljust(10)] + topic_headings]
    
for a in range(len(doc_topic_matrix)):
    results.append([str(a).rjust(2), 
                    str(corpus[a].metadata['year']), 
                    corpus[a].metadata['president'][:10].ljust(10)] + 
                    make_printable(doc_topic_matrix[a]))

output_text = []
for r in results:
    output_text.append(' '.join(r))
    
display(Markdown('<pre style="font-size:10px;">{}</pre>'.format('\n'.join(output_text))))

<pre style="font-size:10px;">                      0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
 0 1789 washington                     1.00                                                                           
 1 1793 washington                                                                                      0.96          
 2 1797 adams_john                     1.00                                                                           
 3 1801 jefferson                      1.00                                                                           
 4 1805 jefferson                      0.40                0.35                                         0.25          
 5 1809 madison                        1.00                                                                           
 6 1813 madison                        1.00                                                                           
 7 1817 monroe                         0.02                                                             0.98          
 8 1821 monroe                                                                                          1.00          
 9 1825 adams_john                     1.00                                                                           
10 1829 jackson                        1.00                                                                           
11 1833 jackson                        1.00                                                                           
12 1837 van_buren                      1.00                                                                           
13 1841 harrison                       1.00                                                                           
14 1845 polk                           0.18                                                             0.81          
15 1849 taylor                         0.26                                                             0.74          
16 1853 pierce                         0.34                                                   0.61      0.05          
17 1857 buchanan                                                                                        1.00          
18 1861 lincoln                                                                                         1.00          
19 1865 lincoln                                            0.99                                                       
20 1869 grant                                                                                 0.11      0.88          
21 1873 grant                                                                                           1.00          
22 1877 hayes                          0.98                                                             0.02          
23 1881 garfield                                                          0.02                          0.98          
24 1885 cleveland                      0.71                                                   0.27      0.02          
25 1889 harrison                                                                                        1.00          
26 1893 cleveland                                                                             1.00                    
27 1897 mckinley                                                                                        1.00          
28 1901 mckinley                                                                                        1.00          
29 1905 roosevelt_                                              1.00                                                  
30 1909 taft                                                                                            1.00          
31 1913 wilson                                                  1.00                                                  
32 1917 wilson                                                  1.00                                                  
33 1921 harding                                                                               1.00                    
34 1925 coolidge                                                                              1.00                    
35 1929 hoover                                                                                1.00                    
36 1933 roosevelt_                                                                            1.00                    
37 1937 roosevelt_                                                        0.36                0.60           0.03     
38 1941 roosevelt_                          0.75                          0.25                                        
39 1945 roosevelt_                                                        0.71                               0.28     
40 1949 truman                                                                                1.00                    
41 1953 eisenhower                                                        0.26                0.73                    
42 1957 eisenhower                                                        1.00                                        
43 1961 kennedy                                                           1.00                                        
44 1965 johnson                                                           1.00                                        
45 1969 nixon                                                             1.00                                        
46 1973 nixon                                                             0.14                               0.85     
47 1977 carter          1.00                                                                                          
48 1981 reagan          0.84                                              0.16                                        
49 1985 reagan                                                            1.00                                        
50 1989 bush_georg                                                        1.00                                        
51 1993 clinton                                                      0.27 0.73                                        
52 1997 clinton                                                           1.00                                        
53 2001 bush_georg                                                        0.81                0.18                    
54 2005 bush_georg                                                        1.00                                        
55 2009 obama                                                             1.00                                        
56 2013 obama                                                             1.00                                        
57 2017 trump                                                             1.00                                        </pre>
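
To summarize a table like the one above, one option is to pick out the single largest topic for each document. This is a minimal sketch on a toy matrix; `dominant_topics` is a hypothetical helper, and I'm assuming the rows of the real `doc_topic_matrix` can be iterated like lists of numbers.

```python
def dominant_topics(doc_topic_rows):
    """Return (topic_index, weight) of the largest entry in each row."""
    result = []
    for row in doc_topic_rows:
        weights = list(row)
        # Index of the largest weight in this document's row.
        best = max(range(len(weights)), key=lambda i: weights[i])
        result.append((best, weights[best]))
    return result

toy_doc_topic = [
    [0.02, 0.98, 0.00],
    [0.40, 0.35, 0.25],
]
for doc_idx, (topic, weight) in enumerate(dominant_topics(toy_doc_topic)):
    print('doc %d: topic %d (%.2f)' % (doc_idx, topic, weight))
```

The second toy row shows why the weight is worth keeping alongside the index: a "dominant" topic at 0.40 says much less about a document than one at 0.98.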

### List the words associated with each topic

I did this once.  I do it again, because I want to see more words, and I'd<br/>
like something that doesn't result in a wide display.

In [20]:
import textwrap

print

print 
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=50):
    print
    print 'topic', topic_idx, ':', '\n'.join(textwrap.wrap(' '.join(top_terms), 80))
    




topic 0 : today child democracy dream woman job promise friend remember problem cost
million challenge courage conviction forward consider ideal person protection
arm build bless dignity benefit credit program safe ancient value learn second
increase use moral story debate politic journey mr stop period section community
constitutional middle permit perfect lose write

topic 1 : dream special hero today help weapon moral economic problem worthy ceremony
group away create price renew goal federal tax child struggle share build
monument afford commitment productivity reflect capacity confront inauguration
ensure sufficient intention enhance arsenal adversary pay govern burden fall
fate answer thank heal open endure tell value word

topic 2 : revenue increase direct economic desire providence reflect extend nature
judgment prevail business expect foundation tax democracy courage ought occur
moral federal counsel weight representative enable material proceed devise
relationship benefit 