# Exercise Twelve: Texts, Three Ways

For this week, you will sample the three methods we've explored (topic modeling, sentiment analysis, and Markov chain generation) using the same set of root texts. 

- Collect and import ten documents (novels work best, but anything goes!)
- Using the topic modeling code as a starter, build a topic model of the documents (pick one topic and render it as a word cloud)
- Using the sentiment analysis code as a starter, run a sentiment analysis on sample fragments from the documents and compare the results (note anything interesting)
- Using the Markov chain code as a starter, generate a sentence using one of the documents
- Using the Markov chain code as a starter, generate a longer text fragment using all of the documents

As a bonus, try to extend this analysis to note other features of these documents using any of our previous exercises as a starting point.
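
To make the Markov chain steps in the list above concrete, here is a minimal, standard-library sketch of word-level generation. It is not the course's starter code: the function names `build_chain` and `generate`, the `order` and `seed` parameters, and the toy `sample` sentence are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=15, seed=None):
    """Walk the chain from a random starting state for up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Toy input (hypothetical); in the exercise this would be the text of one document.
sample = "the dog saw the cat and the cat saw the dog"
chain = build_chain(sample)
print(generate(chain, length=8, seed=0))
```

For the final step, one option would be to concatenate the text of all ten documents before calling `build_chain`, so the generated fragment can wander between sources.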

The cells below import ten works by Arthur Conan Doyle, the same texts used in last week's exercise.

In [13]:
import nltk 
nltk.download('averaged_perceptron_tagger')

import os

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Stoddard\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [14]:
textdir = 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\'
os.chdir(textdir)

# Topic Modeling

In [15]:
import pandas as pd
import os
import numpy as np

documents = []
path = 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\'

filenames=sorted([os.path.join(path, fn) for fn in os.listdir(path)])
print(len(filenames))
print(filenames[:10]) 

10
['C:\\Users\\Stoddard\\DesignDevExercises\\text\\adventure.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\boer.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\fear.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\hound.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\last.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\memoirs.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\return.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\scarlet.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\sign.txt', 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\white.txt']


In [16]:
import sklearn.feature_extraction.text as text

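# input='filename' (singular) tells the vectorizer to open and read each path.
# Any other value, such as the 'filenames' used for the recorded run below, is
# treated as raw content on older scikit-learn, so the path strings get tokenized.
vectorizer = text.CountVectorizer(input='filename', stop_words="english", min_df=1)
dtm = vectorizer.fit_transform(filenames).toarray()  # document-term matrix

# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead.
vocab = np.array(vectorizer.get_feature_names())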

In [17]:
print(f'Shape of document-term matrix: {dtm.shape}. '  # first dimension is 10 because there are 10 documents
      f'Number of tokens {dtm.sum()}')

Shape of document-term matrix: (10, 14). Number of tokens 59
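
A vocabulary of 14 terms and 59 tokens in total is far too small for ten novels, which suggests the vectorizer counted the words in the file paths rather than in the files. The sketch below is an approximation for checking that reading: it applies `CountVectorizer`'s default token pattern (two or more word characters) and lowercasing to one of the paths using only the `re` module. The helper name `tokenize_like_countvectorizer` is hypothetical, and stop-word removal is not reproduced.

```python
import re

# Default token_pattern used by scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize_like_countvectorizer(raw):
    """Lowercase the string and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(raw.lower())


path = 'C:\\Users\\Stoddard\\DesignDevExercises\\text\\adventure.txt'
tokens = tokenize_like_countvectorizer(path)
print(tokens)
```

The resulting tokens line up with the column names that appear in the topic-word table further down (`designdevexercises`, `stoddard`, `txt`, and so on), which is consistent with the paths having been treated as the document content.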


In [18]:
import sklearn.decomposition as decomposition
model = decomposition.LatentDirichletAllocation(
    n_components=100, learning_method='online', random_state=1)

In [19]:
document_topic_distributions = model.fit_transform(dtm)

In [20]:
# Grabbing a set of vocab
vocabulary = vectorizer.get_feature_names()
# (# topics, # vocabulary)
assert model.components_.shape == (100, len(vocabulary))
# (# documents, # topics)
assert document_topic_distributions.shape == (dtm.shape[0], 100)

In [21]:
topic_names = [f'Topic {k}' for k in range(100)]
topic_word_distributions = pd.DataFrame(
    model.components_, columns=vocabulary, index=topic_names)
print(topic_word_distributions)

          adventure      boer  designdevexercises      fear     hound  \
Topic 0    0.237354  0.192002            0.193579  0.183461  0.239989   
Topic 1    0.221949  0.190651            0.198547  0.214334  0.190587   
Topic 2    0.209695  0.196927            0.296181  0.215479  0.254467   
Topic 3    0.210005  0.166724            0.208297  0.219074  0.199425   
Topic 4    0.195116  0.228752            0.183062  0.207091  0.218632   
...             ...       ...                 ...       ...       ...   
Topic 95   0.203013  0.204640            0.219346  0.195669  0.177860   
Topic 96   0.198401  0.226942            0.209587  0.200464  0.222647   
Topic 97   0.197749  0.180549            0.218141  0.200016  0.183130   
Topic 98   0.205529  0.209199            0.205392  0.215759  0.170955   
Topic 99   0.164080  0.197547            0.184811  0.224982  0.247049   

           memoirs    return   scarlet      sign  stoddard      text  \
Topic 0   0.189205  0.233837  0.166079  0.226830  0

In [22]:
topic_word_distributions.loc['Topic 7'].sort_values(ascending=False).head(18)

text                  0.508072
designdevexercises    0.495588
txt                   0.490907
users                 0.480355
stoddard              0.466878
boer                  0.308664
adventure             0.257620
white                 0.257170
scarlet               0.250210
memoirs               0.209963
sign                  0.209306
return                0.201594
fear                  0.198173
hound                 0.179849
Name: Topic 7, dtype: float64

In [23]:
document_topic_distributions = pd.DataFrame(
    document_topic_distributions, columns=topic_names)
print(document_topic_distributions)

    Topic 0   Topic 1   Topic 2   Topic 3   Topic 4   Topic 5   Topic 6  \
0  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
1  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
2  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
3  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
4  0.001667  0.001667  0.001667  0.001667  0.001667  0.001667  0.001667   
5  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
6  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
7  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
8  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   
9  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429  0.001429   

    Topic 7   Topic 8   Topic 9  ...  Topic 90  Topic 91  Topic 92  Topic 93  \
0  0.001429  0.001429  0.001429  ...  0.001429  0.001429  0.001429  0.001429   
1  0.001429  0

In [28]:
words = topic_word_distributions.loc['Topic 7'].sort_values(ascending=False).head(18)
words


text                  0.508072
designdevexercises    0.495588
txt                   0.490907
users                 0.480355
stoddard              0.466878
boer                  0.308664
adventure             0.257620
white                 0.257170
scarlet               0.250210
memoirs               0.209963
sign                  0.209306
return                0.201594
fear                  0.198173
hound                 0.179849
Name: Topic 7, dtype: float64

In [30]:
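from matplotlib import pyplot as plt

# Requires the third-party wordcloud package (pip install wordcloud).
# The class is WordCloud (capitalized); the lowercase name is a submodule.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

print(topic_word_distributions.loc['Topic 7'].head(20))

# generate_from_frequencies expects a word -> weight mapping, so convert the Series
cloud = WordCloud().generate_from_frequencies(words.to_dict())

plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.show()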


ModuleNotFoundError: No module named 'wordcloud'
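
The error above means the `wordcloud` package is not installed in the environment running the notebook. As a small sketch (the helper name `is_installed` is an assumption, not part of any library), the standard library can check for a top-level package before the plotting cell tries to import it:

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("wordcloud"):
    print("wordcloud is available")
else:
    print("wordcloud is missing; install it with: pip install wordcloud")
```

After installing, the kernel may need a restart before the import succeeds.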

# Sentiment Analysis

In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')