# Book Reviews and How to Fake Them

Find an interesting public domain book and download it as plain text.

1. Use Python to massage the data into a suitable format for processing by the Latent Dirichlet Allocation (LDA) model in scikit-learn. This includes removing stop words and punctuation. Some ideas for how to do this can be found here.
2. Break the book up into small sections. The most appropriate level might vary between books, but you will most likely be breaking the book up into either paragraphs or chapters (this might also be a pragmatic decision based on whatever's easiest).
3. Train an LDA model on the corpus. The LDA model should find interesting topics that occur at the paragraph (or chapter) level. Be sure to explain your choice for any parameters that might have a significant effect on the model results.
4. Print out the first ten words of the ten most common topics.
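
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration rather than the pipeline used in the cells below: the stop-word set `STOP_WORDS`, the helpers `split_paragraphs` and `clean_tokens`, and the sample text are all assumptions made for the example. A real run would use a fuller stop-word list, such as the one in `nltk.corpus.stopwords`.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption; use a full list in practice)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def split_paragraphs(text):
    """Split raw text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def clean_tokens(paragraph):
    """Lowercase, strip punctuation, and drop stop words."""
    no_punct = paragraph.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.lower().split() if w not in STOP_WORDS]

sample = "The thane of Cawdor lives.\n\nIt is a tale told by an idiot, full of sound and fury."
docs = [" ".join(clean_tokens(p)) for p in split_paragraphs(sample)]
print(docs)
```

Each cleaned string is one document; a list of such strings can be passed to `CountVectorizer` to build the count matrix that `LatentDirichletAllocation` expects.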

In [162]:
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
import nltk
from pprint import pprint
import spacy
from spacy.lang.en import English
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [145]:
macbeth_sents = nltk.corpus.gutenberg.sents('shakespeare-macbeth.txt')
macbeth_words = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')

In [146]:
pprint(macbeth_sents[1:3])

[['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.']]


In [147]:
print("#acts:", macbeth_words.count("Actus"))
print("#scenes:", macbeth_words.count("Scena"))

#acts: 5
#scenes: 22


In [148]:
# Collect the 5 act headings ("Actus" followed by its Latin ordinal)
act_nums = [macbeth_words[i:i+2] for i in range(len(macbeth_words)) if macbeth_words[i]=="Actus"]

# Collect the 22 scene headings spelled "Scena" (followed by a Latin ordinal)
scene_nums = [macbeth_words[i:i+2] for i in range(len(macbeth_words)) if macbeth_words[i]=="Scena"]

In [149]:
print(act_nums)

[['Actus', 'Primus'], ['Actus', 'Secundus'], ['Actus', 'Tertius'], ['Actus', 'Quartus'], ['Actus', 'Quintus']]


In [150]:
print(scene_nums)

[['Scena', 'Secunda'], ['Scena', 'Tertia'], ['Scena', 'Quarta'], ['Scena', 'Quinta'], ['Scena', 'Sexta'], ['Scena', 'Septima'], ['Scena', 'Prima'], ['Scena', 'Secunda'], ['Scena', 'Tertia'], ['Scena', 'Quarta'], ['Scena', 'Prima'], ['Scena', 'Secunda'], ['Scena', 'Tertia'], ['Scena', 'Quinta'], ['Scena', 'Prima'], ['Scena', 'Secunda'], ['Scena', 'Prima'], ['Scena', 'Secunda'], ['Scena', 'Quarta'], ['Scena', 'Quinta'], ['Scena', 'Sexta'], ['Scena', 'Septima']]


In [151]:
# Latin ordinals that follow "Scena" in the scene headings
latin_numbers = ['Prima', 'Secunda', 'Tertia', 'Quarta', 'Quinta', 'Sexta', 'Septima']

In [152]:
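# Find the start index of each scene: the first word after a "Scena <ordinal>" heading.
# Start at 2 so that i-2 never goes negative and wraps around to the end of the list.
scene_idx = [i for i in range(2, len(macbeth_words)) if macbeth_words[i-2]=="Scena" and macbeth_words[i-1] in latin_numbers]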

In [153]:
# Extract every word in each scene; the last scene runs to the end of the text
scene_ends = scene_idx[1:] + [len(macbeth_words)]
scenes = [macbeth_words[start:end] for start, end in zip(scene_idx, scene_ends)]
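
The sentence output earlier shows the first scene heading spelled `Scoena`, so matching only `Scena` skips that scene. A hedged sketch of a more general splitter follows; the helper name `split_on_markers` and the toy token list are assumptions for illustration. Unlike the cell above, each segment here keeps its heading tokens.

```python
def split_on_markers(words, markers):
    """Return one token list per segment; each segment starts at a marker word."""
    starts = [i for i, w in enumerate(words) if w in markers]
    # Pair each start with the next start; the last segment runs to the end
    ends = starts[1:] + [len(words)]
    return [list(words[s:e]) for s, e in zip(starts, ends)]

# Toy tokens standing in for macbeth_words (an assumption for this example)
tokens = ["Scoena", "Prima", ".", "Thunder", "Scena", "Secunda", ".", "Alarum", "within"]
segments = split_on_markers(tokens, {"Scena", "Scoena"})
print(len(segments), segments[-1])
```

Pairing each start with the following start, and the final start with the end of the text, keeps the last segment from being dropped.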

In [166]:
# Requires the model; run in a terminal: python3 -m spacy download en_core_web_sm
# Load the spaCy English model without the parser and NER components (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized text, keeping only the allowed POS tags (https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep lemmas of the allowed parts of speech; skip spaCy v2's '-PRON-' placeholder
        texts_out.append(" ".join(token.lemma_ for token in doc if token.pos_ in allowed_postags and token.lemma_ != '-PRON-'))
    return texts_out

# Lemmatize each scene, keeping only nouns, adjectives, verbs, and adverbs
data_lemmatized = lemmatization(scenes, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [170]:
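from sklearn.datasets import make_multilabel_classification

# Stand-in data: a synthetic count matrix that overwrites the Macbeth scenes,
# so the LDA cells below are fitted on fake data rather than on the play
scenes, _ = make_multilabel_classification(random_state=0)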

In [172]:
# Build the LDA model with 10 topics; random_state fixes the seed for reproducibility
lda = LatentDirichletAllocation(n_components=10,
                                random_state=0)
lda.fit(scenes)

# Topic distributions for the last two samples (each row sums to 1)
lda.transform(scenes[-2:])

array([[0.00175485, 0.00175492, 0.00175492, 0.50872288, 0.00175489,
        0.00175493, 0.00175492, 0.16129848, 0.31769483, 0.00175439],
       [0.16910656, 0.001755  , 0.39669371, 0.00175489, 0.00175455,
        0.00175493, 0.42191617, 0.00175488, 0.00175493, 0.00175439]])
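
Step 4 asks for the top words of each topic, which the cells above do not yet print. A minimal sketch, assuming a matrix of per-topic word weights and a matching vocabulary: with a fitted model these would come from `lda.components_` and the vectorizer's `get_feature_names_out()`. The helper name `top_words` and the toy numbers are hypothetical.

```python
def top_words(topic_word_weights, vocabulary, n_top=10):
    """For each topic, return the n_top vocabulary words with the largest weights."""
    result = []
    for weights in topic_word_weights:
        # Rank word indices by weight, largest first
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        result.append([vocabulary[j] for j in ranked[:n_top]])
    return result

# Toy stand-ins (assumptions): one row of weights per topic, one word per column
weights = [[0.1, 2.0, 0.5, 1.5], [3.0, 0.2, 0.1, 0.4]]
vocab = ["blood", "king", "night", "sleep"]
for k, words in enumerate(top_words(weights, vocab, n_top=2)):
    print(f"Topic {k}: {' '.join(words)}")
```

This only makes sense when the model was fitted on a real document-term matrix; the synthetic data used above has no vocabulary to look up.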

In [174]:
sum(np.random.dirichlet([1, 2, 3]))

1.0
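
The check above works because a Dirichlet draw is a probability vector, which is also why each row of topic proportions from the LDA model sums to one. One standard way to draw such a vector is to normalize independent gamma draws. The sketch below uses only the standard library; the helper name `dirichlet_sample` and the fixed seed are assumptions for illustration.

```python
import random

def dirichlet_sample(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the draw is reproducible
sample = dirichlet_sample([1, 2, 3], rng)
print(sample, sum(sample))
```

The components are positive and sum to one up to floating-point rounding.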