## Fake Book Review
### Raymundo Gonzalez Leal

I will use Latent Dirichlet Allocation (LDA) on the book "A Tale of Two Cities" by Charles Dickens. I will try two document representations, raw term counts (`CountVectorizer`) and tf-idf weights (`TfidfVectorizer`), and two definitions of a document, chapters and paragraphs, so I will have 4 different sets of results. I will use the top words from each topic to try to infer the main topics in the book, and see whether this is enough to write a book review.
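
To make the difference between the two weightings concrete, here is a minimal pure-Python sketch, not the scikit-learn implementation. The toy documents and the helper names `count_vectors` and `smoothed_idf` are made up for illustration. The idf formula, ln((1 + n) / (1 + df)) + 1, follows the smoothed variant described in the scikit-learn documentation, and the L2 normalization that `TfidfVectorizer` applies by default is left out.

```python
import math
from collections import Counter

# Hypothetical toy "documents"; the notebook uses chapters or paragraphs instead.
docs = ["wine shop wine", "wine street", "prison street"]

def count_vectors(docs):
    """Raw term counts per document (what a count vectorizer produces)."""
    return [Counter(doc.lower().split()) for doc in docs]

def smoothed_idf(counts):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per term
    return {t: math.log((1.0 + n_docs) / (1.0 + df)) + 1
            for t, df in doc_freq.items()}

counts = count_vectors(docs)
idf = smoothed_idf(counts)
# tf-idf: counts rescaled so that terms found in fewer documents weigh more.
tfidf = [{t: n * idf[t] for t, n in c.items()} for c in counts]

print(counts[0])
print(tfidf[0])
```

In the first toy document, `wine` has the higher raw count, but its idf is lower than that of `shop` because it appears in two of the three documents.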

In [13]:
from collections import OrderedDict
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re

In [55]:
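n_features = 1000    # cap on vocabulary size, passed to the vectorizers as max_features
n_components = 10    # number of LDA topics
n_top_words = 10     # number of words printed per topic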

In [56]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
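
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread: it walks the ascending argsort backwards from the end and stops after `n_top_words` items, which gives the indices of the largest weights in descending order. A pure-Python sketch of the same idea, using a hypothetical helper `top_indices` and made-up weights:

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: start at the end, step backwards,
    # and stop before position -(n_top + 1).
    return ascending[:-n_top - 1:-1]

weights = [0.1, 0.5, 0.2, 0.9]   # made-up topic-word weights
print(top_indices(weights, 2))   # [3, 1]
```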

### Using chapters as documents

Chapters for "A Tale of Two Cities" were previously extracted using the chapterize package (https://github.com/JonathanReeve/chapterize).

In [57]:
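Chapters = []
chapters_path = 'TaleOfTwoCities-chapters/'

# Read the chapter files 1.txt ... 39.txt produced by chapterize.
for i in range(1, 40):
    this_path = chapters_path + str(i) + '.txt'
    # Use a context manager so each file handle is closed after reading.
    with open(this_path, 'r') as f:
        Chapters.append(f.read())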

In [58]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(Chapters)

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0: mr jerry cruncher prisoner young lorry time old like father
Topic #1: mr say lorry stryver doctor little time day brought knew
Topic #2: defarge mr little madame hand time like good day looked
Topic #3: lorry mr make speak stryver lady business way cried house
Topic #4: mr miss monseigneur little way know hand like lorry father
Topic #5: brother hand doctor time boy little young day place father
Topic #6: mr lorry doctor miss darnay pross time know manette father
Topic #7: defarge madame mr lorry miss know father pross time hand
Topic #8: long mr know say eyes little old father defarge like
Topic #9: defarge madame wine mr hand day lorry eyes little monsieur
()


In [59]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(Chapters)

lda.fit(tfidf)

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda, tfidf_feature_names, n_top_words)

Topic #0: street fast work hours touch manner hear smile express lay
Topic #1: say road solomon courtyard shown saying usual dover removed sign
Topic #2: mr lorry defarge carton miss darnay pross little left know
Topic #3: speak quarter heavily cried loud curious felt respect bare make
Topic #4: bench carry french master monseigneur seek lifted english patient way
Topic #5: young writing tumbrils wore difficulty woman far gate midst easy
Topic #6: honour seen time mr rose door went lorry low barsad
Topic #7: daughter true strike wot particular summer feel wife past suspected
Topic #8: men lamps steadily tumbrils patriot boots sense know order walked
Topic #9: darkness love rough sun mender narrow summer monsieur grim spot
()


Look at Topic #2: we got plenty of character names. It seems reasonable to say that the main characters are Mr Lorry, Defarge, Carton, Miss Darnay, and Pross.

### Splitting by paragraph

In [52]:
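text_path = 'TaleOfTwoCities.txt'
with open(text_path, 'r') as f:
    full_text = f.read()
# Split on blank lines; the optional '\r' also handles files whose line endings are '\n'.
full_text = re.split(r'\r?\n\r?\n', full_text)
len(full_text)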

3452

In [60]:
tf = tf_vectorizer.fit_transform(full_text)
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0: shall jerry said young sir way hear morrow come voice
Topic #1: time say mr long dear day lucie doctor lorry manette
Topic #2: gutenberg project work tm prison works looked saint electronic foundation
Topic #3: death moment tell believe hour usual case prisoner distance knew
Topic #4: hand eyes old face like look looked man looking passed
Topic #5: little monseigneur went like people came great night did house
Topic #6: charles darnay yes good brother citizen english life doctor french
Topic #7: said mr lorry miss carton pross man head face looked
Topic #8: know think told business wife mother ask evremonde child poor
Topic #9: defarge madame wine monsieur jacques shop door man vengeance little
()


In [61]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(full_text)
lda.fit(tfidf)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda, tfidf_feature_names, n_top_words)

Topic #0: said mr defarge doctor madame man hand time lorry manette
Topic #1: said mr know say lorry don asked good miss pross
Topic #2: roads mender looked hand village said voice man began little
Topic #3: citizen believe remember tell ago tower la north force law
Topic #4: jerry president mail buried accused shadow everybody drank fell bell
Topic #5: gabelle recall pretty hear roar town guard desire worst does
Topic #6: gutenberg project tm work works citizeness electronic use foundation org
Topic #7: yes sir think moment papers truly paper written passenger handed
Topic #8: hope father husband start woman true silence footsteps dear head
Topic #9: monseigneur charles england brother hurry ghost court world begin arrived
()


#### Review 

By looking at some of the recurrent words in our 4 results, we can identify what seem to be important characters, themes, and symbols. When we used each chapter as a document, the individual "topics" did not seem sufficiently different from one another. We'll make an educated guess from this, and say that the topics explored in each chapter are fairly similar. Time for our review:

"A Tale of Two Cities" by Charles Dickens tells the story of Mr. Lorry, Defarge, Miss Darnay and Pross. Following these characters, the novel explores the passage of time in France and England. Family is a recurrent theme, and ideas such as brotherhood and fatherhood are explored through Mr. Lorry's relationships with other characters. Ultimately, this is a love story, even if the characters tend to find themselves threatened by darkness, imprisonment, and death. Dickens does a good job of using symbols such as wine, the streets, bureaucracy and wealth to better convey the story. One criticism is that there is little thematic variation across the chapters. Hence, while the plot progresses, the ideas that Dickens explores remain fairly constant, which allows for consistency but ultimately prevents individual chapters from being memorable.


I actually read the book a couple of years ago in my high school literature class. My teacher would definitely not be impressed with this "review". Something unsettling is that, just by looking at the topics extracted by our method, we would have no clue about the importance of the French Revolution in the novel. We wouldn't even know for sure which are the two cities from the title! (Although we know things happen in France and England, so Paris and London would have been the clear bet.) Anyway, this is not that bad for such a naive method.
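
The impression that the chapter-level topics resemble one another can be given a rough number. The sketch below takes three of the count-based chapter topics printed earlier, counts how many of them each word appears in, and computes the Jaccard overlap of two top-word lists. The helper names are hypothetical, and Jaccard similarity is just one simple choice of overlap measure, not something the analysis above relied on.

```python
from collections import Counter

# Topics #0-#2 from the count-vectorizer run with chapters as documents.
topics = [
    "mr jerry cruncher prisoner young lorry time old like father",
    "mr say lorry stryver doctor little time day brought knew",
    "defarge mr little madame hand time like good day looked",
]
top_words = [set(t.split()) for t in topics]

def recurring_words(word_sets):
    """Map each word to the number of topics whose top words include it."""
    counts = Counter()
    for words in word_sets:
        counts.update(words)
    return counts

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / float(len(a | b))

recurrence = recurring_words(top_words)
# Words that show up in every one of the topics considered.
shared = sorted(w for w, n in recurrence.items() if n == len(top_words))
print(shared)                                # ['mr', 'time']
print(jaccard(top_words[0], top_words[1]))   # 3 shared words out of 17
```

Words such as `mr` and `time` that appear in every topic say little about what distinguishes one topic from another, which is consistent with the chapter-level topics looking alike.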