# Topic Modelling

In text mining, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
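
The mixture idea can be made concrete with a small numeric sketch. The topic names, word probabilities and document weights below are invented for illustration; they are not the output of any fitted model. The probability of a word in a document is the topic-weighted average of the word's probability in each topic.

```python
# Hypothetical topic -> word distributions (each sums to 1).
topic_word = {
    "food":    {"broccoli": 0.5, "eat": 0.4, "drive": 0.1},
    "driving": {"broccoli": 0.1, "eat": 0.1, "drive": 0.8},
}

# Hypothetical document -> topic mixture (sums to 1).
doc_topic = {"food": 0.7, "driving": 0.3}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())


doc_word = {w: word_probability(w, doc_topic, topic_word)
            for w in ["broccoli", "eat", "drive"]}
print(doc_word)
```

Because both layers are probability distributions, the resulting word probabilities for the document also sum to 1.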


In [1]:
%%capture
%pip install stop-words
%pip install gensim

In [2]:
# We will start by taking a few sentences and treating them as documents
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
    
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health." 

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]



In [3]:
# list for tokenized documents in loop
texts = []

# We will clean and tokenize our documents using stemming
# loop through document list
for doc in doc_set:
    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)
print(texts)

[['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother'], ['mother', 'spend', 'lot', 'time', 'drive', 'brother', 'around', 'basebal', 'practic'], ['health', 'expert', 'suggest', 'drive', 'may', 'caus', 'increas', 'tension', 'blood', 'pressur'], ['often', 'feel', 'pressur', 'perform', 'well', 'school', 'mother', 'never', 'seem', 'drive', 'brother', 'better'], ['health', 'profession', 'say', 'brocolli', 'good', 'health']]


This cleaning process introduces a new step called stemming, which reduces each word to its stem.

We will also introduce lemmatization later in this laboratory.

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word (for example 'basebal' and 'practic' in the output above), whereas a lemma is an actual word of the target language.

Stemming applies a fixed sequence of rule-based steps to each word, which makes it fast.

Lemmatization, however, looks words up in a lexical database (here the WordNet corpus), which makes it slower than stemming. You also have to specify the part of speech to obtain the correct lemma.
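
The contrast can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix list and the mini-lexicon are invented here, and real tools (the Porter stemmer, the WordNet lemmatizer) use far richer rules and data.

```python
# Toy illustration only: real stemmers apply many more rules,
# and real lemmatizers look words up in a full lexicon such as WordNet.

SUFFIXES = ["ing", "es", "ed", "s"]  # hypothetical, deliberately tiny rule set

# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("better", "n"): "better",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Dictionary lookup; falls back to the word itself when it is unknown."""
    return LEMMAS.get((word, pos), word)


print(toy_stem("studies"))            # rule-based chop
print(toy_lemmatize("studies", "n"))  # dictionary form
print(toy_lemmatize("better", "a"))   # the part of speech changes the answer
print(toy_lemmatize("better", "n"))
```

The stemmer needs no data and is a handful of string operations, while the lemmatizer is only as good as its lexicon and the part-of-speech tag it is given.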


We create a dictionary that maps every distinct word in our documents to an integer id.

In [4]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
print(dictionary[0], dictionary[1], dictionary[2], dictionary[3], dictionary[4], dictionary[5])

brocolli brother eat good like mother


In [5]:
# convert tokenized documents into a document-term matrix (each row consisting of (word_id_in_dictionary, word_frequency_in_current_document))
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus[0]) # ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']
print(corpus[1]) # ['mother', 'spend', 'lot', 'time', 'drive', 'brother', 'around', 'basebal', 'practic']

[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)]
[(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
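
To make the bag-of-words representation less opaque, here is a plain-Python sketch that reproduces the two rows above. It is not gensim's implementation; the function name `build_bow` and the variable names are hypothetical, and the id-assignment rule (new tokens get the next free ids in alphabetical order within each document) is inferred from the ids printed above.

```python
from collections import Counter


def build_bow(texts):
    """Assign integer ids to tokens and count them per document.

    Mimics the id order seen above: within each document, unseen tokens
    receive the next free ids in alphabetical order.
    """
    token2id = {}
    bows = []
    for tokens in texts:
        counts = Counter(tokens)
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        bows.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, bows


sample_texts = [
    ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother'],
    ['mother', 'spend', 'lot', 'time', 'drive', 'brother', 'around', 'basebal', 'practic'],
]
sample_token2id, sample_bows = build_bow(sample_texts)
print(sample_bows[0])
print(sample_bows[1])
```

Each pair is `(word_id, count)`: the first document contains 'brocolli' (id 0) twice, and the second document shares only 'brother' (id 1) and 'mother' (id 5) with the first.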


Finally we create our LDA model and specify the number of topics that we want/expect and for how many passes LDA should run.

In [6]:
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)


We can now access the topics identified by the model and the words they are composed of.

In [7]:
print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, '0.112*"good" + 0.112*"brocolli" + 0.082*"health" + 0.080*"eat"'), (1, '0.075*"drive" + 0.053*"brother" + 0.053*"mother" + 0.053*"pressur"')]


### Wikipedia
Next, we introduce the wikipedia Python module, which we can use to download Wikipedia pages.

In [8]:
%%capture
%pip install wikipedia

In [9]:
import wikipedia
import nltk

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

religion = wikipedia.page("Religion")
artificial_intelligence = wikipedia.page("Artificial Intelligence")
mona_lisa = wikipedia.page("Mona_Lisaa")
eiffel_tower = wikipedia.page("Eiffel Tower")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
corpus = [religion.content, artificial_intelligence.content, mona_lisa.content, eiffel_tower.content]

In [11]:
print(mona_lisa.content)

The Mona Lisa (; Italian: Gioconda [dʒoˈkonda] or Monna Lisa [ˈmɔnna ˈliːza]; French: Joconde [ʒɔkɔ̃d]) is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". The painting's novel qualities include the subject's enigmatic expression, the monumentality of the composition, the subtle modelling of forms, and the atmospheric illusionism.The painting is probably of the Italian noblewoman Lisa Gherardini, the wife of Francesco del Giocondo. It is painted in oil on a white Lombardy poplar panel. Leonardo never gave the painting to the Giocondo family, and later it is believed he left it in his will to his favored apprentice Salaì. It had been believed to have been painted between 1503 and 1506; however, Leonardo may have continued working on it as late as 1517. It 

In [12]:
# Lemmatization

import re
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

# We wrap the preprocessing in a function so that we can reuse it on multiple texts.
lemmatizer = WordNetLemmatizer()

def preprocess_text(document):
    # Replace every non-word character with a space
    document = re.sub(r'\W', ' ', str(document))

    # Remove single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the text
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse multiple spaces into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove a leading 'b' (left over when a bytes object is converted with str())
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize, drop stop words, and keep only words longer than 5 characters
    tokens = document.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 5]

    return tokens


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [13]:
processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)

print(corpus[0][:205]) # Original text
print(processed_data[0][:15]) # Processed tokens


Religion is usually defined as a social-cultural system of designated behaviors and practices, morals, beliefs, worldviews, texts, sanctified places, prophecies, ethics, or organizations, that generally re
['religion', 'usually', 'defined', 'social', 'cultural', 'system', 'designated', 'behavior', 'practice', 'belief', 'worldviews', 'sanctified', 'prophecy', 'organization', 'generally']


In this preprocessing we use lemmatization instead of stemming (remember the difference discussed earlier).

We preprocess all of the documents and add them to the processed_data list.


In [14]:
%%capture
%pip install gensim==4.1

In [15]:
from gensim import corpora

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]


Optionally, we can save our dictionary, corpus and model so that we can load them back later instead of training again. LDA training is randomized, so a new run may produce different values.


In [16]:
import pickle

with open('gensim_corpus_corpus.pkl', 'wb') as f:
    pickle.dump(gensim_corpus, f)
gensim_dictionary.save('gensim_dictionary.gensim')
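
As a quick sanity check of the save-and-load idea, the following sketch round-trips a small bag-of-words corpus through pickle. The corpus and the file name are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical bag-of-words corpus: a list of (token_id, count) pairs per document.
sample_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_corpus.pkl")

    # The context manager closes the file even if pickling fails.
    with open(path, "wb") as f:
        pickle.dump(sample_corpus, f)

    with open(path, "rb") as f:
        restored_corpus = pickle.load(f)

print(restored_corpus == sample_corpus)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.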


In [17]:
import gensim

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=150)
lda_model.save('gensim_model.gensim')


In [18]:
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic[1], '\n')


0.035*"painting" + 0.017*"leonardo" + 0.009*"louvre" + 0.009*"portrait" + 0.006*"century" 

0.058*"religion" + 0.019*"religious" + 0.009*"belief" + 0.008*"culture" + 0.007*"practice" 

0.021*"intelligence" + 0.016*"artificial" + 0.014*"original" + 0.014*"archived" + 0.013*"retrieved" 

0.025*"eiffel" + 0.007*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" 



In [19]:
lda_model_tst = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=8, id2word=gensim_dictionary, passes=15)
topics = lda_model_tst.print_topics(num_words=5)
for topic in topics:
    print(topic[1], '\n')


0.000*"religion" + 0.000*"intelligence" + 0.000*"original" + 0.000*"archived" + 0.000*"religious" 

0.023*"intelligence" + 0.018*"artificial" + 0.016*"original" + 0.015*"archived" + 0.014*"retrieved" 

0.000*"intelligence" + 0.000*"archived" + 0.000*"religion" + 0.000*"original" + 0.000*"artificial" 

0.000*"religion" + 0.000*"religious" + 0.000*"intelligence" + 0.000*"artificial" + 0.000*"original" 

0.000*"intelligence" + 0.000*"painting" + 0.000*"original" + 0.000*"artificial" + 0.000*"archived" 

0.001*"religion" + 0.000*"religious" + 0.000*"belief" + 0.000*"culture" + 0.000*"practice" 

0.023*"painting" + 0.018*"eiffel" + 0.011*"leonardo" + 0.008*"french" + 0.006*"second" 

0.066*"religion" + 0.021*"religious" + 0.010*"belief" + 0.009*"culture" + 0.008*"practice" 



Let’s test the LDA model that was trained on our previous 4 documents (with num_topics=4) on some sentences written by us. We treat each sentence as a document.

In [20]:
test_doc = 'There are many passages about Christ in The Bible'

In [21]:
# Which topic does the sentence above resemble the most?
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))


[(0, 0.12511058), (1, 0.6247228), (2, 0.12506433), (3, 0.12510233)]


In [22]:
test_doc = 'Da Vinci created impressive paintings'
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))


[(0, 0.5551714), (1, 0.062584914), (2, 0.062921524), (3, 0.3193222)]


In [23]:
test_doc = 'Neural networks are taking over the scene of image recognition'
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))


[(0, 0.050172966), (1, 0.05115974), (2, 0.84851), (3, 0.050157256)]
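
Each result above is a list of `(topic_id, probability)` pairs, so the best-matching topic is simply the pair with the largest probability. The sketch below applies this to the three distributions copied from the outputs above; the helper name `dominant_topic` and the dictionary labels are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Distributions copied from the three outputs above.
sentence_topics = {
    "Bible sentence": [(0, 0.12511058), (1, 0.6247228), (2, 0.12506433), (3, 0.12510233)],
    "Da Vinci sentence": [(0, 0.5551714), (1, 0.062584914), (2, 0.062921524), (3, 0.3193222)],
    "neural network sentence": [(0, 0.050172966), (1, 0.05115974), (2, 0.84851), (3, 0.050157256)],
}

for label, doc_topics in sentence_topics.items():
    topic_id, prob = dominant_topic(doc_topics)
    print(f"{label}: topic {topic_id} ({prob:.2f})")
```

Assuming the topics printed earlier are listed in id order, topic 0 is the painting topic, topic 1 the religion topic and topic 2 the artificial intelligence topic, which agrees with what we would expect for these three sentences.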


Some metrics give us an idea of how well the model is doing and let us compare it with other models. Two common ones are perplexity and coherence.
You can think of perplexity as measuring how probable some new, unseen data is given the model that was learned earlier. That is to say, how well does the model reproduce the statistics of the held-out data? Note that gensim's log_perplexity returns a per-word likelihood bound (a negative number, where higher is better) rather than the perplexity itself, and that below we evaluate it on the training corpus rather than on held-out data.
To decide how many topics to extract with LDA, the topic coherence score is commonly used to measure how interpretable the extracted topics are.

In [24]:
print('Perplexity:', lda_model.log_perplexity(gensim_corpus))

Perplexity: -7.597080705848661
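
The value printed above is the per-word likelihood bound, not the perplexity itself. gensim derives its perplexity estimate from the bound as 2 raised to the negative bound, which can be reproduced by hand (the variable names here are hypothetical):

```python
# Per-word bound copied from the output above.
per_word_bound = -7.597080705848661

# gensim reports its perplexity estimate as 2 ** (-bound).
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

A lower perplexity (equivalently, a bound closer to zero) indicates that the model assigns higher probability to the evaluated documents.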


In [25]:
from gensim.models import CoherenceModel

coherence_score_lda = CoherenceModel(model=lda_model, texts=processed_data, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = coherence_score_lda.get_coherence()

print('Coherence Score:', coherence_score)


Coherence Score: 0.6459205922312374


In [26]:
%%capture
%pip install pyLDAvis

### Visualizing topics

In [27]:
# If we want to load previously saved model
# gensim_dictionary = gensim.corpora.Dictionary.load('gensim_dictionary.gensim')
# gensim_corpus = pickle.load(open('gensim_corpus_corpus.pkl', 'rb'))
# lda_model = gensim.models.ldamodel.LdaModel.load('gensim_model.gensim')


import pyLDAvis.gensim_models

lda_visualization = pyLDAvis.gensim_models.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)




pyLDAvis is quite a powerful tool and provides multiple visualizations within one interactive interface.


We can see the most relevant terms per topic and the most salient terms overall.
We can also play around with the relevance metric to obtain terms that are more or less relevant to a given topic.
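
The relevance slider (lambda) trades off a term's probability within the topic against its lift, that is, how much more probable the term is in the topic than in the corpus overall: relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The sketch below evaluates this formula for two terms; the probabilities are invented for illustration and do not come from the model above.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# Hypothetical probabilities for one topic: (p(word | topic), p(word) overall).
terms = {
    "though":      (0.010, 0.0090),   # frequent in the topic, frequent everywhere
    "raskolnikov": (0.006, 0.0025),   # rarer, but much more typical of the topic
}


def ranked(lam):
    """Terms sorted from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)


print(ranked(1.0))  # ranks by in-topic probability only
print(ranked(0.0))  # ranks by lift, favouring topic-specific terms
```

With lambda = 1 common words tend to dominate the list, while lambda = 0 surfaces terms that are nearly exclusive to the topic, even if they are rare.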


The visualization is interactive: we can, for instance, choose which topic we want to analyze further.


### LSA - Latent Semantic Analysis

As an extra example, gensim also provides an LSA model (LsiModel), which may give more coherent topics at the cost of the separation quality between topics.

In [28]:
from gensim.models import LsiModel

lsi_model = LsiModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary)
topics = lsi_model.print_topics(num_words=10)
for topic in topics:
    print(topic[1])




0.806*"religion" + 0.257*"religious" + 0.123*"belief" + 0.117*"culture" + 0.108*"intelligence" + 0.098*"practice" + 0.087*"original" + 0.086*"artificial" + 0.079*"century" + 0.074*"science"
-0.407*"intelligence" + -0.321*"artificial" + -0.285*"original" + -0.280*"archived" + 0.278*"religion" + -0.257*"retrieved" + -0.222*"machine" + -0.170*"learning" + -0.156*"problem" + -0.132*"october"
0.697*"painting" + 0.337*"leonardo" + 0.178*"louvre" + 0.171*"eiffel" + 0.168*"portrait" + 0.152*"french" + 0.124*"century" + 0.119*"museum" + 0.098*"giocondo" + 0.091*"italian"
-0.654*"eiffel" + 0.259*"painting" + -0.182*"second" + -0.144*"exposition" + -0.143*"structure" + 0.130*"leonardo" + -0.127*"tallest" + -0.116*"engineer" + -0.110*"french" + -0.106*"design"


### Exercise 1
Choose 8 Wikipedia articles of your choice.


In [29]:
article_ids = [
    22934, # Probability
    27619007, # Mossad
    63478112, # 1975 Banqiao Dam Failure
    147809, # Mochi
    28661, # Defamation
    344287, # Mercury Poisoning
    4279208, # Peer-to-peer file sharing
    1558354, # Demographics of Europe
]

articles = [wikipedia.page(pageid=page_id) for page_id in article_ids]
article_content = [article.content for article in articles]


1. Clean the data using **stemming** and print the resulting topics when applying LDA with num_topics = 2.


In [30]:
stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')

wiki_texts = []
for content in article_content:
    raw = content.lower()
    words = tokenizer.tokenize(raw)
    
    stemmed_words = [stemmer.stem(word) for word in words if word not in en_stop]
    
    wiki_texts.append(stemmed_words)

In [31]:
wiki_dict = corpora.Dictionary(wiki_texts)
wiki_corpus = [wiki_dict.doc2bow(text) for text in wiki_texts]

In [32]:
ldamodel = gensim.models.ldamodel.LdaModel(wiki_corpus, num_topics=2, id2word=wiki_dict, passes=20)

print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, '0.015*"defam" + 0.010*"law" + 0.009*"libel" + 0.008*"public"'), (1, '0.011*"mercuri" + 0.008*"mossad" + 0.007*"probabl" + 0.007*"s"')]



2. Clean the data using **lemmatization** and print the resulting topics when applying LDA with num_topics = 2.


In [33]:
stemmer = WordNetLemmatizer()

wiki_texts = []
for content in article_content:
    lemmatized_words = preprocess_text(content)
    wiki_texts.append(lemmatized_words)

In [34]:
wiki_dict = corpora.Dictionary(wiki_texts)
wiki_corpus = [wiki_dict.doc2bow(text) for text in wiki_texts]

In [35]:
lda_model = gensim.models.ldamodel.LdaModel(wiki_corpus, num_topics=8, id2word=wiki_dict, passes=70)

topics = lda_model.print_topics(num_topics=8, num_words=4)
for topic in topics:
    print(topic[1])

0.014*"disaster" + 0.012*"people" + 0.012*"failure" + 0.012*"banqiao"
0.044*"mercury" + 0.016*"sharing" + 0.012*"poisoning" + 0.010*"network"
0.036*"population" + 0.032*"europe" + 0.030*"european" + 0.016*"country"
0.038*"defamation" + 0.017*"statement" + 0.016*"article" + 0.015*"criminal"
0.000*"defamation" + 0.000*"mercury" + 0.000*"statement" + 0.000*"probability"
0.000*"mercury" + 0.000*"defamation" + 0.000*"mossad" + 0.000*"sharing"
0.034*"probability" + 0.014*"displaystyle" + 0.010*"glutinous" + 0.009*"theory"
0.039*"mossad" + 0.017*"israel" + 0.016*"israeli" + 0.014*"operation"



3. Print the coherence and perplexity score of your lemmatization LDA (n=2) model.
Tune your LDA model hyperparameters (num_topics, passes, etc.) until you obtain as good a coherence score as possible.

    Print the **coherence score, perplexity score and the topic separation** of your **new best model**.


In [36]:
print('Perplexity:', lda_model.log_perplexity(wiki_corpus))

Perplexity: -7.516149252318446


In [37]:
from gensim.models import CoherenceModel

coherence_score_lda = CoherenceModel(model=lda_model, texts=wiki_texts, dictionary=wiki_dict, coherence='c_v')
coherence_score = coherence_score_lda.get_coherence()

print('Coherence Score:', coherence_score)

Coherence Score: 0.5302913887945808



4. Apply LSA on your lemmatization-based cleaned articles and print the resulting topics using (num_topics = 2, num_topics = 4 and num_topics = 8)

In [38]:
for num_topics in [2, 4, 8]:
    lsi_model = LsiModel(wiki_corpus, num_topics=num_topics, id2word=wiki_dict)
    topics = lsi_model.print_topics(num_words=6)
    print(f'LSA for number of topics: {num_topics}')
    for topic in topics:
        print(topic[1])



LSA for number of topics: 2
0.599*"defamation" + 0.270*"statement" + 0.258*"article" + 0.240*"criminal" + 0.217*"public" + 0.216*"person"
0.821*"mercury" + 0.215*"poisoning" + 0.184*"mossad" + 0.174*"exposure" + 0.097*"compound" + 0.097*"methylmercury"
LSA for number of topics: 4
0.599*"defamation" + 0.270*"statement" + 0.258*"article" + 0.240*"criminal" + 0.217*"public" + 0.216*"person"
0.821*"mercury" + 0.215*"poisoning" + 0.184*"mossad" + 0.174*"exposure" + 0.097*"compound" + 0.097*"methylmercury"
-0.626*"mossad" + 0.271*"mercury" + -0.269*"israel" + -0.249*"israeli" + -0.222*"operation" + -0.202*"intelligence"
0.796*"probability" + 0.314*"displaystyle" + 0.210*"theory" + 0.124*"outcome" + 0.104*"example" + 0.093*"number"




LSA for number of topics: 8
0.599*"defamation" + 0.270*"statement" + 0.258*"article" + 0.240*"criminal" + 0.217*"public" + 0.216*"person"
-0.821*"mercury" + -0.215*"poisoning" + -0.184*"mossad" + -0.174*"exposure" + -0.097*"compound" + -0.097*"methylmercury"
-0.626*"mossad" + 0.271*"mercury" + -0.269*"israel" + -0.249*"israeli" + -0.222*"operation" + -0.202*"intelligence"
0.796*"probability" + 0.314*"displaystyle" + 0.210*"theory" + 0.124*"outcome" + 0.104*"example" + 0.093*"number"
0.488*"population" + 0.430*"europe" + 0.408*"european" + 0.221*"country" + 0.199*"language" + 0.145*"sharing"
-0.627*"sharing" + -0.376*"network" + -0.207*"community" + -0.149*"percent" + 0.145*"population" + 0.132*"europe"
0.391*"glutinous" + 0.282*"japanese" + 0.216*"traditional" + 0.216*"amylopectin" + 0.202*"starch" + 0.168*"called"
0.326*"disaster" + 0.290*"banqiao" + 0.290*"failure" + 0.259*"people" + 0.218*"chinese" + 0.215*"august"
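
Note that the second topic is printed with positive weights for 2 and 4 topics but with negative weights for 8 topics. This is most likely not a different topic: LSA is based on a singular value decomposition, and the overall sign of each component is arbitrary. The sketch below, using invented vectors, shows that flipping the sign of both singular vectors leaves the reconstructed matrix unchanged.

```python
def rank_one(u, s, v):
    """Outer product s * u * v^T as a nested list."""
    return [[s * ui * vj for vj in v] for ui in u]


# Hypothetical singular triple (term weights u, strength s, document weights v).
u = [0.8, 0.6]
v = [0.6, 0.8]
s = 2.0

flipped_u = [-x for x in u]
flipped_v = [-x for x in v]

print(rank_one(u, s, v) == rank_one(flipped_u, s, flipped_v))
```

When reading LSA topics, what matters is therefore the relative signs and magnitudes of the weights within a topic, not whether the topic as a whole is printed as positive or negative.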


### Topic Modelling on books

In [39]:
# We will try to apply topic modelling on the science books from last laboratories
import pandas as pd
import gutenbergpy.textget
from tidytext import unnest_tokens
import nltk
nltk.download('punkt')

raw_book = gutenbergpy.textget.get_text_by_id(37729)
galileo_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(14725)
huygens_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(13476)
tesla_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(30155)
einstein_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

corpus = [galileo_text, huygens_text, tesla_text, einstein_text]
processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [40]:
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=255)

topics = lda_model.print_topics(num_words=15)
for topic in topics:
    print(topic, '\n')


(0, '0.024*"gravity" + 0.021*"figure" + 0.013*"therefore" + 0.011*"bottom" + 0.011*"matter" + 0.009*"equall" + 0.008*"resistance" + 0.007*"proportion" + 0.007*"_aristotle_" + 0.007*"absolute" + 0.007*"greater" + 0.007*"without" + 0.007*"motion" + 0.006*"weight" + 0.006*"descend"') 

(1, '0.009*"relativity" + 0.008*"theory" + 0.008*"frequency" + 0.007*"reference" + 0.006*"system" + 0.006*"current" + 0.006*"motion" + 0.006*"ordinate" + 0.006*"general" + 0.006*"result" + 0.006*"potential" + 0.006*"experiment" + 0.005*"distance" + 0.005*"energy" + 0.005*"velocity"') 

(2, '0.000*"terminate" + 0.000*"advancing" + 0.000*"mutually" + 0.000*"changed" + 0.000*"clearer" + 0.000*"obscure" + 0.000*"regarding" + 0.000*"applying" + 0.000*"understanding" + 0.000*"believed" + 0.000*"erected" + 0.000*"opposed" + 0.000*"opposition" + 0.000*"edition" + 0.000*"business"') 

(3, '0.027*"refraction" + 0.019*"crystal" + 0.015*"surface" + 0.015*"straight" + 0.011*"perpendicular" + 0.010*"parallel" + 0.010*"mo

In [41]:
# See which topic each book resembles

print(lda_model[gensim_corpus[0]])
print(lda_model[gensim_corpus[1]])
print(lda_model[gensim_corpus[2]])
print(lda_model[gensim_corpus[3]])


[(0, 0.99991864)]
[(3, 0.99990845)]
[(1, 0.99993205)]
[(1, 0.99991816)]


In [42]:
# Let's see which topic a fragment of Tesla's text seems to belong to
tesla_text_crumbs = tesla_text[1000:5000]

tesla_text_crumbs = preprocess_text(tesla_text_crumbs)
tesla_text_crumbs = gensim_dictionary.doc2bow(tesla_text_crumbs)

print(lda_model.get_document_topics(tesla_text_crumbs))


[(1, 0.99623775)]


### Jane Austen

In [43]:
import pandas as pd
import gutenbergpy.textget
from tidytext import unnest_tokens
import nltk
nltk.download('punkt')

raw_book = gutenbergpy.textget.get_text_by_id(161)
sense_sensibility_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(1342)
pride_prejudice_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(158)
emma_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(105)
persuasion_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

corpus = [sense_sensibility_text, pride_prejudice_text, emma_text, persuasion_text]

processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)
    
gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=255)

topics = lda_model.print_topics(num_words=15)
for topic in topics:
    print(topic, '\n')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bogda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.008*"elizabeth" + 0.008*"little" + 0.006*"nothing" + 0.006*"friend" + 0.006*"though" + 0.006*"harriet" + 0.006*"thought" + 0.006*"without" + 0.005*"always" + 0.005*"father" + 0.005*"weston" + 0.005*"sister" + 0.004*"feeling" + 0.004*"knightley" + 0.004*"indeed"') 

(1, '0.021*"elinor" + 0.017*"marianne" + 0.010*"sister" + 0.008*"mother" + 0.008*"edward" + 0.008*"dashwood" + 0.007*"jennings" + 0.007*"willoughby" + 0.007*"though" + 0.006*"nothing" + 0.005*"colonel" + 0.005*"without" + 0.005*"little" + 0.005*"however" + 0.004*"brandon"') 

(2, '0.000*"judiciously" + 0.000*"bespoke" + 0.000*"stealing" + 0.000*"abstraction" + 0.000*"contraction" + 0.000*"resettled" + 0.000*"resembling" + 0.000*"reseated" + 0.000*"reprobate" + 0.000*"despondence" + 0.000*"bespeak" + 0.000*"destruction" + 0.000*"weathered" + 0.000*"recognition" + 0.000*"altering"') 

(3, '0.000*"judiciously" + 0.000*"bespoke" + 0.000*"stealing" + 0.000*"abstraction" + 0.000*"contraction" + 0.000*"resettled" + 0.000*"re

In [44]:
print(lda_model[gensim_corpus[0]])
print(lda_model[gensim_corpus[1]])
print(lda_model[gensim_corpus[2]])
print(lda_model[gensim_corpus[3]])

[(1, 0.99965495)]
[(0, 0.9997716)]
[(0, 0.9999787)]
[(0, 0.9999622)]


### Dostoyevsky

In [45]:
# Let's add some books from Dostoyevsky to our books corpus
raw_book = gutenbergpy.textget.get_text_by_id(2554)
crime_and_punishment_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(600)
notes_from_underground_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")

raw_book = gutenbergpy.textget.get_text_by_id(8117)
the_possessed_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")


In [46]:
# Does the LDA model split the topics by author?
authors_corpus = [sense_sensibility_text, pride_prejudice_text, emma_text, persuasion_text,
                  crime_and_punishment_text, notes_from_underground_text, the_possessed_text]

processed_data = []
for doc in authors_corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)
    
gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=2, id2word=gensim_dictionary, passes=40)

topics = lda_model.print_topics(num_words=15)
for topic in topics:
    print(topic[1], '\n')



0.007*"little" + 0.006*"sister" + 0.006*"nothing" + 0.006*"though" + 0.006*"elizabeth" + 0.006*"elinor" + 0.005*"without" + 0.005*"friend" + 0.005*"thought" + 0.005*"always" + 0.005*"marianne" + 0.004*"mother" + 0.004*"however" + 0.004*"harriet" + 0.004*"feeling" 

0.010*"though" + 0.006*"suddenly" + 0.006*"raskolnikov" + 0.006*"nothing" + 0.006*"something" + 0.005*"without" + 0.005*"thought" + 0.005*"little" + 0.005*"people" + 0.004*"looked" + 0.004*"perhaps" + 0.004*"almost" + 0.004*"stepan" + 0.004*"trofimovitch" + 0.004*"course" 



In [47]:
# Which topic does each of the books seem to belong to?
for g_corpus in gensim_corpus:
    print(lda_model[g_corpus])


[(0, 0.99997556)]
[(0, 0.9999757)]
[(0, 0.9999789)]
[(0, 0.9999594)]
[(1, 0.9999779)]
[(1, 0.9993321)]
[(1, 0.9999817)]


In [48]:
import pyLDAvis.gensim_models

lda_visualization = pyLDAvis.gensim_models.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)




### Exercise 2 - Visualization basics



Use the Jane Austen vs Dostoyevsky topic visualization
1. Inspect the conditional topic distribution given for the words 'elizabeth', 'sister' and 'suddenly'. (hover over the words to see their cond. topic distribution)


* elizabeth - only appears in Jane Austen
* sister - appears in both Jane Austen and Dostoyevsky, but more relevant in Jane Austen
* suddenly - appears significantly more in Dostoyevsky than in Jane Austen


2. Show the most relevant 30 terms of each topic when lambda = 0.



3. Answer the following questions based on visual observations:




>*   The term 'father' is more relevant for which author?
>*   In the books of which author do things happen 'suddenly' more often? 
>*   For which author does 'family' appear more often?
>*   In the books of which author do 'letter(s)' play a bigger role?



* father - more relevant in Jane Austen
* suddenly - significantly more in Dostoyevsky
* family - a lot more in Jane Austen
* letter(s) - more in Jane Austen

### Exercise 3 



Choose books from at least 3 different authors and create your clean corpus of books using lemmatization.
For one book of your choice (out of the >=3) split the book into chapters.


In [49]:
books = [
    {'id': 996, 'name': 'Don Quixote'},
    {'id': 8486, 'name': 'Ghost Stories of an Antiquary'},
    {'id': 2600, 'name': 'War and Peace'},
    {'id': 164, 'name': 'Twenty Thousand Leagues under the Sea'}
]

def process_book(book_id):
    # Download the book from Project Gutenberg and strip its header and footer
    raw_book = gutenbergpy.textget.get_text_by_id(book_id)
    book_text = gutenbergpy.textget.strip_headers(raw_book).decode("utf-8")
    
    return book_text

author_content = [process_book(book['id']) for book in books]

In [50]:
author_texts = []
for content in author_content:
    lemmatized_words = preprocess_text(content)
    author_texts.append(lemmatized_words)

In [51]:
author_dict = corpora.Dictionary(author_texts)
author_corpus = [author_dict.doc2bow(text) for text in author_texts]


1. Train (use) an LDA Model that correctly separates everything into N distinct topics, where N is the number of authors you have. E.g. If you have 4 different authors your model should separate your corpus into 4 different topics.


In [52]:
lda_model = gensim.models.ldamodel.LdaModel(author_corpus, num_topics=4, id2word=author_dict, passes=70)

topics = lda_model.print_topics(num_words=8)
for topic in topics:
    print(topic[1])

0.013*"pierre" + 0.013*"prince" + 0.008*"natásha" + 0.008*"andrew" + 0.006*"princess" + 0.006*"thought" + 0.006*"french" + 0.005*"rostóv"
0.000*"prince" + 0.000*"pierre" + 0.000*"natásha" + 0.000*"andrew" + 0.000*"princess" + 0.000*"thought" + 0.000*"chapter" + 0.000*"without"
0.024*"quixote" + 0.023*"sancho" + 0.009*"knight" + 0.007*"without" + 0.007*"master" + 0.006*"though" + 0.006*"worship" + 0.004*"replied"
0.015*"captain" + 0.012*"nautilus" + 0.007*"conseil" + 0.004*"seemed" + 0.004*"without" + 0.004*"surface" + 0.004*"thought" + 0.004*"little"



2. Separate the text into chapters for one book of your choice out of the ones chosen. Print the similitude of each chapter to all of the authors that you've chosen.
Your output should look like:


In [58]:
import re

def split_into_chapters(book_text):
    return re.split(r'CHAPTER [IVXLCD]+\.', book_text)
    
chapters = split_into_chapters(author_content[0])

In [59]:
# Note: re.split puts any text before the first 'CHAPTER ...' heading in chapters[0],
# so the numbering below is offset if the book has front matter.
for idx, chapter in enumerate(chapters):
    processed_chapter = preprocess_text(chapter)
    bow_processed_chapter = author_dict.doc2bow(processed_chapter)
    
    print(f'Chapter {idx + 1}: {lda_model[bow_processed_chapter]}')

Chapter 1: [(2, 0.9998765)]
Chapter 2: [(2, 0.99835914)]
Chapter 3: [(2, 0.99862164)]
Chapter 4: [(2, 0.9987401)]
Chapter 5: [(2, 0.9985635)]
Chapter 6: [(2, 0.9976819)]
Chapter 7: [(2, 0.9986231)]
Chapter 8: [(2, 0.9981378)]
Chapter 9: [(2, 0.99894243)]
Chapter 10: [(2, 0.99841154)]
Chapter 11: [(2, 0.9982223)]
Chapter 12: [(2, 0.9984106)]
Chapter 13: [(2, 0.99837875)]
Chapter 14: [(2, 0.9990796)]
Chapter 15: [(2, 0.99881226)]
Chapter 16: [(2, 0.99882686)]
Chapter 17: [(2, 0.9988228)]
Chapter 18: [(2, 0.9989061)]
Chapter 19: [(2, 0.9990989)]
Chapter 20: [(2, 0.9987935)]
Chapter 21: [(2, 0.99927396)]
Chapter 22: [(2, 0.9991786)]
Chapter 23: [(2, 0.99909425)]
Chapter 24: [(2, 0.999212)]
Chapter 25: [(2, 0.9990816)]
Chapter 26: [(2, 0.9994252)]
Chapter 27: [(2, 0.99881923)]
Chapter 28: [(2, 0.99946773)]
Chapter 29: [(2, 0.99938834)]
Chapter 30: [(2, 0.9992614)]
Chapter 31: [(2, 0.9991401)]
Chapter 32: [(2, 0.99889714)]
Chapter 33: [(2, 0.998479)]
Chapter 34: [(2, 0.999574)]
Chapter 35: [


    Chapter 1 - (Author 1)

    Chapter 2 - (Author 1 - 0.73, Author 3 - 0.25)

    Chapter 3 - (Author 1)

    etc.

    It's possible that a chapter is similar to the writing style or theme of multiple authors - in that case print out the names and similitude weights for the authors.
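
One possible way to produce the layout requested above is sketched below. It is not the reference solution: the helper names, the `min_weight` cut-off, the toy book and the topic distributions are all hypothetical, and the topic-to-author mapping has to be filled in by hand after inspecting the topics of your own model.

```python
import re


def split_into_chapters(book_text):
    """Split on 'CHAPTER <roman numeral>.' and drop the text before chapter one."""
    parts = re.split(r'CHAPTER [IVXLCD]+\.', book_text)
    return parts[1:]


def format_similarity(chapter_no, doc_topics, topic_to_author, min_weight=0.05):
    """Render one line in the 'Chapter N - (Author - weight, ...)' layout."""
    kept = sorted((pair for pair in doc_topics if pair[1] >= min_weight),
                  key=lambda pair: pair[1], reverse=True)
    if len(kept) == 1:
        body = topic_to_author[kept[0][0]]
    else:
        body = ", ".join(f"{topic_to_author[t]} - {w:.2f}" for t, w in kept)
    return f"Chapter {chapter_no} - ({body})"


# Hypothetical inputs: a tiny "book" and invented topic distributions.
toy_book = "Title page CHAPTER I. first text CHAPTER II. second text"
topic_to_author = {0: "Author 1", 1: "Author 2", 2: "Author 3"}
toy_topics = [[(0, 0.99)], [(0, 0.73), (2, 0.25), (1, 0.02)]]

toy_chapters = split_into_chapters(toy_book)
for chapter_no, doc_topics in enumerate(toy_topics, start=1):
    print(format_similarity(chapter_no, doc_topics, topic_to_author))
```

In a real run, `doc_topics` would come from the LDA model applied to each preprocessed chapter, and dropping the first split element keeps front matter from being counted as chapter 1.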


