## 9.2. Topic Modeling

### 9.2.1. Wikipedia Article Scraping

In [1]:
#$ pip install wikipedia

In [2]:
import wikipedia
import nltk
import re
from nltk.stem import WordNetLemmatizer

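# WordNetLemmatizer is a lemmatizer; the name `stemmer` is kept because later cells use it
stemmer = WordNetLemmatizer()

# nltk.download('wordnet')  # uncomment if the WordNet data is not installed
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))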

ml = wikipedia.page("Machine Learning")
pizza = wikipedia.page("Pizza")
covid = wikipedia.page("Corona Virus")
etower = wikipedia.page("Eiffel Tower")

corpus = [ml.content, pizza.content, covid.content, etower.content]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\usman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 9.2.2. Data Cleaning

In [3]:
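def clean_text(doc):
    # Replace every non-word character (punctuation, symbols) with a space
    doc = re.sub(r'\W', ' ', str(doc))

    # Remove single letters surrounded by whitespace; the pass runs twice
    # because adjacent single letters share whitespace and one pass can miss some
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)

    # Collapse runs of whitespace into a single space
    doc = re.sub(r'\s+', ' ', doc)

    # Drop a leading "b" left over from a bytes-literal prefix
    doc = re.sub(r'^b\s+', '', doc)

    doc = doc.lower()

    # Lemmatize, then drop stop words and words of five characters or fewer
    words = doc.split()
    words = [stemmer.lemmatize(word) for word in words]
    words = [word for word in words if word not in en_stop]
    words = [word for word in words if len(word) > 5]

    return words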
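
The regular-expression steps in `clean_text` can be tried in isolation. The sketch below uses only the standard library and a made-up sample string (not part of the Wikipedia corpus) to show why the single-letter pass is applied twice: after the first pass a stray letter can survive because the whitespace before it was already consumed by the previous match.

```python
import re

# Hypothetical sample string, chosen only to illustrate the regex steps
sample = "Pizza's crust, e.g. thin!"

text = re.sub(r'\W', ' ', sample)               # punctuation becomes spaces
once = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)     # first single-letter pass
twice = re.sub(r'\s+[a-zA-Z]\s+', ' ', once)    # second pass catches leftovers
tokens = re.sub(r'\s+', ' ', twice).lower().split()

print(once.split())   # ['Pizza', 'crust', 'g', 'thin']
print(tokens)         # ['pizza', 'crust', 'thin']
```

The lemmatization, stop-word, and length filters are left out here so that the snippet runs without any NLTK data.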

In [4]:
# Clean every article; each entry is the list of tokens for one document
formated_data = [clean_text(doc) for doc in corpus]

### 9.2.3. Topic Modeling with LDA

In [5]:
from gensim import corpora

gensim_dict = corpora.Dictionary(formated_data)
# Each element of formated_data is one document's token list
gensim_corpus = [gensim_dict.doc2bow(tokens, allow_update=True) for tokens in formated_data]
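
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`: every distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. The names `toy_docs`, `token2id`, and `to_bow` are hypothetical, and the ids gensim assigns to real data may differ from the first-appearance order used here.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data, not the Wikipedia corpus)
toy_docs = [["cheese", "tomato", "cheese"], ["tomato", "basil"]]

# Assign each distinct token an integer id in order of first appearance
token2id = {}
for tokens in toy_docs:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(tokens) for tokens in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

A document containing only unseen tokens maps to an empty list in this sketch, which is also why short test sentences can end up with very little signal.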

In [6]:
import gensim

lda_topic_models = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dict, passes=20)

In [7]:
# Each entry is a (topic_id, weighted_words) pair
lda_topics = lda_topic_models.print_topics(num_words=7)
for topic in lda_topics:
    print(topic)

(0, '0.029*"eiffel" + 0.009*"second" + 0.006*"structure" + 0.006*"french" + 0.006*"exposition" + 0.006*"tallest" + 0.005*"engineer"')
(1, '0.000*"learning" + 0.000*"coronavirus" + 0.000*"machine" + 0.000*"algorithm" + 0.000*"coronaviruses" + 0.000*"eiffel" + 0.000*"training"')
(2, '0.013*"cheese" + 0.009*"italian" + 0.009*"tomato" + 0.007*"ingredient" + 0.007*"similar" + 0.007*"topped" + 0.007*"topping"')
(3, '0.038*"learning" + 0.021*"machine" + 0.021*"coronavirus" + 0.013*"algorithm" + 0.010*"training" + 0.009*"coronaviruses" + 0.007*"example"')


### 9.2.4. Testing the Topic Model

In [8]:
doc = 'I like to eat fast food filled with bread and cream'
formatted_doc = clean_text(doc)
bow_doc = gensim_dict.doc2bow(formatted_doc)

print(lda_topic_models.get_document_topics(bow_doc))

[(0, 0.1330796), (1, 0.12547417), (2, 0.6163538), (3, 0.12509239)]
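
The output is a list of `(topic_id, probability)` pairs. A small sketch for picking the dominant topic is shown below; it hard-codes the values printed above so that it runs on its own, and the variable names are illustrative. Because LDA training is randomized, the exact numbers and topic ids will likely differ between runs.

```python
# Topic distribution copied from the output above
doc_topics = [(0, 0.1330796), (1, 0.12547417), (2, 0.6163538), (3, 0.12509239)]

# Pick the pair with the highest probability
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(f"Dominant topic: {best_topic} (probability {best_prob:.2f})")
# Dominant topic: 2 (probability 0.62)
```

In the run shown earlier, topic 2 is the one whose top words include "cheese", "italian", and "tomato".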


## Exercise 9.1

**Question 1:**

The type of text summary that includes contents from the original text is called:

A. Abstractive Summary

B. Extractive Summary

C. Derived Summary

D. None of the Above

**Answer: B**
    

**Question 2:**

To retrieve the text of a Wikipedia page, which of the following attributes of the `page` object is used:

A. text

B. data

C. content
 
D. raw_data

**Answer: C**
    
    
**Question 3:**

To create a gensim corpus, you need to pass a collection of tokens to which object:

A. gensim.Corpora()

B. gensim.Corpus()

C. gensim.Collection()

D. gensim.corpora.Dictionary()
    
**Answer: D**

## Exercise 9.2

Using the Wikipedia library for Python, perform text summarization of the Wikipedia article on Corona Virus. Add only sentences that contain fewer than 40 words. Display the first 10 sentences from the summary.

**Solution:**

In [9]:
import heapq
import re

import nltk
import wikipedia

nltk.download('stopwords')
# nltk.download('punkt')  # uncomment if the tokenizer models are not installed

covid = wikipedia.page("Corona Virus")
scraped_data = covid.content

# Remove bracketed reference numbers such as [12] and collapse whitespace
scraped_data = re.sub(r'\[[0-9]*\]', ' ', scraped_data)
scraped_data = re.sub(r'\s+', ' ', scraped_data)

# Letters-only copy of the text, used for counting word frequencies
formatted_text = re.sub('[^a-zA-Z]', ' ', scraped_data)
formatted_text = re.sub(r'\s+', ' ', formatted_text)

all_sentences = nltk.sent_tokenize(scraped_data)
stopwords = set(nltk.corpus.stopwords.words('english'))

# Count occurrences of every word that is not a stop word
word_freq = {}
for word in nltk.word_tokenize(formatted_text):
    if word not in stopwords:
        word_freq[word] = word_freq.get(word, 0) + 1

# Normalize the counts by the count of the most frequent word
max_freq = max(word_freq.values())
for word in word_freq:
    word_freq[word] = word_freq[word] / max_freq

# Score each sentence shorter than 40 words by summing its word frequencies
sentence_scores = {}
for sentence in all_sentences:
    if len(sentence.split(' ')) >= 40:
        continue
    for token in nltk.word_tokenize(sentence.lower()):
        if token in word_freq:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[token]

# Keep the 10 highest-scoring sentences
selected_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

text_summary = ' '.join(selected_sentences)
print(text_summary)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\usman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Phylogentically, mouse hepatitis virus (Murine coronavirus), which infects the mouse's liver and central nervous system, is related to human coronavirus OC43 and bovine coronavirus. === Middle East respiratory syndrome (MERS) === In September 2012, a new type of coronavirus was identified, initially called Novel Coronavirus 2012, and now officially named Middle East respiratory syndrome coronavirus (MERS-CoV). RNA recombination appears to be a major driving force in determining genetic variability within a coronavirus species, the capability of a coronavirus species to jump from one host to another and, infrequently, in determining the emergence of novel coronaviruses. Ferret enteric coronavirus causes a gastrointestinal syndrome known as epizootic catarrhal enteritis (ECE), and a more lethal systemic version of the virus (like FIP in cats) known as ferret systemic coronavirus (FSC). Sialodacryoadenitis virus (SDAV), which is a strain of the species Murine coronavirus, is highly infect