![26-Weeks-logo](../images/26-weeks-of-data-science-banner.jpg)

<h1 align="center"> Natural Language Processing - Part 2 </h1>

## Program so far 
***

* Python
* Statistics
* Supervised Machine Learning
* Unsupervised Machine Learning
* NLP Part - 1

## What are we going to learn today ?
***


- POS Tagging
    - Understanding Lexical Units
    - Open Class
    - Closed Classes
    - Implementation in Python
- Chunking
- Parsing
    - Deep Tree Parsing
- Topic Modeling
    - What is Topic Modelling
    - Applications of Topic Modelling
    - Latent Dirichlet Allocation
    - How LDA works
    - Topic Modelling with Gensim

Last week we saw the bag-of-words approach. As we discussed, bag-of-words fails to capture the structure of sentences. Part-of-speech tagging helps us overcome this weakness. Let's see how.

### POS Tagging

Parts of speech are grammatical categories of words (noun, verb, adverb, adjective, and so on). POS tagging classifies each token into its part of speech and labels it with a tag from a tagset, which is the collection of tags used for the task. Sounds confusing? Let's make it simpler.

**Syntax** = how words compose to form larger meaning-bearing units


**POS** = syntactic categories for words
* You could substitute words within a class and have a syntactically valid sentence
* Gives information about how words combine into larger phrases

**POS from school grammar:** noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection


**Reality**
![](../images/img1.png)

## Understanding Lexical Units:

* There are approximately 8 traditional basic word classes, sometimes called lexical classes or types
* These are the ones taught in grade school grammar
    - N noun - chair, bandwidth, boy, girl
    - V verb - study, debate, munch
    - ADJ adjective - purple, tall, ridiculous (includes articles)
    - ADV adverb - unfortunately, slowly
    - P preposition - of, by, to
    - CON conjunction - and, but
    - PRO pronoun - I, me, mine
    - INT interjection -  um

## Open Class

* New words can be added to these basic word classes:
* Nouns, Verbs, Adjectives, Adverbs.
    - Every known human language has nouns and verbs
* Nouns: people, places, things
    - Classes of nouns
        - proper vs. common
        - count vs. mass
    - Properties of nouns: can be preceded by a determiner, etc.
* Verbs: actions and processes
* Adjectives: properties, qualities
* Adverbs: hodgepodge!
    - Unfortunately, John walked home extremely slowly yesterday
* Numerals, ordinals: one, two, three, third, …

## Closed classes

* Words are not added to these classes:
    * determiners: a, an, the
    * pronouns: she, he, I
    * prepositions: on, under, over, near, by, …
        * over the river and through the woods
    * particles: up, down, on, off, …
        * Used with verbs, with a slightly different meaning than when used as a preposition
        * she turned the paper over
* Closed class words are often function words which have structuring uses in grammar:
    - of, it, and, you
* Differ more from language to language than open class words

In [1]:
import nltk 

text = open('../data/C50train/AaronPressman/2537newsML.txt').read()
sents = nltk.sent_tokenize(text)
print(sents[0])
print()
from nltk import word_tokenize
tokens = word_tokenize(sents[0])
print(tokens)

A break-in at the U.S. Justice Department's World Wide Web site last week highlighted the Internet's continued vulnerability to hackers.

['A', 'break-in', 'at', 'the', 'U.S.', 'Justice', 'Department', "'s", 'World', 'Wide', 'Web', 'site', 'last', 'week', 'highlighted', 'the', 'Internet', "'s", 'continued', 'vulnerability', 'to', 'hackers', '.']


In [2]:
import nltk
nltk.pos_tag(tokens)

[('A', 'DT'),
 ('break-in', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('Justice', 'NNP'),
 ('Department', 'NNP'),
 ("'s", 'POS'),
 ('World', 'NNP'),
 ('Wide', 'NNP'),
 ('Web', 'NNP'),
 ('site', 'NN'),
 ('last', 'JJ'),
 ('week', 'NN'),
 ('highlighted', 'VBD'),
 ('the', 'DT'),
 ('Internet', 'NNP'),
 ("'s", 'POS'),
 ('continued', 'JJ'),
 ('vulnerability', 'NN'),
 ('to', 'TO'),
 ('hackers', 'NNS'),
 ('.', '.')]

Tags such as DT, NNP and VBD are POS tags taken from the standard Penn Treebank tagset. The full list can be found here:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

POS tagging is itself a supervised learning problem: a tagger is trained on labelled text and uses features such as the previous word, the next word, and whether the first letter is capitalized.

NLTK provides `nltk.pos_tag()` to get POS tags. It takes the list of tokens produced by the tokenization step.

In our problem of Author Identification, we can create multiple features using POS tagging:
1. The number of nouns, verbs, adjectives, etc.
2. How many times a sentence starts with an adverb, that is, with words like Basically, Typically, etc.
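
The two feature ideas above can be sketched in plain Python. This is only an illustration: the tagged tokens are copied from the `nltk.pos_tag` output shown earlier, and `pos_features` is a hypothetical helper name, not part of NLTK. It groups Penn Treebank tags by prefix (for example, every tag starting with `NN` counts as a noun), which is one reasonable convention rather than the only one.

```python
from collections import Counter

# Tagged tokens copied from the nltk.pos_tag output shown above.
tagged = [('A', 'DT'), ('break-in', 'NN'), ('at', 'IN'), ('the', 'DT'),
          ('U.S.', 'NNP'), ('Justice', 'NNP'), ('Department', 'NNP'),
          ("'s", 'POS'), ('World', 'NNP'), ('Wide', 'NNP'), ('Web', 'NNP'),
          ('site', 'NN'), ('last', 'JJ'), ('week', 'NN'),
          ('highlighted', 'VBD'), ('the', 'DT'), ('Internet', 'NNP'),
          ("'s", 'POS'), ('continued', 'JJ'), ('vulnerability', 'NN'),
          ('to', 'TO'), ('hackers', 'NNS'), ('.', '.')]

# Hypothetical mapping from Penn Treebank tag prefixes to coarse classes.
PREFIX_TO_CLASS = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}


def pos_features(tagged_sentence):
    """Return simple POS-based features for one tagged sentence."""
    counts = Counter()
    for _, tag in tagged_sentence:
        for prefix, name in PREFIX_TO_CLASS.items():
            if tag.startswith(prefix):
                counts[name] += 1
                break
    first_tag = tagged_sentence[0][1] if tagged_sentence else ''
    features = {name: counts[name] for name in PREFIX_TO_CLASS.values()}
    # Feature 2 from the list above: does the sentence open with an adverb?
    features['starts_with_adverb'] = first_tag.startswith('RB')
    return features


features = pos_features(tagged)
print(features)
```

Computing these features per sentence and then averaging them per author would give one row of numeric features per author.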
    

### Chunking

Chunking is the process of extracting phrases (aka chunks) from unstructured text. Single tokens may not represent the actual meaning of the text, so it is advisable to treat a phrase such as "New Delhi" as a single unit instead of the separate words New and Delhi.

- Chunking is done using linguistic rules (language grammar rules), such as: when two proper nouns occur together, merge them into a single unit. For example, "South Africa".
- Chunking works on top of POS tagging. It takes POS tags as input and provides chunks as output.
- Similar to POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP), Verb Phrase (VP), etc.
- Many data scientists use n-grams instead of a chunker, but n-grams end up creating a lot of meaningless phrases.
- Chunking is very important when you want to extract information from text, such as locations, person names, etc.
- In Author Identification, we can build features such as how many named entities the author uses in a sentence.
- Another feature could be which countries/continents the author mostly refers to in their articles.

Many libraries, such as spaCy or TextBlob, provide phrases out of the box. The NLTK approach shown here instead builds chunks from regular expressions over POS tags.
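
To illustrate the "merge consecutive proper nouns" rule mentioned above, here is a minimal sketch in plain Python. The tagged sentence is hand-written for illustration (it is not real tagger output), and `merge_proper_nouns` is a hypothetical helper, not an NLTK function.

```python
def merge_proper_nouns(tagged_tokens):
    """Merge runs of consecutive NNP/NNPS tokens into single phrases."""
    merged = []
    run = []  # proper nouns collected since the last non-proper-noun token
    for word, tag in tagged_tokens:
        if tag in ('NNP', 'NNPS'):
            run.append(word)
            continue
        if run:
            merged.append((' '.join(run), 'NNP'))
            run = []
        merged.append((word, tag))
    if run:  # flush a run that ends the sentence
        merged.append((' '.join(run), 'NNP'))
    return merged


# Hand-tagged example sentence (an assumption made for this sketch).
tagged = [('He', 'PRP'), ('flew', 'VBD'), ('from', 'IN'),
          ('New', 'NNP'), ('Delhi', 'NNP'), ('to', 'TO'),
          ('South', 'NNP'), ('Africa', 'NNP'), ('.', '.')]

phrases = merge_proper_nouns(tagged)
print(phrases)
```

A rule this simple will also glue together unrelated neighbouring names, which is one reason real chunkers use richer grammars.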

In [3]:
# Define your grammar using regular expressions over POS tags.
# For example, an optional determiner (the/an/a) followed by adjectives and nouns forms a noun phrase, such as "a greedy dog".
grammar = ('''
    NP: {<DT>? <JJ>* <NN>*} # NP
    P: {<IN>}           # Preposition
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V.*> <PP|RB|V.*>*}  # VP -> V (PP|RB|V)*
    ''')
line="Unidentified hackers gained access to the department's web page on August 16 and replaced it with a hate-filled diatribe labelled the Department of Injustice that included a swastika and a picture of Adolf Hitler."
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(line))
tree = chunkParser.parse(tagged)
# The first subtree is the whole sentence; the rest are the individual chunks
for subtree in tree.subtrees():
    print(subtree)


(S
  (NP Unidentified/JJ)
  hackers/NNS
  (VP gained/VBD)
  (NP access/NN)
  to/TO
  (NP the/DT department/NN)
  's/POS
  (NP web/JJ page/NN)
  (P on/IN)
  August/NNP
  16/CD
  and/CC
  (VP replaced/VBD)
  it/PRP
  (PP (P with/IN) (NP a/DT hate-filled/JJ diatribe/NN))
  (VP labelled/VBD)
  (NP the/DT)
  Department/NNP
  (P of/IN)
  Injustice/NNP
  that/WDT
  (VP included/VBD)
  (NP a/DT swastika/NN)
  and/CC
  (NP a/DT picture/NN)
  (P of/IN)
  Adolf/NNP
  Hitler/NNP
  ./.)
(NP Unidentified/JJ)
(VP gained/VBD)
(NP access/NN)
(NP the/DT department/NN)
(NP web/JJ page/NN)
(P on/IN)
(VP replaced/VBD)
(PP (P with/IN) (NP a/DT hate-filled/JJ diatribe/NN))
(P with/IN)
(NP a/DT hate-filled/JJ diatribe/NN)
(VP labelled/VBD)
(NP the/DT)
(P of/IN)
(VP included/VBD)
(NP a/DT swastika/NN)
(NP a/DT picture/NN)
(P of/IN)


## Deep Tree Parsing

One of the advanced topics in NLP is the syntactic analysis of text, in which we try to analyze the grammatical structure of a sentence. In the NLP world this process is called deep tree parsing, and it lets us analyze relationships within the text.
- Text parsing is important when you want to know the relationships in a text. For example, in <i>Delhi is capital of India</i>, Delhi and India are related through the relationship <b>is capital of</b>.

In [4]:
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
sent = "the dog saw a man in the park"
tokens=nltk.word_tokenize(sent)
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(tokens):
    print(tree)

(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))
(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P in) (NP (Det the) (N park)))))


![alt text](../images/deep_parsing.png "Title")

Here as well we have to define our own grammar, which is quite a tedious job. Other NLP packages, such as Stanford CoreNLP, provide functions that generate a parse tree from unstructured text without defining any grammar.
- A parse tree gives us meaningful relations between entities, along with the kind of relation they share. These relations are also called facts.
- Tree parsing is used to build a knowledge base from an unstructured corpus. Check DBpedia.

# Topic Modeling

## What is Topic Modelling ?
***
So far we have covered supervised machine learning techniques for NLP. Now let's look at some unsupervised machine learning techniques for NLP.


NLP is all about unstructured data, and one of the problems the industry faces today is the amount of data that any system has to process. Often it is not practical to read through a huge volume of data to get insights about it. Consider Google News: hundreds of thousands of news articles get published on a daily basis. So we need a way to group news by keywords in order to understand what is going on.

![alt text](../images/topic_modelling.png "Title")

- The ones in red are classes. They are fixed, and with the help of training data we can build a news classifier.
- The ones in green are topics, which are identified at run time. The process of identifying topics is totally unsupervised. Topic modelling is one of the best ways to understand and represent unstructured text without actually reading through it.

__Topic Modelling__, as the name suggests, is a process to automatically identify the topics present in a text object and to derive hidden patterns exhibited by a text corpus, thus assisting better decision making.

A __Topic__ can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – Farming.

### Applications of Topic Modelling

- Document Clustering.
    1. Group news.
    2. Group emails.
    3. Group similar medical notes etc.
- Keywords Generation. Can be used for SEO.
- Build WordCloud.
- Build Search Engines.
- Build knowledge-graph(aka ontologies).



Topics are generally described by the important words in a text.
- Frequency counts are one way to identify topics.
- TF-IDF can also be used for topic modelling.
- Or the most famous method, LDA (Latent Dirichlet Allocation).
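
As a rough illustration of the TF-IDF option above, the following sketch scores terms in three tiny hand-made documents and picks the highest-scoring term per document. The documents are invented for this example, `tf_idf` is a hypothetical helper, and the formula used (term frequency normalised by document length, multiplied by the natural log of inverse document frequency, without smoothing) is just one common variant.

```python
import math
from collections import Counter

# Tiny, already tokenized documents (invented for this sketch).
docs = [
    ['doctor', 'patient', 'hospital', 'health'],
    ['farm', 'crops', 'wheat', 'farm'],
    ['doctor', 'health', 'wheat'],
]


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


scores = tf_idf(docs)
for doc_scores in scores:
    # Sort terms from most to least characteristic of the document
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    print(ranked[:2])
```

Terms that are frequent in one document but rare elsewhere, such as "farm" in the second document, get the highest scores, which is why they work as rough topic keywords.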

## Latent Dirichlet Allocation

Suppose you have the following set of sentences:

- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.

LDA will try to identify words that are used in similar contexts and will estimate how likely words are to occur together.
In the above example, LDA might create topics like:
    
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals).



## How LDA Works?

A full treatment of LDA requires an understanding of the Bayesian probabilistic approach. However, here is an intuitive explanation of how LDA operates:

Take the example given above:

* Step 1: We start with tokenizing and then removing stopwords
* Step 2: We decide the number of topics, in our case we have decided that number to be 2.
* Step 3: Then we take each token in each document and randomly assign it to a topic
    * For Statement 1 it might look like this:
        * Like: 2
        * Eat: 2
        * Broccoli: 1
        * banana: 2
        * Repeat this process for each statement

* Step 4: Doing this will result in 2 matrices:
    1. Document to Topic probability distribution (S -> T)
    2. Topic to token probability distribution (T -> W)
        * Since these are distributed randomly, they are not accurate
        * Hence, we want to modify both matrices to make them as near to the real distribution as possible
* Step 5: We iterate through each token in each document again and reassign its topic, considering 2 things:
    * How prevalent is that word across topics? P(word W | Topic T)
    * How prevalent are topics in the document? P(Topic T | Document D)
    * Let's consider the word "eat".
    * In statement 1 it is assigned only to topic 2, so at this point it is associated only with topic 2.
    * Statement 1 is made up of 3/4 topic 2 and 1/4 topic 1.
    * This can be interpreted as: the word "eat" is highly specific to topic 2, and topic 2 makes up the majority of statement 1.
    * Hence, "eat" is more likely to belong to topic 2.
    * So, we assign "eat" to topic 2.
* Step 6:
    * Go back to step 4 and recompute both matrices from the new assignments.
    * We repeat this procedure for each token.
    * If we perform this entire procedure again and again, the (S -> T) and (T -> W) matrices become approximately equal to the actual matrices.
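
The steps above can be sketched as a toy sampler in plain Python. Treat this as an illustration of the intuition only: it is a simple collapsed Gibbs sampler, whereas gensim's `LdaModel` uses online variational Bayes. The function name `toy_lda`, the smoothing values `alpha` and `beta`, and the hand-tokenized documents are all assumptions made for this sketch.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics=2, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler following steps 2 to 6 above."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})

    # Step 3: randomly assign every token to a topic
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Step 4: build the document-topic and topic-word count tables
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for word, topic in zip(doc, assignments[d]):
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    # Steps 5 and 6: revisit each token and resample its topic
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                # Take the token out of the counts before scoring topics
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # Weight is proportional to P(word | topic) * P(topic | document)
                weights = [
                    (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Turn the smoothed counts into the (S -> T) and (T -> W) distributions
    doc_topic_prob = [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    topic_word_prob = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + beta * len(vocab))
         for w in vocab}
        for k in range(num_topics)
    ]
    return doc_topic_prob, topic_word_prob


# The five example sentences, tokenized and stop-word filtered by hand.
docs = [
    ['like', 'eat', 'broccoli', 'bananas'],
    ['ate', 'banana', 'spinach', 'smoothie', 'breakfast'],
    ['chinchillas', 'kittens', 'cute'],
    ['sister', 'adopted', 'kitten', 'yesterday'],
    ['look', 'cute', 'hamster', 'munching', 'piece', 'broccoli'],
]

doc_topic_prob, topic_word_prob = toy_lda(docs)
for k, dist in enumerate(topic_word_prob):
    top_words = sorted(dist, key=dist.get, reverse=True)[:4]
    print('Topic', k, top_words)
```

With so little text the resulting topics depend heavily on the random seed, so do not expect a clean "food" versus "cute animals" split on every run.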
    

### Topic Modelling with Gensim

The [gensim](https://radimrehurek.com/gensim/) package in Python implements several topic modelling algorithms.

* We'll walk through a basic application of Topic Modeling with LDA
* We'll also cover the basic NLP operations necessary for the application
    

### Let's fast-forward through pre-processing

* After the processing, we'll have a tokenized, stopped and stemmed list of words for each document
* We loop through all our documents and append each document's list to *texts*
* So *texts* is a list of lists, one list for each of our original documents

In [5]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from pprint import pprint

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = set(stopwords.words('english'))

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [6]:
   
# create sample documents
# We will use author data
import nltk
text=open("../data/C50train/AaronPressman/2537newsML.txt").read()
sents = nltk.sent_tokenize(text)

# compile sample documents into a list
doc_set = sents

# list for tokenized documents in loop
texts = []

In [7]:
# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

print("\n##### texts")
print(texts)

print("\n##### The lines in texts")
for line in texts:
    print(line)


##### texts
[['break', 'u', 'justic', 'depart', 'world', 'wide', 'web', 'site', 'last', 'week', 'highlight', 'internet', 'continu', 'vulner', 'hacker'], ['unidentifi', 'hacker', 'gain', 'access', 'depart', 'web', 'page', 'august', '16', 'replac', 'hate', 'fill', 'diatrib', 'label', 'depart', 'injustic', 'includ', 'swastika', 'pictur', 'adolf', 'hitler'], ['justic', 'offici', 'quickli', 'pull', 'plug', 'vandalis', 'page', 'secur', 'flaw', 'allow', 'hacker', 'gain', 'entri', 'like', 'exist', 'thousand', 'corpor', 'govern', 'web', 'site', 'secur', 'expert', 'said'], ['vast', 'major', 'site', 'vulner', 'said', 'richard', 'power', 'senior', 'analyst', 'comput', 'secur', 'institut'], ['justic', 'depart', 'singl'], ['justic', 'depart', 'offici', 'said', 'compromis', 'web', 'site', 'connect', 'comput', 'contain', 'sensit', 'file'], ['web', 'site', 'http', 'www', 'usdoj', 'gov', 'includ', 'copi', 'press', 'releas', 'speech', 'publicli', 'avail', 'inform'], ['secur', 'breach', 'like', 'graffiti

## What's next?

* To generate an LDA model, we need to understand how frequently each term occurs within each document
* To do that, we need to construct a document-term matrix with a package called *gensim*

# Topic Modeling with gensim
***
## Getting started with gensim

In [8]:
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(175 unique tokens: ['break', 'continu', 'depart', 'hacker', 'highlight']...)


* `Dictionary()` traverses *texts*, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics
* To see each token’s unique integer id, try -

In [9]:
print(dictionary.token2id)

{'break': 0, 'continu': 1, 'depart': 2, 'hacker': 3, 'highlight': 4, 'internet': 5, 'justic': 6, 'last': 7, 'site': 8, 'u': 9, 'vulner': 10, 'web': 11, 'week': 12, 'wide': 13, 'world': 14, '16': 15, 'access': 16, 'adolf': 17, 'august': 18, 'diatrib': 19, 'fill': 20, 'gain': 21, 'hate': 22, 'hitler': 23, 'includ': 24, 'injustic': 25, 'label': 26, 'page': 27, 'pictur': 28, 'replac': 29, 'swastika': 30, 'unidentifi': 31, 'allow': 32, 'corpor': 33, 'entri': 34, 'exist': 35, 'expert': 36, 'flaw': 37, 'govern': 38, 'like': 39, 'offici': 40, 'plug': 41, 'pull': 42, 'quickli': 43, 'said': 44, 'secur': 45, 'thousand': 46, 'vandalis': 47, 'analyst': 48, 'comput': 49, 'institut': 50, 'major': 51, 'power': 52, 'richard': 53, 'senior': 54, 'vast': 55, 'singl': 56, 'compromis': 57, 'connect': 58, 'contain': 59, 'file': 60, 'sensit': 61, 'avail': 62, 'copi': 63, 'gov': 64, 'http': 65, 'inform': 66, 'press': 67, 'publicli': 68, 'releas': 69, 'speech': 70, 'usdoj': 71, 'www': 72, 'bert': 73, 'brandenbu

Next, each tokenized document must be converted into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) using our dictionary -

In [10]:
corpus = [dictionary.doc2bow(text) for text in texts]
for line in corpus:
    print(line)


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(2, 2), (3, 1), (11, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(3, 1), (6, 1), (8, 1), (11, 1), (21, 1), (27, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1)]
[(8, 1), (10, 1), (44, 1), (45, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)]
[(2, 1), (6, 1), (56, 1)]
[(2, 1), (6, 1), (8, 1), (11, 1), (40, 1), (44, 1), (49, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1)]
[(8, 1), (11, 1), (24, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1)]
[(39, 1), (44, 1), (45, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1)]
[(80, 1), (81, 1), (82, 1)]
[(7, 1), (8, 1

* The doc2bow() function converts a tokenized document into a bag-of-words, using the dictionary to look up term ids
* The result, *corpus*, is a list of vectors, one for each document
* Each document vector is a series of tuples
* The tuples are (term ID, term frequency) pairs
* Only terms that actually occur are included - terms that do not occur in a document will not appear in that document’s vector
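
To make the (term ID, term frequency) structure concrete, here is a minimal pure-Python sketch that mimics what the dictionary and doc2bow() steps produce. It is not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the id ordering (alphabetical within each new document) is chosen only to resemble the output shown above.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        # Sort new tokens within a document so the ids are reproducible
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (term id, frequency) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two short stemmed documents in the style of *texts* above (shortened by hand).
sample_texts = [
    ['justic', 'depart', 'singl'],
    ['justic', 'depart', 'offici', 'depart'],
]

token2id = build_token2id(sample_texts)
sample_corpus = [to_bow(tokens, token2id) for tokens in sample_texts]
print(token2id)
print(sample_corpus)
```

Note how "depart" appears twice in the second document and therefore gets a frequency of 2, while tokens missing from a document simply have no tuple.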

## Creating the LDA Model

*corpus* is a (sparse) document-term matrix and now we’re ready to generate an LDA model

In [11]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

## Parameters to the LDA model

https://radimrehurek.com/gensim/models/ldamodel.html
* num_topics
    - optional in gensim (it has a default value), but you will almost always want to set it
    - An LDA model requires the user to determine how many topics should be generated
    - Our document set is small, so we’re only asking for three topics
* id2word
    - optional, but needed to display topics as words rather than ids
    - The LdaModel class uses our previous dictionary to map ids to strings
* passes
    - optional
    - The number of laps the model will take through corpus
    - More passes generally give the model more chances to converge, so the topics tend to be more stable
    - A lot of passes can be slow on a very large corpus.

In [12]:
print(ldamodel)

LdaModel(num_terms=175, num_topics=3, decay=0.5, chunksize=2000)


In [13]:
print(ldamodel.print_topics())

[(0, '0.031*"secur" + 0.024*"depart" + 0.017*"magazin" + 0.017*"gain" + 0.017*"access" + 0.017*"hole" + 0.017*"spokeswoman" + 0.017*"breach" + 0.017*"measur" + 0.017*"said"'), (1, '0.033*"hacker" + 0.025*"site" + 0.018*"internet" + 0.018*"inform" + 0.018*"break" + 0.018*"u" + 0.010*"justic" + 0.010*"take" + 0.010*"fidel" + 0.010*"elgan"'), (2, '0.049*"site" + 0.043*"web" + 0.037*"said" + 0.031*"secur" + 0.019*"hacker" + 0.014*"justic" + 0.014*"flaw" + 0.014*"offici" + 0.014*"corpor" + 0.014*"major"')]


In [14]:
for topic in ldamodel.print_topics(num_topics=2):
    print(topic)

(1, '0.033*"hacker" + 0.025*"site" + 0.018*"internet" + 0.018*"inform" + 0.018*"break" + 0.018*"u" + 0.010*"justic" + 0.010*"take" + 0.010*"fidel" + 0.010*"elgan"')
(2, '0.049*"site" + 0.043*"web" + 0.037*"said" + 0.031*"secur" + 0.019*"hacker" + 0.014*"justic" + 0.014*"flaw" + 0.014*"offici" + 0.014*"corpor" + 0.014*"major"')


In [15]:
for topic in ldamodel.print_topics(num_topics=3, num_words=3):
    print(topic)

(0, '0.031*"secur" + 0.024*"depart" + 0.017*"magazin"')
(1, '0.033*"hacker" + 0.025*"site" + 0.018*"internet"')
(2, '0.049*"site" + 0.043*"web" + 0.037*"said"')


* Within each topic are the three most probable words to appear in that topic

## Topics in detail
Let's now look at a topic in detail. Let us see how distinct the topics are, and if they seem to capture any context.

In [16]:
print(ldamodel.print_topic(topicno=0))

0.031*"secur" + 0.024*"depart" + 0.017*"magazin" + 0.017*"gain" + 0.017*"access" + 0.017*"hole" + 0.017*"spokeswoman" + 0.017*"breach" + 0.017*"measur" + 0.017*"said"


In [17]:
print(ldamodel.print_topic(topicno=1))

0.033*"hacker" + 0.025*"site" + 0.018*"internet" + 0.018*"inform" + 0.018*"break" + 0.018*"u" + 0.010*"justic" + 0.010*"take" + 0.010*"fidel" + 0.010*"elgan"


In [18]:
print(ldamodel.print_topic(topicno=2))

0.049*"site" + 0.043*"web" + 0.037*"said" + 0.031*"secur" + 0.019*"hacker" + 0.014*"justic" + 0.014*"flaw" + 0.014*"offici" + 0.014*"corpor" + 0.014*"major"


## Do the topics make sense?

In [19]:
for topic in ldamodel.print_topics(num_topics=3, num_words=3):
    print(topic)

(0, '0.031*"secur" + 0.024*"depart" + 0.017*"magazin"')
(1, '0.033*"hacker" + 0.025*"site" + 0.018*"internet"')
(2, '0.049*"site" + 0.043*"web" + 0.037*"said"')


## Refining the model

Two topics seem like a better fit for our documents!

In [20]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)

for topic in ldamodel.print_topics(num_topics=2, num_words=4):
    print(topic)

(0, '0.034*"secur" + 0.031*"said" + 0.030*"site" + 0.029*"hacker"')
(1, '0.029*"site" + 0.025*"web" + 0.016*"inform" + 0.016*"comput"')


Let's try it with more passes:

In [21]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=200)

for topic in ldamodel.print_topics(num_topics=2, num_words=4):
    print(topic)

(0, '0.039*"said" + 0.035*"site" + 0.031*"web" + 0.027*"hacker"')
(1, '0.023*"secur" + 0.023*"site" + 0.017*"inform" + 0.013*"depart"')


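## Predicting Topic for new documents

Note that the cells in this section build a new dictionary and fit a new LDA model on the two new sentences; they do not score these sentences against the model trained above.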

In [22]:
doc_set = ["Are Health professionals justified in saying that brocolli is good for your health?",
           "Broccoli contains various bioactive compounds that have been shown to reduce inflammation in your body’s tissues"]

# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
corpus_2 = [dictionary.doc2bow(text) for text in texts]

In [23]:
import gensim
ldamodel = gensim.models.ldamodel.LdaModel(corpus_2, num_topics=3, id2word = dictionary, passes=20)

In [24]:
# Let's check the default LDA parameters
print(ldamodel)

LdaModel(num_terms=16, num_topics=3, decay=0.5, chunksize=2000)


In [25]:
#Let's print out the topics
for topic in ldamodel.print_topics(num_topics=3, num_words=3):
    print(topic)

(0, '0.189*"health" + 0.108*"say" + 0.108*"good"')
(1, '0.063*"contain" + 0.063*"bodi" + 0.063*"variou"')
(2, '0.087*"shown" + 0.087*"tissu" + 0.087*"bioactiv"')


In [26]:
import nltk

Text = open('../data/C50train/JoeOrtiz/242939newsML.txt').read()
Text

"After a five year struggle, creditors of the collapsed, fraud-ridden BCCI will receive a payment of $2.65 billion on Tuesday, equal to 24.5 percent of their claims, a spokesman for the liquidators said on Monday.\nBank of Credit and Commerce International, founded in 1972, was closed by central banks in 1991 and collapsed with debts of more than $12 billion when evidence of massive fraud and money laundering was unearthed leading to a tangled web of litigation which shows no sign of reaching an early conclusion.\nBCCI had assets of $24 billion and operations in 71 countries at the time of its collapse.\nLiquidator Deloitte and Touche said a further payment, reportedly of 10 percent, of the admitted claims which total some $10.5 billion should be made in the next 12 to 16 months.\nThe gross fund of amounts recovered by the liquidators stands at around $4.0 billion and includes $1.5 billion paid by BCCI's majority shareholder, the government of Abu Dhabi, which will pay a further $250 m

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 1 
***
Perform sentence tokenization on the above data using `sent_tokenize()` and store the result in a variable called '**Sent**'

In [27]:
import nltk

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 2 
***
- Iterate over every Sentence in the list **Sent**  using a for loop and convert every sentence into 
    - lower case 
    - and then tokenize it using the instantiated object 
- Now remove the stopwords from the tokens 
- Lemmatize them using `WordNetLemmatizer().lemmatize()` 
- Finally append them into the list called **Texts**

In [28]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

tokenizer = RegexpTokenizer(r'\w+')

en_stop = set(stopwords.words('english'))

Lema = WordNetLemmatizer()

Texts = []

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 3 
***
Use the `Dictionary` class inside the module `corpora` to assign a unique integer id to every token, and print out the assigned ids using the `.token2id` attribute

In [29]:
from gensim import corpora, models



<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 4 
***
Now convert each tokenized sentence into a bag-of-words using the `.doc2bow()` method of `dictionary` and store the resulting list in a variable **corpus**

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 5 
***
Create an LDA model with a number of topics and a number of passes of your choice. Now print out the top 5 topics and the top 3 words in every topic