**Why do we need topic modeling?**

So why do we need topic modeling? A huge amount of text surrounds us in unstructured form: news articles, research papers, social media posts, and so on. We need a way to understand, organize, and label this data in order to make informed decisions. Topic modeling is used in applications such as finding similar questions on Stack Overflow, aggregating and analyzing news flows, and building recommender systems. All of these focus on finding the hidden thematic structure in text, on the assumption that every text we write, be it a tweet, a post, or a research paper, is composed of themes such as sports, physics, or aerospace.

**How to do topic modeling?**

There are many ways to do topic modeling, but we will discuss a probabilistic approach called Latent Dirichlet Allocation (LDA), introduced by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003. It extends Probabilistic Latent Semantic Analysis (PLSA), developed in 1999 by Thomas Hofmann; the main difference is how the per-document topic distribution is treated, since LDA places a Dirichlet prior on it. So let’s jump straight into how LDA works.

### Latent Dirichlet Allocation
Latent: This refers to everything that we don’t know a priori and that is hidden in the data. Here, the themes or topics that a document consists of are unknown, but they are believed to be present because the text is generated from those topics.

Dirichlet: It is a ‘distribution of distributions’. Yes, you read that right. But what does this mean? Let’s work through an example. Suppose there is a machine that produces dice, and we can control whether the machine always produces a die with equal weight on all sides or a die biased toward some sides. The machine is a distribution, because it produces dice of different types. Each die is itself a distribution, because rolling it yields different values. This is what it means to be a distribution of distributions, and this is what the Dirichlet is. In the context of topic modeling, Dirichlet distributions are used for the distribution of topics in each document and the distribution of words in each topic. This might not be very clear yet, but that’s fine, as we will look at it in more detail shortly.
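
To make the dice machine concrete, here is a minimal sketch using only Python's standard library. It relies on the standard fact that normalizing independent Gamma draws gives a Dirichlet sample. The concentration values (50.0 and 0.1) and the helper name `sample_dirichlet` are illustrative assumptions, not part of any library.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Large, equal concentration values: the machine makes nearly fair dice.
fair_machine = [50.0] * 6
# Small concentration values: the machine makes dice that favor a few sides.
biased_machine = [0.1] * 6

for name, alphas in [("fair", fair_machine), ("biased", biased_machine)]:
    die = sample_dirichlet(alphas, rng)              # one die = one distribution over six sides
    roll = rng.choices(range(1, 7), weights=die)[0]  # rolling that die once
    print(name, [round(p, 3) for p in die], "rolled", roll)
```

Each call to `sample_dirichlet` plays the role of the machine producing one die; `rng.choices` plays the role of rolling it.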

Allocation: This means that once we have the Dirichlet, we allocate topics to the documents and the words of each document to topics.

What LDA essentially says is that each word in each document comes from a topic, and that topic is selected from a per-document distribution over topics.

We can therefore say that the probability of a word given a document, i.e. P(w|d), is equal to

![alt text](https://miro.medium.com/max/374/1*eOvFvK9ouCe-GNMt6IaVLA.jpeg)


where T is the total number of topics. Also, let’s assume that there are W words in the vocabulary across all the documents.

If we assume conditional independence, we can say that
P(w|t,d) = P(w|t)

And hence P(w|d) is equal to

![alt text](https://miro.medium.com/max/400/1*jZDFV8seUaX7xStjawIWdA.png)

![alt text](https://miro.medium.com/max/624/1*QiTvyHNwvGI5UCqeKvhNsg.png)

So, looking at this, we can think of LDA as similar to matrix factorization or SVD: we decompose the matrix of word probabilities per document into two matrices, one holding the distribution of topics in each document and one holding the distribution of words in each topic.

![alt text](https://miro.medium.com/max/624/1*mnehwmSdd0w1c6pfAC4LCw.png)
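
The factorization can be checked by hand on a tiny example. The sketch below uses made-up numbers (two topics, three words, two documents; all values are hypothetical) and computes P(w|d) as the sum over topics of P(w|t) · P(t|d).

```python
# Hypothetical toy numbers: 2 topics, 3 words, 2 documents.
words = ["ball", "goal", "orbit"]

# P(w | t): one row per topic, each row sums to 1.
word_given_topic = [
    [0.5, 0.4, 0.1],   # topic 0 (a "sports"-like topic)
    [0.1, 0.1, 0.8],   # topic 1 (a "space"-like topic)
]

# P(t | d): one row per document, each row sums to 1.
topic_given_doc = [
    [0.9, 0.1],        # document 0 is mostly about topic 0
    [0.2, 0.8],        # document 1 is mostly about topic 1
]

def word_given_doc(doc_row):
    """P(w | d) = sum over topics t of P(w | t) * P(t | d)."""
    return [
        sum(p_t * word_given_topic[t][w] for t, p_t in enumerate(doc_row))
        for w in range(len(words))
    ]

for d, doc_row in enumerate(topic_given_doc):
    probs = word_given_doc(doc_row)
    print(f"document {d}:", dict(zip(words, (round(p, 3) for p in probs))))
```

Because both input tables are row-normalized, each resulting row of P(w|d) also sums to one.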

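LDA is not limited to documents and words. More generally, it models composites (such as documents) made up of parts (such as words), so one could apply LDA to DNA and nucleotides, pizzas and toppings, molecules and atoms, employees and skills, or keyboards and crumbs.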

The probabilistic topic model estimated by LDA consists of two tables (matrices). The first table describes the probability or chance of selecting a particular part when sampling a particular topic (category).

The second table describes the chance of selecting a particular topic when sampling a particular document or composite.


Let’s take an example: *I suddenly have a taste for bacon avocado toast.*

![alt text](https://miro.medium.com/max/1389/1*tmmF-dCMjvASOrGuwxJN9w.png)

![alt text](https://miro.medium.com/max/1394/1*f7ODdUPZtkcWUNcT0CJDzw.png)

The left table shows ‘emoji-versus-topics’, and the right table shows ‘documents-versus-topics’. Each column in the left table and each row in the right table sums to one (allowing for some truncation and precision loss).

So if we were to sample Topic 0 (draw an emoji out of a bag), we’d almost certainly get the avocado emoji. If we sampled Document 3, there’s an equal (or ‘uniform’) probability that we’d get Topic 0, 1, or 2.

The LDA algorithm assumes your composites were generated like so:
1. Pick your unique set of parts.
2. Pick how many composites you want.
3. Pick how many parts you want per composite (sample from a Poisson distribution).
4. Pick how many topics (categories) you want.
5. Pick a number greater than zero and call it alpha.
6. Pick a number greater than zero and call it beta.
7. Build the ‘parts-versus-topics’ table. For each column, draw a sample (spin the wheel) from a Dirichlet distribution (which is a distribution of distributions) using beta as the input. Each sample will fill out each column in the table, sum to one, and give the probability of each part per topic (column).
8. Build the ‘composites-versus-topics’ table. For each row, draw a sample from a Dirichlet distribution using alpha as the input. Each sample will fill out each row in the table, sum to one, and give the probability of each topic (column) per composite.
9. Build the actual composites. For each composite, 1) look up its row in the ‘composites-versus-topics’ table, 2) sample a topic based on the probabilities in the row, 3) go to the ‘parts-versus-topics’ table, 4) look up the topic sampled, 5) sample a part based on the probabilities in the column, 6) repeat from step 2 until you’ve reached how many parts this composite was set to have.
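
The nine steps above can be sketched in a few lines of standard-library Python. Everything concrete here is an assumption made for illustration: the four parts, the sizes, the hyperparameter values, and the helper names `sample_dirichlet` and `sample_poisson`. The Poisson sampler uses Knuth's simple multiplication method, and composites are forced to contain at least one part.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def sample_dirichlet(concentration, size):
    """One draw from a symmetric Dirichlet, via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(mean):
    """Knuth's multiplication method; adequate for small means."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

# Steps 1-6: hypothetical choices for parts, sizes, and hyperparameters.
parts = ["cat", "dog", "fish", "bird"]
num_composites, mean_parts, num_topics = 3, 5, 2
alpha, beta = 0.5, 0.1

# Step 7: one distribution over parts per topic.
parts_vs_topics = [sample_dirichlet(beta, len(parts)) for _ in range(num_topics)]
# Step 8: one distribution over topics per composite.
composites_vs_topics = [sample_dirichlet(alpha, num_topics) for _ in range(num_composites)]

# Step 9: build each composite one part at a time.
composites = []
for topic_probs in composites_vs_topics:
    length = max(1, sample_poisson(mean_parts))  # assumption: avoid empty composites
    composite = []
    for _ in range(length):
        topic = rng.choices(range(num_topics), weights=topic_probs)[0]
        composite.append(rng.choices(parts, weights=parts_vs_topics[topic])[0])
    composites.append(composite)

for composite in composites:
    print(composite)
```

Fitting LDA runs this story in reverse: given only the composites, it tries to recover the two tables.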


Now we know this algorithm (or generative procedure/process) is not how documents (such as articles) are written, but this — for better or worse — is the simplified model LDA assumes.

### Why use LDA?

If you view the number of topics as a number of clusters and the probabilities as the proportion of cluster membership, then using LDA is a way of soft-clustering your composites and parts.
Contrast this with say, k-means, where each entity can only belong to one cluster (hard-clustering). LDA allows for ‘fuzzy’ memberships. This provides a more nuanced way of recommending similar items, finding duplicates, or discovering user profiles/personas.
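
To make the soft-versus-hard contrast concrete, here is a small sketch with made-up topic proportions (the document names and numbers are hypothetical). The soft view keeps the whole row; the hard view keeps only the index of the largest entry, which discards the fact that `doc_b` is split almost evenly.

```python
# Hypothetical topic proportions for three documents over three topics.
topic_proportions = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.34, 0.33, 0.33],
    "doc_c": [0.05, 0.05, 0.90],
}

hard_labels = {}
for name, proportions in topic_proportions.items():
    # Hard clustering: keep only the single most probable topic.
    hard_labels[name] = max(range(len(proportions)), key=proportions.__getitem__)
    print(name, "soft:", proportions, "hard:", hard_labels[name])
```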

If you choose the number of topics to be smaller than the number of documents, using LDA is a way of reducing the dimensionality (the number of rows and columns) of the original composite-versus-part data set.
With the documents now mapped to a lower dimensional latent/hidden topic/category space, you can now apply other machine learning algorithms which will benefit from the smaller number of dimensions. For example, you could run your documents through LDA, and then hard-cluster them using DBSCAN.

Of course, the main reason you’d use LDA is to uncover the themes lurking in your data. By using LDA on pizza orders, you might infer pizza topping themes like ‘spicy’, ‘salty’, ‘savory’, and ‘sweet’.
You could analyze every GitHub repository’s topics/tags, and infer themes like ‘native desktop client’, ‘back-end web service’, ‘single paged app’, or ‘flappy bird clone’.

### How does LDA work?

There are a few ways of implementing LDA. Still, like most — if not all — machine learning algorithms, it comes down to estimating one or more parameters.
For LDA, those parameters are phi and theta (although sometimes they’re called something else). Phi is the ‘parts-versus-topics’ matrix, and theta is the ‘composites-versus-topics’ matrix.

![alt text](https://miro.medium.com/max/1367/1*szc1yRtfebCezMRsN4UWaA.png)

*Are you a cat person or a dog person?*


The documents and emojis are shown in the image above.
Our hyperparameters are:

alpha = 0.5

beta = 0.01

‘topics’ = 2

‘iterations’ = 1.

To start, we need to randomly assign a topic to each emoji. Using a fair coin (sampling from a uniform distribution), we randomly assign to the first cat emoji ‘Topic 0’, the second cat ‘Topic 1’, the first dog emoji ‘Topic 1’, and the second dog ‘Topic 0’.

|         | Cat 0 | Cat 1 | Dog 0 | Dog 1 |
|---------|-------|-------|-------|-------|
| Topic 0 | *     |       |       | *     |
| Topic 1 |       | *     | *     |       |


This is our current topic assignment per each emoji.

|     | Topic 0 | Topic 1 |
|-----|---------|---------|
| Cat | 1       | 1       |
| Dog | 1       | 1       |

This is our current emoji versus topic counts.

|            | Topic 0 | Topic 1 |
|------------|---------|---------|
| Document 0 | 1       | 1       |
| Document 1 | 1       | 1       |

This is our current document versus topic counts.

Now we need to update the topic assignment for the first cat. We subtract one from the emoji versus topic counts for Cat 0, subtract one from the document versus topic counts for Cat 0, calculate the probability of Topic 0 and 1 for Cat 0, flip a biased coin (sample from a categorical distribution), and then update the assignment and counts.

$$
t_0 = \frac{\text{cat emoji with Topic 0} + \beta}{\text{emoji with Topic 0} + \text{unique emoji} \times \beta} \times \frac{\text{emoji in Document 0 with Topic 0} + \alpha}{\text{emoji in Document 0 with a topic} + \text{number of topics} \times \alpha}
$$

$$
t_0 = \frac{0 + 0.01}{1 + 2 \times 0.01} \times \frac{0 + 0.5}{1 + 2 \times 0.5} = 0.0024509803921568627
$$

$$
t_1 = \frac{1 + 0.01}{2 + 2 \times 0.01} \times \frac{1 + 0.5}{1 + 2 \times 0.5} = 0.375
$$

$$
p(\text{Cat 0} = \text{Topic 0} \mid *) = \frac{t_0}{t_0 + t_1} = 0.006493506493506494
$$

$$
p(\text{Cat 0} = \text{Topic 1} \mid *) = \frac{t_1}{t_0 + t_1} = 0.9935064935064936
$$
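
The arithmetic for Cat 0 can be reproduced with a short sketch. The counts below are the ones from the tables above after subtracting Cat 0's current assignment (Topic 0); the function name `weight` and the dictionary layout are illustrative choices, not part of any library.

```python
alpha, beta = 0.5, 0.01
num_topics, num_unique_emoji = 2, 2

# Counts after removing Cat 0's current assignment (Topic 0).
emoji_topic = {"cat": [0, 1], "dog": [1, 1]}   # emoji-versus-topic counts
doc0_topic = [0, 1]                             # Document 0 versus topic counts

def weight(emoji, topic):
    """Unnormalized probability of assigning `topic` to this emoji in Document 0."""
    topic_total = sum(counts[topic] for counts in emoji_topic.values())
    word_term = (emoji_topic[emoji][topic] + beta) / (topic_total + num_unique_emoji * beta)
    doc_term = (doc0_topic[topic] + alpha) / (sum(doc0_topic) + num_topics * alpha)
    return word_term * doc_term

t0, t1 = weight("cat", 0), weight("cat", 1)
p0, p1 = t0 / (t0 + t1), t1 / (t0 + t1)
print("t0, t1:", t0, t1)
print("p(Topic 0), p(Topic 1):", p0, p1)
```

The biased coin is then a single categorical draw with probabilities `p0` and `p1`.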


After flipping the biased coin, we surprisingly get Topic 0 again for Cat 0, so our tables remain the same as they were before we started updating Cat 0.
Next we do for Cat 1 what we did for Cat 0. After flipping the biased coin, we get Topic 0, so now our tables look like so.

|         | Cat 0 | Cat 1 | Dog 0 | Dog 1 |
|---------|-------|-------|-------|-------|
| Topic 0 | *     | *     |       | *     |
| Topic 1 |       |       | *     |       |

This is our current topic assignment per each emoji.

|     | Topic 0 | Topic 1 |
|-----|---------|---------|
| Cat | 2       | 0       |
| Dog | 1       | 1       |

This is our current emoji versus topic counts.

|            | Topic 0 | Topic 1 |
|------------|---------|---------|
| Document 0 | 2       | 0       |
| Document 1 | 1       | 1       |

This is our current document versus topic counts.

What we did for the two cat emoji we now do for the dog emoji. After flipping the biased coins, we end up assigning Topic 1 to Dog 0 and Topic 1 to Dog 1.

|         | Cat 0 | Cat 1 | Dog 0 | Dog 1 |
|---------|-------|-------|-------|-------|
| Topic 0 | *     | *     |       |       |
| Topic 1 |       |       | *     | *     |

This is our current topic assignment per each emoji.

|     | Topic 0 | Topic 1 |
|-----|---------|---------|
| Cat | 2       | 0       |
| Dog | 0       | 2       |

This is our current emoji versus topic counts.

|            | Topic 0 | Topic 1 |
|------------|---------|---------|
| Document 0 | 2       | 0       |
| Document 1 | 0       | 2       |

This is our current document versus topic counts.

To estimate phi, we use the following equation for each row-column cell in the ‘emoji-versus-topic’ count matrix.

$$
\phi_{\text{row},\,\text{column}} = \frac{\text{emoji row with topic column} + \beta}{\text{all emoji with topic column} + \text{unique emoji} \times \beta}
$$

And for estimating theta, we use the following equation for each row-column cell in the document versus topic count matrix.

$$
\theta_{\text{row},\,\text{column}} = \frac{\text{emoji in document row with topic column} + \alpha}{\text{emoji in document row} + \text{number of topics} \times \alpha}
$$

![alt text](https://miro.medium.com/max/1350/1*7ompnTE6eiH_3CitGvtveQ.png)

The matrix on the left shows us that the cat emoji is much, much more likely to represent Topic 0 than Topic 1, and that the dog emoji is much more likely to represent Topic 1 than Topic 0.

The matrix on the right shows us that Document 0 is more likely to be about Topic 0 than Topic 1, and that Document 1 is more likely to be about Topic 1.
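
As a check on the phi and theta equations, the sketch below plugs in the final count tables from this walkthrough together with alpha = 0.5 and beta = 0.01. The variable names are illustrative; the values it prints should show the same pattern described here, with cats tied to Topic 0 and dogs to Topic 1.

```python
alpha, beta = 0.5, 0.01
num_topics = 2

# Final count tables from the walkthrough.
emoji_topic_counts = {"cat": [2, 0], "dog": [0, 2]}
doc_topic_counts = {"Document 0": [2, 0], "Document 1": [0, 2]}
num_unique_emoji = len(emoji_topic_counts)

# Total number of emoji assigned to each topic.
topic_totals = [sum(c[t] for c in emoji_topic_counts.values()) for t in range(num_topics)]

phi = {
    emoji: [(counts[t] + beta) / (topic_totals[t] + num_unique_emoji * beta)
            for t in range(num_topics)]
    for emoji, counts in emoji_topic_counts.items()
}
theta = {
    doc: [(counts[t] + alpha) / (sum(counts) + num_topics * alpha)
          for t in range(num_topics)]
    for doc, counts in doc_topic_counts.items()
}

for emoji, row in phi.items():
    print("phi", emoji, [round(p, 4) for p in row])
for doc, row in theta.items():
    print("theta", doc, [round(p, 4) for p in row])
```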

This is how LDA can be used to classify documents (composites) and words/n-grams (parts) into topics.

In [0]:
import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
! pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)

In [0]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

A topic is nothing but a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can identify what the topic is all about.

The following are key factors in obtaining well-separated topics:

1. The quality of text processing.
2. The variety of topics the text talks about.
3. The choice of topic modeling algorithm.
4. The number of topics fed to the algorithm.
5. The algorithm’s tuning parameters.

In [0]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [0]:
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']


|      | content | target | target_names |
|------|---------|--------|--------------|
| 0    | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1    | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 10   | From: irwin@cmptrc.lonestar.org (Irwin Arnstei... | 8 | rec.motorcycles |
| 100  | From: tchen@magnus.acs.ohio-state.edu (Tsung-K... | 6 | misc.forsale |
| 1000 | From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\n... | 2 | comp.os.ms-windows.misc |


In [0]:
# Convert to list
data = df.content.values.tolist()

# Remove email addresses (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and other runs of whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

print(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- ']


In [0]:
print(data[9:10])

['From: (Jon Livesey) Subject: Re: Genocide is Caused by Atheism Organization: sgi Lines: 38 Distribution: world NNTP-Posting-Host: solntze.wpd.sgi.com In article (Frank ODwyer) writes: |> In article (Jon Livesey) writes: |> #In article (Frank ODwyer) writes: |> #|> In article (Jon Livesey) writes: |> #|> |> #|> I forget the origin of the quote, but "I gotta use words when I talk to |> #|> you". An atheist is one who lacks belief in gods, yes? If so, then |> #|> its entirely plausible that an atheist could dig Lenin or Lennon to |> #|> such an extent that it might be considered "worship", and still be |> #|> an atheist. Anything else seems to be Newspeak. |> # |> #Ask yourself the following question. Would you regard an ardent |> #Nazi as a republican, simply because Germany no longer had a Kaiser? |> |> No, because thats based on false dichotomy. There are more options |> than you present me. And that, of course, is the point. You cant simply divide the world into atheists and non-ath

After removing the emails and extra spaces, the text still looks messy. It is not ready for the LDA to consume. You need to break down each sentence into a list of words through tokenization, while clearing up all the messy text in the process.

In [0]:
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; punctuation is already dropped by the tokenizer
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:5])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst'], ['from', 'guy', 'kuo', 'subje

Gensim’s Phrases model can detect and build bigrams, trigrams, quadgrams, and more. The two important arguments to Phrases are min_count and threshold. The higher these values, the harder it is for words to be combined into bigrams.
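
To see why higher values make phrases rarer, here is a rough standard-library sketch of the scoring idea. It assumes the default scorer follows the formula from Mikolov et al., (bigram count − min_count) × vocabulary size / (count of first word × count of second word), and keeps a pair only when its score exceeds the threshold. The function `bigram_scores` and the toy sentences are hypothetical, and gensim's real implementation handles more details than this.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score adjacent word pairs; a simplified stand-in for a phrase scorer."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
        if n >= min_count  # pairs seen fewer than min_count times are ignored
    }

toy_sentences = [
    ["front", "bumper", "was", "bent"],
    ["the", "front", "bumper", "fell", "off"],
    ["the", "car", "was", "red"],
]

scores = bigram_scores(toy_sentences, min_count=1)
threshold = 1.0
phrases = [pair for pair, score in scores.items() if score > threshold]
print(phrases)
```

Raising `min_count` shrinks every numerator, and raising `threshold` raises the bar a score must clear, so both make merges less likely.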

In [0]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])



['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']


In [0]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

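def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    Relies on the global spaCy pipeline `nlp`, which is loaded in the next cell.
    Tag reference: https://spacy.io/api/annotation
    """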
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [0]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

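# Initialize the spaCy English model, disabling the parser and NER (for efficiency).
# Newer spaCy versions no longer accept the 'en' shortcut, so load the model by its full name.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])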

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['where', 's', 'thing', 'car', 'nntp_poste', 'host', 'umd', 'organization', 'university', 'maryland_college', 'park', 'line', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [0]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:11])

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 5), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1)], [(5, 2), (8, 2), (16, 1), (21, 1), (29, 1), (30, 1), (42, 1), (45, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 5), (61, 1), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 2), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 3), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 3), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1)], [(5, 1), (21, 2), (26, 1)

In [0]:
# Human-readable form of the first document: (word, frequency) pairs
[[(id2word[word_id], freq) for word_id, freq in doc] for doc in corpus[:1]]

[[('addition', 1),
  ('anyone', 2),
  ('body', 1),
  ('bricklin', 1),
  ('bring', 1),
  ('call', 1),
  ('car', 5),
  ('could', 1),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('front_bumper', 1),
  ('funky', 1),
  ('history', 1),
  ('host', 1),
  ('info', 1),
  ('know', 1),
  ('late', 1),
  ('lerxst', 1),
  ('line', 1),
  ('look', 2),
  ('mail', 1),
  ('make', 1),
  ('maryland_college', 1),
  ('model', 1),
  ('name', 1),
  ('neighborhood', 1),
  ('nntp_poste', 1),
  ('organization', 1),
  ('park', 1),
  ('production', 1),
  ('really', 1),
  ('rest', 1),
  ('s', 1),
  ('see', 1),
  ('separate', 1),
  ('small', 1),
  ('specs', 1),
  ('sport', 1),
  ('tellme', 1),
  ('thank', 1),
  ('thing', 1),
  ('umd', 1),
  ('university', 1),
  ('where', 1),
  ('wonder', 1),
  ('year', 1)]]
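
To show what the dictionary and the bag-of-words corpus hold, here is a library-free sketch of the same idea. The helper names and the toy documents are hypothetical, and ids are assigned in first-seen order, which is an assumption of this sketch and need not match the ids gensim assigns.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_to_id = {}
    for text in texts:
        for token in text:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(text, token_to_id):
    """Turn one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(token_to_id[token] for token in text if token in token_to_id)
    return sorted(counts.items())

toy_texts = [["car", "door", "car", "engine"], ["engine", "key", "bit"]]
toy_dictionary = build_dictionary(toy_texts)
toy_corpus = [to_bag_of_words(text, toy_dictionary) for text in toy_texts]

print(toy_dictionary)
print(toy_corpus)
```

Word order is discarded: each document is reduced to which words occur and how often, which is all LDA uses.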

In [0]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)


The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics().

In [0]:
# Print the 10 most significant keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.800*"ax" + 0.056*"max" + 0.003*"proceed" + 0.002*"pool" + '
  '0.002*"microsoft" + 0.002*"lee" + 0.002*"qax" + 0.002*"converter" + '
  '0.002*"qq" + 0.001*"toolkit"'),
 (1,
  '0.017*"god" + 0.014*"christian" + 0.012*"say" + 0.010*"people" + '
  '0.010*"believe" + 0.009*"life" + 0.008*"man" + 0.008*"claim" + 0.007*"word" '
  '+ 0.006*"exist"'),
 (2,
  '0.037*"armenian" + 0.010*"turkish" + 0.010*"turk" + 0.010*"genocide" + '
  '0.009*"kill" + 0.009*"serdar_argic" + 0.008*"child" + 0.007*"people" + '
  '0.007*"greek" + 0.007*"yalanci"'),
 (3,
  '0.019*"israel" + 0.014*"israeli" + 0.011*"center" + 0.010*"jew" + '
  '0.009*"arab" + 0.008*"committee" + 0.008*"march" + 0.008*"member" + '
  '0.007*"mcgill" + 0.007*"soldier"'),
 (4,
  '0.025*"key" + 0.012*"bit" + 0.009*"message" + 0.008*"wiretap" + '
  '0.007*"specifically" + 0.006*"scsi" + 0.006*"clipper" + 0.006*"dream" + '
  '0.006*"algorithm" + 0.006*"punishment"'),
 (5,
  '0.044*"space" + 0.023*"president" + 0.010*"launch" + 0.00

How to interpret this?

For example, a topic might be represented as `0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn"`.

This means the top 10 keywords that contribute to this topic are ‘car’, ‘power’, ‘light’, and so on, and the weight of ‘car’ in this topic is 0.016.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? You may summarise it as either ‘cars’ or ‘automobiles’.
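
If you want the keywords and weights as data rather than as a formatted string, one option is to parse the string with a regular expression. This is only a sketch using the standard library: the pattern assumes the `weight*"word"` layout shown in the printed topics, and the variable names are illustrative.

```python
import re

topic_string = ('0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + '
                '0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + '
                '0.007*"back" + 0.006*"turn"')

# Each match is (weight, word); convert to (word, float weight) pairs.
keyword_weights = [(word, float(weight))
                   for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]
top_words = [word for word, _ in keyword_weights]

print(keyword_weights[:3])
print(top_words)
```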

![alt text](https://www.machinelearningplus.com/wp-content/uploads/2018/03/Inferring-Topic-from-Keywords.png)

In [0]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis


### Part of Speech (POS) Tagging

![alt text](https://cdn-media-1.freecodecamp.org/images/1*f6e0uf5PX17pTceYU4rbCA.jpeg)

From a very small age, we have been made accustomed to identifying part of speech tags. For example, reading a sentence and being able to identify what words act as nouns, pronouns, verbs, adverbs, and so on. All these are referred to as the part of speech tags.

Let’s look at the Wikipedia definition for them:

*In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.*

Identifying part of speech tags is much more complicated than simply mapping words to tags, because the mapping is not fixed: a single word can have different part of speech tags in different sentences, depending on context. That is why a static word-to-tag lookup cannot work in general.

Use cases for POS Tagging:
1. Text to Speech Conversion

*They refuse to permit us to obtain the refuse permit.*

The word refuse is used twice in this sentence, with two different meanings. refUSE (/rəˈfyo͞oz/) is a verb meaning “deny,” while REFuse (/ˈrefˌyo͞os/) is a noun meaning “trash” (that is, they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS tagging.)

2. Word Sense Disambiguation

Words often occur in different senses as different parts of speech. For example:

* She saw a bear.
* Your efforts will bear fruit.


The word bear in the above sentences has completely different senses, but more importantly, one is a noun and the other is a verb. Rudimentary word sense disambiguation is possible if you can tag words with their POS tags.

Word-sense disambiguation (WSD) is identifying which sense of a word (that is, which meaning) is used in a sentence, when the word has multiple meanings.


There are other applications as well which require POS tagging, like Question Answering, Speech Recognition, Machine Translation, and so on.

NLTK POS tag list:

| Tag | Description | Example |
|-----|-------------|---------|
| `CC` | coordinating conjunction | |
| `CD` | cardinal number | |
| `DT` | determiner | |
| `EX` | existential there | "there is" (think of it like "there exists") |
| `FW` | foreign word | |
| `IN` | preposition/subordinating conjunction | |
| `JJ` | adjective | big |
| `JJR` | adjective, comparative | bigger |
| `JJS` | adjective, superlative | biggest |
| `LS` | list marker | 1) |
| `MD` | modal | could, will |
| `NN` | noun, singular | desk |
| `NNS` | noun, plural | desks |
| `NNP` | proper noun, singular | Harrison |
| `NNPS` | proper noun, plural | Americans |
| `PDT` | predeterminer | 'all' the kids |
| `POS` | possessive ending | parent's |
| `PRP` | personal pronoun | I, he, she |
| `PRP$` | possessive pronoun | my, his, hers |
| `RB` | adverb | very, silently |
| `RBR` | adverb, comparative | better |
| `RBS` | adverb, superlative | best |
| `RP` | particle | give up |
| `TO` | to | go 'to' the store |
| `UH` | interjection | errrrrrrrm |
| `VB` | verb, base form | take |
| `VBD` | verb, past tense | took |
| `VBG` | verb, gerund/present participle | taking |
| `VBN` | verb, past participle | taken |
| `VBP` | verb, singular present, non-3rd person | take |
| `VBZ` | verb, 3rd person singular present | takes |
| `WDT` | wh-determiner | which |
| `WP` | wh-pronoun | who, what |
| `WP$` | possessive wh-pronoun | whose |
| `WRB` | wh-adverb | where, when |

How might we use this? The cells below download NLTK's Punkt tokenizer models and the averaged perceptron tagger, then tokenize and tag a sentence. Punkt's sentence tokenizer, the PunktSentenceTokenizer, is trained with an unsupervised algorithm, so you can actually train it on any body of text that you use.

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

test_string = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

def preprocess(sent):
    # Split the sentence into word tokens, then tag each token with its part of speech.
    tokens = word_tokenize(sent)
    return pos_tag(tokens)

sent = preprocess(test_string)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]
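
Once a sentence is tagged, the (word, tag) pairs can be filtered like any other Python list. As a small sketch, the snippet below hard-codes the tagged output shown above, so it runs without NLTK, and then pulls out the nouns and counts how often each tag occurs. The variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above.
tagged = [
    ('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
    ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'), ('$', '$'),
    ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP'),
    ('for', 'IN'), ('abusing', 'VBG'), ('its', 'PRP$'), ('power', 'NN'),
    ('in', 'IN'), ('the', 'DT'), ('mobile', 'JJ'), ('phone', 'NN'),
    ('market', 'NN'), ('and', 'CC'), ('ordered', 'VBD'), ('the', 'DT'),
    ('company', 'NN'), ('to', 'TO'), ('alter', 'VB'), ('its', 'PRP$'),
    ('practices', 'NNS'),
]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts['NN'], tag_counts['IN'])  # 5 3
```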

### Named Entity Recognition (NER)

![alt text](https://miro.medium.com/max/1306/1*JNHlyK5-jQA6JBKj3nDYcA.png)



Named entity recognition (NER) is often the first step in information extraction. It seeks to locate named entities in text and classify them into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages. NER is used in many areas of Natural Language Processing (NLP), and it can help answer many real-world questions, such as:
* Which companies were mentioned in the news article?
* Were specified products mentioned in complaints or reviews?
* Does the tweet contain the name of a person? Does the tweet contain this person’s location?

In [0]:
! pip install spacy
! python -m spacy download en_core_web_sm



spaCy’s named entity recognition has been trained on the OntoNotes 5 corpus.

In [0]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [0]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
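
To answer a question like "which companies were mentioned?", the (text, label) pairs can be grouped by label. The sketch below hard-codes the pairs printed above so that it runs without spaCy; `entities_by_label` is an illustrative name, not a spaCy API.

```python
from collections import defaultdict

# (text, label) pairs copied from the doc.ents output above.
entities = [('European', 'NORP'), ('Google', 'ORG'),
            ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]

# Group entity texts under their label.
entities_by_label = defaultdict(list)
for text, label in entities:
    entities_by_label[label].append(text)

print(entities_by_label['ORG'])     # ['Google']
print(entities_by_label['MONEY'])   # ['$5.1 billion']
```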


In [0]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Drop elements whose text is not part of the article body.
    for script in soup(["script", "style", 'aside']):
        script.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)

167

In [0]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 5,
         'DATE': 23,
         'GPE': 16,
         'LAW': 1,
         'NORP': 2,
         'ORDINAL': 1,
         'ORG': 37,
         'PERSON': 82})

The following are the three most frequent entities.

In [0]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Strzok', 28), ('F.B.I.', 13), ('Trump', 12)]

In [0]:
sentences = [x for x in article.sents]
print(sentences[14])

The report was critical of Mr. Strzok’s conduct in sending the texts, and the bureau’s Office of Professional Responsibility said that Mr. Strzok should be suspended for 60 days and demoted.


In [0]:
displacy.render(nlp(str(sentences[11])), jupyter=True, style='ent')

In [0]:
displacy.render(nlp(str(sentences[11])), style='dep', jupyter = True, options = {'distance': 120})

In [0]:
[(token.orth_, token.pos_, token.lemma_)
 for token in nlp(str(sentences[10]))
 if not token.is_stop and token.pos_ != 'PUNCT']

[('F.B.I.', 'PROPN', 'F.B.I.'),
 ('immense', 'ADJ', 'immense'),
 ('political', 'ADJ', 'political'),
 ('pressure', 'NOUN', 'pressure'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Trump', 'PROPN', 'Trump'),
 ('dismiss', 'VERB', 'dismiss'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Strzok', 'PROPN', 'Strzok'),
 ('removed', 'VERB', 'remove'),
 ('summer', 'NOUN', 'summer'),
 ('staff', 'NOUN', 'staff'),
 ('special', 'ADJ', 'special'),
 ('counsel', 'NOUN', 'counsel'),
 ('Robert', 'PROPN', 'Robert'),
 ('S.', 'PROPN', 'S.'),
 ('Mueller', 'PROPN', 'Mueller'),
 ('III', 'PROPN', 'III')]

In [0]:
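# Map each entity's text to its label (kept in a variable; only the print below is displayed).
entity_labels = {str(ent): ent.label_ for ent in nlp(str(sentences[10])).ents}
# Token-level view: IOB position and entity type for every token.
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[10]])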

[(The, 'O', ''), (F.B.I., 'B', 'ORG'), (had, 'O', ''), (been, 'O', ''), (under, 'O', ''), (immense, 'O', ''), (political, 'O', ''), (pressure, 'O', ''), (by, 'O', ''), (Mr., 'O', ''), (Trump, 'B', 'PERSON'), (to, 'O', ''), (dismiss, 'O', ''), (Mr., 'O', ''), (Strzok, 'B', 'PERSON'), (,, 'O', ''), (who, 'O', ''), (was, 'O', ''), (removed, 'O', ''), (last, 'B', 'DATE'), (summer, 'I', 'DATE'), (from, 'O', ''), (the, 'O', ''), (staff, 'O', ''), (of, 'O', ''), (the, 'O', ''), (special, 'O', ''), (counsel, 'O', ''), (,, 'O', ''), (Robert, 'B', 'PERSON'), (S., 'I', 'PERSON'), (Mueller, 'I', 'PERSON'), (III, 'I', 'PERSON'), (., 'O', '')]
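
In the output above, `B` marks the token that begins an entity, `I` marks a token inside an entity, and `O` marks a token outside any entity. A minimal sketch of how these tags can be merged back into entity spans is shown below. It works on plain tuples (an excerpt of the output above, with tokens written as strings), and `iob_to_spans` is a hypothetical helper, not part of spaCy. Joining tokens with a single space is a simplification.

```python
def iob_to_spans(triples):
    """Merge (token, iob, label) triples into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, label in triples:
        if iob == "B":
            # A new entity starts; close any entity still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # Outside any entity; close the open one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# Excerpt of the token-level output above.
triples = [
    ("The", "O", ""), ("F.B.I.", "B", "ORG"), ("had", "O", ""),
    ("Mr.", "O", ""), ("Trump", "B", "PERSON"),
    ("last", "B", "DATE"), ("summer", "I", "DATE"),
    ("Robert", "B", "PERSON"), ("S.", "I", "PERSON"),
    ("Mueller", "I", "PERSON"), ("III", "I", "PERSON"), (".", "O", ""),
]

print(iob_to_spans(triples))
# [('F.B.I.', 'ORG'), ('Trump', 'PERSON'), ('last summer', 'DATE'),
#  ('Robert S. Mueller III', 'PERSON')]
```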


In [0]:
displacy.render(article, jupyter=True, style='ent')