# Topic Modeling

Topic modeling is a technique for extracting hidden topics from a large volume of text. One common algorithm for topic modeling is Latent Dirichlet Allocation (LDA). A popular implementation of LDA is provided by the gensim library.

### Import required libraries

In [1]:
# !pip install pyLDAvis # Uncomment and install this visualization library

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [44]:
# Data manipulation
import pandas as pd
import numpy as np
from pprint import pprint

# Data preprocessing & cleaning
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
import gensim.corpora as corpora

# Modeling
import gensim

# Model Evaluation
from gensim.models import CoherenceModel

# Plotting tools
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

### Load dataset

We will use the 20-Newsgroups dataset, which contains around 11k newsgroup posts from 20 different topics. The dataset is available here: <a href='https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json'>newsgroups</a>

In [5]:
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')

Check the first 5 rows

In [7]:
df.head()

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 10 | From: irwin@cmptrc.lonestar.org (Irwin Arnstei... | 8 | rec.motorcycles |
| 100 | From: tchen@magnus.acs.ohio-state.edu (Tsung-K... | 6 | misc.forsale |
| 1000 | From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\n... | 2 | comp.os.ms-windows.misc |


Check rows and columns

In [8]:
df.shape

(11314, 3)

### Preprocess Data

Remove emails

In [9]:
data = df.content.values.tolist() # Convert to list first

In [10]:
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # Remove email addresses (raw string avoids invalid-escape warnings)


Remove new line characters

In [12]:
data = [re.sub(r'\s+', ' ', sent) for sent in data]  # Collapse newlines and repeated whitespace into single spaces


Remove distracting single quotes

In [13]:
data = [re.sub("\'", "", sent) for sent in data]
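The three cleaning steps above can be tried on a single string to see what each one removes. The following is a minimal sketch using only the standard library; the sample text is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample post, not taken from the dataset.
sample = "From: someone@example.com (A User)\nSubject: What's this car?\n\nIt's a 2-door  sports car."

no_email = re.sub(r'\S*@\S*\s?', '', sample)  # drop any token containing '@'
one_line = re.sub(r'\s+', ' ', no_email)      # collapse newlines and runs of spaces
cleaned = re.sub(r"'", '', one_line)          # strip single quotes

print(cleaned)
```

Note that the email pattern removes the whole whitespace-delimited token around the '@', so text glued to an address would be removed with it.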

Tokenize the text

In [14]:
def tokenize_to_words(texts):
    for text in texts:
        # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(text), deacc=True)

In [15]:
tokenized_data = list(tokenize_to_words(data))
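To make the tokenizing step more concrete, here is a rough standard-library approximation of what it does to one string. This is a sketch, not gensim's implementation: the function name rough_preprocess is hypothetical, and the length bounds of 2 and 15 are assumed to mirror gensim's defaults.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate the tokenizing step: strip accents, lowercase,
    keep alphabetic runs, and drop very short or very long tokens."""
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    no_accents = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    # Letters only: digits, punctuation, and underscores act as separators.
    tokens = re.findall(r'[^\W\d_]+', no_accents.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

tokens = rough_preprocess("Café owners: I saw a 2-door sports car!")
print(tokens)
```

Single-letter words such as "I" and "a" disappear because of the minimum length, and the digit in "2-door" is dropped because only alphabetic runs are kept.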

Remove stopwords

In [19]:
stop_words = set(stopwords.words('english'))  # needs a one-time nltk.download('stopwords')

def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]  # texts: list of token lists

In [20]:
# Left disabled: the outputs shown below were produced without stop word removal.
# stopwords_less_data = remove_stopwords(tokenized_data)
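The effect of stop word removal can be seen on a toy example. This sketch uses a small hand-picked stop word set (an assumption for illustration; NLTK's English list is much longer) and a hypothetical helper named drop_stop_words.

```python
# Hypothetical, hand-picked stop word set for illustration only.
demo_stop_words = {"the", "a", "is", "of", "and", "in", "it"}

def drop_stop_words(docs, stop_words):
    """Remove stop words from each already-tokenized document."""
    return [[tok for tok in doc if tok not in stop_words] for doc in docs]

docs = [
    ["the", "engine", "of", "the", "car"],
    ["windows", "is", "a", "dos", "shell"],
]
filtered = drop_stop_words(docs, demo_stop_words)
print(filtered)
```

Using a set rather than a list for the stop words keeps each membership check constant-time, which matters on a corpus of this size.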

### Create Data Input to LDA Model

1. Create Dictionary

In [21]:
id2word = corpora.Dictionary(tokenized_data)

2. Create Corpus (Term Document Frequency)

In [22]:
corpus = [id2word.doc2bow(text) for text in tokenized_data]

Show corpus and frequency

In [25]:
print([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]])

[[('addition', 1), ('all', 1), ('anyone', 2), ('be', 1), ('body', 1), ('bricklin', 1), ('brought', 1), ('bumper', 1), ('by', 1), ('called', 1), ('can', 1), ('car', 5), ('college', 1), ('could', 1), ('day', 1), ('door', 1), ('doors', 1), ('early', 1), ('edu', 1), ('engine', 1), ('enlighten', 1), ('from', 3), ('front', 1), ('funky', 1), ('have', 1), ('history', 1), ('host', 1), ('if', 2), ('il', 1), ('in', 1), ('info', 1), ('is', 3), ('it', 2), ('know', 1), ('late', 1), ('lerxst', 1), ('lines', 1), ('looked', 1), ('looking', 1), ('made', 1), ('mail', 1), ('maryland', 1), ('me', 1), ('model', 1), ('my', 1), ('name', 1), ('neighborhood', 1), ('nntp', 1), ('of', 3), ('on', 2), ('or', 1), ('organization', 1), ('other', 1), ('out', 1), ('park', 1), ('please', 1), ('posting', 1), ('production', 1), ('rac', 1), ('really', 1), ('rest', 1), ('saw', 1), ('separate', 1), ('small', 1), ('specs', 1), ('sports', 1), ('subject', 1), ('tellme', 1), ('thanks', 1), ('the', 6), ('there', 1), ('thing', 1), 
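The dictionary and bag-of-words steps can be sketched without gensim to show the data structures involved. This is an illustration under stated assumptions: the names token2id and to_bow are hypothetical, and gensim may assign ids in a different order than the first-appearance order used here.

```python
from collections import Counter

docs = [["car", "engine", "car"], ["windows", "dos", "car"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only frequencies are passed to the model.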

### Modeling LDA Topic model

In the LDA model below, chunksize is the number of documents used in each training chunk, and passes is the total number of training passes over the corpus.

In [26]:
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=20, random_state=100,
                                        update_every=1, chunksize=100, passes=10, alpha='auto',
                                        per_word_topics=True)
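To get a feel for how chunksize and passes interact, the amount of work implied by these settings can be worked out with simple arithmetic. This sketch assumes the corpus has the 11314 documents reported by df.shape and that, with update_every=1, each chunk triggers one model update.

```python
import math

num_docs = 11314   # rows reported by df.shape
chunksize = 100    # documents per training chunk
passes = 10        # full sweeps over the corpus

chunks_per_pass = math.ceil(num_docs / chunksize)  # the last chunk is partial
total_chunks = chunks_per_pass * passes

print(chunks_per_pass, total_chunks)
```

A larger chunksize means fewer, larger updates per pass, while more passes means the whole corpus is revisited more often.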

Show topics

Each keyword in a topic is shown with a weight that indicates how important the keyword is to that topic.

In [31]:
pprint(model.print_topics())

[(0,
  '0.178*"windows" + 0.093*"dos" + 0.050*"ms" + 0.034*"os" + 0.023*"microsoft" '
  '+ 0.018*"kit" + 0.015*"animation" + 0.008*"derek" + 0.008*"evans" + '
  '0.008*"developers"'),
 (1,
  '0.053*"lines" + 0.052*"subject" + 0.051*"organization" + 0.051*"from" + '
  '0.033*"re" + 0.027*"for" + 0.027*"writes" + 0.027*"university" + '
  '0.027*"article" + 0.026*"posting"'),
 (2,
  '0.084*"card" + 0.043*"mb" + 0.036*"video" + 0.035*"ram" + 0.029*"bus" + '
  '0.025*"driver" + 0.025*"mouse" + 0.024*"scsi" + 0.022*"controller" + '
  '0.021*"mac"'),
 (3,
  '0.083*"chips" + 0.018*"sam" + 0.014*"vw" + 0.009*"um" + 0.008*"ross" + '
  '0.001*"perot" + 0.000*"simm" + 0.000*"tl" + 0.000*"dram" + 0.000*"pu"'),
 (4,
  '0.043*"the" + 0.036*"and" + 0.030*"for" + 0.026*"to" + 0.022*"on" + '
  '0.016*"is" + 0.016*"with" + 0.014*"use" + 0.011*"or" + 0.009*"an"'),
 (5,
  '0.084*"car" + 0.034*"cars" + 0.034*"engine" + 0.026*"gay" + '
  '0.020*"spacecraft" + 0.020*"marriage" + 0.020*"vax" + 0.019*"vms" + '
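The printed topic strings are convenient to read but awkward to process. As a minimal sketch, the first topic string from the output can be parsed into (word, weight) pairs with a regular expression; the variable names are illustrative only.

```python
import re

# First topic string from the printed output, joined into one string.
topic = ('0.178*"windows" + 0.093*"dos" + 0.050*"ms" + 0.034*"os" + 0.023*"microsoft" '
         '+ 0.018*"kit" + 0.015*"animation" + 0.008*"derek" + 0.008*"evans" + '
         '0.008*"developers"')

# Each term has the form weight*"word".
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic)]

print(pairs[:3])
```

In practice it may be simpler to call model.show_topic(topic_id), which returns the word and weight pairs directly instead of a formatted string.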


### Model Evaluation

1. Model perplexity

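Perplexity measures how well the model predicts the documents; the lower the perplexity, the better the model. Note that log_perplexity returns a per-word likelihood bound rather than the perplexity itself, so a bound closer to zero corresponds to a lower perplexity.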

In [33]:
model.log_perplexity(corpus)

-11.79229961945948
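The value above is a log-scale bound, so it can be converted into a perplexity estimate. The sketch below assumes the bound is a base-2 per-word quantity, which matches how gensim reports its own perplexity estimate in the training log.

```python
bound = -11.79229961945948  # value returned by model.log_perplexity(corpus)

# Assumption: the bound is a base-2 per-word log-likelihood bound.
perplexity = 2 ** (-bound)

print(round(perplexity))
```

Because the conversion is monotonic, comparing bounds between models gives the same ranking as comparing perplexities, as long as both are evaluated on the same corpus.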

2. Topic Coherence 

Topic coherence measures the degree of semantic similarity between the high-scoring words in a topic. A higher coherence score generally indicates more interpretable topics.

In [34]:
model_coherence = CoherenceModel(model=model, texts=tokenized_data, dictionary=id2word, coherence='c_v')

In [36]:
model_coherence.get_coherence()

0.4750294116473741
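The intuition behind coherence is that the top words of a good topic tend to appear in the same documents. The sketch below illustrates this with a simplified UMass-style score based on document co-occurrence counts. It is not the c_v measure used above and not gensim's implementation; the function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, docs):
    """Average log((co-document count + 1) / document count) over word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = [
        math.log((doc_count(first, second) + 1) / doc_count(first))
        for first, second in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

docs = [
    ["car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["windows", "dos", "driver"],
    ["windows", "dos", "car"],
]

coherent = umass_style_coherence(["car", "engine", "wheel"], docs)
mixed = umass_style_coherence(["car", "dos", "wheel"], docs)
print(coherent > mixed)
```

Words that rarely share a document pull the score down, which is why a topic mixing unrelated keywords scores lower than one whose keywords co-occur. The sketch assumes every top word appears in at least one document.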

#### Visualize the topics

In [41]:
pyLDAvis.enable_notebook()
vis=pyLDAvis.gensim.prepare(model, corpus, id2word)

In [43]:
vis

Interpreting the Visual

Each bubble in the left-hand chart represents a topic. The larger the bubble, the more prevalent the topic. A good topic model has fairly large, non-overlapping bubbles scattered across the chart rather than clustered in one quadrant. A model with too many topics typically shows many overlapping, small bubbles clustered in one region of the chart.

#### Conclusion

The quality of the model depends on the following factors, each of which is a place to look for improvement:<hr>
1. The quality of the text preprocessing.
2. The variety of topics covered by the text.
3. The choice of topic modeling algorithm.
4. The number of topics requested from the algorithm.
5. The tuning of the model hyperparameters.