# Topic Modeling with Gensim




Topic modeling is an unsupervised technique for extracting hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python's Gensim package. The challenge, however, is extracting good-quality topics that are clear, well separated and meaningful. This depends heavily on the quality of the text preprocessing and on the strategy used to find the optimal number of topics. This tutorial attempts to tackle both problems.

#### IMPORTING ALL THE NECESSARY LIBRARIES

In [35]:
import nltk
nltk.download('stopwords')
import re
import numpy as np
import pandas as pd
from pprint import pprint


# Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently as possible.
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline


# Configure logging for gensim (optional); at level ERROR only errors are reported
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)


import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)


from nltk.corpus import stopwords
stop_words=stopwords.words('english')
stop_words.extend(['from','subject','re','edu','use'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\csdin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




#### IMPORTING THE 20-NEWSGROUPS DATASET


In [8]:
df=pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())


['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space'
 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware'
 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc'
 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x'
 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian'
 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']


In [14]:
df.head()

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


In [15]:
df.describe(include='all')

| | content | target | target_names |
|---|---|---|---|
| count | 11314 | 11314.0 | 11314 |
| unique | 11314 | | 20 |
| top | From: lerxst@wam.umd.edu (where's my thing)\nS... | | rec.sport.hockey |
| freq | 1 | | 600 |
| mean | | 9.293 | |
| std | | 5.562719 | |
| min | | 0.0 | |
| 25% | | 5.0 | |
| 50% | | 9.0 | |
| 75% | | 14.0 | |


#### REMOVING EMAIL ADDRESSES AND NEWLINE CHARACTERS

In [9]:
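# Convert the content column to a list of strings
df = df.content.values.tolist()
# Remove email addresses (any whitespace-delimited token containing '@')
# Raw strings keep the regex escapes from being read as string escapes
df = [re.sub(r'\S*@\S*\s?', '', sent) for sent in df]
# Collapse newlines and other runs of whitespace into a single space
df = [re.sub(r'\s+', ' ', sent) for sent in df]
# Remove distracting single quotes
df = [re.sub(r"'", "", sent) for sent in df]
pprint(df[:1])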

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']
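To see what the three substitutions do, the sketch below applies them to one made-up string. The helper name `clean_post` and the sample text (including the `example.com` address) are assumptions for illustration and are not part of the notebook.

```python
import re

def clean_post(text):
    """Apply the three substitutions from the cell above to one string."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # drop any token containing '@'
    text = re.sub(r"\s+", " ", text)         # collapse whitespace runs, including newlines
    text = re.sub(r"'", "", text)            # drop single quotes
    return text

# Hypothetical sample, shaped like the header of a newsgroup post
sample = "From: someone@example.com (where's my thing)\nSubject: WHAT car   is this!?"
cleaned = clean_post(sample)
print(cleaned)
```

Note that the first pattern removes every whitespace-delimited token containing `@`, not only well-formed addresses, so it can occasionally delete ordinary text as well.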


#### TOKENIZATION OF THE WORDS USING simple_preprocess() 

In [10]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes each document,
        # dropping punctuation, digits and very short tokens;
        # deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(df))

print(data_words[:1])


[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
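The output above shows that single letters and digits disappear during tokenization. The sketch below is a standard-library approximation of that behaviour, written only to make the rules visible. It is not Gensim's code: the names `approx_preprocess` and `strip_accents`, and the length limits of 2 and 15 characters, are assumptions based on how `simple_preprocess` appears to behave.

```python
import re
import unicodedata

# Runs of word characters that are not digits
TOKEN = re.compile(r"(?:(?!\d)\w)+")

def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def approx_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, tokenize, and drop too-short or too-long tokens."""
    text = strip_accents(text.lower())
    return [
        tok for tok in TOKEN.findall(text)
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]

tokens = approx_preprocess("It was a 2-door sports car, from the late 60s. Café!")
print(tokens)
```

In this sketch `2-door` becomes `door`, and `60s` leaves only `s`, which is then dropped for being shorter than two characters.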


#### CREATING BIGRAM AND TRIGRAM MODELS

In [11]:
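# Phrases learns word pairs that co-occur often; a higher min_count or threshold yields fewer phrases
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
# Training Phrases again on the bigram output lets it join up to three words
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Phraser wraps a trained Phrases model for faster application to documents
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)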
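The `min_count` and `threshold` arguments are easier to reason about with the scoring rule in view. The sketch below assumes Gensim's default scorer follows the formula from Mikolov et al., `(count_ab - min_count) / count_a / count_b * vocab_size`, and that a pair is joined when its score exceeds `threshold`. The function name `phrase_score` and all the counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate word pair; higher means the pair behaves more like a phrase."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Hypothetical counts, chosen only for illustration
vocab_size = 50_000
threshold = 100

# Two rare words that almost always appear together
rare_pair = phrase_score(count_a=40, count_b=35, count_ab=30, vocab_size=vocab_size)
# Two frequent words that only sometimes appear together
common_pair = phrase_score(count_a=9000, count_b=7000, count_ab=300, vocab_size=vocab_size)

print(rare_pair > threshold, common_pair > threshold)
```

Under this rule, raising `threshold` keeps only pairs whose words rarely occur apart, which is why a value of 100 produces relatively few bigrams.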

#### DEFINING FUNCTIONS TO REMOVE STOPWORDS, MAKE BIGRAMS AND TRIGRAMS, AND LEMMATIZE

In [12]:
# Remove stopwords from each tokenized document
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]


# Join frequently co-occurring word pairs into bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]


# Apply the trigram model (expects documents already passed through make_bigrams)
def make_trigrams(texts):
    return [trigram_mod[doc] for doc in texts]


# Lemmatize each document, keeping only tokens whose part-of-speech tag is in allowed_postags
# Relies on the global spaCy pipeline `nlp`, which must be loaded before this function is called
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):

    texts_out = []

    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

#### CALLING THE FUNCTIONS IN ORDER

In [14]:
# Remove stopwords
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

In [15]:
# Form trigrams; the trigram model was trained on bigram output, so apply it to the bigram-transformed documents
data_words_trigrams = make_trigrams(data_words_bigrams)

In [36]:
# Load spaCy's small English model, disabling the parser and NER components for efficiency
# Install it first with: python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


# Lemmatize, keeping only the words whose part-of-speech tag is in allowed_postags
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'umd', 'organization', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


#### CREATE THE DICTIONARY AND CORPUS NEEDED FOR TOPIC MODELING

In [27]:
# Create the dictionary, which maps each unique token to an integer id
id2word = corpora.Dictionary(data_lemmatized)


# The lemmatized documents are the texts for the corpus
texts = data_lemmatized


# Convert each document to bag-of-words form: a list of (word_id, word_frequency) pairs
corpus = [id2word.doc2bow(text) for text in texts]


# Each pair printed below is (word_id, word_frequency); this corpus is the input to the LDA model

print(corpus[:1])


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 5), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1)]]


In [28]:
# We can view what the given id corresponds to by passing the id as key to Dictionary
id2word[4]

'call'
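The dictionary and bag-of-words steps can be mimicked in plain Python to show what the `(word_id, word_frequency)` pairs contain. This is a minimal sketch, not Gensim's implementation: the helpers `build_vocab` and `to_bow` and the toy documents are hypothetical, and ids are assigned here in order of first appearance, which may differ from how `corpora.Dictionary` numbers tokens.

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy documents for illustration
docs = [["car", "door", "car", "engine"], ["engine", "spec", "car"]]
vocab = build_vocab(docs)
bow = to_bow(docs[0], vocab)
print(vocab)
print(bow)
```

Word order is discarded in this representation; only which words occur, and how often, reaches the LDA model.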

#### BUILDING THE TOPIC MODEL


In [29]:
# Along with the dictionary and corpus, we also need to specify the number of topics
lda_model= gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
                                               

    Alpha is the concentration parameter of the Dirichlet prior on document-topic density: with a higher alpha,
    documents are assumed to be made up of more topics, which gives a more even (less sparse) topic distribution per document.

    Beta (called eta in Gensim) is the corresponding concentration parameter for topic-word density: with a higher beta,
    topics are assumed to be made up of more of the words in the vocabulary, which gives a more even word distribution per topic.
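The effect of the concentration parameter can be checked by sampling from a symmetric Dirichlet distribution. The sketch below uses only the standard library and draws topic mixtures for five topics. The function names, the number of topics and the alpha values are all assumptions chosen for illustration, and the code is independent of Gensim.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k topics via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n=2000, seed=0):
    """Average share of the dominant topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low = mean_largest_share(alpha=0.1)    # sparse mixtures: one topic dominates
high = mean_largest_share(alpha=10.0)  # even mixtures: topics share the document
print(f"alpha=0.1  -> dominant topic share ~ {low:.2f}")
print(f"alpha=10.0 -> dominant topic share ~ {high:.2f}")
```

With a small alpha most of each sampled document belongs to a single topic, while a large alpha spreads the document almost evenly across all five.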


#### VIEW THE TOPICS IN LDA MODEL

    The model is built with 20 topics. Each topic is a weighted combination of keywords, and each weight
    shows how much that keyword contributes to the topic.

In [30]:
pprint(lda_model.print_topics())
doc_lda=lda_model[corpus]


[(0,
  '0.064*"nhl" + 0.059*"recommend" + 0.051*"gateway" + 0.039*"flight" + '
  '0.031*"fuel" + 0.027*"floor" + 0.024*"bank" + 0.018*"space_station" + '
  '0.017*"phase" + 0.017*"qualified"'),
 (1,
  '0.095*"key" + 0.040*"physical" + 0.037*"public" + 0.028*"encryption" + '
  '0.027*"chip" + 0.025*"security" + 0.022*"private" + 0.021*"master" + '
  '0.020*"government" + 0.018*"clipper"'),
 (2,
  '0.028*"believe" + 0.025*"evidence" + 0.023*"reason" + 0.018*"say" + '
  '0.017*"claim" + 0.015*"christian" + 0.015*"sense" + 0.013*"exist" + '
  '0.012*"fact" + 0.012*"faith"'),
 (3,
  '0.072*"team" + 0.069*"game" + 0.050*"play" + 0.048*"win" + 0.040*"year" + '
  '0.034*"player" + 0.024*"season" + 0.018*"fan" + 0.018*"goal" + 0.017*"run"'),
 (4,
  '0.093*"ide" + 0.078*"mother" + 0.040*"remind" + 0.019*"ultimate" + '
  '0.015*"winter" + 0.012*"beauty" + 0.011*"absurd" + 0.009*"grip" + '
  '0.004*"credibility" + 0.002*"stall"'),
 (5,
  '0.135*"monitor" + 0.043*"rd" + 0.034*"trivial" + 0.021*"suc

    The weights above show how much each word contributes to its topic.
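The topic strings are convenient to read but awkward to compute with. As a minimal sketch, the helper below (the name `parse_topic` is hypothetical) turns one printed string into `(word, weight)` pairs, using topic 1 from the output above as input. Gensim's `show_topic` method should return such pairs directly, so parsing is only needed when nothing but the printed text is available.

```python
import re

# Matches fragments such as 0.095*"key"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Convert a print_topics-style string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

# Topic 1, copied from the output above
topic_1 = ('0.095*"key" + 0.040*"physical" + 0.037*"public" + 0.028*"encryption" + '
           '0.027*"chip" + 0.025*"security" + 0.022*"private" + 0.021*"master" + '
           '0.020*"government" + 0.018*"clipper"')

terms = parse_topic(topic_1)
top_ten_mass = sum(weight for _, weight in terms)
print(terms[:3])
print(round(top_ten_mass, 3))
```

The ten listed words account for roughly a third of the topic's probability mass; the remainder is spread over the rest of the vocabulary.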

#### COMPUTE MODEL PERPLEXITY AND COHERENCE SCORE

In [31]:
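# Perplexity captures how uncertain a model is when predicting the words of a corpus (lower is better).
# log_perplexity returns a per-word likelihood bound on a log scale, so the printed value is negative;
# a higher bound corresponds to a lower perplexity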
print(" Perplexity: ", lda_model.log_perplexity(corpus))



# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

 Perplexity:  -13.368808729846432

Coherence Score:  0.5237564492819576
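The negative number printed as perplexity is a per-word likelihood bound on a log scale. Assuming the base-2 convention that Gensim uses in its own log messages, the bound can be converted to a conventional perplexity as `2 ** (-bound)`. The sketch below does that arithmetic for the value printed above.

```python
# Per-word likelihood bound, copied from the output above
bound = -13.368808729846432

# Assumption: base-2 convention, perplexity = 2 ** (-bound)
perplexity = 2 ** (-bound)

print(f"per-word bound: {bound:.3f}")
print(f"perplexity estimate: {perplexity:,.0f}")
```

The absolute figure says little on its own; it is mainly useful for comparing models trained on the same corpus with the same preprocessing.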


#### VISUALIZE THE TOPICS-KEYWORDS

In [32]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis



    >Each bubble represents a topic. The larger the bubble, the higher the percentage of posts in the corpus that are about that topic.


    >Red bars give the estimated number of times a given term was generated by a given topic. As the visualization above shows, the word ‘get’ occurs about 35,000 times in total, and about 32,000 of those occurrences fall within topic 1. The word with the longest red bar is the one used most by the posts belonging to that topic.
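Coming back to the question of how many topics to use, one common approach is to train a model for several candidate values and compare their coherence scores. The sketch below is one possible way to organise that sweep, not a definitive recipe. The helpers `sweep_num_topics` and `gensim_coherence` are hypothetical names, and the training settings simply reuse some of the values from the model built earlier.

```python
def sweep_num_topics(candidates, score_fn):
    """Score every candidate topic count and return (best_k, scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def gensim_coherence(k, corpus, id2word, texts):
    """Train an LDA model with k topics and return its c_v coherence."""
    # Imported here so the selection helper above also works without gensim installed
    from gensim.models import CoherenceModel, LdaModel
    model = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                     random_state=100, passes=10)
    coherence_model = CoherenceModel(model=model, texts=texts,
                                     dictionary=id2word, coherence='c_v')
    return coherence_model.get_coherence()

# Intended use with the objects built earlier (slow: one full training run per candidate):
# best_k, scores = sweep_num_topics(
#     range(5, 45, 5),
#     lambda k: gensim_coherence(k, corpus, id2word, data_lemmatized),
# )
```

The highest score is not always the most useful choice. If coherence flattens out, a smaller number of topics near the start of the plateau is often easier to interpret.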