# Topic Modeling

**Topic modeling** is a method for ***unsupervised classification*** of documents: it finds natural groups of items (topics) even when we’re not sure what we’re looking for.


This notebook introduces the concept of topic modeling and walks through the code for building a topic model with **Latent Dirichlet Allocation (LDA)** in ***Python***, using the gensim implementation.


**Model Implementation Steps:**


1. Loading Data
2. Data Cleaning
3. Phrase Modeling: Bigrams
4. Data Transformation: Corpus and Dictionary
5. Base Model: Latent Dirichlet Allocation (LDA) Model 
6. Hyper-parameter Tuning
7. Final model
8. Visualize Results

**Install Dependencies**

In [1]:
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


# **1. Loading Data**





In [4]:
import pandas as pd

df = pd.read_excel('Pubmed5k.xlsx')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleID  4999 non-null   int64 
 1   Title      4999 non-null   object
 2   Abstract   4999 non-null   object
dtypes: int64(1), object(2)
memory usage: 117.3+ KB


In [None]:
len(df)

The data contain no null values and have three columns: ArticleID, Title and Abstract.

The objective of this task is to extract topics from the abstracts, so the other two columns are not needed.


# **2. Data Cleaning**

In [5]:
# Keep only the Abstract column (columns= already implies axis=1)
df = df.drop(columns=['ArticleID', 'Title'])
df.head()

| | Abstract |
|---|---|
| 0 | Coordination variability (CV) is commonly anal... |
| 1 | Clinical Scenario: Dynamic knee valgus (DKV) i... |
| 2 | Various methodologies have been reported to as... |
| 3 | As outcomes for acute ischemic stroke (AIS) va... |
| 4 | Because hearing loss in children can result in... |


In [5]:
df['Abstract'][0]

'Coordination variability (CV) is commonly analyzed to understand dynamical qualities of human locomotion. The purpose of this study was to develop guidelines for the number of trials required to inform the calculation of a stable mean lower limb CV during overground locomotion. Three-dimensional lower limb kinematics were captured for 10 recreational runners performing 20 trials each of preferred and fixed speed walking and running. Stance phase CV was calculated for 9 segment and joint couplings using a modified vector coding technique. The number of trials required to achieve a CV mean within 10% of 20 strides average was determined for each coupling and individual. The statistical outputs of mode (walking vs running) and speed (preferred vs fixed) were compared when informed by differing numbers of trials. A minimum of 11 trials were required for stable mean stance phase CV. With fewer than 11 trials, CV was underestimated and led to an oversight of significant differences between 

In [6]:
import re

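# Remove selected punctuation characters (, . ! ? %)
df['Abstract_processed'] = df['Abstract'].map(lambda x: re.sub(r'[,.!?%]', '', x))
# Remove parenthesized text such as "(CV)"; each removal leaves a double space behind
df['Abstract_processed'] = df['Abstract_processed'].map(lambda x: re.sub(r'\(.*?\)', '', x))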


# Convert the abstract to lowercase
df['Abstract_processed'] = df['Abstract_processed'].map(lambda x: x.lower())

df['Abstract_processed'][0]

'coordination variability  is commonly analyzed to understand dynamical qualities of human locomotion the purpose of this study was to develop guidelines for the number of trials required to inform the calculation of a stable mean lower limb cv during overground locomotion three-dimensional lower limb kinematics were captured for 10 recreational runners performing 20 trials each of preferred and fixed speed walking and running stance phase cv was calculated for 9 segment and joint couplings using a modified vector coding technique the number of trials required to achieve a cv mean within 10 of 20 strides average was determined for each coupling and individual the statistical outputs of mode  and speed  were compared when informed by differing numbers of trials a minimum of 11 trials were required for stable mean stance phase cv with fewer than 11 trials cv was underestimated and led to an oversight of significant differences between mode and speed future overground locomotion cv resear
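To see what the cleaning steps do in isolation, here is a minimal sketch on a single made-up sentence. The helper name `clean_abstract` and the sample text are illustrative assumptions, not part of the notebook; the two regular expressions match the same characters as the ones in the cell above.

```python
import re

def clean_abstract(text):
    """Apply the three cleaning steps (punctuation, parentheses, lowercase) to one string."""
    # 1. Drop the punctuation characters , . ! ? %
    text = re.sub(r'[,.!?%]', '', text)
    # 2. Drop parenthesized spans (non-greedy, so each "(...)" is removed separately)
    text = re.sub(r'\(.*?\)', '', text)
    # 3. Lowercase everything
    return text.lower()

# Hypothetical sample sentence, loosely modeled on the first abstract
sample = "Stance phase CV (mean) was within 10% of 20 strides."
cleaned = clean_abstract(sample)
print(cleaned)
```

Note that removing `%` turns "10%" into "10", and that deleting a parenthesized span leaves two consecutive spaces behind. Both are harmless here because the tokenization step that follows splits on whitespace and discards numbers.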

**Tokenize words and further clean-up text**

Tokenize each abstract into a list of words, removing punctuation and other unnecessary characters.

In [7]:
from gensim.utils import simple_preprocess

def convert_sentences_into_words(sentences):
    for text in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation and digits;
        # deacc=True additionally strips accent marks
        yield simple_preprocess(str(text), deacc=True)

data_sentences = df['Abstract_processed'].values.tolist()
data_words = list(convert_sentences_into_words(data_sentences))

data_words[0]

['coordination',
 'variability',
 'is',
 'commonly',
 'analyzed',
 'to',
 'understand',
 'dynamical',
 'qualities',
 'of',
 'human',
 'locomotion',
 'the',
 'purpose',
 'of',
 'this',
 'study',
 'was',
 'to',
 'develop',
 'guidelines',
 'for',
 'the',
 'number',
 'of',
 'trials',
 'required',
 'to',
 'inform',
 'the',
 'calculation',
 'of',
 'stable',
 'mean',
 'lower',
 'limb',
 'cv',
 'during',
 'overground',
 'locomotion',
 'three',
 'dimensional',
 'lower',
 'limb',
 'kinematics',
 'were',
 'captured',
 'for',
 'recreational',
 'runners',
 'performing',
 'trials',
 'each',
 'of',
 'preferred',
 'and',
 'fixed',
 'speed',
 'walking',
 'and',
 'running',
 'stance',
 'phase',
 'cv',
 'was',
 'calculated',
 'for',
 'segment',
 'and',
 'joint',
 'couplings',
 'using',
 'modified',
 'vector',
 'coding',
 'technique',
 'the',
 'number',
 'of',
 'trials',
 'required',
 'to',
 'achieve',
 'cv',
 'mean',
 'within',
 'of',
 'strides',
 'average',
 'was',
 'determined',
 'for',
 'each',
 'coup
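The token list above shows that numbers such as "10" and "20" are gone and that "three-dimensional" was split in two. The following standard-library sketch approximates that behavior so the rules are easier to see. It is not gensim's implementation: the function name `rough_preprocess` and the exact regular expression are assumptions, and the length limits of 2 and 15 characters are meant to mirror `simple_preprocess`'s default `min_len` and `max_len`.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer: strip accents, lowercase, keep alphabetic tokens."""
    # Decompose accented characters and drop the combining marks (the "deacc" step)
    decomposed = unicodedata.normalize('NFD', text)
    no_accents = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    # Keep runs of letters only: digits, punctuation and hyphens act as separators
    tokens = re.findall(r'[^\W\d_]+', no_accents.lower())
    # Discard very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical sample text containing a hyphen, a number and accented letters
tokens = rough_preprocess("Three-dimensional kinematics of 10 runners: a naïve résumé")
print(tokens)
```

In this sketch the one-letter word "a" is dropped by the length filter and the accented words come out as plain ASCII, which is the effect `deacc=True` is used for in the notebook.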

# **3. Phrase Modeling: Bigrams**

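***Bigrams*** are pairs of adjacent words that frequently occur together in the corpus. gensim's `Phrases` model detects them; `min_count` and `threshold` control how often and how strongly a pair must co-occur before it is joined into a single token.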


In [8]:
import gensim

bigram = gensim.models.Phrases(data_words, min_count=10, threshold=50) 
bigram_mod = gensim.models.phrases.Phraser(bigram)
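To give a feel for how `min_count` and `threshold` interact, here is a simplified, standard-library sketch of a collocation score of the kind used for phrase detection. It is an assumption-laden illustration rather than gensim's actual code: the function `score_bigrams`, the toy documents and the exact form of the score are all made up for this example.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=0.9):
    """Return {(word_a, word_b): score} for adjacent pairs scoring above threshold."""
    unigram_counts = Counter(word for doc in docs for word in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram_counts)

    scores = {}
    for (a, b), n_ab in bigram_counts.items():
        # Pairs that co-occur more often than their individual counts suggest score higher;
        # subtracting min_count discounts pairs that were seen only a few times
        score = (n_ab - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy tokenized documents (hypothetical)
toy_docs = [
    ["lower", "limb", "pain"],
    ["lower", "limb", "injury"],
    ["lower", "limb", "motion"],
    ["limb", "motion", "study"],
]
phrases = score_bigrams(toy_docs)
print(phrases)
```

On this toy corpus only the pair ("lower", "limb") passes the threshold. Raising `threshold` or `min_count` makes the detector more conservative, which is why the notebook's values of 50 and 10 keep only strongly associated pairs.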


**Remove Stopwords, Make Bigrams and Lemmatize**

In [14]:
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [15]:
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

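def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Relies on the global spaCy pipeline `nlp`, which is loaded in the next cell
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep the lemma of each token whose part-of-speech tag is allowed
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out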

In [17]:
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

data_lemmatized[0]

['coordination',
 'variability',
 'commonly',
 'analyze',
 'understand',
 'dynamical',
 'quality',
 'locomotion',
 'purpose',
 'study',
 'develop',
 'guideline',
 'number',
 'trial',
 'require',
 'inform',
 'calculation',
 'stable',
 'overground',
 'locomotion',
 'kinematic',
 'capture',
 'recreational',
 'runner',
 'perform',
 'trial',
 'prefer',
 'fix',
 'speed',
 'walk',
 'run',
 'stance',
 'phase',
 'cv',
 'calculated',
 'segment',
 'joint',
 'coupling',
 'use',
 'modify',
 'vector',
 'coding',
 'technique',
 'number',
 'trial',
 'require',
 'achieve',
 'cv',
 'mean',
 'stride',
 'average',
 'determine',
 'couple',
 'individual',
 'statistical',
 'outputs',
 'mode',
 'speed',
 'compare',
 'inform',
 'differ',
 'number',
 'trial',
 'minimum',
 'trial',
 'require',
 'stable',
 'mean',
 'stance',
 'phase',
 'few',
 'trial',
 'underestimate',
 'lead',
 'oversight',
 'mode',
 'speed',
 'future',
 'overground',
 'locomotion',
 'cv',
 'research',
 'healthy',
 'population',
 'use',
 'vecto

# **4. Data Transformation: Corpus and Dictionary**

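Latent Dirichlet Allocation (LDA) is a particularly popular method for fitting a topic model. It needs two inputs: the dictionary, which maps each unique token to an integer id, and the corpus, which represents each document as a bag of words, i.e. a list of (token id, count) pairs.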

In [18]:
id2word = gensim.corpora.Dictionary(data_lemmatized)  

corpus = [id2word.doc2bow(text) for text in data_lemmatized]
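The dictionary and bag-of-words corpus can be pictured with a small standard-library sketch. This is not how gensim implements `Dictionary` or `doc2bow`: the helper names and the order in which ids are assigned are assumptions made for illustration.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy lemmatized documents (hypothetical)
toy_docs = [["trial", "stable", "trial"], ["stable", "gait"]]
toy_dictionary = build_dictionary(toy_docs)
toy_corpus = [doc_to_bow(doc, toy_dictionary) for doc in toy_docs]

print(toy_dictionary)
print(toy_corpus)
```

Word order is discarded in this representation: the first toy document becomes "id 0 twice, id 1 once", which is all that LDA needs.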


# **5. Base Model**



In [19]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=3,
                                       random_state=100,
                                       chunksize=100,
                                       passes=10)

In [20]:
lda_model.print_topics()

[(0,
  '0.014*"use" + 0.008*"base" + 0.008*"model" + 0.007*"method" + 0.005*"system" + 0.004*"result" + 0.004*"provide" + 0.004*"process" + 0.004*"study" + 0.004*"time"'),
 (1,
  '0.012*"cell" + 0.008*"effect" + 0.008*"study" + 0.007*"protein" + 0.007*"increase" + 0.006*"show" + 0.006*"high" + 0.006*"gene" + 0.006*"use" + 0.005*"level"'),
 (2,
  '0.018*"patient" + 0.016*"study" + 0.011*"use" + 0.006*"risk" + 0.006*"health" + 0.006*"include" + 0.006*"high" + 0.005*"group" + 0.005*"year" + 0.004*"treatment"')]
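Each topic is printed as a single string of `weight*"word"` terms. If the weights are needed as numbers, for example to plot them, the string can be parsed with a small helper. The function `parse_topic` below is a hypothetical convenience written for this example; gensim's `show_topic` method should give similar (word, weight) pairs directly.

```python
import re

def parse_topic(topic_string):
    """Turn '0.014*"use" + 0.008*"base"' into [('use', 0.014), ('base', 0.008)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 0 from the output above
topic_0 = '0.014*"use" + 0.008*"base" + 0.008*"model"'
parsed = parse_topic(topic_0)
print(parsed)
```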

**Compute Coherence Score**


***Topic coherence*** measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.


The coherence measure used in this task is **C_v**.


The ***C_v measure*** is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.

In [20]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.3771509216529963
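The NPMI ingredient of C_v can be sketched in a few lines. The version below is a deliberate simplification: it counts co-occurrence at the level of whole documents instead of a sliding window and leaves out the cosine-similarity step, so it is not the C_v score itself. The function name `npmi` and the toy documents are assumptions.

```python
import math

def npmi(word_a, word_b, docs):
    """Normalized pointwise mutual information from document-level co-occurrence."""
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # convention: the words never occur together
    if p_ab == 1:
        return 1.0   # both words occur in every document; avoids dividing by log(1) = 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Toy documents (hypothetical)
toy_docs = [
    ["cell", "protein"],
    ["cell", "protein", "patient"],
    ["patient", "risk"],
    ["risk"],
]
print(npmi("cell", "protein", toy_docs))  # always together
print(npmi("cell", "patient", toy_docs))  # together as often as chance predicts
print(npmi("cell", "risk", toy_docs))     # never together
```

NPMI ranges from -1 (never together) through 0 (independent) to 1 (always together), so a topic whose top words have high pairwise NPMI reads as more coherent.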


# **6. Hyperparameter Tuning**

First, we need to understand the difference between model hyperparameters and model parameters:



*   ***Model hyperparameters*** are settings for a machine learning algorithm that the data scientist chooses before training, such as the number of topics, alpha and beta.


*   ***Model parameters*** are what the model learns during training, such as the weights for each word in a given topic.


Now that we have a baseline coherence score for the LDA model, we perform a series of sensitivity tests to help determine the following model hyperparameters:

*   Number of topics (`num_topics`)
*   Hyperparameter alpha (document-topic density)
*   Hyperparameter beta (topic-word density; called `eta` in gensim)
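To build intuition for what alpha and beta control, the sketch below draws samples from a symmetric Dirichlet distribution, which is the prior LDA places on each document's topic mixture (alpha) and on each topic's word mixture (beta). It uses only the standard library; the helper names, the choice of 5 components and the 500 samples are assumptions for illustration.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Extremely small concentrations can underflow; fall back to a uniform vector
        return [1.0 / size] * size
    return [d / total for d in draws]

def average_largest_share(concentration, size=5, n_samples=500, seed=0):
    """Average weight of the biggest component: near 1.0 means very sparse mixtures."""
    rng = random.Random(seed)
    samples = (sample_dirichlet(concentration, size, rng) for _ in range(n_samples))
    return sum(max(sample) for sample in samples) / n_samples

# Compare two concentration values of the kind explored during tuning
for value in (0.01, 0.51):
    print(value, round(average_largest_share(value), 3))
```

A smaller concentration pushes almost all of the weight onto a single component, so a low alpha favors documents dominated by one topic and a low beta favors topics dominated by a few words. Larger values spread the weight more evenly.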




In [21]:
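def compute_coherence_values(corpus, dictionary, k, a, b):
    # Train an LDA model with k topics, document-topic prior a (alpha) and topic-word prior b (eta)
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k,
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)

    # Score the model with C_v coherence, using the dictionary passed in rather than the global id2word.
    # Note that texts still comes from the global data_lemmatized.
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')

    return coherence_model_lda.get_coherence()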

In [22]:
import numpy as np
import tqdm

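# Number of topics to try: 3 through 10
topics_range = range(3, 11, 1)

# Alpha values to try: np.arange(0.01, 1, 0.5) yields [0.01, 0.51]
alpha = list(np.arange(0.01, 1, 0.5))

# Beta (eta) values to try: also [0.01, 0.51]
beta = list(np.arange(0.01, 1, 0.5))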


model_results = {
                 'Num_topics':[],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)))


for k in topics_range:
    for a in alpha:
        for b in beta:
            cv = compute_coherence_values(corpus, id2word, k=k, a=a, b=b)
            model_results['Num_topics'].append(k)
            model_results['Alpha'].append(a)
            model_results['Beta'].append(b)
            model_results['Coherence'].append(cv)
            pbar.update(1)

pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
pbar.close()


100%|██████████| 32/32 [40:39<00:00, 76.22s/it]
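The grid has 8 topic counts, 2 alpha values and 2 beta values, which gives the 32 runs shown in the progress bar. The sketch below enumerates the same grid with `itertools.product` and shows one way to pick the best row from a results table. The coherence numbers in `toy_results` are made up for illustration and are not results from this notebook; `pick_best` is a hypothetical helper.

```python
from itertools import product

# Same search space as the tuning cell, written out explicitly
topics_range = range(3, 11)
alpha_values = [0.01, 0.51]
beta_values = [0.01, 0.51]

grid = list(product(topics_range, alpha_values, beta_values))
print(len(grid))  # 8 * 2 * 2 combinations

def pick_best(results):
    """Return the row (a dict) with the highest coherence score."""
    return max(results, key=lambda row: row['Coherence'])

# Made-up rows in the same shape as model_results, for illustration only
toy_results = [
    {'Num_topics': 3, 'Alpha': 0.01, 'Beta': 0.01, 'Coherence': 0.31},
    {'Num_topics': 5, 'Alpha': 0.51, 'Beta': 0.01, 'Coherence': 0.42},
    {'Num_topics': 7, 'Alpha': 0.01, 'Beta': 0.51, 'Coherence': 0.39},
]
best = pick_best(toy_results)
print(best)
```

With the real results, the same selection could be done on the DataFrame saved to `lda_tuning_results.csv`, for example by sorting on the Coherence column.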


# **7. Final Model**

In [23]:
final_model = gensim.models.LdaMulticore(corpus=corpus,
                                         id2word=id2word,
                                         num_topics=9,
                                         random_state=100,
                                         chunksize=100,
                                         passes=10,
                                         alpha=0.51,
                                         eta=0.51)

In [24]:
final_model.print_topics()

[(0,
  '0.005*"channel" + 0.005*"formulation" + 0.004*"oxide" + 0.004*"iron" + 0.004*"nanoparticle" + 0.003*"charge" + 0.003*"compression" + 0.003*"release" + 0.002*"ingredient" + 0.002*"polyphenol"'),
 (1,
  '0.020*"cell" + 0.012*"protein" + 0.008*"effect" + 0.006*"expression" + 0.006*"increase" + 0.006*"show" + 0.006*"activity" + 0.006*"induce" + 0.006*"study" + 0.005*"mechanism"'),
 (2,
  '0.005*"smoking" + 0.005*"smoker" + 0.003*"smoke" + 0.003*"tobacco" + 0.002*"cessation" + 0.002*"cigarette" + 0.001*"nicotine" + 0.001*"exudative" + 0.001*"fry" + 0.001*"apoa"'),
 (3,
  '0.029*"patient" + 0.019*"study" + 0.013*"use" + 0.010*"group" + 0.009*"high" + 0.009*"risk" + 0.008*"treatment" + 0.007*"include" + 0.007*"year" + 0.007*"age"'),
 (4,
  '0.015*"use" + 0.008*"model" + 0.007*"base" + 0.007*"study" + 0.007*"method" + 0.006*"result" + 0.005*"high" + 0.005*"provide" + 0.005*"specie" + 0.004*"different"'),
 (5,
  '0.012*"health" + 0.012*"study" + 0.009*"use" + 0.006*"care" + 0.005*"inter

In [25]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=final_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.48941576578317786
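To put the two coherence scores side by side, the small calculation below compares the base model from section 5 with the final model. Only the two printed scores are taken from the notebook; the variable names are illustrative.

```python
# Scores copied from the notebook output
base_coherence = 0.3771509216529963    # base model: 3 topics, default priors
final_coherence = 0.48941576578317786  # final model: 9 topics, alpha=0.51, eta=0.51

absolute_gain = final_coherence - base_coherence
relative_gain = absolute_gain / base_coherence

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain:.1%}")
```

The final model improves C_v by roughly 0.11, or just under 30% relative to the base model. Coherence is only a proxy for interpretability, so it is still worth reading the topics themselves.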


# **8. Visualize Results**


In [26]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)

In [27]:

import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(final_model, corpus, id2word)
pyLDAvis.save_html(LDAvis_prepared, 'lDAvis.html')

