## Topic Modelling

This notebook identifies the topics that people discuss in our dataset of tweets about vaccines. Many topic-modelling methods are available; this notebook uses only **LDA (Latent Dirichlet Allocation)**.

For data protection purposes, the dataset used in this notebook is not provided here. If you want to replicate the notebook using this dataset, please contact the authors.


#### Input
- A dataset with tweets ready to be used by our LDA algorithm: `vacc_proc_for_topicMdl.csv`

#### Output
- An HTML file for visualising the discovered topics: `Vaccs_Notts_topic_7.html`
- A dataset with tweets mapped to their main topic: `topics_mapped_Vaccs_Notts.csv`

In [None]:
# ----------------------------------------
# Libraries that need to be installed
# ----------------------------------------

!pip install pyLDAvis
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_sm


# ----------------------------------------    
# For File operations
# ----------------------------------------

import zipfile
import os

# ----------------------------------------
# Data read, write and other operations on Texts
# ----------------------------------------

import pandas as pd
import numpy as np
import string
import re
import unicodedata
from pprint import pprint

# ----------------------------------------
# Libraries for NLP applications
# ----------------------------------------

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import gensim
import spacy
spcy = spacy.load('/opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages/en_core_web_sm/en_core_web_sm-2.3.1')
from gensim import corpora
from gensim.models import CoherenceModel

# ----------------------------------------
# For ignoring some warnings
# ----------------------------------------

import warnings
warnings.filterwarnings('ignore')
def wrng():
  warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
  warnings.simplefilter("ignore")
  wrng()

# ----------------------------------------    
# For Visualizations
# ----------------------------------------

import matplotlib
import matplotlib.pyplot as plt
import pyLDAvis
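try:
    import pyLDAvis.gensim_models as pygen  # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as pygen         # module name in older pyLDAvis releases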
pyLDAvis.enable_notebook()

# ----------------------------------------    
# Need to download some extras
# ----------------------------------------

nltk.download('punkt')
nltk.download('stopwords')

### Load Dataset
The dataset loaded here is produced by the `TopicModelling_Vaccine_Preprocessing` notebook.

In [None]:
processed_tweets_Vaccs_ = pd.read_csv("/project_data/data_asset/vacc_proc_for_topicMdl.csv")
pd.set_option('display.max_columns', None)  # Showing all columns for that dataframe

### Filtering data related to 'Nottingham' 

In [None]:
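# .copy() gives an independent dataframe, so later column assignments do not act on a slice of the original
notts_tweets_Vaccs_ = processed_tweets_Vaccs_[processed_tweets_Vaccs_["City"] == "Nottingham"].copy()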

### Part-of-Speech tagging
We keep only adjectives, nouns, proper nouns and verbs, because other parts of speech tend to add noise to the topics.

In [None]:
keep_pos = {"ADJ", "NOUN", "PROPN", "VERB"}  # parts of speech to keep

sentences = []
for line in notts_tweets_Vaccs_["Clean_sentence_Comment"]:
    doc = spcy(line)  # run the spaCy pipeline to get a POS tag for each token

    # rebuild the sentence from the tokens whose POS tag is in keep_pos
    sentence2 = " ".join(token.text for token in doc if token.pos_ in keep_pos)
    sentences.append(sentence2)
    
notts_tweets_Vaccs_["Clean_sentence_Comment"] = sentences
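The filter above is an allow-list on the coarse part-of-speech tag. A minimal sketch of that logic follows. It uses hand-tagged `(token, tag)` pairs in place of a spaCy pipeline so that it runs without a language model. The example tags and the helper name `keep_content_words` are illustrative assumptions, not output from the notebook's data.

```python
# Coarse POS tags kept by the filter in this notebook.
KEEP_POS = {"ADJ", "NOUN", "PROPN", "VERB"}

def keep_content_words(tagged_tokens):
    """Join the tokens whose POS tag is in KEEP_POS (hypothetical helper)."""
    return " ".join(text for text, pos in tagged_tokens if pos in KEEP_POS)

# Hand-tagged example standing in for spaCy output (tags are assumed).
tagged = [("the", "DET"), ("new", "ADJ"), ("vaccine", "NOUN"),
          ("is", "AUX"), ("very", "ADV"), ("safe", "ADJ")]

filtered = keep_content_words(tagged)
print(filtered)  # new vaccine safe
```

Determiners, auxiliaries and adverbs are dropped, which leaves the words most likely to carry topical meaning.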

### Filtering words
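We remove the rarest and the most common words: `filter_extremes` drops tokens that appear in fewer than `no_below` documents (an absolute count) or in more than `no_above` of the documents (a fraction of the corpus).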

In [None]:
words = [text.split() for text in notts_tweets_Vaccs_["Clean_sentence_Comment"]]
dict_words = corpora.Dictionary(words)
dict_words.filter_extremes(no_below=5, no_above=0.2) 
dict_words.compactify()
myCorpus_notts = [dict_words.doc2bow(word) for word in words]
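To make the two thresholds concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. It approximates the behaviour described above and is not gensim's implementation: it ignores the `keep_n` cap, and the helper name `filter_by_doc_freq` and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    # Count each token once per document to get its document frequency.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))  # fraction converted to a document count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus: four tokenised "tweets".
docs = [
    ["vaccine", "dose", "nhs"],
    ["vaccine", "dose", "travel"],
    ["vaccine", "nhs", "school"],
    ["vaccine", "dose", "nhs"],
]

kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['dose', 'nhs']
```

Here "vaccine" is removed because it occurs in every document, while "travel" and "school" are removed because they occur in only one.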

### Training LDA Model
Here we train one LDA model for each combination of topic count and `alpha` value, and record the coherence score and log-perplexity of each model. We use the coherence score to choose the best model.

In [None]:

MulLda_coherent_scores = []
MulLda_topics_val = []
MulLda_perplexity_val = []
alpha_val = [0.05, 0.1, 0.3, 0.5, 0.8, 1]
MulLda_alphas = []


for topics in range(3, 15, 2):
        for alph in alpha_val:

            lda_model_multi_notts = gensim.models.LdaMulticore(corpus = myCorpus_notts,
                         id2word = dict_words,
                         random_state = 42,
                         num_topics = topics,
                         passes=10,
                         chunksize=512,
                         alpha=alph,
                         offset=64,
                         eta=None,
                         iterations=100,
                         per_word_topics=True,
                         workers=6)
  
            coherence_model_MulLda_notts = CoherenceModel(model = lda_model_multi_notts, 
                                       texts = words, 
                                       dictionary = dict_words, 
                                       coherence = 'c_v')
  
            coherence_MulLda = coherence_model_MulLda_notts.get_coherence()
            perplexity_MulLda = lda_model_multi_notts.log_perplexity(myCorpus_notts)
        
            MulLda_topics_val.append(topics)
            MulLda_alphas.append(alph)        
            MulLda_coherent_scores.append(coherence_MulLda)
            MulLda_perplexity_val.append(perplexity_MulLda)
            

            
df_mulLDA_notts = pd.DataFrame(list(zip(MulLda_topics_val, MulLda_alphas, MulLda_coherent_scores, MulLda_perplexity_val)), 
                         columns = ["MulLda_Topic_Num", "MulLda_alpha_val", "MulLda_Coherent_score", "MulLda_Perplexity_val"])

df_mulLDA_notts.sort_values("MulLda_Coherent_score", axis = 0, ascending = False, 
                     inplace = True) 

df_mulLDA_notts.head()
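As a quick check on the size of the search, the sketch below enumerates the same grid with the standard library and ranks a handful of results by coherence. The coherence values in `toy_results` are invented purely to demonstrate the ranking step, and `top_candidates` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

topic_counts = list(range(3, 15, 2))          # same range as the loop above
alpha_values = [0.05, 0.1, 0.3, 0.5, 0.8, 1]  # same alpha grid as above

# Every (num_topics, alpha) pair that the nested loops visit.
grid = list(product(topic_counts, alpha_values))
print(topic_counts)  # [3, 5, 7, 9, 11, 13]
print(len(grid))     # 36

def top_candidates(results, k=5):
    """Return the k (num_topics, alpha, coherence) rows with the highest coherence."""
    return sorted(results, key=lambda r: r[2], reverse=True)[:k]

# Invented coherence values, used only to show the ranking step.
toy_results = [(3, 0.05, 0.41), (7, 0.05, 0.47), (9, 0.1, 0.49), (5, 0.3, 0.44)]
shortlist = top_candidates(toy_results, k=2)
print(shortlist)  # [(9, 0.1, 0.49), (7, 0.05, 0.47)]
```

The grid therefore trains 36 models, which is worth keeping in mind when adding further hyperparameters to the search.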

### Final Model
After choosing the best hyperparameters from the dataframe above using the coherence metric, we train the final model. Note that we did not rely solely on the highest coherence value; among the top-scoring models, we chose the one whose topics made the most sense based on our experience.

The cell below prints the top words of each topic and displays an interactive visualisation of the topic clusters.

In [None]:

multi_lda_final_notts = gensim.models.LdaMulticore(corpus = myCorpus_notts,
                         id2word = dict_words,
                         random_state = 42,
                         num_topics = 7,
                         passes=10,
                         chunksize=512,
                         alpha=0.05,
                         offset=64,
                         eta=None,
                         iterations=100,
                         per_word_topics=True,
                         workers=6)

pprint(multi_lda_final_notts.print_topics(num_topics = 7, num_words=20))

print("\n\033[91m" + "\033[1m" +"------- Visualization -----------\n")

lda_Mul_vis_notts = pygen.prepare(multi_lda_final_notts, myCorpus_notts, dict_words)
pyLDAvis.display(lda_Mul_vis_notts)


### Saving Topics as html

In [None]:
pyLDAvis.save_html(lda_Mul_vis_notts, "/project_data/data_asset/Vaccs_Notts_topic_7.html")

### Mapping Tweets with Topics

In [None]:

topicss = []
probss = []

for row in multi_lda_final_notts[myCorpus_notts]:     # one result per tweet
    # With per_word_topics=True each result is a tuple; row[0] holds the
    # (topic_num, probability) pairs for the document.
    topic_num, probability = max(row[0], key=lambda x: x[1])    # most probable topic
    topicss.append(topic_num)
    probss.append(probability)
            
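# The cells below refer to the dataframe as `Notts_tweets_Vaccs_` (capital N),
# so create it from the filtered dataframe before adding the topic columns.
Notts_tweets_Vaccs_ = notts_tweets_Vaccs_.copy()
Notts_tweets_Vaccs_["Topic_Num"] = topicss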
Notts_tweets_Vaccs_["Topic_prob"] = probss

Notts_tweets_Vaccs_.head()
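The mapping step reduces each tweet's topic distribution to its single most probable topic. A minimal sketch of that reduction on one hand-written distribution follows. The probabilities are made up for illustration, and `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Illustrative topic distribution for one document (values are made up).
doc_topics = [(0, 0.10), (3, 0.72), (5, 0.18)]

topic_id, prob = dominant_topic(doc_topics)
print(topic_id, prob)  # 3 0.72
```

Keeping the probability alongside the topic number makes it possible to filter out tweets whose dominant topic is only weakly supported.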



### Final Dataset
We assign a descriptive name to each topic and map the names onto the tweets.

In [None]:
"""

ListToStr: concatenates the string items of `list_` into a single string

"""
def ListToStr(list_):
    return "".join(list_)

dts = []

for dttt in Notts_tweets_Vaccs_["Date"]:
    yrs_ = re.findall(r"\d{4}", dttt)
    dts.append(ListToStr(yrs_))
    
Notts_tweets_Vaccs_["year"] = dts

Notts_tweets_Vaccs_["Date"] = pd.to_datetime(Notts_tweets_Vaccs_["Date"]).dt.date

# Human-readable names for the seven topics, keyed by topic number
topic_names = {
    0: "Effects of virus and vaccine",
    1: "Politics in US around vaccine",
    2: "Enforcement of vaccines",
    3: "Politics in UK around vaccine",
    4: "Science around Vaccine",
    5: "Public affairs",
    6: "Distribution of vaccine and logistics",
}

Notts_tweets_Vaccs_["Topic_Names"] = Notts_tweets_Vaccs_["Topic_Num"].map(topic_names)


tyms = []

for tym in Notts_tweets_Vaccs_["Date"].values.tolist():
    tym_ = tym.strftime('%d-%b')
    tyms.append(tym_)
    
Notts_tweets_Vaccs_["Date_month"] = tyms
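The date handling above can be sketched on a single value using only the standard library. The timestamp format in `raw` is an assumption, since the layout of the `Date` column is not shown here. The second example shows a caveat of joining every four-digit match: a string with more than one run of four digits yields a concatenated "year".

```python
import re
from datetime import datetime

raw = "2020-12-08 14:31:05"  # assumed timestamp format, for illustration only

# Same idea as the notebook: join every run of four digits found in the string.
year = "".join(re.findall(r"\d{4}", raw))
print(year)  # 2020

# Day and abbreviated month, as in the Date_month column.
day_month = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").strftime("%d-%b")
print(day_month)  # 08-Dec (month abbreviation depends on the locale)

# Caveat: any other four-digit run is joined onto the year as well.
noisy = "08/12/2020 id 12345"
print("".join(re.findall(r"\d{4}", noisy)))  # 20201234
```

If the column can contain extra digits, taking only the first match, or parsing the date and reading its `year` attribute, would be more robust choices.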


### Saving Final dataset

In [None]:
Notts_tweets_Vaccs_.to_csv('/project_data/data_asset/topics_mapped_Vaccs_Notts.csv', index = False)

  
  
### Author:

-  **Ananda Pal** is a Data Scientist and Performance Test Analyst at IBM, where he specialises in Data Science and Machine Learning Solutions

Copyright © IBM Corp. 2020. Licensed under the Apache License, Version 2.0. Released as licensed Sample Materials.