****Topic Modelling****

***This file contains a Python script that performs topic modelling on Twitter data.***

**Input**: Twitter data as a dataframe with a cleaned text column. Later, the user also supplies the topic names (how many depends on the coherence score).

**Output**: Searches for the best number of topics (from lower_limit up to, but not including, upper_limit, both set by the user) in the Twitter data using the coherence score, and returns a visualization of all the topics, which can be used for insight generation.


**Output_file**: final_pipeline is saved with each tweet mapped to its most probable topic

In [2]:
dbutils.widgets.removeAll()

In [3]:
#import necessary libraries

import pandas as pd
import numpy as np

lower_limit = 4
upper_limit = 10

In [4]:
#import the cleaned twitter file in dataframe format with the cleaned text column

df = pd.read_csv('/dbfs/FileStore/tables/cleaned_drones_pipeline.csv')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,created_at,text,hashtags
0,0,2020-01-16 23:57:08,with law enforcement cracking down on our use ...,drones
1,1,2020-01-16 23:56:22,we re researching ways to accelerate advanced ...,
2,2,2020-01-16 23:53:15,rpas continua la vanguardia de la tecnolog par...,rpas
3,3,2020-01-16 23:51:01,it bird it plane no it and delivering prescrip...,drones
4,4,2020-01-16 23:46:15,drones and accessories visit our online store ...,drones


Now, let us make a working copy and see what the tweets look like in the dataframe format

In [7]:
df1 = df.copy()
df1.head()

Unnamed: 0.1,Unnamed: 0,created_at,text,hashtags
0,0,2020-01-16 23:57:08,with law enforcement cracking down on our use ...,drones
1,1,2020-01-16 23:56:22,we re researching ways to accelerate advanced ...,
2,2,2020-01-16 23:53:15,rpas continua la vanguardia de la tecnolog par...,rpas
3,3,2020-01-16 23:51:01,it bird it plane no it and delivering prescrip...,drones
4,4,2020-01-16 23:46:15,drones and accessories visit our online store ...,drones


Now, let us download the libraries and resources needed to preprocess the tweets specifically for Topic Modelling

In [9]:
#import libraries

import nltk; nltk.download('stopwords')

In [10]:
#downloading spacy

!python -m spacy download en_core_web_sm

In [11]:
#downloading wordnet

nltk.download('wordnet')

In [12]:
#importing libraries

from nltk.stem import WordNetLemmatizer, SnowballStemmer

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import CoherenceModel

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

#lemmatizes a token as a verb, then stems it
def lemmatize_stemming(text):
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))


#tokenizes, removes stopwords and tokens of 3 characters or fewer, then lemmatizes and stems
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


We use lemmatization, followed by stemming, to reduce every word to its base form for topic analysis. We also remove stop words, as they are not useful in topic modelling. As a **sanity check** before applying these functions, let us remove all the rows with null text.

In [14]:
#count the null rows, then drop them from the dataframe itself
print(df1['text'].isnull().sum())
df1 = df1.dropna(subset=['text']).reset_index(drop=True)
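
The two ways of dropping nulls behave differently, which a small sketch on a made-up three-row frame can show. The `toy` frame and its contents are illustrative only and are not part of the pipeline. Assigning a column's `dropna()` result back to that column re-aligns on the index, so the row count stays the same, whereas `DataFrame.dropna(subset=...)` removes the rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for df1; the values are made up.
toy = pd.DataFrame({"text": ["drones are fun", None, "new rpas rules"]})

# Assigning the filtered column back re-aligns on the index,
# so the null row is still present afterwards.
reassigned = toy.copy()
reassigned["text"] = reassigned["text"].dropna()
rows_after_reassign = len(reassigned)

# Dropping on the frame removes the row; reset_index keeps positions contiguous.
dropped = toy.dropna(subset=["text"]).reset_index(drop=True)
rows_after_drop = len(dropped)

print(rows_after_reassign, rows_after_drop)
```

Resetting the index matters here because later cells address tweets by position.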

Now, let's use the above preprocessing functions to preprocess our tweets

In [16]:
df1['text'] = df1['text'].astype('str')

processed_docs = df1['text'].map(preprocess)
processed_docs[:10]

Now we will prepare our tweets for the LDA algorithm. To do that, we need to represent the text as numbers. We will use gensim's Dictionary class to build a vocabulary from the processed tweets; its token2id attribute then gives the mapping from each token to an integer id.

In [18]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [19]:
print(dictionary)

In [20]:
wordtoid = dictionary.token2id

In [21]:
wordtoid
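
As a rough picture of what a token-to-id mapping looks like, here is a minimal pure-Python sketch. The helper `build_token2id` and the toy documents are hypothetical, and gensim may assign the ids in a different order; only the idea of one integer per distinct token carries over.

```python
def build_token2id(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Made-up tokenized tweets, in the list-of-tokens shape that preprocess() returns.
toy_docs = [["drone", "deliveri"], ["drone", "camera", "drone"]]
toy_token2id = build_token2id(toy_docs)
print(toy_token2id)
```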

Now, let us create a bag-of-words (BoW) corpus, in which each tweet becomes a list of (token id, count) pairs built from the mapping generated above

In [23]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
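
To make the bag-of-words representation concrete, here is a minimal sketch of the conversion in plain Python. The helper `doc_to_bow`, the toy mapping and the toy tweet are all made up for illustration; it is meant to mirror the shape of what `doc2bow` returns, not to replace it.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Made-up mapping and tokenized tweet.
toy_token2id = {"drone": 0, "deliveri": 1, "camera": 2}
toy_doc = ["drone", "camera", "drone", "unknown"]
toy_bow = doc_to_bow(toy_doc, toy_token2id)
print(toy_bow)
```

Word order is discarded: only which tokens occur, and how often, is kept.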

In [24]:
len(bow_corpus)

In [25]:
bow_doc_4310 = bow_corpus[4310]
for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time(s).".format(token_id, dictionary[token_id], count))

In [26]:
bow_corpus

To create the features, we will apply TF-IDF weighting to the BoW corpus, which completes our pre-processing. Then, we can run our LDA algorithm.

In [28]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]


from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break
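
For intuition about what the TF-IDF step does to each (token id, count) pair, here is a small pure-Python sketch. It assumes raw counts as term frequency, a base-2 logarithm for the inverse document frequency and unit-length normalization, which we believe matches gensim's defaults but have not verified here. The helper `tfidf_transform` and the toy corpus are hypothetical.

```python
import math

def tfidf_transform(bow_corpus):
    """Weight each count by log2(N / document frequency), then L2-normalize per document."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in doc]
        # Tokens that occur in every document get weight 0 and are dropped.
        weights = [(tid, w) for tid, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_corpus

# Made-up BoW corpus of three tweets; token 0 appears in all of them.
toy_bow_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 1), (2, 2)]]
toy_tfidf = tfidf_transform(toy_bow_corpus)
for doc in toy_tfidf:
    print(doc)
```

A token that appears in every tweet carries no discriminating information, so its weight falls to zero, while rarer tokens are weighted up.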

In [29]:
corpus_tfidf

Now, we will run our LDA algorithm once for each number of topics from lower_limit to upper_limit - 1 (6 times with the current limits, which might change later) and store all the models in model_dict

In [31]:
#running lda models
model_dict = {}
for i in range(lower_limit,upper_limit):
  model_dict["lda_model_tfidf" + str(i)] = gensim.models.LdaMulticore(corpus_tfidf,random_state = 42, num_topics=i, id2word=dictionary, passes=5, workers=4)

In [32]:
model_dict

Now, we will calculate the coherence score for each of the models we ran above and store the scores in coherence_dict

In [34]:
#calculating coherence scores for each of the models ran above 
coherence_dict = {}
for i in range(lower_limit,upper_limit):
  coherence_model = CoherenceModel(model=model_dict["lda_model_tfidf" + str(i)], texts=processed_docs, dictionary=dictionary, coherence='c_v')
  coherence_dict["coherence_model" + str(i)] = coherence_model.get_coherence() 

In [35]:
coherence_dict

Now, let us find the model that gave the maximum coherence score and then use it to visualize the topics

In [37]:
temp = max(coherence_dict.values()) 
#temp is max coherence value
max_coherence_model = [key for key in coherence_dict if coherence_dict[key] == temp] 

In [38]:
max_coherence_model

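In [39]:
#number of topics of the best model: everything after the "coherence_model" prefix, so two-digit counts also work
max_coherence_model[0][len("coherence_model"):]

In [40]:
#use the max coherence key to fetch that particular lda model
model_dict["lda_model_tfidf" + max_coherence_model[0][len("coherence_model"):]]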
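
An alternative that avoids parsing the topic count out of a string key is to key the scores by the integer count directly. The sketch below uses made-up placeholder scores (they are not results from this notebook) and hypothetical names, and it also shows what the last character of a key such as `coherence_model10` would be.

```python
# Placeholder coherence scores keyed by number of topics (illustrative values only).
toy_coherence = {4: 0.31, 9: 0.33, 10: 0.36, 11: 0.29}

# Pick the topic count with the highest score; ties resolve to the first key seen.
best_num_topics = max(toy_coherence, key=toy_coherence.get)

# Rebuild the model_dict key from the integer instead of slicing a string.
best_model_key = "lda_model_tfidf" + str(best_num_topics)

# Taking only the last character of a two-digit key loses the leading digit.
last_char = "coherence_model10"[-1]

print(best_num_topics, best_model_key, last_char)
```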

In [41]:
#importing pyLDAvis
#note: in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models

import pyLDAvis
import pyLDAvis.gensim


As a final step, let us visualize the topics. Hurray!

In [43]:
LDAvis_prepared = pyLDAvis.gensim.prepare(model_dict["lda_model_tfidf" + max_coherence_model[0][len("coherence_model"):]], corpus_tfidf, dictionary)
pyLDAvis.display(LDAvis_prepared)

In [44]:
pyLDAvis.save_html(LDAvis_prepared, '/dbfs/FileStore/tables/topic_viz.html')

In [45]:
lda_model = model_dict["lda_model_tfidf" + max_coherence_model[0][len("coherence_model"):]]

In [46]:
lda_model

In [47]:
lda_model.save('/dbfs/FileStore/tables/lda_pipeline_model.gensim')

END

In [49]:
lda_model =  models.LdaModel.load('/dbfs/FileStore/tables/lda_pipeline_model.gensim')

Extra code for reporting purposes

In [51]:
for idx, topic in lda_model.print_topics(num_words= 25):
    print('Topic: {} Word: {}'.format(idx, topic))
    print('')


In [52]:
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus_tfidf, dictionary)
pyLDAvis.display(LDAvis_prepared)

In [53]:
pyLDAvis.display(LDAvis_prepared)

In [54]:
#making a list of topics for each tweet (each tweet is assigned the topic with the highest probability)
d = []
for k in range(0,df1.shape[0]):
  #lda_model[...] returns (topic_id, probability) pairs and may omit low-probability topics,
  #so take the topic id from the pair itself rather than from its position in the list
  topic_probs = lda_model[bow_corpus[k]]
  best_topic, best_prob = max(topic_probs, key=lambda pair: pair[1])
  d.append(best_topic)
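
When picking the dominant topic, it helps to remember that gensim returns (topic_id, probability) pairs and, as far as we know, can leave out topics whose probability falls below its minimum_probability threshold. The position of a pair in the list is therefore not necessarily its topic id. A minimal sketch with a made-up sparse result (the helper `dominant_topic` is hypothetical):

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability from (topic_id, prob) pairs."""
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

# Made-up sparse output: topics 0 and 2 fell below the cutoff and are absent.
toy_topic_probs = [(1, 0.15), (3, 0.80), (4, 0.05)]

# Position of the largest probability in the list, which differs from the topic id here.
position_of_max = max(range(len(toy_topic_probs)), key=lambda i: toy_topic_probs[i][1])

print(dominant_topic(toy_topic_probs), position_of_max)
```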

In [55]:
#sanity check

len(d)

In [56]:
#total number of distinct topics assigned to at least one tweet

total_topics = len(set(d))

In [57]:
df1['topic'] = np.nan
df1['topic'] = pd.Series(d)

In [58]:
print('total number of topics is: {}'.format(total_topics))

In [59]:
#Naming topics according to words and intuitions

#topic0_name = 'Industry applications'
#topic1_name = 'Drone Accessories'
#topic2_name = 'Photography'
#topic3_name = 'Geo-political'
#topic4_name = 'AI and Future'

Now, we will ask the user for a name for each topic and add the names to the dataframe as a column

In [61]:
#will ask the user for topic names and then append them to the list

top_nam = []

In [62]:
dbutils.widgets.text("topic_n", "enter_name")
topic = dbutils.widgets.get("topic_n")
print(topic)

In [63]:
top_nam.append(topic)
print(top_nam)

In [64]:
print(top_nam)
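
One possible way to collect every topic name from a single text input is to accept a comma-separated string and split it. This is only a sketch under that assumption: `parse_topic_names` is a hypothetical helper, and the string below stands in for whatever the widget would return.

```python
def parse_topic_names(raw, expected_count):
    """Split a comma-separated string into topic names and check how many were given."""
    names = [name.strip() for name in raw.split(",") if name.strip()]
    if len(names) != expected_count:
        raise ValueError(
            "expected {} topic names, got {}".format(expected_count, len(names))
        )
    return names

# Hypothetical widget value; in Databricks this string would come from dbutils.widgets.get.
raw_value = "Industry applications, Drone Accessories, Photography"
toy_names = parse_topic_names(raw_value, expected_count=3)
print(toy_names)
```

Checking the count up front means a missing name is reported immediately instead of surfacing later as an index error.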

Now, let us add all the names given by the user to the dataframe

In [66]:
#pairing each topic label with the name given by the user
name_mapping = []
for i in range(0,total_topics):
  name_mapping.append(['topic'+str(i)+'_name',top_nam[i]])

#adding topic names to the dataframe
df1['topic_name'] = None
for i in range(0,total_topics):
  df1.loc[df1.topic == i, ['topic_name']] = name_mapping[i][1]

df1.head()
  


Unnamed: 0.1,Unnamed: 0,created_at,text,hashtags,topic,topic_name
0,0,2020-01-16 23:57:08,with law enforcement cracking down on our use ...,drones,2,Photography
1,1,2020-01-16 23:56:22,we re researching ways to accelerate advanced ...,,0,Industry applications
2,2,2020-01-16 23:53:15,rpas continua la vanguardia de la tecnolog par...,rpas,4,AI and Future
3,3,2020-01-16 23:51:01,it bird it plane no it and delivering prescrip...,drones,4,AI and Future
4,4,2020-01-16 23:46:15,drones and accessories visit our online store ...,drones,1,Drone Accessories


In [67]:
#taking relevant columns

topic_analysis = df1[['text', 'created_at', 'topic','topic_name']]

In [68]:
topic_analysis.head()

Unnamed: 0,text,created_at,topic,topic_name
0,with law enforcement cracking down on our use ...,2020-01-16 23:57:08,2,Photography
1,we re researching ways to accelerate advanced ...,2020-01-16 23:56:22,0,Industry applications
2,rpas continua la vanguardia de la tecnolog par...,2020-01-16 23:53:15,4,AI and Future
3,it bird it plane no it and delivering prescrip...,2020-01-16 23:51:01,4,AI and Future
4,drones and accessories visit our online store ...,2020-01-16 23:46:15,1,Drone Accessories


In [69]:
#saving final topic analysis file for reporting



#topic_analysis.to_csv('/dbfs/FileStore/tables/topic_pipeline.csv')