**Topic Modelling:**

**Topic modelling** is an unsupervised approach to discovering the topics present in a collection of documents. Here it is done with **LDA (Latent Dirichlet Allocation)**, which infers topics from patterns of word co-occurrence and word frequencies across the documents: each topic is a distribution over words, and each document is a mixture of topics.
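
To make the idea concrete, here is a minimal NumPy sketch of the generative story that LDA assumes and that fitting the model tries to invert. The vocabulary, topic count, document length, and Dirichlet parameters are all hypothetical values chosen for illustration; nothing here comes from the dataset used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = np.array(["patient", "therapy", "dose", "liver", "rash", "fever"])
n_topics, doc_length = 2, 8

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Each document has its own mixture over topics.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# For every word position: pick a topic, then pick a word from that topic.
topic_per_word = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = [str(rng.choice(vocab, p=topic_word[z])) for z in topic_per_word]

print("topic mixture:", doc_topic.round(2))
print("generated document:", " ".join(words))
```

Fitting LDA goes in the opposite direction: given only the observed words, it estimates the per-topic word distributions and the per-document topic mixtures.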

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r"D:\data_text.csv")  # raw string so the backslash is not read as an escape sequence
df.head()

Unnamed: 0,ID,tweets,label
0,413205,Intravenous azithromycin-induced ototoxicity.,1
1,528244,"Immobilization, while Paget's bone disease was...",1
2,361834,Unaccountable severe hypercalcemia in a patien...,1
3,292240,METHODS: We report two cases of pseudoporphyri...,1
4,467101,METHODS: We report two cases of pseudoporphyri...,1


In [3]:
df.label.unique()

array([1, 0], dtype=int64)

In [4]:
df.tweets[1]

"Immobilization, while Paget's bone disease was present, and perhaps enhanced activation of dihydrotachysterol by rifampicin, could have led to increased calcium-release into the circulation."

In [5]:
df = df.dropna()  # dropna() returns a new DataFrame; reassign to keep the result
df

Unnamed: 0,ID,tweets,label
0,413205,Intravenous azithromycin-induced ototoxicity.,1
1,528244,"Immobilization, while Paget's bone disease was...",1
2,361834,Unaccountable severe hypercalcemia in a patien...,1
3,292240,METHODS: We report two cases of pseudoporphyri...,1
4,467101,METHODS: We report two cases of pseudoporphyri...,1
...,...,...,...
23511,146275,"At autopsy, the liver was found to be small, s...",0
23512,375409,"Physical exam revealed a patient with aphasia,...",0
23513,246581,At the time when the leukemia appeared seven o...,0
23514,534599,The American Society for Regional Anesthesia a...,0


In [6]:
df.drop(['label'],axis=1,inplace=True)

In [7]:
df = df.head(1000).copy()  # copy so that adding a column later does not raise SettingWithCopyWarning

In [8]:
df.shape

(1000, 2)

In [9]:
df.head(5)

Unnamed: 0,ID,tweets
0,413205,Intravenous azithromycin-induced ototoxicity.
1,528244,"Immobilization, while Paget's bone disease was..."
2,361834,Unaccountable severe hypercalcemia in a patien...
3,292240,METHODS: We report two cases of pseudoporphyri...
4,467101,METHODS: We report two cases of pseudoporphyri...


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(df['tweets'].values.astype('U'))

In [11]:
doc_term_matrix  # each of the 1000 documents is a 1608-dimensional count vector, i.e. the vocabulary has 1608 terms

<1000x1608 sparse matrix of type '<class 'numpy.int64'>'
	with 11565 stored elements in Compressed Sparse Row format>
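
The effect of `max_df` and `min_df` is easier to see on a tiny corpus. The sketch below uses a hypothetical four-document corpus (not taken from the dataset) with the same vectorizer settings as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus, not taken from the dataset above.
docs = [
    "drug induced rash",
    "drug induced fever",
    "drug induced hepatitis and rash",
    "drug toxicity",
]

# max_df=0.8 drops terms found in more than 80% of documents ("drug": 4 of 4).
# min_df=2 drops terms found in fewer than 2 documents ("fever", "hepatitis", "toxicity").
# stop_words="english" drops common function words such as "and".
toy_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words="english")
toy_matrix = toy_vect.fit_transform(docs)

vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
print(toy_matrix.shape)
```

Only "induced" and "rash" survive both filters, so the last document ends up as an all-zero row. On real data, overly strict thresholds can leave some documents empty in the same way.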

In [12]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# vect =TfidfVectorizer(stop_words='english',max_features=1000)
# vect_text=vect.fit_transform(df['tweets'])

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=2, random_state=42) #n_components=n_topics
LDA.fit(doc_term_matrix)
# The parameter n_components specifies the number of categories, or topics, that we want our text to be divided into.

LatentDirichletAllocation(n_components=2, random_state=42)

In [14]:
print("LDA model Perplexity on train data", LDA.perplexity(doc_term_matrix))

LDA model Perplexity on train data 1017.2050096502307
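
Perplexity measured on the training data tends to be optimistic, because the model is scored on the documents it was fitted to. One common alternative is to compare several values of `n_components` on a held-out split. The sketch below shows the mechanics on a hypothetical stand-in corpus; the sentences, the candidate topic counts, and the 25% split are all assumptions, and the numbers it prints say nothing about the dataset above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in corpus; in the notebook this would be df['tweets'].
base_docs = [
    "patient developed severe hepatitis after treatment",
    "hepatitis induced by long term therapy",
    "case report of drug induced rash",
    "rash and fever presented after therapy",
    "patient presented with severe fever",
    "treatment associated with liver toxicity",
    "liver toxicity developed in two patients",
    "case of syndrome associated with therapy",
]
# Repeated only so the toy example has enough rows to split;
# real held-out documents should be distinct from the training ones.
docs = base_docs * 5

matrix = CountVectorizer(stop_words="english").fit_transform(docs)
train, test = train_test_split(matrix, test_size=0.25, random_state=42)

held_out_perplexity = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(train)
    # Lower perplexity means the model assigns higher likelihood to the held-out rows.
    held_out_perplexity[k] = model.perplexity(test)

for k, value in held_out_perplexity.items():
    print(f"n_components={k}: held-out perplexity {value:.1f}")
```

Perplexity is only one signal; the top words of each topic are usually inspected as well before settling on a topic count.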


In [15]:
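feature_names = count_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    # argsort is ascending, so the last 10 indices are the highest-weighted words
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')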

Top 10 words for topic #0:
['case', 'hepatitis', 'severe', 'year', 'old', 'patient', 'induced', 'developed', 'patients', 'treatment']


Top 10 words for topic #1:
['presented', 'year', 'cases', 'therapy', 'syndrome', 'induced', 'associated', 'patient', 'report', 'case']




In [16]:
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

(1000, 2)

In [17]:
# composition of doc 0 for eg
print("Document 0: ")
for i,topic in enumerate(topic_values[0]):
  print("Topic ",i,": ",topic*100,"%")

Document 0: 
Topic  0 :  27.19388049284014 %
Topic  1 :  72.80611950715985 %


In [18]:
# composition of doc 1 for eg
print("Document 1: ")
for i,topic in enumerate(topic_values[1]):
  print("Topic ",i,": ",topic*100,"%")

Document 1: 
Topic  0 :  91.34509412537913 %
Topic  1 :  8.654905874620873 %


In [19]:
df['Topic'] = topic_values.argmax(axis=1)
df.head()

Unnamed: 0,ID,tweets,Topic
0,413205,Intravenous azithromycin-induced ototoxicity.,1
1,528244,"Immobilization, while Paget's bone disease was...",0
2,361834,Unaccountable severe hypercalcemia in a patien...,0
3,292240,METHODS: We report two cases of pseudoporphyri...,1
4,467101,METHODS: We report two cases of pseudoporphyri...,1
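
Taking the `argmax` discards how confident the assignment is. A possible refinement, sketched below, keeps the winning proportion next to the label and flags documents whose topics are nearly tied. The first two rows reuse the rounded proportions printed for documents 0 and 1; the third row, the column names `Topic_prob` and `ambiguous`, and the 0.6 cutoff are hypothetical.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1: rounded proportions from the notebook output; row 2 is hypothetical.
topic_values = np.array([
    [0.272, 0.728],
    [0.913, 0.087],
    [0.520, 0.480],
])

result = pd.DataFrame({
    "Topic": topic_values.argmax(axis=1),      # index of the dominant topic
    "Topic_prob": topic_values.max(axis=1),    # proportion of that topic
})
# 0.6 is an arbitrary illustrative cutoff, not a recommended value.
result["ambiguous"] = result["Topic_prob"] < 0.6

print(result)
```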


In [20]:
# visualization
import warnings
warnings.simplefilter("ignore", FutureWarning)

import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
pyLDAvis.enable_notebook()
vis = pyLDAvis.sklearn.prepare(LDA, doc_term_matrix, count_vect, mds='tsne')
pyLDAvis.save_html(vis, fileobj="vis.html")  # save_html returns None, so there is nothing to assign