# NLP - Topic Modeling for Health Survey Data
EAI 6000 - M6 Assignment
By Omkar Sadekar

In [39]:
#Importing Required Libraries
import bamboolib as bam 
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

y=pd.read_csv(r'medquad.csv')

In [42]:
y.head(30)

                                             question  \
0                            What is (are) Glaucoma ?   
1                              What causes Glaucoma ?   
2                 What are the symptoms of Glaucoma ?   
3              What are the treatments for Glaucoma ?   
4                            What is (are) Glaucoma ?   
5                            What is (are) Glaucoma ?   
6                            What is (are) Glaucoma ?   
7                      Who is at risk for Glaucoma? ?   
8                           How to prevent Glaucoma ?   
9                 What are the symptoms of Glaucoma ?   
10             What are the treatments for Glaucoma ?   
11  what research (or clinical trials) is being do...   
12                     Who is at risk for Glaucoma? ?   
13                           What is (are) Glaucoma ?   
14                What is (are) High Blood Pressure ?   
15                  What causes High Blood Pressure ?   
16          Who is at risk for 

In [17]:
df = y.dropna()

In [18]:
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 
import string
from nltk import word_tokenize, pos_tag

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/omkarsadekar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/omkarsadekar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [23]:
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [20]:
data_nouns = pd.DataFrame(df.answer.apply(nouns))

In [28]:
data_nouns

                                                  answer
0      Glaucoma group diseases eye nerve result visio...
1      people cause blindness United States anyone gl...
2      Symptoms Glaucoma Glaucoma eyes type glaucoma ...
3      glaucoma treatments vision sight glaucoma trea...
4      Glaucoma group diseases eye nerve result visio...
...                                                  ...
16407  Focal nerves head torso leg Focal neuropathy i...
16408  way blood glucose levels range blood glucose l...
16409  Doctors basis symptoms exam exam doctor blood ...
16410  treatment step blood glucose levels range nerv...
16411  Diabetic neuropathies disorders abnormalities ...

[16393 rows x 1 columns]

In [25]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [26]:

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['medical', 'patient', 'disease','treatment','symptoms','condition','healthcare','hospital','doctor','nurse','medication','diagnosis','surgery','laboratory','research''study','journal','clinical','trial','therapy','intervention','outcome','prognosis','chronic','acute']
stop_words = ENGLISH_STOP_WORDS.union(add_stop_words)


In [27]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_nouns.answer)
data_dtmn = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtmn.index = data_nouns.index


In [30]:
from gensim import matutils, models
import scipy.sparse


In [31]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cv.vocabulary_.items())

In [37]:
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.036*"gene" + 0.020*"syndrome" + 0.018*"cells" + 0.018*"mutations" + 0.014*"disorder" + 0.014*"cell" + 0.013*"cancer" + 0.012*"individuals" + 0.011*"protein" + 0.010*"people"'),
 (1,
  '0.031*"blood" + 0.017*"health" + 0.016*"people" + 0.014*"care" + 0.013*"diabetes" + 0.013*"kidney" + 0.010*"body" + 0.009*"heart" + 0.009*"provider" + 0.009*"test"'),
 (2,
  '0.038*"people" + 0.038*"symptom" + 0.027*"number" + 0.026*"information" + 0.026*"patients" + 0.026*"signs" + 0.025*"frequency" + 0.020*"sign" + 0.019*"study" + 0.013*"human"')]


Topic 0: Genetics and Mutations
Keywords: gene, syndrome, cells, mutations, disorder, cell, cancer, individuals, protein, people

This topic seems to be focused on genetic factors and mutations. It mentions genes, syndromes, cells, and mutations, indicating a discussion related to genetic disorders or diseases. The presence of terms like cancer and protein suggests a possible exploration of genetic links to cancer and related research. The topic also mentions individuals, indicating a potential focus on personalized medicine or genetic variations among people.

Topic 1: Health Conditions and Medical Tests
Keywords: blood, health, people, care, diabetes, kidney, body, heart, provider, test

This topic revolves around health conditions and medical tests. The keyword blood suggests a focus on blood-related health issues or blood tests. Terms like diabetes, kidney, and heart indicate discussions related to specific health conditions. The presence of keywords like care and provider suggests a consideration of healthcare services or healthcare providers. Overall, this topic seems to cover a broad range of health-related topics.

Topic 2: Symptoms and Patient Studies
Keywords: people, symptom, number, information, patients, signs, frequency, sign, study, human

This topic centers around symptoms and patient studies. The keyword people suggests a focus on how symptoms or conditions affect individuals. The presence of terms like symptom, signs, and frequency indicates a discussion about the manifestation and frequency of symptoms. The mention of patients and studies suggests a possible exploration of patient-centered research or clinical studies. The topic also includes keywords like information and human, indicating a broader consideration of various aspects related to symptoms and patient experiences.

