
In [10]:
import io
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd 
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from gensim.models import Word2Vec
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
no_topics = 5 #@param {type:"integer"}

no_top_words = 4 #@param {type:"integer"}

no_top_documents = 3 #@param {type:"integer"}
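
As a rough sketch of what these parameters control in the cells that follow: `no_topics` sets the number of components of the factorization, so the document-topic matrix gets one column per topic and the topic-term matrix one row per topic, while `no_top_words` and `no_top_documents` only limit how much is printed. The toy titles and the name `n_topics` below are invented for illustration and are not part of the notebook's data.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles, invented for illustration (not taken from pubmed_results.csv).
docs = [
    "breast cancer cell growth",
    "cancer stem cell therapy",
    "home palliative care",
    "palliative care at home for patients",
]
n_topics = 2  # plays the role of no_topics

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)      # shape: (n_docs, n_terms)

model = NMF(n_components=n_topics, init="nndsvd", random_state=1)
W = model.fit_transform(X)              # document-topic matrix: (n_docs, n_topics)
H = model.components_                   # topic-term matrix: (n_topics, n_terms)

n_terms = len(vectorizer.vocabulary_)
print(X.shape, W.shape, H.shape)
```

Raising `n_topics` adds columns to `W` and rows to `H`; the number of terms is fixed by the vectorizer, not by the topic count.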

In [None]:
df = pd.read_csv("pubmed_results.csv")
df.dropna(inplace=True)
df

In [4]:
all_sentences = df['title'].to_numpy()
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

In [5]:
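from nltk.corpus import stopwords
# Build the stop-word set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words('english'))
# Compare in lowercase: the NLTK list is all lowercase, so "The" would otherwise be kept.
all_words = [[w for w in words if w.lower() not in stop_words] for words in all_words]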
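
The filtering step can be illustrated without the NLTK data. This is a minimal sketch: `STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')`, and `remove_stop_words` is a hypothetical helper name. It shows why the comparison is usually done in lowercase and why a set is preferable to a list for membership tests.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
STOP_WORDS = {"the", "of", "in", "a", "and", "for"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word (set lookup is O(1) per token)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "change", "in", "home", "palliative", "care"]

# Case-sensitive comparison keeps the capitalized "The".
case_sensitive = [t for t in tokens if t not in STOP_WORDS]
# Lowercasing before the lookup removes it as well.
filtered = remove_stop_words(tokens)

print(case_sensitive)
print(filtered)
```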

In [16]:
#@title Run NMF

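def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words and the top documents for each topic.

    H is the topic-term matrix (topics x terms); W is the document-topic matrix (documents x topics).
    """
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Indices of the no_top_words largest term weights, in descending order.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Indices of the documents that load most heavily on this topic.
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])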

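# NMF works on any non-negative matrix, so tf-idf weighted features can be used
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(all_sentences)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Run NMF
# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2. Newer versions use
# `alpha_W` / `alpha_H`, which are scaled differently, so the value may need retuning there.
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5,
                init='nndsvd').fit(tfidf)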
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

print("NMF Topics")
display_topics(nmf_H, nmf_W, tfidf_feature_names, all_sentences, no_top_words, no_top_documents)
print("--------------")

NMF Topics
Topic 0:
study risk factors population
The contingent valuation study of Heiðmörk, Iceland - Willingness to pay for its preservation.
[The magazineSaúde em Debateas a source and object of study].
A prospective, randomized, single - blind study comparing intraplaque injection of thiocolchicine and verapamil in Peyronie's Disease: a pilot study.
Topic 1:
care home palliative medical
[The change in home palliative care].
[System for dispensing medicines in home medical care-pharmacy's function as medical care facility-].
[The role of home palliative care by health insurance pharmacy].
Topic 2:
cancer cells cell breast
Adipose stem cell crosstalk with chemo-residual breast cancer cells: implications for tumor recurrence.
Co-expression of TIM-3 and CEACAM1 promotes T cell exhaustion in colorectal cancer patients.
Induced cancer stem cells generated by radiochemotherapy and their therapeutic implications.
Topic 3:
review literature prevent interventions
[Evidence-based and promisi
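
The top words of each topic are selected in `display_topics` with the slice `topic.argsort()[:-no_top_words - 1:-1]`. A small sketch of how that slice behaves, using invented weights and a made-up five-word vocabulary:

```python
import numpy as np

# Invented term weights for a single topic, aligned with a made-up vocabulary.
vocab = ["study", "cancer", "care", "cell", "review"]
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.2])
n_top = 3  # plays the role of no_top_words

# argsort() returns indices in ascending order of weight; the slice walks
# backwards from the last index and stops after n_top entries.
top_idx = weights.argsort()[:-n_top - 1:-1]

# Equivalent, more explicit form: reverse the whole ordering, then truncate.
alt_idx = np.argsort(weights)[::-1][:n_top]

top_words = [vocab[i] for i in top_idx]
print(top_words)  # ['cancer', 'cell', 'review']
```

Both forms give the same indices here; the second is the one `display_topics` uses for the top documents.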

In [19]:
#@title Run LDA

# LDA is a probabilistic model over word counts, so it takes raw term counts rather than tf-idf
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(all_sentences)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5,
                                      learning_method='online', learning_offset=50.,
                                      random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, all_sentences, no_top_words, no_top_documents)

LDA Topics
Topic 0:
patients disease care health
Minimally Invasive Inlay Prosthesis Unicompartmental Knee Arthroplasty for the Treatment of Unicompartmental Osteoarthritis: A Prospective Observational Cohort Study with Minimum 2-Year Outcomes and up to 14-Year Survival.
Effectiveness of the Chiari Health Index for Pediatrics instrument in measuring postoperative health-related quality of life in pediatric patients with Chiari malformation type I.
Primary care physicians' perceived barriers, facilitators and strategies to enhance conservative care for older adults with chronic kidney disease: a qualitative descriptive study.
Topic 1:
cell based potential evidence
Prediction of Areal Bone Mineral Density and Bone Mineral Content in Children and Adolescents Living With HIV Based on Anthropometric Variables.
[A current status of the support for patient leaving hospital that was strengthened by the regional alliances: the evaluation of analysis done by the patient, family and regional staf