

Topic modelling assignment
Assignment: Topic Modeling COVID-19 Research Titles
Problem Statement:
Identifying recurring themes in COVID-19 research can be overwhelming due to the large volume of studies. In this assignment, you will use BERTopic to analyze research paper titles and uncover key topics in the field.

Steps:

Dataset Overview:
Use the provided dataset containing research titles, abstracts, and URLs.
Focus only on the title column for topic modeling.

Objective:
Apply BERTopic to discover and analyze topics in the research titles.

Tasks:
Analysis:
Extract and list the topics with their top keywords.
Visualize topic distributions.
Identify the most common topic and summarize its significance.
Assign a topic and topic name to every title in the dataset.

Deliverables:
A Python script implementing BERTopic.
A brief summary (150–200 words) discussing key findings and visualizations.

In [None]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)

In [None]:
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

In [None]:
df = pd.read_csv('/content/covid_abstracts.csv')

In [None]:
df.head()

Unnamed: 0,title,abstract,url
0,Real-World Experience with COVID-19 Including...,This article summarizes the experiences of COV...,https://pubmed.ncbi.nlm.nih.gov/35008137
1,Successful outcome of pre-engraftment COVID-19...,Coronavirus disease 2019 COVID-19 caused by...,https://pubmed.ncbi.nlm.nih.gov/35008104
2,The impact of COVID-19 on oncology professiona...,BACKGROUND COVID-19 has had a significant imp...,https://pubmed.ncbi.nlm.nih.gov/35007996
3,ICU admission and mortality classifiers for CO...,The coronavirus disease 2019 COVID-19 which ...,https://pubmed.ncbi.nlm.nih.gov/35007991
4,Clinical evaluation of nasopharyngeal midturb...,In the setting of supply chain shortages of na...,https://pubmed.ncbi.nlm.nih.gov/35007959


In [None]:
# Drop every row that has a missing value in any column
df = df.dropna()
print(df.shape)

(10000, 3)
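
`df.dropna()` drops a row if any column is missing, including `abstract` and `url`, which the topic model never reads. A minimal sketch, on a made-up three-row frame (not the assignment data), of restricting the check to the `title` column with the `subset` argument:

```python
import pandas as pd

# Toy frame standing in for covid_abstracts.csv (values are illustrative only).
toy = pd.DataFrame(
    {
        "title": ["Paper A", None, "Paper C"],
        "abstract": ["Text A", "Text B", None],
        "url": ["u1", "u2", "u3"],
    }
)

# Default: a missing value in any column removes the row.
strict = toy.dropna()

# Only rows whose title is missing are removed.
titles_only = toy.dropna(subset=["title"])

print(strict.shape)       # rows surviving the all-column check
print(titles_only.shape)  # rows surviving the title-only check
```

Whether the stricter or the looser filter is appropriate depends on whether the abstracts and URLs are needed later in the analysis.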


In [None]:
# Make sure every title is a string before it is passed to the embedding model
df['title'] = df['title'].astype(str)

In [None]:
# Create an array of titles; these are the documents the topic model will read
doc = df["title"].values

In [None]:
doc[0:5]

array(['Real-World Experience with COVID-19  Including Direct COVID-19 Antigen Testing and Monoclonal-Antibody Bamlanivimab in a Rural Critical Access Hospital in South Dakota',
       'Successful outcome of pre-engraftment COVID-19 in an HCT patient  impact of targeted therapies and cellular immunity',
       'The impact of COVID-19 on oncology professionals-one year on  lessons learned from the ESMO Resilience Task Force survey series',
       'ICU admission and mortality classifiers for COVID-19 patients based on subgroups of dynamically associated profiles across multiple timepoints',
       'Clinical evaluation of nasopharyngeal  midturbinate nasal and oropharyngeal swabs for the detection of SARS-CoV-2'],
      dtype=object)

In [None]:
# Load a pre-trained SentenceTransformer model to generate dense vector embeddings for the documents
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # A pre-trained transformer model optimized for semantic similarity
# Model catalogue: https://huggingface.co/sentence-transformers


# Instantiate the BERTopic model
# embedding_model: Specifies the embedding model to use for document vectorization.
# language: Defines the language of the documents (in this case, English).
# calculate_probabilities: Indicates whether to calculate the probabilities of a document belonging to topics (True).
# verbose: Enables detailed logging of the model's processing steps for better visibility.
topic_model = BERTopic(
    embedding_model=embedding_model,
    language="english",
    calculate_probabilities=True,
    verbose=True
)
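
The notebook imports `UMAP`, `HDBSCAN`, `CountVectorizer`, and `ClassTfidfTransformer` but never passes them to the model, so BERTopic falls back to its defaults. A hedged sketch of how those components could be wired in follows; the helper name `build_topic_model` and every parameter value are assumptions for illustration, not settings used in this run. The imports sit inside the function so that the definition loads even where the heavy packages are not installed.

```python
def build_topic_model(seed=42, min_topic_size=10):
    """Return a BERTopic model with explicit sub-models (illustrative values)."""
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # A fixed random_state makes the UMAP step repeatable, which helps keep
    # topic IDs stable from one run to the next.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    # prediction_data=True stores what HDBSCAN needs for membership probabilities.
    hdbscan_model = HDBSCAN(min_cluster_size=min_topic_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    # Removing English stop words keeps terms like "the" and "of" out of topic names.
    vectorizer_model = CountVectorizer(stop_words="english")
    ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

    return BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model,
        calculate_probabilities=True,
        verbose=True,
    )
```

The vectorizer only affects how topics are described after clustering, so changing the stop-word setting should not change which titles are grouped together.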



In [None]:
# Fit the model and assign a topic to every document.
# topics: a list of integers, one per document; -1 marks documents left as outliers.
# probs: a NumPy array with one row per document and one column per topic, holding the
#        probability of the document belonging to each topic.
topics, probs = topic_model.fit_transform(doc)  # doc is the array of titles built above


2025-01-22 13:28:51,940 - BERTopic - Embedding - Transforming documents to embeddings.



2025-01-22 13:31:41,239 - BERTopic - Embedding - Completed ✓
2025-01-22 13:31:41,243 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-22 13:32:31,159 - BERTopic - Dimensionality - Completed ✓
2025-01-22 13:32:31,164 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-22 13:32:47,571 - BERTopic - Cluster - Completed ✓
2025-01-22 13:32:47,587 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-22 13:32:47,976 - BERTopic - Representation - Completed ✓


In [None]:
len(probs)

10000
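
With `calculate_probabilities=True`, `probs` has one row per document, which is why `len(probs)` equals the number of titles. A small sketch, using invented numbers rather than model output, of reading such a matrix: for each row, find the column (topic) with the highest probability and how strong that assignment is. The function name `strongest_topic` is hypothetical.

```python
def strongest_topic(prob_rows):
    """Return (topic_index, probability) for the largest entry of each row."""
    result = []
    for row in prob_rows:
        best = max(range(len(row)), key=lambda i: row[i])
        result.append((best, row[best]))
    return result


# Three toy documents scored against three toy topics (made-up values).
toy_probs = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.35, 0.35],  # tie: the first maximum wins
]

for doc_id, (topic, p) in enumerate(strongest_topic(toy_probs)):
    print(f"document {doc_id}: topic {topic} (p={p:.2f})")
```

The hard assignment in `topics` comes from the clustering step and can be -1 even when a row has a clear maximum, so the two need not agree.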

In [None]:
topic_model.get_topic(0)

[('telemedicine', 0.06820603187670492),
 ('telehealth', 0.054182497844639514),
 ('satisfaction', 0.016862541257429448),
 ('during', 0.013048094244910918),
 ('program', 0.011606459591798254),
 ('use', 0.011043707870705173),
 ('pandemic', 0.010405387979726791),
 ('pediatric', 0.01036672462295814),
 ('epilepsy', 0.010121382476508609),
 ('patient', 0.010010090531074727)]

In [None]:
topic_model.get_topic(13)

[('masks', 0.06879780480001729),
 ('mask', 0.062123570190341),
 ('face', 0.05706986492443266),
 ('wearing', 0.04542509585563421),
 ('n95', 0.029196988038491665),
 ('respirators', 0.027824085810045886),
 ('aerosol', 0.024226717789671576),
 ('surgical', 0.023827546724885722),
 ('fit', 0.020429028599878236),
 ('respirator', 0.016738589571778435)]
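
`get_topic` returns a list of `(word, score)` pairs, highest score first. As a sketch, a short readable label can be built from the first few words; the values below are copied (rounded) from the topic 13 output above, and `label_from_keywords` is a hypothetical helper, not part of BERTopic.

```python
def label_from_keywords(keywords, n_words=3, separator=", "):
    """Join the n highest-scoring words of a topic into one label."""
    ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)
    return separator.join(word for word, _ in ranked[:n_words])


topic_13 = [
    ("masks", 0.0688),
    ("mask", 0.0621),
    ("face", 0.0571),
    ("wearing", 0.0454),
    ("n95", 0.0292),
]

print(label_from_keywords(topic_13))             # masks, mask, face
print(label_from_keywords(topic_13, n_words=2))  # masks, mask
```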

In [None]:
topic_model.get_topic_info()  # Show topic information

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3568,-1_the_of_covid_19,"[the, of, covid, 19, in, and, to, cov, pandemi...",[Psychological Impact of the COVID-19 Pandemic...
1,0,164,0_telemedicine_telehealth_satisfaction_during,"[telemedicine, telehealth, satisfaction, durin...",[Development of a Telemedicine Screening Progr...
2,1,155,1_mental_health_psychiatric_workers,"[mental, health, psychiatric, workers, work, p...",[The mental health impact of contact with COVI...
3,2,144,2_respiratory_oxygen_ventilation_pulmonary,"[respiratory, oxygen, ventilation, pulmonary, ...",[High-flow nasal oxygen therapy decrease the r...
4,3,131,3_drug_treatment_potential_antiviral,"[drug, treatment, potential, antiviral, therap...",[Antiviral drugs Potent agents promising th...
...,...,...,...,...,...
176,175,10,175_distancing_social_kalman_regimes,"[distancing, social, kalman, regimes, mali, ev...",[Effectiveness of social distancing interventi...
177,176,10,176_employment_benefits_unemployment_loans,"[employment, benefits, unemployment, loans, em...",[This time is really different The multiplier...
178,177,10,177_coagulopathy_fibrinogen_coagulation_intrav...,"[coagulopathy, fibrinogen, coagulation, intrav...",[Investigation of the Molecular Mechanism of C...
179,178,10,178_metabolic_adiposity_lipid_severity,"[metabolic, adiposity, lipid, severity, secret...",[The risk factor for instability metabolic hea...
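
The assignment asks for the most common topic. Topic -1 is BERTopic's outlier bucket, so it is usually left out of that comparison. A sketch using only the five counts visible in the table above (the remaining topics are elided there and are not reproduced here); the 10,000 total comes from `df.shape`.

```python
# Topic -> Count, copied from the visible rows of get_topic_info().
visible_counts = {-1: 3568, 0: 164, 1: 155, 2: 144, 3: 131}
total_titles = 10000  # number of rows reported by df.shape

# Share of titles that were not assigned to any topic.
outlier_share = visible_counts[-1] / total_titles

# Largest real topic: ignore the -1 outlier bucket before taking the maximum.
real_topics = {t: c for t, c in visible_counts.items() if t != -1}
top_topic = max(real_topics, key=real_topics.get)

print(f"outliers: {outlier_share:.1%}")
print(f"largest topic: {top_topic} with {real_topics[top_topic]} titles")
```

On the fitted model the same figures can be read from `topic_model.get_topic_info()` after filtering out the row where `Topic` is -1.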


In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=20)

In [None]:
topic_model.visualize_heatmap(width=1000, height=1000)

In [None]:
topic_model.generate_topic_labels()

['-1_the_of_covid',
 '0_telemedicine_telehealth_satisfaction',
 '1_mental_health_psychiatric',
 '2_respiratory_oxygen_ventilation',
 '3_drug_treatment_potential',
 '4_child_children_parents',
 '5_sars_cov_transmission',
 '6_protease_inhibitors_main',
 '7_variants_sars_cov',
 '8_online_learning_education',
 '9_antibodies_igg_neutralizing',
 '10_spike_ace2_protein',
 '11_media_twitter_social',
 '12_prognosis_reactive_meta',
 '13_masks_mask_face',
 '14_nurses_nursing_nurse',
 '15_vaccine_vaccination_2021',
 '16_pregnant_pregnancy_women',
 '17_graphene_detection_biosensor',
 '18_modeling_mathematical_epidemic',
 '19_vaccine_willingness_acceptance',
 '20_images_ray_chest',
 '21_dental_oral_dentistry',
 '22_neurological_brain_palsy',
 '23_cytokine_neutrophil_storm',
 '24_machine_artificial_intelligence',
 '25_sleep_quality_insomnia',
 '26_mobility_transport_transportation',
 '27_respiratory_syndrome_acute',
 '28_pollution_air_pm2',
 '29_students_university_college',
 '30_lessons_emergency_re

In [None]:
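# Search for the topics most similar to "vaccine"; returns (topic IDs, similarity scores).
# Topic IDs can change between runs, so check them against the current output.
topic_model.find_topics("vaccine")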

([96, 58, 15, 19, 105],
 [0.6674495, 0.65018547, 0.64887655, 0.6286167, 0.6200327])
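
`find_topics` returns two parallel lists: topic IDs and their similarity to the search term. A sketch of pairing them so each score sits next to its topic; the IDs and scores are copied from the output above, and only the two labels that appear elsewhere in this notebook are filled in.

```python
similar_ids = [96, 58, 15, 19, 105]
similarities = [0.6674495, 0.65018547, 0.64887655, 0.6286167, 0.6200327]

# Labels known from generate_topic_labels(); other IDs are not shown in the notebook.
known_labels = {
    15: "15_vaccine_vaccination_2021",
    19: "19_vaccine_willingness_acceptance",
}

rows = [
    (topic_id, round(score, 3), known_labels.get(topic_id, "(label not shown)"))
    for topic_id, score in zip(similar_ids, similarities)
]

for topic_id, score, label in rows:
    print(f"{topic_id:>4}  {score:.3f}  {label}")
```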

In [None]:
# Step 1: Assign topic numbers to the DataFrame
df["topic"] = topics  # The `topics` output from BERTopic

df.head(10)


Unnamed: 0,title,abstract,url,topic
0,Real-World Experience with COVID-19 Including...,This article summarizes the experiences of COV...,https://pubmed.ncbi.nlm.nih.gov/35008137,91
1,Successful outcome of pre-engraftment COVID-19...,Coronavirus disease 2019 COVID-19 caused by...,https://pubmed.ncbi.nlm.nih.gov/35008104,-1
2,The impact of COVID-19 on oncology professiona...,BACKGROUND COVID-19 has had a significant imp...,https://pubmed.ncbi.nlm.nih.gov/35007996,106
3,ICU admission and mortality classifiers for CO...,The coronavirus disease 2019 COVID-19 which ...,https://pubmed.ncbi.nlm.nih.gov/35007991,88
4,Clinical evaluation of nasopharyngeal midturb...,In the setting of supply chain shortages of na...,https://pubmed.ncbi.nlm.nih.gov/35007959,-1
5,Safer in care A pandemic-tested model of inte...,BACKGROUND Rural poor persons with HIV PWH...,https://pubmed.ncbi.nlm.nih.gov/35007957,119
6,The effect of Covid-19 in digital media use of...,BACKGROUND Covid-19 pandemic has boosted digi...,https://pubmed.ncbi.nlm.nih.gov/35007925,-1
7,Scanning the RBD-ACE2 molecular interactions i...,The emergence of new SARS-CoV-2 variants poses...,https://pubmed.ncbi.nlm.nih.gov/35007846,-1
8,Implications of SARS-CoV-2 infection on the cl...,BACKGROUND The current coronavirus pandemic ...,https://pubmed.ncbi.nlm.nih.gov/35007842,-1
9,Merbromin is a mixed-type inhibitor of 3-chyom...,3-chyomotrypsin like protease 3CLpro has bee...,https://pubmed.ncbi.nlm.nih.gov/35007835,6


In [None]:
topic_labels = topic_model.generate_topic_labels()
topic_labels

['-1_the_of_covid',
 '0_telemedicine_telehealth_satisfaction',
 '1_mental_health_psychiatric',
 '2_respiratory_oxygen_ventilation',
 '3_drug_treatment_potential',
 '4_child_children_parents',
 '5_sars_cov_transmission',
 '6_protease_inhibitors_main',
 '7_variants_sars_cov',
 '8_online_learning_education',
 '9_antibodies_igg_neutralizing',
 '10_spike_ace2_protein',
 '11_media_twitter_social',
 '12_prognosis_reactive_meta',
 '13_masks_mask_face',
 '14_nurses_nursing_nurse',
 '15_vaccine_vaccination_2021',
 '16_pregnant_pregnancy_women',
 '17_graphene_detection_biosensor',
 '18_modeling_mathematical_epidemic',
 '19_vaccine_willingness_acceptance',
 '20_images_ray_chest',
 '21_dental_oral_dentistry',
 '22_neurological_brain_palsy',
 '23_cytokine_neutrophil_storm',
 '24_machine_artificial_intelligence',
 '25_sleep_quality_insomnia',
 '26_mobility_transport_transportation',
 '27_respiratory_syndrome_acute',
 '28_pollution_air_pm2',
 '29_students_university_college',
 '30_lessons_emergency_re

In [None]:
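# Pair each topic ID with its label. The labels come back in ascending topic-ID order,
# so zip them with the sorted IDs instead of assuming the list starts at -1.
topic_mapping = dict(zip(sorted(set(topic_model.topics_)), topic_labels))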
topic_mapping

{-1: '-1_the_of_covid',
 0: '0_telemedicine_telehealth_satisfaction',
 1: '1_mental_health_psychiatric',
 2: '2_respiratory_oxygen_ventilation',
 3: '3_drug_treatment_potential',
 4: '4_child_children_parents',
 5: '5_sars_cov_transmission',
 6: '6_protease_inhibitors_main',
 7: '7_variants_sars_cov',
 8: '8_online_learning_education',
 9: '9_antibodies_igg_neutralizing',
 10: '10_spike_ace2_protein',
 11: '11_media_twitter_social',
 12: '12_prognosis_reactive_meta',
 13: '13_masks_mask_face',
 14: '14_nurses_nursing_nurse',
 15: '15_vaccine_vaccination_2021',
 16: '16_pregnant_pregnancy_women',
 17: '17_graphene_detection_biosensor',
 18: '18_modeling_mathematical_epidemic',
 19: '19_vaccine_willingness_acceptance',
 20: '20_images_ray_chest',
 21: '21_dental_oral_dentistry',
 22: '22_neurological_brain_palsy',
 23: '23_cytokine_neutrophil_storm',
 24: '24_machine_artificial_intelligence',
 25: '25_sleep_quality_insomnia',
 26: '26_mobility_transport_transportation',
 27: '27_respirat
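
Numbering the labels from -1 gives the right IDs only when the label list begins with the outlier topic. A sketch of a mapping that does not depend on that, pairing the labels with the sorted topic IDs instead. The toy lists are made up; on the fitted model the IDs would come from `topic_model.topics_`, under the assumption that `generate_topic_labels()` returns labels in ascending topic-ID order. `build_topic_mapping` is a hypothetical helper.

```python
def build_topic_mapping(assigned_topics, labels):
    """Map each distinct topic ID to its label, pairing by ascending ID."""
    unique_ids = sorted(set(assigned_topics))
    if len(unique_ids) != len(labels):
        raise ValueError("number of labels does not match number of topics")
    return dict(zip(unique_ids, labels))


# Case 1: outliers present, so IDs start at -1.
with_outliers = build_topic_mapping(
    [0, -1, 1, 0, -1], ["-1_the_of_covid", "0_telemedicine", "1_mental_health"]
)

# Case 2: no document was an outlier, so the first label belongs to topic 0.
without_outliers = build_topic_mapping(
    [0, 1, 1, 0], ["0_telemedicine", "1_mental_health"]
)

print(with_outliers)
print(without_outliers)
```

The length check makes a mismatch between labels and topics fail loudly instead of silently shifting every name by one.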

In [None]:
import numpy as np
# Replace -1 (Outliers) temporarily with NaN to handle missing labels
df["topic_name"] = df["topic"].replace(-1, np.nan)


# Map topic_labels directly to the topic column
df["topic_name"] = df["topic_name"].map(pd.Series(topic_mapping))

df.tail(10)


Unnamed: 0,title,abstract,url,topic,topic_name
9990,Factors associated with early receipt of COVID...,BACKGROUND We aimed to determine factors inde...,https://pubmed.ncbi.nlm.nih.gov/34851970,15,15_vaccine_vaccination_2021
9991,Aerosol tracer testing in Boeing 767 and 777 a...,The COVID-19 pandemic has reintroduced questio...,https://pubmed.ncbi.nlm.nih.gov/34851965,-1,
9992,Cumulative burden of non-communicable diseases...,There continue to be conflicting data regardin...,https://pubmed.ncbi.nlm.nih.gov/34851963,119,119_hiv_services_adherence
9993,Preoperative and Postoperative Opioid Prescrip...,The United States is facing an opioid epidemic...,https://pubmed.ncbi.nlm.nih.gov/34851880,37,37_trauma_hip_fracture
9994,Hypersensitivity Reactions to Vaccines Curren...,The first reports of hypersensitivity reaction...,https://pubmed.ncbi.nlm.nih.gov/34851819,7,7_variants_sars_cov
9995,Rooming-in Breastfeeding and Neonatal Follow-...,INTRODUCTION Due to growing evidence suggesti...,https://pubmed.ncbi.nlm.nih.gov/34851815,-1,
9996,Acute Retinal Necrosis from Reactivation of Va...,PURPOSE To report a case of acute retinal nec...,https://pubmed.ncbi.nlm.nih.gov/34851795,-1,
9997,Acute Abducens Nerve Palsy Following the Secon...,The authors report the case of an otherwise he...,https://pubmed.ncbi.nlm.nih.gov/34851785,22,22_neurological_brain_palsy
9998,Planning and Implementing the Protocol for Psy...,The present study aims to plan the protocol fo...,https://pubmed.ncbi.nlm.nih.gov/34851781,35,35_workers_healthcare_frontline
9999,Prolonged corrected QT interval in hospitalize...,OBJECTIVE To evaluate the association of a pr...,https://pubmed.ncbi.nlm.nih.gov/34851769,40,40_2019_coronavirus_disease


In [None]:
# Fill NaN values with "Outlier" for -1 topics
df["topic_name"] = df["topic_name"].fillna("Outlier")
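
The two naming steps (map, then fill) can be combined into one chained expression. A sketch on a toy frame with invented rows: leaving -1 out of the lookup makes `Series.map` return NaN for outliers, which `fillna` then labels.

```python
import pandas as pd

toy = pd.DataFrame({"topic": [15, -1, 22, -1]})

# Lookup without the -1 entry, so outliers fall through to NaN.
names = {15: "15_vaccine_vaccination_2021", 22: "22_neurological_brain_palsy"}

# Assigning the result back avoids modifying a column selection in place.
toy["topic_name"] = toy["topic"].map(names).fillna("Outlier")

print(toy)
```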

In [None]:
df.tail(10)

Unnamed: 0,title,abstract,url,topic,topic_name
9990,Factors associated with early receipt of COVID...,BACKGROUND We aimed to determine factors inde...,https://pubmed.ncbi.nlm.nih.gov/34851970,15,15_vaccine_vaccination_2021
9991,Aerosol tracer testing in Boeing 767 and 777 a...,The COVID-19 pandemic has reintroduced questio...,https://pubmed.ncbi.nlm.nih.gov/34851965,-1,Outlier
9992,Cumulative burden of non-communicable diseases...,There continue to be conflicting data regardin...,https://pubmed.ncbi.nlm.nih.gov/34851963,119,119_hiv_services_adherence
9993,Preoperative and Postoperative Opioid Prescrip...,The United States is facing an opioid epidemic...,https://pubmed.ncbi.nlm.nih.gov/34851880,37,37_trauma_hip_fracture
9994,Hypersensitivity Reactions to Vaccines Curren...,The first reports of hypersensitivity reaction...,https://pubmed.ncbi.nlm.nih.gov/34851819,7,7_variants_sars_cov
9995,Rooming-in Breastfeeding and Neonatal Follow-...,INTRODUCTION Due to growing evidence suggesti...,https://pubmed.ncbi.nlm.nih.gov/34851815,-1,Outlier
9996,Acute Retinal Necrosis from Reactivation of Va...,PURPOSE To report a case of acute retinal nec...,https://pubmed.ncbi.nlm.nih.gov/34851795,-1,Outlier
9997,Acute Abducens Nerve Palsy Following the Secon...,The authors report the case of an otherwise he...,https://pubmed.ncbi.nlm.nih.gov/34851785,22,22_neurological_brain_palsy
9998,Planning and Implementing the Protocol for Psy...,The present study aims to plan the protocol fo...,https://pubmed.ncbi.nlm.nih.gov/34851781,35,35_workers_healthcare_frontline
9999,Prolonged corrected QT interval in hospitalize...,OBJECTIVE To evaluate the association of a pr...,https://pubmed.ncbi.nlm.nih.gov/34851769,40,40_2019_coronavirus_disease


In [None]:
# Select every title assigned to topic 19 (vaccine willingness and acceptance)
df_changes = df[df['topic'] == 19]

In [None]:
df_changes

Unnamed: 0,title,abstract,url,topic,topic_name
417,Personal willingness to receive a Covid-19 vac...,High levels of vaccine hesitancy are an obstac...,https://pubmed.ncbi.nlm.nih.gov/35001533,19,19_vaccine_willingness_acceptance
673,The myth of vaccination and autism spectrum,BACKGROUND Among all of the studied potential...,https://pubmed.ncbi.nlm.nih.gov/34996019,19,19_vaccine_willingness_acceptance
691,Factors and Reasons Associated with Low COVID-...,INTRODUCTION The inability to achieve high CO...,https://pubmed.ncbi.nlm.nih.gov/34995722,19,19_vaccine_willingness_acceptance
976,COVID-19 vaccine acceptance and hesitancy amon...,BACKGROUND We assessed willingness to accept ...,https://pubmed.ncbi.nlm.nih.gov/34990311,19,19_vaccine_willingness_acceptance
1538,Attitudes towards vaccines and intention to va...,OBJECTIVE To examine SARS-CoV-2 vaccine confi...,https://pubmed.ncbi.nlm.nih.gov/34980631,19,19_vaccine_willingness_acceptance
...,...,...,...,...,...
9216,Relationship between mask wearing testing an...,BACKGROUND Mask wearing mitigates the spread ...,https://pubmed.ncbi.nlm.nih.gov/34865166,19,19_vaccine_willingness_acceptance
9289,Individual factors influencing COVID-19 vaccin...,BACKGROUND A year after the start of the COVI...,https://pubmed.ncbi.nlm.nih.gov/34863621,19,19_vaccine_willingness_acceptance
9678,Measuring vaccination willingness in response ...,The pandemic COVID-19 is continued to the mass...,https://pubmed.ncbi.nlm.nih.gov/34856879,19,19_vaccine_willingness_acceptance
9838,COVID-19 Vaccine Concerns and Acceptability by...,INTRODUCTION We need to understand the contin...,https://pubmed.ncbi.nlm.nih.gov/34854328,19,19_vaccine_willingness_acceptance
