DT5125 - Data Science Application

Group Assignment 1 - Text Clustering

Group 9

Main Topic: COVID-19

Subtopics:

Label; Category; Description

A; COVID-19 & Vaccine Efficacy; Studies evaluating how well different vaccines work, possibly across variants.

B; COVID-19 & Long-Term Neurological Effects; Research on long COVID and cognitive/neurological symptoms (brain fog, memory loss).

C; COVID-19 & Machine Learning for Diagnosis; Use of ML to diagnose COVID-19 from imaging (CT, X-rays) or symptoms.

D; COVID-19 & Public Health Policy / Social Behavior; Research on lockdown effects, mask compliance, misinformation, etc.

E; COVID-19 & Genomic/Variant Analysis; Studies on SARS-CoV-2 mutations, variant tracking, and genomic signatures.



Step 1: Find and import 200 abstracts per category, save to CSV with label and metadata

In [1]:
import time
import re
import pandas as pd
from Bio import Entrez

# Set the contact email here; NCBI asks for one with every E-utilities request
Entrez.email = "akaur104@uottawa.ca"

# Define search categories
categories = {
    "A": 'COVID-19 AND ("vaccine efficacy" OR "vaccine effectiveness")',
    "B": 'COVID-19 AND ("long COVID" OR "neurological symptoms" OR "brain fog")',
    "C": 'COVID-19 AND ("machine learning" OR "deep learning") AND ("diagnosis" OR "prediction")',
    "D": 'COVID-19 AND ("public health" OR "policy" OR "lockdown" OR "social distancing")',
    "E": 'COVID-19 AND ("variant analysis" OR "genomic surveillance" OR "SARS-CoV-2 mutations")'
}

def clean_text(text):
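    """
    Strip HTML/XML tags and non-alphanumeric symbols, then collapse whitespace.
    Produces plain-text titles and abstracts for comparison.
    """
    text = re.sub(r"<[^>]+>", "", text)          # drop markup tags
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text)             # collapse whitespace last so no double spaces remain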
    return text.strip()

def fetch_200_valid_abstracts(query, email="akaur104@uottawa.ca"):
    """
    Fetches batches of PubMed articles until 200 valid ones with abstracts are collected.
    Returns each with title, abstract, label, and additional metadata.
    """

    Entrez.email = email
    valid_abstracts = []
    retstart = 0
    batch_size = 100
    max_attempts = 1000

    search_handle = Entrez.esearch(db="pubmed", term=query, usehistory="y", retmax=0)
    search_results = Entrez.read(search_handle)
    total_records = int(search_results["Count"])
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    while len(valid_abstracts) < 200 and retstart < total_records and max_attempts > 0:
        try:
            fetch_handle = Entrez.efetch(
                db="pubmed",
                rettype="abstract",
                retmode="xml",
                retstart=retstart,
                retmax=batch_size,
                webenv=webenv,
                query_key=query_key
            )
            records = Entrez.read(fetch_handle)
        except Exception as e:
            print(f"Entrez fetch failed at retstart={retstart}: {e}")
            break

        for article in records.get("PubmedArticle", []):
            try:
                article_meta = article["MedlineCitation"]
                article_fields = article_meta["Article"]
                title = article_fields.get("ArticleTitle", "")
                abstract_parts = article_fields.get("Abstract", {}).get("AbstractText", [])
                if not abstract_parts:
                    continue

                abstract = " ".join(str(p) for p in abstract_parts)

                # Metadata
                pmid = article_meta.get("PMID", "?")
                journal = article_fields.get("Journal", {}).get("Title", "?")
                pub_year = article_fields.get("Journal", {}).get("JournalIssue", {}).get("PubDate", {}).get("Year", "?")

                authors_list = article_fields.get("AuthorList", [])
                authors = ", ".join([
                    f"{a.get('LastName', '')} {a.get('Initials', '')}"
                    for a in authors_list if "LastName" in a
                ][:3])  # up to 3 authors

                article_ids = article.get("PubmedData", {}).get("ArticleIdList", [])
                # Take the first identifier tagged as a DOI; avoid shadowing the built-in `id`
                doi = next((str(aid) for aid in article_ids if aid.attributes.get("IdType") == "doi"), "?")

                valid_abstracts.append({
                    "pmid": pmid,
                    "title": clean_text(title),
                    "abstract": clean_text(abstract),
                    "journal": journal,
                    "pub_year": pub_year,
                    "authors": authors,
                    "doi": doi
                })

                if len(valid_abstracts) >= 200:
                    break
            except Exception:
                # Skip records with an unexpected structure instead of aborting the batch
                continue

        retstart += batch_size
        max_attempts -= 1
        time.sleep(0.3)
        print(f"✅ Collected {len(valid_abstracts)} valid abstracts...")

    return valid_abstracts[:200]


# Download and label data
all_data = []

for label, query in categories.items():
    print(f"\n🔍 Fetching Category {label}")
    articles = fetch_200_valid_abstracts(query)
    for article in articles:
        article["label"] = label  # Add label to each entry
    all_data.extend(articles)

# Save with metadata
df = pd.DataFrame(all_data)
df.to_csv("covid19_labeled_abstracts_with_metadata.csv", index=False)
print("✅ Saved to 'covid19_labeled_abstracts_with_metadata.csv'")




🔍 Fetching Category A
✅ Collected 97 valid abstracts...
✅ Collected 195 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category B
✅ Collected 93 valid abstracts...
✅ Collected 189 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category C
✅ Collected 100 valid abstracts...
✅ Collected 199 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category D
✅ Collected 93 valid abstracts...
✅ Collected 190 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category E
✅ Collected 97 valid abstracts...
✅ Collected 197 valid abstracts...
✅ Collected 200 valid abstracts...
✅ Saved to 'covid19_labeled_abstracts_with_metadata.csv'
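
Because the five queries run independently, the same PubMed article can match more than one category and end up in the dataset under two labels, which would blur the ground truth used for ARI later. A minimal sketch of a check, assuming records shaped like the dictionaries in `all_data` (with `pmid` and `label` keys). The helper name `find_cross_label_pmids` and the sample records are hypothetical:

```python
from collections import defaultdict

def find_cross_label_pmids(records):
    """Return {pmid: sorted labels} for PMIDs that appear under more than one label."""
    labels_by_pmid = defaultdict(set)
    for rec in records:
        labels_by_pmid[str(rec["pmid"])].add(rec["label"])
    return {
        pmid: sorted(labels)
        for pmid, labels in labels_by_pmid.items()
        if len(labels) > 1
    }

# Hypothetical records for illustration only; real PMIDs come from the fetch step.
sample_records = [
    {"pmid": "111", "label": "A"},
    {"pmid": "222", "label": "B"},
    {"pmid": "111", "label": "E"},
]
duplicates = find_cross_label_pmids(sample_records)
print(duplicates)  # {'111': ['A', 'E']}
```

Running it on `all_data` before saving would show how many abstracts carry two labels; whether to drop them or keep a single label is a judgment call.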


Step 2: Preprocess text: lowercase everything, remove stopwords, punctuation, and digits

In [68]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

df = pd.read_csv("covid19_labeled_abstracts_with_metadata.csv")
lemmatizer = WordNetLemmatizer()
# Local stopword list
stopwords = set("""
a about above after again against all am an and any are aren't as at be because been before being below between
both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from
further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his
how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor
not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should
shouldn't so some such than that that's the their theirs them themselves then there's these they they'd
they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were
weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you
you'd you'll you're you've your yours yourself yourselves covid covid19 sarscov sarscov2
""".split())

# Normalize text: lowercase, strip digits and punctuation, drop stopwords
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)               # Remove digits, added to improve clustering results
    text = re.sub(r'[^a-z\s]', '', text)          # Remove remaining non-alphabetic characters
    words = text.split()
    words = [w for w in words if w not in stopwords]
    #words = [lemmatizer.lemmatize(w) for w in words if w not in stopwords] #Added to check if clustering results improve or not
    return " ".join(words)

# Rebuild from clean components
df["clean_title"] = df["title"].astype(str).apply(preprocess_text)
df["clean_abstract"] = df["abstract"].astype(str).apply(preprocess_text)
df["full_text"] = df["clean_title"] + " " + df["clean_abstract"]

df.to_csv("covid19_combined_text_dataset.csv", index=False)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
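
One thing to watch in this step: `preprocess_text` strips apostrophes before the stopword filter runs (and `clean_text` already removed them in Step 1), so contractions in the list such as "doesn't" can never match a token like "doesnt". A minimal sketch of one way to handle this, assuming it is acceptable to pass the stopword list through the same letters-only filter. The names `normalize_stopwords` and `preprocess_with` and the tiny stopword set are hypothetical:

```python
import re

def normalize_stopwords(words):
    """Apply the same letters-only filter to stopwords that the text receives."""
    return {re.sub(r"[^a-z]", "", w.lower()) for w in words} - {""}

def preprocess_with(text, stopword_set):
    """Lowercase, keep letters and whitespace only, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(w for w in text.split() if w not in stopword_set)

# Tiny illustrative list; the notebook's full list would be passed instead.
raw_stopwords = {"the", "doesn't", "in"}
sentence = "The vaccine doesn't work in 2021"

before = preprocess_with(sentence, raw_stopwords)
after = preprocess_with(sentence, normalize_stopwords(raw_stopwords))
print(before)  # vaccine doesnt work
print(after)   # vaccine work
```

Normalizing also turns "i'll" into "ill" and "we'll" into "well", which are real words in medical text, so those entries may be better removed from the list first.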


Step 3: Vectorizing with BOW, TF-IDF, LDA, and Word Embeddings; Clustering with K-Means, EM (Gaussian Mixture), and Hierarchical Clustering

In [69]:
# Load Preprocessed Data
df = pd.read_csv("covid19_combined_text_dataset.csv")
df = df[df["label"].isin(["A", "B", "C", "D", "E"])]  # Ensure only labeled classes
df["label"] = df["label"].astype(str)  # make sure label is string
print("✅ Loaded:", df.shape[0], "documents")


✅ Loaded: 1000 documents


In [70]:
import spacy
import numpy as np
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, silhouette_score

In [71]:

# --------- BOW Vectorization ---------
bow_vectorizer = CountVectorizer(max_features=1000, ngram_range=(1, 2))
X_bow = bow_vectorizer.fit_transform(df["full_text"])

# KMeans on BOW
kmeans_bow = KMeans(n_clusters=5, random_state=42)
bow_preds = kmeans_bow.fit_predict(X_bow)
ari_bow = adjusted_rand_score(df["label"], bow_preds)
sil_bow = silhouette_score(X_bow, bow_preds)

# EM on BOW
em_bow = GaussianMixture(n_components=5, random_state=42)
bow_em_preds = em_bow.fit_predict(X_bow.toarray())
ari_bow_em = adjusted_rand_score(df["label"], bow_em_preds)
sil_bow_em = silhouette_score(X_bow, bow_em_preds)

# Hierarchical on BOW
hier_bow = AgglomerativeClustering(n_clusters=5)
bow_hier_preds = hier_bow.fit_predict(X_bow.toarray())
ari_bow_hier = adjusted_rand_score(df["label"], bow_hier_preds)
sil_bow_hier = silhouette_score(X_bow, bow_hier_preds)


In [72]:
# --------- TF-IDF Vectorization ---------
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf_vectorizer.fit_transform(df["full_text"])

# KMeans on TF-IDF
kmeans_tfidf = KMeans(n_clusters=5, random_state=42)
tfidf_preds = kmeans_tfidf.fit_predict(X_tfidf)
ari_tfidf = adjusted_rand_score(df["label"], tfidf_preds)
sil_tfidf = silhouette_score(X_tfidf, tfidf_preds)

# EM on TF-IDF
em_model = GaussianMixture(n_components=5, random_state=42)
em_preds = em_model.fit_predict(X_tfidf.toarray())
ari_em = adjusted_rand_score(df["label"], em_preds)
sil_em = silhouette_score(X_tfidf, em_preds)

# Hierarchical on TF-IDF
hier_model = AgglomerativeClustering(n_clusters=5)
hier_preds = hier_model.fit_predict(X_tfidf.toarray())
ari_hier = adjusted_rand_score(df["label"], hier_preds)
sil_hier = silhouette_score(X_tfidf, hier_preds)


In [73]:
# --------- LDA Clustering ---------
tokenized_docs = [doc.split() for doc in df["full_text"]]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10, random_state=42)
lda_features = []
for doc in corpus:
    topic_probs = [0.0] * 5
    for topic_num, prob in lda_model.get_document_topics(doc):
        topic_probs[topic_num] = prob
    lda_features.append(topic_probs)
X_lda = pd.DataFrame(lda_features)

# KMeans on LDA
kmeans_lda = KMeans(n_clusters=5, random_state=42)
lda_preds = kmeans_lda.fit_predict(X_lda)
ari_lda = adjusted_rand_score(df["label"], lda_preds)

# EM on LDA
em_lda = GaussianMixture(n_components=5, random_state=42)
lda_em_preds = em_lda.fit_predict(X_lda)
ari_lda_em = adjusted_rand_score(df["label"], lda_em_preds)

# Hierarchical on LDA
hier_lda = AgglomerativeClustering(n_clusters=5)
lda_hier_preds = hier_lda.fit_predict(X_lda)
ari_lda_hier = adjusted_rand_score(df["label"], lda_hier_preds)

In [74]:
# --------- Word Embedding Clustering ---------
nlp = spacy.load("en_core_web_md")
def get_embedding(text):
    return nlp(text).vector
embeddings = np.vstack(df["full_text"].apply(get_embedding))

# KMeans on Embeddings
kmeans_embed = KMeans(n_clusters=5, random_state=42)
embed_preds = kmeans_embed.fit_predict(embeddings)
ari_embed = adjusted_rand_score(df["label"], embed_preds)

# EM on Embeddings
em_embed = GaussianMixture(n_components=5, random_state=42)
embed_em_preds = em_embed.fit_predict(embeddings)
ari_embed_em = adjusted_rand_score(df["label"], embed_em_preds)

# Hierarchical on Embeddings
hier_embed = AgglomerativeClustering(n_clusters=5)
embed_hier_preds = hier_embed.fit_predict(embeddings)
ari_embed_hier = adjusted_rand_score(df["label"], embed_hier_preds)
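
The four cells above repeat the same three clusterers for each representation, and the silhouette scores computed in the BOW and TF-IDF cells are never printed. A helper along the following lines could remove the repetition and return both metrics together. This is a sketch, not the notebook's code: the name `evaluate_clusterers`, the explicit `n_init=10`, and the synthetic demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, silhouette_score

def evaluate_clusterers(X, true_labels, n_clusters=5, seed=42):
    """Run K-Means, EM, and hierarchical clustering on a dense array X.

    Returns {method: {"ari": ..., "silhouette": ..., "preds": ...}}.
    """
    X = np.asarray(X)  # sparse matrices should be converted with .toarray() first
    models = {
        # n_init is set explicitly (an assumption) so behavior does not depend
        # on the scikit-learn version's default.
        "KMeans": KMeans(n_clusters=n_clusters, random_state=seed, n_init=10),
        "EM": GaussianMixture(n_components=n_clusters, random_state=seed),
        "Hierarchical": AgglomerativeClustering(n_clusters=n_clusters),
    }
    results = {}
    for name, model in models.items():
        preds = model.fit_predict(X)
        results[name] = {
            "ari": adjusted_rand_score(true_labels, preds),
            "silhouette": silhouette_score(X, preds),
            "preds": preds,
        }
    return results

# Synthetic, well-separated data for illustration only (not the abstracts).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
y_demo = np.repeat(["A", "B", "C"], 20)

for name, scores in evaluate_clusterers(X_demo, y_demo, n_clusters=3).items():
    print(name, round(scores["ari"], 3), round(scores["silhouette"], 3))
```

With the notebook's data, the call would look like `evaluate_clusterers(X_tfidf.toarray(), df["label"])`, repeated for each feature matrix.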

In [75]:

# --------- Results Summary ---------
print("\n✅ Clustering Evaluation Summary")
print("BOW:")
print("  - KMeans ARI:", round(ari_bow, 3))
print("  - EM ARI:", round(ari_bow_em, 3))
print("  - Hierarchical ARI:", round(ari_bow_hier, 3))
print("TF-IDF:")
print("  - KMeans ARI:", round(ari_tfidf, 3))
print("  - EM ARI:", round(ari_em, 3))
print("  - Hierarchical ARI:", round(ari_hier, 3))
print("LDA:")
print("  - KMeans ARI:", round(ari_lda, 3))
print("  - EM ARI:", round(ari_lda_em, 3))
print("  - Hierarchical ARI:", round(ari_lda_hier, 3))
print("Word Embeddings:")
print("  - KMeans ARI:", round(ari_embed, 3))
print("  - EM ARI:", round(ari_embed_em, 3))
print("  - Hierarchical ARI:", round(ari_embed_hier, 3))

# --------- Save Results ---------
df["cluster_bow"] = bow_preds
df["cluster_bow_em"] = bow_em_preds
df["cluster_bow_hier"] = bow_hier_preds
df["cluster_tfidf"] = tfidf_preds
df["cluster_em"] = em_preds
df["cluster_hier"] = hier_preds
df["cluster_lda"] = lda_preds
df["cluster_lda_em"] = lda_em_preds
df["cluster_lda_hier"] = lda_hier_preds
df["cluster_embed"] = embed_preds
df["cluster_embed_em"] = embed_em_preds
df["cluster_embed_hier"] = embed_hier_preds
df.to_csv("assignment2_clustering_all_methods.csv", index=False)
print("\n📁 Results saved to 'assignment2_clustering_all_methods.csv'")



✅ Clustering Evaluation Summary
BOW:
  - KMeans ARI: 0.319
  - EM ARI: 0.319
  - Hierarchical ARI: 0.356
TF-IDF:
  - KMeans ARI: 0.493
  - EM ARI: 0.493
  - Hierarchical ARI: 0.319
LDA:
  - KMeans ARI: 0.153
  - EM ARI: 0.111
  - Hierarchical ARI: 0.151
Word Embeddings:
  - KMeans ARI: 0.241
  - EM ARI: 0.241
  - Hierarchical ARI: 0.269

📁 Results saved to 'assignment2_clustering_all_methods.csv'
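
ARI compares two partitions by counting document pairs that are grouped together in both, corrected for chance, so it does not matter which integer a cluster happens to receive. A minimal from-scratch sketch of the standard contingency-table formula follows, for illustration only (the notebook itself uses scikit-learn's `adjusted_rand_score`). The function name `ari_from_scratch` and the toy labels are hypothetical:

```python
from collections import Counter
from math import comb

def ari_from_scratch(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings.

    Assumes at least two items, so the number of pairs is non-zero.
    """
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = ["A", "A", "B", "B"]
print(round(ari_from_scratch(truth, [1, 1, 0, 0]), 3))  # 1.0: same grouping, different names
print(round(ari_from_scratch(truth, [0, 1, 0, 1]), 3))  # -0.5: worse than chance
```

Read this way, a score near 0.5 for TF-IDF with K-Means means the clusters agree with the query labels well above chance but are far from a perfect match.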


## Results for Full-Text Processing
Results with max_features=1000 for both BOW and TF-IDF, unigrams only, no lemmatization:
BOW:
  - KMeans ARI: 0.177
  - EM ARI: 0.177
  - Hierarchical ARI: 0.389
TF-IDF:
  - KMeans ARI: 0.565
  - EM ARI: 0.565
  - Hierarchical ARI: 0.327
LDA:
  - KMeans ARI: 0.153
  - EM ARI: 0.111
  - Hierarchical ARI: 0.151
Word Embeddings:
  - KMeans ARI: 0.241
  - EM ARI: 0.241
  - Hierarchical ARI: 0.269

Results with max_features=3000 for both BOW and TF-IDF:
BOW:
  - KMeans ARI: 0.213
  - EM ARI: 0.213
  - Hierarchical ARI: 0.35
TF-IDF:
  - KMeans ARI: 0.539
  - EM ARI: 0.539
  - Hierarchical ARI: 0.377
LDA:
  - KMeans ARI: 0.153
  - EM ARI: 0.111
  - Hierarchical ARI: 0.151
Word Embeddings:
  - KMeans ARI: 0.241
  - EM ARI: 0.241
  - Hierarchical ARI: 0.269

New changes:
For BOW and TF-IDF, ngram_range=(1, 2) added with max_features=1000; lemmatization added to text preprocessing:

BOW:
  - KMeans ARI: 0.299
  - EM ARI: 0.299
  - Hierarchical ARI: 0.324
TF-IDF:
  - KMeans ARI: 0.486
  - EM ARI: 0.486
  - Hierarchical ARI: 0.297
LDA:
  - KMeans ARI: 0.222
  - EM ARI: 0.135
  - Hierarchical ARI: 0.204
Word Embeddings:
  - KMeans ARI: 0.235
  - EM ARI: 0.235
  - Hierarchical ARI: 0.245

After removing standalone numbers in preprocessing, to see whether clustering improves:
BOW:
  - KMeans ARI: 0.299
  - EM ARI: 0.299
  - Hierarchical ARI: 0.324
TF-IDF:
  - KMeans ARI: 0.481
  - EM ARI: 0.52
  - Hierarchical ARI: 0.298
LDA:
  - KMeans ARI: 0.222
  - EM ARI: 0.135
  - Hierarchical ARI: 0.204
Word Embeddings:
  - KMeans ARI: 0.235
  - EM ARI: 0.235
  - Hierarchical ARI: 0.245

After removing lemmatization:
BOW:
  - KMeans ARI: 0.319
  - EM ARI: 0.319
  - Hierarchical ARI: 0.356
TF-IDF:
  - KMeans ARI: 0.493
  - EM ARI: 0.493
  - Hierarchical ARI: 0.319
LDA:
  - KMeans ARI: 0.153
  - EM ARI: 0.111
  - Hierarchical ARI: 0.151
Word Embeddings:
  - KMeans ARI: 0.241
  - EM ARI: 0.241
  - Hierarchical ARI: 0.269
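
The run logs above are easier to compare once one metric is lined up across configurations. A small sketch using only the TF-IDF K-Means ARI values reported above; the configuration labels are shorthand paraphrases of the run descriptions, not names from the notebook:

```python
# TF-IDF + K-Means ARI copied from the five runs logged above.
# The keys are shorthand labels (an assumption), not identifiers from the notebook.
tfidf_kmeans_ari = {
    "max_features=1000, unigrams, no lemmatization": 0.565,
    "max_features=3000": 0.539,
    "bigrams + lemmatization, max_features=1000": 0.486,
    "bigrams + lemmatization + number removal": 0.481,
    "bigrams, lemmatization removed": 0.493,
}

# Rank configurations from highest to lowest ARI.
ranked = sorted(tfidf_kmeans_ari.items(), key=lambda item: item[1], reverse=True)
for config, ari in ranked:
    print(f"{ari:.3f}  {config}")

best_config, best_ari = ranked[0]
```

On these numbers the simplest configuration scores highest for TF-IDF with K-Means, while BOW with K-Means moves the other way (0.177 in the first run, 0.319 in the last), so the ranking depends on which representation is being judged. Each figure comes from a single run with one random seed, so small differences should be read with caution.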
