DT5125 - Data Science Application

Group Assignment 1 - Text Classification 

Group 9

Main Topic: COVID-19

Subtopics:

Label; Category; Description

A; COVID-19 & Vaccine Efficacy; Studies evaluating how well different vaccines work, possibly across variants.

B; COVID-19 & Long-Term Neurological Effects; Research on long COVID and cognitive/neurological symptoms (brain fog, memory loss).

C; COVID-19 & Machine Learning for Diagnosis; Use of ML to diagnose COVID-19 from imaging (CT, X-rays) or symptoms.

D; COVID-19 & Public Health Policy / Social Behavior; Research on lockdown effects, mask compliance, misinformation, etc.

E; COVID-19 & Genomic/Variant Analysis; Studies on SARS-CoV-2 mutations, variant tracking, and genomic signatures.



Step 1: Find and import 200 abstracts per category, save to csv with label and metadata 

In [None]:
import time
import re
import pandas as pd
from Bio import Entrez

# Contact email that NCBI asks for on every Entrez (E-utilities) request
Entrez.email = "akaur104@uottawa.ca"

# Define search categories
categories = {
    "A": 'COVID-19 AND ("vaccine efficacy" OR "vaccine effectiveness")',
    "B": 'COVID-19 AND ("long COVID" OR "neurological symptoms" OR "brain fog")',
    "C": 'COVID-19 AND ("machine learning" OR "deep learning") AND ("diagnosis" OR "prediction")',
    "D": 'COVID-19 AND ("public health" OR "policy" OR "lockdown" OR "social distancing")',
    "E": 'COVID-19 AND ("variant analysis" OR "genomic surveillance" OR "SARS-CoV-2 mutations")'
}

def clean_text(text):
    """
    Strip HTML/XML tags, collapse whitespace, and drop every character that is
    not a letter, digit, or space (so "COVID-19" becomes "COVID19").
    """
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text.strip()

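def fetch_200_valid_abstracts(query, email=None):
    """
    Fetches batches of PubMed articles until 200 valid ones with abstracts are collected.
    Returns each with title, abstract, and additional metadata (the label is added by the caller).
    """

    # Only override the module-level Entrez.email when an address is passed explicitly;
    # the old placeholder default silently replaced the real address set above
    if email:
        Entrez.email = email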
    valid_abstracts = []
    retstart = 0
    batch_size = 100
    max_attempts = 1000

    search_handle = Entrez.esearch(db="pubmed", term=query, usehistory="y", retmax=0)
    search_results = Entrez.read(search_handle)
    total_records = int(search_results["Count"])
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    while len(valid_abstracts) < 200 and retstart < total_records and max_attempts > 0:
        try:
            fetch_handle = Entrez.efetch(
                db="pubmed",
                rettype="abstract",
                retmode="xml",
                retstart=retstart,
                retmax=batch_size,
                webenv=webenv,
                query_key=query_key
            )
            records = Entrez.read(fetch_handle)
        except Exception as e:
            print(f"Entrez fetch failed at retstart={retstart}: {e}")
            break

        for article in records.get("PubmedArticle", []):
            try:
                article_meta = article["MedlineCitation"]
                article_fields = article_meta["Article"]
                title = article_fields.get("ArticleTitle", "")
                abstract_parts = article_fields.get("Abstract", {}).get("AbstractText", [])
                if not abstract_parts:
                    continue

                abstract = " ".join(str(p) for p in abstract_parts)

                # Metadata
                pmid = str(article_meta.get("PMID", "?"))  # plain string instead of an Entrez element
                journal = article_fields.get("Journal", {}).get("Title", "?")
                pub_year = article_fields.get("Journal", {}).get("JournalIssue", {}).get("PubDate", {}).get("Year", "?")

                authors_list = article_fields.get("AuthorList", [])
                authors = ", ".join([
                    f"{a.get('LastName', '')} {a.get('Initials', '')}"
                    for a in authors_list if "LastName" in a
                ][:3])  # up to 3 authors

                article_ids = article.get("PubmedData", {}).get("ArticleIdList", [])
                # Avoid shadowing the built-in id(); store the DOI as a plain string
                doi = next((str(aid) for aid in article_ids if aid.attributes.get("IdType") == "doi"), "?")

                valid_abstracts.append({
                    "pmid": pmid,
                    "title": clean_text(title),
                    "abstract": clean_text(abstract),
                    "journal": journal,
                    "pub_year": pub_year,
                    "authors": authors,
                    "doi": doi
                })

                if len(valid_abstracts) >= 200:
                    break
            except Exception:
                continue

        retstart += batch_size
        max_attempts -= 1
        time.sleep(0.3)
        print(f"✅ Collected {len(valid_abstracts)} valid abstracts...")

    return valid_abstracts[:200]


# Download and label data
all_data = []

for label, query in categories.items():
    print(f"\n🔍 Fetching Category {label}")
    articles = fetch_200_valid_abstracts(query)
    for article in articles:
        article["label"] = label  # Add label to each entry
    all_data.extend(articles)

# Save with metadata
df = pd.DataFrame(all_data)
df.to_csv("covid19_labeled_abstracts_with_metadata.csv", index=False)
print("✅ Saved to 'covid19_labeled_abstracts_with_metadata.csv'")




🔍 Fetching Category A
✅ Collected 96 valid abstracts...
✅ Collected 194 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category B
✅ Collected 96 valid abstracts...
✅ Collected 195 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category C
✅ Collected 100 valid abstracts...
✅ Collected 198 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category D
✅ Collected 94 valid abstracts...
✅ Collected 192 valid abstracts...
✅ Collected 200 valid abstracts...

🔍 Fetching Category E
✅ Collected 97 valid abstracts...
✅ Collected 197 valid abstracts...
✅ Collected 200 valid abstracts...
✅ Saved to 'covid19_labeled_abstracts_with_metadata.csv'
✅ Data saved to 'covid19_labeled_abstracts.csv'
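
To make the effect of `clean_text` concrete, here is a small self-contained sketch that reuses the same three regular expressions on a made-up string (the sample text is an assumption, not a real PubMed record). Note that hyphens and percent signs are removed as well, so "COVID-19" ends up as "COVID19".

```python
import re

def clean_text(text):
    # Same three substitutions as in the Step 1 cell
    text = re.sub(r"<[^>]+>", "", text)          # drop HTML/XML tags
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # keep only letters, digits, spaces
    return text.strip()

sample = "<i>COVID-19</i> vaccines:  95% effective!"  # hypothetical input
cleaned = clean_text(sample)
print(cleaned)
```

If hyphenated terms should survive as separate tokens, one option would be to replace punctuation with a space instead of an empty string; that is a design choice, not something the original cell does.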


Step 2: Preprocess Text, Lowercase everything, remove stopwords/punctuation, truncate to ~100 words

In [5]:
# Load the CSV file
df = pd.read_csv("covid19_labeled_abstracts_with_metadata.csv")

# Basic English stopword list (custom/local version)
local_stopwords = set("""
a about above after again against all am an and any are aren't as at be because been before being below between
both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from
further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his
how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor
not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should
shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd
they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were
weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you
you'd you'll you're you've your yours yourself yourselves
""".split())

# Function to clean and preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
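    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    tokens = text.split()  # Tokenize into words
    # Punctuation is already gone at this point, so contractions in the stopword
    # list (e.g. "don't") do not match their stripped forms ("dont")
    tokens = [word for word in tokens if word not in local_stopwords]  # Remove stopwords
    return " ".join(tokens[:100])  # Keep the first 100 remaining tokens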

# Apply to abstract and title columns
df["clean_abstract"] = df["abstract"].astype(str).apply(preprocess_text)
df["clean_title"] = df["title"].astype(str).apply(preprocess_text)

# Optional: Save to new file
df.to_csv("covid19_preprocessed_dataset.csv", index=False)
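
As a quick illustration of what the preprocessing step produces, the sketch below applies the same lowercase, strip-punctuation, filter, truncate sequence to one made-up sentence. The four-word stopword set and the `max_tokens` parameter are assumptions for the demo only; the real cell uses the full `local_stopwords` list and a fixed limit of 100.

```python
import re

# Tiny stand-in for the full stopword list (hypothetical subset)
demo_stopwords = {"the", "of", "and", "isn't"}

def preprocess_text(text, stopwords, max_tokens=100):
    text = text.lower()                          # lowercase
    text = re.sub(r"[^\w\s]", "", text)          # remove punctuation
    tokens = [w for w in text.split() if w not in stopwords]
    return " ".join(tokens[:max_tokens])         # keep the first max_tokens tokens

result = preprocess_text(
    "The efficacy of mRNA vaccines isn't uniform, and wanes.", demo_stopwords)
print(result)
```

The token "isnt" survives because the apostrophe is removed before the stopword lookup. For scientific abstracts this rarely matters, but it is worth knowing when reusing the function on informal text.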


Step 3: Partition Dataset, Randomly split into Train (60%), Validation (20%), Test (20%); keep classes balanced

In [7]:
from sklearn.model_selection import train_test_split

# Split: 60% Train, 20% Val, 20% Test with stratified sampling by label
train_df, temp_df = train_test_split(df, test_size=0.4, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42)

# Save to files
train_df.to_csv("covid19_train.csv", index=False)
val_df.to_csv("covid19_val.csv", index=False)
test_df.to_csv("covid19_test.csv", index=False)

print("✅ Partitioned and saved to:")
print(" - covid19_train.csv")
print(" - covid19_val.csv")
print(" - covid19_test.csv")


✅ Partitioned and saved to:
 - covid19_train.csv
 - covid19_val.csv
 - covid19_test.csv
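
A simple sanity check after partitioning is to count labels per split. The sketch below does this on synthetic labels (200 per class, mirroring the dataset size described above) rather than on the real CSV files, so the variable names and data are assumptions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 items for each of the labels A-E
labels = [c for c in "ABCDE" for _ in range(200)]
items = list(range(len(labels)))

# Same two-stage 60/20/20 stratified split as in the Step 3 cell
train_x, temp_x, train_y, temp_y = train_test_split(
    items, labels, test_size=0.4, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.5, stratify=temp_y, random_state=42)

train_counts, val_counts, test_counts = Counter(train_y), Counter(val_y), Counter(test_y)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
print("test: ", dict(test_counts))
```

On the real data the equivalent check would be `train_df["label"].value_counts()` and the same for `val_df` and `test_df`.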


Step 4: Apply Feature Transformation, Use BoW, TF-IDF, N-gram (optionally LDA, Doc2Vec) to convert text into features

In [8]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Load the preprocessed and split datasets
train_df = pd.read_csv("covid19_train.csv")
val_df = pd.read_csv("covid19_val.csv")
test_df = pd.read_csv("covid19_test.csv")

# Use the cleaned abstract text for vectorization
text_field = "clean_abstract"

# --------------------------
# 1. Bag of Words (Unigrams)
# --------------------------
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(train_df[text_field])
X_val_bow = bow_vectorizer.transform(val_df[text_field])
X_test_bow = bow_vectorizer.transform(test_df[text_field])

# --------------------------
# 2. TF-IDF (Unigrams)
# --------------------------
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df[text_field])
X_val_tfidf = tfidf_vectorizer.transform(val_df[text_field])
X_test_tfidf = tfidf_vectorizer.transform(test_df[text_field])

# ----------------------------------
# 3. TF-IDF with N-grams (1–2 grams)
# ----------------------------------
tfidf_ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_tfidf_ngram = tfidf_ngram_vectorizer.fit_transform(train_df[text_field])
X_val_tfidf_ngram = tfidf_ngram_vectorizer.transform(val_df[text_field])
X_test_tfidf_ngram = tfidf_ngram_vectorizer.transform(test_df[text_field])

# ------------------------------------
# Feature Summary (used for debugging)
# ------------------------------------
print("Feature dimensions:")
print(f"BoW: {X_train_bow.shape[1]} features")
print(f"TF-IDF: {X_train_tfidf.shape[1]} features")
print(f"TF-IDF (1-2 N-grams): {X_train_tfidf_ngram.shape[1]} features")


Feature dimensions:
BoW: 9754 features
TF-IDF: 9754 features
TF-IDF (1-2 N-grams): 54944 features
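
The feature counts above follow from how the vectorizers build their vocabularies: BoW and unigram TF-IDF share the same vocabulary (only the weights differ), while the (1, 2) n-gram range adds every adjacent word pair. A toy sketch with a two-document corpus (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["vaccine efficacy study", "variant genomic study"]  # toy corpus

unigram = CountVectorizer().fit(docs)
unigram_tfidf = TfidfVectorizer().fit(docs)
bigram_tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigram.vocabulary_))        # single words only
print(sorted(bigram_tfidf.vocabulary_))   # single words plus adjacent pairs
```

Since bigram vocabularies grow quickly, parameters such as `min_df` or `max_features` on the vectorizer are the usual way to keep the feature matrix manageable.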


Step 5: Build Feature-Model Combos, Combine each feature method with different classifiers (e.g., TF-IDF + LR, N-gram + NB)

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# NOTE: The previous code cell must be run before this one

# Get labels
y_train = train_df["label"]
y_val = val_df["label"]
y_test = test_df["label"]

# Use the previously created feature matrices; they should be available in memory if the previous code block was run
# - X_train_bow, X_test_bow
# - X_train_tfidf, X_test_tfidf
# - X_train_tfidf_ngram, X_test_tfidf_ngram

# Dictionary to store results
model_results = {}

# 1. TF-IDF + Logistic Regression
lr_tfidf = LogisticRegression(max_iter=1000)
lr_tfidf.fit(X_train_tfidf, y_train)
preds_lr_tfidf = lr_tfidf.predict(X_test_tfidf)
model_results["TF-IDF + Logistic Regression"] = classification_report(
    y_test, preds_lr_tfidf, output_dict=True)

# 2. TF-IDF N-gram + Naive Bayes
nb_tfidf_ngram = MultinomialNB()
nb_tfidf_ngram.fit(X_train_tfidf_ngram, y_train)
preds_nb_tfidf_ngram = nb_tfidf_ngram.predict(X_test_tfidf_ngram)
model_results["TF-IDF N-gram + Naive Bayes"] = classification_report(
    y_test, preds_nb_tfidf_ngram, output_dict=True)

# 3. BoW + Naive Bayes
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)
preds_nb_bow = nb_bow.predict(X_test_bow)
model_results["BoW + Naive Bayes"] = classification_report(
    y_test, preds_nb_bow, output_dict=True)

# 4. TF-IDF N-gram + Logistic Regression
lr_tfidf_ngram = LogisticRegression(max_iter=1000)
lr_tfidf_ngram.fit(X_train_tfidf_ngram, y_train)
preds_lr_tfidf_ngram = lr_tfidf_ngram.predict(X_test_tfidf_ngram)
model_results["TF-IDF N-gram + Logistic Regression"] = classification_report(
    y_test, preds_lr_tfidf_ngram, output_dict=True)

# ----------------------------------
# Summary Table (Accuracy, F1 Scores)
# ----------------------------------
summary = []
for name, report in model_results.items():
    row = {
        "Model": name,
        "Accuracy": report["accuracy"],
        "Macro F1": report["macro avg"]["f1-score"],
        "Weighted F1": report["weighted avg"]["f1-score"]
    }
    summary.append(row)

summary_df = pd.DataFrame(summary)
print(summary_df)


                                 Model  Accuracy  Macro F1  Weighted F1
0         TF-IDF + Logistic Regression     0.775  0.773009     0.773009
1          TF-IDF N-gram + Naive Bayes     0.765  0.760917     0.760917
2                    BoW + Naive Bayes     0.745  0.742114     0.742114
3  TF-IDF N-gram + Logistic Regression     0.785  0.781901     0.781901
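
The cell above compares all four combinations directly on the test split, and the validation split from Step 3 is not used. One common alternative, sketched below on toy data, is to pick the best combination on the validation split and touch the test split only once for the chosen model. All texts and labels here are hypothetical stand-ins; with the real data the same pattern would use `X_val_tfidf`, `y_val`, and so on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the train/validation/test splits (hypothetical texts)
train_texts = ["vaccine efficacy trial", "booster vaccine dose", "vaccine effectiveness study",
               "variant genome mutation", "genomic sequencing variant", "spike mutation lineage"]
train_y = ["A", "A", "A", "E", "E", "E"]
val_texts, val_y = ["vaccine efficacy data", "variant mutation tracking"], ["A", "E"]
test_texts, test_y = ["booster vaccine trial", "genomic variant lineage"], ["A", "E"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val, X_test = vectorizer.transform(val_texts), vectorizer.transform(test_texts)

candidates = {
    "TF-IDF + Logistic Regression": LogisticRegression(max_iter=1000),
    "TF-IDF + Naive Bayes": MultinomialNB(),
}

# Compare candidates on the validation split only
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, train_y)
    val_scores[name] = f1_score(val_y, model.predict(X_val), average="macro")

best_name = max(val_scores, key=val_scores.get)

# Evaluate the selected model once on the held-out test split
test_f1 = f1_score(test_y, candidates[best_name].predict(X_test), average="macro")
print(best_name, val_scores, test_f1)
```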


Extra information for Freeman below

In [None]:
Amrit input: how can we make these models perform better?

Great question — improving model performance in a text classification pipeline like yours can be tackled at multiple levels:

---

## 🚀 How to Improve Model Performance

### 🔧 1. **Better Text Features**

Instead of relying solely on raw BoW or TF-IDF, use semantically richer features:

* **Use pre-trained embeddings** (Word2Vec, GloVe, FastText)
  → Converts words into vectors that capture context and meaning
* **Use transformer-based embeddings** like BERT, BioBERT, or SciBERT
  → Especially useful for biomedical text like PubMed abstracts
* **LDA (Latent Dirichlet Allocation)** for topic modeling
  → Helps capture latent themes across documents
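
Of the options above, LDA is the one that needs no extra downloads, since scikit-learn ships it. The sketch below turns a toy corpus into per-document topic proportions that could be used as features; the documents, the choice of two topics, and the variable names are assumptions for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy corpus, not real abstracts
    "vaccine efficacy trial booster",
    "vaccine dose efficacy booster",
    "variant mutation genome lineage",
    "genome sequencing variant mutation",
]

# LDA expects raw term counts, so use CountVectorizer rather than TF-IDF
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)  # one row per document, one column per topic

print(topic_features.shape)
print(topic_features.round(2))
```

Each row is a probability distribution over topics, so these columns can be fed to a classifier on their own or stacked next to TF-IDF features.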

---

### 🧠 2. **Smarter Models**

* **SVM** (Support Vector Machines): Often competitive with or better than NB/LR on sparse, high-dimensional text features
* **Random Forest / XGBoost**: Work well with engineered features
* **Neural Networks / CNNs / LSTMs**: For large datasets or contextual sequence modeling
* **Transformer models (e.g., BERT)** fine-tuned on your data:

  * HuggingFace's `transformers` library makes this easy
  * Example: Fine-tune `bert-base-uncased` or `dmis-lab/biobert-base-cased-v1.2`
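
As a low-effort starting point among these options, a linear SVM can be dropped in with the same TF-IDF features. This is a minimal sketch on made-up texts for two of the five labels; it says nothing about how an SVM would score on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [  # hypothetical training snippets
    "vaccine efficacy trial results",
    "vaccine effectiveness against infection",
    "booster vaccine efficacy",
    "variant genomic surveillance",
    "spike mutation variant lineage",
    "genomic sequencing of variant",
]
labels = ["A", "A", "A", "E", "E", "E"]

# Vectorizer and classifier chained so raw text goes in and labels come out
svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
svm_clf.fit(texts, labels)

preds = svm_clf.predict(["vaccine efficacy in adults", "new variant mutation"])
print(list(preds))
```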

---

### 🔍 3. **Text Data Enhancements**

* **Augment abstracts** by including other metadata: title, journal, keywords
* **Use full articles or summaries** if possible (not just abstracts)
* **Back-translation** or **synonym injection** for low-resource classes

---

### ⚖️ 4. **Data Balancing**

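* Ensure each class (A–E) has equal or sufficient examples (this dataset already has 200 per class, and the splits are stratified)
* If classes become imbalanced, try **oversampling** (SMOTE) or **class weighting** in your model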
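
To check whether class weighting would change anything, scikit-learn can compute the "balanced" weights directly. The sketch below uses synthetic labels with 200 items per class (matching the dataset described earlier), so every weight comes out as 1.0 and weighting would have no effect here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels mirroring the dataset: 200 items for each of A-E
labels = np.array([c for c in "ABCDE" for _ in range(200)])
classes = np.unique(labels)

# "balanced" weight = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

If the class counts ever diverge, passing `class_weight="balanced"` to `LogisticRegression` applies the same formula automatically.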

---

### 🧪 5. **Hyperparameter Tuning**

Use grid/randomized search for:

* `alpha` in Naive Bayes
* `C` and `penalty` in Logistic Regression or SVM
* `max_features`, `ngram_range`, `min_df` in vectorizers

```python
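from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate regularization strengths; 'l2' is the default penalty
params = {
    'C': [0.1, 1, 10],
    'penalty': ['l2']
}
# 5-fold cross-validation on the training split only
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params, cv=5)
grid.fit(X_train_tfidf, y_train)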
print(grid.best_params_)
```
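
The vectorizer settings listed above can be tuned in the same search by wrapping the vectorizer and classifier in a `Pipeline`, so each cross-validation fold refits the vocabulary on its own training part. This is a sketch on a made-up eight-document corpus with `cv=2`; the step names `tfidf` and `clf` and the grid values are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [  # hypothetical snippets, four per class
    "vaccine efficacy trial", "booster vaccine dose",
    "vaccine effectiveness study", "vaccine efficacy booster",
    "variant genome mutation", "genomic sequencing variant",
    "spike mutation lineage", "variant lineage genome",
]
labels = ["A", "A", "A", "A", "E", "E", "E", "E"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_)
```

With the real data the pipeline would be fitted on `train_df["clean_abstract"]` and `y_train`, with `cv=5` as in the snippet above.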

---

### 🧼 6. **Advanced Preprocessing**

* **Named Entity Recognition (NER)** to extract clinical/medical concepts
* **Stemming or Lemmatization** using spaCy or NLTK
* **POS-tag filtering**: Keep only relevant parts of speech

---

### 📊 7. **Evaluation Strategy**

* Use **confusion matrices** to find specific classes that need improvement
* Inspect **misclassified examples** for patterns or labeling inconsistencies
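
Both checks take only a few lines with scikit-learn. The sketch below uses hand-written true and predicted labels for three of the categories; with the real models, `y_true` would be `y_test` and `y_pred` one of the prediction arrays from Step 5.

```python
from sklearn.metrics import confusion_matrix

labels = ["A", "B", "E"]
y_true = ["A", "A", "B", "B", "E", "E"]   # hypothetical ground truth
y_pred = ["A", "B", "B", "B", "E", "A"]   # hypothetical predictions

# Rows are true labels, columns are predicted labels, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Positions of the misclassified examples, for manual inspection
misclassified = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(misclassified)
```

Off-diagonal cells show which pairs of categories get confused, and the index list points back to the rows worth reading by hand.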

---

Would you like help with any of these:

* Trying **SVM or BERT embeddings**?
* Adding **title + abstract** together as input?
* Running **grid search** for your current models?

Let’s focus based on your goals (better accuracy, research exploration, deep learning, etc.).


In [None]:
Amrit input: what exactly is the output of this code? the end goal is to train a machine that can tell us which covid category when asked

Excellent — let’s clarify **what the output of this pipeline is**, how it fits your end goal, and what to do next to achieve a deployable classifier.

---

## ✅ **What You've Built So Far**

You’ve created a **supervised text classification pipeline** to automatically classify research paper abstracts into **one of five COVID-19 research categories** (A–E). This includes:

### 1. **Data Curation**

* 200 abstracts per category (1000 in total)
* Labeled, cleaned, and stored with metadata

### 2. **Text Preprocessing**

* Lowercased, punctuation stripped, stopwords removed, truncated to 100 words

### 3. **Feature Extraction**

* Converted text into numerical vectors using:

  * BoW
  * TF-IDF
  * N-gram TF-IDF

### 4. **Model Training**

* Trained models: Logistic Regression, Naive Bayes
* Evaluated them on unseen test data

---

## 📦 **What Is the Output of This Code?**

The current code outputs:

* Four trained models (**Logistic Regression and Naive Bayes** on different feature sets)
* A summary table of **classification accuracy and F1 scores** on the test set
* Models that, when given a preprocessed abstract, can predict its **category (A–E)**

However, these models are currently only **in memory** and not saved or deployed. That’s the next step.

---

## 🎯 **Your End Goal: Ask → Get a Category**

You want to:

> Input a new research abstract or article, and get back its category (A–E)

To achieve this fully:

### ✅ You're most of the way there.

### 🔜 Final Steps:

1. **Save the best model to disk**

   ```python
   import joblib
   joblib.dump(lr_tfidf_ngram, "covid_category_model.pkl")
   joblib.dump(tfidf_ngram_vectorizer, "tfidf_vectorizer.pkl")
   ```

2. **Write a prediction script**

   ```python
   import re
   import joblib

   # Load model and vectorizer
   model = joblib.load("covid_category_model.pkl")
   vectorizer = joblib.load("tfidf_vectorizer.pkl")

   # New abstract (example)
   new_text = """COVID-19 vaccines have shown varying levels of effectiveness against new variants..."""

   # Preprocess exactly as in Step 2; `local_stopwords` must be the same set
   # used for training (copy it into this script or import it)
   def preprocess(text):
       text = text.lower()
       text = re.sub(r"[^\w\s]", "", text)
       tokens = [w for w in text.split() if w not in local_stopwords]
       return " ".join(tokens[:100])

   clean_input = preprocess(new_text)

   # Vectorize and predict
   X_input = vectorizer.transform([clean_input])
   prediction = model.predict(X_input)

   print("Predicted Category:", prediction[0])
   ```

3. **(Optional) Build a Web API or Interface**

   * Use Flask or FastAPI to serve this model
   * Add a UI where users paste in text and get the category
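
One way to keep the saved vectorizer and model from drifting apart is to persist them as a single scikit-learn `Pipeline` instead of two files. The sketch below trains on made-up texts and round-trips the pipeline through a temporary directory; the file name `covid_category_pipeline.pkl` is hypothetical, and text passed to the pipeline would still need the Step 2 preprocessing first.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [  # hypothetical, already preprocessed snippets
    "vaccine efficacy trial", "booster vaccine dose", "vaccine effectiveness study",
    "variant genome mutation", "genomic sequencing variant", "spike mutation lineage",
]
labels = ["A", "A", "A", "E", "E", "E"]

# Vectorizer and classifier travel together as one object
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "covid_category_pipeline.pkl")  # hypothetical name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_texts = ["vaccine efficacy against variants", "genome mutation analysis"]
original_preds = list(pipeline.predict(new_texts))
restored_preds = list(restored.predict(new_texts))
print(restored_preds)
```

A web endpoint built with Flask or FastAPI would then only need to load this one file and call `predict` on the preprocessed input.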

---

## 📌 In Short

**You now have a working machine that can predict COVID-19 abstract categories.** You just need to:

* Persist the model + vectorizer
* Build an interface or script to use it interactively

Would you like me to generate the prediction script or an API endpoint for you next?
