# CMSC 35440 Machine Learning in Biology and Medicine
## Homework 1: Embedding Immunology Research Articles
**Released**: Jan 14, 2026

**Due**: Jan 24, 2026 at 11:59 PM Chicago Time on Gradescope

**In this first homework, you'll generate embeddings for 28 immunology research articles and visualize them using various dimensionality reduction techniques.**

At a high level, embeddings are vectors, computed by some algorithm or model, that encode information from data. For this homework, you will encode text documents as vectors using the bag-of-words algorithm and normalize these vectors using the term frequency-inverse document frequency (TF-IDF) method. TF-IDF downweights ubiquitous terms and highlights vocabulary that is distinctive to each paper (e.g., checkpoint, cGAS-STING, germinal center). This helps the embeddings reflect biological themes rather than common filler, so downstream plots can separate immune subfields and detect cross-cutting topics.
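
As a point of reference, one common formulation of TF-IDF is sketched below. The Wikipedia article on tf-idf lists several variants, so treat this as one reasonable choice rather than the required one. The symbols are introduced here for illustration: $f_{t,d}$ is the raw count of term $t$ in document $d$, $N$ is the number of documents (28 in this homework), and $n_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Under this variant, a term that appears in every document has $\mathrm{idf}(t) = \log 1 = 0$, which is how ubiquitous terms get downweighted.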

The 28 papers span three major areas of immunology:
- **T-cell biology**: CD8+ T cell exhaustion, checkpoint inhibition, cancer immunotherapy
- **B-cell biology**: Germinal centers, antibody responses, T follicular helper cells  
- **Innate immunity**: TLRs, cGAS-STING pathway, macrophages, autophagy

TF-IDF dates back more than 50 years, to 1972. Through this homework, we hope to convince you that it is still very much relevant.

## Instructions

1. Download and open this starter notebook in your favorite Jupyter Notebook host. We recommend using [Google Colab](https://colab.research.google.com/).
   * **NB:** We'll design all homeworks such that they can be run on the *free* tier of Colab.
   * For this homework, we don't require the use of any GPUs.

2. Download and extract the research articles. We've provided them as a tarball that can be downloaded from [https://github.com/SummerAnn/SummerAnn-CMSC-35440-Source/releases/download/hw1/hw1.tar.gz](https://github.com/SummerAnn/SummerAnn-CMSC-35440-Source/releases/download/hw1/hw1.tar.gz).
   * You'll notice that there's a CSV of article metadata and a folder of article *PDFs*. These articles are available elsewhere on the internet as extracted text, but real-world data is messy, and one way it can be messy is that it exists only as PDFs. So **you must use the article PDFs for this assignment**.

3. Extract the text from the articles. You will probably want to use some columns from the metadata CSV at this step.

4. Compute the term-document matrix and normalize using TF-IDF. **You must implement TF-IDF yourself. You may not use any existing implementations** (e.g., you may NOT use sklearn's `TfidfVectorizer`).
   * Defining what counts as a "term" is up to you, but don't overcomplicate it. Splitting on whitespace characters works fine.
   * The Wikipedia article should be all you need: [https://en.wikipedia.org/wiki/Tf-idf](https://en.wikipedia.org/wiki/Tf-idf).

5. Normalize your per-document embeddings using L2 normalization.

6. Visualize your embeddings (3 plots total):
   * Apply a **linear** dimensionality reduction method
   * Apply a **non-linear** dimensionality reduction method
   * Add a **clustermap** for a global similarity view
   * **Important for all plots**: Color points by the **subtopic** column in the metadata CSV to show how well your embeddings capture biological themes
   * **Important**: Non-linear methods are sensitive to hyperparameters. See: [https://pair-code.github.io/understanding-umap/](https://pair-code.github.io/understanding-umap/)
   * Label your plots clearly with titles, axis labels, and a legend showing which color corresponds to which subtopic

7. Analyze your results and submit:
   * Your submission should include 2 things:
     1. Your writeup containing figures with your embedding visualizations
     2. Your notebook with your code for computing TF-IDF and generating figures
   * Your writeup should be **0.5 to 1 page** (before figures). Text should be 12pt, single spaced, with 1 inch margins, on letter size paper. PDF or Word.
   * Some guiding questions: Have your embeddings captured underlying information about the articles? Why do some articles cluster together? How do the three figures you generated compare? Do papers with the same subtopic cluster together?
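
To make the first half of step 4 concrete, here is a minimal sketch of building a count matrix from whitespace-split terms. The three toy documents, and the names `docs`, `vocab`, `term_index`, and `counts`, are made up for illustration and are not part of the dataset or the starter code. The matrix is laid out with one row per document and one column per term, which is one possible orientation. The TF-IDF weighting itself is deliberately left out.

```python
from collections import Counter

import numpy as np

# Toy documents standing in for extracted article text (hypothetical).
docs = [
    "T cell exhaustion limits T cell function",
    "germinal center B cell responses",
    "macrophages sense DNA through cGAS-STING",
]

# Simplest possible tokenizer: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# A sorted vocabulary gives every term a stable column index.
vocab = sorted({term for tokens in tokenized for term in tokens})
term_index = {term: j for j, term in enumerate(vocab)}

# counts[i, j] = number of times term j occurs in document i.
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for i, tokens in enumerate(tokenized):
    for term, n in Counter(tokens).items():
        counts[i, term_index[term]] = n

print(counts.shape)
print(counts[0, term_index["cell"]])
```

On real PDF text you will likely see punctuation and line-break artifacts attached to terms. How much cleanup to do is up to you.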

**Tips and Tricks:**
1. You're welcome to use any tools except where noted above.
2. Reading CSVs: use `pandas`
3. Extracting text from PDFs: use [`pypdf`](https://github.com/py-pdf/pypdf)
4. Normalization: use `numpy`
5. Visualization: use `matplotlib` or `seaborn`
6. For help: Email course staff or come to office hours!
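
As an illustration of the `numpy` tip for normalization, the sketch below L2-normalizes each row of a small made-up matrix. The names `X`, `norms`, and `X_unit` are hypothetical, and the zero-norm guard is an assumption about how you might want to treat an empty document, not a requirement.

```python
import numpy as np

# Toy matrix: one row per document, one column per term (values are made up).
X = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],  # an empty document, to show the zero-norm guard
])

# Euclidean (L2) norm of each row, kept as a column so it broadcasts.
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Avoid dividing by zero: all-zero rows are left unchanged.
safe_norms = np.where(norms == 0, 1.0, norms)
X_unit = X / safe_norms

print(X_unit)
```

Once every row has unit length, the dot product of two rows equals their cosine similarity, which is convenient for a similarity-based view such as a clustermap.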

In [None]:
!pip install pypdf numpy pandas matplotlib seaborn scikit-learn umap-learn

## Setup

In [None]:
!wget https://github.com/SummerAnn/SummerAnn-CMSC-35440-Source/releases/download/hw1/hw1.tar.gz
!tar -xzf hw1.tar.gz

In [None]:
import numpy as np
import pandas as pd
from pypdf import PdfReader
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

np.random.seed(42)

## Load Metadata

In [None]:
import os
import re

# Prefer local metadata (with subtopic if available)
candidates = [
    "article-metadata-with-subtopic.csv",
    "hw1/article-metadata-with-subtopic.csv",
    "article-metadata.csv",
    "hw1/article-metadata.csv",
]
meta_path = next((p for p in candidates if os.path.exists(p)), candidates[-1])
df = pd.read_csv(meta_path)

# Use the most specific label column available, falling back to the last column
label_col = next(
    (c for c in ("subtopic", "topic", "category") if c in df.columns),
    df.columns[-1],
)
label_title = label_col.capitalize()
print(f"Loaded {len(df)} papers from {meta_path}")
print(f"\n{label_title} distribution:")
print(df[label_col].value_counts())
df.head()
