## Introduction

The [ZymoBIOMICS mock microbial community](https://github.com/LomanLab/mockcommunity) is a standardized mixture of known bacterial and fungal taxa at defined proportions. Because the true composition is known, it is widely used as a benchmarking and proof-of-concept dataset for evaluating bioinformatics pipelines, feature representations, and ecological analysis methods.

In this homework, you will use the Zymo mock community as a controlled testbed to assess the quality of **GenSLMs sequence embeddings**.



### Background

GenSLMs embeddings are pretrained representations derived from the **GenSLM foundation model** (GitHub: https://github.com/ramanathanlab/genslm).  
They are intended to capture biologically meaningful patterns such as evolutionary or phylogenetic relatedness and shared functional motifs.

There are 10 reference genomes in this dataset. Metagenomic reads have been assembled into contigs. The reference genome IDs correspond to:

- **0**: *Bacillus subtilis*
- **1**: *Cryptococcus neoformans*
- **2**: *Enterococcus faecalis*
- **3**: *Escherichia coli*
- **4**: *Lactobacillus fermentum*
- **5**: *Listeria monocytogenes*
- **6**: *Pseudomonas aeruginosa*
- **7**: *Saccharomyces cerevisiae*
- **8**: *Salmonella enterica*
- **9**: *Staphylococcus aureus*

Among these genomes, *E. coli* and *Salmonella* share ~87.04% similarity, and *Bacillus* and *Staphylococcus* share ~72.25% similarity.



### Hypothesis

We hypothesize that GenSLMs embeddings encode evolutionary and functional signals strongly enough that:

1. A classifier trained on the embeddings can accurately predict contig origin (genome ID).
2. Low-dimensional projections (PCA / UMAP / t-SNE) reveal coherent clusters aligned with genome IDs and reflect the known similarity between pairs such as *E. coli* and *Salmonella*.
3. Clustering algorithms applied to the embeddings approximate true contig origin groupings with high homogeneity and completeness.



### Strategy

1. Load contig-level embeddings and genome IDs.
2. Perform a stratified train/test split to evaluate generalization.
3. Train multiple multi-class classifiers (Logistic Regression, LightGBM, XGBoost, CatBoost) and select the best model based on macro F1.
4. To speed up classification, reduce the dimensionality using PCA and UMAP, and evaluate how different numbers of components affect classification performance.
5. Visualize the embeddings using PCA, t-SNE, and UMAP with tuned hyperparameters; annotate notable overlaps (e.g., *E. coli* vs. *Salmonella*).
6. Apply clustering (DBSCAN / HDBSCAN / KMeans) on the 2D reduced space and compute homogeneity and completeness scores.




### Evaluation

- **Classification**: macro precision, macro recall, macro F1, accuracy.
- **Visualization**: qualitative cluster separation and overlap.
- **Clustering**: `homogeneity_score`, `completeness_score`, number of clusters vs. number of expected genomes, outlier count (if applicable).


### Open-Ended Nature

This assignment is intentionally open-ended: there is no single “correct” setting.  
You are encouraged to explore reasonable hyperparameters.  
For some steps, I also provide my own settings so that everyone starts from the same baseline.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
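# load the embeddings ('e') and the reference genome IDs ('id')
genslm_npz = np.load("zymo_genslm_embeddings.npz")
# print both shapes; a notebook cell only displays its last expression
print(genslm_npz['e'].shape)
print(genslm_npz['id'].shape)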

In [None]:
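# one row per contig; the genome ID ('rid') goes in the last column
genslm_embed = pd.DataFrame(genslm_npz['e'])
genslm_embed['rid'] = genslm_npz['id']
# shuffle the rows; the fixed seed (an arbitrary choice) keeps runs reproducible
genslm_embed = genslm_embed.sample(frac=1, random_state=42)
genslm_embed.describe()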

In [None]:
# count the contigs per reference genome ID (rid)
genslm_embed['rid'].value_counts()

In [None]:
# split the data into feature and target
X = genslm_embed.iloc[:, :-1].values
y = genslm_embed.iloc[:, -1].values

# stratified 80/20 train/test split; the fixed seed (an arbitrary choice) keeps it reproducible
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [None]:
# standardize the features; fit the scaler on the training set only so no test statistics leak in
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Task 1: Multi-Class Classification (20 points)

Apply multiple classifiers to the dataset and identify the best-performing model based on the **macro F1 score**.  
The classifiers to evaluate include: **Multinomial Logistic Regression, LightGBM, XGBoost, and CatBoost**.

Please display the performance metrics for each classifier (e.g., **accuracy, macro precision, macro recall, macro F1**) to justify your selection of the top model for the downstream tasks.
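
As a starting point, the sketch below shows one way the requested metrics might be computed for a single classifier. The synthetic data from `make_classification`, the helper name `macro_report`, and all hyperparameters are assumptions made for illustration. In the assignment you would pass in the `X_train`, `X_test`, `y_train`, and `y_test` arrays prepared above and repeat the call for LightGBM, XGBoost, and CatBoost.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def macro_report(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1 (hypothetical helper)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
    }


# Synthetic stand-in for the embeddings (assumption): 10 classes, 64 features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=64, n_informative=20,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit the scaler on the training portion only.
scaler = StandardScaler().fit(Xtr)

# With the default lbfgs solver, LogisticRegression fits a multinomial model
# when the target has more than two classes.
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(Xtr), ytr)

demo_scores = macro_report(yte, clf.predict(scaler.transform(Xte)))
print(demo_scores)
```

Collecting one such dictionary per classifier into a `pandas` DataFrame gives a compact comparison table, sorted by `macro_f1`.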

## Task 2: Dimensionality Reduction for Efficient Classification (20 points)

Using the full embedding dimensionality for classification can be computationally intensive.  
In this step, your objective is to reduce the dimensionality of the data using **PCA** and **UMAP** to speed up classification.

Explore multiple values of `n_components` to evaluate how dimensionality affects classification performance.  
Use the **best-performing classifier identified in Task 1**.

The dimensionalities to explore are:

```python
n_components = [10, 20, 43, 100, 150, 200, 300, 400]
```

For each setting, report the **accuracy, macro precision, macro recall, and macro F1 score**.

After completing your experiments:

- **Identify which dimensionality reduction method (PCA or UMAP) performs better overall on this dataset.**
- **Determine the best-performing `n_components` value based on the evaluation metrics.**

**Note:** t-SNE is mainly used for reducing data to 2 or 3 dimensions for visualization.  
Because it does not scale effectively to a larger number of output dimensions, we **exclude it** from this task.
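
The sweep over `n_components` could be organized roughly as in the sketch below, shown for PCA only. The synthetic data, the shortened component list, and the choice of logistic regression as the downstream classifier are assumptions for illustration; substitute your best Task 1 model and the list given above. Wrapping the steps in a pipeline means the scaler and PCA are fitted on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings (assumption): 10 classes, 64 features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=64, n_informative=20,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Shorter than the assignment's list because the toy data has only 64 features;
# PCA requires n_components <= min(n_samples, n_features).
demo_components = [2, 5, 10, 20, 40]

pca_results = {}
for k in demo_components:
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k, random_state=0),
        LogisticRegression(max_iter=1000),
    )
    model.fit(Xtr, ytr)
    pca_results[k] = f1_score(yte, model.predict(Xte), average="macro")

for k, score in pca_results.items():
    print(f"n_components={k:>3}  macro F1={score:.3f}")
```

Because `umap.UMAP` from the `umap-learn` package follows the same fit/transform interface, it should be possible to swap it in for the `PCA` step, although it is left out here to keep the sketch dependent on scikit-learn only.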


## Task 3: Visualizing High-Dimensional Embeddings Using PCA, t-SNE, and UMAP (20 points)

Visualize the high-dimensional data using **PCA**, **t-SNE**, and **UMAP**.  
Keep in mind that tuning the hyperparameters for t-SNE and UMAP is essential for producing informative plots.

Experiment with different settings to generate the most insightful visualizations, and **explicitly report the best hyperparameters you identified through your experiments** in your submission.
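
A side-by-side comparison of projections might be set up as in the following sketch. The toy data from `make_blobs`, the perplexity value, and the plot styling are assumptions chosen for illustration, not recommended settings. UMAP is omitted only to avoid the extra `umap-learn` dependency; it can be added to the dictionary in the same way.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the embeddings (assumption): 5 groups in 32 dimensions.
X_demo, y_demo = make_blobs(n_samples=300, centers=5, n_features=32, random_state=0)

# Each entry maps a panel title to a 2D projection of the same data.
projections = {
    "PCA": PCA(n_components=2, random_state=0).fit_transform(X_demo),
    "t-SNE (perplexity=25)": TSNE(
        n_components=2, perplexity=25, init="pca", random_state=0
    ).fit_transform(X_demo),
}

fig, axes = plt.subplots(1, len(projections), figsize=(10, 4))
for ax, (title, coords) in zip(axes, projections.items()):
    # Color each point by its ground-truth label to judge separation and overlap.
    points = ax.scatter(coords[:, 0], coords[:, 1], c=y_demo, cmap="tab10", s=8)
    ax.set_title(title)
    ax.set_xticks([])
    ax.set_yticks([])
fig.colorbar(points, ax=axes, label="group label (toy data)")
```

Looping over several perplexity values (or, for UMAP, `n_neighbors` and `min_dist`) and adding one panel per setting makes it easier to compare hyperparameters visually.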


## Task 4: Evaluating Clustering Algorithms on 2D Embeddings (30 points)

Clustering algorithms often struggle with the curse of dimensionality, which can lead to poor performance on high-dimensional data.  
To address this, use the **2-dimensional representation** obtained from **t-SNE** in Task 3 (`n_components = 2`, `perplexity = 25`).

Apply **three different clustering algorithms** to the 2D data and evaluate their effectiveness.  
For each algorithm, report the following:

- **Visualization** of the clustering result (dimension = 2)
- **Number of clusters**
- **Number of outliers** (`cluster_labels == -1`, if applicable)
- **Completeness score** and **Homogeneity score**  
  (refer to scikit-learn documentation for how to compute these metrics given ground-truth labels)

**Note:**  
Since clustering is unsupervised, use the entire dataset when tuning hyperparameters (no train-test split needed).
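
The quantities listed above can be gathered with a small helper such as the one sketched below. The helper name `summarize_clustering`, the toy 2D data, and the hyperparameters (`n_clusters=4`, `eps=0.5`, `min_samples=5`) are assumptions for illustration; in the assignment the input would be your t-SNE coordinates and the true genome IDs.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score, homogeneity_score

# Synthetic 2D stand-in for the t-SNE coordinates (assumption): 4 groups.
X2d, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=0)


def summarize_clustering(labels, y_true):
    """Count clusters and outliers and score the labels (hypothetical helper)."""
    labels = np.asarray(labels)
    return {
        # The label -1 marks noise points in DBSCAN-style algorithms,
        # so it is not counted as a cluster.
        "n_clusters": len(set(labels.tolist()) - {-1}),
        "n_outliers": int(np.sum(labels == -1)),
        # Both scores take the ground truth first; they treat -1 as one more label.
        "homogeneity": homogeneity_score(y_true, labels),
        "completeness": completeness_score(y_true, labels),
    }


clusterers = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

cluster_summaries = {
    name: summarize_clustering(model.fit_predict(X2d), y_true)
    for name, model in clusterers.items()
}
for name, summary in cluster_summaries.items():
    print(name, summary)
```

Homogeneity is high when each cluster contains contigs from a single genome, while completeness is high when all contigs of a genome fall into the same cluster, so reporting both exposes over- and under-splitting.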


## Task 5: Summary of Best Results and Reflection (10 points)

Summarize the best results you obtained from each of the previous tasks and explain what you learned from this assignment.
