# Assignment 1, Task A: Classification problem.

## The data:
In this QSAR exercise, the mutagenicity of various molecules is to be investigated. The dataset in use is the Ames Mutagenicity Dataset for Multi-Task learning, accessed via the PyTDC library; essentially the same data is also provided here: https://huggingface.co/datasets/scikit-fingerprints/TDC_ames. Columns have been renamed for clarity.

The dataset gives the overall mutagenicity (1 = mutagen) of various drugs, each represented by its SMILES string. From the SMILES strings, molecular fingerprints can be generated and used as molecular descriptors.

## The tasks:
1) Inspect the data and clean if needed. Adhere to good practices!
2) Calculate the fingerprints (partial snippet provided) and create a feature matrix X and a target vector y
3) Train four different models on the fingerprints and evaluate them by accuracy and ROC-AUC score to compare their performance. For each model, overfitting must additionally be addressed.

These four models have to be compared:
- `KNeighborsClassifier`: choose a suitable number of neighbors
- `DecisionTreeClassifier`: use a random_state
- `RandomForestClassifier`: use a random_state and a slightly bigger forest (e.g. 200 trees)
- `GradientBoostingClassifier`: use a random_state

Other than the stated parameters, the models can mostly be used with their `scikit-learn` defaults. No hyperparameter tuning or cross-validation (CV) is required.

4) Conclusion and discussion: Provide answers to the questions.

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, ConfusionMatrixDisplay

In [2]:
df = pd.read_csv("ames_data.csv")
df.head()

index,drug_id,smiles,mutagenicity
0,Drug 0,O=[N+]([O-])c1ccc2ccc3ccc([N+](=O)[O-])c4c5ccc...,1
1,Drug 1,O=[N+]([O-])c1c2c(c3ccc4cccc5ccc1c3c45)CCCC2,1
2,Drug 2,O=c1c2ccccc2c(=O)c2c1ccc1c2[nH]c2c3c(=O)c4cccc...,0
3,Drug 3,[N-]=[N+]=CC(=O)NCC(=O)NN,1
4,Drug 4,[N-]=[N+]=C1C=NC(=O)NC1=O,1


## 1. Inspect and clean the data
- Gain some overview of the data and assess NaNs and duplicates and clean if needed.
- Inspect the class balance!

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7278 entries, 0 to 7277
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   drug_id       7278 non-null   object
 1   smiles        7278 non-null   object
 2   mutagenicity  7278 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 170.7+ KB


In [6]:
df.shape

(7278, 3)

In [3]:
df.isna().sum()

drug_id         0
smiles          0
mutagenicity    0
dtype: int64

In [4]:
df.duplicated().sum()

np.int64(0)
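
The class-balance inspection asked for above could be sketched as follows. The small `toy_df` is a hypothetical stand-in for `df` (same column names, made-up rows). It also illustrates why checking duplicates on the `smiles` column alone may be worthwhile: `df.duplicated()` compares whole rows, so unique `drug_id` values can mask repeated structures.

```python
import pandas as pd

# Hypothetical stand-in for the Ames dataframe (made-up rows, same column names).
toy_df = pd.DataFrame({
    "drug_id": ["Drug 0", "Drug 1", "Drug 2", "Drug 3", "Drug 4"],
    "smiles": ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "CCN"],
    "mutagenicity": [0, 1, 0, 0, 1],
})

# Class balance: absolute counts and relative frequencies per class.
class_counts = toy_df["mutagenicity"].value_counts()
class_fractions = toy_df["mutagenicity"].value_counts(normalize=True)
print(class_counts)
print(class_fractions.round(3))

# Whole-row duplicates vs. duplicates on the structure column only.
n_row_duplicates = int(toy_df.duplicated().sum())
n_smiles_duplicates = int(toy_df.duplicated(subset="smiles").sum())
print(f"row duplicates: {n_row_duplicates}, SMILES duplicates: {n_smiles_duplicates}")
```

On the real data the same calls would be applied to `df`. A string comparison only catches identical SMILES strings, not different spellings of the same molecule.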

## 2. Create fingerprints from the SMILES
The partial snippet for Morgan fingerprints can be used. Note that the function returns a NumPy array per molecule rather than a dataframe; these arrays are collected in a list, from which the feature matrix and the target vector can be created. Inspect the shape of the arrays!

In [18]:
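# Build the Morgan fingerprint generator once and reuse it for every molecule
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def smiles_to_fp(smiles):
    """Return the 2048-bit Morgan fingerprint as a NumPy array, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = mfpgen.GetFingerprint(mol)
    return np.array(fp)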

# Convert to fingerprints
fps = []
valid_labels = []

for smiles, label in zip(df["smiles"], df["mutagenicity"]):
    fp = smiles_to_fp(smiles)
    if fp is not None:
        fps.append(fp)
        valid_labels.append(label)




In [24]:

X = np.array(fps)
y = np.array(valid_labels)
print(f"Feature matrix shape: {X.shape}")
print(f"Label vector shape: {y.shape}")
print("Same number of entries in feature matrix and label vector:", X.shape[0] == y.shape[0])  # sanity check: sample counts must match

Feature matrix shape: (7278, 2048)
Label vector shape: (7278,)
Same number of entries in feature matrix and label vector: True
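
Because each row of `X` is a bit vector, the overlap between two molecules can be quantified directly on the rows. The sketch below uses two made-up 8-bit vectors (not real fingerprints) and a hypothetical helper `tanimoto` to compute the Tanimoto coefficient, a common similarity measure for fingerprints. It is shown next to the Euclidean distance that `KNeighborsClassifier` uses by default: for 0/1 vectors, the squared Euclidean distance is simply the number of differing bits.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two 0/1 vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty vectors count as identical
    return float(np.logical_and(a, b).sum() / union)

# Made-up 8-bit "fingerprints" for illustration only.
fp_a = np.array([1, 1, 0, 0, 1, 0, 0, 1])
fp_b = np.array([1, 1, 0, 0, 0, 0, 1, 1])

similarity = tanimoto(fp_a, fp_b)                       # shared on-bits / all on-bits
differing_bits = int(np.sum(fp_a != fp_b))              # Hamming distance (count)
squared_euclidean = float(np.sum((fp_a - fp_b) ** 2))   # equals the Hamming count for 0/1 data

print(f"Tanimoto similarity: {similarity:.2f}")
print(f"Differing bits: {differing_bits}, squared Euclidean distance: {squared_euclidean:.1f}")
```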


## 3. Train the models
Use a classic train-test split of 0.2 including a random seed and `stratify`. For training and predicting labels, take note of the time the process takes for each model (does not necessarily have to be coded, can also be estimated). Make sure to predict labels for both training and test splits in order to identify overfitting. Use the accuracy and roc-auc as metrics for evaluation.
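
The effect of `stratify` can be illustrated on made-up labels before applying it to the real data. This is only a sketch: `X_toy` and `y_toy` are placeholders, not the Ames fingerprints.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 negatives, 30 positives (not the Ames data).
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the overall positive fraction.
train_pos_fraction = y_tr.mean()
test_pos_fraction = y_te.mean()
print(f"positive fraction: train {train_pos_fraction:.2f}, test {test_pos_fraction:.2f}")
```

Without `stratify`, the class fractions in a random split can drift away from the overall ratio, which makes train and test metrics harder to compare.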

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [36]:
models = {
    "KNN":                KNeighborsClassifier(n_neighbors=5),
    "Decision Tree":      DecisionTreeClassifier(random_state=42),
    "Random Forest":      RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting":  GradientBoostingClassifier(random_state=42)
}

results = {}

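for name, model in models.items():
    # Time fitting plus prediction on both splits, as the task asks for
    start = time.perf_counter()

    model.fit(X_train, y_train)

    # Predict on train and test so the train/test gap (overfitting) is visible;
    # ROC-AUC needs the predicted probability of the positive class (column 1)
    results[name] = {
        "Train Accuracy ": accuracy_score(y_train, model.predict(X_train)),
        "Test  Accuracy:":  accuracy_score(y_test,  model.predict(X_test)),
        "Train ROC-AUC": roc_auc_score(y_train,  model.predict_proba(X_train)[:,1]),
        "Test  ROC-AUC":  roc_auc_score(y_test,   model.predict_proba(X_test)[:,1]),
        "time":      round(time.perf_counter() - start, 2)
    }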

# Print all results (strip stray spaces/colons from the metric names for display)
for name, metrics in results.items():
    print(f"\n{name}:")
    for metric, value in metrics.items():
        label = metric.rstrip(" :")
        if isinstance(value, float):
            print(f"  {label}: {value:.3f}")
        else:
            print(f"  {label}: {value}")


KNN:
  Train Accuracy: 0.861
  Test  Accuracy: 0.794
  Train ROC-AUC: 0.941
  Test  ROC-AUC: 0.862
  time: 1.300

Decision Tree:
  Train Accuracy: 0.999
  Test  Accuracy: 0.772
  Train ROC-AUC: 1.000
  Test  ROC-AUC: 0.768
  time: 0.720

Random Forest:
  Train Accuracy: 0.999
  Test  Accuracy: 0.826
  Train ROC-AUC: 1.000
  Test  ROC-AUC: 0.901
  time: 3.210

Gradient Boosting:
  Train Accuracy: 0.810
  Test  Accuracy: 0.773
  Train ROC-AUC: 0.895
  Test  ROC-AUC: 0.851
  time: 12.530
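
The timings above combine fitting and all prediction passes into one number per model. A possible refinement, sketched here on synthetic data from `make_classification` rather than on the fingerprints, is to time fitting and prediction separately with `time.perf_counter`. The helper name `time_fit_predict` is hypothetical. The split is most informative for KNN, which typically does little work at fit time and most of its work at prediction time.

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def time_fit_predict(model, X_fit, y_fit, X_new):
    """Return (fit_seconds, predict_seconds) for one model."""
    start = time.perf_counter()
    model.fit(X_fit, y_fit)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_new)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

# Synthetic stand-in data, not the Ames fingerprints.
X_syn, y_syn = make_classification(n_samples=500, n_features=50, random_state=42)

timings = {}
for name, model in {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}.items():
    timings[name] = time_fit_predict(model, X_syn[:400], y_syn[:400], X_syn[400:])
    fit_s, predict_s = timings[name]
    print(f"{name}: fit {fit_s:.4f} s, predict {predict_s:.4f} s")
```

Absolute numbers depend on the machine and on the data size, so only the relative pattern between models is meaningful.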


## 4. Conclusion and discussion
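- Which model performed the best?
  - Random Forest performed best, with the highest test accuracy (0.826) and the highest test ROC-AUC (0.901).
- Which was the most time efficient?
  - The Decision Tree was the most time efficient (0.72 s for fitting and prediction); Gradient Boosting was the slowest (12.53 s).
- Which model showed the worst overfitting?
  - The Decision Tree: its training accuracy of 0.999 drops to 0.772 on the test set, the largest gap of all models (0.228). Random Forest also fits the training data almost perfectly (gap 0.174) but generalizes better, while Gradient Boosting shows the smallest gap (0.037).
- Why does ensemble learning outperform a single tree?
  - A single decision tree learns one rigid set of rules and is very sensitive to the training data: small changes in the data can produce a completely different tree. Ensemble methods combine many trees, so the errors of individual trees partly cancel out. The result is a more robust and stable model that generalizes better to unseen molecules, which shows here in the higher test ROC-AUC of Random Forest (0.901) and Gradient Boosting (0.851) compared with the single tree (0.768).
- Why does KNN perform well in high-dimensional fingerprint space?
  - Each fingerprint bit indicates the presence of certain substructures, so molecules that share many substructures lie close together in fingerprint space. Distance between fingerprints therefore reflects chemical similarity, and structurally similar molecules tend to have similar mutagenicity.
- What does ROC-AUC tell us that accuracy does not?
  - Accuracy is the fraction of correct predictions at a single decision threshold. ROC-AUC evaluates the predicted probabilities across all thresholds: it measures how well the model ranks mutagens above non-mutagens, and it is much less sensitive to the class ratio than accuracy.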
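
The difference between accuracy and ROC-AUC can be made concrete with a small worked example. The labels and scores below are made up for illustration (they are not model outputs from this notebook): both hypothetical models reach the same accuracy at a threshold of 0.5, yet they rank the positives very differently.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities for two hypothetical models.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores_a = np.array([0.10, 0.20, 0.60, 0.40, 0.80, 0.90])
scores_b = np.array([0.40, 0.45, 0.90, 0.10, 0.60, 0.70])

# Accuracy only looks at the hard labels obtained at one threshold.
acc_a = accuracy_score(y_true, (scores_a >= 0.5).astype(int))
acc_b = accuracy_score(y_true, (scores_b >= 0.5).astype(int))

# ROC-AUC looks at the ranking of the scores across all thresholds.
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

print(f"Model A: accuracy {acc_a:.3f}, ROC-AUC {auc_a:.3f}")
print(f"Model B: accuracy {acc_b:.3f}, ROC-AUC {auc_b:.3f}")
```

Both models misclassify two of six samples, but model A ranks 8 of the 9 positive/negative pairs correctly while model B ranks only 4 of 9 correctly, which is what ROC-AUC captures.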

In [41]:
results_df = pd.DataFrame(results).T
results_df["overfit_acc"] = (results_df["Train Accuracy "] - results_df["Test  Accuracy:"]).round(3)

results_df.style\
    .format("{:.3f}")\
    .highlight_max(subset=["Test  Accuracy:", "Test  ROC-AUC"], color="lightgreen")\
    .highlight_min(subset=["overfit_acc"], color="lightblue")\
    .highlight_max(subset=["overfit_acc"], color="salmon")

model,Train Accuracy,Test Accuracy:,Train ROC-AUC,Test ROC-AUC,time,overfit_acc
KNN,0.861,0.794,0.941,0.862,1.3,0.067
Decision Tree,0.999,0.772,1.0,0.768,0.72,0.228
Random Forest,0.999,0.826,1.0,0.901,3.21,0.174
Gradient Boosting,0.81,0.773,0.895,0.851,12.53,0.037
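
The train/test gap in the table is largest for the unconstrained decision tree. One common way to reduce such a gap is to restrict the tree depth via `max_depth`. The following is only a sketch on synthetic `make_classification` data and was not run on the Ames fingerprints; the value 3 is an arbitrary illustration, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Ames fingerprints.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=50, n_informative=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

gaps = {}
for label, depth in {"unconstrained": None, "max_depth=3": 3}.items():
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)  # accuracy on the training split
    test_acc = tree.score(X_te, y_te)   # accuracy on the held-out split
    gaps[label] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}
    print(f"{label}: train {train_acc:.3f}, test {test_acc:.3f}, gap {train_acc - test_acc:.3f}")
```

Whether a shallower tree also improves test performance depends on the data, so the same comparison would need to be repeated on `X_train` and `X_test` before drawing conclusions.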
