# Leash Tutorial - ECFPs and Random Forest


## Data Preparation

The training and testing data paths point to the `.parquet` files. We use DuckDB to scan the large training set. To get started, let's sample an equal number of positives and negatives.

The query selects an equal number of rows where `binds` equals 0 (non-binding) and 1 (binding), limited to `NB_ROWS` each (30,000 when `DEBUG` is on, 150,000 otherwise), so the model is not biased towards one class.


In [None]:
DEBUG = False

In [None]:
if DEBUG:
    NB_ROWS = 30_000
else:
    NB_ROWS = 150_000

In [None]:
import duckdb
import pandas as pd
from xgboost import XGBClassifier

train_path = "../../data/raw/train.parquet"
test_path = "../../data/raw/test.parquet"

con = duckdb.connect()

df = con.query(
    f"""(SELECT *
                        FROM parquet_scan('{train_path}')
                        WHERE binds = 0
                        ORDER BY random()
                        LIMIT {NB_ROWS})
                        UNION ALL
                        (SELECT *
                        FROM parquet_scan('{train_path}')
                        WHERE binds = 1
                        ORDER BY random()
                        LIMIT {NB_ROWS})"""
).df()

con.close()

In [None]:
print(df.shape)
df.head()
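
As a quick sanity check, it can help to confirm that the sample really is balanced. The sketch below uses a plain list of labels as a stand-in for `df["binds"].tolist()`; the helper name `class_counts` is made up for this example and is not part of the notebook.

```python
from collections import Counter


def class_counts(labels):
    """Count how many samples fall into each class label."""
    return dict(Counter(labels))


# Toy labels standing in for df["binds"].tolist() (assumed, not real data)
toy_labels = [0, 1, 0, 1, 1, 0]
counts = class_counts(toy_labels)
print(counts)
```

If the file holds fewer than `NB_ROWS` rows of one class, `LIMIT` simply returns what is there, so the two counts can differ.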

## Feature Preprocessing

Let's grab the SMILES for the fully assembled molecule, `molecule_smiles`, and generate ECFPs from it. We could choose a different radius or number of bits, but radius 2 with 1024 bits is a pretty standard choice.
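
To make the role of the bit count concrete, here is a toy sketch of the folding idea: each substructure identifier is mapped to a position by taking it modulo the vector length, so different identifiers can collide on the same bit. This is only an illustration under that assumption, not RDKit's actual hashing; `fold_to_bits` and the identifiers are made up for the example.

```python
def fold_to_bits(identifiers, n_bits=1024):
    """Fold integer identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for identifier in identifiers:
        # Identifiers that differ by a multiple of n_bits share a position
        bits[identifier % n_bits] = 1
    return bits


# Made-up identifiers; 5 and 1029 collide when n_bits is 1024
toy_ids = [5, 1029, 2048, 77]
fp = fold_to_bits(toy_ids, n_bits=1024)
fp_wide = fold_to_bits(toy_ids, n_bits=4096)
print(sum(fp), sum(fp_wide))
```

With more bits there are fewer collisions, at the cost of a wider feature matrix.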


In [None]:
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import OneHotEncoder
import mapply

# Convert SMILES to RDKit molecules
df["molecule"] = df["molecule_smiles"].apply(Chem.MolFromSmiles)


# Generate ECFPs
def generate_ecfp(molecule, radius=2, bits=1024):
    if molecule is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(molecule, radius, nBits=bits))


# Initialize mapply to use all of our CPU cores and provide us a progress bar.
mapply.init(
    n_workers=-1,
    progressbar=True,
)

df["ecfp"] = df["molecule"].mapply(lambda x: generate_ecfp(x))

## Train Model


In [None]:
MODEL_TYPE = "xgb"

In [None]:
import mlflow


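def get_model(model_type):
    """Return an untrained classifier for the given model type ("RF" or "xgb")."""
    if model_type == "RF":
        return RandomForestClassifier(n_estimators=100, random_state=42)
    if model_type == "xgb":
        return XGBClassifier(use_label_encoder=False, n_estimators=100, random_state=42)
    raise ValueError(f"Unknown model type: {model_type!r}")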


mlflow.set_tracking_uri("file:///Users/lstefanu/Documents/perso/kaggle/belka/mlruns")
mlflow.set_experiment("ecfp train")

with mlflow.start_run(run_name="train"):
    # One-hot encode the protein_name
    onehot_encoder = OneHotEncoder(sparse_output=False)
    protein_onehot = onehot_encoder.fit_transform(
        df["protein_name"].values.reshape(-1, 1)
    )

    # Combine ECFPs and one-hot encoded protein_name
    X = [
        ecfp + protein
        for ecfp, protein in zip(df["ecfp"].tolist(), protein_onehot.tolist())
    ]
    y = df["binds"].tolist()

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Create and train the model selected by MODEL_TYPE
    model = get_model(MODEL_TYPE)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred_proba = model.predict_proba(X_test)[
        :, 1
    ]  # Probability of the positive class

    # Calculate the average precision on the held-out split
    map_score = average_precision_score(y_test, y_pred_proba)
    print(f"Mean Average Precision (mAP): {map_score:.2f}")

    # Log parameters and metrics
    mlflow.log_params({"model": MODEL_TYPE, "n_estimators": 100, "random_state": 42})
    mlflow.log_metric("mAP", map_score)
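
For intuition about the metric computed above: average precision averages the precision measured at the rank of each true positive, after sorting samples by predicted score. The pure-Python sketch below is an illustration that assumes no two scores are tied; it is not scikit-learn's implementation, and `average_precision` is a name chosen for the example.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    # Rank sample indices from highest to lowest score
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # Precision at this positive's rank
    return total / n_pos


# Toy labels and scores (made up): positives land at ranks 1 and 3
toy_true = [1, 0, 1, 0]
toy_score = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(toy_true, toy_score)
print(f"{ap:.4f}")
```

A model that ranks every positive above every negative scores 1.0 under this definition.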

Look at that Average Precision score. We did amazing!

Actually no, we just overfit. This is likely to be a recurring theme in this competition. It is easy to predict molecules that come from the same corner of chemical space, but generalizing to new molecules is extremely difficult.
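
One way to get a more honest estimate is to hold out whole groups of related molecules instead of random rows, so nothing in the validation set shares a group with the training set. The sketch below is not what this notebook does; the group key (for example a shared building block or scaffold) is an assumption, and `group_split` is a hypothetical helper.

```python
import random


def group_split(groups, test_fraction=0.2, seed=42):
    """Split row indices so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out at least one whole group
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Made-up group labels, one per row
toy_groups = ["a", "a", "b", "c", "c", "d", "e"]
train_idx, test_idx = group_split(toy_groups)
train_groups = {toy_groups[i] for i in train_idx}
held_out_groups = {toy_groups[i] for i in test_idx}
print(train_groups.isdisjoint(held_out_groups))
```

Scores from a split like this are usually lower than from a random split, but they are closer to what to expect on unseen chemistry.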


## Test Prediction

The trained model (Random Forest or XGBoost, depending on `MODEL_TYPE`) is then used to predict binding probabilities for the test set. These predictions are saved to a CSV file, which serves as the submission file for the Kaggle competition.


Run time:

| Model         | Execution time | Device |
| ------------- | -------------- | ------ |
| Random Forest | 29min          | Mac M2 |
| XGBoost       | 49min          | Mac M2 |


In [None]:
import os
import pandas as pd
from tqdm import tqdm
import time

# Process the test CSV file chunk by chunk
test_file = "../../data/raw/test.csv"
output_file = "submission.csv"  # Path and filename for the output file

# Count the chunks first so the progress bar knows the total
chunksize = 100000  # Adjust chunk size according to your needs
total_chunks = sum(1 for _ in pd.read_csv(test_file, chunksize=chunksize))
progress_bar = tqdm(total=total_chunks, desc="Processing chunks", unit="chunk")

for df_test in pd.read_csv(test_file, chunksize=chunksize):
    # Generate ECFPs for the molecule_smiles
    df_test["molecule"] = df_test["molecule_smiles"].apply(Chem.MolFromSmiles)
    df_test["ecfp"] = df_test["molecule"].apply(generate_ecfp)

    # One-hot encode the protein_name
    protein_onehot = onehot_encoder.transform(
        df_test["protein_name"].values.reshape(-1, 1)
    )

    # Combine ECFPs and one-hot encoded protein_name
    X_test = [
        ecfp + protein
        for ecfp, protein in zip(df_test["ecfp"].tolist(), protein_onehot.tolist())
    ]

    # Predict the probabilities
    probabilities = model.predict_proba(X_test)[:, 1]

    # Create a DataFrame with 'id' and 'binds' columns
    output_df = pd.DataFrame({"id": df_test["id"], "binds": probabilities})

    # Save the output DataFrame to a CSV file
    output_df.to_csv(
        output_file, index=False, mode="a", header=not os.path.exists(output_file)
    )

    # Update the progress bar
    progress_bar.update(1)

# Close the progress bar
progress_bar.close()
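
Note that the loop above appends to `submission.csv` and only writes the header when the file does not exist yet, so rerunning the cell adds rows to the previous run's output. Below is a minimal sketch of one way to guard against that, using the standard `csv` module and made-up rows in place of real predictions (`write_chunks` is a hypothetical helper):

```python
import csv
import os
import tempfile


def write_chunks(chunks, output_file):
    """Write (id, binds) rows chunk by chunk, starting from a clean file."""
    # Remove stale output so a rerun does not append to old results
    if os.path.exists(output_file):
        os.remove(output_file)
    for rows in chunks:
        write_header = not os.path.exists(output_file)
        with open(output_file, "a", newline="") as handle:
            writer = csv.writer(handle)
            if write_header:
                writer.writerow(["id", "binds"])
            writer.writerows(rows)


# Made-up predictions, two chunks of two rows each
toy_chunks = [[(0, 0.1), (1, 0.9)], [(2, 0.4), (3, 0.7)]]
demo_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
write_chunks(toy_chunks, demo_path)
write_chunks(toy_chunks, demo_path)  # Rerun: the file is rebuilt, not extended

with open(demo_path, newline="") as handle:
    lines = handle.read().splitlines()
print(len(lines))
```

The same effect can be had in the notebook by deleting `submission.csv` before the loop starts.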