<html>
  <head>
  </head>

  <body>
    <img src="image/molecule image.jpg">
  </body>
</html>

# Overview

A common challenge in experimental chemistry is inferring causality from small datasets. Chemists often synthesize around 100 molecules, measure a key property, and then attempt to identify which molecular features influence that property. This process typically relies on expert intuition to select meaningful features, followed by multivariate linear regression for feature importance analysis.

Our recent work at the **NSF's Molecule Maker Lab Institute**, published in *Nature*, demonstrated that a **purely data-driven machine learning approach** can:
1. Identify important molecular features previously overlooked by experts.
2. Generate a regression model for photostability that performs well on newly synthesized molecules.

This approach has shown significant promise in discovering and enhancing **molecular photostability** (e.g., for organic solar cells), and we aim to build upon this success to enable broader data-driven discoveries in chemistry.

---

# Challenge

The goal of this competition is to:
> **Identify the best algorithm to select the most informative molecular features and accurately regress the experimental property (T80) for new molecules.**

You are provided with:
- A small dataset (~100 molecules).
- A large number of calculated (but mostly irrelevant) features.
- An experimental property: **photostability lifetime (T80)**.

---

# Dataset & Resources

- **Training and test datasets** include ~150 molecular features.
- **SMILES strings** (textual molecular representations) are included.
- Features can be extended using **RDKit** or other cheminformatics tools.
- An example RDKit script is available in the `SmilesStrings Dataset`.
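
Extending the feature set with RDKit might look like the minimal sketch below. It assumes RDKit is installed and that the SMILES column is named `Smiles`, as in the training data. The helper name `add_rdkit_descriptors`, the `rdkit_` column prefix, and the three descriptors chosen are illustrative assumptions; this is not the example script shipped with the dataset.

```python
import pandas as pd


def add_rdkit_descriptors(df, smiles_col="Smiles"):
    """Return a copy of df with a few extra RDKit descriptors (hypothetical helper)."""
    # Imported lazily so the rest of the notebook still runs without RDKit.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    # Illustrative choice; other functions in rdkit.Chem.Descriptors are called the same way.
    descriptor_funcs = {
        "rdkit_MolWt": Descriptors.MolWt,
        "rdkit_NumHeteroatoms": Descriptors.NumHeteroatoms,
        "rdkit_NumAromaticRings": Descriptors.NumAromaticRings,
    }

    out = df.copy()
    mols = [Chem.MolFromSmiles(s) for s in out[smiles_col]]
    for name, func in descriptor_funcs.items():
        # MolFromSmiles returns None for SMILES it cannot parse; those rows become NaN.
        out[name] = [func(m) if m is not None else float("nan") for m in mols]
    return out
```

A call such as `add_rdkit_descriptors(df_train)` would then return the training frame with three additional columns.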

You may also utilize:
- **Pre-trained models** (e.g., FARM or other SMILES-based models).
- Traditional models without any pre-trained feature extraction.

---

# Model Development Insights

In our previous research, we evaluated **Support Vector Regression (SVR)** models trained on approximately **2.5 million combinations** of molecular features.

Key findings:
- **Top 3 predictive features** identified:
  - `TDOS4.0`
  - `NumHeteroatoms` (number of non-carbon, non-hydrogen atoms)
  - `Mass`
- **TDOS4.0** (and its correlated counterpart `TDOS3.9`) was physically validated.
- **NumHeteroatoms** and **Mass** have not yet been physically analyzed and remain open for discovery.
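
A scaled-down version of the exhaustive search described above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the synthetic data, the RBF kernel with default hyperparameters, the leave-one-out scoring, and the helper name `rank_feature_subsets` are all hypothetical choices.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rank_feature_subsets(X, y, feature_names, subset_size=3):
    """Score every feature subset of the given size with leave-one-out SVR."""
    results = []
    for idx in combinations(range(X.shape[1]), subset_size):
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
        # Each molecule is predicted by a model trained on all the others.
        pred = cross_val_predict(model, X[:, list(idx)], y, cv=LeaveOneOut())
        rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
        results.append((rmse, tuple(feature_names[i] for i in idx)))
    return sorted(results)  # lowest error first


# Synthetic stand-in for the real data: 40 "molecules", 6 features,
# with a target that depends only on f0 and f1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=40)
names = [f"f{i}" for i in range(6)]

ranking = rank_feature_subsets(X, y, names, subset_size=2)
print(ranking[0])  # best (rmse, feature subset) pair
```

The number of subsets grows combinatorially with the number of features, so on the full feature set a search like this is only practical for small subset sizes.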

---

# Evaluation

To encourage innovation:
- We have synthesized **9 new molecules** and measured their **T80 values**.
- These will serve as the test set to evaluate submitted models.

Prizes will be awarded for:
- Best model **using pre-trained SMILES models**.
- Best model **not using pre-trained models** (classical feature-based approach).

---

Good luck, and happy modeling!


In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   


In [4]:
# load the data 
df_train=pd.read_csv("../molecular-machine-learning/data/train.csv")
df_test=pd.read_csv("../molecular-machine-learning/data/test.csv")

In [6]:
# let us see the data
df_train.head()

|  | Batch_ID | T80 | Smiles | Mass | HAcceptors | HDonors | LogP | Asphericity | Rg | TPSA | ... | SDOS4.5 | SDOS4.6 | SDOS4.7 | SDOS4.8 | SDOS4.9 | SDOS5.0 | SDOS5.1 | SDOS5.2 | SDOS5.3 | SDOS5.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Train-01 | 103.86 | CCCCCCCCCCCCc1ccsc1-c1ccc(-c2cccs2)cc1 | 410.692 | 2 | 0 | 9.607 | 0.301361 | 5.187321 | 0.0 | ... | 1.717761 | 1.970186 | 1.760071 | 1.224983 | 0.664733 | 0.282353 | 0.096763 | 0.034589 | 0.030793 | 0.05734 |
| 1 | Train-02 | 101.13 | CCCCCCCCCCCCc1ccsc1-c1cccs1 | 334.594 | 2 | 0 | 7.94 | 0.367472 | 4.141425 | 0.0 | ... | 0.012396 | 0.046031 | 0.133124 | 0.29984 | 0.525958 | 0.718549 | 0.764711 | 0.634854 | 0.414866 | 0.225909 |
| 2 | Train-03 | 78.3 | CN1CCN(S(=O)(=O)c2ccc(-c3ccc(-c4cccs4)cc3)cc2)CC1 | 398.553 | 4 | 0 | 4.0182 | 0.799589 | 5.368024 | 40.62 | ... | 2.421162 | 2.703267 | 2.352276 | 1.595867 | 0.845839 | 0.35462 | 0.127878 | 0.0606 | 0.064782 | 0.098908 |
| 3 | Train-04 | 71.88 | O=C1c2ccccc2C(=O)c2cc(-c3ccc(-c4cccs4)s3)ccc21 | 372.47 | 4 | 0 | 5.919 | 0.793825 | 4.948903 | 34.14 | ... | 0.88632 | 0.579059 | 0.345148 | 0.246564 | 0.276259 | 0.381997 | 0.495304 | 0.566935 | 0.594203 | 0.614075 |
| 4 | Train-05 | 68.37 | CC(C)(C)OC(=O)n1ccc2ccc(-c3ccc(-c4ccc(-c5cccs5... | 457.62 | 5 | 0 | 8.5485 | 0.671148 | 5.994751 | 31.23 | ... | 0.487723 | 0.245764 | 0.249019 | 0.363222 | 0.474953 | 0.505358 | 0.440671 | 0.330129 | 0.234649 | 0.183111 |


In [25]:
# summary statistics of the numeric columns in the training data
df_train.describe()

|  | T80 | Mass | HAcceptors | HDonors | LogP | Asphericity | Rg | TPSA | RingCount | NumRotatableBonds | ... | SDOS4.5 | SDOS4.6 | SDOS4.7 | SDOS4.8 | SDOS4.9 | SDOS5.0 | SDOS5.1 | SDOS5.2 | SDOS5.3 | SDOS5.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | ... | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 |
| mean | 22.710476 | 601.389929 | 5.5 | 0.309524 | 10.854289 | 0.53268 | 6.355368 | 36.990952 | 5.714286 | 11.690476 | ... | 0.874759 | 0.763264 | 0.668653 | 0.595143 | 0.560035 | 0.571105 | 0.624577 | 0.706185 | 0.786314 | 0.833714 |
| std | 26.896407 | 176.251665 | 2.244234 | 0.604378 | 3.504098 | 0.217963 | 1.264587 | 30.888275 | 1.74308 | 6.884015 | ... | 0.993377 | 0.872205 | 0.720167 | 0.580504 | 0.541022 | 0.568734 | 0.59727 | 0.665208 | 0.769713 | 0.814881 |
| min | 1.5 | 238.315 | 2.0 | 0.0 | 3.8721 | 0.169945 | 3.62924 | 0.0 | 2.0 | 2.0 | ... | 0.012396 | 0.014004 | 0.02437 | 0.037166 | 0.039435 | 0.0517 | 0.049996 | 0.034589 | 0.030793 | 0.05734 |
| 25% | 5.085 | 483.69125 | 4.0 | 0.0 | 8.369075 | 0.345224 | 5.593117 | 17.98 | 5.0 | 4.0 | ... | 0.1581 | 0.107728 | 0.150733 | 0.19582 | 0.239064 | 0.283035 | 0.220079 | 0.235281 | 0.250824 | 0.236877 |
| 50% | 10.485 | 570.826 | 5.0 | 0.0 | 10.8015 | 0.541883 | 6.289388 | 36.225 | 5.0 | 14.0 | ... | 0.449055 | 0.363273 | 0.333565 | 0.330316 | 0.410859 | 0.371651 | 0.427824 | 0.475996 | 0.488769 | 0.473099 |
| 75% | 30.1825 | 751.587 | 6.75 | 0.0 | 13.05458 | 0.714117 | 6.95129 | 50.075 | 7.0 | 17.0 | ... | 1.232151 | 1.191322 | 1.086311 | 0.913003 | 0.712412 | 0.718112 | 0.964893 | 1.124234 | 1.386218 | 1.418428 |
| max | 103.86 | 1005.426 | 12.0 | 2.0 | 17.767 | 0.913014 | 10.519416 | 132.99 | 9.0 | 23.0 | ... | 3.765836 | 2.986408 | 2.43103 | 2.394081 | 2.738119 | 3.060757 | 3.103451 | 2.90821 | 2.818927 | 3.188643 |


In [10]:
# Description of the dataset
print(f"The dataset contains {df_train.shape[0]} rows and {df_train.shape[1]} columns.")
print("Here is a summary of the columns:")
print(f"- Total columns: {df_train.shape[1]}")
print(f"- Numerical columns: {df_train.select_dtypes(include=['int64', 'float64']).shape[1]}")
print(f"- Categorical columns: {df_train.select_dtypes(include=['object', 'category']).shape[1]}")
print(f"- Boolean columns: {df_train.select_dtypes(include=['bool']).shape[1]}")
print(f"- Columns with missing values: {df_train.isnull().sum().loc[df_train.isnull().sum() > 0].shape[0]}")


The dataset contains 42 rows and 146 columns.
Here is a summary of the columns:
- Total columns: 146
- Numerical columns: 144
- Categorical columns: 2
- Boolean columns: 0
- Columns with missing values: 0


In [13]:
# check the number of missing values in the data
print("Missing values in the dataset:")
print(df_train.isnull().sum().sum())


Missing values in the dataset:
0


In [18]:
# check the duplicates in the data 
duplicates = df_train.duplicated().sum()
duplicates_percentage = (duplicates / df_train.shape[0]) * 100
print(f"Number of duplicate rows: {duplicates} ({duplicates_percentage:.2f}%)")


Number of duplicate rows: 0 (0.00%)


In [19]:
# check the outliers in the data 
def check_outliers(data):
    outliers = {}
    for column in data.select_dtypes(include=[np.number]).columns:
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers[column] = data[(data[column] < lower_bound) | (data[column] > upper_bound)].shape[0]
    return outliers

In [20]:
# run the outlier check on the training data; the per-column counts are stored in `outliers`
outliers = check_outliers(df_train)
print("Outliers in the dataset:")

Outliers in the dataset:
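
Only the header is printed above, so the per-column counts held in `outliers` are never displayed. One way to inspect them is sketched below on a small made-up frame so that the snippet runs on its own. The helper name `count_iqr_outliers` is hypothetical; it applies the same 1.5 × IQR rule as `check_outliers`, but vectorized and returning a pandas Series that can be sorted and filtered.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(data):
    """Per-column count of values outside the 1.5 * IQR fences."""
    numeric = data.select_dtypes(include=[np.number])
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


# Made-up example: column "b" has one extreme value, column "a" has none,
# and the string column "id" is ignored.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 5, 100],
    "id": list("uvwxyz"),
})
counts = count_iqr_outliers(demo).sort_values(ascending=False)
print(counts[counts > 0])  # show only the columns that contain outliers
```

For the dictionary already computed in this notebook, `pd.Series(outliers).sort_values(ascending=False)` should give an equivalent view.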


In [22]:
# # let's check the outliers through visualization
# def plot_outliers(data, outliers):
#     for column, count in outliers.items():
#         plt.figure(figsize=(10, 6))
#         sns.boxplot(x=data[column])
#         plt.title(f"Boxplot of {column} (Outliers: {count})")
#         plt.show()
# plot_outliers(df_train, outliers)

In [None]:
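# let's explore the most important features in the data one by one
# (the top 3 features listed above; assumed here to match the column names exactly)
top_features = ["TDOS4.0", "NumHeteroatoms", "Mass"]
df_top = df_train[["T80"] + top_features]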

ValueError: could not convert string to float: 'Train-01'
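
A `ValueError` like the one above is typically raised when a string column such as `Batch_ID` is passed to a numeric routine. A possible way around it, shown on a tiny made-up frame rather than the competition data, is to keep only the numeric columns before ranking features by their correlation with `T80`. The helper name `rank_by_target_correlation` and the use of Pearson correlation are assumptions for illustration, not the method used in the original study.

```python
import pandas as pd


def rank_by_target_correlation(df, target="T80"):
    """Rank numeric features by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")  # drops string columns such as Batch_ID
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    return corr.abs().sort_values(ascending=False)


# Made-up data: feat_a tracks T80 closely, feat_b does not.
demo = pd.DataFrame({
    "Batch_ID": ["Train-01", "Train-02", "Train-03", "Train-04", "Train-05"],
    "T80": [100.0, 80.0, 60.0, 40.0, 20.0],
    "feat_a": [5.0, 4.1, 3.3, 1.9, 1.0],
    "feat_b": [0.3, 0.9, 0.1, 0.8, 0.2],
})
print(rank_by_target_correlation(demo))
```

With only 42 training molecules, a univariate ranking like this is at best a first look; it says nothing about features that matter only in combination.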