<a href="https://colab.research.google.com/github/S-AJ-H/AIMS26/blob/main/4a_Project_A_Polymer_Representations_with_Chemprop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4a. Project A: Further representations for polymers with Chemprop

In the workshop notebooks, we represented polymers as pairs of monomers in the form `A.B`, standing in for chains of many bonded, repeating units, i.e. `A-B-A-B-A-B-...` or `-[A-B-]n`. The benefit of this approach is its simplicity: it is easy to create polymer representations that contain the correct atoms and therefore all of the important groups of atoms that define chemical properties.
The downside of this approach is that the two monomers do not share any bonds, so they do not interact with each other properly. Chemprop passes messages along bonds, so atoms in one monomer cannot pass messages to atoms in the other, which limits how well Chemprop can generate hidden representations. This is a polymer-specific problem.
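
To make the message-passing limitation concrete, here is a minimal sketch in plain Python. It is not Chemprop code: the six-atom toy graphs and the helper `reachable_within` are assumptions made for illustration only. It tracks which atoms can exchange information with atom 0 after a given number of message-passing rounds, first for a disconnected pair (`A.B`) and then for a bonded dimer (`A-B`).

```python
def reachable_within(bonds, n_atoms, start, steps):
    """Atoms that can exchange information with `start` after `steps` rounds
    of neighbour-to-neighbour message passing along the given bonds."""
    neighbours = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = {start}
    frontier = {start}
    for _ in range(steps):
        # Each round, information spreads one bond further.
        frontier = {n for atom in frontier for n in neighbours[atom]} - seen
        seen |= frontier
    return seen


# Toy graphs (hypothetical): atoms 0-2 form monomer A, atoms 3-5 form monomer B.
monomer_pair_bonds = [(0, 1), (1, 2), (3, 4), (4, 5)]  # "A.B": no bond between atoms 2 and 3
dimer_bonds = monomer_pair_bonds + [(2, 3)]            # "A-B": monomers joined by one bond

pair_reach = reachable_within(monomer_pair_bonds, n_atoms=6, start=0, steps=5)
dimer_reach = reachable_within(dimer_bonds, n_atoms=6, start=0, steps=5)

print("A.B:", sorted(pair_reach))   # A.B: [0, 1, 2]
print("A-B:", sorted(dimer_reach))  # A-B: [0, 1, 2, 3, 4, 5]
```

However many rounds are run, atom 0 in the `A.B` graph never receives information from monomer B. In a real model, the two fragments would presumably only be combined when the atom representations are pooled into a single molecule-level vector.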

---

## Project Tasks:

In this project, you will be predicting electron affinities of the same polymer photocatalysts as in the workshop. Your goal is to train Chemprop models using different lengths of given SMILES strings ("oligomer length") to represent the polymers:

*   pairs of monomers (format A.B)
*   dimers (format A-B)
*   quadmers (A-B-A-B)
*   octamers (A-B-A-B-A-B-A-B)

To do this, you will need a new dataset ("dataset_oligomers.csv"), which is provided in the GitHub repository. Evaluate the performances of these models against each other and draw conclusions on how SMILES input type affects architecture, hyperparameters, training and performance of your model. You might want to utilise some of the resources given below.
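
When comparing models, it helps to compute the same metrics on the same test polymers for every representation. The sketch below shows one way to tabulate MAE and R² per model in plain Python. The helper names, the dictionary keys and all numbers are placeholders chosen for illustration, not results from this dataset. The `mean_absolute_error` and `r2_score` functions from scikit-learn, imported later in this notebook, compute the same quantities.

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_py(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Placeholder values only: swap in your own test targets and the predictions
# of each trained model (one entry per oligomer length).
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_x": [1.5, 2.5, 2.5, 3.5],
    "model_y": [1.0, 2.0, 3.0, 4.0],
}

summary = {
    name: {
        "MAE": mean_absolute_error_py(y_true, y_pred),
        "R2": r2_score_py(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}

for name, metrics in summary.items():
    print(f"{name}: MAE = {metrics['MAE']:.3f}, R2 = {metrics['R2']:.3f}")
```

Using one shared set of test polymers for all oligomer lengths is a design choice assumed here; it keeps differences in the metrics attributable to the representation rather than to the split.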

---

## In your presentations, you are expected to:

1.   Define the project problem and discuss its real-world applications.
2.   Explain the model architecture and the reasons for using it for the specific problem, with a focus on how it is different from the models from the workshop earlier in the week.
3.   Describe the training process and show training loss curves. How does the oligomer length affect training?
4.   Discuss the impact of your hyperparameter optimisations. Explain your reasoning for hyperparameter selection and tuning. Present your best hyperparameters.
5.   Present key performance metrics from your best model. How do these change as a function of oligomer length?
6.   Explain what is limiting the model and how it could be further improved.

---

## Extension Tasks:

* Explore how changing the oligomer length affects the performance of the fixed representations model. Compare this to your Chemprop models.
* Evaluate the performance of these models as a function of dataset size.
* Create multi-task models, in which EA and IP are predicted simultaneously. Does model performance improve? Why/why not?
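
For the dataset-size task, one possible approach (an assumption, not a prescribed method) is to train on nested subsets of the training set while keeping the validation and test sets fixed, so that each larger subset contains all of the smaller ones. The helper `nested_subsets` below is hypothetical and uses only the standard library.

```python
import random


def nested_subsets(train_indices, fractions, seed=0):
    """Return {fraction: indices}, where each subset is a prefix of one
    fixed shuffle, so smaller subsets are contained in larger ones."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return {
        fraction: shuffled[: max(1, round(fraction * len(shuffled)))]
        for fraction in fractions
    }


# Example with 100 hypothetical training rows.
subsets = nested_subsets(range(100), fractions=[0.1, 0.25, 0.5, 1.0])

for fraction, indices in subsets.items():
    print(f"{fraction:.2f} of the training set -> {len(indices)} polymers")
```

Training one model per subset and plotting the test error against subset size gives a learning curve for each oligomer length.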

---

## Resources:

>RDKit:   
>https://rdkit.org/docs/index.html

>Chemprop:  
>https://pubs.acs.org/doi/10.1021/acs.jcim.9b00237  
>https://pubs.acs.org/doi/10.1021/acs.jcim.3c01250  
>https://chemprop.readthedocs.io/en/latest/

>Data from:  
>https://pubs.acs.org/doi/full/10.1021/jacs.9b03591

>Polymer representations:  
>https://pubs.rsc.org/en/content/articlelanding/2022/SC/D2SC02839E

## 0. Install Chemprop

In [None]:
# Chemprop (~1min)
!pip install chemprop -qq
import chemprop
print("Imported Chemprop version", chemprop.__version__)

from rdkit import Chem                                                  # rdkit is used to convert SMILES to molecular graphs ("mols")
from rdkit.Chem import Draw                                             # Lets us draw molecules
from chemprop import data, featurizers, models, nn                      # chemprop is our GNN package

# ML
import lightning.pytorch as pl                                          # lightning has built-in functions for lots of the basics (metric tracking etc); Chemprop is built on this.
from lightning.pytorch.callbacks import ModelCheckpoint
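from lightning.pytorch.callbacks import EarlyStopping                   # import from lightning.pytorch (not pytorch_lightning) so the callback works with the lightning.pytorch Trainer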
from lightning.pytorch.loggers import CSVLogger                         # Configure CSV logger for tracking losses
import logging
logging.getLogger("lightning.pytorch").setLevel(logging.ERROR)
from sklearn.model_selection import train_test_split, KFold, PredefinedSplit
from sklearn.metrics import r2_score, mean_absolute_error

# Misc
import os
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
version = 0                                                             # used for save files

## 1. Load data



In [None]:
#Get the polymer SMILES from GitHub.
csv_url = "https://raw.githubusercontent.com/S-AJ-H/AIMS26/2a6ddde83a53782df38fe6194cb59c5d3b2c2a1c/dataset_oligomers.csv"
df_data = pd.read_csv(csv_url)
display(df_data)

## 2. Prepare data for machine learning

## 3. Define, train and validate model

## 4. Analyse results