<a href="https://colab.research.google.com/github/FlorianSong/MResAMS_DataAnalytics/blob/main/Workshop4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project

- For your final project, you will analyse data from the following paper: [Electronic spectra from TDDFT and machine learning in chemical space](https://aip.scitation.org/doi/10.1063/1.4928757).

- In this data set, you will find a variety of calculated properties for a total of 22,000 molecules. The original paper performed machine learning with Coulomb-matrix (CM) and bag-of-bonds (BOB) features. Here, the [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) representations of the molecules have instead been featurised more heavily, turning them into quantitative data that models can be fitted on. Since this step can be a little involved (much research in cheminformatics is dedicated to this field!), I have prepared it for you. You will find the data featurised in three distinct ways:

    1. Using Morgan fingerprints as implemented in [RDKit](https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints). This is a popular method for turning molecules of any shape into fixed-size vectors of binary data. The authors of the paper also use this method.
    2. Using all molecular descriptors available in RDKit. I'm not sure whether a comprehensive list of the descriptors exists, but they can all be looked up in the RDKit [documentation](https://www.rdkit.org/docs/source/rdkit.Chem.html). This is also partially used in the paper.
    3. Using the [mordred package](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y), containing some 1826 descriptors, 1427 of which returned non-erroneous features for the molecules in the photoswitch dataset. For a complete list of all descriptors, click [here](https://mordred-descriptor.github.io/documentation/master/descriptors.html). This is not in the paper and may offer additional descriptors that could be important.


- I will put the `final_project_data` folder on the GitHub repository for this course. It contains the following:
    * `Featurisation.ipynb` (My finished notebook with all the commented (!!!) code necessary to obtain the data below, in case you want to change parameters, in particular for fingerprints. Since this may be a little advanced, this is absolutely not mandatory!)
    * `qm8.csv` (this is the original data, SMILES + outcome properties, i.e. transition energies and oscillator strengths)
    * `morgan_fingerprints.csv` (SMILES + Morgan fingerprints)
    * `rdkit_descriptors.csv` (SMILES + RDKit descriptors)
    * `mordred_descriptors.csv` (SMILES + mordred descriptors)
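
Since every file carries the SMILES string, one way to pair features with targets is to join on that column. The following is only a minimal sketch with pandas: the column names (`smiles`, `E1-CC2`, `fp_0`, ...) and the tiny in-memory tables are assumptions made for illustration, so check the real headers in the CSV files before adapting it.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for qm8.csv and morgan_fingerprints.csv.
# All column names and values here are made up for illustration.
targets_csv = io.StringIO(
    "smiles,E1-CC2,f1-CC2\n"
    "C,0.43,0.00\n"
    "CC,0.41,0.01\n"
    "CCO,0.33,0.02\n"
)
features_csv = io.StringIO(
    "smiles,fp_0,fp_1,fp_2\n"
    "CC,0,1,0\n"
    "C,1,0,0\n"
    "CCO,0,1,1\n"
)

targets = pd.read_csv(targets_csv)
features = pd.read_csv(features_csv)

# Inner join on the SMILES string, so each row has both features and
# targets even if the two files list the molecules in a different order.
# validate="one_to_one" raises an error if a SMILES string is duplicated.
merged = targets.merge(features, on="smiles", how="inner", validate="one_to_one")

X = merged.filter(like="fp_")  # feature matrix (columns containing "fp_")
y = merged["E1-CC2"]           # a single target column
print(X.shape, y.shape)
```

With the real files, the `io.StringIO` objects would be replaced by the file paths, and the same join works for the descriptor tables.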


- Some ideas to get you started:
    * Compare & contrast methods of featurisation (fingerprints have no real-life equivalents as opposed to descriptors etc):
        - What features are useful?
        - How can we cut them down?
    * Supervised learning:
        - Attempt to predict excitation energies and oscillator strengths, either purely with ML or by predicting error between CC2 and DFT
        - Random forests? Neural nets? (TAKE CARE: big data set, dimensionality reduction)
    * Unsupervised learning:
        - Clustering the molecules (or subsets), what groups of molecules are you seeing? Useful properties of some groups? Outliers?
        - Are there links we don’t expect? How can we explain these?
![image.png](attachment:image.png)
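
As one possible starting point for cutting features down, the sketch below drops constant columns, greedily removes highly correlated ones, and then projects onto principal components via an SVD. It runs on a synthetic matrix, and the thresholds (`tol`, `threshold=0.95`) and helper names are arbitrary choices for illustration rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a descriptor matrix: 200 molecules, 6 features.
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 2],
    np.full(200, 7.0),                         # constant column
    base[:, 0] * 2.0 + 1.0,                    # linear copy of column 0
    base[:, 1] + 0.01 * rng.normal(size=200),  # near duplicate of column 1
])


def drop_constant(X, tol=1e-12):
    """Remove columns whose variance does not exceed tol."""
    keep = X.var(axis=0) > tol
    return X[:, keep], keep


def drop_correlated(X, threshold=0.95):
    """Greedy filter: keep a column only if its absolute Pearson
    correlation with every previously kept column is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], kept


X_var, var_mask = drop_constant(X)
X_red, kept = drop_correlated(X_var)

# PCA via SVD of the standardised data.
Z = (X_red - X_red.mean(axis=0)) / X_red.std(axis=0)
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of variance per component
scores = Z @ Vt[:2].T            # coordinates on the first two components

print(X.shape, "->", X_red.shape, "->", scores.shape)
```

The greedy filter depends on column order, so which member of a correlated pair survives is arbitrary. For binary fingerprints, a correlation filter may be less meaningful than for continuous descriptors.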



### Best of luck!


QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

- S0 -> S1 transition energy E1 and the corresponding oscillator strength f1
- S0 -> S2 transition energy E2 and the corresponding oscillator strength f2

E1, E2, f1, f2 are in atomic units. f1, f2 are in length representation. Random splitting is recommended for this dataset.

The source data contain:

   - Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
   - Columns 2-5: RI-CC2/def2TZVP
   - Columns 6-9: LR-TDPBE0/def2SVP
   - Columns 10-13: LR-TDPBE0/def2TZVP
   - Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
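
The column layout above can be turned into index ranges, for example to build a "CC2 minus TDDFT" target and a random train/test split. This is a sketch under several assumptions: the table is synthetic, the order within each block of four is assumed to be E1, E2, f1, f2 (the list above does not state it), and the choice of LR-TDPBE0/def2SVP as the baseline is arbitrary. The Hartree-to-eV factor is applied to energies only.

```python
import numpy as np

HARTREE_TO_EV = 27.211386  # 1 Hartree (atomic unit of energy) in eV

# 0-based slices for the 1-based column ranges listed above;
# index 0 holds the molecule ID.
METHOD_COLUMNS = {
    "RI-CC2/def2TZVP": slice(1, 5),           # columns 2-5
    "LR-TDPBE0/def2SVP": slice(5, 9),         # columns 6-9
    "LR-TDPBE0/def2TZVP": slice(9, 13),       # columns 10-13
    "LR-TDCAM-B3LYP/def2TZVP": slice(13, 17), # columns 14-17
}

# Synthetic stand-in for the source table: 10 molecules, 17 columns.
rng = np.random.default_rng(42)
n = 10
table = np.column_stack([np.arange(n), rng.uniform(0.1, 0.5, size=(n, 16))])

cc2 = table[:, METHOD_COLUMNS["RI-CC2/def2TZVP"]]
pbe0 = table[:, METHOD_COLUMNS["LR-TDPBE0/def2SVP"]]

# Assumed order within each block: E1, E2, f1, f2.
# Target for "predicting the error": CC2 E1 minus TDDFT E1, in eV.
delta_e1_ev = (cc2[:, 0] - pbe0[:, 0]) * HARTREE_TO_EV

# Random 80/20 split, following the recommendation above.
perm = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]

print(delta_e1_ev[train_idx].shape, delta_e1_ev[test_idx].shape)
```

A model trained on `delta_e1_ev` would then be added to the TDDFT value to estimate the CC2 energy.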