# Week 3

## Objectives so far
* Read more about Pyomo and other tools that might work for multi-objective optimization in Python
* Read and summarize multiple multi-objective optimization papers for Paloma
* ~~Prepare 2 architectures (one low-data and one large-data) and present them on June 27~~
* ~~Prepare a presentation for literature review on June 30~~

## June 27

>**Title:** Data-Driven Methods for Accelerating Polymer Design
>
>**Author(s):** Tarak K. Patra
>
>**Link:** [PDF Article](https://nbviewer.org/github/LouisTheLuis/MCSC-Summer-2022/blob/master/Summaries/Data-Driven_Methods_for_Accelerating_Polymer_Design.pdf)
>
>**Important Points:**
>* *Context:* data banks in synthetic polymers are sparse, heterogeneous, and sometimes unavailable. Additionally, due to a lack of techniques for measuring polymer properties, building these databases is often challenging.
>* Some popular large-scale polymer simulation packages: LAMMPS, GROMACS, NAMD, DL_POLY, DL_MONTE, and Cassandra. These are used for estimation of a wide range of equilibrium and steady-state properties (structure factor, radius of gyration, viscosity, gas permeation, etc.)
>* There are also packages for quantum level simulations: VASP, QUANTUM ESPRESSO, and Gaussian.
>* Primary bottleneck in optimal design of polymers is the large combinatorial sequence space. Large sequence spaces need to be explored to identify the best candidate. Thus, the article establishes a roadmap for polymer design study.
>* *Polymer Featurization:*
> ![image.png](attachment:image.png)
>* The featurization of polymers can be done by 6 popular representations: One-Hot Encoding of Polymer Sequence, Property Coloring, Chemical Tree, One-Hot Encoding of SMILES Strings, Autoencoding, and Motif-Based Fingerprinting.
>* Each one of these featurization models can be used for a particular kind of architecture; for example, property coloring might be useful for a CNN-focused architecture.
>* The article mentions multiple supervised machine learning methods that can be used for building predictive models for polymers: Kernel ridge regression (KRR), support vector machine (SVM), Gaussian process regression (GPR), ANN, and random forest.
>* The article also describes multiple optimization algorithms: Bayesian Optimization (BO), genetic algorithm (GA), and Monte Carlo tree search (MCTS).
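
To make one of the representations above concrete for myself, here is a minimal sketch of one-hot encoding of SMILES strings. It assumes character-level tokenization, which is a simplification (two-letter atoms such as `Cl` would be split), and the toy repeat units and function names are hypothetical, not taken from the article.

```python
def build_vocabulary(smiles_list):
    """Collect the sorted set of characters that appear in the SMILES strings."""
    return sorted({ch for smiles in smiles_list for ch in smiles})


def one_hot_encode(smiles, vocabulary, max_length):
    """Return a max_length x len(vocabulary) matrix of 0/1 values.

    Row i has a single 1 in the column of the i-th character; rows past
    the end of the string stay all-zero (padding).
    """
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in range(max_length)]
    for position, ch in enumerate(smiles[:max_length]):
        matrix[position][index[ch]] = 1
    return matrix


# Hypothetical toy repeat units; "*" marks a connection point.
toy_polymers = ["*CC*", "*CC(C)*", "*CCO*"]
vocabulary = build_vocabulary(toy_polymers)
max_length = max(len(s) for s in toy_polymers)
encoded = [one_hot_encode(s, vocabulary, max_length) for s in toy_polymers]
```

Each polymer becomes a fixed-size matrix, which is the kind of input a CNN- or RNN-style architecture could consume.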

### Meeting (6/27/2022 - 2:00pm)
During this meeting we reviewed the architectures proposed by all three of us. The idea was that each one of us would propose two architectures:
* One *low-data*/*short-term* architecture, to be used with a small toy data set. Its purpose is to test our machine learning approaches for generating new polymers with certain characteristics pertaining to a previously known set of molecules (cellulose acetate, for example).
* One *big-data*/*long-term* architecture, which would generalize the approach to a large data set with many polymers to choose from.

Before going into the details of what each of us did, it is important to point out the proposed *data set* for the upcoming project development. Neil brought up that, as **PolyInfo** will not be available for a while, we will be forced to use **PI1M**, which stands for PolyInfo one million. The information available about this data set is thin. On the one hand, it contains more than 1 million polymers. On the other hand:
* The data in this database might not be trustworthy, because the methodology used to build it is not particularly rigorous.
* It is likely that the database does not contain the information that we want. It might, however, serve as a useful toy data set.

Now, we will look into each other's architectures:

>**Author:** Andrew Emmel
>
>**Architecture Diagram:** [Diagram](https://nbviewer.org/github/LouisTheLuis/MCSC-Summer-2022/blob/master/Proposed%20Architectures/Andy%27s%20Architecture.pdf)
---
>**Author:** Neil Malur
>
>**Architecture Diagram:** [Diagram](https://nbviewer.org/github/LouisTheLuis/MCSC-Summer-2022/blob/master/Proposed%20Architectures/Neil%27s%20Architecture.pdf)
---
>**Author:** Louis Martinez
>
>**Architecture Diagram:** [Diagram](https://nbviewer.org/github/LouisTheLuis/MCSC-Summer-2022/blob/master/Proposed%20Architectures/Louis%27%20Proposal.pdf)

## June 28
We have now been assigned to read more of the literature in order to prepare a presentation for this Thursday, June 30, on whatever findings we obtain from it.

>**Title:** PI1M: A Benchmark Database for Polymer Informatics
>
>**Author(s):** Ruimin Ma and Tengfei Luo
>
>**Link:** [PDF Article](https://nbviewer.org/github/LouisTheLuis/MCSC-Summer-2022/blob/master/Summaries/PI1M_A_Benchmark_Database_for_Polymer_Informatics.pdf)
>
>**Important Points:**
>* *Context:* PolyInfo 1 Million (PI1M) is a benchmark database introduced by these researchers as a solution to an existing problem in polymer informatics: the lack of accessibility to polymer databases like PolyInfo, Polymer Genome, or CHEMnetBASE-Polymers, which hinders the development of new approaches to molecular design.
>* The researchers manually collected 12,000 polymer structures from PolyInfo and then used them to train a generative model that generates 1 million polymer structures outside of the training data set.
>* Then, a machine learning representation for polymers, called polymer embedding (PE), is introduced, which can be used to get information such as the density, glass transition temperature, etc.
>![image.png](attachment:image.png)
>* The database was built by taking around 12,000 monomers from PolyInfo, represented as p-SMILES strings, and using them as the training set for an RNN, which is the generative model.
>* In the RNN model, the p-SMILES strings are tokenized into a sequence of characters $(X_1, X_2, ..., X_t, X_{t+1}, ..., X_N)$, which is then used to learn the conditional probability distribution $P(X_{t+1}|X_1, X_2, ..., X_t)$. Once the RNN has been trained, it generates new polymers by sampling from this distribution.
>* Unfortunately, PI1M was published without polymer properties.
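
To get a feel for "learn $P(X_{t+1}|X_1, ..., X_t)$, then sample from it", here is a minimal sketch that swaps the RNN for a much simpler first-order Markov (bigram) model, which conditions only on the previous character. This is not the authors' method; the training strings, the `START`/`END` markers, and the function names are all hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical begin/end-of-string markers


def fit_bigram_counts(strings):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for s in strings:
        tokens = [START, *s, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Estimate P(X_{t+1} | X_t = prev) from the counts."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}


def sample_string(counts, rng, max_length=50):
    """Generate one string by sampling character after character until END."""
    out, prev = [], START
    for _ in range(max_length):
        dist = next_token_distribution(counts, prev)
        tokens, probs = zip(*dist.items())
        prev = rng.choices(tokens, weights=probs)[0]
        if prev == END:
            break
        out.append(prev)
    return "".join(out)


# Hypothetical toy training set; "*" marks a connection point.
training_set = ["*CC*", "*CC(C)*", "*CCO*"]
counts = fit_bigram_counts(training_set)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
generated = [sample_string(counts, rng) for _ in range(5)]
```

A bigram model has no memory of unclosed parentheses or rings, so many of its samples will not be valid structures. An RNN conditions on the whole prefix, which is what lets it track that kind of context.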

## June 29
More literature review. Probably will spend some time making the presentation for tomorrow June 30.

Also, Paloma told us to check this website: [The Materials Project](https://materialsproject.org/materials/mp-18957). 

>**Title:** Dielectric Polymer Property Prediction Using Recurrent Neural Networks with Optimizations
>
>**Author(s):** Antonina L. Nazarova, Liqiu Yang, Kuang Liu, Ankit Mishra, Rajiv K. Kalia, Ken-ichi Nomura, Aiichiro Nakano, Priya Vashishta, and Pankaj Rajak
>
>**Link:** [PDF Article](https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c01366)
>
>**Important Points:**
>* *Context:* even though there has been success in the use of ML for predicting structure-property relations, the field is in its infancy when it comes to dielectric properties of polymers.
>* This paper in particular uses a single-layer Elman RNN to identify correlations between the structure of polymers of the norbornene class and their permittivity.
>* It uses SMILES notation in binary and decimal representations.
>* The researchers compared two algorithms to implement the RNN: the original backpropagation (BP) and its modification (ATransformedBP), developed with affine transformed input, as well as resilient propagation (RPROP) with an optimized parameter of initial weight update. These were compared in their effectiveness when predicting dielectric parameters.
>* They generated their own data set for glass transition temperature $T_g$ and dielectric constant $ε$ using valence-aware polarizable reactive force field method (ReaxPQ-v). Thus, the training set consisted of SMILES representation of generated chemically accurate data.
>* The use of input affine transformation (AT) is proposed to overcome the issues that backpropagation brings (e.g. trapping in local minima, slow convergence, etc.)
>* Also, the use of resilient propagation (in the form of iRPROP⁻) was proposed to address convergence issues and getting stuck in local minima.
>![image.png](attachment:image.png)
>* The polymeric data sets were (...)
>* In order to compare the two RNN models, the root-mean-square error (RMSE) and relative standard deviation (RSD) were calculated.
>* The prediction accuracy of the dielectric constant $ε$ using the ATransformedBP algorithm was found to be similar or, in some instances, slightly superior to resilient propagation learning algorithms like iRPROP⁻.
>* The binary SMILES format led to better results than the decimal SMILES format.
>* The average RSD of the algorithms never exceeded 5%, and the maximum was never higher than 30%.
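
As a reminder of what the two comparison metrics measure, here is a minimal sketch. It assumes RSD means the sample standard deviation divided by the absolute mean, in percent; I have not checked whether the paper computes it exactly this way. All numbers are made up for illustration.

```python
import math
import statistics


def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def rsd_percent(values):
    """Relative standard deviation: sample std. dev. over |mean|, in percent."""
    return 100.0 * statistics.stdev(values) / abs(statistics.mean(values))


# Hypothetical numbers, for illustration only (not taken from the paper).
reference_eps = [2.0, 3.0, 4.0]
predicted_eps = [2.0, 3.0, 7.0]
repeated_runs = [9.0, 10.0, 11.0]  # e.g. one prediction repeated over three runs

error = rmse(predicted_eps, reference_eps)
spread = rsd_percent(repeated_runs)
```

RMSE measures how far predictions are from the reference values, while RSD measures how much a set of values scatters relative to its mean, so the two answer different questions about a model.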

## June 30

### Meeting (6/30/2022 - 2:20pm)

* In the last meeting we focused on the architectures our group proposed. The one we leaned towards used bond angles as the feature representation.
* Paloma opened the *Materials Project* website, with a particular polymer ($Nb_3Os$).
* Neil proposed some architectures based on more up-to-date literature. This gave him some ideas that made him reconsider the structure of the current architecture.
* Paloma proposed a crash course in polymer engineering (OCW).
* She asked the chemical group to investigate how the cellulose acetate monomer bonds to itself.
* Paloma mentioned the architectures with IBM. 
    * IBM mentioned that they already did a data-based approach
    * Mentioned Random Forest; according to Neil, that seems to be outdated, as there are methods that outperform Random Forest
    * We might not be working with angles, but with a 3D representation
    * IBM has a data-set already


Let's make it simple for me:

I read:

* **1.** The paper about *Data-Driven Methods for Accelerating Polymer Design*, where the author goes over multiple feature representations for polymers and how these affect the results given by machine learning methods.
* **2.** The paper about *Dielectric Polymer Property Prediction Using Recurrent Neural Networks with Optimizations*, in which the authors describe advances they made in predicting dielectric properties of polymers using a Recurrent Neural Network (RNN). They used multiple strategies (SMILES Fingerprint representations, backpropagation through ATransformedBP and iRPROP). The SMILES representation turned out to be the best in training and prediction performance.
* **3.** Review Article of Machine Learning for Polymer Informatics. Contains a list of polymer databases that could be useful.

Let's compile the **Literature Review Questions**:
* **1.** Representation of the chemical structure
* **2.** The architecture of the machine learning method
* **3.** How do they get the properties of the polymers?
    * **a.** Molecular Weight
    * **b.** Comparison with other polymers
    * **c.** Chemical groups
    * **d.** Another approach
* **4.** Do they mention how to incorporate physics?
* **5.** How much data do we need to do something similar?
* **6.** Could this work on 3D?
* **7.** What is the key message of the paper?

Neil mentioned the papers he read (the 2 most important ones):

>**Author:** Neil Malur
>
>**Important points:**
>
>* **1.** 2022-based approaches. We take a polymer and put it through a Message Passing Neural Network (MPNN). The results are very good, and it lines up with our intuition because we are looking at how polymers bond to themselves.
>* **2.** May 2022 -> They took the monomer as a repeating unit and made it bond to itself in a repeating manner. They made the graph periodic in nature, looping back onto itself, and then passed it through an MPNN. They found significant gains in predictions: atomization error went down, with decreases of around 19%-40% in some properties and increases of ~3% in others (insignificant; one of them was crystallization energy). Overall, the error decreased by ~20%. It does not, however, capture 3D data or entanglement.
>* **3.** Biodegradable polymers -> P2Actor.
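
To picture the periodic-graph idea from point 2, here is a minimal sketch, assuming the repeat unit is an undirected graph and the bond between neighboring repeat units is modeled as one extra edge from the tail atom back to the head atom. The message passing step is a fixed sum over neighbors; a real MPNN uses learned message and update functions. The three-atom unit, the scalar features, and the function names are hypothetical.

```python
def build_graph(bonds):
    """Build an undirected adjacency map {atom: set of neighbors}."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def build_periodic_graph(bonds, head, tail):
    """Same graph plus a tail-head edge, so the unit loops back onto itself."""
    return build_graph([*bonds, (tail, head)])


def message_passing_step(adjacency, features):
    """One round of sum aggregation: own feature plus the neighbors' features."""
    return {
        node: features[node] + sum(features[n] for n in neighbors)
        for node, neighbors in adjacency.items()
    }


# Hypothetical repeat unit 0-1-2 with connection points at atoms 0 and 2.
bonds = [(0, 1), (1, 2)]
features = {0: 1.0, 1: 2.0, 2: 4.0}  # made-up scalar atom features

open_update = message_passing_step(build_graph(bonds), features)
periodic_update = message_passing_step(build_periodic_graph(bonds, 0, 2), features)
# In the open chain the end atoms see only one neighbor; in the periodic
# graph they also receive a message across the repeat-unit boundary.
```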
---
>**Author:** Andrew Emmel
>
>**Important points:**
>
>* PI1M is not a viable data set, unfortunately. It does not contain any polymer properties; just the list of polymers.
>* BigSMILES is a notation for describing the structure of macromolecules. It might be interesting depending on which monomer representation you use. It does not do 3D representation for these macromolecules. You could represent polymerization points.
>* Paloma mentions that we need to understand why these researchers were looking for something different from the original SMILES.
>* Backtracking to an IBM paper. They mentioned using a feature vector for inverse molecular design. If we used the Deep RL method. Could not find much; it is written in a very confusing way.
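
To see what "representing polymerization points" could look like in practice, here is a minimal sketch that pulls the repeat units and their bonding descriptors (`[$]`, `[<]`, `[>]`) out of a BigSMILES string. The example strings use the simple stochastic-object form `{...}` as I understand it from the BigSMILES paper, so treat them as assumptions. The parser is hypothetical and only handles the simplest case: one stochastic object, no nesting.

```python
import re

# Bonding descriptors: [$], [<], [>], optionally followed by an index.
BONDING_DESCRIPTOR = re.compile(r"\[[$<>]\d*\]")


def repeat_units(bigsmiles):
    """Return the comma-separated repeat units of the first {...} object."""
    match = re.search(r"\{(.*?)\}", bigsmiles)
    if match is None:
        return []
    units = match.group(1).split(";")[0]  # drop end groups listed after ';'
    return units.split(",")


def polymerization_points(unit):
    """Return the bonding descriptors found in one repeat unit."""
    return BONDING_DESCRIPTOR.findall(unit)


# Assumed example strings: a homopolymer and a two-unit random copolymer.
polyethylene = "{[$]CC[$]}"
copolymer = "{[$]CC[$],[$]CC(CC)[$]}"

summary = {unit: len(polymerization_points(unit)) for unit in repeat_units(copolymer)}
```

Unlike plain SMILES, the string itself says where each unit connects to the next, which may be part of why the researchers wanted something beyond the original notation.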

Elena unfortunately could not find much information regarding why fingerprinting assumes a Gaussian Distribution. She will be looking deeper into that.

We need more information on Molecular Dynamics. Look into Primer/Article Review. Meet Prof. Gregory Rutledge at MIT to ask more specific questions. We need to look into how to incorporate physics into the architecture.

Download PI1M and see what we can do with it. 