<a href="https://colab.research.google.com/github/D3TaLES/In-The-Mix/blob/main/data_science/InTheMix2_DataScineceDay2_MASTER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recap
1. Reduction and oxidation potentials are important for the development of redox flow batteries

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/redox_flow_battery.png)

2. There are several types of Machine Learning (ML).

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/ml_taxanomy.png)


Supervised learning is a form of machine learning for problems where the task at hand requires a fully labeled dataset. The models in this case use the labels as ground truth to learn and update themselves. Today we will take a deeper dive into supervised learning and look at a practical application of it.


# Principles in Action: Training supervised ML models
>Goal: Train ML model to predict vertical electron affinity.

To accomplish this, we will download around 30,000 molecules and calculate three descriptors which we will use to predict the vertical electron affinity. We will start with simple models and then proceed to complex models.

The goal is to see how well different models trained on our data perform the task of predicting vertical electron affinity.

Before we proceed to the analysis, let us spend some time on the terminology we will use frequently in this tutorial and on the important steps of approaching our task the ML way!

# Terminology Explanation


* **Data visualization and preprocessing**

> Data visualization helps us see patterns or trends. Data preprocessing typically involves cleaning up, organizing, and formatting the data so that it can be correctly handled by the model. This helps the model learn better from the data and make better predictions.



![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/train_test_split.png)
> It is important to evaluate how our model performs on unseen data. To do so, people usually split the data into two parts: one for teaching (training set) and one for checking (testing set). This helps us ensure that the machine learning model generalizes well to new, unseen information and is not just memorizing the training data. The most common way to do the train-test split is to randomly pick 80% of the data for training and the remaining 20% for testing.
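
The random 80/20 split described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `simple_train_test_split` and the placeholder molecule names are made up for this example, and later in the notebook the split is done with scikit-learn's `train_test_split` instead.

```python
import random

def simple_train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split the rows into (train, test)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # fixed seed gives the same split every run
    n_test = int(round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_indices]
    test = [rows[i] for i in sorted(test_indices)]
    return train, test

# Placeholder data standing in for the real molecules
molecules = [f"molecule_{i}" for i in range(10)]
train_set, test_set = simple_train_test_split(molecules)
print(len(train_set), "training rows,", len(test_set), "testing rows")
```

Every row ends up in exactly one of the two sets, so the model is never tested on data it was trained on.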


![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/choose_the_right_model.png)

> Just like different tools are used for different tasks, there are different machine learning models that are meant for different problems. Based on the type of problem and data we have, and the results we would want to achieve, the following are some tips for choosing a good model:
1. Understand the problem: First, figure out what kind of problem you're trying to solve. Is it about classifying things (like cats and dogs), predicting numbers (like house prices), or finding relationships between data points (like how exercise affects health)?
2. Explore the data: Take a look at your data to see what's in it. Are there images, text, or numbers? Are there missing values or unusual patterns? This can help you choose a model that works well with that kind of data.
3. Start simple: Try starting with a simple model, like a decision tree. If it doesn't perform well, you can move on to more complex models.
4. Experiment and compare: Test different models on your data and compare their performance. Keep track of their accuracy, speed, and any other factors that are important for your task.


![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/loss.png)

>A loss function is like a scorecard that tells us how good or bad the computer's predictions are compared to the real answers. Our goal is to help the computer get better at predicting by minimizing the difference between its predictions and the actual values. The loss function guides the computer to make adjustments and improve its performance. The following are two common loss functions:
1. Mean Squared Error (MSE): This loss function calculates the average of the squared differences between the model's predictions and the answers. It's often used in problems where we want to predict numbers, like house prices or temperatures.
2. Cross-Entropy: This loss function is used when we want to classify things, like deciding if a picture is of a cat or a dog. It measures how well our model can predict the correct category, rewarding the model if it's confident about the right answer and penalizing it if it's confident about the wrong one.
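
To make the two loss functions above concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are chosen for this example only, and the cross-entropy shown is the two-class (binary) form, which assumes predicted probabilities strictly between 0 and 1.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and answers."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(labels, probs):
    """labels are 0 or 1; probs are predicted probabilities of class 1 (strictly between 0 and 1)."""
    total = 0.0
    for label, p in zip(labels, probs):
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(labels)

# Regression example: errors of 1 and 2 give (1 + 4) / 2
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])

# Classification example: the true class is 1 in both cases
confident_right = binary_cross_entropy([1], [0.9])   # confident and correct: small loss
confident_wrong = binary_cross_entropy([1], [0.1])   # confident and wrong: large loss
print(mse, confident_right, confident_wrong)
```

The last two values show the behaviour described above: a confident wrong answer is penalized far more than a confident right answer.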





In [None]:
#@title Visual of loss function
from IPython.display import Image
Image(url='https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/Loss_Function.gif', width = 600)

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/hyperparameter.png)

> When we build a machine learning model, some parameters are learned from the data during training. Other parameters, which relate to the architecture of the model or to the training procedure, are not determined by training and must be set beforehand. These are known as hyperparameters. Hyperparameter tuning is the search for the hyperparameter values that give the best model performance.
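
As a small illustration of hyperparameter tuning, the sketch below treats the degree of a polynomial as the hyperparameter and picks the value with the lowest error on held-out validation points. The synthetic data, the candidate degrees and the variable names are all assumptions made for this example; they are not part of the molecule dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)    # synthetic toy data: a noisy parabola

# Hold out every third point for validation
val_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

validation_mse = {}
for degree in [1, 2, 3, 5]:                         # candidate hyperparameter values
    coeffs = np.polyfit(x_train, y_train, deg=degree)    # parameters learned by fitting
    predictions = np.polyval(coeffs, x_val)
    validation_mse[degree] = float(np.mean((y_val - predictions) ** 2))

# Keep the hyperparameter value with the lowest validation error
best_degree = min(validation_mse, key=validation_mse.get)
print(validation_mse, "-> best degree:", best_degree)
```

The polynomial coefficients are the learned parameters, while the degree is fixed before each fit and chosen by comparing validation errors.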

In [None]:
# REST API class
#@title Install and import required packages
!pip install rdkit -qqq
from sklearn.metrics import mean_absolute_percentage_error
import json, warnings, requests, rdkit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import *

from keras.models import Sequential
from keras.layers import Dense

USERNAME = 'd3tales.edu@gmail.com'
PASSWORD = 'D3education'

class RESTAPI(object):
    def __init__(self, method=None, url="https://d3tales.as.uky.edu", endpoint=None,
                 login_endpoint='login', username=USERNAME, password=PASSWORD,
                 upload_file=None, params=None, expected_endpoint=None, return_json=False):
        """
        Upload a file through the d3tales.as.uky.edu file upload feature.

        :param method: str, html method (such as post or get)
        :param url: str, base url
        :param endpoint: str, post or get endpoint url (not containing base url)
        :param login_endpoint: str, login url (not containing base url)
        :param username: str, user username
        :param password: str, user password
        :param upload_file: str, path to file to be uploaded
        :param params: dict, form parameters for post
        :param return_json: bool, get or post method returns json if true
        """
        self.method = method
        self.endpoint = "{}/{}/".format(url, endpoint).replace("//", "/").replace(':/', '://')
        self.login_endpoint = "{}/{}/".format(url, login_endpoint).replace("//", "/").replace(':/', '://')
        if expected_endpoint:
            self.expected_endpoint = "{}/{}/".format(url, expected_endpoint).replace("//", "/").replace(':/', '://')
        self.user_data = dict(username=username, password=password) if username and password else None

        self.client = self.get_client()
        params.update(dict(csrfmiddlewaretoken=self.csrftoken, next='/')) if params else {}
        self.params = params or {}
        self.upload_file = upload_file
        self.return_json = return_json

        if self.method in ["get", "GET", "Get"]:
            self.response = self.get_process()

        elif self.method in ["POST", "post", "Post"]:
            self.response = self.post_process()

        if expected_endpoint:
            if self.response.request.url != self.expected_endpoint:
                warnings.warn("The {} response url for {} to {} did not match the expected response url".format(
                    self.upload_file, self.endpoint, self.method))

    @property
    def cookies(self):
        return self.client.get(self.endpoint).cookies  # sets cookie

    @property
    def csrftoken(self):
        # Retrieve the CSRF token for data post
        return self.cookies['csrftoken'] if 'csrftoken' in self.cookies else self.cookies['csrf']

    def get_client(self):
        with requests.Session() as client:
            if self.login_endpoint and self.user_data:
                # Login
                client.get(self.login_endpoint)  # sets cookie
                csrftoken = client.cookies['csrftoken'] if 'csrftoken' in client.cookies else client.cookies['csrf']
                self.user_data.update(dict(csrfmiddlewaretoken=csrftoken, next='/'))
                # Submit login form
                req = client.post(self.login_endpoint, data=self.user_data, headers=dict(Referer=self.login_endpoint))
            return client

    def post_process(self):
        # Submit data form
        file_data = dict(file=open(self.upload_file, 'rb')) if self.upload_file else None
        req = self.client.post(self.endpoint, data=self.params, files=file_data,
                               headers=dict(Referer=self.endpoint), cookies=self.cookies)
        return_data = req.json() if self.return_json else req
        return return_data

    def get_process(self):
        if self.params:
            req = self.client.get(self.endpoint, data=self.params, headers=dict(Referer=self.endpoint), cookies=self.cookies)
        else:
            req = self.client.get(self.endpoint, headers=dict(Referer=self.endpoint))

        return_data = req.json() if self.return_json else req
        return return_data


def get_prop(prop="reduction_potential", limit=500):
  query = "mol_characterization." + prop + "==true/mol_info.smiles=1&mol_characterization." + prop + "=1/limit=" + str(limit)
  print("Collecting data through REST API...")
  response = RESTAPI(method='get', endpoint="restapi/molecules/"+query,
                      url="https://d3tales.as.uky.edu", login_endpoint='login',
                      return_json=True).response
  comp_data = pd.DataFrame(response)
  get_value = lambda c: c.get(prop).get("value")
  comp_data['smiles'] = comp_data.mol_info.apply(lambda c: c.get("smiles"))
  comp_data[prop] = comp_data.mol_characterization.apply(get_value)

  comp_data.set_index(comp_data._id, inplace=True)
  comp_data.drop(['_id', 'mol_info', 'mol_characterization' ], axis=1, inplace=True)

  return comp_data
# get_prop(prop="vertical_electron_affinity", limit=5)

## Data preparation

Data preparation is an important part of machine learning. It involves getting the data ready to the point where it can reveal its secrets when we cast the spell of machine learning on it. Often the raw or original data contains many irregularities which need to be taken care of. In addition, it is very important to gain a thorough understanding of every aspect of the data. This is necessary to understand and interpret results across the several steps of the analysis process. Even though the steps for preprocessing can vary depending on the data we deal with, the following steps broadly summarize how data can be prepared correctly:

1. Gaining information about the source of the data.
2. Doing a thorough study on the background of the data and what each aspect of the data signifies.
3. Remove irregularities present in the data.
4. Prepare proper visualization of the data.

Let us now obtain our dataset and do a visual study of it to find out what we are dealing with.

Hit run on the next cell. This will download the data required for our supervised ML exercise!

Our data consists of four properties of several molecules, namely vertical electron affinity, Kappa1, LabuteASA and CalcChi1v.

In [None]:
#@title Loading and preprocessing data

raw_df = get_prop(prop = 'vertical_electron_affinity', limit=100000).reset_index(drop=True)
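# Parse each SMILES string into an RDKit molecule and compute three descriptors
mol = []
Kappa1 = []
LabuteASA = []
calc1v = []
for i in range(len(raw_df)):
    mol.append(Chem.MolFromSmiles(raw_df['smiles'][i]))
    Kappa1.append(CalcKappa1(mol[i]))        # first-order kappa shape index
    LabuteASA.append(CalcLabuteASA(mol[i]))  # Labute approximate surface area
    calc1v.append(CalcChi1v(mol[i]))         # first-order valence connectivity index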


d = pd.DataFrame({'Kappa1':Kappa1, 'LabuteASA':LabuteASA, 'CalcChi1v':calc1v})
d = d.join(raw_df['vertical_electron_affinity'])


print(d)

In [None]:
#@title
Response = 'vertical_electron_affinity'
# Response = 'oxidation_potential'

Feature1 = 'Kappa1'
Feature2 = 'CalcChi1v'
Feature3 = 'LabuteASA'
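Sample_Size =  30814
# Sample at most Sample_Size rows, so the cell still runs if fewer molecules were downloaded
data_itm = pd.DataFrame({'Response':d[Response], 'Feature1':d[Feature1], 'Feature2':d[Feature2], 'Feature3':d[Feature3]}).sample(min(Sample_Size, len(d)))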
#data_itm = pd.DataFrame({'Response':d[Response], 'Feature1':d[Feature1], 'Feature2':d[Feature2], 'Feature3':d[Feature3]})
data_itm.head()

Let us now look at our data. We discussed earlier that this is an important part of data preparation. We human beings learn best using our basic senses, so it's important to be able to see the data with our own eyes.

Hit run on the cell below!

In [None]:
#@title Visualizing data
import seaborn as sns

sns.pairplot(data_itm, diag_kind='hist')

One can immediately notice the weirdness present in the data just by looking at it. This happens because of the presence of influential observations called outliers. Outliers are called influential observations simply because they can influence a machine learning model very drastically. Thus it is important to remove them before performing any kind of modelling.

Click on the next cell to see how the data looks after removal of outliers!

In [None]:
#@title Removing outliers
data_itm = data_itm.loc[(data_itm['Response'] > -20) & (data_itm['Response'] < 20)]
sns.pairplot(data_itm, diag_kind='hist')
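
The cell above removes outliers with a fixed range of -20 to 20 on the response. The sketch below applies the same kind of filter to a handful of made-up numbers and, for comparison, shows the interquartile-range (IQR) rule, a different and commonly used technique that derives the cut-offs from the data. The toy values and variable names are assumptions for this example.

```python
import numpy as np

values = np.array([-1.2, 0.3, 0.8, 1.1, 1.5, 2.0, 250.0])   # toy data with one extreme value

# Fixed-threshold filter, in the same spirit as the cell above
kept_threshold = values[(values > -20) & (values < 20)]

# IQR rule: keep points within 1.5 * IQR of the first and third quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
kept_iqr = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]

print("fixed threshold keeps:", kept_threshold)
print("IQR rule keeps:      ", kept_iqr)
```

On this toy data both approaches drop only the extreme value; on real data the two rules can disagree, so it is worth looking at the plots before and after filtering.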

Now that we are visually satisfied, let's proceed to analyze the data. But before we do so, we need to split it. For this exercise we will use 80% of the data for training our models and 20% for testing them.

The next cell will perform this 80-20 split.

In [None]:
#@title Splitting data for training and testing
y = data_itm['Response']
X = data_itm[['Feature1', 'Feature2', 'Feature3']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Linear Regression

This is one of the very first statistical concepts that enabled supervised learning. Even though it is remarkably simple, it has proved useful in innumerable situations and also paved the way for the development of more advanced models for supervised learning.

Consider the data we saw yesterday in the primer on linear algebra.

Student No. | Age | Height |
----------- | --- | ------ |
1           | 20  | 65.78  |
2           | 22  | 71.52  |
3           | 21  | 69.40  |
4           | 21  | 68.22  |
5           | 23  | 67.79  |

Suppose we want to come up with a model which takes the height of a student as input and predicts the student's age. The simplest relationship we can think of is

$Age = α + β \cdot Height$.

Note that this relationship is linear, hence the name linear regression. We have now established a relationship, but we have not quantified it yet, in the sense that we don't know the values of $α$ and $β$. How do we find the right values of $α$ and $β$?


It is evident that modeling can never give predictions with 100% accuracy. No matter how good a model we fit, there will always be some error in our predictions. This being said, we can frame our model as

> $Age = α + β \cdot Height + e$.

Here we are trying to explain the age from the height using the $α + β \cdot Height$ part, but there still remains a part that affects age which we don't know. We denote this part by $e$. Let us now write the equations for our 5 data points

> $20 = α + β \cdot 65.78 + e_1$

> $22 = α + β \cdot 71.52 + e_2$

> $21 = α + β \cdot 69.4 + e_3$

> $21 = α + β \cdot 68.22 + e_4$

> $23 = α + β \cdot 67.79 + e_5$

Rearranging these equations, our errors in prediction are

> $e_1 = 20 - (α + β \cdot 65.78)$

> $e_2 = 22 - (α + β \cdot 71.52)$

> $e_3 = 21 - (α + β \cdot 69.4)$

> $e_4 = 21 - (α + β \cdot 68.22)$

> $e_5 = 23 - (α + β \cdot 67.79)$

We would want to choose values of $α$ and $β$ that make these errors as small as possible. This is achieved using the method of least squares. Let us define a function

> $L(α, β) = \frac{1}{5} \sum\limits_{i=1}^5 e_i^2 = \frac{1}{5} \sum\limits_{i=1}^5 (Age_{\ i} - (α + β \cdot Height_{\ i}))^2$




This is known as the 'mean squared error'. The idea is to choose the $α$ and $β$ that give the minimum value of $L(α, β)$. These values of $α$ and $β$ define our linear regression model.
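
For the five students in the table, the least-squares values can be computed directly. The sketch below uses the standard closed-form formulas for a single input variable; the variable names are chosen for this example, and this is an illustration rather than the method scikit-learn is called with later in the notebook.

```python
import numpy as np

height = np.array([65.78, 71.52, 69.40, 68.22, 67.79])
age = np.array([20, 22, 21, 21, 23], dtype=float)

# Closed-form least-squares estimates for one input variable
beta = np.sum((height - height.mean()) * (age - age.mean())) / np.sum((height - height.mean()) ** 2)
alpha = age.mean() - beta * height.mean()

# Errors e_i and the mean squared error L(alpha, beta)
errors = age - (alpha + beta * height)
mse = np.mean(errors ** 2)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, MSE = {mse:.3f}")

# Any other choice of alpha and beta gives a larger mean squared error
mse_shifted = np.mean((age - ((alpha + 0.1) + beta * height)) ** 2)
```

Comparing `mse` with `mse_shifted` shows that moving away from the least-squares values increases $L(α, β)$.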

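The mean squared error is often minimized using an iterative algorithm called gradient descent, although for linear regression the minimizing values can also be computed directly in closed form.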
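
A minimal gradient descent sketch for the student data is shown below. To keep a single learning rate workable, the height is standardized first, so the fitted parameters `a` and `b` (names chosen for this example) refer to the standardized height rather than to $α$ and $β$ directly. The learning rate and the number of steps are arbitrary choices.

```python
import numpy as np

height = np.array([65.78, 71.52, 69.40, 68.22, 67.79])
age = np.array([20, 22, 21, 21, 23], dtype=float)

# Standardize the input so that one learning rate suits both parameters
z = (height - height.mean()) / height.std()

a, b = 0.0, 0.0                 # parameters of the model Age = a + b * z
learning_rate = 0.1
for step in range(500):
    error = age - (a + b * z)
    grad_a = -2 * error.mean()           # derivative of the mean squared error w.r.t. a
    grad_b = -2 * (error * z).mean()     # derivative of the mean squared error w.r.t. b
    a -= learning_rate * grad_a          # move against the gradient
    b -= learning_rate * grad_b

# Convert back to the slope for height in its original units
beta_from_gd = b / height.std()
print(f"a = {a:.4f}, b = {b:.4f}, slope per unit height = {beta_from_gd:.4f}")
```

Each step nudges the parameters in the direction that reduces the loss, and after enough steps the values settle at the least-squares solution.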




In [None]:
#@title
Image(url='https://gbhat.com/assets/gifs/linear_regression.gif', width = 600)

Let us now perform linear regression on our data on molecules. Hitting run on the next cell will fit the linear regression model to our molecules.

In [None]:
#@title
reg = LinearRegression()
reg.fit(X_train, y_train)

How has the linear regression performed?

To test this, we use our fitted regression model to make predictions on the test data, which we then use to calculate a measure of accuracy known as the coefficient of determination, denoted by $R^2$.

$R^2$ measures the proportion of the variability in the actual values that is explained by the predictions. A value of 1 means perfect predictions, a value of 0 means the model does no better than always predicting the mean, and higher values of $R^2$ indicate better accuracy.
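
$R^2$ can be computed by hand as one minus the ratio of the squared prediction errors to the squared deviations from the mean. The sketch below does this in plain Python on a few made-up numbers; the function name `r_squared` is chosen for this example, and the notebook itself uses scikit-learn's `r2_score`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]      # toy actual values
y_pred = [2.5, 0.0, 2.0, 8.0]       # toy predictions

score = r_squared(y_true, y_pred)

# A "model" that always predicts the mean of the actual values
mean_value = sum(y_true) / len(y_true)
baseline = r_squared(y_true, [mean_value] * len(y_true))
print(score, baseline)
```

The baseline shows why 0 is the natural reference point: always predicting the mean explains none of the variability.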

In [None]:
#@title
y_pred_reg = reg.predict(X_test)
r2_score(y_test, y_pred_reg)

# Neural Networks (NN) for Supervised Learning

A neural network is a computational learning system that translates a data input (given in a certain form) into a desired output (usually in a different form). For example, we can build a neural network to classify images into different classes. In this case the input data are images (one form) and the outputs are real numbers (a form different from images).

Neural networks were inspired by the human brain which is a complex network of neurons connected in a complicated way. Information passes from one neuron to another through the connections which enables us to perform our daily functions.

Like the human brain, neural networks also contain neurons, which are simple computational units arranged in a specific way (usually in several layers) and connected to each other.

A picture of a simple neural network is shown below.

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/neural_net.png)

The picture above represents a simple neural network called the feed forward neural network with 4 layers.

The first layer is called the **input layer**. This is where the data enters the network. The number of neurons in this layer is equal to the number of input features (or input variables).

The layers in the middle are called the **hidden layers**. It is here that most of the magic happens. As the data passes through these layers certain computations take place which finally lead us to the **output layer** which gives us the desired output. Note that there can be several neurons in output layers depending on the problem at hand.

The neurons between any two layers are connected through weights. Weights are parameters in the network that enable the transformation of the input data within the hidden layers.

Suppose we have a neural network with 1 input layer, 3 hidden layers and 1 output layer. Let $W_1, W_2, W_3$ denote three matrices which contain the weights of the 3 hidden layers. Let us denote the input to the neural network by $x$.

Upon passing through the first hidden layer, a matrix multiplication between $W_1$ and $x$ is performed, followed by passing the result through a function $f$ known as the activation function. The output of the first hidden layer, $h_1$, can be written as

> $h_1 = f (W_1x)$

Now $h_1$ passes through the second layer, which performs a matrix multiplication between $W_2$ and $h_1$ followed by the activation function. This can again be written as

> $h_2 = f (W_2 h_1)$

Following this pattern until the final hidden layer, we have

> $h_3 = f (W_3 h_2)$.

$h_3$ is finally passed through the output layer, whose weights we denote by $W_o$, which gives us the predicted value $\hat{y}$ for input $x$.

> $\hat{y} = f (W_o h_3)$.
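
The chain of equations above is just a sequence of matrix multiplications and activations. The sketch below runs one input through such a network with NumPy. The layer widths, the random weights and the choice of ReLU for $f$ are assumptions made for this example. It follows the equations literally and applies $f$ at the output as well, whereas the Keras network built later in this notebook leaves the output layer without an activation.

```python
import numpy as np

def relu(v):
    """ReLU activation: negative values become 0."""
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])       # one input with 3 features

# Randomly initialized weights; a width of 4 per hidden layer is an arbitrary choice
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(4, 4))
Wo = rng.normal(size=(1, 4))

h1 = relu(W1 @ x)        # first hidden layer
h2 = relu(W2 @ h1)       # second hidden layer
h3 = relu(W3 @ h2)       # third hidden layer
y_hat = relu(Wo @ h3)    # output layer, following the equation above
print(y_hat)
```

With untrained random weights the prediction is meaningless; training is what adjusts the weights so that `y_hat` moves toward the true value.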

If $y$ is the actual value (the ground truth) corresponding to $\hat{y}$, then $e = y - \hat{y}$ is the error in predicting $y$. Thus if we have multiple observations $x_1, x_2,\ldots,x_n$ passing through the neural network, then for each $x_i$ we will have a prediction $\hat{y_i}$. This means that for every $x_i$ we will have an error

> $e_i = y_i - \hat{y_i}$.

Using these $n$ errors $e_i$ we can define a 'mean squared error' in exactly the same way as we did for linear regression, which is

> $L= \frac{1}{n} \sum\limits_{i=1}^n e_i^2$.

This function gives us a sense of the loss we incur during predictions, hence it will be used as our loss function.

Training the neural network involves selecting values of the weights which minimize this loss function. This is achieved using gradient-based optimization, where the required gradients are computed by an algorithm called backpropagation. The gif below shows the working principle behind backpropagation.

You can read more about it [here](https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd).





In [None]:
#@title
Image(url='https://miro.medium.com/v2/1*mTTmfdMcFlPtyu8__vRHOQ.gif', width = 600)
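
To see backpropagation at work on something small, the sketch below uses a tiny network with one hidden layer, a tanh activation and a linear output, all of which are simplifying assumptions for this example (as are the hand-picked weights). It computes the gradients with the chain rule, checks one of them against a finite-difference estimate, and takes a single gradient descent step.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # one input with 3 features
y = 1.0                             # its true value
W1 = np.array([[0.1, -0.2, 0.3],
               [0.0, 0.4, -0.1],
               [-0.3, 0.2, 0.1],
               [0.5, -0.1, 0.0]])   # hidden-layer weights (4 neurons)
w2 = np.array([0.2, -0.4, 0.1, 0.3])   # output-layer weights

def loss_fn(W1, w2):
    """Squared error of the network's prediction for the single example (x, y)."""
    h = np.tanh(W1 @ x)
    return (y - w2 @ h) ** 2

# Forward pass
h = np.tanh(W1 @ x)
y_hat = w2 @ h
loss = (y - y_hat) ** 2

# Backward pass: apply the chain rule from the output back toward the input
d_yhat = -2.0 * (y - y_hat)         # dLoss/dy_hat
grad_w2 = d_yhat * h                # dLoss/dw2
d_h = d_yhat * w2                   # dLoss/dh
d_z = d_h * (1.0 - h ** 2)          # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(d_z, x)          # dLoss/dW1

# Finite-difference check of one entry of grad_W1
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, w2) - loss_fn(W1_minus, w2)) / (2 * eps)

# One gradient descent step using the backpropagated gradients
learning_rate = 0.05
new_loss = loss_fn(W1 - learning_rate * grad_W1, w2 - learning_rate * grad_w2)
print(f"loss before: {loss:.4f}, loss after one step: {new_loss:.4f}")
```

Libraries such as Keras perform the backward pass automatically for every layer, so in practice these gradients are never written out by hand.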

We will now build a neural network to predict vertical electron affinity from the three descriptors and see whether we can achieve any increase in accuracy.

In [None]:
#@title Creating neural net
Number_of_input_units =  3
Number_of_output_units = 1
Number_of_hidden_layers = 5#@param {type:"integer"}
Number_of_hidden_units = []
for i in range(0, Number_of_hidden_layers):
            print("The number of neurons for hidden layer " + str(i+1) + ":")
            ele = int(input())
            Number_of_hidden_units.append(ele) # adding the element
#Number_of_hidden_units - inp_list.split(",")
Number_of_ephocs =  50#@param {type:"integer"}
Batch_size =  128


model_parameters = {}
model_parameters['hidden_units'] = Number_of_hidden_units
model_parameters['input_size'] = Number_of_input_units
model_parameters['output_size'] = Number_of_output_units

training_parameters = {}
training_parameters['epoch'] = Number_of_ephocs
training_parameters['batch_size'] = Batch_size
training_parameters['loss'] = 'mean_squared_error'
training_parameters['optimizer'] = 'adam'

Target = ['Response']
Feature = ['Feature1', 'Feature2','Feature3']

X = data_itm[Feature].values
y = data_itm[Target].values

FeatureScaler=StandardScaler()
TargetVarScaler=StandardScaler()

# Storing the fit object for later reference
FeatureScalerFit=FeatureScaler.fit(X)
TargetVarScalerFit=TargetVarScaler.fit(y)

# Generating the standardized values of X and y
X=FeatureScalerFit.transform(X)
y=TargetVarScalerFit.transform(y)

  # Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
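
Standardization, as performed by `StandardScaler` in the cell above, subtracts the mean of each column and divides by its standard deviation. The sketch below reproduces that calculation with NumPy on made-up numbers. As a variation on the cell above, which fits the scalers on the full dataset before splitting, it computes the statistics from the training rows only and reuses them for the test rows, a common way to keep test information out of training.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # toy training features
test = np.array([[4.0, 40.0]])                               # toy test features

# Learn the column-wise mean and standard deviation from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std        # reuse the training statistics

# Undo the scaling to get values back in the original units
test_restored = test_scaled * std + mean
print(train_scaled)
print(test_scaled, test_restored)
```

The same inverse step is what turns a network's standardized predictions back into electron affinity values in their original units.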

In [None]:
#@title Function to train the network
def train_and_test_model(dataset, Target, Feature, model_params, train_params, test, verbose=True):

  losses = []

  X = dataset[Feature].values
  y = dataset[Target].values

  FeatureScaler=StandardScaler()
  TargetVarScaler=StandardScaler()

  # Storing the fit object for later reference
  FeatureScalerFit=FeatureScaler.fit(X)
  TargetVarScalerFit=TargetVarScaler.fit(y)

  # Generating the standardized values of X and y
  X=FeatureScalerFit.transform(X)
  y=TargetVarScalerFit.transform(y)

  # Split the data into training and testing set
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test, random_state=42)

  # start creating the model

  #from keras.models import Sequential
  #from keras.layers import Dense

  model = Sequential()

  # Defining the Input layer and FIRST hidden layer, both are same!
  model.add(Dense(units=model_params['hidden_units'][0], input_dim=model_params['input_size'], kernel_initializer='normal', activation='relu'))

  # add layers as given by the user
  for units_per_layer in model_params['hidden_units'][1:]:
    # the first hidden layer was already added along with the input layer above;
    # each remaining hidden layer also gets a nonlinear activation, matching h = f(W h_prev)
    model.add(Dense(units=units_per_layer, kernel_initializer='normal', activation='relu'))

  # add output layer
  model.add(Dense(model_params['output_size'], kernel_initializer='normal'))

  # Compiling the model
  model.compile(loss=train_params['loss'], optimizer=train_params['optimizer'])

  # Fitting the ANN to the Training set
  history = model.fit(X_train, y_train ,batch_size = train_params['batch_size'], epochs = train_params['epoch'], verbose=verbose)
  losses = history.history['loss']
  # calculate prediction and mean absolute relative error:
  #MAPE = np.nanmean(100 * (np.abs(y_test-model.predict(X_test))/y_test))
  MAPE = mean_absolute_percentage_error(y_test, model.predict(X_test))

  return model, MAPE, losses


In [None]:
#@title Training neural net
model, err, losses = train_and_test_model(data_itm, Target, Feature, model_parameters, training_parameters, 0.2)

epochs = range(1, len(losses) + 1)

# Plot the losses
plt.plot(epochs, losses, 'b', label='Training Loss')
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Let's see if the neural network has increased the accuracy or not!

In [None]:
r2_score(y_test, model.predict(X_test))


We do not see a significant increase in accuracy! This is largely because the relationship between the variables and vertical electron affinity is far too complex to be captured by even this deep network.

If we increased the number of layers or neurons drastically we might see some decent results, but that would lead to a non-parsimonious model with an infeasible number of parameters.

At this stage, we therefore need to look at more advanced networks which can capture the complex relationship present in this data.

# How to improve these models?

None of the models we have developed so far has given us the accuracy we would like to have. Thus the question arises: how can we get a model which will increase the accuracy of prediction significantly? Some common ways include


  - Increase data size.
  - Larger model.
  - Better features.

Let us look at a tool which can perform predictions with very good accuracy. Click on the link below.

[OCELOT ML](https://oscar.as.uky.edu/ocelotml_2d/)

# Synopsis

When approaching any problem using machine learning it is necessary to

>Perform a thorough visualization of the data to detect any influential observations.

> Split the data into two parts for training and testing.

> Define a proper loss function to ensure the chosen algorithm trains well.

Often one simple model won't provide good accuracy, hence it is important to try out several models, ranging from simple to complex, to find the one best suited for the job at hand.

Copyright 2021-2023, University of Kentucky and Iowa State University
Designed by Souradeep Chattopadhyay, Chih-Hsuan Yang and Hsin-Jung Yang