<a href="https://colab.research.google.com/github/D3TaLES/In-The-Mix/blob/main/data_science/InTheMix2_DataScineceDay1_MASTER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Setup**

> Please run the next three cells immediately. These cells install the packages we will need for this tutorial.

In [None]:
#@title Import Packages
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import *
from numpy.linalg import inv
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.cluster import KMeans
import ipywidgets as widgets
import IPython
import json
from IPython.display import display

global speaker, talking_speed, text, button_clear, output_clear, button, output


#**Why Machine Learning?**

*   Predictive Modeling: Machine learning algorithms can build predictive models based on large datasets of chemical structures, properties, and reactions. These models can make accurate predictions about various chemical phenomena, such as the stability of molecules, reaction outcomes, toxicity, and material properties.

*   Materials Design and Discovery: Machine learning enables the exploration and design of new materials with desired properties. By learning from large databases of materials and their properties, machine learning models can identify patterns and relationships between composition, structure, and properties. This knowledge can be utilized to predict and discover novel materials for applications such as energy storage, catalysis, electronics, and more.



#**A primer on linear algebra**
> Suppose we have the following data for 5 students

Student No. | Age | Height | Weight
----------- | --- | ------ | ------
1           | 20  | 65.78  | 112.99
2           | 22  | 71.52  | 136.49
3           | 21  | 69.40  | 153.03
4           | 21  | 68.22  | 142.34
5           | 23  | 67.79  | 144.30

##Matrix
>A matrix is an array of numbers, symbols, or expressions arranged in rows and columns. When dealing with data, matrices give us the tools to store, represent, modify, and transform our data and to perform analysis on it. Matrices therefore sit at the core of most machine learning algorithms. We will briefly use the NumPy package in Python to see some examples of matrices.

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/mat.png)




In [None]:
# Represent the student data above as a matrix (a 2-D NumPy array)

d = np.array([[20, 65.78, 112.99], [22, 71.52, 136.49], [21, 69.40, 153.03], [21, 68.22, 142.34], [23, 67.79, 144.30]])
print(d)

The matrix above (denoted by $M$) has 5 rows and 3 columns, so its dimension is written as $M^{5 \times 3}$.

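
As a quick check, NumPy reports the dimension of an array through its `shape` attribute. The sketch below rebuilds the same student matrix and, purely as an illustration, pulls out one row and one column; the names `n_rows`, `n_cols`, `student_3` and `heights` are made up for this example.

```python
import numpy as np

# Same student data as above: the columns are age, height, weight
d = np.array([[20, 65.78, 112.99],
              [22, 71.52, 136.49],
              [21, 69.40, 153.03],
              [21, 68.22, 142.34],
              [23, 67.79, 144.30]])

n_rows, n_cols = d.shape      # (number of rows, number of columns)
print("Rows:", n_rows, "Columns:", n_cols)

student_3 = d[2, :]           # third row (indexing starts at 0)
heights = d[:, 1]             # second column: every student's height
print("Student 3:", student_3)
print("Heights:", heights)
```

Each row is one student and each column is one measured quantity, which is the layout most machine learning libraries expect.
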
###Matrix operations







In [None]:
# Defining two matrices using numpy
m1 = np.array([[2, -7, 5], [-6, 2, 0], [3, 1, 2]])
m2 = np.array([[5, 8, -5], [3, 6, 9], [0, -5, 8]])

print("1st matrix : \n", m1)
print("2nd matrix : \n", m2)

**Matrix Addition**
>Suppose we have two matrices $M_1^{a \times b}$ and $M_2^{a \times b}$. Their sum is obtained by adding the elements in matching positions, so both matrices must have the same dimension.

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/matsum.png)

In [None]:
# Matrix addition
m_add = np.add(m1, m2)
print("Sum : \n", m_add)

**Multiplication by scalar**
> Suppose we have a matrix $M^{a \times b}$. If we multiply this matrix by a scalar number $c$, then every element of the matrix is multiplied by $c$ individually.

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/matscalmul.png)

In [None]:
# Multiplication by scalar
c = 5
m_sc = c*m1
print("Matrix m1 multiplied by scalar: \n", m_sc)

**Multiplication of two matrices**
> Suppose we have two matrices $M_1^{a \times b}$ and $M_2^{b \times c}$. In their product, the entry in row $i$ and column $j$ is obtained by multiplying row $i$ of $M_1$ element by element with column $j$ of $M_2$ and adding up the results. Every row of $M_1$ is paired with every column of $M_2$ in this way, so the resulting matrix has dimension $a \times c$.

![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/matmul.png)

In [None]:
#@title Run cell to see matrices multiplied...
from IPython.display import Image
Image(url='https://www.mscroggs.co.uk/img/full/multiply_matrices.gif')

In [None]:
# Multiplication of two matrices
m_mult = np.matmul(m1, m2)
print("Product of m1 and m2 is: \n" ,m_mult)
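
To make the row-by-column rule concrete, here is a minimal sketch that builds the product with explicit loops and compares it with `np.matmul`. The variable name `product` and the loop layout are illustrative only; in practice `np.matmul` is the better choice.

```python
import numpy as np

m1 = np.array([[2, -7, 5], [-6, 2, 0], [3, 1, 2]])
m2 = np.array([[5, 8, -5], [3, 6, 9], [0, -5, 8]])

a, b = m1.shape          # m1 is a x b
b2, c = m2.shape         # m2 must be b x c
assert b == b2, "inner dimensions must match"

product = np.zeros((a, c), dtype=m1.dtype)
for i in range(a):              # each row of m1 ...
    for j in range(c):          # ... is paired with each column of m2
        # multiply element by element, then add up the results
        product[i, j] = np.sum(m1[i, :] * m2[:, j])

print("Row-by-column product: \n", product)
print("Matches np.matmul:", np.array_equal(product, np.matmul(m1, m2)))
```
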

**Inverse of a matrix**
> Suppose we have a square matrix $M^{a \times a}$. The inverse of $M$ is a matrix $N$ such that multiplying $M$ and $N$ gives the identity matrix (a matrix whose diagonal elements are 1 and whose other elements are 0). Not every square matrix has an inverse. We will explore this in detail in the next cells.

In [None]:
# Inverse of a matrix
m_inv = inv(m1)
print("The inverse of m1 is: \n", m_inv)
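
One way to check the definition is to multiply `m1` by its inverse and compare the result with the identity matrix. The sketch below does that; the name `identity_check` is illustrative, and the comparison uses a tolerance because floating-point arithmetic leaves tiny rounding errors.

```python
import numpy as np
from numpy.linalg import inv, det

m1 = np.array([[2, -7, 5], [-6, 2, 0], [3, 1, 2]])

# A square matrix has an inverse only if its determinant is non-zero
print("Determinant of m1:", det(m1))

m_inv = inv(m1)
identity_check = np.matmul(m1, m_inv)

# Compare with the 3 x 3 identity matrix, allowing for rounding error
print("m1 times its inverse: \n", np.round(identity_check, 10))
print("Is it the identity?", np.allclose(identity_check, np.eye(3)))
```
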

#**Towards machine learning!**


*“What we want is a machine that can learn from experience..” — Alan Turing*

> * One of the biggest resources available to humans today is data. Recently we have learned to tap into this huge resource, which has resulted in some amazing feats never seen before.
> * The concept of machine learning goes back to the mid-1950s, when researchers were looking to create intelligent machines but relied on explicitly programmed knowledge and rules, which made it difficult to scale. This is when they began exploring the idea of teaching machines to learn from data without being explicitly programmed.
> * In the years that followed, many new statistical and mathematical methods such as regression, classification, decision theory, and random forests were developed. These methods helped extract the useful information from data that machines need in order to learn.
> * In the last two to three decades, as computers became more powerful and bigger and better computing resources became available, these statistical algorithms became far more capable. This enabled new concepts and algorithms that made machines smarter and better at learning from data.
> * Some recent marvels of artificial intelligence include [ChatGPT](https://openai.com/blog/chatgpt), a chatbot that can interact with its user much like a human does. It picks up on the nuances of human conversation and responds accordingly.
> * Broadly speaking, machine learning provides us with tools to understand and analyze data and to derive meaningful representations from it.






> Machine learning can be broadly classified into three types depending on the task at hand and the data being used. They are:
1.   Supervised Learning
2.   Unsupervised Learning
3.   Reinforcement Learning

We'll cover the first two today. Let's learn about them through an example.
For reference we will use the iris dataset included in the scikit-learn package in Python.


![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/ml_taxanomy.png)





Hit run on the cell below to load the data and get a peek at it.

In [None]:
#@title
# Load the iris data and collect the measurements and species codes in one table
iris = datasets.load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
# Replace the numeric species codes (0, 1, 2) with readable names
data['target'] = data['target'].replace(0, 'Iris-setosa')
data['target'] = data['target'].replace(1, 'Iris-versicolor')
data['target'] = data['target'].replace(2, 'Iris-virginica')
data.rename(columns={'target': 'Species'}, inplace=True)
# The features (X) are the four measurements; the label (y) is the species
X = data.drop(['Species'], axis=1)
y = data['Species']
# Hold out 10% of the flowers (15 of 150) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=5)
data.head()

The above dataset contains values of four features (length and width of sepals and petals) of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
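
To confirm this description, the short sketch below reloads the dataset and counts the flowers, features, and species. It builds its own small tables (`features` and `species`, names chosen only for this example) and uses scikit-learn's short species names rather than the `Iris-...` labels shown above.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map each numeric species code to its name
species = pd.Series(iris.target_names[iris.target], name="Species")

print("Number of flowers and features:", features.shape)
print(species.value_counts())
```
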

Let us take a visual look at this dataset. Hit run on the next module.

In [None]:
# Plot data
sns.pairplot(data, hue = 'Species', diag_kind='hist', markers=["o", "s", "D"], height=3)

Now that we have our sample data let us use this to understand ***supervised*** and ***unsupervised*** learning.


##**Supervised Learning**
![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/supervised1.png)

Photo source:
[Iris douglasiana, ](https://gardenerspath.com/wp-content/uploads/2023/03/Purple-Iris-Flower-Growing-in-the-Garden.jpg)
[Iris cristata, ](https://en.wikipedia.org/wiki/Iris_cristata#/media/File:Iris_cristata_(2).jpg)
[Iris tectorum](https://en.wikipedia.org/wiki/Iris_(plant)#/media/File:Iris_tectorum_-_flower_view_01.jpg)


This form of machine learning involves those problems where the task at hand requires a fully labeled dataset. The models in this case use the labels as ground truth to learn and update themselves.

Suppose our task is to use the Iris dataset to build a model that takes as input the petal length, petal width, sepal length and sepal width of a flower and predicts which species that flower belongs to.

The data used to build this model (the Iris data) contains the ground truth, namely the column 'Species'. The model uses this ground truth to train itself by measuring how much error its predictions produce compared with the ground truth. The ground truth provides the model with the necessary 'supervision'.

One such supervised learning method is known as [logistic regression](https://machinelearningmastery.com/logistic-regression-tutorial-for-machine-learning/). Let us use this method to create our model.

Hit run on the next module to create our model using the Iris data.



In [None]:
#@title
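# max_iter is raised from the default to give the solver more room to converge on this data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)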

Now suppose we have 15 new flowers whose sepal length, sepal width, petal length and petal width are given.

Hit run on the next module to take a look at the data

In [None]:
#@title
X_test

Now let us use our model to predict the species of these 15 flowers!

Hit run in the next module. This will perform the prediction for these 15 flowers and display the results.

In [None]:
#@title
y_pred = logreg.predict(X_test)
pd.DataFrame({'Actual Species':y_test, 'Predicted Species': y_pred}, columns=['Actual Species', 'Predicted Species'])
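
Reading the table row by row works for 15 flowers, but a single number is easier to compare. The sketch below is one way to summarize the predictions with `metrics.accuracy_score`. It rebuilds the data and model from scratch so it can run on its own, uses scikit-learn's short species names instead of the `Iris-...` labels, and sets `max_iter=1000` as an assumption to help the solver converge.

```python
import pandas as pd
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target_names[iris.target], name="Species")

# Same split settings as above: 10% of the flowers are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=5)

# Higher iteration cap than the default (an assumption made for this sketch)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Fraction of held-out flowers whose predicted species matches the true one
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Fraction of test flowers classified correctly:", accuracy)
```

Because the 15 test flowers were never shown to the model during training, this fraction is an estimate of how well the model would do on new flowers.
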

**Let us now see a cool real life example of supervised learning!**

In farming, one of the biggest challenges is dealing with pests. Farmers often have to rely heavily on pesticides to reduce pest damage. Different kinds of pests require different kinds of treatment, so it is very important to know what kind of pest infestation is present.

A mechanism that gives farmers this information readily would therefore benefit the farming community profoundly.

[This application](https://insectapp.las.iastate.edu) can identify insects based on a simple photo. We will try the app with this [photo](https://raw.githubusercontent.com/hjy77/inthemix/main/bug.png).

The model behind this application is a supervised one and is built on a large database of insects.

This tool was developed by Dr Aarti Singh's group and Dr Baskar Ganapathysubramanian's group at Iowa State University.

##**Unsupervised Learning**
![](https://raw.githubusercontent.com/D3TaLES/In-The-Mix/main/data_science/media/unsupervised.png)
[Photo source](https://blog.floydhub.com/introduction-to-k-means-clustering-in-python-with-scikit-learn/)

In unsupervised learning the data does not have predefined labels, so it lacks information (a ground truth element) about the specific desired outcome. In this case unsupervised ML methods attempt to find structure in the data by extracting important features and then analyzing them.

Putting this in terms of the iris data, the scenario is:
  >We have the sepal length, sepal width, petal length and petal width of 150 flowers. Suppose our task is to determine how many species of flowers are present in our data. Then this task boils down to finding the number of groups (or clusters) in the data.

Let us consider the iris dataset again but this time let's assume that we do not have the information on species.

>Note that in the current scenario we do not have labels telling us which species each flower belongs to. This means there is no supervisory feature that can help us determine the different groups present in the dataset. We have to figure out the number of groups (species) without the supervisory variable. This is why the problem is called an unsupervised learning problem.

In [None]:
#@title
# Plot the measurements without coloring by species, as if the labels were unknown
sns.pairplot(data, diag_kind='hist')

The visual analysis here is important as it gives us a plausible idea regarding the number of groups.
> But how can we be sure that our visual analysis is giving us the right number of groups? Also how can we find out these groups?

This is essentially a problem of 'clustering', an unsupervised method that finds the optimum groups in a given dataset.

A popular clustering algorithm is the k-means algorithm. The basic idea of the algorithm is:
> The 'k' in k-means refers to the number of groups you want to partition your data into. Thus if we plug in k=2, we have a 2-means algorithm which will find out 2 groups in your data. Here k can be chosen by the user.
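
For intuition about what happens inside, the standard k-means iteration (often called Lloyd's algorithm) repeats two steps: assign every point to its nearest cluster center, then move each center to the mean of the points assigned to it. The sketch below is a minimal NumPy illustration of that loop, not scikit-learn's implementation; the function `simple_kmeans` and the toy data are made up for this example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned points
        # (a center with no points keeps its old position)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers have stopped moving
        centers = new_centers
    return labels, centers

# Toy data (made up for illustration): two well-separated blobs of 20 points
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                 rng.normal(5, 0.5, size=(20, 2))])
labels, centers = simple_kmeans(toy, k=2)
print("Cluster sizes:", np.bincount(labels))
print("Centers: \n", centers)
```

The result can depend on the random starting centers, which is why library implementations usually restart several times and keep the best run.
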

Let us apply the k-means algorithm to our data for k=2, 3 and 4 and see what kind of groups we get.



In [None]:
#@title Fitting the k-means model
kmeans2 = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
kmeans3 = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
kmeans4 = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans2 = kmeans2.fit_predict(X)
y_kmeans3 = kmeans3.fit_predict(X)
y_kmeans4 = kmeans4.fit_predict(X)

X['Determine Species 2'] = y_kmeans2
X['Determine Species 3'] = y_kmeans3
X['Determine Species 4'] = y_kmeans4

#X1['Determine Species'] = X1['Determine Species'].replace(0, 'Iris-setosa')
#X1['Determine Species'] = X['Determine Species'].replace(1, 'Iris-versicolor')
X

In [None]:
#@title Clustering results for k=2
sns.pairplot(X.iloc[:,[0,1,2,3,4]], hue = 'Determine Species 2', markers=["o", "s"], diag_kind='hist')

In [None]:
#@title Clustering results for k=3
sns.pairplot(X.iloc[:,[0,1,2,3,5]], hue = 'Determine Species 3', markers=["o", "s", "D"], diag_kind='hist')

In [None]:
#@title Clustering results for k=4
sns.pairplot(X.iloc[:,[0,1,2,3,6]], hue = 'Determine Species 4', markers=["o", "s", "D", "P"], diag_kind='hist')

Judged by eye alone, all three clustering solutions can look plausible.
> But how do we know which solution is actually correct?

This can be assessed using techniques such as the jump statistic or the silhouette width. We won't go into detail about how these methods work, but for this analysis the jump statistic indicated that the optimum number of groups is 3.
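
The silhouette width mentioned above can be computed with scikit-learn's `silhouette_score`. The sketch below is an illustration only: it reloads the iris measurements, refits k-means for k = 2, 3 and 4 with the same settings as above, and prints the average silhouette width for each k. Different criteria do not always agree, so the k with the highest silhouette width need not match the jump statistic's answer.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data   # four measurements per flower, no species labels

scores = {}
for k in [2, 3, 4]:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(X)
    # Average silhouette width: between -1 and 1, higher means tighter,
    # better separated clusters
    scores[k] = silhouette_score(X, cluster_labels)
    print("k =", k, "average silhouette width =", round(scores[k], 3))
```
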

# Real life application of k-means

Suppose we have been given 100,000 documents and our task is to find out which documents are similar enough to form a group. Reading every document is impractical at this scale.

Here unsupervised machine learning can determine which documents belong together and sort them into groups automatically.

This [link](https://www.kaggle.com/code/aybukehamideak/clustering-text-documents-using-k-means) contains details of how document clustering can be performed using k-means.
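
As a small-scale illustration of the same idea, the sketch below clusters four made-up sentences. It assumes one common approach, which may differ from the linked notebook: turn each document into a vector of TF-IDF word weights with `TfidfVectorizer`, then run k-means on those vectors. The mini corpus and variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: two documents about batteries, two about flowers
documents = [
    "redox flow batteries store energy in liquid electrolytes",
    "battery energy storage relies on stable electrolytes",
    "iris flowers have colorful petals and sepals",
    "the petals of the iris flower are purple",
]

# Turn each document into a vector of word weights (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)

# Group the document vectors into k=2 clusters
kmeans_docs = KMeans(n_clusters=2, n_init=10, random_state=0)
doc_labels = kmeans_docs.fit_predict(doc_vectors)

for text, label in zip(documents, doc_labels):
    print(label, "|", text)
```

Documents that share words end up with similar vectors, so k-means can group them without anyone reading or labeling the text.
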

# Synopsis

1. Machine Learning provides us with efficient ways to find hidden information within data to understand real life scenarios.

2. Redox flow batteries have very important applications in energy storage and are an area of active research. Data collected from these batteries through sensors and other means can be analyzed efficiently using machine learning methods, which can help in the design of efficient energy storage solutions.

3. Two of the fundamental types of ML are (1) supervised learning and (2) unsupervised learning.
> Supervised learning makes use of ground truth to create models that can help us make predictions for future scenarios.
>
> Unsupervised learning tries to infer meaningful representations from the data without the use of ground truth.


Copyright 2021-2023, University of Kentucky and Iowa State University

Designed by Souradeep Chattopadhyay, Chih-Hsuan Yang, Hsin-Jung Yang, and Rebekah Duke