# Supervised linear regression machine learning model to predict the primary dendritic arm spacing

# Project Overview

We develop a supervised regression model that relates the primary dendritic arm spacing (PDAS) to the velocity (V) and the temperature gradient (G).


# Data Wrangling

### 1. Sourcing and loading

#### 1a. Import relevant libraries 

In [1]:
# Import relevant libraries and packages.
import numpy as np 
import pandas as pd 
import math
import matplotlib.pyplot as plt 
import seaborn as sns # For all our visualization needs.
import statsmodels.api as sm # Second library for linear regression, based on ordinary least squares (OLS).
from statsmodels.graphics.api import abline_plot # Draws a straight line from an intercept and slope, or from a fitted model.
from sklearn.metrics import mean_squared_error, r2_score # Assess model performance.
from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score # Split data into training and testing sets; cross-validation helpers.
from sklearn import linear_model, preprocessing # Linear regression models and preprocessing utilities.
import warnings # For handling error messages.
# Don't worry about the following two instructions: they just suppress warnings that could occur later. 
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

#### 1b. Load the data

In [2]:
# Load the data. 
df=pd.read_csv("PDAS_all.csv")

#### 1c. Exploring the data

We generated the data ourselves, so we know they are almost clean. However, I still perform the typical data wrangling steps used to identify any problems with the data files.

In [3]:
# Check out its appearance. 
df.head(n=5)

Unnamed: 0,V,G,Mat_HB,Mat_KF,PDAS
0,0.01,10000000.0,2.02e-14,2.57e-13,2.46857e-06
1,0.02,10000000.0,2.02e-14,2.57e-13,2.16e-06
2,0.03,10000000.0,2.02e-14,2.57e-13,1.08e-06
3,0.04,10000000.0,2.02e-14,2.57e-13,9.25714e-07
4,0.05,10000000.0,2.02e-14,2.57e-13,8.1e-07


In [4]:
# overview of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V       250 non-null    float64
 1   G       250 non-null    float64
 2   Mat_HB  250 non-null    float64
 3   Mat_KF  250 non-null    float64
 4   PDAS    250 non-null    float64
dtypes: float64(5)
memory usage: 9.9 KB


Because I generated the data myself, I expect every column to be in the correct format and every row to hold a valid value.
The summary above confirms this: all five columns have the expected data type (float64), and none of them has a missing value.

In [5]:
df.isna().any()

V         False
G         False
Mat_HB    False
Mat_KF    False
PDAS      False
dtype: bool

As expected, we do not have any NULL values

In [6]:
df.duplicated().any()

False

In [7]:
df['Mat_HB'].nunique()

7

The data set contains data for 7 alloys. For each alloy, the PDAS is calculated for different values of V and G.

In [8]:
print(df["Mat_HB"].unique())

[2.02e-14 3.37e-14 3.51e-14 5.85e-15 2.30e-15 4.33e-15 5.11e-15]
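
The statement that each alloy is sampled at several values of V and G can be checked by grouping on `Mat_HB`. The following is a minimal sketch on a hypothetical miniature frame (`toy`, with illustrative values only); on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for PDAS_all.csv; the values are illustrative only.
toy = pd.DataFrame({
    "V": [0.01, 0.02, 0.01, 0.02, 0.03],
    "G": [1e7, 1e7, 1e7, 1e7, 1e7],
    "Mat_HB": [2.02e-14, 2.02e-14, 3.37e-14, 3.37e-14, 3.37e-14],
})

# For each material-property value: number of rows and number of
# distinct V and G settings.
per_alloy = toy.groupby("Mat_HB").agg(
    n_rows=("V", "size"),
    n_V=("V", "nunique"),
    n_G=("G", "nunique"),
)
print(per_alloy)
```

A strongly unbalanced `n_rows` column would be worth knowing about before fitting, since a single alloy could then dominate the regression.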


In [9]:
# We need a new column showing the alloy name.
# We know that the Mat_HB values 2.02e-14, 3.37e-14, 3.51e-14, 5.85e-15, 2.30e-15, 4.33e-15, and 5.11e-15
# belong to Ti-3.4%Ni, Ti-7.1%Ni, Ti-10.6%Ni, Mg-9 at% Al, Al-6 at%Cu, Al-8 at%Cu, and Al-10 at%Cu, respectively.

conditions=[np.logical_and(df["Mat_HB"].gt(2.01e-14),df["Mat_HB"].lt(2.03e-14)),
            np.logical_and(df["Mat_HB"].gt(3.36e-14),df["Mat_HB"].lt(3.38e-14)),
            np.logical_and(df["Mat_HB"].gt(3.50e-14),df["Mat_HB"].lt(3.52e-14)),
            np.logical_and(df["Mat_HB"].gt(5.84e-15),df["Mat_HB"].lt(5.86e-15)),
            np.logical_and(df["Mat_HB"].gt(2.29e-15),df["Mat_HB"].lt(2.31e-15)),
            np.logical_and(df["Mat_HB"].gt(4.32e-15),df["Mat_HB"].lt(4.34e-15)),
            np.logical_and(df["Mat_HB"].gt(5.10e-15),df["Mat_HB"].lt(5.12e-15))]
outputs=["Ti-3.4 at% Ni","Ti-7.1 at% Ni","Ti-10.7 at% Ni","Mg- 9 at% Al","Al-6 at% Cu","Al-8 at% Cu","Al-10 at% Cu"]
df["Alloy"]=pd.Series(np.select(conditions,outputs,"alloy"))
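
The interval tests above can also be written as a tolerance comparison. The sketch below is one possible alternative, not the method used in this notebook: `label_alloys`, `alloy_by_hb`, and `toy` are hypothetical names, and only three of the seven alloys are listed to keep it short. Note that `atol` must be set to zero, because the default absolute tolerance of `np.isclose` (1e-8) is far larger than the `Mat_HB` values themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: three of the Mat_HB values and labels used above.
alloy_by_hb = {
    2.02e-14: "Ti-3.4 at% Ni",
    3.37e-14: "Ti-7.1 at% Ni",
    2.30e-15: "Al-6 at% Cu",
}

def label_alloys(mat_hb, lookup, rtol=1e-3, default="alloy"):
    """Return an alloy label for each value, matched within a relative tolerance."""
    values = np.asarray(mat_hb, dtype=float)
    # atol=0.0 so that only the relative tolerance decides a match.
    conditions = [np.isclose(values, ref, rtol=rtol, atol=0.0) for ref in lookup]
    return np.select(conditions, list(lookup.values()), default=default)

# Illustrative values only; the last one matches no alloy and gets the default.
toy = pd.DataFrame({"Mat_HB": [2.02e-14, 2.30e-15, 3.37e-14, 1.0e-13]})
toy["Alloy"] = label_alloys(toy["Mat_HB"], alloy_by_hb)
print(toy)
```

Keeping the values and labels in one dictionary makes it harder for the two lists to drift out of step.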


In [10]:
# Our equation has the form PDAS = A (V)^alpha (G)^beta (matprop)^gamma. To turn this into
# a linear regression problem, we first fix PDAS as the dependent variable, then take the
# natural log (ln) of each column, store the results in new columns,
# and move forward with those.
df.head()

Unnamed: 0,V,G,Mat_HB,Mat_KF,PDAS,Alloy
0,0.01,10000000.0,2.02e-14,2.57e-13,2.46857e-06,Ti-3.4 at% Ni
1,0.02,10000000.0,2.02e-14,2.57e-13,2.16e-06,Ti-3.4 at% Ni
2,0.03,10000000.0,2.02e-14,2.57e-13,1.08e-06,Ti-3.4 at% Ni
3,0.04,10000000.0,2.02e-14,2.57e-13,9.25714e-07,Ti-3.4 at% Ni
4,0.05,10000000.0,2.02e-14,2.57e-13,8.1e-07,Ti-3.4 at% Ni
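
Taking the natural log of the power law in the cell above gives a relation that is linear in the unknowns: ln(PDAS) = ln(A) + alpha ln(V) + beta ln(G) + gamma ln(matprop). The sketch below illustrates the idea on synthetic, noise-free data. The coefficients, the `toy` frame, the `ln_` column prefix, and the use of `Mat_HB` as the material property are all assumptions made for illustration; they are not results or choices from this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up coefficients used only to generate synthetic data.
A_true, alpha_true, beta_true, gamma_true = 2.0, -0.3, -0.6, 0.2

toy = pd.DataFrame({
    "V": rng.uniform(0.01, 0.5, 50),
    "G": rng.uniform(1e6, 1e7, 50),
    "Mat_HB": rng.choice([2.02e-14, 3.37e-14, 5.85e-15], 50),
})
toy["PDAS"] = (A_true * toy["V"] ** alpha_true
               * toy["G"] ** beta_true
               * toy["Mat_HB"] ** gamma_true)

# Add one ln-transformed column per variable.
for col in ["V", "G", "Mat_HB", "PDAS"]:
    toy["ln_" + col] = np.log(toy[col])

# Ordinary least squares on the log-transformed columns:
# ln(PDAS) = ln(A) + alpha*ln(V) + beta*ln(G) + gamma*ln(Mat_HB)
X = np.column_stack([np.ones(len(toy)),
                     toy[["ln_V", "ln_G", "ln_Mat_HB"]].to_numpy()])
y = toy["ln_PDAS"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ln_A, alpha, beta, gamma = coef
print(np.exp(ln_A), alpha, beta, gamma)
```

Because the synthetic data contain no noise, the fit should recover the generating coefficients up to floating-point error; real data will not behave this cleanly.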


In [11]:
# Get a basic statistical summary of the dependent variable 
#PDAS is our fixed dependent variable
df["PDAS"].describe()

count    2.500000e+02
mean     2.573232e-06
std      6.532639e-06
min      1.322670e-07
25%      4.926000e-07
50%      8.134800e-07
75%      1.470000e-06
max      6.294750e-05
Name: PDAS, dtype: float64