<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook)

In [1]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import BayesianRidge

## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
# Display a summary of the table information (number of datapoints, etc.)

In [3]:

slime_data = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")
slime_data.head(15)


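| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |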


In [4]:
slime_data.dtypes

Temperature °C      int64
Mols KCL            int64
Size nm^3         float64
dtype: object

In [5]:
slime_data.info  # no parentheses, so this shows the bound method's repr; slime_data.info() prints the summary

<bound method DataFrame.info of      Temperature °C  Mols KCL     Size nm^3
0               469       647  6.244743e+05
1               403       694  5.779610e+05
2               302       975  6.196847e+05
3               779       916  1.460449e+06
4               901        18  4.325726e+04
..              ...       ...           ...
995             894       847  1.545661e+06
996             327       982  6.737041e+05
997             791       213  3.477543e+05
998             769       553  8.684794e+05
999             919       452  8.476413e+05

[1000 rows x 3 columns]>
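
The output above is the repr of the bound method, because `info` was referenced without being called. A minimal sketch of calling it, assuming a small stand-in DataFrame (`sample`, with three rows copied from the table in this part) and a `buf` argument used only so the summary can be captured as a string:

```python
import io

import pandas as pd

# Small stand-in frame; the rows are copied from the head() output above.
sample = pd.DataFrame({
    "Temperature °C": [469, 403, 302],
    "Mols KCL": [647, 694, 975],
    "Size nm^3": [624474.3, 577961.0, 619684.7],
})

# info() prints its summary; passing a buffer captures the text instead.
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

Calling `slime_data.info()` directly in a cell prints the same kind of summary (index range, column names, non-null counts, dtypes) for the full dataset.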

In [6]:
slime_data.describe()

| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 471.53 | 508611.1 |
| std | 288.819436 | 288.482872 | 447483.8 |
| min | 1.0 | 1.0 | 16.11429 |
| 25% | 250.75 | 226.75 | 129826.7 |
| 50% | 500.5 | 459.5 | 382718.2 |
| 75% | 750.25 | 710.25 | 760321.1 |
| max | 1000.0 | 1000.0 | 1972127.0 |


## Part 2. Splitting the dataset

In [7]:
# Take the pandas dataset and split it into our features (X) and label (y)
slime = slime_data
x = slime.iloc[:,:2]
y = slime.iloc[:,-1]

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.10)
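
The split above is reshuffled on every run, so the scores in later cells change slightly each time. A possible way to make it repeatable is to pass `random_state`. The sketch below uses synthetic stand-in arrays (`features`, `labels`) rather than the slime data, and the seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1000-row dataset (two features, one label).
rng = np.random.default_rng(0)
features = rng.integers(1, 1001, size=(1000, 2))
labels = rng.random(1000)

# random_state fixes the shuffle, so the same rows land in the test set each run.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.10, random_state=42
)
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    features, labels, test_size=0.10, random_state=42
)

print(x_tr.shape, x_te.shape)       # 90% / 10% of the rows
print(np.array_equal(x_te, x_te2))  # identical test rows on both calls
```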


## Part 3. Perform a Linear Regression

In [8]:
# Use sklearn to train a model on the training set
Model = LinearRegression()
Model.fit(x_train,y_train)
# Create a sample datapoint and predict the output of that sample with the trained model
prediction = Model.predict(x_test)
print(pd.DataFrame({"Sample" : y_test,"Predicted": prediction}))





           Sample     Predicted
855  2.691125e+05  2.932031e+05
218  1.892368e+05  5.167462e+05
373  2.870685e+05  4.262652e+05
445  1.863741e+06  1.412682e+06
142  2.606631e+04 -2.080550e+05
..            ...           ...
342  1.270029e+03 -3.657091e+05
990  9.089383e+04  3.668614e+05
959  4.322046e+04 -8.375568e+03
544  8.594349e+05  9.319641e+05
666  2.211291e+05  5.473410e+05

[100 rows x 2 columns]
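
The cell above predicts on the whole test set. To predict a single hand-made datapoint instead, one option is to wrap it in a one-row DataFrame with the same column names as the training features. This is a sketch under assumptions: `demo_model` is fitted on only five rows copied from Part 1, and the sample values (500 °C, 500 mol KCl) are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny training frame; rows copied from the head() output in Part 1.
train = pd.DataFrame({
    "Temperature °C": [469, 403, 302, 779, 901],
    "Mols KCL": [647, 694, 975, 916, 18],
    "Size nm^3": [624474.3, 577961.0, 619684.7, 1460449.0, 43257.26],
})
feature_cols = ["Temperature °C", "Mols KCL"]
demo_model = LinearRegression().fit(train[feature_cols], train["Size nm^3"])

# One hypothetical datapoint with matching column names and order.
sample_point = pd.DataFrame({"Temperature °C": [500], "Mols KCL": [500]})
sample_prediction = demo_model.predict(sample_point)
print(sample_prediction)
```

With the notebook's own model, the equivalent call would be `Model.predict(sample_point)`.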


In [9]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
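Score = Model.score(x,y)  # R^2 on the full dataset (train + test); Model.score(x_test, y_test) would score held-out rows only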
print("Score = ", Score)

Score =  0.860677107032014


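The score is the coefficient of determination, $R^2$, which measures goodness of fit for a regression model: the fraction of the variance in the label that the model explains from the features. The best possible value is 1, a model that always predicts the mean of the label scores 0, and a model that does worse than that scores below 0.

Our value of 0.86 means the linear model explains about 86% of the variance in slime size, so temperature and mols of KCl are strongly, but not perfectly, linearly related to size. Note that this score was computed on the full dataset (`x`, `y`), which includes the rows used for training.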
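
To make the definition of $R^2$ concrete, here is a small sketch that computes it by hand as $1 - SS_{res}/SS_{tot}$. The helper `r_squared` and the toy values are illustrative only and are not part of the notebook:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # exact predictions
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than predicting the mean
print(perfect, mean_only, worse)
```

The three cases give 1, 0, and a negative value respectively, which is why $R^2$ is not bounded below by 0.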

In [10]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
Coeff = Model.coef_
Intercept = Model.intercept_
print("Coefficient = ", Coeff)
print("Intercept = ", Intercept)



Coefficient =  [ 875.95625376 1039.82294292]
Intercept =  -419729.2682966482


Here the equation is:

$h(x) = 875.9563\,\text{temp} + 1039.8229\,\text{mols} - 419729.2683$

where temp is the temperature in °C and mols is the amount of KCl.
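
As a quick check of the equation, the sketch below evaluates $h(x)$ by hand using the coefficients and intercept printed above, applied to the first row of the dataset (469 °C, 647 mol KCl, size 624474.3 nm^3). The function name `h` simply mirrors the notation in the text:

```python
import numpy as np

# Coefficients and intercept as printed by the cell above.
coef = np.array([875.95625376, 1039.82294292])
intercept = -419729.2682966482

def h(temp_c, mols_kcl):
    """Linear hypothesis: dot product of coefficients and features, plus intercept."""
    return float(coef @ np.array([temp_c, mols_kcl]) + intercept)

# First row of the dataset from Part 1.
predicted = h(469, 647)
residual = 624474.3 - predicted  # actual size minus predicted size
print(predicted, residual)
```

The prediction is off by tens of thousands of nm^3 for this row, which is consistent with an $R^2$ below 1.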


    

## Part 4. Use Cross Validation

In [11]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
# Report on their finding and their significance
cross_val = cross_val_score(Model,x_train,y_train,cv = 10)
cross_val


array([0.8626782 , 0.8424985 , 0.86331229, 0.85732202, 0.88357524,
       0.87752653, 0.84683639, 0.86505225, 0.86317199, 0.84022695])

In [12]:
print("Mean : ", cross_val.mean())
print("Standard Deviation : ", cross_val.std())

Mean :  0.8602200351150653
Standard Deviation :  0.013393825665117712


cross_val_score returns an array of $R^2$ values, one per fold. With cv = 10, the training set is divided into 10 folds; in each round the model is trained on 9 folds (90% of the training data) and scored on the remaining fold (10%), so every fold serves as the validation set exactly once. The mean score is 0.86 and the standard deviation is 0.013, which confirms that the model performs similarly regardless of which portion of the data is used for training.
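
One detail worth knowing: when `cv` is an integer and the estimator is a regressor, `cross_val_score` uses `KFold` without shuffling. If shuffled folds are wanted, a `KFold` object can be passed explicitly. The sketch below assumes synthetic data (`features`, `labels`) with a made-up noisy linear rule; it is not the slime dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
features = rng.uniform(1, 1000, size=(900, 2))
# Hypothetical noisy linear target, only to give the model something to fit.
labels = 3.0 * features[:, 0] + 2.0 * features[:, 1] + rng.normal(0, 50, size=900)

# Shuffle the rows before cutting them into 10 folds; the seed keeps it repeatable.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), features, labels, cv=folds)

print(scores.mean(), scores.std())
```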

## Part 5. Using Polynomial Regression

In [13]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2

Model_poly = PolynomialFeatures(2)
# Fit the transformer on the training features, then reuse it unchanged on the test features
x_train = Model_poly.fit_transform(x_train)
x_test = Model_poly.transform(x_test)


In [14]:
Model_Bay = BayesianRidge()
Model_Bay.fit(x_train,y_train)
Bay_prediction = Model_Bay.predict(x_test)
print(pd.DataFrame({"Sample" : y_test,"Predicted": Bay_prediction}))



           Sample     Predicted
855  2.691125e+05  2.691125e+05
218  1.892368e+05  1.892368e+05
373  2.870685e+05  2.870685e+05
445  1.863741e+06  1.863741e+06
142  2.606631e+04  2.606631e+04
..            ...           ...
342  1.270029e+03  1.270029e+03
990  9.089383e+04  9.089383e+04
959  4.322046e+04  4.322046e+04
544  8.594349e+05  8.594349e+05
666  2.211291e+05  2.211291e+05

[100 rows x 2 columns]


In [15]:
# Report on the metrics and output the resultant equation as you did in Part 3.
Bay_Score = Model_Bay.score(x_test, y_test)
print("Bayesian Score = ", Bay_Score)

Bayesian Score =  1.0


In [16]:
Bay_Coeff = Model_Bay.coef_
Bay_Intercept = Model_Bay.intercept_
print("Coefficient = ", Bay_Coeff)
print("Intercept = ", Bay_Intercept)


Coefficient =  [ 0.00000000e+00  1.20000000e+01 -7.64515189e-08 -2.57542876e-11
  2.00000000e+00  2.85714287e-02]
Intercept =  6.6371867433190346e-06


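Here the equation is:

$h(x) = 12.0000\,a + 2.0000\,ab + 0.0286\,b^2$

where $a$ is the temperature and $b$ is the mols of KCl. The intercept and the coefficients of $b$ and $a^2$ are zero to four decimal places, so those terms are omitted.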

    

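Bayesian ridge regression was fitted on the degree-2 polynomial features. The score of 1.0 on the test set means the model explains essentially all of the variance in slime size, so it predicts the size accurately from the temperature and the mols of KCl. This indicates that the underlying relationship is quadratic in the two features, which the purely linear model in Part 3 could only approximate.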
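
The fitted coefficients are very close to round numbers, which suggests a simple closed form. As a sanity check, the sketch below assumes the rule size = 12·temp + 2·temp·mols + mols²/35 (the value 1/35 is an inference from the coefficient 2.85714287e-02, not something stated in the dataset) and compares it with the ten rows shown in Part 1:

```python
import numpy as np

# First ten rows, copied from the head() output in Part 1.
temp = np.array([469, 403, 302, 779, 901, 545, 660, 143, 89, 294], dtype=float)
mols = np.array([647, 694, 975, 916, 18, 637, 519, 869, 461, 776], dtype=float)
size = np.array([624474.3, 577961.0, 619684.7, 1460449.0, 43257.26,
                 712463.4, 700696.0, 271826.0, 89198.03, 477021.0])

# Assumed closed form: the fitted coefficients rounded to 12, 2 and 1/35.
closed_form = 12 * temp + 2 * temp * mols + mols**2 / 35

# Largest gap between the assumed formula and the displayed (rounded) sizes.
max_abs_error = np.max(np.abs(closed_form - size))
print(max_abs_error)
```

The remaining differences are at the level of the rounding in the displayed table, which is consistent with the perfect test score.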