<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [1]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit

## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
slime = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")
# Output the first 15 rows of the data
slime.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [3]:
# Display a summary of the table information (number of datapoints, etc.)
slime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [4]:
# Take the pandas dataset and split it into our features (X) and label (y)
features=slime[["Temperature °C", "Mols KCL"]]
features

Unnamed: 0,Temperature °C,Mols KCL
0,469,647
1,403,694
2,302,975
3,779,916
4,901,18
...,...,...
995,894,847
996,327,982
997,791,213
998,769,553


In [5]:
labels = slime["Size nm^3"]
labels

0      6.244743e+05
1      5.779610e+05
2      6.196847e+05
3      1.460449e+06
4      4.325726e+04
           ...     
995    1.545661e+06
996    6.737041e+05
997    3.477543e+05
998    8.684794e+05
999    8.476413e+05
Name: Size nm^3, Length: 1000, dtype: float64

In [6]:
# Use sklearn to split the features and labels into a training/test set.
# No test_size is passed, so the default applies: 75% train, 25% test (use test_size=0.1 for 90% / 10%).
x_train, x_test, y_train, y_test = train_test_split(features, labels, random_state=3)
x_train, y_train

(     Temperature °C  Mols KCL
 942             165       150
 962             299       287
 237             540       172
 914             419       873
 301             306       169
 ..              ...       ...
 952             666       312
 643             192        29
 249             924       140
 664             845       312
 874             526        25
 
 [750 rows x 2 columns],
 942     52122.85714
 962    177567.40000
 237    193085.25710
 914    758377.11430
 301    107916.02860
            ...     
 952    426357.25710
 643     13464.02857
 249    270368.00000
 664    540201.25710
 874     32629.85714
 Name: Size nm^3, Length: 750, dtype: float64)

In [7]:
x_test,y_test

(     Temperature °C  Mols KCL
 642             682       733
 762             116       482
 909             187       575
 199             111       774
 586             275       307
 ..              ...       ...
 146             610       678
 897             923       871
 705             935        28
 458             336       367
 349             688       522
 
 [250 rows x 2 columns],
 642    1.023347e+06
 762    1.198538e+05
 909    2.267404e+05
 199    1.902765e+05
 586    1.748428e+05
            ...     
 146    8.476138e+05
 897    1.640617e+06
 705    6.360240e+04
 458    2.545043e+05
 349    7.343133e+05
 Name: Size nm^3, Length: 250, dtype: float64)
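
The outputs above show 750 training rows and 250 test rows, which is what `train_test_split` produces when no `test_size` is given. The sketch below, run on hypothetical stand-in arrays with the same row count as the dataset (not the real data), compares the default split with an explicit 90% / 10% split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows and 2 feature columns, like the dataset.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# Default behaviour: with neither test_size nor train_size given, test_size is 0.25.
X_tr_default, X_te_default, y_tr_default, y_te_default = train_test_split(
    X, y, random_state=3
)

# Explicit 90% train / 10% test split.
X_tr_90, X_te_10, y_tr_90, y_te_10 = train_test_split(
    X, y, test_size=0.1, random_state=3
)

print(len(X_tr_default), len(X_te_default))
print(len(X_tr_90), len(X_te_10))
```

Passing `test_size=0.1` to the call in cell 6 would change every downstream number in this notebook, so the recorded outputs correspond to the 75% / 25% split.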

## Part 3. Perform a Linear Regression

In [8]:
# Use sklearn to train a model on the training set
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)

# Create a sample datapoint and predict the output of that sample with the trained model
reg.predict(np.array([[335,360]]))

array([248868.17697246])

In [9]:
reg.score(x_train, y_train)

0.8615796489132921

In [10]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
reg.score(x_test,y_test)

0.8578404094944483

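This is the $R^2$ score (coefficient of determination) computed on the test set, where $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, with $y_i$ the true values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the true values. It measures the predictive performance of the model as the fraction of the variance in the label that the model explains. The best possible score is 1.0; a score of about 0.858 means the model accounts for roughly 86% of the variance in `Size nm^3` on data it was not trained on.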
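
As a minimal sketch of the formula above, the snippet below computes $R^2$ by hand and compares it with `sklearn.metrics.r2_score`. The `y_true` and `y_pred` arrays are made-up values for illustration, not results from the slime dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

# A model that always predicts the mean of y_true scores exactly 0.
r2_mean_model = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_mean_model)
```

For a regressor such as `LinearRegression`, `score(X, y)` returns this same quantity.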

In [11]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
reg.coef_

array([ 878.15966663, 1018.11982405])

In [12]:
reg.intercept_

-411838.44800625014

$h(x)=-411838.44801+878.15967x_1+1018.11982x_2$, where $x_1$ is the temperature in °C and $x_2$ is Mols KCL.
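
As a quick check, the equation can be evaluated by hand at the sample point used in cell 8 (temperature 335, Mols KCL 360). The sketch below copies the intercept and coefficients from the outputs above; the function name `h` and its argument names are illustrative choices.

```python
import numpy as np

# Values copied from reg.intercept_ and reg.coef_ in the outputs above.
intercept = -411838.44800625014
coef = np.array([878.15966663, 1018.11982405])


def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis at one data point."""
    return intercept + coef[0] * temperature_c + coef[1] * mols_kcl


value = h(335, 360)
print(value)
```

The result agrees with the `reg.predict` output of about 248868.18, up to the rounding of the printed coefficients.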

## Part 4. Use Cross Validation

In [13]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
reg = linear_model.LinearRegression()
cv = ShuffleSplit(n_splits=10, random_state=0)
scores = cross_val_score(reg, x_train, y_train, cv=cv)
scores

array([0.87066084, 0.8312289 , 0.87822906, 0.85789495, 0.86848602,
       0.85744914, 0.88281516, 0.85173972, 0.88016112, 0.86760776])

In [14]:
scores.mean()

0.8646272670002974

In [15]:
scores = cross_val_score(reg, x_train, y_train, cv=5) #K-fold CV
scores

array([0.85280684, 0.8551826 , 0.86973163, 0.88362657, 0.82609625])

In [16]:
scores.mean()

0.8574887787514971

In [17]:
# Report on their finding and their significance

Instead of holding out a single validation set to guard against overfitting, we use cross-validation, which avoids drastically reducing the number of samples available for training. It also prevents overfitting to the test set and gives a reliable performance measure while tuning hyperparameters.

Each split in the cross-validation gives a slightly different score: from about 0.83 to 0.88 with `ShuffleSplit` (mean 0.865) and from about 0.83 to 0.88 with 5-fold CV (mean 0.857). Cross-validation is better than holding out a separate validation set because more data remains available for training. In addition, we do not need to touch the test set and still get a fair estimate of generalization performance.
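
The two strategies used above differ in how they carve up the data. The sketch below runs both on synthetic data (a made-up linear relationship with noise, not the slime dataset). It assumes the scikit-learn defaults: `ShuffleSplit` holds out 10% per split when no sizes are given, and an integer `cv` with a regressor means unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic data for illustration: y depends linearly on two features, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 50, size=200)

reg = LinearRegression()

# ShuffleSplit: 10 independent random train/validation splits.
# Validation sets from different splits may overlap.
shuffle_cv = ShuffleSplit(n_splits=10, random_state=0)
shuffle_scores = cross_val_score(reg, X, y, cv=shuffle_cv)
train_idx, val_idx = next(shuffle_cv.split(X))

# cv=5 with a regressor: 5 disjoint folds, each used once for validation.
kfold_scores = cross_val_score(reg, X, y, cv=5)
explicit_kfold_scores = cross_val_score(reg, X, y, cv=KFold(n_splits=5))

print(len(train_idx), len(val_idx))
print(shuffle_scores.mean(), kfold_scores.mean())
```

With K-fold every sample is validated exactly once, whereas `ShuffleSplit` lets the number of splits and the validation fraction be chosen independently.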

## Part 5. Using Polynomial Regression

In [18]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                 ('linear', linear_model.LinearRegression(fit_intercept=False))])
model = model.fit(x_train,y_train)

In [19]:
# Report on the metrics and output the resultant equation as you did in Part 3.
model.named_steps['linear'].coef_

array([ 2.06632172e-05,  1.20000000e+01, -1.42161677e-07, -9.82133618e-12,
        2.00000000e+00,  2.85714287e-02])

In [20]:
model.predict(np.array([[335,360]]))

array([248922.85713143])

In [21]:
model.score(x_test,y_test)

1.0

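$h(x)=0.00002+12x_1+2x_1x_2+0.02857x_2^2$, where $x_1$ is the temperature in °C and $x_2$ is Mols KCL. The $x_2$ and $x_1^2$ terms are omitted because their fitted coefficients (on the order of $10^{-7}$ and $10^{-11}$) are effectively zero.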
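
To see which coefficient belongs to which term, it helps to inspect how `PolynomialFeatures(degree=2)` orders its output columns. The sketch below expands the sample point from cell 20 and takes the dot product with the coefficients copied from the output of cell 19. The `powers_` attribute lists the exponent of each input feature for each output column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
sample = np.array([[335, 360]])

# Columns come out in the order [1, x1, x2, x1^2, x1*x2, x2^2].
expanded = poly.fit_transform(sample)
print(poly.powers_)

# Coefficients copied from model.named_steps['linear'].coef_ above.
coef = np.array([
    2.06632172e-05,   # constant term (bias column of ones)
    1.20000000e+01,   # x1
    -1.42161677e-07,  # x2
    -9.82133618e-12,  # x1^2
    2.00000000e+00,   # x1 * x2
    2.85714287e-02,   # x2^2
])

# With fit_intercept=False, the prediction is the dot product of the two.
prediction = (expanded @ coef)[0]
print(prediction)
```

The result agrees with the `model.predict` output of about 248922.857, up to the rounding of the printed coefficients.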