<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^Be sure to update this button to point to your notebook instead of the sample notebook)

In [386]:
# Imports section
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

## Part 1. Loading the dataset

In [387]:
# Using pandas load the dataset (load remotely, not locally)
df = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

In [388]:
# Output the first 15 rows of the data
df.head(15)

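| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |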


In [389]:
# Display a summary of the table information (number of datapoints, etc.)
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 471.53 | 508611.1 |
| std | 288.819436 | 288.482872 | 447483.8 |
| min | 1.0 | 1.0 | 16.11429 |
| 25% | 250.75 | 226.75 | 129826.7 |
| 50% | 500.5 | 459.5 | 382718.2 |
| 75% | 750.25 | 710.25 | 760321.1 |
| max | 1000.0 | 1000.0 | 1972127.0 |


In [390]:
df.size

3000

In [391]:
df.dtypes

Temperature °C      int64
Mols KCL            int64
Size nm^3         float64
dtype: object

## Part 2. Splitting the dataset

In [392]:
# Take the pandas dataset and split it into our features (X) and label (y)
features = df[["Temperature °C", "Mols KCL"]] # X
label = df["Size nm^3"] # Y

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, Y_train, Y_test = train_test_split(features, label, train_size=.90, random_state=0)
X_train, X_test, Y_train, Y_test

(     Temperature °C  Mols KCL
 785              18        61
 873             826       386
 65              518       772
 902             752       701
 317             348       288
 ..              ...       ...
 835               6        13
 192             430       560
 629             153       801
 559             726       744
 684             493       979
 
 [900 rows x 2 columns],
      Temperature °C  Mols KCL
 993             311       265
 859             951       145
 298             985       786
 553              66       317
 672              59       277
 ..              ...       ...
 485             994       881
 568             249       927
 108             293       466
 367             217       682
 644             399       823
 
 [100 rows x 2 columns],
 785    2.518314e+03
 873    6.518410e+05
 65     8.230361e+05
 902    1.077368e+06
 317    2.069938e+05
            ...     
 835    2.328286e+02
 192    4.957200e+05
 629    2.652735e+05
 559    1.104
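
Displaying the four pieces of the split in full produces a long dump. A more compact check is to print only their shapes. The sketch below assumes a synthetic stand-in for the remote CSV (the names `demo_df`, `demo_features` and `demo_label` are hypothetical), with the same column names and row count as the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's column names (assumption:
# the real notebook would use df, features and label instead).
rng = np.random.default_rng(4)
demo_df = pd.DataFrame({
    "Temperature °C": rng.integers(1, 1001, size=1000),
    "Mols KCL": rng.integers(1, 1001, size=1000),
})
demo_df["Size nm^3"] = rng.uniform(0, 2e6, size=1000)

demo_features = demo_df[["Temperature °C", "Mols KCL"]]
demo_label = demo_df["Size nm^3"]
split = train_test_split(demo_features, demo_label, train_size=0.90, random_state=0)

# Order of the pieces: X_train, X_test, Y_train, Y_test.
shapes = [part.shape for part in split]
print(shapes)
```

With 1000 rows and `train_size=0.90`, the training pieces hold 900 rows and the test pieces hold 100.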

## Part 3. Perform a Linear Regression

In [393]:
# Use sklearn to train a model on the training set
l_regression = linear_model.LinearRegression()
l_regression.fit(X_train, Y_train)

LinearRegression()

In [394]:
# Predict with the trained model (here on the whole training set rather than on a single sample datapoint)
l_regression.predict(X_train)

array([-323387.68117916,  701377.24909512,  823759.45783579,
        954402.38564216,  189985.00199204,  887054.15114539,
        551068.4434087 ,  119846.69114978,  192934.17218907,
        741676.34535412,  -71757.58280486,  524258.70144316,
        597288.0607819 ,  696323.09632882,  128232.01539656,
        367527.06642784,   87476.24863334, 1071658.44229592,
        184244.42049944,  338951.37536618, -240868.6266035 ,
       -167044.89823746,  822922.17967811,  369997.38540049,
       -125402.03471609,  656965.09631842,  516995.80669658,
        894031.95322936,  720738.6408524 ,  850133.25770734,
        498511.81496737,  309786.10551914,  452397.40931508,
        263173.51386224,  491432.92337257,  699929.66374183,
        881826.96822756,  140377.54374773, -265594.17308992,
        910044.23235592,  594399.79954541,  340155.38355724,
        182639.01756478,  538500.85058361,  404911.5398402 ,
        -64740.99380059,  345855.84317945,   80121.961106  ,
        854862.19594064,
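
The prompt for this cell asks for a single sample datapoint, whereas the call above predicts on the whole training set. A minimal sketch of the single-sample case follows. It assumes a small synthetic dataset with a known linear relationship in place of the remote CSV, and the sample values (500 °C, 400 mol) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the notebook's data (assumption: the real notebook
# would reuse X_train, Y_train and l_regression instead).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Temperature °C": rng.integers(1, 1001, size=200),
    "Mols KCL": rng.integers(1, 1001, size=200),
})
# Illustrative linear target, not the real relationship in the dataset.
demo_y = 3.0 * demo_X["Temperature °C"] + 5.0 * demo_X["Mols KCL"] + 10.0

demo_model = LinearRegression().fit(demo_X, demo_y)

# A single sample must still be 2D: one row, same column names as training.
sample = pd.DataFrame({"Temperature °C": [500], "Mols KCL": [400]})
sample_prediction = demo_model.predict(sample)
print(sample_prediction)
```

Passing a one-row DataFrame with the training column names keeps the input two-dimensional, which is the shape `predict` expects.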

In [395]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
l_regression.score(X_train,Y_train)
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX

0.8571870071495865

In [396]:
# The coefficients
l_regression.coef_

array([ 863.58108791, 1006.12741921])

In [397]:
# The intercept
l_regression.intercept_ 

-400305.91333353234

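The score is the coefficient of determination, $R^2$: on the training set, the linear model explains about 85.7% of the variance in size. It falls short of 1.0 because a model that is linear in the two features cannot capture all of the structure in the data. Writing $x_1$ for the temperature and $x_2$ for the moles of KCl, the coefficients and intercept above give:

$ h(x) = 863.58x_1 + 1006.13x_2 - 400305.91 $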
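
To make the meaning of the score concrete, `score` for a regressor returns $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below recomputes that quantity by hand and compares it with scikit-learn's value. It assumes synthetic data with an interaction term, so the linear fit is deliberately imperfect, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the target contains an interaction term,
# so a model that is linear in the features cannot fit it exactly.
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 1000, size=(300, 2))
y_demo = 12 * X_demo[:, 0] + 2 * X_demo[:, 0] * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

The two numbers agree up to floating-point error, and both stay below 1 because of the unmodelled interaction.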

## Part 4. Use Cross Validation

In [398]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
cross_val_score(l_regression, X_train, Y_train, cv=60)

# Report on their finding and their significance

array([0.84145609, 0.8616694 , 0.73420563, 0.8233996 , 0.73294288,
       0.80925699, 0.82580605, 0.83128711, 0.86117109, 0.82073346,
       0.83676121, 0.86741143, 0.71294841, 0.71420336, 0.84376908,
       0.85540728, 0.7214358 , 0.84324374, 0.91737668, 0.88700399,
       0.86878485, 0.85362867, 0.880749  , 0.87954522, 0.89764654,
       0.70526365, 0.86022084, 0.92672162, 0.9106964 , 0.86133994,
       0.89624446, 0.92191316, 0.78634922, 0.89468162, 0.91899643,
       0.84198556, 0.85634974, 0.9199058 , 0.88189683, 0.8630636 ,
       0.81052792, 0.68708043, 0.81291639, 0.86818653, 0.83554822,
       0.87644156, 0.83992873, 0.8690072 , 0.89935371, 0.81943985,
       0.71740208, 0.66935338, 0.84579248, 0.88499735, 0.82940534,
       0.86911212, 0.94047234, 0.81493134, 0.6473024 , 0.83387914])

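With `cv=60`, the 900 training rows are split into 60 folds of 15 rows each. The model is refit 60 times, each time holding out one fold and reporting $R^2$ on it. The scores range from about 0.647 to 0.940, and most fall between 0.8 and 0.9, which is consistent with the single training score of 0.857. The spread is wide because each held-out fold contains only 15 points, so individual fold scores are noisy. The overall picture is that the linear model generalizes about as well as its training score suggests.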
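
One detail worth noting: for a regressor, an integer `cv` splits the rows in their existing order without shuffling. A hedged sketch of cross-validation with explicit shuffling and a summary of the scores is shown below. It assumes synthetic data in place of the notebook's training set and uses 10 folds instead of 60 purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / Y_train (assumption).
rng = np.random.default_rng(2)
X_demo = rng.uniform(1, 1000, size=(900, 2))
y_demo = 12 * X_demo[:, 0] + 2 * X_demo[:, 0] * X_demo[:, 1]

# An explicit KFold object lets us shuffle the rows before splitting.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds)

# Summarize the folds rather than reading the raw array.
print(f"mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean and standard deviation makes the fold-to-fold variation easier to judge than the raw array of scores.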

## Part 5. Using Polynomial Regression

In [399]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
p = PolynomialFeatures(degree=2)
p_train = p.fit_transform(X_train)
p_test = p.transform(X_test)  # reuse the transformer fitted on the training set
p_reg = linear_model.LinearRegression()
p_reg.fit(p_train, Y_train)

# Report on the metrics and output the resultant equation as you did in Part 3.

LinearRegression()

In [400]:
p_reg.score(p_train, Y_train)

1.0

In [401]:
p_reg.coef_

array([ 0.00000000e+00,  1.20000000e+01, -1.10385043e-07, -1.20934374e-11,
        2.00000000e+00,  2.85714287e-02])

In [402]:
p_reg.intercept_

1.3901910278946161e-05

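The score rises to 1.0 on the training set when we use degree-2 polynomial features, which indicates that size is, to within rounding, a quadratic function of the two features. The coefficients are ordered as $1, x_1, x_2, x_1^2, x_1x_2, x_2^2$. Dropping the $x_2$ and $x_1^2$ terms, whose coefficients are on the order of $10^{-7}$ or smaller, gives:

$ h(x) = 0.0000139 + 12x_1 + 2x_1x_2 + 0.0286x_2^2 $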