<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

## Part 1. Loading the dataset

In [5]:
Data = pd.read_csv("science_data_large.csv") # reads CSV file and stores in Data

In [6]:
Data.head(15) # returns first 15 rows of Dataframe

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [7]:
Data.shape # returns number of rows and columns

(1000, 3)

In [8]:
Data.info() # returns a summary of the table

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [9]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = Data[["Temperature °C","Mols KCL"]]
Y = Data["Size nm^3"]

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.1) # split using Sklearn

## Part 3. Perform a Linear Regression

In [16]:
# Use sklearn to train a model on the training set
regTrain = LinearRegression().fit(X_train.values,Y_train)
# Create a sample datapoint and predict the output of that sample with the trained model
regTrain.predict([[690.1,79]])

array([269711.34299652])

In [13]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
regTrain.score(X_train.values,Y_train) # pass .values to match how the model was fitted

0.8640012036050259

## What is the score?
The score is the coefficient of determination ($R^2$) of the regression. It measures the fraction of the variance in the label that the model explains, which tells us how well the model predicts data points. The highest possible score is 1, which means the predictions match the data exactly. A score of 0 means the model does no better than always predicting the mean of the label, and the score can be negative (with no lower bound) when the model does worse than that. For this example the regression scored 0.864 on the training set, so it explains about 86% of the variance. That's a good fit, but not a perfect one.
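
To make the meaning of the score concrete, here is a minimal sketch that computes $R^2$ by hand as $1 - SS_{res}/SS_{tot}$. The helper name `r2_score_manual` and the toy labels are assumptions made up for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy labels (hypothetical, not taken from science_data_large.csv)
y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y, y)                      # exact predictions
baseline = r2_score_manual(y, np.full(4, y.mean()))  # always predict the mean
bad = r2_score_manual(y, y[::-1])                    # predictions in reverse order

print(perfect, baseline, bad)  # 1.0 0.0 -3.0
```

The last case shows that the score is not bounded below by -1: predictions that are worse than the mean can push it arbitrarily far below zero.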

In [14]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
regTrain.coef_

array([ 882.09523109, 1037.27012558])

In [15]:
regTrain.intercept_

-420966.91589744086

Sample equation: $h(T,M) = (882.09523)T + (1037.27013)M - 420966.9159$
- h(T,M) = predicted Size (nm^3)
- T = Temperature °C
- M = Mols KCL
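
As a sanity check, the equation can be evaluated by hand at the sample point used in Part 3. This is a small sketch in plain Python; the names `coef_T`, `coef_M`, and `h` are illustrative, and the numbers are copied from the `regTrain.coef_` and `regTrain.intercept_` outputs above.

```python
# Values copied from the outputs of regTrain.coef_ and regTrain.intercept_
coef_T = 882.09523109
coef_M = 1037.27012558
intercept = -420966.91589744086

def h(T, M):
    """Linear hypothesis: predicted Size (nm^3) from Temperature °C and Mols KCL."""
    return coef_T * T + coef_M * M + intercept

prediction = h(690.1, 79)
print(prediction)  # about 269711.343, matching regTrain.predict([[690.1, 79]])
```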

## Part 4. Use Cross Validation

In [23]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
scores = cross_val_score(regTrain, X, Y, cv = 15)
print(scores)
print("\nMean of Scores: " + str(np.mean(scores)))

[0.79028083 0.86917059 0.85559836 0.85907049 0.87786968 0.86099121
 0.87124279 0.85495367 0.85432399 0.88231404 0.85851787 0.8673683
 0.74745492 0.87694851 0.88229995]

Mean of Scores: 0.8538936791610037


## Cross Validation

Cross validation lets us test our prediction function further. The results above show that fitting and scoring on different subsets of the data yields different scores. What's interesting is that they all cluster around 0.85 (the mean is 0.854), with the lowest fold at 0.747. This gives us further assurance that the score from Part 3 is representative, and it gives us a small peek into how the prediction function will behave with new data.
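
When `cv` is given as an integer for a regressor, `cross_val_score` splits the data into consecutive folds without shuffling. If shuffled folds are wanted, one option is to pass an explicit `KFold` splitter. The sketch below assumes synthetic data (`X_demo`, `y_demo` are made up, not the notebook's dataset) and is only meant to show the call pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical): a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 50, size=300)

# 15 folds, shuffled once with a fixed seed so the split is reproducible
cv = KFold(n_splits=15, shuffle=True, random_state=0)
demo_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

# The mean summarizes typical performance; the spread shows how much it varies by fold
print(demo_scores.mean(), demo_scores.std())
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a single low fold is an outlier or part of a wider spread.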


## Part 5. Using Polynomial Regression

In [50]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
model = Pipeline([('poly', PolynomialFeatures(degree=2)), ('linear', LinearRegression(fit_intercept=False))])
model = model.fit(X_train.values, Y_train)
model.named_steps['linear'].coef_

array([ 1.75975034e-05,  1.20000000e+01, -1.33941756e-07, -1.03437259e-11,
        2.00000000e+00,  2.85714287e-02])

### Report on the metrics and output the resultant equation as you did in Part 3.
The result shown above is the array of coefficients for a degree-2 polynomial fitted to the training data X_train.
With x = Temperature °C (T) and z = Mols KCL (M), the coefficients [A, B, C, D, E, F] are arranged to fit the polynomial
- A + Bx + Cz + Dx^2 + Exz + Fz^2

$h(T,M) = 1.7598 \times 10^{-5} + (12)T - (1.3394 \times 10^{-7})M - (1.0344 \times 10^{-11})T^2 + (2)TM + (0.02857)M^2$
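
The polynomial can be checked the same way as the linear equation in Part 3. This sketch evaluates it in plain Python at the sample point (690.1, 79); the name `h_poly` is illustrative, and the coefficients are copied from the `coef_` output above in the order [1, T, M, T^2, T·M, M^2].

```python
# Coefficients copied from model.named_steps['linear'].coef_,
# in the order [1, T, M, T^2, T*M, M^2]
A, B, C, D, E, F = (1.75975034e-05, 1.20000000e+01, -1.33941756e-07,
                    -1.03437259e-11, 2.00000000e+00, 2.85714287e-02)

def h_poly(T, M):
    """Degree-2 hypothesis: predicted Size (nm^3)."""
    return A + B * T + C * M + D * T**2 + E * T * M + F * M**2

poly_prediction = h_poly(690.1, 79)
print(poly_prediction)  # about 117495.314, matching model.predict([[690.1, 79]])
```

Almost all of the prediction comes from the T, T·M, and M^2 terms; the remaining three coefficients are close to zero.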

In [51]:
model.predict([[690.1,79]])

array([117495.31429163])

In [53]:
model.score(X_train.values,Y_train)

1.0

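This method fits the data in our example more accurately. With ordinary linear regression we had a training score of 0.86, while with degree-2 polynomial regression the training score is a perfect 1.0. Both scores are computed on the training set; scoring on X_test would show how well each fit generalizes to unseen data.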
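
A score on held-out data is a fairer comparison than a training score. The following is a minimal sketch of that comparison, assuming synthetic data with a made-up quadratic target (`X_demo`, `y_demo`, and the noise level are illustrative assumptions, not the notebook's dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical): target contains a T*M interaction term
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(500, 2))
T, M = X_demo[:, 0], X_demo[:, 1]
y_demo = 12 * T + 2 * T * M + rng.normal(0, 1000, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# Fit both models on the training split only
linear = LinearRegression().fit(X_tr, y_tr)
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("linear", LinearRegression(fit_intercept=False)),
]).fit(X_tr, y_tr)

# Score both models on the held-out test split
linear_test_r2 = linear.score(X_te, y_te)
poly_test_r2 = poly.score(X_te, y_te)
print(linear_test_r2, poly_test_r2)
```

In the notebook itself, the equivalent calls would be `regTrain.score(X_test.values, Y_test)` and `model.score(X_test.values, Y_test)`.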