<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [170]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

## Part 1. Loading the dataset

In [171]:
# Using pandas load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
slime_data = pd.read_csv('https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv')
slime_data.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [172]:
# Display a summary of the table information (number of datapoints, etc.)
slime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [173]:
# Take the pandas dataset and split it into our features (X) and label (y)
features = slime_data[["Temperature °C","Mols KCL"]]
label = slime_data["Size nm^3"]
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.1, train_size=0.9, random_state=0)
#Displays data
x_train, x_test, y_train, y_test


(     Temperature °C  Mols KCL
 785              18        61
 873             826       386
 65              518       772
 902             752       701
 317             348       288
 ..              ...       ...
 835               6        13
 192             430       560
 629             153       801
 559             726       744
 684             493       979
 
 [900 rows x 2 columns],
      Temperature °C  Mols KCL
 993             311       265
 859             951       145
 298             985       786
 553              66       317
 672              59       277
 ..              ...       ...
 485             994       881
 568             249       927
 108             293       466
 367             217       682
 644             399       823
 
 [100 rows x 2 columns],
 785    2.518314e+03
 873    6.518410e+05
 65     8.230361e+05
 902    1.077368e+06
 317    2.069938e+05
            ...     
 835    2.328286e+02
 192    4.957200e+05
 629    2.652735e+05
 559    1.104

## Part 3. Perform a Linear Regression

In [174]:
# Use sklearn to train a model on the training set
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression()

In [175]:
#Create a sample datapoint and predict the output of that sample with the trained model

# Note: `predictions` holds the model's predicted "Size nm^3" for every row of the test set
predictions = model.predict(x_test)
predictions

array([ 134891.57109752,  566848.1770532 , 1241137.60975794,
        -24367.1696409 ,  -70657.3340248 ,  384956.44708746,
        467392.47066236,  960873.69862144,    8407.39619923,
         51611.35989487,  838231.36519831, 1232720.46674085,
        901563.3931183 , -118835.20674597,  188176.20244545,
       1158467.7057509 ,  665263.26735435,  560560.92590561,
        163940.53886357,  469055.93661764, 1419180.64745835,
        -84401.29716064,  555956.53417346, 1401489.68980634,
        450621.8221688 , 1161403.05700777,   45784.96014151,
        408948.47482721,  421868.97874553,  645969.69402768,
        467893.44392697,  208517.41869172, 1438187.73449241,
        708330.20055877, 1400047.62029314,  711132.70221441,
        942380.06884206, 1116186.8384737 , 1257396.19462681,
        189369.12027644,   71374.08771584,  870102.22808014,
         94087.37936383,  407174.28131093,  271473.01984826,
        388791.32041263,  237873.60116061,  790013.37060675,
        112125.67358907,

In [176]:
#predict_output = model.predict(np.array([[0,1]]))
model.score(x_test, y_test)

0.8761646752736478

-Report on the score for that model, in your own words (markdown, not code) explain what the score means
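The value returned by `model.score` is the coefficient of determination, $R^2$, not a classification accuracy. A score of about 0.876 means the linear model explains roughly 87.6% of the variance in "Size nm^3" on the held-out test set; the remaining 12.4% is left unexplained.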
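
To make the meaning of the score concrete, here is a minimal sketch that computes $R^2$ by hand and compares it with `score` and `sklearn.metrics.r2_score`. The data is synthetic (an assumption for illustration, not the slime dataset), and the names `demo_model`, `X_eval`, and `y_eval` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption: NOT the slime dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 2))
y = 3.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 100, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50
demo_model = LinearRegression().fit(X[:150], y[:150])
X_eval, y_eval = X[150:], y[150:]
y_pred = demo_model.predict(X_eval)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_eval - y_pred) ** 2)
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree
print(r2_manual, demo_model.score(X_eval, y_eval), r2_score(y_eval, y_pred))
```

An $R^2$ of 1 would mean the predictions match the labels exactly, while 0 would mean the model does no better than always predicting the mean label.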

In [177]:
coeffChart = pd.DataFrame(data=model.coef_, index=features.columns, columns = ['Coeff'])
coeffChart

Unnamed: 0,Coeff
Temperature °C,863.581088
Mols KCL,1006.127419


-Extract the coefficients and intercept from the model and write an equation for your h(x) using laTeX
<br>
Note: the coefficients come from `model.coef_` and the intercept from `model.intercept_`. With two features the hypothesis has the form $h(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_0$, where $x_1$ is the temperature in °C, $x_2$ is the mols of KCL, and $\theta_0$ is the intercept.
<br>
$h(x)={864x_1 + 1006x_2 - 400306}$
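
As a rough sketch of how the equation relates to the fitted model, the snippet below rebuilds the hypothesis from `coef_` and `intercept_` and checks it against `predict`. It uses noise-free synthetic data (an assumption, not the slime data), and `demo_model` and `h` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data (assumption): y = 2*x1 - 3*x2 + 7
rng = np.random.default_rng(1)
X = rng.uniform(0, 1000, size=(100, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 7.0

demo_model = LinearRegression().fit(X, y)
theta_1, theta_2 = demo_model.coef_   # one coefficient per feature
theta_0 = demo_model.intercept_       # the constant term

def h(x1, x2):
    """Hand-written hypothesis built from the fitted parameters."""
    return theta_0 + theta_1 * x1 + theta_2 * x2

# The hand-written equation should match model.predict on any sample
sample = np.array([[10.0, 20.0]])
manual = h(10.0, 20.0)
predicted = demo_model.predict(sample)[0]
print(f"h(x) = {theta_1:.3f}*x1 + {theta_2:.3f}*x2 + {theta_0:.3f}")
print(manual, predicted)
```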

## Part 4. Use Cross Validation

In [178]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
scores = cross_val_score(model, x_test, y_test, cv=5)
scores

array([0.91532594, 0.88628606, 0.87313623, 0.75349855, 0.88772151])

-Report on their finding and their significance
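With `cv=5`, `cross_val_score` splits the data it is given into five folds, trains on four of them, scores on the fifth, and repeats until each fold has been used for scoring once. The five $R^2$ values range from about 0.75 to 0.92 with a mean of about 0.86. Averaging over several folds gives a less biased (though still variable) estimate of how well the model will do outside the given slime data than a single train/test split does. Note that the call above passes only the 100-row test split, so each fold is scored on just 20 rows; passing the full `features` and `label` would use all 1000 rows.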
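
The fold mechanics can be sketched as follows. This is an illustration on synthetic data (an assumption, not the slime data) and it uses an explicit `KFold` with `shuffle=True`, which is a choice made here rather than what the cell above does; passing `cv=5` directly uses unshuffled folds for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: NOT the slime dataset)
rng = np.random.default_rng(2)
X = rng.uniform(0, 1000, size=(500, 2))
y = 4.0 * X[:, 0] + 6.0 * X[:, 1] + rng.normal(0, 200, size=500)

# Five shuffled folds: each fold is held out once while the other four train
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X)]

# One R^2 score per fold
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(fold_sizes)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the standard deviation shows both the typical score and how much it moves from fold to fold.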

## Part 5. Using Polynomial Regression

In [179]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
# Build the degree-2 transformer; its output columns are 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
# Report on the metrics and output the resultant equation as you did in Part 3.
new_x_train = poly.fit_transform(x_train)
# Fit an ordinary linear regression on the augmented training features
model2 = LinearRegression()
model2.fit(new_x_train, y_train)
# R^2 on the training set (not the test set), then the fitted parameters
print(model2.score(new_x_train, y_train))
print(model2.coef_)
print(model2.intercept_)


1.0
[ 0.00000000e+00  1.20000000e+01 -1.10385822e-07 -1.20952137e-11
  2.00000000e+00  2.85714287e-02]
1.3903190847486258e-05


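The score after augmenting the training features to degree 2 is 1.0. Because this is $R^2$ measured on the training set, it shows that the degree-2 model reproduces the training data essentially exactly. It is not directly comparable to the 0.876 from Part 3, which was measured on the test set; scoring on `poly.transform(x_test)` would be needed to confirm the improvement on unseen data.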

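Here $x_1$ is the temperature in °C and $x_2$ is the mols of KCL; the coefficients follow the `PolynomialFeatures` column order $1, x_1, x_2, x_1^2, x_1x_2, x_2^2$.

$h(x)={12x_1 - 1.1\times10^{-7}x_2 - 1.2\times10^{-11}x_1^2 + 2x_1x_2 + 2.9\times10^{-2}x_2^2 + 1.4\times10^{-5}}$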