In [1]:
# import section
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn import svm

data_url = "https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv"

COL_TEMP = "Temperature °C"
COL_MOLS = "Mols KCL"
COL_SIZE = "Size nm^3"

regression = LinearRegression()

## Part 1. Loading the dataset

In [2]:
# load the dataset remotely
chem_df = pd.read_csv(data_url)
# output the first 15 rows of the data
chem_df.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [3]:
# summary of the table information
chem_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
Temperature °C    1000 non-null int64
Mols KCL          1000 non-null int64
Size nm^3         1000 non-null float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [4]:
# split the dataframe into features (temperature and mols of KCl) and label (size of slime)
features = chem_df[[COL_TEMP, COL_MOLS]]
labels = chem_df[COL_SIZE]

In [5]:
# use sklearn to split features and labels into training (90%) and testing (10%) sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=53
)
print(f'features training size: {len(features_train)}')
print(f'features testing size: {len(features_test)}')
print(f'labels training size: {len(labels_train)}')
print(f'labels testing size: {len(labels_test)}')

features training size: 900
features testing size: 100
labels training size: 900
labels testing size: 100


## Part 3. Perform a Linear Regression

In [6]:
# train a model on the training set, then score it (R^2) on the held-out test set
model = regression.fit(features_train, labels_train)
score = model.score(features_test, labels_test)
print(f'Score: {score}')

Score: 0.875344345275818


The <em>score</em> function returns the coefficient of determination $R^2$ of the trained model. It tells us how well the trained model expresses the relationship between our features and the labels.  
It is defined as $R^2 = 1-\frac{\mu}{\nu}$, where $\mu$ is the residual sum of squares and $\nu$ is the total sum of squares.  
A perfect model has $R^2 = 1$, and a model that always predicts the mean of the labels has $R^2 = 0$; on held-out data the value can even be negative if the model does worse than that baseline. The higher the $R^2$ value, the better the model expresses the relationship between the features and the labels.  
Here the score is about 0.875, which suggests that the trained model is a reasonably good representation of the equation of a slime.
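
The definition above can be sketched in a few lines of plain Python. The helper name `r_squared` and the sample numbers are illustrative assumptions, not part of the notebook; for a regressor, `model.score` should report the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares (mu in the text)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares (nu in the text)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# made-up labels and predictions, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # every prediction exact
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(perfect, baseline)
```

The two calls illustrate the reference points: exact predictions give 1, and always predicting the mean gives 0.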

In [7]:
print(model.coef_)
print(model.intercept_)

[ 877.29673334 1017.67056544]
-410540.6400718404


Let's say temperature is $x_0$ and mols of KCl is $x_1$; the equation is:  
$h(x): y = 877.30 x_0 + 1017.67 x_1 - 410540.64$
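
As a quick worked example, the fitted equation can be evaluated by hand for row 0 of the dataset preview in Part 1 (temperature 469, mols 647, measured size 624474.3). The function name `predict_linear` is an illustrative assumption; the coefficients are copied from the `coef_` and `intercept_` output above.

```python
# coefficients and intercept copied from the model output above
COEF_TEMP = 877.29673334
COEF_MOLS = 1017.67056544
INTERCEPT = -410540.6400718404


def predict_linear(temp, mols):
    """Evaluate h(x) = w0 * x0 + w1 * x1 + b for one sample."""
    return COEF_TEMP * temp + COEF_MOLS * mols + INTERCEPT


# row 0 of the dataset preview: temperature 469, mols 647, measured size 624474.3
predicted = predict_linear(469, 647)
error = predicted - 624474.3
print(f"predicted: {predicted:.1f}, error: {error:.1f}")
```

The prediction lands in the right range but overshoots the measured size by several percent, which is consistent with an $R^2$ below 1.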

## Part 4. Use Cross Validation

In [8]:
# use the cross_val_score function to repeat the experiment across several train/test splits of the data
# because the labels are continuous values, the estimator must be a regressor
clt = svm.SVR(kernel="linear")
# cv=10 gives 10 folds, so each round trains on 90% of the data and tests on the remaining 10%
scores = cross_val_score(estimator=clt, X=features, y=labels, cv=10)
scores

array([0.82328159, 0.86040459, 0.87426268, 0.8617585 , 0.87031059,
       0.84056155, 0.87636702, 0.86448193, 0.78440962, 0.8842528 ])

The function <em>cross_val_score</em> returns an array of coefficients of determination, one per fold. For the setup above, 90% of the features and labels is used to train the model and the remaining 10% is used to test it. The function repeats the training and testing 10 times, each time holding out a different portion of the data, and returns the $R^2$ of each round.  
Note that the estimator here is a support vector regressor with a linear kernel (<em>svm.SVR</em>), so these $R^2$ values tell us how well a linear model generalizes across the dataset rather than re-scoring the exact <em>LinearRegression</em> fit from Part 3.  
Seven of the ten $R^2$ values are greater than 0.85 (they range from about 0.78 to 0.88), which suggests that a linear model is a reasonable fit for the given dataset.
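
To make the 90%/10% arithmetic concrete, here is a minimal pure-Python sketch of how 1000 rows can be partitioned into 10 contiguous folds. The helper `kfold_indices` is hypothetical and written only for illustration; it is meant to mirror unshuffled k-fold splitting, which is what scikit-learn is understood to use by default when `cv` is an integer and the estimator is a regressor.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # the first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 1000 rows and 10 folds, matching the dataset size and cv=10 above
splits = list(kfold_indices(1000, 10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

Every row is held out exactly once across the 10 rounds, so each $R^2$ in the array is computed on data the model did not see during that round's training.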

## Part 5. Using Polynomial Regression

In [9]:
# use PolynomialFeatures to perform another regression on a degree-2 augmented dataset
# construct the polynomial regression pipeline
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression(fit_intercept=False)),
])

# fit and evaluate
# our feature training data is [Temperature, mols], let's call it [T, M]
# after PolynomialFeatures processes it, each row becomes [1, T, M, T^2, TM, M^2]
poly_model = poly_model.fit(features_train, labels_train)
poly_score = poly_model.score(features_test, labels_test)
poly_score

1.0

The coefficient of determination for the above setup is 1.0, which means the model trained with polynomial regression fits the held-out test data essentially perfectly.
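
The feature expansion described in the code comments can be sketched by hand. The function `expand_degree2` below is an illustrative stand-in, not scikit-learn's implementation; it only reproduces the column order `[1, T, M, T^2, TM, M^2]` for two input features.

```python
def expand_degree2(temp, mols):
    """Return the degree-2 polynomial terms of two features, bias term first."""
    return [1, temp, mols, temp ** 2, temp * mols, mols ** 2]


# row 0 of the dataset preview: temperature 469, mols 647
row = expand_degree2(469, 647)
print(row)
```

Because the expansion already includes the constant column `1`, the linear step can be fitted with `fit_intercept=False`, as in the pipeline above.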

In [10]:
poly_model.named_steps['linear'].coef_

array([ 1.25367616e-05,  1.20000000e+01, -1.08214466e-07, -2.53166377e-11,
        2.00000000e+00,  2.85714287e-02])

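Denoting the temperature as $x_0$ and mols of KCl as $x_1$, the equation is:  
$h(x): y = 1.25 \times 10^{-5} + 12 x_0 - 1.08 \times 10^{-7} x_1 - 2.53 \times 10^{-11} x_0^2 + 2 x_0x_1 + 2.86 \times 10^{-2} x_1^2$  
The constant, $x_1$, and $x_0^2$ coefficients are negligibly small, so the model is effectively $y \approx 12 x_0 + 2 x_0x_1 + 0.0286 x_1^2$.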
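
As a sanity check, the three non-negligible coefficients from the `coef_` array above (1.2e+01, 2.0, and 2.857e-02) can be plugged back into a few rows of the dataset preview from Part 1. This sketch assumes the remaining three terms, all of magnitude $10^{-5}$ or smaller, can be dropped; the function name `predict_poly` is illustrative.

```python
# coefficients copied from the coef_ array above; the three near-zero terms are omitted
COEF_T = 1.20000000e+01      # temperature
COEF_TM = 2.00000000e+00     # temperature * mols
COEF_MM = 2.85714287e-02     # mols squared


def predict_poly(temp, mols):
    """Evaluate the dominant terms of the fitted degree-2 polynomial."""
    return COEF_T * temp + COEF_TM * temp * mols + COEF_MM * mols ** 2


# (temperature, mols, measured size) taken from the dataset preview in Part 1
rows = [(469, 647, 624474.3), (901, 18, 43257.26), (89, 461, 89198.03)]
errors = [abs(predict_poly(t, m) - size) for t, m, size in rows]
print(errors)
```

Each prediction matches the displayed size to within the rounding of the table, which is what an $R^2$ of 1.0 would lead us to expect.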