# Quantifying Linear Regression

Create a linear regression model, then quantify how well it fits the data.

In [1]:
# Import dependencies
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import pandas as pd
import os
from pickle import dump
import numpy as np

In [2]:
# Generate some data
# X, y = make_regression(n_samples=20, n_features=1, random_state=42, noise=4, bias=100.0)

# make dataframe with wine csv
df_wine = pd.read_csv(os.path.join("..", "Resources", "winequality-joined.csv"))
df_wine.head()

Unnamed: 0,color,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [3]:
# we are going to make the range 0 to 10 so that new outlier data can be added by user
# hard code a drop down menu on the user interface to select 0-10 so people can't type decimals
target = df_wine["quality"]
target_names = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"] 
target_names

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

In [4]:
wine_qual_drop = df_wine.drop("quality", axis=1)
# feature_names = wine_qual_drop.columns
wine_qual_drop.head()

Unnamed: 0,color,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
1,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
3,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9


In [5]:
feature_names = wine_qual_drop.columns

In [6]:
# Create a linear model
model = LinearRegression()

# Fit (Train) our model to the data
model.fit(wine_qual_drop, target)

LinearRegression()

## Quantifying our Model

* Mean Squared Error (MSE)

* R2 Score

There are many ways to quantify a model, but MSE and R2 are two of the most common.
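To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy and compares the results with Sklearn's functions. The `y_true` and `y_pred` arrays are made-up toy values for illustration only; they are not taken from the wine dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration (assumption, not wine data)
y_true = np.array([3.0, 5.0, 6.0, 7.0])
y_pred = np.array([2.5, 5.0, 6.5, 8.0])

# MSE: the average of the squared residuals
residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The hand-computed values should agree with Sklearn's
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Written this way, R2 reads as the fraction of the target's variance that the predictions account for, which is why a value near 1 is considered good.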

In [7]:
from sklearn.metrics import mean_squared_error, r2_score

# Use our model to predict a value
predicted = model.predict(wine_qual_drop)

# Score the prediction with mse and r2
mse = mean_squared_error(target, predicted)
r2 = r2_score(target, predicted)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")

Mean Squared Error (MSE): 0.5363623573881133
R-squared (R2): 0.29653465192960937


A "good" MSE score will be close to zero while a "good" [R2 Score](https://en.wikipedia.org/wiki/Coefficient_of_determination) will be close to 1.

R2 is the default metric returned by the `score` method of Sklearn regression models such as `LinearRegression`.

In [8]:
# Overall Score for the model
model.score(wine_qual_drop, target)

0.29653465192960937

## Validation

We also want to understand how well our model performs on new data. 

One approach for this is to split your data into a training and testing dataset.

You fit (train) the model using training data, and score and validate your model using the testing data.

This train/test splitting is so common that Sklearn provides a mechanism for doing this. 
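As a rough illustration of what the split produces, the sketch below applies `train_test_split` to a small synthetic dataset from `make_regression` rather than the wine data. The `demo_`-prefixed names are hypothetical and chosen only to avoid clashing with variables used elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumption: stands in for real data)
demo_X, demo_y = make_regression(
    n_samples=20, n_features=1, random_state=42, noise=4, bias=100.0
)

# With no test_size given, 25% of the rows are held out for testing
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)
print(demo_X_train.shape, demo_X_test.shape)

# test_size sets the held-out fraction explicitly
demo_split_20 = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)
print(len(demo_split_20[0]), len(demo_split_20[1]))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the training and testing sets on every run.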

## Testing and Training Data

In order to quantify our model against new input values, we often split the data into training and testing data. The model is then fit to the training data and scored on the testing data. Sklearn's `model_selection` module provides the `train_test_split` function for automatically splitting the data into training and testing sets.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(wine_qual_drop, target, random_state=42)

In [10]:
# data = train_test_split(X, y, random_state=42)
# X_train = data[0]
# X_test = data[1]
# y_train = data[2]
# y_test = data[3]

Train the model using the training data.

In [11]:
model.fit(X_train, y_train)

LinearRegression()

And score the model using the unseen testing data.

In [12]:
model.score(X_test, y_test)

0.3148927596727418
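One common sanity check is to compare the score on the training data with the score on the testing data. The sketch below does this on synthetic data from `make_regression`, so its numbers will not match the wine results above. The `demo_`-prefixed names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration (assumption, not wine data)
demo_X, demo_y = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=42
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)

# Fit on the training rows only
demo_model = LinearRegression().fit(demo_X_train, demo_y_train)

# score() returns R2 for regressors
demo_train_r2 = demo_model.score(demo_X_train, demo_y_train)
demo_test_r2 = demo_model.score(demo_X_test, demo_y_test)

print(f"Training R2: {demo_train_r2}")
print(f"Testing R2: {demo_test_r2}")
```

A testing score far below the training score would suggest the model is overfitting. Similar scores suggest that it generalizes about as well as it fits.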

## Your Turn!