# Linear Regression 2

**Import Libraries**

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

import plotly.express as px
import plotly.graph_objects as go

**Train-test Split**

A train-test split fits the model on one portion of the data set (the training set) and then evaluates it on data the model has not yet seen (the test set). This is also called using a holdout set: a subset that we "hold back" from training.

Where to split depends on a few factors:
- size of your data set
- type of model
- use of cross-validation metrics

Common splits (train/test) are:
- 80/20
- 75/25
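
As a rough illustration of how the split fraction translates into set sizes, the sketch below applies `train_test_split` to a made-up array of 100 rows. The toy data and the `split_sizes` helper are assumptions for illustration only and are not part of the penguins analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with a single feature
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

def split_sizes(test_size):
    """Return (n_train, n_test) for a given test fraction."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=test_size, random_state=0)
    return len(X_tr), len(X_te)

# Compare the two common splits mentioned above
for test_size in (0.20, 0.25):
    n_train, n_test = split_sizes(test_size)
    print(f"test_size={test_size}: {n_train} train rows, {n_test} test rows")
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.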



**Create Training and Test Sets**

In [2]:
# Load the data into a DataFrame
df = sns.load_dataset("penguins")

# Drop NaNs
df.dropna(inplace=True)

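# Create the 2-D features matrix
# (convert to a NumPy array first; recent pandas versions do not
# support multi-dimensional indexing on a Series)
X = df['flipper_length_mm'].to_numpy()[:, np.newaxis]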

# Create the target array
y = df['body_mass_g']

# Create the training and test sets. 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print('Training features and target: ', X_train.shape, y_train.shape)
print('Test features and target: ', X_test.shape, y_test.shape)

Training features and target:  (266, 1) (266,)
Test features and target:  (67, 1) (67,)


**Fit the model on the training data, then predict on the test data**

In [4]:
# Instantiate the class (with default parameters)
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Slope (also called the model coefficient)
print(model.coef_)

# Intercept
print(model.intercept_)

# Print in equation form
print(f'\nbody_mass_g = {model.coef_[0]} x flipper_length_mm + ({model.intercept_})')

[50.41798199]
-5919.258741821233

body_mass_g = 50.41798199462178 x flipper_length_mm + (-5919.258741821233)


**Making predictions**

In [6]:
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)

0.7938115564401114
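
To make the score above less of a black box, here is a minimal sketch of how $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays `y_true` and `y_pred` and the helper `r2_manual` are made up for illustration; they are not the penguin predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1 - ss_res / ss_tot

print('manual: ', r2_manual(y_true, y_pred))
print('sklearn:', r2_score(y_true, y_pred))
```

A score of 1 means the predictions match the observations exactly, while a score of 0 means the model does no better than always predicting the mean of the observed values.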

### Multiple Linear Regression

The general form for a multiple linear regression is

$y = \beta_0 + \beta_1X_1 + \beta_2X_2$

where $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient for the independent variable $X_1$, and $\beta_2$ is the regression coefficient for the independent variable $X_2$.

**Compare features**

In [10]:
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [12]:
fig = px.scatter_matrix(df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']], height=800)
fig.show()

### Fitting the multiple regression model

With two features, the model fit returns two regression coefficients. If we go back to the equation for a multiple linear regression and substitute in our (rounded) coefficients, we now have the equation of a plane instead of a line:

$y = -5836 + 49X_{1} + 5X_{2}$

Here $X_1$ is the flipper length and $X_2$ is the bill length. Both regression coefficients are positive, which means the predicted body mass increases as either the flipper length or the bill length increases while the other is held fixed.

In [13]:
df.dropna(inplace=True)

features = ['flipper_length_mm', 'bill_length_mm']
X = df[features]

y = df['body_mass_g']

model = LinearRegression()

model.fit(X, y)

# Slope (2 parameters here)
print(model.coef_)

# Intercept
print(model.intercept_)

[48.88969177  4.95860126]
-5836.298732120461
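
As a quick sanity check on how these numbers are used, the sketch below plugs the printed intercept and coefficients into the plane equation for a single penguin. The measurements (200 mm flipper, 45 mm bill) and the `predict_mass` helper are assumptions chosen for illustration.

```python
# Fitted values copied from the output above
intercept = -5836.298732120461
coef_flipper = 48.88969177   # grams per mm of flipper length
coef_bill = 4.95860126       # grams per mm of bill length

def predict_mass(flipper_length_mm, bill_length_mm):
    """Evaluate y = beta_0 + beta_1 * X_1 + beta_2 * X_2."""
    return (intercept
            + coef_flipper * flipper_length_mm
            + coef_bill * bill_length_mm)

# Hypothetical penguin: 200 mm flipper, 45 mm bill
print(predict_mass(200, 45))

# Increasing the flipper length by 1 mm (bill length fixed)
# changes the prediction by exactly coef_flipper
print(predict_mass(201, 45) - predict_mass(200, 45))
```

With a fitted scikit-learn model, `model.predict` performs the same calculation for every row of the features matrix.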


### Creating a 3D plot

In [42]:
# Create the data to plot the best-fit plane
(x_plane, y_plane) = np.meshgrid(np.arange(165, 235, 1), np.arange(30, 60, 1))
z_plane = -5836 + 49*x_plane + 5*y_plane

In [41]:
fig = px.scatter_3d(df, x='flipper_length_mm', y='bill_length_mm', z='body_mass_g', color='species', width=800, height=800)
grey = [[0, 'rgb(176,224,230)'], [1, 'rgb(176,224,230)']]
fig.add_trace(go.Surface(x=x_plane, y=y_plane, z=z_plane, colorscale=grey, opacityscale=0.05,  showscale=False))
fig.show()
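
The plane above uses hard-coded, rounded coefficients. One possible variation, sketched here under the assumption that a small helper is acceptable, is to pass the intercept and coefficients in as arguments so the grid can be rebuilt from any fit. The name `best_fit_plane` is hypothetical.

```python
import numpy as np

def best_fit_plane(intercept, coefs, x1_values, x2_values):
    """Evaluate intercept + coefs[0]*x1 + coefs[1]*x2 on a grid."""
    x1_grid, x2_grid = np.meshgrid(x1_values, x2_values)
    z_grid = intercept + coefs[0] * x1_grid + coefs[1] * x2_grid
    return x1_grid, x2_grid, z_grid

# Same ranges and rounded coefficients as in the cell above
x_plane, y_plane, z_plane = best_fit_plane(
    -5836, [49, 5], np.arange(165, 235, 1), np.arange(30, 60, 1))

# meshgrid returns arrays of shape (len(x2_values), len(x1_values))
print(z_plane.shape)
```

Passing `model.intercept_` and `model.coef_` instead of the rounded values would keep the plotted plane in sync with the fitted model.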

## Looking into Ordinary Least Squares and How it Works

### Sum of squared errors

The ordinary least squares (OLS) method is based on minimizing the sum of the squared errors. Each error is the difference between the observed value of the dependent variable and the value predicted by the linear model. The OLS method estimates the slope and intercept parameters for the line that minimizes the sum of the squared vertical distances between the line and the observed data.

We sum the square of the distance of each point from the line. Some of the data is above the line (positive distance) and some is below (negative distance). We square this distance so that the negative values don't cancel out the positive values.

The equation for a line can be written in many forms and with different variable names. We’ve been using $y = \beta_0 + \beta_1X_1$ and we’ll continue with this format. It is important to be aware that other symbols are often used for the parameters in the equation of a line. We’ll start with an equation for a line with the form:

$y = \beta_0 + \beta_1X$

We want to estimate the parameters $\beta_0$ and $\beta_1$. For each data point $i$ in our data set, we want to find the value predicted by using the coefficients $\beta_0$ and $\beta_1$. The predicted value of $y$ is given by:

$y_{predict} = \beta_0 + \beta_1x_i$

The error between the actual value $y_i$ and the predicted value $y_{predict}$ is:

$\text{diff} = y_{i} - y_{predict} = y_{i} - (\beta_0 + \beta_1x_{i})$

We want to combine the errors over all of the points in the data set, so we square each difference and then add them up:

$\text{Sum of squares} = \sum (y_{i} - (\beta_0 + \beta_1 x_{i}))^2$
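
The sum of squares can be sketched as a small function of the two parameters. The data points and the candidate parameter values below are made up for illustration, and `sum_of_squares` is a hypothetical helper name.

```python
import numpy as np

def sum_of_squares(beta_0, beta_1, x, y):
    """Sum over all points of (y_i - (beta_0 + beta_1 * x_i))^2."""
    residuals = y - (beta_0 + beta_1 * x)
    return np.sum(residuals ** 2)

# Toy data (hypothetical)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.0, 3.0, 5.0, 8.0])

# Try a few candidate lines; a smaller sum means a better fit
for b0, b1 in [(0.0, 2.0), (1.0, 2.0), (0.8, 2.3)]:
    print(f'beta_0={b0}, beta_1={b1}: {sum_of_squares(b0, b1, x_toy, y_toy)}')
```

Of the three candidates, the last one gives the smallest sum; it is the least-squares line for this toy data.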

Now comes the fun part: we need to apply some math to the above sum of squares equation to find the values of $\beta_0$ and $\beta_1$ that minimize the sum. There are a few different ways to derive this answer, including using calculus and linear algebra.
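
As one possible route, here is a brief sketch of the calculus approach. The symbol $S$ for the sum of squares is introduced here only for convenience. Write

$S(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$

and set both partial derivatives to zero:

$\frac{\partial S}{\partial \beta_0} = -2 \sum_i (y_i - \beta_0 - \beta_1 x_i) = 0$

$\frac{\partial S}{\partial \beta_1} = -2 \sum_i x_i (y_i - \beta_0 - \beta_1 x_i) = 0$

Dividing the first equation by the number of points gives $\beta_0 = \bar{y} - \beta_1 \bar{x}$. Substituting this into the second equation and solving for $\beta_1$ gives

$\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

which is the ratio of the covariance of $x$ and $y$ to the variance of $x$, because the normalization factors in the numerator and denominator cancel.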

**Parameter estimates: least squares**

The values for the parameters are given by:

$\beta_1 = \frac{Cov[x,y]}{Var[x]}$

$\beta_0 = \bar{y} - \beta_1 \bar{x}$

where

$\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$

In [44]:
# Generate the sample data
x = np.arange(25)
delta = np.random.uniform(0,20, size=(25,))
y = 0.4 * x + 3 + delta

# Define a function to calculate beta_0 and beta_1

def least_squares_params(x, y):
    '''
    x and y: data to be fit
    prints and returns: the least-squares estimates (beta_0, beta_1)
    '''
    # Calculate the mean of x and y
    xmean = np.mean(x)
    ymean = np.mean(y)

    # Per-point terms of the covariance of x and y and the variance of x
    # (the 1/n normalization cancels in the ratio below)
    xycov = (x-xmean)*(y-ymean)
    xvar = (x-xmean)**2

    # Calculate the coefficients
    beta_1 = sum(xycov) / sum(xvar)
    beta_0 = ymean - (beta_1*xmean)

    print('beta_0: ', beta_0)
    print('beta_1: ', beta_1)
    return beta_0, beta_1

# Find the estimated parameters beta_0 and beta_1
# given our (x, y) data set
beta_0, beta_1 = least_squares_params(x, y)

beta_0:  13.66843993351543
beta_1:  0.37400954084824806


In [45]:
# Instantiate the class (with default parameters)
model = LinearRegression()

X = x[:, np.newaxis]

# Fit the model
model.fit(X, y)

# Intercept
print('beta_0: ', model.intercept_)

# Slope (also called the model coefficient)
print('beta_1: ', model.coef_)

beta_0:  13.66843993351543
beta_1:  [0.37400954]
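
The covariance/variance form of the estimates can also be checked with NumPy's built-in functions. This is a minimal sketch on freshly generated data with a fixed seed (an assumption, so the numbers will differ from the outputs above); the names `x_demo` and `y_demo` are hypothetical.

```python
import numpy as np

# Seeded sample data in the same style as the cell above (assumed seed)
rng = np.random.default_rng(0)
x_demo = np.arange(25)
y_demo = 0.4 * x_demo + 3 + rng.uniform(0, 20, size=25)

# np.cov normalizes by n - 1 by default, so use ddof=1 for the
# variance as well; the normalization then cancels in the ratio
beta_1 = np.cov(x_demo, y_demo)[0, 1] / np.var(x_demo, ddof=1)
beta_0 = np.mean(y_demo) - beta_1 * np.mean(x_demo)

# np.polyfit with deg=1 returns [slope, intercept] for comparison
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

print('beta_0: ', beta_0, ' polyfit intercept: ', intercept)
print('beta_1: ', beta_1, ' polyfit slope: ', slope)
```

Mixing the default normalizations (`np.cov` uses n - 1, `np.var` uses n) would bias the slope slightly, which is why `ddof=1` is set explicitly.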


### Modeling concepts

- overfitting vs. underfitting
- bias and variance

**Bias and Variance**

- High bias: Doesn't pay a lot of attention to the training data and oversimplifies the model (underfitting)
- Low bias: Pays close attention to the training data and is usually a more complicated model
- High variance: Fits the training data set very well but doesn't generalize to new data (overfitting)
- Low variance: Returns similar models for different sets of training data
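
One way to see these ideas in action is to fit polynomials of increasing degree to noisy data and compare the training and test errors. The sketch below uses made-up data (a noisy sine curve), an assumed seed, and a simple holdout split; the exact numbers depend on those choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a sine curve with Gaussian noise
x_all = rng.uniform(0, 1, size=60)
y_all = np.sin(2 * np.pi * x_all) + rng.normal(0, 0.3, size=60)

# Simple holdout: first 40 points for training, last 20 for testing
x_train, y_train = x_all[:40], y_all[:40]
x_test, y_test = x_all[40:], y_all[40:]

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f'degree {degree}: train MSE = {results[degree][0]:.3f}, '
          f'test MSE = {results[degree][1]:.3f}')
```

The training error can only go down as the degree increases, because each higher-degree polynomial contains the lower-degree ones as special cases. A straight line typically shows a large error on both sets (high bias), while a very flexible polynomial tends to chase the noise in the training set and may do worse on the test set (high variance).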