
# Week 10: Introduction to Supervised Learning: Numeric Targets
# Multiple Linear Regression

## A Brief Recap:

* Hello, how are you?
* Any questions about homework 3?
* Today: Supervised Learning Methods for Numeric Targets
    - Multiple Linear Regression
    - Ridge & Lasso
* Next Week: Supervised Learning Methods for Categorical Targets

## Supervised Regression

Basic Supervised Regression Algorithms:  

* Generalized Linear Models
* Ridge Regression
* Lasso Regression
* Decision Trees
* Random Forests
* k-Nearest Neighbors (kNN)
* Support Vector Machines (SVM)

Today we will focus on extending our understanding of Linear Regression and then move on to Ridge and Lasso Regression.  

Next week we will work on Decision Trees and Random Forests as well as kNN and SVM, but for the case where the target variable is categorical.

## Building on Simple Linear Regression Models

Last week we considered simple linear regression where: 
$$Y \approx \beta_0 + \beta_1X$$

But the world is complicated!  
It is rare for a target variable to have a strong and consistent linear relationship with a single predictor.  

Can you name any simple linear systems? ...

Today we will consider systems where the model describes the relationship of multiple predictor values to a single numeric target variable.  

$$Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_NX_N$$
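
To make the notation concrete, here is a minimal sketch that estimates the coefficients of a multiple linear regression by ordinary least squares with NumPy. The data are synthetic, and the sample size, noise level, and coefficient values in `true_beta` are assumptions chosen only for illustration; none of them come from the restaurant dataset used below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): 200 observations, 3 predictors
n_obs, n_pred = 200, 3
X = rng.normal(size=(n_obs, n_pred))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])  # beta_0, beta_1, beta_2, beta_3

# Design matrix: a leading column of ones carries the intercept beta_0
design = np.column_stack([np.ones(n_obs), X])
y = design @ true_beta + rng.normal(scale=0.1, size=n_obs)

# Least-squares estimate of all coefficients at once
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(beta_hat, 2))
```

Each predictor gets its own coefficient, and the estimates should land close to the values used to generate the data.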

## A Classic Example Dataset

NYC Italian Restaurant DataSet.  

Let's get our environment set up....

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
path = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/nyc.csv'
df = pd.read_csv( path, encoding= 'unicode_escape' )
print( df.head(), '\n\n' )
df.info()  # info() prints its own report and returns None, so no print() is needed

In [None]:
# note the similar spread of the predictor variables
df.describe()

## Our Goal:

* We would like to build an optimal model to predict `Price` from the given feature variables
* We would like to use our model to predict the price of other Italian restaurants

Let's get our data into the environment...

## Assessing Performance for ML Models

Last week we explored a few statistical approaches to modeling data.  
For Machine Learning the goal is slightly different than Statistical Learning:  

**Statistical Learning** - use our model to draw statistical inferences about the system  
**Machine Learning** - emphasizes using the model to predict new cases  

We would like to be able to assess our model's ability to predict new observations.  
To do this, we need to divide our dataset into 2 sets:  

* **Train** - data that will be used to train our model 
* **Test** - hold-out data to assess model performance. We will keep this data out of use until our model is developed.

## Split Train/Test Data Sets with `sklearn`

* `X`, `y` - predictors and target
* `test_size` - arbitrary, but typically an 80/20 train/test split
* `random_state` - set a random state so our results are reproducible

In [None]:
#from sklearn.model_selection import train_test_split
X = df.drop(['Price', 'Case', 'Restaurant'], axis=1)
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=42)

We will only use the test data splits once we are ready to test our model.

## Building a Multiple Linear Regression model

* an initial EDA
* we'll start simple, then build up to a kitchen sink model
* model diagnostics
* model evaluation

### Visualize the features

In [None]:
X_train.hist( bins = 20, figsize=(15,6), layout = (2,2) )

## Looking for relationships

Visualize relationships of the predictors

In [None]:
sns.pairplot( X_train )
plt.show()

In [None]:
corrMatrix = X_train.corr()
plt.subplots( figsize = (10,8) )
sns.heatmap(corrMatrix, annot=True)
plt.show()

There are definitely relations between the numeric features (let's ignore 'East' for now).  
A correlation of 0.5 translates to 25% shared variance ($r^2 = 0.5^2 = 0.25$).  

We will keep this in mind when we evaluate our model.
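
As a quick check of the shared-variance claim, the sketch below builds two synthetic variables with a correlation near 0.5 and compares the squared correlation with the R² of a simple regression of one on the other. The variables `a` and `b` and the sample size are assumptions made up for illustration; they are not columns of the restaurant data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic variables (illustrative assumption) built to correlate at about 0.5
n_obs = 5000
a = rng.normal(size=n_obs)
b = 0.5 * a + np.sqrt(1 - 0.5**2) * rng.normal(size=n_obs)

# Pearson correlation between the two variables
r = np.corrcoef(a, b)[0, 1]

# R^2 from a simple linear regression of b on a
slope, intercept = np.polyfit(a, b, deg=1)
resid = b - (slope * a + intercept)
r_squared = 1 - resid.var() / b.var()

print(f"r = {r:.3f}, r^2 = {r**2:.3f}, regression R^2 = {r_squared:.3f}")
```

The squared correlation and the regression R² agree, which is why a correlation of about 0.5 between two predictors means they share roughly a quarter of their variance.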

### Using `statsmodels` to build up to a kitchen sink

`statsmodels` is better suited to a Pythonic approach to statistical learning, so we will start with it.

But first we'll start with a simple linear regression of the dataset: `Price` modelled by `Food`

In [None]:
train = pd.concat( [X_train, y_train], axis=1 )
simp_mod = smf.ols( formula = 'Price ~ Food', data = train)
fitted_simp = simp_mod.fit()
fitted_simp.summary()

In [None]:
# visualize the fit: training points with the fitted regression line
fig, ax = plt.subplots( figsize = (10,8) )
ax.scatter( x = train['Food'], y = train['Price'], color = 'blue')
ax.plot( train['Food'], fitted_simp.fittedvalues, 'k' )
plt.show()

### Adding the `East` categorical variable

In [None]:
simpE_mod = smf.ols( formula = 'Price ~ Food  + C(East)', data = train)
fitted_simpE = simpE_mod.fit()
fitted_simpE.summary()

### Adding `Decor`

In [None]:
simpD_mod = smf.ols( formula = 'Price ~ Food  + Decor', data = train)
fitted_simpD = simpD_mod.fit()
fitted_simpD.summary()

In [None]:
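# visualize the fitted model as a plane in 3D
# the grid gets its own names so the X and y data defined earlier are not overwritten
from mpl_toolkits.mplot3d import axes3d  # registers the 3D projection on older matplotlib versions

# read the coefficients from the fitted model instead of typing them in by hand
b0, b_food, b_decor = fitted_simpD.params[['Intercept', 'Food', 'Decor']]
food_grid, decor_grid = np.meshgrid( np.linspace(16,25,10), np.linspace(14,24,10) )
price_plane = b0 + b_food*food_grid + b_decor*decor_grid

fig = plt.figure(figsize = (10,8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter( train['Food'], train['Decor'], train['Price'])
surf = ax.plot_surface(food_grid, decor_grid, price_plane, alpha=0.5)
ax.set_xlabel('Food')
ax.set_ylabel('Decor')
ax.set_zlabel('Price')
ax.view_init(30, 60)
plt.show()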

### The Kitchen Sink

That's interesting. Let's see what happens when we include all of the variables in the model.  
What can we conclude from the model coefficients?

In [None]:
# a full 'kitchen sink model'
mlr_mod = smf.ols( formula = 'Price ~ Food + Decor + Service + C(East)', data = train)
fitted_mlr = mlr_mod.fit()
fitted_mlr.summary()

## Diagnostic Plots! 

Yep, diagnostic plots are still relevant for multiple linear regression

In [None]:
from statsmodels.nonparametric.smoothers_lowess import lowess
import scipy.stats as stats

def linear_regression(df, x_cols, y_cols):
    # sm.OLS does not add an intercept on its own, so add a constant column
    exog = sm.add_constant( df[x_cols] )
    mod = sm.OLS(endog=df[y_cols], exog=exog).fit()
    influence = mod.get_influence()

    # collect the quantities needed for the diagnostic plots
    res = df.copy()
    res['resid'] = mod.resid
    res['fittedvalues'] = mod.fittedvalues
    res['resid_std'] = mod.resid_pearson
    res['sqrt_resid_std'] = np.sqrt( res['resid_std'].abs() )
    res['leverage'] = influence.hat_matrix_diag
    res['norm_resid'] = influence.resid_studentized_internal
    res['cooks'] = influence.cooks_distance[0]
    res['cooks_pval'] = influence.cooks_distance[1]
    return mod, res


def plot_diagnosis(df):
    # set the style before creating the figure; the old 'seaborn' style name
    # is no longer accepted by recent matplotlib versions
    sns.set_theme()
    fig, axs = plt.subplots(nrows=2, ncols=2, figsize = (10,8))

    # Residuals against fitted values.
    smooth = lowess( endog = df.resid, exog =  df.fittedvalues)
    index, data = np.transpose(smooth)
    axs[0,0].scatter( x = df.fittedvalues, y = df.resid )
    axs[0,0].plot( index, data, 'r' )
    axs[0,0].axhline( y=0, color='k')
    axs[0,0].set_ylabel( 'Residuals' )
    axs[0,0].set_xlabel( 'Fitted values' )
    axs[0,0].set_title( 'Residuals vs Fitted' )

    # Q-Q plot against the normal distribution, to match the panel title
    sm.qqplot(
        df['norm_resid'], dist=stats.norm, fit=True, line='45',
        ax=axs[0, 1]
    )
    axs[0,1].set_title('Normal Q-Q')

    # The scale-location plot.
    smooth = lowess( endog = df.sqrt_resid_std, exog =  df.fittedvalues)
    index, data = np.transpose(smooth)
    axs[1,0].scatter(
        x=df.fittedvalues, y=df.sqrt_resid_std
    )
    axs[1,0].plot( index, data, 'r' )
    axs[1,0].set_xlabel('Fitted values')
    axs[1,0].set_ylabel('Sqrt(|standardized residuals|)')
    axs[1,0].set_title('Scale-Location')

    # Standardized residuals vs. leverage
    smooth = lowess( endog = df.resid_std, exog =  df.leverage)
    index, data = np.transpose(smooth)
    axs[1,1].scatter(
        x=df.leverage, y=df.resid_std
    )
    axs[1,1].axhline(y=0, color='grey', linestyle='dashed')
    axs[1,1].plot( index, data, 'r' )
    axs[1,1].set_xlabel('Leverage')
    axs[1,1].set_ylabel('Standardized residuals')
    axs[1,1].set_title('Residuals vs Leverage')

    # label the three observations with the largest Cook's distance;
    # look rows up by index label, since the training index is shuffled
    cooks_top_3 = df['cooks'].nlargest(3).index
    for i in cooks_top_3:
        axs[1,1].annotate( i, xy=(df.loc[i, 'leverage'],
                                  df.loc[i, 'resid_std']) )

    plt.tight_layout()
    plt.show()

In [None]:
mod_kitchensink, res = linear_regression( df=train, x_cols=['Food', 'Decor', 'Service', 'East'], y_cols=["Price"] )
plot_diagnosis( res )

### Multicollinearity

**Multicollinearity** - we have to consider that the predictors may not only be related to the target variable, but may also be related to each other.  

**Statistical Learning** - multicollinearity inflates the standard errors of the coefficient estimates, which makes the model difficult to interpret  
**Machine Learning** - multicollinearity generally does not degrade the model's predictions, as long as new data show the same relationships among the predictors

We can evaluate multicollinearity with the variance inflation factor (VIF):

In [None]:
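from statsmodels.stats.outliers_influence import variance_inflation_factor

# include an intercept column so each VIF is computed against a model with a constant
X_vif = sm.add_constant( X_train )

vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns
# skip column 0 (the constant) when reporting
vif_data['VIF'] = [variance_inflation_factor( X_vif.values, i ) for i in range( 1, X_vif.shape[1] ) ]
vif_data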
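
To show what the VIF measures, here is a minimal sketch that computes it by hand: each predictor is regressed on the remaining predictors (with an intercept), and the VIF is 1 / (1 - R²) of that regression. The helper `vif_by_hand` and the synthetic columns `x1`, `x2`, `x3` are hypothetical names introduced for illustration; this is not the `statsmodels` implementation.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X, computed as 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_pred = X.shape
    vifs = []
    for j in range(n_pred):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n_obs), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors (illustrative assumption)
rng = np.random.default_rng(2)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.5, size=1000)  # strongly related to x1
x3 = rng.normal(size=1000)                  # unrelated to the others
vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

The two related columns get a noticeably larger VIF than the unrelated one, whose VIF stays close to 1.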

### Use the Model to Predict Unseen data

Predict on the `X_test` predictors, then evaluate against the known `y_test` outcomes.

In [None]:
fitted_mlr.predict( X_test )
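
One way to turn predictions into a performance score is to compare them with the known outcomes using standard error metrics. The sketch below is a minimal example, assuming RMSE, MAE, and R² are the metrics of interest; the helper `regression_metrics` is a hypothetical name, and the four numbers passed to it are made up rather than taken from the restaurant data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors**2))
    mae = np.mean(np.abs(errors))
    # R^2: share of the variance in y_true explained by the predictions
    r2 = 1 - np.sum(errors**2) / np.sum((y_true - y_true.mean())**2)
    return {'rmse': rmse, 'mae': mae, 'r2': r2}

# Tiny made-up example (not restaurant data)
example = regression_metrics([40, 45, 50, 55], [42, 44, 49, 57])
print(example)
```

In this notebook the same helper could be called as `regression_metrics( y_test, fitted_mlr.predict( X_test ) )`. Because the test split played no part in fitting, these numbers estimate how well the model predicts new restaurants.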

## DS4VS Happy Hour 🍷 .....kinda

Here is a toy dataset with several feature variables describing types of wine.  
Try to implement a kitchen sink model with the data to predict the 'quality'

In [None]:
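# local path: point this at your own copy of winequality_red.csv
path = '/home/bonzilla/Documents/ScienceLife/DS4VS/datasets/winequality_red.csv'
wine = pd.read_csv( path, encoding= 'unicode_escape' )
print( wine.head(), '\n\n' )
wine.info()  # info() prints its own report and returns None, so no print() is needed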

<img src="https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Week10/flowchart.png" width="60%" style="margin-left:auto; margin-right:auto">


## Wonderful! We just situated simple linear regression within the broader Linear Regression family. Next we will look at a completely different regression approach...
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">