## Regression Functions for Homework 4

This notebook contains code to create the regression models needed for homework 4.

Please read __ALL__ the comments in the code and the headings. This notebook is NOT intended to be run as a script from top to bottom, although there are some code cells that need to be run first.
- The general utility libraries need to be loaded first
- Then you need to execute the load data and engineer features code cells
- Finally, execute the create X and y from the features code cells

You can use the grid search implementations for each algorithm listed below, tuning the parameters as defined by the Scikit-Learn documentation.

The description box above each algorithm contains a link to sklearn's documentation on that algorithm. Please use that for parameter tuning.

In [None]:
# Load general utilities
# ----------------------
import pandas as pd
from scipy import stats
import datetime
import math
import numpy as np
import pickle

### PREP AND PREPROCESSING SECTION

###  Load the data and engineer features

In [None]:
# This is the code you can use to open your pickle file
# Read the data and features from the pickle
data, discrete_features, continuous_features, ret_cols = pickle.load( open( "Data/clean_data.pickle", "rb" ) )

In [None]:
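# Create a feature for the length of a person's credit history (in months)
# at the time the loan is issued.
# Some recent pandas versions reject np.timedelta64(1, 'M'), so convert to days
# and divide by the average month length (365.2425 / 12 days) instead.
data['cr_hist'] = (data.issue_d - data.earliest_cr_line) / np.timedelta64(1, 'D') / (365.2425 / 12)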
continuous_features.append('cr_hist')

#### If you want to use a smaller sample of the data due to time constraints, use the following code

In [None]:
# this code randomly samples 55% of the rows
# change the frac parameter if you want a different % to sample
# replace=False ensures we won't select the same row twice
data = data.sample(frac=.55, replace=False).copy()
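The sampling call above draws different rows on every run. If you want a repeatable subsample, here is a minimal sketch using a toy DataFrame in place of `data`. The `toy` frame and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` DataFrame (hypothetical values)
toy = pd.DataFrame({"loan_amnt": range(100)})

# Passing the same random_state returns the same rows each time
sample_a = toy.sample(frac=0.55, replace=False, random_state=42)
sample_b = toy.sample(frac=0.55, replace=False, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```

Any integer works as the seed. Fixing it simply makes later model comparisons easier to reproduce.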

### Function to Calculate P-Values

In [None]:
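from sklearn.metrics import mean_squared_error

def getPValues(model, X_test, y_test):
    # Stack the intercept and the coefficients into one parameter vector
    params = np.append(model.intercept_, model.coef_)
    predictions = model.predict(X_test)

    # Design matrix with a leading column of ones for the intercept
    newX = pd.DataFrame({"Constant": np.ones(len(X_test))}).join(pd.DataFrame(X_test.reset_index(drop=True)))

    mse = mean_squared_error(y_test, predictions)

    # Coefficient variances: mse * diag((X'X)^-1).
    # pinv still returns a result when the dummy columns are collinear.
    var_b = mse * (np.linalg.pinv(np.dot(newX.T, newX)).diagonal())
    sd_b = np.sqrt(var_b)
    ts_b = params / sd_b
    # Two-sided p-values from the t distribution
    p_values = [2 * (1 - stats.t.cdf(np.abs(i), (len(newX) - len(newX.columns) - 1))) for i in ts_b]
    sd_b = np.round(sd_b, 3)
    ts_b = np.round(ts_b, 3)
    p_values = np.round(p_values, 8)
    params = np.round(params, 4)

    # One row per parameter, labelled with the feature names
    df = pd.DataFrame()
    df["Coeff"], df["SE"], df["t val"], df["Probs"] = [params, sd_b, ts_b, p_values]
    names = ['Intercept']
    names.extend(list(X_test))
    df.index = names
    return df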
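To see what the function above is computing, here is a minimal sketch of the textbook OLS t-test on synthetic data. The data, the seed, and the variable names are assumptions for illustration. Unlike `getPValues`, this sketch estimates the error variance with the residual degrees of freedom (n minus the number of parameters), so its numbers will not match the function exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
# The second column has no true effect on y
y = 1.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residual variance with n - p degrees of freedom
resid = y - A @ beta
dof = n - A.shape[1]
sigma2 = resid @ resid / dof

# Standard errors, t statistics and two-sided p-values
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

print(np.round(beta, 3))
print(p_values)
```

The p-value for the first feature should be essentially zero, while the p-value for the second feature depends on the random draw.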

### Create X and y from the features

In [None]:
from sklearn.preprocessing import MinMaxScaler

def minMaxScaleContinuous(continuousList):
    # Scale each continuous column of the global `data` to the [0, 1] range
    return pd.DataFrame(MinMaxScaler().fit_transform(data[continuousList]),
                        columns=list(data[continuousList].columns),
                        index=data[continuousList].index)

def createDiscreteDummies(discreteList):
    # One-hot encode the discrete columns; dummy_na=True adds a column for missing values
    return pd.get_dummies(data[discreteList], dummy_na=True, prefix_sep="::", drop_first=False)
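The two helpers above depend on the notebook's global `data`. To show what they produce, here is a small sketch that applies the same calls to a toy DataFrame. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for `data` with one continuous and one discrete column
toy = pd.DataFrame({
    "annual_inc": [20000.0, 50000.0, 80000.0],
    "term": ["36 months", "60 months", np.nan],
})

# Same scaling call as minMaxScaleContinuous: smallest value -> 0, largest -> 1
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[["annual_inc"]]),
    columns=["annual_inc"],
    index=toy.index,
)

# Same encoding call as createDiscreteDummies: one column per level plus one for NaN
dummies = pd.get_dummies(toy[["term"]], dummy_na=True, prefix_sep="::", drop_first=False)

print(scaled)
print(list(dummies.columns))
```

Note that `drop_first=False` keeps every level, so the dummy columns for one feature always sum to 1 across a row.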

#### VERY IMPORTANT STEP
You need to define which features to use in the modeling. The homework will direct you either to use all the features from the data ingestion and cleaning process or to remove some features because they are defined by LendingClub.

In [None]:
# define the discrete features you want to use in modeling.
# if you want to use all the discrete features, just set discrete_features_touse = discrete_features
discrete_features_touse =['purpose', 'term', 'verification_status', 'emp_length', 'home_ownership']

# define the continuous features to use in modeling
# if you want to use all the continuous features, just set the continuous_features_touse = continuous_features
continuous_features_touse = ['loan_amnt', 'funded_amnt','installment','annual_inc','dti','revol_bal','delinq_2yrs','open_acc',
 'pub_rec','fico_range_high','fico_range_low','revol_util','cr_hist']

In [None]:
discrete_features_touse=discrete_features
continuous_features_touse = continuous_features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Create dummies for categorical features and concatenate with continuous features for X or predictor dataframe

# Use this line of code if you do not want to scale the continuous features
#X_continuous = data[continuous_features_touse]

# use this line if you want to scale the continuous features using the MinMaxScaler in the function defined above
X_continuous = minMaxScaleContinuous(continuous_features_touse)

# create numeric dummy features for the discrete features to be used in modeling
X_discrete = createDiscreteDummies(discrete_features_touse)

#concatenate the continuous and discrete features into one dataframe
X = pd.concat([X_continuous, X_discrete], axis = 1)

# this is the target variable 
# 'ret_PESS', 'ret_OPT', 'ret_INTa', 'ret_INTb'

# Use this line of code if you do not want to scale the ret_cols
y=data['ret_PESS']

# use this line if you want to scale the ret_cols using the MinMaxScaler in the function defined above
#ret_data = minMaxScaleContinuous(ret_cols)
#y=ret_data['ret_PESS']

# create a test and train split of the transformed data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.3)
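The cells above fit the `MinMaxScaler` on all rows before calling `train_test_split`, so the test rows influence the scaling range. If that matters for your analysis, one alternative is to split first and fit the scaler on the training rows only. This is a sketch on synthetic data, not something the homework prescribes; the `toy_X` and `toy_y` names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, size=50),
    "dti": rng.uniform(0, 40, size=50),
})
toy_y = pd.Series(rng.normal(size=50))

# Split first, then learn the min and max from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=0, test_size=.3)
scaler = MinMaxScaler().fit(X_tr)

# Apply the training-set scaling to both parts
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.describe().loc[["min", "max"]])
```

With this approach, scaled test values can fall slightly outside the [0, 1] range when a test row is more extreme than anything in the training set.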


### Important functions to save a model to pickle and load it later
Training of these models takes time. It is advisable to save the model as a pickle after you've trained it to your satisfaction so you can use it later for comparison without having to re-train it.

The code after the function definitions provides an __example__ of how to use them.

In [None]:
import joblib
# save the model to disk
def saveModel(filename, model):
    joblib.dump(model, filename)
 
 
# load the model from disk
def loadModel(filename):
    return joblib.load(filename)

In [None]:
# save the model to disk
saveModel('mlr_model', mlr_model)

In [None]:
# load the model from disk
mlr_model = loadModel('mlr_model')
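The save and load cells above assume `mlr_model` has already been trained. A quick way to convince yourself that the round trip works is to save a small model, load it back, and compare predictions. This sketch uses a toy model and a temporary directory; the file name `toy_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny model fitted on y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original
same = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```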

### Linear Regression
This is a multiple linear regression.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.linear_model import LinearRegression

mlr_model = LinearRegression(n_jobs=-1).fit(X_train, y_train)

print("Training set score: {:.5f}".format(mlr_model.score(X_train, y_train)))
print("Test set score: {:.5f}".format(mlr_model.score(X_test, y_test)))

print("mlr.coef_:", mlr_model.coef_)
print("mlr.intercept_:", mlr_model.intercept_)

In [None]:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

predictions = mlr_model.predict(X_test)
score = explained_variance_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
rmse = math.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

print("score = {:.5f} | MAE = {:.5f} | RMSE = {:.5f} | R2 = {:.5f}".format(score, mae, rmse, r2))
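For reference, the MAE, RMSE and R2 values reported above can be reproduced by hand. The sketch below checks the manual formulas against sklearn on a few made-up return values; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted returns
y_true = np.array([0.02, -0.10, 0.05, 0.08, -0.03])
y_pred = np.array([0.01, -0.05, 0.04, 0.06, 0.00])

resid = y_true - y_pred

# Mean absolute error: average size of the residuals
mae_manual = np.mean(np.abs(resid))
# Root mean squared error: square root of the average squared residual
rmse_manual = np.sqrt(np.mean(resid ** 2))
# R2: 1 minus residual sum of squares over total sum of squares
r2_manual = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

RMSE penalizes large errors more heavily than MAE, which is worth keeping in mind when the two metrics rank models differently.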

In [None]:
getPValues(mlr_model, X_train, y_train)

### Ridge Regression GridsearchCV
This is an example grid search with cross validation using ridge regression. Please note the following:
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of R2. You can change that to another scoring method.
'''

parameters = {'alpha' : [0.001, 0.01, 0.1, 1]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(Ridge(max_iter=100000), parameters, cv=5, return_train_score=True, scoring='r2', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
ridge_model = grid.fit(X_train, y_train)

print('best params ',ridge_model.best_params_,'\n')
print('best estimator ',ridge_model.best_estimator_,'\n')
print('best validation score ', ridge_model.best_score_,'\n')
print('scoring method ', ridge_model.scorer_)

print("Test set R2 score: {:.7f}".format(ridge_model.score(X_test, y_test)))

saveModel('ridge_model', ridge_model)
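Beyond the best parameters, the fitted grid search keeps the score for every candidate in its `cv_results_` attribute. Here is a minimal sketch, on synthetic data, of turning that into a table so you can compare training and validation scores per `alpha`. The data and the `toy_grid` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

toy_grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10]}, cv=5,
                        return_train_score=True, scoring="r2")
toy_grid.fit(X_toy, y_toy)

# One row per candidate alpha, best validation score first
results = pd.DataFrame(toy_grid.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate suggests overfitting at that setting.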

### LASSO Regression GridsearchCV
This is an example grid search with cross validation using lasso regression. Please note the following:
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of R2. You can change that to another scoring method.
'''

parameters = {'alpha' : [0.000000001, 0.00000001, 0.0000001]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(Lasso(max_iter=10000), parameters, cv=5, return_train_score=True, scoring='r2', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
ls_model = grid.fit(X_train, y_train)

print('best params ',ls_model.best_params_,'\n')
print('best estimator ',ls_model.best_estimator_,'\n')
print('best validation score ', ls_model.best_score_,'\n')
print('scoring method ', ls_model.scorer_)

print("Test set R2 score: {:.7f}".format(ls_model.score(X_test, y_test)))

saveModel('ls_model', ls_model)
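The example alphas above are very small, so the fitted lasso will behave much like plain linear regression. To see the effect of larger values, this sketch counts the non-zero coefficients for a few alphas on synthetic data where only the first feature matters. The data and the alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only the first column truly affects the target
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# Count how many coefficients survive at each regularization strength
nonzero_counts = {}
for alpha in [1e-4, 0.01, 0.5]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_toy, y_toy)
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

As alpha grows, the L1 penalty pushes the coefficients of uninformative features to exactly zero, which is why lasso is often used for feature selection.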

### ElasticNet Regression GridsearchCV
This is an example grid search with cross validation using ElasticNet regression. It is not required that you use this algorithm, but I included it in case you were interested in trying it. It blends L1 and L2 regularization using the l1_ratio parameter.

Please note the following:
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
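To make the role of `l1_ratio` concrete: at `l1_ratio=1.0` the penalty is pure L1, so ElasticNet should reproduce Lasso with the same `alpha`. The sketch below checks this on synthetic data; the data and the alpha value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Pure L1 penalty: ElasticNet with l1_ratio=1.0 versus Lasso
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)

# A blended penalty for comparison
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_toy, y_toy)

print(np.allclose(lasso.coef_, enet_l1.coef_))
print(lasso.coef_)
print(enet_mix.coef_)
```

Values of `l1_ratio` between 0 and 1 mix the two penalties, which is what the grid below searches over.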

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of R2. You can change that to another scoring method.
'''

parameters = {'alpha' : [0.000000001, 0.00000001, 0.01],
              'l1_ratio': [0.25, 0.5, 0.75]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(ElasticNet(max_iter=10000), parameters, cv=5, return_train_score=True, scoring='r2', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
en_model = grid.fit(X_train, y_train)

print('best params ',en_model.best_params_,'\n')
print('best estimator ',en_model.best_estimator_,'\n')
print('best validation score ', en_model.best_score_,'\n')
print('scoring method ', en_model.scorer_)

print("Test set R2 score: {:.7f}".format(en_model.score(X_test, y_test)))

saveModel('en_model', en_model)