# Metrics and Cross Validation Exercise

#### This exercise corresponds to the Metrics and Cross Validation section of Thinkful Unit 4, Lesson 1. Using the loans data from earlier exercises, we fit a linear regression model, perform 10-fold cross validation, and compute mean absolute error, mean squared error, and R-squared for comparison.

In [38]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold
%matplotlib inline

###### Load data:

In [40]:
loansData = pd.read_csv('https://github.com/Thinkful-Ed/curric-data-001-data-sets/raw/master/loans/loansData.csv')

#### We are trying to model interest rate using loan amount and FICO score. Let's examine these variables.

###### We see that interest rate includes percent symbols that need removing.

In [42]:
loansData['Interest.Rate'][0:5]

81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object

###### The loan amount is fine as is.

In [43]:
loansData['Amount.Requested'][0:5]

81174    20000
99592    19200
80059    35000
15825    10000
33182    12000
Name: Amount.Requested, dtype: int64

###### As in a previous lesson, we set the FICO score to the lowest value in each range. There are probably better ways to approach this, so keep that limitation in mind.

In [45]:
loansData['FICO.Range'][0:5]

81174    735-739
99592    715-719
80059    690-694
15825    695-699
33182    695-699
Name: FICO.Range, dtype: object

In [47]:
cleanInterestRate = loansData['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')), 4))
loansData['FICO.Score'] = [int(val.split('-')[0]) for val in loansData['FICO.Range']]
loansData['Interest.Rate'] = cleanInterestRate

###### Check that we made the correct changes to those variables:

In [49]:
loansData['Interest.Rate'][0:5]

81174     8.90
99592    12.12
80059    21.98
15825     9.99
33182    11.71
Name: Interest.Rate, dtype: float64

In [19]:
loansData['FICO.Score'][0:5]

81174    735
99592    715
80059    690
15825    695
33182    695
Name: FICO.Score, dtype: int64
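
The cleaning step above can also be written with pandas' vectorized string methods. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame and the `FICO.Midpoint` column are illustrative names, not part of the loans data). It also shows the range midpoint as one possible alternative to the lower bound; whether that is actually a better choice for this model is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical sample that mimics the raw column formats shown above
sample = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})

# Strip the trailing percent sign and convert the strings to floats
sample['Interest.Rate'] = sample['Interest.Rate'].str.rstrip('%').astype(float)

# Split each range into its lower and upper bounds (columns 0 and 1)
bounds = sample['FICO.Range'].str.split('-', expand=True).astype(int)
sample['FICO.Score'] = bounds[0]                # lower bound, as in the cell above
sample['FICO.Midpoint'] = bounds.mean(axis=1)   # illustrative alternative

print(sample)
```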

#### Now we will define the outcome variable and feature space.

In [51]:
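# Outcome: interest rate as a NumPy array (as_matrix() was removed from pandas)
y = loansData['Interest.Rate'].to_numpy()

# Features: loan amount and FICO score (np.int was removed from NumPy; use int)
loans_features = loansData[['Amount.Requested', 'FICO.Score']]
X = loans_features.to_numpy().astype(int)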

###### Import what we need for metric computation, modeling, and cross validation.

In [52]:
from sklearn.metrics import mean_absolute_error as mas, mean_squared_error as mse, r2_score as r2s
from sklearn.model_selection import cross_val_score
from sklearn import linear_model

###### Define the linear model and set the cv object:

In [30]:
lm = linear_model.LinearRegression(fit_intercept=True)
cv = KFold(n_splits=10)
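
To see what the `cv` object does, here is a small sketch of how `KFold` partitions row indices. It uses a hypothetical ten-row array (`X_demo`) and five splits so the output stays readable. Note that `KFold` does not shuffle by default, so each test fold is a contiguous block of rows; if the rows of a data set are ordered in some meaningful way, passing `shuffle=True` with a `random_state` may be worth considering.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: ten rows, one feature
X_demo = np.arange(10).reshape(-1, 1)

# Five folds, no shuffling (the default)
kf = KFold(n_splits=5)
folds = list(kf.split(X_demo))

# Each fold holds out a different block of rows for testing
for i, (train_idx, test_idx) in enumerate(folds):
    print("fold", i, "train:", train_idx, "test:", test_idx)
```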

In [31]:
score_metrics = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']

###### Compute each metric on every fold; the average across folds is our estimate of the model's performance.

#### My interpretation of the output below: R-squared is the proportion of the variance in the outcome that the predictors explain, so on average across the folds about 65% of the variation in interest rate is explained by FICO score and loan amount. scikit-learn reports error metrics negated so that higher is always better, which means the minus signs are a scoring convention and say nothing about the direction of the errors. The mean absolute error shows that predictions miss the actual interest rate by about 1.94 percentage points on average, in either direction. The mean squared error of about 6.01 is in squared percentage points; its square root, about 2.45 percentage points, is the more comparable figure. The +/- values are two standard deviations of the score across the folds.

In [36]:
for s in score_metrics:
    scores = cross_val_score(lm, X, y, scoring=s, cv=cv)
    print(str(s) + ": %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

neg_mean_absolute_error: -1.94 (+/- 0.10)
neg_mean_squared_error: -6.01 (+/- 0.82)
r2: 0.65 (+/- 0.05)
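
To make the three metrics reported above concrete, the sketch below computes them by hand on a hypothetical set of four actual and predicted rates (`y_true` and `y_pred` are made-up values, not results from the loans model) and checks them against `sklearn.metrics`. It also shows that the error metrics themselves are never negative; the minus signs in the cross-validation output come from scikit-learn's scorers, which negate errors so that larger scores are better.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted interest rates
y_true = np.array([8.0, 12.0, 22.0, 10.0])
y_pred = np.array([10.0, 11.0, 19.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # average size of a miss, in the outcome's units
mse = np.mean(errors ** 2)       # average squared miss, in squared units
rmse = np.sqrt(mse)              # back in the outcome's units

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's metric functions
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print("MAE: %0.2f  MSE: %0.2f  RMSE: %0.2f  R2: %0.2f" % (mae, mse, rmse, r2))
```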


