### This notebook will use modelling software to generate the model coefficients a0, a1 and a2 to investigate FICO Score and Loan Amount as predictors of Interest Rate
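
For reference, the model being fitted can be written out explicitly. This is a sketch of the standard multiple linear regression form; the error term $\varepsilon$ is an addition for completeness and is not named elsewhere in this notebook.

$$
\text{Interest Rate} = a_0 + a_1 \cdot \text{FICO Score} + a_2 \cdot \text{Loan Amount} + \varepsilon
$$

Here $a_0$ is the intercept, and $a_1$ and $a_2$ are the slopes for FICO Score and Loan Amount.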

In [44]:
%pylab inline
import pylab as pl
import numpy as np
import pandas as pd
import statsmodels.api as sm

# import the cleaned up dataset
df = pd.read_csv('/home/sophie/projects/LendingClub/data/clean_LD.csv')

intrate = df['Interest.Rate']
loanamt = df['Amount.Requested']
fico = df['FICO.Score']

# reshape the data from pandas Series to column matrices
# the dependent variable
# np.matrix creates a 2D (1,1867) row; T turns it into a (1867,1) column
y = np.matrix(intrate).T # .T is shorthand for .transpose()

# the independent variables shaped as columns
x1 = np.matrix(fico).transpose()
x2 = np.matrix(loanamt).transpose()

# put the two columns together to create an input matrix
# if we had n independent variables we would have n columns here
x = np.column_stack([x1,x2])  # column_stack takes a sequence of 1-D or 2-D arrays and stacks them as columns.

print x[0:2,0] # to access x1
print x[0:2,1] # to access x2

# create a linear model and fit it to the data
X = sm.add_constant(x) # adds a column of 1s (the first column) to the x (2D stacked data)
model = sm.OLS(y,X)    # creates an ordinary least squares model. y = response variable; X = input matrix, which must already include the intercept column.

# f is a RegressionResults class instance. The list of attributes is found
# here http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.html
f = model.fit() # fit is one of the methods which can be applied to an OLS object

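print 'Coefficients: ', f.params[0:2] # NOTE: add_constant put the constant first, so this slice is actually a0 (intercept) and a1 (FICO Score)
print 'Intercepts: ', f.params[2] # NOTE: this is actually a2 (Loan Amount); the intercept a0 is f.params[0]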
print 'P-Values: ', f.pvalues
print 'R-Squared: ', f.rsquared



Populating the interactive namespace from numpy and matplotlib
[[735]
 [715]]
[[20000]
 [19200]]
[[  1.00000000e+00   7.35000000e+02   2.00000000e+04]
 [  1.00000000e+00   7.15000000e+02   1.92000000e+04]
 [  1.00000000e+00   6.95000000e+02   1.00000000e+04]
 ..., 
 [  1.00000000e+00   6.80000000e+02   1.00000000e+04]
 [  1.00000000e+00   6.75000000e+02   6.00000000e+03]
 [  1.00000000e+00   6.70000000e+02   9.00000000e+03]] [[  735 20000]
 [  715 19200]
 [  695 10000]
 ..., 
 [  680 10000]
 [  675  6000]
 [  670  9000]]
Coefficients:  [ 0.7232804  -0.00087589]
Intercepts:  1.97716000896e-06
P-Values:  [  0.00000000e+00   0.00000000e+00   3.00521465e-98]
R-Squared:  0.644760522744


Using `%matplotlib inline` instead of `%pylab inline` avoids importing * from pylab and numpy into the interactive namespace.


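`f.params` follows the column order of `X`. Because `add_constant` put the column of 1s first, `f.params[0]` is the intercept $a_0$, `f.params[1]` is $a_1$ (FICO Score) and `f.params[2]` is $a_2$ (Loan Amount).
The labels printed above are therefore shifted: the two values under Coefficients are $a_0 \approx 0.7233$ and $a_1 \approx -0.000876$, and the value under Intercepts is $a_2 \approx 1.98 \times 10^{-6}$.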
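
The link between column order and parameter order can be checked on made-up data. The sketch below is an illustration only: it uses `np.linalg.lstsq` in place of statsmodels, and the data, the `true_a*` values and all variable names are hypothetical, not taken from the Lending Club dataset.

```python
import numpy as np

# Synthetic, noise-free data with known coefficients (hypothetical values).
rng = np.random.RandomState(0)
n = 200
fico = rng.uniform(640, 830, size=n)
loan = rng.uniform(1000, 35000, size=n)
true_a0, true_a1, true_a2 = 0.70, -0.0009, 2e-06
rate = true_a0 + true_a1 * fico + true_a2 * loan

# Column of 1s first, then FICO, then loan amount: the same layout
# that the printed X above shows.
X = np.column_stack([np.ones(n), fico, loan])

# Least-squares solution; the parameters come back in column order.
params = np.linalg.lstsq(X, rate, rcond=None)[0]
a0, a1, a2 = params

print("a0 = %.4f, a1 = %.6f, a2 = %.2e" % (a0, a1, a2))
```

The first parameter recovers the intercept because the constant is the first column; moving the column of 1s to the end would move the intercept to the end of `params` as well.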

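Next, we need to work out how reliable the numbers are.
A p-value is the probability of seeing an estimate at least this far from zero if the true coefficient were zero, so to be confident in a coefficient we want its p-value to be close to 0.
The convention is p < 0.05. If it is higher, we have less confidence in using that variable for modelling and predicting. Here all three p-values are far below 0.05.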
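
As a small sketch, the p < 0.05 convention can be applied to the p-values printed above. The dictionary keys are labels chosen here for illustration, assuming the values are in the column order of `X` (constant, FICO Score, Loan Amount).

```python
# p-values copied from the output above, assumed to be in the column order of X.
p_values = {
    "const": 0.0,
    "FICO.Score": 0.0,
    "Amount.Requested": 3.00521465e-98,
}

ALPHA = 0.05  # conventional significance threshold

# True where the p-value falls below the threshold.
significant = {name: p < ALPHA for name, p in p_values.items()}

for name in ["const", "FICO.Score", "Amount.Requested"]:
    print("%s: p = %.3g, below %.2f: %s"
          % (name, p_values[name], ALPHA, significant[name]))
```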

$R^2$ : the fraction of the variance in the data that is captured by the model. Here it is about 0.64.      
$R$ : the coefficient of correlation between the dependent variable Y and the values the model predicts from the separate X's. A correlation lies between -1 and 1, so $R^2$ lies between 0 and 1.
We want a high $R^2$.
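
To make the two definitions concrete, here is a sketch on synthetic data (all values and names are hypothetical, and `np.linalg.lstsq` stands in for statsmodels). It computes $R^2$ as one minus the ratio of residual to total sum of squares, and compares it with the squared correlation between the observed and fitted values.

```python
import numpy as np

# Synthetic data with noise (hypothetical coefficients and noise level).
rng = np.random.RandomState(1)
n = 300
x1 = rng.uniform(640, 830, size=n)
x2 = rng.uniform(1000, 35000, size=n)
y = 0.70 - 0.0009 * x1 + 2e-06 * x2 + rng.normal(0, 0.02, size=n)

# Fit by least squares with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
params = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X.dot(params)

# R^2 from sums of squares.
ss_res = np.sum((y - y_hat) ** 2)     # variance left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variance around the mean
r_squared = 1.0 - ss_res / ss_tot

# R as the correlation between observed and fitted values.
r = np.corrcoef(y, y_hat)[0, 1]

print("R^2 from sums of squares: %.6f" % r_squared)
print("R^2 from correlation:     %.6f" % (r ** 2))
```

For a least-squares fit that includes an intercept, the two numbers agree up to rounding.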

We have created a multiple linear regression model for Interest Rate. With $R^2 \approx 0.64$ and all three p-values far below 0.05, FICO Score and Loan Amount together capture about 64% of the variance in Interest Rate.
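
As a rough sketch of how the fitted parameters could be used, the function below plugs the values printed above into the linear model. It assumes that the constant was the first column of `X` (so 0.7232804 is the intercept) and that Interest Rate is stored as a fraction rather than a percentage; `predict_rate` is a hypothetical helper, not part of statsmodels.

```python
# Parameters copied from the output above, in the column order of X.
A0 = 0.7232804          # intercept
A1 = -0.00087589        # change in rate per FICO point
A2 = 1.97716000896e-06  # change in rate per unit of loan amount

def predict_rate(fico_score, loan_amount):
    """Return the interest rate predicted by the linear model."""
    return A0 + A1 * fico_score + A2 * loan_amount

# First row of the data shown above: FICO 735, amount requested 20000.
predicted = predict_rate(735, 20000)
print("Predicted interest rate: %.4f" % predicted)
```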
