# LINEAR REGRESSION

## Import Modules

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import numpy as np
import statsmodels.api as sm

## Read in Data

In [17]:
loansData = pd.read_csv('https://github.com/Thinkful-Ed/curric-data-001-data-sets/raw/master/loans/loansData.csv')

## Examine Data for Cleanliness

In [18]:
loansData['Interest.Rate'][0:5]

81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object

### The values are stored as strings with a trailing percent symbol, so they must be converted to numbers prior to analysis.

In [19]:
cleanInterestRate = loansData['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')) / 100, 4))
loansData['Interest.Rate'] = cleanInterestRate

### The code above strips the percent symbol, converts each value to a decimal fraction rounded to four places (8.90% becomes 0.089), and puts the clean column back into the dataset.
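
As an alternative to the element-wise `map`, the same cleaning can be sketched with pandas' vectorized string methods. The sample values below are a hypothetical stand-in copied from the output above, not the full dataset, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample mirroring the raw 'Interest.Rate' strings shown above
rates = pd.Series(['8.90%', '12.12%', '21.98%'])

# Strip the trailing percent sign, convert to float, scale to a fraction,
# and round to four decimal places
clean_rates = (rates.str.rstrip('%').astype(float) / 100).round(4)
print(clean_rates.tolist())
```

Both approaches should give the same column; the vectorized form simply avoids calling a Python lambda once per row.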

In [20]:
loansData['Loan.Length'][0:5]

81174    36 months
99592    36 months
80059    60 months
15825    36 months
33182    36 months
Name: Loan.Length, dtype: object

### The months label should be removed for analysis. 

In [21]:
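# str.strip(' months') strips a set of characters, not the literal suffix, so take the leading number instead
cleanLoanLength = loansData['Loan.Length'].map(lambda y: int(y.split()[0]))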
loansData['Loan.Length'] = cleanLoanLength

### The code above removes the months label from the loan length variable and inserts the clean data column into the dataset. 
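
One way to make this step less dependent on the exact wording of the label is to pull out the leading digits with a regular expression. This is a minimal sketch on hypothetical sample values taken from the output above; note that `str.strip` treats its argument as a set of characters to remove from both ends rather than as a literal suffix, which is why a pattern match can be a safer choice.

```python
import pandas as pd

# Hypothetical sample mirroring the raw 'Loan.Length' strings shown above
lengths = pd.Series(['36 months', '60 months'])

# Capture the first run of digits in each string and convert it to an integer
clean_lengths = lengths.str.extract(r'(\d+)', expand=False).astype(int)
print(clean_lengths.tolist())
```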

In [22]:
loansData['FICO.Range'][0:5]

81174    735-739
99592    715-719
80059    690-694
15825    695-699
33182    695-699
Name: FICO.Range, dtype: object

### A range stored as a string cannot be used directly as a numeric predictor, so it must be converted as well.

In [23]:
loansData['FICO.Score'] = [int(val.split('-')[0]) for val in loansData['FICO.Range']]

### This line of code splits each FICO range on the hyphen and keeps the first item, the lower bound of the range, as an integer FICO score.
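
The split can also be sketched with the vectorized `str.split`, which makes both bounds of the range available at once. The sample values come from the output above; the column names `low` and `high` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the raw 'FICO.Range' strings shown above
fico_range = pd.Series(['735-739', '715-719', '690-694'])

# Split on the hyphen into two columns and convert both to integers
bounds = fico_range.str.split('-', expand=True).astype(int)
bounds.columns = ['low', 'high']  # hypothetical column names

# Use the lower bound as the score, matching the list comprehension above
fico_score = bounds['low']
print(fico_score.tolist())
```

Keeping the upper bound around makes it easy to try a midpoint instead of the lower bound, should that be preferred.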

In [24]:
intrate = loansData['Interest.Rate']
loanamt = loansData['Amount.Requested']
fico = loansData['FICO.Score']

In [25]:
# the dependent variable, shaped as a column
y = np.matrix(intrate).transpose()

# the independent variables, shaped as columns
x1 = np.matrix(fico).transpose()
x2 = np.matrix(loanamt).transpose()

In [26]:
x = np.column_stack([x1,x2])

In [32]:
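# add an intercept column, then create and fit the linear model
x = sm.add_constant(x)
model = sm.OLS(y,x)
f = model.fit()
# display the regression summary (x1 is the FICO score, x2 is the loan amount)
f.summary()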

Dep. Variable:      y                   R-squared:             0.657
Model:              OLS                 Adj. R-squared:        0.656
Method:             Least Squares       F-statistic:           2388.0
Date:               Wed, 18 Jan 2017    Prob (F-statistic):    0.0
Time:               21:06:49            Log-Likelihood:        5727.6
No. Observations:   2500                AIC:                   -11450.0
Df Residuals:       2497                BIC:                   -11430.0
Df Model:           2
Covariance Type:    nonrobust

         coef         std err    t          P>|t|    [95.0% Conf. Int.]
const    0.7288       0.010      73.734     0.000    0.709 0.748
x1       -0.0009      1.4e-05    -63.022    0.000    -0.001 -0.001
x2       2.107e-06    6.3e-08    33.443     0.000    1.98e-06 2.23e-06

Omnibus:            69.496              Durbin-Watson:         1.979
Prob(Omnibus):      0.0                 Jarque-Bera (JB):      77.811
Skew:               0.379               Prob(JB):              1.27e-17
Kurtosis:           3.414               Cond. No.              296000.0


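### The linear regression output shows that both predictors, the FICO score (x1) and the loan amount (x2), are significantly related to the outcome, interest rate (both p-values are reported as 0.000). Because the interest rate was converted to a decimal fraction, the coefficients are in those units. For every one-point increase in the FICO score (holding the loan amount constant), the interest rate decreases by about 0.0009, or roughly 0.09 percentage points. Holding the FICO score constant, each additional dollar requested is associated with an increase of about 2.107e-06 in the interest rate, or roughly 0.21 percentage points per $1,000 requested.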
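
To make the coefficient interpretation concrete, here is a small arithmetic sketch that plugs the coefficients from the summary table into the fitted equation. The function name `predicted_rate` and the example borrower values are hypothetical, and because the table prints the FICO coefficient to only one significant digit, the results are approximate.

```python
# Coefficients as printed (rounded) in the regression summary above
CONST = 0.7288
B_FICO = -0.0009       # change in interest rate per FICO point
B_AMOUNT = 2.107e-06   # change in interest rate per dollar requested


def predicted_rate(fico, amount):
    """Approximate fitted interest rate as a decimal fraction."""
    return CONST + B_FICO * fico + B_AMOUNT * amount


# Hypothetical borrowers used only to illustrate the two effects
base = predicted_rate(700, 10000)
higher_fico = predicted_rate(710, 10000)   # 10 more FICO points, same amount
larger_loan = predicted_rate(700, 20000)   # same FICO, $10,000 more requested

print(round(higher_fico - base, 4))  # effect of 10 FICO points
print(round(larger_loan - base, 4))  # effect of an extra $10,000
```

Multiplying the differences by 100 converts them to percentage points, which is usually the easier scale to read for interest rates.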

### The R-squared of the model is 0.657, which means the model explains about 66% of the variance in the outcome variable, interest rate.
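
For readers who want to see where R-squared comes from, the following sketch computes it by hand as one minus the ratio of residual to total sum of squares. It uses synthetic data with made-up coefficients rather than the loans dataset, and fits the model with NumPy's least-squares solver instead of statsmodels, so the numbers are not comparable to the summary above.

```python
import numpy as np

# Synthetic data: two predictors plus noise (all values are made up)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, as sm.add_constant would provide
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares)
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

With an intercept in the model, this value equals the squared correlation between the fitted values and the observed outcome, which is one way to sanity-check the calculation.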