# Gia Gillis

## Loan Interest Rate Analysis Part 3 of 3

Import necessary libraries.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

Function to replace null values.

In [10]:
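def replace_null_values(x, strategy):
    # Fill NaNs column by column using the given SimpleImputer strategy
    # (for example 'mean' or 'most_frequent').
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    x = imputer.fit_transform(x)
    # The imputer returns a NumPy array, so the rebuilt frame has
    # integer column labels (0, 1, 2, ...) instead of the original names.
    return pd.DataFrame.from_records(x)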
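
The helper above hands back a frame with integer column labels. A minimal sketch of a variant that keeps the original column names and index, assuming that is wanted downstream; `impute_keep_labels` and the toy frame are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_keep_labels(frame, strategy):
    """Impute NaNs column-wise and keep the frame's column names and index."""
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    values = imputer.fit_transform(frame)
    # Assumes no column is entirely NaN; SimpleImputer would drop such a column.
    return pd.DataFrame(values, columns=frame.columns, index=frame.index)


# Toy data, not taken from the loan file.
toy = pd.DataFrame({"Requested": [1000.0, np.nan, 3000.0],
                    "Ratio": [10.0, 20.0, np.nan]})
filled = impute_keep_labels(toy, "mean")
print(filled)
```

Keeping the labels makes later output such as `isnull().sum()` readable by column name rather than by position.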

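Load the cleaned data and convert the date strings to datetime objects. The Loan Id and Borrower Id columns are already absent from this file (see the column list below), so nothing is dropped here.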

In [11]:
clean_loans=pd.read_csv(r'C:\Users\Gia\Downloads\Analyst_Test\Analyst_Test\clean_loan_interest_rates.csv', parse_dates=True)
clean_loans['Loan Date']=pd.to_datetime(clean_loans['Loan Date'], format='%Y-%m-%d')
clean_loans['Credit Line Date']=pd.to_datetime(clean_loans['Credit Line Date'], format='%Y-%m-%d')
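
As an alternative to converting after loading, `read_csv` can parse named columns as dates directly. A small sketch on made-up in-memory CSV text (the rows and dates below are illustrative, not from the loan file):

```python
import io

import pandas as pd

# Made-up rows that only mimic the shape of the real file.
csv_text = (
    "Interest Rate,Loan Date,Credit Line Date\n"
    "0.1189,2013-01-01,1994-02-01\n"
    "0.1071,2013-02-01,2000-10-01\n"
)

# Passing a list of column names parses exactly those columns as datetimes.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Loan Date", "Credit Line Date"])
print(toy.dtypes)
```

Note that `parse_dates=True` on its own asks pandas to try parsing the index, not the named columns, which is why the explicit `pd.to_datetime` calls are still needed in the cell above.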

In [12]:
clean_loans.columns

Index(['Interest Rate', 'Requested', 'Funded', 'Investor Funded',
       'Number of Payments', 'Loan Grade', 'Loan Subgrade', 'Job',
       'Years Employed', 'Home', 'Annual Income', 'Income Verified',
       'Loan Date', 'Loan Cat', 'State', 'Ratio', 'Late Payments',
       'Credit Line Date', 'Months Del', 'Months PR', 'Derog Recs',
       'Credit Lines', 'Status'],
      dtype='object')

Dropping every row that contains a null value leaves less than half of the data, so the call below is commented out and null values are replaced instead.

In [13]:
#clean_loans.dropna(inplace=True)
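
One way to check how much data `dropna` would cost before committing to it is to compare row counts. A sketch on a toy frame (values are made up; the real check would run on `clean_loans`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for clean_loans.
toy = pd.DataFrame({
    "Interest Rate": [0.11, 0.10, 0.17, 0.13],
    "Loan Grade": ["B", None, "D", None],
    "Months Del": [0.0, 41.0, np.nan, 64.0],
})

rows_kept = len(toy.dropna())          # rows with no null in any column
fraction_kept = rows_kept / len(toy)
nulls_per_column = toy.isnull().sum()  # shows which columns drive the loss

print(f"rows kept after dropna: {rows_kept} of {len(toy)} ({fraction_kept:.0%})")
print(nulls_per_column)
```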

Separate data into features x and label y.

In [14]:
train_x = clean_loans.iloc[:,1:]
train_y = clean_loans.iloc[:,0]

In [15]:
train_x.head()

Unnamed: 0,Requested,Funded,Investor Funded,Number of Payments,Loan Grade,Loan Subgrade,Job,Years Employed,Home,Annual Income,...,Loan Cat,State,Ratio,Late Payments,Credit Line Date,Months Del,Months PR,Derog Recs,Credit Lines,Status
0,25000.0,25000.0,19080.0,36 months,B,B4,,< 1 year,RENT,85000.0,...,debt_consolidation,CA,19.48,0.0,1994-02-01,0.0,0.0,0.0,42.0,f
1,7000.0,7000.0,673.0,36 months,B,B5,Cnn,< 1 year,RENT,65000.0,...,credit_card,NY,14.29,0.0,2000-10-01,0.0,0.0,0.0,7.0,f
2,25000.0,25000.0,24725.0,36 months,D,D3,Web Programmer,1 year,RENT,70000.0,...,debt_consolidation,NY,10.5,0.0,2000-06-01,41.0,0.0,0.0,17.0,f
3,1200.0,1200.0,1200.0,36 months,C,C2,City Of Beaumont Texas,10+ years,OWN,54000.0,...,debt_consolidation,TX,5.47,0.0,1985-01-01,64.0,0.0,0.0,31.0,f
4,10800.0,10800.0,10692.0,36 months,C,C3,State Farm Insurance,6 years,RENT,32000.0,...,debt_consolidation,CT,11.63,0.0,1996-12-01,58.0,0.0,0.0,40.0,f


In [16]:
train_y.head()

0    0.1189
1    0.1071
2    0.1699
3    0.1311
4    0.1357
Name: Interest Rate, dtype: float64

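Divide the features x into numeric data and categorical data. Datetime columns are excluded from both groups, so Loan Date and Credit Line Date do not reach the model.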

In [17]:
numerical_data=train_x.select_dtypes(include=['float'])

In [18]:
categorical_data=train_x.select_dtypes(exclude=['float', 'datetime'])

In [19]:
categorical_data.columns

Index(['Number of Payments', 'Loan Grade', 'Loan Subgrade', 'Job',
       'Years Employed', 'Home', 'Income Verified', 'Loan Cat', 'State',
       'Status'],
      dtype='object')
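
The split by dtype can be checked for columns that end up in neither group. A minimal sketch on a toy frame, assuming the same `include` and `exclude` lists as above (the data is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Requested": [25000.0, 7000.0],
    "Home": ["RENT", "OWN"],
    "Loan Date": pd.to_datetime(["2013-01-01", "2013-02-01"]),
})

numeric_cols = toy.select_dtypes(include=["float"]).columns
categorical_cols = toy.select_dtypes(exclude=["float", "datetime"]).columns

# Columns selected by neither call are silently left out of the model.
unused_cols = toy.columns.difference(numeric_cols.union(categorical_cols))
print(list(unused_cols))
```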

Drop the Job column. It has too many distinct values, which caused problems when encoding the categorical data.

In [20]:
# Reassign instead of dropping in place; this avoids pandas' SettingWithCopyWarning.
categorical_data = categorical_data.drop('Job', axis=1)
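
Columns that are too wide to one-hot encode can be found by counting distinct values. A sketch, assuming a hypothetical cutoff `max_levels`; the threshold and toy data are illustrative, not values used in the original analysis:

```python
import pandas as pd

toy = pd.DataFrame({
    "Home": ["RENT", "RENT", "OWN", "RENT"],
    "Job": ["Cnn", "Web Programmer", "City Of Beaumont Texas", "State Farm Insurance"],
})

max_levels = 3  # hypothetical threshold for this toy example
cardinality = toy.nunique()  # distinct non-null values per column
too_wide = cardinality[cardinality > max_levels].index.tolist()
reduced = toy.drop(columns=too_wide)

print(cardinality)
print(reduced.columns.tolist())
```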


In [21]:
numerical_data.head()

Unnamed: 0,Requested,Funded,Investor Funded,Annual Income,Ratio,Late Payments,Months Del,Months PR,Derog Recs,Credit Lines
0,25000.0,25000.0,19080.0,85000.0,19.48,0.0,0.0,0.0,0.0,42.0
1,7000.0,7000.0,673.0,65000.0,14.29,0.0,0.0,0.0,0.0,7.0
2,25000.0,25000.0,24725.0,70000.0,10.5,0.0,41.0,0.0,0.0,17.0
3,1200.0,1200.0,1200.0,54000.0,5.47,0.0,64.0,0.0,0.0,31.0
4,10800.0,10800.0,10692.0,32000.0,11.63,0.0,58.0,0.0,0.0,40.0


In [22]:
categorical_data.head()

Unnamed: 0,Number of Payments,Loan Grade,Loan Subgrade,Years Employed,Home,Income Verified,Loan Cat,State,Status
0,36 months,B,B4,< 1 year,RENT,verified - income,debt_consolidation,CA,f
1,36 months,B,B5,< 1 year,RENT,not verified,credit_card,NY,f
2,36 months,D,D3,1 year,RENT,verified - income,debt_consolidation,NY,f
3,36 months,C,C2,10+ years,OWN,not verified,debt_consolidation,TX,f
4,36 months,C,C3,6 years,RENT,not verified,debt_consolidation,CT,f


An attempt to replace null values in the categorical data with SimpleImputer made the kernel hang, so the call below is commented out.

In [23]:
#Replace null values in both categorical data and numerical data
#categorical_data = replace_null_values(categorical_data, 'most_frequent')

In [24]:
categorical_data.isnull().sum()

Number of Payments        1
Loan Grade            51851
Loan Subgrade         51851
Years Employed        14793
Home                  51960
Income Verified           1
Loan Cat                  1
State                     1
Status                    1
dtype: int64

In [25]:
# Per column: alphabetically last label, then the largest count (not necessarily the same category).
for column in categorical_data.columns:
    counts = categorical_data[column].value_counts()
    print(max(counts.index), max(counts))

 60 months 247661
G 86080
G5 20342
< 1 year 108455
RENT 145958
verified - income source 126990
wedding 198177
WY 52812
w 232474


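Replace nulls in each categorical column with a single fill value. The intent is the most frequent value, but `max(value_counts().index)` returns the alphabetically last label (for example `G`, `WY`, or `wedding` in the output above), so that label is what actually gets filled in. `value_counts().idxmax()` would return the true mode. The results below were produced with the fill as written.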

In [26]:
# Note: max() over the index returns the alphabetically last label, not the most frequent one.
fill_values = {column: max(categorical_data[column].value_counts().index)
               for column in categorical_data.columns}
categorical_data = categorical_data.fillna(fill_values)
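
The difference between the alphabetically last label and the most frequent label is easy to see on a toy series. A minimal sketch (the grades below are made up; `idxmax` is offered as one possible way to get the mode, not as what the notebook ran):

```python
import pandas as pd

grades = pd.Series(["B", "B", "B", "C", "G", None])
counts = grades.value_counts()  # nulls are excluded by default

alphabetical_last = max(counts.index)  # compares labels as strings
most_frequent = counts.idxmax()        # label with the highest count

filled = grades.fillna(most_frequent)
print(alphabetical_last, most_frequent)
print(filled.tolist())
```

Here the two picks disagree, which is the same pattern as a column where `G` is the last grade alphabetically but not necessarily the most common one.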


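Replace null values in the numerical data with the column mean. The helper returns integer column labels, which is why the null counts below are listed as 0 through 9 rather than by column name.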

In [27]:
numerical_data = replace_null_values(numerical_data, 'mean')

In [28]:
numerical_data.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

In [29]:
categorical_data=pd.get_dummies(categorical_data)

Combine the categorical and numerical data now that nulls are replaced and the categorical columns are converted to dummy (one-hot) variables.

In [30]:
train_x = pd.concat([categorical_data, numerical_data], axis=1)
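
The encode-then-combine step can be sketched on a toy example. The frames below are made up; the point is only the shape of the result and the fact that `concat` with `axis=1` matches rows by index label:

```python
import pandas as pd

categorical_toy = pd.DataFrame({"Home": ["RENT", "OWN", "RENT"],
                                "Status": ["f", "w", "f"]})
numerical_toy = pd.DataFrame({"Requested": [25000.0, 7000.0, 1200.0]})

# One indicator column per (column, category) pair.
dummies = pd.get_dummies(categorical_toy)

# Rows are aligned on the index, so both frames need matching index labels.
combined = pd.concat([dummies, numerical_toy], axis=1)
print(combined.columns.tolist())
```

Because the imputation helper rebuilds the numeric frame with a fresh 0..n-1 index, this alignment only works as long as the categorical frame still carries the default index too, which is the case when no rows have been dropped.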

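Split the data into training (90%) and test (10%) sets, then standardize the features. The scaler is fit on the training set only and then applied to the test set.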

In [31]:
x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.10)

#Scaling data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # StandardScaler ignores y, so it is not passed
x_test = scaler.transform(x_test)
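
A self-contained sketch of the same split-then-scale pattern on synthetic data. The `random_state` values are an assumption added here for repeatability; the cell above does not set one, so its split changes from run to run:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
target = rng.normal(size=200)                               # synthetic target

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.10, random_state=0
)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # mean and std come from training rows only
x_te_scaled = scaler.transform(x_te)      # the same statistics are reused on held-out rows

print(x_tr_scaled.mean(axis=0).round(6), x_tr_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training rows alone keeps information about the test rows out of the preprocessing step.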



Import libraries from scikit-learn to build, apply, and score models.

In [32]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

In [33]:
lm=LinearRegression()
lm.fit(x_train, y_train)
y_pred=lm.predict(x_test)

In [34]:
print('Linear Regression Mean Squared Error ', mean_squared_error(y_test, y_pred))
print('Linear Regression R2 ', r2_score(y_test, y_pred))

Linear Regression Mean Squared Error  0.0003281919084094814
Linear Regression R2  0.8269919422030179


In [35]:
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]  # duplicate 100000 removed
r=Ridge()
grid = GridSearchCV(r, parameters,cv=4)
grid.fit(x_train, y_train)
best=grid.best_estimator_
best

Ridge(alpha=10, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
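
To see how each candidate `alpha` scored rather than only the winner, the fitted search object exposes `cv_results_`. A sketch on synthetic data from `make_regression` (the data and the shortened grid are illustrative). With no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R2 for `Ridge`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, unrelated to the loan data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

alphas = [0.001, 0.1, 1, 10, 100]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=4)
search.fit(X, y)

# Mean cross-validated R2 for every alpha in the grid.
mean_scores = search.cv_results_["mean_test_score"]
for alpha, score in zip(alphas, mean_scores):
    print(alpha, round(float(score), 4))
print("best:", search.best_params_)
```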

In [36]:
y_pred=best.predict(x_test)

In [37]:
print('Ridge Mean Squared Error ', mean_squared_error(y_test, y_pred))
print('Ridge R2 ', r2_score(y_test, y_pred))

Ridge Mean Squared Error  0.00032819058081694555
Ridge R2  0.8269926420502727


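A higher R2 and a lower MSE both indicate a better fit. On this test split Ridge (alpha=10) is marginally ahead on both metrics (R2 0.826993 vs. 0.826992, MSE 3.28191e-04 vs. 3.28192e-04), but the gap is far too small to matter, so the two models perform essentially the same.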
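
MSE is in squared units, so taking the square root gives an error on the same scale as the target. A short sketch using the two MSE values reported above, assuming Interest Rate is stored as a fraction (for example 0.1189 for 11.89%):

```python
import math

# MSE values copied from the notebook output above.
mse_linear = 0.0003281919084094814
mse_ridge = 0.00032819058081694555

rmse_linear = math.sqrt(mse_linear)
rmse_ridge = math.sqrt(mse_ridge)

print(f"Linear RMSE: {rmse_linear:.4f}")
print(f"Ridge RMSE:  {rmse_ridge:.4f}")
print(f"Difference:  {rmse_linear - rmse_ridge:.2e}")
```

Under that assumption, both models are off by roughly 0.018, or about 1.8 percentage points of interest rate, on a typical test loan.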