# Home Loan Prediction
The dataset `full_home_loans.csv` contains home loan applications in Washington state, USA; each row is an individual loan application. Your goal in this assignment is to build a machine learning model that can accurately predict whether a given loan application was accepted or rejected.


## Part 1: Data Exploration
The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. Why? Datasets can consume a _lot_ of space in your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. Instead we use `pandas`, whose specialized data structure, the `DataFrame`, is designed to be fast and memory-efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/

In [1]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

To understand what kind of data was collected, `pandas` has some handy commands:

- `df.head()` will show us the first 5 rows of our dataset. You can also specify the first N rows; for example, `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
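
The commands listed above can be tried on a small, made-up table before running them on the full dataset. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical and are not taken from the loan data.

```python
import pandas as pd

# Toy stand-in for the loan data; the values are made up for illustration.
toy = pd.DataFrame({
    "county_name": ["Clark County", "Kitsap County", "King County", "Clark County"],
    "loan_amount_000s": [227, 417, 300, 150],
    "loan_approved": [1, 1, 0, 1],
})

print(toy.head(2))        # first 2 rows
print(toy.sample(2))      # 2 randomly chosen rows
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(toy.describe())     # summary statistics for the numerical columns only

# df.shape is a tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
```

Note that `describe()` skips the text column `county_name` by default, which is why the summary for the real dataset only covers a handful of its 27 columns.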

### Question 1.A:  How many rows are in this dataset? How many columns?
Answer 1.A: Running `df.shape` shows that the dataset has **369281** rows and **27** columns.

In [2]:
df.shape

(369281, 27)

In [3]:
df.head()

Unnamed: 0,town_name,county_name,loan_amount_000s,applicant_income_000s,property_type_name,occupied_by_owner,loan_type_name,is_hoepa_loan,loan_purpose_name,loan_approved,...,co_applicant_race_name_2,co_applicant_race_name_1,co_applicant_ethnicity_name,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name
0,"Portland, Vancouver, Hillsboro - OR, WA",Clark County,227,116.0,One-to-four family dwelling (other than manufa...,1,Conventional,0,Refinancing,1,...,,"Information not provided by applicant in mail,...",Not Hispanic or Latino,Female,,,,,"Information not provided by applicant in mail,...",Not Hispanic or Latino
1,Walla Walla - WA,Walla Walla County,240,42.0,One-to-four family dwelling (other than manufa...,1,FHA-insured,0,Home purchase,1,...,,No co-applicant,No co-applicant,Male,,,,,White,Hispanic or Latino
2,"Portland, Vancouver, Hillsboro - OR, WA",Clark County,241,117.0,One-to-four family dwelling (other than manufa...,1,Conventional,0,Refinancing,1,...,,White,Not Hispanic or Latino,Male,,,,,White,Not Hispanic or Latino
3,"Portland, Vancouver, Hillsboro - OR, WA",Clark County,351,315.0,One-to-four family dwelling (other than manufa...,1,Conventional,0,Refinancing,1,...,,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,...",Male,,,,,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,..."
4,"Bremerton, Silverdale - WA",Kitsap County,417,114.0,One-to-four family dwelling (other than manufa...,1,Conventional,0,Home improvement,1,...,,White,Not Hispanic or Latino,Female,,,,,White,Not Hispanic or Latino


### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
Answer 1.B: The column **`loan_approved`** is the outcome value for each application. The `df.columns` command was used to view all the columns in the dataset.

In [4]:
df.columns

Index(['town_name', 'county_name', 'loan_amount_000s', 'applicant_income_000s',
       'property_type_name', 'occupied_by_owner', 'loan_type_name',
       'is_hoepa_loan', 'loan_purpose_name', 'loan_approved',
       'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
       'co_applicant_sex_name', 'co_applicant_race_name_5',
       'co_applicant_race_name_4', 'co_applicant_race_name_3',
       'co_applicant_race_name_2', 'co_applicant_race_name_1',
       'co_applicant_ethnicity_name', 'applicant_sex_name',
       'applicant_race_name_5', 'applicant_race_name_4',
       'applicant_race_name_3', 'applicant_race_name_2',
       'applicant_race_name_1', 'applicant_ethnicity_name'],
      dtype='object')

### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: There are 3 columns in the dataset that list why a loan was denied. Try looking up the pandas command to list the unique values in a column.

Answer 1.C: From the unique values listed in the 3 columns, the reasons for denial are: 'Other', 'Credit application incomplete', 'Collateral', 'Debt-to-income ratio', 'Credit history', 'Unverifiable information', 'Employment history', 'Insufficient cash (downpayment, closing costs)' and 'Mortgage insurance denied'. The command `df.column_name.unique()` was used on each of the three columns.

In [5]:
df.denial_reason_name_1.unique()

array([nan, 'Other', 'Credit application incomplete', 'Collateral',
       'Debt-to-income ratio', 'Credit history',
       'Unverifiable information', 'Employment history',
       'Insufficient cash (downpayment, closing costs)',
       'Mortgage insurance denied'], dtype=object)

In [6]:
df.denial_reason_name_2.unique()

array([nan, 'Collateral', 'Other',
       'Insufficient cash (downpayment, closing costs)',
       'Debt-to-income ratio', 'Employment history',
       'Credit application incomplete', 'Credit history',
       'Unverifiable information', 'Mortgage insurance denied'],
      dtype=object)

In [7]:
df.denial_reason_name_3.unique()

array([nan, 'Other', 'Credit history',
       'Insufficient cash (downpayment, closing costs)',
       'Employment history', 'Debt-to-income ratio',
       'Unverifiable information', 'Collateral',
       'Credit application incomplete', 'Mortgage insurance denied'],
      dtype=object)
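
The three separate `unique()` calls can also be combined into one step. The sketch below is one possible way to do it, shown on a hypothetical `toy` DataFrame that only borrows the three column names from the dataset; the rows are made up.

```python
import pandas as pd

# Toy stand-in with the same three column names as the loan data.
toy = pd.DataFrame({
    "denial_reason_name_1": [None, "Collateral", "Credit history"],
    "denial_reason_name_2": [None, "Other", None],
    "denial_reason_name_3": [None, None, "Collateral"],
})

reason_cols = ["denial_reason_name_1", "denial_reason_name_2", "denial_reason_name_3"]

# Concatenate the three columns into one long Series, drop the missing
# entries, and keep each distinct reason once.
all_reasons = sorted(pd.concat([toy[c] for c in reason_cols]).dropna().unique())
print(all_reasons)
```

Dropping the missing entries first keeps `nan` out of the final list, since a missing value only means that no (further) denial reason was recorded.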

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
Answer 1.D: Missing pieces of information that could help predict the loan outcome, ranked:
1. Any past foreclosures
2. Applicant's assets other than income, such as retirement funds and savings
3. Employment history
4. Income tax return filing history

In [8]:
df.describe()

Unnamed: 0,loan_amount_000s,applicant_income_000s,occupied_by_owner,is_hoepa_loan,loan_approved
count,369281.0,320143.0,369281.0,369281.0,369281.0
mean,289.583453,114.013435,0.915503,3.2e-05,0.836493
std,478.234372,120.781098,0.278132,0.0057,0.369828
min,1.0,1.0,0.0,0.0,0.0
25%,178.0,61.0,1.0,0.0,1.0
50%,252.0,90.0,1.0,0.0,1.0
75%,354.0,135.0,1.0,0.0,1.0
max,99999.0,9999.0,1.0,1.0,1.0


## Part 2: Preparing Data to Input to a Model
Here we'll start using `scikit-learn`, which provides simple library calls for most things we'd like to do in a simple machine learning pipeline. If you haven't used `scikit-learn` before, this tutorial may be useful to give you a sense of what the library can do: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Machine learning models can only understand data that is represented numerically, but many of the columns in our dataset, like `town_name`, are text _categorical_ data. Meanwhile, many models do better when continuous numerical data is within small, consistent ranges, such as all values lying between -1 and 1, which is definitely not the case for our loan amounts in thousands of dollars.

So first, we will separate our samples (called _X_) into the categorical and the continuous features we'd like to include in our model, so that we can preprocess each group appropriately.

In [9]:
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities

features_cat = ['loan_purpose_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat] 
X_num = df[features_num]

### Part 2.A One Hot Encode Categorical Variables
Run the following code to one hot encode the categorical features:

In [10]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

In [11]:
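# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names())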
X_cat_proc.head()

Unnamed: 0,x0_Home improvement,x0_Home purchase,x0_Refinancing,x1_Female,"x1_Information not provided by applicant in mail, Internet, or telephone application",x1_Male,x1_Not applicable
0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.A: In your own words, how is one-hot encoding transforming the categorical data? What does the term "one-hot" refer to?
Answer 2.A: One-hot encoding splits each non-numerical column into several new columns, one per category. This is like a multiple-choice question where the respondent may choose only one option: for each row, the column of the chosen category is 1 and the others are 0. The term comes from digital circuits, where a one-hot code is used to indicate the state of a state machine. <br><br>
"One-hot" means that, among all the new columns created from one categorical column, exactly one value is 1, i.e. **hot**, and the rest are 0. Conversely, an encoding in which all values are 1 except a single 0 is called "one-cold".
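
To make the idea concrete, here is a minimal hand-written sketch of one-hot encoding in plain Python. It is not how `OneHotEncoder` is implemented; the `one_hot` helper and the example values are hypothetical and only illustrate the output format.

```python
# Values of a hypothetical categorical column.
values = ["Refinancing", "Home purchase", "Refinancing", "Home improvement"]

# One output column per distinct category, in sorted order.
categories = sorted(set(values))

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

encoded = [one_hot(v, categories) for v in values]

print(categories)
for v, row in zip(values, encoded):
    print(f"{v:>16} -> {row}")
```

Every encoded row sums to exactly 1, which is the "one hot" property: a single 1 marks the category and all other positions are 0.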

### Part 2.B Scaling down continuous numerical data
Run the following code to standardize the continuous numerical features, such as loan dollar amount, so that each has a mean of 0 and a standard deviation of 1. This process ensures that the average of that feature, such as the average loan amount a person asks for, is scaled to 0. Values less than the average become negative numbers, and values larger than the average become positive numbers.

In [12]:
scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X_num_proc.head()


Unnamed: 0,loan_amount_000s,applicant_income_000s
0,-0.130864,0.016448
1,-0.10368,-0.596232
2,-0.101589,0.024727
3,0.128424,1.664059
4,0.266432,-0.000111
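
The scaling step can be reproduced by hand to see what it does. The sketch below assumes that `preprocessing.scale` subtracts the column mean and divides by the population standard deviation (NumPy's default, `ddof=0`); the loan amounts are made up.

```python
import numpy as np

# Hypothetical loan amounts in thousands of dollars.
loan_amounts = np.array([100.0, 200.0, 300.0, 400.0, 1000.0])

# Standardize: subtract the mean, then divide by the standard deviation.
standardized = (loan_amounts - loan_amounts.mean()) / loan_amounts.std()

print(standardized)
print(standardized.mean(), standardized.std())
```

The result has mean 0 and standard deviation 1, but it is not confined to a fixed interval: the largest value here ends up well above 1, just as row 3 of the output above has a scaled income of 1.664059.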


### Part 2.C Merge our feature sets into one sample dataset _X_ and fix NaN values
Run the code below to combine the numerical and categorical feature sets.

In [13]:
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,x0_Home improvement,x0_Home purchase,x0_Refinancing,x1_Female,"x1_Information not provided by applicant in mail, Internet, or telephone application",x1_Male,x1_Not applicable
0,-0.130864,0.016448,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,-0.10368,-0.596232,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.101589,0.024727,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.128424,1.664059,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.266432,-0.000111,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.C The code line below replaces any NaN values in our sample with 0. NaNs are missing values that a model won't be able to understand. What is the _semantic_ meaning of replacing a NaN with 0 for the categorical variables? And for the continuous numerical variables?
Answer 2.C: For the one-hot encoded categorical variables, every value is either 1 or 0, so replacing a NaN with 0 means the application is treated as not belonging to that category. For the continuous numerical variables, the earlier scaling step set the average of each feature to 0, so replacing a NaN with 0 assigns the average value to the missing entry. This lets us retain and use the data in the other columns without dropping the whole row.

In [14]:
X = X.fillna(0) # replace NaN values with 0
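
For the numerical columns, filling with 0 after scaling has the same effect as filling with the mean before scaling. The sketch below checks this on a made-up income column; it standardizes by hand with pandas rather than calling `preprocessing.scale`, so it is an illustration under that assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (in thousands) with one missing value.
income = pd.Series([60.0, 90.0, np.nan, 150.0])

# Standardize while ignoring the NaN (pandas skips NaN in mean/std by default).
scaled = (income - income.mean()) / income.std(ddof=0)

# Option 1: fill the missing scaled value with 0.
filled_after = scaled.fillna(0)

# Option 2: fill the raw value with the mean, then apply the same scaling.
filled_before = (income.fillna(income.mean()) - income.mean()) / income.std(ddof=0)

print(filled_after.tolist())
print(filled_before.tolist())
```

Both options give identical columns, so `fillna(0)` on scaled data amounts to mean imputation.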

### Part 2.D Create our target array _y_ that our model will try to predict

In [15]:
y = df['loan_approved'] # target

### Part 2.E Split our data into training, test, and validation sets
Run the code below to split the data. Both validation and test sets will be used for testing our model, but use the validation set while you are developing and improving your model, and leave the test for late stage evaluation.

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct

(258496, 9) (55392, 9) (55393, 9)
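
The two-stage split can be checked on synthetic data. The sketch below is a variation, not the code used above: it adds the `random_state` and `stratify` parameters of `train_test_split` so that the split is reproducible and each subset keeps roughly the same share of approved loans. The arrays `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 2 features, roughly 84% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = (rng.random(1000) < 0.84).astype(int)

# 70% train, then split the remaining 30% in half: 15% validation, 15% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0, stratify=y_tmp)

print(X_tr.shape, X_val.shape, X_te.shape)
print(y_tr.mean(), y_val.mean(), y_te.mean())  # share of positives per subset
```

Without `random_state`, each run produces a different split, so metrics computed on the validation set can change slightly from run to run.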


### Question 2.E:  In a  single sentence, what is the difference between train, test, and validation sets?
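Answer 2.E: Training set: the data the model sees and learns from.<br>
Validation set: the data held out from training and used to evaluate the model while it is being developed and improved.<br>
Test set: the data used only for final evaluation, once the model has been fully developed using the training and validation sets.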

## Part 3. Developing Models
Scikit-learn has a substantial library of different models we can use for classification. Implemented below are two of the simplest classification models, Logistic Regression and Dummy Classifier.

In [17]:
# personal notes
# a confusion matrix is a table that describes the performance of a classification model/ classifier on a set of
# test data for which the true values are known.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))
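
To read the output of the `metrics` helper, it helps to compute a confusion matrix by hand once. The sketch below uses made-up labels and follows the layout that scikit-learn prints: rows are true classes, columns are predicted classes, giving `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
# Hypothetical true labels and predictions (1 = approved, 0 = denied).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # denied, predicted denied
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # denied, predicted approved
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # approved, predicted denied
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # approved, predicted approved

matrix = [[tn, fp], [fn, tp]]

# Metrics for class 1, as reported in the classification report.
precision = tp / (tp + fp)  # of the predicted approvals, how many were right
recall = tp / (tp + fn)     # of the true approvals, how many were found
accuracy = (tp + tn) / len(pairs)

print(matrix)
print(precision, recall, accuracy)
```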

In [18]:
# personal notes
# LR is used when the target/ dependent variable is categorical
# in LR, the predicted probability ranges strictly from 0 to 1
# refer to: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

Confusion matrix:
 [[    0  9185]
 [    0 46207]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9185
           1       0.83      1.00      0.91     46207

   micro avg       0.83      0.83      0.83     55392
   macro avg       0.42      0.50      0.45     55392
weighted avg       0.70      0.83      0.76     55392




The Dummy Classifier is a 'dummy' because it is going to use zero machine learning, and simply predict "approve this loan" (value 1) for every loan it sees.

In [19]:
from sklearn.dummy import DummyClassifier

approve_everyone = DummyClassifier(strategy='constant', constant = 1).fit(X_train, y_train) # first fit (train) the model
y_pred_dummy = approve_everyone.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred_dummy) # finally evaluate performance

Confusion matrix:
 [[    0  9185]
 [    0 46207]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9185
           1       0.83      1.00      0.91     46207

   micro avg       0.83      0.83      0.83     55392
   macro avg       0.42      0.50      0.45     55392
weighted avg       0.70      0.83      0.76     55392
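
The accuracy of an always-approve model follows directly from the class counts in the confusion matrix above. The short calculation below only reuses the two row totals from that output; the variable names are illustrative.

```python
# Row totals copied from the validation confusion matrix shown above.
denied = 9185      # applications whose true class is 0
approved = 46207   # applications whose true class is 1

# A model that always predicts "approved" is correct on exactly the approved rows.
baseline_accuracy = approved / (approved + denied)

# It never predicts "denied", so it finds none of the denied applications.
recall_denied = 0 / denied

print(round(baseline_accuracy, 3))
print(recall_denied)
```

Any model evaluated on this validation set therefore needs an accuracy above roughly 0.83 to beat the constant prediction, and accuracy alone hides the fact that class 0 is never predicted.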



### Question 3.A: Considering only the data itself, why do Logistic Regression and the Dummy Classifier perform the same? What is the semantic meaning for why Dummy Classifier has such high accuracy?
_Double click to write your answer question here._

## Part 4: Your turn!

### Task 4.A: Create a new balanced dataset where exactly half of the samples are rejected loan applications and half are accepted loan application.
_show your work below_

In [20]:
loan_apps = df['loan_approved'] # select the loan_approved column (a pandas Series)
loan_apps.value_counts() # count the number of 0s and 1s in the loan_approved column

1    308901
0     60380
Name: loan_approved, dtype: int64

In [None]:
loan_apps.head()

In [22]:
loan_apps.shape

(369281,)

In [None]:
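# make separate dataframes of the rejected (0) and approved (1) applications
zero = df[loan_apps == 0]
one = df[loan_apps == 1]
# downsample the approved applications to match the number of rejected ones:
# sample rows (axis=0) without replacement so that no application is duplicated
chosen_one = one.sample(n=len(zero), axis=0, replace=False, random_state=None)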

In [None]:
chosen_one.head()

In [None]:
zero = df[loan_apps == 0]
zero.shape

In [None]:
# one = df[df['loan_approved']==1]
# chosen_one = one.sample(n = 60380)
# chosen_one.shape
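
One possible way to finish the balancing step is to downsample the majority class and then stack and shuffle the two halves. The sketch below shows this on a made-up `toy` DataFrame; the names `denied`, `approved_sample`, and `balanced` are hypothetical and not part of the notebook above.

```python
import pandas as pd

# Toy stand-in: 7 approved and 3 denied applications.
toy = pd.DataFrame({
    "loan_amount_000s": range(10),
    "loan_approved": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

denied = toy[toy["loan_approved"] == 0]
approved = toy[toy["loan_approved"] == 1]

# Sample rows (the default axis) without replacement, as many as there are denials.
approved_sample = approved.sample(n=len(denied), random_state=0)

# Stack both halves, shuffle the rows, and renumber the index.
balanced = (
    pd.concat([denied, approved_sample])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["loan_approved"].value_counts())
```

Downsampling discards most of the approved applications; it is a simple choice for this task, not the only way to handle class imbalance.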

### Task 4.B: Below, retry training and evaluating a Logistic regression model on the updated data.
_show your work below_

### Task 4.C: Use your own imagination and experimentation to improve predictive performance for this task, modifying the model choices, feature choices, and data processing however you wish.
_Important! Your ability to improve the model above the baseline after Task 4.B will count for 10% of this assignment grade, with 5% of that given for modest improvements to performance. Thus while we encourage you to experiment, do not sink excessive time into this task. We will test the performance on our own holdout dataset._

_show your work below_