# Students Do: Predicting Fraudulent Loan Applications

According to the American Bankers Association, [_"every dollar of fraud now costs banks and credit unions roughly $2.92"_](https://www.aba.com/member-tools/industry-solutions/insights/state-card-fraud-2018). That is one reason why predicting fraud with machine learning techniques has become a [broad area of research](https://scholar.google.com.mx/scholar?q=fraud+detection+machine+learning&btnG=&oq=fraud+detection+) and a great [business opportunity for FinTech startups](https://www.eu-startups.com/2019/06/paris-based-fintech-bleckwen-raises-e8-8-million-for-its-fraud-detection-software-to-prevent-financial-crime/).

In this activity, you will explore how tree-based algorithms can be used to identify fraudulent loan applications. You will start with a decision tree model, trained on the `sba_loans_encoded.csv` file that you created earlier.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Needed for decision tree visualization
import pydotplus
from IPython.display import Image


## Loading and Preprocessing Loans Encoded Data

Load `sba_loans_encoded.csv` into a pandas DataFrame called `df_loans`.

In [26]:
# Loading data

df_loans = pd.read_csv('../Resources/sba_loans_encoded.csv')
df_loans

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,...,City_WILLITS,City_WILMINGTON,City_WINDSOR,City_WINNETKA,City_WOODLAND,City_WOODLAND HILLS,City_WRIGHTWOOD,City_Watsonville,City_YORBA LINDA,City_YUBA CITY
0,2001,11,32812,36,92801,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2092,2006,6,150000,60,92346,0,5,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2093,1997,4,99000,300,92021,0,4,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2094,1997,2,50000,84,93012,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2095,1997,1,251150,120,91352,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Define the features set, by copying the `df_loans` DataFrame and dropping the `Default` column.

In [25]:
# Define features set
# Work on a copy so that df_loans keeps the Default column for the target vector
X = df_loans.copy()
X = X.drop(columns="Default")
X

  


Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,...,City_WILLITS,City_WILMINGTON,City_WINDSOR,City_WINNETKA,City_WOODLAND,City_WOODLAND HILLS,City_WRIGHTWOOD,City_Watsonville,City_YORBA LINDA,City_YUBA CITY
0,2001,11,32812,36,92801,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2092,2006,6,150000,60,92346,0,5,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2093,1997,4,99000,300,92021,0,4,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2094,1997,2,50000,84,93012,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2095,1997,1,251150,120,91352,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Create the target vector by assigning the values of the `Default` column from the `df_loans` DataFrame.

In [27]:
# Define target vector
y = df_loans["Default"].values.reshape(-1, 1)


Split the data into training and testing sets.
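One possible way to do the split is sketched below. It is only an illustration: the synthetic data from `make_classification` stands in for the loans features and the `Default` target, and the `random_state` value and the `stratify` argument are assumptions, not requirements of the activity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): synthetic features and a binary target,
# used only so that this sketch runs without the real CSV file.
features, target = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(5)])
y = pd.Series(target, name="Default")

# Hold out 25% of the rows (the scikit-learn default) for testing.
# random_state is an arbitrary seed that makes the split reproducible;
# stratify=y keeps the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=78, stratify=y
)

print(X_train.shape, X_test.shape)
```

With the real data, you would pass the `X` and `y` objects defined in the previous cells instead of the synthetic ones.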

In [5]:
# Splitting into Train and Test sets



Use the `StandardScaler` to scale the features data. Remember that only the `X_train` and `X_test` DataFrames should be scaled.
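The scaling steps could look roughly like the following sketch. The synthetic data is a stand-in for the loans features, and the variable name `X_scaler` is an assumption. The point it illustrates is that the scaler is fitted on the training data only and then applied to both sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replaces the loans features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# Create the StandardScaler instance
scaler = StandardScaler()

# Fit the scaler with the training data only, so that no information
# from the testing set leaks into the preprocessing step
X_scaler = scaler.fit(X_train)

# Apply the same transformation to the training and testing features
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Each training column now has mean close to 0 and standard deviation close to 1
print(np.round(X_train_scaled.mean(axis=0), 6))
```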

In [6]:
# Create the StandardScaler instance



In [7]:
# Fit the Standard Scaler with the training data



In [8]:
# Scale the training and testing data



## Fitting the Decision Tree Model

Once the data is scaled, create a decision tree instance and train it with the training data (`X_train_scaled` and `y_train`).
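A minimal sketch of creating and fitting the classifier is shown below. The data is synthetic, and the `random_state` passed to `DecisionTreeClassifier` is an assumption added only to make the run repeatable.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replaces the loans features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# Scale using statistics learned from the training data only
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)

# Create the decision tree classifier instance
model = tree.DecisionTreeClassifier(random_state=78)

# Fit the model with the scaled training data
model = model.fit(X_train_scaled, y_train)

# Inspect the size of the fitted tree
print(model.get_depth(), model.get_n_leaves())
```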

In [9]:
# Create the decision tree classifier instance



In [10]:
# Fit the model



## Making Predictions Using the Tree Model

Validate the trained model by predicting fraudulent loan applications using the testing data (`X_test_scaled`).
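The prediction step might be sketched as follows, again on synthetic stand-in data. The `results` DataFrame is a hypothetical helper for comparing predictions with actual labels side by side; the activity does not ask for it.

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replaces the loans features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# Scale both sets with a scaler fitted on the training data
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Train the decision tree
model = tree.DecisionTreeClassifier(random_state=78).fit(X_train_scaled, y_train)

# Make predictions using the scaled testing data
predictions = model.predict(X_test_scaled)

# Hypothetical helper: compare predicted and actual labels side by side
results = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
print(results.head())
```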

In [11]:
# Making predictions using the testing data



## Model Evaluation

Evaluate the model's results by using `sklearn` to calculate the confusion matrix and the accuracy score, and to generate the classification report.
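As a rough illustration, the three metrics can be computed as in the sketch below. The data is synthetic, and the row and column labels of `cm_df` are assumptions chosen for readability.

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replaces the loans features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
X_scaler = StandardScaler().fit(X_train)
model = tree.DecisionTreeClassifier(random_state=78)
model = model.fit(X_scaler.transform(X_train), y_train)
predictions = model.predict(X_scaler.transform(X_test))

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Accuracy score: fraction of testing rows classified correctly
acc_score = accuracy_score(y_test, predictions)

# Display results
print(cm_df)
print(f"Accuracy Score: {acc_score}")
print(classification_report(y_test, predictions))
```

For fraud detection, the per-class precision and recall in the classification report usually say more than accuracy alone, because fraudulent cases tend to be the minority class.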

In [12]:
# Calculating the confusion matrix

# Calculating the accuracy score



In [13]:
# Displaying results



## Visualizing the Decision Tree

In this section, you should create a visual representation of the decision tree using `pydotplus`. Show the graph in the notebook, and also save it in `PDF` and `PNG` formats.
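The sketch below shows one way this could be done. The DOT data comes from scikit-learn's `export_graphviz`. The rendering is wrapped in a hypothetical helper, `render_tree`, because it needs both `pydotplus` and the Graphviz binaries installed. The output file names, the synthetic data, and `max_depth=3` (used only to keep the picture small) are all assumptions.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Stand-in data (assumption): replaces the loans features and target.
features, target = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]

# max_depth=3 is an assumption that keeps the drawing readable
model = tree.DecisionTreeClassifier(max_depth=3, random_state=78)
model = model.fit(features, target)

# Create DOT data (a text description of the tree)
dot_data = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=feature_names,
    class_names=["0", "1"],
    filled=True,
)


def render_tree(dot_data, pdf_path="sba_loans_tree.pdf", png_path="sba_loans_tree.png"):
    """Hypothetical helper: draw the graph, save it as PDF and PNG, and
    return an image object that a notebook cell can display."""
    import pydotplus  # requires the Graphviz binaries to be installed
    from IPython.display import Image

    # Draw graph
    graph = pydotplus.graph_from_dot_data(dot_data)

    # Save the tree as PDF and PNG
    graph.write_pdf(pdf_path)
    graph.write_png(png_path)

    # Show graph when this is the last expression in a notebook cell
    return Image(graph.create_png())
```

In a notebook, ending a cell with `render_tree(dot_data)` would display the tree and write both files to the working directory.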

In [14]:
# Create DOT data

# Draw graph

# Show graph



In [15]:
# Saving the tree as PDF


# Saving the tree as PNG



## Analysis Question

Finally, analyze the model's evaluation results and answer the following question.

* Would you trust this model enough to deploy a loan application approval solution in a bank?

 * **Your answer here**