# Data Science in a Day

## Problem statement
Given data on loans that we've granted in the past, we want to predict whether a new customer should be given a loan or not. The prediction is meant to aid decision making when our bank offers loans.

## Loading Libraries

In [None]:
# Pandas - for data reading, manipulating, writing, analysis, and some plotting!
import pandas as pd 

# Numpy - for mathematical and matrix operations
import numpy as np

# Seaborn - for data visualisation
import seaborn as sns

# For visualising decision tree
import pydotplus
import graphviz 

# Import Decision tree from sklearn 
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# For splitting our data into training and testing datasets
from sklearn.model_selection import train_test_split

# For evaluating our models
from sklearn.metrics import classification_report, confusion_matrix  

# To produce graphs here in our notebook
%matplotlib inline

## Data Reading and Preparation

In [None]:
# Let's start by loading our .csv file into our Jupyter notebook
loan = pd.read_csv('../loan_data.csv')

Let's explore the dimensions of the data a little bit more

In [None]:
# Print the first 5 rows of the dataframe
loan.head()

# Anyone guess what the command for getting the last 5 rows is?
# loan.tail()

In [None]:
# Get dimensions of the dataframe as (rows, columns)
loan.shape

In [None]:
# Get high-level information on the columns
loan.info() 

# Explain all this

In [None]:
# Maybe we want to get some descriptive statistics of the numerical features? Try doing this on your own (Google)
loan.describe()

### Data Cleaning
1) Are there any missing/NA values?

--> if so, how do we deal with them?

In [None]:
# Use the .isnull() function to flag which values in the dataframe are Null/NA.
# Show the first 5 rows of the result only.
loan.isnull().head()

In [None]:
# Let's get the sum of NA values in each of the columns/features
loan.isnull().sum()

Now that we have an idea of where our NA values are located, how do we deal with them?


**Note** We engage the learners before each of the following blocks to get them thinking of best approaches

**Note2** These are our assumptions. In data cleaning you make assumptions based on common sense, and domain knowledge
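
Before settling on an assumption, it can help to see how much of each column is actually missing. The snippet below is only a sketch on a hypothetical toy dataframe (`toy` and its values are made up for illustration, they are not the loan data); the same expression should work on `loan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan table
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [120.0, np.nan, 66.0, np.nan],
    "Credit_History": [1.0, 1.0, 0.0, 1.0],
})

# .isnull() gives True/False per cell; the mean of booleans is the share of True values
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

A column that is missing in a handful of rows is a reasonable candidate for dropping rows; a column that is missing in many rows usually calls for filling instead.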

In [None]:
loan.head()

In [None]:
# 1) Dealing with Gender, Married, Loan_Amount_Term ==> We remove those rows because they don't make sense!
loan.dropna(subset= ['Gender', 'Married', 'Loan_Amount_Term'], how = 'any', inplace= True)

In [None]:
# FOR DEPENDENTS -- it makes sense to fill NA dependents with 0
loan['Dependents'] = loan['Dependents'].fillna('0')

In [None]:
# FOR Self-Employed -- it makes sense to fill NA Self-employed values with 'No'
loan['Self_Employed'] = loan['Self_Employed'].fillna('No')

In [None]:
# FOR LOAN_AMount -- it makes sense to fill NA loans with 0
loan['LoanAmount'] = loan['LoanAmount'].fillna(float(0))   # or 0.0

In [None]:
# For Credit History - same
loan['Credit_History'] = loan['Credit_History'].fillna(0.0)

In [None]:
# Find sum of NA values in each column/feature
loan.isnull().sum()

In [None]:
# Let's check the dimensions now that we've done a bit of cleaning
loan.shape
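
The fill values above are assumptions, and it is worth letting learners see how much a choice can move the numbers. The sketch below uses a made-up series (`amounts` is hypothetical) and compares filling with 0 against filling with the median, which is one common alternative and not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with two missing entries
amounts = pd.Series([100.0, 120.0, np.nan, 140.0, np.nan])

# Option used in this notebook: treat missing as 0
filled_zero = amounts.fillna(0.0)

# Alternative assumption: treat missing as a "typical" value (the median)
filled_median = amounts.fillna(amounts.median())

print("mean after filling with 0:     ", filled_zero.mean())
print("mean after filling with median:", filled_median.mean())
```

Neither option is right in general; the point is that the assumption changes the data the model sees.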

## Data Exploration

In [None]:
# Using pandas plot function 
loan.Loan_Status.value_counts().plot(kind = 'bar')

In [None]:
# Using seaborn instead - a library built for data visualisation
l_status = sns.countplot(x = 'Loan_Status', data = loan)

# First, type the variable name, and press tab --> Jupyter notebook is so useful!
l_status.set_title('Distribution of Loan status in our data')
l_status.set_ylabel('Frequency', fontsize = 18)
l_status.set_xlabel('Loan Status', fontsize = 18)
l_status.tick_params(labelsize = 12)

In [None]:
# Example 2: Countplot of Loan status while accounting for Gender
ls_gender = sns.countplot(x = 'Loan_Status', hue = 'Gender', data = loan)

ls_gender.set_title('Loan Status and Gender') 

In [None]:
# Example 2.2 - maybe we want to graph horizontally instead?
 
sns.set_style('darkgrid') # or try 'whitegrid' (Basic point is seaborn is flexible style-wise)

ls_gender_horizontal = sns.countplot(y = 'Loan_Status', hue = 'Gender', data = loan, palette= None)

ls_gender_horizontal.set_title('Loan Status and Gender')
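
If learners ask for the numbers behind a grouped countplot, a cross-tabulation gives the same counts as a table. This is a sketch on a hypothetical toy dataframe (the rows are invented for illustration); on the real data the two columns would come from `loan`.

```python
import pandas as pd

# Hypothetical toy data with the same two column names as the loan table
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})

# Rows: loan status, columns: gender, cells: number of applicants
counts = pd.crosstab(toy["Loan_Status"], toy["Gender"])
print(counts)
```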

In [None]:
# Explore Property area and No. of dependents for both class values (i.e. Loan status = Yes, and Loan_Status = No)

area_gender = sns.catplot(x = 'Dependents', hue = 'Property_Area', col = 'Loan_Status', data = loan, kind = 'count', 
            palette= 'rainbow')

area_gender
# See if you can add a title, and change the font sizes of the X and Y labels

#### Explore other relationships using same syntax. 

With every graph, try adding a title, and changing the X and Y labels appropriately. Feel free to play around with seaborn palettes and styles!

In [None]:
# Comment 
# Let them tell you what to type.

## Building our model



We don't want to include the thing we want to predict in the input data, so let's drop it. Let's also put the classes into their own variable for convenience.

In [None]:
# Python has many data types. Let's explore the distribution of different data types across our features.
loan.dtypes.value_counts()

In [None]:
# Let's drop the ID column as it adds no useful information 
loan = loan.drop(['Loan_ID'], axis =1)

We now need to split our data in two ways: 

1) We need to split by **features** (all columns but *Loan Status*) and our **target variable**, *Loan Status*


2) We need to split by rows into one dataset that we will train our model on -- **the training set** -- and one which our model will not see and on which we test performance -- **the testing set**.


In [None]:
# First we split by features and target variable

loan_feats = loan.drop(['Loan_Status'], axis= 1)

loan_class = loan['Loan_Status']

In [None]:
# Now we split into training and testing sets!
train_feats, test_feats, train_class, test_class = train_test_split(loan_feats, loan_class, test_size = 0.3, random_state = 123)

# EXPLAIN parameters
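
To support the explanation of the parameters, here is a small sketch on made-up arrays (`X` and `y` are hypothetical, not the loan data). It shows that `test_size` controls the share of rows held out and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.3 holds out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)
print(X_train.shape, X_test.shape)

# Using the same random_state again reproduces exactly the same split
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=123
)
print(np.array_equal(X_train, X_train_again))
```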

**Note** 
Mention that we'll be using decision trees. The model we'll be using can't handle object (or non-numeric) data types!

==> Therefore, we'll need to convert all to numeric.

To transform the data into a form our decision tree can use, we need to use **one hot encoding**. This basically means that each value a categorical feature can take becomes a column in its own right, holding 1 where a row has that value and 0 otherwise. 

**Note** Important to show shape before and after so it can sink in!
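
A tiny before/after example can make this concrete. The dataframe below is hypothetical (invented values, column names borrowed from the loan data); the numeric column is left alone while the text column expands into one column per value.

```python
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric column
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
})

encoded = pd.get_dummies(toy)

print("before:", toy.shape)
print("after: ", encoded.shape)
print(list(encoded.columns))
```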

In [None]:
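# One-hot encoding all features in training set
train_feats = pd.get_dummies(train_feats)

# One-hot encoding all features in testing set, then aligning its columns with the
# training set in case a category value is missing from one of the two splits
test_feats = pd.get_dummies(test_feats).reindex(columns=train_feats.columns, fill_value=0)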

In [None]:
# Let's see how the data looks after the transformation
train_feats.shape

In [None]:
# And the same with the test set. The number of rows differs, but the number of columns should match the training set.
test_feats.shape

In [None]:
# See how columns look in training set
train_feats.columns

In [None]:
# See how columns look in testing set
test_feats.columns
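
Comparing the two column lists matters because encoding the training and testing sets separately can produce different columns if a category value happens to be absent from one split. The sketch below shows the problem and one possible fix on hypothetical toy data; using `reindex` here is an assumption about how to handle it, and encoding before splitting is another option.

```python
import pandas as pd

# Hypothetical splits: the test split happens to contain only 'Urban'
train = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
test = pd.DataFrame({"Property_Area": ["Urban", "Urban"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(train_enc.shape, test_enc.shape)  # column counts differ

# Give the test set exactly the training columns, filling missing ones with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

A model fitted on the training columns expects the same columns, in the same order, at prediction time.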

Let's check if they're all numeric now

In [None]:
# Gloss over these different numeric data types
train_feats.dtypes

In [None]:
# Convert target attribute values from Y/N to 1/0 -- training set
train_class = np.where(train_class == 'Y', 1,0)
train_class.dtype

In [None]:
# Convert target attribute values from Y/N to 1/0 -- testing set
test_class = np.where(test_class == 'Y', 1,0)
test_class.dtype

In [None]:
# Create the model! (Note use of sklearn)
tree_model = DecisionTreeClassifier()

# Now we fit/train it on our data. Mention X = train_feats, Y = train_class. 
tree_model.fit(train_feats, train_class)

In [None]:
# Let's define a function for plotting our decision tree!
def plotTree():
    dot_data = export_graphviz(tree_model, 
                                    out_file=None, 
                                    feature_names=train_feats.columns,
                                    filled=True, 
                                    rounded=True)
    graph = graphviz.Source(dot_data)
    return graph 

# NOTE This is quite hard to follow. So mention that this will be explored in more detail in modules to come 
# (Classification specifically), but this session is meant to illustrate the ART OF THE POSSIBLE.

#### NOTE ON PREPARATION 

**Need to do this prior to session and then remove this block of code from notebook**

Do the following: 

1) You need to download the .zip graphviz file (again) from https://graphviz.gitlab.io/_pages/Download/Download_windows.html


2) In the notebook, just before you run the _plotTree()_ function, you do:
        - import os
        - os.environ["PATH"] += os.pathsep + r"C:\Users\Client\Downloads\graphviz-2.38\release\bin"
        - plotTree()

**Note** the path in the second part is just where you extract the zip file. On the Windows laptop I couldn't extract it to the working directory of the notebook because of admin restrictions, but I could extract it in the Downloads folder.

In [None]:
plotTree()

Huge tree! It is very complex and hard to read, which defeats the original motive for using a decision tree in the first place: being able to understand how the model decides.

Let's try another tree. This time, let's set a limit to the **max_depth** parameter, to avoid creating a very complex tree.
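
To show what **max_depth** does without needing graphviz, the sketch below fits two trees on synthetic data and prints their size. The data comes from `make_classification` and is purely hypothetical; the exact numbers will differ on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the loan dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# No depth limit: the tree keeps splitting until its leaves are pure
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Depth limited to 3: at most 2**3 = 8 leaves
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("unlimited tree: depth", deep.get_depth(), "leaves", deep.get_n_leaves())
print("max_depth=3:    depth", shallow.get_depth(), "leaves", shallow.get_n_leaves())
```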

In [None]:
# Creating model 2. Setting max_depth at 3 (arbitrary choice)
tree_model2 = DecisionTreeClassifier(max_depth= 3)

# Fitting the model to the same data as before
tree_model2.fit(train_feats, train_class)

In [None]:
def plotTree2():
    dot_data = export_graphviz(tree_model2, 
                                    out_file=None, 
                                    feature_names=train_feats.columns,
                                    filled=True, 
                                    rounded=True)
    graph = graphviz.Source(dot_data)
    return graph 

plotTree2()

**Much better!**

### Making Predictions

Now that we've trained our models, it's time to put them to the test. We'll do this by predicting test set values and comparing those predictions to the values we already know are the ground truth. 

In [None]:
# Predictions made using model 1 (complex one)
predictions1 = tree_model.predict(test_feats)
predictions1

We need to compare those values with the test set!
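
It may help to walk through a confusion matrix by hand first. The labels below are made up for illustration; the sketch shows how the four cells relate to accuracy, precision and recall for the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = loan approved, 0 = not approved)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary labels, rows are the true class and columns the predicted class
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted 1s, how many are really 1
recall = tp / (tp + fn)           # of the real 1s, how many did we find

print(accuracy, precision, recall)
```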

In [None]:
# Let's evaluate how this model did
# Note - refer back to library call

print(confusion_matrix(test_class, predictions1))  
print(classification_report(test_class, predictions1))

In [None]:
# Making prediction using model 2 (simpler)
predictions2 = tree_model2.predict(test_feats)
predictions2

In [None]:
# Let's evaluate this model
print(confusion_matrix(test_class, predictions2))  
print(classification_report(test_class, predictions2))

### Model Evaluation

Q: Which model is better? 
> A: Model 2 (less complex)


Q: Why though? Isn't the more complex model supposed to be better?
> A: Nope. Explain the concept of variance vs. bias. Mention that unconstrained decision trees are high-variance models that tend to overfit the training data, and so don't perform well on data they haven't seen (like the test set in our case).
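
One way to make overfitting visible is to score each model on both the training and the testing data. The sketch below uses noisy synthetic data (`make_classification` with `flip_y=0.2` is an assumption chosen to exaggerate the effect, not the loan data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 20% of labels randomly reassigned
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# .score() returns accuracy; a large train/test gap is a sign of overfitting
for name, model in [("unlimited", deep), ("max_depth=3", shallow)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
```

The unlimited tree memorises the training rows, noise included, so its training accuracy says little about unseen data. Whether the shallow tree actually wins on the test set depends on the data, so let learners read it off the printed scores.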

Link to *Stupid Data Miner tricks paper* (Overfitting the S&P 500)
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500