# Data Science in a Day

## Problem statement
Given data on loans that we've granted in the past, we want to predict whether a new customer should be given a loan. The prediction is meant to aid decision making when offering loans at our bank.

## Loading Libraries

In [517]:
# Pandas - for data reading, manipulating, writing, analysis, and some plotting!

# Numpy - for mathematical and matrix operations

# Seaborn - for data visualisation

# For visualising decision tree

# Import Decision tree from sklearn 

# For splitting our data into training and testing datasets

# For evaluating our models

# To produce graphs here in our notebook
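The comments above describe the imports without showing them. A minimal sketch of what such a cell might look like follows. The exact modules are assumptions: in particular, `export_text` and `plot_tree` stand in for whatever tree-visualisation library the notebook originally used, and `accuracy_score` and `confusion_matrix` are one possible choice of evaluation helpers.

```python
# Pandas - for data reading, manipulating, writing, analysis, and some plotting
import pandas as pd

# Numpy - for mathematical and matrix operations
import numpy as np

# Seaborn - for data visualisation
import seaborn as sns

# Decision tree model, plus sklearn's own tree-visualisation helpers
# (assumption: the original notebook may have used a different visualiser)
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# For splitting our data into training and testing datasets
from sklearn.model_selection import train_test_split

# For evaluating our models (assumed choice of metrics)
from sklearn.metrics import accuracy_score, confusion_matrix

# In a Jupyter notebook, this magic renders graphs inline:
# %matplotlib inline
```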


## Data Reading and Preparation

In [538]:
# Let's start by loading our .csv file into our Jupyter notebook


Let's explore the dimensions of the data a little bit more

In [539]:
# Print the first 5 rows of the dataframe


In [540]:
# Get dimensions of training dataframe


In [541]:
# Get high-level information on the columns
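The three inspection steps above can be sketched on a toy dataframe. The rows below are invented for illustration and are not the workshop data; only the column names `Gender`, `Married`, `Loan_Amount_Term` and `Loan_Status` come from this notebook.

```python
import pandas as pd

# Toy stand-in for the loans data (invented rows, for illustration only)
loans = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Male", "Female"],
    "Married": ["Yes", "No", "Yes", "Yes", None, "No"],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0, 360.0, 360.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Print the first 5 rows of the dataframe
print(loans.head())

# Get dimensions as (rows, columns)
print(loans.shape)

# Get high-level information on the columns: dtypes and non-null counts
loans.info()
```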


In [542]:
# Maybe we want to get some descriptive statistics of the numerical features? Try doing this on your own (Google)


### Data Cleaning
1) Are there any missing/NA values?

--> if so, how do we deal with them?

In [543]:
# Use the .isnull() method to flag which values in the dataframe are Null/NA. Show the first 5 rows only


In [544]:
# Let's get the sum of NA values in each of the columns/features
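As a small sketch of the two steps above, `.isnull()` returns a True/False mask with the same shape as the dataframe, and chaining `.sum()` counts the True values per column. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the loans data (invented rows, for illustration only)
loans = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Married": ["Yes", None, None, "No"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Boolean mask: True wherever a value is missing
na_mask = loans.isnull()
print(na_mask.head())

# True counts as 1, so summing the mask gives the number of NAs per column
na_counts = loans.isnull().sum()
print(na_counts)
```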


Now that we have an idea of where our NA values are located, how do we deal with them?

**Note** The fixes below rest on our own assumptions. In data cleaning you make assumptions based on common sense and domain knowledge.
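One common way to act on such assumptions is to fill missing categorical values with the most frequent value and missing numeric values with the median. The sketch below shows that pattern on invented rows. It is an illustrative assumption, not necessarily the rule this notebook applies to each column.

```python
import pandas as pd

# Toy stand-in for two of the columns (invented rows, for illustration only)
loans = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Male"],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0, 360.0],
})

# Assumption: a missing category is most likely the most common one
most_common_gender = loans["Gender"].mode()[0]
loans["Gender"] = loans["Gender"].fillna(most_common_gender)

# Assumption: a missing numeric value is best replaced by the median
median_term = loans["Loan_Amount_Term"].median()
loans["Loan_Amount_Term"] = loans["Loan_Amount_Term"].fillna(median_term)

# Confirm that no NA values remain
print(loans.isnull().sum())
```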

In [545]:
# 1) Dealing with Gender, Married, Loan_Amount_Term

In [547]:
# For Dependents

In [548]:
# For Self-Employed

In [549]:
# For Loan Amount

In [550]:
# For Credit History 

In [551]:
# Find sum of NA values in each column/feature after the above changes have been made


In [552]:
# Let's check the dimensions now that we've done a bit of cleaning


## Data Exploration

In [553]:
# Using pandas plot function 


In [554]:
# Using seaborn instead - a library built for data visualisation


In [555]:
# Example 2: Countplot of Loan status while accounting for Gender
 

In [556]:
# Example 2.2 - maybe we want to graph horizontally instead?


In [557]:
# Explore Property area and No. of dependents for both class values (i.e. Loan_Status = Yes and Loan_Status = No)


#### Explore other relationships using the same syntax

With every graph, try adding a title, and changing the X and Y labels appropriately. Feel free to play around with seaborn palettes and styles!
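As a starting point, here is a sketch of a seaborn countplot with a title, relabelled axes, a palette and a style. The data rows, the label text and the choice of the `Set2` palette are invented for illustration. The `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Toy stand-in for the loans data (invented rows, for illustration only)
loans = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

# Pick a seaborn style, then count loan outcomes split by gender
sns.set_style("whitegrid")
ax = sns.countplot(data=loans, x="Loan_Status", hue="Gender", palette="Set2")

# Add a title and readable axis labels
ax.set_title("Loan status by gender")
ax.set_xlabel("Loan approved?")
ax.set_ylabel("Number of applicants")
```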

In [558]:
# Comment 


## Building our model



We don't want to include the thing we want to predict in the input data, so let's drop it. Let's also put the classes into their own variable for convenience.

In [559]:
# Python has many data types. Let's explore the distribution of different data types across our features.


In [560]:
# Let's drop the ID column as it adds no useful information 


We now need to split our data in two ways:

1) We need to split by **features** (all columns but *Loan Status*) and our **target variable**, *Loan Status*


2) We need to split by rows into one dataset that we will train our model on -- **the training set** -- and one which our model will not see and on which we test performance -- **the testing set**.
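The two splits can be sketched as follows, using invented rows. The names `train_feats` and `train_class` follow this notebook; `test_feats`, `test_class`, the 30% test size and the `random_state` are assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned loans data (invented rows, for illustration only)
loans = pd.DataFrame({
    "Gender": ["Male", "Female"] * 5,
    "Dependents": ["0", "1", "2", "0", "0", "3+", "1", "0", "2", "0"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y", "N", "Y", "Y", "N"],
})

# 1) Split by columns: features vs. target variable
feats = loans.drop(columns=["Loan_Status"])
target = loans["Loan_Status"]

# 2) Split by rows: training set vs. testing set (assumed 70/30 split)
train_feats, test_feats, train_class, test_class = train_test_split(
    feats, target, test_size=0.3, random_state=42
)

print(train_feats.shape, test_feats.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models later on.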


In [561]:
# First we split by features and target variable


In [562]:
# Now we split into training and testing sets!


**Note** 
Mention that we'll be using decision trees. The model we'll be using can't handle object (or non-numeric) data types!

==> **Therefore, we'll need to convert all to numeric.**

To transform the data into a form our decision tree can use, we need to use **one-hot encoding**. This means that each value a categorical feature can take becomes a 0/1 column in its own right.
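A minimal sketch of one-hot encoding with `pd.get_dummies`, on invented rows. The column name `Property_Area` is an assumed spelling of the "Property area" feature. Numeric columns pass through unchanged, while each category becomes its own column.

```python
import pandas as pd

# Toy feature table (invented rows, for illustration only)
feats = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Loan_Amount_Term": [360, 180, 360, 360],
})

# One-hot encode every non-numeric column
encoded = pd.get_dummies(feats)

print(feats.shape)            # shape before encoding
print(encoded.shape)          # shape after encoding: more columns, same rows
print(list(encoded.columns))  # one column per category value
```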

**Note** Important to show shape before and after so it can sink in!

In [563]:
# One-hot encoding all features in training set

# One-hot encoding all features in testing set


In [564]:
# Let's see how the data looks after the transformation


In [565]:
# And the same with the test set. Note that the same features were used, and hence the same columns are expected (provided every category also appears in the test set).
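When the two sets are encoded separately, a category that never appears in the test set produces no column there. One way to guard against this, sketched here on invented rows with an assumed `Property_Area` column, is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Toy data where the test set lacks two of the training categories (invented)
train_feats = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
test_feats = pd.DataFrame({"Property_Area": ["Urban", "Urban"]})

train_enc = pd.get_dummies(train_feats)
test_enc = pd.get_dummies(test_feats)
print(train_enc.shape, test_enc.shape)  # the column counts differ here

# Give the test set exactly the training columns, filling absent ones with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```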


In [566]:
# See how columns look in training set


In [567]:
# See how columns look in testing set


Let's check if they're all numeric now

In [568]:
# Gloss over these different numeric data types


In [569]:
# Convert target attribute values from Y/N to 1/0 -- training set


In [570]:
# Convert target attribute values from Y/N to 1/0 -- testing set
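The Y/N to 1/0 conversion can be sketched with `Series.map` and a dictionary. The label values below are invented, and `label_map` is a hypothetical helper name.

```python
import pandas as pd

# Toy target values (invented, for illustration only)
train_class = pd.Series(["Y", "N", "Y", "Y"])
test_class = pd.Series(["N", "Y"])

# Hypothetical lookup table: Y becomes 1, N becomes 0
label_map = {"Y": 1, "N": 0}

# Apply the same mapping to both sets so they stay consistent
train_class = train_class.map(label_map)
test_class = test_class.map(label_map)

print(train_class.tolist(), test_class.tolist())
```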


In [572]:
# Create the model! (Note use of sklearn)

# Now we fit/train it on our data. Mention X = train_feats, Y = train_class. 
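Creating and fitting the model might look like the sketch below. The toy features are invented, the column names `Credit_History` and `Married_Yes` are assumed spellings, and `model_1` is a hypothetical variable name; `random_state` is set only to make the result repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy, already-numeric training data (invented rows, for illustration only)
train_feats = pd.DataFrame({
    "Credit_History": [1, 1, 0, 0, 1, 0],
    "Married_Yes": [1, 0, 1, 0, 1, 0],
})
train_class = pd.Series([1, 1, 0, 0, 1, 0])

# Create the model with default settings (no limit on tree depth)
model_1 = DecisionTreeClassifier(random_state=0)

# Fit/train it: X = train_feats, Y = train_class
model_1.fit(train_feats, train_class)

# Accuracy on the data the tree was trained on
print(model_1.score(train_feats, train_class))
```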


In [573]:
# Let's define a function for plotting our decision tree!


In [574]:
# Run the function!
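The plotting function itself is not shown here. As a lightweight stand-in, sklearn's `export_text` prints a fitted tree as indented rules (`sklearn.tree.plot_tree` is the graphical equivalent). The helper name `show_tree` and the toy data are assumptions for this sketch, not the notebook's original function.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text


def show_tree(model, feature_names):
    """Return a plain-text rendering of a fitted decision tree."""
    return export_text(model, feature_names=list(feature_names))


# Toy, already-numeric training data (invented rows, for illustration only)
train_feats = pd.DataFrame({
    "Credit_History": [1, 1, 0, 0, 1, 0],
    "Married_Yes": [1, 0, 1, 0, 1, 0],
})
train_class = pd.Series([1, 1, 0, 0, 1, 0])

model = DecisionTreeClassifier(random_state=0).fit(train_feats, train_class)

# Run the function!
tree_text = show_tree(model, train_feats.columns)
print(tree_text)
```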

Let's try another tree. This time, let's set a limit on the **max_depth** parameter to avoid creating a very complex tree.

In [575]:
# Creating model 2. Setting max_depth at 3 (arbitrary choice)

# Fitting the model to the same data as before
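To see what `max_depth` changes, the sketch below fits an unconstrained tree and a depth-limited tree to the same synthetic data and compares their sizes. The data is randomly generated for illustration, and `model_1` and `model_2` are hypothetical variable names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data with noisy labels (illustration only)
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 5))
train_class = (train_feats[:, 0] + rng.normal(size=200) > 0).astype(int)

# Model 1: no depth limit, so the tree grows until its leaves are pure
model_1 = DecisionTreeClassifier(random_state=0).fit(train_feats, train_class)

# Model 2: depth capped at 3 (arbitrary choice)
model_2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    train_feats, train_class
)

print("model 1 depth:", model_1.get_depth(), "leaves:", model_1.get_n_leaves())
print("model 2 depth:", model_2.get_depth(), "leaves:", model_2.get_n_leaves())
```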


**Much better!**

### Making Predictions

Now that we've trained our models, it's time to put them to the test. We'll do this by predicting test set values and comparing those predictions to the values we already know are the ground truth. 
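Predicting and comparing against the ground truth might be sketched as follows. The data is synthetic, and the use of `accuracy_score` and `confusion_matrix` is an assumed choice of evaluation metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

train_feats, test_feats, train_class, test_class = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(train_feats, train_class)

# Predict labels for rows the model has never seen
predictions = model.predict(test_feats)

# Compare the predictions with the known ground truth
accuracy = accuracy_score(test_class, predictions)
cm = confusion_matrix(test_class, predictions)
print("accuracy:", accuracy)
print(cm)  # rows = true class, columns = predicted class
```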

In [576]:
# Predictions made using model 1 (complex one)


We need to compare those predictions with the true labels of the test set!

In [577]:
# Making prediction using model 2 (simpler)


In [578]:
# Let's evaluate this model


### Model Evaluation

Q: Which model is better? 
> A: Model 2 (less complex)


Q: Why though? Isn't the more complex model supposed to be better?
> A: Not necessarily. Explain the bias-variance trade-off. Mention that unconstrained decision trees are high-variance models: they tend to overfit the training data and then perform poorly on data they haven't seen (like the test set in our case)

Link to *Stupid Data Miner tricks paper* (Overfitting the S&P 500)
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
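A quick way to make the overfitting point concrete is to compare training and testing accuracy for both kinds of tree. The sketch below uses synthetic data with deliberately noisy labels, so the numbers will not match the loans data. An unconstrained tree can memorise the training set, and the gap between its training and testing accuracy is typically larger than for the shallow tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first feature matters, and the labels are noisy
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)

train_feats, test_feats, train_class, test_class = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Complex model (no depth limit) vs. simpler model (max_depth=3)
deep = DecisionTreeClassifier(random_state=0).fit(train_feats, train_class)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    train_feats, train_class
)

for name, model in [("deep", deep), ("shallow", shallow)]:
    train_acc = model.score(train_feats, train_class)
    test_acc = model.score(test_feats, test_class)
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```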