# Machine Learning Workshop - Classification

Betabanenmarkt @ Leiden, Machine Learning Workshop

Koen Oussoren

In this workshop we'll get familiar with some popular tools used by data scientists and data engineers. From the Lending Club I've obtained a free online data set that contains a variety of information about loans granted to people (https://www.lendingclub.com/info/download-data.action). The goal of this workshop is that you'll be able to predict whether a loan is safe or risky by training a particular machine learning model, a decision tree. You can go step by step through each cell of this Jupyter notebook, and along the way you'll learn how to preprocess the data with pandas and make it ready for the machine learning models of scikit-learn. At the end, you can take a look at how your model performed and play with some hyperparameters to make your model even better at predicting.

You'll need to have the pandas and scikit-learn Python libraries installed on your laptop in order to run this notebook. Otherwise, team up with someone who already has them installed.

## 1. Reading in data with Pandas

Read in the data with pandas, a popular tool for representing data in tabular format and for data munging. You can use basic operations like slicing and groupby/aggregate, and create simple graphs via the matplotlib library.

In [None]:
# Importing the pandas library
import pandas as pd

In [None]:
# Reading in the csv file into a pandas DataFrame object (Make sure you put the data in the same directory as 
# this notebook)
loans_df = pd.read_csv('lending-club-data.csv')

In [None]:
# How many entries (rows) do we have in our data?
len(loans_df)

That should be decent enough to train a simple model

In [None]:
# Let's take a look at the first 5 entries of the loans dataframe
loans_df.head(5)

As you can see, there are a variety of variables associated with each loan. There's an id number of the loan, the id of the member of the lending club, the total loan amount, interest rate, etc. You can also see that the columns can contain different data types: integers, floats, timestamps, letters. Not all the columns are displayed by the way (see the '...' in the middle?), so let's print out all the column names

In [None]:
loans_df.columns

That's quite a list of columns, and pandas doesn't show them all together by default in order not to overwhelm the user who wants to take a quick peek at the data. We can also select just one or a couple of columns if we want to focus on a subset of the variables

In [None]:
# Show first 5 entries of the total loan amount (Note the set of two square brackets I used here. One set of
# square brackets will also work, but when you want to extract multiple columns you need to use the set of two)
loans_df[['loan_amnt']].head(5)

In [None]:
# Show first 5 entries of 4 columns (Basically, I'm passing a list of values to the dataframe in order to select
# the columns I want to see)
loans_df[['id','member_id','loan_amnt','safe_loans']].head(5)

The loan with id 1077430 is classified as a risky loan, because its safe_loans value is -1. The other 4 loans are classified as safe
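
The difference between one and two pairs of square brackets is easiest to see on a tiny example. The sketch below uses a made-up `toy_df` (not the real loans data) and only shows which type of object each notation returns.

```python
import pandas as pd

# Toy stand-in for the loans data (values are made up for illustration)
toy_df = pd.DataFrame({
    'loan_amnt': [5000, 2500, 2400],
    'safe_loans': [1, -1, 1],
})

single = toy_df['loan_amnt']       # one pair of brackets -> pandas Series
double = toy_df[['loan_amnt']]     # two pairs of brackets -> pandas DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
```

A Series is a single column, while a DataFrame is a table, even when it has only one column.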

## Exercise

**Try out printing out some other columns in the cells below**

## 2. Exploring the data

Before we start to build our machine learning model, let's first take a closer look at some variables. In the end we want to train our model on variables that have predictive power. For example, if a certain variable has roughly the same values for both safe and risky loans, it's not hard to imagine that our model will do something close to a random guess if you give it only that variable as input. Kinda like trying to guess what car I'm thinking of right now when the only clue I give you is the color of the car.

### Describe

In [None]:
# Let's take a look at some basic statistics of our data
loans_df.describe()

The describe method shows the count, mean, standard deviation, min/max and various percentiles of the columns that contain integers or floats. These numbers help you get a grasp of parts of the data, for example whether there are outliers (values that are very large or very small compared to the bulk of the values of a particular variable). Our model could later on be fooled by the large value of an outlier and perform badly if we don't correct for it.
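
To see how a single outlier shows up in these statistics, here is a small sketch on invented incomes (the numbers are not taken from the loans data). Comparing the mean with the median, which is the 50% row, is one quick way to spot a suspicious column.

```python
import pandas as pd

# Made-up incomes; the last value is a deliberate outlier
incomes = pd.Series([30000, 42000, 51000, 58000, 64000, 5000000], name='annual_inc')

stats = incomes.describe()
print(stats)

# The single outlier pulls the mean far above the median (the 50% row)
print(stats['mean'] > 10 * stats['50%'])
```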

### Categorical features

You can see that the describe method doesn't print out statistics for columns containing text or letters, such as 'grade', 'sub_grade'. Let's take a look at these individually

In [None]:
# Get all the unique values for grade
loans_df['grade'].unique()

NOTE: grade here is like credit history, which reflects if a borrower is reliable or not based on previous loans. Grade A here doesn't mean a loan is automatically safe and grade G per definition risky. 

In [None]:
# and sub_grade
loans_df['sub_grade'].unique()

As the title of this section suggests, 'grade' and 'sub_grade' contain values for a certain category. A loan's grade belongs to one of the categories ranging from 'A' to 'G', and the same goes for a loan's 'sub_grade'. Categorical features (I'll explain a bit later what exactly I mean by feature) are not continuous: the loan is either this grade or one of the others. You can't have a loan that's 0.45 \* A and 0.55 \* F. (In the clustering domain of machine learning there are algorithms that can do exactly that, for example in recommending newspaper articles to online readers, but that goes beyond the scope of this workshop)

### Other types of data

In [None]:
# Let's take a look at the 'desc' column
loans_df['desc'].head(6)

Interesting, let's pick out one example and print out the full text

In [None]:
# Extract the first loan's 'desc' column and print out its value
text = loans_df['desc'].iloc[0:1].values
print(text)

So basically the borrower gives a small explanation of why he needs the loan in the 'desc' column. This is neither a numerical nor a categorical value. It's just a piece of text and at the moment not very useful to put into our model for training. There are ways, of course, to convert this piece of text into an object that can be used as valid input for training our model. In the field of Natural Language Processing (NLP) we have techniques that can convert words to vectors in a higher-dimensional word space, and because vectors are just arrays of numbers, we can feed them into models like Neural Networks for example. Very cool, but also an advanced topic which I'll not cover today.

## Exercise (Optional)

**Try out some other loan's descriptions in the cells below**

Use iloc[n:n+1] if you want to extract the row at position n (counting starts at 0)

### Plotting

A picture says more than a thousand words. Cliché, I know, but nothing beats a good graph for getting a grasp
of your data. What does the distribution look like? Are there any outliers? Do I understand the overall shape of the distribution?

In [None]:
# Pandas has built-in plotting functionality that calls matplotlib to draw plots

# The line below tells matplotlib that it should draw plots within this notebook
%matplotlib inline

# Let's take a look at all the loan amounts by plotting their distribution as a histogram
loans_df['loan_amnt'].plot.hist(bins=40)

First thing to note is that the distribution is not very smooth, because there are many upward spikes. Borrowers are probably inclined to ask for the same round amounts: around 20.000 US dollars is more popular than, let's say, 19.000 or 21.000. Also note the spike all the way at 35.000 dollars, which is relatively large compared to the spikes at lower amounts. 35.000 is the maximum loan amount in the data set, and it's not unlikely that the highest borrowers would have liked to borrow more but had to settle for the maximum.

In [None]:
# Plot a categorical value (First we need to count how often each category occurs, because only numbers can be plotted)
loans_df['grade'].value_counts().plot(kind='bar')

Most loans are classified with grade B and the least with grade G

## Exercise

**Plot another distribution**

NOTE: before you plot the values of a column, first try to figure out if you're dealing with numerical, categorical
or any other type of data, and based on that decide whether it makes sense to plot it or not. Use the
value_counts() method if you're dealing with categorical features, like in the cell above with 'grade'

### Scatter plots and using filters

There's also the option to plot the values of one column versus the value of another column, a.k.a. scatter plot

In [None]:
loans_df.plot.scatter(x='annual_inc', y='loan_amnt')

There are a couple of outliers in annual_inc which make the scatter plot hard to read, so let's introduce a filter to remove them

In [None]:
# Creating a new dataframe, which is a subset of the original loans_df dataframe but with the outliers in the
# annual income removed

loans_lower_annual_inc_df = loans_df[loans_df['annual_inc'] < 500000]

# You might be a bit confused by this notation at first, but basically you're telling pandas to go through every
# entry and check if the 'annual_inc' value for that entry is lower than 500.000. If yes, then keep this row
# and put it into our new dataframe; if not, then do not include it in our new dataframe.
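
The filter notation is easier to follow when it is split into two steps. The sketch below does that on a made-up `toy_df` with three rows. The intermediate variable `mask` is only there for illustration; the one-line version in the notebook does the same thing.

```python
import pandas as pd

# Made-up incomes; one of them is above the 500,000 cut
toy_df = pd.DataFrame({
    'annual_inc': [24000, 750000, 61000],
    'loan_amnt': [5000, 35000, 12000],
})

# Step 1: the comparison gives one True/False value per row
mask = toy_df['annual_inc'] < 500000
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
filtered_df = toy_df[mask]
print(len(filtered_df))
```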

In [None]:
loans_lower_annual_inc_df.plot.scatter(x='annual_inc', y='loan_amnt')

Interesting, but what to conclude from this? I would say that the lowest incomes (< 30.000) are not allowed to borrow more than their annual income, which makes sort of sense if you think about it. On the other hand you see that the higher incomes (> 300.000) tend to go often for the max of 35.000 (I wonder why they need to borrow that amount if their income is already an order of magnitude higher)

## Exercise

**Try to create a scatter plot, where you plot a numerical value versus a categorical value, e.g. annual_inc versus
safe_loans. Can you explain what you see?**

In [None]:
# loans_lower_annual_inc_df.plot.scatter(x='annual_inc', y='safe_loans')

## Exercise (Optional)

**Try to create a new dataframe where you only select the risky loans and plot the annual income as a histogram**

In order to slice out the risky loans use the following filter/condition: loans_df['safe_loans'] == -1

# 3. Selecting Features

What is a feature? A feature is any variable that describes a piece of data. All the columns we saw in the
loans dataframe are features. Now for our model there is one important feature: 'safe_loans'. Let's call it our target feature, because it is the goal or target that we want our model to learn to predict. We're going to
train our model using a subset of the remaining features, but which ones to select?

## Question

**Which features in our data set could be useful to predict the value of safe_loans?**

**Why is it a bad idea to take, for example, 'member_id' as input feature for our model?**

HINT: If there is one borrower that made loans that have always been classified as risky, what will our trained
model predict if it encounters a different member_id during testing?

In [1]:
# List of input features
input_features = [
    'loan_amnt',
    'annual_inc'
]

target_feature = 'safe_loans'

In [None]:
# Slicing out the input features and the target feature from our loans dataframe
loans_sel_features_df = loans_df[input_features+[target_feature]]

### Data processing

Let's first put our loans data into a shape that will later help us create training and test samples. First
we need to know how many loans belong to each category of our target feature: safe or risky?

In [None]:
# Checking for class imbalance
safe_loans_df = loans_sel_features_df[loans_sel_features_df['safe_loans'] == 1]
risky_loans_df = loans_sel_features_df[loans_sel_features_df['safe_loans'] == -1]

print(len(safe_loans_df))
print(len(risky_loans_df))

As you can see we have about a factor 4 more safe loans than risky loans. If we would continue with this data set and train our model, what problems might we run into? Answer: our model can become biased towards the majority class, meaning it is more likely to classify a loan as safe, because during the training it mostly saw safe loans. This problem is known as class imbalance and there are various ways to overcome it. We'll downsize the safe loans so that there are as many of them as there are risky loans. Of course we'll be throwing away data this way, but we end up with a total of about 46.000 loans, which should be enough
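
Why class imbalance is a problem becomes clear when you score a "model" that never looks at its input. The sketch below uses invented labels with the same 4:1 ratio; the names `y_true` and `y_always_safe` are only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels with a 4:1 ratio: 80 safe (+1) and 20 risky (-1)
y_true = np.array([1] * 80 + [-1] * 20)

# A "model" that ignores its input and always answers safe
y_always_safe = np.ones_like(y_true)

# 0.8, without finding a single risky loan
print(accuracy_score(y_true, y_always_safe))
```

An accuracy of 80% sounds decent, yet this predictor is useless for spotting risky loans. On a balanced data set the same trick only reaches 50%, which makes the accuracy easier to interpret.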

In [None]:
# Downsizing the safe_loans_df by making a subset with random safe loans
safe_loans_df = safe_loans_df.sample(len(risky_loans_df))
print(len(safe_loans_df))

In [None]:
# Putting both loans back together and reshuffle
loans_sel_df = pd.concat([safe_loans_df,risky_loans_df])
loans_sel_df = loans_sel_df.sample(frac=1)
print('Total loans {}'.format(len(loans_sel_df)))

Now that we have a properly balanced data set, let's split up our data into training and test data and also split
off the target feature from the rest of the input features. Normally, about 80% of the complete data set is used for training, while the remaining 20% is used for testing.

## Question

**Why do you need to split up your data into a training and a test set? Why can't you just use the complete data set for training?**

In [None]:
# Creating the training and test data sets. 
total_loans = len(loans_sel_df)
total_loans_80 = int(total_loans*0.8)

train_df = loans_sel_df.iloc[:total_loans_80]
test_df = loans_sel_df.iloc[total_loans_80:]

# Just checking the lengths
print(len(train_df))
print(len(test_df))

In [None]:
# Separating the target feature from the input features and call them Y and X respectively
X_train = train_df[input_features].values
Y_train = train_df[target_feature].values
X_test = test_df[input_features].values
Y_test = test_df[target_feature].values
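
As an aside, scikit-learn has a helper that does the shuffling and the 80/20 split in one call. The sketch below shows it on made-up arrays. It is an alternative to the manual slicing above, not what the rest of this notebook uses, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 50 rows, 2 input features, labels +1 / -1
X = np.arange(100).reshape(50, 2)
Y = np.array([1, -1] * 25)

# test_size=0.2 gives the 80/20 split; stratify=Y keeps both classes balanced
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=1)

print(len(X_tr), len(X_te))
```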

Now, we're ready to train our model

# 4. Training and Evaluating the model

For this workshop we'll be using a decision tree (https://en.wikipedia.org/wiki/Decision_tree). A decision tree is
a set of hierarchically structured questions that can each be answered with yes or no. You start at the root or top node with the first question (e.g. is the annual income higher than 500.000?) and based on the answer you either follow the 'yes' path or the 'no' path. Each path leads you to a new node with another yes/no question (e.g.
was the loan amount higher than 15.000?). After this you again have to follow either the 'yes' or 'no' path to the next node. This results in several layers of nodes, with more nodes in the next layer compared to the previous. At some point you'll reach a final node or leaf, which doesn't contain a question but will just state if the loan is risky or safe.
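
A decision tree is really just a set of nested if/else statements. Here is a hand-written tree with two layers of questions. The function name and all thresholds are invented for illustration; a trained tree learns its own questions from the data.

```python
def toy_tree_predict(annual_inc, loan_amnt):
    """Hand-written depth-2 tree; thresholds are invented, not learned."""
    if annual_inc > 50000:            # root node question
        if loan_amnt > 30000:         # second-layer question on the 'yes' path
            return -1                 # leaf: risky
        return 1                      # leaf: safe
    else:
        if loan_amnt > 15000:         # second-layer question on the 'no' path
            return -1                 # leaf: risky
        return 1                      # leaf: safe

print(toy_tree_predict(annual_inc=80000, loan_amnt=10000))   # 1
print(toy_tree_predict(annual_inc=30000, loan_amnt=20000))   # -1
```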
  

In [None]:
# Importing a decision tree
from sklearn import tree

In [None]:
# Get the decision tree classifier and start the training

# Setting the maximum depth the tree is allowed to go
max_layers = 2

clf = tree.DecisionTreeClassifier(max_depth=max_layers)
clf = clf.fit(X_train, Y_train)

If you want to get a visual output of what our trained decision tree looks like, run the cell below. Make sure you have the graphviz library installed.

**NOTE: Keep max_layers lower than 4 when running the cell below. It's not hard to imagine that you'll get a very large picture if max_layers is set to 10 for example. Also the cell takes a long time to run.**

In [None]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=input_features,  
                         class_names=target_feature,  
                         filled=True, rounded=True,  
                         special_characters=True)
graph = graphviz.Source(dot_data)  
graph 

At the top of each node you see the condition that will be applied (the yes/no question). The samples indicate
how many loans ended up in this node. As you can see, the root node contains all the training data. The value
indicates the number of loans of each class that ended up in this node, and the class is determined by which type of loan is in the majority. You can ignore the gini parameter.
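
If you don't have graphviz installed, scikit-learn can also print a tree as plain text with `tree.export_text`. The sketch below trains a tiny tree on four made-up loans, so the data and the `toy_clf` name are only for illustration.

```python
from sklearn import tree

# Made-up training data: [loan_amnt, annual_inc] and a +1 / -1 label
X_toy = [[5000, 60000], [30000, 20000], [8000, 90000], [25000, 25000]]
Y_toy = [1, -1, 1, -1]

toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(X_toy, Y_toy)

# export_text prints the learned questions as indented plain text
tree_text = tree.export_text(toy_clf, feature_names=['loan_amnt', 'annual_inc'])
print(tree_text)
```

Each indentation level in the output corresponds to one layer of nodes, and the lines starting with `class:` are the leaves.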

## Question

**The leaf nodes at the bottom indicate whether a loan is best classified as 'a' or 's'. Can you figure out which one of these two stands for the risky loans?**

Training a decision tree involves creating nodes (the points where the tree branches) and deciding how many layers of nodes you allow. To create a node, the algorithm has to decide which variable to base the yes/no question on. For the root node the algorithm tries to find the variable that is best at separating the safe and risky loans, and it looks for the value of this variable that results in a maximum separation. For the next two nodes the algorithm ends up with two different data sets and applies the same strategy to each. It keeps on looking for a variable that can best separate the risky from the safe loans. The algorithm terminates when it reaches the maximum number of allowed node layers; otherwise it could go on until every loan ends up in its own leaf. This last scenario results in perfect training accuracy but performs very badly on the test data. In general, a tree with a lot of branches tends to perform badly on the test data. This phenomenon is known as overtraining (also called overfitting).
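
To give an idea of what "best at separating" means, here is a minimal sketch of how a split can be scored with the Gini impurity, which is the default criterion of scikit-learn's decision tree. The helper functions `gini` and `best_threshold` and the data are made up for illustration, and the real implementation is more efficient than this.

```python
def gini(labels):
    """Gini impurity of a list of +1 / -1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p_safe = labels.count(1) / len(labels)
    return 1.0 - p_safe ** 2 - (1.0 - p_safe) ** 2

def best_threshold(values, labels):
    """Try each midpoint between sorted values and keep the purest split."""
    pairs = sorted(zip(values, labels))
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        # Weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up incomes: the two lowest incomes are risky (-1), the rest safe (+1)
incomes = [20000, 25000, 40000, 60000, 90000]
labels = [-1, -1, 1, 1, 1]
threshold, impurity = best_threshold(incomes, labels)
print(threshold, impurity)
```

An impurity of 0 means both child nodes contain only one type of loan. On real data the best split is rarely that clean, which is why the tree keeps adding layers.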

## Question

**Can you explain why a model is performing badly on a test data set when it's being overtrained during training?**

Let's retrain our model and allow it to have 25 branch layers (depth). In a later section we'll address
the issue of overtraining, but for now just assume that 25 is the best number

In [None]:
# Get the decision tree classifier and start the training

# Setting the maximum depth the tree is allowed to go
max_layers = 25

clf = tree.DecisionTreeClassifier(max_depth=max_layers)
clf = clf.fit(X_train, Y_train)

## Prediction

Now let's see how well our trained tree can generalize by letting it predict the loans in the test data set

In [None]:
# Get the input features from the test set and predict for each entry if the loan is risky or not. Let's call
# it Y_predict
Y_predict = clf.predict(X_test)
print(Y_predict)

In [None]:
# Getting the accuracy score on both train and testing data
from sklearn.metrics import accuracy_score

# Just gonna pretend that the training data set is test data
Y_predict_train = clf.predict(X_train)

train_acc = accuracy_score(Y_train, Y_predict_train)
test_acc = accuracy_score(Y_test, Y_predict)

print('Train accuracy: {}'.format(train_acc))
print('Test accuracy:  {}'.format(test_acc))

Our training accuracy is doing ok, but when checking the performance on the test data set it seems that we're just slightly better than random guessing. Let's try to make our model better!

## Exercise

**Try adding extra features to the input features and see if it improves the decision tree model's accuracy on the test data.**

NOTE: First check what type your feature is. If it's a numerical/continuous feature, it shouldn't be a problem to add it directly to the list of input features. If the values of your features are or contain strings you have to find a way how to convert these into numbers. For categorical features, you can try out a technique called one-hot-encoding. I'll give an example in the cells below for the 'grade' feature

In [None]:
loans_df['grade'].unique()

We have 7 different categories of grade and pandas has a nice function to one-hot-encode these for us. Just run the cell below and check the output. What has changed?

In [None]:
loans_df = pd.get_dummies(loans_df,columns=['grade'])
print(loans_df.head(4))

Notice the 7 extra columns that have been added to the dataframe? Each grade category turned into a feature which 
is either 1 or 0 for each row, hence the name one-hot-encoding. You can now add all these new features to the list of input features if you like. 
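
Here is the same one-hot encoding on a tiny made-up dataframe, which makes it easier to see what happens to each row. Depending on your pandas version the new columns may show True/False instead of 1/0; passing `dtype=int`, as in this sketch, forces numbers.

```python
import pandas as pd

# Made-up rows with three of the grade categories
toy_df = pd.DataFrame({'loan_amnt': [5000, 12000, 8000],
                       'grade': ['A', 'C', 'B']})

# Each category becomes its own column; dtype=int gives 1/0 values
encoded_df = pd.get_dummies(toy_df, columns=['grade'], dtype=int)
print(encoded_df)
```

Every row has exactly one 1 among the `grade_` columns, which marks the grade that row had before the encoding.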

Now it's your turn!

In [None]:
# Add your additional features to the input_features list (Don't forget to put in extra commas)
input_features = [
    'loan_amnt',
    'annual_inc'
]

target_feature = 'safe_loans'

**I've added all you need in the cells below, so you don't have to go back in the notebook to rerun cells**

In [None]:
# Select only our features, balance out the classes and create the training and test data sets
loans_sel_features_df = loans_df[input_features+[target_feature]]
safe_loans_df = loans_sel_features_df[loans_sel_features_df['safe_loans'] == 1]
risky_loans_df = loans_sel_features_df[loans_sel_features_df['safe_loans'] == -1]
safe_loans_df = safe_loans_df.sample(len(risky_loans_df), random_state=1)
loans_sel_df = pd.concat([safe_loans_df,risky_loans_df])
loans_sel_df = loans_sel_df.sample(frac=1, random_state=1)
total_loans = len(loans_sel_df)
total_loans_80 = int(total_loans*0.8)
train_df = loans_sel_df.iloc[:total_loans_80]
test_df = loans_sel_df.iloc[total_loans_80:]
X_train = train_df[input_features].values
Y_train = train_df[target_feature].values
X_test = test_df[input_features].values
Y_test = test_df[target_feature].values

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=30)
clf = clf.fit(X_train, Y_train)

In [None]:
Y_predict = clf.predict(X_test)
Y_predict_train = clf.predict(X_train)

train_acc = accuracy_score(Y_train, Y_predict_train)
test_acc = accuracy_score(Y_test, Y_predict)

print('Train accuracy: {}'.format(train_acc))
print('Test accuracy:  {}'.format(test_acc))

NOTE: You might notice that if you keep everything the same and rerun the 4 cells over and over again, the train and test accuracy will each time be slightly different from the previous runs. This is because there is some randomness under the hood when training the decision tree (scikit-learn shuffles the features it considers at each split, which matters when several splits are equally good), so a slightly different outcome is completely normal. In general it's good practice to keep an eye on these things, because large changes between runs could indicate that you need more training or test data.
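
If you want exactly the same tree on every run, you can pass a fixed `random_state` to the classifier. The sketch below shows this on made-up random data; the value 42 is arbitrary and the variable names are only for illustration.

```python
import numpy as np
from sklearn import tree

# Made-up noisy data: 200 rows, 5 features, random +1 / -1 labels
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
Y_toy = rng.choice([-1, 1], size=200)

# Same random_state -> the same tree every time the cell is rerun
clf_a = tree.DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_toy, Y_toy)
clf_b = tree.DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_toy, Y_toy)

print((clf_a.predict(X_toy) == clf_b.predict(X_toy)).all())
```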

## Confusion Matrix (Optional)

Besides the overall accuracy of our classifier (model), it can also be very insightful to take a look at how many
loans our decision tree classified correctly, the so-called true positives and true negatives, and more importantly
where it made the wrong predictions, referred to as the false positives and false negatives. Taking safe as the positive class, a false positive
is a risky loan that is classified as safe (type-1 error) and a false negative is a safe loan classified as risky (type-2 error). These 4 quantities can be nicely put together in a confusion matrix.
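
The four quantities can be read directly from scikit-learn's confusion matrix. The sketch below uses eight made-up loans and treats safe (+1) as the positive class; the label lists are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: +1 = safe (the "positive" class), -1 = risky
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, -1, 1]

# labels=[-1, 1] fixes the order: row/column 0 is risky, row/column 1 is safe.
# Rows are the true labels, columns are the predicted labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
print(tn, fp, fn, tp)
```

Here one risky loan was predicted as safe (the false positive) and one safe loan was predicted as risky (the false negative).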

In [None]:
from sklearn.metrics import confusion_matrix

conf_matr = confusion_matrix(Y_test, Y_predict)
print(conf_matr)

It might be a bit confusing to read the confusion matrix in this form. Therefore, I've added the function below in order to create an output which is a bit more user-friendly.
You can find the original code here: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="red" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
plot_confusion_matrix(conf_matr, classes=['risky','safe'])

## Precision and Recall (Optional)

In [None]:
from sklearn.metrics import classification_report
# classification_report sorts the labels, so -1 (risky) comes before 1 (safe)
target_names = ['risky', 'safe']
print(classification_report(Y_test, Y_predict, target_names=target_names))

If you look at these numbers, what do they mean? I'll explain the most important ones. Precision for the safe category means: of all the loans that our model classified as safe, what percentage is truly safe? Recall for safe means: of all truly safe loans in the whole test data, what percentage did our model classify as safe?

Depending on what you want to achieve you can either go for optimizing precision or recall. In general, you want to optimize both as much as possible, but at some point you have to sacrifice precision for better recall or the other way around. For example, if I really care about finding all the risky loans in the data set (high recall), I also have to accept that more safe loans get classified as risky (lower precision)
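
To make the two definitions concrete, the sketch below computes precision and recall for the safe class by hand on eight made-up loans and compares the result with scikit-learn's `precision_score` and `recall_score`. The label lists are invented for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: +1 = safe, -1 = risky
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1]

# Counts for the safe class, written out by hand
predicted_safe = sum(1 for p in y_pred if p == 1)
truly_safe = sum(1 for t in y_true if t == 1)
correct_safe = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

precision_safe = correct_safe / predicted_safe   # correct among predicted safe
recall_safe = correct_safe / truly_safe          # correct among truly safe

# The hand-computed values agree with scikit-learn (pos_label=1 means "safe")
print(precision_safe, precision_score(y_true, y_pred, pos_label=1))
print(recall_safe, recall_score(y_true, y_pred, pos_label=1))
```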

## Question (Optional)

**Can you think of examples where you would rather have a high recall than a high precision? How about the opposite scenario?**

# 5. Tuning hyper parameters

The most interesting part of a job as a data scientist is trying to make your model better and better. That can be either by using more input features or creating new features from existing features, also known as feature engineering.

But apart from introducing new features, we can also turn the knobs and dials of the model we've been using. In the case of our decision tree we can decide to either increase or decrease the depth of our tree (layers of branches), and we can decide what the minimum amount of data in a node should be before it may be split further instead of becoming a leaf node. This process of tuning the settings of a model is known as hyperparameter tuning: hyperparameters are chosen by us before training, in contrast to the parameters the model learns from the data. When doing hyperparameter tuning, another data set is required, known as the validation data set. So we have to split up our data into 3 different sets: a training, a validation and a test set.
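
As a sketch of the second knob, scikit-learn's decision tree has a `min_samples_leaf` hyperparameter that sets the minimum number of training rows every leaf must contain. The example below tries a few values on synthetic stand-in data (generated with `make_classification`, so not the loans data) and scores each one on a validation set. The candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score

# Synthetic stand-in data, split into a training and a validation part
X, Y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.25, random_state=0)

vali_scores = {}
for min_leaf in [1, 5, 20, 50]:
    # A larger min_samples_leaf gives a simpler tree with fewer, bigger leaves
    toy_clf = tree.DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    toy_clf.fit(X_tr, Y_tr)
    vali_scores[min_leaf] = accuracy_score(Y_val, toy_clf.predict(X_val))

print(vali_scores)
```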

## Question

**Why do we need to introduce a validation data set when we start to do hyperparameter tuning?**

HINT: Remember the question about overtraining. Isn't tuning hyperparameters kinda similar to just training the model?

Let's continue with the loans_sel_df we created in section 4, but now split it up into a training, validation and test set. We'll use 60:20:20 as the ratio to divide the complete data set.

In [None]:
# Creating training, validation and test sets
total_loans = len(loans_sel_df)
total_loans_60 = int(total_loans*0.60)
total_loans_20 = int(total_loans*0.20)
train_df = loans_sel_df.iloc[:total_loans_60]
vali_df = loans_sel_df.iloc[total_loans_60:(total_loans_60+total_loans_20)]
test_df = loans_sel_df.iloc[(total_loans_60+total_loans_20):]
X_train = train_df[input_features].values
Y_train = train_df[target_feature].values
X_vali = vali_df[input_features].values
Y_vali = vali_df[target_feature].values
X_test = test_df[input_features].values
Y_test = test_df[target_feature].values

We'll start by trying to find the best tree depth

In [None]:
accuracies = []
tree_depth = []

for depth in range(1,25):

    tree_depth.append(depth)
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_vali_pred = clf.predict(X_vali)
    vali_acc = accuracy_score(Y_vali, Y_vali_pred)
    accuracies.append(vali_acc)
    
print(accuracies)


In [None]:
# Let's plot the validation accuracies
import matplotlib.pyplot as plt
plt.plot(tree_depth, accuracies, 'ro')
plt.ylabel('Accuracy')
plt.xlabel('Tree depth')
plt.show()

Look up in the graph which tree depth gives you the highest accuracy on the validation set and use it in the cell below.
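
Besides reading it off the graph, you can also let numpy find the position of the highest accuracy. The sketch below uses made-up numbers in place of the real `tree_depth` and `accuracies` lists, and the name `best_depth_from_list` is only for illustration.

```python
import numpy as np

# Made-up validation accuracies for tree depths 1 to 6
tree_depth = [1, 2, 3, 4, 5, 6]
accuracies = [0.58, 0.61, 0.63, 0.62, 0.60, 0.59]

# np.argmax returns the position of the highest accuracy
best_index = int(np.argmax(accuracies))
best_depth_from_list = tree_depth[best_index]
print(best_depth_from_list)  # 3
```

It is still worth looking at the graph: if several depths score almost the same, the shallower tree is usually the safer choice.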

In [None]:
best_depth = <'your input'>

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=best_depth)
clf = clf.fit(X_train, Y_train)
Y_test_pred = clf.predict(X_test)
test_acc = accuracy_score(Y_test, Y_test_pred)
print(test_acc)

Is this better or worse than you had at the end of section 4? Is your model better than random guessing?

# 6. Other ML algorithms (Optional)

If you have time left over and you want to play around with some different models, you can check out http://scikit-learn.org/stable/supervised_learning.html#supervised-learning for some inspiration. See if you can train a Support Vector Machine model, for example, and see if you can beat the accuracy you got with the decision tree model
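
As a starting point, here is a minimal sketch of training a Support Vector Machine with scikit-learn. It runs on synthetic stand-in data so it works on its own; to try it on the loans you would swap in X_train, Y_train, X_test and Y_test from the notebook. The scaling step is my own assumption about what helps here, because SVMs tend to be sensitive to features with very different ranges, such as loan_amnt and annual_inc.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (not the loans data)
X, Y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Standardize the features first, then fit the SVM on the scaled values
svm_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_clf.fit(X_tr, Y_tr)

svm_acc = accuracy_score(Y_te, svm_clf.predict(X_te))
print('SVM test accuracy: {}'.format(svm_acc))
```

Training an SVM on all the loans may take noticeably longer than training the decision tree, so it can help to start with a smaller sample.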