[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/7_tree_learning_tasks.ipynb) 

# Tutorial 7 - Classification using Logistic Regression and Decision Trees

In this demo notebook, we will recap the logistic regression model and revisit our lecture on decision trees.  

## Preliminaries

### Standard imports

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### The HMEQ data set
We continue using the "Home Equity" data set (HMEQ). To streamline this notebook and subsequent notebooks that use the same data set, we implemented a helper function `get_HMEQ_credit_data` that loads and preprocesses the data. To use this helper function, you need to import it just like any other library or module. More specifically, you need to:
- Ensure the file `bads_helper_functions.py`, which is available on our [GitHub](https://github.com/Humboldt-WI/bads), is stored in the same directory from which you run this notebook. 
- Import the module using `import bads_helper_functions as bads`.

Afterwards, you can call the function `get_HMEQ_credit_data` to load a ready-to-use version of the data set. 

In [None]:
import bads_helper_functions as bads
X, y = bads.get_HMEQ_credit_data()
X

# Logistic regression



In [None]:
from sklearn.linear_model import LogisticRegression
# random_state only affects solvers that shuffle the data ('sag', 'saga', 'liblinear');
# we set it anyway so that results stay reproducible if the solver is changed
model = LogisticRegression(random_state=888)
model.fit(X, y)  # Train the model
print(model)

Note that the `sklearn` implementation does not provide an informative summary like the one from the library `statsmodels`, which we used to [illustrate regression analysis](https://github.com/Humboldt-WI/bads/blob/master/tutorial_notebooks/4_predictive_analytics_tasks.ipynb). You can still access the estimated coefficients and the intercept using the attributes `coef_` and `intercept_` of the fitted model. However, $R^2$, p-values, and confidence intervals are not available. In brief, this is because `sklearn` is designed to support prediction. Let's demonstrate how to compute predictions using the trained model. For simplicity, we compute predictions for the training data. You already learnt that this is inappropriate for judging a model and that we should use the *holdout* method instead. We will talk about model evaluation in a future notebook.
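
To make the role of `coef_` and `intercept_` more concrete, the following sketch recomputes the predicted probabilities of a binary logistic regression by hand and compares them with `predict_proba()`. It uses small synthetic data from `make_blobs()` rather than HMEQ, and the names `X_demo`, `y_demo`, and `demo_model` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (illustrative only, not the HMEQ data)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)
demo_model = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Linear score z = X w + b, built from the fitted attributes
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]

# The logistic function maps the score to an estimate of P(y = 1 | x)
p_manual = 1.0 / (1.0 + np.exp(-z))

# predict_proba() returns one column per class; column 1 holds P(y = 1 | x)
p_sklearn = demo_model.predict_proba(X_demo)[:, 1]

print("Manual and sklearn probabilities agree:", np.allclose(p_manual, p_sklearn))
```

The comparison only covers the binary case; with more than two classes, `coef_` has one row per class and the mapping to probabilities differs.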

In [None]:
print("Estimated coefficients:\n", model.coef_)  # The coefficients of the model 
print("\nIntercept:\n", model.intercept_)  # The intercept of the model
yhat = model.predict(X)  # simple way to compute predictions using logistic regression and any other machine learning model in sklearn 
print("\nPredictions:\n", yhat)  # The predictions of the model   

### Diagnosing predictions
The above output hints at an issue with our predictions. We discuss this part in the tutorial and *debug* the predictions to fully understand what is going on when we call the function `predict()` and when this function is useful. 
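
As a possible starting point for the debugging, one thing worth checking is how `predict()` relates to `predict_proba()`. The sketch below assumes a binary problem and uses imbalanced synthetic data from `make_classification()`; the data, the class ratio, and all variable names are assumptions for illustration and say nothing about what the HMEQ output actually shows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data: roughly 90% of the cases belong to class 0
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_imb, y_imb)

labels = clf.predict(X_imb)              # discrete class labels
proba = clf.predict_proba(X_imb)[:, 1]   # estimated P(y = 1 | x)

# For a binary problem, predict() corresponds to a 0.5 cut-off on the probability
labels_from_proba = (proba > 0.5).astype(int)
print("Same labels:", np.array_equal(labels, labels_from_proba))

# Compare how often each class is predicted with how often it occurs
print("Predicted class counts:", dict(zip(*np.unique(labels, return_counts=True))))
print("Actual class counts:   ", dict(zip(*np.unique(y_imb, return_counts=True))))
```

If the predicted class counts are far more lopsided than the actual ones, working with the probabilities and choosing a cut-off deliberately may be more informative than the default labels.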

In [None]:
# To be completed in class...




### Visualizing the logistic regression
We complete our examination of the logistic regression model with a visualization of its behavior. Given that plotting is restricted to low-dimensional data, we consider only two features of our data set in this demo.

In [None]:
# Estimate low dimensional logit model for visualization
x1 = "YOJ"
x2 = "DEBTINC"
model2 = LogisticRegression(random_state=888).fit(X[[x1, x2]], y)

For visualization, we use another helper function `plot_logit_decision_surface()`. Let's first inspect its interface and understand how we can use it.

In [None]:
help(bads.plot_logit_decision_surface)

In [None]:
# Code to call the function plot_logit_decision_surface()
bads.plot_logit_decision_surface(model2, X, x2, x1, y)

#### Illustration using synthetic data
Since our real-world lending data is challenging to work with, we also demonstrate our visualization function on well-behaved synthetic data. To that end, we use the function `make_blobs()`, which we learnt about several weeks ago in our [clustering tutorial](https://github.com/Humboldt-WI/bads/blob/master/tutorial_notebooks/3_descriptive_analytics_solutions.ipynb). This function allows us to generate a two-dimensional data set with two classes that logistic regression can distinguish easily. The resulting plot will help us understand the decision boundary of the logistic regression model.

In [None]:
from sklearn.datasets import make_blobs

Xsyn, ysyn = make_blobs(n_samples=250, centers=2, cluster_std=1, n_features=2, random_state=88)
Xsyn = pd.DataFrame(Xsyn, columns=["x1", "x2"])
model_syn = LogisticRegression(random_state=88).fit(Xsyn, ysyn)
bads.plot_logit_decision_surface(model_syn, Xsyn, "x2", "x1", ysyn)

# Decision trees
Once you are familiar with `sklearn`, switching to another model is straightforward. We will now introduce decision trees, including how to train them, how to visualize grown trees, and how to compute predictions. For training and prediction, you already know all you need. The methods `fit()` and `predict()` work the same way as for logistic regression. Likewise, you can obtain probabilistic predictions using the method `predict_proba()`. For visualization, you can use the function `plot_tree()`. Both the class `DecisionTreeClassifier`, which provides these methods, and the function `plot_tree()` are part of the module `sklearn.tree`.
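
The workflow described above might look as follows on toy data. This is a minimal sketch, not a solution for the HMEQ data: the data come from `make_blobs()`, the depth of 2 is an arbitrary choice, and names such as `X_toy` and `toy_tree` are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy two-class data with two features (illustrative only)
X_toy, y_toy = make_blobs(n_samples=250, centers=2, n_features=2, random_state=88)

# Same fit/predict interface as LogisticRegression; max_depth limits tree growth
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=888)
toy_tree.fit(X_toy, y_toy)

toy_labels = toy_tree.predict(X_toy)        # discrete class predictions
toy_proba = toy_tree.predict_proba(X_toy)   # one probability column per class
print(toy_proba[:5])

# Draw the fitted tree; in a notebook the figure is displayed at the end of the cell
fig, ax = plt.subplots(figsize=(8, 4))
plot_tree(toy_tree, feature_names=["x1", "x2"], class_names=["0", "1"], filled=True, ax=ax)
```

The probabilities of a tree are the class shares among the training cases in the leaf that an observation falls into, so a shallow tree produces only a few distinct values.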

This information should be sufficient to get started. Here is your task:

## Exercise tree learning
1. Train a decision tree on the HMEQ data set. Set the maximum depth of the tree to 3.
2. Visualize the tree using the function `plot_tree()`.
3. Compute probabilistic predictions for the training data.

In [None]:
# Your solution: