[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/IPML/blob/master/tutorial_notebooks/9_classification_tasks.ipynb) 

# Classification 
In this demo notebook, we revisit our lecture on classification models. To that end, we consider the logistic regression model and study how it allows us to approach a probability-of-default (PD) prediction task. As usual, we provide ready-to-use demo code and small programming tasks. 

## Preliminaries
We continue to use the HMEQ classification data set. Beyond loading standard libraries, the following code block reads the data from our [GitHub repository](https://github.com/Humboldt-WI/IPML/tree/main) and performs the preprocessing operations that we introduced in earlier tutorials. Since future tutorials will need the same functionality, we encapsulate the code in a function called `get_credit_risk_data()`.


In [None]:
# Load standard libraries
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# We wrap all code that retrieves and prepares our credit risk data in a custom function
def get_credit_risk_data(outlier_factor=2):
    # Load credit risk data directly from GitHub
    data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq.csv'
    hmeq = pd.read_csv(data_url)

    # Code categories properly 
    hmeq['REASON'] = hmeq['REASON'].astype('category')
    hmeq['JOB'] = hmeq['JOB'].astype('category')
    hmeq = pd.get_dummies(hmeq, drop_first=True)  
    
    # Code the target variable properly
    hmeq['BAD'] = hmeq['BAD'].astype('bool')
    
    # Downcast numerical columns to save memory
    ix_numeric_columns = hmeq.select_dtypes(include=np.number).columns
    hmeq[ix_numeric_columns] = hmeq[ix_numeric_columns].astype('float32')

    # Handle missing values:
    # 1. The feature DEBTINC is important but suffers many missing values. Blindly replacing these missing values
    #    would introduce bias and harm any model trained on the data. To avoid this, we add a dummy variable
    #    to indicate whether the feature value was missing or not.
    hmeq['D2I_miss'] = hmeq['DEBTINC'].isna().astype('category')
    # 2. For the numerical features, we use the median to impute missing values
    imputer = SimpleImputer(strategy='median')  # Create an imputer object with the strategy 'median'
    hmeq[ix_numeric_columns] = imputer.fit_transform(hmeq[ix_numeric_columns])  
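    # 3. For the categorical features, we use the mode to impute missing values.
    #    Note: get_dummies() above has already replaced REASON and JOB by dummy columns (a missing category shows
    #    up as all-zero dummies), so this loop only processes columns that still have the dtype 'category'.
    ix_cat = hmeq.select_dtypes(include=['category']).columns  # Get an index of the categorical columns
    for c in ix_cat:  # Process each categorical column
        hmeq.loc[hmeq[c].isna(), c] = hmeq[c].mode()[0]  # the index [0] is necessary because mode() returns a Pandas Series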
    
    # Truncate outliers among numerical features
    if outlier_factor > 0:
        for col in ix_numeric_columns:
            if col not in ['DELINQ', 'DEROG']:  # We do not truncate these features: their distribution is so strongly skewed that outlier truncation would leave us with a constant feature
                q1 = hmeq[col].quantile(0.25)
                q3 = hmeq[col].quantile(0.75)
                iqr = q3 - q1
                lower_bound = q1 - outlier_factor * iqr
                upper_bound = q3 + outlier_factor * iqr
                hmeq[col] = hmeq[col].clip(lower=lower_bound, upper=upper_bound)

    # Scale numerical features
    scaler = StandardScaler()
    hmeq[ix_numeric_columns] = scaler.fit_transform(hmeq[ix_numeric_columns])

    # Separate the target variable and the feature matrix
    y = hmeq.pop('BAD')
    X = hmeq

    return X, y

# Call the function to retrieve the data
X, y = get_credit_risk_data()   

# Preview the data
X


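To make the preprocessing steps inside `get_credit_risk_data()` more tangible, the following sketch replays three of them (missing-value indicator, median imputation, and IQR-based truncation) on a small toy column. The numbers are made up for illustration and are not taken from the HMEQ data. For brevity, the sketch uses `fillna()` instead of `SimpleImputer`.

```python
import numpy as np
import pandas as pd

# Toy feature with one missing value and one extreme value (made-up numbers)
toy = pd.DataFrame({'DEBTINC': [30.0, 35.0, np.nan, 32.0, 34.0, 200.0]})

# 1. Flag missing values BEFORE imputing, so that the information is not lost
toy['D2I_miss'] = toy['DEBTINC'].isna()

# 2. Replace the missing value with the median of the observed values
toy['DEBTINC'] = toy['DEBTINC'].fillna(toy['DEBTINC'].median())

# 3. Truncate values that lie more than outlier_factor * IQR outside the quartiles
outlier_factor = 2
q1 = toy['DEBTINC'].quantile(0.25)
q3 = toy['DEBTINC'].quantile(0.75)
iqr = q3 - q1
toy['DEBTINC'] = toy['DEBTINC'].clip(lower=q1 - outlier_factor * iqr,
                                     upper=q3 + outlier_factor * iqr)

print(toy)
```

After running the sketch, the formerly missing entry carries the median and is marked in `D2I_miss`, while the extreme value has been pulled back to the upper bound.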

# Binary classification for PD modeling
Having prepared our data, we can proceed with predictive modeling. The lecture introduced the general classification setup and the logistic regression model. Let's revisit these elements in detail. 

## Exercise 1: Plotting data for classification
You will remember the many plots we came across when discussing regression. We also saw some analogous plots for classification problems in the lecture. One of them was a 2D scatter plot displaying the bivariate relationship between selected features and the binary target variable. 

![Classification problem in 2D](https://raw.githubusercontent.com/stefanlessmann/ESMT_IML/main/resources/2d_classification_problem.png)

Your first task is to create a similar plot for the credit data. In principle, you can select any combination of features that you like.  

In [None]:
# Exercise 1


# Logistic regression
Time to estimate our first classification model. We will use logistic regression. Think of it as an extension of linear regression for cases in which we work with a binary target variable. Just as in linear regression, logistic regression involves model training on labelled data. The code below uses the `sklearn` library to train a logistic regression-based classification model. In case you receive a warning message when running the code (i.e., a *convergence warning*), please ignore this message for now. 

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state=888)  # we define a random_state to ensure that we get the same results when re-running this cell multiple times
logreg.fit(X, y)  # Initiate training by calling the fit method

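To illustrate the idea of logistic regression as an extension of linear regression, the following minimal sketch computes a linear score and passes it through the logistic (sigmoid) function, which maps any real number into the interval (0, 1). The coefficients `w`, the intercept `b`, and the toy observations are hypothetical values chosen for illustration; they are not the estimates of the model trained above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept (NOT estimated from the HMEQ data)
w = np.array([0.8, -1.2])
b = -0.5

# Three made-up observations with two features each
X_toy = np.array([[0.0, 0.0],
                  [2.0, -1.0],
                  [-1.0, 2.0]])

z = X_toy @ w + b   # linear part, just as in linear regression
p = sigmoid(z)      # squashed into (0, 1), interpretable as a probability

print('Linear scores:', z)
print('Probabilities:', p.round(3))
```

The larger the linear score, the closer the output is to one; a score of zero corresponds to a probability of exactly 0.5.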

You will likely also want to assess the model. An easy way to do so is to call the method `score()` of the trained model. 

In [None]:
perf = logreg.score(X, y)  # Call the general-purpose evaluation method to obtain a quality score of the model
print('Logit model achieves a score of {:.3f} %'.format(perf*100))

## Exercise 2: Diagnosing predictions
A score above 87 percent sounds good. Actually, it is not, and your task is to find out why. Let's break it down into pieces.

### A) What score?
Find out which metric the method `score()` has returned. What exactly does the figure of about 87 percent measure? You can check out the [sklearn documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) or any other source. 

**Your answer:** ...

### B) Is it good or is it bad?
Interpreting our score will be easier if we compare it to a baseline. But what baseline? We face a classification problem. There are two classes, good payers and bad payers, and we aim to tell these apart. Come up with a very basic strategy to solve the classification problem without using any model. Write a piece of code to calculate the performance of your super-basic strategy. 

<details> <summary>Hint: </summary> If you feel a bit lost, consider web searching for <i>dummy classifier</i> </details>  

In [None]:
# Solution to 2b


If you succeeded with the previous task, you will have found that a super-basic (stupid) classifier achieves an accuracy of 80 percent. This is not as high as what we saw from logit, but it puts the first impression of logistic regression performing very well into perspective. 

Note that our approach to computing the score of the naive classifier assumes that the positive class with $Y=1$ is the minority class. While this is typically the case, we should acknowledge that our approach is simplistic. It would be better to first establish which of the two classes is the majority class and then use the fraction of that class as the accuracy score of a dummy classifier. While not too difficult, we leave this extension to interested readers and move on to probabilities. 
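For readers who want to check their own attempt, here is one possible sketch of a baseline that does not assume which class is the minority. It uses made-up toy labels rather than the HMEQ target, and the names `y_toy` and `X_dummy` are hypothetical. The share of the most frequent class is compared with the accuracy of `DummyClassifier` from `sklearn` with `strategy='most_frequent'`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up toy labels: 8 good payers (False) and 2 bad payers (True)
y_toy = pd.Series([False] * 8 + [True] * 2)

# Share of the most frequent class, whichever class that is
majority_share = y_toy.value_counts(normalize=True).max()

# The dummy classifier ignores the features, so a placeholder matrix suffices
X_dummy = np.zeros((len(y_toy), 1))
dummy = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_toy)
dummy_acc = dummy.score(X_dummy, y_toy)

# Flipping the labels makes the positive class the majority; the baseline is unchanged
flipped_share = (~y_toy).value_counts(normalize=True).max()

print('Majority-class share: {:.1f} %'.format(majority_share * 100))
print('Dummy classifier accuracy: {:.1f} %'.format(dummy_acc * 100))
```

Both numbers agree, and the baseline stays the same if the labels are flipped, which is the property that the simpler approach lacks.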

### C) What about probabilities?
Exactly, what about probabilities? The lecture introduced classification as a machine learning setup aimed at predicting class membership probabilities. So logistic regression should answer questions such as "what is the estimated probability that the first credit applicant in our data set will repay?" Can you guess your next task? Write code that provides you with the probabilistic predictions of the logistic regression classifier. More formally, your code should give you estimates of the class membership probabilities $\hat{p}(BAD=1|X_i)$ for all observations $i$ in the data set. 

To that end, begin experimenting with the method `predict()` that every learning algorithm in `sklearn` provides. Understand what it does and how it differs from what you need. Afterwards, search for the right method to obtain class membership probabilities.



In [None]:
# Solution to 2C
