# Case 2 -- Machine Learning for Finance 2025

<span style='color:crimson; font-weight: bold'>Submission deadline: Friday, 5 December 2025, 22:00 AMS.</span>

## Instructions
* This case covers the material discussed in weeks 2 to 5.
* Do not forget to create a group again: go to Canvas -> people -> groups -> Case2.
* Each group submits _only one_ notebook via Canvas on the assignment page. Only Jupyter notebooks submitted via a group are accepted.
* The notebook should be named `case2_groupXX.ipynb` where `XX` is your group number,  
e.g. for group 3 this will be `case2_group03.ipynb`.
* Make sure you download the correct dataset. Loading the wrong set will lead to deduction of points. 
* The notebook should run without raising any errors.
* Deadline: **Friday 5 December 22:00 (AMS)**. Missing the deadline incurs a penalty of 10 points per hour.
* Standard plagiarism and AI checks are in place.
* As a standard anti-fraud measure, I can select a number of you at random to explain your code 
and answers. Any one of you must be able to explain any part of the code. Failure to explain 
your answers will result in a deduction of credits for this case for the whole group. Each group is responsible for ensuring that all group members can explain any part of the code.
* If you need to make a Table or Figure, do this in JF-style: provide a sufficient caption that explains (NOT interprets) what is in the Figure/Table.
* If you test something, provide H0/HA, the test statistic (formula and number) and your conclusion.
* Do not spend time on optimizing the speed of your code. However, if it runs for more than 5 minutes, we will terminate 
it.

----

<div style="font-size:24px; text-align:center; font-weight:bold">Good luck!</div>

----


# Case 2 - Modeling defaults

In this case you will model the defaults of U.K. companies by using several classification models. 
The goal is to see whether sophisticated machine learning models such as decision trees, neural networks and ensemble learning methods can beat a simple logit model. 

Download your own dataset from Canvas. You have the following variables at your disposal:
* **org_id**: organisation ID
* **sic2**: 2-Digit SIC (Standard Industrial Classification) Codes
* **year**: year of the observation
* **def**: binary indicator: 1 if the company defaulted, 0 otherwise
* **wkta**: working capital over total assets
* **reta**: retained earnings over total assets
* **ebitta**: earnings before interest and taxes over total assets 
* **mv**: the market-to-book value
  
First take a look at the data, then test different algorithms, make predictions and provide an answer to the main questions of the case.

State your imports below.

In [3]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Scikit-learn libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score, 
                             f1_score, fbeta_score, confusion_matrix, 
                             classification_report)
from sklearn.preprocessing import StandardScaler

# Neural Network libraries
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Set random seed for reproducibility
np.random.seed(42)

# Load the dataset from Excel
df = pd.read_excel('data_MLF_case2_group_26.xlsx', sheet_name='Data')


# Part I: Preprocessing (15 points)

Import your data from excel. Then perform the following tasks:
1. Check for missing values and delete these.
2. Do the features contain any outliers? If so, treat them carefully with an explanation.
3. Show the correlations between the (labels and the) features. Comment on the sign of the correlation for 2 randomly selected features.
4. Show summary statistics in **Table 1**
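
Tasks 2 to 4 could be approached along the following lines. This is only a minimal sketch on synthetic data: the `demo` frame and its values are hypothetical stand-ins that reuse the case's column names, and winsorizing at the 1st and 99th percentile is one possible outlier treatment, not a prescribed one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the case data (hypothetical values, same column names).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "def": rng.binomial(1, 0.1, size=n),
    "wkta": rng.normal(0.2, 0.3, size=n),
    "reta": rng.normal(0.1, 0.4, size=n),
    "ebitta": rng.normal(0.05, 0.1, size=n),
    "mv": rng.lognormal(0.0, 1.0, size=n),
})
features = ["wkta", "reta", "ebitta", "mv"]

# Task 2 (one option): winsorize each feature at its 1st and 99th percentile.
lower = demo[features].quantile(0.01)
upper = demo[features].quantile(0.99)
winsorized = demo.copy()
winsorized[features] = demo[features].clip(lower=lower, upper=upper, axis=1)

# Task 3: Pearson correlations between the label and the features.
corr = winsorized[["def"] + features].corr()

# Task 4: summary statistics, one row per variable.
table1 = winsorized[["def"] + features].describe().T

print(corr.round(3))
print(table1.round(3))
```

Whichever treatment you choose, the explanation asked for in task 2 should state the cut-offs and why they are reasonable for these ratios.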

In [11]:
### 1

print(df.head())
print(df.shape)

print("")
print("1. Dealing with missing values")
print("")
# Check for missing values
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

# Delete rows with missing values
df_clean = df.dropna()

print(f"\nOriginal dataset size: {len(df)}")
print(f"Cleaned dataset size: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

   org_id  sic2  year  def      wkta      reta    ebitta        mv
0    3671    33  2001    0  0.153244  0.113063  0.070824  1.806667
1    3174    59  2001    0  0.380485  0.203497  0.126315  1.734991
2    3289    78  1988    0  0.039873  0.000000  0.031834  0.196416
3    7872    48  2001    0  0.202144 -0.160948  0.062240 -0.238152
4    1275    49  1991    0 -0.510000 -0.260833  0.025862 -0.130025
(1025, 8)

1. Dealing with missing values

org_id     0
sic2       0
year       0
def        0
wkta      42
reta      42
ebitta    42
mv        42
dtype: int64

Total missing values: 168

Original dataset size: 1025
Cleaned dataset size: 983
Rows removed: 42


1. 42 rows were found with missing values and subsequently deleted.

In [None]:
### 2

print("")
print("2. Looking at outliers")
print("")
print(df_clean.describe())





# Part II: Training (50 points)

We've familiarized ourselves with the data, so now we're going to train some models to model the probability of default. 
Use *wkta, reta, ebitta* and *mv* as features in the following models:
- Model 1: The logistic classifier 
- Model 2: The decision tree
- Model 3: A neural network
- Model 4: Gradient Boosting for Classification

Split the data randomly into training, validation and test sets using the 60/20/20 rule. Pin down your random sets by providing the seed.
If there are hyperparameters, tune these in the correct way and show plots, using an appropriate measure, to explain your final hyperparameter(s). Explain why you have used this measure!
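
A 60/20/20 split can be obtained by calling `train_test_split` twice. The sketch below uses hypothetical random data and an arbitrary seed of 42; only the two-step logic is the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # any fixed integer pins down the split

# Hypothetical stand-in data: 1000 rows, 4 features, roughly 10% defaults.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 4))
y = rng.binomial(1, 0.1, size=1000)

# Step 1: hold out 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=SEED
)
# Step 2: 20% of the original data equals 25% of the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=SEED
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The second call uses `test_size=0.25` because the validation set is taken from the remaining 80% of the rows, not from the full sample.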

Create **Table 2** by showing the accuracy, recall, precision, F1 score, and the $F_\beta$ using the test set for each model.
Choose your own $\beta$ with explanation. Interpret the outcomes. Did you expect these results? Why/Why NOT?
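
For reference, $F_\beta = (1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$, so $\beta > 1$ puts more weight on recall and $\beta < 1$ more weight on precision. A row of Table 2 could be assembled as sketched below; the helper `metric_row`, the toy labels and the choice $\beta = 2$ are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, fbeta_score)

def metric_row(y_true, y_pred, beta=2.0):
    """Return the five Table 2 metrics for one model as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "F1": f1_score(y_true, y_pred, zero_division=0),
        f"F{beta:g}": fbeta_score(y_true, y_pred, beta=beta, zero_division=0),
    }

# Toy labels and predictions (hypothetical), 1 = default.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# One row per model; add further models as extra dictionary entries.
table2 = pd.DataFrame({"toy model": metric_row(y_true, y_pred, beta=2.0)}).T
print(table2.round(3))
```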

### More information about Model 2
Estimate a decision tree using the default hyperparameters of scikit-learn's `DecisionTreeClassifier`. Only tune the *maximum depth* parameter.
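
Tuning the maximum depth on the validation set might look like the sketch below. The data come from `make_classification` as a hypothetical imbalanced stand-in, the depth range 1 to 10 and the $F_2$ score are assumptions, and `random_state` is set only to make the tree reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42
# Hypothetical imbalanced data standing in for the four case features.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.9, 0.1], random_state=SEED)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=SEED)

depths = list(range(1, 11))  # candidate values for max_depth (assumed range)
val_scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=SEED)
    tree.fit(X_train, y_train)  # fit on the training set only
    val_scores.append(fbeta_score(y_val, tree.predict(X_val), beta=2.0, zero_division=0))

# Pick the depth with the highest validation score; the test set stays untouched.
best_depth = depths[int(np.argmax(val_scores))]
print(best_depth, [round(s, 3) for s in val_scores])
```

Plotting `val_scores` against `depths` gives the kind of tuning plot the assignment asks for.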

### More information about Model 3
Keep the following settings fixed: 
- **Activation function**: *ReLU* for hidden layers
- **Dropout ratio**: add a dropout layer after each hidden layer and set its ratio to 0.2
- **Epochs**: 50
- **Batch size**: 10

When compiling the model, keep the following settings fixed:
- loss='categorical_crossentropy'
- optimizer='adam'
- metrics=['accuracy']

Now tune the following hyperparameters: 
- the number of **hidden** layers: 1 or 2.
- Number of nodes per hidden layer: 16 or 8. 
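
The fixed settings and the four candidate architectures could be wired up roughly as follows. The helper name `build_network`, the two-unit softmax output layer and the lazy TensorFlow import are assumptions of this sketch, not requirements of the case.

```python
from itertools import product

def build_network(n_hidden_layers, n_nodes, n_features=4, n_classes=2):
    """Build and compile a network under the fixed conditions of the case."""
    # Imported here so the grid below can be inspected without loading TensorFlow.
    from tensorflow.keras.layers import Dense, Dropout, Input
    from tensorflow.keras.models import Sequential

    model = Sequential()
    model.add(Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(Dense(n_nodes, activation="relu"))
        model.add(Dropout(0.2))  # dropout after each hidden layer
    # Assumption: softmax over two classes, to match categorical cross-entropy.
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

# The four candidate architectures: (hidden layers, nodes per hidden layer).
grid = list(product([1, 2], [8, 16]))
print(grid)  # [(1, 8), (1, 16), (2, 8), (2, 16)]
```

Because the loss is categorical cross-entropy, the labels need to be one-hot encoded before fitting, for example with `to_categorical(y_train)`, and each candidate would then be trained with `epochs=50` and `batch_size=10` and compared on the validation set.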

### More information about Model 4
Set the following hyperparameters fixed:
- Learning rate: 0.8
- random state: 0
- max depth: 2

Tune the parameter *n_estimators*.
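
One way to tune *n_estimators* without refitting for every candidate is to fit once with an upper bound and score each boosting stage on the validation set via `staged_predict`. In this sketch the data are hypothetical, and the upper bound of 100 and the $F_2$ score are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

SEED = 42
# Hypothetical imbalanced data standing in for the four case features.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.9, 0.1], random_state=SEED)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=SEED)

MAX_ESTIMATORS = 100  # upper bound of the search (an assumption)
gbc = GradientBoostingClassifier(learning_rate=0.8, random_state=0, max_depth=2,
                                 n_estimators=MAX_ESTIMATORS)
gbc.fit(X_train, y_train)

# staged_predict yields the validation predictions after 1, 2, ..., 100 trees.
val_scores = [fbeta_score(y_val, pred, beta=2.0, zero_division=0)
              for pred in gbc.staged_predict(X_val)]
best_n = int(np.argmax(val_scores)) + 1  # stages are counted from 1
print(best_n, round(max(val_scores), 3))
```

The final model would then be refit with `n_estimators=best_n` before it is evaluated on the test set.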



# Part III: Estimating a tree using only two features (30 points)

Select only **wkta** and **reta** as features. Then estimate a decision tree again, once more tuning the *maximum depth* hyperparameter.
Let's call this model **Decision Tree Small**.

Answer the following questions:
- Is the hyperparameter of your decision tree of Part II robust?
- Does Decision Tree Small outperform the best model of Part II?
- Does Decision Tree Small beat the logit *with the same two features*?
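
The comparison with the two-feature logit could be set up as in the sketch below. It again uses hypothetical data from `make_classification`; the depth range, the $F_2$ score and the helper `val_score` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

SEED = 42
# Two hypothetical features standing in for wkta and reta.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=SEED)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=SEED)

def val_score(depth):
    """Validation F2 score of a tree with the given maximum depth."""
    tree = DecisionTreeClassifier(max_depth=depth, random_state=SEED).fit(X_train, y_train)
    return fbeta_score(y_val, tree.predict(X_val), beta=2.0, zero_division=0)

depths = list(range(1, 11))
best_depth = max(depths, key=val_score)  # first depth with the highest validation score

tree_small = DecisionTreeClassifier(max_depth=best_depth, random_state=SEED).fit(X_train, y_train)
logit = LogisticRegression().fit(X_train, y_train)

# Both models see the same two features and are scored on the same test set.
results = {
    "Decision Tree Small": fbeta_score(y_test, tree_small.predict(X_test), beta=2.0, zero_division=0),
    "Logit (2 features)": fbeta_score(y_test, logit.predict(X_test), beta=2.0, zero_division=0),
}
print(best_depth, results)
print(export_text(tree_small, feature_names=["wkta", "reta"]))
```

`export_text` prints the split rules of the fitted tree, which can help when interpreting the tree figure.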

Also make a plot of your final decision tree with the two features as done in the tutorial/lecture. Interpret this figure. 

Put your test results in **Table 3**. Interpret your results and relate these to the main question of the case.


# Discussion (5 points)

Provide a short discussion about the way we tune the hyperparameters. Pay attention to the characteristics of the data at hand while answering this question.