# Loss Given Default Analysis [TPS August]
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@constantinevdokimov?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Konstantin Evdokimov</a>
        on 
        <a href='https://unsplash.com/s/photos/loan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# 1. Problem definition

In this month's Tabular Playground Series (TPS) competition, we are tasked with predicting the amount of money a bank or financial institution might lose if a loan goes into default.

Before we start the EDA, let's make sure we are all on the same page on some of the key terms of the problem definition:
1. What is loan default?
   - Default is a failure to repay a debt or loan on time. It can occur when a borrower fails to make timely payments on obligations such as mortgages, bank loans, or car leases.
2. What is a loss given default (LGD)?
   - LGD is the amount of money a bank or financial institution might lose if a loan goes into default. Calculating and predicting LGD can be complex and involve many factors. 
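
To make the second definition concrete, here is a minimal sketch of a common textbook simplification: monetary LGD as exposure at default multiplied by one minus the recovery rate. The function name, the parameters, and the loan figures are hypothetical and chosen purely for illustration; nothing here is taken from the competition data, and real LGD models involve many more factors.

```python
def loss_given_default(exposure_at_default: float, recovery_rate: float) -> float:
    """Monetary LGD under the simplified formula EAD * (1 - recovery rate).

    Both parameters are illustrative assumptions, not competition features.
    """
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be between 0 and 1")
    return exposure_at_default * (1.0 - recovery_rate)


# Hypothetical loan: 200,000 outstanding at default, 60% eventually recovered
example_loss = loss_given_default(200_000, 0.60)
print(f"Estimated loss: {example_loss:,.0f}")
```

Under this simplification, a higher recovery rate (for example, from collateral) directly lowers the loss, which is one reason the real quantity depends on so many loan-level factors.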

As you will see in just a bit, the dataset for the competition has over 100 features and the target `loss` is (I think) LGD. For more information on these terms, check out [this](https://www.kaggle.com/c/tabular-playground-series-aug-2021/discussion/256337) discussion thread.

The metric used in this competition is Root Mean Squared Error (RMSE), a regression metric that penalizes large errors more heavily than small ones:
![](images/metric.png)
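
For reference, Root Mean Squared Error is the square root of the average squared difference between the true and predicted values. The snippet below is a minimal NumPy sketch of that calculation; the helper name `rmse` and the toy arrays are assumptions for illustration, not part of the competition code.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: the errors are -1, 0 and 2, so the mean squared error is 5/3
toy_score = rmse([0, 2, 5], [1, 2, 3])
print(f"RMSE: {toy_score:.4f}")
```

Because the errors are squared before averaging, a single large miss on a high-loss loan can dominate the score.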

# 2. Setup

In [4]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from matplotlib import rcParams

# Global plot configs
rcParams["figure.dpi"] = 200
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False

# Pandas global settings
pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 4)  # full option name; the bare "precision" is ambiguous in newer pandas

# Import data
train_df = pd.read_csv("data/train.csv", index_col="id")
test_df = pd.read_csv("data/test.csv", index_col="id")
sub = pd.read_csv("data/sample_submission.csv")

## Overview of the datasets