# Pre-Examination #2 - Give Me Some Credit

## Dataset Description:
### Dataset Kaggle Link:
[Kaggle Give Me Some Credit](https://www.kaggle.com/competitions/GiveMeSomeCredit/overview)

### Features:
| Feature Name                        | Description                                                                                                                                              | Type       |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| `SeriousDlqin2yrs`                     | Person experienced 90 days past due delinquency or worse                                                                                               | Y/N        |
| `RevolvingUtilizationOfUnsecuredLines` | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| `age`                                  | Age of borrower in years                                                                                                                                 | integer    |
| `NumberOfTime30-59DaysPastDueNotWorse` | Number of times borrower has been 30-59 days past due but no worse in the last 2 years.                                                                  | integer    |
| `DebtRatio`                            | Monthly debt payments, alimony, living costs divided by monthly gross income                                                                             | percentage |
| `MonthlyIncome`                        | Monthly income                                                                                                                                           | real       |
| `NumberOfOpenCreditLinesAndLoans`      | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards)                                                     | integer    |
| `NumberOfTimes90DaysLate`              | Number of times borrower has been 90 days or more past due.                                                                                              | integer    |
| `NumberRealEstateLoansOrLines`         | Number of mortgage and real estate loans including home equity lines of credit                                                                           | integer    |
| `NumberOfTime60-89DaysPastDueNotWorse` | Number of times borrower has been 60-89 days past due but no worse in the last 2 years.                                                                  | integer    |
| `NumberOfDependents`                   | Number of dependents in family excluding themselves (spouse, children etc.)                                                                              | integer    |

### Target:
The dataset contains a target column, `SeriousDlqin2yrs`, which is a boolean flag with 2 possible values, `Y/N`, encoded as `1/0` respectively. It indicates whether a person experienced a delinquency of 90 days past due or worse. Therefore, this is a Supervised Binary Classification Machine Learning Problem.

### Problem Description:
To decide whether a loan should be granted, banks need prior knowledge about the borrower's capability to return the money they borrow. For this, they use a credibility-based system that assigns a credit score (reputation) to each borrower. This score is based on different criteria, such as previous loans, excessive debt, concurrent loans, and so on. This dataset is built around predicting the probability that a potential borrower will experience financial distress in the next two years, which helps banks decide whether to grant a loan to that specific person. The task is to build a model that predicts this from several features, such as **Number of Days Overdue**, **Monthly Income** and others.

## Importing Prerequisites

In [1]:
# Import Data Structures
import pandas as pd
import dask.dataframe as dd

# Import Data Manipulation Libraries
import numpy as np
import math

# Import Base Classes for Type Annotation
from sklearn.base import BaseEstimator

# Import Structure Manipulation Methods
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer
from sklearn.decomposition import PCA

# Import Visualization Libs
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
sns.set_style(style="whitegrid")
sns.set_palette('bright')

# Import Outlier Detection
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# Import Feature Selection Methods
from kydavra import PValueSelector
from sklearn.feature_selection import RFECV

# Import Hyperparameter Tuning
import optuna

# Import ML Models
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Import Interpretation Metrics
from sklearn.metrics import classification_report
# from lime.lime_tabular import LimeTabularExplainer
# import shap

# Import Custom Utils
# from utils import get_percentage_cat_col, get_distance_osmnx, get_distance_api, get_distance_haversine, measure_time_function, get_outliers_by_boxplot
import swifter

## Dataset Loading
Since both training and test datasets are not very large, basic `pandas.DataFrame` will be sufficient.

In [2]:
credit_train_zip: pd.DataFrame = pd.read_csv(filepath_or_buffer='dataset/cs-training.csv', sep=',')

In [3]:
credit_test_zip: pd.DataFrame = pd.read_csv(filepath_or_buffer='dataset/cs-test.csv', sep=',')

### Basic Dataset Analysis

In [9]:
credit_train_zip.head(n=10)

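|   | Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|------------|------------------|--------------------------------------|-----|--------------------------------------|-----------|---------------|---------------------------------|-------------------------|------------------------------|--------------------------------------|--------------------|
| 0 | 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 1 | 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 2 | 3 | 0 | 0.65818 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 3 | 4 | 0 | 0.23381 | 30 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 4 | 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
| 5 | 6 | 0 | 0.213179 | 74 | 0 | 0.375607 | 3500.0 | 3 | 0 | 1 | 0 | 1.0 |
| 6 | 7 | 0 | 0.305682 | 57 | 0 | 5710.0 | NaN | 8 | 0 | 3 | 0 | 0.0 |
| 7 | 8 | 0 | 0.754464 | 39 | 0 | 0.20994 | 3500.0 | 8 | 0 | 0 | 0 | 0.0 |
| 8 | 9 | 0 | 0.116951 | 27 | 0 | 46.0 | NaN | 2 | 0 | 0 | 0 | NaN |
| 9 | 10 | 0 | 0.189169 | 57 | 0 | 0.606291 | 23684.0 | 9 | 0 | 4 | 0 | 2.0 |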


In [10]:
credit_train_zip.tail(n=10)

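|   | Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|------------|------------------|--------------------------------------|-----|--------------------------------------|-----------|---------------|---------------------------------|-------------------------|------------------------------|--------------------------------------|--------------------|
| 149990 | 149991 | 0 | 0.055518 | 46 | 0 | 0.609779 | 4335.0 | 7 | 0 | 1 | 0 | 2.0 |
| 149991 | 149992 | 0 | 0.104112 | 59 | 0 | 0.477658 | 10316.0 | 10 | 0 | 2 | 0 | 0.0 |
| 149992 | 149993 | 0 | 0.871976 | 50 | 0 | 4132.0 | NaN | 11 | 0 | 1 | 0 | 3.0 |
| 149993 | 149994 | 0 | 1.0 | 22 | 0 | 0.0 | 820.0 | 1 | 0 | 0 | 0 | 0.0 |
| 149994 | 149995 | 0 | 0.385742 | 50 | 0 | 0.404293 | 3400.0 | 7 | 0 | 0 | 0 | 0.0 |
| 149995 | 149996 | 0 | 0.040674 | 74 | 0 | 0.225131 | 2100.0 | 4 | 0 | 1 | 0 | 0.0 |
| 149996 | 149997 | 0 | 0.299745 | 44 | 0 | 0.716562 | 5584.0 | 4 | 0 | 1 | 0 | 2.0 |
| 149997 | 149998 | 0 | 0.246044 | 58 | 0 | 3870.0 | NaN | 18 | 0 | 1 | 0 | 0.0 |
| 149998 | 149999 | 0 | 0.0 | 30 | 0 | 0.0 | 5716.0 | 4 | 0 | 0 | 0 | 0.0 |
| 149999 | 150000 | 0 | 0.850283 | 64 | 0 | 0.249908 | 8158.0 | 8 | 0 | 2 | 0 | 0.0 |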


As can be seen, the dataset contains several features. Several `NaN` values appear in the `MonthlyIncome` and `NumberOfDependents` columns. Besides that, no feature scaling has been performed on this dataset. The first column, `Unnamed: 0`, is the name that `pd.read_csv()` assigns to a column whose header is empty in the original `.csv` file; most probably it is just the ID column for the samples in the dataset.
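A minimal sketch of how both observations could be checked. It uses a small toy CSV held in memory instead of the real `dataset/cs-training.csv`; the names `raw_csv`, `toy` and `toy_indexed` are illustrative only. Passing `index_col=0` is one possible way to treat the unnamed first column as an index, and is not what the cells above do.

```python
import io

import pandas as pd

# Toy CSV mimicking the layout of cs-training.csv: the first header cell is empty.
raw_csv = io.StringIO(
    ",SeriousDlqin2yrs,MonthlyIncome,NumberOfDependents\n"
    "1,1,9120.0,2.0\n"
    "2,0,,1.0\n"
    "3,0,3042.0,\n"
    "4,0,,0.0\n"
)

# Default read: pandas names the header-less first column "Unnamed: 0".
toy = pd.read_csv(raw_csv)
first_column = toy.columns[0]

# Count missing values per column and express them as a share of all rows.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Alternative read: use the first column as the index instead of a regular column.
raw_csv.seek(0)
toy_indexed = pd.read_csv(raw_csv, index_col=0)
```

On the real training frame the same `isna().sum()` call would report the missing counts for `MonthlyIncome` and `NumberOfDependents` directly.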

In [11]:
credit_train_zip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  NumberOfDependents                    146076 non-null  float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB

There are no columns of `object` data type, which is often used for string columns. There are 8 columns of `int64` data type and 4 columns of `float64` data type. Also, there are 2 columns with missing values, `MonthlyIncome` and `NumberOfDependents`, as mentioned in the previous paragraph. In total, the 12 columns consist of the ID-like `Unnamed: 0` column, 10 Features and 1 Target Variable, `SeriousDlqin2yrs`. The memory usage of this dataset is $\approx$ 13.7 MB.
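The figures above can also be obtained programmatically. The sketch below uses a small hypothetical frame (`toy`) with the same two dtypes rather than the real data. The memory estimate is simple arithmetic: 150000 rows × 12 columns × 8 bytes per `int64`/`float64` value, ignoring the small overhead of the index.

```python
import numpy as np
import pandas as pd

# Toy frame with the same dtypes as the real dataset (values are illustrative).
toy = pd.DataFrame({
    "Unnamed: 0": np.arange(1, 5, dtype="int64"),
    "SeriousDlqin2yrs": np.array([1, 0, 0, 0], dtype="int64"),
    "DebtRatio": [0.80, 0.12, 5710.0, 0.21],
    "MonthlyIncome": [9120.0, 2600.0, np.nan, 3500.0],
})

# Number of columns per dtype, as summarised at the bottom of DataFrame.info().
dtype_counts = toy.dtypes.value_counts()

# Columns that contain at least one missing value.
columns_with_nan = toy.columns[toy.isna().any()].tolist()

# Actual memory footprint of the toy frame, in bytes (index included).
memory_bytes = toy.memory_usage(index=True, deep=True).sum()

# Back-of-the-envelope estimate for the full training set.
expected_bytes = 150_000 * 12 * 8
expected_mib = expected_bytes / 2**20
print(f"Estimated size: {expected_mib:.1f} MiB")
```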

In [12]:
print(f"Train Dataset Shape: {credit_train_zip.shape[0]} samples, {credit_train_zip.shape[1]} columns")

Train Dataset Shape: 150000 samples, 12 columns


As can be seen, the train dataset contains exactly 150000 training samples and 12 columns: the ID-like `Unnamed: 0` column, 10 features and 1 target variable column.

In [13]:
credit_train_zip.describe()

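|   | Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|------------|------------------|--------------------------------------|-----|--------------------------------------|-----------|---------------|---------------------------------|-------------------------|------------------------------|--------------------------------------|--------------------|
| count | 150000.0 | 150000.0 | 150000.0 | 150000.0 | 150000.0 | 150000.0 | 120269.0 | 150000.0 | 150000.0 | 150000.0 | 150000.0 | 146076.0 |
| mean | 75000.5 | 0.06684 | 6.048438 | 52.295207 | 0.421033 | 353.005076 | 6670.221 | 8.45276 | 0.265973 | 1.01824 | 0.240387 | 0.757222 |
| std | 43301.414527 | 0.249746 | 249.755371 | 14.771866 | 4.192781 | 2037.818523 | 14384.67 | 5.145951 | 4.169304 | 1.129771 | 4.155179 | 1.115086 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 37500.75 | 0.0 | 0.029867 | 41.0 | 0.0 | 0.175074 | 3400.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 75000.5 | 0.0 | 0.154181 | 52.0 | 0.0 | 0.366508 | 5400.0 | 8.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 75% | 112500.25 | 0.0 | 0.559046 | 63.0 | 0.0 | 0.868254 | 8249.0 | 11.0 | 0.0 | 2.0 | 0.0 | 1.0 |
| max | 150000.0 | 1.0 | 50708.0 | 109.0 | 98.0 | 329664.0 | 3008750.0 | 58.0 | 98.0 | 54.0 | 98.0 | 20.0 |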


After a brief analysis of the described dataset, several conclusions were derived:
1. The dataset is highly imbalanced: the 75th percentile of the target variable `SeriousDlqin2yrs` is 0 and its mean is 0.06684, so only about 6.7% of the samples are positive;
2. Features have very different ranges of values, from narrow ranges, such as `age` (0 to 109), to very wide ranges, such as `DebtRatio` (0 to 329664), which can impact gradient-based or distance-based Machine Learning Models, such as Logistic Regression or K-Nearest Neighbors;
3. A considerable number of outliers might be present, judging by the gap between the percentile values and the maximum for several columns, such as `RevolvingUtilizationOfUnsecuredLines` or `DebtRatio`.
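These conclusions can be cross-checked from the summary statistics alone. The sketch below copies the quartiles, maxima and target mean from the `describe()` output above into a small hypothetical `summary` frame. The outlier check uses Tukey's 1.5 × IQR rule, which is an assumption made here for illustration; the notebook itself may rely on other methods, such as the imported `LocalOutlierFactor` or `IsolationForest`.

```python
import pandas as pd

# Quartiles and maxima copied from the describe() output above.
summary = pd.DataFrame(
    {
        "RevolvingUtilizationOfUnsecuredLines": [0.029867, 0.559046, 50708.0],
        "DebtRatio": [0.175074, 0.868254, 329664.0],
        "MonthlyIncome": [3400.0, 8249.0, 3008750.0],
    },
    index=["25%", "75%", "max"],
).T

# The mean of a 0/1 target equals the share of positive samples.
positive_share = 0.06684
negative_share = 1.0 - positive_share

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
iqr = summary["75%"] - summary["25%"]
summary["upper_fence"] = summary["75%"] + 1.5 * iqr
summary["max_beyond_fence"] = summary["max"] > summary["upper_fence"]

print(f"Negative share: {negative_share:.2%}")
print(summary)
```

With the raw data available, the class balance could be read with `credit_train_zip["SeriousDlqin2yrs"].value_counts(normalize=True)` instead of the copied mean.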