# Dataset Description
This data corresponds to a set of financial transactions associated with individuals. The data has been standardized, de-trended, and anonymized. You are provided with over two hundred thousand observations and nearly 800 features. Each observation is independent of the previous one.

For each observation, it was recorded whether a default was triggered. In case of a default, the loss was measured. This quantity lies between 0 and 100. It has been normalised, considering that the notional of each transaction at inception is 100. For example, a loss of 60 means that only 40 is reimbursed. If the loan did not default, the loss was 0. You are asked to predict the losses for each observation in the test set.

Missing feature values have been kept as is, so that the competing teams can really use the maximum data available, implementing a strategy to fill the gaps if desired. Note that some variables may be categorical (e.g. f776 and f777).

The competition sponsor has worked to remove time-dimensionality from the data. However, the observations are still listed in order from old to new in the training set. In the test set they are in random order.

More info: https://www.kaggle.com/competitions/loan-default-prediction/overview

# 0 Import packages

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

In [None]:
print(f"pandas: {pd.__version__}, numpy:{np.__version__}, sklearn:{sklearn.__version__}")

# 1. Data cleaning
and missing values replacement

## read data

In [None]:
data_path = 'data/'

train_df = pd.read_csv(data_path+"train_v2.csv.zip", compression="zip")
delay_df = pd.read_csv(data_path+"test_v2.csv.zip", compression="zip")

In [None]:
train_df.head()

In [None]:
# looking at the dtypes, we can see that there are both numeric and categorical columns
train_df.shape, train_df.dtypes.value_counts()

In [None]:
# train_df has 770 features plus the target 'loss'; delay_df has no 'loss' column
delay_df.shape, delay_df.dtypes.value_counts()

In [None]:
delay_df.head()

In [None]:
train_df['loss'].describe()

In [None]:
train_loss_stat = (train_df['loss'].value_counts(dropna=False)*100/len(train_df)).sort_index()
print(train_loss_stat.head().to_string())
train_loss_stat.loc[1:].plot(kind='bar', figsize=(12, 6))

About 91% of the loss values are zero, so let's build the solution in 2 steps:
- binary classification: is the loss zero or not
- regression on the observations whose loss is not zero
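
The two steps above can be combined into a single loss prediction. The following is only a minimal sketch on synthetic data: the arrays `X_demo`, `default_demo` and `loss_demo`, the choice of `LogisticRegression` and `LinearRegression`, and the combination rule (probability of default times the predicted loss given default) are all assumptions made for illustration, not the method used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data (hypothetical): 1000 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
# Roughly 10% of the rows default, more often when the first feature is large.
default_demo = (X_demo[:, 0] + rng.normal(size=1000) > 1.8).astype(int)
# The loss is 0 without a default, otherwise a value in [1, 100].
loss_demo = np.where(
    default_demo == 1,
    np.clip(20 + 10 * X_demo[:, 1] + rng.normal(size=1000), 1, 100),
    0.0,
)

# Step 1: classifier for "loss > 0".
clf = LogisticRegression().fit(X_demo, default_demo)

# Step 2: regressor fitted only on the rows that defaulted.
mask = default_demo == 1
reg = LinearRegression().fit(X_demo[mask], loss_demo[mask])

# Combine both steps: P(default) * predicted loss given default, kept in [0, 100].
p_default = clf.predict_proba(X_demo)[:, 1]
loss_if_default = np.clip(reg.predict(X_demo), 0, 100)
expected_loss = p_default * loss_if_default
```

Other combination rules are possible, for example predicting the regressor's output only when the probability exceeds a chosen threshold and 0 otherwise.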

## split data

It is better to split the data at the very beginning, so that both the preprocessing and the algorithm can be evaluated on out-of-sample data

In [None]:
id_col = train_df["id"]
y = (train_df["loss"]>0).astype(int)
X = train_df.drop(["id", "loss"], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
y_train.mean(), y_test.mean()

In [None]:
# TO:DO - you need to analyse the features and add the prefixes 'num_' (bool, int, float) and 'cat_' (string, object),
# then filter the columns into lists
num_cols = [j for j in X_train.columns if 'num_' in j]
cat_cols = [j for j in X_train.columns if 'cat_' in j]
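
One possible way to add the prefixes is to rename the columns by dtype. This is a sketch under assumptions: the helper `add_type_prefixes` and the frame `demo` are hypothetical, and a purely dtype-based rule will label integer-coded categorical features (the description mentions f776 and f777 as possible examples) as numeric, so such columns would still need to be reassigned by hand.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def add_type_prefixes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'num_' or 'cat_' prepended to every column name."""
    new_names = {}
    for col in df.columns:
        # is_numeric_dtype is True for bool, int and float columns
        if is_numeric_dtype(df[col]):
            new_names[col] = f"num_{col}"
        else:
            new_names[col] = f"cat_{col}"
    return df.rename(columns=new_names)


demo = pd.DataFrame(
    {"f1": [0.5, 1.5], "f2": [1, 2], "f3": ["a", "b"], "f4": [True, False]}
)
demo = add_type_prefixes(demo)
print(list(demo.columns))
# ['num_f1', 'num_f2', 'cat_f3', 'num_f4']
```

The same renaming would have to be applied to every frame that is preprocessed later, so that the column names stay aligned.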

In [None]:
# unit test - there are no other cols (expect an empty set)
set(X_train.columns)-set(num_cols)-set(cat_cols)

## handle Nan

In [None]:
def get_na_list(df: pd.DataFrame, na_threshold: float = 0.75) -> list:
    """Compute the share of NaN values per column and return the list of
    columns whose share is above na_threshold, i.e. the columns we should drop"""

    # share of missing values per column, as a Series indexed by column name
    cols_na_stat = df.isna().mean()

    sum_na_cols = int((cols_na_stat > na_threshold).sum())
    proc_na_cols = round(100 * sum_na_cols / len(df.columns), 2)

    # then print results
    print(
        f"cols with na rate > {na_threshold}: {sum_na_cols} cols or {proc_na_cols}%"
    )

    return list(cols_na_stat[cols_na_stat>na_threshold].index)

In [None]:
drop_na_list = get_na_list(X_train)

In [None]:
X_train.drop(drop_na_list, axis=1, inplace=True)

In [None]:
num_cols = [j for j in X_train.columns if 'num_' in j]
cat_cols = [j for j in X_train.columns if 'cat_' in j]

## fill na

In [None]:
mean_dict = {}
for col in num_cols:
    # store the train mean of each numeric column, then replace NaN with it
    mean_dict[col] = X_train[col].mean()
    X_train[col] = X_train[col].fillna(mean_dict[col])

In [None]:
# TO:DO - check cols, in order to see whether or not all cat cols are already encoded
# if not - apply LabelEncoder from sklearn.preprocessing
(X_train[cat_cols].dtypes=='int64').astype(int).sum() == len(cat_cols)
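
If the check above returns False, the remaining string columns can be encoded with `LabelEncoder`. A minimal sketch, assuming a hypothetical frame `demo_train` with one string column; note that `astype(str)` turns NaN into the string "nan", which then becomes a category of its own, and that a fitted `LabelEncoder` raises an error for labels it did not see during fitting.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train frame with one string-valued categorical column.
demo_train = pd.DataFrame({"cat_a": ["x", "y", "x", "z"]})

encoders = {}
for col in demo_train.columns:
    enc = LabelEncoder()
    # fit_transform learns the sorted set of labels and maps them to 0..n-1
    demo_train[col] = enc.fit_transform(demo_train[col].astype(str))
    # keep the fitted encoder so the same mapping can be reused on other data
    encoders[col] = enc

print(demo_train["cat_a"].tolist())       # [0, 1, 0, 2]
print(list(encoders["cat_a"].classes_))   # ['x', 'y', 'z']
```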

In [None]:
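most_popular_value_dict = {}
for col in cat_cols:
    # store the most frequent train value of each categorical column, then replace NaN with it
    most_popular_value_dict[col] = X_train[col].mode().iloc[0]
    X_train[col] = X_train[col].fillna(most_popular_value_dict[col])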

In [None]:
# unit test - no nan
X_train.isna().astype(int).sum()[X_train.isna().astype(int).sum()>0]

## handle variety

In [None]:
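def get_val_share_list(df: pd.DataFrame, val_share_threshold: float = 0.75) -> list:
    """Return the columns whose most frequent value covers more than
    val_share_threshold of the rows"""

    # share of the most frequent value per column, as a Series indexed by column name
    cols_val_share_stat = pd.Series(
        {col: df[col].value_counts(normalize=True, dropna=False).iloc[0] for col in df.columns}
    )
    sum_val_share_cols = int((cols_val_share_stat > val_share_threshold).sum())
    proc_val_share_cols = round(100 * sum_val_share_cols / len(df.columns), 2)
    print(
        f"cols with biggest value share > {val_share_threshold}: {sum_val_share_cols} cols or {proc_val_share_cols}%"
    )
    
    return list(cols_val_share_stat[cols_val_share_stat>val_share_threshold].index)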

In [None]:
drop_val_share_list = get_val_share_list(X_train)

In [None]:
X_train.drop(drop_val_share_list, axis=1, inplace=True)

In [None]:
X_train.shape[1]

In [None]:
num_cols = [j for j in X_train.columns if 'num_' in j]
cat_cols = [j for j in X_train.columns if 'cat_' in j]

# 2. Handling outliers

In [None]:
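# Define a function to cap outliers with IQR-style fences; note that the fences are built
# on the perc_lower/perc_upper percentiles (10th and 90th by default), not on the quartiles
def handle_outliers(col_array: np.ndarray, perc_lower=10, perc_upper=90):
    # work on a float copy: the caller's array stays untouched and the bounds are not truncated
    col_array = col_array.astype(float)
    
    q_lower = np.percentile(col_array, perc_lower)
    q_upper = np.percentile(col_array, perc_upper)
    iqr = q_upper - q_lower  # inter-percentile range
    lower_bound = q_lower - 1.5 * iqr
    upper_bound = q_upper + 1.5 * iqr
    col_array[col_array < lower_bound] = lower_bound
    col_array[col_array > upper_bound] = upper_bound
    
    return col_array, [lower_bound, upper_bound]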

In [None]:
outliers_dict = {j:[] for j in num_cols}
for col in num_cols:
    X_train[col], outliers_dict[col] = handle_outliers(X_train[col].values)
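
To see what the capping does, here is a small worked example on a hypothetical array `values` with one extreme observation. It repeats the same arithmetic as `handle_outliers` with the default 10th and 90th percentiles, and uses `np.clip` as an equivalent way to apply the bounds.

```python
import numpy as np

# Hypothetical column: nine ordinary values and one extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# 10th and 90th percentiles (linear interpolation): 1.9 and 108.1
q_lower, q_upper = np.percentile(values, [10, 90])
spread = q_upper - q_lower

# Fences 1.5 spreads beyond the percentiles: about -157.4 and 267.4
lower_bound = q_lower - 1.5 * spread
upper_bound = q_upper + 1.5 * spread

# Values outside the fences are moved onto the fences, the rest stay as they are.
clipped = np.clip(values, lower_bound, upper_bound)
print(lower_bound, upper_bound)
print(clipped)
```

Because the fences are computed from the 10th and 90th percentiles, they are much wider than classic quartile-based fences, so only very extreme values are capped.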

# 3. Descriptive statistics

In [None]:
# TO:DO - anything that helps you to get a deeper understanding of the data and find some patterns

# 4. Encoding categorical variables 
using onehot and target encodings

In [None]:
cat_cols = [j for j in X_train.columns if 'cat_' in j]

In [None]:
cat_cols[0]

In [None]:
X_train[cat_cols[0]].value_counts()

In [None]:
mean_enc_dict = {}
for col in cat_cols:
    # TO:DO - refactor this code to make it work in your case
    mean_enc_dict[col] = X_train[[col]].join(y_train).groupby(col)['loss'].mean().to_dict()
    # hint: 999999999999 is the key for categorical values in the test data that were not seen in training
    mean_enc_dict[col][999999999999] = np.array(list(mean_enc_dict[col].values())).mean()
    X_train[col] = X_train[col].map(mean_enc_dict[col])
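
The target (mean) encoding above can be traced on a toy example. The series `train_cat`, `train_y` and `test_cat` are hypothetical; the fallback key and the fallback value (the unweighted mean of the per-category means) follow the cell above.

```python
import numpy as np
import pandas as pd

UNSEEN = 999999999999  # same placeholder key as in the cell above

# Hypothetical training column and binary target.
train_cat = pd.Series(["a", "a", "b", "b", "b", "c"], name="cat_demo")
train_y = pd.Series([1, 0, 0, 0, 1, 1], name="loss")

# Mean of the target per category: a -> 1/2, b -> 1/3, c -> 1
enc = train_y.groupby(train_cat).mean().to_dict()
# Fallback for categories that never occur in the training data.
enc[UNSEEN] = float(np.mean(list(enc.values())))

# Hypothetical new data; "d" was not seen during training.
test_cat = pd.Series(["a", "c", "d"])
# Replace unseen categories with the placeholder key, then map to the encoded values.
test_encoded = test_cat.where(test_cat.isin(list(enc.keys())), UNSEEN).map(enc)
print(test_encoded.tolist())
```

Since each training row contributes to the mean that is later used to encode it, this kind of encoding can leak target information into the training features; computing the means out-of-fold is a common way to reduce that effect.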

# 5. Feature selection
using correlation 

In [None]:
corr_dict = {}
for col in num_cols+cat_cols:
    corr_dict[col] = X_train[[col]].join(y_train).corr(method='spearman').iloc[0,1]

In [None]:
pd.Series(corr_dict).abs().describe()

In [None]:
corr_threshold = 0.8
corr_df = X_train.corr(method='spearman').abs()
corr_stat_len = 1
    
while corr_stat_len>0:
    corr_stat = (corr_df>corr_threshold).sum().sort_values()-1
    corr_stat = corr_stat[corr_stat>0]
    try:
        col_to_drop = corr_stat.index[0]
        corr_df.drop(col_to_drop, axis=0, inplace=True)
        corr_df.drop(col_to_drop, axis=1, inplace=True)
        corr_stat_len = len(corr_stat)
    except IndexError:
        break

In [None]:
X_train = X_train[list(corr_df.columns)]
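
The greedy filter above can be checked on a toy frame. In this sketch the frame `demo` and its column names are hypothetical: `num_b` is a near-duplicate of `num_a`, while `num_c` is independent, so exactly one column of the correlated pair should be removed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "num_a": base,
    "num_b": base + rng.normal(scale=0.01, size=200),  # near-duplicate of num_a
    "num_c": rng.normal(size=200),                      # independent feature
})

corr_threshold = 0.8
corr = demo.corr(method="spearman").abs()

while True:
    # For every column, count the other columns correlated above the threshold
    # (the -1 removes the column's correlation with itself).
    n_high = (corr > corr_threshold).sum() - 1
    n_high = n_high[n_high > 0].sort_values()
    if n_high.empty:
        break
    # Drop the column with the fewest highly correlated partners, as in the loop above.
    col_to_drop = n_high.index[0]
    corr = corr.drop(index=col_to_drop, columns=col_to_drop)

kept = list(corr.columns)
print(kept)
```

Which column of a correlated pair survives depends on the sort order, so the filter does not take the correlation with the target into account.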

In [None]:
X_train.shape[1]

In [None]:
num_cols = [j for j in X_train.columns if 'num_' in j]
cat_cols = [j for j in X_train.columns if 'cat_' in j]

# 6. Normalisation
of numerical columns using a min-max scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_dict = {}
for col in num_cols:
    scaler = MinMaxScaler()
    # fit the scaler on the train column, scale it, and save the fitted scaler for each feature
    X_train[col] = scaler.fit_transform(X_train[[col]]).ravel()
    scaler_dict[col] = scaler
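
A small sketch of how a scaler fitted on train data behaves on new data. The frames `demo_train` and `demo_test` are hypothetical. The scaler only remembers the minimum and maximum of the data it was fitted on, so a new value outside that range is mapped outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test columns; 15.0 lies above the train maximum of 10.0.
demo_train = pd.DataFrame({"num_x": [0.0, 5.0, 10.0]})
demo_test = pd.DataFrame({"num_x": [2.5, 15.0]})

scaler = MinMaxScaler()
# fit_transform learns min and max on the train column and scales it to [0, 1]
demo_train["num_x"] = scaler.fit_transform(demo_train[["num_x"]]).ravel()
# transform reuses the train min and max, it does not refit on the test column
demo_test["num_x"] = scaler.transform(demo_test[["num_x"]]).ravel()

print(demo_train["num_x"].tolist())
print(demo_test["num_x"].tolist())
```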

# Preprocess test data

In [None]:
# keep the same columns in the same order as X_train; copy to avoid writing into a view
X_test = X_test[list(X_train.columns)].copy()

In [None]:
for col in num_cols:
    X_test[col] = X_test[col].fillna(mean_dict[col])

In [None]:
for col in cat_cols:
    X_test[col] = X_test[col].fillna(most_popular_value_dict[col])

In [None]:
# Define a function that caps values at the bounds learned on the train data
def handle_outliers_test(col_array: np.array, bound_list):
    col_array = col_array.copy()
    lower_bound, upper_bound = bound_list
    col_array[col_array < lower_bound] = lower_bound
    col_array[col_array > upper_bound] = upper_bound
    
    return col_array

In [None]:
for col in num_cols:
    X_test[col] = handle_outliers_test(X_test[col], outliers_dict[col])

In [None]:
for col in cat_cols:
    new_cat_val_filter = X_test[col].isin(list(mean_enc_dict[col].keys()))
    X_test.loc[~new_cat_val_filter, col] = 999999999999
    X_test[col] = X_test[col].map(mean_enc_dict[col])

In [None]:
for col in num_cols:
    X_test[col] = scaler_dict[col].transform(X_test[[col]])

In [None]:
X_train.shape[1], X_test.shape[1]

# 7. Modeling
using logistic regression, decision tree, random forest and LGBM

In [None]:
# pip install imbalanced-learn

In [None]:
# from imblearn.over_sampling import SMOTEN
# sm = SMOTEN(sampling_strategy=0.2, random_state=42)
# X_res, y_res = sm.fit_resample(X_train, y_train)

In [None]:
# pd.Series(y_res).describe().apply(lambda x: '%.4f' % x)

In [None]:
import seaborn as sns
from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, confusion_matrix

## LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
# all parameters not specified are set to their defaults
logistic_model = LogisticRegression(
    solver='liblinear',
    penalty='l2'
)
logistic_model.fit(X_train, y_train)

In [None]:
y_proba_train = logistic_model.predict_proba(X_train)[:,1]
pd.Series(y_proba_train).describe().apply(lambda x: '%.4f' % x)

In [None]:
score = roc_auc_score(y_train, y_proba_train)
print(f"ROC AUC: {score:.4f}")

In [None]:
cm = confusion_matrix(y_train, np.where(y_proba_train>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(score,4))
plt.title(all_sample_title, size = 15);

In [None]:
def plot_sklearn_roc_curve(y_real, y_pred):
    fpr, tpr, _ = roc_curve(y_real, y_pred)
    roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
    roc_display.figure_.set_size_inches(5,5)
    plt.plot([0, 1], [0, 1], color = 'g')
# Plots the ROC curve using the sklearn methods - Good plot
plot_sklearn_roc_curve(y_train, y_proba_train)

In [None]:
y_proba_test = logistic_model.predict_proba(X_test)[:,1]
pd.Series(y_proba_test).describe().apply(lambda x: '%.4f' % x)

In [None]:
cm = confusion_matrix(y_test, np.where(y_proba_test>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(roc_auc_score(y_test, y_proba_test),4))
plt.title(all_sample_title, size = 15);

In [None]:
score = roc_auc_score(y_test, y_proba_test)
print(f"ROC AUC: {score:.4f}")
plot_sklearn_roc_curve(y_test, y_proba_test)

## DecisionTree

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
decision_tree_model = DecisionTreeClassifier(
    max_depth = 3,
    min_samples_leaf = 100,
    random_state = 13
)
decision_tree_model.fit(X_train, y_train)

In [None]:
plot_tree(decision_tree_model)

In [None]:
# pip install dtreeviz

In [None]:
import dtreeviz

In [None]:
viz_model = dtreeviz.model(decision_tree_model,
                           X_train=X_train, y_train=y_train,
                           feature_names=X_train.columns,
                           target_name='gb',
                           class_names=list(y_train.unique()))

# v = viz_model.view()     # render as SVG into internal object 
# v.save("/tmp/iris.svg")  # optionally save as svg
viz_model.view()       # in notebook, displays inline

In [None]:
# pip install pydotplus

In [None]:
# pip install graphviz

In [None]:
# Create DOT data
from sklearn.tree import export_graphviz
from pydotplus import graph_from_dot_data
from IPython.display import Image

dot_data = export_graphviz(decision_tree_model, out_file=None, 
                           feature_names=X_train.columns,  
                           class_names=np.unique(y_train).astype('str'), 
                           filled=True, rounded=True, special_characters=True)
# Draw graph
graph = graph_from_dot_data(dot_data)
# Show graph
Image(graph.create_png())

In [None]:
y_proba_train = decision_tree_model.predict_proba(X_train)[:,1]
pd.Series(y_proba_train).describe().apply(lambda x: '%.4f' % x)

In [None]:
score = roc_auc_score(y_train, y_proba_train)
print(f"ROC AUC: {score:.4f}")

In [None]:
cm = confusion_matrix(y_train, np.where(y_proba_train>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(score,4))
plt.title(all_sample_title, size = 15);

In [None]:
plot_sklearn_roc_curve(y_train, y_proba_train)

In [None]:
y_proba_test = decision_tree_model.predict_proba(X_test)[:,1]
pd.Series(y_proba_test).describe().apply(lambda x: '%.4f' % x)

In [None]:
cm = confusion_matrix(y_test, np.where(y_proba_test>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(roc_auc_score(y_test, y_proba_test),4))
plt.title(all_sample_title, size = 15);

In [None]:
score = roc_auc_score(y_test, y_proba_test)
print(f"ROC AUC: {score:.4f}")
plot_sklearn_roc_curve(y_test, y_proba_test)

## RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_forest_model = RandomForestClassifier(
    max_depth = 3,
    min_samples_leaf = 100,
    random_state = 13
)
random_forest_model.fit(X_train, y_train)

In [None]:
y_proba_train = random_forest_model.predict_proba(X_train)[:,1]
pd.Series(y_proba_train).describe().apply(lambda x: '%.4f' % x)

In [None]:
score = roc_auc_score(y_train, y_proba_train)
print(f"ROC AUC: {score:.4f}")

In [None]:
cm = confusion_matrix(y_train, np.where(y_proba_train>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(score,4))
plt.title(all_sample_title, size = 15);

In [None]:
plot_sklearn_roc_curve(y_train, y_proba_train)

In [None]:
y_proba_test = random_forest_model.predict_proba(X_test)[:,1]
pd.Series(y_proba_test).describe().apply(lambda x: '%.4f' % x)

In [None]:
cm = confusion_matrix(y_test, np.where(y_proba_test>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(roc_auc_score(y_test, y_proba_test),4))
plt.title(all_sample_title, size = 15);

In [None]:
score = roc_auc_score(y_test, y_proba_test)
print(f"ROC AUC: {score:.4f}")
plot_sklearn_roc_curve(y_test, y_proba_test)

## LGBMClassifier

In [None]:
# pip install lightgbm

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lightgbm_model = LGBMClassifier(
    boosting_type = 'gbdt',
    n_estimators = 100,
    max_depth = 3,
    learning_rate = 0.02,
    colsample_bytree = 0.3,
    min_child_samples = 20,
    reg_alpha = 2,
    objective = 'binary',
    is_unbalance = False,
    random_state = 21
)

lightgbm_model.fit(X_train, y_train, eval_metric=['auc'])

In [None]:
y_proba_train = lightgbm_model.predict_proba(X_train)[:,1]
pd.Series(y_proba_train).describe().apply(lambda x: '%.4f' % x)

In [None]:
score = roc_auc_score(y_train, y_proba_train)
print(f"ROC AUC: {score:.4f}")

In [None]:
cm = confusion_matrix(y_train, np.where(y_proba_train>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(score,4))
plt.title(all_sample_title, size = 15);

In [None]:
plot_sklearn_roc_curve(y_train, y_proba_train)

In [None]:
y_proba_test = lightgbm_model.predict_proba(X_test)[:,1]
pd.Series(y_proba_test).describe().apply(lambda x: '%.4f' % x)

In [None]:
cm = confusion_matrix(y_test, np.where(y_proba_test>0.5, 1, 0))

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'ROC AUC: {0}'.format(round(roc_auc_score(y_test, y_proba_test),4))
plt.title(all_sample_title, size = 15);

In [None]:
score = roc_auc_score(y_test, y_proba_test)
print(f"ROC AUC: {score:.4f}")
plot_sklearn_roc_curve(y_test, y_proba_test)

# 8. Interpretation of results
using f1, precision, recall, ROC-AUC and confusion matrix
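
The metrics named above can all be computed with `sklearn.metrics`. The sketch below uses hypothetical labels `y_true` and scores `y_score` for eight observations and an assumed cut-off of 0.5; on the real data the same calls could be applied to `y_test` and `y_proba_test`.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical labels and predicted probabilities for eight observations.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.60, 0.30, 0.40, 0.70, 0.90])

threshold = 0.5  # assumed cut-off, the same as in the confusion-matrix cells
y_pred = (y_score > threshold).astype(int)

# Threshold-dependent metrics use the hard predictions.
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)        # rows: actual class, columns: predicted class

# ROC AUC is threshold-free and uses the scores directly.
auc = roc_auc_score(y_true, y_score)

print(f"precision: {precision:.4f}, recall: {recall:.4f}, f1: {f1:.4f}, ROC AUC: {auc:.4f}")
print(cm)
```

With a strongly imbalanced target, a cut-off of 0.5 may produce very few positive predictions, so precision, recall and f1 are worth inspecting at several thresholds, while ROC AUC does not depend on the threshold.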

# 9. Delay prediction

In [None]:
# keep the same columns in the same order as X_train; copy to avoid writing into a view
X_delay = delay_df[list(X_train.columns)].copy()

for col in num_cols:
    X_delay[col] = X_delay[col].fillna(mean_dict[col])

for col in cat_cols:
    X_delay[col] = X_delay[col].fillna(most_popular_value_dict[col])

# Define a function that caps values at the bounds learned on the train data
def handle_outliers_test(col_array: np.array, bound_list):
    col_array = col_array.copy()
    lower_bound, upper_bound = bound_list
    col_array[col_array < lower_bound] = lower_bound
    col_array[col_array > upper_bound] = upper_bound
    
    return col_array

for col in num_cols:
    X_delay[col] = handle_outliers_test(X_delay[col], outliers_dict[col])

for col in cat_cols:
    new_cat_val_filter = X_delay[col].isin(list(mean_enc_dict[col].keys()))
    X_delay.loc[~new_cat_val_filter, col] = 999999999999
    X_delay[col] = X_delay[col].map(mean_enc_dict[col])

for col in num_cols:
    X_delay[col] = scaler_dict[col].transform(X_delay[[col]])

X_train.shape[1], X_delay.shape[1]

In [None]:
lightgbm_model.fit(pd.concat([X_train, X_test]), pd.concat([y_train, y_test]), eval_metric=['auc'])
y_proba_delay = lightgbm_model.predict_proba(X_delay)[:,1]

In [None]:
pd.Series(y_proba_delay).describe().apply(lambda x: '%.4f' % x)

In [None]:
pd.DataFrame({'proba': y_proba_delay}).to_csv('proba_delay.csv')