# Data preprocessing and classification

The purpose of the work is to pre-process data to train a classification model for clients with debts of more than 90 days, taking into account the class imbalance. Work with missing values, duplicates, outliers, as well as feature encoding, normalization, and standardization. For example and comparison, several different classification models will be trained with a preliminary selection of hyperparameters.

## Initial data

The dataset contains credit information about Bank N's customers. All available information about the data is provided below.
* **client_id** - positive int
* **Age** - client's age (float)
* **Income** - monthly income (float)
* **BalanceToCreditLimit** - the ratio of the credit card balance to the loan limit (float)
* **DIR** - Debt-to-income Ratio (float)
* **NumLoans** - number of loans and credit lines (int)
* **NumRealEstateLoans** - the number of mortgages and loans related to real estate (int)
* **NumDependents** - the number of family members supported by the client, excluding the client himself (int)
* **Num30-59Delinquencies** - the number of overdue loan payments is from 30 to 59 days (int)
* **Num60-89Delinquencies** - the number of overdue loan payments is from 60 to 89 days (int)
* **Delinquent90** - Target, were there any loan payments overdue for more than 90 days (binary)

In [378]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from scipy.stats import f_oneway

## EDA

In [5]:
df_raw=pd.read_csv(path)
df_raw.describe()

Unnamed: 0,client_id,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit,Delinquent90
count,75000.0,75000.0,75000.0,75000.0,75000.0,73084.0,75000.0,75000.0,60153.0,75000.0,75000.0
mean,37499.5,353.260293,52.595605,8.44976,1.016693,0.755966,0.42832,0.248,6740.059,6.276196,0.06684
std,21650.779432,2117.237432,14.869729,5.15644,1.124019,1.108119,4.276439,4.239486,14228.75,267.743321,0.249746
min,0.0,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18749.75,0.176022,41.3,5.0,0.0,0.0,0.0,0.0,3421.354,0.029703,0.0
50%,37499.5,0.366848,52.2,8.0,1.0,0.0,0.0,0.0,5424.552,0.15372,0.0
75%,56249.25,0.86265,63.1,11.0,2.0,1.0,0.0,0.0,8291.518,0.560638,0.0
max,74999.0,332600.27282,109.8,56.0,32.0,20.0,98.0,98.0,1805573.0,50873.874533,1.0


In [6]:
df_raw[df_raw.Delinquent90==1].describe()

Unnamed: 0,client_id,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit,Delinquent90
count,5013.0,5013.0,5013.0,5013.0,5013.0,4928.0,5013.0,5013.0,4186.0,5013.0,5013.0
mean,38236.276681,305.013555,46.251167,7.894075,0.975663,0.942167,2.36166,1.79992,5616.791347,7.223579,1.0
std,21827.36365,1430.838514,12.97314,5.753218,1.407215,1.207893,11.625605,11.641649,6070.672678,182.560328,0.0
min,34.0,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,19495.0,0.189518,36.2,4.0,0.0,0.0,0.0,0.0,2996.352503,0.388838,1.0
50%,39068.0,0.426576,46.0,7.0,1.0,0.0,0.0,0.0,4525.108818,0.835251,1.0
75%,57356.0,0.873894,54.5,11.0,2.0,2.0,2.0,1.0,6813.442413,1.003804,1.0
max,74960.0,38943.239976,101.8,52.0,25.0,8.0,98.0,98.0,236318.53804,8381.813981,1.0


In [7]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   client_id              75000 non-null  int64  
 1   DIR                    75000 non-null  float64
 2   Age                    75000 non-null  float64
 3   NumLoans               75000 non-null  int64  
 4   NumRealEstateLoans     75000 non-null  int64  
 5   NumDependents          73084 non-null  float64
 6   Num30-59Delinquencies  75000 non-null  int64  
 7   Num60-89Delinquencies  75000 non-null  int64  
 8   Income                 60153 non-null  float64
 9   BalanceToCreditLimit   75000 non-null  float64
 10  Delinquent90           75000 non-null  int64  
dtypes: float64(5), int64(6)
memory usage: 6.3 MB


In [364]:
# Distribution of features with serious outliers

features=['Income','Num60-89Delinquencies','Num30-59Delinquencies','NumRealEstateLoans','NumLoans','Age','DIR','BalanceToCreditLimit']
print('The entire dataset')
display(df_raw[features].quantile([.25,.5,.75,.9,.95,.99]))

print('\nData on the minor class')
display(df_raw.loc[df_raw.Delinquent90==1,features].quantile([.25,.5,.75,.9,.95,.99]))

The entire dataset


Unnamed: 0,Income,Num60-89Delinquencies,Num30-59Delinquencies,NumRealEstateLoans,NumLoans,Age,DIR,BalanceToCreditLimit
0.25,3421.353782,0.0,0.0,0.0,5.0,41.3,0.176022,0.029703
0.5,5424.552473,0.0,0.0,1.0,8.0,52.2,0.366848,0.15372
0.75,8291.517816,0.0,0.0,2.0,11.0,63.1,0.86265,0.560638
0.9,11720.520611,0.0,1.0,2.0,15.0,72.4,1255.713033,0.987508
0.95,14770.348646,1.0,2.0,3.0,18.0,78.3,2463.816623,1.006007
0.99,25166.942526,2.0,4.0,4.0,24.0,87.4,4954.675484,1.101759



Data on the minor class


Unnamed: 0,Income,Num60-89Delinquencies,Num30-59Delinquencies,NumRealEstateLoans,NumLoans,Age,DIR,BalanceToCreditLimit
0.25,2996.352503,0.0,0.0,0.0,4.0,36.2,0.189518,0.388838
0.5,4525.108818,0.0,0.0,1.0,7.0,46.0,0.426576,0.835251
0.75,6813.442413,1.0,2.0,2.0,11.0,54.5,0.873894,1.003804
0.9,9976.991885,1.0,3.0,2.0,15.8,62.6,572.949771,1.024379
0.95,12592.471101,2.0,4.0,3.0,18.0,68.1,2000.14665,1.13025
0.99,21530.512047,98.0,98.0,6.0,25.0,82.088,5443.217135,1.763305


In [365]:
print('The number of representatives of the target attribute classes')
print(df_raw.Delinquent90.value_counts())

min_isna=df_raw[df_raw['Delinquent90']==1].isna().sum()
minority=df_raw[df_raw['Delinquent90']==1].shape[0]
print(f'\n\nPercentage of missing values in the minority class')
print(round(min_isna/minority*100,2))

The number of representatives of the target attribute classes
0    69987
1     5013
Name: Delinquent90, dtype: int64


Percentage of missing values in the minority class
client_id                 0.0
DIR                       0.0
Age                       0.0
NumLoans                  0.0
NumRealEstateLoans        0.0
NumDependents             1.7
Num30-59Delinquencies     0.0
Num60-89Delinquencies     0.0
Income                   16.5
BalanceToCreditLimit      0.0
Delinquent90              0.0
dtype: float64


### EDA conclusion

* The major class is many times higher than the minor class - 93.8% versus 6.7%
* There are two attributes in the data with missing values, one of which is less than 3%, which can be considered an insignificant part of the data and deleted. But if you take into account the proportions of classes and the number of skips in a minority class, then this can be a significant data loss. In my opinion, with such an imbalance, you can look first at the gaps in the minority class, and then at the overall statistics.
* It can be noted that each feature contains values exceeding 1.5IQR. Given that this is real data that may represent rare phenomena or simply be an error, it is worth limiting the features with the strongest outliers to the 99th quantile, which are not planned to be categorized.
* **DIR** is more difficult to estimate due to the lack of data on all real incomes and funds of clients in the table, which leads to astronomical values. Let's assume that in the absence of income information, the feature indicates the repayment of the loan, and not its ratio to income. We can also assume that DIR<=1 is the ratio of payments to income, and 1<DIR<=100 is the percentage ratio. Accordingly, it is necessary to switch from the ratio to the loan repayment values, and then check that the payments do not exceed income. The bank cannot issue a loan in the absence of income information or lack of income, so let's assume that the system could have failed or the latest up-to-date income information is indicated there, and not at the time of issuing the loan/loan. Therefore, loan repayments may exceed the stated income.
* **BalanceToCreditLimit** - entries with a value >1 can be interpreted in the same way as in the case of DIR - the remainder is indicated, not the ratio. But due to the lack of additional information, it is not possible to bring everything into a single view. We can assume that 1<x<100 is the percentage and divide it by 100, and delete everything that is greater than 100.
* **Num60-89Delinquencies**, **Num30-59Delinquencies** and **Delinquent90** are disjoint sets. The number of delays from 60 to 90 may be more than from 30 to 60, and so on. Also, the first two attributes have no more than 1% of the data, which is many times higher than 0. However, such outliers may indicate a high probability that the client belongs to a minority class. You can categorize these signs and get rid of outliers (such a number of debts can be quite real, and with categorization we will get rid of the excess)
* **NumRealEstateLoans** and **NumLoans** are overlapping sets. The number of real estate loans cannot be more than the number of loans - you need to check.
* **Income** is a monthly salary, and given its distribution, it can be said that it is presented in dollars or euros. Let's assume that the income may be lower than the minimum wage in the country of the bank's clients (It is missing altogether or some payments are indicated as income). In the descriptive table, you can see a 7-digit salary, but when you look at the salary distribution in detail, you can also see that even 6-digit salaries are not included in the 1%. I believe that salary data can also be categorized to preserve more information about the data. Nan values can be replaced with 0, as there is no income information. Considering that Nan entries in Income account for 16.5% of customers with a delay of more than 90 days, this information may be significant.

## Preprocessing

In [366]:
# We check the duplicates by ID, because it doesn't make sense for the rest of the lines
print('Duplicate count of id:',df_raw.client_id.duplicated().sum())

Duplicate count of id: 0


In [369]:
# BalanceToCreditLimit
df_stage1=df_raw.drop(columns=['client_id'])
mask=(df_stage1.BalanceToCreditLimit<=100.0001)&(df_stage1.BalanceToCreditLimit>=1.0001)
df_stage1.loc[mask,'BalanceToCreditLimit']=df_stage1[mask].BalanceToCreditLimit/100
mask=df_stage1.BalanceToCreditLimit<=100.0001
df_stage1=df_stage1[mask]

# NumLoans
mask=df_stage1.NumRealEstateLoans<=df_stage1.NumLoans
df_stage1=df_stage1[mask]

Since some signs with serious outliers may contain real information and belong to a minority class, they should be preserved and categorized in order to avoid problems.

In [370]:
# Num60-89Delinquencies 
nd_bins=[-1,0,2,5,9,100]
nd_labels=[0,1,2,3,4]
df_stage1['Num60-89Delinquencies']=pd.cut(df_stage1['Num60-89Delinquencies'],bins=nd_bins,labels=nd_labels)
df_stage1['Num60-89Delinquencies']=df_stage1['Num60-89Delinquencies'].astype(int)

# Num30-59Delinquencies
nd_bins=[-1,0,2,5,9,100]
nd_labels=[0,1,2,3,4]
df_stage1['Num30-59Delinquencies']=pd.cut(df_stage1['Num30-59Delinquencies'],bins=nd_bins,labels=nd_labels)
df_stage1['Num30-59Delinquencies']=df_stage1['Num30-59Delinquencies'].astype(int)

# NumLoans
bins=[-1,0,5,8,11,20,60]
labels=[0,1,2,3,4,5]
df_stage1['NumLoans']=pd.cut(df_stage1['NumLoans'],bins=bins,labels=labels)
df_stage1['NumLoans']=df_stage1['NumLoans'].astype(int)

# NumRealEstateLoans
bins=[-1,0,2,8,15,35]
labels=[0,1,2,3,4]
df_stage1['NumRealEstateLoans']=pd.cut(df_stage1['NumRealEstateLoans'],bins=bins,labels=labels)
df_stage1['NumRealEstateLoans']=df_stage1['NumRealEstateLoans'].astype(int)

# Age
bins=[0,25,35,45,60,80,110]
labels=[1,2,3,4,5,6]
df_stage1['Age']=pd.cut(df_stage1['Age'],bins=bins,labels=labels)
df_stage1['Age']=df_stage1['Age'].astype(int)

In [371]:
# DIR
# Transform and limit from above
mask=df_stage1.Income.notna() & (df_stage1.DIR<1.0001)&(df_stage1.Income>100)
df_stage1.loc[mask,'DIR']=df_stage1.loc[mask,'DIR'].values*df_stage1.loc[mask,'Income'].values
mask=df_stage1.Income.notna() & (df_stage1.DIR<100.0001)&(df_stage1.Income>100)
df_stage1.loc[mask,'DIR']=df_stage1.loc[mask,'DIR'].values/100*df_stage1.loc[mask,'Income'].values

quantile=df_stage1['DIR'].quantile(.99)
df_stage1.loc[df_stage1.DIR>quantile,feature]=quantile

df_stage1.describe()

Unnamed: 0,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit,Delinquent90
count,74887.0,74887.0,74887.0,74887.0,72977.0,74887.0,74887.0,60076.0,74887.0,74887.0
mean,1968.00282,3.894214,2.361518,0.692884,0.755909,0.187816,0.060237,6737.691,85.7524,0.066821
std,2647.528529,1.100181,1.21417,0.591761,1.108155,0.472395,0.293601,14235.7,850.900282,0.249713
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,569.813803,3.0,1.0,0.0,0.0,0.0,0.0,3420.354,0.012609,0.0
50%,1538.381238,4.0,2.0,1.0,0.0,0.0,0.0,5423.588,0.089018,0.0
75%,2758.599134,5.0,3.0,1.0,1.0,0.0,0.0,8286.185,0.374017,0.0
max,332600.27282,6.0,5.0,4.0,20.0,4.0,4.0,1805573.0,8551.304143,1.0


In [372]:
# Income
income_bins=[-1,0.00001,1_000,3_000,8_000,15_000,25_000,100_000,float('inf')]
income_labels=[0,1,2,3,4,5,6,7]
df_stage1['Income']=pd.cut(df_stage1['Income'],bins=income_bins,labels=income_labels)
df_stage1['Income']=df_stage1['Income'].astype(float)
df_stage1['Income'].fillna(0,inplace=True)

In [373]:
# NumDependents
imputer=KNNImputer()
imputation_data=imputer.fit_transform(df_stage1)
df_stage2=pd.DataFrame(data=imputation_data,columns=df_stage1.columns)

# NumDependents
bins=[-1,0,2,5,20]
labels=[0,1,2,3]
df_stage2['NumDependents']=pd.cut(df_stage2['NumDependents'],bins=bins,labels=labels)
df_stage2['NumDependents']=df_stage2['NumDependents'].astype(int)

df_stage2.describe()

Unnamed: 0,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit,Delinquent90
count,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0
mean,1968.00282,3.894214,2.361518,0.692884,0.497576,0.187816,0.060237,2.478721,85.7524,0.066821
std,2647.528529,1.100181,1.21417,0.591761,0.655706,0.472395,0.293601,1.478237,850.900282,0.249713
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,569.813803,3.0,1.0,0.0,0.0,0.0,0.0,2.0,0.012609,0.0
50%,1538.381238,4.0,2.0,1.0,0.0,0.0,0.0,3.0,0.089018,0.0
75%,2758.599134,5.0,3.0,1.0,1.0,0.0,0.0,3.0,0.374017,0.0
max,332600.27282,6.0,5.0,4.0,3.0,4.0,4.0,7.0,8551.304143,1.0


In [374]:
feature=['DIR']
rubust_scaler=RobustScaler()
df_stage2[feature]=rubust_scaler.fit_transform(df_stage2[feature])
df_stage2.describe()

Unnamed: 0,DIR,Age,NumLoans,NumRealEstateLoans,NumDependents,Num30-59Delinquencies,Num60-89Delinquencies,Income,BalanceToCreditLimit,Delinquent90
count,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0,74887.0
mean,0.196283,3.894214,2.361518,0.692884,0.497576,0.187816,0.060237,2.478721,85.7524,0.066821
std,1.209588,1.100181,1.21417,0.591761,0.655706,0.472395,0.293601,1.478237,850.900282,0.249713
min,-0.702847,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.442514,3.0,1.0,0.0,0.0,0.0,0.0,2.0,0.012609,0.0
50%,0.0,4.0,2.0,1.0,0.0,0.0,0.0,3.0,0.089018,0.0
75%,0.557486,5.0,3.0,1.0,1.0,0.0,0.0,3.0,0.374017,0.0
max,151.253705,6.0,5.0,4.0,3.0,4.0,4.0,7.0,8551.304143,1.0


In [199]:
df_stage2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74887 entries, 0 to 74886
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   DIR                    74887 non-null  float64
 1   Age                    74887 non-null  float64
 2   NumLoans               74887 non-null  float64
 3   NumRealEstateLoans     74887 non-null  float64
 4   NumDependents          74887 non-null  int32  
 5   Num30-59Delinquencies  74887 non-null  float64
 6   Num60-89Delinquencies  74887 non-null  float64
 7   Income                 74887 non-null  float64
 8   BalanceToCreditLimit   74887 non-null  float64
 9   Delinquent90           74887 non-null  float64
dtypes: float64(9), int32(1)
memory usage: 5.4 MB


In [367]:
print('The number of representatives of the target attribute classes')
print(df_stage2.Delinquent90.value_counts())

The number of representatives of the target attribute classes
0.0    69883
1.0     5004
Name: Delinquent90, dtype: int64


It was possible to save more than 99% of the minority class records and convert the data.  
Based on the results of the F-test, it can be concluded that the DIR attribute is weakly related to the target variable and can be removed from the dataset.

In [375]:
results = []

for feature in df_stage2.columns.drop('Delinquent90'):
    try:
        groups = [
            df_stage2.loc[df_stage2[feature] == val, 'Delinquent90'] 
            for val in df_stage2[feature].unique()
            if len(df_stage2.loc[df_stage2[feature] == val, 'Delinquent90']) > 1
        ] 
        if len(groups) >= 2:
            f_stat, p_value = f_oneway(*groups)
            results.append({'Feature': feature, 'F_statistic': f_stat, 'p_value': p_value})
        else:
            results.append({'Feature': feature, 'F_statistic': np.nan, 'p_value': np.nan})
            
    except Exception as e:
        results.append({'Feature': feature, 'F_statistic': np.nan, 'p_value': np.nan})
        print(f"Error for {feature}: {e}")
        
results_df = pd.DataFrame(results)
results_df.sort_values('p_value')

Unnamed: 0,Feature,F_statistic,p_value
5,Num30-59Delinquencies,1580.729353,0.0
6,Num60-89Delinquencies,1595.913993,0.0
1,Age,193.521557,1.345748e-205
2,NumLoans,133.496367,2.280422e-141
7,Income,48.675766,1.7280340000000003e-69
8,BalanceToCreditLimit,1.465991,1.7497299999999999e-63
3,NumRealEstateLoans,72.328262,2.8786380000000002e-61
4,NumDependents,56.091661,3.29285e-36
0,DIR,0.183645,0.832245


In [376]:
data=df_stage2.drop(columns=['DIR'])

In [377]:
data.to_csv('processed.csv',index=False)

## Model training

In [303]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, precision_score, recall_score, average_precision_score
from imblearn.under_sampling import RandomUnderSampler

In [304]:
seed=0
np.random.seed(seed)
random_state=seed

In [305]:
X=data.drop(columns=['Delinquent90'])
y=data['Delinquent90']
random_state=0
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=.1,shuffle=True,random_state=random_state,stratify=y)
df_metrics=pd.DataFrame(columns=['F1','Precision','Recall','PR_AUC','ROC_AUC'],data=[])

Since there is a significant class imbalance in the data, and the data itself represents a fairly large sample, in my opinion, it is reasonable to apply the under sampling method.

In [306]:
undersamp=RandomUnderSampler(random_state=random_state)
X_train,y_train=undersamp.fit_resample(X_train,y_train)

### Decision tree

The work plans to use various methods with a large number of hyperparameters, and I decided that it would be wise to speed up the selection process a little. To do this, I decided to first run the model through RandomizedSearchCV from all the parameters, and then select up to 2 values with the best learning outcomes for each hyperparameter and find the best combination using GridSearchCV with this sample of hyperparameters.

In [368]:
def CheckScore(trained_model, data, true):
    predict=trained_model.predict(data)
    proba=trained_model.predict_proba(data)[:,-1]
    metrics={}
    metrics['F1']=f1_score(true,predict)
    metrics['Precision'] = precision_score(true, predict, zero_division=0)
    metrics['Recall'] = recall_score(true, predict)
    metrics['PR_AUC'] = average_precision_score(true, proba)
    metrics['ROC_AUC'] = roc_auc_score(true, proba)
    return metrics
    
def SelectHyperparameters(model, params, X_train, y_train):
    rs=RandomizedSearchCV(
        model,
        params,
        n_iter=100,
        cv=3,
        verbose=1,
        n_jobs=-1,
        random_state=random_state,
        scoring='recall',
        error_score=np.nan
    )
    rs.fit(X_train,y_train)
    rs_df=pd.DataFrame(rs.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
    drop_cols=['mean_fit_time',
                'std_fit_time',
                'mean_score_time',
                'std_score_time',
                'params',
                'split0_test_score',
                'split1_test_score',
                'split2_test_score',
                'std_test_score',
                'rank_test_score']
    rs_df = rs_df.drop(columns=drop_cols)
    target=rs_df.columns[-1]
    best_selection={}
    for param in rs_df.columns[:-1]:
        top=rs_df.sort_values(by=target,ascending=0)[param].unique()
        if len(top)>1:
            top=top[:2]
        best_selection[param.replace('param_','')]=top
    gs=GridSearchCV(
        model,
        best_selection,
        scoring='recall',
        n_jobs=-1,
        cv=3
    )
    gs.fit(X_train,y_train)
    print(gs.best_score_)
    return gs.best_estimator_

In [307]:
model=DecisionTreeClassifier(random_state=random_state)
params = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'max_depth': [int(x) for x in np.linspace(2,10,5)],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', 0.5, 0.8],
    'class_weight': [None,'balanced']
}
dtree_model=SelectHyperparameters(model,params,X_train,y_train)
df_metrics.loc['DecisionTree']=CheckScore(dtree_model,X_test,y_test)
df_metrics

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.7932918223181883


Unnamed: 0,F1,Precision,Recall,PR_AUC,ROC_AUC
DecisionTree,0.266523,0.162434,0.742,0.265111,0.799764


### Ensemble methods

In [308]:
from sklearn.ensemble import BaggingClassifier, StackingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

#### Bagging

In [309]:
knn_model=KNeighborsClassifier()
knn_params = {
    'n_neighbors': [int(x) for x in np.linspace(start=3,stop=15,num=10)],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski','chebyshev'],
    'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']
}
knn_model=SelectHyperparameters(knn_model,knn_params,X_train,y_train)

bagging=BaggingClassifier(knn_model,n_jobs=-1,random_state=random_state)
bagging_params = {
    'n_estimators': [int(x) for x in np.linspace(10,200,10)],
    'max_samples': [float(x) for x in np.linspace(.5,1.0,6)],
    'max_features': [float(x) for x in np.linspace(.5,1.0,6)],
    'bootstrap': [True, False]
}
bagging_KNN=SelectHyperparameters(bagging,bagging_params,X_train,y_train)
df_metrics.loc['Bagging_KNN']=CheckScore(bagging_KNN,X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.6956034340769418
Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.7557710453720304


In [310]:
bagging=BaggingClassifier(dtree_model,n_jobs=-1,random_state=random_state)
bagging_dtree=SelectHyperparameters(bagging,bagging_params,X_train,y_train)
df_metrics.loc['Bagging_DTree']=CheckScore(bagging_dtree,X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.7204748543137244


#### Stacking

In [311]:
lr_model=LogisticRegression(penalty='l2',random_state=random_state)
stacking_params={
    'final_estimator__C': [.0001,.001,.01,.1,1,10,100],
    'final_estimator__solver': ['saga','sag','lbfgs','newton-cg','newton-cholesky'],
    'final_estimator__max_iter':[int(x) for x in np.linspace(start=100, stop=1000,num=10)]
}
estimators=[('knn',knn_model),
           ('dtree',dtree_model)]
stacking=StackingClassifier(
    estimators=estimators,
    final_estimator=lr_model,
    cv=5
)
stacking_knn_dtree=SelectHyperparameters(stacking,stacking_params,X_train,y_train)
df_metrics.loc['Stacking_KNN_DTree']=CheckScore(stacking_knn_dtree,X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.7280199056524825


#### RandomForest

In [312]:
rfc=RandomForestClassifier(bootstrap=True,random_state=random_state)
rfc_params={
    'n_estimators': [int(x) for x in np.linspace(50, 500, 20)],
    'max_depth': [int(x) for x in np.linspace(4,8,5)],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', 0.5, 0.7, 0.9],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_samples': [0.5, 0.7, 0.9, 1.0],
    'class_weight': ['balanced', 'balanced_subsample']
}
rfc_model=SelectHyperparameters(rfc_model,rfc_params,X_train,y_train)
df_metrics.loc['RandomForest']=CheckScore(rfc_model,X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.7775291394729301


#### XGBoost

In [314]:
xgb=XGBClassifier(random_state=random_state)
xgb_params={
    'learning_rate':[float(x) for x in np.linspace(.01,.6,10)],
    'max_depth':[4, 5, 6, 7, 8],
    'n_estimators':[int(x) for x in np.linspace(100,1000,10)],
    'subsample':[float(x) for x in np.linspace(.3,.9,5)],
    'colsample_bytree':[float(x) for x in np.linspace(.5,.9,5)],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3],
    'min_child_weight':[1, 2, 3, 4],
    'gamma': [0, 0.1, 0.2, 0.3, 0.5, 1, 2],
    'reg_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
    'reg_lambda': [0.1, 0.5, 1, 2, 5, 10],
    'scale_pos_weight': [int(x) for x in np.linspace(1,10,10)]
}
xgb_model=SelectHyperparameters(xgb,xgb_params,X_train,y_train)
df_metrics.loc['XGBoost']=CheckScore(xgb_model,X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
1.0


In [349]:
def FeatureImpotance(model,X_test,y_test):
    y_pred_proba=model.predict_proba(X_test)[:, 1]
    f_stats=[]
    p_values=[]
    feature_names=X_train.columns
    for i, feature_name in enumerate(feature_names):
        feature_values=X_test[feature_name].values
        n_groups=5
        quantiles=np.quantile(feature_values, np.linspace(0, 1, n_groups + 1))
        groups=np.digitize(feature_values, quantiles[1:-1])

        group_predictions=[]
        for group in range(n_groups):
            mask=groups==group
            if np.sum(mask) > 0:
                group_predictions.append(y_pred_proba[mask])
        if len(group_predictions)>=2:
            f_stat, p_value=f_oneway(*group_predictions)
        else:
            f_stat, p_value=0.0, 1.0
        f_stats.append(f_stat)
        p_values.append(p_value)

    importance_df = pd.DataFrame({
        'feature': feature_names,
        'f_statistic': f_stats,
        'p_value': p_values,
        'significant': [p < 0.05 for p in p_values]
    })

    return importance_df.sort_values('f_statistic', ascending=False)

## Conclusion

In [316]:
df_metrics

Unnamed: 0,F1,Precision,Recall,PR_AUC,ROC_AUC
DecisionTree,0.266523,0.162434,0.742,0.265111,0.799764
Bagging_KNN,0.285938,0.177789,0.73,0.308744,0.825981
Bagging_DTree,0.290955,0.183781,0.698,0.309147,0.821679
Stacking_KNN_DTree,0.272761,0.167503,0.734,0.29251,0.808875
RandomForest,0.275596,0.167092,0.786,0.324161,0.831087
XGBoost,0.125172,0.066765,1.0,0.321138,0.833246


In [351]:
FeatureImpotance(rfc_model,X_test,y_test)

Unnamed: 0,feature,f_statistic,p_value,significant
7,BalanceToCreditLimit,830.654311,0.0,True
0,Age,526.393265,2.093054e-310,True
3,NumDependents,166.786543,9.40451e-38,True
1,NumLoans,60.389347,2.799582e-50,True
2,NumRealEstateLoans,31.8978,1.684539e-08,True
6,Income,18.583891,8.893824e-09,True
4,Num30-59Delinquencies,0.0,1.0,False
5,Num60-89Delinquencies,0.0,1.0,False


In [352]:
FeatureImpotance(xgb_model,X_test,y_test)

Unnamed: 0,feature,f_statistic,p_value,significant
0,Age,1128.481228,0.0,True
7,BalanceToCreditLimit,919.168597,0.0,True
3,NumDependents,314.006898,7.351498000000001e-69,True
1,NumLoans,47.982813,6.665736e-40,True
6,Income,44.767064,4.712571e-20,True
2,NumRealEstateLoans,0.001676,0.9673436,False
4,Num30-59Delinquencies,0.0,1.0,False
5,Num60-89Delinquencies,0.0,1.0,False


In [363]:
threshold=pd.DataFrame(columns=['F1','Precision','Recall'],data=[])
for k in np.linspace(.3,.5,10):
    row=[]
    proba=rfc_model.predict_proba(X_test)
    predict=np.argmax((proba>k).astype(int),1)
    row.append(f1_score(y_test,predict))
    row.append(precision_score(y_test,predict))
    row.append(recall_score(y_test,predict))
    threshold.loc[k]=row
threshold    

Unnamed: 0,F1,Precision,Recall
0.3,0.369841,0.306579,0.466
0.322222,0.368152,0.289988,0.504
0.344444,0.363511,0.277719,0.526
0.366667,0.357284,0.26647,0.542
0.388889,0.352047,0.24876,0.602
0.411111,0.331217,0.22518,0.626
0.433333,0.318288,0.207398,0.684
0.455556,0.298927,0.188345,0.724
0.477778,0.283179,0.173696,0.766
0.5,0.275596,0.167092,0.786


A large number of anomalies and outliers were found in the source data from the credit information of Bank N customers. The data, in general, can be called bad, because 2 out of 10 signs (DIR, BalanceToCreditLimit) contained incorrect information. It is also worth noting a significant imbalance in the classes of the target variable, which could not but affect the effectiveness of training. During preprocessing, it was decided to convert most of the data to categorical due to anomalies and their belonging to a minority class. This helped preserve data that could be interpreted as outliers and anomalies, but they could also be quite real and accounted for 20% of the minority class. After preprocessing, more than 99% of the records for the minority class were saved, and an f-test revealed a feature (DIR) that did not explain the variance of the target variable, as a result of which this feature was deleted.  
Since the minority class makes up 7% of the total sample and these are 5,000 values, it was decided to apply the method of sample alignment towards a smaller class. This solution also made it possible to speed up the process of selecting hyperparameters and training models. 

The models chosen were DecisionTree, KNN, randomForest, XGBoost, and bagging and stacking methods with LogisticRegression as a meta-model.  
The following measures have been taken to speed up the process of parameter selection and training of ensemble methods: First, the selected values of the moedil hyperparameters were passed through RandomSearchCV, then up to 2 values with the highest average value of the selected metric (recall) were selected for each parameter, and the already reduced sample was passed through GridSearchCV; Model parameters for ensemble methods were selected separately. Of course, such decisions could have an impact on the classification ability of trained models. The priority was to test different models and methods.

After working with models and selecting hyperparameters, the results were quite ambiguous. Since a 90-day delay in debt repayment is significant, it would be important for the bank to identify such debtors. It is better not to make a profit than to lose. This means that when training models, it was logical to focus on the completeness of the detection of the debtor class. The best debtor detection models are randomForest with 78.6% and XGBoost with 100%. However, all 5 models showed a high probability of a false positive, and XGBoost, one might say, generally classifies everyone as debtors. Judging by PR_AUC<.5 All models have extremely low classification ability. The best combination of recall and precision (with priority in recall) can be called randomForest - high confidence in classification, high completeness of detection, and in other metrics it does not lag behind competitors. Another conclusion based on the results is the poor quality of the source data and, possibly, the poor quality of preprocessing, which led to equally poor results for most metrics for all models.

It is also worth noting that when examining the effect of signs on prognosis in the two models selected by the best, Num30-59Delinquencies and Num60-89Delinquencies showed no effect. This means that out of 10 features in the original dataset, 6 remain significant for randomForest and 5 for XGBoost. Also, a classification threshold was selected for the randomForest model to improve the quality of the combination of completeness and accuracy. A good value can be called .41 - Recall is reduced by 16%, and Precision is increased by 6%.