# **Implementation of a scoring model**
## **Notebook 4/6 - Modeling: Selection of features**

This notebook is organized as follows:

**0. Set up**
- 0.1 Loading libraries and useful functions
- 0.2 Loading data
- 0.3 Removal of irrelevant features
- 0.4 Data separation
    
**1. Feature selection**
- 1.1 Baseline: no selection
- 1.2 Removal of collinear features
- 1.3 Deletion of features with more than 75% missing values
- 1.4 Deletion of features having zero importance for the model
- 1.5 Deletion of features beyond the first 95% of cumulative importance for the model
- 1.6 Performance comparison

**2. Conclusion**

**3. Data export**

___
### 0. SETUP

In this first step, the working framework is put in place, that is to say:
- The necessary Python libraries and packages are loaded
- Useful functions are defined
- The dataset is loaded
___

___
#### 0.1 LOADING LIBRARIES AND USEFUL FUNCTIONS

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import math
import pandas as pd
import numpy as np
import re
from unidecode import unidecode
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import time
import random
import gc

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
import lightgbm as lgb
from joblib import load, dump

In [4]:
from sys import path
path.append("./Resources/functions")

import helper_functions as hf
import graphical_functions as gf

___
#### 0.2 LOADING DATA

In [5]:
data = pd.read_csv("./Resources/datasets/assembled/full_training_data.csv").rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

___
#### 0.3 REMOVAL OF IRRELEVANT FEATURES

We will remove the column indicating the customer ID.

In addition, to avoid perpetuating bias, we will remove the one-hot encoded columns relating to the gender of clients.

In [6]:
cols_to_drop = ["SK_ID_CURR", "CODE_GENDER_M", "CODE_GENDER_F", "CODE_GENDER_XNA"]
data_model = data.drop(columns=cols_to_drop)

___
#### 0.4 DATA SEPARATION

The dataset will be separated into training data and test data.

Since the exploratory analysis revealed a significant class imbalance in TARGET, we must make sure these proportions are preserved in both the training and test sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(data_model.drop(columns=["TARGET"]), 
                                                    data_model["TARGET"], 
                                                    train_size=0.8, random_state=42, 
                                                    stratify=data_model["TARGET"])
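
The effect of `stratify` can be checked on a small example. The sketch below uses a synthetic, imbalanced toy target (8% positives, an illustrative figure and not the real class ratio) and compares class proportions before and after the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 80 positives out of 1000 rows (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "TARGET": np.r_[np.ones(80, dtype=int), np.zeros(920, dtype=int)],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["TARGET"]), toy["TARGET"],
    train_size=0.8, random_state=42, stratify=toy["TARGET"],
)

# Share of each class in the full data and in each subset
proportions = pd.DataFrame({
    "full": toy["TARGET"].value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions)
```

The three columns should show nearly identical proportions; the same check can be run on `y_train` and `y_test` from the cell above.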

___
### 1. SELECTION OF FEATURES

In [8]:
custom_score = make_scorer(hf.bank_score)

In [9]:
results = pd.DataFrame(columns=["Features", "Custom Score"])

___
#### 1.1 BASELINE: NO SELECTION

In [10]:
baseline_model = lgb.LGBMClassifier(n_estimators=10000, 
                                    objective = 'binary', 
                                    class_weight = 'balanced', 
                                    learning_rate = 0.05, 
                                    reg_alpha = 0.1, 
                                    reg_lambda = 0.1, 
                                    subsample = 0.8, 
                                    n_jobs = -1, 
                                    random_state = 42)

In [None]:
# Note: no eval_set is passed, so eval_metric is not evaluated during training
baseline_model.fit(X_train, y_train, eval_metric=custom_score)
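
Because this `fit` call receives no `eval_set`, LightGBM has no validation data on which to compute a metric, and all 10,000 boosting rounds are trained. If early stopping on a business metric is wanted, LightGBM's scikit-learn API expects a callable taking `(y_true, y_pred)` and returning `(name, value, is_higher_better)`. The sketch below only illustrates that shape: `toy_cost_metric`, its 0.5 threshold, and its 10:1 weighting are assumptions made for the example and are not the definition of `hf.bank_score`, which is not shown in this notebook.

```python
import numpy as np

def toy_cost_metric(y_true, y_pred):
    """Custom metric in the (name, value, is_higher_better) form.

    With the built-in binary objective, y_pred holds predicted
    probabilities of the positive class. Weights are illustrative only.
    """
    y_true = np.asarray(y_true)
    labels = (np.asarray(y_pred) >= 0.5).astype(int)
    false_negatives = np.sum((y_true == 1) & (labels == 0))
    false_positives = np.sum((y_true == 0) & (labels == 1))
    # Assumed weighting: a false negative costs 10x a false positive
    cost = (10 * false_negatives + false_positives) / len(y_true)
    return "toy_cost", float(cost), False  # lower cost is better

name, value, higher_is_better = toy_cost_metric([0, 0, 1, 1], [0.2, 0.7, 0.4, 0.9])
print(name, value, higher_is_better)

# Possible use with a validation set and early stopping (placeholder names):
# model.fit(X_tr, y_tr,
#           eval_set=[(X_val, y_val)],
#           eval_metric=toy_cost_metric,
#           callbacks=[lgb.early_stopping(stopping_rounds=100)])
```

The validation set in such a setup would need to be carved out of the training data, so that the test set stays untouched for the final comparison.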

In [None]:
y_pred = baseline_model.predict(X_test)

In [None]:
results.loc[len(results)] = ["All features", 
                             round(hf.bank_score(y_test, y_pred), 3)]

___
#### 1.2 REMOVAL OF COLLINEAR FEATURES

For each pair of features whose absolute Spearman correlation coefficient exceeds 0.9, one of the two features is removed.

In [None]:
# Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = X_train.corr("spearman").abs()
corr_matrix.head()

In [None]:
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.head()

In [None]:
# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

print('There are %d columns to remove.' % (len(to_drop)))

In [None]:
X_train_nc = X_train.drop(columns = to_drop)
X_test_nc = X_test.drop(columns = to_drop)

print('Training shape: ', X_train_nc.shape)
print('Testing shape: ', X_test_nc.shape)
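
To see why only the upper triangle is inspected, here is a minimal sketch on toy data (the columns `a`, `a_cubed`, and `b` are invented for the illustration). Masking the lower triangle and the diagonal means each pair is examined once, so only the second member of a correlated pair is flagged and the first is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_cubed": a ** 3,           # monotonic in a, so Spearman correlation is 1
    "b": rng.normal(size=200),   # independent of a
})

# Absolute Spearman correlations, keeping only entries above the diagonal
corr = toy.corr(method="spearman").abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it is highly correlated with any earlier column
toy_to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(toy_to_drop)  # ['a_cubed']
```

Note that which feature of a pair survives depends on column order, not on which one is more useful to the model.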

In [None]:
nc_model = lgb.LGBMClassifier(n_estimators=10000, 
                              objective = 'binary', 
                              class_weight = 'balanced', 
                              learning_rate = 0.05, 
                              reg_alpha = 0.1, 
                              reg_lambda = 0.1, 
                              subsample = 0.8, 
                              n_jobs = -1, 
                              random_state = 42)

In [None]:
nc_model.fit(X_train_nc, 
          y_train, 
          eval_metric = custom_score)

# Make predictions
y_pred = nc_model.predict(X_test_nc)

row = ["Without collinear", 
       hf.bank_score(y_test, y_pred)]

results.loc[len(results)] = row

___
#### 1.3 DELETION OF FEATURES WITH MORE THAN 75% MISSING VALUES

In [None]:
THRESHOLD = 0.75

In [None]:
# Fraction of missing values per feature in the training set (between 0 and 1)
X_train_collinear_missing = (X_train_nc.isnull().sum() / len(X_train_nc)).sort_values(ascending = False)

In [None]:
X_train_collinear_missing = X_train_collinear_missing.index[X_train_collinear_missing > THRESHOLD]

In [None]:
X_train_without_collinear_missing = X_train_nc.drop(columns= X_train_collinear_missing)
X_test_without_collinear_missing = X_test_nc.drop(columns= X_train_collinear_missing)
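
The same filter on a toy frame makes the threshold logic easy to verify by hand. The column names below are invented for the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "partly_missing": [np.nan, np.nan, 1.0, 2.0, 3.0],        # 40% missing
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],                    # 0% missing
})

# isnull().mean() is equivalent to isnull().sum() / len(toy)
missing_fraction = toy.isnull().mean()
toy_cols_to_drop = missing_fraction.index[missing_fraction > 0.75]
print(list(toy_cols_to_drop))  # ['mostly_missing']
```

As in the cells above, the columns to drop are identified on the training set only and then removed from both the training and test sets.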

In [None]:
ncm_model = lgb.LGBMClassifier(n_estimators=10000, 
                              objective = 'binary', 
                              class_weight = 'balanced', 
                              learning_rate = 0.05, 
                              reg_alpha = 0.1, 
                              reg_lambda = 0.1, 
                              subsample = 0.8, 
                              n_jobs = -1, 
                              random_state = 42)

In [None]:
ncm_model.fit(X_train_without_collinear_missing, 
          y_train, 
          eval_metric = custom_score)

# Make predictions
y_pred = ncm_model.predict(X_test_without_collinear_missing)

row = ["Without collinear and missing", 
       hf.bank_score(y_test, y_pred)]

results.loc[len(results)] = row

___
#### 1.4 DELETION OF FEATURES OF ZERO IMPORTANCE FOR THE MODEL

In [None]:
train = X_train_without_collinear_missing
test = X_test_without_collinear_missing

In [None]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(train.shape[1])

In [None]:
ncmnof_model = lgb.LGBMClassifier(n_estimators=10000, 
                              objective = 'binary', 
                              class_weight = 'balanced', 
                              learning_rate = 0.05, 
                              reg_alpha = 0.1, 
                              reg_lambda = 0.1, 
                              subsample = 0.8, 
                              n_jobs = -1, 
                              random_state = 42)

In [None]:
# Fit the model twice and accumulate the feature importances so they can be averaged
for i in range(2):
    
    # Train the model (no eval_set is passed, so no early stopping is applied)
    ncmnof_model.fit(train, 
              y_train, 
              eval_metric = custom_score)
    
    # Record the feature importances
    feature_importances += ncmnof_model.feature_importances_

In [None]:
# Average feature importances
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(train.columns), 
                                    'importance': feature_importances}).sort_values('importance', ascending = False)

feature_importances.head()

In [None]:
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
feature_importances.tail(len(zero_features))
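
Averaging importances over several fits is most informative when the fits differ, for example through the random seed; two fits on the same data with the same `random_state` are likely to give near-identical importances. The sketch below shows one way this could be organized. `average_importances` is a hypothetical helper, not the code of `hf.identify_zero_importance_features`, and scikit-learn's `RandomForestClassifier` is swapped in for LightGBM only so that the example runs with scikit-learn alone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def average_importances(make_model, X, y, seeds=(0, 1)):
    """Average feature_importances_ over one fit per seed (hypothetical helper)."""
    total = np.zeros(X.shape[1])
    for seed in seeds:
        model = make_model(seed)
        model.fit(X, y)
        total += model.feature_importances_
    return (pd.DataFrame({"feature": list(X.columns),
                          "importance": total / len(seeds)})
            .sort_values("importance", ascending=False)
            .reset_index(drop=True))

# Toy data: one informative column, one noise column, one constant column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"signal": rng.normal(size=300),
                      "noise": rng.normal(size=300),
                      "constant": np.ones(300)})  # can never be used for a split
y_toy = (X_toy["signal"] > 0).astype(int)

# Stand-in model; a factory returning lgb.LGBMClassifier(random_state=seed, ...)
# could be passed instead, since it also exposes feature_importances_
importances = average_importances(
    lambda seed: RandomForestClassifier(n_estimators=20, random_state=seed),
    X_toy, y_toy)

toy_zero_features = list(importances.loc[importances["importance"] == 0.0, "feature"])
print(importances)
print(toy_zero_features)
```

The constant column is never selected for a split, so its averaged importance is exactly zero and it is the one flagged for removal.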

In [None]:
norm_feature_importances = gf.plot_feature_importances2(feature_importances)

In [None]:
train = train.drop(columns = zero_features)
test = test.drop(columns = zero_features)

print('Training shape: ', train.shape)
print('Testing shape: ', test.shape)

In [None]:
second_round_zero_features, feature_importances = hf.identify_zero_importance_features(train, y_train)

In [None]:
norm_feature_importances = gf.plot_feature_importances2(feature_importances)

In [None]:
train = train.drop(columns=second_round_zero_features)
test = test.drop(columns=second_round_zero_features)

In [None]:
print('Training shape: ', train.shape)
print('Testing shape: ', test.shape)

In [None]:
third_round_zero_features, feature_importances = hf.identify_zero_importance_features(train, y_train)

In [None]:
norm_feature_importances = gf.plot_feature_importances2(feature_importances)

In [None]:
ncmnof_model.fit(train, 
          y_train, 
          eval_metric = custom_score)

# Make predictions
y_pred = ncmnof_model.predict(test)

row = ["Without all + 0 importance features", 
       hf.bank_score(y_test, y_pred)]

results.loc[len(results)] = row

___
#### 1.5 DELETION OF FEATURES BEYOND THE FIRST 95% OF CUMULATIVE IMPORTANCE FOR THE MODEL

In [None]:
# Threshold for cumulative importance
THRESHOLD = 0.95

# Extract the features to keep
features_to_keep = list(norm_feature_importances[norm_feature_importances['cumulative_importance'] < THRESHOLD]['feature'])

# Create new datasets with fewer features
train_small = train[features_to_keep]
test_small = test[features_to_keep]
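
The implementation of `gf.plot_feature_importances2` is not shown in this notebook. As a rough sketch, a table with a `cumulative_importance` column could be built as follows; the toy importances and the `normalized_importance` column name are assumptions made for the example.

```python
import pandas as pd

toy_importances = pd.DataFrame({
    "feature": ["f1", "f2", "f3", "f4"],
    "importance": [50.0, 30.0, 12.0, 8.0],
})

# Sort by importance, normalize so the values sum to 1, then accumulate
norm = toy_importances.sort_values("importance", ascending=False).reset_index(drop=True)
norm["normalized_importance"] = norm["importance"] / norm["importance"].sum()
norm["cumulative_importance"] = norm["normalized_importance"].cumsum()

# Keep features while the running total stays strictly below the threshold
kept = list(norm.loc[norm["cumulative_importance"] < 0.95, "feature"])
print(norm)
print(kept)  # ['f1', 'f2', 'f3']
```

With a strict `<` comparison, the feature whose contribution crosses the threshold is itself excluded, so the retained features account for somewhat less than 95% of the total importance (92% in this toy case).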

In [None]:
ncmnofif_model = lgb.LGBMClassifier(n_estimators=10000, 
                              objective = 'binary', 
                              class_weight = 'balanced', 
                              learning_rate = 0.05, 
                              reg_alpha = 0.1, 
                              reg_lambda = 0.1, 
                              subsample = 0.8, 
                              n_jobs = -1, 
                              random_state = 42)

In [None]:
ncmnofif_model.fit(train_small, 
          y_train, 
          eval_metric = custom_score)

In [None]:
# Make predictions
y_pred = ncmnofif_model.predict(test_small)

row = ["Without 0 importance features and small", 
       hf.bank_score(y_test, y_pred)]

results.loc[len(results)] = row

___
#### 1.6 PERFORMANCE COMPARISON

In [None]:
results

The best score is obtained by removing:
- collinear features
- features with more than 75% missing values
- features having 0 importance for the model

In [None]:
best_model = ncmnof_model

___
### 2. CONCLUSION

___

At this stage, our model has the following characteristics: 
- algorithm: Light Gradient Boosting Machine
- rebalancing strategy: class_weight = 'balanced'
- features: selected in this notebook

Our model obtains a score of 0.109, that is, 10.9% better than a baseline that systematically predicts that the customer will repay their credit.

In order to finalize our model and improve its performance, the last step consists of optimizing the hyperparameters.

___
### 3. DATA EXPORT
___

In [None]:
# Saving the best model and the dataset to save time
dump(best_model, "lgbm_best_features_model.joblib")
train.to_csv("./Resources/datasets/assembled/train.csv")
y_train.to_csv("./Resources/datasets/assembled/y_train.csv")
test.to_csv("./Resources/datasets/assembled/test.csv") 
y_test.to_csv("./Resources/datasets/assembled/y_test.csv")

# Saving columns for new data pipeline
dump(train.columns, "model_features.joblib")
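
A downstream pipeline could reload the saved column list and use it to align new data with the model's expected inputs. The sketch below assumes a temporary file and invented column names (`feature_a`, `feature_b`, `extra_col`) purely for illustration.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load

# Save and reload a column index, as done for model_features.joblib
with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, "model_features.joblib")
    dump(pd.Index(["feature_a", "feature_b"]), features_path)
    model_features = load(features_path)

# New data with one expected column missing and one unexpected column
new_data = pd.DataFrame({"feature_b": [100.0], "extra_col": [1]})

# reindex keeps the saved column order, drops extras, and fills gaps with NaN
aligned = new_data.reindex(columns=model_features)
print(aligned)
```

Whether a missing column should be filled with NaN or should raise an error is a design choice for the pipeline; `reindex` silently does the former.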
