

# <font size="+3"><span style='color:#2994ff'> **P7 - Implémentez un modèle de scoring** </span></font>


<a id='LOADING_LIBRARIES'></a>

---

---

<font size="+1"> **LOADING THE LIBRARIES** </font>

---

In [6]:
# Import required librairies
import sys
import os
import pandas as pd
import numpy as np
import datetime
import pickle

import json
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
import shap
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler


<a id='notebook_settings'></a>


<br>


---
---

<font size="+1"> **NOTEBOOK SETTINGS** </font>

---


In [7]:
#################################
#    -- NOTEBOOK SETTINGS --    #
#################################

%matplotlib inline
sns.set_theme(palette="Set1")

# Random state
seed = 84

# Define training set size
TRAIN_SIZE = 0.8


<a id='USED_FUNCTIONS'></a>


<br>


---
---

<font size="+1"> **FUNCTIONS USED IN THIS NOTEBOOK** </font>

---



## <font color = '#0085dd'>**Table of content**</font>


[Librairies loading](#LOADING_LIBRARIES)<br>

[Functions used in this notebook](#USED_FUNCTIONS)<br>

---

[**Datasets**](#datasets)
 * [Description](#datasets_description)
   * [orders_dataset](#orders_dataset)
   * [customers_dataset](#customers_dataset)
   * [order_items](#order_items)
   * [products_dataset](#products_dataset)
   * [product_category_name_translation](#product_category_name_translation)   
   * [sellers_dataset](#sellers_dataset)
   * [order_payments](#order_payments)
   * [order_reviews](#order_reviews)
   * [geolocation_dataset](#geolocation_dataset)<br> 
<br>
 * [Columns preparation](#columns_preparation)
   * [Zip codes centroids calculation](#centroid_zipCode)
   * [Product category check and update](#cat_products)   
   
   
[**Datasets joining**](#datasets_joining)
 * [Data description](#data_description)
 * [Evaluation of missing values](#missing_values)  
 * [Column filling analysis](#column_fill)   
 * [Row filling analysis](#row_filling)  
 * [Features pre-selection](#features_preSelection)

[**Features engineering**](#features_engineering)
 * [Customers spatial distribution](#customers_spatial_distribution)
 * [RFM features](#RFM_features)  
 * [Products](#products)   
 * [Orders](#Orders)  
 * [Dates](#dates)
 * [Joining customers information](#join_customers_datasets)
 * [Features analysis](#features_analysis)  

[**Dataset for segmentation**](#segmentation_dataset)
<br>

---


<a id='introduciton'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Introduction** </span></span>


This notebook deals with the preparation of the various fields and graphs for API development and display in the dashboard.

In particular:
- **subsample**: recovery of the fields to be used by the API and displayed in the dashboard in an understandable form (e.g. age in years rather than number of days, gender in Female/Male rather than 0/1).
- **interpretability/transparency** :
    - recovery of the 10 most important variables to create graphs for each of these variables by comparing the results for defaulters/non-defaulters and the overall average for all customers to determine the position of the customer to whom the loan will or will not be granted.
    - recovery of the 10 closest neighbours on these 10 variables to determine which of these 10 neighbours have defaulted or not.
    - situate the patient among the defaulters or non-defaulters on the variables age, sex, socio-professional category, income, amount of credit, duration of credit, etc.

<a id='upload_files'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Upload files** </span></span>


<a id='info_api'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Information for API** </span></span>


**Create a list of selected features for the API**

In [21]:
# -----------------------------
# Data used for modeling
# -----------------------------

# Open final train_dataset
path_train_data = \
    'preprocessing/train_data_fs_t25_combi_ML.pkl'

with open(path_train_data, 'rb') as f:
    train_data_modeling = pickle.load(f)


# Open final test_dataset
path_test_data = \
    'preprocessing/test_data_fs_t25_combi_ML.pkl'
os.makedirs(os.path.dirname(path_test_data), exist_ok=True)

In [24]:
selected_feautes = train_data_modeling.columns.tolist()
selected_feautes

['AMT_CREDIT',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'DAYS_LAST_PHONE_CHANGE',
 'CREDIT_INCOME_RATIO',
 'CREDIT_ANNUITY_RATIO',
 'CREDIT_GOODS_RATIO',
 'INCOME_EXT_RATIO',
 'CAR_EMPLOYED_RATIO',
 'EXT_SOURCE_MEAN',
 'OBS_30_CREDIT_RATIO',
 'ENQ_CREDIT_RATIO',
 'REGIONS_RATING_INCOME_MUL_0',
 'DAYS_CREDIT_MEAN_OVERALL',
 'AMT_CREDIT_SUM_MEAN_OVERALL',
 'CURRENT_DEBT_TO_CREDIT_RATIO_MEAN_OVERALL',
 'DAYS_CREDIT_MEAN_CREDITACTIVE_ACTIVE',
 'CURRENT_DEBT_TO_CREDIT_RATIO_MEAN_CREDITACTIVE_ACTIVE',
 'CURRENT_CREDIT_DEBT_DIFF_MEAN_CREDITACTIVE_ACTIVE',
 'HOUR_APPR_PROCESS_START_MEAN_LAST_5',
 'DAYS_DECISION_MEAN_LAST_5',
 'SELLERPLACE_AREA_MEAN_LAST_5',
 'INTEREST_SHARE_MEAN_LAST_5',
 'INTEREST_SHARE_MEAN_FIRST_2',
 'DAYS_PAYMENT_DIFF_MEAN_MEAN',
 'DAYS_PAYMENT_DIFF_MIN_MEAN',
 'DAYS_PAYMENT_DIFF_MAX_MEAN',
 'TARGET',
 'SK_ID_CURR']

<a id='subsample_api_dashboard'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Create a data sample for the API and dashboard** </span></span>


In [10]:
# -------------------------------------------
# Load de datasets after feature engineering
# -------------------------------------------

# Train data
path_train_dataset_after_feat_eng =  '/home/raquelsp/Documents/Openclassrooms/P7_implementez_modele_scoring/P7_travail/P7_scoring_credit/preprocessing/final_train_data.pkl'

with open(path_train_dataset_after_feat_eng, 'rb') as f:
    final_train_data = pickle.load(f)

**Add columns useful for clients description and general comprenhension**

In [12]:
# -----------------------------
# Upload: application_train
# -----------------------------
application_train = pd.read_csv('/home/raquelsp/Documents/Openclassrooms/P7_implementez_modele_scoring/P7_travail/p7_source/application_train.csv',
                                low_memory=False,
                                encoding='utf-8')

In [13]:
# General client information
client_info_columns = ['SK_ID_CURR',
                       'DAYS_BIRTH', 'CODE_GENDER',
                       'NAME_FAMILY_STATUS',
                        'CNT_CHILDREN', 'NAME_EDUCATION_TYPE',
                        'NAME_INCOME_TYPE', 'DAYS_EMPLOYED',
                        'AMT_INCOME_TOTAL', 
                        'NAME_CONTRACT_TYPE',
                        'AMT_GOODS_PRICE', 
                        'NAME_HOUSING_TYPE', ]

In [14]:
general_info = application_train[client_info_columns]

In [15]:
# Change age features to years (instead of days)
# Transform DAYS_BIRTH to years
general_info['AGE'] = np.trunc(np.abs(general_info['DAYS_BIRTH'] / 365)).astype('int8')
# Transform DAYS_EMPLOYED to years
general_info['YEARS_WORKING'] = np.trunc(np.abs(general_info['DAYS_EMPLOYED'] / 365)).astype('int8')
# Transform gender : 0 = Féminin et 1 = Masculin
general_info['GENDER'] = ['Woman' if row == 0 else 'Man'
                        for row in general_info['CODE_GENDER']]

general_info = general_info.drop(columns=['DAYS_BIRTH', 'DAYS_EMPLOYED',
                                          'CODE_GENDER'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [16]:
# Ajout des varaibles manquantes au dataframe du dashboard
all_data = train_data.merge(general_info, on='SK_ID_CURR',
                                      how='left')
all_data.head(3)

Unnamed: 0,AMT_CREDIT,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_LAST_PHONE_CHANGE,...,CNT_CHILDREN,NAME_EDUCATION_TYPE,NAME_INCOME_TYPE,AMT_INCOME_TOTAL,NAME_CONTRACT_TYPE,AMT_GOODS_PRICE,NAME_HOUSING_TYPE,AGE,YEARS_WORKING,GENDER
0,-0.478095,-0.149689,-1.506769,0.755834,0.379837,0.579154,-3.021547,-1.317986,-2.153361,-0.206991,...,0,Secondary / secondary special,Working,202500.0,Cash loans,351000.0,House / apartment,25,1,Man
1,1.72545,-1.252595,0.167322,0.497899,1.078697,1.790855,-1.3841,0.563563,0.111516,0.163108,...,0,Higher education,State servant,270000.0,Cash loans,1129500.0,House / apartment,45,3,Man
2,-1.152888,-0.783388,0.690067,0.9487,0.206115,0.306869,0.011671,0.218208,1.223695,0.178831,...,0,Secondary / secondary special,Working,67500.0,Revolving loans,135000.0,House / apartment,52,0,Man


In [17]:
# Create a data sample for the dashboard
data_0 = all_data[all_data["TARGET"] == 0].sample(100, random_state=seed)
data_1 = all_data[all_data["TARGET"] == 1].sample(100, random_state=seed)
data_dashboard = pd.concat([data_0, data_1]).reset_index(drop=True)

In [18]:
data_dashboard.shape

(200, 46)

**Export data to csv**

In [19]:
# Export this data sample in a csv file
pd.DataFrame(data_dashboard).to_csv("data_dashboard.csv", index=False)