

# <font size="+3"><span style='color:#2994ff'> **P7 - Implement a scoring model** </span></font>


<a id='LOADING_LIBRARIES'></a>

---

---

<font size="+1"> **LOADING THE LIBRARIES** </font>

---

In [12]:
# Standard library
import os
import pickle
import sys

# Data handling
import numpy as np
import pandas as pd

# Data drift analysis with Evidently
import evidently
from evidently import ColumnMapping

from evidently.report import Report
from evidently.metrics.base_metric import generate_column_metrics
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset, DataQualityPreset, RegressionPreset
from evidently.metrics import *

from evidently.test_suite import TestSuite
from evidently.tests.base_test import generate_column_tests
from evidently.test_preset import DataStabilityTestPreset, NoTargetPerformanceTestPreset, RegressionTestPreset
from evidently.tests import *


# Personal packages
import tools_dataframe
import tools_preprocessing
import tools_feat_engineering
import tools_modeling


# Warnings
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')



In [13]:
# Versions
print('Version of used libraries :')

print('Python    : ' + sys.version)
print('NumPy     : ' + np.version.full_version)
print('Pandas    : ' + pd.__version__)
print('Evidently : ' + evidently.__version__)

Version of used libraries :
Python    : 3.9.13 (main, Aug 25 2022, 23:26:10) 
[GCC 11.2.0]
NumPy     : 1.24.3
Pandas    : 1.5.3
Evidently : 0.3.3



<a id='notebook_settings'></a>


<br>


---
---

<font size="+1"> **NOTEBOOK SETTINGS** </font>

---


In [14]:
#################################
#    -- NOTEBOOK SETTINGS --    #
#################################

%matplotlib inline

# Random state
seed = 84

# Define training set size
TRAIN_SIZE = 0.8



<a id='USED_FUNCTIONS'></a>


<br>


---
---

<font size="+1"> **FUNCTIONS USED IN THIS NOTEBOOK** </font>

---







## <font color = '#0085dd'>**Table of contents**</font>


[Libraries loading](#LOADING_LIBRARIES)<br>

[Notebook settings](#notebook_settings)<br>

[Functions used in this notebook](#USED_FUNCTIONS)<br>

---

[**Loading datasets**](#datasets_loading)
 * [Data preparation](#data_preparation)
 * [Data drift analysis](#data_drift_analysis)
<br>

---


<a id='datasets_loading'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Loading datasets** </span></span>


In [15]:
# Define the folder containing the files with the project data
P7_scoring_credit = "/home/raquelsp/Documents/Openclassrooms/P7_implementez_modele_scoring/P7_travail/"

os.chdir(P7_scoring_credit)

In [16]:
# --------------------------------
# Files after feature engineering
# --------------------------------
# Load the final training set: it serves as the reference data
path_train_data = \
    'P7_scoring_credit/preprocessing/train_data_fs_t25_combi_ML.pkl'
with open(path_train_data, 'rb') as f:
    train_data_fe = pickle.load(f)

# Load the final test set: it serves as the current data
path_test_data = \
    'P7_scoring_credit/preprocessing/test_data_fs_t25_combi_ML.pkl'
with open(path_test_data, 'rb') as f:
    test_data_fe = pickle.load(f)

<a id='data_preparation'></a>

## <span style='background:#0085dd'><span style='color:white'>Data preparation</span></span>

**Data after feature engineering**

In [17]:
# --------------------
# Column description
# --------------------
info_train_data_fe = tools_dataframe.complet_description(train_data_fe)
info_train_data_fe.sample(5)

| | Variable | Type | null | Duplicated | Filling percentage | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | SK_ID_CURR | int32 | 0 | 0 | 100.0 | 307511.0 | 278180.518577 | 102790.175348 | 100002.0 | 189145.5 | 278202.0 | 367142.5 | 456255.0 |
| 11 | INTEREST_SHARE_MEAN_FIRST_2 | float32 | 0 | 301202 | 100.0 | 307511.0 | -0.0 | 0.999843 | -4.486805 | -0.258287 | 0.06835 | 0.440544 | 5.295004 |
| 18 | EXT_SOURCE_MEAN | float32 | 0 | 305131 | 100.0 | 307511.0 | 0.0 | 0.999934 | -4.511773 | -0.648111 | 0.092637 | 0.706916 | 3.164031 |
| 10 | INTEREST_SHARE_MEAN_LAST_5 | float32 | 0 | 293183 | 100.0 | 307511.0 | -0.0 | 0.999778 | -3.572714 | -0.270322 | 0.085896 | 0.466777 | 5.512062 |
| 23 | DAYS_ID_PUBLISH | float32 | 0 | 301343 | 100.0 | 307511.0 | 0.0 | 0.999971 | -2.784328 | -0.86442 | -0.172114 | 0.844151 | 1.983641 |


In [18]:
# Identify empty columns (filling percentage below 1 %)
to_remove = info_train_data_fe.loc[info_train_data_fe['Filling percentage']<1]
cols_to_remove = to_remove['Variable'].tolist()
print(f'There are {len(cols_to_remove)} empty columns')

There are 0 empty columns


In [19]:
# Remove empty columns (names missing from a frame are ignored)
train_data_fe = train_data_fe.drop(columns=cols_to_remove, errors='ignore')
test_data_fe = test_data_fe.drop(columns=cols_to_remove, errors='ignore')

In [20]:
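# Reference data: training set without the target and the client identifier
reference_fe = train_data_fe.drop(columns=['TARGET', 'SK_ID_CURR'])
print('Reference' + str(reference_fe.shape))
# Fix random_state so that the 10,000-row sample is reproducible
reference_fe_10000 = reference_fe.sample(n=10000, replace=False,
                                         random_state=seed)

# Current data: test set without the client identifier
current_fe = test_data_fe.drop(columns=['SK_ID_CURR'])
print('Current' + str(current_fe.shape))
current_fe_10000 = current_fe.sample(n=10000, replace=False,
                                     random_state=seed)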

Reference(307511, 33)
Current(48744, 33)
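
Both frames report 33 columns, but equal counts do not guarantee identical column names, and a drift comparison only makes sense column by column. The sketch below shows one way to check this before building the report. The helper `column_mismatch` is hypothetical (it is not part of the notebook or of Evidently) and works on plain lists of names, so that `reference_fe.columns` and `current_fe.columns` could be passed to it directly.

```python
def column_mismatch(reference_columns, current_columns):
    """Return the column names present on one side only.

    Hypothetical helper for illustration: the first list holds names found
    only in the reference data, the second names found only in the current data.
    """
    reference_set = set(reference_columns)
    current_set = set(current_columns)
    only_in_reference = sorted(reference_set - current_set)
    only_in_current = sorted(current_set - reference_set)
    return only_in_reference, only_in_current


# Toy example with made-up column lists
only_ref, only_cur = column_mismatch(
    ['EXT_SOURCE_MEAN', 'DAYS_ID_PUBLISH', 'TARGET'],
    ['EXT_SOURCE_MEAN', 'DAYS_ID_PUBLISH'],
)
print('Only in reference:', only_ref)
print('Only in current  :', only_cur)
```

Two empty lists mean that both frames expose the same set of columns.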


<a id='data_drift_analysis'></a>

## <span style='background:#0085dd'><span style='color:white'>Data drift analysis</span></span>

In [21]:
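# Data drift report on the feature-engineered data:
# the training sample is the reference, the test sample is the current data
feng_report = Report(metrics=[DataDriftPreset()])

feng_report.run(reference_data=reference_fe_10000,
                current_data=current_fe_10000)

# Save the report as a standalone HTML file
feng_report.save_html('feng_report10000.html')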
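
To give an intuition of what a per-column drift score measures, here is a minimal, standard-library sketch that compares one numerical feature between two samples. It is written under stated assumptions and does not reproduce Evidently's internals: the distance chosen (a Wasserstein-1 distance scaled by the reference standard deviation), the function name `normed_wasserstein`, the `DRIFT_THRESHOLD` value of 0.1 and the simulated Gaussian data are all illustrative choices.

```python
import random
import statistics


def normed_wasserstein(reference, current):
    """Wasserstein-1 distance between two equal-size samples,
    divided by the standard deviation of the reference sample."""
    if len(reference) != len(current):
        raise ValueError('this sketch assumes samples of equal size')
    scale = statistics.pstdev(reference)
    if scale == 0:
        raise ValueError('the reference sample is constant')
    # For two 1-D samples of equal size, the Wasserstein-1 distance is the
    # mean absolute difference between the sorted values.
    pairs = zip(sorted(reference), sorted(current))
    distance = sum(abs(r - c) for r, c in pairs) / len(reference)
    return distance / scale


DRIFT_THRESHOLD = 0.1  # hypothetical threshold, chosen for illustration

rng = random.Random(84)
reference = [rng.gauss(0.0, 1.0) for _ in range(10000)]
same = [rng.gauss(0.0, 1.0) for _ in range(10000)]      # same distribution
shifted = [rng.gauss(0.5, 1.0) for _ in range(10000)]   # mean shifted by 0.5

score_same = normed_wasserstein(reference, same)
score_shifted = normed_wasserstein(reference, shifted)

print('No shift : drift detected =', score_same > DRIFT_THRESHOLD)
print('Shifted  : drift detected =', score_shifted > DRIFT_THRESHOLD)
```

Scaling by the reference standard deviation makes the score comparable across features with different units, which is why a single threshold can be applied to every column.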