

# <font size="+3"><span style='color:#2994ff'> **P7 - Implémentez un modèle de scoring** </span></font>


<a id='LOADING_LIBRARIES'></a>

---

---

<font size="+1"> **LOADING THE LIBRARIES** </font>

---

In [1]:
#!pip install tensorflow

In [2]:
# Standard library and data handling
import sys
import pandas as pd
import numpy as np
import os
import pickle

# Data drift (Evidently)
import evidently
from evidently import ColumnMapping

from evidently.report import Report
from evidently.metrics.base_metric import generate_column_metrics
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset, DataQualityPreset, RegressionPreset
from evidently.metrics import *

from evidently.test_suite import TestSuite
from evidently.tests.base_test import generate_column_tests
from evidently.test_preset import DataStabilityTestPreset, NoTargetPerformanceTestPreset, RegressionTestPreset
from evidently.tests import *


# Personal packages
import tools_dataframe
import tools_preprocessing
import tools_feat_engineering
import tools_modeling


# Warnings
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')




In [3]:
# Versions
print('Version of used libraries :')

print('Python    : ' + sys.version)
print('NumPy     : ' + np.version.full_version)
print('Pandas    : ' + pd.__version__)
print('Evidently : ' + evidently.__version__)

Version of used libraries :
Python    : 3.9.13 (main, Aug 25 2022, 23:26:10) 
[GCC 11.2.0]
NumPy     : 1.24.3
Pandas    : 1.5.3
Evidently : 0.3.3



<a id='notebook_settings'></a>


<br>


---
---

<font size="+1"> **NOTEBOOK SETTINGS** </font>

---


In [4]:
#################################
#    -- NOTEBOOK SETTINGS --    #
#################################

%matplotlib inline

# Random state
seed = 84

# Define training set size
TRAIN_SIZE = 0.8



<a id='USED_FUNCTIONS'></a>


<br>


---
---

<font size="+1"> **FUNCTIONS USED IN THIS NOTEBOOK** </font>

---







## <font color = '#0085dd'>**Table of contents**</font>


[Libraries loading](#LOADING_LIBRARIES)<br>

[Functions used in this notebook](#USED_FUNCTIONS)<br>

---

[**Datasets**](#datasets)
 * [Description](#datasets_description)
   * [orders_dataset](#orders_dataset)
   * [customers_dataset](#customers_dataset)
   * [order_items](#order_items)
   * [products_dataset](#products_dataset)
   * [product_category_name_translation](#product_category_name_translation)
   * [sellers_dataset](#sellers_dataset)
   * [order_payments](#order_payments)
   * [order_reviews](#order_reviews)
   * [geolocation_dataset](#geolocation_dataset)<br>
<br>
 * [Columns preparation](#columns_preparation)
   * [Zip codes centroids calculation](#centroid_zipCode)
   * [Product category check and update](#cat_products)


[**Datasets joining**](#datasets_joining)
 * [Data description](#data_description)
 * [Evaluation of missing values](#missing_values)
 * [Column filling analysis](#column_fill)
 * [Row filling analysis](#row_filling)
 * [Features pre-selection](#features_preSelection)

[**Features engineering**](#features_engineering)
 * [Customers spatial distribution](#customers_spatial_distribution)
 * [RFM features](#RFM_features)
 * [Products](#products)
 * [Orders](#Orders)
 * [Dates](#dates)
 * [Joining customers information](#join_customers_datasets)
 * [Features analysis](#features_analysis)

[**Dataset for segmentation**](#segmentation_dataset)
<br>

---


<a id='datasets_loading'></a>

---
---

# <span style='background:#2994ff'><span style='color:white'>**Loading datasets** </span></span>


In [5]:
# Define the folder containing the files with the project data
P7_scoring_credit = "/home/raquelsp/Documents/Openclassrooms/P7_implementez_modele_scoring/P7_travail/P7_scoring_credit/"

os.chdir(P7_scoring_credit)

In [6]:
# -----------------------------
# Files loading:
# -----------------------------

# Load the final training set: this is the reference data
path_train_data = 'preprocessing/final_train_data.pkl'
with open(path_train_data, 'rb') as f:
    train_data = pickle.load(f)

# Load the final test set: this is the current data
path_test_data = 'preprocessing/final_test_data.pkl'
with open(path_test_data, 'rb') as f:
    test_data = pickle.load(f)
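
Before comparing the two datasets, it can help to confirm that they expose the same columns. The snippet below is a minimal sketch of such a check, not part of the original notebook: the helper name `compare_columns` and the toy frames are hypothetical, and `TARGET` is ignored on the assumption that it is not a feature to compare.

```python
import pandas as pd


def compare_columns(reference_df, current_df, ignore=("TARGET",)):
    """Return the columns that appear in only one of the two frames."""
    ref_cols = set(reference_df.columns) - set(ignore)
    cur_cols = set(current_df.columns) - set(ignore)
    return {
        "only_in_reference": sorted(ref_cols - cur_cols),
        "only_in_current": sorted(cur_cols - ref_cols),
    }


# Toy frames standing in for train_data / test_data (hypothetical values)
toy_train = pd.DataFrame(
    {"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1.0, 2.0], "TARGET": [0, 1]}
)
toy_test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [3.0]})

diff = compare_columns(toy_train, toy_test)
print(diff)
```

With the real data, the call would be `compare_columns(train_data, test_data)`; two empty lists mean the feature sets are aligned.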

<a id='data_preparation'></a>

## <span style='background:#0085dd'><span style='color:white'>Data preparation</span></span>

In [7]:
# --------------------
# Column description
# --------------------
info_train_data = tools_dataframe.complet_description(train_data)
info_train_data.sample(5)

| | Variable | Type | null | Duplicated | Filling percentage | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 369 | EXT_SOURCE_3_MIN_AGG_CODE_GENDER_NAME_FAMILY_S... | float16 | 0 | 307492 | 100.0 | 307511.0 | 0.0 | 0.007011 | 0.000527 | 0.000527 | 0.000527 | 0.000527 | 0.67041 |
| 94 | NAME_CONTRACT_TYPE_MEAN_LAST_5 | float16 | 0 | 307486 | 100.0 | 307511.0 | | 0.0 | 1.0 | 1.0 | 1.599609 | 2.0 | 3.666016 |
| 204 | CONTRACT_Signed | float16 | 0 | 307257 | 100.0 | 307511.0 | 0.0 | 0.017105 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 240 | DAYS_PAYMENT_RATIO_MIN_MIN | float16 | 0 | 304673 | 100.0 | 307511.0 | | 0.0 | 0.008163 | 0.697754 | 0.877441 | 0.952637 | 1.166992 |
| 339 | APARTMENTS_SUM_AVG_MEAN_AGG_CODE_GENDER_NAME_F... | float16 | 0 | 307455 | 100.0 | 307511.0 | | 0.0 | 2.0 | 2.640625 | 2.658203 | 2.681641 | 3.884766 |


In [8]:
# Identify (nearly) empty columns: less than 1% of values filled
to_remove = info_train_data.loc[info_train_data['Filling percentage']<1]
cols_to_remove = to_remove['Variable'].tolist()
print(f'There are {len(cols_to_remove)} empty columns')

There are 23 empty columns
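
The `Filling percentage` column comes from `tools_dataframe.complet_description`, a personal helper whose code is not shown here. As a rough illustration, and assuming it means the share of non-null values per column, an equivalent filter can be sketched with plain pandas. The function name `filling_percentage` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def filling_percentage(df):
    """Percentage of non-null values per column (0 to 100)."""
    return df.notna().mean() * 100


# Toy frame with a full, a half-filled and an empty column (hypothetical)
toy = pd.DataFrame(
    {
        "full": [1.0, 2.0, 3.0, 4.0],
        "half": [1.0, np.nan, 3.0, np.nan],
        "empty": [np.nan, np.nan, np.nan, np.nan],
    }
)

fill = filling_percentage(toy)
# Same threshold as the cell above: keep columns filled at less than 1%
nearly_empty = fill[fill < 1].index.tolist()
print(nearly_empty)
```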


In [9]:
# Remove the columns flagged above from both datasets
train_data = train_data.drop(columns=cols_to_remove, errors='ignore')
test_data = test_data.drop(columns=cols_to_remove, errors='ignore')

In [10]:
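# Reference data: training set without the target and the client identifier
reference = train_data.drop(columns=['TARGET', 'SK_ID_CURR'])
print('Reference' + str(reference.shape))
# Sample 5000 rows for the drift report; the seed makes the draw reproducible
reference5000 = reference.sample(n=5000, replace=False, random_state=seed)

# Current data: test set without the client identifier
current = test_data.drop(columns=['SK_ID_CURR'])
print('Current' + str(current.shape))
current5000 = current.sample(n=5000, replace=False, random_state=seed)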

Reference(307511, 578)
Current(48744, 578)


In [None]:
# Build a data drift report comparing the current sample with the reference sample
report = Report(metrics=[DataDriftPreset()])

report.run(reference_data=reference5000, current_data=current5000)
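
To give an intuition of what a per-column drift score measures, the sketch below computes the Population Stability Index (PSI) between two numeric samples with NumPy only. This is an illustration with a swapped-in technique, not the computation performed by `DataDriftPreset`, which selects its own statistical test for each column. The function name, the bin count and the synthetic samples are assumptions, and the inputs are assumed to contain no missing values.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D numeric samples, using quantile bins of the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges taken from reference quantiles; np.unique drops ties
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(reference, quantiles))
    # searchsorted maps each value to a bin index in [0, len(edges)]
    n_cells = len(edges) + 1
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=n_cells)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=n_cells)
    # Clip the shares to avoid log(0) and division by zero in empty bins
    ref_share = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_share = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


# Synthetic samples (hypothetical): same distribution vs. shifted mean
rng = np.random.default_rng(84)
ref = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)

psi_same = population_stability_index(ref, same)
psi_shifted = population_stability_index(ref, shifted)
print(psi_same, psi_shifted)
```

A common rule of thumb reads a PSI below 0.1 as stable and above 0.25 as a marked shift; these cut-offs are conventions, not settings taken from this notebook.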

In [None]:
#report.as_dict()

In [None]:
report.save_html('report.html')
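
Besides the HTML export, the dictionary returned by `report.as_dict()` can be inspected programmatically. The helper below is a hedged sketch: it assumes the per-column results are stored under `result['drift_by_columns']` with a boolean `drift_detected` field, which should be checked against the installed Evidently version. The function name and the toy dictionary are hypothetical.

```python
def drifted_columns(report_dict):
    """Collect the names of columns flagged as drifted in a report dictionary.

    Assumes per-column results sit under result['drift_by_columns'] with a
    boolean 'drift_detected' field (layout to verify for your version).
    """
    drifted = set()
    for metric in report_dict.get("metrics", []):
        by_column = metric.get("result", {}).get("drift_by_columns", {})
        for name, info in by_column.items():
            if info.get("drift_detected"):
                drifted.add(name)
    return sorted(drifted)


# Hand-written stand-in for report.as_dict() (hypothetical names and scores)
toy_report = {
    "metrics": [
        {"metric": "DatasetDriftMetric", "result": {"dataset_drift": False}},
        {
            "metric": "DataDriftTable",
            "result": {
                "drift_by_columns": {
                    "col_a": {"drift_detected": True, "drift_score": 0.31},
                    "col_b": {"drift_detected": False, "drift_score": 0.02},
                }
            },
        },
    ]
}

flagged = drifted_columns(toy_report)
print(flagged)
```

On the real report, the call would be `drifted_columns(report.as_dict())`.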