# Useful links

- [Git repository](https://github.com/JeanRosselVallee/project_8.git)
- [Google Cloud VM](https://console.cloud.google.com/compute/instances/observability?project=ocr-p8-dashboard&tab=instances)
- [Web application](http://localhost:8501/)

# Initialization

In [2]:
dir_in     = './data/in/'
dir_out    = './data/out/'
model_path = './data/model/'

In [3]:
%pip install --quiet plotly mlflow xgboost

Note: you may need to restart the kernel to use updated packages.


In [4]:
! cat ./data/model/requirements.txt

mlflow==2.14.1
cloudpickle==3.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
psutil==5.9.0
pyyaml==6.0.1
scikit-learn==1.5.0
scipy==1.13.1
xgboost==2.1.0

In [5]:
%pip install --quiet -r ./data/model/requirements.txt

  DEPRECATION: psutil is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Note: you may need to restart the kernel to use updated packages.


In [7]:
import pandas as pd
import numpy as np
import json
import mlflow
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Dataset

Predictions are made on the test set, because the model was trained on the training set.

In [8]:
! for file_i in ./data/in/* ; do wc -l $file_i ; done

33121 ./data/in/X_TN.csv
2359 ./data/in/X_TP.csv
48679 ./data/in/X_test_2.csv
8 ./data/in/config.json
48679 ./data/in/data.csv
1 ./data/in/li_features.txt
0 ./data/in/model_optimal_simplified.json
48679 ./data/in/y_pred_4.csv
48679 ./data/in/y_test_2.csv


### Loading

In [9]:
def load_data(file):
    # The first CSV column holds the request identifier; use it as the index
    df_contents = (
        pd.read_csv(file)
        .rename(columns={'Unnamed: 0': 'request_id'})
        .set_index('request_id')
    )
    return df_contents

#### Features

In [10]:
path_X = dir_in + 'X_test_2.csv'
df_X = load_data(path_X)
df_X.shape

(48678, 125)

In [11]:
import ast

with open(dir_in + 'li_features.txt') as f:
    str_li_features = f.read()
li_features = ast.literal_eval(str_li_features)  # parses the list literal without executing arbitrary code
li_features

['CODE_GENDER_M',
 'EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'NAME_EDUCATION_TYPE_Secondary_or_secondary_special',
 'NAME_EDUCATION_TYPE_Higher_education',
 'NAME_CONTRACT_TYPE_Cash_loans',
 'NAME_INCOME_TYPE_Working']

In [12]:
df_X = df_X[li_features]
display(df_X.head(1))
df_X.shape

request_id,CODE_GENDER_M,EXT_SOURCE_3,EXT_SOURCE_2,NAME_EDUCATION_TYPE_Secondary_or_secondary_special,NAME_EDUCATION_TYPE_Higher_education,NAME_CONTRACT_TYPE_Cash_loans,NAME_INCOME_TYPE_Working
155094,0,0.770087,0.607697,0,1,1,0


(48678, 7)

#### Target

In [13]:
path_y = dir_in + 'y_test_2.csv'
df_y = load_data(path_y)
df_y.shape
display(df_y.head(1))

request_id,TARGET
155094,0


### Joining features and target

In [14]:
df_data = df_X.join(df_y)
#df_data.columns = li_variables_simplified

In [15]:
display(df_data.head(1))
df_data.shape

request_id,CODE_GENDER_M,EXT_SOURCE_3,EXT_SOURCE_2,NAME_EDUCATION_TYPE_Secondary_or_secondary_special,NAME_EDUCATION_TYPE_Higher_education,NAME_CONTRACT_TYPE_Cash_loans,NAME_INCOME_TYPE_Working,TARGET
155094,0,0.770087,0.607697,0,1,1,0,0


(48678, 8)

### Saving

In [16]:
df_data.to_csv(dir_in + 'data.csv')

In [17]:
ls $dir_in/data.csv

./data/in//data.csv


In [18]:
! head -n 3 $dir_in/data.csv

request_id,CODE_GENDER_M,EXT_SOURCE_3,EXT_SOURCE_2,NAME_EDUCATION_TYPE_Secondary_or_secondary_special,NAME_EDUCATION_TYPE_Higher_education,NAME_CONTRACT_TYPE_Cash_loans,NAME_INCOME_TYPE_Working,TARGET
155094,0,0.7700870700124128,0.6076973714617412,0,1,1,0,0
74108,0,0.4258928980051529,0.7318427244611323,1,0,1,1,0


# Normalized score

The gauge display needs a score normalized with respect to the threshold between the 2 classes:

|original score|normalized score|
|--|--|
|0|0|
|threshold T|0.5|
|1|1|

These values are obtained by raising the score to the power P:

$$ T ^ P = 0.50 \quad \quad \quad \text{where } T \text{ is the threshold} $$
$$ \Rightarrow \quad P = \frac{\log_{2}(0.50)}{\log_{2}(T)} = -\frac{1}{\log_{2}(T)} $$


In [19]:
def get_normalizing_power(threshold) :
	normalizing_power = - 1 / np.log2(threshold)
	return normalizing_power
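
As a quick sanity check of the formula, the sketch below recomputes the power with the standard library only and confirms that the threshold maps to 0.5 while 0 and 1 stay fixed. The helper `normalize_score` is hypothetical (it is not part of the notebook), and the threshold 0.09 is the `best_threshold` value read from `config.json` later in this notebook.

```python
import math

def get_normalizing_power(threshold):
    # Power P such that threshold ** P == 0.5
    return -1 / math.log2(threshold)

def normalize_score(score, threshold):
    # Hypothetical helper: raises a raw score in [0, 1] to the power P
    return score ** get_normalizing_power(threshold)

threshold = 0.09  # value of best_threshold used in this notebook
for raw_score in (0.0, threshold, 1.0):
    print(raw_score, '->', round(normalize_score(raw_score, threshold), 6))
```

Because the power is positive for any threshold strictly between 0 and 1, the transformation is increasing: the ranking of the scores is preserved, only their scale changes.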

# Adapted model

In [20]:
from xgboost import XGBClassifier

## Model class

The returned score is the probability that a borrower does not repay (class "1"), normalized so that the decision threshold maps to 0.5

In [21]:
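class XGB_prob(XGBClassifier):
    threshold = 0.5  # class-level fallback; 0.5 gives a power of 1 (score unchanged), 0 would break the normalization
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
    def fit(self, df_X_train, df_y_train, **kwargs):
        super().fit(df_X_train, df_y_train, **kwargs)
        return self  # scikit-learn convention: fit returns the estimator
    def predict(self, df_X_subset, bool_save_events=True):
        # bool_save_events is not used here; kept so existing calls keep working
        np_y_pred_proba = super().predict_proba(df_X_subset)[:, 1]  # class 1 probabilities
        # raise to the power that maps the threshold to 0.5
        np_y_normalized = np.power(np_y_pred_proba, get_normalizing_power(self.threshold))
        return np_y_normalized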

## Loading the model

We load the most recent prediction model: the one deployed in pre-production, because the production model is absent from the Git repository

In [22]:
path_config = '../config.json' 

In [23]:
model_prob = XGB_prob()
model_prob.load_model(dir_in + 'model_optimal_simplified.json')
model_prob

### Signature generation

In [24]:
from mlflow.models.signature import infer_signature

In [25]:
model_signature = infer_signature(df_X.head(1), df_y.head(1))



#### Features

In [26]:
list(model_signature.inputs)[:10]

['CODE_GENDER_M': long (required),
 'EXT_SOURCE_3': double (required),
 'EXT_SOURCE_2': double (required),
 'NAME_EDUCATION_TYPE_Secondary_or_secondary_special': long (required),
 'NAME_EDUCATION_TYPE_Higher_education': long (required),
 'NAME_CONTRACT_TYPE_Cash_loans': long (required),
 'NAME_INCOME_TYPE_Working': long (required)]

#### Target

In [27]:
list(model_signature.outputs)

['TARGET': long (required)]

### Updating the binary threshold

In [28]:
dict_params_old = model_prob.get_params()
dict_params_old['threshold']

0.5

In [29]:
with open(dir_in + 'config.json', 'r') as json_file:
    dict_to_config = json.load(json_file)
best_threshold = float(dict_to_config['best_threshold'])

In [30]:
setattr(model_prob, 'threshold', best_threshold)

In [31]:
dict_params_new = model_prob.get_params()
dict_params_new['threshold']

0.09

## Deployment

In [32]:
from mlflow import sklearn as skl

In [33]:
! rm -rf $model_path ; mkdir -p "$model_path"
%time skl.save_model(model_prob, model_path, signature=model_signature)

CPU times: user 1.5 s, sys: 230 ms, total: 1.73 s
Wall time: 6.91 s


#### Generated files

In [34]:
!find "$model_path"

./data/model/
./data/model/python_env.yaml
./data/model/model.pkl
./data/model/conda.yaml
./data/model/MLmodel
./data/model/requirements.txt


### Predictions

In [35]:
np_y_pred_proba = model_prob.predict(df_X)

In [36]:
np_y_pred_proba

array([0.32588145, 0.4276271 , 0.4159582 , ..., 0.31557882, 0.49494374,
       0.53514314], dtype=float32)

In [37]:
np.save(dir_out + 'y_pred_proba', np_y_pred_proba)

In [38]:
np.load(dir_out + 'y_pred_proba.npy')

array([0.32588145, 0.4276271 , 0.4159582 , ..., 0.31557882, 0.49494374,
       0.53514314], dtype=float32)

# Serving

## Stop

In [56]:
port_server = '5677'

In [57]:
mask = ':' + port_server
! pkill -f "$mask"

[2024-07-23 21:03:37 +0000] [3419] [INFO] Worker exiting (pid: 3419)
[2024-07-23 21:03:37 +0000] [3418] [INFO] Handling signal: term
[2024-07-23 21:03:39 +0000] [3418] [INFO] Shutting down: Master


## Start

In [58]:
ip_host = '0.0.0.0'
shell_command =  'nohup mlflow models serve -m '
shell_command += model_path + ' -p ' + port_server + ' -h ' + ip_host + ' --no-conda &'
print(shell_command)

nohup mlflow models serve -m ./data/model/ -p 5677 -h 0.0.0.0 --no-conda &


In [59]:
get_ipython().system_raw(shell_command)          # runs model API in background

## Checking that the server is running

Each server runs 2 processes (a master and a worker, as the shutdown logs show)

In [61]:
! ps aux | grep "scoring_server" | grep -v "grep" | awk '{print $2, $15, $19}'

3471 0.0.0.0:5677 mlflow.pyfunc.scoring_server.wsgi:app
3472 0.0.0.0:5677 mlflow.pyfunc.scoring_server.wsgi:app


[2024-07-23 21:08:11 +0000] [3471] [INFO] Handling signal: term
[2024-07-23 21:08:11 +0000] [3472] [INFO] Worker exiting (pid: 3472)
[2024-07-23 21:08:12 +0000] [3471] [INFO] Shutting down: Master
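
Listing processes shows that the server was launched, but not that it accepts connections. As a complementary check, the sketch below tests whether a TCP port is open, using only the standard library. The helper `is_port_open` is hypothetical and the host and port are the ones used in this notebook; an open port does not by itself prove that the model loaded correctly.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False

# Port of the model API in this notebook
print(is_port_open('localhost', 5677))
```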


# Tests

In [49]:
url_api = 'localhost:' + port_server + '/invocations'
print('URL API    -> http://' + url_api)

URL API    -> http://localhost:5677/invocations


## Predictions

A POST request asks the API to predict the target for a sample of `nb_observations` observations

In [51]:
nb_observations = 1

### True positive (TP) case

In [52]:
path_TP = dir_in + 'X_TP.csv'
df_TP = pd.read_csv(path_TP)

In [53]:
df_TP_sample = df_TP.sample(nb_observations)
df_TP_sample

Unnamed: 0,CODE_GENDER_M,EXT_SOURCE_3,EXT_SOURCE_2,NAME_EDUCATION_TYPE_Secondary_or_secondary_special,NAME_EDUCATION_TYPE_Higher_education,NAME_CONTRACT_TYPE_Cash_loans,NAME_INCOME_TYPE_Working
2011,0,0.598926,0.146104,1,0,1,1


In [54]:
def get_curl_command(df_sample, url) :
    str_features_values = df_sample.to_json(orient='split')
    str_data = '\'{"dataframe_split": ' + str_features_values + '}\' '
    return 'curl -d' + str_data + '''-H 'Content-Type: application/json' -X POST ''' + url

Check that this Linux command line returns a class "1" prediction, i.e. a normalized score above 0.5

In [55]:
shell_command = get_curl_command(df_TP_sample, url_api)
print(shell_command)

curl -d'{"dataframe_split": {"columns":["CODE_GENDER_M","EXT_SOURCE_3","EXT_SOURCE_2","NAME_EDUCATION_TYPE_Secondary_or_secondary_special","NAME_EDUCATION_TYPE_Higher_education","NAME_CONTRACT_TYPE_Cash_loans","NAME_INCOME_TYPE_Working"],"index":[2011],"data":[[0,0.5989262183,0.1461036398,1,0,1,1]]}}' -H 'Content-Type: application/json' -X POST localhost:5677/invocations


In [56]:
get_ipython().system_raw(shell_command)  

{"predictions": [0.5351431369781494]}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   330  100    37  100   293    305   2420 --:--:-- --:--:-- --:--:--  2727
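
The same request can also be sent from Python without going through curl. The sketch below is one possible way to do it with the standard library only: `build_payload` and `post_invocations` are hypothetical helpers, while the `dataframe_split` body and the `/invocations` URL are the ones used by the curl command above. The actual call is left commented out because it requires the MLflow server to be running.

```python
import json
import urllib.request

def build_payload(columns, index, rows):
    """Return the JSON body in the 'dataframe_split' format used above."""
    return json.dumps({'dataframe_split': {'columns': columns, 'index': index, 'data': rows}})

def post_invocations(url, payload, timeout=10):
    """POST the payload to the scoring endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=payload.encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))

columns = ['CODE_GENDER_M', 'EXT_SOURCE_3', 'EXT_SOURCE_2',
           'NAME_EDUCATION_TYPE_Secondary_or_secondary_special',
           'NAME_EDUCATION_TYPE_Higher_education',
           'NAME_CONTRACT_TYPE_Cash_loans', 'NAME_INCOME_TYPE_Working']
# Same observation as in the curl command above
payload = build_payload(columns, [2011], [[0, 0.5989262183, 0.1461036398, 1, 0, 1, 1]])
print(payload)

# Requires the model server to be running:
# reply = post_invocations('http://localhost:5677/invocations', payload)
# print(reply['predictions'])
```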


### True negative (TN) case

In [151]:
path_TN = dir_in + 'X_TN.csv'
df_TN = pd.read_csv(path_TN)

In [152]:
df_TN_sample = df_TN.sample(nb_observations)
df_TN_sample

Unnamed: 0,CODE_GENDER_M,EXT_SOURCE_3,EXT_SOURCE_2,NAME_EDUCATION_TYPE_Secondary_or_secondary_special,NAME_EDUCATION_TYPE_Higher_education,NAME_CONTRACT_TYPE_Cash_loans,NAME_INCOME_TYPE_Working
28726,0,0.513694,0.671199,1,0,1,1
9356,0,0.463275,0.715923,1,0,1,1
24251,0,0.656158,0.758948,0,1,1,0
31176,1,0.707699,0.367036,1,0,1,1
26053,0,0.432962,0.614909,0,1,1,0
30159,0,0.654529,0.744395,1,0,1,0
13343,0,0.588488,0.491529,0,1,0,1
26266,0,0.77641,0.425121,0,1,1,0
354,0,0.661024,0.476217,1,0,1,1
8663,0,0.665855,0.598748,1,0,1,1


Check that this Linux command line returns class "0" predictions, i.e. normalized scores below 0.5

In [153]:
shell_command = get_curl_command(df_TN_sample, url_api)
print(shell_command)

curl -d'{"dataframe_split": {"columns":["CODE_GENDER_M","EXT_SOURCE_3","EXT_SOURCE_2","NAME_EDUCATION_TYPE_Secondary_or_secondary_special","NAME_EDUCATION_TYPE_Higher_education","NAME_CONTRACT_TYPE_Cash_loans","NAME_INCOME_TYPE_Working"],"index":[28726,9356,24251,31176,26053,30159,13343,26266,354,8663],"data":[[0,0.5136937663,0.6711988652,1,0,1,1],[0,0.4632753281,0.7159232202,1,0,1,1],[0,0.656158373,0.7589476174,0,1,1,0],[1,0.7076993447,0.367035797,1,0,1,1],[0,0.4329616671,0.6149092475,0,1,1,0],[0,0.6545292802,0.7443950327,1,0,1,0],[0,0.5884877883,0.491529006,0,1,0,1],[0,0.7764098512,0.4251207017,0,1,1,0],[0,0.6610235391,0.4762169811,1,0,1,1],[0,0.665854922,0.5987476984,1,0,1,1]]}}' -H 'Content-Type: application/json' -X POST localhost:5677/invocations


In [154]:
get_ipython().system_raw(shell_command)  

{"predictions": [0.4005722999572754, 0.419644296169281, 0.2925635278224945, 0.4535232186317444, 0.39870113134384155, 0.3309217691421509, 0.33661720156669617, 0.3511064052581787, 0.4159581959247589, 0.40274977684020996]}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   901  100   219  100   682   2546   7930 --:--:-- --:--:-- --:--:-- 10476
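
Since the API returns normalized scores rather than class labels, the two checks in this section amount to comparing each score with 0.5, the value the threshold is mapped to. A minimal sketch of that comparison on the two replies shown above; `to_classes` is a hypothetical helper, and counting a score exactly equal to 0.5 as class 1 is an assumption.

```python
import json

def to_classes(reply_text, cutoff=0.5):
    """Convert the normalized scores of an API reply into class labels."""
    predictions = json.loads(reply_text)['predictions']
    # Assumption: a score equal to the cutoff counts as class 1
    return [1 if score >= cutoff else 0 for score in predictions]

# Replies copied from the TP and TN requests above
reply_tp = '{"predictions": [0.5351431369781494]}'
reply_tn = ('{"predictions": [0.4005722999572754, 0.419644296169281, '
            '0.2925635278224945, 0.4535232186317444, 0.39870113134384155, '
            '0.3309217691421509, 0.33661720156669617, 0.3511064052581787, '
            '0.4159581959247589, 0.40274977684020996]}')

print(to_classes(reply_tp))  # [1]
print(to_classes(reply_tn))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```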


# Feature Importance

# End of processing

In [225]:
assert(False) # prevents the execution of following cells

AssertionError: 

# Appendix

## Virtual machine on Google Cloud

### Creation

|Action|Region|Machine type|Firewall|
|--|--|--|--|
|**Create VM**|Europe-Paris|E2-micro|Allow HTTP & HTTPS|

#### Bash commands
<code>
sudo apt install git
git clone https://github.com/JeanRosselVallee/project_8
cd project_8
export PATH="/usr/bin:$PATH"
sudo apt-get install python3-pip
sudo apt install python3.11-venv
python3 -m venv ./my_env
source ./my_env/bin/activate
pip3 install jupyter
jupyter notebook --version
./shl/launch_jupyter.sh
</code>

In [3]:
! cat ~/project_8/shl/launch_jupyter.sh

source ~/project_8/my_env/bin/activate
nohup jupyter notebook --no-browser  --ip=0.0.0.0 --port=5555 &
sleep 1
jupyter notebook list
echo "Process & Listening Port :"
ps aux | grep "jupy" | grep -v "grep" | awk '{print $2, $15, $19}'
ss -tuln | grep 5555



### Stop

<code>
pkill -f ":5677"
jupyter notebook stop 5555
</code>

G-Cloud : VM > Stop

### Restart

- G-Cloud : VM > Start
- <code>./launch_mlflow.sh</code>

In [4]:
! cat ~/project_8/shl/launch_mlflow.sh

source ~/project_8/my_env/bin/activate
pkill -f ":5677"
sleep 3
nohup mlflow models serve -m ~/project_8/data/model/ -p 5677 -h 0.0.0.0 --no-conda &
sleep 6
ps aux | grep "scoring_server" | grep -v "grep" | awk '{print $2, $15, $19}'


## Jupyter

### Stop

In [45]:
# ! jupyter notebook stop 5555

### Restart

- Get the external IP address
    - VM > Details > "Network Interfaces" > External IP address
    - http://4.233.201.217:5555/tree?token=
- <code>./launch_jupyter.sh</code>
- Notebook "server_mlflow"
    - Sections
        - "Initialisation" : model_path
        - "Mise en service"
- Test from [Web App](https://project8-dashboard.streamlit.app/)