# Model Prediction Workflow

Model prediction workflows depend on well-structured, clean, and modular code, so all of the code from the earlier experimental phase (credit_default_risk_experiment.ipynb) has been reorganized and broken down into sub-functions.

These functions are grouped into modules (.py files). In each step, only a main function, which is typically a sequence of sub-function calls, is invoked to carry out the task.

In addition, a scikit-learn (SKLearn) Pipeline is used at this stage to chain the transformers and estimators, such as MinMaxScaler and OneHotEncoder, so that they are applied to the data sequentially in a single call.
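The chaining idea can be illustrated with a minimal sketch. This is not the project's actual pipeline: the toy columns, the `ColumnTransformer` routing, and the `LogisticRegression` estimator are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data; the real project uses its own feature set.
X = pd.DataFrame({
    "income": [20.0, 35.0, 50.0, 80.0, 120.0, 150.0],
    "contract_type": ["cash", "revolving", "cash", "cash", "revolving", "cash"],
})
y = [1, 1, 0, 0, 0, 1]

# Route numeric and categorical columns to different transformers.
preprocessor = ColumnTransformer([
    ("scale", MinMaxScaler(), ["income"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
])

# Chain preprocessing and the estimator (an assumed LogisticRegression)
# so that one fit/predict call runs every stage in order.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])

model.fit(X, y)
probabilities = model.predict_proba(X)[:, 1]  # probability of the positive class
```

Because the transformers are fitted inside the Pipeline, the same fitted object can later be applied to unseen data without repeating each preprocessing call by hand.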

The restructured code is contained in .py files organized as follows:

```
ROOT
├── src
│   ├── utils.py
│   ├── data_preprocessing.py
│   ├── model_training.py
│   └── model_evaluation.py
├── pipeline_train.py **
└── pipeline_inference.py **
```

where the .py files in the <code>src</code> folder contain the sub-functions, and the files outside that folder (marked with **) are the main files that execute the Training and Inference Pipelines.

In [1]:
import os, sys
dir_project_root = os.path.dirname(os.getcwd())
sys.path.append(dir_project_root)

# Training Pipeline

As mentioned above, the Training Pipeline is designed as a main function that calls sub-functions from the modules inside the <code>src</code> folder.


The Training Pipeline consists of 4 steps:
 - Data Preprocessing step - preprocess raw data into a format suitable for model training
   <p><em>function : <code>preprocess_step()</code></em></p>
 - Model Training step - train the SKLearn Pipeline object using the data from the previous step, together with the configurations found during the experiment phase
   <p><em>function : <code>model_training_step()</code></em></p>
 - Model Evaluation step - run predictions on the hold-out test data and evaluate the performance (AUC score)
   <p><em>function : <code>model_evaluation_step()</code></em></p>
 - Inference Config step - record the latest version of the trained model as a reference for the Inference Pipeline to use when making predictions
   <p><em>function : <code>inference_config()</code></em></p>
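The sequence of steps listed above can be sketched as a single main function. The step names come from this notebook, but every signature, return value, and config key below is a hypothetical stand-in; the real steps load data, fit the Pipeline, and compute the AUC.

```python
import json
from datetime import datetime
from pathlib import Path
from tempfile import TemporaryDirectory


def preprocess_step(run_id):
    # Stand-in: the real step would load raw data and return train/test sets.
    return {"train": [0.2, 0.8, 0.6], "test": [0.4, 0.9]}


def model_training_step(data, run_id):
    # Stand-in: the real step would fit the scikit-learn Pipeline.
    return {"name": f"model_{run_id}"}


def model_evaluation_step(model, data):
    # Stand-in: the real step would score the hold-out data and return the AUC.
    return None


def inference_config(model, run_id, config_path):
    # Record which trained model the Inference Pipeline should load.
    # The keys used here are assumptions, not the project's actual schema.
    config = {"run_id": run_id, "model_name": model["name"]}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config


def main(config_path):
    # One shared timestamp ties together every artifact produced by a run.
    run_id = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    data = preprocess_step(run_id)
    model = model_training_step(data, run_id)
    model_evaluation_step(model, data)
    return inference_config(model, run_id, config_path)


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "config" / "inference" / "inference_config.json"
    written = main(path)
    reloaded = json.loads(path.read_text())
```

Writing the run identifier to a small JSON file is one simple way to let a separate inference script find the artifacts that belong to the latest training run.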

Logging is also included to record the progress of the pipeline execution along with important information, such as model performance.
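As a rough sketch, a logger that produces lines shaped like the output shown in this notebook (timestamp, logger name, level, function and line number, message) could be configured as follows. The format string is inferred from that output, and `build_logger` is a hypothetical helper, not a function from `src`.

```python
import io
import logging

# Format inferred from the log lines in this notebook (an assumption).
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(funcName)s:%(lineno)d - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Attach a handler only once; adding one on every call would emit
    # each message once per attached handler.
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


def main(logger):
    logger.info("Start training pipeline")


buffer = io.StringIO()
log = build_logger("pipeline_demo", buffer)
build_logger("pipeline_demo", buffer)  # a second call adds no extra handler
main(log)
log_output = buffer.getvalue()
```

The `%(funcName)s:%(lineno)d` fields are what make it possible to trace each message back to the step that emitted it.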

To execute the pipeline, run the file directly with the magic command in the next cell. The whole training process then runs automatically and creates the artifacts in the directory.

<em><u>NOTE</u> : To test the execution, please run the file only from this Jupyter notebook using the command below. The logic that locates the project root directory is not yet robust, so running the file directly from elsewhere might cause an execution error.</em>
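One common way to make a root-directory lookup independent of the current working directory is to walk upwards from a known location until a marker folder is found. The sketch below is not the project's current logic; `find_project_root` and the choice of `src` as the marker are assumptions.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def find_project_root(start, marker="src"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# Demonstration on a throwaway directory tree.
with TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebook" / "drafts"
    nested.mkdir(parents=True)
    found_from_nested = find_project_root(nested)
    found_from_root = find_project_root(root)
    same_root = found_from_nested == root and found_from_root == root
```

Inside a script, the starting point would typically be `Path(__file__).parent`, which does not change with the directory the script is launched from.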

In [2]:
%run {dir_project_root}\pipeline_train.py

2024-02-08 02:24:10 - __main__ - INFO - main:287 - Start training pipeline
2024-02-08 02:24:10 - __main__ - DEBUG - preprocess_step:65 - STEP PROCESSING
2024-02-08 02:26:32 - __main__ - DEBUG - preprocess_step:114 - Training features shape --> (276759, 64), Training target shape --> (276759,)
2024-02-08 02:26:32 - __main__ - DEBUG - preprocess_step:115 - Evaluation features shape --> (30752, 64), Evaluation target shape --> (30752,)
2024-02-08 02:26:41 - __main__ - DEBUG - save_csv_file:153 - Csv file saved at c:\Users\11413929\repos\int_ass\data\production\process_data\train\features_train_2024_02_08_02_26_32.csv
2024-02-08 02:26:42 - __main__ - DEBUG - save_csv_file:153 - Csv file saved at c:\Users\11413929\repos\int_ass\data\production\process_data\test\features_test_2024_02_08_02_26_32.csv
2024-02-08 02:26:42 - __main__ - DEBUG - save_ndarray:189 - Numpy array file saved at c:\Users\11413929\repos\int_ass\data\production\process_data\train\target_train_2024_02_08_02_26_32.npy
2024-

# Inference Pipeline

The Inference Pipeline is also designed as a main function that calls sub-functions from the modules inside the <code>src</code> folder.


The Inference Pipeline consists of 3 steps:
 - Read Config step - read the configs previously recorded by the Training Pipeline
   <p><em>function : <code>read_training_cfg()</code></em></p>
 - Data Preprocessing step - preprocess raw data into the same format that was used for model training, so that predictions can be made
   <p><em>function : <code>preprocess_step()</code></em></p>
 - Model Inference step - predict on the unseen dataset, producing both scores (probabilities) and labels based on a given probability threshold
   <p><em>function : <code>inference_step()</code></em></p>
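The relationship between the config, the scores, and the labels described above can be sketched as follows. The function names match this notebook, but their signatures, the `threshold` key, and the value 0.4 are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def read_training_cfg(config_path):
    # Load whatever the Training Pipeline recorded for inference.
    return json.loads(Path(config_path).read_text())


def inference_step(probabilities, threshold):
    # A record is labeled 1 when its score reaches the threshold, otherwise 0.
    return [{"prob": p, "pred": int(p >= threshold)} for p in probabilities]


with TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "inference_config.json"
    # Hypothetical config content; the real file may use different keys.
    config_path.write_text(json.dumps({"threshold": 0.4}))
    cfg = read_training_cfg(config_path)

predictions = inference_step([0.35, 0.62, 0.19, 0.43], cfg["threshold"])
```

Keeping the threshold in the config rather than in the code means the labels can be re-derived from the stored scores if the threshold is later changed.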

Logging is also included to record the progress of the pipeline execution along with important information.

To execute the pipeline, run the file directly with the magic command in the next cell. The whole inference process then runs automatically and creates the prediction file in the directory.

<em><u>NOTE</u> : To test the execution, please run the file only from this Jupyter notebook using the command below. The logic that locates the project root directory is not yet robust, so running the file directly from elsewhere might cause an execution error.</em>

In [3]:
%run {dir_project_root}\pipeline_inference.py

2024-02-08 02:27:50 - __main__ - INFO - main:153 - Start inference pipeline
2024-02-08 02:27:50 - __main__ - INFO - main:153 - Start inference pipeline
2024-02-08 02:27:50 - __main__ - DEBUG - load_dict:182 - Dict file loaded from c:\Users\11413929\repos\int_ass\config\inference\inference_config.json
2024-02-08 02:27:50 - __main__ - DEBUG - load_dict:182 - Dict file loaded from c:\Users\11413929\repos\int_ass\config\inference\inference_config.json
2024-02-08 02:27:50 - __main__ - DEBUG - preprocess_step:58 - STEP PROCESSING
2024-02-08 02:27:50 - __main__ - DEBUG - preprocess_step:58 - STEP PROCESSING


2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos\int_ass\data\production\artifacts\list_col_high_corr_2024_02_08_02_26_32.sav
2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos\int_ass\data\production\artifacts\list_col_high_corr_2024_02_08_02_26_32.sav
2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos\int_ass\data\production\artifacts\list_col_heavy_nan_2024_02_08_02_26_32.sav
2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos\int_ass\data\production\artifacts\list_col_heavy_nan_2024_02_08_02_26_32.sav
2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos\int_ass\data\production\artifacts\list_col_nominal_2024_02_08_02_26_32.sav
2024-02-08 02:28:17 - __main__ - DEBUG - load_list:169 - List file loaded from c:\Users\11413929\repos

After the execution has completed, the prediction results can be inspected using the code below.

In [7]:
import os

import pandas as pd
from src.utils import get_latest_file_dir, DIR_INFERENCE_OUTPUT

# Locate the most recent prediction file in the inference output directory
last_prediction_file_name = get_latest_file_dir(DIR_INFERENCE_OUTPUT, file_type='.csv')
last_prediction_file_path = os.path.join(DIR_INFERENCE_OUTPUT, last_prediction_file_name)
# Load the predictions, using the first column as the index
df_prediction = pd.read_csv(last_prediction_file_path, index_col=0)
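The implementation of `get_latest_file_dir` is not shown in this notebook. As an illustration only, a helper with similar behavior could rely on the timestamp suffix in the file names; `latest_file_name` below is a hypothetical function and assumes that all candidate files share the same prefix.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def latest_file_name(directory, file_type=".csv"):
    names = [
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == file_type
    ]
    if not names:
        raise FileNotFoundError(f"No '{file_type}' files in {directory}")
    # Zero-padded YYYY_MM_DD_HH_MM_SS suffixes sort chronologically as plain
    # strings, so the maximum name is the newest file (same prefix assumed).
    return max(names)


# Demonstration with empty placeholder files.
with TemporaryDirectory() as tmp:
    for stamp in ("2024_02_08_02_28_47", "2024_02_07_23_59_59", "2024_01_31_10_00_00"):
        (Path(tmp) / f"prediction_{stamp}.csv").write_text("")
    (Path(tmp) / "notes.txt").write_text("")
    newest = latest_file_name(tmp)
```

Sorting by name avoids depending on file modification times, which can change when files are copied or checked out.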

In [9]:
print("Last prediction directory: {}".format(last_prediction_file_path))
df_prediction.head(6)

Last prediction directory: c:\Users\11413929\repos\int_ass\data\production\inference_output\prediction_2024_02_08_02_28_47.csv


| SK_ID_CURR | pred | prob |
|---|---|---|
| 100001 | 0 | 0.354832 |
| 100005 | 1 | 0.62249 |
| 100013 | 0 | 0.191359 |
| 100028 | 1 | 0.425296 |
| 100038 | 1 | 0.633031 |
| 100042 | 0 | 0.234196 |
