# Machine learning Workshop (3. Inference)

## <u>Table of contents</u>

### 1. ELT and EDA

1. Import Python essential modules and dataset
2. Preliminary data and data understanding
3. Preparing data before use in the model

### 2. Modeling

1. Common function hyperparameters
2. Common model hyperparameter tuning
3. Import Python essential modules and dataset, and prepare data
4. Training model (1st attempt)
5. Error analysis
6. Training model (2nd attempt)
7. Save model

### 3. Inference

1. Import Python essential modules and dataset
2. Prepare data the same way as the training data
3. Load Model
4. Predict with prepared data
5. Deploy with Gradio

---

## <u>Contents</u>

After training the model, we use it to make predictions and then deploy it with the Gradio library.

## 1. Import Python essential modules and dataset

In [None]:
# ! pip install gradio

In [None]:
import os
import pandas as pd
import gradio as gr

from joblib import load

In [None]:
data_inference = pd.read_csv("dataset/test_data.csv")

In [None]:
data_inference.head()

In [None]:
data_inference.shape

## 2. Prepare data the same way as the training data

We must apply the same preparation steps that we applied to the training data. <br>
<b> *** Except for operations that remove rows, such as `drop`, `drop_duplicates`, or `filter`, so that every input row still receives a prediction. *** </b>

- Change the data type of the columns.
- Replace `CREDIT_SCORE` values greater than 1 with NaN, and set `ANNUAL_MILEAGE` values lower than 0 to 0.
- Categorize (bin) the `PAST_ACCIDENTS`, `DUIS`, and `SPEEDING_VIOLATIONS` features.
- Remove the columns that are not relevant.
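
Row-removing operations are excluded because every row in the inference file needs a prediction, so the prepared table must keep the same number of rows as the raw file. A minimal sketch of such a check, using a toy DataFrame and a hypothetical helper `rows_preserved` (neither is part of the workshop code):

```python
import pandas as pd


def rows_preserved(before: pd.DataFrame, after: pd.DataFrame) -> bool:
    """Return True when a preparation step kept every row (hypothetical helper)."""
    return len(before) == len(after)


# Toy data with made-up values, not the workshop dataset.
raw = pd.DataFrame({"ID": [1, 2, 3], "ANNUAL_MILEAGE": [12000.0, 12000.0, -5.0]})

# A value-replacing step keeps all rows.
replaced = raw.copy()
replaced.loc[replaced["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0

# A row-removing step (here drop_duplicates on one column) loses a row.
deduplicated = raw.drop_duplicates(subset=["ANNUAL_MILEAGE"])

print(rows_preserved(raw, replaced))
print(rows_preserved(raw, deduplicated))
```

Running a check like this after each preparation step is one way to catch an accidental row removal before predicting.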

First, we change the data type of the columns.

In [None]:
data_inference["MARRIED"] = data_inference["MARRIED"].astype(bool)
data_inference["CHILDREN"] = data_inference["CHILDREN"].astype(bool)

Then, we replace `CREDIT_SCORE` values greater than 1 with NaN, and set `ANNUAL_MILEAGE` values lower than 0 to 0.

In [None]:
data_inference.loc[data_inference["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = pd.NA

In [None]:
data_inference.loc[data_inference["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0
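
To see what these two replacements do, the sketch below applies the same boolean-mask pattern and the same conditions as the cells above to a toy DataFrame (made-up values, not the workshop data). It uses `numpy.nan` as the missing-value marker, which is an assumption made here to keep the column a plain float column.

```python
import numpy as np
import pandas as pd

# Toy rows: one out-of-range credit score and one negative mileage.
toy = pd.DataFrame({
    "CREDIT_SCORE": [0.35, 1.70, 0.90],
    "ANNUAL_MILEAGE": [11000.0, -200.0, 0.0],
})

# Rows matching the mask get the new value; all other rows are untouched.
toy.loc[toy["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = np.nan
toy.loc[toy["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0

print(toy)
```

Both steps change values only, so the number of rows stays the same.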

Moreover, we categorize the `PAST_ACCIDENTS`, `DUIS`, and `SPEEDING_VIOLATIONS` features.

In [None]:
def accident_binning(row):
    past_accident = row["PAST_ACCIDENTS"] 
    if past_accident in [0]:
        return "Never"
    elif past_accident in [1,2]:
        return "Rarely"
    else:
        return "Often"

In [None]:
def duis_binning(row):
    duis = row["DUIS"] 
    if duis in [0]:
        return "Never"
    else:
        return "Used to"

In [None]:
def speed_binning(row):
    speed = row["SPEEDING_VIOLATIONS"] 
    if speed in [0]:
        return "Never"
    elif speed in [1,2,3,4,5]:
        return "Rarely"
    else:
        return "Often"

In [None]:
data_inference["FREQUENT_ACCIDENT"] = data_inference.apply(accident_binning, axis=1)

In [None]:
data_inference["USED_TO_DUIS"] = data_inference.apply(duis_binning, axis=1)

In [None]:
data_inference["FREQUENT_SPEED_VIOLATIONS"] = data_inference.apply(speed_binning, axis=1)
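
The three binning functions share one pattern: 0 maps to one label, a fixed set of values maps to a second label, and everything else maps to a third. As an optional alternative to row-wise `apply`, the sketch below expresses that pattern with `numpy.select` on a whole column. The helper name `bin_counts` and the toy data are hypothetical and not part of the workshop code.

```python
import numpy as np
import pandas as pd


def bin_counts(counts: pd.Series, rarely_values: list) -> np.ndarray:
    """Label 0 as "Never", values in rarely_values as "Rarely", all else as "Often"."""
    conditions = [counts == 0, counts.isin(rarely_values)]
    labels = ["Never", "Rarely"]
    # The first matching condition wins; unmatched rows get the default label.
    return np.select(conditions, labels, default="Often")


# Toy column with made-up values.
toy = pd.DataFrame({"PAST_ACCIDENTS": [0, 1, 2, 3, 7]})
toy["FREQUENT_ACCIDENT"] = bin_counts(toy["PAST_ACCIDENTS"], [1, 2])

print(toy["FREQUENT_ACCIDENT"].tolist())
```

If you switch to a vectorized version, use it in both the training and inference notebooks so that the two preparations stay identical.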

Finally, we remove the columns that are not relevant.

In [None]:
# Before removing the ID column, save the IDs so they can be mapped to the predictions
id_test = data_inference["ID"].tolist()

In [None]:
data_inference = data_inference.drop(["ID", "POSTAL_CODE"], axis=1)

In [None]:
# preview test input
data_inference.head()

## 3. Load Model

In [None]:
model = load('model/best_model.joblib')

In [None]:
model

## 4. Predict with prepared data

The expected format of the prediction output is shown in the `submission_template.csv` file.

<img src="./image/submission_template.png" style="height:300px"/>

In [None]:
y_predict = model.predict(data_inference)
y_predict

After obtaining the predictions, we put them in a DataFrame together with the saved IDs.

In [None]:
df_submission = pd.DataFrame({
    "ID": id_test,
    "OUTCOME": y_predict
})
df_submission

Finally, we save the submission table to a CSV file.

In [None]:
df_submission.to_csv("dataset/submission_1.csv", index=False)
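
Before sharing the file, it can help to check that the table has the expected shape. The sketch below is one possible check, assuming the template expects exactly the columns `ID` and `OUTCOME`, one row per input row, and unique IDs. The helper `validate_submission` is hypothetical, and the uniqueness rule is an assumption rather than something stated in the workshop.

```python
import pandas as pd


def validate_submission(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    if list(df.columns) != ["ID", "OUTCOME"]:
        return ["columns must be exactly ID, OUTCOME"]

    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["ID"].duplicated().any():
        problems.append("duplicate IDs found")  # assumption: IDs are unique
    if df["OUTCOME"].isna().any():
        problems.append("missing predictions found")
    return problems


# Toy submissions with made-up values.
good = pd.DataFrame({"ID": [10, 11, 12], "OUTCOME": [0, 1, 0]})
bad = pd.DataFrame({"ID": [10, 10], "OUTCOME": [0, 1]})

print(validate_submission(good, expected_rows=3))
print(validate_submission(bad, expected_rows=3))
```

In the notebook, the call would look like `validate_submission(df_submission, expected_rows=len(id_test))`.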

## 5. Deploy with Gradio

In this example, we mainly use `gradio.Interface`. <br><br>
For more information, see the Gradio documentation: <br>
https://www.gradio.app/docs/gradio/introduction

With `gr.Interface`, you simply combine 3 ingredients:
- `fn` (a Python function)
- `inputs` (input component)
- `outputs` (output component)

First, we wrap the preprocessing code in a function.

In [None]:
def preprocess_data(path):
    data = pd.read_csv(path)

    # Cast the flag columns to boolean, as in the training preparation.
    data["MARRIED"] = data["MARRIED"].astype(bool)
    data["CHILDREN"] = data["CHILDREN"].astype(bool)

    # Replace out-of-range values without removing any rows.
    data.loc[data["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = pd.NA
    data.loc[data["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0

    # Bin the count features into categories.
    data["FREQUENT_ACCIDENT"] = data.apply(accident_binning, axis=1)
    data["USED_TO_DUIS"] = data.apply(duis_binning, axis=1)
    data["FREQUENT_SPEED_VIOLATIONS"] = data.apply(speed_binning, axis=1)

    # Keep the IDs of this file before dropping the column.
    ids = data["ID"].tolist()
    data = data.drop(["ID", "POSTAL_CODE"], axis=1)

    return ids, data

In [None]:
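def predict_model(ids, data):
    # Drop the target column if the uploaded file contains it.
    if "OUTCOME" in data.columns:
        data = data.drop(["OUTCOME"], axis=1)

    # Predict on the rows passed to this function and pair them with their IDs.
    y_predict = model.predict(data)
    df_submission = pd.DataFrame({
        "ID": ids,
        "OUTCOME": y_predict
    })

    return df_submission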

Then, we create the inference function for the Gradio interface.

In [None]:
def inference(path):
    ids, df_preprocessed = preprocess_data(path)
    df_submission = predict_model(ids, df_preprocessed)
    df_submission.to_csv("output.csv", index=False)
    
    return "output.csv", df_submission
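
Depending on the installed Gradio version and the file component's settings, the value passed to `inference` may be a path string or a temporary-file object that exposes its path as `.name`. This is an assumption worth checking against the Gradio documentation for your version. The sketch below shows a hypothetical helper, `resolve_upload_path`, that accepts either form. It is not part of the workshop code.

```python
from pathlib import Path


def resolve_upload_path(uploaded) -> str:
    """Return a filesystem path from a path-like value or an object with `.name`."""
    if isinstance(uploaded, (str, Path)):
        return str(uploaded)

    name = getattr(uploaded, "name", None)
    if name is None:
        raise TypeError(f"cannot get a file path from {type(uploaded).__name__}")
    return str(name)


class FakeUpload:
    """Stand-in for a temporary-file object, used only for this illustration."""
    name = "uploads/test_data.csv"


print(resolve_upload_path("dataset/test_data.csv"))
print(resolve_upload_path(FakeUpload()))
```

With such a helper, the first line of `inference` could call `preprocess_data(resolve_upload_path(path))`.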

Finally, we build the Gradio interface.

In [None]:
demo = gr.Interface(
    fn=inference,

    inputs="file",

    outputs=["file", "dataframe"],

    title="Upload the CSV file to obtain the prediction values.",

    description="This deployment uses an XGBoost model."
)

In [None]:
demo.launch()

---