# Machine Learning Workshop (3. Inference)

## <u>Table of contents</u>

### 1. ELT and EDA

1. Import Python essential modules and dataset
2. Preliminary data and data understanding
3. Preparing data before use in the model

### 2. Modeling

1. Common function hyperparameters
2. Common model hyperparameter tuning
3. Import Python essential modules and dataset, and prepare data
4. Training model (1st attempt)
5. Error analysis
6. Training model (2nd attempt)
7. Save model

### 3. Inference

1. Import Python essential modules and dataset
2. Prepare the data the same way as the training data
3. Load Model
4. Predict with prepared data
5. Deploy with Gradio

---

## <u>Contents</u>

After training the model, we use it to make predictions and then deploy it with the Gradio library.

## 1. Import Python essential modules and dataset

In [1]:
# ! pip install gradio


In [2]:
import os
import pandas as pd
import gradio as gr

from joblib import load

In [3]:
data_inference = pd.read_csv("test_data.csv")

In [4]:
data_inference.head()

|  | ID | AGE | GENDER | DRIVING_EXPERIENCE | EDUCATION | INCOME | CREDIT_SCORE | VEHICLE_YEAR | MARRIED | CHILDREN | POSTAL_CODE | ANNUAL_MILEAGE | VEHICLE_TYPE | SPEEDING_VIOLATIONS | DUIS | PAST_ACCIDENTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 710627 | 26-39 | male | 10-19y | university | middle class | 0.539283 | before 2015 | 0 | 0 | 10238 | 18000.0 | sedan | 1 | 0 | 0 |
| 1 | 591696 | 65+ | male | 20-29y | university | upper class | 0.636651 | after 2015 | 0 | 1 | 10238 | NaN | sedan | 3 | 0 | 1 |
| 2 | 858911 | 16-25 | female | 0-9y | high school | working class | 0.371904 | before 2015 | 0 | 0 | 10238 | 14000.0 | sedan | 0 | 0 | 0 |
| 3 | 674516 | 16-25 | female | 0-9y | high school | working class | NaN | before 2015 | 1 | 0 | 10238 | NaN | sedan | 0 | 0 | 0 |
| 4 | 535845 | 65+ | male | 20-29y | high school | upper class | 0.70275 | before 2015 | 1 | 1 | 32765 | 9000.0 | sedan | 8 | 1 | 1 |


In [5]:
data_inference.shape

(61, 16)

## 2. Prepare the data the same way as the training data

We must apply the same preprocessing steps that we applied to the training data. <br>
<b> *** The exception is any operation that removes rows, such as `drop`, `drop_duplicates`, `filter`, because every row in the test data needs a prediction. *** </b>

- Change the data type of the columns.
- Replace `CREDIT_SCORE` values greater than 1 with NaN, and set `ANNUAL_MILEAGE` values lower than 0 to 0.
- Categorize (bin) the `PAST_ACCIDENTS`, `DUIS`, and `SPEEDING_VIOLATIONS` features.
- Remove the columns that are not relevant.
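
One way to confirm that the preparation keeps every row is to compare the row count before and after. The sketch below mirrors the cleaning rules used in the code cells of this section, but it runs on a small made-up DataFrame, and `prepare` is a hypothetical helper rather than part of the workshop code.

```python
import pandas as pd

# Toy stand-in for the inference data (values are made up for illustration).
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "CREDIT_SCORE": [0.5, 1.7, None],
    "ANNUAL_MILEAGE": [12000.0, -5.0, None],
})

def prepare(df):
    """Hypothetical helper: clean invalid values instead of dropping rows."""
    out = df.copy()
    out.loc[out["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = float("nan")
    out.loc[out["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0
    return out

prepared = prepare(raw)

# Every input row must still be present so that each ID gets a prediction.
assert len(prepared) == len(raw), "a row-removing step was applied"
print(prepared)
```

Invalid values are overwritten rather than removed, so the IDs saved later still line up one-to-one with the predictions.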

First, we change the data type of the columns.

In [6]:
data_inference["MARRIED"] = data_inference["MARRIED"].astype(bool)
data_inference["CHILDREN"] = data_inference["CHILDREN"].astype(bool)

Then, we replace `CREDIT_SCORE` values greater than 1 with NaN, and set `ANNUAL_MILEAGE` values lower than 0 to 0.

In [7]:
data_inference.loc[data_inference["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = pd.NA

In [8]:
data_inference.loc[data_inference["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0

Moreover, we categorize the `PAST_ACCIDENTS`, `DUIS`, and `SPEEDING_VIOLATIONS` features.

In [9]:
def accident_binning(row):
    past_accident = row["PAST_ACCIDENTS"]
    if past_accident in [0]:
        return "Never"
    elif past_accident in [1,2]:
        return "Rarely"
    else:
        return "Often"

In [10]:
def duis_binning(row):
    duis = row["DUIS"]
    if duis in [0]:
        return "Never"
    else:
        return "Used to"

In [11]:
def speed_binning(row):
    speed = row["SPEEDING_VIOLATIONS"]
    if speed in [0]:
        return "Never"
    elif speed in [1,2,3,4,5]:
        return "Rarely"
    else:
        return "Often"

In [12]:
data_inference["FREQUENT_ACCIDENT"] = data_inference.apply(accident_binning, axis=1)

In [13]:
data_inference["USED_TO_DUIS"] = data_inference.apply(duis_binning, axis=1)

In [14]:
data_inference["FREQUENT_SPEED_VIOLATIONS"] = data_inference.apply(speed_binning, axis=1)
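
The row-wise functions above could also be written as a vectorized binning with `pd.cut`. This is only a sketch on made-up values, not the method used to train the model, and it differs in one respect: missing or negative counts become NaN here, whereas the row-wise functions send them to the final `else` branch.

```python
import numpy as np
import pandas as pd

# Toy values (made up) covering each bin of the speeding-violation rule.
df = pd.DataFrame({"SPEEDING_VIOLATIONS": [0, 1, 5, 6, 12]})

# Right-closed bins: (-1, 0] -> Never, (0, 5] -> Rarely, (5, inf] -> Often
df["FREQUENT_SPEED_VIOLATIONS"] = pd.cut(
    df["SPEEDING_VIOLATIONS"],
    bins=[-1, 0, 5, np.inf],
    labels=["Never", "Rarely", "Often"],
).astype(object)  # plain strings, like the output of the row-wise functions

print(df)
```

If this variant were used for inference, the same binning would also have to be used when preparing the training data, so that the model sees identical categories.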

Finally, we remove the columns that are not relevant.

In [15]:
# Before removing the ID column, save the IDs so they can be mapped to the predictions
id_test = data_inference["ID"].tolist()

In [16]:
data_inference = data_inference.drop(["ID", "POSTAL_CODE"], axis=1)

In [17]:
# preview test input
data_inference.head()

|  | AGE | GENDER | DRIVING_EXPERIENCE | EDUCATION | INCOME | CREDIT_SCORE | VEHICLE_YEAR | MARRIED | CHILDREN | ANNUAL_MILEAGE | VEHICLE_TYPE | SPEEDING_VIOLATIONS | DUIS | PAST_ACCIDENTS | FREQUENT_ACCIDENT | USED_TO_DUIS | FREQUENT_SPEED_VIOLATIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26-39 | male | 10-19y | university | middle class | 0.539283 | before 2015 | False | False | 18000.0 | sedan | 1 | 0 | 0 | Never | Never | Rarely |
| 1 | 65+ | male | 20-29y | university | upper class | 0.636651 | after 2015 | False | True | NaN | sedan | 3 | 0 | 1 | Rarely | Never | Rarely |
| 2 | 16-25 | female | 0-9y | high school | working class | 0.371904 | before 2015 | False | False | 14000.0 | sedan | 0 | 0 | 0 | Never | Never | Never |
| 3 | 16-25 | female | 0-9y | high school | working class | NaN | before 2015 | True | False | NaN | sedan | 0 | 0 | 0 | Never | Never | Never |
| 4 | 65+ | male | 20-29y | high school | upper class | 0.70275 | before 2015 | True | True | 9000.0 | sedan | 8 | 1 | 1 | Rarely | Used to | Often |


## 3. Load Model

In [19]:
model = load('best_model.joblib')

In [20]:
model

## 4. Predict with prepared data

The expected output format is shown in the `submission_template.csv` file.

<img src="https://drive.google.com/uc?id=1oEoHmbi8qRYA9WYRrFkiSxrmD3jn4Ztx" style="height:300px"/>

In [21]:
y_predict = model.predict(data_inference)
y_predict

array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1])
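
Before calling `predict`, it can help to check that the prepared columns match what the model expects. The sketch below uses a hypothetical helper `check_columns` and a made-up `expected` list. The real list depends on how the model was trained. If the saved model is a scikit-learn estimator fitted on a DataFrame, its `feature_names_in_` attribute may provide that list, but that is an assumption about `best_model.joblib`.

```python
import pandas as pd

def check_columns(df, expected):
    """Return (missing, extra) column names relative to the expected list."""
    missing = [c for c in expected if c not in df.columns]
    extra = [c for c in df.columns if c not in expected]
    return missing, extra

# Hypothetical expected features; the real list depends on the trained model.
expected = ["AGE", "GENDER", "CREDIT_SCORE"]

# Toy frame (made-up values) with one missing and one unexpected column.
df = pd.DataFrame({"AGE": ["26-39"], "GENDER": ["male"], "ID": [710627]})

missing, extra = check_columns(df, expected)
print("missing:", missing)
print("extra:", extra)
```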

After obtaining the prediction data, we will create a DataFrame with it.

In [22]:
df_submission = pd.DataFrame({
    "ID": id_test,
    "OUTCOME": y_predict
})
df_submission

|  | ID | OUTCOME |
|---|---|---|
| 0 | 710627 | 1 |
| 1 | 591696 | 0 |
| 2 | 858911 | 1 |
| 3 | 674516 | 1 |
| 4 | 535845 | 0 |
| ... | ... | ... |
| 56 | 608297 | 0 |
| 57 | 247809 | 0 |
| 58 | 62005 | 1 |
| 59 | 83170 | 1 |


Finally, we save the submission table to a CSV file.

In [23]:
df_submission.to_csv("submission_1.csv", index=False)
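
A few sanity checks on the submission table can catch mistakes before the file is handed in. The sketch below assumes the template requires exactly the `ID` and `OUTCOME` columns with binary outcomes. `validate_submission` is a hypothetical helper, and the toy rows reuse the first three predictions shown above.

```python
import pandas as pd

def validate_submission(df, n_expected):
    """Raise ValueError if the submission table looks malformed."""
    if list(df.columns) != ["ID", "OUTCOME"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != n_expected:
        raise ValueError(f"expected {n_expected} rows, got {len(df)}")
    if df["ID"].duplicated().any():
        raise ValueError("duplicate IDs found")
    if not df["OUTCOME"].isin([0, 1]).all():
        raise ValueError("OUTCOME must contain only 0 or 1")

# Toy submission with three rows.
toy = pd.DataFrame({"ID": [710627, 591696, 858911], "OUTCOME": [1, 0, 1]})
validate_submission(toy, n_expected=3)
print("submission looks valid")
```

For the real table, `n_expected` would be the number of rows in the test data.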

## 5. Deploy with Gradio

In this example, we mainly use `gradio.Interface`. <br><br>
For more information, see the Gradio documentation: <br>
https://www.gradio.app/docs/gradio/introduction

With `gr.Interface`, you simply combine 3 ingredients:
- `fn` (a Python function)
- `inputs` (input component)
- `outputs` (output component)
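
The three ingredients can be seen in a minimal sketch that is unrelated to the model. `shout` and `build_demo` are illustrative names, and Gradio is imported inside `build_demo` only so that the plain function can be tried without starting a UI.

```python
def shout(text):
    """fn: a plain Python function that maps the input to the output."""
    return text.upper()

def build_demo():
    """Wire the three ingredients together into an Interface."""
    import gradio as gr

    return gr.Interface(
        fn=shout,        # the function to call
        inputs="text",   # input component: a text box
        outputs="text",  # output component: a text box
    )

# To start the app: build_demo().launch()
print(shout("hello gradio"))
```

The string shortcuts `"text"`, `"file"`, and `"dataframe"` select Gradio's built-in components, which is the style used for the model interface later in this section.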

First, we wrap the preprocessing code in a function.

In [24]:
def preprocess_data(path):
    data = pd.read_csv(path)

    data["MARRIED"] = data["MARRIED"].astype(bool)
    data["CHILDREN"] = data["CHILDREN"].astype(bool)

    data.loc[data["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = pd.NA
    data.loc[data["ANNUAL_MILEAGE"] < 0, "ANNUAL_MILEAGE"] = 0

    data["FREQUENT_ACCIDENT"] = data.apply(accident_binning, axis=1)
    data["USED_TO_DUIS"] = data.apply(duis_binning, axis=1)
    data["FREQUENT_SPEED_VIOLATIONS"] = data.apply(speed_binning, axis=1)

    # Keep the IDs from this file so they can be attached to the predictions
    ids = data["ID"].tolist()
    data = data.drop(["ID", "POSTAL_CODE"], axis=1)

    return ids, data

In [25]:
def predict_model(ids, data):
    if "OUTCOME" in data.columns:
        data = data.drop(["OUTCOME"], axis=1)

    # Predict on the data passed in, not on the global data_inference
    y_predict = model.predict(data)
    df_submission = pd.DataFrame({
        "ID": ids,
        "OUTCOME": y_predict
    })

    return df_submission

Then, we create the inference function for the Gradio interface.

In [26]:
def inference(path):
    # `ids` avoids shadowing the built-in `id`
    ids, df_preprocessed = preprocess_data(path)
    df_submission = predict_model(ids, df_preprocessed)
    df_submission.to_csv("output.csv", index=False)

    return "output.csv", df_submission

Finally, we build the Gradio interface.

In [27]:
demo = gr.Interface(
    fn=inference,

    inputs="file",

    outputs=["file", "dataframe"],

    title="Upload the CSV file to obtain the prediction values.",

    description="This deployment uses an XGBoost model"
)

In [28]:
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://681733f929f5d0b9b9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




---