# Machine learning Workshop (1. ELT and EDA)

## <u>Table of contents</u>

### 1. ELT and EDA

1. Import Python essential modules and dataset
2. Preliminary data and data understanding
3. Preparing data before use in a model

### 2. Modeling

1. Common function hyperparameters
2. Common model hyperparameter tuning
3. Import Python essential modules and dataset, and prepare data
4. Training model (1st attempt)
5. Error analysis
6. Training model (2nd attempt)
7. Save model

### 3. Inference

1. Import Python essential modules and dataset
2. Prepare data as was done for the training data
3. Load Model
4. Predict with prepared data
5. Deploy with Gradio

---

## <u>Contents</u>

## 1. Data science workflow

Microsoft created a workflow diagram for data science called `TDSP` (Team Data Science Process).

<img src="https://drive.google.com/uc?id=1LPCQ_9ASzFd0RPzDUsGsldWAFK33mghA" style="height:400px"/>

Reference: https://harshvardhan.blog/data-science-process-frameworks

## 2. Import Python essential modules and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
from itertools import product

In [None]:
import warnings
warnings.filterwarnings("ignore")

The company has shared its annual car insurance data. Your task is to uncover real customer behavior from the data. <br><br>
Ref: https://www.kaggle.com/datasets/sagnik1511/car-insurance-data/data

In [None]:
data = pd.read_csv("train_data.csv")

The first task of data science is <b>data understanding</b>.

---

## 2. Preliminary data and data understanding

Understand the meaning and behavior of each column.

In [None]:
data.columns

In [None]:
data.head()

In this example, we can summarize the data as follows:

<table align="left">
    <tr>
        <th>No</th>
        <th>Column names</th>
        <th>Type</th>
        <th>Data unique</th>
        <th>Remark</th>
    </tr>
    <tr>
        <td>1</td>
        <td>ID</td>
        <td>Object</td>
        <td>-</td>
        <td>-</td>
    </tr>
    <tr>
        <td>2</td>
        <td>AGE</td>
        <td>Object (Category)</td>
        <td>['16-25', '26-39', '40-64', '65+']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>3</td>
        <td>GENDER</td>
        <td>Object (Category)</td>
        <td>['male', 'female']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>4</td>
        <td>DRIVING_EXPERIENCE</td>
        <td>Object (Category)</td>
        <td>['0-9y', '10-19y', '20-29y', '30y+']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>5</td>
        <td>EDUCATION</td>
        <td>Object (Category)</td>
        <td>['none', 'high school', 'university']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>6</td>
        <td>INCOME</td>
        <td>Object (Category)</td>
        <td>['poverty', 'working class', 'middle class', 'upper class']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>7</td>
        <td>CREDIT_SCORE</td>
        <td>Float</td>
        <td>0.-1.</td>
        <td>-</td>
    </tr>
    <tr>
        <td>8</td>
        <td>VEHICLE_YEAR</td>
        <td>Object (Category)</td>
        <td>['before 2015', 'after 2015']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>9</td>
        <td>MARRIED</td>
        <td>Bool</td>
        <td>[True, False]</td>
        <td>-</td>
    </tr>
    <tr>
        <td>10</td>
        <td>CHILDREN</td>
        <td>Bool</td>
        <td>[True, False]</td>
        <td>-</td>
    </tr>
    <tr>
        <td>11</td>
        <td>POSTAL_CODE</td>
        <td>Object</td>
        <td>-</td>
        <td>-</td>
    </tr>
    <tr>
        <td>12</td>
        <td>ANNUAL_MILEAGE</td>
        <td>Float</td>
        <td>0-inf</td>
        <td>-</td>
    </tr>
    <tr>
        <td>13</td>
        <td>VEHICLE_TYPE</td>
        <td>Object (Category)</td>
        <td>['sedan', 'sports car']</td>
        <td>-</td>
    </tr>
    <tr>
        <td>14</td>
        <td>SPEEDING_VIOLATIONS</td>
        <td>Int</td>
        <td>0-inf</td>
        <td>Counting number</td>
    </tr>
    <tr>
        <td>15</td>
        <td>DUIS</td>
        <td>Int</td>
        <td>0-inf</td>
        <td>Counting number <br> DUIS = Driving under the influence</td>
    </tr>
    <tr>
        <td>16</td>
        <td>PAST_ACCIDENTS</td>
        <td>Int</td>
        <td>0-inf</td>
        <td>Counting number</td>
    </tr>
    <tr>
        <td>17</td>
        <td>OUTCOME</td>
        <td>Bool, Target</td>
        <td>[True, False]</td>
        <td>True: claimed, False: not claimed</td>
    </tr>
</table>

Check the missing values and data type of each column.

In [None]:
data.info()
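
`data.info()` reports non-null counts, so missing values have to be inferred by comparing them with the row count. A more direct view is a per-column summary of missing values. The sketch below uses a small hypothetical frame (`toy`, not the workshop dataset) and a hypothetical name `missing_summary`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the workshop dataset.
toy = pd.DataFrame(
    {
        "MARRIED": [1.0, np.nan, 0.0, 1.0],
        "ANNUAL_MILEAGE": [12000.0, 9000.0, np.nan, np.nan],
        "DUIS": [0, 1, 0, 2],
    }
)

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": toy.isna().mean() * 100,
    }
)
print(missing_summary)
```

The same two expressions should work unchanged on `data`.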

<b>Key finding </b> <br>
1. `MARRIED` and `ANNUAL_MILEAGE` columns have missing values.
2. `ID`, `MARRIED`, `CHILDREN`, `POSTAL_CODE`, and `OUTCOME` columns have the wrong data type.

We change `ID`, `MARRIED`, `CHILDREN`, `POSTAL_CODE`, and `OUTCOME` to the correct data types.

In [None]:
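data["ID"] = data["ID"].astype(object)
# MARRIED has missing values. Plain `bool` would turn NaN into True,
# so the nullable "boolean" dtype is used to keep them as <NA>.
data["MARRIED"] = data["MARRIED"].astype("boolean")
data["CHILDREN"] = data["CHILDREN"].astype(bool)
data["POSTAL_CODE"] = data["POSTAL_CODE"].astype(object)
data["OUTCOME"] = data["OUTCOME"].astype(bool)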
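
One caveat when casting a column that contains missing values: `astype(bool)` treats NaN as truthy, so a missing value silently becomes `True`. The sketch below, on a hypothetical toy Series named `flag`, contrasts the plain `bool` dtype with Pandas' nullable `"boolean"` dtype, which keeps the missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flag column with one missing value.
flag = pd.Series([1.0, 0.0, np.nan])

# Plain bool: NaN is truthy, so the missing value becomes True.
as_plain_bool = flag.astype(bool)

# Nullable boolean: the missing value is kept as <NA>.
as_nullable_bool = flag.astype("boolean")

print(as_plain_bool.tolist())
print(as_nullable_bool.isna().sum())
```

Which dtype is preferable depends on whether the missing values should still be visible in later steps.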

### 2.1. Investigating numerical columns

We investigate the numerical columns via descriptive statistics.

In [None]:
numerical_col = ["CREDIT_SCORE", "ANNUAL_MILEAGE", "SPEEDING_VIOLATIONS", "DUIS", "PAST_ACCIDENTS"]

In [None]:
data[numerical_col].describe()

<b>Key finding </b> <br>
The `ANNUAL_MILEAGE` and `CREDIT_SCORE` columns contain out-of-range values.

We need to drill down into the data in the `ANNUAL_MILEAGE` and `CREDIT_SCORE` columns. <br>
We can investigate a column in the following ways:
- Filtering for the out-of-range data
- Visualizing the distribution of the column of interest, for example with a boxplot or histogram.

In [None]:
# Drill down into the `CREDIT_SCORE` column by filtering the data
data.loc[data["CREDIT_SCORE"] > 1, :]

In [None]:
# Example of boxplot
# sns.boxplot(y=data["CREDIT_SCORE"], hue=data["OUTCOME"], hue_order=[True, False])
# plt.show()

This may have been a calculation error. <br>
We will solve this problem by changing the values to NaN.

In [None]:
data.loc[data["CREDIT_SCORE"] > 1, "CREDIT_SCORE"] = np.nan
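
The filter above only catches values greater than 1. A slightly more general sketch flags anything outside an expected range and replaces it with NaN. The frame `toy` and the mask name `out_of_range` are hypothetical, and the 0 to 1 range is taken from the column summary table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one value outside the 0-1 range.
toy = pd.DataFrame({"CREDIT_SCORE": [0.35, 0.80, 7.2, 0.55]})

# Flag values outside the expected range (bounds are inclusive).
out_of_range = ~toy["CREDIT_SCORE"].between(0, 1)

# Replace the flagged values with NaN and keep the rest unchanged.
toy["CREDIT_SCORE"] = toy["CREDIT_SCORE"].mask(out_of_range, np.nan)
print(toy)
```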

---

### Exercise 1

How can we investigate the problem in the `ANNUAL_MILEAGE` column and resolve it?

In [None]:
# Write the code for exercise 1-1 here.


---

We check the other numerical columns by counting frequencies. <br>
Here, we show only the `PAST_ACCIDENTS` column.

We can count frequencies in the following ways:
- The `value_counts` method in the Pandas library
- Visualizing the frequencies in the column of interest, for example with a bar plot.

In [None]:
# Checking frequencies by visualizing them with a count plot
sns.countplot(
    x=data["PAST_ACCIDENTS"],
    # `range` excludes its stop value, so add 1 to include the maximum
    order=range(data["PAST_ACCIDENTS"].min(), data["PAST_ACCIDENTS"].max() + 1)
)
plt.show()

In [None]:
# Example of the "value_counts" method in the Pandas library
# data["PAST_ACCIDENTS"].value_counts().sort_index()

Nothing looks wrong in the `PAST_ACCIDENTS` column.

We can create a binned column from any numerical column. <br>
It can be created automatically with the `cut` and `qcut` functions in Pandas. <br>
In this example, however, we create it from the `PAST_ACCIDENTS` column with a custom function.

An example case:
- "Never": [0]
- "Rarely": [1,2]
- "Often": [3,4,5,6,7,8,9,10,11,12,13,14]

In [None]:
def accident_binning(row):
    past_accident = row["PAST_ACCIDENTS"]
    if past_accident in [0]:
        return "Never"
    elif past_accident in [1,2]:
        return "Rarely"
    else:
        return "Often"

In [None]:
data["FREQUENT_ACCIDENT"] = data.apply(accident_binning, axis=1)

In [None]:
sns.countplot(x=data["FREQUENT_ACCIDENT"], hue=data["OUTCOME"], order=["Never", "Rarely", "Often"])
plt.show()
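
The same three bins could also be produced with `pd.cut`, which avoids a row-wise `apply`. This is a minimal sketch on a hypothetical toy Series. The bin edges are an assumption chosen to match the custom function above:

```python
import pandas as pd

# Hypothetical accident counts.
past_accidents = pd.Series([0, 1, 2, 3, 14], name="PAST_ACCIDENTS")

# Right-closed intervals: (-1, 0] -> Never, (0, 2] -> Rarely, (2, inf] -> Often.
frequent_accident = pd.cut(
    past_accidents,
    bins=[-1, 0, 2, float("inf")],
    labels=["Never", "Rarely", "Often"],
)
print(frequent_accident.tolist())
```

`pd.cut` returns an ordered categorical, so plots and group-bys keep the "Never", "Rarely", "Often" order without an explicit `order` argument.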

### 2.2. Investigating categorical columns

As above, we investigate the categorical columns by using descriptive statistics.

In [None]:
categorical_col = ["AGE", "GENDER", "DRIVING_EXPERIENCE", "EDUCATION", "INCOME", "VEHICLE_YEAR", "MARRIED", "CHILDREN", "VEHICLE_TYPE", "OUTCOME"]

In [None]:
data[categorical_col].describe()

<b>Key finding </b> <br>
- The `GENDER` column has too many distinct values.
- Except for the boolean columns, every column has one extra distinct value.

We need to drill down into the data.

We can check the unique values in the following ways:
- The `unique` method in the Pandas library
- The `value_counts` method in the Pandas library
- Visualizing the frequencies in the column of interest, for example with a bar plot.

We start by drilling down into the `GENDER` column.

In [None]:
# Checking unique values by visualizing their frequencies
sns.countplot(x=data["GENDER"])
plt.show()

In [None]:
# Example of "unique" method in the Pandas library
# data["GENDER"].unique()

In [None]:
# Example of "value_counts" method in the Pandas library
# data["GENDER"].value_counts()

We find that the letter case in the `GENDER` column is inconsistent and that there is an abnormal value (0). <br>
We check the rows with the abnormal value.

In [None]:
data[data["GENDER"] == "0"]

This appears to be an erroneous row in the data. We will remove it.

In [None]:
# `.copy()` makes the filtered frame independent, so later column assignments are safe
data = data[data["GENDER"] != "0"].copy()

Then, we fix the letter case in the `GENDER` column by converting it to lower case.

In [None]:
data["GENDER"] = data["GENDER"].str.lower()

We check the frequency counts in the categorical columns. <br>
Here, we show only the `VEHICLE_TYPE` column.

In [None]:
vehicle_count = data[["VEHICLE_TYPE", "OUTCOME"]].value_counts()
vehicle_count

We can convert the counts to percentages of the total within each category.

In [None]:
(vehicle_count / vehicle_count.groupby(level=0).transform('sum')) * 100
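
As an alternative sketch, `pd.crosstab` with `normalize="index"` gives the same within-category percentages as a two-dimensional table. The frame `toy` and the name `outcome_pct` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: 4 sedans (1 claim) and 2 sports cars (1 claim).
toy = pd.DataFrame(
    {
        "VEHICLE_TYPE": ["sedan", "sedan", "sedan", "sedan", "sports car", "sports car"],
        "OUTCOME": [True, False, False, False, True, False],
    }
)

# Row-wise percentages: each VEHICLE_TYPE row sums to 100.
outcome_pct = pd.crosstab(toy["VEHICLE_TYPE"], toy["OUTCOME"], normalize="index") * 100
print(outcome_pct)
```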

---

## 3. Preparing data before use in a model

We delete the `ID` and `POSTAL_CODE` columns, which are not relevant to the target column.

In [None]:
data = data.drop(["ID", "POSTAL_CODE"], axis=1)

Then, we drop duplicated rows because most ML models assume that observations are independent.

In [None]:
data = data.drop_duplicates()

Some models, such as decision trees, random forests, and XGBoost, can handle training with missing values. <br>
However, others, like linear regression, logistic regression, and support vector machines, require a dataset without missing values.
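
This notebook leaves the missing values in place. If a model that requires complete data were used, one common option (an assumption here, not a step in this workshop) is simple imputation: the median for numerical columns and the most frequent value for categorical columns. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame(
    {
        "ANNUAL_MILEAGE": [12000.0, np.nan, 9000.0, 15000.0],
        "EDUCATION": ["university", "high school", None, "university"],
    }
)

imputed = toy.copy()
# Numerical column: fill with the median of the observed values.
imputed["ANNUAL_MILEAGE"] = imputed["ANNUAL_MILEAGE"].fillna(
    imputed["ANNUAL_MILEAGE"].median()
)
# Categorical column: fill with the most frequent observed value.
imputed["EDUCATION"] = imputed["EDUCATION"].fillna(imputed["EDUCATION"].mode().iloc[0])
print(imputed)
```

In a real pipeline, the fill values would normally be computed on the training split only and then reused at inference time.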

Next, we look at the relationship between 2 variables. <br>
The type of each column must be considered because it affects which method is used.

<table align="left">
    <tr>
        <th></th>
        <th>Categorical column</th>
        <th>Numerical column</th>
    </tr>
    <tr>
        <td>Categorical column</td>
        <td>Chi-Square Test</td>
        <td>One-way ANOVA test</td>
    </tr>
    <tr>
        <td>Numerical column</td>
        <td>One-way ANOVA test</td>
        <td>Pearson correlation <br> Spearman's rank correlation</td>
    </tr>
</table>

### 3.1. Numerical - Numerical

We can calculate the Pearson correlation to measure the linear relationship between 2 numerical variables.

There are 5 cases for interpreting the Pearson correlation value. <br>

<table align="left">
    <tr>
        <th>|r| (absolute correlation)</th>
        <th>Meaning</th>
    </tr>
    <tr>
        <td>0 to 0.05</td>
        <td>No correlation</td>
    </tr>
    <tr>
        <td>0.05 to 0.20</td>
        <td>Weak correlation</td>
    </tr>
    <tr>
        <td>0.20 to 0.70</td>
        <td>Medium correlation</td>
    </tr>
    <tr>
        <td>0.70 to 0.95</td>
        <td>Strong correlation</td>
    </tr>
    <tr>
        <td>0.95 to 1.00</td>
        <td>Columns can be used interchangeably.</td>
    </tr>
</table>

The sign of the coefficient indicates the direction of the relationship.

In [None]:
numerical_col = ["CREDIT_SCORE", "ANNUAL_MILEAGE", "SPEEDING_VIOLATIONS", "DUIS", "PAST_ACCIDENTS"]

In [None]:
# `method` can be "pearson" (default), "kendall", or "spearman"
data[numerical_col].corr()
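
The `method` argument of `corr` switches between coefficients. Pearson measures linear association, while Spearman works on ranks and therefore captures any monotonic relationship. The sketch below uses a hypothetical toy frame in which `y` grows monotonically but not linearly with `x`:

```python
import pandas as pd

# Hypothetical toy data: y is a monotonic, non-linear function of x.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here Spearman is 1 because the ranks agree perfectly, while Pearson is lower because the relationship is not a straight line.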

### 3.2. Categorical - Categorical

It does not make sense to use the Pearson correlation to measure the relationship between categorical variables. <br>
Instead, we use the Chi-Square test.

In [None]:
categorical_col_1 = ["AGE", "GENDER", "DRIVING_EXPERIENCE", "EDUCATION", "INCOME", "VEHICLE_YEAR", "MARRIED", "CHILDREN", "VEHICLE_TYPE", "FREQUENT_ACCIDENT"]
categorical_col_2 = ["OUTCOME"]

In [None]:
cat_var_prod = list(product(categorical_col_1, categorical_col_2, repeat = 1))

In [None]:
result = []
for col_name in cat_var_prod:
    if col_name[0] != col_name[1]:
        result.append(
            (
                col_name[0],
                col_name[1],
                list(ss.chi2_contingency(pd.crosstab(data[col_name[0]], data[col_name[1]])))[1]
            )
        )

In [None]:
# The stored values are p-values, so the column is named accordingly
chi_test_output = pd.DataFrame(result, columns = ["var1", "var2", "p_value"])
chi_test_output.pivot(index="var1", columns="var2", values="p_value")

All categorical columns except `VEHICLE_TYPE` have a statistically significant relationship with the `OUTCOME` column (p-value ≤ 0.05).
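
To make the loop above easier to follow, here is the test for a single pair of columns on a hypothetical toy frame. `chi2_contingency` returns four values: the test statistic, the p-value, the degrees of freedom, and the expected counts under independence. The loop keeps only the p-value.

```python
import pandas as pd
import scipy.stats as ss

# Hypothetical toy data: older vehicles claim more often than newer ones.
toy = pd.DataFrame(
    {
        "VEHICLE_YEAR": ["before 2015"] * 50 + ["after 2015"] * 50,
        "OUTCOME": [True] * 30 + [False] * 20 + [True] * 10 + [False] * 40,
    }
)

# Observed counts for every combination of the two categories.
contingency = pd.crosstab(toy["VEHICLE_YEAR"], toy["OUTCOME"])

chi2, p_value, dof, expected = ss.chi2_contingency(contingency)
print(contingency)
print(f"chi2={chi2:.2f}, p-value={p_value:.5f}, dof={dof}")
```

A p-value at or below 0.05 is read as evidence that the two columns are not independent.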

### 3.3. Numerical - Categorical

We can use a one-way ANOVA test to measure the relationship between a numerical and a categorical variable.

In [None]:
numerical_col = ["CREDIT_SCORE", "ANNUAL_MILEAGE", "SPEEDING_VIOLATIONS", "DUIS", "PAST_ACCIDENTS"]
categorical_col = ["OUTCOME"]

In [None]:
var_prod = list(product(numerical_col, categorical_col, repeat = 1))

In [None]:
result = []
for col_name in var_prod:
    unique_cat = data[col_name[1]].unique()
    numerical_list = []
    for unique in unique_cat:
        numerical_list.append(data.loc[data[col_name[1]] == unique, col_name[0]].dropna().to_list())

    result.append(
            (
                col_name[0],
                col_name[1],
                ss.f_oneway(*numerical_list).pvalue
            )
        )

In [None]:
# The stored values are ANOVA p-values
anova_test_output = pd.DataFrame(result, columns = ["var1", "var2", "p_value"])
anova_test_output.pivot(index="var1", columns="var2", values="p_value")

All numerical columns have a statistically significant relationship with the `OUTCOME` column (p-value ≤ 0.05).

Based on the relationship tests, we remove the `VEHICLE_TYPE` column from the data.
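
For a single pair of columns, the same test can be sketched with `groupby`, which builds one array of numerical values per category. The toy frame and the name `groups` are hypothetical:

```python
import pandas as pd
import scipy.stats as ss

# Hypothetical toy data: claimants have lower credit scores.
toy = pd.DataFrame(
    {
        "CREDIT_SCORE": [0.30, 0.35, 0.40, 0.32, 0.70, 0.75, 0.65, 0.72],
        "OUTCOME": [True, True, True, True, False, False, False, False],
    }
)

# One array of CREDIT_SCORE values per OUTCOME category, without missing values.
groups = [grp.dropna().to_numpy() for _, grp in toy.groupby("OUTCOME")["CREDIT_SCORE"]]

f_stat, p_value = ss.f_oneway(*groups)
print(f"F={f_stat:.2f}, p-value={p_value:.6f}")
```

With only two categories, as with `OUTCOME`, one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances.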

In [None]:
data = data.drop(["VEHICLE_TYPE"], axis=1)

Finally, we save the dataset before using it for model training.

In [None]:
data.to_csv("data_prepared.csv", index=False)

---
---