# Machine learning Workshop (2. Modeling)

## <u>Table of contents</u>

### 1. ELT and EDA

1. Import Python essential modules and dataset
2. Preliminary data and data understanding
3. Preparing data before use in model

### 2. Modeling

1. Commonly used function hyperparameters
2. Commonly used model hyperparameters
3. Import Python essential modules and dataset, and prepare data
4. Training model (1st attempt)
5. Error analysis
6. Training model (2nd attempt)
7. Save model

### 3. Inference

1. Import Python essential modules and dataset
2. Prepare data to match the training data
3. Load Model
4. Predict with prepared data
5. Deploy with Gradio

---

## <u>Contents</u>

## 1. Commonly used function hyperparameters

<table align="left">
    <tr>
        <th>Function</th>
        <th>Type</th>
        <th>Hyperparameter</th>
    </tr>
    <tr>
        <td>One Hot Encoder</td>
        <td>Preprocessing</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Ordinal Encoder</td>
        <td>Preprocessing</td>
        <td>-</td>
    </tr>
    <tr>
        <td>PolynomialFeatures</td>
        <td>Preprocessing</td>
        <td>degree</td>
    </tr>
    <tr>
        <td>Standard Scaler</td>
        <td>Normalization</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Power Transformer</td>
        <td>Normalization</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Pipeline</td>
        <td>Pipeline</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Train test split</td>
        <td>Model selection</td>
        <td>shuffle, stratify</td>
    </tr>
    <tr>
        <td>Cross-validator types <br> (ex. Stratified K-Folds, Time Series)</td>
        <td>Model selection</td>
        <td>n_splits, (or test_size)</td>
    </tr>
    <tr>
        <td>Grid Search CV</td>
        <td>Model selection</td>
        <td>param_grid, scoring, cv</td>
    </tr>
    <tr>
        <td>Randomized Search CV</td>
        <td>Model selection</td>
        <td>param_distributions, n_iter, scoring, cv</td>
    </tr>
    <tr>
        <td>et cetera ~</td>
        <td>~</td>
        <td>~</td>
    </tr>
</table>

---

## 2. Commonly used model hyperparameters

<table align="left">
    <tr>
        <th>Model</th>
        <th>Type</th>
        <th>Normalization</th>
        <th>Hyperparameter</th>
    </tr>
    <tr>
        <td>Linear Regression</td>
        <td>Linear</td>
        <td>Yes</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Logistic Regression</td>
        <td>Linear</td>
        <td>Yes</td>
        <td>penalty, C</td>
    </tr>
    <tr>
        <td>Ridge Regression (L2)</td>
        <td>Linear</td>
        <td>Yes</td>
        <td>alpha</td>
    </tr>
    <tr>
        <td>Lasso Regression (L1)</td>
        <td>Linear</td>
        <td>Yes</td>
        <td>alpha</td>
    </tr>
    <tr>
        <td>Elastic Net Regression <br>(combine L2 and L1)</td>
        <td>Linear</td>
        <td>Yes</td>
        <td>alpha, l1_ratio</td>
    </tr>
    <tr>
        <td>Naïve Bayes</td>
        <td>Naïve Bayes</td>
        <td>Not necessary</td>
        <td>-</td>
    </tr>
    <tr>
        <td>K-Nearest Neighbors</td>
        <td>Nearest Neighbors</td>
        <td>Yes</td>
        <td>n_neighbors</td>
    </tr>
    <tr>
        <td>Support Vector Machine</td>
        <td>Support Vector Machine</td>
        <td>Yes</td>
        <td>C, gamma, epsilon (regression)</td>
    </tr>
    <tr>
        <td>Decision Tree</td>
        <td>Tree</td>
        <td>Not necessary</td>
        <td>max_depth, min_samples_leaf</td>
    </tr>
    <tr>
        <td>Random Forest</td>
        <td>Ensemble (Tree)</td>
        <td>Not necessary</td>
        <td>n_estimators, max_depth, min_samples_leaf</td>
    </tr>
    <tr>
        <td>AdaBoost</td>
        <td>Ensemble (Tree)</td>
        <td>Not necessary</td>
        <td>n_estimators, learning_rate</td>
    </tr>
    <tr>
        <td>XGBoost<br>(adapted from gradient boosting)</td>
        <td>Ensemble (Tree)</td>
        <td>Not necessary</td>
        <td>eta, gamma, max_depth, lambda, objective,<br>rate_drop (dart), skip_drop (dart)</td>
    </tr>
    <tr>
        <td>LightGBM<br>(adapted from gradient boosting)</td>
        <td>Ensemble (Tree)</td>
        <td>Not necessary</td>
        <td>learning_rate, objective,<br>max_depth, min_data_in_leaf, lambda_l2,<br>drop_rate (dart), skip_drop (dart)</td>
    </tr>
    <tr>
        <td>CatBoost<br>(adapted from gradient boosting)</td>
        <td>Ensemble (Tree)</td>
        <td>Not necessary</td>
        <td>learning_rate, loss_function, eval_metric, <br>depth, min_data_in_leaf</td>
    </tr>
    <tr>
        <td>et cetera ~</td>
        <td>~</td>
        <td>~</td>
        <td>~</td>
    </tr>
</table>
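
The "Normalization" column matters most for models that rely on distances or on penalized coefficients. The sketch below uses four hypothetical rows (a credit score between 0 and 1 and an annual mileage in the thousands) to show how the nearest neighbor of a point can change once both columns are min-max scaled. The numbers are made up for illustration and do not come from the workshop dataset.

```python
import numpy as np

# Hypothetical rows: [credit score in 0..1, annual mileage]
points = np.array([
    [0.20, 10000.0],  # a (query point)
    [0.90, 10200.0],  # b
    [0.22, 11000.0],  # c
    [0.50, 16000.0],  # d
])

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(distances)) + 1

# Min-max scaling maps every column to the range [0, 1]
scaled = (points - points.min(axis=0)) / (points.max(axis=0) - points.min(axis=0))

nearest_raw = nearest_to_first(points)     # mileage dominates the distance
nearest_scaled = nearest_to_first(scaled)  # both features contribute
print(nearest_raw, nearest_scaled)
```

Without scaling, the mileage column decides almost everything. After scaling, the very different credit score of row `b` is taken into account.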

---

## 3. Import Python essential modules and dataset, and prepare data

In [None]:
# ! pip install xgboost

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
from sklearn.inspection import permutation_importance

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

In [None]:
from joblib import dump, load

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("data_prepared.csv")

In [None]:
data.shape

First, we split the data into training and test sets before using it to train a model.

In [None]:
X = data.drop(["OUTCOME"], axis=1)
y = data["OUTCOME"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=42, stratify=y)
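
The following sketch, on a synthetic and imbalanced target (an assumption, not the workshop data), illustrates two details of the call above: an integer `test_size` is an absolute number of rows, and `stratify` keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 800 negatives, 200 positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 800 + [1] * 200)

# An integer test_size is an absolute number of rows, not a fraction
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=200, random_state=42, stratify=y_demo
)

print(len(y_te))    # number of test rows
print(y_te.mean())  # share of positives in the test set
print(y_tr.mean())  # share of positives in the training set
```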

---

## 4. Training model (1st attempt)

We will create 3 models:

1. Logistic regression
2. K-Nearest Neighbors
3. XGBoost

In [None]:
def confusion_report_formatting(y, report):
    # Label rows with the actual classes and columns with the predicted classes
    labels = np.sort(y.unique())
    confusion = pd.DataFrame(
        report,
        columns=pd.MultiIndex.from_product([["Predicted"], labels]),
        index=pd.MultiIndex.from_product([["Actual"], labels]),
    )
    return confusion

In [None]:
categorical_features = ["AGE", "GENDER", "DRIVING_EXPERIENCE", "EDUCATION", "INCOME", "VEHICLE_YEAR", "MARRIED", "CHILDREN", "FREQUENT_ACCIDENT"]
numeric_features = ["CREDIT_SCORE", "ANNUAL_MILEAGE", "SPEEDING_VIOLATIONS", "DUIS", "PAST_ACCIDENTS"]

We create empty DataFrames to collect the metrics of each model.

In [None]:
report_train = pd.DataFrame(columns=["Accuracy", "F1 score"])
report_test = pd.DataFrame(columns=["Accuracy", "F1 score"])

### 4.1. Logistic regression

In [None]:
transformers_list=[
    (
        "category",
        make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(drop="first")),
        categorical_features
    ),
    (
        "numeric",
        make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
        numeric_features
    )
]
preprocess_transformer = ColumnTransformer(transformers_list)

In [None]:
steps = [
    ("preprocess", preprocess_transformer),
    ("logistic", LogisticRegression(tol=1e-2, solver="liblinear"))
]
logistic_pipeline = Pipeline(steps)

We create a pipeline because we want a single wrapper that applies the preprocessing steps and the model together.

The diagram below shows the pipeline workflow defined above.

<img src="https://drive.google.com/uc?id=1Lg-ZnQae5G-Y1eqReODKCyeZ2-6E8HLB" style="height:250px"/>
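
One practical benefit of wrapping preprocessing and model in a pipeline is that the preprocessing statistics are learned from the training rows only. The sketch below uses synthetic data and a simplified two-step pipeline (both are assumptions made for illustration) to show this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic features
y_demo = (X_demo[:, 0] > 50.0).astype(int)                 # synthetic target
X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)  # fits the scaler and the model on the training rows only

# The scaler statistics come from the training rows, not from X_te
scaler_mean = pipe.named_steps["scaler"].mean_
print(scaler_mean, X_tr.mean(axis=0))

# predict() re-applies the already fitted scaler before calling the model
test_predictions = pipe.predict(X_te)
```

Inside `GridSearchCV`, the same mechanism refits the preprocessing on each training fold, which avoids leaking information from the validation fold.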

In [None]:
logistic_params = {
    'logistic__C': np.logspace(-2, 2, 5)
}
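
The key `logistic__C` follows the `<step name>__<parameter>` convention that scikit-learn uses to address a parameter of one pipeline step. A minimal sketch, using a simplified pipeline that is only an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression()),
])

# np.logspace(-2, 2, 5) gives five values evenly spaced on a log scale
c_grid = np.logspace(-2, 2, 5)
print(c_grid)  # 0.01, 0.1, 1, 10, 100

# "<step name>__<parameter>" addresses a parameter of one pipeline step
pipe.set_params(logistic__C=c_grid[3])
print(pipe.named_steps["logistic"].C)
```

`pipe.get_params().keys()` lists every name that can be used in a parameter grid.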

In [None]:
logistic_search = GridSearchCV(estimator = logistic_pipeline,
                                       param_grid = logistic_params,
                                       cv = 5, scoring='f1', verbose=2)

In [None]:
logistic_search.fit(X_train, y_train)

In [None]:
y_train_predict = logistic_search.predict(X_train)

In [None]:
y_test_predict = logistic_search.predict(X_test)

We evaluate the model with a confusion matrix, precision, recall, and F1 score.

In [None]:
confusion_report_formatting(y_train, confusion_matrix(y_train, y_train_predict))

In [None]:
confusion_report_formatting(y_test, confusion_matrix(y_test, y_test_predict))

In [None]:
print(classification_report(y_train, y_train_predict))

In [None]:
print(classification_report(y_test, y_test_predict))

Then, we append the model's metrics to the report table.

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_train, y_train_predict),
    f1_score(y_train, y_train_predict, average='weighted'),
]], index=["logistic"], columns=["Accuracy", "F1 score"])

In [None]:
report_train = pd.concat([report_train, score])
report_train

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_test, y_test_predict),
    f1_score(y_test, y_test_predict, average='weighted'),
]], index=["logistic"], columns=["Accuracy", "F1 score"])

In [None]:
report_test = pd.concat([report_test, score])
report_test
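
Note that the grid search ranks candidates with `scoring='f1'`, which is the F1 score of the positive class, while the report table uses `average='weighted'`. The two numbers can differ, as the sketch below shows on hypothetical labels.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 6 negatives and 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Default average="binary": F1 of the positive class only (what scoring="f1" uses)
f1_binary = f1_score(y_true, y_pred)

# average="weighted": per-class F1 averaged with class frequencies as weights
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print(round(f1_binary, 3), round(f1_weighted, 3))
```

When the positive class is the minority, the weighted score is usually pulled up by the majority class, so it helps to state which variant a table reports.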

We repeat this workflow for the other models.

### 4.2. K-Nearest Neighbors

In [None]:
transformers_list=[
        (
            "category",
            make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder(handle_unknown="use_encoded_value", encoded_missing_value=-1, unknown_value=-2)),
            categorical_features
        ),
        (
            "numeric",
            SimpleImputer(strategy="mean"),
            numeric_features
        )
]
preprocess_transformer = ColumnTransformer(transformers_list)

In [None]:
steps = [
    ("preprocess", preprocess_transformer),
    ("min_max", MinMaxScaler()),
    ("knn", KNeighborsClassifier())
]
knn_pipeline = Pipeline(steps)

In [None]:
knn_params = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__leaf_size': np.linspace(10, 80, 5, dtype=int)
}

The diagram below shows the pipeline workflow defined above.

<img src="https://drive.google.com/uc?id=1Vmeq_sTp2-qC1OfJpWiJzHmiq_HbLdYR" style="height:300px"/>

In [None]:
knn_search = GridSearchCV(estimator = knn_pipeline,
                          param_grid = knn_params,
                          cv = 5, scoring='f1', verbose=2)

In [None]:
knn_search.fit(X_train, y_train)

In [None]:
y_train_predict = knn_search.predict(X_train)
y_test_predict = knn_search.predict(X_test)

In [None]:
# Example of evaluating the model
# confusion_report_formatting(y_train, confusion_matrix(y_train, y_train_predict))
# confusion_report_formatting(y_test, confusion_matrix(y_test, y_test_predict))

# print(classification_report(y_train, y_train_predict))
# print(classification_report(y_test, y_test_predict))

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_train, y_train_predict),
    f1_score(y_train, y_train_predict, average='weighted'),
]], index=["knn"], columns=["Accuracy", "F1 score"])

In [None]:
report_train = pd.concat([report_train, score])
report_train

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_test, y_test_predict),
    f1_score(y_test, y_test_predict, average='weighted'),
]], index=["knn"], columns=["Accuracy", "F1 score"])

In [None]:
report_test = pd.concat([report_test, score])
report_test

### 4.3. XGBoost

In [None]:
transformers_list=[
        ("category", OrdinalEncoder(handle_unknown="use_encoded_value", encoded_missing_value=-1, unknown_value=-2), categorical_features),
        ("numeric", "passthrough", numeric_features)
]
preprocess_transformer = ColumnTransformer(transformers_list)

In [None]:
xgb_pipeline = Pipeline([
    ("preprocess", preprocess_transformer),
    ("xgb", XGBClassifier())
])

In [None]:
xgb_params = {
    'xgb__max_depth': [3, 5, 7],
    'xgb__min_child_weight': [0.1, 1, 10]
}

The diagram below shows the pipeline workflow defined above.

<img src="https://drive.google.com/uc?id=1YhzRuEqjA5WtQKuvRgpcQ_0vtRThAbl2" style="height:200px"/>

In [None]:
xgb_search = GridSearchCV(estimator = xgb_pipeline,
                                       param_grid = xgb_params,
                                       cv = 5, scoring='f1', verbose=2)

In [None]:
xgb_search.fit(X_train, y_train)

In [None]:
y_train_predict = xgb_search.predict(X_train)
y_test_predict = xgb_search.predict(X_test)

In [None]:
# Example of evaluating the model
# confusion_report_formatting(y_train, confusion_matrix(y_train, y_train_predict))
# confusion_report_formatting(y_test, confusion_matrix(y_test, y_test_predict))

# print(classification_report(y_train, y_train_predict))
# print(classification_report(y_test, y_test_predict))

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_train, y_train_predict),
    f1_score(y_train, y_train_predict, average='weighted'),
]], index=["xgb (1st)"], columns=["Accuracy", "F1 score"])

In [None]:
report_train = pd.concat([report_train, score])
report_train

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_test, y_test_predict),
    f1_score(y_test, y_test_predict, average='weighted'),
]], index=["xgb (1st)"], columns=["Accuracy", "F1 score"])

In [None]:
report_test = pd.concat([report_test, score])
report_test

---

# 5. Error analysis

## 5.1. Checking learning curve

XGBoost is our best model. <br>
First, we examine the effect of training set size by plotting a learning curve.

In [None]:
train_sizes, train_scores, test_scores = learning_curve(xgb_search.best_estimator_, X, y,
                                                        scoring='f1', cv=5, train_sizes=np.linspace(0.1, 1.0, 20))

In [None]:
plt.figure(figsize=(6,3))
plt.plot(train_sizes, train_scores.mean(axis=1), "-o", label="train set")
plt.plot(train_sizes, test_scores.mean(axis=1), "-o", label="validation set")
plt.xlabel("Train sizes")
plt.ylabel("F1 score")
plt.legend()
plt.ylim([0,1])
plt.show()

The learning curve shows that we have enough training data to train this model.
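
To make the outputs of `learning_curve` easier to read, the sketch below runs it on a synthetic dataset with a logistic regression (both are stand-ins chosen for speed, not the workshop setup) and prints the shapes involved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the workshop data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(), X_demo, y_demo,
    scoring="f1", cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

# One row per training size, one column per cross-validation fold
print(sizes)                      # absolute number of training rows used
print(train_scores.shape)
print(valid_scores.mean(axis=1))  # fold average, as used for plotting
```

The second array holds scores on the held-out cross-validation folds. A curve that flattens as the training size grows suggests that more data alone would add little.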

## 5.2. Compare actual and predict values

Next, we concatenate the predicted values with the original training data for visualization.

In [None]:
y_train_predict = pd.DataFrame(y_train_predict, index=y_train.index).rename({0:"predict"},axis=1)
data_concat_train = pd.concat([X_train, y_train, y_train_predict], axis=1).rename({"OUTCOME":"actual"},axis=1)
data_concat_train["is_error"] = data_concat_train["actual"] != data_concat_train["predict"]

In this example, we inspect only one numerical column, `CREDIT_SCORE`.

In [None]:
def subplot_distribution(data, x, hue, bins=20, discrete=False):
    # One panel per actual class; bars are colored by the `hue` column
    classes = np.sort(data["actual"].unique())
    ncol = classes.shape[0]
    palette = ["#FF0B04", "#4374B3"]  # colors for hue values True and False

    fig, ax = plt.subplots(nrows=1, ncols=ncol, figsize=(8,3), sharex=True, sharey=True)
    for i in range(ncol):
        ax1 = sns.histplot(
            ax=ax[i], data=data[data["actual"]==classes[i]], x=x, hue=hue, bins=bins,
            hue_order=[True, False], discrete=discrete, palette=palette
        )
        ax1.set_title(f"actual class = {classes[i]}")

    plt.tight_layout()
    plt.show()

In [None]:
subplot_distribution(data_concat_train, x="CREDIT_SCORE", hue="is_error")

The error range for actual class False differs from that for actual class True. <br>
Expanding the hyperparameter grid in `GridSearchCV` may yield a better model.

Next, we look at a categorical column, using only `AGE` in this example.

In [None]:
subplot_distribution(data_concat_train, x="AGE", hue="is_error", discrete=True)

The conclusion from the plot above is the same as what we found for the `CREDIT_SCORE` and `ANNUAL_MILEAGE` columns.

## 5.3. Feature effects in the model

There are many methods to assess the effect of features on a model, such as:
- Coefficients in linear models
- Feature importance in tree-based models
- Permutation importance
- LIME and SHAP

### 5.3.1. Coefficients in linear model

Assessing features by their coefficients is limited to linear models (see the model types in the table in section 2).

<img src="https://drive.google.com/uc?id=1uUxKOo_aVh8yX5SGpVjyRGJj2HEv_Hh8" style="height:300px"/>

figure ref: https://www.saedsayad.com/logistic_regression.htm

In this section, we inspect the logistic regression model trained earlier. <br>
The logistic pipeline is illustrated below.

<img src="https://drive.google.com/uc?id=1jYw-Lu1WafZ1obHQ8e04W6-be8c_u14V" style="height:300px"/>

In [None]:
# Get best model in GridSearchCV
best_logistic = logistic_search.best_estimator_

In [None]:
# Get logistic model in Pipeline
logistic_model = best_logistic["logistic"] # "logistic" is the step name defined in the training pipeline
# logistic_model = best_logistic[-1] # the model is usually the last step of the pipeline

In [None]:
# Get coefficients in logistic model
coef_logistic = logistic_model.coef_
coef_logistic

In [None]:
# Example of get coefficients in logistic model in one line
# logistic_search.best_estimator_["logistic"].coef_

For clarity, we add feature names and convert the result to a pandas DataFrame.

Because the pipeline contains one-hot encoding, we cannot use the input column names directly. <br>
We must get the output feature names from the encoder and concatenate them with the remaining (numeric) column names.

In [None]:
# Get column name input from Pipeline
cat_colname = logistic_search.best_estimator_["preprocess"]["category"].feature_names_in_
num_colname = logistic_search.best_estimator_["preprocess"]["numeric"].feature_names_in_

In [None]:
# Get column name input from one-hot encoder
cat_onehot_colname = logistic_search.best_estimator_["preprocess"]["category"][-1].get_feature_names_out()
cat_onehot_colname

In [None]:
# Create dict for replace x in cat_onehot_colname
cat_word = {f"x{i}": n for (i, n) in enumerate(cat_colname)}
cat_word

In [None]:
# Replace the x<i> prefix in cat_onehot_colname with the original column name
for index, name in enumerate(cat_onehot_colname):
    prefix, _, category = name.partition("_")  # e.g. "x0_<category>" -> "x0", "<category>"
    cat_onehot_colname[index] = f"{cat_word[prefix]}_{category}"

cat_onehot_colname

In [None]:
# Concat one-hot output name and numerical output name
colname = np.concatenate([cat_onehot_colname, num_colname])

In [None]:
# Combine names with coefficients in a DataFrame and sort by coefficient
coef_data = pd.DataFrame({"feature_name": colname, "coefficient": coef_logistic[0]}).sort_values("coefficient", ascending=False)
coef_data

The sign of a coefficient indicates the direction of the relationship.

To compare the magnitude of the coefficients, we take their absolute values.

In [None]:
coef_data["abs_coefficient"] = coef_data["coefficient"].abs()

In [None]:
coef_data.sort_values("abs_coefficient", ascending=False)

From the table above, the `DRIVING_EXPERIENCE` and `GENDER` columns have the largest effect in the logistic regression model. <br>
In contrast, `CREDIT_SCORE` has the smallest effect in this model, followed by `DUIS` and `EDUCATION`.
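
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to explain. The sketch below uses hypothetical coefficients and placeholder feature names, not the fitted values from the model above.

```python
import numpy as np

# Hypothetical coefficients, NOT the fitted values from the model above
feature_names = ["feature_a", "feature_b", "feature_c"]
coefficients = np.array([0.7, -1.2, 0.0])

# exp(coefficient) is the factor by which the odds of the positive class
# change when the (scaled or one-hot) feature increases by one unit
odds_ratios = np.exp(coefficients)

for name, coef, ratio in zip(feature_names, coefficients, odds_ratios):
    print(f"{name}: coefficient={coef:+.2f}, odds ratio={ratio:.2f}")
```

An odds ratio above 1 raises the odds of the positive class, a ratio below 1 lowers them, and a ratio of exactly 1 means no effect. Because the numeric features were standardized, "one unit" here means one standard deviation.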

### 5.3.2. Feature importance in tree-based models

Assessing features by feature importance is limited to tree-based models (see the model types in the table in section 2).

In this section, we inspect the XGBoost model trained earlier. <br>
The XGBoost pipeline is illustrated below.

<img src="https://drive.google.com/uc?id=1kVtDVGjmCQqKpq7WCDx-UJdAoehtqzX-" style="height:250px"/>

In [None]:
# Get XGBoost model from GridSearchCV
xgb_model = xgb_search.best_estimator_["xgb"]

In [None]:
# Get feature importances in XGBoost
fim_xgb = xgb_model.feature_importances_
fim_xgb

In [None]:
# Example of feature importances in XGBoost in one line
# xgb_search.best_estimator_["xgb"].feature_importances_

As in the previous section, we add feature names and convert the result to a pandas DataFrame. <br>
This time, we used an ordinal encoder instead of a one-hot encoder, so we can use the input column names directly, with the categorical features first, followed by the numeric ones.

In [None]:
# Get column name input from Pipeline
cat_colname = xgb_search.best_estimator_["preprocess"]["category"].feature_names_in_
num_colname = xgb_search.best_estimator_["preprocess"]["numeric"].feature_names_in_

In [None]:
# Concat categorical output name and numerical output name
colname = np.concatenate([cat_colname, num_colname])

In [None]:
# Combine names with feature importances in a DataFrame and sort by importance
fim_data = pd.DataFrame({"feature_name": colname, "feature_importances": fim_xgb}).sort_values("feature_importances", ascending=False)
fim_data

Unlike in logistic regression, the `DRIVING_EXPERIENCE` and `VEHICLE_YEAR` columns have the largest effect in the XGBoost model. <br>
The `DUIS` and `SPEEDING_VIOLATIONS` columns have the smallest effect in this model.

### 5.3.3. Permutation importance

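Permutation importance measures how much the model's score drops when the values of a single feature are randomly shuffled, because permuting a predictive feature breaks the relationship between that feature and the target.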

<img src="https://drive.google.com/uc?id=1JiM1XfVqHEkd5_-_vKBSPnx8ijMVQwTz" style="height:300px"/>

<img src="https://drive.google.com/uc?id=1wdqdRziiRcqoYZzTy7bjZBLpMdu5kEXl" style="height:300px"/>

Therefore, we evaluate the fitted model of interest, XGBoost, on data with permuted features; the model itself is not retrained.

In [None]:
result = permutation_importance(xgb_search.best_estimator_, X_test, y_test, n_repeats=20, random_state=0)

In [None]:
index_sort = result.importances_mean.argsort()[::-1].tolist()

In [None]:
pd.DataFrame({
    "column_name":X_test.columns[index_sort],
    "permutation_importances": [f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}" for i in index_sort]
})

Permutation importances are interpreted in a similar way to coefficient magnitudes. <br>
If the importance value is close to zero, the feature has little effect in the model.
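
The idea behind permutation importance can be sketched by hand in a few lines. The example below uses synthetic data and a stand-in `predict` function in place of a fitted model. Both are assumptions for illustration, and the scikit-learn function handles scoring and repeats more generally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends only on column 0
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model that uses column 0 only."""
    return (X[:, 0] > 0).astype(int)

def accuracy(X, y):
    return float(np.mean(predict(X) == y))

baseline = accuracy(X_demo, y_demo)
importances = []
for col in range(X_demo.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle, like n_repeats
        X_perm = X_demo.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - accuracy(X_perm, y_demo))
    importances.append(float(np.mean(drops)))

print(importances)  # large drop for column 0, none for column 1
```

Shuffling the column the model relies on removes most of its accuracy, while shuffling the unused column changes nothing.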

The `DRIVING_EXPERIENCE` and `VEHICLE_YEAR` columns have the largest effect in the XGBoost model. <br>
The `AGE` and `CHILDREN` columns have the smallest effect in this model.

### 5.3.4. LIME and SHAP

In this example, we do not cover LIME and SHAP. If you are interested in these methods, see the following references.

- SHAP and LIME: https://www.datacamp.com/tutorial/explainable-ai-understanding-and-trusting-machine-learning-models

- LIME 1: https://github.com/marcotcr/lime
- LIME 2: https://towardsdatascience.com/decrypting-your-machine-learning-model-using-lime-5adc035109b5

- SHAP 1: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html
- SHAP 2: https://medium.com/@anshulgoel991/model-exploitability-using-shap-shapley-additive-explanations-and-lime-local-interpretable-cb4f5594fc1a

---

All the methods we use to validate the impact of features on the model yield different results. <br>
It is highly recommended to choose one consistent method for the entire project.

In this example, we will use <b> feature importance in tree-based models </b> to investigate the effect of features.

In [None]:
fim_data

The model relies heavily on the `DRIVING_EXPERIENCE` and `VEHICLE_YEAR` columns. <br>
We can add `lambda`, the L2 regularization term on weights, to the hyperparameter grid.

Moreover, the `FREQUENT_ACCIDENT` column has more effect on this model than the `PAST_ACCIDENTS` column. <br>
Binning some features can therefore improve model performance.

---

## 6. Training model (2nd attempt)

Based on the error analysis, we make the following changes:
- Binning certain features
- Including 'lambda' in the grid hyperparameters
- Expanding the grid of hyperparameters

First, we bin the features that had the least effect in the previous model, `DUIS` and `SPEEDING_VIOLATIONS`.

In this example, the `DUIS` column is divided into `Never` and `Used to`. <br>
Additionally, the `SPEEDING_VIOLATIONS` column is categorized as follows: <br>
- "Never": [0]
- "Rarely": [1,2,3,4,5]
- "Often": 6+

In [None]:
data = pd.read_csv("data_prepared.csv")

In [None]:
def duis_binning(row):
    duis = row["DUIS"]
    if duis in [0]:
        return "Never"
    else:
        return "Used to"

In [None]:
def speed_binning(row):
    speed = row["SPEEDING_VIOLATIONS"]
    if speed in [0]:
        return "Never"
    elif speed in [1,2,3,4,5]:
        return "Rarely"
    else:
        return "Often"

In [None]:
data["USED_TO_DUIS"] = data.apply(duis_binning, axis=1)

In [None]:
data["FREQUENT_SPEED_VIOLATIONS"] = data.apply(speed_binning, axis=1)
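
The same bins can also be produced without a row-wise `apply`. The sketch below shows a vectorized alternative with `pd.cut` on a few hypothetical counts. Unlike `speed_binning`, it leaves missing values as NaN instead of labeling them "Often".

```python
import numpy as np
import pandas as pd

# Hypothetical violation counts
speeding = pd.Series([0, 1, 5, 6, 10], name="SPEEDING_VIOLATIONS")

# Intervals are right-inclusive: (-1, 0], (0, 5], (5, inf)
binned = pd.cut(
    speeding,
    bins=[-1, 0, 5, np.inf],
    labels=["Never", "Rarely", "Often"],
)
print(binned.tolist())
```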

Split the data into training and test sets.

In [None]:
X = data.drop(["OUTCOME"], axis=1)
y = data["OUTCOME"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=42, stratify=y)

Finally, we will include "lambda" in the grid of hyperparameters and expand the overall grid.

### Exercise 2

Complete the code for training an XGBoost model.

In [None]:
categorical_features = ["AGE", "GENDER", "DRIVING_EXPERIENCE", "EDUCATION", "INCOME", "VEHICLE_YEAR", "MARRIED", "CHILDREN", "FREQUENT_ACCIDENT", "USED_TO_DUIS", "FREQUENT_SPEED_VIOLATIONS"]
numeric_features = ["CREDIT_SCORE", "ANNUAL_MILEAGE", "SPEEDING_VIOLATIONS", "DUIS", "PAST_ACCIDENTS"]

In [None]:
transformers_list=[
        ("category", ____________, categorical_features),
        ("numeric", "passthrough", numeric_features)
]
preprocess_transformer = ____________(transformers_list)

In [None]:
xgb_pipeline = ____________([
    ("preprocess", preprocess_transformer),
    ("xgb", ____________)
])

In [None]:
xgb_params = {
    'xgb__max_depth': [2, 3, 5],
    'xgb__min_child_weight': [0.1, 1, 10, 100],
    'xgb__lambda': [0.1, 1, 10, 100]
}
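
Expanding the grid increases training time, so it can help to count the fits before starting the search. A small sketch using scikit-learn's `ParameterGrid` on the same grid, with `cv=5` as used for the grid searches in this notebook:

```python
from sklearn.model_selection import ParameterGrid

xgb_params = {
    "xgb__max_depth": [2, 3, 5],
    "xgb__min_child_weight": [0.1, 1, 10, 100],
    "xgb__lambda": [0.1, 1, 10, 100],
}

n_candidates = len(ParameterGrid(xgb_params))  # every combination in the grid
n_folds = 5                                    # cv=5 for the grid search
n_fits = n_candidates * n_folds                # plus one final refit on all training data
print(n_candidates, n_fits)
```

If the number of fits becomes too large, `RandomizedSearchCV` with a fixed `n_iter` is a common way to cap the cost.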

In [None]:
xgb_second_search = ____________(estimator = xgb_pipeline,
                                       param_grid = xgb_params,
                                       cv = 5, scoring='f1', verbose=2)

---

Training and evaluating model

In [None]:
xgb_second_search.fit(X_train, y_train)

In [None]:
y_train_predict = xgb_second_search.predict(X_train)
y_test_predict = xgb_second_search.predict(X_test)

In [None]:
# Example of evaluating the model
# confusion_report_formatting(y_train, confusion_matrix(y_train, y_train_predict))
# confusion_report_formatting(y_test, confusion_matrix(y_test, y_test_predict))

# print(classification_report(y_train, y_train_predict))
# print(classification_report(y_test, y_test_predict))

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_train, y_train_predict),
    f1_score(y_train, y_train_predict, average='weighted'),
]], index=["xgb (2nd)"], columns=["Accuracy", "F1 score"])

In [None]:
report_train = pd.concat([report_train, score])
report_train

In [None]:
score = pd.DataFrame([[
    accuracy_score(y_test, y_test_predict),
    f1_score(y_test, y_test_predict, average='weighted'),
]], index=["xgb (2nd)"], columns=["Accuracy", "F1 score"])

In [None]:
report_test = pd.concat([report_test, score])
report_test

In [None]:
# check best params
xgb_second_search.best_params_

To improve the model further, we can run another error analysis, create new features, group (cluster) features, or try a different model.

---

## 7. Save model

Lastly, we save the model for use in the next Python notebook.

In [None]:
os.makedirs("./model", exist_ok=True)

In [None]:
dump(xgb_second_search, 'model/best_model.joblib')

In [None]:
# Example of loading the model with the joblib library
# model = load('model/best_model.joblib')
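
A quick round trip can confirm that a saved model behaves like the original. The sketch below uses a tiny synthetic model and a temporary file (the file name is hypothetical) rather than the grid search object saved above.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Tiny synthetic model standing in for the fitted grid search
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored object predicts exactly like the original
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

It is safest to load the file with the same library versions that were used to save it, and to load only files from a trusted source.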

---