
# Datathon 2024: Category A (Champions) Workshop

# 1 Preparing data

A model needs data to train on before it can make predictions. That data must first be cleaned and prepared: if it is wrongly formatted or has missing values, the model will not work as intended.

The steps below will help you get started should you be new to the concept.

## 1.1 Preparing Data

To work in the Colab environment, we first mount Google Drive. Make sure the dataset is saved in 'My Drive' on your Google Drive, or that a shortcut to it has been created there. The cells below install CatBoost, mount the drive and load the dataset.

In [None]:
%pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.3f}'.format)

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import r2_score

from catboost import CatBoostRegressor

from google.colab import drive

drive.mount('/content/drive')

SEED = 42

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/catA_train.csv')
print(f'Shape of data: {df.shape}')
df.head()

Shape of data: (29182, 28)


Unnamed: 0,LATITUDE,LONGITUDE,AccountID,Company,SIC Code,Industry,8-Digit SIC Code,8-Digit SIC Description,Year Found,Entity Type,Parent Company,Parent Country,Ownership Type,Company Description,Square Footage,Company Status (Active/Inactive),Employees (Single Site),Employees (Domestic Ultimate Total),Employees (Global Ultimate Total),Sales (Domestic Ultimate Total USD),Sales (Global Ultimate Total USD),Import/Export Status,Fiscal Year End,Global Ultimate Company,Global Ultimate Country,Domestic Ultimate Company,Is Domestic Ultimate,Is Global Ultimate
0,1.285,103.844,LAKB2BID4559214,FRANK CONSULTING SERVICES PRIVATE LIMITED,7361.0,Employment Agencies,73610000.0,Employment agencies,2020.0,Subsidiary,FRANK RECRUITMENT GROUP PRIVATE LTD.,Singapore,Private,Frank Consulting Services Private Limited is p...,,Active,15.0,25.0,,2209224.0,4637871.0,,,FINDERS HOLDCO LIMITED,United Kingdom,FRANK RECRUITMENT GROUP PRIVATE LTD.,0.0,0.0
1,1.291,103.827,LAKB2BID7610849,NEW DESERT ORCHID SHIPPING PTE. LTD.,4449.0,"Water Transportation of Freight, Not Elsewhere...",44490000.0,Water transportation of freight,2015.0,Subsidiary,FORTITUDE SHIPPING PTE. LTD.,Singapore,Private,New Desert Orchid Shipping Pte. Ltd. is primar...,,Active,39.0,100.0,100.0,7093536000.0,7093536000.0,,,PETREDEC PTE. LIMITED,Singapore,,0.0,0.0
2,1.3,103.858,LAKB2BID5461679,2MBAO BIOCELLBANK PTE. LTD.,6719.0,"Offices of Holding Companies, Not Elsewhere Cl...",67190000.0,"Holding companies, nec",1993.0,Subsidiary,MADISON LIGHTERS AND WATCHES CO LTD,Hong Kong SAR,Private,2Mbao Biocellbank Pte. Ltd. is primarily engag...,,Active,4.0,4.0,4.0,1026308.0,1026308.0,,,MADISON LIGHTERS AND WATCHES CO LTD,Hong Kong SAR,2MBAO BIOCELLBANK PTE. LTD.,1.0,0.0
3,1.301,103.791,LAKB2BID5088529,NEWBLOOM PTE. LTD.,6719.0,"Offices of Holding Companies, Not Elsewhere Cl...",67190000.0,"Holding companies, nec",2006.0,Subsidiary,WILMAR INTERNATIONAL LIMITED,Singapore,Private,Newbloom Pte. Ltd. is primarily engaged in hol...,,Active,10.0,100.0,100.0,73398976000.0,73398976000.0,,,WILMAR INTERNATIONAL LIMITED,Singapore,WILMAR INTERNATIONAL LIMITED,0.0,0.0
4,1.299,103.859,LAKB2BID1268831,ASIA GREEN CAPITAL PTE. LTD.,6719.0,"Offices of Holding Companies, Not Elsewhere Cl...",67190000.0,"Holding companies, nec",2006.0,Parent,ASIA GREEN CAPITAL PTE. LTD.,Singapore,Private,Asia Green Capital Pte. Ltd. is primarily enga...,,Active,,4.0,4.0,432213.0,432213.0,Exports,,ASIA GREEN CAPITAL PTE. LTD.,Singapore,ASIA GREEN CAPITAL PTE. LTD.,1.0,1.0


In [None]:
# Show information better than describe() and info()
desc = pd.DataFrame(index=df.columns)
desc["count"] = df.count()
desc["null"] = df.isna().sum()
desc["%null"] = desc["null"] / len(df) * 100
desc["nunique"] = df.nunique()
desc["%unique"] = desc["nunique"] / len(df) * 100
desc["type"] = df.dtypes
desc = pd.concat([desc, df.describe().T.drop("count", axis=1)], axis=1)

# styles = [dict(selector=f".row_heading", props=[('text-align', 'left')])]
# desc = desc.style.set_table_styles(styles)
desc

Unnamed: 0,count,null,%null,nunique,%unique,type,mean,std,min,25%,50%,75%,max
LATITUDE,29062,120,0.411,9305,31.886,float64,1.321,0.044,1.239,1.285,1.31,1.338,1.47
LONGITUDE,29062,120,0.411,9307,31.893,float64,103.843,0.054,103.611,103.832,103.849,103.866,104.003
AccountID,29182,0,0.0,29182,100.0,object,,,,,,,
Company,29182,0,0.0,29182,100.0,object,,,,,,,
SIC Code,29182,0,0.0,582,1.994,float64,6169.271,1705.846,132.0,5084.0,6719.0,7311.0,9721.0
Industry,29182,0,0.0,580,1.988,object,,,,,,,
8-Digit SIC Code,29182,0,0.0,2255,7.727,float64,61690923.55,17057775.477,1320000.0,50840000.0,67190000.0,73110000.0,97219905.0
8-Digit SIC Description,29182,0,0.0,2191,7.508,object,,,,,,,
Year Found,28748,434,1.487,106,0.363,float64,2004.506,13.464,1819.0,1997.0,2008.0,2014.0,2023.0
Entity Type,29182,0,0.0,4,0.014,object,,,,,,,


## Baseline: CatBoost and Random Forest

In [None]:
df_tmp = df.drop(["Square Footage", "Import/Export Status", "Fiscal Year End"], axis=1)
df_tmp = df_tmp.dropna()
df_tmp = df_tmp.select_dtypes(include=np.number)

X = df_tmp.drop("Sales (Domestic Ultimate Total USD)", axis=1)
y = df_tmp["Sales (Domestic Ultimate Total USD)"]

cat_features = list(X.select_dtypes(include=["category", "object"]).columns)
print(cat_features)

cat = CatBoostRegressor(cat_features=cat_features, verbose=0)

cat.fit(X, y)

[]


<catboost.core.CatBoostRegressor at 0x7f67b2934d90>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
cat = CatBoostRegressor(cat_features=cat_features, verbose=0)

cat.fit(X_train, y_train)

<catboost.core.CatBoostRegressor at 0x7f67b4265750>

In [None]:
r2_score(y_test, cat.predict(X_test))
# r2_score(y_train, cat.predict(X_train))

-0.46253119692495415

In [None]:
from sklearn.ensemble import RandomForestRegressor

df_tmp = df.drop(["Square Footage", "Import/Export Status", "Fiscal Year End"], axis=1)
df_tmp = df_tmp.dropna()
df_tmp = df_tmp.select_dtypes(include=np.number)

X = df_tmp.drop("Sales (Domestic Ultimate Total USD)", axis=1)
y = df_tmp["Sales (Domestic Ultimate Total USD)"]
rf = RandomForestRegressor()
rf.fit(X, y)

In [None]:
# Plot the feature importances of the random forest
plt.figure(figsize=(15, 10))
sns.barplot(x=rf.feature_importances_, y=X.columns)
plt.show()

## 1.2 Processing Data


#### Drop NaN Values
- We drop rows without latitude and longitude coordinates, as they make up only about 0.4% of the dataset (120 of 29,182 rows).

#### One-Hot Encoding
We also handle categorical variables by one-hot encoding the columns whose categories carry useful information. This is done with the pd.get_dummies function.

An alternative approach using the LabelEncoder from scikit-learn is also demonstrated.

- Be cautious about the Curse of Dimensionality!
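As a minimal sketch of how the two encodings differ, the cell below applies both to a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from the competition data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({"Entity Type": ["Parent", "Subsidiary", "Branch", "Subsidiary"]})

# One-hot encoding: one new indicator column per category
one_hot = pd.get_dummies(toy, columns=["Entity Type"], prefix="Entity_Type")
print(one_hot.columns.tolist())

# Label encoding: a single integer column; classes are sorted alphabetically
encoder = LabelEncoder()
labels = encoder.fit_transform(toy["Entity Type"])
print(list(encoder.classes_))
print(labels.tolist())
```

One-hot encoding adds one column per category, so a column such as SIC Code (582 unique values in the summary table above) would add hundreds of features, which is where the dimensionality concern comes from. Label encoding keeps a single column but assigns integers in an arbitrary (alphabetical) order.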

In [None]:
# Remove rows where the Company Status is 'Inactive'
# .copy() avoids a SettingWithCopyWarning when new columns are added to df later
df = df[df['Company Status (Active/Inactive)'] == 'Active'].copy()

In [None]:
# Flag missing Import/Export Status so that this information is not lost when encoding
df['Import/Export Status_Missing'] = df['Import/Export Status'].isna()

# Drop rows with missing coordinates, employee counts or founding year
df3 = df.dropna(subset=["LATITUDE", "LONGITUDE",
                        "Employees (Single Site)", "Employees (Domestic Ultimate Total)", "Employees (Global Ultimate Total)",
                        "Year Found"])

# One-hot encode the categorical columns (get_dummies removes the original columns)
df3 = pd.get_dummies(df3, columns=['Entity Type'], prefix='Entity_Type')
df3 = pd.get_dummies(df3, columns=['Ownership Type'], prefix='Ownership_Type')
df3 = pd.get_dummies(df3, columns=['Import/Export Status'], prefix='Import_Export_Status')

# Company Status carries no information any more: all remaining rows are Active
df3 = df3.drop(columns=["Company Status (Active/Inactive)"], errors='ignore')




In [None]:
# Check the number of null values
df3.isna().sum()

In [None]:
# Find the frequency of each SIC Code
sic_code_frequency = df3['SIC Code'].value_counts()

# Set a threshold for low-frequency SIC Codes
threshold = 7  # Adjust this threshold based on your preference

# Identify SIC Codes with frequency below the threshold
low_frequency_sic_codes = sic_code_frequency[sic_code_frequency < threshold].index

# Replace these low-frequency SIC Codes with a common label "Others"
df3['SIC Code'] = df3['SIC Code'].replace(low_frequency_sic_codes, 'Others')

sic_code_frequency1 = df3['SIC Code'].value_counts()

# Display the updated frequencies
print(sic_code_frequency1)

In [None]:
# Convert to str because we do not want the model to treat SIC Code as a numeric value
# This also gives the whole column a common data type (numeric codes and 'Others')
df3['SIC Code'] = df3['SIC Code'].astype(str)

An alternative approach to encoding, label encoding, is shown below for the SIC Codes.

For more information on SIC Codes, visit the site here: https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Use label encoding for 'SIC Code'
label_encoder = LabelEncoder()
df3['SIC Code'] = label_encoder.fit_transform(df3['SIC Code'])

In [None]:
df3["SIC Code"].unique()

# 2. EDA

We will now proceed to analyze and visualize a subset of our data to gain some rough insight into how inputs in our data are related to our outputs.

## 2.1. Plotting Graphs

### Correlation Matrix

We will first use a Correlation Matrix to observe how each numeric variable in the dataset correlates with the others.
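A correlation matrix can also be read column by column. The sketch below ranks features by the absolute value of their Pearson correlation with a target. It runs on synthetic data: the `toy` frame and its column names are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data, made up for illustration: "sales" depends on "employees" only
rng = np.random.default_rng(42)
employees = rng.integers(1, 500, size=200)
toy = pd.DataFrame({
    "employees": employees,
    "noise": rng.normal(size=200),
    "sales": employees * 1000.0 + rng.normal(scale=5000.0, size=200),
})

# Pearson correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["sales"]
    .drop("sales")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship, and a high value does not by itself show that one variable causes the other.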

In [None]:
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(numeric_only=True), square=True, annot=True)

### Geospatial Analysis

### AccountID
AccountID is unique for every row, so it shows no clear pattern -> Drop

In [None]:
df["AccountID"]

### "Company" features


In [None]:
# Company	Parent Company Company Description Global Ultimate Company Domestic Ultimate Company
df[["Company", "Parent Company", "Company Description", "Global Ultimate Company", "Domestic Ultimate Company"]]
# df.loc[0, "Company Description"]

In [None]:
df["Parent Company"].value_counts()

In [None]:
# Find the frequency of each parent company
par_freq = df["Parent Company"].value_counts()

# Set a threshold for low-frequency parent companies
threshold = 15  # Adjust this threshold based on your preference

# Identify parent companies with frequency below the threshold
low_frequency_par = par_freq[par_freq < threshold].index

df2 = df.copy()
# Replace these low-frequency parent companies with a common label "Others"
df2["Parent Company"] = df2["Parent Company"].replace(low_frequency_par, 'Others')

df2["Parent Company"].value_counts()

In [None]:
# mean_sale_by_par = df2.groupby("Parent Company")["Sales (Domestic Ultimate Total USD)"].mean()
# df2["Sales (Domestic Ultimate Total USD)"] = np.log(df2["Sales (Domestic Ultimate Total USD)"])
plt.figure(figsize=(15, 10))
plt.scatter(data=df2[df2["Parent Company"] != "Others"], x="Sales (Domestic Ultimate Total USD)", y="Parent Company")

### Industry

In [None]:
df2 = df.copy()
# Find the frequency of each industry
sic_code_frequency = df2['Industry'].value_counts()

# Set a threshold for low-frequency industries
threshold = 10  # Adjust this threshold based on your preference

# Identify industries with frequency below the threshold
low_frequency_sic_codes = sic_code_frequency[sic_code_frequency < threshold].index

# Replace these low-frequency industries with a common label "Others"
df2['Industry'] = df2['Industry'].replace(low_frequency_sic_codes, 'Others')

df2['Industry'].value_counts()

In [None]:
largest_industries = df2["Industry"].value_counts()[:10].index
sns.barplot(data=df2[df2["Industry"].isin(largest_industries)], y="Industry", x="Sales (Domestic Ultimate Total USD)")

### Year Found


### Entity Type

In [None]:
df["Entity Type"].value_counts()

In [None]:
sns.barplot(data=df, y="Entity Type", x="Sales (Domestic Ultimate Total USD)")

### Parent Country and Global Ultimate Country

In [None]:
df2 = df.copy()
# Find the frequency of each country
pc_frequency = df2['Parent Country'].value_counts()
guc_frequency = df2['Global Ultimate Country'].value_counts()

# Set a threshold for low-frequency countries
threshold = 10  # Adjust this threshold based on your preference

# Identify countries with frequency below the threshold
low_frequency_pc = pc_frequency[pc_frequency < threshold].index
low_frequency_guc = guc_frequency[guc_frequency < threshold].index

# Replace these low-frequency countries with a common label "Others"
df2['Parent Country'] = df2['Parent Country'].replace(low_frequency_pc, 'Others')
df2['Global Ultimate Country'] = df2['Global Ultimate Country'].replace(low_frequency_guc, 'Others')

display(df2['Parent Country'].value_counts())
display(df2['Global Ultimate Country'].value_counts())

### Ownership Type


In [None]:
sns.barplot(data=df, y="Ownership Type", x="Sales (Domestic Ultimate Total USD)")

### Employees

### Sales (Global Ultimate Total USD)

In [None]:

sns.scatterplot(data=df, x="Sales (Global Ultimate Total USD)", y="Sales (Domestic Ultimate Total USD)")
plt.xscale('log')
plt.yscale('log')

### Import/Export Status

In [None]:
df2 = df.copy()
df2.loc[df2["Import/Export Status"].isna(), "Import/Export Status"] = "Missing"
df2["Import/Export Status"].value_counts()

In [None]:
sns.barplot(data=df2, y="Import/Export Status", x="Sales (Domestic Ultimate Total USD)")

### Fiscal Year End


In [None]:
df["Fiscal Year End"].value_counts()

### Is Domestic Ultimate and Is Global Ultimate

In [None]:
df2 = df.copy()
df2["Is Domestic Ultimate"] = df2["Is Domestic Ultimate"].astype('category')
df2["Is Global Ultimate"] = df2["Is Global Ultimate"].astype('category')
display(df2["Is Domestic Ultimate"].value_counts())
display(df2["Is Global Ultimate"].value_counts())

In [None]:
sns.barplot(data=df2, y="Is Domestic Ultimate", x="Sales (Domestic Ultimate Total USD)")

In [None]:
sns.barplot(data=df2, y="Is Global Ultimate", x="Sales (Domestic Ultimate Total USD)")

### Is Ultimate vs Entity Type

In [None]:
sns.barplot(data=df2, y="Entity Type", x="Sales (Domestic Ultimate Total USD)", hue="Is Domestic Ultimate")

In [None]:
sns.barplot(data=df2, y="Entity Type", x="Sales (Domestic Ultimate Total USD)", hue="Is Global Ultimate")

## 2.2 Feature Selection
We will remove features that are either:
- seemingly irrelevant to the output, judging by domain knowledge;
- low- or zero-variance factors.

We will also filter our dataset to select only the rows of data we are interested in.
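The low- or zero-variance check can be sketched with a simple count of unique values per column. This is a heuristic on a made-up frame, not the selection procedure used in this notebook. The `toy` columns are hypothetical stand-ins for a constant column (like Company Status after filtering to Active) and an identifier column (like AccountID).

```python
import pandas as pd

# Hypothetical toy frame: "status" has a single value, "id" is unique per row
toy = pd.DataFrame({
    "status": ["Active"] * 5,
    "id": ["A1", "A2", "A3", "A4", "A5"],
    "employees": [4, 10, 15, 39, 10],
})

n_unique = toy.nunique()

# Zero variance: the column has at most one distinct value
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Identifier-like: text columns whose value is different in every row
text_cols = toy.select_dtypes(include="object").columns
identifier_cols = [col for col in text_cols if n_unique[col] == len(toy)]

reduced = toy.drop(columns=constant_cols + identifier_cols)
print("Constant:", constant_cols)
print("Identifier-like:", identifier_cols)
print("Kept:", reduced.columns.tolist())
```

The identifier check is restricted to text columns on purpose: a continuous numeric feature can also be unique in every row and should not be dropped for that reason alone.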

In [None]:
# Specify columns to drop
columns_to_drop = ["error", "Fiscal Year End", "Sales (Global Ultimate Total USD)", "Global Ultimate Company", "Domestic Ultimate Company", "Web Address",
                   "Square Footage", "Company Description", "PostCode", "8-Digit SIC Code", "8-Digit SIC Description", "AccountID",
                   "Parent Company", "City", "Country", "Address", "Address1", "Industry", "Region", "Parent Country", "Global Ultimate Country", "Company"]

# Drop columns if they exist in the DataFrame
df4 = df3.drop(columns=[col for col in columns_to_drop if col in df3.columns], errors='ignore')

In [None]:
df4 = df4.dropna(subset=["Employees (Single Site)", "Employees (Domestic Ultimate Total)", "Employees (Global Ultimate Total)",
                                       "Year Found"])

### Data Type Conversion

We will need to convert some columns into datatypes that are suitable for analysis. This makes sure that the values in these fields make sense.

In [None]:
# Convert 'Is Domestic Ultimate' to True/False
df4['Is Domestic Ultimate'] = df4['Is Domestic Ultimate'] == 1
df4['Is Global Ultimate'] = df4['Is Global Ultimate'] == 1

In [None]:
df4.columns

## 2.3 Preprocessing

In [None]:
def clean(df):
  chosen_cols = [
    "LATITUDE",
    "LONGITUDE",
    # "Industry",
    "Entity Type",
    # "Parent Country",
    # "Global Ultimate Country",
    "Ownership Type",
    "Import/Export Status",
    "Is Domestic Ultimate",
    "Is Global Ultimate",
  ]

  df = df.copy()
  df = df[chosen_cols]

  df.loc[df["Import/Export Status"].isna(), "Import/Export Status"] = "Missing"

  df = df.dropna()

  return df

In [None]:
def preprocess(df):
  df = df.copy()
  df = pd.get_dummies(df, columns=['Entity Type'], prefix='EntityType')
  df = pd.get_dummies(df, columns=['Ownership Type'], prefix='OwnershipType')
  df = pd.get_dummies(df, columns=['Import/Export Status'], prefix='ImportExport')
  return df

In [None]:
df_processed = clean(df)
print(df_processed.shape)
df_processed.head()

## 2.4 Baseline Modelling

In [None]:
X = clean(df)
y = df.loc[X.index, "Sales (Domestic Ultimate Total USD)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [None]:
def cv_score(model, preprocess):
  FOLDS = 5
  print(X_train.shape, y_train.shape)
  X_processed = preprocess(X_train)
  scores = cross_val_score(model, X_processed, y_train, cv=FOLDS, scoring='r2')
  print(f"CV: {np.mean(scores):.4f} ({np.std(scores):.4f})")

In [None]:
rf = RandomForestRegressor()
cv_score(rf, preprocess)

In [None]:
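rf = RandomForestRegressor()
# Random forests need numeric inputs, so one-hot encode the categorical columns first
rf.fit(preprocess(X_train), y_train)


In [None]:
# R-squared on the training set (an optimistic estimate of performance)
r2_score(y_train, rf.predict(preprocess(X_train)))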

In [None]:
cat_features = list(X_train.select_dtypes(include=["category", "object"]).columns)
print(cat_features)

cat = CatBoostRegressor(cat_features=cat_features, verbose=0)

cat.fit(X_train, y_train)

In [None]:
# r2_score(y_train, cat.predict(X_train))
r2_score(y_test, cat.predict(X_test))

# 3 Model Training and Evaluation

## 3.1 Selecting a Model
Selecting a model involves a couple of decisions to make:

- Train-test split: the proportion of data used to train the model versus the proportion held out to test and evaluate it,
- Type of ML model used (well-known ones include Decision Trees, Random Forests, Support Vector Machines, Linear/Logistic Regression and Neural Networks).

In our case, we will use the Gradient Boosting Regressor provided by scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Separate features and target variable
X = df4.drop('Sales (Domestic Ultimate Total USD)', axis=1)
y = df4['Sales (Domestic Ultimate Total USD)']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the GradientBoostingRegressor
model = GradientBoostingRegressor(random_state=42)

### Cross-Validation

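We will also perform K-fold cross-validation. The training data is split into K folds; the model is trained on K-1 folds and tested on the remaining fold, and this is repeated K times so that each fold serves as the test set once. Averaging the K scores gives a more reliable performance estimate than a single split and helps reveal overfitting to one particular subset of the data.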
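To make the fold mechanics concrete, here is a minimal sketch on ten made-up samples. `X_toy` is an assumption for illustration and has nothing to do with the workshop data.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 made-up samples

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_indices_seen = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    # Each fold trains on 8 rows and tests on the remaining 2
    print(f"Fold {fold}: train on {len(train_idx)} rows, test on rows {test_idx.tolist()}")
    test_indices_seen.extend(test_idx.tolist())

# Across the 5 folds, every sample is used for testing exactly once
print(sorted(test_indices_seen))
```

Passing a `KFold` object as the `cv` argument of `cross_val_score`, as the next cell does, runs this same loop internally and returns one score per fold.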

In [None]:
# Lists to store results
n_folds_values = list(range(4, 16))
mean_r2_scores = []
std_r2_scores = []

# Iterate over different numbers of folds
for n_folds in n_folds_values:
    # Use k-fold cross-validation with the current number of folds
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Perform cross-validation and get R-squared scores
    cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')

    # Append mean and standard deviation of R-squared scores to lists
    mean_r2_scores.append(cv_scores.mean())
    std_r2_scores.append(cv_scores.std())

# Plot the results
plt.errorbar(n_folds_values, mean_r2_scores, yerr=std_r2_scores, marker='o', linestyle='-', label='R-squared scores')
plt.xlabel('Number of Folds')
plt.ylabel('R-squared Score')
plt.title('Cross-Validated R-squared Scores for Different Numbers of Folds')
plt.legend()
plt.show()

## 3.2 Model Evaluation Metrics

There are many ways to evaluate a Machine Learning model:

- Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE) for Regression Tasks;
- Confusion Matrix, AUC-ROC Curve for Classification Problems;
- and other variants of such metrics.

In this problem, MSE will be very large because sales figures are large by nature. We therefore opt for the R-squared score, which measures how well a regression model fits its data and does not depend on the scale of the target.
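For reference, R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. The sketch below computes it by hand on made-up numbers and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Baseline: predicting the mean of y_true for every row scores exactly 0
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_mean_baseline)
```

A score of 1 is a perfect fit and 0 matches the predict-the-mean baseline. A negative score, such as the -0.46 obtained by the first CatBoost baseline earlier in this notebook, means the model did worse than that baseline on the held-out data.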

In [None]:
# Use k-fold cross-validation with 10 folds
model_10 = GradientBoostingRegressor(random_state=42)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Perform cross-validation and get R-squared scores
cv_scores = cross_val_score(model_10, X_train, y_train, cv=kf, scoring='r2')

# Print the R-squared scores for each fold
print("Cross-Validation R-squared scores:", cv_scores)

# Print the mean and standard deviation of the R-squared scores
print("Mean R-squared score:", cv_scores.mean())
print("Standard Deviation of R-squared scores:", cv_scores.std())

# Train the model on the entire training set
model_10.fit(X_train, y_train)

# Evaluate the model on the test set
test_score = model_10.score(X_test, y_test)
print("Test R-squared score:", test_score)

This is a minimal submission. You can go further to improve the model, for example by creating artificial features, feature engineering or clustering.

# Saving and testing the model

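We will export our model to a file with joblib for use by others. Note that although the file is named with a .h5 extension, joblib writes its own pickle-based format, not Hierarchical Data Format 5 (HDF5), so the file must be loaded with joblib.load. A general use case is also covered in the form of a function below.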

In [None]:
import joblib

# Save the base model with joblib (pickle-based format, despite the .h5 extension)
joblib.dump(model_10, 'base_model.h5')

In [None]:
def test_model(data):
    # Assumes the data has already been cleaned and preprocessed
    # Load the saved model from disk
    loaded_model = joblib.load('./base_model.h5')
    predictions = loaded_model.predict(data)

    return predictions

# Extract the last row of the test set (double brackets keep it as a DataFrame)
last_row = X_test.iloc[[-1]]

# Make predictions on the last row
print(test_model(last_row))
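The save-and-load round trip above can be sanity-checked by confirming that a reloaded model reproduces the original predictions. The sketch below does this with a small model fitted on synthetic data. `toy_model`, `X_toy`, `y_toy` and the file name are all assumptions for illustration, not the workshop model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, made up for illustration
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

toy_model = GradientBoostingRegressor(n_estimators=10, random_state=42)
toy_model.fit(X_toy, y_toy)

# Write the model to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(toy_model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from sources you trust, and ideally with the same scikit-learn version that was used to save them.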

Feel free to explore the functions written above and play around with more models! You may be able to achieve a higher R-squared score.

Happy learning and coding!