# **<span style="color: green;">Predictive Regression Tool Using Gradio</span>**

## **1. Medical Insurance Cost**
---
Dataset source: https://www.kaggle.com/datasets/mosapabdelghany/medical-insurance-cost-dataset

---

For Assessment 2, the brief is to solve a machine learning problem of any type and to serve the resulting model through Gradio. This project tackles a regression task: predicting the `charges` column of the medical insurance dataset from the other six columns.

### **1.1 Importing**
---

In [1]:
import gradio as gr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_regression

### **1.2 Printing CSV and Data Checking**
---

In [2]:
mic = pd.read_csv('/kaggle/input/medical-insurance-cost-dataset/insurance.csv')
print(mic.shape)
print(mic.columns)
mic.head()

(1338, 7)
Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')


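| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |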


#### **1.21 Cleaning Data**
---
Before modelling, it is worth checking whether any cells hold missing values. Chaining pandas' `isna()` and `sum()` counts the missing cells in each column; a count of zero everywhere means no rows need to be dropped or imputed.

In [3]:
mic.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
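
To illustrate what `isna().sum()` reports when data actually is missing, here is a small sketch on a made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the insurance dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells (not the insurance data)
toy = pd.DataFrame({
    "age": [19, np.nan, 28],
    "bmi": [27.9, 33.77, np.nan],
    "smoker": ["yes", "no", "no"],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells across the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

If any count were non-zero on the real data, the affected rows would need to be dropped or imputed before fitting the models.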

### **1.3 Regression Task**
---

In [4]:
X = mic.drop('charges', axis=1)
X = pd.get_dummies(X, columns=['sex', 'smoker', 'region'], drop_first=True)
y = mic['charges']

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)

LR = LinearRegression().fit(train_X, train_y)
DTR = DecisionTreeRegressor().fit(train_X, train_y)
RF = RandomForestRegressor().fit(train_X, train_y)
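
The `pd.get_dummies(..., drop_first=True)` call above turns each categorical column into 0/1 indicator columns and drops the first category of each, which then acts as the implicit baseline. The sketch below shows the resulting column names on a few hypothetical rows; the `toy` frame is made up for illustration and only mirrors the dataset's categorical column names.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's categorical column names
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northeast", "northwest"],
})

# Categories are sorted alphabetically and the first one per column is dropped,
# so "female", "no" and "northeast" become the all-zero baseline
encoded = pd.get_dummies(toy, columns=["sex", "smoker", "region"], drop_first=True)
print(list(encoded.columns))
```

Dropping one category per feature avoids perfectly collinear columns, which matters mostly for the linear model; the tree-based models would work either way.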

### **1.4 Model Performance & Optimization**
---
The models' predictive performance has to be measured to judge whether the tool is reliable enough to be deployed for people to use.

In [5]:
def evaluate(model, X_val, y_val):
    preds = model.predict(X_val)
    r2 = r2_score(y_val, preds)
    mae = mean_absolute_error(y_val, preds)
    mse = mean_squared_error(y_val, preds)
    rmse = np.sqrt(mse)  # reuse the MSE computed above
    return r2, mae, mse, rmse

# Linear Regression 
r2_lr, mae_lr, mse_lr, rmse_lr = evaluate(LR, val_X, val_y) 
print("\nLinear Regression") 
print(f"R²: {r2_lr:.2f}") 
print(f"MAE: {mae_lr:,.2f}") 
print(f"MSE: {mse_lr:,.2f}")
print(f"RMSE: {rmse_lr:,.2f}")
    
# Decision Tree 
r2_dt, mae_dt, mse_dt, rmse_dt = evaluate(DTR, val_X, val_y) 
print("\nDecision Tree") 
print(f"R²: {r2_dt:.2f}") 
print(f"MAE: {mae_dt:,.2f}") 
print(f"MSE: {mse_dt:,.2f}") 
print(f"RMSE: {rmse_dt:,.2f}")

# Random Forest 
r2_rf, mae_rf, mse_rf, rmse_rf = evaluate(RF, val_X, val_y) 
print("\nRandom Forest") 
print(f"R²: {r2_rf:.2f}") 
print(f"MAE: {mae_rf:,.2f}") 
print(f"MSE: {mse_rf:,.2f}")
print(f"RMSE: {rmse_rf:,.2f}")


Linear Regression
R²: 0.78
MAE: 4,181.19
MSE: 33,596,915.85
RMSE: 5,796.28

Decision Tree
R²: 0.73
MAE: 3,199.41
MSE: 41,951,477.91
RMSE: 6,477.00

Random Forest
R²: 0.86
MAE: 2,552.78
MSE: 21,237,081.99
RMSE: 4,608.37


#### **1.41 Optimization**
---
For MAE, MSE, and RMSE, lower values are better; for R², higher is better. Of the three untuned models, Random Forest performed best on every metric (R² 0.86, RMSE 4,608.37). The Decision Tree had the lowest R² (0.73) and the highest RMSE (6,477.00), although its MAE (3,199.41) was lower than Linear Regression's (4,181.19).
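
To make the four metrics concrete, the sketch below computes them by hand on four made-up values and checks the results against scikit-learn. The `y_true` and `y_pred` arrays are hypothetical and unrelated to the insurance data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_pred                         # [-10, 10, -30, 20]
mae = np.mean(np.abs(errors))                    # (10 + 10 + 30 + 20) / 4 = 17.5
mse = np.mean(errors ** 2)                       # (100 + 100 + 900 + 400) / 4 = 375.0
rmse = np.sqrt(mse)                              # square root puts the error back in target units

# R² compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)                     # 1500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000
r2 = 1 - ss_res / ss_tot                         # 0.97

# The hand-computed values should agree with scikit-learn
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Because MSE squares each error, a few large misses inflate MSE and RMSE much more than MAE, which is one reason a model can rank differently on MAE than on RMSE.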

##### **1.41a Linear Regression Optimization** >> (Ridge)
---
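One possible way to approach the Ridge step named in this heading is sketched below. It is not the notebook's own code: the data comes from `make_regression` as a synthetic stand-in for `train_X` and `train_y`, and the `alphas` grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption, not the insurance set)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Candidate regularisation strengths (arbitrary illustrative grid)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Score each alpha with 5-fold cross-validated RMSE
cv_rmse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[alpha] = float(np.sqrt(-scores.mean()))

# Keep the alpha with the lowest cross-validated RMSE and refit on all the data
best_alpha = min(cv_rmse, key=cv_rmse.get)
ridge = Ridge(alpha=best_alpha).fit(X_demo, y_demo)
print(best_alpha, round(cv_rmse[best_alpha], 2))
```

Ridge adds an L2 penalty on the coefficients, so it tends to help most when features are correlated or numerous; with only a handful of features the gain over plain Linear Regression may be small.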

##### **1.41b Decision Tree Optimization** >> (Hyperparameter Tuning)
---

In [6]:
# code used from geeksforgeeks, using bayesian optimization (https://www.geeksforgeeks.org/machine-learning/how-to-tune-a-decision-tree-in-hyperparameter-tuning/)
def dt_mic(max_depth, min_samples_split, min_samples_leaf):
    estimator = DecisionTreeRegressor(
        max_depth=int(max_depth),
        min_samples_split=int(min_samples_split),
        min_samples_leaf=int(min_samples_leaf),
        random_state=2
    )
    cval = cross_val_score(estimator, train_X, train_y, scoring='neg_mean_squared_error', cv=5)
    return cval.mean()

param_bounds = {
    'max_depth': (1, 20),
    'min_samples_split': (2, 20),
    'min_samples_leaf': (1, 20)
}

optimizer = BayesianOptimization(
    f=dt_mic,
    pbounds=param_bounds,
    random_state=42,
)

optimizer.maximize(n_iter=25, init_points=5)
best_params_bayes = optimizer.max['params']
best_params_bayes['max_depth'] = int(best_params_bayes['max_depth'])
best_params_bayes['min_samples_split'] = int(best_params_bayes['min_samples_split'])
best_params_bayes['min_samples_leaf'] = int(best_params_bayes['min_samples_leaf'])
best_score_bayes = optimizer.max['target']

print(f"Best Parameters (Bayesian Optimization): {best_params_bayes}")
print(f"Best Score (Bayesian Optimization): {best_score_bayes}")

|   iter    |  target   | max_depth | min_sa... | min_sa... |
-------------------------------------------------------------
| 1         | -2.46e+07 | 8.1162622 | 19.112857 | 14.907884 |
| 2         | -3.28e+07 | 12.374511 | 4.8083355 | 3.9638958 |
| 3         | -2.57e+07 | 2.1035886 | 17.591170 | 12.421185 |
| 4         | -2.42e+07 | 14.453378 | 2.3705208 | 19.428287 |
| 5         | -2.96e+07 | 16.816410 | 5.8221039 | 4.4546743 |
| 6         | -2.39e+07 | 11.385296 | 10.542765 | 20.0      |
| 7         | -2.39e+07 | 20.0      | 14.138457 | 20.0      |
| 8         | -2.55e+07 | 20.0      | 20.0      | ...

Adapting for model use
---

In [7]:
DT_Bayes = DecisionTreeRegressor( 
    max_depth=best_params_bayes['max_depth'], 
    min_samples_split=best_params_bayes['min_samples_split'], 
    min_samples_leaf=best_params_bayes['min_samples_leaf'], 
    random_state=42) 

DT_Bayes.fit(train_X, train_y)

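# optimizer.max['target'] is the mean negative MSE from 5-fold cross-validation
# on the training set, so negating it gives the cross-validated MSE
cv_mse = -best_score_bayes
cv_rmse = np.sqrt(cv_mse)

# R² is measured on the held-out validation set, unlike the cross-validated RMSE above
preds = DT_Bayes.predict(val_X)
r2 = r2_score(val_y, preds)

print(f"Optimized R²: {r2:,.4f}")
print(f"Optimized RMSE: {cv_rmse:,.2f}")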

Optimized R²: 0.8653
Optimized RMSE: 4,893.66
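
The RMSE printed above is derived from the optimizer's cross-validation score, while the R² is computed on the validation split, so the two numbers describe different data. The sketch below shows both kinds of RMSE side by side. It uses synthetic data from `make_regression` and arbitrary tree settings as assumptions, not the notebook's tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data and an 80/20 split (assumptions for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Arbitrary hyperparameters, not the ones found by the Bayesian search
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse_folds = cross_val_score(tree, X_tr, y_tr, scoring="neg_mean_squared_error", cv=5)
cv_rmse = float(np.sqrt(-neg_mse_folds.mean()))   # RMSE estimated on the training folds

# RMSE on the held-out validation split, comparable with the earlier evaluate() numbers
tree.fit(X_tr, y_tr)
val_rmse = float(np.sqrt(mean_squared_error(y_va, tree.predict(X_va))))

print(f"CV RMSE: {cv_rmse:,.2f}")
print(f"Validation RMSE: {val_rmse:,.2f}")
```

Reporting the validation RMSE alongside the validation R² keeps the tuned model directly comparable with the untuned results in section 1.4.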


##### **1.41c Random Forest Optimization** >> (Hyperparameter Tuning)
---

In [8]:
def rf_mic(n_estimators, max_depth, min_samples_split):
    estimator = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_split=int(min_samples_split),
        random_state=42
    )
    cval = cross_val_score(estimator, train_X, train_y, scoring='neg_mean_squared_error', cv=5)
    return cval.mean()

param_bounds = {
    'n_estimators': (50, 300),
    'max_depth': (3, 20),
    'min_samples_split': (2, 20)
}

optimizer = BayesianOptimization(
    f=rf_mic,
    pbounds=param_bounds,
    random_state=42,
)

optimizer.maximize(init_points=5, n_iter=25)

best_params_bayes = optimizer.max['params']
best_params_bayes['n_estimators'] = int(best_params_bayes['n_estimators'])
best_params_bayes['max_depth'] = int(best_params_bayes['max_depth'])
best_params_bayes['min_samples_split'] = int(best_params_bayes['min_samples_split'])
best_score_bayes = optimizer.max['target']

print(f"Best Parameters (Bayesian Optimization): {best_params_bayes}")
print(f"Best Score (Bayesian Optimization): {best_score_bayes}")

|   iter    |  target   | n_esti... | max_depth | min_sa... |
-------------------------------------------------------------
| 1         | -2.28e+07 | 143.63502 | 19.162143 | 15.175890 |
| 2         | -2.20e+07 | 199.66462 | 5.6523168 | 4.8079013 |
| 3         | -2.30e+07 | 64.520903 | 17.724994 | 12.820070 |
| 4         | -2.19e+07 | 227.01814 | 3.3499364 | 19.458377 |
| 5         | -2.24e+07 | 258.11066 | 6.6097648 | 5.2728494 |
| 6         | -2.16e+07 | 225.18033 | 4.2505167 | 19.573685 |
| 7         | -2.19e+07 | 219.23389 | 5.3204561 | 17.235988 |
| 8         | -2.25e+07 | 224.71841 | 11.136050 | ...

Adapting for model use
---

In [9]:
RF_Bayes = RandomForestRegressor( 
    n_estimators=best_params_bayes['n_estimators'], 
    max_depth=best_params_bayes['max_depth'], 
    min_samples_split=best_params_bayes['min_samples_split'], 
    random_state=42) 

RF_Bayes.fit(train_X, train_y)

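# optimizer.max['target'] is the mean negative MSE from 5-fold cross-validation
# on the training set, so negating it gives the cross-validated MSE
cv_mse = -best_score_bayes
cv_rmse = np.sqrt(cv_mse)

# R² is measured on the held-out validation set, unlike the cross-validated RMSE above
preds = RF_Bayes.predict(val_X)
r2 = r2_score(val_y, preds)

print(f"Optimized R²: {r2:,.4f}")
print(f"Optimized RMSE: {cv_rmse:,.2f}")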

Optimized R²: 0.8716
Optimized RMSE: 4,656.96


### **1.5 Functions**
---

#### **1.51 Function 1: Building the Model Input**
---

In [10]:
def build_input(age, sex, bmi, children, smoker, region):
    # Start from an all-zero row with the same columns the models were trained on
    input_row = pd.DataFrame(0, index=[0], columns=X.columns)

    # Numeric features are copied over directly
    input_row['age'] = age
    input_row['bmi'] = bmi
    input_row['children'] = children

    # One-hot categorical features: set the matching dummy column to 1.
    # Baseline categories dropped by drop_first=True have no column and stay all-zero.
    for col in (f"sex_{sex}", f"smoker_{smoker}", f"region_{region}"):
        if col in input_row.columns:
            input_row[col] = 1

    return input_row
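
A quick way to sanity-check this encoding is to build two rows and inspect them. The sketch below is self-contained, so it uses a hypothetical `build_row` helper that takes the column list as an argument instead of reading the notebook's `X`. The `feature_columns` list is an assumption about what `get_dummies(..., drop_first=True)` produces for this dataset.

```python
import pandas as pd

# Assumed training columns after one-hot encoding with drop_first=True
feature_columns = ["age", "bmi", "children", "sex_male", "smoker_yes",
                   "region_northwest", "region_southeast", "region_southwest"]

def build_row(columns, age, sex, bmi, children, smoker, region):
    """Hypothetical stand-in for build_input that takes the columns explicitly."""
    row = pd.DataFrame(0, index=[0], columns=columns)
    row["age"] = age
    row["bmi"] = bmi
    row["children"] = children
    # Set the dummy column for each category if it exists; baselines stay zero
    for col in (f"sex_{sex}", f"smoker_{smoker}", f"region_{region}"):
        if col in row.columns:
            row[col] = 1
    return row

# A smoker from the southwest: smoker_yes and region_southwest are set
smoker_row = build_row(feature_columns, 19, "female", 27.9, 0, "yes", "southwest")
# All baseline categories: every dummy column stays zero
baseline_row = build_row(feature_columns, 40, "female", 25.0, 2, "no", "northeast")

print(smoker_row.iloc[0].to_dict())
print(baseline_row.iloc[0].to_dict())
```

Note that a misspelled category (for example `"south west"`) would also produce an all-zero row and silently be treated as the baseline, so validating the inputs before encoding may be worthwhile.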

#### **1.52 Function 2: Predicting the Charge**
---

In [11]:
def predict(model, age, sex, bmi, children, smoker, region):
    # Dataset categories are lowercase, but the UI capitalises them for display
    sex = sex.lower()
    smoker = smoker.lower()
    region = region.lower()

    # sex, smoker, and region are categorical strings that build_input one-hot encodes
    input_df = build_input(age, sex, bmi, children, smoker, region)

    # Map each dropdown label to its fitted model
    models = {
        'Linear Regression': LR,
        'Decision Tree': DTR,
        'Random Forest': RF,
        'Optimized DT': DT_Bayes,
        'Optimized RF': RF_Bayes,
    }

    # Return a readable message instead of None if the label is not recognised
    if model not in models:
        return f"Unknown model: {model}"

    preds = models[model].predict(input_df)[0]
    return f"${preds:,.2f}"

### **1.6 Gradio Interface**
---

In [12]:
inputs = [
    gr.Dropdown(["Linear Regression", "Decision Tree", "Random Forest", "Optimized DT", "Optimized RF"], value="Random Forest", label="Model"),
    gr.Number(label="Age", minimum=0),
    gr.Dropdown(["Male", "Female"], label="Sex"),
    gr.Number(label="BMI"),
    gr.Number(label="Number of children"),
    gr.Dropdown(["No", "Yes"], label="Smoker"),
    gr.Dropdown(["Northeast", "Northwest", "Southeast", "Southwest"], label="Location")
]

outputs = gr.Textbox(label="Predicted Charge")

PredictiveTool = gr.Interface(fn=predict, inputs=inputs, outputs=outputs, title="Medical Insurance Cost")
PredictiveTool.launch()

* Running on local URL:  http://127.0.0.1:7860
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://bd44c70fdec3faf6f3.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## **2. Deliverables**

### **2.1 Dataset Description**

### **2.2 Preprocessing**

### **2.3 Model Implementation**

### **2.4 Results**

### **2.5 Interpretation**

### **2.6 Critical Reflection**
---

Initially, 