# **<span style="color: green;">Predictive Regression Tool Using Gradio</span>**

## **1. Medical Insurance Cost**
---
Dataset: https://www.kaggle.com/datasets/mosapabdelghany/medical-insurance-cost-dataset

---

For Assessment 2, we are required to solve a Machine Learning problem of any type and to use Gradio to interact with the trained model. For this project, I chose a regression task: predicting medical insurance charges from the other columns in the dataset.

### **1.1 Importing**
---

In [1]:
import gradio as gr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_regression

### **1.2 Printing CSV and Data Checking**
---

In [2]:
mic = pd.read_csv('/kaggle/input/medical-insurance-cost-dataset/insurance.csv')
print(mic.shape)
print(mic.columns)
mic.head()

(1338, 7)
Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')


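| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |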


#### **1.21 Cleaning Data**
---
It is still good practice to check whether any cells contain missing data. Chaining the pandas methods `isna()` and `sum()` counts the missing values in each column; the output below shows that this dataset has none.

In [3]:
mic.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
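
Because the real dataset has no gaps, the check above only ever prints zeros. To show what the same idiom reports when values are actually missing, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from the insurance data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberately missing cells
toy = pd.DataFrame({
    "age": [19, np.nan, 28],
    "bmi": [27.9, 33.77, np.nan],
    "smoker": ["yes", "no", "no"],
})

# isna() returns a boolean frame; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summing the per-column counts gives the total number of missing cells
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A non-zero count in any column would be the signal to impute or drop rows before training.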

### **1.3 Regression Task**
---

In [4]:
X = mic.drop('charges', axis=1)
X = pd.get_dummies(X, columns=['sex', 'smoker', 'region'], drop_first=True)
y = mic['charges']

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)

LR = LinearRegression().fit(train_X, train_y)
DTR = DecisionTreeRegressor().fit(train_X, train_y)
RF = RandomForestRegressor().fit(train_X, train_y)
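
The effect of `drop_first=True` in the cell above is easy to miss: for each categorical column, the alphabetically first category gets no column of its own and is represented by zeros in the remaining dummy columns. The sketch below illustrates this on a hypothetical four-row frame (the `toy` data is made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame using two of the dataset's categorical columns
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
})

# drop_first=True drops the first category of each column ("female", "no")
encoded = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
print(encoded)
```

This is why a lookup such as `sex_female` finds no matching column later on: the baseline category is encoded by leaving `sex_male` at 0.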

### **1.4 Model Performance & Optimization**
---
The accuracy of the models has to be measured to see whether the tool is reliable enough to be deployed for people to use.

In [7]:
def evaluate(model, X_val, y_val):
    preds = model.predict(X_val)
    r2 = r2_score(y_val, preds)
    mae = mean_absolute_error(y_val, preds)
    mse = mean_squared_error(y_val, preds)
    rmse = np.sqrt(mse)  # reuse the MSE computed above
    return r2, mae, mse, rmse

# Linear Regression 
r2_lr, mae_lr, mse_lr, rmse_lr = evaluate(LR, val_X, val_y) 
print("\nLinear Regression") 
print(f"R²: {r2_lr:.2f}") 
print(f"MAE: {mae_lr:,.2f}") 
print(f"MSE: {mse_lr:,.2f}")
print(f"RMSE: {rmse_lr:,.2f}")
    
# Decision Tree 
r2_dt, mae_dt, mse_dt, rmse_dt = evaluate(DTR, val_X, val_y) 
print("\nDecision Tree") 
print(f"R²: {r2_dt:.2f}") 
print(f"MAE: {mae_dt:,.2f}") 
print(f"MSE: {mse_dt:,.2f}") 
print(f"RMSE: {rmse_dt:,.2f}")

# Random Forest 
r2_rf, mae_rf, mse_rf, rmse_rf = evaluate(RF, val_X, val_y) 
print("\nRandom Forest") 
print(f"R²: {r2_rf:.2f}") 
print(f"MAE: {mae_rf:,.2f}") 
print(f"MSE: {mse_rf:,.2f}")
print(f"RMSE: {rmse_rf:,.2f}")


Linear Regression
R²: 0.78
MAE: 4,181.19
MSE: 33,596,915.85
RMSE: 5,796.28

Decision Tree
R²: 0.71
MAE: 3,306.00
MSE: 44,361,200.17
RMSE: 6,660.42

Random Forest
R²: 0.86
MAE: 2,590.52
MSE: 21,461,042.41
RMSE: 4,632.61


#### **1.41 Optimization**
---
For MAE, MSE, and RMSE, lower values are better; for the R² score, higher is better. Of the models evaluated so far, Random Forest performed best out of the box on every metric. The Decision Tree had the lowest R² and the highest MSE and RMSE, although its MAE (3,306.00) was lower than that of Linear Regression (4,181.19).
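
To make the four metrics concrete, the following sketch computes them by hand with NumPy on four made-up predictions. The arrays `y_true` and `y_pred` are hypothetical values chosen for easy arithmetic, not model output from this notebook.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_pred

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))

# MSE: average squared error, which penalizes large misses more heavily
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.2f}")
```

Because MSE squares each error, a model with a few large misses can have a worse RMSE and R² than another model while still having a better MAE.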

##### **1.41a Linear Regression Optimization** >> (Gradient Descent)
---

##### **1.41b Decision Tree Optimization** >> (Hyperparameter Tuning)
---
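
One way hyperparameter tuning for a decision tree could look is a cross-validated grid search. The sketch below is an assumption rather than the method used in this project: it runs on synthetic data from `make_regression` instead of the insurance dataset, and the grid values in `param_grid` are hypothetical starting points.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the insurance dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical grid: limit tree depth and leaf size to curb overfitting
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training split, scored by R²
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
# Score the refitted best tree on the held-out validation split
val_r2 = search.best_estimator_.score(X_va, y_va)
print(best_params)
print(f"Validation R²: {val_r2:.2f}")
```

Keeping the validation split out of the search means the reported R² is not inflated by the tuning itself.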

##### **1.41c Random Forest Optimization** >> ()
---

### **1.5 Functions**
---

#### **1.51 Function 1**
---

In [5]:
def build_input(age, sex, bmi, children, smoker, region):
    input_row = pd.DataFrame(0, index=[0], columns=X.columns)

    input_row['age'] = age
    input_row['bmi'] = bmi
    input_row['children'] = children

    # one-hot categorical
    col = f"sex_{sex}"
    if col in input_row.columns:
        input_row[col] = 1

    col = f"smoker_{smoker}"
    if col in input_row.columns:
        input_row[col] = 1

    col = f"region_{region}"
    if col in input_row.columns:
        input_row[col] = 1

    return input_row

#### **1.52 Function 2**
---

In [6]:
def predict(model, age, sex, bmi, children, smoker, region):
    # Dataset categories are lowercase, but the UI capitalizes the first letter for display
    sex = sex.lower()
    smoker = smoker.lower()
    region = region.lower()

    # sex, smoker, and region are categorical strings that build_input one-hot encodes
    input_df = build_input(age, sex, bmi, children, smoker, region)

    # Map each dropdown label to its fitted model
    models = {'Linear Regression': LR, 'Decision Tree': DTR, 'Random Forest': RF}
    if model not in models:
        # Return a message instead of silently returning None
        return f"Unknown model: {model}"

    pred = models[model].predict(input_df)[0]
    return f"${pred:,.2f}"

### **1.6 Gradio Interface**
---

In [8]:
inputs = [
    gr.Dropdown(["Linear Regression", "Decision Tree", "Random Forest"], value="Random Forest", label="Model"),
    gr.Number(label="Age", minimum=0),
    gr.Dropdown(["Male", "Female"], label="Sex"),
    gr.Number(label="BMI"),
    gr.Number(label="Number of children"),
    gr.Dropdown(["No", "Yes"], label="Smoker"),
    gr.Dropdown(["Northeast", "Northwest", "Southeast", "Southwest"], label="Location")
]

outputs = gr.Textbox(label="Predicted Charge")

PredictiveTool = gr.Interface(fn=predict, inputs=inputs, outputs=outputs, title="Medical Insurance Cost")
PredictiveTool.launch()

* Running on local URL:  http://127.0.0.1:7860
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://6d7e402b3c7c6a163b.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## **2. Deliverables**

### **2.1 Dataset Description**

### **2.2 Preprocessing**

### **2.3 Model Implementation**

### **2.4 Results**

### **2.5 Interpretation**

### **2.6 Critical Reflection**
---

Initially, 