## Model Training

### Notes
#### Predictions will be based on
- Blood Type
- Medical Condition
- Insurance Provider
- Medication

#### 1.1 Import Data and Required Packages
##### Importing the Pandas, NumPy, Matplotlib, Seaborn, and Warnings libraries.

In [47]:
# Basic imports
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

#### Import the CSV Data as a Pandas DataFrame

In [48]:
# Read the dataset from CSV
df = pd.read_csv('data/healthcare_dataset.csv')

# Remove duplicate rows
df = df.drop_duplicates()

# Display the first 5 rows
df.head()


Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


#### Preparing X and Y variables

In [49]:
# Drop the target and every column not used as a feature
X = df.drop(columns=[
    'Billing Amount', 'Hospital', 'Doctor', 'Room Number', 'Admission Type',
    'Discharge Date', 'Date of Admission', 'Test Results', 'Age', 'Name', 'Gender',
])

In [50]:
X.head()

Unnamed: 0,Blood Type,Medical Condition,Insurance Provider,Medication
0,B-,Cancer,Blue Cross,Paracetamol
1,A+,Obesity,Medicare,Ibuprofen
2,A-,Obesity,Aetna,Aspirin
3,O+,Diabetes,Medicare,Ibuprofen
4,AB+,Cancer,Aetna,Penicillin
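
An alternative to dropping eleven columns is to select the four feature columns by name, which keeps `X` unchanged if the CSV later gains extra columns. The sketch below is only an illustration on a two-row toy frame; `feature_cols`, `toy_df`, `X_toy`, and `y_toy` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical list of the four features named in the Notes section.
feature_cols = ["Blood Type", "Medical Condition", "Insurance Provider", "Medication"]

# Toy stand-in for df (two rows, a subset of the real columns).
toy_df = pd.DataFrame({
    "Name": ["a", "b"],
    "Age": [30, 62],
    "Blood Type": ["B-", "A+"],
    "Medical Condition": ["Cancer", "Obesity"],
    "Insurance Provider": ["Blue Cross", "Medicare"],
    "Medication": ["Paracetamol", "Ibuprofen"],
    "Billing Amount": [18856.28, 33643.33],
})

# Select features explicitly instead of listing everything to drop.
X_toy = toy_df[feature_cols]
y_toy = toy_df["Billing Amount"]
print(X_toy.columns.tolist())
```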


In [51]:
print("Categories in 'Gender' variable:     ",end=" " )
print(df['Gender'].unique())

print("Categories in 'Medical Condition' variable:  ",end=" ")
print(df['Medical Condition'].unique())


print("Categories in 'Hospital' variable:     ",end=" " )
print(df['Hospital'].unique())

print("Categories in 'Insurance Provider' variable:     ",end=" " )
print(df['Insurance Provider'].unique())

print("Categories in 'Admission Type' variable:     ",end=" " )
print(df['Admission Type'].unique())

print("Categories in 'Test Results' variable:     ",end=" " )
print(df['Test Results'].unique())

print("Categories in 'Medication' variable:     ",end=" " )
print(df['Medication'].unique())

Categories in 'Gender' variable:      ['Male' 'Female']
Categories in 'Medical Condition' variable:   ['Cancer' 'Obesity' 'Diabetes' 'Asthma' 'Hypertension' 'Arthritis']
Categories in 'Hospital' variable:      ['Sons and Miller' 'Kim Inc' 'Cook PLC' ... 'Guzman Jones and Graves,'
 'and Williams, Brown Mckenzie' 'Moreno Murphy, Griffith and']
Categories in 'Insurance Provider' variable:      ['Blue Cross' 'Medicare' 'Aetna' 'UnitedHealthcare' 'Cigna']
Categories in 'Admission Type' variable:      ['Urgent' 'Emergency' 'Elective']
Categories in 'Test Results' variable:      ['Normal' 'Inconclusive' 'Abnormal']
Categories in 'Medication' variable:      ['Paracetamol' 'Ibuprofen' 'Aspirin' 'Penicillin' 'Lipitor']


In [52]:
y = df['Billing Amount']

In [53]:
y

0        18856.281306
1        33643.327287
2        27955.096079
3        37909.782410
4        14238.317814
             ...     
55495     2650.714952
55496    31457.797307
55497    27620.764717
55498    32451.092358
55499     4010.134172
Name: Billing Amount, Length: 54966, dtype: float64

In [57]:
# Create a ColumnTransformer with two transformers
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Z-score normalization for numeric columns
numeric_transformer = StandardScaler()

# One-hot encoding for categorical columns; works well when there are not too many unique values
oh_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

In [58]:
X = preprocessor.fit_transform(X)

In [59]:
X.shape

(54966, 24)
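
The width of the encoded matrix follows from how one-hot encoding works: each categorical feature contributes one column per distinct value. The outputs above list 6 medical conditions, 5 insurance providers, and 5 medications; assuming Blood Type has 8 distinct values (its categories are not printed above), that gives 8 + 6 + 5 + 5 = 24. The sketch below checks the same arithmetic on a small toy frame, not on the real data; `toy`, `expected_width`, and `encoded` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the four feature columns (values are illustrative only).
toy = pd.DataFrame({
    "Blood Type": ["A+", "B-", "O+", "A+"],
    "Medical Condition": ["Cancer", "Obesity", "Cancer", "Diabetes"],
    "Insurance Provider": ["Aetna", "Cigna", "Aetna", "Medicare"],
    "Medication": ["Aspirin", "Aspirin", "Lipitor", "Ibuprofen"],
})

# One output column per distinct value in each feature.
expected_width = int(toy.nunique().sum())

# The encoder should produce exactly that many columns.
encoded = OneHotEncoder().fit_transform(toy)
print(expected_width, encoded.shape)
```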

In [60]:
# Separate the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)

(43972, 24)
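
In the cells above the preprocessor is fitted on the full `X` before the split. With one-hot encoding of a small, fixed set of categories this probably has little practical effect, but a common pattern is to split first and fit the preprocessing on the training rows only. The following is a minimal sketch of that pattern on random toy data, using scikit-learn's `Pipeline`; the names `toy_X`, `toy_y`, and `pipe`, and the choice of `Ridge`, are assumptions for illustration and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Random toy data: two categorical features and an unrelated numeric target.
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "Medical Condition": rng.choice(["Cancer", "Obesity", "Diabetes"], size=n),
    "Medication": rng.choice(["Aspirin", "Lipitor", "Ibuprofen"], size=n),
})
toy_y = pd.Series(rng.uniform(1000, 50000, size=n), name="Billing Amount")

# Split the raw data first so the test rows never influence the encoder.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Preprocessing and model are fitted together, on the training rows only.
pipe = Pipeline([
    ("prep", ColumnTransformer([
        # handle_unknown="ignore" encodes unseen test categories as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), list(toy_X.columns)),
    ])),
    ("model", Ridge()),
])
pipe.fit(X_tr, y_tr)

# score() returns R² on the held-out rows.
test_r2 = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
```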


#### Create an Evaluation Function That Returns All Metrics After Model Training

In [18]:
def evaluate_model(true, predicted):
    """Return MAE, RMSE, and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
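
To help read the three metrics, here is a small worked example on made-up numbers (not taken from the dataset). It illustrates a general property of R²: a model that always predicts the mean of the true values scores exactly 0, and a model that does worse than that scores below 0. The arrays `y_true`, `y_mean`, and `y_bad` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up target values with mean 25.
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Predict the mean of y_true for every row.
y_mean = np.full_like(y_true, y_true.mean())

mae = mean_absolute_error(y_true, y_mean)            # (15 + 5 + 5 + 15) / 4
rmse = np.sqrt(mean_squared_error(y_true, y_mean))   # sqrt((225 + 25 + 25 + 225) / 4)
r2 = r2_score(y_true, y_mean)                        # 1 - 500 / 500

# Predictions in reversed order are worse than the mean, so R² is negative.
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)                     # 1 - 2000 / 500

print(mae, rmse, r2, r2_bad)
```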

In [19]:
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(), 
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list =[]

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 14188.3829
- Mean Absolute Error: 12270.3629
- R2 Score: 0.0006
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 14272.6909
- Mean Absolute Error: 12369.9907
- R2 Score: -0.0002


Lasso
Model performance for Training set
- Root Mean Squared Error: 14188.3879
- Mean Absolute Error: 12270.4864
- R2 Score: 0.0006
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 14272.4653
- Mean Absolute Error: 12369.7799
- R2 Score: -0.0002


Ridge
Model performance for Training set
- Root Mean Squared Error: 14188.3829
- Mean Absolute Error: 12270.3634
- R2 Score: 0.0006
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 14272.6901
- Mean Absolute Error: 12369.9901
- R2 Score: -0.0002


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 15154.2183
- Mean Absolute E

### Results

In [20]:
pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)

Unnamed: 0,Model Name,R2_Score
8,AdaBoost Regressor,-8.7e-05
1,Lasso,-0.00017
2,Ridge,-0.000201
0,Linear Regression,-0.000201
7,CatBoosting Regressor,-0.016299
6,XGBRegressor,-0.019749
4,Decision Tree,-0.026125
5,Random Forest Regressor,-0.026512
3,K-Neighbors Regressor,-0.191362


## Conclusion:
- All nine models show poor predictive power, with test R² scores at or slightly below zero (K-Neighbors Regressor is lowest, at about -0.19).
- This is consistent with the initial observation of uniform distributions in the data.
- The dataset is synthetic, and the resulting lack of meaningful variation across feature categories likely hampers the models' ability to make accurate predictions.
- Future improvements could include training on real-world data.
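
One way to probe the points above is to compare the mean billing amount across the categories of a feature: if the group means barely differ relative to the overall spread, that feature carries little signal for a regressor. The sketch below does this on randomly generated toy data that mimics a uniform, feature-independent target; it does not use the real dataset, and `toy`, `group_means`, and `ratio` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data: the target is drawn independently of the category.
rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame({
    "Medical Condition": rng.choice(["Cancer", "Obesity", "Diabetes"], size=n),
    "Billing Amount": rng.uniform(1000, 50000, size=n),
})

# Mean target value within each category.
group_means = toy.groupby("Medical Condition")["Billing Amount"].mean()

# Gap between the highest and lowest group mean, in units of the overall std.
spread = group_means.max() - group_means.min()
ratio = spread / toy["Billing Amount"].std()

print(group_means.round(2))
print(f"spread / std = {ratio:.3f}")
```

A ratio close to zero suggests the category does little to separate billing amounts; running the same check on `df` for each of the four features would show whether that holds for the real columns.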