# Modeling

In this notebook, we train two regression models, Linear Regression and Random Forest, to predict medical insurance costs. We evaluate each model on a held-out test set and compare them using Mean Absolute Error (MAE) and R-squared.
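
To make the two metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The function names `mean_absolute_error_manual` and `r2_score_manual` and the toy values are assumptions made for illustration; they are not part of the notebook or the dataset. The notebook itself uses `mean_absolute_error` and `r2_score` from scikit-learn.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen only for illustration (hypothetical, not real costs)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mae = mean_absolute_error_manual(y_true, y_pred)
r2 = r2_score_manual(y_true, y_pred)
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

MAE is expressed in the same units as the target, so lower is better; R-squared is unitless, and values closer to 1 indicate that the model explains more of the variance in the target.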

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Load the processed data
# NOTE: the filename mentions 'sleep'; confirm this path points to the insurance dataset
data = pd.read_csv('../data/processed/sleep_cleaned.csv')

# Display the first few rows of the dataset
data.head()

In [None]:
# Split the data into features and target variable
X = data.drop('Insurance_Cost', axis=1)  # Replace 'Insurance_Cost' with the actual target column name
y = data['Insurance_Cost']  # Replace with the actual target column name

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
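
The scaling step fits the scaler on the training set only and then reuses those statistics for the test set, so no information from the test data leaks into preprocessing. A minimal sketch of the same idea in plain Python follows; the `train_ages` and `test_ages` values are hypothetical and stand in for a single numeric feature.

```python
from statistics import mean, pstdev

# Hypothetical values for one numeric feature (not taken from the dataset)
train_ages = [20.0, 30.0, 40.0, 50.0]
test_ages = [35.0, 60.0]

# "fit": learn the mean and population standard deviation from the training data only
mu = mean(train_ages)
sigma = pstdev(train_ages)

# "transform": apply the training statistics to both sets
train_scaled = [(x - mu) / sigma for x in train_ages]
test_scaled = [(x - mu) / sigma for x in test_ages]

print(train_scaled)
print(test_scaled)
```

After this transformation the training feature has mean 0 and standard deviation 1, while the test feature generally does not, which is expected.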

In [None]:
# Train a Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_linear = linear_model.predict(X_test_scaled)

# Evaluate the model
mae_linear = mean_absolute_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print(f'Linear Regression MAE: {mae_linear:.2f}')
print(f'Linear Regression R^2: {r2_linear:.3f}')

In [None]:
# Train a Random Forest Regressor model
# Tree-based models do not require scaled features; the scaled data is reused here for consistency
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f'Random Forest MAE: {mae_rf:.2f}')
print(f'Random Forest R^2: {r2_rf:.3f}')

## Conclusion

In this notebook, we trained and evaluated Linear Regression and Random Forest models on the same train/test split. The model with the lower MAE and the higher R-squared on the test set is the better choice for predicting medical insurance costs.
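
One way to make the comparison explicit is to collect the metrics in a dictionary and pick the model programmatically. The sketch below is an assumption about how this could be done, not part of the original notebook: `pick_best_model` is a hypothetical helper, and the numbers in `example_results` are placeholders, not results from the insurance data. In the notebook, the placeholders would be replaced with `mae_linear`, `r2_linear`, `mae_rf`, and `r2_rf`.

```python
def pick_best_model(results):
    """Return the model name with the lowest MAE, breaking ties by highest R^2."""
    return min(results, key=lambda name: (results[name]["mae"], -results[name]["r2"]))


# Placeholder metrics for illustration only (hypothetical, not measured values)
example_results = {
    "Linear Regression": {"mae": 2.0, "r2": 0.70},
    "Random Forest": {"mae": 1.5, "r2": 0.80},
}

best = pick_best_model(example_results)
print(f"Best model by MAE: {best}")
```

Ranking by MAE first is a design choice; if the two metrics disagree, it may be worth inspecting the residuals before committing to one model.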