In [2]:
import pandas as pd

data_path = 'medical_insurance.csv'
data = pd.read_csv(data_path)

data


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
2767,47,female,45.320,1,no,southeast,8569.86180
2768,21,female,34.600,0,no,southwest,2020.17700
2769,19,male,26.030,1,yes,northwest,16450.89470
2770,23,male,18.715,0,no,northwest,21595.38229


The dataset contains the following features:

    age: The age of the individual.
    sex: The gender of the individual (male or female).
    bmi: Body Mass Index, which provides an understanding of body weight in relation to height.
    children: The number of children/dependents covered by health insurance.
    smoker: Whether the individual is a smoker or not.
    region: The beneficiary's residential area in the US.
    charges: The individual medical costs billed by health insurance.

To predict insurance costs, we can train a machine learning model on these features. First the data must be prepared: converting categorical variables to numerical form, handling any missing values, and splitting the data into a training set and a testing set.
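A quick check for missing values and exact duplicate rows might look like the sketch below. The small `sample` frame is made up for illustration and only mimics some of the columns above; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame mimicking the dataset's columns (not real data)
sample = pd.DataFrame({
    "age": [19, 18, 28, 19],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 33.77, None, 27.9],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16884.924, 1725.5523, 4449.462, 16884.924],
})

# Count missing values per column
missing_per_column = sample.isna().sum()

# Count rows that are exact copies of an earlier row
n_duplicates = int(sample.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

Exact duplicates are worth knowing about before splitting, because a copy of a training row that lands in the test set can inflate the test score.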


## Linear Regression

Linear regression is a good starting point because it is simple and interpretable. If it proves insufficient, we can explore more complex models such as Random Forest or Gradient Boosting.

Before training the model, we need to prepare the data. This includes:

    Converting categorical variables into numeric format: We can use one-hot encoding to transform the categorical variables (sex, smoker, region) into a format that can be used to train the model.
    Splitting the data: We'll separate the data into a training set and a testing set to evaluate our model's performance on data it hasn't seen during training.
    Standardization: It's not strictly necessary for linear regression, but it can help with numerical stability and the interpretation of coefficients, especially if we decide to explore other models later on.
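To make the first and third steps concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical). It uses `pd.get_dummies` and plain arithmetic rather than the scikit-learn transformers, purely to show what the transformed columns look like.

```python
import pandas as pd

# Hypothetical toy frame (not real data) with one numeric and one categorical column
toy = pd.DataFrame({
    "bmi": [20.0, 30.0, 40.0],
    "region": ["southwest", "southeast", "southwest"],
})

# One-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)

# Standardization: subtract the mean and divide by the population standard
# deviation (ddof=0), which is the convention StandardScaler uses
encoded["bmi"] = (toy["bmi"] - toy["bmi"].mean()) / toy["bmi"].std(ddof=0)

print(encoded)
```

In a scikit-learn pipeline the means, standard deviations, and category lists are learned from the training set only and then reused on the test set, which this toy version does not show.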

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


# Column names grouped by type
numerical_cols = ['age', 'bmi', 'children']
categorical_cols = ['sex', 'smoker', 'region']

# Preprocessing for numerical and categorical columns
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Model
model = LinearRegression()

# Creating the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

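# Splitting the data into training and testing sets.
# X still contains the 'Unnamed: 0' index column from the CSV; it is not listed
# in numerical_cols or categorical_cols, so the ColumnTransformer drops it
# (remainder='drop' is the default).
X = data.drop('charges', axis=1)
y = data['charges']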
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training the model
pipeline.fit(X_train, y_train)

# Evaluating the model
score = pipeline.score(X_test, y_test)

score


0.7465278904179782

The linear regression model achieved a coefficient of determination ($R^2$) of approximately 0.75 on the test set, meaning it explains about 75% of the variance in insurance charges from the provided features. That is a reasonable result for a first attempt, and it leaves room for improvement, for instance by trying more complex models or refining the data preprocessing.
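For reference, the score reported by a scikit-learn regressor is the coefficient of determination, one minus the ratio of residual to total sum of squares. The sketch below computes it by hand on made-up numbers; the helper name `r_squared` and the values are illustrative only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up numbers for illustration only
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1500.0, 1500.0, 3500.0, 3500.0]
print(r_squared(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.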

## Random Forest Regressor Model

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Reusing the preprocessor defined above with a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)

# Creating a new pipeline for Random Forest
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rf_model)])

# Training the Random Forest model
rf_pipeline.fit(X_train, y_train)

# Evaluating the Random Forest model
rf_score = rf_pipeline.score(X_test, y_test)
rf_predictions = rf_pipeline.predict(X_test)
rf_rmse = sqrt(mean_squared_error(y_test, rf_predictions))

rf_score, rf_rmse


(0.9616343700872745, 2391.1580012338836)

The Random Forest Regressor achieved a coefficient of determination ($R^2$) of approximately 0.96 on the test set, so it explains about 96% of the variance in insurance charges. Its Root Mean Square Error (RMSE) is about 2391.16, in the same units as the charges. That is small relative to the charges in the rows displayed above, which range from roughly 1,700 to 22,000, so the typical prediction error is modest.

On this test split, the Random Forest clearly outperforms the linear regression model ($R^2$ of 0.96 versus 0.75), which is consistent with ensemble methods being better at capturing non-linear relationships and interactions between features.
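RMSE was only computed for the Random Forest here, so the comparison with linear regression rests on $R^2$ alone. A small helper like the one below could put both models on the same error scale; the function name `rmse` and the numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up numbers for illustration only
y_true = [2000.0, 4000.0, 6000.0, 8000.0]
y_pred = [2300.0, 3600.0, 6400.0, 7700.0]

error = rmse(y_true, y_pred)
relative_error = error / float(np.mean(y_true))  # RMSE as a fraction of the mean target
print(round(error, 2), round(relative_error, 3))
```

With the notebook's objects, `rmse(y_test, pipeline.predict(X_test))` would give the linear model's RMSE to set beside `rf_rmse`.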