![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, you will predict customers' healthcare costs with machine learning. Your predictions will help the company tailor its services and help customers plan their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv

| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



The raw data need some cleaning before modeling; the preview further down, for instance, shows inconsistent capitalization in `region` and `$` prefixes in `charges`. Once your model is built on `insurance.csv`, the next step is to apply it to `validation_dataset.csv`. This second dataset is similar to your training data minus the `charges` column, and it tests the model's real-world utility by predicting costs for new customers.
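
Because the validation file is expected to mirror the training features without `charges`, it can help to confirm that assumption before fitting anything. The sketch below is one possible check, not part of the project itself: `missing_feature_columns` is a hypothetical helper, and the two small frames are hand-made stand-ins for the real CSV files.

```python
import pandas as pd

TARGET = "charges"

def missing_feature_columns(train: pd.DataFrame, new: pd.DataFrame, target: str = TARGET) -> list:
    """Return the training feature columns (all except the target) that `new` lacks."""
    features = [c for c in train.columns if c != target]
    return [c for c in features if c not in new.columns]

# Hand-made stand-ins for insurance.csv and validation_dataset.csv (illustrative values only)
train_sample = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 1725.5523],
})
validation_sample = train_sample.drop(columns=[TARGET])

gaps = missing_feature_columns(train_sample, validation_sample)
print("Missing feature columns:", gaps)
```

An empty list means every training feature is available for prediction; any name in the list would need to be handled before calling `predict`.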

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [9]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552
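
The preview already hints at two cleaning tasks: `region` mixes letter case (`Southeast` and `southeast`), and some `charges` values carry a `$` prefix. One way to find such entries across a whole column is to attempt a numeric conversion and list whatever fails. This is a minimal sketch on the five preview rows only; `unparseable_values` is a hypothetical helper name, not something the project provides.

```python
import pandas as pd

def unparseable_values(series: pd.Series) -> pd.Series:
    """Return the non-missing entries that pd.to_numeric cannot convert."""
    converted = pd.to_numeric(series, errors="coerce")
    return series[converted.isna() & series.notna()]

# The five rows shown in the preview, with charges kept as raw strings
preview = pd.DataFrame({
    "region": ["southwest", "Southeast", "southeast", "northwest", "northwest"],
    "charges": ["16884.924", "1725.5523", "$4449.462", "$21984.47061", "$3866.8552"],
})

# Entries in `charges` that are not plain numbers
bad_charges = unparseable_values(preview["charges"])
print(bad_charges.tolist())

# Region values that appear under more than one spelling once lower-cased
spellings = preview["region"].groupby(preview["region"].str.lower()).nunique()
print(spellings[spellings > 1].index.tolist())
```

On these five rows the check flags the three `$`-prefixed charges and the two spellings of `southeast`. Running the same idea on the full file would show whether other kinds of inconsistency exist beyond those visible in the preview.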


In [10]:
# Implement model creation and training here
# Use as many cells as you need

# Full solution, intended to run as a single cell
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score as _r2_score

# -------------------------
# Load data
# -------------------------
insurance = pd.read_csv("insurance.csv")
validation_data = pd.read_csv("validation_dataset.csv")

# -------------------------
# Clean target: 'charges'
# -------------------------
# Remove dollar signs and commas, coerce to float
insurance['charges'] = (insurance['charges']
                        .astype(str)
                        .str.replace(r'[\$,]', '', regex=True)
                        .str.strip()
                        .replace('', np.nan)
                        .astype(float))

# -------------------------
# Define columns
# -------------------------
categorical_cols = ['sex', 'smoker', 'region']
numeric_cols = ['age', 'bmi', 'children']

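# Basic standardization of string categories (helps avoid small text mismatches).
# .where(...notna()) keeps missing entries as NaN; a bare astype(str) would turn them
# into the literal string 'nan', which dropna and the imputer would not treat as missing.
for c in categorical_cols:
    insurance[c] = insurance[c].astype(str).str.strip().str.lower().where(insurance[c].notna())
    # Standardize the validation column the same way when it is present
    # (columns absent from the validation file are added further down)
    if c in validation_data.columns:
        validation_data[c] = (validation_data[c].astype(str).str.strip().str.lower()
                              .where(validation_data[c].notna()))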

# Ensure numeric columns are numeric
insurance[numeric_cols] = insurance[numeric_cols].apply(pd.to_numeric, errors='coerce')
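# Convert only the numeric columns actually present in the validation file;
# indexing a missing column here would raise a KeyError before it can be added further down
for c in numeric_cols:
    if c in validation_data.columns:
        validation_data[c] = pd.to_numeric(validation_data[c], errors='coerce')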

# Drop rows in training that still have missing critical values
insurance = insurance.dropna(subset=numeric_cols + categorical_cols + ['charges'])

# Split features/target
X = insurance.drop(columns=['charges'])
y = insurance['charges']

# -------------------------
# Preprocessing + model pipeline
# - numeric: median imputation
# - categorical: constant imputer + OneHotEncoder(handle_unknown='ignore')
# -------------------------
num_transformer = SimpleImputer(strategy='median')
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
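    # The `sparse` argument was renamed `sparse_output` in scikit-learn 1.2, so it is
    # omitted here to run on old and new versions; the random forest accepts sparse input
    ('ohe', OneHotEncoder(handle_unknown='ignore'))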
])

preprocessor = ColumnTransformer([
    ('num', num_transformer, numeric_cols),
    ('cat', cat_transformer, categorical_cols)
], remainder='drop')   # drop any other columns

model = Pipeline([
    ('pre', preprocessor),
    ('reg', RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1))
])

# -------------------------
# Train / evaluate
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)

# save R^2 in variable r2_score as required
r2_score = float(_r2_score(y_test, y_pred_test))
print("R^2 on holdout test set:", r2_score)

# -------------------------
# Predict on validation set
# -------------------------
# Ensure validation contains required feature columns; if not, add missing columns filled with NaN
for c in numeric_cols + categorical_cols:
    if c not in validation_data.columns:
        validation_data[c] = np.nan

# Make predictions (preprocessor handles missing values)
val_X = validation_data[numeric_cols + categorical_cols]
validation_preds = model.predict(val_X)

# Enforce minimum charge of 1000
validation_data['predicted_charges'] = np.maximum(validation_preds, 1000.0)

# Final objects required by the task:
# - r2_score  (float)
# - validation_data  (pandas DataFrame, with 'predicted_charges' column)
print("\nSample predictions:")
print(validation_data[['predicted_charges']].head())

# validation_data is the final DataFrame requested


R^2 on holdout test set: 0.8001308602531637

Sample predictions:
   predicted_charges
0        3173.637163
1       20345.595443
2       18516.302949
3       49032.662032
4        7119.909259
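
A single 80/20 split gives one R^2 estimate, which can shift with the random seed. The first cell imports `cross_val_score` without using it; the sketch below shows how that function could provide a k-fold estimate instead. It runs on synthetic data generated inside the snippet (the columns and the cost formula are made up for illustration), so its scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: NOT the insurance dataset
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n),
    "smoker": rng.integers(0, 2, size=n),  # already encoded as 0/1
})
# Made-up cost formula plus noise, for illustration only
y_demo = (250 * X_demo["age"] + 300 * X_demo["bmi"]
          + 20000 * X_demo["smoker"] + rng.normal(0, 2000, size=n))

reg = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validated R^2: one score per fold
cv_scores = cross_val_score(reg, X_demo, y_demo, cv=5, scoring="r2")
print("Fold R^2:", np.round(cv_scores, 3))
print("Mean R^2:", cv_scores.mean())
```

With the notebook's own objects, the equivalent call would presumably be `cross_val_score(model, X, y, cv=5, scoring="r2")`. Because `model` is a full pipeline, the imputers and encoder are refit inside each fold, which keeps the held-out fold out of the preprocessing step.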
