![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a health insurance company, your task is to predict customer healthcare costs with machine learning. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [1]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

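|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |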


In [2]:
# Implement model creation and training here
# Use as many cells as you need

In [3]:
# Display summary statistics
print(insurance.describe())

# Display dataset information
print(insurance.info())

               age          bmi     children
count  1272.000000  1272.000000  1272.000000
mean     35.214623    30.560550     0.948899
std      22.478251     6.095573     1.303532
min     -64.000000    15.960000    -4.000000
25%      24.750000    26.180000     0.000000
50%      38.000000    30.210000     1.000000
75%      51.000000    34.485000     2.000000
max      64.000000    53.130000     5.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB
None


In [4]:
# Display unique values for categorical columns
print("Unique values in 'sex' column:")
print(insurance['sex'].unique())

print("\nUnique values in 'smoker' column:")
print(insurance['smoker'].unique())

print("\nUnique values in 'region' column:")
print(insurance['region'].unique())

Unique values in 'sex' column:
['female' 'male' 'woman' 'F' 'man' nan 'M']

Unique values in 'smoker' column:
['yes' 'no' nan]

Unique values in 'region' column:
['southwest' 'Southeast' 'southeast' 'northwest' 'Northwest' 'Northeast'
 'northeast' 'Southwest' nan]


## Data Cleaning and Preprocessing

In [5]:
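def clean_dataset(insurance):
    # Standardize the alternative spellings in 'sex' (this modifies the DataFrame passed in)
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    # Strip the dollar sign from 'charges' and convert to float (the raw string avoids an invalid-escape warning)
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    # Keep rows with a positive age; .copy() makes the later assignments act on an independent frame
    insurance = insurance[insurance["age"] > 0].copy()
    # Treat negative dependent counts as zero
    insurance.loc[insurance["children"] < 0, "children"] = 0
    # Lowercase region names so 'Southeast' and 'southeast' match
    insurance["region"] = insurance["region"].str.lower()
    # Drop any rows that still contain missing values
    return insurance.dropna()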

In [6]:
# Clean the dataset
cleaned_insurance = clean_dataset(insurance)
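
The cleaning function above changes some columns of the DataFrame passed to it and returns a filtered copy for the rest. As a minimal sketch, a variant that leaves its input untouched could look like the following. The name `clean_copy` and the four hand-made rows are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

def clean_copy(df):
    """Return a cleaned copy of df without modifying the caller's frame."""
    out = df.copy()
    # Map the alternative spellings onto 'male' / 'female'
    out["sex"] = out["sex"].replace(
        {"M": "male", "man": "male", "F": "female", "woman": "female"}
    )
    # Remove a literal dollar sign, then convert to numbers
    charges_text = out["charges"].astype(str).str.replace("$", "", regex=False)
    out["charges"] = pd.to_numeric(charges_text, errors="coerce")
    # Keep rows with a positive age
    out = out[out["age"] > 0].copy()
    # Treat negative dependent counts as zero
    out.loc[out["children"] < 0, "children"] = 0
    # Lowercase region names
    out["region"] = out["region"].str.lower()
    # Drop rows that still contain missing values
    return out.dropna()

# Made-up rows that mimic the problems seen in the real data
raw = pd.DataFrame({
    "age": [19.0, -18.0, 28.0, 33.0],
    "sex": ["female", "M", "man", "F"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "children": [0.0, 1.0, -3.0, 0.0],
    "smoker": ["yes", "no", "no", None],
    "region": ["southwest", "Southeast", "Southeast", "northwest"],
    "charges": ["16884.924", "1725.5523", "$4449.462", "$21984.47061"],
})

cleaned = clean_copy(raw)
print(cleaned)
```

With this variant, inspecting `raw` after the call still shows the original values, so a before-and-after comparison compares two distinct frames.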

In [7]:
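# Display unique values after standardizing
# Note: this inspects `insurance`, which clean_dataset only partly modified in place;
# the fully cleaned rows (lowercased regions, no missing values) are in `cleaned_insurance`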
print("\nUnique values after cleaning:")
print("Unique values in 'sex' column after cleaning:")
print(insurance['sex'].unique())

print("\nUnique values in 'smoker' column after cleaning:")
print(insurance['smoker'].unique())

print("\nUnique values in 'region' column after cleaning:")
print(insurance['region'].unique())


Unique values after cleaning:
Unique values in 'sex' column after cleaning:
['female' 'male' nan]

Unique values in 'smoker' column after cleaning:
['yes' 'no' nan]

Unique values in 'region' column after cleaning:
['southwest' 'Southeast' 'southeast' 'northwest' 'Northwest' 'Northeast'
 'northeast' 'Southwest' nan]


In [8]:
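# Display summary statistics after cleaning
# Note: `insurance` still contains the rows with negative ages and child counts;
# the filtered data is in `cleaned_insurance`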
print(insurance.describe())

# Display dataset information after cleaning
print(insurance.info())

# Confirming 'charges' column type
print("\nData type of 'charges' column:")
print(insurance['charges'].dtype)

               age          bmi     children       charges
count  1272.000000  1272.000000  1272.000000   1272.000000
mean     35.214623    30.560550     0.948899  13286.594477
std      22.478251     6.095573     1.303532  12142.505233
min     -64.000000    15.960000    -4.000000   1121.873900
25%      24.750000    26.180000     0.000000   4733.582163
50%      38.000000    30.210000     1.000000   9382.033000
75%      51.000000    34.485000     2.000000  16579.959053
max      64.000000    53.130000     5.000000  63770.428010
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1272 non-null   float64
dtypes: floa

## Encoding Categorical Variables
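We'll use one-hot encoding for the categorical variables (`sex`, `smoker`, and `region`) to convert them into numeric columns suitable for machine learning models. The first level of each variable is dropped (`drop_first=True`), so it serves as the baseline category.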
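
To make the encoding concrete, here is a small sketch on a made-up three-row frame (the rows are illustrative, not taken from `insurance.csv`). It shows which dummy columns `pd.get_dummies` produces when `drop_first=True`.

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Each categorical column becomes one dummy column per level,
# minus the first level in sorted order, which is dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['sex_male', 'smoker_yes', 'region_southeast', 'region_southwest']
```

Here `female`, `no`, and `northwest` are the dropped baselines because they sort first among the levels present. The baseline therefore depends on which levels appear in the data being encoded.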

## Splitting Data into Training and Test Sets
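Rather than a single train/test split, the code below evaluates the model with 5-fold cross-validation (`cross_val_score` with `cv=5`), so every row is used for both training and testing across the folds.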

## Model Building and Evaluation
We'll use a Linear Regression model and evaluate it using the mean squared error (MSE) and the R-squared score, each averaged over the five folds.
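
One way to keep scaling and encoding inside the model, so they are fitted once on raw features and reapplied automatically at prediction time, is a `ColumnTransformer` inside the `Pipeline`. The sketch below is an alternative arrangement, not the notebook's code, and it runs on synthetic data. The generated rows, coefficients, and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned insurance data
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n).astype(float),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n).astype(float),
    "sex": rng.choice(["female", "male"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
# Made-up relationship between features and charges, plus noise
demo["charges"] = (
    250 * demo["age"]
    + 300 * demo["bmi"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 1000, n)
)

# Scale numeric columns and one-hot encode categorical ones inside the pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])
model = Pipeline([("preprocess", preprocess), ("lin_reg", LinearRegression())])

X = demo.drop(columns="charges")
y = demo["charges"]

# 5-fold cross-validation on the raw (unscaled, unencoded) features
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mse_scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean R2:", r2_scores.mean())
print("Mean MSE:", mse_scores.mean())

# Fit on all rows; predict() then accepts raw rows with the same columns
model.fit(X, y)
print(model.predict(X.head(3)))
```

Because the preprocessing lives in the pipeline, new data needs no manual scaling or dummy-column alignment before calling `predict`.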

In [9]:
# Import the modeling tools used in this cell
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

def create_and_evaluate_regression_model(insurance):
    # Separate the features from the target
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']

    # One-hot encode the categorical columns, dropping the first level of each
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)

    # Standardize every column (numeric features and dummy variables alike)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_processed)

    # Caution: the pipeline reuses the same scaler object and is fitted on X_scaled,
    # so the scaler is re-fitted on already standardized data. The stored scaler then
    # no longer reflects the raw feature scale, which matters when predicting on new data.
    lin_reg = LinearRegression()
    steps = [("scaler", scaler), ("lin_reg", lin_reg)]
    insurance_model_pipeline = Pipeline(steps)

    insurance_model_pipeline.fit(X_scaled, y)

    # 5-fold cross-validated MSE (sign flipped back to positive) and R2
    mse_scores = -cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)

    # X_processed is returned so its column layout can be reused for the validation data
    return insurance_model_pipeline, mean_mse, mean_r2, X_processed

insurance_model, mean_mse, mean_r2, X_processed = create_and_evaluate_regression_model(cleaned_insurance)
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

Mean MSE: 37431001.52191915
Mean R2: 0.7450511466263761
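
An MSE in squared currency units is hard to read directly. Taking its square root gives a root mean squared error (RMSE) in the same units as `charges`. The short calculation below uses the mean MSE printed above. Note that the square root of the averaged MSE is not identical to the average of per-fold RMSE values, so treat it as a rough guide.

```python
import math

# Mean cross-validated MSE reported in the output above
mean_mse = 37431001.52191915

# RMSE is expressed in the same units as the target
rmse = math.sqrt(mean_mse)
print(round(rmse, 2))
```

The result is roughly 6,100, which can be compared with the mean charge of about 13,290 shown in the summary statistics.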


In [10]:
print(X_processed.head())

    age     bmi  children  ...  region_northwest  region_southeast  region_southwest
0  19.0  27.900       0.0  ...                 0                 0                 1
1  18.0  33.770       1.0  ...                 0                 1                 0
2  28.0  33.000       3.0  ...                 0                 1                 0
3  33.0  22.705       0.0  ...                 1                 0                 0
4  32.0  28.880       0.0  ...                 1                 0                 0

[5 rows x 8 columns]


## Applying the Model to the Validation Dataset
1. Load the validation dataset.
2. Perform the same preprocessing steps on the validation dataset.
3. Predict the charges using the trained model.
4. Replace any predicted value below 1000, including negative values, with the minimum basic charge of 1000.
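
Steps 2 and 4 can be sketched in isolation as follows. The training column list is an assumption based on the eight columns reported for `X_processed`, two of which are hidden in the printed output. The rows and the raw predictions are made up for illustration.

```python
import pandas as pd

# Assumed training feature layout (hypothetical reconstruction of X_processed.columns)
train_columns = [
    "age", "bmi", "children", "sex_male", "smoker_yes",
    "region_northwest", "region_southeast", "region_southwest",
]

# Two illustrative validation rows
new_rows = pd.DataFrame({
    "age": [18.0, 39.0],
    "bmi": [24.09, 26.41],
    "children": [1.0, 0.0],
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "region": ["southeast", "northeast"],
})

# Encode every level, then keep exactly the training columns in training order;
# columns missing from this batch are created and filled with 0
encoded = pd.get_dummies(new_rows, columns=["sex", "smoker", "region"])
aligned = encoded.reindex(columns=train_columns, fill_value=0).astype(float)
print(aligned)

# Made-up raw predictions: raise anything below 1000 to the floor
predictions = pd.Series([-250.0, 800.0, 15000.0])
floored = predictions.clip(lower=1000)
print(floored.tolist())
```

Encoding without `drop_first` and then reindexing avoids depending on which level happens to sort first in the validation batch. The baseline levels simply end up with all-zero dummy columns.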

In [11]:
# Loading the validation dataset
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)

In [12]:
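# Normalize the 'sex' labels: lowercase first, then map the short forms
validation_data['sex'] = validation_data['sex'].str.lower()
validation_data['sex'] = validation_data['sex'].replace({'m': 'male', 'man': 'male', 'f': 'female', 'woman': 'female'})
# Lowercase region names to match the training data
validation_data['region'] = validation_data['region'].str.lower()
# Coerce 'age' and 'children' to numbers; unparseable entries become NaN
validation_data['age'] = pd.to_numeric(validation_data['age'], errors='coerce')
validation_data['children'] = pd.to_numeric(validation_data['children'], errors='coerce')
# Drop rows where either value is missing or negative
# (clean_dataset instead sets negative 'children' to 0 and requires age > 0)
validation_data = validation_data.dropna(subset=['age', 'children'])
validation_data = validation_data[(validation_data['age'] >= 0) & (validation_data['children'] >= 0)]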

In [13]:
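# One-hot encode the non-numeric columns, dropping the first level of each
validation_data_processed = pd.get_dummies(validation_data, drop_first=True)
# Add any training dummy column that is absent from the validation data, filled with 0
for col in X_processed.columns:
    if col not in validation_data_processed.columns:
        validation_data_processed[col] = 0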

In [14]:
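# Reorder the columns to match the training feature matrix
validation_data_processed = validation_data_processed[X_processed.columns]
# Caution: predict() applies the pipeline's scaler again, and that scaler was last fitted on
# already standardized data (cell 9); this probably explains the six-figure predictions in the output
validation_data_scaled = insurance_model.named_steps['scaler'].transform(validation_data_processed)
validation_predictions = insurance_model.predict(validation_data_scaled)
validation_data['predicted_charges'] = validation_predictions
# Raise any prediction below 1000 to the minimum basic charge
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000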

# Display the first few rows of the validation dataset with predictions
print(validation_data.head())

    age     sex        bmi  children smoker     region  predicted_charges
0  18.0  female  24.090000       1.0     no  southeast      128624.195643
1  39.0    male  26.410000       0.0    yes  northeast      220740.537449
2  27.0    male  29.150000       0.0    yes  southeast      181357.588606
3  71.0    male  65.502135      13.0    yes  southeast      423490.687270
4  28.0    male  38.060000       0.0     no  southeast      193247.431989


In [15]:
# Save the validation dataset with predictions to a new CSV file
validation_data.to_csv('validation_dataset_with_predictions.csv', index=False)

# Their solution

In [17]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Load the dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)

def clean_dataset(insurance):
    """
    Cleans the insurance dataset by performing several preprocessing tasks:
    - Corrects the 'sex' column values to a standard format ('male', 'female').
    - Removes the dollar sign from the 'charges' column and converts it to float.
    - Drops rows whose 'age' is zero or negative.
    - Converts negative 'children' values to zero.
    - Converts 'region' values to lowercase.
    - Drops rows with any missing values.
    
    Parameters:
    - insurance: pandas DataFrame, the insurance dataset.
    
    Returns:
    - DataFrame after cleaning.
    """
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    insurance = insurance[insurance["age"] > 0].copy()
    insurance.loc[insurance["children"] < 0, "children"] = 0
    insurance["region"] = insurance["region"].str.lower()

    return insurance.dropna()

def create_and_evaluate_regression_model(insurance):
    """
    Prepares the data, fits a linear regression model, and evaluates it using cross-validation.
    
    Parameters:
    - insurance: pandas DataFrame, the cleaned insurance dataset.
    
    Returns:
    - A tuple containing the fitted sklearn Pipeline object, mean MSE, and mean R2 scores.
    """
    # Preprocessing
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']
    
    # Convert categorical variables to dummy variables
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    
    # Combine numerical features with dummy variables
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
    # Scaling all features (numerical columns and dummy variables)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_processed)
    # Linear regression model
    lin_reg = LinearRegression()
    
    # Pipeline
    steps = [("scaler", scaler), ("lin_reg", lin_reg)]
    insurance_model_pipeline = Pipeline(steps)
    
    # Fitting the model
    insurance_model_pipeline.fit(X_scaled, y)
    
    # Evaluating the model
    mse_scores = -cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)
    
    return insurance_model_pipeline, mean_mse, mean_r2

# Usage example
cleaned_insurance = clean_dataset(insurance)
insurance_model, mean_mse, r2_score = create_and_evaluate_regression_model(cleaned_insurance)
print("Mean MSE:", mean_mse)
print("Mean R2:", r2_score)

# Predict on validation data
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)

# Ensure categorical variables are properly transformed
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

# Make predictions using the trained model
validation_predictions = insurance_model.predict(validation_data_processed)

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_data.head()

Mean MSE: 37431001.52191915
Mean R2: 0.7450511466263761


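|   | age  | sex    | bmi       | children | smoker | region    | predicted_charges |
|---|------|--------|-----------|----------|--------|-----------|-------------------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast | 128624.195643     |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast | 220740.537449     |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast | 181357.588606     |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast | 423490.68727      |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast | 193247.431989     |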
