![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!


Develop a regression model using the insurance.csv dataset to predict charges. Evaluate the model's accuracy using the R-Squared Score. Then, apply the model to estimate predicted_charges for unseen data in validation_dataset.csv.

- Build a regression model to predict charges using the insurance.csv dataset. Evaluate the R-Squared Score of your trained model and save it as a variable named r2_score. The model's success will be assessed based on its R-Squared Score, which must exceed a threshold of 0.65.

- Use the trained model to predict charges for the data in validation_dataset.csv. Store the predictions in a new column named predicted_charges within the validation dataset, and save it as a pandas DataFrame called validation_data. Ensure a minimum basic charge of 1000.

⚠️ Note: If you encounter errors during model training, make sure the insurance DataFrame is properly cleaned and ready for modeling.



In [3]:
# importing the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import pickle

In [4]:
# loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
print(insurance.shape)
insurance.head()

(1338, 7)


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [5]:
# Remove non-numeric characters (like '$') from the 'charges' column
insurance['charges'] = insurance['charges'].str.replace('$', '', regex=False)

# Convert the cleaned 'charges' column to numeric
insurance['charges'] = pd.to_numeric(insurance['charges'], errors='coerce')

insurance = insurance[insurance["age"] > 0]
insurance.loc[insurance["children"] < 0, "children"] = 0
insurance["region"] = insurance["region"].str.lower()
insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})

In [6]:
# preprocessing the data

# 1. encoding the categorical column
def encode_categorical_column(dataset):
    """Function to encode categorical columns."""

    # Handle missing values and standardize strings
    dataset.loc[:, 'region'] = dataset['region'].astype(str).str.capitalize()
    dataset.loc[:, 'sex'] = dataset['sex'].astype(str).str.lower()

    # Map binary categories for 'sex'
    dataset.loc[:, 'sex'] = dataset['sex'].map({
        'female': 0,
        'male': 1,
    })

    # Encode 'region' column directly as numerical labels
    region_mapping = {region: idx for idx, region in enumerate(dataset['region'].unique())}
    dataset.loc[:, 'region'] = dataset['region'].map(region_mapping)

    # Encode 'smoker' column directly as numerical labels
    if 'smoker' in dataset.columns:
        dataset.loc[:, 'smoker'] = dataset['smoker'].map({'yes': 1, 'no': 0})

    return dataset


# 2. defining a function to fill in missing values
def filling_missing_values(dataset):
    """The function fills missing values in the dataset"""
    return dataset.dropna()


In [7]:
# applying the preprocessing functions using the pipeline
insurance_processed = filling_missing_values(insurance)
insurance_processed = encode_categorical_column(insurance_processed)


# 3. scaling the numeric columns
numeric_columns = ['age', 'bmi', 'children']
scaler = StandardScaler()
insurance_processed.loc[:, numeric_columns] = scaler.fit_transform(insurance_processed[numeric_columns])
# insurance

insurance_processed


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset.loc[:, 'region'] = dataset['region'].astype(str).str.capitalize()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset.loc[:, 'sex'] = dataset['sex'].astype(str).str.lower()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset.loc[:, 'sex'] = dataset['sex'].map({
A value is trying to be

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,-1.427171,0,-0.439874,-0.853770,1,0,16884.92400
1,-1.497807,1,0.519066,-0.014607,0,1,1725.55230
2,-0.791445,1,0.393276,1.663719,0,1,4449.46200
3,-0.438264,1,-1.288543,-0.853770,0,2,21984.47061
4,-0.508900,1,-0.279778,-0.853770,0,2,3866.85520
...,...,...,...,...,...,...,...
1332,0.903824,0,2.304620,1.663719,0,0,11411.68500
1333,0.762551,1,0.061650,1.663719,0,2,10600.54830
1335,-1.497807,0,1.022223,-0.853770,0,1,1629.83350
1336,-1.285898,0,-0.782935,-0.853770,0,0,2007.94500


In [8]:
# selecting the the features
explanatory_var = insurance_processed.drop(columns=['charges'], axis=1)
target_features = insurance_processed['charges']

# fitting the model
model = LinearRegression()
model.fit(explanatory_var, target_features)

# cross validation
# Evaluating the model
mse_scores = -cross_val_score(model, explanatory_var, target_features, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(model, explanatory_var, target_features, cv=5, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)

print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

Mean MSE: 37330581.03710814
Mean R2: 0.7457258064178742


In [9]:
# testing the model
validation_data = pd.read_csv('validation_dataset.csv')
validation_data.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
dtype: int64

In [10]:
validation_data = encode_categorical_column(validation_data)
validation_data = filling_missing_values(validation_data)

validation_data['predicted_charges'] = model.predict(validation_data)
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
validation_data

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,0,24.09,1.0,0,0,123262.08751
1,39.0,1,26.41,0.0,1,1,229544.726349
2,27.0,1,29.15,0.0,1,0,190106.181156
3,71.0,1,65.502135,13.0,1,0,431703.126076
4,28.0,1,38.06,0.0,0,0,187562.266383
5,70.0,0,72.958351,11.0,1,0,442106.183208
6,29.0,0,32.11,2.0,0,2,181133.415471
7,42.0,0,41.325,1.0,0,1,247008.852106
8,48.0,0,36.575,0.0,0,2,259709.440026
9,63.0,1,33.66,3.0,0,0,310196.954062


In [11]:
# Save the trained model
with open('insurance_cost_model.pkl', 'wb') as f:
    pickle.dump(model, f)

In [15]:
from flask import Flask, request, jsonify
import pickle
import numpy as np

# Load the model
with open('insurance_cost_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Initialize Flask app
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Get input data from the request
    data = request.json
    features = np.array([[
        data['age'], 
        data['sex'],  # Ensure 'sex' is encoded as 0 or 1
        data['bmi'], 
        data['children'], 
        data['smoker'],  # Ensure 'smoker' is encoded as 0 or 1
        data['region']  # Ensure 'region' is encoded as a numeric value
    ]])

    # Make prediction
    prediction = model.predict(features)

    # Return prediction
    return jsonify({'predicted_charges': prediction[0]})

if __name__ == '__main__':
    app.run(port=8000, debug=True)


 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
[2m   Use a production WSGI server instead.[0m
 * Debug mode: on


OSError: [Errno 98] Address already in use

: 

In [None]:
import streamlit as st
import requests

st.title("Healthcare Cost Prediction")

# Collect user inputs
age = st.number_input("Age", min_value=0, max_value=100)
sex = st.selectbox("Sex", ["Male", "Female"])
bmi = st.number_input("BMI", min_value=0.0)
children = st.number_input("Number of Children", min_value=0, max_value=10)
smoker = st.selectbox("Smoker", ["Yes", "No"])
region = st.selectbox("Region", ["Northeast", "Southeast", "Southwest", "Northwest"])

# Map categorical inputs to numbers
sex = 1 if sex == "Male" else 0
smoker = 1 if smoker == "Yes" else 0
region_mapping = {"Northeast": 0, "Southeast": 1, "Southwest": 2, "Northwest": 3}
region = region_mapping[region]

# Send input to backend for prediction
if st.button("Predict"):
    response = requests.post("http://127.0.0.1:8000/predict", json={
        "age": age, 
        "sex": sex, 
        "bmi": bmi, 
        "children": children, 
        "smoker": smoker, 
        "region": region
    })
    prediction = response.json()
    st.write(f"Predicted Healthcare Costs: ${prediction['predicted_charges']:.2f}")
