![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top health insurance company, you will predict customer healthcare costs with machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv

| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



Some data cleaning is needed before modeling: the raw file contains missing values, negative ages and child counts, inconsistent category labels, and charges stored as text with a `$` prefix. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This new dataset has the same columns as your training data minus the `charges` column, and it tests your model's real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [86]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


# analysing data - Insurance

In [87]:
# Collect, for each column, the row indices where that column is missing
missing_indices_per_column = {column: insurance.index[insurance[column].isnull()].tolist() for column in insurance.columns}

# Merge them into a sorted list of unique row indices with at least one missing value
unique_missing_rows = sorted(set(index for indices in missing_indices_per_column.values() for index in indices))

# Print how many rows contain at least one missing value
print(f'Unique row indices with missing values: {len(unique_missing_rows)}')

Unique row indices with missing values: 130
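
The same row count can usually be obtained without building per-column index lists. The sketch below is a minimal illustration on a made-up toy frame (the `toy` values are assumptions, not rows from `insurance.csv`): `isna().any(axis=1)` flags incomplete rows, and `dropna()` removes exactly those rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for insurance.csv (values are made up for illustration).
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "smoker": ["yes", "no", None, "no"],
    "charges": [16884.924, 1725.5523, np.nan, 21984.47],
})

# A row is incomplete if any of its columns is missing.
rows_with_missing = toy.index[toy.isna().any(axis=1)].tolist()
n_missing_rows = len(rows_with_missing)

# dropna() removes exactly the rows flagged above.
toy_cleaned = toy.dropna().reset_index(drop=True)
print(n_missing_rows, len(toy_cleaned))
```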


In [88]:
# Drop every row that has at least one missing value
df_cleaned = insurance.drop(index=unique_missing_rows).reset_index(drop=True)

# Confirm that no missing values remain
print(df_cleaned.isna().sum())


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [89]:
# Collect the row indices with a negative (invalid) age
counts = df_cleaned.index[df_cleaned["age"] < 0].tolist()


In [90]:
# Drop the rows with negative ages
df_cleaned = df_cleaned.drop(index=counts).reset_index(drop=True)

# Confirm that no missing values remain
print(df_cleaned.isna().sum())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
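
Filtering invalid ages can also be written with a boolean mask, which avoids looping over row positions. This is a minimal sketch on a made-up toy column (the ages below are assumptions for illustration only):

```python
import pandas as pd

# Toy column with two invalid (negative) ages; values are made up.
toy = pd.DataFrame({"age": [19.0, -18.0, 28.0, -33.0, 32.0]})

# Boolean mask marking the invalid rows.
negative_age = toy["age"] < 0
n_negative = int(negative_age.sum())

# Keep only the valid rows and renumber the index.
toy_cleaned = toy.loc[~negative_age].reset_index(drop=True)
print(n_negative, toy_cleaned["age"].tolist())
```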


In [91]:
# Map abbreviated and alternative labels onto 'male' / 'female'
df_cleaned['sex'] = df_cleaned['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
df_cleaned["sex"].value_counts()

male      583
female    566
Name: sex, dtype: int64

In [92]:
df_cleaned["children"].value_counts()

 0.0    495
 1.0    264
 2.0    190
 3.0    126
 4.0     18
-1.0     17
 5.0     15
-2.0     12
-3.0      9
-4.0      3
Name: children, dtype: int64

In [93]:
# Treat negative (invalid) child counts as 0
df_cleaned.loc[df_cleaned['children'] < 0, 'children'] = 0

In [94]:
df_cleaned["children"].value_counts()

0.0    536
1.0    264
2.0    190
3.0    126
4.0     18
5.0     15
Name: children, dtype: int64

In [95]:

# Normalise region names to lowercase so 'Southeast' and 'southeast' are one category
df_cleaned['region'] = df_cleaned['region'].str.lower()
df_cleaned["region"].value_counts()

southeast    308
northwest    282
northeast    280
southwest    279
Name: region, dtype: int64

In [96]:
# Strip the '$' prefix and convert to float (raw string avoids an invalid-escape warning)
df_cleaned['charges'] = df_cleaned['charges'].replace({r'\$': ''}, regex=True).astype(float)
df_cleaned['charges'].dtype

dtype('float64')
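
An alternative way to clean the column, sketched here on made-up strings rather than the real file, is a literal (non-regex) replacement followed by `pd.to_numeric`. The toy values are assumptions chosen to mirror the mixed formats seen in the `head()` output.

```python
import pandas as pd

# Toy column mixing plain numbers and '$'-prefixed numbers, all stored as text.
toy = pd.DataFrame({"charges": ["16884.924", "1725.5523", "$4449.462", "$21984.47061"]})

# regex=False treats '$' as a literal character, so no escaping is needed.
stripped = toy["charges"].str.replace("$", "", regex=False)

# to_numeric raises on anything that still is not a number.
toy["charges"] = pd.to_numeric(stripped)
print(toy["charges"].dtype)
```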

In [97]:
from sklearn.preprocessing import LabelEncoder

# Encode the categorical columns as integers (classes are numbered in sorted order)
columns_to_encode = ["sex", "smoker", "region"]
le = LabelEncoder()
for col in columns_to_encode:
    df_cleaned[col] = le.fit_transform(df_cleaned[col])
df_cleaned.dtypes

age         float64
sex           int64
bmi         float64
children    float64
smoker        int64
region        int64
charges     float64
dtype: object

# splitting and scaling

In [98]:
features = df_cleaned.drop(['charges'], axis=1)
target = df_cleaned['charges']

In [99]:
# from sklearn.preprocessing import StandardScaler

# sc = StandardScaler()
# features_scaled = sc.fit_transform(features)
# features_scaled.shape

In [100]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
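
If scaling is switched back on, one common pattern is to wrap the scaler and the model in a pipeline so the scaler is fitted on the training split only. The following is a rough sketch on synthetic data (the arrays and coefficients are made up, and `pipe` is a hypothetical name), not the notebook's actual features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the insurance features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics when transforming X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f"R-squared on the held-out split: {test_r2:.4f}")
```

For plain linear regression, standardising the inputs does not change the predictions, so this matters mainly if a regularised or distance-based model is tried later.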

# modelling

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [102]:
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using R-squared score
# Store the result under a new name so the imported r2_score function is not overwritten
r2 = r2_score(y_test, y_pred)
print(f'R-squared score: {r2:.4f}')

R-squared score: 0.6673
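
A single train/test split gives one R-squared estimate that depends on which rows land in the test set. The notebook imports `cross_val_score` but does not call it; the sketch below shows how it could be used, on synthetic data with made-up coefficients rather than the insurance features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Five-fold cross-validation: one R-squared score per held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
print(f"Mean R-squared over {len(scores)} folds: {mean_r2:.4f}")
```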


# predicting validation data

In [103]:
v_data_path = 'validation_dataset.csv'
v = pd.read_csv(v_data_path)
v.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18.0,female,24.09,1.0,no,southeast
1,39.0,male,26.41,0.0,yes,northeast
2,27.0,male,29.15,0.0,yes,southeast
3,71.0,male,65.502135,13.0,yes,southeast
4,28.0,male,38.06,0.0,no,southeast


In [104]:

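from sklearn.preprocessing import LabelEncoder

# Encode the same categorical columns in the validation data.
# Note: this refits the encoder on the validation data, so the integer codes
# match the training codes only if each column holds the same set of
# categories in both files.
columns_to_encode = ["sex", "smoker", "region"]
les = LabelEncoder()
for col in columns_to_encode:
    v[col] = les.fit_transform(v[col])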
v.dtypes

age         float64
sex           int64
bmi         float64
children    float64
smoker        int64
region        int64
dtype: object
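
`LabelEncoder` numbers classes in sorted order, so an encoder refitted on a different file can assign different integers to the same label. One way to guard against that, sketched here on made-up toy frames (`train`, `valid`, and `encoders` are hypothetical names), is to fit one encoder per column on the training data and reuse it on the validation data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training and validation frames (values are made up for illustration).
train = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})
valid = pd.DataFrame({
    "sex": ["male", "male"],
    "smoker": ["no", "yes"],
    "region": ["southwest", "southwest"],
})

columns_to_encode = ["sex", "smoker", "region"]
encoders = {}
for col in columns_to_encode:
    # Fit on the training column only, then apply the same mapping to both frames.
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])
    valid[col] = encoders[col].transform(valid[col])

print(valid)
```

In this toy example `southwest` keeps the code 2 learned from the training data, whereas an encoder refitted on `valid` alone would assign it 0. `transform` raises an error for a label that was not present in training, which surfaces the mismatch instead of hiding it.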

In [105]:
v.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18.0,0,24.09,1.0,0,2
1,39.0,1,26.41,0.0,1,0
2,27.0,1,29.15,0.0,1,2
3,71.0,1,65.502135,13.0,1,2
4,28.0,1,38.06,0.0,0,2


In [106]:
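# Select the training feature columns explicitly so the cell can be re-run
# after predicted_charges has been added to v
v["predicted_charges"] = model.predict(v[features.columns])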

In [107]:
validation_data = v
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,0,24.09,1.0,0,2,382.853046
1,39.0,1,26.41,0.0,1,0,32121.781497
2,27.0,1,29.15,0.0,1,2,29047.361993
3,71.0,1,65.502135,13.0,1,2,55073.434853
4,28.0,1,38.06,0.0,0,2,7360.358088


In [108]:
# Count the negative predictions (a linear model can predict charges below zero)
counts = int((validation_data["predicted_charges"] < 0).sum())

In [109]:
# Enforce a minimum predicted charge of 1000 (this also replaces negative predictions)
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
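
The same floor can be expressed with `Series.clip`. This is a minimal sketch on a made-up predictions column (the numbers are illustrative, not model output):

```python
import pandas as pd

# Toy predictions, including one negative value and one below the floor.
preds = pd.DataFrame({"predicted_charges": [382.85, 32121.78, -150.0, 7360.36]})

# Count negative predictions before applying the floor.
n_negative = int((preds["predicted_charges"] < 0).sum())

# clip(lower=1000) raises every value below 1000 up to 1000 and leaves the rest unchanged.
preds["predicted_charges"] = preds["predicted_charges"].clip(lower=1000)
print(n_negative, preds["predicted_charges"].tolist())
```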