![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, your task is to predict customer healthcare costs with machine learning. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



Some data cleaning is needed before the dataset is ready for modeling. Once your model is built on `insurance.csv`, the next step is to apply it to `validation_dataset.csv`. This dataset has the same columns as the training data except `charges`, so it tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [95]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

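|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |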


In [96]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


In [97]:
insurance.describe()

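|       | age       | bmi      | children |
|-------|-----------|----------|----------|
| count | 1272.0    | 1272.0   | 1272.0   |
| mean  | 35.214623 | 30.56055 | 0.948899 |
| std   | 22.478251 | 6.095573 | 1.303532 |
| min   | -64.0     | 15.96    | -4.0     |
| 25%   | 24.75     | 26.18    | 0.0      |
| 50%   | 38.0      | 30.21    | 1.0      |
| 75%   | 51.0      | 34.485   | 2.0      |
| max   | 64.0      | 53.13    | 5.0      |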


In [98]:
columns = insurance.columns
for column in columns:
    print(f"Unique values of column {column} are:")
    print(insurance[column].unique(), "\n")

Unique values of column age are:
[ 19.  18.  28.  33.  32. -31.  46.  37.  60.  25.  62.  23.  56. -27.
  52. -23.  30. -34.  59.  63.  55.  31.  22.  nan  26.  35.  24.  41.
  21.  48.  36.  40.  58.  34.  43.  64.  20.  61.  27.  53.  44.  57.
 -41.  45. -35.  54.  38.  29.  49.  47.  51.  42.  50. -44. -39. -28.
 -40.  39. -25. -52. -26. -47. -45. -57. -43. -50. -58. -56. -30. -51.
 -60. -37. -55. -64. -22. -36. -21. -18. -20. -19. -33.] 

Unique values of column sex are:
['female' 'male' 'woman' 'F' 'man' nan 'M'] 

Unique values of column bmi are:
[27.9   33.77  33.    22.705 28.88  25.74  33.44  27.74  29.83  25.84
 26.22  26.29  34.4   39.82  42.13  24.6   30.78  23.845 40.3   35.3
 36.005 32.4   34.1      nan 28.025 27.72  23.085 32.775 17.385 36.3
 35.6   26.315 28.6   28.31  36.4   20.425 32.965 20.8   36.67  39.9
 26.6   36.63  21.78  37.3   38.665 34.77  24.53  35.625 28.    34.43
 28.69  36.955 31.825 31.68  22.88  37.335 27.36  33.66  24.7   25.935
 22.42  28.9   39.1   3

The output above points to the following cleaning steps:

1. Standardize the `sex` column values to 'male' and 'female'
2. Drop rows with missing values
3. Remove '$' from the `charges` column and convert it to float
4. Change negative values in the `children` column to 0
5. Drop rows with negative `age` values
6. Convert `region` values to lowercase
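
Before applying these steps, it can be useful to count how many rows each one would touch. The sketch below does this on a small made-up frame that imitates the problems seen above. The `toy` frame, its values, and the `audit` dictionary are illustrative assumptions, not rows or names from `insurance.csv`.

```python
import numpy as np
import pandas as pd

# Toy data that imitates the issues found above (all values are made up).
toy = pd.DataFrame({
    "age": [19.0, -31.0, 46.0, np.nan],
    "sex": ["female", "M", "woman", "male"],
    "children": [0.0, -4.0, 2.0, 1.0],
    "region": ["southwest", "Southeast", "northwest", None],
    "charges": ["16884.924", "$4449.462", None, "$3866.8552"],
})

# Ignore missing entries when checking label formats.
sex = toy["sex"].dropna()
region = toy["region"].dropna()

# Count the rows each planned cleaning step would affect.
audit = {
    "nonstandard_sex": int((~sex.isin(["male", "female"])).sum()),
    "rows_with_missing": int(toy.isna().any(axis=1).sum()),
    "dollar_prefixed_charges": int(toy["charges"].str.startswith("$", na=False).sum()),
    "negative_children": int((toy["children"] < 0).sum()),
    "negative_age": int((toy["age"] < 0).sum()),
    "mixed_case_region": int((region != region.str.lower()).sum()),
}
print(audit)
```

Running the same counts on the real `insurance` frame before and after cleaning gives a quick check that each step did what was intended.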

In [99]:
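# Standardize the sex labels to 'male' / 'female'
insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
# Strip the '$' prefix from charges (raw string avoids an invalid-escape warning) and convert to float
insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
# Keep only rows with a positive age; .copy() avoids SettingWithCopyWarning on the assignments below
insurance = insurance[insurance["age"] > 0].copy()
# Replace negative child counts with 0
insurance["children"] = insurance["children"].clip(lower=0)
# Normalize region names to lowercase
insurance["region"] = insurance["region"].str.lower()

# Drop any remaining rows with missing values
insurance = insurance.dropna()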

In [100]:
for column in columns:
    print(f"Unique values of column {column} are:")
    print(insurance[column].unique(), "\n")

Unique values of column age are:
[19. 18. 28. 33. 32. 46. 37. 60. 25. 62. 23. 56. 52. 30. 59. 63. 55. 31.
 22. 26. 35. 24. 41. 48. 36. 40. 58. 34. 43. 64. 20. 61. 27. 53. 44. 57.
 21. 45. 54. 38. 29. 49. 47. 51. 42. 50. 39.] 

Unique values of column sex are:
['female' 'male'] 

Unique values of column bmi are:
[27.9   33.77  33.    22.705 28.88  33.44  27.74  29.83  25.84  26.22
 26.29  34.4   39.82  24.6   30.78  40.3   35.3   36.005 32.4   34.1
 28.025 27.72  23.085 32.775 17.385 36.3   35.6   26.315 28.31  36.4
 20.425 32.965 20.8   36.67  39.9   26.6   36.63  21.78  37.3   38.665
 34.77  24.53  35.625 28.    34.43  28.69  36.955 31.825 31.68  37.335
 27.36  33.66  24.7   22.42  39.1   36.19  23.98  24.75  28.5   28.1
 32.01  27.4   34.01  35.53  26.885 38.285 37.62  41.23  34.8   22.895
 31.16  27.2   26.98  39.49  24.795 31.3   30.8   38.28  19.95  19.3
 31.6   30.115 29.92  27.5   28.4   30.875 27.94  35.09  33.63  29.7
 35.72  32.205 49.06  27.17  23.37  37.1   23.75  28.975 31

In [101]:
insurance.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
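
The null counts confirm one of the cleaning goals. A minimal sketch of checking all of them in one place is shown below; `check_clean` is a hypothetical helper name, and `cleaned_toy` is made-up data standing in for the cleaned `insurance` frame.

```python
import pandas as pd

def check_clean(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame still shows any of the known problems."""
    assert df.isna().sum().sum() == 0, "missing values remain"
    assert (df["age"] > 0).all(), "non-positive ages remain"
    assert (df["children"] >= 0).all(), "negative child counts remain"
    assert set(df["sex"].unique()) <= {"male", "female"}, "nonstandard sex labels remain"
    assert df["region"].str.islower().all(), "region labels are not all lowercase"
    assert pd.api.types.is_float_dtype(df["charges"]), "charges is not a float column"

# Made-up rows in the shape of the cleaned data.
cleaned_toy = pd.DataFrame({
    "age": [19.0, 28.0],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.0],
    "children": [0.0, 3.0],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 4449.462],
})

check_clean(cleaned_toy)  # passes silently when every check holds
```

In the notebook, the call would be `check_clean(insurance)` right after the cleaning cell.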

In [102]:
features = insurance.drop('charges', axis=1)
target = insurance['charges']

In [103]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1149 non-null   float64
 1   sex       1149 non-null   object 
 2   bmi       1149 non-null   float64
 3   children  1149 non-null   float64
 4   smoker    1149 non-null   object 
 5   region    1149 non-null   object 
 6   charges   1149 non-null   float64
dtypes: float64(4), object(3)
memory usage: 71.8+ KB


In [104]:
categorical_features = ["sex", "smoker", "region"]
numerical_features = ["age", "bmi", "children"]

In [105]:
features_categorical = pd.get_dummies(features[categorical_features], drop_first=True)
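
To see what `drop_first=True` does, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). For each categorical column, the first category in sorted order is dropped and serves as the baseline, which avoids perfectly collinear dummy columns in a linear model.

```python
import pandas as pd

# Made-up categorical data with two smoker levels and three regions.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

# 'no' and 'northeast' sort first, so they become the dropped baselines.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with zeros in every `region_*` column therefore represents the baseline region rather than a missing value.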

In [106]:
features_final = pd.concat([features[numerical_features], features_categorical], axis=1)

In [107]:
scaler = StandardScaler()

In [108]:
features_scaled = scaler.fit_transform(features_final)

In [109]:
from sklearn.linear_model import LinearRegression

In [110]:
lin_model = LinearRegression()

In [111]:
from sklearn.pipeline import Pipeline

In [112]:
steps = [('scaler', scaler), ('lin_model', lin_model)]
insurance_pipeline = Pipeline(steps)
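
An alternative layout, not the one used in this notebook, is to put the encoding and scaling inside the pipeline with scikit-learn's `ColumnTransformer`, so the fitted model accepts raw rows directly. The sketch below assumes made-up training rows (`toy`) and hypothetical step names (`preprocess`, `num`, `cat`).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same columns as the cleaned insurance data.
toy = pd.DataFrame({
    "age": [19.0, 18.0, 28.0, 33.0, 32.0, 46.0, 37.0, 60.0],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 33.4, 27.7, 25.8],
    "children": [0.0, 1.0, 3.0, 0.0, 0.0, 1.0, 3.0, 0.0],
    "sex": ["female", "male", "male", "male", "male", "female", "female", "female"],
    "smoker": ["yes", "no", "no", "no", "no", "no", "yes", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest",
               "northwest", "northeast", "northeast", "southwest"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 8200.0, 21000.0, 28900.0],
})

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])

model = Pipeline([("preprocess", preprocess), ("lin_model", LinearRegression())])

X = toy.drop(columns="charges")
model.fit(X, toy["charges"])

# Raw, unencoded rows can be passed straight to predict.
preds = model.predict(X.head(2))
print(preds.shape)
```

With this layout the same preprocessing is applied at fit and predict time, and cross-validation refits the scaler and encoder on each training fold.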

In [113]:
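# Fit on the full training set. features_scaled is already standardized,
# so the scaler inside the pipeline standardizes it a second time.
insurance_pipeline.fit(features_scaled, target)

# 5-fold cross-validation; scikit-learn reports MSE as a negative score, so flip the sign
mse_scores = -cross_val_score(insurance_pipeline, features_scaled, target, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(insurance_pipeline, features_scaled, target, cv=5, scoring='r2')
mean_mse = np.mean(mse_scores)
r2_score = np.mean(r2_scores)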

In [114]:
print("Mean MSE:", mean_mse)
print("Mean R2:", r2_score)

Mean MSE: 37431001.52191915
Mean R2: 0.7450511466263761
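
MSE is in squared dollars, which is hard to interpret. Taking the square root gives the RMSE in the same units as `charges`. The short sketch below uses the mean MSE printed above. The square root of the mean of fold MSEs is not exactly the mean of fold RMSEs, so treat it as an approximation.

```python
import math

# Mean cross-validated MSE copied from the output above.
mean_mse = 37431001.52191915

# RMSE is in dollars, the same unit as the target.
rmse = math.sqrt(mean_mse)
print(f"RMSE: {rmse:,.2f}")
```

This puts the typical prediction error at roughly 6,100 dollars per customer.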


In [115]:
# Predict on validation data
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)

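# One-hot encode the categorical columns the same way as the training features
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)
# Align the columns with the training features (same names, same order)
validation_data_processed = validation_data_processed.reindex(columns=features_final.columns, fill_value=0)

# Make predictions using the trained pipeline.
# Note: these features are unscaled, whereas the pipeline was fit on features_scaled.
validation_predictions = insurance_pipeline.predict(validation_data_processed)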

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_data.head()

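|   | age  | sex    | bmi       | children | smoker | region    | predicted_charges |
|---|------|--------|-----------|----------|--------|-----------|-------------------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast | 128624.195643     |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast | 220740.537449     |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast | 181357.588606     |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast | 423490.68727      |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast | 193247.431989     |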
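
These predictions are far larger than the charges seen in the training data. A likely contributor is a scale mismatch: the pipeline was fit on `features_scaled`, but the validation features are passed in unscaled. The sketch below reproduces that pattern on synthetic data and compares it with fitting the pipeline on raw features so that scaling happens only once. All data, coefficients, and names here (`make_pipeline_`, `pipe_a`, `pipe_b`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic training data (made up): charges rise with age and smoking.
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=200).astype(float),
    "smoker_yes": rng.integers(0, 2, size=200),
})
target = 250 * train["age"] + 20000 * train["smoker_yes"] + rng.normal(0, 500, size=200)

def make_pipeline_():
    """Build a fresh scaler + linear regression pipeline."""
    return Pipeline([("scaler", StandardScaler()), ("lin_model", LinearRegression())])

# Pattern A: scale first, then fit a pipeline that scales again.
pre_scaled = StandardScaler().fit_transform(train)
pipe_a = make_pipeline_().fit(pre_scaled, target)

# Pattern B: fit the pipeline on raw features, so scaling happens once inside it.
pipe_b = make_pipeline_().fit(train, target)

# New rows arrive unscaled, with columns in a different order.
new = pd.DataFrame({"smoker_yes": [0, 1], "age": [30.0, 50.0]})
new = new.reindex(columns=train.columns, fill_value=0)

pred_a = pipe_a.predict(new.to_numpy())  # raw values fed to a model fit on scaled values
pred_b = pipe_b.predict(new)             # raw values scaled by the pipeline itself
print(pred_a.round(0), pred_b.round(0))
```

In this sketch, pattern A returns predictions several times larger than pattern B, whose outputs stay close to the values implied by the synthetic formula. Applied to this notebook, the equivalent change would be fitting `insurance_pipeline` on `features_final` rather than `features_scaled`.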
