![](image.jpg)


This project combines healthcare data with predictive analytics. As a Data Scientist at a health insurance company, your task is to predict customers' healthcare costs with a machine learning model. The predictions help the company tailor its services and help customers plan their healthcare expenses.

## Dataset Summary

The primary dataset is `insurance.csv`, which holds information on health insurance customers and is used to find patterns in healthcare costs. Its columns are described below:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



The raw data needs cleaning before it is ready for modeling. Once your model is built on `insurance.csv`, the next step is to apply it to `validation_dataset.csv`. This second dataset has the same columns as the training data except `charges`, so the model has to predict costs for customers it has not seen.

## Let's Get Started!

The cells below explore the data, clean it, fit a regression model, and use that model to predict charges for the validation dataset.

In [86]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [87]:
# Implement model creation and training here
# Use as many cells as you need

In [88]:
print(insurance.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB
None


In [89]:
print(insurance.describe())

               age          bmi     children
count  1272.000000  1272.000000  1272.000000
mean     35.214623    30.560550     0.948899
std      22.478251     6.095573     1.303532
min     -64.000000    15.960000    -4.000000
25%      24.750000    26.180000     0.000000
50%      38.000000    30.210000     1.000000
75%      51.000000    34.485000     2.000000
max      64.000000    53.130000     5.000000
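
The summary above shows a minimum `age` of -64 and a minimum `children` of -4, neither of which is a plausible value. One possible way to handle such rows is sketched below on a small made-up frame: non-positive ages and negative child counts are marked as missing so that a later `dropna()` removes them. The rule, the `flag_implausible` helper, and the `demo` data are assumptions for illustration; the notebook does not say how these values arose or how they should be treated.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the implausible values seen in describe()
demo = pd.DataFrame({
    "age": [19.0, -64.0, 38.0, np.nan],
    "children": [0.0, 2.0, -4.0, 1.0],
})

def flag_implausible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with non-positive ages and negative child counts set to NaN."""
    out = df.copy()
    out.loc[out["age"] <= 0, "age"] = np.nan
    out.loc[out["children"] < 0, "children"] = np.nan
    return out

# Only the first row has a plausible age and a plausible child count
cleaned = flag_implausible(demo).dropna()
print(cleaned)
```

If the negative signs are known to be data-entry errors, taking the absolute value would keep more rows, but that is a stronger assumption than dropping them.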


In [90]:
# Assign the correct data types
# fillna(0) lets the integer cast succeed, but it also turns missing ages and
# child counts into literal zeros, so the later dropna() will not remove them
insurance['age'] = pd.to_numeric(insurance['age'], errors='coerce').fillna(0).astype('int')
insurance['children'] = pd.to_numeric(insurance['children'], errors='coerce').fillna(0).astype('int')
insurance['bmi'] = pd.to_numeric(insurance['bmi'], errors='coerce').astype('float')

# Convert categorical variables to category type
insurance['sex'] = insurance['sex'].astype('category')
insurance['smoker'] = insurance['smoker'].astype('category')
insurance['region'] = insurance['region'].astype('category')

# Convert charges to float; values that do not parse as numbers
# (for example '$4449.462') become NaN
insurance['charges'] = pd.to_numeric(insurance['charges'], errors='coerce')

In [91]:
#Checking newly assigned data types
insurance.dtypes

age            int64
sex         category
bmi          float64
children       int64
smoker      category
region      category
charges      float64
dtype: object

In [92]:
#Checking missing values
print(insurance.isna().sum().sort_values())

age           0
children      0
sex          66
bmi          66
smoker       66
region       66
charges     321
dtype: int64
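
The count of missing `charges` rises to 321 after the type conversion, although `info()` reported only 54 nulls in that column (1284 non-null of 1338). The difference is consistent with entries such as `$4449.462` in the `head()` output, which `pd.to_numeric(..., errors='coerce')` cannot parse. A minimal sketch of stripping the currency symbol before converting is shown below. The `parse_charges` helper and the sample values are made up for illustration, and the real column may contain other problems that this does not address.

```python
import pandas as pd

# Made-up values in the formats visible in the head() output
raw_charges = pd.Series(["16884.924", "$4449.462", "$3866.8552", None])

def parse_charges(s: pd.Series) -> pd.Series:
    """Strip a literal dollar sign, then convert to float (unparseable -> NaN)."""
    stripped = s.str.replace("$", "", regex=False).str.strip()
    return pd.to_numeric(stripped, errors="coerce")

charges = parse_charges(raw_charges)
# Only the genuinely missing entry remains NaN
print(charges.isna().sum())
```

Applied before the `dropna()` step, a conversion like this would keep rows whose only defect is the `$` prefix.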


In [93]:
insurance.shape

(1338, 7)

In [94]:
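# Drop every row that has at least one missing value
# (this includes rows whose charges could not be parsed as numbers)
insurance = insurance.dropna()
print(insurance.isna().sum().sort_values())
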
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [95]:
#Checking for discrepancies
print(insurance['sex'].unique())
print(insurance['smoker'].unique())
print(insurance['region'].unique())

['female', 'male', 'woman', 'F', 'man', 'M']
Categories (6, object): ['F', 'M', 'female', 'male', 'man', 'woman']
['yes', 'no']
Categories (2, object): ['no', 'yes']
['southwest', 'Southeast', 'southeast', 'Northwest', 'northwest', 'northeast', 'Southwest', 'Northeast']
Categories (8, object): ['Northeast', 'Northwest', 'Southeast', 'Southwest', 'northeast', 'northwest', 'southeast', 'southwest']


In [96]:
# Replace inconsistent labels using a single mapping, assigning the result back
# instead of calling replace(..., inplace=True) on a column selection
sex_map = {'F': 'female', 'woman': 'female', 'M': 'male', 'man': 'male'}
insurance['sex'] = insurance['sex'].astype(str).replace(sex_map).astype('category')
# Set region values to lowercase (the result is an object column, not a category)
insurance['region'] = insurance['region'].str.lower()

In [97]:
#Check again
print(insurance['sex'].unique())
print(insurance['smoker'].unique())
print(insurance['region'].unique())

['female', 'male']
Categories (2, object): ['female', 'male']
['yes', 'no']
Categories (2, object): ['no', 'yes']
['southwest' 'southeast' 'northwest' 'northeast']


In [98]:
#Splitting data into features and target variables
X = insurance[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
y = insurance['charges']

In [99]:
#Encoding categorical variables in features variable
X_encoded = pd.get_dummies(X, drop_first=True)
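
To make the effect of `drop_first=True` concrete, the sketch below encodes a small made-up frame. For each categorical column the alphabetically first category is dropped and becomes the implicit baseline, so `sex` yields only `sex_male` and `region` yields three columns rather than four. The `demo` frame is an assumption for illustration, not the notebook's data.

```python
import pandas as pd

# Made-up rows with the same categorical columns as the insurance data
demo = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Numeric columns pass through; each categorical column loses its first category
encoded = pd.get_dummies(demo, drop_first=True)
print(list(encoded.columns))
```

Together with `age`, `bmi`, and `children`, this kind of encoding accounts for the eight feature columns that appear later in the scaled validation output.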


In [100]:
#Split target and feature variables to test and training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate and fit regression model
from sklearn.linear_model import LinearRegression
reg_mod = LinearRegression()
reg_mod.fit(X_train_scaled, y_train)

# Make predictions
y_pred = reg_mod.predict(X_test_scaled)

In [101]:
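# Evaluate with cross-validation on the full, unscaled feature set
# (with no cv argument, five folds are used, as the five scores below show)
cv_scores = cross_val_score(reg_mod, X_encoded, y, scoring='r2')
print(cv_scores)
# Named mean_r2 to avoid shadowing sklearn.metrics.r2_score
mean_r2 = np.mean(cv_scores)
print(mean_r2)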

[0.6925984  0.66607056 0.72171354 0.70195271 0.7082948 ]
0.698126000731359
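
The cross-validation above runs on unscaled features, while the fitted model was trained on scaled ones. For ordinary least squares with an intercept this does not change the R² scores, but it would matter for regularized models. One way to keep the two steps consistent is to wrap the scaler and the regressor in a pipeline, so the scaler is refit inside each fold. The sketch below uses synthetic data as a stand-in for `X_encoded` and `y`; the names `X_demo`, `y_demo`, and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Scaling and regression are fit together on the training part of each fold
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
mean_r2 = scores.mean()
print(scores.round(3), round(mean_r2, 3))
```

A fitted pipeline can also be applied directly to new data with `model.predict`, which removes the need for a separate `scaler.transform` call.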


In [102]:
# Loading the validation dataset
validation_data_path = 'validation_dataset.csv'
validation_dataset = pd.read_csv(validation_data_path)
validation_dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18.0,female,24.09,1.0,no,southeast
1,39.0,male,26.41,0.0,yes,northeast
2,27.0,male,29.15,0.0,yes,southeast
3,71.0,male,65.502135,13.0,yes,southeast
4,28.0,male,38.06,0.0,no,southeast


In [103]:
#Checking validation dataset
validation_dataset.dtypes

age         float64
sex          object
bmi         float64
children    float64
smoker       object
region       object
dtype: object

In [104]:
#Cleaning Validation data set
# Assigning the correct Data Types
# Handle non-finite values before converting to integer

validation_dataset['age'] = pd.to_numeric(validation_dataset['age'], errors='coerce').fillna(0).astype('int')
validation_dataset['children'] = pd.to_numeric(validation_dataset['children'], errors='coerce').fillna(0).astype('int')
validation_dataset['bmi'] = pd.to_numeric(validation_dataset['bmi'], errors='coerce').astype('float')

# Convert categorical variables to category type
validation_dataset['sex'] = validation_dataset['sex'].astype('category')
validation_dataset['smoker'] = validation_dataset['smoker'].astype('category')
validation_dataset['region'] = validation_dataset['region'].astype('category')


In [105]:
#Check Again
validation_dataset.dtypes

age            int64
sex         category
bmi          float64
children       int64
smoker      category
region      category
dtype: object

In [106]:
#Checking missing values in validation dataset
print(validation_dataset.isna().sum().sort_values())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
dtype: int64


In [107]:
#Encoding categorical values in validation dataset
validation_dataset_encoded = pd.get_dummies(validation_dataset, drop_first=True)

In [108]:
#Scaling validation dataset
validation_dataset_scaled = scaler.transform(validation_dataset_encoded)
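
`scaler.transform` expects the same features, in the same order, that the scaler saw during fitting. Encoding the validation set separately with `drop_first=True` only produces matching columns if every category appears in both files. A more defensive approach, sketched below on made-up frames, is to encode the validation data without dropping a category and then reindex it to the training columns. The frame names are assumptions for illustration.

```python
import pandas as pd

# Made-up training frame: three regions, so two dummy columns after drop_first
train_encoded = pd.get_dummies(
    pd.DataFrame({"age": [19, 28, 33],
                  "region": ["northeast", "southeast", "southwest"]}),
    drop_first=True,
)

# Made-up validation frame in which only one region happens to occur
valid_raw = pd.DataFrame({"age": [40, 52], "region": ["southeast", "southeast"]})

# Encode every category, then keep exactly the training columns in training order;
# columns absent from the validation data are filled with 0
valid_full = pd.get_dummies(valid_raw)
valid_aligned = valid_full.reindex(columns=train_encoded.columns, fill_value=0)
print(list(valid_aligned.columns))
```

With `drop_first=True` on the validation frame alone, `region_southeast` would have been dropped here, and both rows would silently be treated as the baseline region.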

In [109]:
#Predicting charges  
validation_pred = reg_mod.predict(validation_dataset_scaled)

In [110]:
# Replace every prediction below the minimum basic charge (set at 1000),
# including any negative predictions, with that minimum
min_basic_charge = 1000
validation_pred[validation_pred < min_basic_charge] = min_basic_charge

# Collect the scaled features in a pandas DataFrame
# (built from a NumPy array, so the columns are numbered rather than named)
validation_data = pd.DataFrame(validation_dataset_scaled)

# Add the predicted_charges column to the DataFrame
validation_data['predicted_charges'] = validation_pred
validation_data.head(30)
 


Unnamed: 0,0,1,2,3,4,5,6,7,predicted_charges
0,-0.927337,-1.087196,0.03971,-1.009121,-0.549877,-0.557875,1.608119,-0.547876,3053.58168
1,0.131739,-0.707972,-0.745333,0.990962,1.818588,-0.557875,-0.621845,-0.547876,30737.782195
2,-0.473448,-0.260095,-0.745333,0.990962,1.818588,-0.557875,1.608119,-0.547876,29556.796478
3,1.745568,5.681981,9.460234,0.990962,1.818588,-0.557875,1.608119,-0.547876,53785.048472
4,-0.423015,1.196323,-0.745333,0.990962,-0.549877,-0.557875,1.608119,-0.547876,9459.798061
5,1.695136,6.900765,7.890147,-1.009121,1.818588,-0.557875,1.608119,-0.547876,56527.957143
6,-0.372583,0.223743,0.824754,-1.009121,-0.549877,1.792516,-0.621845,-0.547876,8370.573986
7,0.283035,1.730016,0.03971,-1.009121,-0.549877,-0.557875,-0.621845,-0.547876,14000.577746
8,0.585628,0.953587,-0.745333,-1.009121,-0.549877,1.792516,-0.621845,-0.547876,12153.074416
9,1.342111,0.477105,1.609798,0.990962,-0.549877,-0.557875,1.608119,-0.547876,13073.39631
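
The table above pairs each prediction with standardized features under numeric column names, which is hard to read back as customer data. An alternative, sketched below on made-up values, is to attach the predictions to a copy of the original, unscaled validation frame and apply the minimum charge with `np.maximum`. The names `validation_demo`, `pred_demo`, and `result` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up validation rows and raw model predictions
validation_demo = pd.DataFrame({
    "age": [18, 39, 27],
    "smoker": ["no", "yes", "yes"],
})
pred_demo = np.array([-250.0, 30737.78, 640.5])

min_basic_charge = 1000

# Keep the readable columns and floor each prediction at the minimum charge
result = validation_demo.copy()
result["predicted_charges"] = np.maximum(pred_demo, min_basic_charge)
print(result)
```

The floor affects every prediction below 1000, not only negative ones, which matches the behavior of the boolean-mask assignment used in the notebook.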


In [111]:
# Save the modified validation dataset to a new CSV file if needed
validation_data.to_csv('validation_data_with_predictions.csv', index=False)