## Dataset Summary

The `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs.

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |

In [183]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [184]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


In [185]:
insurance = insurance.dropna()
insurance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1208 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1208 non-null   float64
 1   sex       1208 non-null   object 
 2   bmi       1208 non-null   float64
 3   children  1208 non-null   float64
 4   smoker    1208 non-null   object 
 5   region    1208 non-null   object 
 6   charges   1208 non-null   object 
dtypes: float64(3), object(4)
memory usage: 75.5+ KB


In [186]:
insurance['sex'].unique()

array(['female', 'male', 'woman', 'F', 'man', 'M'], dtype=object)

In [187]:
insurance['sex'] = insurance['sex'].replace({'woman':'female', 'F':'female',
                                             'M':'male', 'man':'male' })

In [188]:
insurance['sex'].unique()

array(['female', 'male'], dtype=object)

In [189]:
insurance.describe()

Unnamed: 0,age,bmi,children
count,1208.0,1208.0,1208.0
mean,35.35596,30.574971,0.942881
std,22.061241,6.117562,1.311809
min,-64.0,15.96,-4.0
25%,24.75,26.195,0.0
50%,38.0,30.23,1.0
75%,51.0,34.58,2.0
max,64.0,53.13,5.0


In [190]:
insurance = insurance[insurance['age'] > 0]
insurance.describe()

Unnamed: 0,age,bmi,children
count,1149.0,1149.0,1149.0
mean,39.204526,30.59262,0.947781
std,14.163214,6.124013,1.314243
min,18.0,15.96,-4.0
25%,26.0,26.2,0.0
50%,39.0,30.3,1.0
75%,51.0,34.7,2.0
max,64.0,53.13,5.0


In [191]:
insurance.loc[insurance['children'] < 0,"children"] = 0

In [192]:
insurance['region'] = insurance['region'].str.lower()

In [193]:
insurance['charges'] = insurance['charges'].str.replace('$', '').astype(float)

In [194]:
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,4449.462
3,33.0,male,22.705,0.0,no,northwest,21984.47061
4,32.0,male,28.88,0.0,no,northwest,3866.8552


In [195]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1149 non-null   float64
 1   sex       1149 non-null   object 
 2   bmi       1149 non-null   float64
 3   children  1149 non-null   float64
 4   smoker    1149 non-null   object 
 5   region    1149 non-null   object 
 6   charges   1149 non-null   float64
dtypes: float64(4), object(3)
memory usage: 71.8+ KB


In [196]:
def create_and_train_model(insurance):
    X = insurance.drop('charges', axis=1)
    Y = insurance['charges']
    
    numerical_features = ['age', 'bmi', 'children']
    categorical_features = ['sex','smoker','region']

    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
    
    scaler = StandardScaler()
    lin_reg = LinearRegression()
    X_train_scaled = scaler.fit_transform(X_processed)
    #pipeline
    insurance_model = Pipeline([('scaler', scaler), ('classifier', lin_reg)])
    #fitting the model
    insurance_model.fit(X_train_scaled, Y)
    #evaluation 
    mse_scores = -cross_val_score(insurance_model, X_train_scaled, Y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model, X_train_scaled, Y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)

    return insurance_model, mean_mse, mean_r2

In [197]:
insurance_model, mean_mse, mean_r2 = create_and_train_model(insurance)

print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

Mean MSE: 37431001.52191915
Mean R2: 0.7450511466263761


In [198]:
validation_data = pd.read_csv('validation_dataset.csv')

# Preprocess categorical features using one-hot encoding
validation_data_processed = pd.get_dummies(validation_data,columns=['sex', 'smoker', 'region'],drop_first=True)

# Make predictions using the trained model
validation_predictions = insurance_model.predict(validation_data_processed)

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,female,24.09,1.0,no,southeast,128624.195643
1,39.0,male,26.41,0.0,yes,northeast,220740.537449
2,27.0,male,29.15,0.0,yes,southeast,181357.588606
3,71.0,male,65.502135,13.0,yes,southeast,423490.68727
4,28.0,male,38.06,0.0,no,southeast,193247.431989
