## Business Problem

Insurance companies need to estimate premiums that reflect the risk associated with insuring individuals. Key challenges include:
- Identifying which factors most significantly influence insurance charges.
- Predicting insurance charges for new applicants based on demographic and health-related features.
- Ensuring fairness in premium estimation while maintaining profitability.

### Predicting Insurance Charges Based on Customer Data

### Import Required Libraries

In [72]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, root_mean_squared_error

### Import/Read Data

In [4]:
insurance_data = pd.read_csv('insurance.csv')

### Initial Analysis

In [5]:
insurance_data.head()

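| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |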


In [6]:
insurance_data.shape

(1338, 7)

In [7]:
insurance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [79]:
insurance_data['children'].max()

np.int64(5)

In [9]:
insurance_data.describe(include='all')

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 1338.0 | 1338 | 1338.0 | 1338.0 | 1338 | 1338 | 1338.0 |
| unique | | 2 | | | 2 | 4 | |
| top | | male | | | no | southeast | |
| freq | | 676 | | | 1064 | 364 | |
| mean | 39.207025 | | 30.663397 | 1.094918 | | | 13270.422265 |
| std | 14.04996 | | 6.098187 | 1.205493 | | | 12110.011237 |
| min | 18.0 | | 15.96 | 0.0 | | | 1121.8739 |
| 25% | 27.0 | | 26.29625 | 0.0 | | | 4740.28715 |
| 50% | 39.0 | | 30.4 | 1.0 | | | 9382.033 |
| 75% | 51.0 | | 34.69375 | 2.0 | | | 16639.912515 |


In [10]:
insurance_data.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

## EDA - Exploratory Data Analysis

### Hypothesis Testing

1. 
- H0: Mean insurance charges are the same for smokers and non-smokers.
- Ha: Mean insurance charges differ between smokers and non-smokers (smokers are expected to incur higher charges).

alpha = 0.05

In [13]:
smoker = insurance_data[insurance_data['smoker']=='yes']['charges']
non_smoker = insurance_data[insurance_data['smoker']=='no']['charges']

In [14]:
t_stats, p_value = stats.ttest_ind(smoker,non_smoker,equal_var=False)

In [18]:
# Rounded to two decimals, so a printed 0.0 means p < 0.005
print("P-value: {}".format(round(p_value, 2)))

P-value: 0.0


In [19]:
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis in favor of the alternative hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis in favor of the alternative hypothesis
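
Because the expectation here is directional (smokers pay more), a one-sided Welch test is a natural companion to the two-sided test above. The sketch below uses synthetic samples, not the insurance data; the names `smoker_charges` and `non_smoker_charges` and the chosen means and spreads are illustrative assumptions. It relies on the `alternative` argument of `stats.ttest_ind`, which requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two charge samples (NOT the real insurance data).
rng = np.random.default_rng(0)
smoker_charges = rng.normal(loc=32000, scale=11000, size=200)
non_smoker_charges = rng.normal(loc=8500, scale=6000, size=800)

# Two-sided Welch test: H0 says the two means are equal.
t_two_sided, p_two_sided = stats.ttest_ind(
    smoker_charges, non_smoker_charges, equal_var=False
)

# One-sided Welch test: Ha says the smoker mean is greater.
t_one_sided, p_one_sided = stats.ttest_ind(
    smoker_charges, non_smoker_charges, equal_var=False, alternative="greater"
)

print(f"t statistic: {t_two_sided:.2f}")
print(f"two-sided p: {p_two_sided:.3g}, one-sided p: {p_one_sided:.3g}")
```

The t statistic is the same in both calls; only the p-value changes. When the statistic is positive, the one-sided p-value is half the two-sided one.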


2.

- Hypothesis: BMI is positively correlated with insurance charges (higher BMI results in higher charges).

In [20]:
insurance_data[['bmi','charges']].corr()

| | bmi | charges |
|---|---|---|
| bmi | 1.0 | 0.198341 |
| charges | 0.198341 | 1.0 |


The correlation between BMI and charges is weak (about 0.20), so a higher BMI does not necessarily mean higher charges.
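
A correlation coefficient near 0.2 corresponds to r² of roughly 0.04, meaning BMI alone accounts for only about 4% of the variance in charges. The sketch below shows one way to get both the coefficient and a p-value with `stats.pearsonr`. It runs on synthetic data; the slope and noise level are assumptions chosen only to give a weak positive relationship, and they are not estimates from the insurance dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data with a weak positive linear relationship (illustrative only).
rng = np.random.default_rng(1)
bmi = rng.normal(loc=30, scale=6, size=500)
charges = 400 * bmi + rng.normal(loc=0, scale=12000, size=500)

# Pearson correlation coefficient and the p-value for H0: no linear correlation.
r, p_value = stats.pearsonr(bmi, charges)

# Share of variance in charges explained by a straight-line fit on BMI.
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, p = {p_value:.3g}")
```

A small p-value here says the correlation is unlikely to be exactly zero; it does not say the relationship is strong. The size of r² is what indicates how much BMI explains on its own.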

3.

- Hypothesis: Insurance charges increase with age due to higher health risks.

In [21]:
insurance_data[['age','charges']].corr()

| | age | charges |
|---|---|---|
| age | 1.0 | 0.299008 |
| charges | 0.299008 | 1.0 |


There is a weak-to-moderate positive correlation (about 0.30) between age and charges; however, a higher age does not necessarily mean higher insurance charges.

4.

- H0: Mean insurance charges are the same across all geographic regions.
- Ha: Mean insurance charges differ for at least one region (for example, due to regional cost variations).

alpha  = 0.05

In [23]:
insurance_data['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [25]:
southwest_data = insurance_data[insurance_data['region'] == 'southwest']['charges']
southeast_data = insurance_data[insurance_data['region'] == 'southeast']['charges']
northwest_data = insurance_data[insurance_data['region'] == 'northwest']['charges']
northeast_data = insurance_data[insurance_data['region'] == 'northeast']['charges']

In [26]:
# One-way ANOVA returns an F statistic (not a t statistic) and its p-value
f_stat, p_value = stats.f_oneway(southwest_data, southeast_data, northwest_data, northeast_data)

In [28]:
alpha = 0.05

if p_value < alpha:
    print("We reject the null hypothesis in favor of the alternative hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis in favor of the alternative hypothesis
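
To make the one-way ANOVA less of a black box, the sketch below computes the F statistic by hand as the ratio of between-group to within-group mean squares and compares it with `stats.f_oneway`. The four groups are synthetic; their means, spread, and sizes are arbitrary assumptions, not the regional figures from the dataset.

```python
import numpy as np
from scipy import stats

# Four synthetic groups standing in for the regions (illustrative values only).
rng = np.random.default_rng(2)
groups = [
    rng.normal(loc=mu, scale=12000, size=300)
    for mu in (12000, 15000, 12500, 13500)
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# Upper-tail probability of the F distribution with (k - 1, n - k) degrees of freedom.
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```

A significant result only says that at least one group mean differs. Identifying which regions differ would need a follow-up pairwise comparison.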


## Feature Engineering Activity

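The categorical features sex, smoker, and region are one-hot encoded. Each category becomes its own 0/1 indicator column, for example:

- sex_male: 1 for male, 0 for female
- smoker_yes: 1 for smokers, 0 for non-smokers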

In [31]:
encoder = OneHotEncoder()

In [32]:
categorical_features = ['sex','smoker','region']

In [37]:
encoded_features = encoder.fit_transform(insurance_data[categorical_features]).toarray()

In [41]:
categorical_feature_names = encoder.get_feature_names_out(categorical_features)

In [43]:
categorical_encoded_features = pd.DataFrame(encoded_features, columns=categorical_feature_names)

In [63]:
numerical_features = insurance_data[['age','bmi','children']]
final_data = pd.concat([numerical_features,categorical_encoded_features,insurance_data['charges']],axis=1)

### Model Building and Evaluation

In [65]:
X = final_data.drop(['charges'],axis=1)
y = final_data['charges']

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=12)

In [67]:
model = LinearRegression()

In [68]:
model.fit(X_train,y_train)

In [71]:
y_pred = model.predict(X_test)

In [74]:
print(
    "R2 Score : {}\nMean Squared Error: {}\nmean_absolute_error : {}\nroot_mean_squared_error: {}".format(
        r2_score(y_test, y_pred),
        mean_squared_error(y_test, y_pred),
        mean_absolute_error(y_test, y_pred),
        root_mean_squared_error(y_test, y_pred),
    )
)

R2 Score : 0.6908753357757642
Mean Squared Error: 41525886.58280326
mean_absolute_error : 4311.092308479656
root_mean_squared_error: 6444.058238625971
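
The four metrics above are closely related: RMSE is the square root of MSE (here, the square root of 41525886.58 is about 6444.06), and R² compares the residual sum of squares with the total variance around the mean. The sketch below recomputes each metric with plain NumPy on a tiny hand-made example; the arrays `y_true` and `y_hat` are illustrative values, not model output.

```python
import numpy as np

# Tiny hand-made example (not model output).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat                      # [1, 0, -1, -2]

mse = np.mean(residuals ** 2)                   # (1 + 0 + 1 + 4) / 4
mae = np.mean(np.abs(residuals))                # (1 + 0 + 1 + 2) / 4
rmse = np.sqrt(mse)                             # square root of MSE

ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                        # share of variance explained

print(f"MSE = {mse}, MAE = {mae}, RMSE = {rmse:.4f}, R2 = {r2}")
```

MAE and RMSE are in the same units as the target (charges), which makes them easier to interpret than MSE.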


#### Train the model with Scaled numbers

In [75]:
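scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training-set mean and standard deviation on the test set.
# Calling fit_transform on X_test (as in the run that produced the metrics below)
# rescales the test set with its own statistics, which is why that score differs
# from the unscaled model.
X_test_scaled = scaler.transform(X_test)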

In [76]:
scaled_model = LinearRegression()
scaled_model.fit(X_train_scaled,y_train)

In [77]:
y_pred_with_scaled_x = scaled_model.predict(X_test_scaled)

In [78]:
print(
    "R2 Score : {}\nMean Squared Error: {}\nmean_absolute_error : {}\nroot_mean_squared_error: {}".format(
        r2_score(y_test, y_pred_with_scaled_x),
        mean_squared_error(y_test, y_pred_with_scaled_x),
        mean_absolute_error(y_test, y_pred_with_scaled_x),
        root_mean_squared_error(y_test, y_pred_with_scaled_x),
    )
)

R2 Score : 0.6806623803374146
Mean Squared Error: 42897831.55611179
mean_absolute_error : 4504.416088953354
root_mean_squared_error: 6549.6436205424025
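
For ordinary least squares, standardizing the features should leave the predictions unchanged, as long as the scaler is fit on the training split only and the same statistics are reused on the test split. A different score after scaling usually points to the two splits being scaled inconsistently. The sketch below illustrates this on synthetic data using `make_pipeline`; the arrays `X_demo` and `y_demo` and their coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (illustrative only).
rng = np.random.default_rng(3)
X_demo = rng.normal(loc=[40, 30, 1], scale=[14, 6, 1.2], size=(400, 3))
y_demo = (
    250 * X_demo[:, 0] + 330 * X_demo[:, 1] + 500 * X_demo[:, 2]
    + rng.normal(0, 3000, size=400)
)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=12)

# Unscaled model.
plain = LinearRegression().fit(X_tr, y_tr)

# Pipeline: the scaler is fit on the training data inside fit(), and the same
# mean and standard deviation are applied automatically inside predict().
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

pred_plain = plain.predict(X_te)
pred_scaled = scaled.predict(X_te)

# Inconsistent variant for comparison: scale the test set with its own statistics.
leaky_X_te = StandardScaler().fit_transform(X_te)
pred_leaky = scaled.named_steps["linearregression"].predict(leaky_X_te)

print("consistent scaling matches unscaled:", np.allclose(pred_plain, pred_scaled))
print("refit-on-test scaling matches unscaled:", np.allclose(pred_plain, pred_leaky))
```

Wrapping the scaler and model in one pipeline removes the chance of fitting the scaler on the wrong split, since both steps are driven by a single `fit` and `predict` call.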
