### Business Problem Understanding

- The goal is to predict a person's medical expenses from factors such as age, BMI, smoking status, and other demographic information. These predictions help estimate healthcare costs and support insurance and medical decisions.

In [37]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

###  Data Understanding

In [3]:
df = pd.read_csv(r'C:\Users\sahur\Downloads\insurance.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
1333,50,male,31.0,3,no,northwest,10600.55
1334,18,female,31.9,0,no,northeast,2205.98
1335,18,female,36.9,0,no,southeast,1629.83
1336,21,female,25.8,0,no,southwest,2007.95


**Dataset Understanding**

In [4]:
df.shape

(1338, 7)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   expenses  1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [6]:
df['sex'].value_counts()

sex
male      676
female    662
Name: count, dtype: int64

In [7]:
df['children'].value_counts()

children
0    574
1    324
2    240
3    157
4     25
5     18
Name: count, dtype: int64

In [8]:
df['smoker'].value_counts()

smoker
no     1064
yes     274
Name: count, dtype: int64

In [9]:
df['region'].value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64

**Exploratory Data Analysis**

In [10]:
continuous_features = ['age','bmi','expenses']
discrete_categorical = ['sex','smoker','region']
discrete_count = ['children']

In [11]:
df[continuous_features].describe()

Unnamed: 0,age,bmi,expenses
count,1338.0,1338.0,1338.0
mean,39.207025,30.665471,13270.422414
std,14.04996,6.098382,12110.01124
min,18.0,16.0,1121.87
25%,27.0,26.3,4740.2875
50%,39.0,30.4,9382.03
75%,51.0,34.7,16639.915
max,64.0,53.1,63770.43


In [12]:
df[discrete_categorical].describe()

Unnamed: 0,sex,smoker,region
count,1338,1338,1338
unique,2,2,4
top,male,no,southeast
freq,676,1064,364


In [13]:
df[continuous_features].corr()

Unnamed: 0,age,bmi,expenses
age,1.0,0.109341,0.299008
bmi,0.109341,1.0,0.198576
expenses,0.299008,0.198576,1.0


### Data Preprocessing

**Data Cleaning**

In [15]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
expenses    0
dtype: int64

In [16]:
df.duplicated().sum()

1

In [18]:
df.drop_duplicates(inplace = True)

In [19]:
df.shape

(1337, 7)

In [20]:
df.drop('region',axis = 1,inplace = True)

**Encoding**

In [23]:
# encoding 'sex' column (assign the result back instead of using inplace on a selected column)
df['sex'] = df['sex'].map({'female':0,'male':1})

# encoding 'smoker' column
df['smoker'] = df['smoker'].map({'no':0,'yes':1})

**Features (x) and Target (y)**

In [25]:
x = df.drop('expenses',axis = 1)
y = df['expenses']

**Train Test Split**

In [27]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size = 0.8, random_state = 13)

### Modelling & Evaluation

In [28]:
from sklearn.model_selection import GridSearchCV

# model
from sklearn.linear_model import Lasso
estimator = Lasso()

# parameter & value
param_grid = {'alpha':list(range(1,100))}

# Identifying the best value of the parameter within given values for the given data 
model_hp = GridSearchCV(estimator,param_grid,cv = 5,scoring = 'r2')

model_hp.fit(x_train,y_train)

model_hp.best_params_

{'alpha': 64}
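
Beyond `best_params_`, the fitted search object also stores the mean cross-validated score of every candidate `alpha`, which shows how sensitive the score is to that choice. The sketch below is illustrative only: it assumes synthetic data from `make_regression` and a small hypothetical grid (`demo_grid`), not the insurance data or the 1 to 99 grid used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the insurance dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Small hypothetical grid, chosen only to keep the demo fast
demo_grid = {'alpha': [0.1, 1, 10, 100]}
demo_search = GridSearchCV(Lasso(), demo_grid, cv=5, scoring='r2')
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate, in the same order as 'params'
results = demo_search.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(f"alpha={params['alpha']:>5}: mean R2={mean:.4f} (std {std:.4f})")

print('Best alpha:', demo_search.best_params_['alpha'])
print('Best mean CV R2:', round(demo_search.best_score_, 4))
```

If neighbouring `alpha` values score almost the same, the exact winner matters less than the general range.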

**Build Lasso Model using best hyperparameters**

In [30]:
# Modelling
from sklearn.linear_model import Lasso
lasso_best = Lasso(alpha=64)
lasso_best.fit(x_train,y_train)

print('Intercept:',lasso_best.intercept_)
print('coefficient:',lasso_best.coef_)

# prediction & evaluation on train data
ypred_train = lasso_best.predict(x_train)

from sklearn.metrics import r2_score
print('Train R2:',r2_score(y_train,ypred_train))

from sklearn.model_selection import cross_val_score
print('Cross validation score:',cross_val_score(lasso_best,x_train,y_train,cv = 5).mean())

# prediction & evaluation on test data
ypred_test = lasso_best.predict(x_test)
print('Test R2:',r2_score(y_test,ypred_test))

Intercept: -12021.003805310957
coefficient: [  257.50021231    -0.           327.32419947   392.64442115
 23759.71445301]
Train R2: 0.7478682741667431
Cross validation score: 0.7406417007941529
Test R2: 0.754622022325151
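
The R2 values above are coefficients of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. As a minimal sketch of that formula, the hypothetical helper `r2_manual` below computes it with NumPy. The demo numbers are made up for illustration and are not taken from the insurance data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values (assumption: illustrative only)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]
print(r2_manual(y_true_demo, y_pred_demo))  # SS_res = 0.75, SS_tot = 20, so about 0.9625
```

A value near 1 means the predictions explain most of the variation in the target. A value near 0 means they do little better than always predicting the mean.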


### Final Model

In [31]:
# 'sex' is dropped because Lasso shrank its coefficient to zero in the model above
x = x.drop('sex',axis=1)
y = df['expenses']

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 9)

# Modelling
from sklearn.linear_model import Lasso
lasso_best = Lasso(alpha=60)
lasso_best.fit(x_train,y_train)

print('Intercept:',lasso_best.intercept_)
print('coefficient:',lasso_best.coef_)

# prediction & evaluation on train data
ypred_train = lasso_best.predict(x_train)

print('Train R2:',r2_score(y_train,ypred_train))
print('Cross Validation Score:',cross_val_score(lasso_best,x_train,y_train,cv = 5).mean())

# prediction & evaluation on test data
ypred_test = lasso_best.predict(x_test)
print('Test R2:',r2_score(y_test,ypred_test))

Intercept: -12045.187463841941
coefficient: [  264.37194096   317.04095573   373.19607238 23621.90427308]
Train R2: 0.7592042058163876
Cross Validation Score: 0.7538402453637711
Test R2: 0.7008929179833459


### Prediction on New Data 

**Data**

In [32]:
input_data = {'age':35,
              'sex':'Male',
              'bmi':31.4,
              'children':5,
              'smoker':'yes',
              'region':'southeast'
             }

In [33]:
df_test = pd.DataFrame(input_data,index = [0])
df_test

Unnamed: 0,age,sex,bmi,children,smoker,region
0,35,Male,31.4,5,yes,southeast


**Preprocessing the Data**

In [34]:
df_test.drop(['region','sex'],axis = 1,inplace = True)

# assign the result back; an inplace replace on a selected column acts on a copy
df_test['smoker'] = df_test['smoker'].map({'no':0,'yes':1})


**Predict**

In [35]:
lasso_best.predict(df_test)

array([32650.80111484])
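
The preprocessing steps for new data can be collected into one function so that every input is encoded the same way and its columns arrive in the order the model was trained on. The sketch below is one possible arrangement, not the notebook's own code. `prepare_input`, `FEATURE_ORDER`, and `SMOKER_MAP` are hypothetical names, and the column order assumes the final model was fit on `age`, `bmi`, `children`, `smoker`.

```python
import pandas as pd

# Assumed training column order after 'region' and 'sex' were dropped
FEATURE_ORDER = ['age', 'bmi', 'children', 'smoker']
SMOKER_MAP = {'no': 0, 'yes': 1}

def prepare_input(raw: dict) -> pd.DataFrame:
    """Turn one raw record into a single-row frame ready for predict()."""
    row = pd.DataFrame([raw])
    row['smoker'] = row['smoker'].map(SMOKER_MAP)
    # map() yields NaN for any value outside the mapping, so fail loudly
    if row['smoker'].isna().any():
        raise ValueError("smoker must be 'yes' or 'no'")
    # Select only the model's features, in training order
    return row[FEATURE_ORDER]

prepared = prepare_input({'age': 35, 'sex': 'Male', 'bmi': 31.4,
                          'children': 5, 'smoker': 'yes', 'region': 'southeast'})
print(prepared)
```

The returned frame could then be passed to `lasso_best.predict(prepared)`.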

In [36]:
# manual check: intercept + coefficient * value for age, bmi, children, smoker
-12045.187463841941 + (264.37194096 * 35) + (317.04095573 * 31.4) + (373.19607238 * 5) + (23621.90427308 * 1)

32650.801114660055
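
The manual calculation above is the linear model written out: the intercept plus each coefficient times its feature value. As a small sketch, the same arithmetic can be wrapped in a function. The intercept and coefficients are copied from the printed output of the final model, while `predict_expenses` and the dictionary layout are hypothetical names used only for illustration.

```python
# Values copied from the printed output of the final model above
INTERCEPT = -12045.187463841941
COEFS = {
    'age': 264.37194096,
    'bmi': 317.04095573,
    'children': 373.19607238,
    'smoker': 23621.90427308,
}

def predict_expenses(features):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)

new_person = {'age': 35, 'bmi': 31.4, 'children': 5, 'smoker': 1}
print(predict_expenses(new_person))  # approximately 32650.80, in line with the model's prediction
```

Because `smoker` is encoded as 0 or 1, its coefficient reads directly as the estimated extra expense for a smoker with the other features held fixed.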