### Step-1: Business Problem Understanding

- The goal is to predict a person's medical expenses based on factors like age, BMI, smoking status, and other demographic information. This helps in estimating healthcare costs and making informed insurance or medical decisions.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.simplefilter('ignore')

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

### Step-2: Data Understanding

**Load Data & Understand every variable**

In [3]:
df = pd.read_csv(r'C:\Users\sahur\Downloads\insurance.csv')
df.head()

index,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


**Dataset Understanding**

In [4]:
df.shape

(1338, 7)
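
Beyond the shape, it usually helps to check for missing values and to see which columns are categorical before preprocessing. The sketch below is a minimal illustration that re-types only the five rows shown by `df.head()` into a small frame named `sample` (a name introduced here so the snippet runs without the CSV file); on the full data the same calls would be applied to `df`.

```python
import pandas as pd

# The five rows printed by df.head(), re-typed so this sketch is self-contained.
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.8, 33.0, 22.7, 28.9],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'expenses': [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()

# Split the column names into numeric and non-numeric (categorical) groups.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(missing_counts)
print('numeric:', numeric_cols)
print('categorical:', categorical_cols)
```

The non-numeric columns (`sex`, `smoker`, `region`) are the ones that need encoding or removal before a linear model can use them.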

### Step-3: Data Preprocessing

In [5]:
df.duplicated().sum()

1

In [6]:
df.drop_duplicates(inplace = True)

In [7]:
df.shape

(1337, 7)

In [8]:
# drop the region column 
df.drop('region',axis = 1,inplace = True)

In [22]:
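# encode the 'sex' column (female -> 0, male -> 1)
# assign the result back instead of calling replace(..., inplace=True) on a column selection
df['sex'] = df['sex'].map({'female':0,'male':1})

# encode the 'smoker' column (no -> 0, yes -> 1)
df['smoker'] = df['smoker'].map({'no':0,'yes':1})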
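
A dictionary-based encoding silently depends on every label being spelled exactly as expected. The following is a small sketch, using a made-up three-row frame called `demo` (not part of the insurance data), of how `Series.map` exposes an unexpected label as `NaN` so it can be counted before modelling.

```python
import pandas as pd

sex_codes = {'female': 0, 'male': 1}

# Hypothetical input with one badly capitalised label ('Male').
demo = pd.DataFrame({'sex': ['female', 'male', 'Male']})

# map() returns NaN for any label that is not a key of the dictionary.
encoded_sex = demo['sex'].map(sex_codes)

# Count labels that could not be encoded.
unmapped = int(encoded_sex.isnull().sum())
print('unmapped labels:', unmapped)
```

If the count is non-zero, the labels can be normalised (for example with `str.lower()`) before encoding.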

In [23]:
x = df.drop('expenses',axis = 1)
y = df['expenses']

In [24]:
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size = 0.8,random_state = 9)
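
To see what `train_size = 0.8` means in row counts, the sketch below repeats the split on placeholder arrays with the same number of rows as the de-duplicated data (1337). The arrays `x_demo` and `y_demo` are stand-ins introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1337 rows, matching df.shape after dropping the duplicate.
x_demo = np.arange(1337).reshape(-1, 1)
y_demo = np.arange(1337)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, train_size=0.8, random_state=9)

# 80% of 1337 is 1069.6; the train count is rounded down and the rest goes to test.
print(len(x_tr), len(x_te))  # 1069 268
```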

### Step-4: Modelling

**Applying Hyperparameter tuning for Ridge Regression**

In [25]:
# model
estimator = Ridge()

# parameters & values
param_grid = {'alpha':list(range(1,100))}

# Identify the best value of the parameter among the given values for this data
model_hp = GridSearchCV(estimator,param_grid,cv=5,scoring = 'r2')

model_hp.fit(x_train,y_train)
model_hp.best_params_

{'alpha': 1}
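
The selected value, 1, is the smallest candidate in `range(1,100)`, so the search cannot tell whether an even smaller `alpha` would score better. One possible follow-up is a log-spaced grid that also covers values below 1. The sketch below shows the idea on synthetic data from `make_regression`; the data, the grid bounds, and the names `X_demo`, `y_demo`, `alpha_grid`, and `search` are assumptions for illustration, and the result on the insurance data may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption): 200 rows and 5 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=9)

# Six log-spaced candidates from 0.001 to 100, so values below 1 are also tried.
alpha_grid = {'alpha': np.logspace(-3, 2, 6).tolist()}

search = GridSearchCV(Ridge(), alpha_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
```

If the best value again lands on the edge of the grid, the grid can be extended further in that direction.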

**Modelling Ridge Regression using best hyperparameters**

In [27]:
# Modelling

ridge_best = Ridge(alpha = 1)
ridge_best.fit(x_train,y_train)

print('Intercept:',ridge_best.intercept_)
print('Coefficient:',ridge_best.coef_)

Intercept: -12131.383174500314
Coefficient: [  264.4786592   -112.37962155   318.56350557   413.12069122
 23853.85951773]
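
The coefficient array is easier to read when each value is paired with its feature name. The sketch below copies the printed coefficients and assumes the column order `age, sex, bmi, children, smoker` (the order of `x` after `region` and `expenses` are dropped); in the notebook itself the same pairing could be built from `x_train.columns` and `ridge_best.coef_`.

```python
# Feature order of x after dropping 'region' and 'expenses'.
feature_names = ['age', 'sex', 'bmi', 'children', 'smoker']

# Coefficients copied from the printed output above.
coefficients = [264.4786592, -112.37962155, 318.56350557, 413.12069122, 23853.85951773]

labelled = dict(zip(feature_names, coefficients))

# Print the features sorted by absolute coefficient size, largest first.
for name, value in sorted(labelled.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f'{name:>8}: {value:12.2f}')

largest = max(labelled, key=lambda n: abs(labelled[n]))
```

Read this way, being a smoker (`smoker = 1`) adds roughly 23,854 to the predicted expenses with the other features held fixed. The features are on different scales, so the raw coefficient sizes are not directly comparable across features.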


### Step-5: Evaluation

**Evaluation on Train Data**

In [30]:
ypred_train = ridge_best.predict(x_train)

print('Train R2:',r2_score(y_train,ypred_train))
print('cross validation score:',cross_val_score(ridge_best,x_train,y_train,cv = 5).mean())

Train R2: 0.7593639632162803
cross validation score: 0.7534705953944542


**Evaluation on Test Data**

In [31]:
ypred_test = ridge_best.predict(x_test)
print('Test R2:',r2_score(y_test,ypred_test))

Test R2: 0.7008629672692219
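
As a reminder of what these R2 values measure, the score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a small toy example; the numbers in `y_true` and `y_pred` are made up for illustration and are not taken from the insurance data.

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (illustrative only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

r2_toy = r2_manual(y_true, y_pred)
print(r2_toy)  # 0.95

# Always predicting the mean of y_true gives an R2 of 0.
r2_mean_only = r2_manual(y_true, [6.0] * 4)
```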


#### Prediction on Unknown Data 

**Data**

In [32]:
input_data = {'age':31,
              'sex':'female',
              'bmi':25.74,
              'children':0,
              'smoker':'no',
              'region':'northeast' 
             }

              
            

**Step-1: Preprocessing the data**

In [42]:
df_test = pd.DataFrame(input_data,index = [0])

# apply the same preprocessing steps that were used on the training data
df_test.drop('region',axis = 1,inplace = True)
df_test['sex'] = df_test['sex'].map({'female':0,'male':1})
df_test['smoker'] = df_test['smoker'].map({'no':0,'yes':1})

df_test

index,age,sex,bmi,children,smoker
0,31,0,25.74,0,0
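
Because new records must go through exactly the same steps as the training data, the preprocessing can be collected into one function. The following is a minimal sketch; `preprocess_record`, `SEX_CODES`, `SMOKER_CODES`, and `FEATURE_ORDER` are hypothetical names introduced here and do not appear in the notebook.

```python
import pandas as pd

SEX_CODES = {'female': 0, 'male': 1}
SMOKER_CODES = {'no': 0, 'yes': 1}
# Column order the model was trained on ('region' is not used).
FEATURE_ORDER = ['age', 'sex', 'bmi', 'children', 'smoker']

def preprocess_record(record):
    """Turn one raw input dict into a single-row frame with the training columns."""
    frame = pd.DataFrame(record, index=[0])
    frame['sex'] = frame['sex'].map(SEX_CODES)
    frame['smoker'] = frame['smoker'].map(SMOKER_CODES)
    # map() yields NaN for labels outside the dictionaries, so fail loudly here.
    if frame[['sex', 'smoker']].isnull().any().any():
        raise ValueError('unexpected label in sex or smoker')
    # Selecting FEATURE_ORDER drops 'region' and fixes the column order.
    return frame[FEATURE_ORDER]

row = preprocess_record({'age': 31, 'sex': 'female', 'bmi': 25.74,
                         'children': 0, 'smoker': 'no', 'region': 'northeast'})
print(row)
```

The returned frame can then be passed to the fitted model's `predict` method.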


**Step-2: Predict**

In [43]:
ridge_best.predict(df_test)

array([4267.27989412])
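
The prediction can be checked by hand, since a Ridge model predicts the intercept plus the sum of each coefficient times its feature value. The sketch below copies the intercept and coefficients printed in Step-4 and assumes the feature order `age, sex, bmi, children, smoker`.

```python
# Intercept and coefficients copied from the printed model output.
intercept = -12131.383174500314
coefficients = {
    'age': 264.4786592,
    'sex': -112.37962155,
    'bmi': 318.56350557,
    'children': 413.12069122,
    'smoker': 23853.85951773,
}

# The encoded input row: 31-year-old female (0), BMI 25.74, no children, non-smoker (0).
features = {'age': 31, 'sex': 0, 'bmi': 25.74, 'children': 0, 'smoker': 0}

# Linear prediction: intercept + sum(coefficient * feature value).
manual_prediction = intercept + sum(coefficients[name] * features[name] for name in coefficients)

print(round(manual_prediction, 2))  # 4267.28
```

The result agrees with the model output above up to the rounding of the printed coefficients.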