# Medical Cost Prediction
This notebook uses the medical cost dataset to predict the individual medical costs billed by health insurance.


### Topics Covered:

Training a Linear Regression model

Predicting using the trained model

Evaluating a model: R² score and root mean squared error (RMSE)

Finding out coefficients and intercept

#### Load the data into a pandas DataFrame, check its shape, and check for null values

In [4]:
import pandas as pd

# Local path to the dataset CSV; adjust it for your own machine.
df = pd.read_csv('C:/Users/MANISH/python/DS & ML/Projects/insurance (1).csv')


In [5]:
print(df.head())
df.shape


   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


(1338, 7)

In [6]:
df.isnull().sum()


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
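
The dataset above has no missing values, so every count is zero. As a minimal sketch of what the same check reports when a value is missing, the toy frame below (hypothetical values, not the insurance data) has one empty `bmi` cell:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the insurance data: one bmi value is missing.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
})

null_counts = toy.isnull().sum()          # per-column count of missing cells
has_missing = toy.isnull().any().any()    # True if any cell at all is missing

print(null_counts)
print("any missing:", has_missing)
```

A non-zero count would usually call for dropping or imputing rows before training.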

#### Convert categorical features to numeric values using one-hot encoding


In [7]:
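from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# Select the categorical (object-dtype) columns: sex, smoker, region.
d = df.select_dtypes(include=[object])
# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array.
encoded_labels = ohe.fit_transform(d).toarray()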

In [8]:
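# Wrap the one-hot array in a DataFrame (its columns are labelled 0..7).
df_encoded = pd.DataFrame(encoded_labels)
# Drop the original categorical columns, now replaced by the encoded ones.
df.drop(columns=['sex', 'smoker', 'region'], inplace=True)
# Join side by side; ignore_index=True with axis=1 relabels all columns 0..11,
# so columns 8-11 are age, bmi, children and charges.
df3 = pd.concat([df_encoded, df], ignore_index=True, axis=1)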

In [9]:
df3

        0    1    2    3    4    5    6    7    8       9   10           11
0     1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   19  27.900    0  16884.92400
1     0.0  1.0  1.0  0.0  0.0  0.0  1.0  0.0   18  33.770    1   1725.55230
2     0.0  1.0  1.0  0.0  0.0  0.0  1.0  0.0   28  33.000    3   4449.46200
3     0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0   33  22.705    0  21984.47061
4     0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0   32  28.880    0   3866.85520
...   ...  ...  ...  ...  ...  ...  ...  ...  ...     ...  ...          ...
1333  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0   50  30.970    3  10600.54830
1334  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0   18  31.920    0   2205.98080
1335  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0   18  36.850    0   1629.83350
1336  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0   21  25.800    0   2007.94500


#### Split the dataset into training and test sets, keeping 25% of the data for testing

In [10]:
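from sklearn.model_selection import train_test_split

# Features are every column except the last; the target (charges) is the last column.
X = df3.iloc[:, :-1]
y = df3.iloc[:, -1]

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
print(X_train.shape)
print(X_test.shape)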

(1003, 11)
(335, 11)
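
The shapes above follow from the split arithmetic: 25% of 1338 rows is 334.5, and `train_test_split` rounds the test size up, leaving 335 test rows and 1003 training rows. A minimal sketch that reproduces the sizes with placeholder arrays (zeros, not the insurance data; `X_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's X and y
# (1338 rows, 11 feature columns); the values are not the insurance data.
X_demo = np.zeros((1338, 11))
y_demo = np.zeros(1338)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7
)
print(X_tr.shape, X_te.shape)
```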


#### Train a linear regression model for prediction

In [14]:
from sklearn.linear_model import LinearRegression 
regr = LinearRegression()
regr.fit(X_train,y_train)

LinearRegression()

#### Coefficients and intercept of the trained model


In [20]:
print('The final coefficients after training are:',regr.coef_)
print('The final intercept after training is:',regr.intercept_)

The final coefficients after training are: [    97.18113355    -97.18113355 -11905.32547314  11905.32547314
    667.24158209     15.86338029   -233.63531263   -449.46964974
    251.90247991    353.38540435    465.2280675 ]
The final intercept after training is: -887.6917181755362
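
A fitted linear regression predicts with the intercept plus the dot product of the coefficients and the feature values. The sketch below checks that relationship on synthetic data (hypothetical values generated with NumPy, not the insurance data; `X_demo`, `y_demo`, `model` and `manual` are illustrative names).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: 50 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept + sum over features of coefficient * value.
manual = X_demo @ model.coef_ + model.intercept_

# The manual result should agree with model.predict up to floating-point error.
print(np.allclose(manual, model.predict(X_demo)))
```

In the notebook's output, the first two coefficient pairs mirror each other (±97.18 and ±11905.33). This is likely because both one-hot columns of each binary feature (`sex`, `smoker`) were kept, so the two columns in a pair carry the same information with opposite sign.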


#### Predict charges for the test data and calculate the R² score and root mean squared error

In [21]:
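from sklearn.metrics import r2_score,mean_squared_error
y_pred = regr.predict(X_test)
print('Predictions for test data:', y_pred[:5])
print("r2 score of our model is:", r2_score(y_test,y_pred))
# RMSE is the square root of the MSE. Taking the root explicitly avoids the
# squared=False argument, which is deprecated in recent scikit-learn releases.
print("root mean squared error of our model is:", mean_squared_error(y_test,y_pred) ** 0.5)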


Predictions for test data: [15248.874306   11126.97945225 -2048.68105088 29282.63519248
  9070.8295246 ]
r2 score of our model is: 0.7509741262661104
root mean squared error of our model is: 6080.977967616992
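
To make the two metrics concrete, here is a small sketch that computes them by hand and compares the results with scikit-learn. The four values are hypothetical and chosen only to keep the arithmetic easy to follow. RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("manual RMSE:", rmse, "sklearn:", mean_squared_error(y_true, y_hat) ** 0.5)
print("manual R2:", r2, "sklearn:", r2_score(y_true, y_hat))
```

Read this way, the notebook's R² of about 0.75 means the model accounts for roughly three quarters of the variance in test-set charges, and the RMSE of about 6081 is in the same units as `charges`.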
