### Predicting Health Insurance Charges

##### Objective:

The objective of this data science problem is to develop a predictive model that can accurately estimate health insurance charges for individuals based on their demographic and health-related information.

### Q1. Define an approach you would take to solve the problem and document it.

### Approach to solve the problem.

##### Prediction Problem. 
- Understand the problem: the goal is to develop a predictive model that accurately estimates health insurance charges from demographic and health-related information.
- Understand the dataset provided and its features.


### Q2. Get the data and determine what type of machine learning problem it is.

- Obtain and upload the dataset into a DataFrame.
- Since the goal is to predict a continuous target variable (health insurance charges), this is a regression problem.
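The regression-versus-classification decision can be checked mechanically from the target column. The helper below is a minimal sketch, not part of the original notebook: `problem_type` and its `max_classes` threshold are hypothetical names chosen for illustration, and the sample values are the first five `charges` entries shown later in this notebook.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Heuristic (assumption): float targets or many distinct values suggest regression."""
    if is_float_dtype(target) or target.nunique() > max_classes:
        return "regression"
    return "classification"


# First five values of the charges column, copied from df.head()
charges = pd.Series([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552], name="charges")
smoker = pd.Series(["yes", "no", "no", "no", "no"], name="smoker")

print(problem_type(charges))
print(problem_type(smoker))
```

A heuristic like this is only a sanity check; the problem statement itself (predicting a monetary amount) is what settles the question.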


### Q3. Outline possible algorithms you would use to create the model. 

- Linear Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Machine (SVM) Regression
- Gradient Boosting Regression
- Neural Network Regression
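As a rough sketch, the candidates listed above map onto scikit-learn estimators as follows. The dictionary name `candidate_models` and the hyperparameters (`random_state=42`, `max_iter=1000`) are assumptions for illustration; the notebook itself only trains the first four with default settings.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# One estimator per algorithm in the list; seeds are fixed so reruns are repeatable
candidate_models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(random_state=42),
    "svr": SVR(),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
    "neural_network": MLPRegressor(random_state=42, max_iter=1000),
}

for name, model in candidate_models.items():
    print(name, type(model).__name__)
```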

### Q4. Conduct exploratory analysis (EDA) to understand the distribution of variables, identify any correlations, and gain insights into the dataset.

- Explore the distribution of variables using statistical measures and visualizations.
- Identify correlations between features and the target variable.
- Look for outliers and understand their impact on the model.
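The three EDA steps above could look roughly like this. The data here is synthetic (generated with a fixed seed and an assumed relationship between `smoker`, `age`, and `charges`) so the snippet runs without `insurance.csv`; the 1.5 × IQR rule is one common outlier convention, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (assumption: smoker and age drive charges)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.integers(0, 2, size=n),
})
demo["charges"] = 250 * demo["age"] + 20000 * demo["smoker"] + rng.normal(0, 1000, size=n)

# Distribution summary and correlation of each feature with the target
summary = demo.describe()
corr_with_target = demo.corr()["charges"].drop("charges").sort_values(ascending=False)

# Flag outliers in the target with the 1.5 * IQR rule
q1, q3 = demo["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (demo["charges"] < q1 - 1.5 * iqr) | (demo["charges"] > q3 + 1.5 * iqr)

print(corr_with_target)
print("outliers flagged:", int(outlier_mask.sum()))
```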


### Q5. Handle missing values, encode categorical variables, and scale numerical features if necessary. 

HINT: Encoding converts categorical values to numbers, using tools such as scikit-learn's `LabelEncoder` and `OneHotEncoder` (OHE). Scaling puts features on comparable ranges so that no feature dominates simply because of its units; `StandardScaler` and `MinMaxScaler` are common choices.

##### Data Preprocessing
- Handle missing values: Impute missing values using appropriate strategies like mean, median, or mode imputation.
- Encode categorical variables: Convert categorical variables into numerical format using techniques like One-Hot Encoding or Label Encoding.
- Scale numerical features: Normalize or standardize numerical features to ensure fair comparison among them.
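The three preprocessing steps can be combined in a single scikit-learn `ColumnTransformer`. This is a sketch of one possible arrangement, not the approach the notebook takes later (which uses `LabelEncoder` and scales every column). The sample rows come from `df.head()`, except that one `bmi` value is replaced with `NaN` purely to exercise the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "bmi", "children"]
categorical_cols = ["sex", "smoker", "region"]

# Numeric columns: median imputation, then standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: mode imputation, then one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, np.nan, 22.705],  # NaN inserted for illustration only
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})
features = preprocessor.fit_transform(sample)
print(features.shape)
```

Keeping the steps in one object means the same fitted transformations can be reapplied to the test split with `preprocessor.transform`.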


### Q6. Extract additional features if needed, such as interaction terms or polynomial features.

- Extract additional features if needed, such as interaction terms or polynomial features, to capture complex relationships between variables.


### Q7. Evaluate different regression algorithms (e.g., linear regression, decision tree regression, random forest regression, etc.) and select the one with the best performance based on evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared.

- Split the dataset into training and testing sets.
- Train different regression algorithms using the training set.
- Evaluate each model's performance on the test set using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
- Select the model with the best performance based on evaluation metrics.
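A single train/test split can be noisy on a dataset of this size, so one option is to compare models with k-fold cross-validation before touching the test set. This is a sketch on synthetic data from `make_regression`; the fold count, the MAE criterion, and the two models shown are illustrative assumptions rather than the notebook's procedure.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the insurance features
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn returns negated MAE so that higher is always better
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

best_name = min(cv_mae, key=cv_mae.get)
print(cv_mae, best_name)
```

On this synthetic data the target is linear by construction, so the ranking says nothing about which model suits the insurance data.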


### Q8. Train the selected model on the training dataset.

- Train the selected model using the entire training dataset.
- Evaluate the trained model's performance on the test dataset to ensure generalization and assess its predictive capabilities.


### Q9. Evaluate the trained model's performance on the test dataset using appropriate evaluation metrics.

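- Compute MAE, MSE, and R-squared on the held-out test set to assess how well the model generalizes.
- Deploy the trained model in a production environment for making predictions.
- Monitor the model's performance over time and update it as needed to maintain accuracy and relevance.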
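For the deployment step, one common way to hand a fitted scikit-learn model to another process is to serialize it with `joblib`. The sketch below uses a toy model and a temporary directory; the file name `charges_model.joblib` is hypothetical, and a real deployment would also need to persist the preprocessing steps.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model: y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "charges_model.joblib"  # hypothetical file name
    joblib.dump(model, path)       # save the fitted estimator
    restored = joblib.load(path)   # reload it, e.g. in a serving process

prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```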


#### Import necessary libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


#### Load the dataset

In [3]:
df = pd.read_csv('insurance.csv')

#### Explore the dataset

In [4]:
df

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


In [5]:
df.head()

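| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |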


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [7]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [8]:
df.isnull()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | False | False | False | False | False | False | False |
| 1334 | False | False | False | False | False | False | False |
| 1335 | False | False | False | False | False | False | False |
| 1336 | False | False | False | False | False | False | False |


In [9]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337    False
Length: 1338, dtype: bool
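The element-wise outputs of `df.isnull()` and `df.duplicated()` are hard to read at 1338 rows; aggregating them with `.sum()` gives a per-column count of missing values and a single count of duplicate rows. The small frame below is made up for illustration (one `NaN` and one repeated row are inserted deliberately), so the counts it prints say nothing about the insurance data.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 2 repeats row 1, and one bmi value is missing
demo = pd.DataFrame({
    "age": [19, 18, 18, 28],
    "bmi": [27.9, 33.77, 33.77, np.nan],
    "smoker": ["yes", "no", "no", "no"],
})

missing_per_column = demo.isnull().sum()        # count of NaN values in each column
duplicate_rows = int(demo.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```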

In [41]:
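# X is created in the "Split the data into features and target" step; this cell was run after encoding.
# After label encoding, sex, smoker and region are int32, so they match neither the
# int64/float64 filter nor the object filter used here.
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns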

In [42]:
X.select_dtypes  # no parentheses: this displays the bound method instead of calling it

<bound method DataFrame.select_dtypes of       age  sex     bmi  children  smoker  region
0      19    0  27.900         0       1       3
1      18    1  33.770         1       0       2
2      28    1  33.000         3       0       2
3      33    1  22.705         0       0       1
4      32    1  28.880         0       0       1
...   ...  ...     ...       ...     ...     ...
1333   50    1  30.970         3       0       1
1334   18    0  31.920         0       0       0
1335   18    0  36.850         0       0       2
1336   21    0  25.800         0       0       3
1337   61    0  29.070         0       1       1

[1338 rows x 6 columns]>

In [43]:
X.select_dtypes(include=['int64', 'float64']).columns

Index(['age', 'bmi', 'children'], dtype='object')

In [45]:
X.select_dtypes(include=['object']).columns

Index([], dtype='object')

#### Encode categorical variables

In [12]:
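# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
# The same encoder object is refit for each column, so its fitted classes are overwritten each time.
label_encoder = LabelEncoder()
df['sex'] = label_encoder.fit_transform(df['sex'])
df['smoker'] = label_encoder.fit_transform(df['smoker'])
df['region'] = label_encoder.fit_transform(df['region'])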


In [15]:
df['sex']

0       0
1       1
2       1
3       1
4       1
       ..
1333    1
1334    0
1335    0
1336    0
1337    0
Name: sex, Length: 1338, dtype: int32

In [16]:
df['smoker']

0       1
1       0
2       0
3       0
4       0
       ..
1333    0
1334    0
1335    0
1336    0
1337    1
Name: smoker, Length: 1338, dtype: int32

In [17]:
df['region']

0       3
1       2
2       2
3       1
4       1
       ..
1333    1
1334    0
1335    2
1336    3
1337    1
Name: region, Length: 1338, dtype: int32
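Label encoding turns `region` into the integers 0 to 3, which a linear model reads as an ordered quantity even though the regions have no natural order. One alternative, sketched here and not used in this notebook, is one-hot encoding with `pd.get_dummies`. Dropping the first category (`drop_first=True`) is an assumption made to avoid a redundant column.

```python
import pandas as pd

regions = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northeast"],
})

# One indicator column per region; the alphabetically first one (northeast) is dropped
encoded = pd.get_dummies(regions, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```

A row of all zeros then stands for the dropped category, `northeast`.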


#### Split the data into features and target

In [18]:
X = df.drop(columns=['charges'])
y = df['charges']

In [19]:
X

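| | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.900 | 0 | 1 | 3 |
| 1 | 18 | 1 | 33.770 | 1 | 0 | 2 |
| 2 | 28 | 1 | 33.000 | 3 | 0 | 2 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 |
| 4 | 32 | 1 | 28.880 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | 1 | 30.970 | 3 | 0 | 1 |
| 1334 | 18 | 0 | 31.920 | 0 | 0 | 0 |
| 1335 | 18 | 0 | 36.850 | 0 | 0 | 2 |
| 1336 | 21 | 0 | 25.800 | 0 | 0 | 3 |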


In [20]:
y

0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1338, dtype: float64

#### Split the data into training and testing sets

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [23]:
train_test_split(X, y, test_size=0.2, random_state=42)

[      age  sex     bmi  children  smoker  region
 560    46    0  19.950         2       0       1
 1285   47    0  24.320         0       0       0
 1142   52    0  24.860         0       0       2
 969    39    0  34.320         5       0       2
 486    54    0  21.470         3       0       1
 ...   ...  ...     ...       ...     ...     ...
 1095   18    0  31.350         4       0       0
 1130   39    0  23.870         5       0       2
 1294   58    1  25.175         0       0       0
 860    37    0  47.600         2       1       3
 1126   55    1  29.900         0       0       3
 
 [1070 rows x 6 columns],
       age  sex     bmi  children  smoker  region
 764    45    0  25.175         2       0       0
 887    36    0  30.020         0       0       1
 890    64    0  26.885         0       1       1
 1293   46    1  25.745         3       0       1
 259    19    1  31.920         0       1       1
 ...   ...  ...     ...       ...     ...     ...
 109    63    1  35.09

#### Scale numerical features

In [24]:
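# Fit the scaler on the training split only, then reuse its statistics on the test split
# so that no information from the test set leaks into training.
# Note: this scales every column, including the label-encoded sex, smoker and region.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)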

In [26]:
X_train_scaled

array([[ 0.47222651, -1.0246016 , -1.75652513,  0.73433626, -0.50874702,
        -0.45611589],
       [ 0.54331294, -1.0246016 , -1.03308239, -0.91119211, -0.50874702,
        -1.35325561],
       [ 0.8987451 , -1.0246016 , -0.94368672, -0.91119211, -0.50874702,
         0.44102382],
       ...,
       [ 1.3252637 ,  0.97598911, -0.89153925, -0.91119211, -0.50874702,
        -1.35325561],
       [-0.16755139, -1.0246016 ,  2.82086429,  0.73433626,  1.96561348,
         1.33816354],
       [ 1.1120044 ,  0.97598911, -0.10932713, -0.91119211, -0.50874702,
         1.33816354]])

In [27]:
X_test_scaled

array([[ 0.40114007, -1.0246016 , -0.89153925,  0.73433626, -0.50874702,
        -1.35325561],
       [-0.23863782, -1.0246016 , -0.08946143, -0.91119211, -0.50874702,
        -0.45611589],
       [ 1.75178229, -1.0246016 , -0.60845296, -0.91119211,  1.96561348,
        -0.45611589],
       ...,
       [-0.09646495,  0.97598911, -0.41972876, -0.08842793, -0.50874702,
        -1.35325561],
       [ 1.04091797, -1.0246016 ,  2.78941026, -0.91119211,  1.96561348,
         0.44102382],
       [ 0.82765867, -1.0246016 ,  0.60252728, -0.08842793, -0.50874702,
         1.33816354]])
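To make concrete what `fit_transform` on the training set followed by `transform` on the test set does, the sketch below reproduces `StandardScaler` by hand: each test value is standardized as (x − mean) / std, where the mean and (population) standard deviation come from the training data only. The numbers are a few ages from the notebook, used purely as an example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[19.0], [18.0], [28.0], [33.0]])
test = np.array([[32.0]])

scaler = StandardScaler().fit(train)   # learns mean and std from the training data
scaled_test = scaler.transform(test)   # applies those training statistics to the test data

# Manual equivalent; np.std defaults to the population std (ddof=0), as StandardScaler does
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_test, manual)
```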

#### Initialize regressors

In [28]:
linear_reg = LinearRegression()
random_forest_reg = RandomForestRegressor()
decision_tree_reg = DecisionTreeRegressor()
svr_reg = SVR()
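`SVR` with its defaults (`C=1.0`, `epsilon=0.1`) can fit poorly when the target is measured in thousands, as `charges` is, because those defaults are tiny relative to the target's scale. One possible remedy, which this notebook does not apply, is to standardize the target with `TransformedTargetRegressor`. The sketch below shows the effect on synthetic data whose scale loosely resembles `charges`; the coefficients are invented for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and a target in the tens of thousands (assumed relationship)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 13000 + 8000 * X_demo[:, 0] + 3000 * X_demo[:, 1] + rng.normal(0, 500, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default SVR trained directly on the raw target
plain_svr = SVR().fit(X_tr, y_tr)

# Same SVR, but the target is standardized for fitting and un-standardized for predictions
scaled_target_svr = TransformedTargetRegressor(
    regressor=SVR(), transformer=StandardScaler()
).fit(X_tr, y_tr)

plain_r2 = r2_score(y_te, plain_svr.predict(X_te))
scaled_r2 = r2_score(y_te, scaled_target_svr.predict(X_te))
print(f"raw target R2: {plain_r2:.3f}, standardized target R2: {scaled_r2:.3f}")
```

Tuning `C` and `epsilon` directly is another route; which works better on the insurance data would need to be tested.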

#### Train and evaluate models

In [32]:
# Fit each candidate on the scaled training data and score it on the held-out test set
for model in [linear_reg, random_forest_reg, decision_tree_reg, svr_reg]:
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # MAE is in the units of charges, MSE is in squared units, R-squared is unitless
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Model: {model.__class__.__name__}")
    print(f"Mean Absolute Error: {mae}")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    print("------------------------")

Model: LinearRegression
Mean Absolute Error: 4186.508898366437
Mean Squared Error: 33635210.43117844
R-squared: 0.7833463107364536
------------------------
Model: RandomForestRegressor
Mean Absolute Error: 2517.142255962781
Mean Squared Error: 21068869.503180083
R-squared: 0.864289586776453
------------------------
Model: DecisionTreeRegressor
Mean Absolute Error: 3029.225887485075
Mean Squared Error: 41989539.20374115
R-squared: 0.7295337694532715
------------------------
Model: SVR
Mean Absolute Error: 8599.328962388287
Mean Squared Error: 165839509.92452022
R-squared: -0.06821813183902203
------------------------
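Because MSE is expressed in squared units, it is hard to compare with MAE or with the mean charge of about 13,270 reported by `df.describe()`. Taking the square root gives the root mean squared error (RMSE) in the same units as `charges`. The snippet below applies that to the MSE values printed above; the dictionary name `reported_mse` is introduced here for illustration.

```python
import math

# MSE values copied from the evaluation output above
reported_mse = {
    "LinearRegression": 33635210.43117844,
    "RandomForestRegressor": 21068869.503180083,
    "DecisionTreeRegressor": 41989539.20374115,
    "SVR": 165839509.92452022,
}

# RMSE = sqrt(MSE), in the same units as charges
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}
best_by_rmse = min(rmse, key=rmse.get)

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
print("lowest RMSE:", best_by_rmse)
```

This ranking agrees with the MAE and R-squared figures above: the random forest has the lowest error on this split and SVR the highest.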
