<h1> <center> Insurance premium prediction <center> </h1>
    
    
- Context:

The **insurance.csv** dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with numerical value designated for each level.

- Inspiration: 

The purposes of this exercise to look into different features to observe their relationship, and plot a multiple linear regression based on several features of individual such as age, physical/family condition and location against their existing medical expense to be used for predicting future medical expenses of individuals that help medical insurance to make decision on charging the premium.

In [9]:
### import all packages needed 
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt 
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import *
import warnings
warnings.filterwarnings("ignore")

In [5]:
#### Import the dataset 
Data = pd.read_csv("insurance.csv")

In [6]:
### The first five rows of the dataframe 
Data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [8]:
### Get the data info 
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
expenses    1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


### Preprocessing stage:

- At this stage of our work, we want to perform a preprocessing on the dataset. 
    - Convert all categorical variable into nominal cases 
    - Standardized the continuous features to be on the same scale. 

In [10]:
#### Import preprocessing packages 
from sklearn.preprocessing import MinMaxScaler
MinMax = MinMaxScaler()

In [11]:
#### perform a transformation on age and bmi features in the datasets. 
numeric_scaling = MinMax.fit_transform(Data[["age","bmi"]].values)

In [23]:
### Extract all categorical features in the dataframe 
def categorical_extraction(Data):
    """
    Argument:
        Takes in the dataframe as input argument
    operation:
        Extract all categorical features in the dataframe 
    return:
        List of categorical features in the dataframe. 
        
    """
    categorical_features = []
    for column in Data.columns:
        if column == "age" or column == "bmi" or column == "expenses" or column=="children":
            continue 
        else:
            categorical_features.append(column)
    return categorical_features
categorical_features = categorical_extraction(Data=Data)

In [24]:
### categorical features in the dataframe. 
categorical_features

['sex', 'smoker', 'region']

In [25]:
#### dataframe of categorical features 
categorical_data = Data[categorical_features]

In [26]:
categorical_data.columns

Index(['sex', 'smoker', 'region'], dtype='object')

In [27]:
#### Encode all categorical variables 
categorical_encoding = pd.get_dummies(categorical_data,categorical_features,drop_first=False)

In [36]:
#### concatenate the values together 
x_values = np.concatenate([categorical_encoding.values,Data["children"].values.reshape(-1,1),numeric_scaling],axis=1)
y_value = Data["expenses"].values 

In [38]:
print(f"The of x_values: {x_values.shape} \n")
print(f"The of y_values: {y_value.shape} \n")

The of x_values: (1338, 11) 

The of y_values: (1338,) 



### Model comparison and analysis:

- At this point of our analysis, we want to compare various model and analyse them. This is to evaluate the performance of the model on the insurance data. 
- Before we do that, we have to split the data in train and test set for analysis. 

In [39]:
### splitting the data into train and test datasets. 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train , y_test = train_test_split(x_values,y_value,random_state=50,test_size=0.25)

### Building various model 

- The following are the models:

    - Linear regression
    - Decision tress regression
    - Random forest regressor 
    - Gradient Boosting regression 
    - Mult-layer preceptron regression 
    
 The goal is to perform comparison analysis of various model to assess which best solve the problem. 

In [58]:

Dict = {
    "Multi-layer preceptron": MLPRegressor(hidden_layer_sizes=(22,),activation="relu"),
    "Linear" : LinearRegression(),
    "Decision regressor" : DecisionTreeRegressor(presort=True),
    "Random regressor" : RandomForestRegressor(n_estimators=10),
    "AdaboostReg" : AdaBoostRegressor(n_estimators=55),
    "Gradientboost" : GradientBoostingRegressor(n_estimators=100)  
}

In [59]:
Dict.items()

dict_items([('Multi-layer preceptron', MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(22,), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)), ('Linear', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)), ('Decision regressor', DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=True, random_state=None, splitter='best'

In [61]:
#### This is to perform comparison analysis of various model. 
models = []
r2_square = []
mse = []
for name, model in Dict.items():
    y_pred = model.fit(x_train,y_train).predict(x_test)
    models.append(name)
    r2_square.append(r2_score(y_pred,y_test))
    mse.append(mean_squared_error(y_pred,y_test))
Information = pd.DataFrame({
    "Models": models,
    "r2_score":r2_square,
    "mean_squared_error":mse
})
#### Final results 

In [62]:
Information

Unnamed: 0,Models,r2_score,mean_squared_error
0,Multi-layer preceptron,-100497.803578,332751300.0
1,Linear,0.696189,33419940.0
2,Decision regressor,0.712351,42121810.0
3,Random regressor,0.83871,21918330.0
4,AdaboostReg,0.797163,26406240.0
5,Gradientboost,0.858734,18910460.0


In [63]:
MLPRegressor??