# Application of supervised learning

The "Medical Cost Personal" dataset, available on Kaggle, provides a rich source of information on individual patients' health insurance data, which can be instrumental in understanding the factors influencing the cost of their medical treatment. The dataset encompasses six independent features, namely age, sex, body mass index (BMI), number of children, smoking status, and region of residence. 

The data is stored in the "data/supervised_learning_data" folder and can also be downloaded from Kaggle: [https://www.kaggle.com/datasets/mirichoi0218/insurance](https://www.kaggle.com/datasets/mirichoi0218/insurance)

### Why a linear regression model?

Linear regression is a supervised learning algorithm used when the target (dependent) variable is a continuous real number. It models the relationship between the dependent variable y and one or more independent variables x using a best-fit line.
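
As a minimal sketch of what "best-fit line" means for a single feature, the snippet below computes the least-squares slope and intercept by hand with NumPy. The helper name `fit_line` and the toy numbers are made up for illustration and are not taken from the insurance dataset.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance(x, y) / variance(x); the intercept makes the line pass through the means
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy values (hypothetical): costs grow by exactly 100 per year of age
ages = [20, 30, 40, 50]
costs = [2000, 3000, 4000, 5000]
slope, intercept = fit_line(ages, costs)
print(slope, intercept)
```

With several features the same idea applies, except that one coefficient is estimated per feature.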

We now need to load the data, analyze it, and preprocess it if needed so that we can apply the linear regression model.

### Load the Data

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from pathlib import Path

import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

import plotly.express as px

In [None]:
dataset_folder = Path('../../data/supervised_learning_data/')
dataset_folder.resolve()

In [None]:
data = pd.read_csv(dataset_folder.joinpath('insurance.csv'))
data.head()

In [None]:
data.shape

### EDA approach (Exploratory Data Analysis)

EDA is an approach to analyzing and summarizing datasets in order to gain insights and understand the underlying patterns and relationships within the data. 

#### Do we need to clean the data?

In [None]:
data.info()

In [None]:
data.describe()

From the output of data.info we can see that there are no missing values, and the values in each column seem consistent.
Note, however, that the "children" and "charges" columns are skewed.
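
One way to put a number on skewness, assuming pandas' built-in `skew` method is an acceptable measure, is sketched below on a toy frame. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): one symmetric column, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 20],
})

# skew() is close to 0 for symmetric data and positive for a long right tail
skewness = toy.skew()
print(skewness)
```

On the notebook's data the equivalent call would be `data[['children', 'charges']].skew()`.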

#### Check if there are correlations between variables

Note: the closer the absolute value of the coefficient is to 1, the stronger the correlation.
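
For reference, the coefficient in question is the Pearson correlation. The sketch below computes it by hand and compares it with `np.corrcoef`; the helper name `pearson` and the toy values are assumptions made for this example.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Toy values (hypothetical): charges rising with age, but not perfectly linearly
age = [19, 25, 33, 45, 60]
charges = [1700, 2500, 3900, 7000, 12000]
r_age = pearson(age, charges)
print(r_age)

# Perfectly linear relationships give the two extreme values
print(pearson([1, 2, 3], [2, 4, 6]))   # increasing line
print(pearson([1, 2, 3], [6, 4, 2]))   # decreasing line
```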

In [None]:
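# Restrict to numeric columns: sex, smoker and region are still strings at this point
my_corr_data = data.corr(numeric_only=True)
sns.heatmap(my_corr_data, annot= True)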

#### Using all of this, we can see that:
- The dataset, shaped as (1338,7), comprises 1338 individual patient entries (rows) and 7 attributes (columns).
- 'Charges' is the target variable we aim to predict, while the other six (age, sex, BMI, children, smoker, and region) are the independent variables used for prediction.
- Given the presence of multiple independent variables, a Multiple Linear Regression model is needed to best predict the 'charges' based on these variables.

Important: a high correlation does not imply causation.

#### See, for example, the correlation between sex and charges

In [None]:
df = data.copy()
fig = px.histogram(df, 
                   x='charges', 
                   color='sex',
                   color_discrete_sequence=['blue', 'orange']
)
fig.show()

In [None]:
fig = px.histogram(df, 
                   x='bmi', 
                   color='sex',
                   color_discrete_sequence=['blue', 'orange']
)
fig.show()

In [None]:
sns.lmplot(x='bmi',y='charges',hue='sex',data=data,aspect=1.5,height=5)

We see that for most customers the charges are between 0 and 20k, and that there are more male customers than female customers.
We can also see that more male customers have a high BMI.

The general trend seems to be that overweight customers are charged more. Because more males are overweight, there is indeed a correlation between being male and having higher medical fees, though it may be small.
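
A simple way to check this kind of indirect relationship numerically is to compare group sizes and group means. The sketch below does so on a tiny hand-made frame; the values are hypothetical and only mimic the column names of the dataset.

```python
import pandas as pd

# Hand-made toy rows (hypothetical values, not taken from insurance.csv)
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female", "female"],
    "bmi": [31.0, 33.0, 28.0, 35.0, 24.0, 27.0, 30.0],
    "charges": [12000, 15000, 8000, 17000, 5000, 7000, 11000],
})

# Number of customers per group, then average BMI and charges per group
counts = toy["sex"].value_counts()
summary = toy.groupby("sex")[["bmi", "charges"]].mean()
print(counts)
print(summary)
```

If one group has both a higher mean BMI and higher mean charges, part of the apparent effect of sex may simply reflect the effect of BMI.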

#### A quick look at the relationship between charges and three other variables: age, smoker and region

In [None]:
sns.lmplot(x='bmi',y='charges',hue='age',data=data,aspect=1.5,height=5)

It seems that the older customers get, the higher their medical fees are.

In [None]:
sns.lmplot(x='bmi',y='charges',hue='smoker',data=data,aspect=1.5,height=5)

In [None]:
sns.lmplot(x='bmi',y='charges',hue='region',data=data,aspect=1.5,height=5)

It seems that the variables discussed above are correlated with charges (strongly so for age and smoker, for example).

So let's fit a linear regression on multiple features.

### Functions that split the data and train the model

In [341]:
def compute(data, param = False, polynomial = False):
    X = data.drop('charges',axis=1) # Independent variables
    y = data['charges'] # Dependent variable
    X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.3,random_state=1)

    # Model: `param` is passed to fit_intercept (False means no intercept term is fitted)
    model = LinearRegression(fit_intercept= param)
    
    # Optional variant using polynomial (interaction) features
    if polynomial:
        poly_mod = PolynomialFeatures(degree = 2, interaction_only = True)
        X_train = poly_mod.fit_transform(X_train)
        # Reuse the transformer fitted on the training set instead of refitting on the test set
        X_test = poly_mod.transform(X_test)
    
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)
    
    # model.score and r2_score(y_true, y_pred) both report R squared on the test set
    print('The score of my model is: ', model.score(X_test, Y_test))
    print('The R squared score of my model is: ', r2_score(Y_test, predictions))
    print('The coefficients are: ', model.coef_)
    print('The Root Mean Squared Error is: ', np.sqrt(mean_squared_error(Y_test, predictions)))

    # Plot predicted against actual charges, with a fitted line to judge linearity
    plt.plot(Y_test,predictions,'o')
    m,b = np.polyfit(Y_test,predictions,1)
    plt.plot(Y_test,m*Y_test+b)
    plt.show()
    
    return model
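
The metrics printed by `compute` can be checked on a few hand-made values. This sketch relies on scikit-learn's documented argument order `(y_true, y_pred)`; the numbers are toy values chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values (hypothetical): only the last prediction is off, by 2
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: 4 / 4 = 1.0
rmse = np.sqrt(mse)                        # same unit as the target: 1.0
r2 = r2_score(y_true, y_pred)              # 1 - 4 / 5 = 0.2

# Unlike the MSE, R squared is not symmetric: swapping the arguments changes the result
r2_swapped = r2_score(y_pred, y_true)      # 1 - 4 / 14, about 0.714
print(mse, rmse, r2, r2_swapped)
```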

In [None]:
def cross_val(model, data):
    X = data.drop('charges',axis=1) # Independent variables
    y = data['charges'] # Dependent variable
    kfold = 50 # Number of cross-validation folds
    return cross_val_score(model, X, y, cv=kfold)
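
As a hedged illustration of how such cross-validation scores are typically read, the sketch below runs `cross_val_score` with an explicit `KFold` splitter on synthetic data. The data, the five folds and the shuffling are assumptions for this example, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical): 3 features, known coefficients, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffled 5-fold split; each fold is used once as the validation set
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

# The mean summarizes performance, the standard deviation its stability across folds
print(scores.mean(), scores.std())
```

With many folds on a small dataset each validation set holds only a few rows, so individual fold scores can vary widely.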

### Data processing

Before training, however, we need to turn the categorical columns into numerical values, because ML algorithms do not work with categorical data directly. There are three common options: dummy variables, label encoding and one-hot encoding.

Here we use pandas' get_dummies with drop_first=True, which creates dummy variables (a one-hot encoding with the first category of each column dropped).
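
To make the difference between the three encodings concrete, here is a small sketch on a toy column. The frame is invented for illustration, and label encoding is done with pandas category codes, which is only one of several ways to do it.

```python
import pandas as pd

# Toy column (hypothetical rows) with three distinct categories
toy = pd.DataFrame({"region": ["northeast", "southwest", "northeast", "southeast"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy, columns=["region"], dtype="int8")

# Dummy variables: same, but the first category is dropped to avoid redundancy
dummies = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype="int8")

# Label encoding: a single integer code per category (implies an arbitrary order)
label_codes = toy["region"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(dummies.columns.tolist())
print(label_codes.tolist())
```

For a linear model, dummy variables are usually preferred over label codes, since integer codes would suggest an ordering between regions that does not exist.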

In [None]:
# get dummies for categorical columns
data = pd.get_dummies(data, columns = ['sex', 'smoker', 'region'],drop_first =True,
              dtype='int8')
data.head()

In [None]:
print(data.columns.values)

In [None]:
sns.pairplot(data.iloc[:, 0:4], diag_kind = 'kde')

### Estimator for data with preprocessing

In [None]:
model = compute(data)

The model reaches a score of about 0.70, so it fits the data reasonably well. However, the graph shows that the relationship between the dependent and independent variables is not exactly linear.

This is not ideal, and we should handle the model a bit differently to get better results.

### The model can be improved

There are three things we can look at to improve the model:
- Multicollinearity: identifying and removing highly correlated independent variables improves model stability and interpretability.
- Polynomial features: adding them lets the model capture more complex relationships.
- Gradient descent: an optimization algorithm that iteratively adjusts the model parameters to minimize the difference between the predicted and actual values. For plain linear regression it converges towards the same least-squares fit that LinearRegression computes directly, so it changes how the model is fitted rather than what it can represent.
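
As a rough illustration of the last item, the following sketch fits a linear model with batch gradient descent on synthetic data and compares the result with NumPy's least-squares solver. The function name `gradient_descent`, the learning rate and the iteration count are illustrative choices, not part of the notebook.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Fit y ~ w0 + X @ w[1:] by minimizing the mean squared error with batch gradient descent."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the mean squared error with respect to w
        grad = 2 / len(y) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# Synthetic, noise-free data (hypothetical): intercept 3, coefficients 2 and -1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([2.0, -1.0])

w_gd = gradient_descent(X, y)
# Closed-form least-squares solution for comparison
w_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)
print(w_gd, w_ols)
```

Both approaches should land on essentially the same coefficients here, which is why gradient descent matters more for very large datasets or models without a closed-form solution.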

Here we will focus on polynomial features.

#### Polynomial Features

In [None]:
model = compute(data, polynomial = True)

The score is now higher than before: 0.84 instead of 0.70.
The graph also shows that the model fits the data better, which is confirmed by the prediction error, which is smaller than before.

So polynomial features fit the data better: they help when the relationships between variables aren't exactly linear, which improves accuracy.

### Conclusion

Applying linear regression to this dataset was a better idea than classification, since the target variable (charges) is a continuous value.

However, after:
- exploring the data and finding correlations between inputs and targets
- picking the linear regression model
- encoding the categorical data as dummy variables
- setting aside a test set (a fraction of the full dataset)
- training the model
- making predictions on the test set

we have seen that the model still had to be improved, as we did in the example with polynomial features.
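
The steps listed above can also be chained into a single scikit-learn pipeline. The sketch below is only one possible arrangement, run on synthetic data: the toy columns, the coefficients used to generate `charges`, and the choice of `ColumnTransformer` with `OneHotEncoder` are assumptions for illustration, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic data (hypothetical): charges depend on age, bmi and a smoker-bmi interaction
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
toy["charges"] = (250 * toy["age"] + 300 * toy["bmi"]
                  + (toy["smoker"] == "yes") * 800 * toy["bmi"]
                  + rng.normal(0, 1000, size=n))

# Encode the categorical column, pass the numeric ones through unchanged (dense output)
encode = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), ["smoker"])],
    remainder="passthrough",
    sparse_threshold=0,
)
pipeline = make_pipeline(
    encode,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="charges"), toy["charges"], test_size=0.3, random_state=1
)
pipeline.fit(X_train, y_train)           # every step is fitted on the training set only
score = pipeline.score(X_test, y_test)   # R squared on the held-out test set
print(score)
```

Keeping the preprocessing inside the pipeline ensures the test set is only ever transformed, never used for fitting.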

However, as the model still needs improvement, it might have been worth comparing it with other models such as XGBoost regression, which has more parameters that we could adjust.