# Problems 5 & 6

## Problem 5 : Multiple Regression

Your goal is to predict the insurance charges (charges) from a patient’s BMI (bmi), age (age),
and number of children (children). Thus, you need to estimate three regression weights (β1, β2,
and β3), along with the intercept (α), and the noise parameter (σ). 

It is also recommended that you
standardize your predictors (i.e., subtract the means from the input variables and divide by their
standard deviations) in order to bring them to a common scale. Split the data into a training set
and a test set and fit the model only to the training set. Perform the usual convergence checks and
describe your results. Which of the three variables is the best predictor of Insurance Charges?

***Alternative problem chosen:***

> Use the Bayesian Ridge regression implementation from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html, also for the next task. If you go this path, include a small description of how the Bayesian ridge differs from the model implementation suggested above.

In [124]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [125]:
# Data formatting:
data = pd.read_csv('insurance.csv', header=0)
data.head()

# standardize the predictor variables and true output
data['age'] = (data['age'] - data['age'].mean()) / data['age'].std()
data['bmi'] = (data['bmi'] - data['bmi'].mean()) / data['bmi'].std()
data['children'] = (data['children'] - data['children'].mean()) / data['children'].std()
data['smoker'] = data['smoker'].map({'yes': 0, 'no': 1})
data['smoker'] = (data['smoker'] - data['smoker'].mean()) / data['smoker'].std()

data['charges'] = (data['charges'] - data['charges'].mean()) / data['charges'].std()


# create train/test split (fixed seed for reproducibility)
train_set, test_set = train_test_split(data, test_size=0.33, random_state=691)

data.head()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | -1.438227 | female | -0.453151 | -0.908274 | -1.96985 | southwest | 0.298472 |
| 1 | -1.509401 | male | 0.509431 | -0.078738 | 0.507273 | southeast | -0.953333 |
| 2 | -0.797655 | male | 0.383164 | 1.580335 | 0.507273 | southeast | -0.728402 |
| 3 | -0.441782 | male | -1.305043 | -0.908274 | 0.507273 | northwest | 0.719574 |
| 4 | -0.512957 | male | -0.292447 | -0.908274 | 0.507273 | northwest | -0.776512 |
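The cell above standardizes with the mean and standard deviation of the full dataset before splitting. A common alternative is to compute those statistics on the training rows only and reuse them for the test rows, so that no test information enters the preprocessing. The sketch below illustrates this on synthetic data; the `toy` frame and its values are made up for the example and are not the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same column names (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "children": rng.integers(0, 5, size=200),
})
toy["charges"] = 250 * toy["age"] + 300 * toy["bmi"] + rng.normal(0, 5000, size=200)

features = ["age", "bmi", "children"]
toy_train, toy_test = train_test_split(toy, test_size=0.33, random_state=691)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(toy_train[features])
X_train = scaler.transform(toy_train[features])
X_test = scaler.transform(toy_test[features])

print("train column means:", X_train.mean(axis=0).round(3))  # zero by construction
print("test column means: ", X_test.mean(axis=0).round(3))   # near zero, not exact
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `pandas.Series.std()` uses `ddof=1`, so the scaled values differ very slightly from the manual approach.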


In [126]:
# Model creation
br_model = linear_model.BayesianRidge()

# Model training
br_model.fit(train_set[['age', 'bmi', 'children']], train_set['charges'])

### Difference between Bayesian Ridge & the Stan approach

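Bayesian ridge regression and the proposed Bayesian multiple regression model in Stan serve the same goal: predicting the dependent variable from the independent variables while placing priors on the parameters. They differ in how the priors are specified and how the posterior is obtained. Bayesian ridge places a zero-mean Gaussian prior on the model's coefficients, which penalizes large values and shrinks the estimates toward zero, making the model less sensitive to small fluctuations in the data. In the scikit-learn implementation, the precision of this prior (`lambda_`) and the noise precision (`alpha_`) are given Gamma hyperpriors and are estimated by maximizing the marginal likelihood, so the fit returns point estimates and a Gaussian posterior over the coefficients instead of the MCMC samples that Stan produces.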
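The shrinkage effect can be seen by fitting ordinary least squares and Bayesian ridge to the same small sample. This is a minimal sketch on synthetic data (the weights and sample size are arbitrary assumptions), not a reproduction of the insurance fit.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))  # small synthetic sample
y = X @ np.array([0.3, 0.2, 0.0]) + rng.normal(scale=1.0, size=30)

ols = LinearRegression().fit(X, y)
br = BayesianRidge().fit(X, y)

print("OLS coefficients:           ", ols.coef_)
print("Bayesian ridge coefficients:", br.coef_)

# Hyperparameters estimated from the data during fitting.
print("noise precision alpha_:  ", br.alpha_)
print("weight precision lambda_:", br.lambda_)
```

The Bayesian ridge coefficient vector is never longer than the OLS one, and the amount of shrinkage is governed by the ratio `lambda_ / alpha_`.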

## Problem 6: Predictive Distribution

Use the generated quantities block in the Stan program to also pass the test data and sample
from the predictive distribution. Extract the samples from the predictive distribution, compute the
predictive means from the samples, and calculate the root-mean-squared error (RMSE) between
the predictive means and the actual charges in the test set:

$$\mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}_m - y_m\right)^2}$$

where $M$ denotes the number of test instances, $\hat{y}_m$ denotes the predictive means, and $y_m$ denotes the actual charges. How good are
your predictions? What information did you lose by computing the predictive means? How could
you possibly propagate the uncertainty information encoded in the predictive distribution to obtain
a distribution over the test RMSE values?

***This has been appropriately modified to fit the alternative approach (using Bayesian Ridge instead of Stan).***

In [127]:
# Calculate the root-mean-squared error on the test set
test_predictions = br_model.predict(test_set[['age', 'bmi', 'children']])
RMSE = np.sqrt(np.mean((test_predictions - test_set['charges']) ** 2))
print("Root Mean Square Error over test set:", RMSE)

Root Mean Square Error over test set: 0.9831064038894856
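A single RMSE computed from predictive means discards the predictive spread. With `BayesianRidge`, one possible way to recover a distribution over the test RMSE is to request the predictive standard deviation via `predict(..., return_std=True)`, draw samples from the resulting Gaussian for each test point, and compute one RMSE per draw. The sketch below uses synthetic data and assumes the test points can be sampled independently, which ignores correlation induced by the shared coefficients.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(2)
true_w = np.array([0.27, 0.17, 0.03])  # hypothetical weights
X_tr, X_te = rng.normal(size=(300, 3)), rng.normal(size=(100, 3))
y_tr = X_tr @ true_w + rng.normal(size=300)
y_te = X_te @ true_w + rng.normal(size=100)

model = BayesianRidge().fit(X_tr, y_tr)

# Predictive mean and standard deviation for each test instance.
y_mean, y_std = model.predict(X_te, return_std=True)
rmse_of_mean = np.sqrt(np.mean((y_mean - y_te) ** 2))

# Draw S predictive samples per test point (independence assumed)
# and compute one RMSE value per draw.
S = 2000
draws = rng.normal(loc=y_mean, scale=y_std, size=(S, len(y_te)))
rmse_draws = np.sqrt(np.mean((draws - y_te) ** 2, axis=1))

lo, hi = np.percentile(rmse_draws, [2.5, 97.5])
print(f"RMSE of predictive means: {rmse_of_mean:.3f}")
print(f"RMSE over draws: mean {rmse_draws.mean():.3f}, "
      f"95% interval [{lo:.3f}, {hi:.3f}]")
```

The RMSE values of individual draws are larger than the RMSE of the predictive means because each draw also contains the observation noise that averaging removes.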


In [128]:
br_model.coef_
br_model.feature_names_in_
pd.DataFrame(br_model.coef_, index=br_model.feature_names_in_, columns=['feature importance'])

| | feature importance |
|---|---|
| age | 0.265412 |
| bmi | 0.172207 |
| children | 0.031349 |


### Discussion:

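Because `charges` was standardized to unit variance, a test RMSE of about 0.983 is close to 1, which is roughly what a model that always predicts the mean charge would achieve. The predictions based on age, BMI, and number of children are therefore weak, not strong. Since the predictors were standardized as well, the coefficients of the trained model are directly comparable: age has the largest weight (0.265), followed by BMI (0.172), while the number of children contributes very little (0.031). Age is thus the best of the three predictors of insurance charges, and the number of children is the weakest.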
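One way to put an RMSE value in context is to compare it with a baseline that always predicts the training mean. The sketch below does this on synthetic data; the `rmse` helper and the weights are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(4)
w = np.array([0.27, 0.17, 0.03])  # hypothetical weights
X_tr, X_te = rng.normal(size=(300, 3)), rng.normal(size=(150, 3))
y_tr = X_tr @ w + rng.normal(size=300)
y_te = X_te @ w + rng.normal(size=150)


def rmse(pred, actual):
    """Root-mean-squared error between two equally shaped arrays."""
    return np.sqrt(np.mean((pred - actual) ** 2))


model = BayesianRidge().fit(X_tr, y_tr)
rmse_model = rmse(model.predict(X_te), y_te)

# Baseline: predict the training mean for every test instance.
rmse_baseline = rmse(np.full_like(y_te, y_tr.mean()), y_te)

print(f"model RMSE:    {rmse_model:.3f}")
print(f"baseline RMSE: {rmse_baseline:.3f}")
print(f"relative improvement: {1 - rmse_model / rmse_baseline:.1%}")
```

A relative improvement near zero indicates that the predictors add little beyond the mean.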