The Medical Insurance Dataset describes patient attributes and the costs their insurance companies incurred. Your task is to predict insurance charges ('charges'). Each written comment required by a subtask should be no more than 150 words.

Your project's grade breakdown is the following:

1. Build an OLS regression and reach an adjusted R-squared of at least 70%. (15%)
2. Explain OLS regression output: R-squared and adjusted R-squared and their differences. (10%)
3. Explain OLS regression output: beta coefficients. Also write the OLS function formula. Do all of your beta coefficients seem logical? Comment on the beta coefficients you received. (20%)
4. Explain OLS regression output: p-values. Indicate which of your p-values are above the threshold (choose 0.05). Why might it be this way? (10%)
5. Write the OLS function formula and explain the confidence intervals. Comment on the confidence intervals you received. (5%)
6. Calculate SST, SSR and SSE. What do these metrics mean? Comment on your SST, SSR and SSE. (5%)
7. Calculate MAE, MSE, RMSE. Your MAE should be about 3500-4500. Which one is the best metric in your case? Comment on your MAE, MSE, RMSE. (20%)
8. Build a residual plot and comment on it. (5%)
9. Build an actual vs. predicted charges plot and comment on it. (5%)
10. Briefly explain what degrees of freedom and the dummy variable trap are. (5%)



In [1]:
import os
import sys
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.metrics import mean_squared_log_error

# Silence library warnings in the notebook output
warnings.filterwarnings("ignore")

In [3]:
#read file
df = pd.read_csv('insurance.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


# Task1

In [10]:
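# Encode categorical variables; drop_first=True avoids the dummy variable trap
# dtype=float keeps the dummy columns numeric (newer pandas versions default to bool,
# which statsmodels may reject in a mixed-type frame)
df_encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True, dtype=float)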

In [9]:
# Prepare data for model

X = df_encoded.drop("charges", axis=1)
y = df_encoded["charges"]

In [7]:
X = sm.add_constant(X)

In [13]:
model = sm.OLS(y, X).fit()
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:                charges   R-squared (uncentered):                   0.874
Model:                            OLS   Adj. R-squared (uncentered):              0.874
Method:                 Least Squares   F-statistic:                              1158.
Date:                Sun, 27 Aug 2023   Prob (F-statistic):                        0.00
Time:                        19:55:05   Log-Likelihood:                         -13618.
No. Observations:                1338   AIC:                                  2.725e+04
Df Residuals:                    1330   BIC:                                  2.729e+04
Df Model:                           8                                                  
Covariance Type:            nonrobust                                                  
                       coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------

# Task2

The **R-squared** value measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. Here, R-squared is 0.751, which means that approximately 75.1% of the variance in charges is explained by the independent variables in the model.

The **adjusted R-squared** value adjusts the R-squared value for the number of predictors relative to the number of observations, which guards against rewarding a model simply for including more predictors. Here, the adjusted R-squared is 0.749, slightly lower than the R-squared. The small gap shows that the penalty for the number of predictors is minor at this sample size; it does not by itself show which predictors contribute significantly.

The difference between R-squared and adjusted R-squared lies in how they account for model complexity. R-squared can increase by simply adding more predictors, which can lead to overfitting. Adjusted R-squared penalizes models with more predictors that do not significantly improve the fit, making it a more reliable measure of model fit.
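The link between the two statistics can be checked by hand. The sketch below assumes the usual formula for a model with an intercept, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with n = 1338 observations and p = 8 predictors (intercept excluded); the function name is illustrative, not part of statsmodels.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared as quoted above (rounded), 1338 rows, 8 predictors
adj = adjusted_r_squared(0.751, n_obs=1338, n_predictors=8)
print(f"{adj:.4f}")  # about 0.7495 when starting from the rounded 0.751
```

Because n is large compared with p, the correction factor (n − 1)/(n − p − 1) is close to 1, which is why the two values differ only in the third decimal place.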

# Task3

**coef**: These are the estimated coefficients (beta coefficients) for each predictor variable, indicating the change in the dependent variable for a one-unit change in the predictor variable, holding other variables constant.

**std err**: Standard errors associated with the estimated coefficients. These give an idea of the uncertainty in the estimated coefficients.

**t**: The t-statistic is the estimated coefficient divided by its standard error, i.e. how many standard errors the estimate lies from zero. It is used to test whether the coefficient is statistically significant.

**P>|t|**: This is the p-value associated with the t-statistic. It is the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis that the coefficient is zero.

**[0.025, 0.975]**: These are the lower and upper bounds of the 95% confidence interval for the estimated coefficient. They provide a range within which the true coefficient value is likely to fall.
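The columns described above are tied together by simple arithmetic. The following is a minimal sketch using hypothetical values for `coef` and `std_err` (they are not taken from the fitted model). It also assumes a normal approximation to the t distribution, which is reasonable with more than a thousand residual degrees of freedom, so the numbers may differ slightly from the statsmodels summary.

```python
from statistics import NormalDist


def summarize_coefficient(coef: float, std_err: float, alpha: float = 0.05):
    """Return (t statistic, two-sided p-value, confidence interval).

    Uses the standard normal as a large-sample stand-in for the t distribution.
    """
    z = NormalDist()
    t_stat = coef / std_err
    p_value = 2.0 * (1.0 - z.cdf(abs(t_stat)))
    crit = z.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for a 95% interval
    return t_stat, p_value, (coef - crit * std_err, coef + crit * std_err)


# Hypothetical coefficient and standard error, for illustration only
t_stat, p_value, (lower, upper) = summarize_coefficient(coef=250.0, std_err=12.5)
print(t_stat, p_value, lower, upper)
```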

The OLS regression function formula is:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + \epsilon$$
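To make the formula concrete, a prediction is the intercept plus the sum of each coefficient times its predictor value. The sketch below uses made-up coefficients and a made-up observation (none of these numbers come from the fitted model), and the helper name `predict_one` is hypothetical.

```python
def predict_one(intercept: float, betas: dict, row: dict) -> float:
    """Evaluate intercept + sum(beta_j * x_j) for a single observation."""
    return intercept + sum(beta * row[name] for name, beta in betas.items())


# Hypothetical coefficients and one hypothetical observation
betas = {"x1": 2.0, "x2": 3.0}
row = {"x1": 30.0, "x2": 25.0}
y_hat = predict_one(intercept=100.0, betas=betas, row=row)
print(y_hat)  # 100 + 2*30 + 3*25 = 235.0
```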

# Task4

P>|t| (P-value): This column shows the p-value associated with the t-statistic for each coefficient. A p-value is the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis that the coefficient is zero.

The p-values for region_northwest and sex_male are above the 0.05 threshold, so these coefficients are not statistically significant. In other words, sex might not be a significant predictor of charges in this model, and there is not enough evidence to conclude that living in the "northwest" region significantly affects charges.
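The threshold check described above can be written as a small filter. This is a sketch only: the helper name is hypothetical and the p-values in the example are placeholders, not the values from the regression output.

```python
def not_significant(p_values: dict, threshold: float = 0.05) -> list:
    """Return the names of predictors whose p-value exceeds the threshold."""
    return [name for name, p in p_values.items() if p > threshold]


# Placeholder p-values, for illustration only
example = {"x1": 0.001, "x2": 0.40, "x3": 0.049, "x4": 0.70}
flagged = not_significant(example)
print(flagged)  # ['x2', 'x4']
```

With the fitted model, the same function could presumably be applied to `model.pvalues`, which statsmodels exposes as a pandas Series indexed by predictor name.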

# Task5

$$\text{Dependent Variable} = \beta_0 + \beta_1 \times \text{Predictor}_1 + \beta_2 \times \text{Predictor}_2 + \dots + \beta_k \times \text{Predictor}_k + \epsilon$$

**Comments on the Confidence Intervals:**

The confidence intervals provide insight into the precision of the coefficient estimates. Wider intervals indicate less precise estimates, while narrower intervals indicate more precise estimates.

For the coefficient estimates of "const," "age," "bmi," and "children," the confidence intervals are relatively narrow. This suggests that these coefficients are estimated with relatively high precision.

The "const" coefficient represents the intercept. Its confidence interval suggests that the true population intercept lies between approximately -13900 and -10000.

For the "age," "bmi," and "children" coefficients, the lower and upper bounds of the confidence intervals do not include zero. This indicates that these predictors have a statistically significant effect on the dependent variable (charges) at the 5% level, because a 95% interval that excludes zero corresponds to a p-value below 0.05.
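The two readings of an interval used above (its width as a measure of precision, and whether it contains zero) can be sketched as follows. The intercept bounds are the approximate values quoted in the text; the second interval is hypothetical and only shows what a non-significant coefficient would look like.

```python
def interval_width(lower: float, upper: float) -> float:
    """Width of a confidence interval; smaller means a more precise estimate."""
    return upper - lower


def excludes_zero(lower: float, upper: float) -> bool:
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


const_interval = (-13900.0, -10000.0)  # approximate bounds quoted above
hypothetical_interval = (-500.0, 300.0)  # made-up interval that crosses zero

print(interval_width(*const_interval), excludes_zero(*const_interval))
print(interval_width(*hypothetical_interval), excludes_zero(*hypothetical_interval))
```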