# Medical Cost Models
In this page of the notebook, I have imported medical cost data from Kaggle and have created several models in hopes of predicting medical charges.

In [28]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

In [2]:
def configure_plots():
    '''Configures plots by making some quality of life adjustments'''
    # Setting each rcParam once is enough; the values persist for the session
    plt.rcParams['figure.figsize'] = [16, 9]
    plt.rcParams['axes.titlesize'] = 20
    plt.rcParams['axes.labelsize'] = 16
    plt.rcParams['xtick.labelsize'] = 14
    plt.rcParams['ytick.labelsize'] = 14
    plt.rcParams['lines.linewidth'] = 2

In [3]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


## Data Preparation

In [4]:
data = pd.read_csv("./archive (2)/medical_cost.csv")

In [5]:
data.head()

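|   | Id | age | sex | bmi | children | smoker | region | charges |
|---|----|-----|-----|-----|----------|--------|--------|---------|
| 0 | 1 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 2 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 3 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 4 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 5 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |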


In [12]:
# Convert categorical variables (sex, smoker, region) into numeric indicator columns
data_encoded = pd.get_dummies(data, drop_first=True)
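
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up three-row frame (the `toy` data is hypothetical and not taken from the Kaggle file). A categorical column with k levels becomes k-1 indicator columns, and the dropped level acts as the baseline.

```python
import pandas as pd

# Hypothetical toy frame for illustration only, not the Kaggle data
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# drop_first=True drops the first level of each categorical column
# ("female" and "no" here), leaving one indicator column per remaining level
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(list(toy_encoded.columns))
```

Numeric columns such as `age` pass through unchanged, so only the text columns are expanded.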

In [13]:
# Set the feature matrix X and the target y
# Note: X keeps every column except charges, including Id
X = data_encoded.drop('charges', axis=1)
y = data_encoded['charges']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

## Modeling

In this section, I created four different models and calculated their mean squared error (MSE), root mean squared error (RMSE), and R-squared statistics on the test set.
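
To make the relationship between the three statistics concrete, here is a minimal NumPy sketch of their definitions. The helper name `regression_metrics` and the toy numbers are made up for illustration; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R-squared for two equal-length sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                     # average squared error
    rmse = np.sqrt(mse)                               # same units as the target
    ss_res = np.sum(residuals ** 2)                   # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # error of always predicting the mean
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Toy values, not taken from the medical cost data
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

Because RMSE is just the square root of MSE, the two always rank models in the same order; RMSE is simply easier to read since it is in the units of `charges`.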

### Model 1: Linear Regression

In [19]:
model_reg = LinearRegression().fit(X_train, y_train)

In [20]:
#stats of model
y_pred_reg = model_reg.predict(X_test)

mse_reg = mean_squared_error(y_test, y_pred_reg)
rmse_reg = np.sqrt(mse_reg)
r2_reg = r2_score(y_test, y_pred_reg)
b = model_reg.intercept_

print(f"Intercept:{b}")
print(f"Mean Squared Error: {mse_reg}")
print(f"Root Mean Squared: {rmse_reg}")
print(f"R-Squared: {r2_reg}")

Intercept:-12889.557611588012
Mean Squared Error: 42729492.91371807
Root Mean Squared: 6536.7800111154165
R-Squared: 0.6953348996308837


### Model 2: Decision Tree

In [21]:
model_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

In [22]:
#stats of model
y_pred_reg = model_tree.predict(X_test)

mse_reg = mean_squared_error(y_test, y_pred_reg)
rmse_reg = np.sqrt(mse_reg)
r2_reg = r2_score(y_test, y_pred_reg)

print(f"Mean Squared Error: {mse_reg}")
print(f"Root Mean Squared: {rmse_reg}")
print(f"R-Squared: {r2_reg}")

Mean Squared Error: 51276239.37604052
Root Mean Squared: 7160.742934643061
R-Squared: 0.6343958340999446


### Model 3: Random Forest

In [26]:
model_forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

In [27]:
#stats of model
y_pred_reg = model_forest.predict(X_test)

mse_reg = mean_squared_error(y_test, y_pred_reg)
rmse_reg = np.sqrt(mse_reg)
r2_reg = r2_score(y_test, y_pred_reg)

print(f"Mean Squared Error: {mse_reg}")
print(f"Root Mean Squared: {rmse_reg}")
print(f"R-Squared: {r2_reg}")

Mean Squared Error: 26097755.63852684
Root Mean Squared: 5108.5962493161305
R-Squared: 0.8139206716757501


### Model 4: k-Nearest Neighbors

In [29]:
model_kneigh = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

In [30]:
#stats of model
y_pred_reg = model_kneigh.predict(X_test)

mse_reg = mean_squared_error(y_test, y_pred_reg)
rmse_reg = np.sqrt(mse_reg)
r2_reg = r2_score(y_test, y_pred_reg)

print(f"Mean Squared Error: {mse_reg}")
print(f"Root Mean Squared: {rmse_reg}")
print(f"R-Squared: {r2_reg}")

Mean Squared Error: 153431541.8301949
Root Mean Squared: 12386.748638371366
R-Squared: -0.0939805952267061


## Analysis of Results

Starting with the root mean squared error (RMSE), the k-nearest neighbors model has the largest value at 12386.75, while the random forest model has the smallest at 5108.60. RMSE is expressed in the same units as charges and measures the typical size of the prediction errors, with large errors weighted more heavily. The random forest's predictions are therefore the closest to the actual values on the test set, while the k-nearest neighbors predictions are the farthest.

Moving on to R-squared, the decision tree model has the lowest positive value at 0.63, followed by linear regression at 0.70. These are not terrible, but they could be higher. The random forest model has the highest R-squared at 0.81, which is very good. It is important to note that the k-nearest neighbors model produced a negative R-squared (-0.09). This means the model predicts the test data worse than a baseline that always predicts the mean of the actual charges (look at 3 in README).
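
A negative R-squared can look surprising, so here is a small sketch with made-up numbers (not the medical cost data) showing how it arises. The helper `r_squared` is written out by hand for illustration; a model that always predicts the mean scores exactly 0, and anything with larger squared error than that baseline drops below 0.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot (illustrative helper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the actual values
baseline_pred = np.full_like(y_true, y_true.mean())

# A deliberately poor prediction: the actual values in reverse order
poor_pred = y_true[::-1]

baseline_r2 = r_squared(y_true, baseline_pred)
poor_r2 = r_squared(y_true, poor_pred)
print(baseline_r2, poor_r2)
```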

Given its large RMSE and negative R-squared, the k-nearest neighbors model is the worst of the four. On the other hand, the random forest model is the best, since it has the smallest RMSE and the largest R-squared.
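
One possible, untested explanation for the weak k-nearest neighbors result is that the model works on raw distances, so a column with a large range (for instance `Id` in the preview above) can dominate columns with small ranges such as `bmi`. The sketch below shows one common way to address this, standardizing features before the neighbor search. It runs on synthetic stand-in data, the names `X_demo`, `y_demo`, and `scaled_knn` are hypothetical, and it makes no claim about the scores this approach would produce on the medical cost data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (assumption): one large-range and one small-range feature
big = rng.uniform(0, 1000, size=n)
small = rng.uniform(0, 1, size=n)
X_demo = np.column_stack([big, small])
y_demo = 10 * small  # the target depends only on the small-range feature

# Standardize each feature to zero mean and unit variance, then fit kNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X_demo[:150], y_demo[:150])

preds = scaled_knn.predict(X_demo[150:])
print(preds.shape)
```

Wrapping the scaler and the regressor in one pipeline means the scaling statistics are learned from the training rows only, which avoids leaking information from the test set.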