# Predictive Analysis: D209 Task 2

## **Medical Readmission**

### Natalie Toler

## Table of Contents
### Part 1. Research Question
#### A1. Research Question
#### A2. Goal

### Part 2. Method Justification
#### B1. Ridge
#### B2. Assumption
#### B3. Packages and Libraries 

### Part 3. Data Preparation
#### C1. Preprocessing
#### C2. Variables
#### C3. Data Preparation
#### C4. Copy of Cleaned Data

### Part 4. Analysis
#### D1. Test and Training Split
#### D2. Analysis Test
#### D3. Predictive Analysis Code

### Part 5. Data Summary and Implications
#### E1. Accuracy of Model
#### E2. Results and Implications
#### E3. Limitations
#### E4. Course of Action

### Part 6. Data Summary and Implications
#### F. Panopto Video

### Sources
#### G. Web Sources
#### H. Source references

## Part 1

### A1. Research Question

**Which variables affect the price of hospitalization for the patient?** Since this question uses a continuous numeric variable as the dependent variable, I will use the ridge regression method to create my model.

### A2. Goals

The goal of this project is to look at the price of hospitalization for a patient and see which other gathered variables may influence the price. This is an important question to be able to answer so that the hospital system can correctly communicate with patients what affects their cost of being hospitalized.

## Part 2

### B1. Ridge Regression

Ridge regression is a linear regression model that adds a regularization penalty to the ordinary least squares (OLS) loss function. The penalty is proportional to the sum of the squared coefficients, which shrinks them toward zero and helps prevent overfitting. The strength of the penalty is set by an alpha value: a larger alpha penalizes the coefficients more aggressively (and a very large alpha can cause underfitting), while an alpha of 0 gives the same result as an unpenalized OLS model. This keeps the model from relying too heavily on certain variables over others. (DataCamp Team, 2022) [Lasso and Ridge Regression in Python](https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression)
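
To illustrate the effect of alpha described above, here is a minimal sketch on synthetic data (the data, `true_coefs`, and the `demo_alphas` values are assumptions chosen for illustration, not the medical dataset). It compares the overall size of the coefficients from an unpenalized fit with ridge fits at increasing alpha values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration, not the medical dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=200)

# Unpenalized least squares as the reference point
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# The L2 norm of the ridge coefficients shrinks as alpha grows
demo_alphas = [0.1, 10.0, 1000.0]
ridge_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in demo_alphas
]

print("OLS coefficient norm:", ols_norm)
for a, n in zip(demo_alphas, ridge_norms):
    print(f"alpha={a}: coefficient norm={n}")
```

The printed norms should decrease as alpha increases, with the smallest alpha staying close to the unpenalized fit.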

### B2. Assumption

One of the important assumptions of a linear model is that the relationship between the independent and dependent variables is linear. (Vishalmendekarhere, 2021) [Its All About Assumptions](https://medium.com/swlh/its-all-about-assumptions-pros-cons-497783cfed2d)

### B3. Packages and Libraries

For this project I will be using the following libraries and packages:

- pandas for handling the dataset
- numpy for performing certain operations
- matplotlib.pyplot for plotting
- sklearn.preprocessing: StandardScaler for standardizing the values of the variables
- sklearn.model_selection: train_test_split for splitting the dataset into train and test
- sklearn.linear_model: Ridge for performing the Ridge regression model
- sklearn.metrics: mean_squared_error for calculating the mean squared error for assessing the accuracy of the model

In [1]:
# Import the libraries and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse

# Import the dataset CSV
df = pd.read_csv('medical_clean.csv', index_col=0)

# Check the Dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer_id         10000 non-null  object 
 1   Interaction         10000 non-null  object 
 2   UID                 10000 non-null  object 
 3   City                10000 non-null  object 
 4   State               10000 non-null  object 
 5   County              10000 non-null  object 
 6   Zip                 10000 non-null  int64  
 7   Lat                 10000 non-null  float64
 8   Lng                 10000 non-null  float64
 9   Population          10000 non-null  int64  
 10  Area                10000 non-null  object 
 11  TimeZone            10000 non-null  object 
 12  Job                 10000 non-null  object 
 13  Children            10000 non-null  int64  
 14  Age                 10000 non-null  int64  
 15  Income              10000 non-null  float64
 16  Marital  

## Part 3. Data Preparation

### C1. Preprocessing

As with most models, the most important preprocessing step is transforming all categorical variables to numeric values through encoding. This dataset is particularly heavy in binary variables, which all need to be encoded as 1/0. Additionally, all of the nominal categorical variables need to be encoded using one-hot encoding. Without this step the model would be unable to use the categorical variables, which make up a large part of the dataset. (Shmueli, 2015) [Categorical Predictors](https://www.bzst.com/2015/08/categorical-predictors-how-many-dummies.html)

### C2. Variables

I will be using the following variables:

**Dependent Variable**
- **Total Charge**
    - **Continuous Numeric**

**Independent Variables**

**Numeric Variables**
- *Discrete*
    - Age
    - Initial Days
- *Continuous*
    - Income
    - Vitamin D Levels
    
**Categorical**
- *Binary*
    - High Blood Pressure
    - Stroke
    - Overweight
    - Arthritis
    - Diabetes
    - Hyperlipidemia
    - Back Pain
    - Anxiety
    - Allergic Rhinitis
    - Reflux Esophagitis
    - Asthma
    - Readmission
- *Ordinal*
    - Complication Risk
- *Nominal*
    - Area
    - Marital Status
    - Gender
    - Initial Admission
    - Services

### C3. Data Preparation

In [2]:
# Drop any Variables that aren't being used
df_clean = df.drop(['Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Zip', 'Lat', 'Lng', 'Population',
              'TimeZone', 'Job', 'Children', 'Doc_visits', 'Full_meals_eaten', 'vitD_supp', 'Soft_drink',
              'Additional_charges', 'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'],
             axis = 1)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Area                10000 non-null  object 
 1   Age                 10000 non-null  int64  
 2   Income              10000 non-null  float64
 3   Marital             10000 non-null  object 
 4   Gender              10000 non-null  object 
 5   ReAdmis             10000 non-null  object 
 6   VitD_levels         10000 non-null  float64
 7   Initial_admin       10000 non-null  object 
 8   HighBlood           10000 non-null  object 
 9   Stroke              10000 non-null  object 
 10  Complication_risk   10000 non-null  object 
 11  Overweight          10000 non-null  object 
 12  Arthritis           10000 non-null  object 
 13  Diabetes            10000 non-null  object 
 14  Hyperlipidemia      10000 non-null  object 
 15  BackPain            10000 non-null  object 
 16  Anxiety  

In [3]:
# Encode the binary Yes/No columns to 1/0
# (Complication_risk is ordinal, so it is encoded separately in the next cell)
binary_columns = ['ReAdmis', 'HighBlood', 'Stroke', 'Overweight', 'Arthritis', 'Diabetes',
                 'Hyperlipidemia', 'BackPain', 'Anxiety', 'Allergic_rhinitis', 'Reflux_esophagitis', 'Asthma']
binary_encoding = {'Yes': 1, 'No': 0}
for col in binary_columns:
    # map() converts each Yes/No value to 1/0
    df_clean[col] = df_clean[col].map(binary_encoding)
df_clean.value_counts().sum()


10000

In [4]:
# Encode the Ordinal Categorical Variable to 1, 2, 3 for Low, medium, high
risk_mapping = {'High': 3, 'Medium': 2, 'Low': 1}
# Map the values in the "complication_risk" column using the defined mapping
df_clean['Complication_risk'] = df_clean['Complication_risk'].map(risk_mapping)
# Convert to int type
df_clean['Complication_risk'] = df_clean['Complication_risk'].astype(int)

In [5]:
# One-hot encode the nominal categorical variables
categorical_columns = ['Area', 'Marital', 'Gender', 'Initial_admin', 'Services']
df_clean = pd.get_dummies(df_clean, columns=categorical_columns, drop_first=False)

In [6]:
# Isolate the numeric variables to standardize (note: this also standardizes the target, TotalCharge)
continuous_columns = ['Age', 'Income', 'VitD_levels', 'Initial_days', 'TotalCharge']

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the selected columns
df_clean[continuous_columns] = scaler.fit_transform(df_clean[continuous_columns])
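
The cell above fits the scaler on all 10,000 rows before the train/test split, so statistics from the eventual test rows feed into the transformation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern on a toy DataFrame; the values and the names `toy`, `toy_train`, `toy_test`, and `toy_scaler` are hypothetical and are not part of this project's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the medical data (hypothetical values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Age": rng.integers(18, 90, size=100),
    "Income": rng.normal(40000, 15000, size=100),
})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)

# Fit on the training rows only, then apply the same transformation to both sets
toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

With this ordering the test set plays no part in computing the means and standard deviations used for scaling.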

In [7]:
# Check the DataFrame
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 36 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Age                                  10000 non-null  float64
 1   Income                               10000 non-null  float64
 2   ReAdmis                              10000 non-null  int64  
 3   VitD_levels                          10000 non-null  float64
 4   HighBlood                            10000 non-null  int64  
 5   Stroke                               10000 non-null  int64  
 6   Complication_risk                    10000 non-null  int32  
 7   Overweight                           10000 non-null  int64  
 8   Arthritis                            10000 non-null  int64  
 9   Diabetes                             10000 non-null  int64  
 10  Hyperlipidemia                       10000 non-null  int64  
 11  BackPain                         

In [8]:
# Rename the Columns
pythonic_columns = ['age', 'income', 'readmission', 'vit_d', 'high_blood', 'stroke', 'complication_risk', 'overweight',
                   'arthritis', 'diabetes', 'hyperlipidemia', 'back_pain', 'anxiety', 'allergic_rhinitis', 'reflux_esophagitis',
                   'asthma', 'initial_days', 'total_charge', 'area_rural', 'area_suburban', 'area_urban', 'marital_divorced',
                   'marital_married', 'marital_never', 'marital_seperated', 'marital_widowed', 'gender_female',
                   'gender_male', 'gender_nonbinary', 'initial_admin_elective', 'initial_admin_emergency',
                   'initial_admin_observation', 'services_blood_work', 'services_ct', 'services_iv', 'services_mri']
df_clean = df_clean.set_axis(pythonic_columns, axis=1)

In [9]:
# Check the Dataframe
df_clean.head()

| CaseOrder | age | income | readmission | vit_d | high_blood | stroke | complication_risk | overweight | arthritis | diabetes | ... | gender_female | gender_male | gender_nonbinary | initial_admin_elective | initial_admin_emergency | initial_admin_observation | services_blood_work | services_ct | services_iv | services_mri |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.024795 | 1.615914 | 0 | 0.583603 | 1 | 0 | 2 | 0 | 1 | 1 | ... | False | True | False | False | True | False | True | False | False | False |
| 2 | -0.121706 | 0.221443 | 0 | 0.483901 | 1 | 0 | 3 | 1 | 0 | 0 | ... | True | False | False | False | True | False | False | False | True | False |
| 3 | -0.024795 | -0.91587 | 0 | 0.046227 | 1 | 0 | 2 | 1 | 0 | 1 | ... | True | False | False | True | False | False | True | False | False | False |
| 4 | 1.186592 | -0.026263 | 0 | -0.687811 | 0 | 1 | 2 | 0 | 1 | 0 | ... | False | True | False | True | False | False | True | False | False | False |
| 5 | -1.526914 | -1.377325 | 0 | -0.260366 | 0 | 0 | 1 | 0 | 0 | 0 | ... | True | False | False | True | False | False | False | True | False | False |


In [10]:
df_clean.shape

(10000, 36)

In [11]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 36 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        10000 non-null  float64
 1   income                     10000 non-null  float64
 2   readmission                10000 non-null  int64  
 3   vit_d                      10000 non-null  float64
 4   high_blood                 10000 non-null  int64  
 5   stroke                     10000 non-null  int64  
 6   complication_risk          10000 non-null  int32  
 7   overweight                 10000 non-null  int64  
 8   arthritis                  10000 non-null  int64  
 9   diabetes                   10000 non-null  int64  
 10  hyperlipidemia             10000 non-null  int64  
 11  back_pain                  10000 non-null  int64  
 12  anxiety                    10000 non-null  int64  
 13  allergic_rhinitis          10000 non-null  int64  


### C4. Copy of Cleaned Data

In [12]:
# Save Preprocessed Dataframe into a csv
df_clean.to_csv('d209_task_2.csv', index=False)

## Part 4. Analysis

### D1. Test and Training Split

In order to measure the accuracy of the model I have to split the dataset into a training set and a test set. I will be holding back 30% of the data for the test set.

In [13]:
# Separate the y variable from the independent variables
X = df_clean.drop('total_charge', axis=1).copy()
y = df_clean['total_charge'].copy()

In [14]:
y.shape

(10000,)

In [15]:
X.shape

(10000, 35)

In [16]:
# Split data into train and test sets keeping 30% of the data out for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### D2. Analysis Test

Before I can fit the final Ridge model I need to choose an alpha for this dataset. To do so, I will fit the model once for each candidate alpha and compare the resulting R squared scores on the test set.

In [17]:
# Set Up the Alpha testing
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
    ridge = Ridge(alpha = alpha)
    ridge.fit(X_train, y_train)
    score = ridge.score(X_test, y_test)
    ridge_scores.append(score)
print(ridge_scores)

[0.9978431986648607, 0.9978429712779706, 0.9978338956168774, 0.997264755983, 0.9797184654965866, 0.6830825629228375]


In [18]:
# Find the index of the maximum score
best_alpha_index = np.argmax(ridge_scores)

# Retrieve the corresponding best alpha value
best_alpha = alphas[best_alpha_index]

print("Ridge Scores:", ridge_scores)
print("Best Alpha Value:", best_alpha)

Ridge Scores: [0.9978431986648607, 0.9978429712779706, 0.9978338956168774, 0.997264755983, 0.9797184654965866, 0.6830825629228375]
Best Alpha Value: 0.1


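The best-scoring alpha is 0.1. The scores for 0.1, 1.0, and 10.0 differ by less than 0.00001, so the model is not sensitive to alpha in this range, and the score only drops noticeably at 1000 and above.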
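
The loop above scores each alpha on the test set, which means the test set also influences the choice of alpha. One alternative is to choose alpha by cross-validation on the training data only, for example with scikit-learn's `RidgeCV`. The sketch below uses synthetic data as a stand-in for `X` and `y` (an assumption for illustration); `candidate_alphas` reuses the same grid as the loop above, and the 5-fold setting is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; stands in for X and y)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 5-fold cross-validation on the training data chooses alpha;
# the test set is used only once, for the final score
candidate_alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

print("Selected alpha:", ridge_cv.alpha_)
print("Held-out R^2:", ridge_cv.score(X_te, y_te))
```

Because the test rows are never seen during selection, the held-out score is a cleaner estimate of how the chosen model generalizes.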

### D3. Predictive Analysis Code



In [19]:
# Set up the model
ridge = Ridge(alpha = 0.1)

# Fit the model to the training data
ridge.fit(X_train, y_train)

## Part 5. Data Summary and Implications

### E1. Accuracy of Model



In [20]:
# Create the accuracy scores
train_score = ridge.score(X_train, y_train)
test_score = ridge.score(X_test, y_test)

print("The train score for the Ridge model is {}".format(train_score))
print("The test score for the Ridge model is {}".format(test_score))

The train score for the Ridge model is 0.9978241739814598
The test score for the Ridge model is 0.9978431986648607


In [21]:
# Find the mean squared error and root mean squared error
predictions = ridge.predict(X_test)
r_squared = test_score
mse_value = mse(y_test, predictions)
# Take the square root directly; the squared=False argument was deprecated in scikit-learn 1.4
rmse = np.sqrt(mse_value)
print("R^2: {}".format(r_squared))
print("MSE: {}".format(mse_value))
print("RMSE: {}".format(rmse))

R^2: 0.9978431986648607
MSE: 0.0021731708618094947
RMSE: 0.04661728072088177




The training score for the Ridge model is approximately 0.9978, meaning that 99.78% of the variance in the training data is explained by the model. The test score is approximately the same when rounded, which indicates that the model generalizes well to the test data.

The R squared value is identical to the test score because `ridge.score` returns R squared. The RMSE of 0.0466 is low, but because total charge was standardized during data preparation, it is measured in standard deviations of total charge rather than in dollars. A typical prediction error is therefore about 0.05 standard deviations of the actual values.

All of this suggests that the Ridge model accurately predicts the dependent variable, total charge.
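
Because `TotalCharge` is among the columns passed to `StandardScaler` during data preparation, the RMSE reported above is in standardized units. Multiplying it by the standard deviation the scaler used would express it in the original units. The sketch below demonstrates the conversion on hypothetical charges; `charges`, `target_scaler`, and the simulated predictions are assumptions for illustration and do not reproduce this project's figures.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical charges in dollars (not the real TotalCharge values)
rng = np.random.default_rng(0)
charges = rng.normal(5000, 2000, size=(1000, 1))

target_scaler = StandardScaler()
charges_scaled = target_scaler.fit_transform(charges)

# Simulated predictions in standardized units, each off by a small error
preds_scaled = charges_scaled + rng.normal(scale=0.05, size=charges_scaled.shape)
rmse_scaled = np.sqrt(np.mean((charges_scaled - preds_scaled) ** 2))

# Standardization divides by the column's standard deviation,
# so multiplying by scale_ converts the error back to dollars
rmse_dollars = rmse_scaled * target_scaler.scale_[0]

# Same figure computed directly in the original units
preds_dollars = target_scaler.inverse_transform(preds_scaled)
rmse_direct = np.sqrt(np.mean((charges - preds_dollars) ** 2))

print(rmse_scaled, rmse_dollars, rmse_direct)
```

Keeping a separate scaler for the target, as assumed here, makes this conversion straightforward.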

### E2. Results and Implications

The results of the model show that this model could be used to accurately predict the cost of hospitalization for a patient based on the independent variables used. The high score for the test dataset and the low value of RMSE are signs of a model that fits well. 
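
To connect the fitted model back to the research question of which variables affect the charge, the coefficients can be paired with their column names and ranked by absolute size. The sketch below does this on synthetic data with hypothetical feature names (`feat_a` to `feat_d`); it is an illustration of the approach, not output from the model fitted in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic features with hypothetical names (not the fitted model above)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["feat_a", "feat_b", "feat_c", "feat_d"],
)
y_demo = (
    5.0 * X_demo["feat_a"]
    - 2.0 * X_demo["feat_c"]
    + rng.normal(scale=0.1, size=300)
)

demo_model = Ridge(alpha=0.1).fit(X_demo, y_demo)

# Pair each coefficient with its column name and sort by absolute size
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficients are only directly comparable when the predictors are on similar scales, which is one reason standardizing the numeric predictors matters for this kind of ranking.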

### E3. Limitations

Apart from the small alpha search, I did not perform any hyperparameter tuning or feature selection when choosing the variables for this model. As a result, some of the variables used could be contributing to overfitting. The alpha was also chosen using the test set, so the reported test score may be slightly optimistic. If there is substantial multicollinearity among the independent variables, the model could score well while its individual coefficients remain unreliable to interpret.
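
One way to check for the multicollinearity mentioned above is the variance inflation factor (VIF). The sketch below computes it from the inverse of the correlation matrix on synthetic predictors (`x1`, `x2`, `x3` are hypothetical, with `x3` built as a near copy of `x1`). Note that full sets of one-hot columns created with `drop_first=False` are perfectly collinear, so one level per category would need to be dropped before this calculation could be applied to them.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (hypothetical names); x3 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)
X_demo = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# The VIF of each predictor equals the matching diagonal entry
# of the inverse of the predictors' correlation matrix
corr = X_demo.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X_demo.columns)
print(vif)
```

A commonly used rule of thumb treats VIF values above roughly 10 as a sign of strong collinearity; in this toy example `x1` and `x3` should stand out while `x2` stays near 1.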

### E4. Course of Action

Since the model seems fairly good at predicting the cost of hospitalization based on the variables I would suggest to the hospital system that they could use this model to figure out where the highest costs are coming from for patients and subsequently adjust systems so that they can lower the cost to the patients. 

### F. Panopto Video

The video link is provided with submission.
https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a30e7255-3349-4198-a83b-b108000ff678

## Sources
### G. Web Sources

Web sources used to create the code for this project were:

- D209 recorded webinars were used to create the code for the feature selection.
- D209 Datacamp pathway was used to help create most of the code used for this project.

### H. Source references

Other sources used for understanding and explaining the models and methods were:

> Vishalmendekarhere. (2021, January 22). It’s All About Assumptions, Pros & Cons. The Startup. https://medium.com/swlh/its-all-about-assumptions-pros-cons-497783cfed2d
- [Its All About Assumptions](https://medium.com/swlh/its-all-about-assumptions-pros-cons-497783cfed2d)

> Shmueli, Galit. “Categorical Predictors: How Many Dummies to Use in Regression vs. K-Nearest Neighbors.” BzST | Business Analytics, Statistics, Teaching, 19 Aug. 2015, www.bzst.com/2015/08/categorical-predictors-how-many-dummies.html. Accessed 1 Feb. 2024.
- [Categorical Predictors](https://www.bzst.com/2015/08/categorical-predictors-how-many-dummies.html)

> DataCamp Team. “Lasso and Ridge Regression Tutorial.” Www.datacamp.com, Mar. 2022, www.datacamp.com/tutorial/tutorial-lasso-ridge-regression. Accessed 1 Feb. 2024.
- [Lasso and Ridge Regression in Python](https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression)