<h1><center>D209: Data Mining I</center></h1>
<br>
<center>Task 2: Random Forest</center>
<br>
<center>Michelle Nelson</center>
<br>
<center>Department of Information Technology, Western Governors University</center>
<br>
<center>Dr. Eric Straw</center>
<br>
<center>February 28, 2024</center>
<br>
<br>
<br>
<br>

## A1. Research Question

For this paper, I will be re-analyzing the question I asked in my D208 Task 1 paper: What factors influence the length of a hospital stay? The linear regression model I used to analyze the data for this question in my previous paper had weak predictive power because of its large error. It is my hope that a random forest, an ensemble method, will prove to be a stronger model for predicting length of hospital stay, represented in this dataset by "Initial_days."

Decreasing the number of days hospitalized while still providing quality care ensures that resources such as physicians, medical supplies, and time are used efficiently rather than wasted. In addition, shortening hospital stays should increase patient satisfaction scores, so long as adequate time is still taken to care for patients and they are not rushed out. Thus, identifying the risk factors that contribute to a longer hospital stay could be important for a hospital, allowing it to be proactive instead of reactive when managing patients.

## A2. Analysis Objectives and Goals

As mentioned above, my goal for this analysis is to create a model that predicts length of hospital stay better than my previous attempt using linear regression. To that end, I will be using a new type of model, a random forest regressor.

More broadly, the goal is to help the hospital to which this data belongs reduce its average length of stay by first identifying the factors that contribute to lengthy stays. The hospital can then use that information to develop a proactive plan. Long term, the hospital could expect to see benefits such as improved efficiency and reduced costs, as mentioned above, which might also be communicated to a board of executives as a goal of this analysis.

## B1. Justification of Classification Method

In order to explain a random forest model, I must first explain a decision tree, on which this ensemble model is built. A decision tree works by learning a sequence of if/else questions about the individual features in the dataset it was trained on so it can infer the value of a new observation that is fed to it. These if/else questions split the data into groups, then split those groups into smaller groups, until a prediction can be made. When a decision tree is trained, it aims to maximize the information gained from each node/question. For a regression tree, this typically means choosing the split that most reduces the error (for example, the mean squared error) of the target values within the resulting groups.

A decision tree is made up of a hierarchy of "nodes," each of which represents either a question or a prediction. Each question has only two answers, so each question node splits into two child nodes. There are a few types of nodes. The root node is the starting node and represents the initial question the tree asks of the observation to determine which of two groups it belongs to. The root's two child nodes are the first internal nodes, which represent further questions the tree asks of the observation. In this way, the observation follows a path down the tree, being placed in one group or the other whenever a question is asked of it at a node. Eventually, this line of questioning culminates in a prediction at the end of the tree, the final node, called a "leaf." Since this is difficult to explain with words, I will include a picture of a decision tree for visualization purposes (Saini, 2024).

![Decision Tree Picture](https://av-eks-blogoptimized.s3.amazonaws.com/498772.png)
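To make the if/else structure concrete, here is a minimal sketch that fits a very shallow regression tree and prints the questions it learned. The data is synthetic and the feature names `age` and `visits` are illustrative assumptions, not columns from the hospital dataset. It uses scikit-learn's `DecisionTreeRegressor` and `export_text`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (illustrative assumption, not the hospital dataset):
# the target depends mostly on whether "age" exceeds 50.
rng = np.random.default_rng(0)
age = rng.integers(18, 90, size=200)
visits = rng.integers(1, 9, size=200)
X = np.column_stack([age, visits])
y = np.where(age > 50, 40.0, 10.0) + rng.normal(0, 1, size=200)

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the learned if/else questions, one line per node,
# with the predicted value shown at each leaf.
rules = export_text(tree, feature_names=["age", "visits"])
print(rules)
```

Each indented line in the printed output is one question (an internal node), and each `value` line is a leaf holding the prediction for observations that reach it.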

A random forest uses a large number of decision trees together in an ensemble. Many decision trees are made using different randomly selected subsets of both the data and the features. Each tree in the ensemble then makes a prediction. The random forest model then makes its own prediction by looking at all of the predictions made by the many trees within it. It averages those predictions into one final prediction. Since this may be difficult to visualize, I have included an image to better explain this concept too (Machine Learning Random Forest Algorithm - JavatPoint, n.d.).

![Random Forest Picture](https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm.png)
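The averaging step can also be checked directly. The sketch below, which uses synthetic data from `make_regression` rather than the hospital dataset, fits a small scikit-learn `RandomForestRegressor`, collects the prediction of every individual tree through the fitted `estimators_` attribute, and compares their mean to the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_.
# Stack every tree's prediction for the first five observations.
per_tree = np.array([t.predict(X[:5]) for t in forest.estimators_])

# Average across trees and compare with the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```

For a regressor, the ensemble prediction is the mean of the tree predictions, which is why the two arrays agree.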

## B2. Random Forest Model Assumptions

Random forest models make very few assumptions. The one I have chosen to highlight is that a random forest does not assume the data follows any formal distribution, such as a normal distribution. Because a random forest is a non-parametric model, it can still predict on skewed data or data with multiple modes (Vishalmendekarhere, 2021). If the data in question does follow a known distribution, it might be wiser to choose a technique that assumes that distribution.

## B3. Benefits of Python Packages

For this project, I will be using Python and the following packages:

* Pandas
    * Pandas is useful because it provides a framework for working with the data. Without it, using only NumPy arrays would be rather clunky. Pandas allows the data to resemble a spreadsheet.
* NumPy
    * In this case, I will be using numpy for certain mathematical operations like summing, squaring, or square rooting.
* Seaborn
    * This package is handy in combination with matplotlib.pyplot because it expands the kinds of graphs we can use to plot the data. With seaborn, I can create scatterplots, plots with a line of best fit, and so on.
* Matplotlib.pyplot
    * I will primarily be using this to create graphs.
* Sklearn
    * sklearn.ensemble: I will be using RandomForestRegressor from this package, which is the regression algorithm I will use to analyze this data.
    * sklearn.model_selection: I will use train_test_split from this package, which I will use to easily divide the dataset into train and test datasets. I will also use RandomizedSearchCV from this package to do some basic hyperparameter tuning.
    * sklearn.metrics: I will use this package's mean_squared_error and r2_score to evaluate the model's performance.
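As a rough sketch of how the scikit-learn pieces listed above fit together, the example below runs the whole workflow on synthetic data. The data, the parameter grid, and the small `n_iter` and `cv` values are assumptions chosen so the example runs quickly; they are not the settings used in this analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption for illustration only).
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=42)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space; real values would be chosen for the dataset at hand.
param_distributions = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Randomly sample 4 parameter combinations, scoring each with 3-fold CV.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Evaluate the best model found on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(search.best_params_, round(rmse, 2), round(r2, 3))
```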

In [2]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score


#load csv into pandas dataframe. The CSV's first column is an index, so we let pandas know that too.
df=pd.read_csv('C:/Users/essay/Documents/D209 PA Dataset/medical_clean.csv', index_col = 0)

#visually inspect dataframe's datatypes and size to ensure it loaded properly.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer_id         10000 non-null  object 
 1   Interaction         10000 non-null  object 
 2   UID                 10000 non-null  object 
 3   City                10000 non-null  object 
 4   State               10000 non-null  object 
 5   County              10000 non-null  object 
 6   Zip                 10000 non-null  int64  
 7   Lat                 10000 non-null  float64
 8   Lng                 10000 non-null  float64
 9   Population          10000 non-null  int64  
 10  Area                10000 non-null  object 
 11  TimeZone            10000 non-null  object 
 12  Job                 10000 non-null  object 
 13  Children            10000 non-null  int64  
 14  Age                 10000 non-null  int64  
 15  Income              10000 non-null  float64
 16  Mari

In [3]:
# Visually inspect dataframe to ensure data loaded as expected and do initial visual exploration
pd.set_option("display.max_columns", None)
df.head(10)

CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,Population,Area,TimeZone,Job,Children,Age,Income,Marital,Gender,ReAdmis,VitD_levels,Doc_visits,Full_meals_eaten,vitD_supp,Soft_drink,Initial_admin,HighBlood,Stroke,Complication_risk,Overweight,Arthritis,Diabetes,Hyperlipidemia,BackPain,Anxiety,Allergic_rhinitis,Reflux_esophagitis,Asthma,Services,Initial_days,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,2951,Suburban,America/Chicago,"Psychologist, sport and exercise",1,53,86575.93,Divorced,Male,No,19.141466,6,0,0,No,Emergency Admission,Yes,No,Medium,No,Yes,Yes,No,Yes,Yes,Yes,No,Yes,Blood Work,10.58577,3726.70286,17939.40342,3,3,2,2,4,3,3,4
2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,11303,Urban,America/Chicago,Community development worker,3,51,46805.99,Married,Female,No,18.940352,4,2,1,No,Emergency Admission,Yes,No,High,Yes,No,No,No,No,No,No,Yes,No,Intravenous,15.129562,4193.190458,17612.99812,3,4,3,4,4,4,3,3
3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,17125,Suburban,America/Chicago,Chief Executive Officer,3,53,14370.14,Widowed,Female,No,18.057507,4,1,0,No,Elective Admission,Yes,No,Medium,Yes,No,Yes,No,No,No,No,No,No,Blood Work,4.772177,2434.234222,17505.19246,2,4,4,4,3,4,3,3
4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,2162,Suburban,America/Chicago,Early years teacher,0,78,39741.49,Married,Male,No,16.576858,4,1,0,No,Elective Admission,No,Yes,Medium,No,Yes,No,No,No,No,No,Yes,Yes,Blood Work,1.714879,2127.830423,12993.43735,3,5,5,3,4,5,5,5
5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,5287,Rural,America/New_York,Health promotion specialist,1,22,1209.56,Widowed,Female,No,17.439069,5,0,2,Yes,Elective Admission,No,No,Low,No,No,No,Yes,No,No,Yes,No,No,CT Scan,1.254807,2113.073274,3716.525786,2,1,3,3,5,3,4,3
6,S543885,e3b0a319-9e2e-4a23-8752-2fdc736c30f4,03e447146d4a32e1aaf75727c3d1230c,Braggs,OK,Muskogee,74423,35.67302,-95.1918,981,Urban,America/Chicago,Corporate treasurer,3,76,81999.88,Never Married,Male,No,19.612646,6,0,0,No,Observation Admission,No,No,Medium,Yes,Yes,Yes,No,Yes,No,Yes,No,No,Blood Work,5.95725,2636.69118,12742.58991,4,5,4,4,3,5,4,6
7,E543302,2fccb53e-bd9a-4eaa-a53c-9dfc0cb83f94,e4884a42ba809df6a89ded6c97f460d4,Thompson,OH,Geauga,44086,41.67511,-81.05788,2558,Rural,America/New_York,Hydrologist,0,50,10456.05,Never Married,Male,No,14.751687,6,0,0,No,Emergency Admission,Yes,No,Low,Yes,Yes,Yes,Yes,Yes,Yes,No,Yes,No,Intravenous,9.05821,3694.627161,16815.5136,4,3,3,2,3,4,5,5
8,K477307,ab634508-dd8c-42e5-a4e4-d101a46f2431,5f78b8699d1aa9b950b562073f629ca2,Strasburg,VA,Shenandoah,22641,39.08062,-78.3915,479,Urban,America/New_York,Psychiatric nurse,7,40,38319.29,Divorced,Female,No,19.688673,7,2,0,No,Observation Admission,No,No,Medium,Yes,No,No,No,No,No,No,No,No,Intravenous,14.228019,3021.499039,6930.572138,1,2,2,5,4,2,4,2
9,Q870521,67b386eb-1d04-4020-9474-542a09d304e3,e8e016144bfbe14974752d834f530e26,Panama City,FL,Bay,32404,30.20097,-85.5061,40029,Urban,America/Chicago,Computer games developer,0,48,55586.48,Widowed,Male,No,19.65332,6,3,0,No,Emergency Admission,No,No,Low,Yes,No,No,Yes,No,No,No,No,No,Intravenous,6.180339,2968.40286,8363.18729,3,3,2,3,3,3,4,2
10,Z229385,5acd5dd3-f0ae-41c7-9540-cf3e4ecb2e27,687e7ba1b80022c310fa2d4b00db199a,Paynesville,MN,Stearns,56362,45.40325,-94.71424,5840,Urban,America/Chicago,"Production assistant, radio",2,78,38965.22,Never Married,Female,No,18.224324,7,1,2,No,Emergency Admission,Yes,No,High,Yes,No,No,No,No,No,Yes,Yes,Yes,Blood Work,1.632554,3147.855813,26225.98991,5,5,5,3,4,2,3,2


## C1. Data Preparation Goals

I will be performing several data preparation steps so that the data is ready for random forest regression, many of which are directly taken from my D208 paper. However, the most important step I feel I should highlight is creating dummy variables using one-hot encoding. This process converts a categorical variable column's values from their labels like "male" and "female" to integer representations that a random forest regression model can understand, with one column for each category. 

For example, let's take a look at Gender, which has three categories in this dataset. If a 1 appears in the dummy column for male, we know the patient was male. If a 1 appears in the dummy column for female, we know the patient was female. If a 0 appears in both columns, we can infer the patient was non-binary. For linear and logistic regression, it is common practice to drop one of these three columns for exactly this reason: zeros in the other two columns tell you that the third would hold a 1. Keeping a dummy column for every category would make the dummy columns perfectly collinear, and linear and logistic regression are both sensitive to multicollinearity. A random forest model does not suffer from this problem because it does not make linear combinations of the features (Categorical Predictors: How Many Dummies to Use in Regression Vs. K-nearest Neighbors, n.d.). Thus, when I create dummy variables for my categorical columns, I will keep a dummy column for each category in the column, including the non-binary column, which holds a 1 if the patient is non-binary, and refrain from dropping one.
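The difference between the two encodings can be seen on a tiny example. The four-row `gender` series below is made up for illustration; only the category labels mirror the dataset.

```python
import pandas as pd

# Toy column (illustrative values, not rows from the dataset).
gender = pd.Series(["Male", "Female", "Nonbinary", "Female"], name="Gender")

# Keep one dummy column per category, as planned for the random forest.
all_levels = pd.get_dummies(gender, drop_first=False)

# Drop the first category, as is common for linear and logistic regression.
reference_coded = pd.get_dummies(gender, drop_first=True)

print(list(all_levels.columns))
print(list(reference_coded.columns))
```

With every level kept, each row has exactly one active dummy column. With `drop_first=True`, the dropped category is represented by a row in which all of the remaining dummy columns are off.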

## C2. Variable Selection

For my analysis, I will not use categorical columns with high cardinality. These columns would require one-hot encoding, which, when performed on columns such as these, greatly expands the dataset columnwise. For example, the column "Job" has 639 different occupations (see the code below for how this number was obtained). Performing one-hot encoding on this column would add 639 new columns to the dataframe and remove the original column. Expanding the dataframe to such a size would not only be cumbersome, but it would also increase the risk that my laptop would be set aflame by the sheer processing power required for a random forest model running on that many features. For my D208 paper, I excluded high cardinality categorical variables for the same reason.

In [4]:
print(len(pd.unique(df.Job)))

639
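The same check can be run across every non-numeric column at once to spot high cardinality candidates. This is a minimal sketch on a made-up three-column frame named `toy`; on the real dataframe the same expression would be applied to `df`.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Job": ["Hydrologist", "Psychiatric nurse", "Corporate treasurer", "Hydrologist"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [50, 40, 76, 22],
})

# Count distinct values in each non-numeric column, largest first.
cardinality = (
    toy.select_dtypes(exclude="number")
       .nunique()
       .sort_values(ascending=False)
)
print(cardinality)
```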


Other variables will be excluded simply because I want to do a comparison of this random forest model and the linear regression I did in my previous paper. Thus, I will limit myself to the same variable set I used for multiple linear regression. The variables I will use are as follows:

* Age: Numeric
* Gender: Categorical
* ReAdmis: Categorical
* Doc_visits: Numeric
* HighBlood: Categorical
* Initial_admin: Categorical
* Stroke: Categorical
* Diabetes: Categorical
* Complication_risk: Categorical
* BackPain: Categorical
* Anxiety: Categorical
* Services: Categorical
* Initial_days: Numeric

## C3. Data Preparation

Because the data is stored as a CSV, there are a number of errors, more in the data cleaning realm, that need to be fixed first. The CSV format has caused the Zip column to lose the leading zeros that are part of some zip codes. Those will be restored. A number of columns are also stored with the incorrect datatype. These will be converted, beginning with the Zip column, which is erroneously stored as an integer. Many of the remaining columns are string objects that would be better stored as categories. Below you will find the code used to do these two things.

In [5]:
#Clean datatypes up using code from D206 PA. 
#[In-Text Citation: (Nelson, 2023).]

# Convert Zip to string from integer.
df['Zip'] = df['Zip'].astype('str')
# Add leading zeros using zfill()
df['Zip'] = df['Zip'].str.zfill(5)
# Identify columns that can be converted all at once to category datatype using for loop.
category_cols = ['Area', 'Marital', 'Initial_admin', 'Complication_risk', 'Services', 'ReAdmis',
                 'Soft_drink', 'HighBlood', 'Stroke', 'Arthritis', 'Diabetes', 'Hyperlipidemia', 'BackPain',
                 'Allergic_rhinitis', 'Reflux_esophagitis', 'Asthma']
# Will do Item# columns later since they require an order. Timezone needs a dict written, will do that later too.
for col in category_cols:
    df[col] = df[col].astype('category')
# Convert gender to category datatype.
df['Gender'] = df['Gender'].astype('category')
# Convert Overweight to category datatype.
df['Overweight'] = df['Overweight'].astype('category')
# Convert Anxiety to category datatype.
df['Anxiety'] = df['Anxiety'].astype('category')
# Convert Job to category datatype.
df['Job'] = df['Job'].astype('category')

# Create ordered categories for Item# variables. 8 is "least important" and 1 is "most important"
survey_scores = CategoricalDtype(categories=['8', '7', '6', '5', '4', '3', '2', '1'], ordered=True)
# Identify columns that need to become ordered categorical
ord_cat_cols = ['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8']
# Create for loop to convert columns above to string (issues if not string first,) then ordered categorical datatype.
for col in ord_cat_cols:
    df[col] = df[col].astype('str')
    df[col] = df[col].astype(survey_scores)
# Convert Initial_days to an integer. Note that astype truncates the fractional part (e.g. 10.58577 becomes 10) rather than rounding.
df['Initial_days'] = df['Initial_days'].astype('int64')
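As an aside on the leading-zero problem, the sketch below shows on a two-row, made-up CSV how the zeros are lost on load, how `zfill` restores them, and how passing `dtype` to `read_csv` could avoid the loss in the first place. The CSV text and the zip code `00601` are illustrative assumptions, and reading Zip as a string is an alternative approach, not the one used in this paper.

```python
import io
import pandas as pd

# Tiny made-up CSV with a zip code that starts with zeros.
csv_text = "CaseOrder,Zip,Initial_days\n1,00601,10.58577\n2,35621,15.129562\n"

# Default load: Zip is parsed as an integer, so the leading zeros disappear.
as_int = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Repair after the fact, as done above: cast to string, then left-pad to 5 characters.
repaired = as_int["Zip"].astype(str).str.zfill(5)

# Alternative: tell read_csv to keep Zip as a string from the start.
as_str = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"Zip": str})

print(as_int["Zip"].tolist(), repaired.tolist(), as_str["Zip"].tolist())
```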

While not necessary for this paper, I will also reduce the timezone column's cardinality from its many, detailed choices to just the commonly recognized timezones. This will then be converted to a category datatype as well.

In [6]:
# Write dictionary for option reduction for Timezone column
mapping_timezone = {'America/Puerto_Rico' : 'Atlantic',
          'America/New_York' : 'Eastern',
          'America/Detroit' : 'Eastern',
          'America/Indiana/Indianapolis' : 'Eastern',
          'America/Indiana/Vevay' : 'Eastern',
          'America/Indiana/Vincennes' : 'Eastern',
          'America/Kentucky/Louisville' : 'Eastern',
          'America/Toronto' : 'Eastern',
          'America/Indiana/Marengo' : 'Eastern',
          'America/Indiana/Winamac' : 'Eastern',
          'America/Chicago' : 'Central',
          'America/Menominee' : 'Central',
          'America/Indiana/Knox' : 'Central',
          'America/Indiana/Tell_City' : 'Central',
          'America/North_Dakota/Beulah' : 'Central',
          'America/North_Dakota/New_Salem' : 'Central',
          'America/Denver' : 'Mountain',
          'America/Boise' : 'Mountain',
          'America/Phoenix' : 'Mountain',
          'America/Los_Angeles' : 'Pacific',
          'America/Nome' : 'Alaskan',
          'America/Anchorage' : 'Alaskan',
          'America/Sitka' : 'Alaskan',
          'America/Yakutat' : 'Alaskan',
          'America/Adak' : 'Hawaiian',
          'Pacific/Honolulu' : 'Hawaiian'
          }
# Use dictionary to convert timezone options, assigning the result back to the column.
df['TimeZone'] = df['TimeZone'].replace(mapping_timezone)
# Convert timezone to category datatype.
df['TimeZone'] = df['TimeZone'].astype('category')
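One thing worth checking with a mapping dictionary like this is whether every value was covered. `Series.replace` leaves unmapped values untouched, while `Series.map` turns them into missing values, which makes gaps easy to find. The sketch below uses a shortened, hypothetical mapping and a made-up label, `America/Unknown_City`, to show the difference.

```python
import pandas as pd

# Toy data: the third label is deliberately absent from the mapping (made-up value).
tz = pd.Series(["America/Chicago", "America/New_York", "America/Unknown_City"])
mapping = {"America/Chicago": "Central", "America/New_York": "Eastern"}

# replace() keeps any value that is not a key in the dictionary.
replaced = tz.replace(mapping)

# map() returns NaN for values that are not keys, exposing gaps in the mapping.
mapped = tz.map(mapping)
unmapped = tz[mapped.isna()].unique().tolist()

print(replaced.tolist())
print(unmapped)
```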

Lastly, there are a couple of columns that would be better stored with less precision, as money isn't stored with more than two decimal places. I will round these.

In [7]:
# Round TotalCharge to 2 decimal places
df['TotalCharge'] = df.TotalCharge.round(2)
# Round Additional_charges to 2 decimal places
df['Additional_charges'] = df.Additional_charges.round(2)

In [8]:
#Re-inspect dataframe to see if changes took.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Customer_id         10000 non-null  object  
 1   Interaction         10000 non-null  object  
 2   UID                 10000 non-null  object  
 3   City                10000 non-null  object  
 4   State               10000 non-null  object  
 5   County              10000 non-null  object  
 6   Zip                 10000 non-null  object  
 7   Lat                 10000 non-null  float64 
 8   Lng                 10000 non-null  float64 
 9   Population          10000 non-null  int64   
 10  Area                10000 non-null  category
 11  TimeZone            10000 non-null  category
 12  Job                 10000 non-null  category
 13  Children            10000 non-null  int64   
 14  Age                 10000 non-null  int64   
 15  Income              10000 non-null  

In [9]:
df.head(10)

CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,Population,Area,TimeZone,Job,Children,Age,Income,Marital,Gender,ReAdmis,VitD_levels,Doc_visits,Full_meals_eaten,vitD_supp,Soft_drink,Initial_admin,HighBlood,Stroke,Complication_risk,Overweight,Arthritis,Diabetes,Hyperlipidemia,BackPain,Anxiety,Allergic_rhinitis,Reflux_esophagitis,Asthma,Services,Initial_days,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,2951,Suburban,Central,"Psychologist, sport and exercise",1,53,86575.93,Divorced,Male,No,19.141466,6,0,0,No,Emergency Admission,Yes,No,Medium,No,Yes,Yes,No,Yes,Yes,Yes,No,Yes,Blood Work,10,3726.7,17939.4,3,3,2,2,4,3,3,4
2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,11303,Urban,Central,Community development worker,3,51,46805.99,Married,Female,No,18.940352,4,2,1,No,Emergency Admission,Yes,No,High,Yes,No,No,No,No,No,No,Yes,No,Intravenous,15,4193.19,17613.0,3,4,3,4,4,4,3,3
3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,17125,Suburban,Central,Chief Executive Officer,3,53,14370.14,Widowed,Female,No,18.057507,4,1,0,No,Elective Admission,Yes,No,Medium,Yes,No,Yes,No,No,No,No,No,No,Blood Work,4,2434.23,17505.19,2,4,4,4,3,4,3,3
4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,2162,Suburban,Central,Early years teacher,0,78,39741.49,Married,Male,No,16.576858,4,1,0,No,Elective Admission,No,Yes,Medium,No,Yes,No,No,No,No,No,Yes,Yes,Blood Work,1,2127.83,12993.44,3,5,5,3,4,5,5,5
5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,5287,Rural,Eastern,Health promotion specialist,1,22,1209.56,Widowed,Female,No,17.439069,5,0,2,Yes,Elective Admission,No,No,Low,No,No,No,Yes,No,No,Yes,No,No,CT Scan,1,2113.07,3716.53,2,1,3,3,5,3,4,3
6,S543885,e3b0a319-9e2e-4a23-8752-2fdc736c30f4,03e447146d4a32e1aaf75727c3d1230c,Braggs,OK,Muskogee,74423,35.67302,-95.1918,981,Urban,Central,Corporate treasurer,3,76,81999.88,Never Married,Male,No,19.612646,6,0,0,No,Observation Admission,No,No,Medium,Yes,Yes,Yes,No,Yes,No,Yes,No,No,Blood Work,5,2636.69,12742.59,4,5,4,4,3,5,4,6
7,E543302,2fccb53e-bd9a-4eaa-a53c-9dfc0cb83f94,e4884a42ba809df6a89ded6c97f460d4,Thompson,OH,Geauga,44086,41.67511,-81.05788,2558,Rural,Eastern,Hydrologist,0,50,10456.05,Never Married,Male,No,14.751687,6,0,0,No,Emergency Admission,Yes,No,Low,Yes,Yes,Yes,Yes,Yes,Yes,No,Yes,No,Intravenous,9,3694.63,16815.51,4,3,3,2,3,4,5,5
8,K477307,ab634508-dd8c-42e5-a4e4-d101a46f2431,5f78b8699d1aa9b950b562073f629ca2,Strasburg,VA,Shenandoah,22641,39.08062,-78.3915,479,Urban,Eastern,Psychiatric nurse,7,40,38319.29,Divorced,Female,No,19.688673,7,2,0,No,Observation Admission,No,No,Medium,Yes,No,No,No,No,No,No,No,No,Intravenous,14,3021.5,6930.57,1,2,2,5,4,2,4,2
9,Q870521,67b386eb-1d04-4020-9474-542a09d304e3,e8e016144bfbe14974752d834f530e26,Panama City,FL,Bay,32404,30.20097,-85.5061,40029,Urban,Central,Computer games developer,0,48,55586.48,Widowed,Male,No,19.65332,6,3,0,No,Emergency Admission,No,No,Low,Yes,No,No,Yes,No,No,No,No,No,Intravenous,6,2968.4,8363.19,3,3,2,3,3,3,4,2
10,Z229385,5acd5dd3-f0ae-41c7-9540-cf3e4ecb2e27,687e7ba1b80022c310fa2d4b00db199a,Paynesville,MN,Stearns,56362,45.40325,-94.71424,5840,Urban,Central,"Production assistant, radio",2,78,38965.22,Never Married,Female,No,18.224324,7,1,2,No,Emergency Admission,Yes,No,High,Yes,No,No,No,No,No,Yes,Yes,Yes,Blood Work,1,3147.86,26225.99,5,5,5,3,4,2,3,2


Now that the data is cleaner (despite being labeled as already cleaned), let us move on to items that are more in the data preparation realm. First, I will create dummy variables for all of the categorical columns I intend to use for random forest analysis, since scikit-learn's random forest requires numeric representations of categorical variables. For columns with more than two categories, I will use get_dummies. I will keep every column generated by get_dummies instead of dropping one of them as I would for logistic and linear regression. For columns with exactly two categories, I will simply make a boolean mapping dictionary and use that to map yes to 1 and no to 0. These dummy variables will be inserted into a new dataframe made to house solely the variables I am interested in for random forest regression.

In [10]:
# Create dictionary needed to re-map boolean columns.
boolean_map = {"No" : 0, "Yes" : 1}
# Map all boolean variables and convert to int
df["ReAdmis"] = df["ReAdmis"].map(boolean_map)
df["ReAdmis"] = df["ReAdmis"].astype("int64")
df["Soft_drink"] = df["Soft_drink"].map(boolean_map)
df["HighBlood"] = df["HighBlood"].map(boolean_map)
df["HighBlood"] = df["HighBlood"].astype('int64')
df["Stroke"] = df["Stroke"].map(boolean_map)
df["Stroke"] = df["Stroke"].astype('int64')
df["Overweight"] = df["Overweight"].map(boolean_map)
df["Overweight"] = df["Overweight"].astype('int64')
df["Arthritis"] = df["Arthritis"].map(boolean_map)
df["Diabetes"] = df["Diabetes"].map(boolean_map)
df["Diabetes"] = df["Diabetes"].astype('int64')
df["Hyperlipidemia"] = df["Hyperlipidemia"].map(boolean_map)
df["BackPain"] = df["BackPain"].map(boolean_map)
df["BackPain"] = df["BackPain"].astype('int64')
df["Anxiety"] = df["Anxiety"].map(boolean_map)
df["Anxiety"] = df["Anxiety"].astype('int64')
df["Allergic_rhinitis"] = df["Allergic_rhinitis"].map(boolean_map)
df["Reflux_esophagitis"] = df["Reflux_esophagitis"].map(boolean_map)
df["Asthma"] = df["Asthma"].map(boolean_map)

# Check that these took.
df.head(10)

CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,Population,Area,TimeZone,Job,Children,Age,Income,Marital,Gender,ReAdmis,VitD_levels,Doc_visits,Full_meals_eaten,vitD_supp,Soft_drink,Initial_admin,HighBlood,Stroke,Complication_risk,Overweight,Arthritis,Diabetes,Hyperlipidemia,BackPain,Anxiety,Allergic_rhinitis,Reflux_esophagitis,Asthma,Services,Initial_days,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,2951,Suburban,Central,"Psychologist, sport and exercise",1,53,86575.93,Divorced,Male,0,19.141466,6,0,0,0,Emergency Admission,1,0,Medium,0,1,1,0,1,1,1,0,1,Blood Work,10,3726.7,17939.4,3,3,2,2,4,3,3,4
2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,11303,Urban,Central,Community development worker,3,51,46805.99,Married,Female,0,18.940352,4,2,1,0,Emergency Admission,1,0,High,1,0,0,0,0,0,0,1,0,Intravenous,15,4193.19,17613.0,3,4,3,4,4,4,3,3
3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,17125,Suburban,Central,Chief Executive Officer,3,53,14370.14,Widowed,Female,0,18.057507,4,1,0,0,Elective Admission,1,0,Medium,1,0,1,0,0,0,0,0,0,Blood Work,4,2434.23,17505.19,2,4,4,4,3,4,3,3
4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,2162,Suburban,Central,Early years teacher,0,78,39741.49,Married,Male,0,16.576858,4,1,0,0,Elective Admission,0,1,Medium,0,1,0,0,0,0,0,1,1,Blood Work,1,2127.83,12993.44,3,5,5,3,4,5,5,5
5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,5287,Rural,Eastern,Health promotion specialist,1,22,1209.56,Widowed,Female,0,17.439069,5,0,2,1,Elective Admission,0,0,Low,0,0,0,1,0,0,1,0,0,CT Scan,1,2113.07,3716.53,2,1,3,3,5,3,4,3
6,S543885,e3b0a319-9e2e-4a23-8752-2fdc736c30f4,03e447146d4a32e1aaf75727c3d1230c,Braggs,OK,Muskogee,74423,35.67302,-95.1918,981,Urban,Central,Corporate treasurer,3,76,81999.88,Never Married,Male,0,19.612646,6,0,0,0,Observation Admission,0,0,Medium,1,1,1,0,1,0,1,0,0,Blood Work,5,2636.69,12742.59,4,5,4,4,3,5,4,6
7,E543302,2fccb53e-bd9a-4eaa-a53c-9dfc0cb83f94,e4884a42ba809df6a89ded6c97f460d4,Thompson,OH,Geauga,44086,41.67511,-81.05788,2558,Rural,Eastern,Hydrologist,0,50,10456.05,Never Married,Male,0,14.751687,6,0,0,0,Emergency Admission,1,0,Low,1,1,1,1,1,1,0,1,0,Intravenous,9,3694.63,16815.51,4,3,3,2,3,4,5,5
8,K477307,ab634508-dd8c-42e5-a4e4-d101a46f2431,5f78b8699d1aa9b950b562073f629ca2,Strasburg,VA,Shenandoah,22641,39.08062,-78.3915,479,Urban,Eastern,Psychiatric nurse,7,40,38319.29,Divorced,Female,0,19.688673,7,2,0,0,Observation Admission,0,0,Medium,1,0,0,0,0,0,0,0,0,Intravenous,14,3021.5,6930.57,1,2,2,5,4,2,4,2
9,Q870521,67b386eb-1d04-4020-9474-542a09d304e3,e8e016144bfbe14974752d834f530e26,Panama City,FL,Bay,32404,30.20097,-85.5061,40029,Urban,Central,Computer games developer,0,48,55586.48,Widowed,Male,0,19.65332,6,3,0,0,Emergency Admission,0,0,Low,1,0,0,1,0,0,0,0,0,Intravenous,6,2968.4,8363.19,3,3,2,3,3,3,4,2
10,Z229385,5acd5dd3-f0ae-41c7-9540-cf3e4ecb2e27,687e7ba1b80022c310fa2d4b00db199a,Paynesville,MN,Stearns,56362,45.40325,-94.71424,5840,Urban,Central,"Production assistant, radio",2,78,38965.22,Never Married,Female,0,18.224324,7,1,2,0,Emergency Admission,1,0,High,1,0,0,0,0,0,1,1,1,Blood Work,1,3147.86,26225.99,5,5,5,3,4,2,3,2


In [11]:
# Create dummy variables, keeping all category columns.
Gender_dum = pd.get_dummies(data=df['Gender'], drop_first=False)
Initial_admin_dum = pd.get_dummies(data=df['Initial_admin'], drop_first=False)
Comp_risk_dum = pd.get_dummies(data=df['Complication_risk'], drop_first=False)
Services_dum = pd.get_dummies(data=df['Services'], drop_first=False)

# Create regression variable only dataframe and insert dummy columns into it.
# .copy() makes RFR_df independent of df, which avoids a SettingWithCopyWarning on insert.
# [In-Text Citation: (GeeksforGeeks, 2023).]
RFR_df = df[['Age', 'ReAdmis', 'HighBlood', 'Doc_visits', 'Stroke', 'Diabetes',
            'BackPain', 'Anxiety', 'Initial_days']].copy()
RFR_df.insert(1,"dummy_male", Gender_dum.Male)
RFR_df.insert(1,"dummy_female", Gender_dum.Female)
RFR_df.insert(1,"dummy_nonbinary", Gender_dum.Nonbinary)
RFR_df.insert(5,"dummy_emergency", Initial_admin_dum["Emergency Admission"])
RFR_df.insert(5,"dummy_observation", Initial_admin_dum["Observation Admission"])
RFR_df.insert(5,"dummy_elective", Initial_admin_dum["Elective Admission"])
RFR_df.insert(9,"dummy_comp_risk_low", Comp_risk_dum.Low)
RFR_df.insert(9,"dummy_comp_risk_medium", Comp_risk_dum.Medium)
RFR_df.insert(9,"dummy_comp_risk_high", Comp_risk_dum.High)
RFR_df.insert(14,"dummy_CT_scan", Services_dum["CT Scan"])
RFR_df.insert(14,"dummy_intravenous", Services_dum.Intravenous)
RFR_df.insert(14,"dummy_Blood_work", Services_dum["Blood Work"])
RFR_df.insert(14,"dummy_MRI", Services_dum.MRI)
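To make the dummy-variable step concrete, here is a minimal, dependency-free sketch of what one-hot encoding does to a single categorical column. The helper `one_hot` and the sample values are illustrative assumptions and are not part of the analysis code; on the real dataframe, `pd.get_dummies` does the equivalent work.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct category (hypothetical helper)."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Illustrative sample, not rows taken from the real dataset.
risk = ["Medium", "High", "Medium", "Low"]
dummies = one_hot(risk, "dummy_comp_risk")

for name, column in dummies.items():
    print(name, column)
```

Each row ends up with exactly one 1 across the new columns, which is why keeping all category columns (drop_first=False) loses no information for a tree-based model.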

In [12]:
# Check that dataframe looks correct
RFR_df.head(20)

CaseOrder,Age,dummy_nonbinary,dummy_female,dummy_male,ReAdmis,dummy_elective,dummy_observation,dummy_emergency,HighBlood,dummy_comp_risk_high,dummy_comp_risk_medium,dummy_comp_risk_low,Doc_visits,Stroke,dummy_MRI,dummy_Blood_work,dummy_intravenous,dummy_CT_scan,Diabetes,BackPain,Anxiety,Initial_days
1,53,0,0,1,0,0,0,1,1,0,1,0,6,0,0,1,0,0,1,1,1,10
2,51,0,1,0,0,0,0,1,1,1,0,0,4,0,0,0,1,0,0,0,0,15
3,53,0,1,0,0,1,0,0,1,0,1,0,4,0,0,1,0,0,1,0,0,4
4,78,0,0,1,0,1,0,0,0,0,1,0,4,1,0,1,0,0,0,0,0,1
5,22,0,1,0,0,1,0,0,0,0,0,1,5,0,0,0,0,1,0,0,0,1
6,76,0,0,1,0,0,1,0,0,0,1,0,6,0,0,1,0,0,1,1,0,5
7,50,0,0,1,0,0,0,1,1,0,0,1,6,0,0,0,1,0,1,1,1,9
8,40,0,1,0,0,0,1,0,0,0,1,0,7,0,0,0,1,0,0,0,0,14
9,48,0,0,1,0,0,0,1,0,0,0,1,6,0,0,0,1,0,0,0,0,6
10,78,0,1,0,0,0,0,1,1,1,0,0,7,0,0,1,0,0,0,0,0,1


To use sklearn for random forest analysis, I need to separate the predictor variables from the response variable, Initial_days. The code below stores each set as a NumPy array.

In [13]:
X = RFR_df[['Age', 'dummy_nonbinary', 'dummy_female', 'dummy_male', 'ReAdmis',
            'dummy_elective', 'dummy_observation', 'dummy_emergency', 'HighBlood',
            'dummy_comp_risk_high', 'dummy_comp_risk_medium', 'dummy_comp_risk_low',
            'Doc_visits', 'Stroke', 'dummy_MRI', 'dummy_Blood_work',
            'dummy_intravenous', 'dummy_CT_scan', 'Diabetes', 'BackPain',
            'Anxiety']].values
y = RFR_df['Initial_days'].values

A random forest regressor does not require any kind of standardization or normalization. Each tree splits on a threshold within a single feature, so only the ordering of a feature's values matters, not their scale or the distance between datapoints. Thus, I will not perform any such scaling of the data.
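The claim that scaling does not matter can be checked with a small sketch. The helper names and the six sample values below are illustrative assumptions, and the split search is a simplified, brute-force stand-in for what a regression tree does at one node, not sklearn's implementation.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_left_group(x, y):
    """Return the row indices sent left by the split with the lowest total SSE."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_cost, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Illustrative values only.
age = [22, 40, 48, 53, 76, 78]
days = [1, 14, 6, 10, 5, 1]

# Min-max scale the feature to the range [0, 1].
age_scaled = [(a - min(age)) / (max(age) - min(age)) for a in age]

same = best_left_group(age, days) == best_left_group(age_scaled, days)
print(same)
```

Because min-max scaling preserves the order of the values, the same rows land on each side of the best split before and after scaling.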

## C4. Prepared Dataset

My random forest dataset is currently stored in two arrays, so for the purpose of providing my dataset as a CSV, I will simply provide the original dataframe from which my two arrays were split, RFR_df. This dataframe should reflect all data preparation steps performed except for the splitting of the predictive variables from the response variable.

In [14]:
RFR_df.to_csv('RFR_prepared_Task2.csv', index=False)

## D1. Data Splitting

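I split the data into a training set (70%) and a test set (30%) using train_test_split, then exported each piece to its own CSV file. Please note that the numpy arrays, when exported to a CSV, are difficult to read because they do not retain the column names of the dataframe from which they came. The columns in the X files, however, are in the same order as the predictor columns in RFR_df.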

In [15]:
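# Split the data: 70% training, 30% testing.
# stratify=y treats every distinct Initial_days value as a class, which is unusual
# for a regression target and relies on each value occurring more than once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7, stratify = y)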

In [16]:
# Export train and test sets to CSV
# [In-Text Citation: (GfG, 2024).]
np.savetxt('X_train_RFR_task2.csv', X_train, delimiter = ',')
np.savetxt('X_test_RFR_task2.csv', X_test, delimiter = ',')
np.savetxt('y_train_RFR_task2.csv', y_train, delimiter = ',')
np.savetxt('y_test_RFR_task2.csv', y_test, delimiter = ',')
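If readable column names are wanted in the exported files, one option is to write a header row first. The sketch below uses only the standard library, with an in-memory buffer standing in for a file; the three column names and the two rows are illustrative assumptions. (`np.savetxt` also accepts `header` and `comments` arguments that serve the same purpose.)

```python
import csv
import io

feature_names = ["Age", "dummy_female", "Doc_visits"]  # illustrative subset
rows = [[53, 0, 6], [51, 1, 4]]                        # illustrative values

buffer = io.StringIO()  # stands in for an open file handle
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(feature_names)  # header row goes first
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```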

## D2. Analysis Technique & Calculations

As mentioned previously, a random forest regressor is an ensemble of decision trees. First, a specified number of decision trees are created, each trained on a random sample of datapoints drawn from the training dataset. When a new datapoint is fed to this ensemble, each tree makes its own prediction by asking a series of true-or-false questions about the datapoint, following the matching branch at each node until it reaches a leaf node (the end of the tree), where a prediction is made. The predictions of all the individual decision trees are then averaged to produce a final prediction.

For example, say we want to predict a day's temperature from weather data. The first question, or root node, a decision tree might ask is, "Is it summer?" If true, the tree would follow the true path to the next node, where it might ask, "Is it overcast?" If false, the tree would then follow the false path to the next node, where it might ask, "Is the humidity less than 0.5?" This question's answer might then lead to a leaf node. The maximum number of questions a tree can ask before it must reach a leaf is set by the analyst-supplied max_depth parameter. At the leaf node, this individual decision tree will make a prediction of, let's say, 78 degrees. Another tree in the ensemble might come to the conclusion of 70, while yet another might predict 82. If these three trees are the entire ensemble, these three predictions are then averaged, resulting in a final prediction of 76.7 degrees.
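The temperature example can be written out as a toy ensemble. The three hand-written "trees" below are illustrative assumptions (real trees learn their questions and leaf values from training data), but the averaging step is the same idea.

```python
def tree_one(day):
    # Root node: "Is it summer?"
    if day["summer"]:
        # True path: "Is it overcast?"
        if day["overcast"]:
            return 72.0
        # False path: "Is the humidity less than 0.5?"
        return 78.0 if day["humidity"] < 0.5 else 81.0
    return 50.0

def tree_two(day):
    # A shallower tree that asks only about humidity.
    return 70.0 if day["humidity"] < 0.6 else 75.0

def tree_three(day):
    # A one-question tree.
    return 82.0 if day["summer"] else 48.0

def forest_predict(trees, day):
    """Average the individual tree predictions."""
    predictions = [tree(day) for tree in trees]
    return sum(predictions) / len(predictions)

today = {"summer": True, "overcast": False, "humidity": 0.4}
prediction = forest_predict([tree_one, tree_two, tree_three], today)
print(round(prediction, 1))  # 76.7
```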

It is important to note that I will be using only the default settings for this random forest regression model with the exception of n_estimators, max_depth, and min_samples_leaf, which I will discuss later. There are many parameters for a random forest, so here I will go over only the most commonly used ones that influence split calculation and training.

The default "criterion" parameter for random forest is "squared_error." This means that the quality of a split (the result of asking a question of the datapoints at a node) is measured using mean squared error. Since some of the other options are significantly slower to use when training the model, the default will be kept.

The default for min_samples_split is 2, which means a node must contain at least two samples before it can be split. Some might set this higher, but two makes sense as an absolute minimum, since one sample cannot be divided between two branches. Because I don't want to limit the trees too much in my initial random forest, I will keep the default here as well. If I find out later that the ensemble overfits, I could raise this value to combat overfitting.

Another parameter I will leave at the default is min_weight_fraction_leaf, which sets the minimum fraction of the total sample weight that must end up in a leaf. It defaults to 0.0, and because I do not supply sample weights, every observation is weighted equally. This parameter is useful when some observations should count for more than others. I do not believe this to be the case, so I will not change the default.

Max_features defaults to 1.0, which I will not change. This controls the number of features considered when attempting to find the best split, and the value 1.0 considers all features. Lowering it is useful for decorrelating trees, if correlation is found to be a problem. This is another parameter I would consider changing after evaluating the initial model's performance.

Max_leaf_nodes defaults to "None," which indicates there is no limit on the number of leaves a tree can have. I see no reason to impose such a cap on the random forest's initial run.

Min_impurity_decrease defaults to 0.0, which means a split is allowed as long as it decreases impurity by at least 0.0; in other words, no minimum improvement is demanded. Setting a threshold here can help deter the tree from making less meaningful splits, and thus it is yet another parameter I would consider changing after evaluating the initial model's performance.

It is customary for trees to be built from bootstrap samples of the training dataset. This helps introduce randomness in the tree building process that can improve model performance. Thus, I will keep the bootstrap parameter equal to its default "True."
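To illustrate how the "squared_error" criterion judges a split, and what min_impurity_decrease would be compared against, here is a minimal sketch for a single split at the root node. The helper names and the four target values are illustrative assumptions; sklearn additionally weights the decrease by the node's share of all samples, which equals 1 at the root and is therefore omitted here.

```python
def mse_impurity(values):
    """Mean squared deviation from the node's mean prediction."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (mse_impurity(parent)
            - len(left) / n * mse_impurity(left)
            - len(right) / n * mse_impurity(right))

parent = [2, 4, 10, 12]  # illustrative target values at one node

good = impurity_decrease(parent, left=[2, 4], right=[10, 12])
bad = impurity_decrease(parent, left=[2, 12], right=[4, 10])
print(good, bad)
```

The first split separates small values from large ones and removes most of the impurity; the second leaves both children with the same mean as the parent and removes none. A positive min_impurity_decrease would reject splits like the second one.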

Let us now discuss n_estimators, max_depth, and min_samples_leaf. These important parameters can be picked arbitrarily, but it is also common to use hyperparameter tuning to pick better values for them. This assignment does not require hyperparameter tuning, but in the absence of a better way to pick the values for these parameters, I will perform it anyway for this limited set of parameters using RandomizedSearchCV. GridSearchCV checks every combination of the parameter values supplied to it, while RandomizedSearchCV checks only a fixed number of randomly chosen combinations. As such, RandomizedSearchCV takes less time and computational power while still providing an improvement in performance over a model with parameters chosen arbitrarily. With RandomizedSearchCV, I can feed in larger lists of possible parameters without the risk of my computer catching fire from the sheer computational power required to run GridSearchCV on the same lists, and without losing hours waiting for it to finish.

Because hyperparameter tuning does not require the codification of mathematical formulas, **I did not perform any intermediate calculations.**

## D3. Analysis Code

Below you will find the code used to run the random forest model and some very basic hyperparameter tuning.

In [17]:
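rfr = RandomForestRegressor(random_state = 7)

# Candidate values to sample from. Float values of min_samples_leaf are
# interpreted by sklearn as a fraction of the training samples.
params = {'n_estimators': [10,20,50,100,200,500,1000,1500],
          'max_depth': [2,4,6,8,10,20],
          'min_samples_leaf': [0.08,0.1,0.2,0.3]}

# n_iter is left at its default of 10, so 10 random combinations are tried with 5-fold CV.
grid = RandomizedSearchCV(estimator=rfr, cv=5, param_distributions=params, scoring = 'neg_mean_squared_error', n_jobs = -1, random_state=7, verbose = 1)
grid.fit(X_train, y_train)
best_params = grid.best_params_
print(best_params)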

Fitting 5 folds for each of 10 candidates, totalling 50 fits
{'n_estimators': 1000, 'min_samples_leaf': 0.1, 'max_depth': 4}
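The fit log above reports 50 fits. The sketch below counts how many fits an exhaustive grid search over the same lists would have needed, as a rough illustration of the savings. The sampling step uses Python's own random module, so it is an assumption-laden stand-in and will not reproduce the exact combinations RandomizedSearchCV drew.

```python
import random
from itertools import product

params = {
    "n_estimators": [10, 20, 50, 100, 200, 500, 1000, 1500],
    "max_depth": [2, 4, 6, 8, 10, 20],
    "min_samples_leaf": [0.08, 0.1, 0.2, 0.3],
}
cv_folds = 5
n_iter = 10  # RandomizedSearchCV's default number of sampled candidates

# Every combination an exhaustive grid search would evaluate.
all_combinations = list(product(*params.values()))
grid_fits = len(all_combinations) * cv_folds
random_fits = n_iter * cv_folds

# Illustrative random draw of 10 distinct combinations.
rng = random.Random(7)
sampled = rng.sample(all_combinations, n_iter)

print(len(all_combinations), grid_fits, random_fits)  # 192 960 50
```

The randomized search therefore evaluated 10 of 192 possible combinations, so better settings may exist among those it never tried.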


In [18]:
# Use best model to predict
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Print MSE, RMSE, and R-squared
print("Test MSE: " + str(MSE(y_test, y_pred)))
print("Test RMSE: " + str(np.sqrt(MSE(y_test, y_pred))))
print("R-squared: " + str(r2_score(y_test, y_pred)))

Test MSE: 195.34812860227586
Test RMSE: 13.976699488873468
R-squared: 0.7174279127068248


## E1. Accuracy & MSE

The code that returns MSE, which quantifies the model's accuracy, can be viewed above. I have also provided code for the calculation of RMSE, which is easier to interpret, and R-squared to help explain the model's performance.

Unfortunately, this model is not any better than the linear regression model I used in my previous paper, which continues to cause me frustration. The MSE, or mean squared error, for this model is 195.3. MSE is the average of the squared differences between the predictions and the true values, so lower is better: the smaller the MSE, the smaller the errors, and thus, the better the model. The MSE for an ideal model would be as close to 0 as possible, and the MSE for this model is rather high.

MSE, however, is difficult to interpret by itself, since the errors are squared and the unit, in this case days squared, is less intuitive. Thus, it is helpful to take MSE and calculate RMSE from it, which uses the natural unit of the outcome variable in question, making it easier to understand. The RMSE of this model was 14.0 days, rounded to the tenths place. This means that the model's predictions of Initial_days typically miss the true value by about 14 days, an error of about two weeks. RMSE is not a plain average of the errors, however: because the errors are squared before averaging, large misses count more heavily. This value of RMSE, in my opinion, is decent, but those with domain knowledge may disagree based on how vital it is to know exactly how long a hospital stay will be. And, as always, the lower the RMSE, the better.
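The relationship between MSE and RMSE can be sketched in a few lines. The first part applies the square root to the test MSE reported above; the second part uses four made-up predictions (an illustrative assumption) to show how one large miss pulls RMSE above the mean absolute error.

```python
import math

# RMSE is the square root of the reported test MSE.
test_mse = 195.34812860227586
test_rmse = math.sqrt(test_mse)
print(round(test_rmse, 1))  # 14.0

# Illustrative values: three perfect predictions and one 20-day miss.
y_true = [10, 10, 10, 30]
y_pred = [10, 10, 10, 10]
errors = [t - p for t, p in zip(y_true, y_pred)]

toy_mse = sum(e ** 2 for e in errors) / len(errors)
toy_rmse = math.sqrt(toy_mse)
toy_mae = sum(abs(e) for e in errors) / len(errors)
print(toy_mse, toy_rmse, toy_mae)  # 100.0 10.0 5.0
```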

MSE, in my opinion, is better for comparing models, while RMSE is better for understanding the performance of a single model. The MSEs of the linear regression model and the random forest were almost exactly the same, with linear regression outperforming the random forest by a marginal amount. Thus, I have not succeeded in creating a better model. Rather, I have created a very similar one.

Finally, though it is not required by the rubric, let us discuss R-squared. R-squared, also known as the coefficient of determination, tells us what proportion of the variance in the response variable, length of hospital stay, is explained by the explanatory variables. The maximum for R-squared is 1, and the closer to 1 this value is, the better the model. The random forest model achieved an R-squared of 0.72. This, again, reinforces the fact that the random forest did marginally worse than the linear regression model, which got an R-squared value of 0.73. Neither of these R-squared values is impressive. Rather, they are fairly mediocre, explaining only roughly 70% of the variance in hospital stay.
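For readers who want to see what R-squared measures, here is a minimal sketch of the standard formula, one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the toy values are illustrative assumptions; the analysis itself uses sklearn's r2_score.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only.
y_true = [1, 5, 10, 14]
close_pred = [2, 4, 11, 13]           # each prediction is off by one day
mean_pred = [7.5, 7.5, 7.5, 7.5]      # always predict the mean of y_true

close_score = r_squared(y_true, close_pred)
mean_score = r_squared(y_true, mean_pred)
print(round(close_score, 3), mean_score)
```

A model that always predicts the mean scores 0, so an R-squared of 0.72 means the model removes about 72% of the squared error that such a baseline would make.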

## E2. Analysis Results

With an R-squared of 0.72 and an MSE of 195.3, the random forest model was narrowly outperformed by the linear regression model, making it marginally the worse of the two. However, the random forest model is still better at predicting than, say, throwing darts at a board of randomly placed and numbered Post-it notes to predict the length of hospital stay (unlike my classification model in Task 1). Both models are almost equally mediocre, and other analysis techniques that might perform better should be considered going forward, as there is a *lot* of room for improvement. Roughly 28% of the variation in length of hospital stay remains unexplained by the model, a gap that is almost unacceptable. Both the random forest model and the linear regression model could be deemed passable, however, if the 14-day error is "close enough" to the true value to be usable by the hospital in an actionable plan. Determining whether or not this level of precision is acceptable would require domain knowledge I simply don't possess.

## E3. Analysis Limitations

The major limitation of the random forest analysis I performed is that this technique is rather biased when dealing with categorical variables (Vishalmendekarhere, 2021). The features I chose to use to predict length of hospital stay were almost all categorical. Of the 12 original variables (before they were converted to dummy variables), only two were numeric. Because categorical variables are a weakness for random forests and I had so many of them, performance could have suffered. In particular, random forests are biased towards categorical variables with multiple categories because feature selection is based on impurity reduction (Goyal, 2022). This means that the model might have favored the feature Services, which has four categories, more than any other categorical variable. In addition, gender, initial_admin, and complication_risk, which each have three categories, may also have been favored over the condition variables such as diabetes or anxiety, which only have two. This bias makes the results of the random forest questionable at best.

## E4. Recommended Action

Since this model is fairly mediocre, there are a few avenues I would recommend considering. First, one could try to improve the random forest model with more hyperparameter tuning. I tuned three hyperparameters--n_estimators, max_depth, and min_samples_leaf--which I chose to tune because they are the most commonly tuned hyperparameters for a random forest model. However, it might end up being beneficial to tune even more hyperparameters. Helpful hyperparameters to consider tuning can be found in the sklearn ensemble documentation that I briefly discussed in section D2.

Secondly, the analyst might perhaps include more variables in the feature dataset. I limited myself to 12 because I was trying to compare against a past model I created using linear regression. Thus, I would recommend the analyst re-run the random forest on as many features as possible while still excluding categorical variables with a high cardinality of choices, since this would bias the model and expand the dataset far too much column-wise.

Lastly, one might abandon the random forest altogether, since the dataset is composed of mostly categorical variables, a weakness for random forests. In that case, I would recommend that the analyst try other models such as Lasso or Ridge regression.

## F. Panopto

The link to the Panopto recording for this assignment can be found here: https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=22583f4c-cdca-4fd5-8aba-b128013b012d

## G. Code Sources

GeeksforGeeks. (2023, November 30). How to add column from another DataFrame in Pandas.<br>
&emsp;&emsp;https://www.geeksforgeeks.org/how-to-add-column-from-another-dataframe-in-pandas/

GfG. (2024, February 2). Convert a NumPy array into a CSV file. GeeksforGeeks.<br>
&emsp;&emsp;https://www.geeksforgeeks.org/convert-numpy-array-into-csv-file/

Nelson, M. (2023, August 8). *D206: Data Cleaning Performance Assessment.* Unpublished manuscript, Western Governors University.

## H. Content Sources


Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors. (n.d.).<br>
&emsp;&emsp; https://www.bzst.com/2015/08/categorical-predictors-how-many-dummies.html

Goyal, C. (2022, June 24). Bagging- 25 questions to test your skills on Random Forest Algorithm. Analytics Vidhya.<br>
&emsp;&emsp; https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-algorithm/

Regis College. (2022, August 10). *How Reducing Hospital Readmissions Benefits Patients and Hospitals.*<br>
&emsp;&emsp;https://online.regiscollege.edu/blog/reducing-hospital-readmissions/

Saini, A. (2024, January 5). Decision Tree – a Step-by-Step guide. Analytics Vidhya.<br>
&emsp;&emsp;https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/

Vishalmendekarhere. (2021, December 27). It’s all about assumptions, pros & cons - the startup - medium. Medium.<br>
&emsp;&emsp; https://medium.com/swlh/its-all-about-assumptions-pros-cons-497783cfed2d