# COVID-19's Impact on Healthcare Accessibility
By: Tristan Call and Maria Elena Aviles-Baquero  
CPSC 322, Spring 2021  

# Introductory

## Our Database:
This section must briefly describe the dataset you used and the classification task you implemented (e.g., what were you trying to classify in the dataset).

We utilized week 21 of the Household Pulse Survey Public Use File, which covered the time period from December 9 – December 21.

## Our Findings:
You should also briefly describe your findings (e.g., what classifier approach performed the best).  
Overall we discovered that a random forest classifier was our best classifier.


# Data Analysis

## Database information
Information about the dataset itself, e.g., the attributes and attribute types, the number of instances, and the attribute being used as the label.


## Loading in the data

In [6]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# uncomment once you paste your mypytable.py into mysklearn package
import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.plot_utils
importlib.reload(mysklearn.plot_utils)
import mysklearn.plot_utils as plot_utils

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyDecisionTreeClassifier, MyZeroRClassifier, MyRandomClassifier, MyRandomForestClassifier


import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import os
import pandas as pd
from tabulate import tabulate

### Manipulate Data into Useable Format
The first thing we need to do is grab the data from the sas file and manipulate it into a format and size which is workable with our very much not optimized dataset. Part of this involves dropping rows with NaNs or -99s (seen but unanswered questions) in them ahead of time. Overall we aim to go from about 70,000 results to a more reasonable < 10,000 so that our computers can run it in a reasonable amount of time.

In [7]:
# Grab the data
week21_filename = os.path.join("input_data", "pulse2020_puf_21.sas7bdat")
iterator = pd.read_sas(week21_filename, chunksize=5000)
alldata = []
for chunk in iterator:
    alldata.append(chunk)

relevant_attributes = ["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME", "DELAY", "NOTGET"]

# Grab a chunk of data with the attributes we are interested in, minus Nans, and save to a local file
data = alldata[0][["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME", "DELAY", "NOTGET"]]
working_data_filename = os.path.join("input_data", "week21_working.csv")
nafree_data = data.dropna()

# Get rid of -99 results (aka seen but not answered)
nafree_data = nafree_data[nafree_data.INCOME != -99]
nafree_data = nafree_data[nafree_data.DELAY != -99]
nafree_data = nafree_data[nafree_data.NOTGET != -99]
print(nafree_data)

# Save to file
nafree_data.to_csv(working_data_filename)

      TBIRTH_YEAR  EGENDER  RHISPANIC  RRACE  EEDUC  INCOME  DELAY  NOTGET
1          1969.0      2.0        1.0    1.0    7.0     6.0    1.0     2.0
2          1959.0      2.0        1.0    1.0    7.0     4.0    1.0     1.0
4          1967.0      1.0        1.0    1.0    4.0     6.0    2.0     2.0
5          1965.0      1.0        1.0    1.0    7.0     6.0    2.0     2.0
6          1962.0      2.0        1.0    2.0    4.0     1.0    2.0     2.0
...           ...      ...        ...    ...    ...     ...    ...     ...
4993       1964.0      2.0        1.0    1.0    4.0     1.0    2.0     2.0
4994       1984.0      1.0        1.0    1.0    4.0     7.0    1.0     1.0
4995       1973.0      1.0        1.0    1.0    6.0     8.0    2.0     2.0
4997       1976.0      2.0        1.0    1.0    3.0     3.0    1.0     1.0
4999       1958.0      2.0        1.0    1.0    7.0     5.0    2.0     2.0

[3909 rows x 8 columns]


### Organize the data
Next we want to get the data into a more useful format. Step one of this is chunk years into decades to have a reasonable number of attribute values for year according to the below:

years | label
-|-
1932-1941 | 1
1942-1951 | 2
1952-1961 | 3
1962-1971 | 4
1972-1981 | 5
1982-1991 | 6
1992-2002 | 7

Next we want to create a DELAYNOTGET column as a composite of delay and notget so we can look into both these attributes at ounce.

In [8]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# Load the data into a mypytable for future analysis
overall_table = MyPyTable()
overall_table.load_from_file(working_data_filename)
overall_table.convert_to_numeric()

# Convert year into bigger categorical chunks
year_col = overall_table.get_column("TBIRTH_YEAR")
year_label = [x + 1 for x in range(7)]
cutoffs = [1932 + 10 * x for x in range(8)]
year_col = myutils.categorize_continuous_list(year_col, cutoffs, year_label)

# Create DELAYNOTGET column
delay = overall_table.get_column("DELAY")
notget = overall_table.get_column("NOTGET")
delaynotget = []
for i in range(len(delay)):
    if delay[i] == 1 or notget[i] == 1:
        delaynotget.append(1)
    else:
        delaynotget.append(2)
        
# Combine all the above into the overall_table
overall_table.column_names.append("DELAYNOTGET")
overall_table.data = [[overall_table.data[i][0]] + [year_col[i]] + overall_table.data[i][2:] + [delaynotget[i]] for i in range(len(year_col))]
#overall_table.pretty_print()

## Summary Statistics
Relevant summary statistics about the dataset.

In [4]:
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 
importlib.reload(mysklearn.plot_utils)
import mysklearn.plot_utils as plot_utils

# use overall_table object declared above to compute the stats for all attributes
table_stats = overall_table.compute_summary_statistics(overall_table.column_names[1:])
# print out the statistics table
table_stats.pretty_print()

items, values = myutils.get_item_frequency(overall_table.get_column("DELAYNOTGET"))

attribute      min    max    mid      avg    median       std
-----------  -----  -----  -----  -------  --------  --------
TBIRTH_YEAR      1      7    4    4.0967          4  1.55314
EGENDER          1      2    1.5  1.59964         2  0.489971
RHISPANIC        1      2    1.5  1.08314         1  0.276096
RRACE            1      4    2.5  1.31491         1  0.778868
EEDUC            1      7    4    5.33078         6  1.42407
INCOME           1      8    4.5  4.5559          5  2.07749
DELAY            1      2    1.5  1.65004         2  0.476958
NOTGET           1      2    1.5  1.74648         2  0.435025
DELAYNOTGET      1      2    1.5  1.61474         2  0.486658


## Data Visualizations
Data visualizations highlighting important/interesting aspects of your dataset. Visualizations may include frequency distributions, comparisons of attributes (scatterplot, multiple frequency diagrams), box and whisker plots, etc. The goal is not to include all possible diagrams, but instead to select and highlight diagrams that provide insight about the dataset itself.
Note that this section must describe the above (in paragraph form) and not just provide diagrams and statistics. Also, each figure included must have a figure caption (Figure number and textual description) that is referenced from the text (e.g., “Figure 2 shows a frequency diagram for ...”).


### Data Breakdown

In [None]:
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

# get subtables group by whether or not an instance was delayed/didn't get care or did receive care without delay
group_names, subtables = overall_table.group_by("DELAYNOTGET")
# first subtable represents the instances where the individual got delayed care or did not get any
delayed_or_none = MyPyTable(overall_table.column_names, subtables[0])



# Classification
First we will break the information into the appropriate format

In [9]:
# Break information into X_train and class_label
X_train = overall_table.get_columns(["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME"])
X_train = X_train.data
Y_train = overall_table.get_column("DELAYNOTGET")

### Process
To find the best classifier, we first computed 2 baseline classifiers, ZeroR and random. We then tested naive bayes, decision tree, and random forest classifiers and looked at how they compared to the baselines and each other. All of these classifiers were loosley based on the sklearn implementation, but featured simpler algorithms made by the authors and less optimization (https://scikit-learn.org/stable/).

### Evaluation
To determine which classifier was the best, we ran each of them over a stratified 10-fold cross validation testing technique for accuracy. The exception was the random forest classifier, which required its own unique approach. We then plugged these results into a confusion matrix to determine if the classifier was better at one or another prediction. Given roughly equal accuracy (within about 1%), the classifier where the recognition rates of all class labels were closer to being the same won out. Meaning any classifier which at least matched the zero R classifier in accuracy, but didn't have a 0%, 100% split in recognition rates, would win out.

## Compute the Baseline
First we will compute the baseline classifiers to get an idea of how must we must improve our classifiers.
### Zero R

In [10]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_zero = []
all_actual_delay_zero = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    zero = MyZeroRClassifier()
    zero.fit(xtrain, ytrain)
    predicted_delay = zero.predict(xtest)
    all_predicted_delay_zero += predicted_delay
    all_actual_delay_zero += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_zero, all_actual_delay_zero)
error_rate = 1- accuracy

print("Zero R: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Zero R: accuracy = 0.614735226400614, error rate = 0.38526477359938605


In [11]:
print('''===========================================
Confusion Matrices
===========================================
Zero R (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_zero, all_predicted_delay_zero, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Zero R (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                     0           1506     1506                  0
Not delayed                          0           2403     2403                100


### Random Classifier

In [12]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_random = []
all_actual_delay_random = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    random = MyRandomClassifier()
    random.fit(xtrain, ytrain)
    predicted_delay = random.predict(xtest)
    all_predicted_delay_random += predicted_delay
    all_actual_delay_random += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_random, all_actual_delay_random)
error_rate = 1- accuracy

print("Random: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Random: accuracy = 0.5305704783832182, error rate = 0.46942952161678175


In [13]:
print('''===========================================
Confusion Matrices
===========================================
Random (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_random, all_predicted_delay_random, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Random (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   581            925     1506              38.58
Not delayed                        910           1493     2403              62.13


## Compute Actual Classifiers
Next we try running naive bayes, decision tree, and random forest classifiers over our database and see how they compare.

### Naive Bayes

In [14]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_bayes = []
all_actual_delay_bayes = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    bayes = MyNaiveBayesClassifier()
    bayes.fit(xtrain, ytrain)
    predicted_delay = bayes.predict(xtest)
    all_predicted_delay_bayes += predicted_delay
    all_actual_delay_bayes += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_bayes, all_actual_delay_bayes)
error_rate = 1- accuracy

print("Naive bayes: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Naive bayes: accuracy = 0.6144794064978255, error rate = 0.3855205935021745


In [15]:
print('''===========================================
Confusion Matrices
===========================================
Naive bayes (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_bayes, all_predicted_delay_bayes, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Naive bayes (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   221           1285     1506              14.67
Not delayed                        222           2181     2403              90.76


Naive Bayes had comparable accuracy to the Zero R classifier, as well as better delayed/canceled recognition. It also has better accuracy, and worse recognition than the random classifier.

### Decision Tree

In [16]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_tree = []
all_actual_delay_tree = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    tree = MyDecisionTreeClassifier()
    tree.fit(xtrain, ytrain, ['RHISPANIC', 'RRACE', 'TBIRTH_YEAR', 'INCOME', 'label'])
    predicted_delay = tree.predict(xtest)
    all_predicted_delay_tree += predicted_delay
    all_actual_delay_tree += ytest
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_tree, all_actual_delay_tree)
error_rate = 1- accuracy

print("Decision Tree: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Decision Tree: accuracy = 0.6091071885392684, error rate = 0.3908928114607316


In [17]:
print('''===========================================
Confusion Matrices
===========================================
Decision Tree (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Decision Tree (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   157           1349     1506              10.42
Not delayed                        179           2224     2403              92.55


The decision tree, as you can see, did not have very good performance. It was out performed slightly by the zero R and Naive Bayes classifiers in terms of accuracy only had the benefit of better delayed/canceled recognition. The best accuracy results were found using hispanic, race, birth year, and income. Adding gender and education decreased accuracy and increased delayed/canceled recognition. With the current structure it has almost identical performance to the Naive Bayes classifier, if slightly less good. In accuracy it still outperformed the random classifier, though did worse in delayed/canceled recognition.

### Random Forest 
With the forest classifier we had to use a different method of evaluation due to the randomness that it featured. To do this we followed the random forest procedure described in class (located at https://github.com/GonzagaCPSC322/U7-Ensemble-Learning/blob/master/A%20Ensemble%20Learning.ipynb). We computed a stratified k fold cross validation with k=3. Then we selected one of the folds as our validation set, and the rest as our training set. We trained the forest over the training set, then tested it against the validation set. This was done 5 times to attempt to minimize the effects of randomness in evaluating the classifier. We then computed the overall accuracy and confusion matrixes, and continued as normal in our evaluation. Additionally, we varied the values of N, M, and F to see if they had an effect on accuracy

In [18]:
N = 10
M = 5
F = 1

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a0'], validation accuracy: 0.6357069143446853
Tree attributes: ['a0'], validation accuracy: 0.6253869969040248
Tree attributes: ['a3'], validation accuracy: 0.6247379454926625
Tree attributes: ['a5'], validation accuracy: 0.6219895287958115
Tree attributes: ['a5'], validation accuracy: 0.6180698151950719
Tree attributes: ['a4'], validation accuracy: 0.6363636363636364
Tree attributes: ['a4'], validation accuracy: 0.6341201716738197
Tree attributes: ['a0'], validation accuracy: 0.6260330578512396
Tree attributes: ['a0'], validation accuracy: 0.625531914893617
Tree attributes: ['a2'], validation accuracy: 0.6200828157349897
Tree attributes: ['a5'], validation accuracy: 0.6227224008574491
Tree attributes: ['a3'], validation accuracy: 0.6189967982924226
Tree attributes: ['a2'], validation accuracy: 0.6186094069529653
Tree attributes: ['a4'], validation accuracy: 0.6136125654450262
Tree attributes: ['a2'], validation accuracy:

In [19]:
N = 10
M = 5
F = 4

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a2', 'a5', 'a0', 'a3'], validation accuracy: 0.6247422680412371
Tree attributes: ['a0', 'a3', 'a2', 'a4'], validation accuracy: 0.6222684703433923
Tree attributes: ['a3', 'a5', 'a2', 'a0'], validation accuracy: 0.6221052631578947
Tree attributes: ['a0', 'a3', 'a4', 'a1'], validation accuracy: 0.6212278876170656
Tree attributes: ['a0', 'a2', 'a5', 'a1'], validation accuracy: 0.6176154672395274
Tree attributes: ['a5', 'a4', 'a2', 'a0'], validation accuracy: 0.6111111111111112
Tree attributes: ['a0', 'a5', 'a3', 'a4'], validation accuracy: 0.6080937167199149
Tree attributes: ['a3', 'a2', 'a4', 'a0'], validation accuracy: 0.6069489685124865
Tree attributes: ['a2', 'a3', 'a4', 'a5'], validation accuracy: 0.6064718162839249
Tree attributes: ['a2', 'a5', 'a1', 'a3'], validation accuracy: 0.6059670781893004
Tree attributes: ['a4', 'a5', 'a1', 'a3'], validation accuracy: 0.6213592233009708
Tree attributes: ['a0', 'a2', 'a5', 'a3']

Increasing F seems to have only decreased the accuracy of the classifier.

In [21]:
N = 20
M = 5
F = 2

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a0', 'a3'], validation accuracy: 0.6222684703433923
Tree attributes: ['a4', 'a1'], validation accuracy: 0.6220145379023884
Tree attributes: ['a2', 'a3'], validation accuracy: 0.6191950464396285
Tree attributes: ['a3', 'a2'], validation accuracy: 0.6136595310907238
Tree attributes: ['a2', 'a1'], validation accuracy: 0.6071055381400209
Tree attributes: ['a1', 'a4'], validation accuracy: 0.6308654848800834
Tree attributes: ['a2', 'a4'], validation accuracy: 0.6305931321540063
Tree attributes: ['a2', 'a5'], validation accuracy: 0.6267605633802817
Tree attributes: ['a0', 'a3'], validation accuracy: 0.6221052631578947
Tree attributes: ['a2', 'a3'], validation accuracy: 0.6113989637305699
Tree attributes: ['a0', 'a1'], validation accuracy: 0.6321243523316062
Tree attributes: ['a1', 'a4'], validation accuracy: 0.6318565400843882
Tree attributes: ['a2', 'a3'], validation accuracy: 0.6151452282157677
Tree attributes: ['a1', 'a0'], 

Increasing M seems to have marginally increased the accuracy, but not significantly

In [23]:
N = 20
M = 2
F = 1

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a3'], validation accuracy: 0.6334371754932503
Tree attributes: ['a5'], validation accuracy: 0.6302966101694916
Tree attributes: ['a4'], validation accuracy: 0.6446280991735537
Tree attributes: ['a4'], validation accuracy: 0.6372950819672131
Tree attributes: ['a4'], validation accuracy: 0.6421052631578947
Tree attributes: ['a4'], validation accuracy: 0.6364605543710021
Tree attributes: ['a3'], validation accuracy: 0.6469366562824507
Tree attributes: ['a3'], validation accuracy: 0.6365568544102019
Tree attributes: ['a4'], validation accuracy: 0.6378205128205128
Tree attributes: ['a1'], validation accuracy: 0.6331967213114754
Predictive Accuracy
Stratified 3-Fold Cross Validation
Forest: accuracy = 0.614735226400614, error rate = 0.38526477359938605
Confusion Matrices:
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   157           1349     1506              

Decreasing the M value also did not noticeably modify the results.

As such the best combination seems to have F=1, with M and N not having as much of an effect.

## Results
The forest classifier accuracy outperformed all the other classifiers. In terms of accuracy it beat Naive Bayes, Decision Tree, and Random classifiers. It was on par with the zero R classifier, but had a better Delayed/canceled recognition rate. It didn't have as good a delayed/canceled recognition rate as random did, but it was significantly more accurate, making it still the better choice. As such it was the best. 

That said it was not a particularly good classifier. It only managed to be on par with the zero R classifier in terms of accuracy, and only counted as a better algorithm because its delayed/canceled recognition rate was superior to zero R's. More work would be required for it to be utilized in an important role.

### Heroku TODOOOOO!!!!
Here we will store the last tree we made with one of the better results.

To test, type in the following url (TODO change to heroku): http://127.0.0.1:5000/predict?birth_year=1993&gender=2&hispanic=1&race=3&income=4&education=7  
You can then modify the values and see what it predicts. Reference the introduction section where all of these are explained to determine what numbers you should place. Invalid numbers will not be accepted.

In [29]:
import pickle
trees = forest.chosen_trees
package = []
for tree in trees:
    package.append([tree['tree'].tree, tree['tree'].header])
print(package)
pickle_path = "forest_pickler.py"
outfile = open(pickle_path, "wb")
pickle.dump(package, outfile)
outfile.close()

[[['Attribute', 'a4', ['Value', 2, ['Leaf', 2, 349, 2606]], ['Value', 4, ['Leaf', 2, 509, 2606]], ['Value', 5, ['Leaf', 2, 512, 2606]], ['Value', 1, ['Leaf', 2, 72, 2606]], ['Value', 3, ['Leaf', 2, 595, 2606]], ['Value', 6, ['Leaf', 2, 436, 2606]], ['Value', 7, ['Leaf', 2, 133, 2606]]], ['a4', 'label']], [['Attribute', 'a1', ['Value', 6, ['Leaf', 2, 445, 2606]], ['Value', 7, ['Leaf', 2, 140, 2606]], ['Value', 2, ['Leaf', 2, 343, 2606]], ['Value', 5, ['Leaf', 2, 419, 2606]], ['Value', 3, ['Leaf', 2, 612, 2606]], ['Value', 4, ['Leaf', 2, 554, 2606]], ['Value', 1, ['Leaf', 2, 93, 2606]]], ['a1', 'label']]]


# Conclusion

Overall the project was moderately successful. We started with a dataset with a number of demographic and socioeconomic attributes, and whether or not individuals suffered delayed or canceled care due to COVID-19. Ounce we found the correct dataset, we were able to relatively easily use part of that data for classification. We approached our classifiers by creating two baseline ones, then testing three others against those baselines to determine how effective they were. Our best classifier, the random forest, had comparable accuracy to the first baseline classifier with better recognition rates, and outperformed the second baseline classifier by a notable amount in accuracy. However, these results were not ideal, and future development would be needed before the classifier could be useful. 

Future work might try predicting based on other attributes that were included in the general dataset. Employment status, other health related information, housing status, or any number of elements may have a stronger correlation. Atlernatively, the correlations of some attributes may be greater in certain areas, such as race mattering in some states more than others. Finally, it should be noted that the current classifier utilized only about 3000 of the ~70,000 data points available in one week of a study that has been going along for at least 27 weeks due to runtime issues. Utilizing more computer power and optimized algorithms to process all of the data might reveal that the current snapshot was not representative of the whole, and should definitely be investigated.