# COVID-19's Impact on Healthcare Accessibility
### By: Tristan Call and Maria Elena Aviles-Baquero

# Classification
In this document we will aim to apply classifiers to the dataset

(question about this notebook: what is the difference b/w this one and the EDA?)

In [None]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# uncomment once you paste your mypytable.py into mysklearn package
import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.plot_utils
importlib.reload(mysklearn.plot_utils)
import mysklearn.plot_utils as plot_utils

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyDecisionTreeClassifier, MyZeroRClassifier, MyRandomClassifier, MyRandomForestClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import os
import pandas as pd
from tabulate import tabulate

In [None]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

working_data_filename = os.path.join("input_data", "week21_working.csv")

# Load the data into a mypytable for future analysis
overall_table = MyPyTable()
overall_table.load_from_file(working_data_filename)
overall_table.convert_to_numeric()

# Convert year into bigger categorical chunks
year_col = overall_table.get_column("TBIRTH_YEAR")
year_label = [str(1932 + 10 * x) + " to " + str(1941 + 10 * x) for x in range(6)]
year_label.append("1992 to 2002")
cutoffs = [1932 + 10 * x for x in range(8)]
year_col = myutils.categorize_continuous_list(year_col, cutoffs, year_label)

# Create DELAYNOTGET column
delay = overall_table.get_column("DELAY")
notget = overall_table.get_column("NOTGET")
delaynotget = []
for i in range(len(delay)):
    if delay[i] == 1 or notget[i] == 1:
        delaynotget.append(1)
    else:
        delaynotget.append(2)
        
# Combine all the above into the overall_table
overall_table.column_names.append("DELAYNOTGET")
overall_table.data = [[overall_table.data[i][0]] + [year_col[i]] + overall_table.data[i][2:] + [delaynotget[i]] for i in range(len(year_col))]
print(len(overall_table.data))

# Copy from here

# Classification
First we will break the information into the appropriate format

In [None]:
# Break information into X_train and class_label
X_train = overall_table.get_columns(["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME"])
X_train = X_train.data
Y_train = overall_table.get_column("DELAYNOTGET")

### Process
To find the best classifier, we first computed 2 baseline classifiers, ZeroR and random. We then tested naive bayes, decision tree, and random forest classifiers and looked at how they compared to the baselines and each other. All of these classifiers were loosley based on the sklearn implementation, but featured simpler algorithms and less streamlining (https://scikit-learn.org/stable/).

### Evaluation
To determine which classifier was the best, we ran each of them over a stratified 10-fold cross validation testing technique for accuracy. The exception was the random forest classifier, which required its own unique approach. We then plugged these results into a confusion matrix to determine if the classifier was better at one or another prediction. Given roughly equal accuracy (within about 1%), the classifier where the recognition rates of all class labels were closer to being the same won out. Meaning any classifier which at least matched the zero R classifier in accuracy, but didn't have a 0%, 100% split in recognition rates, would win out.

## Compute the Baseline
First we will compute the baseline classifiers to get an idea of how must we must improve our classifiers.
### Zero R

In [4]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_zero = []
all_actual_delay_zero = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    zero = MyZeroRClassifier()
    zero.fit(xtrain, ytrain)
    predicted_delay = zero.predict(xtest)
    all_predicted_delay_zero += predicted_delay
    all_actual_delay_zero += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_zero, all_actual_delay_zero)
error_rate = 1- accuracy

print("Zero R: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Zero R: accuracy = 0.614735226400614, error rate = 0.38526477359938605


In [5]:
print('''===========================================
Confusion Matrices
===========================================
Zero R (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_zero, all_predicted_delay_zero, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Zero R (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                     0           1506     1506                  0
Not delayed                          0           2403     2403                100


### Random Classifier

In [6]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_random = []
all_actual_delay_random = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    random = MyRandomClassifier()
    random.fit(xtrain, ytrain)
    predicted_delay = random.predict(xtest)
    all_predicted_delay_random += predicted_delay
    all_actual_delay_random += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_random, all_actual_delay_random)
error_rate = 1- accuracy

print("Random: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Random: accuracy = 0.5154771041187004, error rate = 0.4845228958812996


In [7]:
print('''===========================================
Confusion Matrices
===========================================
Random (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_random, all_predicted_delay_random, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Random (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   573            933     1506              38.05
Not delayed                        961           1442     2403              60.01


## Compute Actual Classifiers
Next we try running naive bayes, decision tree, and random forest classifiers over our database and see how they compare.

### Naive Bayes

In [8]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_bayes = []
all_actual_delay_bayes = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    bayes = MyNaiveBayesClassifier()
    bayes.fit(xtrain, ytrain)
    predicted_delay = bayes.predict(xtest)
    all_predicted_delay_bayes += predicted_delay
    all_actual_delay_bayes += ytest
    
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_bayes, all_actual_delay_bayes)
error_rate = 1- accuracy

print("Naive bayes: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Naive bayes: accuracy = 0.6144794064978255, error rate = 0.3855205935021745


In [9]:
print('''===========================================
Confusion Matrices
===========================================
Naive bayes (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_bayes, all_predicted_delay_bayes, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Naive bayes (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   221           1285     1506              14.67
Not delayed                        222           2181     2403              90.76


Naive Bayes had comparable accuracy to the Zero R classifier, as well as better delayed/canceled recognition. It also has better accuracy, and worse recognition than the random classifier.

### Decision Tree

In [63]:
print('''===========================================
Predictive Accuracy
===========================================
Stratified 10-Fold Cross Validation''')
k = 10
all_predicted_delay_tree = []
all_actual_delay_tree = []

# Get training data
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, Y_train, k)
for i in range(k):
    # Sort training data
    xtrain = myutils.distribute_data_by_index(X_train, train_folds[i])
    ytrain = myutils.distribute_data_by_index(Y_train, train_folds[i])
    xtest = myutils.distribute_data_by_index(X_train, test_folds[i])
    ytest = myutils.distribute_data_by_index(Y_train, test_folds[i])

    # Compute prediction and convert
    tree = MyDecisionTreeClassifier()
    tree.fit(xtrain, ytrain, ['RHISPANIC', 'RRACE', 'TBIRTH_YEAR', 'INCOME', 'label'])
    predicted_delay = tree.predict(xtest)
    all_predicted_delay_tree += predicted_delay
    all_actual_delay_tree += ytest
# Calculate overall accuracy
accuracy = myutils.calculate_accuracy(all_predicted_delay_tree, all_actual_delay_tree)
error_rate = 1- accuracy

print("Decision Tree: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))

Predictive Accuracy
Stratified 10-Fold Cross Validation
Decision Tree: accuracy = 0.6091071885392684, error rate = 0.3908928114607316


In [64]:
print('''===========================================
Confusion Matrices
===========================================
Decision Tree (Stratified 10-Fold Cross Validation):''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Confusion Matrices
Decision Tree (Stratified 10-Fold Cross Validation):
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   157           1349     1506              10.42
Not delayed                        179           2224     2403              92.55


The decision tree, as you can see, did not have very good performance. It was out performed slightly by the zero R and Naive Bayes classifiers in terms of accuracy only had the benefit of better delayed/canceled recognition. The best accuracy results were found using hispanic, race, birth year, and income. Adding gender and education decreased accuracy and increased delayed/canceled recognition. With the current structure it has almost identical performance to the Naive Bayes classifier, if slightly less good. In accuracy it still outperformed the random classifier, though did worse in delayed/canceled recognition.

In [75]:
# tree = MyDecisionTreeClassifier()
# tree.fit(X_train, Y_train)
# tree.print_decision_rules(attribute_names=["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME"], class_name="delayed/canceled (1 = yes)")

### Random Forest 
With the forest classifier we had to use a different method of evaluation due to the randomness that it featured. To do this we followed the random forest procedure described in class (located at https://github.com/GonzagaCPSC322/U7-Ensemble-Learning/blob/master/A%20Ensemble%20Learning.ipynb). We computed a stratified k fold cross validation with k=3. Then we selected one of the folds as our validation set, and the rest as our training set. We trained the forest over the training set, then tested it against the validation set. This was done 5 times to attempt to minimize the effects of randomness in evaluating the classifier. We then computed the overall accuracy and confusion matrixes, and continued as normal in our evaluation. 

In [34]:
N = 10
M = 5
F = 1

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a1'], validation accuracy: 0.6355329949238578
Tree attributes: ['a3'], validation accuracy: 0.6347177848775293
Tree attributes: ['a4'], validation accuracy: 0.6302083333333334
Tree attributes: ['a0'], validation accuracy: 0.6242171189979123
Tree attributes: ['a2'], validation accuracy: 0.6216494845360825
Tree attributes: ['a1'], validation accuracy: 0.6223132036847492
Tree attributes: ['a3'], validation accuracy: 0.6179196704428425
Tree attributes: ['a5'], validation accuracy: 0.6164102564102564
Tree attributes: ['a0'], validation accuracy: 0.6149068322981367
Tree attributes: ['a1'], validation accuracy: 0.6148936170212767
Tree attributes: ['a3'], validation accuracy: 0.6292016806722689
Tree attributes: ['a5'], validation accuracy: 0.6284544524053224
Tree attributes: ['a0'], validation accuracy: 0.6239583333333333
Tree attributes: ['a2'], validation accuracy: 0.6226012793176973
Tree attributes: ['a4'], validation accuracy

In [35]:
N = 10
M = 5
F = 4

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a1', 'a2', 'a0', 'a4'], validation accuracy: 0.6197478991596639
Tree attributes: ['a1', 'a3', 'a2', 'a0'], validation accuracy: 0.6170678336980306
Tree attributes: ['a3', 'a1', 'a2', 'a4'], validation accuracy: 0.6162162162162163
Tree attributes: ['a0', 'a3', 'a2', 'a4'], validation accuracy: 0.6078431372549019
Tree attributes: ['a4', 'a5', 'a3', 'a0'], validation accuracy: 0.6058091286307054
Tree attributes: ['a5', 'a2', 'a4', 'a1'], validation accuracy: 0.6201716738197425
Tree attributes: ['a1', 'a4', 'a5', 'a0'], validation accuracy: 0.6127049180327869
Tree attributes: ['a5', 'a1', 'a0', 'a4'], validation accuracy: 0.6123711340206186
Tree attributes: ['a1', 'a3', 'a2', 'a4'], validation accuracy: 0.6086508753861998
Tree attributes: ['a3', 'a4', 'a0', 'a5'], validation accuracy: 0.6086021505376344
Tree attributes: ['a1', 'a2', 'a3', 'a4'], validation accuracy: 0.625
Tree attributes: ['a3', 'a4', 'a5', 'a1'], validation 

Increasing F seems to have only decreased the accuracy of the classifier.

In [38]:
N = 20
M = 5
F = 2

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a4', 'a1'], validation accuracy: 0.6466165413533834
Tree attributes: ['a1', 'a3'], validation accuracy: 0.6236786469344608
Tree attributes: ['a2', 'a5'], validation accuracy: 0.6220806794055201
Tree attributes: ['a5', 'a1'], validation accuracy: 0.6194594594594595
Tree attributes: ['a0', 'a3'], validation accuracy: 0.6194503171247357
Tree attributes: ['a0', 'a3'], validation accuracy: 0.6274921301154249
Tree attributes: ['a1', 'a3'], validation accuracy: 0.6257995735607675
Tree attributes: ['a0', 'a4'], validation accuracy: 0.6175548589341693
Tree attributes: ['a1', 'a3'], validation accuracy: 0.6173733195449845
Tree attributes: ['a4', 'a0'], validation accuracy: 0.6161934805467929
Tree attributes: ['a5', 'a3'], validation accuracy: 0.6307363927427961
Tree attributes: ['a0', 'a4'], validation accuracy: 0.6218487394957983
Tree attributes: ['a2', 'a3'], validation accuracy: 0.6186094069529653
Tree attributes: ['a2', 'a4'], 

Increasing M seems to have marginally increased the accuracy, but not significantly

In [37]:
N = 20
M = 2
F = 1

print('''===========================================
Predictive Accuracy
===========================================''')
print("Individual Tree Accuracy:")
all_predicted_forest = []
all_actual_forest = []
# Run tests of each parameter 5 times
for i in range(5):
    forest = MyRandomForestClassifier(N, F, M)
    accuracy, predicted, actual = forest.test_tree_stratified_kfold(X_train, Y_train)
    all_predicted_forest += predicted
    all_actual_forest += actual
    
    # Print off each trees individual validation acccuracy
    for tree in forest.chosen_trees:
        print("Tree attributes: " + str(tree['attributes']) + ", validation accuracy: " + str(tree['accuracy']))
    
accuracy = myutils.calculate_accuracy(all_predicted_forest, all_actual_forest)
error_rate = 1- accuracy
print('''===========================================
Predictive Accuracy
===========================================
Stratified 3-Fold Cross Validation''')
print("Forest: accuracy = " + str(accuracy) + ", error rate = " + str(error_rate))
print('''Confusion Matrices:''')

ylabels = list(set(Y_train))
matrix = myevaluation.confusion_matrix(all_actual_delay_tree, all_predicted_delay_tree, ylabels)
header = myutils.format_confusion_matrix_into_table(matrix, ["Delayed/canceled", "Not delayed"], "Delayed/canceled")

print(tabulate(matrix, headers=header, tablefmt="rst", numalign="right"))

Predictive Accuracy
Individual Tree Accuracy:
Tree attributes: ['a2'], validation accuracy: 0.6240837696335079
Tree attributes: ['a0'], validation accuracy: 0.6217616580310881
Tree attributes: ['a3'], validation accuracy: 0.6426332288401254
Tree attributes: ['a1'], validation accuracy: 0.6358921161825726
Tree attributes: ['a5'], validation accuracy: 0.6370217166494312
Tree attributes: ['a3'], validation accuracy: 0.6325878594249201
Tree attributes: ['a2'], validation accuracy: 0.6551362683438156
Tree attributes: ['a3'], validation accuracy: 0.6337148803329865
Tree attributes: ['a2'], validation accuracy: 0.6432318992654774
Tree attributes: ['a1'], validation accuracy: 0.6371220020855057
Predictive Accuracy
Stratified 3-Fold Cross Validation
Forest: accuracy = 0.614735226400614, error rate = 0.38526477359938605
Confusion Matrices:
Delayed/canceled      Delayed/canceled    Not delayed    Total    Recognition (%)
Delayed/canceled                   333           1173     1506              

Decreasing the M value also did not noticeably modify the results.

As such the best combination seems to have F=1, with M and N not having as much of an effect.

## Best Classifier
The forest classifier accuracy outperformed all the other classifiers. In terms of accuracy it beat Naive Bayes, Decision Tree, and Random classifiers. It was on par with the zero R classifier, but had a better Delayed/canceled recognition rate. It didn't have as good a delayed/canceled recognition rate as random did, but it was significantly more accurate, making it still the better choice. As such it was the best. 

That said it was not a particularly good classifier. It only managed to be on par with the zero R classifier in terms of accuracy, and only counted as a better algorithm because its delayed/canceled recognition rate was superior to zero R's. More work would be required for it to be utilized in an important role.

### Heroku


# Conclusion

Overall the project was moderately successful. We started with a dataset with a number of demographic and socioeconomic attributes, and whether or not individuals suffered delayed or canceled care due to COVID-19. Ounce we found the correct dataset, we were able to relatively easily use part of that data for classification. We approached our classifiers by creating two baseline ones, then testing three others against those baselines to determine how effective they were. Our best classifier, the random forest, had comparable accuracy to the first baseline classifier with better recognition rates, and outperformed the second baseline classifier by a notable amount in accuracy. However, these results were not ideal, and future development would be needed before the classifier could be useful. 

Future work might try predicting based on other attributes that were included in the general dataset. Employment status, other health related information, housing status, or any number of elements may have a stronger correlation. Atlernatively, the correlations of some attributes may be greater in certain areas, such as race mattering in some states more than others. Finally, it should be noted that the current classifier utilized only about 3000 of the ~70,000 data points available in one week of a study that has been going along for at least 27 weeks due to runtime issues. Utilizing more computer power and optimized algorithms to process all of the data might reveal that the current snapshot was not representative of the whole, and should definitely be investigated.