# COVID-19's Impact on Healthcare Accessibility
By: Tristan Call and Maria Elena Aviles-Baquero  
CPSC 322, Spring 2021  

# Introduction

## Our Dataset:
This section must briefly describe the dataset you used and the classification task you implemented (e.g., what were you trying to classify in the dataset).

We used week 21 of the Household Pulse Survey Public Use File, which covers the period December 9 – December 21, 2020.

## Our Findings:
You should also briefly describe your findings (e.g., what classifier approach performed the best).  
Overall, we found that the random forest classifier performed best.


# Data Analysis

## Dataset Information
Information about the dataset itself, e.g., the attributes and attribute types, the number of instances, and the attribute being used as the label.


## Loading in the data

In [5]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# requires mypytable.py to be present in the mysklearn package
import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.plot_utils
importlib.reload(mysklearn.plot_utils)
import mysklearn.plot_utils as plot_utils

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import os
import pandas as pd
from tabulate import tabulate

### Manipulate Data into a Usable Format
The first step is to read the data from the SAS file and reduce it to a format and size that our far-from-optimized code can handle. Part of this involves dropping, ahead of time, any rows that contain NaNs or -99s (questions that were seen but not answered). Overall, we aim to go from about 70,000 responses to fewer than 10,000 so that our computers can process the data in a reasonable amount of time.

In [6]:
# Grab the data
week21_filename = os.path.join("input_data", "pulse2020_puf_21.sas7bdat")
iterator = pd.read_sas(week21_filename, chunksize=5000)
alldata = []
for chunk in iterator:
    alldata.append(chunk)

relevant_attributes = ["TBIRTH_YEAR", "EGENDER", "RHISPANIC", "RRACE", "EEDUC", "INCOME", "DELAY", "NOTGET"]

# Take the first chunk with only the attributes we are interested in, drop NaNs, and save to a local file
data = alldata[0][relevant_attributes]
working_data_filename = os.path.join("input_data", "week21_working.csv")
nafree_data = data.dropna()

# Get rid of -99 results (aka seen but not answered)
nafree_data = nafree_data[nafree_data.INCOME != -99]
nafree_data = nafree_data[nafree_data.DELAY != -99]
nafree_data = nafree_data[nafree_data.NOTGET != -99]
print(nafree_data)

# Save to file
nafree_data.to_csv(working_data_filename)

      TBIRTH_YEAR  EGENDER  RHISPANIC  RRACE  EEDUC  INCOME  DELAY  NOTGET
1          1969.0      2.0        1.0    1.0    7.0     6.0    1.0     2.0
2          1959.0      2.0        1.0    1.0    7.0     4.0    1.0     1.0
4          1967.0      1.0        1.0    1.0    4.0     6.0    2.0     2.0
5          1965.0      1.0        1.0    1.0    7.0     6.0    2.0     2.0
6          1962.0      2.0        1.0    2.0    4.0     1.0    2.0     2.0
...           ...      ...        ...    ...    ...     ...    ...     ...
4993       1964.0      2.0        1.0    1.0    4.0     1.0    2.0     2.0
4994       1984.0      1.0        1.0    1.0    4.0     7.0    1.0     1.0
4995       1973.0      1.0        1.0    1.0    6.0     8.0    2.0     2.0
4997       1976.0      2.0        1.0    1.0    3.0     3.0    1.0     1.0
4999       1958.0      2.0        1.0    1.0    7.0     5.0    2.0     2.0

[3909 rows x 8 columns]


### Organize the data
Next, we want to get the data into a more useful format. The first step is to group birth years into decades, so that the year attribute has a reasonable number of values, according to the table below:

years | label
-|-
1932-1941 | 1
1942-1951 | 2
1952-1961 | 3
1962-1971 | 4
1972-1981 | 5
1982-1991 | 6
1992-2002 | 7
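
The binning in the table above can be sketched as follows. This is a minimal illustration, not the project's `myutils.categorize_continuous_list`, whose implementation is not shown here; the function name `categorize_by_cutoffs` is hypothetical, and the sketch assumes that each bin includes its lower cutoff and that the last bin also includes its upper cutoff (so 2002 falls in bin 7).

```python
from bisect import bisect_right


def categorize_by_cutoffs(values, cutoffs, labels):
    """Map each value to the label of the bin it falls in.

    Assumption: bin i covers [cutoffs[i], cutoffs[i + 1]), except the last
    bin, which also includes its upper cutoff.
    """
    if len(cutoffs) != len(labels) + 1:
        raise ValueError("need exactly one more cutoff than labels")
    result = []
    for value in values:
        if value < cutoffs[0] or value > cutoffs[-1]:
            raise ValueError(f"{value} is outside the cutoff range")
        # bisect_right counts how many cutoffs are <= value
        index = min(bisect_right(cutoffs, value) - 1, len(labels) - 1)
        result.append(labels[index])
    return result


# Same cutoffs and labels as in the table: 1932, 1942, ..., 2002 -> 1..7
cutoffs = [1932 + 10 * x for x in range(8)]
labels = [x + 1 for x in range(7)]
print(categorize_by_cutoffs([1932, 1941, 1969, 1992, 2002], cutoffs, labels))
# prints [1, 1, 4, 7, 7]
```

Using a sorted cutoff list with a binary search keeps the lookup independent of the number of bins, which matters little for seven decades but makes the boundary rule explicit.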

Next, we create a DELAYNOTGET column as a composite of DELAY and NOTGET so that we can examine both attributes at once. DELAYNOTGET is 1 if either DELAY or NOTGET is 1, and 2 otherwise.

In [7]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# Load the data into a mypytable for future analysis
overall_table = MyPyTable()
overall_table.load_from_file(working_data_filename)
overall_table.convert_to_numeric()

# Convert year into bigger categorical chunks
year_col = overall_table.get_column("TBIRTH_YEAR")
year_label = [x + 1 for x in range(7)]
cutoffs = [1932 + 10 * x for x in range(8)]
year_col = myutils.categorize_continuous_list(year_col, cutoffs, year_label)

# Create DELAYNOTGET column
delay = overall_table.get_column("DELAY")
notget = overall_table.get_column("NOTGET")
delaynotget = []
for i in range(len(delay)):
    if delay[i] == 1 or notget[i] == 1:
        delaynotget.append(1)
    else:
        delaynotget.append(2)
        
# Combine all the above into the overall_table
overall_table.column_names.append("DELAYNOTGET")
overall_table.data = [[overall_table.data[i][0]] + [year_col[i]] + overall_table.data[i][2:] + [delaynotget[i]] for i in range(len(year_col))]
#overall_table.pretty_print()

## Summary Statistics
Relevant summary statistics about the dataset.

## Data Visualizations
Data visualizations highlighting important/interesting aspects of your dataset. Visualizations may include frequency distributions, comparisons of attributes (scatterplot, multiple frequency diagrams), box and whisker plots, etc. The goal is not to include all possible diagrams, but instead to select and highlight diagrams that provide insight about the dataset itself.
Note that this section must describe the above (in paragraph form) and not just provide diagrams and statistics. Also, each figure included must have a figure caption (Figure number and textual description) that is referenced from the text (e.g., “Figure 2 shows a frequency diagram for ...”).


# Classification
Classification Results: This section should describe the classification approach you developed and its performance. Explain what techniques you used, briefly how you designed and implemented the classifiers, how you evaluated your classifiers’ predictive ability, and how well the classifiers performed. Thoroughly describe how you evaluated performance, the comparison results, and which classifier is “best”. Include a link to a Heroku web app with this “best” classifier deployed with an API interface.


## Our Classifiers
For this project, we used a Naive Bayes classifier, a decision tree classifier, and a random forest (ensemble of trees) classifier.

All of these were loosely based on sklearn's DecisionTreeClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html  
Specifically, the inputs and outputs were designed to conform to sklearn's inputs and outputs, while the internals of each algorithm were implemented by the authors.
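
To illustrate the input/output convention described above, here is a minimal sketch of a classifier that follows sklearn's `fit`/`predict` pattern. The class `MajorityClassClassifier` is a hypothetical baseline written for this example; it is not one of the project's classifiers and only shows the shape of the interface.

```python
from collections import Counter


class MajorityClassClassifier:
    """Hypothetical baseline: always predicts the most common training label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, X_train, y_train):
        # X_train: list of feature rows; y_train: parallel list of labels
        self.majority_label = Counter(y_train).most_common(1)[0][0]
        return self  # returning self mirrors sklearn's estimators

    def predict(self, X_test):
        if self.majority_label is None:
            raise ValueError("call fit() before predict()")
        # One prediction per test row, returned as a list
        return [self.majority_label for _ in X_test]


clf = MajorityClassClassifier().fit([[1], [2], [3]], [2, 2, 1])
print(clf.predict([[5], [6]]))  # prints [2, 2]
```

Keeping every classifier behind the same two methods means the evaluation code can treat them interchangeably.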

## Evaluation
To determine which classifier was best, we evaluated each one with stratified k-fold cross-validation and measured its accuracy. We then tabulated the predictions in a confusion matrix to see whether a classifier was better at predicting one class than the other.
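
The evaluation procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's `myevaluation` module: the function names `stratified_kfold_indices`, `confusion_matrix`, and `accuracy` are hypothetical, no shuffling is done, and the labels in the usage lines are toy values rather than survey results.

```python
from collections import defaultdict


def stratified_kfold_indices(y, n_splits=5):
    """Return (train_indices, test_indices) pairs whose test folds keep the
    class proportions of y as close to the full data set as possible."""
    # Group row indices by class label
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)
    # Deal the indices of each class round-robin into the folds
    folds = [[] for _ in range(n_splits)]
    position = 0
    for label in sorted(by_label):
        for index in by_label[label]:
            folds[position % n_splits].append(index)
            position += 1
    # Each fold serves once as the test set; the rest form the training set
    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels only: four rows of class 1 and eight rows of class 2
toy_y = [1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2]
for train, test in stratified_kfold_indices(toy_y, n_splits=4):
    print(test, [toy_y[i] for i in test])

toy_matrix = confusion_matrix([1, 1, 2, 2, 2], [1, 2, 2, 2, 1], labels=[1, 2])
print(toy_matrix, accuracy(toy_matrix))  # prints [[1, 1], [1, 2]] 0.6
```

In a full run, the predictions from every test fold would be accumulated into a single confusion matrix per classifier, so that accuracy and per-class behavior are compared on the same rows.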

In [9]:
# Put classifier code

## Results
Do result stuff + decide which is best  
Add Heroku app

# Conclusion
Conclusion: Provide a brief conclusion of your project, including a short summary of the dataset you used (and any of its inherent challenges for classification), the classification approach you developed, your classifiers' performance, and any ideas you have on ways to improve their performance.
