# SVI & COVID - SAN ANTONIO

## Summary

The CDC's social vulnerability index (SVI) is a scale that predicts the vulnerability of a population in the event of an emergency or natural disaster. COVID is the first global pandemic since the development of this measure. We will evaluate the association between SVI score and COVID case count in San Antonio, Texas. Features from this measure will be incorporated into a predictive model that can be used to guide recovery resource prioritization.



**Goals**      
1. Evaluate the association between SVI score and COVID case count in San Antonio, TX
2. Build a model based on SVI score component features that can predict COVID cases by census tract within San Antonio, TX
3. Complete the same evaluation and models for Dallas
4. Determine whether the San Antonio and Dallas results differ


### Problem Statement

Background
The SVI (Social Vulnerability Index) was developed to help city governments and first responders identify areas that are particularly vulnerable in emergency situations so that resources can be prioritized for areas at high risk (Citation CDC Website). The CDC’s Social Vulnerability Index (CDC SVI) uses 15 U.S. census variables to classify census tracts with a composite score between 0 and 1 (lower scores = less vulnerability, higher scores = greater vulnerability). This score is calculated by first ranking every census tract, in every county, in every state in the United States. The ranked variables are then grouped into 4 themes (socioeconomic status, household composition and disability, minority status and language, housing type and transportation) and each theme is ranked again. The overall score is then tallied by summing the themed percentiles and ranking that total on a scale between 0 and 1.

While the SVI was designed to help city governments respond to emergency situations, its efficacy has never been tested in response to a global pandemic. COVID-19 is the disease caused by a new coronavirus called SARS-CoV-2. WHO first learned of this new virus on 31 December 2019, following a report of a cluster of cases of ‘viral pneumonia’ in Wuhan, People’s Republic of China (Citation WHO). As of 9 December 2020, more than 68.4 million cases have been confirmed, with more than 1.56 million deaths attributed to COVID-19.




### Work Plan

*Now that you have a clear idea of what you are trying to accomplish, you need to find a data source. Think of this as working through a maze forwards and backwards at the same time. At the start you have any number of data sets available to work with (and the whole Internet to search and scrape), and at the end is the hypothetical data set that would answer your question immediately if you had it.*

- What data, if you had it, would solve your problem right away?
    - Case counts per 100K by census tract nationally would be ideal
- What data do you have access to?
    - SVI from the CDC by census tract
    - Bexar County case counts per 100K by zip code for a single date, 12/8/20
- What additional data would be good to have?
    - More historical data would be valuable
- What data would be impossible to collect?
    - Due to privacy/panic concerns, case data at the census tract level is not being published for open access
- What are the best proxies you can find for unavailable or impossible data?
    - find a way to translate from census tract to zip code
- What are the legal or ethical issues you might run into if you were to try to collect all of the types of data you would like to work with?
    - considered web scraping for zip code level data where specific information is not available
    - likely better to hard code existing data that is publicly available

*Of course you need to make a plan to turn your data into the solutions you need. Think of what type of problem this is, what models are commonly used for those types of problems, what types of data those models require and special considerations that may need to be made. Do your homework and find out what approaches other people are using on similar tasks.*

- What ML paradigm are you working in? (Classification, Regression, Clustering, etc)
    - Regression
- What models are commonly used in this task?
    - Linear Regression, LassoLars, Polynomial Features, Tweedie Regressor
* What other solutions are being tried in this field?
* What special considerations need to be taken when dealing with these models? (i.e., imbalanced classes, text preprocessing, data leakage, etc)
* How will you know that your models work?
* How will you recognize and diagnose cases where the predictions are incorrect?

## Imports


In [1]:
import pandas as pd
import seaborn as sns

from scripts_python import acquire
# from .scripts_python import explore
# from .scripts_python import model_MAE

import matplotlib.pyplot as plt
import numpy as np


# from sklearn.metrics import mean_squared_error
# from sklearn.linear_model import LinearRegression
# from statsmodels.formula.api import ols
# from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
# from sklearn.feature_selection import f_regression, SelectKBest, RFE 
# from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
# from sklearn.preprocessing import PolynomialFeatures

# # Classification Modeling:
# from sklearn.linear_model import LogisticRegression, LinearRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
# import model_classification

# from math import sqrt
from scipy import stats  # needed below for stats.levene

In [2]:
df = acquire.get_san_antonio_data()

In [3]:
df.head()

| OBJECTID | zip | population | positive | casesp100000 | zipint | activecases | activecaserate | shape_length | shape_area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 78002 | 9061 | 378 | 4171.724975 | 78002 | 24 | 264.871427 | 0.427542 | 0.009546 |
| 2 | 78006 | 5243 | 96 | 1831.012779 | 78006 | 7 | 133.511348 | 0.552725 | 0.005416 |
| 4 | 78015 | 12254 | 192 | 1566.835319 | 78015 | 16 | 130.56961 | 0.278955 | 0.002312 |
| 5 | 78023 | 29569 | 642 | 2171.192803 | 78023 | 49 | 165.714092 | 0.886455 | 0.017922 |
| 7 | 78052 | 699 | 28 | 4005.722461 | 78052 | 1 | 143.061516 | 0.260085 | 0.001147 |


# Acquire & Prepare

- SVI data from the CDC's website
- COVID data for San Antonio and Dallas downloaded from the cities' respective COVID data web portals
- HUD crosswalk provided a guide to transform the data
    - complicated because of overlap in census tracts and zip codes
    - found the Zip code that accounted for the highest percentage of addresses within the tract and assigned that as the sole Zip code for the tract
    - the ratio of addresses for the census tract was then used to calculate a cases per 100K measure for each tract 
- selected 29 features and renamed for clarity
- verified no null values to address
- military bases were removed from the data frame
- raw SVI score binned
    - bin_svi column = text label (low, low-mod, mod-high, high)
    - rank_svi column = numeric representation of SVI (1 representing a high score, 4 representing a low score)
- numeric columns with values greater than 4 were scaled using sklearn's MinMaxScaler
- six data frames are returned at the end of wrangle, including *train_explore* for exploration and individual scaled data frames for modeling (*X_train_scaled, y_train, X_test_scaled, y_test*).
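
The tract-to-zip assignment described above can be sketched as follows. This is a minimal illustration, not the project's `wrangle` code: the column names (`tract`, `zip`, `res_ratio`) and all values are assumptions, and carrying the dominant zip code's rate over to the tract is only one possible reading of the address-ratio step.

```python
import pandas as pd

# Hypothetical crosswalk: share of each tract's residential addresses that fall in each zip code
crosswalk = pd.DataFrame({
    "tract": ["A", "A", "B", "B", "B"],
    "zip": [78002, 78006, 78006, 78015, 78023],
    "res_ratio": [0.7, 0.3, 0.2, 0.5, 0.3],
})

# Hypothetical zip-level case rates (values rounded from the sample rows shown earlier)
zip_cases = pd.DataFrame({
    "zip": [78002, 78006, 78015, 78023],
    "casesp100000": [4171.7, 1831.0, 1566.8, 2171.2],
})

# For each tract, keep the row with the largest address share (the "dominant" zip code)
dominant = crosswalk.loc[crosswalk.groupby("tract")["res_ratio"].idxmax()]

# Attach the dominant zip code's case rate to the tract
tract_cases = dominant.merge(zip_cases, on="zip", how="left")
print(tract_cases)
```

Assigning a single zip code per tract keeps the join simple, at the cost of ignoring the addresses that fall in the tract's other zip codes.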

In [None]:
df, train_exp, X_train_scaled, y_train, X_test_scaled, y_test = wrangle.wrangle_data()
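
As a rough sketch of the binning and scaling steps listed above (not the actual `wrangle.wrangle_data` implementation), the example below bins a raw SVI score into four range categories and scales one numeric column. The quartile cut points, the sample values, and the `e_pov_scaled` column name are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test rows
train = pd.DataFrame({"raw_svi": [0.10, 0.30, 0.60, 0.90], "e_pov": [120, 450, 900, 1500]})
test = pd.DataFrame({"raw_svi": [0.45], "e_pov": [600]})

# Bin the raw score into four range categories (cut points assumed to be quartiles of the 0-1 scale)
labels = ["Low", "Low Moderate", "Moderate High", "High"]
train["bin_svi"] = pd.cut(train["raw_svi"], bins=[0, 0.25, 0.5, 0.75, 1],
                          labels=labels, include_lowest=True)

# Numeric rank: 1 represents a high score, 4 a low score
rank_map = {"High": 1, "Moderate High": 2, "Low Moderate": 3, "Low": 4}
train["rank_svi"] = train["bin_svi"].astype(str).map(rank_map)

# Fit the scaler on train only, then apply the same transformation to test
scaler = MinMaxScaler()
train["e_pov_scaled"] = scaler.fit_transform(train[["e_pov"]]).ravel()
test["e_pov_scaled"] = scaler.transform(test[["e_pov"]]).ravel()
```

Fitting the scaler on the training rows only avoids leaking information from the test set into the features.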

# Explore

Exploration focuses on answering questions regarding the relationship between the CDC's range category SVI score and cases of COVID-19 per 100k.

- Visualize cases per 100K by binned SVI value
    - Appear to be distinct
    - Will conduct an ANOVA test (or the non-parametric Kruskal-Wallis test if ANOVA's assumptions are not met) to confirm
- Verify raw SVI score relationship to cases per 100K
    - will conduct Pearson's R correlation test
- Explore distribution of cases per 100K with SVI score
- Explore distribution of flags by SVI score


## Hypothesis Testing

### Question One: Is there a correlation between the CDC's Range Category SVI Score and COVID-19 Infection Cases per 100k Individuals?

In [None]:
explore.sns_boxplot(train_exp)

**Takeaway:**
`There appears to be a correlation between COVID-19 Count and SVI Category. Next step is Hypothesis testing between categories to validate statistical significance`

In [None]:
# Mean COVID-19 Count By CDC's SVI Category
All = round(train_exp.tract_cases_per_100k.mean(),5)
low = round((train_exp[train_exp.bin_svi == 'Low']).tract_cases_per_100k.mean(),5)
low_mod = round((train_exp[train_exp.bin_svi == 'Low Moderate']).tract_cases_per_100k.mean(),6)
mod_high = round((train_exp[train_exp.bin_svi == 'Moderate High']).tract_cases_per_100k.mean(),6)
high = round((train_exp[train_exp.bin_svi== 'High']).tract_cases_per_100k.mean(),6)

print(f'The average number of cases per 100k for all CDC SVI Range Categories is {All}') 
print(f'The average number of cases per 100k for CDC SVI Range Category (low) is {low}')
print(f'The average number of cases per 100k for CDC SVI Range Category (low_mod) is {low_mod}')
print(f'The average number of cases per 100k for CDC SVI Range Category (mod_high) is {mod_high}')
print(f'The average number of cases per 100k for CDC SVI Range Category (high) is {high}')

In [None]:
low = (train_exp[train_exp.bin_svi == 'Low']).tract_cases_per_100k
low_mod = (train_exp[train_exp.bin_svi == 'Low Moderate']).tract_cases_per_100k
mod_high = (train_exp[train_exp.bin_svi == 'Moderate High']).tract_cases_per_100k
high = (train_exp[train_exp.bin_svi== 'High']).tract_cases_per_100k

#### Variance Test

In [None]:
stats.levene(low, low_mod, mod_high, high)

In [None]:
alpha = 0.01
null = "Average number of COVID-19 cases per 100k is the same across all CDC SVI Range Categories "
alternate = "Average number of COVID-19 cases per 100k is significantly different across all CDC SVI Range Categories "
explore.anova_test(low, low_mod, mod_high, high, null, alternate, alpha)

**Takeaway:**
`At alpha = 0.01 we reject the null hypothesis: the average number of cases per 100k differs significantly across the CDC SVI Range Categories (at least one category differs from the others).`
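
The internals of `explore.anova_test` are not shown in this notebook, so the following is only a sketch of how the variance check and the group comparison might fit together using SciPy. The four groups are synthetic stand-ins, and the rule of falling back to Kruskal-Wallis when Levene's test rejects equal variances is an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for cases per 100k in the four SVI bins (assumed values)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=300, size=40) for mean in (1500, 2000, 2500, 3000)]

alpha = 0.01

# Levene's test: the null hypothesis is that all groups share the same variance
_, p_levene = stats.levene(*groups)

if p_levene > alpha:
    # No evidence of unequal variances, so the parametric one-way ANOVA is reasonable
    test_name = "ANOVA"
    _, p_value = stats.f_oneway(*groups)
else:
    # Variances differ, so use the rank-based Kruskal-Wallis test instead
    test_name = "Kruskal-Wallis"
    _, p_value = stats.kruskal(*groups)

reject_null = p_value < alpha
print(f"{test_name}: p = {p_value:.3g}, reject null: {reject_null}")
```

Rejecting the null here only says that at least one group differs; a post-hoc pairwise test would be needed to say which ones.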

### Question Two: Is there a correlation between raw_svi and cases per 100k?

In [None]:
raw_svi = train_exp.raw_svi
cases_per_100k = train_exp.tract_cases_per_100k
alpha = 0.01
null = "There is no statistically significant linear correlation between raw_svi and cases per 100K"
alternate = "There is a statistically significant linear correlation between raw_svi and cases per 100K"
explore.pearson(raw_svi, cases_per_100k, null, alternate, alpha)

**Takeaway:**
`We can state with 99% certainty that there is not a statistically significant difference between the Social Vulnerability Index and Census Tract cases per 100,000 people.`
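
For reference, a Pearson correlation test can be run directly with SciPy as sketched below. The helper `explore.pearson` may differ, and the data here is synthetic, so the printed result says nothing about the San Antonio tracts.

```python
import numpy as np
from scipy import stats

# Synthetic data with a built-in positive relationship (assumed, for illustration only)
rng = np.random.default_rng(1)
raw_svi = rng.uniform(0, 1, size=100)
cases_per_100k = 1500 + 2000 * raw_svi + rng.normal(0, 300, size=100)

alpha = 0.01

# r is the correlation coefficient; p_value tests the null hypothesis that r is zero
r, p_value = stats.pearsonr(raw_svi, cases_per_100k)
reject_null = p_value < alpha
print(f"r = {r:.3f}, p = {p_value:.3g}, reject null: {reject_null}")
```

A p-value below alpha means the null hypothesis of no linear correlation is rejected; a p-value above alpha means it is not.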

## Distribution Exploration

In [None]:
explore.joint_plot_index('raw_svi','tract_cases_per_100k', train_exp, 'bin_svi')

In [None]:
explore.my_plotter(train_exp, "all_flags_total", "All SVI Component Flags", "tract_cases_per_100k", "Cases by Tract per 100k")
plt.show()

In [None]:
explore.hist_case(train_exp.tract_cases_per_100k)

# Explore: Feature Engineering (Clustering)

Features that demonstrate potential for Clustering

1. E_POV (Persons below poverty estimate)
2. EP_POV (Percentage of persons below poverty estimate)
3. SPL_THEME1 (Sum of series for Socioeconomic theme)

#### Scatterplot spl_theme1


In [None]:
explore.cluster_scatter(train_exp, 
                'Sum of Series for Socioeconomic Theme by Number of Cases per 100k', 
                'spl_theme1',
                'Sum of Series for Socioeconomic Theme',
                'tract_cases_per_100k',
                'Number of Cases per 100,000')

***

#### Scatterplot e_pov

In [None]:
explore.cluster_scatter(train_exp, 
                'Persons Below Poverty Estimate by Number of Cases per 100k', 
                'e_pov',
                'Persons Below Poverty Estimate',
                'tract_cases_per_100k',
                'Number of Cases per 100,000')

***

#### Scatterplot ep_pov

In [None]:
explore.cluster_scatter(train_exp, 
                'Percentage of Persons Below Poverty Estimate by Number of Cases per 100k', 
                'ep_pov',
                'Percentage of Persons Below Poverty Estimate',
                'tract_cases_per_100k',
                'Number of Cases per 100,000')

***

## Creating a Poverty Cluster

#### Elbow Method to establish k

In [None]:
cluster_vars = ['spl_theme1_scaled', 'ep_pov_scaled', 'e_pov_scaled']
explore.elbow_plot(X_train_scaled, cluster_vars)
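
The elbow method fits k-means for a range of k values and looks for the point where inertia (the within-cluster sum of squares) stops dropping sharply. A minimal sketch on synthetic data, assuming `explore.elbow_plot` does something similar:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic three-feature data with three true clusters (assumed, for illustration only)
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

# Fit k-means for k = 1..7 and record the inertia of each fit
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting these values against k shows the "elbow"; for well-separated data like this, the drop in inertia flattens after the true number of clusters.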

#### Create Clusters

In [None]:
train_clusters, kmeans = explore.run_kmeans(train_exp, X_train_scaled, k=3, cluster_vars=cluster_vars, cluster_col_name = 'poverty_cluster')
test_clusters = explore.kmeans_transform(X_test_scaled, kmeans, cluster_vars, cluster_col_name = 'poverty_cluster')

#### Get Centroids

In [None]:
centroids = explore.get_centroids(cluster_vars, cluster_col_name='poverty_cluster', kmeans= kmeans)
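
The cluster helpers above live in the `explore` module and are not shown here. The sketch below illustrates the general pattern they appear to follow: fit k-means on the training rows, assign test rows with the fitted model, and read the centroids from the model. The data is synthetic and the `centroid_` column prefix is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

cluster_vars = ["spl_theme1_scaled", "ep_pov_scaled", "e_pov_scaled"]

# Synthetic stand-in for the scaled feature frames
points, _ = make_blobs(n_samples=120, centers=3, n_features=3, random_state=0)
data = pd.DataFrame(points, columns=cluster_vars)
train, test = data.iloc[:90].copy(), data.iloc[90:].copy()

# Fit on train only; test rows are assigned to the nearest fitted centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train[cluster_vars])
train["poverty_cluster"] = kmeans.predict(train[cluster_vars])
test["poverty_cluster"] = kmeans.predict(test[cluster_vars])

# One row per cluster, one column per clustering variable
centroids = pd.DataFrame(kmeans.cluster_centers_,
                         columns=[f"centroid_{col}" for col in cluster_vars])
print(centroids)
```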

#### Append Cluster and Join Centroids



In [None]:
train_exp = explore.add_to_train(train_clusters, centroids, train_exp, cluster_col_name = 'poverty_cluster')
X_train_scaled = explore.add_to_train(train_clusters, centroids, X_train_scaled, cluster_col_name = 'poverty_cluster')
X_test_scaled = explore.add_to_train(test_clusters, centroids, X_test_scaled, cluster_col_name = 'poverty_cluster')

In [None]:
X_train_scaled.head(1)

## Are the clusters significant?

In [None]:
explore.sns_boxplot_hypothesis(train_exp.poverty_cluster, 
                       train_exp.tract_cases_per_100k, 
                       "Poverty Clusters", 
                       "Cases per 100k", 
                       "Are The Clusters Significant?")



Hypothesis Testing: (ANOVA/Kruskal)

Is there a statistically significant difference in cases per 100k between the poverty clusters?

Null Hypothesis: Mean # of cases is the same across all clusters

Alternative Hypothesis: Mean # of cases is different across clusters

alpha=0.01


In [None]:
cluster_0 = train_exp[train_exp.poverty_cluster == 0].tract_cases_per_100k
cluster_1 = train_exp[train_exp.poverty_cluster == 1].tract_cases_per_100k
cluster_2 = train_exp[train_exp.poverty_cluster == 2].tract_cases_per_100k

#### Variance Test

In [None]:
stats.levene(cluster_0, cluster_1, cluster_2)

In [None]:
alpha = 0.01
null = "Mean number of cases per 100k is the same across all clusters"
alternate = "Mean number of cases per 100k is significantly different between clusters"
# compare the three poverty clusters (assumes explore.anova_test accepts any number of groups)
explore.anova_test(cluster_0, cluster_1, cluster_2, null, alternate, alpha)

#### Split clusters in to dummy variables for modeling

In [None]:
X_train_scaled = pd.get_dummies(X_train_scaled,
                           columns=["poverty_cluster"])
X_test_scaled = pd.get_dummies(X_test_scaled,
                           columns=["poverty_cluster"])

***

# Model the Data

- The baseline for modeling was chosen after plotting the histogram of COVID-19 cases per 100k.
- The distribution is skewed, which would favor a median baseline, so both the mean and the median are inspected below; the reported baseline MAE uses the mean.
- Cross-validation was used because the dataset is small; its size is limited by the number of census tracts in San Antonio.
- Linear Regression, LassoLars, Tweedie Regressor, Random Forest, and SVR models were each trained on ten feature sets, ranging from all features to the top 4 features identified by RFE.
- Models were compared on MAE (mean absolute error), and the selected feature sets were then run on out-of-sample data (test).
- With the top 4 features, the Tweedie Regressor returned the lowest MAE (767, versus a baseline MAE of 973).
- Overall this is a 21% improvement over the mean baseline MAE.

## Create Baseline

In [None]:
# What is the mean vs median of the target variable?
y_train.tract_cases_per_100k.mean(), y_train.tract_cases_per_100k.median()

In [None]:
# calculate the mean absolute error (MAE) of the baseline using mean
mean_baseMAE, basepred1 = model_MAE.get_baseline_mean(y_train)
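
A baseline MAE is the error of always predicting one constant value. The sketch below uses a tiny made-up target to show why the choice between mean and median matters for skewed data; it is not the `model_MAE.get_baseline_mean` implementation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Small right-skewed target (assumed values)
y = np.array([1.0, 2.0, 3.0, 10.0])

# Predict the same constant for every row and measure the absolute error
mean_mae = mean_absolute_error(y, np.full_like(y, y.mean()))
median_mae = mean_absolute_error(y, np.full_like(y, np.median(y)))

print(f"mean baseline MAE:   {mean_mae}")
print(f"median baseline MAE: {median_mae}")
```

The median is the constant that minimizes absolute error, so a median baseline is never easier to beat than a mean baseline when the metric is MAE.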

## Feature Ranking

- Use recursive feature elimination to evaluate features for modeling

In [None]:
rankdf = model_MAE.feature_ranking(X_train_scaled, y_train)
rankdf
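
Recursive feature elimination repeatedly fits a model and drops the weakest feature until the requested number remains, which yields a full ranking when that number is 1. A minimal sketch with scikit-learn on synthetic data; the estimator choice and column names are assumptions, and `model_MAE.feature_ranking` may be implemented differently.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data with hypothetical feature names
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

# Eliminate one feature at a time until a single feature remains
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)
rfe.fit(X, y)

# ranking_ is 1 for the last surviving feature and larger for features dropped earlier
rankdf = pd.DataFrame({"feature": X.columns, "rank": rfe.ranking_}).sort_values("rank")
print(rankdf)
```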

## Feature Selection

In [None]:
# only raw svi score
X_raw_svi = X_train_scaled[['raw_svi']]
# binned svi score by CDC range category = 1st ranked
X_rank_svi_only = X_train_scaled[['rank_svi_scaled']]
# top 4 ranked features
X_top4 = X_train_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr']]
# only the summary of the flags = 19th ranked
X_all_flags_only = X_train_scaled[['all_flags_total_scaled']]
# only summary flags, should be the same as all flags total? = 5th, 12th, 15th, 21st
X_summary_flags = X_train_scaled[['f_comp_total_scaled', 'f_soci_total_scaled', 'f_status_total_scaled', 'f_trans_total_scaled']]
# all individual flags
X_not_summary_flags = X_train_scaled[['f_nohsdp_soci', 'f_minrty_status', 'f_groupq_trans', 'f_unemp_soci', 
                                     'f_disabl_comp', 'f_noveh_trans', 'f_mobile_trans', 'f_age65_comp', 
                                     'f_age17_comp', 'f_pov_soci', 'f_limeng_status', 'f_crowd_trans', 
                                      'f_pci_soci', 'f_sngpnt_comp', 'f_munit_trans']]
# top 10 by RFE
X_top10 = X_train_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr', 'rank_svi_scaled', 
                          'f_soci_total_scaled', 'f_pov_soci', 'ep_pov_scaled', 'raw_svi', 'f_age17_comp']]
# engineered features only
X_svifeatures = X_train_scaled[['rising', 'falling', 'delta', 'avg3yr', 'r_soci_rise', 'r_comp_rise', 
                                'r_status_rise', 'r_trans_rise', 'r_soci_fall', 'r_comp_fall', 'r_status_fall', 'r_trans_fall']]
X_Rlist = X_train_scaled[['rank_svi_scaled', 'rising', 'falling', 'delta', 'avg3yr']]

## Build Regression Models

- due to limited size of dataset (limited by number of zip codes in San Antonio area) cross validation will be used for the train and validate stages of modeling
- regression models will be used because the target variable is continuous
- models to try: Linear Regression, LassoLars, Tweedie Regressor, Random Forest Regressor, Support Vector Regressor (SVR)

In [None]:
# create variables for loop
df2test = [X_rank_svi_only, X_top4, X_all_flags_only, X_summary_flags, X_not_summary_flags, X_train_scaled, 
           X_raw_svi, X_top10, X_svifeatures, X_Rlist]
target = y_train

In [None]:
# Linear Regression Models
cvlm_MAE_list = []
for df in df2test:
    cvlm_MAE = model_MAE.cvLinearReg(df, target) 
    cvlm_MAE_list.append(cvlm_MAE)
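
The `model_MAE.cv*` helpers are not shown in this notebook. A cross-validated MAE for one model over several feature sets could be computed roughly as below; the fold count, the helper name `cv_mae`, and the synthetic data are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with hypothetical column names
X, y = make_regression(n_samples=150, n_features=5, n_informative=3,
                       noise=15.0, random_state=1)
X = pd.DataFrame(X, columns=list("abcde"))

# Feature sets to compare, mirroring the loop over data frames above
feature_sets = {"all_features": X, "first_two": X[["a", "b"]]}

def cv_mae(model, features, target, folds=5):
    """Return the mean absolute error averaged over the cross-validation folds."""
    scores = cross_val_score(model, features, target, cv=folds,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()  # scikit-learn reports MAE as a negative score

results = {name: cv_mae(LinearRegression(), feats, y)
           for name, feats in feature_sets.items()}
print(results)
```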

In [None]:
# LassoLars Models
cvll_MAE_list = []
for df in df2test:
    cvll_MAE = model_MAE.cvLassoLars(df, target, 1) 
    cvll_MAE_list.append(cvll_MAE)

In [None]:
# Random Forest Models
cvrf_MAE_list = []
for df in df2test:
    cvrf_MAE = model_MAE.cvRandomForest(df, target, 4) 
    cvrf_MAE_list.append(cvrf_MAE)

In [None]:
# Tweedie Regressor Models
cvtw_MAE_list = []
for df in df2test:
    cvtw_MAE = model_MAE.cvTweedie(df, target, 1.25, .25)
    cvtw_MAE_list.append(cvtw_MAE)


In [None]:
# Support Vector Models
cvSVRrbf_MAE_list = []
for df in df2test:
    cvSVRrbf_MAE = model_MAE.cvSVR(df, target, 'rbf')
    cvSVRrbf_MAE_list.append(cvSVRrbf_MAE)

In [None]:
# Support Vector Models
cvSVRlinear_MAE_list = []
for df in df2test:
    cvSVRlinear_MAE = model_MAE.cvSVR(df, target, 'linear')
    cvSVRlinear_MAE_list.append(cvSVRlinear_MAE)


### Interpret the Models

In [None]:
# create dataframe for results of all train models
df_list = ['rank_svi_only', 'top4', 'total_all_flags_only', 'summary_flags', 'not_summary_flags', 
           'all_features', 'raw_svi_only', 'top10', 'svi_features', 'Rlist']

results = pd.DataFrame(df_list, columns=['Features'])
results['Base_mean_MAE'] = mean_baseMAE
results['LinearRegression_MAE'] = cvlm_MAE_list
results['LassoLars_MAE'] = cvll_MAE_list
results['Tweedie_MAE'] = cvtw_MAE_list
results['RandomForest_MAE'] = cvrf_MAE_list
results['SVR_rbf_MAE'] = cvSVRrbf_MAE_list
results['SVR_linear_MAE'] = cvSVRlinear_MAE_list
results.sort_values('LinearRegression_MAE')

### Test Stage

In [None]:
# create test dataframe with Top10 features as identified by RFE as that is the best performing model
X_test_top10 = X_test_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr', 'rank_svi_scaled', 
                          'f_soci_total_scaled', 'f_pov_soci', 'ep_pov_scaled', 'raw_svi', 'f_age17_comp']]

In [None]:
# fit Tweedie Regressor with Top10 features on train dataset, then use that model to predict test values
TWtestMAE, modelTW = model_MAE.tweedie_test(X_top10, y_train, X_test_top10, y_test, 1.5, .5)
TWtestMAE

In [None]:
LRtestMAE, modelLR = model_MAE.linear_test(X_top10, y_train, X_test_top10, y_test)
LRtestMAE

In [None]:
LLtestMAE, modelLL = model_MAE.lasso_lars_test(X_top10, y_train, X_test_top10, y_test)
LLtestMAE

#### Takeaways

- the Top10 features performed best on train, but the model appears to be overfit
- to reduce overfitting, we will run the Top4 feature model instead

In [None]:
# create test dataframe with only Top4 features as identified by RFE to reduce overfitting
X_test_top4 = X_test_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr']]

In [None]:
# fit Tweedie Regressor with Top4 features on train dataset, then use that model to predict test values
TWtestMAE, modelTW = model_MAE.tweedie_test(X_top4, y_train, X_test_top4, y_test, 1.5, .5)
TWtestMAE

In [None]:
LRtestMAE, modelLR = model_MAE.linear_test(X_top4, y_train, X_test_top4, y_test)
LRtestMAE

In [None]:
LLtestMAE, modelLL = model_MAE.lasso_lars_test(X_top4, y_train, X_test_top4, y_test)
LLtestMAE

### Takeaways:
1. Our models beat the baseline.
2. Using the top 4 features (as selected by RFE), all three models returned a lower MAE than the baseline MAE of 973: Tweedie Regression (767), LassoLars (780), and Linear Regression (785).
3. Because the error is lower than baseline, we believe these models can help predict the graduated need for resources in specific areas in the event of another pandemic.
4. The Tweedie regression model is a 21% improvement over baseline.
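
The improvement percentages follow directly from the MAE values reported above, as this short calculation shows:

```python
# MAE values reported in the takeaways above
baseline_mae = 973
model_mae = {"Tweedie": 767, "LassoLars": 780, "Linear Regression": 785}

# Relative reduction in MAE compared with the baseline
improvement = {name: (baseline_mae - mae) / baseline_mae
               for name, mae in model_mae.items()}

for name, pct in improvement.items():
    print(f"{name}: {pct:.1%} lower MAE than baseline")
```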

#### Tweedie regression Summary:

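Tweedie regression is a generalized linear model (GLM) whose target follows a Tweedie distribution. The power parameter selects the distribution (0 = normal, 1 = Poisson, between 1 and 2 = compound Poisson-gamma, 2 = gamma), and the alpha parameter sets the regularization strength.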
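
To make the two parameters concrete, here is a minimal sketch of fitting scikit-learn's `TweedieRegressor` on synthetic, strictly positive data. The data-generating process is an assumption; the power and alpha values match the ones passed to `model_MAE.tweedie_test` above.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

# Synthetic positive, right-skewed target (assumed, for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))
expected = np.exp(1.0 + 1.0 * X[:, 0] - 0.5 * X[:, 1])
y = rng.gamma(shape=2.0, scale=expected / 2.0)

# power=1.5 is a compound Poisson-gamma distribution; alpha is the L2 penalty strength
model = TweedieRegressor(power=1.5, alpha=0.5)
model.fit(X, y)

pred = model.predict(X)
print(model.coef_, model.intercept_)
```

With a power other than 0 the default link is the log link, so predictions are always positive, which suits a target such as cases per 100k.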


In [None]:
# Tweedie Results table:
tw_result = pd.DataFrame()
x_train_columns = X_test_top4.columns.tolist()
tw_result['features'] = x_train_columns
tw_result['coefs'] = modelTW.coef_
tw_result['abs_coefs'] = abs(modelTW.coef_)
tw_result.sort_values(by = 'abs_coefs', ascending = False).reset_index()

#### Strengths

- The model demonstrated that the SVI does a good job of identifying the areas of San Antonio that were most at risk from COVID infections. Interestingly, the communities most at risk from natural disasters (such as hurricanes, fires, and other acts of nature) also appear to be the communities needing resources in response to a pandemic: within San Antonio, that need appears to be highly correlated with the overall SVI score.
- Looking inside the index itself and reviewing its component parts, we found that socioeconomic "flags" or components tended to correlate more strongly with case rates than other types of components of the overall Social Vulnerability Index.


#### Weaknesses

- There is no consensus in the research literature as to which components of the SVI correlate most strongly with health risks or outcomes. Results for some demographic groups require careful interpretation because of their unique characteristics.
- These models are not intended in any way to be presented as predictions of infection; the medical reasons for COVID transmission are not yet determined, and modeling them would require a vastly more complex model than the one presented here. The purpose of this model is to identify and predict the areas most in need of help and response from communities and local officials.

## Classification Modeling

In [None]:
# bring in classification datasets with new y variable
class_df, class_train_exp, class_X_train_scaled, class_y_train, class_X_test_scaled, class_y_test = wrangle.wrangle_data_class()

In [None]:
# only raw svi score
cX_raw_svi = class_X_train_scaled[['raw_svi']]
# binned svi score by CDC range category = 1st ranked
cX_rank_svi_only = class_X_train_scaled[['rank_svi_scaled']]
# top 4 ranked features
cX_top4 = class_X_train_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr']]
# only the summary of the flags = 19th ranked
cX_all_flags_only = class_X_train_scaled[['all_flags_total_scaled']]
# only summary flags, should be the same as all flags total? = 5th, 12th, 15th, 21st
cX_summary_flags = class_X_train_scaled[['f_comp_total_scaled', 'f_soci_total_scaled', 'f_status_total_scaled', 'f_trans_total_scaled']]
# all individual flags
cX_not_summary_flags = class_X_train_scaled[['f_nohsdp_soci', 'f_minrty_status', 'f_groupq_trans', 'f_unemp_soci', 
                                     'f_disabl_comp', 'f_noveh_trans', 'f_mobile_trans', 'f_age65_comp', 
                                     'f_age17_comp', 'f_pov_soci', 'f_limeng_status', 'f_crowd_trans', 
                                      'f_pci_soci', 'f_sngpnt_comp', 'f_munit_trans']]
# top 10 by RFE
cX_top10 = class_X_train_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr', 'rank_svi_scaled', 
                          'f_soci_total_scaled', 'f_pov_soci', 'ep_pov_scaled', 'raw_svi', 'f_age17_comp']]
# engineered features only
cX_svifeatures = class_X_train_scaled[['rising', 'falling', 'delta', 'avg3yr', 'r_soci_rise', 'r_comp_rise', 
                                'r_status_rise', 'r_trans_rise', 'r_soci_fall', 'r_comp_fall', 'r_status_fall', 'r_trans_fall']]
cX_Rlist = class_X_train_scaled[['rank_svi_scaled', 'rising', 'falling', 'delta', 'avg3yr']]

In [None]:
# create variables for loop
cdf2test = [cX_rank_svi_only, cX_top4, cX_all_flags_only, cX_summary_flags, cX_not_summary_flags, class_X_train_scaled, 
           cX_raw_svi, cX_top10, cX_svifeatures, cX_Rlist]
ctarget = class_y_train

### Creating a Baseline model:

In [None]:
class_y_train.rank_cases.value_counts(normalize = True)

#### The baseline model we have chosen is the most commonly occurring case bin, rank 3, at 48.7%.

- This means that if a model returns an accuracy greater than 48.7%, it is better at predicting the COVID case bin than simply choosing the most commonly occurring bin.
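
A majority-class baseline can be read straight from the normalized value counts, as in this small sketch. The labels are made up and do not reproduce the 48.7% figure.

```python
import pandas as pd

# Hypothetical case-rank labels
rank_cases = pd.Series([3, 3, 3, 3, 2, 2, 1, 4], name="rank_cases")

# Share of rows in each bin; the largest share is the baseline accuracy
shares = rank_cases.value_counts(normalize=True)
baseline_class = shares.idxmax()
baseline_accuracy = shares.max()

print(f"Always predicting rank {baseline_class} is correct {baseline_accuracy:.1%} of the time")
```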

In [None]:
class_df.tract_cases_per_100k.plot(kind = 'hist')

### Random Forest Model

In [None]:
model_classification.random_forest_class(cX_raw_svi, ctarget, max_depth = 4, n_estimators = 100)

In [None]:
# create variables for loop
crf2test = [
            cX_rank_svi_only, 
            cX_top4, 
            cX_all_flags_only, 
            cX_summary_flags, 
            cX_not_summary_flags, 
            class_X_train_scaled, 
            cX_raw_svi, 
            cX_top10, 
            cX_svifeatures, 
            cX_Rlist
           ]
# target var:
ctarget = class_y_train


# Random Forest Models Classification Loop
cvrf_class_list = []
for df in crf2test:
    cvrf_class = model_classification.random_forest_class(df, ctarget, max_depth = 3, n_estimators = 75) 
    cvrf_class_list.append(cvrf_class)
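
`model_classification.random_forest_class` is not shown here. Assuming it reports a cross-validated accuracy, the equivalent scikit-learn calls might look like the sketch below, which uses synthetic four-class data and an assumed fold count.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic four-class problem standing in for the case-rank bins
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=1, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# Shallow trees limit overfitting on a small dataset
model = RandomForestClassifier(max_depth=3, n_estimators=75, random_state=0)

# Accuracy on each of 5 held-out folds
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"mean CV accuracy: {mean_accuracy:.3f}")
```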

In [None]:
# create variables for loop
crf2test = [
            cX_rank_svi_only, 
            cX_top4, 
            cX_all_flags_only, 
            cX_summary_flags, 
            cX_not_summary_flags, 
            class_X_train_scaled, 
            cX_raw_svi, 
            cX_top10, 
            cX_svifeatures, 
            cX_Rlist
           ]
# target var:
ctarget = class_y_train


# Random Forest Models Classification Loop
cvrf_class_list = []
for df in crf2test:
    cvrf_class = model_classification.random_forest_class(df, ctarget, max_depth = 2, n_estimators = 200) 
    cvrf_class_list.append(cvrf_class)

### Takeaways:

- The best set of features for the Random Forest model was the top 4 features as selected by RFE, which include the average change of the SVI in a given area over 3 years, the amount changed, and the engineered feature that measures whether the SVI score was falling for a particular sub-group. These features, together with the summed score for the socioeconomic theme, created the best model within the Random Forest training.
- Overall, most of these models came close to or beat the baseline.
- The best RF model beat the baseline by 14 percentage points (62% vs 48%).

### KNN Model

In [None]:
model_classification.knn_classification(cX_rank_svi_only, class_y_train, n_neighbors=5, cv = 10)

In [None]:
# create variables for loop
cknn2test = [
            cX_rank_svi_only, 
            cX_top4, 
            cX_all_flags_only, 
            cX_summary_flags, 
            cX_not_summary_flags, 
            class_X_train_scaled, 
            cX_raw_svi, 
            cX_top10, 
            cX_svifeatures, 
            cX_Rlist
           ]
# target var:
cknn_target = class_y_train


# KNN Models Classification Loop
cvknn_class_list = []
for df in cknn2test:
    cvknn_class = model_classification.knn_classification(df, cknn_target, n_neighbors=5, cv = 10) 
    cvknn_class_list.append(cvknn_class)
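
One way to compare values of k is to cross-validate a KNN classifier for each candidate, as sketched below on synthetic data. The candidate list is an assumption, and `model_classification.knn_classification` may work differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class problem standing in for the case-rank bins
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=1, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# 10-fold cross-validated accuracy for each candidate k
accuracy_by_k = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                       cv=10, scoring="accuracy").mean()
    for k in (3, 5, 7, 9)
}

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, "best k:", best_k)
```

KNN relies on distances, so features should be on comparable scales before fitting, which is why the scaled data frames are used above.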

### Takeaways:

- The best KNN model on our train dataset used the top 4 features as defined by RFE. The accuracy of this model was 56%.
- This still beat the baseline by 8 percentage points, but since it did not score as high as Random Forest, we will use RF in the test stage on unseen data.
- We selected a k of 5, which yielded the best results from the limited dataset that we had. Our models were able to beat the baseline accuracy with all but one set of features (which, interestingly, was the raw SVI score itself).
- This leads us to believe that there are very specific demographic realities within the San Antonio community that lend themselves to being measured more accurately by SVI than by other measurement techniques.

### Test

In [None]:
X_test_top_4_class = class_X_test_scaled[['spl_theme1_scaled', 'r_status_fall', 'delta', 'avg3yr']]
raw_test_svi = class_X_test_scaled[['rank_svi_scaled']] 
class_y_test

In [None]:
model_classification.rf_test_class(cX_rank_svi_only, ctarget, raw_test_svi, class_y_test)

## Classification Modeling Summary
- In this classification modeling phase, we ran a series of classification models using the Random Forest and KNN algorithms, with SVI components as features, to predict the severity of COVID cases based on our constructed rankings of case counts: Low, Low-Moderate, Moderate-High, and High bins.
- The most useful features for our classification models were similar to the features we found useful in regression modeling.
- Our best model, using the Random Forest algorithm and the top 4 RFE features, yielded an accuracy of **55%**, an improvement of **7 percentage points** over the baseline (a relative increase in accuracy of about 14.5%).

# Next Steps: What Can We Do Now?

## Is San Antonio different from other cities?


- San Antonio is very different from Dallas
- SVI and features do seem to correlate closely with high case rates in San Antonio

## Feature Engineering

- SVI trend for the county
    - is rising? is declining? 
    - delta of SVI change year over year?
    - std dev of SVI?

- This is an area for possible further exploration