# Active COVID-19 Cases in US Counties
Noah Litwiller

The purpose of this project is to use COVID-19 data from US counties to predict the number of active cases in the 10 most populous US counties. The dataset was provided by the Center for Systems Science and Engineering at Johns Hopkins University. The data is from the daily report for May 1st, 2020. For the purposes of this project, we will use the following attributes from the dataset:

*Lat*: The latitude of the county \
*Long_*: The longitude of the county \
*Confirmed*: The total number of confirmed COVID-19 cases in the county \
*Deaths*: The total number of deaths from COVID-19 in the county \
*Recovered*: The total number of patients who have recovered from COVID-19 in the county \
*Active*: The number of people who currently have COVID-19 in the county. This is the attribute we will attempt to predict.

**Top 10 US counties by population (of the counties present in the dataset):**
1. Los Angeles, CA
2. Cook, IL
3. Harris, TX
4. Maricopa, AZ
5. San Diego, CA
6. Orange, CA
7. Miami-Dade, FL
8. Dallas, TX
9. Riverside, CA
10. King, WA

**Pre-processing**

Before pre-processing the data, I first selected only the records for US counties from the dataset. There are 2916 of these records, 10 of which are the top 10 counties that will be used for testing. For pre-processing, I replaced any NaN values with the mean of their respective column and then normalized the data.

**Model Methodology** 

Since *Active* is a continuous attribute, I chose to use a Stochastic Gradient Descent Regressor model. The `squared_epsilon_insensitive` loss function is used to create a balance between fitting outliers and more centralized data. The adaptive learning rate lowers the step size when the training loss stops improving, which helps the model keep converging. I also chose to shuffle the training data after each epoch so that the updates do not depend on the order of the records.
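
As a rough illustration of the loss named above, the sketch below writes out the squared epsilon-insensitive loss as scikit-learn describes it: residuals smaller than `epsilon` cost nothing, and larger residuals are penalized by the square of the excess. The helper name `squared_epsilon_insensitive` and the sample values are assumptions made for illustration; `epsilon=0.1` mirrors the `SGDRegressor` default.

```python
import numpy as np

def squared_epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: max(0, |y_true - y_pred| - epsilon) ** 2."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon) ** 2

# Illustrative values only (not taken from the dataset)
losses = squared_epsilon_insensitive([10.0, 10.0, 10.0], [10.05, 11.1, 7.9])
print(losses)
```

The first prediction falls inside the `epsilon` tube and contributes no loss, while the other two are penalized quadratically, so large residuals still dominate the fit.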

# Import Data

In [None]:
import pandas as pd

#from https://github.com/CSSEGISandData/COVID-19/blob/df78742b57976079cad11110560ad6628742c134/csse_covid_19_data/csse_covid_19_daily_reports/05-01-2020.csv
covid5_1_20 = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/df78742b57976079cad11110560ad6628742c134/csse_covid_19_data/csse_covid_19_daily_reports/05-01-2020.csv")

In [None]:
covid5_1_20

| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-05-02 02:32:27 | 34.223334 | -82.461707 | 31 | 0 | 0 | 31 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-05-02 02:32:27 | 30.295065 | -92.414197 | 133 | 10 | 0 | 123 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-05-02 02:32:27 | 37.767072 | -75.632346 | 303 | 5 | 0 | 298 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-05-02 02:32:27 | 43.452658 | -116.241552 | 681 | 16 | 0 | 665 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-05-02 02:32:27 | 41.330756 | -94.471059 | 1 | 0 | 0 | 1 | Adair, Iowa, US |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3183 | | | | West Bank and Gaza | 2020-05-02 02:32:27 | 31.952200 | 35.233200 | 353 | 2 | 76 | 275 | West Bank and Gaza |
| 3184 | | | | Western Sahara | 2020-05-02 02:32:27 | 24.215500 | -12.885800 | 6 | 0 | 5 | 1 | Western Sahara |
| 3185 | | | | Yemen | 2020-05-02 02:32:27 | 15.552727 | 48.516388 | 7 | 2 | 1 | 4 | Yemen |
| 3186 | | | | Zambia | 2020-05-02 02:32:27 | -13.133897 | 27.849332 | 109 | 3 | 74 | 32 | Zambia |
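
The pre-processing step that follows selects US counties by row position and refers to the test counties by hard-coded row numbers. As a hedged alternative, the sketch below selects rows by column values and looks up test rows by `Combined_Key`. The small `report` frame reuses a few rows from the preview above plus one hypothetical US row without a county name; on the full file, this filter is not guaranteed to return exactly the same 2916 rows as the positional slice.

```python
import pandas as pd

# Small stand-in for the daily report, built from rows in the preview above.
# "Example Territory" is a hypothetical US row with no county (Admin2) value.
report = pd.DataFrame({
    "Admin2": ["Abbeville", "Acadia", "Accomack", None, None, None],
    "Country_Region": ["US", "US", "US", "US", "Yemen", "Zambia"],
    "Combined_Key": ["Abbeville, South Carolina, US", "Acadia, Louisiana, US",
                     "Accomack, Virginia, US", "Example Territory, US",
                     "Yemen", "Zambia"],
    "Active": [31, 123, 298, 0, 4, 32],
})

# Keep only US rows that name a county
is_us_county = (report["Country_Region"] == "US") & report["Admin2"].notna()
us_counties = report[is_us_county].reset_index(drop=True)

# Find test rows by name instead of by hard-coded position
test_keys = ["Acadia, Louisiana, US"]
test_positions = us_counties.index[us_counties["Combined_Key"].isin(test_keys)].tolist()
print(len(us_counties), test_positions)
```

Looking rows up by key keeps the train/test split valid even if the row order of the source file changes.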


# Data Pre-processing

In [None]:
import numpy as np
from sklearn import preprocessing as pre
from sklearn.impute import SimpleImputer

covid5_1_20 = covid5_1_20.loc[0:2915,:] #Select only the records that are US counties

X = covid5_1_20.loc[:,["Lat", "Long_", "Confirmed", "Deaths", "Recovered"]]
y = np.array(covid5_1_20.loc[:,"Active"])

SI = SimpleImputer(missing_values=np.nan, strategy='mean')
X = SI.fit_transform(X) #Replace NaN values with the mean of their column

X = pre.normalize(X) #Scale each row (sample) to unit L2 norm

#Use the top 10 US counties as test data
test_idx = [1546, 575, 1094, 1613, 2281, 1948, 1732, 652, 2219, 1356]
X_test = X[test_idx,:]
y_test = y[test_idx]

#Use all other US counties as training data
X_train = np.delete(X, test_idx, 0)
y_train = np.delete(y, test_idx, 0)
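
One detail of the cell above is worth spelling out: by default, `preprocessing.normalize` rescales each row (one county) to unit L2 norm; it does not rescale each column. The sketch below contrasts it with per-feature standardization using `StandardScaler`, which is shown only as an alternative and is not what this notebook uses. `X_demo` holds two rows taken from the data preview.

```python
import numpy as np
from sklearn import preprocessing as pre

# Columns: Lat, Long_, Confirmed, Deaths, Recovered
X_demo = np.array([
    [34.223334, -82.461707, 31.0, 0.0, 0.0],     # Abbeville, South Carolina
    [43.452658, -116.241552, 681.0, 16.0, 0.0],  # Ada, Idaho
])

row_scaled = pre.normalize(X_demo)                       # each row divided by its own L2 norm
col_scaled = pre.StandardScaler().fit_transform(X_demo)  # each column centered and scaled

row_norms = np.linalg.norm(row_scaled, axis=1)  # 1 for every row, up to floating point
col_means = col_scaled.mean(axis=0)             # 0 for every column, up to floating point
print(row_norms)
print(col_means)
```

With row-wise scaling, a county's *Confirmed* value is divided by a norm that *Confirmed* itself can dominate, so large and small counties can end up with similar feature vectors. Per-feature scaling keeps the differences between counties within each column.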


# SGD Regressor

In [None]:
from sklearn import linear_model

sgd_model = linear_model.SGDRegressor(loss="squared_epsilon_insensitive", penalty="l1",
                                      learning_rate="adaptive", shuffle=True,
                                      max_iter=100000, random_state=40)
sgd_model.fit(X_train, y_train)

yPredicted = sgd_model.predict(X_test)
#Absolute value of the mean signed error; over- and under-predictions can cancel out
meanError = np.abs(np.mean(yPredicted - y_test))
print("Mean error: %.0f" % meanError)


Mean error: 7212


# Predicted Active Cases vs. Actual Active Cases

In [None]:
yComparison = {'County' : ["Los Angeles, CA", "Cook, IL", "Harris, TX", "Maricopa, AZ", "San Diego, CA", "Orange, CA", "Miami-Dade, FL", "Dallas, TX", "Riverside, CA", "King, WA"],
               'Actual Active Cases': y_test, 'Predicted Active Cases': yPredicted}
yComparisonDF = pd.DataFrame(data=yComparison)
yComparisonDF


| | County | Actual Active Cases | Predicted Active Cases |
|---|---|---|---|
| 0 | Los Angeles, CA | 23088 | 3104.283344 |
| 1 | Cook, IL | 36995 | 3104.383795 |
| 2 | Harris, TX | 6429 | 2956.738042 |
| 3 | Maricopa, AZ | 4009 | 2928.693725 |
| 4 | San Diego, CA | 3440 | 2893.117291 |
| 5 | Orange, CA | 2487 | 2766.258101 |
| 6 | Miami-Dade, FL | 12031 | 3036.628666 |
| 7 | Dallas, TX | 3612 | 2912.664278 |
| 8 | Riverside, CA | 3923 | 2929.057489 |
| 9 | King, WA | 5822 | 3081.053929 |


# Conclusion

To evaluate this model, we use the mean error over the 10 test counties. This SGD model had a mean error of 7212 cases. Because this metric is the absolute value of the mean signed difference between predicted and actual active cases, it shows that the model under-predicted by about 7212 cases per county on average. Errors in opposite directions can cancel, so it is not the typical size of a single county's error. Looking at the above table, we can see that this regressor was much closer for counties with a lower number of actual active cases than for those with a higher number of active cases. The predictions themselves barely vary (roughly 2,770 to 3,100 cases for every county), so the counties that appear well predicted are those whose actual counts happen to fall near that range. A possible reason for this behavior is that the vast majority of counties in the dataset had under 1,000 active cases, with the average number of active cases being 355.
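
To make the metric concrete, the sketch below recomputes it from the values in the comparison table and places it next to the mean absolute error, which averages the size of each county's error regardless of its sign. The variable names are illustrative and are not part of the notebook.

```python
import numpy as np

# Values copied from the comparison table above
actual = np.array([23088, 36995, 6429, 4009, 3440, 2487, 12031, 3612, 3923, 5822], dtype=float)
predicted = np.array([3104.283344, 3104.383795, 2956.738042, 2928.693725, 2893.117291,
                      2766.258101, 3036.628666, 2912.664278, 2929.057489, 3081.053929])

abs_mean_error = np.abs(np.mean(predicted - actual))  # metric used in this notebook
mean_abs_error = np.mean(np.abs(predicted - actual))  # average size of each error

print("Absolute value of mean error: %.0f" % abs_mean_error)
print("Mean absolute error: %.0f" % mean_abs_error)
```

The two values differ only slightly here because just one county (Orange, CA) was over-predicted. With a more even mix of over- and under-predictions, the first metric could be close to zero even for a poor model.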

A limitation of this project is the number of data records used for training. Since 5 attributes are used for prediction, a rough rule of thumb would call for approximately 100,000 records to train the model, but only 2906 training records (the 2916 US county records minus the 10 test counties) were available. There are fewer than 100,000 counties in the US, so using a more plentiful unit, such as cities, could train a more accurate model. Another limitation of this model is that it only considers one day's worth of data. A pandemic unfolds over many days, so any model that does not use a sequence of days for prediction fails to account for trends over time.
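
As a minimal sketch of the multi-day idea, assuming several daily reports had been stacked into one table with a date column, previous-day features could be derived per county as shown below. The `history` frame, its county names, and its numbers are entirely hypothetical, and the column names `Date`, `Active_prev_day`, and `Active_change` are illustrative.

```python
import pandas as pd

# Hypothetical three-day history for two counties; the numbers are invented
history = pd.DataFrame({
    "Combined_Key": ["County A"] * 3 + ["County B"] * 3,
    "Date": pd.to_datetime(["2020-04-29", "2020-04-30", "2020-05-01"] * 2),
    "Active": [10, 12, 15, 200, 220, 260],
})

history = history.sort_values(["Combined_Key", "Date"])
# Previous day's active count within each county (NaN on a county's first day)
history["Active_prev_day"] = history.groupby("Combined_Key")["Active"].shift(1)
# Day-over-day change, a simple trend feature
history["Active_change"] = history["Active"] - history["Active_prev_day"]

print(history)
```

Features like these would let a regressor see whether a county's case count is rising or falling, which a single day's snapshot cannot express.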



