## Data 100 Final Project: COVID 19 Dataset
Collaborators:

1. Exploratory Data Analysis

Import, clean, and merge dataframes

Examine County features (which ones?) in relation to death/confirmed/mortality/rate of spread

Produce 2 visualizations

Assignments:

?: Clean states dataframe, ...

Morgan: Import, clean, merge and examine features for California

?: Import, clean, merge and examine features for New York


In [67]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns

#Import Datasets
counties = pd.read_csv('abridged_couties.csv')
deaths = pd.read_csv('time_series_covid19_deaths_US.csv')
cases = pd.read_csv('time_series_covid19_confirmed_US.csv')
states = pd.read_csv('4.18states.csv')

2. Describe any data cleaning or transformations that you perform and why they are motivated by your EDA.

3. Apply relevant inference or prediction methods (e.g., linear regression, logistic regression, or classification and regression trees), 
including, if appropriate, feature engineering and regularization.
4. Use cross-validation or test data as appropriate for model selection and evaluation. Make sure to
carefully describe the methods you are using and why they are appropriate for the question to be
answered.
5. Summarize and interpret your results (including visualization).
6. Provide an evaluation of your approach and discuss any limitations of the methods you used.
7. Describe any surprising discoveries that you made and future work.


## Data Cleaning on COVID Datasets

### States Dataframe 

For data cleaning, a good starting point is to check for null and missing values and interpret what they may mean, before replacing them with any particular value.

In [68]:
# Check for how many missing values are in every column
states.isnull().sum()

Province_State           0
Country_Region           0
Last_Update             83
Lat                      5
Long_                    5
Confirmed                0
Deaths                   0
Recovered               24
Active                   1
FIPS                    82
Incident_Rate            5
People_Tested           84
People_Hospitalized     91
Mortality_Rate           3
UID                      0
ISO3                     0
Testing_Rate            84
Hospitalization_Rate    91
dtype: int64

Replacing information like People_Hospitalized with zero may not be a great idea since we cannot assume there are no people hospitalized, hence, we can replace these missing values with the mean of that country in which specific state data is missing. We can tackle People_Hospitalized, Testing_Rate, Hospitalization_Rate, People_Tested in this way. 

The cell below calculates how many of the total values for each country are not missing. If a value turns out to be zero that means all data points in People_Hospitalized is missing for that country.

In [69]:
countries = states['Country_Region'].value_counts().index.tolist()
people_hospitalization_country_nulls = {}
def usable_vals(col):
    dic = {}
    for country in countries:
        dic[country] = len(states.loc[states['Country_Region'] == country,col].isnull()) - sum(states.loc[states['Country_Region'] == country,'People_Hospitalized'].isnull())
    return dic
People_Hospitalization_country_nulls = usable_vals('People_Hospitalized')
Hospitalization_Rate_country_nulls = usable_vals('Hospitalization_Rate')
print(Hospitalization_Rate_country_nulls == People_Hospitalization_country_nulls)
print('People_Hospitalized \n', usable_vals('People_Hospitalized'))
print('Hospitalization_Rate \n', usable_vals('Hospitalization_Rate'))
print('Testing_Rate \n', usable_vals('Testing_Rate'))
print('People Tested \n', usable_vals('People_Tested'))
print('Mortality Rate \n', usable_vals('Mortality_Rate'))

True
People_Hospitalized 
 {'US': 49, 'China': 0, 'Canada': 0, 'France': 0, 'United Kingdom': 0, 'Australia': 0, 'Netherlands': 0, 'Denmark': 0}
Hospitalization_Rate 
 {'US': 49, 'China': 0, 'Canada': 0, 'France': 0, 'United Kingdom': 0, 'Australia': 0, 'Netherlands': 0, 'Denmark': 0}
Testing_Rate 
 {'US': 49, 'China': 0, 'Canada': 0, 'France': 0, 'United Kingdom': 0, 'Australia': 0, 'Netherlands': 0, 'Denmark': 0}
People Tested 
 {'US': 49, 'China': 0, 'Canada': 0, 'France': 0, 'United Kingdom': 0, 'Australia': 0, 'Netherlands': 0, 'Denmark': 0}
Mortality Rate 
 {'US': 49, 'China': 0, 'Canada': 0, 'France': 0, 'United Kingdom': 0, 'Australia': 0, 'Netherlands': 0, 'Denmark': 0}


Only the data for US contains values that will be usable so we can go ahead and replace the NaN values in the US data with the mean of that column in the US and from here note that we can only incorporate People_Hospitalized in analysis within the US. 

In [83]:
People_Hospitalization_US_arr = states[states['Country_Region'] == 'US']['People_Hospitalized']
People_Hospitalization_US_arr_mean = People_Hospitalization_US_arr.mean()
People_Hospitalization_US_arr.fillna(People_Hospitalization_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','People_Hospitalized'] = People_Hospitalization_US_arr

Hospitalization_Rate_US_arr = states[states['Country_Region'] == 'US']['Hospitalization_Rate']
Hospitalization_Rate_US_arr_mean = Hospitalization_Rate_US_arr.mean()
Hospitalization_Rate_US_arr.fillna(Hospitalization_Rate_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','Hospitalization_Rate'] = Hospitalization_Rate_US_arr
states.isnull().sum()

Testing_Rate_US_arr = states[states['Country_Region'] == 'US']['Testing_Rate']
Testing_Rate_US_arr_mean = Testing_Rate_US_arr.mean()
Testing_Rate_US_arr.fillna(Testing_Rate_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','Testing_Rate'] = Testing_Rate_US_arr
states.isnull().sum()

People_Tested_US_arr = states[states['Country_Region'] == 'US']['People_Tested']
People_Tested_US_arr_mean = People_Tested_US_arr.mean()
People_Tested_US_arr.fillna(People_Tested_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','People_Tested'] = People_Tested_US_arr
states.isnull().sum()

Mortality_Rate_US_arr = states[states['Country_Region'] == 'US']['Mortality_Rate']
Mortality_Rate_US_arr_mean = Mortality_Rate_US_arr.mean()
Mortality_Rate_US_arr.fillna(Mortality_Rate_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','Mortality_Rate'] = Mortality_Rate_US_arr
states.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)


Province_State           0
Country_Region           0
Last_Update             83
Lat                      5
Long_                    5
Confirmed                0
Deaths                   0
Recovered               24
Active                   1
FIPS                    82
Incident_Rate            5
People_Tested           81
People_Hospitalized     81
Mortality_Rate           1
UID                      0
ISO3                     0
Testing_Rate            81
Hospitalization_Rate    81
dtype: int64

There is still an uncovered NaN value in Mortality_Rate so we can explore that item, which happens to exist in position 115 from inspection. That item is weirdly a float object type, so we can manually replace this element with the mean. 

In [84]:
weird_nan = states['Mortality_Rate'].values[115]
print(type(weird_nan))
Mortality_Rate_US_arr.replace(to_replace=weird_nan, value=Mortality_Rate_US_arr_mean, inplace=True)
states.loc[states['Country_Region'] == 'US','Mortality_Rate'] = Mortality_Rate_US_arr
states.isnull().sum()


<class 'numpy.float64'>


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)


Province_State           0
Country_Region           0
Last_Update             83
Lat                      5
Long_                    5
Confirmed                0
Deaths                   0
Recovered               24
Active                   1
FIPS                    82
Incident_Rate            5
People_Tested           81
People_Hospitalized     81
Mortality_Rate           1
UID                      0
ISO3                     0
Testing_Rate            81
Hospitalization_Rate    81
dtype: int64

### Cases DataFrame

In [46]:
cases.isnull().sum().values

array([0, 0, 0, 0, 4, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])