# Total deaths and COVID-19 deaths per county in Washington, Oregon, and California in 2020

## The data come from https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy

### Download the dataset in CSV format

### Import the pandas package, which Python needs to read the data into a data frame


In [61]:
import pandas as pd

### Import the original data, which is named "emma_original_data.csv"

In [62]:
data = pd.read_csv("emma_original_data.csv")
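As a side note, some count values in this file are stored as text with a thousands separator (for example "10,908"). The sketch below shows one possible way to have `read_csv` parse such values as numbers at load time. The two-row sample is a hypothetical stand-in for the real file, and whether the `thousands` option suits the full CDC export is an assumption that would need checking.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the CDC export
sample_csv = io.StringIO(
    'State,County name,Deaths involving COVID-19,Deaths from All Causes\n'
    'CA,Alameda County,573,"10,908"\n'
    'CA,Amador County,31,415\n'
)

# thousands="," tells the parser to treat commas inside numbers as separators
sample = pd.read_csv(sample_csv, thousands=",")
print(sample.dtypes)
```

This notebook instead reads the counts as strings and converts them in later steps.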

### Check the original data by displaying the data frame (data)
### The data used for this analysis cover deaths through 1/2/21 (see the 'Last week' column)

In [63]:
data

Unnamed: 0,Date as of,First week,Last week,State,County name,FIPS County Code,Urban Rural Code,Deaths involving COVID-19,Deaths from All Causes
0,1/6/21,1/4/20,1/2/21,AK,Anchorage Borough,2020,Medium metro,112,2143
1,1/6/21,1/4/20,1/2/21,AK,Fairbanks North Star Borough,2090,Small metro,22,509
2,1/6/21,1/4/20,1/2/21,AK,Kenai Peninsula Borough,2122,Noncore,10,358
3,1/6/21,1/4/20,1/2/21,AK,Matanuska-Susitna Borough,2170,Medium metro,12,591
4,1/6/21,1/4/20,1/2/21,AL,Autauga County,1001,Medium metro,41,488
...,...,...,...,...,...,...,...,...,...
1860,1/6/21,1/4/20,1/2/21,WY,Laramie County,56021,Small metro,53,926
1861,1/6/21,1/4/20,1/2/21,WY,Natrona County,56025,Small metro,105,993
1862,1/6/21,1/4/20,1/2/21,WY,Sheridan County,56033,Micropolitan,16,296
1863,1/6/21,1/4/20,1/2/21,WY,Sweetwater County,56037,Micropolitan,16,344


### Drop the variables that are unnecessary for this analysis and check the results

In [64]:
# Columns that only carry reporting dates and are not needed for this analysis
to_drop = ['Date as of',
          'First week',
          'Last week']
data.drop(to_drop, inplace=True, axis=1)

data

Unnamed: 0,State,County name,FIPS County Code,Urban Rural Code,Deaths involving COVID-19,Deaths from All Causes
0,AK,Anchorage Borough,2020,Medium metro,112,2143
1,AK,Fairbanks North Star Borough,2090,Small metro,22,509
2,AK,Kenai Peninsula Borough,2122,Noncore,10,358
3,AK,Matanuska-Susitna Borough,2170,Medium metro,12,591
4,AL,Autauga County,1001,Medium metro,41,488
...,...,...,...,...,...,...
1860,WY,Laramie County,56021,Small metro,53,926
1861,WY,Natrona County,56025,Small metro,105,993
1862,WY,Sheridan County,56033,Micropolitan,16,296
1863,WY,Sweetwater County,56037,Micropolitan,16,344
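The same kind of drop can also be written without `inplace=True`, so that the original frame stays available. This is only a sketch on a made-up one-row frame, not the notebook's actual cell.

```python
import pandas as pd

# Made-up one-row frame with one date column to remove
toy = pd.DataFrame({
    "Date as of": ["1/6/21"],
    "State": ["AK"],
    "Deaths involving COVID-19": [112],
})

# drop(columns=...) returns a new frame and leaves `toy` untouched
trimmed = toy.drop(columns=["Date as of"])
print(trimmed.columns.tolist())
```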


### Keep only the data from California, Washington, and Oregon and check the results
### After discussion, our group decided to analyze counties in California, Washington, and Oregon, so only rows from these three states are kept.

In [65]:
data = data[data.State.isin(['CA','WA','OR'])]

data

Unnamed: 0,State,County name,FIPS County Code,Urban Rural Code,Deaths involving COVID-19,Deaths from All Causes
111,CA,Alameda County,6001,Large central metro,573,10908
112,CA,Alpine County,6003,,0,
113,CA,Amador County,6005,Noncore,31,415
114,CA,Butte County,6007,Small metro,101,2313
115,CA,Calaveras County,6009,Noncore,12,385
...,...,...,...,...,...,...
1778,WA,Wahkiakum County,53069,,0,
1779,WA,Walla Walla County,53071,Small metro,38,651
1780,WA,Whatcom County,53073,Small metro,63,1881
1781,WA,Whitman County,53075,Micropolitan,21,269
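A filtered frame like this one is a selection from a larger frame, and pandas may emit a `SettingWithCopyWarning` when such a selection is modified later. A minimal sketch of one common way to avoid that, assuming a made-up toy frame: take an explicit `.copy()` right after filtering.

```python
import pandas as pd

# Made-up toy frame; only the state codes matter here
toy = pd.DataFrame({
    "State": ["AK", "CA", "OR", "WA", "WY"],
    "County name": ["A", "B", "C", "D", "E"],
})

# .copy() makes the filtered result an independent frame
west = toy[toy["State"].isin(["CA", "WA", "OR"])].copy()

# Later edits now apply to `west` only, not to a view of `toy`
west["Region"] = "West Coast"
print(west)
```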


### Replace unreported values with 0 and check the results
#### For counties with no reported number of deaths, the death counts are almost all 0 as of 1/2/21

In [66]:
data['Deaths from All Causes'].fillna(0, inplace = True)

data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


Unnamed: 0,State,County name,FIPS County Code,Urban Rural Code,Deaths involving COVID-19,Deaths from All Causes
111,CA,Alameda County,6001,Large central metro,573,10908
112,CA,Alpine County,6003,,0,0
113,CA,Amador County,6005,Noncore,31,415
114,CA,Butte County,6007,Small metro,101,2313
115,CA,Calaveras County,6009,Noncore,12,385
...,...,...,...,...,...,...
1778,WA,Wahkiakum County,53069,,0,0
1779,WA,Walla Walla County,53071,Small metro,38,651
1780,WA,Whatcom County,53073,Small metro,63,1881
1781,WA,Whitman County,53075,Micropolitan,21,269
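An alternative to `fillna(..., inplace=True)` on a selected column is to assign the filled column back. The sketch below uses a made-up two-row frame and fills with the string "0" rather than the integer 0, on the assumption that the column holds text values; this keeps every entry a string.

```python
import numpy as np
import pandas as pd

# Made-up frame: counts stored as text, one value missing
toy = pd.DataFrame({
    "County": ["Alameda", "Alpine"],
    "Deaths_total": ["10,908", np.nan],
})

# Assign the result back instead of using inplace=True on a column selection
toy["Deaths_total"] = toy["Deaths_total"].fillna("0")
print(toy)
```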


### Rename the columns and check the results
### To keep everyone's data consistent, we decided to rename all variables. The basic rule is to replace " " with "_" to reduce confusion in later analysis.

#### Change variable 'County name' to 'County'

In [67]:
data_new = data.rename(columns={'County name':'County'})

#### Change variable 'FIPS County Code' to 'FIPS_code'

In [68]:
# Rename on data_new so that the previous rename is kept
data_new = data_new.rename(columns={'FIPS County Code':'FIPS_code'})

#### Change variable 'Urban Rural Code' to 'Urban_Rural_Code'

In [69]:
# Rename on data_new so that the previous renames are kept
data_new = data_new.rename(columns={'Urban Rural Code':'Urban_Rural_Code'})

#### Change variable 'Deaths involving COVID-19' to 'Deaths_COVID'

In [70]:
data_new = data_new.rename(columns={'Deaths involving COVID-19':'Deaths_COVID'})

#### Change variable 'Deaths from All Causes' to 'Deaths_total'

In [71]:
data_new = data_new.rename(columns={'Deaths from All Causes':'Deaths_total'})

#### Apply all the renames to the original data frame in place and check the results

In [72]:
# inplace=True modifies data directly and returns None, so the result is not assigned
data.rename(columns={'County name':'County',
                     'FIPS County Code':'FIPS_code',
                     'Urban Rural Code':'Urban_Rural_Code',
                     'Deaths involving COVID-19':'Deaths_COVID',
                     'Deaths from All Causes':'Deaths_total'},
            inplace=True)

data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,State,County,FIPS_code,Urban_Rural_Code,Deaths_COVID,Deaths_total
111,CA,Alameda County,6001,Large central metro,573,10908
112,CA,Alpine County,6003,,0,0
113,CA,Amador County,6005,Noncore,31,415
114,CA,Butte County,6007,Small metro,101,2313
115,CA,Calaveras County,6009,Noncore,12,385
...,...,...,...,...,...,...
1778,WA,Wahkiakum County,53069,,0,0
1779,WA,Walla Walla County,53071,Small metro,38,651
1780,WA,Whatcom County,53073,Small metro,63,1881
1781,WA,Whitman County,53075,Micropolitan,21,269
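By default, `rename` silently ignores keys that do not match any column, so a misspelled column name changes nothing and raises no error. The sketch below, on a made-up one-row frame, shows that behavior and how `errors="raise"` can surface such a typo.

```python
import pandas as pd

toy = pd.DataFrame({
    "County name": ["Alameda County"],
    "Deaths involving COVID-19": [573],
})

# A misspelled key ("Death" instead of "Deaths") is ignored by default
unchanged = toy.rename(columns={"Death involving COVID-19": "Deaths_COVID"})

# errors="raise" turns the same typo into a KeyError
try:
    toy.rename(columns={"Death involving COVID-19": "Deaths_COVID"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# With correct keys, all renames can go into one dictionary
renamed = toy.rename(
    columns={"County name": "County", "Deaths involving COVID-19": "Deaths_COVID"},
    errors="raise",
)
print(renamed.columns.tolist())
```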


### Drop the 'FIPS_code' variable and check the results
### After our team meeting, we decided not to keep the 'FIPS_code' variable in our analysis. Drop it in this step.

In [73]:
to_drop = ['FIPS_code']
data.drop(to_drop, inplace=True, axis=1)

data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,State,County,Urban_Rural_Code,Deaths_COVID,Deaths_total
111,CA,Alameda County,Large central metro,573,10908
112,CA,Alpine County,,0,0
113,CA,Amador County,Noncore,31,415
114,CA,Butte County,Small metro,101,2313
115,CA,Calaveras County,Noncore,12,385
...,...,...,...,...,...
1778,WA,Wahkiakum County,,0,0
1779,WA,Walla Walla County,Small metro,38,651
1780,WA,Whatcom County,Small metro,63,1881
1781,WA,Whitman County,Micropolitan,21,269


### Check the data type of all variables
### This step helps you better understand the dataset and prepares for the later regression.

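#### Describe the dataset (written without parentheses, data.describe displays the bound method together with the data frame rather than summary statistics)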

In [74]:
data.describe

<bound method NDFrame.describe of      State              County     Urban_Rural_Code Deaths_COVID Deaths_total
111     CA      Alameda County  Large central metro          573       10,908
112     CA       Alpine County                  NaN            0            0
113     CA       Amador County              Noncore           31          415
114     CA        Butte County          Small metro          101        2,313
115     CA    Calaveras County              Noncore           12          385
...    ...                 ...                  ...          ...          ...
1778    WA    Wahkiakum County                  NaN            0            0
1779    WA  Walla Walla County          Small metro           38          651
1780    WA      Whatcom County          Small metro           63        1,881
1781    WA      Whitman County         Micropolitan           21          269
1782    WA       Yakima County          Small metro          266        2,239

[133 rows x 5 columns]>
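For comparison, calling the method with parentheses returns summary statistics. A minimal sketch on a three-row toy frame whose values are taken from the rows shown above; note that `describe()` summarizes numeric columns by default, so it is most useful once the counts have been converted to numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Alameda_CA", "Amador_CA", "Butte_CA"],
    "Deaths_COVID": [573, 31, 101],
})

# With parentheses the method is called and returns count, mean, std, etc.
summary = toy.describe()
print(summary)
```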

#### Check the basic information of the dataset, especially the data types

In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133 entries, 111 to 1782
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   State             133 non-null    object
 1   County            133 non-null    object
 2   Urban_Rural_Code  85 non-null     object
 3   Deaths_COVID      133 non-null    object
 4   Deaths_total      133 non-null    object
dtypes: object(5)
memory usage: 6.2+ KB


#### Check for rows with missing values in the death-count columns (columns at position 3 onward), using code learned in class

In [76]:
data[data.iloc[:,3:].isnull().any(axis=1)]

Unnamed: 0,State,County,Urban_Rural_Code,Deaths_COVID,Deaths_total
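An empty result means that no row has a missing value in the selected columns. The sketch below shows the same kind of check with column names instead of positions, plus a per-column count of missing values. The toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Urban_Rural_Code": ["Noncore", None],
    "Deaths_COVID": [31, 0],
    "Deaths_total": [415.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows with a missing value in either death-count column
rows_missing_counts = toy[toy[["Deaths_COVID", "Deaths_total"]].isna().any(axis=1)]
print(missing_per_column)
print(rows_missing_counts)
```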


### Rename all observations in the 'County' variable and check the results
### During our team meeting, we decided to remove the word 'County' from every county name.

#### Remove the word 'County' from each county name

In [77]:
data['County'] = data['County'].str.replace('County', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['County'] = data['County'].str.replace('County', '')


#### Strip the leading and trailing blank spaces from the county names and check the final results

In [78]:
data['County'] = data['County'].str.strip()

data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['County'] = data['County'].str.strip()


Unnamed: 0,State,County,Urban_Rural_Code,Deaths_COVID,Deaths_total
111,CA,Alameda,Large central metro,573,10908
112,CA,Alpine,,0,0
113,CA,Amador,Noncore,31,415
114,CA,Butte,Small metro,101,2313
115,CA,Calaveras,Noncore,12,385
...,...,...,...,...,...
1778,WA,Wahkiakum,,0,0
1779,WA,Walla Walla,Small metro,38,651
1780,WA,Whatcom,Small metro,63,1881
1781,WA,Whitman,Micropolitan,21,269
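The two steps above can also be combined. The following sketch, which is an alternative and not the notebook's own code, uses a regular expression that removes only a trailing "County" together with the whitespace before it.

```python
import pandas as pd

counties = pd.Series(["Alameda County", "Walla Walla County", "Alpine County"])

# \s*County$ matches the word only at the end of the name, with any preceding spaces
cleaned = counties.str.replace(r"\s*County$", "", regex=True).str.strip()
print(cleaned.tolist())
```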


### Create a new variable called 'Location' and check the results
### We checked the dataset and found that some county names appear in more than one state. To solve this problem, we created a new variable called 'Location' by joining the county name and the state abbreviation. This variable identifies each county uniquely.

In [79]:
data["Location"] = data["County"].astype(str) + "_" + data["State"]
del data['State']
del data['County']

data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Location"] = data["County"].astype(str) + "_" + data["State"]


Unnamed: 0,Urban_Rural_Code,Deaths_COVID,Deaths_total,Location
111,Large central metro,573,10908,Alameda_CA
112,,0,0,Alpine_CA
113,Noncore,31,415,Amador_CA
114,Small metro,101,2313,Butte_CA
115,Noncore,12,385,Calaveras_CA
...,...,...,...,...
1778,,0,0,Wahkiakum_WA
1779,Small metro,38,651,Walla Walla_WA
1780,Small metro,63,1881,Whatcom_WA
1781,Micropolitan,21,269,Whitman_WA
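A quick way to confirm that the combined key is unique is the `is_unique` property. The sketch below uses a small illustrative frame; Lake County exists in both California and Oregon, so the county name alone is ambiguous.

```python
import pandas as pd

toy = pd.DataFrame({
    "County": ["Lake", "Lake", "Alameda"],
    "State": ["CA", "OR", "CA"],
})

# Join county and state into one identifier
toy["Location"] = toy["County"] + "_" + toy["State"]

print(toy["County"].is_unique)    # county names alone repeat
print(toy["Location"].is_unique)  # the combined key does not
```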


### Convert 'Deaths_COVID' and 'Deaths_total' from string variables to numeric variables
### Once these variables are numeric, we can use them in regression and clustering.

#### Check the data information again before making any changes

In [80]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133 entries, 111 to 1782
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Urban_Rural_Code  85 non-null     object
 1   Deaths_COVID      133 non-null    object
 2   Deaths_total      133 non-null    object
 3   Location          133 non-null    object
dtypes: object(4)
memory usage: 5.2+ KB


#### Remove the thousands separator ',' from these variables and check the results

In [81]:
data["Deaths_total"] = data["Deaths_total"].str.replace(',', '')
data["Deaths_COVID"] = data["Deaths_COVID"].str.replace(',', '')

data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Deaths_total"] = data["Deaths_total"].str.replace(',', '')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Deaths_COVID"] = data["Deaths_COVID"].str.replace(',', '')


Unnamed: 0,Urban_Rural_Code,Deaths_COVID,Deaths_total,Location
111,Large central metro,573,10908,Alameda_CA
112,,0,,Alpine_CA
113,Noncore,31,415,Amador_CA
114,Small metro,101,2313,Butte_CA
115,Noncore,12,385,Calaveras_CA
...,...,...,...,...
1778,,0,,Wahkiakum_WA
1779,Small metro,38,651,Walla Walla_WA
1780,Small metro,63,1881,Whatcom_WA
1781,Micropolitan,21,269,Whitman_WA


#### The step above turned every 0 in the 'Deaths_total' variable into a missing value: the earlier fillna(0) inserted the integer 0, and .str.replace returns NaN for values that are not strings. We need to change them back.

#### Replace the missing value with 0 in 'Deaths_total' and check the results

In [82]:
data['Deaths_total'].fillna(0, inplace = True)

data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


Unnamed: 0,Urban_Rural_Code,Deaths_COVID,Deaths_total,Location
111,Large central metro,573,10908,Alameda_CA
112,,0,0,Alpine_CA
113,Noncore,31,415,Amador_CA
114,Small metro,101,2313,Butte_CA
115,Noncore,12,385,Calaveras_CA
...,...,...,...,...
1778,,0,0,Wahkiakum_WA
1779,Small metro,38,651,Walla Walla_WA
1780,Small metro,63,1881,Whatcom_WA
1781,Micropolitan,21,269,Whitman_WA
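The behavior described above can be reproduced on a small mixed-type series. The sketch below also shows one possible way to avoid the extra `fillna` step, by casting every value to a string first. The series is a made-up example and `astype(str)` is a suggested alternative, not the notebook's method.

```python
import pandas as pd

# Object column mixing strings with an integer 0, as after fillna(0)
mixed = pd.Series(["10,908", 0, "415"], dtype=object)

# .str methods return NaN for entries that are not strings
via_str = mixed.str.replace(",", "", regex=False)

# Casting to str first keeps the 0 as the string "0"
safe = mixed.astype(str).str.replace(",", "", regex=False)

print(via_str.tolist())
print(safe.tolist())
```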


#### Convert string variables to numeric variables and check the results

In [83]:
data["Deaths_total"] = pd.to_numeric(data.Deaths_total)
data["Deaths_COVID"] = pd.to_numeric(data.Deaths_COVID)

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133 entries, 111 to 1782
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Urban_Rural_Code  85 non-null     object
 1   Deaths_COVID      133 non-null    int64 
 2   Deaths_total      133 non-null    int64 
 3   Location          133 non-null    object
dtypes: int64(2), object(2)
memory usage: 5.2+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Deaths_total"] = pd.to_numeric(data.Deaths_total)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Deaths_COVID"] = pd.to_numeric(data.Deaths_COVID)
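By default, `pd.to_numeric` raises an error if any value cannot be parsed. A minimal sketch of the `errors="coerce"` option, which turns unparseable values into NaN so they can be inspected or filled; the "n/a" entry is a hypothetical bad value, not something observed in this dataset.

```python
import pandas as pd

raw = pd.Series(["10908", "0", "n/a"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gaps and cast to integers once they have been reviewed
as_int = numeric.fillna(0).astype(int)
print(as_int.tolist())
```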


### Final check of the data before exporting it to a CSV file

In [84]:
data

Unnamed: 0,Urban_Rural_Code,Deaths_COVID,Deaths_total,Location
111,Large central metro,573,10908,Alameda_CA
112,,0,0,Alpine_CA
113,Noncore,31,415,Amador_CA
114,Small metro,101,2313,Butte_CA
115,Noncore,12,385,Calaveras_CA
...,...,...,...,...
1778,,0,0,Wahkiakum_WA
1779,Small metro,38,651,Walla Walla_WA
1780,Small metro,63,1881,Whatcom_WA
1781,Micropolitan,21,269,Whitman_WA


### Export the cleaned dataset with data.to_csv and name the file "emma_data.csv"

In [85]:
data.to_csv('emma_data.csv',header=True, index=False)
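To check that an export of this kind can be read back as expected, a round trip is a simple test. The sketch below writes to an in-memory buffer instead of a file, using two rows taken from the output above.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Urban_Rural_Code": ["Noncore", "Small metro"],
    "Deaths_COVID": [31, 101],
    "Deaths_total": [415, 2313],
    "Location": ["Amador_CA", "Butte_CA"],
})

# index=False leaves the row index out of the file, as in the export above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the text back and compare with the original frame
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Because the index is not written, the reloaded frame gets a fresh index starting at 0.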