# Data Wrangling

In this section we will look into analyzing relatively large sets of data using Pandas.

We will use our data manipulation and analysis skills to analyze daily Covid-19 data provided by Johns Hopkins University.

## The Input Data

Johns Hopkins provides a data repository that contains daily Covid-19 metrics aggregated from a variety of sources, including the World Health Organization, the European Centre for Disease Prevention and Control, the US CDC, and public health departments of various states and counties. The popular 2019 Novel Coronavirus Visual Dashboard (https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) is powered by this large data repository.
You can download the dataset from https://github.com/CSSEGISandData/COVID-19. We are going to use the daily report data aggregated at the level of states. 

## Reading the Data

Let us first read a single file from the folder csse_covid_19_daily_reports_us to get an idea about the structure of the data.

In [1]:
# First we import pandas
import pandas as pd

In [2]:
#read the file 01-01-2021.csv from the csse_covid_19_daily_reports_us folder
sampledata = pd.read_csv('../../largedatasets/csse_covid_19_daily_reports_us/01-01-2021.csv')

In [3]:
#let's see the first five records
sampledata.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,Total_Test_Results,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate
0,Alabama,US,2021-01-02 05:30:44,32.3182,-86.9023,365747,4872,202137.0,158738.0,1.0,...,,,1.332068,84000001.0,USA,,,2021-01-01,,
1,Alaska,US,2021-01-02 05:30:44,61.3707,-152.4044,47019,206,7165.0,39648.0,2.0,...,1275750.0,,0.438121,84000002.0,USA,174391.185778,,2021-01-01,,
2,American Samoa,US,2021-01-02 05:30:44,-14.271,-170.132,0,0,,,60.0,...,2140.0,,,16.0,ASM,3846.084722,,2021-01-01,,
3,Arizona,US,2021-01-02 05:30:44,33.7298,-111.4312,530267,9015,76934.0,444318.0,4.0,...,5155330.0,,1.700087,84000004.0,USA,39551.860582,,2021-01-01,,
4,Arkansas,US,2021-01-02 05:30:44,34.9697,-92.3731,229442,3711,199247.0,26484.0,5.0,...,2051488.0,,1.617402,84000005.0,USA,67979.497674,,2021-01-01,,


In [4]:
#how many rows and columns
sampledata.shape

(58, 21)

In [5]:
#List the columns
sampledata.columns

Index(['Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_',
       'Confirmed', 'Deaths', 'Recovered', 'Active', 'FIPS', 'Incident_Rate',
       'Total_Test_Results', 'People_Hospitalized', 'Case_Fatality_Ratio',
       'UID', 'ISO3', 'Testing_Rate', 'Hospitalization_Rate', 'Date',
       'People_Tested', 'Mortality_Rate'],
      dtype='object')

Let's check the record for Ohio.

In [6]:
sampledata[sampledata['Province_State']=='Ohio']

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,Total_Test_Results,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate
40,Ohio,US,2021-01-02 05:30:44,40.3888,-82.7649,700380,13766,556106.0,130508.0,39.0,...,8757248.0,,1.965504,84000039.0,USA,7e-06,,2021-01-01,,


## Reading multiple files from a folder

You will often need to combine data from multiple files and then analyze the combined data. Python has a library called glob that finds all pathnames matching a specified pattern. Let's list all the csv files in the folder csse_covid_19_daily_reports_us.

In [7]:
#First we need to import the library glob
import glob
#then we use the glob function in the glob library to search for all files that match the pattern *.csv in the folder csse_covid_19_daily_reports_us
allFiles = glob.glob('../../largedatasets/csse_covid_19_daily_reports_us/*.csv')
#print the first 10 file paths
allFiles[0:10]

['../../largedatasets/csse_covid_19_daily_reports_us\\01-01-2021.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-01-2022.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-02-2021.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-02-2022.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-03-2021.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-03-2022.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-04-2021.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-04-2022.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-05-2021.csv',
 '../../largedatasets/csse_covid_19_daily_reports_us\\01-05-2022.csv']
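
Note that glob returns paths in no guaranteed order, and even an alphabetical listing like the one above interleaves 2021 and 2022 files because the names start with the month. If chronological order matters, one option is to sort the paths by the date encoded in each file name. The sketch below assumes the file names follow the MM-DD-YYYY.csv pattern seen above. The list `paths` and the helper `report_date` are made up for illustration.

```python
from datetime import datetime
from pathlib import Path

# made-up sample paths, following the MM-DD-YYYY.csv naming seen above
paths = [
    'reports/01-01-2021.csv',
    'reports/01-01-2022.csv',
    'reports/01-02-2021.csv',
    'reports/12-31-2020.csv',
]

def report_date(path):
    # Path(path).stem is the file name without the extension, e.g. '01-01-2021'
    return datetime.strptime(Path(path).stem, '%m-%d-%Y')

# sort the paths by the date parsed from each file name
sorted_paths = sorted(paths, key=report_date)
print(sorted_paths)
```

For the analysis in this section the order of the files does not change the results, since we later sort and group the combined data.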

## Combining multiple DataFrames to a single DataFrame

We can use the concat() function in the Pandas library to combine multiple DataFrames. Let's see the details of the concat function using the '?' operator.

![concat](images/concat.png)

In [None]:
pd.concat?

So the concat() function accepts a sequence of DataFrames/Series and concatenates them along a specified axis (index or columns).
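
To make the effect of the axis concrete, here is a minimal sketch on two tiny DataFrames. The data and the variable names (`day1`, `day2`) are made up for illustration and are not part of the Covid-19 dataset.

```python
import pandas as pd

# two small made-up DataFrames with the same columns
day1 = pd.DataFrame({'state': ['Ohio', 'Texas'], 'deaths': [10, 20]})
day2 = pd.DataFrame({'state': ['Ohio', 'Texas'], 'deaths': [12, 25]})

# the default axis=0 stacks the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([day1, day2], ignore_index=True)
print(stacked.shape)

# axis=1 places the DataFrames side by side instead
side_by_side = pd.concat([day1, day2], axis=1)
print(side_by_side.shape)
```

Stacking rows is what we need here, because every daily file has the same columns and contributes additional records.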

Our plan is to
1) Loop through the pathnames stored in allFiles and read each one into a DataFrame using the read_csv() function.
2) Save all the DataFrames in a list.
3) Use pd.concat to create a single DataFrame from all the DataFrames.

In [9]:
# an empty list to store the DataFrames
allDataFrames = []
# now we will loop through each filepath and read them into DataFrames
for fileName in allFiles:
    #read the file using pandas
    data = pd.read_csv(fileName)
    #store the DataFrame to the list
    allDataFrames.append(data)
#now we apply pd.concat to create a single DataFrame containing all the records. We will use the same variable name as the list
allDataFrames = pd.concat(allDataFrames,ignore_index=True)

In [10]:
allDataFrames.shape

(57322, 21)

So there are 57,322 records in our combined dataset. 

Let's use the describe function to see some basic stats about the combined dataset. 

In [None]:
allDataFrames.describe()

Let us look at the datatypes for the columns

In [None]:
allDataFrames.dtypes

## Parsing Dates

The date column available in the dataset is 'Date'. If you check the datatype of this column (using dtypes), you can see that it is stored as a string (the object dtype). This is not particularly helpful if you want to filter or aggregate the data at different time granularities (day, week, month, year, etc.). Pandas has a function called to_datetime, which converts strings to datetime objects. Let us try out pd.to_datetime.

In [None]:
pd.to_datetime?

In [12]:
allDataFrames['Date_converted'] = pd.to_datetime(allDataFrames['Date'])

Let's check the datatype for the new column that we have created

In [13]:
allDataFrames['Date_converted']

0       2021-01-01
1       2021-01-01
2       2021-01-01
3       2021-01-01
4       2021-01-01
           ...    
57317   2021-12-31
57318   2021-12-31
57319   2021-12-31
57320   2021-12-31
57321   2021-12-31
Name: Date_converted, Length: 57322, dtype: datetime64[ns]
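
Real data often contains missing or malformed dates, and some rows shown later in this section have NaT (not a time) in Date_converted. The sketch below shows how to_datetime handles such values, using the `format` and `errors` parameters. The Series `raw` is made-up sample input.

```python
import pandas as pd

# made-up date strings; the last two entries are not usable dates
raw = pd.Series(['2021-01-01', '2021-12-31', 'not a date', None])

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)

# count how many values could not be converted
print(parsed.isna().sum())
```

Checking the number of NaT values after a conversion is a quick way to see how many records will be skipped by any date-based filter.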

Now we can easily filter the dataset based on this new column. For example, we can select all the records for the year 2021.

In [14]:
allDataFrames[allDataFrames['Date_converted'].dt.year==2021]

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted
0,Alabama,US,2021-01-02 05:30:44,32.3182,-86.9023,365747,4872,202137.0,158738.0,1.0,...,,1.332068,84000001.0,USA,,,2021-01-01,,,2021-01-01
1,Alaska,US,2021-01-02 05:30:44,61.3707,-152.4044,47019,206,7165.0,39648.0,2.0,...,,0.438121,84000002.0,USA,174391.185778,,2021-01-01,,,2021-01-01
2,American Samoa,US,2021-01-02 05:30:44,-14.2710,-170.1320,0,0,,,60.0,...,,,16.0,ASM,3846.084722,,2021-01-01,,,2021-01-01
3,Arizona,US,2021-01-02 05:30:44,33.7298,-111.4312,530267,9015,76934.0,444318.0,4.0,...,,1.700087,84000004.0,USA,39551.860582,,2021-01-01,,,2021-01-01
4,Arkansas,US,2021-01-02 05:30:44,34.9697,-92.3731,229442,3711,199247.0,26484.0,5.0,...,,1.617402,84000005.0,USA,67979.497674,,2021-01-01,,,2021-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57317,Virginia,US,2022-01-01 04:32:44,37.7693,-78.1700,1118518,15587,,,51.0,...,,1.393540,84000051.0,USA,130970.524464,,2021-12-31,,,2021-12-31
57318,Washington,US,2022-01-01 04:32:44,47.4009,-121.4905,849075,9853,,,53.0,...,,1.160439,84000053.0,USA,130778.607132,,2021-12-31,,,2021-12-31
57319,West Virginia,US,2022-01-01 04:32:44,38.4912,-80.9545,328162,5336,,,54.0,...,,1.626026,84000054.0,USA,275536.772374,,2021-12-31,,,2021-12-31
57320,Wisconsin,US,2022-01-01 04:32:44,44.2685,-89.6165,1120663,11173,,,55.0,...,,0.996999,84000055.0,USA,0.000023,,2021-12-31,,,2021-12-31


Or for January 2021

In [15]:
allDataFrames[(allDataFrames['Date_converted'].dt.year==2021)&(allDataFrames['Date_converted'].dt.month==1)]

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted
0,Alabama,US,2021-01-02 05:30:44,32.3182,-86.9023,365747,4872,202137.0,158738.0,1.0,...,,1.332068,84000001.0,USA,,,2021-01-01,,,2021-01-01
1,Alaska,US,2021-01-02 05:30:44,61.3707,-152.4044,47019,206,7165.0,39648.0,2.0,...,,0.438121,84000002.0,USA,174391.185778,,2021-01-01,,,2021-01-01
2,American Samoa,US,2021-01-02 05:30:44,-14.2710,-170.1320,0,0,,,60.0,...,,,16.0,ASM,3846.084722,,2021-01-01,,,2021-01-01
3,Arizona,US,2021-01-02 05:30:44,33.7298,-111.4312,530267,9015,76934.0,444318.0,4.0,...,,1.700087,84000004.0,USA,39551.860582,,2021-01-01,,,2021-01-01
4,Arkansas,US,2021-01-02 05:30:44,34.9697,-92.3731,229442,3711,199247.0,26484.0,5.0,...,,1.617402,84000005.0,USA,67979.497674,,2021-01-01,,,2021-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3533,Virginia,US,2021-02-01 05:30:44,37.7693,-78.1700,504779,6464,40322.0,457993.0,51.0,...,,1.280560,84000051.0,USA,61322.047318,,2021-01-31,,,2021-01-31
3534,Washington,US,2021-02-01 05:30:44,47.4009,-121.4905,311597,4285,,,53.0,...,,1.375174,84000053.0,USA,59728.206293,,2021-01-31,,,2021-01-31
3535,West Virginia,US,2021-02-01 05:30:44,38.4912,-80.9545,121001,2024,97782.0,21195.0,54.0,...,,1.672713,84000054.0,USA,108561.351273,,2021-01-31,,,2021-01-31
3536,Wisconsin,US,2021-02-01 05:30:44,44.2685,-89.6165,592140,6434,517169.0,68537.0,55.0,...,,1.086567,84000055.0,USA,0.000011,,2021-01-31,,,2021-01-31


Or between January 01 2021 and December 31 2021 (yearly)

In [16]:
allDataFrames[(allDataFrames['Date_converted']>='2021-01-01')&(allDataFrames['Date_converted']<='2021-12-31')]

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted
0,Alabama,US,2021-01-02 05:30:44,32.3182,-86.9023,365747,4872,202137.0,158738.0,1.0,...,,1.332068,84000001.0,USA,,,2021-01-01,,,2021-01-01
1,Alaska,US,2021-01-02 05:30:44,61.3707,-152.4044,47019,206,7165.0,39648.0,2.0,...,,0.438121,84000002.0,USA,174391.185778,,2021-01-01,,,2021-01-01
2,American Samoa,US,2021-01-02 05:30:44,-14.2710,-170.1320,0,0,,,60.0,...,,,16.0,ASM,3846.084722,,2021-01-01,,,2021-01-01
3,Arizona,US,2021-01-02 05:30:44,33.7298,-111.4312,530267,9015,76934.0,444318.0,4.0,...,,1.700087,84000004.0,USA,39551.860582,,2021-01-01,,,2021-01-01
4,Arkansas,US,2021-01-02 05:30:44,34.9697,-92.3731,229442,3711,199247.0,26484.0,5.0,...,,1.617402,84000005.0,USA,67979.497674,,2021-01-01,,,2021-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57317,Virginia,US,2022-01-01 04:32:44,37.7693,-78.1700,1118518,15587,,,51.0,...,,1.393540,84000051.0,USA,130970.524464,,2021-12-31,,,2021-12-31
57318,Washington,US,2022-01-01 04:32:44,47.4009,-121.4905,849075,9853,,,53.0,...,,1.160439,84000053.0,USA,130778.607132,,2021-12-31,,,2021-12-31
57319,West Virginia,US,2022-01-01 04:32:44,38.4912,-80.9545,328162,5336,,,54.0,...,,1.626026,84000054.0,USA,275536.772374,,2021-12-31,,,2021-12-31
57320,Wisconsin,US,2022-01-01 04:32:44,44.2685,-89.6165,1120663,11173,,,55.0,...,,0.996999,84000055.0,USA,0.000023,,2021-12-31,,,2021-12-31
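
The same kind of range filter can also be written with the Series.between() method, which some readers find easier to read than two comparisons joined with &. This is a minimal sketch on made-up data; only the column names mirror the ones used above.

```python
import pandas as pd

# made-up daily records spanning three calendar years
df = pd.DataFrame({
    'Date_converted': pd.to_datetime(
        ['2020-12-31', '2021-01-01', '2021-06-15', '2021-12-31', '2022-01-01']
    ),
    'Deaths': [5, 6, 9, 14, 15],
})

# between() is inclusive on both ends by default
in_2021 = df[df['Date_converted'].between('2021-01-01', '2021-12-31')]
print(in_2021)
```

Keep in mind that the upper bound '2021-12-31' means midnight at the start of that day, so a timestamp later on December 31 would fall outside the range. That is not an issue here because Date_converted holds dates without a time component.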


## Aggregating Data using Grouping (groupby)

Categorizing a dataset into groups and then applying a function to each group can be a critical component of a data analysis workflow.

You can think of Grouping Operations in Pandas as a combination of split-apply-combine. In the first stage of the process, data contained in a pandas object (DataFrame/Series), is split into groups based on one or more keys that you provide. Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object.

The figure shown below gives you a better idea about the split-apply-combine paradigm. 

![split-apply-combine](images/split-apply-combine.png)
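
To see the three stages in code, the sketch below performs split-apply-combine by hand with a loop and then repeats it with the groupby() method that is introduced next. The data is made up for illustration.

```python
import pandas as pd

# made-up cumulative counts for two states
df = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Texas', 'Texas'],
    'deaths': [10, 12, 20, 25],
})

# split: one subset per state; apply: max of each subset; combine: collect in a dict
manual = {}
for state in df['state'].unique():
    subset = df[df['state'] == state]
    manual[state] = subset['deaths'].max()

# groupby performs the same three steps in one expression
grouped = df.groupby('state')['deaths'].max()

print(manual)
print(grouped.to_dict())
```

Both approaches give the same numbers. The groupby version is shorter and is usually the better choice for large datasets.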

In pandas we use the groupby() method to group a DataFrame/Series based on a key or a combination of keys. Let's group our dataset based on state name (Province_State).

In [20]:
stateGroups = allDataFrames.groupby('Province_State')
stateGroups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002A4211B1EA0>

We can check the number of groups using the len() function

In [21]:
len(stateGroups)

59

You can loop through each group using a for loop.

In [22]:
for key, group in stateGroups:
    print ('This is the',key,'group and it has',len(group),'records')

This is the Alabama group and it has 988 records
This is the Alaska group and it has 988 records
This is the American Samoa group and it has 988 records
This is the Arizona group and it has 988 records
This is the Arkansas group and it has 988 records
This is the California group and it has 988 records
This is the Colorado group and it has 988 records
This is the Connecticut group and it has 988 records
This is the Delaware group and it has 988 records
This is the Diamond Princess group and it has 988 records
This is the District of Columbia group and it has 988 records
This is the Florida group and it has 988 records
This is the Georgia group and it has 988 records
This is the Grand Princess group and it has 988 records
This is the Guam group and it has 988 records
This is the Hawaii group and it has 988 records
This is the Idaho group and it has 988 records
This is the Illinois group and it has 988 records
This is the Indiana group and it has 988 records
This is the Iowa group and it

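You might be confused by the two loop variables (until now we have only seen one loop variable). This works because each item produced by the loop is a pair of values, and Python unpacks the pair into the two variables. Let's see a small example.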

In [23]:
# a list of lists
records = [[1,'John'],[10,'Jay'],[34,'Sam']]
# now we can use two loop variables
for idValue,name in records:
    print (idValue,'-->',name)

1 --> John
10 --> Jay
34 --> Sam


Back to our groupby example. We can retrieve an individual group using the get_group() method. Let's retrieve the Ohio group.

In [24]:
ohGroup = stateGroups.get_group('Ohio')

Can you show another way to retrieve all the records from Ohio?

If we want to calculate the total number of deaths in Ohio, we should retrieve the maximum value of the Deaths column, as the data is cumulative.

In [25]:
ohGroup['Deaths'].max()

40840

We can apply the same logic to retrieve the total number of deaths for every state by applying the max() method to the Deaths column of the grouped object.

In [26]:
stateGroups['Deaths'].max()

Province_State
Alabama                     20737
Alaska                       1455
American Samoa                 39
Arizona                     32038
Arkansas                    12682
California                  98385
Colorado                    13841
Connecticut                 11733
Delaware                     3205
Diamond Princess                0
District of Columbia         1411
Florida                     83606
Georgia                     41361
Grand Princess                  3
Guam                          411
Hawaii                       1758
Idaho                        5303
Illinois                    40610
Indiana                     25303
Iowa                        10389
Kansas                       9709
Kentucky                    17625
Louisiana                   18345
Maine                        2779
Maryland                    15905
Massachusetts               22793
Michigan                    40657
Minnesota                   14273
Mississippi                 13083

And then sort them by value

In [27]:
stateGroups['Deaths'].max().sort_values(ascending=False)

Province_State
California                  98385
Texas                       91716
Florida                     83606
New York                    74845
Pennsylvania                48798
Georgia                     41361
Ohio                        40840
Michigan                    40657
Illinois                    40610
New Jersey                  35425
Arizona                     32038
Tennessee                   28545
North Carolina              27629
Indiana                     25303
Massachusetts               22793
Virginia                    22649
Missouri                    22209
Alabama                     20737
South Carolina              18811
Louisiana                   18345
Kentucky                    17625
Oklahoma                    17383
Maryland                    15905
Wisconsin                   15767
Washington                  14991
Minnesota                   14273
Colorado                    13841
Mississippi                 13083
Arkansas                    12682

In [28]:
ohGroup.tail()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,People_Hospitalized,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted
57072,Ohio,US,2021-12-30 04:33:59,40.3888,-82.7649,1975723,31637,,,39.0,...,,1.59658,84000039.0,USA,2.2e-05,,2021-12-29,,,2021-12-29
57130,Ohio,US,2020-12-31 05:30:27,40.3888,-82.7649,690748,13469,546305.0,130974.0,39.0,...,,1.949915,84000039.0,USA,7e-06,,2020-12-30,,,2020-12-30
57188,Ohio,US,2021-12-31 04:33:06,40.3888,-82.7649,1995497,31770,,,39.0,...,,1.587474,84000039.0,USA,2.2e-05,,2021-12-30,,,2021-12-30
57246,Ohio,US,2021-01-01 05:30:27,40.3888,-82.7649,700380,13621,556106.0,130653.0,39.0,...,,1.944801,84000039.0,USA,7e-06,,2020-12-31,,,2020-12-31
57304,Ohio,US,2022-01-01 04:32:44,40.3888,-82.7649,2016082,31898,,,39.0,...,,1.577565,84000039.0,USA,2.2e-05,,2021-12-31,,,2021-12-31


Now can you do the same for the confirmed cases?

Since we have cumulative totals for deaths, what if we want to see daily totals? For this we can use the diff() method, which subtracts from each value the value a given number of rows earlier (one row by default). Before applying this method we need to make sure that the rows are in order. Here we sort by the 'Deaths' value itself, which matches chronological order as long as the cumulative count never decreases.

In [29]:
ohGroup_sorted = ohGroup.sort_values(by='Deaths')

In [30]:
ohGroup_sorted['Deaths'].diff()

11756     NaN
11931    36.0
12106    35.0
12281    36.0
12456    34.0
         ... 
55390     0.0
55216     0.0
55042     0.0
56608     0.0
55912     0.0
Name: Deaths, Length: 988, dtype: float64
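
The cell above sorts by the cumulative 'Deaths' value. An alternative is to sort by a date column, which keeps the rows in true chronological order even if a cumulative count is ever revised downwards. The following is a minimal sketch on made-up data, assuming every row has a valid date.

```python
import pandas as pd

# made-up cumulative death counts, deliberately out of chronological order
df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-03', '2021-01-01', '2021-01-02']),
    'Deaths': [18, 10, 13],
})

# sort chronologically, then subtract the previous row's value from each row
df = df.sort_values(by='Date')
df['Daily_Deaths'] = df['Deaths'].diff()
print(df)
```

The first row has no previous value to subtract, so diff() returns NaN for it, just as in the Ohio output above.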

You can add this as a new column in the group

In [31]:
ohGroup_sorted['Daily_Deaths_Total'] = ohGroup_sorted['Deaths'].diff()

In [33]:
ohGroup_sorted

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted,Daily_Deaths_Total
11756,Ohio,US,2020-04-12 23:18:15,40.3888,-82.7649,6604,363,,,39.0,...,,84000039.0,USA,5.973514e-08,29.497274,2020-04-12,63243.0,5.496669,2020-04-12,
11931,Ohio,US,2020-04-13 23:07:54,40.3888,-82.7649,6975,399,,,39.0,...,,84000039.0,USA,6.187559e-08,29.146953,2020-04-13,65112.0,5.720430,2020-04-13,36.0
12106,Ohio,US,2020-04-14 23:33:31,40.3888,-82.7649,7285,434,,,39.0,...,,84000039.0,USA,6.496651e-08,29.595058,2020-04-14,67874.0,5.957447,2020-04-14,35.0
12281,Ohio,US,2020-04-15 22:56:51,40.3888,-82.7649,7794,470,,,39.0,...,,84000039.0,USA,6.839449e-08,28.701565,2020-04-15,71552.0,6.030280,2020-04-15,36.0
12456,Ohio,US,2020-04-16 23:30:51,40.3888,-82.7649,8414,504,,,39.0,...,,84000039.0,USA,7.171125e-08,27.703827,2020-04-16,74840.0,5.990017,2020-04-16,34.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55390,Ohio,US,2022-12-19 04:31:41,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0
55216,Ohio,US,2022-12-18 04:31:17,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0
55042,Ohio,US,2022-12-17 04:32:17,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0
56608,Ohio,US,2022-12-26 04:31:20,40.3888,-82.7649,3294521,40840,,,39.0,...,1.239634,84000039.0,USA,,,,,,NaT,0.0


Now you can find the day with the highest number of deaths in Ohio.

In [34]:
ohGroup_sorted.sort_values(by='Daily_Deaths_Total',ascending=False)

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,Case_Fatality_Ratio,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted,Daily_Deaths_Total
54926,Ohio,US,2020-12-17 05:30:43,40.3888,-82.7649,584766,10979,416028.0,157759.0,39.0,...,1.877503,84000039.0,USA,6.692936e-06,,2020-12-16,,,2020-12-16,211.0
55274,Ohio,US,2020-12-19 05:30:27,40.3888,-82.7649,605862,11396,430621.0,163845.0,39.0,...,1.880956,84000039.0,USA,6.828255e-06,,2020-12-18,,,2020-12-18,209.0
55100,Ohio,US,2020-12-18 05:30:33,40.3888,-82.7649,596178,11187,426525.0,158466.0,39.0,...,1.876453,84000039.0,USA,6.761601e-06,,2020-12-17,,,2020-12-17,208.0
55970,Ohio,US,2020-12-23 05:30:33,40.3888,-82.7649,637032,12165,467570.0,157297.0,39.0,...,1.909637,84000039.0,USA,7.014998e-06,,2020-12-22,,,2020-12-22,199.0
53882,Ohio,US,2020-12-11 05:30:31,40.3888,-82.7649,531850,9848,361308.0,160694.0,39.0,...,1.851650,84000039.0,USA,6.347023e-06,,2020-12-10,,,2020-12-10,194.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55216,Ohio,US,2022-12-18 04:31:17,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0
55042,Ohio,US,2022-12-17 04:32:17,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0
56608,Ohio,US,2022-12-26 04:31:20,40.3888,-82.7649,3294521,40840,,,39.0,...,1.239634,84000039.0,USA,,,,,,NaT,0.0
55912,Ohio,US,2022-12-22 04:31:45,40.3888,-82.7649,3276630,40840,,,39.0,...,1.243564,84000039.0,USA,,,,,,NaT,0.0


As you can see, the highest daily death counts in Ohio occurred during the 2nd and 3rd weeks of December 2020.
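
If you only need the single peak day rather than the full sorted table, idxmax() returns the index label of the largest value. This is a small sketch on a made-up Series of daily counts indexed by date, not the actual Ohio data.

```python
import pandas as pd

# made-up daily death counts indexed by date
daily = pd.Series(
    [34.0, 211.0, 199.0],
    index=pd.to_datetime(['2020-12-10', '2020-12-16', '2020-12-22']),
)

# idxmax() returns the index label (here a date) of the largest value
peak_day = daily.idxmax()
print(peak_day.date(), daily.max())
```

To use this on a DataFrame such as ohGroup_sorted, the date column would first have to be set as the index, for example with set_index().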

Now can you do the same for Pennsylvania and check for spikes in Covid deaths?

Before we move on to the next section, we will calculate monthly Covid-19 death totals for Ohio. To do that, we can apply groupby to the Ohio group using year and month as the keys. Since we don't have year and month as separate columns, we can create them.

In [35]:
# work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
ohGroup = ohGroup.copy()
ohGroup['year'] = ohGroup['Date_converted'].dt.year
ohGroup['month'] = ohGroup['Date_converted'].dt.month


In [36]:
#use year and month as keys
ohDateGrouped = ohGroup.groupby(['year','month'])

Now if you want just the data for March 2021, pass the (year, month) key to get_group() as a tuple.

In [37]:
ohDateGrouped.get_group((2021,3))

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,...,UID,ISO3,Testing_Rate,Hospitalization_Rate,Date,People_Tested,Mortality_Rate,Date_converted,year,month
6884,Ohio,US,2021-03-02 05:30:53,40.3888,-82.7649,968874,18720,911474.0,38677.0,39.0,...,84000039.0,USA,1e-05,,2021-03-01,,,2021-03-01,2021.0,3.0
7000,Ohio,US,2021-03-03 05:30:49,40.3888,-82.7649,970583,18749,914893.0,36938.0,39.0,...,84000039.0,USA,1e-05,,2021-03-02,,,2021-03-02,2021.0,3.0
7116,Ohio,US,2021-03-04 05:30:41,40.3888,-82.7649,972605,18779,916592.0,37231.0,39.0,...,84000039.0,USA,1e-05,,2021-03-03,,,2021-03-03,2021.0,3.0
7232,Ohio,US,2021-03-05 05:31:07,40.3888,-82.7649,974480,18806,919296.0,36375.0,39.0,...,84000039.0,USA,1.1e-05,,2021-03-04,,,2021-03-04,2021.0,3.0
7348,Ohio,US,2021-03-06 04:30:38,40.3888,-82.7649,976230,18832,921707.0,35689.0,39.0,...,84000039.0,USA,1.1e-05,,2021-03-05,,,2021-03-05,2021.0,3.0
7464,Ohio,US,2021-03-07 05:31:30,40.3888,-82.7649,977736,18853,921707.0,37174.0,39.0,...,84000039.0,USA,1.1e-05,,2021-03-06,,,2021-03-06,2021.0,3.0
7580,Ohio,US,2021-03-08 05:31:09,40.3888,-82.7649,978471,18879,,,39.0,...,84000039.0,USA,1.1e-05,,2021-03-07,,,2021-03-07,2021.0,3.0
7696,Ohio,US,2021-03-09 05:30:55,40.3888,-82.7649,979725,18902,,,39.0,...,84000039.0,USA,1.1e-05,,2021-03-08,,,2021-03-08,2021.0,3.0
7812,Ohio,US,2021-03-10 05:31:08,40.3888,-82.7649,981618,18924,,,39.0,...,84000039.0,USA,1.1e-05,,2021-03-09,,,2021-03-09,2021.0,3.0
7928,Ohio,US,2021-03-11 05:30:58,40.3888,-82.7649,983486,18949,,,39.0,...,84000039.0,USA,1.1e-05,,2021-03-10,,,2021-03-10,2021.0,3.0


If you want to get the monthly death totals you need to apply the max() method as the death totals are cumulative. 

In [38]:
monthlyDeathTotals = ohDateGrouped['Deaths'].max()

In [39]:
monthlyDeathTotals

year    month
2020.0  4.0       1083
        5.0       2134
        6.0       2601
        7.0       3209
        8.0       3913
        9.0       4471
        10.0      5325
        11.0      8084
        12.0     13621
2021.0  1.0      17325
        2.0      18691
        3.0      19291
        4.0      19846
        5.0      20306
        6.0      20541
        7.0      20706
        8.0      21442
        9.0      23616
        10.0     26100
        11.0     28141
        12.0     31898
2022.0  1.0      36289
        2.0      38022
        3.0      38497
        4.0      38646
        5.0      38817
        6.0      39045
        7.0      39347
        8.0      39775
        9.0      40010
Name: Deaths, dtype: int64

This is a Series object with a multi-level index (a MultiIndex). If you want to convert it into a DataFrame, you can use the reset_index() method.

In [40]:
monthlyDeathTotals = monthlyDeathTotals.reset_index()

In [41]:
monthlyDeathTotals

Unnamed: 0,year,month,Deaths
0,2020.0,4.0,1083
1,2020.0,5.0,2134
2,2020.0,6.0,2601
3,2020.0,7.0,3209
4,2020.0,8.0,3913
5,2020.0,9.0,4471
6,2020.0,10.0,5325
7,2020.0,11.0,8084
8,2020.0,12.0,13621
9,2021.0,1.0,17325
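
The relationship between the MultiIndex and reset_index() is easier to see on a small example. The sketch below uses made-up numbers; only the column names follow the ones used above.

```python
import pandas as pd

# made-up cumulative counts with year and month columns
df = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'month': [11, 12, 1, 1],
    'Deaths': [80, 136, 150, 173],
})

# grouping by two keys yields a Series indexed by (year, month) pairs
totals = df.groupby(['year', 'month'])['Deaths'].max()
print(totals.index.nlevels)
print(totals.loc[(2021, 1)])

# reset_index() moves both index levels back into ordinary columns
flat = totals.reset_index()
print(list(flat.columns))
```

While the result is still a Series, individual values are looked up with a tuple key, in the same way that get_group() took a tuple earlier.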


Now if we want to see the month-to-month change in the totals, we can apply the diff() method.

In [42]:
monthlyDeathTotals['change'] = monthlyDeathTotals.Deaths.diff()
monthlyDeathTotals

Unnamed: 0,year,month,Deaths,change
0,2020.0,4.0,1083,
1,2020.0,5.0,2134,1051.0
2,2020.0,6.0,2601,467.0
3,2020.0,7.0,3209,608.0
4,2020.0,8.0,3913,704.0
5,2020.0,9.0,4471,558.0
6,2020.0,10.0,5325,854.0
7,2020.0,11.0,8084,2759.0
8,2020.0,12.0,13621,5537.0
9,2021.0,1.0,17325,3704.0
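
As an alternative to creating separate year and month columns, a datetime column can be converted to monthly periods and used directly as the grouping key. This is a sketch of that variation on made-up data, not the approach used in the cells above.

```python
import pandas as pd

# made-up cumulative counts on a few dates
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-11-15', '2020-11-30', '2020-12-31', '2021-01-31']),
    'Deaths': [60, 80, 136, 173],
})

# to_period('M') maps each date to its calendar month, e.g. 2020-11
monthly = df.groupby(df['Date'].dt.to_period('M'))['Deaths'].max()

# diff() gives the increase from one month to the next
change = monthly.diff()
print(change)
```

Note that diff() compares consecutive rows, so if a month were missing from the data the difference would span more than one month.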


In the next section we will learn about joining multiple files (not concatenating) which is a key data analysis procedure. 