## Some data exploration with `pandas`
* find datasets at: https://datasetsearch.research.google.com/
* see this repo: https://github.com/clownfragment/covid-19-exploration-with-pandas




In [1]:
# imports
import pandas as pd

### COVID-19 data from data.europa.eu
* https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data

In [3]:
# import .csv into a dataframe (df)
df = pd.read_csv('data/2020_04_07__data_europa_eu__covid_19_data.csv')

### Basics of your data
* `object` usually means the column holds strings or some other non-numeric value
* `pandas` infers the datatype of each column when it reads the file

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9310 entries, 0 to 9309
Data columns (total 10 columns):
dateRep                    9310 non-null object
day                        9310 non-null int64
month                      9310 non-null int64
year                       9310 non-null int64
cases                      9310 non-null int64
deaths                     9310 non-null int64
countriesAndTerritories    9310 non-null object
geoId                      9286 non-null object
countryterritoryCode       9125 non-null object
popData2018                9172 non-null float64
dtypes: float64(1), int64(5), object(4)
memory usage: 727.4+ KB
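
The `info()` output lists `dateRep` as `object`, which means the dates were loaded as plain strings. Here is a minimal sketch of converting them to real datetimes, assuming the day/month/year layout holds for every row in the file. The `sample` frame and the new `date` column are illustrative names that are not part of the original notebook.

```python
import pandas as pd

# A few dateRep/cases values in the same layout as the file (illustrative subset)
sample = pd.DataFrame({
    "dateRep": ["07/04/2020", "06/04/2020", "31/03/2020"],
    "cases": [38, 29, 27],
})

# Before conversion: dateRep is a string column
print(sample.dtypes)

# The day comes first in this file, so state the format explicitly
# instead of letting pandas guess between day-first and month-first
sample["date"] = pd.to_datetime(sample["dateRep"], format="%d/%m/%Y")

# After conversion: date is a datetime column that supports date arithmetic
print(sample.dtypes)
print(sample["date"].dt.month.tolist())
```

With a datetime column in place, sorting and filtering by date work on actual dates rather than on string order.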


In [14]:
# get the dimensions of the dataframe
df.shape

(9310, 10)

In [15]:
df.columns

Index(['dateRep', 'day', 'month', 'year', 'cases', 'deaths',
       'countriesAndTerritories', 'geoId', 'countryterritoryCode',
       'popData2018'],
      dtype='object')

In [13]:
# give some brief summary stats
df.describe()

| | day | month | year | cases | deaths | popData2018 |
|---|---|---|---|---|---|---|
| count | 9310.0 | 9310.0 | 9310.0 | 9310.0 | 9310.0 | 9172.0 |
| mean | 15.713426 | 2.561117 | 2019.992803 | 141.459506 | 7.955532 | 64882720.0 |
| std | 9.462419 | 1.281713 | 0.084531 | 1104.27054 | 66.660811 | 202613400.0 |
| min | 1.0 | 1.0 | 2019.0 | -9.0 | 0.0 | 1000.0 |
| 25% | 6.0 | 2.0 | 2020.0 | 0.0 | 0.0 | 3731000.0 |
| 50% | 16.0 | 3.0 | 2020.0 | 0.0 | 0.0 | 10627160.0 |
| 75% | 24.0 | 3.0 | 2020.0 | 12.0 | 0.0 | 44494500.0 |
| max | 31.0 | 12.0 | 2020.0 | 34272.0 | 2004.0 | 1392730000.0 |


##### Be sure these summary stats are meaningful given your data
* e.g. the mean of `day`, `month`, or `year` is not helpful
* the mean of `cases` is not helpful because different countries started reporting at different times
* the max of `cases` seems helpful
    * does a one-day count of 34,272 cases for a single country make sense?
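
One way to get more meaningful numbers is to aggregate per country and to inspect the row behind a suspicious maximum. The sketch below uses a handful of Afghanistan and Zimbabwe rows from this dataset rather than the full file; the names `toy`, `totals`, and `peak` are illustrative.

```python
import pandas as pd

# Small subset of rows from the dataset, enough to show the pattern
toy = pd.DataFrame({
    "countriesAndTerritories": ["Afghanistan", "Afghanistan", "Afghanistan",
                                "Zimbabwe", "Zimbabwe"],
    "dateRep": ["07/04/2020", "06/04/2020", "05/04/2020",
                "29/03/2020", "28/03/2020"],
    "cases": [38, 29, 35, 2, 2],
    "deaths": [0, 2, 1, 0, 0],
})

# Totals per country are easier to interpret than a mean over all rows
totals = toy.groupby("countriesAndTerritories")[["cases", "deaths"]].sum()
print(totals)

# Pull the full row behind the maximum to see which country and date it belongs to
peak = toy.loc[toy["cases"].idxmax()]
print(peak)
```

On the full dataframe, the same `idxmax()` lookup would show which country and day produced the 34,272 figure, which is the quickest way to judge whether it is plausible.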

In [10]:
# view top 10 rows
df.head(10)

| | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2018 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/04/2020 | 7 | 4 | 2020 | 38 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 1 | 06/04/2020 | 6 | 4 | 2020 | 29 | 2 | Afghanistan | AF | AFG | 37172386.0 |
| 2 | 05/04/2020 | 5 | 4 | 2020 | 35 | 1 | Afghanistan | AF | AFG | 37172386.0 |
| 3 | 04/04/2020 | 4 | 4 | 2020 | 0 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 4 | 03/04/2020 | 3 | 4 | 2020 | 43 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 5 | 02/04/2020 | 2 | 4 | 2020 | 26 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 6 | 01/04/2020 | 1 | 4 | 2020 | 25 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 7 | 31/03/2020 | 31 | 3 | 2020 | 27 | 0 | Afghanistan | AF | AFG | 37172386.0 |
| 8 | 30/03/2020 | 30 | 3 | 2020 | 8 | 1 | Afghanistan | AF | AFG | 37172386.0 |
| 9 | 29/03/2020 | 29 | 3 | 2020 | 15 | 1 | Afghanistan | AF | AFG | 37172386.0 |


#### Notes
* it looks like this data is in alphabetical order by country
    * secondarily, within each country it looks like it's in reverse chronological order
* each row is a daily report of cases and deaths for one country
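
Eyeballing ten rows only suggests an ordering, so it can be checked programmatically. This is a sketch on a few rows from the dataset, assuming `dateRep` parses as day/month/year; `rows`, `countries_sorted`, and `dates_descending` are illustrative names.

```python
import pandas as pd

# A few rows in the order they appear in the file
rows = pd.DataFrame({
    "countriesAndTerritories": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "dateRep": ["07/04/2020", "06/04/2020", "30/03/2020", "29/03/2020"],
})
rows["date"] = pd.to_datetime(rows["dateRep"], format="%d/%m/%Y")

# Are the country names in ascending alphabetical order?
countries_sorted = rows["countriesAndTerritories"].is_monotonic_increasing

# Within every country, do the dates run from newest to oldest?
dates_descending = (
    rows.groupby("countriesAndTerritories")["date"]
    .apply(lambda s: s.is_monotonic_decreasing)
    .all()
)

print(countries_sorted, dates_descending)
```

Running the same two checks on the full `df` would confirm or refute both observations for all 9,310 rows.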

In [11]:
# view last 10 rows
df.tail(10)

| | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2018 |
|---|---|---|---|---|---|---|---|---|---|---|
| 9300 | 30/03/2020 | 30 | 3 | 2020 | 0 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9301 | 29/03/2020 | 29 | 3 | 2020 | 2 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9302 | 28/03/2020 | 28 | 3 | 2020 | 2 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9303 | 27/03/2020 | 27 | 3 | 2020 | 0 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9304 | 26/03/2020 | 26 | 3 | 2020 | 1 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9305 | 25/03/2020 | 25 | 3 | 2020 | 0 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9306 | 24/03/2020 | 24 | 3 | 2020 | 0 | 1 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9307 | 23/03/2020 | 23 | 3 | 2020 | 0 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9308 | 22/03/2020 | 22 | 3 | 2020 | 1 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |
| 9309 | 21/03/2020 | 21 | 3 | 2020 | 1 | 0 | Zimbabwe | ZW | ZWE | 14439018.0 |


##### When does it seem that the EU started receiving COVID-19 reports from Zimbabwe?
- judging by the last row of the `tail()` output, it looks like March 21, 2020
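
Instead of reading the first report date off the tail of the table, it can be computed for every country at once. This is a sketch on a few rows from the dataset, assuming day/month/year dates; `rows` and `first_report` are illustrative names. The Afghanistan date it prints is only the earliest one in this small sample, not that country's real first report.

```python
import pandas as pd

# The Zimbabwe rows include its oldest entry in the file; the Afghanistan rows are a partial sample
rows = pd.DataFrame({
    "countriesAndTerritories": ["Afghanistan", "Afghanistan",
                                "Zimbabwe", "Zimbabwe", "Zimbabwe"],
    "dateRep": ["30/03/2020", "29/03/2020",
                "23/03/2020", "22/03/2020", "21/03/2020"],
})
rows["date"] = pd.to_datetime(rows["dateRep"], format="%d/%m/%Y")

# Earliest report date per country
first_report = rows.groupby("countriesAndTerritories")["date"].min()
print(first_report)
```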

### Dump the dataframe to an Excel spreadsheet

In [None]:
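# write .xlsx (needs the openpyxl package); recent pandas versions can no longer write the legacy .xls format
df.to_excel('data/2020_04_07__data_europa_eu__covid_19_data.xlsx')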

## Notation Methods

In [17]:
# dot notation
df.dateRep.head()

0    07/04/2020
1    06/04/2020
2    05/04/2020
3    04/04/2020
4    03/04/2020
Name: dateRep, dtype: object

In [18]:
# bracket notation
df['dateRep'].head()

0    07/04/2020
1    06/04/2020
2    05/04/2020
3    04/04/2020
4    03/04/2020
Name: dateRep, dtype: object
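
Both notations return the same column here, but dot notation breaks down when a column name collides with a DataFrame method or is not a valid Python identifier. A short sketch follows; the `count` and `new cases` columns are hypothetical and do not exist in this dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    "dateRep": ["07/04/2020", "06/04/2020"],
    "count": [38, 29],       # hypothetical name that collides with DataFrame.count()
    "new cases": [38, 29],   # hypothetical name containing a space
})

# For an ordinary column name, the two notations are interchangeable
same = demo.dateRep.equals(demo["dateRep"])
print(same)

# demo.count resolves to the DataFrame method, not the column
print(callable(demo.count))

# Bracket notation always reaches the column
print(demo["count"].tolist())
print(demo["new cases"].tolist())
```

Bracket notation is therefore the safer default, and it is also the only form that works when creating a new column.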

In [19]:
df['dateRep'].describe()

count           9310
unique            99
top       06/04/2020
freq             203
Name: dateRep, dtype: object

##### How many unique countries have reported?

In [20]:
df['countriesAndTerritories'].describe()

count      9310
unique      204
top       China
freq         99
Name: countriesAndTerritories, dtype: object
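
`describe()` reports the unique count alongside other fields. When only the count or the rows per country are needed, `nunique()` and `value_counts()` answer directly. This is a sketch on a tiny illustrative frame; `rows`, `n_countries`, and `per_country` are names chosen for the example.

```python
import pandas as pd

# Tiny stand-in for the country column of the full dataframe
rows = pd.DataFrame({
    "countriesAndTerritories": ["Afghanistan", "Afghanistan", "Afghanistan",
                                "Zimbabwe", "Zimbabwe"],
})

# Number of distinct countries
n_countries = rows["countriesAndTerritories"].nunique()
print(n_countries)

# Number of rows (daily reports) per country, largest first
per_country = rows["countriesAndTerritories"].value_counts()
print(per_country)
```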

##### What are the unique countries that have reported?

In [22]:
# use double quotes for the f-string so the column name can keep its single quotes
print(f"{df['countriesAndTerritories'].nunique()} countries have reported")
df['countriesAndTerritories'].unique()