# Data Gather scratch

Let's use the very nice [COVID19Py](https://github.com/Kamaropoulos/COVID19Py) package to get COVID-19 time series from around the world as JSON.

In [1]:
import COVID19Py

And then we need to instantiate the parser (for example):

In [2]:
covid19 = COVID19Py.COVID19()

We have to choose which data source we want. In our case, we are interested in the JHU database:

In [3]:
covid19 = COVID19Py.COVID19(data_source="jhu")

Now, we can get time series by country:

In [4]:
location = covid19.getLocationByCountryCode("CA", timelines=True)

location

[{'id': 35,
  'country': 'Canada',
  'country_code': 'CA',
  'country_population': 33679000,
  'province': 'Alberta',
  'last_updated': '2020-03-28T17:55:48.094294Z',
  'coordinates': {'latitude': '53.9333', 'longitude': '-116.5765'},
  'latest': {'confirmed': 542, 'deaths': 2, 'recovered': 0},
  'timelines': {'confirmed': {'latest': 542,
    'timeline': {'2020-01-22T00:00:00Z': 0,
     '2020-01-23T00:00:00Z': 0,
     '2020-01-24T00:00:00Z': 0,
     '2020-01-25T00:00:00Z': 0,
     '2020-01-26T00:00:00Z': 0,
     '2020-01-27T00:00:00Z': 0,
     '2020-01-28T00:00:00Z': 0,
     '2020-01-29T00:00:00Z': 0,
     '2020-01-30T00:00:00Z': 0,
     '2020-01-31T00:00:00Z': 0,
     '2020-02-01T00:00:00Z': 0,
     '2020-02-02T00:00:00Z': 0,
     '2020-02-03T00:00:00Z': 0,
     '2020-02-04T00:00:00Z': 0,
     '2020-02-05T00:00:00Z': 0,
     '2020-02-06T00:00:00Z': 0,
     '2020-02-07T00:00:00Z': 0,
     '2020-02-08T00:00:00Z': 0,
     '2020-02-09T00:00:00Z': 0,
     '2020-02-10T00:00:00Z': 0,
     '2

The getter above returns a list with one entry per matching location (for Canada, one per province), each a JSON object parsed into a Python `dict`. To get the timelines of the first entry (Alberta in this case), we run the following:

In [5]:
location[0]['timelines']

{'confirmed': {'latest': 542,
  'timeline': {'2020-01-22T00:00:00Z': 0,
   '2020-01-23T00:00:00Z': 0,
   '2020-01-24T00:00:00Z': 0,
   '2020-01-25T00:00:00Z': 0,
   '2020-01-26T00:00:00Z': 0,
   '2020-01-27T00:00:00Z': 0,
   '2020-01-28T00:00:00Z': 0,
   '2020-01-29T00:00:00Z': 0,
   '2020-01-30T00:00:00Z': 0,
   '2020-01-31T00:00:00Z': 0,
   '2020-02-01T00:00:00Z': 0,
   '2020-02-02T00:00:00Z': 0,
   '2020-02-03T00:00:00Z': 0,
   '2020-02-04T00:00:00Z': 0,
   '2020-02-05T00:00:00Z': 0,
   '2020-02-06T00:00:00Z': 0,
   '2020-02-07T00:00:00Z': 0,
   '2020-02-08T00:00:00Z': 0,
   '2020-02-09T00:00:00Z': 0,
   '2020-02-10T00:00:00Z': 0,
   '2020-02-11T00:00:00Z': 0,
   '2020-02-12T00:00:00Z': 0,
   '2020-02-13T00:00:00Z': 0,
   '2020-02-14T00:00:00Z': 0,
   '2020-02-15T00:00:00Z': 0,
   '2020-02-16T00:00:00Z': 0,
   '2020-02-17T00:00:00Z': 0,
   '2020-02-18T00:00:00Z': 0,
   '2020-02-19T00:00:00Z': 0,
   '2020-02-20T00:00:00Z': 0,
   '2020-02-21T00:00:00Z': 0,
   '2020-02-22T00:00:00Z': 0

We can get the confirmed cases with:

In [6]:
location[0]['timelines']['confirmed']['timeline']

{'2020-01-22T00:00:00Z': 0,
 '2020-01-23T00:00:00Z': 0,
 '2020-01-24T00:00:00Z': 0,
 '2020-01-25T00:00:00Z': 0,
 '2020-01-26T00:00:00Z': 0,
 '2020-01-27T00:00:00Z': 0,
 '2020-01-28T00:00:00Z': 0,
 '2020-01-29T00:00:00Z': 0,
 '2020-01-30T00:00:00Z': 0,
 '2020-01-31T00:00:00Z': 0,
 '2020-02-01T00:00:00Z': 0,
 '2020-02-02T00:00:00Z': 0,
 '2020-02-03T00:00:00Z': 0,
 '2020-02-04T00:00:00Z': 0,
 '2020-02-05T00:00:00Z': 0,
 '2020-02-06T00:00:00Z': 0,
 '2020-02-07T00:00:00Z': 0,
 '2020-02-08T00:00:00Z': 0,
 '2020-02-09T00:00:00Z': 0,
 '2020-02-10T00:00:00Z': 0,
 '2020-02-11T00:00:00Z': 0,
 '2020-02-12T00:00:00Z': 0,
 '2020-02-13T00:00:00Z': 0,
 '2020-02-14T00:00:00Z': 0,
 '2020-02-15T00:00:00Z': 0,
 '2020-02-16T00:00:00Z': 0,
 '2020-02-17T00:00:00Z': 0,
 '2020-02-18T00:00:00Z': 0,
 '2020-02-19T00:00:00Z': 0,
 '2020-02-20T00:00:00Z': 0,
 '2020-02-21T00:00:00Z': 0,
 '2020-02-22T00:00:00Z': 0,
 '2020-02-23T00:00:00Z': 0,
 '2020-02-24T00:00:00Z': 0,
 '2020-02-25T00:00:00Z': 0,
 '2020-02-26T00:00:0

Now the deaths, this time keeping only the counts:

In [7]:
location[0]['timelines']['deaths']['timeline'].values()

dict_values([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2])
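
The chained lookups in cells 5 to 7 can be wrapped in a small helper. Since the getter returns one entry per province, a country-level series also needs the per-province timelines added up date by date. The following is a minimal sketch under assumptions: `get_timeline` and `sum_timelines` are hypothetical helper names, not part of COVID19Py, and `sample_locations` is made-up data that only mimics the shape of the response shown above.

```python
from collections import Counter

# Made-up sample shaped like the parsed response (not real figures).
sample_locations = [
    {
        "province": "Province A",
        "timelines": {
            "confirmed": {"latest": 3, "timeline": {
                "2020-03-01T00:00:00Z": 1, "2020-03-02T00:00:00Z": 3}},
            "deaths": {"latest": 0, "timeline": {
                "2020-03-01T00:00:00Z": 0, "2020-03-02T00:00:00Z": 0}},
        },
    },
    {
        "province": "Province B",
        "timelines": {
            "confirmed": {"latest": 5, "timeline": {
                "2020-03-01T00:00:00Z": 2, "2020-03-02T00:00:00Z": 5}},
            "deaths": {"latest": 1, "timeline": {
                "2020-03-01T00:00:00Z": 0, "2020-03-02T00:00:00Z": 1}},
        },
    },
]


def get_timeline(entry, metric):
    """Return the {date: count} mapping for one metric of one location entry."""
    return entry["timelines"][metric]["timeline"]


def sum_timelines(entries, metric):
    """Add up one metric across all entries, date by date."""
    totals = Counter()
    for entry in entries:
        # Counter.update adds the counts of matching keys instead of replacing them.
        totals.update(get_timeline(entry, metric))
    # ISO 8601 strings sort chronologically, so a plain sort orders the dates.
    return dict(sorted(totals.items()))


country_confirmed = sum_timelines(sample_locations, "confirmed")
country_deaths = sum_timelines(sample_locations, "deaths")
```

With the real response, `sum_timelines(location, "confirmed")` would be the equivalent call, assuming every entry carries the `timelines` key (that is, it was requested with `timelines=True`).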

But how to create a `pandas.DataFrame` from this parsed data? Well, let's figure it out!

In [8]:
import pandas as pd

In [9]:
amount_of_days = len(location[0]['timelines']['confirmed']['timeline'])
days_range_list = list(range(amount_of_days))
dict_for_a_country = {
    "day": days_range_list,
    "date": list(location[0]['timelines']['confirmed']['timeline'].keys()),
    "confirmed": list(location[0]['timelines']['confirmed']['timeline'].values()),
    "deaths": list(location[0]['timelines']['deaths']['timeline'].values()),
}

dict_for_a_country

{'day': [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65],
 'date': ['2020-01-22T00:00:00Z',
  '2020-01-23T00:00:00Z',
  '2020-01-24T00:00:00Z',
  '2020-01-25T00:00:00Z',
  '2020-01-26T00:00:00Z',
  '2020-01-27T00:00:00Z',
  '2020-01-28T00:00:00Z',
  '2020-01-29T00:00:00Z',
  '2020-01-30T00:00:00Z',
  '2020-01-31T00:00:00Z',
  '2020-02-01T00:00:00Z',
  '2020-02-02T00:00:00Z',
  '2020-02-03T00:00:00Z',
  '2020-02-04T00:00:00Z',
  '2020-02-05T00:00:00Z',
  '2020-02-06T00:00:00Z',
  '2020-02-07T00:00:00Z',
  '2020-02-08T00:00:00Z',
  '2020-02-09T00:00:00Z',
  '2020-02-10T00:00:00Z',
  '2020-02-11T00:00:00Z',
  '2020-02-12T00:00:00Z',
  '2020-02-13T00:00:00Z',


Now we can put everything in a DataFrame:

In [10]:
df_country_data = pd.DataFrame(dict_for_a_country)
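# Parse the ISO 8601 strings as UTC, then drop the timezone to get naive timestamps.
df_country_data["date"] = pd.to_datetime(df_country_data["date"], utc=True).dt.tz_localize(None)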

df_country_data

Unnamed: 0,day,date,confirmed,deaths
0,0,2020-01-22,0,0
1,1,2020-01-23,0,0
2,2,2020-01-24,0,0
3,3,2020-01-25,0,0
4,4,2020-01-26,0,0
...,...,...,...,...
61,61,2020-03-23,301,1
62,62,2020-03-24,359,1
63,63,2020-03-25,358,2
64,64,2020-03-26,486,2
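
Cells 9 and 10 can be folded into one function that builds the columns from a single location entry. This is only a sketch: `build_country_columns` is a hypothetical name, `sample_entry` is made-up data in the shape of the response, and filling a missing deaths value with 0 is an assumption, not something the data source guarantees. It uses only the standard library so that the column logic can be checked without pandas.

```python
from datetime import datetime

# Made-up entry shaped like one element of the parsed response.
sample_entry = {
    "timelines": {
        "confirmed": {"timeline": {
            "2020-03-01T00:00:00Z": 1,
            "2020-03-02T00:00:00Z": 4,
            "2020-03-03T00:00:00Z": 9,
        }},
        "deaths": {"timeline": {
            "2020-03-01T00:00:00Z": 0,
            "2020-03-02T00:00:00Z": 0,
            "2020-03-03T00:00:00Z": 1,
        }},
    }
}


def build_country_columns(entry):
    """Build the day/date/confirmed/deaths columns from one location entry."""
    confirmed = entry["timelines"]["confirmed"]["timeline"]
    deaths = entry["timelines"]["deaths"]["timeline"]
    dates = sorted(confirmed)  # ISO 8601 strings sort chronologically
    return {
        "day": list(range(len(dates))),
        "date": [datetime.strptime(d, "%Y-%m-%dT%H:%M:%SZ") for d in dates],
        "confirmed": [confirmed[d] for d in dates],
        # Look deaths up by date so both columns stay aligned
        # (assumption: a date missing from the deaths timeline counts as 0).
        "deaths": [deaths.get(d, 0) for d in dates],
    }


columns = build_country_columns(sample_entry)
```

Passing the result to `pd.DataFrame(columns)` should give the same four columns as above, with `date` already parsed.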


Not so complicated! Let's design a class to handle this kind of request. First, let's get the available countries, along with the data for each country and province.

In [11]:
all_data = covid19.getAll()

all_data

{'latest': {'confirmed': 593291, 'deaths': 27198, 'recovered': 0},
 'locations': [{'id': 0,
   'country': 'Afghanistan',
   'country_code': 'AF',
   'country_population': 29121286,
   'province': '',
   'last_updated': '2020-03-28T17:56:50.812740Z',
   'coordinates': {'latitude': '33.0', 'longitude': '65.0'},
   'latest': {'confirmed': 110, 'deaths': 4, 'recovered': 0}},
  {'id': 1,
   'country': 'Albania',
   'country_code': 'AL',
   'country_population': 2986952,
   'province': '',
   'last_updated': '2020-03-28T17:56:50.889715Z',
   'coordinates': {'latitude': '41.1533', 'longitude': '20.1683'},
   'latest': {'confirmed': 186, 'deaths': 8, 'recovered': 0}},
  {'id': 2,
   'country': 'Algeria',
   'country_code': 'DZ',
   'country_population': 34586184,
   'province': '',
   'last_updated': '2020-03-28T17:56:50.907829Z',
   'coordinates': {'latitude': '28.0339', 'longitude': '1.6596'},
   'latest': {'confirmed': 409, 'deaths': 26, 'recovered': 0}},
  {'id': 3,
   'country': 'Andorra'

In [12]:
# Collect one name, code, and province per location entry.
country_names = []
country_codes = []
country_provinces = []
for entry in all_data['locations']:
    country_name = entry['country']
    country_names.append(country_name)
    
    country_code = entry['country_code']
    country_codes.append(country_code)
    
    country_province = entry['province']
    country_provinces.append(country_province)
    
    print(f"Country name: {country_name}")
    print(f"Country code: {country_code}")
    print(f"Province: {country_province}")
    
    if country_province:
        print(f"Full region name: {country_name} ({country_province})")
    else:
        print(f"Full region name: {country_name}")
        
    print("**************************************\n")
    
country_database_dict = {
    "name": country_names,
    "code": country_codes,
    "province": country_provinces
}

df_available_countries = pd.DataFrame(country_database_dict)

Country name: Afghanistan
Country code: AF
Province: 
Full region name: Afghanistan
**************************************

Country name: Albania
Country code: AL
Province: 
Full region name: Albania
**************************************

Country name: Algeria
Country code: DZ
Province: 
Full region name: Algeria
**************************************

Country name: Andorra
Country code: AD
Province: 
Full region name: Andorra
**************************************

Country name: Angola
Country code: AO
Province: 
Full region name: Angola
**************************************

Country name: Antigua and Barbuda
Country code: AG
Province: 
Full region name: Antigua and Barbuda
**************************************

Country name: Argentina
Country code: AR
Province: 
Full region name: Argentina
**************************************

Country name: Armenia
Country code: AM
Province: 
Full region name: Armenia
**************************************

Country name: Australia
Country code: 
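
The naming rule printed in the loop above (country alone, or country plus province in parentheses) can be isolated so it is reusable outside the loop. A minimal sketch, assuming entries shaped like those in `all_data['locations']`; `full_region_name` and `to_rows` are hypothetical helper names, and the sample entries only copy the fields shown in the outputs above.

```python
# A few entries with the same fields as all_data['locations'].
sample_entries = [
    {"country": "Canada", "country_code": "CA", "province": "Alberta"},
    {"country": "Afghanistan", "country_code": "AF", "province": ""},
    {"country": "Albania", "country_code": "AL", "province": ""},
]


def full_region_name(entry):
    """Return 'Country (Province)' when a province is set, else the country."""
    if entry["province"]:
        return f"{entry['country']} ({entry['province']})"
    return entry["country"]


def to_rows(entries):
    """Turn location entries into (name, code, province) rows sorted by name."""
    rows = [(e["country"], e["country_code"], e["province"]) for e in entries]
    return sorted(rows)


rows = to_rows(sample_entries)
names = [full_region_name(e) for e in sample_entries]
```

The rows can then go straight into `pd.DataFrame(rows, columns=["name", "code", "province"])`, which avoids keeping three parallel lists in sync.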

In [13]:
df_available_countries = df_available_countries.sort_values(by="name").reset_index(drop=True)

df_available_countries

Unnamed: 0,name,code,province
0,Afghanistan,AF,
1,Albania,AL,
2,Algeria,DZ,
3,Andorra,AD,
4,Angola,AO,
...,...,...,...
244,Venezuela,VE,
245,Vietnam,VN,
246,West Bank and Gaza,PS,
247,Zambia,ZM,


In [14]:
df_available_countries.to_csv("available_countries.csv", index=False)

In [15]:
df_from_csv = pd.read_csv("../data/available_countries.csv")

df_from_csv

Unnamed: 0,name,code,province
0,Afghanistan,AF,
1,Albania,AL,
2,Algeria,DZ,
3,Andorra,AD,
4,Angola,AO,
...,...,...,...
244,Venezuela,VE,
245,Vietnam,VN,
246,West Bank and Gaza,PS,
247,Zambia,ZM,
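
One detail of this round trip is worth checking: most provinces are empty strings, and those are written to the CSV as empty fields. The sketch below uses the standard library `csv` module on an in-memory buffer (the rows are a small sample, not the full table) to show that the file itself preserves them.

```python
import csv
import io

rows = [
    {"name": "Afghanistan", "code": "AF", "province": ""},
    {"name": "Canada", "code": "CA", "province": "Alberta"},
]

# Write the rows to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "code", "province"])
writer.writeheader()
writer.writerows(rows)

# Read them back: empty fields come back as empty strings.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

`pd.read_csv` behaves differently by default: empty fields, and literal strings such as `NA` (which is also Namibia's ISO country code), are loaded as `NaN`. If that matters for the table, passing `keep_default_na=False` to `pd.read_csv` keeps them as plain strings.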


In [16]:
import attr

In [17]:
# @attr.s(auto_attribs=True)
# class CountryData:
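
The stub above is left unfinished. As one possible direction, and not the notebook's actual design, here is a sketch of what such a class might hold. It uses the standard library `dataclasses` module rather than `attrs` so that it runs without extra dependencies. The field names and the `from_entry` constructor are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CountryData:
    """Hypothetical container for the time series of one location entry."""

    name: str
    code: str
    province: str = ""
    confirmed: Dict[str, int] = field(default_factory=dict)
    deaths: Dict[str, int] = field(default_factory=dict)

    @classmethod
    def from_entry(cls, entry):
        """Build an instance from one parsed location entry with timelines."""
        timelines = entry["timelines"]
        return cls(
            name=entry["country"],
            code=entry["country_code"],
            province=entry["province"],
            confirmed=dict(timelines["confirmed"]["timeline"]),
            deaths=dict(timelines["deaths"]["timeline"]),
        )

    @property
    def amount_of_days(self):
        """Number of dates covered by the confirmed timeline."""
        return len(self.confirmed)


# Made-up entry shaped like one element of the parsed response.
entry = {
    "country": "Canada",
    "country_code": "CA",
    "province": "Alberta",
    "timelines": {
        "confirmed": {"timeline": {"2020-03-01T00:00:00Z": 1, "2020-03-02T00:00:00Z": 4}},
        "deaths": {"timeline": {"2020-03-01T00:00:00Z": 0, "2020-03-02T00:00:00Z": 0}},
    },
}
alberta = CountryData.from_entry(entry)
```

With `attrs`, the equivalent would keep the same annotated fields under `@attr.s(auto_attribs=True)`, using `attr.Factory(dict)` for the mutable defaults.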
    