# Jupyter notebook to extract COVID data for Nepal from an API and aggregate it by date and administrative unit (municipality, district, and province)

#### _Work done by Nepal Poverty Team, The World Bank_

## API
1. [Nepal Coronavirus Information](https://nepalcorona.info/)

This notebook is written in Python 3 and walks through the data cleaning and merging steps.

## Setup

Running this notebook requires Jupyter; either Jupyter Notebook or JupyterLab will do. Two Python packages are also required: _pandas_ and _requests_, both imported in the first code cell.

### Jupyter Software Installation
https://jupyter.org/install

### pandas Package Installation
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html

After all the dependencies are installed, the notebook can be opened in Jupyter and run.
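
A quick way to confirm that the required packages are importable before opening the notebook is sketched below. It uses only the standard library; the helper name `find_missing_packages` is hypothetical, and `requests` is listed because the notebook imports it alongside pandas.

```python
import importlib.util


def find_missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by this notebook (json ships with Python itself).
missing = find_missing_packages(["pandas", "requests"])
if missing:
    print("Install before running the notebook:", ", ".join(missing))
else:
    print("All required packages are available.")
```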

## Imports

In [195]:
import requests
import json
import pandas as pd

## API URLs

In [196]:
# API URLs for district, municipality, and individual COVID case data
districts_url = 'https://data.nepalcorona.info/api/v1/districts'
municipals_url = 'https://data.nepalcorona.info/api/v1/municipals'
covid_cases_url = 'https://data.nepalcorona.info/api/v1/covid'

## Function to extract data from API URL

In [197]:
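# function to get list of dictionaries from URL
def get_data_from_url(url):
    """
    Gets list of dictionaries from the API URL.
    
    Params:
    url (str): URL for the API
    """
    # time out instead of hanging forever; 30 seconds is an arbitrary choice
    response = requests.get(url, timeout=30)
    # raise an HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return json.loads(response.content.decode())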
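
`get_data_from_url` hands whatever the endpoint returns straight to `pd.DataFrame`, so it implicitly assumes a JSON array of objects. A minimal sketch of checking that assumption up front follows; `parse_records` is a hypothetical helper, and the sample payload is made up for illustration rather than taken from the API.

```python
import json


def parse_records(payload):
    """Decode a JSON string and confirm it is a list of dictionaries."""
    data = json.loads(payload)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    bad = [i for i, item in enumerate(data) if not isinstance(item, dict)]
    if bad:
        raise ValueError(f"items at positions {bad} are not JSON objects")
    return data


# Made-up payload shaped like a list of records.
sample = '[{"id": 1, "title_en": "Alpha"}, {"id": 2, "title_en": "Beta"}]'
records = parse_records(sample)
print(len(records))
```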

## Data preprocessing -- extraction and wrangling

In [198]:
# get data for districts, municipalities and individual COVID cases
districts = get_data_from_url(districts_url)
municipals = get_data_from_url(municipals_url)
covid_cases = get_data_from_url(covid_cases_url)

# store the data in dataframe using pd.DataFrame constructor
districts = pd.DataFrame(districts)
municipals = pd.DataFrame(municipals)
covid_cases = pd.DataFrame(covid_cases)

# add a column in municipals -- district_en -- which stores districts names
municipals['district_en'] = municipals['district'].map(dict(zip(districts['id'], districts['title_en'])))

# concatenate two columns, municipality name and municipality type (whitespace stripped), to get the full municipality name
municipals['municipal_en'] = municipals['title_en'].str.strip() + ' ' + municipals['type'].str.strip()

# add municipality and district names in the covid cases dataframe
covid_cases['municipality_name'] = covid_cases['municipality'].map(dict(zip(municipals['id'], municipals['municipal_en'])))
covid_cases['district_name'] = covid_cases['district'].map(dict(zip(districts['id'], districts['title_en'])))

# parse the createdOn string and keep only the calendar date (stored as Python date objects)
covid_cases['date'] = pd.to_datetime(covid_cases['createdOn']).dt.date
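
The name lookups above rely on `Series.map` with a dictionary built from two columns. The sketch below reproduces the pattern on made-up data (the ids and names are invented, not API values) and shows that an id missing from the lookup table becomes `NaN`, which is worth checking for before aggregating.

```python
import pandas as pd

# Made-up lookup table and case records, for illustration only.
lookup = pd.DataFrame({"id": [10, 20], "title_en": ["Alpha", "Beta"]})
cases = pd.DataFrame({"id": [1, 2, 3], "district": [10, 20, 99]})

# Build an id -> name dictionary and map it onto the foreign-key column.
id_to_name = dict(zip(lookup["id"], lookup["title_en"]))
cases["district_name"] = cases["district"].map(id_to_name)

# Rows whose id had no match in the lookup table end up with a missing name.
unmatched = cases[cases["district_name"].isna()]
print(unmatched["district"].tolist())
```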

## Data aggregation

In [199]:
# aggregate the data by date and municipality
date_muni_covid_cases = (
    covid_cases.groupby(['date', 'province', 'district_name', 'municipality_name'])['id']
    .count()
    .reset_index(name='cases_count')
)

# aggregate the data by date and district
date_dist_covid_cases = (
    covid_cases.groupby(['date', 'province', 'district_name'])['id']
    .count()
    .reset_index(name='cases_count')
)

# aggregate the data by date and province
date_province_covid_cases = (
    covid_cases.groupby(['date', 'province'])['id']
    .count()
    .reset_index(name='cases_count')
)
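
Because the three aggregates count the same rows at different levels, a coarser table can also be obtained by summing a finer one. The sketch below illustrates this on made-up records (the dates, provinces, and district names are invented); it is meant as a consistency check, not a replacement for the cell above.

```python
import pandas as pd

# Made-up case records, for illustration only.
cases = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "date": ["2020-05-01", "2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02"],
    "province": [1, 1, 2, 1, 2],
    "district_name": ["Alpha", "Beta", "Gamma", "Alpha", "Gamma"],
})

# Count cases per date and district, as in the notebook.
by_district = (
    cases.groupby(["date", "province", "district_name"])["id"]
    .count()
    .reset_index(name="cases_count")
)

# Count cases per date and province directly ...
by_province = (
    cases.groupby(["date", "province"])["id"].count().reset_index(name="cases_count")
)

# ... and again by summing the district-level counts.
rolled_up = by_district.groupby(["date", "province"])["cases_count"].sum().reset_index()

print(by_province["cases_count"].tolist() == rolled_up["cases_count"].tolist())
```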

In [200]:
# total cases to date per municipality
municipals['cases_count'] = municipals['id'].map(covid_cases.groupby('municipality')['id'].count())

# total cases to date per district
districts['cases_count'] = districts['id'].map(covid_cases.groupby('district')['id'].count())

# total cases to date per province (Nepal has seven provinces, numbered 1 to 7)
provinces = pd.DataFrame({'province': [1, 2, 3, 4, 5, 6, 7]})
provinces['cases_count'] = provinces['province'].map(covid_cases.groupby('province')['id'].count())
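
With the mapping above, administrative units that have no recorded cases receive `NaN` rather than 0. If explicit zeros are preferred in the exported tables, one option is sketched below on made-up data. Whether zero is the right value for a unit with no records is a judgment call, so treat this as an assumption rather than part of the original workflow.

```python
import pandas as pd

# Made-up admin units and case records, for illustration only.
units = pd.DataFrame({"id": [10, 20, 30]})
cases = pd.DataFrame({"id": [1, 2, 3], "district": [10, 10, 20]})

# Cases per unit; a Series can be passed to map directly as a lookup.
counts = cases.groupby("district")["id"].count()

# Units absent from the counts map to NaN, which is then replaced by 0.
units["cases_count"] = units["id"].map(counts).fillna(0).astype(int)
print(units)
```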

## Validation

Each aggregate should account for every row of `covid_cases`. If any of the *assertions* below fails, revisit the cleaning and aggregation sections: a failure means some cases were dropped or counted more than once along the way.

In [204]:
assert date_muni_covid_cases['cases_count'].sum() == covid_cases.shape[0]
assert date_dist_covid_cases['cases_count'].sum() == covid_cases.shape[0]
assert date_province_covid_cases['cases_count'].sum() == covid_cases.shape[0]
assert municipals['cases_count'].sum() == covid_cases.shape[0]
assert districts['cases_count'].sum() == covid_cases.shape[0]
assert provinces['cases_count'].sum() == covid_cases.shape[0]
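
One common reason for such a check to fail is that `groupby` drops rows whose grouping keys are missing (its default is `dropna=True`), for example a case whose municipality id had no match in the lookup. A minimal sketch for locating such rows is given below; `rows_with_missing_keys` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def rows_with_missing_keys(df, keys):
    """Return the rows of df where at least one grouping key is missing."""
    return df[df[keys].isna().any(axis=1)]


# Made-up records: the third case has no municipality name.
cases = pd.DataFrame({
    "id": [1, 2, 3],
    "province": [1, 1, 2],
    "municipality_name": ["Alpha Municipality", "Beta Rural Municipality", None],
})

keys = ["province", "municipality_name"]
grouped_total = cases.groupby(keys)["id"].count().sum()
dropped = rows_with_missing_keys(cases, keys)

# The gap between the raw row count and the grouped total equals the dropped rows.
print(len(cases) - grouped_total, len(dropped))
```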

## Export

In [205]:
# export the aggregates to the "Nepal COVID" folder, creating it first if it does not exist
import os
os.makedirs('Nepal COVID', exist_ok=True)

date_muni_covid_cases.to_csv('Nepal COVID/date_muni_covid_cases.csv', index=False)
date_dist_covid_cases.to_csv('Nepal COVID/date_dist_covid_cases.csv', index=False)
date_province_covid_cases.to_csv('Nepal COVID/date_province_covid_cases.csv', index=False)

covid_cases.to_csv('Nepal COVID/individual_covid_cases.csv', index=False)
municipals.to_csv('Nepal COVID/municipal_covid_cases.csv', index=False)
districts.to_csv('Nepal COVID/district_covid_cases.csv', index=False)
provinces.to_csv('Nepal COVID/province_covid_cases.csv', index=False)
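
When the exported files are loaded again, the `date` column comes back as text unless it is parsed explicitly. The sketch below shows a round trip on made-up data, with a temporary directory standing in for the `Nepal COVID` folder; the file name mirrors one of those used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up aggregate, for illustration only.
aggregate = pd.DataFrame({
    "date": ["2020-05-01", "2020-05-02"],
    "province": [1, 2],
    "cases_count": [3, 5],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Nepal COVID"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv needs the folder to exist
    out_path = out_dir / "date_province_covid_cases.csv"
    aggregate.to_csv(out_path, index=False)

    # Parse the date column on the way back in; otherwise it is read as text.
    reloaded = pd.read_csv(out_path, parse_dates=["date"])

print(reloaded.dtypes)
```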