# Data Collection and Preparation

The following workbook collects all data required for the analysis. This includes: 

 * Economic data: GDP per capita growth rates and inflation rates
 * Coronavirus data: daily case numbers, total cases, deaths
 * Fear and Greed Index: related to crypto assets and stocks
 * Asset price data: Bitcoin, Ethereum, the S&P 500 and the overall cryptocurrency market capitalisation figure

---

### Import the modules required for the analysis

The data collection process relies on a number of Python modules, primarily pandas.

In [2]:
# Import the modules required for the analysis
import pandas as pd
import datetime as dt
import time
import numpy as np
import os
import requests
import json
from pathlib import Path

### Coronavirus Data

This section of the data collection process collects time series data related to COVID-19. The objective of collecting this data is to prepare the dataset for multiple visualisations (such as geographic representation and time-series comparison against other indicators). 

In [2]:
# Specify the URL to the raw github content
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
geo_url = 'https://raw.githubusercontent.com/albertyw/avenews/master/old/data/average-latitude-longitude-countries.csv'

# Read the covid data into a dataframe
covid = pd.read_csv(url)

# Read the geographic data into a dataframe
geo_data = pd.read_csv(geo_url)

# Print datatypes
print(covid.dtypes)

# Display data
covid.head(5)

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


Convert the date column to a datetime format using pandas' *to_datetime* function.

Once the conversion is complete, the code filters for the period used in this report. For the purposes of analysis and data availability, this period runs from **1st January 2020 to 31st December 2021**, which is the range applied by the filter in the code. The source data is updated continuously, so it must be filtered to capture the correct dates.

In [3]:
# Convert date to datetime (ISO-formatted dates are parsed without extra arguments)
covid['date'] = pd.to_datetime(covid['date'])

# Filter 'date' based on assessment period specified above
covid = covid[(covid['date'] >= '2020-01-01') & (covid['date'] <= '2021-12-31')]
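
As a quick illustration of the filtering step, the sketch below applies the same conversion and date window to a small, made-up frame. The rows and the names `sample`, `in_window` and `sample_filtered` are hypothetical and exist only for this example. It uses `Series.between`, which includes both endpoints by default, as an alternative to chaining two comparisons.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the OWID CSV
sample = pd.DataFrame({
    'location': ['A', 'A', 'A', 'A'],
    'date': ['2019-12-31', '2020-01-01', '2021-12-31', '2022-01-01'],
    'new_cases': [1.0, 2.0, 3.0, 4.0],
})

# Convert the text dates to datetime values
sample['date'] = pd.to_datetime(sample['date'])

# Keep rows inside the window used by the filter (both endpoints included)
in_window = sample['date'].between('2020-01-01', '2021-12-31')
sample_filtered = sample[in_window]
```

Only the two middle rows survive, because the first row falls before the window and the last row falls after it.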

The time series data is now prepared and ready to use. Next, the data is grouped per country for each of 2020 and 2021.

In [4]:
# Group by location and year, then sum the daily new cases and new deaths
covid_grouped = (
    covid.groupby(['location', covid['date'].dt.year])[['new_cases', 'new_deaths']]
         .sum()
         .reset_index()
         .rename(columns = {'new_cases' : 'Cases',
                            'new_deaths' : 'Deaths',
                            'location' : 'Country'})
)
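
To make the grouping step concrete, here is a minimal sketch on invented data. The frame `daily` and its values are hypothetical. It groups by location and by the year extracted from the date column, then sums the daily figures.

```python
import pandas as pd

# Hypothetical daily records for two made-up locations
daily = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2021-01-01', '2020-03-01']),
    'new_cases': [10.0, 5.0, 7.0, 1.0],
    'new_deaths': [1.0, 0.0, 2.0, 0.0],
})

# One row per (location, year), holding the yearly totals
yearly = (
    daily.groupby(['location', daily['date'].dt.year])[['new_cases', 'new_deaths']]
         .sum()
         .reset_index()
)
```

Selecting the numeric columns before calling `sum()` keeps the datetime column out of the aggregation, so the year key can be restored as the `date` column without a name clash.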

The following code merges the *covid_grouped* dataframe with the *geo_data* dataframe to produce a combined dataframe with Covid-19 data and locational data.

In [5]:
# Join the locational data from geo_data with the covid_grouped dataframe
covid_grouped = pd.merge(covid_grouped, geo_data, on = 'Country', how = 'inner')

# Display joined results
covid_grouped

Unnamed: 0,Country,date,Cases,Deaths,ISO 3166 Country Code,Latitude,Longitude
0,Afghanistan,2020,52330.0,2189.0,AF,33.0,65.0
1,Afghanistan,2021,105754.0,5167.0,AF,33.0,65.0
2,Albania,2020,58316.0,1181.0,AL,41.0,20.0
3,Albania,2021,151908.0,2036.0,AL,41.0,20.0
4,Algeria,2020,99610.0,2756.0,DZ,28.0,3.0
...,...,...,...,...,...,...,...
378,Yemen,2021,8027.0,1374.0,YE,15.0,48.0
379,Zambia,2020,20725.0,388.0,ZM,-15.0,30.0
380,Zambia,2021,233549.0,3346.0,ZM,-15.0,30.0
381,Zimbabwe,2020,13867.0,363.0,ZW,-20.0,30.0


### Economic data

There are two primary economic indicators that will be used to answer a number of questions relating to this analysis. These are Gross Domestic Product (GDP) per capita (per capita translates to *'per person'*) and inflation (the increase in the price of goods and services).

In [14]:
# Read real GDP growth data 
real_gdp = pd.read_csv('real_gdp_growth.csv')

# Read inflation rate data (World Bank indicator, stored as a local CSV file)
inflation = pd.read_csv('inflation_annual.csv')
inflation

Unnamed: 0,Country,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,ABW,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,4.316297,0.627472,-2.372065,0.421441,0.474764,-0.931196,-1.028282,3.626041,4.257462,
1,Africa Eastern and Southern,AFE,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,8.971206,9.158707,5.746949,5.370290,5.250171,6.594604,6.399343,4.720811,3.923372,4.978097
2,Afghanistan,AFG,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,11.804186,6.441213,7.385772,4.673996,-0.661709,4.383892,4.975952,0.626149,2.302373,
3,Africa Western and Central,AFW,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,4.018699,4.578375,2.439201,1.758052,2.130268,1.494564,1.764635,1.784050,1.758565,2.425007
4,Angola,AGO,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,13.482468,10.277905,8.777814,7.280387,9.150372,30.695313,29.843587,19.628608,17.081215,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,XKX,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,7.336418,2.476738,1.767324,0.428958,-0.536929,0.273169,1.488234,1.053798,2.675992,0.198228
262,"Yemen, Rep.",YEM,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,19.543562,9.885387,10.968442,8.104726,,,,,,
263,South Africa,ZAF,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,1.288878,2.102343,1.24629,1.337968,2.53498,4.069023,...,5.017158,5.723944,5.776404,6.136020,4.509208,6.594604,5.181082,4.504577,4.124351,3.223885
264,Zambia,ZMB,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,6.429397,6.575900,6.977676,7.806876,10.110593,17.869730,6.577312,7.494572,9.150316,15.732585


The data is in wide format, with one column per year, rather than panel (long) format with one row per country and year. The following code reshapes it accordingly.

In [15]:
# Stack data, reset index and rename columns
real_gdp = real_gdp.set_index(['Country','Country Code', 'Indicator Name', 'Indicator Code']).\
                    stack().reset_index().rename(columns = {'level_4' : 'Year', 
                                                            0 : 'GDP per capita growth (annual %)'})

# Select only relevant columns within the real_gdp dataframe
real_gdp = real_gdp.iloc[:,[0,1,4,5]]

# Apply the above logic to the inflation data
inflation = inflation.set_index(['Country','Country Code', 'Indicator Name', 'Indicator Code']).\
                      stack().reset_index().rename(columns = {'level_4' : 'Year', 
                                                              0 : 'Inflation Rate'})

# Select only relevant columns within the inflation dataframe
inflation = inflation.iloc[:,[0,1,4,5]]

inflation

Unnamed: 0,Country,Country Code,Year,Inflation Rate
0,Aruba,ABW,1985,4.032258
1,Aruba,ABW,1986,1.073966
2,Aruba,ABW,1987,3.643045
3,Aruba,ABW,1988,3.121868
4,Aruba,ABW,1989,3.991628
...,...,...,...,...
10257,Zimbabwe,ZWE,2014,-0.197785
10258,Zimbabwe,ZWE,2015,-2.430968
10259,Zimbabwe,ZWE,2016,-1.543670
10260,Zimbabwe,ZWE,2017,0.893962
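
The same wide-to-long reshape can also be written with `DataFrame.melt`. The sketch below is an alternative to the `set_index().stack()` approach used above, not the workbook's own method, and the frame `wide` contains invented values. Missing values are dropped explicitly so that, as in the stacked output above, only observed years remain.

```python
import pandas as pd

# Hypothetical wide-format data: one column per year
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    'Country Code': ['AAA', 'BBB'],
    '2019': [1.5, None],
    '2020': [2.5, 3.5],
})

# Reshape to one row per country and year, dropping missing observations
long = (
    wide.melt(id_vars=['Country', 'Country Code'],
              var_name='Year',
              value_name='Inflation Rate')
        .dropna(subset=['Inflation Rate'])
        .sort_values(['Country', 'Year'])
        .reset_index(drop=True)
)
```

With `melt`, the new column names are supplied directly through `var_name` and `value_name`, which avoids renaming `level_4` and `0` afterwards.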


Convert year column to datetime format.

In [8]:
# List of dataframes
economic_data = [real_gdp, inflation]

# Initiate for loop
for eco in economic_data:
    
    # Convert to datetime
    eco['Year'] = pd.to_datetime(eco['Year'], format = '%Y')
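
A short sketch of what this conversion produces, assuming the year labels are four-digit strings as in the tables above. The series `years` is hypothetical. Parsing with an explicit `'%Y'` format maps each year to 1 January of that year.

```python
import pandas as pd

# Hypothetical year labels, as they appear after stacking the year columns
years = pd.Series(['2019', '2020'])

# Parse four-digit years; month and day default to 1 January
parsed = pd.to_datetime(years, format='%Y')
```

This is why the later comparison against `'2019-01-01'` keeps the 2019 rows: each 2019 observation is dated exactly 1 January 2019.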

Join the dataframes.

In [9]:
# Merge dataframes
eco_data = pd.merge(real_gdp, inflation, on = ['Country','Year','Country Code'] , how = 'inner')

# Take the average inflation rate and GDP per capita growth rate for 2019 and 2020 only (to join with Covid-19 data)
# The numeric columns are selected before averaging so the datetime 'Year' column is not aggregated
eco_avg = (
    eco_data[eco_data['Year'] >= '2019-01-01']
        .groupby(['Country', 'Country Code', eco_data['Year'].dt.year])
        [['GDP per capita growth (annual %)', 'Inflation Rate']]
        .mean()
        .rename(columns = {'GDP per capita growth (annual %)' : 'Average GDP Per Capita Growth Rate',
                           'Inflation Rate' : 'Average Inflation Rate'})
        .reset_index()
)

# Display results
eco_avg

Unnamed: 0,Country,Country Code,Year,Average GDP Per Capita Growth Rate,Average Inflation Rate
0,Afghanistan,AFG,2019,1.535637,2.302373
1,Africa Eastern and Southern,AFE,2019,-0.570661,3.923372
2,Africa Eastern and Southern,AFE,2020,-5.394391,4.978097
3,Africa Western and Central,AFW,2019,0.501092,1.758565
4,Africa Western and Central,AFW,2020,-3.502433,2.425007
...,...,...,...,...,...
396,West Bank and Gaza,PSE,2020,-13.631959,-0.735332
397,World,WLD,2019,1.480856,2.167730
398,World,WLD,2020,-4.395210,1.936941
399,Zambia,ZMB,2019,-1.451364,9.150316


---

### Combined Economic and Covid Data

The following code combines the Coronavirus data prepared earlier in this workbook with the economic data prepared above. Data is joined on country name and year. The following approach has been taken to deal with data quality issues:

 * Countries missing either year of data are excluded so that the dataset is complete

There are also a number of limitations to mention. These include:

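 * GDP per capita figures are not available for every country, so some countries are excluded from the dataset
 * Joining on country name works for most countries, but names can differ between datasets (for example, the inflation data lists *Yemen, Rep.* where the Covid-19 data lists *Yemen*), so some countries fail to match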
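
One way to see which country names fail to match is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below is a diagnostic suggestion rather than part of the workbook's pipeline. The frames `left` and `right` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical frames with one matching and one mismatched country name
left = pd.DataFrame({'Country': ['Albania', 'Yemen, Rep.'], 'Rate': [1.0, 2.0]})
right = pd.DataFrame({'Country': ['Albania', 'Yemen'], 'Cases': [10.0, 20.0]})

# The outer merge keeps every row; '_merge' records which side each row came from
check = pd.merge(left, right, on='Country', how='outer', indicator=True)

# Rows found on one side only are the names that did not match
unmatched = check.loc[check['_merge'] != 'both', ['Country', '_merge']]
```

The names listed in `unmatched` could then be harmonised by hand, or the join could be switched to a standard country code where both datasets provide one.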
 
The following code combines the dataframes described above.

In [10]:
# Combine dataframes using country as the key
eco_covid = pd.merge(eco_avg, 
                     covid_grouped, 
                     how = 'left', 
                     left_on = ['Country','Year'], 
                     right_on = ['Country','date'])

# Count the number of rows per country, to exclude countries that do not contain both 2019 and 2020 data
count_country = eco_covid.groupby('Country').size().reset_index(name = 'Count')

# Join count to eco_covid dataframe
eco_covid = pd.merge(eco_covid, count_country, on = 'Country', how = 'inner')

# Remove any countries with missing 2019 or 2020 data
eco_covid = eco_covid[eco_covid['Count'] > 1].drop(columns = 'Count')

# Back-fill the missing 2019 latitude and longitude from the same country's 2020 row
eco_covid[['Latitude','Longitude']] = eco_covid.groupby('Country')[['Latitude','Longitude']].bfill()

# Remove rows that still have no latitude after the back-fill
eco_covid = eco_covid[eco_covid['Latitude'].notna()]

# Select only relevant columns
eco_covid = eco_covid.iloc[:,[0,2,3,4,6,7,9,10]]
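
The exclusion and back-fill logic can be seen on a toy example. This sketch uses `transform('size')` to attach the per-country row count directly, which is an alternative to building a separate count dataframe and merging it back. The frame `panel` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical panel: country A has both years, country B has only one
panel = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'Year': [2019, 2020, 2019],
    'Latitude': [None, 41.0, None],
})

# Count rows per country and broadcast the count back onto each row
rows_per_country = panel.groupby('Country')['Country'].transform('size')

# Keep only countries with more than one row
complete = panel[rows_per_country > 1].copy()

# Back-fill within each country so the 2019 row inherits the 2020 latitude
complete['Latitude'] = complete.groupby('Country')['Latitude'].bfill()
```

Country B is dropped because it has a single row, and country A's 2019 row picks up the latitude recorded on its 2020 row.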

---

### Cryptocurrency



In [3]:
# Specify the path used where the data is located
crypto_tmc_path = Path("./crypto_tmc.csv")

# Read the CSV file
crypto_tmc_data = pd.read_csv(crypto_tmc_path, index_col="date", parse_dates=True)

# Display the first five rows
crypto_tmc_data.head(5)

date,open,high,low,close,MA,MA.1,Volume,Volume MA,OnBalanceVolume,RSI
2014-03-19,7734802000.0,7869879000.0,7623081000.0,7700141000.0,,,8512398.32,,-50927719.95,57.701674
2014-03-20,7688016000.0,7704364000.0,7290326000.0,7378310000.0,,,8512398.32,17958504.08,-59440118.27,50.721708
2014-03-21,7380354000.0,7667154000.0,7002819000.0,7179603000.0,,,8512398.32,16100911.41,-67952516.59,46.945709
2014-03-22,7180253000.0,7210748000.0,6808422000.0,7125488000.0,,,8512398.32,15695300.61,-76464914.91,45.94262
2014-03-23,7131450000.0,7197064000.0,7034979000.0,7105506000.0,,,22428897.94,15985514.79,-98893812.86,45.55555


In [None]:
# Select only relevant columns
crypto = crypto_tmc_data['close']

# Display the first five rows
crypto.head(5)

In [None]:
# Write the cleaned closing values to CSV
crypto.to_csv('cryptotmc_data_cleaned.csv')

### Final remarks

With the removal of null values and countries with missing data, the dataset for assessing Covid-19 impacts is ready for use. The following code confirms the available data and the structure of the new dataframe that will be used in the report.

In [11]:
# Print sample size and code completion
print(f"After the data preparation process, there are {(eco_covid['Country'].count())/2} countries with available data.")
print(f"Code executed without error at {dt.datetime.now()}")

# Write to CSV
eco_covid.to_csv("eco_covid.csv", index = False)

# Display results
eco_covid

After the data preparation process, there are 124.0 countries with available data.
Code executed without error at 2022-01-06 16:25:35.035855


Unnamed: 0,Country,Year,Average GDP Per Capita Growth Rate,Average Inflation Rate,Cases,Deaths,Latitude,Longitude
5,Albania,2019,2.549359,1.411091,,,41.0,20.0
6,Albania,2020,-3.398708,1.620887,58316.0,1181.0,41.0,20.0
7,Algeria,2019,-0.934556,1.951768,,,28.0,3.0
8,Algeria,2020,-6.826964,2.415131,99610.0,2756.0,28.0,3.0
13,Armenia,2019,7.382197,1.443447,,,40.0,45.0
...,...,...,...,...,...,...,...,...
391,Uruguay,2020,-6.183824,9.756406,19119.0,181.0,-33.0,-56.0
393,Vietnam,2019,6.001037,2.795824,,,16.0,106.0
394,Vietnam,2020,1.979231,3.220934,1465.0,35.0,16.0,106.0
399,Zambia,2019,-1.451364,9.150316,,,-15.0,30.0
