# Data Collection and Preparation

The following workbook collects all data required for the analysis. This includes: 

 * Economic data: GDP per capita growth rates and inflation rates
 * Coronavirus data: daily case numbers, total cases, deaths
 * Fear and greed index: for crypto assets and stocks
 * Asset price data: Bitcoin, Ethereum, S&P 500 and the overall cryptocurrency market capitalisation figure

---

### Import the modules required for the analysis

The data collection process relies on a small number of Python modules, chiefly pandas.

In [110]:
# Import the modules required for the analysis
import pandas as pd
import datetime as dt
import time
import numpy as np
import os

### Coronavirus Data

This section collects time series data on COVID-19 and prepares it for several visualisations, such as geographic maps and time-series comparisons against other indicators.

In [111]:
# Specify the URLs of the raw GitHub content
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
geo_url = 'https://raw.githubusercontent.com/albertyw/avenews/master/old/data/average-latitude-longitude-countries.csv'

# Read the covid data into a dataframe
covid = pd.read_csv(url)

# Read the geographic data into a dataframe
geo_data = pd.read_csv(geo_url)

# Understand datatypes
print(covid.dtypes)

# Display data
covid.head(5)

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object


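| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 2020-02-24 | 5.0 | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 1 | AFG | Asia | Afghanistan | 2020-02-25 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 2 | AFG | Asia | Afghanistan | 2020-02-26 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 3 | AFG | Asia | Afghanistan | 2020-02-27 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 4 | AFG | Asia | Afghanistan | 2020-02-28 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |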
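
The full file has 67 columns, but the steps below only use a handful of them. As a minimal sketch (not the approach taken in this workbook), `pd.read_csv` can restrict parsing to named columns with `usecols` and convert the date while reading with `parse_dates`. The in-memory CSV and the names `sample_csv`, `wanted` and `covid_small` are illustrative assumptions, not part of the original analysis.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the OWID CSV (values are illustrative only)
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases,new_cases,new_deaths\n"
    "AAA,Asia,Exampleland,2020-02-24,5,5,0\n"
    "AAA,Asia,Exampleland,2020-02-25,5,0,0\n"
)

# Hypothetical subset of columns needed for the grouping step
wanted = ['location', 'date', 'new_cases', 'new_deaths']

# usecols limits which columns are parsed; parse_dates converts 'date' on read
covid_small = pd.read_csv(sample_csv, usecols=wanted, parse_dates=['date'])

print(covid_small.dtypes)
```

Loading a subset this way only works if later steps select columns by name, since positional indices change once columns are dropped.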


Convert the date column to a datetime format using pandas' *to_datetime* function.

The code then filters for the period covered by this report. For reasons of analysis and data availability, this period runs from **1st March 2020 to 31st December 2021**. Because the source dataset is updated continually, the filter is needed to keep only these dates.

In [112]:
# Convert date to datetime
# (infer_datetime_format is deprecated from pandas 2.0, where format inference is the default)
covid['date'] = pd.to_datetime(covid['date'])

# Filter 'date' based on assessment period specified above
covid = covid[(covid['date'] >= '2020-03-01') & (covid['date'] <= '2021-12-31')]
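
To make the boundary behaviour of the date filter concrete, here is a small sketch on a toy frame. It uses `Series.between`, which is inclusive at both ends by default, as an equivalent way of writing the two comparisons above. The frame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Toy data straddling both ends of the assessment period (values are illustrative)
demo = pd.DataFrame({
    'date': ['2020-02-29', '2020-03-01', '2021-12-31', '2022-01-01'],
    'new_cases': [1, 2, 3, 4],
})
demo['date'] = pd.to_datetime(demo['date'])

# Period bounds as stated in the text
start, end = pd.Timestamp('2020-03-01'), pd.Timestamp('2021-12-31')

# between() keeps rows where start <= date <= end
in_period = demo[demo['date'].between(start, end)]

print(in_period)
```

Both boundary dates are kept because the data is daily and each timestamp falls at midnight; data with intraday timestamps on 31st December 2021 would need a later upper bound.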

The time series data is now ready to use. Next, new cases and new deaths are totalled per country for each of 2020 and 2021.

In [114]:
# Group by location and year, then sum new cases and new deaths (columns selected by name)
covid_grouped = (covid.groupby(['location', covid['date'].dt.year])[['new_cases', 'new_deaths']]
                      .sum()
                      .reset_index()
                      .rename(columns = {'new_cases' : 'Cases',
                                         'new_deaths' : 'Deaths',
                                         'location' : 'Country'}))

# Display grouped results
covid_grouped

| | Country | date | Cases | Deaths |
|---|---|---|---|---|
| 0 | Afghanistan | 2020 | 52325.0 | 2189.0 |
| 1 | Afghanistan | 2021 | 105754.0 | 5167.0 |
| 2 | Africa | 2020 | 2760451.0 | 65468.0 |
| 3 | Africa | 2021 | 6932141.0 | 162439.0 |
| 4 | Albania | 2020 | 58316.0 | 1181.0 |
| ... | ... | ... | ... | ... |
| 456 | Yemen | 2021 | 8027.0 | 1374.0 |
| 457 | Zambia | 2020 | 20725.0 | 388.0 |
| 458 | Zambia | 2021 | 233549.0 | 3346.0 |
| 459 | Zimbabwe | 2020 | 13867.0 | 363.0 |
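
The grouping step can be checked by hand on a toy frame. The sketch below groups by location and calendar year and sums the daily figures. The frame `demo`, its values, and the renamed `year` column are assumptions made for illustration only.

```python
import pandas as pd

# Toy daily data for two locations across two years (values are illustrative)
demo = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2021-01-01', '2020-05-01']),
    'new_cases': [10.0, 5.0, 7.0, 3.0],
    'new_deaths': [1.0, 0.0, 2.0, 0.0],
})

# Group on location plus the year extracted from 'date', then sum the daily values
yearly = (
    demo.groupby(['location', demo['date'].dt.year])[['new_cases', 'new_deaths']]
        .sum()
        .reset_index()
        .rename(columns={'date': 'year'})
)

print(yearly)
```

Note that `sum` skips missing values, so days with no reported figure contribute nothing to the yearly total rather than making it missing.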


In [115]:
# Join the location data from geo_data onto covid_grouped (the inner join keeps only countries present in both)
covid_grouped = pd.merge(covid_grouped, geo_data, on = 'Country', how = 'inner')

# Display joined results
covid_grouped

| | Country | date | Cases | Deaths | ISO 3166 Country Code | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2020 | 52325.0 | 2189.0 | AF | 33.0 | 65.0 |
| 1 | Afghanistan | 2021 | 105754.0 | 5167.0 | AF | 33.0 | 65.0 |
| 2 | Albania | 2020 | 58316.0 | 1181.0 | AL | 41.0 | 20.0 |
| 3 | Albania | 2021 | 151908.0 | 2036.0 | AL | 41.0 | 20.0 |
| 4 | Algeria | 2020 | 99609.0 | 2756.0 | DZ | 28.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 378 | Yemen | 2021 | 8027.0 | 1374.0 | YE | 15.0 | 48.0 |
| 379 | Zambia | 2020 | 20725.0 | 388.0 | ZM | -15.0 | 30.0 |
| 380 | Zambia | 2021 | 233549.0 | 3346.0 | ZM | -15.0 | 30.0 |
| 381 | Zimbabwe | 2020 | 13867.0 | 363.0 | ZW | -20.0 | 30.0 |
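
The grouped table ends at index 459 while the joined table ends at index 381, so the inner join drops rows whose `Country` has no match in the geographic data, such as the `Africa` aggregate. One way to list what was dropped, sketched here on toy data, is a left join with `indicator=True`. The frames `cases` and `geo` and the case counts are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for covid_grouped and geo_data (case counts are illustrative)
cases = pd.DataFrame({'Country': ['Afghanistan', 'Africa', 'Albania'],
                      'Cases': [1.0, 2.0, 3.0]})
geo = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                    'Latitude': [33.0, 41.0],
                    'Longitude': [65.0, 20.0]})

# A left join keeps every row of `cases`; the _merge column records whether a match was found
check = pd.merge(cases, geo, on='Country', how='left', indicator=True)

# Countries that an inner join would silently drop
unmatched = check.loc[check['_merge'] == 'left_only', 'Country'].unique()

print(unmatched)
```

Reviewing the unmatched names helps separate regional aggregates, which are safe to drop for a country-level map, from real countries whose names are simply spelled differently in the two sources.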


### Economic data

Two primary economic indicators are used to answer several of the questions in this analysis: Gross Domestic Product (GDP) per capita (*per capita* means *'per person'*) and inflation.