# Wrangle and Analyze Covid-19 Data


## Table of Contents
- [1. Introduction](#intro)
- [2. Gather data](#gather)
- [3. Assess data](#assess)
- [4. Clean data](#clean)
- [5. Store clean data](#store)


<a id='intro'></a>
## 1. Introduction

This project analyzes the correlation between Covid-19 cases and the political environment of different countries. The goal is to find answers, or at least indicators, to questions like:
- Did the countries that were more successful in containing the number of Covid-19 cases have something in common?
- Is there a correlation between gross domestic product, Human Development Index or political ideology and the number of Covid-19 cases of a country?

The main goal of this project is to generate a comprehensive exploratory and explanatory data analysis of the gathered data. The data analysis process is distributed over three ipynb-files: gather_clean_Covid19.ipynb, exploration_Covid19.ipynb and slide_deck_Covid19.ipynb.

First, as part of gather_clean_Covid19.ipynb, data is gathered from different sources: the Covid-19 data is retrieved via programmatically downloaded csv-files from the GitHub repository [Covid-19](https://github.com/CSSEGISandData/COVID-19), and additional data about countries is retrieved via the Wikipedia API. Second, the data from the different sources is assessed visually and programmatically and then cleaned.
The exploratory and explanatory data analysis of the gathered data is performed in exploration_Covid19.ipynb. Finally, the findings are presented in slide_deck_Covid19.ipynb.

In [469]:
# Import necessary libraries
import numpy as np
import pandas as pd
from datetime import date
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import os # to work with local directory
import re
import wptools
import json # to create json file from python dictionary
import time # for timer 
sns.set()

<a id='gather'></a>
## 2. Gather data

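#### Data is gathered from three different sources as described in the steps below:

1. Fatality, confirmed cases, recovered cases and data by country are retrieved via programmatically downloaded csv-files from the GitHub repository [Covid-19](https://github.com/CSSEGISandData/COVID-19) and from Our World in Data (OWID).
2. Additional datasets are downloaded manually from the European statistical database, Wikipedia and the United Nations database and read from local csv-files.
3. Additional data is retrieved via the wptools API from different Wikipedia articles.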

### a. Read data from programmatically downloaded csv-files

In [470]:
# Gather data from Johns Hopkins GitHub
df_JHU_Fatality = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_JHU_Confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df_JHU_Recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_JHU_Countries = pd.read_csv('https://raw.githubusercontent.com/RRighart/covid-19/master/countries.csv')

In [471]:
df_OWID_Covid = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df_OWID_Testing = pd.read_csv('https://covid.ourworldindata.org/data/testing/covid-testing-latest-data-source-details.csv')
df_OWID_Countries = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/locations.csv')

### b. Read data from local datasets

The following data was downloaded manually from different databases: [European statistical database](https://ec.europa.eu/eurostat/data/database), [Wikipedia table on intensive care units](https://en.wikipedia.org/wiki/List_of_countries_by_hospital_beds) and [United Nations database](https://data.un.org):

In [472]:
df_ESTAT_census = pd.read_csv('inputData/Eurostat_census_2001.csv')
df_WIKI_ICU = pd.read_csv('inputData/Wikipedia_ICU.csv')
df_UN_births = pd.read_csv('inputData/UNdata_birthsByMonth.csv')
df_UN_deaths = pd.read_csv('inputData/UNdata_deathsByMonth.csv')

### c. Query additional information for countries via Wikipedia API

Additional information to be queried for each country:
- Leader Gender
- Ideology of Leading Party
- Number of Intensive Care Beds
- Gross domestic product per capita
- Human Development Index

In [473]:
# Query the Wikipedia article of every country and save its infobox in json-format to 'country_json.txt'
# (wrapped in a string literal so that the cell does not execute the query)
'''
country_id_errors = []
start = time.time()
count = 0

with open('country_json.txt', 'w') as outfile:

    for country in df_JHU_Countries['Country/Region']:
        count += 1
        try:
            # Query API for data of wikipedia article
            article = wptools.page(country).get_parse()
            infobox = article.data['infobox']
            # Measure elapsed time
            mid_s = time.time()
            # Print counter and time elapsed
            print(str(count) + ' ' + str(mid_s - start))
            # Write json of infobox to 'country_json.txt'
            json.dump(infobox, outfile)
            # New line
            outfile.write("\n")

        # Not best practice to catch all exceptions but fine for this short script
        except Exception as error:
            mid_f = time.time()
            print(str(count) + ' ' + str(mid_f - start) + ' ' + str(error))
            # Gather countries whose query failed
            country_id_errors.append([count, str(country)])

    end = time.time()
    print(end - start)
'''


In [474]:
'''
so = wptools.page('Germany').get_parse()
infobox = so.data['infobox']
print(infobox)
'''

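Once the query cell above has been run, the resulting file can be read back line by line. The following is a minimal sketch, assuming that 'country_json.txt' holds one JSON infobox per line as written by the loop above. The helper name `infoboxes_to_df`, the infobox keys and the sample values are hypothetical; the real keys depend on the Wikipedia article.

```python
import io
import json

import pandas as pd


def infoboxes_to_df(lines):
    """Parse one JSON object per line into a dataframe (one row per infobox)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        infobox = json.loads(line)
        if isinstance(infobox, dict):  # 'null' is written when no infobox was returned
            records.append(infobox)
    return pd.DataFrame(records)


# Hypothetical sample standing in for the content of 'country_json.txt'
sample = io.StringIO(
    '{"common_name": "Germany", "HDI": "0.9"}\n'
    'null\n'
    '{"common_name": "France", "HDI": "0.8"}\n'
)
df_infobox = infoboxes_to_df(sample)
print(df_infobox)
```

With the real file, the open file handle can be passed instead of the sample buffer, for example `with open('country_json.txt') as infile: df_infobox = infoboxes_to_df(infile)`.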
<a id='assess'></a>
## 3. Assess data

After gathering, each of the above pieces of data is assessed visually and programmatically for quality and tidiness issues. Requirements to be met:

- Quality requirements:
    - Completeness: All necessary records are in the dataframes; no specific rows, columns or cells are missing.
    - Validity: No records that do not conform to the schema.
    - Accuracy: No wrong data that is nevertheless valid.
    - Consistency: No data that is valid and accurate but referred to in multiple correct ways.
- Tidiness requirements (as defined by Hadley Wickham):
    - each variable is a column
    - each observation is a row
    - each type of observational unit is a table.
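Some of the quality requirements above can be checked programmatically. The following is a minimal sketch on a toy dataframe; the helper name `assess` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def assess(df):
    """Summarise basic quality indicators of a dataframe."""
    return {
        'rows': len(df),
        # Completeness: number of missing cells per column
        'missing_per_column': df.isna().sum().to_dict(),
        # Number of rows that repeat an earlier row completely
        'duplicated_rows': int(df.duplicated().sum()),
        # Validity: datatypes to compare against the expected schema
        'dtypes': df.dtypes.astype(str).to_dict(),
    }


# Hypothetical toy dataframe with one missing value and one duplicated row
df_toy = pd.DataFrame({
    'country': ['A', 'B', 'B', 'C'],
    'value': [1.0, 2.0, 2.0, None],
})
summary = assess(df_toy)
print(summary)
```

Accuracy and consistency usually cannot be verified this way and still require comparison with a second source or visual inspection.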

### a. Visual assessment

In [475]:
# Check layout of df_JHU_Countries visually
df_JHU_Countries.sample(n=5)

Unnamed: 0.1,Unnamed: 0,Country/Region,inhabitants,area
17,17,Turkey,84200851,783562
15,15,Luxembourg,625978,2590
19,19,Brazil,212559417,8358140
16,16,Poland,37864109,312685
27,27,Vietnam,97338579,310070


In [476]:
# Check layout of df_JHU_Fatality visually
df_JHU_Fatality.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
50,Beijing,China,40.1824,116.4142,0,0,0,0,0,1,...,9,9,9,9,9,9,9,9,9,9
193,,Senegal,14.4974,-14.4524,0,0,0,0,0,0,...,12,13,13,17,19,19,19,21,23,25
35,Alberta,Canada,53.9333,-116.5765,0,0,0,0,0,0,...,112,114,115,116,117,117,118,120,121,125
52,Fujian,China,26.0789,117.9874,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
216,,United Arab Emirates,24.0,54.0,0,0,0,0,0,0,...,157,165,174,185,198,201,203,206,208,210


In [477]:
# Check layout of df_JHU_Confirmed visually
df_JHU_Confirmed.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
217,Bermuda,United Kingdom,32.3078,-64.7505,0,0,0,0,0,0,...,118,118,118,118,118,119,121,121,122,122
24,,Benin,9.3077,2.3158,0,0,0,0,0,0,...,96,140,242,284,319,319,327,327,339,339
258,Saint Pierre and Miquelon,France,46.8852,-56.3159,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
191,,San Marino,43.9424,12.4578,0,0,0,0,0,0,...,608,622,623,637,628,628,638,643,648,652
182,,Philippines,13.0,122.0,0,0,0,0,0,0,...,10004,10343,10463,10610,10794,11086,11350,11618,11876,12091


In [478]:
# Check layout of df_JHU_Recovered visually
df_JHU_Recovered.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
161,,Namibia,-22.9576,18.4904,0,0,0,0,0,0,...,8,9,9,10,11,11,11,11,12,13
234,,Burma,21.9162,95.956,0,0,0,0,0,0,...,50,62,67,68,72,74,76,79,84,89
34,,Cambodia,11.55,104.9167,0,0,0,0,0,0,...,120,120,120,120,120,121,121,121,121,122
154,,Mexico,23.6345,-102.5528,0,0,0,0,0,0,...,17781,17781,20314,21824,21824,23100,25935,26990,28475,30451
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,458,468,472,502,558,558,610,648,691,745


In [490]:
# Check layout of df_OWID_Covid visually
df_OWID_Covid.sample(n=5)

# df_OWID_Covid: change 'location' to 'country'
# df_OWID_Covid: create df_OWID_Countries with 'iso_code', 'location', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_100k'
# df_OWID_Covid: merge it to df_country

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
3561,CPV,Cape Verde,2020-03-28,5,0,1,0,8.993,0.0,1.799,...,4.46,3.437,6222.554,,182.219,2.42,2.1,16.5,,2.1
17526,OWID_WRL,World,2020-04-02,945624,76757,47447,4907,121.315,9.847,6.087,...,8.696,5.355,15469.207,10.0,233.07,8.51,6.434,34.635,60.13,2.705
12445,NPL,Nepal,2020-01-07,0,0,0,0,0.0,0.0,0.0,...,5.809,3.212,2442.804,15.0,260.797,7.26,9.5,37.8,47.782,0.3
6080,GEO,Georgia,2020-02-17,0,0,0,0,0.0,0.0,0.0,...,14.864,10.244,9745.079,4.2,496.218,7.11,5.3,55.5,,2.6
9597,LBN,Lebanon,2020-03-11,41,0,1,1,6.007,0.0,0.147,...,8.514,5.43,13367.565,,266.591,12.71,26.9,40.7,,2.9


In [491]:
# Check layout of df_OWID_Testing visually
df_OWID_Testing.sample(n=5)

# df_OWID_Testing: drop

Unnamed: 0,ISO code,Entity,Date,Source URL,Source label,Notes,Number of observations,Cumulative total,Cumulative total per thousand,Daily change in cumulative total,Daily change in cumulative total per thousand,3-day rolling mean daily change,3-day rolling mean daily change per thousand,7-day rolling mean daily change,7-day rolling mean daily change per thousand,General source label,General source URL,Short description,Detailed description
73,ZAF,South Africa - units unclear,2020-05-15,https://github.com/dsfsi/covid19za,National Institute for Communicable Diseases (...,Made available by the University of Pretoria o...,76,421555,7.108,18537.0,0.313,17286.0,0.292,16257.571,0.274,National Institute for Communicable Diseases (...,https://www.nicd.ac.za/media/alerts/,The number of people tested.,The South African National Institute for Commu...
42,KEN,Kenya - units unclear,2020-05-15,https://www.health.go.ke/wp-content/uploads/20...,Kenya Ministry of Health,,47,39018,0.726,2100.0,0.039,1700.667,0.032,1369.714,0.026,Ministry of Health,http://www.health.go.ke,"Units are unclear, and could refer to the numb...",The Kenya Ministry of Health provides daily pr...
47,MEX,Mexico - cases tested,2020-05-15,https://datos.gob.mx/busca/dataset/informacion...,Health Secretary,,136,134663,1.044,,0.0,,0.006,,0.016,Health Secretary,https://datos.gob.mx/busca/dataset/informacion...,The number of cases tested.,The Mexican Health Secretary publishes a datas...
65,RWA,Rwanda - units unclear,2020-05-16,https://twitter.com/RwandaHealth/status/126171...,Rwanda Ministry of Health,,41,48239,3.724,2041.0,0.158,1331.333,0.103,979.143,0.076,Rwanda Ministry of Health,https://www.moh.gov.rw/,The number of samples tested.,The Rwanda Ministry of Health ([@RwandaHealth]...
20,EST,Estonia - tests performed,2020-05-15,https://www.koroonakaart.ee/en,Estonian Central Health Information System and...,,81,68840,51.894,752.0,0.567,920.0,0.694,876.857,0.661,Social Ministry,https://www.terviseamet.ee/et/koroonaviirus/ko...,"The number of tests performed (""Testide koguarv"")",The Social Ministry embeds the [Koroonakaart d...


In [492]:
# Check layout of df_OWID_Countries visually
df_OWID_Countries.sample(n=5)

# df_OWID_Countries convert datatype population to integer
# df_OWID_Countries drop 'countriesAndTerritories', 'population_year'

Unnamed: 0,countriesAndTerritories,location,continent,population_year,population
85,Guinea_Bissau,Guinea-Bissau,Africa,2020.0,1967998.0
8,Armenia,Armenia,Asia,2020.0,2963234.0
108,Kyrgyzstan,Kyrgyzstan,Asia,2020.0,6524191.0
4,Angola,Angola,Africa,2020.0,32866268.0
89,Honduras,Honduras,North America,2020.0,9904608.0


In [479]:
# Check layout of df_ESTAT_census visually
df_ESTAT_census.sample(n=5)

# df_ESTAT_census make columns from values in n_person
# df_ESTAT_census replace 'Germany (until 1990 former territory of the FRG)' with 'Germany'
# df_ESTAT_census drop country 'Bulgaria'
# df_ESTAT_census drop '4 persons', '5 persons', '6 persons or more'

Unnamed: 0,N_PERSON,GEO,TIME,AGE,HHCOMP,UNIT,Value,Flag and Footnotes
213,5 persons,Liechtenstein,2001,65 years or over,Total,Number,:,
170,4 persons or more,France,2001,65 years or over,Total,Number,2117,
3,Total,Germany (until 1990 former territory of the FRG),2001,65 years or over,Total,Number,37706500,
206,5 persons,Poland,2001,65 years or over,Total,Number,:,
102,2 persons,Slovakia,2001,65 years or over,Total,Number,116684,


In [500]:
df_ESTAT_census.AGE.value_counts()

65 years or over    243
Name: AGE, dtype: int64

In [480]:
# Check layout of df_WIKI_ICU visually
df_WIKI_ICU.sample(n=5)

Unnamed: 0,country,continent,hospital_beds_per_1000_people,occupancy,ICU-CCB_beds_per_1000_people,ventilators
40,Mexico,North America,1.38,74.0,1.2,2050.0
14,Estonia,Europe,4.69,70.4,14.6,
17,Slovenia,Europe,4.5,69.5,6.4,
19,Greece,Europe,4.21,61.6,6.0,
23,Netherlands,Europe,3.32,65.4,6.4,


In [481]:
df_UN_births.sample(n=5)
# Drop columns 'Area', 'Record Type', 'Reliability', 'Value Footnotes', 'Source Year'
# Change datatype of column 'Value' to integer
# Merge df_UN_births and df_UN_deaths on Year


Unnamed: 0,Country or Area,Year,Area,Month,Record Type,Reliability,Source Year,Value,Value Footnotes
451,Anguilla,2012,Total,December,Data tabulated by year of registration,"Final figure, complete",2014.0,15.0,
1522,Brunei Darussalam,2014,Total,November,Data tabulated by year of registration,"Final figure, complete",2016.0,594.0,
8860,Singapore,2014,Total,November,Data tabulated by year of occurrence,"Final figure, complete",2015.0,3735.0,
8907,Singapore,2010,Total,February,Data tabulated by year of occurrence,"Final figure, complete",2012.0,2867.0,
9233,Spain,2013,Total,January,Data tabulated by year of occurrence,"Final figure, complete",2015.0,36869.0,


In [482]:
df_UN_births.Area.value_counts()

Total    10373
Name: Area, dtype: int64

In [483]:
df_UN_deaths.sample(n=5)

Unnamed: 0,Country or Area,Year,Area,Month,Record Type,Reliability,Source Year,Value,Value Footnotes
257,Andorra,2011,Total,October,Data tabulated by year of occurrence,"Final figure, complete",2013.0,19.0,
1476,Canada,2017,Total,August,Data tabulated by year of occurrence,"Final figure, complete",2019.0,21578.0,7.0
3023,Faroe Islands,2018,Total,Total,Data tabulated by year of occurrence,"Final figure, complete",2019.0,392.0,
7021,Portugal,2017,Total,Total,Data tabulated by year of occurrence,"Final figure, complete",2019.0,109758.0,30.0
1947,"China, Macao SAR",2010,Total,August,Data tabulated by year of occurrence,"Final figure, complete",2012.0,128.0,


### b. Programmatic assessment

In [484]:
# List of countries that are available in Johns Hopkins dataset
df_JHU_Recovered['Country/Region'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Brunei',
       'Bulgaria', 'Burkina Faso', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Diamond Princess',
       'Cuba', 'Cyprus', 'Czechia', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
       'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Grenada', 'Greece', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'H

In [485]:
# List of countries that are available in OWID dataset
df_OWID_Covid['location'].unique()

array(['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania', 'Andorra',
       'United Arab Emirates', 'Argentina', 'Armenia',
       'Antigua and Barbuda', 'Australia', 'Austria', 'Azerbaijan',
       'Burundi', 'Belgium', 'Benin', 'Bonaire Sint Eustatius and Saba',
       'Burkina Faso', 'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas',
       'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda',
       'Bolivia', 'Brazil', 'Barbados', 'Brunei', 'Bhutan', 'Botswana',
       'Central African Republic', 'Canada', 'Switzerland', 'Chile',
       'China', "Cote d'Ivoire", 'Cameroon',
       'Democratic Republic of Congo', 'Congo', 'Colombia', 'Comoros',
       'Cape Verde', 'Costa Rica', 'Cuba', 'Curacao', 'Cayman Islands',
       'Cyprus', 'Czech Republic', 'Germany', 'Djibouti', 'Dominica',
       'Denmark', 'Dominican Republic', 'Algeria', 'Ecuador', 'Egypt',
       'Eritrea', 'Western Sahara', 'Spain', 'Estonia', 'Ethiopia',
       'Finland', 'Fiji', 'Falkland Islands', 'France',

In [486]:
# Available variables in dataset
list(df_OWID_Covid)

['iso_code',
 'location',
 'date',
 'total_cases',
 'new_cases',
 'total_deaths',
 'new_deaths',
 'total_cases_per_million',
 'new_cases_per_million',
 'total_deaths_per_million',
 'new_deaths_per_million',
 'total_tests',
 'new_tests',
 'total_tests_per_thousand',
 'new_tests_per_thousand',
 'tests_units',
 'population',
 'population_density',
 'median_age',
 'aged_65_older',
 'aged_70_older',
 'gdp_per_capita',
 'extreme_poverty',
 'cvd_death_rate',
 'diabetes_prevalence',
 'female_smokers',
 'male_smokers',
 'handwashing_facilities',
 'hospital_beds_per_100k']

In [487]:
df_OWID_Covid.query('location == "Germany" and date == "2020-05-13"')

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
4194,DEU,Germany,2020-05-13,171306,798,7634,101,2044.616,9.524,91.115,...,21.453,15.957,45229.245,,156.139,8.31,28.2,33.1,,8.0


In [488]:
df_Check = df_JHU_Confirmed.copy()
df_Check.rename(columns={'Country/Region': 'country'}, inplace=True)
df_Check.query('country == "Germany"')

Unnamed: 0,Province/State,country,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
120,,Germany,51.0,9.0,0,0,0,0,0,1,...,168162,169430,170588,171324,171879,172576,173171,174098,174478,175233
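The two outputs above allow a quick comparison of both sources for the same country and day. The following sketch copies the two values for Germany on 2020-05-13 from those outputs and computes the difference; the variable names are chosen for illustration only.

```python
# Cumulative confirmed cases for Germany on 2020-05-13, copied from the two outputs above
owid_total_cases = 171306   # df_OWID_Covid, column 'total_cases'
jhu_confirmed = 174098      # df_JHU_Confirmed, column '5/13/20'

absolute_difference = jhu_confirmed - owid_total_cases
relative_difference = absolute_difference / owid_total_cases

print(absolute_difference)
print(round(relative_difference * 100, 2))
```

The difference is small relative to the total, but it shows that the two sources cannot be mixed without deciding which one takes precedence.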


### Findings that contradict the requirements:

#### Quality Observations:
- Validity: Some observations/rows in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality' contain the values for a region; for example, Australia appears multiple times in the country column as the observations are per region.
- Consistency: Data about Covid-19 cases differs slightly between Johns Hopkins and OWID. Data that is available in both datasets will be kept only from Johns Hopkins.

#### Tidiness Observations:
- The data of 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality' should be one observational unit 'df_covid' with columns 'country', 'date', 'recovered', 'confirmed', 'fatal' and 'date' being of type datetime.
- Column 'Country/Region' should only contain countries, therefore the column name should be 'country'; the same applies to the OWID data.
- Columns 'Province/State', 'Lat' and 'Long' are not necessary in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality'
- Data for countries which are not of interest is not needed in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality', 'df_JHU_Countries'
- In the different dataframes there is Covid-related data that varies daily and data not directly Covid-related that varies monthly or is even constant. Thus, there should be three observational units: df_covid for Covid-related data with daily observations, df_country
- Use the ISO code as primary key



<a id='clean'></a>
## 4. Clean data

In [489]:
# Create copies for cleaning process to preserve original dataframes
df_JHU_Fatality_clean = df_JHU_Fatality.copy()
df_JHU_Confirmed_clean = df_JHU_Confirmed.copy()
df_JHU_Recovered_clean = df_JHU_Recovered.copy()
df_JHU_Countries_clean = df_JHU_Countries.copy()
df_OWID_Covid_clean = df_OWID_Covid.copy()
df_OWID_Testing_clean = df_OWID_Testing.copy()
df_OWID_Countries_clean = df_OWID_Countries.copy()
df_ESTAT_census_clean = df_ESTAT_census.copy()
df_WIKI_ICU_clean = df_WIKI_ICU.copy()
df_UN_births_clean = df_UN_births.copy()
df_UN_deaths_clean = df_UN_deaths.copy()

### Issue 1:
#### Observe:
-  Tidiness: Columns 'Province/State', 'Lat' and 'Long' are not necessary in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality'

#### Define:
- Drop columns 'Province/State', 'Lat' and 'Long'

#### Code:

In [444]:
# Drop columns which are not necessary for the analysis
df_JHU_Fatality_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
df_JHU_Confirmed_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
df_JHU_Recovered_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
# Drop OWID columns which are not kept
df_OWID_Covid_clean.drop(['total_cases',
                         'new_cases',
                         'total_deaths',
                         'new_deaths',
                         'total_cases_per_million',
                         'new_cases_per_million',
                         'total_deaths_per_million',
                         'new_deaths_per_million',
                         'total_tests_per_thousand',
                         'new_tests_per_thousand',
                         'population',
                         'population_density',], axis=1, inplace=True)

#### Test:

In [445]:
# Check if columns 'Province/State', 'Lat' and 'Long' were dropped
df_JHU_Fatality_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,104,106,109,115,120,122,127,132,136,153


In [446]:
# Check if columns 'Province/State', 'Lat' and 'Long' were dropped
df_JHU_Confirmed_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,3392,3563,3778,4033,4402,4687,4963,5226,5639,6053


In [447]:
# Check if columns 'Province/State', 'Lat' and 'Long' were dropped
df_JHU_Recovered_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,458,468,472,502,558,558,610,648,691,745


### Issue 2:
#### Observe:
- Tidiness: Column 'Country/Region' should only contain countries, therefore the column name should be 'country'.

#### Define:
- Rename column 'Country/Region' to 'country'

#### Code:

In [448]:
# Rename column in place so that all dataframes share an identical key name
df_JHU_Fatality_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_JHU_Confirmed_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_JHU_Recovered_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_JHU_Countries_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_OWID_Covid_clean.rename(columns={'location': 'country'}, inplace=True)



#### Test:

In [449]:
assert df_JHU_Fatality_clean.country.any()

In [450]:
assert df_JHU_Confirmed_clean.country.any()

In [451]:
assert df_JHU_Recovered_clean.country.any()

In [452]:
assert df_JHU_Countries_clean.country.any()

In [453]:
assert df_OWID_Covid_clean.country.any()

### Issue 3:
#### Observe:
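- Tidiness: Data for countries which are not of interest is not needed in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality', 'df_JHU_Countries'

#### Define:
- Create array with countries of interest and keep only rows of these countries for all dataframes.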

#### Code:
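The notebook does not yet contain code for this issue. The step defined above could look like the following minimal sketch, which uses a toy dataframe. The list `countries_of_interest` and all values are hypothetical, since the actual selection of countries is not fixed in this notebook.

```python
import pandas as pd

# Hypothetical selection; the actual countries of interest are not fixed in this notebook
countries_of_interest = ['Germany', 'Poland', 'Vietnam']

# Toy dataframe standing in for one of the cleaned dataframes (hypothetical values)
df_toy = pd.DataFrame({
    'country': ['Germany', 'Namibia', 'Poland', 'Nepal', 'Vietnam'],
    '5/15/20': [5, 4, 3, 2, 1],
})

# Keep only the rows whose country is in the selection
df_filtered = df_toy[df_toy['country'].isin(countries_of_interest)].reset_index(drop=True)
print(df_filtered)
```

A matching test would assert that `set(df_filtered['country'])` is a subset of `countries_of_interest`.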

#### Test:

### Issue 4:
#### Observe:
- Validity: Some observations/rows in dataframes 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality' contain the values for a region, for example Australia appears multiple times in column country as the observations are per region.

#### Define: 
- Sum values of rows with same entry in column country by using groupby

#### Code:

In [454]:
# Groupby and sum
df_JHU_Fatality_clean = df_JHU_Fatality_clean.groupby(['country'], as_index=False).sum()
df_JHU_Confirmed_clean = df_JHU_Confirmed_clean.groupby(['country'], as_index=False).sum()
df_JHU_Recovered_clean = df_JHU_Recovered_clean.groupby(['country'], as_index=False).sum()

#### Test:

In [455]:
df_JHU_Fatality_clean.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
183    False
184    False
185    False
186    False
187    False
Length: 188, dtype: bool
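Note that `duplicated()` without arguments compares whole rows, so two rows with the same country but different counts are not flagged. A stricter test checks the country column only. The following is a minimal sketch on a toy dataframe with hypothetical values:

```python
import pandas as pd

# Toy dataframe: 'Australia' appears once per region (hypothetical values)
df_toy = pd.DataFrame({
    'country': ['Australia', 'Australia', 'Austria'],
    '5/15/20': [1, 2, 3],
})

# Whole-row comparison does not flag the repeated country
assert not df_toy.duplicated().any()
# Comparison on the key column does
assert df_toy['country'].duplicated().any()

# After groupby and sum every country appears exactly once
df_grouped = df_toy.groupby(['country'], as_index=False).sum()
assert df_grouped['country'].is_unique
print(df_grouped)
```

Applied to the notebook, the corresponding check would be `assert df_JHU_Fatality_clean['country'].is_unique`, and likewise for the other two dataframes.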

### Issue 5:
#### Observe:
- Tidiness: The data of 'df_JHU_Confirmed', 'df_JHU_Recovered', 'df_JHU_Fatality' should be one observational unit 'df_covid' with columns 'country', 'date', 'recovered', 'confirmed', 'fatal' and 'date' being of type datetime.

#### Define:
- Melt the date columns to one column 'date', transform 'date' to type datetime and merge the three dataframes to one dataframe 'df_covid' with sorted date values.

#### Code:

In [456]:
# Melt each dataframe so that it results in the columns country, date and one value column
df_JHU_Fatality_clean = pd.melt(df_JHU_Fatality_clean, id_vars = ['country'], var_name='date', value_name='fatal')
df_JHU_Confirmed_clean = pd.melt(df_JHU_Confirmed_clean, id_vars = ['country'], var_name='date', value_name='confirmed')
df_JHU_Recovered_clean = pd.melt(df_JHU_Recovered_clean, id_vars = ['country'], var_name='date', value_name='recovered')

In [457]:
# Convert new columns date to datetime
df_JHU_Fatality_clean.date=pd.to_datetime(df_JHU_Fatality_clean.date)
df_JHU_Confirmed_clean.date=pd.to_datetime(df_JHU_Confirmed_clean.date)
df_JHU_Recovered_clean.date=pd.to_datetime(df_JHU_Recovered_clean.date)

In [458]:
# Merge three covid dataframes to one
df_covid = pd.merge(df_JHU_Fatality_clean, df_JHU_Confirmed_clean, on=['country','date'])
df_covid = pd.merge(df_covid, df_JHU_Recovered_clean, on=['country','date'])

In [459]:
# Sort date values by date
df_covid = df_covid.sort_values(by='date', ascending=True)

#### Test:

In [460]:
list(df_covid)

['country', 'date', 'fatal', 'confirmed', 'recovered']

In [461]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21620 entries, 0 to 21619
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   country    21620 non-null  object        
 1   date       21620 non-null  datetime64[ns]
 2   fatal      21620 non-null  int64         
 3   confirmed  21620 non-null  int64         
 4   recovered  21620 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 1013.4+ KB


In [462]:
df_covid

Unnamed: 0,country,date,fatal,confirmed,recovered
0,Afghanistan,2020-01-22,0,0,0
120,Namibia,2020-01-22,0,0,0
121,Nepal,2020-01-22,0,0,0
122,Netherlands,2020-01-22,0,0,0
123,New Zealand,2020-01-22,0,0,0
...,...,...,...,...,...
21496,Gambia,2020-05-15,1,23,10
21497,Georgia,2020-05-15,12,671,393
21498,Germany,2020-05-15,7897,175233,151597
21489,Estonia,2020-05-15,63,1766,923


### Issue 6:
#### Observe:
- Consistency: Some country names differ from one dataframe to another 

#### Define:
- ...

#### Code:

In [463]:
# Replace differing country names (assignment instead of inplace on a selected column)
df_covid['country'] = df_covid['country'].replace({'US': 'United States', 'Taiwan*': 'Taiwan'})
df_JHU_Countries_clean['country'] = df_JHU_Countries_clean['country'].replace({'US': 'United States', 'Taiwan*': 'Taiwan'})

In [464]:
df_covid = pd.merge(df_covid,df_OWID_Covid_clean[['country','Target_Column']],on='Key_Column', how='left')

KeyError: "['Target_Column'] not in index"
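The KeyError above occurs because 'Target_Column' and 'Key_Column' are placeholders and not column names of the dataframes. The following is a minimal sketch of a left merge on toy data, assuming that the join keys are 'country' and 'date' and that 'new_tests' is the column to be added; both choices are assumptions and not confirmed by the notebook.

```python
import pandas as pd

# Toy stand-ins for df_covid and df_OWID_Covid_clean (hypothetical values)
df_left = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Taiwan'],
    'date': pd.to_datetime(['2020-05-14', '2020-05-15', '2020-05-15']),
    'confirmed': [10, 12, 3],
})
df_right = pd.DataFrame({
    'country': ['Germany', 'Germany'],
    'date': ['2020-05-14', '2020-05-15'],   # stored as strings, as read from a csv-file
    'new_tests': [100.0, 110.0],
})

# Both key columns need the same dtype before merging
df_right['date'] = pd.to_datetime(df_right['date'])

# Left merge keeps every row of df_left; indicator marks rows without a match
df_merged = pd.merge(df_left, df_right[['country', 'date', 'new_tests']],
                     on=['country', 'date'], how='left', indicator=True)
print(df_merged)
```

Rows marked 'left_only' in the '_merge' column point to country names or dates that exist in only one source, which helps to find remaining naming inconsistencies.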

#### Test:

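In [None]:
# Sorted country names of both sources
countries_df_JHU = sorted(df_covid['country'].unique())
countries_df_OWID = sorted(df_OWID_Covid_clean['country'].unique())

countries_df_OWID

In [None]:
# Countries that are available in both dataframes
countries_in_both_df = sorted(set(countries_df_JHU) & set(countries_df_OWID))
len(countries_in_both_df)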

In [None]:
len(df_covid.country.unique())

In [None]:
len(df_OWID_Covid_clean.country.unique())

In [None]:
list(df_OWID_Covid_clean)

In [None]:
df_OWID_Covid_clean.new_tests.value_counts()

<a id='store'></a>
## 5. Store clean data

In [None]:
# Store cleaned dataset to csv
df_covid.to_csv('covid_master.csv', encoding='utf-8')
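When the stored file is read again, for example in exploration_Covid19.ipynb, the index and the datetime type of 'date' have to be restored explicitly. The following is a minimal sketch with a toy dataframe and an in-memory buffer instead of the real file; all values are hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for df_covid (hypothetical values)
df_toy = pd.DataFrame({
    'country': ['Germany', 'Estonia'],
    'date': pd.to_datetime(['2020-05-15', '2020-05-15']),
    'fatal': [2, 1],
})

# Write to an in-memory buffer instead of 'covid_master.csv'
buffer = io.StringIO()
df_toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index, parse_dates restores the datetime dtype
df_restored = pd.read_csv(buffer, index_col=0, parse_dates=['date'])
print(df_restored.dtypes)
```

Alternatively, passing `index=False` to `to_csv` avoids writing the index column in the first place.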