# Wrangle and Analyze Data of a Twitter Account


## Table of Contents
- [1. Introduction](#intro)
- [2. Gather data](#gather)
- [3. Assess data](#assess)
- [4. Clean data](#clean)
- [5. Store](#store)


<a id='gather'></a>
## 1. Introduction

This project is an analysis of correlation between the Covid-19 cases and the political environment of different countries. Goal is to find answers or at least indicators to questions like: 
- Did the countries which had more success in containing the amount of Covid-19 cases something in common? 
- Is there a correlation in  Gross domestic product, Human Development Index or political ideology with the amount of Covid-19 cases of the country.

Main goal of this project is to generate a comprehensive exploratory and explanatory data analysis of the gathered data. The data analysis process is distributed over three ipynb-files: gather_clean_Covid19.ipynb, exploration_Covid19.ipynb and slide_deck_Covid19.ipynb.

Firstly, as part of gather_clean_Covid19.ipynb data is gathered from different sources: The Covid-19 data of this project is retrieved via programmatically downloaded csv-files from the GitHub repository [Covid-19](https://github.com/CSSEGISandData/COVID-19) and additional data about countries is retrieved via the wikipedia API. Secondly, the data from the different sources is visually and programmatically assessed to be cleaned.
The exploratory and explanatory data analysis of the gathered data is performed in exploration_Covid19.ipynb. Finally the findings are presented in slide_deck_Covid19.ipynb.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from datetime import date
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import os # to work with local directory
import re
import wptools
import json # to create json file from python dictionary
import time # for timer 
sns.set()

<a id='intro'></a>
## 2. Gather data

####  Data is gathered from three different sources of data as described in steps below:

1. Fatality, confirmed cases, recovered cases and data by country is retrieved via programmatically downloaded csv-files from the GitHub repository [Covid-19](https://github.com/CSSEGISandData/COVID-19).
2. Additional data is retrieved via the wptools API from different wikipedia articles.

### a. Read data from programmatically download csv-file

In [2]:
# Gather data from John Hopkins GitHub 
df_Fatality = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_Confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df_Recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_Countries = pd.read_csv('https://raw.githubusercontent.com/RRighart/covid-19/master/countries.csv')

In [4]:
df_OWID_covid = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')

### b. Query additional information for countries via wikipedia API

Additional Information
- Leader Gender
- Ideology of Leading Party
- Amount of Intensive Care Beds
- Gross domestic product per capita
- Human Development Index

In [407]:
# Query for every tweet id in enhanced twitter archive and save tweet-information in json-format to 'tweet_json.txt'
'''             
country_jsons = {}
county_id_errors = []
start = time.time()
count = 0


with open('country_json.txt', 'w') as outfile:
    
    for country in df_Countries['Country/Region']:
        count +=1
        try:
            # Query API for data of wikipedia article
            article = wptools.page(country).get_parse()
            infobox = article.data['infobox']
            # Measure elapsed time
            mid_s = time.time()
            # Print id and time elapsed
            print(str(count) + str(mid_s - start) )
            # Write json of tweet to 'tweet_json.txt'
            json.dump(infobox, outfile)
            # New line
            outfile.write("\n")

        # Not best practice to catch all exceptions but fine for this short script
        except Exception as error:
            mid_f = time.time()
            print(str(count) + str(mid_f - start) + str(error))
            # Gather ids of id's without status
            tweet_id_errors.append([count, str(tweet_id)])
            
    end = time.time()
    print(end - start)
    
    '''

'             \ncountry_jsons = {}\ncounty_id_errors = []\nstart = time.time()\ncount = 0\n\n\nwith open(\'country_json.txt\', \'w\') as outfile:\n    \n    for country in df_Countries[\'Country/Region\']:\n        count +=1\n        try:\n            # Query API for data of wikipedia article\n            article = wptools.page(country).get_parse()\n            infobox = article.data[\'infobox\']\n            # Measure elapsed time\n            mid_s = time.time()\n            # Print id and time elapsed\n            print(str(count) + str(mid_s - start) )\n            # Write json of tweet to \'tweet_json.txt\'\n            json.dump(infobox, outfile)\n            # New line\n            outfile.write("\n")\n\n        # Not best practice to catch all exceptions but fine for this short script\n        except Exception as error:\n            mid_f = time.time()\n            print(str(count) + str(mid_f - start) + str(error))\n            # Gather ids of id\'s without status\n           

In [408]:
'''
so = wptools.page('Germany').get_parse()
infobox = so.data['infobox']
print(infobox)
'''

"\nso = wptools.page('Germany').get_parse()\ninfobox = so.data['infobox']\nprint(infobox)\n"

<a id='assess'></a>
## 3. Assess data

After gathering each of the above pieces of data, they are assessed visually and programmatically for quality and tidiness issues. Requirements to be met:

- Quality requirements:
    - Completeness: All necessary records in dataframes, no specific rows, columns or cells missing.
    - Validity: No records available, that do not conform schema.
    - Accuracy: No wrong data, that is valid.
    - Consistency: No data, that is valid and accurate, but referred to in multiple correct ways.
- Tidiniss requirements (as defined by Hadley Wickham):
    - each variable is a column
    - each observation is a row
    - each type of observational unit is a table.

### a. Visual assessment

In [409]:
# Check layout of df_Countries vsiually
df_Countries.head()

Unnamed: 0.1,Unnamed: 0,Country/Region,inhabitants,area
0,0,US,325386357,9833520
1,1,Germany,83792987,357386
2,2,France,65227357,551695
3,3,Belgium,11579502,30510
4,4,Netherlands,17123478,41198


In [410]:
# Check layout of df_Fatality vsiually
df_Fatality.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,85,90,95,104,106,109,115,120,122,127
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,31,31,31,31,31,31,31,31,31,31
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,463,465,470,476,483,488,494,502,507,515
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,45,45,46,46,47,47,48,48,48,48
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,2,2,2,2,2,2,2,2,2,2


In [411]:
# Check layout of df_Confirmed vsiually
df_Confirmed.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,2704,2894,3224,3392,3563,3778,4033,4402,4687,4963
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,795,803,820,832,842,850,856,868,872,876
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,4474,4648,4838,4997,5182,5369,5558,5723,5891,6067
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,748,750,751,751,752,752,754,755,755,758
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,35,35,36,36,36,43,43,45,45,45


In [412]:
# Check layout of df_Recovered vsiually
df_Recovered.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,345,397,421,458,468,472,502,558,558,610
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,531,543,570,595,605,620,627,650,654,682
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,1936,1998,2067,2197,2323,2467,2546,2678,2841,2998
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,493,499,514,521,526,537,545,550,550,568
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,11,11,11,11,11,11,13,13,13,13


### b. Programmatic assessment

In [413]:
# List of countries that are avaoilable in John Hopkins Dataset
df_Recovered['Country/Region'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Brunei',
       'Bulgaria', 'Burkina Faso', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Diamond Princess',
       'Cuba', 'Cyprus', 'Czechia', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
       'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Grenada', 'Greece', 'Guatemala', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'H

In [10]:
# Available variables in dataset
list(df_OWID_covid)

['iso_code',
 'location',
 'date',
 'total_cases',
 'new_cases',
 'total_deaths',
 'new_deaths',
 'total_cases_per_million',
 'new_cases_per_million',
 'total_deaths_per_million',
 'new_deaths_per_million',
 'total_tests',
 'new_tests',
 'total_tests_per_thousand',
 'new_tests_per_thousand',
 'tests_units',
 'population',
 'population_density',
 'median_age',
 'aged_65_older',
 'aged_70_older',
 'gdp_per_capita',
 'extreme_poverty',
 'cvd_death_rate',
 'diabetes_prevalence',
 'female_smokers',
 'male_smokers',
 'handwashing_facilities',
 'hospital_beds_per_100k']

In [15]:
df_OWID_covid.query('location == "Germany" and date == "2020-05-13"')

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
4144,DEU,Germany,2020-05-13,171306,798,7634,101,2044.616,9.524,91.115,...,21.453,15.957,45229.245,,156.139,8.31,28.2,33.1,,8.0


In [18]:
df_Check = df_Confirmed.copy()
df_Check.rename(columns={'Country/Region': 'country'}, inplace=True)
df_Check.query('country == "Germany"')

Unnamed: 0,Province/State,country,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20,5/13/20,5/14/20,5/15/20
120,,Germany,51.0,9.0,0,0,0,0,0,1,...,168162,169430,170588,171324,171879,172576,173171,174098,174478,175233


### Findings, which contradict requirements:

#### Quality Observations:
- Validity: Some observations/rows in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality' contain the values for a region, for example Australia appears multiple times in column country as the observations are per region.
- Consistency: Data about Covid-19 cases differs slightly between John Hopkins and OWID, data which is available in both datasets will be kept only from John Hopkins

#### Tidiness Observations:
- The data of 'df_Confirmed', 'df_Recovered', 'df_Fatality' should be one observational unit 'df_covid' with columns 'country', 'date', 'recovered', 'confirmed', 'fatal' and 'date' beeing of type datetime.
- Column 'Country/Region' should only contain countries, therefore column name should by 'country', same for OWID data.
- Columns 'Province/State', 'Lat' and 'Long' are not necessary in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality'
- Data for countries, which are not of interested is not needed in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality', 'df_Countries'

<a id='clean'></a>
## 4. Clean data

In [415]:
# Create copies for cleaning process to preserve original dataframes
df_Fatality_clean = df_Fatality.copy()
df_Confirmed_clean = df_Confirmed.copy()
df_Recovered_clean = df_Recovered.copy()
df_Countries_clean = df_Countries.copy()
df_OWID_covid_clean = df_OWID_covid.copy()

### Issue 1:
#### Observe:
-  Tidiness: Columns 'Province/State', 'Lat' and 'Long' are not necessary in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality'

#### Define:
- Drop columns 'Province/State', 'Lat' and 'Long'

#### Code:

In [416]:
# Drop variables which are only necessary for retweets
df_Fatality_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
df_Confirmed_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
df_Recovered_clean.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
df_OWID_covid_clean(['iso_code',
                     'total_cases',
                     'new_cases',
                     'total_deaths',
                     'new_deaths',
                     'total_cases_per_million',
                     'new_cases_per_million',
                     'total_deaths_per_million',
                     'new_deaths_per_million',
                     'new_tests',
                     'total_tests_per_thousand',
                     'new_tests_per_thousand',
                     'population',
                     'population_density',], axis=1, inplace=True)

#### Test:

In [417]:
# Check if columnns 'Province/State', 'Lat' and 'Long' dropped
df_Fatality_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,85,90,95,104,106,109,115,120,122,127


In [418]:
# Check if columnns 'Province/State', 'Lat' and 'Long' dropped
df_Confirmed_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,2704,2894,3224,3392,3563,3778,4033,4402,4687,4963


In [419]:
# Check if columnns 'Province/State', 'Lat' and 'Long' dropped
df_Recovered_clean.head(1)

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,5/3/20,5/4/20,5/5/20,5/6/20,5/7/20,5/8/20,5/9/20,5/10/20,5/11/20,5/12/20
0,Afghanistan,0,0,0,0,0,0,0,0,0,...,345,397,421,458,468,472,502,558,558,610


### Issue 2:
#### Observe:
- Tidiness: Column 'Country/Region' should only contain countries, therefore column name should by 'Country'.

#### Define:
- Rename column 'Country/Region' to 'country'

#### Code:

In [420]:
# Rename coloumn inplace
df_Fatality_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_Confirmed_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_Recovered_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_Countries_clean.rename(columns={'Country/Region': 'country'}, inplace=True)
df_OWID_covid_clean.rename(columns={'location': 'country'}, inplace=True)

#### Test:

In [421]:
assert df_Fatality_clean.country.any()

In [422]:
assert df_Confirmed_clean.country.any()

In [423]:
assert df_Recovered_clean.country.any()

In [424]:
assert df_Countries_clean.country.any()

### Issue 3:
#### Observe:
- Tidiness: Data for countries, which are not of interested is not needed in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality', 'df_Countries'.

#### Define
- Create array with countries of interest and keep only rows of thes countries for all dataframes.

#### Code:

In [425]:
# Create array with countries of interest
countries = ['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'China', 'Denmark', 'Finland', 'France', 
             'Germany', 'Greece', 'Hungary', 'India', 'Indonesia', 'Iran', 'Israel', 'Italy', 'Japan', 
             'Korea, South', 'Luxembourg', 'Mexico', 'Netherlands', 'Norway', 'Philippines', 'Poland', 'Portugal',
            'Russia', 'South Africa', 'Spain', 'Sweden', 'Switzerland', 'Taiwan*', 'Thailand', 'Tunisia', 'Turkey', 
             'United Arab Emirates', 'United Kingdom', 'US', 'Vietnam']

In [426]:
# Keep only rows of countries which are in country array
df_Fatality_clean = df_Fatality_clean[df_Fatality_clean['country'].isin(countries)]
df_Confirmed_clean = df_Confirmed_clean[df_Confirmed_clean['country'].isin(countries)]
df_Recovered_clean = df_Recovered_clean[df_Recovered_clean['country'].isin(countries)]
df_Countries_clean = df_Countries_clean[df_Countries_clean['country'].isin(countries)]

#### Test:

In [427]:
assert len(df_Fatality_clean.query('country == "Afghanistan"')) == 0

In [428]:
assert len(df_Confirmed_clean.query('country == "Afghanistan"')) == 0

In [429]:
assert len(df_Recovered_clean.query('country == "Afghanistan"')) == 0

In [430]:
assert len(df_Countries_clean.query('country == "Afghanistan"')) == 0

### Issue 4:
#### Observe:
- Validity: Some observations/rows in dataframes 'df_Confirmed', 'df_Recovered', 'df_Fatality' contain the values for a region, for example Australia appears multiple times in column country as the observations are per region.

#### Define: 
- Sum values of rows with same entry in column country by using groupby

#### Code:

In [431]:
# Groupby and sum
df_Fatality_clean = df_Fatality_clean.groupby(['country'], as_index=False).sum()
df_Confirmed_clean = df_Fatality_clean.groupby(['country'], as_index=False).sum()
df_Recovered_clean = df_Fatality_clean.groupby(['country'], as_index=False).sum()

#### Test:

In [432]:
# Check if amount of rows equals amount of countries elected to be of interest
assert df_Fatality_clean.index.nunique() == len(countries)

In [433]:
assert df_Confirmed_clean.index.nunique() == len(countries)

In [434]:
assert df_Recovered_clean.index.nunique() == len(countries)

### Issue 5:
#### Observe:
- Tidiness: The data of 'df_Confirmed', 'df_Recovered', 'df_Fatality' should be one observational unit 'df_covid' with columns 'country', 'date', 'recovered', 'confirmed', 'fatal' and 'date' beeing of type datetime.

#### Define:
- Melt date columns to one column 'date', transform date to type datetime and merge the three dataframes to ones dataframe 'df_covid' with sorted date values.

#### Code:

In [435]:
# Melt each dataframe so that results in columns: country,
df_Fatality_clean = pd.melt(df_Fatality_clean, id_vars = ['country'], var_name='date', value_name='fatal')
df_Confirmed_clean = pd.melt(df_Confirmed_clean, id_vars = ['country'], var_name='date', value_name='confirmed')
df_Recovered_clean = pd.melt(df_Recovered_clean, id_vars = ['country'], var_name='date', value_name='recovered')

In [436]:
# Convert new columns date to datetime
df_Fatality_clean.date=pd.to_datetime(df_Fatality_clean.date)
df_Confirmed_clean.date=pd.to_datetime(df_Confirmed_clean.date)
df_Recovered_clean.date=pd.to_datetime(df_Recovered_clean.date)

In [437]:
# Merge three covid dataframes to one
df_covid = pd.merge(df_Fatality_clean, df_Confirmed_clean, on=['country','date'])
df_covid = pd.merge(df_covid, df_Recovered_clean, on=['country','date'])

In [438]:
# Sort date values by date
df_covid = df_covid.sort_values(by='date', ascending=True)

#### Test:

In [439]:
list(df_covid)

['country', 'date', 'fatal', 'confirmed', 'recovered']

In [440]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4368 entries, 0 to 4367
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   country    4368 non-null   object        
 1   date       4368 non-null   datetime64[ns]
 2   fatal      4368 non-null   int64         
 3   confirmed  4368 non-null   int64         
 4   recovered  4368 non-null   int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 204.8+ KB


In [441]:
df_covid

Unnamed: 0,country,date,fatal,confirmed,recovered
0,Australia,2020-01-22,0,0,0
22,Norway,2020-01-22,0,0,0
23,Philippines,2020-01-22,0,0,0
24,Poland,2020-01-22,0,0,0
25,Portugal,2020-01-22,0,0,0
...,...,...,...,...,...
4343,Iran,2020-05-12,6733,6733,6733
4344,Israel,2020-05-12,260,260,260
4345,Italy,2020-05-12,30911,30911,30911
4347,"Korea, South",2020-05-12,259,259,259


### Issue 6:
#### Observe:
- Consistency: 

#### Define:
- ...

#### Code

#### Test:

<a id='store'></a>
## 5. Store clean data

In [442]:
# Store cleaned dataset to csv
df_covid.to_csv('covid_master.csv', encoding='utf-8')