# Data ingest

This notebook loads several COVID-19 datasets, merges them with pandas, and exports the result to the datasets folder, split by location and by continent. It also exports a random sample of the merged data to cut down run time in the analysis notebooks that follow.

### Imports

First let's load in some packages.

In [1]:
# The f-strings used in the export cells need Python 3.6+, where the
# `__future__` imports for print_function and division are no-ops.
import pandas as pd

In [2]:
pd.set_option('display.max_columns', None)

### Data ingest

Next let's load all the data into dataframes.

In [3]:
owid = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
countries = pd.read_csv('https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv')
google = pd.read_csv('https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv', header=0, na_values = [''])
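
One thing to watch for when loading these files: by default pandas reads the literal string `NA` as a missing value, and `NA` is also the ISO 3166 alpha-2 code for Namibia. Passing `na_values=['']` on its own adds to the default list rather than replacing it. Here is a minimal sketch of the difference, using a made-up two-row CSV in place of the real country list:

```python
import io

import pandas as pd

# Tiny stand-in for the ISO 3166 country list (made-up sample, not the real file).
csv_text = "name,alpha-2,alpha-3\nNamibia,NA,NAM\nNorway,NO,NOR\n"

# Default parsing: the string "NA" is on pandas' default missing-value list.
default = pd.read_csv(io.StringIO(csv_text))

# Treat only empty fields as missing, so "NA" survives as a real code.
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

print(default['alpha-2'].isna().sum())  # 1
print(strict['alpha-2'].tolist())       # ['NA', 'NO']
```

Whether this matters here depends on whether Namibia appears in the source files. If it does, the same two arguments could be passed to the `read_csv` calls above.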

### Data merging

Now let's prepare the datasets and merge them into a single dataframe.

In [4]:
df = pd.merge(owid, countries[['alpha-2', 'alpha-3']], how="inner", left_on='iso_code', right_on='alpha-3')

In [5]:
del df['alpha-3']
df = df.rename(columns={"iso_code": "alpha-3"})
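
An inner merge silently drops every row whose `iso_code` has no match in the country list. To see what gets dropped, one option is a left merge with `indicator=True`. The sketch below uses toy frames; the `OWID_WRL` code is an assumed example of an aggregate row with no ISO 3166 counterpart.

```python
import pandas as pd

# Toy frames; 'OWID_WRL' stands in for an aggregate row with no ISO 3166 match.
owid_toy = pd.DataFrame({
    'iso_code': ['AFG', 'NOR', 'OWID_WRL'],
    'new_cases': [1.0, 2.0, 3.0],
})
countries_toy = pd.DataFrame({'alpha-2': ['AF', 'NO'], 'alpha-3': ['AFG', 'NOR']})

# A left merge keeps every OWID row; the _merge column records where each row matched.
check = pd.merge(owid_toy, countries_toy, how='left',
                 left_on='iso_code', right_on='alpha-3', indicator=True)

# Codes that an inner merge would have dropped.
unmatched = check.loc[check['_merge'] == 'left_only', 'iso_code'].unique()
print(list(unmatched))  # ['OWID_WRL']
```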

In [6]:
df = pd.merge(df, google, how="inner", left_on=['alpha-2', 'date'], right_on=['country_region_code', 'date'])

In [7]:
del df['country_region_code']
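
Note that the mobility report contains rows for sub-regions and metro areas as well as whole countries, so after this merge each country-level OWID row repeats once per matching mobility row on the same date. If only country-level mobility is wanted, one possible approach is to filter the mobility data before merging. This sketch assumes that national rows are the ones where `sub_region_1`, `sub_region_2` and `metro_area` are all empty, and uses a toy frame:

```python
import numpy as np
import pandas as pd

# Toy mobility frame: one national row and one metro-area row for the same date.
google_toy = pd.DataFrame({
    'country_region_code': ['AF', 'AF'],
    'sub_region_1': [np.nan, np.nan],
    'sub_region_2': [np.nan, np.nan],
    'metro_area': [np.nan, 'Kabul Metropolitan Area'],
    'date': ['2020-02-24', '2020-02-24'],
    'workplaces_percent_change_from_baseline': [7.0, 6.0],
})

# Assumption: a row is country-level when every regional column is missing.
region_cols = ['sub_region_1', 'sub_region_2', 'metro_area']
national = google_toy[google_toy[region_cols].isna().all(axis=1)]

print(len(national))  # 1
```

Keeping the regional rows, as this notebook does, is also a valid choice. It just means that country-level columns such as `total_cases` should not be summed across rows.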

### Let's quickly explore!

In [8]:
df.head()

Unnamed: 0,alpha-3,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,alpha-2,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AFG,Asia,Afghanistan,2020-02-24,1.0,1.0,,,,,0.026,0.026,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,AF,Afghanistan,,,,,,ChIJbQL_-LZu0TgReNqWvg1GtfM,3.0,13.0,4.0,9.0,7.0,0.0
1,AFG,Asia,Afghanistan,2020-02-24,1.0,1.0,,,,,0.026,0.026,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,AF,Afghanistan,,,Kabul Metropolitan Area,,,ChIJ5U8S6F5v0TgREM8nxN8oCHA,-2.0,15.0,1.0,8.0,6.0,-1.0
2,AFG,Asia,Afghanistan,2020-02-25,1.0,0.0,,,,,0.026,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,AF,Afghanistan,,,,,,ChIJbQL_-LZu0TgReNqWvg1GtfM,2.0,5.0,1.0,9.0,7.0,0.0
3,AFG,Asia,Afghanistan,2020-02-25,1.0,0.0,,,,,0.026,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,AF,Afghanistan,,,Kabul Metropolitan Area,,,ChIJ5U8S6F5v0TgREM8nxN8oCHA,-2.0,7.0,4.0,8.0,5.0,1.0
4,AFG,Asia,Afghanistan,2020-02-26,1.0,0.0,,,,,0.026,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,AF,Afghanistan,,,,,,ChIJbQL_-LZu0TgReNqWvg1GtfM,-1.0,5.0,2.0,3.0,5.0,1.0


In [9]:
df.shape

(4342866, 73)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4342866 entries, 0 to 4342865
Data columns (total 73 columns):
 #   Column                                              Dtype  
---  ------                                              -----  
 0   alpha-3                                             object 
 1   continent                                           object 
 2   location                                            object 
 3   date                                                object 
 4   total_cases                                         float64
 5   new_cases                                           float64
 6   new_cases_smoothed                                  float64
 7   total_deaths                                        float64
 8   new_deaths                                          float64
 9   new_deaths_smoothed                                 float64
 10  total_cases_per_million                             float64
 11  new_cases_per_million                
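
The `df.info()` output lists `date` as `object`, which means the dates are still plain strings. If later notebooks need date arithmetic or resampling, a conversion along these lines would help. This is a sketch on a toy frame, assuming the `YYYY-MM-DD` format seen in `df.head()`:

```python
import pandas as pd

toy = pd.DataFrame({'date': ['2020-02-24', '2020-02-25'], 'new_cases': [1.0, 0.0]})

# Parse the strings into real timestamps; an explicit format avoids guessing.
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')

# Datetime accessors now work, e.g. day of week (Monday is 0).
print(toy['date'].dt.dayofweek.tolist())  # [0, 1]
```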

### Data export

Finally, let's export the data to CSV so we can use it in the upcoming notebooks.

In [11]:
df.sample(n = 50000).to_csv('../datasets/sdf.csv')
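
As written, the sample differs on every run. If the following notebooks should see the same rows each time, a fixed seed is one option. The sketch below uses a toy frame and an arbitrary seed of 42. It also passes `index=False`, since the default writes the index as an unnamed first column that comes back as `Unnamed: 0` when the file is read again.

```python
import pandas as pd

toy = pd.DataFrame({'location': ['A', 'B', 'C', 'D', 'E'], 'new_cases': range(5)})

# Guard against asking for more rows than exist.
n = min(3, len(toy))

# The same random_state gives the same rows on every run.
sample_a = toy.sample(n=n, random_state=42)
sample_b = toy.sample(n=n, random_state=42)
print(sample_a.equals(sample_b))  # True

# index=False leaves the row index out of the CSV.
csv_text = sample_a.to_csv(index=False)
print(csv_text.splitlines()[0])  # location,new_cases
```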

In [12]:
locations = df['location'].unique()
pd.DataFrame(locations).to_csv('../datasets/locations.csv')
for location in locations:
    df[df['location'] == location].to_csv(f'../datasets/df_location_{location}.csv')

In [13]:
continents = df['continent'].unique()
pd.DataFrame(continents).to_csv('../datasets/continents.csv')
for continent in continents:
    df[df['continent'] == continent].to_csv(f'../datasets/df_continent_{continent}.csv')
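
The two loops above rescan the full dataframe once per location or continent. A possible alternative is to let `groupby` split the frame in a single pass. The sketch below is not the notebook's method: `export_groups` is a hypothetical helper, the data is a toy frame, and the files go to a temporary directory. It also replaces characters that are awkward in file names, so any notebook reading these files would need to apply the same naming rule.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def export_groups(frame, column, out_dir):
    """Write one CSV per distinct value of `column`; return the paths written."""
    out_dir = Path(out_dir)
    paths = []
    # groupby splits the frame in one pass and skips rows where `column` is NaN.
    for value, group in frame.groupby(column):
        # Replace anything other than letters, digits, underscore or hyphen.
        safe = re.sub(r'[^\w\-]+', '_', str(value))
        path = out_dir / f'df_{column}_{safe}.csv'
        group.to_csv(path, index=False)
        paths.append(path)
    return paths


toy = pd.DataFrame({
    'location': ['Afghanistan', 'Afghanistan', "Cote d'Ivoire"],
    'continent': ['Asia', 'Asia', 'Africa'],
    'new_cases': [1.0, 0.0, 2.0],
})

with tempfile.TemporaryDirectory() as tmp:
    written = export_groups(toy, 'location', tmp)
    names = sorted(p.name for p in written)
    print(names)  # ['df_location_Afghanistan.csv', 'df_location_Cote_d_Ivoire.csv']
```

The same helper could be called with `'continent'` as the column to cover the second loop.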