# Data Visualization Project - Data Engineering

This notebook contains all the data manipulations we will perform throughout the development of the Covid-19 poster project for the Data Visualization curricular unit.

The goal of the project is to show that there is flagrant inequality in access to Covid-19 vaccines between developed and developing countries. To do that, we will rely on data from different sources, merge it together and output a solid dataset that can be used in a data visualization tool like Tableau or Microsoft Power BI.

#### Brief outline of desired columns and the source used:

#### Part 1 - General country information and representation

1. Country Name
2. Location - polygon design - to allow for representation 
Sources: World map shapefile: a file with the necessary data to allow world map visualization; https://hub.arcgis.com/datasets/2b93b06dc0dc4e809d3c8db5cb96ba69_0

3. GDP 
4. Population 
5. GDP p/capita
Source: IMF, World Bank
Source: https://www.imf.org/en/Publications/WEO/weo-database/2020/October

#### Part 2 - Covid Vaccine Data

Contracted quantity by manufacturer:
https://launchandscalefaster.org/COVID-19

specifically
https://public.tableau.com/vizql/w/TimelineofCOVIDVaccineProcurementDeals_16125539354560/v/Dashboard1/viewData/sessions/BD1E18003B5448B88669524972EB60A5-0:0/views/16126187992227925297_15952188591581136529?maxrows=200&viz=%7B%22worksheet%22%3A%22Sheet%201%22%2C%22dashboard%22%3A%22Dashboard%201%22%7D

vaccination by country: other vaccination data - (number of vaccines taken) https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations

vaccination by manufacturer - vaccinations performed (not bought) https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/vaccinations-by-manufacturer.csv

Price of vaccines - UNICEF - may rely on
https://app.powerbi.com/view?r=eyJrIjoiNmE0YjZiNzUtZjk2OS00ZTg4LThlMzMtNTRhNzE0NzA4YmZlIiwidCI6Ijc3NDEwMTk1LTE0ZTEtNGZiOC05MDRiLWFiMTg5MjAyMzY2NyIsImMiOjh9&pageName=ReportSectiona329b3eafd86059a947b

Scheduled (expected) date for the first vaccine deliveries

#### Part 3 - The Dream - to be considered only once the data for Parts 1 and 2 has been found.

In countries that still have no vaccine, this many people have already died........; how many more are we willing to have or accept? 
Confirmed deaths
Projected deaths until the country gets the vaccine (expected date) - not available

Delivery deadline
Expected contract date
Quantities actually delivered in each time period!

In [1]:
#!pip install  openpyxl 

In [2]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from math import ceil
import warnings

import os
warnings.filterwarnings("ignore")
import matplotlib.gridspec as gspec

In [3]:
import geopandas as gdp

In [4]:
#Select the path where you've put your dataset provided in Moodle

#step 1: check current file directory
path = os.getcwd()
files = os.listdir(path)
files

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'Datasets and SHPs',
 'Data_Viz_CoVID.ipynb',
 'README.md']

In [5]:
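#build the path with os.path.join: '\g' is an invalid escape sequence and a backslash separator only works on Windows
os.chdir(os.path.join('Datasets and SHPs', 'general'))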
path = os.getcwd()
files = os.listdir(path)
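
Changing the working directory with `os.chdir` makes later cells depend on the order in which they were run. One alternative, sketched here under the assumption that the notebook is started from the repository root shown in the listing above, is to build the file paths explicitly and leave the working directory alone. `DATA_DIR` and `GENERAL_DIR` are illustrative names, not part of the original notebook.

```python
from pathlib import Path

# Assumption: the notebook runs from the repository root listed above.
DATA_DIR = Path("Datasets and SHPs")
GENERAL_DIR = DATA_DIR / "general"

# Build file paths without calling os.chdir; the separator is handled per platform.
pop_path = GENERAL_DIR / "POP.xlsx"
gdp_path = GENERAL_DIR / "GDP,US.xlsx"

print(pop_path.as_posix())
```

Paths built this way can be passed directly to `pd.read_excel`.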

In [6]:
pop = pd.read_excel('POP.xlsx')

#load gdp dataset
GDP = pd.read_excel("GDP,US.xlsx")

In [7]:
GDP

Unnamed: 0,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",GDP p/capita
0,Afghanistan,19.006,499.441
1,Albania,14.034,4898.277
2,Algeria,147.323,3331.076
3,Angola,62.724,2021.310
4,Antigua and Barbuda,1.389,14158.571
...,...,...,...
187,Vietnam,340.602,3497.512
188,West Bank and Gaza,14.750,2894.069
189,Yemen,20.948,645.126
190,Zambia,18.909,1001.440


In [8]:
data = pd.merge(GDP, pop, on='Country')

In [9]:
data

Unnamed: 0,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",GDP p/capita,Population in 2020 (Millions)
0,Afghanistan,19.006,499.441,38.055
1,Albania,14.034,4898.277,2.865
2,Algeria,147.323,3331.076,44.227
3,Angola,62.724,2021.310,31.031
4,Antigua and Barbuda,1.389,14158.571,0.098
...,...,...,...,...
187,Vietnam,340.602,3497.512,97.384
188,West Bank and Gaza,14.750,2894.069,5.097
189,Yemen,20.948,645.126,32.471
190,Zambia,18.909,1001.440,18.882


In [10]:
data = data[['Country','GDP, current prices, in  2020 (Bilions U.S. dollar)','Population in 2020 (Millions)','GDP p/capita']]

In [11]:
data

Unnamed: 0,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",Population in 2020 (Millions),GDP p/capita
0,Afghanistan,19.006,38.055,499.441
1,Albania,14.034,2.865,4898.277
2,Algeria,147.323,44.227,3331.076
3,Angola,62.724,31.031,2021.310
4,Antigua and Barbuda,1.389,0.098,14158.571
...,...,...,...,...
187,Vietnam,340.602,97.384,3497.512
188,West Bank and Gaza,14.750,5.097,2894.069
189,Yemen,20.948,32.471,645.126
190,Zambia,18.909,18.882,1001.440
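
Since GDP is given in billions of U.S. dollars and population in millions, GDP per capita should be roughly GDP / population × 1000. The sketch below runs that sanity check on three rows copied from the table above; the column names `gdp_billions`, `pop_millions` and `gdp_per_capita` are shortened stand-ins chosen for this example, and the 0.1% tolerance is an arbitrary choice that allows for rounding in the published figures.

```python
import pandas as pd

# Sample rows copied from the table above (GDP in billions USD, population in millions).
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "gdp_billions": [19.006, 14.034, 147.323],
    "pop_millions": [38.055, 2.865, 44.227],
    "gdp_per_capita": [499.441, 4898.277, 3331.076],
})

# billions / millions = thousands of USD per person, so multiply by 1000.
sample["recomputed"] = sample["gdp_billions"] / sample["pop_millions"] * 1000

# Relative difference between the published and the recomputed value.
sample["rel_diff"] = (
    (sample["recomputed"] - sample["gdp_per_capita"]).abs() / sample["gdp_per_capita"]
)

# Flag rows that deviate by more than 0.1% (tolerance chosen for this sketch).
suspicious = sample[sample["rel_diff"] > 1e-3]
print(suspicious)
```

A large deviation would point to mismatched units or a misaligned join rather than rounding.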


In [12]:
geo = gdp.read_file('World_Countries__Generalized_.shp')

In [13]:
geo = geo[['COUNTRY','geometry']]

In [14]:
geo

Unnamed: 0,COUNTRY,geometry
0,American Samoa,"POLYGON ((-170.74390 -14.37555, -170.74942 -14..."
1,United States Minor Outlying Islands,"MULTIPOLYGON (((-160.02114 -0.39805, -160.0281..."
2,Cook Islands,"MULTIPOLYGON (((-159.74698 -21.25667, -159.793..."
3,French Polynesia,"MULTIPOLYGON (((-149.17920 -17.87084, -149.258..."
4,Niue,"POLYGON ((-169.89389 -19.14556, -169.93088 -19..."
...,...,...
244,Northern Mariana Islands,"MULTIPOLYGON (((145.73468 15.08722, 145.72830 ..."
245,Palau,"MULTIPOLYGON (((134.53137 7.35444, 134.52234 7..."
246,Russian Federation,"MULTIPOLYGON (((-179.99999 68.98010, -179.9580..."
247,Spain,"MULTIPOLYGON (((-2.91472 35.27361, -2.93924 35..."


In [15]:
exp = pd.merge(geo, data, left_on='COUNTRY', right_on='Country')

In [16]:
exp

Unnamed: 0,COUNTRY,geometry,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",Population in 2020 (Millions),GDP p/capita
0,Samoa,"MULTIPOLYGON (((-172.59650 -13.50911, -172.551...",Samoa,0.829,0.203,4083.806
1,Tonga,"MULTIPOLYGON (((-175.14529 -21.26806, -175.186...",Tonga,0.503,0.100,5023.166
2,El Salvador,"POLYGON ((-87.69467 13.81901, -87.72501 13.733...",El Salvador,24.784,6.486,3821.286
3,Guatemala,"POLYGON ((-89.34831 14.43198, -89.43556 14.414...",Guatemala,76.191,17.971,4239.672
4,Mexico,"MULTIPOLYGON (((-111.56001 24.42945, -111.5761...",Mexico,1040.372,128.933,8069.104
...,...,...,...,...,...,...
180,Marshall Islands,"MULTIPOLYGON (((168.78637 7.28889, 168.76721 7...",Marshall Islands,0.225,0.055,4070.617
181,Micronesia,"MULTIPOLYGON (((158.22775 6.78055, 158.18469 6...",Micronesia,0.395,0.103,3854.743
182,Palau,"MULTIPOLYGON (((134.53137 7.35444, 134.52234 7...",Palau,0.251,0.018,14232.720
183,Russian Federation,"MULTIPOLYGON (((-179.99999 68.98010, -179.9580...",Russian Federation,1464.078,146.812,9972.495


In [17]:
a = exp['Country'].to_list()

In [18]:
b = data['Country'].to_list()

In [19]:
# countries lost in the merge because their name has no match in the shapefile
print([x for x in b if x not in set(a)])

["Côte d'Ivoire", 'Hong Kong SAR', 'Korea', 'Kosovo', 'Macao SAR', 'Taiwan Province of China', 'West Bank and Gaza']
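
The same check can be done inside the merge itself. The sketch below uses toy frames and the `indicator=True` option of `DataFrame.merge` to list the names that find no partner, then applies a renaming map and checks again. The shapefile spelling `"South Korea"` and the `name_map` dictionary are hypothetical; the real spellings would have to be looked up in `geo['COUNTRY']`.

```python
import pandas as pd

# Toy frames; the shapefile-side spelling "South Korea" is hypothetical.
shape_names = pd.DataFrame({"COUNTRY": ["Samoa", "Mexico", "South Korea"]})
stats = pd.DataFrame({"Country": ["Samoa", "Mexico", "Korea"]})


def unmatched(left, right):
    """Return the names in `left` that find no partner in `right`."""
    merged = left.merge(right, left_on="Country", right_on="COUNTRY",
                        how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", "Country"].tolist()


before = unmatched(stats, shape_names)

# Hypothetical mapping from the statistics spelling to the shapefile spelling.
name_map = {"Korea": "South Korea"}
stats_fixed = stats.assign(Country=stats["Country"].replace(name_map))
after = unmatched(stats_fixed, shape_names)

print(before, after)
```

Renaming before the merge keeps these countries in the output instead of silently dropping them with the default inner join.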


In [20]:
exp

Unnamed: 0,COUNTRY,geometry,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",Population in 2020 (Millions),GDP p/capita
0,Samoa,"MULTIPOLYGON (((-172.59650 -13.50911, -172.551...",Samoa,0.829,0.203,4083.806
1,Tonga,"MULTIPOLYGON (((-175.14529 -21.26806, -175.186...",Tonga,0.503,0.100,5023.166
2,El Salvador,"POLYGON ((-87.69467 13.81901, -87.72501 13.733...",El Salvador,24.784,6.486,3821.286
3,Guatemala,"POLYGON ((-89.34831 14.43198, -89.43556 14.414...",Guatemala,76.191,17.971,4239.672
4,Mexico,"MULTIPOLYGON (((-111.56001 24.42945, -111.5761...",Mexico,1040.372,128.933,8069.104
...,...,...,...,...,...,...
180,Marshall Islands,"MULTIPOLYGON (((168.78637 7.28889, 168.76721 7...",Marshall Islands,0.225,0.055,4070.617
181,Micronesia,"MULTIPOLYGON (((158.22775 6.78055, 158.18469 6...",Micronesia,0.395,0.103,3854.743
182,Palau,"MULTIPOLYGON (((134.53137 7.35444, 134.52234 7...",Palau,0.251,0.018,14232.720
183,Russian Federation,"MULTIPOLYGON (((-179.99999 68.98010, -179.9580...",Russian Federation,1464.078,146.812,9972.495


In [21]:
#print_SHP to test

#exp.to_file("test.shp", driver='ESRI Shapefile')

## Joining part 2:

In [22]:
#vaccines by manufacturer
#the following code searches a GitHub directory and extracts all csv files in the directory to a dictionary

#in this case, it will look into the country data folder of OWID's covid-19-data repository
#Probably not very efficient, but works

#To run this cell you might have to install bs4 and requests; uncomment if needed
#!pip install bs4
#!pip install requests

# Import the required packages: 
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re 

# Store the url as a string scalar: url => str
url = 'https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data'

# Issue request: r => requests.models.Response
r = requests.get(url)

# Extract text: html_doc => str
html_doc = r.text

# Parse the HTML with an explicit parser (avoids the "no parser specified" warning): soup => bs4.BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'a' tags that carry an href attribute (hyperlinks): a_tags => bs4.element.ResultSet
# href=True skips anchors without a link, for which link.get('href') would return None
a_tags = soup.find_all('a', href=True)

# Store a list of urls containing .csv: urls => list
urls = ['https://raw.githubusercontent.com'+re.sub('/blob', '', link['href']) 
        for link in a_tags  if '.csv' in link['href']]

# Store a list of Data Frame names (file name without the .csv extension): df_list_names => list
df_list_names = [url.split('.csv')[0].split('/')[-1] for url in urls]

# Read every csv into a Data Frame: df_list => list
df_list = [pd.read_csv(url, sep = ',') for url in urls]

# Name the dataframes in the list, coerce to a dictionary: df_dict => dict
df_dict = dict(zip(df_list_names, df_list))
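
To make the URL handling in this cell concrete, the sketch below applies the same two string transformations to a single link. The `href` value is an assumed example of what a GitHub directory listing link looks like; it is not taken from an actual response.

```python
import re

# Assumed example of an href found in the GitHub directory listing.
href = "/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Albania.csv"

# Dropping '/blob' and switching host gives the address of the raw file.
raw_url = "https://raw.githubusercontent.com" + re.sub("/blob", "", href)

# The dictionary key is the file name without its extension.
name = raw_url.split(".csv")[0].split("/")[-1]

print(raw_url)
print(name)
```

Because the cell depends on the HTML that GitHub serves for the directory page, it is worth checking that `urls` is not empty before reading the files.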



In [23]:
df_dict

{'Albania':    location        date          vaccine  \
 0   Albania  2021-01-10  Pfizer/BioNTech   
 1   Albania  2021-01-12  Pfizer/BioNTech   
 2   Albania  2021-01-13  Pfizer/BioNTech   
 3   Albania  2021-01-14  Pfizer/BioNTech   
 4   Albania  2021-01-15  Pfizer/BioNTech   
 5   Albania  2021-01-16  Pfizer/BioNTech   
 6   Albania  2021-01-17  Pfizer/BioNTech   
 7   Albania  2021-01-18  Pfizer/BioNTech   
 8   Albania  2021-01-19  Pfizer/BioNTech   
 9   Albania  2021-01-20  Pfizer/BioNTech   
 10  Albania  2021-01-21  Pfizer/BioNTech   
 11  Albania  2021-02-02  Pfizer/BioNTech   
 12  Albania  2021-02-09  Pfizer/BioNTech   
 13  Albania  2021-02-17  Pfizer/BioNTech   
 14  Albania  2021-02-18  Pfizer/BioNTech   
 15  Albania  2021-02-19  Pfizer/BioNTech   
 16  Albania  2021-02-22  Pfizer/BioNTech   
 17  Albania  2021-02-25  Pfizer/BioNTech   
 18  Albania  2021-03-01  Pfizer/BioNTech   
 19  Albania  2021-03-03  Pfizer/BioNTech   
 
                                          

In [24]:
#convert the dict into a single dataframe
# add each dict key as a 'key' column so every row can be traced back to its file
for key in df_dict.keys():
    df_dict[key]['key'] = key 

# concatenating the DataFrames
countries = pd.concat(df_dict.values())
countries

Unnamed: 0,location,date,vaccine,source_url,total_vaccinations,people_vaccinated,people_fully_vaccinated,key
0,Albania,2021-01-10,Pfizer/BioNTech,https://www.france24.com/en/live-news/20210111...,0.0,0.0,,Albania
1,Albania,2021-01-12,Pfizer/BioNTech,https://shendetesia.gov.al/dita-iii-e-vaksinim...,128.0,128.0,,Albania
2,Albania,2021-01-13,Pfizer/BioNTech,https://shendetesia.gov.al/dita-iii-e-vaksinim...,188.0,188.0,,Albania
3,Albania,2021-01-14,Pfizer/BioNTech,https://shendetesia.gov.al/dita-iv-e-vaksinimi...,266.0,266.0,,Albania
4,Albania,2021-01-15,Pfizer/BioNTech,https://shendetesia.gov.al/dita-peste-e-vaksin...,308.0,308.0,,Albania
...,...,...,...,...,...,...,...,...
9,Zimbabwe,2021-03-02,Sinopharm/Beijing,https://twitter.com/MoHCCZim/status/1366851011...,25077.0,25077.0,,Zimbabwe
10,Zimbabwe,2021-03-03,Sinopharm/Beijing,https://twitter.com/MoHCCZim/status/1367208409...,27970.0,27970.0,,Zimbabwe
11,Zimbabwe,2021-03-04,Sinopharm/Beijing,https://twitter.com/MoHCCZim/status/1367546700...,30658.0,30658.0,,Zimbabwe
12,Zimbabwe,2021-03-05,Sinopharm/Beijing,https://twitter.com/MoHCCZim/status/1367917461...,31325.0,31325.0,,Zimbabwe


In [25]:
#keep last row (most updated one)

countries = countries.drop_duplicates(subset='key', keep="last").drop(['source_url', 'key'], axis = 1)
countries

Unnamed: 0,location,date,vaccine,total_vaccinations,people_vaccinated,people_fully_vaccinated
19,Albania,2021-03-03,Pfizer/BioNTech,15793.0,,
2,Algeria,2021-02-19,Sputnik V,75000.0,,
6,Andorra,2021-02-26,Pfizer/BioNTech,2526.0,2526.0,
3,Anguilla,2021-02-26,Oxford/AstraZeneca,3929.0,3929.0,
47,Argentina,2021-03-05,Sputnik V,1357596.0,1030504.0,327092.0
...,...,...,...,...,...,...
62,United States,2021-03-06,"Moderna, Pfizer/BioNTech",87912323.0,57358849.0,29776160.0
5,Uruguay,2021-03-05,Sinovac,70408.0,70408.0,
2,Venezuela,2021-03-04,Sputnik V,12194.0,12194.0,
56,Wales,2021-03-05,"Oxford/AstraZeneca, Pfizer/BioNTech",1151582.0,983419.0,168163.0
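
`drop_duplicates(..., keep="last")` returns the most recent row only if each country's rows are already in chronological order. A minimal sketch that does not rely on file order is shown below: it sorts by date first. The toy rows reuse values from the tables above but are deliberately shuffled.

```python
import pandas as pd

# Toy rows, deliberately out of chronological order.
rows = pd.DataFrame({
    "location": ["Albania", "Albania", "Zimbabwe", "Zimbabwe"],
    "date": ["2021-01-12", "2021-01-10", "2021-03-05", "2021-03-04"],
    "total_vaccinations": [128.0, 0.0, 31325.0, 30658.0],
})
rows["date"] = pd.to_datetime(rows["date"])

# Sort within each location by date, then keep the last (latest) row.
latest = (
    rows.sort_values(["location", "date"])
        .drop_duplicates(subset="location", keep="last")
        .reset_index(drop=True)
)
print(latest)
```

Note that the latest row can still contain missing values in some columns, as the Albania row in the output above shows.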


In [26]:
#go back
os.chdir('..')
path = os.getcwd()
files = os.listdir(path)

#go to vaccinations
os.chdir('vaccinations')
path = os.getcwd()
files = os.listdir(path)

In [27]:
files

['country_data',
 'locations.csv',
 'README.md',
 'Sheet_1_data.csv',
 'us_state_vaccinations.csv',
 'vaccinations-by-manufacturer.csv',
 'vaccinations.csv',
 'vaccinations.json',
 'vaccine proc_data_07_03.csv']

In [28]:
#get vaccine data from ds

vaccines = pd.read_csv(r'vaccine proc_data_07_03.csv')
vaccines

Unnamed: 0,Country seperate (group),subtitle,Page 1,Company and Scientific Name1,Deal not on Map,Deal Period1 11,"Potential (1=yes, 0=no)1",3 Star Note,Company's Country,Country seperate,...,Purchaser's country Economic Status,Purchaser's Country Income Status,title vaccine,Tooltip deal amount,Type of Vaccine,Year,% Of National Population Able To Be Vaccinated,Deal Amount,Number of people able to be vaccinated with doses procured,Population
0,Israel,This map shows publicly-reported vaccine purch...,July 2020,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,...,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,vaccines,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
1,Israel,This map shows publicly-reported vaccine purch...,August 2020,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,...,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,vaccines,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
2,Israel,This map shows publicly-reported vaccine purch...,September 2020,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,...,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,vaccines,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
3,Israel,This map shows publicly-reported vaccine purch...,October 2020,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,...,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,vaccines,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
4,Israel,This map shows publicly-reported vaccine purch...,November 2020,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,...,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,vaccines,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4156,Latin America w/o Brazil,This map shows publicly-reported vaccine purch...,January 2021,Oxford-AstraZeneca _AZD1222,0,August 2020,Confirmed,0,UK,Latin America w/o Brazil,...,Upper middle income,LMIC,only Oxford-AstraZeneca _AZD1222,vaccines,Adenoviral,2020,1782.707876,150000000.0,75000000.0,4207083
4157,Latin America w/o Brazil,This map shows publicly-reported vaccine purch...,March 2021,Oxford-AstraZeneca _AZD1222,0,August 2020,Confirmed,0,UK,Latin America w/o Brazil,...,Upper middle income,LMIC,only Oxford-AstraZeneca _AZD1222,vaccines,Adenoviral,2020,1782.707876,150000000.0,75000000.0,4207083
4158,Latin America w/o Brazil,This map shows publicly-reported vaccine purch...,November 2020,Oxford-AstraZeneca _AZD1222,0,August 2020,Confirmed,0,UK,Latin America w/o Brazil,...,Upper middle income,LMIC,only Oxford-AstraZeneca _AZD1222,vaccines,Adenoviral,2020,1782.707876,150000000.0,75000000.0,4207083
4159,Latin America w/o Brazil,This map shows publicly-reported vaccine purch...,October 2020,Oxford-AstraZeneca _AZD1222,0,August 2020,Confirmed,0,UK,Latin America w/o Brazil,...,Upper middle income,LMIC,only Oxford-AstraZeneca _AZD1222,vaccines,Adenoviral,2020,1782.707876,150000000.0,75000000.0,4207083


In [29]:
with pd.option_context("display.max_rows", 1000):
    display(vaccines[vaccines['Country seperate (group)'] == 'African Union'])

Unnamed: 0,Country seperate (group),subtitle,Page 1,Company and Scientific Name1,Deal not on Map,Deal Period1 11,"Potential (1=yes, 0=no)1",3 Star Note,Company's Country,Country seperate,...,Purchaser's country Economic Status,Purchaser's Country Income Status,title vaccine,Tooltip deal amount,Type of Vaccine,Year,% Of National Population Able To Be Vaccinated,Deal Amount,Number of people able to be vaccinated with doses procured,Population
518,African Union,This map shows publicly-reported vaccine purch...,January 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Burundi,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,vaccines,Adenoviral,2021,8.82354,120000000.0,120000000.0,1359998350
519,African Union,This map shows publicly-reported vaccine purch...,February 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Burundi,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,vaccines,Adenoviral,2021,8.82354,120000000.0,120000000.0,1359998350
520,African Union,This map shows publicly-reported vaccine purch...,January 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Cameroon,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
521,African Union,This map shows publicly-reported vaccine purch...,February 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Cameroon,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
522,African Union,This map shows publicly-reported vaccine purch...,January 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Central African Republic,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
523,African Union,This map shows publicly-reported vaccine purch...,February 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Central African Republic,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
524,African Union,This map shows publicly-reported vaccine purch...,January 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Chad,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
525,African Union,This map shows publicly-reported vaccine purch...,February 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Chad,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
526,African Union,This map shows publicly-reported vaccine purch...,January 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Congo Republic,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350
527,African Union,This map shows publicly-reported vaccine purch...,February 2021,Janssen (J&J)_Ad26.COV2.S,0,January 2021,Confirmed,0,Belgium,Congo Republic,...,Low income,LMIC,only Janssen (J&J)_Ad26.COV2.S,unknown amount,Adenoviral,2021,,,,1359998350


In [30]:
vaccines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4161 entries, 0 to 4160
Data columns (total 32 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   Country seperate (group)                                    4161 non-null   object 
 1   subtitle                                                    4161 non-null   object 
 2   Page 1                                                      4161 non-null   object 
 3   Company and Scientific Name1                                4161 non-null   object 
 4   Deal not on Map                                             4161 non-null   int64  
 5   Deal Period1 11                                             4161 non-null   object 
 6   Potential (1=yes, 0=no)1                                    4161 non-null   object 
 7   3 Star Note                                                 4161 non-null   int64  
 8 

In [31]:
#replacing names of countries to match each other
country_name_map = {
    'UK': 'United Kingdom',
    'USA': 'United States',
    'Príncipe': 'Sao Tome and Principe',
    'São Tomé': 'Sao Tome and Principe',
    'South Korea': 'Korea',
    'Côte d’Ivoire': 'Côte d\'Ivoire',
    'DR Congo': 'Congo DRC',
    'Congo Republic': 'Congo',
    'Taiwan': 'Taiwan Province of China',
    'Hong Kong': 'Hong Kong SAR',
}
vaccines['Country seperate'] = vaccines['Country seperate'].replace(country_name_map)

In [32]:
countries['location'] = countries['location'].replace('Czechia', 'Czech Republic')
countries['location'] = countries['location'].replace('Russia', 'Russian Federation')

In [33]:
#store number of countries to see if there is difference

#vaccines df and store it in list

a = list(vaccines['Country seperate'].unique())

#countries df and store it in list

b = list(countries['location'].unique())

#names that appear in vaccines but not in countries
new_list = list(set(a).difference(set(b)))
new_list

['Korea',
 'Sao Tome and Principe',
 'Cameroon',
 'Djibouti',
 'Chad',
 'Iraq',
 'Somalia',
 'Niger',
 'Namibia',
 'Uzbekistan',
 'Liberia',
 'Latin America w/o Brazil',
 'Kenya',
 'South Sudan',
 'Angola',
 'Burkina Faso',
 'Ghana',
 'Vietnam',
 'Mali',
 'Sierra Leone',
 'Libya',
 'Cabo Verde',
 'Eswatini',
 "Côte d'Ivoire",
 'Congo',
 'Sahrawi Republic',
 'Central African Republic',
 'Palestine',
 'Zambia',
 'Burundi',
 'Equatorial Guinea',
 'Gambia',
 'North Macedonia',
 'Comoros',
 'Guinea-Bissau',
 'Nigeria',
 'Botswana',
 'Benin',
 'Ethiopia',
 'Tanzania',
 'Tunisia',
 'Sudan',
 'COVAX',
 'Lesotho',
 'Uganda',
 'Togo',
 'Mozambique',
 'Hong Kong SAR',
 'Eritrea',
 'Congo DRC',
 'Gabon',
 'Malawi',
 'Madagascar',
 'Taiwan Province of China',
 'Mauritania']

In [34]:
new_list1 = list(set(b).difference(set(a)))
new_list1

['Macao',
 'Andorra',
 'Cambodia',
 'Belarus',
 'Northern Cyprus',
 'Liechtenstein',
 'Guatemala',
 'Cayman Islands',
 'Jersey',
 'Bermuda',
 'Isle of Man',
 'Honduras',
 'Saint Helena',
 'Maldives',
 'Monaco',
 'Montenegro',
 'Barbados',
 'Trinidad and Tobago',
 'Scotland',
 'Guernsey',
 'Guyana',
 'Belize',
 'Turks and Caicos Islands',
 'Mongolia',
 'Greenland',
 'Faeroe Islands',
 'Montserrat',
 'Russian Federation',
 'South Korea',
 'Gibraltar',
 'Northern Ireland',
 'San Marino',
 'Palau',
 'Hong Kong',
 'England',
 'Bahrain',
 'Iran',
 'Wales',
 'Anguilla',
 'Falkland Islands']
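
Comparing the two lists above suggests that some mismatches are only spelling differences: 'Korea' and 'Hong Kong SAR' appear on the vaccines side, while 'South Korea' and 'Hong Kong' appear on the countries side. The sketch below shows how such pairs could be harmonized before the set comparison. The small sets and the `owid_to_vaccine` mapping are illustrative only and cover just these two pairs.

```python
# Small illustrative subsets of the two name lists.
vaccine_names = {"Korea", "Hong Kong SAR", "Albania"}
owid_names = {"South Korea", "Hong Kong", "Albania"}

# Differences before harmonizing the spellings.
raw_only_in_vaccines = sorted(vaccine_names - owid_names)

# Illustrative mapping applied to the countries (OWID) side only.
owid_to_vaccine = {"South Korea": "Korea", "Hong Kong": "Hong Kong SAR"}
owid_harmonized = {owid_to_vaccine.get(name, name) for name in owid_names}

# Differences after harmonizing: only genuine gaps should remain.
only_in_vaccines = sorted(vaccine_names - owid_harmonized)
only_in_owid = sorted(owid_harmonized - vaccine_names)

print(raw_only_in_vaccines, only_in_vaccines, only_in_owid)
```

In the notebook the equivalent step would be a `replace` on `countries['location']` with such a mapping, so that both frames use one spelling per country.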

In [35]:
#getting the set of countries in the world map shapefile
#merge exp with vaccines and countries

In [36]:
#drop irrelevant columns

vaccines = vaccines.drop(['subtitle', 'Page 1', 'Purchaser Entity / Country1', 'Tooltip deal amount'], axis = 1)

In [37]:
#NEXT STEP - MERGE THE ORIGINAL

pro_vaxers = vaccines.drop_duplicates()
pro_vaxers

Unnamed: 0,Country seperate (group),Company and Scientific Name1,Deal not on Map,Deal Period1 11,"Potential (1=yes, 0=no)1",3 Star Note,Company's Country,Country seperate,Date of Deal,Day,...,Partners,Purchaser's country Economic Status,Purchaser's Country Income Status,title vaccine,Type of Vaccine,Year,% Of National Population Able To Be Vaccinated,Deal Amount,Number of people able to be vaccinated with doses procured,Population
0,Israel,Arcturus Therapeutics_LUNAR-COV19,0,July 2020,Confirmed,0,USA,Israel,Thu Jul 23 00:00:00 EDT 2020,23,...,CEPI,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,mRNA,2020,44.182784,4000000.0,4000000.0,9053300
8,Singapore,Arcturus Therapeutics_LUNAR-COV19,0,November 2020,Confirmed,0,USA,Singapore,Mon Nov 09 00:00:00 EST 2020,9,...,Duke-NUS,High income,HIC,only Arcturus Therapeutics_LUNAR-COV19,mRNA,2020,,,,5703569
12,Brazil,Bharat Biotech_COVAXIN,0,January 2021,Confirmed,0,India,Brazil,Tue Jan 12 00:00:00 EST 2021,12,...,Indian Council of Medical Research and Nationa...,Upper middle income,LMIC,only Bharat Biotech_COVAXIN,Whole-viron inactivated,2021,2.842935,12000000.0,6000000.0,211049527
14,India,Bharat Biotech_COVAXIN,0,January 2021,Confirmed,0,India,India,Tue Jan 12 00:00:00 EST 2021,12,...,Indian Council of Medical Research and Nationa...,Lower middle income,LMIC,only Bharat Biotech_COVAXIN,Whole-viron inactivated,2021,0.201256,5500000.0,2750000.0,1366417754
16,Brazil,Bharat Biotech_COVAXIN,0,February 2021,Confirmed,0,India,Brazil,Wed Feb 03 00:00:00 EST 2021,3,...,Indian Council of Medical Research and Nationa...,Upper middle income,LMIC,only Bharat Biotech_COVAXIN,Whole-viron inactivated,2021,1.895290,8000000.0,4000000.0,211049527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,UK,Valneva_VLA2001,0,February 2021,Confirmed,1,France,United Kingdom,Tue Feb 02 00:00:00 EST 2021,2,...,NIH,High income,HIC,only Valneva_VLA2001,Protein adjuvant,2021,29.924707,40000000.0,20000000.0,66834405
4145,Canada,Pfizer-BioNTech_BNT162,0,January 2021,Confirmed,1,USA/Germany,Canada,Tue Jan 12 00:00:00 EST 2021,12,...,BioNTech and Fosun Pharma,High income,HIC,only Pfizer-BioNTech_BNT162,mRNA,2021,26.603342,20000000.0,10000000.0,37589262
4148,USA,Moderna_mRNA-1273,0,February 2021,Confirmed,1,USA,United States,2/11/2021,11,...,NIH,High income,HIC,only Moderna_mRNA-1273,mRNA,2021,15.232779,100000000.0,50000000.0,328239523
4150,USA,Pfizer-BioNTech_BNT162,0,January 2021,Confirmed,1,USA/Germany,United States,Wed Jan 27 00:00:00 EST 2021,27,...,BioNTech and Fosun Pharma,High income,HIC,only Pfizer-BioNTech_BNT162,mRNA,2021,15.232779,100000000.0,50000000.0,328239523


In [38]:
proer_vaxxers = pro_vaxers.groupby(['Country seperate', 'Company and Scientific Name1', 'Potential (1=yes, 0=no)1']).agg(
             Population = ('Population', 'mean'),
             Percent_pop_covered =('% Of National Population Able To Be Vaccinated', 'sum'),
             Vaccines_bought = ('Deal Amount', 'sum'),
             People_covered = ('Number of people able to be vaccinated with doses procured', 'sum'))

In [39]:
#create pivot table to relate vaccines bought by manufacturer to countries

first_pivot = pd.pivot_table(proer_vaxxers, values='Vaccines_bought', index=['Country seperate'],
                    columns=['Company and Scientific Name1'])

#reset index
first_pivot = first_pivot.reset_index()
first_pivot

Company and Scientific Name1,Country seperate,Arcturus Therapeutics_LUNAR-COV19,Bharat Biotech_COVAXIN,COVAX,COVAXX (United Biomedical)_UB-162,CanSino Biologics_Ad5-nCoV,CureVac_CVnCov,Gamaleya Research Institute_Sputnik V,Janssen (J&J)_Ad26.COV2.S,Medicago_CoVLP,Moderna_mRNA-1273,Novavax_NVX-CoV2373,Oxford-AstraZeneca _AZD1222,Pfizer-BioNTech_BNT162,Sanofi-GSK_SARS-CoV-2,Sinopharm,Sinovac_Coronavac,Valneva_VLA2001
0,Albania,,,,,,,,,,,,,500000.0,,,,
1,Algeria,,,,,,,,0.0,,,,0.0,0.0,,,,
2,Angola,,,,,,,,0.0,,,,0.0,0.0,,,,
3,Argentina,,,,,,,25000000.0,,,,,22000000.0,,,,,
4,Australia,,,,,,,,,,,51000000.0,53800000.0,20000000.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,Uzbekistan,,,,,,,1000000.0,,,,,,,,,,
138,Venezuela,,,,,,,10000000.0,,,,,,,,,,
139,Vietnam,,,,,,,50000000.0,,,,,30000000.0,,,,,
140,Zambia,,,,,,,,0.0,,,,0.0,0.0,,,,
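
One detail worth keeping in mind: `pd.pivot_table` aggregates with the mean by default. Because the grouped frame also has a 'Confirmed'/'Potential' level, a country can have more than one row per manufacturer, and those rows are then averaged rather than added. The toy example below, with made-up countries `A` and `B`, a made-up company `X` and made-up dose counts, shows the difference; which behaviour is wanted for the poster is a decision for the authors.

```python
import pandas as pd

# Made-up deals: country A has one confirmed and one potential deal with company X.
deals = pd.DataFrame({
    "country": ["A", "A", "B"],
    "company": ["X", "X", "X"],
    "status": ["Confirmed", "Potential", "Confirmed"],
    "doses": [100.0, 50.0, 30.0],
})

# Named aggregation, as in the cell above, followed by reset_index.
grouped = (
    deals.groupby(["country", "company", "status"])
         .agg(doses=("doses", "sum"))
         .reset_index()
)

# Default aggfunc is the mean; aggfunc="sum" adds the rows instead.
mean_pivot = pd.pivot_table(grouped, values="doses", index="country", columns="company")
sum_pivot = pd.pivot_table(grouped, values="doses", index="country", columns="company",
                           aggfunc="sum")

print(mean_pivot.loc["A", "X"], sum_pivot.loc["A", "X"])
```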


In [40]:
proer_vaxxers = proer_vaxxers.reset_index().rename(columns = {'Country seperate': 'COUNTRY'})

In [41]:
first_pivot.rename(columns = {'Country seperate': 'COUNTRY'}, inplace = True)

In [42]:
#first_pivot['COUNTRY'].unique()

In [43]:
df=first_pivot[['COUNTRY',
                'COVAXX (United Biomedical)_UB-162',
                'Pfizer-BioNTech_BNT162',
                'Janssen (J&J)_Ad26.COV2.S',
                'Sanofi-GSK_SARS-CoV-2',
                'Oxford-AstraZeneca _AZD1222',
                'Moderna_mRNA-1273',
                'Sinovac_Coronavac',
                'Gamaleya Research Institute_Sputnik V' ]]

In [44]:
df

Company and Scientific Name1,COUNTRY,COVAXX (United Biomedical)_UB-162,Pfizer-BioNTech_BNT162,Janssen (J&J)_Ad26.COV2.S,Sanofi-GSK_SARS-CoV-2,Oxford-AstraZeneca _AZD1222,Moderna_mRNA-1273,Sinovac_Coronavac,Gamaleya Research Institute_Sputnik V
0,Albania,,500000.0,,,,,,
1,Algeria,,0.0,0.0,,0.0,,,
2,Angola,,0.0,0.0,,0.0,,,
3,Argentina,,,,,22000000.0,,,25000000.0
4,Australia,,20000000.0,,,53800000.0,,,
...,...,...,...,...,...,...,...,...,...
137,Uzbekistan,,,,,,,,1000000.0
138,Venezuela,,,,,,,,10000000.0
139,Vietnam,,,,,30000000.0,,,50000000.0
140,Zambia,,0.0,0.0,,0.0,,,


In [45]:
df=df.rename(columns = {'COVAXX (United Biomedical)_UB-162': 'COVAX',
                     'Pfizer-BioNTech_BNT162':'Pfizer/BioNTech',
                     'Janssen (J&J)_Ad26.COV2.S':"Johnson & Johnson",
                     'Oxford-AstraZeneca _AZD1222':'Oxford-AstraZeneca',
                     'Moderna_mRNA-1273':'Moderna',
                     'Sinovac_Coronavac':'Sinovac',
                     'Sanofi-GSK_SARS-CoV-2':'Sanofi/GSK',
                     'Gamaleya Research Institute_Sputnik V':'SputnikV'})

In [46]:
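df=df.replace(0.0, np.nan)
#keep rows with at least 2 non-missing values (COUNTRY plus at least one manufacturer);
#recent pandas versions raise an error when how and thresh are passed together
df=df.dropna(thresh=2)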
df2=df.set_index('COUNTRY').stack().to_frame('values').reset_index()
df2.rename(columns={"Company and Scientific Name1":"Labs",
                   "values":"vaccines"}, inplace=True)

In [47]:
proer_vaxxers

Unnamed: 0,COUNTRY,Company and Scientific Name1,"Potential (1=yes, 0=no)1",Population,Percent_pop_covered,Vaccines_bought,People_covered
0,Albania,Pfizer-BioNTech_BNT162,Confirmed,2.854191e+06,8.759049,500000.0,250000.0
1,Algeria,Janssen (J&J)_Ad26.COV2.S,Confirmed,1.359998e+09,0.000000,0.0,0.0
2,Algeria,Oxford-AstraZeneca _AZD1222,Confirmed,1.359998e+09,0.000000,0.0,0.0
3,Algeria,Pfizer-BioNTech_BNT162,Confirmed,1.359998e+09,0.000000,0.0,0.0
4,Angola,Janssen (J&J)_Ad26.COV2.S,Confirmed,1.359998e+09,0.000000,0.0,0.0
...,...,...,...,...,...,...,...
490,Zambia,Oxford-AstraZeneca _AZD1222,Confirmed,1.359998e+09,0.000000,0.0,0.0
491,Zambia,Pfizer-BioNTech_BNT162,Confirmed,1.359998e+09,0.000000,0.0,0.0
492,Zimbabwe,Janssen (J&J)_Ad26.COV2.S,Confirmed,1.359998e+09,0.000000,0.0,0.0
493,Zimbabwe,Oxford-AstraZeneca _AZD1222,Confirmed,1.359998e+09,0.000000,0.0,0.0


In [48]:
df2 = df2.merge(proer_vaxxers, on='COUNTRY', how = 'left')


In [49]:
df2['Labs'].unique()

array(['Pfizer/BioNTech', 'Oxford-AstraZeneca', 'SputnikV',
       'Johnson & Johnson', 'Sanofi/GSK', 'Moderna', 'Sinovac', 'COVAX'],
      dtype=object)
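
The `Labs` column listed above comes from reshaping the wide manufacturer table into long format. As a sketch of the same reshaping on toy data, the block below uses `melt` followed by `dropna`, an alternative to `set_index(...).stack()` that makes the dropping of missing values explicit. The three rows are cut down from the pivot table shown earlier; `long_df` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Cut-down wide table: one column per manufacturer.
wide = pd.DataFrame({
    "COUNTRY": ["Albania", "Algeria", "Argentina"],
    "Pfizer/BioNTech": [500000.0, 0.0, np.nan],
    "SputnikV": [np.nan, np.nan, 25000000.0],
})

# Treat zero-dose entries as missing and drop countries with no deal at all.
wide = wide.replace(0.0, np.nan).dropna(thresh=2)

# One row per (country, manufacturer) pair that has a value.
long_df = (
    wide.melt(id_vars="COUNTRY", var_name="Labs", value_name="vaccines")
        .dropna(subset=["vaccines"])
        .reset_index(drop=True)
)
print(long_df)
```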

In [50]:
#create one dataframe per manufacturer, keeping only the rows where the short label (Labs)
#and the original name (Company and Scientific Name1) refer to the same vaccine
pfizer = df2[(df2['Labs']=='Pfizer/BioNTech') & (df2['Company and Scientific Name1']=='Pfizer-BioNTech_BNT162')]
jj = df2[(df2['Labs']=="Johnson & Johnson") & (df2['Company and Scientific Name1']=='Janssen (J&J)_Ad26.COV2.S')]
oxf = df2[(df2['Labs']=='Oxford-AstraZeneca') & (df2['Company and Scientific Name1']=='Oxford-AstraZeneca _AZD1222')]
mod = df2[(df2['Labs']=='Moderna') & (df2['Company and Scientific Name1']=='Moderna_mRNA-1273')]
sinovac = df2[(df2['Labs']=='Sinovac') & (df2['Company and Scientific Name1']=='Sinovac_Coronavac')]
sanofi = df2[(df2['Labs']=='Sanofi/GSK') & (df2['Company and Scientific Name1']=='Sanofi-GSK_SARS-CoV-2')]
sputnik = df2[(df2['Labs']=='SputnikV') & (df2['Company and Scientific Name1']=='Gamaleya Research Institute_Sputnik V')]

pdList = [pfizer, jj, oxf, mod, sinovac, sanofi, sputnik]  # list of the per-manufacturer dataframes

#replace df2 with the concatenation of the filtered dataframes
df2 = pd.concat(pdList)
df2

Unnamed: 0,COUNTRY,Labs,vaccines,Company and Scientific Name1,"Potential (1=yes, 0=no)1",Population,Percent_pop_covered,Vaccines_bought,People_covered
0,Albania,Pfizer/BioNTech,500000.0,Pfizer-BioNTech_BNT162,Confirmed,2.854191e+06,8.759049,500000.0,250000.0
7,Australia,Pfizer/BioNTech,20000000.0,Pfizer-BioNTech_BNT162,Confirmed,2.536431e+07,39.425481,20000000.0,10000000.0
15,Austria,Pfizer/BioNTech,500000000.0,Pfizer-BioNTech_BNT162,Confirmed,4.475120e+08,55.864419,500000000.0,250000000.0
63,Burundi,Pfizer/BioNTech,50000000.0,Pfizer-BioNTech_BNT162,Confirmed,1.359998e+09,1.838238,50000000.0,25000000.0
73,COVAX,Pfizer/BioNTech,40000000.0,Pfizer-BioNTech_BNT162,Confirmed,5.047561e+06,396.230972,40000000.0,20000000.0
...,...,...,...,...,...,...,...,...,...
319,Palestine,SputnikV,10000.0,Gamaleya Research Institute_Sputnik V,Confirmed,5.168185e+06,0.096746,10000.0,5000.0
346,Serbia,SputnikV,2000000.0,Gamaleya Research Institute_Sputnik V,Confirmed,6.944975e+06,14.398900,2000000.0,1000000.0
441,Uzbekistan,SputnikV,1000000.0,Gamaleya Research Institute_Sputnik V,Confirmed,3.358065e+07,1.488953,1000000.0,500000.0
442,Venezuela,SputnikV,10000000.0,Gamaleya Research Institute_Sputnik V,Confirmed,2.851583e+07,17.534121,10000000.0,5000000.0
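Because the merge in cell 48 joins on `COUNTRY` only, every deal row of a country is paired with every `proer_vaxxers` row of that country, and the per-manufacturer filters then keep the pairs whose labels agree. A possible shortcut, sketched here on made-up data, is to translate the raw name to the short label first and merge on both keys. The names `name_map`, `deals` and `coverage` are hypothetical, as are the numbers; the label pairs are the ones used in the filters above.

```python
import pandas as pd

# Raw manufacturer name -> short label, as paired in the filters above.
name_map = {
    "Pfizer-BioNTech_BNT162": "Pfizer/BioNTech",
    "Gamaleya Research Institute_Sputnik V": "SputnikV",
}

# Hypothetical stand-in for df2 before the merge.
deals = pd.DataFrame({
    "COUNTRY": ["Serbia", "Serbia"],
    "Labs": ["Pfizer/BioNTech", "SputnikV"],
    "vaccines": [340000.0, 2000000.0],
})

# Hypothetical stand-in for proer_vaxxers.
coverage = pd.DataFrame({
    "COUNTRY": ["Serbia", "Serbia"],
    "Company and Scientific Name1": list(name_map),
    "People_covered": [170000.0, 1000000.0],
})

# Joining on COUNTRY alone pairs every deal with every coverage row.
crossed = deals.merge(coverage, on="COUNTRY", how="left")

# Adding the translated label as a second key keeps matching pairs only.
coverage["Labs"] = coverage["Company and Scientific Name1"].map(name_map)
matched = deals.merge(coverage, on=["COUNTRY", "Labs"], how="left")

print(len(crossed), len(matched))
```

One difference to keep in mind: with `how="left"` a deal without a matching coverage row (for example the `COVAX` label) is kept with missing values, whereas the filters above drop it. Using `how="inner"` would mirror the filters more closely.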


In [51]:
df2 = df2.sort_values(by = 'COUNTRY').drop(['Company and Scientific Name1', 'Vaccines_bought'], axis = 1).reset_index(drop = True)

In [52]:
#merge with original shapefile, left join - keeps all original countries of exp and adds column values for those where there is information available

exp = exp.merge(df2, on='COUNTRY', how='left')
exp

Unnamed: 0,COUNTRY,geometry,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",Population in 2020 (Millions),GDP p/capita,Labs,vaccines,"Potential (1=yes, 0=no)1",Population,Percent_pop_covered,People_covered
0,Samoa,"MULTIPOLYGON (((-172.59650 -13.50911, -172.551...",Samoa,0.829,0.203,4083.806,,,,,,
1,Tonga,"MULTIPOLYGON (((-175.14529 -21.26806, -175.186...",Tonga,0.503,0.100,5023.166,,,,,,
2,El Salvador,"POLYGON ((-87.69467 13.81901, -87.72501 13.733...",El Salvador,24.784,6.486,3821.286,Oxford-AstraZeneca,2000000.0,Confirmed,6453553.0,15.495340,1000000.0
3,Guatemala,"POLYGON ((-89.34831 14.43198, -89.43556 14.414...",Guatemala,76.191,17.971,4239.672,,,,,,
4,Mexico,"MULTIPOLYGON (((-111.56001 24.42945, -111.5761...",Mexico,1040.372,128.933,8069.104,SputnikV,24000000.0,Confirmed,127575529.0,9.406193,12000000.0
...,...,...,...,...,...,...,...,...,...,...,...,...
236,Marshall Islands,"MULTIPOLYGON (((168.78637 7.28889, 168.76721 7...",Marshall Islands,0.225,0.055,4070.617,,,,,,
237,Micronesia,"MULTIPOLYGON (((158.22775 6.78055, 158.18469 6...",Micronesia,0.395,0.103,3854.743,,,,,,
238,Palau,"MULTIPOLYGON (((134.53137 7.35444, 134.52234 7...",Palau,0.251,0.018,14232.720,,,,,,
239,Russian Federation,"MULTIPOLYGON (((-179.99999 68.98010, -179.9580...",Russian Federation,1464.078,146.812,9972.495,,,,,,
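A left join silently leaves missing values wherever the country names do not match, and it repeats a country once per lab. A small sketch of how both effects could be checked with the `indicator` option of `merge`, using made-up stand-ins (`base` for `exp`, `deals` for `df2`); the "Russia" spelling is an invented illustration of a name mismatch, not something observed in the data.

```python
import pandas as pd

# Hypothetical stand-ins for exp (base map table) and df2 (vaccine deals).
base = pd.DataFrame({"COUNTRY": ["Samoa", "Mexico", "Russian Federation"]})
deals = pd.DataFrame({
    "COUNTRY": ["Mexico", "Mexico", "Russia"],
    "Labs": ["SputnikV", "Pfizer/BioNTech", "SputnikV"],
})

# indicator=True adds a _merge column saying where each row came from.
merged = base.merge(deals, on="COUNTRY", how="left", indicator=True)

# Base countries that found no deal row (possibly a spelling mismatch).
unmatched = merged.loc[merged["_merge"] == "left_only", "COUNTRY"].tolist()

# Deal countries whose name never appears in the base table.
orphans = sorted(set(deals["COUNTRY"]) - set(base["COUNTRY"]))

# Countries that now occupy more than one row (one row per lab).
repeated = merged.loc[merged["COUNTRY"].duplicated(), "COUNTRY"].unique().tolist()

print(unmatched, orphans, repeated)
```

Countries listed in `repeated` will be drawn once per row by a mapping tool, so it is worth deciding whether to aggregate them before the export.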


In [53]:
#now, to do the same with the countries table
countries = countries.rename(columns = {'location': 'COUNTRY', 'date': 'last_update'}).drop('vaccine', axis = 1)
countries

Unnamed: 0,COUNTRY,last_update,total_vaccinations,people_vaccinated,people_fully_vaccinated
19,Albania,2021-03-03,15793.0,,
2,Algeria,2021-02-19,75000.0,,
6,Andorra,2021-02-26,2526.0,2526.0,
3,Anguilla,2021-02-26,3929.0,3929.0,
47,Argentina,2021-03-05,1357596.0,1030504.0,327092.0
...,...,...,...,...,...
62,United States,2021-03-06,87912323.0,57358849.0,29776160.0
5,Uruguay,2021-03-05,70408.0,70408.0,
2,Venezuela,2021-03-04,12194.0,12194.0,
56,Wales,2021-03-05,1151582.0,983419.0,168163.0


In [54]:
#merge with original shapefile, left join - keeps all original countries of exp and adds column values for those where there is information available

exp = exp.merge(countries, on='COUNTRY', how='left')
exp

Unnamed: 0,COUNTRY,geometry,Country,"GDP, current prices, in 2020 (Bilions U.S. dollar)",Population in 2020 (Millions),GDP p/capita,Labs,vaccines,"Potential (1=yes, 0=no)1",Population,Percent_pop_covered,People_covered,last_update,total_vaccinations,people_vaccinated,people_fully_vaccinated
0,Samoa,"MULTIPOLYGON (((-172.59650 -13.50911, -172.551...",Samoa,0.829,0.203,4083.806,,,,,,,,,,
1,Tonga,"MULTIPOLYGON (((-175.14529 -21.26806, -175.186...",Tonga,0.503,0.100,5023.166,,,,,,,,,,
2,El Salvador,"POLYGON ((-87.69467 13.81901, -87.72501 13.733...",El Salvador,24.784,6.486,3821.286,Oxford-AstraZeneca,2000000.0,Confirmed,6453553.0,15.495340,1000000.0,2021-02-25,16000.0,16000.0,
3,Guatemala,"POLYGON ((-89.34831 14.43198, -89.43556 14.414...",Guatemala,76.191,17.971,4239.672,,,,,,,2021-03-01,2427.0,,
4,Mexico,"MULTIPOLYGON (((-111.56001 24.42945, -111.5761...",Mexico,1040.372,128.933,8069.104,SputnikV,24000000.0,Confirmed,127575529.0,9.406193,12000000.0,2021-03-06,2765805.0,2162358.0,603447.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
236,Marshall Islands,"MULTIPOLYGON (((168.78637 7.28889, 168.76721 7...",Marshall Islands,0.225,0.055,4070.617,,,,,,,,,,
237,Micronesia,"MULTIPOLYGON (((158.22775 6.78055, 158.18469 6...",Micronesia,0.395,0.103,3854.743,,,,,,,,,,
238,Palau,"MULTIPOLYGON (((134.53137 7.35444, 134.52234 7...",Palau,0.251,0.018,14232.720,,,,,,,2021-02-01,3109.0,,
239,Russian Federation,"MULTIPOLYGON (((-179.99999 68.98010, -179.9580...",Russian Federation,1464.078,146.812,9972.495,,,,,,,2021-03-06,6583873.0,5082127.0,1501746.0


In [55]:
#now that we have the shapefile, we check how it handles country groups, especially the EU and the African Union
#create sets with the countries in each group, as defined in the data

Af_union = set(pro_vaxers[pro_vaxers['Country seperate (group)'] == 'African Union']['Country seperate'])

EU = set(pro_vaxers[pro_vaxers['Country seperate (group)'] == 'European Union']['Country seperate'])

#as can be seen, the African Union group covers all African countries
exp = exp.drop('COUNTRY', axis = 1)

#use this to do the fill
#np.where() requires at least a condition argument, so the bare call is left commented out
#np.where()
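The `np.where` note in the cell above does not say what should be filled. One plausible reading, given the `EU` and `Af_union` sets built in the same cell, is to spread a group-level figure over the member countries that have no value of their own. The sketch below shows the call signature under that assumption; `table`, `eu_vaccines` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the EU set and the exp table.
EU = {"Austria", "Portugal"}
table = pd.DataFrame({
    "Country": ["Austria", "Portugal", "Samoa"],
    "vaccines": [np.nan, np.nan, np.nan],
})

# Hypothetical group-level figure to spread over the member countries.
eu_vaccines = 500000000.0

# np.where(condition, x, y): take x where the condition holds, else y.
is_eu_gap = table["Country"].isin(EU) & table["vaccines"].isna()
table["vaccines"] = np.where(is_eu_gap, eu_vaccines, table["vaccines"])
print(table)
```

Rows outside the set keep whatever they had, including missing values.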

In [None]:
exp[exp['Country'] == 'Canada']

In [None]:
#go back up to the parent directory
os.chdir('..')

#move into the SHP_exp folder
os.chdir('SHP_exp')

#export the GeoDataFrame as a shapefile
exp.to_file("test.shp", driver='ESRI Shapefile')
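One thing to watch with this export: the shapefile format stores attributes in a DBF table whose field names are limited to 10 characters, so long column names such as `Percent_pop_covered` cannot survive unchanged. A minimal sketch of a helper that builds short, unique names ahead of time; `shorten_columns` is a hypothetical helper, not part of geopandas.

```python
def shorten_columns(columns, limit=10):
    """Map each column name to a unique name of at most `limit` characters."""
    mapping = {}
    used = set()
    for name in columns:
        short = name[:limit]
        counter = 1
        # On a clash, replace the tail with a numeric suffix.
        while short in used:
            suffix = str(counter)
            short = name[: limit - len(suffix)] + suffix
            counter += 1
        used.add(short)
        mapping[name] = short
    return mapping


cols = ["Country", "Percent_pop_covered", "people_vaccinated", "people_fully_vaccinated"]
mapping = shorten_columns(cols)
print(mapping)
```

The mapping could then be applied with something like `exp.rename(columns=shorten_columns(exp.columns))` before calling `to_file`, which keeps control over the final names instead of leaving the truncation to the driver.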

In [None]:
#Convert to csv
exp.to_csv('result_poster.csv')