# GDP vs VACCINATION

In this project we seek to analyze the GDP for selected countries depending on the child vaccination rates. 

Imports and set magics:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


# Read and clean data

We have imported data in cvs-files from the OECD Databank. 

**The 1st dataset:**
The first dataset is an overview of GDP per capita for 34 countries in the world from 2018 to 2022. The included variables are year, GDP per capita and country. 

**The 2nd dataset:**
The second dataset is an overview of child vaccination rates for XX countries in the world from 2018 to 2022. The included variables are year, GDP per capita and country. 

# Importing and cleaning the first dataset

In [2]:
#import data
gdp = pd.read_csv('GDP.csv')

#inspect data
print(gdp)

    LOCATION INDICATOR SUBJECT  MEASURE FREQUENCY  TIME         Value  \
0        AUS       GDP     TOT  USD_CAP         A  2018  52991.214900   
1        AUS       GDP     TOT  USD_CAP         A  2019  52732.185831   
2        AUS       GDP     TOT  USD_CAP         A  2020  55690.254043   
3        AUS       GDP     TOT  USD_CAP         A  2021  61974.774982   
4        AUT       GDP     TOT  USD_CAP         A  2018  56956.110562   
..       ...       ...     ...      ...       ...   ...           ...   
266      CMR       GDP     TOT  USD_CAP         A  2019   3900.880855   
267      CMR       GDP     TOT  USD_CAP         A  2020   3861.209890   
268      SEN       GDP     TOT  USD_CAP         A  2018   3416.171427   
269      SEN       GDP     TOT  USD_CAP         A  2019   3530.096494   
270      SEN       GDP     TOT  USD_CAP         A  2020   3513.149951   

    Flag Codes  
0          NaN  
1          NaN  
2          NaN  
3          NaN  
4          NaN  
..         ...  
266 

In [3]:
#We drop the unecessary columns
drop_cont = ['SUBJECT', 'MEASURE', 'FREQUENCY', 'INDICATOR', 'Flag Codes']
gdp.drop(drop_cont, axis=1, inplace=True)

#Renaming the columns
gdp.rename(columns = {'LOCATION' : 'country', 'TIME' : 'year', 'Value' : 'GDP'}, inplace=True)

#Removing two rows that are not containing only one country
gdp = gdp[gdp.country != 'EA19']
gdp = gdp[gdp.country != 'EU27_2020']

#Print
print(gdp)

    country  year           GDP
0       AUS  2018  52991.214900
1       AUS  2019  52732.185831
2       AUS  2020  55690.254043
3       AUS  2021  61974.774982
4       AUT  2018  56956.110562
..      ...   ...           ...
266     CMR  2019   3900.880855
267     CMR  2020   3861.209890
268     SEN  2018   3416.171427
269     SEN  2019   3530.096494
270     SEN  2020   3513.149951

[261 rows x 3 columns]


In [4]:
#Reshaping the data set to have all information for a country in one row
gdp = gdp.pivot_table(index='country', columns='year', values='GDP', aggfunc='sum').reset_index() # reset_index removes the second index created by the pivot
print(gdp)


#Removing variables where data is missing
gdp = gdp.dropna()
print(gdp)

year country          2018          2019          2020          2021  \
0        ALB  13498.122566  14407.379403  14033.967211           NaN   
1        ARG  23290.675736  23263.429463  20981.428074  23856.680324   
2        AUS  52991.214900  52732.185831  55690.254043  61974.774982   
3        AUT  56956.110562  59719.612013  57258.818295  59991.439846   
4        BEL  52530.841510  55801.087951  54544.288759  58820.998349   
..       ...           ...           ...           ...           ...   
58       SWE  53521.635778  56404.297122  56140.551353  60126.822079   
59       TUR  28281.473275  28011.322789  27973.899303  30679.506593   
60       USA  62449.612827  64690.272395  63480.861222  70181.121750   
61       ZAF  13218.431286  13228.020366  12253.323477           NaN   
62       ZMB   3507.795230   3514.688008   3358.004891           NaN   

year          2022  
0              NaN  
1              NaN  
2              NaN  
3     68774.881689  
4     66176.491484  
..       

In [5]:
gdp.info()

#Changing the type for country to string
gdp.country = gdp.country.astype('string')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 3 to 58
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  25 non-null     object 
 1   2018     25 non-null     float64
 2   2019     25 non-null     float64
 3   2020     25 non-null     float64
 4   2021     25 non-null     float64
 5   2022     25 non-null     float64
dtypes: float64(5), object(1)
memory usage: 1.4+ KB


# Importing and cleaning the second dataset

In [6]:
# import data
vac = pd.read_csv('vaccination.csv') 
# inspect data
print(vac) 

    LOCATION    INDICATOR  SUBJECT   MEASURE FREQUENCY  TIME  Value Flag Codes
0        AUS  CHILDVACCIN      DTP  PC_CHILD         A  2018   94.6        NaN
1        AUS  CHILDVACCIN      DTP  PC_CHILD         A  2019   94.9        NaN
2        AUS  CHILDVACCIN      DTP  PC_CHILD         A  2020   95.4        NaN
3        AUS  CHILDVACCIN      DTP  PC_CHILD         A  2021   95.0        NaN
4        AUS  CHILDVACCIN  MEASLES  PC_CHILD         A  2018   95.0          E
..       ...          ...      ...       ...       ...   ...    ...        ...
319      PER  CHILDVACCIN      DTP  PC_CHILD         A  2019   88.0        NaN
320      PER  CHILDVACCIN      DTP  PC_CHILD         A  2020   72.2        NaN
321      PER  CHILDVACCIN  MEASLES  PC_CHILD         A  2018   85.0        NaN
322      PER  CHILDVACCIN  MEASLES  PC_CHILD         A  2019   85.0        NaN
323      PER  CHILDVACCIN  MEASLES  PC_CHILD         A  2020   77.1        NaN

[324 rows x 8 columns]


In [7]:
# Dropping the unecessary colums
drop_these = ['FREQUENCY', 'Flag Codes', 'INDICATOR', 'MEASURE']
vac.drop(drop_these, axis=1, inplace=True)
# Renaming columns
vac.rename(columns = {'LOCATION':'country', 'TIME':'year', 'SUBJECT':'subject', 'Value':'cv_rate'}, inplace=True)
vac.head()
#Removing the subject DTP, as we are interested in the measles vaccination rates
vac = vac[vac.country != 'DTP']

In [8]:
# Reshaping from long to wide
vac = vac.pivot_table(index='country', columns='year', values='cv_rate').reset_index() 
vac.head()

year,country,2018,2019,2020,2021
0,ARG,90.0,84.5,75.8,78.35
1,AUS,94.8,94.95,95.2,93.85
2,AUT,89.5,89.5,89.5,
3,BEL,97.0,97.0,96.5,
4,BRA,89.5,80.5,78.3,


In [9]:
# Removing variables where data is missing
vac = vac.dropna()
print(vac)

year country   2018   2019   2020   2021
0        ARG  90.00  84.50  75.80  78.35
1        AUS  94.80  94.95  95.20  93.85
8        CHN  98.85  98.95  99.30  99.40
9        COL  93.50  94.50  89.20  86.45
10       CRI  94.00  95.00  95.80  90.50
11       CZE  96.70  94.55  95.60  95.15
13       DNK  95.00  95.50  94.80  94.95
14       ESP  95.25  96.75  96.15  92.60
15       EST  89.40  89.60  90.90  89.45
18       GBR  93.00  92.00  92.05  91.75
20       HUN  99.90  99.90  99.85  99.85
21       IDN  92.30  95.80  86.60  83.65
22       IND  94.85  92.95  86.90  87.10
23       IRL  93.40  92.25  93.00  91.95
24       ISL  92.00  92.60  93.15  91.80
29       LTU  92.25  92.40  90.75  89.25
31       LVA  97.00  99.50  99.45  95.45
32       MEX  92.50  77.50  79.40  83.45
33       NLD  93.20  93.65  93.90  93.75
34       NOR  96.00  97.00  96.90  96.70
35       NZL  92.25  91.70  91.75  90.45
39       RUS  97.25  97.30  96.90  97.00
40       SVK  96.30  96.45  96.50  96.20
43       TUR  97

In [10]:
vac.info()

#Changing the type for country to string
vac.country = vac.country.astype('string')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 45
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  25 non-null     object 
 1   2018     25 non-null     float64
 2   2019     25 non-null     float64
 3   2020     25 non-null     float64
 4   2021     25 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.2+ KB


## Explore each data set

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

**Interactive plot** :

In [11]:
def plot_func():
    # Function that operates on data set
    pass

widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
); 


interactive(children=(Output(),), _dom_classes=('widget-interact',))

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of a (inner) **merge**:

In [12]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

NameError: name 'venn2' is not defined

<Figure size 1500x700 with 0 Axes>

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONLUSION.