# Data Analysis Project - GDP & Population

This data project analyses how GDP per capita (*chain-linked volumes*) has changed over the years 2012-2022 for European countries. The countries covered are current EU members, former EU members, and countries applying to become members of the EU.

For this project we use data from Eurostat. We access the data directly from Eurostat, so it is necessary to install the eurostat module *(see below)*.
We will use data from two datasets, nama_10_gdp and DEMO_PJAN. 

We will use two different methods to access and clean the data. First, we download the full "nama_10_gdp" dataset and then manually clean it, deleting the parts we do not need. Second, for the "DEMO_PJAN" dataset, we filter the request so that we only download the parts of the dataset that we need.

After accessing and cleaning both datasets, we will combine the two and make some calculations and visualisations of the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# This project uses the eurostat module. If you have not installed it yet, uncomment the next line and run it once.
# %pip install eurostat

# user written modules
from DPJ import GDP_CapitaClass
model = GDP_CapitaClass()

## Definitions

**GDP :**  Gross Domestic Product 

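**Chain-Linked Volumes :** A measure of GDP in real terms. Each year's output is valued at the previous year's prices, and the resulting yearly growth rates are chained together to a reference year (2015 for the unit CLV15_MEUR), which removes the effect of price changes.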

**Population :** The total population, meaning all people who are registered as citizens in a country. The value is measured on 1 January of the year in question. $^*$


$^*$ Definition from the Eurostat metadata for the DEMO_PJAN dataset.

# The first dataset - GDP

We will start off by accessing the dataset (nama_10_gdp) from Eurostat.

Here we download the full dataset, which we will then clean up.

We choose which rows we want to keep, based on *'unit'* and *'na_item'*. We have chosen Gross Domestic Product (B1GQ) in chain-linked volumes (2015), million euro (CLV15_MEUR).

In [2]:
model.Get_GDP() 


Unnamed: 0,freq,unit,na_item,geo\TIME_PERIOD,1975,1976,1977,1978,1979,1980,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,A,CLV05_MEUR,B1G,AT,,,,,,,...,251167.1,252879.7,255017.1,259996.4,266016.9,272985.2,277001.4,259083.6,269392.5,283659.7
1,A,CLV05_MEUR,B1G,BA,,,,,,,...,9118.0,9224.4,9358.7,9658.1,9978.2,10355.7,10650.6,10358.6,11080.5,11427.6
2,A,CLV05_MEUR,B1G,BE,,,,,,,...,308495.4,313684.3,320657.4,323677.5,328736.7,334655.5,342477.2,325482.3,344637.6,355710.6
3,A,CLV05_MEUR,B1G,BG,,,,,,,...,25127.0,25373.6,26129.3,26806.7,27691.2,28654.5,29721.9,28535.0,30816.2,31876.7
4,A,CLV05_MEUR,B1G,CH,,,,,,,...,381505.9,390405.6,396747.1,405030.0,410565.5,422877.6,427969.5,418277.5,435957.0,445107.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29630,A,PYP_MNAC,YA1,PT,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-27.4,156.0
29631,A,PYP_MNAC,YA1,RO,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29632,A,PYP_MNAC,YA1,SI,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29633,A,PYP_MNAC,YA1,UK,,,,,,,...,,,,,,,,,,


We will now clean up the dataset:

1. We remove the column freq and the years 1975-2011 (unit and na_item are kept, as the output below shows).

2. We rename the column geo\TIME_PERIOD to Country_code.

3. We remove the aggregate values in our dataset, as we are only interested in the specific countries. 

4.  We reset the index.

In [7]:
model.Clean_GDP()

Unnamed: 0,unit,na_item,Country_code,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,CLV15_MEUR,B1GQ,AL,9768.4,9866.2,10041.3,10264.1,10604.4,11007.6,11450.0,11689.0,11282.1,,
1,CLV15_MEUR,B1GQ,AT,338486.5,338572.8,340811.7,344269.2,351118.3,359048.5,367756.8,373337.1,349242.1,365156.5,383403.2
2,CLV15_MEUR,B1GQ,BA,13695.8,14017.6,14179.3,14791.1,15270.7,15766.1,16369.8,16842.2,16334.4,17541.8,18225.3
3,CLV15_MEUR,B1GQ,BE,400181.0,402018.8,408364.8,416701.4,421979.7,428814.0,436502.4,446374.2,422437.0,448991.1,463573.6
4,CLV15_MEUR,B1GQ,BG,44117.0,43869.7,44293.9,45812.3,47204.9,48508.8,49811.2,51822.6,49771.2,53571.0,55371.7
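
The cleaning steps above can be sketched in plain pandas. This is only an illustration and not the code inside `Clean_GDP`: the function name `clean_gdp`, the list of aggregate codes, and the tiny stand-in frame are assumptions made for the example, and values that are not shown in the outputs above are left as NaN. Because the output above still shows unit and na_item, the sketch keeps those two columns and only uses them to select rows.

```python
import pandas as pd

nan = float("nan")

# Tiny stand-in for the raw nama_10_gdp frame; unknown values are NaN.
raw = pd.DataFrame({
    "freq": ["A", "A", "A", "A"],
    "unit": ["CLV15_MEUR", "CLV15_MEUR", "CLV15_MEUR", "PYP_MNAC"],
    "na_item": ["B1GQ", "B1GQ", "B1GQ", "YA1"],
    r"geo\TIME_PERIOD": ["AT", "BE", "EU27_2020", "PT"],
    "2011": [nan, nan, nan, nan],
    "2012": [338486.5, 400181.0, nan, nan],
    "2013": [338572.8, 402018.8, nan, 0.0],
})


def clean_gdp(df, first_year=2012, aggregates=("EU27_2020", "EA19")):
    """Hypothetical helper mirroring the four cleaning steps."""
    # Keep only real GDP: chain-linked volumes (2015), million euro.
    out = df[(df["unit"] == "CLV15_MEUR") & (df["na_item"] == "B1GQ")]
    # Step 1: drop freq and every year column before first_year.
    old_years = [c for c in out.columns if c.isdigit() and int(c) < first_year]
    out = out.drop(columns=["freq"] + old_years)
    # Step 2: rename the country column.
    out = out.rename(columns={r"geo\TIME_PERIOD": "Country_code"})
    # Step 3: remove aggregate rows (the codes listed here are an assumption).
    out = out[~out["Country_code"].isin(aggregates)]
    # Step 4: reset the index.
    return out.reset_index(drop=True)


clean = clean_gdp(raw)
print(clean)
```

The raw string `r"geo\TIME_PERIOD"` is used because the column name contains a backslash.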


# The second dataset - Population

We will now access the dataset (DEMO_PJAN) from Eurostat.

With this dataset, we filter the request directly at Eurostat, meaning that we only download the data we need:

* Start period: 2012
* End period: 2022
* Sex: T
* Age: TOTAL

This gives us the total population for each country in the period 2012-2022.
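
As a rough sketch, the filter above could be expressed as a dictionary and passed to the eurostat module. This is not the code inside `Get_Population`: the helper names `population_filter` and `fetch_population` are hypothetical, and the sketch assumes that `eurostat.get_data_df` accepts a `filter_pars` dictionary with the keys `startPeriod` and `endPeriod`, so check the documentation of your installed version. The download itself needs network access and is therefore not run here.

```python
def population_filter(start_year=2012, end_year=2022):
    """Hypothetical helper: build the filter described in the list above."""
    return {
        "startPeriod": start_year,  # first year to download
        "endPeriod": end_year,      # last year to download
        "sex": "T",                 # both sexes combined
        "age": "TOTAL",             # all ages combined
    }


def fetch_population(filter_pars):
    """Download the filtered DEMO_PJAN data (requires network access)."""
    import eurostat  # imported here so the sketch loads without the module

    # Assumption: get_data_df takes the dataset code and a filter_pars dict.
    return eurostat.get_data_df("DEMO_PJAN", filter_pars=filter_pars)


filter_pars = population_filter()
print(filter_pars)
```

Filtering at the source keeps the download small, whereas the GDP dataset above is downloaded in full and cut down afterwards.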

In [8]:
model.Get_Population()

Unnamed: 0,freq,unit,age,sex,geo\TIME_PERIOD,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,A,NR,TOTAL,T,AD,78115.0,76246.0,,,,,,76177.0,,,
1,A,NR,TOTAL,T,AL,2903008.0,2897770.0,2892394.0,2885796.0,2875592.0,2876591.0,2870324.0,2862427.0,2845955.0,2829741.0,
2,A,NR,TOTAL,T,AM,3274285.0,,,3010598.0,2998577.0,2986151.0,2972732.0,2965269.0,2959694.0,2963251.0,
3,A,NR,TOTAL,T,AT,8408121.0,8451860.0,8507786.0,8584926.0,8700471.0,8772865.0,8822267.0,8858775.0,8901064.0,8932664.0,8978929.0
4,A,NR,TOTAL,T,AZ,9235085.0,9356483.0,9477119.0,9593038.0,9705643.0,9809981.0,9898085.0,9981457.0,10067108.0,10119133.0,


We will also do a bit of cleaning with this dataset:

1. We will rename the column geo\TIME_PERIOD to Country_code.

2. We will delete the columns 'freq', 'unit', 'age' and 'sex'.

In [9]:
model.Clean_Population()

Unnamed: 0,Country_code,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AD,78115.0,76246.0,,,,,,76177.0,,,
1,AL,2903008.0,2897770.0,2892394.0,2885796.0,2875592.0,2876591.0,2870324.0,2862427.0,2845955.0,2829741.0,
2,AM,3274285.0,,,3010598.0,2998577.0,2986151.0,2972732.0,2965269.0,2959694.0,2963251.0,
3,AT,8408121.0,8451860.0,8507786.0,8584926.0,8700471.0,8772865.0,8822267.0,8858775.0,8901064.0,8932664.0,8978929.0
4,AZ,9235085.0,9356483.0,9477119.0,9593038.0,9705643.0,9809981.0,9898085.0,9981457.0,10067108.0,10119133.0,


# Merging the two datasets

We will now merge the two datasets. First we reshape both datasets from wide to long format, so that each row holds one country-year observation and the two tables can be matched on both country and year.

Then we merge the two datasets with an inner merge, meaning that we keep only the observations that appear in both datasets. The merge keys are 'Country_code' and 'year'.
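
The reshape and merge described above can be sketched with `DataFrame.melt` and `DataFrame.merge`. This is an illustration, not the code inside `Merge_Data`: the helper `to_long` and the value column names `gdp_meur` and `population` are assumptions made for readability. The numbers are copied from the cleaned outputs above.

```python
import pandas as pd

# Small wide-format excerpts of the two cleaned datasets.
gdp_wide = pd.DataFrame({
    "Country_code": ["AL", "AT", "BE"],
    "2012": [9768.4, 338486.5, 400181.0],
    "2013": [9866.2, 338572.8, 402018.8],
})
pop_wide = pd.DataFrame({
    "Country_code": ["AD", "AL", "AT"],
    "2012": [78115.0, 2903008.0, 8408121.0],
    "2013": [76246.0, 2897770.0, 8451860.0],
})


def to_long(df, value_name):
    """Hypothetical helper: one row per (Country_code, year)."""
    long = df.melt(id_vars="Country_code", var_name="year", value_name=value_name)
    long["year"] = long["year"].astype(int)  # year labels arrive as strings
    return long


gdp_long = to_long(gdp_wide, "gdp_meur")
pop_long = to_long(pop_wide, "population")

# Inner merge: only country-year pairs present in both tables survive.
merged = gdp_long.merge(pop_long, on=["Country_code", "year"], how="inner")
print(merged)
```

In this excerpt BE appears only in the GDP table and AD only in the population table, so the inner merge drops both and keeps AL and AT.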

In [10]:
model.Merge_Data()

Country_code,year,na_item,unit,_x,_y
AL,2012,B1GQ,CLV15_MEUR,9768.4,2903008.0
AT,2012,B1GQ,CLV15_MEUR,338486.5,8408121.0
AL,2013,B1GQ,CLV15_MEUR,9866.2,2897770.0
AT,2013,B1GQ,CLV15_MEUR,338572.8,8451860.0
AL,2014,B1GQ,CLV15_MEUR,10041.3,2892394.0
AT,2014,B1GQ,CLV15_MEUR,340811.7,8507786.0
AL,2015,B1GQ,CLV15_MEUR,10264.1,2885796.0
AT,2015,B1GQ,CLV15_MEUR,344269.2,8584926.0
AL,2016,B1GQ,CLV15_MEUR,10604.4,2875592.0
AT,2016,B1GQ,CLV15_MEUR,351118.3,8700471.0
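
With GDP (column _x, in million euro) and population (column _y) side by side, GDP per capita follows by converting GDP to euro and dividing by the population. The sketch below is one way to do that calculation and is not taken from the project's own module. The column names `gdp_meur`, `population` and `gdp_per_capita_eur` are assumptions; the values are copied from the merged output above.

```python
import pandas as pd

# Excerpt of the merged data, with readable names for _x and _y.
merged = pd.DataFrame({
    "Country_code": ["AL", "AT", "AL", "AT"],
    "year": [2012, 2012, 2013, 2013],
    "gdp_meur": [9768.4, 338486.5, 9866.2, 338572.8],
    "population": [2903008.0, 8408121.0, 2897770.0, 8451860.0],
})

# GDP is stated in million euro, so scale to euro before dividing.
merged["gdp_per_capita_eur"] = merged["gdp_meur"] * 1_000_000 / merged["population"]

print(merged[["Country_code", "year", "gdp_per_capita_eur"]].round(0))
```

For Austria in 2012 this gives roughly 40,257 euro per person (338,486.5 million euro divided by 8,408,121 inhabitants).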


# Adding the third dataset

We will now add a third dataset, which shows the country code, the country name and the ISO-3 code for each country. This dataset is stored in the file C_Name_ISO3.xlsx.

In [12]:
pd.read_excel('C_Name_ISO3.xlsx')


Unnamed: 0,Country_code,ISO_3_Code,Country_Name
0,AL,ALB,Albania
1,AT,AUT,Austria
2,BA,BIH,Bosnia and Herzegovina
3,BE,BEL,Belgium
4,BG,BGR,Bulgaria
5,CH,CHE,Switzerland
6,CY,CYP,Cyprus
7,CZ,CZE,Czechia
8,DE,DEU,Germany
9,DK,DNK,Denmark
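
One possible way to attach these names to the merged data is a left merge on Country_code. How the project itself combines the tables is not shown here, so treat this as a sketch under that assumption. The frames below are small in-memory excerpts rather than the real Excel file, and the column name `gdp_meur` is hypothetical.

```python
import pandas as pd

# Excerpt of the lookup table shown above.
names = pd.DataFrame({
    "Country_code": ["AL", "AT", "BE"],
    "ISO_3_Code": ["ALB", "AUT", "BEL"],
    "Country_Name": ["Albania", "Austria", "Belgium"],
})

# Excerpt of the merged GDP data in long format.
merged = pd.DataFrame({
    "Country_code": ["AL", "AT", "AL", "AT"],
    "year": [2012, 2012, 2013, 2013],
    "gdp_meur": [9768.4, 338486.5, 9866.2, 338572.8],
})

# Left merge keeps every country-year row; validate guards against
# duplicate country codes in the lookup table.
labelled = merged.merge(names, on="Country_code", how="left", validate="many_to_one")
print(labelled)
```

A left merge leaves the country-year rows untouched even if a code is missing from the lookup table, in which case the name columns are simply NaN for that row.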
