# Data Analysis Project - GDP & Population

This data project analyses how GDP per capita (*chain linked volumes*) changed over the years 2012-2022 for European countries. The sample covers countries that are currently in the EU, that have previously been in the EU, or that are applying to become a member of the EU.

For this project we use data from Eurostat. We access the data directly from Eurostat, so it is necessary to install the eurostat package *(see below)*.
We will use data from two datasets, nama_10_gdp and DEMO_PJAN. 

We use two different methods to access and clean the data. First, we access the full "nama_10_gdp" dataset and then clean it manually, deleting the parts we do not need. Second, for "DEMO_PJAN" we filter the dataset in the request itself, so that we only download the data we need.

After accessing and cleaning both datasets, we will combine the two and make some calculations and visualisations of the data.

**Table of contents**<a id='toc0_'></a>    
- 1. [Definitions](#toc1_)    
- 2. [The first dataset - GDP](#toc2_)    
- 3. [The second dataset - Population](#toc3_)    
- 4. [Merging the two datasets](#toc4_)    
- 5. [Adding a third dataset](#toc5_)  
- 6. [Plotting the results](#toc6_) 
    - 6.1 [The Lineplot](#toc7_)
    - 6.2 [The Choropleth map](#toc8_)
    - 6.3 [The Scatterplot](#toc9_)
- 7. [Standard Calculations](#toc10_)
- 8. [Conclusion](#toc11_)

In [20]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2
import plotly.express as px

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# This project uses the eurostat package. If you have not installed it yet,
# uncomment the next line and run it once.
# %pip install eurostat

# user written modules
from DPJ import GDP_CapitaClass
model = GDP_CapitaClass()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. <a id='toc1_'></a>[Definitions](#toc0_)

**GDP :**  Gross Domestic Product 

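**Chained Linked Volumes :** A volume measure in which each year's output is valued at the previous year's prices and the yearly changes are linked ("chained") to a reference year, so that the series reflects real growth rather than price changes.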

**Population :** The total population, meaning all people who are registered as citizens of a country. The value is measured on 1 January of the year in question. $^*$


$^*$ Definition from Eurostat on the metadata within the DEMO_PJAN-dataset used.

## 2. <a id='toc2_'></a>[The first Dataset - GDP](#toc0_)

We will start off by accessing the dataset (nama_10_gdp) from Eurostat.

With this dataset, we are accessing the full dataset, which we will then clean up.

We choose which rows *('unit' and 'na_items')* we want to see. We have chosen the Gross Domestic Product in chain linked volumes (2005), million euro.

In [2]:

# If you want to see the data before we do anything with it
# you can run the code below.

model.Get_GDP() 


We will now clean up the dataset:

1.  We remove the columns freq, unit, na_items, and the years 1975-2011.

2. We rename the column geo/Time_Period to Country_code. 

3. We remove the aggregate values in our dataset, as we are only interested in the specific countries. 

4.  We reset the index.
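The four steps above can be sketched with plain pandas. This is only an illustration on a made-up three-row table, not the actual body of `Clean_GDP()`: the helper name `clean_gdp`, the aggregate codes and the toy values are assumptions, and the column names are taken from the text.

```python
import pandas as pd

# Toy stand-in for the raw nama_10_gdp table (values are made up).
raw = pd.DataFrame({
    "freq": ["A", "A", "A"],
    "unit": ["CLV05_MEUR"] * 3,
    "na_items": ["B1GQ"] * 3,
    "geo/Time_Period": ["DK", "EU27_2020", "SE"],
    "2011": [1.0, 2.0, 3.0],
    "2012": [1.1, 2.1, 3.1],
    "2013": [1.2, 2.2, 3.2],
})


def clean_gdp(df, first_year=2012, aggregates=("EU27_2020", "EA19")):
    """Hypothetical helper mirroring the four cleaning steps."""
    # 1. Drop metadata columns and all year columns before first_year.
    old_years = [c for c in df.columns if c.isdigit() and int(c) < first_year]
    out = df.drop(columns=["freq", "unit", "na_items", *old_years])
    # 2. Rename the country column.
    out = out.rename(columns={"geo/Time_Period": "Country_code"})
    # 3. Remove aggregate rows (the list of codes here is an assumption).
    out = out[~out["Country_code"].isin(aggregates)]
    # 4. Reset the index.
    return out.reset_index(drop=True)


cleaned = clean_gdp(raw)
print(cleaned)
```

Selecting the old year columns by value rather than listing 1975-2011 by hand keeps the helper usable if the start year changes.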

In [3]:
# The code below will show you the cleaned data
model.Clean_GDP()

## 3. <a id='toc3_'></a>[The second dataset - Population](#toc0_)

We will now access the dataset (DEMO_PJAN) from Eurostat.

With this dataset, we will filter it directly from Eurostat, meaning that we will only access the data we need:

- Startperiod : 2012
- Endperiod : 2022
- sex : T
- Age : Total

This will give us the total population for each country in the period 2012-2022.
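A filtered request of this kind could look roughly as follows. The sketch assumes that the installed eurostat package provides `get_data_df` with a `filter_pars` dictionary, and that the age code for the total is `TOTAL`; the two helper functions are hypothetical and are not taken from `Get_Population()`.

```python
def build_population_filter(start=2012, end=2022):
    """Hypothetical helper: the filter listed above as a dictionary."""
    return {"startPeriod": start, "endPeriod": end, "sex": "T", "age": "TOTAL"}


def fetch_population(filter_pars):
    """Download only the filtered part of DEMO_PJAN (needs network access)."""
    # Imported lazily so the rest of the block runs without the package.
    import eurostat
    return eurostat.get_data_df("DEMO_PJAN", filter_pars=filter_pars)


filter_pars = build_population_filter()
print(filter_pars)
# population = fetch_population(filter_pars)  # uncomment to download
```

Filtering in the request keeps the download small, which is the difference from the approach used for the GDP dataset.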

In [4]:
# The code below will show you the "raw" population data
model.Get_Population()

We will also do a bit of cleaning with this dataset:

1. We will rename the column geo/Time_Period to Country_code.

2. We will delete the columns 'freq', 'unit', 'age' , 'sex'

In [6]:
# The code below will show you the cleaned population data
model.Clean_Population()

KeyError: "['freq', 'unit', 'age', 'sex'] not found in axis"
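A `KeyError` like the one above typically appears when a cleaning cell is run a second time on data whose columns have already been dropped. Whether that is the cause here is an assumption, since the internals of `Clean_Population()` are not shown. A minimal sketch of a drop that is safe to re-run, using a hypothetical helper `drop_metadata` and a made-up row:

```python
import pandas as pd


def drop_metadata(df, cols=("freq", "unit", "age", "sex")):
    """Hypothetical helper: drop metadata columns if they are present."""
    # errors="ignore" skips columns that are already gone,
    # so running the step twice does not raise a KeyError.
    return df.drop(columns=list(cols), errors="ignore")


# Toy population table (the value is made up).
pop = pd.DataFrame({
    "freq": ["A"], "unit": ["NR"], "age": ["TOTAL"], "sex": ["T"],
    "Country_code": ["DK"], "2012": [5_500_000],
})

once = drop_metadata(pop)
twice = drop_metadata(once)  # second run is a no-op
print(twice)
```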

## 4. <a id='toc4_'></a>[Merging the two datasets](#toc0_)

We will now merge the two datasets. First we reshape both datasets from wide to long format, which gives a tidier merged result.

Then we merge the two datasets with an inner merge, meaning that we keep only the observations that appear in both datasets. The merge keys are 'Country_code' and 'year'.
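As an illustration of the reshape and the inner merge, here is a small sketch on made-up data. The helper `to_long` and the value column name are assumptions, not the code inside `Merge_Data()`.

```python
import pandas as pd

# Toy wide tables: one row per country, one column per year (values made up).
gdp_wide = pd.DataFrame({
    "Country_code": ["DK", "SE", "UK"],
    "2012": [250.0, 400.0, 2000.0],
    "2013": [255.0, 410.0, 2050.0],
})
pop_wide = pd.DataFrame({
    "Country_code": ["DK", "SE", "NO"],
    "2012": [5.5, 9.5, 5.0],
    "2013": [5.6, 9.6, 5.1],
})


def to_long(df):
    """Reshape from wide (one column per year) to long (one row per year)."""
    long = df.melt(id_vars="Country_code", var_name="year", value_name="value")
    long["year"] = long["year"].astype(int)
    return long


# Inner merge: only (Country_code, year) pairs found in both tables survive.
merged = to_long(gdp_wide).merge(
    to_long(pop_wide), on=["Country_code", "year"], how="inner"
)
print(merged)
```

Because both value columns share a name, pandas appends the suffixes `_x` and `_y`; UK and NO drop out because each appears in only one table.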

In [7]:
# The code below will show you the merged data
model.Merge_Data()

We will now clean the merged data:
1. We rename the columns x and y to GDP and Population.
2. We drop countries that have NaNs for all values.
3. We reset the index.
4. We calculate the GDP per capita.
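These steps can be sketched as below, on made-up numbers. The function is a hypothetical stand-in for `Clean_merge()`, and the `_x`/`_y` column names are an assumption about what the merge produced. With GDP in million euro and population in persons, multiplying the ratio by 1000 gives thousand euros per person.

```python
import numpy as np
import pandas as pd

# Toy merged table; "XX" is a made-up country with no data at all.
merged = pd.DataFrame({
    "Country_code": ["DK", "DK", "XX", "XX"],
    "year": [2012, 2013, 2012, 2013],
    "value_x": [250000.0, 255000.0, np.nan, np.nan],        # GDP, million euro
    "value_y": [5_000_000.0, 5_100_000.0, np.nan, np.nan],  # persons
})


def clean_merge(df):
    """Hypothetical helper mirroring the four steps above."""
    # 1. Rename the suffixed value columns.
    out = df.rename(columns={"value_x": "GDP", "value_y": "Population"})
    # 2. Drop countries whose rows are NaN in every value column.
    row_empty = out[["GDP", "Population"]].isna().all(axis=1)
    country_empty = row_empty.groupby(out["Country_code"]).transform("all")
    # 3. Reset the index.
    out = out[~country_empty].reset_index(drop=True)
    # 4. Million euro / persons * 1000 = thousand euros per person.
    out["GDP_Cap"] = out["GDP"] / out["Population"] * 1000
    return out


cleaned = clean_merge(merged)
print(cleaned)
```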

In [8]:
# Running the below code will show you the cleaned and merged data
model.Clean_merge()

It should be noted that the GDP per capita is now in thousand euros per person, while GDP is still in million euro and population is the total number of people.

## 5. <a id='toc5_'></a>[Adding a third dataset](#toc0_)

We will now add a third dataset, which shows the country code, the country name and the iso-3 code for each country. This dataset is stored in a .xlsx file under the name C_name_ISO3.
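A lookup merge of this kind might be sketched as follows. The in-memory `names` table stands in for the Excel file (which would normally be read with `pd.read_excel`), and its column names are assumptions; this is not the body of `Merge_excel()`.

```python
import pandas as pd

# Toy version of the cleaned data (values made up).
data = pd.DataFrame({
    "Country_code": ["DK", "SE", "DK"],
    "year": [2012, 2012, 2013],
    "GDP_Cap": [50.0, 42.0, 51.0],
})

# Stand-in for the lookup sheet; column names are assumed.
names = pd.DataFrame({
    "Country_code": ["DK", "SE"],
    "Country_Name": ["Denmark", "Sweden"],
    "ISO3": ["DNK", "SWE"],
})

# Left merge keeps every data row; validate raises an error if a
# country code appears more than once in the lookup table.
with_names = data.merge(
    names, on="Country_code", how="left", validate="many_to_one"
)
print(with_names)
```

The `validate` argument is a cheap guard against duplicated lookup rows silently multiplying the data.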

In [9]:
# Running the below code will show you the data merged with the excel file
model.Merge_excel()

## 6. <a id='toc6_'></a>[Plotting the results](#toc0_)

We will now plot our data in three different types of plots.

**1. A line plot :** The line plot will show how the GDP/capita has changed over time for each country individually.

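**2. A Choropleth map :** The choropleth map will show how GDP/capita compares across countries in a given year, with a slider to move through the years.

**3. A scatterplot :** The scatterplot will show population against GDP per capita for each country in a given year.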

### 6.1 <a id='toc7_'></a>[The Lineplot](#toc0_)

In [16]:
model.line_interactive()



The plot above shows how GDP per capita has changed over the years for each country. Most countries show a drop in GDP per capita in 2020, most likely due to the COVID-19 pandemic. Apart from that, all countries show growth over the period.

### 6.2 <a id='toc8_'></a>[The Choropleth map](#toc0_)

In [11]:
model.plot_choropleth()

The plot above gives a more visual impression of how the countries' GDP per capita compare to one another.

By using the slider we can see how GDP per capita has changed over time. The most noticeable change is for Ireland, which goes from around 42 thousand euros per capita to around 89 thousand euros per capita, more than doubling over the 10 years.
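The size of the Ireland change can be checked with a few lines of arithmetic. The inputs are the approximate values quoted above, so the results are rough as well; the average annual growth rate is an extra summary that is not computed in the project itself.

```python
# Approximate GDP per capita for Ireland, thousand euros (from the text).
start, end = 42.0, 89.0
years = 10  # 2012 to 2022

ratio = end / start                 # overall growth factor
cagr = ratio ** (1 / years) - 1     # implied average annual growth rate

print(f"growth factor: {ratio:.2f}")
print(f"average annual growth: {cagr:.1%}")
```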

We can also see when countries enter and exit the EU: when a country is not part of the EU, the data is not recorded. For example, the United Kingdom left the EU in January 2020, and there is no data for it after 2019.

### 6.3 <a id='toc9_'></a>[The Scatterplot](#toc0_)

In [None]:
model.scatter_interactive()


The plot above shows population against GDP per capita in euros. Countries with a high GDP per capita tend to have a relatively small population. Of course, this does not mean that a high GDP per capita causes a small population; it is more likely that the relationship runs the other way.

## 7. <a id='toc10_'></a>[Standard Calculations](#toc0_)

We will now do some standard calculations for the data, such as the mean, the standard deviation and the quartiles.
In this we also change the format of the output, to make it more readable. 

In [17]:
pd.options.display.float_format = '{:.2f}'.format
model.Merge.describe()

| | year | GDP | Population | GDP_Cap |
|---|---|---|---|---|
| count | 396.0 | 396.0 | 396.0 | 396.0 |
| mean | 2016.93 | 455101.7 | 16427120.83 | 28.47 |
| std | 3.14 | 732127.62 | 23181705.27 | 22.19 |
| min | 2012.0 | 3353.7 | 319575.0 | 3.36 |
| 25% | 2014.0 | 36851.28 | 2076912.75 | 11.79 |
| 50% | 2017.0 | 177109.7 | 6981901.5 | 20.28 |
| 75% | 2020.0 | 452793.88 | 11412821.5 | 40.76 |
| max | 2022.0 | 3261919.4 | 83614362.0 | 98.63 |


From the calculations above we can see that the mean GDP per capita across all countries and years is 28464.5 euros, with a standard deviation of 22191.51 euros.
The GDP per capita values vary quite a bit, with a minimum of 3364.92 euros and a maximum of 98633.75 euros.


## 8. <a id='toc11_'></a>[Conclusion](#toc0_)