# Data Project

This is a notebook with basic information and inspiration for the Data Project.

My favorite `pandas` function: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html (**very** useful!)

<img src="pivot.PNG" style="float:center" width="800">

In [36]:
import pandas as pd
import numpy as np

N = 5
df_gdp = pd.DataFrame({
                    "Country": "ABC_land",
                    "Year": np.linspace(1990, 2011, N, dtype = int), 
                    "Variable_type": "GDP",
                    "Value": np.linspace(666, 6666, N, dtype = int)}) # arbitrary data
df_wages = pd.DataFrame({
                    "Country": "ABC_land",
                    "Year": np.linspace(1990, 2011, N, dtype = int), 
                    "Variable_type": "Wages",
                    "Value": np.linspace(100, 1000, N, dtype = int)}) # arbitrary data

df_country = pd.concat([df_gdp, df_wages])
# df_country

In [37]:
df_country

Unnamed: 0,Country,Year,Variable_type,Value
0,ABC_land,1990,GDP,666
1,ABC_land,1995,GDP,2166
2,ABC_land,2000,GDP,3666
3,ABC_land,2005,GDP,5166
4,ABC_land,2011,GDP,6666
0,ABC_land,1990,Wages,100
1,ABC_land,1995,Wages,325
2,ABC_land,2000,Wages,550
3,ABC_land,2005,Wages,775
4,ABC_land,2011,Wages,1000


In [38]:
# Equivalent call with keyword arguments instead of positional ones:
# df_country = df_country.pivot_table(values = 'Value', index = ['Year', 'Country'], columns = 'Variable_type')
df_country = df_country.pivot_table('Value', ['Year', 'Country'], 'Variable_type')
df_country

Unnamed: 0_level_0,Variable_type,GDP,Wages
Year,Country,Unnamed: 2_level_1,Unnamed: 3_level_1
1990,ABC_land,666,100
1995,ABC_land,2166,325
2000,ABC_land,3666,550
2005,ABC_land,5166,775
2011,ABC_land,6666,1000


Merge example: (This is also a very good function!)

In [39]:
df_pop = pd.DataFrame({
                    "Country": "ABC_land",
                    "Year": np.linspace(1990, 2011, N, dtype = int), 
                    "Population (MIL)": np.linspace(4, 5, N)}) # arbitrary data
df_pop

Unnamed: 0,Country,Year,Population (MIL)
0,ABC_land,1990,4.0
1,ABC_land,1995,4.25
2,ABC_land,2000,4.5
3,ABC_land,2005,4.75
4,ABC_land,2011,5.0


In [33]:
df_country_complete = df_country.merge(df_pop, on = ['Country', 'Year'])
df_country_complete

Unnamed: 0,Country,Year,GDP,Wages,Population (MIL)
0,ABC_land,1990,666,100,4.0
1,ABC_land,1995,2166,325,4.25
2,ABC_land,2000,3666,550,4.5
3,ABC_land,2005,5166,775,4.75
4,ABC_land,2011,6666,1000,5.0


## What does Christian appreciate?

- Show that you are able to "clean" a dataset  
    - Take a big dataset and focus on specific variables  
    
- Merge different datasets (**Important**)
    - Cross-reference different datasets
    - This shows a lot of understanding, as you usually have to be able to do a few manipulations before you can merge data.
    - (E.g. you can't merge a dataset with "years" in the columns and one with "years" in the rows -> they have to be in the same format -> pivot!)
- Convey some kind of story!
    - Basically treat this as a mini-paper, where instead of Excel/ Stata, you use Python to analyze your data
- If possible use an API for getting your data
    - Then everything is done in Python -> no separate Excel files!
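
The point about "years" in the columns versus "years" in the rows can be sketched as follows. This is only a minimal illustration with arbitrary data: the DataFrames `df_wide` and `df_long`, the country `XYZ_land` and all values are made up for the example. It uses `melt` to move the year columns into rows before merging.

```python
import pandas as pd

# Hypothetical "wide" dataset: years in the columns (arbitrary data)
df_wide = pd.DataFrame({
    "Country": ["ABC_land", "XYZ_land"],
    "1990": [4.0, 10.0],
    "1995": [4.25, 10.5],
})

# Hypothetical "long" dataset: years in the rows (arbitrary data)
df_long = pd.DataFrame({
    "Country": ["ABC_land", "ABC_land", "XYZ_land", "XYZ_land"],
    "Year": [1990, 1995, 1990, 1995],
    "GDP": [666, 2166, 900, 1200],
})

# Move the year columns into rows so both datasets have the same format
df_pop_long = df_wide.melt(id_vars="Country", var_name="Year",
                           value_name="Population (MIL)")
# The year labels were column names (strings), so convert them to integers
df_pop_long["Year"] = df_pop_long["Year"].astype(int)

# Now the two datasets share the keys "Country" and "Year" and can be merged
df_merged = df_long.merge(df_pop_long, on=["Country", "Year"], how="left")
```

`how="left"` keeps every row of `df_long` even if a population value is missing, which makes it easy to spot keys that did not match.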

## Very broad structural idea (you don't **have** to follow this)

- Import some data and show the data in a table

- Sort the data: which variables do you want to look at?

- Arrange the data, cross-reference with another dataset, or create new variables. 

- Present visually, graphs are your friend!

- Try to convey a story throughout

# Ideas

## Declining labor shares

Although constant labor shares have long been taught as a "stylized fact", labor shares seem to be declining at a global level. See this paper: http://piketty.pse.ens.fr/files/KarabarbounisNeiman13.pdf  

In my current seminar paper I am looking at something very similar:  

GDP in USD: https://data.un.org/Data.aspx?q=gdp&d=SNAAMA&f=grID%3a101%3bcurrID%3aUSD%3bpcFlag%3a0  
GDP and compensation of workers in national currency: https://data.un.org/Data.aspx?q=gdp&d=SNAAMA&f=grID%3a101%3bcurrID%3aUSD%3bpcFlag%3a0  
Value added and compensation of workers in the corporate sector: https://data.un.org/Data.aspx?q=4.8&d=SNA&f=group_code%3a408

Watch out when downloading: you can't download data for all countries at once if you include too many variables.

Labor's share of total income is defined as: Labor share = Total compensation of employees / GDP  
(For the corporate sector it's "value added" and not GDP)

- Import UN data for GDP and compensation of employees, and GDP in US dollars.
- Specify which countries you use, and which years you look at.
- Create a labor share variable
- Plot the average global labor share for each year, using GDP in US dollars as a weight (otherwise small countries contribute the same as big ones)
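
The labor share and the GDP-weighted global average could be computed roughly like this. It is a minimal sketch with arbitrary data: the column names (`GDP_national`, `Compensation_national`, `GDP_USD`) and the countries are assumptions for the example and will not match the UN download directly.

```python
import pandas as pd

# Arbitrary data: two hypothetical countries over two years
df = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [2000, 2000, 2001, 2001],
    "GDP_national": [100.0, 1000.0, 110.0, 1100.0],          # national currency
    "Compensation_national": [60.0, 500.0, 55.0, 550.0],     # national currency
    "GDP_USD": [50.0, 150.0, 60.0, 140.0],                   # used as weight
})

# Labor share = total compensation of employees / GDP (same currency)
df["Labor_share"] = df["Compensation_national"] / df["GDP_national"]

# Weighted average per year: sum(share * weight) / sum(weight)
df["Weighted"] = df["Labor_share"] * df["GDP_USD"]
sums = df.groupby("Year")[["Weighted", "GDP_USD"]].sum()
global_labor_share = sums["Weighted"] / sums["GDP_USD"]
```

`global_labor_share` is a Series indexed by year, so with `matplotlib` installed, `global_labor_share.plot()` gives the line plot.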

**EXTRA CHALLENGE:** The UN data comes in different "Series", which is the same data where different methodology has been applied. Data should not be continued across multiple series unless the data is "compatible". I.e., are the values for different series close to each other for overlapping years? (close could be within 5%)
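
One possible way to check compatibility is sketched below, for a single country and a single variable. The data is arbitrary, and the column name `Series` and the series numbers 100 and 200 are assumptions for the example. The 5% threshold is the suggestion from the text above.

```python
import pandas as pd

# Arbitrary data: the same variable reported in two hypothetical series
df_series = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2001, 2002, 2003],
    "Series": [100, 100, 100, 200, 200, 200],
    "Value": [50.0, 52.0, 54.0, 53.0, 60.0, 62.0],
})

# One column per series, one row per year
wide = df_series.pivot_table(values="Value", index="Year", columns="Series")

# Keep only the years where both series have a value
overlap = wide.dropna()

# Relative difference between the two series in the overlapping years
rel_diff = (overlap[200] - overlap[100]).abs() / overlap[100]

# "Compatible" here means: within 5% in every overlapping year
compatible = bool((rel_diff <= 0.05).all())
```

Requiring every overlapping year to be within the threshold is a strict choice. Using the mean of `rel_diff` instead would be a looser alternative.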

## Covid-19 and municipality data

- Find data on Covid-19 infections (Johns Hopkins, WHO etc.) or Danish municipality data
- Cross-reference with something else, maybe country/municipality wealth or population density
- Present any correlation you might find in tables and graphs
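
The cross-referencing step could look something like this. It is a minimal sketch: the municipalities, column names and numbers are all arbitrary and only illustrate the merge followed by a correlation.

```python
import pandas as pd

# Arbitrary data: infections per municipality (hypothetical names and values)
df_cases = pd.DataFrame({
    "Municipality": ["A", "B", "C", "D"],
    "Cases_per_100k": [120.0, 300.0, 210.0, 90.0],
})

# Arbitrary data: population density per municipality
df_density = pd.DataFrame({
    "Municipality": ["A", "B", "C", "D"],
    "Pop_density": [50.0, 400.0, 220.0, 30.0],
})

# Cross-reference the two datasets on the shared key
df_covid = df_cases.merge(df_density, on="Municipality", how="inner")

# Pearson correlation between the two variables
corr = df_covid["Cases_per_100k"].corr(df_covid["Pop_density"])
```

With `matplotlib` installed, `df_covid.plot.scatter(x="Pop_density", y="Cases_per_100k")` shows the same relationship as a graph.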

## Climate change and investments in green energy?

- Find data on which countries are hit the hardest by climate change, and cross-country data on investments in green energy
- Any data on cross-country climate action plans?
- Present any correlation you might find in tables and graphs

## Data from your own seminar papers, BA or previous assignments

Feel free to use data you've previously used for something else, it's good practice to see how much you can manipulate the data in Python!