In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
from dstapi import DstApi


Create an object to interact with the Statistics Denmark (DST) API and display a summary table for the GDP dataset:

In [2]:
gdp_dst = DstApi('NRHP')

t_gdp = gdp_dst.tablesummary(language='en')
display(t_gdp)

Table NRHP: 1-2.1.1 Production, GDP and generation of income by region, transaction, price unit and time
Last update: 2022-10-27T08:00:00


Unnamed: 0,variable name,# values,First value,First value label,Last value,Last value label,Time variable
0,OMRÅDE,18,000,All Denmark,999,Outside regions,False
1,TRANSAKT,9,P1K,P.1 Output,B2A3GD,B.2g+B.3g Gross operating surplus and mixed in...,False
2,PRISENHED,4,V_T,"Current prices, (mill. DKK.)",LRG_C,"Pr. capita, 2010-prices, chained values, (1000...",False
3,Tid,29,1993,1993,2021,2021,True


Look up the values (rows) that each variable (column) can take:

In [3]:
for variable in t_gdp['variable name']:
    print(variable+':')
    display(gdp_dst.variable_levels(variable, language='en'))

OMRÅDE:


Unnamed: 0,id,text
0,0,All Denmark
1,84,Region Hovedstaden
2,1,Province Byen København
3,2,Province Københavns omegn
4,3,Province Nordsjælland
5,4,Province Bornholm
6,85,Region Sjælland
7,5,Province Østsjælland
8,6,Province Vest- og Sydsjælland
9,83,Region Syddanmark


TRANSAKT:


Unnamed: 0,id,text
0,P1K,P.1 Output
1,P2D,P.2 Intermediate consumption
2,B1GD,B.1g Gross value added
3,D21X31D,D.21-D.31 Taxes less subsidies on products
4,B1GQD,B.1*g Gross domestic product
5,D29X39D,D.29-D.39 Other taxes less subsidies on produc...
6,B1GFD,B.1GF Gross domestic product at factor cost
7,D1D,D.1 Compensation of employees
8,B2A3GD,B.2g+B.3g Gross operating surplus and mixed in...


PRISENHED:


Unnamed: 0,id,text
0,V_T,"Current prices, (mill. DKK.)"
1,V_C,"Pr. capita. Current prices, (1000 DKK.)"
2,LRG_T,"2010-prices, chained values, (mill. DKK.)"
3,LRG_C,"Pr. capita, 2010-prices, chained values, (1000..."


Tid:


Unnamed: 0,id,text
0,1993,1993
1,1994,1994
2,1995,1995
3,1996,1996
4,1997,1997
5,1998,1998
6,1999,1999
7,2000,2000
8,2001,2001
9,2002,2002


Look up the format of the dictionary of dataset parameters:

In [4]:
par_gdp = gdp_dst._define_base_params(language='en')

display(par_gdp)

{'table': 'nrhp',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'TRANSAKT', 'values': ['*']},
  {'code': 'PRISENHED', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

Define parameters dictionary to select only specified values (rows) of dataset:

In [5]:
par_gdp = {'table': 'nrhp',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['000']},
  {'code': 'TRANSAKT', 'values': ['B1GQD']},
  {'code': 'PRISENHED', 'values': ['LRG_T']},
  {'code': 'Tid', 'values': ['>1993<=2021']}]}

# Just took some random parameters: I'll fix it later
# !! Does real price in gdp dataset have same base year as in consumption dataset?
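
The parameter dictionary above follows the layout printed by `_define_base_params`. As a minimal sketch, the same dictionary can be assembled from a plain mapping of variable codes to selected values. The helper `build_params` and the name `par_gdp_sketch` are hypothetical and not part of dstapi.

```python
def build_params(table, selection, lang='en', fmt='BULK'):
    """Build a parameter dictionary in the layout shown above.

    `selection` maps each variable code to the list of values to keep;
    use ['*'] to keep every value of a variable.
    This helper is hypothetical and not part of dstapi.
    """
    variables = [{'code': code, 'values': list(values)}
                 for code, values in selection.items()]
    return {'table': table, 'format': fmt, 'lang': lang, 'variables': variables}


par_gdp_sketch = build_params('nrhp', {
    'OMRÅDE': ['000'],        # All Denmark
    'TRANSAKT': ['B1GQD'],    # B.1*g Gross domestic product
    'PRISENHED': ['LRG_T'],   # 2010-prices, chained values (mill. DKK)
    'Tid': ['>1993<=2021'],   # same range filter as in the cell above
})
print(par_gdp_sketch)
```

Keeping the selection in one mapping makes it easier to swap a single value, for example the price unit, without retyping the whole dictionary.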

Download dataset using only specified parameters:

In [6]:
gdp = gdp_dst.get_data(params=par_gdp)

display(gdp.head(5))

Unnamed: 0,OMRÅDE,TRANSAKT,PRISENHED,TID,INDHOLD
0,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1994,1403340
1,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1995,1445828
2,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1996,1487758
3,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1997,1536272
4,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1998,1570349


Renaming columns:

In [7]:
gdp.rename(columns = {'OMRÅDE':'Area', 
                      'PRISENHED':'Price unit', 
                      'TID':'Variables', # helpful later
                      'INDHOLD':'GDP'}, inplace=True)
gdp.head(5)

Unnamed: 0,Area,TRANSAKT,Price unit,Variables,GDP
0,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1994,1403340
1,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1995,1445828
2,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1996,1487758
3,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1997,1536272
4,All Denmark,B.1*g Gross domestic product,"2010-prices, chained values, (mill. DKK.)",1998,1570349


Dropping unimportant variables:

In [8]:
gdp.drop(['TRANSAKT', 'Area', 'Price unit'], axis='columns', inplace=True)
gdp.head(5)

Unnamed: 0,Variables,GDP
0,1994,1403340
1,1995,1445828
2,1996,1487758
3,1997,1536272
4,1998,1570349


Prefix each year with 'year', so that the labels are strings rather than bare numbers, and set them as the index:

In [9]:
# Prefix every year with 'year', e.g. 1994 -> 'year1994'
gdp['Variables'] = 'year' + gdp['Variables'].astype(str)
gdp = gdp.set_index('Variables')
gdp.head(5)

Variables,GDP
year1994,1403340
year1995,1445828
year1996,1487758
year1997,1536272
year1998,1570349


Transpose:

In [10]:
gdp = gdp.T
gdp.head(5)

Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,year2003,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
GDP,1403340,1445828,1487758,1536272,1570349,1616643,1677217,1691023,1698909,1705536,...,1839290,1856457,1886520,1930714,1993384,2049632,2090410,2121630,2079312,2180277
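
The last few cells (prefix the years, set the index, transpose) together turn a long table into a single wide row. A minimal sketch of the same reshaping as one chain, on a toy frame holding the first three GDP values shown above; the names `gdp_long` and `gdp_wide` are only illustrative.

```python
import pandas as pd

# Toy frame with the first three rows of the GDP output shown above.
gdp_long = pd.DataFrame({'Variables': [1994, 1995, 1996],
                         'GDP': [1403340, 1445828, 1487758]})

gdp_wide = (
    gdp_long
    .assign(Variables=lambda d: 'year' + d['Variables'].astype(str))  # 1994 -> 'year1994'
    .set_index('Variables')  # year labels become the index
    .T                       # years become columns, 'GDP' is the only row
)
print(gdp_wide)
```

The result has the same layout as the consumption table built below: one row per variable and one column per year.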


Import the "consumption choices" dataset previously downloaded from DST, for which only the necessary parameters were selected. We also skip the first two and the last two rows of the sheet:

In [11]:
filename = 'FU07.xlsx'
cop = pd.read_excel(filename, skiprows=2, skipfooter=2)
display(cop.head(50))

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,1994,1995,1996,1997,1998,1999,2000,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Real prices,Average Households,CONSUMPTION TOTAL,273603,290986,293971,304904,311828,312286,306447,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,,,01.1 Food,35028,35031,34386,34364,34742,34246,34405,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,,,"04.5 Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
3,,,05.2 Household textiles,1148,1140,1071,1097,1059,1093,1094,...,1323,1224,1287,1496,1291,1010,1077,1113,1478,1781
4,,,07.1 Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
5,,,07.3 Transport services,5990,5996,6056,6491,5982,6060,6077,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
6,,,12.1 Personal care,6469,6604,6761,6866,6937,6867,6797,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
7,,,12.5 Insurance,256,12547,13008,14468,15694,15688,14660,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762


We drop the first two unnamed columns, which are NaN except in the first row:

In [12]:
drop_these = ['Unnamed: ' + str(num) for num in range(2)] # use list comprehension to create list of columns
cop.drop(drop_these, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made
cop.head(10)

Unnamed: 0,Unnamed: 2,1994,1995,1996,1997,1998,1999,2000,2001,2002,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,CONSUMPTION TOTAL,273603,290986,293971,304904,311828,312286,306447,301471,297266,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,01.1 Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,"04.5 Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
3,05.2 Household textiles,1148,1140,1071,1097,1059,1093,1094,1239,1475,...,1323,1224,1287,1496,1291,1010,1077,1113,1478,1781
4,07.1 Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
5,07.3 Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
6,12.1 Personal care,6469,6604,6761,6866,6937,6867,6797,6365,6432,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
7,12.5 Insurance,256,12547,13008,14468,15694,15688,14660,13619,12301,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762


Renaming the consumption-category and year columns:

In [13]:
cop.rename(columns = {'Unnamed: 2':'Variables'}, inplace=True)

# Map each year label to 'year<label>', e.g. '1994' -> 'year1994'
col_dict = {str(i): f'year{i}' for i in range(1994, 2021+1)}
cop.rename(columns = col_dict, inplace=True)

cop.head(5)

Unnamed: 0,Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
0,CONSUMPTION TOTAL,273603,290986,293971,304904,311828,312286,306447,301471,297266,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,01.1 Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,"04.5 Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
3,05.2 Household textiles,1148,1140,1071,1097,1059,1093,1094,1239,1475,...,1323,1224,1287,1496,1291,1010,1077,1113,1478,1781
4,07.1 Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
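
The rename above looks up string keys such as '1994'. Depending on how the spreadsheet stores its header, the year labels could also arrive as integers, and `rename` skips keys it does not find. A hedged sketch of a variant that handles both cases by passing a function instead of a dictionary; `add_year_prefix` and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame: one year label is an int, the other a str, to mimic both cases.
toy = pd.DataFrame([['01.1 Food', 35028, 35031]],
                   columns=['Variables', 1994, '1995'])

def add_year_prefix(label):
    """Return 'year<label>' for four-digit year labels, else the label unchanged."""
    text = str(label)
    return f'year{text}' if text.isdigit() and len(text) == 4 else label

toy = toy.rename(columns=add_year_prefix)
print(list(toy.columns))
```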


Dropping the 'Household textiles' row:

In [14]:
# Build a boolean mask that flags the rows to drop
I = cop.Variables.str.contains('Household textiles')
cop = cop.loc[~I] # keep everything else
cop.head(10)

Unnamed: 0,Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
0,CONSUMPTION TOTAL,273603,290986,293971,304904,311828,312286,306447,301471,297266,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,01.1 Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,"04.5 Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
4,07.1 Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
5,07.3 Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
6,12.1 Personal care,6469,6604,6761,6866,6937,6867,6797,6365,6432,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
7,12.5 Insurance,256,12547,13008,14468,15694,15688,14660,13619,12301,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762


Resetting index:

In [15]:
cop.reset_index(inplace = True, drop = True) # Drop old index too
cop.iloc[0:7,:]

Unnamed: 0,Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
0,CONSUMPTION TOTAL,273603,290986,293971,304904,311828,312286,306447,301471,297266,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,01.1 Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,"04.5 Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
3,07.1 Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
4,07.3 Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
5,12.1 Personal care,6469,6604,6761,6866,6937,6867,6797,6365,6432,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
6,12.5 Insurance,256,12547,13008,14468,15694,15688,14660,13619,12301,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762


Removing numbers from consumption categories:

In [16]:
# Strip the leading numeric code and the space after it, e.g. '01.1 Food' -> 'Food'
cop['Variables'] = cop['Variables'].str.replace(r'^[\d.]+\s*', '', regex=True)

cop.loc[0, 'Variables'] = 'Total consumption'

cop.head(10)

Unnamed: 0,Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
0,Total consumption,273603,290986,293971,304904,311828,312286,306447,301471,297266,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
1,Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
2,"Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
3,Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
4,Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
5,Personal care,6469,6604,6761,6866,6937,6867,6797,6365,6432,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
6,Insurance,256,12547,13008,14468,15694,15688,14660,13619,12301,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762
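
A small sketch of the prefix removal in plain Python, assuming every category code consists of digits and dots followed by whitespace. The name `strip_code` is illustrative. The last line shows that stripping only digits and dots, without the space, leaves a leading space in the label, which matters once the labels are used as index keys.

```python
import re

def strip_code(label):
    """Drop a leading numeric code such as '04.5 ' from a category label."""
    return re.sub(r'^[\d.]+\s*', '', label)

labels = ['CONSUMPTION TOTAL', '01.1 Food', '04.5 Electricity, gas and other fuels']
cleaned = [strip_code(label) for label in labels]
print(cleaned)

# For comparison: stripping only digits and dots keeps the space after the code.
with_space = '01.1 Food'.strip('0123456789.')
print(repr(with_space))
```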


Set Index:

In [17]:
cop = cop.set_index('Variables')
cop.head()

Variables,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,year2003,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
Total consumption,273603,290986,293971,304904,311828,312286,306447,301471,297266,296754,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,32743,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
"Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,23392,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,10152,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,5639,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935


Concatenate gdp and cop datasets:

In [18]:
all = pd.concat([cop,gdp])
all.head(11)

Unnamed: 0,year1994,year1995,year1996,year1997,year1998,year1999,year2000,year2001,year2002,year2003,...,year2012,year2013,year2014,year2015,year2016,year2017,year2018,year2019,year2020,year2021
Total consumption,273603,290986,293971,304904,311828,312286,306447,301471,297266,296754,...,314891,313597,312190,308583,298823,303926,318057,316611,313395,322890
Food,35028,35031,34386,34364,34742,34246,34405,33466,32742,32743,...,33744,33234,33822,33393,33160,32604,34210,33004,33774,33007
"Electricity, gas and other fuels",21063,21376,22726,23558,23566,22344,21690,23257,23068,23392,...,23230,23200,22880,22306,21301,24776,24968,24445,21068,26193
Purchase of vehicles,13595,13302,13756,14874,14538,14976,10656,10220,11188,10152,...,14444,15174,13933,17255,15871,15876,18436,18466,22491,21174
Transport services,5990,5996,6056,6491,5982,6060,6077,6049,6075,5639,...,6245,6130,5932,4841,4713,5099,5821,5469,3094,2935
Personal care,6469,6604,6761,6866,6937,6867,6797,6365,6432,6392,...,6209,6258,5944,6283,6040,6699,7064,6540,6432,6240
Insurance,256,12547,13008,14468,15694,15688,14660,13619,12301,14709,...,15590,15497,15096,18237,14919,13802,14545,14500,14781,17762
GDP,1403340,1445828,1487758,1536272,1570349,1616643,1677217,1691023,1698909,1705536,...,1839290,1856457,1886520,1930714,1993384,2049632,2090410,2121630,2079312,2180277
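
`pd.concat` stacks the rows and aligns them on column labels, so any year label present in only one of the two tables would produce NaN cells. A minimal sketch of a check that could run before concatenating, on toy frames with the 1994 and 1995 values from above; `cop_toy`, `gdp_toy` and `combined` are illustrative names.

```python
import pandas as pd

cop_toy = pd.DataFrame({'year1994': [273603], 'year1995': [290986]},
                       index=['Total consumption'])
gdp_toy = pd.DataFrame({'year1994': [1403340], 'year1995': [1445828]},
                       index=['GDP'])

# Labels present in one frame but not the other would yield NaN cells.
mismatched = cop_toy.columns.symmetric_difference(gdp_toy.columns)
assert mismatched.empty, f'Column labels differ: {list(mismatched)}'

combined = pd.concat([cop_toy, gdp_toy])
print(combined)
```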


Next steps:

0) Resolve the base-year problem for real prices:
    The base year used to compute real values differs between the gdp and cop datasets. Should we use current prices instead, or is there another solution? --> current prices
1) Drop remaining NaN rows in cop dataset - DONE
2) New index for cop dataset - (I think) DONE
3) Transform gdp dataset:
    - years as column names
    - only one row called gdp: drop other (?)
4) Merge datasets
5) Create new data by applying an operator to other data in the dataset:
    (e.g. summing two rows)
6) Run a function on the dataset
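
Items 5 and 6 could look roughly like the sketch below, which uses a toy table with the 1994 and 1995 values of three rows from the output above. The row label 'Transport' and the function `pct_change_1994_1995` are made up for illustration; which rows to combine and which function to run is still open.

```python
import pandas as pd

# Toy table with values for 1994 and 1995 copied from the output above.
data = pd.DataFrame(
    {'year1994': [273603, 13595, 5990],
     'year1995': [290986, 13302, 5996]},
    index=['Total consumption', 'Purchase of vehicles', 'Transport services'],
)

# Item 5: a new row as the sum of two existing rows ('Transport' is a made-up label).
data.loc['Transport'] = data.loc['Purchase of vehicles'] + data.loc['Transport services']

# Item 6: run a function on every row, here the percentage change from 1994 to 1995.
def pct_change_1994_1995(row):
    return 100 * (row['year1995'] - row['year1994']) / row['year1994']

growth = data.apply(pct_change_1994_1995, axis=1)
print(growth.round(2))
```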