# Data project

**Table of contents**<a id='toc0_'></a>    
- 1. [Import packages and check folders](#toc1_)    
- 2. [GDP data set](#toc1_1_)  
- 3. [Consumption data set](#toc1_2_)
- 4. [Combining the data sets](#toc1_3_)
- 5. [Analysis](#toc2_)    
  - 5.1. [Graphs](#toc2_1_)    


**Brief description of the project**

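In this project we obtain two data sets in two different ways: GDP figures are downloaded through the Statistics Denmark (DST) API, and consumption figures are read from an Excel file previously downloaded from DST. We then clean the data by dropping unwanted rows and variables, renaming variables and converting both data sets to the same unit. Finally, we plot the data and comment on the results.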

**Goal of the analysis**

We want to analyse how economic fluctuations affect Danes' consumption patterns across different consumption groups.




## 1. <a id='toc1_'></a>[Import packages and check folders](#toc0_)

In [270]:
#First we import necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os 

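# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in matplotlib 3.6; use that name on newer versions
plt.style.use('seaborn-whitegrid')
from dstapi import DstApi # Statistics Denmark API wrapper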

# Import our function
from data_proj import prediction

# Using assert to check that paths exist on computer.
assert os.path.isfile('data/FU07_cp.xlsx')

# Print everything in data
os.listdir('data/')


['FU07_cp.xlsx']

## 2. <a id='toc1_1_'></a>[GDP data set](#toc1_1_)

2.1 Retrieving the data

In [271]:
#Create an object to interact with the Statistics Denmark API and fetch a summary table with
#information about the GDP data set:
gdp_dst = DstApi('NRHP')

t_gdp = gdp_dst.tablesummary(language='en')


#Uncomment the lines below to gain further insight into what the data contains

#display(t_gdp)

#Look up values (rows) that each variable (columns) can take: 
#for variable in t_gdp['variable name']:
#    print(variable+':')
#    display(gdp_dst.variable_levels(variable, language='en'))

#Look up the format of the dictionary of dataset parameters:
#par_gdp = gdp_dst._define_base_params(language='en')
#display(par_gdp)

Table NRHP: 1-2.1.1 Production, GDP and generation of income by region, transaction, price unit and time
Last update: 2022-10-27T08:00:00


2.2 Handling the data

In [272]:
#Define parameters dictionary to select only specified values (rows) of dataset:
par_gdp = {'table': 'nrhp',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['000']},
  {'code': 'TRANSAKT', 'values': ['B1GQD']},
  {'code': 'PRISENHED', 'values': ['V_C']},
  {'code': 'Tid', 'values': ['>1993<=2021']}]}

#Download the specific dataset by specified parameters:
gdp = gdp_dst.get_data(params=par_gdp)

#Rename columns:
gdp.rename(columns = {'OMRÅDE':'Area', 
                      'PRISENHED':'Price unit', 
                      'TID':'variables', #helpful later
                      'INDHOLD':'GDP'}, inplace=True)

#Drop unimportant variables
gdp.drop(['TRANSAKT', 'Area', 'Price unit'], axis='columns', inplace=True)

#Prefix each year with 'value' (text plus number, no spaces) and set it as index:
gdp['variables'] = 'value' + gdp['variables'].astype(str)
gdp = gdp.set_index('variables')

#Transpose
gdp = gdp.T
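
The reshaping step above can be illustrated on a small stand-in table. This is a minimal sketch, assuming the downloaded table has one row per year; the toy frame `toy` is hypothetical, and its three values are copied from the GDP row printed later in this notebook rather than downloaded.

```python
import pandas as pd

# Toy stand-in for the downloaded table: one row per year (assumed layout).
toy = pd.DataFrame({'variables': [1994, 1995, 1996],
                    'GDP': [191, 198, 207]})

# Prefix each year with 'value' so the labels match the consumption data.
toy['variables'] = 'value' + toy['variables'].astype(str)

# After transposing, the years are columns and 'GDP' is the only row label.
toy_wide = toy.set_index('variables').T

print(toy_wide)
```

The result has the same shape as a single row of the consumption table, which is what makes the later concatenation straightforward.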

## 3. <a id='toc1_2_'></a>[Consumption data set](#toc1_2_)

3.1 Retrieving the data

In [273]:
#Load the "consumption choices" data set, previously downloaded from DST with only
#the necessary parameters selected. We skip the empty rows at the top and bottom of the sheet:
filename = 'data/FU07_cp.xlsx'
cop = pd.read_excel(filename, skiprows=2, skipfooter=2)

#display(cop) #Uncomment to gain information on the data set

3.2 Handling the data

In [274]:
#Drop NaN columns:
drop_these = ['Unnamed: ' + str(num) for num in range(2)] # use list comprehension to create list of columns
cop.drop(drop_these, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made

#Rename consumption and year columns:
cop.rename(columns = {'Unnamed: 2':'variables'}, inplace=True)
col_dict = {str(i) : f'value{i}' for i in range(1994,2021+1)}
cop.rename(columns = col_dict, inplace=True)

#Drop unimportant variables:
I = cop.variables.str.contains('Household textiles')
cop = cop.loc[I == False] # keeping everything else

#Reset the index
cop.reset_index(inplace = True, drop = True) # Drop old index too

#Remove the leading category numbers from the consumption categories:
cop['variables'] = cop['variables'].str.strip('0123456789.')

cop.loc[0,['variables']] = 'Total consumption'

#Set variables as index:
cop = cop.set_index('variables')
cop

variables,value1994,value1995,value1996,value1997,value1998,value1999,value2000,value2001,value2002,value2003,...,value2012,value2013,value2014,value2015,value2016,value2017,value2018,value2019,value2020,value2021
Total consumption,22325,23280,23488,24021,24904,24905,25605,25803,25972,26428,...,33133,32802,33381,33393,33268,33526,35068,34380,35485,34864
"Electricity, gas and other fuels",11551,11786,13261,14056,14669,14579,15426,17058,17774,17939,...,23866,24282,24322,22306,20812,23919,23940,23580,19925,25635
Purchase of vehicles,12170,12029,12453,13525,13535,14268,10072,9742,10985,10423,...,14669,15307,13928,17255,15671,15465,17742,17754,21668,20965
Transport services,3663,3701,3445,3720,3425,3669,3869,3984,4154,4229,...,6188,6096,5820,4841,4533,5133,6010,5918,3324,3026
Personal care,4092,4274,4531,4697,4874,5072,5169,5017,5210,5314,...,6163,6236,5916,6283,5945,6522,6733,6155,5970,5761
Insurance,5875,6206,6528,7519,8344,8809,8399,8292,8207,10671,...,15398,14502,14820,18237,15276,14183,15225,15210,15887,19317
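
The label clean-up relies on how `str.strip` treats a set of characters. The sketch below uses hypothetical raw labels (the original category names are not shown in this notebook) to show that stripping digits and dots can leave a leading space behind, and that a second plain `strip()` removes it.

```python
# Hypothetical labels: the raw category names are not shown in this notebook.
raw_labels = ['04.5 Electricity, gas and other fuels', '07.1 Purchase of vehicles']

# str.strip removes any of the listed characters from both ends and
# stops at the first character that is not in the set.
cleaned = [label.strip('0123456789.') for label in raw_labels]

# The space after the category code is not in the set, so it survives;
# a plain strip() removes the remaining whitespace.
tidy = [label.strip() for label in cleaned]

print(cleaned)
print(tidy)
```

Whether the extra `strip()` is wanted depends on how the labels are matched later, for example as dropdown options.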


## 4. <a id='toc1_3_'></a>[Combining the data sets](#toc1_3_)

4.1 Concatenate gdp and cop datasets:


In [275]:
#Check that every column (year) in cop also exists in gdp
different_years = [y for y in cop.columns.unique() if y not in gdp.columns.unique()] 
print(f'Columns (years) found in cop data but not in gdp: {different_years}')

#Concatenate them
all = pd.concat([cop,gdp])
#all

Columns (years) found in cop data but not in gdp: []
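
The check above only looks for columns that are in `cop` but missing from `gdp`. A possible extension, sketched here with stand-in lists instead of the real `cop.columns` and `gdp.columns`, compares the labels in both directions with set differences.

```python
# Stand-in column labels; in the notebook these would come from
# cop.columns and gdp.columns.
cop_cols = [f'value{year}' for year in range(1994, 2022)]
gdp_cols = [f'value{year}' for year in range(1994, 2022)]

# Labels present in one data set but not in the other.
only_in_cop = sorted(set(cop_cols) - set(gdp_cols))
only_in_gdp = sorted(set(gdp_cols) - set(cop_cols))

print(f'Only in cop: {only_in_cop}')
print(f'Only in gdp: {only_in_gdp}')
```

If either list is non-empty, `pd.concat` still runs but fills the missing cells with NaN, so the check is worth doing before concatenating.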


4.2 Accommodate the data

In [276]:
#Rename index:
all.index.names = ['variables']

#Consumption is in DKK while GDP (per capita) is in 1,000 DKK. Convert consumption to 1,000 DKK so that both use the same unit.
scalar = 1000
is_consumption = all.index != 'GDP'
all.loc[is_consumption] = all.loc[is_consumption] / scalar
all

variables,value1994,value1995,value1996,value1997,value1998,value1999,value2000,value2001,value2002,value2003,...,value2012,value2013,value2014,value2015,value2016,value2017,value2018,value2019,value2020,value2021
Total consumption,22.325,23.28,23.488,24.021,24.904,24.905,25.605,25.803,25.972,26.428,...,33.133,32.802,33.381,33.393,33.268,33.526,35.068,34.38,35.485,34.864
"Electricity, gas and other fuels",11.551,11.786,13.261,14.056,14.669,14.579,15.426,17.058,17.774,17.939,...,23.866,24.282,24.322,22.306,20.812,23.919,23.94,23.58,19.925,25.635
Purchase of vehicles,12.17,12.029,12.453,13.525,13.535,14.268,10.072,9.742,10.985,10.423,...,14.669,15.307,13.928,17.255,15.671,15.465,17.742,17.754,21.668,20.965
Transport services,3.663,3.701,3.445,3.72,3.425,3.669,3.869,3.984,4.154,4.229,...,6.188,6.096,5.82,4.841,4.533,5.133,6.01,5.918,3.324,3.026
Personal care,4.092,4.274,4.531,4.697,4.874,5.072,5.169,5.017,5.21,5.314,...,6.163,6.236,5.916,6.283,5.945,6.522,6.733,6.155,5.97,5.761
Insurance,5.875,6.206,6.528,7.519,8.344,8.809,8.399,8.292,8.207,10.671,...,15.398,14.502,14.82,18.237,15.276,14.183,15.225,15.21,15.887,19.317
GDP,191.0,198.0,207.0,217.0,224.0,233.0,249.0,256.0,262.0,267.0,...,339.0,344.0,351.0,358.0,368.0,380.0,389.0,397.0,399.0,428.0


## 5. <a id='toc2_'></a>[Analysis](#toc2_)

In [277]:
#Create new column, value2022, with a predicted value for 2022 for every variable,
#assuming 5% growth (growth factor rate=1.05):
all['value2022'] = all.apply(prediction, rate=1.05, axis=1)

#Compute each variable as a share of GDP:
for val in all.index:
    all.loc[val + "/GDP"] = all.loc[val] / all.loc["GDP"]

#Round to two decimals:
all = all.astype(float).round(decimals=2)
all

variables,value1994,value1995,value1996,value1997,value1998,value1999,value2000,value2001,value2002,value2003,...,value2013,value2014,value2015,value2016,value2017,value2018,value2019,value2020,value2021,value2022
Total consumption,22.32,23.28,23.49,24.02,24.9,24.9,25.6,25.8,25.97,26.43,...,32.8,33.38,33.39,33.27,33.53,35.07,34.38,35.48,34.86,36.61
"Electricity, gas and other fuels",11.55,11.79,13.26,14.06,14.67,14.58,15.43,17.06,17.77,17.94,...,24.28,24.32,22.31,20.81,23.92,23.94,23.58,19.92,25.64,26.92
Purchase of vehicles,12.17,12.03,12.45,13.52,13.54,14.27,10.07,9.74,10.98,10.42,...,15.31,13.93,17.26,15.67,15.46,17.74,17.75,21.67,20.96,22.01
Transport services,3.66,3.7,3.44,3.72,3.42,3.67,3.87,3.98,4.15,4.23,...,6.1,5.82,4.84,4.53,5.13,6.01,5.92,3.32,3.03,3.18
Personal care,4.09,4.27,4.53,4.7,4.87,5.07,5.17,5.02,5.21,5.31,...,6.24,5.92,6.28,5.94,6.52,6.73,6.16,5.97,5.76,6.05
Insurance,5.88,6.21,6.53,7.52,8.34,8.81,8.4,8.29,8.21,10.67,...,14.5,14.82,18.24,15.28,14.18,15.22,15.21,15.89,19.32,20.28
GDP,191.0,198.0,207.0,217.0,224.0,233.0,249.0,256.0,262.0,267.0,...,344.0,351.0,358.0,368.0,380.0,389.0,397.0,399.0,428.0,449.4
Total consumption/GDP,0.12,0.12,0.11,0.11,0.11,0.11,0.1,0.1,0.1,0.1,...,0.1,0.1,0.09,0.09,0.09,0.09,0.09,0.09,0.08,0.08
"Electricity, gas and other fuels/GDP",0.06,0.06,0.06,0.06,0.07,0.06,0.06,0.07,0.07,0.07,...,0.07,0.07,0.06,0.06,0.06,0.06,0.06,0.05,0.06,0.06
Purchase of vehicles/GDP,0.06,0.06,0.06,0.06,0.06,0.06,0.04,0.04,0.04,0.04,...,0.04,0.04,0.05,0.04,0.04,0.05,0.04,0.05,0.05,0.05
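
The `prediction` helper is imported from `data_proj`, and its body is not shown in this notebook. In the table above, each `value2022` entry equals the 2021 value times 1.05 (for GDP, 428.0 × 1.05 = 449.4). A sketch consistent with that output could look like the following; `prediction_sketch` is a hypothetical stand-in and not necessarily how the real function is written.

```python
import pandas as pd

def prediction_sketch(row, rate):
    """Hypothetical stand-in for data_proj.prediction:
    scale the last observed value in the row by the growth factor."""
    return row.iloc[-1] * rate

# Two rows copied from the table above (1,000 DKK).
toy = pd.DataFrame({'value2020': [399.0, 35.485],
                    'value2021': [428.0, 34.864]},
                   index=['GDP', 'Total consumption'])

# axis=1 passes one row at a time; rate is forwarded as a keyword argument.
toy['value2022'] = toy.apply(prediction_sketch, rate=1.05, axis=1)

print(toy.round(2))
```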


The table above shows how large a share of GDP each consumption item makes up. Total consumption falls from 12% of GDP in 1994 to 8% in 2021. We have further looked into five subcategories of consumption. These will be easier to analyse through a graph, which we create in the next section. It can already be seen, however, that electricity, gas and other fuels have stayed at a steady 6-7% of GDP for most of the period, whereas insurance varies more from year to year, rising from 3% of GDP in 1994 to 5% in 2021.
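
As a quick check of the shares quoted above, the arithmetic can be repeated by hand. The sketch below copies the 1994 and 2021 values from the tables in this notebook; the dictionary names are only for illustration.

```python
# Values in 1,000 DKK, copied from the tables above.
gdp_per_capita = {1994: 191.0, 2021: 428.0}
total_consumption = {1994: 22.325, 2021: 34.864}
insurance = {1994: 5.875, 2021: 19.317}

shares = {}
for year in (1994, 2021):
    # Divide first and round afterwards, in the same order as the notebook.
    shares[year] = {
        'total': round(total_consumption[year] / gdp_per_capita[year], 2),
        'insurance': round(insurance[year] / gdp_per_capita[year], 2),
    }
    print(year, shares[year])
```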

### 5.1 <a id='toc2_1_'></a>[Graphs](#toc2_1_)

5.1.1 Prepare the data

In [278]:
#Reset the index:
all = all.reset_index()

#Transform dataframe from wide to long format:
all_long = pd.wide_to_long(all, stubnames='value', i='variables', j='year')

#Save a copy of the final format of our dataset as CSV (uncomment to run the code).
#The index holds variables and year, so it is written to the file as well:
#all_long.to_csv('data/FU07_cp_long.csv')

#Reset the index again
all_long = all_long.reset_index()
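
The wide-to-long step can be seen more easily on a small stand-in frame. The sketch below assumes the same column naming as above (`value` followed by the year); the frame `wide` is hypothetical and holds four values copied from the rounded table.

```python
import pandas as pd

# Small wide table: one row per variable, one column per year.
wide = pd.DataFrame({'variables': ['GDP', 'Insurance'],
                     'value2020': [399.0, 15.89],
                     'value2021': [428.0, 19.32]})

# 'value' is the stub; the numeric suffix becomes the new 'year' column.
long = pd.wide_to_long(wide, stubnames='value', i='variables', j='year')
long = long.reset_index()

print(long)
```

Each row of the long table is one (variable, year) pair, which is the layout the plotting function filters on.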

5.1.2 Plot an interactive graph

In [279]:
import ipywidgets as widgets

def plot_e(df, variable): 
    """Plot the yearly values of one variable."""
    I = df['variables'] == variable
    df.loc[I,:].plot(x='year', y='value', style='-o', legend=False)

widgets.interact(plot_e, 
    df = widgets.fixed(all_long),
    variable = widgets.Dropdown(description='variables', 
                                    options=all_long.variables.unique(), 
                                    value='Total consumption')
); 


[Interactive plot: `value` against `year`, drawn as a line with markers, for the variable chosen in the `variables` dropdown (default: Total consumption).]

Most of the variables trend upwards over the period; the exceptions are purchase of vehicles, transport services and personal care after 2007.
Purchase of vehicles is the most volatile variable, likely because vehicles are a luxury good and therefore more sensitive to economic fluctuations. The biggest drop in purchase of vehicles occurs in 2007, when GDP also drops due to the financial crisis. The ratio of purchase of vehicles to GDP also falls by 2 percentage points from 2007 to 2009.
Most of the variables appear to be affected by GDP growth, but this could also be a side effect of the decrease in total consumption and therefore reflect a general trend.
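
One way to put a number on volatility is the year-on-year growth rate. The sketch below is only an illustration: it uses the purchase-of-vehicles values for the years printed in the truncated table above (1994 to 2003), so it does not cover 2007 and cannot confirm or refute the statement about that year.

```python
# Purchase of vehicles in 1,000 DKK, for the years visible in the table above.
vehicles = {1994: 12.170, 1995: 12.029, 1996: 12.453, 1997: 13.525,
            1998: 13.535, 1999: 14.268, 2000: 10.072, 2001: 9.742,
            2002: 10.985, 2003: 10.423}

years = sorted(vehicles)

# Growth rate of each year relative to the year before.
growth = {year: vehicles[year] / vehicles[prev] - 1
          for prev, year in zip(years, years[1:])}

for year, rate in growth.items():
    print(f'{year}: {rate:+.1%}')

# Year with the most negative growth rate within this window.
largest_drop = min(growth, key=growth.get)
print(f'Largest drop in this window: {largest_drop}')
```

Applying the same calculation to the full series, and to the other variables, would make the comparison of volatility across consumption groups explicit.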