# Alt andet end lige: Data project

Importing packages and setting magics:

In [198]:
# importing the used packages 
import pandas as pd
import numpy as np
import datetime

# importing package to create plots and setting basic, visual settings
import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"-"})
plt.rcParams.update({'font.size': 10})
import ipywidgets as widgets

# importing the API from DST used to import data
from dstapi import DstApi 

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

1. Data is imported using the API from Statistics Denmark (Danmarks Statistik, DST).

In [199]:
price = DstApi('EJ55') # EJ55 is the price index for sales of property
unemp = DstApi('AUS08') # AUS08 is seasonally adjusted data on unemployment by region

## Exploring the data sets

1. The available values for each variable are displayed in order to select the relevant variables.

In [200]:
#An overview over the available data. 
tabsum = price.tablesummary(language='en')
display(tabsum)

# Displaying the available values for each variable:
for variable in tabsum['variable name']:
    print(variable+':')
    display(price.variable_levels(variable, language='en'))

Table EJ55: Price index for sales of property by region, category of real property, unit and time
Last update: 2023-03-31T08:00:00


Unnamed: 0,variable name,# values,First value,First value label,Last value,Last value label,Time variable
0,OMRÅDE,17,000,All Denmark,11,Province Nordjylland,False
1,EJENDOMSKATE,3,0111,One-family houses,2103,"Owner-occupied flats, total",False
2,TAL,3,100,Index,310,Percentage change compared to same quarter the...,False
3,Tid,124,1992K1,1992Q1,2022K4,2022Q4,True


OMRÅDE:


Unnamed: 0,id,text
0,0,All Denmark
1,84,Region Hovedstaden
2,1,Province Byen København
3,2,Province Københavns omegn
4,3,Province Nordsjælland
5,4,Province Bornholm
6,85,Region Sjælland
7,5,Province Østsjælland
8,6,Province Vest- og Sydsjælland
9,83,Region Syddanmark


EJENDOMSKATE:


Unnamed: 0,id,text
0,111,One-family houses
1,801,Weekend cottages
2,2103,"Owner-occupied flats, total"


TAL:


Unnamed: 0,id,text
0,100,Index
1,210,Percentage change compared to previous quarter
2,310,Percentage change compared to same quarter the...


Tid:


Unnamed: 0,id,text
0,1992K1,1992Q1
1,1992K2,1992Q2
2,1992K3,1992Q3
3,1992K4,1992Q4
4,1993K1,1993Q1
...,...,...
119,2021K4,2021Q4
120,2022K1,2022Q1
121,2022K2,2022Q2
122,2022K3,2022Q3


We are only interested in a subset of the total dataset. Below, we specify the subset of the dataset we want to include

1. A parameter dictionary is defined in order to specify the data we want.

In [201]:
# Getting an overview of the underlying code that determines which variables and subset of data we import via the API
params = price._define_base_params(language='en')
params

{'table': 'ej55',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'EJENDOMSKATE', 'values': ['*']},
  {'code': 'TAL', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

1. We select the data we want: the 11 provinces (codes 01 to 11), one-family houses, and the index values for all available quarters.

In [202]:
# Using the format printed above to specify which subsets of the available dataset we want to import
params = {'table': 'ej55',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']},
  {'code': 'EJENDOMSKATE', 'values': ['0111']},
  {'code': 'TAL', 'values': ['100']}, # with the key 'code' we choose the desired variable and with the key 'values' we choose what subset of the dataset for the given variable we want to include
  {'code': 'Tid', 'values': ['*']}]} # ['*'] includes all the available data
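
The request dictionary above follows a fixed shape, so it can also be generated rather than typed by hand. The sketch below assumes only the structure printed by `_define_base_params`; the helper `build_params` and the name `example_params` are hypothetical and not part of `dstapi`.

```python
def build_params(table, selections, lang="en"):
    """Build a request dictionary in the same shape as the one printed above.

    `selections` maps a variable code to a list of value ids; ['*'] means all values.
    This helper is hypothetical and not part of dstapi.
    """
    return {
        "table": table,
        "format": "BULK",
        "lang": lang,
        "variables": [
            {"code": code, "values": list(values)}
            for code, values in selections.items()
        ],
    }


# Province ids 01..11, zero-padded to two digits as in the cell above
province_ids = [f"{i:02d}" for i in range(1, 12)]

example_params = build_params(
    "ej55",
    {
        "OMRÅDE": province_ids,
        "EJENDOMSKATE": ["0111"],  # one-family houses
        "TAL": ["100"],            # index values
        "Tid": ["*"],              # all available periods
    },
)
print(example_params["variables"][0])
```

Generating the province ids avoids typos in the long list, at the cost of hiding which codes are requested.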

1. Data is sorted and the index is reset. 
2. Columns are renamed.

In [203]:
sales_api = price.get_data(params=params) # retrieving the specified subset of the dataset from DST
sales_api.sort_values(by=['OMRÅDE', 'TID', 'EJENDOMSKATE'], inplace=True) # sorting the values
sales_api.reset_index(inplace = True, drop = True) # resetting the initial index, so it fits the new dataset
sales_api.rename(columns = {'OMRÅDE':'PROVINCE', 'EJENDOMSKATE':'CATEGORY', 'TAL':'UNIT', 'TID':'TIME', 'INDHOLD':'SALES_INDEX'}, inplace=True) # renaming columns

1. Missing values are replaced with NaN.

In [204]:
sales_api = sales_api.replace('..', np.nan) # replacing all the missing data (denoted with '..' by DST) with NaN-values

1. Value types are inspected.

In [205]:
sales_api.info() # assessing the types of data in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1364 entries, 0 to 1363
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PROVINCE     1364 non-null   object
 1   CATEGORY     1364 non-null   object
 2   UNIT         1364 non-null   object
 3   TIME         1364 non-null   object
 4   SALES_INDEX  1308 non-null   object
dtypes: object(5)
memory usage: 53.4+ KB


1. The value variable is changed to a float type variable. 

In [206]:
sales_api['SALES_INDEX'] = sales_api['SALES_INDEX'].astype('float') # changing the column SALES_INDEX from object to float
sales_api.info() # displaying the new types of variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1364 entries, 0 to 1363
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PROVINCE     1364 non-null   object 
 1   CATEGORY     1364 non-null   object 
 2   UNIT         1364 non-null   object 
 3   TIME         1364 non-null   object 
 4   SALES_INDEX  1308 non-null   float64
dtypes: float64(1), object(4)
memory usage: 53.4+ KB
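
The two cleaning steps above (replacing '..' with NaN and casting to float) could also be done in one call. This is a minimal sketch on a toy column, not the DST data; the names `raw` and `clean` are made up for the illustration.

```python
import pandas as pd

# Toy column mimicking the raw API output: numbers stored as strings, '..' for missing
raw = pd.Series(["63.9", "..", "22.8"], name="SALES_INDEX")

# errors="coerce" turns every value that cannot be parsed as a number into NaN
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)         # float64
print(clean.isna().sum())  # 1
```

Note that `errors="coerce"` silently converts any unparseable string to NaN, not only '..', so the explicit replacement used in the notebook is the stricter choice.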


Below, we **explore the raw data** by creating **interactive plots** to show important developments 

**Interactive plot** :

1. We create an interactive plot of the housing prices in different provinces of Denmark

In [207]:
def plot_value(df, province): 
    I = (df['PROVINCE'] == province) # boolean mask selecting the rows of the chosen province
    df.loc[I,:].plot(x='TIME', y='SALES_INDEX', legend=False) # plotting the price index (y) against time (x)

widgets.interact(plot_value, # creating an interactive plot
    df = widgets.fixed(sales_api),
    province = widgets.Dropdown(description='Province', #creating drop-down widget that allows us to choose the desired province
                                    options=sales_api.PROVINCE.unique(), 
                                    value='Province Byen København'), # initial province will be 'Byen København'
)


[Interactive plot: SALES_INDEX over TIME for the province chosen in the dropdown; initial selection 'Province Byen København']

In the plot above, we see that all the (nominal) price indexes have followed an increasing trend since 1992. Note that the index levels are not comparable across provinces, as each province's index is normalized to 100 in 2006. So even if prices in, e.g., Byen København were, say, 30 percent higher than prices in Fyn in 2006, both indexes would equal 100 in that year.
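
To make the normalization point concrete, here is a small sketch with hypothetical price levels (not DST data). Region A is 30 percent more expensive than region B in the base year, yet both get index 100. The exact base-period definition used by DST is not shown in this notebook, so a single base-year value is assumed.

```python
import pandas as pd

# Hypothetical price levels: region A is 30 percent more expensive than B in 2006
prices = pd.DataFrame({
    "YEAR": [2006, 2007, 2006, 2007],
    "REGION": ["A", "A", "B", "B"],
    "PRICE": [130.0, 143.0, 100.0, 105.0],
})

# Base-year price of each region, looked up for every row via the region name
base = prices[prices["YEAR"] == 2006].set_index("REGION")["PRICE"]
prices["INDEX"] = 100 * prices["PRICE"] / prices["REGION"].map(base)

print(prices)
```

The index only carries information about growth within a region (here roughly 10 percent for A and 5 percent for B), while the level difference between the regions is lost.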

However, the plot does let us compare the relative increase in prices *within* a given province over time. We see that the largest price increases since 2006 have been in the provinces 'Byen København', 'Københavns omegn' and 'Bornholm'. Of these three provinces, Bornholm has experienced the relatively largest decrease in prices in the most recent quarters.

# Merge with data on gross unemployment

Now we wish to examine the correlation within provinces between gross unemployment and prices of one-family houses. Theory suggests that gross unemployment in a province is a determinant of housing prices, and that the two are negatively correlated. The intuition is that when the unemployment rate increases in a province, it becomes less attractive to move there, as the risk of unemployment is, all else equal, higher in this area. The demand for houses in the province therefore decreases, which implies that housing prices decrease.

First, we collect data on gross unemployment within provinces in Denmark. This is done by using the API from Statistics Denmark and importing the table *AUS08* (seasonally adjusted unemployment).

Just like before, we then clean this dataset and select the relevant information from the dataset. To do this, we need to get an overview of the dataset:

In [208]:
#An overview over the available data. 
tabsum_unemp = unemp.tablesummary(language='en')
display(tabsum_unemp)

# The available values for each variable:
for variable in tabsum_unemp['variable name']:
    print(variable+':')
    display(unemp.variable_levels(variable, language='en'))

Table AUS08: unemployed persons (seasonally adjusted) by region, seasonal adjustment and actual figures and time
Last update: 2023-04-28T08:00:00


OMRÅDE:


Unnamed: 0,id,text
0,000,All Denmark
1,084,Region Hovedstaden
2,01,Province Byen København
3,101,Copenhagen
4,147,Frederiksberg
...,...,...
112,787,Thisted
113,820,Vesthimmerlands
114,851,Aalborg
115,998,Unknown municipality


SAESONFAK:


Unnamed: 0,id,text
0,9,Seasonally adjusted figures in percent of the ...
1,10,Seasonally adjusted
2,22,Enumerated actual figures in percent of the l...
3,24,Enumerated actual figures


Tid:


Unnamed: 0,id,text
0,2007M01,2007M01
1,2007M02,2007M02
2,2007M03,2007M03
3,2007M04,2007M04
4,2007M05,2007M05
...,...,...
190,2022M11,2022M11
191,2022M12,2022M12
192,2023M01,2023M01
193,2023M02,2023M02


Now, we select the relevant variables for this analysis.

In [209]:
# Getting an overview of the underlying code that determines which variables and subset of data we import via the API
params_unemp = unemp._define_base_params(language='en')
params_unemp 

{'table': 'aus08',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'SAESONFAK', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

In [210]:
# Using the format printed above to specify which subsets of the available dataset we want to import
params_unemp = {'table': 'aus08',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']}, # with the key 'code' we choose the desired variable and with the key 'values' we choose what subset of the dataset for the given variable we want to include
  {'code': 'SAESONFAK', 'values': ['9']},
  {'code': 'Tid', 'values': ['*']}]} # ['*'] includes all the available values



In [211]:
unemp_api = unemp.get_data(params=params_unemp) # retrieving the specified subset of the dataset from DST
unemp_api.sort_values(by=['OMRÅDE', 'TID'], inplace=True) # sorting the values
unemp_api = unemp_api.drop(columns = ['SAESONFAK']) # dropping unwanted columns
unemp_api.reset_index(inplace = True, drop = True) # resetting the index, so it fits the new dataset
unemp_api.rename(columns = {'OMRÅDE':'PROVINCE','TID':'TIME', 'INDHOLD':'UNEMPLOYMENT_RATE'}, inplace=True) # renaming columns


### Taking yearly averages over the unemployment rate 

Now, we want to calculate yearly averages of the unemployment rate and the housing prices, respectively. This is so we can plot the two against each other later.

We start by creating a new variable 'YEAR' that takes the same value for all observations in a given year. Then we drop 'TIME'. Finally, we calculate the average over each year. We do this first for the unemployment rate and then for the housing prices.

In [None]:
unemp_api['YEAR'] = pd.to_datetime(unemp_api['TIME'], format='%YM%m').dt.year # extracting the year information from the column 'TIME' and creating a new variable 'YEAR' with the year information
unemp_api = unemp_api.drop(columns = ['TIME']) # dropping the column 'TIME'

unemp_api['UNEMPLOYMENT_RATE'] = unemp_api['UNEMPLOYMENT_RATE'].astype('float') # changing the column UNEMPLOYMENT_RATE from object to float
unemp_api.info() # assessing the types of data in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2145 entries, 0 to 2144
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PROVINCE           2145 non-null   object 
 1   UNEMPLOYMENT_RATE  2145 non-null   float64
 2   YEAR               2145 non-null   int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 50.4+ KB

In [212]:
unemp_avg = unemp_api.groupby(['PROVINCE', 'YEAR'])['UNEMPLOYMENT_RATE'].mean() # calculating the average unemployment rate for each province in each year
unemp_avg = unemp_avg.reset_index() # resetting the index, so it fits the new dataset
unemp_avg.head(5)

Unnamed: 0,PROVINCE,YEAR,UNEMPLOYMENT_RATE
0,Province Bornholm,2007,7.391667
1,Province Bornholm,2008,5.975
2,Province Bornholm,2009,7.475
3,Province Bornholm,2010,8.191667
4,Province Bornholm,2011,7.4
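
The yearly averaging can be checked on a toy example where the result is easy to verify by hand. The sketch below uses made-up numbers for a hypothetical province 'X' and the same 'YYYYMmm' time labels as the unemployment table.

```python
import pandas as pd

# Made-up monthly unemployment rates for a hypothetical province 'X'
toy = pd.DataFrame({
    "PROVINCE": ["X", "X", "X", "X"],
    "TIME": ["2007M01", "2007M02", "2008M01", "2008M02"],
    "UNEMPLOYMENT_RATE": [7.0, 8.0, 5.0, 6.0],
})

# The literal 'M' in the format string matches the 'M' in labels such as '2007M01'
toy["YEAR"] = pd.to_datetime(toy["TIME"], format="%YM%m").dt.year

# as_index=False keeps PROVINCE and YEAR as ordinary columns
toy_avg = toy.groupby(["PROVINCE", "YEAR"], as_index=False)["UNEMPLOYMENT_RATE"].mean()
print(toy_avg)
```

Here the 2007 average is (7.0 + 8.0) / 2 = 7.5 and the 2008 average is (5.0 + 6.0) / 2 = 5.5.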


Now, we also calculate the mean for the house prices in the same fashion.

In [213]:

sales_api['YEAR'] = pd.to_datetime(sales_api['TIME'], format='%YQ%m').dt.year # extracting the year from 'TIME'; '%m' reads the quarter digit as a month, which leaves the year unaffected
sales_api = sales_api.drop(columns = ['TIME']) # dropping the column 'TIME'

sales_api['SALES_INDEX'] = sales_api['SALES_INDEX'].astype('float') # ensuring SALES_INDEX is float (it was already converted above)
sales_api.info() # assessing the types of data in the dataset
sales_api.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1364 entries, 0 to 1363
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PROVINCE     1364 non-null   object 
 1   CATEGORY     1364 non-null   object 
 2   UNIT         1364 non-null   object 
 3   SALES_INDEX  1308 non-null   float64
 4   YEAR         1364 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 53.4+ KB


Unnamed: 0,PROVINCE,CATEGORY,UNIT,SALES_INDEX,YEAR
282,Province Fyn,One-family houses,Index,63.9,2000
137,Province Byen København,One-family houses,Index,22.8,1995
941,Province Vest- og Sydsjælland,One-family houses,Index,85.0,2010
831,Province Sydjylland,One-family houses,Index,91.9,2013
798,Province Sydjylland,One-family houses,Index,88.1,2005
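
In the cell above, the quarter digit is parsed with `%m`, i.e. as if it were a month. That is harmless when only the year is needed. If the quarter itself were needed, a quarterly period index is one alternative. The sketch assumes labels of the form '1992Q1', as in the English output of the table.

```python
import pandas as pd

# Toy quarter labels in the same form as the TIME column
quarters = pd.Series(["1992Q1", "2006Q3", "2022Q4"])

# Interpreting the labels as quarterly periods instead of dates
periods = pd.PeriodIndex(quarters, freq="Q")

print(periods.year.tolist())     # [1992, 2006, 2022]
print(periods.quarter.tolist())  # [1, 3, 4]
```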


In [219]:
sales_avg = sales_api.groupby(['PROVINCE', 'YEAR'])['SALES_INDEX'].apply('mean') # calculating the average price index for each province in each year
sales_avg = sales_avg.reset_index() # resetting the index, so it fits the new dataset
sales_avg.sample(10)

Unnamed: 0,PROVINCE,YEAR,SALES_INDEX
307,Province Østjylland,2020,117.1
51,Province Byen København,2012,79.4
296,Province Østjylland,2009,92.475
20,Province Bornholm,2012,84.625
86,Province Fyn,2016,100.9
154,Province Nordjylland,2022,127.65
329,Province Østsjælland,2011,75.55
203,Province Sydjylland,2009,104.15
1,Province Bornholm,1993,
153,Province Nordjylland,2021,126.425


Now, we wish to merge the unemployment data onto the dataset of housing prices. We merge on province and year, such that for each province in each year we also have the average gross unemployment rate. We have data on housing prices from 1992 onwards, while we only have data on gross unemployment from 2007. As observations on housing prices before 2007 cannot be linked to the unemployment rate in the same period, they are irrelevant here, and we therefore merge using an *inner join*. This method only keeps the rows that match in both datasets.


In [221]:
sales_with_unemp = pd.merge(sales_avg, unemp_avg, on = ['PROVINCE', 'YEAR'], how = 'inner') # Performing an inner merge of the two datasets (keeping only data for which there are observations for both variables in the dataset)
sales_with_unemp.sample(10) # displaying a sample of 10 random values in the dataset

Unnamed: 0,PROVINCE,YEAR,SALES_INDEX,UNEMPLOYMENT_RATE
147,Province Østjylland,2010,95.15,5.558333
37,Province Fyn,2012,91.625,7.375
36,Province Fyn,2011,96.4,7.15
85,Province Nordsjælland,2012,71.65,4.625
105,Province Sydjylland,2016,101.825,3.633333
40,Province Fyn,2015,98.625,5.3
16,Province Byen København,2007,96.275,5.433333
126,Province Vest- og Sydsjælland,2021,105.275,3.5
60,Province Københavns omegn,2019,116.0,3.741667
74,Province Nordjylland,2017,112.475,4.791667
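
To see what the inner join discards, one option is to run an outer merge with `indicator=True` on the side and count the rows by origin. The sketch below uses made-up numbers for a hypothetical province 'X'; the frames `left` and `right` stand in for `sales_avg` and `unemp_avg`.

```python
import pandas as pd

# Made-up yearly data: prices start one year before unemployment does
left = pd.DataFrame({
    "PROVINCE": ["X", "X", "X"],
    "YEAR": [2006, 2007, 2008],
    "SALES_INDEX": [100.0, 104.0, 99.0],
})
right = pd.DataFrame({
    "PROVINCE": ["X", "X"],
    "YEAR": [2007, 2008],
    "UNEMPLOYMENT_RATE": [3.5, 2.9],
})

# Inner join: only (PROVINCE, YEAR) pairs present in both frames survive
inner = pd.merge(left, right, on=["PROVINCE", "YEAR"], how="inner")

# Outer join with an indicator column showing where each row came from
check = pd.merge(left, right, on=["PROVINCE", "YEAR"], how="outer", indicator=True)

print(len(inner))
print(check["_merge"].value_counts())
```

Rows marked `left_only` are the price observations without a matching unemployment rate, which is what the inner join drops.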


Now that the dataset is ready, we can examine the correlation between gross unemployment and prices on one-family houses within each province. We do this by constructing a binned scatterplot for each province. The binned scatterplot groups the unemployment rates into 10 equal-sized bins for each province and plots the mean of the associated housing prices within each bin. 
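
The idea behind a binned scatterplot can be sketched with plain pandas. The example assumes that "equal-sized" means bins with an equal number of observations (quantile bins) and uses simulated data, not the merged dataset. It illustrates the binning step only and is not a reimplementation of *binsreg*.

```python
import numpy as np
import pandas as pd

# Simulated data with a negative relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(2, 9, size=200)
y = 120 - 4 * x + rng.normal(0, 5, size=200)
toy = pd.DataFrame({"UNEMPLOYMENT_RATE": x, "SALES_INDEX": y})

# qcut splits x into 10 quantile bins, labelled 0..9
toy["BIN"] = pd.qcut(toy["UNEMPLOYMENT_RATE"], q=10, labels=False)

# One point per bin: the mean of x and the mean of y within the bin
binned = toy.groupby("BIN")[["UNEMPLOYMENT_RATE", "SALES_INDEX"]].mean()
print(binned)
```

Plotting the ten rows of `binned` instead of all 200 raw points gives the binned scatter, which removes much of the noise around the underlying relationship.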

We construct the binned scatter plot using the package *binsreg*, which can be installed by running "pip install binsreg" in the command prompt.

In [None]:
# Import binsreg
import binsreg

def binscatter(df,province):
    I = df[df['PROVINCE'] == province] # keeping only the rows of the chosen province
    binsreg.binsreg('SALES_INDEX', 'UNEMPLOYMENT_RATE', data=I, # y-variable first, then x-variable
                    nbins=10, #specify 10 bins 
                    polyreg=1, #create linear fitted line      
    )

    
widgets.interact(binscatter, # creating interactive widget letting us choose the desired province
    df = widgets.fixed(sales_with_unemp),
    province = widgets.Dropdown(description='Province', 
                                    options=sales_with_unemp.PROVINCE.unique(), 
                                    value='Province Byen København'),
)



From the interactive figure above, we observe a negative correlation between the gross unemployment rate and housing prices in all of the provinces. This result is consistent with the theoretical suggestion that the gross unemployment rate in a province is a determinant of the housing prices in the province. However, it is important to emphasize that these results are only correlations, so we cannot give them any causal interpretation. In principle, the negative correlation could be due to reverse causality or omitted variables that correlate with both gross unemployment and housing prices (e.g. interest rates).
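
The visual impression could be complemented with a correlation coefficient per province. The sketch below uses made-up numbers for two hypothetical provinces; with the real data, `toy` would be replaced by `sales_with_unemp`.

```python
import pandas as pd

# Made-up data for two hypothetical provinces
toy = pd.DataFrame({
    "PROVINCE": ["X"] * 4 + ["Y"] * 4,
    "UNEMPLOYMENT_RATE": [3.0, 4.0, 5.0, 6.0, 2.0, 3.0, 4.0, 5.0],
    "SALES_INDEX": [110.0, 105.0, 101.0, 95.0, 120.0, 118.0, 111.0, 109.0],
})

# Pearson correlation between the two variables, computed separately per province
corr_by_province = pd.Series({
    name: grp["UNEMPLOYMENT_RATE"].corr(grp["SALES_INDEX"])
    for name, grp in toy.groupby("PROVINCE")
})
print(corr_by_province)
```

As with the figure, a negative coefficient only describes co-movement and says nothing about the direction of causality.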