# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [1]:
import pandas as pd
import numpy as np
import datetime

import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"-"})
plt.rcParams.update({'font.size': 10})
import ipywidgets as widgets

from dstapi import DstApi 

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject



# Read and clean data

1. Data is imported using the API of Statistics Denmark (Danmarks Statistik).

In [2]:
data = DstApi('EJ55') 
unemp = DstApi('AULK01') 

## Explore each data set

1. The available values for each variable are displayed in order to select the relevant variables.

In [3]:
# An overview of the available data.
tabsum = data.tablesummary(language='en')
display(tabsum)

# The available values for each variable:
for variable in tabsum['variable name']:
    print(variable+':')
    display(data.variable_levels(variable, language='en'))

Table EJ55: Price index for sales of property by region, category of real property, unit and time
Last update: 2023-03-31T08:00:00


Unnamed: 0,variable name,# values,First value,First value label,Last value,Last value label,Time variable
0,OMRÅDE,17,000,All Denmark,11,Province Nordjylland,False
1,EJENDOMSKATE,3,0111,One-family houses,2103,"Owner-occupied flats, total",False
2,TAL,3,100,Index,310,Percentage change compared to same quarter the...,False
3,Tid,124,1992K1,1992Q1,2022K4,2022Q4,True


OMRÅDE:


Unnamed: 0,id,text
0,000,All Denmark
1,084,Region Hovedstaden
2,01,Province Byen København
3,02,Province Københavns omegn
4,03,Province Nordsjælland
5,04,Province Bornholm
6,085,Region Sjælland
7,05,Province Østsjælland
8,06,Province Vest- og Sydsjælland
9,083,Region Syddanmark


EJENDOMSKATE:


Unnamed: 0,id,text
0,0111,One-family houses
1,801,Weekend cottages
2,2103,"Owner-occupied flats, total"


TAL:


Unnamed: 0,id,text
0,100,Index
1,210,Percentage change compared to previous quarter
2,310,Percentage change compared to same quarter the...


Tid:


Unnamed: 0,id,text
0,1992K1,1992Q1
1,1992K2,1992Q2
2,1992K3,1992Q3
3,1992K4,1992Q4
4,1993K1,1993Q1
...,...,...
119,2021K4,2021Q4
120,2022K1,2022Q1
121,2022K2,2022Q2
122,2022K3,2022Q3


We are only interested in some of the values listed above.

1. A parameter dictionary is defined in order to specify the data we want.

In [4]:
params = data._define_base_params(language='en')
params

{'table': 'ej55',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'EJENDOMSKATE', 'values': ['*']},
  {'code': 'TAL', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

1. We select the data we want: the price index (`TAL` = 100) for one-family houses (`EJENDOMSKATE` = 0111) in each of the eleven provinces (`OMRÅDE` = 01 to 11), for all available quarters.

In [5]:
params = {'table': 'ej55',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']},
  {'code': 'EJENDOMSKATE', 'values': ['0111']},
  {'code': 'TAL', 'values': ['100']},
  {'code': 'Tid', 'values': ['*']}]}
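
Writing the dictionary by hand becomes error-prone once several tables are queried. The following is a minimal sketch of a helper that builds the same structure. `build_params` and `params_sketch` are hypothetical names, not part of `dstapi`, and the sketch assumes the dictionary layout returned by `_define_base_params` above.

```python
def build_params(table, selections, language='en'):
    """Build a parameter dictionary in the layout shown above (hypothetical helper).

    selections maps each variable code to a list of value ids;
    ['*'] requests every value of that variable.
    """
    return {
        'table': table,
        'format': 'BULK',
        'lang': language,
        'variables': [
            {'code': code, 'values': list(values)}
            for code, values in selections.items()
        ],
    }


# Province codes '01' to '11', zero-padded to two digits
province_codes = [f'{i:02d}' for i in range(1, 12)]

params_sketch = build_params('ej55', {
    'OMRÅDE': province_codes,
    'EJENDOMSKATE': ['0111'],  # one-family houses
    'TAL': ['100'],            # index
    'Tid': ['*'],              # all periods
})
```

Because the province codes are generated rather than typed, the same list can be reused for any other table that uses the same province codes.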

1. Data is sorted and the index is reset.
2. Columns are renamed.

In [6]:
sales_api = data.get_data(params=params)
sales_api.sort_values(by=['OMRÅDE', 'TID', 'EJENDOMSKATE'], inplace=True)
sales_api.reset_index(inplace = True, drop = True)
sales_api.rename(columns = {'OMRÅDE':'PROVINCE', 'EJENDOMSKATE':'CATEGORY', 'TAL':'UNIT', 'TID':'TIME', 'INDHOLD':'SALES_INDEX'}, inplace=True)

1. Missing values, which appear as `..` in the downloaded data, are replaced with NaN.

In [7]:
sales_api = sales_api.replace('..', np.nan)

1. The column types are inspected.

In [8]:
sales_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1364 entries, 0 to 1363
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PROVINCE     1364 non-null   object
 1   CATEGORY     1364 non-null   object
 2   UNIT         1364 non-null   object
 3   TIME         1364 non-null   object
 4   SALES_INDEX  1308 non-null   object
dtypes: object(5)
memory usage: 53.4+ KB


1. The value column is converted to a float type.

In [9]:
sales_api['SALES_INDEX'] = sales_api['SALES_INDEX'].astype('float')
sales_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1364 entries, 0 to 1363
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PROVINCE     1364 non-null   object 
 1   CATEGORY     1364 non-null   object 
 2   UNIT         1364 non-null   object 
 3   TIME         1364 non-null   object 
 4   SALES_INDEX  1308 non-null   float64
dtypes: float64(1), object(4)
memory usage: 53.4+ KB
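
The two cleaning steps above (replacing `..` and casting to float) can also be done in one call. The sketch below uses `pd.to_numeric` on a small made-up frame; the numbers are invented for illustration and are not taken from the EJ55 table.

```python
import pandas as pd

# Made-up values for illustration; '..' marks a missing observation
toy = pd.DataFrame({
    'TIME': ['1992Q1', '1992Q2', '1992Q3'],
    'SALES_INDEX': ['45.1', '..', '46.3'],
})

# errors='coerce' turns every entry that cannot be parsed into NaN
toy['SALES_INDEX'] = pd.to_numeric(toy['SALES_INDEX'], errors='coerce')

# Count how many observations ended up missing
n_missing = int(toy['SALES_INDEX'].isna().sum())
```

One caveat: `errors='coerce'` also hides any other unparseable entry, so comparing `n_missing` with the expected number of `..` markers is a useful sanity check.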


To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments.

**Interactive plot**:

1. We make an interactive plot

In [56]:
def plot_value(df, province):
    # Keep only the rows for the selected province
    I = (df['PROVINCE'] == province)
    df.loc[I,:].plot(x='TIME', y='SALES_INDEX', legend=False)

widgets.interact(plot_value, 
    df = widgets.fixed(sales_api),
    province = widgets.Dropdown(description='Province', 
                                    options=sales_api.PROVINCE.unique(), 
                                    value='Province Byen København'),
)


interactive(children=(Dropdown(description='Province', index=1, options=('Province Bornholm', 'Province Byen K…

<function __main__.plot_value(df, province)>
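
For a static counterpart to the dropdown, the long table can be reshaped so that every province becomes its own column. The sketch below assumes the same column names as `sales_api` but uses made-up numbers. It also parses the `2007Q1`-style strings into quarterly periods so that the time axis is a proper date axis.

```python
import pandas as pd

# Made-up numbers in the same long format as sales_api
toy = pd.DataFrame({
    'PROVINCE': ['Province Fyn', 'Province Fyn',
                 'Province Bornholm', 'Province Bornholm'],
    'TIME': ['2007Q1', '2007Q2', '2007Q1', '2007Q2'],
    'SALES_INDEX': [100.0, 101.5, 98.0, 97.2],
})

# Parse '2007Q1'-style strings into quarterly periods
toy['TIME'] = pd.PeriodIndex(toy['TIME'], freq='Q')

# One row per quarter, one column per province
wide = toy.pivot(index='TIME', columns='PROVINCE', values='SALES_INDEX')

# wide.plot() would then draw one line per province in a single figure
```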

Explain what you see when moving elements of the interactive plot around. 

# Merge with unemployment

In [11]:
# An overview of the available data.
tabsum_unemp = unemp.tablesummary(language='en')
display(tabsum_unemp)

# The available values for each variable:
for variable in tabsum_unemp['variable name']:
    print(variable+':')
    display(unemp.variable_levels(variable, language='en'))

Table AULK01: Full-time unemployed persons by region, type of benefits, unemployment insurance fund, age, sex and time
Last update: 2023-03-17T08:00:00



OMRÅDE:


Unnamed: 0,id,text
0,000,All Denmark
1,084,Region Hovedstaden
2,01,Province Byen København
3,101,Copenhagen
4,147,Frederiksberg
...,...,...
112,787,Thisted
113,820,Vesthimmerlands
114,851,Aalborg
115,998,Unknown municipality


YDELSESTYPE:


Unnamed: 0,id,text
0,TOT,Gross unemployment
1,LDP,Net unemployed recipients of unemployment bene...
2,LKT,Net unemployed recipients of social assistance
3,ADP,Activation of persons on unemployment benefits
4,AKT,Activation of persons on social assistance (pr...


AKASSE:


Unnamed: 0,id,text
0,TOT,Total
1,48,Akademikernes (fra 1. juli 2013 inkl. ingeniører)
2,46,Din Faglige A-kasse (fra 1. januar 2021 inkl. ...
3,05,Børne- og Ungdomspædagoger (BUPL-A)
4,06,Din Sundhedsfaglige A-kasse (DSA)
5,40,Det Faglige Hus A-kasse
6,44,Fag og Arbejde (FOA)
7,43,Faglig Fælles a-kasse (3F)
8,11,A-kassen Frie (fra 1. januar 2020 inkl. DANA)
9,13,Funktionærer og Tjenestemænd (FTF-A)


ALDER:


Unnamed: 0,id,text
0,TOT,"Age, total"
1,16-24,16-24 years
2,25-29,25-29 years
3,30-34,30-34 years
4,35-39,35-39 years
5,40-44,40-44 years
6,45-49,45-49 years
7,50-54,50-54 years
8,55-59,55-59 years
9,6099,60 year and over


KØN:


Unnamed: 0,id,text
0,TOT,Total
1,M,Men
2,K,Women


Tid:


Unnamed: 0,id,text
0,2007K1,2007Q1
1,2007K2,2007Q2
2,2007K3,2007Q3
3,2007K4,2007Q4
4,2008K1,2008Q1
...,...,...
59,2021K4,2021Q4
60,2022K1,2022Q1
61,2022K2,2022Q2
62,2022K3,2022Q3


In [12]:
params_unemp = unemp._define_base_params(language='en')
params_unemp

{'table': 'aulk01',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'YDELSESTYPE', 'values': ['*']},
  {'code': 'AKASSE', 'values': ['*']},
  {'code': 'ALDER', 'values': ['*']},
  {'code': 'KØN', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

In [13]:
params_unemp = {'table': 'aulk01',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']},
  {'code': 'YDELSESTYPE', 'values': ['TOT']},
  {'code': 'AKASSE', 'values': ['TOT']},
  {'code': 'ALDER', 'values': ['TOT']},
  {'code': 'KØN', 'values': ['TOT']},
  {'code': 'Tid', 'values': ['*']}]}


In [14]:
unemp_api = unemp.get_data(params=params_unemp)
unemp_api.sort_values(by=['OMRÅDE', 'TID'], inplace=True)
# Only totals were requested, so these columns are constant and can be dropped
unemp_api = unemp_api.drop(columns = ['YDELSESTYPE', 'AKASSE', 'ALDER', 'KØN'])
unemp_api.reset_index(inplace = True, drop = True)
# Use the same key names as in sales_api so the two tables can be merged
unemp_api.rename(columns = {'OMRÅDE':'PROVINCE','TID':'TIME', 'INDHOLD':'GROSS UNEMPLOYMENT'}, inplace=True)
unemp_api.head(5)

Unnamed: 0,PROVINCE,TIME,GROSS UNEMPLOYMENT
0,Province Bornholm,2007Q1,1930
1,Province Bornholm,2007Q2,1364
2,Province Bornholm,2007Q3,1060
3,Province Bornholm,2007Q4,1459
4,Province Bornholm,2008Q1,1572


In [15]:
sales_with_unemp = pd.merge(sales_api, unemp_api, on = ['PROVINCE', 'TIME'], how = 'inner')
sales_with_unemp.sample(10)

Unnamed: 0,PROVINCE,CATEGORY,UNIT,TIME,SALES_INDEX,GROSS UNEMPLOYMENT
25,Province Bornholm,One-family houses,Index,2013Q2,87.7,1064
514,Province Vestjylland,One-family houses,Index,2007Q3,113.6,4334
665,Province Østsjælland,One-family houses,Index,2013Q2,75.1,5374
697,Province Østsjælland,One-family houses,Index,2021Q2,122.8,4239
152,Province Fyn,One-family houses,Index,2013Q1,90.9,16931
619,Province Østjylland,One-family houses,Index,2017Q4,106.3,16596
271,Province Nordjylland,One-family houses,Index,2010Q4,100.6,17867
13,Province Bornholm,One-family houses,Index,2010Q2,104.8,1347
598,Province Østjylland,One-family houses,Index,2012Q3,92.0,20468
585,Province Østjylland,One-family houses,Index,2009Q2,93.1,18100
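
The inner join keeps only province-quarter pairs that exist in both tables. Since the price index starts in 1992Q1 and the unemployment series in 2007Q1, the earlier quarters are dropped silently. A minimal sketch of how to make this visible uses an outer join with `indicator=True`; the frames below are made up for illustration.

```python
import pandas as pd

# Made-up frames with the same key columns as sales_api and unemp_api
sales_toy = pd.DataFrame({
    'PROVINCE': ['Province Fyn'] * 3,
    'TIME': ['2006Q4', '2007Q1', '2007Q2'],
    'SALES_INDEX': [99.0, 100.0, 101.5],
})
unemp_toy = pd.DataFrame({
    'PROVINCE': ['Province Fyn'] * 2,
    'TIME': ['2007Q1', '2007Q2'],
    'GROSS UNEMPLOYMENT': [17000, 16500],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
# validate='one_to_one' raises an error if a key is duplicated in either table
check = pd.merge(sales_toy, unemp_toy, on=['PROVINCE', 'TIME'],
                 how='outer', indicator=True, validate='one_to_one')

n_both = int((check['_merge'] == 'both').sum())
n_sales_only = int((check['_merge'] == 'left_only').sum())
n_unemp_only = int((check['_merge'] == 'right_only').sum())

# Rows present in both tables, i.e. what the inner join would return
matched = check.loc[check['_merge'] == 'both'].drop(columns='_merge')
```

Here `n_sales_only` counts the price observations without an unemployment match, which in the real data would correspond to the quarters before 2007.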


**Figure**

In [16]:
print(sales_with_unemp.columns)


Index(['PROVINCE', 'CATEGORY', 'UNIT', 'TIME', 'SALES_INDEX',
       'GROSS UNEMPLOYMENT'],
      dtype='object')



In [84]:
# Import binsreg
import binsreg

def binscatter(df, province):
    # Keep only the rows for the selected province
    subset = df[df['PROVINCE'] == province]
    binsreg.binsreg('SALES_INDEX', 'GROSS UNEMPLOYMENT', data=subset,
                    nbins=10,   # specify 10 bins
                    polyreg=1,  # add a linear fitted line
    )

widgets.interact(binscatter,
    df = widgets.fixed(sales_with_unemp),
    province = widgets.Dropdown(description='Province',
                                    options=sales_with_unemp.PROVINCE.unique(),
                                    value='Province Byen København'),
)



interactive(children=(Dropdown(description='Province', index=1, options=('Province Bornholm', 'Province Byen K…

<function __main__.binscatter(df, province)>
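
To make clear what a binned scatter plot shows, the basic idea can be sketched with pandas alone: split the x variable into quantile bins, average x and y within each bin, and fit a straight line. This is only an illustration on simulated data. It is not how `binsreg` is implemented, and `binsreg` additionally offers data-driven bin selection and inference.

```python
import numpy as np
import pandas as pd

# Simulated data with a known negative relationship (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(1000, 20000, size=200)
y = 120 - 0.001 * x + rng.normal(0, 5, size=200)
toy = pd.DataFrame({'GROSS UNEMPLOYMENT': x, 'SALES_INDEX': y})

# Split x into 10 quantile bins holding equally many observations
toy['bin'] = pd.qcut(toy['GROSS UNEMPLOYMENT'], q=10, labels=False)

# One point per bin: the mean of x and the mean of y within the bin
binned = toy.groupby('bin')[['GROSS UNEMPLOYMENT', 'SALES_INDEX']].mean()

# Linear fit on the raw observations, comparable to a degree-1 polynomial fit
slope, intercept = np.polyfit(toy['GROSS UNEMPLOYMENT'], toy['SALES_INDEX'], deg=1)
```

Plotting `binned` as points and the fitted line on top gives the familiar binscatter picture with ten dots.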

MAKE A FIGURE WITH UNEMPLOYMENT AND HOUSE PRICES FOR EACH PROVINCE

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.