# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [None]:
# importing the used packages 
import pandas as pd 
import numpy as np 
import datetime 

# importing package to create plots and setting basic settings
import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"-"})
plt.rcParams.update({'font.size': 10})
import ipywidgets as widgets

# importing the API from DST used to gather data
from dstapi import DstApi 

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject



# Read and clean data

1. Data is imported using the API from Statistics Denmark (Danmarks Statistik, DST).

In [None]:
data = DstApi('EJ55') #EJ55 is data on the pricing of houses
unemp = DstApi('AULK01') #AULK01 is data on the number of unemployed 

## Explore each data set

1. The available values for each variable are displayed in order to select the relevant variables.

In [None]:
# An overview of the available data.
tabsum = data.tablesummary(language='en')
display(tabsum)

# The available values for each variable:
for variable in tabsum['variable name']:
    print(variable+':')
    display(data.variable_levels(variable, language='en'))

We are only interested in a subset of the available data.

1. A param dictionary is defined in order to detail the data we want
    - Initially it includes all data

In [None]:
params = data._define_base_params(language='en')
params

1. We select the data we want on housing prices: the eleven provinces (codes `01` to `11`), a single property category (`0111`) and the indexed values (`TAL` code `100`), for all available periods.

In [None]:
params = {'table': 'ej55',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']},
  {'code': 'EJENDOMSKATE', 'values': ['0111']},
  {'code': 'TAL', 'values': ['100']}, # with the key 'code' we choose the desired variable and with the key 'values' we choose what subset of the dataset for the given variable we want to include
  {'code': 'Tid', 'values': ['*']}]} # ['*'] includes all the available data
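
The request dictionary above follows a regular pattern, so it could also be assembled programmatically. The sketch below is only an illustration: `build_params` is a hypothetical helper that is not part of `dstapi`, and it assumes nothing beyond the dictionary layout shown above.

```python
def build_params(table, selections, lang='en', fmt='BULK'):
    """Assemble a request dictionary with the same layout as `params` above.

    `selections` maps a variable code to the list of values to keep;
    ['*'] keeps every available value.
    Hypothetical helper for illustration, not part of dstapi.
    """
    variables = [{'code': code, 'values': list(values)}
                 for code, values in selections.items()]
    return {'table': table, 'format': fmt, 'lang': lang, 'variables': variables}


# Province codes '01', '02', ..., '11' generated instead of typed by hand
province_codes = [f'{i:02d}' for i in range(1, 12)]

example_params = build_params('ej55', {
    'OMRÅDE': province_codes,
    'EJENDOMSKATE': ['0111'],
    'TAL': ['100'],
    'Tid': ['*'],
})
```

Building the dictionary this way keeps the province list in one place, which may help when the same provinces are requested from a second table.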


1. Data is sorted and the index is reset. 
2. Columns are renamed.

In [None]:
sales_api = data.get_data(params=params) # retrieving the desired part of the dataset from DST
sales_api.sort_values(by=['OMRÅDE', 'TID', 'EJENDOMSKATE'], inplace=True) # sorting the values
sales_api.reset_index(inplace = True, drop = True) # resetting the index, so it fits the new dataset
sales_api.rename(columns = {'OMRÅDE':'PROVINCE', 'EJENDOMSKATE':'CATEGORY', 'TAL':'UNIT', 'TID':'TIME', 'INDHOLD':'SALES_INDEX'}, inplace=True) # renaming columns

1. Missing values are replaced with NaN.

In [None]:
sales_api = sales_api.replace('..', np.nan) # replacing all the missing data (denoted with '..' by DST) with NaN-values

1. The value types of the columns are inspected.

In [None]:
sales_api.info() 

1. The value variable is changed from an object to a float type variable. 

In [None]:
sales_api.SALES_INDEX = sales_api.SALES_INDEX.astype('float')
sales_api.info()
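
The two cleaning steps above (marking `'..'` as missing, then converting to float) can be tried on a small toy frame. This is a minimal sketch with made-up values, not DST data; the `TIME` labels are an assumption about the format. It also shows `pd.to_numeric` with `errors='coerce'` as a possible one-step alternative.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw API output: numbers arrive as strings and
# missing observations are marked with '..' (all values here are made up).
toy = pd.DataFrame({'TIME': ['2006Q1', '2006Q2', '2006Q3'],
                    'SALES_INDEX': ['98.5', '..', '101.2']})

# Option 1: the two-step approach used above.
two_step = toy['SALES_INDEX'].replace('..', np.nan).astype('float')

# Option 2: pd.to_numeric with errors='coerce' turns every string that
# cannot be parsed as a number into NaN in a single call.
one_step = pd.to_numeric(toy['SALES_INDEX'], errors='coerce')
```

Note that `errors='coerce'` silently converts any unparseable string to NaN, not only `'..'`, so the explicit replacement is the safer choice when other markers should raise an error.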

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

**Interactive plot** :

1. We make an interactive plot of housing prices over time.

In [None]:
def plot_value(df, province): 
    I = (df['PROVINCE'] == province) # boolean mask selecting the rows for the province chosen in the drop-down
    ax=df.loc[I,:].plot(x='TIME', y='SALES_INDEX', legend=False) # choosing the x- and y-variables for the plot

widgets.interact(plot_value, 
    df = widgets.fixed(sales_api),
    province = widgets.Dropdown(description='Province', #creating drop-down widget that allows us to choose the desired province
                                    options=sales_api.PROVINCE.unique(), 
                                    value='Province Byen København'), # initial province will be 'Byen København'
)


In the plot above, we see that all price indexes have increased overall since 1992. Note that the indexes are not comparable across provinces, because each province's index is normalized to 100 in 2006. So even if prices in Byen København were, say, 30 percent higher than prices in Fyn in 2006, both indexes would equal 100 in that year.

However, the plot does let us compare the relative increase in prices *within* a given province over time. The largest increases since 2006 have been in the provinces 'Byen København', 'Københavns Omegn' and 'Bornholm'. Of these three, Bornholm has experienced the largest relative decrease in prices in recent quarters.
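
The point about normalization can be checked with a small worked example. The numbers below are made up for illustration and are not DST data: region `A` is 30 percent more expensive than region `B` in the base year, yet both indexes equal 100 there.

```python
import pandas as pd

# Made-up price levels (not DST data): region A is 30 percent more
# expensive than region B in the base year 2006.
prices = pd.DataFrame({'A': [1000.0, 1300.0, 1950.0],
                       'B': [800.0, 1000.0, 1200.0]},
                      index=[2000, 2006, 2020])

# Normalize each region to 100 in the base year.
price_index = 100 * prices / prices.loc[2006]

# Both indexes equal 100 in 2006, so the 30 percent level gap is invisible.
# Growth since the base year is still meaningful within each region.
growth_since_2006 = price_index.loc[2020] / price_index.loc[2006] - 1
```

Here `growth_since_2006` is 50 percent for `A` and 20 percent for `B`, which is the kind of within-region comparison the index supports.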



# Including a dataset on unemployment sorted by regions

1. The table was opened with the DST API in the *Read and clean data* section above. Below we clean and structure the dataset following the same basic method as for the dataset on housing prices.

In [None]:
# An overview of the available data.
tabsum_unemp = unemp.tablesummary(language='en')
display(tabsum_unemp)

# The available values for each variable:
for variable in tabsum_unemp['variable name']:
    print(variable+':')
    display(unemp.variable_levels(variable, language='en'))

In [None]:
params_unemp = unemp._define_base_params(language='en')
params_unemp

In [None]:
params_unemp = {'table': 'aulk01',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11']}, # choosing the same provinces as before
  {'code': 'YDELSESTYPE', 'values': ['TOT']}, # choosing the total number of unemployed for each variable
  {'code': 'AKASSE', 'values': ['TOT']},
  {'code': 'ALDER', 'values': ['TOT']},
  {'code': 'KØN', 'values': ['TOT']},
  {'code': 'Tid', 'values': ['*']}]} # choosing the entire timespan of the dataset


In [None]:
unemp_api = unemp.get_data(params=params_unemp) # retrieving the dataset from DST with the specifications from 'params_unemp'
unemp_api.sort_values(by=['OMRÅDE', 'TID'], inplace=True) #sorting values
unemp_api = unemp_api.drop(columns = ['YDELSESTYPE', 'AKASSE', 'ALDER', 'KØN']) #cleaning dataset by dropping columns
unemp_api.reset_index(inplace = True, drop = True) # resetting index
unemp_api.rename(columns = {'OMRÅDE':'PROVINCE','TID':'TIME', 'INDHOLD':'GROSS UNEMPLOYMENT'}, inplace=True) # renaming columns
unemp_api.head(5)

In [None]:
sales_with_unemp = pd.merge(sales_api, unemp_api, on = ['PROVINCE', 'TIME'], how = 'inner') # performing an inner merge of the two datasets (keeping only province-time pairs observed in both)
sales_with_unemp.sample(10)
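
An inner merge silently drops every province-time pair that appears in only one of the two datasets, for example if the two tables record `TIME` at different frequencies or in different formats. A possible way to see how much is lost is an outer merge with `indicator=True`. The sketch below uses made-up toy frames, not the DST data.

```python
import pandas as pd

# Toy frames (made-up values) with partly overlapping keys.
left = pd.DataFrame({'PROVINCE': ['A', 'A', 'B'],
                     'TIME': ['2020', '2021', '2020'],
                     'SALES_INDEX': [110.0, 120.0, 105.0]})
right = pd.DataFrame({'PROVINCE': ['A', 'B', 'B'],
                      'TIME': ['2020', '2020', '2021'],
                      'GROSS UNEMPLOYMENT': [500, 300, 280]})

# An outer merge with indicator=True adds a '_merge' column stating whether
# each key was found in 'both', 'left_only' or 'right_only'.
check = pd.merge(left, right, on=['PROVINCE', 'TIME'], how='outer', indicator=True)
counts = check['_merge'].value_counts()

# The inner merge keeps exactly the rows flagged 'both'.
inner = pd.merge(left, right, on=['PROVINCE', 'TIME'], how='inner')
```

If `counts['both']` turns out to be small or zero on the real data, the key columns would need to be harmonized before merging.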

**Figure**

In [None]:
print(sales_with_unemp.columns)


In [None]:
def plot_value(df, province): 
    I = (df['PROVINCE'] == province)
    ax=df.loc[I,:].plot(x='GROSS UNEMPLOYMENT', y='SALES_INDEX', legend=False)

widgets.interact(plot_value, 
    df = widgets.fixed(sales_with_unemp),
    province = widgets.Dropdown(description='Province', 
                                    options=sales_with_unemp.PROVINCE.unique(), 
                                    value='Province Byen København'),
)

In [None]:

def plot_value(df, province): 
    I = (df['PROVINCE'] == province)
    ax = df.loc[I,:].plot(kind='scatter', x='GROSS UNEMPLOYMENT', y='SALES_INDEX', legend=False)
    ax.set_xlabel('GROSS UNEMPLOYMENT')
    ax.set_ylabel('SALES_INDEX')
    ax.set_title('Correlation between Sales Index and Gross Unemployment for {}'.format(province))

widgets.interact(plot_value, 
    df = widgets.fixed(sales_with_unemp),
    province = widgets.Dropdown(description='Province', 
                                    options=sales_with_unemp.PROVINCE.unique(), 
                                    value='Province Byen København'),
)
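
The scatter plot shows the relationship visually; a correlation coefficient per province would summarize it in one number. This is a minimal sketch on made-up toy data that reuses the column names of `sales_with_unemp`; it is not a result from the DST data.

```python
import pandas as pd

# Toy data (made-up values) with the same column names as sales_with_unemp.
toy = pd.DataFrame({
    'PROVINCE': ['A', 'A', 'A', 'B', 'B', 'B'],
    'GROSS UNEMPLOYMENT': [100, 200, 300, 100, 200, 300],
    'SALES_INDEX': [130.0, 120.0, 110.0, 90.0, 100.0, 110.0],
})

# Pearson correlation between the two series, computed province by province.
corr_by_province = pd.Series({
    name: group['SALES_INDEX'].corr(group['GROSS UNEMPLOYMENT'])
    for name, group in toy.groupby('PROVINCE')
})
```

In the toy data the relationship is perfectly linear, so the correlation is -1 for `A` and +1 for `B`. A correlation describes co-movement only and says nothing about causality.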


In [None]:
# Import binsreg
import binsreg

def binscatter(df,province):
    I = df[df['PROVINCE'] == province] # keep only the rows for the chosen province
    binsreg.binsreg('SALES_INDEX', 'GROSS UNEMPLOYMENT', data=I, 
                    nbins=10, # specify 10 bins 
                    polyreg=1) # add a linear fitted line
    # Specify x- and y-axis labels
    plt.xlabel('GROSS UNEMPLOYMENT')
    plt.ylabel('SALES_INDEX')


widgets.interact(binscatter, 
    df = widgets.fixed(sales_with_unemp),
    province = widgets.Dropdown(description='Province', 
                                    options=sales_with_unemp.PROVINCE.unique(), 
                                    value='Province Byen København'),
)
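
To make clear what a binned scatter plot summarizes, the idea can be reproduced by hand: sort the x-variable into equally sized bins and average both variables within each bin. The sketch below is a simplified illustration on simulated data; it is not the algorithm used by `binsreg`, and the variable names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

# Simulated data: y depends linearly on x plus noise (true slope is 2).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.uniform(0, 10, 200)})
toy['y'] = 2 * toy['x'] + rng.normal(0, 1, 200)

# Split x into 10 quantile bins and average x and y within each bin.
toy['bin'] = pd.qcut(toy['x'], q=10, labels=False)
binned = toy.groupby('bin')[['x', 'y']].mean()

# Linear fit through the raw (unbinned) observations.
slope, intercept = np.polyfit(toy['x'], toy['y'], deg=1)
```

Plotting `binned['y']` against `binned['x']` together with the fitted line would give a hand-rolled counterpart to the figure above.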


MAKE A FIGURE WITH UNEMPLOYMENT AND HOUSING PRICES FOR EACH PROVINCE

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.