# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code from [lecture 5](https://numeconcopenhagen.netlify.com/lectures/Workflow_and_debugging).
> 1. Remember this [guide](https://www.markdownguide.org/basic-syntax/) on markdown and (a bit of) LaTeX.
> 1. Turn on automatic numbering by clicking on the small icon on top of the table of contents in the left sidebar.
> 1. The `dataproject.py` file includes a function which will be used multiple times in this notebook.

Imports and set magics:

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2 

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# local modules
import dataproject

# API client for Statistics Denmark (DST), with English labels
import pydst
dst = pydst.Dst(lang='en')



# Read and clean data

## Employment data

**Read the education data** from the Statistics Denmark table ``HFUDD10`` through `pydst` and **clean it** by dropping and renaming columns:

In [42]:
# a. list the available variables and load the data
educ_vars = dst.get_variables(table_id='HFUDD10')
educ_vars

educ = dst.get_data(table_id='HFUDD10',
                    variables={'Tid':['*'], 'HERKOMST':['*'], 'ALDER':['*'],
                               'HFUDD':['TOT','H30','H40','H50','H60','H70'], 'KØN':['*']})
educ.head(10)

# education codes (HFUDD): H10, H20, H30, H40, H50, H60, H70

# b. drop columns (BOPOMR is not requested above, so ignore it if absent)
drop_these = ['BOPOMR']
educ.drop(columns=drop_these, errors='ignore', inplace=True)
educ.head(10)

# c. rename columns
educ.rename(columns = {'HERKOMST':'Ancestry', 'ALDER':'Age', 'HFUDD':'HCEDUC', 'KØN':'Gender', 'TID':'Year', 'INDHOLD':'Units'}, inplace=True)
educ.head(10)




|   | Year | Ancestry | Age | HCEDUC | Gender | Units |
|---|------|----------|-----|--------|--------|-------|
| 0 | 2016 | Total | 40-44 years | H30 Vocational Education and Training (VET) | Total | 141758 |
| 1 | 2016 | Immigrants | Age, total | H30 Vocational Education and Training (VET) | Women | 43424 |
| 2 | 2016 | Persons of Danish origin | 65-69 years | H60 Bachelors programmes | Total | 1528 |
| 3 | 2016 | Descendant | Age, total | H60 Bachelors programmes | Women | 1386 |
| 4 | 2016 | Descendant | Age, total | H60 Bachelors programmes | Total | 2457 |
| 5 | 2016 | Immigrants | Age, total | H60 Bachelors programmes | Women | 6026 |
| 6 | 2016 | Immigrants | 30-34 years | H30 Vocational Education and Training (VET) | Total | 10105 |
| 7 | 2016 | Persons of Danish origin | 35-39 years | Total | Total | 286370 |
| 8 | 2016 | Descendant | 40-44 years | Total | Women | 1473 |
| 9 | 2016 | Persons of Danish origin | Age, total | H30 Vocational Education and Training (VET) | Men | 630689 |


The dataset now looks like this:

In [None]:
empl.head()

**Remove all rows which are not municipalities**:

In [None]:
empl = dataproject.only_keep_municipalities(empl)
empl.head()
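
The helper `dataproject.only_keep_municipalities()` lives in `dataproject.py` and is not shown in this notebook. As a rough sketch of what such a filter might look like, the example below assumes that aggregate rows can be recognized by substrings such as `Region`, `Province` and `All Denmark` in the `municipality` column. The function name `only_keep_municipalities_sketch`, the marker list and the toy numbers are all hypothetical.

```python
import pandas as pd

def only_keep_municipalities_sketch(df, aggregates=('Region', 'Province', 'All Denmark')):
    """Hypothetical stand-in: drop rows whose 'municipality' label contains an aggregate marker."""
    mask = pd.Series(False, index=df.index)
    for label in aggregates:
        # plain substring match; missing labels count as "no match"
        mask |= df['municipality'].str.contains(label, regex=False, na=False)
    return df.loc[~mask].copy()

# toy data with made-up values
toy = pd.DataFrame({
    'municipality': ['All Denmark', 'Region Hovedstaden', 'Copenhagen', 'Aarhus'],
    '2008': [75.0, 76.0, 73.0, 74.0],
})
toy_munis = only_keep_municipalities_sketch(toy)
print(toy_munis['municipality'].tolist())
```

Returning a copy rather than a view avoids chained-assignment warnings when the result is later renamed in place.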

**Convert the dataset to long format**:

In [None]:
# a. rename year columns
mydict = {str(i):f'employment{i}' for i in range(2008,2018)}
empl.rename(columns = mydict, inplace=True)

# b. convert to long
empl_long = pd.wide_to_long(empl, stubnames='employment', i='municipality', j='year').reset_index()

# c. show
empl_long.head()
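
To see what the wide-to-long conversion does, here is a minimal sketch on made-up data with two municipalities and two years. The column names mirror the `employment{year}` pattern used above; the values are invented.

```python
import pandas as pd

# wide format: one row per municipality, one column per year
wide = pd.DataFrame({
    'municipality': ['A', 'B'],
    'employment2008': [70.0, 74.0],
    'employment2009': [72.0, 76.0],
})

# long format: one row per (municipality, year) pair;
# the numeric suffix after the stub 'employment' becomes the 'year' column
long_df = pd.wide_to_long(wide, stubnames='employment', i='municipality', j='year').reset_index()
print(long_df.sort_values(['municipality', 'year']))
```

The year suffix is parsed as a number, so `year` can be used directly on the x-axis of a plot or as a merge key.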

## Income data

**Read the income data** in ``INDKP101.xlsx`` and **clean it**:

In [None]:
# a. load
inc = pd.read_excel('INDKP101.xlsx', skiprows=2)

# b. drop and rename columns
inc.drop([f'Unnamed: {i}' for i in range(3)], axis=1, inplace=True)
inc.rename(columns = {'Unnamed: 3':'municipality'}, inplace=True)

# c. drop rows with missing
inc.dropna(inplace=True)

# d. remove non-municipalities
inc = dataproject.only_keep_municipalities(inc)

# e. convert to long
inc.rename(columns = {str(i):f'income{i}' for i in range(1986,2018)}, inplace=True)
inc_long = pd.wide_to_long(inc, stubnames='income', i='municipality', j='year').reset_index()

# f. show
inc_long.head(5)

> **Note:** The function ``dataproject.only_keep_municipalities()`` is used on both the employment and the income datasets.

## Explore data set

To **explore the raw data**, we provide an **interactive plot** that shows either the employment level or the income level in each municipality.

The **static plot** is:

In [None]:
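def plot_empl_inc(empl, inc, dataset, municipality):
    """Plot employment or income over time for one municipality."""

    # a. choose the data set and the column to plot
    if dataset == 'Employment':
        df = empl
        y = 'employment'
    else:
        df = inc
        y = 'income'

    # b. select the municipality and plot
    I = df['municipality'] == municipality
    ax = df.loc[I,:].plot(x='year', y=y, style='-o')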

The **interactive plot** is:

In [None]:
widgets.interact(plot_empl_inc,
    empl = widgets.fixed(empl_long),
    inc = widgets.fixed(inc_long),
    dataset = widgets.Dropdown(description='Dataset',
                               options=['Employment','Income']),
    municipality = widgets.Dropdown(description='Municipality',
                                    options=empl_long.municipality.unique())
);

ADD SOMETHING HERE IF THE READER SHOULD KNOW THAT E.G. SOME MUNICIPALITY IS SPECIAL.

# Merge data sets

We now create a data set with the **municipalities which are in both of our data sets**. We can illustrate this **merge** as:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('inc', 'empl'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

In [None]:
merged = pd.merge(empl_long, inc_long, how='inner',on=['municipality','year'])

print(f'Number of municipalities = {len(merged.municipality.unique())}')
print(f'Number of years          = {len(merged.year.unique())}')
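
One way to check which municipality-year pairs an inner join discards is to run an outer merge with `indicator=True` first. The sketch below uses made-up toy frames (`left`, `right`), not the actual data sets.

```python
import pandas as pd

# toy data: 'A' only has employment, 'D' only has income
left = pd.DataFrame({'municipality': ['A', 'B', 'C'], 'year': [2008] * 3,
                     'employment': [70.0, 74.0, 71.0]})
right = pd.DataFrame({'municipality': ['B', 'C', 'D'], 'year': [2008] * 3,
                      'income': [220.0, 215.0, 230.0]})

# the outer merge keeps every row and records where it came from in '_merge'
check = pd.merge(left, right, how='outer', on=['municipality', 'year'], indicator=True)
counts = check['_merge'].value_counts()
print(counts)

# rows marked 'both' are exactly those an inner merge would keep
inner = check.loc[check['_merge'] == 'both'].drop(columns='_merge')
print(inner)
```

Rows marked `left_only` or `right_only` correspond to the "dropped" areas of the Venn diagram.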

# Analysis

To get a quick overview of the data, we show some **summary statistics by year**:

In [None]:
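# select the numeric columns explicitly; the text column 'municipality' cannot be averaged
merged.groupby('year')[['employment','income']].agg(['mean','std']).round(2)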
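
The grouped summary has two-level column labels (variable, statistic). A small sketch on made-up numbers shows the structure and how a single statistic can be pulled out; `toy_merged` is hypothetical and only mimics the column layout of `merged`.

```python
import pandas as pd

# toy stand-in for the merged data, with invented values
toy_merged = pd.DataFrame({
    'municipality': ['A', 'B', 'A', 'B'],
    'year': [2008, 2008, 2009, 2009],
    'employment': [70.0, 74.0, 72.0, 76.0],
    'income': [200.0, 220.0, 210.0, 250.0],
})

# mean and standard deviation of each numeric column, by year
summary = toy_merged.groupby('year')[['employment', 'income']].agg(['mean', 'std']).round(2)
print(summary)

# a (variable, statistic) tuple selects one column of the summary
income_mean = summary[('income', 'mean')]
print(income_mean)
```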

ADD FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.