# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

To do:

Should we remove the category "All Denmark", or use it for something? It disappears in the inner join, but make this explicit (noted at merged_000).

Descriptive statistics: variation over the years?

Clean up the code.

Remove the decimals from the years in the slider.

Imports and set magics:

In [3]:
#Necessary to create the maps below
%pip install geopandas
%pip install mapclassify


import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from dstapi import DstApi

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


Note: you may need to restart the kernel to use updated packages.


SyntaxError: invalid syntax (dataproject.py, line 2)

# Read and clean data

We have three data sets overall: 
1. Data for the **average family income**. 
2. Data for the number of **members of the church**. 
3. Data for the **location of municipalities**. 

In [4]:
inc = DstApi('INDKF132') #Family income table, matching the 'table' entry in params below
church = DstApi('KM6')

**1. Income data (Table: Inc):**

In [5]:
#We specify what to select from the table: the average family income, taken over all incomes
#(the total) rather than for specific income intervals.
params = {'table': 'indkf132',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'ENHED', 'values': ['117']}, #Average income for families in the group (DKK)
  {'code': 'FAMTYP', 'values': ['*']},
  {'code': 'INDKINTB', 'values': ['99']}, #Total
  {'code': 'Tid', 'values': ['*']}]}

#We apply the dictionary created above to get our dataset.
inc_table = inc.get_data(params=params)

#We sort the data.
inc_table.sort_values(by=['OMRÅDE', 'TID', 'FAMTYP'], inplace=True)

#Removing non-used columns to simplify the data set
inc_table_000=inc_table.loc[:, ['OMRÅDE', 'TID', 'FAMTYP','INDHOLD']]

#We only save average income for families (for each municipality and year)
I = inc_table_000.FAMTYP.str.contains('Families, total')
inc_table_010 = inc_table_000.loc[I,['OMRÅDE', 'TID', 'INDHOLD']]

#We rename variables. The first two are renamed for the inner join later in the code. The last is
#renamed to give it a more meaningful name. 
inc_table_020=inc_table_010.rename(columns={'OMRÅDE':'MUNICIPALITY', 'TID':'YEAR','INDHOLD': 'AVG_FAM_INC'})

**2. Church data (Table: Church):**

In [8]:
#The lines below display the different variables. They were commented out after the initial exploration:
#tabsum = church.tablesummary(language='en')
#for variable in tabsum['variable name']:
    #print(variable+':')
    #display(church.variable_levels(variable, language='en'))

#We define a dictionary that loads all variables in the church dataset.
params = church._define_base_params(language='en')

#We load the dataset from the DST API and load all variables using the dictionary created above
church_table = church.get_data(params=params)

#We sort the data
church_table.sort_values(by=['KOMK', 'TID', 'FKMED'], inplace=True)

#Rename variables
church_table_000=church_table.rename(columns={'KOMK':'MUNICIPALITY','TID':'YEAR','INDHOLD':'NUMBER_OF_INDIVIDUALS'})

#We calculate the number of members and non-members for each municipality and year
church_grouped=church_table_000.groupby(['MUNICIPALITY', 'YEAR', 'FKMED'])['NUMBER_OF_INDIVIDUALS'].sum()

#We drop duplicates, so we only have one row per municipality, year and membership status
church_table_010 = church_table_000.drop_duplicates(subset=['MUNICIPALITY', 'YEAR', 'FKMED'])

#We only keep the categorical variables, as we will merge the numbers back on in the next step
church_table_020=church_table_010.loc[:,['MUNICIPALITY', 'YEAR', 'FKMED']]

#We now merge the grouped values on the dataset.
church_table_030 = church_table_020.set_index(['MUNICIPALITY', 'YEAR', 'FKMED']).join(church_grouped, how='left').reset_index()

#We pivot from long to wide format to get one column for each outcome of FKMED. This is useful for the figures later on. 
church_table_040 = church_table_030.pivot(index=['MUNICIPALITY', 'YEAR'], columns='FKMED', values='NUMBER_OF_INDIVIDUALS').reset_index()
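
The group, de-duplicate, join, and pivot steps above can probably be expressed in a single call. The following is a minimal sketch, assuming a toy table with the same column names as `church_table_000`; the numbers and the extra `SEX` column are invented for illustration and are not taken from the DST data.

```python
import pandas as pd

# Toy stand-in for church_table_000; all values are invented for illustration.
toy = pd.DataFrame({
    'MUNICIPALITY': ['Aarhus', 'Aarhus', 'Aarhus', 'Aarhus'],
    'YEAR': [2021, 2021, 2021, 2021],
    'FKMED': ['Member of National Church', 'Member of National Church',
              'Not member of National Church', 'Not member of National Church'],
    'SEX': ['Men', 'Women', 'Men', 'Women'],  # hypothetical extra dimension
    'NUMBER_OF_INDIVIDUALS': [100, 120, 40, 30],
})

# pivot_table sums over the remaining dimensions and spreads FKMED into columns.
wide = toy.pivot_table(index=['MUNICIPALITY', 'YEAR'],
                       columns='FKMED',
                       values='NUMBER_OF_INDIVIDUALS',
                       aggfunc='sum').reset_index()
wide.columns.name = None  # drop the leftover 'FKMED' label on the column axis
print(wide)
```

If this matches the output of the step-by-step version on the real data, it could replace `church_table_010` through `church_table_040`.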


**Merge of income and church data**

In [None]:
#We do an inner join, which keeps only the municipality-year pairs present in both tables, so no values are missing. 
#Rows without a match, such as the aggregate "All Denmark", are therefore dropped. 
merged_000 = pd.merge(church_table_040,inc_table_020,how='inner',on=['MUNICIPALITY','YEAR'])

#Changing names - We have to do this so it fits with the geojson file. 
merged_000.loc[merged_000.MUNICIPALITY == 'Copenhagen', 'MUNICIPALITY'] = 'København'
merged_000.loc[merged_000.MUNICIPALITY == 'Høje-Taastrup', 'MUNICIPALITY'] = 'Høje Taastrup'

#Making a new variable: the share of the population (in %) that is not a member of the church. We also express the income in 1000 DKK.
merged_000['% non-members']=100*(1-merged_000['Member of National Church']
                            /(merged_000['Member of National Church']+merged_000['Not member of National Church']))
merged_000['avg_fam_inc_1000']=merged_000['AVG_FAM_INC']/1000
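
To make explicit which rows the inner join drops, one option is an outer merge with `indicator=True`. This is a minimal sketch on invented toy tables; which of the two real tables actually contains "All Denmark" is an assumption here and should be checked against the data.

```python
import pandas as pd

# Toy stand-ins for church_table_040 and inc_table_020; values are invented.
church_toy = pd.DataFrame({'MUNICIPALITY': ['All Denmark', 'Aarhus'],
                           'YEAR': [2021, 2021],
                           'members': [1000, 200]})
income_toy = pd.DataFrame({'MUNICIPALITY': ['Aarhus'],
                           'YEAR': [2021],
                           'AVG_FAM_INC': [500000]})

# An outer merge keeps every row; the _merge column records where each row came from.
check = pd.merge(church_toy, income_toy, how='outer',
                 on=['MUNICIPALITY', 'YEAR'], indicator=True)

# Rows not found in both tables are exactly those an inner join would drop.
dropped = check.loc[check['_merge'] != 'both', 'MUNICIPALITY'].unique().tolist()
print(dropped)
```

Running the same check on the real tables before the inner join would document the dropped municipalities instead of leaving them implicit.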

**3. Geographic data**

In [None]:
#Getting the data with positional data for the map. This is so the code can connect the names with
#the locations of the municipalities. 
gdf = gpd.read_file('kommuner.geojson')
#Renaming so we have the same variable name. Otherwise, we cannot make a merge. 
gdf_000=gdf.rename(columns={'KOMNAVN':'MUNICIPALITY'})


**Merging data from DST with geographical data**

In [None]:
#Merging the locational data with the datasets from Statistics Denmark
merged_010 = pd.merge(merged_000,gdf_000,how='left',on=['MUNICIPALITY'])
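
Because the merge is on municipality names, any spelling difference between the two sources leaves a row without geometry. A minimal sketch of a check for this, using invented toy name lists rather than the real `merged_000` and `gdf_000`:

```python
import pandas as pd

# Toy name columns; the real ones are merged_000.MUNICIPALITY and gdf_000.MUNICIPALITY.
data_names = pd.Series(['København', 'Aarhus', 'Høje-Taastrup'])
geo_names = pd.Series(['København', 'Aarhus', 'Høje Taastrup'])

# Names in the data that have no counterpart in the geographic file.
unmatched = sorted(data_names[~data_names.isin(geo_names)].unique())
print(unmatched)
```

An empty list on the real data would suggest that the renaming before the merge covers every case.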

# Analysis

In [None]:
#We choose the year 2021 for the maps. 
merged_015=merged_010.loc[merged_010['YEAR']==2021]

#Make it into a geodataframe. This is needed to show the map. 
merged_020 = gpd.GeoDataFrame(merged_015, geometry='geometry')

#Plotting. cmap sets the color map. set_axis_off removes the latitude and longitude axes, as they are not used. 
merged_020.plot(column='% non-members', 
                legend=True, 
                cmap='inferno',
                legend_kwds={'label': "% of population not member of the church", 'orientation': "horizontal"} 
                ).set_axis_off()
#Give it a title. 
plt.title("Non-Members of the Danish Church (2021)")

In [None]:
#Making map with income. 
merged_020.plot(column='avg_fam_inc_1000', 
                legend=True, 
                cmap='inferno',
                legend_kwds={'label': "Average family income (1000 DKK)", 'orientation': "horizontal"} 
                ).set_axis_off()
plt.title("Average family income (2021)")

In [None]:
#Making an interactive map, where we can see the values for all municipalities. 

merged_020.explore(column='% non-members', 
                   legend=True, 
                   legend_kwds={'label': "% of population not members of the church", 'orientation': "horizontal"},
                   cmap='inferno',
                   width='70%',
                   height='70%',
                   highlight=True)


In [None]:
merged_020.explore(column='avg_fam_inc_1000', 
                   legend=True, 
                   legend_kwds={'label': "Average Family Income (1000 DKK)", 'orientation': "horizontal"},
                   cmap='inferno',
                   width='70%',
                   height='70%',
                   highlight=True)

In [None]:
#Add a scatter plot with a slider. We use the dataset covering all periods so that the slider can step through the years.
#This is the dataset merged_010, as merged_015 only keeps 2021.
def figure(time):
    
    scatter_000 = merged_010[merged_010['YEAR'] == time]
    fig=plt.figure(dpi=100)
    ax=fig.add_subplot(1,1,1)
    ax.scatter(scatter_000['% non-members'], scatter_000.avg_fam_inc_1000, color=(148/255.,141/255.,134/255.))

    ax.set_title('Non-members of the church and average family income: 2011-2021', color=(7/255., 9/255., 74/255.))
    ax.set_xlabel('Non-members of the church in % of population', color=(7/255., 9/255., 74/255.))
    ax.set_ylabel('Average family income (1000 DKK)', color=(7/255., 9/255., 74/255.))
    ax.grid(True)

    ax.set_xlim(0,70)
    ax.set_ylim(200,900)


    plt.show()

In [None]:
widgets.interact(figure, 
                 time=widgets.IntSlider(min=int(merged_010['YEAR'].min()), 
                                        max=int(merged_010['YEAR'].max()), step=1, 
                                        value=int(merged_010['YEAR'].max()))
);
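
The to-do list asks for descriptive statistics on the variation over the years. One possible starting point is a per-year summary, sketched here on an invented toy table that only borrows the column names `YEAR` and `% non-members` from `merged_010`:

```python
import pandas as pd

# Toy stand-in for merged_010; the values are invented for illustration.
toy = pd.DataFrame({'YEAR': [2020, 2020, 2021, 2021],
                    '% non-members': [10.0, 20.0, 20.0, 40.0]})

# Mean, minimum and maximum share of non-members across municipalities, per year.
summary = toy.groupby('YEAR')['% non-members'].agg(['mean', 'min', 'max'])
print(summary)
```

The same call on `merged_010`, with `avg_fam_inc_1000` added as a second column, would give a compact table to accompany the scatter plot.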

# Conclusion

ADD CONCISE CONCLUSION.