# Notebook Setup

In [109]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Dela - Predicting the number of deaths per year

### 1. Intro
<b>Who is our client?</b><br>
Our client is Dela, a funeral insurer and provider of funeral services. This semester, Dela will share internal problems for us to investigate.<br><br>
<b>Project explanation</b><br>
Dela faced an unprecedented challenge from fluctuating demand during the first year of Covid-19, and is therefore looking to improve its ability to react to sudden surges or drops in demand.
We cannot predict when Dela needs to scale up or down. However, we can forecast the number of deaths in the upcoming years. Based on that forecast and its own experience, Dela can decide for itself when to scale up or down.<br><br>
<b>Project goal</b><br>
In this project, we forecast the number of deaths per year, so that Dela can decide more easily how to respond to higher or lower demand.<br><br><br>
<b>Document explanation</b><br>
This document tests and implements the hypothesis from the project proposal. We first explore and experiment with the collected data. Once we understand it, we apply machine learning to the dataset to see whether the hypothesis can be validated.<br><br>
<b>Document setup:</b><br>
<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left;">Data requirements</th>
        <td style="text-align: left;">In this chapter, we set up the requirements for the data needed for the prediction. We answer questions such as ‘Which sources are trustworthy?’ and ‘Do we need specific features?’</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data collection</th>
        <td style="text-align: left;">In this chapter, we explain where we found our data and where we store it, with references to the subchapter for each dataset.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data understanding</th>
        <td style="text-align: left;">In this chapter, we examine each downloaded dataset to understand what it contains and what value it brings to Dela.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data preparation</th>
        <td style="text-align: left;">In this chapter, we clean the data so it is ready to work with: for example, removing invalid records, handling wrong values, and reconciling features that are the same but carry different names.</td>
    </tr>
</table>

### 2. Provisioning

### 2.1 Data Requirements
In this chapter, we set up the expectations and requirements for the data we are going to collect in the provisioning phase.

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Domain</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data type</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Target Variable</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Expected Features</th>
        <td style="text-align: left !important"></td>
    </tr>
</table>

### 2.2 Data Collection
Because we need data on the number of deaths in the Netherlands, we looked for an open governmental data bank. A governmental source is trustworthy, which improves the chance of a good prediction. That is how we arrived at CBS (`Centraal Bureau voor de Statistiek`, literally `Central Bureau of Statistics`; its official English name is Statistics Netherlands).

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Source</th>
        <td style="text-align: left !important">We got our data from the official <a href="https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS" target="_blank">CBS</a> website</td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data Storage</th>
        <td style="text-align: left !important">We stored all of our datasets on <a href="https://github.com/i454038/AI-car-price-prediction" target="_blank">GitHub</a>, so they are accessible from anywhere</td>
    </tr>
</table>

Load the datasets from GitHub:

In [110]:
# Custom helper that keeps the data-loading code out of the notebook.
from classes.dataImporting import datasetManager

datasets = datasetManager.defineDatasets()
dataframes = datasetManager.loadDatasets(datasets)
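
The `datasetManager` helper itself is not shown in this notebook. As a rough sketch of the kind of structure it appears to return (a dictionary of groups, each holding named dataframes), a loader could look like the following. The names `load_datasets` and `sources` are hypothetical, and the in-memory CSV stands in for a real file path or URL.

```python
import io

import pandas as pd


def load_datasets(sources):
    """Read every CSV source in a {group: {name: source}} mapping.

    Hypothetical stand-in for datasetManager.loadDatasets; the real
    implementation is not part of this notebook.
    """
    return {
        group: {name: pd.read_csv(src) for name, src in files.items()}
        for group, files in sources.items()
    }


# In-memory CSV used so the sketch runs without network access.
# In practice each source would be a path or a raw GitHub URL.
sources = {
    "birth": {
        "birthPerYear1899_2018": io.StringIO("Jaar,Levend geborenen\n1899,163.0\n"),
    }
}

frames = load_datasets(sources)
print(frames["birth"]["birthPerYear1899_2018"].head(1))
```

Keeping the two-level dictionary means every dataset stays addressable as `frames[group][name]`, which matches how `dataframes` is indexed in the cells that follow.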

### 2.3 Data Understanding

understand this data

### 2.4 Data Preparation

In [111]:
dataframes['lifeExpectency']['lifeExpectencyPerRegion'].head(5)

Unnamed: 0,id,Gemeente,Groep_rij,Geslacht,Bij geboorte,Bij geboorte (afwijking tov NL),Bij 65 jaar,Bij 65 jaar (afwijking tov NL)
0,518,'s-Gravenhage,Levensverwachting,Totaal,808,"onder, 99% zeker",191,"onder, 99% zeker"
1,796,'s-Hertogenbosch,Levensverwachting,Totaal,813,"onder, 99% zeker",197,"onder, 99% zeker"
2,1680,Aa en Hunze,Levensverwachting,Totaal,821,geen,204,geen
3,358,Aalsmeer,Levensverwachting,Totaal,829,"boven, 99% zeker",201,geen
4,197,Aalten,Levensverwachting,Totaal,821,geen,203,geen


In [112]:
# Align this dataset with the others: shared column names, integer years,
# and Gemeente values mapped through datasetManager.mapFeature.
dataframes['populationChange']['pop2002_2020'] = (
    dataframes['populationChange']['pop2002_2020']
        .rename(columns={'RegioS': 'Gemeente', 'Perioden': 'Year'})
        # keep only the leading four-digit year of the period code
        .assign(Year = lambda x: pd.to_datetime(x.Year.str[:4], format='%Y').dt.year)
        .assign(Gemeente = lambda x: x.Gemeente.replace(datasetManager.mapFeature('Gemeente')))
#         .groupby(['Year', 'Gemeente', 'LevendGeborenKinderen_2', 'Overledenen_3']).mean()
#         .iloc[:, 4:4]
#         .reset_index()
)
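
The transformation above can be tried on a small frame. This is a minimal sketch that assumes the period codes look like `2002JJ00` (the format visible in the death datasets) and that region values are codes mapped to names. The `gemeente_names` dictionary and the `GM1680` code are illustrative stand-ins for whatever `datasetManager.mapFeature('Gemeente')` returns.

```python
import pandas as pd

# Illustrative raw rows; the region code and the mapping are assumptions.
raw = pd.DataFrame({
    "RegioS": ["GM1680", "GM1680"],
    "Perioden": ["2002JJ00", "2003JJ00"],
})
gemeente_names = {"GM1680": "Aa en Hunze"}

tidy = (
    raw.rename(columns={"RegioS": "Gemeente", "Perioden": "Year"})
       # the first four characters of the period code hold the year
       .assign(Year=lambda x: x["Year"].str[:4].astype(int))
       # replace region codes with readable names
       .assign(Gemeente=lambda x: x["Gemeente"].replace(gemeente_names))
)
print(tidy)
```

Casting the sliced string with `astype(int)` gives the same integer year as the datetime round trip and fails loudly if a code does not start with four digits.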

In [129]:
dataframes['populationChange']['pop2002_2020']

Unnamed: 0,ID,Gemeente,Year,BevolkingAanHetBeginVanDePeriode_1,LevendGeborenKinderen_2,Overledenen_3,TotaleVestiging_4,VestigingVanuitEenAndereGemeente_5,Immigratie_6,TotaalVertrekInclAdmCorrecties_7,VertrekNaarAndereGemeente_8,EmigratieInclusiefAdmCorrecties_9,OverigeCorrecties_10,Bevolkingsgroei_11,BevolkingsgroeiRelatief_12,BevolkingsgroeiSinds1Januari_13,BevolkingsgroeiSinds1JanuariRela_14,BevolkingAanHetEindeVanDePeriode_15
0,14604,Aa en Hunze,2002,25552.0,289.0,251.0,1353.0,1121.0,232.0,1617.0,1498.0,119.0,-21.0,-247.0,-0.97,-247.0,-0.97,25305.0
1,14617,Aa en Hunze,2003,25305.0,279.0,241.0,1127.0,1071.0,56.0,1264.0,1111.0,153.0,12.0,-87.0,-0.34,-87.0,-0.34,25218.0
2,14630,Aa en Hunze,2004,25218.0,233.0,221.0,1167.0,1104.0,63.0,1077.0,1023.0,54.0,9.0,111.0,0.44,111.0,0.44,25329.0
3,14643,Aa en Hunze,2005,25329.0,230.0,231.0,1322.0,1254.0,68.0,1143.0,1062.0,81.0,0.0,178.0,0.70,178.0,0.70,25507.0
4,14656,Aa en Hunze,2006,25507.0,216.0,212.0,1369.0,1320.0,49.0,1326.0,1222.0,104.0,9.0,56.0,0.22,56.0,0.22,25563.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10483,155842,Zwolle,2016,124896.0,1541.0,913.0,6793.0,6017.0,776.0,6763.0,5996.0,767.0,-6.0,652.0,0.52,652.0,0.52,125548.0
10484,155855,Zwolle,2017,125548.0,1443.0,977.0,6703.0,5954.0,749.0,6593.0,5937.0,656.0,-8.0,568.0,0.45,568.0,0.45,126116.0
10485,155868,Zwolle,2018,126116.0,1458.0,988.0,7210.0,6343.0,867.0,6297.0,5567.0,730.0,-2.0,1381.0,1.10,1381.0,1.10,127497.0
10486,155881,Zwolle,2019,127497.0,1460.0,953.0,6915.0,5940.0,975.0,6071.0,5356.0,715.0,-8.0,1343.0,1.05,1343.0,1.05,128840.0
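
One possible sanity check for the preparation step: in the rows shown above, the growth column equals births minus deaths, plus arrivals minus departures, plus other corrections. The sketch below assumes that identity holds for the whole table, which has only been verified here for the sample rows, and flags rows where it does not.

```python
import pandas as pd

# First two rows of the output above, reduced to the columns used in the check.
pop = pd.DataFrame({
    "Gemeente": ["Aa en Hunze", "Aa en Hunze"],
    "Year": [2002, 2003],
    "LevendGeborenKinderen_2": [289.0, 279.0],
    "Overledenen_3": [251.0, 241.0],
    "TotaleVestiging_4": [1353.0, 1127.0],
    "TotaalVertrekInclAdmCorrecties_7": [1617.0, 1264.0],
    "OverigeCorrecties_10": [-21.0, 12.0],
    "Bevolkingsgroei_11": [-247.0, -87.0],
})

# Assumed bookkeeping identity: natural change + net migration + corrections.
expected_growth = (
    pop["LevendGeborenKinderen_2"]
    - pop["Overledenen_3"]
    + pop["TotaleVestiging_4"]
    - pop["TotaalVertrekInclAdmCorrecties_7"]
    + pop["OverigeCorrecties_10"]
)

# Rows where the reported growth deviates from the recomputed value.
mismatch = pop[(expected_growth - pop["Bevolkingsgroei_11"]).abs() > 0.5]
print(f"{len(mismatch)} inconsistent rows")
```

Rows that end up in `mismatch` would be candidates for closer inspection before modelling.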


In [114]:
# dataset = (
#     dataframes['populationChange']['pop2002_2020']
#         .merge(dataframes['lifeExpectency']['lifeExpectencyPerRegion'], how='left', on='Gemeente')
#         .groupby(['Year', 'Gemeente', 'Bij geboorte']).mean()
#         .reset_index()
# )

# dataset.head(10)
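
The commented-out merge joins yearly population rows with a life-expectancy table that has one row per gemeente. As a hedged sketch of how that join could be checked, the example below uses a few values from the outputs above and adds pandas' `validate` and `indicator` options to surface duplicate keys and unmatched municipalities. The variable names are illustrative.

```python
import pandas as pd

# A few rows taken from the outputs above (columns reduced for brevity).
pop = pd.DataFrame({
    "Gemeente": ["Aa en Hunze", "Aa en Hunze", "Zwolle"],
    "Year": [2002, 2003, 2016],
    "Overledenen_3": [251.0, 241.0, 913.0],
})
life = pd.DataFrame({
    "Gemeente": ["Aa en Hunze", "Aalsmeer"],
    "Bij geboorte": [821, 829],
})

merged = pop.merge(
    life,
    how="left",
    on="Gemeente",
    validate="many_to_one",  # raises if a gemeente occurs twice in `life`
    indicator=True,          # adds a `_merge` column describing the match
)

# Municipalities without a life-expectancy row in this small sample.
unmatched = merged.loc[merged["_merge"] == "left_only", "Gemeente"].unique().tolist()
print(unmatched)
```

Because the life-expectancy table has no year column, each value is repeated for every year of the same gemeente after the merge.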

In [115]:
# Below each gemeente (municipality), we add a summary of the figures for the whole of the Netherlands.
# This dataset will help with that.
# dataframes['populationChange']['popOverview'] = (
#     dataframes['populationChange']['popOverview']
        
# )

# 'Bevolking aan het begin van de periode' -> population at the beginning of the period.
dataframes['populationChange']['popOverview'].head(5)
# [year, deaths]
# [2002, 142355]
# [2003, 141936]

Unnamed: 0,Regio's,Onderwerp,Unnamed: 2,2002,2003,2004,2005,2006,2007,2008,...,2017,2018,2019,2020 januari,2020 februari,2020 maart,2020 oktober,2020 november,2020 december,2020
0,Nederland,Bevolking aan het begin van de periode,aantal,16105285,16192572,16258032,16305526,16334210,16357992,16405399,...,17081507,17181084,17282163,17407585,17413971,17423863,17465296,17472440,17476482,17407585
1,Nederland,Levend geboren kinderen,aantal,202083,200297,194007,187910,185057,181336,184634,...,169836,168525,169680,14085,12905,13556,14508,13555,13277,168681
2,Nederland,Overledenen,aantal,142355,141936,136553,136402,135372,133022,135136,...,150214,153363,151885,14111,12895,16237,14613,14925,16802,168678
3,Nederland,Vestiging in de gemeente |Totale vestiging,aantal,750197,720704,711944,734386,753452,763383,792769,...,1031566,1002022,1026586,89538,83523,77114,92808,81445,82067,1014246
4,Nederland,Vestiging in de gemeente |Vestiging vanuit een...,aantal,628947,616190,617925,642089,652302,646564,649253,...,796609,758285,757522,67318,61828,61504,71947,63662,65751,793393
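
The comments in the cell above aim for a `[year, deaths]` layout, while `popOverview` stores one column per year. A minimal sketch of that reshaping with `melt`, using a few values from the output above; the monthly columns such as `2020 januari` are excluded by keeping only four-digit column names.

```python
import pandas as pd

# Reduced copy of the wide overview table shown above.
wide = pd.DataFrame({
    "Regio's": ["Nederland", "Nederland"],
    "Onderwerp": ["Levend geboren kinderen", "Overledenen"],
    "2002": [202083, 142355],
    "2003": [200297, 141936],
    "2004": [194007, 136553],
    "2020 januari": [14085, 14111],
})

# Keep only whole-year columns; monthly columns are left out.
year_cols = [c for c in wide.columns if c.isdigit() and len(c) == 4]

deaths = (
    wide.loc[wide["Onderwerp"] == "Overledenen"]
        .melt(id_vars=["Onderwerp"], value_vars=year_cols,
              var_name="year", value_name="deaths")
        .assign(year=lambda x: x["year"].astype(int))
        .loc[:, ["year", "deaths"]]
)
print(deaths)
```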


In [116]:
# Already covered by another dataset.
dataframes['populationChange']['popComparison2015_2020'].head(1)

Unnamed: 0,id,Gemeente,Indicator,Waarde
0,3,Appingedam,Aantal inwoners 2015,12.011
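
A value such as `12.011` for a number of inhabitants looks like a Dutch-formatted thousands separator that was read as a decimal point. That reading is an assumption about the source file. If it holds, one way to recover the integer count is to read the column as text and strip the dots, as sketched here on the row shown above.

```python
import io

import pandas as pd

# The row shown above, as it might appear in the raw CSV.
csv_text = (
    "id,Gemeente,Indicator,Waarde\n"
    "3,Appingedam,Aantal inwoners 2015,12.011\n"
)

# Read the value column as text so the dot is not interpreted as a decimal point.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Waarde": str})

# Assumption: every dot in this column is a thousands separator.
df["Waarde"] = pd.to_numeric(df["Waarde"].str.replace(".", "", regex=False))
print(df)
```

This only works for columns that contain no genuine decimals, so percentage indicators would need separate handling.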


In [130]:
dataframes['populationChange']['growthPrediction2020_2050'].head(1)

Unnamed: 0,id,Gemeente,Indicator,2030,2040,2050
0,3,Appingedam,Absoluut aantal inwoners,10.873,10.046,9.253
1,3,Appingedam,Bevolkingsgroei (% t.o.v. 2020),-66.0,-137.0,-205.0
2,10,Delfzijl,Absoluut aantal inwoners,22.393,19.182,16.213


In [118]:
dataframes['populationChange']['absoluteNr'].head(1)

Unnamed: 0,id,Gemeente,Indicator,Waarde
0,3,Appingedam,Aantal inwoners (absoluut),11.642


In [119]:
dataframes['death']['reasons1997_2014'].head(1)

Unnamed: 0,ID,Geslacht,Leeftijd,DoodsoorzakenUitgebreideLijst,Perioden,Overledenen_1
0,1,T001038,10000,T001075,1997JJ00,135783.0


In [120]:
dataframes['death']['reasons2005_2012'].head(1)

Unnamed: 0,ID,Geslacht,Leeftijd,DoodsoorzakenUitgebreideLijst,Perioden,Overledenen_1
0,9,T001038,10000,T001075,2005JJ00,136402.0


In [121]:
dataframes['death']['reasons2013_2020'].head(1)

Unnamed: 0,ID,Geslacht,Leeftijd,DoodsoorzakenUitgebreideLijst,Perioden,Overledenen_1
0,17,T001038,10000,T001075,2013JJ00,141245.0
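
The three cause-of-death tables share the same columns and their periods overlap (1997 to 2014, 2005 to 2012, 2013 to 2020). One way to combine them, sketched under the assumption that overlapping rows are duplicates and that the later table should win, is to concatenate and drop duplicates on the identifying columns. The frames below hold illustrative rows based on the outputs above, and the `ID` column is omitted.

```python
import pandas as pd

keys = ["Geslacht", "Leeftijd", "DoodsoorzakenUitgebreideLijst", "Perioden"]
cols = keys + ["Overledenen_1"]

# Illustrative rows; the duplicated 2005 row mimics the overlap between tables.
reasons_1997 = pd.DataFrame(
    [["T001038", 10000, "T001075", "1997JJ00", 135783.0],
     ["T001038", 10000, "T001075", "2005JJ00", 136402.0]],
    columns=cols,
)
reasons_2005 = pd.DataFrame(
    [["T001038", 10000, "T001075", "2005JJ00", 136402.0]], columns=cols
)
reasons_2013 = pd.DataFrame(
    [["T001038", 10000, "T001075", "2013JJ00", 141245.0]], columns=cols
)

combined = (
    pd.concat([reasons_1997, reasons_2005, reasons_2013], ignore_index=True)
      # keep the row from the most recent table when keys collide
      .drop_duplicates(subset=keys, keep="last")
      .assign(Year=lambda x: x["Perioden"].str[:4].astype(int))
      .sort_values("Year")
      .reset_index(drop=True)
)
print(combined[["Year", "Overledenen_1"]])
```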


In [122]:
dataframes['death']['perWeek2020_2021'].head(1)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Overledenen,Verwacht aantal overledenen,Verwacht aantal overledenen (95%-interval)
0,2020,1.0,3103.0,3277.0,2908 – 3645
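
The weekly table stores the 95% interval as text (`2908 – 3645`). A hedged sketch of splitting it into numeric bounds and comparing observed with expected deaths follows. The column names `year` and `week` are assumptions for the two unnamed columns, and `lower`, `upper`, `excess` and `within_interval` are illustrative names.

```python
import pandas as pd

# The row shown above; "year" and "week" are assumed names for the unnamed columns.
weekly = pd.DataFrame({
    "year": [2020],
    "week": [1.0],
    "Overledenen": [3103.0],
    "Verwacht aantal overledenen": [3277.0],
    "Verwacht aantal overledenen (95%-interval)": ["2908 – 3645"],
})

# Split on the en dash used in the source and strip surrounding spaces.
bounds = (
    weekly["Verwacht aantal overledenen (95%-interval)"]
        .str.split("–", expand=True)
        .apply(lambda col: col.str.strip())
        .astype(int)
)
weekly["lower"] = bounds[0]
weekly["upper"] = bounds[1]

# Observed minus expected deaths, and whether the observation falls inside the interval.
weekly["excess"] = weekly["Overledenen"] - weekly["Verwacht aantal overledenen"]
weekly["within_interval"] = weekly["Overledenen"].between(weekly["lower"], weekly["upper"])
print(weekly[["year", "week", "lower", "upper", "excess", "within_interval"]])
```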


In [123]:
dataframes['birth']['birthPerYear1899_2018'].head(1)

Unnamed: 0,Jaar,Levend geborenen
0,1899,163.0


In [124]:
dataframes['birth']['avaragesOfMonth'].head(1)

Unnamed: 0,Maand,2021*,2020,1975
0,januari,13899,14085,15140


<b>Clean Data</b>

- lifeExpectency
    - lifeExpectencyPerRegion2016_2019 
- populationChange
    - pop2002_2020
    - popOverview
    - popComparison2015_2020
    - growthPrediction2020_2050
    - absoluteNr
- death
    - reasons1997_2014
    - reasons2005_2012
    - reasons2013_2020
    - perWeek2020_2021
- birth
    - birthPerYear1899_2018
    - avaragesOfMonth

<b>Merge</b>

In [125]:
# these are custom classes made to keep the notebook neat.



<b>Heatmap</b>
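
This section is still empty in the notebook. As one possible starting point, and only as an assumption about what the heatmap is meant to show, the sketch below plots pairwise correlations between a few numeric columns, using the first five Aa en Hunze rows from `pop2002_2020` shown earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First five rows of pop2002_2020 (Aa en Hunze), reduced to three columns.
sample = pd.DataFrame({
    "BevolkingAanHetBeginVanDePeriode_1": [25552.0, 25305.0, 25218.0, 25329.0, 25507.0],
    "LevendGeborenKinderen_2": [289.0, 279.0, 233.0, 230.0, 216.0],
    "Overledenen_3": [251.0, 241.0, 221.0, 231.0, 212.0],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()

ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation between population features (sample)")
plt.tight_layout()
```

With only five rows the coefficients are not meaningful on their own; the same call on the full merged dataset would be the informative version.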