# Notebook Setup

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Dela - Predicting the number of deaths per year

### 1. Intro
<b>Who is our client?</b><br>
Our client is Dela, a funeral insurer and provider of funeral services. This semester, they will give us some internal problems to investigate.<br><br>
<b>Project explanation</b><br>
Dela faced an unprecedented challenge from fluctuating demand during the first year of Covid-19, and is therefore looking to improve its ability to react to unexpected surges or drops in demand.
We cannot predict when Dela needs to scale up or down. However, we can forecast the number of deaths in the upcoming years. Based on that forecast and its own experience, Dela can decide for itself when to scale up or down.<br><br>
<b>Project goal</b><br>
In our project, we are going to forecast the number of deaths per year. This should make it easier for Dela to decide how to respond to higher or lower demand.<br><br><br>
<b>Document explanation</b><br>
This document covers the testing and implementation of the hypothesis from the project proposal. We first explore and experiment with the collected data. We then check whether our hypothesis can be validated by applying machine learning to our dataset.<br><br>
<b>Document setup:</b><br>
<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left;">Data requirements</th>
        <td style="text-align: left;">In this chapter, we set up the requirements for the data needed for the prediction. We answer questions such as ‘Which sources are trustworthy?’ and ‘Do we need specific features?’</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data collection</th>
        <td style="text-align: left;">In this chapter, we explain where we found our data and where we store it, with references to the subchapter for each dataset.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data understanding</th>
        <td style="text-align: left;">In this chapter, we examine each dataset we downloaded to understand what it contains and how it can bring value to Dela.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data preparation</th>
        <td style="text-align: left;">In this chapter, we clean our data so it is ready to work with, for example by removing invalid records, wrong values, and duplicate features that appear under different names.</td>
    </tr>
</table>

### 2. Provisioning

### 2.1 Data Requirements
In this chapter, we set up the expectations and requirements for the data we are going to collect in the provisioning phase.

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Domain</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data type</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Target Variable</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Expected Features</th>
        <td style="text-align: left !important"></td>
    </tr>
</table>

### 2.2 Data Collection
Because we want data on the number of deaths in the Netherlands, we started by searching for a governmental open data bank. Data from such a source is trustworthy, which increases the chance of a good prediction. That is how we landed on CBS (`Centraal Bureau voor de Statistiek`, literally `Central Bureau of Statistics`, known in English as Statistics Netherlands).

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Source</th>
        <td style="text-align: left !important">We got our data from the official <a href="https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS" target="_blank">CBS</a> website</td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data Storage</th>
        <td style="text-align: left !important">We stored all of our datasets on <a href="https://github.com/i454038/AI-car-price-prediction" target="_blank">GitHub</a>, so they are globally accessible</td>
    </tr>
</table>

Load the datasets from GitHub:

In [2]:
# these are custom classes made to keep the notebook neat.
from classes.dataImporting import datasetManager

datasets = datasetManager.defineDatasets()
dataframes = datasetManager.loadDatasets(datasets)
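
The implementation of `datasetManager` is not shown in this notebook, but later cells index `dataframes` as a nested dictionary (group, then dataset name). The sketch below illustrates one way such a structure could be built. The function name `load_datasets`, the semicolon delimiter, and the in-memory CSV text (used here instead of GitHub URLs) are all assumptions for illustration, not the actual class.

```python
import io
import pandas as pd

# Hypothetical stand-in for the files on GitHub: group -> dataset name -> CSV text.
# The semicolon delimiter and the sample row are assumptions.
sources = {
    "lifeExpectency": {
        "lifeExpectencyPerRegion": "Municipality;LifeExpectancy\nAalten;82,1\n",
    },
}

def load_datasets(sources: dict) -> dict:
    """Read every CSV into a DataFrame, keeping the group/name nesting."""
    return {
        group: {
            name: pd.read_csv(io.StringIO(text), sep=";")
            for name, text in files.items()
        }
        for group, files in sources.items()
    }

frames = load_datasets(sources)
print(frames["lifeExpectency"]["lifeExpectencyPerRegion"])
```

Keeping the nesting identical to the folder layout means a dataset can be addressed the same way in every cell, which is how the preparation cells use `dataframes`.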

### 2.3 Data Understanding

understand this data

### 2.4 Data Preparation

### LifeExpectency - LifeExpectencyPerRegion

For the life expectancy dataset:
- We renamed the columns to English.
- We converted the life expectancy columns from strings with a decimal comma (object dtype) to floats.

In [3]:
dataframes['lifeExpectency']['lifeExpectencyPerRegion'] = (
    dataframes['lifeExpectency']['lifeExpectencyPerRegion']
        .rename(columns=datasetManager.renameFeatures('lifeExpectency.lifeExpectencyPerRegion'))
        .assign(LifeExpectancy = lambda x: x.LifeExpectancy.str.replace(',', '.').astype(float))
        .assign(LifeExpectancyWhen65OrOlder = lambda x: x.LifeExpectancyWhen65OrOlder.str.replace(',', '.').astype(float))
)

In [4]:
dataframes['lifeExpectency']['lifeExpectencyPerRegion'].head()

Unnamed: 0,id,Municipality,Groep_rij,Geslacht,LifeExpectancy,LifeExpectancyNL,LifeExpectancyWhen65OrOlder,LifeExpectancyWhen65OrOlderNL
0,518,'s-Gravenhage,Levensverwachting,Totaal,80.8,"onder, 99% zeker",19.1,"onder, 99% zeker"
1,796,'s-Hertogenbosch,Levensverwachting,Totaal,81.3,"onder, 99% zeker",19.7,"onder, 99% zeker"
2,1680,Aa en Hunze,Levensverwachting,Totaal,82.1,geen,20.4,geen
3,358,Aalsmeer,Levensverwachting,Totaal,82.9,"boven, 99% zeker",20.1,geen
4,197,Aalten,Levensverwachting,Totaal,82.1,geen,20.3,geen
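
The decimal-comma conversion used for the life expectancy columns can be tried in isolation. The sketch below uses a small hypothetical frame (the helper name `to_float` is not part of the notebook) to show the string-to-float step.

```python
import pandas as pd

# Hypothetical sample mimicking the raw comma-decimal strings.
raw = pd.DataFrame({
    "Municipality": ["Aalsmeer", "Aalten"],
    "LifeExpectancy": ["82,9", "82,1"],
})

def to_float(series: pd.Series) -> pd.Series:
    """Convert strings with a decimal comma to floats."""
    return series.str.replace(",", ".", regex=False).astype(float)

clean = raw.assign(LifeExpectancy=lambda x: to_float(x.LifeExpectancy))
print(clean.dtypes)
```

If the raw file is read with pandas directly, `pd.read_csv(..., decimal=',')` performs the same conversion at load time.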


### PopulationChange - Pop2002_2020

In [5]:
# transform dataset to align with others
dataframes['populationChange']['pop2002_2020'] = (
    dataframes['populationChange']['pop2002_2020']
        .rename(columns=datasetManager.renameFeatures('populationChange.pop2002_2020'))
        .assign(Year = lambda x: pd.to_datetime(x.Year.str[:4]).dt.year)
)

In [6]:
dataframes['populationChange']['pop2002_2020'].head(5)

Unnamed: 0,ID,Municipality,Year,PopulationAtBeginOfPeriod,AliveBornChildren,Deceased,TotalLocations,LocationsFromOtherMunicipality,Immigration,TotaalVertrekInclAdmCorrecties_7,AmountMovedToOtherMunicipality,EmigratieInclusiefAdmCorrecties_9,OverigeCorrecties_10,PopulationGrowth,RelativePopulationGrowth,PopulationGrowthSinceJanuari,RelativePopulationGrowthSinceJanuari,PopulationAtEndOfPeriod
0,14604,GM1680,2002,25552.0,289.0,251.0,1353.0,1121.0,232.0,1617.0,1498.0,119.0,-21.0,-247.0,-0.97,-247.0,-0.97,25305.0
1,14617,GM1680,2003,25305.0,279.0,241.0,1127.0,1071.0,56.0,1264.0,1111.0,153.0,12.0,-87.0,-0.34,-87.0,-0.34,25218.0
2,14630,GM1680,2004,25218.0,233.0,221.0,1167.0,1104.0,63.0,1077.0,1023.0,54.0,9.0,111.0,0.44,111.0,0.44,25329.0
3,14643,GM1680,2005,25329.0,230.0,231.0,1322.0,1254.0,68.0,1143.0,1062.0,81.0,0.0,178.0,0.7,178.0,0.7,25507.0
4,14656,GM1680,2006,25507.0,216.0,212.0,1369.0,1320.0,49.0,1326.0,1222.0,104.0,9.0,56.0,0.22,56.0,0.22,25563.0
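
The `Year` transformation keeps only the first four characters of the period string. The sketch below assumes period codes of the form `2002JJ00` (the raw values are not shown in this notebook, so this format is an assumption) and compares the datetime route used above with a direct integer cast.

```python
import pandas as pd

# Assumed raw period codes; the exact format in the source files is not shown.
periods = pd.Series(["2002JJ00", "2003JJ00", "2020JJ00"], name="Year")

# Route used in the notebook: parse the four-digit prefix as a date, then take the year.
via_datetime = pd.to_datetime(periods.str[:4], format="%Y").dt.year

# Shorter alternative: cast the four-digit prefix directly to an integer.
via_cast = periods.str[:4].astype(int)

print(via_cast.tolist())
```

Both routes give the same years. The direct cast avoids creating intermediate timestamps.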


### Death - Reason of Death

In [7]:
dataframes['death']['reason_per_year_per_region2002_2020'] = (
    pd.concat([
        dataframes['death']['reasons_per_year_per_region2002_2015'], 
        dataframes['death']['reasons_per_year_per_region2016_2020']
    ])
)

dataframes['death']['reason_per_year_per_region2002_2020'] = (
    dataframes['death']['reason_per_year_per_region2002_2020']
        .rename(columns={'RegioS': 'Municipality', 'Perioden': 'Year'})
        .assign(Year = lambda x: pd.to_datetime(x.Year.str[:4]).dt.year)
)
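
When two period files are stacked with `pd.concat`, each part keeps its own row labels by default, so labels can repeat. The sketch below uses hypothetical miniature frames (the `RegioS` and `Perioden` column names come from the cell above, the values are made up) to show the effect of `ignore_index=True`.

```python
import pandas as pd

# Hypothetical miniature versions of the two period files.
part_a = pd.DataFrame({"RegioS": ["GM1680", "GM1680"], "Perioden": ["2014JJ00", "2015JJ00"]})
part_b = pd.DataFrame({"RegioS": ["GM1680"], "Perioden": ["2016JJ00"]})

# Default behaviour: row labels of both parts are kept, so 0 appears twice.
with_duplicates = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([part_a, part_b], ignore_index=True)

print(with_duplicates.index.tolist(), stacked.index.tolist())
```

Repeated labels do not affect the later merge, which joins on columns, but a clean index makes label-based lookups unambiguous.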

### Merge Datasets

In [8]:
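dataset = (
    dataframes['populationChange']['pop2002_2020']
        .merge(dataframes['lifeExpectency']['lifeExpectencyPerRegion'], how='outer', on="Municipality")
        .merge(dataframes['death']['reason_per_year_per_region2002_2020'], how='outer', on=['Year', "Municipality"])
        # average the numeric columns per year and municipality; numeric_only=True
        # skips text columns, which pandas >= 2.0 would otherwise reject with a TypeError
        .groupby(['Year', 'Municipality']).mean(numeric_only=True)
        .reset_index()
#         .sort_values(by=['Year','Municipality'])
        # replace municipality values via the custom mapping (codes to names, judging by the output)
        .assign(Municipality = lambda x: x.Municipality.replace(datasetManager.mapFeature('Municipality')))
        .assign(Year = lambda x: x.Year.astype(int))
)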
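
An outer merge never fails on unmatched keys, so it is worth checking how many rows actually matched. The `head()` outputs above show municipality codes such as `GM1680` in the population frame and names such as `Aa en Hunze` in the life expectancy frame, which may limit the number of matches. The sketch below uses hypothetical miniature frames with made-up values to show how `indicator=True` reports this.

```python
import pandas as pd

# Hypothetical miniature frames: one keyed mostly by code, one partly by name.
population = pd.DataFrame({
    "Municipality": ["GM1680", "GM0358"],
    "Deceased": [251.0, 180.0],
})
life_expectancy = pd.DataFrame({
    "Municipality": ["Aa en Hunze", "GM0358"],
    "LifeExpectancy": [82.1, 82.9],
})

# indicator=True adds a _merge column: left_only, right_only, or both.
merged = population.merge(life_expectancy, how="outer", on="Municipality", indicator=True)
match_counts = merged["_merge"].value_counts()

print(match_counts)
```

A large `left_only` or `right_only` count would suggest that the key columns need to be mapped to a common form before merging.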

### Rename columns

In [9]:
dataset

Unnamed: 0,Year,Municipality,ID_x,PopulationAtBeginOfPeriod,AliveBornChildren,Deceased,TotalLocations,LocationsFromOtherMunicipality,Immigration,TotaalVertrekInclAdmCorrecties_7,...,k_1713AccidenteleVerdrinking_86,k_1714AccidenteleVergiftiging_87,k_1715OverigeOngevallen_88,k_172Zelfdoding_89,k_173MoordEnDoodslag_90,k_174GebeurtenissenOpzetOnbekend_91,k_175OverigeUitwendigeDoodsoorzaken_92,k_18TotaalCOVID19Coronavirus19_93,k_181VastgesteldeCOVID19_94,k_182VermoedelijkeCOVID19_95
0,2002,Appingedam,21260.0,12443.0,121.0,135.0,726.0,644.0,82.0,682.0,...,,,,,,,,,,
1,2002,Bedum,24076.0,10916.0,140.0,65.0,359.0,328.0,31.0,463.0,...,,,,,,,,,,
2,2002,Bellingwedde,25356.0,9722.0,83.0,101.0,568.0,489.0,79.0,603.0,...,,,,,,,,,,
3,2002,Ten Boer,32780.0,7461.0,98.0,53.0,342.0,298.0,44.0,483.0,...,,,,,,,,,,
4,2002,Delfzijl,40460.0,29018.0,295.0,328.0,1403.0,1096.0,307.0,1425.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10483,2020,Westerkwartier,144374.0,63329.0,614.0,592.0,2757.0,2622.0,135.0,2427.0,...,,,,,,,,,,
10484,2020,Noardeast-Fryslân,99318.0,45228.0,430.0,485.0,1825.0,1691.0,134.0,1516.0,...,,,,,,,,,,
10485,2020,Molenlanden,93174.0,43909.0,459.0,305.0,1701.0,1505.0,196.0,1631.0,...,,,,,,,,,,
10486,2020,Eemsdelta,47350.0,,,,,,,,...,,,,,,,,,,


### PopulationChange - PopComparison2015_2020

### PopulationChange - GrowthPrediction2020_2050

### Death Datasets

### Birth Datasets

<b>Clean Data</b>

- lifeExpectency
    - lifeExpectencyPerRegion2016_2019 
- populationChange
    - pop2002_2020
    - popOverview
    - popComparison2015_2020
    - growthPrediction2020_2050
    - absoluteNr
- death
    - reasons1997_2014
    - reasons2005_2012
    - reasons2013_2020
    - perWeek2020_2021
- birth
    - birthPerYear1899_2018
    - avaragesOfMonth

<b>Merge</b>

In [10]:
# these are custom classes made to keep the notebook neat.



<b>Heatmap</b>