# Notebook Setup

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split as tts

# Dela - Predicting the number of deaths per year

### 1. Intro
<b>Who is our client?</b><br>
Our client is Dela, a funeral insurer and provider of funeral services. This semester, Dela will share internal problems for us to investigate.<br><br>
<b>Project explanation</b><br>
Dela faced an unprecedented challenge from fluctuating demand during the first year of Covid-19, and is therefore looking to improve its ability to react to unexpected surges or drops in demand.
We cannot predict when Dela needs to scale up or down. However, we can forecast the number of deaths in the upcoming years. Based on that forecast and its own experience, Dela can decide for itself when to scale up or down.<br><br>
<b>Project goal</b><br>
In our project, we are going to forecast the number of deaths per year. This helps Dela decide more easily how to respond to higher or lower demand.<br><br><br>
<b>Document explanation</b><br>
This document covers the testing and implementation of the hypothesis from the delivered project proposal. We first explore and experiment with the collected data. We then apply machine learning to the dataset to see whether the hypothesis can be validated.<br><br>
<b>Document setup:</b><br>
<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left;">Data requirements</th>
        <td style="text-align: left;">In this chapter, we set up the requirements for the data needed for the prediction. We answer questions such as ‘Which sources are trustworthy?’ and ‘Do we need specific features?’.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data collection</th>
        <td style="text-align: left;">In this chapter, we explain where we found our data and where we store it, with references to the subchapters for each dataset.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data understanding</th>
        <td style="text-align: left;">In this chapter, we examine each downloaded dataset to understand what it contains and what value it brings to Dela.</td>
    </tr>
    <tr>
        <th style="text-align: left;">Data preparation</th>
        <td style="text-align: left;">In this chapter, we clean the data so it is ready to work with, for example by removing invalid records, correcting or removing wrong values, and consolidating similar features that have different names.</td>
    </tr>
</table>

### 2. Provisioning

### 2.1 Data Requirements
In this chapter, we set up the expectations and requirements for the data we are going to collect in the provisioning phase.

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Domain</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data type</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Target Variable</th>
        <td style="text-align: left !important"></td>
    </tr>
    <tr>
        <th style="text-align: left !important">Expected Features</th>
        <td style="text-align: left !important"></td>
    </tr>
</table>

### 2.2 Data Collection
Because we need data on the number of deaths in the Netherlands, we searched for an open governmental data bank. A governmental source is trustworthy, which increases the chance of a good prediction. That is how we arrived at CBS (`Centraal Bureau voor de Statistiek`, literally `Central Bureau for Statistics`, known in English as Statistics Netherlands).

<table style="font-size: 14px !important; margin: 0 !important">
    <tr>
        <th style="text-align: left !important">Data Source</th>
        <td style="text-align: left !important">We got our data from the official <a href="https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS" target="_blank">CBS</a> website</td>
    </tr>
    <tr>
        <th style="text-align: left !important">Data Storage</th>
        <td style="text-align: left !important">We stored all of our datasets on <a href="https://github.com/i454038/AI-car-price-prediction" target="_blank">GitHub</a>, so they are publicly accessible</td>
    </tr>
</table>

Load the datasets from GitHub.

In [2]:
# these are custom classes made to keep the notebook neat.
from classes.dataImporting import datasetManager

datasets = datasetManager.defineDatasets()
dataframes = datasetManager.loadDatasets(datasets)

### 2.3 Data Understanding

We want to plot the number of deceased.

### 2.4 Data Preparation

### LifeExpectency - LifeExpectencyPerRegion

For the life expectancy dataset:
- We renamed the columns to English.
- We converted the life expectancy columns from objects (strings with a decimal comma) to floats.

In [3]:
dataframes['lifeExpectency']['lifeExpectencyPerRegion'] = (
    dataframes['lifeExpectency']['lifeExpectencyPerRegion']
        .rename(columns=datasetManager.renameFeatures('lifeExpectency.lifeExpectencyPerRegion'))
        .assign(LifeExpectancy = lambda x: x.LifeExpectancy.str.replace(',', '.').astype(float))
        .assign(LifeExpectancyWhen65OrOlder = lambda x: x.LifeExpectancyWhen65OrOlder.str.replace(',', '.').astype(float))
)

In [4]:
dataframes['lifeExpectency']['lifeExpectencyPerRegion'].head()

Unnamed: 0,id,Municipality,Groep_rij,Geslacht,LifeExpectancy,LifeExpectancyNL,LifeExpectancyWhen65OrOlder,LifeExpectancyWhen65OrOlderNL
0,518,'s-Gravenhage,Levensverwachting,Totaal,80.8,"onder, 99% zeker",19.1,"onder, 99% zeker"
1,796,'s-Hertogenbosch,Levensverwachting,Totaal,81.3,"onder, 99% zeker",19.7,"onder, 99% zeker"
2,1680,Aa en Hunze,Levensverwachting,Totaal,82.1,geen,20.4,geen
3,358,Aalsmeer,Levensverwachting,Totaal,82.9,"boven, 99% zeker",20.1,geen
4,197,Aalten,Levensverwachting,Totaal,82.1,geen,20.3,geen


### PopulationChange - Pop2002_2020

In [5]:
# transform dataset to align with others
dataframes['populationChange']['pop2002_2020'] = (
    dataframes['populationChange']['pop2002_2020']
        .rename(columns=datasetManager.renameFeatures('populationChange.pop2002_2020'))
        .assign(Year = lambda x: pd.to_datetime(x.Year.str[:4]).dt.year)
)

In [6]:
dataframes['populationChange']['pop2002_2020'].head(5)

Unnamed: 0,ID,Municipality,Year,PopulationAtBeginOfPeriod,AliveBornChildren,Deceased,TotalLocations,LocationsFromOtherMunicipality,Immigration,TotaalVertrekInclAdmCorrecties_7,AmountMovedToOtherMunicipality,EmigratieInclusiefAdmCorrecties_9,OverigeCorrecties_10,PopulationGrowth,RelativePopulationGrowth,PopulationGrowthSinceJanuari,RelativePopulationGrowthSinceJanuari,PopulationAtEndOfPeriod
0,14604,GM1680,2002,25552.0,289.0,251.0,1353.0,1121.0,232.0,1617.0,1498.0,119.0,-21.0,-247.0,-0.97,-247.0,-0.97,25305.0
1,14617,GM1680,2003,25305.0,279.0,241.0,1127.0,1071.0,56.0,1264.0,1111.0,153.0,12.0,-87.0,-0.34,-87.0,-0.34,25218.0
2,14630,GM1680,2004,25218.0,233.0,221.0,1167.0,1104.0,63.0,1077.0,1023.0,54.0,9.0,111.0,0.44,111.0,0.44,25329.0
3,14643,GM1680,2005,25329.0,230.0,231.0,1322.0,1254.0,68.0,1143.0,1062.0,81.0,0.0,178.0,0.7,178.0,0.7,25507.0
4,14656,GM1680,2006,25507.0,216.0,212.0,1369.0,1320.0,49.0,1326.0,1222.0,104.0,9.0,56.0,0.22,56.0,0.22,25563.0


### Death - Reason of Death

In [7]:
dataframes['death']['reason_per_year_per_region2002_2020'] = (
    pd.concat([
        dataframes['death']['reasons_per_year_per_region2002_2015'], 
        dataframes['death']['reasons_per_year_per_region2016_2020']
    ])
)

dataframes['death']['reason_per_year_per_region2002_2020'] = (
    dataframes['death']['reason_per_year_per_region2002_2020']
        .rename(columns={'RegioS': 'Municipality', 'Perioden': 'Year'})
        .assign(Year = lambda x: pd.to_datetime(x.Year.str[:4]).dt.year)
)
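The `Year` conversion used above can be sketched on a few made-up values. The period codes below (for example `2002JJ00`) are an assumption about the CBS format; the notebook only shows that the first four characters hold the year. The second variant is a simpler alternative that skips the datetime round trip.

```python
import pandas as pd

# Assumed example values; only the 4-digit year prefix matters here.
periods = pd.Series(["2002JJ00", "2003JJ00", "2020JJ00"], name="Year")

# Variant used in the notebook, with the date format made explicit:
# parse the 4-digit prefix as a date, then take the year.
via_datetime = pd.to_datetime(periods.str[:4], format="%Y").dt.year

# Simpler alternative: cast the 4-digit prefix straight to an integer.
via_int = periods.str[:4].astype(int)

print(via_datetime.tolist())
print(via_int.tolist())
```

Both variants yield the same integers, so the choice is mostly a matter of readability.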

### Merge Datasets

<table style="margin: 0 !important; font-size: 14px !important">
    <tr>
        <th style="text-align: left !important">Goal:</th>
        <td style="text-align: left !important">
            Merge the datasets:<br>
            <ul>
                <li>lifeExpectency</li>
                <li>population</li>
                <li>reason of death</li>
            </ul>
        </td>
    </tr>
    <tr>
        <th style="text-align: left !important">Possible Solutions:</th>
        <td style="text-align: left !important">
            <ul>
                <li>merge (pandas)</li>
                <li>concatenate (pandas)</li>
            </ul>
        </td>
    </tr>
</table>

<b>Merge</b><br>
Merge DataFrame or named Series objects with a database-style join.<br>
A named Series object is treated as a DataFrame with a single named column.<br>
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.<br>
<b>Concatenate</b><br>
Concatenate pandas objects along a particular axis with optional set logic along the other axes.<br>
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.<br>
<b>What is the best solution?</b><br>
Our datasets do not share the same columns; each one contributes different features. We want to add those features to one dataset by matching rows on shared keys (`Municipality` and `Year`), not stack the datasets along an axis. This is why we use the merge function to build one combined dataset.
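To illustrate the difference between the two options, here is a minimal sketch on two made-up tables. The column names mirror the notebook, but the values and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; the values are made up for illustration.
population = pd.DataFrame({
    "Municipality": ["A", "A", "B"],
    "Year": [2019, 2020, 2020],
    "Deceased": [10, 12, 7],
})
life_expectancy = pd.DataFrame({
    "Municipality": ["A", "B"],
    "LifeExpectancy": [81.5, 82.0],
})

# merge: a key-based join that adds the columns of the right table
# to every matching row of the left table.
merged = population.merge(life_expectancy, how="left", on="Municipality")

# concat: stacks the tables along an axis; columns missing from one
# table are filled with NaN instead of being matched on a key.
stacked = pd.concat([population, life_expectancy], ignore_index=True)

print(merged)
print(stacked)
```

In the merged result every population row gains a `LifeExpectancy` value, whereas the stacked result simply has five rows with gaps, which is not what we want when adding features.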

In [8]:
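dataset = (
    dataframes['populationChange']['pop2002_2020']
        # the life expectancy table shown above has no Year column, so it joins on Municipality only
        .merge(dataframes['lifeExpectency']['lifeExpectencyPerRegion'], how='outer', on="Municipality")
        .merge(dataframes['death']['reason_per_year_per_region2002_2020'], how='outer', on=['Year', "Municipality"])
        # map the Municipality values through the project's lookup (presumably codes to names)
        .assign(Municipality = lambda x: x.Municipality.replace(datasetManager.mapFeature('Municipality')))
        # note: this sets every missing value to 0, in the feature columns as well
        .fillna(0)
        # rows without a Year become 0 after fillna and are assigned to 2002
        .assign(Year = lambda x: x.Year.replace(0, 2002))
        .assign(Year = lambda x: x.Year.astype(int))
)
dataset['Year'].unique()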

array([2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])

In [9]:
# PredictingDataset = dataset.assign(Municipality = lambda x: x.Municipality.astype("category").cat.codes)
# Drop the string (object) columns, because they cannot be used as numeric features.
dataset = dataset.drop(
    columns=['Groep_rij', 'Geslacht_x', 'LifeExpectancyNL', 'LifeExpectancyWhen65OrOlderNL', 'Geslacht_y'],
    errors='ignore',  # skip names that are not present, as the original filter did
)

In [10]:
dataset.set_index(['Municipality', 'Year'], inplace=True)

In [11]:
municipalities = dataset.reset_index()['Municipality'].unique()
# one sub-DataFrame per municipality, indexed by Year
datasets = {}
for municipality in municipalities:
    datasets[municipality] = dataset.loc[municipality]

print(datasets['Amsterdam']['Deceased'])
print(datasets['Zwolle']['Deceased'])
# pd.DataFrame({
#     "recall": {
#         "amsterdam": '0.82',
#         'zwolle': '0.42'
#     },
#     'accuracy': {
#         "amsterdam": '0.82',
#         'zwolle': '0.42'
#     }
# })

Year
2002    6546.0
2002    6546.0
2002    6546.0
2002    6546.0
2002    6546.0
         ...  
2020    5763.0
2020    5763.0
2020    5763.0
2020    5763.0
2002       0.0
Name: Deceased, Length: 571, dtype: float64
Year
2002     953.0
2002     953.0
2002     953.0
2002     953.0
2002     953.0
         ...  
2020    1036.0
2020    1036.0
2020    1036.0
2020    1036.0
2002       0.0
Name: Deceased, Length: 571, dtype: float64
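The repeated values per year in the output above suggest that the merges duplicated population rows, which happens when the other table holds several rows per key. The following is a minimal sketch of that effect on made-up data; the `Cause` column and all values are hypothetical and not taken from the CBS datasets.

```python
import pandas as pd

# One population row per (Municipality, Year) ...
population = pd.DataFrame({
    "Municipality": ["A"],
    "Year": [2002],
    "Deceased": [100.0],
})
# ... but several rows per key in the other table (hypothetical causes).
causes = pd.DataFrame({
    "Municipality": ["A", "A", "A"],
    "Year": [2002, 2002, 2002],
    "Cause": ["x", "y", "z"],
})

# The single population row is repeated once for every matching row.
merged = population.merge(causes, how="outer", on=["Year", "Municipality"])

# Keep one row per key to recover the original yearly totals.
per_key = merged.drop_duplicates(subset=["Municipality", "Year"])[
    ["Municipality", "Year", "Deceased"]
]

print(merged)
print(per_key)
```

Whether de-duplicating or aggregating is the right fix depends on what the extra rows in the real datasets represent.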


<table style="margin: 0 !important;">
    <tr>
        <th style="text-align: center !important">Goal:</th>
        <td style="text-align: center !important">
            Grouping the dataset from municipality level to year level:<br>
        </td>
    </tr>
    <tr>
        <th style="text-align: center !important">Possible Solutions:</th>
        <td style="text-align: center !important">
            <ul>
                <li>groupby</li>
                <li>sort</li>
            </ul>
        </td>
    </tr>
</table>

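<b>GroupBy</b><br>
Splits the DataFrame into groups based on one or more keys, applies a function such as a sum or mean to each group, and combines the results into one row per group.<br>
<b>Sort</b><br>
Reorders the rows by the values of one or more columns. The number of rows stays the same, so nothing is aggregated.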
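As a small illustration of the two options, the sketch below applies both to made-up records. The values are hypothetical, and summing `Deceased` per year is an assumption about the aggregation we would want.

```python
import pandas as pd

# Hypothetical records: two municipalities over two years.
records = pd.DataFrame({
    "Municipality": ["A", "B", "A", "B"],
    "Year": [2020, 2020, 2019, 2019],
    "Deceased": [12.0, 7.0, 10.0, 6.0],
})

# groupby: collapses all municipalities into one row per year.
per_year = records.groupby("Year", as_index=False)["Deceased"].sum()

# sort: keeps every row and only changes the order.
sorted_records = (
    records.sort_values(["Year", "Municipality"]).reset_index(drop=True)
)

print(per_year)
print(sorted_records)
```

Only `groupby` reduces the data from municipality level to year level; sorting alone leaves one row per municipality and year.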

### Rename columns

In [12]:
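def doFeatureSelection(dataset, y, k=5):
    """Return the names of the k features that share the most mutual information with y."""
    # the target itself must not be used as a feature
    X = dataset.loc[:, dataset.columns != 'Deceased']

    # never ask for more features than there are columns
    sel = SelectKBest(mutual_info_regression, k=min(k, X.shape[1]))
    sel.fit(X, y)

    # boolean mask of the selected columns -> column names
    cols = sel.get_support()
    features_df_new = X.columns[cols]
    return features_df_new
#     print("\n")
#     print("SelectKBest chooses the following highest-scoring features:")
#     print(features_df_new)
#     X = X[features_df_new]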
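To show what `SelectKBest` with `mutual_info_regression` does in isolation, here is a minimal sketch on synthetic data. The column names, the target formula and the fixed `random_state` are assumptions made for the example and are not part of the project data.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3 * X["signal_a"] - 2 * X["signal_b"] + rng.normal(scale=0.1, size=n)

# mutual_info_regression is randomized, so fix its seed for repeatable scores.
score_func = partial(mutual_info_regression, random_state=0)

# Keep the two columns with the highest mutual information with y.
selector = SelectKBest(score_func, k=2).fit(X, y)
selected = X.columns[selector.get_support()]

print(dict(zip(X.columns, selector.scores_.round(3))))
print(list(selected))
```

Printing the scores next to the selected names makes it easier to judge how clear the separation between kept and dropped features is.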

In [13]:
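def predictByNearestNeighbors(dataset, y, n_neighbors=2):
    """Fit a k-nearest-neighbours regressor and return its R^2 score on a held-out test set."""
    from sklearn.neighbors import KNeighborsRegressor

    X_train, X_test, y_train, y_test = tts(dataset, y, test_size=0.2, random_state=20)

    # fit the scaler on the training split only, so no test information leaks into it
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # n_neighbors may not exceed the number of training samples
    neigh = KNeighborsRegressor(n_neighbors=min(n_neighbors, len(X_train)))
    neigh.fit(X_train, y_train)
    return neigh.score(X_test, y_test)

#     from sklearn.neighbors import NearestNeighbors
#     neigh = NearestNeighbors()
#     neigh.fit(X_train)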

In [14]:
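scores = {}
for municipality in municipalities:
    # use a separate name so the merged `dataset` is not overwritten inside the loop
    municipalityData = datasets[municipality]
    y = municipalityData.Deceased
    features = doFeatureSelection(municipalityData, y)
    # keep the R^2 score per municipality instead of discarding it
    scores[municipality] = predictByNearestNeighbors(municipalityData[features], y)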

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 16, n_neighbors = 20
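The error above reports 20 requested neighbours against only 16 available samples, presumably from a run with a different `n_neighbors` setting than the cell shown. One way to avoid it is to cap the number of neighbours at the number of training samples. The sketch below uses random data of the same size; the variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Random stand-in data with 16 samples, as in the error message.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(16, 3))
y_train = rng.normal(size=16)

requested_neighbors = 20
# A neighbour query cannot return more points than were fitted, so cap it.
n_neighbors = min(requested_neighbors, len(X_train))

model = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
predictions = model.predict(X_train[:2])

print(n_neighbors, predictions.shape)
```

Capping only prevents the crash. With so few samples per municipality, a small `n_neighbors` is likely the more meaningful choice, because using all samples as neighbours just predicts their mean.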

In [None]:
X

In [None]:
# dataset = (
#     dataset
#         .groupby(['Year', 'Municipality']).mean()
#         .reset_index()
# )

In [None]:
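# dataset
import plotly.express as px
# Municipality and Year are index levels after set_index, so turn them back into columns for plotting
px.line(dataset.reset_index(), x="Year", y='Deceased', color="Municipality", title="Deceased per Year per Municipality in the Netherlands")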

### PopulationChange - PopComparison2015_2020

### PopulationChange - GrowthPrediction2020_2050

### Death Datasets

### Birth Datasets

<b>Clean Data</b>

- lifeExpectency
    - lifeExpectencyPerRegion2016_2019 
- populationChange
    - pop2002_2020
    - popOverview
    - popComparison2015_2020
    - growthPrediction2020_2050
    - absoluteNr
- death
    - reasons1997_2014
    - reasons2005_2012
    - reasons2013_2020
    - perWeek2020_2021
- birth
    - birthPerYear1899_2018
    - avaragesOfMonth

<b>Merge</b>

In [None]:
# these are custom classes made to keep the notebook neat.



<b>Heatmap</b>