# Retrieving the data

To estimate home-insurance premiums for different regions of the country, a variety of datasets is used. These datasets come from two sources: the police and the CBS (Central Bureau of Statistics). This document contains all functionality for retrieving the desired data and storing it temporarily, so that it can be cleaned, analyzed and displayed on the dashboard later.

## Functionality setup

This section contains the global variables and helper functions used to retrieve and store the data.

In [None]:
# Importing the required dependencies
import requests
import pandas as pd
import cbsodata
from datetime import datetime

In [17]:
# Path where all the datasets will be stored
data_path: str = r"../data"


def dataframe_to_csv(df: pd.DataFrame, save_folder: str, file_name: str) -> None:
    """ Converts a DataFrame to a csv file and saves it at a specific location

    Args:
        df (pd.DataFrame): DataFrame to be converted to CSV
        save_folder (str): The folder in which the file needs to be saved
        file_name (str): The actual name of the file
    """
    # Pass the separator by keyword: newer pandas versions no longer accept it positionally
    df.to_csv(f"{save_folder}/{file_name}.csv", sep=",", index=False, encoding="utf-8")
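
Writing a CSV fails when the target folder does not exist yet. A minimal sketch of how the save path could be prepared first, assuming the standard-library `pathlib` module; `build_csv_path` is a hypothetical helper and not part of the notebook:

```python
import tempfile
from pathlib import Path


def build_csv_path(save_folder: str, file_name: str) -> Path:
    """Hypothetical helper: returns '<save_folder>/<file_name>.csv' and creates the folder if needed."""
    folder = Path(save_folder)
    # parents=True also creates intermediate folders, exist_ok=True makes repeated calls safe
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{file_name}.csv"


# Demonstration in a throwaway directory, so nothing is written next to the notebook
with tempfile.TemporaryDirectory() as tmp:
    target = build_csv_path(f"{tmp}/data", "wanted_persons")
    folder_created = target.parent.is_dir()
    target_name = target.name
```

The returned path could then be passed to `df.to_csv(...)` in place of the hand-built f-string.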

## Dynamic data from the police

The Dutch police provide API access to a database with information ranging from missing persons and police stations to patrol agents and general news articles. This data is updated daily, which is why it is used in this project to satisfy the dynamic dataset requirement.

In [11]:
from typing import Optional


def get_police_data(target_url: str, max_requests: int = 10, parameters: Optional[dict[str, Optional[str]]] = None) -> pd.DataFrame:
    """ Gets the data from the police API back in a DataFrame. Since the API returns at most 25 records per call,
    it is queried a specified number of times with an increasing offset. Additional query parameters can be passed along.

    Args:
        target_url (str): The url for the specific data that you want to retrieve from the police
        max_requests (int, optional): Maximum number of requests made to the police API. Defaults to 10
        parameters (dict, optional): Additional parameters for the API request. Entries whose value is None are skipped. Defaults to None.

    Returns:
        pd.DataFrame: DataFrame containing the desired data
    """
    df: pd.DataFrame = pd.DataFrame()
    request_index: int = 1

    # Keep only the parameters that actually have a value
    query: dict = {key: value for key, value in (parameters or {}).items() if value is not None}

    # Making the requests
    while request_index <= max_requests:
        print(f"Starting request {request_index}/{max_requests}...")
        # Calculate the offset
        offset: int = (request_index - 1) * 25

        # Let requests build the query string, so that "?" and "&" are always placed correctly
        r = requests.get(target_url, params={**query, "offset": offset}).json()
        df = pd.concat([df, pd.DataFrame(r["opsporingsberichten"])], ignore_index=True)

        request_index = request_index + 1

    print("Finished requests")

    return df
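
The function above always makes `max_requests` calls, even when fewer records are available. A minimal sketch of offset pagination that stops as soon as a page comes back incomplete is shown here. `fetch_all_pages` and `fake_fetch` are hypothetical names, and the fake data stands in for the real API, whose behavior past the last page is not verified here:

```python
from typing import Callable

PAGE_SIZE = 25  # page size mentioned in the docstring above


def fetch_all_pages(fetch_page: Callable[[int], list], max_requests: int = 10) -> list:
    """Collects records page by page and stops early once a page is not full."""
    records: list = []
    for request_index in range(max_requests):
        page = fetch_page(request_index * PAGE_SIZE)
        records.extend(page)
        # A page with fewer than PAGE_SIZE records means there is nothing left to fetch
        if len(page) < PAGE_SIZE:
            break
    return records


# Stand-in for the real API: 60 fake records served in pages of 25
fake_records = [{"uid": i} for i in range(60)]
calls: list = []


def fake_fetch(offset: int) -> list:
    calls.append(offset)
    return fake_records[offset:offset + PAGE_SIZE]


result = fetch_all_pages(fake_fetch, max_requests=10)
```

Injecting the fetch function keeps the paging logic testable without network access. In the notebook, the collected records could be turned into a DataFrame with `pd.DataFrame(result)`.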

In [18]:
police_url: str = "https://api.politie.nl/v4/gezocht"

wanted_persons_parameters: dict[str, str] = {
    "uid": None,
    "language": "nl",
    "query": None,
    "lat": None,
    "lon": None,
    "radius": None,
    "maxnumberofitems": "25"
}

police_data = get_police_data(police_url, 3, wanted_persons_parameters)
dataframe_to_csv(police_data, data_path, "wanted_persons")

Starting request 1/3...
Starting request 2/3...
Starting request 3/3...
Finished requests
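
To check which URL a given parameter set leads to, the query string can be built without sending a request. A small sketch, assuming the standard-library `urllib.parse.urlencode`; `preview_url` is a hypothetical helper, and the parameter names are taken from the cell above:

```python
from urllib.parse import urlencode


def preview_url(base_url: str, parameters: dict, offset: int) -> str:
    """Hypothetical helper: builds the request URL without sending anything."""
    # Drop the parameters that were left as None
    query = {key: value for key, value in parameters.items() if value is not None}
    query["offset"] = offset
    return f"{base_url}?{urlencode(query)}"


url = preview_url(
    "https://api.politie.nl/v4/gezocht",
    {"uid": None, "language": "nl", "query": None, "maxnumberofitems": "25"},
    offset=25,
)
print(url)
# prints https://api.politie.nl/v4/gezocht?language=nl&maxnumberofitems=25&offset=25
```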


## Data from CBS

The Central Bureau of Statistics (CBS) in the Netherlands publishes free datasets on a wide range of topics that anyone can analyze. The report justifies which dataset is used and why.

In [7]:
# 70072ned: Regional metrics for the whole of the Netherlands. Large dataset with a lot of information, broken down by municipality
cbs_datasets: dict[str, str] = {
    "income_inequality": "71511NED"
}


for dataset, identifier in cbs_datasets.items():
    data = pd.DataFrame(cbsodata.get_data(identifier, dir=f"../data/{dataset}"))
    # A bare data.head() inside a loop is not displayed, so print the preview explicitly
    print(data.head())

In [6]:
def test_functie():
    pass

Unnamed: 0,ID,RegioS,Perioden,PersonenMetUitgesprokenSchuldsanering_1
0,0,Nederland,1999,6220.0
1,1,Nederland,2000,13930.0
2,2,Nederland,2001,20465.0
3,3,Nederland,2002,25560.0
4,4,Nederland,2003,28150.0
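
The preview above contains an `Unnamed: 0` column, which typically appears when a CSV that was written together with its index is read back. A minimal sketch of reading such a file with `index_col=0` follows. The CSV text is reconstructed from the preview and is an assumption about what the file on disk looks like:

```python
import io

import pandas as pd

# Assumed file contents: the first header cell is empty because the index was written along
csv_text = """,ID,RegioS,Perioden,PersonenMetUitgesprokenSchuldsanering_1
0,0,Nederland,1999,6220.0
1,1,Nederland,2000,13930.0
2,2,Nederland,2001,20465.0
3,3,Nederland,2002,25560.0
4,4,Nederland,2003,28150.0
"""

# index_col=0 uses the first column as the index instead of loading it as "Unnamed: 0"
debt_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
columns = list(debt_df.columns)
```

Saving with `index=False`, as `dataframe_to_csv` does, avoids the extra column in the first place.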
