# The data

Our data consists of population data obtained from Statistics Finland and healthcare expenditure data obtained from the Finnish Institute for Health and Welfare.

First let us load and look at the data:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [4]:
population = pd.read_csv("data/pop_by_region.csv", sep=";", header=None, 
                         names=["Category", "Id", "Region", "Area_code", "Gender", "Year", "Count", "Pop_total"])

# keep only the columns we need
population = population[["Region", "Gender", "Year", "Pop_total"]]

print(population)

health_cost = pd.read_csv("data/health_expenditure_total.csv", sep=";", header=None,
                            names=["Category", "Id", "Region", "Area_code", "Gender", "Year", "Count", "Health_total"])

# keep only the columns we need
health_cost = health_cost[["Region", "Gender", "Year", "Health_total"]]
print(health_cost)

               Region    Gender  Year  Pop_total
0     Central Finland      male  1990     125353
1     Central Finland    female  1990     129186
2     Central Finland  combined  1990     254539
3     Central Finland      male  1991     126307
4     Central Finland    female  1991     130173
...               ...       ...   ...        ...
1933            Åland    female  2022      15295
1934            Åland  combined  2022      30359
1935            Åland      male  2023      15122
1936            Åland    female  2023      15419
1937            Åland  combined  2023      30541

[1938 rows x 4 columns]
              Region    Gender  Year  Health_total
0    Central Finland  combined  1993        200375
1    Central Finland  combined  1994        192911
2    Central Finland  combined  1995        200223
3    Central Finland  combined  1996        210664
4    Central Finland  combined  1997        225084
..               ...       ...   ...           ...
527            Åland  combined

The `pop_by_region` file contains the 19 regions of Finland with their populations by year from 1990 to 2023. The `health_expenditure_total` file contains the total operating expenditure of municipal healthcare, in thousands of euros, from 1993 to 2020.
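
The coverage described above (number of regions, first and last year) can be checked directly from the frames, and the two tables can be aligned on region and year. The following is a minimal sketch, not part of the original notebook: the `coverage` helper and the toy rows are hypothetical and only mirror the column names used above, so that the example runs without the CSV files.

```python
import pandas as pd


def coverage(df: pd.DataFrame) -> dict:
    """Return the number of distinct regions and the first and last year in df."""
    return {
        "regions": df["Region"].nunique(),
        "first_year": int(df["Year"].min()),
        "last_year": int(df["Year"].max()),
    }


# Hypothetical toy rows with the same columns as the frames loaded above.
toy_population = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B", "B"],
    "Gender": ["male", "female", "combined"] * 2,
    "Year": [1990, 1990, 1990, 1991, 1991, 1991],
    "Pop_total": [10, 12, 22, 20, 21, 41],
})
toy_health_cost = pd.DataFrame({
    "Region": ["A", "B"],
    "Gender": ["combined", "combined"],
    "Year": [1990, 1991],
    "Health_total": [100, 200],
})

pop_cov = coverage(toy_population)
cost_cov = coverage(toy_health_cost)

# The expenditure table only has "combined" rows, so keep the matching
# population rows and join on the shared keys.
combined = toy_population[toy_population["Gender"] == "combined"]
merged = combined.merge(toy_health_cost, on=["Region", "Gender", "Year"], how="inner")
print(pop_cov, cost_cov)
print(merged)
```

Calling `coverage(population)` and `coverage(health_cost)` on the real frames should reproduce the region count and year ranges stated above, assuming the files load as shown. An inner join keeps only the region and year pairs present in both tables.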

Our goal is to predict the future healthcare costs of a region from the following predictors: population structure, alcohol and tobacco consumption, drug use, level of education, and employment (others may be added).

## Data processing

The data 