## Analysing the correlation between accessibility of jobs and unemployment in the capital region of Finland

This notebook analyses the correlation between job accessibility and unemployment across YKR grid cells (250 x 250 m resolution) in the Finnish capital region. It uses the following data:

- YKR demographic data from the year 2020 (SYKE, 2020: https://www.stat.fi/tup/ykraineistot/index.html): YKR grid cells in the Helsinki region with at least one inhabitant. The data also contains demographic information on the inhabitants, such as gender, income and employment status
    - This data is used to extract the number of employed and unemployed people residing in each grid cell and, from these, the unemployment rate
    - For privacy reasons, the numbers of employed and unemployed people are not reported for cells with too few inhabitants; they are marked with -1 instead. These cells are filtered out
    - The unemployment rate has been calculated beforehand by dividing the number of unemployed people by the total number of employed and unemployed people in each cell, and stored in a new column ("tyott")

- Data on the number of people employed in companies located within each grid cell (HSY, 2015: https://hri.fi/data/en_GB/dataset/helsingin-seudun-tyopaikkaruudukko)
    - The data contains only private companies and non-public organisations
    - This dataset has been joined to the demographic data beforehand
    
- Helsinki travel time matrix (Digital Geography Lab, 2018: https://blogs.helsinki.fi/accessibility/helsinki-region-travel-time-matrix-2018/)
    - Data on the estimated travel time to each YKR grid cell from all the other grid cells in the capital region. The dataset contains travel times for several transport modes and conditions, from which the following travel time metrics were chosen for analysis:
        - Travel times by walking (70 meters/minute)
        - Travel times by cycling (12 km/hour). The time also includes one extra minute for picking up and returning the bike
        - Travel times by public transportation (rush hour; includes the whole travel chain, for example waiting at home and at the station)
        - Travel times by car (rush hour; includes the time for walking to the car and finding a parking space)
    - The data is provided as a separate text file for each grid cell, containing the travel times to that cell from all the other grid cells in the area by the different transport modes
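
The unemployment rate described in the YKR bullets above can be illustrated with a minimal sketch. The toy values are invented, and the sketch assumes that `pop_pt_tyoll` holds the number of employed and `pop_pt_tyott` the number of unemployed residents, and that the rate is stored as a fraction. The precomputed "tyott" column in the real data may have been produced differently.

```python
import numpy as np
import pandas as pd

# Toy grid cells; the numbers are invented for illustration only.
cells = pd.DataFrame({
    "YKR_ID": [1, 2, 3],
    "pop_pt_tyoll": [90, -1, 45],  # employed residents (-1 = not reported)
    "pop_pt_tyott": [10, -1, 5],   # unemployed residents (-1 = not reported)
})

# Treat the privacy marker -1 as missing so it never enters the arithmetic.
employed = cells["pop_pt_tyoll"].replace(-1, np.nan)
unemployed = cells["pop_pt_tyott"].replace(-1, np.nan)

# Unemployment rate = unemployed / (employed + unemployed), as a fraction.
cells["tyott"] = unemployed / (employed + unemployed)

# Keep only the cells for which the rate could be computed.
usable = cells.dropna(subset=["tyott"])
print(usable[["YKR_ID", "tyott"]])
```

Replacing -1 with NaN before dividing means that suppressed cells drop out on their own instead of producing a misleading rate.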

The data processing workflow is as follows:

- Loop over the travel time matrix files and, for each file:
    - Process the file only if its grid cell ID is found in the YKR grid
    - Join the travel time data with the dataframe on jobs and population in each grid cell
    - For each transport mode, calculate the following into new columns:
       - The average travel time to all the jobs in the Helsinki region
       - The number of jobs within 30 minutes (the average travel time to work in the capital region)
- Join this data to the population YKR grid cell by cell and save the dataframe with the new columns included
- Create correlation graphs with unemployment and job accessibility on their own axes
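
The two accessibility metrics in the workflow above can be sketched on toy data for a single cell and a single transport mode. The travel times and job counts below are invented. The column names (`from_id`, `walk_t`, `tp_tp_yht`) follow the ones used later in this notebook. The average is assumed to be a job-weighted mean over reachable cells only, and the threshold is applied as strictly under 30 minutes.

```python
import pandas as pd

TT_MAX = 30  # maximum travel time in minutes

# Toy travel times (minutes) to one destination cell; -1 marks "unreachable".
times = pd.DataFrame({
    "from_id": [1, 2, 3, 4],
    "walk_t": [10, 25, 40, -1],
})

# Toy number of jobs located in each grid cell.
jobs = pd.DataFrame({
    "YKR_ID": [1, 2, 3, 4],
    "tp_tp_yht": [100, 50, 200, 30],
})

# Attach the job counts to the travel times and drop unreachable cells.
merged = times.merge(jobs, left_on="from_id", right_on="YKR_ID")
reachable = merged[merged["walk_t"] != -1]

# Number of jobs located under TT_MAX minutes away.
njobs_walk = reachable.loc[reachable["walk_t"] < TT_MAX, "tp_tp_yht"].sum()

# Job-weighted average travel time over the reachable cells.
avg_ttime_walk = (
    (reachable["walk_t"] * reachable["tp_tp_yht"]).sum()
    / reachable["tp_tp_yht"].sum()
)

print(njobs_walk, round(avg_ttime_walk, 2))
```

Weighting by job counts means that a distant cell with many jobs pulls the average up more than a distant cell with few jobs.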
   

In [None]:
#Import necessary libraries
import pandas as pd
import geopandas as gpd
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
#Read the YKR dataset with unemployment and job numbers in each cell
grid_data=gpd.read_file("ykr_jobaccess.gpkg")

In [None]:
#Folder containing the travel times to each cell (built with os.path.join so the path is not tied to one OS)
matrixfold=os.path.join("data", "HelsinkiTravelTimeMatrix2018")

In [None]:
#Lists of tuples pairing each new column (average travel time / number of jobs accessible within 30 min)
#with the travel time column of the corresponding transport mode, to make looping over the files easier
avg_ttimes=[("avg_ttime_car", "car_r_t"), ("avg_ttime_walk", "walk_t"), ("avg_ttime_bike", "bike_s_t"), ("avg_ttime_pt", "pt_r_t")]
njobs=[("njobs_car", "car_r_t"), ("njobs_pt", "pt_r_t") , ("njobs_bike", "bike_s_t"), ("njobs_walk", "walk_t")]

#Set up the max travel time for travelling to work to calculate the number of accessible jobs
tt_max=30

In [None]:
#Set up the new columns for the average travel times and the number of accessible jobs.
#NaN (rather than None) keeps the columns numeric
for new_col, _ in avg_ttimes+njobs:
    grid_data[new_col]=np.nan

#Select only the rows with adequate data on unemployment. The IDs of these cells are used to filter the travel time matrix files
grid_data_filt=grid_data[grid_data["pop_pt_tyott"].notnull() & (grid_data["pop_pt_tyott"]!=-1)]
valid_ids=set(grid_data_filt["YKR_ID"])

#Loop over all the travel time matrix text files containing the travel times to each cell
dirlist=[x[0] for x in os.walk(matrixfold)]
for fold in dirlist[1:]:
        for f in os.listdir(fold):
            file=pd.read_csv(os.path.join(fold, f), sep=";")
            cellid=file["to_id"][0]
            
            #Check if the cell is included in the dataframe (i.e. the number of unemployed has been reported for it)
            #and, if so, join the job data to the travel time data
            if cellid in valid_ids:
                print(cellid)
                
                merged_file=file.merge(grid_data[["YKR_ID", "tp_tp_yht"]], left_on="from_id", right_on="YKR_ID")
                
                #Loop over the transport modes and calculate the number of jobs and the average travel time for each
                #of them. Grid cells marked with -1 are unreachable and are left out. Insert the results into the
                #corresponding columns of the original grid data
                for tmean in njobs:
                    jobnum=merged_file[(merged_file[tmean[1]]<tt_max) & (merged_file[tmean[1]]!=-1)]["tp_tp_yht"].sum()

                    grid_data.loc[grid_data["YKR_ID"]==cellid, tmean[0]]=jobnum
                for tmean in avg_ttimes:
                    #Job-weighted mean travel time; the same reachable cells are used in the numerator and the denominator
                    reachable=merged_file[merged_file[tmean[1]]!=-1]
                    ttime_mean=(reachable[tmean[1]]*reachable["tp_tp_yht"]).sum()/reachable["tp_tp_yht"].sum()

                    grid_data.loc[grid_data["YKR_ID"]==cellid, tmean[0]]=ttime_mean

In [None]:
#Keep only the cells where the unemployment rate is available and the numbers of unemployed and employed
#people have been reported (i.e. are not marked with -1). Copy the result so that it can be modified safely
pop_grid_data=grid_data[grid_data["tyott"].notnull() & (grid_data["pop_pt_tyott"]!=-1) & (grid_data["pop_pt_tyoll"]!=-1)].copy()

In [None]:
#Save the calculated dataframes (with and without the population filter) to new files
grid_data.to_file("ykr_jobaccess_calculated.gpkg")

pop_grid_data.to_file("ykr_jobaccess_calculated_popfilt.gpkg")

In [None]:
#Change the datatype in the columns indicating the number of available jobs to int for better formatting
pop_grid_data["njobs_walk"]=pop_grid_data["njobs_walk"].astype("int64")
pop_grid_data["njobs_car"]=pop_grid_data["njobs_car"].astype("int64")
pop_grid_data["njobs_bike"]=pop_grid_data["njobs_bike"].astype("int64")
pop_grid_data["njobs_pt"]=pop_grid_data["njobs_pt"].astype("int64")

In [None]:
#Correlation graphs with the unemployment % on the Y-axis and the number of jobs on the X-axis

fig, ax = plt.subplots(nrows=2, ncols=2, sharey=True, constrained_layout=True)


i=0
axlabels={"njobs_car": "Car",
          "njobs_walk": "Walking",
          "njobs_bike": "Bicycle",
          "njobs_pt": "Public Transport"
         }

for row in ax:
    for col in row:
        x=njobs[i][0]
        y="tyott"


        pop_grid_data.plot.scatter(x, y, s=1, ax=col)
        m, b = np.polyfit(pop_grid_data[x], pop_grid_data[y], 1)

        col.plot(pop_grid_data[x], m*pop_grid_data[x]+b, color="red")
        
        spr=stats.spearmanr(pop_grid_data[x],pop_grid_data[y])
        
        col.set_xlabel("Jobs Reached by {} \n Spearman cor:{:.8} \n p: {} \n".format(axlabels[njobs[i][0]], spr[0], spr[1]))
        col.set_ylabel("Unemployment (%)", fontdict=dict(weight='bold'))
        
        i+=1