# Data Project

Install the DST API data reader (`dstapi`) and `pandas-datareader`. If you want to run our code but have not yet installed the two packages, remove the hashtag in front of `%pip install git+https://github.com/alemartinello/dstapi` and `%pip install pandas-datareader` in the cell below and run it once. Then comment the lines out again.

In [32]:
# The DST API wrapper
#%pip install git+https://github.com/alemartinello/dstapi

# A wrapper for multiple APIs with a pandas interface
#%pip install pandas-datareader

Imports and set magics:

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets

#from dataproject_sript import *

from dstapi import DstApi # install with `pip install git+https://github.com/alemartinello/dstapi`
import pandas_datareader # install with `pip install pandas-datareader`

# Read and clean data

Consider the following dictionary definitions:

In [34]:
columns_dict = {}
columns_dict['TRANSMID'] = 'vehicle'
columns_dict['TID'] = 'year'
columns_dict['INDHOLD'] = 'value'

We will download all data from table PKM1 and ...  using DstApi. First we will read and clean data from table PKM1.

In [35]:
pkm1_api = DstApi('PKM1') # load the table by writing its name inside the parentheses
params = pkm1_api._define_base_params(language='en') # no restrictions, only that we want the table in English

pkm1 = pkm1_api.get_data(params=params) # get the data; params defines what we want to include from the table
pkm1.head()

Unnamed: 0,TRANSMID,TID,INDHOLD
0,VEHICLES ON THE ROAD TOTAL,1981,..
1,Bicycles/Mopeds max. 30 km/h,1981,..
2,Motor vehicles total,1981,46168
3,Private cars and vans under 2.001 kg.,1981,36854
4,Vans over 2.000 kg.,1981,3795


Rename the columns using columns_dict ('TRANSMID' becomes 'vehicle', 'TID' becomes 'year' and 'INDHOLD' becomes 'value'):

In [36]:
pkm1.rename(columns=columns_dict,inplace=True)
pkm1.head(14)

Unnamed: 0,vehicle,year,value
0,VEHICLES ON THE ROAD TOTAL,1981,..
1,Bicycles/Mopeds max. 30 km/h,1981,..
2,Motor vehicles total,1981,46168
3,Private cars and vans under 2.001 kg.,1981,36854
4,Vans over 2.000 kg.,1981,3795
5,Taxis,1981,441
6,Motorcycles,1981,282
7,Mopeds max. 45 km/h,1981,0
8,Buses and coaches total,1981,4797
9,Scheduled buses,1981,2418


The dataset contains the following vehicle categories, which do not fit into our analysis and are therefore dropped:

In [37]:
# Build up a logical index I
I = pkm1.vehicle.str.contains('VEHICLES ON THE ROAD TOTAL')
I |= pkm1.vehicle.str.contains('Motor vehicles total')
I |= pkm1.vehicle.str.contains('Vans over 2.000 kg.')
I |= pkm1.vehicle.str.contains('Scheduled buses')
I |= pkm1.vehicle.str.contains('Coaches and other buses')
pkm1.loc[I, :]

pkm1 = pkm1.loc[~I] # keep everything else
pkm1.reset_index(inplace = True, drop = True) # Drop old index too
pkm1.head(9)

Unnamed: 0,vehicle,year,value
0,Bicycles/Mopeds max. 30 km/h,1981,..
1,Private cars and vans under 2.001 kg.,1981,36854
2,Taxis,1981,441
3,Motorcycles,1981,282
4,Mopeds max. 45 km/h,1981,0
5,Buses and coaches total,1981,4797
6,Train,1981,4724
7,Ship,1981,..
8,Aeroplane,1981,..

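One thing worth knowing about the filter above: `str.contains` treats its pattern as a regular expression by default, so the dots in a label such as 'Vans over 2.000 kg.' match any character. A minimal sketch of an exact-match alternative using `isin` follows. The toy frame and the names `toy`, `drop_labels`, `is_aggregate` and `kept` are illustrative assumptions, not part of our notebook:

```python
import pandas as pd

# Toy stand-in for pkm1 (a few of the 1981 rows shown above)
toy = pd.DataFrame({
    "vehicle": ["VEHICLES ON THE ROAD TOTAL", "Motor vehicles total", "Taxis", "Train"],
    "year": [1981, 1981, 1981, 1981],
    "value": ["..", "46168", "441", "4724"],
})

# Hypothetical list of aggregate labels to drop
drop_labels = ["VEHICLES ON THE ROAD TOTAL", "Motor vehicles total"]

# isin compares whole strings, so no regex characters are interpreted
is_aggregate = toy["vehicle"].isin(drop_labels)
kept = toy.loc[~is_aggregate].reset_index(drop=True)
print(kept)

# The regex pitfall: '.' matches any character unless regex=False is passed
lookalike = pd.Series(["Vans over 2x000 kgs"])
matches_as_regex = bool(lookalike.str.contains("Vans over 2.000 kg.").iloc[0])
matches_literal = bool(lookalike.str.contains("Vans over 2.000 kg.", regex=False).iloc[0])
print(matches_as_regex, matches_literal)
```

With the labels in PKM1 both approaches should select the same rows; the exact match is simply stricter.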

Convert the values to numeric so that they can be used in calculations later in the analysis. First remove the missing values, marked '..'; after that the values can be converted from strings to floats:

In [54]:
# remove rows where 'value' is '..' (copy so we do not modify a view of the original frame)
pkm1 = pkm1[pkm1.value != '..'].copy()

# convert values to numeric
pkm1['value'] = pkm1['value'].astype('float')

pkm1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 364 entries, 260 to 303
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   vehicle  364 non-null    object 
 1   year     364 non-null    int64  
 2   value    364 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 11.4+ KB

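As an alternative to filtering out '..' by hand, the two steps can be combined with `pd.to_numeric`. This is a small sketch on a made-up series (the name `raw` and its contents are illustrative), not the code we ran:

```python
import pandas as pd

# Toy column mixing the '..' placeholder with numbers stored as strings
raw = pd.Series(["..", "46168", "441", ".."], name="value")

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Dropping the NaN rows leaves only the real observations, as floats
clean = numeric.dropna()
print(clean)
```

The advantage is that any other non-numeric placeholder would also become NaN instead of raising an error in `astype`.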

Furthermore, we may want to drop the years without any information.

Sort by vehicle and year:

In [39]:
pkm1.sort_values(by=['vehicle', 'year'],inplace=True)
pkm1.head(33)

Unnamed: 0,vehicle,year,value
260,Aeroplane,1990,476.0
134,Aeroplane,1991,442.0
314,Aeroplane,1992,457.0
323,Aeroplane,1993,449.0
143,Aeroplane,1994,478.0
332,Aeroplane,1995,497.0
71,Aeroplane,1996,541.0
341,Aeroplane,1997,519.0
350,Aeroplane,1998,424.0
359,Aeroplane,1999,398.0

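If we decide to restrict the sample to years where every vehicle has an observation, one possible approach is sketched below. The toy numbers are made up and the names `per_year`, `full_years` and `balanced` are our own illustrative choices:

```python
import pandas as pd

# Toy data: Train has one more year (1989) than Aeroplane
toy = pd.DataFrame({
    "vehicle": ["Train", "Train", "Train", "Aeroplane", "Aeroplane"],
    "year": [1989, 1990, 1991, 1990, 1991],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],  # made-up values
})

# Count how many distinct vehicles are observed in each year
n_vehicles = toy["vehicle"].nunique()
per_year = toy.groupby("year")["vehicle"].nunique()

# Keep only the years where all vehicles are present
full_years = per_year[per_year == n_vehicles].index
balanced = toy[toy["year"].isin(full_years)].sort_values(["vehicle", "year"])
print(balanced)
```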

# Explore the data set 

Interactive plot of the data from table PKM1, showing the development in the use of each vehicle, measured in million person-km (y-axis), over the years (x-axis):

In [40]:
# define the plot_e function
def plot_e(df, vehicle): 
    I = df['vehicle'] == vehicle
    ax=df.loc[I,:].plot(x='year', y='value', style='-o', legend=False)

In [41]:
# interactive plot using widgets and the defined plot_e function
widgets.interact(plot_e, 
    df = widgets.fixed(pkm1),
    vehicle = widgets.Dropdown(description='vehicle', 
                                    options=pkm1.vehicle.unique(), # we choose to look at vehicles
                                    value='Train') # Train is the start observation
);

[Interactive widget output: 'vehicle' dropdown with a line plot of value by year for the selected vehicle]

Surprisingly, we see a fall in the use of Bicycles/Mopeds max. 30 km/h, even though we see ourselves as the cycling nation of the world.

Train use increased from 1980 to around 2015. After 2015 we see a slight fall, followed by the sharp drop during the COVID-19 pandemic.

Buses and coaches total reached its second-highest level just before 2000, after which it decreased. From around 2013 it began to increase again, until the pandemic hit. After the pandemic, bus transport jumped to its highest level ever, which could be due to many different factors.

Private cars and vans have increased from 1980 to today, again with a slight fall due to the pandemic.

Air transport has had a generally negative trend from 1990 to just before the pandemic, with a peak before the financial crisis.

Checking the outlier for buses and coaches total in 2022

In [42]:
filtered_df = pkm1[pkm1['vehicle'] == 'Buses and coaches total']
filtered_df

Unnamed: 0,vehicle,year,value
194,Buses and coaches total,1980,4611.0
5,Buses and coaches total,1981,4797.0
248,Buses and coaches total,1982,5183.0
122,Buses and coaches total,1983,5378.0
59,Buses and coaches total,1984,5642.0
203,Buses and coaches total,1985,5938.0
212,Buses and coaches total,1986,6171.0
95,Buses and coaches total,1987,6271.0
104,Buses and coaches total,1988,6292.0
14,Buses and coaches total,1989,6343.0

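One way to locate an outlier like this numerically is to look at year-over-year growth. The sketch below uses only the 1980 to 1989 rows printed above, so it does not cover 2022; the names `buses`, `growth` and `largest_jump_year` are illustrative:

```python
import pandas as pd

# Buses and coaches total, 1980-1989, copied from the output above
buses = pd.Series(
    [4611.0, 4797.0, 5183.0, 5378.0, 5642.0, 5938.0, 6171.0, 6271.0, 6292.0, 6343.0],
    index=pd.Index(range(1980, 1990), name="year"),
    name="value",
)

# Percentage change from the previous year (the first year is NaN)
growth = buses.pct_change() * 100

# Year with the largest relative increase; NaN is skipped by idxmax
largest_jump_year = growth.idxmax()
print(growth.round(2))
print(largest_jump_year)
```

Applied to the full series, an unusually large value of `growth` in 2022 would confirm what the plot suggests.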

# Merge with population data from Statistics Denmark


We want the value (million person-km) from the PKM1 table in per capita terms. To get that, we need to download population data from Statistics Denmark and merge it with our data from PKM1:

In [43]:
FT_api = DstApi('FT')
params = FT_api._define_base_params(language='en')
params['variables'][0]['values'] = ['000'] 
# '000' is the code for all of Denmark; this can be seen by using: FT_api.variable_levels('HOVEDDELE', language='en')
pop = FT_api.get_data(params=params)

pop.rename(columns={'TID':'year','INDHOLD':'population'},inplace=True)
pop =  pop.loc[:,['year','population']]
pop.head()

Unnamed: 0,year,population
0,2010,5534738
1,1769,797584
2,1840,1289075
3,1860,1608362
4,1901,2449540


Merge the population data set with the PKM1 data set. When we do that, we get an extra column with population:

In [44]:
merged = pd.merge(pkm1,pop,how='left',on=['year'])
merged.head(33)

Unnamed: 0,vehicle,year,value,population
0,Aeroplane,1990,476.0,5135409
1,Aeroplane,1991,442.0,5146469
2,Aeroplane,1992,457.0,5162126
3,Aeroplane,1993,449.0,5180614
4,Aeroplane,1994,478.0,5196642
5,Aeroplane,1995,497.0,5215718
6,Aeroplane,1996,541.0,5251027
7,Aeroplane,1997,519.0,5275121
8,Aeroplane,1998,424.0,5294860
9,Aeroplane,1999,398.0,5313577

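A left merge silently keeps rows that find no match and fills them with NaN, so it can be worth checking the result. The sketch below uses the `validate` and `indicator` arguments of `pd.merge` on toy frames. The population row for 1992 is deliberately left out for illustration, and the names `left`, `pop_toy`, `checked` and `unmatched` are assumptions:

```python
import pandas as pd

# Toy version of pkm1 (Aeroplane rows from the output above)
left = pd.DataFrame({
    "vehicle": ["Aeroplane", "Aeroplane", "Aeroplane"],
    "year": [1990, 1991, 1992],
    "value": [476.0, 442.0, 457.0],
})

# Toy population table; 1992 is omitted on purpose to create an unmatched row
pop_toy = pd.DataFrame({"year": [1990, 1991], "population": [5135409, 5146469]})

# validate raises an error if a year appears more than once in pop_toy;
# indicator adds a '_merge' column telling where each row came from
checked = pd.merge(left, pop_toy, how="left", on="year",
                   validate="many_to_one", indicator=True)

# Years in the left frame without population data
unmatched = checked.loc[checked["_merge"] == "left_only", "year"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every year in the transport data found a population figure.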

# Analysis 

First we would like to get the value (million person-km) in per capita terms, so we can check whether the increases and decreases are driven by population growth or not.

Before we can do the calculation, we need to check whether population is numeric. If it is not, we will have to convert it:

In [45]:
# check the Dtype of the population column
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   vehicle     364 non-null    object 
 1   year        364 non-null    int64  
 2   value       364 non-null    float64
 3   population  364 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 11.5+ KB


Population is already numeric (int64). We convert it to float so that it has the same dtype as value:

In [46]:
# change dtype from integer to float
merged.population = merged.population.astype('float')
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   vehicle     364 non-null    object 
 1   year        364 non-null    int64  
 2   value       364 non-null    float64
 3   population  364 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 11.5+ KB


In [47]:
#merged.head(33)

Now we can do the calculation

In [48]:
merged['pr.capita'] = (merged['value']*1000000)/merged['population']
merged.head(33)

Unnamed: 0,vehicle,year,value,population,pr.capita
0,Aeroplane,1990,476.0,5135409.0,92.689794
1,Aeroplane,1991,442.0,5146469.0,85.884128
2,Aeroplane,1992,457.0,5162126.0,88.529416
3,Aeroplane,1993,449.0,5180614.0,86.669264
4,Aeroplane,1994,478.0,5196642.0,91.982476
5,Aeroplane,1995,497.0,5215718.0,95.288894
6,Aeroplane,1996,541.0,5251027.0,103.027465
7,Aeroplane,1997,519.0,5275121.0,98.386369
8,Aeroplane,1998,424.0,5294860.0,80.07766
9,Aeroplane,1999,398.0,5313577.0,74.902462

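To make the unit conversion explicit, here is the first row worked by hand: value is in million person-km, so multiplying by 1,000,000 and dividing by the population gives person-km per person. The variable names are illustrative; the numbers are taken from the Aeroplane 1990 row above:

```python
import pandas as pd

# Aeroplane, 1990: 476 million person-km and a population of 5,135,409
value_mio = 476.0
population = 5135409

# Million person-km -> person-km, then divide by the number of people
per_capita = value_mio * 1_000_000 / population
print(per_capita)

# The same division on pandas columns, with population left as int64
toy = pd.DataFrame({"value": [476.0], "population": [5135409]})
toy["pr.capita"] = toy["value"] * 1_000_000 / toy["population"]
print(toy.dtypes)
```

The result agrees with the pr.capita column above, and the second part suggests that the division also works when population is still an integer column.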

We want to recreate the interactive graph with the new pr. capita numbers:

In [49]:
# again define the plot_e function, now with pr. capita on y-axis
def plot_e(df, vehicle): 
    I = df['vehicle'] == vehicle
    ax=df.loc[I,:].plot(x='year', y='pr.capita', style='-o', legend=False)

In [50]:
# interactive plot using widgets and the defined plot_e function
widgets.interact(plot_e, 
    df = widgets.fixed(merged),
    vehicle = widgets.Dropdown(description='vehicle', 
                                    options=merged.vehicle.unique(), # we again choose to look at vehicles, so we can compare with graph from earlier
                                    value='Train') # Train is the start observation
);

[Interactive widget output: 'vehicle' dropdown with a line plot of pr.capita by year for the selected vehicle]

The reason we want per capita terms is that, for example, if you observe an increase in the absolute number of train riders over time, you might conclude that train transportation is becoming more popular than it really is. The increase could be a by-product of a growing population.

We do not see any big changes; the trends are similar to those in the earlier interactive plot. The peaks and valleys do not differ much once we correct for population.

Secondly, we want to index the merged data with 2015 as the base year. We use our index_to_year function, defined below, which returns the dataframe indexed to a given year, in our case 2015:

In [51]:
def index_to_year(df, base_year):
    # Ensure 'year' column exists
    if 'year' not in df.columns:
        raise ValueError("The DataFrame does not contain a 'year' column.")
    
    # Check if base year is in the 'year' column
    if not df['year'].isin([base_year]).any():
        raise ValueError(f"Base year {base_year} not found in DataFrame's 'year' column")
    
    # Set the index to 'vehicle' and 'year' if not already set
    if set(df.index.names) != {'vehicle', 'year'}:
        df = df.set_index(['vehicle', 'year'])
    
    # Isolate the base year data
    base_year_df = df.xs(base_year, level='year')
    
    # Divide each row by its corresponding base year value and multiply by 100
    df_normalized = df.div(base_year_df, level='vehicle') * 100
    
    # Reset the index so 'vehicle' and 'year' are columns again
    df_normalized.reset_index(inplace=True)
    
    return df_normalized

In [52]:
merged_index2015 = index_to_year(merged, 2015)
merged_index2015.head(33)

Unnamed: 0,vehicle,year,value,population,pr.capita
0,Aeroplane,1990,133.333333,90.736177,146.946167
1,Aeroplane,1991,123.809524,90.931593,136.156774
2,Aeroplane,1992,128.011204,91.208232,140.350494
3,Aeroplane,1993,125.770308,91.534892,137.401493
4,Aeroplane,1994,133.893557,91.818086,145.824818
5,Aeroplane,1995,139.215686,92.155135,151.066662
6,Aeroplane,1996,151.540616,92.779,163.335039
7,Aeroplane,1997,145.378151,93.204711,155.977257
8,Aeroplane,1998,118.767507,93.553474,126.951466
9,Aeroplane,1999,111.484594,93.88418,118.746944

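As the output shows, index_to_year indexes every numeric column, including population. If only one column should be indexed, a simpler variant could look like the sketch below. The function `index_column` and the toy values are hypothetical and only illustrate the idea of dividing by the base-year value within each vehicle:

```python
import pandas as pd

# Made-up values for two vehicles over three years
toy = pd.DataFrame({
    "vehicle": ["Train", "Train", "Train", "Taxis", "Taxis", "Taxis"],
    "year": [2014, 2015, 2016, 2014, 2015, 2016],
    "value": [90.0, 100.0, 110.0, 40.0, 50.0, 45.0],
})

def index_column(df, column, base_year):
    """Return `column` indexed so that base_year equals 100 for each vehicle."""
    # Base-year value per vehicle, as a Series keyed by vehicle name
    base = df.loc[df["year"] == base_year].set_index("vehicle")[column]
    # Look up each row's base value via its vehicle and rescale
    return df[column] / df["vehicle"].map(base) * 100

toy["index_2015"] = index_column(toy, "value", 2015)
print(toy)
```

Here the other columns stay in their original units, which makes it easier to plot the index next to the raw numbers.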