# Data Project

Introduction ....

# Importing packages and loading data

Install the DST API data reader and pandas-datareader. If you want to run our code but have not yet installed the two packages (%pip install git+https://github.com/alemartinello/dstapi and %pip install pandas-datareader), remove the hash sign in front of the two lines below and run the cell once. Then comment the lines out again.

In [55]:
# The DST API wrapper
#%pip install git+https://github.com/alemartinello/dstapi

# A wrapper for multiple APIs with a pandas interface
#%pip install pandas-datareader

Imports and set magics:

In [56]:
# import packages 
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# Import definitions from the py file
from dataproject import *

# import the DST API wrapper and pandas-datareader
from dstapi import DstApi 
import pandas_datareader

# import raw csv file data
#pris111 = pd.read_csv("/Users/jacob/Documents/GitHub/projects-2024-jacobogmads/Jacob/Data project/data.csv", encoding='ISO-8859-1', skiprows=[0])
#pris111.head(20)

# Read and clean data

This assignment looks at two data sets, loaded with two different methods. The first data set is PKM1 from Statistics Denmark, which we download using DstApi. The second data set is PRIS111 (Forbrugerprisindekset, the consumer price index), which we have downloaded manually as a CSV file. We first read and clean the data from table PKM1.

Consider the following dictionary definitions:

In [57]:
columns_dict = {}
columns_dict['TRANSMID'] = 'vehicle'
columns_dict['TID'] = 'year'
columns_dict['INDHOLD'] = 'mio_personkm' # kilometers traveled by persons with a vehicle, in millions

We download all data from table PKM1 and ... using DstApi.

In [58]:
pkm1_api = DstApi('PKM1') # load the table by passing its name to DstApi
params = pkm1_api._define_base_params(language='en') # no restrictions, except that we want the table in English

pkm1 = pkm1_api.get_data(params=params) # fetch the data selected by params
pkm1.head()

Unnamed: 0,TRANSMID,TID,INDHOLD
0,VEHICLES ON THE ROAD TOTAL,1981,..
1,Bicycles/Mopeds max. 30 km/h,1981,..
2,Motor vehicles total,1981,46168
3,Private cars and vans under 2.001 kg.,1981,36854
4,Vans over 2.000 kg.,1981,3795


Rename the columns using the dictionary above, so that e.g. 'TRANSMID' becomes 'vehicle':

In [59]:
pkm1.rename(columns=columns_dict,inplace=True)
pkm1.head(14)

Unnamed: 0,vehicle,year,mio_personkm
0,VEHICLES ON THE ROAD TOTAL,1981,..
1,Bicycles/Mopeds max. 30 km/h,1981,..
2,Motor vehicles total,1981,46168
3,Private cars and vans under 2.001 kg.,1981,36854
4,Vans over 2.000 kg.,1981,3795
5,Taxis,1981,441
6,Motorcycles,1981,282
7,Mopeds max. 45 km/h,1981,0
8,Buses and coaches total,1981,4797
9,Scheduled buses,1981,2418


The data set contains the following vehicle categories, which do not fit into our analysis and are therefore dropped:

In [60]:
# Build up a logical index I of the rows to drop
# (regex=False so that '.' is matched literally and not as a wildcard)
I = pkm1.vehicle.str.contains('VEHICLES ON THE ROAD TOTAL', regex=False)
I |= pkm1.vehicle.str.contains('Motor vehicles total', regex=False)
I |= pkm1.vehicle.str.contains('Vans over 2.000 kg.', regex=False)
I |= pkm1.vehicle.str.contains('Scheduled buses', regex=False)
I |= pkm1.vehicle.str.contains('Coaches and other buses', regex=False)

pkm1 = pkm1.loc[~I].copy() # keep everything else
pkm1.reset_index(inplace = True, drop = True) # Drop old index too
pkm1.head(9)

Unnamed: 0,vehicle,year,mio_personkm
0,Bicycles/Mopeds max. 30 km/h,1981,..
1,Private cars and vans under 2.001 kg.,1981,36854
2,Taxis,1981,441
3,Motorcycles,1981,282
4,Mopeds max. 45 km/h,1981,0
5,Buses and coaches total,1981,4797
6,Train,1981,4724
7,Ship,1981,..
8,Aeroplane,1981,..
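
The same kind of filter can also be written with exact matching. The sketch below is only an illustration: it uses a small toy frame (a few rows copied from the output above, not the full PKM1 table) and the hypothetical names toy and drop_these. It uses isin instead of str.contains, so whole labels are compared and no regular expressions are involved.

```python
import pandas as pd

# Toy data standing in for pkm1 (a few rows taken from the 1981 output)
toy = pd.DataFrame({
    'vehicle': ['Motor vehicles total', 'Taxis', 'Train', 'Scheduled buses'],
    'year': [1981, 1981, 1981, 1981],
    'mio_personkm': ['46168', '441', '4724', '2418'],
})

# Aggregates and sub-categories we do not want to keep (illustrative list)
drop_these = ['Motor vehicles total', 'Scheduled buses']

# isin matches whole strings exactly; ~ negates the boolean mask
toy_clean = toy.loc[~toy['vehicle'].isin(drop_these)].reset_index(drop=True)
print(toy_clean['vehicle'].tolist())
```

Exact matching is stricter than substring matching, so it only works when the labels are spelled exactly as in the table.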


Convert mio_personkm to numeric, so that it can later be used in mathematical operations in the analysis. First we remove the rows where mio_personkm is missing ('..'); afterwards the column can be converted from strings to floats:

In [61]:
# remove rows where 'mio_personkm' is '..' (copy so we do not modify a view)
pkm1 = pkm1[pkm1.mio_personkm != '..'].copy()

# convert mio_personkm to numeric
pkm1['mio_personkm'] = pkm1['mio_personkm'].astype('float')

pkm1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 364 entries, 1 to 386
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   vehicle       364 non-null    object 
 1   year          364 non-null    int64  
 2   mio_personkm  364 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 11.4+ KB
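
An alternative way to handle the '..' markers is to let pandas coerce them to NaN and drop them afterwards. This is a sketch on toy data under that assumption, not the code used above; the frame toy is hypothetical.

```python
import pandas as pd

# Toy data: '..' marks a missing observation, as in the DST tables
toy = pd.DataFrame({
    'vehicle': ['Ship', 'Taxis', 'Train'],
    'mio_personkm': ['..', '441', '4724'],
})

# errors='coerce' turns anything that cannot be parsed (here '..') into NaN
toy['mio_personkm'] = pd.to_numeric(toy['mio_personkm'], errors='coerce')

# Drop the rows that had no value and continue with an independent copy
toy = toy.dropna(subset=['mio_personkm']).copy()
print(toy.dtypes)
```

The advantage of this variant is that any other non-numeric marker would also end up as NaN instead of raising an error in astype.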


Sort by vehicle and year:

In [62]:
pkm1.sort_values(by=['vehicle', 'year'],inplace=True)
pkm1.head(33)

Unnamed: 0,vehicle,year,mio_personkm
260,Aeroplane,1990,476.0
134,Aeroplane,1991,442.0
314,Aeroplane,1992,457.0
323,Aeroplane,1993,449.0
143,Aeroplane,1994,478.0
332,Aeroplane,1995,497.0
71,Aeroplane,1996,541.0
341,Aeroplane,1997,519.0
350,Aeroplane,1998,424.0
359,Aeroplane,1999,398.0


Our data for transportation by vehicle has now been cleaned, ready to be used. 

Now we read and clean the PRIS111 table using the downloaded CSV file. First we create a dictionary that maps the Danish labels to English names, and then we perform the remapping.


In [63]:
# Define the new translated names
var_dict = {
     '00 Forbrugerprisindekset i alt': 'General Consumer Price Index',
     '07.2 Drift af personlige transportmidler': 'Passenger transport by personal transportation',
     '07.2.1 Reservedele og tilbehï¿½r': 'Spare parts and accessories',
     '07.2.2 Brï¿½ndstof': 'Fuel',
     '07.2.3 Vedligeholdelse og reparation af personlige transportmidler': 'Maintenance and repair of personal transportation equipment',
     '07.3.1.1 Personbefordring med tog': 'Passenger transport by train',
     '07.3.1.2 Personbefordring med metro': 'Passenger transport by metro',
     '07.3.2.1Personbefordring med bus': 'Passenger transport by bus',
     '07.3.2.2 Personbefordring med taxi og lejet bil med fører': 'Passenger transport by taxi and rented car with driver',
     '07.3.3.1 Indenrigsflyvning': 'Personal transport by domestic flights',
     '07.3.4 Personbefordring med fï¿½rge': 'Passenger transport by ferry',
     '07.3.4.1 Personbefordring ad søvejen': 'Passenger transport by sea',
     'ï¿½ndring i forhold til mï¿½neden fï¿½r (pct.)': 'Change compared to the previous month (pct.)',
     'ï¿½ndring i forhold til samme mï¿½ned ï¿½ret fï¿½r (pct.)': 'Change compared to the same month last year (pct.)'
 }

We rename the indexes:

In [64]:
# Rename the indexes
pris111.replace(var_dict, inplace=True)

NameError: name 'pris111' is not defined

We continue by dropping the rows we are not interested in. We then reset the index.

In [None]:
pris111 = pris111.drop(pris111.index[9:])
pris111 = pris111.drop(pris111.index[0])
pris111.reset_index(inplace = True, drop = True)
pris111

NameError: name 'pris111' is not defined

Now we rename our index-column to Category.

In [None]:
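# rename the second column (the category labels) to 'Category'
pris111 = pris111.rename(columns={pris111.columns[1]: 'Category'})
pris111.iloc[[]] # show only the column headers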

NameError: name 'pris111' is not defined

We now want to average the monthly values within each year, so that they become comparable with the rest of our data, which is yearly. This takes a bit of manipulation. First we need to ensure that our column names are correctly formatted.

In [None]:
# Strip leading/trailing spaces from column names
pris111.columns = pris111.columns.str.strip()

# Ensure column names are in the expected case, here assuming title case for 'Category'
pris111.columns = pris111.columns.str.title()

NameError: name 'pris111' is not defined

We then replace ".." with NaN to handle missing values properly when we aggregate and average the observations.

In [None]:
# Replace '..' with NaN to properly handle missing values during aggregation
pris111.replace('..', pd.NA, inplace=True)

NameError: name 'df' is not defined

Now we make the conversion to long format.

In [None]:
# Convert the DataFrame from wide to long format to easily manipulate the dates and values
pris111_long = pd.melt(pris111, id_vars=["Category"], var_name="Date", value_name="Value")

# Ensure 'Value' is numeric and handle any conversion errors by coercing them to NaN
pris111_long['Value'] = pd.to_numeric(pris111_long['Value'], errors='coerce')

We now convert the dates, which were originally column names in the format yyyyMmm (e.g. 2015M01), to a proper datetime format.

In [None]:
# Convert 'Date' from the custom format 'YYYYMmm' to datetime, correcting the format
pris111_long['Date'] = pd.to_datetime(pris111_long['Date'], format='%YM%m', errors='coerce')

# Dropping rows where Date conversion resulted in NaT to clean up the data
pris111_long.dropna(subset=['Date'], inplace=True)

NameError: name 'pris111_long' is not defined

We finally group by category and year, and calculate the mean for each group.

In [None]:
# Group by Category and Year, then calculate mean for each group
pris111_yearly_mean = pris111_long.groupby(['Category', pris111_long['Date'].dt.year])['Value'].mean().reset_index()

print(pris111_yearly_mean)
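
The steps above (wide to long, parsing the yyyyMmm labels, and averaging per year) can be sketched on a small toy table. The numbers and the names wide, long and yearly are made up for illustration and do not come from PRIS111.

```python
import pandas as pd

# Toy wide table: one row per category, one column per month (made-up numbers)
wide = pd.DataFrame({
    'Category': ['Fuel', 'Passenger transport by train'],
    '2015M01': [100.0, 100.0],
    '2015M02': [102.0, 101.0],
    '2016M01': [110.0, '..'],
})

# Wide -> long: one row per (category, month)
long = wide.melt(id_vars='Category', var_name='Date', value_name='Value')
long['Value'] = pd.to_numeric(long['Value'], errors='coerce')  # '..' -> NaN
long['Date'] = pd.to_datetime(long['Date'], format='%YM%m')    # '2015M02' -> 2015-02-01

# Average the monthly values within each year; NaN values are skipped by mean()
yearly = (long.assign(year=long['Date'].dt.year)
              .groupby(['Category', 'year'])['Value']
              .mean()
              .reset_index())
print(yearly)
```

A group in which every month is missing still appears in the result, with NaN as its mean.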

We then convert back to wide format:

In [None]:
pris111_wide = pris111_yearly_mean.pivot(index='Category', columns='Date', values='Value')

# forward fill to replace NaNs
pris111_wide.ffill(inplace=True)

# Reset the index so 'Category' is a column and not an index
pris111_wide.reset_index(inplace=True)

# Output the wide format DataFrame
pris111_wide.head()

The data from DST was originally indexed with January 2015 as the base month. Since we averaged the values for each year, the 2015 value no longer equals 100. We therefore want to re-index the dataframe. We use our index_to_year function, which is defined in the script and returns the dataframe indexed to a given year, in our case 2015.

In [None]:
pris111_wide_index2015 = index_to_year(pris111_wide, 2015)
pris111_wide_index2015.head()

And we save our dataframe as a long format as well in order to plot it easily for descriptive statistics:

In [None]:
pris111_long = pd.melt(pris111_wide_index2015, id_vars=["Category"], var_name="Date", value_name="Value")
pris111_long

NameError: name 'pris111_wide_index2015' is not defined

Our data for transportation prices has now been cleaned, ready to be used. 

# Explore the data set 

First we examine the PKM1 data set, and afterwards we look at PRIS111. We make an interactive plot of the data from table PKM1, showing the development in the use of each vehicle measured in million person-km (y-axis) over the years (x-axis).

In [None]:
# define the plot_e function
def plot_e(df, vehicle): 
    I = df['vehicle'] == vehicle
    ax=df.loc[I,:].plot(x='year', y='mio_personkm', style='-o', legend=False)

In [None]:
# interactive plot using widgets and the defined plot_e function
widgets.interact(plot_e, 
    df = widgets.fixed(pkm1),
    vehicle = widgets.Dropdown(description='vehicle', 
                                    options=pkm1.vehicle.unique(), # we choose to look at vehicles
                                    value='Train') # Train is the start observation
);

interactive(children=(Dropdown(description='vehicle', index=8, options=('Aeroplane', 'Bicycles/Mopeds max. 30 …

Surprisingly, we see a fall in the use of Bicycles/Mopeds max. 30 km/h, even though we see ourselves as the cycling nation of the world.

The use of trains increased from 1980 until around 2015. After 2015 we see a slight fall in the use of trains, followed by the sharp drop during the Corona pandemic.

Buses and coaches total reached its second-highest level just before 2000, after which it decreased again. From around 2013 it began to increase again, but then Corona came. After Corona, transportation by bus rose sharply to its highest level ever, which could be due to many different things.

Private cars and vans have increased from 1980 to today, again with a slight fall due to Corona.

Transportation by aeroplane has had a generally negative trend from 1990 until just before Corona, with a peak before the financial crisis.

We use the same method to examine the PRIS111 data. The only difference is that the plotting function draws two curves in the same graph, so that two categories can be compared.

In [None]:
# plot two categories in the same figure
def plot_two(df, category1, category2):
    fig, ax = plt.subplots()
    for category in (category1, category2):
        I = df['Category'] == category
        df.loc[I, :].plot(x='Date', y='Value', style='-o', ax=ax, label=category)

# interactive plot using widgets and the plot_two function
categories = pris111_long['Category'].unique()
widgets.interact(plot_two,
    df=widgets.fixed(pris111_long),
    category1=widgets.Dropdown(
        description='Category 1',
        options=categories,
        value=categories[0]  # Default to the first category
    ),
    category2=widgets.Dropdown(
        description='Category 2',
        options=categories,
        value=categories[1] if len(categories) > 1 else categories[0]  # Default to the second category if it exists, otherwise the first
    )
);

The General Consumer Price Index shows a consistent upward trend over time. Growth accelerates noticeably after 2020, indicating a significant rise in consumer prices in recent years.

Passenger Transport by Bus exhibits a gradual increase with some fluctuations. There is a notable rise after 2020, similar to the general consumer price index, suggesting increased costs in bus transport. Specifically, the price of bus transport rises less than the general consumer price index in the period after 2020.

Passenger Transport by Metro follows a slightly more volatile path than bus transport, with sharper increases and some periods of stability. It also shows a steep increase after 2020, pointing to a significant jump in metro transport costs. In contrast to bus transport, the price of metro transport increases more than the general consumer price index, making travel by metro relatively more expensive.

Passenger Transport by Personal Transportation shows a distinct pattern with more pronounced fluctuations. There is a significant dip around 2016, followed by a rapid increase that is particularly sharp after 2020, indicating volatile costs associated with personal transportation. Personal transportation is a category that covers cars, bikes, mopeds, motorcycles etc. and includes the cost of buying, repairing, servicing etc.

Passenger Transport by Sea has a distinct trend with notable dips and recoveries, reflecting the variable costs associated with sea transport. Like others, it shows an upward trend after 2020, but with a notable dip before this recent rise.

# Merge data


We would like to do two merges. In the first merge, we add Denmark's population size to the PKM1 data set.

We want to have the million person-km from the PKM1 table in per capita terms. To get that, we need to download population data from Statistics Denmark and merge it with our data from PKM1:

In [None]:
FT_api = DstApi('FT')
params = FT_api._define_base_params(language='en')
params['variables'][0]['values'] = ['000'] 
## 000 is the code for all of Denmark, this can be seen by using: FT_api.variable_levels('HOVEDDELE', language='en')
pop = FT_api.get_data(params=params)

pop.rename(columns={'TID':'year','INDHOLD':'population'},inplace=True)
pop =  pop.loc[:,['year','population']]
pop.head()

Unnamed: 0,year,population
0,2010,5534738
1,1769,797584
2,1840,1289075
3,1860,1608362
4,1901,2449540


Merge the population data set with the PKM1 data set. When we do that, we get an extra column with population:

In [None]:
merged = pd.merge(pkm1,pop,how='left',on=['year'])
merged.head(33)

Unnamed: 0,vehicle,year,value,population
0,Aeroplane,1990,476.0,5135409
1,Aeroplane,1991,442.0,5146469
2,Aeroplane,1992,457.0,5162126
3,Aeroplane,1993,449.0,5180614
4,Aeroplane,1994,478.0,5196642
5,Aeroplane,1995,497.0,5215718
6,Aeroplane,1996,541.0,5251027
7,Aeroplane,1997,519.0,5275121
8,Aeroplane,1998,424.0,5294860
9,Aeroplane,1999,398.0,5313577
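
To illustrate what the left merge does, here is a minimal sketch on toy frames. The names traffic, pop_toy and merged_toy are hypothetical, the Train value is only illustrative, and validate='m:1' is an optional safeguard that is not used in the cell above.

```python
import pandas as pd

# Toy traffic data: several vehicles per year
traffic = pd.DataFrame({
    'vehicle': ['Aeroplane', 'Aeroplane', 'Train'],
    'year': [1990, 1991, 1990],
    'mio_personkm': [476.0, 442.0, 4724.0],  # Train value is only illustrative
})

# Toy population data: exactly one row per year
pop_toy = pd.DataFrame({'year': [1990, 1991], 'population': [5135409, 5146469]})

# how='left' keeps every traffic row; validate='m:1' raises an error if a
# year appears more than once in the population table
merged_toy = pd.merge(traffic, pop_toy, how='left', on='year', validate='m:1')
print(merged_toy)
```

Each traffic row receives the population of its year, so the population value is repeated for every vehicle observed in that year.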


Before we do the second merge, we index the newly merged data set. We use our index_to_year function, which is defined in the script and returns the dataframe indexed to a given year, in our case 2015:

In [66]:
def index_to_year(df, base_year):
    # Ensure 'year' column exists
    if 'year' not in df.columns:
        raise ValueError("The DataFrame does not contain a 'year' column.")
    
    # Check if base year is in the 'year' column
    if not df['year'].isin([base_year]).any():
        raise ValueError(f"Base year {base_year} not found in DataFrame's 'year' column")
    
    # Set the index to 'vehicle' and 'year' if not already set
    if set(df.index.names) != {'vehicle', 'year'}:
        df = df.set_index(['vehicle', 'year'])
    
    # Isolate the base year data
    base_year_df = df.xs(base_year, level='year')
    
    # Divide each row by its corresponding base year value and multiply by 100
    df_normalized = df.div(base_year_df, level='vehicle') * 100
    
    # Optionally, reset index if you want 'vehicle' and 'year' back as columns
    df_normalized.reset_index(inplace=True)
    
    return df_normalized

In [67]:
merged_index2015 = index_to_year(merged, 2015)
merged_index2015.head(33)

Unnamed: 0,vehicle,year,value,population,pr.capita
0,Aeroplane,1990,133.333333,90.736177,146.946167
1,Aeroplane,1991,123.809524,90.931593,136.156774
2,Aeroplane,1992,128.011204,91.208232,140.350494
3,Aeroplane,1993,125.770308,91.534892,137.401493
4,Aeroplane,1994,133.893557,91.818086,145.824818
5,Aeroplane,1995,139.215686,92.155135,151.066662
6,Aeroplane,1996,151.540616,92.779,163.335039
7,Aeroplane,1997,145.378151,93.204711,155.977257
8,Aeroplane,1998,118.767507,93.553474,126.951466
9,Aeroplane,1999,111.484594,93.88418,118.746944
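
The indexing rule is index_t = 100 * value_t / value_base, applied separately for each vehicle. The sketch below shows this on made-up numbers using a lookup with map rather than the multi-index division in index_to_year; the frame toy and the column index2015 are hypothetical.

```python
import pandas as pd

# Toy data with made-up values for two vehicles and two years
toy = pd.DataFrame({
    'vehicle': ['Train', 'Train', 'Taxis', 'Taxis'],
    'year': [2014, 2015, 2014, 2015],
    'value': [90.0, 120.0, 30.0, 40.0],
})

base_year = 2015

# Value of each vehicle in the base year, looked up by vehicle name
base = toy.loc[toy['year'] == base_year].set_index('vehicle')['value']

# index_t = 100 * value_t / value_base, computed per vehicle
toy['index2015'] = 100 * toy['value'] / toy['vehicle'].map(base)
print(toy)
```

By construction every vehicle has the value 100 in the base year, which is a quick sanity check on the result.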


Now we can do the second merge, where we add the pris111_long data set to merged_index2015.

In [None]:
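# pris111_long stores the year in the column 'Date', so we rename it to 'year' before merging
final_index2015_merge = pd.merge(merged_index2015,pris111_long.rename(columns={'Date':'year'}),how='left',on=['year'])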
final_index2015_merge.head()

# Analysis 

First we would like to get the million person-km in per capita terms, so that we can check whether the increases and decreases are driven by population growth or not.

Before we can do the calculation, we need to check whether population is numeric. If it is not, we have to convert it:

In [None]:
# check the dtype of the population column
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   vehicle     364 non-null    object 
 1   year        364 non-null    int64  
 2   value       364 non-null    float64
 3   population  364 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 11.5+ KB


Population is stored as integers. We convert it to float so that it has the same type as the person-km values:

In [None]:
# change dtype from integer to float
merged.population = merged.population.astype('float')
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   vehicle     364 non-null    object 
 1   year        364 non-null    int64  
 2   value       364 non-null    float64
 3   population  364 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 11.5+ KB


Now we can do the calculation. We multiply by 1,000,000 to go from million person-km to person-km and divide by the population:

In [None]:
merged['pr.capita'] = (merged['mio_personkm']*1000000)/merged['population']
merged.head(33)

Unnamed: 0,vehicle,year,value,population,pr.capita
0,Aeroplane,1990,476.0,5135409.0,92.689794
1,Aeroplane,1991,442.0,5146469.0,85.884128
2,Aeroplane,1992,457.0,5162126.0,88.529416
3,Aeroplane,1993,449.0,5180614.0,86.669264
4,Aeroplane,1994,478.0,5196642.0,91.982476
5,Aeroplane,1995,497.0,5215718.0,95.288894
6,Aeroplane,1996,541.0,5251027.0,103.027465
7,Aeroplane,1997,519.0,5275121.0,98.386369
8,Aeroplane,1998,424.0,5294860.0,80.07766
9,Aeroplane,1999,398.0,5313577.0,74.902462
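
As a quick check of the formula, the first row of the table can be recomputed by hand. This is only a worked example of the arithmetic; the variable names are chosen for illustration.

```python
mio_personkm = 476.0      # Aeroplane, 1990 (from the table above)
population = 5135409.0    # Denmark, 1990 (from the table above)

# million person-km -> person-km, then divide by the number of inhabitants
km_per_capita = mio_personkm * 1_000_000 / population
print(round(km_per_capita, 4))
```

The result is the number of kilometers traveled by aeroplane per inhabitant in 1990 and agrees with the pr.capita column in the first row.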


We want to recreate the interactive graph with the new pr. capita numbers:

In [None]:
# again define the plot_e function, now with pr. capita on y-axis
def plot_e(df, vehicle): 
    I = df['vehicle'] == vehicle
    ax=df.loc[I,:].plot(x='year', y='pr.capita', style='-o', legend=False)

In [None]:
# interactive plot using widgets and the defined plot_e function
widgets.interact(plot_e, 
    df = widgets.fixed(merged),
    vehicle = widgets.Dropdown(description='vehicle', 
                                    options=merged.vehicle.unique(), # we again choose to look at vehicles, so we can compare with graph from earlier
                                    value='Train') # Train is the start observation
);

interactive(children=(Dropdown(description='vehicle', index=8, options=('Aeroplane', 'Bicycles/Mopeds max. 30 …

The reason we want per capita terms is that, for example, if you observe an increase in the absolute number of train passenger kilometers over time, you might conclude that train transportation is becoming more popular than it really is. The increase could simply be a by-product of a larger population.

However, we do not see any big changes. The trend is similar to the interactive plot from earlier. Peaks and valleys do not differ much when corrected for population.

The last interactive graph should go here, followed by some discussion. We plan to reuse the code from the descriptive analysis, where two curves were plotted in the same graph.

In [None]:
# interactive plot using widgets and the defined plot_e function
widgets.interact(plot_e, 
    df=widgets.fixed(final_index2015_merge),
    category1=widgets.Dropdown(
        description='Category 1', 
        options=final_index2015_merge['xxx'].unique(),
        value=final_index2015_merge['xxx'].unique()[0]  # Default to the first unique category value
    ),
    category2=widgets.Dropdown(
        description='Category 2', 
        options=final_index2015_merge['xxx'].unique(),
        value=final_index2015_merge['xxx'].unique()[1] if len(final_index2015_merge['xxx'].unique()) > 1 else final_index2015_merge['xxx'].unique()[0]  # Default to the second unique category value if it exists, otherwise the first
    )
)

# Conclusion 

bla bla bla