# Project Group - 10

Members: Flip Boekhorst, Sjoerd Norbart, Timo van Manen, Justin Plasmeijer, 
Fariba Tavakoli

Student numbers: 4694880, 4700139, 4869494, 5188466, 5716632

# Introduction

In February 2020 the first COVID-19 infection was found in the Netherlands (Rijksoverheid, 2022). After that, events moved quickly. The Netherlands went through an "intelligent lockdown" and a curfew, and the situation worsened again when the Omicron variant arrived. This had serious effects on the mobility and travel patterns of Dutch citizens. Public transport in particular was subject to many restrictions, such as mandatory face masks and reduced seat capacity. As a result, public transport became less attractive to use.
According to van Wee (2002), the split between travel modes depends on the needs and desires of people, transport resistances and the locations of human activities. COVID-19 influenced all of these factors. Transport resistance increased, because the pandemic restrictions made public transport less comfortable and reliable. Furthermore, people worked from home more often, which changed the locations of human activities. Lastly, social activities carried a higher risk of infection during the pandemic, so people changed their needs and desires.

# Research Objective

*Requires data modeling and quantitative research in Transport, Infrastructure & Logistics*




How does the urbanization of a region affect the distance traveled by a mode of transport?

How does the urbanization of a region affect mobility patterns?


# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

- Distance traveled versus number of trips
- Distance traveled per mode of transport
- Number of trips per mode of transport
- Distance traveled per motive
- Number of trips per motive

# Data Used

We only focus on pre-corona data before March 2020 in the Netherlands. Datasets we might use:

- https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=84710ENG&_theme=1159


# Data Pipeline

In [54]:
# Import all libraries we use

import pandas as pd
import seaborn as sns
import plotly.express as px
import numpy as np
import plotly.io as pio
import matplotlib.pyplot as plt

In [55]:
file_path = 'per_person__travel_modes__travel_purpose_12102022_104624.csv'
df = pd.read_csv(file_path, delimiter=';', encoding='Windows-1252') 
df.head()

Unnamed: 0,"ï»¿""Travel motives""",Population,Travel modes,Margins,Region characteristics,Periods,Average per person per day/Trips (number),Average per person per day/Distance travelled (passenger kilometres ),Average per person per year/Trips (number),Average per person per year/Distance travelled (passenger kilometres )
0,Total,Population 6 years or older,Total,Value,The Netherlands,2018,2.78,36.16,1015,13200
1,Total,Population 6 years or older,Total,Value,The Netherlands,2019,2.71,36.0,989,13140
2,Total,Population 6 years or older,Total,Value,The Netherlands,2020,2.35,24.88,861,9105
3,Total,Population 6 years or older,Total,Value,The Netherlands,2021,2.51,27.24,915,9942
4,Total,Population 6 years or older,Total,Value,Extremely urbanised,2018,2.7,32.66,987,11922


Checking the names of columns to see if they need to be changed:

In [56]:
df.columns

Index(['ï»¿"Travel motives"', 'Population', 'Travel modes', 'Margins',
       'Region characteristics', 'Periods',
       'Average per person per day/Trips (number)',
       'Average per person per day/Distance travelled                       (passenger kilometres )',
       'Average per person per year/Trips (number)',
       'Average per person per year/Distance travelled    (passenger kilometres )'],
      dtype='object')
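
The `ï»¿` prefix on the first column name is what the three bytes of a UTF-8 byte-order mark (BOM) look like when they are decoded as Windows-1252. This suggests the CSV may actually be UTF-8 with a BOM. The sketch below uses only the standard library to show the effect; the header string is a made-up stand-in and is not read from the CBS file.

```python
# Illustrative only: a tiny in-memory stand-in for the first line of the CSV.
raw = '\ufeff"Travel motives";Population\n'.encode('utf-8')  # UTF-8 with BOM

# Decoding as Windows-1252 turns the BOM bytes (EF BB BF) into visible debris.
as_cp1252 = raw.decode('cp1252')
first_cp1252 = as_cp1252.split(';')[0]

# Decoding as 'utf-8-sig' strips the BOM, leaving only the quoted header name.
as_utf8_sig = raw.decode('utf-8-sig')
first_utf8 = as_utf8_sig.split(';')[0]

print(repr(first_cp1252))
print(repr(first_utf8))
```

If this is indeed the cause, passing `encoding='utf-8-sig'` to `pd.read_csv` would probably yield a clean `Travel motives` column name. We did not verify this against the actual file.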

Common practice is to rename the columns to lower case without white space:

In [57]:
df.rename(columns={'ï»¿"Travel motives"': 'travel_motive', 'Population':'population', 'Travel modes':'travel_mode', 'Margins':'margine',
       'Region characteristics':'region_characteristics', 'Periods':'year',
       'Average per person per day/Trips (number)':'average_trips_per_person_per_day_number',
       'Average per person per day/Distance travelled                       (passenger kilometres )':'average_trips_per_person_per_day_distance(km)',
       'Average per person per year/Trips (number)':'average_trips_per_person_per_year_number',
       'Average per person per year/Distance travelled    (passenger kilometres )':'average_trips_per_person_per_year_distance(km)'}, inplace=True)

df.columns

Index(['travel_motive', 'population', 'travel_mode', 'margine',
       'region_characteristics', 'year',
       'average_trips_per_person_per_day_number',
       'average_trips_per_person_per_day_distance(km)',
       'average_trips_per_person_per_year_number',
       'average_trips_per_person_per_year_distance(km)'],
      dtype='object')

Get an overall view of the dataset and check the data types:

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1008 entries, 0 to 1007
Data columns (total 10 columns):
 #   Column                                          Non-Null Count  Dtype 
---  ------                                          --------------  ----- 
 0   travel_motive                                   1008 non-null   object
 1   population                                      1008 non-null   object
 2   travel_mode                                     1008 non-null   object
 3   margine                                         1008 non-null   object
 4   region_characteristics                          1008 non-null   object
 5   year                                            1008 non-null   int64 
 6   average_trips_per_person_per_day_number         1008 non-null   object
 7   average_trips_per_person_per_day_distance(km)   1008 non-null   object
 8   average_trips_per_person_per_year_number        1008 non-null   object
 9   average_trips_per_person_per_year_distance(km)  1008 non-null   object

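The dataset contains no null values. All columns are of type `object` except for the `year` column, which is `int64`.
The last four columns hold numbers stored as strings, so we must convert them to a numeric type (`float`, since the daily averages are decimals). We first strip whitespace and stray separator characters: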

In [59]:
for c in df.columns[6:]:
    df[c] = df[c].str.strip().str.replace(',', '').str.replace("'", "")
    

In [60]:
df['average_trips_per_person_per_day_number'].unique()

array(['2.78', '2.71', '2.35', '2.51', '2.70', '2.59', '2.14', '2.33',
       '2.80', '2.74', '2.37', '2.52', '2.84', '2.61', '2.76', '2.44',
       '2.67', '2.64', '2.41', '2.53', '0.96', '0.95', '0.81', '0.82',
       '0.66', '0.63', '0.52', '0.54', '0.97', '0.80', '1.06', '0.93',
       '1.14', '1.13', '0.98', '1.02', '1.18', '1.01', '1.00', '0.32',
       '0.31', '0.24', '0.26', '0.23', '0.18', '0.20', '0.33', '0.25',
       '0.27', '0.37', '0.28', '0.35', '0.08', '0.03', '0.13', '0.05',
       '0.06', '0.09', '0.04', '0.07', '0.02', '.', '0.16', '0.01',
       '0.79', '0.76', '0.64', '0.86', '0.65', '0.77', '0.68', '0.69',
       '0.75', '0.71', '0.60', '0.61', '0.58', '0.53', '0.55', '0.44',
       '0.43', '0.73', '0.46', '0.42', '0.62', '0.41', '0.49', '0.56',
       '0.36', '0.34', '0.51', '0.50', '0.30', '0.38', '0.39', '0.19',
       '0.12', '0.22', '0.21', '0.29', '0.14', '0.10', '0.15', '0.11',
       '0.00', '0.59', '0.57', '0.48', '0.47', '0.17'], dtype=object)

The missing values appear as the string `'.'`, so below we check which rows and columns contain them.
The `travel_motive` column is split into several motives, some of which have missing data points. As discussed with the professor, if we only use the sum over all motives (the `Total` category) and do not look at specific motives, we avoid most of the missing values: as shown below, only three rows with `travel_motive` equal to `Total` contain missing values.



In [61]:
df[df.isin(['.']).any(axis=1)]['travel_motive'].value_counts()

Professionally                              85
Services/care                               58
Shopping, groceries, funshopping.           29
Attending education/courses                 24
Travel to/from work, (non)-daily commute    23
Total                                        3
Name: travel_motive, dtype: int64
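
As an alternative to searching for the string `'.'` afterwards, the marker could be declared as missing when the file is loaded. The following is a minimal sketch under that assumption; the miniature CSV and its column names are hypothetical and only mimic the structure of the CBS export.

```python
import io
import pandas as pd

# Hypothetical miniature of the CBS export: '.' marks a missing value.
csv_text = (
    "travel_motive;travel_mode;trips_per_day\n"
    "Total;Train;0.03\n"
    "Total;Bus/metro;.\n"
    "Professionally;Train;.\n"
    "Professionally;Bike;0.01\n"
)

# na_values tells pandas to parse '.' as NaN, so the column becomes float.
toy = pd.read_csv(io.StringIO(csv_text), sep=';', na_values='.')

# Count rows with at least one missing value per travel motive.
missing_per_motive = toy[toy.isna().any(axis=1)]['travel_motive'].value_counts()
print(toy.dtypes)
print(missing_per_motive)
```

With this approach the numeric columns would not need a separate string-to-float conversion, and `isna()` replaces the `isin(['.'])` check.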

Below, we print the three `Total` rows that have missing values to get a sense of the data.
Next, we need a sensible way to replace these missing values. One solution is to look at the urbanisation categories: if a row with missing values has urbanisation level `Not urbanised`, it makes sense to replace the missing values with those from the `Hardly urbanised` row for the same year and travel mode.

In [62]:
# Combine both conditions in one boolean mask; chaining two filters makes pandas reindex the second mask and emit a warning
df[df.isin(['.']).any(axis=1) & (df['travel_motive'] == 'Total')]

Unnamed: 0,travel_motive,population,travel_mode,margine,region_characteristics,year,average_trips_per_person_per_day_number,average_trips_per_person_per_day_distance(km),average_trips_per_person_per_year_number,average_trips_per_person_per_year_distance(km)
94,Total,Population 6 years or older,Train,Value,Not urbanised,2020,.,.,.,.
95,Total,Population 6 years or older,Train,Value,Not urbanised,2021,.,.,.,.
118,Total,Population 6 years or older,Bus/metro,Value,Not urbanised,2020,.,.,.,.


We replace the missing values in row `94` (`Not urbanised`, `2020`) with the values from row `90`, which has the same year and travel mode but the `Hardly urbanised` category. Following the same reasoning, we replace the other two rows as well.

In [63]:
df.iloc[94,6:10] = df.iloc[90,6:10]

In [64]:
df.iloc[95,6:10] = df.iloc[91,6:10]

In [65]:
df.iloc[118,6:10] = df.iloc[114,6:10]
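
The positional `iloc` assignments above depend on the exact row order of the file. A label-based version of the same idea can be sketched as follows. This is an illustration on made-up toy data with hypothetical column names (`trips_per_day`, `donor_value`), not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'travel_mode': ['Train', 'Train', 'Train', 'Train'],
    'region_characteristics': ['Hardly urbanised', 'Hardly urbanised',
                               'Not urbanised', 'Not urbanised'],
    'year': [2020, 2021, 2020, 2021],
    'trips_per_day': [0.02, 0.03, np.nan, np.nan],
})

keys = ['travel_mode', 'year']

# Donor rows: the 'Hardly urbanised' value for each (mode, year) pair.
donor = (toy.loc[toy['region_characteristics'] == 'Hardly urbanised',
                 keys + ['trips_per_day']]
         .rename(columns={'trips_per_day': 'donor_value'}))

# Attach the donor value to every row with the same mode and year.
merged = toy.merge(donor, on=keys, how='left')

# Fill only the gaps in 'Not urbanised' rows, then drop the helper column.
is_gap = ((merged['region_characteristics'] == 'Not urbanised')
          & merged['trips_per_day'].isna())
merged.loc[is_gap, 'trips_per_day'] = merged.loc[is_gap, 'donor_value']
filled = merged.drop(columns='donor_value')
print(filled)
```

Matching on travel mode and year makes the rule explicit and keeps it valid if the rows are reordered or filtered first.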

Now we proceed to filter the `travel_motive` column. We only keep the rows that have `Total` as `travel_motive`.

In [66]:
df = df[df['travel_motive']=='Total'].copy()  # copy so later assignments do not act on a view of the original

In [67]:
# Convert the four measurement columns from strings to floats
num_cols = df.columns[6:10]
df[num_cols] = df[num_cols].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 10 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   travel_motive                                   168 non-null    object 
 1   population                                      168 non-null    object 
 2   travel_mode                                     168 non-null    object 
 3   margine                                         168 non-null    object 
 4   region_characteristics                          168 non-null    object 
 5   year                                            168 non-null    int64  
 6   average_trips_per_person_per_day_number         168 non-null    float64
 7   average_trips_per_person_per_day_distance(km)   168 non-null    float64
 8   average_trips_per_person_per_year_number        168 non-null    float64
 9   average_trips_per_person_per_year_distance(km)  168 non-null    float64

In [68]:
df_2018 = df.query('year==2018')
df_total = df_2018[df_2018['travel_mode'] == 'Total']
df_total

Unnamed: 0,travel_motive,population,travel_mode,margine,region_characteristics,year,average_trips_per_person_per_day_number,average_trips_per_person_per_day_distance(km),average_trips_per_person_per_year_number,average_trips_per_person_per_year_distance(km)
0,Total,Population 6 years or older,Total,Value,The Netherlands,2018,2.78,36.16,1015.0,13200.0
4,Total,Population 6 years or older,Total,Value,Extremely urbanised,2018,2.7,32.66,987.0,11922.0
8,Total,Population 6 years or older,Total,Value,Strongly urbanised,2018,2.8,36.07,1023.0,13165.0
12,Total,Population 6 years or older,Total,Value,Moderately urbanised,2018,2.84,35.97,1036.0,13128.0
16,Total,Population 6 years or older,Total,Value,Hardly urbanised,2018,2.84,39.49,1036.0,14415.0
20,Total,Population 6 years or older,Total,Value,Not urbanised,2018,2.67,38.39,975.0,14011.0


In [69]:
# Show the number of trips; bars are sorted from highest to lowest value
fig = px.bar(df_total, x='region_characteristics', y='average_trips_per_person_per_day_number', 
             title="Number of trips per person per day by degree of urbanisation (2018)")
fig.update_layout(xaxis={'categoryorder':'total descending'})  # the category axis is x, not y
fig.show()

In [70]:
# Show the distance travelled; bars are sorted from highest to lowest value
fig = px.bar(df_total, x='region_characteristics', y='average_trips_per_person_per_day_distance(km)', 
             title="Average distance traveled per person per day by degree of urbanisation (2018)")
fig.update_layout(xaxis={'categoryorder':'total descending'})  # the category axis is x, not y
fig.show()
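
The two bar charts can also be combined into a single indicator, the average distance per trip, which relates distance traveled to the number of trips. The sketch below uses the 2018 totals from the `df_total` table shown earlier. The short column names and the ordered categorical are our own additions for illustration.

```python
import pandas as pd

# 2018 totals copied from the df_total table shown earlier in this notebook.
totals_2018 = pd.DataFrame({
    'region_characteristics': ['The Netherlands', 'Extremely urbanised',
                               'Strongly urbanised', 'Moderately urbanised',
                               'Hardly urbanised', 'Not urbanised'],
    'trips_per_day': [2.78, 2.70, 2.80, 2.84, 2.84, 2.67],
    'km_per_day': [36.16, 32.66, 36.07, 35.97, 39.49, 38.39],
})

# Order from least to most urbanised; the national row is left out.
levels = ['Not urbanised', 'Hardly urbanised', 'Moderately urbanised',
          'Strongly urbanised', 'Extremely urbanised']
regions = totals_2018[totals_2018['region_characteristics'].isin(levels)].copy()
regions['region_characteristics'] = pd.Categorical(
    regions['region_characteristics'], categories=levels, ordered=True)
regions = regions.sort_values('region_characteristics').reset_index(drop=True)

# Average distance per trip (km) = daily distance / daily number of trips.
regions['km_per_trip'] = (regions['km_per_day'] / regions['trips_per_day']).round(2)
print(regions)
```

Sorting by an ordered categorical keeps the regions in a meaningful sequence instead of alphabetical order, which makes a trend across urbanisation levels easier to read.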

In [71]:
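# Sum only the numeric columns; newer pandas versions would otherwise try to aggregate the text columns as well
df_new = df.groupby(['year', 'region_characteristics', 'travel_mode']).sum(numeric_only=True).reset_index()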
df_new

Unnamed: 0,year,region_characteristics,travel_mode,average_trips_per_person_per_day_number,average_trips_per_person_per_day_distance(km),average_trips_per_person_per_year_number,average_trips_per_person_per_year_distance(km)
0,2018,Extremely urbanised,Bike,0.86,3.26,313.0,1189.0
1,2018,Extremely urbanised,Bus/metro,0.18,1.85,64.0,676.0
2,2018,Extremely urbanised,Passenger car (driver),0.66,13.47,240.0,4918.0
3,2018,Extremely urbanised,Passenger car (passenger),0.24,5.33,89.0,1945.0
4,2018,Extremely urbanised,Total,2.70,32.66,987.0,11922.0
...,...,...,...,...,...,...,...
163,2021,The Netherlands,Passenger car (driver),0.82,14.05,300.0,5129.0
164,2021,The Netherlands,Passenger car (passenger),0.26,5.10,94.0,1860.0
165,2021,The Netherlands,Total,2.51,27.24,915.0,9942.0
166,2021,The Netherlands,Train,0.03,1.67,13.0,610.0


In [72]:
textding = 'average_trips_per_person_per_day_distance(km)'

#Import
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#Exclude data on all the modes of transport combined
df2 = df[df['travel_mode']!="Total"]

#Only use data from 2018
df2018 = df2[df2['year']==2018]

#Prepare dataframes for each level of urbanization
dfNu = df2018[df2018["region_characteristics"]=="Not urbanised"]
dfHu = df2018[df2018["region_characteristics"]=="Hardly urbanised"]
dfMu = df2018[df2018["region_characteristics"]=="Moderately urbanised"]
dfSu = df2018[df2018["region_characteristics"]=="Strongly urbanised"]
dfEu = df2018[df2018["region_characteristics"]=="Extremely urbanised"]

#Set labels 
labels = dfNu["travel_mode"]

#Create subplot frame
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=["Not urbanised", "Hardly urbanised", "Moderately urbanised", "Strongly urbanised", "Extremely urbanised"])

#Add each subplot
fig.add_trace(go.Pie(labels=labels, values=dfNu[textding]),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=dfHu[textding]),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=dfMu[textding]),
              1, 3)
fig.add_trace(go.Pie(labels=labels, values=dfSu[textding]),
              2, 1)
fig.add_trace(go.Pie(labels=labels, values=dfEu[textding]),
              2, 2)

fig.update_layout(title_text="Average distance traveled per person per day for different modes of transport")

#Create hole and show label + value as hover data
fig.update_traces(hole=.4, hoverinfo="label+value")

#Show total subplot frame
fig.show()

In [73]:
textding = 'average_trips_per_person_per_day_number'

#Exclude data on all the modes of transport combined
df2 = df[df['travel_mode']!="Total"]

#Only use data from 2018
df2018 = df2[df2['year']==2018]

#Prepare dataframes for each level of urbanization
dfNu = df2018[df2018["region_characteristics"]=="Not urbanised"]
dfHu = df2018[df2018["region_characteristics"]=="Hardly urbanised"]
dfMu = df2018[df2018["region_characteristics"]=="Moderately urbanised"]
dfSu = df2018[df2018["region_characteristics"]=="Strongly urbanised"]
dfEu = df2018[df2018["region_characteristics"]=="Extremely urbanised"]

#Set labels 
labels = dfNu["travel_mode"]

#Create subplot frame
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=["Not urbanised", "Hardly urbanised", "Moderately urbanised", "Strongly urbanised", "Extremely urbanised"])

#Add each subplot
fig.add_trace(go.Pie(labels=labels, values=dfNu[textding]),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=dfHu[textding]),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=dfMu[textding]),
              1, 3)
fig.add_trace(go.Pie(labels=labels, values=dfSu[textding]),
              2, 1)
fig.add_trace(go.Pie(labels=labels, values=dfEu[textding]),
              2, 2)

fig.update_layout(title_text="Average number of trips per person per day for different modes of transport")

#Create hole and show label + value as hover data
fig.update_traces(hole=.4, hoverinfo="label+value")

#Show total subplot frame
fig.show()
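
The pie charts show each mode's share within a region. The same shares can be computed as a table, which is easier to compare across regions. The following is a minimal sketch on made-up numbers; the values and the simplified mode names are not taken from the CBS table.

```python
import pandas as pd

# Made-up values for illustration; not taken from the CBS table.
toy = pd.DataFrame({
    'region_characteristics': ['Not urbanised'] * 3 + ['Extremely urbanised'] * 3,
    'travel_mode': ['Car', 'Bike', 'Train'] * 2,
    'km_per_day': [30.0, 6.0, 4.0, 12.0, 4.0, 4.0],
})

# Share of each mode within its region, in percent (what a pie slice shows).
region_total = toy.groupby('region_characteristics')['km_per_day'].transform('sum')
toy['share_pct'] = (100 * toy['km_per_day'] / region_total).round(1)

# One row per region, one column per mode.
shares = toy.pivot(index='region_characteristics', columns='travel_mode',
                   values='share_pct')
print(shares)
```

Applied to the real data, the same pattern would use `average_trips_per_person_per_day_distance(km)` or `average_trips_per_person_per_day_number` as the value column, after excluding the `Total` travel mode.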

In [74]:
df2 = df[df["travel_mode"]=="Total"]

fig = px.line(df2, x="year", y="average_trips_per_person_per_day_number", color="region_characteristics",
              title='Average trips per person per day over time for different urbanization grades', markers=True)


dfTot = df2[df2["region_characteristics"]=="The Netherlands"]
fig.update_xaxes(nticks = len(dfTot["year"]))

fig.show()

In [75]:
df2 = df[df["travel_mode"]=="Total"]

fig = px.line(df2, x="year", y="average_trips_per_person_per_day_distance(km)", color="region_characteristics",
              title='Average distance traveled per person per day over time for different urbanization grades', markers=True)


dfTot = df2[df2["region_characteristics"]=="The Netherlands"]
fig.update_xaxes(nticks = len(dfTot["year"]))

fig.show()
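
To put a number on the changes visible in the line charts, the year-over-year change can be computed directly. The sketch below uses the national totals from the first rows of the dataset shown earlier; the short column names are our own.

```python
import pandas as pd

# National totals copied from the first rows of the dataset shown earlier.
nl = pd.DataFrame({
    'year': [2018, 2019, 2020, 2021],
    'trips_per_day': [2.78, 2.71, 2.35, 2.51],
    'km_per_day': [36.16, 36.00, 24.88, 27.24],
}).set_index('year')

# Year-over-year change in percent; the first year has no predecessor (NaN).
yoy_pct = (nl.pct_change() * 100).round(1)
print(yoy_pct)
```

For the Netherlands as a whole, this gives a drop of roughly 13% in trips and roughly 31% in distance from 2019 to 2020. The same calculation could be repeated per urbanisation level with a `groupby` on `region_characteristics`.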

In [76]:
# showing the number of trips per urbanization grade over the years
# Drop the 'Total' mode so the stacked bars do not double-count the individual modes
df_modes = df_new[df_new['travel_mode'] != 'Total']
fig = px.bar(df_modes, x='region_characteristics', y='average_trips_per_person_per_day_number', color='travel_mode', 
             animation_frame='year', title="Travel modes per urbanization grade over the period 2018-2021")
fig.update_xaxes(categoryorder='array', categoryarray=['The Netherlands', 'Extremely urbanised', 'Strongly urbanised', 'Moderately urbanised', 'Hardly urbanised', 'Not urbanised'])
fig.update_layout(yaxis_range=[0,3.5]) #fixed range so the animation frames are comparable
fig.show()

In [77]:
# showing the distance travelled per urbanization grade over the years
# Drop the 'Total' mode so the stacked bars do not double-count the individual modes
df_modes = df_new[df_new['travel_mode'] != 'Total']
fig = px.bar(df_modes, x='region_characteristics', y='average_trips_per_person_per_day_distance(km)', color='travel_mode', 
             animation_frame='year', title="Distance travelled per urbanization grade over the period 2018-2021")
fig.update_xaxes(categoryorder='array', categoryarray=['The Netherlands', 'Extremely urbanised', 'Strongly urbanised', 'Moderately urbanised', 'Hardly urbanised', 'Not urbanised'])
fig.update_layout(yaxis_range=[0,45]) #fixed range so the animation frames are comparable
fig.show()

# References
Rijksoverheid. (2022, September 26). Februari 2020: Eerste coronabesmetting in Nederland. Retrieved October 19, 2022, from https://www.rijksoverheid.nl/onderwerpen/coronavirus-tijdlijn/februari-2020-eerste-coronabesmetting-in-nederland

van Wee, B. (2002, December). Land use and transport: research and policy challenges. Journal of Transport Geography, 10(4), 259–271. https://doi.org/10.1016/s0966-6923(02)00041-8