# Project Group - 10

## Members
* Flip Boekhorst:   4694880
* Sjoerd Norbart:  4700139
* Timo van Manen:  4869494
* Justin Plasmeijer: 5188466
* Fariba Tavakoli:   5716632

## Contribution Statement
-   Flip Boekhorst: Introduction, Data used, Coding paragraph 1, writing paragraph 5
-   Sjoerd Norbart: Coding paragraph 5, Data Cleaning, helping during meetings 
-	Timo van Manen: Coding and writing paragraph 2 and 3, writing conclusion 
-   Justin Plasmeijer: Coding and writing paragraph 1 and 4
-   Fariba Tavakoli: Data Cleaning, Data Used, help with visualizing last graphs and errors

# Introduction

In recent years, the difference between more urbanised and less urbanised areas has become increasingly pronounced. For example, many highly educated and young people are moving to the big cities of the Netherlands, resulting in population shrinkage and ageing in less urbanised areas (CBS, 2016). This growing contrast could have a large effect on travel behaviour within these regions. This report investigates how the grade of urbanisation of an area influences the travel behaviour of its residents. Our research question is therefore:

**How does the grade of urbanisation of an area affect the travel behaviour of its residents?**

The research is conducted using a mobility dataset from CBS, which is explained in more detail in the next chapter. CBS defines the grade of urbanisation based on the number of surrounding addresses per square kilometre.
Before analysing the dataset, we formulated the following hypotheses for our research question:

* H0: The more urbanised a region is, the higher the average distance travelled per person by public transport or on foot.
* H1: The less urbanised a region is, the higher the distance travelled per person by car.
* H2: The more urbanised a region is, the higher the number of trips by bike and the shorter the average distance travelled per person.

Firstly, the dataset used will be further explained. After this, the process of cleaning and organising the data is shown. In the third chapter, data analysis is performed in order to find conclusions to answer the research question. The last chapter gives a conclusion on our findings.


# Data Used

The dataset we use contains information on the mobility of residents of the Netherlands aged 6 or older in private households, so excluding residents of institutions and homes. Per person, per day and per year, the table gives the average number of trips, the average distance travelled and the average time travelled. These are regular trips on Dutch territory, including domestic holiday mobility. The distance travelled is based on stage information. Mobility based on series of calls trips is excluded from this dataset. The mobility behaviour is broken down by mode of travel, purpose of travel, population and region characteristics. The data are retrieved from the Dutch national travel survey Onderweg in Nederland (ODiN).

According to CBS (2022), a stage is the part of a trip made with one mode of transport.

Data available from: 2018 to 2021

The dataset we use can be found [here](https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=84710ENG&_theme=1159).


### Explanation about dataset
The dataset has columns describing the following things:

* `travel motives`: the purpose for which people travelled
* `population`: in this dataset, people aged 6 or older
* `travel modes`: the mode of transport people chose
* `Region characteristics`: the urbanisation category of the region, explained below
* `periods`: year
* `Average per person per day/Trips (number)`: average number of trips per person per day
* `Average per person per day/Distance travelled (passenger kilometres )`: average distance travelled per person per day (km)
* `Average per person per year/Trips (number)`: average number of trips per person per year
* `Average per person per year/Distance travelled (passenger kilometres )`: average distance travelled per person per year (km)

**Description of the `Region characteristics` column:**

Urbanisation is classified on the basis of five categories of surrounding address density.
* Extremely urbanised: 2500 addresses or more per square kilometre.
* Strongly urbanised: 1500 to 2500 addresses per square kilometre.
* Moderately urbanised: 1000 to 1500 addresses per square kilometre.
* Hardly urbanised: 500 to 1000 addresses per square kilometre.
* Not urbanised: 0 to 500 addresses per square kilometre.
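
The five categories can be read as a simple threshold lookup. The sketch below is only an illustration: the function name `urbanisation_grade` is hypothetical, and it assumes that the classes are contiguous and that each class includes its lower bound (so the strongly urbanised class ends where the extremely urbanised class starts).

```python
def urbanisation_grade(addresses_per_km2: float) -> str:
    """Map a surrounding address density to an urbanisation category."""
    if addresses_per_km2 < 0:
        raise ValueError("address density cannot be negative")
    # Lower bound of each class, checked from most to least urbanised
    thresholds = [
        (2500, "Extremely urbanised"),
        (1500, "Strongly urbanised"),
        (1000, "Moderately urbanised"),
        (500, "Hardly urbanised"),
        (0, "Not urbanised"),
    ]
    for lower_bound, label in thresholds:
        if addresses_per_km2 >= lower_bound:
            return label


# Hypothetical densities, one per category
for density in (320, 750, 1200, 1800, 3100):
    print(density, urbanisation_grade(density))
```

How CBS treats values exactly on a boundary is an assumption here and should be checked against the CBS definition.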

The dataset also contains data on entire provinces. While the first four paragraphs of the data analysis chapter focus on the region characteristics defined above, the fifth paragraph uses the province data and compares it with the previous results.


# Data Pipeline

Before using our data for analysis and visualisation, we first need to clean and organize the data. This will be done in the blocks of code down below.

In [2]:
# Import all libraries we use

import pandas as pd
import seaborn as sns
import plotly.express as px
import numpy as np
import plotly.io as pio
import matplotlib.pyplot as plt

In [2]:
file_path = 'per_person__travel_modes__travel_purpose_12102022_104624.csv'
df = pd.read_csv(file_path, delimiter=';', encoding='Windows-1252') 
df.head()

Checking the names of columns to see if they need to be changed:

In [None]:
df.columns

Index(['ï»¿"Travel motives"', 'Population', 'Travel modes', 'Margins',
       'Region characteristics', 'Periods',
       'Average per person per day/Trips (number)',
       'Average per person per day/Distance travelled                       (passenger kilometres )',
       'Average per person per year/Trips (number)',
       'Average per person per year/Distance travelled    (passenger kilometres )'],
      dtype='object')
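
The `ï»¿` prefix in the first column name is what a UTF-8 byte-order mark (BOM) looks like when a file is decoded as Windows-1252. The sketch below illustrates this with the standard library on a hypothetical one-line header, not the actual CBS file; `first_header` is a name made up for this example.

```python
import csv
import io

# Hypothetical header line, stored as UTF-8 with a byte-order mark (BOM)
raw = b'\xef\xbb\xbf"Travel motives";"Population"\n'


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the first column name of the header row."""
    text = data.decode(encoding)
    header = next(csv.reader(io.StringIO(text), delimiter=";"))
    return header[0]


garbled = first_header(raw, "cp1252")    # BOM bytes become visible characters
clean = first_header(raw, "utf-8-sig")   # BOM is stripped during decoding

print(repr(garbled))
print(repr(clean))
```

Reading the CSV with `encoding='utf-8-sig'` would probably avoid the prefix, but any later rename would then have to use the clean column name instead of the garbled one.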

The common practice is to rename the columns to lower case and without white spaces:

In [None]:
df.rename(columns={'ï»¿"Travel motives"': 'travel_motive', 'Population':'population', 'Travel modes':'travel_mode', 'Margins':'margine',
       'Region characteristics':'region_characteristics', 'Periods':'year',
       'Average per person per day/Trips (number)':'average_trips_per_person_per_day_number',
       'Average per person per day/Distance travelled                       (passenger kilometres )':'average_trips_per_person_per_day_distance(km)',
       'Average per person per year/Trips (number)':'average_trips_per_person_per_year_number',
       'Average per person per year/Distance travelled    (passenger kilometres )':'average_trips_per_person_per_year_distance(km)'}, inplace=True)

df.columns

Index(['travel_motive', 'population', 'travel_mode', 'margine',
       'region_characteristics', 'year',
       'average_trips_per_person_per_day_number',
       'average_trips_per_person_per_day_distance(km)',
       'average_trips_per_person_per_year_number',
       'average_trips_per_person_per_year_distance(km)'],
      dtype='object')

Get an overall view of the dataset and check the data types:

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1008 entries, 0 to 1007
Data columns (total 10 columns):
 #   Column                                          Non-Null Count  Dtype 
---  ------                                          --------------  ----- 
 0   travel_motive                                   1008 non-null   object
 1   population                                      1008 non-null   object
 2   travel_mode                                     1008 non-null   object
 3   margine                                         1008 non-null   object
 4   region_characteristics                          1008 non-null   object
 5   year                                            1008 non-null   int64 
 6   average_trips_per_person_per_day_number         1008 non-null   object
 7   average_trips_per_person_per_day_distance(km)   1008 non-null   object
 8   average_trips_per_person_per_year_number        1008 non-null   object
 9   average_trips_per_person_per_year_distance(km)  1008

The dataset contains 1008 rows and 10 columns. It seems that the dataset contains no null values, since the non-null count of every column equals the number of rows. All columns are of type `object`, except for the `year` column, which is `int64`.

The last four columns hold numbers stored as strings, so they must be converted to a numeric type. As a first step, we strip whitespace and remove commas and apostrophes from these strings:

In [None]:
for c in df.columns[6:]:
    df[c] = df[c].str.strip().str.replace(',', '').str.replace("'", "")
    

In [None]:
# dig deeper into the dataset
df['average_trips_per_person_per_day_number'].unique()

array(['2.78', '2.71', '2.35', '2.51', '2.70', '2.59', '2.14', '2.33',
       '2.80', '2.74', '2.37', '2.52', '2.84', '2.61', '2.76', '2.44',
       '2.67', '2.64', '2.41', '2.53', '0.96', '0.95', '0.81', '0.82',
       '0.66', '0.63', '0.52', '0.54', '0.97', '0.80', '1.06', '0.93',
       '1.14', '1.13', '0.98', '1.02', '1.18', '1.01', '1.00', '0.32',
       '0.31', '0.24', '0.26', '0.23', '0.18', '0.20', '0.33', '0.25',
       '0.27', '0.37', '0.28', '0.35', '0.08', '0.03', '0.13', '0.05',
       '0.06', '0.09', '0.04', '0.07', '0.02', '.', '0.16', '0.01',
       '0.79', '0.76', '0.64', '0.86', '0.65', '0.77', '0.68', '0.69',
       '0.75', '0.71', '0.60', '0.61', '0.58', '0.53', '0.55', '0.44',
       '0.43', '0.73', '0.46', '0.42', '0.62', '0.41', '0.49', '0.56',
       '0.36', '0.34', '0.51', '0.50', '0.30', '0.38', '0.39', '0.19',
       '0.12', '0.22', '0.21', '0.29', '0.14', '0.10', '0.15', '0.11',
       '0.00', '0.59', '0.57', '0.48', '0.47', '0.17'], dtype=object)

The missing values appear to be encoded as the string `'.'`, so below we look at which rows contain them.
The `travel_motive` column distinguishes several motives, and most missing data points occur in the specific motives. As discussed with the professor, we avoid most of the problem by using only the sum over all motives (the `Total` category) instead of the specific motives. The output below shows that only three rows with `travel_motive` equal to `Total` contain missing values.


In [None]:
df[df.isin(['.']).any(axis=1)]['travel_motive'].value_counts()

Professionally                              85
Services/care                               58
Shopping, groceries, funshopping.           29
Attending education/courses                 24
Travel to/from work, (non)-daily commute    23
Total                                        3
Name: travel_motive, dtype: int64
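
An alternative way to find these placeholders is to try converting the value columns to numbers and count what fails. The sketch below is a minimal illustration on a hypothetical miniature table (the column names `trips` and `distance` and all values are made up); it relies on `pd.to_numeric` with `errors="coerce"`, which turns unparseable strings such as `'.'` into `NaN`.

```python
import pandas as pd

# Hypothetical miniature of the raw table; '.' marks a missing value
sample = pd.DataFrame({
    "travel_motive": ["Total", "Total", "Professionally", "Professionally"],
    "trips": ["2.78", ".", ".", "0.32"],
    "distance": ["36.1", ".", "1.2", "."],
})

value_cols = ["trips", "distance"]
# Anything that cannot be parsed as a number (here '.') becomes NaN
numeric = sample[value_cols].apply(pd.to_numeric, errors="coerce")

# Count the rows with at least one gap, per travel motive
rows_with_gaps = numeric.isna().any(axis=1)
gaps_per_motive = sample.loc[rows_with_gaps, "travel_motive"].value_counts()
print(gaps_per_motive)
```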

Below, we print the three rows that have missing values to get a sense of the data.
We then need to replace these missing values with data that makes sense. One solution is to use the neighbouring urbanisation category. For example, if a row with missing values belongs to the `Not urbanised` category, it makes sense to replace its values with those of the `Hardly urbanised` row for the same year and travel mode.

In [None]:
df[df.isin(['.']).any(axis=1) & (df['travel_motive'] == 'Total')]

| | travel_motive | population | travel_mode | margine | region_characteristics | year | average_trips_per_person_per_day_number | average_trips_per_person_per_day_distance(km) | average_trips_per_person_per_year_number | average_trips_per_person_per_year_distance(km) |
|---|---|---|---|---|---|---|---|---|---|---|
| 94 | Total | Population 6 years or older | Train | Value | Not urbanised | 2020 | . | . | . | . |
| 95 | Total | Population 6 years or older | Train | Value | Not urbanised | 2021 | . | . | . | . |
| 118 | Total | Population 6 years or older | Bus/metro | Value | Not urbanised | 2020 | . | . | . | . |


We replace the missing values of row `94`, which belongs to the `Not urbanised` category and the year `2020`, with the values of row `90`, which has the same year and travel mode but belongs to the `Hardly urbanised` category. Using the same line of reasoning, we replace the other two rows as well.

In [None]:
df.iloc[94,6:10] = df.iloc[90,6:10]

In [None]:
df.iloc[95,6:10] = df.iloc[91,6:10]

In [None]:
df.iloc[118,6:10] = df.iloc[114,6:10]
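
The same replacement could also be done by looking up the donor row instead of hard-coding row positions. The sketch below is one possible approach, shown on a hypothetical miniature table with made-up values; the helper `fill_from_donor` and the column name `trips` are assumptions for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the table (the numbers are made up)
sample = pd.DataFrame({
    "travel_mode": ["Train", "Train", "Train", "Train"],
    "region_characteristics": ["Hardly urbanised", "Hardly urbanised",
                               "Not urbanised", "Not urbanised"],
    "year": [2020, 2021, 2020, 2021],
    "trips": ["0.01", "0.02", ".", "."],
})


def fill_from_donor(frame, value_cols, target="Not urbanised", donor="Hardly urbanised"):
    """Copy values from the donor category into target rows that contain '.'."""
    out = frame.copy()
    is_missing = out[value_cols].eq(".").any(axis=1)
    is_target = out["region_characteristics"] == target
    for idx in out.index[is_missing & is_target]:
        row = out.loc[idx]
        # Donor row: same travel mode and year, neighbouring category
        match = out[(out["region_characteristics"] == donor)
                    & (out["travel_mode"] == row["travel_mode"])
                    & (out["year"] == row["year"])]
        if len(match) == 1:  # only fill when exactly one donor row exists
            out.loc[idx, value_cols] = match.iloc[0][value_cols].values
    return out


filled = fill_from_donor(sample, ["trips"])
print(filled)
```

Matching on year and travel mode makes the intent explicit and does not break if the row order of the table changes.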

Now we proceed to filter the `travel_motive` column. We only keep the rows that have `Total` as `travel_motive`.

In [None]:
df = df[df['travel_motive']=='Total']

In [None]:
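value_cols = df.columns[6:10]
df = df.copy()  # work on an explicit copy of the filtered frame
df[value_cols] = df[value_cols].astype(float)  # convert the four value columns
df.info()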

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 10 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   travel_motive                                   168 non-null    object 
 1   population                                      168 non-null    object 
 2   travel_mode                                     168 non-null    object 
 3   margine                                         168 non-null    object 
 4   region_characteristics                          168 non-null    object 
 5   year                                            168 non-null    int64  
 6   average_trips_per_person_per_day_number         168 non-null    float64
 7   average_trips_per_person_per_day_distance(km)   168 non-null    float64
 8   average_trips_per_person_per_year_number        168 non-null    float64
 9   average_trips_per_person_per_year_distance(

Below we delete the columns `population` and `margine`. They are unnecessary because each holds a single predetermined value, as explained under the `Data Used` heading.

In [None]:
print(df['population'].unique())
print(df['margine'].unique())

['Population 6 years or older']
['Value']


In [None]:
df.drop(['population', 'margine'], axis=1, inplace=True)
df.columns

Index(['travel_motive', 'travel_mode', 'region_characteristics', 'year',
       'average_trips_per_person_per_day_number',
       'average_trips_per_person_per_day_distance(km)',
       'average_trips_per_person_per_year_number',
       'average_trips_per_person_per_year_distance(km)'],
      dtype='object')

# Data analysis

The dataframe is now ready to be used for analysis. This chapter focuses on analysing the data and finding an answer to the research question.

In the first paragraph, the relation between urbanisation grade and general travel behaviour in 2018 is analysed. The second paragraph highlights the relation between urbanisation grade and the use of different modes of transport, also in 2018. The third paragraph examines how general travel behaviour develops over time for the different urbanisation grades. The fourth paragraph summarises the previous paragraphs and shows a complete visualisation to answer the research question.

The conclusions in paragraph four cover urbanisation grades that are defined per square kilometre. The fifth paragraph investigates to what extent these findings also hold for the general travel behaviour of entire provinces.

## 1 Travel behaviour per urbanisation grade


## 1.1 Number of trips per person per day for different urbanisation grades
The following bar chart shows the average number of trips made per person per day for different urbanisation grades. For a clear overview, the analysis only considers the average number of trips, without distinguishing between travel modes. The number of trips is expected to be higher in more urbanised areas, because activities are closer to people's homes and the resistance to making a trip is therefore smaller.

In [None]:
#Filtering the data to only look at data from 2018
df_2018 = df.query('year==2018')
df_total = df_2018[df_2018['travel_mode'] == 'Total']

#Creating bar chart
fig = px.bar(df_total, x='region_characteristics', y='average_trips_per_person_per_day_number', 
             title="Amount of trips made per person per day in different grades of urbanizations", 
             text='average_trips_per_person_per_day_number')

#Zoom in to improve readability
fig.update_layout(yaxis_range=[2,3]) 

#Update titles x- and y-axis
fig.update_layout(xaxis_title="Urbanisation grade", 
                  yaxis_title="Average amount of trips per person per day")

#Show final barchart
fig.show()

### Results
The graph partly meets our expectations. The number of trips made in not urbanised areas is lower than in all other urbanisation grades. However, the chart shows that the number of trips made in strongly, moderately and hardly urbanised areas is higher than in extremely urbanised areas, which does not match the expectation above.

## 1.2 Average distance traveled per person per day for different urbanisation grades
The following bar chart shows the average distance travelled per person per day for different urbanisation grades. As before, the analysis considers the average distance without distinguishing between travel modes, because that gives a clearer overview. The average distance travelled is expected to be lower in more urbanised areas, because activities are closer to people's homes and people therefore have to travel less far to reach them.

In [None]:
#Creating bar chart
fig = px.bar(df_total, x='region_characteristics', y='average_trips_per_person_per_day_distance(km)', 
             title="Average distance traveled per person per day in different grades of urbanizations", 
             text='average_trips_per_person_per_day_distance(km)')

#Zoom in to improve readability
fig.update_layout(yaxis_range=[30,40]) 

#Update titles x- and y-axis
fig.update_layout(xaxis_title="Urbanisation grade", yaxis_title="Average distance traveled per person per day(km)")

#Show final barchart
fig.show()

### Results
The graph partly meets the expectations. The bar chart shows that the average distance travelled increases as the urbanisation grade decreases. However, the distance travelled in not urbanised areas is lower than in hardly urbanised areas, which does not meet our expectations.

## 2 Usage of different travel modes per urbanisation grade    
This research also analyses the relationship between urbanisation grade and the use of different modes of travel. To do this, two sets of pie charts have been created.

## 2.1 Average distance traveled per person per day for different modes of transport
The first set of pie charts highlights the relationship between urbanisation grade and average distance traveled per person per day for different modes of transport.   
 
### Hypothesis
Before analysing the data, we expect some significant differences in transport use between urbanisation grades. The expectations are:
 - The more urbanised an area is, the less distance is traveled by car
 - The more urbanised an area is, the more distance is traveled on foot
 - The more urbanised an area is, the more distance is traveled by metro
 - The more urbanised an area is, the less distance is traveled by bike

In [None]:
textding = 'average_trips_per_person_per_day_distance(km)'

#Import
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#Exclude data on all the modes of transport combined
df2 = df[df['travel_mode']!="Total"]

#Only use data from 2018
df2018 = df2[df2['year']==2018]

#Prepare dataframes for each level of urbanization
dfNu = df2018[df2018["region_characteristics"]=="Not urbanised"]
dfHu = df2018[df2018["region_characteristics"]=="Hardly urbanised"]
dfMu = df2018[df2018["region_characteristics"]=="Moderately urbanised"]
dfSu = df2018[df2018["region_characteristics"]=="Strongly urbanised"]
dfEu = df2018[df2018["region_characteristics"]=="Extremely urbanised"]

#Set labels 
labels = dfNu["travel_mode"]

#Create subplot frame
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=["Not urbanised", "Hardly urbanised", "Moderately urbanised", "Strongly urbanised", "Extremely urbanised"])

#Add each subplot
fig.add_trace(go.Pie(labels=labels, values=dfNu[textding]),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=dfHu[textding]),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=dfMu[textding]),
              1, 3)
fig.add_trace(go.Pie(labels=labels, values=dfSu[textding]),
              2, 1)
fig.add_trace(go.Pie(labels=labels, values=dfEu[textding]),
              2, 2)

fig.update_layout(title_text="Average distance traveled per person per day for different modes of transport")

#Create hole and add label + value as hover data
fig.update_traces(hole=.4, hoverinfo="label+value")

#Show total subplot frame
fig.show()

### Results
The results of the first pie chart visualisation can be seen directly above. The size of each coloured segment indicates to what extent that mode of transport contributes to the total average distance traveled per person per day. Hovering over a segment shows the average distance in kilometres per person per day for that mode of transport.

Most of the hypotheses were right, except the prediction about bicycle usage. The results show that the more urbanised an area is, the more distance is traveled by bicycle. While the average distance per bicycle trip is probably higher in non-urbanised areas, the number of trips in urbanised areas apparently makes up for it. This can be checked in the next set of pie charts, which shows the number of trips for different modes of transport.

Another interesting result is the significant increase in distance traveled by train in more urbanised regions. While the train only makes up 5.47% of the travel distance in non-urbanised regions, it makes up an astonishing 20.1% of the travel distance in extremely urbanised regions. 
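
The percentages in the pie charts are the share of each travel mode in the total distance of its urbanisation grade. As a rough sketch of that calculation, the example below uses a hypothetical miniature table; the column name `distance_km`, the mode labels and all numbers are made up and are not the CBS values.

```python
import pandas as pd

# Hypothetical distances (km per person per day); not taken from the CBS table
sample = pd.DataFrame({
    "region_characteristics": ["Not urbanised"] * 3 + ["Extremely urbanised"] * 3,
    "travel_mode": ["Car", "Train", "Bike"] * 2,
    "distance_km": [30.0, 2.0, 3.0, 18.0, 6.0, 6.0],
})

# Share of each mode within its urbanisation grade, in percent
totals = sample.groupby("region_characteristics")["distance_km"].transform("sum")
sample["share_pct"] = (100 * sample["distance_km"] / totals).round(1)

# One row per travel mode, one column per urbanisation grade
shares = sample.pivot(index="travel_mode", columns="region_characteristics",
                      values="share_pct")
print(shares)
```

A table like `shares` makes it possible to compare the modal split between grades without hovering over each pie segment.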


## 2.2 Average amount of trips per person per day for different modes of transport

The second set highlights the relationship between urbanisation grade and the average number of trips per person per day for different modes of transport.

### Hypothesis
Before analysing the data the following results about the differences in transport use between urbanisation grades are expected:

 - The more urbanised an area is, the fewer trips are made by car
 - The more urbanised an area is, the more trips are made on foot
 - The more urbanised an area is, the more trips are made by metro
 - The more urbanised an area is, the more trips are made by train
 - The more urbanised an area is, the more trips are made by bike

In [None]:
textding = 'average_trips_per_person_per_day_number'

#Exclude data on all the modes of transport combined
df2 = df[df['travel_mode']!="Total"]

#Only use data from 2018
df2018 = df2[df2['year']==2018]

#Prepare dataframes for each level of urbanization
dfNu = df2018[df2018["region_characteristics"]=="Not urbanised"]
dfHu = df2018[df2018["region_characteristics"]=="Hardly urbanised"]
dfMu = df2018[df2018["region_characteristics"]=="Moderately urbanised"]
dfSu = df2018[df2018["region_characteristics"]=="Strongly urbanised"]
dfEu = df2018[df2018["region_characteristics"]=="Extremely urbanised"]

#Set labels 
labels = dfNu["travel_mode"]

#Create subplot frame
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=["Not urbanised", "Hardly urbanised", "Moderately urbanised", "Strongly urbanised", "Extremely urbanised"])

#Add each subplot
fig.add_trace(go.Pie(labels=labels, values=dfNu[textding]),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=dfHu[textding]),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=dfMu[textding]),
              1, 3)
fig.add_trace(go.Pie(labels=labels, values=dfSu[textding]),
              2, 1)
fig.add_trace(go.Pie(labels=labels, values=dfEu[textding]),
              2, 2)

fig.update_layout(title_text="Average amount of trips per person per day for different modes of transport")

#Create hole and add label + value as hover data
fig.update_traces(hole=.4, hoverinfo="label+value")

#Show total subplot frame
fig.show()

### Results
The results of the second pie chart visualisation can be seen directly above. The size of each coloured segment indicates to what extent that mode of transport contributes to the total number of trips made per person per day. Hovering over a segment shows how many trips the average person makes per day using that mode of transport.

For this subquestion, all our hypotheses were right. For every other mode of transport, the percentage of trips per day increases with urbanisation, at the expense of travelling by car. Interestingly, the percentage of people travelling by car as a driver decreases more rapidly than the percentage travelling by car as a passenger. This suggests that people in more urbanised regions are more likely to share a car than people in less urbanised regions.








## 3 Travel behaviour over time per urbanisation grade


In the previous visualisations we explored the relation between grade of urbanisation and travel behaviour in the year 2018. To gain better insight into the general relation, it is also interesting to examine how these relations develop over the years. The dataset we used contains data from 2018 to 2021, of which the last two years were affected by the coronavirus. The next visualisations explore the relation between urbanisation and travel behaviour from 2018 to 2021, thereby also providing insight into the effect of the coronavirus.

## 3.1 Average trips per person per day
The line graph below shows the average trips per person per day over time for different urbanisation grades. 

### Hypothesis


We expect to see a decrease in average trips per person per day in 2020 and 2021 due to the coronavirus. We expect the different urbanisation grades to follow the same pattern.

In [None]:
df2 = df[df["travel_mode"]=="Total"]

fig = px.line(df2, x="year", y="average_trips_per_person_per_day_number", color="region_characteristics",
              title='Average trips per person per day over time for different urbanization grades', markers=True)


dfTot = df2[df2["region_characteristics"]=="The Netherlands"]
fig.update_xaxes(nticks = len(dfTot["year"]))
fig.update_layout(xaxis_title="Year", yaxis_title="Average trips per person per day", legend_title="Urbanisation grade")

fig.show()

### Results
The results of the visualisation can be seen directly above. Hovering over the data points shows the average trips per person per day at that point in the graph. 

The first part of our hypothesis was correct: The number of trips decreases in the years 2020 and 2021. The second part of our hypothesis is not entirely correct however. The extremely urbanised areas seem to be most affected by the coronavirus while the not urbanised areas seem to be least affected. 
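
One way to quantify which urbanisation grade was affected most is to compute the relative change between two years per grade. The sketch below shows the idea on hypothetical numbers; the column name `trips` and the values are made up and do not come from the CBS table.

```python
import pandas as pd

# Hypothetical trips per person per day; not the values from the CBS table
sample = pd.DataFrame({
    "region_characteristics": ["Extremely urbanised"] * 2 + ["Not urbanised"] * 2,
    "year": [2019, 2020, 2019, 2020],
    "trips": [2.50, 2.00, 2.50, 2.25],
})

# One row per urbanisation grade, one column per year
wide = sample.pivot(index="region_characteristics", columns="year", values="trips")

# Relative change from 2019 to 2020, in percent
wide["change_pct"] = 100 * (wide[2020] - wide[2019]) / wide[2019]
print(wide.sort_values("change_pct"))
```

Sorting by `change_pct` lists the grade with the strongest decrease first, which supports a statement such as "most affected" with a number.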




## 3.2 Average distance traveled per person per day
The line graph below shows the average distance traveled per person per day over time for different urbanisation grades.

### Hypothesis 
We expect to see a decrease in average travel distance per person per day in the year 2020 and 2021 due to the corona virus. We expect the different urbanisation grades to follow the same pattern. 

In [None]:
df2 = df[df["travel_mode"]=="Total"]

fig = px.line(df2, x="year", y="average_trips_per_person_per_day_distance(km)", color="region_characteristics",
              title='Average distance traveled per person per day over time for different urbanization grades', markers=True)


dfTot = df2[df2["region_characteristics"]=="The Netherlands"]
fig.update_xaxes(nticks = len(dfTot["year"]))
fig.update_layout(xaxis_title="Year", yaxis_title="Average distance traveled per person per day(km)", legend_title="Urbanisation grade")

fig.show()

### Results
The results of the visualisation can be seen directly above. Hovering over the data points shows the average distance traveled per person per day at that point in the graph.

For this visualisation, our hypothesis was correct. The average travel distance decreases during the corona years, and all urbanisation grades seem to follow the same pattern. Interestingly, the average travel distance in the not urbanised areas was rising rapidly before the arrival of the coronavirus, while the previous graph shows that the average number of trips in those areas was decreasing over the same period.

## 4 Final overview
The following bar charts provide a summary of all the data research we did before. In both figures, the slider can be used to manually compare data from different years over the period 2018-2021, while the play button can be used to automatically show the data of every single year. It is especially interesting to compare the data of the years 2019 and 2020, because those provide good insight in the effects of COVID-19.

In both figures, the y-axis represents the different urbanisation grades. In the first figure, the x-axis represents the average number of trips (per person per day), while the x-axis of the second figure represents the average distance travelled (per person per day). Modes of travel can be selected and deselected with a single click on a travel mode in the legend of the corresponding figure. Moreover, every bar in both figures shows the ratio between the different modes of travel.

In [None]:
#Exclude average data for region characteristics and modes of travel
df_cleaned = df[(df['region_characteristics']!='The Netherlands') & (df['travel_mode']!='Total')] 

#Create bar chart
fig = px.bar(df_cleaned, x='average_trips_per_person_per_day_number', 
             y='region_characteristics', color='travel_mode', 
             animation_frame='year', 
             title="Average number of trips per urbanization grade over the period 2018-2021",
             orientation = 'h')

#Set correct range to improve readability
fig.update_layout(xaxis_range=[0,3]) 

#Update titles
fig.update_layout(xaxis_title="Average number of trips per person per day", 
                  yaxis_title="Urbanisation grade", legend_title="Travel Modes") 

#Show final barchart
fig.show()

In [None]:
#Create bar chart
fig = px.bar(df_cleaned, x='average_trips_per_person_per_day_distance(km)',
             y='region_characteristics', color='travel_mode', 
             animation_frame='year', 
             title="Distance travelled per urbanization grade over the period 2018-2021",
             orientation = 'h')

#Set correct range to improve readability
fig.update_layout(xaxis_range=[0,40]) 

#Update titles
fig.update_layout(xaxis_title="Distance travelled per person per day (km)", 
                  yaxis_title="Urbanisation grade", legend_title="Travel Modes")

#Show final barchart
fig.show()

## 5 Province comparison

The conclusions in the previous paragraphs covered the relationship between travel behaviour and urbanisation grade based on address density per square kilometre. To find out whether these conclusions still hold for entire provinces, we compare them with the data on provinces from the same dataset. Before starting the comparison, a new dataset with the number of addresses per province is imported in the code below.

In [3]:
file_path = 'nederlandinwoners2.csv'
dfnl = pd.read_csv(file_path, delimiter=';', encoding='Windows-1252') 
dfnl.head()

| | ï»¿province | year | inhabitants_per_kmÂ² | adresses | province_area | address_per_km2 |
|---|---|---|---|---|---|---|
| 0 | Groningen | 2018 | 251 | 33560500 | 232394 | 144412076 |
| 1 | Fryslan | 2018 | 194 | 37238400 | 333562 | 1116386159 |
| 2 | Drenthe | 2018 | 187 | 27167000 | 263265 | 1031926006 |
| 3 | Overijssel | 2018 | 347 | 61171500 | 331900 | 1843070202 |
| 4 | Flevoland | 2018 | 292 | 21079500 | 141163 | 1493273733 |


The columns of the dataset get renamed to more easy and understanding names.

In [4]:
dfnl.rename(columns={'ï»¿province':'provinces', 'inhabitants_per_kmÂ²':'inhabitants_per_km2'}, inplace = True)
for column in dfnl.columns[3:]:
    dfnl[column] = dfnl[column].str.replace(',','.').astype(float)
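
As a quick sanity check on the cleaning step, the address density can be recomputed from the other two columns. The sketch below is not part of the original analysis; it uses only the Groningen and Fryslan rows from the output above, and the decimal positions of those values are an assumption about how the output was flattened.

```python
import pandas as pd

# Two rows as they would look after the cleaning step.
# The decimal positions are an assumption about the flattened output above.
check = pd.DataFrame({
    "provinces": ["Groningen", "Fryslan"],
    "adresses": [335605.0, 372384.0],
    "province_area": [2323.94, 3335.62],
    "address_per_km2": [144.412076, 111.6386159],
})

# Recompute the density and compare it with the stored column.
check["recomputed"] = check["adresses"] / check["province_area"]
check["matches"] = (check["recomputed"] - check["address_per_km2"]).abs() < 1e-3

print(check[["provinces", "recomputed", "matches"]])
```

If the two columns agree for every row, the decimal conversion most likely worked as intended.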

The following diagram shows the categorisation of the provinces by urbanisation grade. Zuid-Holland, Noord-Holland and Utrecht are the most urbanised provinces, while Zeeland, Fryslan and Drenthe are the least urbanised. Based on this diagram, the provinces can be divided into three subcategories: 
* The most urbanised provinces (coloured red), with more than 400 addresses per square kilometre. 
* The moderately urbanised provinces (coloured blue), with 200 to 400 addresses per square kilometre. 
* The least urbanised provinces (coloured green), with fewer than 200 addresses per square kilometre.

These colour codes are used in the next graphs as well.
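
The three thresholds listed above could also be applied in code, so that the colour of each province follows from its address density instead of being typed by hand. This is a minimal sketch, not the method used in the notebook: the function names `urbanisation_category` and `build_colour_map` are hypothetical, the Groningen and Fryslan densities come from the table output (decimals assumed), and the two placeholder rows are invented purely to exercise the other categories.

```python
import pandas as pd

# Colours per category, matching the colour names used in the charts.
CATEGORY_COLOURS = {
    "most urbanised": "indianRed",
    "moderately urbanised": "royalBlue",
    "least urbanised": "seaGreen",
}

def urbanisation_category(address_density):
    """Map addresses per km2 to one of the three categories from the text."""
    if address_density > 400:
        return "most urbanised"
    if address_density >= 200:
        return "moderately urbanised"
    return "least urbanised"

def build_colour_map(df, name_col="provinces", density_col="address_per_km2"):
    """Return a {province: colour} dict usable as color_discrete_map."""
    return {
        name: CATEGORY_COLOURS[urbanisation_category(density)]
        for name, density in zip(df[name_col], df[density_col])
    }

# Groningen and Fryslan are taken from the table output;
# the placeholder rows are hypothetical and only illustrate the thresholds.
example = pd.DataFrame({
    "provinces": ["Groningen", "Fryslan", "Placeholder A", "Placeholder B"],
    "address_per_km2": [144.41, 111.64, 300.0, 450.0],
})

colour_map = build_colour_map(example)
print(colour_map)
```

The resulting dictionary has the same shape as the `color_discrete_map` argument passed to `px.bar`, so it could be built once and reused for every chart.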

In [20]:
fig = px.bar(dfnl, x='provinces', y='address_per_km2', 
             title='Urbanisation of provinces based on number of addresses per km2', color='provinces', 
             color_discrete_map={'Groningen':'seaGreen', 'Fryslan':'seaGreen', 'Drenthe':'seaGreen',
                              'Overijssel':'seaGreen', 'Flevoland':'seaGreen', 'Gelderland':'royalBlue',
                              'Utrecht':'indianRed', 'Noord-Holland':'indianRed', 'Zuid-Holland':'indianRed',
                              'Zeeland':'seaGreen', 'Noord-Brabant':'royalBlue', 'Limburg':'royalBlue'})

fig.update_xaxes(title='Provinces', tickangle=-45)
fig.update_yaxes(title='Number of addresses per km2')
fig.update_layout(xaxis={'categoryorder':'total descending'})

fig.show()

In [8]:
file_path = 'nederlandprovincies4.csv'
# 'utf-8-sig' strips the byte order mark that otherwise ends up in the first column name
df2 = pd.read_csv(file_path, delimiter=';', encoding='utf-8-sig') 
df2.head()

| | ID | TravelMotives | Population | TravelModes | Margins | province | Periods | Trips_1 | DistanceTravelled_2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | T001080 | A048710 | T001093 | MW00000 | Groningen | 2018 | 2.81 | 37.46 |
| 1 | 21 | T001080 | A048710 | T001093 | MW00000 | Groningen | 2019 | 2.63 | 39.76 |
| 2 | 22 | T001080 | A048710 | T001093 | MW00000 | Groningen | 2020 | 2.27 | 26.74 |
| 3 | 23 | T001080 | A048710 | T001093 | MW00000 | Groningen | 2021 | 2.41 | 30.47 |
| 4 | 24 | T001080 | A048710 | T001093 | MW00000 | Fryslan | 2018 | 2.7 | 42.6 |


In [9]:
# renaming the columns for more clarity
df2.rename(columns={'Periods':'year', 'Trips_1':'trips', 'DistanceTravelled_2':'distance_travelled'}, inplace=True)
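
With the columns renamed, changes between years can be expressed relative to 2018. The sketch below is an illustration rather than part of the original notebook. It uses only the four Groningen rows shown in the output above, and the column name `distance_change_pct` is introduced here for the example.

```python
import pandas as pd

# The four Groningen rows from the output above, with the renamed columns.
groningen = pd.DataFrame({
    "province": ["Groningen"] * 4,
    "year": [2018, 2019, 2020, 2021],
    "trips": [2.81, 2.63, 2.27, 2.41],
    "distance_travelled": [37.46, 39.76, 26.74, 30.47],
})

# Use 2018 as the baseline year.
baseline = groningen.loc[groningen["year"] == 2018, "distance_travelled"].iloc[0]

# Percentage change in distance travelled relative to 2018.
groningen["distance_change_pct"] = (groningen["distance_travelled"] / baseline - 1) * 100

print(groningen[["year", "distance_travelled", "distance_change_pct"]].round(1))
```

For these rows the distance travelled in 2020 is roughly 29% below the 2018 value. Applying the same calculation per province with `groupby('province')` would be one way to compare provinces.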

In [22]:
#showing the distance travelled
df2018 = df2[df2['year']==2018]

fig = px.bar(df2018, x='province', y='distance_travelled', 
             title="Average distance travelled per person per day per province (2018)", text = 'distance_travelled', color='province',
             color_discrete_map={'Groningen':'seaGreen', 'Fryslan':'seaGreen', 'Drenthe':'seaGreen',
                              'Overijssel':'seaGreen', 'Flevoland':'seaGreen', 'Gelderland':'royalBlue',
                              'Utrecht':'indianRed', 'Noord-Holland':'indianRed', 'Zuid-Holland':'indianRed',
                              'Zeeland':'seaGreen', 'Noord-Brabant':'royalBlue', 'Limburg':'royalBlue'})

fig.update_layout(xaxis={'categoryorder':'total descending'}, yaxis_range=[30,45])
fig.update_xaxes(title='Provinces')
fig.update_yaxes(title='Distance travelled per person per day (km)')
fig.show()

This plot shows that Drenthe, Flevoland and Fryslan have the highest average distance travelled per person per day. The provinces with the lowest average distance are Zuid-Holland, Limburg and Zeeland. As expected, the average distance travelled generally increases as the urbanisation grade decreases. Zeeland and Limburg are exceptions: they are among the less urbanised provinces, yet their residents travel less distance than those of highly urbanised provinces such as Zuid-Holland and Noord-Holland. These results match those found in paragraph 1. 
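
The visual comparison above could be complemented with a rank correlation between address density and distance travelled. The sketch below is one possible way to do this and was not part of the original analysis. The function name `density_distance_correlation` is hypothetical, and the demo frames contain invented placeholder values, not CBS figures. Spearman's rho is computed as the Pearson correlation of the ranks, so only pandas is needed.

```python
import pandas as pd

def density_distance_correlation(density_df, travel_df, year):
    """Spearman rank correlation between address density and distance travelled."""
    density = density_df.loc[density_df["year"] == year, ["provinces", "address_per_km2"]]
    travel = travel_df.loc[travel_df["year"] == year, ["province", "distance_travelled"]]

    # Match the two datasets on province name.
    merged = density.merge(travel, left_on="provinces", right_on="province")

    # Spearman's rho equals the Pearson correlation of the ranked values.
    return merged["address_per_km2"].rank().corr(merged["distance_travelled"].rank())

# Placeholder data (hypothetical values, for illustration only).
toy_density = pd.DataFrame({
    "provinces": ["P1", "P2", "P3", "P4"],
    "year": [2018] * 4,
    "address_per_km2": [100.0, 200.0, 300.0, 500.0],
})
toy_travel = pd.DataFrame({
    "province": ["P1", "P2", "P3", "P4"],
    "year": [2018] * 4,
    "distance_travelled": [42.0, 40.0, 36.0, 33.0],
})

rho = density_distance_correlation(toy_density, toy_travel, 2018)
print(rho)
```

A value close to -1 would indicate that denser provinces consistently travel less distance. With only twelve provinces, such a coefficient should be read as a rough indication rather than a firm result.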

In [21]:
df2018 = df2[df2['year']==2018]

fig = px.bar(df2018, x='province', y='trips', 
             title="Average number of trips per person per day per province (2018)", text = 'trips', color='province',
             color_discrete_map={'Groningen':'seaGreen', 'Fryslan':'seaGreen', 'Drenthe':'seaGreen',
                              'Overijssel':'seaGreen', 'Flevoland':'seaGreen', 'Gelderland':'royalBlue',
                              'Utrecht':'indianRed', 'Noord-Holland':'indianRed', 'Zuid-Holland':'indianRed',
                              'Zeeland':'seaGreen', 'Noord-Brabant':'royalBlue', 'Limburg':'royalBlue'})
                              
fig.update_layout(xaxis={'categoryorder':'total descending'}, yaxis_range=[2.5,3])
fig.update_xaxes(title='Provinces')
fig.update_yaxes(title='Number of trips per person per day')
fig.show()

In this diagram, the average number of trips per person per day is presented. The most urbanised provinces, Zuid-Holland and Noord-Holland, have a low number of trips per day; Utrecht is the exception. Furthermore, the moderately urbanised provinces do not have the highest number of trips. In conclusion, the average number of trips per province does not show the same relation to urbanisation grade as found in paragraph 1.

# Conclusion

The research leads to the following conclusions.

The less urbanised an area is, the more trips its residents make and the longer the distance they travel on average. However, areas that are not urbanised at all seem to be an exception: residents of these areas make very few trips and travel less distance than residents of areas with a higher urbanisation grade. 

A higher urbanisation grade seems to have the following effects on the use of different transport modes for residents of the area:
- Cars are used less often, and less distance is travelled by cars on average.
- A trip by car is shared by more passengers on average.
- Train, metro and bus are used more often, and more distance is travelled using these modes of transport on average.
- Walking and cycling are done more often, and more distance is travelled this way on average.

The frequency and distance of travel decreased in 2020 and 2021 due to the coronavirus pandemic. This decrease was largest in more urbanised areas, while less urbanised areas were affected less.

The relation between average distance travelled and grade of urbanisation seems to follow the same pattern per province as per square kilometre. The same is not true for the number of trips: its relation to the grade of urbanisation differs between the two levels. 





# References


[Centraal Bureau voor de Statistiek. (2016, September 12). *Verstedelijking: verschillen tussen stad en land*. Centraal Bureau Voor De Statistiek.](https://www.cbs.nl/nl-nl/achtergrond/2016/36/verstedelijking-verschillen-tussen-stad-en-land)

[Centraal Bureau voor de Statistiek. (2022, February 23). *Trips*. Retrieved 20 October 2022](https://www.cbs.nl/en-gb/onze-diensten/methods/definitions/trip)
