
# Data Preparation Notebook

This Jupyter notebook contains steps for preparing the dataset for further analysis and modeling. 
It includes data cleaning, handling missing data, and (pre-)processing data.

## Steps Included:
1. Converting Date and Time Columns
2. Transforming Numeric Data
3. Handling Missing Data
4. Cleaning String Data
5. Data Quality Assessment

Let's start by loading the dataset and necessary libraries.


In [1]:

import pandas as pd

# Load the dataset
file_path = 'combined_data.csv' 
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()


Unnamed: 0,NLSitNummer,DatumFileBegin,DatumFileEind,TijdFileBegin,TijdFileEind,FileZwaarte,GemLengte,FileDuur,HectometerKop,HectometerStaart,...,TrajVan,TrajNaar,OorzaakGronddetail,OorzaakVerloop,OorzaakCodeVerloop,OorzaakCode,Oorzaak_1,Oorzaak_2,Oorzaak_3,Oorzaak_4
0,3346718,2023-01-11,2023-01-11,17:22:32,17:44:04,58340,2709000,21533,1165,1145,...,Utrecht,'s-Hertogenbosch,Defect(e) voertuig(en),"[Geen oorzaakcode opgegeven door VWM 2], [Defe...","[000], [BKD]",BKD,Defect(e) voertuig(en),Defect voertuig,Incident (gestrand voertuig),Incident
1,3346719,2023-01-11,2023-01-11,17:22:32,17:43:00,74363,3633000,20467,0,37,...,Amersfoort,Utrecht,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 20],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
2,3346720,2023-01-11,2023-01-11,17:22:32,17:59:00,213030,5842000,36467,700,666,...,Muiden,Lelystad,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 36],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
3,3346721,2023-01-11,2023-01-11,17:22:32,17:30:04,20768,2757000,7533,98,122,...,Aken,Geleen,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 8],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
4,3346722,2023-01-11,2023-01-11,17:22:32,17:24:00,3080,2100000,1467,2232,2253,...,Enschede,Varsseveld,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 1],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit



## Convert Date and Time Columns

We'll convert columns with dates and times into a standard datetime format.


In [2]:

# Converting columns to datetime format
date_columns = ['DatumFileBegin', 'DatumFileEind']
time_columns = ['TijdFileBegin', 'TijdFileEind']

for col in date_columns:
    data[col] = pd.to_datetime(data[col])

for col in time_columns:
    data[col] = pd.to_datetime(data[col], format='%H:%M:%S').dt.time

# Display the updated dataset
data.head()


Unnamed: 0,NLSitNummer,DatumFileBegin,DatumFileEind,TijdFileBegin,TijdFileEind,FileZwaarte,GemLengte,FileDuur,HectometerKop,HectometerStaart,...,TrajVan,TrajNaar,OorzaakGronddetail,OorzaakVerloop,OorzaakCodeVerloop,OorzaakCode,Oorzaak_1,Oorzaak_2,Oorzaak_3,Oorzaak_4
0,3346718,2023-01-11,2023-01-11,17:22:32,17:44:04,58340,2709000,21533,1165,1145,...,Utrecht,'s-Hertogenbosch,Defect(e) voertuig(en),"[Geen oorzaakcode opgegeven door VWM 2], [Defe...","[000], [BKD]",BKD,Defect(e) voertuig(en),Defect voertuig,Incident (gestrand voertuig),Incident
1,3346719,2023-01-11,2023-01-11,17:22:32,17:43:00,74363,3633000,20467,0,37,...,Amersfoort,Utrecht,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 20],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
2,3346720,2023-01-11,2023-01-11,17:22:32,17:59:00,213030,5842000,36467,700,666,...,Muiden,Lelystad,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 36],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
3,3346721,2023-01-11,2023-01-11,17:22:32,17:30:04,20768,2757000,7533,98,122,...,Aken,Geleen,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 8],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
4,3346722,2023-01-11,2023-01-11,17:22:32,17:24:00,3080,2100000,1467,2232,2253,...,Enschede,Varsseveld,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 1],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
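
The conversion above raises an error if a value does not match the expected format. As a minimal sketch, assuming some rows could be malformed, `errors="coerce"` turns unparseable values into `NaT` so they can be counted before deciding what to do with them. The `sample` frame and its invalid entries are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample mimicking the date and time columns of the dataset;
# the last row is deliberately invalid.
sample = pd.DataFrame({
    "DatumFileBegin": ["2023-01-11", "2023-01-12", "not a date"],
    "TijdFileBegin": ["17:22:32", "07:30:28", "25:61:00"],
})

# errors="coerce" replaces unparseable entries with NaT instead of raising.
dates = pd.to_datetime(sample["DatumFileBegin"], format="%Y-%m-%d", errors="coerce")
times = pd.to_datetime(sample["TijdFileBegin"], format="%H:%M:%S", errors="coerce")

# Count how many entries failed to parse in each column.
print("unparseable dates:", int(dates.isna().sum()))
print("unparseable times:", int(times.isna().sum()))
```

Counting the `NaT` values first makes it visible how many rows would be affected before they are dropped or repaired.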



## Transform Numeric Data

Transform columns with numeric data stored as strings into numeric formats.


In [3]:

# Convert numeric data stored as strings to numeric format
numeric_columns = ['FileZwaarte', 'GemLengte']


for col in numeric_columns:
    data[col] = data[col].str.replace(',', '').astype(float)



data['HectometerKop'] = data['HectometerKop'].str.replace(',', '.').astype(float)
data['HectometerStaart'] = data['HectometerStaart'].str.replace(',', '.').astype(float)

# Display the updated dataset
data.head()

Unnamed: 0,NLSitNummer,DatumFileBegin,DatumFileEind,TijdFileBegin,TijdFileEind,FileZwaarte,GemLengte,FileDuur,HectometerKop,HectometerStaart,...,TrajVan,TrajNaar,OorzaakGronddetail,OorzaakVerloop,OorzaakCodeVerloop,OorzaakCode,Oorzaak_1,Oorzaak_2,Oorzaak_3,Oorzaak_4
0,3346718,2023-01-11,2023-01-11,17:22:32,17:44:04,58340.0,2709000.0,21533,116.5,114.5,...,Utrecht,'s-Hertogenbosch,Defect(e) voertuig(en),"[Geen oorzaakcode opgegeven door VWM 2], [Defe...","[000], [BKD]",BKD,Defect(e) voertuig(en),Defect voertuig,Incident (gestrand voertuig),Incident
1,3346719,2023-01-11,2023-01-11,17:22:32,17:43:00,74363.0,3633000.0,20467,0.0,3.7,...,Amersfoort,Utrecht,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 20],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
2,3346720,2023-01-11,2023-01-11,17:22:32,17:59:00,213030.0,5842000.0,36467,70.0,66.6,...,Muiden,Lelystad,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 36],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
3,3346721,2023-01-11,2023-01-11,17:22:32,17:30:04,20768.0,2757000.0,7533,9.8,12.2,...,Aken,Geleen,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 8],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
4,3346722,2023-01-11,2023-01-11,17:22:32,17:24:00,3080.0,2100000.0,1467,223.2,225.3,...,Enschede,Varsseveld,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 1],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
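
The hectometer columns use a decimal comma, which is why the comma is replaced by a dot before casting. A small sketch of the same idea as a reusable helper is given here; `parse_decimal_comma` is a hypothetical name, and the choice to turn unparseable values into `NaN` via `pd.to_numeric(..., errors="coerce")` is an assumption rather than something the notebook does.

```python
import pandas as pd

def parse_decimal_comma(series: pd.Series) -> pd.Series:
    """Parse strings such as '116,5' into floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.strip().str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical values in the style of the hectometer columns.
sample = pd.Series(["116,5", " 3,7", "70", "n/a"])
parsed = parse_decimal_comma(sample)
print(parsed)
```

Unlike `astype(float)`, this variant does not stop at the first bad value, which can help when the raw file contains placeholders.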



## Handling Missing Data

Identifying and addressing missing data in the dataset.


In [4]:

# Check for missing values
missing_data = data.isnull().sum()
missing_data[missing_data > 0]



Series([], dtype: int64)

In [5]:
# Drop rows with any missing values
data = data.dropna()

# Verify the operation by checking for missing values again
missing_data_after = data.isnull().sum()
missing_data_after[missing_data_after > 0]

Series([], dtype: int64)
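
In this dataset no column has missing values, so `dropna()` removes nothing. For datasets where that is not the case, a sketch like the following reports the count and share of missing values per column and how many rows a full `dropna()` would discard. The `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
sample = pd.DataFrame({
    "FileDuur": [21.5, np.nan, 36.4, 7.5],
    "TrajVan": ["Utrecht", "Amersfoort", None, "Aken"],
    "RouteOms": ["A2", "A28", "A6", "A76"],
})

# Count and share of missing values per column, limited to affected columns.
missing_count = sample.isnull().sum()
missing_report = pd.DataFrame({
    "missing": missing_count,
    "share": missing_count / len(sample),
})
missing_report = missing_report[missing_report["missing"] > 0]

# Number of rows that dropna() would remove.
rows_dropped = len(sample) - len(sample.dropna())

print(missing_report)
print("rows dropped by dropna():", rows_dropped)
```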


## Cleaning String Data

Trimming leading and trailing whitespace (including stray newlines) from string columns.


In [6]:
# Cleaning string columns
string_columns = data.select_dtypes(include='object').columns

for col in string_columns:
    data[col] = data[col].astype(str).str.strip()

# Display the updated dataset
data.head()


Unnamed: 0,NLSitNummer,DatumFileBegin,DatumFileEind,TijdFileBegin,TijdFileEind,FileZwaarte,GemLengte,FileDuur,HectometerKop,HectometerStaart,...,TrajVan,TrajNaar,OorzaakGronddetail,OorzaakVerloop,OorzaakCodeVerloop,OorzaakCode,Oorzaak_1,Oorzaak_2,Oorzaak_3,Oorzaak_4
0,3346718,2023-01-11,2023-01-11,17:22:32,17:44:04,58340.0,2709000.0,21533,116.5,114.5,...,Utrecht,'s-Hertogenbosch,Defect(e) voertuig(en),"[Geen oorzaakcode opgegeven door VWM 2], [Defe...","[000], [BKD]",BKD,Defect(e) voertuig(en),Defect voertuig,Incident (gestrand voertuig),Incident
1,3346719,2023-01-11,2023-01-11,17:22:32,17:43:00,74363.0,3633000.0,20467,0.0,3.7,...,Amersfoort,Utrecht,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 20],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
2,3346720,2023-01-11,2023-01-11,17:22:32,17:59:00,213030.0,5842000.0,36467,70.0,66.6,...,Muiden,Lelystad,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 36],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
3,3346721,2023-01-11,2023-01-11,17:22:32,17:30:04,20768.0,2757000.0,7533,9.8,12.2,...,Aken,Geleen,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 8],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
4,3346722,2023-01-11,2023-01-11,17:22:32,17:24:00,3080.0,2100000.0,1467,223.2,225.3,...,Enschede,Varsseveld,Spitsfile (geen oorzaak gemeld),[Geen oorzaakcode opgegeven door VWM 1],[000],001,Spitsfile (geen oorzaak gemeld),Geen oorzaak gemeld,Drukte,Hoge intensiteit
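
One side effect worth noting: columns holding `datetime.time` objects also have the `object` dtype, so `astype(str)` in the cell above would likely turn them into plain strings. A possible alternative, sketched here under that assumption, strips only values that are actually strings and leaves everything else untouched. `strip_if_str` and the `sample` frame are hypothetical.

```python
import datetime as dt
import pandas as pd

# Hypothetical frame with one text column and one column of time objects.
sample = pd.DataFrame({
    "TrajVan": ["  Utrecht ", "Aken\n"],
    "TijdFileBegin": [dt.time(17, 22, 32), dt.time(7, 30, 28)],
})

def strip_if_str(value):
    """Strip surrounding whitespace from strings; return other values unchanged."""
    return value.strip() if isinstance(value, str) else value

for col in sample.columns:
    sample[col] = sample[col].map(strip_if_str)

print(sample)
```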


## Removing Square Brackets in 'OorzaakVerloop' Column

In the `OorzaakVerloop` column, each entry is enclosed within square brackets, which are not necessary for our analysis and may interfere with certain data processing tasks. To clean this data, we will remove the square brackets from each entry in this column.

This step enhances the readability of the data and ensures that any subsequent text processing or analysis on this column does not get affected by these extraneous characters.


In [7]:
# Remove square brackets from 'OorzaakVerloop' column
data['OorzaakVerloop'] = data['OorzaakVerloop'].str.replace('[', '', regex=False)
data['OorzaakVerloop'] = data['OorzaakVerloop'].str.replace(']', '', regex=False)

# Display the updated 'OorzaakVerloop' column
data['OorzaakVerloop'].head()


0    Geen oorzaakcode opgegeven door VWM 2, Defect(...
1               Geen oorzaakcode opgegeven door VWM 20
2               Geen oorzaakcode opgegeven door VWM 36
3                Geen oorzaakcode opgegeven door VWM 8
4                Geen oorzaakcode opgegeven door VWM 1
Name: OorzaakVerloop, dtype: object
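
The two literal replacements can also be written as a single pass with a regular-expression character class. This is only an equivalent sketch of the same cleanup; the sample values are taken from the `OorzaakCodeVerloop` and `OorzaakVerloop` columns shown earlier.

```python
import pandas as pd

sample = pd.Series([
    "[000], [BKD]",
    "[Geen oorzaakcode opgegeven door VWM 20]",
])

# The character class [\[\]] matches either an opening or a closing square bracket.
cleaned = sample.str.replace(r"[\[\]]", "", regex=True)
print(cleaned)
```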

# Dataset Translation from Dutch to English

## Purpose
The dataset originally contains information in Dutch, which is not accessible or understandable to all potential users, especially those who are not proficient in Dutch. To make this dataset more universally accessible and easier to work with, we will translate it from Dutch to English.

## Scope
1. **Column Headers**: All column headers will be translated from Dutch to English to provide a clear understanding of the dataset's structure.
2. **Column Contents**: The textual contents of each column will also be translated. This step is crucial for columns containing descriptive information or categorical data.

## Method
We will use the `googletrans` library for this translation. This library provides a convenient way to access Google Translate's capabilities programmatically.

## Considerations
- The translation process can be time-consuming for large datasets.
- Automated translations may not always be perfect, and there could be nuances in the text that are not captured accurately.
- We will handle translation errors and retain the original Dutch text where translation is not feasible or produces unclear results.

By translating the dataset, we aim to enhance its accessibility and usability for a broader audience, facilitating better data understanding and analysis.
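
Because translating every cell can be slow, one option is to translate each distinct value only once and map the results back onto the column. The sketch below illustrates that idea with a stand-in translator; `translate_column` and `fake_translate` are hypothetical names, and a real translator would call a translation service instead of a lookup table.

```python
from typing import Callable

import pandas as pd

def translate_column(series: pd.Series, translate: Callable[[str], str]) -> pd.Series:
    """Translate each distinct value once, keeping the original text on failure."""
    mapping = {}
    for value in series.dropna().unique():
        try:
            mapping[value] = translate(value)
        except Exception:
            mapping[value] = value  # fall back to the original Dutch text
    return series.map(mapping)

# Stand-in translator for illustration only; it records every call it receives.
calls = []

def fake_translate(text: str) -> str:
    calls.append(text)
    lookup = {"oplopend": "ascending", "aflopend": "descending"}
    return lookup[text]  # raises KeyError for unknown words

sample = pd.Series(["oplopend", "aflopend", "oplopend", "onbekend"])
translated = translate_column(sample, fake_translate)
print(translated)
print("translator calls:", len(calls))
```

With four rows but three distinct values, the translator is called three times; on a column with a handful of categories and hundreds of thousands of rows the saving would be much larger.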


In [8]:
from googletrans import Translator
import pandas as pd


# Create a Translator object
translator = Translator()

# Function to translate text
def translate_text(text, src_language='nl', dest_language='en'):
    """Translate the specified text from src_language to dest_language."""
    try:
        return translator.translate(text, src=src_language, dest=dest_language).text
    except Exception as e:
        print(f"Error while translating '{text}': {e}")
        return text  # Return original text if translation fails

# Translate column headers
translated_headers = [translate_text(header) for header in data.columns]

# Create a new DataFrame with translated headers
translated_data = pd.DataFrame(data.values, columns=translated_headers)

# Check the translated headers
print(translated_data.columns)



Error while translating 'NLSitNummer': 'NoneType' object has no attribute 'group'
Error while translating 'DatumFileBegin': 'NoneType' object has no attribute 'group'
Error while translating 'DatumFileEind': 'NoneType' object has no attribute 'group'
Error while translating 'TijdFileBegin': 'NoneType' object has no attribute 'group'
Error while translating 'TijdFileEind': 'NoneType' object has no attribute 'group'
Error while translating 'FileZwaarte': 'NoneType' object has no attribute 'group'
Error while translating 'GemLengte': 'NoneType' object has no attribute 'group'
Error while translating 'FileDuur': 'NoneType' object has no attribute 'group'
Error while translating 'HectometerKop': 'NoneType' object has no attribute 'group'
Error while translating 'HectometerStaart': 'NoneType' object has no attribute 'group'
Error while translating 'RouteLet': 'NoneType' object has no attribute 'group'
Error while translating 'RouteNum': 'NoneType' object has no attribute 'group'
Error while 

The automated translation raised errors for the column headers shown above, so the headers will be translated manually instead.

In [9]:

translated_headers = {
    'NLSitNummer': 'NL Site Number',
    'DatumFileBegin': 'File Start Date',
    'DatumFileEind': 'File End Date',
    'TijdFileBegin': 'File Start Time',
    'TijdFileEind': 'File End Time',
    'FileZwaarte': 'File Severity',
    'GemLengte': 'Average Length',
    'FileDuur': 'File Duration',
    'HectometerKop': 'Hectometer Head',
    'HectometerStaart': 'Hectometer Tail',
    'RouteLet': 'Route Letter',
    'RouteNum': 'Route Number',
    'RouteOms': 'Route Description',
    'hectometreringsrichting': 'Hectometering Direction',
    'KopWegvakVan': 'Head Road Section From',
    'KopWegvakNaar': 'Head Road Section To',
    'TrajVan': 'Trajectory From',
    'TrajNaar': 'Trajectory To',
    'OorzaakGronddetail': 'Cause Ground Detail',
    'OorzaakVerloop': 'Cause Progression',
    'OorzaakCodeVerloop': 'Cause Code Progression',
    'OorzaakCode': 'Cause Code',
    'Oorzaak_1': 'Cause 1',
    'Oorzaak_2': 'Cause 2',
    'Oorzaak_3': 'Cause 3',
    'Oorzaak_4': 'Cause 4',
  
}

# Replace the headers in the dataset
data.rename(columns=translated_headers, inplace=True)
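
`rename` silently ignores dictionary keys that do not match any column and leaves unmatched columns unchanged, so a quick coverage check can catch typos in a manual mapping. The following is a minimal sketch with a hypothetical three-column frame and a shortened `header_map`.

```python
import pandas as pd

# Hypothetical frame with only a few of the original headers.
sample = pd.DataFrame(columns=["Unnamed: 0", "NLSitNummer", "RouteOms"])

header_map = {
    "NLSitNummer": "NL Site Number",
    "RouteOms": "Route Description",
    "RouteLet": "Route Letter",
}

# Columns that will keep their original name, and mapping keys that match nothing.
untranslated = [col for col in sample.columns if col not in header_map]
unused_keys = [key for key in header_map if key not in sample.columns]

renamed = sample.rename(columns=header_map)

print("untranslated columns:", untranslated)
print("unused mapping keys:", unused_keys)
print(list(renamed.columns))
```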



Now let's translate the contents of the Hectometering Direction column.

In [10]:

translation_dict = {
    'oplopend': 'ascending',
    'aflopend': 'descending'
}

data['Hectometering Direction'] = data['Hectometering Direction'].replace(translation_dict)




data['Hectometering Direction']

0          ascending
1         descending
2          ascending
3         descending
4         descending
             ...    
352244    descending
352245    descending
352246     ascending
352247     ascending
352248     ascending
Name: Hectometering Direction, Length: 352249, dtype: object

The changes were successful.

Now I am going to drop all the unneeded columns.

In [11]:

data.drop(columns=["Cause Code Progression", "Cause Code", "Cause 1", "Cause 2", "Cause 3", "Cause 4", "Head Road Section From", "Head Road Section To", "Cause Progression" ], inplace=True)
data




Unnamed: 0,NL Site Number,File Start Date,File End Date,File Start Time,File End Time,File Severity,Average Length,File Duration,Hectometer Head,Hectometer Tail,Route Letter,Route Number,Route Description,Hectometering Direction,Trajectory From,Trajectory To,Cause Ground Detail
0,3346718,2023-01-11,2023-01-11,17:22:32,17:44:04,58340.0,2709000.0,21533,116.5,114.5,A,2,A2,ascending,Utrecht,'s-Hertogenbosch,Defect(e) voertuig(en)
1,3346719,2023-01-11,2023-01-11,17:22:32,17:43:00,74363.0,3633000.0,20467,0.0,3.7,A,28,A28,descending,Amersfoort,Utrecht,Spitsfile (geen oorzaak gemeld)
2,3346720,2023-01-11,2023-01-11,17:22:32,17:59:00,213030.0,5842000.0,36467,70.0,66.6,A,6,A6,ascending,Muiden,Lelystad,Spitsfile (geen oorzaak gemeld)
3,3346721,2023-01-11,2023-01-11,17:22:32,17:30:04,20768.0,2757000.0,7533,9.8,12.2,A,76,A76,descending,Aken,Geleen,Spitsfile (geen oorzaak gemeld)
4,3346722,2023-01-11,2023-01-11,17:22:32,17:24:00,3080.0,2100000.0,1467,223.2,225.3,N,18,N18,descending,Enschede,Varsseveld,Spitsfile (geen oorzaak gemeld)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352244,3334697,2022-12-21,2022-12-21,16:48:31,16:57:00,24270.0,2861000.0,8483,39.7,42.5,N,35,N35,descending,Almelo,Zwolle,Spitsfile (geen oorzaak gemeld)
352245,3334698,2022-12-21,2022-12-21,16:48:31,16:52:00,11843.0,3400000.0,3483,52.1,55.5,N,57,N57,descending,Westenschouwen,Ellemeet,Spitsfile (geen oorzaak gemeld)
352246,3334699,2022-12-21,2022-12-21,16:49:32,16:53:00,11093.0,3200000.0,3467,61.8,58.6,A,12,A12,ascending,Den Haag,Utrecht,Spitsfile (geen oorzaak gemeld)
352247,3334700,2022-12-21,2022-12-21,16:49:32,16:55:03,15193.0,2754000.0,5517,19.7,17.7,A,16,A16,ascending,Rotterdam,Breda,Spitsfile (geen oorzaak gemeld)


I am going to translate the values in the Cause Ground Detail to English.

In [12]:
cause_ground_detail_translations = {
    'Defect(e) voertuig(en)': 'Defective vehicle(s)',
    'Spitsfile (geen oorzaak gemeld)': 'Rush hour traffic jam (no cause reported)',
    'File buiten spits (geen oorzaak gemeld)': 'Traffic jam outside rush hour (no cause reported)',
    'Spitsfile (geen oorzaak gemeld) met gevonden werk in Spin': 'Rush hour traffic jam (no cause reported) with work found in Spin',
    'Ongeval (in een spitsfile)': 'Accident (in a rush hour traffic jam)',
    'Ongeval(len)': 'Accident(s)',
    'Ongeval met vrachtwagen(s)': 'Accident involving truck(s)',
    'Spitsfile (met defect voertuig)': 'Rush hour traffic jam (with defective vehicle)',
    'Ongeval op aansluitende weg': 'Accident on connecting road',
    'Incident op aansluitende weg': 'Incident on connecting road',
    'Schade aan wegmeubilair': 'Damage to road furniture',
    'Wegwerkzaamheden': 'Roadworks',
    'Opruimingswerkzaamheden': 'Cleanup operations',
    'Spitsfile (met ongeval)': 'Rush hour traffic jam (with accident)',
    'Defecte vrachtwagen(s)': 'Defective truck(s)',
    'Opgehoogde werkzaamheden': 'Elevated works',
    'Eerder(e) ongeval(len)': 'Previous accident(s)',
    'Wegdek in slechte toestand': 'Road surface in poor condition',
    'Ongeval vrachtwagen (met opruim/berging)': 'Truck accident (with cleanup/tow)',
    'Afremmend verkeer als gevolg van kijkers naar ongeval(len)': 'Traffic slowing due to onlookers at accident(s)',
   
}

# Apply the translations to the 'Cause Ground Detail' column (formerly 'OorzaakGronddetail');
# values without an entry in the dictionary keep their original Dutch text
data['Cause Ground Detail'] = data['Cause Ground Detail'].map(cause_ground_detail_translations).fillna(data['Cause Ground Detail'])

# The DataFrame 'data' now has the translated values in the 'Cause Ground Detail' column
data['Cause Ground Detail']



0                              Defective vehicle(s)
1         Rush hour traffic jam (no cause reported)
2         Rush hour traffic jam (no cause reported)
3         Rush hour traffic jam (no cause reported)
4         Rush hour traffic jam (no cause reported)
                            ...                    
352244    Rush hour traffic jam (no cause reported)
352245    Rush hour traffic jam (no cause reported)
352246    Rush hour traffic jam (no cause reported)
352247    Rush hour traffic jam (no cause reported)
352248    Rush hour traffic jam (no cause reported)
Name: Cause Ground Detail, Length: 352249, dtype: object

Again, the changes were successful, and the column is now easier to read.
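
Because `map(...).fillna(...)` keeps the Dutch text for any value missing from the dictionary, it can be useful to list which values were not covered. A small sketch of such a check follows; the shortened `translations` dictionary reuses two entries from above, and `'Onbekende oorzaak'` is a hypothetical value added for illustration.

```python
import pandas as pd

translations = {
    "Wegwerkzaamheden": "Roadworks",
    "Ongeval(len)": "Accident(s)",
}

sample = pd.Series([
    "Wegwerkzaamheden",
    "Ongeval(len)",
    "Onbekende oorzaak",  # hypothetical value without a translation
    "Wegwerkzaamheden",
])

# Values that are not keys of the dictionary, with how often each occurs.
unmapped = sample[~sample.isin(list(translations))].value_counts()
print(unmapped)
```

Running this on the original column before translating would show which categories still need an entry.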

Now that the data is prepared and in a more readable format, I am going to filter it for the route between Maarheeze and Eindhoven and annotate the route.

In [13]:
# Keep descending A2 records with head below 182.0 and tail above 155.0
data_for_filtering = data[
    (data["Route Description"] == "A2") &
    (data['Hectometer Head'] < 182.0) &
    (data['Hectometer Tail'] > 155.0) &
    (data['Hectometering Direction'] == 'descending')
].copy()  # copy after filtering so the new column is added to an independent DataFrame
data_for_filtering['Route'] = "M-E"



# Save the filtered data
modified_file_path = './Maarheeze-Eindhoven.csv'
data_for_filtering.to_csv(modified_file_path, index=False)

# Display info of the filtered data
data_for_filtering

Unnamed: 0,NL Site Number,File Start Date,File End Date,File Start Time,File End Time,File Severity,Average Length,File Duration,Hectometer Head,Hectometer Tail,Route Letter,Route Number,Route Description,Hectometering Direction,Trajectory From,Trajectory To,Cause Ground Detail,Route
24,3347248,2023-01-12,2023-01-12,07:30:28,09:45:05,1102808.0,8192000.0,134617,181.5,183.6,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
39,3347573,2023-01-12,2023-01-12,08:08:30,08:12:00,9273.0,2650000.0,3500,163.2,165.8,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
124,3348623,2023-01-12,2023-01-12,16:29:31,16:36:03,17240.0,2639000.0,6533,164.2,166.3,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
240,3350696,2023-01-16,2023-01-16,06:49:25,06:55:01,13518.0,2414000.0,5600,163.4,165.6,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
250,3350899,2023-01-16,2023-01-16,07:35:27,07:51:00,39717.0,2554000.0,15550,164.2,166.5,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
351682,3335191,2022-12-22,2022-12-22,07:51:29,08:03:00,26770.0,2324000.0,11517,163.4,166.4,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
351717,3335194,2022-12-22,2022-12-22,07:52:29,07:57:00,12253.0,2713000.0,4517,174.4,177.2,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
351834,3335214,2022-12-22,2022-12-22,08:03:30,08:10:02,16327.0,2499000.0,6533,164.4,166.7,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E
352068,3335151,2022-12-22,2022-12-22,07:39:28,07:48:00,17920.0,2100000.0,8533,164.2,166.3,A,2,A2,descending,Maastricht,Eindhoven,Rush hour traffic jam (no cause reported),M-E


Now I will filter the data for Eindhoven to Den Bosch and apply the same logic.

In [14]:
# Keep descending A2 records with head below 154.8 and tail above 112.8
data_for_filtering = data[
    (data["Route Description"] == "A2") &
    (data['Hectometer Head'] < 154.8) &
    (data['Hectometer Tail'] > 112.8) &
    (data['Hectometering Direction'] == 'descending')
].copy()  # copy after filtering so the new column is added to an independent DataFrame
data_for_filtering['Route'] = "E-D"

# Save the filtered data
modified_file_path = './Eindhoven-DenBosch.csv'
data_for_filtering.to_csv(modified_file_path, index=False)

# Display info of the filtered data
data_for_filtering


Unnamed: 0,NL Site Number,File Start Date,File End Date,File Start Time,File End Time,File Severity,Average Length,File Duration,Hectometer Head,Hectometer Tail,Route Letter,Route Number,Route Description,Hectometering Direction,Trajectory From,Trajectory To,Cause Ground Detail,Route
151,3348898,2023-01-12,2023-01-12,17:07:32,17:18:00,38938.0,3720000.0,10467,110.9,113.0,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
181,3349412,2023-01-12,2023-01-12,18:14:31,18:16:00,4005.0,2700000.0,1483,110.9,113.6,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
245,3350794,2023-01-16,2023-01-16,07:16:27,08:48:01,329302.0,3596000.0,91567,110.9,113.1,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
295,3346817,2023-01-11,2023-01-11,17:37:32,17:40:04,5320.0,2100000.0,2533,110.9,113.0,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
859,3346384,2023-01-11,2023-01-11,16:31:33,17:12:44,165080.0,4008000.0,41183,114.6,116.4,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Accident(s),E-D
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
351580,3337426,2022-12-26,2022-12-26,13:27:28,13:32:00,10117.0,2232000.0,4533,112.1,114.2,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Traffic jam outside rush hour (no cause reported),E-D
351671,3334924,2022-12-21,2022-12-21,17:33:30,17:36:02,5320.0,2100000.0,2533,110.9,113.0,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
351743,3334829,2022-12-21,2022-12-21,17:15:32,17:17:00,3667.0,2500000.0,1467,111.7,114.2,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D
352063,3334988,2022-12-21,2022-12-21,17:51:31,17:55:02,7033.0,2000000.0,3517,110.9,112.9,A,2,A2,descending,'s-Hertogenbosch,Utrecht,Rush hour traffic jam (no cause reported),E-D


Now I will filter the data for Eindhoven to Arnhem and apply the same logic.

In [15]:
# Keep ascending A50 records with head below 162.2 and tail above 93.5
data_for_filtering = data[
    (data["Route Description"] == "A50") &
    (data['Hectometer Head'] < 162.2) &
    (data['Hectometer Tail'] > 93.5) &
    (data['Hectometering Direction'] == 'ascending')
].copy()  # copy after filtering so the new column is added to an independent DataFrame

data_for_filtering['Route'] = "E-A"

data_for_filtering.to_csv("./Eindhoven-Arnhem.csv", index=False)
data_for_filtering

Unnamed: 0,NL Site Number,File Start Date,File End Date,File Start Time,File End Time,File Severity,Average Length,File Duration,Hectometer Head,Hectometer Tail,Route Letter,Route Number,Route Description,Hectometering Direction,Trajectory From,Trajectory To,Cause Ground Detail,Route
33,3347448,2023-01-12,2023-01-12,07:54:30,08:44:16,224497.0,4511000.0,49767,136.8,134.6,A,50,A50,ascending,Oss,Arnhem,Accident(s),E-A
128,3348657,2023-01-12,2023-01-12,16:35:31,16:38:00,5745.0,2313000.0,2483,112.7,110.3,A,50,A50,ascending,Eindhoven,Oss,Rush hour traffic jam (no cause reported),E-A
278,3346466,2023-01-11,2023-01-11,16:45:32,17:19:00,110052.0,3288000.0,33467,117.8,113.8,A,50,A50,ascending,Eindhoven,Oss,Rush hour traffic jam (no cause reported),E-A
357,3347822,2023-01-12,2023-01-12,08:40:31,09:16:03,147383.0,4148000.0,35533,139.9,137.6,A,50,A50,ascending,Oss,Arnhem,Rush hour traffic jam (no cause reported),E-A
364,3347942,2023-01-12,2023-01-12,08:57:31,09:03:00,14012.0,2555000.0,5483,112.9,110.7,A,50,A50,ascending,Eindhoven,Oss,Rush hour traffic jam (no cause reported),E-A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352089,3334677,2022-12-21,2022-12-21,16:46:32,17:08:00,71295.0,3321000.0,21467,140.1,137.9,A,50,A50,ascending,Oss,Arnhem,Rush hour traffic jam (no cause reported),E-A
352132,3334785,2022-12-21,2022-12-21,17:07:32,18:09:00,312195.0,5079000.0,61467,118.8,114.4,A,50,A50,ascending,Eindhoven,Oss,Rush hour traffic jam (no cause reported),E-A
352138,3334893,2022-12-21,2022-12-21,17:27:32,17:48:00,64373.0,3145000.0,20467,140.2,137.7,A,50,A50,ascending,Oss,Arnhem,Rush hour traffic jam (no cause reported),E-A
352177,3334899,2022-12-21,2022-12-21,17:28:32,17:37:00,25895.0,3058000.0,8467,111.3,108.6,A,50,A50,ascending,Eindhoven,Oss,Rush hour traffic jam (no cause reported),E-A
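
The three filtering cells differ only in road, hectometer bounds, direction and label, so they could be expressed with one helper. The sketch below is one possible way to do that; `filter_route` and its parameter names are hypothetical, and the `sample` frame holds a few hectometer values taken from the outputs above.

```python
import pandas as pd

def filter_route(df, road, head_below, tail_above, direction, label):
    """Return a labelled copy of the rows matching road, hectometer bounds and direction."""
    mask = (
        (df["Route Description"] == road)
        & (df["Hectometer Head"] < head_below)
        & (df["Hectometer Tail"] > tail_above)
        & (df["Hectometering Direction"] == direction)
    )
    subset = df.loc[mask].copy()
    subset["Route"] = label
    return subset

sample = pd.DataFrame({
    "Route Description": ["A2", "A2", "A50", "A2"],
    "Hectometer Head": [163.2, 110.9, 136.8, 181.5],
    "Hectometer Tail": [165.8, 113.0, 134.6, 183.6],
    "Hectometering Direction": ["descending", "descending", "ascending", "descending"],
})

# Same bounds as the three cells above.
m_e = filter_route(sample, "A2", 182.0, 155.0, "descending", "M-E")
e_d = filter_route(sample, "A2", 154.8, 112.8, "descending", "E-D")
e_a = filter_route(sample, "A50", 162.2, 93.5, "ascending", "E-A")

print(m_e.index.tolist(), e_d.index.tolist(), e_a.index.tolist())
```

Each result could then be written out with `to_csv(..., index=False)` as in the cells above.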
