# Exploratory Data Analysis (EDA) of Maarheeze-Eindhoven Dataset

This notebook presents an exploratory data analysis of a dataset related to events between Maarheeze and Eindhoven. The analysis includes basic data structure examination, summary statistics, missing value analysis, unique value exploration, and data visualization.


In [None]:
import pandas as pd

# Load the dataset
file_path = 'dataMaarheezeEindhoven.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()


## Data Structure

First, let's examine the structure of the dataset, including the number of rows, columns, and the data types of each column.


In [None]:
# Checking the structure of the dataset
data_structure = {
    "Number of Rows": data.shape[0],
    "Number of Columns": data.shape[1],
    "Column Names": data.columns.tolist(),
    "Data Types": data.dtypes
}

data_structure


## Summary Statistics and Missing Values

Next, we generate basic summary statistics for the numerical columns and check for any missing values in the dataset.


In [None]:
# Generating summary statistics for numerical columns
numerical_summary = data.describe()

# Checking for missing values
missing_values = data.isnull().sum()

numerical_summary, missing_values[missing_values > 0]  # Only showing columns with missing values
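
Raw missing-value counts are easier to judge next to the size of the dataset. The following is a minimal sketch that adds a percentage column; the helper name `missing_summary` and the small `toy` frame are illustrative assumptions, not part of the original notebook or dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    missing = df.isna().sum()
    out = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(df)).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return out[out["missing"] > 0].sort_values("missing", ascending=False)


# Made-up data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(data)`.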


## Unique Values in Categorical Columns

Understanding the diversity in categorical columns is essential. We will explore the unique values in some key categorical columns.


In [None]:
# Exploring unique values in certain categorical columns
unique_values = {
    "Trajectory From": data['Trajectory From'].unique(),
    "Trajectory To": data['Trajectory To'].unique(),
    "Cause Ground Detail": data['Cause Ground Detail'].unique(),
    "Cause Progression": data['Cause Progression'].unique(),
    "Month": data['Month'].unique(),
    "Route": data['route'].unique()
}

unique_values


## Data Visualization

Visualizations will help us better understand the distribution and relationships of data in the dataset. We will create plots to visualize the distribution of events over months and the frequency of different causes.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style of the plots
sns.set(style="whitegrid")



In [None]:
# Distribution of events over months
# Create the figure first so the plot is drawn on it rather than on an implicit one
plt.figure(figsize=(15, 5))
sns.countplot(x='Month', data=data, palette='viridis')
plt.title('Distribution of Events Over Months')

plt.tight_layout()
plt.show()




The bar chart illustrates the number of events that occurred in each month throughout a given year. The vertical axis represents the count of events, and the horizontal axis represents the months, labeled 1 through 12.

From the chart, we can observe the following:
- The number of events generally increases as the year progresses, with a slight dip in the mid-year months.
- The year starts with a lower count of events, which might reflect less movement during winter conditions, though the chart alone cannot confirm this.
- A significant increase in events is noticeable in the later months, peaking in the last month, December. This could potentially be related to end-of-year activities, holiday-related events, or weather changes.
- The months with the highest number of events are 8 (August) and 12 (December), pointing to specific periods of high activity or seasonal factors that may contribute to more events.
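
Observations like these are easier to verify against a table than against bar heights. The following is a minimal sketch that tabulates the counts and each month's share of the total; it assumes the `Month` column holds month numbers 1 through 12, and the helper name `monthly_event_counts` and the `toy` frame are illustrative assumptions.

```python
import pandas as pd


def monthly_event_counts(df: pd.DataFrame, month_col: str = "Month") -> pd.DataFrame:
    """Event counts and share of the total per month, ordered by month number."""
    counts = df[month_col].value_counts().sort_index()
    return pd.DataFrame({
        "events": counts,
        "share": (counts / counts.sum()).round(3),
    })


# Made-up data for illustration only (not the real dataset)
toy = pd.DataFrame({"Month": [1, 1, 8, 8, 8, 12, 12, 12, 12, 6]})
table = monthly_event_counts(toy)
print(table)
print("Busiest month:", table["events"].idxmax())
```

On the real data this would be called as `monthly_event_counts(data)`.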


In [None]:
# Frequency of different causes (top 10)
# Create the figure first so the plot is drawn on it rather than on an implicit one
plt.figure(figsize=(15, 5))
top_causes = data['Cause Ground Detail'].value_counts().head(10)
sns.barplot(y=top_causes.index, x=top_causes.values, palette='rocket')
plt.title('Top 10 Causes of Events')

plt.tight_layout()
plt.show()


The horizontal bar chart shows the most frequent causes of events in the dataset. The horizontal axis gives the frequency of each cause, while the vertical axis lists the causes.

The following observations can be made from the chart:
- The leading cause of events is 'rush hour traffic jam (no cause reported)', indicating that congestion during peak travel times is a significant contributor to the number of events.
- The second most common cause is 'traffic jam outside rush hour (no cause reported)', suggesting that traffic-related issues are prevalent even beyond traditional rush hours.
- 'Accident (len)' ranks third, underscoring accidents as another primary factor in event causation.
- Other notable causes include vehicle-related issues such as 'defective truck (s)' and 'defective vehicle (s)', and incidents related to 'road work' and 'clearance work'.
- The chart also includes less frequent causes such as 'accident with truck (s)' and 'accident (with cleaning/storage)'. Because the chart shows only frequency, their impact cannot be judged from it.
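
To quantify how dominant the leading causes are, the counts can be expressed as shares of all events with a recorded cause. The following is a minimal sketch; the helper name `top_cause_shares` and the `toy_causes` series are illustrative assumptions.

```python
import pandas as pd


def top_cause_shares(causes: pd.Series, n: int = 10) -> pd.DataFrame:
    """Counts, share, and cumulative share of the n most frequent causes."""
    counts = causes.value_counts().head(n)  # missing values are excluded by default
    share = counts / causes.notna().sum()
    return pd.DataFrame({
        "events": counts,
        "share": share,
        "cumulative_share": share.cumsum(),
    })


# Made-up data for illustration only (not the real dataset)
toy_causes = pd.Series(
    ["jam", "jam", "jam", "accident", "accident", "road work", None]
)
shares = top_cause_shares(toy_causes)
print(shares)
```

On the real data this would be called as `top_cause_shares(data['Cause Ground Detail'])`. The cumulative column shows how much of the dataset the top causes cover together.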


## Time-Series Analysis

We will conduct a time-series analysis to observe trends and patterns over time. This includes analyzing the distribution of events across different months, time of the day, and the duration of these events.


In [None]:
# Converting date and time columns to datetime format and setting new datetime columns
data['File Start Date'] = pd.to_datetime(data['File Start Date'])
data['File End Date'] = pd.to_datetime(data['File End Date'])
data['Start DateTime'] = pd.to_datetime(data['File Start Date'].astype(str) + ' ' + data['File Start Time'])
data['End DateTime'] = pd.to_datetime(data['File End Date'].astype(str) + ' ' + data['File End Time'])


In [None]:
# Creating time-series visualizations
# The figure is created in the same cell as the plot so that the plot uses it
plt.figure(figsize=(10, 6))
data.groupby(data['Start DateTime'].dt.to_period("M")).size().plot(kind='bar')
plt.title('Monthly Event Frequency')
plt.xlabel('Month')
plt.ylabel('Number of Events')
plt.xticks(rotation=45)

# Apply tight layout to adjust the spacing
plt.tight_layout()

# Display the plot
plt.show()

#### Monthly Event Frequency Analysis

The bar chart displays the number of events recorded each month throughout the year 2022. The x-axis represents the months of the year, labeled from January (2022-01) to December (2022-12), and the y-axis indicates the count of events that occurred in each month.

Observations from the chart:

- The number of events starts at a lower point in January, with a gradual increase in frequency as the year progresses, reaching the first peak around mid-year in June (2022-06).
- The highest peak occurs in August (2022-08), which could suggest a seasonal pattern or a specific event that caused a higher number of occurrences during this month.
- Following August, there is a noticeable decrease in September and October, with a slight increase again in November.
- The event frequency diminishes once more in December, ending the year on a lower note compared to the mid-year peak.
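
This chart and the earlier count plot of the `Month` column both describe monthly totals, so a quick consistency check between the two sources may be worthwhile. The following is a minimal sketch that assumes `Month` holds month numbers 1 through 12; the helper name `compare_month_sources` and the `toy` frame are illustrative assumptions.

```python
import pandas as pd


def compare_month_sources(
    df: pd.DataFrame,
    month_col: str = "Month",
    datetime_col: str = "Start DateTime",
) -> pd.DataFrame:
    """Side-by-side monthly counts from a month column and from a datetime column."""
    by_column = df[month_col].value_counts()
    by_datetime = df[datetime_col].dt.month.value_counts()
    out = pd.concat(
        [by_column, by_datetime], axis=1, keys=["from_month_col", "from_datetime"]
    )
    # Months missing from one source count as zero events there
    out = out.fillna(0).astype(int).sort_index()
    out["difference"] = out["from_month_col"] - out["from_datetime"]
    return out


# Made-up data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Month": [1, 1, 8, 12],
    "Start DateTime": pd.to_datetime(
        ["2022-01-05", "2022-01-20", "2022-08-03", "2022-08-30"]
    ),
})
comparison = compare_month_sources(toy)
print(comparison)
```

A non-zero `difference` for any month would indicate that the two columns do not describe the same month for every event.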


In [None]:
# Time of day patterns for event occurrences
data['Start DateTime'].dt.hour.value_counts().sort_index().plot(kind='bar')
plt.title('Time of Day for Event Occurrences')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Events')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

#### Time of Day for Event Occurrences Analysis

This bar chart represents the distribution of events across different hours of the day. The x-axis denotes the hour of the day in 24-hour format, ranging from 4 (4 AM) to 22 (10 PM), while the y-axis indicates the number of events starting in each hour.

Observations from the chart include:

- A significant peak is observed during the early hours of the day, around 6 AM. This could indicate a surge in events during the early morning, possibly related to the start of rush hour or the beginning of business operations.
- Following the morning peak, there is a substantial drop in the number of events until midday.
- Event occurrences increase again in the afternoon, peaking at around 3 PM. This might correlate with the end of the school day or the beginning of evening rush hour, suggesting another period of high activity.
- The number of events significantly decreases as the evening progresses, with the lowest counts in the late evening hours.
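
The morning and afternoon peaks can also be read off programmatically instead of from the chart. The following is a minimal sketch; the helper name `hourly_peaks`, the midday split, and the `toy_starts` series are illustrative assumptions.

```python
import pandas as pd


def hourly_peaks(start_times: pd.Series) -> dict:
    """Busiest starting hour before noon and from noon onward."""
    counts = (
        start_times.dt.hour.value_counts()
        .reindex(range(24), fill_value=0)  # include hours with no events
    )
    return {
        "morning_peak_hour": int(counts.loc[0:11].idxmax()),
        "afternoon_peak_hour": int(counts.loc[12:23].idxmax()),
        "counts": counts,
    }


# Made-up data for illustration only (not the real dataset)
toy_starts = pd.Series(pd.to_datetime([
    "2022-03-01 06:10", "2022-03-01 06:40", "2022-03-01 07:05",
    "2022-03-01 15:00", "2022-03-01 15:30", "2022-03-02 15:45",
    "2022-03-02 16:20", "2022-03-02 22:00",
]))
peaks = hourly_peaks(toy_starts)
print(peaks["morning_peak_hour"], peaks["afternoon_peak_hour"])
```

On the real data this would be called as `hourly_peaks(data['Start DateTime'])`.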




In [None]:
# Duration of events over time
data['Event Duration'] = (data['End DateTime'] - data['Start DateTime']).dt.total_seconds() / 60  # Duration in minutes
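# Sort by start time so the line connects events in chronological order
data.sort_values('Start DateTime').plot(x='Start DateTime', y='Event Duration', kind='line', ax=plt.gca())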
plt.title('Duration of Events Over Time')
plt.xlabel('Date')
plt.ylabel('Duration (Minutes)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


#### Duration of Events Over Time Analysis

The line chart depicts the duration of events over a span of time, specifically from January 2022 to January 2023. The x-axis represents the date, and the y-axis represents the duration of events in minutes.

Key observations from the chart:

- There is considerable variability in event durations throughout the year. Most events tend to have shorter durations, but there are several notable spikes indicating events of much longer duration.
- The longest events appear to occur sporadically rather than following a clear seasonal or monthly pattern. These outliers could be due to extraordinary circumstances or significant incidents requiring extended attention.
- There is no clear trend indicating an increase or decrease in event duration over time, suggesting that the factors affecting duration are varied and possibly unrelated to time of year.
- The data points are densely packed, which could suggest a consistent recording of events throughout the year with no significant periods of inactivity.
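
The spikes described above can be flagged with a simple rule rather than by eye. The following is a minimal sketch using the common interquartile-range (IQR) rule; this rule is a swapped-in convention, not something the original analysis applied, and the helper name `flag_long_events` and the `toy_durations` series are illustrative assumptions.

```python
import pandas as pd


def flag_long_events(durations: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of durations above Q3 + k * IQR."""
    q1 = durations.quantile(0.25)
    q3 = durations.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return durations > upper


# Made-up durations in minutes, for illustration only (not the real dataset)
toy_durations = pd.Series([5, 8, 10, 12, 15, 20, 240])
flags = flag_long_events(toy_durations)
print(toy_durations[flags])
print("mean:", toy_durations.mean(), "median:", toy_durations.median())
```

On the real data this would be called as `flag_long_events(data['Event Duration'])`. Comparing the mean with the median gives a quick sense of how strongly a few long events pull the average upward.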


## Correlation and Categorical Analysis

Now, we'll conduct a correlation analysis among numerical variables and explore the relationships between categorical variables and other key metrics like event severity and duration.


In [None]:
# Correlation Analysis among numerical variables
numerical_data = data[['File Severity', 'Average Length', 'Hectometer Head', 'Hectometer Tail', 'Event Duration']]

# Calculating correlation
corr = numerical_data.corr()

# Creating a heatmap for correlation
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()




#### Correlation Heatmap Analysis

The correlation heatmap provides a visual summary of the relationships between different numerical variables. Each cell in the heatmap shows the correlation coefficient between two variables, with the scale on the right indicating the strength of the relationship, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

Key observations:

- **File Severity and Event Duration**: There is a very high positive correlation (0.95) between 'File Severity' and 'Event Duration', suggesting that events with greater severity tend to have longer durations.
- **Hectometer Head and Hectometer Tail**: These two variables have an almost perfect positive correlation (0.99), indicating that as the head hectometer marker increases, the tail marker tends to increase correspondingly, which is expected since they are part of the same event location.
- **Average Length**: 'Average Length' has a moderate positive correlation with 'File Severity' (0.57), implying that more severe events also tend to have a greater average length.
- **Weak Correlations**: Several variable pairs, such as 'Hectometer Head' and 'File Severity', have a very low correlation (0.03), indicating no meaningful linear relationship between them.
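
Reading pairs off a heatmap gets harder as the number of variables grows. The following is a minimal sketch that lists each pair once, ordered by the absolute value of the Pearson coefficient; the helper name `strongest_pairs` and the `toy` frame are illustrative assumptions.

```python
from itertools import combinations

import pandas as pd


def strongest_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Each pair of numeric columns once, sorted by absolute Pearson correlation."""
    corr = df.corr()
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_1", "var_2", "r"])
    # Sort by strength regardless of sign, strongest first
    return pairs.sort_values(
        "r", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)


# Made-up data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 3, 2, 5],
    "c": [4, 3, 2, 1],
})
pairs = strongest_pairs(toy)
print(pairs)
```

On the real data this would be called as `strongest_pairs(numerical_data)`. Pearson correlation captures only linear relationships, so a low value does not rule out a non-linear one.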


Now we will go more in-depth and look at how different 'Cause Ground Detail' categories impact 'File Severity' and 'Event Duration'.

In [None]:
# In-depth analysis of categorical variables
# Analyzing the impact of 'Cause Ground Detail' on 'File Severity' and 'Event Duration'
cause_severity = data.groupby('Cause Ground Detail')['File Severity'].mean().sort_values(ascending=False)
cause_duration = data.groupby('Cause Ground Detail')['Event Duration'].mean().sort_values(ascending=False)

plt.figure(figsize=(15, 5))

# Impact on File Severity
cause_severity.head(10).plot(kind='barh', color='skyblue')
plt.title('Top 10 Causes by Average Severity')
plt.xlabel('Average Severity')
plt.ylabel('Cause')

plt.tight_layout()
plt.show()



#### Analysis of Top 10 Causes by Average Severity


The horizontal bar chart lists the top 10 causes of events, ranked by their average severity. The horizontal axis shows the average severity, with values in the range of hundreds of thousands: the `1e6` at the far end of the axis is matplotlib's scientific-notation multiplier, meaning each tick label is multiplied by one million. Each bar's length represents the average severity score for a cause, with the causes listed on the vertical axis.

Key observations from the chart:

- The cause with the highest average severity, by a significant margin, is 'spitsfile (with accident)', where 'spitsfile' is Dutch for a rush-hour traffic jam.
- 'Accident with truck (s)' and 'activities (and traffic jam outside spits without cause)' are also among the top causes with high average severity, which shows that these events can have serious implications.
- Less severe, but still within the top causes, are 'road work' and 'spitsfile (with defective vehicle)', which have relatively lower average severity scores compared to the top-ranking causes.
- The cause 'damaged guide rail' has the lowest average severity among the top 10, though it still ranks above every cause outside this list.
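
A ranking by mean can be dominated by causes that occur only once or twice. The following is a minimal sketch that reports the median and the number of events alongside the mean, and drops rare causes; the helper name `severity_by_cause`, the `min_events` threshold, and the `toy` frame are illustrative assumptions.

```python
import pandas as pd


def severity_by_cause(
    df: pd.DataFrame,
    cause_col: str = "Cause Ground Detail",
    value_col: str = "File Severity",
    min_events: int = 2,
) -> pd.DataFrame:
    """Mean, median, and event count per cause, for causes with enough events."""
    stats = df.groupby(cause_col)[value_col].agg(["mean", "median", "count"])
    frequent = stats[stats["count"] >= min_events]
    return frequent.sort_values("mean", ascending=False)


# Made-up data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Cause Ground Detail": ["A", "A", "B", "C", "C", "C"],
    "File Severity": [100, 300, 1000, 10, 20, 30],
})
stats = severity_by_cause(toy)
print(stats)
```

In the toy data, cause `B` has the highest mean but only one event, so it is excluded. The same helper could be applied to durations with `value_col='Event Duration'`.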



In [None]:
# Impact on Event Duration
cause_duration.head(10).plot(kind='barh', color='salmon')
plt.title('Top 10 Causes by Average Event Duration')
plt.xlabel('Average Duration (Minutes)')
plt.ylabel('Cause')

plt.tight_layout()
plt.show()

#### Analysis of Top 10 Causes by Average Event Duration

This horizontal bar chart depicts the average duration, in minutes, of events attributed to various causes. The x-axis quantifies the average duration while the y-axis lists the causes of events.

Observations from the chart include:

- 'Spitsfile (with an incident)' has the longest average duration, suggesting that events with this cause tend to last longer than those associated with other causes.
- 'Spitsfile (defective vehicle and storage)' and 'accident (in a pointed file)' follow as the second and third causes leading to prolonged events, respectively.
- Other causes such as 'accident and (traffic jam outside striker without cause)', 'demonstration', and 'road work' are also significant contributors to longer event durations, though to a lesser extent than the top-ranking causes.
- 'Activities (and traffic jam outside spits without cause)' and 'damaged guide rail' have the lowest average duration among the top 10 causes, which may indicate quicker resolution or less complexity in managing these types of events.

