# Introduction
This collection of Jupyter notebooks documents the data mining process for a train data set. The data set contains records of train journeys taken since November 2023 and was created by manually documenting key statistics about every journey in a spreadsheet. The aim of this analysis is to explore patterns and potential correlations within the data. Each record includes the train line, the departure and arrival stations, the date and time, the delay, and whether a ticket inspection took place. The exact attributes are analyzed later on.
This data set was chosen because of a personal connection: the data was collected manually over an extended period of time. Predictions and classifications could also help with future decisions about which train connection to take to minimize the risk of delay.

The primary questions for this analysis are:
1. Is there a correlation between the train line and other journey statistics? This includes examining whether certain lines tend to have more delays or higher crowding.
2. Can journeys be classified based on key features? Classification techniques will be used to try to predict, for example, whether a train is 'likely to be crowded'.
3. Is the likelihood of ticket inspections dependent on the time of day or month?

# Data exploration
The data set consists of two related spreadsheets: one for the train journeys and general statistics about these journeys, the other for announcements made during the journey.

## Train journeys

### Attributes
The data set has a limited number of attributes, which are explained either by their name or by annotations in the data set. The original names are in German, but for consistency the English translations are used when discussing the attributes. The most important attributes are:

- **ID** ("Lfd. Nummer") is an ID for every record
- **train line** ("Lininen Nummer") is the train number
- **departure station** ("Start Haltestelle") is the departure station
- **arrival station** ("Ziel Haltestelle") is the arrival station
- **planned departure** ("Planmäßige Abfahrtszeit") and **date** ("Datum") are the time and date of departure

These attributes are normally known before the journey is taken and are referred to as **fixed information** of a journey. The following attributes are determined during or after the journey and are referred to as **variable information**:

- **delay** ("Verspätung in min") is the delay upon arrival in minutes. Delays are only documented if they are 2 min or longer. Arrival is defined as the moment the first passengers step out of the vehicle.
- **ticket checked** ("Kontrolliert") indicates whether there was a ticket inspection. For context: German trains commonly have no gates that force passengers to buy a ticket; instead, ticket inspectors perform random checks on the train.
- **platform changed** ("Gleis verlegt") indicates whether the platform (departure or arrival) differed from the planned platform
- **crowdiness** ("Fülle des Zuges") describes how full the train was
- **train model** ("Zugmodel") is only documented for specific trains which can be classified as old or new
- **cleanliness** ("Sauberkeit") takes into account whether there is trash, whether the floor or seats are dirty, and whether there is any smell in the train

Additionally, there are three attributes which exist because the delay of a single train does not always reflect the true experience of travelling by train. For example, if train A has a delay of 15 min and the connecting train B is missed as a result, the traveler might have to wait an additional 15 min for the next train. The total delay would therefore be 30 min. The exact calculation is as follows:<br>
*The connection to be taken is determined when planning the trip. If train delays cause a different connection to be taken at any point during the trip, the relative timings become relevant. External factors when traveling to the departure station (e.g. delayed bus, traffic jam, ...) do not affect this calculation.*

- **relative planned departure** ("rel. planmäßige Abfahrtszeit") the planned departure time, different from the actual departure time
- **relative delay** ("rel. Verspätung") the delay compared to when one would have arrived at the arrival station when using the planned connection
- **alternative connection** ("Alternativer Anschluss") indicates whether the relative delay is due to a missed connecting train
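
The worked example above (15 min delay plus 15 min waiting time) can be expressed as a small calculation. This is only a minimal sketch: the function name `relative_delay` and the timestamps are hypothetical and not part of the data set, and the spreadsheet values were entered manually rather than computed this way.

```python
from datetime import datetime


def relative_delay(planned_arrival: datetime, actual_arrival: datetime) -> int:
    """Minutes between the arrival of the planned connection and the actual arrival."""
    return int((actual_arrival - planned_arrival).total_seconds() // 60)


# Hypothetical timings: train A is 15 min late, connecting train B is missed,
# and the next train reaches the arrival station 30 min later than planned.
planned_arrival = datetime(2023, 11, 24, 8, 0)   # arrival with the planned connection
actual_arrival = datetime(2023, 11, 24, 8, 30)   # arrival with the alternative connection

total = relative_delay(planned_arrival, actual_arrival)
print(total)
```

The relative delay only depends on the two arrival times at the final station, which matches the rule that external factors before departure do not affect the calculation.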

The **note** attribute ("Bemerkung") is a personal remark and is not considered in this analysis.


The following table shows the types of attributes:

| Attribute Name             | Attribute Type | Possible values                                                                                                                   | Annotation |
|----------------------------|----------------|-----------------------------------------------------------------------------------------------------------------------------------|------------|
| ID                         | nominal        | natural numbers (0 excluded)                                                                                                      |            |
| train line                 | nominal        | any                                                                                                                               |            |
| departure station          | nominal        | any                                                                                                                               |            |
| arrival station            | nominal        | any                                                                                                                               |            |
| planned departure          | interval       | time                                                                                                                              |            |
| date                       | interval       | calendar date                                                                                                                     |            |
| delay                      | ratio          | {null; 0; natural numbers > 1; X; N}                                                                                              | [1]        |
| ticket checked             | nominal        | true/false                                                                                                                        |            |
| platform changed           | nominal        | true/false                                                                                                                        |            |
| crowdiness                 | ordinal        | {nearly empty, light crowd, moderate, heavy crowd, packed} ({"(Fast) leer", "Wenig voll", "Normal voll", "Sehr voll", "Zu voll"}) | [2]        |
| train model                | nominal        | {old, new, other}, ({"Alt", "Neu", "Sonstiges"})                                                                                  | [3]        |
| cleanliness                | ordinal        | {dirty, alright, very clean}, ({"Dreckig", "Ok", "Sehr sauber"})                                                                  |            |
| relative planned departure | interval       | {time; X}                                                                                                                         | [4]        |
| relative delay             | ratio          | natural numbers (0 included)                                                                                                      | [5]        |
| alternative connection     | nominal        | true/false                                                                                                                        |            |

[1] *null* = 0 (a delay of 0 or 1 min is documented by leaving the field empty); *X* = train canceled and no relative delay on another connection; *N* = decided not to take the train due to overcrowding, apparent delays, etc.<br>
[2] *nearly empty* = free choice of seats; *light crowd* = people in the train, lots of places including 4-seaters available; *moderate* = 2-seaters are available, maybe after a bit of searching; *heavy crowd* = only sitting next to someone is possible; *packed* = even if all seats were taken, people would still have to stand<br>
[3] *other* = a new and an old train compartment are connected, or a temporary model is used<br>
[4] *X* = the departure is from a station that was not initially planned, so the time cannot be determined<br>
[5] due to the calculation this delay can also be 0 or 1<br>
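
Because crowdiness is ordinal, its order can be preserved when loading the data. The following is a minimal sketch, assuming pandas is used as in the cells below; the sample values are hypothetical, while the category order is taken from the table above.

```python
import pandas as pd

# Category order taken from the attribute table (least to most crowded).
crowd_levels = ["(Fast) leer", "Wenig voll", "Normal voll", "Sehr voll", "Zu voll"]
crowd_dtype = pd.CategoricalDtype(categories=crowd_levels, ordered=True)

# Hypothetical sample values, not taken from the data set.
sample = pd.Series(["Sehr voll", "Wenig voll", "Zu voll", None])
crowdiness = sample.astype(crowd_dtype)

# Ordered categoricals support comparisons and expose integer codes (-1 = missing).
at_least_heavy = crowdiness >= "Sehr voll"
codes = crowdiness.cat.codes

print(at_least_heavy.tolist())
print(codes.tolist())
```

The same pattern would apply to cleanliness with its three levels.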


### Sample data

The data set does not appear to have many missing values; however, the last entries are completely empty except for the ID. As an example, this is the record with ID 16:<br>
`16,RS30,Oldenburg,Bremen,07:05,24.11.2023,4,FALSE,FALSE,Sehr voll,Neu,,,,FALSE,,,,`<br>
Even without further analysis, it is apparent that many entries contain "Rastede", "Oldenburg" or "Bremen" as departure or arrival station. Records with a relative delay are quite rare. Most records have a normal delay or no delay at all.
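
To make the sample record easier to read, its fields can be matched to the attribute names. This is a small sketch which assumes that the columns appear in the same order as the attributes are listed above; the trailing unnamed fields are ignored.

```python
import csv
from io import StringIO

record = "16,RS30,Oldenburg,Bremen,07:05,24.11.2023,4,FALSE,FALSE,Sehr voll,Neu,,,,FALSE,,,,"

# Assumed column order: the order in which the attributes are listed above.
attributes = [
    "ID", "train line", "departure station", "arrival station",
    "planned departure", "date", "delay", "ticket checked",
    "platform changed", "crowdiness", "train model", "cleanliness",
    "relative planned departure", "relative delay", "alternative connection",
]

fields = next(csv.reader(StringIO(record)))

# zip() stops at the shorter list, so the trailing unnamed fields are dropped.
parsed = dict(zip(attributes, fields))

for name, value in parsed.items():
    print(f"{name}: {value or '(empty)'}")
```

Under this assumption the record describes a journey on line RS30 from Oldenburg to Bremen with a delay of 4 min, no ticket inspection and no relative delay.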

### Visualizations


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# The first three lines are invalid -> skip
journeys = pd.read_csv('../data/train-drives.csv', skiprows=3, encoding='utf-8')

# Convert date to datetime, coerce errors to NaT
journeys['Datum'] = pd.to_datetime(journeys['Datum'], format='%d.%m.%Y', errors='coerce')

# Drop rows with invalid or missing dates
journeys = journeys.dropna(subset=['Datum'])

# Group by year and month
months = journeys['Datum'].dt.strftime('%Y-%m')
month_counts = months.value_counts().sort_index()

# Prepare sorted months and counts for plotting
sorted_months = month_counts.index.tolist()
counts = month_counts.values.tolist()


In [None]:
# Create a bar chart of the number of records per month
plt.figure(figsize=(7, 4))
plt.bar(sorted_months, counts)
plt.xlabel('Month')
plt.ylabel('Number of records')
plt.title('Number of records per month')
plt.xticks(rotation=45, ha='right')
plt.show()


The records are distributed quite evenly, with most months having at least 20 records. December 2024 has fewer than 10 records, which could indicate an anomaly in the data or simply fewer journeys taken in that month.
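
Note that `value_counts()` only lists months that contain at least one record, so a month without any journeys would not appear in the chart at all. The following sketch shows one way to make such gaps visible and to flag sparse months. The dates and the threshold are hypothetical and not taken from the data set.

```python
import pandas as pd

# Hypothetical dates, not taken from the data set.
dates = pd.to_datetime(
    ["2024-10-02", "2024-10-15", "2024-10-28", "2024-12-03", "2025-01-07", "2025-01-20"]
)

# Count records per calendar month.
per_month = pd.Series(dates).dt.to_period("M").value_counts().sort_index()

# Reindex over the full month range so months without any record show up as 0.
full_range = pd.period_range(per_month.index.min(), per_month.index.max(), freq="M")
per_month = per_month.reindex(full_range, fill_value=0)

# Flag months below a hypothetical threshold.
threshold = 2
sparse_months = per_month[per_month < threshold]
print(sparse_months)
```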

In [None]:
# Convert columns to numeric; empty cells and non-numeric entries ('X', 'N') become NaN
delay = pd.to_numeric(journeys['Verspätung in min'], errors='coerce')
rel_delay = pd.to_numeric(journeys['rel. Verspätung'], errors='coerce')

# Treat all NaN delays as 0 (this also sets 'X' and 'N' to 0)
delay_filled = delay.fillna(0)

# Create box plots
plt.figure(figsize=(7, 4))
plt.boxplot([delay_filled.dropna(), rel_delay.dropna()], tick_labels=['Delay', 'Relative Delay'])
plt.ylabel('Minutes')
plt.title('Delay and Relative Delay Distribution')
plt.show()

# Create histograms
plt.figure(figsize=(9, 4))
plt.subplot(1, 2, 1)
plt.hist(delay_filled, bins=50, alpha=0.7, label='Delay')
plt.xlabel('Minutes')
plt.ylabel('Frequency')
plt.title('Histogram of Delay')
plt.subplot(1, 2, 2)
plt.hist(rel_delay, bins=20, alpha=0.7, label='Relative Delay')
plt.xlabel('Minutes')
plt.ylabel('Frequency')
plt.title('Histogram of Relative Delay')
plt.show()


# Count occurrences of 'X' and 'N' in delay columns
delay_x_count = journeys['Verspätung in min'].astype(str).str.upper().eq('X').sum()
delay_n_count = journeys['Verspätung in min'].astype(str).str.upper().eq('N').sum()
rel_delay_x_count = journeys['rel. Verspätung'].astype(str).str.upper().eq('X').sum()

print('-- Statistics about Delay --')
print(delay_filled.describe())
print(f"'X' in Delay: {delay_x_count}; 'N' in Delay: {delay_n_count}\n")
print('-- Statistics about Relative Delay --')
print(rel_delay.describe())
print(f"'X' in Relative Delay: {rel_delay_x_count}")


The statistics are not fully accurate because X and N were converted to 0 for the delays; however, only a few records contain an X or N. The box plot clearly shows that the majority of delays are very low: the median is 0 minutes. Noteworthy is the outlier and maximum of 173 minutes of delay. The relative delays are more spread out, and their percentiles cover a larger range.
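
One way to avoid this inaccuracy would be to fill only the genuinely empty cells with 0 and to keep X and N as missing values. The following is a minimal sketch with hypothetical raw values; it is not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical raw values as they might appear in "Verspätung in min".
raw = pd.Series([None, "4", "X", "12", "N", None, "x"])

# Identify the special markers (case-insensitive).
markers = raw.astype(str).str.upper()
is_cancelled = markers.eq("X")
is_skipped = markers.eq("N")

# Numeric delays; empty cells, X and N all become NaN at this point.
delay = pd.to_numeric(raw, errors="coerce")

# Only genuinely empty cells mean "no documented delay" and are set to 0.
delay = delay.mask(raw.isna(), 0)

print(delay.tolist())
print(delay.describe())
```

With this variant, `describe()` ignores canceled and skipped journeys instead of counting them as punctual.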

In [None]:
# Count missing values for each column, ignoring the last 4 columns (these are fully empty/not relevant)
cols = journeys.columns[:-4]
missing_counts = journeys[cols].isna().sum() + (journeys[cols] == '').sum()
missing_counts = missing_counts[missing_counts > 0] # keep only columns that have missing values

# Create bar chart
plt.figure(figsize=(7, 4))
bars = plt.bar(missing_counts.index, missing_counts.to_numpy())
plt.bar_label(bars)
plt.ylabel('Amount')
plt.title('Missing values per column')
plt.xticks(rotation=45, ha='right')
plt.show()


The bar chart shows the number of missing values for all columns which have any. Relative planned departure and relative delay have the most missing values. This is expected, as relative delays do not occur regularly and the fields are left empty when not relevant. The delay also has many missing values, since the field is left empty when the delay is 0. The same applies to the train model, which is not documented for every train.<br>
Notable are the few missing values for departure station, arrival station, crowdiness and cleanliness. These will have to be handled during preprocessing or analysis.
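
One possible way to handle such gaps is sketched below with a hypothetical mini data set. The column names are taken from the attribute list, but the values are made up, and dropping records without a station is only an assumption about how the preprocessing could look.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set using column names from the attribute list.
df = pd.DataFrame({
    "Ziel Haltestelle": ["Bremen", "", "Oldenburg", None],
    "Fülle des Zuges": ["Sehr voll", "Normal voll", " ", "Wenig voll"],
    "Sauberkeit": ["Ok", None, "Ok", "Dreckig"],
})

# Treat empty and whitespace-only strings as missing values.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

# Drop records without an arrival station; keep the rest for later handling.
cleaned = cleaned.dropna(subset=["Ziel Haltestelle"])

print(cleaned.isna().sum())
```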

In [None]:
# Get value counts, drop missing values
departures = journeys['Start Haltestelle '].dropna().value_counts()
arrivals = journeys['Ziel Haltestelle'].dropna().value_counts()

# Get all unique stations
all_stations = set(departures.index).union(arrivals.index)

# Sort stations by combined counts (descending)
sorted_stations = departures.add(arrivals, fill_value=0).sort_values(ascending=False).index.tolist()

# Get separate counts for all stations
departures_counts = [departures.get(station, 0) for station in sorted_stations]
arrivals_counts = [arrivals.get(station, 0) for station in sorted_stations]


# Create bar chart for departure and arrival
x = np.arange(len(sorted_stations))
width = 0.4

plt.figure(figsize=(12, 5))
plt.bar(x - width/2, departures_counts, width, label='Departure')
plt.bar(x + width/2, arrivals_counts, width, label='Arrival')
# Set tick positions, labels and rotation in a single call
plt.xticks(x, sorted_stations, rotation=45, ha='right')
plt.xlabel('Station')
plt.ylabel('Frequency')
plt.title('Histogram of Departure and Arrival Stations')
plt.legend()
# Adjust the layout after all labels are set so rotated ticks are not cut off
plt.tight_layout()
plt.show()


As discussed in the "Sample data" section, most journeys either start or end in "Oldenburg", "Rastede" or "Bremen". The bar chart visualizes this quite clearly. Most other stations have very few records in the data set and might not supply sufficient data for a deeper analysis.
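
If the rarely used stations turn out to be too sparse, one option would be to group them under a common label before further analysis. The sketch below uses hypothetical station values and a hypothetical threshold `min_records`; it is an illustration, not a step of this notebook.

```python
import pandas as pd

# Hypothetical station values, not taken from the data set.
stations = pd.Series(
    ["Oldenburg", "Bremen", "Rastede", "Oldenburg", "Bremen", "Hude", "Oldenburg", "Delmenhorst"]
)

# Hypothetical threshold for "enough" records per station.
min_records = 2

counts = stations.value_counts()
frequent = counts[counts >= min_records].index

# Keep frequent stations, replace all others with a common label.
grouped = stations.where(stations.isin(frequent), "Other")
print(grouped.value_counts())
```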


## Train announcements

TBD