# Introduction
This collection of Jupyter notebooks documents the data mining process for a train data set. The data set contains records of train journeys taken since November 2023 and was created by manually documenting key statistics about every journey in a spreadsheet. The aim of this analysis is to explore patterns and potential correlations within the data set. The data set includes the train line, the departure and arrival stations, date and time, the delay, and ticket inspections. The exact attributes are analyzed later on.
This data set was chosen because of a personal connection: the data was collected manually over an extended period of time. Predictions and classifications could also help with future decisions about which train connection to take in order to minimize the risk of delay.

The primary questions for this analysis are:
1. Is there a correlation between the train line and other journey statistics? This includes examining whether certain lines tend to have longer delays or higher crowding.
2. Can journeys be classified based on key features? Classification techniques will be used to try to predict, for example, whether a train is 'likely to be crowded'.
3. Is the likelihood of ticket inspections dependent on the time of day or month?

# Data exploration
The data set consists of two related spreadsheets: one for the train journeys and general statistics about these journeys, the other for announcements made during the journey.

## Train journeys

### Attributes
The data set has a small number of attributes, which can be explained either by their name or by annotations in the data set. The original names are in German, but for consistency the English translations are used when referring to the attributes. The most important attributes are:

- **ID** ("Lfd. Nummer") is an ID for every record
- **train line** ("Lininen Nummer") is the train number
- **departure station** ("Start Haltestelle") is the departure station
- **arrival station** ("Ziel Haltestelle") is the arrival station
- **planned departure** ("Planmäßige Abfahrtszeit") and **date** ("Datum") are the time and date of departure

These attributes are normally known before the journey is taken and are referred to as the **fixed information** of a journey. The following attributes are determined during or after the journey and are referred to as **variable information**:

- **delay** ("Verspätung in min") is the delay upon arrival in minutes. Delays are only documented if they are 2 min or longer. Arrival is defined as the moment the first passengers step out of the vehicle.
- **ticket checked** ("Kontrolliert") indicates whether there was a ticket inspection. For context: German trains commonly have no gates that force passengers to buy a ticket; instead, ticket inspectors perform random checks on the train.
- **platform changed** ("Gleis verlegt") indicates whether the platform (departure or arrival) was different from the planned platform
- **crowdiness** ("Fülle des Zuges") describes how full the train was
- **train model** ("Zugmodel") is only documented for specific trains which can be classified as old or new
- **cleanliness** ("Sauberkeit") takes into consideration whether there is trash, whether the floor or seats are dirty, and whether there is any smell in the train

Additionally, there are three attributes which exist because the delay of a single train does not always reflect the true experience of train travel. For example, if train A has a delay of 15 min and the connecting train B is missed because of it, the traveler might have to wait an additional 15 min for the next train. The total delay would therefore be 30 min. The exact calculation is the following:<br>
*The connection to be taken is determined when planning the trip. If train delays cause a different connection to be taken at any point during the trip, the relative timings become relevant. External factors on the way to the departure station (e.g. a delayed bus or a traffic jam) do not affect this calculation.*

- **relative planned departure** ("rel. planmäßige Abfahrtszeit") the planned departure time, different from the actual departure time
- **relative delay** ("rel. Verspätung") the delay compared to when one would have arrived at the arrival station when using the planned connection
- **alternative connection** ("Alternativer Anschluss") indicates whether the relative delay is due to a missed connecting train
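
The relative delay described above can be illustrated with a short sketch. It is not taken from the data set: the function name `relative_delay_minutes` and the example times are hypothetical, and clamping early arrivals to 0 is an assumption based on the value range given for this attribute.

```python
from datetime import datetime


def relative_delay_minutes(planned_arrival: datetime, actual_arrival: datetime) -> int:
    """Minutes between the arrival of the planned connection and the actual arrival."""
    delta = actual_arrival - planned_arrival
    # Assumption: an arrival earlier than planned counts as a relative delay of 0.
    return max(0, int(delta.total_seconds() // 60))


# Hypothetical trip: the planned connection would have arrived at 08:00.
planned_arrival = datetime(2023, 11, 24, 8, 0)
# Train A is 15 min late, train B is missed, and the next train adds another 15 min.
actual_arrival = datetime(2023, 11, 24, 8, 30)

print(relative_delay_minutes(planned_arrival, actual_arrival))  # 30
```

The delay of the individual train would only be 15 min here, which is why the relative delay is recorded separately.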

The attribute **note** ("Bemerkung") is a personal note and will not be taken into consideration for this analysis.


The following table shows the types of attributes:

| Attribute Name             | Attribute Type | Possible values                                                                                                                   | Annotation |
|----------------------------|----------------|-----------------------------------------------------------------------------------------------------------------------------------|------------|
| ID                         | nominal        | natural numbers (0 excluded)                                                                                                      |            |
| train line                 | nominal        | any                                                                                                                               |            |
| departure station          | nominal        | any                                                                                                                               |            |
| arrival station            | nominal        | any                                                                                                                               |            |
| planned departure          | interval       | time                                                                                                                              |            |
| date                       | interval       | calendar date                                                                                                                     |            |
| delay                      | ratio          | {natural numbers > 1 or 0; X; N}                                                                                                  | [1]        |
| ticket checked             | nominal        | true/false                                                                                                                        |            |
| platform changed           | nominal        | true/false                                                                                                                        |            |
| crowdiness                 | ordinal        | {nearly empty, light crowd, moderate, heavy crowd, packed} ({"(Fast) leer", "Wenig voll", "Normal voll", "Sehr voll", "Zu voll"}) | [2]        |
| train model                | nominal        | {old, new, other}, ({"Alt", "Neu", "Sonstiges"})                                                                                  | [3]        |
| cleanliness                | ordinal        | {dirty, alright, very clean}, ({"Dreckig", "Ok", "Sehr sauber"})                                                                  |            |
| relative planned departure | interval       | {time; X}                                                                                                                         | [4]        |
| relative delay             | ratio          | natural numbers (0 included)                                                                                                      | [5]        |
| alternative connection     | nominal        | true/false                                                                                                                        |            |

[1] *X* = train canceled and no relative delay on another connection; *N* = decided not to take the train due to overcrowding, apparent delays, etc.<br>
[2] *nearly empty* = free choice of seats; *light crowd* = some people on the train, plenty of seats available, including 4-seaters; *moderate* = 2-seaters are available, possibly after a bit of searching; *heavy crowd* = only seats next to someone are available; *packed* = even if all seats were taken, people would have to stand<br>
[3] *other* = a new and an old train compartment are connected, or a temporary model<br>
[4] *X* = the departure is from a station that was not initially planned, so the time cannot be determined<br>
[5] due to the calculation, this delay can also be 0 or 1<br>
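
The mixed value domains in the table need explicit handling before any numeric analysis. The following is a minimal sketch of one possible approach, not the preprocessing actually used in these notebooks: the names `parse_delay`, `crowdiness_rank` and `CROWDINESS_ORDER` are hypothetical, and treating an empty cell as a missing value is an assumption.

```python
from typing import Optional, Tuple

# German labels from the data set, ordered from emptiest to fullest.
CROWDINESS_ORDER = ["(Fast) leer", "Wenig voll", "Normal voll", "Sehr voll", "Zu voll"]


def parse_delay(raw: str) -> Tuple[str, Optional[int]]:
    """Split a raw delay cell into a status and a delay in minutes."""
    value = raw.strip()
    if value == "X":
        return ("canceled", None)   # train canceled, see annotation [1]
    if value == "N":
        return ("not_taken", None)  # train deliberately not taken, see annotation [1]
    if value == "":
        return ("missing", None)    # assumption: empty cell means no value recorded
    return ("ok", int(value))


def crowdiness_rank(label: str) -> int:
    """Map a crowdiness label to its ordinal rank (0 = nearly empty, 4 = packed)."""
    return CROWDINESS_ORDER.index(label)


print(parse_delay("4"))              # ('ok', 4)
print(parse_delay("X"))              # ('canceled', None)
print(crowdiness_rank("Sehr voll"))  # 3
```

Keeping the status separate from the number avoids mixing the markers *X* and *N* into a ratio-scaled column.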


### Sample data

The data set does not appear to have many missing values; however, the last entries are all completely empty except for the ID. As an example, this is the record with ID 16:<br>
`16,RS30,Oldenburg,Bremen,07:05,24.11.2023,4,FALSE,FALSE,Sehr voll,Neu,,,,FALSE,,,,`<br>
Even without further analysis, it is apparent that many entries contain "Rastede", "Oldenburg" or "Bremen" as the departure or arrival station. Records with a relative delay are quite rare. Most records have a normal delay or no delay at all.
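
Records that are empty apart from the ID can be filtered out before the analysis. The sketch below shows one way to do this on raw CSV lines with Python's `csv` module. The helper name `is_empty_record` and the second example row (ID 999) are made up for illustration, and the ID is assumed to be the first field.

```python
import csv
from io import StringIO


def is_empty_record(fields):
    """Return True if every field after the ID (assumed to be the first field) is blank."""
    return all(field.strip() == "" for field in fields[1:])


raw = StringIO(
    "16,RS30,Oldenburg,Bremen,07:05,24.11.2023,4,FALSE,FALSE,Sehr voll,Neu,,,,FALSE,,,,\n"
    "999,,,,,,,,,,,,,,,,,,\n"  # hypothetical placeholder row that only has an ID
)
rows = list(csv.reader(raw))
filled = [row for row in rows if not is_empty_record(row)]

print(len(rows), len(filled))  # 2 1
```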

### Visualizations


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import collections

# The first 3 lines are invalid -> skip
journeys = np.genfromtxt('../data/train-drives.csv', delimiter=',', skip_header=3, names=True, usecols=range(18), dtype=None, encoding='utf-8')

# Convert date to datetime objects, skip malformed dates
date_objs = []
for date in journeys['Datum']:
	if date != '':
		try:
			date_objs.append(datetime.strptime(date, "%d.%m.%Y"))
		except ValueError:
			# Skip dates that do not match the expected format
			continue

# Group by year and month
months = [date_obj.strftime("%Y-%m") for date_obj in date_objs]
month_counts = collections.Counter(months)

# Sort months
sorted_months = sorted(month_counts.keys())
counts = [month_counts[month] for month in sorted_months]

In [None]:
# Plot a bar chart of the number of records per month
plt.figure(figsize=(7, 4))
plt.bar(sorted_months, counts)
plt.xlabel("Month")
plt.ylabel("Number of records")
plt.title("Number of records per month")
plt.xticks(rotation=45)
plt.show()


## Announcements

TBD