# Data preprocessing

The data exploration in the previous notebook showed that the data set contains missing values and some values whose format is unsuitable for further analysis. Because the meaning of each attribute is known, semantic inconsistencies can be filtered out as well. For example, there might be records with the same departure and arrival station.

The goal of this step is to handle missing values, transform values where needed, and export a clean data set that can be used for further analysis.
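
The semantic check mentioned above, records whose departure and arrival station are identical, could be sketched as follows. The small `sample` frame and its values are made up for illustration, and the column names assume the English labels introduced in the renaming step below.

```python
import pandas as pd

# Hypothetical sample data, not taken from the real data set
sample = pd.DataFrame({
    'departure station': ['Rastede', 'Oldenburg', 'Bremen'],
    'arrival station': ['Bremen', 'Oldenburg', 'Rastede'],
})

# A journey that starts and ends at the same station is inconsistent
same_station = sample['departure station'] == sample['arrival station']

# Keep only the consistent records
consistent = sample[~same_station].reset_index(drop=True)
print(consistent)
```

Building the boolean mask first makes it easy to inspect or count the affected records before they are dropped.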

## Renaming columns

First, the columns are renamed to the English labels defined in the attribute description.

In [None]:
import pandas as pd

# The first three lines are invalid -> skip
journeys = pd.read_csv('../data/train-drives.csv', skiprows=3, encoding='utf-8')

# Rename columns to English labels (the station filter below relies on these names)
journeys.rename(columns={
    'dateCrawled': 'date_crawled',
    'Lfd. Nummer': 'ID',
    'Linien Nummer': 'train line',
    'Start Haltestelle ': 'departure station',
    'Ziel Haltestelle': 'arrival station',
    'Planmäßige Abfahrtszeit': 'planned departure',
    'Verspätung in min': 'delay',
    'Kontrolliert': 'ticket checked',
    'Gleis verlegt': 'platform changed',
    'Fülle des Zuges': 'crowiness',
    'Zugmodel': 'train model',
    'Sauberkeit': 'cleanliness',
    'rel. planmäßige Abfahrtszeit': 'relative planned departure',
    'rel. Verspätung': 'relative delay',
    'Alternativer Anschluss': 'alternative connection',
}, inplace=True)

# Drop unnecessary columns
journeys.drop(columns=['Bemerkung'], inplace=True)
print(journeys.head(20))

# Drop all records with rare departure or arrival stations,
# i.e. keep only journeys between the stations listed below
rastede_bremen_stations = ['Rastede', 'Oldenburg', 'Hude', 'Delmenhorst', 'Bremen', 'Bremen-Neustadt']
journeys = journeys[
    journeys['departure station'].isin(rastede_bremen_stations)
    & journeys['arrival station'].isin(rastede_bremen_stations)
]


# Save the cleaned data to a new CSV file
journeys.to_csv('../data/train-drives-cleaned.csv', index=False, encoding='utf-8')
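
By default, `DataFrame.rename` silently skips mapping keys that do not match an existing column. A label that differs only by a stray space (note the trailing space in `'Start Haltestelle '` in the mapping above) would therefore stay untranslated without any warning. The following is a minimal sketch of a guard, using a made-up two-column frame rather than the real file: passing `errors='raise'` turns such a mismatch into a `KeyError`.

```python
import pandas as pd

# Hypothetical frame; the first header carries a trailing space,
# as the mapping in the cell above suggests for the raw file
raw = pd.DataFrame({'Start Haltestelle ': ['Rastede'], 'Ziel Haltestelle': ['Bremen']})

mapping = {
    'Start Haltestelle': 'departure station',  # trailing space omitted on purpose
    'Ziel Haltestelle': 'arrival station',
}

# Default behaviour: the unmatched key is skipped without any message
silent = raw.rename(columns=mapping)
print(list(silent.columns))

# errors='raise' reports keys that match no column
try:
    raw.rename(columns=mapping, errors='raise')
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)
```

Whether to raise or to normalise the headers first (for example with `raw.columns.str.strip()`) is a design choice. Raising keeps the raw labels visible and makes changes in the source file obvious.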
