# Data extraction - Deutsche Bahn

This notebook describes how we extract data from several APIs provided by *Deutsche Bahn*.

Please note that we split data extraction and data preparation into separate steps, because preparing the data for storage in a graph database (*Neo4J*) takes additional time.

In the following, we describe the process and the data we extracted.

> **Important note:**
>
> We started all processes from the command line, because Jupyter notebooks have several disadvantages in terms of performance, monitoring, and so on. The code works in a notebook anyway.

In [1]:
import pandas as pd
from pymongo import MongoClient
import requests
import urllib.parse

# custom imports
from dataDB import dataDB 

## Object and object initialization

We created a class that handles all operations related to the data extraction. Since we use several APIs from *Deutsche Bahn*, we had to register to get access to them.

The initialization is described below.

In [None]:
# init class
extractor = dataDB()

# set API token (private)
extractor.setApiToken('################################')

## List of all train stations

The available APIs only return information for *a single* train station or train at a time. No API on the *Deutsche Bahn* portal lets us download a list of all train stations. However, *Deutsche Bahn* provides a CSV file of all train stations in Germany. Since this file contains both *long-distance stations* (FV) and *short-distance stations* (RV or DPN), we filtered for *long-distance stations* only.

In [None]:
# set train stations
extractor.setTrainStations("https://download-data.deutschebahn.com/static/datasets/haltestellen/D_Bahnhof_2020_alle.CSV")

This method downloads the CSV file and stores all stations, regardless of the attribute `Verkehr`. Below, you can see some examples of this data.

In [4]:
# download the data
stations = pd.read_csv("https://download-data.deutschebahn.com/static/datasets/haltestellen/D_Bahnhof_2020_alle.CSV", sep=';')

# filter the data
stations = stations[stations['Verkehr'] == 'FV'].reset_index(drop=True)

# display some data
stations.sample(5)

| | EVA_NR | DS100 | IFOPT | NAME | Verkehr | Laenge | Breite | Betreiber_Name | Betreiber_Nr | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 116 | 8000266 | KW | de:05124:11376 | Wuppertal Hbf | FV | 7149543 | 51254363 | DB Station und Service AG | 6914.0 | |
| 204 | 8002549 | AH | de:02000:10950 | Hamburg Hbf | FV | 10006909 | 53552736 | DB Station und Service AG | 2514.0 | |
| 66 | 8000150 | FH,FH N,FH S | de:06435:4503 | Hanau Hbf | FV | 8929 | 50120953 | DB Station und Service AG | 2537.0 | |
| 310 | 8010240 | UNM | de:15084:8010240 | Naumburg(Saale)Hbf | FV | 11796984 | 51163071 | DB Station und Service AG | 4309.0 | |
| 15 | 8000041 | EBO | de:05911:5194 | Bochum Hbf | FV | 7223275 | 51478609 | DB Station und Service AG | 724.0 | |


For the further data extraction, we use the attribute `EVA_NR`, which is a unique identifier for each train station.
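
As a minimal sketch of how `EVA_NR` feeds into the later requests, the helper below builds the `departureBoard` URL for one station and one start time, in the same form as the request used further down in this notebook. The name `departure_board_url` is hypothetical and not part of `dataDB`.

```python
BASE_URL = "https://api.deutschebahn.com/fahrplan-plus/v1"


def departure_board_url(eva_nr: int, date_iso: str) -> str:
    """Build the departureBoard URL for one station (hypothetical helper)."""
    return f"{BASE_URL}/departureBoard/{eva_nr}?date={date_iso}"


# Hamburg Hbf on 1 July 2022, starting at midnight
url = departure_board_url(8002549, "2022-07-01T00:00:00")
print(url)
```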

## Get station and train details

Since we used a class to extract all data, it is easy to say "*We only need to call a method and tada - all data are downloaded.*".

Below, you will find a detailed description of what happens in this method.

In [None]:
# start data extraction
extractor.startDataExtraction()

First, we download the departure data of each train station for a specific date. In this case, we used 1 July 2022 (*ISO format*: 2022-07-01T00:00:00 to 2022-07-01T23:59:59).

> **Important note:** The API is accessible only with an authentication token.

Below you can see the departure data for one train station (Hamburg Hbf: 8002549). Please note that this API returns only 20 departures per call. We had to call it several times with different date parameters to extract all departures for this day.

In [7]:
# set header
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer ################################",
}

# get data
data = requests.get("https://api.deutschebahn.com/fahrplan-plus/v1/departureBoard/8002549?date=2022-07-01T00:00:00", headers=headers)
data.json()

[{'name': 'ICE 591',
  'type': 'ICE',
  'boardId': 8002549,
  'stopId': 8002549,
  'stopName': 'Hamburg Hbf',
  'dateTime': '2022-07-01T03:20',
  'track': '14',
  'detailsId': '538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549'},
 {'name': 'EC 7',
  'type': 'EC',
  'boardId': 8002549,
  'stopId': 8002549,
  'stopName': 'Hamburg Hbf',
  'dateTime': '2022-07-01T04:37',
  'track': '14',
  'detailsId': '457410%2F157477%2F182432%2F61254%2F80%3fstation_evaId%3D8002549'},
 {'name': 'ICE 571',
  'type': 'ICE',
  'boardId': 8002549,
  'stopId': 8002549,
  'stopName': 'Hamburg Hbf',
  'dateTime': '2022-07-01T04:50',
  'track': '14',
  'detailsId': '534792%2F180773%2F71370%2F142579%2F80%3fstation_evaId%3D8002549'},
 {'name': 'ICE 581',
  'type': 'ICE',
  'boardId': 8002549,
  'stopId': 8002549,
  'stopName': 'Hamburg Hbf',
  'dateTime': '2022-07-01T04:54',
  'track': '14',
  'detailsId': '416067%2F141255%2F810554%2F266588%2F80%3fstation_evaId%3D8002549'},
 {'name': 'ICE 783',
  'typ
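
Because each call returns only 20 departures, collecting a full day needs a loop over several start times. The following is a minimal sketch of one way to do this, not the code used in `dataDB`. It assumes that the API returns the departures at or after the given time in chronological order, and it takes a hypothetical callable `fetch_board` that wraps the `requests.get` call above and formats the start time as the API expects.

```python
from typing import Callable


def collect_departures(fetch_board: Callable[[str], list], day: str) -> list:
    """Collect all departures of one day by paging through the board.

    fetch_board(start) is assumed to return up to 20 departures at or
    after `start` ('YYYY-MM-DDTHH:MM'), in the format shown above.
    """
    day_end = f"{day}T23:59"
    cursor = f"{day}T00:00"
    seen = {}  # detailsId -> departure; also removes duplicates
    while cursor <= day_end:
        page = fetch_board(cursor)
        # keep only departures of this day that we have not stored yet
        new = [d for d in page
               if d["detailsId"] not in seen and d["dateTime"] <= day_end]
        if not new:
            break  # nothing new on this page, so we are done
        for dep in new:
            seen[dep["detailsId"]] = dep
        # continue from the latest departure time on this page
        cursor = max(d["dateTime"] for d in page)
    return list(seen.values())
```

The overlap between consecutive pages is removed via `detailsId`. One limitation of this sketch: if more than 20 trains depart within the same minute, the loop stops early and misses some of them.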

Each dictionary in this list describes one departing train. To extract the train details, we can use the `detailsId`, which is a unique identifier for a train on a given date. As an example, we use a train from above:

* Name: `ICE 591`
* detailsId: `538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549`

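Unfortunately, we can't use the raw `detailsId`, since the string contains characters that need to be encoded: it is already percent-encoded, so its `%` characters have to be encoded once more (as `%25`) before the string is placed into the URL. To create a working string, we use the function `urllib.parse.quote()`.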

In [6]:
# print detailsId
print(f'Raw detailsId:\t\t538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549')

# encode characters of details Id
detailsId = urllib.parse.quote('538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549')
print(f'Encoded detailsId:\t{detailsId}')

Raw detailsId:		538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549
Encoded detailsId:	538968%252F182282%252F316104%252F21604%252F80%253fstation_evaId%253D8002549
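
The double encoding can be checked with a short round trip. This is only an illustrative sketch: it passes `safe=""` so that a literal `/` would be escaped as well (by default, `quote()` leaves `/` untouched), which makes no difference for this particular `detailsId`.

```python
from urllib.parse import quote, unquote

raw = "538968%2F182282%2F316104%2F21604%2F80%3fstation_evaId%3D8002549"

# safe="" also escapes "/", which quote() leaves untouched by default
encoded = quote(raw, safe="")

# decoding once gives back the raw detailsId
assert unquote(encoded) == raw

# decoding twice reveals the path and query string hidden in the id
print(unquote(unquote(encoded)))
```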


After we encoded the detailsId, we can use it to get the `journeyDetails`.

In [8]:
# set header
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer ################################",
}

# get data
data = requests.get(f'https://api.deutschebahn.com/fahrplan-plus/v1/journeyDetails/{detailsId}', headers=headers)
data.json()

[{'stopId': 8002553,
  'stopName': 'Hamburg-Altona',
  'lat': '53.552697',
  'lon': '9.935175',
  'depTime': '03:02',
  'train': 'ICE 591',
  'type': 'ICE',
  'operator': 'DB',
  'notes': [{'key': 'PF',
    'priority': '200',
    'text': 'Please wear an FFP2 mask. You are legally required to do so'},
   {'key': 'CK',
    'priority': '200',
    'text': 'Komfort Check-in possible (visit bahn.de/kci for more information)'},
   {'key': '3G',
    'priority': '205',
    'text': 'Nationwide «3G» rule applies on trains: valid proof must be presented'},
   {'key': 'FK',
    'priority': '260',
    'text': 'Number of bicycles conveyed limited'},
   {'key': 'FR',
    'priority': '260',
    'text': 'Bicycles conveyed - subject to reservation'},
   {'key': 'EH', 'priority': '560', 'text': 'vehicle-mounted access aid'}]},
 {'stopId': 8002548,
  'stopName': 'Hamburg Dammtor',
  'lat': '53.560751',
  'lon': '9.989569',
  'arrTime': '03:10',
  'depTime': '03:11',
  'train': 'ICE 591',
  'type': 'ICE',
 

This API returns all stops of a specific train (e.g. 'ICE 591') and further information about this train.
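
To get a quick overview of such a response, the stops can be reduced to their names and times. The sketch below is a hypothetical helper, not part of `dataDB`. It uses `dict.get()` because the first stop in the response above has no `arrTime`, and the last stop presumably has no `depTime`.

```python
def summarize_stops(stops: list) -> list:
    """Return (stopName, arrTime, depTime) for every stop of a journey.

    Missing arrival or departure times are returned as None.
    """
    return [(s["stopName"], s.get("arrTime"), s.get("depTime")) for s in stops]


# two stops taken from the response above (notes and other fields omitted)
sample = [
    {"stopId": 8002553, "stopName": "Hamburg-Altona", "depTime": "03:02"},
    {"stopId": 8002548, "stopName": "Hamburg Dammtor",
     "arrTime": "03:10", "depTime": "03:11"},
]

for name, arrival, departure in summarize_stops(sample):
    print(name, arrival, departure)
```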

## Data store

All data from the CSV file and the different APIs is stored unprocessed in a MongoDB database. This enables an iterative preparation process without having to download all data again and again.
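
How `dataDB` writes to MongoDB is not shown in this notebook. As a rough sketch under that caveat, raw documents could be stored with an upsert, so that running the extraction again does not create duplicates. The function `store_raw` and its `key` parameter are hypothetical; `replace_one(..., upsert=True)` is the regular `pymongo` collection method.

```python
def store_raw(collection, documents: list, key: str) -> int:
    """Upsert raw documents into a MongoDB collection, one per key value.

    `collection` is expected to be a pymongo Collection; `key` names the
    field that identifies a document (e.g. "EVA_NR" for stations).
    """
    for doc in documents:
        # upsert=True inserts the document if no match exists yet
        collection.replace_one({key: doc[key]}, doc, upsert=True)
    return len(documents)


# usage, assuming a local MongoDB as in this notebook:
# from pymongo import MongoClient
# db = MongoClient("mongodb://localhost:27017/")["deutscheBahn"]
# store_raw(db["station"], station_documents, key="EVA_NR")
```

Here, `station_documents` stands for a list of dictionaries, for example one per row of the CSV file.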

Below you can see how many documents we extracted for the date 2022-07-01.

In [9]:
# connect to MongoClient
mongoClient = MongoClient('mongodb://localhost:27017/')
mongoDatabase = mongoClient["deutscheBahn"]

# print number of documents
print(f'Number of stations:\t{mongoDatabase["station"].count_documents({})} (from CSV)')
print(f'Number of trains:\t{mongoDatabase["train"].count_documents({})} (from API: departureBoard)')
print(f'Number of stops:\t{mongoDatabase["stops"].count_documents({})} (from API: journeyDetails)')

Number of stations:	357
Number of trains:	964
Number of stops:	953


## Note: complete process

Below, you can find the code to extract the data as a script. You only need to paste in your personal API key.

We recommend copying the code into a Python file and executing it from the command line. There are multiple issues when running the code within Jupyter notebooks and the VS Code integrated 'Interactive Window'.

```python
from dataDB import dataDB

# init class
extractor = dataDB()

# set api token
extractor.setApiToken('################################')

# set train stations
extractor.setTrainStations("https://download-data.deutschebahn.com/static/datasets/haltestellen/D_Bahnhof_2020_alle.CSV")

# start data extraction
extractor.startDataExtraction()
```