## Analyzing train and passenger movements using unsupervised machine learning

### Introduction

The analysis of pedestrian and traffic flow has received growing attention in recent years. Understanding how people move helps engineers and scientists optimize the design of civil and building structures for pedestrian use, and it reduces accidents related to pedestrian flow. Within the scope of the CroMa project, which aims to improve the functionality of train stations, this work tries to predict train arrival times by analyzing the macroscopic behavior of the passenger flow. A Python program achieves this with the unsupervised machine learning algorithm DBSCAN.

### CroMa project

#### Goal
As passenger numbers are steadily increasing, new concepts are needed to increase the efficiency of train and subway stations. As part of the Crowd Management in Transport Infrastructures (CroMa) project, various measures, such as structural regulations, appropriate crowd management, and innovative, cross-organizational action strategies, are being developed to increase the robustness of stations during peak hours. The underlying basis is a set of comprehensive studies on pedestrian traffic in transportation facilities and on the behavior of passengers in large crowds. Observations at train stations and experience gathered from crowd control at large event venues both played a vital role in finding subtle flaws in the system and proposing more effective solutions. The results of this project are intended to improve the efficiency of traffic facilities and reduce crowding.

#### Partners
The CroMa project was initiated by the Federal Ministry of Education and Research. Many organizations collaborate on the project, among them the University of Wuppertal, the Jülich Research Centre, the University of Bochum, Deutsche Bahn AG, Schweizerische Bundesbahnen AG, and the federal police.

#### Progress
So far, many experiments and studies have been carried out and published. This work continues the research of "Küpper, M.; Seyfried, A. (2020) Analysis of Space Usage on Train Station Platforms Based on Trajectory Data. Sustainability 12, no. 20: 8325".

The functionality of railway platforms can be assessed by level of service (LOS) concepts.
LOS describes interactions between humans and the built environment and allows a proper assessment of the risk of overcrowding. To improve existing concepts, a detailed analysis of how pedestrians use the
space was performed, and new measurement and evaluation methods were introduced. Boarding and
alighting passengers show different behavior, considering the travel paths, waiting times and mean
speed. Density, speed and flow profiles were exploited and a new measure for the occupation of space
is introduced. The analysis has shown that it is necessary to filter the data in order to reach a realistic
assessment of the level of service. Three main factors should be considered: the time of day, the times
when trains arrive and depart and the platform side. Therefore, density, speed and flow profiles
were averaged over one minute and calculated depending on the train arrival. The methodology
developed in this article is the basis for enhanced and more specific level of service concepts and offers
the possibility to optimize planning of transportation infrastructures with regard to functionality and
sustainability.

### Objective of this work
Following the research mentioned above, the main goal of this project is to predict train arrival times from the flow of passengers. Before discussing the method used to achieve this objective, several points should be taken into consideration.

#### Data collecting and preparation
The analysis uses tracking data acquired by stereo sensors, provided by Swiss Federal
Railways (SBB AG) for Zurich main station, Switzerland, and by Deutsche Bahn (DB) for Frankfurt main station. The datasets were checked for plausibility; nevertheless, the completeness of the trajectories cannot be guaranteed. For technical reasons during recording, the tracking data are mirrored horizontally.

#### JuPedSim
Juelich Pedestrian Simulator (JuPedSim) is an open source framework for simulating, analyzing and visualizing pedestrian dynamics. It helps students and researchers investigate pedestrian dynamics and focus on their research. JuPedSim currently focuses on evacuation but is easily extendable to cover other areas.

JuPedSim consists of four parts: jpscore for simulation, jpsreport for analysis, jpseditor for editing, and jpsvis for visualization. In the previous work, the train arrival times were detected manually in jpsvis, which is not a practical way of detecting them. The arrival times were taken to be the moments at which a number of pedestrians appear or disappear at the edge of the platform, on the assumption that these people were getting off or on the train. The same approach to detecting passenger movements is used in this work. The jpsvis interface looks as follows:

:::{figure-md}

<img src="figs/jpsvis.png" width="80%">

The platform is shown in dark blue and the green points are people. The visualization continuously shows the position of the people on the platform at every time step.

:::

#### Machine Learning

To understand machine learning, one first needs the basic concept of artificial intelligence (AI). AI is defined as a program that exhibits cognitive abilities similar to those of a human being. Making computers think like humans and solve problems the way we do is one of its main principles. AI is an umbrella term for all computer programs that can think as humans do.

With the help of machine learning algorithms, AI can perform tasks beyond what it was explicitly programmed to do. Machine learning, a subfield of artificial intelligence, can process large amounts of data, extract useful information, learn the patterns in the given data, and improve on previous iterations by learning from their output. Statistics, calculus and linear algebra form the foundation of machine learning.

There are three types of machine learning: supervised learning, unsupervised learning and reinforcement learning.

In supervised learning, the algorithm uses labeled data, which contain both the input and the output parameters. The algorithm can therefore compare its result with the output parameter in the dataset. It trains itself on the labeled data and then predicts the output for unlabeled data.

Unsupervised machine learning has the advantage of being able to work with unlabeled data. The algorithm finds relations and patterns in the input data, but it cannot compare its output with a provided output from the dataset to optimize its parameters.

Reinforcement learning takes direct inspiration from how human beings learn from experience. It features an algorithm that learns from new situations by trial and error.

### Zurich main train station

First, the required datasets are loaded.

- The passengers' locations on the platform in every frame. In this dataset each frame corresponds to one second.

:::{figure-md}

<img src="figs/lauf_data.png" width="30%">

`X` and `Y` indicate the position of a person along the X and Y axes.
`PersID` is the ID number assigned to each person.

:::

- Geometry of the platform.

:::{figure-md}

<img src="figs/geo.png" width="30%">

Positions of all objects and of the platform edges.

:::

In the next step the geometry is loaded. For each edge of the platform, a hypothetical line is also computed. These lines are part of the method for filtering the passengers on the platform. The coefficients of the lines are computed with `numpy.polyfit`.

In [None]:
def oberkante_koeff(tag, degree):
    # Fit a polynomial of the given degree through the geometry points of the object `tag`.
    # The rows of this object must be adjacent in the geometry file.
    k = geo.index[geo['Object'] == tag].tolist()
    return np.polyfit(geo['geo_x'][k[0]:k[-1]+1], geo['geo_y'][k[0]:k[-1]+1], degree)

koeff_o = oberkante_koeff(input('Name of the upper-edge object: '), int(input('Polynomial order, 1 or 2: ')))
print('Coefficients: ', koeff_o)

def g_oben(x, gleis_abstand=float(input('Distance of the hypothetical line from the upper edge: '))):
    # Fitted upper edge shifted down by gleis_abstand.
    # np.polyval evaluates the fitted polynomial for order 1 and order 2 alike.
    return np.polyval(koeff_o, x) - gleis_abstand
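
The fit of a straight platform edge and the shifted hypothetical line can also be reproduced without NumPy. The following is a minimal sketch for the first-order case only; the helper names `fit_line` and `offset_line` and the edge coordinates are hypothetical and not part of the notebook. A first-order `numpy.polyfit` computes the same least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of a straight line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def offset_line(slope, intercept, offset):
    """Return the fitted line shifted by `offset` along the y axis."""
    def line(x):
        return slope * x + intercept + offset
    return line


# Hypothetical upper platform edge, slightly inclined (coordinates in metres).
edge_x = [0.0, 10.0, 20.0, 30.0]
edge_y = [12.0, 12.5, 13.0, 13.5]

slope, intercept = fit_line(edge_x, edge_y)
# Shift the upper edge down by an assumed 0.4 m, as g_oben does with gleis_abstand.
upper_line = offset_line(slope, intercept, -0.4)
print(slope, intercept, upper_line(10.0))
```

For the lower edge the offset would be positive, so that both lines lie inside the platform.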

:::{figure-md}

<img src="figs/platform.png" width="100%">

:::

The two hypothetical lines are then used to detect boarding and alighting passengers. Passengers who first appear in the area between a line and the corresponding platform edge are considered alighting passengers: it is assumed that they got off the train and were detected by the sensors for the first time. Conversely, passengers who disappear in these areas are considered boarding passengers. This approach is implemented as shown below:

In [None]:
for e in range(len(PersID_each)):
    # person is a DataFrame with all frames of one person
    person = lauf_data[PersID_each[e] == lauf_data['PersID']]
    first, last = person.iloc[0], person.iloc[-1]    # first and last recorded frame

    # Person first appears at the lower platform edge: alighting, lower side
    if first['Y'] < g_unten(first['X']):
        unten_aus.append(int(first['Frame']))
        aussteiger_unten_anzahl += 1

    # Person disappears at the lower platform edge: boarding, lower side
    if last['Y'] < g_unten(last['X']):
        unten_ein.append(int(last['Frame']))
        einsteiger_unten_anzahl += 1

    # Person first appears at the upper platform edge: alighting, upper side
    if first['Y'] > g_oben(first['X']):
        oben_aus.append(int(first['Frame']))
        aussteiger_oben_anzahl += 1

    # Person disappears at the upper platform edge: boarding, upper side
    if last['Y'] > g_oben(last['X']):
        oben_ein.append(int(last['Frame']))
        einsteiger_oben_anzahl += 1
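
The rule can be tried out on a toy platform without any data files. This is a minimal sketch, assuming a straight platform with edges at y = 0 and y = 10 and hypothetical lines 1.5 m inside; the function `classify_passenger` and the two trajectories are invented for illustration.

```python
def classify_passenger(track, upper_line, lower_line):
    """Classify one trajectory by where it starts and where it ends.

    track is a list of (frame, x, y) tuples ordered by frame.
    upper_line and lower_line map x to the y value of the hypothetical lines.
    Returns a list of (side, role, frame) events.
    """
    first_frame, first_x, first_y = track[0]
    last_frame, last_x, last_y = track[-1]
    events = []
    if first_y > upper_line(first_x):      # first seen next to the upper edge
        events.append(("upper", "alighting", first_frame))
    if last_y > upper_line(last_x):        # last seen next to the upper edge
        events.append(("upper", "boarding", last_frame))
    if first_y < lower_line(first_x):      # first seen next to the lower edge
        events.append(("lower", "alighting", first_frame))
    if last_y < lower_line(last_x):        # last seen next to the lower edge
        events.append(("lower", "boarding", last_frame))
    return events


def upper(x):
    return 8.5   # assumed line 1.5 m below the upper edge at y = 10


def lower(x):
    return 1.5   # assumed line 1.5 m above the lower edge at y = 0


alighting = [(100, 5.0, 9.2), (101, 5.1, 8.0), (102, 5.3, 6.0)]
boarding = [(200, 7.0, 5.0), (201, 7.0, 3.0), (202, 7.1, 1.0)]

print(classify_passenger(alighting, upper, lower))
print(classify_passenger(boarding, upper, lower))
```

A person who both appears and disappears inside one of the strips produces two events, which matches the independent `if` statements in the notebook code.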

So far the code has collected separate lists of alighting and boarding times for each side of the platform. Each list contains the times at which individual passengers alighted or boarded. To obtain the times of all passengers (alighting or boarding) on one side of the platform, the alighting and boarding times are combined into one data frame.

With the help of this data frame, the train arrival times can be detected. On the time axis there are regions with a higher density of events. These are the moments at which a number of people get on or off a train, so they can be interpreted as the times at which a train is in the station.

To find these high-density regions, the unsupervised machine learning algorithm DBSCAN is used.

**DBSCAN**

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm. Given a set of points in some space (here, time), it groups together points that are closely packed (points with many nearby neighbors) and treats points with few or no neighbors as noise. The algorithm assigns each point to one of three categories:

1. Core point: a point that has at least $\mf n-1$ other points within the distance $\mf d$.
2. Border point: a point that lies within the distance $\mf d$ of a core point but is not a core point itself (it has fewer than $\mf n-1$ other points within the distance $\mf d$).
3. Noise point: a point that is neither a core point nor within the distance $\mf d$ of a core point.
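
The three categories can be checked by hand on a short list of time stamps. The sketch below is a plain-Python illustration of the definitions above, not the scikit-learn implementation; the function name `classify_points` and the sample times are assumptions made for this example.

```python
def classify_points(times, d, n):
    """Label each time stamp as 'core', 'border' or 'noise'.

    A point is a core point if at least n points (itself included) lie
    within the distance d of it, i.e. it has at least n - 1 neighbors.
    """
    def neighbors(i):
        return [j for j, t in enumerate(times) if abs(t - times[i]) <= d]

    core = {i for i in range(len(times)) if len(neighbors(i)) >= n}

    labels = []
    for i in range(len(times)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical boarding/alighting times in seconds.
times = [10, 20, 30, 40, 80, 300]
labels = classify_points(times, d=45, n=4)
print(list(zip(times, labels)))
```

Here the passenger at 80 s has only one neighbor within 45 s, but that neighbor is a core point, so the passenger still counts as part of the same train. The passenger at 300 s is isolated and is treated as noise.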

:::{figure-md}

<img src="figs/dbscan.png" width="40%">

In the figure above, red points are core points, yellow points are border points, and the blue point is noise. Every point with at least $\mf 4-1$ other points in its surroundings is a core point, so $\mf n=4$, and the radius of the circles is the distance $\mf d$.

source: https://en.wikipedia.org/wiki/DBSCAN

:::

In this case, $\mf n$ is set to $\mf 4$. This means that the algorithm looks for high-density regions with at least $\mf 4$ events close to each other, in other words, for trains with at least $\mf 4$ alighting and/or boarding passengers. In addition, $\mf d=45$, which means that these $\mf 4$ passengers should get on or off the train within a maximum time difference of $\mf 45s$.
These parameter values were picked by trial and error, comparing the results for different values with the results from jpsvis.
The parameters are passed to DBSCAN using the scikit-learn package for machine learning:

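In [None]:
frame_second_ratio = 1    # frames per second of the recording
# eps is measured in frames, so the 45 s window is scaled by the frame rate
db = DBSCAN(eps=45 * frame_second_ratio, min_samples=4).fit(Dataframe)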
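
To make the effect of these two parameters on a list of event times concrete, here is a small pure-Python sketch of the DBSCAN idea in one dimension. It is a simplified re-implementation for illustration, not the scikit-learn code, and the function `dbscan_1d` and the event times are hypothetical. As in scikit-learn, the label `-1` marks noise and `min_samples` counts the point itself.

```python
def dbscan_1d(times, eps, min_samples):
    """Return one cluster label per time stamp; -1 marks noise."""
    n = len(times)
    neighbors = [
        [j for j in range(n) if abs(times[j] - times[i]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:        # only core points extend the cluster
                        stack.append(q)
        cluster += 1
    return labels


# Hypothetical event times in seconds: two trains and one stray passenger.
events = [100, 105, 110, 118, 400, 900, 905, 912, 920, 931]
train_labels = dbscan_1d(events, eps=45, min_samples=4)
print(train_labels)
```

Each non-negative label corresponds to one detected train; the event at 400 s has no dense neighborhood and is discarded as noise.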

After processing by DBSCAN, a data frame is created to show the results. The numbers of alighting and boarding passengers are separated for each train. More importantly, the train arrival time is calculated in three ways, for various applications:

1. The mean time of all passengers.
2. The time of the first alighting/boarding passenger.
3. The mean time of the first $\mf 15$ passengers.
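
The three estimates can be sketched for a single cluster as follows. The function name `arrival_estimates`, the sample times, and the choice to round all means to whole time units are assumptions for this illustration; `first_k` plays the role of the $\mf 15$ passengers above and is set to 3 only because the sample cluster is small.

```python
from statistics import mean


def arrival_estimates(times, first_k=15):
    """Three arrival-time estimates for the event times of one train."""
    ordered = sorted(times)
    return {
        "mean_all": round(mean(ordered)),                 # 1. mean of all passengers
        "first": ordered[0],                              # 2. first passenger
        "mean_first_k": round(mean(ordered[:first_k])),   # 3. mean of the first k
    }


# Hypothetical cluster: most passengers move early, two stragglers follow.
cluster = [1200, 1203, 1207, 1210, 1250, 1290]
estimates = arrival_estimates(cluster, first_k=3)
print(estimates)
```

The example shows why the estimates differ: late passengers pull the overall mean away from the actual arrival, while the first passenger alone is sensitive to a single early detection.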

:::{figure-md}

<img src="figs/Zurich_Oberkante_Predicted.png" width="60%">

The arrival times in seconds, and passenger counts for the upper side of the platform.
:::

#### Comparing the prediction results with the results from JPSVis
Here the arrival times that were noted manually in `jpsvis` are compared with the predictions by `DBSCAN`. In this case the mean time of the first $\mf 15$ passengers is used.

:::{figure-md}

<img src="figs/Zurich_Compare.png" width="100%">

:::

As the graph shows, the prediction has one false negative: at $\mf 3099s$ there was a train, but the algorithm did not recognize it. The main reason is the low frame rate of one frame per second; the algorithm works better with a higher frame rate. In a time step of one second a person can travel more than $\mf 0.4m$, which is the width of the observation area along the platform edge used for the prediction.
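
A short calculation illustrates the argument. The walking speed of 1.3 m/s is an assumed typical value and is not taken from the datasets; the helper `distance_per_frame` is likewise only for illustration.

```python
def distance_per_frame(speed_m_per_s, frames_per_second):
    """Distance in metres a person covers between two consecutive frames."""
    return speed_m_per_s / frames_per_second


ASSUMED_WALKING_SPEED = 1.3   # m/s, an assumption, not measured in this work
STRIP_WIDTH = 0.4             # m, width of the observation area stated above

for fps in (1, 4):
    step = distance_per_frame(ASSUMED_WALKING_SPEED, fps)
    # If one step is wider than the strip, a person can cross it between two frames.
    print(fps, "fps:", round(step, 3), "m per frame, can skip the strip:", step > STRIP_WIDTH)
```

Under this assumption a person covers about 1.3 m per frame at one frame per second and can pass through the strip without ever being recorded inside it, whereas at four frames per second the step is shorter than the strip.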

#### Speed and durations
Investigating the microscopic parameters of the flow, such as the speed, duration and travel distance of each passenger, also helps to validate the results.

:::{figure-md}

<img src="figs/sonstiges_.png" width="30%">

The index shows the **ID** of each passenger. Distance, time and mean speed are in $\mf m$, $\mf s$ and $\mf m/s$, respectively.
:::

The speed of each passenger at every time step is also calculated and visualized.

:::{figure-md}

<img src="figs/Timestep.png" width="100%">

The speed profile suggests that passengers who move faster at the beginning are probably alighting passengers, who also stay on the platform for a shorter time. Conversely, those who move faster in their last moments and stay longer on the platform are probably boarding passengers.
:::
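
The per-passenger quantities can be derived from consecutive positions as sketched below. The function `travel_summary` and the sample trajectory are hypothetical; the sketch assumes one position per consecutive frame, with no gaps in the recording.

```python
import math


def travel_summary(track, frames_per_second):
    """Distance (m), duration (s), mean speed (m/s) and per-step speeds (m/s).

    track is a list of (x, y) positions in metres, one per consecutive frame.
    """
    dt = 1.0 / frames_per_second
    steps = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]
    speeds = [step / dt for step in steps]
    distance = sum(steps)
    duration = len(steps) * dt
    mean_speed = distance / duration if duration > 0 else 0.0
    return distance, duration, mean_speed, speeds


# Hypothetical passenger: walks, waits for one time step, walks again.
track = [(0.0, 0.0), (0.6, 0.8), (0.6, 0.8), (1.2, 1.6)]
distance, duration, mean_speed, speeds = travel_summary(track, frames_per_second=1)
print(distance, duration, round(mean_speed, 2), speeds)
```

The waiting step appears as a zero in the speed profile and lowers the mean speed, which is the kind of pattern used above to tell boarding from alighting passengers.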

### Frankfurt main train station
To give a better picture of how the application works, the algorithm is tested below with another dataset. Note that at Frankfurt main station the platform edges are straight lines. Furthermore, the dataset was recorded at $\mf 4$ frames per second.

The geometry of the platform is shown below. The hypothetical lines are at a distance of $\mf 1.5m$ from the edges of the platform.

:::{figure-md}

<img src="figs/FF_Geometry.png" width="100%">

The outlines in the center of the picture are structures and objects such as columns, stairs and advertising boards.

:::

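In [None]:
frame_second_ratio = 4    # frames per second of the recording
# eps is measured in frames, so the 45 s window is scaled by the frame rate
db = DBSCAN(eps=45 * frame_second_ratio, min_samples=4).fit(Dataframe)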

The predicted arrival times are shown below:

:::{figure-md}

<img src="figs/FF_Oberkante_Predicted.png" width="60%">

The arrival times in the table are given in frames. Dividing by $\mf 4$ converts them to $\mf s$.

:::
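
The conversion between frames and seconds, in both directions, can be written down explicitly. The helper names and the sample frame number are hypothetical.

```python
def seconds_to_frames(seconds, frames_per_second):
    """Convert a duration or time stamp from seconds to frames."""
    return seconds * frames_per_second


def frames_to_seconds(frames, frames_per_second):
    """Convert a duration or time stamp from frames to seconds."""
    return frames / frames_per_second


FRANKFURT_FPS = 4

# The 45 s window becomes the eps value that is passed to DBSCAN.
print(seconds_to_frames(45, FRANKFURT_FPS))
# A hypothetical predicted arrival at frame 4800 converted back to seconds.
print(frames_to_seconds(4800, FRANKFURT_FPS))
```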

#### Comparing the prediction results with the results from JPSVis
In this case as well, the mean time of the first $\mf 15$ passengers is used.

:::{figure-md}

<img src="figs/FF_Compare.png" width="100%">

For this case, the algorithm predicted the number of trains correctly. The `mean error` and `variance` of the prediction are $\mf -3.08$ and $\mf 18$, respectively.

:::
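
Error statistics of this kind can be computed as in the following sketch. It assumes that the number of predicted trains equals the number of observed trains, so that the i-th prediction belongs to the i-th observation, and it uses the standard sample variance from Python's `statistics` module, which is not necessarily the exact formula behind the value reported above. The function name and the sample times are invented.

```python
from statistics import mean, variance


def error_statistics(observed, predicted):
    """Signed errors (observed minus predicted), their mean and sample variance."""
    if len(observed) != len(predicted):
        raise ValueError("every observed train needs exactly one prediction")
    errors = [o - p for o, p in zip(observed, predicted)]
    return errors, mean(errors), variance(errors)


# Hypothetical arrival times in seconds.
observed = [100, 400, 900]
predicted = [103, 398, 905]

errors, mean_error, error_variance = error_statistics(observed, predicted)
print(errors, mean_error, error_variance)
```

A negative mean error means that, on average, the predicted arrival is later than the observed one.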

### Conclusion and future insight

The DBSCAN algorithm can be a suitable way to predict train arrival times for further use. Nevertheless, the `frame-second` ratio plays a vital role in recognizing the movement of passengers on the platform at every time step.

Tuning the other parameters of the algorithm can be the next step in optimizing it: the distance of the hypothetical line from the edge of the platform, the minimum number of alighting/boarding passengers $\mf n$, and the maximum time difference between two neighboring passengers $\mf d$. Such hyperparameter tuning needs a supervised machine learning model and therefore a large dataset from different train stations, which is not currently available.

### References

- Website of CroMa project:
http://www.croma-projekt.de/croma-projekt/DE/Home/home_node.html

- Page of the project in the federal ministry website:
https://www.sifo.de/sifo/de/projekte/schutz-kritischer-infrastrukturen/verkehrsinfrastrukturen/croma-crowd-management-in-verkehrsinfrastrukturen/croma-crowd-management-in-verkehrsinfrastrukturen.html

- Website of the computer simulation for fire and pedestrian dynamics department of the university of Wuppertal:
https://www.asim.uni-wuppertal.de/

- Website of DBSCAN algorithm:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

- What Is Machine Learning: Definition, Types, Applications and Examples:
https://www.toolbox.com/tech/artificial-intelligence/tech-101/what-is-machine-learning-definition-types-applications-and-examples/

- Machine Learning and Deep Learning Applications-A Vision:
https://www.sciencedirect.com/science/article/pii/S2666285X21000042

- JuPedSim website:
https://www.jupedsim.org/index.html

- Analysis of Space Usage on Train Station Platforms Based on Trajectory Data:
https://juser.fz-juelich.de/record/885761

### Source code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statistics
import plotly.graph_objects as go
import dataframe_image as dfi
#DBSCAN
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

### Read the person/trajectory file
lauf_datei_name = input('Name of the person/trajectory file: ')

# Raw string for the regular expression, so that \s is not treated as a string escape
lauf_data = pd.read_csv(lauf_datei_name, comment='#', sep=r'\s+', names=['PersID','Frame','X','Y','H'])
lauf_data = lauf_data.drop(columns='H')

PersID = lauf_data['PersID']
PersID_each = sorted(set(PersID))

lauf_data.head()
# dfi.export(lauf_data,'lauf_data.png', max_rows=10)

### Read the geometry file 'Geo.txt'
geo_datei_name = input('Name of the geometry file: ')

geo = pd.read_csv(geo_datei_name, comment='#', sep=r'\s+', names=['geo_x','geo_y','Object'], index_col=False)
geo.head()
# dfi.export(geo,'geo.png', max_rows=12)

### Create the platform-edge functions

#### Upper edge
# The rows of the upper/lower edge must be adjacent in geo.txt!

def oberkante_koeff(tag, degree):
    # Fit a polynomial of the given degree through the geometry points of the object `tag`
    k = geo.index[geo['Object'] == tag].tolist()
    return np.polyfit(geo['geo_x'][k[0]:k[-1]+1], geo['geo_y'][k[0]:k[-1]+1], degree)

koeff_o = oberkante_koeff(input('Name of the upper-edge object: '), int(input('Polynomial order, 1 or 2: ')))
print('Coefficients: ', koeff_o)

def g_oben(x, gleis_abstand=float(input('Distance of the hypothetical line from the upper edge: '))):
    # Fitted upper edge shifted down by gleis_abstand.
    # np.polyval evaluates the fitted polynomial for order 1 and order 2 alike.
    return np.polyval(koeff_o, x) - gleis_abstand

#### Lower edge
def unterkante_koeff(tag, degree):
    # Fit a polynomial of the given degree through the geometry points of the object `tag`
    k = geo.index[geo['Object'] == tag].tolist()
    return np.polyfit(geo['geo_x'][k[0]:k[-1]+1], geo['geo_y'][k[0]:k[-1]+1], degree)

koeff_u = unterkante_koeff(input('Name of the lower-edge object: '), int(input('Polynomial order, 1 or 2: ')))
print('Coefficients: ', koeff_u)

def g_unten(x, gleis_abstand=float(input('Distance of the hypothetical line from the lower edge: '))):
    # Fitted lower edge shifted up by gleis_abstand
    return np.polyval(koeff_u, x) + gleis_abstand

### Plot the geometry and the hypothetical lines
# Plot the platform geometry and the fitted curves

t = np.arange(min(geo['geo_x']), max(geo['geo_x']), 0.1)    # x values at which the functions are evaluated

fig = plt.figure(figsize=(16,4),dpi=100)
plt.plot(t, g_oben(t), 'r')
plt.plot(t, g_unten(t), 'b')

# Draw the outline of every object in the geometry file
for name in geo['Object'].unique():
    obj = geo[geo['Object'] == name]
    plt.plot(obj['geo_x'], obj['geo_y'],
             color='black', linewidth = 0.7, markersize = 100)
plt.axis('equal')    # set the aspect ratio before saving so that the saved image uses it
plt.savefig('FF_Geometry.png')

### Detect and distinguish boarding and alighting passengers
# This yields the numbers of boarding/alighting passengers and the corresponding times

unten_ein = []
unten_aus = []
oben_ein = []
oben_aus = []

einsteiger_unten_anzahl = 0
aussteiger_unten_anzahl = 0
einsteiger_oben_anzahl = 0
aussteiger_oben_anzahl = 0


for e in range(len(PersID_each)):
    # person is a DataFrame with all frames of one person
    person = lauf_data[PersID_each[e] == lauf_data['PersID']]
    first, last = person.iloc[0], person.iloc[-1]    # first and last recorded frame

    # Person first appears at the lower platform edge: alighting, lower side
    if first['Y'] < g_unten(first['X']):
        unten_aus.append(int(first['Frame']))
        aussteiger_unten_anzahl += 1

    # Person disappears at the lower platform edge: boarding, lower side
    if last['Y'] < g_unten(last['X']):
        unten_ein.append(int(last['Frame']))
        einsteiger_unten_anzahl += 1

    # Person first appears at the upper platform edge: alighting, upper side
    if first['Y'] > g_oben(first['X']):
        oben_aus.append(int(first['Frame']))
        aussteiger_oben_anzahl += 1

    # Person disappears at the upper platform edge: boarding, upper side
    if last['Y'] > g_oben(last['X']):
        oben_ein.append(int(last['Frame']))
        einsteiger_oben_anzahl += 1

print(einsteiger_unten_anzahl, aussteiger_unten_anzahl, einsteiger_oben_anzahl, aussteiger_oben_anzahl)

# Upper edge
# pd.DataFrame of boarding passengers, alighting passengers, and both

Oben_ein = pd.DataFrame(sorted(oben_ein))
Oben_aus = pd.DataFrame(sorted(oben_aus))

oben = oben_ein+oben_aus 

Oben = pd.DataFrame(sorted(oben))

### DBSCAN
db = DBSCAN(eps=45 * int(input('How many frames correspond to one second?: ')), min_samples=4).fit(Oben)

# db.labels_ holds the raw DBSCAN results
# db.labels_

# Ratio of noise points to all points

noise = np.count_nonzero(db.labels_ == -1)
print('''Number of people who stepped onto the platform edge but belong to no train /
all people (boarding/alighting):''', noise, '/', len(oben))

result_expand = pd.DataFrame()
result_expand['MW_alle_P_Zeit'] = Oben[0]
result_expand['Zug_nummer'] = db.labels_
# result_expand

noises = np.where(result_expand['Zug_nummer'] == -1)[0]
# noises

result_expand = result_expand.drop(noises, axis=0)
result_expand = result_expand.reset_index(drop=True)
# result_expand

clusters = np.delete(db.labels_, np.where(db.labels_ == -1))

unique, counts = np.unique(clusters, return_counts=True)
z = dict(zip(unique, counts))
Anzahl = pd.DataFrame.from_dict(z.values())
# Anzahl

# Time of the first passenger and mean time of the first 15 passengers as the train arrival time.

Ersteperson_zeit = list()
Anfahrtzeit_erste15 = list()

for i in result_expand['Zug_nummer'].unique():
    index = result_expand.index[result_expand['Zug_nummer']==i].tolist()
    Ersteperson_zeit.append(result_expand['MW_alle_P_Zeit'][index[0]])
    Anfahrtzeit_erste15.append(round(statistics.mean(result_expand['MW_alle_P_Zeit'][index[:15]])))
    
# Separate boarding and alighting passengers for each train

trenn = pd.DataFrame(columns=['Einsteiger','Aussteiger'])

for c in range(len(z.keys())):
               
    ein = 0
    aus = 0
               
    zug = result_expand[:][result_expand['Zug_nummer']==c]
               
    for e in oben_ein:
        if min(zug['MW_alle_P_Zeit']) <= e <= max(zug['MW_alle_P_Zeit']):
            ein += 1
    for a in oben_aus:
        if min(zug['MW_alle_P_Zeit']) <= a <= max(zug['MW_alle_P_Zeit']):
            aus += 1
               
    trenn.loc[c] = list([ein, aus])
    
### Table of arrival times
result = result_expand.groupby(['Zug_nummer']).mean()
result['MW_alle_P_Zeit'] = result['MW_alle_P_Zeit'].astype(int)
result['Erste_P_Zeit'] = Ersteperson_zeit
result['Erste_15_P_Zeit'] = Anfahrtzeit_erste15
result['Pers_Anzahl'] = Anzahl
result['Einsteiger'] = trenn['Einsteiger']
result['Aussteiger'] = trenn['Aussteiger']
result

# Save the table
dfi.export(result, 'FF_Oberkante_Predicted.png')

### Read the observed arrival times from jpsvis
obs = pd.read_csv(input('File name: '))
obs

### Plot predicted and observed arrival times together
fig = go.Figure()

col = input('Name of the arrival-time column: ')

fig.add_trace(go.Scatter(
    x = obs['zeit'] , y = pd.Series(0, np.arange(len(obs['zeit']))) , name='observed', 
    mode='markers', marker_size=6,
))

x = result[col]
fig.add_trace(go.Scatter(
    x = result[col].to_list(), y = pd.Series(0.5, np.arange(len(x))).to_list() , name='DBSCAN', 
    mode='markers', marker_size=6,
))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False, 
                 zeroline=True, zerolinecolor='black', zerolinewidth=3,
                 showticklabels=False)
fig.update_layout(height=210, plot_bgcolor='white')
fig.show()

### Prediction error
# Errors for all arrivals: dct_errors, mean error: mw_error, variance: varianz
# Mean of all boarding/alighting passengers as the train arrival time

error = []
dct_errors = {}
column = input('Name of the arrival-time column: ')

for i in range(len(obs)):
    # Observed minus predicted arrival time of the i-th train
    # (assumes that obs and result list the same trains in the same order)
    e = int(obs['zeit'].iloc[i]) - int(result[column].iloc[i])
    dct_errors[i] = e
    error.append(e)

add = []
for v in error:
    add.append(v**2)

mw_error = sum(error)/len(obs)
# Parentheses added: divide the sum of squared errors by n-1 instead of computing (sum/n) - 1
varianz = sum(add)/(len(obs)-1)

print('Error for {}:   mean error: {},   variance: {}'.format(column, round(mw_error,2), round(varianz)))
dct_errors

# Lower edge
# pd.DataFrame of boarding passengers, alighting passengers, and both

Unten_ein = pd.DataFrame(sorted(unten_ein))
Unten_aus = pd.DataFrame(sorted(unten_aus))

unten = unten_ein+unten_aus 

Unten = pd.DataFrame(sorted(unten))

### DBSCAN
db = DBSCAN(eps=50 * int(input('How many frames correspond to one second?: ')), min_samples=3).fit(Unten)

# db.labels_ holds the raw DBSCAN results
# db.labels_

# Ratio of noise points to all points

noise = np.count_nonzero(db.labels_ == -1)
print('''Number of people who stepped onto the platform edge but belong to no train /
all people (boarding/alighting):''', noise, '/', len(unten))

result_expand = pd.DataFrame()
result_expand['MW_alle_P_Zeit'] = Unten[0]
result_expand['Zug_nummer'] = db.labels_
# result_expand

noises = np.where(result_expand['Zug_nummer'] == -1)[0]
# noises

result_expand = result_expand.drop(noises, axis=0)
result_expand = result_expand.reset_index(drop=True)
# result_expand

clusters = np.delete(db.labels_, np.where(db.labels_ == -1))

unique, counts = np.unique(clusters, return_counts=True)
z = dict(zip(unique, counts))
Anzahl = pd.DataFrame.from_dict(z.values())
# Anzahl

# Time of the first passenger and mean time of the first 15 passengers as the train arrival time.

Ersteperson_zeit = list()
Anfahrtzeit_erste15 = list()

for i in result_expand['Zug_nummer'].unique():
    index = result_expand.index[result_expand['Zug_nummer']==i].tolist()
    Ersteperson_zeit.append(result_expand['MW_alle_P_Zeit'][index[0]])
    Anfahrtzeit_erste15.append(round(statistics.mean(result_expand['MW_alle_P_Zeit'][index[:15]])))
    
# Separate boarding and alighting passengers for each train

trenn = pd.DataFrame(columns=['Einsteiger','Aussteiger'])

for c in range(len(z.keys())):
               
    ein = 0
    aus = 0
               
    zug = result_expand[:][result_expand['Zug_nummer']==c]
               
    for e in unten_ein:
        if min(zug['MW_alle_P_Zeit']) <= e <= max(zug['MW_alle_P_Zeit']):
            ein += 1
    for a in unten_aus:
        if min(zug['MW_alle_P_Zeit']) <= a <= max(zug['MW_alle_P_Zeit']):
            aus += 1
               
    trenn.loc[c] = list([ein, aus])
    
### Table of arrival times
result = result_expand.groupby(['Zug_nummer']).mean()
result['MW_alle_P_Zeit'] = result['MW_alle_P_Zeit'].astype(int)
result['Erste_P_Zeit'] = Ersteperson_zeit
result['Erste_15_P_Zeit'] = Anfahrtzeit_erste15
result['Pers_Anzahl'] = Anzahl
result['Einsteiger'] = trenn['Einsteiger']
result['Aussteiger'] = trenn['Aussteiger']
result

# Further quantities for each ID
# Helper that collects the speeds of each person over all time steps

def set_key(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = value
    elif type(dictionary[key]) == list:
        dictionary[key].append(value)
    else:
        dictionary[key] = [dictionary[key], value]
        
# Total distance, waiting time, and the speed at every time step together with its mean

PersID_ordered = sorted(set(lauf_data.PersID))

dct_general = {}
dct_speed = {}
framesec = int(input('How many frames correspond to one second?: '))

for P in PersID_ordered:
    
    datafile_filtered = lauf_data[P == PersID]
    Dauer = 0
    Abstand_gesamt = 0
    
    for F in range(len(datafile_filtered)-1):
        
        Dauer += 1/framesec
        
        xx = (datafile_filtered.iloc[F+1]['X'] - datafile_filtered.iloc[F]['X'])
        yy = (datafile_filtered.iloc[F+1]['Y'] - datafile_filtered.iloc[F]['Y'])
        
        Abstand_timestep = (xx**2+yy**2)**0.5
        Speed_timestep = Abstand_timestep/(1/framesec)   # m/s

        # Dictionary with the speed of each person at every time step
        set_key(dct_speed, '%s' % P, round(Speed_timestep,4))
        Abstand_gesamt += Abstand_timestep
                  
    if Abstand_gesamt == 0:
        mean_speed = 0
    else:
        mean_speed = Abstand_gesamt/Dauer
    
    # Dictionary with values: [distance, duration, mean speed]
    set_key(dct_general, '%s' % P, [round(Abstand_gesamt,1), Dauer, round(mean_speed,2)])

# dct_general
# dct_speed

sonstiges = pd.DataFrame.from_dict(dct_general, orient='index', columns=['Abstand', 'Dauer', 'M_speed'])
sonstiges.head()

# Plot the speed of one ID over all time steps

# Convert to int so that the ID matches the PersID column (assumes integer IDs)
P = int(input('Person ID: '))

frames = lauf_data.loc[lauf_data['PersID'] == P, 'Frame']
# First frame
erst = frames.iloc[0]
# Last frame
letzt = frames.iloc[-1]

# One speed value per time step, plotted at the frame that ends the step
X = np.linspace((erst+1), letzt, letzt-(erst+1)+1)
Y = np.array([dct_speed['%s' % P]]).ravel()

plt.figure(figsize=(25,5))    # set the size before plotting so that it takes effect
plt.plot(X,Y)
plt.xlabel('Frame',fontsize=10)
plt.ylabel('Speed / m/s',fontsize=10)
plt.grid()
plt.title('Person ID = %s' % P, fontsize=20)
plt.show()