<div style="text-align: center; font-weight: bold; font-size: 300%">INF442 Project</div><br />
<div style="text-align: center; font-weight: bold; font-size: 180%">Processing and preprocessing</div><br />
<div style="text-align: center; font-size: 150%">École Polytechnique, May 2020</div><br />
<div style="text-align: center; font-size: 120%">Paul Calot and Jean-Charles Layoun</div>

# INTRODUCTION

Before turning to the code itself and demonstrating what we built, we give a quick recap of the approach we followed.

## Understanding the problem

### Data format in the .csv file
A request is made at a given instant and for a given station. It returns the estimated arrival times of the next trains (most often 6 per direction for line B).
Each row starts with the station name and the request date, followed by the total number of trains in the request. The next four fields describe the two running directions: the outbound direction ("aller", or "A") followed by its number of trains, then the return direction ("retour", or "R") followed by its number of trains.
Next come the estimated arrival times. In theory, the estimates for the outbound trains come first, in the number indicated earlier, followed by those for the return direction. Finally, a number of messages are provided; they are discussed later.


|         | station name | request date | total trains | outbound direction | nb outbound | return direction | nb return | estimates    | messages |
|---------|--------------|--------------|--------------|--------------------|-------------|------------------|-----------|--------------|----------|
| example | Lozere       | 201805280917 | 12           | A                  | 6           | R                | 6         | 201805280933 | 09:33    |
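
As an illustration of this layout, here is a minimal sketch that splits one already tokenised row into its parts. The helper name `split_request` is hypothetical, and the sketch assumes that the row contains exactly `total trains` estimates followed by the messages, which real rows do not always satisfy. The toy row uses made-up values.

```python
def split_request(fields):
    """Split one raw row (a list of strings) into named parts.

    Assumes the layout described above: 7 header fields, then
    `total` estimates, then the messages (hypothetical helper).
    """
    station, date, total = fields[0], fields[1], int(fields[2])
    nb_outbound = int(fields[4])
    estimates = fields[7:7 + total]
    messages = fields[7 + total:]
    return {
        "station": station,
        "date": date,
        "outbound": estimates[:nb_outbound],
        "return": estimates[nb_outbound:],
        "messages": messages,
    }

# Toy row with 1 outbound and 1 return train (made-up values).
row = ["Lozere", "201805280917", "2", "A", "1", "R", "1",
       "201805280933", "201805280925", "09:33", "Train a quai"]
parsed = split_request(row)
print(parsed["outbound"], parsed["return"], parsed["messages"])
```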

### Data used
Having gained access to the API too late, we worked with data from 28 and 29 May 2018 for line B. A total of 47 requests (one per station) were made every minute. The first phase was therefore to understand this data.

Sampling problem: because requests are made once per minute, some messages never appear (a train that stops for only a few seconds at a station), while others appear several times (a train that stops for several minutes at a station).

The algorithms must therefore be able to cope with this situation.
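
The effect of the one-minute sampling can be illustrated with a small sketch. The function name and the stop durations below are assumptions chosen for the example; they are not taken from the data.

```python
def observed_samples(arrival_s, departure_s, period_s=60):
    """Return the sampling instants (in seconds) at which a train that is
    at the platform during [arrival_s, departure_s] is actually seen."""
    # First multiple of the period that is >= arrival_s (ceiling division).
    first = -(-arrival_s // period_s) * period_s
    return list(range(first, departure_s + 1, period_s))

short_stop = observed_samples(70, 100)   # 30 s stop between two samples
long_stop = observed_samples(70, 250)    # 3 min stop
print(short_stop)  # []
print(long_stop)   # [120, 180, 240]
```

The short stop is never observed, whereas the long stop produces the same "at platform" message in three consecutive requests.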

### Our objective

Our objective is to obtain the pair (time of the first estimate by the RATP, actual arrival time). To do so, we track the trains station by station, repeating the process for each station.

### General idea

Given the data available (in particular the lack of a train identifier), "tracking trains" is uncertain. We can never be sure that we are following the right train and have not made a mistake.

Our initial idea was therefore to go naively from one request to the next, looking for estimates close enough to each other to be considered the same train. The arrival of that train at the station is given by the messages that accompany the estimates.
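
A minimal sketch of this naive matching is given below. It is not the processing code itself: the function name, the greedy strategy, the 2-minute tolerance and the use of minutes since midnight are all assumptions made for the illustration.

```python
def match_estimates(previous, current, tolerance=2):
    """Greedily pair each current estimate with the closest unused previous
    estimate, if it lies within `tolerance` minutes.

    Times are minutes since midnight. Returns a list of
    (index_in_previous or None, current_estimate).
    """
    unused = set(range(len(previous)))
    pairs = []
    for estimate in current:
        best = min(unused, key=lambda i: abs(previous[i] - estimate), default=None)
        if best is not None and abs(previous[best] - estimate) <= tolerance:
            unused.remove(best)
            pairs.append((best, estimate))
        else:
            pairs.append((None, estimate))  # treated as a newly seen train
    return pairs

previous = [560, 575, 590]   # 09:20, 09:35, 09:50
current = [561, 575, 605]    # one minute later; 10:05 is a new train
print(match_estimates(previous, current))
# [(0, 561), (1, 575), (None, 605)]
```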


## Problems encountered

In addition to the sampling problem already mentioned, we ran into many other problems.

The main one is that the data is not perfect and that we do not have the rules governing how it is produced. Not having these rules (or at least not having understood some basics from the start) probably hurt us, and if we were to start over we would do things differently.

Indeed, many of the algorithms below rest on the idea that the initial data was noisy. We are no longer sure of this. While developing the solution presented here, we gained a finer understanding of the data and its logic. So even if the data remains slightly noisy, we think that some of our algorithms actually add bias by distorting it, if only slightly.

Among the difficulties we encountered, for a given station:
    1. Some trains seem to disappear and then reappear a few minutes later.
    2. Handling train delays is problematic. Our algorithms track trains by minimizing the discontinuity between two consecutive estimates of the arrival time (from one minute to the next), so when estimates cross, they cannot tell the trains apart. A less local check would probably be needed, looking at how the estimate evolves and favouring continuity of its "derivative", or rather of its variation.
    3. Inconsistency between the number of trains displayed, the number of estimates and the number of messages. Sometimes an estimate is missing.
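
The "less local" check suggested in item 2 could look like the following sketch, which was not implemented in this project. It assumes that the last two estimates of a tracked train are available (in minutes since midnight) and picks the candidate closest to their linear extrapolation; the function name is hypothetical.

```python
def pick_by_trend(history, candidates):
    """Pick the candidate closest to the linear extrapolation of `history`."""
    predicted = history[-1] + (history[-1] - history[-2])
    return min(candidates, key=lambda c: abs(c - predicted))

# A train whose estimate has been slipping by 2 minutes per request.
history = [600, 602, 604]
print(pick_by_trend(history, [605, 606]))  # 606
```

A purely local rule (closest to the last value, 604) would pick 605 here, whereas following the variation picks 606.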

## Implementation

Our implementation is split into two stages:
    1. Preprocessing
    2. Processing
The result of the processing is then written to a new .csv file with the format: station, direction, first estimate of the arrival time, $\Delta T$ between the time at which the first estimate was made and that first estimated arrival time, whether the train was flagged as late (whether the corresponding message appeared), and finally the arrival time.

The preprocessing consists of several algorithms which, in the end, turn each row into a vector of integers only.

The processing relies on doubly linked lists; the initial idea is to use them as FIFO queues while keeping more flexibility for pathological cases.
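
To illustrate the idea, the sketch below uses Python's `collections.deque` as a stand-in for the doubly linked lists of the processing stage. The dictionary keys and the values are hypothetical.

```python
from collections import deque

# One queue of tracked trains per (station, direction); oldest train first.
tracked = deque()
tracked.append({"first_estimate": 925, "last_estimate": 925})
tracked.append({"first_estimate": 930, "last_estimate": 931})
tracked.append({"first_estimate": 939, "last_estimate": 939})

# Normal case: the train at the head of the queue arrives first (FIFO).
arrived = tracked.popleft()

# Pathological case: a train that is not at the head disappears,
# so it has to be removed from the middle of the structure.
del tracked[1]

remaining = [t["first_estimate"] for t in tracked]
print(arrived["first_estimate"], remaining)  # 925 [930]
```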


# Implementation
## Reading a .csv file
### Conversion to a numpy array

In [1]:
# libraries
import pandas as pd # to read csv files
import numpy as np
import time

The following function prepends a line of delimiters so that every row is read with exactly the same number of columns, which allows $\textit{pandas}$ to read the file.

In [2]:
import io

# https://stackoverflow.com/questions/52861571/pandas-read-csv-load-data-with-irregular-rows
# to add delimiter to have exactly the same numbers of fields for each row
def add_delimiters(fpath, delimiter=','):

    s_data = ''
    max_num_delimiters = 0

    with open(fpath, 'r', encoding = "utf-8") as f:
        for line in f:
            s_data += line
            delimiter_count = line.count(delimiter)
            if delimiter_count > max_num_delimiters:
                max_num_delimiters = delimiter_count

    s_delimiters = delimiter * max_num_delimiters + '\n'

    return io.StringIO(s_delimiters + s_data)

In [3]:
import os

base_path ="./DonneesRATP_Projet2020_Jeux_1/Donnees_RerB_6/LineInfoRB"
extension = ".csv"
files = "1234"
path = base_path + files[0] + extension

raw_data_1 = pd.read_csv(add_delimiters(path)).to_numpy() # to convert it to numpy 
raw_data_2 = pd.read_csv(add_delimiters(base_path + files[1] + extension)).to_numpy() 
raw_data_3 = pd.read_csv(add_delimiters(base_path + files[2] + extension)).to_numpy() 
raw_data_4 = pd.read_csv(add_delimiters(base_path + files[3] + extension)).to_numpy() 

# for testing functions 
raw_data_testing = pd.read_csv(add_delimiters(path)).to_numpy() # to convert it to numpy 




### Basic characteristics of the files: number of stations, number of requests
Now that the files have been read and converted to numpy arrays, we determine the number of stations and of requests per file.

In [4]:
def get_data_nbs(raw_data):
    nb_stations = 0
    station1 = raw_data[0][0]
    for k in range(1,len(raw_data)):
        if(raw_data[k][0]==station1):
            nb_stations = k
            break
    nb_requests = (int) (len(raw_data)/nb_stations)
    return (nb_stations, nb_requests)

def print_fichier(raw_data, name):
    nb_stations, nb_requests = get_data_nbs(raw_data)
    print(" ")
    print(name + "         nb de stations : " + str(nb_stations) + " ; nb de requêtes (par station) : " + str(nb_requests))
    print("Date première requête : " + str(raw_data[0][1]) + " ; date dernière requête  : " + str(raw_data[-1][1])) 
    
print_fichier(raw_data_1, "Fichier 1") # for the first file, that last one should not be taken into account. Because it's one hour after the previous one.
print_fichier(raw_data_2, "Fichier 2") 
print_fichier(raw_data_3, "Fichier 3")
print_fichier(raw_data_4, "Fichier 4")

# for testing the next functions
nb_stations, nb_requests = get_data_nbs(raw_data_testing)
offset = 7

 
Fichier 1         nb de stations : 47 ; nb de requêtes (par station) : 285
Date première requête : 201805280917 ; date dernière requête  : 201805281514
 
Fichier 2         nb de stations : 47 ; nb de requêtes (par station) : 585
Date première requête : 201805281514 ; date dernière requête  : 201805290105
 
Fichier 3         nb de stations : 47 ; nb de requêtes (par station) : 227
Date première requête : 201805290641 ; date dernière requête  : 201805291142
 
Fichier 4         nb de stations : 47 ; nb de requêtes (par station) : 54
Date première requête : 201805291143 ; date dernière requête  : 201805291236
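
The counting logic relies on the rows being grouped by request, with the stations always in the same order. The sketch below replays it on toy data (made-up station names and dates) and shows the resulting indexing: the row for request `k` and station `j` is at position `nb_stations * k + j`. The function name and the fallback for a file holding a single request are our additions for the illustration.

```python
def count_stations_and_requests(rows):
    """Variant of the counting logic on a list of (station, date) tuples.

    Assumes rows are grouped by request, stations always in the same order.
    """
    first_station = rows[0][0]
    nb_stations = len(rows)  # fallback when the file holds a single request
    for k in range(1, len(rows)):
        if rows[k][0] == first_station:
            nb_stations = k
            break
    return nb_stations, len(rows) // nb_stations

rows = [(s, d) for d in (917, 918, 919) for s in ("A", "B")]
nb_st, nb_req = count_stations_and_requests(rows)
k, j = 2, 1  # third request, second station
print(nb_st, nb_req, rows[nb_st * k + j])  # 2 3 ('B', 919)
```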


## Preprocessing
The final objective is to produce the pair (time of first estimate, arrival time) for every train. However, we have no id for the trains, so we need to follow them along their journey using the estimated times and the messages. The data must first be formatted in a way that eases the later processing phase. In addition, some functions are there to correct as many of the errors observed in the data as possible.

In order:
 1. Removal of the "NaN" values in the numpy arrays (caused by the superfluous delimiters that were added).
 2. Replacement of the messages by more formal ones (a formal code that makes the messages easier to analyse).
 3. Creation of a new data set containing only integers, using one dictionary for the codes and one for the stations.
 4. Sorting of the data in increasing order to make train tracking easier.
 5. Addition of the trains that do not appear but should (by continuity with the previous function).

### Removing the NaN values

In [5]:
# deleting "nan"

def delete_NaN(requests):
    new_data = []
    for index in range(len(requests)):
        l = len(requests[index])
        L = []
        nb_tot_trains = requests[index][2] # already an int

        for k in range(offset):
            L.append(requests[index][k])

        i = 0
        while(requests[index][offset+i] == requests[index][offset+i] and type(requests[index][offset+i]) == int):        
            i+=1

        for k in range(offset + i, l):
            if(requests[index][k] == requests[index][k]):
                L.append(requests[index][k])
            else:
                break
        new_data.append(np.array(L))
    return new_data

print("Before : " + str(raw_data_testing[0]))
raw_data_testing_NaN = delete_NaN(raw_data_testing)
print("")
print("After : " + str(raw_data_testing_NaN[0]))


Before : ['Saint Remy les Chevreuse' 201805280917 12 'A' 6 'R' 6 201805280918.0
 201805280934.0 201805280949.0 201805281004.0 201805281019.0
 201805281034.0 201805280927.0 201805280933.0 201805280932.0
 201805280941.0 201805280917.0 '201805281006' '09:18 Depart Voie 1'
 '09:34 Depart Voie 2' '09:49 Depart Voie 1' '10:04 Depart Voie 2'
 '10:19 Depart Voie 1' '10:34 Depart Voie 2' 'Train terminus V.2'
 'Train terminus V.1' 'Sans voyageurs V.3' 'Train terminus V.2'
 'Train terminus V.2' 'Train terminus V.1' nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]

After : ['Saint Remy les Chevreuse' '201805280917' '12' 'A' '6' 'R' '6'
 '201805280918.0' '201805280934.0' '201805280949.0' '201805281004.0'
 '201805281019.0' '201805281034.0' '201805280927.0' '201805280933.0'
 '201805280932.0' '201805280941.0' '201805280917.0' '201805281006'
 '09:18 Depart Voie 1' '09:34 Depart Voie 2' '09:49 Depart Voie 1'
 '10:04 Depart Voie 2' '10:19 Depart Voie 1' '1
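
The test `requests[index][k] == requests[index][k]` used above relies on the fact that NaN is the only value that is not equal to itself. A minimal, self-contained sketch of the same trick on a toy row (made-up values, hypothetical function name):

```python
import math

def strip_trailing_nan(row):
    """Keep the values of `row` up to (not including) the first NaN."""
    cleaned = []
    for value in row:
        if value != value:  # only NaN is different from itself
            break
        cleaned.append(value)
    return cleaned

row = ["Lozere", 201805280917, 12, "09:33", math.nan, math.nan]
print(strip_trailing_nan(row))  # ['Lozere', 201805280917, 12, '09:33']
```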

### Formalizing the messages

Messages are formalized case by case. In practice, we first added the most obvious cases, while printing the lines where a message was replaced by "NA" so as to add, step by step, the cases we had forgotten. At present, the only remaining "NA" are the messages showing "Voie .." (platform), which add no value to our data since we work at the level of stations rather than platforms.

In [6]:
# second part of the preprocessing part
offset = 7 # this is the offset to say that the first 7 parameters are not estimations.
def formalizing_messages(requests,nb_requests,nb_stations): # means that the trains info starts at number 6, and then we need to add the total number of trains.
    # replace french by simpler english voca
    for i in range(len(requests)):
        total_trains = requests[i][2]
        for j in range(offset , len(requests[i])):

            if requests[i][j][:16] == "Train sans arret" :
                requests[i][j] = "NO STOP"
            elif requests[i][j][:18] == "Train a l'approche" :
                requests[i][j] = "INCOMING"
            elif requests[i][j][:12] == "A l'approche" :
                requests[i][j] = "INCOMING"
            elif requests[i][j][:12] == "Train a quai" : 
                requests[i][j] = "ARRIVED"
            elif requests[i][j][6:12] == "Depart":
                requests[i][j] = "DEPARTURE SOON"
            elif requests[i][j][2] == ":" :
                requests[i][j] = int(requests[i][j][:2]+requests[i][j][3:5])
            elif requests[i][j][:10] == "Sans arret":
                requests[i][j] = "NO STOP"
            elif requests[i][j][:14] == "Train terminus": 
                requests[i][j] = "TERMINUS" # means that the train stop here
            elif requests[i][j][:14] == "Sans voyageurs":
                requests[i][j] = "NO PASSENGERS"
            elif requests[i][j][:13] == "Train retarde":
                requests[i][j] = "LATE"
            elif requests[i][j][:9] == "Stationne": # follow a "ARRIVED"
                requests[i][j] = "WAITING"
            elif requests[i][j][:6] == "Depart":
                requests[i][j] = "DEPARTURE"
            elif requests[i][j][:8] == requests[i][1][:8]: # we check if it's a time (but of type str)
                continue # if it is we simply pass instead of writing "NA" -> conversion to int is effectued later
            else : 
                #print(requests[i][j])
                requests[i][j] = "NA" # generally speaking the message is juste "Voie 2" something. We could pick up the estimated time in the data we deleted but...
            # we may have to deal with this later on, but for now it's fine, it's before "ARRIVED"
            # and it's never the first time a we have data on the train (a priori)

    return requests

# testing : 
print(raw_data_testing_NaN[47])
raw_data_testing_formalizing = formalizing_messages(raw_data_testing_NaN, nb_requests, nb_stations)
print(raw_data_testing_formalizing[47])

['Saint Remy les Chevreuse' '201805280918' '12' 'A' '6' 'R' '6'
 '201805280918.0' '201805280934.0' '201805280949.0' '201805281004.0'
 '201805281019.0' '201805281034.0' '201805280927.0' '201805280934.0'
 '201805280931.0' '201805280941.0' '201805280918.0' '201805281006'
 '09:18 Depart Voie 1' '09:34 Depart Voie 2' '09:49 Depart Voie 1'
 '10:04 Depart Voie 2' '10:19 Depart Voie 1' '10:34 Depart Voie 2'
 'Train terminus V.2' 'Train terminus V.1' 'Sans voyageurs V.3'
 'Train terminus V.2' 'Train terminus V.2' 'Train terminus V.1']
['Saint Remy les Chevreuse' '201805280918' '12' 'A' '6' 'R' '6'
 '201805280918.0' '201805280934.0' '201805280949.0' '201805281004.0'
 '201805281019.0' '201805281034.0' '201805280927.0' '201805280934.0'
 '201805280931.0' '201805280941.0' '201805280918.0' '201805281006'
 'DEPARTURE SOON' 'DEPARTURE SOON' 'DEPARTURE SOON' 'DEPARTURE SOON'
 'DEPARTURE SOON' 'DEPARTURE SOON' 'TERMINUS' 'TERMINUS' 'NO PASSENGERS'
 'TERMINUS' 'TERMINUS' 'TERMINUS']
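
The same case-by-case mapping can also be written in a table-driven way, which makes it easier to add a forgotten case. The sketch below is an alternative formulation rather than the code used in the rest of the notebook; it covers the message prefixes listed above but leaves out the check for full-date strings, and the names `PREFIXES` and `formalize` are hypothetical.

```python
# (prefix, label) pairs, checked in order; labels match those used above.
PREFIXES = [
    ("Train sans arret", "NO STOP"),
    ("Sans arret", "NO STOP"),
    ("Train a l'approche", "INCOMING"),
    ("A l'approche", "INCOMING"),
    ("Train a quai", "ARRIVED"),
    ("Train terminus", "TERMINUS"),
    ("Sans voyageurs", "NO PASSENGERS"),
    ("Train retarde", "LATE"),
    ("Stationne", "WAITING"),
    ("Depart", "DEPARTURE"),
]

def formalize(message):
    """Map one raw message to its formal label (sketch, subset of the rules)."""
    if len(message) >= 12 and message[2] == ":" and message[6:12] == "Depart":
        return "DEPARTURE SOON"                  # e.g. "09:18 Depart Voie 1"
    if len(message) >= 5 and message[2] == ":":
        return int(message[:2] + message[3:5])   # "09:33" -> 933
    for prefix, label in PREFIXES:
        if message.startswith(prefix):
            return label
    return "NA"

samples = ["09:18 Depart Voie 1", "09:33", "Train terminus V.2", "Voie 2"]
print([formalize(m) for m in samples])
# ['DEPARTURE SOON', 933, 'TERMINUS', 'NA']
```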


### Converting the data to integers

We start by defining the code used for the formalized messages, as well as a dictionary that replaces station names with integers.

In [7]:
code = dict()
code["NO STOP"] = -10
code["WAITING"] = -1
code["ARRIVED"] = -2
code["INCOMING"] = -3
code["LATE"] = -4
code["NO PASSENGERS"] = -100
code["TERMINUS"] = -5
code["NA"] = -1000
code["DEPARTURE"] = -7
code["DEPARTURE SOON"] = -6
print(code)
print("")
dico_station = dict()
for k in range(nb_stations):
    dico_station[raw_data_1[k][0]] = k
print(dico_station)

{'NO STOP': -10, 'WAITING': -1, 'ARRIVED': -2, 'INCOMING': -3, 'LATE': -4, 'NO PASSENGERS': -100, 'TERMINUS': -5, 'NA': -1000, 'DEPARTURE': -7, 'DEPARTURE SOON': -6}

{'Saint Remy les Chevreuse': 0, 'Courcelle Sur Yvette': 1, 'Gif Sur Yvette': 2, 'La Hacquiniere': 3, 'Bures Sur Yvette': 4, 'Orsay Ville': 5, 'Le Guichet': 6, 'Lozere': 7, 'Palaiseau Villebon': 8, 'Palaiseau': 9, 'Massy Palaiseau': 10, 'Massy Verrieres': 11, 'Les Baconnets': 12, 'Fontaine Michalon': 13, 'Antony': 14, 'La Croix de Berny': 15, 'Parc de Sceaux': 16, 'Bourg la Reine': 17, 'Bagneux': 18, 'Arcueil Cachan': 19, 'Laplace': 20, 'Gentilly': 21, 'Cite Universitaire': 22, 'Denfert Rochereau': 23, 'Port Royal': 24, 'Luxembourg': 25, 'Saint Michel': 26, 'Chatelet': 27, 'Gare du Nord': 28, 'La Plaine-Stade de France': 29, 'Aubervilliers': 30, 'Le Bourget': 31, 'Drancy': 32, 'Blanc-Mesnil': 33, 'Aulnay Sous Bois': 34, 'Sevran Beaudottes': 35, 'Villepinte': 36, 'Parc des Expositions': 37, 'Aeroport Ch.De Gaulle 1': 38, 'A

In [8]:

# requires dico_station and the code dictionary to be defined beforehand.
def data_to_Int(data,nb_requests,nb_stations):
    preprocesseddata = []
    # np.float was removed from recent NumPy versions; the builtin float is equivalent
    date = int(data[0][1].astype(float)/10000)*10000 # initial date - that we are going to delete to only keep the hours + minutes
    
    #print("Date : " + str(date))
    for k in range(len(data)):
        date = int(data[k][1].astype(float)/10000)*10000 # the 2nd file spreads over the 28th and the 29th
        L = []
        l = len(data[k])
        L.append(dico_station[data[k][0]])
        L.append(int(data[k][1].astype(float))-date)
        L.append(int(data[k][2].astype(float)))
        L.append(1)
        L.append(int(data[k][4].astype(float)))
        L.append(-1)
        L.append(int(data[k][6].astype(float)))
        
        count = 0
        for j in range(offset,l):
            try :
                if data[k][j][:8] == data[k][1][:8]: # we check if it's a time
                    L.append(int(data[k][j].astype(float))-date)
                else :
                    if(count == 0):
                        L.append(0.0) # we add a zero to separate the estimations from the messages
                        count = 1
                    L.append(int(data[k][j].astype(float)))
            except ValueError:
                L.append(code[data[k][j]])
        preprocesseddata.append(L)
    return preprocesseddata

#  testing : 
raw_data_testing_to_Int = data_to_Int(raw_data_testing_formalizing,nb_requests,nb_stations)
print("Note the 0.0 between the messages and the estimations.")
print(raw_data_testing_to_Int[0])

Note the 0.0 between the messages and the estimations.
[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 927, 933, 932, 941, 917, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]
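
Once a row is in this integer form, the `0.0` sentinel is what allows the estimates to be separated from the messages. A minimal sketch, using the row printed above and `offset = 7`; the function name is hypothetical. Note that an estimate of exactly 00:00 would be encoded as `0` and would be confused with the sentinel by this sketch.

```python
def split_at_sentinel(row, offset=7):
    """Split an integer row into (header, estimates, messages).

    The 0.0 written by the conversion step separates estimates from messages.
    """
    sentinel = row.index(0.0, offset)
    return row[:offset], row[offset:sentinel], row[sentinel + 1:]

row = [0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034,
       927, 933, 932, 941, 917, 1006, 0.0,
       -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]
header, estimates, messages = split_at_sentinel(row)
print(len(estimates), len(messages))  # 12 12
```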


### Correcting the rows
In some rows, the number of messages and the number of estimates are inconsistent. Often, one estimate is missing compared with the number of messages. In those cases we try to add the missing train: we use the message in its place if the message is a time, otherwise we take the time from the previous request for the same station.

A function for subtracting times was also added. We could have used Python's $\textit{datetime}$ module, but since a good portion of the code had already been written, we preferred simply to add this function.

In [9]:
def substract_times(t1,t2): # t1 = 1939, t2 = 1045 for example, for 10:45 and 19:39
    # returns a result in minutes
    t2_hours = int(t2/100)
    t1_hours = int(t1/100)
    t2_mins = t2 - t2_hours*100
    t1_mins = t1 - t1_hours*100
    
    # these conditions were added afterwards because times after midnight were causing trouble for file 2.
    if(t2_hours==0):
        t2_hours = 24
    elif(t2_hours==1):
        t2_hours = 25
    elif(t2_hours==2):
        t2_hours = 26
    
    if(t1_hours==0):
        t1_hours = 24
    elif(t1_hours==1):
        t1_hours = 25
    elif(t1_hours==2):
        t1_hours = 26
        
    dt = t2_hours - t1_hours
    return dt*60+ t2_mins - t1_mins # minutes

# testing
print(substract_times(2345,223)) # this is for 2h23 - 23h45 -> diff : 2h38 = 158


158
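
For comparison, here is a sketch of what the same subtraction could look like with the `datetime` module. It is not used in the rest of the notebook. It keeps the same convention as above (hours 0, 1 and 2 belong to the following day); the function name and the reference date are arbitrary choices for the example.

```python
from datetime import datetime, timedelta

def minutes_between(t1, t2):
    """Minutes from t1 to t2, both given as HHMM integers (e.g. 2345).

    Assumption: hours 0, 1 and 2 belong to the following day.
    """
    def to_datetime(hhmm):
        hours, minutes = divmod(hhmm, 100)
        moment = datetime(2018, 5, 28, hours, minutes)  # arbitrary reference day
        return moment + timedelta(days=1) if hours < 3 else moment

    return int((to_datetime(t2) - to_datetime(t1)).total_seconds() // 60)

print(minutes_between(2345, 223))   # 158
print(minutes_between(1045, 1939))  # 534
```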


In [10]:
def correcting_line_errors(data, nb_requests,nb_stations): # we should be careful in case there is an arrival too
    # the goal here is to correct abnormal numbers in the estimated time (compared to the message one)
    # does not really work
    local_offset = 1 # because we added a zero previously
    max_acceptable_diff = 2
    new_data = []
    for k in range(nb_requests):
        for j in range(nb_stations):
            current_request = data[nb_stations*k+j]

            qty = current_request[2]
            i = offset
            while(qty > 0 and current_request[i]!=0.0): 
                i+=1
            missing_trains = qty - (i - offset)

            if (missing_trains > 0): # so this is the number of request with issues (i.e. there is a train missing in the estimated time)
               
                shift = 0
                new_current = current_request[:offset]
                if(k>0): # we can not change a train of the first requests because we don't have a previous one
                    previous_request = new_data[nb_stations*(k-1)+j] # we take the new one which is already corrected
                    #print(previous_request)
                  #  assert(len(previous_request) == len(current_request) + missing_trains) # we hope that the previous one is okay in the other direction too, like no missing train
                    shift = 0
                    # we'll only try to deal with the previous one (because the next ones can receive a new train too)
                    kk = offset
                    while(kk < offset + qty - missing_trains): # for all trains in the current one (in the given direction), we check if we can verify it's in the right place
                        ##print(kk, shift)
                        ##print(abs(previous_request[kk + shift]-current_request[kk]))
                        if(shift == 1): # we do that because sometimes there is a train that is arriving and in this case it's normal that there is a =/=
                            new_current.append(current_request[kk])
                            kk+=1
                        else :
                            # we check if there is too big of a difference 
                            if(abs(substract_times(previous_request[kk + shift],current_request[kk]))<max_acceptable_diff):
                                new_current.append(current_request[kk])
                                kk+=1
                            else : # the difference is too big - so we decide which one we are going to add
                                new_current.append(previous_request[kk + shift])
                                shift+=1
                                """
                                if(kk + 1 < offset + qty - missing_trains and substract_times(current_request[kk],current_request[kk+1])>0):
                                    # in this case they may just be inverted 
                                    new_current.append(current_request[kk])
                                    kk+=1 
                                else:
                                    new_current.append(previous_request[kk + shift])
                                    shift+=1
                                """

                for r in range(qty+1): # for now we don't add anymore the ones in the other direction
                    new_current.append(current_request[r + offset + qty - missing_trains])
                #print("new : ")
                #print(new_current)
                new_data.append(new_current)
            else:
                new_data.append(current_request)
    return new_data

# testing 

not_working_line = 10209
print("Missing an estimation apparently ... ")
print(raw_data_testing_to_Int[not_working_line]) 
raw_data_testing_corrected_lines=  correcting_line_errors(raw_data_testing_to_Int, nb_requests,nb_stations)
print("Never mind all good ! ")
print(raw_data_testing_corrected_lines[not_working_line]) 
print("The previous line looked like : ")
print(raw_data_testing_to_Int[not_working_line-47])


Missing an estimation apparently ... 
[10, 1257, 12, 1, 6, -1, 6, 1257, 1300, 1310, 1315, 1325, 1300, 1310, 1314, 1326, 1328, 1339, 0.0, -4, -1000, -6, 1310, -6, 1325, 1300, -5, 1314, -5, 1328, -5]
Never mind all good ! 
[10, 1257, 12, 1, 6, -1, 6, 1257, 1300, 1300, 1310, 1315, 1325, 1300, 1310, 1314, 1326, 1328, 1339, 0.0, -4, -1000, -6, 1310, -6, 1325, 1300, -5, 1314, -5, 1328, -5]
The previous line looked like : 
[10, 1256, 12, 1, 6, -1, 6, 1256, 1301, 1300, 1310, 1315, 1325, 1259, 1311, 1314, 1326, 1328, 1339, 0.0, -4, 1301, -7, 1310, -6, 1325, 1259, -5, 1314, -5, 1328, -5]
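
The detection step of this correction (how many estimates are missing in a row) can be isolated as follows. This is a simplified sketch of the counting done at the start of `correcting_line_errors`, applied to the faulty row printed above; the function name is hypothetical.

```python
def count_missing_estimates(row, offset=7):
    """Number of estimates announced in the header but absent from the row."""
    announced = row[2]
    sentinel = row.index(0.0, offset)  # 0.0 separates estimates and messages
    return announced - (sentinel - offset)

row = [10, 1257, 12, 1, 6, -1, 6,
       1257, 1300, 1310, 1315, 1325, 1300, 1310, 1314, 1326, 1328, 1339, 0.0,
       -4, -1000, -6, 1310, -6, 1325, 1300, -5, 1314, -5, 1328, -5]
missing = count_missing_estimates(row)
print(missing)  # 1
```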


### Sorting the data
The goal of the following algorithm is to sort the estimated arrival times of the trains. It then makes it possible to resolve the spurious "missing" trains, that is, trains that disappear for several minutes and then reappear somewhat at random. We then proceed by interpolation.

However, this algorithm can introduce bias: we do not know the rules that govern the order in which trains are returned by the requests. If the RATP returns its data in the order in which the trains first appeared, then whenever a train is delayed so much that its estimate moves past the arrival time of trains initially scheduled later, sorting will have swapped their estimates.
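
The core of the sorting step, for one direction, amounts to sorting the estimates and applying the same permutation to their messages. The sketch below expresses this with Python's built-in stable sort instead of the in-place swaps used in the notebook; the function name is hypothetical and the values are the return-direction fields of a request for station 10 at 09:17.

```python
def sort_direction(estimates, messages):
    """Sort the estimates of one direction in increasing order and apply the
    same permutation to their messages (stable: ties keep their order)."""
    order = sorted(range(len(estimates)), key=lambda i: estimates[i])
    return [estimates[i] for i in order], [messages[i] for i in order]

estimates = [919, 931, 942, 943, 950, 948]
messages = [919, 931, -5, 943, -5, -1000]
sorted_estimates, sorted_messages = sort_direction(estimates, messages)
print(sorted_estimates)  # [919, 931, 942, 943, 948, 950]
print(sorted_messages)   # [919, 931, -5, 943, -1000, -5]
```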


In [11]:
def order_data(raw_dataset,nb_requests,nb_stations): # as coded here, it modifies the rows of the dataset it is given
    """
    sorting along the first direction, and then along the last direction.
    """
    local_offset = 1

    dataset = raw_dataset
    nb = 0
    new_dataset = []
    for k in range(nb_requests-1):
        for j in range(nb_stations):
            current_request = dataset[nb_stations*k+j]
            qty = current_request[2]
            qty1 = current_request[4]
            qty2 = current_request[6]
            b = False
                # sorting each list
            save_current_request = [current_request[k] for k in range(len(current_request))]
        
            i = offset + qty1 # we start at the end of the estimations
            while(i>offset): # we check if there is a valid one before (to take the previous, if there is not, then no pb)
                for e in range(offset+1, i):
                    current = current_request[e]
                    previous = current_request[e-1]
                    if(previous > current): 
                        # some work
                        b = True
                        current_request[e] = previous
                        current_request[e-1] = current
                        # swapping in the messages section too
                        idx_message = qty + local_offset + e
                        try : 
                            current_request[idx_message],  current_request[idx_message-1] = current_request[idx_message-1],  current_request[idx_message]
                        except IndexError:
                            break
                        nb += 1
                i-=1
                    # we should check if there are multiple previous that are bigger and there are

            #2nd part of the tab
            i = offset + qty # we start at the end of the estimations
            while(i>offset+qty1): # we check if there is a valid one before (to take the previous, if there is not, then no pb)
                for e in range(offset+qty1+1, i):
                    current = current_request[e]
                    previous = current_request[e-1]
                    if(previous > current): 
                        # some work
                        b = True
                        current_request[e] = previous
                        current_request[e- 1] = current
                        # swapping in the messages section too
                        idx_message = qty+local_offset + e
                        try:
                            current_request[idx_message],  current_request[idx_message-1] = current_request[idx_message-1],  current_request[idx_message]
                        except IndexError:
                            break
                        nb += 1
                i-=1
            if(b):
                print("")
                print(save_current_request)
                print(current_request)
            new_dataset.append(current_request)
    print(" ")
    print("Number of lines that were concerned by it: " + str(nb))
    dataset = new_dataset
    return dataset

# test - only the lines that are concerned will be printed ...
raw_data_testing_ordered = order_data(raw_data_testing_corrected_lines, nb_requests,nb_stations)


[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 927, 933, 932, 941, 917, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]
[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 917, 927, 932, 933, 941, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]

[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 950, 948, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -5, -1000]
[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 948, 950, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -1000, -5]

[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 947, 946, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 947, 946]
[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 946, 947, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 946, 947]

[12, 917, 12, 1, 6, -1, 6, 917, 928, 934, 942, 949, 958, 927, 938, 939, 946, 944, 951, 0.0, -3, -10,


[13, 1307, 12, 1, 6, -1, 6, 1308, 1307, 1316, 1320, 1328, 1335, 1312, 1320, 1328, 1333, 1341, 1356, 0.0, 1308, -10, -10, 1320, -10, 1335, -10, 1320, -10, 1333, -10, -10]
[13, 1307, 12, 1, 6, -1, 6, 1307, 1308, 1316, 1320, 1328, 1335, 1312, 1320, 1328, 1333, 1341, 1356, 0.0, -10, 1308, -10, 1320, -10, 1335, -10, 1320, -10, 1333, -10, -10]

[14, 1307, 12, 1, 6, -1, 6, 1310, 1307, 1317, 1322, 1330, 1337, 1310, 1318, 1327, 1331, 1339, 1354, 0.0, 1310, -4, 1317, 1322, 1330, 1337, 1310, 1318, 1327, 1331, 1339, 1354]
[14, 1307, 12, 1, 6, -1, 6, 1307, 1310, 1317, 1322, 1330, 1337, 1310, 1318, 1327, 1331, 1339, 1354, 0.0, -4, 1310, 1317, 1322, 1330, 1337, 1310, 1318, 1327, 1331, 1339, 1354]

[15, 1307, 12, 1, 6, -1, 6, 1307, 1311, 1307, 1319, 1324, 1331, 1308, 1316, 1325, 1330, 1337, 1352, 0.0, -2, 1311, -4, 1319, 1324, 1331, 1308, 1316, 1325, 1330, 1337, 1352]
[15, 1307, 12, 1, 6, -1, 6, 1307, 1307, 1311, 1319, 1324, 1331, 1308, 1316, 1325, 1330, 1337, 1352, 0.0, -2, -4, 1311, 1319, 1324, 133

[24, 1319, 12, 1, 6, -1, 6, 1320, 1325, 1328, 1335, 1334, 1340, 1322, 1325, 1334, 1340, 1343, 1349, 0.0, 1320, 1325, 1328, 1335, 1334, 1340, 1322, 1325, 1334, 1340, 1343, 1349]
[24, 1319, 12, 1, 6, -1, 6, 1320, 1325, 1328, 1334, 1335, 1340, 1322, 1325, 1334, 1340, 1343, 1349, 0.0, 1320, 1325, 1328, 1334, 1335, 1340, 1322, 1325, 1334, 1340, 1343, 1349]

[25, 1319, 12, 1, 6, -1, 6, 1320, 1322, 1326, 1330, 1337, 1336, 1320, 1324, 1332, 1338, 1341, 1347, 0.0, -3, 1322, 1326, 1330, 1337, 1336, 1320, 1324, 1332, 1338, 1341, 1347]
[25, 1319, 12, 1, 6, -1, 6, 1320, 1322, 1326, 1330, 1336, 1337, 1320, 1324, 1332, 1338, 1341, 1347, 0.0, -3, 1322, 1326, 1330, 1336, 1337, 1320, 1324, 1332, 1338, 1341, 1347]

[26, 1319, 12, 1, 6, -1, 6, 1322, 1324, 1328, 1332, 1339, 1338, 1319, 1322, 1330, 1336, 1339, 1345, 0.0, 1322, 1324, 1328, 1332, 1339, 1338, -2, 1322, 1330, 1336, 1339, 1345]
[26, 1319, 12, 1, 6, -1, 6, 1322, 1324, 1328, 1332, 1338, 1339, 1319, 1322, 1330, 1336, 1339, 1345, 0.0, 1322, 1324, 13

[21, 1330, 12, 1, 6, -1, 6, 1331, 1330, 1338, 1344, 1342, 1350, 1331, 1340, 1346, 1349, 1355, 1401, 0.0, -10, 1330, 1338, -10, 1342, 1350, -10, 1340, -10, 1349, 1355, -10]
[21, 1330, 12, 1, 6, -1, 6, 1330, 1331, 1338, 1342, 1344, 1350, 1331, 1340, 1346, 1349, 1355, 1401, 0.0, 1330, -10, 1338, 1342, -10, 1350, -10, 1340, -10, 1349, 1355, -10]

[11, 1331, 12, 1, 6, -1, 6, 1333, 1332, 1343, 1347, 1357, 1402, 1331, 1340, 1343, 1357, 1407, 1412, 0.0, -10, 1332, -10, 1347, -10, 1402, -10, 1340, 1343, -10, 1407, -10]
[11, 1331, 12, 1, 6, -1, 6, 1332, 1333, 1343, 1347, 1357, 1402, 1331, 1340, 1343, 1357, 1407, 1412, 0.0, 1332, -10, -10, 1347, -10, 1402, -10, 1340, 1343, -10, 1407, -10]

[18, 1331, 12, 1, 6, -1, 6, 1333, 1342, 1337, 1344, 1352, 1353, 1333, 1333, 1345, 1348, 1354, 1400, 0.0, 1333, -10, 1337, 1344, -10, 1353, 1333, -10, 1345, -10, 1354, 1400]
[18, 1331, 12, 1, 6, -1, 6, 1333, 1337, 1342, 1344, 1352, 1353, 1333, 1333, 1345, 1348, 1354, 1400, 0.0, 1333, 1337, -10, 1344, -10, 1353, 


[27, 1343, 11, 1, 5, -1, 6, 1345, 1350, 1356, 1355, 1406, 1344, 1343, 1349, 1352, 1358, 1404, 0.0, 1345, 1350, 1356, 1355, 1406, -3, 1343, 1349, 1352, 1358, 1404]
[27, 1343, 11, 1, 5, -1, 6, 1345, 1350, 1355, 1356, 1406, 1343, 1344, 1349, 1352, 1358, 1404, 0.0, 1345, 1350, 1355, 1356, 1406, 1343, -3, 1349, 1352, 1358, 1404]

[10, 1344, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1348, 1405, 1416, 1413, 1424, 1428, 0.0, -6, -1000, 1355, -6, 1410, -6, 1348, 1405, -5, -1000, -5, 1428]
[10, 1344, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1348, 1405, 1413, 1416, 1424, 1428, 0.0, -6, -1000, 1355, -6, 1410, -6, 1348, 1405, -1000, -5, -5, 1428]

[11, 1344, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1404, 1413, 1412, 1422, 1427, 0.0, 1347, -10, -10, 1402, -10, 1417, 1346, -10, 1413, -10, 1422, -10]
[11, 1344, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1404, 1412, 1413, 1422, 1427, 0.0, 1347, -10, -10, 1402, -10, 1417, 1346, -10, -10, 1413, 14

### Interpolating missing trains
We re-insert trains that disappear from one request and then reappear in a later one. To do so, we go through the requests in chronological order and compare each request with the (already corrected) request obtained one minute earlier for the same station: if a train from the previous request is missing, it is added back.

This algorithm assumes that the number of messages matches the number of estimates and that the trains are sorted.
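To make the record layout and the two preconditions above concrete, here is a minimal sketch. The names `split_request` and `check_request` are hypothetical (they are not part of the notebook), and the layout (7 header fields, then the estimates, one `0.0` separator, then one message per estimate) is inferred from the records printed in this notebook. The order check compares plain integers, so it deliberately ignores estimates that wrap around midnight.

```python
OFFSET = 7  # number of header fields before the first estimate (assumed layout)

def split_request(record):
    """Split a request into (header, estimates_1, estimates_2, messages)."""
    nb_1, nb_2 = record[4], record[6]
    header = record[:OFFSET]
    estimates = record[OFFSET:OFFSET + nb_1 + nb_2]
    # one separator value (0.0) sits between the estimates and the messages
    messages = record[OFFSET + nb_1 + nb_2 + 1:]
    return header, estimates[:nb_1], estimates[nb_1:], messages

def check_request(record):
    """True if messages match estimates in number and each direction is sorted."""
    header, est_1, est_2, messages = split_request(record)
    same_count = len(messages) == len(est_1) + len(est_2) == header[2]
    is_sorted = est_1 == sorted(est_1) and est_2 == sorted(est_2)
    return same_count and is_sorted

# record taken from the outputs of this notebook (station 9, request at 1257)
example = [9, 1257, 11, 1, 5, -1, 6,
           1308, 1323, 1338, 1353, 1408,
           1302, 1317, 1331, 1346, 1401, 1416,
           0.0,
           1308, 1323, 1338, 1353, 1408,
           1302, 1317, 1331, 1346, 1401, 1416]
print(check_request(example))  # True
```

A request that fails this check would need the earlier correction and ordering steps before being passed to the interpolation.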

In [12]:

def interpolate_missing_trains(dataset,nb_requests,nb_stations):
    new_dataset = [dataset[k] for k in range(nb_stations)] # init new dataset
    max_diff = 3 # max gap tolerated between the previous and the current estimate of the same train (3 should be enough)
    local_offset = 1
    offset = 7
    min_diff_ = 10
    for k in range(1, nb_requests-1): # will not work for the first request but it's ok!
                for j in range(nb_stations):
                    current_request = dataset[nb_stations*k+j]
                    qty = dataset[nb_stations*k+j][2]
                    qty_previous = new_dataset[nb_stations*(k-1)+j][2]  # the previous request is handled first, so it should be fine
                    i = offset + qty + local_offset
                    l = len(current_request)

                    if(qty_previous > qty):
                        previous_request = new_dataset[nb_stations*(k-1)+j] # new_dataset
                        
                        print("current request :")
                        print(current_request)
                        print("previous request :")
                        print(previous_request)
                        
                        # preparing the new "current"
                        new_current = current_request[:offset]
                        new_current[2] = previous_request[2]
                        new_current[4] = previous_request[4]
                        new_current[6] = previous_request[6]
                        messages_new_current = []
                        # trains in each direction
                        qty1 = dataset[nb_stations*k+j][4]
                        qty1_previous = new_dataset[nb_stations*(k-1)+j][4]
                        qty2 = dataset[nb_stations*k+j][6]
                        qty2_previous = new_dataset[nb_stations*(k-1)+j][6]
                        
                        current_time = current_request[1] # the time of the request
                        
                        if(qty1 < qty1_previous): # checking for the first direction
                            print("PROBLEM with DIRECTION 1")
                            tr = offset
                            shift = 0
                            nb_previously_arrived_trains = 0
                            while(tr < offset + qty1 and tr+shift+nb_previously_arrived_trains < offset + qty1_previous):
                                # sometimes the last train is problematic (weirdly enough), so we have to add the remaining ...
                                # qty_previous is for the previous request, qty is for the current one !!!
                                ##print("went through")
                                ##print("shift : " + str(shift))
                                ##print("arr. trains : " + str(nb_previously_arrived_trains))
                                ##print(tr)
                                try :
                                    q = tr+shift+nb_previously_arrived_trains + qty_previous + local_offset # index of the associated message in the previous request
                                    A = previous_request[q] == code["ARRIVED"] and current_request[tr + qty + local_offset] != code["ARRIVED"]# if the previous request saw an arrived train and not the actual one
                                   # B = previous_request[q] == code["INCOMING"] and current_request[tr + qty + local_offset] != code["INCOMING"]
                                   # C = previous_request[q] == code["INCOMING"]  and current_request[tr + qty + local_offset] != code["ARRIVED"]
                                    D = previous_request[q] == code["ARRIVED"] and current_request[tr + qty + local_offset] != code["WAITING"] # if the previous request saw an arrived train and not the actual one
                                    E = previous_request[q] == code["WAITING"] and current_request[tr + qty + local_offset] != code["WAITING"]
                                # B means that the train arrived and left the station between two consecutive requests (sampling-frequency problem)
                                # or (B and C)
                                except IndexError:
                                    break
                                if((A and D)  or E): # basically this means : if there has been an arrival in the previous request then we have to shift one more to the right in the previous estimations times
                                    nb_previously_arrived_trains+=1
                        
                                if(abs(substract_times(previous_request[tr+shift+nb_previously_arrived_trains],current_request[tr])) > max_diff):
                                    # then there is a shift here (due to a missing train)
                                    try :
                                        q = tr+shift+nb_previously_arrived_trains + qty_previous
                                        # in this case, we add the previous estimation time to the new current

                                        new_current.append(previous_request[tr+shift+nb_previously_arrived_trains])
                                        messages_new_current.append(previous_request[q+local_offset])
                                        ##print("Too different. Added " + str(previous_request[tr+shift+nb_previously_arrived_trains]))
                                        shift += 1
                                    except IndexError:
                                        break
                                else: 
                                    # then it's the same, np
                                    try:
                                        new_current.append(current_request[tr])         
                                        messages_new_current.append(current_request[tr + qty + local_offset]) 
                                        ##print("No difference. Added " + str(current_request[tr]))
                                        tr+=1
                                    except IndexError:
                                        break
                            # maybe there are still some trains left that we could not add ... because too many trains came in
                            if(shift == 0): 
                                # then problem - it means that we added nothing
                                for trr in range(tr, offset + qty1_previous): # we add the next ones in previous 
                                    q = trr+nb_previously_arrived_trains + qty_previous
                                    try:
                                        new_current.append(previous_request[trr+nb_previously_arrived_trains])
                                        messages_new_current.append(previous_request[q+local_offset])
                                    except IndexError:
                                        break
                             # old :  to be sure we are not adding trains that have nothing to do here (error in the xtrain_2 where 2 trains should have not been added - or not really at least)
                            
                            for trr in range(tr, offset + qty1):
                                try:
                                    new_current.append(current_request[trr])
                                    messages_new_current.append(current_request[trr + qty + local_offset]) 
                                except IndexError:
                                    break

                        else : # if the issue is not with quantity one, then we still have to add it..
                            for tr in range(offset, offset + qty1):
                                try :
                                    new_current.append(current_request[tr])
                                    messages_new_current.append(current_request[tr+qty+local_offset])
                                except IndexError:
                                    break
            
                        if(qty2 < qty2_previous): # checking for the second direction
                            print("PROBLEM with DIRECTION 2")
                            tr = offset + qty1 # this is the only thing that changes compared to before
                            shift = 0
                            nb_previously_arrived_trains = 0
                            
                            while(tr < offset + qty and tr+shift+nb_previously_arrived_trains < offset + qty_previous):
                                q = tr+shift+nb_previously_arrived_trains + qty_previous + local_offset # index of the associated message in the previous request
                                try:
                                    A = previous_request[q] == code["ARRIVED"] and current_request[tr + qty + local_offset] != code["ARRIVED"]# if the previous request saw an arrived train and not the actual one
                                  # B = previous_request[q] == code["INCOMING"] and current_request[tr + qty + local_offset] != code["INCOMING"]
                                  #  C = previous_request[q] == code["INCOMING"]  and current_request[tr + qty + local_offset] != code["ARRIVED"]
                                    D = previous_request[q] == code["ARRIVED"] and current_request[tr + qty + local_offset] != code["WAITING"] # if the previous request saw an arrived train and not the actual one
                                    E = previous_request[q] == code["WAITING"] and current_request[tr + qty + local_offset] != code["WAITING"]
                                # B means that the train arrived and left the station between two consecutive requests (sampling-frequency problem)
                                except IndexError: 
                                    break
                                if((A and D)  or E): # basically this means : if there has been an arrival in the previous request then we have to shift one more to the right in the previous estimations times
                                    nb_previously_arrived_trains+=1

                                if(abs(substract_times(previous_request[tr+shift+nb_previously_arrived_trains],current_request[tr])) > max_diff):
                                    # then there is a shift here
                                    try :
                                        q = tr+shift+nb_previously_arrived_trains + qty_previous
                                        # in this case, we add the previous estimation time to the new current
                                        new_current.append(previous_request[tr+shift+nb_previously_arrived_trains])
                                        messages_new_current.append(previous_request[q + local_offset])
                                        # we update shift at the end
                                        shift += 1
                                    except IndexError:
                                        break
                                else: 
                                    # then it's the same, np
                                    try :
                                        new_current.append(current_request[tr])
                                        messages_new_current.append(current_request[tr + qty + local_offset]) 
                                        tr+=1
                                    except IndexError:
                                        break
                                
                                if(shift == 0): 
                                # then problem - it means that we added nothing
                                    for trr in range(tr, offset + qty_previous): # we add the next ones in previous 
                                        q = trr+nb_previously_arrived_trains + qty_previous
                                        try :
                                            new_current.append(previous_request[trr+nb_previously_arrived_trains])
                                            messages_new_current.append(previous_request[q+local_offset])
                                        except IndexError :
                                            break
                      
                                for trr in range(tr, offset + qty2):
                                    try:
                                        new_current.append(current_request[trr])
                                        messages_new_current.append(current_request[trr + qty + local_offset]) 
                                    except IndexError:
                                        break
                        else : # if the issue is not with quantity two, then we still have to add it..
                            for tr in range(offset+qty1, offset + qty): # qty = qty1 + qty2
                                try :
                                    new_current.append(current_request[tr])
                                    messages_new_current.append(current_request[tr+qty+local_offset])
                                except IndexError:
                                    break
                        new_current.append(0.0) # don't forget this one 
                        for message in range(len(messages_new_current)):
                            new_current.append(messages_new_current[message])
                        print("new request : ")
                        print(new_current)
                        print("")
                        if(len(new_current) > len(previous_request)):
                            new_current = current_request # too bad: keep the request unchanged
                    else:
                        new_current = current_request
                    new_dataset.append(new_current)
    return new_dataset

# test - only the affected cases are printed
raw_data_testing_interpolate = interpolate_missing_trains(raw_data_testing_ordered,nb_requests,nb_stations)

current request :
[9, 1257, 11, 1, 5, -1, 6, 1308, 1323, 1338, 1353, 1408, 1302, 1317, 1331, 1346, 1401, 1416, 0.0, 1308, 1323, 1338, 1353, 1408, 1302, 1317, 1331, 1346, 1401, 1416]
previous request :
[9, 1256, 12, 1, 6, -1, 6, 1259, 1308, 1323, 1338, 1353, 1408, 1302, 1316, 1331, 1346, 1401, 1416, 0.0, 1259, 1308, 1323, 1338, 1353, 1408, 1302, 1316, 1331, 1346, 1401, 1416]
PROBLEM with DIRECTION 1
new request : 
[9, 1257, 12, 1, 6, -1, 6, 1259, 1308, 1323, 1338, 1353, 1408, 1302, 1317, 1331, 1346, 1401, 1416, 0.0, 1259, 1308, 1323, 1338, 1353, 1408, 1302, 1317, 1331, 1346, 1401, 1416]

current request :
[14, 1257, 11, 1, 5, -1, 6, 1257, 1257, 1307, 1315, 1322, 1303, 1310, 1318, 1324, 1331, 1339, 0.0, -2, -4, 1307, 1315, 1322, 1303, 1310, 1318, 1324, 1331, 1339]
previous request :
[14, 1256, 12, 1, 6, -1, 6, 1256, 1256, 1306, 1307, 1315, 1322, 1256, 1303, 1309, 1318, 1324, 1331, 0.0, -3, -4, 1306, 1307, 1315, 1322, -2, 1303, 1309, 1318, 1324, 1331]
PROBLEM with DIRECTION 1
new request 
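The first repaired request above (station 9, request at 1257) can be reproduced with a much smaller sketch restricted to a single direction. This is a simplified illustration, not the notebook's function: `minutes_between` is a hypothetical stand-in for `substract_times` (whose definition is not shown in this section), times are assumed to be HHMM integers, and the sketch ignores messages, arrived trains and trains that appear only in the current request before the end of the list.

```python
def minutes_between(a, b):
    """Signed difference a - b in minutes for HHMM integers, wrapping at midnight."""
    diff = (a // 100 * 60 + a % 100) - (b // 100 * 60 + b % 100)
    # keep the result in [-720, 720) so that 0002 vs 2358 gives 4, not -1436
    return (diff + 720) % 1440 - 720

def carry_forward(previous, current, max_diff=3):
    """Merge one direction's estimates, re-inserting trains missing from `current`."""
    merged, i, j = [], 0, 0
    while i < len(previous) and j < len(current):
        if abs(minutes_between(previous[i], current[j])) > max_diff:
            merged.append(previous[i])  # train vanished: reuse the old estimate
            i += 1
        else:
            merged.append(current[j])   # same train: keep the fresh estimate
            i += 1
            j += 1
    merged.extend(previous[i:])  # trailing trains missing from the current request
    merged.extend(current[j:])   # trailing trains only present in the current request
    return merged

previous_dir1 = [1259, 1308, 1323, 1338, 1353, 1408]
current_dir1 = [1308, 1323, 1338, 1353, 1408]
print(carry_forward(previous_dir1, current_dir1))
# [1259, 1308, 1323, 1338, 1353, 1408]
```

Here 1259 and 1308 are 9 minutes apart, which exceeds `max_diff`, so the estimate 1259 is carried over from the previous request; the remaining estimates match and are taken from the current request.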

### Final preprocessing algorithm

In [13]:
offset = 7
L = [] # durations (in seconds) of the preprocessing steps, appended at every call
def final_preprocessing(raw_data):
    nb_stations, nb_requests = get_data_nbs(raw_data)
    # 1: remove NaN values
    t1 = time.time()
    new_data = delete_NaN(raw_data)
    dt = time.time() - t1
    L.append(dt)
    # 2: formalize the messages
    t1 = time.time()
    new_data_1 = formalizing_messages(new_data,nb_requests,nb_stations)
    dt = time.time() - t1
    L.append(dt)
    # 3: convert the fields to integers
    t1 = time.time()
    new_data_2 = data_to_Int(new_data_1,nb_requests,nb_stations) # offset = 7 (cf. before)
    dt = time.time() - t1
    L.append(dt)
    # 4: correct line errors
    t1 = time.time()
    new_data_3 = correcting_line_errors(new_data_2, nb_requests, nb_stations)
    dt = time.time() - t1
    L.append(dt)
    # 5: order the data
    t1 = time.time()
    new_data_4 = order_data(new_data_3,nb_requests,nb_stations)
    dt = time.time() - t1
    L.append(dt)
    # 6: interpolate missing trains
    t1 = time.time()
    preprocessed_data = interpolate_missing_trains(new_data_4,nb_requests,nb_stations)
    dt = time.time() - t1
    L.append(dt)
    return preprocessed_data
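The six timed steps all follow the same pattern, so the timing could be factored into a loop. The sketch below is only an illustration of that idea: `run_timed` is a hypothetical helper, and the toy steps stand in for the real preprocessing functions. It uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import time

def run_timed(steps, data):
    """Apply each (name, function) step in order and record its duration."""
    durations = {}
    for name, step in steps:
        start = time.perf_counter()
        data = step(data)
        durations[name] = time.perf_counter() - start
    return data, durations

# toy steps standing in for delete_NaN, order_data, ...
toy_steps = [
    ("drop_none", lambda rows: [r for r in rows if r is not None]),
    ("sort", sorted),
]
result, durations = run_timed(toy_steps, [3, None, 1, 2])
print(result)           # [1, 2, 3]
print(list(durations))  # ['drop_none', 'sort']
```

With the real functions, each step would have to be wrapped so that it takes only the data, for example `lambda d: order_data(d, nb_requests, nb_stations)`.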


In [14]:
dataset1 =  final_preprocessing(raw_data_1)


[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 927, 933, 932, 941, 917, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]
[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 917, 927, 932, 933, 941, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]

[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 950, 948, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -5, -1000]
[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 948, 950, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -1000, -5]

[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 947, 946, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 947, 946]
[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 946, 947, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 946, 947]

[12, 917, 12, 1, 6, -1, 6, 917, 928, 934, 942, 949, 958, 927, 938, 939, 946, 944, 951, 0.0, -3, -10,

[10, 1343, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1348, 1405, 1416, 1413, 1424, 1428, 0.0, -1000, -6, 1355, -6, 1410, -6, 1348, 1405, -5, -1000, -5, 1428]
[10, 1343, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1348, 1405, 1413, 1416, 1424, 1428, 0.0, -1000, -6, 1355, -6, 1410, -6, 1348, 1405, -1000, -5, -5, 1428]

[11, 1343, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1404, 1413, 1412, 1422, 1427, 0.0, -10, 1347, -10, 1402, -10, 1417, 1346, -10, 1413, -10, 1422, -10]
[11, 1343, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1404, 1412, 1413, 1422, 1427, 0.0, -10, 1347, -10, 1402, -10, 1417, 1346, -10, -10, 1413, 1422, -10]

[18, 1343, 12, 1, 6, -1, 6, 1343, 1350, 1356, 1353, 1359, 1406, 1345, 1354, 1400, 1400, 1403, 1409, 0.0, -10, 1350, -10, 1353, 1359, -10, 1345, -10, 1400, 1401, -10, 1409]
[18, 1343, 12, 1, 6, -1, 6, 1343, 1350, 1353, 1356, 1359, 1406, 1345, 1354, 1400, 1400, 1403, 1409, 0.0, -10, 1350, 1353, -10, 1359, -10, 1345, -1

In [15]:
dataset2 =  final_preprocessing(raw_data_2)


[25, 1518, 12, 1, 6, -1, 6, 1518, 1522, 1527, 1531, 1535, 1542, 1524, 1523, 1526, 1532, 1538, 1541, 0.0, -3, 1522, 1527, 1531, 1535, 1542, 1524, 1523, 1526, 1532, 1538, 1541]
[25, 1518, 12, 1, 6, -1, 6, 1518, 1522, 1527, 1531, 1535, 1542, 1523, 1524, 1526, 1532, 1538, 1541, 0.0, -3, 1522, 1527, 1531, 1535, 1542, 1523, 1524, 1526, 1532, 1538, 1541]

[26, 1518, 12, 1, 6, -1, 6, 1520, 1525, 1529, 1533, 1537, 1544, 1522, 1521, 1524, 1530, 1536, 1539, 0.0, 1520, 1525, 1529, 1533, 1537, 1544, 1522, 1521, 1524, 1530, 1536, 1539]
[26, 1518, 12, 1, 6, -1, 6, 1520, 1525, 1529, 1533, 1537, 1544, 1521, 1522, 1524, 1530, 1536, 1539, 0.0, 1520, 1525, 1529, 1533, 1537, 1544, 1521, 1522, 1524, 1530, 1536, 1539]

[27, 1518, 12, 1, 6, -1, 6, 1518, 1522, 1527, 1531, 1535, 1539, 1520, 1519, 1522, 1528, 1534, 1537, 0.0, -2, 1522, 1527, 1531, 1535, 1539, 1520, 1519, 1522, 1528, 1534, 1537]
[27, 1518, 12, 1, 6, -1, 6, 1518, 1522, 1527, 1531, 1535, 1539, 1519, 1520, 1522, 1528, 1534, 1537, 0.0, -2, 1522, 152

[45, 1615, 12, 1, 6, -1, 6, 1617, 1632, 1647, 1702, 1717, 1732, 1626, 1637, 1636, 1651, 1707, 1722, 0.0, 1617, 1632, 1647, 1702, 1717, 1732, 1626, 1637, 1636, 1651, 1707, 1722]
[45, 1615, 12, 1, 6, -1, 6, 1617, 1632, 1647, 1702, 1717, 1732, 1626, 1636, 1637, 1651, 1707, 1722, 0.0, 1617, 1632, 1647, 1702, 1717, 1732, 1626, 1636, 1637, 1651, 1707, 1722]

[46, 1615, 12, 1, 6, -1, 6, 1619, 1633, 1648, 1703, 1718, 1733, 1624, 1635, 1634, 1649, 1705, 1720, 0.0, 1619, 1633, 1648, 1703, 1718, 1733, 1624, 1635, 1634, 1649, 1705, 1720]
[46, 1615, 12, 1, 6, -1, 6, 1619, 1633, 1648, 1703, 1718, 1733, 1624, 1634, 1635, 1649, 1705, 1720, 0.0, 1619, 1633, 1648, 1703, 1718, 1733, 1624, 1634, 1635, 1649, 1705, 1720]

[10, 1616, 12, 1, 6, -1, 6, 1621, 1629, 1636, 1644, 1650, 1659, 1621, 1629, 1630, 1642, 1641, 1639, 0.0, 1621, -6, 1636, -6, 1650, -6, 1621, -5, 1630, -5, -1000, -5]
[10, 1616, 12, 1, 6, -1, 6, 1621, 1629, 1636, 1644, 1650, 1659, 1621, 1629, 1630, 1639, 1641, 1642, 0.0, 1621, -6, 1636, -6,

[16, 1820, 12, 1, 6, -1, 6, 1825, 1832, 1840, 1850, 1855, 1902, 1827, 1835, 1837, 1841, 1844, 1852, 0.0, 1825, 1832, 1840, 1850, 1855, 1902, 1827, 1835, 1837, 1841, 1844, 1852]

[18, 1820, 12, 1, 6, -1, 6, 1821, 1822, 1829, 1836, 1838, 1844, 1825, 1828, 1832, 1837, 1830, 1833, 0.0, -10, 1822, 1829, -10, 1838, 1844, 1825, 1828, -10, 1837, 1830, -10]
[18, 1820, 12, 1, 6, -1, 6, 1821, 1822, 1829, 1836, 1838, 1844, 1825, 1828, 1830, 1832, 1833, 1837, 0.0, -10, 1822, 1829, -10, 1838, 1844, 1825, 1828, 1830, -10, -10, 1837]

[19, 1820, 12, 1, 6, -1, 6, 1821, 1824, 1831, 1836, 1840, 1846, 1823, 1826, 1831, 1835, 1829, 1832, 0.0, -10, 1824, 1831, -10, 1840, 1846, 1823, 1826, -10, 1835, 1829, -10]
[19, 1820, 12, 1, 6, -1, 6, 1821, 1824, 1831, 1836, 1840, 1846, 1823, 1826, 1829, 1831, 1832, 1835, 0.0, -10, 1824, 1831, -10, 1840, 1846, 1823, 1826, 1829, -10, -10, 1835]

[20, 1820, 12, 1, 6, -1, 6, 1822, 1826, 1833, 1837, 1842, 1848, 1821, 1824, 1830, 1833, 1827, 1832, 0.0, -10, 1826, 1833, -10, 1

[20, 2007, 12, 1, 6, -1, 6, 2008, 2012, 2017, 2019, 2025, 2027, 2007, 2016, 2023, 2016, 2021, 2027, 0.0, -10, 2012, 2017, 2019, -10, 2027, -10, 2016, 2023, -10, -1000, 2027]
[20, 2007, 12, 1, 6, -1, 6, 2008, 2012, 2017, 2019, 2025, 2027, 2007, 2016, 2016, 2021, 2023, 2027, 0.0, -10, 2012, 2017, 2019, -10, 2027, -10, 2016, -10, -1000, 2023, 2027]

[21, 2007, 12, 1, 6, -1, 6, 2009, 2014, 2019, 2021, 2025, 2029, 2014, 2021, 2016, 2019, 2025, 2031, 0.0, -10, 2014, 2019, 2021, -10, 2029, 2014, 2021, -10, 2019, 2025, -10]
[21, 2007, 12, 1, 6, -1, 6, 2009, 2014, 2019, 2021, 2025, 2029, 2014, 2016, 2019, 2021, 2025, 2031, 0.0, -10, 2014, 2019, 2021, -10, 2029, 2014, -10, 2019, 2021, 2025, -10]

[22, 2007, 12, 1, 6, -1, 6, 2007, 2010, 2015, 2020, 2023, 2026, 2013, 2019, 2014, 2017, 2023, 2030, 0.0, -3, 2010, 2015, -5, 2023, 2026, 2013, 2019, 2014, 2017, 2023, -6]
[22, 2007, 12, 1, 6, -1, 6, 2007, 2010, 2015, 2020, 2023, 2026, 2013, 2014, 2017, 2019, 2023, 2030, 0.0, -3, 2010, 2015, -5, 2023, 20


[9, 2225, 10, 1, 4, -1, 6, 2234, 2302, 2332, 2225, 2238, 2306, 2336, 2225, 2225, 2225, 0.0, 2234, 2302, 2332, 2, 2238, 2306, 2336, 6, 36, 54]
[9, 2225, 10, 1, 4, -1, 6, 2225, 2234, 2302, 2332, 2225, 2225, 2225, 2238, 2306, 2336, 0.0, 2, 2234, 2302, 2332, 6, 36, 54, 2238, 2306, 2336]

[0, 2226, 9, 1, 3, -1, 6, 2244, 2314, 2344, 2227, 2234, 2257, 2325, 2355, 2226, 0.0, -6, -6, -6, -5, -5, -5, -5, -5, -5]
[0, 2226, 9, 1, 3, -1, 6, 2244, 2314, 2344, 2226, 2227, 2234, 2257, 2325, 2355, 0.0, -6, -6, -6, -5, -5, -5, -5, -5, -5]

[1, 2226, 9, 1, 3, -1, 6, 2246, 2316, 2346, 2232, 2254, 2322, 2352, 2226, 2226, 0.0, 2246, 2316, 2346, 2232, 2254, 2322, 2352, 22, 52]
[1, 2226, 9, 1, 3, -1, 6, 2246, 2316, 2346, 2226, 2226, 2232, 2254, 2322, 2352, 0.0, 2246, 2316, 2346, 22, 52, 2232, 2254, 2322, 2352]

[2, 2226, 9, 1, 3, -1, 6, 2249, 2319, 2349, 2229, 2251, 2319, 2349, 2226, 2226, 0.0, 2249, 2319, 2349, 2229, 2251, 2319, 2349, 19, 49]
[2, 2226, 9, 1, 3, -1, 6, 2249, 2319, 2349, 2226, 2226, 2229, 225


[8, 2325, 6, 1, 2, -1, 4, 0.0, 2335, 0.0, -1000, 2339, -1000, -1000, -1000, 2335]
[8, 2325, 6, 1, 2, -1, 4, 0.0, 2335, -1000, 0.0, 2339, -1000, -1000, -1000, 2335]

[9, 2325, 6, 1, 2, -1, 4, 2303, 2337, 0.0, -1000, 2338, -1000, -1000, -1000, 2337]
[9, 2325, 6, 1, 2, -1, 4, 2303, 2337, -1000, 0.0, 2338, -1000, -1000, -1000, 2337]

[10, 2325, 9, 1, 3, -1, 6, 0.0, 2339, 2340, 0.0, -1000, 2335, -1000, 2348, -1000, -1000, -1000, 2339, -6]
[10, 2325, 9, 1, 3, -1, 6, 0.0, 2339, 2340, -1000, -1000, 0.0, 2335, 2348, -1000, -1000, -1000, 2339, -6]

[11, 2325, 9, 1, 3, -1, 6, 0.0, 2341, 2342, 0.0, -1000, 2333, 2347, 2346, -1000, -1000, -1000, 2341, 2342]
[11, 2325, 9, 1, 3, -1, 6, 0.0, 2341, 2342, -1000, 0.0, 2333, 2346, 2347, -1000, -1000, -1000, 2341, 2342]

[12, 2325, 9, 1, 3, -1, 6, 0.0, 2343, 2344, 0.0, -1000, 2331, 2346, 2344, -1000, -1000, -1000, 2343, 2344]
[12, 2325, 9, 1, 3, -1, 6, 0.0, 2343, 2344, -1000, 0.0, 2331, 2344, 2346, -1000, -1000, -1000, 2343, 2344]

[13, 2325, 10, 1, 4, -1,

PROBLEM with DIRECTION 1
new request : 
[12, 1543, 12, 1, 6, -1, 6, 1549, 1604, 1604, 1610, 1619, 1625, 1552, 1556, 1600, 1605, 1611, 1620, 0.0, 1549, -10, 1605, 1610, 1619, 1625, -10, -10, 1600, 1605, -10, 1620]

current request :
[13, 1543, 11, 1, 5, -1, 6, 1550, 1604, 1605, 1612, 1620, 1551, 1556, 1558, 1603, 1611, 1618, 0.0, 1550, -10, 1605, 1612, 1620, -10, -10, 1558, 1603, -10, 1618]
previous request :
[13, 1542, 12, 1, 6, -1, 6, 1550, 1604, 1605, 1612, 1620, 1627, 1551, 1556, 1558, 1603, 1611, 1618, 0.0, 1550, -10, 1605, 1612, 1620, 1627, -10, -10, 1558, 1603, -10, 1618]
PROBLEM with DIRECTION 1
new request : 
[13, 1543, 12, 1, 6, -1, 6, 1550, 1604, 1605, 1612, 1620, 1627, 1551, 1556, 1558, 1603, 1611, 1618, 0.0, 1550, -10, 1605, 1612, 1620, 1627, -10, -10, 1558, 1603, -10, 1618]

current request :
[16, 1543, 11, 1, 5, -1, 6, 1555, 1609, 1610, 1617, 1625, 1547, 1551, 1553, 1558, 1606, 1613, 0.0, 1555, -10, 1610, 1617, 1625, -10, -10, 1553, 1558, -10, 1613]
previous request :
[16


current request :
[4, 2238, 9, 1, 3, -1, 6, 2252, 2322, 2352, 2238, 2238, 2238, 2248, 2318, 2346, 0.0, 2252, 2322, 2352, 16, 46, 103, 2248, 2318, 2346]
previous request :
[4, 2237, 10, 1, 4, -1, 6, 2252, 2322, 2352, 2225, 2237, 2237, 2237, 2248, 2316, 2346, 0.0, 2252, 2322, 2352, -2, 16, 46, 103, 2248, 2316, 2346]
PROBLEM with DIRECTION 1
new request : 
[4, 2238, 10, 1, 4, -1, 6, 2252, 2322, 2352, 2225, 2238, 2238, 2238, 2248, 2318, 2346, 0.0, 2252, 2322, 2352, -2, 16, 46, 103, 2248, 2318, 2346]

current request :
[5, 2238, 9, 1, 3, -1, 6, 2254, 2324, 2354, 2238, 2238, 2238, 2246, 2316, 2344, 0.0, 2254, 2324, 2354, 14, 44, 102, 2246, 2316, 2344]
previous request :
[5, 2237, 10, 1, 4, -1, 6, 2254, 2324, 2354, 2227, 2237, 2237, 2237, 2246, 2314, 2344, 0.0, 2254, 2324, 2354, 14, 14, 44, 102, 2246, 2314, 2344]
PROBLEM with DIRECTION 1
new request : 
[5, 2238, 10, 1, 4, -1, 6, 2254, 2324, 2354, 2227, 2238, 2238, 2238, 2246, 2316, 2344, 0.0, 2254, 2324, 2354, 14, 14, 44, 102, 2246, 2316, 23

[15, 2313, 12, 1, 6, -1, 6, 2259, 2259, 2314, 2328, 2343, 2349, -1000, 0.0, -1000, 2314, 2328, 2343, 2349]
PROBLEM with DIRECTION 1
new request : 
[15, 2314, 12, 1, 6, -1, 6, 2259, 2259, 2314, 2328, 2343, 2349, -1000, 0.0, -1000, -2, 2328, 2343, 2349]

current request :
[16, 2314, 11, 1, 5, -1, 6, 0.0, 2316, 2330, 2345, 2350, -1000, 0.0, 2324, 2328, 2335, 2353, -1000, -1000, 2316, 2330, 2345, 2350]
previous request :
[16, 2313, 12, 1, 6, -1, 6, -1000, 0.0, 2316, 2330, 2345, 2350, -1000, 0.0, -1000, 2316, 2330, 2345, 2350]
PROBLEM with DIRECTION 1
new request : 
[16, 2314, 12, 1, 6, -1, 6, -1000, 0.0, 2316, 2330, 2345, 2350, -1000, 0.0, -1000, 2316, 2330, 2345, 2350]

current request :
[44, 2314, 11, 1, 5, -1, 6, 2301, 2316, 2331, 2344, 2352, -1000, 0.0, 2323, 2338, 2352, -1000, -1000, -1000, -6, -6, -100, -6]
previous request :
[44, 2313, 12, 1, 6, -1, 6, 0.0, 2301, 2316, 2331, 2344, 2352, -1000, 0.0, -6, -1000, -6, -6, -100, -6]
PROBLEM with DIRECTION 1
new request : 
[44, 2314, 12, 1

[20, 2348, 12, 1, 6, -1, 6, -1000, 0.0, 0.0, 2354, 2359, -1000, 0.0, -1000, 2359, -10]
PROBLEM with DIRECTION 1
new request : 
[20, 2349, 12, 1, 6, -1, 6, -1000, 0.0, 0.0, 2354, 2359, -1000, 0.0, -1000, 2359, -10]

current request :
[21, 2349, 11, 1, 5, -1, 6, -1000, 0.0, 0.0, 2355, -1000, -1000, -1000, -1000, 2349, 2355, -1000, -1000, -10, -1000]
previous request :
[21, 2348, 12, 1, 6, -1, 6, -1000, -1000, 0.0, 0.0, -1000, 0.0, -10, -1000]
PROBLEM with DIRECTION 1
new request : 
[21, 2349, 12, 1, 6, -1, 6, -1000, -1000, 0.0, 0.0, -1000, 0.0, -10, -1000]

current request :
[22, 2349, 11, 1, 5, -1, 6, -1000, 0.0, 0.0, 2356, -1000, -1000, -1000, -1000, 2353, 2358, -1000, -1000, 2356, -1000]
previous request :
[22, 2348, 12, 1, 6, -1, 6, -1000, -1000, 0.0, 2356, -1000, -1000, -1000, -1000, 2348, 2353, 2359, -1000, -1000]
PROBLEM with DIRECTION 1
new request : 
[22, 2349, 12, 1, 6, -1, 6, -1000, -1000, 0.0, 0.0, -1000, 0.0, 2356, -1000]

current request :
[44, 2349, 8, 1, 2, -1, 6, 2301, 2

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
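
This warning is triggered by the volume of diagnostic `print` calls made while repairing requests. One possible way to keep the diagnostics without flooding the notebook is to route them through the standard `logging` module and only enable them on small samples. The logger name and the `report_repair` helper below are hypothetical.

```python
import logging

logger = logging.getLogger("preprocessing")  # hypothetical logger name

def report_repair(current, previous, repaired):
    """Log one repaired request at DEBUG level instead of printing it."""
    logger.debug("current request: %s", current)
    logger.debug("previous request: %s", previous)
    logger.debug("new request: %s", repaired)

logging.basicConfig(level=logging.WARNING)  # quiet by default
report_repair([1], [2], [3])                # nothing is shown at WARNING level
logger.setLevel(logging.DEBUG)              # opt in when inspecting a small sample
report_repair([1], [2], [3])                # now the three lines are emitted
```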



In [16]:
dataset3 =  final_preprocessing(raw_data_3)


[17, 644, 24, 1, 12, -1, 12, 648, 651, 657, 704, 705, 711, 719, 720, 726, 735, 750, 805, 647, 651, 657, 704, 705, 711, 718, 720, 733, 748, 803, 644, 0.0, 648, 651, 657, 704, 705, 711, 719, 720, 726, 735, 750, 805, 647, 651, 657, 704, 705, 711, 718, 720, 733, 748, 803, -3]
[17, 644, 24, 1, 12, -1, 12, 648, 651, 657, 704, 705, 711, 719, 720, 726, 735, 750, 805, 644, 647, 651, 657, 704, 705, 711, 718, 720, 733, 748, 803, 0.0, 648, 651, 657, 704, 705, 711, 719, 720, 726, 735, 750, 805, -3, 647, 651, 657, 704, 705, 711, 718, 720, 733, 748, 803]

[10, 704, 12, 1, 6, -1, 6, 706, 714, 721, 727, 729, 735, 711, 704, 715, 724, 733, 740, 0.0, 706, -6, 721, -6, -6, 735, -5, -100, -1000, -5, -1000, -5]
[10, 704, 12, 1, 6, -1, 6, 706, 714, 721, 727, 729, 735, 704, 711, 715, 724, 733, 740, 0.0, 706, -6, 721, -6, -6, 735, -100, -5, -1000, -5, -1000, -5]

[10, 705, 12, 1, 6, -1, 6, 706, 714, 721, 727, 729, 735, 711, 2359, 715, 724, 736, 740, 0.0, -3, -6, 721, -6, -6, 735, -5, -100, -1000, -5, -1000, -5

[26, 915, 12, 1, 6, -1, 6, 915, 921, 925, 929, 933, 937, 916, 918, 916, 921, 924, 930, 0.0, -2, 921, 925, 929, 933, 937, 916, 918, 916, 921, 924, 930]
[26, 915, 12, 1, 6, -1, 6, 915, 921, 925, 929, 933, 937, 916, 916, 918, 921, 924, 930, 0.0, -2, 921, 925, 929, 933, 937, 916, 916, 918, 921, 924, 930]

[27, 915, 12, 1, 6, -1, 6, 915, 917, 923, 927, 931, 935, 915, 916, 915, 919, 922, 928, 0.0, -2, 917, 923, 927, 931, 935, -2, 916, 915, 919, 922, 928]
[27, 915, 12, 1, 6, -1, 6, 915, 917, 923, 927, 931, 935, 915, 915, 916, 919, 922, 928, 0.0, -2, 917, 923, 927, 931, 935, -2, 915, 916, 919, 922, 928]

[0, 916, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 916, 932, 930, 948, 1005, 1010, 0.0, -6, -6, -6, -6, -6, -6, -100, -5, -100, -5, -5, -5]
[0, 916, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 916, 930, 932, 948, 1005, 1010, 0.0, -6, -6, -6, -6, -6, -6, -100, -100, -5, -5, -5, -5]

[10, 916, 12, 1, 6, -1, 6, 924, 930, 939, 945, 955, 1000, 925, 926, 937, 943, 940, 948, 0.0, 924, -

In [17]:
dataset4 =  final_preprocessing(raw_data_4)


[0, 1143, 12, 1, 6, -1, 6, 1149, 1204, 1219, 1234, 1249, 1304, 1155, 1208, 1143, 1221, 1235, 1250, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -5, -5, -5, -5]
[0, 1143, 12, 1, 6, -1, 6, 1149, 1204, 1219, 1234, 1249, 1304, 1143, 1155, 1208, 1221, 1235, 1250, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -5, -5, -5, -5]

[18, 1143, 12, 1, 6, -1, 6, 1143, 1144, 1152, 1153, 1159, 1206, 1150, 1149, 1154, 1200, 1203, 1209, 0.0, -2, 1144, -10, 1153, 1159, -10, 1150, -10, 1154, 1200, -10, 1209]
[18, 1143, 12, 1, 6, -1, 6, 1143, 1144, 1152, 1153, 1159, 1206, 1149, 1150, 1154, 1200, 1203, 1209, 0.0, -2, 1144, -10, 1153, 1159, -10, -10, 1150, 1154, 1200, -10, 1209]

[0, 1144, 12, 1, 6, -1, 6, 1149, 1204, 1219, 1234, 1249, 1304, 1155, 1208, 1144, 1222, 1235, 1250, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -5, -5, -5, -5]
[0, 1144, 12, 1, 6, -1, 6, 1149, 1204, 1219, 1234, 1249, 1304, 1144, 1155, 1208, 1222, 1235, 1250, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -5, -5, -5, -5]

[18, 1144, 12, 1, 6, -1, 6, 1144, 1152, 1153, 

In [18]:
print(L) # display the computation times for the first dataset.

[0.2707960605621338, 0.7487435340881348, 0.7599058151245117, 0.01100301742553711, 0.2484283447265625, 0.07592654228210449, 0.5126974582672119, 1.34794020652771, 1.4666287899017334, 0.02561807632446289, 1.2140088081359863, 1.1128361225128174, 0.22516465187072754, 0.6031894683837891, 0.6487061977386475, 0.00873708724975586, 0.36187028884887695, 0.008671045303344727, 0.05383419990539551, 0.14783310890197754, 0.1455390453338623, 0.0019838809967041016, 0.05408740043640137, 0.001142740249633789]


## Processing 

To process the data and recover the train arrival times, we define two classes. The first, $\textit{Train}$, stores the data we have on a tracked train. The second, $\textit{linkedList}$, implements a doubly linked list of $\textit{Train}$ instances.
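
The doubly linked list idea can be sketched in a few lines before reading the full class. This is a minimal sketch under assumptions: `Node` and `DoublyLinkedList` are hypothetical names used only here, not the notebook's `Train` and `linkedList`, and only appending and unlinking are shown. The point is that removing an element must update the root and the tail independently, so that a list holding a single element ends up empty.

```python
class Node:
    """Hypothetical stand-in for a tracked train: a payload plus two links."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class DoublyLinkedList:
    """Minimal doubly linked list with explicit root and tail pointers."""

    def __init__(self):
        self.root = None
        self.tail = None
        self.length = 0

    def append(self, node):
        # Attach the node after the current tail (or make it the only element).
        if self.tail is None:
            self.root = node
        else:
            self.tail.next = node
            node.prev = self.tail
        self.tail = node
        self.length += 1

    def unlink(self, node):
        # Reconnect the neighbours around the node, updating root and tail
        # when the node sits at either end (or is the only element).
        if node.prev is None:
            self.root = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
        node.prev = node.next = None
        self.length -= 1

    def values(self):
        # Walk from root to tail and collect the payloads.
        out, node = [], self.root
        while node is not None:
            out.append(node.value)
            node = node.next
        return out


trains = DoublyLinkedList()
nodes = [Node(t) for t in (2316, 2330, 2345)]  # illustrative HHMM estimations
for n in nodes:
    trains.append(n)
trains.unlink(nodes[1])       # remove the middle element
remaining = trains.values()   # [2316, 2345]
```

Keeping both a root and a tail pointer makes appending and reading the last tracked train constant-time operations, which is what the processing loop relies on when it inspects the tail.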

In [19]:
class linkedList : 
    def __init__(self, _root, _tail, _current, _lenght):
        self.m_lenght = _lenght
        self.m_root = _root 
        self.m_tail = _tail
        self.m_current = _current
    
    def get_root(self):
        self.m_current = self.m_root
        return self.m_root
    
    def get_tail(self):
        self.m_current = self.m_tail
        return self.m_tail
    
    def get_lenght(self):
        return self.m_lenght
    
    def get_current(self):
        return self.m_current
    
    def set_current(self, train):
        self.m_current = train
        
    def get_next(self):
        if(self.m_current != None):
            self.m_current = self.m_current.get_next()
            return self.m_current
        else :
            raise ValueError("self.m_current is null, can not get next !")

    def get_previous(self):
        if(self.m_current != None):
            self.m_current = self.m_current.get_previous()
            return self.m_current # mirrors get_next, which also returns the new current
        else :
            raise ValueError("self.m_current is null, can not get previous !")
            
    def add_queue(self, new_train):
        if(self.m_lenght == 0):
            self.m_root = new_train
            self.m_tail = new_train
        else: 
            self.m_tail.set_next(new_train)
            new_train.set_previous(self.m_tail)
            self.m_tail = new_train
        self.m_lenght+=1
        
    def add_after_current(self, new_train):
        if(self.m_lenght == 0):
            self.m_root = new_train
            self.m_tail = new_train
        else: 
            if(self.m_current is None):
                raise ValueError("Current = None : you can not add. Set current first. ")
            following = self.m_current.get_next()
            if(following is not None):
                following.set_previous(new_train)
            else:
                # current was the tail, so the new train becomes the tail
                self.m_tail = new_train
            new_train.set_next(following)
            self.m_current.set_next(new_train)
            new_train.set_previous(self.m_current)
        self.m_lenght+=1
    
    def add_before_current(self, new_train):
        if(self.m_lenght == 0):
            self.m_root = new_train
            self.m_tail = new_train
        else: 
            if(self.m_current is None):
                raise ValueError("Current = None : you can not add. Set current first. ")
            if(self.m_current.get_previous() is None):
                # current is the root: add_root already updates the length, so return here
                self.add_root(new_train)
                return
            self.m_current.get_previous().set_next(new_train)
            new_train.set_previous(self.m_current.get_previous())
            self.m_current.set_previous(new_train)
            new_train.set_next(self.m_current)
        self.m_lenght+=1
    
    def add_root(self, new_train):
        if(self.m_lenght == 0):
            self.m_root = new_train
            self.m_tail = new_train
        else:
            self.m_root.set_previous(new_train)
            new_train.set_next(self.m_root)
            self.m_root = new_train
        self.m_lenght+=1
        
    def remove(self, train):
        self.m_current = train
        self.remove_current()
        
    def remove_current(self): # removes the current train and returns the next one (or None)
        if(self.m_current is not None): # in theory it's never None
            if(self.m_current.get_current_state() == -10 or self.m_current.get_arrival_time() > 0  or self.m_current.get_isLate()): 
                # isLate => maybe it has been cancelled... but we'll never know
                previous = self.m_current.get_previous()
                following = self.m_current.get_next()
                # root and tail are handled independently so that a single-element list is emptied correctly
                if(self.m_current == self.m_root):
                    self.m_root = following
                else:
                    previous.set_next(following)
                if(self.m_current == self.m_tail):
                    self.m_tail = previous
                else:
                    following.set_previous(previous)
                    
                self.m_lenght-=1
                self.m_current = following
                return self.m_current
            
            else :
                raise NameError("You can't do that, the 'current' has not arrived yet !")

        else :
            print("The 'current' is already a None type")

            
        # should we return the previous one, or the next one? # maybe not
    def toString(self):
        print("Nb trains : "+ str(self.m_lenght))
        train = self.m_root
        while(train != None):
            print(train.toString())
            train = train.get_next()

class Train :
    
    def __init__(self, first_estimated_time,current_state, _next, _previous, request_time):
        self.first_estimated_time = first_estimated_time
        self.current_state = current_state
        self.m_next = _next
        self.m_previous = _previous
        self.arrival_time = 0
        self.isLate = False
        self.last_estimation = first_estimated_time
        self.first_estimation_request_time = request_time
    
    def set_estimated_time(self, new_time):
        if(self.first_estimated_time >= 0):
            raise ValueError("The first estimated time has already been set !")
        else : 
            self.first_estimated_time = new_time
    
    def get_estimated_time(self):
        return self.first_estimated_time

    def set_current_state(self, new_state):
        # it depends
        self.current_state = new_state
        # exceptions to raise ?
        
    def get_current_state(self):
        return self.current_state
    
    def set_next(self, new_next):
        self.m_next = new_next
        # disabled guard, kept for reference:
        # if(self.m_next == None):
        #     self.m_next = new_next
        # else :
        #     if(self.m_next.get_current_state() == -2) : # only way to change the next is if the previous one arrived
        #         self.m_next = new_next
    def get_next(self):
        return self.m_next

    
    def set_previous(self, new_previous):
        self.m_previous = new_previous
        # disabled guard, kept for reference:
        # if(self.m_previous == None):
        #     
        # else :
        #     if(self.m_previous.get_current_state() == -2) : 
        #         self.m_previous = new_previous
    def get_previous(self):
        return self.m_previous
    
    def set_arrival_time(self,time):
        self.arrival_time = time
        
    def get_arrival_time(self):
        return self.arrival_time
    
    def set_isLate(self,late):
        self.isLate = late
        
    def get_isLate(self):
        return self.isLate
    
    def set_last_estimation(self,time):
        self.last_estimation = time
        
    def get_last_estimation(self):
        return self.last_estimation
    
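    def toList(self, station_idx, direction):
        # row layout: station, direction, first estimation (in minutes), gap between the first request and the first estimation,
        # late flag (0/1), gap between the first estimation and the arrival time
        b = 1 if self.isLate else 0
        L = [station_idx, direction, convert(self.first_estimated_time), substract_times(self.first_estimation_request_time,self.first_estimated_time), b, substract_times(self.first_estimated_time,self.arrival_time)]
        return L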
    
    def toString(self):
        # the next / previous flags were built in the original but never added to the result, so they are dropped here
        return(" first estimated time : " + str(self.first_estimated_time) + " last estimated time : " + str(self.get_last_estimation()) + "  time first request: " + str(self.first_estimation_request_time) + " ; state : " + str(self.current_state) + " ; is late : " + str(self.isLate))
        
def convert(time): # converts an HHMM time to minutes before adding it to the dataset.

    hours = int(time/100)
    minutes = time - hours*100
    # the first hours after midnight are counted as a continuation of the previous day
    if(hours == 0):
        hours = 24
    if(hours == 1):
        hours = 25
    if(hours == 2):
        hours = 26
    return(hours*60+minutes)
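
The conversion above maps an HHMM integer to a number of minutes and pushes the first hours after midnight past 24 h, so that late-evening and early-morning times stay in order. A minimal standalone sketch of that idea follows. The names `hhmm_to_minutes`, `minutes_between` and `night_cutoff_hour` are hypothetical, and the cutoff at 3 h is an assumption; the notebook's own `substract_times` is defined elsewhere and may use a different sign convention.

```python
def hhmm_to_minutes(hhmm, night_cutoff_hour=3):
    """Convert an HHMM integer (e.g. 2316) to minutes since midnight.

    Hours below `night_cutoff_hour` are treated as belonging to the same
    service day as the evening before, so 0005 maps to 24*60 + 5.
    The cutoff value is an assumption made for this sketch.
    """
    hours, minutes = divmod(int(hhmm), 100)
    if hours < night_cutoff_hour:
        hours += 24
    return hours * 60 + minutes


def minutes_between(earlier, later):
    """Signed gap in minutes between two HHMM integers (later - earlier)."""
    return hhmm_to_minutes(later) - hhmm_to_minutes(earlier)


print(hhmm_to_minutes(917))      # 557
print(hhmm_to_minutes(5))        # 1445
print(minutes_between(2359, 2))  # 3
```

Working in minutes on a single extended day avoids special cases when an estimation made at 23:59 refers to a train arriving at 00:02.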

### The processing phase itself
The following function implements the actual data processing.

Because new errors kept surfacing as the processing and our analysis progressed, we went back and forth many times between preprocessing and processing. With the experience gained, we would certainly do things very differently if we had to start over.

In [20]:

def processing(station_idx, nb_requetes, nb_station, requests, direction):

    """
    This function takes a station index (between 0 and 46), a direction (1 for "aller" or -1 for "retour"),
    the preprocessed request set, the number of stations and the number of requests per station in the given dataset. 
    
    It reads every line, starting from the beginning, and tracks the different trains.
    
    The algorithm uses practical criteria to decide whether a train has arrived, whether it must be removed from the tracked trains, etc.
    """
    # init of some parameters useful for what's next
    local_offset = 1 # because of the 0.0 we added between the messages and the estimations.
    dataset = [] # to store the data
    
    max_difference_last_current_est = 3 # to delete a train when the difference between the current estimation and the next one is bigger than 3.
    max_diff_late = 5
    min_new_train = 6
    max_deleting_no_message = 2
    max_difference_last_current_est_spotted_missing_train = 2
    
    count = station_idx # count allows us to only check the lines of "requests" concerning the station station_idx.
    
    # get number of trains of 1st request
        # here we compute the indexes of the first estimation and of the last one
    qty = requests[count][2] + local_offset # total number of trains, useful to check messages (as "ARRIVED" or rather -2 in our code)
    start = offset
    end = offset + requests[count][4]
    indx = 4
    if direction == -1 :
        indx = 6
        start += requests[count][4]
        end += requests[count][6]
        
    # initializing linkedlist
    linkedlist = linkedList(None, None, None, 0) 
    
    # initializing with first request
    time_request = requests[count][1]
    for j in range(start, end):
        if (requests[count][j+qty] != code["NO PASSENGERS"] or requests[count][j+qty] != code["NO STOP"]): # always true as written (a value cannot equal both codes), so NO STOP and NO PASSENGERS trains are tracked here as well.
            if(requests[count][j+qty] != code["NA"]): # NA means there were a "Voie .." - so we can simply write the "current" instead
                train = Train(requests[count][j], requests[count][j+qty], None, None, time_request)
            else :
                train = Train(requests[count][j],requests[count][j], None, None, time_request)
            if(requests[count][j+qty] == -2 or requests[count][j+qty] == -5): # we are at the station, -5 means we are departing from the station
                train.set_arrival_time(requests[count][1]) # just so the delete function is happy, the train is not added to our dataset though
            linkedlist.add_queue(train) # and we add the train to the linkedlist

    
    count += nb_station # update count after init

    # update
    for i in range(1, nb_requetes): # loop over all the requests
        try :
            request_time = requests[count][1] # at some point there was an error (file 3)
        except IndexError:
            break
        # get the number of trains for the current (station, request)
        qty = requests[count][2] + local_offset # total number of trains, useful to check messages (as "ARRIVED")
        start = offset 
        end = offset + requests[count][4]
        if direction == -1 : 
            start += requests[count][4]
            end += requests[count][6]
            
        print("  ")
        print("  ")
        

        ##print(" end - start " + str(end-start))
        #print(" lenght : " + str(linkedlist.get_lenght()))
        """
        The following piece of code is a fairly recent idea: maybe we can guess whether there is a new train
        by checking the last trains, which are less likely to change.
        It can give us some idea of what's going on. But it also adds complexity ...
         1. Check whether or not there is a NO PASSENGERS or NO STOP in the request. -> sometimes a random train would appear
         and in most cases it was because of a NO PASSENGERS or NO STOP. So we simply ignored all those cases (even if some could have been used)
             -> TODO : refine this so that only the cases that do not work are ignored
         2. If there is not : check if a train is missing (we don't go further than one for now)
        """
        nb_of_added_train = 0
        no_stop_or_no_passengers = False 
        try :
            for j in range(start+qty,end + qty): # remember that qty already includes local_offset
                if(requests[count][j] == code["NO STOP"] or requests[count][j] == code["NO PASSENGERS"]):
                    no_stop_or_no_passengers = True
                    break
        except IndexError:
            continue
            
        if(linkedlist.get_lenght() >= end-start and not no_stop_or_no_passengers): # it's possible we already deleted a train ...
            #print(requests[count])
            train = linkedlist.get_tail() # how can the tail be None?
            if(train != None):
                #print(train.toString())
                pass
            for j in range(end-1, start-1, -1):
                current = requests[count][j] # we start at the end
                current_state = requests[count][j+qty]
                if(abs(substract_times(train.get_last_estimation(), current))>min_new_train):
                    # it means that probably a train has left - it is beyond the station 
                    nb_of_added_train += 1
                    break # we'll just do that for now
                else :
                    break
                    
        # loop over the new estimations / messages
        train = linkedlist.get_root() # we get the root 
        j = start
        print("New passage : " + str(i) + " ; request time: " + str(request_time))
        linkedlist.toString()
        #print("  ")
        #print("trains missing : " + str(nb_of_added_train))
        
        while(j < end):
            """
            code["NO STOP"] = -10
            code["WAITING"] = -1
            code["ARRIVED"] = -2
            code["INCOMING"] = -3
            code["LATE"] = -4
            code["NO PASSENGERS"] = -100
            code["TERMINUS"] = -5
            code["NA"] = -1000
            code["DEPARTURE"] = -5
            code["DEPARTURE SOON"] = -6
            """
            try :
                current = requests[count][j] # it's an INT (actual estimated arrival time)
                current_state = requests[count][j+qty]
                print(" ----> current estimation : " + str(current) + " ; current state : " + str(current_state))
            except IndexError:
                break
                
            if(current_state == code["NO PASSENGERS"] or current_state == code["NO STOP"]): # NO PASSENGER / NO STOP are ignored for now
                j+=1
                continue
                
                
            # we delete the train that we were able to spot by looking at the last estimations
            # we still check that there is indeed a big enough discontinuity between the new estimate and the last one
            if(train != None and nb_of_added_train > 0 and abs(substract_times(train.get_last_estimation(), request_time)) < max_difference_last_current_est_spotted_missing_train):
                train.set_arrival_time(request_time)
                dataset.append(train.toList(station_idx, direction))  # adding to the dataset
                train_tmp = train.get_next()
                print( " DELETED : "  + train.toString()  + "SPOTTED MISSING TRAIN ")
                linkedlist.remove(train) # removing
                train = train_tmp
                nb_of_added_train -= 1 
                continue 

            if(train != None): # if train != None, then the current columns we are reading do not belong to a new train (if we did things correctly)
                print(train.toString()) # we print it
                
                
                A = (not train.get_isLate() or abs(substract_times(train.get_last_estimation(), current))>max_diff_late) and current_state != code["LATE"]
                B = not (train.get_previous() != None and train.get_previous().get_isLate())
                C = abs(substract_times(train.get_last_estimation(), current)) >  max_difference_last_current_est # can we use a comparison with the next one instead 
                D =(train.get_next()!= None) and (abs(substract_times(train.get_last_estimation(), current)) > abs(substract_times(train.get_next().get_last_estimation(), current)))
                # this is basically to check whether we should delete the train or not. This started with few conditions, but others were added as new problems appeared.
                if (A and B and (C or D)): 
                    print("Time difference between current and last estimation : " + str(abs(substract_times(train.get_last_estimation(), current))))
                    
                    if(train.get_current_state() == code["INCOMING"]): # the train stopped but we did not get any "ARRIVED" -  so we have to add a time of arrival 
                        train.set_arrival_time(request_time)
                        dataset.append(train.toList(station_idx, direction))
                        
                    elif(train.get_current_state() == code["DEPARTURE SOON"]): # the train departed but we did not get any "DEPARTURE" -  so we have to add a time of departure 
                        train.set_arrival_time(request_time)
                        dataset.append(train.toList(station_idx, direction))
                    
                    elif(abs(substract_times(train.get_last_estimation(), request_time))<max_deleting_no_message) : # not sure if it's still relevant considering the preprocessing we added
                        train.set_arrival_time(request_time)
                        dataset.append(train.toList(station_idx, direction))
                        break
                        
                    train.set_arrival_time(request_time) 
                    """
                    Note that the previous line is NOT supposed to be here... The initial idea was to make this "if" the one that deletes trains, while
                    the others would only set the arrival time. So if we entered this "if" without an arrival time already set, 
                    "linkedlist.remove(train)" would fail (the remove function checks that the train already has an arrival time and
                    raises an error otherwise).
                    We lost many hours trying to fix every issue (which did help us understand the dataset better)
                    without managing to (some errors seem too difficult to fix), so we kept this line.
                    """
                    train_tmp = train.get_next()
                    print("DELETED :" + train.toString())
                    linkedlist.remove(train)
                    train = train_tmp
                    continue # when we do that, we lose the current request, not good
                
                if(current_state != code["NA"]): #
                    
                    if((train.get_current_state() == code["ARRIVED"] and current_state != code["ARRIVED"]  and current_state != code["WAITING"])):
                        # this was added when a problem appeared at station 10 : a train was not deleted because it was slightly late (but never got a LATE message) 
                        # and too close to the next train.
                        # since we don't delete trains immediately, this train was kept even though its code changed from ARRIVED to something else, which is not possible
                        # so we have to delete this train here
                        train_tmp = train.get_next()
                        print("has been deleted because it arrived -> fixed in checking state")
                        linkedlist.remove(train)
                        train = train_tmp
                        continue
                    
                train.set_current_state(current_state)
                train.set_last_estimation(current) 
                if current_state == code["LATE"] :
                    print(" ----> LATE or still LATE")
                    train.set_isLate(True)
        
                
            else : # we can add it - we've never seen this train before
                if (current_state not in (code["NO STOP"], code["NO PASSENGERS"])): # always true here: these two states were skipped at the top of the loop
                    if(current_state == code["NA"]):
                        train = Train(current, current, None, None, request_time)
                    else:
                        train = Train(current, current_state, None, None, request_time)
                    if(current_state == -2 or current_state == -5): 
                        train.set_arrival_time(request_time) # 
                    linkedlist.add_queue(train)
                
                
            """
            The 3 conditions that follow set the arrival time in 3 different ways. 
            After some issues occurred, we added the possibility to delete the train by checking the next request and hoping that 
            no earlier train has departed... 
            """
            if(train.get_current_state() == code["ARRIVED"]):
                 # if the train is arrived, then we have to treat it
                train.set_arrival_time(request_time)
                dataset.append(train.toList(station_idx, direction))
                # we'll try to delete here, by checking on the next one
                if(count + nb_station < len(requests)): # only if a next request exists. The issue here is that if a previous train departed => trouble, but it works ok for ARRIVED 
                    try:
                        if( train.get_isLate() and requests[count+nb_station][j+qty] not in (code["ARRIVED"], code["WAITING"])): # qty already has offset
                            train_tmp = train.get_next()
                            #print("has been delete because it arrived (check next)")
                            linkedlist.remove(train)
                            train = train_tmp
                            j+=1
                            continue
                    except Exception: # e.g. the next request is shorter than the current one
                        break
                
                
            elif(train.get_current_state() == code["DEPARTURE"]):
                train.set_arrival_time(request_time)
                dataset.append(train.toList(station_idx, direction))
            
                if(count + nb_station < len(requests)): # only if a next request exists
                    try :
                        if(requests[count+nb_station][j+qty] != code["DEPARTURE"] and abs(substract_times(train.get_last_estimation(), requests[count+nb_station][j]))> max_difference_last_current_est): # careful ... the local offset...
                            train_tmp = train.get_next()
                            #print("has been delete because it DEPARTED (check next)")
                            linkedlist.remove(train)
                            train = train_tmp
                            j+=1
                            continue
                    except Exception: # e.g. the next request is shorter than the current one
                        break
                
            elif(train.get_current_state() == code["TERMINUS"]): # in this case, it stays at TERMINUS forever. Maybe we should try to replace it by the time ? Or we completely ignore it.
                # NOTE: in the code table above, TERMINUS and DEPARTURE share the value -5, so the DEPARTURE branch catches these trains first.
                # it means we are leaving a station, but we don't really want to add such things to the dataset, right ?
                train.set_arrival_time(request_time)
                dataset.append(train.toList(station_idx, direction))
                
                if(count + nb_station < len(requests)): # only if a next request exists
                    if( (requests[count+nb_station][j+qty] != code["TERMINUS"] ) and abs(substract_times(train.get_last_estimation(), requests[count+nb_station][j]))> max_difference_last_current_est):
                        train_tmp = train.get_next()
                        #print("has been delete because it arrived to TERMINUS (check next)")
                        linkedlist.remove(train)
                        train = train_tmp
                        j+=1
                        continue
                
            j+=1
            train = train.get_next()

        count += nb_station

    return dataset
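
The deletion test in the function above combines four conditions (A, B, C, D). As a rough standalone sketch, it can be isolated in a pure function and tried on a few values. The names `should_drop`, `minutes_between` and the keyword arguments are hypothetical. The default thresholds mirror `max_difference_last_current_est = 3` and `max_diff_late = 5`, condition D is read here as a comparison of the two absolute gaps, and the time difference is assumed to be a same-day gap in minutes, which may differ from the notebook's `substract_times`.

```python
def _to_minutes(hhmm):
    """HHMM integer to minutes since midnight (same-day times only)."""
    return (hhmm // 100) * 60 + hhmm % 100


def minutes_between(a, b):
    """Signed gap in minutes between two HHMM integers (b - a)."""
    return _to_minutes(b) - _to_minutes(a)


def should_drop(last_est, new_est, next_last_est=None, *,
                is_late=False, previous_is_late=False,
                new_state_is_late=False, max_jump=3, max_jump_late=5):
    """Decide whether the tracked train no longer matches the new estimation."""
    jump = abs(minutes_between(last_est, new_est))
    # A: a train already flagged late is tolerated up to a larger jump,
    #    and nothing is dropped while the new message itself says LATE.
    a = (not is_late or jump > max_jump_late) and not new_state_is_late
    # B: do not drop while the preceding tracked train is itself late.
    b = not previous_is_late
    # C: the estimation moved by more than the allowed jump.
    c = jump > max_jump
    # D: the new estimation is closer to the next tracked train.
    d = (next_last_est is not None
         and jump > abs(minutes_between(next_last_est, new_est)))
    return bool(a and b and (c or d))


print(should_drop(1102, 1103))                       # False: 1 minute drift
print(should_drop(1102, 1120, next_last_est=1119))   # True: 18 minute jump
print(should_drop(1102, 1106, is_late=True))         # False: late, jump <= 5
print(should_drop(1104, 1106, next_last_est=1107))   # True: closer to next
```

Isolating the rule this way makes each threshold testable on its own, which is hard to do inside the long tracking loop.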

In [21]:
# testing ...

dataset_ = []
dataset = final_preprocessing(raw_data_1)
t1 = time.time()
dataset_.append(processing(1, nb_requests-20, nb_stations, dataset, -1)) # the last request is 1h after ... # 14
dt = time.time() - t1
"""
code["NO STOP"] = -10
code["WAITING"] = -1
code["ARRIVED"] = -2
code["INCOMING"] = -3
code["LATE"] = -4
code["NO PASSENGERS"] = -100
code["TERMINUS"] = -5
code["NA"] = -1000
code["DEPARTURE"] = -5
code["DEPARTURE SOON"] = -6
"""
print(dataset_)


[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 927, 933, 932, 941, 917, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]
[0, 917, 12, 1, 6, -1, 6, 918, 934, 949, 1004, 1019, 1034, 917, 927, 932, 933, 941, 1006, 0.0, -6, -6, -6, -6, -6, -6, -5, -5, -100, -5, -5, -5]

[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 950, 948, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -5, -1000]
[10, 917, 12, 1, 6, -1, 6, 925, 930, 939, 945, 955, 1000, 919, 931, 942, 943, 948, 950, 0.0, 925, -6, 939, -6, 955, -6, 919, 931, -5, 943, -1000, -5]

[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 947, 946, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 947, 946]
[11, 917, 12, 1, 6, -1, 6, 927, 932, 941, 947, 957, 1002, 917, 928, 940, 941, 946, 947, 0.0, -10, 932, -10, 947, -10, 1002, -2, 928, 940, 941, 946, 947]

[12, 917, 12, 1, 6, -1, 6, 917, 928, 934, 942, 949, 958, 927, 938, 939, 946, 944, 951, 0.0, -3, -10,


[10, 1342, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1342, 1348, 1405, 1416, 1413, 1424, 0.0, -1000, -6, 1355, -6, 1410, -6, -5, 1348, 1405, -5, -1000, -5]
[10, 1342, 12, 1, 6, -1, 6, 1345, 1345, 1355, 1400, 1410, 1415, 1342, 1348, 1405, 1413, 1416, 1424, 0.0, -1000, -6, 1355, -6, 1410, -6, -5, 1348, 1405, -1000, -5, -5]

[11, 1342, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1403, 1413, 1412, 1422, 1427, 0.0, -10, 1347, -10, 1402, -10, 1417, 1346, -10, 1413, -10, 1422, -10]
[11, 1342, 12, 1, 6, -1, 6, 1347, 1347, 1357, 1402, 1412, 1417, 1346, 1403, 1412, 1413, 1422, 1427, 0.0, -10, 1347, -10, 1402, -10, 1417, 1346, -10, -10, 1413, 1422, -10]

[18, 1342, 12, 1, 6, -1, 6, 1343, 1349, 1356, 1353, 1359, 1406, 1345, 1354, 1400, 1400, 1403, 1409, 0.0, -10, 1349, -10, 1353, 1359, -10, 1345, -10, 1400, 1401, -10, 1409]
[18, 1342, 12, 1, 6, -1, 6, 1343, 1349, 1353, 1356, 1359, 1406, 1345, 1354, 1400, 1400, 1403, 1409, 0.0, -10, 1349, 1353, -10, 1359, -10, 1345, -10, 

 first estimated time : 1147 last estimated time : 1147  time first request: 1053 ; state : 1147 ; is late : False
 first estimated time : 1202 last estimated time : 1202  time first request: 1053 ; state : 1202 ; is late : False
 first estimated time : 1217 last estimated time : 1217  time first request: 1053 ; state : 1217 ; is late : False
 ----> current estimation : 1103 ; current state : 1103
 first estimated time : 1102 last estimated time : 1104  time first request: 950 ; state : 1104 ; is late : False
 ----> current estimation : 1120 ; current state : 1120
 first estimated time : 1117 last estimated time : 1119  time first request: 1010 ; state : 1119 ; is late : False
 ----> current estimation : 1138 ; current state : 1138
 first estimated time : 1138 last estimated time : 1138  time first request: 1053 ; state : 1138 ; is late : False
 ----> current estimation : 1147 ; current state : 1147
 first estimated time : 1147 last estimated time : 1147  time first request: 1053 ; sta

In [22]:
print("Processed time for the first station - file 1 : " + str(round(dt,2)) + " secs.")

Processed time for the first station - file 1 : 0.29 secs.


### Conversion to a .csv file

Here we apply the preprocessing, then the processing, to all the data read at the beginning of the notebook. The final goal is to convert the result to .csv format.

After a few trials, some stations were removed because they gave results that were too poor. Given the time we had left, we preferred to move on to a more "data science" phase using the data we had, imperfect as it was.

In particular, all the data 

In [23]:
# Convert the processed data of one raw file to a .csv file
from numpy import asarray
from numpy import savetxt

def make_training_data_to_csv(name_raw_data, sens, name):
    nb_stations, nb_requests = get_data_nbs(name_raw_data)
    dataset = final_preprocessing(name_raw_data)
    doesnotwork = []      # stations for which processing() raised an error
    processed_data = []   # one list of arrived trains per processed station
    for k in range(1, nb_stations):
        # Skipped stations: 10 and 44 in direction -1 only, 23 in both directions
        if sens == -1 and k in (10, 44):
            continue
        if k == 23:
            continue
        #print("Station : " + str(k))
        try:
            processed_data.append(processing(k, nb_requests-1, nb_stations, dataset, sens))
        except (ValueError, NameError):
            # Record the failing station instead of storing it with the results,
            # so that processed_data only ever contains lists of trains
            print("Did not work for station " + str(k))
            doesnotwork.append(k)
    print("Stations that do not work : ")
    print(doesnotwork)
    # Flatten the per-station lists into a single list of rows (one per train)
    trainingdata = [row for station_rows in processed_data for row in station_rows]
    print("Nb of arrived trains :")
    print(len(trainingdata))
    savetxt(name, asarray(trainingdata), delimiter=',')
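
The flattening and export step can be tried in isolation. The sketch below is only an illustration: the rows are made-up three-column examples (the real rows returned by `processing` are longer), and the file name `trainingdata_example.csv` is hypothetical. It assumes every row has the same length, which `asarray` needs in order to build a 2-D array.

```python
import os
import tempfile

import numpy as np

# Hypothetical output of processing() for three stations: each station
# yields a list of fixed-length numeric rows (one row per arrived train).
processed_data = [
    [[11, 917, 927], [11, 917, 932]],  # station with two arrived trains
    [],                                # station with no arrived train
    [[12, 917, 928]],                  # station with one arrived train
]

# Flatten the per-station lists into one list of rows.
training_rows = [row for station_rows in processed_data for row in station_rows]

# Write the rows to a temporary .csv file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "trainingdata_example.csv")
np.savetxt(path, np.asarray(training_rows), delimiter=",")
reloaded = np.loadtxt(path, delimiter=",")

print(reloaded.shape)
```

Note that `savetxt` writes values in floating-point notation by default, so integer times such as `917` are read back as floats.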

The output of the following cell has been cleared because the number of lines was enormous. It showed the preprocessing and processing results for the four files. 

In [None]:
# sens = 1 : "aller" direction ; sens = -1 : "retour" direction
make_training_data_to_csv(raw_data_1, 1, "trainingdata_1_aller.csv")
make_training_data_to_csv(raw_data_2, 1, "trainingdata_2_aller.csv")
make_training_data_to_csv(raw_data_3, 1, "trainingdata_3_aller.csv")
make_training_data_to_csv(raw_data_4, 1, "trainingdata_4_aller.csv")

make_training_data_to_csv(raw_data_1, -1, "trainingdata_1_retour.csv")
make_training_data_to_csv(raw_data_2, -1, "trainingdata_2_retour.csv")
make_training_data_to_csv(raw_data_3, -1, "trainingdata_3_retour.csv")
make_training_data_to_csv(raw_data_4, -1, "trainingdata_4_retour.csv")
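
The eight calls follow a regular naming scheme, so the list of jobs could also be generated. The sketch below only builds the (file index, direction, output name) triples and does not call `make_training_data_to_csv`; the names `directions` and `jobs` are ours and do not appear elsewhere in the notebook.

```python
# Map each direction code to the label used in the output file names.
directions = {1: "aller", -1: "retour"}

# One job per (direction, raw file): file index, direction code, output name.
jobs = [
    (file_index, sens, f"trainingdata_{file_index}_{label}.csv")
    for sens, label in directions.items()
    for file_index in range(1, 5)
]

for job in jobs:
    print(job)
```

Each triple could then be passed to `make_training_data_to_csv` together with the matching `raw_data_*` variable.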

# Conclusion 

The data produced here are analyzed in part 3 of our study.

Regarding the work carried out here, we feel that many improvements could be made: in the architecture of the code, in the train-tracking strategy, and in the implementation itself.

In particular, with what we now understand of the dataset, we could probably, starting almost from scratch, read the data more finely and produce results in which we would have more confidence. This could be done, for example, by first categorizing all the possible cases (many of which we have already discovered) and then devising the best strategy for handling each of them.

Indeed, as you may have noticed, this notebook was not written linearly: there was a lot of going back and forth, and many additions were made. The result is somewhat patchwork code, with pieces added here and there as new problems appeared.