# Data Gathering and Management Lab - 094290
# Final Task - Bus Trip App
#### Nitzan Shamir  & Omer Shubi 

# Project Overview

The project consists of multiple notebooks that work and interact together.

*Lab4_Main* (this notebook) includes explanations and descriptions about Part 1 (Preprocessing & App) and Part 2 (EDA, Article and App).

*Lab4_Preparations* consists of the preprocessing part, in which we also train and save the ML model, and the relevant helper models (indexer and encoder, for categorical variables).

*Lab4_UI_part1* - The front-end of the first part of our App.

*Lab4_UI_part2* - The front-end of the second part of our App.

*Lab4_functions* includes all the functions we used throughout the project, including back-end connection functions (such as connecting to Elastic).

# Part 1 – Warmup

In this part we provide an application that reads stream or batch data and applies our prediction task to it.

Recall our task from before: *Predict whether a bus trip includes congestion*

We define a 'bus trip' as a single run of a bus along a specific journey pattern, from start to finish.

Specifically, each trip is defined by the following features:

 * Numerical features:
   * Total distance covered
   * Total trip time
 * Categorical features:
   * Whether the trip was during the weekend or not
   * Area id (all areas passed through)
   * Journey pattern id
   * Hour group (part of day)
 * Label:
   * isCongestion
   
Partitioning the data into trips is done by aggregating (groupBy) over journeyPatternId, vehicleId, date and hourGroup.
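
The grouping step can be illustrated with a small plain-Python sketch. The project performs this aggregation in Spark (see *Lab4_Preparations*); the version below only mirrors the logic. The raw-record field names `distance` and `elapsed`, and the choice to sum them per trip, are assumptions made for the example; the four grouping keys come from the description above.

```python
from collections import defaultdict

# Keys that identify a single bus trip (taken from the description above).
TRIP_KEYS = ("journeyPatternId", "vehicleId", "date", "hourGroup")


def aggregate_trips(records):
    """Group raw sensor records into trips and sum distance and time.

    `distance` and `elapsed` are hypothetical raw-record field names.
    """
    totals = defaultdict(lambda: {"TotalDistance": 0.0, "TotalTime": 0.0})
    for rec in records:
        key = tuple(rec[k] for k in TRIP_KEYS)
        totals[key]["TotalDistance"] += rec["distance"]
        totals[key]["TotalTime"] += rec["elapsed"]
    # Flatten back into one dictionary per trip.
    return [dict(zip(TRIP_KEYS, key), **agg) for key, agg in totals.items()]


raw_records = [
    {"journeyPatternId": "10001", "vehicleId": 33320, "date": "2017-11-10",
     "hourGroup": 4, "distance": 1.5, "elapsed": 120.0},
    {"journeyPatternId": "10001", "vehicleId": 33320, "date": "2017-11-10",
     "hourGroup": 4, "distance": 2.0, "elapsed": 180.0},
    {"journeyPatternId": "10001", "vehicleId": 33470, "date": "2017-08-01",
     "hourGroup": 3, "distance": 0.5, "elapsed": 60.0},
]

trips = aggregate_trips(raw_records)
```

The first two records share all four keys, so they collapse into one trip, while the third record forms a trip of its own.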

The model we use is a **Logistic Regression model**. We trained the model offline on 80% of the batch data (the code is in the *Lab4_preparations* notebook).
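
For readers unfamiliar with the model, the following sketch shows how a trained logistic regression turns trip features into a congestion prediction. The weights, the bias and the 0.5 threshold are made-up illustrative values, not the parameters of our trained model, and the categorical features are omitted for brevity.

```python
import math


def predict_congestion(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one trip using a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, int(probability >= threshold)


# Hypothetical weights for [TotalDistance, TotalTime]; illustrative only.
example_weights = [0.05, 0.0004]
example_bias = -3.0

prob, label = predict_congestion([12.4, 3839.0], example_weights, example_bias)
```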

In addition, as in lab 2, we define a record in the raw data as **uncertain** if ellapsedTime > 600 or ellapsedTime < 0. We handle uncertain records by removing them, as this method yielded the best results.
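
A minimal sketch of this rule in plain Python is shown below. The `id` field and the sample values are made up for the example; the thresholds and the `ellapsedTime` field name are the ones stated above.

```python
def is_uncertain(elapsed_time):
    """A raw record is uncertain if its elapsed time is above 600 or negative."""
    return elapsed_time > 600 or elapsed_time < 0


def remove_uncertain(records):
    """Keep only records whose elapsed time lies in the accepted range [0, 600]."""
    return [rec for rec in records if not is_uncertain(rec["ellapsedTime"])]


sample = [
    {"id": 1, "ellapsedTime": 30},
    {"id": 2, "ellapsedTime": -5},    # negative: uncertain
    {"id": 3, "ellapsedTime": 601},   # above 600: uncertain
    {"id": 4, "ellapsedTime": 600},   # boundary value is kept
]

clean = remove_uncertain(sample)
```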

Moreover, as in lab 3, we **hydrate** the data with popular locations and widespread events, adding two features to each bus trip: isEventDate and isPassedLocation.
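
One plausible way to compute the two hydration flags is sketched below. The actual matching rules live in *Lab4_Preparations*; here we assume, purely for illustration, that events are given as a set of dates, that popular locations are (latitude, longitude) pairs, and that a trip "passes" a location when any of its points lies within a hypothetical 0.5 km radius. The `points` field and all sample values are likewise assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def hydrate_trip(trip, event_dates, popular_locations, radius_km=0.5):
    """Return a copy of the trip with isEventDate and isPassedLocation added."""
    passed = any(
        haversine_km(lat, lon, p_lat, p_lon) <= radius_km
        for lat, lon in trip["points"]
        for p_lat, p_lon in popular_locations
    )
    return {**trip,
            "isEventDate": int(trip["date"] in event_dates),
            "isPassedLocation": int(passed)}


# Illustrative inputs (not taken from the real hydration data).
event_dates = {"2018-03-17"}
popular_locations = [(53.3498, -6.2603)]

central_trip = {"date": "2018-03-17", "points": [(53.3500, -6.2600), (53.40, -6.30)]}
remote_trip = {"date": "2018-03-18", "points": [(53.45, -6.10)]}

hydrated_central = hydrate_trip(central_trip, event_dates, popular_locations)
hydrated_remote = hydrate_trip(remote_trip, event_dates, popular_locations)
```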

### The app flow:

1. The application lets the user provide a path from which to load batch data, or an IP address from which to load stream data.
2. It lets the user filter the bus trips to predict on.
3. Then, behind the scenes, it:
  1. aggregates the data into bus trips,
  2. predicts congestion on the bus trips using the pre-trained model,
  3. uploads the result to our Elastic DB.
4. Finally, it displays the journeys (journeyPatternId) with congestion on an interactive map.

### Link to our application: [DUBUS - Dublin Bus Trips - Congestion](https://eastus.azuredatabricks.net/?o=6694791539123117#notebook/1325942436209506/dashboard/1325942436209514/present)

### Business Use Case:

First of all, by defining and computing bus trips (as opposed to working with the raw bus sensor log records), we allow decision makers to focus on more abstract and meaningful tasks. The task we chose can be incorporated as part of a bigger system: given the above features, the model predicts whether there is congestion during the trip.
For example, one can use the correlations and possible causal effects that the model has learned to improve the real-world system.
Additionally, when planning a new line, route planners can provide input (assumed or computed from real-world knowledge) and predict the probability of congestion during the trip, allowing for better decision making.

**Below are visualizations of our application in use.**

In [4]:
from IPython.display import Image
Image(url="https://drive.google.com/uc?id=1GSDmRRKhauj1Iw813Kj3YdDW0ijWmW09",  width=1200)

In [5]:
Image(url="https://drive.google.com/uc?id=1eEJvJtOy-iO8zLrYkaT91wU6p796A0ty",  width=1200)

# Part 2 – New Task

In this part we define a task over the Dublin dataset based on the paper *Overview of Data Exploration Techniques* by Stratos Idreos (Harvard), Olga Papaemmanouil (Brandeis) and Surajit Chaudhuri (Microsoft Research).

The article is about modern-day data exploration. 

We define our task over the Dublin dataset in a similar manner to the paper. More specifically, our task is defined on the aggregated data (bus trips) presented in Part 1 above.

We present an application that allows the user to perform modern-day data exploration, following the article's definition of such exploration. More specifically, we provide a collection of advanced visualization & data exploration tools, bundled in a convenient interface.

Among other things, it lets the user:
1. Get a dynamic overview of the data they uploaded via an interactive dashboard.
2. Investigate the data further in an intuitive, code-less way, using state-of-the-art tools.
3. Get quick query result estimates and visualizations using sampled data.


### Link to our application: [DUBUS - Dublin Bus Trips - Analytics](https://eastus.azuredatabricks.net/?o=6694791539123117#notebook/2483473424244723/dashboard/1109751670127317/present)

**Visualizations of our application in use are shown further below.**


## Business Use Case

**Why should a user use our app for data exploration rather than explore directly in Kibana?**
* The app wraps the interface together with Elastic, adding the data to the Elastic index automatically.
* The app automatically updates the visualizations in Kibana and shows them to the user inside the app.
* Our code takes raw data as input and aggregates it into bus trips, so the user can explore the **bus trips**.
* We present a dashboard with a collection of useful visualizations that we selected to allow easy data exploration.
* We researched the advanced Kibana and Elastic features. Using our app, the user has a guide and an easy interface for using them.
* As a result, the user does not need to produce any graph of their own. They can use our graphs in the interactive dashboard, produce graphs automatically with the automatic visualization part of the app, or see recommended visualizations on the sampled data.



## Relation to Article

We now explain how the application utilizes the knowledge from the paper, how it is different, and the limitations in our case.

First of all, the paper surveys recent developments in the emerging area of database systems tailored for data exploration.
In the main section, the authors discuss new ideas on how to store and access data, as well as new ideas on how to interact with a data system so that users and applications can quickly figure out which parts of the data are of interest. Additionally, they discuss lessons learned from past research, the new challenges that data exploration creates, emerging applications and future research directions.

Specifically:

1. **User Interaction** - Query Result Visualization & Exploration Interfaces: advanced visualization tools and alternative exploration interfaces for big data exploration tasks. The article divides these into three sub-categories:
  - systems that assist SQL query formulation.
  - systems that automate the data exploration process by identifying and presenting relevant data items.
  - novel query interfaces such as keyword search queries over databases and gestural queries.

  This comes into play by:
  * Automatic query formulation based on keywords.
  * Automatic recommendations of visualizations according to the queries.
  * Interactive visualization tools that assist users in navigating the underlying data structures.

  When building our app we mainly focused on the above part. 

  Using Kibana Lens, data transforms into visualizations with a single mouse gesture.

  As can be seen in the app, the user can drag & drop a field of interest into the center part of the display. 

  Then, the system automatically provides smart suggestions of alternative ways to visualize the data. It does so by combining common usage patterns with relevance ranking to present an optimal visualization type.

  If the user is not satisfied with the result they can easily change the underlying query, visualization type or filters, all with an easy-to-use, intuitive UI.

  This elegantly handles the first two points of manually formulating the query and manually configuring visualizations, in an efficient manner.

  For the third point, we showcase an interactive dashboard.

  The dashboard provides the user with the capability to both get an overview of the data and to dive into the details:
  - The first is accomplished by clean, high level information, such as unique counts and general distributions of the different fields. 
  - The second is achieved by allowing the user to zoom in on or filter the data with a simple mouse selection or a click on the part they are interested in.

  Note that when additional data is uploaded, the visualizations are updated in real time based on the underlying index, and no coding is required to view the changes.
  
  Specifically, as we are dealing with aggregated bus trips based on the Dublin dataset, the visualizations include the specific fields that we have: numerical fields such as time and distance, and categorical fields such as hour group, journey pattern ID and the boolean flags isWeekend and isEventDate. The visualization types are chosen according to the field type and the most common metrics (such as mean and max).

2. **Middleware** - 
  The article mainly refers to different kinds of data prefetching and query approximation that reduce wait times in the front-end visualizations.

  *Data prefetching & preloading* - The article covers research that aims to reduce the overall exploration time through result prefetching techniques. In our work, data prefetching comes into play behind the scenes, in Elastic & Kibana, in several ways.

  Caching: the results of queries used in the filter context are cached in the node query cache for fast lookup. The cache uses an LRU eviction policy: when the cache is full, the least recently used query results are evicted to make way for new data.

  Bitsets: Elastic represents the set of documents that match a filter as a *bitset*. When Elasticsearch determines that a bitset is likely to be reused, it caches it directly in memory. These cached bitsets are “smart”: they are updated incrementally. As new data is indexed, only the new documents need to be added to the existing bitsets, rather than recomputing the entire cached filter over and over.

  Independent query caching: the bitsets belonging to a query component are independent of the rest of the search request. Once cached, a query component can therefore be reused in multiple search requests, regardless of the "context" of the surrounding query. This lets caching accelerate the most frequently used portions of queries without wasting overhead on the less frequent or more volatile portions.
 
  We note a limitation in our case: advanced methods such as background execution of similar speculative queries are not implemented.
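
The caching behaviour described above (LRU eviction plus incremental updates of cached filter results) can be illustrated with a toy sketch. This is not Elasticsearch's implementation: the `FilterCache` class, its method names and the use of Python sets in place of real bitsets are all assumptions made for the example.

```python
from collections import OrderedDict


class FilterCache:
    """Toy LRU cache mapping a filter name to the set of matching doc IDs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # name -> (predicate, matching IDs)

    def get(self, name):
        """Return cached matches and mark the entry as most recently used."""
        if name not in self._entries:
            return None
        self._entries.move_to_end(name)
        return self._entries[name][1]

    def put(self, name, predicate, docs):
        """Cache a filter result, evicting the least recently used if full."""
        matches = {doc_id for doc_id, doc in docs.items() if predicate(doc)}
        self._entries[name] = (predicate, matches)
        self._entries.move_to_end(name)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used

    def on_index(self, doc_id, doc):
        """Incrementally update every cached result for one new document."""
        for predicate, matches in self._entries.values():
            if predicate(doc):
                matches.add(doc_id)


docs = {1: {"isWeekend": 1, "hourGroup": 2}, 2: {"isWeekend": 0, "hourGroup": 3}}
cache = FilterCache(capacity=2)
cache.put("weekend", lambda d: d["isWeekend"] == 1, docs)
cache.put("hour3", lambda d: d["hourGroup"] == 3, docs)

# Indexing a new document only touches the cached sets; nothing is recomputed.
docs[3] = {"isWeekend": 1, "hourGroup": 3}
cache.on_index(3, docs[3])

weekend_hits = cache.get("weekend")  # also marks "weekend" as recently used
cache.put("hour2", lambda d: d["hourGroup"] == 2, docs)  # evicts "hour3"
```

Because each filter is cached under its own key, the same cached result can serve any request that contains that filter, which mirrors the independent query caching described above.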

  *Query Approximation* is an extremely time-saving component when dealing with large databases. 
  According to the article, this is done with different sampling techniques, where one must control the quality of the query results while bounding their execution time.
  In our work, we showcase this point in a third dashboard that allows quick query result estimation and visualization using sampled Dublin data, where the data is sampled in a diversified & efficient manner. 
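
The idea behind sampling-based approximation can be sketched as follows. The example uses plain uniform random sampling, which is simpler than the diversified sampling used for the dashboard, and runs on synthetic values rather than the Dublin data; the function name and parameters are hypothetical.

```python
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample, with its standard error."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, std_error


# Synthetic values standing in for a numerical column such as TotalTime.
population = [float(t) for t in range(10_000)]

estimate, std_error = approximate_mean(population, sample_size=1_000)
exact = statistics.fmean(population)
```

The estimate is computed from a tenth of the data, and the standard error gives a rough handle on its quality: a larger sample shrinks the error at the cost of a longer execution time.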

3. **Database Layer** -

  The article refers to this as work that aims at rethinking database architectures at their core, reconsidering the fundamental methods to store and access data to match exploration patterns.

  In general, Elastic is a suitable data warehouse layer for these tasks. This is because Elasticsearch indexes all data in every field, and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. So, after data is stored and indexed, it is fully searchable in near real-time. This is in contrast to more traditional data warehouse structures, which would require more time to iterate over all the data in order to answer queries.
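
To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not how Elasticsearch stores its indices internally; the sample documents and function names are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every token of the query."""
    token_sets = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


docs = {
    1: "city centre to airport",
    2: "airport express route",
    3: "city centre loop",
}
index = build_inverted_index(docs)
```

A lookup such as `search(index, "city centre")` only touches the entries for the two query tokens, instead of scanning every document.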

  More specifically, the authors separate the work into four areas: adaptive indexing, adaptive loading, adaptive storage and sampling-based architectures.
  We focus on the relations to *Adaptive Loading* and *Adaptive Storage*. The two other areas are more in the research phase and not yet implemented.

  - *Adaptive Loading*: During data exploration not all data is needed. Adaptive loading relates to the notion that users can start querying a database even before all data is loaded or even leaving some parts of the data unloaded.
    In our case, when querying and displaying the map in Kibana, the data is filtered according to the coordinates that are visible on the map of the Dublin data. Only when the user scrolls through the map is the relevant data (the data in the newly visible area) loaded and added to the visualization. This enables fast usage, without waiting for all the data to load.
  - *Adaptive Storage*: Older systems rely on static layouts and build the whole architecture around a single layout. In a data exploration scenario we cannot decide a priori what a good layout is, as we do not know the exact query patterns up front, which leads to sub-optimal performance for traditional static systems. Using Elastic we are able to:
    1. Automatically detect and add new fields to the index, with dynamic mapping enabled. Elasticsearch detects and maps booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.
    2. Index the same field in different ways for different purposes. This allows for more dynamic and faster query run times, according to the usage.

    For the Dublin dataset, users can create additional indexes over the same fields according to their needs. Additionally, we explicitly define the Geo Point as a geo_point data type, since it cannot be detected automatically, but leave the other fields for Elastic to type automatically.
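
Both points can be illustrated with the kind of request bodies involved, written here as Python dictionaries. The `geo_point` type, multi-fields and the `geo_bounding_box` filter are standard Elasticsearch features, but the field name `location`, the `text` sub-field and the coordinates are assumptions for this sketch and not necessarily what our index or Kibana uses.

```python
import json

# Hypothetical index body: only `location` and `journeyPatternId` are mapped
# explicitly; every other field is left to dynamic mapping.
trip_index_body = {
    "mappings": {
        "dynamic": True,
        "properties": {
            "location": {"type": "geo_point"},
            # Multi-field: the same value indexed as keyword and as text.
            "journeyPatternId": {
                "type": "keyword",
                "fields": {"text": {"type": "text"}},
            },
        },
    }
}


def viewport_query(top, left, bottom, right, field="location"):
    """Build a geo_bounding_box filter for the visible part of the map."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top, "lon": left},
                            "bottom_right": {"lat": bottom, "lon": right},
                        }
                    }
                }
            }
        }
    }


# Approximate bounding box around central Dublin (illustrative values).
query = viewport_query(top=53.36, left=-6.30, bottom=53.33, right=-6.22)
print(json.dumps(query, indent=2))
```

Only documents whose point falls inside the box are returned, which is the adaptive-loading behaviour described above: panning the map changes the box and triggers a new, equally small request.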

As described above, we are able to implement and make use of some of the paper's suggestions for a good exploration-driven data warehouse. However, there are also limitations and differences between what the authors propose and what can be done in our case, as noted in the relevant sections.

## Notes:

In Part 1, we want the user to clearly see the data they uploaded, so by default the app shows only the most recently uploaded data on the map. To change this, change the 'overwrite' flag from True to False. Note, however, that the map may become crowded if you do not filter enough.

In Part 2, as the user is exploring their data, it makes sense to keep all the previous data and just append to it, so the default configuration is to append newly uploaded data to all the previously uploaded data.
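
The difference between the two defaults can be summarized with a tiny sketch. The `upload` function below is hypothetical and only models the effect of the 'overwrite' flag on the stored data; the real upload logic is in *Lab4_functions*.

```python
def upload(index, new_docs, overwrite):
    """Return the index contents after an upload.

    overwrite=True  -> keep only the newly uploaded documents (Part 1 default)
    overwrite=False -> append to what is already stored (Part 2 default)
    """
    return list(new_docs) if overwrite else list(index) + list(new_docs)


stored = ["trip-1", "trip-2"]
part1_view = upload(stored, ["trip-3"], overwrite=True)
part2_view = upload(stored, ["trip-3"], overwrite=False)
```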


We provide full explanations on how to set up and get the app up and running on **your** system as well. Full instructions are on the [github repo](https://github.com/scaperex/DUBUS) of the project.

However, **be aware!** There are a lot of steps that need to be taken in order to allow everything (DataBricks, Kibana, ElasticSearch) to work and communicate with each other.

For example, 
- We tested the system on specific versions of Spark, DBR, Elastic, Kibana, the VM OS and so on, and cannot provide guarantees for other software versions.
- DataBricks requires embedded webpages to support the HTTPS standard for secure connections. Therefore, you must set up an SSL certificate on your VM for it to work.

Additionally, for both usage on your system or on ours:

1. As our app is displayed inside Databricks, you must have a valid Databricks account in order to interact with it.
2. The server must be on for the app to work.
3. Due to security limitations, first-time visitors must first open [this link](https://da2020w-0001.eastus.cloudapp.azure.com:5601) in their browser and allow the necessary permissions; only then will the app be displayed correctly.
4. To avoid rerunning the whole process when filters change, open the [databricks notebook Part1](https://eastus.azuredatabricks.net/?o=6694791539123117#notebook/1325942436209506/command/2483473424243540) and the [databricks notebook Part2](https://eastus.azuredatabricks.net/?o=6694791539123117#notebook/2483473424244723/command/1109751670127311), and make sure that the Databricks widgets option in each notebook is set to `on_widget_change: do nothing`.

## [Link to a Youtube Demo of our Interactive Dashboard application](https://youtu.be/7zQYNa5ib-s)

In [9]:
Image(url="https://drive.google.com/uc?id=12eZIJ2J3cffro4nQCXi-IHiD2eXfxUjx",  width=1200)

## [Link to a Youtube Demo of our Automatic Visualizations application](https://youtu.be/C_gWaofDWhk)

In [11]:
Image(url="https://drive.google.com/uc?id=1HmWAdA8BKVplvJR_-aNGYK5WugtEsdnJ",  width=1200)

In [12]:
Image(url="https://drive.google.com/uc?id=1eVX8VRBA1iU8laO2mOGkJyGiTmcjQGnj",  width=1200)

In [13]:
Image(url="https://drive.google.com/uc?id=1E4cwgF3SbhvWayejtXcgEbVdWfmX-xev",  width=1500)

## EDA

In this phase we explore the aggregated data (bus trips), as this is the data our task is defined over.

Before implementing the defined task, we first need to decide which visualizations and interactive parts to display and offer to the user.

We do so by first performing our own preliminary data exploration, to understand the data for our specific task.

We examine which fields the data contains, how many records there are, the distinct counts of the categorical fields and the general distributions of the numerical ones.

The cells below show the process and the results.

With this analysis, we understand which fields there are, their types, and the general distribution of each.

From this, we determine the fields, the aggregations and the visualization types (metric/bar/line charts) that we show in the interactive dashboard.

Notes:
- We define each record in the **raw** data (before aggregation) as "uncertain" if ellapsedTime > 600 or ellapsedTime < 0. As this part explores the **aggregated** data, we show it after dealing with the uncertainty. Specifically, we handled uncertain records by removing them.

- We hydrate the data with popular locations and widespread events, adding to each bus trip the features isEventDate and isPassedLocation. As this hydration is relevant for our EDA phase, and as it is again performed on the **raw** data (before aggregation), we focus our attention on the final form of the hydrated data (aggregated data, after hydrating with isEventDate and isPassedLocation).

In [15]:
!pip install libify
import libify
lab4_functions = libify.importer(globals(), '/Users/shubi@campus.technion.ac.il/Lab4_functions')
import pyspark.sql.functions as F

In [16]:
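# Load the hydrated, aggregated bus trips from Elastic and drop the columns that are not needed for the EDA.
df = lab4_functions.read_elastic('hydrated-grouped-removed-uncertain-df', array_field='areaIds').drop("NumberOfBusStop", 'recordCount', 'uncertainCount', 'uncertainPercentage', 'date', 'congestionPercentage', 'areaIds')
# Cast the hydration flags to integers so they can be aggregated numerically.
df = df.withColumn('isEventDate', F.col('isEventDate').cast('int'))\
       .withColumn('isPassedLocation', F.col('isPassedLocation').cast('int'))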

In [17]:
display(df)

TotalDistance,TotalTime,day,hourGroup,isCongestion,isEventDate,isPassedLocation,isWeekend,journeyPatternId,vehicleId
15.623412,5683.0,2017-11-10T00:00:00.000+0000,4,0.0,0,1,0,10001,33320
20.184248,6111.0,2017-08-01T00:00:00.000+0000,3,0.0,0,1,0,10001,33470
5.242347,1133.0,2017-07-04T00:00:00.000+0000,4,0.0,0,1,0,10001,33491
15.449061,4147.0,2018-06-26T00:00:00.000+0000,3,0.0,0,1,0,10001,33494
9.148318,1976.0,2017-07-19T00:00:00.000+0000,2,0.0,0,1,0,10001,33519
12.40341,3839.0,2018-04-26T00:00:00.000+0000,2,1.0,0,1,0,10001,33529
1.6609503,502.0,2017-10-20T00:00:00.000+0000,2,0.0,0,1,0,10001,38008
8.682408,2010.0,2017-07-17T00:00:00.000+0000,3,0.0,0,1,0,10001,43074
29.210276,7450.0,2018-05-07T00:00:00.000+0000,2,0.0,0,1,0,10001,43075
14.374625,3856.0,2018-07-21T00:00:00.000+0000,2,0.0,0,1,1,10001,43075


In [18]:
df.count()

In [19]:
display(spark.createDataFrame(df.dtypes).withColumnRenamed('_1', 'feature').withColumnRenamed('_2', 'feature_type'))

feature,feature_type
TotalDistance,float
TotalTime,float
areaIds,array
date,timestamp
day,timestamp
hourGroup,bigint
isCongestion,float
isEventDate,string
isPassedLocation,string
isWeekend,bigint


In [20]:
numerical_columns = ["summary", "TotalTime", "TotalDistance"]

display(df.describe().select(numerical_columns))

summary,TotalTime,TotalDistance
count,1322001.0,1322001.0
mean,4238.983413023137,19.888527015087536
stddev,2014.7369132903964,12.986275602033292
min,140.0,0.0
max,14730.0,5518.5


In [22]:
display(df)

TotalDistance,TotalTime,areaIds,date,day,hourGroup,isCongestion,isEventDate,isPassedLocation,isWeekend,journeyPatternId,vehicleId
15.623412,5683.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-11-10T00:00:00.000+0000,4,0.0,0,1,0,10001,33320
20.184248,6111.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-08-01T00:00:00.000+0000,3,0.0,0,1,0,10001,33470
5.242347,1133.0,"List(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-07-04T00:00:00.000+0000,4,0.0,0,1,0,10001,33491
15.449061,4147.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2018-06-26T00:00:00.000+0000,3,0.0,0,1,0,10001,33494
9.148318,1976.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-07-19T00:00:00.000+0000,2,0.0,0,1,0,10001,33519
12.40341,3839.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2018-04-26T00:00:00.000+0000,2,1.0,0,1,0,10001,33529
1.6609503,502.0,"List(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-10-20T00:00:00.000+0000,2,0.0,0,1,0,10001,38008
8.682408,2010.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2017-07-17T00:00:00.000+0000,3,0.0,0,1,0,10001,43074
29.210276,7450.0,"List(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2018-05-07T00:00:00.000+0000,2,0.0,0,1,0,10001,43075
14.374625,3856.0,"List(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",,2018-07-21T00:00:00.000+0000,2,0.0,0,1,1,10001,43075


In [23]:
import pandas as pd
import pyspark.sql.functions as F

# Count the distinct values of each categorical column.
categorical_columns = ["journeyPatternId"]
cat_df = pd.DataFrame(df.agg(*(F.countDistinct(F.col(c)).alias("Unique Count of "+c) for c in categorical_columns)).collect(), index=['Number of Unique Categories'],columns=categorical_columns)
cat_df.transpose()

Unnamed: 0,Number of Unique Categories
journeyPatternId,584

