# tb.lx Data Science Challenge - Part I
----
----
## Introduction

Dear applicant,

Congratulations on passing the first screening! We’re excited to get to know you better and to get a clearer sense of your skills. In this round, we will test your problem-solving skills and data science experience by giving you a case to solve.

After you hand in your solution, we will review it and send you our feedback. If you pass, you will be invited to an on-site interview. During the interview, you’ll get the opportunity to explain your solution and the steps you took to get there. We've prepared this notebook to help you walk us through your ideas and decisions.

If you're not able to fully solve the case, please elaborate as precisely as you can:

- Which next steps you would take;
- Which problems you would foresee and how you would solve them.

If you have any questions, feel free to contact ana.cunha@daimler.com or sara.gorjao@daimler.com.

Best of luck!

## Context

Working with GPS data is part of day-to-day life at tb.lx. We need to extract and analyze patterns from fleets in order to build intelligence on top of them. Knowing only the position history of the vehicles (latitude, longitude, and timestamp), it is possible to answer the questions we ask below. Keep the following concepts in mind:

**Frequently stopping location** - Delimited location where vehicles stop regularly with a specific purpose. In the trucking world, these can be warehouses, fuel stations, rest locations, etc.;

**Trip** - The result of a vehicle moving from a '*frequently stopping location A*' to a '*frequently stopping location B*'. For short, we write trip(A,B);

**Trajectory** - The actual path that the vehicle took to make the *trip*(A,B);

<img src="../images/concepts.png" style="width: 700px;"/>


## Data:

In this challenge, we ask you to perform simple analyses on a vehicle telematics dataset. The dataset you need to answer the questions accompanies the paper: [Vehicle Energy Dataset (VED), A Large-scale Dataset for Vehicle Energy Consumption Research](https://arxiv.org/abs/1905.02081).

#### Important: You cannot use the `Trip` column for your calculations.

## Tasks:

1. How would you find the *frequently stopping locations* from the data?
2. What is the most popular *frequently stopping location*? How many vehicles start or end their *trips* there? Please identify the *frequently stopping location* using its bounding box coordinates.
3. What is the most frequent *trip*? How many statistically different *trajectories* make up this trip?
4. What are the *trips* with the highest and lowest average speed?
5. Discuss the anonymization process used by the dataset authors. Are there any obvious flaws? Can you devise a way to counter it?


## Requirements:

- Solution implemented in Python 3.6+;
- Provide a `requirements.txt` to test the solution in the same environment;
- Write well-structured, documented, maintainable code;
- Write sanity checks to test the different steps of the pipeline.

In [None]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

# Defining spark session and context
spark = SparkSession.builder \
           .master('local[*]') \
           .appName('tblx') \
           .getOrCreate()

sc = spark.sparkContext

# Read all dynamic csvs into one spark dataframe
df = spark.read.option('header', 'true') \
        .csv('/home/lucas/Projects/tblx-challenge/data/part1/dynamic-data/*.csv')

# Keeping only the needed columns
df = df.select(["DayNum", "VehId", "Timestamp(ms)", "Latitude[deg]", "Longitude[deg]", "Vehicle Speed[km/h]"])

# Replace the string "NaN" with null values
df = df.replace("NaN", None)

# Drop rows where all values are null (sanity check)
df = df.na.drop(how="all")

# Get a sample of the dataframe
# df_sample = df.sample(False, 0.01, 42)

# Keep only the first two weeks of data (DayNum in [1, 15))
first_to_weeks = df.filter((df.DayNum >= 1) & (df.DayNum < 15))

Looking at the data, the pair (DayNum, VehId) seems to define a trip.

A stopping location could be viewed as a (lat, long) pair where a trip begins (i.e. timestamp 0) or ends (i.e. the maximum timestamp for that trip).

However, we can see the anonymization process in action: a trip starts and ends with a speed > 0, which indicates that the parts of the trajectory before $t_{min}$ and after $t_{max}$ are suppressed, as stated in the paper.
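The observation above can be turned into a small sanity check. The following is a minimal sketch in plain Python, assuming each record is reduced to a `(DayNum, VehId, timestamp, speed)` tuple; the function names `endpoint_speeds` and `share_moving_at_endpoints` and the sample records are hypothetical and not part of the dataset or the notebook.

```python
from collections import defaultdict


def endpoint_speeds(rows):
    """Return {(day_num, veh_id): (speed_at_first_ts, speed_at_last_ts)}.

    `rows` is an iterable of (day_num, veh_id, timestamp_ms, speed_kmh) tuples.
    """
    by_trip = defaultdict(list)
    for day_num, veh_id, timestamp_ms, speed_kmh in rows:
        by_trip[(day_num, veh_id)].append((timestamp_ms, speed_kmh))

    result = {}
    for key, samples in by_trip.items():
        # Order the samples by timestamp before reading the endpoints
        samples.sort(key=lambda sample: sample[0])
        result[key] = (samples[0][1], samples[-1][1])
    return result


def share_moving_at_endpoints(rows):
    """Fraction of trips whose first and last samples both have speed > 0."""
    speeds = endpoint_speeds(rows)
    if not speeds:
        return 0.0
    moving = sum(1 for start, end in speeds.values() if start > 0 and end > 0)
    return moving / len(speeds)


# Made-up sample records (assumption), not taken from the VED files
sample_rows = [
    (1.5, 8, 0, 12.0), (1.5, 8, 1000, 30.0), (1.5, 8, 2000, 9.0),
    (2.1, 10, 1000, 25.0), (2.1, 10, 0, 0.0), (2.1, 10, 2000, 0.0),
]
```

A share close to 1 on the real data would support the reading that the start and end of each trip have been cut off; a low share would suggest the vehicles are mostly recorded from standstill.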

In [38]:
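# Get the first and last (lat, long) pair for every pair of (DayNum, VehId)
from pyspark.sql import functions as F

# first()/last() follow row order, which Spark does not guarantee after a
# shuffle, so select the rows with the smallest and largest timestamp instead.
# The CSV was read without a schema, so the timestamp is cast from string.
point = F.struct(F.col("Timestamp(ms)").cast("double").alias("ts"),
                 F.col("Latitude[deg]").alias("Latitude"),
                 F.col("Longitude[deg]").alias("Longitude"))

# min/max on a struct compare its fields in order, i.e. by timestamp first
endpoints = first_to_weeks.groupBy("DayNum", "VehId") \
                          .agg(F.min(point).alias("start"), F.max(point).alias("end"))

# Taking the latitude and longitude at the first timestamp of each group
start_locations = endpoints.select("DayNum", "VehId", "start.Latitude", "start.Longitude")

# Taking the latitude and longitude at the last timestamp of each group
end_locations = endpoints.select("DayNum", "VehId", "end.Latitude", "end.Longitude")

stopping_locations = start_locations.union(end_locations)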
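
One possible way to go from individual start and end points to *frequently stopping locations* is to bin the points into a regular grid and keep the cells that collect many points. The sketch below is plain Python and only illustrates the idea: the cell size `CELL_DEG`, the `min_count` threshold, the helper names and the sample coordinates are all assumptions, and a density-based clustering method could be used instead of a fixed grid.

```python
import math
from collections import Counter

CELL_DEG = 0.001  # assumed cell size in degrees (roughly 110 m of latitude)


def cell_of(lat, lon, cell_deg=CELL_DEG):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def bounding_box(cell, cell_deg=CELL_DEG):
    """Return (min_lat, min_lon, max_lat, max_lon) for a grid cell."""
    row, col = cell
    return (row * cell_deg, col * cell_deg,
            (row + 1) * cell_deg, (col + 1) * cell_deg)


def frequent_cells(points, min_count=2, cell_deg=CELL_DEG):
    """Return [(cell, count)] for cells with at least `min_count` points,
    ordered from the most to the least populated cell."""
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in points)
    return [(cell, n) for cell, n in counts.most_common() if n >= min_count]


# Made-up coordinates (assumption), not taken from the VED files
sample_points = [
    (42.2801, -83.7402),
    (42.2804, -83.7405),
    (42.2808, -83.7409),
    (42.3055, -83.6903),
]

top_cell, top_count = frequent_cells(sample_points)[0]
top_box = bounding_box(top_cell)
```

Here `top_box` gives the bounding box of the most populated cell, which is the form Task 2 asks for. A fixed grid can split one physical location across neighbouring cells, so the cell size would need to be checked against the real data.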