# Community Prediction based on Taxi Dataset

## Load dataset
1. TRIP_ID: (String) A unique identifier for each trip;
1. CALL_TYPE: (char) How the service was requested. It takes one of three possible values:
    - ‘A’ if the trip was dispatched from the central;
    - ‘B’ if the trip was requested directly from a taxi driver at a specific stand;
    - ‘C’ otherwise (i.e. a trip hailed on a random street).
1. ORIGIN_CALL: (integer) A unique identifier for each phone number that was used to request at least one service. It identifies the trip’s customer if CALL_TYPE=‘A’. Otherwise, it is NULL;
1. ORIGIN_STAND: (integer) A unique identifier for the taxi stand. It identifies the starting point of the trip if CALL_TYPE=‘B’. Otherwise, it is NULL;
1. TAXI_ID: (integer) A unique identifier for the taxi driver that performed each trip;
1. TIMESTAMP: (integer) Unix timestamp (in seconds). It identifies the trip’s start;
1. DAYTYPE: (char) The day type of the trip’s start. It takes one of three possible values:
    - ‘B’ if the trip started on a holiday or any other special day (e.g. extended holidays, floating holidays);
    - ‘C’ if the trip started on the day before a type-B day;
    - ‘A’ otherwise (i.e. a normal day, workday or weekend).
1. MISSING_DATA: (Boolean) FALSE when the GPS data stream is complete and TRUE when one or more locations are missing;
1. POLYLINE: (String) A list of GPS coordinates (WGS84 format) encoded as a string. The beginning and the end of the string are marked with brackets (\[ and \], respectively). Each pair of coordinates is enclosed in the same brackets as \[LONGITUDE, LATITUDE\]. The list contains one pair of coordinates for every 15 seconds of the trip. The first item is the trip’s start and the last item is its destination.
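
Because POLYLINE holds one coordinate pair per 15 seconds, the number of points gives a rough estimate of the trip duration. The following is a minimal sketch, assuming the sampling interval is exactly 15 seconds and no GPS points are missing. `trip_duration_seconds` is a hypothetical helper and the coordinates are made up for illustration.

```python
import json

SAMPLING_INTERVAL_S = 15  # one GPS fix every 15 seconds, per the field description

def trip_duration_seconds(polyline: str) -> int:
    """Estimates the trip duration from a POLYLINE string (hypothetical helper)."""
    points = json.loads(polyline)  # -> [[longitude, latitude], ...]
    # n points span n-1 intervals; an empty or single-point trip has duration 0
    return max(len(points) - 1, 0) * SAMPLING_INTERVAL_S

# Made-up example with three GPS fixes
example = "[[-8.61, 41.14], [-8.62, 41.15], [-8.63, 41.16]]"
print(trip_duration_seconds(example))  # 3 points -> 2 intervals -> 30
```

For rows where MISSING_DATA is TRUE, this estimate is a lower bound at best.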


In [None]:
import csv
import json
from datetime import datetime
from typing import Iterator

enum_mapping = {'A': 1, 'B': 2, 'C': 3}

def load_csv_content() -> Iterator:
    '''Returns a generator for all lines in the csv file with correct field types.'''
    
    # newline='' lets the csv module handle line endings inside quoted fields itself
    with open('train.csv', newline='') as csv_file:
        reader = csv.reader(csv_file)

        headers = [h.lower() for h in next(reader)]

        for line in reader:
            # convert line fields to correct type
            for i in range(len(headers)):
                # trip_id AS string
                if i == 0:
                    continue
                # call_type, day_type 
                if i in [1, 6]:
                    line[i] = enum_mapping[line[i]]
                # origin_call, origin_stand, taxi_id AS int
                elif i in [2, 3, 4]:
                    line[i] = int(line[i]) if line[i] != "" else ""
                # timestamp AS timestamp
                elif i == 5:
                    # datetime is not serializable
                    # line[i] = datetime.fromtimestamp(int(line[i]))
                    line[i] = int(line[i])
                # missing_data AS bool
                elif i == 7: 
                    line[i] = line[i].lower() == 'true'
                # polyline AS List[List[float]]
                elif i == 8:
                    line[i] = json.loads(line[i])

            entry = dict(zip(headers, line))
            yield entry


In [None]:
print(next(load_csv_content()))

## Display some dataset routes

In [None]:
from typing import List
import folium

def displayNodes(nodes: List[List[float]]):  
    '''
    Displays the nodes on a map of the city.

    :param nodes: A list of coordinates, eg. [[1,2],[1,3]]
    '''
    m = folium.Map(location=[41.15,-8.6],tiles='stamenterrain',zoom_start=12, control_scale=True) 

    for idx, node in enumerate(nodes): 
        popupLabel = idx

        folium.Marker(
          location=[node[1], node[0]],
          #popup='Cluster Nr: '+ str(node.cluster_no),
          popup=popupLabel,
          icon=folium.Icon(color='red', icon='circle'),
        ).add_to(m)
      
    display(m)

In [None]:
content = load_csv_content()

In [None]:
displayNodes(next(content)['polyline'])

# Model Training

## Split dataset into multiple layers
Use the SMART pipeline to split the data into multiple layers. <br />
To do so, upload the csv file to the Semantic Linking microservice, which creates the layers. Next, the Role Stage Discovery microservice clusters the individual layers.

## Define features per cluster/layer
### "Local" features based on single clusters
- cluster size
- cluster variance (variance from cluster mean)
- cluster density (ratio $\frac{cluster\ range}{\#\ cluster\ nodes}$ )
- (cluster trustworthiness)
### "Global" features based on clusters in the context of a layer
- cluster importance I (ratio $\frac{\#\ cluster\ nodes}{\#\ layer\ nodes}$)
- cluster importance II (ratio $\frac{1}{diversity}$, where *diversity* = number of non-empty clusters in the layer, i.e. clusters with more than 0 nodes)
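
The feature definitions above can be sketched as follows. This is an illustration only: it assumes each cluster is a plain list of one-dimensional numeric values and uses the population variance. The actual computation lives in `entities.Cluster`, which is not shown here, and `cluster_features` is a hypothetical name.

```python
from statistics import pvariance
from typing import Dict, List

def cluster_features(layer: Dict[int, List[float]], cluster_id: int) -> Dict[str, float]:
    """Computes the listed features for one cluster of 1-D values (illustrative only)."""
    nodes = layer[cluster_id]
    size = len(nodes)
    layer_size = sum(len(n) for n in layer.values())
    # diversity = number of clusters in the layer that contain at least one node
    diversity = sum(1 for n in layer.values() if len(n) > 0)

    value_range = max(nodes) - min(nodes) if nodes else 0.0
    return {
        "size": size,
        "variance": pvariance(nodes) if nodes else 0.0,        # variance from the cluster mean
        "density": value_range / size if size else 0.0,         # cluster range / # cluster nodes
        "importance1": size / layer_size if layer_size else 0.0,  # # cluster nodes / # layer nodes
        "importance2": 1 / diversity if diversity else 0.0,     # 1 / diversity
    }

# Made-up layer with three clusters, one of them empty
layer = {0: [1.0, 1.0, 3.0, 3.0], 1: [2.0, 2.0], 2: []}
features = cluster_features(layer, 0)
print(features)
```

Empty clusters are mapped to 0 for every ratio here to avoid division by zero; that choice is an assumption of the sketch.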

## Calculate the Metrics for the Clusters

In [None]:
from typing import List
import json
import os
from entities import TimeWindow, Cluster

def calculate_metrics_for_clusters(layer_name: str = 'CallTypeLayer', feature_name: str = 'call_type'):
    path_in = f'timeslices/{layer_name}'
    path_out = f"metrics/{layer_name}.json"

    complete_clusters: List[Cluster] = []

    for root, _, files in os.walk(path_in):
        for f in files:
            print(f"Working on file: {root}/{f}")
            with open(os.path.join(root, f), 'r') as file:
                json_slice = json.loads(file.read())
                time_window = TimeWindow.create_from_serializable_dict(json_slice)

                # create all clusters + metrics for one time window
                clusters = Cluster.create_from_time_window(time_window, feature_name)
                complete_clusters.extend(clusters)
        
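    # store the cluster metrics (create the output directory if it does not exist yet)
    os.makedirs(os.path.dirname(path_out), exist_ok=True)
    with open(path_out, 'w') as file:
        json.dump([cl.__dict__ for cl in complete_clusters], file)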

In [None]:
calculate_metrics_for_clusters()

## Provide the cluster metrics and labels as data points for learning
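
Time enters the data points as a cyclic 2D feature so that the last and the first week of a year end up next to each other instead of 51 units apart. The sketch below shows this effect using only the standard library; `cyclic` and `distance` are hypothetical helper names.

```python
import math

def cyclic(time: int, max_time_value: int = 52):
    """Maps a time value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * time / max_time_value
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two 2-D points (math.dist needs Python 3.8+)."""
    return math.dist(a, b)

# As raw integers, week 52 and week 1 are 51 apart; on the circle they are neighbours.
d_wrap = distance(cyclic(52), cyclic(1))
d_next = distance(cyclic(1), cyclic(2))
print(d_wrap, d_next)  # both distances are equal up to rounding

# Half a year apart means opposite points on the unit circle.
d_half = distance(cyclic(26), cyclic(52))
print(d_half)  # close to 2, the diameter of the unit circle
```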

In [None]:
# Example of how to convert time to a cyclic 2D feature

import numpy as np
import matplotlib.pyplot as plt

MAX_TIME_VAL = 52 # for weeks

times = np.arange(1, MAX_TIME_VAL + 1) # week numbers 1..52

df = {}
df['sin_time'] = np.sin(2*np.pi*times/MAX_TIME_VAL)
df['cos_time'] = np.cos(2*np.pi*times/MAX_TIME_VAL)

plt.plot(df['sin_time'])
plt.plot(df['cos_time'])
plt.show()

plt.scatter(df['sin_time'], df['cos_time'])
plt.show()

# feature_new = {i+1:(s,c) for i,(s,c) in enumerate(zip(df['sin_time'], df['cos_time']))}

In [None]:
import json
from entities import Cluster
import collections
import numpy as np

def get_evolution_label(old_size: int, new_size: int) -> int:
    '''Returns the evolution label as int by mapping 0..4 to {continuing, shrinking, growing, dissolving, forming}.'''
    if old_size == new_size:
        return 0 # continuing
    if old_size == 0 and new_size != 0:
        return 4 # forming
    if old_size != 0 and new_size == 0:
        return 3 # dissolving
    if old_size > new_size:
        return 1 # shrinking
    if old_size < new_size:
        return 2 # growing

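def get_cyclic_time_feature(time: int, max_time_value: int = 52) -> (float, float):
    '''Returns the time as a 2D cyclic feature (sin, cos), so that max_time_value wraps around to 0.'''
    return (np.sin(2*np.pi*time/max_time_value),
            np.cos(2*np.pi*time/max_time_value))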

def create_data(N: int = 3, layer_name: str = 'CallTypeLayer'):
    """
    Yields training data points. A single training data point looks like this:

    (cluster_size, cluster_variance, cluster_density, cluster_import1, cluster_import2, time_info) ^ N, evolution_label
    time_info ... the time as 2d cyclic feature, i.e. time_info := (time_f1, time_f2)

    The first tuple represents metrics from the cluster in t_i-(N-1).
    The Nth tuple represents metrics from the cluster in t_i.
    The label is one of {continuing, shrinking, growing, dissolving, forming} (splitting and merging are not covered) and identifies the change for t_i+1.
    
    :param N: number of cluster metric tuples
    :param layer_name: name of the layer whose cluster metrics are read from the metrics directory
    """
    
    path_in = f"metrics/{layer_name}.json"
    with open(path_in, 'r') as file:
        data = [Cluster.create_from_dict(cl_d) for cl_d in json.loads(file.read())]

    data.sort(key=lambda cl: (cl.cluster_id, cl.time_window_id))

    # sliding window over the last N metric tuples of the current cluster
    tuples = collections.deque(maxlen=N)

    for i, cur_cluster in enumerate(data[:-1]):
        next_cluster = data[i+1]

        if cur_cluster.cluster_id != next_cluster.cluster_id:
            # the next cluster slice belongs to another cluster id -> reset the window and skip the current (last) cluster slice
            print("new cluster")
            tuples.clear()
            continue

        cur_metrics = (cur_cluster.size, cur_cluster.variance, cur_cluster.density, cur_cluster.importance1, cur_cluster.importance2, get_cyclic_time_feature(cur_cluster.get_time_info()))

        # a deque with maxlen=N drops the oldest element when the N+1st one is appended
        tuples.append(cur_metrics)

        # the label describes the change from the current to the next time window
        label = get_evolution_label(cur_cluster.size, next_cluster.size)

        if len(tuples) == N:
            yield list(tuples) + [label]


## Approach

### 1. Prediction of cluster evolution based on metrics from clusters in one layer
Use the cluster metrics from the last N time windows to predict the change in $t_{i+1}$.
Either use a standard classifier on $(cluster\_metrics)^{N} \cup (label)$ or choose a recurrent neural network (RNN).
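
For the classification variant, each data point is the concatenation of N metric tuples, and the label is the evolution class for $t_{i+1}$. The sketch below only illustrates this data layout with a 1-nearest-neighbour classifier in plain Python. It is a stand-in, not the classifier chosen for this project, and all sample values are made up. `predict_1nn` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple

# Same label encoding as get_evolution_label: 0..4
LABELS = ["continuing", "shrinking", "growing", "dissolving", "forming"]

def predict_1nn(train: List[Tuple[Sequence[float], int]], x: Sequence[float]) -> int:
    """Returns the label of the training point closest to x (Euclidean distance)."""
    _, label = min(train, key=lambda point: math.dist(point[0], x))
    return label

# Made-up, already flattened data points for N = 2 with two features per window:
# x = (size at t_i-1, importance1 at t_i-1, size at t_i, importance1 at t_i)
train = [
    ((10, 0.50, 10, 0.50), 0),  # stable size -> continuing
    ((10, 0.50, 6, 0.30), 1),   # size dropping -> shrinking
    ((6, 0.30, 10, 0.50), 2),   # size rising -> growing
]

prediction = predict_1nn(train, (9, 0.45, 5, 0.25))
print(LABELS[prediction])  # the closest training point is the shrinking one
```

In practice the features would need scaling before any distance-based method, since cluster size and the ratio features live on very different ranges.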

### 2. Prediction of cluster evolution based on metrics from cluster interaction between multiple layers
*todo*

In [None]:
def flatten(data: list) -> ('X', 'Y'):
    '''
    Flattens training data in the form:
    [(cluster_size, cluster_variance, cluster_density, cluster_import1, cluster_import2, (time_f1, time_f2))^N, evolution_label]
    to:
    (X: np.array, evolution_label)
    '''
    flat_list = []
    for entry in data[:-1]: # for all x
        flat_list.extend(entry[:-1]) # add all number features except the time tuple
        flat_list.extend(entry[-1]) # add time tuple

    # flat_list.append(data[-1]) # add y

    return np.asarray(flat_list), data[-1]
