# Community Prediction based on Taxi Dataset

## Load dataset
1. TRIP_ID: (String) It contains a unique identifier for each trip;
1. CALL_TYPE: (char) It identifies the way used to demand this service. It may contain one of three possible values:
    - ‘A’ if this trip was dispatched from the central;
    - ‘B’ if this trip was demanded directly to a taxi driver on a specific stand;
    - ‘C’ otherwise (i.e. a trip demanded on a random street).
1. ORIGIN_CALL: (integer) It contains a unique identifier for each phone number which was used to demand, at least, one service. It identifies the trip’s customer if CALL_TYPE=‘A’. Otherwise, it assumes a NULL value;
1. ORIGIN_STAND: (integer) It contains a unique identifier for the taxi stand. It identifies the starting point of the trip if CALL_TYPE=‘B’. Otherwise, it assumes a NULL value;
1. TAXI_ID: (integer) It contains a unique identifier for the taxi driver that performed each trip;
1. TIMESTAMP: (integer) Unix Timestamp (in seconds). It identifies the trip’s start;
1. DAYTYPE: (char) It identifies the daytype of the trip’s start. It assumes one of three possible values:
    - ‘B’ if this trip started on a holiday or any other special day (i.e. extending holidays, floating holidays, etc.);
    - ‘C’ if the trip started on a day before a type-B day;
    - ‘A’ otherwise (i.e. a normal day, workday or weekend).
1. MISSING_DATA: (Boolean) It is FALSE when the GPS data stream is complete and TRUE whenever one (or more) locations are missing;
1. POLYLINE: (String) It contains a list of GPS coordinates (i.e. WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e. \[ and \], respectively). Each pair of coordinates is also identified by the same brackets as \[LONGITUDE, LATITUDE\]. This list contains one pair of coordinates for each 15 seconds of trip. The last list item corresponds to the trip’s destination while the first one represents its start.
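
As a minimal sketch of how the POLYLINE field can be interpreted: since the list holds one coordinate pair per 15 seconds, a trip with n points spans roughly (n - 1) × 15 seconds. The helper name `trip_summary` and the example string are hypothetical and only serve as illustration.

```python
import json

SAMPLING_INTERVAL_S = 15  # one GPS fix every 15 seconds, per the field description

def trip_summary(polyline_raw: str) -> dict:
    """Parse a raw POLYLINE string and derive start, destination and duration.

    Hypothetical helper, not part of the dataset tooling.
    """
    points = json.loads(polyline_raw)  # list of [longitude, latitude] pairs
    if not points:
        # trips without any GPS fix have no start, destination or duration
        return {"start": None, "destination": None, "duration_s": 0}
    return {
        "start": points[0],
        "destination": points[-1],
        # n points span n - 1 sampling intervals
        "duration_s": (len(points) - 1) * SAMPLING_INTERVAL_S,
    }

# made-up example string in the documented format
example = "[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251]]"
summary = trip_summary(example)
print(summary)
```

Note that the estimate is only a lower bound when MISSING_DATA is TRUE, because missing locations shorten the list.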


In [None]:
import csv
import json
from datetime import datetime
from typing import Iterator

enum_mapping = {'A': 1, 'B': 2, 'C': 3}

def load_csv_content() -> Iterator[dict]:
    '''Returns a generator for all lines in the csv file with correct field types.'''

    # newline='' lets the csv module handle line endings inside quoted fields
    with open('train.csv', newline='') as csv_file:
        reader = csv.reader(csv_file)

        headers = [h.lower() for h in next(reader)]

        for line in reader:
            # convert line fields to correct type
            for i in range(len(headers)):
                # trip_id AS string
                if i == 0:
                    continue
                # call_type, day_type 
                if i in [1, 6]:
                    line[i] = enum_mapping[line[i]]
                # origin_call, origin_stand, taxi_id AS int
                elif i in [2, 3, 4]:
                    line[i] = int(line[i]) if line[i] != "" else ""
                # timestamp AS timestamp
                elif i == 5:
                    # datetime is not serializable
                    # line[i] = datetime.fromtimestamp(int(line[i]))
                    line[i] = int(line[i])
                # missing_data AS bool
                elif i == 7: 
                    line[i] = line[i].lower() == 'true'
                # polyline AS List[List[float]]
                elif i == 8:
                    line[i] = json.loads(line[i])

            entry = dict(zip(headers, line))
            yield entry


In [None]:
print(next(load_csv_content()))

## Display some dataset routes

In [None]:
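from typing import List
import folium
from IPython.display import display

def displayNodes(nodes: List[List[float]]):
    '''
    Displays the nodes on a map of the city.

    :param nodes: A list of [longitude, latitude] coordinates, eg. [[-8.6,41.15],[-8.61,41.16]]
    '''
    # Stamen tiles are no longer bundled with recent folium releases, so the default OpenStreetMap tiles are used instead
    m = folium.Map(location=[41.15, -8.6], tiles='OpenStreetMap', zoom_start=12, control_scale=True)

    for idx, node in enumerate(nodes):
        # the popup shows the position of the node within the route
        popupLabel = str(idx)

        folium.Marker(
          # folium expects [latitude, longitude], the dataset stores [longitude, latitude]
          location=[node[1], node[0]],
          #popup='Cluster Nr: '+ str(node.cluster_no),
          popup=popupLabel,
          icon=folium.Icon(color='red', icon='circle'),
        ).add_to(m)

    display(m)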

In [None]:
content = load_csv_content()

In [None]:
displayNodes(next(content)['polyline'])

# Model Training

## Split dataset into multiple layers
Use the SMART pipeline to split the data into multiple layers. <br />
To do so, upload the CSV file to the Semantic Linking microservice, which creates the layers. Next, the Role Stage Discovery microservice clusters the individual layers.

## Define features per cluster/layer
### "Local" features based on single clusters
- cluster size
- cluster variance (variance from cluster mean)
- cluster density (ratio $\frac{cluster\ range}{\#\ cluster\ nodes}$)
- (cluster trustworthiness)
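
The local features above could be computed roughly as follows. This is only a sketch under assumptions: a cluster is taken to be a list of coordinate pairs, the variance is the mean squared distance from the cluster mean, and the "cluster range" (not defined further in this notebook) is assumed to be the diagonal of the bounding box. The function name `local_cluster_features` is hypothetical.

```python
import math
from typing import Dict, List

def local_cluster_features(nodes: List[List[float]]) -> Dict[str, float]:
    """Compute size, variance and density for a single cluster (illustrative only)."""
    n = len(nodes)
    if n == 0:
        return {"size": 0, "variance": 0.0, "density": 0.0}

    dims = len(nodes[0])
    # cluster mean (centroid), one value per dimension
    mean = [sum(p[d] for p in nodes) / n for d in range(dims)]
    # variance: mean squared Euclidean distance from the cluster mean
    variance = sum(math.dist(p, mean) ** 2 for p in nodes) / n
    # ASSUMPTION: cluster range = diagonal of the axis-aligned bounding box
    extent = [max(p[d] for p in nodes) - min(p[d] for p in nodes) for d in range(dims)]
    cluster_range = math.hypot(*extent)

    return {
        "size": n,
        "variance": variance,
        # density as defined above: cluster range divided by number of nodes
        "density": cluster_range / n,
    }

square = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
local = local_cluster_features(square)
print(local)
```

A different definition of the range (for example the largest pairwise distance) would only change the `cluster_range` line.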
### "Global" features based on clusters in the context of a layer
- cluster importance I (ratio $\frac{\#\ cluster\ nodes}{\#\ layer\ nodes}$)
- cluster importance II (ratio $\frac{1}{diversity}$, where diversity = number of clusters with nodes > 0)
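
The two global features can be sketched in the same way. The input format (a mapping from a cluster label to its node count within one layer) and the name `global_cluster_features` are assumptions made for this example, not the format produced by the SMART pipeline.

```python
from typing import Dict

def global_cluster_features(cluster_sizes: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Compute importance I and II for every cluster of one layer (illustrative only)."""
    layer_nodes = sum(cluster_sizes.values())
    # diversity: number of clusters that contain at least one node
    diversity = sum(1 for size in cluster_sizes.values() if size > 0)

    features = {}
    for label, size in cluster_sizes.items():
        features[label] = {
            # importance I: share of the layer's nodes that fall into this cluster
            "importance_1": size / layer_nodes if layer_nodes else 0.0,
            # importance II: 1 / diversity, identical for all clusters of the layer
            "importance_2": 1 / diversity if diversity else 0.0,
        }
    return features

# hypothetical layer with four clusters, one of them empty
layer = {"c0": 6, "c1": 3, "c2": 1, "c3": 0}
feats = global_cluster_features(layer)
print(feats["c0"])
```

Because importance II depends only on the layer, it takes the same value for every cluster in that layer and only differs between layers.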