### User-to-cells association

This notebook implements the following operations:

1. It materializes a set of uniform grids, with different resolutions and alignments, over the geographical area covering the stop segments detected for a given set of trajectories.
2. For every user and grid, it assigns each of the user's stop segments to a grid cell.
3. For every user and grid, it computes the number of distinct days that the user's stop segments cover in each associated cell.
4. For every user and grid, it finds the top-k cells ranked by the number of distinct days covered by the user's stop segments. The intuition is that the cells with the highest number of distinct days covered are likely to be the user's most frequented locations, such as home and work.

The final output consists of a set of files, one per grid, each containing the set of top-k cells for each user.
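
Steps 2 to 4 can be illustrated with a minimal, standard-library-only sketch. It is not the code in `src.stop_grid_mapper`: the helper names (`cell_of`, `days_covered`, `top_k_cells`), the representation of a stop segment as a single metric point with a start and end timestamp, and the tie-breaking rule are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def cell_of(x, y, length, offset=0.0):
    """Map metric coordinates to the (column, row) index of a square cell.

    Assumption: the grid origin is shifted by `offset` along both axes.
    """
    return (math.floor((x - offset) / length), math.floor((y - offset) / length))


def days_covered(start, end):
    """Return the set of calendar dates touched by the interval [start, end]."""
    n_days = (end.date() - start.date()).days
    return {start.date() + timedelta(days=i) for i in range(n_days + 1)}


def top_k_cells(stops, length, offset, k):
    """stops: iterable of (uid, x, y, start, end).

    Returns {uid: [(cell, n_distinct_days), ...]} with at most k cells per user.
    """
    days = defaultdict(set)  # (uid, cell) -> distinct dates covered
    for uid, x, y, start, end in stops:
        days[(uid, cell_of(x, y, length, offset))] |= days_covered(start, end)

    per_user = defaultdict(list)
    for (uid, cell), dates in days.items():
        per_user[uid].append((cell, len(dates)))

    # Rank by day count (descending); break ties by cell index for determinism.
    return {
        uid: sorted(cells, key=lambda c: (-c[1], c[0]))[:k]
        for uid, cells in per_user.items()
    }


# Toy data: two overnight stops in one cell, one daytime stop in another.
stops = [
    ("u1", 120.0, 250.0, datetime(2023, 1, 1, 22), datetime(2023, 1, 2, 7)),
    ("u1", 130.0, 260.0, datetime(2023, 1, 3, 23), datetime(2023, 1, 4, 6)),
    ("u1", 950.0, 400.0, datetime(2023, 1, 2, 9), datetime(2023, 1, 2, 17)),
]
result = top_k_cells(stops, length=100, offset=0, k=2)
print(result)
```

In this toy example the two overnight stops fall in the same cell and together cover four distinct dates, so that cell ranks first, which matches the "home" intuition described above.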

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import copy

from src.stop_grid_mapper import StopGridMapper, parallel_user_to_cells_mapping
from src.stop_explorer import StopExplorer
from src.grid_partitioning import Grid

## Main code

In [2]:
# Create a GeoDataFrame containing the stop segments dataset.
path_stops = './data_simulator/huge_dataset/dataset_simulator_trajectories.compressed.parquet.stops.parquet'
stop_explorer = StopExplorer(path_stops)
# display(stop_explorer.get_df_stops())
# stop_explorer.get_df_stops().info()

### Materialize a set of uniform grids

In [3]:
# Compute the bounding box of the objects in the GeoDataFrame. Also retrieve its original CRS and estimate
# a suitable metric CRS for the area covered by the GeoDataFrame. All of this information is then passed
# to the constructor of each grid.
stops_df = stop_explorer.get_df_stops()
bbox = stops_df.total_bounds  # NumPy array of the form [minx, miny, maxx, maxy]
orig_crs = stops_df.crs  # Original CRS of the GeoDataFrame.
metric_crs = stops_df.estimate_utm_crs()  # Estimated metric (UTM) CRS for the area covered by the GeoDataFrame.


# compute multiple grids with different cell lengths and offsets.
set_grids = {}
min_length, max_length, step_length = 100, 1000, 100
for length in range(min_length, max_length + step_length, step_length):
    for offset in range(0, length, length // 10):
        # print(f'Computing grid with cell length {length} and offset {offset}')
        grid = Grid(bbox, orig_crs, metric_crs, length, offset)
        set_grids[(length, offset)] = grid
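
To see how many grids the nested loop above produces without building any `Grid` object, the parameter pairs can be enumerated on their own. The helper name `grid_parameters` and its `offsets_per_length` argument are hypothetical and only mirror the loop bounds used in the cell.

```python
def grid_parameters(min_length, max_length, step_length, offsets_per_length=10):
    """Enumerate the (length, offset) pairs visited by the nested loop."""
    params = []
    for length in range(min_length, max_length + step_length, step_length):
        # Offsets advance in steps of one tenth of the cell length.
        step = length // offsets_per_length
        for offset in range(0, length, step):
            params.append((length, offset))
    return params


params = grid_parameters(100, 1000, 100)
print(len(params))             # number of grids that would be created
print(params[:3], params[-1])  # first few pairs and the last one
```

With the values used here (lengths from 100 to 1000 in steps of 100), every length yields 10 offsets, for 100 grids in total. Note that `length // 10` evaluates to 0 for lengths below 10, in which case `range` raises a `ValueError`, so the scheme assumes cell lengths of at least 10.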

### User-to-cells association

In [4]:
stops_df = stop_explorer.get_df_stops()
top_k_cells_user = 8
num_proc = 6
path_out = './test/'
parallel_user_to_cells_mapping(set_grids, stops_df, path_out, top_k_cells_user, num_proc)
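
How `parallel_user_to_cells_mapping` distributes the work internally is not shown in this notebook. As a purely illustrative sketch, one plausible scheme is to split the grid keys into `num_proc` roughly equal batches and give one batch to each worker process. The helper `split_into_batches` below is hypothetical and is not part of `src.stop_grid_mapper`.

```python
def split_into_batches(keys, num_batches):
    """Distribute keys round-robin into `num_batches` lists of near-equal size."""
    keys = list(keys)
    return [keys[i::num_batches] for i in range(num_batches)]


# Same (length, offset) keys as the grids built earlier, split across 6 workers.
grid_keys = [
    (length, offset)
    for length in range(100, 1100, 100)
    for offset in range(0, length, length // 10)
]
batches = split_into_batches(grid_keys, 6)
print([len(batch) for batch in batches])
```

Round-robin assignment keeps batch sizes within one item of each other and mixes fine and coarse grids in every batch, which may help balance the load if fine grids are more expensive to process.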

**DEBUG**: generate a map of a grid to check that everything's going well.

**DEBUG**: Compute statistics concerning the pairs '(uid, cell_id)', and the cells of the grid.

**DEBUG**: Plot heatmaps of the grid, each focused on a different statistic.