<h1> NumMobility Filters </h1>
<p>
    This Jupyter Notebook contains several examples of filtering
    functions, such as filtering the data by time, date, or
    proximity to a point.
    <br>
    <br>
    Apart from filtering, this Jupyter
    Notebook also has examples of outlier detection and removal
    from the dataset.
</p>

<hr>

<p align='justify'>
    The following datasets are used to demonstrate the
    filtering examples:
    <ul>
        <li> <a href="https://github.com/YakshHaranwala/NumMobility/blob/main/examples/data/geolife_sample.csv"> Geolife Sample </a> </li>
        <li> <a href="https://github.com/YakshHaranwala/NumMobility/blob/main/examples/data/gulls.csv"> Seagulls Dataset </a> </li>
        <li> <a href="https://github.com/YakshHaranwala/NumMobility/blob/main/examples/data/atlantic.csv"> Hurricane Dataset </a> </li>
    </ul>
</p>
</html>

In [1]:
import numpy as np

from core.TrajectoryDF import NumPandasTraj as NumTrajDF
from features.spatial_features import SpatialFeatures as spatial
from features.temporal_features import TemporalFeatures as temporal
from utilities.conversions import Conversions as con
from preprocessing.filters import Filters as filters

import pandas as pd
import time
np.seterr(invalid='ignore')
start = time.time()

In [2]:
%%time
"""
    First of all, let's import all the datasets one by one
    and check out a few of their points.
"""
# Reading the geolife dataset and converting to NumPandasTraj.
# Also, let's print the first 5 points of the dataset to
# see how the dataframe looks.
geolife = pd.read_csv('./data/geolife_sample.csv')
geolife = NumTrajDF(geolife,'lat','lon','datetime','id')
geolife.head()

CPU times: user 594 ms, sys: 63.7 ms, total: 657 ms
Wall time: 660 ms


traj_id,DateTime,lat,lon
1,2008-10-23 05:53:11,39.984224,116.319402
1,2008-10-23 05:53:16,39.984211,116.319389
1,2008-10-23 05:53:21,39.984217,116.319422
1,2008-10-23 05:53:23,39.98471,116.319865
1,2008-10-23 05:53:28,39.984674,116.31981
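
The `head()` output above shows rows indexed by `traj_id` and `DateTime`, with `lat` and `lon` as regular columns. As a rough illustration of that layout, and not the library's actual constructor, a similar shape can be built in plain pandas. The column mapping mirrors the call above; the third row is made up so that a second trajectory exists.

```python
import pandas as pd

# First two rows come from the geolife sample shown above;
# the third row is invented purely for illustration.
raw = pd.DataFrame({
    "id": [1, 1, 2],
    "datetime": ["2008-10-23 05:53:11", "2008-10-23 05:53:16", "2008-10-24 01:00:00"],
    "lat": [39.984224, 39.984211, 39.9],
    "lon": [116.319402, 116.319389, 116.3],
})

# Rename to the index names seen in the output, parse the timestamps,
# and index every point by (trajectory id, timestamp).
indexed = (
    raw.rename(columns={"id": "traj_id", "datetime": "DateTime"})
       .assign(DateTime=lambda d: pd.to_datetime(d["DateTime"]))
       .set_index(["traj_id", "DateTime"])
       .sort_index()
)
print(indexed)
```

With this layout, `indexed.loc[1]` selects all points of trajectory 1.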


In [3]:
%%time

# Reading the gulls dataset and converting to NumPandasTraj.
# Also, let's print the first 5 points of the dataset to
# see how the dataframe looks.
gulls = pd.read_csv('./data/gulls.csv')
gulls = NumTrajDF(gulls,
                 latitude='location-lat',
                 longitude='location-long',
                 datetime='timestamp',
                 traj_id='tag-local-identifier',
                 rest_of_columns=[])
gulls.head()

CPU times: user 311 ms, sys: 4.4 ms, total: 316 ms
Wall time: 314 ms


traj_id,DateTime,event-id,visible,lon,lat,sensor-type,individual-taxon-canonical-name,individual-local-identifier,study-name
91732,2009-05-27 14:00:00,1082620685,True,24.58617,61.24783,gps,Larus fuscus,91732A,Navigation experiments in lesser black-backed ...
91732,2009-05-27 20:00:00,1082620686,True,24.58217,61.23267,gps,Larus fuscus,91732A,Navigation experiments in lesser black-backed ...
91732,2009-05-28 05:00:00,1082620687,True,24.53133,61.18833,gps,Larus fuscus,91732A,Navigation experiments in lesser black-backed ...
91732,2009-05-28 08:00:00,1082620688,True,24.582,61.23283,gps,Larus fuscus,91732A,Navigation experiments in lesser black-backed ...
91732,2009-05-28 14:00:00,1082620689,True,24.5825,61.23267,gps,Larus fuscus,91732A,Navigation experiments in lesser black-backed ...


In [4]:
%%time
'''
    1. Read the atlantic dataset, clean it up, and then
       convert it to NumPandasTraj.
    2. Note that before converting to NumPandasTraj, the
       dataframe needs some cleanup: the time format provided
       in the dataframe must first be converted into a
       library-supported time format, and the coordinates must
       likewise be converted to a library-supported format.
    3. Also, let's print the first 5 points of the dataset to
       see how the dataframe looks.
'''
atlantic = pd.read_csv('./data/atlantic.csv')
atlantic = con.convert_directions_to_degree_lat_lon(atlantic, 'Latitude',"Longitude")
def convert_to_datetime(row):
    # Date is stored as YYYYMMDD and Time as HHMM (e.g. 600 means 06:00).
    date_str = str(row['Date'])
    this_date = '{}-{}-{}'.format(date_str[0:4], date_str[4:6], date_str[6:])
    this_time = '{:02d}:{:02d}:00'.format(int(row['Time']) // 100, int(row['Time']) % 100)
    return '{} {}'.format(this_date, this_time)
atlantic['DateTime'] = atlantic.apply(convert_to_datetime, axis=1)
atlantic = NumTrajDF(atlantic,
                         latitude='Latitude',
                         longitude='Longitude',
                         datetime='DateTime',
                         traj_id='ID',
                         rest_of_columns=[])
atlantic = temporal.create_date_column(atlantic)
atlantic.head()


CPU times: user 8.56 s, sys: 68.9 ms, total: 8.63 s
Wall time: 8.65 s


traj_id,DateTime,Name,Date,Time,Event,Status,lat,lon,Maximum Wind,Minimum Pressure,Low Wind NE,...,Low Wind SW,Low Wind NW,Moderate Wind NE,Moderate Wind SE,Moderate Wind SW,Moderate Wind NW,High Wind NE,High Wind SE,High Wind SW,High Wind NW
AL011851,1851-06-25 00:00:00,UNNAMED,1851-06-25,0,,HU,28.0,-94.8,80,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
AL011851,1851-06-25 06:00:00,UNNAMED,1851-06-25,600,,HU,28.0,-95.4,80,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
AL011851,1851-06-25 12:00:00,UNNAMED,1851-06-25,1200,,HU,28.0,-96.0,80,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
AL011851,1851-06-25 18:00:00,UNNAMED,1851-06-25,1800,,HU,28.1,-96.5,80,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
AL011851,1851-06-25 21:00:00,UNNAMED,1851-06-25,2100,L,HU,28.2,-96.8,80,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
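
The cleanup step above turns an integer date such as `18510625` and an integer time such as `600` into a `YYYY-MM-DD HH:MM:SS` string. A minimal standard-library sketch of the same conversion follows; the helper name `to_datetime_string` is hypothetical and not part of the library.

```python
from datetime import datetime


def to_datetime_string(date_value, time_value):
    """Combine a YYYYMMDD integer and an HHMM integer into 'YYYY-MM-DD HH:MM:SS'."""
    # Zero-pad the time so that 0 becomes '0000' and 600 becomes '0600'.
    stamp = datetime.strptime(
        f"{int(date_value):08d}{int(time_value):04d}", "%Y%m%d%H%M"
    )
    return stamp.isoformat(sep=" ")


print(to_datetime_string(18510625, 600))
```

Parsing through `strptime` has the side benefit of rejecting impossible values, such as a time of `2560`, instead of silently formatting them.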


In [5]:
# %%time

# Now, let's create a bounding box with a 100 km radius around
# the coordinates (39, 116).

bbox = filters.get_bounding_box_by_radius(39, 116, 100000)
bbox

(38.100678394081264, 114.84275815636957, 39.89932160591873, 117.15724184363044)
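
To make the returned tuple less opaque, here is one common way to derive a bounding box from a centre point and a radius on a spherical Earth. This is a sketch under assumptions, not the library's implementation: the function name, the mean Earth radius of 6,371,000 m, and the (min lat, min lon, max lat, max lon) ordering are all assumed here, and poles and the antimeridian are not handled.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def bounding_box_by_radius(lat, lon, radius_m):
    """Return (min_lat, min_lon, max_lat, max_lon) in degrees."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    # Angular radius of the circle as seen from the Earth's centre.
    angular = radius_m / EARTH_RADIUS_M
    # Longitude span widens with latitude because meridians converge.
    d_lon = math.asin(math.sin(angular) / math.cos(lat_r))
    return (
        math.degrees(lat_r - angular),
        math.degrees(lon_r - d_lon),
        math.degrees(lat_r + angular),
        math.degrees(lon_r + d_lon),
    )


box = bounding_box_by_radius(39, 116, 100_000)
print(box)
```

For the inputs used in the cell above, this sketch lands close to the tuple printed by the notebook.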

In [6]:
# Now, let's filter the trajectory based on date. Here we pass
# both a start date and an end date.

small = filters.filter_by_date(atlantic, start_date='1851-06-25',end_date='2011-01-01')
print(f"Length of atlantic: {len(atlantic)}")
print(f"Length of small: {len(small)}")

Length of atlantic: 49105
Length of small: 46909
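
Conceptually, a date filter is a boolean mask over the timestamps. The plain-pandas sketch below shows the idea on a tiny made-up frame. The helper `filter_rows_by_date` is hypothetical, and treating both bounds as inclusive and each bound as optional are assumptions of this sketch rather than documented library behaviour.

```python
import pandas as pd

# Made-up points spanning a wide date range.
points = pd.DataFrame({
    "traj_id": ["A", "A", "B", "B"],
    "DateTime": pd.to_datetime([
        "1851-06-25 00:00:00",
        "1900-01-01 06:00:00",
        "2010-12-31 18:00:00",
        "2015-08-01 12:00:00",
    ]),
})


def filter_rows_by_date(df, start_date=None, end_date=None):
    """Keep rows whose calendar date lies in [start_date, end_date]."""
    dates = df["DateTime"].dt.normalize()  # drop the time-of-day part
    mask = pd.Series(True, index=df.index)
    if start_date is not None:
        mask &= dates >= pd.Timestamp(start_date)
    if end_date is not None:
        mask &= dates <= pd.Timestamp(end_date)
    return df[mask]


both = filter_rows_by_date(points, "1851-06-25", "2011-01-01")
only_start = filter_rows_by_date(points, start_date="1900-01-01")
only_end = filter_rows_by_date(points, end_date="1899-12-31")
print(len(both), len(only_start), len(only_end))
```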


In [7]:
# Now, let's filter the trajectory based on datetime. Here we
# pass both a start datetime and an end datetime.

tiny = filters.filter_by_datetime(atlantic, start_dateTime='1859-09-21 23:00:00' ,
                                  end_dateTime='2011-09-21 23:00:00')
print(f"Length of atlantic: {len(atlantic)}")
print(f"Length of tiny: {len(tiny)}")

Length of atlantic: 49105
Length of tiny: 46536


In [8]:
# Now, let's filter the dataframe based on maximum speed.

atlantic = spatial.create_speed_from_prev_column(atlantic)
max_speed_filt_df = filters.filter_by_max_speed(atlantic, 10)
print(f"Length of atlantic: {len(atlantic)}")
print(f"Length of speed_filt_df: {len(max_speed_filt_df)}")

Length of atlantic: 49105
Length of speed_filt_df: 41358
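
The speed filter relies on a "speed from previous point" value for every row. The sketch below shows one way such a value could be computed and thresholded using only the standard library. The haversine distance, the metres-per-second unit, and keeping the first point of a track are assumptions of this sketch; the source does not state which formula or unit the library uses. The three points are the first three rows of the atlantic sample shown earlier.

```python
import math
from datetime import datetime


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in metres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    d_phi = p2 - p1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def speeds_from_prev(track):
    """track: time-sorted list of (datetime, lat, lon). First entry gets None."""
    speeds = [None]
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        seconds = (t1 - t0).total_seconds()
        speeds.append(haversine_m(la0, lo0, la1, lo1) / seconds if seconds > 0 else None)
    return speeds


track = [
    (datetime(1851, 6, 25, 0, 0), 28.0, -94.8),
    (datetime(1851, 6, 25, 6, 0), 28.0, -95.4),
    (datetime(1851, 6, 25, 12, 0), 28.0, -96.0),
]
speeds = speeds_from_prev(track)

# Keep points at or below a threshold; points without a speed are kept.
max_speed = 2.0
slow_enough = [p for p, s in zip(track, speeds) if s is None or s <= max_speed]
print(speeds, len(slow_enough))
```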


In [9]:
# Now, let's filter the dataframe based on minimum speed.

min_speed_filt = filters.filter_by_min_speed(max_speed_filt_df, 5)
print(f"Length of speed_filt_df: {len(max_speed_filt_df)}")
print(f"Length of min_speed_filt: {len(min_speed_filt)}")

Length of speed_filt_df: 41358
Length of min_speed_filt: 20485


In [10]:
# Now, let's filter the dataframe based on minimum distance
# between consecutive points.

min_distance_filt = filters.filter_by_min_consecutive_distance(atlantic,
                                                               125000)
print(f"length of atlantic: {len(atlantic)}")
print(f"length of min_distance_filt: {len(min_distance_filt)}")

length of atlantic: 49105
length of min_distance_filt: 20584


In [11]:
# Now, let's filter the dataframe based on maximum distance
# between consecutive points.

max_distance_filt = filters.filter_by_max_consecutive_distance(min_distance_filt,
                                                               500000)
print(f"length of min_distance_filt: {len(min_distance_filt)}")
print(f"length of max_distance_filt: {len(max_distance_filt)}")

length of min_distance_filt: 20584
length of max_distance_filt: 20412


In [12]:
# Now, let's filter the data based on maximum speed as
# well as maximum distance between 2 consecutive points.

max_dist_speed_filt = \
    filters.filter_by_max_distance_and_speed(atlantic, max_distance=300000, max_speed=5)
print(f"length of atlantic: {len(atlantic)}")
print(f"length of max_dist_speed_filt: {len(max_dist_speed_filt)}")

length of atlantic: 49105
length of max_dist_speed_filt: 20873
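
A combined filter can be read as two conditions applied to the same row. The toy sketch below assumes the conditions are joined with a logical AND, so a point survives only if it satisfies both limits. The source does not confirm that this is how the library combines them, and the rows and the helper `keep_within` are made up.

```python
# Made-up rows: distance from the previous point (metres) and speed.
rows = [
    {"id": 1, "distance_m": 100_000, "speed": 3.0},
    {"id": 2, "distance_m": 400_000, "speed": 3.0},
    {"id": 3, "distance_m": 100_000, "speed": 8.0},
    {"id": 4, "distance_m": 400_000, "speed": 8.0},
]


def keep_within(rows, max_distance, max_speed):
    """Keep rows that are within BOTH the distance and the speed limit."""
    return [
        r for r in rows
        if r["distance_m"] <= max_distance and r["speed"] <= max_speed
    ]


kept_ids = [r["id"] for r in keep_within(rows, max_distance=300_000, max_speed=5)]
print(kept_ids)
```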


In [13]:

# Now, let's filter the data based on minimum speed as
# well as minimum distance between 2 consecutive points.

min_dist_speed_filt = \
    filters.filter_by_min_distance_and_speed(atlantic, min_distance=150000, min_speed=10)
print(f"length of atlantic: {len(atlantic)}")
print(f"length of min_dist_speed_filt: {len(min_dist_speed_filt)}")

length of atlantic: 49105
length of min_dist_speed_filt: 5773


In [14]:
# Now, let's remove the outliers based on the
# distance between 2 consecutive points.

geolife = spatial.create_speed_from_prev_column(geolife)
outlier_df = filters.filter_outliers_by_consecutive_distance(geolife)
print(f"length of geolife: {len(geolife)}")
print(f"length of outlier_df: {len(outlier_df)}")
print(f"Number of outliers: {len(geolife) - len(outlier_df)}")

length of geolife: 217653
length of outlier_df: 212124
Number of outliers: 5529
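
The notebook does not say which rule the library uses to decide what counts as an outlier. One widely used choice is the interquartile-range (IQR) rule, sketched below on made-up consecutive distances. The helper `iqr_bounds`, the factor of 1.5, and the quantile method are all assumptions of this sketch.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up distances between consecutive points, with one obvious jump.
distances = [5.4, 6.0, 5.5, 6.2, 5.8, 250.0, 5.9, 6.1]
low, high = iqr_bounds(distances)
kept = [d for d in distances if low <= d <= high]
print(f"kept {len(kept)} of {len(distances)} points")
```

Because the fences come from quartiles rather than from the mean, a single extreme jump does not stretch them enough to hide itself.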


In [15]:
# Now, let's remove the outliers based on the
# speed between 2 consecutive points.

odf_two = filters.filter_outliers_by_consecutive_speed(geolife)
print(f"length of geolife: {len(geolife)}")
print(f"length of odf_two: {len(odf_two)}")
print(f"Number of outliers: {len(geolife) - len(odf_two)}")

length of geolife: 217653
length of odf_two: 195283
Number of outliers: 22370


In [16]:
# Now, let's remove the trajectories that have
# fewer than 5 points.

short_traj_gone = filters.remove_trajectories_with_less_points(atlantic, 5)
print(f"Number of unique Traj IDs in atlantic: {atlantic.traj_id.nunique()}")
print(f"Number of unique Traj IDs left after filter: {short_traj_gone.traj_id.nunique()}")



Number of unique Traj IDs in atlantic: 1814
Number of unique Traj IDs left after filter: 1772
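
Dropping short trajectories amounts to counting the points per trajectory ID and keeping only the rows whose ID meets the threshold. A plain-pandas sketch on made-up data follows. It keeps trajectories with at least 5 points, matching the "fewer than 5" wording of the comment above, and is not the library's implementation.

```python
import pandas as pd

# Made-up points: trajectory A has 6 points, B has 2, C has 5.
points = pd.DataFrame({
    "traj_id": ["A"] * 6 + ["B"] * 2 + ["C"] * 5,
    "lat": range(13),
})

min_points = 5
# Map every row to the size of the trajectory it belongs to.
sizes = points["traj_id"].map(points["traj_id"].value_counts())
long_enough = points[sizes >= min_points]

print(points["traj_id"].nunique(), long_enough["traj_id"].nunique())
```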
