# Outlier detection

The dataset contains a large number of samples that are trivially wrong: simply keeping the points whose latitude and longitude fall approximately within the New York City area already removes several samples. This suggests that, even within New York City, there are further points that can be removed.

In [None]:
import matplotlib.pyplot as plt
import polars as pl
import taxifare.data as data
import taxifare.boroughs as boroughs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns
import numpy as np
import math


In [None]:
df = data.load_data().fetch(500_000)
# df = data.load_data().collect()

# Run preprocess.py to obtain the parquet dataset
# df = pl.read_parquet('datasets/train.parquet')
df.head()

In [None]:
describe_infos = df.describe()
describe_infos

## Passenger count
According to the [NYC taxi commission](https://www.nyc.gov/site/tlc/passengers/passenger-frequently-asked-questions.page#:~:text=The%20maximum%20amount%20of%20passengers,of%20an%20adult%20passenger%20seated), the maximum number of passengers, for suitable vehicles, is five, and an additional sixth passenger (a child) is admitted. Samples with more than six passengers can therefore be treated as noise. In fact, values greater than six are highly underrepresented.

In [None]:
df.groupby('passenger_count').agg(pl.count()).sort('passenger_count')

In [None]:
df = df.filter(pl.col('passenger_count') <= 6)

## Analyzing spatial locations
Thanks to [1] we can download a map of New York City that helps us visualize the pickup and dropoff locations.
Several styles are available, but for our purposes it is useful to have a map without any labels or decorations.

To download the map image we have to provide a bounding box of the area we want to download. For visualization purposes we chose a square bounding box (note that a square on a sphere is not a square on a plane, so additional steps are needed).
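
A minimal sketch of those additional steps, assuming a haversine great-circle distance and a mean Earth radius of 6371 km. The helpers `haversine_km` and `pad_latitude` and the example coordinates are hypothetical stand-ins; they are not the implementations of `data.distance` and `data.find_latitude_correction`:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)


def haversine_km(p, q):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pad_latitude(lat, km, direction):
    """Shift a latitude by `km` kilometres north (direction=1) or south (direction=-1)."""
    return lat + direction * math.degrees(km / EARTH_RADIUS_KM)


# Hypothetical bounding box, wider than it is tall
x_min, x_max, y_min, y_max = -74.3, -73.7, 40.6, 40.8

width = haversine_km((x_min, y_min), (x_max, y_min))
height = haversine_km((x_min, y_min), (x_min, y_max))
pad = (width - height) / 2

# Extend the box equally to the south and to the north
new_y_min = pad_latitude(y_min, pad, -1)
new_y_max = pad_latitude(y_max, pad, 1)
new_height = haversine_km((x_min, new_y_min), (x_min, new_y_max))
```

In this sketch the width is measured along the original bottom edge, so the result is only approximately square: the top edge of the box is slightly shorter than the bottom one.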

In [None]:
max_values = describe_infos.filter(pl.col('describe') == 'max').select(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
min_values = describe_infos.filter(pl.col('describe') == 'min').select(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
x_max = max(max_values['pickup_longitude'][0], max_values['dropoff_longitude'][0])
y_max = max(max_values['pickup_latitude'][0], max_values['dropoff_latitude'][0])
x_min = min(min_values['pickup_longitude'][0], min_values['dropoff_longitude'][0])
y_min = min(min_values['pickup_latitude'][0], min_values['dropoff_latitude'][0])

# Make the area a square
width = data.distance((x_min,y_min), (x_max,y_min))
height = data.distance((x_min,y_min), (x_min,y_max))

assert width > height

additional_space = (width - height)/2

new_lat_min, _ = data.find_latitude_correction((x_min,y_min), additional_space, b=-1)
new_lat_max, _ = data.find_latitude_correction((x_min,y_max), additional_space, b=1)

points_area = x_min, x_max, new_lat_min, new_lat_max

url = 'https://b.basemaps.cartocdn.com/light_nolabels/{z}/{x}/{y}.png'
image = data.new_york_map(points_area)

plt.imshow(image)
plt.show()
print(points_area)

## Detecting points in the ocean - Oceanic detection
Now that we have a map of New York City, we can use the image as a mask to filter out the points that fall into the ocean.
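
A minimal sketch of such a mask lookup, assuming that water pixels share one known color, that the image covers exactly `points_area`, and that the image can be treated as linear in longitude and latitude. The helper `point_on_ocean`, the `OCEAN_RGB` color and the toy image are hypothetical; the actual check lives in `data.polars_point_on_ocean`:

```python
import numpy as np

OCEAN_RGB = np.array([200, 210, 220])  # hypothetical water color


def point_on_ocean(lon, lat, image, points_area):
    """Return a boolean array: True where the pixel under (lon, lat) has the water color."""
    left, right, bottom, top = points_area
    rows, cols = image.shape[:2]
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Column grows with longitude; row 0 is the top of the image (maximum latitude)
    col = ((lon - left) / (right - left) * cols).astype(int)
    row = ((top - lat) / (top - bottom) * rows).astype(int)
    # Points on the right or bottom edge fall into the last pixel
    col = np.clip(col, 0, cols - 1)
    row = np.clip(row, 0, rows - 1)
    return np.all(image[row, col, :3] == OCEAN_RGB, axis=-1)


# Toy 4x4 map: left half land (white), right half water
image = np.full((4, 4, 3), 255, dtype=np.uint8)
image[:, 2:] = OCEAN_RGB
points_area = (-74.0, -73.0, 40.0, 41.0)

mask = point_on_ocean([-73.9, -73.1], [40.5, 40.5], image, points_area)
```

The sketch clips out-of-range indices to the image border, so it is only meaningful for points that already lie inside `points_area`.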

In [None]:
# Remove points on ocean, not working at the moment
ocean_pickup = df.select(
    pl.struct(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
    .map(data.polars_point_on_ocean(points_area, pickup=True))
    ).get_columns()[0].alias('ocean_pickup')
ocean_dropoff = df.select(
    pl.struct(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
    .map(data.polars_point_on_ocean(points_area, dropoff=True))
    ).get_columns()[0].alias('ocean_dropoff')

print('Pickups in the ocean', ocean_pickup.arg_true().shape[0])
print('Dropoffs in the ocean', ocean_dropoff.arg_true().shape[0])
print('Total ocean outlier samples',
      (ocean_dropoff | ocean_pickup).arg_true().shape[0])

outsiders_pickup = df.filter(ocean_pickup)
outsiders_dropoff = df.filter(ocean_dropoff)

In [None]:
def print_point_on_map(ax, x, y, points_area, image, markersize=.5, color='b', title=None):
    left, right, bottom, top = points_area
    ax.imshow(image, extent=(left, right, bottom, top))
    ax.set_ylim(bottom, top)
    ax.set_xlim(left, right)
    ax.scatter(x, y, s=markersize, c=color)
    if title is not None:
        ax.title.set_text(str(title))

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(30, 30))

print_point_on_map(axs[0], outsiders_pickup['pickup_longitude'], outsiders_pickup['pickup_latitude'], points_area, image, color='b', markersize=3)
print_point_on_map(axs[1], outsiders_dropoff['dropoff_longitude'], outsiders_dropoff['dropoff_latitude'], points_area, image, color='r', markersize=3)

In [None]:
df = df.filter(~ocean_pickup & ~ocean_dropoff)

## Detecting points outside boroughs

While analyzing the dataset, we noticed that the taxi fare also depends on the boroughs in which the pickup and dropoff locations lie. For this reason we also included the borough and neighborhood of the pickup and dropoff as features. Now that we have geographical information about the boroughs, we can remove additional outliers that lie outside the area of New York City.

This method makes the earlier oceanic detection useless, because we now have a better filter, but we decided to keep that implementation in this notebook anyway.

Thanks to [2] we filtered the samples that have pickup or dropoff outside the boroughs of New York City.

Using the areas as polygons forces us to use a slow function to check whether a point is inside a non-convex shape. Instead, we can "compile" the polygons into an image and use it as an array to look up the value associated with a position (similar to the oceanic detection). This increased performance by 10x.
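
A minimal sketch of the idea, assuming polygons given as lists of (longitude, latitude) vertices and a square grid. The helpers `points_in_polygon`, `rasterize` and `lookup` and the toy polygon are hypothetical and are not the code behind `boroughs.get_image_boroughs` or `boroughs.point_boroughs`:

```python
import numpy as np


def points_in_polygon(px, py, vertices):
    """Even-odd ray casting test for arrays of points against one polygon."""
    inside = np.zeros(px.shape, dtype=bool)
    n = len(vertices)
    for k in range(n):
        (x1, y1), (x2, y2) = vertices[k], vertices[(k + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # Edges that straddle the horizontal line through each point
        crosses = (y1 > py) != (y2 > py)
        # x coordinate where the edge meets that line
        x_hit = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < x_hit)
    return inside


def rasterize(polygons, points_area, size):
    """Build a (size, size) integer grid: 0 = outside, i + 1 = inside polygons[i]."""
    left, right, bottom, top = points_area
    # Coordinates of the pixel centers; row 0 is the top of the grid
    xs = left + (np.arange(size) + 0.5) * (right - left) / size
    ys = top - (np.arange(size) + 0.5) * (top - bottom) / size
    grid_x, grid_y = np.meshgrid(xs, ys)
    grid = np.zeros((size, size), dtype=np.int32)
    for i, vertices in enumerate(polygons):
        # The polygon test runs once per pixel here, not once per sample later
        grid[points_in_polygon(grid_x, grid_y, vertices)] = i + 1
    return grid


def lookup(lon, lat, grid, points_area):
    """Return the polygon id under each point with a single array indexing operation."""
    left, right, bottom, top = points_area
    size = grid.shape[0]
    col = ((np.asarray(lon) - left) / (right - left) * size).astype(int)
    row = ((top - np.asarray(lat)) / (top - bottom) * size).astype(int)
    return grid[np.clip(row, 0, size - 1), np.clip(col, 0, size - 1)]


# Toy example: one non-convex, L-shaped polygon inside a unit square
l_shape = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.4), (0.4, 0.4), (0.4, 0.9), (0.1, 0.9)]
area = (0.0, 1.0, 0.0, 1.0)
grid = rasterize([l_shape], area, size=100)

ids = lookup([0.2, 0.7, 0.7], [0.8, 0.2, 0.7], grid, area)
```

The trade-off is resolution: points close to a polygon border can be assigned to the wrong side when the grid is coarse, so the grid size bounds the accuracy of the lookup.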

In [None]:
boros = boroughs.load()

In [None]:
boros_image, boros_colors = boroughs.get_image_boroughs(boros, points_area)

df = df.with_column(pl.struct(['pickup_longitude', 'pickup_latitude'])
                    .map(boroughs.point_boroughs(boros_image, boros_colors, points_area, "pickup_")).alias('pickup_borough'))

df = df.with_column(pl.struct(['dropoff_longitude', 'dropoff_latitude'])
                    .map(boroughs.point_boroughs(boros_image, boros_colors, points_area, "dropoff_")).alias('dropoff_borough'))

In [None]:
df = df.filter((pl.col('pickup_borough') != 'None') & (pl.col('dropoff_borough') != 'None'))

In [None]:
fig, ax = plt.subplots(1, figsize=(30, 30))

for b, color in zip(boros.values(), ['r','g','b','c', 'm', 'y']):
    df_tmp = df.filter(pl.col('pickup_borough') == b['name'])
    print_point_on_map(ax, df_tmp['pickup_longitude'], df_tmp['pickup_latitude'], points_area, image, color=color)

# References
*TODO: properly cite?*

[1]: CARTO basemap styles, https://github.com/CartoDB/basemap-styles
[2]: New York City Neighborhoods, 2007, https://geodata.lib.utexas.edu/catalog/sde-columbia-nycp_2007_nynh