# Filtering and Preprocessing
We have GPS data from a vehicle driving around Munich for a few weeks. The data was recorded with a smartphone app that is supposed to detect when the vehicle is driving and trigger a recording. This doesn't work every time, so we need to find and remove points where the vehicle is not driving. In cities, the GPS signal is also not always good, so we need to detect when the signal is bad and remove those points.

Let's start by importing the modules.

In [None]:
import pandas as pd, geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from shapely import wkb
import mplleaflet
import sys
import pickle
# import additional functions
%run ./data/custom_functions.py



In [None]:
# Get the raw data from the smartphone. It is prepared as a geopandas dataframe with all GPS positions converted to shapely points
data = pd.read_pickle('data/gps_data.p')

In [None]:
# First display the data we got. 
display(data)

## Preparing data for filtering
The first task is to find and remove points that indicate a measurement error.    
Which aspects can indicate such errors if you have a time series of GPS points as your data basis?

In [None]:
# Aspects indicating measurement errors
#<<solution>>
# - Big change in Speed over short time period = Acceleration
# - Big change of location/distance over short time period = Acceleration
# - Bad GPS signal indication from GPS module (HDOP, VDOP)
#<</solution>>

Now we precalculate some values between two consecutive data points that we will need several times.    
- Time difference
- Distance
- Acceleration (speed difference divided by time difference)
- Speed from distance difference (for comparison)

For the distance calculation between two points, a function is already provided in 'custom_functions.py'. It returns the distance in meters.

    lat_lon_2_m(latitude_1, longitude_1, latitude_2, longitude_2)    
    
Write a function to calculate the distance between consecutive data points in the geopandas dataframe. 
Hint: 

    You can access the longitude/latitude with .x /.y on the geometry object.

In [None]:
def calculate_distance_points(points):
    # initialize distance array; the first point has no predecessor, so its distance stays 0
    distance = np.zeros(len(points))
    # loop over all pairs of consecutive points
    for i in range(len(points) - 1):
        # shapely stores longitude in .x and latitude in .y,
        # while lat_lon_2_m expects (latitude, longitude) pairs
        d = lat_lon_2_m(
            points.iloc[i].y,
            points.iloc[i].x,
            points.iloc[i + 1].y,
            points.iloc[i + 1].x,
        )
        # store the distance at the later of the two points
        distance[i + 1] = d
    return distance
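
The helper `lat_lon_2_m` comes from 'custom_functions.py', which is not shown here. As a rough idea of what such a helper could look like, here is a minimal sketch of a great-circle (haversine) distance. The name `haversine_m` and the spherical Earth radius are assumptions for illustration, not the actual implementation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km
one_degree = haversine_m(48.0, 11.5, 49.0, 11.5)
print(round(one_degree))
```

Note the argument order: latitude first, then longitude. With shapely points this means passing `.y` before `.x`.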

In [None]:
# time difference to the previous point using pandas.shift();
# the first row is compared against the Unix epoch (assumes timezone-naive timestamps),
# so it gets a very large value and is later treated as a track start
data['time_diff'] = data.time - data.time.shift(1, fill_value=pd.Timestamp(0))
# distance to the previous point in meters
data['distance_diff'] = calculate_distance_points(data.geom)
# acceleration from the GPS speed: speed difference divided by time difference
data['acceleration'] = (data.speed-data.speed.shift())/(data.time_diff.dt.total_seconds())
# for comparison, derive the speed from the distance and the time between consecutive points
data['speed_calc'] = (data.distance_diff/(data.time_diff.dt.total_seconds()))
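
To make the derived columns easier to follow, here is a small sketch on three made-up points. The values are invented for illustration. It uses `diff()` instead of `shift()`, so the first row simply becomes `NaN` because it has no predecessor.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2015-01-14 14:00:00",
        "2015-01-14 14:00:01",
        "2015-01-14 14:00:03",
    ]),
    "speed": [10.0, 12.0, 8.0],          # m/s, as reported by the GPS module
    "distance_diff": [0.0, 11.0, 20.0],  # m, distance to the previous point
})

# seconds since the previous point (NaN for the first row)
dt = toy["time"].diff().dt.total_seconds()
# acceleration in m/s^2 from the reported speed
toy["acceleration"] = toy["speed"].diff() / dt
# speed in m/s derived from position and time only
toy["speed_calc"] = toy["distance_diff"] / dt
print(toy)
```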

In [None]:
# display rows where the time difference is bigger than one minute. The threshold must be given as a timedelta object.
display(data[data.time_diff > np.timedelta64(1, 'm')])

In [None]:
# compare speed from GPS vs speed calculated from distance/time
# root-mean-square error between two sets of values
def rsme(predictions,targets):
    return np.sqrt(np.mean((predictions-targets)**2))

# Insert your code here...
#<<solution>>
data['gps_vs_calc'] = data.speed - data.speed_calc
data['speed_error'] = data.apply( lambda x: rsme(x.speed_calc, x.speed).astype(float), axis =1)
#<</solution>>

#<<solution>>
data.gps_vs_calc.plot(ylim = (0,10), figsize = (30,7))

In [None]:
data.speed_error.plot(ylim = (0,20), figsize = (30,7))
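# RMSE over all rows (the mean of the per-row values above would be the mean absolute error instead)
print("Total RMSE: " + str(rsme(data.speed_calc, data.speed)))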
#<</solution>>
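
Applied to a single row, the root-mean-square error reduces to the absolute difference, so averaging those per-row values gives the mean absolute error rather than the RMSE. The following sketch shows the difference on invented numbers.

```python
import numpy as np

speed_gps = np.array([10.0, 12.0, 8.0, 15.0])   # made-up GPS speeds in m/s
speed_calc = np.array([10.0, 12.0, 8.0, 11.0])  # made-up speeds from distance/time

# per-point error, its mean (MAE) and the root-mean-square error (RMSE)
abs_error = np.abs(speed_gps - speed_calc)
mae = abs_error.mean()
rmse = np.sqrt(np.mean((speed_gps - speed_calc) ** 2))
print(mae, rmse)
```

Because the errors are squared before averaging, the RMSE weights single large deviations more strongly than the MAE does.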

# Filter values with low speed
Remove rows from the data where the speed is near zero. But first, let's visualize the current speed distribution with a KDE plot.    
Hint: 

    You can use the plotting function integrated in pandas for a quick plot. 
    Parameters are explained in the source.
    https://github.com/pandas-dev/pandas/blob/v0.25.1/pandas/plotting/_core.py#L504-L1533

In [None]:

# Plot the KDE before removing the near zero values
data['speed'].plot.kde(figsize = (30,7))
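
A kernel density estimate places a small Gaussian bump on every sample and averages them. The sketch below computes this by hand with NumPy on synthetic speeds, as a conceptual illustration only. The fixed bandwidth, the helper name `gaussian_kde_1d` and the synthetic data are assumptions. The pandas plot chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # standardized distance of every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # average the kernels of all samples at each grid point
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
speeds = np.concatenate([
    np.abs(rng.normal(0.0, 0.2, 300)),  # synthetic "standing" samples near 0 m/s
    rng.normal(12.0, 3.0, 700),         # synthetic "driving" samples around 12 m/s
])

grid = np.linspace(-5, 30, 701)
density = gaussian_kde_1d(speeds, grid, bandwidth=0.5)
# the density should integrate to roughly 1 over the grid
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A pronounced peak near zero in such a plot is what the low-speed filter is meant to remove.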


In [None]:
data.speed.plot(figsize = (30,7))

Plot some points with very low speed on a map. How do they behave?    
Hint:

    Use the prepared function 'map_folium' for plotting a dataframe.
    map_folium(pos_data, colorvalue, geometry = 'geom', c_min = -1, c_max = -1, line = False, cm_type='jet')
    Take a subset of the data to keep it plottable (max. 6500 points) by conditional filtering (see the link below for examples).
    https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/

In [None]:
condition = (data.time > '2015-01-14 14:05:29.748') & (data.time < '2015-01-14 16:28:26.000') & (data.speed < 1)
map_folium(data[condition], 'speed')

In [None]:
# Filter Data
# Select a threshold value for clipping points by speed
speed_thr = 0.1 # m/s
# First calculate a filtered speed with a rolling median over 3 consecutive samples. This prevents short stops from being removed
data['speed_median']= data['speed'].rolling(3).median()

#apply filter
data = data[data.speed_median > speed_thr]
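
To see what the rolling median does to the filter, here is a small sketch on invented speed values: a single zero sample between moving samples survives, while a longer stop is removed.

```python
import pandas as pd

# made-up speeds in m/s: one isolated stop (index 2) and one longer stop (index 5 to 7)
speed = pd.Series([5.0, 5.0, 0.0, 5.0, 5.0, 0.0, 0.0, 0.0, 5.0])

# median over the current sample and the two before it
speed_median = speed.rolling(3).median()
# keep only samples whose smoothed speed is above the threshold
kept = speed[speed_median > 0.1]
print(kept.index.tolist())
```

Two side effects are visible here. The first two samples are dropped because their window is incomplete and the median is `NaN`. The window also looks backwards only, so the filter reacts with a delay at the beginning and end of a stop.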

In [None]:
data.speed.plot(figsize = (30,7))

In [None]:
# Plot the KDE after removing the near zero values
#<<solution>>
data['speed'].plot.kde()
#<</solution>>

# Filter GPS outliers
Filter out points with a bad signal quality index, the horizontal dilution of precision (HDOP). The table below explains the value ranges.    

| DOP Value | Rating    | Description |
|-----------|-----------|-------------|
| 1         | Ideal     | Highest possible confidence level to be used for applications demanding the highest possible precision at all times. |
| 1-2       | Excellent | At this confidence level, positional measurements are considered accurate enough to meet all but the most sensitive applications. |
| 2-5       | Good      | Represents a level that marks the minimum appropriate for making accurate decisions. Positional measurements could be used to make reliable in-route navigation suggestions to the user. |
| 5-10      | Moderate  | Positional measurements could be used for calculations, but the fix quality could still be improved. A more open view of the sky is recommended. |
| 10-20     | Fair      | Represents a low confidence level. Positional measurements should be discarded or used only to indicate a very rough estimate of the current location. |
| >20       | Poor      | At this level, measurements are inaccurate by as much as 300 meters with a 6-meter accurate device (50 DOP × 6 meters) and should be discarded. |

Source: https://en.wikipedia.org/wiki/Dilution_of_precision_(navigation)    

Visualize the HDOP Values indicating a poor quality. Check out the locations on the map.

In [None]:
condition = data.hdop > 8
map_folium(data[condition], 'hdop')
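
The rating bands from the DOP table can also be attached to each point as a label, which makes it easy to count how many points fall into each class. This is a minimal sketch with invented HDOP values. How the boundary values (for example exactly 2 or 5) are assigned is an assumption, since the table leaves it open.

```python
import numpy as np
import pandas as pd

hdop = pd.Series([0.8, 1.5, 3.0, 8.0, 15.0, 25.0])  # made-up HDOP values

# bin edges and labels following the DOP table; upper edges are inclusive
bins = [0, 1, 2, 5, 10, 20, np.inf]
labels = ["Ideal", "Excellent", "Good", "Moderate", "Fair", "Poor"]
rating = pd.cut(hdop, bins=bins, labels=labels)
print(rating.value_counts(sort=False))
```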

In [None]:
# display rows where the acceleration is implausibly high although the points are close in time
#display(data[ #condition])
#<<solution>>
display(data[(data.acceleration > 5) & (data.time_diff < np.timedelta64(5, 's'))])
#<</solution>>

In [None]:
# display rows where hdop is very high
display(data[data.hdop > 20])

In [None]:
# Remove points with very bad (high) HDOP
#hdop_thr = select_a_threshold_for_bad_hdop_value
#filter data
#<<solution>>
hdop_thr = 20
data = data[data.hdop < hdop_thr]
#<</solution>>

In [None]:
# clean up
# first reset the index to compensate for the removed data points
data =data.reset_index(drop=True)
# Recalculate all values relative to the neighbouring points using the functions we already had above. This helps us find the start and end of a trip
# recalculate the distance to the previous point in meters
data['distance_diff'] = calculate_distance_points(data.geom)
# the first row is compared against the Unix epoch (assumes timezone-naive timestamps), so it gets a very large time difference
data['time_diff'] = data.time - data.time.shift(1, fill_value=pd.Timestamp(0))
# for comparison, derive the speed and acceleration values from the positions and the time between them
data['speed_calc'] = (data.distance_diff/(data.time_diff.dt.total_seconds()))
data['acceleration_calc'] = (data.speed_calc-data.speed_calc.shift(1))/(data.time_diff.dt.total_seconds())
#display(data)

# Analyze Tracks
After some basic filtering, we analyze the bulk of raw data to find single trips (tracks). This can be done by looking for large time differences between two consecutive data points.

In [None]:
# First define a new DataFrame for storing the tracks we extract
tracks = pd.DataFrame(columns=['time_start','time_stop'])
# For the start of a track, simply take the time difference we calculated before. If it is bigger than x minutes (you can play around with this value), we declare a new track.
tracks['time_start'] = data[data.time_diff > np.timedelta64(5, 'm')].time

# For the track end, we take the point just before the start of each new track.
# If the first data point starts a track, index - 1 is -1 and selects the last data point; sorting the stop times assigns it to the last track.
tracks['time_stop'] = data.iloc[tracks.index-1].time.sort_values().values
# Let's calculate our track durations.
tracks['duration'] = tracks.time_stop- tracks.time_start
# We can filter short tracks by duration. These are the leftovers of our previous filtering
tracks = tracks[tracks.duration > np.timedelta64(30, 's')]

# reindex the tracks to the track number instead of the first data point in the original DataFrame
tracks.reset_index(drop = True, inplace=True)
# Add a date column for grouping tracks by date for export
tracks['date'] = tracks.time_start.apply(lambda x: x.date())
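
An alternative way to express the same idea is to flag every gap above the threshold and take a cumulative sum of the flags, which yields a track number per point. The sketch below uses invented timestamps. The names `points`, `track_id` and `toy_tracks` are hypothetical and not part of the notebook.

```python
import pandas as pd

points = pd.DataFrame({"time": pd.to_datetime([
    "2015-01-14 08:00:00", "2015-01-14 08:00:01", "2015-01-14 08:00:02",
    "2015-01-14 17:30:00", "2015-01-14 17:30:01",
])})

# True wherever the gap to the previous point exceeds 5 minutes
# (the first row has no predecessor and compares as False)
gap = points["time"].diff() > pd.Timedelta(minutes=5)
# every gap starts a new track, so the running sum is the track number
points["track_id"] = gap.cumsum()

# start, stop and duration per track
toy_tracks = points.groupby("track_id")["time"].agg(time_start="min", time_stop="max")
toy_tracks["duration"] = toy_tracks["time_stop"] - toy_tracks["time_start"]
print(toy_tracks)
```

With a track number on every point, later per-track calculations can use `groupby` instead of filtering by start and stop time.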

In [None]:
display(tracks)

In [None]:
# reset track view counter
track_no = 0

In [None]:
# Plot a single Track on a map, show next track when executing again
if track_no >= tracks.shape[0]: 
    track_no=0
    print("Restart")
my_track = data[(data.time > tracks.time_start[track_no]) & (data.time < tracks.time_stop[track_no])]
print("Track No. " + str(track_no) + " Duration: " + str(tracks.duration[track_no]))
track_no += 1
map_folium(my_track, 'speed' , line=True)

In [None]:
# Display Data of this track for manual analysis
display(my_track)

In [None]:
# Now we store the data in a list with one entry per track: [date, [dataframe with the points of the track]]
def generate_daily_driving(data, tracks):
    trackdata = []
    for _, row in tracks.iterrows():
        # select all points between start and stop of the track (both inclusive)
        in_track = (data.time >= row.time_start) & (data.time <= row.time_stop)
        trackdata.append([row['date'], [data.loc[in_track, ['time', 'geom', 'speed']]]])
    return trackdata

trackdata = generate_daily_driving(data, tracks)
with open("trackdata.p", "wb") as f:
    pickle.dump(trackdata, f)
# load with:
# with open("trackdata.p", "rb") as f:
#     trackdata = pickle.load(f)
#display(trackdata)

# Extra Task 1
Calculate the distance and average speed for each track by summing up the point-to-point distances.

In [None]:
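#<<solution>>
# sum the point-to-point distances per track and convert meters to kilometers;
# the start point is excluded because its distance_diff refers to the previous track
tracks['distance'] = [
    data[(data.time > row.time_start) & (data.time <= row.time_stop)].distance_diff.sum() / 1000
    for _, row in tracks.iterrows()
]
# average speed in km/h
tracks['avg_speed'] = tracks.distance/(tracks.duration.dt.total_seconds()/3600)
display(tracks)
#<</solution>>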
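
If every point carries a track number, the same summary can be computed with a single `groupby`. The following sketch uses invented values. The column names `track_id`, `distance_km` and `avg_speed_kmh` are hypothetical and only serve the illustration.

```python
import pandas as pd

points = pd.DataFrame({
    "track_id": [0, 0, 0, 1, 1],
    "time": pd.to_datetime([
        "2015-01-14 08:00:00", "2015-01-14 08:00:30", "2015-01-14 08:01:00",
        "2015-01-14 17:30:00", "2015-01-14 17:30:30",
    ]),
    # meters to the previous point; 9000 m is the jump between the two tracks
    "distance_diff": [0.0, 500.0, 500.0, 9000.0, 250.0],
})

# the first point of a track measures the distance to the previous track, so zero it out
is_first = points["track_id"] != points["track_id"].shift()
points.loc[is_first, "distance_diff"] = 0.0

summary = points.groupby("track_id").agg(
    distance_m=("distance_diff", "sum"),
    time_start=("time", "min"),
    time_stop=("time", "max"),
)
summary["distance_km"] = summary["distance_m"] / 1000
hours = (summary["time_stop"] - summary["time_start"]).dt.total_seconds() / 3600
summary["avg_speed_kmh"] = summary["distance_km"] / hours
print(summary[["distance_km", "avg_speed_kmh"]])
```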

# Extra Task 2
Try to optimize the filtering to get better results regarding track separation. Optimize the low-speed filtering to avoid deleting traffic jams.
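
One possible starting point, sketched here on synthetic 1 Hz data, is to keep a point whenever the vehicle moved at some time within the preceding 30 seconds, using a time-based rolling maximum. The 30-second window and the 0.1 m/s threshold are assumptions to experiment with, not recommended values.

```python
import numpy as np
import pandas as pd

# synthetic 1 Hz speed signal: driving, 20 s standstill (traffic jam), driving, then parked
index = pd.Timestamp("2015-01-14 08:00:00") + pd.to_timedelta(np.arange(100), unit="s")
speed = np.zeros(100)
speed[0:10] = 10.0
speed[30:40] = 10.0
s = pd.Series(speed, index=index)

# highest speed within the last 30 seconds, including the current sample
recent_max = s.rolling("30s").max()
# keep samples where the vehicle moved at some point inside that window
kept = s[recent_max > 0.1]
print(len(kept), "of", len(s), "samples kept")
```

The 20-second standstill stays in the data, while the parked phase is cut off once the window no longer contains any movement. The price is that the first seconds of every long stop are kept as well.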