In [9]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon, LineString

<font size="6"> Joining Datasets </font>

To train classifiers on routes together with the points of interest they pass within sight of, we first had to match our cleaned routes data with our cleaned points of interest (poi) data. We chose a spatial join from the geopandas library, which matches the two datasets on their geometry.

Of course, this being a study project, we did this for only 451 scraped and cleaned routes from wandermap.net and a (sufficiently large) subset of 18 types of poi scraped from openstreetmap.org.

In [10]:
# reading in the routes and poi data
routes = pd.read_csv('cleaned_routes_data/cleaned_all_routes_data_long.csv')
#poi = pd.read_csv('cleaned_all_poi_data.csv').iloc[0:5000, :] #using smaller sample for experimenting
poi = pd.read_csv('cleaned_all_poi_data_fixed.csv')

In [11]:
# converting routes df into a GeoDataFrame
gdf_routes = gpd.GeoDataFrame(routes, geometry=gpd.points_from_xy(routes.longitude, routes.latitude))
gdf_routes.drop(['Unnamed: 0', 'lat_lgt'], axis=1, inplace=True)
gdf_routes

Unnamed: 0,route_id,num_of_waypoint,latitude,longitude,geometry
0,1005019,0,52.50607,13.33208,POINT (13.33208 52.50607)
1,1005019,1,52.50553,13.33163,POINT (13.33163 52.50553)
2,1005019,2,52.50525,13.33148,POINT (13.33148 52.50525)
3,1005019,3,52.50515,13.33337,POINT (13.33337 52.50515)
4,1005019,4,52.50520,13.33366,POINT (13.33366 52.50520)
...,...,...,...,...,...
186229,933359,151,52.50444,13.38246,POINT (13.38246 52.50444)
186230,933359,152,52.50525,13.38633,POINT (13.38633 52.50525)
186231,933359,153,52.50643,13.38615,POINT (13.38615 52.50643)
186232,933359,154,52.50648,13.39023,POINT (13.39023 52.50648)


In [12]:
# converting points from routes data to linestrings and adding these into a new column
gdf_routes['route_linestring'] = gdf_routes['route_id'].map(gdf_routes.groupby(['route_id'])['geometry'].apply(lambda x: LineString(x.tolist())))
gdf_routes

Unnamed: 0,route_id,num_of_waypoint,latitude,longitude,geometry,route_linestring
0,1005019,0,52.50607,13.33208,POINT (13.33208 52.50607),"LINESTRING (13.33208 52.50607, 13.33163 52.505..."
1,1005019,1,52.50553,13.33163,POINT (13.33163 52.50553),"LINESTRING (13.33208 52.50607, 13.33163 52.505..."
2,1005019,2,52.50525,13.33148,POINT (13.33148 52.50525),"LINESTRING (13.33208 52.50607, 13.33163 52.505..."
3,1005019,3,52.50515,13.33337,POINT (13.33337 52.50515),"LINESTRING (13.33208 52.50607, 13.33163 52.505..."
4,1005019,4,52.50520,13.33366,POINT (13.33366 52.50520),"LINESTRING (13.33208 52.50607, 13.33163 52.505..."
...,...,...,...,...,...,...
186229,933359,151,52.50444,13.38246,POINT (13.38246 52.50444),"LINESTRING (13.49061 52.50157, 13.48447 52.503..."
186230,933359,152,52.50525,13.38633,POINT (13.38633 52.50525),"LINESTRING (13.49061 52.50157, 13.48447 52.503..."
186231,933359,153,52.50643,13.38615,POINT (13.38615 52.50643),"LINESTRING (13.49061 52.50157, 13.48447 52.503..."
186232,933359,154,52.50648,13.39023,POINT (13.39023 52.50648),"LINESTRING (13.49061 52.50157, 13.48447 52.503..."


Each row now holds one waypoint of a specific route ("route_id" is categorical data) together with the LineString of the whole route that the waypoint belongs to.
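The grouping step can be illustrated on a toy table. This is a minimal sketch, not the notebook's exact code: the sample coordinates are made up, and sorting by num_of_waypoint before building the line is an added assumption that makes the vertex order explicit.

```python
import pandas as pd
from shapely.geometry import LineString

# Toy waypoints for two hypothetical routes (values are illustrative only)
waypoints = pd.DataFrame({
    "route_id": [1, 1, 1, 2, 2],
    "num_of_waypoint": [0, 1, 2, 0, 1],
    "longitude": [13.0, 13.1, 13.2, 13.5, 13.6],
    "latitude": [52.0, 52.1, 52.2, 52.5, 52.6],
})

# Assumption: waypoints have to be in travel order before they are connected
ordered = waypoints.sort_values(["route_id", "num_of_waypoint"])

# One LineString per route, built from its (longitude, latitude) pairs
lines_by_route = {
    route_id: LineString(list(zip(group["longitude"], group["latitude"])))
    for route_id, group in ordered.groupby("route_id")
}

# Attach the full-route line to every waypoint row of that route
waypoints["route_linestring"] = waypoints["route_id"].map(lines_by_route)
```

Every row of a route then carries the same LineString, which mirrors the long format used in this notebook.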

In [14]:
# converting poi df into a GeoDataFrame
gdf_poi = gpd.GeoDataFrame(poi, geometry=gpd.points_from_xy(poi.lon, poi.lat))
gdf_poi.drop(['Unnamed: 0'], axis=1, inplace=True)
gdf_poi

Unnamed: 0,category,name,id,lat,lon,geometry
0,atm,Bank für Sozialwirtschaft,78252154,52.523744,13.398627,POINT (13.39863 52.52374)
1,atm,Sparda-Bank,87036263,52.532985,13.384282,POINT (13.38428 52.53299)
2,atm,Bankhaus August Lenz,89275133,52.518025,13.406956,POINT (13.40696 52.51802)
3,atm,,213106623,52.542170,13.441137,POINT (13.44114 52.54217)
4,atm,Berliner Sparkasse,213113204,52.542750,13.392862,POINT (13.39286 52.54275)
...,...,...,...,...,...,...
213006,viewpoint,,8931299152,52.487989,13.275393,POINT (13.27539 52.48799)
213007,viewpoint,,9024702237,52.506772,13.334563,POINT (13.33456 52.50677)
213008,viewpoint,Alpengipfel,9026936271,52.401704,13.366960,POINT (13.36696 52.40170)
213009,viewpoint,,9038673666,52.482133,13.291911,POINT (13.29191 52.48213)


In [19]:
# spatially joining both datasets: each poi is matched to its nearest route waypoint within max_distance
poi_routes = gpd.sjoin_nearest(gdf_poi, gdf_routes, how='inner', max_distance=0.001, distance_col='distance')

# followed by some cosmetic clean-up
poi_routes.drop(['id', 'index_right'], axis=1, inplace=True) #drop identifier columns to reduce risk of overfitting
poi_routes.rename({'geometry': 'poi_lat_lgt', 'lat': 'poi_latitude', 'lon': 'poi_longitude', 'category': 'poi_category', 'name': 'poi_name', 'latitude':'waypoint_latitude', 'longitude':'waypoint_longitude'}, axis=1, inplace=True) #renaming columns
poi_routes = poi_routes.iloc[:, [5,9,6,8,7,4,3,2,0,1,10]] #rearranging columns
poi_routes['route_id'] = poi_routes['route_id'].astype(int, errors='ignore') #converting float to int
poi_routes.dropna(thresh=5, inplace=True) #dropping any poi without a route passing by within max_distance (should be redundant after the inner join)
poi_routes.sort_values(by=['route_id', 'num_of_waypoint'], inplace=True) #sorting by route_id and by number of waypoint to keep the order
poi_routes.reset_index(drop=True, inplace=True) #reset index
poi_routes.head()


Unnamed: 0,route_id,route_linestring,num_of_waypoint,waypoint_longitude,waypoint_latitude,poi_lat_lgt,poi_longitude,poi_latitude,poi_category,poi_name,distance
0,113043,"LINESTRING (13.69072 52.45147, 13.69075 52.451...",133,13.67734,52.43849,POINT (13.67747 52.43852),13.677475,52.438521,viewpoint,Müggeleck,0.000138
1,113043,"LINESTRING (13.69072 52.45147, 13.69075 52.451...",282,13.62759,52.44386,POINT (13.62791 52.44431),13.627905,52.444312,bench,,0.000551
2,113043,"LINESTRING (13.69072 52.45147, 13.69075 52.451...",288,13.62704,52.44442,POINT (13.62716 52.44445),13.627159,52.444446,bench,,0.000122
3,113043,"LINESTRING (13.69072 52.45147, 13.69075 52.451...",288,13.62704,52.44442,POINT (13.62695 52.44448),13.626949,52.444481,bench,,0.00011
4,113043,"LINESTRING (13.69072 52.45147, 13.69075 52.451...",292,13.62716,52.44509,POINT (13.62739 52.44532),13.627385,52.445316,bench,,0.000319


Since the join only covers very small distances within Berlin, we did not set a coordinate reference system (CRS), so max_distance is measured in degrees of longitude/latitude rather than in metres; we considered the resulting distortion negligible. We found a tolerable distance value (max_distance=0.001) by (visual) experimentation with gpsvisualizer.com. At that distance from the route, the poi is at most as far away as the other side of the street, or otherwise visible to a pedestrian.
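To get a feeling for what a threshold of 0.001 means on the ground, it can be converted into approximate metres. The sketch below assumes a spherical Earth and a latitude of about 52.5° for Berlin; both constants are assumptions for illustration and are not taken from the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (spherical approximation)
BERLIN_LATITUDE_DEG = 52.5    # assumed approximate latitude of Berlin
MAX_DISTANCE_DEG = 0.001      # the max_distance value used in the join

# Length of one degree of latitude on the sphere
metres_per_degree = math.pi * EARTH_RADIUS_M / 180

# A degree of longitude shrinks with the cosine of the latitude
north_south_m = MAX_DISTANCE_DEG * metres_per_degree
east_west_m = north_south_m * math.cos(math.radians(BERLIN_LATITUDE_DEG))

print(f"north-south: {north_south_m:.0f} m, east-west: {east_west_m:.0f} m")
```

Under these assumptions the threshold corresponds to roughly 111 m in the north-south direction and roughly 68 m in the east-west direction, so the search radius is not the same in every direction.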

In [20]:
# writing joined sample data into a csv file
#poi_routes.to_csv('joint_sample_data.csv')

# writing joined data into a csv file
poi_routes.to_csv('joint_data.csv', index=False)

 <font size="5"> Notes on joining routes and points of interest data </font> 

To discover which points of interest our popular routes pass by (which we believe is why pedestrians choose those routes in the first place), we had to inner join our routes data with our data about specific points of interest.

The geopandas function sjoin_nearest matches the coordinates of every point of interest (poi) to the nearest waypoint from our routes dataset, provided it is closer than the defined max_distance. (If several waypoints are equally near to a poi, sjoin_nearest matches all of them.)
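The matching behaviour can be seen on a tiny example. This is a sketch with made-up planar coordinates and hypothetical names (`toy_poi`, `toy_waypoints`), not our real data: one poi has two equally near waypoints, the other has no waypoint within max_distance.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two hypothetical poi: one close to the waypoints, one far away
toy_poi = gpd.GeoDataFrame(
    {"poi_name": ["bench", "far_away"]},
    geometry=[Point(0, 0), Point(10, 10)],
)

# Three hypothetical waypoints; routes 1 and 2 are both 0.5 away from "bench"
toy_waypoints = gpd.GeoDataFrame(
    {"route_id": [1, 2, 3]},
    geometry=[Point(0.5, 0), Point(-0.5, 0), Point(3, 0)],
)

# Inner join: poi without any waypoint within max_distance are dropped
toy_joined = gpd.sjoin_nearest(
    toy_poi, toy_waypoints, how="inner", max_distance=1.0, distance_col="distance"
)
print(toy_joined[["poi_name", "route_id", "distance"]])
```

Here "bench" should appear twice, once per equidistant waypoint, while "far_away" should not appear at all.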

After inner joining, we find that 430 of the 451 routes in our routes dataset pass by points of interest from our poi dataset. Before joining, we had grouped the waypoints by route and attached a linestring of the whole route to every waypoint, so that after joining each poi is essentially matched with information on the full route. We keep the data in long format, so it is easier to work with.
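A count like this can be obtained by comparing the distinct route ids before and after the join. The following sketch uses a small made-up table in place of our data; `all_route_ids` and `toy_joined_long` are hypothetical names.

```python
import pandas as pd

# Hypothetical ids of all routes before the join
all_route_ids = pd.Series([101, 102, 103, 104])

# Hypothetical joined data in long format: one row per matched poi
toy_joined_long = pd.DataFrame({
    "route_id": [101, 101, 103],
    "poi_category": ["bench", "atm", "viewpoint"],
})

# Number of routes that have at least one poi within max_distance
routes_with_poi = toy_joined_long["route_id"].nunique()

# Routes that dropped out of the inner join entirely
routes_without_poi = sorted(
    set(all_route_ids.tolist()) - set(toy_joined_long["route_id"].tolist())
)
print(routes_with_poi, routes_without_poi)
```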