# Which borough is the station in?

[old new yorker app](https://projects.newyorker.com/story/citi-bike.html)

![caption](./_assets/map.png)

In the first part of the lecture we will try to get additional information about the stations, namely which borough of the city each one is in, using reverse geocoding,
and along the way learn to use **apply** and **merge**.

In [12]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

In [2]:
path = '../data/l5/201801-citibike-tripdata.csv.zip'
df = pd.read_csv(path)

In [3]:
df.head(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,970,2018-01-01 13:50:57.4340,2018-01-01 14:07:08.1860,72,W 52 St & 11 Ave,40.767272,-73.993929,505,6 Ave & W 33 St,40.749013,-73.988484,31956,Subscriber,1992,1
1,723,2018-01-01 15:33:30.1820,2018-01-01 15:45:33.3410,72,W 52 St & 11 Ave,40.767272,-73.993929,3255,8 Ave & W 31 St,40.750585,-73.994685,32536,Subscriber,1969,1
2,496,2018-01-01 15:39:18.3370,2018-01-01 15:47:35.1720,72,W 52 St & 11 Ave,40.767272,-73.993929,525,W 34 St & 11 Ave,40.755942,-74.002116,16069,Subscriber,1956,1


In [4]:
df.shape

(718994, 15)

# Unique stations

In [7]:
cols = ['station id', 'station name', 'station latitude', 'station longitude'] 
s = df[['start ' + el for el in  cols]]
e = df[['end ' + el for el in  cols]]
s.columns, e.columns = cols, cols

In [8]:
stations = pd.concat([s, e]).set_index('station id')
stations = stations[~stations.index.duplicated(keep='first')]  # in other cases you can use stations.drop_duplicates(subset=[column names])

In [9]:
stations.shape

(768, 3)

In [10]:
stations.head(3)

station id,station name,station latitude,station longitude
72,W 52 St & 11 Ave,40.767272,-73.993929
79,Franklin St & W Broadway,40.719116,-74.006667
82,St James Pl & Pearl St,40.711174,-74.000165
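
The deduplication cell mentions `drop_duplicates` as an alternative to filtering on the index. A minimal sketch on made-up rows (the `toy` frame is an assumption, not the real trip data) suggests the two approaches give the same table when the station id is the deduplication key:

```python
import pandas as pd

# Toy stand-in for the concatenated start/end station rows (made-up sample).
toy = pd.DataFrame({
    'station id': [72, 79, 72, 82, 79],
    'station name': ['W 52 St & 11 Ave', 'Franklin St & W Broadway',
                     'W 52 St & 11 Ave', 'St James Pl & Pearl St',
                     'Franklin St & W Broadway'],
})

# Approach from the notebook: index by id, then drop repeated index labels.
by_index = toy.set_index('station id')
by_index = by_index[~by_index.index.duplicated(keep='first')]

# Alternative: drop duplicates on a column subset, then set the index.
by_subset = toy.drop_duplicates(subset=['station id']).set_index('station id')

print(by_index.equals(by_subset))
```

The index-based version is convenient when the key is already the index; `drop_duplicates` reads more naturally when the key is an ordinary column.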


## Get boroughs

[Nominatim API policy](https://operations.osmfoundation.org/policies/nominatim/)

In [13]:
import requests as rq
import time

# caching
from functools import lru_cache
# OR - from fastcache import clru_cache <- external package, 30x-40x faster

# just a nice UI progress bar
from tqdm import tqdm  # tqdm_notebook is an HTML progress bar; it works in jupyter notebook, but not in jupyter
tqdm.pandas()  # registers progress_apply on pandas objects


### Decorators

In [19]:
def my_decorator(func):
    def wrapper():
        print("Something is happening before the function is called.")
        func()
        print("Something is happening after the function is called.")
    return wrapper

In [20]:
@my_decorator
def say_whee():
    print("Whee!")

In [21]:
say_whee()

Something is happening before the function is called.
Whee!
Something is happening after the function is called.
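
The `wrapper` above takes no arguments and discards the return value, which is enough for `say_whee` but not for functions such as an API request. A sketch of a more general decorator follows; `count_calls` and `add` are hypothetical names used only for illustration:

```python
from functools import wraps

def count_calls(func):
    """Hypothetical decorator: counts how many times func is called."""
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)  # forward arguments and the return value
    wrapper.calls = 0
    return wrapper

@count_calls
def add(a, b=0):
    return a + b

result = add(2, b=3)
add(1)
print(result, add.calls, add.__name__)
```

`lru_cache`, used in the next section, follows the same pattern: it wraps a function and adds behaviour (a cache) around each call.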


### Back to the task

In [22]:
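@lru_cache(maxsize=2000)  # repeated calls with the same arguments skip the request and the sleep
def generic_json_api_request(url: str, time_sleep=1, **kwargs) -> dict:
    response = rq.get(url, params=kwargs)  # kwargs become the query string
    response.raise_for_status()  # raise on HTTP 4xx/5xx responses
    time.sleep(time_sleep)  # pause after each real request to respect the API rate limit
    return response.json()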
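
To see what the cache buys us without touching the network, here is a small offline sketch. `slow_lookup` is a hypothetical stand-in for the request function, with a short sleep in place of the real delay:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=2000)
def slow_lookup(key: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a rate-limited API request."""
    time.sleep(delay)  # paid only on a cache miss
    return key.upper()

slow_lookup('a')  # miss: runs the body and sleeps
slow_lookup('a')  # hit: served from the cache, no sleep
slow_lookup('b')  # miss: new argument
info = slow_lookup.cache_info()
print(info.hits, info.misses)
```

Note that `lru_cache` requires all arguments to be hashable, which holds here because the URL and the query parameters are strings and numbers.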

In [23]:
def reverse_geocode(lat: float, lon: float):
    base_url = 'https://nominatim.openstreetmap.org/reverse'
    params = {'format': 'jsonv2', 'lat': lat, 'lon': lon}

    return generic_json_api_request(base_url, **params)

In [24]:
# test
reverse_geocode(40.767272, -73.993929)

{'place_id': '48356435',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'node',
 'osm_id': '3708656763',
 'lat': '40.7672722',
 'lon': '-73.9939289',
 'place_rank': '30',
 'category': 'amenity',
 'type': 'bicycle_rental',
 'importance': '0',
 'addresstype': 'amenity',
 'name': 'Citi Bike - W 52 St & 11 Ave',
 'display_name': 'Citi Bike - W 52 St & 11 Ave, West 52nd Street, Diamond District, Midtown, Manhattan, Manhattan Community Board 5, New York County, NYC, New York, 10019, USA',
 'address': {'address29': 'Citi Bike - W 52 St & 11 Ave',
  'road': 'West 52nd Street',
  'neighbourhood': 'Diamond District',
  'suburb': 'Midtown',
  'city_district': 'Manhattan',
  'city': 'NYC',
  'county': 'New York County',
  'state': 'New York',
  'postcode': '10019',
  'country': 'USA',
  'country_code': 'us'},
 'boundingbox': ['40.7671722', '40.7673722', '-73.9940289', '-73.9938289']}

In [25]:
reverse_geocode(40.6838, -73.9763)

{'place_id': '48359864',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'node',
 'osm_id': '3708656243',
 'lat': '40.683826',
 'lon': '-73.9763233',
 'place_rank': '30',
 'category': 'amenity',
 'type': 'bicycle_rental',
 'importance': '0',
 'addresstype': 'amenity',
 'name': 'Citi Bike - Atlantic Ave & Fort Greene Pl',
 'display_name': 'Citi Bike - Atlantic Ave & Fort Greene Pl, Atlantic Avenue, Fort Greene, BK, Kings County, NYC, New York, 11208, USA',
 'address': {'address29': 'Citi Bike - Atlantic Ave & Fort Greene Pl',
  'road': 'Atlantic Avenue',
  'neighbourhood': 'Fort Greene',
  'suburb': 'BK',
  'county': 'Kings County',
  'city': 'NYC',
  'state': 'New York',
  'postcode': '11208',
  'country': 'USA',
  'country_code': 'us'},
 'boundingbox': ['40.683726', '40.683926', '-73.9764233', '-73.9762233']}
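
The two responses show that `city_district` is not always present (the Brooklyn address lacks it), while `county` appears in both. One possible way to derive a borough is therefore to map the county name. The sketch below assumes this approach; `borough_from_address` is a hypothetical helper, and the exact strings Nominatim returns for the Bronx and Staten Island are an assumption, since they do not appear in the responses shown here:

```python
# County-to-borough mapping for New York City's five boroughs.
# Assumption: Nominatim reports counties under these exact names.
COUNTY_TO_BOROUGH = {
    'New York County': 'Manhattan',
    'Kings County': 'Brooklyn',
    'Queens County': 'Queens',
    'Bronx County': 'The Bronx',
    'Richmond County': 'Staten Island',
}

def borough_from_address(address: dict):
    """Hypothetical helper: return the NYC borough, or None if unknown."""
    return COUNTY_TO_BOROUGH.get(address.get('county'))

# Trimmed copies of the two 'address' dicts from the responses above.
manhattan = {'suburb': 'Midtown', 'city_district': 'Manhattan',
             'county': 'New York County'}
brooklyn = {'suburb': 'BK', 'county': 'Kings County'}
print(borough_from_address(manhattan), borough_from_address(brooklyn))
```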

# Reverse geocode all stations

In [26]:
# progress_apply is tqdm's wrapper around pandas `apply` method. only thing it changes is the progress bar

# geocoded = stations.apply(lambda row: reverse_geocode(row['station latitude'],
#                                                       row['station longitude'])['address'],
#                                                axis=1)

# for the example, use the first 10 rows
first10 = stations.head(10)
geocoded = first10.progress_apply(lambda row: reverse_geocode(row['station latitude'], row['station longitude'])['address'],
                                             axis=1)



100%|██████████| 10/10 [00:13<00:00,  1.43s/it]


In [28]:
regeocoded = pd.DataFrame(geocoded.tolist(), index=first10.index)

In [29]:
regeocoded.head(3)

station id,address29,city,city_district,country,country_code,county,neighbourhood,pedestrian,postcode,residential,road,state,suburb
72,Citi Bike - W 52 St & 11 Ave,NYC,Manhattan,USA,us,New York County,Diamond District,,10019,,West 52nd Street,New York,Midtown
79,Citi Bike - Franklin St & W Broadway,NYC,Manhattan,USA,us,New York County,,Franklin Street,10013,,,New York,TriBeCa
82,Citi Bike - St James Pl & Pearl St,NYC,Manhattan,USA,us,New York County,,,10038,,Saint James Place,New York,Civic Center
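
The same two steps, a row-wise `apply` that returns one dict per row and the expansion of those dicts into columns, can be tried offline. In this sketch `fake_geocode` is a hypothetical stand-in for `reverse_geocode(...)['address']`, and its rules are made up purely to produce dicts with different keys:

```python
import pandas as pd

def fake_geocode(lat: float, lon: float) -> dict:
    """Hypothetical offline stand-in for reverse_geocode(...)['address']."""
    address = {'county': 'New York County' if lon < -73.98 else 'Kings County'}
    if lat > 40.75:
        address['suburb'] = 'Midtown'  # only some rows get this key
    return address

toy = pd.DataFrame(
    {'station latitude': [40.767272, 40.683826],
     'station longitude': [-73.993929, -73.976323]},
    index=pd.Index([72, 83], name='station id'),
)

# axis=1 passes each row to the function as a Series.
addresses = toy.apply(
    lambda row: fake_geocode(row['station latitude'], row['station longitude']),
    axis=1,
)

# One column per key; keys missing from a dict become NaN.
expanded = pd.DataFrame(addresses.tolist(), index=toy.index)
print(expanded)
```

This is why columns such as `pedestrian` and `residential` are mostly empty in the real result: only a few responses contain those keys.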


In [31]:
regeocoded.isnull().sum().sort_values()

address29        0
city             0
country          0
country_code     0
county           0
postcode         0
state            0
road             1
suburb           3
neighbourhood    4
city_district    5
pedestrian       9
residential      9
dtype: int64

In [32]:
cols = ['county', 'postcode', 'suburb', 'city_district', 'neighbourhood']
regeocoded = regeocoded[cols]

In [34]:
stations = stations.join(regeocoded)
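
`join` aligns the two frames on their index and keeps every row of the left frame. The same operation can be written with `merge`; a minimal sketch on toy frames (the values are illustrative, not taken from the real run):

```python
import pandas as pd

left = pd.DataFrame(
    {'station name': ['W 52 St & 11 Ave', 'Franklin St & W Broadway',
                      'St James Pl & Pearl St']},
    index=pd.Index([72, 79, 82], name='station id'),
)
# Only two of the three stations have geocoding results in this toy case.
right = pd.DataFrame(
    {'county': ['New York County', 'New York County']},
    index=pd.Index([72, 79], name='station id'),
)

joined = left.join(right)  # left join on the index by default
merged = left.merge(right, how='left', left_index=True, right_index=True)

print(joined.equals(merged))
print(joined['county'].isna().sum())  # stations without a match get NaN
```

With only the first 10 stations geocoded, the remaining rows of `stations` would get NaN in the new columns in the same way.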

In [58]:
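# reset_index keeps 'station id' as a regular column; index=False then drops only the new integer index
stations.reset_index().to_csv('../data/l6/stations_regeocoded.csv.gzip', index=False)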
# stations = pd.read_csv('../data/l6/stations_regeocoded.csv.gzip')

In [57]:
stations.head(5)

Unnamed: 0,station id,station name,station latitude,station longitude,county,postcode,suburb,city_district,neighbourhood
0,72,W 52 St & 11 Ave,40.767272,-73.993929,New York County,10019,Midtown,Manhattan,Diamond District
1,79,Franklin St & W Broadway,40.719116,-74.006667,New York County,10013,TriBeCa,Manhattan,
2,82,St James Pl & Pearl St,40.711174,-74.000165,New York County,10038,Civic Center,Manhattan,
3,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,Kings County,11208,BK,,Fort Greene
4,119,Park Ave & St Edwards St,40.696089,-73.978034,Kings County,11251,,,Fort Greene


In [41]:
stations['county'].value_counts()

New York County              421
Kings County                 266
Queens County                 78
Agglomération de Montréal      2
Hudson County                  1
Name: county, dtype: int64

In [42]:
stations[stations['county'] == 'Agglomération de Montréal']

Unnamed: 0,station id,station name,station latitude,station longitude,county,postcode,suburb,city_district,neighbourhood
607,3488,8D QC Station 01,45.506364,-73.569463,Agglomération de Montréal,H3A 1N9,Ville-Marie,,Quartier des Spectacles
752,3650,8D Mobile 01,45.506264,-73.568906,Agglomération de Montréal,H3A 1N9,Ville-Marie,,Quartier des Spectacles
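
The two Montréal rows are clearly outside New York, so an analysis by borough may want to exclude them. One possible filter, sketched on a toy frame with `isin`; the list of counties to keep is an assumption and should be adjusted to the question at hand:

```python
import pandas as pd

toy = pd.DataFrame({
    'station name': ['W 52 St & 11 Ave', '8D QC Station 01',
                     'Atlantic Ave & Fort Greene Pl'],
    'station latitude': [40.767272, 45.506364, 40.683826],
    'county': ['New York County', 'Agglomération de Montréal', 'Kings County'],
})

# Counties we choose to keep (an assumption, not part of the original notebook).
nyc_counties = ['New York County', 'Kings County', 'Queens County']

in_nyc = toy[toy['county'].isin(nyc_counties)]
print(in_nyc['station name'].tolist())
```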


In [44]:
stations.head(5)

Unnamed: 0,station id,station name,station latitude,station longitude,county,postcode,suburb,city_district,neighbourhood
0,72,W 52 St & 11 Ave,40.767272,-73.993929,New York County,10019,Midtown,Manhattan,Diamond District
1,79,Franklin St & W Broadway,40.719116,-74.006667,New York County,10013,TriBeCa,Manhattan,
2,82,St James Pl & Pearl St,40.711174,-74.000165,New York County,10038,Civic Center,Manhattan,
3,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,Kings County,11208,BK,,Fort Greene
4,119,Park Ave & St Edwards St,40.696089,-73.978034,Kings County,11251,,,Fort Greene
