## Feature engineering
- Bringing in external data: finding neighbourhoods
- Finding the distance from the city center

In [1]:
import json
import os
import pandas as pd
from utils import get_coordinate_lookup, get_maps, lat_long_in_neighbourhood
data_dir = "../data"

In [2]:
# Load the neighbourhood polygons from the GeoJSON file
with open(os.path.join(data_dir, "neighbourhoods.geojson")) as file:
    json_data = json.load(file)

The function `lat_long_in_neighbourhood()` checks whether a given latitude-longitude pair lies inside a neighbourhood polygon and returns a boolean. `get_coordinate_lookup()` returns a dictionary that maps each neighbourhood name to its coordinate polygon.
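
The `utils` module is not shown in this notebook, so the following is only a sketch of how a point-in-polygon check of this kind can work, using the ray-casting technique. The name `point_in_polygon` and the toy square are hypothetical; the assumption is that a polygon is a list of `(longitude, latitude)` vertices, which is the coordinate order GeoJSON uses.

```python
def point_in_polygon(long, lat, polygon):
    """Ray casting: return True if (long, lat) lies inside the polygon.

    polygon is a list of (longitude, latitude) vertices (assumed layout).
    """
    inside = False
    j = len(polygon) - 1  # index of the previous vertex
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge j -> i straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            # Count only crossings to the right of the point
            if long < x_cross:
                inside = not inside
        j = i
    return inside


# Hypothetical square, not a real neighbourhood boundary
toy_square = [(4.39, 51.21), (4.41, 51.21), (4.41, 51.23), (4.39, 51.23)]
print(point_in_polygon(4.398631, 51.218575, toy_square))  # True
print(point_in_polygon(4.45853, 51.19772, toy_square))    # False
```

An odd number of edge crossings means the point is inside. The actual helper in `utils` may use a different method or a geometry library.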

In [3]:
%load_ext sql
%sql sqlite:///../data/sample_db/airbnb.db
listings = %sql select * from listings
listings = pd.DataFrame(listings)

 * sqlite:///../data/sample_db/airbnb.db
Done.


In [4]:

listings.head(2)

Unnamed: 0,listing_id,listing_url,name,description,latitude,longitude,property_type,room_type,accomodates,bathrooms_text,bedrooms,beds,amenities,host_id
0,50904,https://www.airbnb.com/rooms/50904,aplace/antwerp: cosy suite - fashion district,Decorated in a vintage style combined with a f...,51.218575,4.398631,Room in boutique hotel,Hotel room,2,1 private bath,1.0,1.0,"[""Kitchen"", ""Smoke alarm"", ""Hair dryer"", ""Keyp...",234077
1,116134,https://www.airbnb.com/rooms/116134,Spacious apartment nearby Mas,Enjoy your stay at our 4 person apartment in t...,51.23051,4.40593,Entire rental unit,Entire home/apt,4,2.5 baths,2.0,2.0,"[""Refrigerator"", ""Elevator"", ""Paid street park...",586942


In [5]:
listings.iloc[0][['latitude','longitude']]

latitude     51.218575
longitude     4.398631
Name: 0, dtype: object

In [6]:
coordinate_lookup = get_coordinate_lookup(json_data)
coordinate_lookup.keys()

dict_keys(['Haringrode', 'Linkeroever', 'Haven Antwerpen', 'Haven Bezali', 'Deurne Zuid Oost', 'Dam', 'Amandus - Atheneum', 'Brederode', 'Centraal Station', 'Donk', 'Harmonie', 'Lambrechtshoeken', 'Lillo', 'Eilandje', 'Kiel', 'Ekeren Centrum', 'Leugenberg', 'Hoboken - Centrum', 'Hoboken - Noord', 'Hoboken - West', 'Hoboken - Zuidoost', 'Hoogte', 'Mariaburg', 'Valaar', 'Nieuw - Zuid', 'Merksem - Heide', 'Nieuwdreef', 'Oosterveld - Elsdonk', 'Schoonbroek-Rozemaai', 'Oud - Merksem', 'Stadspark', 'Sint-Andries', 'Polder', 'Universiteitsbuurt', 'Stuivenberg', 'Tentoonstellingswijk', 'Theaterbuurt-Meir', 'Wilrijk Centrum', 'Zuid', 'Historisch Centrum', 'Groenenhoek', 'Markgrave', 'Middelheim', 'Oud - Berchem', 'Deurne Vlieghaven', 'Borgerhout Intra Muros Zuid', 'Petroleum - Zuid', 'Deurne Noord', 'Neerland', 'Deurne Dorp - Gallifort', 'Koornbloem', 'Nieuw - Kwartier West', 'Kruininge - Bremweide', 'Rivierenhof', 'Luchtbal', 'Deurne Oost', 'Deurne Zuid West', 'Borgerhout Extra Muros', 'Borger

In [7]:
lat_long_in_neighbourhood(long = 4.398631,lat=51.218575,coordinates=coordinate_lookup['Deurne Zuid Oost'])

False

In [8]:
for neighbourhood in coordinate_lookup:
    exists = lat_long_in_neighbourhood(long = 4.398631,lat=51.218575,coordinates=coordinate_lookup[neighbourhood])
    print(f"{neighbourhood}: {exists}")

Haringrode: False
Linkeroever: False
Haven Antwerpen: False
Haven Bezali: False
Deurne Zuid Oost: False
Dam: False
Amandus - Atheneum: False
Brederode: False
Centraal Station: False
Donk: False
Harmonie: False
Lambrechtshoeken: False
Lillo: False
Eilandje: False
Kiel: False
Ekeren Centrum: False
Leugenberg: False
Hoboken - Centrum: False
Hoboken - Noord: False
Hoboken - West: False
Hoboken - Zuidoost: False
Hoogte: False
Mariaburg: False
Valaar: False
Nieuw - Zuid: False
Merksem - Heide: False
Nieuwdreef: False
Oosterveld - Elsdonk: False
Schoonbroek-Rozemaai: False
Oud - Merksem: False
Stadspark: False
Sint-Andries: False
Polder: False
Universiteitsbuurt: False
Stuivenberg: False
Tentoonstellingswijk: False
Theaterbuurt-Meir: False
Wilrijk Centrum: False
Zuid: False
Historisch Centrum: True
Groenenhoek: False
Markgrave: False
Middelheim: False
Oud - Berchem: False
Deurne Vlieghaven: False
Borgerhout Intra Muros Zuid: False
Petroleum - Zuid: False
Deurne Noord: False
Neerland: False
De

In [9]:
# Determine the neighbourhood each listing belongs to
mapper = {}
for _, row in listings.iterrows():
    lat = row['latitude']
    long = row['longitude']
    listing_id = row['listing_id']
    # Test the listing's coordinates against every neighbourhood polygon
    for neighbourhood in coordinate_lookup:
        exists = lat_long_in_neighbourhood(long=long, lat=lat, coordinates=coordinate_lookup[neighbourhood])
        if exists:
            mapper[listing_id] = neighbourhood
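
The nested loop above keeps testing every polygon even after a match has been found. Below is a minimal sketch of an early-exit variant, assuming neighbourhood polygons do not overlap, so the first match is the only one. `find_neighbourhood`, `in_bounding_box`, and `toy_lookup` are hypothetical names. The bounding-box predicate only stands in for `lat_long_in_neighbourhood` so that the example runs on its own.

```python
def find_neighbourhood(lat, long, lookup, contains):
    """Return the first key in lookup whose shape contains the point, else None."""
    for name, coordinates in lookup.items():
        if contains(long=long, lat=lat, coordinates=coordinates):
            return name  # stop at the first match
    return None


def in_bounding_box(long, lat, coordinates):
    # Stand-in predicate: coordinates is (min_long, min_lat, max_long, max_lat)
    min_long, min_lat, max_long, max_lat = coordinates
    return min_long <= long <= max_long and min_lat <= lat <= max_lat


# Hypothetical boxes, not real neighbourhood boundaries
toy_lookup = {
    "box_a": (4.39, 51.21, 4.40, 51.22),
    "box_b": (4.40, 51.22, 4.42, 51.24),
}
print(find_neighbourhood(51.218575, 4.398631, toy_lookup, in_bounding_box))  # box_a
print(find_neighbourhood(50.0, 4.0, toy_lookup, in_bounding_box))            # None
```

In the notebook, `lat_long_in_neighbourhood` could be passed as `contains` and `coordinate_lookup` as `lookup`, since the helper is already called with the keywords `long`, `lat`, and `coordinates`. Listings that fall inside no polygon would then map to `None`.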

In [10]:
listings['neighbourhood']=listings['listing_id'].map(mapper)

In [11]:
listings.head(2)

Unnamed: 0,listing_id,listing_url,name,description,latitude,longitude,property_type,room_type,accomodates,bathrooms_text,bedrooms,beds,amenities,host_id,neighbourhood
0,50904,https://www.airbnb.com/rooms/50904,aplace/antwerp: cosy suite - fashion district,Decorated in a vintage style combined with a f...,51.218575,4.398631,Room in boutique hotel,Hotel room,2,1 private bath,1.0,1.0,"[""Kitchen"", ""Smoke alarm"", ""Hair dryer"", ""Keyp...",234077,Historisch Centrum
1,116134,https://www.airbnb.com/rooms/116134,Spacious apartment nearby Mas,Enjoy your stay at our 4 person apartment in t...,51.23051,4.40593,Entire rental unit,Entire home/apt,4,2.5 baths,2.0,2.0,"[""Refrigerator"", ""Elevator"", ""Paid street park...",586942,Eilandje


## Finding the city center and the distance of each listing from it

**Finding city center**
https://www.expatica.com/be/moving/location/living-in-antwerp-1500930/

In [12]:
get_maps(json_data)

In [13]:
listings[listings['neighbourhood']=='Eilandje'][['latitude','longitude']].mean()

latitude     51.231227
longitude     4.409973
dtype: float64

To calculate the distance between two points whose latitude and longitude are known, we can use the haversine formula. `sklearn` provides an implementation, [`haversine_distances`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html), which expects each point as `[latitude, longitude]` in radians.
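
To make the formula concrete, here is a small standard-library sketch of the haversine distance between two points. `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the radius of 6371 km is the mean Earth radius that this notebook also uses when scaling the `sklearn` result.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(long2 - long1)
    # Haversine of the central angle between the two points
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Two coordinate pairs taken from this notebook
print(haversine_km(51.231227, 4.409973, 51.218575, 4.405930))
```

The function takes degrees and converts them internally, whereas `haversine_distances` needs the conversion to radians done beforehand.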

In [14]:
# Use the mean coordinates of the Eilandje listings (computed above) as the city center
lat_c = 51.231227
long_c = 4.409973
listings[['listing_id','latitude','longitude','neighbourhood']].head(5)

Unnamed: 0,listing_id,latitude,longitude,neighbourhood
0,50904,51.218575,4.398631,Historisch Centrum
1,116134,51.23051,4.40593,Eilandje
2,218916,51.20633,4.42942,Zurenborg
3,224333,51.19772,4.45853,Deurne Zuid West
4,224682,51.21722,4.3979,Sint-Andries


In [15]:
from sklearn.metrics.pairwise import haversine_distances
from math import radians
lat_lon_c = [lat_c,long_c]
lat_lon_c = [radians(_) for _ in lat_lon_c]

In [16]:
lat_lon_c

[0.894153590986617, 0.07696854877405189]

In [17]:
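# Test point: note that this pairs the latitude of listing 50904 with the longitude of listing 116134
lat_lon_l1 = [51.218575,4.405930]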
lat_lon_l1 = [radians(_) for _ in lat_lon_l1]
lat_lon_l1

[0.8939327719296547, 0.07689798511239376]

In [18]:
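# haversine_distances returns angles in radians; multiply by the Earth's mean radius (6371000 m) and divide by 1000 for km
(haversine_distances([lat_lon_c,lat_lon_l1])*6371000/1000)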

array([[0.        , 1.43473377],
       [1.43473377, 0.        ]])

In [19]:
(haversine_distances([lat_lon_c,lat_lon_l1])*6371000/1000)[0,1]

1.4347337732227432

In [20]:
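# Extend the same logic to compute the distance for every listing
def get_distances(row):
    """Return the haversine distance in km between a listing and the city center."""
    # haversine_distances expects [latitude, longitude] in radians
    lat_lon_l = [radians(row['latitude']), radians(row['longitude'])]
    # Earth's mean radius is 6371000 m; dividing by 1000 gives kilometres
    distance = (haversine_distances([lat_lon_c, lat_lon_l]) * 6371000 / 1000)[0, 1]
    return distance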

In [21]:
listings['distance_city_center']=listings.apply(get_distances,axis=1)
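
Calling `haversine_distances` once per row works, but it builds a 2x2 matrix for every listing. As a hedged alternative, the sketch below computes all distances in one vectorised pass with NumPy. `distances_to_center_km` is a hypothetical helper, and the sample coordinates are the first three listings and the city center from this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distances_to_center_km(latitudes, longitudes, center_lat, center_long):
    """Haversine distance in km from each (lat, long) in degrees to the center."""
    lat = np.radians(np.asarray(latitudes, dtype=float))
    long = np.radians(np.asarray(longitudes, dtype=float))
    c_lat, c_long = np.radians(center_lat), np.radians(center_long)
    # Haversine of the central angle, evaluated element-wise
    a = (np.sin((lat - c_lat) / 2) ** 2
         + np.cos(c_lat) * np.cos(lat) * np.sin((long - c_long) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


lats = [51.218575, 51.23051, 51.20633]
longs = [4.398631, 4.40593, 4.42942]
print(distances_to_center_km(lats, longs, 51.231227, 4.409973))
```

In the notebook this could be used as `listings['distance_city_center'] = distances_to_center_km(listings['latitude'], listings['longitude'], lat_c, long_c)`, which should give the same values as the row-wise `apply`.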

In [22]:
listings.head()

Unnamed: 0,listing_id,listing_url,name,description,latitude,longitude,property_type,room_type,accomodates,bathrooms_text,bedrooms,beds,amenities,host_id,neighbourhood,distance_city_center
0,50904,https://www.airbnb.com/rooms/50904,aplace/antwerp: cosy suite - fashion district,Decorated in a vintage style combined with a f...,51.218575,4.398631,Room in boutique hotel,Hotel room,2,1 private bath,1.0,1.0,"[""Kitchen"", ""Smoke alarm"", ""Hair dryer"", ""Keyp...",234077,Historisch Centrum,1.613389
1,116134,https://www.airbnb.com/rooms/116134,Spacious apartment nearby Mas,Enjoy your stay at our 4 person apartment in t...,51.23051,4.40593,Entire rental unit,Entire home/apt,4,2.5 baths,2.0,2.0,"[""Refrigerator"", ""Elevator"", ""Paid street park...",586942,Eilandje,0.29258
2,218916,https://www.airbnb.com/rooms/218916,Apartment with terrace in trendy Zurenborg,Do you enjoy authentic places with a lot of ch...,51.20633,4.42942,Entire condominium (condo),Entire home/apt,5,1 bath,1.0,3.0,"[""Paid street parking off premises"", ""Kitchen""...",915664,Zurenborg,3.081981
3,224333,https://www.airbnb.com/rooms/224333,Large stylish room in 1930s house + garden,"Large bedroom in classic 1930s house. Kitchen,...",51.19772,4.45853,Private room in residential home,Private room,2,2 shared baths,1.0,1.0,"[""Fire extinguisher"", ""Backyard"", ""Long term s...",1167377,Deurne Zuid West,5.031959
4,224682,https://www.airbnb.com/rooms/224682,APARTMENT ROSCAM - OLD CENTRE ANTWERP,"<b>The space</b><br />Apartment ""Roscam"" is a ...",51.21722,4.3979,Entire rental unit,Entire home/apt,3,1 bath,1.0,2.0,"[""Refrigerator"", ""Kitchen"", ""Smoke alarm"", ""Sh...",1263933,Sint-Andries,1.769939


In [23]:
# path = "../../../../Content/Excel_Course/listings.csv"
# listings.to_csv(path,index=False)

In [24]:
# path = "../../../../Content/Excel_Course/hosts.csv"
# hosts = %sql select * from hosts
# hosts = pd.DataFrame(hosts)
# hosts.to_csv(path,index=False)

