# Second phase notebook: Recommending Rental Properties / Case: Boston
This notebook includes 10 parts:
1. downloading the neighbourhoods' report
2. importing, cleaning and forming the dataset
3. finding the coordinates of neighbourhoods
4. plotting location of neighbourhoods
5. filtering neighbourhoods by distance
6. finding venues in the selected neighbourhoods
7. analysing each neighbourhood by found venues
8. developing scoring frame and ranking neighbourhoods
9. finding, filtering, and plotting rental properties
10. clustering and visualisation of results

Each step is explained below. This notebook is published on **GitHub**; to view the plots, **nbviewer** should be used.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
from dotenv import load_dotenv
from pathlib import Path
import os, tabula, wget, requests, subprocess, warnings
# disabling warnings 
warnings.filterwarnings("ignore")

## 1st step, downloading the neighbourhoods' report
In this step the dataset is downloaded with wget, which is invoked through the subprocess library. The downloaded file is saved in the main directory.

In [2]:
# importing dataset
url = r'http://www.bostonplans.org/getattachment/6f48c617-cf23-4c9f-b54b-35c8a954091c'
file_name = r'boston_statistics.pdf'
path = os.path.dirname(os.path.abspath('Recommending rental properties.ipynb'))
subprocess.run(["wget", "-r", "-nd", "-O", file_name, path, url])

CompletedProcess(args=['wget', '-r', '-nd', '-O', 'boston_statistics.pdf', 'C:\\Users\\King Aron\\Desktop\\Python_projects\\Coursera_Capstone\\Boston Neighbourhoods', 'http://www.bostonplans.org/getattachment/6f48c617-cf23-4c9f-b54b-35c8a954091c'], returncode=4)
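
The `returncode=4` in the output above is wget's exit status for a network failure, so the PDF may not have been fetched by this call. One possible cause (an assumption, not verified here) is that the local directory is passed as a positional argument, which wget treats as a second URL. A minimal alternative sketch using only the standard library follows; `download_file` and its `opener` parameter are hypothetical names introduced for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, destination, opener=urllib.request.urlopen):
    """Stream the content at `url` into `destination` and return the path.

    `opener` is a hypothetical hook (not part of the notebook) so that the
    function can be exercised without network access.
    """
    destination = Path(destination)
    with opener(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination


# Intended use in the notebook (requires network access):
# download_file(url, file_name)
```

Because the destination is given explicitly, no separate directory argument is needed.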

## 2nd step, importing, cleaning and forming the dataset
The downloaded PDF file has many pages of demographic data. Page 5 contains the age distribution of Boston, and it is used here because it may be useful in later analysis. The PDF file is read with the tabula library. Next, the desired columns are selected and the data is converted to float type.

In [3]:
# cleaning and forming the dataset
df = tabula.read_pdf(file_name, pages = 5)[0]
df.drop([0,1,2], inplace = True, axis = 0) # the first three rows are irrelevant
columns_list = list(range(0,3,1)) + list(range(3,15,2))
df = df[df.columns[columns_list]] # Aron is interested in young neighbourhoods
df.columns = ['neighbourhood', 'total_population', 'median_age', '0-9', '10-19', '20-34', '35-54', '55-64', '65+']
df.reset_index(inplace = True, drop = True)
df.head()

Got stderr: Aug 22, 2020 2:05:30 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font ABCDEE+Calibri are not implemented in PDFBox and will be ignored
Aug 22, 2020 2:05:30 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font ABCDEE+Calibri-Bold are not implemented in PDFBox and will be ignored



Unnamed: 0,neighbourhood,total_population,median_age,0-9,10-19,20-34,35-54,55-64,65+
0,Allston,19761,27,644,3152,12741,1912,691,621
1,Back Bay,17577,33,756,1372,7681,3229,1973,2566
2,Beacon Hill,9305,31,686,247,4909,1751,646,1066
3,Brighton,47768,31,2852,3016,25485,7238,3577,5600
4,Charlestown,18058,34,2559,1256,5754,4774,2006,1709


In [4]:
# conversion of types
def replace_resi(col):
    for ii in range(0, len(col)):
        col[ii] = col[ii].replace(',', '')
    return col
col_list = [1] + list(range(3,len(df.columns)))
df.iloc[:,col_list] = df.iloc[:,col_list].apply(replace_resi, axis=0)
df.iloc[:,1:] = df.iloc[:,1:].astype('float')
df.head()

Unnamed: 0,neighbourhood,total_population,median_age,0-9,10-19,20-34,35-54,55-64,65+
0,Allston,19761.0,27.0,644.0,3152.0,12741.0,1912.0,691.0,621.0
1,Back Bay,17577.0,33.0,756.0,1372.0,7681.0,3229.0,1973.0,2566.0
2,Beacon Hill,9305.0,31.0,686.0,247.0,4909.0,1751.0,646.0,1066.0
3,Brighton,47768.0,31.0,2852.0,3016.0,25485.0,7238.0,3577.0,5600.0
4,Charlestown,18058.0,34.0,2559.0,1256.0,5754.0,4774.0,2006.0,1709.0
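
The element-wise loop in `replace_resi` can also be expressed with pandas string methods. The sketch below is one possible alternative, shown on a small sample copied from the first table above; `to_numeric_columns` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def to_numeric_columns(frame, columns):
    """Return a copy of `frame` with thousands separators removed and the
    given columns converted to numeric dtype."""
    out = frame.copy()
    for col in columns:
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(cleaned)
    return out


# Sample rows in the string form that tabula might return (assumed formatting)
sample = pd.DataFrame({
    "neighbourhood": ["Allston", "Back Bay"],
    "total_population": ["19,761", "17,577"],
    "median_age": ["27", "33"],
})
clean = to_numeric_columns(sample, ["total_population", "median_age"])
```

Working on a copy leaves the original frame untouched, which makes the cell safe to re-run.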


In [5]:
# brief insights
print('There are {:.0f} neighbourhoods in Boston \n'.format(df['neighbourhood'].count()))
print('Boston total population is {:.0f} \n'.format(df['total_population'].sum()))
print('{:.2f}% of this population aged between 20-34 \n'.format(df['20-34'].sum() / df['total_population'].sum() * 100))

There are 23 neighbourhoods in Boston 

Boston total population is 650281 

34.68% of this population aged between 20-34 



## 3rd step, finding coordinates
In this section, a geocoder is used in a loop to find the corresponding latitude/longitude of each row. The important parts are as follows:
1. using geopy's Nominatim geocoder
2. catching GeocoderTimedOut to avoid time-out errors
3. setting a limit on the number of searches per neighbourhood
4. sleeping for 1 second before each request to avoid being blocked by the server's rate limit
5. appending a random string of symbols to the address
6. randomly ordering the parts of the address
<br>

Finally, because the coordinates of all neighbourhoods were found, there is no need to set any of them manually.

In [6]:
import geopy, random
from time import sleep
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

In [7]:
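def do_geocode(address):
    # use a local name that does not shadow the imported geopy module
    geolocator = Nominatim(user_agent="aron.shirazi@gmail.com")
    try:
        sleep(1)  # pause to respect the server's rate limit
        return geolocator.geocode(address)
    except GeocoderTimedOut:
        return do_geocode(address)  # retry after a time-out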

df['latitude'] = 'NA'
df['longitude'] = 'NA'
max_try = 10
for nn in range(0, len(df)):
    neighbourhood = df['neighbourhood'].iloc[nn]
    location = None
    count = 0
    while location is None and count < max_try:
        password = ''.join(random.choice(['#', '$', '%', '@', '*', '-', '&', '~', '!']) for i in range(8))
        address_list = [neighbourhood, 'Boston', 'Massachusetts', password]
        order = ''.join(random.sample(['0', '1', '2', '3'], 4))
        n0 = int(order[0]); n1 = int(order[1]); n2 = int(order[2]); n3 = int(order[3])
        address = '{}, {}, {}, {}'.format(address_list[n0], address_list[n1], address_list[n2], address_list[n3])
        location = do_geocode(address)
        count += 1
    if location is not None:
        print('{}, coordinates found for {}'.format(nn, neighbourhood))
        df.loc[df.index[nn], 'latitude'] = location.latitude
        df.loc[df.index[nn], 'longitude'] = location.longitude
    else:
        print('{}, coordinates not found for {}'.format(nn, neighbourhood))
df

0, coordinates found for Allston
1, coordinates found for Back Bay
2, coordinates found for Beacon Hill
3, coordinates found for Brighton
4, coordinates found for Charlestown
5, coordinates found for Dorchester
6, coordinates found for Downtown
7, coordinates found for East Boston
8, coordinates found for Fenway
9, coordinates found for Harbor Islands
10, coordinates found for Hyde Park
11, coordinates found for Jamaica Plain
12, coordinates found for Longwood
13, coordinates found for Mattapan
14, coordinates found for Mission Hill
15, coordinates found for North End
16, coordinates found for Roslindale
17, coordinates found for Roxbury
18, coordinates found for South Boston
19, coordinates found for South Boston Waterfront
20, coordinates found for South End
21, coordinates found for West End
22, coordinates found for West Roxbury


Unnamed: 0,neighbourhood,total_population,median_age,0-9,10-19,20-34,35-54,55-64,65+,latitude,longitude
0,Allston,19761.0,27.0,644.0,3152.0,12741.0,1912.0,691.0,621.0,42.3554,-71.1321
1,Back Bay,17577.0,33.0,756.0,1372.0,7681.0,3229.0,1973.0,2566.0,42.3503,-71.1012
2,Beacon Hill,9305.0,31.0,686.0,247.0,4909.0,1751.0,646.0,1066.0,42.3587,-71.0678
3,Brighton,47768.0,31.0,2852.0,3016.0,25485.0,7238.0,3577.0,5600.0,42.3501,-71.1564
4,Charlestown,18058.0,34.0,2559.0,1256.0,5754.0,4774.0,2006.0,1709.0,42.3814,-71.0727
5,Dorchester,124489.0,33.0,15841.0,16428.0,33342.0,33529.0,13470.0,11879.0,42.3329,-71.0448
6,Downtown,16903.0,34.0,888.0,2440.0,5647.0,3303.0,2067.0,2558.0,42.3586,-71.0639
7,East Boston,44989.0,34.0,5778.0,4237.0,13361.0,13917.0,3745.0,3951.0,42.3811,-71.035
8,Fenway,32210.0,23.0,452.0,8582.0,17575.0,2883.0,1271.0,1447.0,42.3373,-71.1057
9,Harbor Islands,329.0,42.0,0.0,5.0,127.0,94.0,70.0,33.0,42.2697,-70.9209
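
`do_geocode` retries by calling itself whenever a time-out occurs, so a persistently unreachable server would keep it recursing. A bounded, iterative variant is sketched below as one way this could be handled; `geocode_with_retry` and its parameters are hypothetical, and the geocoder is passed in as a plain callable so that the example runs without network access.

```python
import time


def geocode_with_retry(geocode, address, max_tries=3, pause=1.0,
                       retry_on=(TimeoutError,)):
    """Call `geocode(address)` up to `max_tries` times.

    Returns the first result obtained, or None if every attempt raised
    one of the exceptions listed in `retry_on`.
    """
    for _ in range(max_tries):
        try:
            return geocode(address)
        except retry_on:
            time.sleep(pause)  # wait before the next attempt
    return None


# Stand-in geocoder for illustration only: fails twice, then succeeds.
calls = {"count": 0}


def flaky_geocode(address):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated time-out")
    return (42.3554, -71.1321)  # Allston, rounded values from the table above


result = geocode_with_retry(flaky_geocode, "Allston, Boston, Massachusetts", pause=0)
```

With geopy, the callable would be the `geocode` method of a `Nominatim` instance and `retry_on` would be `(GeocoderTimedOut,)`.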


## 4th step, plotting locations
In this step, the coordinates found for the neighbourhoods are plotted along with their names. Instead of setting an initial zoom level in folium, the map is fitted to the data with the more convenient fit_bounds method.

In [8]:
# importing the library
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
# finding the center of map for illustration purpose
df_loc = df[['neighbourhood', 'latitude', 'longitude']]
center_lat = df_loc['latitude'].mean()
center_lon = df_loc['longitude'].mean()
# to set boundaries of folium
lat_min = df_loc['latitude'].min()
lat_max = df_loc['latitude'].max()
lon_min = df_loc['longitude'].min()
lon_max = df_loc['longitude'].max()

In [9]:
map_boston = folium.Map(location=[center_lat, center_lon], width=800, height=600)
map_boston.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])
# add markers to map
for lat, lng, label in zip(df_loc['latitude'], df_loc['longitude'], df_loc['neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_boston)  
map_boston
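
The centre and bounds calculation is repeated before each map in this notebook. As a sketch, the bounds part could be collected into a small helper; `bounding_box` is a hypothetical name, and the sample points are the rounded coordinates of the first three neighbourhoods from the table above.

```python
def bounding_box(points):
    """Return [[lat_min, lon_min], [lat_max, lon_max]] for (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# Allston, Back Bay, Beacon Hill
sample_points = [(42.3554, -71.1321), (42.3503, -71.1012), (42.3587, -71.0678)]
bounds = bounding_box(sample_points)
```

The result has the same shape as the argument given to `fit_bounds` in the cell above, so `map_boston.fit_bounds(bounds)` would be the corresponding call.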

## 5th step, filtering neighbourhoods by distance
Aron has decided to find an apartment as close as possible to the campus. His ideal choice would be less than 3 km from the MIT campus. The neighbourhoods are filtered by their calculated distance from the campus.

In [10]:
from geopy.distance import geodesic 
campus_loc = (42.3601, -71.0942)
# distance of each neighbourhood from the campus in km
df['distance'] = [
    geodesic(campus_loc, (lat, lon)).km
    for lat, lon in zip(df['latitude'], df['longitude'])
]
desired_distance = 3 #km
df_select = df[df['distance'] <= desired_distance]
df_select

Unnamed: 0,neighbourhood,total_population,median_age,0-9,10-19,20-34,35-54,55-64,65+,latitude,longitude,distance
1,Back Bay,17577.0,33.0,756.0,1372.0,7681.0,3229.0,1973.0,2566.0,42.3503,-71.1012,1.23121
2,Beacon Hill,9305.0,31.0,686.0,247.0,4909.0,1751.0,646.0,1066.0,42.3587,-71.0678,2.178011
4,Charlestown,18058.0,34.0,2559.0,1256.0,5754.0,4774.0,2006.0,1709.0,42.3814,-71.0727,2.956335
6,Downtown,16903.0,34.0,888.0,2440.0,5647.0,3303.0,2067.0,2558.0,42.3586,-71.0639,2.503793
8,Fenway,32210.0,23.0,452.0,8582.0,17575.0,2883.0,1271.0,1447.0,42.3373,-71.1057,2.699708
12,Longwood,5233.0,21.0,12.0,2358.0,2663.0,120.0,40.0,40.0,42.3371,-71.1019,2.63195
17,Roxbury,51252.0,31.0,6844.0,7959.0,14228.0,12277.0,4879.0,5065.0,42.3379,-71.1014,2.534625
20,South End,31601.0,35.0,2752.0,1841.0,11211.0,8885.0,3172.0,3740.0,42.3413,-71.0772,2.512266
21,West End,5945.0,34.0,364.0,297.0,2415.0,1622.0,522.0,725.0,42.3603,-71.0583,2.958284
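
`geodesic` computes distances on an ellipsoidal model of the Earth. To illustrate the idea behind the filter, the sketch below uses the simpler haversine formula on a sphere. This is a swapped-in approximation, not the method used in the cell above, so its results can differ slightly from the `distance` column; `haversine_km` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


campus_loc = (42.3601, -71.0942)
back_bay = (42.3503, -71.1012)  # rounded coordinates from the table above
distance = haversine_km(campus_loc, back_bay)
within_range = distance <= 3  # same 3 km threshold as desired_distance
```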


In [11]:
# finding the center of map for illustration purpose
df_loc = df_select[['neighbourhood', 'latitude', 'longitude']]
center_lat = df_loc['latitude'].mean()
center_lon = df_loc['longitude'].mean()
# to set boundaries of folium
lat_min = df_loc['latitude'].min()
lat_max = df_loc['latitude'].max()
lon_min = df_loc['longitude'].min()
lon_max = df_loc['longitude'].max()
map_boston = folium.Map(location=[center_lat, center_lon], width=800, height=600)
map_boston.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])
# add markers to map
for lat, lng, label in zip(df_loc['latitude'], df_loc['longitude'], df_loc['neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_boston)  
map_boston

## 6th step, finding venues
After obtaining the coordinates of the neighbourhoods, the next task is to extract the details of registered venues. To do so, the Foursquare API service is used. The 5 corresponding steps are as follows:
1. defining API credentials with dotenv. In this method, credentials are saved in a .env file, which is listed in .gitignore so that it is not published on GitHub.
2. defining two main functions: the first function finds venues around a specified location given its lat/lon. The limit defaults to 1000 and the radius to 3000 m; each request below returned 100 venues. The second function extracts the category of a venue from the retrieved JSON file.
3. exploring the neighbourhoods' venues by running the two functions over all coordinates selected in the previous step. A new dataset is generated here, which stores the details of the venues.
4. analysing the venues dataset, starting with how many venues were found per neighbourhood. Then the number of unique venues and the number of unique categories are calculated.
5. plotting the venues found on top of the neighbourhoods' plot to see their distribution.

In [12]:
# Defining Foursquare Credentials and Version
# importing credentials
load_dotenv()
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)
CLIENT_ID = os.getenv("Foursquare_CLIENT_ID")
CLIENT_SECRET = os.getenv("Foursquare_CLIENT_SECRET")
VERSION = '20200630' # Foursquare API version

In [13]:
# defining two main functions
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the column keeps its 'venue.' prefix after json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def find_venue(lat, lon, limit = 1000, radius = 3000):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lon, 
    radius, 
    limit)
    results = requests.get(url).json()
    try:
        venues = results['response']['groups'][0]['items']
    except:
        venues = []
    nearby_venues = None
    if len(venues) > 0:
        nearby_venues = pd.json_normalize(venues) # flatten JSON
        # filter columns
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        nearby_venues = nearby_venues.loc[:, filtered_columns]
        # filter the category for each row
        nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
        # clean columns
        nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    return nearby_venues
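
To make the structure that `find_venue` relies on more concrete, the sketch below walks a hand-written payload that mimics only the fields accessed above (`response`, then `groups[0]`, then `items`, and each item's `venue`). The payload is an illustrative assumption, not a real API response, and `extract_venues` is a hypothetical name.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into a list of row dictionaries."""
    try:
        items = payload["response"]["groups"][0]["items"]
    except (KeyError, IndexError):
        return []  # no venues in the payload
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "name": venue["name"],
            # keep only the name of the first category, if any
            "categories": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return rows


# Hand-written sample shaped like the fields used in find_venue (assumed layout)
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Fenway Park",
                   "categories": [{"name": "Baseball Stadium"}],
                   "location": {"lat": 42.346282, "lng": -71.097535}}},
        {"venue": {"name": "Unnamed venue",
                   "categories": [],
                   "location": {"lat": 42.35, "lng": -71.10}}},
    ]}]}
}
rows = extract_venues(sample_payload)
```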

In [21]:
df_venues = pd.DataFrame(columns = ['name', 'categories', 'lat', 'lng'])
nn = 0
for name, lat, lng in zip(df_select['neighbourhood'], df_select['latitude'], df_select['longitude']):
    df_tr = find_venue(lat, lng)
    if df_tr is None:
        len_found = 0
    else: 
        len_found = len(df_tr)
        df_tr['neighbourhood'] = name
    print('{}, venues of {} explored at lat: {} and long: {}, with {} venues'.format(nn, name, lat, lng, len_found))
    df_venues = pd.concat([df_venues, df_tr])
    nn += 1
df_venues.reset_index(inplace = True, drop = True)
print('venues are explored, the dataset shape is {} \n'.format(df_venues.shape))
df_venues

0, venues of Back Bay explored at lat: 42.35031725 and long: -71.10122545064246, with 100 venues
1, venues of Beacon Hill explored at lat: 42.3587085 and long: -71.067829, with 100 venues
2, venues of Charlestown explored at lat: 42.3813909 and long: -71.0726639, with 100 venues
3, venues of Downtown explored at lat: 42.35860195 and long: -71.06387508501135, with 100 venues
4, venues of Fenway explored at lat: 42.33734685 and long: -71.10571720213595, with 100 venues
5, venues of Longwood explored at lat: 42.3371008 and long: -71.10187956391178, with 100 venues
6, venues of Roxbury explored at lat: 42.337915550000005 and long: -71.10139842213647, with 100 venues
7, venues of South End explored at lat: 42.34131 and long: -71.0772298, with 100 venues
8, venues of West End explored at lat: 42.3602534 and long: -71.0582912, with 100 venues
venues are explored, the dataset shape is (900, 5) 



Unnamed: 0,name,categories,lat,lng,neighbourhood
0,Charles River Esplanade,Trail,42.351128,-71.100407,Back Bay
1,Island Creek Oyster Bar,Seafood Restaurant,42.348838,-71.095280,Back Bay
2,Fenway Park,Baseball Stadium,42.346282,-71.097535,Back Bay
3,Fenway Beer Shop,Liquor Store,42.344928,-71.099908,Back Bay
4,Mei Mei,Chinese Restaurant,42.347481,-71.105949,Back Bay
...,...,...,...,...,...
895,Thinking Cup,Coffee Shop,42.351653,-71.074884,West End
896,Bella Sante The SPA on Newbury,Spa,42.352087,-71.073132,West End
897,Residence Inn by Marriott Boston Downtown/Seaport,Hotel,42.350179,-71.047857,West End
898,Bacco's Fine Foods,Gourmet Shop,42.350750,-71.071220,West End


In [22]:
print('There are {} uniques categories \n'.format(len(df_venues['categories'].unique())))
print('There are {} uniques venues \n'.format(len(df_venues['name'].unique())))

There are 133 uniques categories 

There are 348 uniques venues 



In [23]:
# plotting venues along their neighbourhoods
map_venue_boston = folium.Map(location=[center_lat, center_lon], width=800, height=600)
map_venue_boston.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])
# add markers to map for neighbourhoods
for lat, lng, label in zip(df_select['latitude'], df_select['longitude'], df_select['neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius= 10,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_venue_boston)
# add markers to map for venues
for lat, lng, label in zip(df_venues['lat'], df_venues['lng'], df_venues['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=0.2,
        popup=label,
        color='green',
        fill=False,
        fill_color='#31cc67',
        fill_opacity=0.5,
        parse_html=False).add_to(map_venue_boston)
map_venue_boston

## 7th step, analysing each neighbourhood
In this step, neighbourhoods are analysed by classifying their venues into the groups of interest: Restaurant, Bar, Sport, Coffee, and Gym. The corresponding frequencies are then calculated and, based on the scores given to the classes, the neighbourhoods are ranked. This part consists of four steps:
1. building the one-hot dataset
2. grouping categories into the 5 desired classes
3. grouping the dataset by neighbourhood
4. calculating frequencies

In [24]:
# one hot encoding
boston_onehot = pd.get_dummies(df_venues[['categories']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
boston_onehot['neighbourhood'] = df_venues['neighbourhood'] 
# move neighborhood column to the first column
fixed_columns = [boston_onehot.columns[-1]] + list(boston_onehot.columns[:-1])
boston_onehot = boston_onehot[fixed_columns]
print('shape of neighbourhood-venues dataset is {}'.format(boston_onehot.shape))
boston_onehot.head()

shape of neighbourhood-venues dataset is (900, 134)


Unnamed: 0,neighbourhood,Afghan Restaurant,American Restaurant,Aquarium,Arepa Restaurant,Art Museum,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,...,Theater,Tour Provider,Trail,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Yoga Studio
0,Back Bay,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,Back Bay,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Back Bay,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Back Bay,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Back Bay,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# classification of categories by keywords
search_list = ['Restaurant', 'Bar', 'Sport', 'Coffee', 'Gym'] # Aron is interested in these classes
df_onehot = pd.DataFrame(columns = search_list, index = boston_onehot.index)
for search_spec in search_list:
    df_onehot[search_spec] = boston_onehot[list(filter(lambda a: search_spec in a, boston_onehot.columns))].sum(axis = 1)
df_onehot.index = boston_onehot['neighbourhood']
df_onehot

neighbourhood,Restaurant,Bar,Sport,Coffee,Gym
Back Bay,0,0,0,0,0
Back Bay,1,0,0,0,0
Back Bay,0,0,0,0,0
Back Bay,0,0,0,0,0
Back Bay,1,0,0,0,0
...,...,...,...,...,...
West End,0,0,0,1,0
West End,0,0,0,0,0
West End,0,0,0,0,0
West End,0,0,0,0,0


In [26]:
# grouping neighbourhood-venues dataset by its neighbourhood to find densities
boston_onehot = df_onehot.groupby('neighbourhood').sum().reset_index()
boston_onehot['Gym'] += boston_onehot['Sport']
boston_onehot.drop('Sport', axis = 1, inplace = True)
boston_onehot.set_index('neighbourhood', inplace = True, drop = True)
print('the shape of neighbourhood- selected venues dataset is {}'.format(boston_onehot.shape))
boston_onehot = boston_onehot / boston_onehot.sum(axis = 0)
boston_onehot

the shape of neighbourhood- selected venues dataset is (9, 4)


neighbourhood,Restaurant,Bar,Coffee,Gym
Back Bay,0.121076,0.071429,0.071429,0.108108
Beacon Hill,0.085202,0.0,0.142857,0.135135
Charlestown,0.107623,0.214286,0.035714,0.054054
Downtown,0.089686,0.0,0.142857,0.108108
Fenway,0.112108,0.071429,0.071429,0.108108
Longwood,0.125561,0.071429,0.107143,0.135135
Roxbury,0.125561,0.142857,0.107143,0.135135
South End,0.152466,0.357143,0.142857,0.135135
West End,0.080717,0.071429,0.178571,0.081081
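
The classification and frequency steps can be illustrated on a toy dataset. The sketch below uses made-up venues in two made-up neighbourhoods, `A` and `B`, and all names in it are hypothetical. It matches categories by keyword, sums the matches per neighbourhood, and divides each column by its total so that every class sums to 1 across neighbourhoods.

```python
import pandas as pd

# Toy venues, invented for illustration only
venues = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "B"],
    "categories": ["Coffee Shop", "Sushi Restaurant", "Gym", "Wine Bar", "Coffee Shop"],
})
keywords = ["Restaurant", "Bar", "Coffee", "Gym"]

# One indicator column per keyword: 1 if the category name contains it
counts = pd.DataFrame({
    key: venues["categories"].str.contains(key, regex=False).astype(int)
    for key in keywords
})
counts["neighbourhood"] = venues["neighbourhood"]

# Venues per class in each neighbourhood, then each class's share per neighbourhood
per_neighbourhood = counts.groupby("neighbourhood").sum()
shares = per_neighbourhood / per_neighbourhood.sum(axis=0)
```

Note that substring matching is permissive: a keyword such as `Bar` matches any category name that merely contains those letters.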


## 8th step, developing scoring frame
Aron has given each class of venue a score out of 10: <br>
1. Gym: 10 / 10
2. Coffee: 8 / 10
3. Restaurant: 5 / 10
4. Bar: 5 / 10

Using these scores, the neighbourhoods can be rated and the top 2 selected.

In [27]:
boston_onehot['Restaurant'] *= 5
boston_onehot['Bar'] *= 5
boston_onehot['Coffee'] *= 8
boston_onehot['Gym'] *= 10
boston_onehot['Score'] = boston_onehot.sum(axis = 1)
boston_onehot = boston_onehot.sort_values(by = 'Score', ascending=False)
boston_onehot

neighbourhood,Restaurant,Bar,Coffee,Gym,Score
South End,0.762332,1.785714,1.142857,1.351351,5.042255
Roxbury,0.627803,0.714286,0.857143,1.351351,3.550583
Longwood,0.627803,0.357143,0.857143,1.351351,3.19344
West End,0.403587,0.357143,1.428571,0.810811,3.000113
Beacon Hill,0.426009,0.0,1.142857,1.351351,2.920217
Downtown,0.44843,0.0,1.142857,1.081081,2.672369
Back Bay,0.605381,0.357143,0.571429,1.081081,2.615034
Fenway,0.560538,0.357143,0.571429,1.081081,2.570191
Charlestown,0.538117,1.071429,0.285714,0.540541,2.4358
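
The score is a weighted sum of the class shares. As a sketch, the same calculation can be written as a single dot product that leaves the shares untouched; the three rows below are copied from the frequency table in the 7th step, and the variable names are illustrative.

```python
import pandas as pd

# Scores out of 10 given to each class
weights = pd.Series({"Restaurant": 5, "Bar": 5, "Coffee": 8, "Gym": 10})

# Class shares for three neighbourhoods, copied from the 7th step
shares = pd.DataFrame(
    {
        "Restaurant": [0.107623, 0.125561, 0.152466],
        "Bar": [0.214286, 0.142857, 0.357143],
        "Coffee": [0.035714, 0.107143, 0.142857],
        "Gym": [0.054054, 0.135135, 0.135135],
    },
    index=["Charlestown", "Roxbury", "South End"],
)

# Weighted sum per neighbourhood, highest score first
scores = shares.dot(weights).sort_values(ascending=False)
top_two = list(scores.index[:2])
```

Keeping the weights in one `Series` makes it easy to change a preference and recompute the ranking.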


In [29]:
n_top_neighbourhood = 2
negihbour_selected_list = list(boston_onehot.index[range(0,n_top_neighbourhood)])
df_selected_neighbourhood = df_select[df_select.neighbourhood.isin(negihbour_selected_list)]
df_selected_neighbourhood.reset_index(inplace = True)
df_selected_neighbourhood

Unnamed: 0,index,neighbourhood,total_population,median_age,0-9,10-19,20-34,35-54,55-64,65+,latitude,longitude,distance
0,17,Roxbury,51252.0,31.0,6844.0,7959.0,14228.0,12277.0,4879.0,5065.0,42.3379,-71.1014,2.534625
1,20,South End,31601.0,35.0,2752.0,1841.0,11211.0,8885.0,3172.0,3740.0,42.3413,-71.0772,2.512266


## 9th step, finding rental properties
In this section, rental properties are identified with the realtor API. The JSON response is flattened into a pandas dataframe that includes the parameters of interest. This section consists of the following parts:
1. exploring the rental properties in Boston with the realtor API
2. converting the JSON file to a pandas dataframe and selecting the columns of interest
3. finding the distance of each property from the selected neighbourhoods, and filtering by it
4. visualising the properties found in folium

In [66]:
# exploring rental properties
load_dotenv()
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)

url = "https://realtor.p.rapidapi.com/properties/v2/list-for-rent"
querystring = {"beds_min":"0","price_max":"10000","prop_type":"single_family","sort":"relevance","baths_min":"0","price_min":"0","city":"Boston","state_code":"MA","limit":"10000","offset":"0"}
headers = {
    'x-rapidapi-host': os.getenv("rapidapi_host_realtor"),
    'x-rapidapi-key': os.getenv("rapidapi_key_realtor")
    }
response = requests.request("GET", url, headers=headers, params=querystring)

In [67]:
# conversion of JSON file
import json
all_col = pd.json_normalize(json.loads(response.text)['properties']).columns
selected_col = ['prop_type','list_date','last_update','listing_status','beds','baths_full','prop_status','price','baths','address.lat','address.lon','address.line','address.postal_code','address.neighborhood_name','address.neighborhoods','garage']
df_rent = pd.json_normalize(json.loads(response.text)['properties'])[selected_col]
print('{} properties found with specified features'.format(len(df_rent)))
df_rent.head()

64 properties found with specified features


Unnamed: 0,prop_type,list_date,last_update,listing_status,beds,baths_full,prop_status,price,baths,address.lat,address.lon,address.line,address.postal_code,address.neighborhood_name,address.neighborhoods,garage
0,single_family,,2020-08-21T03:51:46.000Z,active,2,1,for_rent,2450,2,42.379986,-71.028741,285 Princeton St,2128,Central Maverick Square - Paris Street,[{'id': 'aee6de62-e741-59c4-8c5f-3e8fbf20f153'...,
1,single_family,2020-03-31T14:12:53.000Z,2020-08-21T07:04:00.000Z,active,5,2,for_rent,4000,2,42.355079,-71.125291,30 Wadsworth St,2134,,,
2,single_family,2020-06-26T03:23:35.000Z,2020-08-20T17:40:00.000Z,active,3,1,for_rent,2600,2,42.32058,-71.063353,21 Elder St Unit 1,2125,Uphams Corner - Jones Hill,[{'id': '6494a6e6-7a5d-56aa-930a-e62b83647e63'...,
3,single_family,2020-08-03T18:08:47.000Z,2020-08-21T03:05:00.000Z,active,3,2,for_rent,2650,2,42.320423,-71.104588,63 Mozart St,2130,Hyde Square,[{'id': '3772888b-1edf-5c39-9c98-f156f52402a4'...,
4,single_family,2020-05-27T23:05:11.000Z,2020-06-11T15:21:00.000Z,active,4,1,for_rent,3200,2,42.317317,-71.053103,36 Spring Garden St Unit Sf,2125,Columbia Point,[{'id': 'a527fe3e-5350-5fd3-93d2-662b858c8f3b'...,


In [68]:
# filtering rental properties
# 1. filtering by features, number of bedrooms and bathrooms (currently disabled)
# df_rent_filtered = df_rent[df_rent['beds'] == 1][df_rent['baths_full'] == 1]
# df_rent_filtered.reset_index(inplace = True, drop = True)
# 2. filtering by distance
df_rent['distance_n1'] = np.nan
df_rent['distance_n2'] = np.nan
for ii, lat, lon in zip(df_rent.index, df_rent['address.lat'], df_rent['address.lon']):
    test_loc = (lat, lon)
    check_loc = (df_selected_neighbourhood['latitude'][0], df_selected_neighbourhood['longitude'][0])
    df_rent['distance_n1'].iloc[ii] = geodesic(check_loc, test_loc).km
    check_loc = (df_selected_neighbourhood['latitude'][1], df_selected_neighbourhood['longitude'][1])
    df_rent['distance_n2'].iloc[ii] = geodesic(check_loc, test_loc).km
allowable_distance = 3 # set to 3 km
dist_mask = (df_rent['distance_n1'] < allowable_distance) | (df_rent['distance_n2'] < allowable_distance) # within range of either neighbourhood
df_rent_filtered = df_rent[dist_mask]
df_rent_filtered.reset_index(inplace = True, drop = True)
df_rent_filtered

Unnamed: 0,prop_type,list_date,last_update,listing_status,beds,baths_full,prop_status,price,baths,address.lat,address.lon,address.line,address.postal_code,address.neighborhood_name,address.neighborhoods,garage,distance_n1,distance_n2
0,single_family,2020-03-31T14:12:53.000Z,2020-08-21T07:04:00.000Z,active,5,2,for_rent,4000,2,42.355079,-71.125291,30 Wadsworth St,2134,,,,2.740575,4.245214
1,single_family,2020-06-26T03:23:35.000Z,2020-08-20T17:40:00.000Z,active,3,1,for_rent,2600,2,42.32058,-71.063353,21 Elder St Unit 1,2125,Uphams Corner - Jones Hill,[{'id': '6494a6e6-7a5d-56aa-930a-e62b83647e63'...,,3.679841,2.571081
2,single_family,2020-08-03T18:08:47.000Z,2020-08-21T03:05:00.000Z,active,3,2,for_rent,2650,2,42.320423,-71.104588,63 Mozart St,2130,Hyde Square,[{'id': '3772888b-1edf-5c39-9c98-f156f52402a4'...,,1.960771,3.235335
3,single_family,2020-07-31T17:01:58.000Z,2020-08-04T03:05:00.000Z,active,7,3,for_rent,6000,3,42.317518,-71.073618,11 Hartford St,2125,Dudley Triangle,[{'id': 'e8e244b2-7fff-52bb-87d8-74f47fa3881a'...,,3.221294,2.659521
4,single_family,2020-06-24T16:56:25.000Z,2020-08-19T21:25:00.000Z,active,2,2,for_rent,7500,3,42.342501,-71.077431,149 West Newton St,2118,Columbus,[{'id': '9ad686c5-a3cb-5e0b-bd67-55f2298a6513'...,,2.039727,0.133331
5,single_family,2020-07-11T14:53:08.000Z,2020-07-15T03:05:00.000Z,active,2,2,for_rent,3000,3,42.328977,-71.054366,,2127,Columbus Park - Andrew Square,[{'id': '5d49913a-c9f7-57f2-ad15-f784fa25f49e'...,,4.001405,2.329674
6,single_family,,2020-08-15T00:00:00.000Z,active,3,2,for_rent,3150,2,42.354268,-71.128082,106 Chester St Apt 2,2134,,,,2.851991,4.430456
7,single_family,2020-06-03T20:35:42.000Z,2020-06-07T03:05:00.000Z,active,2,1,for_rent,6800,2,42.358455,-71.067997,87 MT Vernon Unit Carriageh,2108,South Slope,[{'id': '920a66e5-be56-5330-a1f8-f4753d35f84a'...,,3.574903,2.050786
8,single_family,2020-05-18T16:17:07.000Z,2020-08-18T03:05:00.000Z,active,2,1,for_rent,2600,1,42.322284,-71.061392,223 Boston St Unit House,2125,Columbia Point,[{'id': 'a527fe3e-5350-5fd3-93d2-662b858c8f3b'...,,3.726607,2.484023
9,single_family,2020-08-04T20:32:39.000Z,2020-08-08T03:05:00.000Z,active,1,1,for_rent,2200,2,42.334595,-71.05362,147 W 8th St Unit 147,2127,D Street - West Broadway,[{'id': '5a8b50f3-cdef-5f2b-b2c3-0f0fbb150324'...,,3.954808,2.08378
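
The distance filter keeps a property when it lies within the allowed distance of either selected neighbourhood. The sketch below shows that logical OR written with the `|` operator; the first two rows are copied from the table above and the third row is a made-up value added to show a rejected case.

```python
import pandas as pd

distances = pd.DataFrame({
    "distance_n1": [2.740575, 3.679841, 3.5],  # last value is invented
    "distance_n2": [4.245214, 2.571081, 3.2],  # last value is invented
})
allowable_distance = 3  # km

# True where the property is close enough to at least one neighbourhood
near_either = (distances["distance_n1"] < allowable_distance) | (
    distances["distance_n2"] < allowable_distance
)
selected = distances[near_either].reset_index(drop=True)
```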


In [70]:
# plotting properties
center_lat = df_rent_filtered['address.lat'].mean()
center_lon = df_rent_filtered['address.lon'].mean()
# to set boundaries of folium
lat_min = df_rent_filtered['address.lat'].min()
lat_max = df_rent_filtered['address.lat'].max()
lon_min = df_rent_filtered['address.lon'].min()
lon_max = df_rent_filtered['address.lon'].max()
map_rent_boston = folium.Map(location=[center_lat, center_lon], width=800, height=600)
map_rent_boston.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])
# add markers to map for neighbourhoods
for lat, lng, label in zip(df_selected_neighbourhood['latitude'], df_selected_neighbourhood['longitude'], df_selected_neighbourhood['neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius= 10,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rent_boston)
# add markers to map for properties
for lat, lng, label_1, label_2 in zip(df_rent_filtered['address.lat'], df_rent_filtered['address.lon'], df_rent_filtered['address.line'], df_rent_filtered['price']):
    label = 'address:' + str(label_1) + '_price:' + str(label_2)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        popup=label,
        color='red',
        fill=False,
        fill_color='#31cc67',
        fill_opacity=0.5,
        parse_html=False).add_to(map_rent_boston)
map_rent_boston

## 10th step, clustering properties
This is the last stage, where the filtered properties are clustered and plotted. This process is accomplished in three steps:
1. clustering the properties with the k-means method, which is an unsupervised machine learning method. The clustering is applied to a number of features: distance from the centres of the neighbourhoods, number of bathrooms and bedrooms, and, most importantly, price.
2. creating a dataset to pass to the plotting section. It includes the main feature columns.
3. plotting the properties with the folium library, with each cluster colour coded. <br>
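
k-means groups points by Euclidean distance, so a feature with a large numeric range, such as price in dollars, can outweigh features such as distance in kilometres. As a hedged alternative to clustering the raw values, the features could be standardised first. The sketch below uses values copied from the first five rows of the filtered table in the 9th step; `standardise` is a hypothetical helper and is not part of the notebook.

```python
import pandas as pd


def standardise(frame):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)


features = pd.DataFrame({
    "distance_n2": [4.245214, 2.571081, 3.235335, 2.659521, 0.133331],
    "price": [4000, 2600, 2650, 6000, 7500],
    "beds": [5, 3, 3, 7, 2],
})
scaled = standardise(features)
# `scaled` could then be passed to KMeans(...).fit(scaled) in place of the raw frame
```

Whether scaling is desirable depends on the goal: leaving price unscaled effectively clusters by price, which may be the intent given that price is named as the most important feature.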

In [71]:
# forming a new dataframe containing main features of properties
df_learn = df_rent_filtered[['distance_n1', 'distance_n2', 'price', 'beds', 'baths_full', 'baths']]
# importing libraries
from sklearn.cluster import KMeans
kclusters = 3
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_learn)
# check cluster labels generated for each row in the dataframe
print('The cluster labels of the first 10 properties are ', kmeans.labels_[0:10])
df_rent_filtered.insert(0, 'Cluster Labels', kmeans.labels_)
df_rent_filtered

The cluster labels of the first 10 properties are  [2 0 0 1 1 2 2 1 0 0]


Unnamed: 0,Cluster Labels,prop_type,list_date,last_update,listing_status,beds,baths_full,prop_status,price,baths,address.lat,address.lon,address.line,address.postal_code,address.neighborhood_name,address.neighborhoods,garage,distance_n1,distance_n2
0,2,single_family,2020-03-31T14:12:53.000Z,2020-08-21T07:04:00.000Z,active,5,2,for_rent,4000,2,42.355079,-71.125291,30 Wadsworth St,2134,,,,2.740575,4.245214
1,0,single_family,2020-06-26T03:23:35.000Z,2020-08-20T17:40:00.000Z,active,3,1,for_rent,2600,2,42.32058,-71.063353,21 Elder St Unit 1,2125,Uphams Corner - Jones Hill,[{'id': '6494a6e6-7a5d-56aa-930a-e62b83647e63'...,,3.679841,2.571081
2,0,single_family,2020-08-03T18:08:47.000Z,2020-08-21T03:05:00.000Z,active,3,2,for_rent,2650,2,42.320423,-71.104588,63 Mozart St,2130,Hyde Square,[{'id': '3772888b-1edf-5c39-9c98-f156f52402a4'...,,1.960771,3.235335
3,1,single_family,2020-07-31T17:01:58.000Z,2020-08-04T03:05:00.000Z,active,7,3,for_rent,6000,3,42.317518,-71.073618,11 Hartford St,2125,Dudley Triangle,[{'id': 'e8e244b2-7fff-52bb-87d8-74f47fa3881a'...,,3.221294,2.659521
4,1,single_family,2020-06-24T16:56:25.000Z,2020-08-19T21:25:00.000Z,active,2,2,for_rent,7500,3,42.342501,-71.077431,149 West Newton St,2118,Columbus,[{'id': '9ad686c5-a3cb-5e0b-bd67-55f2298a6513'...,,2.039727,0.133331
5,2,single_family,2020-07-11T14:53:08.000Z,2020-07-15T03:05:00.000Z,active,2,2,for_rent,3000,3,42.328977,-71.054366,,2127,Columbus Park - Andrew Square,[{'id': '5d49913a-c9f7-57f2-ad15-f784fa25f49e'...,,4.001405,2.329674
6,2,single_family,,2020-08-15T00:00:00.000Z,active,3,2,for_rent,3150,2,42.354268,-71.128082,106 Chester St Apt 2,2134,,,,2.851991,4.430456
7,1,single_family,2020-06-03T20:35:42.000Z,2020-06-07T03:05:00.000Z,active,2,1,for_rent,6800,2,42.358455,-71.067997,87 MT Vernon Unit Carriageh,2108,South Slope,[{'id': '920a66e5-be56-5330-a1f8-f4753d35f84a'...,,3.574903,2.050786
8,0,single_family,2020-05-18T16:17:07.000Z,2020-08-18T03:05:00.000Z,active,2,1,for_rent,2600,1,42.322284,-71.061392,223 Boston St Unit House,2125,Columbia Point,[{'id': 'a527fe3e-5350-5fd3-93d2-662b858c8f3b'...,,3.726607,2.484023
9,0,single_family,2020-08-04T20:32:39.000Z,2020-08-08T03:05:00.000Z,active,1,1,for_rent,2200,2,42.334595,-71.05362,147 W 8th St Unit 147,2127,D Street - West Broadway,[{'id': '5a8b50f3-cdef-5f2b-b2c3-0f0fbb150324'...,,3.954808,2.08378
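
A quick way to read the clusters is to summarise one feature per label. The sketch below is an illustration, not part of the original notebook. It takes the cluster label and price of the ten rows displayed above and reports the minimum, mean and maximum price per cluster. The helper name `price_range_by_cluster` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (cluster label, price) pairs for the ten rows displayed above
labelled_prices = [
    (2, 4000), (0, 2600), (0, 2650), (1, 6000), (1, 7500),
    (2, 3000), (2, 3150), (1, 6800), (0, 2600), (0, 2200),
]


def price_range_by_cluster(pairs):
    """Return {label: (min price, mean price, max price)} for each cluster label."""
    groups = defaultdict(list)
    for label, price in pairs:
        groups[label].append(price)
    return {label: (min(p), mean(p), max(p)) for label, p in sorted(groups.items())}


summary = price_range_by_cluster(labelled_prices)
for label, (lowest, average, highest) in summary.items():
    print(f"cluster {label}: min={lowest}, mean={average:.1f}, max={highest}")
```

In these ten rows the price ranges of the three clusters do not overlap, which suggests the grouping largely follows price bands. This is only a sample of the full result. On the whole dataframe the same summary could be obtained with `df_rent_filtered.groupby('Cluster Labels')['price'].agg(['min', 'mean', 'max'])`.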


In [72]:
# importing libraries
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[center_lat, center_lon], width=800, height=600)
map_clusters.fit_bounds([[lat_min, lon_min], [lat_max, lon_max]])

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per property, coloured by its cluster label
for lat, lon, price, address, cluster in zip(df_rent_filtered['address.lat'], df_rent_filtered['address.lon'], df_rent_filtered['price'], df_rent_filtered['address.line'], df_rent_filtered['Cluster Labels']):
    # skip rows without a cluster label
    if not np.isnan(cluster):
        cluster = int(cluster)
        label = folium.Popup('Price:' + str(price) + ',Cluster:' + str(cluster) + ',Address:' + str(address), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=2,
            popup=label,
            # labels start at 0, so they index the colour list directly
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_clusters)
        
map_clusters

# END of CODE
Please send your inquiries to aron.shirazi (at) gmail.com