# Rate Neighborhoods

## Introduction

Completed as the Capstone Project for IBM Professional Certificate in Data Science.

### Business Question

For a real estate company, **how can we more efficiently sell houses** (or rent apartments)? A major factor in home buyers' decision is the neighborhood surrounding the home. i.e., What businesses and services are nearby? This tool will allow buyers to select their top venue types and will generate a score for each neighborhood based on the buyers' desired features. This will allow real estate agents to focus on neighborhoods that are more suited to the buyers' needs, reducing time wasted on unproductive leads. 


## Methodology


### Data Collection

Inputs: neighborhood or postal code location; buyer's prefered venues

Outputs: neighborhood score from 0 (worst) to 100 (best) 

Data Needs: 
- What types of businesses and services are located within a given radius from a given location. (Foursquare API)


### Data Visualization

A map of the city will be generated with neighborhoods color-coded based on their neighborhood score (high score = green ; low score = purple). 


### Data Manipulation

- A list of neighborhoods and their location is obtained from Wikipedia 
    - using pandas built-in read_html method
- A list of venue categories is scraped from the Foursquare website
    - using BeautifulSoup 
- A list of venues and their categories within each neighborhood is obtained from Foursquare
    - using API requests
- The venue categories are checked against the user preferences. 
- Neighborhoods with a greater number of venues within the user preferences are ranked more highly. 
    - At this time, all venue types within the user preference lists are equally weighted. 
    
    - The users' prefence lists may be of variable length. For example (see below), User 1 is happy with any restuarant type, which means that User 1 will have a greater number of preferred venues in total; this imbalance inevitably skews the results towards User 1's preferences. In order that both users have their preferences more equally weighted, I normalize the number of preferred venues by the number of venue types preferred for each user. 
    
    - Scores for each user are added to get the total score for the neighborhood.
    
    - All scores are divided by the maximum score and multiplied by 100 so that the maximum score is scaled to 100. 



##### For Development and Testing Purposes

- two users with pre-defined preferences
    

|     User 1's preferences :    |    User 2's preferences :   |
|-------------------------------|-----------------------------|
|  Restaurant - any type        |   Football Stadium          |
|  Parks                        |   Music Venues              |
|  Church                       |   Nightlife Spot - any type |
|  Library                      |   Music Festivals           |
|  Elementary School            |   Mexican Restaurants       |
 
  
- the city is pre-defined as New Orleans, LA
    - with latitude and longitude of neighborhoods available in a Wikipedia table

- The number of venues is limited to 50 per neighborhood

In [2]:
!python -V

Python 3.7.3


In [3]:
import numpy as np
np.__version__

'1.16.2'

In [4]:
import pandas as pd
pd.__version__

'1.1.0'

In [5]:
from pandas.io.json import json_normalize

In [6]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [7]:
import json
json.__version__

'2.0.9'

In [8]:
import requests
requests.__version__

'2.18.3'

In [9]:
import folium
folium.__version__

'0.5.0'

In [10]:
from time import sleep

In [11]:
from dotenv import load_dotenv

In [12]:
import os

In [13]:
import re
re.__version__

'2.2.1'

In [14]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt 
%matplotlib inline

In [15]:
from bs4 import BeautifulSoup

In [17]:
# scrape venue categories from Foursquare

url = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

# print(soup.prettify())

In [18]:
# The soup contains a hierarchical list of venue categories and sub-categories.
# convert the hierarchical list into a dictionary with key = sub-category and value = category

regex = re.compile('VenueCategories__Wrapper\S*')

page = soup.find('ul', class_=regex)

In [27]:
cat_dict = {}
cat_list = []

for entry in page :
    category = entry.find_all('h3')
    value = category[0].contents[0]
    if value not in cat_list :
        cat_list.append(value)
    #print(value)
    for item in category :
        #print("\t", item.contents[0])
        key = item.contents[0]
        cat_dict.update({key : value})

print(cat_list)
print(cat_dict)

['Arts & Entertainment', 'College & University', 'Event', 'Food', 'Nightlife Spot', 'Outdoors & Recreation', 'Professional & Other Places', 'Residence', 'Shop & Service', 'Travel & Transport']
{'Arts & Entertainment': 'Arts & Entertainment', 'Amphitheater': 'Arts & Entertainment', 'Aquarium': 'Arts & Entertainment', 'Arcade': 'Arts & Entertainment', 'Art Gallery': 'Arts & Entertainment', 'Bowling Alley': 'Arts & Entertainment', 'Casino': 'Arts & Entertainment', 'Circus': 'Arts & Entertainment', 'Comedy Club': 'Arts & Entertainment', 'Concert Hall': 'Arts & Entertainment', 'Country Dance Club': 'Arts & Entertainment', 'Disc Golf': 'Arts & Entertainment', 'Exhibit': 'Arts & Entertainment', 'General Entertainment': 'Arts & Entertainment', 'Go Kart Track': 'Arts & Entertainment', 'Historic Site': 'Arts & Entertainment', 'Karaoke Box': 'Arts & Entertainment', 'Laser Tag': 'Arts & Entertainment', 'Memorial Site': 'Arts & Entertainment', 'Mini Golf': 'Arts & Entertainment', 'Movie Theater': '

In [19]:
# This code block will eventually be turned into a user input menu

user1_prefs = ['Food - any', 'Park', 'Church', 'Library', 'Elementary School']
user2_prefs = ['Football Stadium', 'Music Venue', 'Nightlife Spot - any', 'Music Festival', 'Mexican Restaurant']

location = 'New Orleans, LA, USA'

radius = 1000    # units of meters

In [20]:
# get neighborhood longitude and latitudes 

url = 'https://en.wikipedia.org/wiki/Neighborhoods_in_New_Orleans'

df_neighborhoods = pd.read_html(url)
df_neighborhoods = df_neighborhoods[0]
df_neighborhoods.head()

Unnamed: 0,Neighborhood,Longitude,Latitude
0,U.S. NAVAL BASE,-90.026093,29.946085
1,ALGIERS POINT,-90.051606,29.952462
2,WHITNEY,-90.042357,29.9472
3,AUDUBON,-90.12145,29.932994
4,OLD AURORA,-90.0,29.92444


In [119]:
# I have my credientials in a .env file to keep them private. This code loads the credientials from .env (in same directory as Jupyter Notebook).

load_dotenv()

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
VERSION = os.getenv('VERSION')

LIMIT = 50
test_limit = 10


In [120]:
def url_builder(lat, lng, rad, action, item, lim) :
    # Making a function to build the url from uri and user input
    # for query to Foursqure API
    
    text_uri = 'https://api.foursquare.com/v2/'
    text_etc = 'client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'
    
    latitude = lat
    longitude = lng
    radius = rad
    
    item = item
    action = action
    limit = lim
    
    url = text_uri+item+'/'+action+'?'+text_etc
    url = url.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, limit)
    # print(url)
    return url

In [121]:
venues_list=[]

for lat, lng, neighborhood in zip(df_neighborhoods['Latitude'], 
                                              df_neighborhoods['Longitude'], 
                                              df_neighborhoods['Neighborhood']):
    dict1 = { 'lat' : lat , 
              'lng' : lng , 
              'rad' : radius , 
              'action' : 'explore' ,
              'item' : 'venues' ,
              'lim' : LIMIT }
    
    url = url_builder(**dict1)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
        
        
    venues_list.append([(
            neighborhood, 
            lat, lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    

In [122]:
nearby_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,U.S. NAVAL BASE,29.946085,-90.026093,Behrman Stadium,29.939423,-90.030863,Stadium
1,U.S. NAVAL BASE,29.946085,-90.026093,Federal City Inn & Suites,29.947682,-90.032891,Hotel
2,U.S. NAVAL BASE,29.946085,-90.026093,Subway,29.950125,-90.034375,Sandwich Place
3,U.S. NAVAL BASE,29.946085,-90.026093,The Mighty Missisippi,29.949695,-90.02371,Boat or Ferry
4,U.S. NAVAL BASE,29.946085,-90.026093,Navy Federal Credit Union,29.949349,-90.032268,Credit Union


In [123]:
# This could be modified to make a function that can be called for each user. 

for item1 in user1_prefs : 
    if " - any" in item1 : 
        super_category = item1.replace(" - any", "")
        # print(super_category)
        for item2 in cat_dict : 
            if cat_dict[item2] == super_category :
                user1_prefs.append(item2)

for item1 in user2_prefs : 
    if " - any" in item1 : 
        super_category = item1.replace(" - any", "")
        # print(super_category)
        for item2 in cat_dict : 
            if cat_dict[item2] == super_category :
                user2_prefs.append(item2)
                
print(user2_prefs)

['Football Stadium', 'Music Venue', 'Nightlife Spot - any', 'Music Festival', 'Mexican Restaurant', 'Nightlife Spot', 'Bar', 'Beach Bar', 'Beer Bar', 'Beer Garden', 'Champagne Bar', 'Cocktail Bar', 'Dive Bar', 'Gay Bar', 'Hookah Bar', 'Hotel Bar', 'Karaoke Bar', 'Pub', 'Sake Bar', 'Speakeasy', 'Sports Bar', 'Tiki Bar', 'Whisky Bar', 'Wine Bar', 'Brewery', 'Lounge', 'Night Market', 'Nightclub', 'Other Nightlife', 'Strip Club', 'Nightlife Spot', 'Bar', 'Beach Bar', 'Beer Bar', 'Beer Garden', 'Champagne Bar', 'Cocktail Bar', 'Dive Bar', 'Gay Bar', 'Hookah Bar', 'Hotel Bar', 'Karaoke Bar', 'Pub', 'Sake Bar', 'Speakeasy', 'Sports Bar', 'Tiki Bar', 'Whisky Bar', 'Wine Bar', 'Brewery', 'Lounge', 'Night Market', 'Nightclub', 'Other Nightlife', 'Strip Club']


In [124]:
# use one-hot encoding to show whether a venue has a venue category within the user preferences lists
# if a preference contains " - any" that means that the key should be checked instead of the value

nearby_venues["User1 Prefs"] = nearby_venues['Venue Category'].isin(user1_prefs)

In [125]:
nearby_venues["User2 Prefs"] = nearby_venues['Venue Category'].isin(user2_prefs)

In [126]:
nearby_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,User1 Prefs,User2 Prefs
0,U.S. NAVAL BASE,29.946085,-90.026093,Behrman Stadium,29.939423,-90.030863,Stadium,False,False
1,U.S. NAVAL BASE,29.946085,-90.026093,Federal City Inn & Suites,29.947682,-90.032891,Hotel,False,False
2,U.S. NAVAL BASE,29.946085,-90.026093,Subway,29.950125,-90.034375,Sandwich Place,True,False
3,U.S. NAVAL BASE,29.946085,-90.026093,The Mighty Missisippi,29.949695,-90.02371,Boat or Ferry,False,False
4,U.S. NAVAL BASE,29.946085,-90.026093,Navy Federal Credit Union,29.949349,-90.032268,Credit Union,False,False


In [55]:
# We need to normalize the preference lists so that one user with a greater number of preferences does not 
# dominate the neighborhood score. 

user1_ncat = len(user1_prefs)
user2_ncat = len(user2_prefs)

print(user1_ncat, user2_ncat)

# It might be more efficient to combine both lists into a single list
# or a dictionary with key=category and value=wieght from ranked choice preferences

703 30


In [127]:
nearby_venues_groupby_neighborhood = nearby_venues.groupby(['Neighborhood'])


In [128]:
user1_scores = nearby_venues_groupby_neighborhood['User1 Prefs'].sum() / user1_ncat
user2_scores = nearby_venues_groupby_neighborhood['User2 Prefs'].sum() / user1_ncat

In [98]:
type(user2_scores)

pandas.core.series.Series

In [129]:
user1_scores_df = user1_scores.to_frame()
user2_scores_df = user2_scores.to_frame()

In [130]:
scores_df = pd.merge(user1_scores_df, user2_scores_df, on='Neighborhood')

In [131]:
scores_df['Total'] = scores_df['User1 Prefs'] + scores_df['User2 Prefs']
print(scores_df['Total'].min(), scores_df['Total'].max())
scores_df['Scaled Score'] = 100.0 * scores_df['Total'] / scores_df['Total'].max()

0.0 0.059743954480796585


In [132]:
scores_df.sort_values('Total', ascending=False, inplace=True)

In [136]:
sorted_neighborhoods = scores_df.index.tolist()
print("All neighborhoods in order : ", sorted_neighborhoods)
top5_neighborhoods = sorted_neighborhoods[0:5]
print("Top 5 : ", top5_neighborhoods)

All neighborhoods in order :  ['MID-CITY', 'FRERET', 'UPTOWN', 'LOWER GARDEN DISTRICT', 'EAST CARROLLTON', 'EAST RIVERSIDE', 'BLACK PEARL', 'TOURO', 'MARIGNY', 'MILAN', 'BAYOU ST. JOHN', 'TULANE - GRAVIER', 'CENTRAL BUSINESS DISTRICT', 'ST. THOMAS DEV', 'GARDEN DISTRICT', 'LEONIDAS', 'IBERVILLE', 'FAIRGROUNDS', 'SEVENTH WARD', 'TREME - LAFITTE', 'BROADMOOR', 'WEST RIVERSIDE', 'FRENCH QUARTER', 'WEST END', 'LAKEVIEW', 'ST. CLAUDE', 'MARLYVILLE - FONTAINEBLEAU', 'IRISH CHANNEL', 'GERT TOWN', 'MILNEBURG', 'NAVARRE', 'CENTRAL CITY', 'BYWATER', 'DIXON', 'ALGIERS POINT', 'GENTILLY WOODS', 'ST. ROCH', 'LAKE TERRACE & OAKS', 'ST. ANTHONY', 'B. W. COOPER', 'AUDUBON', 'READ BLVD EAST', 'GENTILLY TERRACE', 'McDONOGH', 'LAKEWOOD', 'WHITNEY', 'LITTLE WOODS', 'HOLLYGROVE', 'WEST LAKE FOREST', 'CITY PARK', 'TALL TIMBERS - BRECHTEL', 'OLD AURORA', 'PINES VILLAGE', 'FISCHER DEV', 'DILLARD', 'FILMORE', 'FLORIDA AREA', 'PONTCHARTRAIN PARK', 'FLORIDA DEV', 'HOLY CROSS', 'LAKESHORE - LAKE VISTA', 'PLUM ORC

In [134]:
print("Your top 5 neighborhoods are : ", top5_neighborhoods)

Your top 5 neighborhoods are :  ['MID-CITY', 'FRERET', 'UPTOWN', 'LOWER GARDEN DISTRICT', 'EAST CARROLLTON']


## Results

New Orleans's City Planning Commission divides the city into 72 neighborhoods.

The top 5 neighborhoods for this couple are :
 - Mid-City
 - Freret
 - Uptown
 - Lower Garden District
 - East Carrollton
 
Based on my knowledge of the city, this prediction is good. These neighborhoods all have a high concentration of bars and restuarants. The French Quarter is ranked number 23, which is surprisingly low; this might be due to the lack of parks and churches, which are a preference for User 1. 



In [138]:
# need to have a dataframe with neighborhood name, latitude, longitude, and score for visualization

visual_df = pd.merge(df_neighborhoods, scores_df, on='Neighborhood')

In [139]:
visual_df.head()

Unnamed: 0,Neighborhood,Longitude,Latitude,User1 Prefs,User2 Prefs,Total,Scaled Score
0,U.S. NAVAL BASE,-90.026093,29.946085,0.002845,0.0,0.002845,4.761905
1,ALGIERS POINT,-90.051606,29.952462,0.012802,0.00569,0.018492,30.952381
2,WHITNEY,-90.042357,29.9472,0.01138,0.0,0.01138,19.047619
3,AUDUBON,-90.12145,29.932994,0.01138,0.002845,0.014225,23.809524
4,OLD AURORA,-90.0,29.92444,0.007112,0.0,0.007112,11.904762


In [155]:
# Make folio map with neighborhoods color-coded by score

# create map
lat_init = visual_df['Latitude'][0]
lng_init = visual_df['Longitude'][0]
map_clusters = folium.Map(location=[lat_init, lng_init], zoom_start=12)

# add markers to the map
markers_colors = []
for lat, lon, neighb, score in zip(visual_df['Latitude'], 
                                  visual_df['Longitude'], 
                                  visual_df['Neighborhood'], 
                                  visual_df['Scaled Score']):
    try : 
        score_int = int(score)
        score_flt = score / 100.0
        # print(score_int, score_flt)
    except :
        break

    label = folium.Popup(str(neighb) + ' Score ' + str(score_int), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=colors.rgb2hex(colormap(score_flt)),
        fill=True,
        fill_color=colors.rgb2hex(colormap(score_flt)),
        fill_opacity=0.7 ).add_to(map_clusters)
    
    
map_clusters

Static png image for display: 
![Ratings of Neighborhoods](ratings01.png)


## Discussion

The location of a home can be a major selling point if the surrounding neighborhood contains amenities desired by the buyers. This project has taken the preferences of two users and used them to compute a personalized neighborhood score for each neighborhood in New Orleans, LA. Based on my knowledge of the city, the scores are accurate; the neighborhoods with a high concentration of restuarants and bars are scored highly, and those with fewer restuarants and bars have a low score. 

This neighborhood score is similar to a "[walkability score](https://www.walkscore.com/)" because the neighborhood radius is only 1 km (0.6 mi; a 12 minute walk for most adults). However, the neighborhood score presented here is personalized to the preference of the user(s), whereas the standard walkability score is not personalized. One potential addition to this project is a "drivability score" in which venues at a further distance would be taken into account. 

In addition to the neighborhood amenities, other factors such as neighborhood safety and school quality are often important to buyers. This information is not available from Foursquare and so will require integration of new data sources and API's into the current program. There is an opportunity to integrate this type of information with current filtering options on real estate websites so that buyers can better narrow down their options.  


##### Room for Methodological Improvement

- A better category scheme is needed. The categories were taken from Foursquare's website. Sometimes, they seem too detailed, for example, 'Sausage Shop' may not need its own category and could be combined with other 'Grocery' venues. On the other hand,  sometimes the categories are not detailed enough. For example, the 'Church' church category does not contain any information on denomination; if there are 100 Baptist churches but the buyer is Catholic, the neighborhood score will have a false high value. 

- In order to generalized the program, a consistent source for neighborhood location/boundary data is needed. Sometimes this data can be found in geojson files on a city, but sometimes not. Sometimes this data is on Wikipedia, but sometimes not. Also, the url's for the data sources are currently entered manually. An automated way to find neighborhood data is needed. 

- Improvements to the scoring function:
    - An option for ranked-choice of venue types could be added, then the scored function weighted appropriately.
    - The scoring function can give artificially high scores because it is defined to give the best neighborhood a score of 100. Neighborhoods that are overall bad (with a low number of preferred venues) can still look good if the entire city is lacking in the preferred venues. 
 
##### Room for Programming Improvement
- The number of users should be a variable, not set at two.
- The users' preference lists can be combined into a single list immediately, rather than computing separate user scores. 


## Conclusion

This project was able to successfully determine neighborhoods with the desired venue types for two users. This project could be combined with current filtering options on real estate websites to help potential buyers more rapidly narrow down their housing option, resulting in saved time for the realtor and increased customer satisfaction for the buyer. 