# **The Battle of the Neighborhoods**

## **IBM Applied Data Science Capstone Project - Week 1**

### **1. Introduction: Marketing Consultancy Problem**

Marketing teams in travel, tourism, and housing face growing pressure to provide high-quality information tailored to each customer's needs. In this era of information overload, it is easy to be overwhelmed by the seemingly endless choices without being able to make good use of the information available.

Customers today are highly aware of the market, and as a result the industry has become highly competitive. This makes it essential to understand the client's needs in the shortest possible time and to present the options most suitable for their tastes.

For my project, I present the example of **a travel advisor approached by a client looking for a neighborhood in and around Toronto for a short-term stay. The client has provided some information about his tastes, including the neighborhoods he likes and dislikes most. The challenge for the advisor lies in understanding and exploiting client information that lacks linear structure and using it to recommend location clusters. The sheer number of possible neighborhoods to choose from poses an additional challenge.**

The Toronto area is spread over 630.2 square kilometers and, like any big city, has a mix of neighborhoods with varying densities of shops, establishments, open areas, and recreation facilities. One of the big factors in the search for a suitable neighborhood is obviously the available budget, but the overall makeup of the neighborhood also matters. For the purposes of this project, budget is not taken into consideration and the focus is on facilities and infrastructure. Presenting all this information visually, in a form that is easy to understand and that lets the client zoom in on areas of interest, will give the client a clearer understanding of the options matching their outlook and lifestyle.


### **2. Materials & Methods: Data Collection to Clustering, Recommendation & Classification**

#### **2.1 Project Objective**
To explore, cluster, and recommend the neighborhoods of Toronto using k-means clustering and a recommender system, and then to classify the client's profile based on their neighborhood preferences using several machine learning algorithms.

#### **2.2 Data collection**
This project applies well-known machine learning algorithms to data available from Wikipedia, the official Statistics Canada census website, and Foursquare. The first task is to define the data requirements for the segmentation, recommendation, and classification approach for the Toronto area. The data required for this project is collected from the following sources:

1. Wikipedia List of Postal Codes of Canada
   https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Geospatial Data
   http://cocl.us/Geospatial_data
3. Canada Census population data from 2016 
   https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/CompFile.cfm?Lang=Eng&T=1201&OFT=FULLCSV
4. Foursquare Data 

Location data describing places and venues, such as their geographical location, category, working hours, and full address, needs to be gathered. Once all the data ingredients are collected, I will have a better understanding of what I will be working with. The data should be such that, for a given location expressed as geographical coordinates (latitude and longitude values), one can determine what types of venues exist within a defined radius of that location. For a given location, I will then be able to tell whether there are restaurants or other facilities and institutions nearby, such as schools, banks, parks, gyms, or community centers, and how dense these facilities are in each neighborhood.
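
The "venues within a defined radius" requirement can be made concrete with a small distance check. The sketch below is only an illustration, not part of the project code: it assumes a great-circle (haversine) distance on a spherical Earth with a mean radius of 6,371 km, and the helper names `haversine_m` and `venues_within` are hypothetical. The coordinates are copied from venue rows that appear later in this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def venues_within(center, venues, radius_m=500):
    """Return the (name, category) pairs of venues no farther than radius_m from center."""
    lat0, lon0 = center
    return [(name, cat) for name, cat, lat, lon in venues
            if haversine_m(lat0, lon0, lat, lon) <= radius_m]

parkwoods = (43.753259, -79.329656)
venues = [
    ("Brookbanks Park", "Park", 43.751976, -79.332140),
    ("KFC", "Fast Food Restaurant", 43.754387, -79.333021),
    ("Textile Museum of Canada", "Art Museum", 43.654396, -79.386500),
]
# The museum is downtown, several kilometres away, so only the first two remain
print(venues_within(parkwoods, venues))
```

In the notebook itself this filtering is delegated to the `radius` parameter of the Foursquare request, so a check like this is mainly useful for validating returned data.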

In the subsequent steps, this data is transformed and fed to the various algorithms so that the resulting classifications match those in the Customer Profile dataset.

#### **2.3 Data Definition**
The content, format, and representation of the data needed for clustering, recommendation, and classification are defined, and the combination of information extraction and machine learning is carried out. In this phase the data requirements are revisited and a decision is made on whether more or less data needs to be collected.

Exploratory data analysis (EDA) techniques such as descriptive statistics and visualization can be applied to the dataset to assess its content and quality and to gain initial insights. Gaps in the data will be identified, and plans made to either fill them or substitute for them.

#### **2.4 Neighborhood Clustering**
The clustering algorithm produces a taxonomy of locations, namely neighborhoods, using the k-means method. The approach separates neighborhoods into groups (clusters) with similar characteristics, independent of their spatial location. Neighborhoods are segmented by venue category, so that a neighborhood can be recommended to a potential buyer, renter, or traveler based on their preferences and on the description of a property of interest.
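
As a rough illustration of the clustering step, the sketch below runs Lloyd's k-means algorithm on a tiny, made-up matrix of venue-category shares. Everything in it is an assumption for illustration: the neighborhood names, the three category columns, the choice of k = 2, and the hand-rolled `kmeans` helper, which stands in for a library implementation such as scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centers) for the rows of X."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical share of venues per category: [restaurant, park, shop]
names = ["A", "B", "C", "D"]
X = np.array([
    [0.8, 0.1, 0.1],   # restaurant-heavy
    [0.7, 0.1, 0.2],   # restaurant-heavy
    [0.1, 0.8, 0.1],   # park-heavy
    [0.2, 0.7, 0.1],   # park-heavy
])
labels, centers = kmeans(X, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

The features here are category shares rather than coordinates, which is why the resulting groups do not depend on where the neighborhoods sit on the map.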

#### **2.5 Neighborhood Recommendation**
Recommending products to consumers is a popular application of machine learning, especially when substantial data exists about the customer's preferences. Even though people's tastes vary, they generally follow patterns. There are similarities in the things people tend to like, i.e., they tend to like things in the same category or things that share the same characteristics.

One of the main advantages of using a recommender system in my case study is that the client gets broader exposure to many different neighborhoods he might be interested in. The idea is that people who prefer a particular destination are more likely to select a neighborhood from the same neighborhood cluster. Not only does this provide a better experience for the client, it also benefits the consultant through increased potential revenue and better results for customers.

Content-based recommender systems try to figure out which aspects of a neighborhood a client favors, and then recommend neighborhoods that share those aspects. The recommendation in a content-based system is based on the client's liking for a particular neighborhood and the nature of the venue categories it contains.
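
One common way to realize a content-based recommender is sketched below. It is an assumption about how such a system could look, not a description of the code used in this project: the client's ratings of a few neighborhoods are combined into a taste profile over venue categories, and every neighborhood is then scored against that profile. The neighborhood names, category columns, and ratings are made up.

```python
import numpy as np

categories = ["Cafe", "Park", "Nightclub"]
# Hypothetical share of each venue category per neighborhood (rows sum to 1)
names = ["N1", "N2", "N3", "N4"]
features = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
])

# Client ratings on a 1-5 scale for the neighborhoods he already knows
ratings = {"N1": 5, "N3": 1}

# Taste profile: rating-weighted sum of the rated neighborhoods' category shares
rated_idx = [names.index(n) for n in ratings]
weights = np.array(list(ratings.values()), dtype=float)
profile = weights @ features[rated_idx]
profile /= profile.sum()  # normalize so the profile weights sum to 1

# Score every neighborhood by how well its categories match the profile
scores = features @ profile

# Recommend the unrated neighborhoods, best match first
candidates = [(n, s) for n, s in zip(names, scores) if n not in ratings]
recommended = sorted(candidates, key=lambda pair: pair[1], reverse=True)
for name, score in recommended:
    print(f"{name}: {score:.3f}")
```

With these numbers the cafe-heavy neighborhood N4 ranks ahead of the park-heavy N2, because the highly rated N1 is dominated by cafes.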

After clustering, the cluster centers are chosen to represent each category, and they become the new training set for the classification algorithms.

#### **2.6 Final Prediction - Client Profile match**
Matching the client profile to the most appropriate neighborhood cluster is the final stage, in which a wide range of information is distilled into a highly focused result. This is achieved by running three classification algorithms: k-nearest neighbors (KNN), decision tree, and support vector machine. Finally, each classification model is evaluated to check its accuracy.
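
To make the classification step concrete, here is a minimal k-nearest-neighbors sketch with a simple accuracy check. It is an illustration under assumptions: the feature vectors, the cluster labels, and the helper `knn_predict` are made up, and a library classifier (for example from scikit-learn) would normally be used instead.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: category shares [cafe, park] -> neighborhood cluster label
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
y_train = ["urban", "urban", "urban", "green", "green", "green"]

# Hypothetical held-out profiles with known labels, used to estimate accuracy
X_test = np.array([[0.85, 0.15], [0.15, 0.85]])
y_test = ["urban", "green"]

y_pred = [knn_predict(X_train, y_train, x) for x in X_test]
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(y_pred, accuracy)
```

Accuracy here is simply the fraction of held-out profiles whose predicted cluster matches the known one; the same check applies unchanged to a decision tree or a support vector machine.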

#### **2.7 Final Visual Representation**
Clients often feel the need to switch between different sources in search of complete information, but a visual tree has been found to be more effective for customer engagement because it presents personalized, comprehensive information at a glance. The decision tree visualization will make the information easier for the client to understand, giving them more flexibility to discover, use, modify, and update it according to their tastes and preferences, and thereby improving their decision-making process.

#### **2.8 Summary**
The main stages are data collection, visualization, pre-processing, clustering with the k-means algorithm, term weighting, application of the recommender system, and finally classification with the KNN, decision tree, and support vector machine algorithms.
In short, clustering, recommender-system, and classification techniques are applied in sequence to segregate, recommend, classify, and predict the Toronto neighborhoods available to potential clients based on their tastes, to get the best possible outcome from their neighborhood search.


### **3. Download and Explore Dataset**

##### **Import Libraries & Modules**

In [1]:
import json  # library to handle JSON files
import numpy as np
import pandas as pd
import requests  # library to handle HTTP requests
from bs4 import BeautifulSoup
# json_normalize lives in the top-level pandas namespace since pandas 1.0
from pandas import json_normalize  # transform a JSON file into a pandas dataframe
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
print('Libraries imported.')

Libraries imported.


##### **3.1 Wikipedia List of Postal Codes of Canada**
In order to explore and cluster the neighborhoods in Toronto, the postal code of each neighborhood, along with its borough and neighborhood name, is obtained from the Wikipedia list of postal codes (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

Using the BeautifulSoup package in Python, the table on the Wikipedia page is scraped, wrangled, cleaned, and read into a structured format, a pandas DataFrame. Once the data is in a structured format, it can be analyzed to explore and cluster the neighborhoods in the city of Toronto.

In [2]:
# Target page
page_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(page_html, 'lxml')

# Extract the postal code table from the page
table = soup.find('table', class_='wikitable sortable')

# Collect the text of every cell; cells repeat in the order postal code, borough, neighborhood
table_list = [cell.text.replace('\n', '') for cell in table.find_all('td')]

# Build the dataframe directly from every third cell
toronto_data = pd.DataFrame({
    'PostalCode': table_list[::3],
    'Borough': table_list[1::3],
    'Neighborhood': table_list[2::3],
})

# Drop rows whose borough is "Not assigned"
toronto_data.replace('Not assigned', np.nan, inplace=True)
toronto_data.dropna(subset=['Borough'], inplace=True)
toronto_data.reset_index(drop=True, inplace=True)

# Where Neighborhood is not assigned, use the Borough of the same row
toronto_data['Neighborhood'] = toronto_data['Neighborhood'].fillna(toronto_data['Borough'])

# Join the neighborhoods that share a postal code and borough, then drop the repeated rows
toronto_data['Neighborhood'] = toronto_data.groupby(['PostalCode', 'Borough'])['Neighborhood'].transform(lambda x: ', '.join(x))
toronto_data.drop_duplicates(inplace=True)
toronto_data.reset_index(drop=True, inplace=True)
toronto_data.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
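
The two cleaning rules applied to the scraped table (fall back to the borough when a neighborhood is missing, then merge rows that share a postal code) can be checked on a few hand-written rows. This is a small sketch with made-up sample rows, not the scraped table itself.

```python
import numpy as np
import pandas as pd

# Hand-written sample in the same shape as the scraped table
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A", "M1A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park", "Not assigned"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Not assigned", "Not assigned"],
})

# Treat "Not assigned" as missing and drop rows without a borough
sample = sample.replace("Not assigned", np.nan).dropna(subset=["Borough"])

# Row-wise fallback: a missing neighborhood takes the borough of the same row
sample["Neighborhood"] = sample["Neighborhood"].fillna(sample["Borough"])

# One row per postal code and borough, with the neighborhood names joined
cleaned = (sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
                 .agg(", ".join)
                 .reset_index())
print(cleaned)
```

Using `fillna` with the `Borough` column keeps the fallback local to each row, so one missing neighborhood cannot pick up the borough of a different row.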


##### **3.2 Geospatial Data**

In [3]:
Geospatial_data = pd.read_csv("http://cocl.us/Geospatial_data")
neighborhoods = toronto_data.merge(Geospatial_data, how = "left", left_on = "PostalCode", right_on = "Postal Code")
neighborhoods = neighborhoods.drop(["Postal Code"], axis = 1)
neighborhoods.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
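
Because the merge is a left join, any postal code missing from the geospatial file would silently end up with empty coordinates. The sketch below shows one way to check for that with the `indicator` option of `DataFrame.merge`; the two small frames are made up, with one deliberately unmatched code.

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Hypothetical"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

merged = left.merge(right, how="left", left_on="PostalCode",
                    right_on="Postal Code", indicator=True)

# Rows tagged 'left_only' found no coordinates in the right-hand frame
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates before the data is passed on to the venue search.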


##### **3.3 Statistics Canada Census data**
The demographics of Toronto can be explored extensively after reading the Statistics Canada 2016 census population data into a pandas DataFrame.

In [4]:
Toronto_pop = pd.read_csv("https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/CompFile.cfm?Lang=Eng&T=1201&OFT=FULLCSV", encoding='latin-1')
Toronto_pop = Toronto_pop[Toronto_pop['Province or territory'] == "Ontario"]
Toronto_pop = Toronto_pop[Toronto_pop["Geographic code"].isin(neighborhoods['PostalCode'].tolist())]
Toronto_pop = Toronto_pop[["Geographic code", "Population, 2016", "Total private dwellings, 2016", "Private dwellings occupied by usual residents, 2016" ]]
Toronto_pop.columns = ["Geographic code", "Population", "Total pvt dwlgs", "Pvt by res."]
Toronto_pop.head()

| | Geographic code | Population | Total pvt dwlgs | Pvt by res. |
|---|---|---|---|---|
| 895 | M1B | 66108.0 | 20957.0 | 20230.0 |
| 896 | M1C | 35626.0 | 11588.0 | 11274.0 |
| 897 | M1E | 46943.0 | 17637.0 | 17161.0 |
| 898 | M1G | 29690.0 | 10116.0 | 9767.0 |
| 899 | M1H | 24383.0 | 9274.0 | 8985.0 |


##### **Geocoding**
To use the Foursquare location data, I need the latitude and longitude coordinates of each location I query. To convert the address 'Toronto' into latitude and longitude values, I use Nominatim, a geocoding tool that searches OpenStreetMap data.

In [5]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Toronto_explorer")
address = 'Toronto'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


##### **3.4 Foursquare location data**
Using the Foursquare API, I search for the types of venues or stores around the location coordinates returned by Nominatim. The call to the database requires my developer account credentials, namely my Client ID and Client Secret, along with the version of the API.

Because I am searching for different types of venues, I pass the latitude and longitude coordinates along with the search radius and result limit. This completes the URI for the call to the database, which returns a JSON file of the venues that match the query, with each venue's name, unique ID, location, and category information.

In [6]:
# The code was removed by Watson Studio for sharing.

Your credentials: Kept Hidden
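
Since the credentials cell is hidden, the sketch below shows one plausible way such a request could be assembled. It is an assumption, not the removed code: the environment variable names, the `build_explore_url` helper, and the placeholder version and limit values are hypothetical; only the endpoint and the query parameter names are taken from the `getNearbyVenues` function further down.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable names; real credentials should never be hard-coded
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "your-client-id")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "your-client-secret")
VERSION = "20180605"  # placeholder API version date (assumed value)
LIMIT = 100           # assumed maximum number of venues per request

def build_explore_url(lat, lng, radius=500, limit=LIMIT):
    """Assemble the venues/explore request URL for one pair of coordinates."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes the comma in "ll" and joins the pairs with "&"
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url(43.653963, -79.387207)
print(url.split("?")[0])
# results = requests.get(url).json()  # the actual call needs valid credentials
```

Reading the credentials from the environment keeps them out of the shared notebook, which serves the same purpose as hiding the cell.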


I get a JSON file of the venues with their names, unique IDs, locations, and categories. I then apply the get_category_type function from the Foursquare lab, clean the JSON, and structure it into a pandas DataFrame.

In [7]:
# function that extracts the category of the venue
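def get_category_type(row):
    # the category list sits under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue without categories yields None; otherwise take the first category's name
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']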

In [8]:
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Downtown Toronto | Neighborhood | 43.653232 | -79.385296 |
| 1 | Textile Museum of Canada | Art Museum | 43.654396 | -79.3865 |
| 2 | Japango | Sushi Restaurant | 43.655268 | -79.385165 |
| 3 | Sansotei Ramen 三草亭 | Ramen Restaurant | 43.655157 | -79.386501 |
| 4 | Tsujiri | Tea Room | 43.655374 | -79.385354 |
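
To see where column names such as `venue.location.lat` come from, the sketch below flattens a small hand-written payload with `pd.json_normalize`. The payload only imitates the nested shape that the cell above relies on and is not real API output.

```python
import pandas as pd

# Hand-written items imitating the nested structure of the response
items = [
    {"venue": {"name": "Example Cafe",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.65, "lng": -79.38}}},
    {"venue": {"name": "Example Lot",
               "categories": [],
               "location": {"lat": 43.66, "lng": -79.39}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
print(sorted(flat.columns))

# Keep the first category name, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None)

# Drop the dotted prefixes, as the notebook does
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat[["name", "categories", "lat", "lng"]])
```

Lists such as `categories` are left untouched by the flattening, which is why a separate step is needed to pull out the category name.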


To repeat the same process for all the neighborhoods in Toronto, the getNearbyVenues function is created.

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)            
        # create the API request URL (CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are assumed to be set in the hidden credentials cell)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']    
    return nearby_venues
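
The list comprehension in `getNearbyVenues` assumes that every response contains a non-empty `groups` list and that every venue has at least one category. A more defensive variant of the extraction step might look like the sketch below; `extract_rows` is a hypothetical helper and the payload is hand-written.

```python
def extract_rows(name, lat, lng, payload):
    """Turn one explore-style payload into flat rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"] if categories else None,  # no category -> None
        ))
    return rows

payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Park", "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled Spot", "location": {"lat": 43.76, "lng": -79.34},
               "categories": []}},
]}]}}

print(extract_rows("Parkwoods", 43.753259, -79.329656, payload))
print(extract_rows("Nowhere", 0.0, 0.0, {"response": {}}))  # -> []
```

Returning an empty list for a neighborhood without results lets the loop over all neighborhoods continue instead of stopping on the first incomplete response.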

In [10]:
Toronto_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Parkwoods
Victoria Village
Harbourfront, Regent Park
Lawrence Heights, Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The D

In [11]:
print(Toronto_venues.shape)
Toronto_venues.head()

(2271, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | KFC | 43.754387 | -79.333021 | Fast Food Restaurant |
| 2 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 3 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 4 | Victoria Village | 43.725882 | -79.315572 | Tim Hortons | 43.725517 | -79.313103 | Coffee Shop |


Now I am ready to explore the venues in the city of Toronto.