# Capstone- Predicting Neighborhood for Living in Toronto

## Applied Data Science with Capstone by IBM

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data Description](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a resident. Specifically, this report will be targeted to people who are interested in migrating to a city of Toronto.

As everyone wants to have residence where they can find nearby Restaurant, Park, Grocery Store and the necessities routine life. So this can make people life easier what actually people are expecting.  

We will use our data science powers to generate a few most promising neighborhoods based on this criteria. Advantages of each area will then be clearly expressed so that best possible final location can be chosen by the people who are thinking to migrate.

## Data Description <a name="data"></a>

Based on definition of our problem, factors that will influence our decision are:
* number of the neighborhood in Toronto
* in each neighborhood will find list and number of Pharmacy, Theater, grocery store, restaurants and Gym.
* distance of neighborhood from nearby school, transportation and grocery store.

We decided to use regularly spaced grid of locations, centered around city center, to define our neighborhoods.

Following data sources will be needed to extract/generate the required information:
* centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using **Google Maps API reverse geocoding**
* number of school, transportation location and grocery store in every neighborhood will be obtained using **Foursquare API**
* coordinate of Berlin center will be obtained using **Google Maps API geocoding** of well known Berlin location (Alexanderplatz)

### Neighborhood in Toronto

From the link https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, will retrieve the data of neighborhood with postal code in Toronto. The data will then be converted in dataframe.

In [1]:
import requests
import pandas as pd
import numpy as np
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find('table', class_='wikitable sortable')
    
codes=[]
city=[]
places=[]
 
for row in tb.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==3:
        codes.append(cells[0].find(text=True))
        city.append(cells[1].find(text=True))
        places.append(cells[2].find(text=True))


df=pd.DataFrame(codes,columns=['Postcode'])
df['Borough']=city
df['Neighbourhood']=places

#Delete these 'Not assigned' rows in Borough column from dataFrame   
indexNames = df[ df['Borough'] == 'Not assigned'].index
df.drop(indexNames , inplace=True)  

#Merge the rows that have same postal code
df = df.groupby('Postcode').agg({'Borough': 'first', 
                             'Neighbourhood':', '.join }).reset_index()

#Replace the 'Not assigned' data in Neighbourhood column with data of Borough in same row
df['Neighbourhood'] = df['Neighbourhood'].replace('Not assigned',df['Borough'], regex = True)

#Droping the rows that doesn't contain Toronto word in Borough column
df = df[df["Borough"].str.contains('Toronto') == True].reset_index(drop=True)

df.head()



Unnamed: 0,Postcode,Borough,Neighbourhood
0,M4E,East Toronto,The Beaches
1,M4K,East Toronto,"The Danforth West\n, Riverdale"
2,M4L,East Toronto,"The Beaches West\n, India Bazaar"
3,M4M,East Toronto,Studio District\n
4,M4N,Central Toronto,Lawrence Park


Now we will retrieve latitude and longitude of the Neighborhood from reading the CSV file that had Postal code, Latitude and Longitude. The dataframe having the co-ordinates of the Neighborhood is merged with the above dataframe that had Post code , Borough and Neighborhood.

In [2]:
df_geo = pd.read_csv(
'https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv',
    index_col=0)

df = pd.merge(df, 
                  df_geo[['Latitude', 'Longitude']],
                  left_on='Postcode',
                  right_on='Postal Code',
                  how='left')

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West\n, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West\n, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District\n,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


### Foursquare
Now that we have our location candidates, let's use Foursquare API to get info on different categories in each neighborhood.

We're interested in venues in various categories, but only those that are very close to Neighborhood. So initially we will include try to get all categories and then we will fetch only venues that are familiar in each Neighborhood. 

Initially we will import all libraries that are required as below.

In [46]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

In [47]:
CLIENT_ID = 'SP1BAW0ZQNIAYSQYVGBMV1LANUOYHB2EA4EH1Y1CSYD5ATO2' # your Foursquare ID
CLIENT_SECRET = 'V4NELWEOH3IL1O4CQEMLLNAHLNFPADWYI450IKCSUGZRSN1E' # your Foursquare Secret
VERSION = '20180604'

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: SP1BAW0ZQNIAYSQYVGBMV1LANUOYHB2EA4EH1Y1CSYD5ATO2
CLIENT_SECRET:V4NELWEOH3IL1O4CQEMLLNAHLNFPADWYI450IKCSUGZRSN1E


In [48]:
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [49]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )



The Beaches
The Danforth West
, Riverdale
The Beaches West
, India Bazaar
Studio District

Lawrence Park
Davisville North

North Toronto West

Davisville

Moore Park, Summerhill East

Deer Park, Forest Hill SE
, Rathnelly, South Hill, Summerhill West

Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson
, Garden District

St. James Town
Berczy Park
Central Bay Street

Adelaide
, King
, Richmond

Harbourfront East
, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel

Roselawn

Forest Hill North, Forest Hill West

The Annex, North Midtown
, Yorkville
Harbord
, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay
, Island airport
, Harbourfront West
, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade

First Canadian Place, Underground city
Christie

Dovercourt Village, Dufferin

Little Portugal, Trinity
Brockton
, Exhibition Place, Park

In [71]:
print(toronto_venues.shape)
toronto_venues =toronto_venues[toronto_venues["Venue Category"].str.contains('Grocery Store|Pharmacy|Gym|Movie Theater|Asian Restaurant')==True].reset_index(drop=True)
toronto_venues.head()


(155, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Fox Theatre,43.672801,-79.287272,Indie Movie Theater
1,The Beaches,43.676357,-79.293031,Shoppers Drug Mart,43.670087,-79.300497,Pharmacy
2,The Beaches,43.676357,-79.293031,The Goof,43.672633,-79.287467,Asian Restaurant
3,The Beaches,43.676357,-79.293031,Dyson's valu-mart,43.67321,-79.285868,Grocery Store
4,"The Danforth West\n, Riverdale",43.679557,-79.352188,Bulk Barn,43.67679,-79.355865,Grocery Store


Looking good. So now we have all the various categories in area within few kilometers from each Neighborhood.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for residence!

## Methodology <a name="methodology"></a>

In this project we will direct our efforts on detecting areas of Toronto Neighborhood and we will limit our analysis to area ~500 meters around Neighborhood. We will try to recognize maximum of the neighborhood that all required venue close by. Below we will see the chart that will show the highest ratio of neighborhood in Toronto Borough.

In first step we have collected the required **data: location and type (category) of each Neighborhood in Toronto** . We have also **identified number of venue actually available in Neighborhood** (according to One Hot Encoding technique).

Second step in our analysis will be calculation and exploration of every category in each Neighborhood by finding the weighed average of each category in each Neighborhood.

In final step we will focus on most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion: we will take into consideration locations with **most common venue in each Neighborhood**. We will present map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for final 'street level' exploration and search for optimal venue location.

## Analysis <a name="analysis"></a>

Let's perform some basic explanatory data analysis and derive some additional info from our raw data. First let's count the **number of venue in each Neighborhood using One Hot Encoding Techniques**:

In [51]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Asian Restaurant,Climbing Gym,College Gym,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Indie Movie Theater,Movie Theater,Pharmacy
0,The Beaches,0,0,0,0,0,0,0,1,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,1
2,The Beaches,1,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,1,0,0,0,0,0,0
4,"The Danforth West\n, Riverdale",0,0,0,1,0,0,0,0,0,0


In [23]:
toronto_onehot.shape

(155, 11)

In [52]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Asian Restaurant,Climbing Gym,College Gym,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Indie Movie Theater,Movie Theater,Pharmacy
0,"Adelaide\n, King\n, Richmond\n",0.285714,0.0,0.0,0.0,0.285714,0.142857,0.0,0.0,0.285714,0.0
1,Berczy Park,0.0,0.0,0.0,0.333333,0.666667,0.0,0.0,0.0,0.0,0.0
2,"Brockton\n, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 East...,0.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street\n,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.333333,0.0


In [53]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide
, King
, Richmond
----
                  venue  freq
0      Asian Restaurant  0.29
1                   Gym  0.29
2         Movie Theater  0.29
3  Gym / Fitness Center  0.14
4          Climbing Gym  0.00


----Berczy Park----
              venue  freq
0               Gym  0.67
1     Grocery Store  0.33
2  Asian Restaurant  0.00
3      Climbing Gym  0.00
4       College Gym  0.00


----Brockton
, Exhibition Place, Parkdale Village----
                  venue  freq
0                   Gym  0.67
1  Gym / Fitness Center  0.33
2      Asian Restaurant  0.00
3          Climbing Gym  0.00
4           College Gym  0.00


----Business Reply Mail Processing Centre 969 Eastern
----
              venue  freq
0     Grocery Store  0.67
1               Gym  0.33
2  Asian Restaurant  0.00
3      Climbing Gym  0.00
4       College Gym  0.00


----Central Bay Street
----
              venue  freq
0     Grocery Store  0.33
1               Gym  0.33
2     Movie Theater  0.33
3  Asian Restaurant

In [54]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [55]:
num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"Adelaide\n, King\n, Richmond\n",Movie Theater,Gym,Asian Restaurant,Gym / Fitness Center,Pharmacy,Indie Movie Theater,Gym Pool
1,Berczy Park,Gym,Grocery Store,Pharmacy,Movie Theater,Indie Movie Theater,Gym Pool,Gym / Fitness Center
2,"Brockton\n, Exhibition Place, Parkdale Village",Gym,Gym / Fitness Center,Pharmacy,Movie Theater,Indie Movie Theater,Gym Pool,Grocery Store
3,Business Reply Mail Processing Centre 969 East...,Grocery Store,Gym,Pharmacy,Movie Theater,Indie Movie Theater,Gym Pool,Gym / Fitness Center
4,Central Bay Street\n,Movie Theater,Gym,Grocery Store,Pharmacy,Indie Movie Theater,Gym Pool,Gym / Fitness Center


In [56]:
from sklearn.cluster import KMeans

kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 2, 1, 2, 1, 2, 0, 2, 2], dtype=int32)

In [57]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,4.0,Pharmacy,Indie Movie Theater,Grocery Store,Asian Restaurant,Movie Theater,Gym Pool,Gym / Fitness Center
1,M4K,East Toronto,"The Danforth West\n, Riverdale",43.679557,-79.352188,4.0,Grocery Store,Pharmacy,Asian Restaurant,Movie Theater,Indie Movie Theater,Gym Pool,Gym / Fitness Center
2,M4L,East Toronto,"The Beaches West\n, India Bazaar",43.668999,-79.315572,2.0,Gym,Movie Theater,Grocery Store,Asian Restaurant,Pharmacy,Indie Movie Theater,Gym Pool
3,M4M,East Toronto,Studio District\n,43.659526,-79.340923,4.0,Gym / Fitness Center,Gym,Grocery Store,Climbing Gym,Pharmacy,Movie Theater,Indie Movie Theater
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0.0,Gym / Fitness Center,College Gym,Pharmacy,Movie Theater,Indie Movie Theater,Gym Pool,Gym


In [58]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium

In [59]:
from geopy.geocoders import Nominatim
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

toronto_merged = toronto_merged.dropna(axis=0)


The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [60]:

import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

map_df = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_df)  
    
map_df

In [61]:

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

This concludes our analysis. We have created 38 addresses representing centers of zones containing locations with low number of venue or all venues available in Neighborhood. Although zones are shown on map with a radius of ~500 meters, their shape is actually very irregular and their centers/addresses should be considered only as a starting point for exploring area neighborhoods in search for potential locations. All of the zones are located in boroughs of Toronto, which we have identified as interesting due to being popular with tourists, fairly close to city center and well connected by public transport.

## Results and Discussion <a name="results"></a>

Our analysis shows that although there is a great number of Borough and Neighborhood near Toronto but we focused Neighborhood and Borough in Toronto which offer a combination of popularity among tourists, closeness to city center and strong socio-economic dynamics. The reason behind choosing the location is having the happening in life as location around is happening.

Those location candidates were then clustered to create zones of interest which contain greatest number of location candidates. Addresses of centers of those zones were also generated using reverse geocoding to be used as markers/starting points for more detailed local analysis based on other factors.

Result of all this is 38 zones containing largest number of potential new locations based on number of and distance to existing venues. This, of course, does not imply that those zones are actually optimal locations for a new resident! Purpose of this analysis was to only provide info on areas in Toronto center where we can get enough information on the availability of daily necessities things. Those criteria would make life more easier and happier. Having those location for resident would be the good achievement. Recommended zones should therefore be considered only as a starting point for more detailed analysis which could eventually result in location which has not only no nearby competition but also other factors taken into account and all other relevant conditions met.

## Conclusion <a name="conclusion"></a>

Purpose of this project was to identify the Neighborhood in Toronto where people can find all necesities things around the residence. By finding the different categories venues in each neighborhood from Foursquare data we have first identified all categories venues that justify further analysis, and then generated extensive collection of locations which satisfy some basic requirements. Clustering of those locations was then performed in order to create major zones of interest (containing greatest number of potential locations) and addresses of those zone centers were created to be used as starting points for final exploration.

Final decision on optimal resident location will be made by people/client based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors like attractiveness of each location (proximity to park or water), levels of noise / proximity to major roads, real estate availability, prices, social and economic dynamics of every neighborhood etc.