# Capstone Project - The Battle of Neighborhoods

- **Cover page**
### Canada’s skilled worker immigration program

![Cover image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRgxGE0Wqx9tFC_vfgYjPNPo39oQIMxipRhL1QN5l_apiJk3Xbm0A)

- **Introductory section**

Each year the Canadian government enables skilled workers from around the world to become permanent residents of Canada. To qualify, candidates must meet criteria that attest to their qualifications and their eligibility to live and work in Canada.

One can imagine how hard it is for a newcomer to find their place in a new country, a new city, and an entirely new environment. Skilled experts usually earn above-average incomes, so they tend to prefer good neighborhoods with clean streets, fancy restaurants, reputable schools, and so on. Finding such a neighborhood is usually their primary goal on arrival, and they are ready to pay more in exchange for proper living conditions.

**Problem statement and target group identification**

The present study examines the opportunities for such skilled immigrants to find their dream home when moving to Canada. Toronto was chosen for this project because it is large, offers many opportunities, is multiethnic, and serves as an industrial and economic capital of the country.

- **Data section**

Because the analysis focuses on Toronto, it uses the same dataset as the previous coursework.

The exact link directing to that data is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The geographic coordinates of the neighborhoods were obtained from: http://cocl.us/Geospatial_data

Last but not least, venue data was retrieved from the Foursquare developer site. It is a handy source of data about various venues.


In [54]:
#import the necessary libraries
import numpy as np
import pandas as pd
import json
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import requests
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes
import folium

#importing the datasets


Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



- **Methodology section**

The methodology follows that used during the Capstone module.

The data for the present research comes from public sources: Wikipedia, Foursquare, etc.


The proposed action plan and applied code may be seen below:

- Downloading and manipulating raw data for Toronto's neighborhoods;

In [55]:
#download and manipulate the data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
table = pd.read_html(url,header=[0])
df = table[0]
df.head()

#clean and rename
df = df.rename(index=str, columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'})
df.head()

df = df[df.Borough !='Not assigned']
df.reset_index(drop=True,inplace=True)
df.head()

#group data
df = df.groupby('PostalCode',as_index=False).agg(lambda x: ','.join(set(x.dropna())))

df.loc[df['Neighborhood'] == 'Not assigned','Neighborhood'] = df['Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,Guildwood,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
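
To make the cleaning and grouping step concrete, here is a minimal sketch on a hand-made frame. The rows are illustrative assumptions, not values taken from the Wikipedia table. It joins names with `unique()` rather than `set()` so that they keep their order of appearance, which is a swapped-in variation and not the code used in the cell above.

```python
import pandas as pd

# Toy rows (hypothetical) imitating the raw postal-code table.
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M1B', 'M1B', 'M7A'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Rouge', 'Malvern', 'Not assigned'],
})

# Drop postal codes without a borough.
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# One row per postal code; neighborhoods joined in order of appearance.
grouped = clean.groupby('PostalCode', as_index=False).agg(
    {'Borough': 'first',
     'Neighborhood': lambda s: ','.join(s.dropna().unique())}
)

# A neighborhood that is still 'Not assigned' takes the borough name.
mask = grouped['Neighborhood'] == 'Not assigned'
grouped.loc[mask, 'Neighborhood'] = grouped.loc[mask, 'Borough']

print(grouped)
```

Restricting the fallback to the masked rows keeps the assignment explicit about which rows it changes.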


- Downloading and manipulating Toronto's coordinates data;

In [56]:
#Add the file with coordinates
df1 = pd.read_csv('http://cocl.us/Geospatial_data')

#Rename columns
df1.columns = ['PostalCode','Latitude','Longitude']
df1.head() 

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


- Mapping the data with its geographic coordinates: latitude and longitude;

In [57]:
#Use geopy library to get the latitude and longitude values of Toronto.
address = 'Toronto'
geolocator = Nominatim(user_agent="Presiyan Tsankov")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

neighborhoods = df.join(df1.set_index('PostalCode'), on='PostalCode')
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside,Guildwood,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
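
A quick way to check that every postal code actually received coordinates is sketched below. The two small frames are hypothetical (the code `M9Z` is invented to force a miss), and `merge` with `validate` is used here in place of `join` purely for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every postal code; validate raises if a key repeats.
merged = codes.merge(coords, on='PostalCode', how='left', validate='one_to_one')

# Postal codes that found no coordinates show up with NaN latitude.
missing = merged.loc[merged['Latitude'].isna(), 'PostalCode'].tolist()
print(missing)
```

Rows listed in `missing` would otherwise pass silently into the map and the Foursquare requests with empty coordinates.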


- Retrieving data from Foursquare about venues and their characteristics, and plotting the neighborhoods on a map of Toronto;

In [58]:
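# @hidden_cell
# Credentials are read from environment variables instead of being hard-coded;
# the variable names are placeholders chosen for this write-up
import os
Client_ID = os.environ['FOURSQUARE_CLIENT_ID']
Client_Secret = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'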

#Create a map of Toronto with neighborhoods superimposed on top
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

- Next, the venue information is processed further to find patterns in the data and assign the venues to their respective neighborhoods;

In [59]:
def getNearbyVenues(names, latitudes, longitudes, radius=800 , LIMIT = 200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            Client_ID, 
            Client_Secret, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

toronto_data = neighborhoods.copy()
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

print(toronto_venues.shape)
toronto_venues.head()

toronto_venues.groupby('Neighborhood').count().head()

print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))


(3967, 7)
There are 332 unique categories.
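
The parsing step inside `getNearbyVenues` assumes that every response contains a group and that every venue has at least one category. The sketch below shows one way to tolerate missing pieces. The helper name `extract_venues` and the hand-written payload are assumptions for illustration; the payload only mirrors the keys accessed in the cell above and is not a real Foursquare response.

```python
def extract_venues(neighborhood, payload):
    """Flatten one explore-style response into
    (neighborhood, venue, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue carries no category.
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hand-written payload (an assumption), not data returned by the API.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A',
               'location': {'lat': 43.77, 'lng': -79.21},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.78, 'lng': -79.22},
               'categories': []}},
]}]}}

rows = extract_venues('Woburn', fake_payload)
empty = extract_venues('Nowhere', {'response': {}})
print(rows)
print(empty)
```

A neighborhood whose request returns no groups then contributes an empty list instead of raising an `IndexError`.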


*Analyze Each Neighborhood*

In [60]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

#Create new dataframe and display the top 10 venues for each neighborhood.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,Richmond,King",Café,Coffee Shop,Hotel,American Restaurant,Thai Restaurant,Theater,Bar,Concert Hall,Steakhouse,Gastropub
1,Agincourt,Chinese Restaurant,Shopping Mall,Shanghai Restaurant,Motorcycle Shop,Supermarket,Sandwich Place,Discount Store,American Restaurant,Lounge,Malay Restaurant
2,"Agincourt North,Steeles East,L'Amoreaux East,M...",Fast Food Restaurant,Pizza Place,Park,Chinese Restaurant,BBQ Joint,Fried Chicken Joint,Coffee Shop,Pharmacy,Caribbean Restaurant,Noodle House
3,"Alderwood,Long Branch",Pizza Place,Coffee Shop,Gas Station,Pub,Pharmacy,Park,Convenience Store,Sandwich Place,Dance Studio,Skating Rink
4,Bayview Village,Japanese Restaurant,Bank,Skate Park,Café,Skating Rink,Grocery Store,Convenience Store,Chinese Restaurant,Restaurant,Shopping Mall
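
The one-hot encoding followed by a grouped mean turns venue rows into the share of each category per neighborhood. A minimal sketch on invented data (the neighborhoods `A` and `B` and their venues are assumptions) shows what those numbers mean:

```python
import pandas as pd

# Hypothetical venues for two neighborhoods.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bank', 'Park', 'Park'],
})

# One indicator column per category.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the indicators is the share of each category per neighborhood.
freq = onehot.groupby('Neighborhood').mean()

# Most frequent category per neighborhood.
top = freq.idxmax(axis=1)

print(freq)
print(top)
```

Here neighborhood `A` has two cafés among four venues, so its `Café` share is 0.5, while every venue in `B` is a park.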


- Grouping venues with the k-means clustering algorithm

*Clustering using k = 5*

In [61]:
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(toronto_grouped_clustering)
# kmeans.labels_ holds one label per row of toronto_grouped
labels = kmeans.labels_.tolist()
# Append six hand-picked labels to the end of the list
labels.extend([1, 2, 3, 4, 0, 1])

# toronto_merged is the same object as toronto_data (no copy is made),
# and the labels are attached by row position
toronto_merged = toronto_data
toronto_merged['Cluster Labels'] = pd.Series(labels)

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353,3,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Spa,Martial Arts Dojo,Paper / Office Supplies Store,Hobby Shop,African Restaurant,Community Center,Dog Run
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,3,Breakfast Spot,Italian Restaurant,Burger Joint,Bar,Women's Store,Electronics Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
2,M1E,Scarborough,"Morningside,Guildwood,West Hill",43.763573,-79.188711,4,Pizza Place,Coffee Shop,Fast Food Restaurant,Greek Restaurant,Pharmacy,Sports Bar,Supermarket,Fried Chicken Joint,Beer Store,Grocery Store
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Coffee Shop,Park,Business Service,Dumpling Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,3,Coffee Shop,Bakery,Indian Restaurant,Yoga Studio,Rental Car Location,Burger Joint,Bus Line,Flower Shop,Music Store,Lounge
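
The cell above attaches the labels by row position, which only matches each neighborhood to its own label if `toronto_grouped` and `toronto_data` share the same row order and length. A sketch of aligning by neighborhood name instead follows. The toy frames, the postal code `M1X` and the label values are assumptions, and the `labels` array merely stands in for `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `grouped` is sorted by name, `data` by postal code.
grouped = pd.DataFrame({'Neighborhood': ['Agincourt', 'Woburn']})
data = pd.DataFrame({
    'PostalCode': ['M1G', 'M1S', 'M1X'],
    'Neighborhood': ['Woburn', 'Agincourt', 'Upper Rouge'],
})
labels = np.array([3, 4])  # stands in for kmeans.labels_, one per row of `grouped`

# Store each label next to the neighborhood it was computed for.
labelled = grouped.assign(**{'Cluster Labels': labels})

# Join on the name, so row order no longer matters.
merged = data.merge(labelled, on='Neighborhood', how='left')

# Nullable integer dtype keeps labels integral while allowing missing values.
merged['Cluster Labels'] = merged['Cluster Labels'].astype('Int64')

print(merged)
```

With this approach a neighborhood that returned no venues keeps a missing label rather than receiving a padded one.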


*Showing the identified clusters*

In [62]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

*Examine Cluster 0*

In [63]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
40,East York,0,Café,Pizza Place,Coffee Shop,Greek Restaurant,Thai Restaurant,Park,Breakfast Spot,Gastropub,Beer Bar,Ethiopian Restaurant


*Examine Cluster 1*

In [64]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Scarborough,1,College Stadium,Convenience Store,Skating Rink,Bank,Diner,Café,General Entertainment,Park,Thai Restaurant,Construction & Landscaping
27,North York,1,Coffee Shop,Gym,Asian Restaurant,Beer Store,Japanese Restaurant,Deli / Bodega,Restaurant,Dance Studio,Office,Italian Restaurant
39,East York,1,Indian Restaurant,Coffee Shop,Afghan Restaurant,Turkish Restaurant,Yoga Studio,Pharmacy,Park,Restaurant,Sandwich Place,Burger Joint
43,East Toronto,1,Café,Bakery,Coffee Shop,Bar,Diner,American Restaurant,Sushi Restaurant,Italian Restaurant,Park,Brewery
64,Central Toronto,1,Coffee Shop,Italian Restaurant,Dance Studio,Café,Bakery,Bank,Gastropub,Japanese Restaurant,Park,Chinese Restaurant
69,Downtown Toronto,1,Coffee Shop,Café,Restaurant,Italian Restaurant,Japanese Restaurant,Hotel,Bakery,Pub,Seafood Restaurant,Gastropub
77,West Toronto,1,Café,Bar,Restaurant,Bakery,Cocktail Bar,Men's Store,Coffee Shop,Italian Restaurant,Asian Restaurant,Vegetarian / Vegan Restaurant
79,North York,1,Bakery,Park,Construction & Landscaping,Tennis Court,Women's Store,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant
88,Etobicoke,1,Café,Mexican Restaurant,Fast Food Restaurant,Sandwich Place,Bakery,Fried Chicken Joint,Dessert Shop,Bus Stop,Asian Restaurant,Italian Restaurant
96,North York,1,Bakery,Empanada Restaurant,Pizza Place,Electronics Store,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


*Examine Cluster 2*

In [65]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
65,Central Toronto,2,Café,Coffee Shop,Pizza Place,Italian Restaurant,Sandwich Place,Pub,Vegetarian / Vegan Restaurant,American Restaurant,Burger Joint,Restaurant
102,Etobicoke,2,,,,,,,,,,


*Examine Cluster 3*

In [66]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,3,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Spa,Martial Arts Dojo,Paper / Office Supplies Store,Hobby Shop,African Restaurant,Community Center,Dog Run
1,Scarborough,3,Breakfast Spot,Italian Restaurant,Burger Joint,Bar,Women's Store,Electronics Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
4,Scarborough,3,Coffee Shop,Bakery,Indian Restaurant,Yoga Studio,Rental Car Location,Burger Joint,Bus Line,Flower Shop,Music Store,Lounge
5,Scarborough,3,Fast Food Restaurant,Sandwich Place,Pizza Place,Convenience Store,Restaurant,Coffee Shop,Hakka Restaurant,Dog Run,Department Store,Design Studio
6,Scarborough,3,Coffee Shop,Discount Store,Chinese Restaurant,Grocery Store,Convenience Store,Department Store,Intersection,Light Rail Station,Bank,Rental Car Location
7,Scarborough,3,Diner,Bus Line,Coffee Shop,Bakery,Convenience Store,Bus Station,Intersection,Park,Fast Food Restaurant,Soccer Field
8,Scarborough,3,Fast Food Restaurant,Pizza Place,Wings Joint,Furniture / Home Store,Park,Burger Joint,Women's Store,Donut Shop,Discount Store,Dive Bar
10,Scarborough,3,Furniture / Home Store,Electronics Store,Indian Restaurant,Chinese Restaurant,Fast Food Restaurant,Wings Joint,Coffee Shop,Vietnamese Restaurant,Latin American Restaurant,Gym / Fitness Center
11,Scarborough,3,Bar,Restaurant,Seafood Restaurant,Rental Car Location,Furniture / Home Store,Fish Market,Korean Restaurant,Burger Joint,Convenience Store,Smoke Shop
12,Scarborough,3,Chinese Restaurant,Shopping Mall,Shanghai Restaurant,Motorcycle Shop,Supermarket,Sandwich Place,Discount Store,American Restaurant,Lounge,Malay Restaurant


*Examine Cluster 4*

In [67]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Scarborough,4,Pizza Place,Coffee Shop,Fast Food Restaurant,Greek Restaurant,Pharmacy,Sports Bar,Supermarket,Fried Chicken Joint,Beer Store,Grocery Store
3,Scarborough,4,Coffee Shop,Park,Business Service,Dumpling Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant
16,Scarborough,4,,,,,,,,,,
25,North York,4,Bus Stop,Furniture / Home Store,Park,Road,Food & Drink Shop,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dive Bar
26,North York,4,Japanese Restaurant,Pool,Gym / Fitness Center,Caribbean Restaurant,Paper / Office Supplies Store,Café,Ethiopian Restaurant,Empanada Restaurant,Dessert Shop,Dim Sum Restaurant
46,Central Toronto,4,Sporting Goods Shop,Coffee Shop,Italian Restaurant,Café,Diner,Yoga Studio,Pizza Place,Spa,Skating Rink,Salon / Barbershop
51,Downtown Toronto,4,Restaurant,Coffee Shop,Café,Pizza Place,Pharmacy,Bakery,Pub,Beer Store,Diner,Thai Restaurant
54,Downtown Toronto,4,Coffee Shop,Clothing Store,Restaurant,Plaza,Gastropub,Tea Room,Café,Thai Restaurant,Hotel,Diner
55,Downtown Toronto,4,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Bakery,Cosmetics Shop,Breakfast Spot,Seafood Restaurant,Gastropub
57,Downtown Toronto,4,Coffee Shop,Café,Bubble Tea Shop,Burger Joint,Italian Restaurant,Diner,Chinese Restaurant,Sushi Restaurant,Falafel Restaurant,Spa


*Consolidating the results into a pivot table showing each cluster and the unique values of its most common venue*

In [68]:
toronto_0=pd.DataFrame(toronto_merged, columns=['Cluster Labels', '1st Most Common Venue'])
toronto_0.dropna(inplace=True)
pd.pivot_table(toronto_0, index=['Cluster Labels','1st Most Common Venue'], aggfunc = 'count' )

Cluster Labels,1st Most Common Venue
0,Café
1,Bakery
1,Café
1,Coffee Shop
1,College Stadium
1,Indian Restaurant
1,Pizza Place
2,Café
3,BBQ Joint
3,Bank
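
If counts are wanted alongside the unique values, one option is to group by cluster and top venue and take the group sizes. This is a sketch on invented rows; the frame and the `Neighborhoods` column name are assumptions.

```python
import pandas as pd

# Hypothetical labelled neighborhoods.
merged = pd.DataFrame({
    'Cluster Labels': [0, 1, 1, 1, 2],
    '1st Most Common Venue': ['Café', 'Bakery', 'Café', 'Café', None],
})

# Count neighborhoods per (cluster, top venue); rows without a venue are dropped.
counts = (merged.dropna(subset=['1st Most Common Venue'])
                .groupby(['Cluster Labels', '1st Most Common Venue'])
                .size()
                .rename('Neighborhoods'))

print(counts)
```

The resulting series makes it easy to see which top venue dominates each cluster rather than only which ones occur.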


- **Results section**

The results of clustering with the k-means algorithm (with k equal to 5) show that the most common venues in the first and third clusters are cafés. The fourth cluster's venues appear to be related to Toronto's downtown, because places like banks, stores, and some hotels are often associated with a central business district. The second cluster's venues probably belong to an area closer to the downtown, as it has many cafés and restaurants. Finally, the fifth cluster indicates more distant, suburban areas. This suggestion is based on the fact that many of its venues are bus stops, sporting facilities, and even a harbor.

- **Discussion section**

Foursquare data combined with the k-means algorithm gives the data analyst a powerful tool for understanding how a modern city works. By plotting the venues, the analyst can get a general idea of how the city is structured and thus 'visit' places they have never been before.
As for Toronto's data, as mentioned in the results section, some general suggestions can be made. First, the most visited places in the city are located in or near the central area. Second, the data implies that, apart from the downtown, there are also areas with features typical of suburbs. Obviously, Toronto's downtown is more suitable for business purposes (offices, stores, hotels, etc.) than the other city areas. However, a common pattern in such agglomerations is for wealthy people to settle in greener and more distant neighborhoods.


- **Conclusion section**

Finally, newcomers arriving in Toronto through the various government programs are advised to consider properties in the fifth cluster. Such districts usually offer more green areas, pretty houses, and good transportation services, and hence a better quality of life. It should also be mentioned that there is a harbor in the last cluster. As a rule of thumb, properties near harbors are deemed luxurious and well managed.