### Week 3: Segmenting and Clustering Neighborhoods in Toronto

### Part.1:

First, let's import all required libraries.

In [1]:
import pandas as pd
import numpy as np

In [4]:
#!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup
import requests

Then let's scrape the data from the Wikipedia page.
The data is cleaned in the same step.

In [5]:
url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(url,'html5lib')

table_contents = []
table = soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)
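To make the string cleaning easier to follow, here is a minimal sketch of the same idea applied to a single cell. The helper name `parse_cell` and the sample text are my own assumptions, chosen to mirror the "Borough(Neighborhood / Neighborhood)" format the loop expects; it is a simplified version of the chained calls above, not a replacement for them.

```python
def parse_cell(span_text):
    """Split 'Borough(Neighborhood / Neighborhood)' into its two parts."""
    # Everything before the first '(' is the borough.
    borough, _, rest = span_text.partition('(')
    # Drop the closing ')' and turn ' /' separators into commas.
    neighborhood = rest.strip(')').replace(' /', ',').strip()
    return borough, neighborhood


# Hypothetical cell text, made up to mirror the format described above.
sample = 'Downtown Toronto(Regent Park / Harbourfront)'
borough, neighborhood = parse_cell(sample)
print(borough, '|', neighborhood)
```

Testing the cleaning on one string like this makes it easier to spot cells that do not follow the expected format before the whole table is parsed.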

In the next step the table is transferred to a dataframe.
In the same step the data in the Borough column is cleaned.


In [6]:
df = pd.DataFrame(table_contents)
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

Let's print the whole dataframe to ensure there are no missing values.

In [7]:
pd.set_option('display.max_rows', len(df))
print(df)
pd.reset_option('display.max_rows')

    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
5          M9A               Etobicoke   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B              North York   
11         M9B               Etobicoke   
12         M1C             Scarborough   
13         M3C              North York   
14         M4C               East York   
15         M5C        Downtown Toronto   
16         M6C                    York   
17         M9C               Etobicoke   
18         M1E             Scarborough   
19         M4E            East Toronto   
20         M5E        Downtown Toronto   
21         M6E                    York   
22         M1G             Scarbor
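Reading the full printout works, but the same check can also be done programmatically. The sketch below uses a small toy dataframe (the rows and the planted missing value are made up for illustration) and counts missing values per column with `isna().sum()`.

```python
import pandas as pd

# Toy frame standing in for df; the None is planted to show a non-zero count.
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', None],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real dataframe, all counts being zero would confirm that no values are missing.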

Lastly, let's print the number of rows of the dataframe as requested in the assignment description.

In [8]:
print('number of rows of dataframed: ',df.shape[0])

number of rows of dataframed:  103


### Part.2:

First we import geocoder.

In [None]:
#!conda install -c conda-forge geocoder --yes
import geocoder

Then we get the coordinates for each row in the dataframe.

In [None]:
coordinates = []

for postal_code in df['PostalCode']:
    print(postal_code)
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    
    coordinates.append((latitude,longitude))
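The loop above keeps asking until it gets an answer, so it never finishes if the service stays unavailable. A possible alternative is to cap the number of attempts. The sketch below is an assumption on my side: `lookup_with_retries` and `fake_lookup` are hypothetical names, and the fake lookup stands in for the real geocoder call so the example runs without network access.

```python
def lookup_with_retries(lookup, query, max_attempts=5):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    # Give up and let the caller decide how to handle the missing value.
    return None


# Hypothetical stand-in for the geocoder: fails twice, then succeeds.
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return [43.65, -79.38] if calls['n'] >= 3 else None


result = lookup_with_retries(fake_lookup, 'M5A, Toronto, Ontario')
print(result, 'after', calls['n'], 'calls')
```

With the real service, `lookup` could be a small function that wraps the geocoder call and returns its `latlng` value.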

Unfortunately the geocoder was not working for me, so I had to use the CSV file instead. We read it into a dataframe and merge it with the existing dataframe on the postal code.

In [9]:
csv_path = 'https://cocl.us/Geospatial_data'
df_coordinates = pd.read_csv(csv_path)

df = pd.merge(df, df_coordinates, left_on='PostalCode', right_on='Postal Code', how='left').drop('Postal Code', axis=1)
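Because this is a left merge, a postal code without a match in the CSV file would silently get empty coordinates. One way to check for that is the `indicator=True` option of `pd.merge`. The sketch below uses made-up toy data (including the postal code `M9Z` and the coordinate values), so it only illustrates the technique.

```python
import pandas as pd

# Toy inputs; 'M9Z' is deliberately missing from the coordinates table.
codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z']})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.75, 43.72],
    'Longitude': [-79.33, -79.31],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(codes, coords, left_on='PostalCode', right_on='Postal Code',
                  how='left', indicator=True)

# Rows marked 'left_only' found no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty list on the real data would mean every postal code received coordinates.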

Let's print the whole dataframe to ensure there are no missing values.

In [10]:
pd.set_option('display.max_rows', len(df))
print(df)
pd.reset_option('display.max_rows')

    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
5          M9A               Etobicoke   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B              North York   
11         M9B               Etobicoke   
12         M1C             Scarborough   
13         M3C              North York   
14         M4C               East York   
15         M5C        Downtown Toronto   
16         M6C                    York   
17         M9C               Etobicoke   
18         M1E             Scarborough   
19         M4E            East Toronto   
20         M5E        Downtown Toronto   
21         M6E                    York   
22         M1G             Scarbor

Lastly, let's print the number of columns of the dataframe.

In [11]:
print('number of columns of dataframed: ',df.shape[1])

number of columns of dataframed:  5


### Part.3:

First we import folium, matplotlib and k-means for the clustering stage.

In [12]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

In [13]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [14]:
from sklearn.cluster import KMeans

Next we drop all rows in the dataframe which do not contain "Toronto" in the "Borough" column and reset the index.

In [26]:
index_names = df[~df['Borough'].str.contains('Toronto')].index
df = df.drop(index_names)
df = df.reset_index(drop=True)
print(df)

   PostalCode                 Borough  \
0         M5A        Downtown Toronto   
1         M5B        Downtown Toronto   
2         M5C        Downtown Toronto   
3         M4E            East Toronto   
4         M5E        Downtown Toronto   
5         M5G        Downtown Toronto   
6         M6G        Downtown Toronto   
7         M5H        Downtown Toronto   
8         M6H            West Toronto   
9         M4J  East York/East Toronto   
10        M5J        Downtown Toronto   
11        M6J            West Toronto   
12        M4K            East Toronto   
13        M5K        Downtown Toronto   
14        M6K            West Toronto   
15        M4L            East Toronto   
16        M5L        Downtown Toronto   
17        M4M            East Toronto   
18        M4N         Central Toronto   
19        M5N         Central Toronto   
20        M4P         Central Toronto   
21        M5P         Central Toronto   
22        M6P            West Toronto   
23        M4R   
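The same filter can also be written as a single boolean mask instead of collecting the index and dropping it. This is a small sketch on a made-up toy dataframe; the result should be equivalent to the drop-based version above.

```python
import pandas as pd

# Toy frame with a mix of boroughs.
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M4J', 'M1B'],
    'Borough': ['North York', 'Downtown Toronto',
                'East York/East Toronto', 'Scarborough'],
})

# Keep only rows whose borough name contains "Toronto", then renumber.
toronto_only = toy[toy['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_only)
```

Note that a combined label such as "East York/East Toronto" is kept as well, because it contains the word "Toronto".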

Create a map of Toronto with neighborhoods superimposed on top.

In [30]:
# create map of Toronto using latitude and longitude values
latitude = df['Latitude'][0]
longitude = df['Longitude'][0]
map_toronto = folium.Map(location=[latitude,longitude],zoom_start=11)

# add markers to map
for lat,lng,label in zip(df['Latitude'],df['Longitude'],df['Neighborhood']):
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now we get the venue list for Toronto. Unfortunately Foursquare was not working for me, so I had to use the workaround provided by the staff members in the discussion forum.

In [34]:
nearby_venues1 = pd.read_json("https://raw.githubusercontent.com/ibm-developer-skills-network/yczvh-DataFilesForIBMProjects/master/segmenting_neighborhoods.json")    
nearby_venues1.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                 'Venue', 
                 'Venue Latitude', 
                 'Venue Longitude', 
                 'Venue Category']
toronto_venues = nearby_venues1
print(toronto_venues)

                                           Neighborhood  \
0                                        Malvern, Rouge   
1                Rouge Hill, Port Union, Highland Creek   
2                     Guildwood, Morningside, West Hill   
3                     Guildwood, Morningside, West Hill   
4                     Guildwood, Morningside, West Hill   
...                                                 ...   
1332  South Steeles, Silverstone, Humbergate, Jamest...   
1333  Clairville, Humberwood, Woodbine Downs, West H...   
1334  Clairville, Humberwood, Woodbine Downs, West H...   
1335  Clairville, Humberwood, Woodbine Downs, West H...   
1336  Clairville, Humberwood, Woodbine Downs, West H...   

      Neighborhood Latitude  Neighborhood Longitude                   Venue  \
0                 43.806686              -79.194353                 Wendy’s   
1                 43.784535              -79.160497   Royal Canadian Legion   
2                 43.763573              -79.188711   

Now we analyse the neighborhoods in Toronto. To do so, we first one-hot encode the venue categories in the "toronto_venues" dataframe.

In [35]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
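To see what the one-hot encoding does on a small scale, here is a toy sketch with three made-up venues. Each venue category becomes its own column, and every row has a 1 in exactly one of them. The `dtype=int` argument is my addition so that the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Three made-up venues in two made-up neighborhoods.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Café', 'Park'],
})

# One column per category, without any prefix in the column names.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='',
                        dtype=int)

# Put the neighborhood in front so each row can be traced back.
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(onehot)
```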


Next we group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [38]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
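The mean of a 0/1 column within a group is simply the share of venues in that neighborhood that belong to the category. A small sketch with made-up numbers:

```python
import pandas as pd

# Toy one-hot table: neighborhood A has one café and one park, B has one park.
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Café': [0, 1, 0],
    'Park': [1, 0, 1],
})

# Mean per neighborhood = fraction of its venues in each category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Here neighborhood A gets 0.5 for both categories, while B gets 0.0 for "Café" and 1.0 for "Park".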

Next we print each neighborhood along with the top 5 most common venues.

In [37]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4        Monument / Landmark  0.00


----Alderwood, Long Branch----
          venue  freq
0   Pizza Place  0.25
1      Pharmacy  0.12
2   Coffee Shop  0.12
3  Skating Rink  0.12
4           Pub  0.12


----Bathurst Manor, Wilson Heights, Downsview North----
                       venue  freq
0                       Bank  0.09
1                Coffee Shop  0.09
2                Pizza Place  0.04
3                 Restaurant  0.04
4  Middle Eastern Restaurant  0.04


----Bayview Village----
                 venue  freq
0                 Bank  0.25
1                 Café  0.25
2   Chinese Restaurant  0.25
3  Japanese Restaurant  0.25
4          Yoga Studio  0.00


----Bedford Park, Lawrence Manor East----
                     venue  freq
0       Italian Restaurant  0.09
1              Coffee Shop  0

Now we put that into a pandas dataframe.

In [39]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
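As a quick usage check, the function can be tried on a single hand-made row. The function is repeated here so the block runs on its own; the row values are made up and chosen without ties, so the order of the result is unambiguous.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and sort the frequencies.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up row in the same layout as one row of toronto_grouped.
row = pd.Series({'Neighborhood': 'A', 'Café': 0.2, 'Park': 0.5, 'Pub': 0.3})
top_two = list(return_most_common_venues(row, 2))
print(top_two)
```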

In [51]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Women's Store,Deli / Bodega,Drugstore,Donut Shop,Dog Run,Distribution Center
1,"Alderwood, Long Branch",Pizza Place,Skating Rink,Pharmacy,Pub,Sandwich Place,Coffee Shop,Gym,Gas Station,Coworking Space,Diner
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Diner,Bridal Shop,Supermarket,Restaurant,Sushi Restaurant,Ice Cream Shop,Middle Eastern Restaurant,Mobile Phone Shop
3,Bayview Village,Bank,Chinese Restaurant,Japanese Restaurant,Café,Women's Store,Deli / Bodega,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Greek Restaurant,Thai Restaurant,Liquor Store,Juice Bar,Indian Restaurant,Restaurant,Sushi Restaurant
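The column names rely on a list of three suffixes plus a fallback to "th", which is enough for ten columns. If more columns were ever needed, a small helper could build the ordinals directly. The function name `ordinal` is my own choice, and this is only a sketch of one possible approach.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```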


Now we run k-means to cluster the neighborhoods into 4 clusters.

In [52]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
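The first ten labels are all the same, so it can be useful to look at how many rows ended up in each cluster. The sketch below shows one way to do that with `np.bincount` on a tiny made-up dataset. The `n_init=10` argument is my addition to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three points close together and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Number of rows assigned to each cluster label.
sizes = np.bincount(kmeans.labels_, minlength=2)
print(sizes)
```

A very uneven split, as in this toy case, suggests that one cluster holds most of the data and the others only a few outliers.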

Then we create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [53]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# merge dataframes to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'),on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Breakfast Spot,Café,Greek Restaurant,Gym / Fitness Center,Pub,Performing Arts Venue,Mexican Restaurant
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Café,Theater,Clothing Store,Sporting Goods Shop,Hotel,Fast Food Restaurant,Steakhouse,Bakery,Ramen Restaurant,Music Venue
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Gastropub,Café,Farmers Market,Coffee Shop,Thai Restaurant,Diner,Jazz Club,Japanese Restaurant,Italian Restaurant,Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Pub,Trail,Health Food Store,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Women's Store,Cupcake Shop
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Cocktail Bar,Coffee Shop,Beer Bar,Farmers Market,Seafood Restaurant,Café,Breakfast Spot,Liquor Store,Bistro,Comfort Food Restaurant
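One thing to watch for with this join: a neighborhood that has no entry in the venue table would get a missing cluster label, and the label column would then turn into floats. The sketch below, on made-up data, shows how such rows could be dropped and the labels turned back into integers. Whether this is needed depends on the data, so treat it as an optional safeguard.

```python
import pandas as pd

# Toy data: neighborhood 'B' has no cluster label.
left = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [1, 3]})

merged = left.join(labels.set_index('Neighborhood'), on='Neighborhood')

# Drop rows without a label and restore the integer type.
cleaned = merged.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cleaned)
```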


Finally, we visualize the resulting clusters.

In [54]:
# create map
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat,lon,poi,cluster in zip(toronto_merged['Latitude'],toronto_merged['Longitude'],toronto_merged['Neighborhood'],toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi)+' Cluster '+str(cluster),parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters
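The color scheme boils down to picking one color per cluster from the rainbow colormap and converting it to a hex string that folium understands. A stripped-down sketch of just that step, without the map:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# Sample the colormap at evenly spaced positions, one per cluster.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA tuple into a '#rrggbb' string.
rainbow = [colors.rgb2hex(c) for c in colors_array]
print(rainbow)
```

Each cluster label can then be used as an index into this list to pick the marker color.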

Lastly we examine each cluster and determine the discriminating venue categories that distinguish each cluster.

In [47]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,toronto_merged.columns[[1]+list(range(5,toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,toronto_merged.columns[[1]+list(range(5,toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1,Coffee Shop,Park,Bakery,Breakfast Spot,Café,Greek Restaurant,Gym / Fitness Center,Pub,Performing Arts Venue,Mexican Restaurant
1,Downtown Toronto,1,Café,Theater,Clothing Store,Sporting Goods Shop,Hotel,Fast Food Restaurant,Steakhouse,Bakery,Ramen Restaurant,Music Venue
2,Downtown Toronto,1,Gastropub,Café,Farmers Market,Coffee Shop,Thai Restaurant,Diner,Jazz Club,Japanese Restaurant,Italian Restaurant,Restaurant
3,East Toronto,1,Pub,Trail,Health Food Store,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Women's Store,Cupcake Shop
4,Downtown Toronto,1,Cocktail Bar,Coffee Shop,Beer Bar,Farmers Market,Seafood Restaurant,Café,Breakfast Spot,Liquor Store,Bistro,Comfort Food Restaurant
5,Downtown Toronto,1,Coffee Shop,Italian Restaurant,Café,Yoga Studio,Thai Restaurant,Department Store,Sandwich Place,Spa,Japanese Restaurant,Bubble Tea Shop
6,Downtown Toronto,1,Grocery Store,Café,Park,Baby Store,Candy Store,Coffee Shop,Italian Restaurant,Nightclub,Restaurant,Deli / Bodega
7,Downtown Toronto,1,Coffee Shop,Café,Seafood Restaurant,Thai Restaurant,Smoke Shop,Lounge,Bakery,Steakhouse,Hotel,Fast Food Restaurant
8,West Toronto,1,Pharmacy,Bakery,Park,Pool,Brewery,Bar,Bank,Supermarket,Café,Middle Eastern Restaurant
9,East York/East Toronto,1,Pizza Place,Park,Convenience Store,Intersection,Dance Studio,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run,Distribution Center


In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,toronto_merged.columns[[1]+list(range(5,toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,toronto_merged.columns[[1]+list(range(5,toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,3,Bus Line,Park,Swim School,Women's Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store
21,Central Toronto,3,Park,Sushi Restaurant,Jewelry Store,Trail,Women's Store,Diner,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
29,Central Toronto,3,Lawyer,Park,Trail,Summer Camp,Drugstore,Donut Shop,Dog Run,Distribution Center,Curling Ice,Eastern European Restaurant
33,Downtown Toronto,3,Park,Playground,Trail,Women's Store,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
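Two of the four tables above are empty, so a quick count of rows per cluster gives a compact overview before looking at the individual tables. The sketch below uses a made-up list of labels; on the real data the series would be the "Cluster Labels" column of `toronto_merged`.

```python
import pandas as pd

kclusters = 4

# Made-up labels standing in for toronto_merged['Cluster Labels'].
labels = pd.Series([1, 1, 1, 3, 1, 3], name='Cluster Labels')

# Count rows per cluster and show clusters without rows as 0.
counts = labels.value_counts().reindex(range(kclusters), fill_value=0)
print(counts)
```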
