# Segmenting and Clustering Neighborhoods in Toronto

# Introduction

This peer-graded assignment explores and clusters the neighborhoods of Toronto. We scrape the neighborhood data from a Wikipedia page, then use the explore function of the Foursquare API to find the most common venue categories in each neighborhood. Those category frequencies are the features we use to group the neighborhoods into clusters with k-means. Finally, we use the Folium library to visualize the clusters on a map of Toronto.


#### Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
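# transform JSON file into a pandas dataframe
try:
    from pandas import json_normalize # top-level import, available in pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions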

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

# install the scraping dependencies in one step
!pip install bs4 lxml html5lib requests
from bs4 import BeautifulSoup
import csv

print('Libraries imported.')


## Start of Part 1

#### We will scrape the following Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to get information about the Toronto neighborhoods, and will use BeautifulSoup to parse through the information.

In [None]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
#print(soup.prettify())

#### Extract the Postal Code, Borough, and Neighborhood information.

In [None]:
match = soup.find('table',class_='wikitable sortable')
pclist = []
for pcall in match.tbody.find_all('td'):
    postcode = pcall.text
    pclist.append(postcode)

#### Start by putting it into a dataframe.

In [None]:
# strip surrounding whitespace (including trailing newlines) from each cell
newpclist = [s.strip() for s in pclist]
pcpd = pd.DataFrame(newpclist)
pcpd.head(7)

#### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood, so we shape the dataframe into 3 columns instead of 1.
#### Only process the cells that have an assigned borough. Ignore cells with a borough that is 'Not assigned'. To do this, we remove all rows where Borough is equal to 'Not assigned'.

In [None]:
pcdf = pd.DataFrame({'PostalCode':newpclist[0::3],'Borough':newpclist[1::3],'Neighborhood':newpclist[2::3]}, columns=['PostalCode','Borough','Neighborhood'])
# Alternative way to drop rows where Borough == 'Not assigned'
#  pcdf = pcdf.set_index('Borough')
#  pcdf.drop(['Not assigned'], axis = 0,inplace=True)

# keep only the rows that have an assigned borough
pcdf = pcdf[pcdf['Borough'] != 'Not assigned']
print(pcdf.shape)
pcdf.head(10)
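The reshaping and filtering step can be illustrated on a toy list. This is a small sketch with made-up cell values, not the scraped data. It shows how the stride-3 slices split one flat list into three columns, and how a boolean mask drops the unassigned boroughs.

```python
import pandas as pd

# Toy stand-in for the flat list of scraped cell texts (hypothetical values).
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods',
         'M5A', 'Downtown Toronto', 'Harbourfront']

# Every third item, starting at offsets 0, 1 and 2, forms one column.
toy_df = pd.DataFrame({'PostalCode': cells[0::3],
                       'Borough': cells[1::3],
                       'Neighborhood': cells[2::3]})

# Keep only the rows whose borough is assigned.
toy_df = toy_df[toy_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(toy_df)
```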

#### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [None]:
pcdf2 = pcdf.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
print(pcdf2.shape)
pcdf2.head()
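To make the combining step concrete, here is a small sketch on a toy dataframe. The rows are illustrative only and mirror the M5A example described above.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group on the postal code and borough, then join the neighborhood names.
combined = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
               .apply(', '.join)
               .reset_index())
print(combined)
```

The two M5A rows collapse into one row whose Neighborhood value is the comma-separated list, while M3A is unchanged.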

#### If a cell has a Borough but a 'Not assigned' Neighborhood, then the Neighborhood will be the same as the Borough. E.g. for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [None]:
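# where the neighborhood is 'Not assigned', copy the borough name into it
not_assigned = pcdf2['Neighborhood'] == 'Not assigned'
pcdf2.loc[not_assigned, 'Neighborhood'] = pcdf2.loc[not_assigned, 'Borough']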
pcdf2.head()

In [None]:
print(pcdf2.shape)

## End of Part 1
## Start of Part 2
#### Now that we have built a dataframe of each Postal Code along with the Borough name and Neighborhood names, we need to get the latitude and the longitude coordinates of each Postal Code in order to utilize the Foursquare location data.
#### The Geocoder Python package https://geocoder.readthedocs.io/index.html proved very unreliable, and I was not able to get the geographical coordinates of the Postal Codes with it. Instead, I used this csv file, which has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data.

In [None]:
latlong_df = pd.read_csv('http://cocl.us/Geospatial_data', header=0)
print(latlong_df.shape)
latlong_df.head()

#### Rename the postal code column so that it has the same name in both dataframes.

In [None]:
latlong_df.rename(columns = {"Postal Code": "PostalCode"}, inplace=True)
latlong_df.head()

#### Merge the two dataframes to create one combined dataset.

In [None]:
toronto_data = pd.merge(pcdf2, latlong_df, on='PostalCode')
toronto_data.head(20)
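Because `pd.merge` performs an inner join by default, postal codes missing from the coordinates file would be dropped without notice. The sketch below shows one way to check for that. It uses toy dataframes with placeholder coordinates, and the `M9Z` code is invented for the example.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.80, 43.78],      # placeholder values
                      'Longitude': [-79.19, -79.16]})  # placeholder values

# A left join with indicator=True adds a '_merge' column that records
# whether each row found a match in the right-hand dataframe.
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

If the list is empty, the inner join used in the notebook loses no rows.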

## End of Part 2
## Start of Part 3

#### Use the geopy library to get the latitude and longitude values of Toronto.

#### In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent to_explorer.


In [None]:
#define Toronto's lat and long
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('latitude: ', latitude, 'longitude: ', longitude)

#### Validate that the dataset has all Boroughs and Neighborhoods.

In [None]:
# each row is one postal code area, which may hold several neighborhoods
print('The dataframe has {} boroughs and {} postal code areas.'.format(
        len(toronto_data['Borough'].unique()),
        toronto_data.shape[0]
    )
)

#### Create a map of Toronto with Boroughs and Neighborhoods superimposed on top.

In [None]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = 'Borough: {} \n Neighborhood: {}'.format(borough, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version.

In [None]:
# The code was removed by Watson Studio for sharing.

#### Let's explore the first Postal Code in our dataframe.

In [None]:
toronto_data.loc[0, 'PostalCode']

#### Get the Postal Code's latitude and longitude values.

In [None]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'PostalCode'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

#### Now, let's get the top 100 venues that are near Postal Code M1B within a radius of 500 meters.

In [None]:
radius = 500
LIMIT = 100
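# use the coordinates of the selected postal code, not those of the city center
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, neighborhood_latitude, neighborhood_longitude, radius, LIMIT)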
# url


#### Send the GET request and examine the results.

In [None]:
results = requests.get(url).json()
results
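As an alternative to formatting the query string by hand, the request URL can be assembled from a dictionary of parameters. This is a sketch only: the helper name `build_explore_url`, the credential strings, and the version value are placeholders, and no request is sent. The endpoint and parameter names are the ones used in the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters such as the comma in 'll'
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and coordinates, for illustration only.
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.8, -79.2)

# Parse the query string back to confirm what would be sent.
example_query = parse_qs(urlsplit(example_url).query)
print(example_query['ll'], example_query['radius'], example_query['limit'])
```

Keeping the parameters in a dictionary makes it easier to see which coordinates are being passed.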

#### All the information is in the items key. Before we proceed, let's create the get_category_type function.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows store the list under the 'venue.categories' key
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Now we are ready to clean the json and structure it into a pandas dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
print(venues)
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
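The flattening step can be tried without calling the API. The sketch below runs the same steps on a mock list of items whose shape is assumed from the keys accessed above (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`). The venue names and coordinates are invented.

```python
import pandas as pd

# Mock items, shaped like the entries the code above reads (assumed structure).
mock_items = [
    {'venue': {'name': 'Example Cafe',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.80, 'lng': -79.19}}},
    {'venue': {'name': 'Uncategorized Spot',
               'categories': [],
               'location': {'lat': 43.81, 'lng': -79.20}}},
]

flat = pd.json_normalize(mock_items)  # nested keys become dotted column names
wanted = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
flat = flat[wanted].copy()

# Keep the first category name, or None when the list is empty.
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# Drop the 'venue.' and 'venue.location.' prefixes from the column names.
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```

The second mock item shows why the empty-list check matters: a venue without categories yields a missing value instead of an error.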

#### Let's examine how many venues were returned.

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

#### Let's create a function to repeat the same process for all the neighborhoods in Toronto.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Now we call the above function on each neighborhood and create a new dataframe called toronto_venues.

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['PostalCode'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

#### Let's check the size of the resulting dataframe.

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

#### Let's check how many venues were returned for each Postal Code.

In [None]:
# display the venue counts per postal code (only the last expression in a cell is shown)
toronto_venues.groupby('PostalCode').count().head(10)

#### Let's find out how many unique categories can be curated from all the returned venues.

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

#### Let's analyze each Postal Code area.

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the postal code column back to the dataframe
toronto_onehot['PostalCode'] = toronto_venues['PostalCode']

# move the postal code column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

#### And let's examine the new dataframe size.

In [None]:
toronto_onehot.shape

#### Now, let's group the rows by Postal Code area and take the mean of the frequency of occurrence of each category.

In [None]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped.head()
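A toy example may help show what the grouped means represent. In this sketch (with made-up venues), each value in the result is the share of a postal code's venues that fall in a given category, so each row sums to 1.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C', 'M1C'],
    'Venue Category': ['Park', 'Cafe', 'Cafe', 'Cafe', 'Cafe', 'Park'],
})

# One indicator column per category, stored as floats.
onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'PostalCode', toy_venues['PostalCode'])

# The mean of the indicator columns is the category frequency per postal code.
freq = onehot.groupby('PostalCode').mean().reset_index()
print(freq)
```

Here M1C has three cafes out of four venues, so its Cafe frequency is 0.75.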

#### Let's confirm the new size.

In [None]:
toronto_grouped.shape

#### Let's print each Postal Code along with the top 5 most common venues.

In [None]:
num_top_venues = 5

for pc in toronto_grouped['PostalCode']:
    print("----"+pc+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == pc].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a pandas dataframe.

#### First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
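To see what the function returns, the sketch below repeats its definition and applies it to a single toy row. The frequencies are invented. The first element of the row is the postal code, which `row.iloc[1:]` skips.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the postal code in position 0
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: a postal code followed by category frequencies (made-up numbers).
toy_row = pd.Series({'PostalCode': 'M1B', 'Cafe': 0.2, 'Park': 0.5, 'Bakery': 0.3})
top_two = list(return_most_common_venues(toy_row, 2))
print(top_two)
```

When several categories share the same frequency, which is common for postal codes with few venues, their relative order in the result should not be read as meaningful.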

#### Now let's create the new dataframe and display the top 10 venues for each Postal Code area.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third all take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
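The column names above rely on an exception to switch from 'st', 'nd', 'rd' to 'th'. As an alternative sketch, a small helper (the name `ordinal` is hypothetical) can build the same labels explicitly, and it also handles ranks such as 11th and 21st that the list-based approach would not.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

example_columns = ['Postal Code'] + [
    '{} Most Common Venue'.format(ordinal(rank)) for rank in range(1, 11)]
print(example_columns)
```

For the top 10 used in this notebook, the resulting names are identical to the ones generated above.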

#### Run k-means to cluster the Postal Code areas into 5 clusters.


In [None]:
# set number of clusters
kclusters = 5

# drop the label column so that only the numeric category frequencies remain
toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
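The clustering call can be illustrated on a tiny frequency matrix. This is a sketch with invented numbers, two clusters, and an explicit `n_init`. It is not the Toronto data or the notebook's setting of five clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are postal codes, columns are category frequencies (hypothetical values).
toy_freq = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.2, 0.0],
                     [0.0, 0.1, 0.9],
                     [0.1, 0.0, 0.9]])

toy_model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freq)
toy_labels = toy_model.labels_

print('labels:', toy_labels)
print('inertia:', toy_model.inertia_)
```

The first two rows share one label and the last two share the other. Which group is called 0 and which is called 1 is arbitrary, so cluster numbers carry no meaning beyond identifying the group.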

#### Let's create a new dataframe that includes the cluster as well as the top 10 venues for each Postal Code area.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join the sorted venues (with cluster labels) onto toronto_data, which holds the latitude/longitude of each postal code
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='PostalCode', how='inner')

#toronto_merged["Cluster Labels"] = toronto_merged["Cluster Labels"].fillna(0.0).astype(int)
toronto_merged.head(20) 

#### Finally, let's visualize the resulting clusters.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, pc, br, nh, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup('Postal Code: ' + str(pc) + ' Borough: ' + br + ' Neighborhood(s): ' + nh + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
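The color assignment can be looked at separately from the map. The sketch below builds one hex color per cluster from the same `rainbow` colormap and looks colors up directly by label. The list of labels is invented for the example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 5

# One evenly spaced color per cluster, converted to hex strings for folium.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, n_clusters))]

# Hypothetical cluster labels for four postal codes.
example_labels = [0, 3, 4, 1]
marker_colors = [palette[label] for label in example_labels]
print(marker_colors)
```

The map code above indexes with `cluster-1`, so cluster 0 takes the last color in the list through Python's negative indexing. Each cluster still gets a distinct color either way.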

### Examine Clusters
#### Start by examining the number of unique venue categories at each rank of the top 10.

In [None]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

for ind in np.arange(num_top_venues):
    try:
        column = '{}{} Most Common Venue'.format(ind+1, indicators[ind])
    except IndexError:
        # ranks beyond the third all take the 'th' suffix
        column = '{}th Most Common Venue'.format(ind+1)
    print(column + ': ' + str(len(toronto_merged[column].unique().tolist())))

#### Examining Cluster 0, we see that Coffee Shops, Donut Shops, Cafes, and Bakeries are highlights.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + [1] + list(range(5, toronto_merged.shape[1]))]]

#### Examining Cluster 1, we see that it is the only cluster with Cafeteria as the most common venue.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Examining Cluster 2, we see that Garden is listed as the most common venue.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Examining Cluster 3, we see that it features more ethnic restaurants.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Examining Cluster 4, we see that Park is the first or second most common venue.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

## End of Part 3