# Scrape data from Wikipedia
# Finding the Postal Code, Borough and Neighborhood from the table
    
## In this program, I will explore and cluster the neighborhoods in Toronto

### Part 1 
        Scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown in the assignment

            The dataframe specification 
                a- Consists of three columns: PostalCode, Borough, and Neighborhood
                b- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
                c- More than one neighborhood can exist in one postal code area.
                d- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
                e- Clean this Notebook and add Markdown cells to explain your work and any assumptions you are making.
                f- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

### Part 2 

                To utilize the Foursquare location data, we need to get the latitude and longitude coordinates of each neighborhood.
                
                Assumption: I can use the provided CSV file to import the coordinates into my dataframe.
                Here is a link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
                
                
### Part 3
                I will explore and cluster the neighborhoods in Toronto. 
                Note: I will work with only boroughs that contain the word Toronto.


## Part 1

            Finding the Postal Code, Borough and Neighborhood from the table


In [1]:
# Import all libraries required to complete part 1
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

In [2]:
#Read the Wikipedia page as text and store it in a variable called source
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
#Parse the page source with the lxml parser
soup = BeautifulSoup(source, 'lxml')

After reviewing the parsed soup, all the data is in a table, with one record per table row.

In [4]:
#Find the table and all the tr's in the table
#Read the table from the soup into the table variable
table = soup.find('table')
#Read all the rows from the table into the table_rows variable
table_rows = table.find_all('tr')

In [5]:
#Create a list to collect all the rows
toronto_df = []

In [6]:
#Append the Wikipedia table data to the "toronto_df" list
for tr in table_rows:
    td = tr.find_all('td')
    row = [data.text for data in td]
    row = [content.replace('\n', '') for content in row]
    toronto_df.append(row)

In [7]:
#Sanity check 
toronto_df

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Downtown Toronto', "Queen's Park"],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 'Etobicoke', 'Martin Grove'],
 ['M9B'
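
The first element of the list above is empty because the header row holds only `th` cells. A minimal sketch of the same extraction that skips such rows up front, assuming a tiny hand-written table (the HTML string below is hypothetical, not the real Wikipedia page) and the built-in `html.parser` backend instead of `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml required
rows = []
for tr in soup.find("table").find_all("tr"):
    # get_text(strip=True) removes the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

With the empty row filtered here, the later `dropna` step would only be a safety net.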

In [8]:
#Create the dataframe from the list of rows
columns = ['Postcode', 'Borough', 'Neighborhood']
toronto_df = pd.DataFrame(toronto_df, columns=columns)
#Remove rows with missing values (the empty header row)
toronto_df.dropna(inplace=True)

In [9]:
#Sanity check 
toronto_df

Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


In [10]:
#Keep all rows except those whose Borough is Not assigned
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned']

In [11]:
#Sanity check 
toronto_df

Unnamed: 0,Postcode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
...,...,...,...
282,M8Z,Etobicoke,Kingsway Park South West
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West


In [13]:
#Replace "Not assigned" Neighborhood values with the Borough value
not_assigned = toronto_df['Neighborhood'] == 'Not assigned'
toronto_df.loc[not_assigned, 'Neighborhood'] = toronto_df.loc[not_assigned, 'Borough']

In [14]:
#Sanity Check 
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
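
The first five rows do not include a case where the replacement applies, so its effect is hard to see here. A small sketch on toy data, modelled on the M7A row from the scraped list, showing a single `.loc` assignment (pandas can warn on chained indexing of the form `df[col][mask] = ...`, and such writes may not reach the original frame):

```python
import pandas as pd

# Toy frame modelled on two rows of the scraped table
df = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Boolean mask of rows whose neighborhood is missing
mask = df["Neighborhood"] == "Not assigned"

# One .loc call selects rows and column together, so the write hits df itself
df.loc[mask, "Neighborhood"] = df.loc[mask, "Borough"]

print(df)
```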


In [15]:
#Merge neighborhoods that share the same postal code and borough
toronto_df = toronto_df.groupby(['Postcode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(set(x))).reset_index()

In [16]:
#Sanity Check 
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Morningside, West Hill, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
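
Joining through `set(x)` removes duplicates, but a set has no guaranteed order, so the order of names inside a cell can differ between runs. A possible variant, shown on toy data, that deduplicates with `dict.fromkeys` and keeps the order in which the names appeared in the table:

```python
import pandas as pd

# Toy rows with a repeated postal code
df = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighborhood": ["Rouge", "Malvern", "Parkwoods"],
})

merged = (
    df.groupby(["Postcode", "Borough"])["Neighborhood"]
      # dict.fromkeys drops duplicates while preserving first-seen order
      .agg(lambda names: ", ".join(dict.fromkeys(names)))
      .reset_index()
)

print(merged)
```

A stable order matters later, because the combined neighborhood string is used as a join key in part 3.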


In [17]:
#Use shape to get the number of rows and columns
toronto_df.shape

(103, 3)

# End of part 1 of the assignment

### ================================

### Part 2 

# ADDING LATITUDE AND LONGITUDE

In [18]:
#Read the CSV file containing the latitude and longitude of each postal code
LL_df = pd.read_csv('http://cocl.us/Geospatial_data')

In [19]:
#Sanity Check 
LL_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Comparing the column names of the two dataframes: only the postal code column is named differently, so to merge them I'll rename it in toronto_df to match LL_df

In [20]:
# toronto_df column names
toronto_df.columns

Index(['Postcode', 'Borough', 'Neighborhood'], dtype='object')

In [21]:
# LL_df column names
LL_df.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [22]:
#Rename the postal code column
toronto_df.rename(columns={'Postcode':'Postal Code'}, inplace=True)

In [23]:
#sanity check 
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Morningside, West Hill, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### Merging toronto_df and LL_df to have all information in one dataframe

In [24]:
#Merging 
toronto_ll_df = pd.merge(toronto_df, LL_df, on='Postal Code', how='outer')

In [25]:
#sanity check 
toronto_ll_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Morningside, West Hill, Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [26]:
#Identify the number of rows and columns in the final dataframe
toronto_ll_df.shape

(103, 5)
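
The row count staying at 103 suggests every postal code found a match, but the shape alone does not prove it. A hedged sketch of an explicit check using the `validate` and `indicator` options of `pd.merge`, run on toy frames (the `M9Z` row is a made-up code added to force a mismatch):

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code repeats on either side;
# indicator adds a _merge column saying where each row came from
checked = pd.merge(neighborhoods, coords, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that did not find coordinates
unmatched = checked.loc[checked["_merge"] != "both", "Postal Code"].tolist()
print(unmatched)
```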

# End of part 2 of the assignment

### ==============================

### Part 3 
### Segmenting and Clustering Neighborhoods in Toronto

In [27]:
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from pandas.io.json import json_normalize

print('Libraries imported.')

Libraries imported.


In [28]:
# Discover Toronto latitude and longitude
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="to_explorer", timeout=3)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, ON, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, ON, Canada are 43.653963, -79.387207.


## Plotting map of Toronto with the Neighborhoods

In [29]:
#create the Toronto map 
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
#print Toronto map 
map_toronto

In [30]:
# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_ll_df['Latitude'], toronto_ll_df['Longitude'], toronto_ll_df['Borough'], toronto_ll_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
#Print map after adding markers 
map_toronto

## Using the Foursquare API to explore and segment the neighborhoods

In [31]:
#Foursquare ID
CLIENT_ID = '1DQJY4DOUFEPCGCI0403CLF24SKKCZ2R5RTMDEMZZ1OB2CCJ'
# Foursquare SECRET
CLIENT_SECRET = 'RXLBLRSD4YA0OGQTHMZKTOD2JYV4QU3N445MIJB1SP0N1MOC'
# Foursquare API version
VERSION = '20180605' 
#sanity check 
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 1DQJY4DOUFEPCGCI0403CLF24SKKCZ2R5RTMDEMZZ1OB2CCJ
CLIENT_SECRET:RXLBLRSD4YA0OGQTHMZKTOD2JYV4QU3N445MIJB1SP0N1MOC
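
Credentials written into a notebook travel with every copy of it. One common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined by Foursquare:

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret
```

In the notebook this would be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` after exporting the two variables in the shell that starts Jupyter.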


## Explore a single neighborhood in our dataframe.

In [32]:
location = 40

In [33]:
# identify the name in location 40 
toronto_ll_df.loc[location, 'Neighborhood']

'East Toronto'

## Get the latitude and longitude values for location 40, 'East Toronto'

In [34]:
neighborhood_latitude = toronto_ll_df.loc[location, 'Latitude'] # neighborhood latitude value

neighborhood_longitude = toronto_ll_df.loc[location, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_ll_df.loc[location, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of East Toronto are 43.685347, -79.3381065.


## Get the top 100 venues in East Toronto within a radius of 500 meters

In [35]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

In [36]:
#create the GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=1DQJY4DOUFEPCGCI0403CLF24SKKCZ2R5RTMDEMZZ1OB2CCJ&client_secret=RXLBLRSD4YA0OGQTHMZKTOD2JYV4QU3N445MIJB1SP0N1MOC&v=20180605&ll=43.685347,-79.3381065&radius=500&limit=100'
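
The same query string can be assembled from a dictionary, which keeps each parameter next to its name and handles escaping. A sketch using only the standard library, with placeholder credentials (`YOUR_ID` and `YOUR_SECRET` are not real keys) and a hypothetical helper name `build_explore_url`:

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL used above from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes values (the comma in ll becomes %2C)
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605",
                        43.685347, -79.3381065, 500, 100)

# Decode the query string again to confirm the parameters survived intact
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"], query["limit"])
```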

In [37]:
# Send the GET request, parse the JSON response, and sanity check
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5dfb1a1b47b43d0023e74cb6'},
 'response': {'headerLocation': 'Greektown',
  'headerFullLocation': 'Greektown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.6898470045, 'lng': -79.33189528390383},
   'sw': {'lat': 43.6808469955, 'lng': -79.34431771609616}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c059c6691d776b020b3f7f9',
       'name': 'Aldwych Park',
       'location': {'address': '134 Aldwych Ave.',
        'crossStreet': 'btwn Dewhurst Blvd & Donlands Ave.',
        'lat': 43.68490095564762,
        'lng': -79.34109075059628,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.68490095564762,
          'lng': -79.34109075059628}],
       

## Create a function that extracts the category of the venue

In [38]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [39]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Aldwych Park,Park,43.684901,-79.341091
1,The Path,Park,43.683923,-79.335007
2,Sammon Convenience,Convenience Store,43.686951,-79.335007
3,Donlands Subway Station,Metro Station,43.68096,-79.337759


In [40]:
# print venues returned by Foursquare
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.
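
To make the flattening step easier to follow without calling the API, here is a sketch that runs the same idea on two hand-written items shaped like the response above. It uses `pd.json_normalize`, the top-level function available in pandas 1.0 and later; the second venue is hypothetical and has an empty category list:

```python
import pandas as pd

# Items shaped like results['response']['groups'][0]['items']
items = [
    {"venue": {"name": "Aldwych Park",
               "categories": [{"name": "Park"}],
               "location": {"lat": 43.684901, "lng": -79.341091}}},
    {"venue": {"name": "Uncategorised Spot",  # hypothetical venue
               "categories": [],
               "location": {"lat": 43.68, "lng": -79.34}}},
]

# Nested dictionaries become dotted column names such as venue.location.lat
flat = pd.json_normalize(items)

# Keep only the first category name, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Shorten the column names to the part after the last dot
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```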


## Define a function that repeats the venue extraction for all the neighborhoods

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
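
Inside the function, `v['venue']['categories'][0]` raises an `IndexError` if a venue comes back with an empty category list. A sketch of the row-building step as a separate helper that tolerates that case; `extract_venue_rows` is a hypothetical name, and the sample items are hand-written rather than real API output:

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare-style items into flat tuples for one neighborhood."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue has no category instead of failing
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


sample_items = [
    {"venue": {"name": "The Path", "categories": [{"name": "Park"}],
               "location": {"lat": 43.683923, "lng": -79.335007}}},
    {"venue": {"name": "No Category Venue", "categories": [],
               "location": {"lat": 43.68, "lng": -79.33}}},
]
rows = extract_venue_rows("East Toronto", 43.685347, -79.3381065, sample_items)
print(rows)
```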

## Calling the getNearbyVenues() function and storing the results in toronto_venues

In [42]:
toronto_venues = getNearbyVenues(names=toronto_ll_df['Neighborhood'],
                                   latitudes=toronto_ll_df['Latitude'],
                                   longitudes=toronto_ll_df['Longitude']
                                  )

Rouge, Malvern
Rouge Hill, Port Union, Highland Creek
Morningside, West Hill, Guildwood
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Oakridge, Golden Mile, Clairlea
Cliffcrest, Cliffside, Scarborough Village West
Cliffside West, Birch Cliff
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Sullivan, Clarks Corners, Tam O'Shanter
L'Amoreaux East, Agincourt North, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Downsview North, Bathurst Manor, Wilson Heights
Northwood Park, York University
Downsview East, CFB Toronto
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [43]:
#Sanity check and identify the size of the result 
print(toronto_venues.shape)
toronto_venues.head()

(2237, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,"Morningside, West Hill, Guildwood",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Morningside, West Hill, Guildwood",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


## Number of venues for each neighborhood

In [44]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",9,9,9,9,9,9
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Berczy Park,56,56,56,56,56,56
...,...,...,...,...,...,...
Willowdale South,36,36,36,36,36,36
Willowdale West,6,6,6,6,6,6
Woburn,4,4,4,4,4,4
Woodbine Heights,9,9,9,9,9,9


In [45]:
# Count the unique categories among all the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 274 unique categories.


## Analyse Each Neighborhood

In [46]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
#dimensions of the new dataframe
toronto_onehot.shape

(2237, 274)
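
The one-hot frame has 274 columns, the same as the number of unique categories, even though a Neighborhood column was added, and the preview shows Yoga Studio in first position. One possible explanation, which is an assumption and not verified against the data here, is that one venue category is itself named "Neighborhood", so the assignment overwrites that dummy column instead of appending a new one. A sketch on toy data of how such a clash could be detected and avoided:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Agincourt", "Agincourt", "Woburn"],
    # One category is literally called "Neighborhood" (toy data)
    "Venue Category": ["Lounge", "Neighborhood", "Park"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Detect the clash before adding the label column
collision = "Neighborhood" in onehot.columns

# Rename the clashing dummy column, then insert the label column first
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(collision, list(onehot.columns))
```

`DataFrame.insert` refuses to add a column whose name already exists, so a clash surfaces as an error rather than a silent overwrite.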

## Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category

In [48]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
#sanity check 
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
2,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
3,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.017857,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,Willowdale South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.027778,0.0,0.0,0.0,0.0
97,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
98,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
99,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.000000,0.0,0.111111,0.000000,0.0,0.0,0.0,0.0


In [49]:
# dimensions of the dataframe
toronto_grouped.shape

(101, 274)
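
Why the mean of one-hot rows is a frequency: each venue contributes a single 1 in its category column, so averaging over a neighborhood's venues gives the share of venues in each category, and the shares in a row add up to 1. A small sketch on toy data:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Agincourt"] * 4 + ["Woburn"] * 2,
    "Venue Category": ["Lounge", "Breakfast Spot", "Skating Rink", "Lounge",
                       "Park", "Park"],
})

# One row per venue, one column per category, 1.0 where the venue belongs
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)

# Mean per neighborhood = fraction of that neighborhood's venues per category
freq = onehot.groupby(venues["Neighborhood"]).mean()

print(freq)
```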

## Each neighborhood along with the top 5 most common venues

In [50]:
top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4              Metro Station  0.00


----Alderwood, Long Branch----
          venue  freq
0   Pizza Place  0.22
1   Coffee Shop  0.11
2  Skating Rink  0.11
3      Pharmacy  0.11
4           Gym  0.11


----Bayview Village----
                 venue  freq
0                 Café  0.25
1                 Bank  0.25
2   Chinese Restaurant  0.25
3  Japanese Restaurant  0.25
4  Monument / Landmark  0.00


----Bedford Park, Lawrence Manor East----
                  venue  freq
0    Italian Restaurant  0.12
1           Coffee Shop  0.08
2     Indian Restaurant  0.04
3  Fast Food Restaurant  0.04
4               Butcher  0.04


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1            Beer Bar  0.04
2         Cheese Shop  0.04
3  Seafood Restaurant  0.04
4        Cocktail Bar

# Populate the dataframe

## Define a function to sort the venues in descending order

In [53]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

## Creating a new dataframe and displaying the top 10 venues for each neighborhood

In [54]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Skating Rink,Lounge,Breakfast Spot,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Skating Rink,Pharmacy,Pool,Pub,Sandwich Place,Gym,Airport Terminal,Dessert Shop
2,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Women's Store
3,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Juice Bar,Restaurant,Spa,Fast Food Restaurant,Breakfast Spot,Sandwich Place,Liquor Store,Butcher
4,Berczy Park,Coffee Shop,Seafood Restaurant,Farmers Market,Beer Bar,Bakery,Steakhouse,Cocktail Bar,Café,Cheese Shop,Irish Pub
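
The column labels come from the three-item `indicators` list with a fallback to "th", which is correct for ten columns but would yield "21th" if `num_top_venues` went above 20. A sketch of a small helper that handles the general case; `ordinal` is a hypothetical name introduced for illustration:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```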


# Cluster Neighborhoods

## Run k-means to cluster the neighborhood into 5 clusters.

In [55]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       0, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4,
       0, 4, 4, 0, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 0, 4, 4, 0, 4, 4, 4,
       0, 3, 4, 2, 4, 4, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 0], dtype=int32)
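
Most labels in the array above are 4, so the clusters look very uneven. A quick way to quantify that is to count the members of each cluster. The sketch below uses a short made-up label array, not the real `kmeans.labels_`:

```python
import numpy as np

# Hypothetical labels for eight neighborhoods and five clusters
labels = np.array([4, 4, 0, 4, 1, 4, 0, 4])

# bincount returns how many times each integer label occurs;
# minlength keeps a slot for clusters that received no members
sizes = np.bincount(labels, minlength=5)

for cluster, size in enumerate(sizes):
    print("cluster {}: {} neighborhoods".format(cluster, size))
```

In the notebook the same call would be `np.bincount(kmeans.labels_, minlength=kclusters)`.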

## Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [56]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_ll_df

# join toronto_ll_df with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [57]:
#check the last columns!

toronto_merged.tail()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
98,M9N,York,Weston,43.706876,-79.518188,0.0,Park,Convenience Store,Women's Store,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
99,M9P,Etobicoke,Westmount,43.696319,-79.532242,4.0,Pizza Place,Intersection,Coffee Shop,Sandwich Place,Chinese Restaurant,Women's Store,Donut Shop,Discount Store,Dog Run,Doner Restaurant
100,M9R,Etobicoke,"Richview Gardens, Martin Grove Gardens, Kingsv...",43.688905,-79.554724,4.0,Park,Pizza Place,Bus Line,Mobile Phone Shop,Empanada Restaurant,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dessert Shop
101,M9V,Etobicoke,"Silverstone, Mount Olive, Albion Gardens, Beau...",43.739416,-79.588437,4.0,Fast Food Restaurant,Grocery Store,Pharmacy,Pizza Place,Beer Store,Fried Chicken Joint,Sandwich Place,Doner Restaurant,Dim Sum Restaurant,Diner
102,M9W,Etobicoke,Northwest,43.706748,-79.594054,4.0,Rental Car Location,Bar,Drugstore,Women's Store,Dim Sum Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


## Dropping neighborhoods for which no venue data is available and converting the cluster labels from float to int

In [58]:
toronto_merged = toronto_merged.dropna().copy()
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
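
The labels show up as floats (0.0, 4.0) after the join because rows without a match receive NaN, and NaN forces a float column. The sketch below reproduces this on toy data and shows two options: dropping the unmatched rows as done above, or keeping them with the nullable `Int64` dtype. The frames are illustrative, with one neighborhood that has no venue data:

```python
import pandas as pd

left = pd.DataFrame({"Neighborhood": ["Weston", "Upper Rouge"]})
right = pd.DataFrame({"Neighborhood": ["Weston"], "Cluster Labels": [0]})

# Upper Rouge has no match on the right, so its label becomes NaN
joined = left.join(right.set_index("Neighborhood"), on="Neighborhood")
print(joined["Cluster Labels"].dtype)

# Option 1: drop unmatched rows, then a plain int conversion is safe
kept = joined.dropna().copy()
kept["Cluster Labels"] = kept["Cluster Labels"].astype(int)

# Option 2: keep every row and store the labels as nullable integers
nullable = joined["Cluster Labels"].astype("Int64")
print(nullable)
```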

In [59]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Identify the cluster members 

## Cluster 1 members

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,0,Playground,Convenience Store,Women's Store,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
14,Scarborough,0,Park,Coffee Shop,Playground,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
23,North York,0,Park,Bank,Convenience Store,Bar,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
25,North York,0,Food & Drink Shop,Park,Women's Store,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
30,North York,0,Park,Airport,Women's Store,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
40,East York,0,Park,Metro Station,Convenience Store,Women's Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
44,Central Toronto,0,Park,Swim School,Bus Line,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant
50,Downtown Toronto,0,Park,Playground,Trail,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Event Space,Dumpling Restaurant,Department Store
74,York,0,Park,Women's Store,Market,Fast Food Restaurant,Gourmet Shop,Golf Course,Ethiopian Restaurant,Empanada Restaurant,Greek Restaurant,Electronics Store
79,North York,0,Park,Bakery,Construction & Landscaping,Women's Store,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant


## Cluster 2 members 

In [61]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
97,North York,1,Baseball Field,Women's Store,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Festival


## Cluster 3 members 

In [62]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,2,Fast Food Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Gym Pool


## Cluster 4 members 

In [63]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Central Toronto,3,Garden,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store,Dessert Shop


## Cluster 5 members 

In [64]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,4,Bar,History Museum,Women's Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
2,Scarborough,4,Intersection,Rental Car Location,Electronics Store,Spa,Breakfast Spot,Pizza Place,Mexican Restaurant,Medical Center,Women's Store,Discount Store
3,Scarborough,4,Coffee Shop,Korean Restaurant,Insurance Office,Women's Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,Scarborough,4,Hakka Restaurant,Bakery,Caribbean Restaurant,Athletics & Sports,Bank,Gas Station,Thai Restaurant,Fried Chicken Joint,Doner Restaurant,Dog Run
6,Scarborough,4,Department Store,Bus Station,Convenience Store,Coffee Shop,Hobby Shop,Chinese Restaurant,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop
...,...,...,...,...,...,...,...,...,...,...,...,...
96,North York,4,Pizza Place,Empanada Restaurant,Shopping Mall,Women's Store,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
99,Etobicoke,4,Pizza Place,Intersection,Coffee Shop,Sandwich Place,Chinese Restaurant,Women's Store,Donut Shop,Discount Store,Dog Run,Doner Restaurant
100,Etobicoke,4,Park,Pizza Place,Bus Line,Mobile Phone Shop,Empanada Restaurant,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dessert Shop
101,Etobicoke,4,Fast Food Restaurant,Grocery Store,Pharmacy,Pizza Place,Beer Store,Fried Chicken Joint,Sandwich Place,Doner Restaurant,Dim Sum Restaurant,Diner


# End of part 3 and end of the assignment