# Segmenting and Clustering Neighborhoods in Toronto, Canada

## Introduction

In this notebook I will first scrape Toronto neighborhood data from Wikipedia. From there I will convert addresses into their equivalent latitude and longitude values. I will also use the Foursquare API to explore all of the neighborhoods of Toronto. Using the __explore__ function, I will find the most common venue categories in each neighborhood and then use them as features to group the neighborhoods into clusters. The _k_-means clustering algorithm will be used to complete this task. Finally, I will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

### Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Web Scrape and Explore the Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

In [1]:
# import all of the necessary libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Webscraping libraries
from bs4 import BeautifulSoup as soup
import praw
import csv
import os
import sys
import pickle
import wikipedia

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Web Scrape and Explore the Dataset

### Set up a request (using requests) to the URL below. 
### Use BeautifulSoup to parse the page and extract all results.


In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
wiki = requests.get(url).text
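
The bare `requests.get` call above sets no timeout and does not check the HTTP status, so a failed request could silently hand an error page to the parser. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the notebook).

    Raises requests.HTTPError on a 4xx/5xx response and a
    requests.exceptions.Timeout if the server does not answer in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.text
```

With this helper, the cell above could be written as `wiki = fetch_html(url)`.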

In [4]:
page = soup(wiki, 'lxml') # store the parsed page under its own name so the imported `soup` alias is not overwritten

In [5]:
table = page.find('table', class_='wikitable sortable')

### Loop over the rows of the Wikipedia table to extract all of the neighborhood info.

In [6]:
lst = []
    
for row in table.find_all('tr'):
    col = row.find_all('td')
    if len(col) == 3:
        lst.append((col[0].text.strip(), 
                    col[1].text.strip(), 
                    col[2].text.strip()))
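
The row filter above can be tried offline on a small inline table. This is a minimal sketch, assuming a hypothetical three-row HTML snippet (the header labels are illustrative) and the built-in `html.parser` backend rather than `lxml`. It shows why `len(col) == 3` skips the header row: that row holds `<th>` cells, so `find_all('td')` returns an empty list for it.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table; the real page has many more rows.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

sample_page = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_page.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row contains only <th> cells, so it yields zero <td> and is skipped.
    if len(cells) == 3:
        rows.append(tuple(cell.text.strip() for cell in cells))

print(rows)
# [('M1A', 'Not assigned', 'Not assigned'), ('M3A', 'North York', 'Parkwoods')]
```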

### Create a pandas DataFrame out of the table that was scraped from Wikipedia

In [20]:
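# build the frame directly from the list of (zip, borough, neighborhood) tuples
df = pd.DataFrame(lst, columns=['Zip', 'Borough', 'Neighborhood'])
# remove stray ']' characters left in the scraped text; regex=False treats ']' literally
df['Neighborhood'] = df['Neighborhood'].str.replace(']', '', regex=False)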

In [21]:
df.head()

|  | Zip | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [22]:
# Replace 'Not assigned' with NaN so those rows can be dropped.
df.replace('Not assigned', np.nan, inplace=True)
df.head(10)

|  | Zip | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | NaN |
| 9 | M8A | NaN | NaN |


#### There are some Boroughs that have names but have NaN values for Neighborhood, for example Queen's Park. I'm going to replace the NaN with the same name as the Borough.

In [23]:
# Vectorized fill: wherever Neighborhood is NaN, copy the Borough name.
# Rows whose Borough is also NaN stay NaN and are dropped in a later step.
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])

In [24]:
# Check to see that it worked.
df.head(10)

|  | Zip | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 9 | M8A | NaN | NaN |


In [25]:
# Drop all the NaN Boroughs and Neighborhoods.
df.dropna(inplace=True)

In [26]:
df.head()

|  | Zip | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
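
The three cleaning steps (replace, fill, drop) can be checked end to end on a small frame. This is a minimal sketch, assuming a toy frame built from three rows that appear in the outputs above. The fill step uses `Series.fillna`, and the names `toy` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Three rows taken from the scraped table: one fully unassigned,
# one complete, and one with a Borough but no Neighborhood.
toy = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M3A", "North York", "Parkwoods"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["Zip", "Borough", "Neighborhood"],
)

cleaned = toy.replace("Not assigned", np.nan)  # step 1: mark missing values
cleaned["Neighborhood"] = cleaned["Neighborhood"].fillna(cleaned["Borough"])  # step 2: borrow the Borough name
cleaned = cleaned.dropna()  # step 3: drop rows that are still incomplete

print(cleaned)
```

Only the `M1A` row is dropped. The `M7A` row survives with `Queen's Park` in both columns, which matches the behavior seen on the full table.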


#### Group the Neighborhoods by Zip code and Borough

In [27]:
df = df.groupby(['Zip', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

In [29]:
df.head(15)

|  | Zip | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
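
To see what the `groupby(...).apply(', '.join)` step does, it helps to run it on a handful of rows. This is a minimal sketch, assuming a toy frame built from rows shown earlier in this notebook (the names `toy` and `grouped` are illustrative). Neighborhoods that share a postal code and borough collapse into one comma-separated string.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Zip": ["M3A", "M5A", "M5A", "M6A", "M6A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                    "North York", "North York"],
        "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park",
                         "Lawrence Heights", "Lawrence Manor"],
    }
)

# One output row per (Zip, Borough) pair; the neighborhood names are joined in row order.
grouped = (
    toy.groupby(["Zip", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)

print(grouped["Neighborhood"].tolist())
# ['Parkwoods', 'Harbourfront, Regent Park', 'Lawrence Heights, Lawrence Manor']
```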


In [30]:
df.shape

(103, 3)
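
Beyond checking the shape, it can be worth confirming that the grouped frame has no missing cells and no repeated (Zip, Borough) pairs. This is a minimal sketch, assuming a hypothetical helper named `check_grouped` that is not part of the original notebook:

```python
import pandas as pd


def check_grouped(frame):
    """Return True if no cell is missing and every (Zip, Borough) pair is unique."""
    no_missing = not frame.isna().any().any()
    no_duplicates = not frame.duplicated(subset=["Zip", "Borough"]).any()
    return bool(no_missing and no_duplicates)


# Two rows taken from the grouped output, plus a deliberately broken copy.
good = pd.DataFrame(
    {
        "Zip": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)  # repeats the M1B row

print(check_grouped(good), check_grouped(bad))
```

On the real data, `check_grouped(df)` could be run right after `df.shape`.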