# Segmenting and Clustering Neighborhoods in Toronto

The key techniques of this notebook are:
- Converting addresses into their equivalent latitude and longitude values.
- Using the Foursquare API to explore neighborhoods.
- Web scraping with BeautifulSoup.
- Exploring the most common venue categories in each neighborhood.
- Grouping the neighborhoods into clusters.
- Using the Folium library to visualize the neighborhoods and their emerging clusters.

-- TODO
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
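
The three cleaning rules above can also be expressed without an explicit loop. The following is only a minimal sketch under stated assumptions: the sample rows, the `clean_postal_table` helper, and the column names `PostalCode`, `Borough`, and `Neighborhood` are illustrative and are not taken from the scraped table, whose headers may differ.

```python
import pandas as pd

# Hypothetical sample rows mimicking the layout of the Wikipedia table
raw_rows = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)


def clean_postal_table(df):
    """Apply the three cleaning rules to a PostalCode/Borough/Neighborhood frame."""
    # Rule 1: drop rows whose borough is not assigned
    out = df[df["Borough"] != "Not assigned"].copy()

    # Rule 2: an unassigned neighborhood takes the name of its borough
    unassigned = out["Neighborhood"] == "Not assigned"
    out.loc[unassigned, "Neighborhood"] = out.loc[unassigned, "Borough"]

    # Rule 3: merge neighborhoods that share a postal code into one row
    out = (
        out.groupby(["PostalCode", "Borough"], sort=True)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )
    return out


cleaned = clean_postal_table(raw_rows)
print(cleaned)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result; this assumes each postal code belongs to a single borough.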

In [3]:
# Importing the dependencies

from bs4 import BeautifulSoup #library for web scraping
import requests  # library to handle requests
import json  # library to handle JSON files
import xml
import pandas as pd #Python library data manipulation and analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print("Imported Set 1 libraries")

Imported Set 1 libraries


In [4]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib.colors import rgb2hex
import matplotlib.pyplot as plt
%matplotlib notebook
!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if folium is already installed
import folium # map rendering library

# import k-means from clustering stage
from sklearn.cluster import KMeans
print("Imported Set 2 libraries")

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  51.63 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  34.57 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  38.66 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  48.70 MB/s
Imported Set 2 libraries


In [5]:
!pip install pracpred
import pracpred.scrape as pps
print("Imported Set 3 libraries")

Collecting pracpred
  Downloading https://files.pythonhosted.org/packages/65/97/567c57b82dae7afe07db5f2401aadb6ee339ab5178d54c4410890f97c788/pracpred-0.2.1-py3-none-any.whl
Requirement not upgraded as not directly required: numpy in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from pracpred)
Installing collected packages: pracpred
Successfully installed pracpred-0.2.1
Imported Set 3 libraries


In [6]:
#from mpl_toolkits.basemap import Basemap
from pathlib import Path
import warnings
print("Imported Set 4 libraries")

Imported Set 4 libraries


Using pracpred to parse HTML Table data

In [7]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
CAN_SUBURB_INFO = 'wikitable sortable'
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +
    'AppleWebKit/537.36 (KHTML, like Gecko) ' +
    'Chrome/61.0.3163.100 Safari/537.36'
)

REQUEST_HEADERS = {
    'user-agent': USER_AGENT,
}

tables = pps.HTMLTables(URL, table_class=CAN_SUBURB_INFO, headers=REQUEST_HEADERS)
raw = tables[0].to_df(repeat_span=True)

In [8]:
URLs = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(URLs,'lxml')
My_table = soup.find('table',{'class':'wikitable sortable'})

raw1 = pd.read_html(str(My_table))[0]


#print(df[0].to_json(orient='records'))
#raw1.head(5)

In [9]:
def cleanupData(raw):
    prevPostCode = ""
    prevNeighborhood = ""
    del_list = list()
    
    
    dataTemp = raw.copy()
    #print('Stage 0 Shape ', dataTemp.shape)
    dataTemp.columns = dataTemp.loc[0, :]
    

    #The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    dataTemp = dataTemp.drop(dataTemp.index[0])
    #print('Stage 1 Shape ', dataTemp.shape)
    
    #Ignore cells with a borough that is Not assigned.
    dataTemp = dataTemp[dataTemp.Borough != 'Not assigned']
    
    #Sort Data on Post Code and reindex
    dataTemp.sort_values("Postcode", inplace=True)
    dataTemp = dataTemp.reset_index(drop=True)
    #print('Stage 2 Shape ', dataTemp.shape)


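    for index, row in dataTemp.iterrows():
        # Read the values once; do not rely on `row` reflecting later writes to dataTemp
        postcode, borough, neighbourhood = row.iloc[0], row.iloc[1], row.iloc[2]

        if neighbourhood == "Not assigned":
            # Not assigned neighborhood, then the neighborhood will be the same as the borough
            neighbourhood = borough
            dataTemp.at[index, 'Neighbourhood'] = neighbourhood
        elif prevPostCode == postcode:
            # Same postcode as the previous row: merge the names and mark the previous row for deletion
            neighbourhood = prevNeighborhood + ", " + neighbourhood
            dataTemp.at[index, 'Neighbourhood'] = neighbourhood
            del_list.append(index - 1)

        prevPostCode = postcode
        prevNeighborhood = neighbourhood  # carry the merged value forward for runs of three or more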
    
    #print(del_list)
    # Delete Duplicate Post code rows and reindex
    dataTemp.drop(dataTemp.index[del_list], axis=0, inplace=True)
    dataTemp.sort_values("Postcode", inplace=True)
    dataTemp = dataTemp.reset_index(drop=True)
    
    #print('Stage 3 Shape ', dataTemp.shape)
    
    return dataTemp

In [10]:
Neighbourhoods = cleanupData(raw)
Neighbourhoods.to_csv('toronto_base.csv') #Save data for future use
Neighbourhoods


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Rouge Hill, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Cliffcrest, Scarborough Village West, Cliffside"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"


In [11]:
print('Shape of Data set: ', Neighbourhoods.shape)

Shape of Data set:  (103, 3)
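
Since `.shape` is a `(rows, columns)` tuple, the row count alone can be read from its first element. A minimal sketch, assuming a small stand-in frame (`demo` is hypothetical and only reuses the first two rows shown above):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned Toronto table
demo = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge, Malvern", "Port Union, Rouge Hill, Highland Creek"],
    }
)

# .shape is an attribute, not a method: it unpacks into (rows, columns)
n_rows, n_cols = demo.shape
print("Number of rows:", n_rows)
```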
