# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

This project scrapes the Wikipedia page listing the Canadian postal codes that begin with M, then cleans and preprocesses the data for clustering. The clustering is carried out with K-Means, and the clusters are plotted using the Folium library. The boroughs whose names contain 'Toronto' are first plotted, then clustered and plotted again.

## All three tasks (*web scraping*, *cleaning*, and *clustering*) are implemented in the same notebook for ease of evaluation.

In [None]:
!pip install beautifulsoup4
!pip install lxml
import random # library for random number generation

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforming JSON into a pandas DataFrame
from pandas import json_normalize # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

### Scraping the Wikipedia page for the table of postal codes of Canada

The BeautifulSoup library is used to scrape the table from the Wikipedia page. The page title is printed to confirm that the page was fetched and parsed successfully, and then the table of postal codes is displayed.

In [None]:
# Download the page and parse it with the lxml parser
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)

# soup.table is the first <table> element on the page; render it as raw HTML
tab = str(soup.table)
display_html(tab, raw=True)
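
To illustrate what the scraping step extracts, here is a minimal sketch that parses a small inline HTML string instead of the live page. The HTML snippet and its rows are illustrative stand-ins, and the built-in `html.parser` is used here as an assumption so the example runs without `lxml`; the notebook itself uses `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the rows are illustrative only.
html = """
<html><head><title>Example page</title></head><body>
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
</body></html>
"""

# Assumption: the standard-library parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # first <table> element, same as soup.table

# Collect the header cells and the data rows as plain text.
header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")  # skip the header row, which has no <td> cells
]

print(soup.title.string)
print(header)
print(rows)
```

Checking the title first, as the notebook does, is a cheap way to notice an error page before the table is parsed.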

### The HTML table is converted to a pandas DataFrame for cleaning and preprocessing

In [None]:
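from io import StringIO

# read_html returns one DataFrame per <table>; StringIO avoids the literal-HTML deprecation warning in newer pandas
dfs = pd.read_html(StringIO(tab))
df = dfs[0]
df.head()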

### Data preprocessing and cleaning

In [None]:
# Dropping the rows where Borough is 'Not assigned'
df1 = df[df.Borough != 'Not assigned']

# Combining the neighbourhoods that share the same Postcode and Borough
df2 = df1.groupby(['Postcode','Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

# Replacing 'Not assigned' neighbourhood names with the name of the Borough
df2['Neighbourhood'] = np.where(df2['Neighbourhood'] == 'Not assigned', df2['Borough'], df2['Neighbourhood'])

df2
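
The three cleaning steps can be checked on a small hand-made table. The sketch below assumes the same column names as the scraped table (`Postcode`, `Borough`, `Neighbourhood`); the rows in `toy` are illustrative, not taken from the live page.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; they mimic the layout of the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows whose Borough is 'Not assigned'.
kept = toy[toy["Borough"] != "Not assigned"]

# 2. Join the neighbourhoods that share a Postcode and Borough into one string.
#    sort=False keeps the groups in order of first appearance.
merged = (
    kept.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Where the Neighbourhood is still 'Not assigned', fall back to the Borough name.
merged["Neighbourhood"] = np.where(
    merged["Neighbourhood"] == "Not assigned",
    merged["Borough"],
    merged["Neighbourhood"],
)

print(merged)
```

Selecting the `Neighbourhood` column before aggregating makes it explicit which column is being joined, which is a small design choice in this sketch rather than something the notebook requires.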

In [None]:
# Shape of data frame
df2.shape

### Importing the CSV file containing the latitudes and longitudes for various neighbourhoods in Canada

In [None]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

### Merging the two tables to get the latitudes and longitudes for various neighbourhoods in Canada

In [None]:
lat_lon.rename(columns={'Postal Code':'Postcode'},inplace=True)
df3 = pd.merge(df2,lat_lon,on='Postcode')
df3.head()
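
Because `pd.merge` defaults to an inner join, any postal code without a match in the coordinates table is dropped silently. The sketch below shows one way to check for that with `how='left'` and `indicator=True`. The frames `neigh` and `coords`, the code `M9Z`, and the coordinate values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical neighbourhood table; 'M9Z' is a made-up code with no coordinates.
neigh = pd.DataFrame({
    "Postcode": ["M5A", "M7A", "M9Z"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Example Borough"],
})

# Hypothetical coordinates table; the numbers are placeholders, not real data.
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Latitude": [43.65, 43.66],
    "Longitude": [-79.36, -79.39],
})

# Align the key column name, as the notebook does.
coords = coords.rename(columns={"Postal Code": "Postcode"})

# Default inner join: rows without a match on either side are dropped.
inner = pd.merge(neigh, coords, on="Postcode")

# Left join with an indicator column to list codes that found no coordinates.
checked = pd.merge(neigh, coords, on="Postcode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()

print(len(inner), missing)
```

If `missing` is empty, the inner join used in the notebook loses no rows.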

### From here on, the notebook clusters and plots the neighbourhoods whose Borough contains 'Toronto'

**Getting all the rows of the data frame whose Borough contains 'Toronto'.**

In [None]:
df4 = df3[df3['Borough'].str.contains('Toronto',regex=False)]
df4
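
As a small illustration of the filter, the sketch below applies `str.contains` to a hand-made `Borough` column. The frame `toy` is illustrative. `regex=False` treats the pattern as a literal substring, and `na=False` is an added assumption that makes missing values count as non-matches; the merged table may not need it.

```python
import pandas as pd

# Illustrative boroughs, including one missing value.
toy = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York", "East Toronto", None],
})

# Literal substring match; missing values are treated as False.
mask = toy["Borough"].str.contains("Toronto", regex=False, na=False)

# Keep the matching rows and renumber them from zero.
toronto = toy[mask].reset_index(drop=True)

print(toronto)
```

Resetting the index is optional, but it gives the filtered frame a clean 0-based index for later steps.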