# Toronto Neighbourhood Analysis

## First Part

Scrape the postal code table from Wikipedia, load it into a pandas DataFrame, and clean the DataFrame as described in the instructions section.

Import all needed libraries.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Get the webpage from Wikipedia.

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(page.content, "html.parser")

Find the table containing the data and convert it to a pandas dataframe with the correct header.

In [3]:
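from io import StringIO

table = soup.find("table", class_="wikitable sortable")
# Newer pandas versions deprecate passing a raw HTML string, so wrap it in StringIO.
df_list = pd.read_html(StringIO(str(table)))
df = df_list[0]
# Promote the first row to the header, then drop that row from the data.
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df.head()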

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
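
As an alternative to `pd.read_html`, the table rows can also be pulled out by hand with BeautifulSoup. The sketch below is illustrative only: it runs on a small inline snippet (`sample_html`, made up to mirror the layout of the page) rather than on the live page, and the helper name `table_to_dataframe` is hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up snippet mirroring the layout of the Wikipedia table (assumption).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Hypothetical helper: turn the first 'wikitable' in html into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    # Collect the text of every header and data cell, row by row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    # The first row holds the column names; the rest is data.
    header, *body = rows
    return pd.DataFrame(body, columns=header)


sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

Building the frame from the header row directly avoids the separate step of promoting the first data row to column names.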


Remove all rows where the *Borough* column has the value *Not assigned*.

In [4]:
df = df[df["Borough"] != "Not assigned"]

Group the rows by postal code and join the corresponding *Neighbourhood* values with commas.
If a *Neighbourhood* value is *Not assigned*, replace it with the value of the *Borough* column.

In [5]:
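df = df.groupby('Postcode').agg({'Borough' : 'first', 'Neighbourhood' : ', '.join}).reset_index().reindex(columns=df.columns)
# Use a single .loc call; chained indexing may assign to a temporary copy.
not_assigned = df["Neighbourhood"] == "Not assigned"
df.loc[not_assigned, "Neighbourhood"] = df.loc[not_assigned, "Borough"]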
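
To make the effect of the grouping and the fallback easier to see, here is a minimal sketch on a toy DataFrame. The rows in `toy` are made up for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Toy rows (made up) with a repeated postal code and an unassigned neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# One row per postal code; neighbourhoods are joined with ", ".
grouped = (
    toy.groupby("Postcode", as_index=False)
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
)

# Fall back to the borough name where no neighbourhood was assigned.
mask = grouped["Neighbourhood"] == "Not assigned"
grouped.loc[mask, "Neighbourhood"] = grouped.loc[mask, "Borough"]
print(grouped)
```

Selecting rows and column in one `.loc` call writes to `grouped` itself, which is why it is preferred over chaining `["Neighbourhood"].loc[...]`.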

Rename the *Postcode* column to *PostalCode*.

In [6]:
df.rename(columns={"Postcode": "PostalCode"}, inplace=True)

Print the shape of the final DataFrame.

In [7]:
df.shape

(103, 3)

## Second Part

Load the geospatial data from the CSV file and rename the *Postal Code* column to *PostalCode* so that both DataFrames can be merged on it.

In [8]:
geo_data = pd.read_csv("https://cocl.us/Geospatial_data")
geo_data.rename(columns={"Postal Code": "PostalCode"}, inplace=True)

Merge both DataFrames to get a single one we can work with.

In [9]:
df = pd.merge(df, geo_data, on="PostalCode")
df.head()

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
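
The default inner merge silently drops postal codes that have no coordinates. A possible way to check for such rows is sketched below on toy frames; `codes` and `coords` are stand-ins, and the `M9Z` row is invented so that one code has no match.

```python
import pandas as pd

# Toy frames (made up) standing in for the postal code table and the geo CSV.
codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every postal code, indicator=True adds a "_merge" column,
# and validate raises MergeError if a postal code is duplicated on either side.
merged = pd.merge(codes, coords, on="PostalCode", how="left",
                  indicator=True, validate="one_to_one")

# Postal codes that found no coordinates in the geo data.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used above loses nothing.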


## Third Part

In [14]:
from sklearn.cluster import KMeans

In [17]:
# set number of clusters
kclusters = 5

Prepare DataFrame for clustering.

In [25]:
df_clustering = df["Neighbourhood"].to_frame()

Run k-means clustering.

In [26]:
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

ValueError: could not convert string to float: 'Northwest'
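
The error arises because `KMeans` expects numeric features, while `df_clustering` holds only the *Neighbourhood* strings. One possible way forward, assuming the goal is to group postal codes by location, is to cluster on the *Latitude* and *Longitude* columns instead. The sketch below uses the five rows shown in the merged table as a toy input; the choice of features and the cluster count of 2 are assumptions made for this small example, not the notebook's settings.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The five rows shown in the merged table above, used as a toy input.
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E", "M1G", "M1H"],
    "Latitude": [43.806686, 43.784535, 43.763573, 43.770992, 43.773136],
    "Longitude": [-79.194353, -79.160497, -79.188711, -79.216917, -79.239476],
})

# Assumption: cluster on location, so only numeric columns reach KMeans.
features = toy[["Latitude", "Longitude"]]

# Two clusters here because the toy frame has just five rows.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)

# Attach the cluster label of each row to the frame.
toy["Cluster"] = kmeans.labels_
print(toy)
```

Clustering on a text column such as *Neighbourhood* would first require encoding it numerically, which is a different design choice from the location-based grouping sketched here.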