# **Segmenting and Clustering Neighborhoods in Toronto**

## Let's first fetch the data that we need for this project from the Wikipedia page using the BeautifulSoup package

### First, we import the library we will use to connect to the Wikipedia page and fetch its contents:

In [1]:
# import the library we use to open URLs
import urllib.request

# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

### Let's use Beautiful Soup’s prettify function to inspect the parsed HTML right in our notebook

In [2]:
#print(soup.prettify())

### Let's find the table that we are looking for

In [3]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables=soup.find_all("table")
#all_tables

### Looking through the output of "all_tables", we can see that the class attribute of our chosen table is "wikitable sortable". We can use this to have BeautifulSoup return only this particular table and keep it in a variable called "right_table":

In [4]:
right_table=soup.find('table', class_='wikitable sortable')
#right_table

### Now we loop through the rows to collect the data from every row in the table.

In [5]:
A=[]
B=[]
C=[]

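for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    # keep only rows that have exactly three <td> cells
    if len(cells)==3:
        # take the first text node of each cell
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))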
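
The same row loop can be tried offline on a small table. The sketch below is only an illustration: the HTML string `sample_html` and the helper `extract_rows` are made up for this example and are not part of the Wikipedia page or of BeautifulSoup. It assumes the built-in `html.parser` backend and uses `get_text(strip=True)`, which removes surrounding whitespace while the text is extracted.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = (
    '<table class="wikitable sortable">'
    "<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>"
    "<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>"
    "</table>"
)


def extract_rows(table):
    """Return one tuple of cell texts per row that has exactly three <td> cells."""
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 3:
            # get_text(strip=True) drops the trailing newline of each cell
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")
sample_rows = extract_rows(sample_table)
print(sample_rows)
```

Collecting whole rows as tuples keeps the three values of a row together, which is one possible alternative to maintaining three parallel lists.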

### We’ll import pandas and create a dataframe with it, assigning each of the lists A-C to a column named after the corresponding column of our source table

In [6]:
import pandas as pd

df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neibourhood']=C

### Let's take a look at our dataframe

In [7]:
df.head()

Unnamed: 0,Postal Code,Borough,Neibourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


### Cleaning the data inside our dataframe

In [8]:
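# remove the newline characters; regex=False treats '\n' as a literal newline, not a pattern
df['Postal Code'] = df['Postal Code'].str.replace('\n', '', regex=False)
df['Borough'] = df['Borough'].str.replace('\n', '', regex=False)
df['Neibourhood'] = df['Neibourhood'].str.replace('\n', '', regex=False)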
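
A slightly more general variant, sketched on a toy frame: `str.strip()` removes leading and trailing whitespace of any kind, including a trailing newline, without needing a search pattern. The frame `toy` and its values are illustrative only and simply mimic the shape of the scraped data.

```python
import pandas as pd

# Toy data with the same trailing newlines as the scraped cells
toy = pd.DataFrame({
    "Postal Code": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
    "Neibourhood": ["Parkwoods\n", "Victoria Village\n"],
})

# Strip surrounding whitespace in every column
for column in toy.columns:
    toy[column] = toy[column].str.strip()

print(toy)
```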

In [9]:
df.head()

Unnamed: 0,Postal Code,Borough,Neibourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Drop the rows whose borough is Not assigned.

In [10]:
df = df[df.Borough != 'Not assigned']

In [11]:
df.reset_index(inplace=True, drop=True)
df.head(5)

Unnamed: 0,Postal Code,Borough,Neibourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


### Checking whether any neighbourhood is still Not assigned

In [12]:
df[df['Neibourhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neibourhood
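
The empty result above can also be turned into an explicit count. A minimal sketch on a toy frame, where `toy`, `assigned`, and `unassigned_count` are names introduced only for this illustration:

```python
import pandas as pd

# Toy data: one row without a borough, two assigned rows
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neibourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough and renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Count the remaining rows whose neighbourhood is still unassigned
unassigned_count = int((assigned["Neibourhood"] == "Not assigned").sum())
print(unassigned_count)
```

A count of zero means no further cleanup of that column is needed.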


In [13]:
df.shape

(103, 3)

### We tried several ways to get the coordinates without the CSV file, but none of them worked, so we will use the CSV file instead. Let's download it and read it

In [14]:
#Download the data
!wget -q -O 'Geospatial_Coordinates.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [15]:
#Read the data
df_coordinates = pd.read_csv('Geospatial_Coordinates.csv')
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Let's merge our dataframe df with the new dataframe df_coordinates to attach the coordinates to each postal code

In [16]:
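# join on the shared 'Postal Code' column (named explicitly rather than inferred)
df_data_with_coordinates = pd.merge(df, df_coordinates, on='Postal Code')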
df_data_with_coordinates.head()

Unnamed: 0,Postal Code,Borough,Neibourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [17]:
df_data_with_coordinates.shape

(103, 5)
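
The row count stays at 103 after the merge, so every postal code found a match. Because `pd.merge` performs an inner join by default, an unmatched code would otherwise be dropped silently. The sketch below shows one way to make such cases visible, using a left join with `indicator=True` and `validate="one_to_one"`. The frames `codes` and `coords` are toy data, and the postal code `M9Z` is a made-up value that deliberately has no coordinates.

```python
import pandas as pd

# Toy frames; "M9Z" is hypothetical and has no matching coordinates
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# The left join keeps every postal code; validate raises if a code is duplicated
checked = pd.merge(
    codes, coords, on="Postal Code", how="left",
    validate="one_to_one", indicator=True,
)

# Rows marked "left_only" are postal codes without coordinates
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```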