# <center>Toronto Neighborhood Clustering - Part 1</center>

### In this project, we gather data on the neighborhoods of Toronto and use that data to cluster the neighborhoods in a way that will be useful to solve certain problems.

### In Part 1, we scrape the Toronto Neighborhood data that we need for this project using BeautifulSoup and pass it into a DataFrame.

In [1]:
#Import the libraries that will be needed for making the DataFrame
import pandas as pd
import requests
from bs4 import BeautifulSoup

First, we download the webpage with the requests library.  The page that contains the Toronto neighborhood data we need is https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.  We save the response object as 'html_data'; a status of 200 means the request succeeded.

In [2]:
html_data = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
html_data

<Response [200]>

Next, we parse the webpage using BeautifulSoup.

In [3]:
#Create the BeautifulSoup object for parsing the webpage
soup = BeautifulSoup(html_data.content, 'html.parser')
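
To make the parsing step concrete, here is a minimal sketch that runs the same `find` calls on a hand-written table cell. The markup in `sample_html` is an assumption made for illustration (a bold postal code followed by a span); it is not copied from the live page, and the names `sample_html`, `sample_soup`, and `details` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup, assumed for illustration only.
sample_html = (
    "<table><tbody><tr>"
    "<td><p><b>M5A</b><br/>"
    "<span>Downtown Toronto(Regent Park / Harbourfront)</span></p></td>"
    "</tr></tbody></table>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")
cell = sample_soup.find("td")

postal_code = cell.find("b").get_text()   # text inside the bold tag
details = cell.find("span").get_text()    # borough and neighborhoods, not yet cleaned

print(postal_code)
print(details)
```

The span text still holds the borough and the neighborhoods in one string, which is why a separate clean-up step is needed.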

Now, we scrape the data from the webpage and pass it into a dataframe.  We assume the following about each cell of the table, although a few cells break these assumptions:

1.  The Postal Code is listed at the top of the cell in bold
2.  The Borough is listed below the Postal Code
3.  The Neighborhoods are all listed in parentheses '()'
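
Taken together, these assumptions mean the text below the postal code looks roughly like `Borough(Neighborhood / Neighborhood)`. The following is a minimal sketch of how such a string could be split, using only the standard library. The helper name `split_cell_text` is hypothetical and the sample string is illustrative.

```python
def split_cell_text(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into (borough, neighborhoods)."""
    # Drop closing parentheses and any trailing whitespace they leave behind.
    cleaned = text.replace(")", " ").rstrip()
    # Everything before the first '(' is the borough; the rest is neighborhoods.
    borough, _, rest = cleaned.partition("(")
    # Turn the remaining separators into commas.
    neighborhoods = rest.replace("(", ", ").replace(" /", ",")
    return borough.strip(), neighborhoods


print(split_cell_text("Downtown Toronto(Regent Park / Harbourfront)"))
# → ('Downtown Toronto', 'Regent Park, Harbourfront')
```

A cell that has no parentheses returns an empty neighborhood string with this helper, so cells that break the assumptions still have to be reviewed by hand.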

In [4]:
#Create the dataframe for holding the data.  This frame will have 3 columns: PostalCode, Borough, and Neighborhood.
toronto_neighborhood_data = pd.DataFrame(columns=['PostalCode','Borough','Neighborhood']) 

#Iterate through each cell in each row of the table to extract postalcode, borough, and neighborhood data.
for row in soup.find('tbody').find_all('tr'):    
    for cell in row.find_all('td'):
    
        #If no borough is assigned to the postalcode, then we just skip over it, since we are only interested in the
        #postal codes with assigned boroughs.
        if cell.find('span').getText() == 'Not assigned':
            pass
        else:
            #Get the postal code data
            postal_code = cell.find('b').getText()
            
            #Create a dummy variable x to hold the borough and neighborhood data, since this must be cleaned up after
            #extraction.
            x = cell.find('span').getText()
            
            #Clean up this part by first replacing all the ')' characters with blank spaces ' '.
            x = x.replace(')', ' ')
            
            #If the last character of the string is a blank space due to it originally being a ')' character, we simply cut
            #off this last character using this if statement.
            if x[-1] == ' ':
                x = x[0:len(x)-1]
            
            #Next, create another dummy variable y that converts the string x into a list that separates the borough from
            #the neighborhoods using the split method for strings.
            y = x.split('(',1)
            
            #The borough is now cleaned and can now be extracted.  It is the first element in the y list.
            borough = y[0]
            
            #The neighborhoods data still needs more cleaning, which we do with the replace method.
            y[1] = y[1].replace('(',', ').replace(' /',',')
            
            #Neighborhood data is now ready for extraction.
            neighborhood = y[1]
            
            #Finally, append the data to the dataframe.  DataFrame.append was removed in pandas 2.0, so we add the
            #row by assigning to the next integer label with .loc instead.
            toronto_neighborhood_data.loc[len(toronto_neighborhood_data)] = [postal_code, borough, neighborhood]
toronto_neighborhood_data

|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East TorontoBusiness reply mail Processing Cen... | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
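
Growing a DataFrame one row at a time copies data on every step. A common alternative, sketched here under the assumption that each parsed cell is stored as a dictionary, is to collect the rows in a plain list and build the frame once at the end. The names `records` and `frame` are hypothetical; the two rows are taken from the output above.

```python
import pandas as pd

# In the notebook, these dictionaries would be produced inside the scraping loop.
records = [
    {"PostalCode": "M3A", "Borough": "North York", "Neighborhood": "Parkwoods"},
    {"PostalCode": "M4A", "Borough": "North York", "Neighborhood": "Victoria Village"},
]

# Build the DataFrame in a single step; `columns` fixes the column order.
frame = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(frame)
```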


As the table above shows, we are not quite done with cleaning the frame: the Borough value in row 100 (postal code M7Y) has extra text fused onto it.  This happens because some cells in the table break our assumptions, so not every row was copied into the frame correctly.  We must examine each row to find and fix these incorrect elements.

We start by converting each column into a list using the dummy variables a, b, and c.  We also create a range of the index numbers so we know which index each element is at.

In [5]:
a = list(toronto_neighborhood_data['PostalCode'])
b = list(toronto_neighborhood_data['Borough'])
c = list(toronto_neighborhood_data['Neighborhood'])
indeces = range(toronto_neighborhood_data.shape[0])

First, we look at the postal codes.

In [6]:
for i in indeces:
    print(i,'   ' + a[i])

0    M3A
1    M4A
2    M5A
3    M6A
4    M7A
5    M9A
6    M1B
7    M3B
8    M4B
9    M5B
10    M6B
11    M9B
12    M1C
13    M3C
14    M4C
15    M5C
16    M6C
17    M9C
18    M1E
19    M4E
20    M5E
21    M6E
22    M1G
23    M4G
24    M5G
25    M6G
26    M1H
27    M2H
28    M3H
29    M4H
30    M5H
31    M6H
32    M1J
33    M2J
34    M3J
35    M4J
36    M5J
37    M6J
38    M1K
39    M2K
40    M3K
41    M4K
42    M5K
43    M6K
44    M1L
45    M2L
46    M3L
47    M4L
48    M5L
49    M6L
50    M9L
51    M1M
52    M2M
53    M3M
54    M4M
55    M5M
56    M6M
57    M9M
58    M1N
59    M2N
60    M3N
61    M4N
62    M5N
63    M6N
64    M9N
65    M1P
66    M2P
67    M4P
68    M5P
69    M6P
70    M9P
71    M1R
72    M2R
73    M4R
74    M5R
75    M6R
76    M7R
77    M9R
78    M1S
79    M4S
80    M5S
81    M6S
82    M1T
83    M4T
84    M5T
85    M1V
86    M4V
87    M5V
88    M8V
89    M9V
90    M1W
91    M4W
92    M5W
93    M8W
94    M9W
95    M1X
96    M4X
97    M5X
98    M8X
99    M4Y
100    M7Y
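
The visual scan can be backed by a pattern check. The sketch below assumes, based on the codes printed above, that every valid entry is the letter M followed by one digit and one uppercase letter. The names `POSTAL_CODE_PATTERN` and `invalid_postal_codes` are hypothetical.

```python
import re

# Pattern assumed from the printed codes: 'M', one digit, one uppercase letter.
POSTAL_CODE_PATTERN = re.compile(r"M\d[A-Z]")


def invalid_postal_codes(codes):
    """Return (index, code) pairs that do not match the expected pattern."""
    return [
        (i, code)
        for i, code in enumerate(codes)
        if not POSTAL_CODE_PATTERN.fullmatch(code)
    ]


sample_codes = ["M3A", "M4A", "M5A", "M7Y"]  # a few codes from the list above
print(invalid_postal_codes(sample_codes))
# → []
print(invalid_postal_codes(["M3A", "M4", "Not assigned"]))
# → [(1, 'M4'), (2, 'Not assigned')]
```

In the notebook, the same function could be called on the list `a`; an empty result would agree with the manual inspection.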

No need to change anything for the postal codes.  So we move on to the boroughs.

In [7]:
for i in indeces:
    print(i, '  ' + b[i])

0   North York
1   North York
2   Downtown Toronto
3   North York
4   Queen's Park
5   Etobicoke
6   Scarborough
7   North York
8   East York
9   Downtown Toronto
10   North York
11   Etobicoke
12   Scarborough
13   North York
14   East York
15   Downtown Toronto
16   York
17   Etobicoke
18   Scarborough
19   East Toronto
20   Downtown Toronto
21   York
22   Scarborough
23   East York
24   Downtown Toronto
25   Downtown Toronto
26   Scarborough
27   North York
28   North York
29   East York
30   Downtown Toronto
31   West Toronto
32   Scarborough
33   North York
34   North York
35   East YorkEast Toronto
36   Downtown Toronto
37   West Toronto
38   Scarborough
39   North York
40   North York
41   East Toronto
42   Downtown Toronto
43   West Toronto
44   Scarborough
45   North York
46   North York
47   East Toronto
48   Downtown Toronto
49   North York
50   North York
51   Scarborough
52   North York
53   North York
54   East Toronto
55   North York
56   York
57   North York
58   Scarbo

Cells 35, 76, 92, 94, and 100 need to be corrected for the Boroughs column.  So let's make those corrections.

In [8]:
toronto_neighborhood_data.iloc[35,1] = 'East York, East Toronto'
toronto_neighborhood_data.iloc[76,1] = 'Mississauga'
toronto_neighborhood_data.iloc[92,1] = 'Downtown Toronto'
toronto_neighborhood_data.iloc[94,1] = 'Etobicoke'
toronto_neighborhood_data.iloc[100,1] = 'East Toronto'
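
The rows above were found by reading the printed list. As a cross-check, fused values such as 'East YorkEast Toronto' can be flagged with a simple heuristic: a lowercase letter directly followed by an uppercase letter. This is only a sketch. The heuristic is an assumption, it can raise false alarms on names with internal capitals, and it will not catch every kind of bad row. The names `FUSED_PATTERN` and `suspect_rows` are hypothetical.

```python
import re

# Heuristic (an assumption, not a rule): a lowercase letter immediately followed
# by an uppercase letter suggests two labels were joined without a separator.
FUSED_PATTERN = re.compile(r"[a-z][A-Z]")


def suspect_rows(values):
    """Return the positions of values that look like two fused labels."""
    return [i for i, value in enumerate(values) if FUSED_PATTERN.search(value)]


sample = ["North York", "Queen's Park", "East YorkEast Toronto"]
print(suspect_rows(sample))
# → [2]
```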

Display the boroughs list again to confirm the corrections.

In [9]:
b = list(toronto_neighborhood_data['Borough'])
for i in indeces:
    print(i, '  ' + b[i])

0   North York
1   North York
2   Downtown Toronto
3   North York
4   Queen's Park
5   Etobicoke
6   Scarborough
7   North York
8   East York
9   Downtown Toronto
10   North York
11   Etobicoke
12   Scarborough
13   North York
14   East York
15   Downtown Toronto
16   York
17   Etobicoke
18   Scarborough
19   East Toronto
20   Downtown Toronto
21   York
22   Scarborough
23   East York
24   Downtown Toronto
25   Downtown Toronto
26   Scarborough
27   North York
28   North York
29   East York
30   Downtown Toronto
31   West Toronto
32   Scarborough
33   North York
34   North York
35   East York, East Toronto
36   Downtown Toronto
37   West Toronto
38   Scarborough
39   North York
40   North York
41   East Toronto
42   Downtown Toronto
43   West Toronto
44   Scarborough
45   North York
46   North York
47   East Toronto
48   Downtown Toronto
49   North York
50   North York
51   Scarborough
52   North York
53   North York
54   East Toronto
55   North York
56   York
57   North York
58   Scar

Boroughs column looks good now, so let's look at the neighborhood column.

In [10]:
for i in indeces:
    print(i, '  ' + c[i])

0   Parkwoods
1   Victoria Village
2   Regent Park, Harbourfront
3   Lawrence Manor, Lawrence Heights
4   Ontario Provincial Government
5   Islington Avenue
6   Malvern, Rouge
7   Don Mills North
8   Parkview Hill, Woodbine Gardens
9   Garden District, Ryerson
10   Glencairn
11   West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
12   Rouge Hill, Port Union, Highland Creek
13   Don Mills South, Flemingdon Park
14   Woodbine Heights
15   St. James Town
16   Humewood-Cedarvale
17   Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
18   Guildwood, Morningside, West Hill
19   The Beaches
20   Berczy Park
21   Caledonia-Fairbanks
22   Woburn
23   Leaside
24   Central Bay Street
25   Christie
26   Cedarbrae
27   Hillcrest Village
28   Bathurst Manor, Wilson Heights, Downsview North
29   Thorncliffe Park
30   Richmond, Adelaide, King
31   Dufferin, Dovercourt Village
32   Scarborough Village
33   Fairview, Henry Farm, Oriole
34   Northwood Park, York University

Cells 35, 40, 76, 92, and 100 need fixing in the neighborhood column.  So let's make those corrections.

In [11]:
toronto_neighborhood_data.iloc[35,2] = 'The Danforth East'
toronto_neighborhood_data.iloc[40,2] = 'Downsview East, CFB Toronto'
toronto_neighborhood_data.iloc[76,2] = 'Canada Post Gateway Processing Centre'
toronto_neighborhood_data.iloc[92,2] = 'Stn A PO Boxes 25, The Esplanade'
toronto_neighborhood_data.iloc[100,2] = 'Business reply mail Processing Centre 969 Eastern'

Let's check the list once more to make sure everything is good.

In [12]:
c = list(toronto_neighborhood_data['Neighborhood'])
for i in indeces:
    print(i, '  ' + c[i])

0   Parkwoods
1   Victoria Village
2   Regent Park, Harbourfront
3   Lawrence Manor, Lawrence Heights
4   Ontario Provincial Government
5   Islington Avenue
6   Malvern, Rouge
7   Don Mills North
8   Parkview Hill, Woodbine Gardens
9   Garden District, Ryerson
10   Glencairn
11   West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
12   Rouge Hill, Port Union, Highland Creek
13   Don Mills South, Flemingdon Park
14   Woodbine Heights
15   St. James Town
16   Humewood-Cedarvale
17   Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
18   Guildwood, Morningside, West Hill
19   The Beaches
20   Berczy Park
21   Caledonia-Fairbanks
22   Woburn
23   Leaside
24   Central Bay Street
25   Christie
26   Cedarbrae
27   Hillcrest Village
28   Bathurst Manor, Wilson Heights, Downsview North
29   Thorncliffe Park
30   Richmond, Adelaide, King
31   Dufferin, Dovercourt Village
32   Scarborough Village
33   Fairview, Henry Farm, Oriole
34   Northwood Park, York University

The column looks good, so the DataFrame is ready.  We print out the first 5 rows and then get the dimensions.

In [13]:
toronto_neighborhood_data.head()

|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [14]:
toronto_neighborhood_data.shape

(103, 3)

Finally, we save our DataFrame to a csv file so we can use it for Part 2 of this project.

In [15]:
#Pass the file path directly to to_csv so that pandas opens and closes the file itself.
toronto_neighborhood_data.to_csv('toronto_neighborhood_data.csv', index=False)
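
Several Neighborhood values contain commas, so it is worth confirming that they survive the trip through CSV. The sketch below writes a one-row frame to an in-memory buffer instead of a file and reads it back. The names `frame`, `buffer`, and `restored` are hypothetical, and the row is taken from the table above.

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {
        "PostalCode": ["M5A"],
        "Borough": ["Downtown Toronto"],
        "Neighborhood": ["Regent Park, Harbourfront"],
    }
)

# Write to an in-memory text buffer, then rewind and read it back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text)
print(restored["Neighborhood"][0])
```

The field that contains a comma is wrapped in double quotes in the CSV text, and reading the file back returns it as a single value.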