<a href="https://colab.research.google.com/github/Tardiser/Coursera_Capstone/blob/master/TorontoNeighs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 # **<div align="center"> Segmenting and Clustering Neighborhoods in Toronto </div>**

---





*   Let's first download and import our packages.



In [0]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re

In [3]:
!pip install geocoder
import geocoder




*   Now that we've imported our packages, let's get the postal code table from Wikipedia, using the Beautiful Soup package for scraping.



In [0]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(source, 'lxml')
table = soup.find("table")
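
*   As a quick offline check of the parsing step, here is a minimal sketch that runs the same `find` / `find_all` calls on a small inline HTML snippet. The snippet and the `html.parser` backend are assumptions made for illustration; the live Wikipedia page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so no network access is needed.
html = """
<table>
  <tbody>
    <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser; the notebook itself uses 'lxml'
table = soup.find("table")                 # first <table> in the document

# Column names come from the <th> cells.
header = [th.text.strip() for th in table.find_all("th")]

# Data rows come from the <td> cells; the header row has no <td>, so skip it.
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td") is not None
]

print(header)
print(rows)
```

Working on a fixed snippet like this makes it easier to tell a parsing mistake apart from a change in the page itself.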



*   We'll write the scraped rows to a CSV file, so let's create one and initialize a CSV writer.



In [0]:
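# newline='' lets the csv module handle line endings itself (avoids blank rows on Windows).
csv_file = open('canadaPostal.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)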
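
*   A possible alternative, sketched under assumptions: wrapping the file in a `with` block closes it automatically, even if a later step fails. The example writes to a temporary file (the path and the two rows are illustrative), not to `canadaPostal.csv`.

```python
import csv
import os
import tempfile

# Illustrative rows; the second one mirrors a row shown later in this notebook.
rows = [
    ["Postal code", "Borough", "Neighborhood"],
    ["M5A", "Downtown Toronto", "Regent Park, Harbourfront"],
]

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# The file is closed when the with block ends, so no explicit close() is needed.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Read the file back; the comma inside the neighbourhood field was quoted on write.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

The trade-off is that all the writing has to happen inside one cell, whereas the open/close pair used here lets the writer stay alive across several cells.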



*   Firstly, we'll scrape column names from the wiki table.



In [0]:
columns = []
for th in table.find_all("th"):
    ftext = th.text.strip()
    columns.append(ftext)

csv_writer.writerow(columns)



*   And then, the rest of the table.



In [0]:
table = table.tbody
for tr in table.find_all("tr"):
    columns = []
    for td in tr.find_all("td"):
        ftext = td.text.strip()
        ftext = ftext.replace("/", ",")  # If a neighbourhood has more than one value, separate them with "," instead of "/".
        ftext = re.sub(r'\s+(,)', r'\1', ftext)  # Remove the space before ",".
        if ftext == "Not assigned":
            columns = []  # Discard rows that contain an unassigned cell.
            break

        columns.append(ftext)
    if columns:  # Skip the header row (no <td>) and discarded rows instead of writing blank lines.
        csv_writer.writerow(columns)

csv_file.close()
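
*   The cell-cleaning rules above can also be pulled into small functions and tried on their own. This is a minimal sketch; the names `clean_cell` and `clean_row` and the sample inputs are made up for illustration and are not part of the notebook.

```python
import re


def clean_cell(text):
    """Trim the cell, turn '/' into ',', and drop whitespace before commas."""
    text = text.strip().replace("/", ",")
    return re.sub(r"\s+,", ",", text)


def clean_row(cells):
    """Return the cleaned cells, or None if any cell is 'Not assigned'."""
    cleaned = [clean_cell(c) for c in cells]
    if "Not assigned" in cleaned:
        return None
    return cleaned


# Illustrative inputs: one normal row and one row that should be dropped.
kept = clean_row(["M5A", "Downtown Toronto", "Regent Park / Harbourfront"])
dropped = clean_row(["M1A", "Not assigned", "Not assigned"])

print(kept)
print(dropped)
```

Keeping the rules in functions makes it possible to check them without fetching the page again.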



*   Let's take a look at our data.



In [9]:
canada_postals_df = pd.read_csv('canadaPostal.csv') 
canada_postals_df.head()

  Postal code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government




*   And the shape of our data:



In [14]:
canada_postals_df.shape

(103, 3)



*   I tried using the geocoder package to obtain the coordinates, but it kept returning None. Therefore, I commented out the whole section below.



In [0]:
"""
# Initialize variables.
pcodes = canada_postals_df["Postal code"].tolist()
latitudes = []
longitudes = []
# Get the geo data.
for pcode in pcodes:
    coords = None
    # Note: this loops forever if the service never returns a result.
    while coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(pcode))
        coords = g.latlng
        print(coords)

    latitudes.append(coords[0])
    longitudes.append(coords[1])

print(len(latitudes))
"""

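*   Because the `while` loop above never ends when the service keeps returning None, a bounded retry is one way to fail gracefully. The sketch below is an assumption-laden illustration: `lookup_with_retries` and `fake_lookup` are hypothetical names, and the fake lookup stands in for a real geocoding call so the example runs offline.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {"count": 0}


def fake_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.753259, -79.329656]  # M3A coordinates as shown later in this notebook


result = lookup_with_retries(fake_lookup, "M3A, Toronto, Ontario")
gave_up = lookup_with_retries(lambda q: None, "M3A, Toronto, Ontario", max_attempts=2)

print(result)
print(gave_up)
```

A caller can then decide what to do with a None result, for example record the postal code as missing and move on.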
*   So instead, I used the provided CSV file of coordinates.


In [16]:
geodata = pd.read_csv("geo_coords.csv")
canada_postals_df = canada_postals_df.join(geodata.set_index('Postal Code'), on = 'Postal code')
print(canada_postals_df.head())

  Postal code           Borough  ...   Latitude  Longitude
0         M3A        North York  ...  43.753259 -79.329656
1         M4A        North York  ...  43.725882 -79.315572
2         M5A  Downtown Toronto  ...  43.654260 -79.360636
3         M6A        North York  ...  43.718518 -79.464763
4         M7A  Downtown Toronto  ...  43.662301 -79.389494

[5 rows x 5 columns]
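
*   Note that the two files spell the key column differently ("Postal code" vs. "Postal Code"). As a hedged sketch of the same idea, the example below uses `merge` with explicit key names on two tiny frames and then lists any postal codes that found no coordinates. The frames are illustrative (values copied from the output above, with M4A left out on purpose), and `merge` is swapped in here for the `join` call used in the notebook.

```python
import pandas as pd

# Small illustrative frames; column names match the two CSV files.
postals = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M3A"],  # M4A deliberately missing
    "Latitude": [43.654260, 43.753259],
    "Longitude": [-79.360636, -79.329656],
})

# Left merge keeps every postal code; unmatched rows get NaN coordinates.
merged = postals.merge(
    coords, how="left", left_on="Postal code", right_on="Postal Code"
).drop(columns="Postal Code")  # drop the duplicate key column

# Rows that found no match in the coordinates file.
missing = merged[merged["Latitude"].isna()]

print(merged)
print(missing["Postal code"].tolist())
```

Checking for missing coordinates before exporting helps avoid NaN values surfacing later in the clustering step.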




*  Lastly, let's export our data to a csv file, so that we can use it for the next part: Clustering.  




In [0]:
canada_postals_df.to_csv('canada_coords.csv', index = False)



*   Clustering the Toronto Neighborhoods: https://github.com/Tardiser/Coursera_Capstone/blob/master/TorontoNeighsClustered.ipynb 

