| <h1> Data Science Coursera Capstone Project </h1> |
:---:

This project is the capstone for the IBM Data Science course offered on Coursera. It analyzes neighborhoods in Toronto, Canada, to help decide where to move, using location data and machine learning to reach a conclusion.

##### The first step is to install and import the necessary packages.

In [1]:
import pandas as pd
import numpy as np
!pip install beautifulsoup4



In [2]:
from bs4 import BeautifulSoup
import requests

In [3]:
!pip install lxml



##### Here I define the URL where the information is stored and download the page's HTML.

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_content = requests.get(url).text

##### Here I take the HTML downloaded from the site and convert it into parsed data using the BeautifulSoup package.

In [5]:
ParseData = BeautifulSoup(html_content, "lxml")

##### Now that I have retrieved the HTML and parsed it, I need to find the table within it and retrieve each row of that table. Since the table is the only thing I care about, I extract only the table and store it in a variable. If you compare the contents of this variable with the table on the site, you will see that each row is wrapped in a "tr" tag and each data cell within a row is wrapped in a "td" tag. So I created three empty lists and looped through the table's rows. For each row with three cells, I took the value in each column and appended it to its respective list, removing unnecessary characters ("\n") along the way.

In [6]:
TorontoTable = ParseData.find('table', attrs={'class': 'wikitable sortable'}) # Find the Table

In [7]:
Postalcode = []
Borough = []
Neighborhood = []

for row in TorontoTable.find_all('tr'):
    cells = row.find_all('td')
    if len(cells)==3:
        Postalcode.append(cells[0].find(text=True).replace('\n', ' ').strip())
        Borough.append(cells[1].find(text=True).replace('\n', ' ').strip())
        Neighborhood.append(cells[2].find(text=True).replace('\n', ' ').strip())
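
The row-and-cell extraction described above can be tried offline on a small hand-written table. This is only a sketch: `sample_html` is a made-up two-row stand-in for the Wikipedia table, the built-in `html.parser` is used instead of `lxml` so nothing extra needs installing, and `get_text(strip=True)` is swapped in for `find(text=True)`, so it returns all of a cell's text rather than only the first text node.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of the real table (not scraped data).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")  # the header row only has <th> cells, so this is empty
    if len(cells) == 3:
        # strip=True removes the trailing newline inside each cell
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```

Collecting each row as one tuple, instead of appending to three separate lists, keeps the values of a row together if a cell is ever missing.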

##### Now that I have the data and it is almost fully cleaned up, I need to convert it into a dataframe and change the column names.

In [8]:
TorontoNeighborhoods = pd.DataFrame(Postalcode, columns=['PostalCode'])
TorontoNeighborhoods['Borough'] = Borough
TorontoNeighborhoods['Neighborhood'] = Neighborhood
TorontoNeighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
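
As an alternative sketch, the three lists could also be turned into a dataframe in a single call by passing a dict, where each key becomes a column name. The list contents below are illustrative sample values, and the variable names are hypothetical stand-ins for the scraped lists.

```python
import pandas as pd

# Hypothetical sample values standing in for the three scraped lists.
postal_codes = ["M1A", "M3A", "M4A"]
boroughs = ["Not assigned", "North York", "North York"]
neighborhoods = ["Not assigned", "Parkwoods", "Victoria Village"]

# Each dict key becomes a column name; all lists must have the same length.
df = pd.DataFrame({
    "PostalCode": postal_codes,
    "Borough": boroughs,
    "Neighborhood": neighborhoods,
})

print(df.shape)
```

If the lists ever differ in length, this form raises a `ValueError` straight away, which can make a scraping mistake easier to spot.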


##### The last thing I need to do to clean it up is to filter out all data points where there is no Borough assigned to the postal code.

In [9]:
NewToronto = TorontoNeighborhoods[TorontoNeighborhoods['Borough'] != "Not assigned"]
NewToronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
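
Notice that the filtered dataframe keeps its original row labels, so it starts at 2 rather than 0. If a fresh 0-based index is preferred, one option is to chain `reset_index(drop=True)` onto the filter. The sketch below uses a small made-up dataframe rather than the scraped data.

```python
import pandas as pd

# Small illustrative dataframe; the values are sample rows, not the full table.
df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough, then renumber the index from 0.
assigned = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

The later merge produces a new 0-based index anyway, so this step is optional here.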


##### Here I am simply getting the dimensions of the dataframe.

In [10]:
NewToronto.shape

(103, 3)

##### Because geocoder was not working for me (it could not find my postal codes), I decided to read in the provided CSV file, which contains the coordinates for each postal code.

In [11]:
lonlng = pd.read_csv('http://cocl.us/Geospatial_data')
lonlng.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


##### I then merged the two dataframes, specifying the columns to find the matches and dropping the duplicate column. 

In [12]:
lnlngToronto = pd.merge(NewToronto, lonlng, left_on = "PostalCode", right_on = "Postal Code").drop('Postal Code', axis=1)

In [13]:
lnlngToronto.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
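
By default, `pd.merge` performs an inner join, so any postal code without a match in the coordinates file would be dropped silently. A possible way to check for that is sketched below with a left join plus the `indicator` and `validate` options. The two small dataframes are illustrative, and "M9Z" is a made-up postal code added only to show an unmatched row.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # "M9Z" is hypothetical, with no coordinates
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = pd.merge(
    neighborhoods,
    coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",             # keep every neighborhood row, matched or not
    validate="one_to_one",  # raise MergeError if a postal code repeats on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
).drop(columns="Postal Code")

# Postal codes that found no coordinates in the lookup table.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` comes back empty on the real data, the inner join used above loses nothing.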
