<h1 align=center>Data Collecting</h1>

For this assignment, we'll collect information about the neighborhoods in Toronto.

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Acquire data
### Web scraping
The information we need about Toronto, namely the **Postal Code, Borough and Neighbourhood** of each area, is available on the following [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [2]:
# assign the url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# fetch the page and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()
results = response.text

Next, we'll extract the required information from the first table on the page:

In [3]:
# parse the html results
soup = BeautifulSoup(results, 'html.parser')

toronto_table = soup.find('table') # get the Toronto table
table_rows = toronto_table.find_all('tr') # get all the rows in the table

headers = [] # a list to store the table headers
rows = [] # a list to store each row components

# iterate through each row of the table
for row in table_rows:
    # extract the table headers
    contents = row.find_all('th')
    if len(contents) != 0:
        for content in contents:
            header = content.get_text().strip()
            headers.append(header)
        continue
    # extract the three cells of a data row
    contents = row.find_all('td')
    postal_code = contents[0].get_text().strip()
    borough = contents[1].get_text().strip()
    neighborhood = contents[2].get_text().strip()
    rows.append([postal_code, borough, neighborhood])

print('The table headers are :', headers)
print('The table contains', len(rows), 'rows')

The table headers are : ['Postal Code', 'Borough', 'Neighbourhood']
The table contains 180 rows
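
The loop above assumes that every data row has at least three `<td>` cells. As a minimal sketch of the same extraction on a small, made-up HTML snippet (the `sample_html` table and the `extract_rows` helper are illustrative assumptions, not part of the notebook), rows with fewer cells can be skipped instead of raising an `IndexError`:

```python
from bs4 import BeautifulSoup

# Made-up table mimicking the layout of the Wikipedia table (illustrative only)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>malformed row</td></tr>
</table>
"""

def extract_rows(html):
    """Return (headers, rows) from the first table found in `html`."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    headers, rows = [], []
    for tr in table.find_all('tr'):
        header_cells = tr.find_all('th')
        if header_cells:
            # header row: collect the column names and move on
            headers.extend(cell.get_text(strip=True) for cell in header_cells)
            continue
        cells = [cell.get_text(strip=True) for cell in tr.find_all('td')]
        if len(cells) < 3:
            # skip rows that do not carry all three fields
            continue
        rows.append(cells[:3])
    return headers, rows

sample_headers, sample_rows = extract_rows(sample_html)
print(sample_headers)
print(sample_rows)
```

Whether skipping is the right policy depends on the page; logging the skipped rows may be preferable if they are unexpected.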


### Load the data
Transform the data into a pandas DataFrame:

In [4]:
toronto_df = pd.DataFrame(rows, columns=headers)
toronto_df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### Inspect the data

In [5]:
toronto_df.shape

(180, 3)

Let's check for nulls. The table marks missing values with the string `Not assigned`, so we convert those to `NaN` first:

In [6]:
# replace the 'Not assigned' placeholder with NaN
toronto_df = toronto_df.replace('Not assigned', np.nan)
# fraction of null values in each column
toronto_df.isnull().mean()

Postal Code      0.000000
Borough          0.427778
Neighbourhood    0.427778
dtype: float64
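
`isnull().mean()` reports the fraction of missing values per column, while `isnull().sum()` gives the corresponding counts. A minimal sketch on a small made-up frame (the `demo_df` data is an assumption for illustration), followed by the arithmetic that converts the fraction reported above into a row count:

```python
import numpy as np
import pandas as pd

# Made-up frame with two rows that lack a borough and a neighbourhood
demo_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': [np.nan, np.nan, 'North York', 'North York'],
    'Neighbourhood': [np.nan, np.nan, 'Parkwoods', 'Victoria Village'],
})

null_fraction = demo_df.isnull().mean()  # share of nulls per column
null_count = demo_df.isnull().sum()      # number of nulls per column
print(null_fraction)
print(null_count)

# Converting the fraction from the real table back to a count: 0.427778 * 180 rows
approx_nulls = round(0.427778 * 180)
print(approx_nulls)
```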

* Both the **Borough** and **Neighbourhood** columns contain null values (77 rows each, about 42.8% of the 180 rows).
* Since the **Postal Code** column doesn't contain any null values, let's check whether it contains duplicates (i.e. whether two or more rows share the same postal code).

In [7]:
toronto_df['Postal Code'].duplicated().mean()

0.0

The **Postal Code** column does not contain any duplicates, which means each row in our dataframe has a unique postal code.
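
To illustrate what this check measures, here is a minimal sketch on made-up postal codes (the `codes` series is hypothetical). `duplicated()` flags every repeat after the first occurrence, so a mean of `0.0` is equivalent to `is_unique` being `True`:

```python
import pandas as pd

# Hypothetical series in which 'M2A' appears twice
codes = pd.Series(['M1A', 'M2A', 'M3A', 'M2A'])

dup_mask = codes.duplicated()  # True only for the second 'M2A'
print(dup_mask.tolist())
print(dup_mask.mean())         # fraction of rows that repeat an earlier value
print(codes.is_unique)

# After dropping the repeats, the series passes the uniqueness check
unique_codes = codes.drop_duplicates()
print(unique_codes.is_unique)
```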

## Data cleaning
To deal with null values, we'll follow these rules:
* Rows without an assigned borough will be dropped.
* If a row has a borough but its neighborhood is `Not assigned`, the neighborhood will be set to the borough name.

In [8]:
# fill null neighbourhoods with the borough name of the same row
toronto_df['Neighbourhood'] = toronto_df['Neighbourhood'].fillna(toronto_df['Borough'])

# drop rows that still contain nulls (rows without an assigned borough)
toronto_df = toronto_df.dropna()

# sort the dataframe by Borough
toronto_df = toronto_df.sort_values(by='Borough').reset_index(drop=True)

toronto_df.head(10)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M4V | Central Toronto | Summerhill West, Rathnelly, South Hill, Forest... |
| 1 | M4S | Central Toronto | Davisville |
| 2 | M4T | Central Toronto | Moore Park, Summerhill East |
| 3 | M5P | Central Toronto | Forest Hill North & West, Forest Hill Road Park |
| 4 | M5R | Central Toronto | The Annex, North Midtown, Yorkville |
| 5 | M4P | Central Toronto | Davisville North |
| 6 | M5N | Central Toronto | Roselawn |
| 7 | M4R | Central Toronto | North Toronto West, Lawrence Park |
| 8 | M4N | Central Toronto | Lawrence Park |
| 9 | M5E | Downtown Toronto | Berczy Park |


Confirm the changes:

In [9]:
toronto_df.shape

(103, 3)
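
The new shape is consistent with the earlier null check: 180 rows minus the 77 rows without a borough leaves 103. A minimal sketch of the same cleaning rules and sanity checks on a made-up frame (`toy_df` is an assumption for illustration; the `M7A` row with a borough but no neighbourhood is invented to exercise the second rule):

```python
import numpy as np
import pandas as pd

# Made-up data: one row without a borough, one row without a neighbourhood
toy_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A', 'M5A'],
    'Borough': [np.nan, 'North York', 'Example Borough', 'Downtown Toronto'],
    'Neighbourhood': [np.nan, 'Parkwoods', np.nan, 'Regent Park, Harbourfront'],
})

n_before = len(toy_df)
n_missing_borough = toy_df['Borough'].isnull().sum()

# A missing neighbourhood takes the borough name of the same row
toy_df['Neighbourhood'] = toy_df['Neighbourhood'].fillna(toy_df['Borough'])
# Rows without a borough are dropped
toy_clean = toy_df.dropna(subset=['Borough']).reset_index(drop=True)

# Sanity checks: no nulls remain, and only the borough-less rows were removed
assert toy_clean.isnull().sum().sum() == 0
assert len(toy_clean) == n_before - n_missing_borough
print(toy_clean)
```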

## Save the data

In [10]:
toronto_df.to_csv('Toronto_neighborhoods.csv', index=False)
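
To confirm that the file was written as expected, it can be read back with `pd.read_csv`. A minimal sketch using a temporary directory and a small made-up frame (`sample_df` and the temporary path are assumptions for illustration; the notebook itself writes `Toronto_neighborhoods.csv` to the working directory):

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the same columns as the cleaned Toronto data
sample_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'Toronto_neighborhoods.csv')
    # index=False keeps the row index out of the file
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Compare the reloaded frame with the original
print(list(reloaded.columns))
print(reloaded.equals(sample_df))
```

Passing `index=False` matters for the round trip: without it, the index is written as an extra column that shows up as `Unnamed: 0` when the file is read back.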