# IBM Data Science Professional Certificate
## Applied Data Science Capstone
### Week 3, Part 1:

Build a pandas DataFrame of Toronto postal codes with their corresponding boroughs and neighborhoods.

#### Let's start by installing and importing all the libraries we need.

In [1]:
!pip install pandas -U
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
print("\n*** Pandas Installed, Updated, & Imported\n")
print("\n*** JSON_normalize Imported\n")

!pip install numpy -U
import numpy as np
print("\n*** NumPy Installed, Updated, & Imported\n")

import requests
import urllib.request
print("\n*** Requests Imported\n")

import random
print("\n*** Random Imported\n")

!pip install geopy -U
from geopy.geocoders import Nominatim
print("\n*** Geopy Installed, Updated, & Imported\n")
print("\n*** Nominatim Imported\n")

!pip install ipython -U
from IPython.display import Image, HTML
print("\n*** IPython Installed, Updated, & Imported\n")
print("\n*** Image & HTML Imported\n")

!pip install folium -U
import folium
print("\n*** Folium Installed, Updated, & Imported\n")

!pip install BeautifulSoup4 -U
from bs4 import BeautifulSoup

Requirement already up-to-date: pandas in c:\users\alexi\anaconda3\lib\site-packages (1.0.3)

*** Pandas Installed, Updated, & Imported


*** JSON_normalize Imported

Requirement already up-to-date: numpy in c:\users\alexi\anaconda3\lib\site-packages (1.18.3)

*** NumPy Installed, Updated, & Imported


*** Requests Imported


*** Random Imported

Requirement already up-to-date: geopy in c:\users\alexi\anaconda3\lib\site-packages (1.21.0)

*** Geopy Installed, Updated, & Imported


*** Nominatim Imported

Requirement already up-to-date: ipython in c:\users\alexi\anaconda3\lib\site-packages (7.13.0)

*** IPython Installed, Updated, & Imported


*** Image & HTML Imported

Requirement already up-to-date: folium in c:\users\alexi\anaconda3\lib\site-packages (0.10.1)

*** Folium Installed, Updated, & Imported

Requirement already up-to-date: BeautifulSoup4 in c:\users\alexi\anaconda3\lib\site-packages (4.9.0)


#### Scrape data from Wikipedia with BeautifulSoup.

BeautifulSoup parses the target Wikipedia page with the **lxml** parser, which lets us search the page for tables. Looking at the page source, we notice there are a couple of tables on the page. The one we're interested in has the class **wikitable sortable**.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

In [3]:
table = soup.find('table', class_='wikitable sortable')
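
One thing worth knowing about the lookup above: passing a multi-word string to `class_` matches the class attribute as an exact string, so it depends on the classes appearing in that order with nothing added. The sketch below uses a small hypothetical HTML snippet (not the real page) and the built-in `html.parser` to show the difference between that lookup and a CSS selector, which is one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables, with the classes of the
# second one in a different order and with an extra class added.
html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="sortable wikitable extra"><tr><td>M3A</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: finds nothing in this snippet.
exact = soup.find("table", class_="wikitable sortable")

# CSS selector: matches a table that carries both classes, in any order.
flexible = soup.select_one("table.wikitable.sortable")

print(exact)
print(flexible.td.get_text(strip=True))
```

If the lookup in the notebook ever returns `None`, a changed class attribute on the page is one possible cause to check.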

#### Parse table, and add the contents to a pandas dataframe.

In [4]:
PostalCode = []
Borough = []
Neighborhood = []

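for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Keep only rows with exactly three <td> cells.
    if len(cells) == 3:
        # Take the first text node of each cell; newlines are cleaned up later.
        PostalCode.append(cells[0].find(text=True))
        Borough.append(cells[1].find(text=True))
        Neighborhood.append(cells[2].find(text=True))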
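
As a possible variation on the loop above, `get_text(strip=True)` returns the cell text with surrounding whitespace already removed, which would avoid some of the newline cleanup later on. This is a minimal sketch on a hypothetical one-row excerpt shaped like the Wikipedia table, not the notebook's own method. Note that `get_text` concatenates every text node in a cell, whereas `find(text=True)` returns only the first one, so the two can differ on cells with mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt: one header row and one data row with trailing newlines.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A
</td><td><a href="#">North York</a>
</td><td>Parkwoods
</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row has <th> cells only, so it yields no <td> cells here.
    if len(cells) == 3:
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```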

In [5]:
df = pd.DataFrame(PostalCode, columns=["PostalCode"])
df["Borough"] = Borough
df["Neighborhood"] = Neighborhood
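
The same frame could also be built in a single call from a dictionary of lists. The sketch below uses hypothetical sample values in place of the scraped lists. One side effect of this form is that pandas raises a `ValueError` if the lists differ in length, which would surface a parsing problem early.

```python
import pandas as pd

# Hypothetical sample values standing in for the three scraped lists.
postal_codes = ["M3A", "M4A"]
boroughs = ["North York", "North York"]
neighborhoods = ["Parkwoods", "Victoria Village"]

# Column order follows the order of the dictionary keys.
sample = pd.DataFrame({
    "PostalCode": postal_codes,
    "Borough": boroughs,
    "Neighborhood": neighborhoods,
})

print(sample.shape)
```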

#### Let's see what we've scraped up

The scrape returned 180 rows. There are a lot of **\n (newline)** characters, **empty cells**, and **Not assigned** cells, all of which need to be cleaned up. Otherwise, it looks like we picked up all of the postal codes (from M1A to M9Z) and their corresponding boroughs and neighborhoods.

In [6]:
print("Head:\n:", df.head())
print("\nShape:\n:", df.shape)
print("\nDescribe:\n:", df.describe())
print("\nTail:\n:", df.tail())

Head:
:   PostalCode             Borough                  Neighborhood
0      M1A\n      Not assigned\n                            \n
1      M2A\n      Not assigned\n                            \n
2      M3A\n        North York\n                   Parkwoods\n
3      M4A\n        North York\n            Victoria Village\n
4      M5A\n  Downtown Toronto\n  Regent Park / Harbourfront\n

Shape:
: (180, 3)

Describe:
:        PostalCode         Borough Neighborhood
count         180             180          180
unique        180              11           99
top         M2A\n  Not assigned\n           \n
freq            1              77           77

Tail:
:     PostalCode         Borough  \
175      M5Z\n  Not assigned\n   
176      M6Z\n  Not assigned\n   
177      M7Z\n  Not assigned\n   
178      M8Z\n     Etobicoke\n   
179      M9Z\n  Not assigned\n   

                                          Neighborhood  
175                                                 \n  
176                
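
The `describe()` output above already hints at the scale of the problem: the most frequent Borough value is `Not assigned\n`, with a frequency of 77. A quick way to count such cells directly is sketched below on a hypothetical three-row sample that mirrors the first rows of the scraped frame. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample mirroring the first three scraped rows.
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M2A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "Not assigned\n", "North York\n"],
    "Neighborhood": ["\n", "\n", "Parkwoods\n"],
})

# Count each borough label after stripping the trailing newline.
borough_counts = raw["Borough"].str.strip().value_counts()

# Count neighborhood cells that are empty once whitespace is removed.
empty_neighborhoods = int((raw["Neighborhood"].str.strip() == "").sum())

print(borough_counts["Not assigned"], empty_neighborhoods)
```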

#### Cleaning up

Let's clean up the data: replace the **\n (newline)** characters, convert **Not assigned** cells to NaN, and drop every row that has a missing value.

In [7]:
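# Substitute every newline character with a space.
df = df.replace(r'\n',  ' ', regex=True)
# With a non-string replacement, the whole cell becomes NaN when the pattern matches.
df = df.replace(r'  ',  np.nan, regex=True)
df = df.replace(r'Not assigned', np.nan, regex=True)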
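
Because the newlines are replaced with spaces rather than removed, the cleaned values keep a trailing space (for example `Parkwoods ` in the output further down). A possible alternative, sketched here on hypothetical rows copied from the raw output and not the notebook's own method, is to strip whitespace first and then blank out cells by exact value.

```python
import pandas as pd

# Hypothetical rows in the same shape as the raw scraped frame.
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "North York\n"],
    "Neighborhood": ["\n", "Parkwoods\n"],
})

# Strip surrounding whitespace, including newlines, from every column.
stripped = raw.apply(lambda col: col.str.strip())

# Cells that are empty or exactly "Not assigned" become NaN; others are kept.
cleaned = stripped.mask(stripped.isin(["", "Not assigned"]))

print(cleaned.dropna().values.tolist())
```

Matching on exact values also avoids the risk of a regular expression blanking out a cell that merely contains the pattern somewhere inside a longer name.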

In [8]:
df.dropna(axis=0, how="any", inplace=True)
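
Two optional refinements to the drop step are sketched below on a hypothetical three-row sample. `subset` restricts the missing-value check to the named columns, and `reset_index(drop=True)` renumbers the rows from 0, whereas the notebook's cleaned frame keeps its original labels (2, 3, 4, ...). Whether to use `subset` is a design choice: a row with an assigned borough but a missing neighborhood would survive it, while `how="any"` across all columns would drop that row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first row has no borough or neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": [np.nan, "North York", "North York"],
    "Neighborhood": [np.nan, "Parkwoods", "Victoria Village"],
})

# Drop rows without a borough, then renumber the remaining rows from 0.
kept = sample.dropna(subset=["Borough"]).reset_index(drop=True)

print(kept.index.tolist())
```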

In [9]:
print("Head:\n:", df.head())
print("\nShape:\n:", df.shape)
print("\nDescribe:\n:", df.describe())
print("\nTail:\n:", df.tail())

Head:
:   PostalCode            Borough                                   Neighborhood
2       M3A         North York                                      Parkwoods 
3       M4A         North York                               Victoria Village 
4       M5A   Downtown Toronto                     Regent Park / Harbourfront 
5       M6A         North York              Lawrence Manor / Lawrence Heights 
6       M7A   Downtown Toronto   Queen's Park / Ontario Provincial Government 

Shape:
: (102, 3)

Describe:
:        PostalCode      Borough Neighborhood
count         102          102          102
unique        102           10           97
top          M3C   North York    Downsview 
freq            1           24            4

Tail:
:     PostalCode            Borough  \
157       M5X   Downtown Toronto    
165       M4Y   Downtown Toronto    
168       M7Y       East Toronto    
169       M8Y          Etobicoke    
178       M8Z          Etobicoke    

                                  

#### Grouping the dataframe

Group the rows by postal code and borough, joining the neighborhoods within each group with commas, then replace the remaining **/** separators with commas as well.

In [10]:
df_grouped = df.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped = df_grouped.replace(r' / ',  ', ', regex=True)
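
In this notebook the frame has 102 rows both before and after grouping, so each postal code already occupies a single row and the visible change comes from replacing the ` / ` separators. To show what the join does when a postal code spans several rows, here is a minimal sketch on hypothetical data in which one code appears twice.

```python
import pandas as pd

# Hypothetical rows: postal code M5A is listed twice.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse each (PostalCode, Borough) pair into one row and join the
# neighborhood names with a comma. Groups come back sorted by key.
grouped = (
    sample.groupby(["PostalCode", "Borough"], as_index=False)
          .agg({"Neighborhood": ", ".join})
)

print(grouped)
```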

In [11]:
print("Head:\n:", df_grouped.head())
print("\nShape:\n:", df_grouped.shape)
print("\nDescribe:\n:", df_grouped.describe())
print("\nTail:\n:", df_grouped.tail())

Head:
:   PostalCode       Borough                             Neighborhood
0       M1B   Scarborough                           Malvern, Rouge 
1       M1C   Scarborough   Rouge Hill, Port Union, Highland Creek 
2       M1E   Scarborough        Guildwood, Morningside, West Hill 
3       M1G   Scarborough                                   Woburn 
4       M1H   Scarborough                                Cedarbrae 

Shape:
: (102, 3)

Describe:
:        PostalCode      Borough Neighborhood
count         102          102          102
unique        102           10           97
top          M2K   North York    Downsview 
freq            1           24            4

Tail:
:     PostalCode     Borough                                       Neighborhood
97        M9N        York                                             Weston 
98        M9P   Etobicoke                                          Westmount 
99        M9R   Etobicoke   Kingsview Village, St. Phillips, Martin Grove ...
100       M