<img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="300" align="left"><h1 align="center"><font size="6.5">Applied Data Science Capstone</font></h1>

<h3 align="center"><font size="5">This notebook is intended for the final course in the IBM Data Science Professional Certificate.</font></h3>
<hr style="border: dashed rgb(0,0,0) 1.0px;background-color: rgb(0,0,255);height: 3.0px;"/>

## Week 3 - Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

<hr style="border: dashed rgb(0,0,0) 1.0px;background-color: rgb(0,0,0);height: 1.0px;"/>

### Part (1) - Build code to scrape the Wikipedia page, obtain the data in the table of postal codes, and transform the data into a pandas dataframe

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.
Start by creating a new Notebook for this assignment.
Use the Notebook to build the code that scrapes the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, obtains the data in the table of postal codes, and transforms the data into a pandas dataframe.

In [6]:
# Clear the output
from IPython.display import clear_output

# import the library we use to open URLs
import urllib.request

# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

# import pandas
import pandas as pd

In [7]:
# specify which URL/web page we are going to be scraping
url_wiki = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# open the url using urllib.request and put the HTML into the page variable
page_wiki = urllib.request.urlopen(url_wiki)
# parse the HTML from our URL into the BeautifulSoup parse tree format
bs_wiki = BeautifulSoup(page_wiki, "lxml")

# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store them in the 'tables_wiki' variable
tables_wiki=bs_wiki.find_all("table")

In [8]:
# Get the Required table
table_Pcodes=bs_wiki.find('table', class_='wikitable sortable')

# Lists to hold data
PostalCode=[]
Borough=[]
Neighborhood=[]

# Loop over the table rows and skip rows whose borough is 'Not assigned'
for row in table_Pcodes.find_all('tr'):
    cells=row.find_all('td')
    if len(cells) > 0 :
        # Only process rows that have an assigned borough
        if cells[1].text.strip() != 'Not assigned':
            PostalCode.append(cells[0].text.strip())
            Borough.append(cells[1].text.strip())
            # If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name
            if cells[2].text.strip() == 'Not assigned':
                Neighborhood.append(cells[1].text.strip())
            else:
                Neighborhood.append(cells[2].text.strip())

# Create dataframe to hold data        
df_wiki=pd.DataFrame(PostalCode,columns=['PostalCode'])
df_wiki['Borough']=Borough
df_wiki['Neighborhood']=Neighborhood

# show data shape
print (df_wiki.shape)

(211, 3)
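
The two cleaning rules applied in the loop above can be looked at separately from the scraping. The following is a minimal sketch rather than the notebook's own code: `clean_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def clean_rows(rows):
    """Apply the two cleaning rules to (postal_code, borough, neighborhood) tuples.

    `clean_rows` is a hypothetical helper, not part of the original notebook.
    """
    cleaned = []
    for postal_code, borough, neighborhood in rows:
        # Rule 1: skip rows whose borough is 'Not assigned'
        if borough == 'Not assigned':
            continue
        # Rule 2: a 'Not assigned' neighborhood falls back to the borough name
        if neighborhood == 'Not assigned':
            neighborhood = borough
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned


# Made-up sample rows in the same three-column layout as the scraped table
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
cleaned_rows = clean_rows(sample_rows)
print(cleaned_rows)
```

Keeping the rules in a small function like this makes them easy to check on a few hand-written rows before running them on the scraped table.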


In [9]:
# Group Values based on the Postal Code
df_Pcodes = df_wiki.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_Pcodes.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
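
To see what the grouping step does in isolation, here is a small sketch on made-up rows (the toy dataframe and its values are assumptions for illustration only). It uses the same `groupby` and `', '.join` pattern as the cell above.

```python
import pandas as pd

# Made-up rows: one postal code listed on two rows, another on a single row
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Join the neighborhood names of each (PostalCode, Borough) group into one string
toy_grouped = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .apply(', '.join)
       .reset_index()
)
print(toy_grouped)
```

Each postal code ends up on a single row, with its neighborhoods separated by commas.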


In [10]:
# show data shape
print (df_Pcodes.shape)

(103, 3)


<hr style="border: dashed rgb(0,0,0) 1.0px;background-color: rgb(0,0,0);height: 1.0px;"/>

### Part (2) - Build code to get the latitude and the longitude coordinates of each neighborhood

#### Option (1) : Using ArcGIS Geocoder rather than Google

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

In [None]:
# Check whether the geocoder package is available
try:
    import geocoder
    print ('geocoder available')
except ImportError:
    # install geocoder if the import fails
    print ('Installing geocoder')
    !pip install geocoder
    clear_output()
    print ('geocoder Installed')

In [None]:
import geocoder # convert an address into latitude and longitude values

In [None]:
# Lists to hold data
Latitude=[]
Longitude=[]

# Loop over the postal codes and get their coordinates using the ArcGIS geocoder
for postal_code in df_Pcodes['PostalCode']:
    address = postal_code + ', canada'
    g = geocoder.arcgis(address)
    Latitude.append(g.lat)
    Longitude.append(g.lng)
    
# Create new dataframe to hold coord
df_Coords=df_Pcodes.copy()
df_Coords['Latitude']=Latitude
df_Coords['Longitude']=Longitude

df_Coords.head(10)

In [None]:
# show data shape
print (df_Coords.shape)
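
A geocoding service may occasionally return no result for an address. One common way to cope is to retry a few times before giving up. The sketch below shows that idea under stated assumptions: `geocode_with_retry` is a hypothetical helper, `lookup` stands for any callable that returns a `(lat, lng)` tuple or `None`, and the fake lookup and its coordinates are made up so the example runs without network access.

```python
def geocode_with_retry(lookup, address, max_attempts=3):
    """Call `lookup(address)` until it yields coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) tuple or None; both this
    helper and that calling convention are assumptions made for this sketch.
    """
    for _ in range(max_attempts):
        coords = lookup(address)
        if coords is not None:
            return coords
    # Give up and let the caller decide how to handle the gap
    return None


# A fake lookup that fails twice before succeeding, standing in for a real geocoder
attempts = {'count': 0}

def flaky_lookup(address):
    attempts['count'] += 1
    if attempts['count'] < 3:
        return None
    return (43.65, -79.38)  # made-up coordinates

coords = geocode_with_retry(flaky_lookup, 'M5A, Toronto, Ontario')
print(coords, attempts['count'])
```

Capping the number of attempts avoids an endless loop when a postal code cannot be resolved at all.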

#### Alternative option (Note: Reading from file as geocoder failed)

Because the Geocoder package can be very unreliable, here is a link to a CSV file with the geographical coordinates of each postal code, in case you are not able to get the coordinates of the neighborhoods using the package: http://cocl.us/Geospatial_data

In [None]:
# Download the dataset (-q runs wget quietly)
!wget -q -O Geospatial_Coordinates.csv https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
print('Data downloaded!')

# Load Data From CSV File  
df_csv = pd.read_csv('Geospatial_Coordinates.csv')
df_csv.head()

In [None]:
# Get the Long & Lat from the csv file and append it to main dataframe
df_final = pd.merge(df_Pcodes, df_csv, how='inner', left_on = 'PostalCode', right_on = 'Postal Code')

df_final.drop(['Postal Code'], axis = 1,inplace=True)
df_final.head(10)
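
An inner merge silently drops any postal code that has no match in the coordinates file. The following sketch, on made-up data, shows one way to check for such rows by using a left merge with pandas' `indicator=True` option. The toy dataframes, the unmatched code `M9Z`, and the coordinate values are assumptions for illustration.

```python
import pandas as pd

# Made-up postal codes; 'M9Z' deliberately has no coordinates
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z']})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.81, 43.78],
    'Longitude': [-79.19, -79.16],
})

# A left merge keeps every postal code; indicator=True records where each row matched
checked = pd.merge(codes, coords, how='left',
                   left_on='PostalCode', right_on='Postal Code', indicator=True)

# Rows marked 'left_only' would be dropped by an inner merge
missing_codes = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing_codes)
```

If the list of missing codes is empty, the inner merge used above loses nothing.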

<hr style="border: dashed rgb(0,0,0) 1.0px;background-color: rgb(0,0,0);height: 1.0px;"/>

### Part (3) - Explore and cluster the neighborhoods in Toronto

Explore and cluster the neighborhoods in Toronto. You can decide to work with only the boroughs that contain the word Toronto and then replicate the same analysis we did on the New York City data. It is up to you.
Just make sure:

1- to add enough Markdown cells to explain what you decided to do and to report any observations you make. 

2- to generate maps to visualize your neighborhoods and how they cluster together. 

In [None]:
# Checking for Required Packages availability

# Check folium
try:
    import folium
    print ('folium available')
except ImportError:
    # installing folium
    print ('Installing folium')
    !conda install -c conda-forge folium=0.5.0 --yes
    !pip install folium
    clear_output()
    print ('folium Installed')

In [None]:
# import required libraries

# import folium
import folium

Create a function that searches for venues near each neighborhood.

In [None]:
# import requests to call the Foursquare API
import requests

# create a function that searches for venues near each neighborhood
# CLIENT_ID, CLIENT_SECRET and VERSION must be defined before the function is called
# limit caps the venues returned per neighborhood; the default of 100 is an assumption
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
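
The function above reads a nested JSON response. To make the assumed layout explicit, here is a minimal sketch that flattens a made-up payload without calling the API. `extract_venues` is a hypothetical helper, and the payload mirrors only the keys that `getNearbyVenues` reads. The fallback to `None` for a venue without categories is an addition for illustration.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples.

    `extract_venues` is a hypothetical helper; the payload layout mirrors the
    keys that getNearbyVenues reads and is otherwise an assumption.
    """
    items = payload['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Made-up payload with one categorized and one uncategorized venue
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Cafe A',
                   'location': {'lat': 43.65, 'lng': -79.38},
                   'categories': [{'name': 'Coffee Shop'}]}},
        {'venue': {'name': 'Spot B',
                   'location': {'lat': 43.66, 'lng': -79.39},
                   'categories': []}},
    ]}]}
}
venue_rows = extract_venues(sample_payload)
print(venue_rows)
```

Testing the parsing on a hand-written payload like this helps separate parsing problems from network or credential problems.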

In [None]:
# Get the Coord of Toronto
g= geocoder.arcgis('Toronto, Canada')

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[g.lat, g.lng], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
map_toronto

#### Start Clustering Data

Let's simplify the above map and segment and cluster only the neighborhoods in Toronto. So let's slice the original dataframe and create a new dataframe of the Toronto data.

In [None]:
# Select the subset of the dataframe whose borough name contains 'Toronto'
df_simplified=df_final.copy()
df_toronto = df_simplified[df_simplified['Borough'].str.contains('Toronto')]
df_toronto.shape
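
The borough filter relies on a boolean mask built by `str.contains`. A small sketch on made-up rows shows what the mask keeps. The toy dataframe is an assumption for illustration, and `regex=False` is used here so that the pattern is treated as a plain substring.

```python
import pandas as pd

# Made-up rows using borough names of the kind found in the postal code table
toy = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto', 'Scarborough'],
    'Neighborhood': ['Harbourfront', 'Parkwoods', 'The Beaches', 'Woburn'],
})

# str.contains builds a boolean mask; regex=False treats 'Toronto' as a literal substring
mask = toy['Borough'].str.contains('Toronto', regex=False)
toy_toronto = toy[mask].reset_index(drop=True)
print(toy_toronto)
```

Only the rows whose borough name includes the word Toronto remain, which is the same selection the cell above applies to `df_simplified`.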

In [None]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

# get the nearby venues for every Toronto neighborhood
df_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                            latitudes=df_toronto['Latitude'],
                            longitudes=df_toronto['Longitude'])
# check the size of the resulting dataframe
print(df_venues.shape)
df_venues.head()
# check how many venues were returned for each neighborhood
df_venues.groupby('Neighborhood').count()
# find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(df_venues['Venue Category'].unique())))
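
The two summaries above (venues per neighborhood and number of distinct categories) can be tried on a small made-up table first. This is a sketch under assumptions: the toy venues are invented, and `size()` and `nunique()` are used as compact alternatives to `count()` and `len(...unique())`.

```python
import pandas as pd

# Made-up venues in the same column layout that getNearbyVenues returns
toy_venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Harbourfront'],
    'Venue': ['Cafe A', 'Park B', 'Cafe C'],
    'Venue Category': ['Coffee Shop', 'Park', 'Coffee Shop'],
})

# size() gives one venue count per neighborhood instead of a count per column
venues_per_neighborhood = toy_venues.groupby('Neighborhood').size()

# nunique() counts the distinct categories directly
n_categories = toy_venues['Venue Category'].nunique()

print(venues_per_neighborhood)
print('There are {} unique categories.'.format(n_categories))
```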

# Analyze Each Neighborhood