# IBM Data Science Professional Certificate
## Part 9: Applied Data Science Capstone - Week 3

<a id="0"></a>
## Notebook Table of Contents:
* [Part 1: Scraping Wikipedia and Create a Cleaned Dataframe](#1)
    * [Requirement 1: Create New Workbook.](#1.1)
    * [Requirement 2: Scrape the Data from Wikipedia.](#1.2)
    * [Requirement 3: Generate Clean Dataframe.](#1.3)
    * [Requirement 4: Submit to Github and Link.](#1.4)
* [Part 2: Geocoding the Dataframe.](#2)
* [Part 3: Visually Explore and Cluster Neighbourhoods.](#3)

### This is Part 1: Scraping Wikipedia and Create a Cleaned Dataframe <a name="1"></a>
For this part of the assignment, I am exploring and clustering the neighborhoods in Toronto.
Each portion of the problem is broken out on its own.

#### Requirement 1: Create a new Notebook - Done. <a name="1.1"></a>

[Back to Table of Contents](#0)

#### Requirement 2: Scrape the data from Wikipedia <a name="1.2"></a>

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

import requests # library to handle requests
from bs4 import BeautifulSoup

print('Done importing.')

Done importing.


In [2]:
#assign the get request to a variable
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [3]:
#check the status of the request - should read 200 for success.
website_url.status_code

200

In [4]:
#Pull the html from the website as text
fullPage = BeautifulSoup(website_url.text,"lxml")
#print(fullPage.prettify())      
#uncomment print to see all the webpage html - takes up too much space, so commented out.

[Back to Table of Contents](#0)

#### Requirement 3 - Turn the scraped Wikipedia data into a pandas dataframe. <a name="1.3"></a>

In [5]:
#Isolate just the table.
wikiTable = fullPage.find("table",{"class":"wikitable sortable"})

In [6]:
#Isolate just the rows in the table.
tableRows = wikiTable.findAll('tr')

Sub-criteria 1: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [7]:
#Loop through the rows of the table and add them to a dataframe
postCode = []
for tr in tableRows:
    cells = tr.findAll('td')
    row = [cell.text for cell in cells] # .text is important here so we don't get html tags.
    if row: # the header row has <th> cells only, so its list is empty and is skipped
        postCode.append(row)
            
df = pd.DataFrame(postCode,columns=["Postal Code", "Borough", "Neighbourhood"])

#Clean up the messy carriage returns.
df = df.replace(r"\n","", regex=True)

#Display the dataframe header to check.
df.head()      # Still has "Not Assigned"

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Sub-criteria 2: Ignore cells with a Borough that is "Not Assigned"

In [8]:
#ignore cells with Borough = "Not assigned"
df = df[df.Borough != "Not assigned"]

#Display the dataframe
df.head()     # Still needs to be grouped

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Sub-criteria 3: Group neighbourhoods based on their PostalCode; comma separated.

In [9]:
#Grouping Neighborhoods by PostalCode and Borough - comma separated.
df = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].agg(lambda x: ', '.join(set(x))).reset_index()

#Display the dataframe
df.head()      # Still need to assign "Not Assigned" Neighborhoods to Borough name.

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [10]:
#Specific Check: M5A from example to make sure it matches.
df.loc[df["Postal Code"] == "M5A"]

Unnamed: 0,Postal Code,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Regent Park, Harbourfront"
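
Joining `set(x)` removes duplicates but does not guarantee order, which is why M5A shows "Regent Park, Harbourfront" here although the rows were scraped in the opposite order. A minimal sketch of an order-preserving variant follows, assuming the same column names; the `join_unique` helper and the sample rows are illustrative only and are not part of the assignment.

```python
import pandas as pd

# Illustrative rows in the same shape as the scraped table (not the full dataset).
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Harbourfront", "Parkwoods"],
})

def join_unique(values):
    """Join a Series with ', ', dropping duplicates but keeping first-seen order."""
    # Series.unique() returns values in order of first appearance.
    return ", ".join(values.unique())

grouped = (
    sample.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(join_unique)
    .reset_index()
)
print(grouped)
```

With this variant the joined string is reproducible between runs, which makes the specific checks on M5A and M7A easier to compare against the example.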


Sub-criteria 4: If Borough has name and Neighbourhood does not, use Borough name as Neighborhood name. (See M7A.)

In [11]:
#Check to see which rows need to be fixed.
df[df["Neighbourhood"].str.contains("Not assign")]

Unnamed: 0,Postal Code,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


In [12]:
#replace unassigned Neighborhoods with Borough name
df["Neighbourhood"] = np.where(df["Neighbourhood"].str.contains("Not assign"), df["Borough"], df["Neighbourhood"])

#Check M7A from example to make sure it matches.
df.loc[df["Postal Code"] == "M7A"]

Unnamed: 0,Postal Code,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park


In [13]:
df.head()     #Final check on head

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [14]:
df.shape

(103, 3)
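
The four sub-criteria can also be checked in one place rather than by eye. The sketch below is one possible way to do that, assuming the column names used above; `check_clean` and both sample frames are hypothetical and exist only for illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Postal Code", "Borough", "Neighbourhood"]

def check_clean(frame):
    """Return a list of problems found in a cleaned postal-code frame (empty if none)."""
    if list(frame.columns) != EXPECTED_COLUMNS:
        return ["unexpected columns"]
    problems = []
    if frame["Postal Code"].duplicated().any():
        problems.append("duplicate postal codes")
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough")
    if frame["Neighbourhood"].str.contains("Not assigned").any():
        problems.append("unassigned neighbourhood")
    return problems

# Illustrative frames: one that passes and one that still needs cleaning.
good = pd.DataFrame(
    [["M1G", "Scarborough", "Woburn"], ["M7A", "Queen's Park", "Queen's Park"]],
    columns=EXPECTED_COLUMNS,
)
bad = pd.DataFrame(
    [["M5A", "Downtown Toronto", "Harbourfront"],
     ["M5A", "Downtown Toronto", "Regent Park"],
     ["M7A", "Queen's Park", "Not assigned"]],
    columns=EXPECTED_COLUMNS,
)

print(check_clean(good))
print(check_clean(bad))
```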

[Back to Table of Contents](#0)

#### Requirement 4: Submit a link to Notebook on Github. <a name="1.4"></a>
 ~~ Should have worked if you're reading this ~~

[Back to Table of Contents](#0)

### This is Part 2: Geocoding the Dataframe <a name="2"></a>
For this part of the assignment, I take the previously created dataframe and assign the appropriate Latitude / Longitude values to each PostalCode.

In [15]:
# Retrieve csv file with lat/lon data.
!wget -q -O 'Geospatial_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded')

Data downloaded


In [16]:
# Load the data to a dataframe.
geo_data = pd.read_csv("Geospatial_data.csv")

geo_data.head() # Verify working as intended.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
# Merge the dataframe from part 1 with the geospatial data.
mergedData = pd.merge(df, geo_data, on="Postal Code")

mergedData.head() # Verify working as intended.

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"West Hill, Morningside, Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
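
`pd.merge` defaults to an inner join, so a postal code with no match in the geospatial file would be dropped without any warning. The sketch below shows one way to surface such rows, using the `how`, `validate` and `indicator` parameters of `pd.merge`. The two small frames are illustrative, and the postal code "M9Z" is a made-up value used only to force a mismatch.

```python
import pandas as pd

# Illustrative left frame; "M9Z" is a hypothetical code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row from the left frame, validate raises if a postal
# code is duplicated on either side, and indicator adds a "_merge" column.
checked = pd.merge(
    left, right, on="Postal Code", how="left",
    validate="one_to_one", indicator=True,
)

missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty, the inner join used above loses nothing, and the row count of `mergedData` should equal that of `df`.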


#### Submit a link to Notebook on Github.
 ~~ Should have worked if you're reading this ~~

[Back to Table of Contents](#0)

### This is Part 3: Visually Explore and Cluster Neighbourhoods <a name="3"></a>
For this part of the assignment, I visualize, explore and cluster the neighborhoods in Toronto. The approach is modeled on the NY exercise.

In [18]:
#importing additional libraries for visualizations

import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0+ exposes this as pd.json_normalize)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # the samples_generator submodule is not available in newer scikit-learn releases

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Additional Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Additional Libraries imported.


In [19]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: <redacted>
CLIENT_SECRET: <redacted>


In [22]:
# Define borough's latitude and longitude values.

borough_name = mergedData.loc[0, "Borough"] # Borough Name
borough_code = mergedData.loc[0, "Postal Code"] # Borough Postal Code

borough_lat = mergedData.loc[0, "Latitude"] # Borough Latitude Value
borough_lon = mergedData.loc[0, "Longitude"] # Borough Longitude Value

In [35]:
# create map of all Boroughs in our data using latitude and longitude values
map_mergedData = folium.Map(location=[borough_lat, borough_lon], zoom_start=11)

# add markers to map
for lat, lng, postcode, borough, neighbourhood in zip(mergedData["Latitude"], mergedData["Longitude"], mergedData["Postal Code"], mergedData["Borough"], mergedData["Neighbourhood"]):
    label="Borough: {} ({});   Neighbourhood(s): {}".format(borough, postcode, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='black', #black border for all markers
        fill=True,
        fill_color='seagreen', #seagreen fill
        fill_opacity=0.7,
        parse_html=True).add_to(map_mergedData)  
    
map_mergedData
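
The map above is centered on the first row of `mergedData` (M1B in Scarborough), which places the center toward the eastern edge of the data. One alternative, sketched here under the assumption that a simple average is good enough at city scale, is to center on the mean of all coordinates. The three rows below are copied from the `geo_data` preview and stand in for the full frame.

```python
import pandas as pd

# First three coordinate pairs from the geospatial preview, used as a stand-in.
coords = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Plain Python floats, in the [lat, lon] order that folium.Map(location=...) expects.
center = [float(coords["Latitude"].mean()), float(coords["Longitude"].mean())]
print(center)
```

In the notebook, `center` could then be passed as `location=center` in place of `[borough_lat, borough_lon]`.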

In [None]:
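# Assumption: keep only boroughs whose name contains "Toronto", mirroring the single-borough subset in the NY exercise.
toronto_data = mergedData[mergedData["Borough"].str.contains("Toronto")].reset_index(drop=True)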

#ignore cells with Borough = "Not assigned"
#df = df[df.Borough != "Not assigned"]

#Check to see which rows need to be fixed.
#df[df["Neighbourhood"].str.contains("Not assign")]

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
results = requests.get(url).json()
results

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
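
A short usage sketch for `get_category_type` follows. The function is repeated so the block runs on its own, and the sample rows are hypothetical dictionaries shaped like the `categories` field in the code above, not real Foursquare responses.

```python
def get_category_type(row):
    """Return the name of the first category in a venue row, or None if there is none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows covering the three cases the function handles.
plain_row = {"categories": [{"name": "Park"}]}
flat_row = {"venue.categories": [{"name": "Coffee Shop"}, {"name": "Bakery"}]}
empty_row = {"venue.categories": []}

for row in (plain_row, flat_row, empty_row):
    print(get_category_type(row))
```

Only the first category is kept, so a venue tagged as both "Coffee Shop" and "Bakery" is counted as a coffee shop.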

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
print("{} venue(s) were returned by Foursquare.".format(nearby_venues.shape[0]))

In [None]:
def getTorontoVenues(names, latitudes, longitudes, radius=500):
    
    venues_list = [] # collect one list of venue tuples per neighbourhood
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
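
The request URL in `getTorontoVenues` is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary so that each parameter is named and URL-encoded. `build_explore_url` is a hypothetical helper, the credentials and version string are placeholders, and the default `limit` of 100 is an assumption because `LIMIT` is not defined in the visible cells.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL for one location without sending any request."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials and version; real values come from the hidden credentials cell.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "YYYYMMDD", 43.806686, -79.194353)
print(url)
```

Keeping URL construction separate from the GET request also makes it possible to inspect the URL for one neighbourhood before looping over all of them.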