# Coursera Capstone Project
Anthony Suárez

This notebook contains my Capstone project for the IBM Data Science Specialization.

## Week 1

In [1]:
import pandas as pd
import numpy as np
import requests
import bs4
from bs4 import BeautifulSoup

In [2]:
print("Hello Coursera Capstone Project!")

Hello Coursera Capstone Project!


## Week 3

### Collect data about Toronto neighborhoods

I will use Beautiful Soup to do web scraping and get data from Wikipedia.

In [3]:
page_url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto#Lists_of_city-designated_neighbourhoods"
page_html = requests.get(page_url, timeout=10)

page_html

<Response [200]>

In [4]:
toronto_soup = BeautifulSoup(page_html.content, "html.parser")  # name the parser explicitly to avoid a warning
# print(toronto_soup.prettify())

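We are interested in the "Multiple listing service districts and neighbourhoods" table. In the browser it has the classes "wikitable sortable jquery-tablesorter", but the last class is added by JavaScript after the page loads, so in the downloaded HTML we search for "wikitable sortable".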

In [5]:
districts_table = toronto_soup.find("table", {"class": "wikitable sortable"})
# districts_table.__dict__

In [6]:
districts_df = pd.read_html(str(districts_table))
districts_df = districts_df[0]
districts_df.head()

Unnamed: 0,District Number,Neighbourhoods Included
0,C01,"Downtown, Harbourfront, Little Italy, Little P..."
1,C02,"The Annex, Yorkville, South Hill, Summerhill, ..."
2,C03,"Forest Hill South, Oakwood–Vaughan, Humewood–C..."
3,C04,"Bedford Park, Lawrence Manor, North Toronto, F..."
4,C06,"North York, Clanton Park, Bathurst Manor"


Now that we have parsed the table from Wikipedia, we have to get the data from each neighborhood.

In [7]:
neighborhoods = []

for row in districts_df["Neighbourhoods Included"]:
    n_in_district = row.split(', ')
    neighborhoods.extend(n_in_district)
    
print(str(len(neighborhoods)) + ' neighborhoods found.')

225 neighborhoods found.


Now we have a list of 225 individual neighborhoods in Toronto. As almost every one of them has a Wikipedia page under its own name, we can use those pages to extract the coordinates of each neighborhood.

In [8]:
neighborhoods_df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
neighborhoods_df.head()

Unnamed: 0,Neighborhood
0,Downtown
1,Harbourfront
2,Little Italy
3,Little Portugal
4,Dufferin Grove


In [9]:
def find_wiki_coords(page):
    possible_titles = [
        page.replace(' ', '_') + "_Toronto",
        page.replace(' ', '_') + ",_Toronto",
        "Toronto_" + page.replace(' ', '_') ,
        "Toronto,_" + page.replace(' ', '_') ,
        page.replace(' ', '_')
    ]
    
    possible_urls = []
    for title in possible_titles:
        possible_urls.append("https://en.wikipedia.org/wiki/" + title)
    
    for url in possible_urls:
        wiki_page = requests.get(url, timeout=10)
    
        if wiki_page.status_code == 200:
            soup = BeautifulSoup(wiki_page.content, "html.parser")
            latitude = soup.find("span", {"class": "latitude"})
            longitude = soup.find("span", {"class": "longitude"})

            if latitude and longitude:
                return [latitude.text, longitude.text]
            
    return None
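
The list of titles that `find_wiki_coords` tries can also be built on its own, which makes the guessing order easy to inspect without any network calls. This is only a minimal sketch: the helper name `candidate_wiki_urls` is hypothetical, and the explicit percent-encoding with `urllib.parse.quote` is shown for illustration.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def candidate_wiki_urls(name):
    """Return the Wikipedia URLs to try for a neighborhood, in the order they are tried.

    Hypothetical helper: it mirrors the title patterns used in find_wiki_coords.
    """
    slug = name.replace(" ", "_")
    titles = [
        slug + "_Toronto",
        slug + ",_Toronto",
        "Toronto_" + slug,
        "Toronto,_" + slug,
        slug,
    ]
    # quote() percent-encodes characters such as the en dash; "," stays readable
    return [WIKI_BASE + quote(title, safe=",_") for title in titles]

urls = candidate_wiki_urls("Little Italy")
print(urls[1])
```

The plain name comes last, so a generic page is only used when none of the Toronto-specific titles returns coordinates.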

The following code will find the coordinates of each neighborhood in Toronto. It takes a bit of time to run, so the resulting dataframe was saved in a .csv file.

```python
latitudes = []
longitudes = []

for neighborhood in neighborhoods_df["Neighborhood"]:
    coords = find_wiki_coords(neighborhood)
    
    if coords:
        latitudes.append(coords[0])
        longitudes.append(coords[1])
    else:
        latitudes.append(None)
        longitudes.append(None)

neighborhoods_df["Latitude"] = latitudes
neighborhoods_df["Longitude"] = longitudes
neighborhoods_df.to_csv("data/toronto_neighborhoods_coords.csv", index=False)

neighborhoods_df.head()
```
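
The "run once, then reload from disk" idea can be wrapped in a small helper so the slow step only runs when the file is missing. This is a minimal sketch using only the standard library; `load_or_build_rows` and its parameters are hypothetical names, and `build_rows` stands in for the scraping loop.

```python
import csv
import os

def load_or_build_rows(path, build_rows, fieldnames):
    """Return rows from `path` if it exists; otherwise build them and save them.

    Hypothetical helper: `build_rows` is any zero-argument function that
    returns a list of dicts (for example, a wrapper around the scraping loop).
    """
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    rows = build_rows()
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # the folder must exist before writing
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that rows loaded from the file come back as strings, so any numeric columns still need converting afterwards.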

In [10]:
neighborhoods_df = pd.read_csv("data/toronto_neighborhoods_coords.csv")
neighborhoods_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Downtown,43°39′9.01″N,79°23′0.81″W
1,Harbourfront,43°38′17″N,79°23′06″W
2,Little Italy,43°39′18″N,79°24′47″W
3,Little Portugal,43°39′00″N,79°26′08″W
4,Dufferin Grove,43°39′25″N,79°25′41″W


In [11]:
# For folium we need coordinates as decimals.

def dms_to_dd(coords_str):
    
    if isinstance(coords_str, str):
        new_str = coords_str[:-2]
        delimiters = ["°", "′"]

        for delimiter in delimiters:
            new_str = new_str.replace(delimiter, ',')

        dms = new_str.split(',')
        dms = dms + [0, 0, 0]
        degrees = float(dms[0])
        minutes = float(dms[1])
        seconds = float(dms[2])

        decimal = degrees + (minutes / 60) + (seconds / 3600)
        return decimal
    return None
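
As a worked example, 43°39′9.01″N is 43 + 39/60 + 9.01/3600 ≈ 43.652503. An alternative sketch of the conversion is shown here: it reads the hemisphere letter instead of slicing it off, so West and South come out negative directly. The name `dms_to_signed_dd` and the regular expression are my own assumptions about the input format, based on the strings shown in the table above.

```python
import re

DMS_PATTERN = re.compile(
    r"(?P<deg>\d+(?:\.\d+)?)°"
    r"(?:(?P<min>\d+(?:\.\d+)?)′)?"
    r"(?:(?P<sec>\d+(?:\.\d+)?)″)?"
    r"(?P<hemi>[NSEW])"
)

def dms_to_signed_dd(text):
    """Convert a string such as '79°23′0.81″W' to signed decimal degrees."""
    match = DMS_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention
    return -decimal if match.group("hemi") in "SW" else decimal

print(round(dms_to_signed_dd("43°39′9.01″N"), 6))  # 43.652503
print(round(dms_to_signed_dd("79°23′0.81″W"), 6))  # -79.383558
```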

In [12]:
decimal_latitudes = []
decimal_longitudes = []

for latitude in neighborhoods_df['Latitude']:
    decimal_latitudes.append(dms_to_dd(latitude))
    
for longitude in neighborhoods_df['Longitude']:
    decimal_longitudes.append(dms_to_dd(longitude))
    
neighborhoods_df['Latitude'] = decimal_latitudes
neighborhoods_df['Longitude'] = decimal_longitudes
neighborhoods_df['Longitude'] = neighborhoods_df['Longitude'] * -1 # Had to multiply by -1 because Toronto is west

neighborhoods_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Downtown,43.652503,-79.383558
1,Harbourfront,43.638056,-79.385
2,Little Italy,43.655,-79.413056
3,Little Portugal,43.65,-79.435556
4,Dufferin Grove,43.656944,-79.428056


In [13]:
print(neighborhoods_df.shape)

(225, 3)


In [14]:
# Drop nans
neighborhoods_df = neighborhoods_df.dropna(axis=0)
print(neighborhoods_df.shape)

(163, 3)


### Visualize Toronto Neighborhoods

In [15]:
# !pip install folium
import folium

In [16]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map

If we zoom out on the map, we can see that some neighborhood coordinates are wrong. This may be due to the way I got the coordinates from Wikipedia, since a guessed page title can belong to a different place with the same name. I will remove those neighborhoods manually.

In [17]:
neighborhoods_df[neighborhoods_df['Neighborhood'] == 'Hunt Club'].index

Int64Index([126], dtype='int64')

In [18]:
wrong_neighborhoods = ['Westmount', 'Adelaide']

for neighborhood in wrong_neighborhoods:
    neighborhoods_df = neighborhoods_df[neighborhoods_df['Neighborhood'] != neighborhood]
    
neighborhoods_df.shape

(161, 3)
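
Instead of spotting outliers by eye, one option is to reject coordinates that fall outside a rough bounding box around the city. This is a minimal sketch: the box limits are approximate values I am assuming for illustration, not official city boundaries, and `in_toronto_bbox` is a hypothetical helper.

```python
# Approximate bounding box around Toronto (assumed values, for illustration only)
LAT_MIN, LAT_MAX = 43.55, 43.90
LON_MIN, LON_MAX = -79.70, -79.10

def in_toronto_bbox(lat, lon):
    """Return True when the point lies inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

points = {
    "Downtown": (43.652503, -79.383558),  # from the table above
    "Far away": (45.500000, -73.600000),  # made-up point outside the box
}
kept = [name for name, (lat, lon) in points.items() if in_toronto_bbox(lat, lon)]
print(kept)  # ['Downtown']
```

The same function could be applied row by row to `neighborhoods_df` to build a boolean mask, which would replace the hard-coded list of names.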

In [19]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map

### Explore venues with Foursquare API

In [20]:
from dotenv import load_dotenv # Use pip to install python-dotenv package
import os

In [21]:
load_dotenv()

# If you are going to run this notebook, please add your own Foursquare credentials in a .env file.
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
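
`os.getenv` returns `None` when a variable is not set, so a missing `.env` file only shows up later as a failed API request. One way to fail early is sketched here; `require_env` is a hypothetical helper, and the demo uses a stand-in dictionary rather than real credentials.

```python
import os

def require_env(names, env=os.environ):
    """Return the requested variables, raising early if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"CLIENT_ID": "abc", "CLIENT_SECRET": ""}
try:
    require_env(["CLIENT_ID", "CLIENT_SECRET"], env=demo_env)
except RuntimeError as error:
    print(error)  # Missing environment variables: CLIENT_SECRET
```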

In [22]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500

As a test, I'll explore the first neighborhood in the dataframe.

In [28]:
test_neighborhood = neighborhoods_df.loc[0]['Neighborhood']
test_lat = neighborhoods_df.loc[0]['Latitude']
test_long = neighborhoods_df.loc[0]['Longitude']

In [29]:
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    test_lat, 
    test_long, 
    radius, 
    LIMIT)

results = requests.get(url, timeout=10)
results = results.json()
# results
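
The long format string can also be written as a dictionary of query parameters, which is easier to read and lets the standard library handle the encoding. This is a minimal sketch with placeholder credentials; `build_explore_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the same explore URL as above from a dict of query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the comma in "ll" as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, not real ones
print(build_explore_url("ID", "SECRET", "20180605", 43.652503, -79.383558, 500, 100))
```

With `requests`, the same dictionary can be passed directly through the `params` argument of `requests.get`, which builds the query string for you.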

In [30]:
# Function that extracts the category of the venue. By Coursera.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened rows use the 'venue.' prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [31]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

KeyError: 'groups'
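
A request can come back without any venue data, so it can help to extract the items defensively instead of indexing straight into the JSON. This is a minimal sketch, assuming the response layout used in this notebook (`response` → `groups[0]` → `items`); `extract_items` is a hypothetical helper and the payloads are hand-written for illustration.

```python
def extract_items(payload):
    """Return the venue items from an explore response, or [] if there are none.

    Assumes the layout used in this notebook: response -> groups[0] -> items.
    """
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

# Hand-written payloads for illustration (not real API output)
ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Example Cafe"}}]}]}}
error_payload = {"meta": {"code": 400}, "response": {}}

print(len(extract_items(ok_payload)))  # 1
print(extract_items(error_payload))    # []
```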

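When the request succeeds, this cell shows the first venues near the test neighborhood. The `KeyError: 'groups'` in the saved output means the response contained no `groups` entry, which typically indicates that the request was rejected (for example, because the credentials were missing when the cell was rerun). Now, it's time to repeat the process for all of Toronto's neighborhoods. The following function is also featured in the Coursera IBM Data Science course: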

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

As the next piece of code takes some time to run, I exported the results to a .csv file.

```python
neighborhoods_venues = getNearbyVenues(names=neighborhoods_df['Neighborhood'],
                                      latitudes=neighborhoods_df['Latitude'],
                                      longitudes=neighborhoods_df['Longitude'])

neighborhoods_venues.to_csv('data/toronto_neighborhoods_venues.csv', index=False)
```

In [None]:
neighborhoods_venues = pd.read_csv('data/toronto_neighborhoods_venues.csv')
neighborhoods_venues.head()

In [34]:
neighborhoods_venues['Venue Category'].unique().shape

(307,)

Perfect! Now we have used Foursquare data to obtain information about venues close to each neighborhood in Toronto. The next step is processing the obtained data to feed it into our clustering algorithm.

### Data wrangling

It's time to prepare our data so it can be fed into the machine learning algorithm.

In [35]:
# From IBM/Coursera Course:
# one hot encoding
neighborhoods_onehot = pd.get_dummies(neighborhoods_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neighborhoods_onehot['Neighborhood'] = neighborhoods_venues['Neighborhood'] 

# move neighborhood column to the first column
first_col = neighborhoods_onehot.pop("Neighborhood")
neighborhoods_onehot.insert(0, "Neighborhood", first_col)

In [36]:
neighborhoods_grouped = neighborhoods_onehot.groupby("Neighborhood").mean().reset_index()
neighborhoods_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport Service,American Restaurant,Amphitheater,Antique Shop,Aquarium,Arcade,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Baby Point,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Banbury,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0
4,Bathurst Manor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
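
Taking the mean of the one-hot columns per neighborhood gives the share of that neighborhood's venues in each category. The same arithmetic can be sketched without pandas on a few made-up rows (the neighborhoods "A" and "B" and their venues are invented for illustration):

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, venue category) pairs for illustration
venues = [
    ("A", "Park"), ("A", "Park"), ("A", "Café"), ("A", "Bakery"),
    ("B", "Pizza Place"), ("B", "Gas Station"),
]

counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Relative frequency = category count / number of venues in the neighborhood
frequencies = {
    neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for neighborhood, counter in counts.items()
}
print(frequencies["A"]["Park"])         # 0.5
print(frequencies["B"]["Pizza Place"])  # 0.5
```

This is why a neighborhood with only two venues shows values of 0.5: each row of the grouped table sums to 1.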


Print each neighborhood's most common venues.

In [37]:
num_top_venues = 5

for hood in neighborhoods_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = neighborhoods_grouped[neighborhoods_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

   Park  0.25
1  Residential Building (Apartment / Condo)  0.25
2                              Tennis Court  0.25
3                  Mediterranean Restaurant  0.25
4                         Accessories Store  0.00


----Humber Heights – Westmount----
         venue  freq
0  Gas Station   0.5
1  Pizza Place   0.5
2  Music Venue   0.0
3  Opera House   0.0
4       Office   0.0


----Humber Summit----
                     venue  freq
0                   Bakery   0.5
1              Pizza Place   0.5
2        Accessories Store   0.0
3  New American Restaurant   0.0
4             Optical Shop   0.0


----Humber Valley Village----
               venue  freq
0       Cupcake Shop  0.17
1          BBQ Joint  0.17
2             Garden  0.17
3  Electronics Store  0.17
4               Park  0.17


----Humewood-Cedarvale----
               venue  freq
0              Trail  0.25
1  Convenience Store  0.25
2       Hockey Arena  0.25
3              Field  0.25
4  Accessories Store  0.00


----Humewood–C

Put it into a dataframe

In [38]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [39]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # 4th and above all use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = neighborhoods_grouped['Neighborhood']

for ind in np.arange(neighborhoods_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neighborhoods_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Park,Zoo,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Field
1,Alderwood,Market,Playground,Athletics & Sports,Park,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm
2,Baby Point,Pool Hall,Athletics & Sports,Playground,Café,Taco Place,Latin American Restaurant,Burger Joint,Bar,Bakery,Coffee Shop
3,Banbury,Restaurant,Pizza Place,Women's Store,Bank,Coffee Shop,Spa,Clothing Store,Movie Theater,Sushi Restaurant,Supermarket
4,Bathurst Manor,Playground,Convenience Store,Arcade,Baseball Field,Park,Zoo,Fast Food Restaurant,Escape Room,Ethiopian Restaurant,Event Space
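
The column names rely on a list of suffixes plus a fallback to "th", which is fine for ten venues but would produce "21th" for larger values. A small general-purpose sketch of the suffix rule is shown here; `ordinal` is a hypothetical helper, not part of the course code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are exceptions to the st/nd/rd rule
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns[1], "|", columns[4], "|", columns[10])
```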


## Cluster Neighborhoods

In [40]:
from sklearn.cluster import KMeans

In [41]:
# set number of clusters
kclusters = 5

neighborhoods_grouped_clustering = neighborhoods_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neighborhoods_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 0, 0, 0, 0, 0, 0, 1, 0])
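
Before mapping the clusters, it can be useful to check how many neighborhoods ended up in each one, since a very uneven split is a hint that the clusters may not be very informative. This is a minimal sketch using only the ten labels printed above; in the notebook the full `kmeans.labels_` array would be used instead.

```python
from collections import Counter

# The first ten labels shown above; in the notebook this would be kmeans.labels_
labels = [2, 2, 0, 0, 0, 0, 0, 0, 1, 0]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print("Cluster {}: {} neighborhoods".format(cluster, size))
```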

In [42]:
# add clustering labels

try:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
except ValueError:  # the column already exists when the cell is rerun
    pass

neighborhoods_merged = neighborhoods_df

neighborhoods_merged = neighborhoods_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')  

neighborhoods_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown,43.652503,-79.383558,0.0,Coffee Shop,Hotel,Clothing Store,Restaurant,Café,Breakfast Spot,Juice Bar,Seafood Restaurant,Sushi Restaurant,Salad Place
1,Harbourfront,43.638056,-79.385,0.0,Boat or Ferry,Aquarium,Coffee Shop,Restaurant,Sports Bar,Baseball Stadium,Brewery,Café,Pizza Place,Music Venue
2,Little Italy,43.655,-79.413056,0.0,Café,Bar,Italian Restaurant,Sushi Restaurant,Cocktail Bar,Asian Restaurant,Sandwich Place,Korean Restaurant,Burger Joint,Pizza Place
3,Little Portugal,43.65,-79.435556,0.0,Coffee Shop,Café,Bar,Grocery Store,Pizza Place,Restaurant,Bakery,Boutique,Vietnamese Restaurant,Breakfast Spot
4,Dufferin Grove,43.656944,-79.428056,0.0,Restaurant,Bar,Bakery,Coffee Shop,Market,Farmers Market,Sushi Restaurant,Sports Bar,Gastropub,Beer Store


In [43]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [44]:
# create map
map_clusters = folium.Map(location=[toronto_coords[0], toronto_coords[1]], zoom_start=11)

# set color scheme: one color per cluster, plus one for neighborhoods without a label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters + 1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(neighborhoods_merged['Latitude'], neighborhoods_merged['Longitude'], neighborhoods_merged['Neighborhood'], neighborhoods_merged['Cluster Labels']):
    if np.isnan(cluster):
        cluster = kclusters  # no venue data, so no cluster label
    else:
        cluster = int(cluster)

    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # index by label so cluster 0 and unlabeled points get different colors
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters