# Applied Data Science Capstone

_Course provided by IBM in [Coursera.org](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome)._

Please follow the [CourseraCapstone repository](https://github.com/eric-dvlp/CourseraCapstone) to acess all the files used within this project.

## 3rd Week Assignment: Segmenting and Clustering Neighborhoods in Toronto

In this project, we'll explore and cluster the neighborhoods in Toronto, Canada.

* [First part:](#1st) Collecting data
* [Second part:](#2nd) Augmenting data with coordinates
* [Third part:](#3rd) Analysing clustered data

***
<div id='1st'/>

### First part

***

In [None]:
import pandas as pd

First, let us get postal codes of Canada from wikipedia using pandas method `pandas.read_html`:

In [1]:
page = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

This returned a list of contents from the webpage. We only need the first element in `df`:

In [2]:
df = page[0]
df.rename(columns={'Postal Code':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Data wrangling

In `df` we have some _Borough_ informed as Not assigned, we should drop those:

In [3]:
print('Rows before:', df.shape[0])
df.drop( df[df['Borough'] == 'Not assigned'].index, inplace=True)
print('Rows after:', df.shape[0])

Rows before: 180
Rows after: 103


At first, this dataset had duplicated _Boroughs_. This can show it has no more:

In [4]:
df_grouped = df.groupby(['PostalCode']).count()
df_grouped[ df_grouped['Borough'] > 1]

Unnamed: 0_level_0,Borough,Neighborhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1


Just another checking that, in the values previously duplicated, it has more then one Neighbourhood comma separated:

In [5]:
df[(df['PostalCode'] == 'M5A') | (df['PostalCode'] == 'M5V')]

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
139,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."


Now, we don't have invalid values:

In [6]:
print('Not assigned Neighborhood      ->', (df['Neighborhood'] == 'Not assigned').sum(),
      '\nNeighborhood equals to Borough ->', (df['Neighborhood'] == df['Borough']).sum(),
      '\nFinal dataframe shape:', df.shape
     )

Not assigned Neighborhood      -> 0 
Neighborhood equals to Borough -> 0 
Final dataframe shape: (103, 3)


***
<div id='2nd'/>

### Second part

***

In [7]:
import geocoder

In this session, we'll complement `df` with coordinates to use in Foursquare API.

First, let's crate a new df `df2` to receive previous data plus coordinates:

In [108]:
column_names = ['PostalCode','Borough','Neighborhood', 'Latitude', 'Longitude']
df2 = pd.DataFrame(columns=column_names)
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude


The [course task](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit) asked to use geocoder library to retrieve coordinates to every one of the 103 samples. 

In [14]:
#lat_lng_coords = None

#while(lat_lng_coords is None):
#    g = geocoder.google('M5A, Toronto, Ontario')
#    lat_lng_coords = g.latling
    
#print(g.latlng)

Since it took over 10 minutes to complete the code above for only 1 sample _(in 18/08/2020, 10h30)_, it will be used the provided [coordinates dataset](https://cocl.us/Geospatial_data) from __Cognitive Class__ (Coursera).

In [38]:
latlong = pd.read_csv('https://cocl.us/Geospatial_data')
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Let's merge `df` and `latlong` datasets according their _PostalCode_ and create the new `df2`:

In [109]:
for i, sample in enumerate(df['PostalCode']):
    df2 = df2.append({'PostalCode': sample,
                      'Borough': df['Borough'].iloc[i],
                      'Neighborhood': df['Neighborhood'].iloc[i],
                      'Latitude': float(latlong['Latitude'].loc[latlong['Postal Code'] == sample].values),
                      'Longitude': float(latlong['Longitude'].loc[latlong['Postal Code'] == sample].values)}, ignore_index=True)

In [111]:
df2.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


***
<div id='3rd'/>

### Third part

***

In [200]:
import numpy as np
import folium
import json
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

In this final session, we'll analyse data from North York, Toronto, clusterizing it according to Foursquare categorization.

First, let's create a separated dataset `df_ny`:

In [199]:
df_ny = df2[df2['Borough'] == 'North York']
df_ny.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
7,M3B,North York,Don Mills,43.745906,-79.352188
10,M6B,North York,Glencairn,43.709577,-79.445073


Let's see what we have about Noth York:

In [201]:
latitude = 43.762948
longitude = -79.431893

map_ny = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, borough, neighborhood in zip(df_ny['Latitude'], df_ny['Longitude'], df_ny['Borough'], df_ny['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ny)  
    
map_ny

Now, we'll get data from Foursquare API to cluster this area.

In [222]:
# @hidden_cell
CLIENT_ID = 'HP2CHSWBWPZAVYIQMEYV2IE1JIAEMRRWX0DVX2LRUTUCS3OC' # your Foursquare ID
CLIENT_SECRET = 'ECOO4IJDBU1JRFS21Z41JXVU3N1CCQ5ZGCNB0PULGJUV23QL' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500

In [203]:
# function that extracts the category of the venue, from Foursquare lab.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

    
# functions from previous lab

# it repeats the get request for all Neighborhoods
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)

# it gets the top_venues from a grouped dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

Here, the `getNearbyVenues` function is used:

In [204]:
ny_venues = getNearbyVenues(names=df_ny['Neighborhood'],
                                   latitudes=df_ny['Latitude'],
                                   longitudes=df_ny['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


Let's use one hot encoding and group data for KNN uses:

In [223]:
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 

fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()
ny_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Supermarket,Supplement Shop,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095238,...,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0
3,Don Mills,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,...,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downsview,0.0,0.066667,0.0,0.0,0.0,0.0,0.066667,0.0,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


And then, let's fix data to it can be processed:

In [208]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Mobile Phone Shop,Shopping Mall,Middle Eastern Restaurant,Park,Pet Store,Pharmacy,Pizza Place,Bridal Shop
1,Bayview Village,Chinese Restaurant,Bank,Café,Japanese Restaurant,French Restaurant,Food Truck,Comfort Food Restaurant,Furniture / Home Store,Construction & Landscaping,Convenience Store
2,"Bedford Park, Lawrence Manor East",Restaurant,Sandwich Place,Coffee Shop,Italian Restaurant,Juice Bar,Café,Butcher,Comfort Food Restaurant,Pharmacy,Pizza Place
3,Don Mills,Gym,Restaurant,Coffee Shop,Japanese Restaurant,Beer Store,Sandwich Place,Clothing Store,Chinese Restaurant,Caribbean Restaurant,Italian Restaurant
4,Downsview,Grocery Store,Park,Gym / Fitness Center,Discount Store,Korean Restaurant,Liquor Store,Shopping Mall,Baseball Field,Bank,Athletics & Sports


Once we have the information, we can build our model and see the ten first results:

In [225]:
kclusters = 5
ny_grouped_clustering = ny_grouped.drop('Neighborhood', 1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 3, 0, 4, 2])

Now, let's merge all data in a single dataframe:

In [210]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
ny_merged = df_ny
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
ny_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Bus Stop,Food & Drink Shop,Fast Food Restaurant,Park,Women's Store,Dim Sum Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,French Restaurant,Coffee Shop,Hockey Arena,Portuguese Restaurant,Women's Store,Diner,Clothing Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Coffee Shop,Vietnamese Restaurant,Athletics & Sports,Bakery,Convenience Store
7,M3B,North York,Don Mills,43.745906,-79.352188,0,Gym,Restaurant,Coffee Shop,Japanese Restaurant,Beer Store,Sandwich Place,Clothing Store,Chinese Restaurant,Caribbean Restaurant,Italian Restaurant
10,M6B,North York,Glencairn,43.709577,-79.445073,3,Park,Japanese Restaurant,Bakery,Pub,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop


The last step is to visualize our clustering. Let's use Folium one last time:

In [212]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Ok, now job is done.

### Thanks for your attention!

<p></p>

_Notebook developed by Eric Galindo in August/2020 in São Paulo, Brazil._