# Applied Data Science Capstone

## Week 3 - Evaluation - Question 1
Below I'll show you how I built the pandas dataframe from the Wikipedia article. I used a link to an older revision to ensure that I was working with the same dataset the instructors used when they created this evaluation. I've commented my code to make it easier to understand.

Let's start by importing the needed libraries.

In [1]:
# Import the needed libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib.request

In this block I fetch the page, parse it with BeautifulSoup and convert the table into a dataframe.

In [2]:
# Step 1: Retrieve html file from page
data_url = urllib.request.urlopen("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641")
html_file = data_url.read()

# Step 2: Pass html file into BeautifulSoup
soup = BeautifulSoup(html_file, "html.parser")

# Step 3: Extract Table
table = soup.table

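# Step 4: Create a df (wrapping the HTML in StringIO avoids the deprecation warning that newer pandas versions raise for literal HTML strings)
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]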

Now that the dataframe has been made, it's time to start cleaning it up. This means restructuring and renaming the dataframe where necessary.

In [4]:
# Step 1: Remove unassigned Boroughs
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: Fill in unassigned neighbourhoods
df.loc[df["Neighbourhood"] == "Not assigned", "Neighbourhood"] = df["Borough"]

# Step 3: Group by postcode & Borough, then merge the neighbourhood values with a comma in between each value.
df_grouped = df.groupby(["Postcode","Borough"])["Neighbourhood"].apply(','.join).reset_index()

# Step 4: Rename Postcode
df_final = df_grouped.rename(columns={"Postcode": "Postalcode"})
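
To make the cleaning steps easier to follow, here is a minimal sketch of the same three operations on a tiny dataframe. The rows in `toy` are made up for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy data (hypothetical rows, only for illustration)
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# Drop rows without a borough
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# A neighbourhood without a name takes the name of its borough
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

# One row per postcode, neighbourhoods joined with commas
toy_grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)
print(toy_grouped)
```

The two M5A rows collapse into one row, and the M7A row gets its borough name as neighbourhood.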

Time for the final part, showing the dataframe's shape. I've thrown in the head as well for good measure so you can see my data is properly structured. It does show in a different order than the example due to the groupby method I used.

In [16]:
# Show the final dataframe's shape as requested in the exercise
shape = df_final.shape
text = "This is the dataframe's shape: {}".format(shape)
print(text)
df_final.head()

This is the dataframe's shape: (103, 3)


| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


## Week 3 - Evaluation - Question 2
Below you'll find my method for adding the coordinates to each postal code. As geocoder was unreliable for me, I chose to use the provided CSV file instead.

In [12]:
# read csv into a dataframe
loc_data = pd.read_csv("https://cocl.us/Geospatial_data")

# View the dataframe to see what's in it
loc_data.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now that we know that the new dataframe 'loc_data' contains only a Postal Code, Latitude and Longitude, our next step is clear: merge the dataframe from the previous question with this new one, matching 'Postalcode' to 'Postal Code'.

In [14]:
# Merge the 2 dataframes where the postal codes match. DataFrame.join would align on the row index instead of the postal code, so merge with explicit keys is used here.
df_joined = df_final.merge(loc_data, left_on="Postalcode", right_on="Postal Code", how="left")

# With this method, we end up with the postal code column from both dataframes. This code drops the 2nd Postal Code column since we won't need it.
df_complete = df_joined.drop(labels="Postal Code", axis=1)
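
When combining two dataframes by postal code, matching on the key values rather than on row position is the safer route, because the result stays correct even if the two tables are sorted differently. A small sketch with `DataFrame.merge`, assuming made-up coordinates and a deliberately shuffled right-hand table:

```python
import pandas as pd

left = pd.DataFrame({"Postalcode": ["M1B", "M1C", "M1E"],
                     "Borough": ["Scarborough"] * 3})

# Same codes in a different order, with made-up coordinates
right = pd.DataFrame({"Postal Code": ["M1E", "M1B", "M1C"],
                      "Latitude": [3.0, 1.0, 2.0],
                      "Longitude": [-3.0, -1.0, -2.0]})

merged = (
    left.merge(right, left_on="Postalcode", right_on="Postal Code",
               how="left", validate="one_to_one")
    .drop(columns="Postal Code")
)
print(merged)
```

`validate="one_to_one"` makes pandas raise an error if a postal code appears more than once on either side, which is a cheap sanity check for this kind of lookup table.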

And the result is...

In [17]:
# Show the completed dataframe's shape as requested in the exercise
shape2 = df_complete.shape
text2 = "This is the dataframe's shape: {}".format(shape2)
print(text2)
df_complete.head()

This is the dataframe's shape: (103, 5)


| | Postalcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


## Week 3 - Evaluation - Question 3
Scarborough has been at the top of my dataframes all this time, which has made me quite curious about this borough. Let's zoom in on it and see what's popular in the neighbourhoods of this rough-sounding borough. Due to there being multiple neighbourhoods per postal code, I'll be using the postal code to keep things easier to read.

First up: importing the needed libraries.

In [18]:
import requests 
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!pip install folium
import folium
print("Done importing")

Done importing


##### I'll enter my API credentials below. Make sure you hide yours. GitHub is public after all! If you didn't hide them, go back and change them!

In [19]:
# hidden credentials (empty placeholders; fill in your own Foursquare values)
CLIENT_ID = ""
CLIENT_SECRET = ""

In [20]:
# Other required fields to make API calls to Foursquare
VERSION = '20180605' 
LIMIT = 100 

### Time to define the getNearbyVenues function that was shown in the exercises.

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
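
The function above indexes straight into the JSON response, so a response without groups or a venue without a category would raise an error. Below is a rough sketch of a more defensive way to flatten one response. The helper name `venues_from_response` and the sample payload are hypothetical; the payload only mimics the nesting that `getNearbyVenues` reads, and whether the API ever omits a category is an assumption on my part.

```python
def venues_from_response(payload, name, lat, lng):
    """Flatten one explore-style JSON payload into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: a venue may lack a category; keep it with None
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hypothetical payload with the same nesting the function above reads
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Sample Cafe",
                   "location": {"lat": 43.0, "lng": -79.0},
                   "categories": [{"name": "Café"}]}},
        {"venue": {"name": "Unlabelled Spot",
                   "location": {"lat": 43.1, "lng": -79.1},
                   "categories": []}},
    ]}]}
}

rows = venues_from_response(sample_payload, "M1B", 43.8, -79.2)
empty = venues_from_response({"response": {}}, "M1C", 43.7, -79.1)
print(rows)
print(empty)
```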

In [23]:
# Let's zoom in on Scarborough as I said I would in the introduction.
scar_data = df_complete[df_complete["Borough"] == "Scarborough"].reset_index(drop=True)

# Define the needed variables
scar_post = scar_data["Postalcode"]
scar_lat = scar_data["Latitude"]
scar_lng = scar_data["Longitude"]

# Execute the function and make the needed calls
scar_venues = getNearbyVenues(names=scar_post, latitudes=scar_lat, longitudes=scar_lng)

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X


In [24]:
# Time to see just how many venues were fetched per postal code.
scar_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| M1B | 2 | 2 | 2 | 2 | 2 | 2 |
| M1C | 1 | 1 | 1 | 1 | 1 | 1 |
| M1E | 8 | 8 | 8 | 8 | 8 | 8 |
| M1G | 3 | 3 | 3 | 3 | 3 | 3 |
| M1H | 8 | 8 | 8 | 8 | 8 | 8 |
| M1J | 1 | 1 | 1 | 1 | 1 | 1 |
| M1K | 7 | 7 | 7 | 7 | 7 | 7 |
| M1L | 7 | 7 | 7 | 7 | 7 | 7 |
| M1M | 2 | 2 | 2 | 2 | 2 | 2 |
| M1N | 4 | 4 | 4 | 4 | 4 | 4 |


### All right, that's that. Time to use one-hot encoding to see which types of venues were popular in each postal code.

In [25]:
# Step 1: one hot encoding
scar_onehot = pd.get_dummies(scar_venues[['Venue Category']], prefix="", prefix_sep="")

# Step 2: add neighborhood column back to dataframe
scar_onehot['Neighborhood'] = scar_venues['Neighborhood'] 

# Step 3: move neighborhood column to the first column
fixed_columns = [scar_onehot.columns[-1]] + list(scar_onehot.columns[:-1])
scar_onehot = scar_onehot[fixed_columns]

# Step 4: Let's have a quick look at the preliminary results.
scar_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Café,...,Playground,Print Shop,Rental Car Location,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Thai Restaurant,Train Station,Vietnamese Restaurant
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# Time to whittle that down to the average values per postal code.
scar_grouped = scar_onehot.groupby('Neighborhood').mean().reset_index()
scar_grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Café,...,Playground,Print Shop,Rental Car Location,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Thai Restaurant,Train Station,Vietnamese Restaurant
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.125,0.0,0.125,0.0,0.0,...,0.0,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.125,0.0,0.125,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0
5,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M1K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0
7,M1L,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M1M,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M1N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0
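
To see where fractions like 0.5 and 0.125 come from, here is a minimal sketch of one-hot encoding followed by a group mean on three made-up venues. The data in `toy_venues` is hypothetical.

```python
import pandas as pd

# Three made-up venues spread over two postal codes
toy_venues = pd.DataFrame({
    "Neighborhood": ["M1B", "M1B", "M1C"],
    "Venue Category": ["Print Shop", "Fast Food Restaurant", "Bar"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of the 0/1 columns is the share of venues in each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so a postal code with only one venue gets a single 1.0 and zeros everywhere else.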


### That's quite the data set! Let's see what the most popular venue types are.

In [27]:
# Step 1: Define a function that will return the most popular venues per postal code
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
# Step 2: Define needed variables
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# Step 3: create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Step 4: create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scar_grouped['Neighborhood']

for ind in np.arange(scar_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scar_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | Fast Food Restaurant | Print Shop | Vietnamese Restaurant | College Stadium | Gym | Grocery Store | General Entertainment | Gas Station | Fried Chicken Joint | Electronics Store |
| 1 | M1C | Bar | Vietnamese Restaurant | Convenience Store | Hakka Restaurant | Gym | Grocery Store | General Entertainment | Gas Station | Fried Chicken Joint | Fast Food Restaurant |
| 2 | M1E | Intersection | Bank | Restaurant | Rental Car Location | Breakfast Spot | Electronics Store | Medical Center | Mexican Restaurant | Vietnamese Restaurant | Convenience Store |
| 3 | M1G | Coffee Shop | Korean BBQ Restaurant | Vietnamese Restaurant | Convenience Store | Gym | Grocery Store | General Entertainment | Gas Station | Fried Chicken Joint | Fast Food Restaurant |
| 4 | M1H | Hakka Restaurant | Thai Restaurant | Athletics & Sports | Bakery | Bank | Gas Station | Fried Chicken Joint | Caribbean Restaurant | Department Store | Gym |
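
One thing to keep in mind when reading this table: a postal code with only a handful of venues still gets ten "most common" entries, and the ones beyond its actual venues are categories with a share of zero. A possible variant, sketched below under the hypothetical name `top_venues_with_share`, keeps the share next to each category and drops the zero entries. The example row is made up.

```python
import pandas as pd

def top_venues_with_share(row, num_top_venues):
    """Return (category, share) pairs, skipping categories with zero share."""
    shares = row.iloc[1:].astype(float).sort_values(ascending=False)
    shares = shares[shares > 0].head(num_top_venues)
    return [(category, float(share)) for category, share in shares.items()]

# Made-up row in the same layout as scar_grouped: name first, then shares
toy_row = pd.Series({"Neighborhood": "M1C", "Bank": 0.0, "Bar": 1.0, "Gym": 0.0})

top = top_venues_with_share(toy_row, 10)
print(top)
```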


### We've finally got the data we need, time to get clustering.

In [29]:
# Step 1: set number of clusters
kclusters = 3

scar_grouped_clustering = scar_grouped.drop('Neighborhood', axis=1)

# Step 2: run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scar_grouped_clustering)

In [30]:
# Step 3: Add the labels to our dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
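
For intuition about what k-means does with these venue shares, here is a small sketch on four made-up rows with two category columns. The data is hypothetical, and `n_init=10` is set explicitly as my own choice so the result doesn't depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up venue shares: two "bar-heavy" and two "bakery-heavy" postal codes
toy_shares = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_shares)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Rows with similar shares end up with the same label. Which group is called 0 and which is called 1 is arbitrary, so the label numbers carry no meaning beyond group membership.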


### And now we start prepping to visualise

In [31]:
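# Step 1: Merge the 2 data frames on the postal code (join would align on the row index instead, which misaligns rows when a postal code is missing)
scar_merged = scar_data.merge(neighborhoods_venues_sorted, left_on="Postalcode", right_on="Neighborhood", how="left")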
scar_merged.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,0.0,M1B,Fast Food Restaurant,Print Shop,Vietnamese Restaurant,College Stadium,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,2.0,M1C,Bar,Vietnamese Restaurant,Convenience Store,Hakka Restaurant,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,0.0,M1E,Intersection,Bank,Restaurant,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,Convenience Store
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,M1G,Coffee Shop,Korean BBQ Restaurant,Vietnamese Restaurant,Convenience Store,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,M1H,Hakka Restaurant,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Fried Chicken Joint,Caribbean Restaurant,Department Store,Gym


In [32]:
# With this method of joining, my cluster labels became floats but they have to be integers. Time to fix that and a few other problems!

# Step 2: Let's get rid of that 2nd neighborhood column real quick.
scar_merged_2 = scar_merged.drop(labels="Neighborhood", axis = 1)

# Step 3: Due to the limited amount of venues I retrieved, 1 postal code got left out. Time to remove that one. The .copy() makes the result an independent dataframe, so the assignment in step 4 doesn't trigger a SettingWithCopyWarning.
scar_merged_final = scar_merged_2[scar_merged_2['Cluster Labels'].notna()].copy()

# Step 4: Turn the floats into integers.
scar_merged_final["Cluster Labels"] = scar_merged_final["Cluster Labels"].astype(int)

# Step 5: One final look at the dataframe before we map the data.
scar_merged_final.head()


Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,0,Fast Food Restaurant,Print Shop,Vietnamese Restaurant,College Stadium,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,2,Bar,Vietnamese Restaurant,Convenience Store,Hakka Restaurant,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,0,Intersection,Bank,Restaurant,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,Convenience Store
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Korean BBQ Restaurant,Vietnamese Restaurant,Convenience Store,Gym,Grocery Store,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,Hakka Restaurant,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Fried Chicken Joint,Caribbean Restaurant,Department Store,Gym
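
The float labels come from the missing value: a NumPy integer column can't hold NaN, so pandas stores the whole column as floats. A minimal sketch of the filter-then-convert pattern on made-up data (the codes P1 to P3 are placeholders, not real postal codes):

```python
import numpy as np
import pandas as pd

# Made-up labels; the NaN stands for a postal code without any venues
labels = pd.DataFrame({"Postalcode": ["P1", "P2", "P3"],
                       "Cluster Labels": [0.0, 2.0, np.nan]})
print(labels["Cluster Labels"].dtype)  # float64, because of the NaN

# Filter first, take an explicit copy, then convert to integers
kept = labels[labels["Cluster Labels"].notna()].copy()
kept["Cluster Labels"] = kept["Cluster Labels"].astype(int)
print(kept)
```

Taking the copy before assigning tells pandas that `kept` is its own dataframe rather than a slice of `labels`.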


### Time to create the map we need

In [33]:
# Step 1: Provide the location data to have the map zoom in on the correct neighbourhood
scar_latitude = 43.777702
scar_longitude = -79.233238

# Step 2: create map
map_clusters = folium.Map(location=[scar_latitude, scar_longitude], zoom_start=11)

# Step 3: set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Step 4: add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scar_merged_final['Latitude'], scar_merged_final['Longitude'], scar_merged_final['Postalcode'], scar_merged_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
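
As a side note on the colour scheme, the sketch below shows how one hex colour per cluster can be built from the rainbow colormap. It indexes the list with the cluster label directly, which is my own simplification; the cell above uses `cluster-1`, which also gives each cluster a distinct colour because index -1 wraps around to the last entry.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 3

# One evenly spaced colour from the rainbow colormap per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Look up the colour for each cluster label directly by its index
cluster_colors = {cluster: rainbow[cluster] for cluster in range(kclusters)}
print(cluster_colors)
```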