## Day 48 Lecture 1 Assignment

In this assignment, we will apply hierarchical clustering to a dataset containing the locations of Starbucks stores, focusing on a single U.S. state.

Note: this assignment uses geographical data and maps, which will require the use of two specific packages: haversine and plotly. Both of these can be pip installed.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from haversine import haversine
import plotly.express as px

Below are some convenience functions for calculating geographical distance matrices using lat-long data and plotting a dendrogram by combining a scikit-learn model with scipy's dendrogram plotting functionality.

In [2]:
def geo_sim_matrix(df, col_name = 'Coordinates'):
    """
    A function that computes a geographical distance matrix (in miles).
    Each row in the dataframe should correspond to one location.
    In addition, the dataframe must have a column containing the lat-long of each location as a tuple (i.e. (lat, long)).
    
    Parameters:
        df (pandas dataframe): an nxm dataframe containing the locations to compute similarities between.
        col_name (string): the name of the column containing the lat-long tuples.
        
    Returns:
        distance (pandas dataframe): an nxn distance matrix between the geographical coordinates of each location.
    """
    
    df = df.copy()
    df.reset_index(inplace=True)
    haver_vec = np.vectorize(haversine, otypes=[np.float32])
    distance = df.groupby('index').apply(lambda x: pd.Series(haver_vec(df[col_name], x[col_name])))
    distance = distance / 1.609344
    distance.columns = distance.index
    
    return distance


def plot_dendrogram(model, **kwargs):
    """
    A basic function for plotting a dendrogram. Sourced from the following link:
    https://github.com/scikit-learn/scikit-learn/blob/70cf4a676caa2d2dad2e3f6e4478d64bcb0506f7/examples/cluster/plot_hierarchical_clustering_dendrogram.py
    
    Parameters:
        model (sklearn.cluster.AgglomerativeClustering): a fitted scikit-learn hierarchical clustering model.
    
    Output: a dendrogram based on the model passed in the parameters.
    
    Returns: N/A    
    """
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [3]:
# answer goes here
df = pd.read_csv('data/starbucks_locations.csv')




Begin by narrowing down the dataset to a specific geographic area of interest. Since we will need to manually compute a distance matrix, whose size grows on the order of $n^{2}$ for $n$ locations, we recommend choosing an area with 3,000 or fewer locations. In this example, we will use California, which has about 2,800 locations. Feel free to choose a different region that is of more interest to you, if desired.
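
To make the $n^{2}$ growth concrete, the following is a rough back-of-the-envelope sketch, not part of the assignment. The helper name `distance_matrix_megabytes` is hypothetical, the default of 8 bytes per value assumes float64 storage, and the 25,000 figure is only an illustrative stand-in for a much larger region.

```python
def distance_matrix_megabytes(n_locations, bytes_per_value=8):
    """Approximate memory for a dense n x n matrix (8 bytes assumes float64)."""
    return n_locations ** 2 * bytes_per_value / 1e6


# Compare a small state, the recommended upper bound, and a much larger region.
for n in (378, 3000, 25000):
    print(f"{n:>6} locations -> {distance_matrix_megabytes(n):,.1f} MB")
```

At 3,000 locations the matrix is about 72 MB, which is comfortable on most machines; the quadratic growth is what makes much larger regions impractical.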

Subset the dataframe to only include records for Starbucks locations in California.

In [4]:
# answer goes here
# This solution uses Ohio (378 locations) rather than California.
oh_df = df[df['State/Province'] == 'OH']




In [5]:
oh_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378 entries, 21503 to 21880
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Brand           378 non-null    object 
 1   Store Number    378 non-null    object 
 2   Store Name      378 non-null    object 
 3   Ownership Type  378 non-null    object 
 4   Street Address  378 non-null    object 
 5   City            378 non-null    object 
 6   State/Province  378 non-null    object 
 7   Country         378 non-null    object 
 8   Postcode        377 non-null    object 
 9   Phone Number    357 non-null    object 
 10  Timezone        378 non-null    object 
 11  Longitude       378 non-null    float64
 12  Latitude        378 non-null    float64
dtypes: float64(2), object(11)
memory usage: 41.3+ KB


The haversine package takes tuples with 2 numeric elements and interprets them as (lat, long) to calculate distance, so add a new column called "Coordinates" that combines the latitude and longitude in each row into a tuple, with latitude first. In other words, the last two columns of the dataframe should initially look like this:

**Longitude, Latitude**  
-121.64, 39.14  
-116.40, 34.13  
...

After adding the new column, the last three columns should look like this:

**Longitude, Latitude, Coordinates**  
-121.64, 39.14, (39.14, -121.64)  
-116.40, 34.13, (34.13, -116.40)  
...
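
For intuition about what is computed from each pair of tuples, here is a minimal pure-Python sketch of the haversine formula. It is not the haversine package's code: the function name `haversine_miles` is hypothetical, the Earth radius constant is an assumption, and the two city coordinates are approximate values chosen only for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres
KM_PER_MILE = 1.609344       # same conversion factor used in geo_sim_matrix


def haversine_miles(point_a, point_b):
    """Great-circle distance in miles between two (lat, long) tuples in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h)) / KM_PER_MILE


columbus = (39.96, -83.00)   # approximate (lat, long), illustrative only
cleveland = (41.50, -81.69)  # approximate (lat, long), illustrative only
print(round(haversine_miles(columbus, cleveland), 1))
```

Because the formula treats the first element as latitude, swapping the order of the tuple silently produces wrong distances rather than an error, which is why the order in the "Coordinates" column matters.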

In [6]:
# answer goes here
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
oh_df = oh_df.assign(Coordinates=list(zip(oh_df['Latitude'], oh_df['Longitude'])))


Calculate the distance matrix using the starter code/function geo_sim_matrix() provided earlier in the notebook. It assumes the column containing the coordinates for each location is called "Coordinates"; examine the docstring for more details.

In [7]:
# answer goes here
oh_dists = geo_sim_matrix(oh_df)
oh_dists




index,21503,21504,21505,21506,21507,21508,21509,21510,21511,21512,...,21871,21872,21873,21874,21875,21876,21877,21878,21879,21880
21503,0.000000,0.000000,6.305164,4.193080,6.688168,6.987537,7.936386,6.688168,0.865266,0.520821,...,105.880028,38.227207,38.227207,26.301682,27.919926,103.765007,158.916397,44.674191,80.583328,80.108788
21504,0.000000,0.000000,6.305164,4.193080,6.688168,6.987537,7.936386,6.688168,0.865266,0.520821,...,105.880028,38.227207,38.227207,26.301682,27.919926,103.765007,158.916397,44.674191,80.583328,80.108788
21505,6.305164,6.305164,0.000000,9.525536,11.808400,0.690934,12.979708,11.808400,5.744386,6.411960,...,102.261017,44.329292,44.329292,32.478065,25.890823,100.739342,156.542999,44.981430,75.136665,74.692642
21506,4.193080,4.193080,9.525536,0.000000,2.498870,10.132814,3.748107,2.498870,4.171383,3.749296,...,104.916473,36.760777,36.760777,24.733757,26.366465,102.364052,157.013779,48.050720,81.421280,80.910194
21507,6.688168,6.688168,11.808400,2.498870,0.000000,12.381018,1.249237,0.000000,6.655448,6.248166,...,104.589706,35.923550,35.923550,23.931051,25.893564,101.769958,156.081787,50.021702,82.184959,81.653519
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21876,103.765007,103.765007,100.739342,102.364052,101.769958,100.328682,101.495262,101.769958,102.922958,103.365067,...,11.569868,133.901108,133.901108,123.248039,76.001450,0.000000,57.364594,143.775848,53.650658,53.013039
21877,158.916397,158.916397,156.542999,157.013779,156.081787,156.192856,155.629318,156.081787,158.105133,158.478348,...,60.916637,185.321884,185.321884,175.694748,131.014267,57.364594,0.000000,200.448715,107.193069,106.736809
21878,44.674191,44.674191,44.981430,48.050720,50.021702,45.180485,51.024044,50.021702,45.293259,45.189640,...,143.852310,51.016815,51.016815,47.035366,70.797531,143.775848,200.448715,0.000000,108.874985,108.696991
21879,80.583328,80.583328,75.136665,81.421280,82.184959,74.496178,82.592972,82.184959,79.758240,80.410332,...,46.599174,118.105301,118.105301,106.095444,60.223953,53.650658,107.193069,108.874985,0.000000,0.870435


Build the hierarchical clustering model using n_clusters = 5 and average linkage. Bear in mind that we are passing a precomputed distance matrix, which will require an additional parameter to be manually specified. 

Additionally, save the predicted cluster assignments as a new column in your dataframe.
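
As a hedged illustration of the "additional parameter", the sketch below fits average-linkage clustering on a tiny, made-up distance matrix rather than the real data. The matrix `toy_dists` is hypothetical. The keyword that marks the input as precomputed is `metric` in newer scikit-learn releases and `affinity` in older ones, so the sketch inspects the constructor and uses whichever is available.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 4x4 distance matrix (miles): two tight pairs far from each other.
toy_dists = np.array([
    [0.0, 1.0, 50.0, 52.0],
    [1.0, 0.0, 49.0, 51.0],
    [50.0, 49.0, 0.0, 2.0],
    [52.0, 51.0, 2.0, 0.0],
])

# Pick the keyword this scikit-learn version accepts for a precomputed matrix.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
precomputed_kw = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2,       # the assignment asks for 5 on the real data
    linkage="average",  # average linkage works with precomputed distances
    **{precomputed_kw: "precomputed"},
)
labels = model.fit_predict(toy_dists)
print(labels)
```

On the real data, the same call with `n_clusters=5` and the distance matrix from geo_sim_matrix() returns one label per row, which can then be stored in a new dataframe column (the column name is up to you).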

In [None]:
# answer goes here





Plot the dendrogram using the provided starter function plot_dendrogram(). The dendrogram will be difficult to read because there are so many leaf nodes; try experimenting with smaller geographical areas for easier-to-read dendrograms.
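
To see what plot_dendrogram() hands to scipy, the sketch below rebuilds its linkage matrix from a small hand-written `children` array instead of a fitted model. The array is hypothetical, and `no_plot=True` is used so that the layout can be inspected without drawing a figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram

# Hypothetical children_ array for 4 points: each row is one merge, and ids >= 4
# refer to clusters formed by earlier rows (4 is row 0, 5 is row 1).
children = np.array([[0, 1], [2, 3], [4, 5]])

# Same construction as plot_dendrogram: placeholder heights and counts.
distance = np.arange(children.shape[0])
no_of_observations = np.arange(2, children.shape[0] + 2)
linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

# no_plot=True returns the computed layout as a dict instead of drawing it.
layout = dendrogram(linkage_matrix, no_plot=True)
print(layout["ivl"])  # leaf labels in left-to-right order
```

Because plot_dendrogram() forwards its keyword arguments to scipy's dendrogram(), options such as `truncate_mode` and `p` can be passed through to collapse the lower levels of a crowded tree. Note that the merge heights here are placeholders, so only the tree's shape is meaningful, not its vertical scale.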

In [None]:
# answer goes here





Finally, plot the resulting clusters on a map using the "scatter_geo" function from plotly.express. The map defaults to the entire world; the "scope" parameter is useful for narrowing down the region plotted in the map. The documentation can be found here:

https://www.plotly.express/plotly_express/#plotly_express.scatter_geo

Tip: If the markers on the map are too large, their size can be changed with the following line of code:

*fig.update_traces(marker=dict(size=...))*

Do the clusters correspond to geographic areas you would expect? Experiment with other values for n_clusters and linkage and see how they affect the results.

In [None]:
# answer goes here



