## Walmart Home Catchments

In [1]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests

from google.cloud import bigquery

from cartoframes.viz import Map, Layer

import datasets

## Retrieve the data

The data is a set of locations from which people traveled to visit Panama City Beach, Florida, during July 2019. It comes from the [SafeGraph Patterns data](https://blog.safegraph.com/introducing-places-patterns-17ac5b96fb33). This example is a basic reproduction of some of the findings in the [CARTO <> SafeGraph partnership blog post](https://carto.com/blog/visit-pattern-footfall-data-safegraph/).

Since we know where visitors come from, a natural question is whether we can identify general regions that drive the visits. For example, are there areas with a higher density of source visits that could be used to understand visit demographics?

Let's get started by downloading the data and taking a look at it.

In [2]:
sg_pcb = datasets.get_safegraph_visits()

In [3]:
sg_pcb.head()

cartodb_id,geometry,longitude,latitude,num_visits
1,POINT (-84.69018 33.99092),-84.690182,33.990924,5.0
2,POINT (-85.87721 30.21668),-85.877212,30.216679,13.0
3,POINT (-85.17326 31.90427),-85.173263,31.904274,6.0
4,POINT (-86.00685 34.63264),-86.006852,34.632641,5.0
5,POINT (-85.03878 32.52374),-85.038783,32.523741,11.0


This is a point dataset: each row is an origin location with the number of visits recorded from it (`num_visits`).

## Visualize points on map

In [4]:
Layer(sg_pcb)

## Calculate Clusters

To calculate clusters, we will use DBSCAN because it finds clusters based on density and works well with spatial distance measurements.
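With a haversine metric, distances are central angles in radians on a unit sphere, so a search radius in kilometers has to be divided by the Earth's radius before it can be used as a DBSCAN epsilon. The sketch below illustrates that conversion with plain NumPy. The names `haversine_km` and `EARTH_RADIUS_KM` are illustrative and not part of the notebook, and the two coordinates are taken from the first two rows of the sample output above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # mean Earth radius in km (illustrative constant name)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # central angle in radians: the unit a haversine metric works in
    central_angle = 2 * np.arcsin(np.sqrt(a))
    return EARTH_RADIUS_KM * central_angle


# distance between the first two sample rows (cartodb_id 1 and 2)
d = haversine_km(33.990924, -84.690182, 30.216679, -85.877212)

# a 35 km search radius expressed in radians
eps_radians = 35 / EARTH_RADIUS_KM

print("distance (km):", round(float(d), 1))
print("35 km in radians:", eps_radians)
```

Two points count as neighbors only when their central angle is at most `eps_radians`, which is why the coordinates passed to the clustering step need to be in radians as well.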

In [5]:
from sklearn.cluster import dbscan

# use lat/lng in radians as coordinates
coords = np.radians(sg_pcb[["latitude", "longitude"]].values)

# choose appropriate epsilon value
# here we use ~35 kilometers
kms_per_radian = 6371
epsilon = 35 / kms_per_radian

# calculate clusters
# use haversine metric for approximate great-circle ("as the crow flies") distances on earth's surface
_, cluster_labels = dbscan(
    coords, eps=epsilon, min_samples=4, algorithm="ball_tree", metric="haversine",
)

# note: this count includes the noise label (-1) if any points were left unclustered
print("Number of clusters: {}".format(len(set(cluster_labels))))

Number of clusters: 9
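DBSCAN assigns the label `-1` to points it treats as noise, so a count based on the set of all labels includes that label too. The following is a minimal sketch of how the noise label could be separated from the real clusters and how cluster sizes could be tallied. The array `example_labels` is hypothetical data in the same format as `cluster_labels`, not the notebook's actual result.

```python
import numpy as np

# hypothetical labels in the format DBSCAN returns (-1 marks noise)
example_labels = np.array([0, 0, 1, -1, 1, 1, 2, -1, 0, 2])

# count how many points carry each label
labels, counts = np.unique(example_labels, return_counts=True)
sizes = {int(label): int(count) for label, count in zip(labels, counts)}

# real clusters are every label except the noise label
n_clusters = len(set(sizes) - {-1})
n_noise = sizes.get(-1, 0)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("sizes:", sizes)
```

Applied to `cluster_labels`, the same pattern would report the number of clusters without the outlier group.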


### Add cluster labels to data

Now that we have uncovered some natural clusters, let's attach the cluster labels to the data and map the points by label.

In [6]:
from cartoframes.viz.helpers import color_category_layer

# convert labels to text for creating a category map
sg_pcb["dbscan_labels"] = [str(s) for s in cluster_labels]

# show distribution of labels
color_category_layer(sg_pcb, 'dbscan_labels')


### Apply readable labels to clusters

In [None]:
sg_pcb["dbscan_labels"] = cluster_labels

# Give cluster labels titles
cluster_title_mapping = {
    -1: "Outlier",
    0: "Northern Alabama and Georgia",
    1: "Panama City Beach (Locals)",
}
cluster_title_mapping.update(
    {k: "Other smaller region" for k in range(2, max(cluster_labels) + 1)}
)


# identify points as within a cluster or not
def in_cluster(cluster_num):
    if cluster_num == -1:
        return "Out of cluster"
    return "In cluster"


sg_pcb["in_cluster"] = sg_pcb["dbscan_labels"].apply(in_cluster)
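Once each point is flagged as in or out of a cluster, one possible follow-up is to check how much of the visit volume the clusters account for. The sketch below assumes a frame with the same `dbscan_labels` and `num_visits` columns as `sg_pcb`. The rows in `example` are hypothetical, and `visit_share` is an illustrative name rather than something the notebook computes.

```python
import pandas as pd

# hypothetical rows mirroring the notebook's columns
example = pd.DataFrame(
    {
        "dbscan_labels": [-1, 0, 0, 1, 1, -1],
        "num_visits": [5.0, 13.0, 6.0, 5.0, 11.0, 10.0],
    }
)

# same in/out rule as the in_cluster helper: -1 means the point is an outlier
example["in_cluster"] = example["dbscan_labels"].apply(
    lambda label: "Out of cluster" if label == -1 else "In cluster"
)

# total visits per group and each group's share of all visits
visits_by_group = example.groupby("in_cluster")["num_visits"].sum()
visit_share = visits_by_group / visits_by_group.sum()

print(visit_share)
```

A high in-cluster share would suggest that the identified regions capture most of the visit volume, while a low share would suggest that visits are spread widely across outliers.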

## Calculate Convex Hulls to show approximate cluster region

In [None]:
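# drop outliers, then build one convex hull per cluster;
# the 0.05 buffer is in degrees because the geometry is in lon/lat
cluster_hulls = (
    sg_pcb[sg_pcb["dbscan_labels"] != -1]
    .groupby("dbscan_labels")
    .geometry.apply(lambda x: x.unary_union.convex_hull.buffer(0.05))
)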

cluster_hulls = gpd.GeoDataFrame(cluster_hulls).reset_index()
cluster_hulls["dbscan_labels_readable"] = cluster_hulls["dbscan_labels"].apply(
    lambda x: cluster_title_mapping.get(x)
)

In [None]:
cluster_hulls
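A convex hull is the smallest convex polygon that contains every point in a cluster, and buffering it pads the outline so that points on the boundary sit inside the drawn region. The sketch below shows the idea with Shapely on a handful of hypothetical points. The coordinates and the names `points`, `hull`, and `padded` are illustrative only, and the buffer distance is in the same units as the coordinates.

```python
from shapely.geometry import MultiPoint

# hypothetical points: four corners of a unit square plus one interior point
points = MultiPoint([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)])

# smallest convex polygon containing all points (the interior point is ignored)
hull = points.convex_hull

# pad the hull outward by 0.05 coordinate units
padded = hull.buffer(0.05)

print("hull area:", hull.area)
print("padded area:", padded.area)
```

Because the hull only follows the outermost points, it can cover areas with no observations, so it is best read as an approximate outline of a cluster rather than a precise boundary.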

## Visualize outputs

In [None]:
Map(
    [
        color_category_layer(
            cluster_hulls,
            "dbscan_labels_readable",
            opacity=0.7,
            widget=True,
            palette=["#66C5CC", "#DCB0F2", "#F89C74"],
            stroke_color="transparent",
            title="Visit Regions",
        ),
        color_category_layer(
            sg_pcb,
            "in_cluster",
            palette=["#666", "deeppink"],
            opacity=0.5,
            title="In Cluster?",
        ),
    ]
)