# Clustering the UCSD Campus

The University of California, San Diego (UCSD) has a distinctive campus environment for its undergraduate students. The entire campus and undergraduate student population are currently divided into six residential colleges. When a student is admitted to UCSD, they are assigned to one of the six colleges, which determines where they live, which general education classes they must take, and which resources are available to them. These colleges are central to UCSD's campus culture, which makes the campus a natural clustering problem. For this problem, I have decided to implement the unsupervised k-means algorithm; however, I may revisit it with other types of clustering algorithms.

In [1]:
import pandas as pd
import numpy as np
import googlemaps
import time
from utils import distance, get_key
from visualize import cluster_map

## Data Wrangling

In order to cluster the campus, we must first gather a dataset of place names on campus and their locations. To do this, I use the Google Maps Places API through its Python client library. I first create a list of location categories relevant to the UCSD campus. Then I search for each category, one by one, near the campus and save the results to a list.

In [2]:
gmaps = googlemaps.Client(key=get_key())
gmaps

<googlemaps.client.Client at 0x251bf449438>

In [3]:
types = ["university", "art_gallery", "atm", "bakery", "bank", "bar", "beauty_salon", "bicycle_store", "book_store",
         "bus_station", "cafe", "campground", "clothing_store", "convenience_store", "doctor", "establishment",
         "food", "gym", "grocery_or_supermarket", "health", "hospital", "laundry", "library", "liquor_store",
         "local_government_office", "lodging", "meal_takeaway", "museum", "park", "parking", "pharmacy", "physiotherapist",
         "police", "post_office", "restaurant", "school", "stadium", "storage", "store", "supermarket", "transit_station"]

def page_search(loc, rad, category=''):
    """
    Returns metadata for places near a location that match the given category
    :param loc: tuple of latitude and longitude
    :param rad: integer for radius (in meters), passed along with follow-up page requests
    :param category: string specifying category
    :return: list of dictionaries of places' metadata
    """
    results = []
    # The first request ranks results by distance, so it does not take a radius.
    search = gmaps.places_nearby(location=loc, rank_by="distance", type=category)
    results += search["results"]
    # Keep requesting pages for as long as the API returns a next_page_token.
    while "next_page_token" in search:
        time.sleep(2)  # the token needs a short delay before it becomes valid
        search = gmaps.places_nearby(location=loc, page_token=search["next_page_token"], radius=rad, type=category)
        results += search["results"]
    return results


places = []
for t in types:
    places += page_search((32.881439,-117.237729), 50, category=t)
places

{'geometry': {'location': {'lat': 32.8781369, 'lng': -117.2403332},
  'viewport': {'northeast': {'lat': 32.8794694802915,
    'lng': -117.2387766197085},
   'southwest': {'lat': 32.8767715197085, 'lng': -117.2414745802915}}},
 'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/generic_business-71.png',
 'id': 'eb175531d46075e176a0c51290da00a24108de63',
 'name': 'Mandeville Annex Gallery',
 'place_id': 'ChIJweM8xMYG3IAR6GgXGxeFbVQ',
 'plus_code': {'compound_code': 'VQH5+7V San Diego, California, United States',
  'global_code': '8544VQH5+7V'},
 'reference': 'ChIJweM8xMYG3IAR6GgXGxeFbVQ',
 'scope': 'GOOGLE',
 'types': ['art_gallery', 'university', 'point_of_interest', 'establishment'],
 'vicinity': 'San Diego'}

After gathering the data, I format the search results so they can be saved into a dataframe storing each location's name, latitude, and longitude.

In [4]:
dct = {"name": [], "latitude": [], "longitude": []}
for place in places:
    dct["name"].append(place["name"])
    dct["latitude"].append(place["geometry"]["location"]["lat"])
    dct["longitude"].append(place["geometry"]["location"]["lng"])
    
{key:dct[key][:3] for key in dct.keys()}

{'name': ['Mandeville Annex Gallery',
  'University Art Gallery (UAG)',
  'Crafts Center Grove Gallery'],
 'latitude': [32.8781369, 32.87780120000001, 32.8756918],
 'longitude': [-117.2403332, -117.2407337, -117.2356496]}

In [5]:
df = pd.DataFrame(dct)
df.head()

| | name | latitude | longitude |
|---|---|---|---|
| 0 | Mandeville Annex Gallery | 32.878137 | -117.240333 |
| 1 | University Art Gallery (UAG) | 32.877801 | -117.240734 |
| 2 | Crafts Center Grove Gallery | 32.875692 | -117.23565 |
| 3 | San Diego Center-Jewish Comm | 32.875717 | -117.21535 |
| 4 | Gotthelf Art Gallery | 32.87561 | -117.215004 |


Now that I have a dataframe of places, I can filter the API results with boolean masks on the pandas Series to narrow the dataset to only locations on campus.
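
As an illustration of this kind of filtering, here is a minimal sketch on a made-up three-row dataframe. The column names match the ones used above, but the rows and the rectangular bounds are invented for the example and are not the campus bounds.

```python
import pandas as pd

# Toy data: one point inside a hypothetical bounding box and two outside it.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "latitude": [32.88, 32.90, 32.875],
    "longitude": [-117.24, -117.24, -117.20],
})

# Each comparison yields a boolean Series aligned with the dataframe's index.
in_lat = (toy["latitude"] >= 32.87) & (toy["latitude"] <= 32.89)
in_lng = (toy["longitude"] >= -117.25) & (toy["longitude"] <= -117.22)

# Combine the masks with & and index the dataframe once.
inside = toy[in_lat & in_lng]
print(inside["name"].tolist())
```

Combining the masks with `&` before indexing keeps every mask aligned with the same, unfiltered index.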

In [1]:
top_bound = ((df['latitude'] <= 32.8912) & (df['longitude'] <= -117.237248)) | \
            ((df['latitude'] <= 32.885216) & (df['longitude'] <= -117.222171)) | \
            ((df['latitude'] <= 32.882486) & (df['longitude'] <= -117.21923))
bottom_bound = (df['latitude'] >= 32.871570)
left_bound = df['longitude'] >= -117.243233
right_bound = df['longitude'] <= -117.218857

# Combine the masks with & and index once, so each mask stays aligned with df's index
df = df[left_bound & right_bound & top_bound & bottom_bound]
df.head()


In [6]:
df.to_csv("places.csv")

In [7]:
places = {row[1]: (row[2], row[3]) for row in df.itertuples()}

{key:places[key] for key in list(places.keys())[:5]}

{'Mandeville Annex Gallery': (32.8781369, -117.2403332),
 'University Art Gallery (UAG)': (32.87780120000001, -117.2407337),
 'Crafts Center Grove Gallery': (32.8756918, -117.2356496),
 'San Diego Center-Jewish Comm': (32.8757165, -117.2153504),
 'Gotthelf Art Gallery': (32.87561000000001, -117.215004)}

## K-Means

Now that our dataset has been put together, we can begin to implement the k-means algorithm. The goal of the algorithm is to take a sequence of points and separate them into k clusters. The algorithm starts by randomly selecting k points as the initial centroids. These centroids act as the centers of our clusters. The algorithm then consists of 2 notable steps:

1. Update the clusters. We assign each point to a cluster based on which centroid the point is closest to. This step is performed by the **group_by_centroid** function, which takes the places and a sequence of centroids and returns k clusters.
2. Update the centroids. We calculate the mean latitude and mean longitude of each cluster and use them as that cluster's new centroid. This step is performed by the **find_centroid** function, which is mapped over each cluster.

These 2 steps are repeated until the centroids stop changing or we reach a maximum number of updates that we define ourselves.
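
To make the two steps concrete, here is a minimal sketch of one update on four made-up points. The helper names `euclidean`, `assign`, and `update` are hypothetical and exist only for this example; the straight-line distance is a stand-in for whatever metric the notebook's `distance` function uses, and the starting centroids are picked by hand rather than at random.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two (x, y) points (a stand-in metric)."""
    return float(np.hypot(a[0] - b[0], a[1] - b[1]))

def assign(points, centroids):
    """Step 1: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        dists = [euclidean(c, p) for c in centroids]
        clusters[dists.index(min(dists))].append(p)
    return clusters

def update(clusters):
    """Step 2: move each centroid to the mean of its cluster."""
    return [
        (float(np.mean([p[0] for p in c])), float(np.mean([p[1] for p in c])))
        for c in clusters
    ]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]  # hand-picked starting centroids

clusters = assign(points, centroids)
new_centroids = update(clusters)
print(new_centroids)

# A second pass leaves the centroids where they are, so the loop would stop here.
converged = update(assign(points, new_centroids)) == new_centroids
print(converged)
```

In this toy case the centroids move once, from the hand-picked points to the middle of each pair, and a second pass changes nothing, which is the stopping condition described above.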

In [8]:
def location(place_name):
    """
    Returns location of a place
    :param place_name: name of place on map
    :return: the location given as a tuple of latitude and longitude
    """
    return (places[place_name][0], places[place_name][1])

In [9]:
def group_by_centroid(places, centroids):
    """
    Assigns places to their respective closest centroids and returns a cluster of places for each centroid
    :param places: a dictionary mapping place names to (latitude, longitude) tuples
    :param centroids: a sequence of centroids
    :return: a nested sequence containing sequences of locations all closest to the same centroid
    """
    clusters = [[] for _ in range(len(centroids))]
    # Named loc rather than location so the location() function is not shadowed
    for place_name, loc in places.items():
        dists = [distance(centroid, loc) for centroid in centroids]
        # index() returns the first match, so ties go to the earliest centroid
        clusters[dists.index(min(dists))].append(loc)
    return clusters
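
The `distance` function used above is imported from `utils` and is not shown in this notebook. As a minimal sketch of what such a function could look like, assuming a great-circle (haversine) distance between two (latitude, longitude) tuples, the hypothetical `haversine` below returns meters. It is an illustration only, not the actual implementation in `utils`.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine(a, b):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine formula: h is the squared half-chord length between the points.
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

# Two of the galleries from the dataset above.
mandeville = (32.8781369, -117.2403332)
uag = (32.87780120000001, -117.2407337)
print(haversine(mandeville, uag))
```

Over an area as small as a campus, a plain Euclidean distance on the raw coordinates would likely produce similar clusters, so either choice is plausible here.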

In [10]:
def find_centroid(cluster):
    """
    Returns updated centroid of given cluster
    :param cluster: a sequence of (latitude, longitude) tuples
    :return: tuple of latitude and longitude for updated centroid
    """
    return (np.mean([i[0] for i in cluster]), np.mean([i[1] for i in cluster]))
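
One edge case worth noting: `np.mean` of an empty list returns `nan` and emits a warning, so a cluster that ends up with no places would produce a `nan` centroid. A minimal sketch of one possible guard is below. The name `find_centroid_or_keep` and the choice to fall back to the previous centroid are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def find_centroid_or_keep(cluster, previous):
    """
    Returns the mean (latitude, longitude) of cluster,
    or the previous centroid unchanged if the cluster is empty.
    """
    if not cluster:
        return previous
    return (
        float(np.mean([p[0] for p in cluster])),
        float(np.mean([p[1] for p in cluster])),
    )

print(find_centroid_or_keep([(0.0, 0.0), (2.0, 4.0)], (9.0, 9.0)))
print(find_centroid_or_keep([], (9.0, 9.0)))
```

Other strategies, such as re-seeding the empty cluster from a random place, are equally reasonable; keeping the old centroid is simply the smallest change.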

In [1]:
def k_means(places, k, max_updates=100):
    """
    Uses the k-means algorithm to group places into k clusters
    :param places: a dictionary mapping place names to (latitude, longitude) tuples
    :param k: number of clusters to group places into
    :param max_updates: maximum number of centroid updates allowed
    :return: list of k centroids, each a tuple of latitude and longitude
    """
    assert len(places) >= k, 'Not enough places to cluster'

    old_centroids, n = [], 0
    # Pick k distinct places at random as the initial centroids
    indexes = list(np.random.choice(range(len(places)), size=k, replace=False))
    locations = list(places.values())
    centroids = [locations[i] for i in indexes]

    # Repeat the two steps until the centroids stop changing or max_updates is reached
    while old_centroids != centroids and n < max_updates:
        old_centroids = centroids
        clusters = group_by_centroid(places, centroids)
        centroids = list(map(find_centroid, clusters))
        n += 1
    return centroids

In [2]:
centroids = k_means(places, 9)
centroids


## Visualization

To visualize the locations on campus, I pass the places and the final centroids to `cluster_map`.

In [None]:
cluster_map(places, centroids)

Serving HTTP on 0.0.0.0 port 8000 ...
Type Ctrl-C to exit.
