# IBM Data Science Capstone

## Introduction
The different neighborhoods of a city can each have a distinctly different feel. I will be examining the neighborhoods of Seattle and clustering them into groups of similar neighborhoods based on the number and types of venues they contain. This analysis could be valuable to several groups of people. For example, business owners with a successful business in one neighborhood may be interested in similar areas where they could expand. Residents planning to move between neighborhoods might also want to know which neighborhoods are similar (or dissimilar) to where they currently live.

In [3]:
# Imports, needed for code throughout the analysis
import pandas as pd
import numpy as np
import os
import requests
import folium
import geopy.distance
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from math import cos, radians
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

## Data

The data used in this analysis comes from two sources.

First, we need geographic data for Seattle's neighborhoods. This is downloaded from an open data set (found [here](https://gis-kingcounty.opendata.arcgis.com/datasets/neighborhood-centers-in-king-county-neighborhood-centers-point), under the "Download" menu). This data defines 46 different neighborhoods of Seattle and provides latitude and longitude coordinates for each. These coordinates are needed to fetch the second data source.

I decided to drop neighborhoods that were far from the city center, using an arbitrary cutoff of 13 km. Calculating distances from latitude and longitude coordinates is actually quite complicated (fun fact: the curvature of the Earth must be accounted for!), but fortunately the `geopy` library does the hard work for you. That left 33 neighborhoods within the cutoff distance.

In [4]:
# Load the neighborhood centers and keep only the name and coordinates
df = pd.read_csv("Neighborhood_Centers_in_King_County___neighborhood_centers_point.csv")
df = df[['NAME', 'LATITUDE', 'LONGITUDE']]

seattle_coordinates = (47.6062, -122.3321)

# Distance (km) from the city center to each neighborhood center
distances = []
for idx, neighborhood in df.iterrows():
    neighborhood_coordinates = (neighborhood.LATITUDE, neighborhood.LONGITUDE)
    km_to_city_center = geopy.distance.distance(seattle_coordinates, neighborhood_coordinates).km
    distances.append(km_to_city_center)
df['KM_TO_CITY_CENTER'] = distances

# Drop neighborhoods beyond the 13 km cutoff
drop_cities_farther_than = 13
df = df[df['KM_TO_CITY_CENTER'] <= drop_cities_farther_than]
df.head()

| | NAME | LATITUDE | LONGITUDE | KM_TO_CITY_CENTER |
|---|---|---|---|---|
| 1 | Northgate Neighborhood | 47.708593 | -122.323276 | 11.403751 |
| 2 | Lake City Neighborhood | 47.719278 | -122.295228 | 12.873825 |
| 4 | Wedgewood Neighborhood | 47.675783 | -122.290273 | 8.350559 |
| 5 | University District | 47.661268 | -122.313133 | 6.286332 |
| 6 | Green Lake Neighborhood | 47.67949 | -122.325846 | 8.162216 |
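
To make the "curvature of the Earth" remark concrete, here is a minimal sketch of the haversine formula, which treats the Earth as a sphere. This is not what the analysis uses: the cell above relies on `geopy.distance.distance`, which (as far as I know) defaults to a geodesic on an ellipsoid, so the two results will differ slightly. The function name `haversine_km` and the radius constant are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a perfect sphere


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = map(radians, point_a)
    lat2, lng2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine of the central angle between the two points
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


seattle_center = (47.6062, -122.3321)
northgate = (47.708593, -122.323276)  # coordinates from the df.head() output
northgate_km = haversine_km(seattle_center, northgate)
print(round(northgate_km, 1))
```

For Northgate, the spherical approximation should land within a few meters of the `KM_TO_CITY_CENTER` value in the `df.head()` output, which is plenty for a 13 km cutoff.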


Let's see those neighborhoods on a map:

In [7]:
seattle_map = folium.Map(location=seattle_coordinates, zoom_start=11)

for lat, lng, name in zip(df['LATITUDE'], df['LONGITUDE'], df['NAME']):
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        color='blue',
        fill=True,
        fill_opacity=1.0,
        popup=label
    ).add_to(seattle_map)
seattle_map

Next, we need venue data for each neighborhood, which we will fetch from the Foursquare API. This data contains numerous details for each venue, but we are mainly concerned with the `category`. There are hundreds of venue categories, but they all roll up into 10 top-level categories, and these top-level categories are what this analysis uses. The top-level categories are:
- Arts & Entertainment
- College & University
- Event
- Food
- Nightlife Spot
- Outdoors & Recreation
- Professional & Other Places
- Residence
- Shop & Service
- Travel & Transport

To get the venue data, we need to iterate through every category for every neighborhood and make a request to the Foursquare API for each. The design of the API imposes a limitation: it can only return venues within a certain radius of a point on the map. Unfortunately, the neighborhoods of Seattle are not shaped like perfect circles, so we will not get all the venues in a neighborhood, but it will have to do. I arbitrarily chose a radius of 250 meters.

Fetching this data requires `10 categories * 33 neighborhoods = 330 requests` to the API. Fortunately, this is within the usage limits of the free tier.
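
The request plan described above can be sketched as follows. This is not the code used for the analysis (that code is omitted from this write-up); it is a minimal illustration that assumes the legacy Foursquare v2 `venues/search` endpoint and its `ll`, `radius`, `categoryId`, and `limit` parameters. The helper names, the credentials, and the category IDs are hypothetical placeholders.

```python
from itertools import product

# Assumption: the legacy v2 search endpoint; each params dict below would be
# sent with requests.get(SEARCH_URL, params=params)
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"
RADIUS_METERS = 250
RESULT_LIMIT = 50  # per-request cap on returned venues


def build_search_params(lat, lng, category_id, client_id, client_secret):
    """Query parameters for one neighborhood/category search."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20200101",  # API version date (placeholder)
        "ll": f"{lat},{lng}",
        "radius": RADIUS_METERS,
        "categoryId": category_id,
        "limit": RESULT_LIMIT,
    }


def plan_requests(neighborhoods, category_ids, client_id, client_secret):
    """One (neighborhood, category, params) tuple per required request."""
    return [
        (name, category, build_search_params(lat, lng, category_id, client_id, client_secret))
        for (name, lat, lng), (category, category_id)
        in product(neighborhoods, category_ids.items())
    ]


# Two neighborhoods and two categories, with placeholder (not real) category IDs
neighborhoods = [
    ("Northgate Neighborhood", 47.708593, -122.323276),
    ("University District", 47.661268, -122.313133),
]
category_ids = {"Food": "FOOD_ID_PLACEHOLDER", "Residence": "RESIDENCE_ID_PLACEHOLDER"}
planned = plan_requests(neighborhoods, category_ids, "MY_CLIENT_ID", "MY_CLIENT_SECRET")
print(len(planned))  # 2 neighborhoods * 2 categories
```

With all 33 neighborhoods and all 10 top-level categories, the same plan would contain the 330 requests mentioned above.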

While testing these results, I noticed that many of the requests returned the API limit of 50 venues (for example, Fremont, Capitol Hill, and the International District all had at least 50 `Food` venues). This was a problem because if a neighborhood had more than 50 venues of a certain category, some of them would be left out and this would represent that neighborhood inaccurately. 

To get around this, whenever a request returned 50 venues, I performed a subsearch of that same area. I submitted 4 new requests to the API, each one searching a radius of 125 meters (half the original 250 meters) and centered 125 meters north, south, east, and west of the original search point. These 4 subsearches are all contained within the original search area but do not cover all of it, so it is possible that some venues were missed. These subsearches also overlap, so duplicate venues needed to be removed.
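
The subsearch geometry and the de-duplication step can be sketched like this. Again, this is an illustration rather than the original code: the meters-to-degrees conversion is a flat-Earth approximation that is only reasonable for small offsets, and the sketch assumes each venue record carries a unique `id` field. The function names are hypothetical.

```python
from math import cos, radians

# Approximate length of one degree of latitude in meters (assumption: good
# enough for offsets of a few hundred meters)
METERS_PER_DEGREE_LAT = 111_320


def subsearch_centers(lat, lng, offset_m=125):
    """Centers of the 4 subsearches, offset_m north, south, east, and west."""
    dlat = offset_m / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    dlng = offset_m / (METERS_PER_DEGREE_LAT * cos(radians(lat)))
    return {
        "north": (lat + dlat, lng),
        "south": (lat - dlat, lng),
        "east": (lat, lng + dlng),
        "west": (lat, lng - dlng),
    }


def dedupe_venues(venues):
    """Keep the first occurrence of each venue, keyed on its `id`."""
    unique = {}
    for venue in venues:
        unique.setdefault(venue["id"], venue)
    return list(unique.values())


centers = subsearch_centers(47.6062, -122.3321)
merged = dedupe_venues([{"id": "a"}, {"id": "b"}, {"id": "a"}])
print(len(centers), len(merged))
```

Each subsearch would then be issued with a 125 meter radius around its center, and the results from all 4 subsearches passed through `dedupe_venues` before counting.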

The code to do this was fairly verbose, so we will skip the details. The results were saved to a CSV file. 

In [8]:
vdf = pd.read_csv('seattle_venues.csv')
vdf.head()

| | VENUE | CATEGORY | TOP_LEVEL_CATEGORY | NEIGHBORHOOD | NEIGHBORHOOD_LAT | NEIGHBORHOOD_LNG |
|---|---|---|---|---|---|---|
| 0 | Westlake Dance Center | Dance Studio | Arts & Entertainment | Northgate Neighborhood | 47.708593 | -122.323276 |
| 1 | Kidgits | General Entertainment | Arts & Entertainment | Northgate Neighborhood | 47.708593 | -122.323276 |
| 2 | Hot Blonde Bum @ 5th And Northgate | Outdoor Sculpture | Arts & Entertainment | Northgate Neighborhood | 47.708593 | -122.323276 |
| 3 | The Wolfe Den | Theme Park Ride / Attraction | Arts & Entertainment | Northgate Neighborhood | 47.708593 | -122.323276 |
| 4 | Circulation | Art Gallery | Arts & Entertainment | Northgate Neighborhood | 47.708593 | -122.323276 |
