# City venues classification of Bergen and Hudson County, NJ

New Jersey's Hudson County has a unique geography.  It is the sixth most densely populated county in the USA and is located west of the Hudson River.  The county has a robust transit system: PATH trains take passengers to the World Trade Center, Midtown Manhattan, and Newark Penn Station; the Hudson-Bergen Light Rail provides frequent service within the county; and bus lines serve many of its neighborhoods.  The dense population, availability of transportation, and proximity to major cities make the county attractive for prospective venues.  This project clusters the neighborhoods of Hudson County, NJ by the frequency of the venue categories found in each neighborhood, grouping similar neighborhoods together with the KMeans algorithm.  The report is aimed at people who might want to open a business and want to see the most frequent venues in an area of Hudson County.

# Data

The 'explore' endpoint of the Foursquare API was used to locate nearby venues.  The API requires credentials, which are omitted from the final code, along with a latitude, a longitude, and a search radius for each location.  Some Hudson County cities are large and are subdivided into several zip codes, so venues are collected per zip code.  OpenDataSoft (opendatasoft.com) publishes a list of NJ zip codes with their city names and latitude and longitude coordinates.  That list covers every city in NJ, so the table of cities in the Hudson County Wikipedia article is used to limit it to Hudson County.  The Foursquare API is then queried at each remaining coordinate to collect nearby venue information.

OpenDataSoft list of NJ city information:
https://public.opendatasoft.com/api/records/1.0/search/?dataset=us-zip-code-latitude-and-longitude&q=NJ&rows=762&facet=state&facet=timezone&facet=dst

#### Example record from an OpenDataSoft response

{'datasetid': 'us-zip-code-latitude-and-longitude',
  'recordid': '0b26467a7ec2b1a30432f960a680a2948091124c',
  'fields': {'city': 'Nutley',
   'zip': '07110',
   'dst': 1,
   'geopoint': [40.8196, -74.15877],
   'longitude': -74.15877,
   'state': 'NJ',
   'latitude': 40.8196,
   'timezone': -5},
  'geometry': {'type': 'Point', 'coordinates': [-74.15877, 40.8196]},
  'record_timestamp': '2018-02-09T16:33:38.603000+00:00'},
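
Records shaped like the one above can be flattened into a table and then limited to Hudson County. The following is a minimal sketch, not the notebook's actual code: the function names `records_to_frame` and `keep_cities`, the short city list, and the second sample record (whose coordinates are illustrative placeholders) are all assumptions made for the example.

```python
import pandas as pd

def records_to_frame(records):
    """Flatten OpenDataSoft-style records into one row per zip code."""
    rows = [
        {
            "City": rec["fields"]["city"],
            "Zip": rec["fields"]["zip"],
            "Latitude": rec["fields"]["latitude"],
            "Longitude": rec["fields"]["longitude"],
        }
        for rec in records
    ]
    return pd.DataFrame(rows, columns=["City", "Zip", "Latitude", "Longitude"])

def keep_cities(zip_frame, city_names):
    """Keep only the rows whose city appears in city_names."""
    mask = zip_frame["City"].isin(city_names)
    return zip_frame[mask].reset_index(drop=True)

# Sample input: the Nutley record from above, trimmed to the fields used here,
# plus one hypothetical record with placeholder coordinates.
sample_records = [
    {"fields": {"city": "Nutley", "zip": "07110",
                "latitude": 40.8196, "longitude": -74.15877}},
    {"fields": {"city": "Hoboken", "zip": "07030",
                "latitude": 40.7445, "longitude": -74.0329}},
]

all_zips = records_to_frame(sample_records)
# In the project the city names come from the Wikipedia table; a short
# hard-coded list stands in for it here.
hudson_zips = keep_cities(all_zips, ["Hoboken", "Jersey City"])
print(hudson_zips)
```

Nutley is dropped because it is not in the city list, which is the same filtering step the Wikipedia table performs on the full NJ list.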

Wikipedia article that contains a table of Hudson County, NJ cities: https://en.wikipedia.org/wiki/Hudson_County,_New_Jersey
Only the city column of this table is used.

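import pandas as pd

# Read every table on the Hudson County Wikipedia page
hudsonURL = "https://en.wikipedia.org/wiki/Hudson_County,_New_Jersey"
hudsonTables = pd.read_html(hudsonURL)
# The city table was at index 1 when this was run; the index may change if the page is edited
hudsonCities = hudsonTables[1]
# Rename the column to "City" so it can be used as the city filter
hudsonCities = hudsonCities.rename(columns={"Municipality": "City"})
hudsonCities.head()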

## Foursquare API URL used in the program

url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    lat,
    lng,
    radius,
    LIMIT
)
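
Because the same URL has to be built once per zip code, it could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `build_explore_url`, its arguments, and the `EXPLORE_ENDPOINT` constant are hypothetical and do not appear in the original program. It uses the standard library's `urlencode`, which percent-encodes the comma in `ll`; decoding the query string recovers the original value.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build an explore URL with the same query parameters as the template above."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    })
    return EXPLORE_ENDPOINT + "?" + query

# Build a URL for one coordinate and decode it again to check the parameters.
example_url = build_explore_url("redacted", "redacted", "20180605", 40.81, -74.15, 500, 3)
params = parse_qs(urlsplit(example_url).query)
print(params["ll"], params["radius"], params["limit"])
```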

### Typical response from the URL

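import requests

# Credentials are redacted; supply your own Foursquare client ID and secret
CLIENT_ID = "redacted"
CLIENT_SECRET = "redacted"
VERSION = "20180605"  # API version date (YYYYMMDD)
LIMIT = 3  # maximum number of venues to return
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    40.81,   # latitude
    -74.15,  # longitude
    500,     # search radius in meters
    LIMIT
)
test = requests.get(url).json()
test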

{'meta': {'code': 200, 'requestId': '604e8306c0470e3e31a3e43d'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Nutley',
  'headerFullLocation': 'Nutley',
  'headerLocationGranularity': 'city',
  'totalResults': 21,
  'suggestedBounds': {'ne': {'lat': 40.814500004500005,
    'lng': -74.14406564245576},
   'sw': {'lat': 40.8054999955, 'lng': -74.15593435754425}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4fb18d6ee4b05cf2df954c93',
       'name': 'TNT Fitness',
       'location': {'address': '135 Main St',
        'crossStreet': 'Belleville Avenue',
        'lat': 40.80967276742127,
        'lng': -74.15170481755672,
        'labeledLatL
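
The venue name and category can be pulled out of a response with the nesting shown above (`response`, `groups`, `items`, `venue`). The sketch below assumes each venue also carries a `categories` list whose entries have a `name`; that part of the output is cut off above, so the field names should be treated as an assumption. The function name `extract_venues` and the sample response, with placeholder venue names and categories, are hypothetical.

```python
def extract_venues(explore_json):
    """Return (venue name, first category name) pairs from an explore-style dict."""
    venues = []
    for group in explore_json.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            categories = venue.get("categories") or []
            # Use the first listed category; None when a venue has no category
            category = categories[0]["name"] if categories else None
            venues.append((venue.get("name"), category))
    return venues

# Hypothetical, heavily trimmed response with placeholder values.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Venue A",
                               "categories": [{"name": "Example Category"}]}},
                    {"venue": {"name": "Example Venue B"}},
                ]
            }
        ]
    }
}

print(extract_venues(sample_response))
```

Returning `None` for a missing category keeps one malformed venue from stopping the collection loop; such rows can be dropped before the frequencies are computed.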

# Methodology
The coordinates of each NJ zip code were parsed from the OpenDataSoft API response.  The pandas library was used to read the table of cities from Wikipedia, and the zip codes were limited to the Hudson County cities in that table.  The category of each venue was parsed from the Foursquare API responses.  The categories were converted to dummy (one-hot) variables with pandas, and the rows were grouped by zip code to give the frequency of each venue category in that zip code.  The 10 most frequent venue categories for each zip code were added to a new dataframe in descending order of frequency.  The KMeans clustering algorithm, with k = 5, was then used to group the zip codes based on their most frequent venue categories.
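
The encoding, grouping, ranking, and clustering steps can be sketched on a toy table. This is an illustration under assumptions rather than the project's code: the function names, the column names `Zip` and `Category`, and the placeholder zip labels are hypothetical, and the sketch fits KMeans on the per-zip-code frequency table, which is one plausible reading of the description above. It uses k = 2 only because the toy table has four zip codes.

```python
import pandas as pd
from sklearn.cluster import KMeans

def venue_frequencies(venues):
    """One-hot encode venue categories, then average the rows per zip code."""
    dummies = pd.get_dummies(venues["Category"], dtype=float)
    return dummies.groupby(venues["Zip"]).mean()

def top_categories(freq, n):
    """Map each zip code to its n most frequent categories, most frequent first."""
    return {
        zip_code: row.sort_values(ascending=False).index[:n].tolist()
        for zip_code, row in freq.iterrows()
    }

def cluster_zip_codes(freq, k, seed=0):
    """Fit KMeans on the frequency table and return one cluster label per zip code."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(freq.to_numpy())
    return pd.Series(labels, index=freq.index, name="Cluster")

# Toy input with placeholder zip labels: one row per venue found.
venues = pd.DataFrame({
    "Zip": ["z1", "z1", "z1", "z2", "z2", "z3", "z3", "z4", "z4", "z4"],
    "Category": [
        "Italian Restaurant", "Italian Restaurant", "Bank",
        "Italian Restaurant", "Bank",
        "Indian Restaurant", "Indian Restaurant",
        "Indian Restaurant", "Indian Restaurant", "Deli / Bodega",
    ],
})

freq = venue_frequencies(venues)
top = top_categories(freq, n=2)          # the report keeps the top 10
clusters = cluster_zip_codes(freq, k=2)  # the report uses k = 5
print(top)
print(clusters)
```

The cluster numbers themselves are arbitrary; only which zip codes share a label is meaningful, so results are best compared by membership rather than by label value.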

# Results

The algorithm grouped West New York, North Bergen, and both zip codes of Union City into cluster 0.  Cluster 1 contains 07096 (Secaucus), 07099 (Kearny), and the Jersey City zip codes 07309, 07303, 07399, 07097, 07398, and 07311.  Cluster 2 contains Hoboken, Bayonne, Harrison, 07094 (Secaucus), and the Jersey City zip codes 07307, 07310, 07302, 07305, 07304, and 07301.  07032 (Kearny) is the only member of cluster 3, and 07306 (Jersey City) is the only member of cluster 4.

# Discussion

Cluster 1 is interesting.  The dominant venue category in cluster 1 is Italian Restaurant, and Mexican Restaurant is the second most common in the majority of its zip codes.  There also seems to be a subgroup within cluster 1, as two zip codes share exactly the same common venues.

The zip codes in cluster 0 share Latin American Restaurant as the most common venue category, with Union City as the exception.  These cities have large Hispanic populations, so this is not a big surprise.  Cuban Restaurant is the second most common venue category for some zip codes in cluster 0.

The single-member clusters 3 and 4 are unique.  Cluster 4 consists of 07306 in Jersey City, which is set apart by its most common venue category: Indian Restaurant.  About one quarter of the venues in this zip code were Indian restaurants, which was probably the deciding factor for cluster 4.

The sole member of cluster 3 is less of an outlier, as its most common venue category is Deli/Bodega.  The algorithm may have placed it alone in cluster 3 because it has a unique second most common venue category: Dance Studio.

Cluster 2 is the most puzzling.  There is no clear reason why these zip codes were placed together; however, they do share some similarity.  They typically include general venues such as banks, bars, and a few restaurants.

# Conclusion

I identified Hudson County, NJ as a prime target for potential new venues because of its location, access to public transit, and dense population.  I used the OpenDataSoft API, a Wikipedia table of Hudson County city names, and the Foursquare API to get the venue data.  The most frequent venue categories were found for each zip code.  These were fed into the KMeans algorithm, which clustered the zip codes according to their common venues.  One zip code was identified with a high frequency of Indian restaurants.  One zip code had an unusually high number of dance studios.  Many zip codes had a high number of Italian restaurants.  There are a lot of Latin American restaurants in the northern parts of Hudson County.  Some zip codes are only loosely related to each other.