# Coursera - Applied Data Science Capstone Project

## Best GYM in Manhattan NY (Week 2)

### Scenario:

As a fitness enthusiast, my first priority after moving to a new place is to find a good gym.
I will use this opportunity provided by Coursera to compare the best gyms around Manhattan, NY.
To compare and evaluate the fitness centers, I set the following restrictions:
- The gym should be very popular with Foursquare users.
- The gym should be in one of the neighborhoods of Manhattan.

### Business Problem:   
The challenge is to find a good gym in Manhattan, NY that meets the requirements on rating, popularity, and share of disagreeing users. The data required to address this challenge is described in Section 2 below.


### Interested Audience
I believe this is a relevant challenge, with valid questions, for anyone moving to another large city in the US, EU, or Asia and wanting to find good fitness centers. The same methodology can be adapted to other requirements. It also applies to anyone interested in starting or locating a new business in a city. Lastly, it can serve as a good practical exercise for developing data science skills.

# 2. Data Section:
### Description of the data and its sources that will be used to solve the problem

### Description of the Data:

The following data is required to answer the questions posed by the problem:

- Latitude and longitude of Manhattan, NY.
- List of fitness centers around Manhattan, NY.
- List of fitness centers with ratings, popularity count, and agree/disagree count.
- NY neighborhood data.
- Venues for each fitness center (by Manhattan neighborhood, so they can be clustered).


### How the data will be used to solve the problem

The data will be used as follows:
- Use Foursquare to find the top 5 venues for each Manhattan neighborhood.
- Use Foursquare and geopy data to map the location of fitness centers.
- Use Folium to map NY neighborhoods and cluster them into groups (as in the course lab).
- Create a map that depicts the best gyms / fitness centers around Manhattan, NY.

Processing this data will help answer the key questions needed to make a decision:
- Which is the most popular fitness center in the neighborhood?
- What is the distance from the fitness center to a possible home in NY?
- How many Foursquare users agree with the review given for the fitness center?
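
The distance question above can be answered with the haversine formula, which gives the great-circle distance between two latitude/longitude pairs. The sketch below is a minimal illustration: the function name `haversine_m` and the Earth radius constant are assumptions, not part of the original notebook. The sample coordinates are the Marble Hill centroid and the Planet Fitness venue that appear in the Foursquare results further down.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

marble_hill = (40.87655077879964, -73.91065965862981)
planet_fitness = (40.874088, -73.909137)
distance = haversine_m(*marble_hill, *planet_fitness)
print(round(distance))
```

Foursquare reports this venue at a distance of 302 meters from the query point, so the computed value should land close to that figure.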

## Note: Because this notebook is hosted on GitHub, the images are not visible. Please refer to the presentation for the images.

## Data Preparation


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Since the `features` key holds the data we need, we'll assign it to a variable.

In [4]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a *pandas* dataframe

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time.

In [7]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
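
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. A version-independent sketch, assuming the same feature structure, collects plain dictionaries first and builds the dataframe once. The function name `build_neighborhoods` is hypothetical, and the two sample features are trimmed copies used only for illustration.

```python
import pandas as pd

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

def build_neighborhoods(features):
    """Build the neighborhoods dataframe from a list of GeoJSON features."""
    rows = []
    for data in features:
        # GeoJSON stores point coordinates as [longitude, latitude]
        lon, lat = data['geometry']['coordinates']
        rows.append({'Borough': data['properties']['borough'],
                     'Neighborhood': data['properties']['name'],
                     'Latitude': lat,
                     'Longitude': lon})
    return pd.DataFrame(rows, columns=column_names)

sample_features = [
    {'geometry': {'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.91065965862981, 40.87655077879964]},
     'properties': {'name': 'Marble Hill', 'borough': 'Manhattan'}},
]
neighborhoods_sample = build_neighborhoods(sample_features)
print(neighborhoods_sample)
```

Building the frame in one step also avoids copying the accumulated rows on every iteration.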

In [8]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [10]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


#### Create a map of New York with neighborhoods superimposed on top.

In [11]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [12]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


Let's get the geographical coordinates of Manhattan.

In [13]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


As we did with all of New York City, let's visualize Manhattan and the neighborhoods in it.

In [14]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'LRI3X3XUNGGKG11TX2DH5ZAVOQ5CLE3CM2NXOBHA0YNQT2K5' # your Foursquare ID
CLIENT_SECRET = '4DRAHU1TRPC0XAUYGVVA2LKDYGPKXVSYY1U0ENJV3MNTC4Y3' # your Foursquare Secret
VERSION = '20180605'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LRI3X3XUNGGKG11TX2DH5ZAVOQ5CLE3CM2NXOBHA0YNQT2K5
CLIENT_SECRET:4DRAHU1TRPC0XAUYGVVA2LKDYGPKXVSYY1U0ENJV3MNTC4Y3
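
Hard-coding and printing API credentials in a notebook exposes them to anyone who can read the file. One common alternative, sketched here, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are assumptions for this illustration, not Foursquare conventions.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-123',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-456'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
```

In a real session, `load_credentials()` would be called without arguments so that it reads from `os.environ`.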


In [16]:
search_query = 'GYM'
radius = 3000
print(search_query + ' .... OK!')

GYM .... OK!


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [17]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

Get the neighborhood's latitude and longitude values.

In [18]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


#### Now, let's get up to 30 venues matching the query `GYM` within a radius of 3000 meters of Marble Hill.

First, let's create the GET request URL and store it in **url**.

In [19]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=LRI3X3XUNGGKG11TX2DH5ZAVOQ5CLE3CM2NXOBHA0YNQT2K5&client_secret=4DRAHU1TRPC0XAUYGVVA2LKDYGPKXVSYY1U0ENJV3MNTC4Y3&ll=40.87655077879964,-73.91065965862981&v=20180605&query=GYM&radius=3000&limit=30'
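
Building the query string by hand with `str.format` works here, but it does not escape special characters in values such as a multi-word search query. A minimal sketch using the standard library's `urllib.parse.urlencode` follows; the helper name `build_search_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Assemble the venues/search URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)

demo_url = build_search_url('YOUR_ID', 'YOUR_SECRET',
                            40.87655077879964, -73.91065965862981,
                            '20180605', 'fitness center', 3000, 30)
print(demo_url)
```

The same dictionary can also be passed as `params=` to `requests.get`, which performs the encoding itself.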

Send the GET request and examine the results

In [20]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d6aa06eacc5f5002ceac72e'},
 'response': {'venues': [{'id': '4cc9ac26d54fa1cda7f33829',
    'name': 'Old Gym Building - Lehman College',
    'location': {'address': '250 Bedford Park Blvd W',
     'lat': 40.87219552416185,
     'lng': -73.89512523441805,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.87219552416185,
       'lng': -73.89512523441805}],
     'distance': 1394,
     'postalCode': '10468',
     'cc': 'US',
     'city': 'Bronx',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['250 Bedford Park Blvd W',
      'Bronx, NY 10468',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d198941735',
      'name': 'College Academic Building',
      'pluralName': 'College Academic Buildings',
      'shortName': 'Academic Building',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/education/academicbuilding_',
       'suffix': '.png'},
      'primary': True}],
    'referralId

For the search endpoint used here, the venue records are stored under the *venues* key of the response (the *items* key from the Foursquare lab belongs to the explore endpoint). Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [22]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
nearby_venues = json_normalize(venues)
nearby_venues.head()

Unnamed: 0,categories,hasPerk,id,location.address,location.cc,location.city,location.country,location.crossStreet,location.distance,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.neighborhood,location.postalCode,location.state,name,referralId,venuePage.id
0,"[{'id': '4bf58dd8d48988d198941735', 'name': 'C...",False,4cc9ac26d54fa1cda7f33829,250 Bedford Park Blvd W,US,Bronx,United States,,1394,"[250 Bedford Park Blvd W, Bronx, NY 10468, Uni...","[{'label': 'display', 'lat': 40.87219552416185...",40.872196,-73.895125,,10468.0,NY,Old Gym Building - Lehman College,v-1567268974,
1,"[{'id': '4bf58dd8d48988d176941735', 'name': 'G...",False,53fbbab3498e41f50c2fac75,3210 Riverdale Ave,US,Bronx,United States,232nd Street,735,"[3210 Riverdale Ave (232nd Street), Bronx, NY ...","[{'label': 'display', 'lat': 40.88274556055887...",40.882746,-73.907625,,10463.0,NY,3210 Riverdale Avenue - Wellness Center & Gym,v-1567268974,
2,[],False,4ce43fbe4039b60c7d0ec006,,US,Bronx,United States,,215,"[Bronx, NY 10463, United States]","[{'label': 'display', 'lat': 40.87471362199872...",40.874714,-73.911467,,10463.0,NY,Winston Churchill Gym,v-1567268974,
3,"[{'id': '4bf58dd8d48988d175941735', 'name': 'G...",False,4aa00bcdf964a520103e20e3,82 W 225th St,US,Bronx,United States,,302,"[82 W 225th St, Bronx, NY 10463, United States]","[{'label': 'display', 'lat': 40.8740876399889,...",40.874088,-73.909137,,10463.0,NY,Planet Fitness,v-1567268974,
4,"[{'id': '4bf58dd8d48988d176941735', 'name': 'G...",False,4ce716cd0f196dcb7fe43bae,Douglas Avenue,US,Bronx,United States,,1456,"[Douglas Avenue, Bronx, NY, United States]","[{'label': 'display', 'lat': 40.888722, 'lng':...",40.888722,-73.917011,,,NY,Hayden On Hudson Gym,v-1567268974,


And how many venues were returned by Foursquare?

In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

30 venues were returned by Foursquare.


<a id='item2'></a>

## 2. Explore GYMs in Manhattan

#### Let's create a function to repeat the same process to all the neighborhoods in Manhattan

In [24]:
def getNearbyVenues(names, latitudes, longitudes):
    radius = 100
    LIMIT = 5
    search_query = "GYM"
    venues_list = pd.DataFrame(columns=["name","categories","address","cc","city","country","crossStreet","distance","formattedAddress","labeledLatLngs","lat","lng","postalCode","state","id","Neighborhood","Neighborhood_lat","Neighborhood_long"])
    for name, lat, long in zip(names, latitudes, longitudes):
        print(name)
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, long, VERSION, search_query, radius, LIMIT)
        results = requests.get(url).json()
        
        # assign relevant part of JSON to venues
        venues = results['response']['venues']

        # transform venues into a dataframe
        dataframe = json_normalize(venues)
        #print(dataframe.head())
        if dataframe.empty:
            continue
        else:
            # keep only columns that include venue name, and anything that is associated with location
            filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
            dataframe_filtered = dataframe.loc[:, filtered_columns]

            # function that extracts the category of the venue
            def get_category_type(row):
                try:
                    categories_list = row['categories']
                except:
                    categories_list = row['venue.categories']

                if len(categories_list) == 0:
                    return None
                else:
                    return categories_list[0]['name']

            # filter the category for each row
            dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

            # clean column names by keeping only last term
            dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

            dataframe_filtered['Neighborhood'] = name
            dataframe_filtered['Neighborhood_lat'] = lat
            dataframe_filtered['Neighborhood_long'] = long
            venues_list = venues_list.append(dataframe_filtered, ignore_index = True)
    
    return(venues_list)
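
Appending to a dataframe inside the loop copies the accumulated data on every iteration, and `DataFrame.append` is no longer available in pandas 2.0. A sketch of the accumulation step alone, assuming the per-neighborhood venue lists have already been fetched, gathers one frame per neighborhood and concatenates once. The function name `combine_venues` and the sample records are hypothetical.

```python
import pandas as pd

def combine_venues(venues_by_neighborhood):
    """Flatten {neighborhood: [venue dicts]} into a single dataframe."""
    frames = []
    for name, venues in venues_by_neighborhood.items():
        if not venues:  # skip neighborhoods with no matching venues
            continue
        frame = pd.json_normalize(venues)
        # keep only the last term of dotted names such as 'location.lat'
        frame.columns = [col.split('.')[-1] for col in frame.columns]
        frame['Neighborhood'] = name
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True, sort=False)

sample = {
    'Chinatown': [{'id': 'a1', 'name': 'Gym Office',
                   'location': {'lat': 40.71614, 'lng': -73.994452}}],
    'Marble Hill': [],
    'Yorkville': [{'id': 'b2', 'name': 'Cambridge Gym',
                   'location': {'lat': 40.775142, 'lng': -73.947551}},
                  {'id': 'c3', 'name': "Ellen's Gym",
                   'location': {'lat': 40.775609, 'lng': -73.948123}}],
}
combined = combine_venues(sample)
print(combined[['Neighborhood', 'name', 'lat', 'lng']])
```

Passing `sort=False` keeps the column order stable when neighborhoods return different sets of fields.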

#### Now run the above function on each neighborhood and create a new dataframe called *manhattan_venues*.

In [25]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


#### Let's check the size of the resulting dataframe

In [26]:
print(manhattan_venues.shape)
manhattan_venues.head()

(47, 19)


Unnamed: 0,Neighborhood,Neighborhood_lat,Neighborhood_long,address,categories,cc,city,country,crossStreet,distance,formattedAddress,id,labeledLatLngs,lat,lng,name,neighborhood,postalCode,state
0,Chinatown,40.715618,-73.994279,,Office,US,,United States,,59,"[New York, United States]",4d135140d1848cfa20d3c271,"[{'label': 'display', 'lat': 40.71614, 'lng': ...",40.71614,-73.994452,Gym Office,,,New York
1,Upper East Side,40.775639,-73.960508,911 Park Ave,Gym,US,New York,United States,80th st,83,"[911 Park Ave (80th st), New York, NY 10075, U...",4c06a60d92a4ef3b7117b0f1,"[{'label': 'display', 'lat': 40.77631040937412...",40.77631,-73.960056,911 Gym,,10075.0,NY
2,Yorkville,40.77593,-73.947118,,Gym,US,New York,United States,,95,"[New York, NY, United States]",56bb3803498eb5c4304be776,"[{'label': 'display', 'lat': 40.77514150587055...",40.775142,-73.947551,Cambridge Gym,,,NY
3,Yorkville,40.77593,-73.947118,500 E 85th St,Residential Building (Apartment / Condo),US,New York,United States,York Ave,89,"[500 E 85th St (York Ave), New York, NY 10028,...",4efc79e99a521091899c58a7,"[{'label': 'display', 'lat': 40.775135809722, ...",40.775136,-73.947311,Gym @ The Cambridge,,10028.0,NY
4,Yorkville,40.77593,-73.947118,435 E 85th St,Residential Building (Apartment / Condo),US,New York,United States,York Ave.,91,"[435 E 85th St (York Ave.), New York, NY 10028...",4e369e151838f85189b2b58a,"[{'label': 'display', 'lat': 40.77560867202094...",40.775609,-73.948123,Ellen's Gym,,10028.0,NY


In [27]:
manhattan_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood_lat | Neighborhood_long | address | categories | cc | city | country | crossStreet | distance | formattedAddress | id | labeledLatLngs | lat | lng | name | neighborhood | postalCode | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Battery Park City | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Chelsea | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Chinatown | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Civic Center | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 1 | 5 | 5 |
| Clinton | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0 | 4 | 4 |
| East Village | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | 2 |
| Financial District | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0 | 4 | 4 |
| Flatiron | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 |
| Gramercy | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Greenwich Village | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |


#### Let's find out how many unique categories the returned venues fall into

In [28]:
print('There are {} unique categories.'.format(len(manhattan_venues['categories'].unique())))

There are 15 unique categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood in Manhattan with category of GYM results

In [29]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['categories']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

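| | Neighborhood | Building | College Auditorium | College Gym | Daycare | General Entertainment | Gym | Gym / Fitness Center | Hotel | Martial Arts Dojo | Office | Physical Therapist | Playground | Residential Building (Apartment / Condo) | Sports Club | University |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chinatown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Upper East Side | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Yorkville | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Yorkville | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Yorkville | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |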


And let's examine the new dataframe size.

In [30]:
manhattan_onehot.shape

(47, 16)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [31]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

| | Neighborhood | Building | College Auditorium | College Gym | Daycare | General Entertainment | Gym | Gym / Fitness Center | Hotel | Martial Arts Dojo | Office | Physical Therapist | Playground | Residential Building (Apartment / Condo) | Sports Club | University |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Chelsea | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Chinatown | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Civic Center | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Clinton | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | East Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| 6 | Financial District | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.25 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | Flatiron | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.666667 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | Gramercy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | Greenwich Village | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
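
Because each row of the one-hot table has at most one 1, the per-neighborhood mean of a category column equals the share of that neighborhood's venues in the category. The plain-Python sketch below reproduces the Civic Center row from five venues. The category list is inferred from the 0.6 / 0.2 / 0.2 frequencies shown above rather than read from the raw data, and `category_shares` is a hypothetical helper.

```python
from collections import Counter

def category_shares(categories):
    """Return {category: share of venues} for one neighborhood."""
    total = len(categories)
    # venues without a category (None) count toward the total but get no share
    counts = Counter(c for c in categories if c is not None)
    return {category: count / total for category, count in counts.items()}

civic_center = ['Gym', 'Gym', 'Gym', 'Gym / Fitness Center', 'Martial Arts Dojo']
shares = category_shares(civic_center)
print(shares)
```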


#### Let's confirm the new size

In [32]:
manhattan_grouped.shape

(22, 16)

#### Let's print each neighborhood along with the top 5 most common venues

In [33]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
                venue  freq
0                 Gym   1.0
1            Building   0.0
2  College Auditorium   0.0
3         College Gym   0.0
4             Daycare   0.0


----Chelsea----
                venue  freq
0          University   1.0
1            Building   0.0
2  College Auditorium   0.0
3         College Gym   0.0
4             Daycare   0.0


----Chinatown----
                venue  freq
0              Office   1.0
1            Building   0.0
2  College Auditorium   0.0
3         College Gym   0.0
4             Daycare   0.0


----Civic Center----
                  venue  freq
0                   Gym   0.6
1  Gym / Fitness Center   0.2
2     Martial Arts Dojo   0.2
3              Building   0.0
4    College Auditorium   0.0


----Clinton----
                  venue  freq
0  Gym / Fitness Center  0.75
1              Building  0.25
2    College Auditorium  0.00
3           College Gym  0.00
4               Daycare  0.00


----East Village----
        

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
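
One caveat: when a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is zero, in whatever order the sort leaves them. A hedged variant that keeps only categories actually present is sketched below. The name `return_present_venues` is hypothetical, and padding with `None` is a design choice for illustration rather than part of the original analysis.

```python
import pandas as pd

def return_present_venues(row, num_top_venues):
    """Top categories with non-zero frequency, padded with None to a fixed length."""
    row_categories = row.iloc[1:].astype(float)  # skip the Neighborhood label
    present = row_categories[row_categories > 0].sort_values(ascending=False)
    top = list(present.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))

sample_row = pd.Series({'Neighborhood': 'Clinton', 'Building': 0.25,
                        'Gym': 0.0, 'Gym / Fitness Center': 0.75, 'Hotel': 0.0})
top_three = return_present_venues(sample_row, 3)
print(top_three)
```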

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [35]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|
| 0 | Battery Park City | Gym | University | Sports Club | Residential Building (Apartment / Condo) | Playground |
| 1 | Chelsea | University | Sports Club | Residential Building (Apartment / Condo) | Playground | Physical Therapist |
| 2 | Chinatown | Office | University | Sports Club | Residential Building (Apartment / Condo) | Playground |
| 3 | Civic Center | Gym | Martial Arts Dojo | Gym / Fitness Center | University | Sports Club |
| 4 | Clinton | Gym / Fitness Center | Building | University | Sports Club | Residential Building (Apartment / Condo) |


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [36]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 2, 1, 2, 2, 2, 1, 4], dtype=int32)
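
A quick sanity check on the result is to count how many neighborhoods land in each cluster, since very small clusters often reflect a single unusual category rather than a meaningful group. The sketch uses only the first ten labels printed above; `cluster_sizes` is a hypothetical helper.

```python
from collections import Counter

def cluster_sizes(labels):
    """Map each cluster label to the number of rows assigned to it."""
    return dict(sorted(Counter(int(label) for label in labels).items()))

first_ten_labels = [2, 0, 0, 2, 1, 2, 2, 2, 1, 4]
sizes = cluster_sizes(first_ten_labels)
print(sizes)
```

The same call should work unchanged on the full `kmeans.labels_` array.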

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge neighborhoods_venues_sorted with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# drop neighborhoods without a cluster label (no venues were returned for them)
manhattan_merged.dropna(axis=0, how='any', inplace=True)
manhattan_merged["Cluster Labels"] = manhattan_merged["Cluster Labels"].astype(int)
manhattan_merged.head() # check the last columns!

Finally, let's visualize the resulting clusters

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Examine Clusters

#### Cluster 1

In [60]:
c1 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
for neighbor in c1['Neighborhood']:
    print("Neighborhood--> "+neighbor+" GYMs-->"+manhattan_venues.loc[manhattan_venues['Neighborhood']==neighbor]['name'].values)

['Neighborhood--> Chinatown GYMs-->Gym Office']
['Neighborhood--> Upper West Side GYMs-->Gymboree Play & Music'
 'Neighborhood--> Upper West Side GYMs-->Gymboree Play & Music']
['Neighborhood--> Chelsea GYMs-->18th Street Gym - Volleyball']


#### Cluster 2

In [59]:
c2 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
for neighbor in c2['Neighborhood']:
    print("Neighborhood--> "+neighbor+" GYMs-->"+manhattan_venues.loc[manhattan_venues['Neighborhood']==neighbor]['name'].values)

['Neighborhood--> Clinton GYMs-->Yotel Gym'
 'Neighborhood--> Clinton GYMs-->The Victory Gym'
 'Neighborhood--> Clinton GYMs-->The Gym at The OUT NYC'
 'Neighborhood--> Clinton GYMs-->The Victory']
['Neighborhood--> Gramercy GYMs-->Gym']
['Neighborhood--> Turtle Bay GYMs-->Ambassador East Gym']
['Neighborhood--> Tudor City GYMs-->The Corinthian Sky Gym']
['Neighborhood--> Stuyvesant Town GYMs-->309 Makeshift Swole Gym']


#### Cluster 3

In [61]:
c3 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
for neighbor in c3['Neighborhood']:
    print("Neighborhood--> "+neighbor+" GYMs-->"+manhattan_venues.loc[manhattan_venues['Neighborhood']==neighbor]['name'].values)

['Neighborhood--> Upper East Side GYMs-->911 Gym']
['Neighborhood--> Yorkville GYMs-->Cambridge Gym'
 'Neighborhood--> Yorkville GYMs-->Gym @ The Cambridge'
 "Neighborhood--> Yorkville GYMs-->Ellen's Gym"]
['Neighborhood--> Midtown GYMs-->Gym'
 'Neighborhood--> Midtown GYMs-->Harvard Club Gym'
 'Neighborhood--> Midtown GYMs-->Gym at the Mansfield'
 'Neighborhood--> Midtown GYMs-->Sofitel New York'
 'Neighborhood--> Midtown GYMs-->a gym']
['Neighborhood--> Murray Hill GYMs-->Affinia Shelburne Gym'
 'Neighborhood--> Murray Hill GYMs-->The Gersten Gym']
["Neighborhood--> East Village GYMs-->Gordon's Gym"
 'Neighborhood--> East Village GYMs-->Tompkins Square Outdoors Gym']
['Neighborhood--> West Village GYMs-->350 Bleecker Street Gym']
['Neighborhood--> Battery Park City GYMs-->The Club at Gateway']
['Neighborhood--> Financial District GYMs-->Stark Gym'
 'Neighborhood--> Financial District GYMs-->37 Wall Street Gym'
 'Neighborhood--> Financial District GYMs-->Professional Physical Therapy'

#### Cluster 4

In [62]:
c4 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
for neighbor in c4['Neighborhood']:
    print("Neighborhood--> "+neighbor+" GYMs-->"+manhattan_venues.loc[manhattan_venues['Neighborhood']==neighbor]['name'].values)

['Neighborhood--> Morningside Heights GYMs-->Lefrak Gym Barnard College']


#### Cluster 5

In [63]:
c5 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
for neighbor in c5['Neighborhood']:
    print("Neighborhood--> "+neighbor+" GYMs-->"+manhattan_venues.loc[manhattan_venues['Neighborhood']==neighbor]['name'].values)

["Neighborhood--> Greenwich Village GYMs-->St. Anthony's Mem. Gym"]
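
Adding a string to a NumPy array, as the loops above do, broadcasts the prefix over every gym name and prints the whole array. For one readable line per neighborhood, the names can be joined instead. The sketch below works on a plain dictionary for illustration; `format_gyms` is a hypothetical helper, and the sample data is copied from the Cluster 2 output.

```python
def format_gyms(gyms_by_neighborhood):
    """Return one 'Neighborhood: gym, gym, ...' line per neighborhood."""
    return ['{}: {}'.format(neighborhood, ', '.join(names))
            for neighborhood, names in gyms_by_neighborhood.items()]

cluster_2_sample = {
    'Clinton': ['Yotel Gym', 'The Victory Gym', 'The Gym at The OUT NYC', 'The Victory'],
    'Gramercy': ['Gym'],
}
lines = format_gyms(cluster_2_sample)
for line in lines:
    print(line)
```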


## Conclusion

Clusters 2 and 3 contain the neighborhoods with the most options in the Gym category.