
# Introduction

Restaurants and food trucks are common nowadays, but both share the same problems: time wasted in long queues and difficulty finding a place to sit.

This project is aimed at anyone who wants to open a health food truck focused on low-cost snacks to eat on the go. It identifies locations where opening this kind of establishment is viable.

To achieve this, Citi Bike System data will be used. We focus on people who have already adopted a healthy lifestyle and care about how their time is spent. The end points of morning trips are used to compute the total arrivals per station, so we can determine where people with this lifestyle are concentrated.

# Data Information

The data used in this project comes from Citi Bike System Data and from Foursquare.

Data is available online (https://s3.amazonaws.com/tripdata/index.html) and contains:
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth

This data has been processed to remove trips that are taken by staff as they service and inspect the system, trips that are taken to/from any of our “test” stations (which we were using more in June and July 2013), and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it's secure). (Citi Bike System Data info)

Of all this information, only the following fields are relevant to the solution:
- Stop Time and Date
- End Station Name
- Station Lat/Long

The year selected for this solution is 2014.

The data retrieved from Foursquare consists of up to 30 venues for each Citi Bike end point that registered more than 10,000 morning arrivals in 2014.

# Methodology discussion

The main idea of this project was to capture the most common ride destinations during the morning period (6:00 AM to 12:00 PM). To achieve this, the date and time field was split and a mask was applied to select all rides that ended in this period.
The rides were then grouped by destination, which made it possible to find the most common venues around each location. After this step, KMeans was applied to find similar destinations based on venues and total arrivals.
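The morning mask and the arrival count can be sketched on a few made-up rows. This is only an illustration under stated assumptions: the timestamps and station names are invented, and the sketch uses the pandas `.dt.hour` accessor instead of the string split used in the notebook cells further down.

```python
import pandas as pd

# Invented sample trips; column names follow the cleaned Citi Bike headers
trips = pd.DataFrame({
    "stoptime": pd.to_datetime([
        "2014-01-01 05:59:59",  # before the window, dropped
        "2014-01-01 06:00:00",  # kept
        "2014-01-01 11:59:59",  # kept
        "2014-01-01 12:00:00",  # noon, dropped
    ]),
    "endstationname": ["A St", "B St", "B St", "C St"],
})

# Keep trips that ended from 06:00:00 up to, but not including, 12:00:00
hours = trips["stoptime"].dt.hour
morning = trips[(hours >= 6) & (hours < 12)]

# Total morning arrivals per end station
arrivals = morning["endstationname"].value_counts()
print(arrivals)
```

Only the two rows ending at "B St" fall inside the window here, so that station is the only one left in the count.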

# Results

The results show the end points clustered into four classes:

0.   Restaurants
1.   Park
2.   Hotel
3.   Bar

# Conclusion

Cluster 1 offers the best opportunities, because it has a moderate number of arrivals and this kind of business is not common in those locations.

As a suggestion, the food truck could be opened at Barclay St & Church St.

# Download Data

In [0]:
import pandas as pd
import os
import sys
from zipfile import ZipFile
from urllib.request import urlopen
import requests
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import folium

In [2]:
months = {'01':'January', '02':'February', '03':'March',
          '04':'April', '05':'May', '06':'June',
          '07':'July', '08':'August','09':'September',
          '10':'October', '11':'November','12':'December'
          }
year = "2014"

data = pd.DataFrame()

for m, m_name in months.items():
    print("Retrieving data for {} .....".format(m_name))
    URL = "https://s3.amazonaws.com/tripdata/" + year + m + "-citibike-tripdata.zip"
    url = urlopen(URL)

    # Save the monthly archive to a temporary file
    with open('zipFile.zip', 'wb') as output:
        output.write(url.read())

    # Read the zipped CSV and append it to the yearly DataFrame
    bike_data = pd.read_csv('zipFile.zip')
    data = pd.concat([data, bike_data], ignore_index=True)

    os.remove('zipFile.zip')

print("Data retrieved successfully!")

Retrieving data for January .....
Retrieving data for February .....
Retrieving data for March .....
Retrieving data for April .....
Retrieving data for May .....
Retrieving data for June .....
Retrieving data for July .....
Retrieving data for August .....
Retrieving data for September .....
Retrieving data for October .....
Retrieving data for November .....
Retrieving data for December .....
Data retrieved successfully!
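A possible variation on the download loop is to build the twelve URLs first and concatenate all monthly frames once at the end, which avoids growing the DataFrame on every iteration. The sketch below is not the notebook's code: `monthly_urls`, `load_year` and the `reader` parameter are hypothetical names, and reading a zip archive straight from a URL with `pd.read_csv` is an assumption that depends on the pandas version and on each archive containing a single CSV.

```python
import pandas as pd

BASE_URL = "https://s3.amazonaws.com/tripdata/"  # same bucket as in the loop above

def monthly_urls(year):
    """Build the twelve monthly archive URLs for a given year."""
    return [f"{BASE_URL}{year}{month:02d}-citibike-tripdata.zip"
            for month in range(1, 13)]

def load_year(year, reader=pd.read_csv):
    """Read every monthly file and concatenate once at the end.

    `reader` is a hypothetical hook so the function can be tried
    without network access; by default it is pandas.read_csv.
    """
    frames = [reader(url) for url in monthly_urls(year)]
    return pd.concat(frames, ignore_index=True)

urls = monthly_urls("2014")
print(urls[0])
```

Calling `load_year("2014")` would then download the whole year; passing a custom `reader` makes it easy to test the logic on small local files first.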


Clean the column headers

In [0]:
headers = []

for c in data.columns:
    headers.append(c.replace(" ",""))

data.columns = headers
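The same header cleanup can also be written with the string methods of the column index. A minimal sketch on an invented two-column frame (the column names are only examples):

```python
import pandas as pd

# Invented frame with spaces in the header names
df = pd.DataFrame({"end station name": ["A St"], "trip duration": [120]})

# Vectorized string method on the column Index; regex=False treats " " literally
df.columns = df.columns.str.replace(" ", "", regex=False)

print(list(df.columns))
```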

Take a look at the data types, features and total entries

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8081216 entries, 0 to 8081215
Data columns (total 15 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   tripduration           int64  
 1   starttime              object 
 2   stoptime               object 
 3   startstationid         int64  
 4   startstationname       object 
 5   startstationlatitude   float64
 6   startstationlongitude  float64
 7   endstationid           int64  
 8   endstationname         object 
 9   endstationlatitude     float64
 10  endstationlongitude    float64
 11  bikeid                 int64  
 12  usertype               object 
 13  birthyear              object 
 14  gender                 int64  
dtypes: float64(4), int64(5), object(6)
memory usage: 924.8+ MB


Convert the 'stoptime' column to datetime

In [0]:
data['stoptime'] = pd.to_datetime(data['stoptime'])

Split 'stoptime' into date and time

In [0]:
data['date'] = [d.date() for d in data['stoptime']]
data['time'] = [str(d.time()) for d in data['stoptime']]

Keep only the trips that ended between 6:00 AM and 12:00 PM

In [0]:
data = data[(data.time > '06:00') & (data.time < '12:00')]

Round the latitude and longitude values

In [0]:
data = data.round(6)

Create a DataFrame that will contain only necessary features

In [0]:
dfLocation = data[['endstationname','endstationlatitude','endstationlongitude']]

In [0]:
dfLocationGroup = pd.DataFrame(dfLocation.groupby('endstationname').agg({'endstationname': 'count',
                                                                         'endstationlatitude':'min',
                                                                         'endstationlongitude':'min'}))

In [0]:
dfLocationGroup.index.name = ''
dfLocationGroup.reset_index(inplace = True)
dfLocationGroup.columns = ['name', 'total', 'lat', 'lon']
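The grouping and renaming steps above could also be combined with pandas named aggregation, which sets the output column names directly. This is a sketch on invented rows, not a replacement for the cells above; it counts rows per station with `size` rather than counting the grouping column itself.

```python
import pandas as pd

# Invented morning trips; names mirror the cleaned Citi Bike columns
trips = pd.DataFrame({
    "endstationname": ["A St", "B St", "B St"],
    "endstationlatitude": [40.70, 40.71, 40.71],
    "endstationlongitude": [-74.00, -74.01, -74.01],
})

# Named aggregation: output_name=(input_column, function)
stations = (
    trips.groupby("endstationname")
    .agg(
        total=("endstationlatitude", "size"),  # rows per station = arrivals
        lat=("endstationlatitude", "min"),
        lon=("endstationlongitude", "min"),
    )
    .rename_axis("name")
    .reset_index()
)
print(stations)
```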

In [0]:
dfLocationTop10 = dfLocationGroup.sort_values(by = 'total', ascending = True)

In [13]:
dfLocationTop10['total'].describe()

count      344.000000
mean      6569.084302
std       5142.949173
min        161.000000
25%       2069.750000
50%       5627.000000
75%       9413.750000
max      27574.000000
Name: total, dtype: float64

In [0]:
dfLocationTop10 = dfLocationTop10[dfLocationTop10['total'] > 10000]

In [15]:
map_ny = folium.Map(location=[40.758896, -73.985130], zoom_start=13)

for lat, lng, label, total in zip(dfLocationTop10['lat'], dfLocationTop10['lon'], dfLocationTop10['name'],dfLocationTop10['total']):
    label = folium.Popup(label+ '\n Total: '+str(total), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label ,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ny)  
    
map_ny

# Foursquare data

In [0]:
CLIENT_ID = '######'
CLIENT_SECRET = '######' 
VERSION = '20180605'

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 30

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']

        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Name', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
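The parsing part of `getNearbyVenues` can be tried without calling the API by isolating it in a small function. The helper below, `extract_venues`, is hypothetical: it assumes the response layout used above (`response` → `groups[0]` → `items` → `venue`) and adds fallbacks for a missing group or an empty category list, cases the notebook does not say actually occur.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows for a single end point.

    Assumes the layout used above:
    payload["response"]["groups"][0]["items"][i]["venue"].
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder when a venue has no category
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Invented payload shaped like the response described above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Some Park",
               "location": {"lat": 40.73, "lng": -73.99},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "No Category Place",
               "location": {"lat": 40.74, "lng": -73.98},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Washington Square E", 40.730494, -73.995721)
print(rows)
```

Separating the parsing from the HTTP request also makes it easier to add a timeout or error handling around `requests.get` later without touching the row-building logic.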

Retrieve 30 venues for each end point

In [0]:
location_venues = getNearbyVenues(names=dfLocationTop10['name'],
                                   latitudes=dfLocationTop10['lat'],
                                   longitudes=dfLocationTop10['lon']
                                 )

In [19]:
print(location_venues.shape)
location_venues.head()

(2340, 7)


Unnamed: 0,Name,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Washington Square E,40.730494,-73.995721,Washington Square Park,40.730816,-73.997458,Park
1,Washington Square E,40.730494,-73.995721,Brooklyn Bagel & Coffee Company,40.730913,-73.993259,Bagel Shop
2,Washington Square E,40.730494,-73.995721,Some Good Wine,40.731981,-73.995469,Wine Shop
3,Washington Square E,40.730494,-73.995721,Washington Square Dog Run,40.730767,-73.99849,Dog Run
4,Washington Square E,40.730494,-73.995721,Boba Guys,40.730122,-73.994298,Bubble Tea Shop


In [20]:
print('There are {} unique categories.'.format(len(location_venues['Venue Category'].unique())))

There are 242 unique categories.


Analyse the venues at each end point

In [21]:
location_onehot = pd.get_dummies(location_venues[['Venue Category']], prefix = "", prefix_sep="")

location_onehot['Name'] = location_venues['Name']

location_onehot = location_onehot[ ['Name'] + [ col for col in location_onehot.columns if col != 'Name' ] ]

location_onehot.head()

Unnamed: 0,Name,Accessories Store,Adult Boutique,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auditorium,Australian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Court,Basketball Stadium,Beer Bar,Beer Store,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Boxing Gym,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,...,Spiritual Center,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Synagogue,Szechuan Restaurant,TV Station,Taco Place,Tailor Shop,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Tourist Information Center,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Volleyball Court,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Washington Square E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Washington Square E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Washington Square E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,Washington Square E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Washington Square E,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
location_onehot.shape

(2340, 243)

Frequency of each type of venue in DataFrame

In [23]:
location_grouped = location_onehot.groupby('Name').mean().reset_index()
location_grouped.head()

Unnamed: 0,Name,Accessories Store,Adult Boutique,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auditorium,Australian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Court,Basketball Stadium,Beer Bar,Beer Store,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Boxing Gym,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,...,Spiritual Center,Sporting Goods Shop,Sports Bar,Sports Club,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Synagogue,Szechuan Restaurant,TV Station,Taco Place,Tailor Shop,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Tourist Information Center,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Volleyball Court,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,1 Ave & E 30 St,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.066667,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0
1,1 Ave & E 44 St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
2,11 Ave & W 27 St,0.0,0.0,0.0,0.0,0.0,0.033333,0.233333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
3,2 Ave & E 31 St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.033333,0.033333,0.0,0.0
4,5 Ave & E 29 St,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
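To see why the mean of the one-hot columns is a frequency, here is a minimal sketch on four invented venues. Each row of the result gives, per end point, the share of its venues that fall in each category, so the values in a row add up to 1.

```python
import pandas as pd

# Invented venues around two end points
venues = pd.DataFrame({
    "Name": ["A St", "A St", "A St", "B St"],
    "Venue Category": ["Park", "Park", "Bar", "Bar"],
})

# One indicator column per category, then the end point name in front
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Name", venues["Name"])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby("Name").mean().reset_index()
print(freq)
```

For "A St", two of three venues are parks, so its `Park` value is 2/3 and its `Bar` value is 1/3.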


Retrieve the 5 most common venues for each end point

In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)

    return row_categories_sorted.index.values[0:num_top_venues]

In [25]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

columns = ['Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

location_venues_sorted = pd.DataFrame(columns=columns)
location_venues_sorted['Name'] = location_grouped['Name']

for ind in np.arange(location_grouped.shape[0]):
    location_venues_sorted.iloc[ind, 1:] = return_most_common_venues(location_grouped.iloc[ind, :], num_top_venues)

location_venues_sorted.head()

Unnamed: 0,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,1 Ave & E 30 St,Grocery Store,Coffee Shop,American Restaurant,Pizza Place,Mexican Restaurant
1,1 Ave & E 44 St,Sushi Restaurant,Karaoke Bar,Coffee Shop,Park,Deli / Bodega
2,11 Ave & W 27 St,Art Gallery,Park,Lounge,Cocktail Bar,Scenic Lookout
3,2 Ave & E 31 St,Thai Restaurant,Wine Bar,Grocery Store,Pub,Pizza Place
4,5 Ave & E 29 St,Korean Restaurant,Gym / Fitness Center,Japanese Restaurant,Hotel,Bakery
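The ranking step can be illustrated on a tiny frequency table. The sketch below uses `Series.nlargest` instead of sorting the whole row; `top_categories` is a hypothetical helper and the frequencies are invented.

```python
import pandas as pd

# Invented category frequencies for two end points
freq = pd.DataFrame(
    {"Name": ["A St", "B St"],
     "Bar": [0.2, 0.6], "Park": [0.5, 0.1], "Hotel": [0.3, 0.3]}
).set_index("Name")

def top_categories(row, n):
    """Return the n category names with the highest frequency in a row."""
    return list(row.nlargest(n).index)

top2 = {name: top_categories(row, 2) for name, row in freq.iterrows()}
print(top2)
```

Note that when two categories have the same frequency, which one ranks first depends on the tie-breaking of the method used, so the order among tied venues should not be over-interpreted.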


In [26]:
dfComplete = pd.merge(left = location_grouped, right = dfLocationTop10, how= 'inner', left_on = 'Name', right_on= 'name')
dfComplete

Unnamed: 0,Name,Accessories Store,Adult Boutique,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auditorium,Australian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Court,Basketball Stadium,Beer Bar,Beer Store,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Boxing Gym,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,...,Steakhouse,Street Art,Supermarket,Sushi Restaurant,Synagogue,Szechuan Restaurant,TV Station,Taco Place,Tailor Shop,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Tourist Information Center,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Volleyball Court,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,name,total,lat,lon
0,1 Ave & E 30 St,0.0,0.0,0.066667,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.033333,0.0,0.033333,0.000000,0.066667,0.000000,0.0,0.000000,0.000000,0.033333,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.000000,0.066667,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.033333,0.000000,0.033333,0.0,0.000000,1 Ave & E 30 St,13960,40.741444,-73.975361
1,1 Ave & E 44 St,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.033333,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.033333,0.0,0.033333,0.000000,0.0,0.000000,0.0,0.0,0.033333,0.0,...,0.000000,0.0,0.000000,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.033333,0.033333,0.000000,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.033333,1 Ave & E 44 St,10029,40.750020,-73.969053
2,11 Ave & W 27 St,0.0,0.0,0.000000,0.0,0.0,0.033333,0.233333,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.033333,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.033333,0.033333,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.033333,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.033333,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.033333,0.000000,0.0,0.000000,11 Ave & W 27 St,10895,40.751396,-74.005226
3,2 Ave & E 31 St,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.033333,0.0,0.000000,0.033333,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.000000,0.066667,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.066667,0.033333,0.033333,0.0,0.000000,2 Ave & E 31 St,11054,40.742909,-73.977061
4,5 Ave & E 29 St,0.0,0.0,0.033333,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.033333,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.033333,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.033333,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.033333,0.000000,0.000000,0.0,0.000000,5 Ave & E 29 St,13397,40.745168,-73.986831
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,W 56 St & 6 Ave,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.066667,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.033333,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,W 56 St & 6 Ave,12618,40.763406,-73.977225
74,W Broadway & Spring St,0.0,0.0,0.000000,0.0,0.0,0.000000,0.033333,0.033333,0.033333,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.033333,0.000000,0.000000,0.0,0.033333,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.033333,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,W Broadway & Spring St,12277,40.724910,-74.001547
75,W Houston St & Hudson St,0.0,0.0,0.100000,0.0,0.0,0.000000,0.033333,0.033333,0.000000,0.000000,0.033333,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.033333,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.033333,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.033333,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.033333,0.0,0.0,0.000000,0.0,0.0,0.066667,0.000000,0.000000,0.0,0.000000,W Houston St & Hudson St,16022,40.728739,-74.007488
76,Washington Square E,0.0,0.0,0.066667,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.033333,0.000000,0.000000,0.000000,0.0,0.000000,0.033333,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.033333,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.033333,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.066667,0.000000,0.0,0.000000,Washington Square E,10021,40.730494,-73.995721


Cluster Endpoints

In [0]:
kclusters = 4

location_grouped_clustering = dfComplete.drop(columns=['Name','name'])
location_grouped_clustering_scaled = StandardScaler().fit_transform(location_grouped_clustering)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(location_grouped_clustering_scaled)
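The `StandardScaler` step matters here because `total` is in the thousands while the venue frequencies lie between 0 and 1. The sketch below shows the effect on four invented end points with one frequency column and one arrivals column; the numbers and the choice of two clusters are assumptions made only for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented end points: [park frequency, total arrivals]
X = np.array([
    [0.9, 10000.0],  # A
    [0.9, 14000.0],  # B
    [0.1, 12000.0],  # C
    [0.1, 16000.0],  # D
])

# Without scaling, distances are dominated by the arrivals column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# After standardization both columns have zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled).labels_

print("raw:   ", raw_labels)
print("scaled:", scaled_labels)
```

On this toy data the unscaled run should pair the points by arrivals (A with C, B with D), while the scaled run should pair them by venue frequency (A with B, C with D). The label numbers themselves are arbitrary.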

In [0]:
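# Drop labels left over from a previous run, if any, so the next cell can be re-executed
location_venues_sorted.drop(columns='Cluster Labels', inplace=True, errors='ignore')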

In [48]:

location_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

location_merged = dfLocationTop10

location_merged = location_merged.join(location_venues_sorted.set_index('Name'), on='name')

location_merged.head() 

Unnamed: 0,name,total,lat,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
332,Washington Square E,10021,40.730494,-73.995721,0,Ice Cream Shop,Wine Shop,American Restaurant,Cuban Restaurant,Spa
127,E 33 St & 5 Ave,10028,40.747659,-73.984907,0,Korean Restaurant,Japanese Restaurant,Hotel,Gym / Fitness Center,Cosmetics Shop
3,1 Ave & E 44 St,10029,40.75002,-73.969053,0,Sushi Restaurant,Karaoke Bar,Coffee Shop,Park,Deli / Bodega
227,Mott St & Prince St,10101,40.72318,-73.9948,0,Italian Restaurant,Cosmetics Shop,Yoga Studio,Paper / Office Supplies Store,Street Art
120,E 27 St & 1 Ave,10133,40.739445,-73.976806,3,Bar,American Restaurant,Multiplex,Taco Place,Spa


Map visualization of Clusters

In [49]:
map_clusters = folium.Map(location=[40.758896, -73.985130], zoom_start=13)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster,total in zip(location_merged['lat'], location_merged['lon'], location_merged['name'], location_merged['Cluster Labels'],location_merged['total']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster) + ' Total: ' + str(total), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine Clusters

Cluster 0

In [50]:
print(location_merged.loc[location_merged['Cluster Labels'] == 0].shape)
location_merged.loc[location_merged['Cluster Labels'] == 0, location_merged.columns[[1] + list(range(5, location_merged.shape[1]))]].head()

(68, 10)


Unnamed: 0,total,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
332,10021,Ice Cream Shop,Wine Shop,American Restaurant,Cuban Restaurant,Spa
127,10028,Korean Restaurant,Japanese Restaurant,Hotel,Gym / Fitness Center,Cosmetics Shop
3,10029,Sushi Restaurant,Karaoke Bar,Coffee Shop,Park,Deli / Bodega
227,10101,Italian Restaurant,Cosmetics Shop,Yoga Studio,Paper / Office Supplies Store,Street Art
152,10142,Hotel,Jewelry Store,Coffee Shop,Steakhouse,Electronics Store


Cluster 1

In [51]:
print(location_merged.loc[location_merged['Cluster Labels'] == 1].shape)
location_merged.loc[location_merged['Cluster Labels'] == 1, location_merged.columns[[1] + list(range(5, location_merged.shape[1]))]].head()

(3, 10)


Unnamed: 0,total,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
259,11048,Park,Memorial Site,Gourmet Shop,Food Court,Plaza
37,13741,Memorial Site,Hotel,Building,Yoga Studio,Taco Place
274,18192,Park,Food Court,Steakhouse,Performing Arts Venue,Burger Joint


Cluster 2

In [52]:
print(location_merged.loc[location_merged['Cluster Labels'] == 2].shape)
location_merged.loc[location_merged['Cluster Labels'] == 2, location_merged.columns[[1] + list(range(5, location_merged.shape[1]))]].head()

(4, 10)


Unnamed: 0,total,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,11924,Hotel,Lounge,Coffee Shop,Italian Restaurant,Music Venue
296,13295,Burger Joint,Korean Restaurant,Indie Theater,Hotel,Gym / Fitness Center
294,14954,Boxing Gym,Yoga Studio,Indie Theater,Gym / Fitness Center,Coffee Shop
17,15988,Hotel,Music Venue,Deli / Bodega,Cupcake Shop,Donut Shop


Cluster 3

In [53]:
print(location_merged.loc[location_merged['Cluster Labels'] == 3].shape)
location_merged.loc[location_merged['Cluster Labels'] == 3, location_merged.columns[[1] + list(range(5, location_merged.shape[1]))]].head()

(3, 10)


Unnamed: 0,total,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
120,10133,Bar,American Restaurant,Multiplex,Taco Place,Spa
9,11054,Thai Restaurant,Wine Bar,Grocery Store,Pub,Pizza Place
2,13960,Grocery Store,Coffee Shop,American Restaurant,Pizza Place,Mexican Restaurant
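The per-cluster inspection above can be complemented with a compact summary of arrivals per cluster. The sketch below runs on an invented stand-in for `location_merged`; the names and totals are made up and only the column names follow the notebook.

```python
import pandas as pd

# Invented stand-in for the merged table built above
merged = pd.DataFrame({
    "name": ["A St", "B St", "C St", "D St"],
    "total": [10000, 12000, 18000, 11000],
    "Cluster Labels": [0, 0, 1, 1],
})

# Number of end points and arrival statistics for each cluster
summary = (
    merged.groupby("Cluster Labels")["total"]
    .agg(["count", "mean", "max"])
)
print(summary)
```

Applied to the real table, a summary like this puts the cluster sizes and typical arrival volumes side by side, which helps when comparing clusters as done in the conclusion.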
