# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Daniel Claudiano Cabral Pinto 
#### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>
### What kind of business is likely to be successful in Queens, NY?

* To answer this question, I first need to find which kinds of venues are most popular among similar neighborhoods, and then identify the neighborhoods where those venues are not present.
* Because the neighborhoods are similar, the most popular kind of venue in the group is likely to be popular in the neighborhoods where it is currently missing.


## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* The number, type, and location of venues (restaurants and other businesses) in each Queens neighborhood.

The following data sources are needed to extract or generate the required information:
* New York neighborhood location data from the dataset provided by IBM (`newyork_data.json`)
* The number, type, and location of venues in every neighborhood, obtained using the **Foursquare API**
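
As a rough illustration of the first data source, the sketch below turns hand-written records into table rows. It assumes each record has the shape the notebook reads later (`properties.borough`, `properties.name`, `geometry.coordinates`). The helper name `features_to_frame` and the second sample record are hypothetical, and the coordinate pair is stored longitude first, latitude second.

```python
import pandas as pd

# Hand-written records shaped like entries of newyork_data['features'].
# Only the keys used in this notebook are included; the second record is made up.
sample_features = [
    {
        "properties": {"borough": "Queens", "name": "Astoria"},
        "geometry": {"coordinates": [-73.915654, 40.768509]},  # [longitude, latitude]
    },
    {
        "properties": {"borough": "Example Borough", "name": "Example Neighborhood"},
        "geometry": {"coordinates": [-73.9, 40.8]},
    },
]


def features_to_frame(features):
    """Build a DataFrame with one row per neighborhood record (hypothetical helper)."""
    rows = []
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])


sample_df = features_to_frame(sample_features)

# Keep only the borough of interest, as the notebook does for Queens
queens_only = sample_df[sample_df["Borough"] == "Queens"].reset_index(drop=True)
print(queens_only)
```

Collecting the rows in a list and building the DataFrame once avoids growing a DataFrame inside a loop.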


### Importing New York Data


In [2]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


### Importing packages

In [3]:
import pandas as pd
import numpy as np 
import json 
from geopy.geocoders import Nominatim 
import requests 
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!pip install folium
import folium 
print('Libraries imported.')

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Libraries imported.


### Getting Queens Data

In [6]:
with open('newyork_data.json') as json_data: 
    newyork_data = json.load(json_data) # getting new york data

neighborhoods_data = newyork_data['features']

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

rows = []

for data in neighborhoods_data: # getting New York's neighborhood data
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the DataFrame once; DataFrame.append was removed in pandas 2.0
neighborhoods = pd.DataFrame(rows, columns=column_names)
    
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True) # getting Queens's data

queens_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138
...,...,...,...,...
76,Queens,Middle Village,40.716415,-73.881143
77,Queens,Malba,40.790602,-73.826678
78,Queens,Hammels,40.587338,-73.805530
79,Queens,Bayswater,40.611322,-73.765968


### Getting Venues Data from Foursquare

In [7]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare credentials
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # do not publish real secrets in a notebook
VERSION = '20180605' 
LIMIT = 100 

def getNearbyVenues(names, latitudes, longitudes, radius=500): #function to get neighborhood's venues data
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                                  )
queens_venues

Astoria
Woodside
Jackson Heights
Elmhurst
Howard Beach
Corona
Forest Hills
Kew Gardens
Richmond Hill
Flushing
Long Island City
Sunnyside
East Elmhurst
Maspeth
Ridgewood
Glendale
Rego Park
Woodhaven
Ozone Park
South Ozone Park
College Point
Whitestone
Bayside
Auburndale
Little Neck
Douglaston
Glen Oaks
Bellerose
Kew Gardens Hills
Fresh Meadows
Briarwood
Jamaica Center
Oakland Gardens
Queens Village
Hollis
South Jamaica
St. Albans
Rochdale
Springfield Gardens
Cambria Heights
Rosedale
Far Rockaway
Broad Channel
Breezy Point
Steinway
Beechhurst
Bay Terrace
Edgemere
Arverne
Rockaway Beach
Neponsit
Murray Hill
Floral Park
Holliswood
Jamaica Estates
Queensboro Hill
Hillcrest
Ravenswood
Lindenwood
Laurelton
Lefrak City
Belle Harbor
Rockaway Park
Somerville
Brookville
Bellaire
North Corona
Forest Hills Gardens
Jamaica Hills
Utopia
Pomonok
Astoria Heights
Hunters Point
Sunnyside Gardens
Blissville
Roxbury
Middle Village
Malba
Hammels
Bayswater
Queensbridge


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,40.768509,-73.915654,Orange Blossom,40.769856,-73.917012,Gourmet Shop
2,Astoria,40.768509,-73.915654,Off The Hook,40.767200,-73.918104,Seafood Restaurant
3,Astoria,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
4,Astoria,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
...,...,...,...,...,...,...,...
2131,Queensbridge,40.756091,-73.945631,Queensbridge Park Softball Fields,40.756055,-73.948407,Baseball Field
2132,Queensbridge,40.756091,-73.945631,Queensbridge Basketball Courts,40.755060,-73.949103,Basketball Court
2133,Queensbridge,40.756091,-73.945631,The Ravel Hotel Gym,40.753787,-73.948815,Athletics & Sports
2134,Queensbridge,40.756091,-73.945631,Estate Garden And Grill,40.753700,-73.948841,Beer Garden
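
The request loop above indexes straight into each item (`v['venue']['categories'][0]['name']`), which would raise an error if a venue came back without a category. The sketch below shows one more defensive way to flatten the items. It assumes the response shape implied by the notebook code; the helper name `parse_explore_items` and the second sample item are hypothetical, and no network call is made.

```python
def parse_explore_items(neighborhood, items):
    """Flatten explore-style items into tuples, tolerating missing fields.

    Hypothetical helper: the item layout is inferred from the notebook code,
    not from the Foursquare documentation.
    """
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows


# Hand-written items; the first mirrors a row from the output above, the second is made up
sample_items = [
    {"venue": {"name": "Favela Grill",
               "location": {"lat": 40.767348, "lng": -73.917897},
               "categories": [{"name": "Brazilian Restaurant"}]}},
    {"venue": {"name": "Uncategorized Example",
               "location": {"lat": 40.0, "lng": -73.0},
               "categories": []}},
]

parsed_rows = parse_explore_items("Astoria", sample_items)
for row in parsed_rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step.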


## Methodology <a name="methodology"></a>

In this project we direct our efforts at detecting areas of Queens that lack the most popular kind of venue found in similar neighborhoods. We use the k-means method to discover which neighborhoods are similar.
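
To make the idea concrete before running it on the real data, here is a minimal sketch on a made-up table of three neighborhoods. It assumes that "similar" means "similar share of each venue category". It uses `pd.crosstab` with `normalize="index"` as a compact stand-in for the one-hot encoding and group mean used in this notebook; all names prefixed with `toy_` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venues: A and B are dominated by delis and bars, C by parks
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "Venue Category": ["Deli", "Deli", "Bar", "Deli", "Bar", "Park", "Park", "Gym", "Park"],
})

# Share of each category per neighborhood (each row sums to 1)
toy_freq = pd.crosstab(
    toy_venues["Neighborhood"], toy_venues["Venue Category"], normalize="index"
)

# Cluster the neighborhoods on their category shares
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freq)
toy_labels = pd.Series(toy_kmeans.labels_, index=toy_freq.index, name="Cluster")

print(toy_freq.round(2))
print(toy_labels)
```

In this toy case A and B end up in the same cluster and C in the other; the label numbers themselves carry no meaning.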

In [8]:
# one hot encoding
queens_onehot = pd.get_dummies(queens_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe under a temporary name,
# because 'Neighborhood' is also one of the venue categories
queens_onehot['ZNeighborhood'] = queens_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [queens_onehot.columns[-1]] + list(queens_onehot.columns[:-1])
queens_onehot = queens_onehot[fixed_columns]
# drop the 'Neighborhood' venue-category column, then give the temporary column its real name
queens_onehot.drop('Neighborhood', axis=1, inplace=True)
queens_onehot.rename(index=str, columns={'ZNeighborhood': 'Neighborhood'}, inplace=True)
queens_grouped = queens_onehot.groupby('Neighborhood').mean().reset_index()
queens_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport Terminal,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Arverne,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.045455,0.0,0.000000,0.0
1,Astoria,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.010000,0.0,0.000000,0.0
2,Astoria Heights,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000,0.0
3,Auburndale,0.000000,0.0,0.050000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000,0.0
4,Bay Terrace,0.000000,0.0,0.085714,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.028571,0.000000,0.0,0.0,0.028571,0.00,0.000000,0.0,0.057143,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,Sunnyside Gardens,0.000000,0.0,0.030000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.01,0.000000,0.0,0.000000,0.0
77,Utopia,0.066667,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000,0.0
78,Whitestone,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000,0.0
79,Woodhaven,0.000000,0.0,0.000000,0.037037,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.037037,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000,0.0


In [9]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions after the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = queens_grouped['Neighborhood']

for ind in np.arange(queens_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(queens_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arverne,Surf Spot,Metro Station,Sandwich Place,BBQ Joint,Playground,Donut Shop,Coffee Shop,Caribbean Restaurant,Café,Bus Stop
1,Astoria,Bar,Middle Eastern Restaurant,Greek Restaurant,Seafood Restaurant,Bakery,Café,Mediterranean Restaurant,Deli / Bodega,Indian Restaurant,Hookah Bar
2,Astoria Heights,Italian Restaurant,Plaza,Bus Station,Playground,Bowling Alley,Supermarket,Museum,Chinese Restaurant,Bakery,Burger Joint
3,Auburndale,Italian Restaurant,Mattress Store,Pharmacy,Pet Store,Discount Store,Comic Shop,Noodle House,Fast Food Restaurant,Mobile Phone Shop,Miscellaneous Shop
4,Bay Terrace,Clothing Store,American Restaurant,Shoe Store,Mobile Phone Shop,Cosmetics Shop,Women's Store,Kids Store,Donut Shop,Shopping Mall,Gym
...,...,...,...,...,...,...,...,...,...,...,...
76,Sunnyside Gardens,Bar,Grocery Store,Pizza Place,Pharmacy,Turkish Restaurant,Coffee Shop,Bank,Mexican Restaurant,American Restaurant,Deli / Bodega
77,Utopia,Deli / Bodega,Afghan Restaurant,Automotive Shop,Pizza Place,Donut Shop,South American Restaurant,Spa,Grocery Store,Basketball Court,Korean Restaurant
78,Whitestone,Dance Studio,Deli / Bodega,Bubble Tea Shop,Candy Store,Yoga Studio,Fish & Chips Shop,Farm,Farmers Market,Fast Food Restaurant,Filipino Restaurant
79,Woodhaven,Deli / Bodega,Pharmacy,Bank,Park,Spanish Restaurant,Fried Chicken Joint,Nail Salon,Sandwich Place,Supermarket,Latin American Restaurant


In [10]:
# set number of clusters
kclusters = 20

queens_grouped_clustering = queens_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(queens_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50] 

array([14,  6, 19,  6, 17, 19,  1, 19,  6,  6,  3,  3, 18,  6,  0, 12, 17,
        3, 19,  3,  3, 19,  6,  3, 19, 17,  6, 17,  9,  6,  0,  5,  6, 17,
       11,  6,  6, 19,  6,  7,  6,  6,  3, 13, 17,  3,  3, 19, 10, 19],
      dtype=int32)

In [11]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

queens_merged = queens_data

# join the cluster labels and top venues onto queens_data, which holds latitude/longitude for each neighborhood
queens_merged = queens_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
queens_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Queens,Astoria,40.768509,-73.915654,6,Bar,Middle Eastern Restaurant,Greek Restaurant,Seafood Restaurant,Bakery,Café,Mediterranean Restaurant,Deli / Bodega,Indian Restaurant,Hookah Bar
1,Queens,Woodside,40.746349,-73.901842,17,Grocery Store,Latin American Restaurant,Bakery,Filipino Restaurant,Thai Restaurant,Pub,American Restaurant,Bar,Donut Shop,Pharmacy
2,Queens,Jackson Heights,40.751981,-73.882821,19,Latin American Restaurant,Peruvian Restaurant,South American Restaurant,Bakery,Mobile Phone Shop,Thai Restaurant,Grocery Store,Pizza Place,Supermarket,Empanada Restaurant
3,Queens,Elmhurst,40.744049,-73.881656,6,Mexican Restaurant,Thai Restaurant,Bubble Tea Shop,Chinese Restaurant,Vietnamese Restaurant,Colombian Restaurant,Big Box Store,Bar,Malay Restaurant,Donut Shop
4,Queens,Howard Beach,40.654225,-73.838138,6,Italian Restaurant,Pharmacy,Deli / Bodega,Chinese Restaurant,Sandwich Place,Fast Food Restaurant,Shipping Store,Concert Hall,Hookah Bar,Tapas Restaurant


## Analysis <a name="analysis"></a>

Now that each neighborhood has a cluster label and a list of its most common venue types, the next step is to find the most popular venue type in the largest cluster and the neighborhoods in that cluster where this type of venue is not present.
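
The two steps described above can be sketched on a made-up cluster as follows. This is only an illustration under the assumption that "most popular" means "most frequent category among the venues returned for the cluster"; the data and the `toy_` names are hypothetical.

```python
import pandas as pd

# Made-up venues for one cluster of three neighborhoods
toy_cluster_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "Venue Category": ["Deli", "Bar", "Deli", "Deli", "Park", "Bar"],
})

# Step 1: the most frequent category across the whole cluster
toy_top_category = toy_cluster_venues["Venue Category"].value_counts().idxmax()

# Step 2: neighborhoods that have at least one venue of that category ...
toy_has_top = set(
    toy_cluster_venues.loc[
        toy_cluster_venues["Venue Category"] == toy_top_category, "Neighborhood"
    ]
)

# ... and the neighborhoods in the cluster that have none
toy_missing = sorted(set(toy_cluster_venues["Neighborhood"]) - toy_has_top)

print(toy_top_category, toy_missing)
```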

#### Observing clusters

In [12]:
address = 'Queens, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Queens are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Queens are 40.7498243, -73.7976337.


In [13]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(queens_merged['Latitude'], queens_merged['Longitude'], queens_merged['Neighborhood'], queens_merged['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Discovering the most populated cluster

In [15]:
popular_cluster = queens_merged['Cluster_Labels'].value_counts()
popular_cluster

6     23
19    13
17    13
3     12
9      4
0      2
14     1
13     1
12     1
11     1
10     1
18     1
8      1
7      1
15     1
5      1
4      1
16     1
2      1
1      1
Name: Cluster_Labels, dtype: int64

#### Getting data from the most populated cluster

In [16]:
cluster1_neighborhoods = queens_merged.query('Cluster_Labels == 6')
cluster1_neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Queens,Astoria,40.768509,-73.915654,6,Bar,Middle Eastern Restaurant,Greek Restaurant,Seafood Restaurant,Bakery,Café,Mediterranean Restaurant,Deli / Bodega,Indian Restaurant,Hookah Bar
3,Queens,Elmhurst,40.744049,-73.881656,6,Mexican Restaurant,Thai Restaurant,Bubble Tea Shop,Chinese Restaurant,Vietnamese Restaurant,Colombian Restaurant,Big Box Store,Bar,Malay Restaurant,Donut Shop
4,Queens,Howard Beach,40.654225,-73.838138,6,Italian Restaurant,Pharmacy,Deli / Bodega,Chinese Restaurant,Sandwich Place,Fast Food Restaurant,Shipping Store,Concert Hall,Hookah Bar,Tapas Restaurant
6,Queens,Forest Hills,40.725264,-73.844475,6,Gym / Fitness Center,Gym,Yoga Studio,Thai Restaurant,Park,Pharmacy,Pizza Place,Convenience Store,Peruvian Restaurant,Optical Shop
7,Queens,Kew Gardens,40.705179,-73.829819,6,Cosmetics Shop,Chinese Restaurant,Indian Restaurant,Bank,Pharmacy,Pizza Place,Donut Shop,Park,Spa,Bar
17,Queens,Woodhaven,40.689887,-73.85811,6,Deli / Bodega,Pharmacy,Bank,Park,Spanish Restaurant,Fried Chicken Joint,Nail Salon,Sandwich Place,Supermarket,Latin American Restaurant
18,Queens,Ozone Park,40.680708,-73.843203,6,Pharmacy,Gym,Diner,Bank,Pizza Place,Grocery Store,Furniture / Home Store,Sandwich Place,Gas Station,Bowling Alley
19,Queens,South Ozone Park,40.66855,-73.809865,6,Park,Bar,Deli / Bodega,Donut Shop,Fast Food Restaurant,Sandwich Place,Food Truck,Hotel,Dim Sum Restaurant,Event Space
23,Queens,Auburndale,40.76173,-73.791762,6,Italian Restaurant,Mattress Store,Pharmacy,Pet Store,Discount Store,Comic Shop,Noodle House,Fast Food Restaurant,Mobile Phone Shop,Miscellaneous Shop
26,Queens,Glen Oaks,40.749441,-73.715481,6,Pharmacy,Indian Restaurant,Donut Shop,Mexican Restaurant,Playground,Bus Station,Mattress Store,Moving Target,Fast Food Restaurant,Falafel Restaurant


#### Getting venue data for the cluster

In [17]:
cluster1_neighborhoods_venues = getNearbyVenues(names=cluster1_neighborhoods['Neighborhood'],
                                   latitudes=cluster1_neighborhoods['Latitude'],
                                   longitudes=cluster1_neighborhoods['Longitude']
                                  )
cluster1_neighborhoods_venues 

Astoria
Elmhurst
Howard Beach
Forest Hills
Kew Gardens
Woodhaven
Ozone Park
South Ozone Park
Auburndale
Glen Oaks
Briarwood
Jamaica Center
St. Albans
Rochdale
Springfield Gardens
Rockaway Beach
Hillcrest
Belle Harbor
Rockaway Park
Bellaire
Jamaica Hills
Hunters Point
Queensbridge


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,40.768509,-73.915654,Orange Blossom,40.769856,-73.917012,Gourmet Shop
2,Astoria,40.768509,-73.915654,Off The Hook,40.767200,-73.918104,Seafood Restaurant
3,Astoria,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
4,Astoria,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
...,...,...,...,...,...,...,...
693,Queensbridge,40.756091,-73.945631,Queensbridge Park Softball Fields,40.756055,-73.948407,Baseball Field
694,Queensbridge,40.756091,-73.945631,Queensbridge Basketball Courts,40.755060,-73.949103,Basketball Court
695,Queensbridge,40.756091,-73.945631,The Ravel Hotel Gym,40.753787,-73.948815,Athletics & Sports
696,Queensbridge,40.756091,-73.945631,Estate Garden And Grill,40.753700,-73.948841,Beer Garden


#### Discovering the most popular type of Venue in the cluster

In [18]:
cluster1_neighborhoods_venues['Venue Category'].value_counts()

Deli / Bodega            25
Pharmacy                 23
Donut Shop               19
Chinese Restaurant       18
Pizza Place              17
                         ..
Hardware Store            1
Check Cashing Service     1
Shop & Service            1
Cuban Restaurant          1
Movie Theater             1
Name: Venue Category, Length: 183, dtype: int64

#### Counting venues by neighborhood and category in the cluster

In [21]:
df_decision = cluster1_neighborhoods_venues.groupby(['Neighborhood','Venue Category'])['Venue'].count()
df = df_decision.to_frame()
df.reset_index(inplace= True)
df.rename(index=str, columns={"Venue Category": "Venue_Category"},inplace = True)
df

Unnamed: 0,Neighborhood,Venue_Category,Venue
0,Astoria,BBQ Joint,1
1,Astoria,Bagel Shop,2
2,Astoria,Bakery,4
3,Astoria,Bar,6
4,Astoria,Beer Garden,1
...,...,...,...
520,Woodhaven,Sandwich Place,1
521,Woodhaven,Spanish Restaurant,1
522,Woodhaven,Supermarket,1
523,Woodhaven,Thai Restaurant,1


#### Discovering neighborhoods in the cluster that have a Deli / Bodega

In [27]:
df_deli = df.query('Venue_Category == "Deli / Bodega"')
df_deli 

Unnamed: 0,Neighborhood,Venue_Category,Venue
15,Astoria,Deli / Bodega,3
60,Auburndale,Deli / Bodega,1
80,Bellaire,Deli / Bodega,1
94,Belle Harbor,Deli / Bodega,2
105,Briarwood,Deli / Bodega,1
139,Forest Hills,Deli / Bodega,1
204,Howard Beach,Deli / Bodega,2
241,Hunters Point,Deli / Bodega,2
283,Jamaica Center,Deli / Bodega,1
336,Kew Gardens,Deli / Bodega,2


In [29]:
# List of neighborhoods that have a Deli / Bodega
deli_neighborhoods = df_deli['Neighborhood'].to_list()
deli_neighborhoods

['Astoria',
 'Auburndale',
 'Bellaire',
 'Belle Harbor',
 'Briarwood',
 'Forest Hills',
 'Howard Beach',
 'Hunters Point',
 'Jamaica Center',
 'Kew Gardens',
 'Ozone Park',
 'Rockaway Beach',
 'Rockaway Park',
 'South Ozone Park',
 'St. Albans',
 'Woodhaven']

#### Discovering neighborhoods in the cluster that have no Deli / Bodega

In [33]:
cluster6_neighborhoods = df['Neighborhood'].unique()
not_deli = []
for i in cluster6_neighborhoods:
    if i not in deli_neighborhoods:
        not_deli.append(i)
print(not_deli)
        

['Elmhurst', 'Glen Oaks', 'Hillcrest', 'Jamaica Hills', 'Queensbridge', 'Rochdale', 'Springfield Gardens']
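
One caveat with building the candidate list from the venue table: a neighborhood for which no venues were returned at all would not appear in it, so it could never be reported as lacking a deli. The sketch below shows one possible way to guard against that by reindexing a count table against the full list of cluster neighborhoods. The data and names are made up for illustration and this is not the method used above.

```python
import pandas as pd

# Full list of neighborhoods in the cluster; "D" has no venues in the table below
toy_cluster_names = ["A", "B", "C", "D"]

toy_counts_input = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "C"],
    "Venue_Category": ["Deli / Bodega", "Bar", "Park", "Deli / Bodega"],
})

# Count venues per neighborhood and category, then add all-zero rows
# for neighborhoods that are missing from the venue table
toy_counts = pd.crosstab(
    toy_counts_input["Neighborhood"], toy_counts_input["Venue_Category"]
).reindex(toy_cluster_names, fill_value=0)

# Neighborhoods with a zero count for the category of interest
toy_no_deli = toy_counts.index[toy_counts["Deli / Bodega"] == 0].tolist()

print(toy_counts)
print(toy_no_deli)
```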


## Results and Discussion <a name="results"></a>

The neighborhoods Elmhurst, Glen Oaks, Hillcrest, Jamaica Hills, Queensbridge, Rochdale and Springfield Gardens are good candidates for opening a Deli / Bodega: it is the most common type of venue in their group of similar neighborhoods, and Foursquare returned no Deli / Bodega within the 500 m search radius used for each of these neighborhoods.


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify opportunities to start a business in similar neighborhoods in the Queens area. After using the k-means method to group similar neighborhoods into clusters, I identified which kind of venue is the most popular among those neighborhoods and then which neighborhoods do not have it. Because the neighborhoods are similar, the most popular kind of venue in the group is likely to be popular in the neighborhoods where it is missing.

Based on the results of this analysis, Deli / Bodega is the most popular venue type in the most populated cluster and can be expected to be a successful type of business in Elmhurst, Glen Oaks, Hillcrest, Jamaica Hills, Queensbridge, Rochdale and Springfield Gardens.