# IBM Data Science - The Battle of Neighborhoods (Week 2)
## Chinese Restaurant Exploration

## Problem Statement

A budding entrepreneur sees a trend in New York City for Chinese cuisine and is interested in setting up shop there. Because the city is so large, he is unsure where the best place to open his new Chinese restaurant would be. He understands, however, that data can support a more informed decision, so he would like to use data to determine the best location for the restaurant.

## Data

To help this restaurateur solve his dilemma, data from the New York City neighbourhoods dataset will be used. Using the latitude and longitude given for each neighbourhood, the Foursquare API will be queried to find the distribution of Chinese restaurants across the neighbourhoods. The neighbourhoods with the lowest density of Chinese restaurants can then be considered as candidate locations for the new restaurant.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in New York City</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>   
    
6. <a href="#item6">Results & Discussion</a>

7. <a href="#item7">Conclusion</a>
</font>
</div>

First, all the dependencies used in this project have to be downloaded and the necessary libraries installed.

In [6]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

The dataset containing the details about each neighbourhood located in New York is available from the following link:
https://cocl.us/new_york_dataset

Now the data is downloaded:

In [9]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

In [7]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [9]:
neighborhoods_data = newyork_data['features']

#### Transform the data into a *pandas* dataframe

In [10]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Fill the data in the dataframe from the loaded JSON file.

In [11]:
# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # the coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
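As an alternative to the explicit loop, the same flattening could be sketched with `pd.json_normalize`. The two records in `sample_features` are hand-written stand-ins that mirror the keys read from the JSON file (`properties.borough`, `properties.name`, `geometry.coordinates`); the name `neighborhoods_sketch` is only illustrative.

```python
import pandas as pd

# hand-written stand-ins for two entries of newyork_data['features']
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Manhattan", "name": "Marble Hill"},
     "geometry": {"coordinates": [-73.91066, 40.876551]}},
]

# nested dicts become dotted column names; the coordinate list stays a list
flat = pd.json_normalize(sample_features)

neighborhoods_sketch = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    # coordinates are stored as [longitude, latitude]
    "Latitude": flat["geometry.coordinates"].str[1],
    "Longitude": flat["geometry.coordinates"].str[0],
})
```

This avoids growing a dataframe row by row, which tends to be slow for larger inputs.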

Quickly examine the resulting dataframe.

In [12]:
neighborhoods

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
| 5 | Bronx | Kingsbridge | 40.881687 | -73.902818 |
| 6 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 7 | Bronx | Woodlawn | 40.898273 | -73.867315 |
| 8 | Bronx | Norwood | 40.877224 | -73.879391 |
| 9 | Bronx | Williamsbridge | 40.881039 | -73.857446 |


#### Use geopy library to get the latitude and longitude values of New York City.

In [13]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
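A geocoding lookup can fail or return `None` (for example when the service is unreachable or finds no match), which would break the attribute access in the cell above. The following is a minimal sketch of a defensive wrapper; `resolve_coordinates` and `NYC_FALLBACK` are hypothetical names, and the fallback values are simply the coordinates printed above. The helper takes any callable so it can be tried without network access; with geopy it would be called as `resolve_coordinates(geolocator.geocode, address, NYC_FALLBACK)`.

```python
from types import SimpleNamespace

# fallback taken from the coordinates printed by the notebook
NYC_FALLBACK = (40.7127281, -74.0060152)

def resolve_coordinates(geocode, address, fallback):
    """Return (latitude, longitude) for address, or fallback if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:
        # broad catch keeps the sketch free of geopy-specific imports;
        # a real notebook might catch geopy's own exception classes instead
        location = None
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# stand-in geocoders for demonstration (no network access needed)
found = resolve_coordinates(
    lambda _: SimpleNamespace(latitude=40.5, longitude=-74.1),
    'New York City, NY', NYC_FALLBACK)
missing = resolve_coordinates(lambda _: None, 'nowhere', NYC_FALLBACK)
```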


#### Create a map of New York with neighborhoods superimposed on top.

In [14]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

#### Define Foursquare Credentials and Version

In [15]:
import os

# read the credentials from environment variables rather than hard-coding them
# (the variable names below are an assumed convention, not a Foursquare requirement)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# do not print the credentials themselves, so they never end up in saved output
print('Foursquare credentials loaded.')

Foursquare credentials loaded.


In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<a id='item2'></a>

#### Create a function to repeat the same process to all the neighborhoods 

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    """Query the Foursquare explore endpoint around each neighborhood centre."""

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL; the radius and limit arguments are used
        # as passed in rather than being overwritten inside the loop
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # passing the column names to the constructor also works when no venues were found
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues
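The function above indexes straight into the nested response, so a reply without groups, or a venue without categories, would raise an error partway through the loop. A more tolerant parsing step could be sketched as below. `parse_explore_items` is a hypothetical helper, and `sample_payload` is a hand-written stand-in that only mirrors the keys the function reads; its second venue is invented to show the missing-category case.

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one explore-style response into venue rows, tolerating missing fields."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            # None when the venue carries no category at all
            categories[0]['name'] if categories else None,
        ))
    return rows

# hand-written payload; the second venue is hypothetical and has no category
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Lollipops Gelato',
               'location': {'lat': 40.894123, 'lng': -73.845892},
               'categories': [{'name': 'Dessert Shop'}]}},
    {'venue': {'name': 'Uncategorised Venue',
               'location': {'lat': 40.8950, 'lng': -73.8460},
               'categories': []}},
]}]}}

rows = parse_explore_items('Wakefield', 40.894705, -73.847201, sample_payload)
```

Rows with a `None` category would simply be dropped by the later filter on `'Chinese Restaurant'`.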

In [25]:
# fetch the nearby venues for every neighborhood in the dataframe
venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                         latitudes=neighborhoods['Latitude'],
                         longitudes=neighborhoods['Longitude'])



Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

#### Check the size of the resulting dataframe

In [26]:
print(venues.shape)
venues.head()

(10259, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 2 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |


Check how many venues were returned for each neighborhood

In [22]:
venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Allerton | 28 | 28 | 28 | 28 | 28 | 28 |
| Annadale | 8 | 8 | 8 | 8 | 8 | 8 |
| Arden Heights | 5 | 5 | 5 | 5 | 5 | 5 |
| Arlington | 7 | 7 | 7 | 7 | 7 | 7 |
| Arrochar | 19 | 19 | 19 | 19 | 19 | 19 |
| Arverne | 18 | 18 | 18 | 18 | 18 | 18 |
| Astoria | 100 | 100 | 100 | 100 | 100 | 100 |
| Astoria Heights | 13 | 13 | 13 | 13 | 13 | 13 |
| Auburndale | 18 | 18 | 18 | 18 | 18 | 18 |
| Bath Beach | 48 | 48 | 48 | 48 | 48 | 48 |


Since we are only interested in the Chinese restaurants in the area, we filter the dataset so that it shows only those venues.

In [55]:
# keep only the venues that Foursquare categorises as Chinese restaurants
chineseFood = venues[venues['Venue Category'] == 'Chinese Restaurant']
chineseFood

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 46 | Eastchester | 40.887556 | -73.827806 | Xing Lung Chinese Restaurant | 40.888785 | -73.831226 | Chinese Restaurant |
| 208 | Norwood | 40.877224 | -73.879391 | Ming Liang Kitchen | 40.879876 | -73.876629 | Chinese Restaurant |
| 212 | Norwood | 40.877224 | -73.879391 | Sing Fei Chinese Restaurant | 40.879907 | -73.875307 | Chinese Restaurant |
| 216 | Norwood | 40.877224 | -73.879391 | Happy Dragon | 40.88041 | -73.883442 | Chinese Restaurant |
| 244 | Pelham Parkway | 40.857413 | -73.854756 | Mr. Q's Chinese Restaurant | 40.85579 | -73.855455 | Chinese Restaurant |
| 260 | Pelham Parkway | 40.857413 | -73.854756 | Great Wall Chinese Restaurant | 40.855168 | -73.855587 | Chinese Restaurant |
| 305 | Bedford Park | 40.870185 | -73.885512 | Choi Yuan - Chinese Restaurant | 40.873078 | -73.889086 | Chinese Restaurant |
| 308 | Bedford Park | 40.870185 | -73.885512 | Hung Hing Chinese Restaurant | 40.871181 | -73.886759 | Chinese Restaurant |
| 311 | Bedford Park | 40.870185 | -73.885512 | Rose Flower Chinese | 40.867958 | -73.883858 | Chinese Restaurant |
| 320 | Bedford Park | 40.870185 | -73.885512 | Hong Kong Kitchen | 40.869617 | -73.890512 | Chinese Restaurant |


## 2. Explore Neighborhoods in New York City

<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [56]:
# one hot encoding
chinese_onehot = pd.get_dummies(chineseFood[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chinese_onehot['Neighborhood'] = chineseFood['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [chinese_onehot.columns[-1]] + list(chinese_onehot.columns[:-1])
chinese_onehot = chinese_onehot[fixed_columns]

chinese_onehot

| | Neighborhood | Chinese Restaurant |
|---|---|---|
| 46 | Eastchester | 1 |
| 208 | Norwood | 1 |
| 212 | Norwood | 1 |
| 216 | Norwood | 1 |
| 244 | Pelham Parkway | 1 |
| 260 | Pelham Parkway | 1 |
| 305 | Bedford Park | 1 |
| 308 | Bedford Park | 1 |
| 311 | Bedford Park | 1 |
| 320 | Bedford Park | 1 |


And let's examine the new dataframe size.

In [52]:
chinese_onehot.shape

(24, 3)

#### Next, let's group rows by neighborhood and sum the one-hot column to count the Chinese restaurants in each neighborhood

In [64]:
chinese_grouped = chinese_onehot.groupby('Neighborhood').sum().reset_index()
chinese_grouped

| | Neighborhood | Chinese Restaurant |
|---|---|---|
| 0 | Allerton | 1 |
| 1 | Astoria | 1 |
| 2 | Bath Beach | 3 |
| 3 | Battery Park City | 1 |
| 4 | Bay Ridge | 2 |
| 5 | Bayside | 1 |
| 6 | Bedford Park | 4 |
| 7 | Beechhurst | 2 |
| 8 | Bellaire | 2 |
| 9 | Belle Harbor | 1 |
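One caveat of grouping the filtered dataframe is that only neighbourhoods with at least one Chinese restaurant appear in the result; neighbourhoods with none drop out entirely. A minimal sketch, on toy data, of keeping those zero-count neighbourhoods is shown below. `neighborhoods_toy` and `venues_toy` are illustrative stand-ins for the notebook's `neighborhoods` and `venues` dataframes, not real query results.

```python
import pandas as pd

# toy stand-ins for the notebook's `neighborhoods` and `venues` dataframes
neighborhoods_toy = pd.DataFrame(
    {'Neighborhood': ['Wakefield', 'Eastchester', 'Norwood']})
venues_toy = pd.DataFrame({
    'Neighborhood': ['Wakefield', 'Wakefield', 'Eastchester',
                     'Norwood', 'Norwood', 'Norwood'],
    'Venue Category': ['Dessert Shop', 'Pharmacy', 'Chinese Restaurant',
                       'Chinese Restaurant', 'Chinese Restaurant',
                       'Chinese Restaurant'],
})

is_chinese = venues_toy['Venue Category'] == 'Chinese Restaurant'
counts = (venues_toy[is_chinese]
          .groupby('Neighborhood')
          .size()
          # bring back every neighbourhood; those without a match get a count of 0
          .reindex(neighborhoods_toy['Neighborhood'], fill_value=0)
          .rename_axis('Neighborhood')
          .rename('Chinese Restaurant')
          .reset_index())
```

Whether zero-count neighbourhoods should enter the clustering is a design choice: they are arguably the least served areas, but a zero can also mean that the neighbourhood has few venues of any kind.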


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to group the neighborhoods into 3 clusters.

In [61]:
# set number of clusters
kclusters = 3

chinese_grouped_clustering = chinese_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(chinese_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 1, 0, 1, 0, 2, 1, 1, 0], dtype=int32)
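The integer labels that *k*-means returns are arbitrary: label 0 is not guaranteed to be the cluster with the fewest restaurants. One way to make the labels interpretable is to rank the clusters by their centroid, as in the sketch below. The counts are toy values chosen for illustration, and `ordered_labels` is a hypothetical name; `n_init` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy restaurant counts per neighborhood (illustrative values only)
counts = np.array([1, 1, 3, 1, 2, 1, 4, 2, 2, 1, 9, 5]).reshape(-1, 1)

kclusters = 3
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(counts)

# rank the clusters by centroid: 0 = fewest restaurants, kclusters - 1 = most
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_label = np.empty(kclusters, dtype=int)
rank_of_label[order] = np.arange(kclusters)

# relabel every neighborhood with the rank of its cluster
ordered_labels = rank_of_label[kmeans.labels_]
```

With ordered labels, "cluster 0" always denotes the low-count group regardless of how the algorithm happened to number the clusters.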

A new dataframe is created to include the clusters

In [65]:
# add clustering labels
chinese_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

# join with the neighborhoods dataframe to add the borough and latitude/longitude of each neighborhood
chinesefoodCombined = chinese_grouped.join(neighborhoods.set_index('Neighborhood'), on='Neighborhood')

In [66]:
chinesefoodCombined

| | Cluster Labels | Neighborhood | Chinese Restaurant | Borough | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 0 | Allerton | 1 | Bronx | 40.865788 | -73.859319 |
| 1 | 0 | Astoria | 1 | Queens | 40.768509 | -73.915654 |
| 2 | 1 | Bath Beach | 3 | Brooklyn | 40.599519 | -73.998752 |
| 3 | 0 | Battery Park City | 1 | Manhattan | 40.711932 | -74.016869 |
| 4 | 1 | Bay Ridge | 2 | Brooklyn | 40.625801 | -74.030621 |
| 5 | 0 | Bayside | 1 | Queens | 40.766041 | -73.774274 |
| 6 | 2 | Bedford Park | 4 | Bronx | 40.870185 | -73.885512 |
| 7 | 1 | Beechhurst | 2 | Queens | 40.792781 | -73.804365 |
| 8 | 1 | Bellaire | 2 | Queens | 40.733014 | -73.738892 |
| 9 | 0 | Belle Harbor | 1 | Queens | 40.576156 | -73.854018 |


Finally the resulting clusters are visualized

In [67]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(chinesefoodCombined['Latitude'], chinesefoodCombined['Longitude'], chinesefoodCombined['Neighborhood'], chinesefoodCombined['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Examine Clusters

#### Cluster 1

In [70]:
chinesefoodCombined.loc[chinesefoodCombined['Cluster Labels'] == 0, chinesefoodCombined.columns[1:]]

| | Neighborhood | Chinese Restaurant | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Allerton | 1 | Bronx | 40.865788 | -73.859319 |
| 1 | Astoria | 1 | Queens | 40.768509 | -73.915654 |
| 3 | Battery Park City | 1 | Manhattan | 40.711932 | -74.016869 |
| 5 | Bayside | 1 | Queens | 40.766041 | -73.774274 |
| 9 | Belle Harbor | 1 | Queens | 40.576156 | -73.854018 |
| 10 | Bellerose | 1 | Queens | 40.728573 | -73.720128 |
| 12 | Blissville | 1 | Queens | 40.737251 | -73.932442 |
| 13 | Boerum Hill | 1 | Brooklyn | 40.685683 | -73.983748 |
| 14 | Borough Park | 1 | Brooklyn | 40.633131 | -73.990498 |
| 15 | Bronxdale | 1 | Bronx | 40.852723 | -73.861726 |


#### Cluster 2

In [71]:
chinesefoodCombined.loc[chinesefoodCombined['Cluster Labels'] == 1, chinesefoodCombined.columns[1:]]

| | Neighborhood | Chinese Restaurant | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | Bath Beach | 3 | Brooklyn | 40.599519 | -73.998752 |
| 4 | Bay Ridge | 2 | Brooklyn | 40.625801 | -74.030621 |
| 7 | Beechhurst | 2 | Queens | 40.792781 | -73.804365 |
| 8 | Bellaire | 2 | Queens | 40.733014 | -73.738892 |
| 11 | Bensonhurst | 2 | Brooklyn | 40.611009 | -73.99518 |
| 17 | Bulls Head | 2 | Staten Island | 40.609592 | -74.159409 |
| 21 | Central Harlem | 2 | Manhattan | 40.815976 | -73.943211 |
| 28 | Clinton Hill | 3 | Brooklyn | 40.693229 | -73.967843 |
| 30 | College Point | 2 | Queens | 40.784903 | -73.843045 |
| 31 | Concourse | 2 | Bronx | 40.834284 | -73.915589 |


#### Cluster 3

In [72]:
chinesefoodCombined.loc[chinesefoodCombined['Cluster Labels'] == 2, chinesefoodCombined.columns[1:]]

| | Neighborhood | Chinese Restaurant | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 6 | Bedford Park | 4 | Bronx | 40.870185 | -73.885512 |
| 23 | Chinatown | 9 | Manhattan | 40.715618 | -73.994279 |
| 50 | Flushing | 5 | Queens | 40.764454 | -73.831773 |
| 71 | Little Neck | 5 | Queens | 40.770826 | -73.738898 |
| 73 | Lower East Side | 4 | Manhattan | 40.717807 | -73.98089 |


<a id='item6'></a>

## 6. Results & Discussion

From the analysis conducted, cluster 1 contains the neighbourhoods with the lowest count of Chinese restaurants (one each in the rows shown), cluster 2 contains those with two or three, and cluster 3 contains those with four or more, led by Chinatown with nine. Overall, New York City still appears to offer plenty of opportunity for new Chinese restaurants. Outside Manhattan, Queens, and the Bronx, the number of Chinese restaurants is still relatively low, and Staten Island is among the boroughs with the fewest.
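A borough-level comparison of this kind can be made explicit by aggregating the combined dataframe by borough. The sketch below uses only the first ten rows of `chinesefoodCombined` as displayed earlier, so the totals are illustrative and not the full result; `sample` and `borough_summary` are hypothetical names.

```python
import pandas as pd

# first ten rows of chinesefoodCombined as displayed earlier (not the full result)
sample = pd.DataFrame({
    'Neighborhood': ['Allerton', 'Astoria', 'Bath Beach', 'Battery Park City',
                     'Bay Ridge', 'Bayside', 'Bedford Park', 'Beechhurst',
                     'Bellaire', 'Belle Harbor'],
    'Chinese Restaurant': [1, 1, 3, 1, 2, 1, 4, 2, 2, 1],
    'Borough': ['Bronx', 'Queens', 'Brooklyn', 'Manhattan', 'Brooklyn',
                'Queens', 'Bronx', 'Queens', 'Queens', 'Queens'],
})

# per borough: how many neighbourhoods are listed and how many restaurants they hold
borough_summary = (sample.groupby('Borough')['Chinese Restaurant']
                   .agg(neighborhoods='count', restaurants='sum')
                   .sort_values('restaurants'))
```

Run on the full dataframe, the same aggregation would show directly which boroughs have the fewest Chinese restaurants among the neighbourhoods that have any.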

<a id='item7'></a>

## 7. Conclusion

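In conclusion, the New York City neighbourhoods dataset was combined with Foursquare venue data, and the neighbourhoods were clustered by their number of Chinese restaurants. The neighbourhoods in cluster 1, which have the fewest existing Chinese restaurants, are the most suitable candidates for opening a new one, with Staten Island standing out among the boroughs.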