#### Author: Sbonelo Ngobese

#### Capstone Project: Real-estate business using Foursquare API and historical housing data. 

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

import json # library to handle JSON files

#### Let us become familiar with the data. The property sales data were obtained from Kaggle datasets and the neighbourhood names from NYU.

NYC property sales (house prices): [https://www.kaggle.com/new-york-city/nyc-property-sales#nyc-rolling-sales.csv](https://www.kaggle.com/new-york-city/nyc-property-sales#nyc-rolling-sales.csv)

2014 New York City Neighborhood Names (neighbourhood coordinates): https://geo.nyu.edu/catalog/nyu_2451_34572

In [2]:
# Load the NYC Property data to a pandas dataframe object 
nyc_prop_data2 = pd.read_csv("nyc-rolling-sales.csv", index_col=0) 

In [3]:
# Wrangle data from a JSON file 
with open('nyu_2451_34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    # the borough name goes into the 'PostalCode' index and the 'Borough'
    # column repeats the neighborhood name, as in the output below
    rows.append({'PostalCode': borough,
                 'Borough': neighborhood_name,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

nyc_prop_data = pd.DataFrame(rows, columns=['PostalCode'] + column_names)
nyc_prop_data.set_index("PostalCode", inplace=True)
nyc_prop_data.head()

PostalCode,Borough,Neighborhood,Latitude,Longitude
Bronx,Wakefield,Wakefield,40.894705,-73.847201
Bronx,Co-op City,Co-op City,40.874294,-73.829939
Bronx,Eastchester,Eastchester,40.887556,-73.827806
Bronx,Fieldston,Fieldston,40.895437,-73.905643
Bronx,Riverdale,Riverdale,40.890834,-73.912585
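
The loop above can also be written as a single list comprehension over the GeoJSON features. The sketch below uses a small hand-made `features` list in place of the real file (the two entries copy values from the table above) and assumes each feature carries the `properties.borough`, `properties.name` and `geometry.coordinates` keys used in the notebook. Here the borough is kept in its own `Borough` column; the `features`, `rows` and `neighborhoods` names are illustrative.

```python
import pandas as pd

# Hand-made stand-in for newyork_data['features']; values copied from the table above.
features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

# GeoJSON stores coordinates as [longitude, latitude].
rows = [
    {
        "Borough": f["properties"]["borough"],
        "Neighborhood": f["properties"]["name"],
        "Latitude": f["geometry"]["coordinates"][1],
        "Longitude": f["geometry"]["coordinates"][0],
    }
    for f in features
]
neighborhoods = pd.DataFrame(rows)
print(neighborhoods)
```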


#### This data has longitude, latitude and neighbourhood columns, which should come in handy for Foursquare.

In [4]:
nyc_prop_data.dtypes

Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [5]:
nyc_prop_data.Neighborhood = [string.upper() for string in nyc_prop_data.Neighborhood.values]

#### Some data wrangling. We merge data from NYC Properties and from the JSON file.

In [6]:
# select and rename in one step so we work on a new frame rather than a slice of the original
nyc_data = nyc_prop_data2[['NEIGHBORHOOD','SALE PRICE']].rename(
    columns={'SALE PRICE':'PRICE', 'NEIGHBORHOOD': 'Neighborhood'})

In [7]:
print(nyc_data.shape)

(84548, 2)


#### The Property dataset marks missing sale prices with a dash (-). We remove these rows and merge the rest with the NYC neighbourhood names data, so that Foursquare is queried with a more polished dataset and relevant neighbourhood names.

In [8]:
nyc_data.head(10)

Unnamed: 0,Neighborhood,PRICE
4,ALPHABET CITY,6625000
5,ALPHABET CITY,-
6,ALPHABET CITY,-
7,ALPHABET CITY,3936272
8,ALPHABET CITY,8000000
9,ALPHABET CITY,-
10,ALPHABET CITY,3192840
11,ALPHABET CITY,-
12,ALPHABET CITY,-
13,ALPHABET CITY,16232000


In [9]:
merge_df =pd.merge(nyc_data,
                   nyc_prop_data,
                   how="inner",
                   on='Neighborhood')

In [10]:
merge_df.shape

(57599, 5)

In [11]:
merge_df.head(10)

Unnamed: 0,Neighborhood,PRICE,Borough,Latitude,Longitude
0,CHELSEA,-,Chelsea,40.744035,-74.003116
1,CHELSEA,-,Chelsea,40.594726,-74.18956
2,CHELSEA,-,Chelsea,40.744035,-74.003116
3,CHELSEA,-,Chelsea,40.594726,-74.18956
4,CHELSEA,7425000,Chelsea,40.744035,-74.003116
5,CHELSEA,7425000,Chelsea,40.594726,-74.18956
6,CHELSEA,10,Chelsea,40.744035,-74.003116
7,CHELSEA,10,Chelsea,40.594726,-74.18956
8,CHELSEA,10,Chelsea,40.744035,-74.003116
9,CHELSEA,10,Chelsea,40.594726,-74.18956
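
In the output above every CHELSEA sale appears twice, once per set of coordinates, because the neighbourhood file has two entries named Chelsea and a merge on the name alone pairs each sale with both. A minimal sketch of how this can be detected, using toy frames (the `sales` and `hoods` names and the toy prices are illustrative; the coordinates come from the tables above):

```python
import pandas as pd

sales = pd.DataFrame({"Neighborhood": ["CHELSEA", "CHELSEA", "ASTORIA"],
                      "PRICE": [7425000.0, 10.0, 1399695.0]})
hoods = pd.DataFrame({"Neighborhood": ["CHELSEA", "CHELSEA", "ASTORIA"],
                      "Latitude": [40.744035, 40.594726, 40.768509],
                      "Longitude": [-74.003116, -74.18956, -73.915654]})

# Names that occur more than once on the right-hand side of the merge.
dup_names = hoods.loc[hoods["Neighborhood"].duplicated(), "Neighborhood"].unique()
print(dup_names)

# validate="many_to_one" makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(sales, hoods, how="inner", on="Neighborhood", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)
```

If both files carry a comparable borough column, merging on borough as well as name would be one way to keep the two Chelseas apart.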


In [12]:
merge_df.drop(['Borough'], axis=1, inplace=True)

#### More data cleaning

In [13]:
merge_df.replace(' -  ', np.nan, inplace=True)
merge_df.dropna(inplace=True)
merge_df.PRICE = merge_df.PRICE.astype(np.float64)
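
An alternative to matching the exact placeholder string is to coerce the column to numbers and let anything unparseable become NaN. This is a sketch on toy data, not the notebook's data; the placeholder spelling `' -  '` is taken from the cell above and the `toy` frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Neighborhood": ["ALPHABET CITY"] * 3,
                    "PRICE": ["6625000", " -  ", "3936272"]})

# Non-numeric entries such as ' -  ' become NaN, whatever their exact spacing.
toy["PRICE"] = pd.to_numeric(toy["PRICE"], errors="coerce")
toy = toy.dropna(subset=["PRICE"])
print(toy["PRICE"].tolist())
```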

#### Aggregations: Group by average price

In [14]:
df_grp_price = merge_df.groupby(['Neighborhood', 'Latitude', 'Longitude'])['PRICE'].mean().reset_index()

In [15]:
df_grp_price.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,PRICE
0,ANNADALE,40.538114,-74.178549,628046.6
1,ARDEN HEIGHTS,40.549286,-74.185887,394956.5
2,ARROCHAR,40.596313,-74.067124,570045.9
3,ARVERNE,40.589144,-73.791992,379451.9
4,ASTORIA,40.768509,-73.915654,1399695.0


#### This is where the data section ends!

In [16]:
df_grp_price.Neighborhood.value_counts()

MURRAY HILL            2
SUNNYSIDE              2
CHELSEA                2
FIELDSTON              1
BAYCHESTER             1
LITTLE ITALY           1
OAKLAND GARDENS        1
LITTLE NECK            1
MANOR HEIGHTS          1
CIVIC CENTER           1
JAMAICA HILLS          1
FRESH MEADOWS          1
COUNTRY CLUB           1
WILLOWBROOK            1
ASTORIA                1
NEW SPRINGVILLE        1
NEW DORP               1
HUGUENOT               1
RIDGEWOOD              1
BAY RIDGE              1
WOODROW                1
HOLLISWOOD             1
SOUTH OZONE PARK       1
PARK SLOPE             1
OCEAN HILL             1
STAPLETON              1
RIVERDALE              1
KENSINGTON             1
MADISON                1
HAMMELS                1
                      ..
NEPONSIT               1
MANHATTAN VALLEY       1
GERRITSEN BEACH        1
BRONXDALE              1
CANARSIE               1
RICHMOND HILL          1
SOUTH BEACH            1
EMERSON HILL           1
PLEASANT PLAINS        1


In [17]:
df_grp_price.columns

Index(['Neighborhood', 'Latitude', 'Longitude', 'PRICE'], dtype='object')

In [18]:
df_grp_price.shape

(164, 4)

In [19]:
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium 

print('Libraries imported.')

Libraries imported.


#### Methodology section!!!

In [20]:
import datetime as DT
import hmac
from geopy.distance import geodesic # vincenty was removed in geopy 2.0
# import k-means from clustering stage
from sklearn.cluster import KMeans

#### QUERY NEW YORK from GEOPY; this will be compatible with the Neighborhood names

In [22]:
address = 'NEW YORK'

# recent geopy versions require a user_agent; the name used here is arbitrary
geolocator = Nominatim(user_agent="nyc_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7308619, -73.9871558.


#### Create the Map of NEW YORK City.


In [23]:
# create map of NYC using latitude and longitude values
map_nyc = folium.Map(location=[latitude, longitude], zoom_start=11)

# add one marker per neighborhood, labelled with its name and average price
for lat, lng, price, neighborhood in zip(df_grp_price['Latitude'], df_grp_price['Longitude'], 
                                         df_grp_price['PRICE'], df_grp_price['Neighborhood']):
    label = '{}, {}'.format(neighborhood, price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_nyc)  
    
map_nyc

In [24]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names below are an assumption; use whatever names your environment defines.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20181206' # Foursquare API version

print('Your credentials are loaded.')

Your credentials are loaded.


#### 4. Modeling
After exploring the dataset and gaining insights into it, we are ready to use clustering to analyze real estate. We use the k-means clustering technique because it is fast and computationally cheap, and because it can simply be re-run as the real estate market in New York City changes.
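
k-means needs the number of clusters to be chosen up front. One common heuristic is to compare the inertia (within-cluster sum of squared distances) for several values of k and look for the point where it stops dropping sharply. The sketch below runs on hand-made 2-D toy data, not on the notebook's data; the `centers`, `offsets` and `inertias` names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three tight groups around hand-picked centers.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]])
X = np.vstack([c + offsets for c in centers])

# Inertia is the quantity k-means minimizes; record it for k = 1..4.
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
print(inertias)
```

On data like this the inertia falls steeply up to k = 3 and only slightly afterwards, which is the "elbow" the heuristic looks for.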

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)   
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Street', 
                  'Street Latitude', 
                  'Street Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
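
The function above indexes straight into the response, so a reply without groups, or a venue without categories, would raise an error. A hedged sketch of a more defensive parsing step is shown below. `venue_rows` is a hypothetical helper, not part of the notebook or of any Foursquare client, and the `fake` payload is hand-written in the shape the function above expects.

```python
def venue_rows(payload, name, lat, lng):
    """Turn one explore response into (name, lat, lng, venue, vlat, vlng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Keep the venue even when no category is attached.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload; the first venue copies a row from the notebook's output.
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Annadale Diner",
               "location": {"lat": 40.542079, "lng": -74.177325},
               "categories": [{"name": "Diner"}]}},
    {"venue": {"name": "No Category Venue",
               "location": {"lat": 40.54, "lng": -74.17},
               "categories": []}},
]}]}}

rows = venue_rows(fake, "ANNADALE", 40.538114, -74.178549)
empty = venue_rows({"response": {}}, "ANNADALE", 40.538114, -74.178549)
print(rows)
print(empty)
```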

In [26]:
# Run the above function on each location and create a new dataframe called location_venues and display it.
location_venues = getNearbyVenues(names=df_grp_price['Neighborhood'],
                                   latitudes=df_grp_price['Latitude'],
                                   longitudes=df_grp_price['Longitude']
                                  )

ANNADALE
ARDEN HEIGHTS
ARROCHAR
ARVERNE
ASTORIA
BATH BEACH
BAY RIDGE
BAYCHESTER
BAYSIDE
BEDFORD STUYVESANT
BEECHHURST
BELLE HARBOR
BELLEROSE
BELMONT
BENSONHURST
BERGEN BEACH
BLOOMFIELD
BOERUM HILL
BOROUGH PARK
BRIARWOOD
BRIGHTON BEACH
BROAD CHANNEL
BRONXDALE
BROOKLYN HEIGHTS
BROWNSVILLE
BULLS HEAD
BUSHWICK
CAMBRIA HEIGHTS
CANARSIE
CARROLL GARDENS
CASTLETON CORNERS
CHELSEA
CHELSEA
CHINATOWN
CITY ISLAND
CIVIC CENTER
CLINTON
CLINTON HILL
CO-OP CITY
COBBLE HILL
COLLEGE POINT
CONCORD
CONEY ISLAND
CORONA
COUNTRY CLUB
CROWN HEIGHTS
CYPRESS HILLS
DONGAN HILLS
DOUGLASTON
DYKER HEIGHTS
EAST ELMHURST
EAST NEW YORK
EAST TREMONT
EAST VILLAGE
ELMHURST
ELTINGVILLE
EMERSON HILL
FAR ROCKAWAY
FIELDSTON
FLATIRON
FLATLANDS
FLORAL PARK
FORDHAM
FOREST HILLS
FORT GREENE
FRESH MEADOWS
GERRITSEN BEACH
GLEN OAKS
GLENDALE
GOWANUS
GRAMERCY
GRANT CITY
GRASMERE
GRAVESEND
GREAT KILLS
GREENPOINT
GRYMES HILL
HAMMELS
HILLCREST
HOLLIS
HOLLISWOOD
HOWARD BEACH
HUGUENOT
HUNTS POINT
INWOOD
JACKSON HEIGHTS
JAMAICA ESTATES
JA

In [27]:
location_venues

Unnamed: 0,Street,Street Latitude,Street Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,ANNADALE,40.538114,-74.178549,Play Sports Bar,40.540418,-74.177196,Sports Bar
1,ANNADALE,40.538114,-74.178549,Annadale Diner,40.542079,-74.177325,Diner
2,ANNADALE,40.538114,-74.178549,Il Sogno,40.541286,-74.178489,Restaurant
3,ANNADALE,40.538114,-74.178549,MTA SIR - Annadale,40.540482,-74.178185,Train Station
4,ANNADALE,40.538114,-74.178549,The Square,40.540013,-74.178330,Pizza Place
5,ANNADALE,40.538114,-74.178549,Diesel Bagels,40.540373,-74.177374,American Restaurant
6,ANNADALE,40.538114,-74.178549,Angelos Pizza,40.540475,-74.176878,American Restaurant
7,ANNADALE,40.538114,-74.178549,Creative Nails,40.540960,-74.177380,Cosmetics Shop
8,ANNADALE,40.538114,-74.178549,Mangia! Healthy Kitchen,40.541178,-74.178250,American Restaurant
9,ANNADALE,40.538114,-74.178549,Crown Palace,40.540604,-74.176116,Food


In [28]:
location_venues.groupby('Street').count()

Street,Street Latitude,Street Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
ANNADALE,12,12,12,12,12,12
ARDEN HEIGHTS,5,5,5,5,5,5
ARROCHAR,19,19,19,19,19,19
ARVERNE,18,18,18,18,18,18
ASTORIA,100,100,100,100,100,100
BATH BEACH,48,48,48,48,48,48
BAY RIDGE,88,88,88,88,88,88
BAYCHESTER,25,25,25,25,25,25
BAYSIDE,72,72,72,72,72,72
BEDFORD STUYVESANT,27,27,27,27,27,27


In [29]:
# get the number of unique categories
print('There are {} unique categories.'.format(len(location_venues['Venue Category'].unique())))

There are 372 unique categories.


In [30]:
# one hot encoding
venues_onehot = pd.get_dummies(location_venues[['Venue Category']], prefix="", prefix_sep="")

# add street column back to dataframe
venues_onehot['Street'] = location_venues['Street'] 

# move street column to the first column
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])

#fixed_columns
venues_onehot = venues_onehot[fixed_columns]

venues_onehot.head()

Unnamed: 0,Street,ATM,Accessories Store,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Volleyball Court,Waste Facility,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,ANNADALE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ANNADALE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ANNADALE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ANNADALE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ANNADALE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
nyc_grouped = venues_onehot.groupby('Street').mean().reset_index()
nyc_grouped

Unnamed: 0,Street,ATM,Accessories Store,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Volleyball Court,Waste Facility,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,ANNADALE,0.0,0.000000,0.0,0.000000,0.250000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.000000,0.000000
1,ARDEN HEIGHTS,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.000000,0.000000
2,ARROCHAR,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.000000,0.000000
3,ARVERNE,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.055556,0.000000,0.000000,0.000000
4,ASTORIA,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.010000,0.000000,0.000000,0.000000
5,BATH BEACH,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.041667,0.000000
6,BAY RIDGE,0.0,0.000000,0.0,0.000000,0.034091,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.000000,0.000000
7,BAYCHESTER,0.0,0.000000,0.0,0.000000,0.040000,0.000000,0.04,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.000000,0.000000,0.000000,0.000000,0.000000
8,BAYSIDE,0.0,0.000000,0.0,0.000000,0.041667,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.013889,0.000000,0.000000,0.000000,0.013889
9,BEDFORD STUYVESANT,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00,0.000000,0.00,...,0.00,0.0,0.00,0.0,0.00,0.037037,0.037037,0.000000,0.000000,0.000000
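
The two steps above (one-hot encoding, then a grouped mean) turn a list of venues into per-neighbourhood category frequencies. A small sketch on toy data shows why the mean of a 0/1 column is a frequency; the `toy` frame and its values are illustrative, not taken from the Foursquare results.

```python
import pandas as pd

toy = pd.DataFrame({"Street": ["A", "A", "A", "A", "B", "B"],
                    "Venue Category": ["Pizza Place", "Pizza Place", "Diner", "Bar",
                                       "Bar", "Bar"]})

# One 0/1 column per category, one row per venue.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Street", toy["Street"])

# The mean of a 0/1 column is the share of that street's venues in the category.
freq = onehot.groupby("Street").mean()
print(freq)
```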


In [32]:
# What are the top 5 venues/facilities nearby profitable real estate investments?#
num_top_venues = 5

for hood in nyc_grouped['Street']:
    print("----"+hood+"----")
    temp = nyc_grouped[nyc_grouped['Street'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----ANNADALE----
                 venue  freq
0  American Restaurant  0.25
1          Pizza Place  0.17
2         Dance Studio  0.08
3       Cosmetics Shop  0.08
4           Restaurant  0.08


----ARDEN HEIGHTS----
         venue  freq
0     Pharmacy   0.2
1  Coffee Shop   0.2
2         Pool   0.2
3  Pizza Place   0.2
4   Playground   0.2


----ARROCHAR----
                   venue  freq
0     Italian Restaurant  0.11
1               Bus Stop  0.11
2          Deli / Bodega  0.11
3  Outdoors & Recreation  0.05
4             Food Truck  0.05


----ARVERNE----
             venue  freq
0        Surf Spot  0.22
1    Metro Station  0.11
2       Donut Shop  0.06
3  Thai Restaurant  0.06
4      Coffee Shop  0.06


----ASTORIA----
                       venue  freq
0                        Bar  0.07
1  Middle Eastern Restaurant  0.07
2                 Hookah Bar  0.06
3           Greek Restaurant  0.05
4         Seafood Restaurant  0.04


----BATH BEACH----
                  venue  freq
0  Fast

                venue  freq
0       Deli / Bodega  0.08
1              Bakery  0.08
2   Korean Restaurant  0.08
3  Italian Restaurant  0.08
4  Chinese Restaurant  0.08


----DYKER HEIGHTS----
           venue  freq
0  Hot Dog Joint  0.12
1   Dance Studio  0.12
2   Burger Joint  0.12
3           Park  0.12
4     Bagel Shop  0.12


----EAST ELMHURST----
                 venue  freq
0           Donut Shop  0.25
1  Rental Car Location  0.08
2  Indie Movie Theater  0.08
3          Coffee Shop  0.08
4   Chinese Restaurant  0.08


----EAST NEW YORK----
                venue  freq
0       Deli / Bodega  0.25
1  Chinese Restaurant  0.17
2       Event Service  0.08
3       Metro Station  0.08
4   Convenience Store  0.08


----EAST TREMONT----
            venue  freq
0     Pizza Place  0.24
1            Café  0.12
2      Shoe Store  0.06
3          Lounge  0.06
4  Breakfast Spot  0.06


----EAST VILLAGE----
                venue  freq
0                 Bar  0.06
1            Wine Bar  0.05
2     

                venue  freq
0          Bagel Shop   0.2
1         Pizza Place   0.1
2  Italian Restaurant   0.1
3       Deli / Bodega   0.1
4        Dessert Shop   0.1


----MANHATTAN BEACH----
            venue  freq
0            Café  0.22
1  Sandwich Place  0.11
2           Beach  0.11
3        Bus Stop  0.11
4  Ice Cream Shop  0.11


----MANHATTAN VALLEY----
               venue  freq
0  Indian Restaurant  0.05
1        Coffee Shop  0.05
2        Pizza Place  0.05
3        Yoga Studio  0.03
4               Café  0.03


----MANOR HEIGHTS----
                 venue  freq
0           Bagel Shop   0.2
1           Donut Shop   0.1
2        Deli / Bodega   0.1
3             Pharmacy   0.1
4  American Restaurant   0.1


----MARINE PARK----
            venue  freq
0  Baseball Field  0.11
1   Deli / Bodega  0.11
2            Park  0.11
3    Soccer Field  0.11
4  Ice Cream Shop  0.11


----MASPETH----
               venue  freq
0        Pizza Place  0.10
1              Diner  0.10
2         

                 venue  freq
0           Donut Shop  0.13
1  Fried Chicken Joint  0.13
2          Bus Station  0.13
3           Shoe Store  0.07
4   Chinese Restaurant  0.07


----ST. ALBANS----
                  venue  freq
0  Caribbean Restaurant  0.17
1         Deli / Bodega  0.11
2  Fast Food Restaurant  0.11
3            Donut Shop  0.06
4     Mobile Phone Shop  0.06


----STAPLETON----
                venue  freq
0      Discount Store  0.06
1         Pizza Place  0.06
2  Mexican Restaurant  0.06
3          Restaurant  0.06
4                Bank  0.06


----SUNNYSIDE----
                venue  freq
0         Pizza Place  0.08
1  Italian Restaurant  0.08
2  Chinese Restaurant  0.06
3       Deli / Bodega  0.06
4         Coffee Shop  0.04


----SUNSET PARK----
                       venue  freq
0         Mexican Restaurant  0.09
1                       Bank  0.09
2  Latin American Restaurant  0.09
3                     Bakery  0.09
4                Pizza Place  0.09


----THROGS NECK

In [33]:
# Define a function to return the most common venues/facilities nearby real estate investments#

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Street']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

In [35]:
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Street'] = nyc_grouped['Street']

for ind in np.arange(nyc_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(nyc_grouped.iloc[ind, :], num_top_venues)

In [36]:
venues_sorted.head()

Unnamed: 0,Street,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,ANNADALE,American Restaurant,Pizza Place,Dance Studio,Train Station,Diner,Restaurant,Food,Cosmetics Shop,Sports Bar,Food & Drink Shop
1,ARDEN HEIGHTS,Pool,Pharmacy,Playground,Pizza Place,Coffee Shop,Yoga Studio,Farmers Market,English Restaurant,Ethiopian Restaurant,Event Service
2,ARROCHAR,Bus Stop,Deli / Bodega,Italian Restaurant,Cosmetics Shop,Athletics & Sports,Supermarket,Taco Place,Middle Eastern Restaurant,Outdoors & Recreation,Mediterranean Restaurant
3,ARVERNE,Surf Spot,Metro Station,Thai Restaurant,Coffee Shop,Donut Shop,Board Shop,Event Service,Bus Stop,Sandwich Place,Pizza Place
4,ASTORIA,Middle Eastern Restaurant,Bar,Hookah Bar,Greek Restaurant,Seafood Restaurant,Bakery,Italian Restaurant,Gourmet Shop,Gym / Fitness Center,Gym


In [37]:
venues_sorted.shape

(160, 11)

In [38]:
nyc_grouped = df_grp_price
print(nyc_grouped.shape)

(164, 4)


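After our inspection of the venues/facilities/amenities near each neighbourhood, we can begin clustering. Note that `nyc_grouped` now refers to `df_grp_price`, so k-means runs on latitude, longitude and average price; the most common venues are joined onto each neighbourhood afterwards to describe the clusters.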

In [39]:
#Distribute in 5 Clusters

# set number of clusters
kclusters = 5

nyc_grouped_clustering = nyc_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50]

array([0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 4, 0, 0, 0, 0,
       0, 4, 0, 0, 0, 0, 0, 4, 0, 2, 2, 2, 0, 3, 4, 4, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
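
The features passed to k-means here are in their raw units. Prices are in the hundreds of thousands while coordinates differ by fractions of a degree, so the distances k-means minimizes are driven almost entirely by price. One common option, shown as a sketch rather than as the notebook's method, is to standardize the features first. The five rows are copied from `df_grp_price.head()`; the `toy`, `features` and `scaled` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Neighborhood": ["ANNADALE", "ARDEN HEIGHTS", "ARROCHAR", "ARVERNE", "ASTORIA"],
    "Latitude": [40.538114, 40.549286, 40.596313, 40.589144, 40.768509],
    "Longitude": [-74.178549, -74.185887, -74.067124, -73.791992, -73.915654],
    "PRICE": [628046.6, 394956.5, 570045.9, 379451.9, 1399695.0],
})

features = toy.drop(columns="Neighborhood")

# Rescale every column to zero mean and unit variance so none dominates the distance.
scaled = StandardScaler().fit_transform(features)

# k=2 is an arbitrary choice for this five-row illustration.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled).labels_
print(scaled.std(axis=0))
print(labels)
```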

In [40]:
#Dataframe to include Clusters

# work on a copy so that adding cluster labels does not modify df_grp_price
nyc_grouped_clustering = df_grp_price.copy()
nyc_grouped_clustering.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,PRICE
0,ANNADALE,40.538114,-74.178549,628046.6
1,ARDEN HEIGHTS,40.549286,-74.185887,394956.5
2,ARROCHAR,40.596313,-74.067124,570045.9
3,ARVERNE,40.589144,-73.791992,379451.9
4,ASTORIA,40.768509,-73.915654,1399695.0


In [41]:
df_grp_price.shape

(164, 4)

In [42]:
nyc_grouped_clustering.shape

(164, 4)

In [43]:
venues_sorted.shape

(160, 11)

In [44]:
kmeans.labels_.shape

(164,)

In [45]:
#add clustering labels
nyc_grouped_clustering['Cluster Labels'] = kmeans.labels_

# join the most common venues onto each clustered neighborhood
nyc_grouped_clustering = nyc_grouped_clustering.join(venues_sorted.set_index('Street'), on='Neighborhood')

nyc_grouped_clustering.head(30) # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,PRICE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,ANNADALE,40.538114,-74.178549,628046.6,0,American Restaurant,Pizza Place,Dance Studio,Train Station,Diner,Restaurant,Food,Cosmetics Shop,Sports Bar,Food & Drink Shop
1,ARDEN HEIGHTS,40.549286,-74.185887,394956.5,0,Pool,Pharmacy,Playground,Pizza Place,Coffee Shop,Yoga Studio,Farmers Market,English Restaurant,Ethiopian Restaurant,Event Service
2,ARROCHAR,40.596313,-74.067124,570045.9,0,Bus Stop,Deli / Bodega,Italian Restaurant,Cosmetics Shop,Athletics & Sports,Supermarket,Taco Place,Middle Eastern Restaurant,Outdoors & Recreation,Mediterranean Restaurant
3,ARVERNE,40.589144,-73.791992,379451.9,0,Surf Spot,Metro Station,Thai Restaurant,Coffee Shop,Donut Shop,Board Shop,Event Service,Bus Stop,Sandwich Place,Pizza Place
4,ASTORIA,40.768509,-73.915654,1399695.0,4,Middle Eastern Restaurant,Bar,Hookah Bar,Greek Restaurant,Seafood Restaurant,Bakery,Italian Restaurant,Gourmet Shop,Gym / Fitness Center,Gym
5,BATH BEACH,40.599519,-73.998752,519516.8,0,Pharmacy,Fast Food Restaurant,Pizza Place,Kids Store,Italian Restaurant,Chinese Restaurant,Women's Store,Sushi Restaurant,Bubble Tea Shop,Mobile Phone Shop
6,BAY RIDGE,40.625801,-74.030621,565323.6,0,Italian Restaurant,Spa,Pizza Place,Bagel Shop,Grocery Store,Pharmacy,Greek Restaurant,American Restaurant,Bar,Chinese Restaurant
7,BAYCHESTER,40.866858,-73.835798,336642.7,0,Bus Station,Gym / Fitness Center,Pet Store,Fried Chicken Joint,Breakfast Spot,Supermarket,Fast Food Restaurant,Mexican Restaurant,Men's Store,Mattress Store
8,BAYSIDE,40.766041,-73.774274,682523.4,0,Bar,Pizza Place,American Restaurant,Indian Restaurant,Mexican Restaurant,Sushi Restaurant,Bakery,Italian Restaurant,Pub,Donut Shop
9,BEDFORD STUYVESANT,40.687232,-73.941785,822454.9,0,Coffee Shop,Bar,Deli / Bodega,Pizza Place,Café,Cocktail Bar,Juice Bar,Gourmet Shop,Thrift / Vintage Store,Bagel Shop


In [46]:
# Create Map

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyc_grouped_clustering['Latitude'], nyc_grouped_clustering['Longitude'],
                                  nyc_grouped_clustering['Neighborhood'], nyc_grouped_clustering['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [47]:
nyc_grouped_clustering.loc[nyc_grouped_clustering['Cluster Labels'] == 0, 
                              nyc_grouped_clustering.columns[[1] + list(range(5, nyc_grouped_clustering.shape[1]))]].head()


Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,40.538114,American Restaurant,Pizza Place,Dance Studio,Train Station,Diner,Restaurant,Food,Cosmetics Shop,Sports Bar,Food & Drink Shop
1,40.549286,Pool,Pharmacy,Playground,Pizza Place,Coffee Shop,Yoga Studio,Farmers Market,English Restaurant,Ethiopian Restaurant,Event Service
2,40.596313,Bus Stop,Deli / Bodega,Italian Restaurant,Cosmetics Shop,Athletics & Sports,Supermarket,Taco Place,Middle Eastern Restaurant,Outdoors & Recreation,Mediterranean Restaurant
3,40.589144,Surf Spot,Metro Station,Thai Restaurant,Coffee Shop,Donut Shop,Board Shop,Event Service,Bus Stop,Sandwich Place,Pizza Place
5,40.599519,Pharmacy,Fast Food Restaurant,Pizza Place,Kids Store,Italian Restaurant,Chinese Restaurant,Women's Store,Sushi Restaurant,Bubble Tea Shop,Mobile Phone Shop


In [48]:
nyc_grouped_clustering.loc[nyc_grouped_clustering['Cluster Labels'] == 1, 
                              nyc_grouped_clustering.columns[[1] + list(range(5, nyc_grouped_clustering.shape[1]))]].head()


Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,40.605779,Bus Stop,Discount Store,Recreation Center,Theme Park,Park,English Restaurant,Ethiopian Restaurant,Event Service,Event Space,Eye Doctor


In [49]:
nyc_grouped_clustering.loc[nyc_grouped_clustering['Cluster Labels'] == 2, 
                              nyc_grouped_clustering.columns[[1] + list(range(5, nyc_grouped_clustering.shape[1]))]].head()


Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,40.594726,Coffee Shop,Italian Restaurant,Ice Cream Shop,Nightclub,American Restaurant,Bakery,Seafood Restaurant,Theater,Hotel,French Restaurant
32,40.744035,Coffee Shop,Italian Restaurant,Ice Cream Shop,Nightclub,American Restaurant,Bakery,Seafood Restaurant,Theater,Hotel,French Restaurant
33,40.715618,Chinese Restaurant,Dim Sum Restaurant,Cocktail Bar,American Restaurant,Vietnamese Restaurant,Bakery,Noodle House,Ice Cream Shop,Bubble Tea Shop,Hotpot Restaurant
53,40.727847,Bar,Wine Bar,Ice Cream Shop,Mexican Restaurant,Vegetarian / Vegan Restaurant,Ramen Restaurant,Cocktail Bar,Pizza Place,Chinese Restaurant,Coffee Shop
59,40.739673,Yoga Studio,Gym,American Restaurant,Gym / Fitness Center,New American Restaurant,Sporting Goods Shop,Mediterranean Restaurant,Cycle Studio,Bakery,Japanese Restaurant


In [50]:
nyc_grouped_clustering.loc[nyc_grouped_clustering['Cluster Labels'] == 3, 
                              nyc_grouped_clustering.columns[[1] + list(range(5, nyc_grouped_clustering.shape[1]))]].head()


Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
35,40.715229,Gym / Fitness Center,Bakery,Italian Restaurant,French Restaurant,Yoga Studio,Coffee Shop,Sporting Goods Shop,Spa,Cocktail Bar,Sandwich Place
91,40.719324,Bakery,Café,Sandwich Place,Chinese Restaurant,Mediterranean Restaurant,Seafood Restaurant,Bubble Tea Shop,Salon / Barbershop,Clothing Store,Cocktail Bar


In [51]:
nyc_grouped_clustering.loc[nyc_grouped_clustering['Cluster Labels'] == 4, 
                              nyc_grouped_clustering.columns[[1] + list(range(5, nyc_grouped_clustering.shape[1]))]].head()


Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,40.768509,Middle Eastern Restaurant,Bar,Hookah Bar,Greek Restaurant,Seafood Restaurant,Bakery,Italian Restaurant,Gourmet Shop,Gym / Fitness Center,Gym
13,40.857277,Italian Restaurant,Deli / Bodega,Pizza Place,Bakery,Grocery Store,Dessert Shop,Spanish Restaurant,Bar,Bank,Liquor Store
17,40.685683,Coffee Shop,Dance Studio,Hotel,Spa,Bar,Martial Arts Dojo,Middle Eastern Restaurant,Furniture / Home Store,French Restaurant,Bakery
23,40.695864,Yoga Studio,Pizza Place,Park,Deli / Bodega,Gym,Coffee Shop,Italian Restaurant,Cosmetics Shop,Mexican Restaurant,Asian Restaurant
29,40.68054,Italian Restaurant,Cocktail Bar,Bakery,Pizza Place,Coffee Shop,Wine Shop,Bar,Gym / Fitness Center,Spa,Dessert Shop
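
Besides listing the members of each cluster, it can help to summarize the clusters by size and average price. A minimal sketch with four rows copied from the clustered table shown earlier (the `toy` and `summary` names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["ANNADALE", "ARDEN HEIGHTS", "ASTORIA", "BAYSIDE"],
    "PRICE": [628046.6, 394956.5, 1399695.0, 682523.4],
    "Cluster Labels": [0, 0, 4, 0],
})

# Number of neighborhoods and mean price per cluster.
summary = toy.groupby("Cluster Labels")["PRICE"].agg(["count", "mean"])
print(summary)
```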


#### Results and discussion

We applied a clustering technique to New York neighbourhoods in order to recommend locations, together with the current average price of real estate, where home-buyers can make a real estate investment. We recommended locations according to the amenities and essential facilities surrounding them, e.g. elementary schools, high schools, hospitals and grocery stores.

We started by collecting New York City rolling property sales data from Kaggle ([https://www.kaggle.com/new-york-city/nyc-property-sales#nyc-rolling-sales.csv](https://www.kaggle.com/new-york-city/nyc-property-sales#nyc-rolling-sales.csv)); the neighbourhood names and coordinates were extracted from the 2014 New York City Neighborhood Names dataset (https://geo.nyu.edu/catalog/nyu_2451_34572). Moreover, to explore and target recommended locations according to the presence of amenities and essential facilities, we accessed data through the Foursquare API and arranged them as a data frame for visualization. Integrating data from the Foursquare API and the New York property data, we suggest which neighbourhoods are great for real-estate investment business based on the surrounding facilities and amenities.

The Methodology section can be broken down into four stages:

1. Collect Data
2. Explore and Understand Data
3. Data preparation and preprocessing
4. Modeling
    
In the modeling section, we used the k-means clustering technique because it is fast and computationally cheap, and because it can simply be re-run as the real estate market in New York changes.

#### Conclusion
The top areas/neighbourhoods that seem to be good for real-estate business are Annadale, Arden Heights, Arrochar, Arverne and Astoria. Keep in mind that these choices are based on the surrounding facilities, which are mostly trending and well placed in terms of availability.

