## Capstone Project - The Battle of the Neighborhoods (Week 2)
## Neighborhood Recommender System by:
### Abubaker Abdelhafiz

## Business Problem Chosen:
### Neighborhood Recommender
Imagine a young couple expecting a child in a few months. They are both early in their careers and, after many years of working hard and making good financial decisions, can finally afford to buy a house.

First of all, congratulations to our fictional couple!

And now with that out of the way, the real work begins. Not only do our protagonists have to find something they can afford, they also have to decide which neighborhood to live in. Not an easy task at all.

On top of that, there are many, many factors to consider here. What is "livable" for one family is considered too spartan for another. Some people place a high value on amenities and community, whereas others are focused on quality-of-life factors such as having a good variety of restaurants nearby. Commutes are also a consideration for many people, so as we can see, this is a complicated, multi-dimensional problem.

The goal in this project is to develop a system that does the following:

1- Take some things a client (in this case, our young and strapping couple) wants in a neighborhood, sorted by importance from 'Non-negotiable' to 'nice to haves'.

2- Take a "sample" ideal neighborhood they would like to live in.

3- Use data science and machine learning techniques to recommend to them some candidate neighborhoods, sorted by how well they match their desired requirements and how closely they 'resemble' the sample neighborhood they provided.

We could think of this as a virtual real-estate agent of sorts.

## Data:

In this project, the following data inputs are required:


1- **Client's budget**: After all, there's no point in doing all of this work only to realize that they cannot afford a house in the recommended neighborhoods.

2- **List of requirements sorted by 'priority'**: The application asks the client to input an item they care about or would like to have. This will be used to 'score' a neighborhood depending on the priority the client has placed on it. To limit complexity at this stage, the client will be provided a list of things to choose from, and will be able to input a 'score' between 0 and 10 to indicate how much value they place on each item.
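To make the two inputs above concrete, here is a minimal sketch of how a budget and a set of 0 to 10 priorities could be turned into a ranking. Everything in it is an assumption for illustration: the toy data, the `avg_price` column, the function name `score_neighborhoods`, and the choice of a priority-weighted sum of venue counts as the score. It is not the scoring used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: venue counts per neighborhood and made-up average prices.
candidates = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "avg_price": [900_000, 650_000, 700_000],
        "Park": [2, 5, 1],
        "Coffee Shop": [4, 1, 3],
        "Gym": [0, 2, 2],
    }
)

# Client inputs (assumed format): a budget and a 0-10 priority per item.
budget = 750_000
priorities = {"Park": 10, "Coffee Shop": 6, "Gym": 3}


def score_neighborhoods(df, priorities, budget):
    """Drop neighborhoods over budget, then rank by priority-weighted venue counts."""
    affordable = df[df["avg_price"] <= budget].copy()
    items = list(priorities)
    weights = pd.Series(priorities, dtype=float) / 10.0  # scale 0-10 to 0-1
    # Weighted sum of the amenity counts, one score per neighborhood.
    affordable["score"] = affordable[items].mul(weights, axis=1).sum(axis=1)
    return affordable.sort_values("score", ascending=False).reset_index(drop=True)


ranked = score_neighborhoods(candidates, priorities, budget)
print(ranked[["Neighborhood", "score"]])
```

Filtering on the budget first keeps unaffordable neighborhoods out of the ranking entirely, so a high score can never recommend a place the clients cannot buy into.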

Once all this data is collected, our application will fetch additional information about the city's neighborhoods from the following sources:

1- **Foursquare API**: This will be used to find venues/amenities for the neighborhoods around the city.

2- **Folium**: This will be used to draw maps of the city and its neighborhoods.

3- **Real Estate Databases** (Optional): This will be used to find the average house price for the neighborhoods to exclude ones that fall outside the client's budget. If it is not possible to procure a compatible, readily available database, I will find real estate price estimates for our city and enter them directly. 


With all of this on hand, what our brave application will attempt to do is to:

1- Provide our clients with a list of neighborhoods that are 'similar' to the one they listed

2- Score those neighborhoods based on how well they fit the criteria provided by clients if multiple matches or candidates exist.


### Obtaining the New York Map

First, let's import all the libraries we need

In [2]:
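import requests  # HTTP requests to the Foursquare API
import pandas as pd  # dataframes
import numpy as np  # numerical arrays
import random
import json  # parsing the neighborhood JSON file
# flattens nested JSON into a dataframe (pandas 1.0+ also exposes this as pd.json_normalize)
from pandas.io.json import json_normalize
from IPython.display import Image
from IPython.core.display import HTML
import matplotlib.cm as cm
import matplotlib.colors as colors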

!conda install -c conda-forge folium=0.5.0 --yes
import folium


from geopy.geocoders import Nominatim


print('Done importing and installing')


Let's bring in the map of our city and its neighborhoods

In [17]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')
with open('newyork_data.json') as json_data:
    NYC_data = json.load(json_data)

Data downloaded!


Before converting the JSON data to a dataframe, we ask the client for their reference neighborhood and look up its coordinates.

In [468]:
thehood=input('Please input your reference neighborhood\t')

address = '{}, NY'.format(thehood)

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(thehood, latitude, longitude))

Please input your reference neighborhood	Chelsea
The geographical coordinates of Chelsea are 40.7464906, -74.0015283.


So they picked Chelsea as their 'reference' area. We will hold on to this piece of information for the time being and continue to work on collecting the neighborhood data and getting our mapping ready.

In [469]:
# each entry in 'features' describes one neighborhood
neighborhoods_data = NYC_data['features']

column_names = ['Neighborhood', 'Latitude', 'Longitude']

# collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [529]:
neighborhoods.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 |
| 1 | Co-op City | 40.874294 | -73.829939 |
| 2 | Eastchester | 40.887556 | -73.827806 |
| 3 | Fieldston | 40.895437 | -73.905643 |
| 4 | Riverdale | 40.890834 | -73.912585 |


Here we plot all of the NYC neighborhoods on a map centered on the reference neighborhood.

In [471]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    # the popup shows the neighborhood name when a marker is clicked
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Next, we will call the Foursquare API in order to collect data on the amenities/features of a given neighborhood

In [472]:
import os

# Foursquare credentials are read from environment variables so that they are
# not stored in the notebook (the variable names below are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'


In this step, we will run a sample query where we search for the availability of the client's desired amenity around the neighborhood specified (just as a test/sanity check). 

In [473]:
search_query = input('Please enter a sample amenity or service you would like to look up\t')

radius = 500
LIMIT = 30

# let requests build and URL-encode the query string (the query may contain spaces)
url = 'https://api.foursquare.com/v2/venues/search'
params = {'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET,
          'll': '{},{}'.format(latitude, longitude), 'v': VERSION,
          'query': search_query, 'radius': radius, 'limit': LIMIT}
results = requests.get(url, params=params).json()
results

Please enter a sample amenity or service you would like to look up	Japanese Food


{'meta': {'code': 200, 'requestId': '5ea3739abe61c9001b9e2573'},
 'response': {'venues': [{'id': '4bbd185af57ba5932cb3adb9',
    'name': 'Health Food & Vitamin City',
    'location': {'lat': 40.746354,
     'lng': -74.001795,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.746354,
       'lng': -74.001795},
      {'label': 'entrance', 'lat': 40.746473, 'lng': -74.00166}],
     'distance': 27,
     'postalCode': '10011',
     'cc': 'US',
     'city': 'New York',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['New York, NY 10011', 'United States']},
    'categories': [{'id': '50aa9e744b90af0d42d5de0e',
      'name': 'Health Food Store',
      'pluralName': 'Health Food Stores',
      'shortName': 'Health Food Store',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/food_grocery_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1587770260',
    'hasPerk': False},
   {'id': '507c544be4b0f570f9a8b5e

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


In [41]:
neighborhood_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )


Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [474]:
print(neighborhood_venues.shape)
neighborhood_venues.head()
neighborhood_venues.groupby('Neighborhood').count()

(6107, 7)


| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Allerton | 27 | 27 | 27 | 27 | 27 | 27 |
| Annadale | 9 | 9 | 9 | 9 | 9 | 9 |
| Arden Heights | 5 | 5 | 5 | 5 | 5 | 5 |
| Arlington | 6 | 6 | 6 | 6 | 6 | 6 |
| Arrochar | 21 | 21 | 21 | 21 | 21 | 21 |
| Arverne | 17 | 17 | 17 | 17 | 17 | 17 |
| Astoria | 30 | 30 | 30 | 30 | 30 | 30 |
| Astoria Heights | 12 | 12 | 12 | 12 | 12 | 12 |
| Auburndale | 16 | 16 | 16 | 16 | 16 | 16 |
| Bath Beach | 30 | 30 | 30 | 30 | 30 | 30 |


### One-hot encoding:
Here we one-hot encode the venues of the neighborhood to prepare the data for processing and analysis

In [533]:
# one hot encoding
neighborhood_onehot = pd.get_dummies(neighborhood_venues[['Venue Category']], prefix="", prefix_sep="")


neighborhood_onehot['Neighborhood'] =neighborhood_venues['Neighborhood'] 


fixed_columns = [neighborhood_onehot.columns[-1]] + list(neighborhood_onehot.columns[:-1])
neighborhood_onehot = neighborhood_onehot[fixed_columns]

neighborhood_onehot.head()


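| | Yoga Studio | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | Airport Terminal | American Restaurant | Antique Shop | Arcade | Arepa Restaurant | ... | Vietnamese Restaurant | Warehouse Store | Waste Facility | Waterfront | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |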


In [532]:
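# average the one-hot rows so each value is the share of a neighborhood's venues in that category
neighborhood_grouped = neighborhood_onehot.groupby('Neighborhood').mean().reset_index()

# sanity check: the category frequencies in each row should sum to 1
s = neighborhood_grouped.drop(columns=['Neighborhood']).sum(axis=1)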
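The one-hot encoding and per-neighborhood averaging can be easier to follow on a tiny example. The sketch below uses a made-up venue table with the same two column names as above; the rows are invented purely for illustration.

```python
import pandas as pd

# Toy venue table (made-up rows) with the same two columns used in the notebook.
venues = pd.DataFrame(
    {
        "Neighborhood": ["X", "X", "X", "X", "Y", "Y"],
        "Venue Category": ["Park", "Cafe", "Cafe", "Gym", "Park", "Park"],
    }
)

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="").astype(float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicators gives each category's share of a neighborhood's venues.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)

# Every venue has exactly one category here, so each row of shares sums to 1.
row_sums = grouped.drop(columns="Neighborhood").sum(axis=1)
print(row_sums.tolist())
```

Because the values are shares rather than raw counts, a neighborhood with 5 venues and one with 30 can be compared on the same scale.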

In [534]:
num_top_venues = 5

for hood in neighborhood_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = neighborhood_grouped[neighborhood_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allerton----
              venue  freq
0       Pizza Place  0.15
1  Department Store  0.07
2     Deli / Bodega  0.07
3       Supermarket  0.07
4    Cosmetics Shop  0.07


----Annadale----
          venue  freq
0  Dance Studio  0.11
1   Pizza Place  0.11
2         Diner  0.11
3           Pub  0.11
4    Sports Bar  0.11


----Arden Heights----
           venue  freq
0  Deli / Bodega   0.2
1       Pharmacy   0.2
2    Coffee Shop   0.2
3       Bus Stop   0.2
4    Pizza Place   0.2


----Arlington----
                 venue  freq
0             Bus Stop  0.33
1        Deli / Bodega  0.17
2         Intersection  0.17
3  American Restaurant  0.17
4        Grocery Store  0.17


----Arrochar----
                      venue  freq
0        Italian Restaurant  0.10
1                  Bus Stop  0.10
2             Deli / Bodega  0.10
3        Athletics & Sports  0.05
4  Mediterranean Restaurant  0.05


----Arverne----
            venue  freq
0       Surf Spot  0.24
1  Sandwich Place  0.12
2   Met

# Methodology:

Now that we have set everything up (importing our libraries, preparing the data and plotting our neighborhoods on the map), we move on to how we are going to carry out our objectives.

In this project, the procedure is as follows:

1- Use k-means clustering to 'group' or segment the neighborhoods into clusters.

2- After carrying out k-means clustering, locate the cluster that the clients' reference neighborhood belongs to.

3- Use cosine similarity within that cluster to locate the neighborhood most similar to the reference.

4- Once our work is complete, generate a 'survey' of the selected neighborhood to show the clients which venues, amenities and features are present there.
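Steps 1 to 3 can be sketched end to end on a handful of made-up venue-frequency profiles. This is a simplified illustration under assumed data (five invented neighborhoods, four categories, two clusters), not the notebook's actual run, which uses 10 clusters on the full Foursquare data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Made-up venue-frequency profiles; rows are neighborhoods.
profiles = pd.DataFrame(
    [
        [0.6, 0.3, 0.1, 0.0],
        [0.5, 0.4, 0.1, 0.0],
        [0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.3, 0.6],
        [0.1, 0.0, 0.4, 0.5],
    ],
    index=["Ref", "N1", "N2", "N3", "N4"],
    columns=["Cafe", "Bar", "Park", "Gym"],
)
reference = "Ref"

# Steps 1-2: cluster the profiles, then keep the reference's cluster.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(profiles.values)
ref_label = labels[profiles.index.get_loc(reference)]
same_cluster = profiles[labels == ref_label]

# Step 3: cosine similarity of the reference to every other cluster member.
others = same_cluster.drop(index=reference)
sims = cosine_similarity(same_cluster.loc[[reference]], others)[0]
ranking = pd.Series(sims, index=others.index).sort_values(ascending=False)
best_match = ranking.index[0]
print(ranking)
print(best_match)
```

Clustering first narrows the field cheaply; the similarity ranking then only has to order the neighborhoods that already look broadly like the reference.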

In [478]:
from sklearn.cluster import KMeans

kclusters = 10

# k-means needs numeric features only, so drop the name column
neighborhood_grouped_clustering = neighborhood_grouped.drop(columns=['Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neighborhood_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20] 



array([2, 2, 4, 4, 4, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 6, 2, 1, 2, 1],
      dtype=int32)

In [535]:

# attach the cluster labels and merge them with the neighborhood coordinates
neighborhood_grouped.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhood_merged = neighborhoods.join(neighborhood_grouped.set_index('Neighborhood'), on='Neighborhood')

# look up the cluster label of the reference neighborhood
desired_cluster = neighborhood_merged.loc[neighborhood_merged['Neighborhood'] == str(thehood), 'Cluster Labels']
x = desired_cluster.values

# keep only the neighborhoods that share the reference's cluster
result_df = neighborhood_merged.loc[neighborhood_merged['Cluster Labels'] == x[0]]

result_df.shape


(107, 387)

Our cluster contains 107 neighborhoods.

What this means is that if our real estate agent wishes to stop here and use this information to narrow their search, that still leaves them with a whopping 107 neighborhoods to search within.

In [481]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(result_df['Latitude'], result_df['Longitude'], result_df['Neighborhood']):
    # the popup shows the neighborhood name when a marker is clicked
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Next, we will use the cosine similarity metric to find the neighborhood 'most similar' to the reference within its own cluster. First, however, we need to work with our `result_df` dataframe a bit to get it ready.
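As a reminder of what the metric measures: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares the mix of venue categories rather than the overall magnitude. A minimal sketch with made-up vectors follows; the helper name `cosine` is illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = [0.5, 0.5, 0.0]
b = [0.25, 0.25, 0.0]  # same mix of categories as a, different scale
c = [0.0, 0.0, 1.0]    # no categories in common with a

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # 0: orthogonal profiles

# The hand-written helper agrees with scikit-learn's pairwise implementation.
print(np.isclose(cosine(a, b), cosine_similarity([a], [b])[0, 0]))
```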

In [482]:
result_df_temp = result_df

# pick our desired reference area as the one to compare against
reference_hood = neighborhood_merged.loc[neighborhood_merged['Neighborhood'] == str(thehood)]
# average the rows in case the name appears more than once in the dataset
reference_hood = reference_hood.groupby('Neighborhood').mean()

# drop the non-venue columns so that only the venue frequencies remain
reference_hood_cleaned = reference_hood.drop(columns=['Cluster Labels', 'Latitude', 'Longitude'])
result_df_temp_cleaned = result_df_temp.drop(columns=['Cluster Labels', 'Latitude', 'Longitude'])

reference_hood_cleaned

| Neighborhood | Yoga Studio | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | Airport Terminal | American Restaurant | Antique Shop | Arcade | Arepa Restaurant | ... | Vietnamese Restaurant | Warehouse Store | Waste Facility | Waterfront | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chelsea | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.055556 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [510]:
from sklearn.metrics.pairwise import cosine_similarity

r3 = result_df_temp_cleaned

# pairwise cosine similarity between all neighborhoods in the cluster
c = cosine_similarity(r3.drop(columns=['Neighborhood']))

# label rows and columns with the original dataframe index
df_ref = pd.DataFrame(c, columns=r3.index.values, index=r3.index).reset_index()

# 116 is the hard-coded index label of the reference neighborhood's row;
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
desired_row = df_ref.loc[df_ref['index'] == 116].drop(columns=['index']).copy()

# 46 is the (also hard-coded) row label of that row in df_ref.
# Zero out the largest entry, normally the reference's similarity to itself ...
o = desired_row.idxmax(axis=1, skipna=True)
desired_row.loc[46, o] = 0
# ... then zero out the next-largest entry as well
f = desired_row.idxmax(axis=1, skipna=True)
desired_row.loc[46, f] = 0
# index label of the largest remaining similarity
# (note: after two maskings this is the third-largest entry in the row)
desired_row.idxmax(axis=1, skipna=True)


46    125
dtype: int64

It should be noted that instead of finding the best match directly, our real estate agent could choose to just look at the top 10 matches. A simple modification of this code can be made to sort the entries by their similarity index and fetch the neighborhoods corresponding to the 10 highest similarity indices if desired.
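One way such a modification could look is sketched below. It is an illustration on made-up data rather than a drop-in replacement for the cell above: the function name `top_matches` and the toy frequency table are assumptions, and the table is assumed to be indexed by neighborhood name.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_matches(features, reference, n=10):
    """Return the n rows most similar to `reference`, best first."""
    # similarity of the reference row to every row, including itself
    sims = cosine_similarity(features.loc[[reference]], features)[0]
    # drop the reference so it cannot be reported as its own best match
    ranked = pd.Series(sims, index=features.index).drop(index=reference)
    return ranked.sort_values(ascending=False).head(n)


# Made-up frequency table indexed by neighborhood name.
features = pd.DataFrame(
    {
        "Cafe": [0.6, 0.5, 0.1, 0.0],
        "Park": [0.4, 0.3, 0.2, 1.0],
        "Gym": [0.0, 0.2, 0.7, 0.0],
    },
    index=["Ref", "N1", "N2", "N3"],
)

matches = top_matches(features, "Ref", n=2)
print(matches)
```

Sorting the whole row once and taking the head avoids the repeated "find the maximum, zero it out" pattern and does not depend on hard-coded row labels.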

### Result of the Search:
Now that we have found our matching neighborhood, let's fetch it from the dataframe!

In [509]:
result_df.loc[125]

Neighborhood                     Morningside Heights
Latitude                                      40.808
Longitude                                   -73.9639
Cluster Labels                                     1
Yoga Studio                                        0
Accessories Store                                  0
Adult Boutique                                     0
Afghan Restaurant                                  0
African Restaurant                                 0
Airport Terminal                                   0
American Restaurant                              0.1
Antique Shop                                       0
Arcade                                             0
Arepa Restaurant                                   0
Argentinian Restaurant                             0
Art Gallery                                        0
Art Museum                                         0
Arts & Crafts Store                                0
Asian Restaurant                              

Once all is said and done, we can then show the clients a neat little "summary" of the neighborhood the system has identified as being the top candidate. We'll do this by pulling up the Foursquare API and having it give them a little virtual tour of the neighborhood 

In [521]:
address = 'Morningside Heights, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))


# build the explore request for venues around the matched neighborhood
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)

# send the GET request and count the recommended venues returned
results = requests.get(url).json()
'There are {} venues around {}.'.format(len(results['response']['groups'][0]['items']),address)



The geographical coordinates of Morningside Heights, NY are 40.81, -73.9625.



In [522]:
items = results['response']['groups'][0]['items']
items[0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '4bb64fc7f562ef3bf2f52f97',
  'name': 'Alma Mater Statue',
  'location': {'address': 'W. 116th St.',
   'crossStreet': 'btwn. Broadway and Amsterdam',
   'lat': 40.80772569502726,
   'lng': -73.96225166494271,
   'labeledLatLngs': [{'label': 'display',
     'lat': 40.80772569502726,
     'lng': -73.96225166494271}],
   'distance': 254,
   'postalCode': '10027',
   'cc': 'US',
   'city': 'New York',
   'state': 'NY',
   'country': 'United States',
   'formattedAddress': ['W. 116th St. (btwn. Broadway and Amsterdam)',
    'New York, NY 10027',
    'United States']},
  'categories': [{'id': '52e81612bcbc57f1066b79ed',
    'name': 'Outdoor Sculpture',
    'pluralName': 'Outdoor Sculptures',
    'shortName': 'Outdoor Sculpture',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/sculpture_',
     'suffix':

In [526]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # json_normalize prefixes nested fields, so fall back to the flattened name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [527]:
dataframe = json_normalize(items) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories'] + [col for col in dataframe.columns if col.startswith('venue.location.')] + ['venue.id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# filter the category for each row
dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean columns
dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

dataframe_filtered.head(10)

| | name | categories | address | cc | city | country | crossStreet | distance | formattedAddress | labeledLatLngs | lat | lng | postalCode | state | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alma Mater Statue | Outdoor Sculpture | W. 116th St. | US | New York | United States | btwn. Broadway and Amsterdam | 254 | [W. 116th St. (btwn. Broadway and Amsterdam), ... | [{'label': 'display', 'lat': 40.80772569502726... | 40.807726 | -73.962252 | 10027.0 | NY | 4bb64fc7f562ef3bf2f52f97 |
| 1 | Jan's Express | Food Truck | W. 120th Street | US | New York | United States | Broadway | 85 | [W. 120th Street (Broadway), New York, NY 1002... | [{'label': 'display', 'lat': 40.81039029230897... | 40.81039 | -73.961621 | 10027.0 | NY | 508037efe4b08a31a6b373e7 |
| 2 | Nous Espresso Bar - Graduate Student Center | Café | 1150 Amsterdam Ave, 301 Philosophy Hall | US | New York | United States | | 306 | [1150 Amsterdam Ave, 301 Philosophy Hall, New... | [{'label': 'display', 'lat': 40.80753267924142... | 40.807533 | -73.960879 | | NY | 534560c6498e7e382358a4ec |
| 3 | Shake Shack | Burger Joint | 2957 Broadway | US | New York | United States | W 116th Street | 263 | [2957 Broadway (W 116th Street), New York, NY ... | [{'label': 'display', 'lat': 40.8079332191406,... | 40.807933 | -73.964013 | 10027.0 | NY | 5978a3f5f427de7d9ea68121 |
| 4 | Riverside Park 119th Street Tennis Courts | Tennis Court | 119th Street | US | New York | United States | Riverside Park | 312 | [119th Street (Riverside Park), New York, NY 1... | [{'label': 'display', 'lat': 40.81135834077416... | 40.811358 | -73.965748 | 10025.0 | NY | 4e652fbf52b1260c144d1676 |
| 5 | Sakura Park | Park | 500 Riverside Dr | US | New York | United States | at W 122nd St | 344 | [500 Riverside Dr (at W 122nd St), New York, N... | [{'label': 'display', 'lat': 40.813078, 'lng':... | 40.813078 | -73.962124 | 10027.0 | NY | 4be84319947820a1a2adb4db |
| 6 | Hartley Pharmacy | Pharmacy | 1219 Amsterdam Ave | US | New York | United States | at W 120 St | 287 | [1219 Amsterdam Ave (at W 120 St), New York, N... | [{'label': 'display', 'lat': 40.80927154381511... | 40.809272 | -73.959231 | 10027.0 | NY | 4abd4c1ff964a5208f8920e3 |
| 7 | Math Lawn | Park | 2990 Broadway | US | New York | United States | | 153 | [2990 Broadway, New York, NY 10027, United Sta... | [{'label': 'display', 'lat': 40.80882806650421... | 40.808828 | -73.961537 | 10027.0 | NY | 4e6913cbb0fb8e94c812656c |
| 8 | Joe Coffee Company | Coffee Shop | 550 W 120th St | US | New York | United States | at Broadway | 40 | [550 W 120th St (at Broadway), New York, NY 10... | [{'label': 'display', 'lat': 40.81004243550347... | 40.810042 | -73.96202 | 10027.0 | NY | 4d4fe94b529dcbffe118d5c4 |
| 9 | Columbia Greenmarket | Farmers Market | 2926 Broadway | US | New York | United States | btwn 114th & 115th St. | 348 | [2926 Broadway (btwn 114th & 115th St.), New Y... | [{'label': 'display', 'lat': 40.80719498437207... | 40.807195 | -73.964335 | 10027.0 | NY | 4a59feb9f964a5209fb91fe3 |


In [528]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=15) # generate map centred on the matched neighborhood


# add the neighborhood's center as a red circle marker
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    popup=address,
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.6
    ).add_to(venues_map)


# add popular spots to the map as blue circle markers
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(venues_map)

# display map
venues_map

And now we have an interactive map for our clients to look at and use to investigate the neighborhood our system spit out for them!

# Discussion:

Throughout this project, we have gone through the full process of taking a client's preferred neighborhood and converting it into a concrete recommendation: the area most similar to the one they named.

Throughout this process, we have used many tools covering the gamut of visualization, mapping, data analysis, data manipulation, data science, machine learning and Python programming. 

As a result, we have built a system that can function as a 'virtual' real estate agent. It takes a very vague piece of information such as "we want to live in an area that's kinda like X" and, using the tools and techniques listed above, converts this generic input into an actionable piece of information that can help a real estate agent better serve their clientele and find them their dream home.

In the course of going through this project, we went from an initial list of 307 neighborhoods, down to a smaller list of 107. Through further analysis, we were able to narrow this down to just one neighborhood within which a real estate agent could focus their search. Not bad I'd say!


# Conclusion:

The real takeaway from this project, in my opinion, is that data science can be used to streamline processes that we often think of as "ill-defined". Selecting a house to live in is one of the most challenging decisions most of us will ever make, and there are so many ambiguities and subjective factors involved in the process that no amount of experience will trivialize this problem.

I believe that a system with these capabilities can help in simplifying this process, and has the potential to be extended for use in the real estate industry; in particular larger real estate firms that manage hundreds of agents, each having a portfolio of hundreds of clients. 

From the perspective of an 'economy of scale', the potential savings for that industry can be quite significant. Consider how much time a real estate agent spends on the initial footwork of taking a statement as non-specific as "I like this neighborhood" and producing actionable intelligence from it. A system like this helps them narrow their search and focus on the properties that will appeal to their clients, instead of wasting everyone's time going through countless properties that are not of interest.