# Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods (Week 1)

##### Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

##### In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.

##### In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

##### These are just a couple of many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why a client or a group of people would be interested in your project.

## WEEK 1
##### For this week, you will be required to submit the following:
###### 1. A description of the problem and a discussion of the background. (15 marks)

Background:
The aim of this project is to find the most suitable neighborhood in New York for a family with young children to live in. The neighborhood should have some amenities that the parents would benefit from as well as some amenities for the children to avail of.

Problem Statement:
The parents of the family would like to live within walking distance of at least one pub, bar or restaurant.
The children would like to live in a neighborhood that is close to one or more of the following amenities:
- A Sports Club
- A Park
- A Playground
- A Toy / Game Store
- A Movie Theater (Cinema)


###### 2. A description of the data and how it will be used to solve the problem. (15 marks)

The primary dataset that will be used in this analysis is available at the following link:
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json 

This JSON file contains a list of the neighborhoods in New York along with their location data (latitude, longitude). The file will be read into a pandas dataframe for the analysis. The project will then use this location data in conjunction with the Foursquare API to explore which amenities/venues are near each neighborhood.

We will use this data from Foursquare to filter the neighborhoods of New York so that only neighborhoods suitable for both the parents and the children are considered in our analysis. To do this, we will only consider neighborhoods within 1,500 m of a pub, bar or restaurant. This should satisfy the parents' wants. We will also only consider neighborhoods that are within a similar distance of a sports club, park, playground, toy store or cinema to satisfy the children's needs.
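The 1,500 m criterion can also be checked directly from coordinates. The following is a minimal sketch, assuming a spherical Earth and the haversine formula; the helper names `haversine_m` and `within_radius` and the venue coordinates are hypothetical and are not part of the notebook, which relies on the `radius` parameter of the Foursquare request instead.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_radius(lat1, lon1, lat2, lon2, radius_m=1500):
    """True when the two points are at most radius_m meters apart."""
    return haversine_m(lat1, lon1, lat2, lon2) <= radius_m


# Wakefield coordinates from the dataset; the venue location is made up for illustration
wakefield = (40.894705, -73.847201)
hypothetical_venue = (40.8980, -73.8450)
venue_is_close = within_radius(*wakefield, *hypothetical_venue)
```

A check like this could be used to verify that venues returned for a neighborhood really lie inside the intended radius.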

The initial analysis will not consider house prices. However, it may be worth including another dataset containing regional house price data, as prices would have a big impact on the decision of which neighborhood to live in.

We will perform k-means clustering on the filtered data to cluster the neighborhoods.

## WEEK 2
## Step 1 - Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

import json

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
import folium # map rendering library

import requests

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries Imported")

Libraries Imported


## Step 2 - Download/Extract New York Data

*2.1* - Download data from json file

In [2]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

*2.2* - Filter the data to include the 'features' key only as this contains the info that we want to use

In [4]:
neighborhoods_data = newyork_data['features']

*2.3* - Transform the data into a pandas dataframe. Create an empty dataframe with the 4 fields that we want

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

*2.4* - Loop through the data extracted from the 'features' key of the downloaded JSON file and fill the dataframe, one record per neighborhood.

In [6]:
# collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [7]:
neighborhoods.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---------|--------------|----------|-----------|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
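As an alternative to the explicit loop in step 2.4, the nested records can be flattened with `pd.json_normalize`. This is a minimal sketch on two hand-written records shaped like the entries of `newyork_data['features']`; the names `sample_features`, `flat` and `sample_neighborhoods` are illustrative only.

```python
import pandas as pd

# two records shaped like entries of newyork_data['features'] (only the keys used here)
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

# nested dictionaries become dotted column names; the coordinate list stays a list
flat = pd.json_normalize(sample_features)

sample_neighborhoods = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    # GeoJSON stores coordinates as [longitude, latitude]
    "Latitude": flat["geometry.coordinates"].apply(lambda c: c[1]),
    "Longitude": flat["geometry.coordinates"].apply(lambda c: c[0]),
})
```

Both approaches should yield the same four columns; the flattened version avoids growing a dataframe inside a loop.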


## Step 3 - Create Map Using Folium

*3.1* - Get the coordinates of New York City to centre the map around

In [8]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


*3.2* - Add each neighborhood to the map as a marker

In [9]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

## Step 4 - Use the Foursquare API to explore the neighborhoods in New York and segment them

*4.1* - Define Foursquare Credentials

In [10]:
CLIENT_ID = 'YL1YFSDDV0Z2331ZQE3NFLVKXGSSX0XK1W0MLOVDUNV0ZBVB' # your Foursquare ID
CLIENT_SECRET = 'XPJEHN4KJ0FQ3O2H0TFYORMMN2ICL0IEGGCGSHVB3G5ML5VL' # your Foursquare Secret
ACCESS_TOKEN = 'PGYXZ04MNOKM3KQC5X3QOFGTFCOG3XRJQZKSZRFOYWBLL0Z0' # your FourSquare Access Token
VERSION = '20180604'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YL1YFSDDV0Z2331ZQE3NFLVKXGSSX0XK1W0MLOVDUNV0ZBVB
CLIENT_SECRET:XPJEHN4KJ0FQ3O2H0TFYORMMN2ICL0IEGGCGSHVB3G5ML5VL
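Credentials written into a notebook are visible to anyone who can read it. A minimal sketch of one alternative, assuming the keys are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical choices, not something the Foursquare API prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")  # assumed names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {"client_id": env[names[0]], "client_secret": env[names[1]]}


# demonstration with a plain dictionary standing in for os.environ
demo_credentials = load_foursquare_credentials({
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
})
```

In the notebook, `CLIENT_ID` and `CLIENT_SECRET` could then be taken from the returned dictionary instead of being typed into the cell.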


*4.2* - Let's explore the first neighborhood in our dataframe.

In [11]:
neighborhoods.loc[0, 'Neighborhood']

'Wakefield'

In [12]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[0, 'Neighborhood'] # neighborhood name for Wakefield

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Wakefield are 40.89470517661, -73.84720052054902.


*4.3* - Let's get the top 30 venues that are in Wakefield within a radius of 1500 meters.

In [13]:
LIMIT = 30

radius = 1500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=YL1YFSDDV0Z2331ZQE3NFLVKXGSSX0XK1W0MLOVDUNV0ZBVB&client_secret=XPJEHN4KJ0FQ3O2H0TFYORMMN2ICL0IEGGCGSHVB3G5ML5VL&v=20180604&ll=40.89470517661,-73.84720052054902&radius=1500&limit=30'
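The same request URL can be assembled from a dictionary of query parameters rather than a long format string. This is a sketch using the standard library's `urlencode`; the helper name `build_explore_url` and the placeholder credentials are illustrative assumptions.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore request URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma inside ll unescaped, as in the hand-built URL
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


example_url = build_explore_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
    40.89470517661, -73.84720052054902, 1500, 30,
)
```

`requests.get` also accepts a `params` dictionary, so the parameters could be passed to it directly without building the string by hand.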

*4.4* - Send the GET request and examine the results

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 429,
  'errorType': 'quota_exceeded',
  'errorDetail': 'Quota exceeded',
  'requestId': '6059113bb6f602648295e2fa'},
 'response': {}}

*4.5* - Extract the data from the JSON response and put the results into a pandas dataframe

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

KeyError: 'groups'
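The `KeyError` follows from the response shown in step 4.4: the request was rejected with code 429 (`quota_exceeded`), so `response` is empty and has no `groups` key. A minimal sketch of checking the `meta` block before parsing, assuming successful responses carry `meta.code == 200`; the helper name `extract_venue_items` is hypothetical.

```python
def extract_venue_items(payload):
    """Return the venue items of an explore response, or raise if the request failed."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        raise RuntimeError("Foursquare request failed: {} (code {})".format(
            meta.get("errorType", "unknown_error"), meta.get("code")))
    groups = payload.get("response", {}).get("groups", [])
    # an empty list means the request succeeded but returned no venue groups
    return groups[0]["items"] if groups else []


# the failed response from step 4.4
quota_payload = {
    "meta": {"code": 429, "errorType": "quota_exceeded",
             "errorDetail": "Quota exceeded",
             "requestId": "6059113bb6f602648295e2fa"},
    "response": {},
}

try:
    extract_venue_items(quota_payload)
except RuntimeError as err:
    message = str(err)  # names the error type instead of failing with KeyError
```

With a check like this, the failure is reported where it occurs, and the remaining cells can be rerun once the quota has reset.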

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

## Step 5 - Repeat the above analysis of Wakefield for all neighborhoods in New York

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
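            # category name, or None when the venue has no category listed
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])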

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

*5.1* - Use the above function to get nearby venues for all neighborhoods in the dataframe

In [None]:
NY_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'],
                                   radius=radius  # 1500 m, as defined in step 4.3
                                  )

*5.2* - Get a list of all venue categories returned

In [None]:
NY_venues["Venue Category"].unique()

*5.3* - Create a smaller filtered dataset that only includes the following venue categories
- Park
- Supermarket
- Playground
- Toy / Game Store
- Movie Theater
- School
- Sports Club
- Pub
- Bar
- Restaurant

In [None]:
NY_venues_Filtered = NY_venues[NY_venues['Venue Category'].isin(["Park",
                                "Supermarket",
                                "Playground",
                                "Toy / Game Store",
                                "Movie Theater",
                                "School",
                                "Sports Club",
                                "Pub",
                                "Bar",
                                "Restaurant"])]
NY_venues_Filtered.head()

*5.4* - Do one-hot encoding on the filtered dataset for analysis

In [None]:
# one hot encoding
NY_onehot = pd.get_dummies(NY_venues_Filtered[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NY_onehot['Neighborhood'] = NY_venues_Filtered['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [NY_onehot.columns[-1]] + list(NY_onehot.columns[:-1])
NY_onehot = NY_onehot[fixed_columns]

NY_onehot.head()

*5.5* - Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
NY_grouped = NY_onehot.groupby('Neighborhood').mean()
NY_grouped
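To make the meaning of the grouped values concrete, here is a small sketch on made-up data (the neighborhoods "A" and "B" and their venues are hypothetical). The mean of a one-hot column within a neighborhood is the share of that neighborhood's retained venues that fall in the category, so any value above zero means at least one such venue was found.

```python
import pandas as pd

# hypothetical filtered venues: neighborhood A has one park and two bars, B has one park
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Bar", "Bar", "Park"],
})

# one indicator column per category (float so the means are plain fractions)
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)

# average the indicators per neighborhood
toy_grouped = toy_onehot.groupby(toy_venues["Neighborhood"]).mean()
```

In this toy table, two thirds of the venues in A are bars and one third are parks, while B consists of parks only.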

*5.6* - Apply filters based on project requirements

In [None]:
#Filter the data to suit the requirements for the parents. i.e. must be close to a Restaurant, Bar or Pub
NY_Parents = NY_grouped[(NY_grouped['Restaurant'] > 0) | (NY_grouped['Bar'] > 0) | (NY_grouped['Pub'] > 0)]

#Filter the data to suit the requirements for the Children. i.e. must be close to a Park, Playground, Cinema, Sports Club or Toy Store
NY_Children = NY_grouped[(NY_grouped['Park'] > 0) | (NY_grouped['Playground'] > 0) | (NY_grouped['Movie Theater']>0) | (NY_grouped['Sports Club']>0) | (NY_grouped['Toy / Game Store']>0)]

#Inner join the 2 dataframes to find the neighborhoods that are suitable for both Parents & Children
NY_Parents_Children = pd.merge(NY_Parents,NY_Children[[]],how='inner',on=['Neighborhood'])
print("There are: {} neighborhoods that meet the minimum requirements".format(NY_Parents_Children.shape[0]))
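The filters above index columns such as `'Pub'` directly, which raises a `KeyError` if a category never appears in the returned venues. A sketch of the same rule that tolerates missing categories, shown on a made-up grouped table; the constants and the helper `suitable_neighborhoods` are hypothetical names.

```python
import pandas as pd

ADULT_CATEGORIES = ["Pub", "Bar", "Restaurant"]
CHILD_CATEGORIES = ["Park", "Playground", "Movie Theater", "Sports Club", "Toy / Game Store"]

# hypothetical grouped frequencies; note that several categories are absent entirely
toy_grouped = pd.DataFrame(
    {"Bar": [0.5, 0.0, 0.0], "Park": [0.5, 1.0, 0.0], "Restaurant": [0.0, 0.0, 1.0]},
    index=pd.Index(["A", "B", "C"], name="Neighborhood"),
)


def suitable_neighborhoods(grouped):
    """Keep rows with at least one adult amenity and at least one child amenity."""
    # categories missing from the data are added as all-zero columns
    full = grouped.reindex(columns=ADULT_CATEGORIES + CHILD_CATEGORIES, fill_value=0.0)
    adults_ok = (full[ADULT_CATEGORIES] > 0).any(axis=1)
    children_ok = (full[CHILD_CATEGORIES] > 0).any(axis=1)
    return grouped[adults_ok & children_ok]


toy_suitable = suitable_neighborhoods(toy_grouped)
```

Only neighborhood A passes here: B has a park but no pub, bar or restaurant, and C has a restaurant but nothing for the children.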

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Apply k-means clustering
k_means_1 = KMeans(init="k-means++", n_clusters = 4, n_init = 12)
k_means_1.fit(NY_Parents_Children)
labels = k_means_1.labels_

NY_Parents_Children["Labels"] = labels
NY_Parents_Children.head()
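For readers who want to see what the clustering step does, the following is a plain NumPy illustration of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not scikit-learn's implementation: it uses simple random initialization instead of `k-means++` and a single run instead of `n_init = 12`. The function name `kmeans_sketch` and the toy points are assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Basic k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest center for every point
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: mean of each cluster (an empty cluster keeps its old center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# two well-separated groups of made-up points
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centers = kmeans_sketch(toy_points, k=2)
```

The cluster numbers themselves are arbitrary; only the grouping matters, which is why the clusters in the notebook still have to be inspected to interpret them.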

In [None]:
#Examine clusters

C1 = NY_Parents_Children[NY_Parents_Children["Labels"] ==0]
C2 = NY_Parents_Children[NY_Parents_Children["Labels"] ==1]
C3 = NY_Parents_Children[NY_Parents_Children["Labels"] ==2]
C4 = NY_Parents_Children[NY_Parents_Children["Labels"] ==3]

print("There are {} neighborhoods in cluster 1".format(C1.shape[0]))
print("There are {} neighborhoods in cluster 2".format(C2.shape[0]))
print("There are {} neighborhoods in cluster 3".format(C3.shape[0]))
print("There are {} neighborhoods in cluster 4".format(C4.shape[0]))
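The per-cluster counts can also be obtained without creating one dataframe per cluster. A short sketch on a made-up label column (the values in `toy_labels` are hypothetical):

```python
import pandas as pd

# hypothetical cluster labels, as k-means would assign them to seven neighborhoods
toy_labels = pd.Series([0, 2, 1, 0, 3, 0, 1], name="Labels")

# number of neighborhoods per label, ordered by label
cluster_sizes = toy_labels.value_counts().sort_index()

for label, size in cluster_sizes.items():
    print("There are {} neighborhoods in cluster {}".format(size, label + 1))
```

This scales to any number of clusters, so the code does not need to change if `n_clusters` does.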

In [None]:
#Add location data back to the final dataset for mapping/visualisation purposes
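# reset_index() turns the Neighborhood index into a regular column for merging and mapping
NY_Parents_Children_loc = pd.merge(NY_Parents_Children.reset_index(),neighborhoods,how='left',on=['Neighborhood'])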
NY_Parents_Children_loc.head()

In [None]:
#Visualize clusters on Map
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(4)
ys = [i + x + (i*x)**2 for i in range(4)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NY_Parents_Children_loc['Latitude'], NY_Parents_Children_loc['Longitude'], NY_Parents_Children_loc['Neighborhood'], NY_Parents_Children_loc['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

In [None]:
#Examine Cluster 1 for similarities
C1

In [None]:
#Examine Cluster 2 for similarities
C2

In [None]:
#Examine Cluster 3 for similarities
C3

In [None]:
#Examine Cluster 4 for similarities
C4

In [None]:
# raw string so the backslashes in the Windows path are not read as escape sequences
C1.to_excel(r"U:\WORK\TRAINING\Coursera - Python Capstone\C1.xlsx")