# Capstone Project - The Battle of Neighborhoods

### Section 1: Introduction/Business Problem

The intention of this capstone project is to explore the neighborhoods of the cities Toronto and New York. Therefore Foursquare location data will be leveraged to identify characteristics of the neighborhoods of the cities. In a first step the data will be mined and wrangeld before the characteristics will be visually explored. Afterwards the usage of clustering methods will allow to find similarities in the data and answer the question of the stakeholder i.e. solve the business problem.

The concrete question that will be answered is:
##### What are the characteristics of someones neighborhood in NY (e.g. in Riverdale, NYC)? If he/she would like to move to Toronto, which neighborhoods are comparable and due to which characteristics?

### Section 2: Data

##### Import libraries, initializing

In [5]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
                                                                                                                       /failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - cffi -> python[version='2.7.*|3.5.*|3.6.*|3.6.9|3.6.9|3.6.9|3.6.9|>=3.6,<3.7.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|3.4.*',build='0_73_pypy|3_73_pypy|2_73

ibm-wsrt-py37main-main -> parso==0.7.0[build=*]

Package cudnn conflicts for:
pytorch -> cudnn[version='5.1.*|6.0.*|>=7.0.0,<=8.0a0|>=7.0.5,<=8.0a0|>=7.1.0,<=8.0a0|>=7.1.3,<8.0a0|>=7.3.0,<=8.0a0|>=7.6,<8.0a0|>=7.6.5,<8.0a0|>=7.6.4,<8.0a0|>=7.3.1,<8.0a0|>=7.1.2,<=8.0a0']
tensorflow-base -> cudnn[version='>=7.0.0,<=8.0a0|>=7.1.0,<=8.0a0|>=7.2.0,<=8.0a0|>=7.3.1,<8.0a0|>=7.6,<8.0a0|>=7.6.5,<8.0a0|>=7.6.0,<8.0a0|>=7.6.4,<8.0a0|>=7.1.2,<=8.0a0|>=7.0.5,<=8.0a0']

Package matplotlib conflicts for:
scikit-image -> matplotlib[version='>=1.1|>=1.3.1|>=2.0.0']
nltk -> matplotlib
ibm-wsrt-py37main-main -> matplotlib==3.2.2[build=*]
bokeh -> matplotlib
arcgis=1.6.0 -> matplotlib
seaborn -> matplotlib[version='>=1.4.3|>=2.1.2']

Package pyasn1-modules conflicts for:
google-auth -> pyasn1-modules[version='>=0.0.5,<1dev|>=0.2.1']
ibm-wsrt-py37main-main -> pyasn1-modules==0.2.8[build=*]

Package webencodings conflicts for:
tensorboard -> webencodings
bleach -> webencodings
ibm-wsrt-

Libraries imported.


##### Get boroughs, neighboorhoods, longitudes and latitudes of New York City 

In [22]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']


#definition of dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
#instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

#fill the dataframe

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

#examine the dataframe
df_nyc = neighborhoods
df_nyc.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


##### Get boroughs, neighboorhoods, longitudes and latitudes of Toronto

In [11]:
#read table from Wiki via Pandas
postalcodes_rawlist = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') 

#the first dataframe is the desired one, assign it to another dataframe
df_poco_raw = postalcodes_rawlist[0] 

#remove i.e. ignore examples with not assigned boroughs
df_poco_new = df_poco_raw[df_poco_raw.Borough != 'Not assigned' ] 

#reset the index
df_poco_new = df_poco_new.reset_index(drop = True)

#replace column names
df_poco_new = df_poco_new.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood" } ) 

#read CSV-file for geospatial data
coordinates = pd.read_csv('http://cocl.us/Geospatial_data')

#rename column so it can match with main table
coordinates = coordinates.rename(columns = {'Postal Code': 'PostalCode'}) 

#merge the two tables by joining them on 'PostalCode' as ID
df_poco_total = pd.merge(df_poco_new, coordinates, how = 'left', on = 'PostalCode' ) 

#examine the dataframe
df_poco_total.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


##### Merge the two datasets

In [29]:
#select necessary columns in Toronto Dataset
df_toronto = df_poco_total[['Borough','Neighborhood','Latitude','Longitude']]
df_toronto.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [31]:
#add City column
df_toronto['City']= 'Toronto'
df_nyc['City'] = 'NYC'

#check whether size (number of columns) match
print(df_toronto.shape)
print(df_nyc.shape)

(103, 5)
(306, 5)


In [42]:
#merge the two datasets
df_merged = pd.concat([df_toronto,df_nyc])
df_merged = df_merged.reset_index(drop = True)

#check the maths.. sum of examples should be 103 + 309 = 409 
df_merged.shape

(409, 5)

##### Get the data for the characteristics of the neighborhoods by using Foursquare 

In [44]:
#foursquare credentials
CLIENT_ID = 'GYQDYNALIGVN1ATEW1VO4F23QRBOWHPIKPOR3KSGXZE1I03B' 
CLIENT_SECRET = 'PJ0XHMR03XTMUU1BNJEHZI3EEDVRA4AX01O4LLTDA4STSQR4' 
VERSION = '20180604'
LIMIT = 30
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

CLIENT_ID: GYQDYNALIGVN1ATEW1VO4F23QRBOWHPIKPOR3KSGXZE1I03B
CLIENT_SECRET:PJ0XHMR03XTMUU1BNJEHZI3EEDVRA4AX01O4LLTDA4STSQR4


In [45]:
#function to get nearby venues for the neighborhoods
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [47]:
#search venues for each neighborhood
df_venues = getNearbyVenues(df_merged['Neighborhood'], df_merged['Latitude'], df_merged['Longitude'])

df_venues.shape #7739 Venues for the neighborhoods were found!

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

(7739, 7)

In [48]:
df_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [53]:
#gives a venue count for each neighborhood, there is none with 0 count
df_venues.groupby('Neighborhood').count()  

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",7,7,7,7,7,7
Allerton,30,30,30,30,30,30
Annadale,12,12,12,12,12,12
Arden Heights,6,6,6,6,6,6
Arlington,7,7,7,7,7,7
Arrochar,24,24,24,24,24,24
Arverne,21,21,21,21,21,21
Astoria,30,30,30,30,30,30
Astoria Heights,13,13,13,13,13,13


In [57]:
#gives the total amount of unique venue categories
len(df_venues['Venue Category'].unique())

424

###### df_venues is the dataset which will be used further. In total it contains 7739 venues while there is at least one venue available for each neighborhood. The data contains 424 different venue catgeories which is the potential number of features which can be used after wrangling the data further for the clustering algorithm.

### Section 3: to be continued in week 2 ;)