<h1>The Battle of Neighborhoods</h1>

This project is the final assignment of the IBM Data Science Professional Certificate Capstone.

<b>Author</b>: Julian Oellrich

<h2>Table of Contents</h2>

<ol>
  <li>Introduction and Background</li>
  <li>Setup and Libraries</li>
  <li>Data Import and Preprocessing</li>
  <li>Crawling Venues with Foursquare API</li>
  <li>Exploratory Data Analysis</li>
</ol>

<h2 id="introduction">1. Introduction and Background</h2>

<p>Many people work in large cities around the world, such as New York and Toronto. Suppose you live on the west side of the City of Toronto in Canada. You love your neighborhood mainly because of its great amenities and other venues, such as gourmet fast food joints, pharmacies, parks, and grad schools. Now suppose you receive a job offer with great career prospects from a company on the other side of the city. Given the distance from your current home, you would have to move if you accepted the offer. Wouldn't it be great if you could find neighborhoods on the other side of the city that are exactly like your current neighborhood or, failing that, similar neighborhoods that are at least closer to your new job?
</p>

<p>
The goal of this capstone project is to compare different neighborhoods in terms of a service, and to search for potential explanations of why one neighborhood is popular, what causes complaints in another, or anything else related to neighborhoods. Hence the name of the capstone project: The Battle of Neighborhoods.</p>

<p>
To do this for a city such as the City of Toronto, we will segment it into neighborhoods using the geographical coordinates of the center of each neighborhood. Then, using a combination of location data and machine learning, we will group the neighborhoods into clusters.
</p>
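To make the clustering idea concrete, here is a minimal sketch on made-up numbers. It assumes each neighborhood is described by the share of its venues in a few categories; the neighborhood names, the categories, and the choice of two clusters are hypothetical and are not taken from this project's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical neighborhoods and venue-category shares (restaurants, parks, shops)
neighbourhoods = ['A', 'B', 'C', 'D']
features = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Put all categories on a comparable scale before clustering
scaled = StandardScaler().fit_transform(features)

# Group the neighborhoods into two clusters (the number of clusters is an assumption)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map each neighborhood to its cluster label
cluster_of = dict(zip(neighbourhoods, kmeans.labels_))
print(cluster_of)
```

Neighborhoods with a similar venue mix (here A and B, and C and D) should end up with the same label; the label numbers themselves carry no meaning.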

<h2 id="libraries">2. Setup and Libraries</h2>

Import Libraries

In [106]:
# Computation Libraries
import pandas as pd
import numpy as np

# API Communication
import json
import requests
import urllib.request # the request submodule must be imported explicitly for urlretrieve

# Plotting 
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# Maps and geocoding
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium

# ML Algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# JSON & API
import lxml
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)

# Data scraping modules
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

# Print Statement
print('\n * --- Libraries imported! --- *\n')


 * --- Libraries imported! --- *



<h2 id="data">3. Data Import and Preprocessing</h2>

<p>
To segment the neighbourhoods, explore the venues in them, and compare them, we need a dataset that contains all the boroughs and neighbourhoods of both New York and Toronto, as well as the latitude and longitude coordinates of each neighbourhood.
</p>

<p>
As a first step, we will therefore import the neighbourhood and location data of New York and Toronto, clean the data where necessary, and prepare a dataframe with the following columns:
<ul>
  <li>Borough </li>
  <li>Neighbourhood </li>
  <li>Latitude  </li>
  <li>Longitude  </li>
</ul>
</p>

<p>
This dataframe will be created for both cities, New York and Toronto. In a final Data Summary, both dataframes will be printed and a map of each city with the neighbourhoods superimposed on top will be created.
</p>

<h3>3.1 New York Data</h3>

Let's first work on the New York neighbourhood data. The New York data is provided **for free** on the web and can be downloaded via this [link](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json). The dataset is in JSON format. More information on the New York dataset can be found on this [site](https://geo.nyu.edu/catalog/nyu_2451_34572?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).

<h4>3.1.1 Download New York Location Data</h4>

The first step is to download the New York JSON dataset and save it as _newyork_data.json_ in the current folder.

In [107]:
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json'
json_filename = 'newyork_data.json'

urllib.request.urlretrieve(url, json_filename)

# print statement
print('New York data has been successfully downloaded to file',json_filename)

New York data has been successfully downloaded to file newyork_data.json


<h4>3.1.2 Preprocessing and Cleaning</h4>

Now the data from the JSON file can be loaded and stored in a dataframe. To get the final New York dataframe _df_newyork_ we need to perform the following steps:

<ol>
  <li>Load JSON data to dictionary</li>
  <li>Extract only the relevant data from the <i>features</i> key</li>
  <li>Collect one record per neighbourhood</li>
  <li>Build the dataframe <i>df_newyork</i> from the records</li>
</ol>

In [108]:
# 1. Load JSON data to dictionary
with open('newyork_data.json') as json_data:
    newyork_dict = json.load(json_data)

# 2. Extract only the relevant data from the 'features' key
newyork_data = newyork_dict['features']

# 3. Collect one record per neighbourhood
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list)
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude']
rows = []
for data in newyork_data:
    borough = data['properties']['borough']
    neighbourhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighbourhood_latlon = data['geometry']['coordinates']
    neighbourhood_lat = neighbourhood_latlon[1]
    neighbourhood_lon = neighbourhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighbourhood': neighbourhood_name,
                 'Latitude': neighbourhood_lat,
                 'Longitude': neighbourhood_lon})

# 4. Build the dataframe df_newyork from the records
df_newyork = pd.DataFrame(rows, columns=column_names)

In [109]:
df_newyork.head(8)

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
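As an alternative to looping over the features by hand, the nested records could also be flattened with `json_normalize`. The following is only a sketch on a two-entry sample; `sample_features` and `df_sample` are illustrative names, and the sample is assumed to have the same shape as the entries under the `features` key.

```python
import pandas as pd

# Two sample entries shaped like the items under the 'features' key (assumption)
sample_features = [
    {'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
     'geometry': {'coordinates': [-73.847201, 40.894705]}},
    {'properties': {'borough': 'Manhattan', 'name': 'Marble Hill'},
     'geometry': {'coordinates': [-73.91066, 40.876551]}},
]

# Nested keys become dotted column names, e.g. 'properties.borough'
flat = pd.json_normalize(sample_features)

# Coordinates are stored as [longitude, latitude], so index 1 is the latitude
df_sample = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighbourhood': flat['properties.name'],
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(df_sample)
```

Both approaches should yield the same four columns; the explicit loop is easier to follow, while `json_normalize` avoids hand-written indexing into each record.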


<h3>3.2 Toronto Data</h3>

The Toronto neighbourhood and location data cannot be downloaded directly, so several steps are needed to get the final dataframe.

Postal code, borough and neighbourhood information can be scraped from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). Latitude and longitude information is provided separately as a CSV file (filename: _Toronto_Geospatial_Coordinates.csv_).

To get the final Toronto dataframe _df_toronto_ we need to perform the following steps:

<ol>
  <li>Get Toronto borough and neighbourhood data from the Wikipedia page and store it in a dataframe</li>
  <li>Clean the data</li>
  <li>Import latitude and longitude information from csv to a dataframe</li>
  <li>Merge neighbourhood dataframe with coordinates dataframe</li>
</ol>

<h4>3.2.1 Import Toronto Neighbourhood Data</h4>

Scrape the Toronto data from the Wikipedia page with the BeautifulSoup library and extract the relevant information from the table on that page.

In [110]:
# scrape wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml') # name the parser explicitly to avoid a parser warning

# extract data from wikipedia table
table_data = soup.find('div', class_='mw-parser-output')
table = table_data.table.tbody

Finally, write the table data into a new dataframe.

In [111]:
columns = ['PostalCode', 'Borough', 'Neighbourhood']
# one empty list per column
data = {key: [] for key in columns}

# walk through the table rows and collect the cell text column by column
for row in table.find_all('tr'):
    for cell, column in zip(row.find_all('td'), columns):
        data[column].append(cell.text.replace('\n', ''))

df_raw = pd.DataFrame.from_dict(data=data)[columns]
print(df_raw.shape)
df_raw.head(10)

(180, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


<h4>3.2.2 Preprocessing and Cleaning</h4>

Before continuing with the merge, the dataframe _df_raw_ has to be cleaned. After examining the printed dataframe above, the following cleaning steps are needed:

<ol>
  <li>Change column name <i>PostalCode</i> to <i>Postal Code</i></li>
  <li>Remove rows with <i>Not assigned</i> boroughs</li>
  <li>Fill in <i>Not assigned</i> neighbourhoods with corresponding borough name</li>
  <li>Check for duplicate neighbourhoods</li>
</ol>

<h5>Change column name <i>PostalCode</i> to <i>Postal Code</i></h5>

In [112]:
df_raw.rename(columns={'PostalCode': 'Postal Code'}, inplace = True)

<h5>Remove rows with <i>Not assigned</i> boroughs</h5>

Only rows that have an assigned borough should be processed. Rows whose borough is 'Not assigned' are therefore removed from the dataframe. The steps are:

1. The first step is to count how many rows have a 'Not assigned' Borough
2. Drop the rows where Borough is 'Not assigned' and write it into new dataframe df_cleaned
3. Check if there are any rows with Not assigned Borough in the new dataframe

1. The first step is to count how many rows have a 'Not assigned' Borough

In [113]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df_raw[df_raw['Borough'] == 'Not assigned'].shape[0])
print(print_statement)

There are 77 rows where Borough is Not assigned


2. Drop the rows where Borough is 'Not assigned' and write it into new dataframe df_cleaned

In [114]:
df_cleaned = df_raw.drop(df_raw[df_raw['Borough'] == 'Not assigned'].index) 
df_cleaned.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


3. Check if there are any rows with Not assigned Borough in the new dataframe

In [115]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df_cleaned[df_cleaned['Borough'] == 'Not assigned'].shape[0])
print(print_statement)

There are 0 rows where Borough is Not assigned


<h5>Fill in <i>Not assigned</i> neighbourhood</h5>

If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough. Let's check how many rows have a 'Not assigned' Neighbourhood:

In [116]:
print_statement = 'There are {} rows where Neighbourhood is Not assigned'.format(
    df_cleaned[df_cleaned['Neighbourhood'] == 'Not assigned'].shape[0])
print(print_statement)

There are 0 rows where Neighbourhood is Not assigned


<b>Conclusion:</b> There are <b>no rows</b> with a 'Not assigned' Neighbourhood but an assigned Borough, so nothing has to be filled in.
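Because no such rows exist in the current data, the fill-in step is never executed. If the source table changes, it could be handled along the following lines. This is a sketch on a made-up example; `df_example`, the postal code 'M0X' and 'Example Borough' are hypothetical.

```python
import pandas as pd

# Made-up rows: the second one has a borough but no assigned neighbourhood
df_example = pd.DataFrame({
    'Postal Code': ['M3A', 'M0X'],
    'Borough': ['North York', 'Example Borough'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Select the rows whose neighbourhood is missing
mask = df_example['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
df_example.loc[mask, 'Neighbourhood'] = df_example.loc[mask, 'Borough']
print(df_example)
```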

<h5>Check for duplicate neighbourhoods</h5>

In [117]:
# create new test dataframe with only neighbourhoods and postal codes
# (.copy() avoids a SettingWithCopyWarning when the new column is added)
df_neigh = df_cleaned[['Postal Code','Neighbourhood']].copy()

# check postal code duplicates and create new bool column that marks duplicates with True
df_neigh['duplicate bool'] = df_neigh['Postal Code'].duplicated(keep = 'first')

# Output value count of duplicate postal code values
df_neigh['duplicate bool'].value_counts()

False    103
Name: duplicate bool, dtype: int64

<b>Conclusion:</b> There are no duplicate Postal Code values.
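The check above looks at postal codes only. Neighbourhood names can still repeat across different postal codes, and the same `duplicated` method can list them. The sketch below uses made-up rows; `df_example` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows: 'Alpha' appears under two different postal codes
df_example = pd.DataFrame({
    'Postal Code': ['M1X', 'M2X', 'M3X'],
    'Neighbourhood': ['Alpha', 'Beta', 'Alpha'],
})

# keep=False marks every occurrence of a repeated name, not just the later ones
dup_mask = df_example['Neighbourhood'].duplicated(keep=False)

# Rows that share their neighbourhood name with at least one other row
repeated = df_example[dup_mask]
print(repeated)
```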

<h4>3.2.3 Import Toronto Coordinates Data</h4>

Import the latitude and longitude data from the CSV file <i>Toronto_Geospatial_Coordinates.csv</i> provided by Coursera.

In [118]:
geospatial_toronto = pd.read_csv('Toronto_Geospatial_Coordinates.csv')
geospatial_toronto.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h4>3.2.4 Merge Data</h4>

Merge the dataframe _geospatial_toronto_ into _df_cleaned_ on the key _Postal Code_ and save the result as the new dataframe **df_toronto**.

In [119]:
df_toronto = pd.merge(df_cleaned, geospatial_toronto, how= 'inner', on ='Postal Code')

The postal code information is no longer needed, so it is removed from the dataframe.

In [120]:
df_toronto.drop('Postal Code', axis = 1, inplace = True) 
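An inner merge silently drops postal codes that have no coordinates. One way to check for that before dropping the key column is a left merge with the `indicator` and `validate` options of `pd.merge`. This is a sketch on made-up frames; `left`, `right`, `checked` and the postal code 'M9Z' are hypothetical.

```python
import pandas as pd

# Made-up neighbourhood table; 'M9Z' has no matching coordinates on purpose
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises an error if a postal code appears more than once on either side;
# indicator adds a '_merge' column that shows where each row came from
checked = pd.merge(left, right, how='left', on='Postal Code',
                   validate='one_to_one', indicator=True)

# Postal codes that would be lost by an inner merge
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used above keeps every neighbourhood.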

<h3>3.3 Data Summary and City Maps</h3>

<h4>3.3.1 Facts about the New York and Toronto DataFrames</h4>

In [121]:
# New York Print Statements
NY_statement0 = '\n* ----- New York DataFrame ----- *'
NY_statement1 = '\nThe New York dataframe has {} rows and {} columns'.format(df_newyork.shape[0], df_newyork.shape[1])
NY_statement2 = 'New York has {} unique Boroughs'.format(len(df_newyork['Borough'].unique()))
NY_statement3 = 'New York has {} unique Neighbourhoods'.format(len(df_newyork['Neighbourhood'].unique()))
NY_statement4 = 'The dataframe looks as follows: \n'

# Toronto Print Statements
TOR_statement0 = '\n* ----- Toronto DataFrame ----- *'
TOR_statement1 = '\nThe Toronto dataframe has {} rows and {} columns'.format(df_toronto.shape[0], df_toronto.shape[1])
TOR_statement2 = 'Toronto has {} unique Boroughs'.format(len(df_toronto['Borough'].unique()))
TOR_statement3 = 'Toronto has {} unique Neighbourhoods'.format(len(df_toronto['Neighbourhood'].unique()))
TOR_statement4 = 'The dataframe looks as follows: \n'

# output Toronto strings
print(TOR_statement0)
print(TOR_statement1)
print(TOR_statement2)
print(TOR_statement3)
print(TOR_statement4)

# output head of Toronto dataframe
display(df_toronto.head(8))

# output New York strings
print(NY_statement0)
print(NY_statement1)
print(NY_statement2)
print(NY_statement3)
print(NY_statement4)

# output head of New York dataframe
display(df_newyork.head(8))


* ----- Toronto DataFrame ----- *

The Toronto dataframe has 103 rows and 4 columns
Toronto has 10 unique Boroughs
Toronto has 99 unique Neighbourhoods
The dataframe looks as follows: 



Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,North York,Don Mills,43.745906,-79.352188



* ----- New York DataFrame ----- *

The New York dataframe has 306 rows and 4 columns
New York has 5 unique Boroughs
New York has 302 unique Neighbourhoods
The dataframe looks as follows: 



Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315


<h4>3.3.2 Visualize Neighbourhood Maps</h4>

<h5>Get coordinates of New York and Toronto</h5>

In [122]:
# Toronto Coordinates
tor_address = 'Toronto, CA'

tor_geolocator = Nominatim(user_agent="tor_explorer")
tor_location = tor_geolocator.geocode(tor_address)
tor_lat = tor_location.latitude
tor_lng = tor_location.longitude

# New York Coordinates
ny_address = 'New York City, NY'

ny_geolocator = Nominatim(user_agent="ny_explorer")
ny_location = ny_geolocator.geocode(ny_address)
ny_lat = ny_location.latitude
ny_lng = ny_location.longitude

# Print Coordinates
print('The geographical coordinates of Toronto are {}, {}.'.format(tor_lat, tor_lng))
print('The geographical coordinates of New York are {}, {}.'.format(ny_lat, ny_lng))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
The geographical coordinates of New York are 40.7127281, -74.0060152.
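The `geocode` call returns `None` when no match is found, in which case reading `.latitude` fails with an `AttributeError`. A small guard makes that failure explicit. The helper `get_coordinates` and the stand-in `fake_geocode` below are hypothetical names used only for this sketch; in the notebook the first argument would be something like `tor_geolocator.geocode`.

```python
from types import SimpleNamespace


def get_coordinates(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing was found."""
    location = geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


def fake_geocode(address):
    """Stand-in for a real geocoder so the sketch runs without network access."""
    known = {
        'Toronto, CA': SimpleNamespace(latitude=43.6534817, longitude=-79.3839347),
    }
    return known.get(address)


tor_lat, tor_lng = get_coordinates(fake_geocode, 'Toronto, CA')
print(tor_lat, tor_lng)

# An unknown address now fails with a clear message instead of an AttributeError
try:
    get_coordinates(fake_geocode, 'Nowhere')
except ValueError as err:
    print(err)
```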


<h5>Create Toronto Map</h5>

In [124]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[tor_lat, tor_lng], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{} | {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h5>Create New York Map</h5>

In [126]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[ny_lat, ny_lng], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_newyork['Latitude'], df_newyork['Longitude'], df_newyork['Borough'], df_newyork['Neighbourhood']):
    label = '{} | {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

<h2 id="venues">4. Crawling Venues with Foursquare API</h2>

<h3>4.1 Function definition</h3>

<h3>4.2 Getting New York neighbourhood venues</h3>

<h3>4.3 Getting Toronto neighbourhood venues</h3>