# Capstone Project - The Battle of the Neighborhoods (Week 1)

## Table of contents
* [Introduction: Problem Description](#introduction)
* [Data Description](#data)


### Introduction: Problem Description

New York City and the City of Toronto are two of the world's major cities. Every year, millions of tourists visit them for business and pleasure.

The main reasons tourists visit these cities are to:

* Discover different neighborhoods
* Stand in awe of the skyscrapers
* Enjoy international and experimental food
* Visit world-renowned art venues and galleries

Visitors come from all over the world to explore new places, taste familiar and unfamiliar cuisines, and enjoy popular sites. Someone who has to choose between the two cities will want to compare them against their own likes and dislikes.

A comparison of the venues in the two cities will help people decide where to go. This project therefore analyzes venue data for New York City and Toronto to give a picture of the most sought-after venues in each.

### Data Description

To compare venues of interest in the two cities (New York and Toronto), we need suitable datasets for both.

The following data sources will be used to extract or generate the required information:

* New York City data will be obtained from a JSON file hosted by the IBM Developer Skills Network
* City of Toronto postal codes will be obtained from a Wikipedia page
* City of Toronto coordinates will be obtained from a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
* The number of restaurants in every neighborhood, with their type and location, will be obtained using the Foursquare API

Communication with the Foursquare database goes through its RESTful API. Every request starts from a base uniform resource identifier (URI), api.foursquare.com/v2, to which an endpoint and extra parameters are appended depending on the data being sought. Through this API you can request data about venues, users, or tips.

For New York City, we will use the coordinates of Manhattan to conduct the search.
For the City of Toronto, the coordinates of the city of Toronto will be used.
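
As a rough illustration of how such a request URI is put together, the sketch below appends query parameters to the base URI for the `venues/explore` endpoint. The helper name `build_explore_url`, the placeholder credentials, and the default `version`, `radius`, and `limit` values are assumptions made for this example; the coordinates are those of Marble Hill from the Manhattan data shown later in this notebook. No request is actually sent.

```python
from urllib.parse import urlencode

BASE_URI = "https://api.foursquare.com/v2"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a venues/explore request URL (hypothetical helper, no network call)."""
    params = {
        "client_id": client_id,          # your Foursquare app ID
        "client_secret": client_secret,  # your Foursquare app secret
        "v": version,                    # API version date (YYYYMMDD), assumed value
        "ll": f"{lat},{lng}",            # latitude,longitude of the search center
        "radius": radius,                # search radius in meters, assumed value
        "limit": limit,                  # maximum number of venues, assumed value
    }
    return f"{BASE_URI}/venues/explore?{urlencode(params)}"


# Placeholder credentials; replace them with real ones before sending a request.
example_url = build_explore_url(40.876551, -73.91066,
                                "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(example_url)
```

The resulting string could then be passed to `requests.get(example_url).json()` once real credentials are filled in.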



In [20]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
# transform JSON data into a pandas dataframe
# (pandas >= 1.0 exposes this at the top level; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes  
#uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

#### City of Toronto Data Preparation

A dataframe will be created from a Wikipedia page. It will consist of three columns: PostalCode, Borough, and Neighbourhood.

In [21]:
url='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'

In [22]:
df = pd.read_html(url)[0]
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [23]:
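# keep only the postal codes that have an assigned borough
df = df[df['Borough'] != 'Not assigned']
# rename without inplace=True so that a filtered slice is not modified in place
df = df.rename(columns={'Postal Code': 'PostalCode'})
df.head()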

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [24]:
df.reset_index(drop=True, inplace=True)
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [25]:
df_postal = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df_postal.set_index('PostalCode', inplace = True)
df_postal.head()

| PostalCode | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough | Malvern, Rouge |
| M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| M1E | Scarborough | Guildwood, Morningside, West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
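
To make the effect of the `groupby` step above easier to see, here is a minimal sketch on a toy dataframe. The rows are hypothetical: they assume a postal code that appears on two separate lines, which is the situation the grouping is meant to handle. The `', '` separator is also a choice made for this example.

```python
import pandas as pd

# Hypothetical input: postal code M5A is listed twice, once per neighbourhood
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on postal code and borough, then join the neighbourhood names
# of each group into a single comma-separated string
merged = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
       .apply(", ".join)
       .reset_index()
)
print(merged)
```

Each postal code ends up on exactly one row, which is what the later join with the coordinates file relies on.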


Create a dataframe with latitude and longitude from the CSV file that has the geographical coordinates of each postal code.

In [26]:
df_latlong = pd.read_csv('Geospatial_Coordinates.csv')
df_latlong.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_latlong.set_index('PostalCode', inplace = True)
df_latlong.head()

| PostalCode | Latitude | Longitude |
|---|---|---|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |
| M1G | 43.770992 | -79.216917 |
| M1H | 43.773136 | -79.239476 |


Join the two dataframes to create one dataframe with PostalCode, Borough, Neighbourhood, Latitude, and Longitude.

In [27]:
df_toronto = df_postal.join(df_latlong)
df_toronto.reset_index(inplace=True)
df_toronto.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
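
Because `DataFrame.join` performs a left join on the index by default, a postal code with no entry in the coordinates file would silently receive missing values. The sketch below shows one possible check on toy data. The `M9Z` row is a made-up postal code added only to trigger the check; the other values are taken from the tables above.

```python
import pandas as pd

# Toy stand-ins for df_postal and df_latlong, both indexed by PostalCode
postal = pd.DataFrame(
    {"Borough": ["Scarborough", "Scarborough", "Hypothetical"],
     "Neighbourhood": ["Malvern, Rouge", "Woburn", "Nowhere"]},
    index=pd.Index(["M1B", "M1G", "M9Z"], name="PostalCode"),  # M9Z is made up
)
coords = pd.DataFrame(
    {"Latitude": [43.806686, 43.770992],
     "Longitude": [-79.194353, -79.216917]},
    index=pd.Index(["M1B", "M1G"], name="PostalCode"),
)

# Left join on the shared index: every row of `postal` is kept
joined = postal.join(coords)

# Rows whose coordinates are missing after the join
missing = joined[joined[["Latitude", "Longitude"]].isna().any(axis=1)]
print(f"{len(missing)} postal code(s) without coordinates: {list(missing.index)}")
```

Running the same check on `df_toronto` before calling the Foursquare API would reveal any postal codes that lack coordinates.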


#### New York City data preparation
The data is extracted from the following URL, stored in the variable newyorkurl.

In [35]:
import urllib.request

In [36]:
newyorkurl = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json'

In [41]:
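# download the JSON file and parse it into a Python dictionary
response = urllib.request.urlopen(newyorkurl)
content = response.read()
newyork_data = json.loads(content.decode("utf8"))
# print the type rather than the whole (very large) dictionary
print(type(newyork_data))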


<class 'dict'>


Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [43]:
neighborhoods_data = newyork_data['features']

Let's look at the first item in this list.

In [44]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
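
The record above follows the GeoJSON convention of storing a point as `[longitude, latitude]`, so the order has to be swapped when filling Latitude and Longitude columns. The sketch below pulls the four fields of interest out of an abbreviated copy of this record; `feature_to_row` is a hypothetical helper name used only for this example.

```python
# Abbreviated copy of the first feature shown above
feature = {
    "type": "Feature",
    "geometry": {"type": "Point",
                 "coordinates": [-73.84720052054902, 40.89470517661]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}


def feature_to_row(feature):
    """Return the borough, name, and coordinates of one feature as a dict."""
    # GeoJSON order is [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


row = feature_to_row(feature)
print(row)
```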

##### Transform the data into a pandas dataframe

In [45]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [46]:
neighborhoods

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|


In [47]:
# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [48]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [49]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [50]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |
