# **CAPSTONE PROJECT**

## **By Arthur Ye**

<img src="https://scontent.fsan1-1.fna.fbcdn.net/v/t1.0-9/p720x720/82957017_2950790718277850_5324816671653756928_o.jpg?_nc_cat=106&_nc_ohc=Pvp6ilX5GS4AX-Rwe2m&_nc_ht=scontent.fsan1-1.fna&_nc_tp=6&oh=deb89b0f9edee6c1947741bed11be7a0&oe=5ED6F94B">

##   

## __Introduction__

### New York City (NYC), located at 40.7128° N, 74.0060° W, is the most populous city in the United States, with an estimated 2019 population of 8.6 million spread over a land area of 303 square miles. NYC comprises 5 boroughs bounded by the Hudson River and the Atlantic Ocean. At its center is Manhattan, a densely populated borough that is one of the world's largest commercial, cultural, and financial districts. As the financial center of the United States, NYC records a Gross Domestic Product (GDP) of US$842.3 billion. 

### In this capstone project, the goal is to recommend a location for opening a restaurant in NYC, based on a detailed analysis of the data available online. More generally, if a contractor is trying to start a business in NYC, which neighborhood would be the ideal place to launch it? 

<img src="https://miro.medium.com/max/1200/1*o77vgUxSopaFSuqa5UFBCw.gif">

##  

## __Data__

### Data from Wikipedia will be used for this project. The source is the page "List of towns in New York (state)". The shapefile used alongside it was created to project populations at a small-area level, from 2000 to 2030, for PlaNYC, the long-term sustainability plan for New York City. Since population size affects the error associated with population projections, these geographic units needed a minimum population, which was set at 15,000. This criterion resulted in combinations of neighborhoods that probably would not occur if one were solely designating the boundaries of historical neighborhoods. 

### **"https://en.wikipedia.org/wiki/List_of_towns_in_New_York_(state)"**


### In addition, the "2014 New York City Neighborhood Names" dataset is obtained from NYU to identify the 306 neighborhoods within the 5 boroughs of NYC. The file was created as a guide to the NYC neighborhoods that appear on the web source, and it includes information such as industry structure, population, area, and trade value. 

### **https://geo.nyu.edu/catalog/nyu-2451-34572**
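
### As a rough illustration of how such a file could be turned into a table, the sketch below flattens a GeoJSON-style `features` list into a pandas DataFrame. The two sample features, their names, and their coordinates are made up for illustration, and the `properties`/`geometry` layout is an assumption about the NYU file rather than something confirmed here.

```python
import pandas as pd

# Hypothetical sample shaped like a GeoJSON FeatureCollection; all values are illustrative only
sample = {
    "features": [
        {"properties": {"borough": "Borough A", "name": "Neighborhood 1"},
         "geometry": {"coordinates": [-74.0060, 40.7128]}},
        {"properties": {"borough": "Borough B", "name": "Neighborhood 2"},
         "geometry": {"coordinates": [-73.9500, 40.6500]}},
    ]
}

rows = []
for feature in sample["features"]:
    # GeoJSON points are stored as [longitude, latitude]
    lng, lat = feature["geometry"]["coordinates"]
    rows.append({
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lng,
    })

neighborhoods = pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])
print(neighborhoods)
```

### If the real file uses different key names, only the two `feature[...]` lookups would need to change.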


### Lastly, the project uses the Foursquare API. Foursquare is a local search-and-discovery mobile application developed by Foursquare Labs Inc. The application provides personalized recommendations of places near the user's current location, based on the user's previous browsing and check-in history. 

### **https://developer.foursquare.com/docs/api/endpoints**
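
### A minimal sketch of how a venue request could be assembled is shown below. It assumes the legacy v2 `venues/explore` endpoint with the `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` query parameters; the helper name `build_explore_url`, the default values, and the placeholder credentials are hypothetical, and the endpoint documentation linked above should be checked before relying on any of them.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a request URL for the (assumed) legacy v2 venues/explore endpoint."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date in YYYYMMDD form
        "ll": f"{lat},{lng}",            # latitude,longitude of the search center
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues to return
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# NYC coordinates from the introduction; credentials are placeholders
url = build_explore_url(40.7128, -74.0060, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

### The sketch only builds the URL; sending it with `requests.get(url).json()` would require valid credentials.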

In [2]:
import requests  # library to handle HTTP requests
!pip install beautifulsoup4  # library to parse HTML and XML documents
from bs4 import BeautifulSoup  # HTML/XML parser class
!pip install lxml  # fast parser backend usable by BeautifulSoup
!pip install html5lib  # lenient HTML5 parser backend
from urllib.request import urlopen  # opens a URL and returns the response

import numpy as np  # library for numerical arrays
import pandas as pd  # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

#Command to install OpenCage Geocoder for fetching Lat and Long of Neighborhood
!pip install opencage

#Importing OpenCage Geocoder
from opencage.geocoder import OpenCageGeocode

#Use the inline backend to generate the plots within the browser
%matplotlib inline 

#Importing Matplot lib and associated packages to perform Data Visualisation and Exploratory Data Analysis
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

#Check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Importing folium to visualise Maps and plot based on Lat and Lng
import folium

#To normalise data returned by FourSquare API
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions

#Importing KMeans from SciKit library to Classify neighborhoods into clusters
from sklearn.cluster import KMeans

print('Libraries imported')


##  

In [3]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page=urlopen(url).read().decode('utf-8')
soup=BeautifulSoup(page,'html.parser')
req_table=soup.body.table.tbody

In [4]:
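#Data parsing
def req_cell(element):
    """Return the text of every <td> cell in a table row."""
    cells = element.find_all('td')
    row = []

    for cell in cells:
        # Prefer the link text when the cell contains a non-empty anchor
        if cell.a and cell.a.text:
            row.append(cell.a.text)
            continue
        # get_text() also works for cells with nested tags, where cell.string is None
        row.append(cell.get_text(strip=True))

    return row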

In [5]:
def req_row():    
    data = []  
    
    for tr in req_table.find_all('tr'):
        row = req_cell(tr)
        if len(row) != 3:
            continue
        data.append(row)        
    
    return data

In [6]:
data = req_row()
columns = ['Postcode', 'Borough', 'Neighbourhood']
df = pd.DataFrame(data, columns=columns)
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [7]:
df_req = df[df.Borough != 'Not assigned']
df_req = df_req.sort_values(by=['Postcode','Borough'])
df_req.reset_index(inplace=True)
df_req.drop('index',axis=1,inplace=True)
df_req.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge |
| 1 | M1B | Scarborough | Malvern |
| 2 | M1C | Scarborough | Highland Creek |
| 3 | M1C | Scarborough | Rouge Hill |
| 4 | M1C | Scarborough | Port Union |


In [87]:
df = pd.DataFrame(arrRowData)
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Town | County | Pop.(2010) | Land(sq mi) | Water(sq mi) | Coordinates | GEO ID(FIPS code) | ANSI code(GNIS ID) | | | | | | | |
| 1 | Adams | Jefferson | 5143 | 42.270 | 0.153 | 43.844658, -76.054289 | 3604500210 | 00978655 | | | | | | | |
| 2 | Addison | Steuben | 2595 | 25.545 | 0.142 | 42.132164, -77.234422 | 3610100287 | 00978656 | | | | | | | |
| 3 | Afton | Chenango | 2851 | 45.836 | 0.679 | 42.226305, -75.526838 | 3601700353 | 00978657 | | | | | | | |
| 4 | Alabama | Genesee | 1869 | 42.370 | 0.406 | 43.090053, -78.366480 | 3603700474 | 00978658 | | | | | | | |


## __Methodology__

### Fortunately, the data obtained from NYU are in JSON format, which means they can be easily loaded and interpreted in a Jupyter Notebook. However, the data will require some wrangling and clean-up work with pandas DataFrames. Using the Foursquare API, the project will identify the key venues of each city from its latitude and longitude values, specifically venues related to economic activity, in order to find the most common types of venues in each city. In combination with the "2016 New York City Neighborhood Tabulation Area" data, the study will perform a clustering analysis to separate the cities into clusters and identify each cluster's strategic strengths through an analysis of its characteristics. Furthermore, with the folium package, a geographical map can be drawn to visualize the clusters of venues. 
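
### The venue-frequency step described above can be sketched as follows. The venue list is a made-up stand-in for what the Foursquare responses would provide, and the column names `Neighborhood` and `Category` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical venue list; in the project this would come from the Foursquare responses
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Category": ["Pizza Place", "Pizza Place", "Bank", "Park", "Cafe", "Cafe"],
})

# One-hot encode the categories, then average per neighborhood to get category frequencies
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

# Most frequent venue category for each neighborhood
top_category = freq.idxmax(axis=1)
print(freq)
print(top_category)
```

### A frequency table of this shape, with one row per neighborhood and one column per venue category, is the kind of feature matrix that could then be passed to `KMeans` for the clustering step.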

In [9]:
# Unique postcodes, one row each (avoids in-place modification of a DataFrame slice)
df_postcodes = df_req['Postcode'].drop_duplicates()
df_req2 = pd.DataFrame(df_postcodes)
df_req2['Borough'] = ''
df_req2['Neighbourhood'] = ''

# Give both frames a clean 0..n-1 index so positional and label indexing agree
df_req2.reset_index(drop=True, inplace=True)
df_req.reset_index(drop=True, inplace=True)

# For each unique postcode, copy the borough and append every matching neighbourhood
for i in df_req2.index:
    for j in df_req.index:
        if df_req2.iloc[i, 0] == df_req.iloc[j, 0]:
            df_req2.iloc[i, 1] = df_req.iloc[j, 1]
            df_req2.iloc[i, 2] = df_req2.iloc[i, 2] + ',' + df_req.iloc[j, 2]

# Strip the leading comma left by the concatenation above
for i in df_req2.index:
    s = df_req2.iloc[i, 2]
    if s.startswith(','):
        s = s[1:]
    df_req2.iloc[i, 2] = s

df_req2.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
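
### The nested loops above compare every unique postcode with every row. As an alternative sketch, the same comma-joined result can be produced with a pandas `groupby`, assuming each postcode maps to a single borough. The small `sample` frame reuses a few rows from the output above and stands in for `df_req`.

```python
import pandas as pd

# Small sample in the same shape as df_req
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1C", "M1C", "M1C", "M1G"],
    "Borough": ["Scarborough"] * 6,
    "Neighbourhood": ["Rouge", "Malvern", "Highland Creek",
                      "Rouge Hill", "Port Union", "Woburn"],
})

# Join all neighbourhood names that share a postcode and borough
grouped = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)
print(grouped)
```

### This avoids building strings with a leading comma, so no clean-up pass is needed afterwards.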


In [11]:
import sys
!{sys.executable} -m pip install geocoder
!{sys.executable} -m pip install folium


In [12]:
import geocoder
def get_latlng(postal_code):
    lat_lng_coords = None
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
get_latlng('M4G')

[43.70949500000006, -79.36398897099997]
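
### The `while` loop in `get_latlng` keeps retrying for as long as the geocoder returns nothing, so it would never finish for a postcode that cannot be resolved. A minimal sketch of a bounded retry is shown below. The helper name `get_latlng_with_retries`, its parameters, and the `fake_lookup` stand-in are hypothetical and only illustrate the idea without calling a real geocoding service.

```python
import time

def get_latlng_with_retries(lookup, query, max_retries=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the retries run out."""
    for attempt in range(max_retries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # optional pause between attempts
    return None  # give up instead of looping forever

# Stand-in lookup that fails twice before succeeding (no network access involved)
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.7095, -79.3640]

result = get_latlng_with_retries(fake_lookup, "M4G, Toronto, Ontario")
print(result, calls["count"])
```

### With the geocoder package used above, the `lookup` argument could be `lambda q: geocoder.arcgis(q).latlng`.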

In [13]:
postal_codes = df_req2['Postcode']    
coords = [ get_latlng(postal_code) for postal_code in postal_codes.tolist() ]
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df_req2['Latitude'] = df_coords['Latitude']
df_req2['Longitude'] = df_coords['Longitude']
df_req2.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.811525 | -79.195517 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.785665 | -79.158725 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.765815 | -79.175193 |
| 3 | M1G | Scarborough | Woburn | 43.768369 | -79.21759 |
| 4 | M1H | Scarborough | Cedarbrae | 43.769688 | -79.23944 |


In [15]:
TorontoData  = df_req2

In [16]:
toronto_map = folium.Map(location=[43.65, -79.4], zoom_start=12)

# Cluster the postcodes on their coordinates
X = TorontoData['Latitude']
Y = TorontoData['Longitude']
Z = np.stack((X, Y), axis=1)

kmeans = KMeans(n_clusters=4, random_state=0).fit(Z)

clusters = kmeans.labels_
# Named cluster_colors to avoid shadowing matplotlib.colors, imported earlier as `colors`
cluster_colors = ['red', 'green', 'blue', 'yellow']
TorontoData['Cluster'] = clusters

# One marker per postcode: the fill color encodes the cluster, the popup shows the borough
for latitude, longitude, borough, cluster in zip(TorontoData['Latitude'], TorontoData['Longitude'], TorontoData['Borough'], TorontoData['Cluster']):
    label = folium.Popup(borough, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=4,
        popup=label,
        color='red',
        fill=True,
        fill_color=cluster_colors[cluster],
        fill_opacity=0.8).add_to(toronto_map)

toronto_map
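
### Besides the map, a short numeric summary can help when comparing clusters. The sketch below counts the points in each cluster and computes the mean coordinates per cluster. The `sample` frame and its cluster labels are made up for illustration and stand in for a frame with `Latitude`, `Longitude`, and `Cluster` columns like the one built above.

```python
import pandas as pd

# Hypothetical coordinates and cluster labels, for illustration only
sample = pd.DataFrame({
    "Latitude": [43.81, 43.79, 43.65, 43.66],
    "Longitude": [-79.20, -79.16, -79.38, -79.40],
    "Cluster": [0, 0, 1, 1],
})

# Number of points and mean coordinates for each cluster
summary = sample.groupby("Cluster").agg(
    Count=("Latitude", "size"),
    MeanLatitude=("Latitude", "mean"),
    MeanLongitude=("Longitude", "mean"),
)
print(summary)
```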