# Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods (Week 2)

This week you will continue working on your capstone project. By the end of the week, you need to submit the following:

1. A full report consisting of all of the following components (15 marks):
   * Introduction, where you discuss the business problem and who would be interested in this project.
   * Data, where you describe the data that will be used to solve the problem and the source of the data.
   * Methodology, the main component of the report, where you describe any exploratory data analysis you did, any inferential statistical testing you performed, and which machine learning techniques were used and why.
   * Results, where you discuss the results.
   * Discussion, where you discuss any observations you noted and any recommendations you can make based on the results.
   * Conclusion, where you conclude the report.
2. A link to your Notebook on your GitHub repository, showing your code (15 marks).
3. Your choice of a presentation or blog post (10 marks).

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find the optimal location for different businesses in New York City. Specifically, we will look at existing restaurants, stores, and other businesses, together with their locations and business information, to see whether some businesses succeed or struggle because of where they are. We will then use data science tools to suggest ideal locations for several example businesses.

## Data <a name="data"></a>

For this problem, the factors that will influence a business's success in a location are:
* Location in NYC (coordinates)
* Categories of businesses

The following data sources will be needed to extract or generate the required information:
* Foursquare API
* Google Maps API geocoding
* NYC Data (Wiki)


### Import data

In [5]:
import numpy as np 

import pandas as pd 
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json 

from geopy.geocoders import Nominatim
import geocoder 

import requests 
from bs4 import BeautifulSoup

from pandas import json_normalize  # public import path since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium 

print("Libraries imported.")

Libraries imported.


## Scrape data into a DataFrame

In [8]:
data = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City").text

In [9]:
soup = BeautifulSoup(data, 'html.parser')

In [10]:
neighborhoodList = []

In [13]:
# Guard the lookup: this page may contain fewer than two "mw-category" blocks.
blocks = soup.find_all("div", class_="mw-category")
for row in (blocks[1].find_all("li") if len(blocks) > 1 else []):
    neighborhoodList.append(row.text)

In [14]:
nyc_df = pd.DataFrame({"Neighborhood": neighborhoodList})
nyc_df.head()

Unnamed: 0,Neighborhood


In [15]:
nyc_df.shape

(0, 1)

## Get coordinates

In [17]:
def get_nyc(neighborhood):
    # initialize your variable to None
    nyc_coords = None
    # loop until you get the coordinates
    while(nyc_coords is None):
        g = geocoder.arcgis('{}, New York, New York'.format(neighborhood))
        nyc_coords = g.latlng  # [latitude, longitude], or None if the lookup failed
    return nyc_coords
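
The loop in `get_nyc` keeps retrying until a result arrives, so it never terminates for a name that cannot be geocoded. A minimal sketch of a bounded retry follows. The names `geocode_with_retry`, `max_attempts`, `delay_seconds`, and the injected `lookup` callable are hypothetical and introduced here only so the idea can be run without network access; in the notebook, `lookup` could wrap a call such as `geocoder.arcgis(query).latlng`.

```python
import time


def geocode_with_retry(query, lookup, max_attempts=3, delay_seconds=0.0):
    """Return (lat, lng) from lookup(query), or None after max_attempts failures."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return tuple(coords)
        time.sleep(delay_seconds)  # pause before the next attempt
    return None


# Stand-in lookup table for demonstration only; the values are illustrative placeholders.
_fake_results = {"Astoria, New York, New York": [40.77, -73.93]}


def fake_lookup(query):
    """Hypothetical replacement for a real geocoding call."""
    return _fake_results.get(query)


found = geocode_with_retry("Astoria, New York, New York", fake_lookup)
missing = geocode_with_retry("Nowhere, New York, New York", fake_lookup)
```

Neighborhoods that come back as `None` can then be dropped or inspected by hand instead of stalling the whole run.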

In [None]:
coords = [ get_nyc(neighborhood) for neighborhood in nyc_df["Neighborhood"].tolist() ]

In [None]:
coords

In [None]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [None]:
nyc_df['Latitude'] = df_coords['Latitude']
nyc_df['Longitude'] = df_coords['Longitude']

In [18]:
print(nyc_df.shape)
nyc_df

(0, 1)


Unnamed: 0,Neighborhood


## Create Map

In [None]:
address = 'New York, USA'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York, New York are {}, {}.'.format(latitude, longitude))

In [None]:
map_nyc = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(nyc_df['Latitude'], nyc_df['Longitude'], nyc_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_nyc)  
    
map_nyc

In [None]:
map_nyc.save('map_nyc.html')


## Use the Foursquare API to explore the neighborhoods


In [None]:
CLIENT_ID = 'your Foursquare ID' 
CLIENT_SECRET = 'your Foursquare Secret'  
VERSION = '20180605' 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [None]:
# get venues within a certain radius (in meters) of each neighborhood
radius = 1000
LIMIT = 500

venues = []

for lat, long, neighborhood in zip(nyc_df['Latitude'], nyc_df['Longitude'], nyc_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
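
Formatting the query string by hand works, but it does not escape values, and the nested indexing fails on any response that lacks a group or a category. The sketch below builds the same URL with `urllib.parse.urlencode` and reads the response defensively. `build_explore_url` and `extract_venues` are hypothetical helpers, and the sample dictionary is made up: it mirrors only the keys that the loop above reads and is not real API output.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venues(payload, neighborhood, lat, lng):
    """Flatten a response-shaped dict into tuples, tolerating missing parts."""
    rows = []
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Made-up payload for illustration only.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe", "location": {"lat": 40.7, "lng": -74.0},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Uncategorized Spot", "location": {"lat": 40.71, "lng": -74.01},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20180605", 40.7, -74.0, 1000, 50)
rows = extract_venues(sample, "Example Neighborhood", 40.7, -74.0)
```

A venue without a category is kept with `None` rather than raising an `IndexError`, so one incomplete record does not abort the whole collection loop.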

In [None]:
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

In [None]:
venues_df.groupby(["Neighborhood"]).count()

In [None]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

In [None]:
venues_df['VenueCategory'].unique()[:50]

In [None]:
"Neighborhood" in venues_df['VenueCategory'].unique()

## Analyse Neighbourhoods

In [None]:
# one hot encoding
nyc_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
nyc_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [nyc_onehot.columns[-1]] + list(nyc_onehot.columns[:-1])
nyc_onehot = nyc_onehot[fixed_columns]

print(nyc_onehot.shape)
nyc_onehot.head()

In [None]:
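# average the one-hot rows per neighborhood to get category frequencies
nyc_grouped = nyc_onehot.groupby(["Neighborhoods"]).mean().reset_index()
print(nyc_grouped.shape)
nyc_grouped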
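
A grouped mean over one-hot rows gives each neighborhood's share of venues per category, which is the kind of table the clustering step needs. The following is a minimal sketch on made-up rows; the toy DataFrame and its values are assumptions for illustration, and only the column names follow the notebook.

```python
import pandas as pd

# Made-up venues: neighborhood "A" has a cafe and a park, "B" has one cafe.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "VenueCategory": ["Cafe", "Park", "Cafe"],
})

# One column per category, one row per venue.
toy_onehot = pd.get_dummies(toy_venues[["VenueCategory"]], prefix="", prefix_sep="")

# Put the neighborhood name in the first column.
toy_onehot.insert(0, "Neighborhoods", toy_venues["Neighborhood"])

# The mean of the indicator columns is the fraction of venues in each category.
toy_grouped = toy_onehot.groupby("Neighborhoods").mean().reset_index()
print(toy_grouped)
```

Each row of `toy_grouped` sums to 1 across the category columns, so neighborhoods with very different venue counts become comparable.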

## Cluster Neighbourhoods

In [None]:
kclusters = 4

nyc_clustering = nyc_grouped.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyc_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
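# start from the grouped neighborhoods so the rows line up with kmeans.labels_
nyc_merged = nyc_grouped[["Neighborhoods"]].copy()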

# add clustering labels
nyc_merged["Cluster Labels"] = kmeans.labels_

In [None]:
nyc_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
nyc_merged.head()

In [None]:
#merge 
nyc_merged = nyc_merged.join(nyc_df.set_index("Neighborhood"), on="Neighborhood")

print(nyc_merged.shape)
nyc_merged.head()

In [None]:
# sort by cluster labels
print(nyc_merged.shape)
nyc_merged.sort_values(["Cluster Labels"], inplace=True)
nyc_merged

## Visualization

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyc_merged['Latitude'], nyc_merged['Longitude'], nyc_merged['Neighborhood'], nyc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
map_clusters.save('map_clusters.html')

## Analysis <a name="analysis"></a>

In [None]:
nyc_merged.loc[nyc_merged['Cluster Labels'] == 0]

In [None]:
nyc_merged.loc[nyc_merged['Cluster Labels'] == 1]

In [None]:
nyc_merged.loc[nyc_merged['Cluster Labels'] == 2]

In [None]:
nyc_merged.loc[nyc_merged['Cluster Labels'] == 3]

## Methodology <a name="methodology"></a>

In this project we use New York City to analyse existing businesses. Because there are so many businesses, we limit each Foursquare query to at most 500 venues within a 1,000 m radius of a neighborhood.

In the first step we collect the data: neighborhood locations from Wikipedia and business (venue) data from Foursquare.

In the second step we calculate and explore the categories of businesses to identify areas close to the center with lower numbers of certain businesses.

In the third and final step we focus on the most promising areas and group them into 4 clusters of locations that meet basic requirements.
We present a map of all such locations and also cluster them (using **k-means clustering**) to identify general zones, neighborhoods, or addresses where a business could open with minimal competition.
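
To make the k-means step concrete, here is a plain-Python sketch of the assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation: `simple_kmeans` is a hypothetical helper, it starts from the first `k` points instead of a randomized initialization, and the four feature rows are made-up category shares.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simple_kmeans(points, k, max_iterations=100):
    """Cluster points with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Made-up rows: (share of cafes, share of parks) for four neighborhoods.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = simple_kmeans(features, k=2)
print(labels)
```

Neighborhoods with similar category mixes end up with the same label, which is what the cluster map visualises.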

## Conclusion <a name="conclusion"></a>

In conclusion, the locations on the map with low density of similar categories of businesses indicate low competition but could also mean that there is a lower need for said category in that neighbourhood. 
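
One way to weigh these two readings is to look at the count of a target category next to the total number of venues in each neighborhood: a low target count in an otherwise busy neighborhood is more suggestive of low competition than a low count where there are few venues at all. The sketch below is only an illustration; `rank_by_competition` and the sample rows are hypothetical.

```python
from collections import Counter


def rank_by_competition(venue_rows, target_category):
    """Return (neighborhood, target_count, total_count), fewest target venues first.

    Ties are broken by the larger total venue count, then by name.
    """
    totals = Counter(name for name, _ in venue_rows)
    targets = Counter(name for name, cat in venue_rows if cat == target_category)
    ranked = [(name, targets[name], totals[name]) for name in totals]
    return sorted(ranked, key=lambda row: (row[1], -row[2], row[0]))


# Made-up (neighborhood, category) pairs for illustration only.
sample_rows = [
    ("A", "Cafe"), ("A", "Cafe"), ("A", "Park"),
    ("B", "Park"), ("B", "Gym"), ("B", "Bar"),
    ("C", "Park"),
]

ranking = rank_by_competition(sample_rows, "Cafe")
print(ranking)
```

In this toy data, "B" and "C" both have no cafes, but "B" has more venues overall, so it is listed first as the more plausible low-competition candidate.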