# Data Science Capstone Project


## Purpose
The purpose of this project is to explore the venues of Toronto and of the Waterloo-Kitchener area, and to understand how a big city differs from a smaller one.

## Background
Predicting where to open a business is a very hard problem because so many variables are involved. It is almost impossible to predict whether a certain kind of business will succeed in one neighbourhood versus another, or in one city versus another. Environment, culture, people, religion, time, location, and aesthetic appeal all interact to determine whether a place is good for a business, and the kind of business matters a great deal too. I am no expert and do not claim to be one; I simply experiment, analyze, and predict, so I go into this problem without preconceptions and see what comes out. In this project I look at a single variable, the venue data on other people's businesses: I cluster the venues and see which cluster is good for which kind of business.


## Methodology
I. Data acquisition and cleaning

II. Data Analysis

III. Clustering

IV. Conclusions



# Setup

## Library import
We import all the required Python libraries

In [None]:
# Import all the necessary libraries
from bs4 import BeautifulSoup # For web scraping

import csv
import json # To handle JSON files

import pandas as pd # Data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from pandas import json_normalize # Top-level location since pandas 1.0
import numpy as np

import geocoder # Get location data
from geopy.geocoders import Nominatim # Convert an address into latitude and longitude values

import requests

from sklearn.cluster import KMeans # To run the k-means algorithm

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# %matplotlib inline  # Uncomment in Jupyter to render plots inline

import folium # To visualize on a map
from folium import plugins
from folium.plugins import MeasureControl
from folium.plugins import FloatImage

import seaborn as sns
from IPython.display import display

print('Libraries Imported')

# Data Acquisition and Cleaning

We retrieve all the data required for the analysis. I use the ArcGIS REST API to get the Waterloo-Kitchener neighbourhood data. I am aware that the locations of these neighbourhoods might not be accurate; they are rough placeholder names used to segment the area.

Getting Waterloo Data

In [None]:
url = 'https://services.arcgis.com/ZpeBVw5o1kjit7LT/arcgis/rest/services/NeighbourhoodAssociations/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json'
data = requests.get(url).json()  


with open('data.geojson', 'w') as json_file:
    json.dump(data, json_file)

address = 'Waterloo, Ontario, Canada'

geolocator = Nominatim(user_agent="waterloo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Waterloo are {}, {}.'.format(latitude, longitude))

element = []

for item in data['features']:
    ring = item['geometry']['rings'][0]  # outer ring: list of [longitude, latitude] vertices
    # Use one vertex, roughly five-sixths of the way around the ring, as a rough placeholder location
    i = round(len(ring)/2) + round(len(ring)/3)
    element.append([item['attributes']['NAME'], item['attributes']['TYPE'], ring[i][1], ring[i][0]])


df = pd.DataFrame(element,columns = ['Neighbourhood', 'Neighbourhood Type','lat','long'])

df.head()
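The cell above uses a single polygon vertex as each neighbourhood's location. One possible alternative, sketched here under assumptions, is to average the ring's vertices so the point falls nearer the middle of the polygon. The helper `ring_centroid` and the sample ring are hypothetical and are not part of the original analysis; the vertex average is only an approximation of the true area centroid.

```python
def ring_centroid(ring):
    """Return (lat, lon) as the mean of a polygon ring's vertices.

    `ring` is a list of [lon, lat] pairs, the order used in the cell above.
    A closed ring repeats its first vertex at the end, so that copy is dropped.
    """
    points = ring[:-1] if len(ring) > 1 and ring[0] == ring[-1] else ring
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lat, lon


# Hypothetical square ring, not taken from the real service response.
sample_ring = [[-80.54, 43.46], [-80.52, 43.46], [-80.52, 43.48],
               [-80.54, 43.48], [-80.54, 43.46]]
lat, lon = ring_centroid(sample_ring)
print(round(lat, 4), round(lon, 4))
```

If this were adopted, `ring_centroid(item['geometry']['rings'][0])` could replace the indexed vertex when building each row.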

Getting Kitchener Data

In [None]:
url = 'https://services1.arcgis.com/qAo1OsXi67t7XgmS/arcgis/rest/services/Neighbourhood_Association/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json'

data2 = requests.get(url).json() 

data2['features'][0]

element = []

for item in data2['features']:
    ring = item['geometry']['rings'][0]  # outer ring: list of [longitude, latitude] vertices
    # Use the vertex about one-fifth of the way around the ring as a rough placeholder location
    i = round(len(ring)/5)
    element.append([item['attributes']['MAPLABEL1'], item['attributes']['NEIGHBOURHOOD_ASSOCIATION'], ring[i][1], ring[i][0]])
    
element[:5]

df2 = pd.DataFrame(element,columns = ['Neighbourhood', 'Neighbourhood Type','lat','long'])

df2.head()

Combining them to get the bigger dataset

In [None]:
# drop=True avoids carrying the old per-city row numbers along as an extra 'index' column
df_big = pd.concat([df, df2], axis=0, sort=False).reset_index(drop=True)

df_big.head(10)

In [None]:
df_big.shape

Getting venue data from this DataFrame

In [None]:
import os

names = df_big['Neighbourhood']
latitudes = df_big['lat']
longitudes = df_big['long']
radius = 500

# Credentials are read from the environment rather than hard-coded in the notebook.
# The variable names below are placeholders; set them to your own Foursquare keys.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20200528' # Foursquare API version
LIMIT = 500

venue_lists = []     # one list of venue tuples per neighbourhood
missing_values = []  # neighbourhoods for which no venues were returned

for name, lat, lng in zip(names, latitudes, longitudes):
    url = f'https://api.foursquare.com/v2/venues/explore?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&radius={radius}&limit={LIMIT}'
    try:
        results = requests.get(url).json()
        venues = results['response']['groups'][0]['items']
        if len(venues) == 0:
            missing_values.append(name)
            continue
        venue_lists.append([(
                    name,
                    lat,
                    lng,
                    v['venue']['name'],
                    v['venue']['location']['lat'],
                    v['venue']['categories'][0]['name'],
                    v['venue']['location']['lng']) for v in venues]
                   )
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Skip neighbourhoods whose request failed or whose response was malformed
        continue

In [None]:
missing_values # The neighbourhoods for which no venues were returned

In [None]:
df_big = df_big[~df_big['Neighbourhood'].isin(missing_values)]

In [None]:
df_big.shape

In [None]:
venue_stuff = pd.DataFrame([item for venue_list in venue_lists for item in venue_list],columns = ['Neighbourhood','lat','long','venue','venue lat','venue type','venue long'])

In [None]:
# Save next to the notebook so the read_csv call in the Clustering section finds the file
venue_stuff.to_csv('venue_waterloo_data.csv', index=False)

In [None]:
venue_stuff.head(10)
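To make the row layout of `venue_stuff` easier to follow, here is a minimal sketch of how one explore-style response could be flattened into rows. The function name `flatten_explore_response` and the sample payload are hypothetical; only the nested keys (`response`, `groups`, `items`, `venue`, `location`, `categories`) are taken from the cell above, and the real API may return more fields.

```python
def flatten_explore_response(name, lat, lng, payload):
    """Return one tuple per venue, in the column order used for venue_stuff.

    Returns an empty list when the payload contains no venues.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Venues without a category get None instead of raising an IndexError
        categories = venue.get('categories') or [{'name': None}]
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], categories[0]['name'],
                     venue['location']['lng']))
    return rows


# Hypothetical payload for illustration only.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.47, 'lng': -80.53},
               'categories': [{'name': 'Coffee Shop'}]}},
]}]}}

rows = flatten_explore_response('Uptown', 43.465, -80.52, sample_payload)
empty = flatten_explore_response('Nowhere', 0.0, 0.0, {'response': {}})
print(rows)
print(empty)
```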

Getting Toronto data from previous analysis

In [None]:
toronto_venue_data = pd.read_csv('venue_toronto_data.csv')
toronto_venue_data.head(10)

In [None]:
toronto_clusters = pd.read_csv('toronto_clusters.csv')
toronto_clusters.head()

We have our desired DataFrame

# Data Analysis

In [None]:
KW_map = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, neighbourhood_type in zip(df_big['lat'], df_big['long'], df_big['Neighbourhood Type']):
    label = folium.Popup(str(neighbourhood_type), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
    ).add_to(KW_map)
    

KW_map

# Clustering

In [None]:
# One Hot Encoding
venue_stuff = pd.read_csv('venue_waterloo_data.csv')
waterloo_onehot = pd.get_dummies(venue_stuff[['venue type']],prefix ='',prefix_sep = '')
waterloo_onehot['Neighbourhood'] = venue_stuff['Neighbourhood']

fixed_columns = [waterloo_onehot.columns[-1]] + waterloo_onehot.columns[:-1].to_list()

waterloo_onehot = waterloo_onehot[fixed_columns]

In [None]:
waterloo_grouped = waterloo_onehot.groupby('Neighbourhood').mean().reset_index()
waterloo_grouped.shape
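Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues that belong to each venue type. The plain-Python sketch below illustrates the same idea on made-up rows; `venue_type_frequencies` and the sample data are hypothetical. Unlike the DataFrame version, it omits venue types with a share of zero.

```python
from collections import Counter, defaultdict


def venue_type_frequencies(rows):
    """Map each neighbourhood to {venue type: share of its venues}."""
    counts = defaultdict(Counter)
    for neighbourhood, venue_type in rows:
        counts[neighbourhood][venue_type] += 1
    return {
        hood: {vtype: n / sum(c.values()) for vtype, n in c.items()}
        for hood, c in counts.items()
    }


# Made-up (neighbourhood, venue type) pairs for illustration only.
sample_rows = [('A', 'Park'), ('A', 'Park'), ('A', 'Pizza Place'),
               ('A', 'Bank'), ('B', 'Bank')]
freqs = venue_type_frequencies(sample_rows)
print(freqs)
```

These per-neighbourhood shares are the feature vectors that k-means then groups.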

In [None]:
waterloo_clustered = waterloo_grouped.drop(columns='Neighbourhood')

In [None]:
kclusters = 15
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(waterloo_clustered)

# check cluster labels generated for each row in the dataframe
kmeans.labels_



In [None]:
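# Attach labels by neighbourhood name: waterloo_grouped is sorted by name,
# so its row order does not match df_big and a positional insert would misalign them
cluster_labels = pd.DataFrame({'Neighbourhood': waterloo_grouped['Neighbourhood'],
                               'Cluster Labels': kmeans.labels_})
df_big = df_big.merge(cluster_labels, on='Neighbourhood', how='inner')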

In [None]:
df_big.head()

In [None]:
inertia = []
N = 55
K = []
for k in range(2, N):
    K.append(k)
    # Use a separate name so the fitted model stored in `kmeans` is not overwritten
    km = KMeans(n_clusters=k, random_state=0).fit(waterloo_clustered)
    inertia.append(km.inertia_)
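The loop above records the inertia for each k but never inspects it. One rough way to read an "elbow" off such a curve, offered here only as a heuristic sketch, is to pick the k with the largest discrete second difference, that is, where the drop in inertia slows the most. The function `elbow_by_second_difference` and the sample numbers are hypothetical and assume evenly spaced values of k; plotting `K` against `inertia` and judging by eye is the more common approach.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply.

    The bend at position i is inertias[i-1] - 2*inertias[i] + inertias[i+1];
    the first and last points have no bend and are never returned.
    """
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up inertia values for illustration only.
sample_ks = [2, 3, 4, 5, 6, 7]
sample_inertia = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0]
elbow_k = elbow_by_second_difference(sample_ks, sample_inertia)
print(elbow_k)
```

With the notebook's own lists, the call would be `elbow_by_second_difference(K, inertia)`.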


In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_big['lat'], df_big['long'], df_big['Neighbourhood'], df_big['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       

map_clusters

In [None]:
for i in range(kclusters):
    print(f'cluster {i}')
    venue_cluster = df_big[df_big['Cluster Labels'] == i]
    for item in venue_cluster['Neighbourhood']:        
        display(venue_stuff[venue_stuff['Neighbourhood'] == item].drop(columns='Neighbourhood'))

In [None]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="yyz_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

kclusters = 30
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_clusters['lat'], toronto_clusters['long'], toronto_clusters['Borough'], toronto_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       

map_clusters

Now I am going to attach each cluster label to its respective venue and plot it.

In [None]:
dataframe = []
for j in sorted(toronto_clusters['Cluster Labels'].unique()):
    cluster_j_boroughs = toronto_clusters[toronto_clusters['Cluster Labels'] == j]['Borough'].to_list()
    for borough in cluster_j_boroughs:
        # Copy this borough's venues and tag every row with cluster label j
        temp = toronto_venue_data[toronto_venue_data['Borough'] == borough].copy()
        temp.insert(0, 'Cluster Labels', j)
        dataframe.append(temp)

venue_cluster = pd.concat(dataframe).reset_index(drop=True)
venue_cluster.drop_duplicates().head()

In [None]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="yyz_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

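# 'CartoDB positron' is a built-in tileset; 'Stamen Toner' may be unavailable in recent folium releases
map_clusters = folium.Map(location=[latitude, longitude], tiles='CartoDB positron', zoom_start=11)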

kclusters = 30
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(len(toronto_clusters['Cluster Labels']))]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(venue_cluster['venue lat'], venue_cluster['venue long'], venue_cluster['Borough'], venue_cluster['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=1,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       

map_clusters

In [None]:
dataframe = []
for j in sorted(df_big['Cluster Labels'].unique()):
    cluster_j_neighbourhoods = df_big[df_big['Cluster Labels'] == j]['Neighbourhood'].to_list()
    for neighbourhood in cluster_j_neighbourhoods:
        # Copy this neighbourhood's venues and tag every row with cluster label j
        temp = venue_stuff[venue_stuff['Neighbourhood'] == neighbourhood].copy()
        temp.insert(0, 'Cluster Labels', j)
        dataframe.append(temp)

waterloo_new_venue = pd.concat(dataframe).reset_index(drop=True)

address = 'Waterloo, Ontario, Canada'

geolocator = Nominatim(user_agent="waterloo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


# 'CartoDB positron' is a built-in tileset; 'Stamen Toner' may be unavailable in recent folium releases
map_clusters = folium.Map(location=[latitude, longitude], tiles='CartoDB positron', zoom_start=13)

# set color scheme for the clusters, sized from the Waterloo labels
# (kclusters still holds the Toronto value at this point)
n_clusters = int(waterloo_new_venue['Cluster Labels'].max()) + 1
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(waterloo_new_venue['venue lat'], waterloo_new_venue['venue long'], waterloo_new_venue['Neighbourhood'], waterloo_new_venue['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       

map_clusters

I have clustered the neighbourhoods based on venue type

In [None]:
cluster_6 = venue_cluster[venue_cluster['Cluster Labels']==6]
cluster_6.head()

In [None]:
cluster_6['venue type'].value_counts()

In [None]:
random = []
random2 = []
for i in range(kclusters):
    #print(f'cluster {i}')
    cluster_i = venue_cluster[venue_cluster['Cluster Labels']==i]
    random.append([cluster_i['venue type'].value_counts().index[0:9].to_list(),cluster_i['Borough'].unique()])

for j in range(len(waterloo_new_venue['Cluster Labels'].unique())):
    cluster_j = waterloo_new_venue[waterloo_new_venue['Cluster Labels'] == j]
    random2.append([cluster_j['venue type'].value_counts().index[0:9].to_list(),cluster_j['Neighbourhood'].unique()])


#     display(cluster_i['venue type'].value_counts().index[0:9])
#     display(cluster_i['Borough'].unique())
#     for item in venue_cluster['Neighbourhood']:        
#         display(venue_stuff[venue_stuff['Neighbourhood'] == item].drop('Neighbourhood',1))
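The cell above ranks venue types within each cluster using `value_counts`. As a minimal plain-Python sketch of the same ranking, the example below counts venue types per cluster label and keeps the most frequent ones. The function `top_venue_types` and the sample pairs are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_venue_types(rows, n=9):
    """Map each cluster label to its n most frequent venue types."""
    per_cluster = {}
    for label, venue_type in rows:
        per_cluster.setdefault(label, Counter())[venue_type] += 1
    return {label: [vtype for vtype, _ in counts.most_common(n)]
            for label, counts in sorted(per_cluster.items())}


# Made-up (cluster label, venue type) pairs for illustration only.
sample_pairs = [(0, 'Park'), (0, 'Park'), (0, 'Bank'),
                (1, 'Pizza Place'), (1, 'Pizza Place'), (1, 'Pizza Place'),
                (1, 'Gym'), (1, 'Gym'), (1, 'Bank')]
top = top_venue_types(sample_pairs, n=2)
print(top)
```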

In [None]:
d = [row[0] for row in random]  # most common venue types per Toronto cluster
f = [row[1] for row in random]  # boroughs belonging to each Toronto cluster

# Column names with correct ordinal suffixes: 1st, 2nd, 3rd, 4th, ...
suffixes = {1: 'st', 2: 'nd', 3: 'rd'}
columns = [f"{n}{suffixes.get(n, 'th')} most common venue type" for n in range(1, 10)]

toronto_final = pd.DataFrame(d, columns = columns).reset_index(drop = True)
toronto_cluster_neighbourhoods  = pd.DataFrame(f)
    

In [None]:
d = [row[0] for row in random2]  # most common venue types per Waterloo cluster
f = [row[1] for row in random2]  # neighbourhoods belonging to each Waterloo cluster

# Column names with correct ordinal suffixes: 1st, 2nd, 3rd, 4th, ...
suffixes = {1: 'st', 2: 'nd', 3: 'rd'}
columns = [f"{n}{suffixes.get(n, 'th')} most common venue type" for n in range(1, 10)]

waterloo_final = pd.DataFrame(d, columns = columns).reset_index(drop = True)
waterloo_cluster_neighbourhoods  = pd.DataFrame(f)

In [None]:
pd.DataFrame(waterloo_final.iloc[0,:])

In [None]:
pd.DataFrame(toronto_final.iloc[6,:])

In [None]:
pd.DataFrame(toronto_cluster_neighbourhoods.iloc[6,:])

In [None]:
pd.DataFrame(waterloo_cluster_neighbourhoods.iloc[0,:]).dropna()