# Data Science Capstone Project


## Purpose
The purpose of this project is to explore the venues of Toronto and Waterloo and understand the differences between a big city and a smaller city.

## Background
Predicting where to open a business is a hard problem because so many variables are involved. It is almost impossible to predict whether a certain kind of business will succeed in one neighbourhood rather than another, or in one city rather than another. Environment, culture, people, religion, time, location and aesthetic appeal all interact to determine whether a place suits a business, and the kind of business matters as well. I am no expert and do not claim to be one; I experiment, analyze and predict, so I go into this problem without preconceptions and see what comes out. In this project I look at one variable, the venue data on other people's businesses, cluster it, and see which cluster suits which kind of business.


## Methodology
I. Data acquisition and cleaning

II. Data Mining and Cleaning

III. Exploratory Data Analysis

IV. Predictive Modeling

V. Conclusions



# Setup

## Library import
We import all the required Python libraries.

In [1]:
# Import all the necessary libraries
from bs4 import BeautifulSoup # For Web Scraping

import csv
import pandas as pd #Data Analysis
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)
import numpy as np

import geocoder #get location data
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests
import json #To handle json files
from pandas import json_normalize # top-level import; the pandas.io.json path is deprecated since pandas 1.0

from sklearn.cluster import KMeans # To run k-means algorithm

import matplotlib.cm as cm
import matplotlib.colors as colors # matplotlib stuff
import matplotlib.pyplot as plt
#%matplotlib inline # backend for rendering plots wit

import folium # To visualize on map
from folium.map import *
from folium import plugins
from folium.plugins import MeasureControl
from folium.plugins import FloatImage

print('Libraries Imported')

Libraries Imported


# Data Mining and Cleaning

We retrieve all the data required for the analysis. I use the ArcGIS REST API to get the Waterloo-Kitchener neighbourhood data. I am aware that the neighbourhood locations may not be accurate; the names serve only as rough placeholders to segment the area.

Getting Waterloo Data

In [2]:
url = 'https://services.arcgis.com/ZpeBVw5o1kjit7LT/arcgis/rest/services/NeighbourhoodAssociations/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json'
data = requests.get(url).json()  


with open('data.geojson', 'w') as json_file:
    json.dump(data, json_file)

address = 'Waterloo, Ontario, Canada'

geolocator = Nominatim(user_agent="waterloo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Waterloo City are {}, {}.'.format(latitude, longitude))

element = []

for item in data['features']:
    ring = item['geometry']['rings'][0] # first ring of the polygon as [long, lat] pairs
    # Pick one boundary vertex (about five-sixths of the way around the ring) as a rough location;
    # min() keeps the index in range for very short rings
    i = min(round(len(ring)/2)+round(len(ring)/3), len(ring)-1)
    element.append([item['attributes']['NAME'],item['attributes']['TYPE'],ring[i][1],ring[i][0]])


df = pd.DataFrame(element,columns = ['Neighbourhood', 'Neighbourhood Type','lat','long'])

df.head()

The geographical coordinates of Waterloo City are 43.466874, -80.524635.


| | Neighbourhood | Neighbourhood Type | lat | long |
|---|---|---|---|---|
| 0 | Uptown West | Neighbourhood Association | 43.457114 | -80.532676 |
| 1 | Mary Allen | Neighbourhood Association | 43.459286 | -80.511531 |
| 2 | Conservation Meadows | Neighbourhood | 43.498458 | -80.573241 |
| 3 | Upper Beechwood II | Homes Association | 43.458892 | -80.571388 |
| 4 | Beechwood North | Homes Association | 43.465507 | -80.563759 |
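
The cell above uses a single boundary vertex as each neighbourhood's location. One way to get a more central point is to average the vertices of the ring. The following is a minimal sketch, assuming each ring is a closed list of `[long, lat]` pairs as in the response above; `ring_mean_point` is a hypothetical helper that is not part of the original notebook, and the sample ring is made up for illustration.

```python
def ring_mean_point(ring):
    """Return (lat, lng) as the mean of a ring's vertices.

    Assumes `ring` is a list of [lng, lat] pairs. If the last vertex
    repeats the first (a closed ring), the repeat is dropped so that
    it is not counted twice.
    """
    points = ring[:-1] if len(ring) > 1 and ring[0] == ring[-1] else ring
    lng = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lat, lng


# Made-up square ring, for illustration only (not real neighbourhood data)
sample_ring = [
    [-80.54, 43.45],
    [-80.52, 43.45],
    [-80.52, 43.47],
    [-80.54, 43.47],
    [-80.54, 43.45],
]
lat, lng = ring_mean_point(sample_ring)
print(round(lat, 2), round(lng, 2))
```

The vertex mean is not the true area centroid of the polygon, but for roughly convex neighbourhoods it should land closer to the middle than a point on the boundary.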


Getting Kitchener Data

In [3]:
url = 'https://services1.arcgis.com/qAo1OsXi67t7XgmS/arcgis/rest/services/Neighbourhood_Association/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json'

data2 = requests.get(url).json() 

data2['features'][0]

element = []

for item in data2['features']:
    ring = item['geometry']['rings'][0] # first ring of the polygon as [long, lat] pairs
    # Pick one boundary vertex (about a fifth of the way around the ring) as a rough location
    i = round(len(ring)/5)
    element.append([item['attributes']['MAPLABEL1'],item['attributes']['NEIGHBOURHOOD_ASSOCIATION'],ring[i][1],ring[i][0]])
    
element[:5]

df2 = pd.DataFrame(element,columns = ['Neighbourhood', 'Neighbourhood Type','lat','long'])

df2.head()

| | Neighbourhood | Neighbourhood Type | lat | long |
|---|---|---|---|---|
| 0 | Victoria Hills N. A. (27) | VICTORIA HILLS NEIGHBOURHOOD ASSOCIATION | 43.439768 | -80.507431 |
| 1 | King East N. A. (19) | KING EAST NEIGHBOURHOOD ASSOCIATION | 43.446339 | -80.472788 |
| 2 | Auditorium N. A. (2) | AUDITORIUM NEIGHBOURHOOD ASSOCIATION | 43.445602 | -80.470959 |
| 3 | Westmount N. A. (29) | WESTMOUNT NEIGHBOURHOOD ASSOCIATION | 43.446995 | -80.522008 |
| 4 | Bridgeport Community Assoc. (4) | BRIDGEPORT COMMUNITY ASSOCIATION | 43.491443 | -80.480919 |


Combining them to get the bigger dataset

In [4]:
df_big = pd.concat([df,df2],axis = 0,sort = False).reset_index()

df_big.head()

| | index | Neighbourhood | Neighbourhood Type | lat | long |
|---|---|---|---|---|---|
| 0 | 0 | Uptown West | Neighbourhood Association | 43.457114 | -80.532676 |
| 1 | 1 | Mary Allen | Neighbourhood Association | 43.459286 | -80.511531 |
| 2 | 2 | Conservation Meadows | Neighbourhood | 43.498458 | -80.573241 |
| 3 | 3 | Upper Beechwood II | Homes Association | 43.458892 | -80.571388 |
| 4 | 4 | Beechwood North | Homes Association | 43.465507 | -80.563759 |
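
Calling `reset_index()` after `pd.concat` keeps the old row labels as an extra `index` column, which is visible in the output above. If that column is not needed, `ignore_index=True` renumbers the rows directly. The sketch below uses two tiny made-up frames in place of the Waterloo and Kitchener tables, so the names `a`, `b` and `combined` are illustrative only.

```python
import pandas as pd

# Two tiny made-up frames standing in for the Waterloo and Kitchener tables
a = pd.DataFrame({'Neighbourhood': ['A', 'B'], 'lat': [43.45, 43.46]})
b = pd.DataFrame({'Neighbourhood': ['C'], 'lat': [43.44]})

# ignore_index=True renumbers the rows 0..n-1 and does not keep
# the old row labels as a separate column
combined = pd.concat([a, b], axis=0, ignore_index=True)

print(list(combined.columns))
print(list(combined.index))
```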


Visualizing this DataSet on a map

In [5]:
KW_map = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat,lng,neighbourhood_type in zip(df_big['lat'],df_big['long'],df_big['Neighbourhood Type']):
    # The popup shows the neighbourhood type; parse_html=True escapes the label text
    label = folium.Popup(f'{neighbourhood_type}', parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup=label,
        color = 'green',
        fill = True,
    ).add_to(KW_map)
    

KW_map

Getting venue data from this DataFrame

In [6]:
names = df_big['Neighbourhood']
latitudes = df_big['lat']
longtitudes = df_big['long']
radius = 500

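import os # credentials are read from the environment instead of being hard-coded
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID (variable name is an assumption; set it before running)
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret (variable name is an assumption)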
VERSION = '20200528' # Foursquare API version
LIMIT = 500

list = [] # per-neighbourhood lists of venue tuples (note: this name shadows the built-in list)
missing_values = [] # neighbourhoods for which no venues were returned

for name,lat,long in zip(names,latitudes,longtitudes):
    url = f'https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{long}&radius={radius}&limit={LIMIT}'
    try:
        # One request per neighbourhood; 'results' avoids shadowing the json module
        results = requests.get(url).json()
        venues = results['response']['groups'][0]['items']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        continue # skip neighbourhoods whose request or response parsing failed
    if len(venues) == 0:
        missing_values.append(name)
    try:
        list.append([(
                    name,
                    lat,
                    long,
                    v['venue']['name'], 
                    v['venue']['location']['lat'],
                    v['venue']['categories'][0]['name'],
                    v['venue']['location']['lng']) for v in venues]
                   )
    except (KeyError, IndexError):
        continue # skip neighbourhoods where a venue lacks an expected field
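
The extraction step in the loop above can also be written as a small function that is easy to check without calling the API. This is a minimal sketch: `venue_rows` is a hypothetical helper, and the sample payload is made up and only mirrors the nesting (`response`, `groups`, `items`, `venue`) that the loop relies on. Unlike the loop, it keeps a venue that has no category and records `None` for its type.

```python
def venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into row tuples.

    Returns an empty list when the expected keys are missing or
    when no venues were returned.
    """
    try:
        items = payload['response']['groups'][0]['items']
    except (KeyError, IndexError, TypeError):
        return []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # Fall back to an empty category when the list is missing or empty
        categories = venue.get('categories') or [{}]
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            categories[0].get('name'),
            location.get('lng'),
        ))
    return rows


# Made-up payload for illustration; not a real API response
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.4581, 'lng': -80.5321},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'No Category Place',
               'location': {'lat': 43.4590, 'lng': -80.5330},
               'categories': []}},
]}]}}

rows = venue_rows('Uptown West', 43.457114, -80.532676, sample)
print(len(rows))
print(venue_rows('Empty', 0.0, 0.0, {}))
```

The tuples keep the same field order as the loop above, so they fit the same column list when a DataFrame is built from them.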

In [15]:
list[0][0] # Looking at what this list looks like

('Uptown West',
 43.457113845,
 -80.532676196,
 "Fo'cheezy",
 43.458229035211716,
 'Food Truck',
 -80.53239871108174)

In [16]:
missing_values # The neighbourhoods which have no venues

['Conservation Meadows', 'Huron Community Assoc. (18) ']

In [7]:
venue_stuff = pd.DataFrame([item for venue_list in list for item in venue_list],columns = ['Neighbourhood','lat','long','venue','venue lat','venue type','venue long'])

In [8]:
venue_stuff.to_csv(r'C:\Users\Abhik\Downloads\venue_waterloo_data.csv',index = False)

In [9]:
venue_stuff.head()

| | Neighbourhood | lat | long | venue | venue lat | venue type | venue long |
|---|---|---|---|---|---|---|---|
| 0 | Uptown West | 43.457114 | -80.532676 | Our Lady of Lourdes Ball Field | 43.457637 | Baseball Field | -80.530302 |
| 1 | Uptown West | 43.457114 | -80.532676 | Fo'cheezy | 43.458229 | Food Truck | -80.532399 |
| 2 | Uptown West | 43.457114 | -80.532676 | Empire Public School | 43.455051 | Playground | -80.535269 |
| 3 | Uptown West | 43.457114 | -80.532676 | Mirage Las Vegas | 43.454707 | Hotel | -80.529663 |
| 4 | Uptown West | 43.457114 | -80.532676 | Optometrist | 43.459087 | Optical Shop | -80.53713 |
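
The venue table is built by flattening a list of per-neighbourhood lists into one list of rows. The sketch below shows the same flattening with `itertools.chain`, followed by a quick count of venues per neighbourhood as a sanity check. The nested data is made up for illustration, and the names `nested`, `flat`, `records` and `per_neighbourhood` are not from the original notebook.

```python
from collections import Counter
from itertools import chain

columns = ['Neighbourhood', 'lat', 'long', 'venue',
           'venue lat', 'venue type', 'venue long']

# Made-up nested structure: one inner list per neighbourhood,
# including an empty one for a neighbourhood with no venues
nested = [
    [('N1', 43.45, -80.53, 'Venue A', 43.451, 'Park', -80.531),
     ('N1', 43.45, -80.53, 'Venue B', 43.452, 'Hotel', -80.532)],
    [],
    [('N2', 43.44, -80.50, 'Venue C', 43.441, 'Park', -80.501)],
]

# Flatten the inner lists into a single list of row tuples
flat = list(chain.from_iterable(nested))

# Pair each value with its column name, then count rows per neighbourhood
records = [dict(zip(columns, row)) for row in flat]
per_neighbourhood = Counter(r['Neighbourhood'] for r in records)

print(len(flat))
print(dict(per_neighbourhood))
```

Empty inner lists simply contribute no rows, which is why neighbourhoods without venues do not appear in the flattened table.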
