# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

**Bakeries** are a popular type of food service establishment. The smell of freshly baked goods and fantastic coffee is what I call heaven.
My friend Mia loves to bake and has just finished baking school. Opening a bakery presents many unique challenges, so I want to help her choose a location where she can open one with a low risk of competition.

This project aims to analyze and select the best locations in **Pune**, India, to open a new bakery. It focuses on geospatial analysis of the city and uses data science methodology and machine learning techniques such as **clustering** to answer the business question: if a person is looking to open a new bakery in Pune, where would you recommend that they open it?


In [73]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from bs4 import BeautifulSoup

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

/bin/bash: conda: command not found
/bin/bash: conda: command not found
Libraries imported.


## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing bakeries in the neighborhood
* distance of neighborhood from city center

The following data sources will be needed to extract or generate the required information:
* List of neighbourhoods in Pune. This defines the project's scope, which is confined to the city of Pune.
* Latitude and longitude coordinates of those neighbourhoods. These are required to plot the map and to fetch the venue data.
* Venue data, particularly data related to bakeries. We will use this data to perform clustering on the neighbourhoods.


In [74]:
# Send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Pune").text
# Parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')
# Create a list to store neighbourhood data
neighborhoodList = []
# Append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].findAll("li"):
  neighborhoodList.append(row.text)
# Create a new DataFrame from the list
kl_df = pd.DataFrame({"Neighborhood": neighborhoodList})
kl_df.head()

Unnamed: 0,Neighborhood
0,Appa Balwant Chowk
1,"Aundh, Pune"
2,Balewadi
3,Baner
4,Bavdhan


### Neighborhood Candidates

This is the data frame created after scraping the data. To use the Foursquare API, we need the geographical coordinates (latitude and longitude) of each neighbourhood. To get them, we will use the Geocoder package, which converts an address into latitude and longitude.


In [75]:
#!pip install geocoder
import geocoder
# Defining a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Pune, India'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in kl_df["Neighborhood"].tolist()]

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)
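
The `while` loop in `get_latlng` keeps retrying until the geocoder returns a result, so a persistent timeout like the one logged above could make it spin forever. A minimal sketch of a bounded retry is shown here. The function name `get_latlng_bounded`, the injected `geocode_fn` callable, and the default attempt count and pause are assumptions for illustration, not part of the original notebook.

```python
import time

def get_latlng_bounded(neighborhood, geocode_fn, max_attempts=3, pause_seconds=1.0):
    """Return [lat, lng] for a neighbourhood, or None after max_attempts failures.

    geocode_fn is a hypothetical callable that takes a query string and
    returns [lat, lng] or None, e.g. lambda q: geocoder.arcgis(q).latlng
    """
    query = '{}, Pune, India'.format(neighborhood)
    for attempt in range(max_attempts):
        coords = geocode_fn(query)
        if coords is not None:
            return coords
        # Wait before retrying, but not after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)
    # Give up and let the caller decide how to handle the missing value
    return None
```

Neighbourhoods that come back as `None` can then be dropped or geocoded by hand before plotting.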


In [76]:
coords

[[18.516483671884753, 73.85387026191101],
 [18.563450000000046, 73.81227000000007],
 [18.576020000000028, 73.77983000000006],
 [18.548200000000065, 73.77316000000008],
 [18.50747000000007, 73.78236000000004],
 [18.509030000000052, 73.87317000000007],
 [18.579220000000078, 73.74352000000005],
 [18.516890000000046, 73.85617000000008],
 [18.51244931570263, 73.85657158825195],
 [18.494410000000073, 74.39857000000006],
 [18.515850000000057, 73.84061000000008],
 [18.46628000000004, 73.85325000000006],
 [18.447020000000066, 73.80757000000006],
 [18.509650000000022, 73.83124000000004],
 [18.505840000000035, 73.90232000000003],
 [18.514030000000048, 73.86287000000004],
 [18.503460000000075, 73.86432000000008],
 [18.50577000000004, 73.86142000000007],
 [18.502490000000023, 73.92709000000008],
 [18.591420000000028, 73.73895000000005],
 [18.546460000000025, 73.90067000000005],
 [18.52305000000007, 73.85825000000006],
 [18.544620000000066, 73.93922000000003],
 [18.535330000000044, 73.89382000000006

Looking good. Let's now place all this into a Pandas dataframe.

In [77]:
# Create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
# Merge the coordinates into the original dataframe
kl_df['Latitude'] = df_coords['Latitude']
kl_df['Longitude'] = df_coords['Longitude']
print(kl_df.shape)
kl_df.head()

(57, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Appa Balwant Chowk,18.516484,73.85387
1,"Aundh, Pune",18.56345,73.81227
2,Balewadi,18.57602,73.77983
3,Baner,18.5482,73.77316
4,Bavdhan,18.50747,73.78236


This is the combined data frame which contains all the neighbourhoods along with the geographical coordinates.

Let's first find the latitude and longitude of Pune city, using the Nominatim geocoder from the geopy package.

In [78]:
# Getting the coordinates of Pune
address = 'Pune, India'
geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Pune, India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Pune, India are 18.521428, 73.8544541.


With the data gathered into a pandas DataFrame, we can visualize the neighbourhoods on a map using the Folium package.

In [79]:
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)
# Adding markers to map
for lat, lng, neighborhood in zip(kl_df['Latitude'],  kl_df['Longitude'], kl_df['Neighborhood']):
 label = '{}'.format(neighborhood)
 label = folium.Popup(label, parse_html=True)
 folium.CircleMarker([lat, lng],radius=5,popup=label,color='blue',fill=True,fill_color='#3186cc',fill_opacity=0.7).add_to(map_kl)
map_kl

### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on bakeries in each neighborhood.

We're interested in venues in the 'Bakery' category, so we will look for venues whose category name is 'Bakery'.

In [80]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'
radius = 2000
LIMIT = 100
venues = []
for lat, long, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
  # Create the API request URL
  url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,long,radius,LIMIT)
  # Make the GET request
  results = requests.get(url).json()["response"]['groups'][0]['items']
  # Return only relevant information for each nearby venue
  for venue in results:
      venues.append((neighborhood,lat,long,venue['venue']['name'],
      venue['venue']['location']['lat'],venue['venue']['location']['lng'],venue['venue']['categories'][0]['name']))
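
The loop above indexes straight into the JSON response, so a single response without `groups` or a venue without categories would raise an exception and stop the whole run. A hedged sketch of a more defensive version follows. The helper names `build_explore_params` and `extract_venues` are hypothetical; the parameter names and the `response -> groups -> items` layout are taken from the request URL and indexing used in the cell above.

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the query parameters used by the explore request in this notebook."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }

def extract_venues(payload, neighborhood, lat, lng):
    """Turn one decoded JSON response into venue tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to an empty category when a venue has none
        categories = venue.get("categories") or [{}]
        rows.append((neighborhood, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name")))
    return rows
```

With these helpers, each request could be made as `requests.get("https://api.foursquare.com/v2/venues/explore", params=build_explore_params(...), timeout=10)`, and the decoded JSON passed to `extract_venues`.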

After extracting all the venues, we have to convert the venues list into a new DataFrame.

In [81]:
venues_df = pd.DataFrame(venues)
# Defining the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
print(venues_df.shape)
venues_df.head()

(3369, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Appa Balwant Chowk,18.516484,73.85387,Bhagat Tarachand,18.514332,73.851317,Indian Restaurant
1,Appa Balwant Chowk,18.516484,73.85387,Lal Mahal,18.51872,73.856556,Historic Site
2,Appa Balwant Chowk,18.516484,73.85387,Sujata Mastani,18.511793,73.852145,Ice Cream Shop
3,Appa Balwant Chowk,18.516484,73.85387,Fish Curry Rice,18.516415,73.850934,Seafood Restaurant
4,Appa Balwant Chowk,18.516484,73.85387,Raja Dinkar Kelkar museum,18.510744,73.854389,History Museum


In [82]:
# Let's check how many venues were returned for each neighbourhood
venues_df.groupby(["Neighborhood"]).count()
# Let's check how many unique categories can be curated from all the returned values
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))
# Displaying the first 50 Venue Category names
venues_df['VenueCategory'].unique()[:50]

There are 143 unique categories.


array(['Indian Restaurant', 'Historic Site', 'Ice Cream Shop',
       'Seafood Restaurant', 'History Museum',
       'Vegetarian / Vegan Restaurant', 'Jewelry Store', 'Juice Bar',
       'Tea Room', 'Donut Shop', 'Café', 'Snack Place', 'Motorcycle Shop',
       'Fast Food Restaurant', 'BBQ Joint', 'Bistro', 'Bakery',
       'Sandwich Place', "Women's Store", 'Coffee Shop',
       'South Indian Restaurant', 'Diner', 'Road', 'Italian Restaurant',
       'Restaurant', 'Frozen Yogurt Shop', 'Dessert Shop', 'Burger Joint',
       'Pizza Place', 'Maharashtrian Restaurant', 'Stationery Store',
       'Bar', 'Convenience Store', 'Hookah Bar', 'Hotel',
       'Asian Restaurant', 'Food Truck', 'Comfort Food Restaurant',
       'Gym / Fitness Center', 'Hospital', 'Garden', 'Department Store',
       'Gym', 'Bookstore', 'Korean Restaurant', 'Chocolate Shop',
       'Shopping Mall', 'Grocery Store', 'Mexican Restaurant',
       'Chinese Restaurant'], dtype=object)

## Analyzing each neighbourhood

In [83]:
# One hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")
# Adding neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood']
# Moving neighbourhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]
print(kl_onehot.shape)

(3369, 144)



Next, let’s group rows of neighbourhood by taking the sum of the frequency of occurrence of each category.

In [84]:
kl_grouped=kl_onehot.groupby(["Neighborhoods"]).sum().reset_index()
print(kl_grouped.shape)

(56, 144)
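
To make the one-hot encoding and grouping steps concrete, here is a small sketch on a toy table. The rows and the `toy_` names are invented for illustration and are not Foursquare results. It also shows that `pd.crosstab` yields the same per-neighbourhood counts in one call.

```python
import pandas as pd

# Toy venue table (hypothetical rows, not the Foursquare results)
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "VenueCategory": ["Bakery", "Bakery", "Tea Room", "Tea Room", "Hotel"],
})

# One-hot encode the category, then put the neighbourhood label in front
toy_onehot = pd.get_dummies(toy_venues["VenueCategory"], dtype=int)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Summing the indicator columns per neighbourhood gives venue counts per category
toy_grouped = toy_onehot.groupby("Neighborhood").sum().reset_index()

# pd.crosstab produces the same counts directly from the two columns
toy_counts = pd.crosstab(toy_venues["Neighborhood"], toy_venues["VenueCategory"])
print(toy_grouped)
```

Each row of `toy_grouped` is one neighbourhood and each remaining column is a category count, which is the shape `kl_grouped` has in this notebook.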


In [85]:
len((kl_grouped[kl_grouped["Bakery"] > 0]))

40

Of the 56 neighbourhoods, 40 have at least one bakery, which is a high share. So we need to find locations where the number of bakeries is low, so that a new bakery there faces less competition.

## Methodology <a name="methodology"></a>

The Foursquare API allows application developers to interact with the Foursquare platform. It is a RESTful API: you send HTTP requests to a set of endpoints, so there is nothing to download onto your server.

In [86]:
# Creating a dataframe for Bakery data only
kl_bakery = kl_grouped[["Neighborhoods","Bakery"]]

## Clustering the neighbourhoods
Now we need to group the neighbourhoods into clusters. The results will allow us to identify which neighbourhoods have a higher number of bakeries, a lower number, or none at all. Based on the occurrence of bakeries in different neighbourhoods, this will help us answer the question of which neighbourhoods are most suitable for opening a new bakery.

In [87]:
# Setting the number of clusters
kclusters = 3
kl_clustering = kl_bakery.drop(columns=["Neighborhoods"])
# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(kl_clustering)
# Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 0, 0, 0, 2, 2, 0, 2, 2, 2], dtype=int32)

We set the number of clusters to 3 and run the algorithm. After applying the K-Means clustering algorithm, all the neighbourhoods get segregated and form different clusters.
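
Because the only feature here is a single bakery count per neighbourhood, the clustering is effectively one-dimensional. The sketch below is a plain-Python illustration of the assign-and-update loop behind k-means on toy counts. It is not scikit-learn's implementation: `kmeans_1d`, the evenly spaced starting centroids, and the sample counts are all assumptions made for illustration, and scikit-learn's `KMeans` uses k-means++ initialisation by default, so its label numbering can differ.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values with a basic assign/update loop; returns (labels, centroids)."""
    lo, hi = min(values), max(values)
    # Hypothetical deterministic start: k centroids evenly spaced over the data range
    if k > 1:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centroids = [sum(values) / len(values)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

toy_counts = [0, 0, 0, 1, 1, 2, 2, 3, 5, 6]  # made-up bakery counts
toy_labels, toy_centroids = kmeans_1d(toy_counts, k=3)
print(toy_labels)
```

On this toy input the three groups correspond to low, medium, and high counts, which mirrors how the neighbourhoods are separated by bakery count in this notebook.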

In [88]:
# Creating a new dataframe that includes the bakery count and the cluster label for each neighborhood.
kl_merged = kl_bakery.copy()
# Add the clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head(10)

Unnamed: 0,Neighborhood,Bakery,Cluster Labels
0,Appa Balwant Chowk,1,2
1,"Aundh, Pune",0,0
2,Balewadi,0,0
3,Baner,0,0
4,Bavdhan,2,2
5,"Bhavani Peth, Pune",1,2
6,Blue Ridge Town Pune,0,0
7,"Budhwar Peth, Pune",1,2
8,"Chakan, Pune",1,2
9,Deccan Gymkhana,2,2


Here the Bakery column gives the number of bakeries found in that neighbourhood, and Cluster Labels gives the cluster number (0, 1, or 2).

In [89]:
# Adding latitude and longitude values to the existing dataframe
kl_merged['Latitude'] = kl_df['Latitude']
kl_merged['Longitude'] = kl_df['Longitude']
# Sorting the results by Cluster Labels
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

Unnamed: 0,Neighborhood,Bakery,Cluster Labels,Latitude,Longitude
11,Dhayari,0,0,18.46628,73.85325
47,"Sus, Pune",0,0,18.52031,73.86768
13,Fatimanagar,0,0,18.50965,73.83124
29,Marunji,0,0,18.55874,73.98737
54,Warje,0,0,18.604,73.75038
10,Dhankawadi,0,0,18.51585,73.84061
28,Manjri,0,0,18.52458,73.86465
21,Kharadi,0,0,18.52305,73.85825
6,Blue Ridge Town Pune,0,0,18.57922,73.74352
36,Pashan,0,0,18.50741,73.84409


The rows shown above all belong to the first cluster (cluster number 0). The full dataframe is sorted by cluster label, so clusters 0, 1, and 2 appear in that order.
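
One caveat about the coordinate step: assigning `kl_df['Latitude']` to `kl_merged` aligns rows by index, not by name. Since `kl_df` has 57 rows and the grouped table has 56, some neighbourhoods may end up paired with another neighbourhood's coordinates. The toy sketch below, with invented names and values, shows the effect and a possible alternative that merges on the neighbourhood name.

```python
import pandas as pd

# Toy stand-ins for kl_df and the grouped bakery table (values are made up)
coords_df = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [18.50, 18.55, 18.60],
    "Longitude": [73.80, 73.85, 73.90],
})
# "B" returned no venues, so it is missing after grouping
counts_df = pd.DataFrame({"Neighborhood": ["A", "C"], "Bakery": [1, 0]})

# Index-based assignment pairs row 1 of counts_df ("C") with row 1 of coords_df ("B")
by_index = counts_df.copy()
by_index["Latitude"] = coords_df["Latitude"]

# Merging on the name keeps each neighbourhood with its own coordinates
by_name = counts_df.merge(coords_df, on="Neighborhood", how="left")
print(by_name)
```

In this notebook the equivalent would be a merge of `kl_merged` with `kl_df` on `Neighborhood`, which would likely change some of the coordinates shown above.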

## Visualizing the resulting clusters

In [90]:
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# Setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
  label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
  folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster-1],fill=True,fill_color=rainbow[cluster-1],fill_opacity=0.7).add_to(map_clusters)
map_clusters

In [91]:
kl_merged[kl_merged['Cluster Labels']==0].Bakery.unique()

array([0], dtype=uint8)

In [92]:
kl_merged[kl_merged['Cluster Labels']==1].Bakery.unique()

array([5, 3, 6, 4], dtype=uint8)

In [93]:
kl_merged[kl_merged['Cluster Labels']==2].Bakery.unique()

array([2, 1], dtype=uint8)

## Examining the clusters

In [94]:
print(len(kl_merged.loc[kl_merged['Cluster Labels'] == 0]))
print(len(kl_merged.loc[kl_merged['Cluster Labels'] == 1]))
print(len(kl_merged.loc[kl_merged['Cluster Labels'] == 2]))

16
6
34


## Results and Discussion <a name="results"></a>

The results from the K-means clustering show that we can categorize the neighbourhoods into 3 clusters based on the number of venues in the "Bakery" category:
-  Cluster 0: 16 neighbourhoods with no bakeries
-  Cluster 1: 6 neighbourhoods with a high number of bakeries (3 to 6)
-  Cluster 2: 34 neighbourhoods with a moderate number of bakeries (1 or 2)
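
The per-cluster checks above were done with separate cells for each label. As a sketch of a more compact alternative, a single `groupby` can report the size and bakery range of every cluster at once. The toy frame below uses made-up rows and only mirrors the column names of `kl_merged`.

```python
import pandas as pd

# Toy stand-in for kl_merged (rows are invented for illustration)
toy_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Bakery": [0, 0, 1, 2, 4, 6],
    "Cluster Labels": [0, 0, 2, 2, 1, 1],
})

# Size, minimum and maximum bakery count for each cluster in one table
cluster_summary = toy_merged.groupby("Cluster Labels")["Bakery"].agg(["count", "min", "max"])
print(cluster_summary)
```

Applied to `kl_merged`, the same call would give the cluster sizes and bakery ranges in one table.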


We visualize the results of the clustering in the map with cluster 0 in red colour, cluster 1 in purple colour, and cluster 2 in mint green colour.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Pune with a low number of bakeries, in order to narrow down the search for a location where Mia can open a new bakery.
Using Foursquare venue data, we first found that 40 of the 56 neighbourhoods already have at least one bakery.
The neighbourhoods were then clustered by bakery count into three groups (none, moderate, and high), and the neighbourhoods with few or no bakeries can serve as starting points for final exploration by stakeholders.

The final decision on the optimal bakery location will be made by the soon-to-be owner (Mia), based on the specific characteristics of the neighbourhoods and locations in each recommended group, taking into consideration the number of bakeries already present in the neighbourhood.