# Coursera Capstone Battle of The Neighborhoods

This notebook is used for the Week 4 and 5 Capstone Project (The Battle of the Neighborhoods) of the IBM Data Science Specialization on Coursera.

This project looks for the best area of Mumbai, India, in which to open a café.

## Step 1: Importing Associated Libraries

In [1]:
# library to handle data in a vectorized manner
import numpy as np


# library for data analysis
import pandas as pd


# library to handle JSON files
import json


# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim


# to get coordinates
import geocoder


# library to handle requests
import requests


# library to parse HTML and XML documents
from bs4 import BeautifulSoup


# transform JSON file into a pandas dataframe
# (pandas >= 1.0 exposes json_normalize at the top level; the old pandas.io.json path is deprecated)
from pandas import json_normalize


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


# import k-means from clustering stage
from sklearn.cluster import KMeans


# map rendering library
import folium

print("Libraries imported.")

Libraries imported.


## Step 2: Scraping Mumbai neighborhood data from Wikipedia

In [2]:
url = "https://en.wikipedia.org/wiki/Category:Suburbs_of_Mumbai"

# send the GET request
url_html = requests.get(url).text

# parse data from the html above
html_doc = BeautifulSoup(url_html, 'html.parser')

## Step 3: Extracting the neighborhoods into a dataframe

In [3]:
# initiating an empty list to store the neighborhood data
neighborhoods = []

# append the data into the list
for row in html_doc.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoods.append(row.text)

# creating a dataframe of the neighborhoods list
df_mumbai = pd.DataFrame({"Neighborhood": neighborhoods})

df_mumbai.set_index('Neighborhood', inplace = True)
df_mumbai.drop('Uttan', inplace = True)

df_mumbai.reset_index(inplace = True)
df_mumbai.head()

Unnamed: 0,Neighborhood
0,Andheri
1,Anushakti Nagar
2,Baiganwadi
3,Bandra
4,Bhandup


In [4]:
df_mumbai.shape

(41, 1)
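
The selection in Step 3 (every `li` inside the `div` with class `mw-category`) can be checked offline before hitting Wikipedia. The following is a minimal sketch, not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser`, and the HTML snippet and the class name `CategoryListParser` are made up for illustration.

```python
from html.parser import HTMLParser


class CategoryListParser(HTMLParser):
    """Collect the text of every <li> nested inside <div class="mw-category">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the mw-category div (counts nested divs)
        self.in_li = False
        self.items = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth or "mw-category" in classes:
                self.div_depth += 1
        elif tag == "li" and self.div_depth:
            self.in_li = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li" and self.in_li:
            self.items.append("".join(self._buffer).strip())
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self._buffer.append(data)


# Hypothetical snippet that mimics the structure of a category page.
sample_html = (
    '<div class="mw-category"><ul>'
    '<li><a href="#">Andheri</a></li><li><a href="#">Bandra</a></li>'
    '</ul></div>'
    '<div><ul><li>Unrelated item</li></ul></div>'
)

parser = CategoryListParser()
parser.feed(sample_html)
items = parser.items
print(items)
```

Only list items inside the category container are kept, which is the same filtering the `find_all` chain performs on the live page.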

## Step 4: Getting the Lat, Long of the neighborhoods

In [5]:
# creating a function to get the lat long

def get_latlng(neighborhood):
    
    # initialize lat, long  variable to None
    lat_lng_coords = None
    
    # loop until we get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Mumbai, India'.format(neighborhood))
        lat_lng_coords = g.latlng
        
    return lat_lng_coords
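
The `while` loop in `get_latlng` never exits if the geocoder keeps returning `None` for a name. A bounded retry is one way to avoid that. This is a minimal sketch under assumptions: the helper name `geocode_with_retry`, its parameters, and the stub lookup are hypothetical, and the demonstration runs offline rather than calling ArcGIS.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning [lat, lng] or None. In the notebook it
    could be wired as: lambda q: geocoder.arcgis(q).latlng (not executed here).
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # caller decides how to handle a neighborhood that never resolves


# Offline demonstration: a stub that fails twice before succeeding.
responses = iter([None, None, [19.11847, 72.84177]])
result = geocode_with_retry(
    lambda q: next(responses), "Andheri, Mumbai, India", delay_seconds=0
)
print(result)
```

Returning `None` after the last attempt lets the caller drop or flag the neighborhood instead of blocking the whole notebook.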

In [6]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [get_latlng(neighborhood) for neighborhood in df_mumbai["Neighborhood"].tolist()]

In [7]:
# converting the list to a dataframe
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

df_coords.head()

Unnamed: 0,Latitude,Longitude
0,19.11847,72.84177
1,19.04283,72.92734
2,19.06294,72.92663
3,19.05437,72.84017
4,19.14556,72.94856


In [8]:
# merge the coordinates into the original dataframe
df_mumbai['Latitude'] = df_coords['Latitude']
df_mumbai['Longitude'] = df_coords['Longitude']

In [9]:
# check the neighborhoods and the coordinates
print(df_mumbai.shape)
df_mumbai

(41, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Andheri,19.11847,72.84177
1,Anushakti Nagar,19.04283,72.92734
2,Baiganwadi,19.06294,72.92663
3,Bandra,19.05437,72.84017
4,Bhandup,19.14556,72.94856
5,Borivali,19.22936,72.85751
6,Charkop,19.20866,72.82612
7,Chembur,19.053995,72.899675
8,Dahisar,19.25003,72.85907
9,Devipada,19.22469,72.86605


## Step 5: Creating a map of Mumbai with neighborhoods as markers

In [10]:
# get the coordinates of Mumbai
address = 'Mumbai, India'

geolocator = Nominatim(user_agent="mumbai_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai, India are: {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai, India are: 19.0759899, 72.8773928.


In [11]:
# create map of Mumbai using latitude and longitude values
map_mumbai = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df_mumbai['Latitude'], df_mumbai['Longitude'], df_mumbai['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mumbai)  
    
map_mumbai

## Step 6: Use the Foursquare API to explore the neighborhoods

In [12]:
# define Foursquare Credentials and Version
CLIENT_ID = 'UFQ1VCPKDL1FK2RXWEBR43PE5IZUMXHIMUY2AOHGVNRE0RJJ' # My Foursquare ID
CLIENT_SECRET = 'AHLXOHS0IE143EQSRV3MORUWXCVCFCIO2PH2QJVZJFQ0V5KP' # My Foursquare Secret
VERSION = '20200721' # Foursquare API version

print('My Foursquare credentials are:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My Foursquare credentials are:
CLIENT_ID: UFQ1VCPKDL1FK2RXWEBR43PE5IZUMXHIMUY2AOHGVNRE0RJJ
CLIENT_SECRET:AHLXOHS0IE143EQSRV3MORUWXCVCFCIO2PH2QJVZJFQ0V5KP
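
Hard-coding and printing API credentials exposes them to anyone who can read the notebook. One common alternative is to read them from environment variables and mask them when logging. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical, and uses placeholder values for the demonstration.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read credentials from a mapping of environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            "Missing environment variable: {}".format(missing.args[0])
        ) from None


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with placeholder values instead of real credentials.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id-123456",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret-abcdef",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```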


In [13]:
radius = 2500 # Venues within a radius of 2.5 Km
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(df_mumbai['Latitude'], df_mumbai['Longitude'], df_mumbai['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
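
The loop above indexes `["response"]["groups"][0]["items"]` and `["categories"][0]` directly, so an empty response or a venue without a category raises an exception partway through. The following sketch shows a more defensive flattening step. The function name `extract_venues` and the sample payload are hypothetical; the payload only mimics the nesting the notebook's code already relies on.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore-style response into row tuples.

    Returns an empty list when the response has no groups, and uses None
    as the category when a venue has no categories.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written sample; the second venue is invented to exercise the empty-category path.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Merwans Cake shop",
               "location": {"lat": 19.1193, "lng": 72.845418},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Unlabelled venue",
               "location": {"lat": 19.12, "lng": 72.84},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Andheri", 19.11847, 72.84177)
for row in rows:
    print(row)
```

Rows with a `None` category can be dropped before one-hot encoding, or kept and counted to see how often the API omits a category.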

In [14]:
# convert the list of venues to dataframe
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

venues_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,Andheri,19.11847,72.84177,Merwans Cake shop,19.1193,72.845418,Bakery
1,Andheri,19.11847,72.84177,Radha Krishna Veg Restaurant,19.11513,72.84306,Indian Restaurant
2,Andheri,19.11847,72.84177,Naturals,19.111204,72.837255,Ice Cream Shop
3,Andheri,19.11847,72.84177,Shawarma Factory,19.124591,72.840398,Falafel Restaurant
4,Andheri,19.11847,72.84177,Quench- All Day Pub,19.114538,72.836204,Pub


In [15]:
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Andheri,100,100,100,100,100,100
Anushakti Nagar,20,20,20,20,20,20
Baiganwadi,4,4,4,4,4,4
Bandra,100,100,100,100,100,100
Bhandup,37,37,37,37,37,37
Borivali,100,100,100,100,100,100
Charkop,61,61,61,61,61,61
Chembur,98,98,98,98,98,98
Dahisar,78,78,78,78,78,78
Devipada,100,100,100,100,100,100


In [16]:
# checking the unique count of Venue Categories
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 183 unique categories.


In [17]:
# check if the results contain "Café"
"Café" in venues_df['Venue Category'].unique()

True
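
A membership test on an accented label such as "Café" depends on how the accent is encoded. The check above returned `True`, so the labels here match, but the same visible text can be stored in composed or decomposed Unicode form. The sketch below illustrates the difference and one way to guard against it; `normalize_label` is a hypothetical helper name.

```python
import unicodedata


def normalize_label(text):
    """Return the NFC (composed) form so visually identical labels compare equal."""
    return unicodedata.normalize("NFC", text)


composed = "Caf\u00e9"      # 'é' as a single code point
decomposed = "Cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ even though they render the same.
print(composed == decomposed)
# After normalization they compare equal.
print(normalize_label(composed) == normalize_label(decomposed))
```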

## Step 7: Analyzing each Neighborhood

In [35]:
# one hot encoding of data as part of data wrangling
mumbai = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mumbai['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [mumbai.columns[-1]] + list(mumbai.columns[:-1])
mumbai = mumbai[fixed_columns]

mumbai.head()

Unnamed: 0,Neighborhoods,Afghan Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Theater,Theme Park,Toy / Game Store,Track,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Wine Bar,Women's Store
0,Andheri,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Group rows by neighborhood by taking the mean of the frequency of occurrence of each category
mumbai_grouped = mumbai.groupby(["Neighborhoods"]).mean().reset_index()

mumbai_grouped

Unnamed: 0,Neighborhoods,Afghan Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Theater,Theme Park,Toy / Game Store,Track,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Wine Bar,Women's Store
0,Andheri,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0,0.01,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01
1,Anushakti Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Baiganwadi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bandra,0.0,0.0,0.02,0.0,0.03,0.0,0.01,0.01,0.07,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bhandup,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.054054,0.027027,0.0,0.0
5,Borivali,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0
6,Charkop,0.0,0.016393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Chembur,0.0,0.0,0.0,0.010204,0.030612,0.0,0.0,0.0,0.010204,...,0.0,0.0,0.0,0.0,0.0,0.0,0.010204,0.030612,0.0,0.0
8,Dahisar,0.0,0.0,0.012821,0.0,0.0,0.0,0.025641,0.0,0.012821,...,0.012821,0.0,0.0,0.0,0.0,0.0,0.012821,0.012821,0.0,0.0
9,Devipada,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
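
Taking the mean of one-hot columns per neighborhood yields, for each category, the fraction of that neighborhood's venues belonging to it. The sketch below computes the same quantity with the standard library on a tiny invented dataset, which makes the meaning of the values easier to verify by hand. The function name and the rows are assumptions for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map neighborhood -> {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {
            category: n / sum(counter.values())
            for category, n in counter.items()
        }
        for neighborhood, counter in counts.items()
    }


# Invented (neighborhood, category) pairs, standing in for venues_df rows.
rows = [
    ("A", "Café"), ("A", "Bakery"), ("A", "Café"), ("A", "Pub"),
    ("B", "Bakery"),
]
freq = category_frequencies(rows)
print(freq["A"]["Café"])          # 2 of A's 4 venues are cafés
print(freq["B"].get("Café", 0.0))  # B has no cafés
```

Because the values are shares, a neighborhood with only a handful of venues can show a large café share from very few cafés.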


In [20]:
# create a new dataframe with only Café data
mumbai_cafe = mumbai_grouped[['Neighborhoods', 'Café']]

mumbai_cafe.head()

Unnamed: 0,Neighborhoods,Café
0,Andheri,0.08
1,Anushakti Nagar,0.0
2,Baiganwadi,0.0
3,Bandra,0.09
4,Bhandup,0.081081


## Step 8: Clustering the neighborhoods

In [21]:
# set number of clusters
kclusters = 5

mumbai_clustered = mumbai_cafe.drop(["Neighborhoods"], axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mumbai_clustered)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 1, 2, 1, 4, 4, 2, 2])
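
Because the clustering input is a single column (the café share), k-means here simply partitions values along one axis. To make that concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm on scalar values. It is not scikit-learn's implementation: the evenly spread initialization is an assumption chosen to keep the example deterministic, whereas `KMeans` uses k-means++ by default, so label numbering will differ.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, labels


# A few café shares in the style of the table above.
shares = [0.0, 0.0, 0.01, 0.08, 0.09, 0.19]
centers, labels = kmeans_1d(shares, k=3)
print(labels)
print(centers)
```

In one dimension the clusters are contiguous intervals of the café share, so each cluster can be read as a band such as "near zero", "moderate", or "high".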

In [22]:
# create a new dataframe that includes the cluster label
mumbai_merged = mumbai_cafe.copy()

# add clustering labels
mumbai_merged["Cluster Labels"] = kmeans.labels_


mumbai_merged.rename(columns = {'Neighborhoods': 'Neighborhood'}, inplace = True)
mumbai_merged.head()

Unnamed: 0,Neighborhood,Café,Cluster Labels
0,Andheri,0.08,2
1,Anushakti Nagar,0.0,0
2,Baiganwadi,0.0,0
3,Bandra,0.09,1
4,Bhandup,0.081081,2


In [23]:
# merge with df_mumbai to add latitude/longitude for each neighborhood
mumbai_merged = mumbai_merged.join(df_mumbai.set_index("Neighborhood"), on="Neighborhood")

print(mumbai_merged.shape)
mumbai_merged.head() # check the last columns!

(40, 5)


Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
0,Andheri,0.08,2,19.11847,72.84177
1,Anushakti Nagar,0.0,0,19.04283,72.92734
2,Baiganwadi,0.0,0,19.06294,72.92663
3,Bandra,0.09,1,19.05437,72.84017
4,Bhandup,0.081081,2,19.14556,72.94856
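
The merged frame has 40 rows while `df_mumbai` has 41, which suggests one neighborhood returned no venues and so dropped out of the grouped data. A set difference is a quick way to find which names went missing. The sketch below uses invented names; in the notebook the two inputs would presumably be `df_mumbai["Neighborhood"]` and `mumbai_cafe["Neighborhoods"]`.

```python
def missing_neighborhoods(all_names, names_with_venues):
    """Return the names present in the first collection but not the second."""
    return sorted(set(all_names) - set(names_with_venues))


# Invented example; "Example Nagar" is a placeholder, not a real result.
all_names = ["Andheri", "Bandra", "Example Nagar"]
names_with_venues = ["Andheri", "Bandra"]

missing = missing_neighborhoods(all_names, names_with_venues)
print(missing)
```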


## Step 9: Visualizing the clusters

In [24]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_merged['Latitude'], mumbai_merged['Longitude'], mumbai_merged['Neighborhood'], mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
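
The marker code indexes the palette with `rainbow[cluster-1]`, so label 0 wraps around to the last color through negative indexing. The mapping is still one color per cluster, but it is easy to misread. A direct label-to-color lookup is a simpler alternative. The sketch below builds a palette with the standard library's `colorsys` instead of Matplotlib's `rainbow` colormap; the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(k, saturation=0.8, value=0.9):
    """Return k hex colors with evenly spaced hues."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, saturation, value)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


kclusters = 5
palette = cluster_palette(kclusters)

# Direct lookup: cluster label i always gets palette[i].
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```

With this mapping, `color=color_for[cluster]` replaces `rainbow[cluster-1]` in the marker loop, and the same dictionary can drive a legend.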

## Step 10: Examining the Clusters

### Cluster 0

In [25]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
1,Anushakti Nagar,0.0,0,19.04283,72.92734
2,Baiganwadi,0.0,0,19.06294,72.92663
17,Kalyan,0.01,0,18.95394,72.82037
23,Mankhurd,0.0,0,19.04853,72.93222
27,Mumbra,0.0,0,19.188413,73.022011
30,Shil Phata,0.0,0,19.14658,73.04005
36,Vikhroli,0.023256,0,19.11109,72.92781


### Cluster 1

In [26]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
3,Bandra,0.09,1,19.05437,72.84017
5,Borivali,0.1,1,19.22936,72.85751
19,Kanjurmarg,0.097561,1,19.13138,72.93568
20,Kausa,0.12,1,19.12758,72.82539
29,Seven Bungalows,0.12,1,19.13146,72.81646


### Cluster 2

In [27]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
0,Andheri,0.08,2,19.11847,72.84177
4,Bhandup,0.081081,2,19.14556,72.94856
8,Dahisar,0.064103,2,19.25003,72.85907
9,Devipada,0.08,2,19.22469,72.86605
11,Eastern Suburbs (Mumbai),0.074074,2,19.004272,72.85579
12,Ghatkopar,0.07,2,19.0863,72.90908
13,Goregaon,0.06,2,19.16455,72.84946
25,Mira Road,0.057692,2,19.265705,72.870693
26,Mulund,0.06,2,19.17183,72.95565
33,Thakur village,0.076923,2,19.2102,72.87541


### Cluster 3

In [28]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 3]

Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
10,Dombivli,0.192308,3,19.21275,73.08324


### Cluster 4

In [29]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 4]

Unnamed: 0,Neighborhood,Café,Cluster Labels,Latitude,Longitude
6,Charkop,0.032787,4,19.20866,72.82612
7,Chembur,0.040816,4,19.053995,72.899675
14,Grant Road,0.04,4,18.95929,72.83108
15,Jogeshwari,0.05,4,19.13792,72.84941
16,Juhu,0.05,4,19.01492,72.84522
18,Kandivali,0.05,4,19.2119,72.8375
21,Kurla,0.04,4,19.06498,72.88069
22,Mahavir Nagar (Kandivali),0.05,4,19.21094,72.84137
24,"Matharpacady, Mumbai",0.04,4,18.95071,72.82727
28,Pestom sagar,0.05,4,19.07064,72.90217
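
Cluster labels from k-means are arbitrary integers, so it helps to rank the clusters by their café share before interpreting them. The sketch below does this with the standard library, using a handful of rows copied from the tables above. The function name `summarize` is hypothetical, and the subset of rows is only illustrative.

```python
from collections import defaultdict

# (neighborhood, café share, cluster label), copied from the cluster tables above.
rows = [
    ("Anushakti Nagar", 0.0, 0), ("Vikhroli", 0.023256, 0),
    ("Bandra", 0.09, 1), ("Kausa", 0.12, 1),
    ("Andheri", 0.08, 2), ("Mira Road", 0.057692, 2),
    ("Dombivli", 0.192308, 3),
    ("Charkop", 0.032787, 4), ("Juhu", 0.05, 4),
]


def summarize(rows):
    """Return [(label, (min, mean, max))] sorted by mean café share, ascending."""
    grouped = defaultdict(list)
    for _, share, label in rows:
        grouped[label].append(share)
    summary = {
        label: (min(shares), sum(shares) / len(shares), max(shares))
        for label, shares in grouped.items()
    }
    return sorted(summary.items(), key=lambda item: item[1][1])


summary = summarize(rows)
ranking = [label for label, _ in summary]
for label, (low, mean, high) in summary:
    print("cluster {}: min={:.3f} mean={:.3f} max={:.3f}".format(label, low, mean, high))
```

On these rows the clusters order from lowest to highest café share as 0, 4, 2, 1, 3.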


#### Observation:
The Café column is the share of each neighborhood's returned venues that are cafés, so the clusters group neighborhoods by café density rather than by absolute count. In the rows shown above, cluster 3 (Dombivli alone) has the highest share at about 0.19, cluster 1 ranges from 0.09 to 0.12, cluster 2 from about 0.06 to 0.08, cluster 4 from about 0.03 to 0.05, and cluster 0 from 0 to about 0.02.

Most of the cafés are concentrated in the southern and eastern areas of Mumbai. Neighborhoods in cluster 0 have very few or no cafés, which makes them high-potential areas for new cafés because there is little to no competition from existing shops. Cafés in cluster 3, by contrast, are likely facing intense competition from oversupply. The oversupply appears mostly in the central area of the city, while the suburbs still have very few cafés.

This project therefore recommends that developers open new cafés in cluster 0 neighborhoods, where competition is minimal. Developers with a unique selling proposition that sets them apart can also consider neighborhoods in clusters 2 and 4, where competition is moderate. Developers are advised to avoid cluster 3, which already has a high concentration of cafés and intense competition.