# **Exploring Mumbai City for Foreign Tourists**

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)




### **Introduction: Business Problem** <a name="introduction"></a>

Mumbai is the commercial capital of India and has a rich cosmopolitan culture. Tourists from all over the world visit the city to enjoy its scenic beaches, historic monuments, museums and wide variety of international cuisines. It is also a shopping paradise, offering everything from readymade garments and household items to antiques.

Tourists generally have little or no idea where to book a hotel close to the places they want to see, such as beaches, museums or an aquarium. They may also want to know:

* where to find good restaurants serving a wide variety of international cuisine;
* where to find parks and gardens for their small children;
* which areas are good for shopping, with malls or flea markets;
* where to buy antiques;
* where to find a gym or spa nearby, for fitness enthusiasts;
* where to enjoy a boat ride.

Buses and trains are usually overcrowded and not recommended for tourists. Tourists therefore need some knowledge of the city's neighbourhoods, so that they can book a hotel where the places that interest them are within walking distance or a few minutes away by taxi, and no time is wasted on overcrowded buses or trains.

Mumbai is broadly divided into South, West, East and North regions, or boroughs. South Mumbai is the oldest and is rich in heritage.

I chose Mumbai because I was born and brought up in this city, am familiar with most of its neighbourhoods, and can verify the accuracy of the data obtained.


The data required to address this problem comprises the neighbourhoods of Mumbai, the borough each one falls in, their coordinates, and the nearby venues with their categories.

The key aim of this project is to help a foreign tourist book a hotel in a neighbourhood where the places that interest them are within walking distance or a few minutes away by taxi. This is done by showing a map on which clusters of similar neighbourhoods are drawn in different colours.

Foreign tourists will benefit most from the findings of this project: they should be able to book a hotel in an area with the things they care about nearby, and save the time otherwise spent commuting on overcrowded trains or buses. Those who want to set up a new hotel, restaurant or shop can also benefit, as can tour operators looking for the right place to open an office.

### **Data** <a name="data"></a>

In this analysis the key decision-making factor is the frequency of occurrence of each venue category within a neighbourhood. The data needed is: each neighbourhood, its borough, the latitude and longitude of each neighbourhood, and the venues with their categories, latitudes and longitudes.

The main data source is the Foursquare API, which returns venues along with their categories. To return the venues around a location, it needs that location's latitude and longitude, which are fetched with the geopy geocoder. The geocoder limits how many locations it will resolve in one run and can time out; this was resolved by splitting the data and passing the locations in two batches. The major challenge, however, was that no ready-made list of all the neighbourhoods with their respective boroughs was available. This table was therefore constructed in code using Python lists, dictionaries and dataframes.
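To make "frequency of occurrence" concrete, here is a minimal sketch on made-up data. The helper name `category_frequencies` and the sample rows are illustrative assumptions, not part of the project's dataset. For each neighbourhood it computes the share of venues that fall in each category.

```python
from collections import Counter, defaultdict

def category_frequencies(rows):
    """Return {neighbourhood: {category: share of that neighbourhood's venues}}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in rows:
        counts[neighbourhood][category] += 1
    return {
        nb: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for nb, cats in counts.items()
    }

# Hypothetical (neighbourhood, venue category) pairs, for illustration only
sample_rows = [
    ("Colaba", "Hotel"),
    ("Colaba", "Hotel"),
    ("Colaba", "Dessert Shop"),
    ("Colaba", "Thai Restaurant"),
    ("Juhu", "Beach"),
    ("Juhu", "Hotel"),
]

frequencies = category_frequencies(sample_rows)
print(frequencies["Colaba"]["Hotel"])  # 2 of Colaba's 4 sample venues are hotels
```

Neighbourhoods with similar frequency profiles are the ones a clustering step would be expected to group together.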

Import the libraries required for this project.

In [64]:
import pandas as pd
from pandas import json_normalize  # transform JSON into a pandas dataframe (top-level import, pandas 1.0+)

import numpy as np

from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns
from urllib.request import urlopen

import requests
# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print("Library imported.")


Library imported.


Let's generate the data for the neighbourhoods, with their respective boroughs and their latitudes and longitudes, using lists, dictionaries and dataframes.

Mumbai is divided into four main boroughs: South, North, East and West. This analysis uses 21 neighborhoods from South, 8 from West, and 5 each from East and North, for a total of 39 neighborhoods to begin with.

In [0]:
# South
areas_south = ["Colaba","Apollo Bandar","Fort","Churchgate","Nariman Point" \
        ,"Marine Lines","Walkeshwar","Malabar Hill","Kalbadevi","Bhuleshwar" \
        ,"Masjid Bandar","Darukhana","Pydhoni","Nagpada","Dongri" \
        ,"Byculla","Grand Road","Mazgaon","Mumbai Central","Mahalaxmi" \
        ,"Worli"]

# West
areas_west = ["Dadar","Bandra West","Juhu","Andheri","Kalina" \
              ,"Santa Cruz","Mahim","Khar"]

# East
areas_east = ["Kurla","Ghatkopar","Chembur","Govandi","Mulund West"] 

# North
areas_north = ["Powai","Jogeshwari","Malad","Borivali","Mira Road"] 

# boroughs

# One borough label per neighborhood, so each list matches its areas list in length
boroughs_south_mumbai = ["South Mumbai"] * len(areas_south)

boroughs_west_mumbai = ["West Mumbai"] * len(areas_west)

boroughs_east_mumbai = ["East Mumbai"] * len(areas_east)

boroughs_north_mumbai = ["North Mumbai"] * len(areas_north)

#### Note: The neighbourhoods are split because the geopy geocoder could not return the coordinates of all 39 neighbourhoods in a single run. The data was therefore retrieved in two batches.

In [66]:
boroughs1 =  boroughs_south_mumbai
boroughs2 =  boroughs_west_mumbai + boroughs_east_mumbai + boroughs_north_mumbai

areas1 = areas_south
areas2 = areas_west + areas_east + areas_north

#Building full address to pass to API

full_address1 = []
for area in areas1:
  full_address1.append(area + ", Mumbai")

full_address2 = []
for area in areas2:
  full_address2.append(area + ", Mumbai") 

print("Length of areas1={} , boroughs1 = {} , full_address1 = {}" \
.format(len(areas1),len(boroughs1),len(full_address1)))

print("Length of areas2={} , boroughs2 = {} , full_address2 = {}" \
.format(len(areas2),len(boroughs2),len(full_address2))) 

#Create Dictionary first
dic_area1 = {"Neighborhood": areas1,"Boroughs": boroughs1,"Address":full_address1}
dic_area2 = {"Neighborhood": areas2,"Boroughs": boroughs2,"Address":full_address2}
print("dic_area1:",dic_area1)
print("dic_area2:",dic_area2)

#Create dataframes from dictionary
df_area1 = pd.DataFrame.from_dict(dic_area1)
df_area2 = pd.DataFrame.from_dict(dic_area2)

print("df_area1:",df_area1.shape)
print("df_area2:",df_area2.shape)



Length of areas1=21 , boroughs1 = 21 , full_address1 = 21
Length of areas2=18 , boroughs2 = 18 , full_address2 = 18
dic_area1: {'Neighborhood': ['Colaba', 'Apollo Bandar', 'Fort', 'Churchgate', 'Nariman Point', 'Marine Lines', 'Walkeshwar', 'Malabar Hill', 'Kalbadevi', 'Bhuleshwar', 'Masjid Bandar', 'Darukhana', 'Pydhoni', 'Nagpada', 'Dongri', 'Byculla', 'Grand Road', 'Mazgaon', 'Mumbai Central', 'Mahalaxmi', 'Worli'], 'Boroughs': ['South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai', 'South Mumbai'], 'Address': ['Colaba, Mumbai', 'Apollo Bandar, Mumbai', 'Fort, Mumbai', 'Churchgate, Mumbai', 'Nariman Point, Mumbai', 'Marine Lines, Mumbai', 'Walkeshwar, Mumbai', 'Malabar Hill, Mumbai', 'Kalbadevi, Mumbai', 'Bhuleshwar, Mumba

Let's geocode the first batch, South Mumbai.

In [67]:
# South Mumbai first

df2 = df_area1

# Create the geocoder once and reuse it for every address
geolocator = Nominatim(user_agent="ny_explorer")

data = []
for neighborhood, address, borough in zip(df2['Neighborhood'], df2['Address'], df2['Boroughs']):
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    data.append([neighborhood, borough, latitude, longitude])

col_names =  ['Neighborhood', 'Borough', 'Latitude','Longitude']
neighborhoods = pd.DataFrame(data,columns=col_names)

# Assign to neighborhoods1 : South Mumbai
neighborhoods1 = neighborhoods
print("Coordinates for {} South Mumbai neighborhoods obtained.".format(neighborhoods1.shape[0]) )
neighborhoods1.tail()


Coordinates for 21 South Mumbai neighborhoods obtained.


| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 16 | Grand Road | South Mumbai | 18.938771 | 72.835335 |
| 17 | Mazgaon | South Mumbai | 18.968052 | 72.840012 |
| 18 | Mumbai Central | South Mumbai | 18.969586 | 72.819315 |
| 19 | Mahalaxmi | South Mumbai | 18.982568 | 72.82416 |
| 20 | Worli | South Mumbai | 19.011696 | 72.81807 |


Good. We now have the coordinates of all 21 South Mumbai neighborhoods, each with its borough.

Now let's pass the second batch, the suburbs of Mumbai.

In [68]:
# second iteration for suburbs of Mumbai.

df2 = df_area2

# Create the geocoder once and reuse it for every address
geolocator = Nominatim(user_agent="ny_explorer")

data = []
for neighborhood, address, borough in zip(df2['Neighborhood'], df2['Address'], df2['Boroughs']):
  print("neighborhood:", neighborhood)
  print("address:", address)
  print("borough:", borough)
  location = geolocator.geocode(address)
  latitude = location.latitude
  longitude = location.longitude
  print("latitude:", latitude)
  print("longitude:", longitude)
  data.append([neighborhood, borough, latitude, longitude])

col_names =  ['Neighborhood', 'Borough', 'Latitude','Longitude']
neighborhoods = pd.DataFrame(data,columns=col_names)

#Assign to neighborhoods2 : Rest of Mumbai
neighborhoods2 = neighborhoods
print("Coordinates for {} Rest of Mumbai neighborhoods obtained.".format(neighborhoods2.shape[0]) )
neighborhoods2.tail()

neighborhood: Dadar
address: Dadar, Mumbai
borough: West Mumbai
latitude: 19.019282
longitude: 72.8428757
neighborhood: Bandra West
address: Bandra West, Mumbai
borough: West Mumbai
latitude: 19.0583358
longitude: 72.8302669
neighborhood: Juhu
address: Juhu, Mumbai
borough: West Mumbai
latitude: 19.1070215
longitude: 72.8275275
neighborhood: Andheri
address: Andheri, Mumbai
borough: West Mumbai
latitude: 19.1196976
longitude: 72.8464205
neighborhood: Kalina
address: Kalina, Mumbai
borough: West Mumbai
latitude: 19.079273
longitude: 72.8612672
neighborhood: Santa Cruz
address: Santa Cruz, Mumbai
borough: West Mumbai
latitude: 19.0793694
longitude: 72.8470855
neighborhood: Mahim
address: Mahim, Mumbai
borough: West Mumbai
latitude: 19.0423145
longitude: 72.8398344
neighborhood: Khar
address: Khar, Mumbai
borough: West Mumbai
latitude: 19.0696584
longitude: 72.8398944
neighborhood: Kurla
address: Kurla, Mumbai
borough: East Mumbai
latitude: 19.0652797
longitude: 72.8793805
neighborhood: G

| | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|
| 13 | Powai | North Mumbai | 19.11872 | 72.907348 |
| 14 | Jogeshwari | North Mumbai | 19.134899 | 72.84882 |
| 15 | Malad | North Mumbai | 19.186719 | 72.848588 |
| 16 | Borivali | North Mumbai | 19.229068 | 72.857363 |
| 17 | Mira Road | North Mumbai | 19.187896 | 72.836596 |


Good. Now we have the coordinates of the suburbs of Mumbai as well. So, let's merge the two batches.

In [69]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
neighborhoods = pd.concat([neighborhoods1, neighborhoods2], ignore_index=True)
neighborhoods = neighborhoods.reset_index()
print("Merged data frame has {} rows.".format(neighborhoods.shape[0]) )
neighborhoods.tail()


Merged data frame has 39 rows.


| | index | Neighborhood | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 34 | 34 | Powai | North Mumbai | 19.11872 | 72.907348 |
| 35 | 35 | Jogeshwari | North Mumbai | 19.134899 | 72.84882 |
| 36 | 36 | Malad | North Mumbai | 19.186719 | 72.848588 |
| 37 | 37 | Borivali | North Mumbai | 19.229068 | 72.857363 |
| 38 | 38 | Mira Road | North Mumbai | 19.187896 | 72.836596 |


Now we have the coordinates of all the neighborhoods in one place. Next, let's get the coordinates of Mumbai city itself.

In [70]:
address = "Mumbai"
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
mumbai_latitude = latitude
mumbai_longitude = longitude
print ("Latitude={} and Longitude={}".format(mumbai_latitude,mumbai_longitude))

Latitude=18.9387711 and Longitude=72.8353355
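A sanity check may be worth running at this point. In the South Mumbai table above, Grand Road's coordinates (18.938771, 72.835335) coincide with the ones just printed for Mumbai itself. One possible explanation, which is an assumption rather than something the notebook confirms, is that the geocoder could not resolve that name and returned the city instead. The sketch below flags such rows; `same_point` and `flag_city_fallbacks` are hypothetical helper names.

```python
import math

def same_point(a, b, tol=1e-5):
    """True when two (lat, lng) pairs agree within `tol` degrees."""
    return math.isclose(a[0], b[0], abs_tol=tol) and math.isclose(a[1], b[1], abs_tol=tol)

def flag_city_fallbacks(coords, city_point):
    """Return the names whose coordinates coincide with the city's own point."""
    return [name for name, point in coords.items() if same_point(point, city_point)]

mumbai = (18.9387711, 72.8353355)
# A few rows copied from the South Mumbai table
coords = {
    "Grand Road": (18.938771, 72.835335),
    "Mazgaon": (18.968052, 72.840012),
    "Worli": (19.011696, 72.81807),
}
print(flag_city_fallbacks(coords, mumbai))
```

Any neighborhood flagged this way would be queried at the city centre rather than at its own location, so its venues deserve a second look before clustering.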


We have Mumbai's latitude and longitude. Now we can create a map of Mumbai using the Folium library and superimpose the neighborhoods on top of it.

In [71]:
map_mumbai = folium.Map(location=[latitude, longitude], zoom_start=14)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mumbai)  
    
map_mumbai

Beautiful!

##### Now we will use the Foursquare API to fetch the venues around a location. Foursquare takes the latitude and longitude of a location and returns venues along with their categories.

Let's define Foursquare Credentials and Version.

In [72]:
CLIENT_ID = 'AED1YJAEJO1BSJVJJ5KT2Y2BDGDVPCB21WUXERHTA4B4XXKC' # Foursquare ID
CLIENT_SECRET = '5WF0PKUAGWG30ASKZSBV12XHC2IDIQLEHRRPU541XI2JVQCH' #  Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


credentials:
CLIENT_ID: AED1YJAEJO1BSJVJJ5KT2Y2BDGDVPCB21WUXERHTA4B4XXKC
CLIENT_SECRET:5WF0PKUAGWG30ASKZSBV12XHC2IDIQLEHRRPU541XI2JVQCH
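Hard-coded credentials end up in every shared copy of a notebook. A common alternative is to read them from environment variables and mask them when printing. The sketch below assumes hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; the helper names are also illustrative.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables instead of hard-coding them."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demo with placeholder values; real values would come from the shell environment
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id-123",
            "FOURSQUARE_CLIENT_SECRET": "demo-secret-456"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```

With this approach the notebook can be published without exposing the secret, and each user supplies their own keys.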


Let's create a function to get the venues near each neighborhood in Mumbai.

I tried to find a suitable value for the radius. A larger radius fetches more venues, which is good, but it also increases the chance that neighboring search areas overlap. A radius of 650 meters fetched the most data with the fewest overlaps and duplicates. The limit was kept at 100.
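The overlap trade-off can be checked numerically: two search circles of equal radius overlap when their centres are closer than twice the radius. The sketch below uses the haversine formula with coordinates that appear in this notebook; `haversine_m` and `search_circles_overlap` are hypothetical helper names, and this is one way to reason about the radius, not the procedure used to arrive at 650.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6371000  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def search_circles_overlap(p1, p2, radius_m):
    """Two circles of equal radius overlap when their centres are closer than 2 * radius."""
    return haversine_m(*p1, *p2) < 2 * radius_m

# Coordinates as printed elsewhere in this notebook
nagpada = (18.9681782, 72.8286009)
mumbai_central = (18.969586, 72.819315)
worli = (19.011696, 72.81807)

print(round(haversine_m(*nagpada, *mumbai_central)), "m between Nagpada and Mumbai Central")
print(search_circles_overlap(nagpada, mumbai_central, 650))
print(search_circles_overlap(nagpada, worli, 650))
```

Pairs whose circles overlap are the ones likely to return the same venue twice, which is why a de-duplication step is still needed afterwards.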

In [73]:
radius=650
LIMIT = 100
print("The radius is = {} and LIMIT is = {}".format(radius,LIMIT))

def getNearbyVenues(names, latitudes, longitudes, radius=650):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues


The radius is = 650 and LIMIT is = 100
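The function above indexes straight into the response, so a failed request or a venue without categories would raise an exception partway through the loop. A more defensive parsing step could look like the sketch below. `extract_venue_rows` is a hypothetical helper, and both payloads are hand-written to mimic the keys read by `getNearbyVenues`; the shape of the error payload in particular is an assumption.

```python
def extract_venue_rows(name, lat, lng, response_json):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payloads, for illustration only
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Theobroma",
               "location": {"lat": 18.919298, "lng": 72.829185},
               "categories": [{"name": "Dessert Shop"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 18.92, "lng": 72.83},
               "categories": []}},
]}]}}
error_payload = {"meta": {"code": 429}, "response": {}}

ok_rows = extract_venue_rows("Colaba", 18.915091, 72.825969, ok_payload)
error_rows = extract_venue_rows("Colaba", 18.915091, 72.825969, error_payload)
print(ok_rows)
print(error_rows)
```

Returning an empty list for a failed request lets the loop continue with the remaining neighborhoods; whether to log or retry such failures is a separate design choice.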


Now, let's run the above function on each neighborhood and create a new dataframe.

In [74]:
south_mumbai_data = neighborhoods
south_mumbai_venues = getNearbyVenues(names=south_mumbai_data['Neighborhood'],
                                   latitudes=south_mumbai_data['Latitude'],
                                   longitudes=south_mumbai_data['Longitude']
                                   
                                  )
print("dtypes:",south_mumbai_venues.dtypes)
print("Total venues retrieved:",south_mumbai_venues.shape[0])


Colaba
Apollo Bandar
Fort
Churchgate
Nariman Point
Marine Lines
Walkeshwar
Malabar Hill
Kalbadevi
Bhuleshwar
Masjid Bandar
Darukhana
Pydhoni
Nagpada
Dongri
Byculla
Grand Road
Mazgaon
Mumbai Central
Mahalaxmi
Worli
Dadar
Bandra West
Juhu
Andheri
Kalina
Santa Cruz
Mahim
Khar
Kurla
Ghatkopar
Chembur
Govandi
Mulund West
Powai
Jogeshwari
Malad
Borivali
Mira Road
dtypes: Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Venue                      object
Venue Latitude            float64
Venue Longitude           float64
Venue Category             object
dtype: object
Total venues retrieved: 1104


Let's see what the data looks like.

In [75]:
south_mumbai_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Colaba | 18.915091 | 72.825969 | Charagh Din | 18.915254 | 72.824151 | Men's Store |
| 1 | Colaba | 18.915091 | 72.825969 | IMBISS Meating Joint | 18.917157 | 72.827018 | German Restaurant |
| 2 | Colaba | 18.915091 | 72.825969 | Thai Pavilion | 18.914246 | 72.82108 | Thai Restaurant |
| 3 | Colaba | 18.915091 | 72.825969 | Theobroma | 18.919298 | 72.829185 | Dessert Shop |
| 4 | Colaba | 18.915091 | 72.825969 | Vivanta by Taj - President | 18.914413 | 72.821028 | Hotel |


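Back up this data to a file.

In [0]:
# backup: an independent copy, not a second name for the same dataframe
south_mumbai_venues_unclean = south_mumbai_venues.copy()
# save as csv file (to_csv returns None when given a path, so its result is not assigned)
south_mumbai_venues_unclean.to_csv("south_mumbai_venues_unclean.csv")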

Now, merge with the master table to include the borough of each neighborhood.

In [77]:
# merge with master table

south_mumbai_venues_merge_1 = pd.merge(south_mumbai_venues,neighborhoods,on="Neighborhood", how="inner")
south_mumbai_venues_merge_1 = south_mumbai_venues_merge_1[["Neighborhood","Borough", "Neighborhood Latitude" \
                               , "Neighborhood Longitude", "Venue", "Venue Latitude" \
                               , "Venue Longitude", "Venue Category" \
                                ]]
south_mumbai_venues_merge_1.head()


| | Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Colaba | South Mumbai | 18.915091 | 72.825969 | Charagh Din | 18.915254 | 72.824151 | Men's Store |
| 1 | Colaba | South Mumbai | 18.915091 | 72.825969 | IMBISS Meating Joint | 18.917157 | 72.827018 | German Restaurant |
| 2 | Colaba | South Mumbai | 18.915091 | 72.825969 | Thai Pavilion | 18.914246 | 72.82108 | Thai Restaurant |
| 3 | Colaba | South Mumbai | 18.915091 | 72.825969 | Theobroma | 18.919298 | 72.829185 | Dessert Shop |
| 4 | Colaba | South Mumbai | 18.915091 | 72.825969 | Vivanta by Taj - President | 18.914413 | 72.821028 | Hotel |


### **Data Cleansing**

Good. Now we have all the data we need, but it is not yet fit for analysis and requires cleaning. So, let's identify the issues with the data and resolve them one by one.

In summary, here is what we are going to do. First, we will get rid of duplicate data. Then we will find out which neighborhoods have very little data; these will either be dropped or merged with nearby areas. Finally, we will look for categories that are very common but not very informative, and remove them.

First, let's drop the duplicates caused by overlapping neighborhoods, a side effect of the larger search radius.

In [78]:
print("rows before dropping duplicates={}".format(south_mumbai_venues.shape[0]))
south_mumbai_venues = south_mumbai_venues.drop_duplicates(subset=['Venue','Venue Latitude', 'Venue Longitude'])
print("rows after dropping duplicates={}".format(south_mumbai_venues.shape[0]))


rows before dropping duplicates=1104
rows after dropping duplicates=988


Okay, so we got 116 duplicates, and they are gone now.
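It helps to know which copy of a shared venue survives. `drop_duplicates` keeps the first occurrence by default, so a venue found from two neighborhoods stays with whichever neighborhood was queried first. The sketch below shows this on a tiny made-up frame; the rows, including the second venue's coordinates, are illustrative only.

```python
import pandas as pd

# Made-up rows: the same venue was returned for two neighborhoods
venues = pd.DataFrame({
    "Neighborhood": ["Colaba", "Apollo Bandar", "Apollo Bandar"],
    "Venue": ["Theobroma", "Theobroma", "Gateway of India"],
    "Venue Latitude": [18.919298, 18.919298, 18.9220],
    "Venue Longitude": [72.829185, 72.829185, 72.8347],
})

key = ["Venue", "Venue Latitude", "Venue Longitude"]
# Rows that repeat a venue already seen in an earlier row
shared = venues[venues.duplicated(subset=key, keep="first")]
deduplicated = venues.drop_duplicates(subset=key, keep="first")

print(len(venues) - len(deduplicated), "duplicate row(s) removed")
print(deduplicated["Neighborhood"].tolist())
```

Inspecting `shared` before dropping shows which neighborhoods lose venues, which matters later when deciding which neighborhoods look sparse.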

Let's work on the neighborhoods first.
We group by Neighborhood to find out which neighborhoods have very few venues. We can then either drop those neighborhoods or merge them with nearby ones so as not to lose much data.

In [0]:
# group by neighborhoods
south_mumbai_venues_grp_nbhood_1 = south_mumbai_venues_merge_1.groupby("Neighborhood").count()
south_mumbai_venues_grp_nbhood_1.head(1)
south_mumbai_venues_grp_nbhood_1 = south_mumbai_venues_grp_nbhood_1.rename({"Borough" : "Count"}, axis=1)
south_mumbai_venues_grp_nbhood_2 = south_mumbai_venues_grp_nbhood_1.sort_values("Count" , ascending=True)
south_mumbai_venues_grp_nbhood_2 = south_mumbai_venues_grp_nbhood_2[["Count"]]



Let's see which ones.

In [80]:
south_mumbai_venues_grp_nbhood_2.head(10)

| Neighborhood | Count |
|---|---|
| Darukhana | 1 |
| Jogeshwari | 6 |
| Govandi | 6 |
| Malad | 7 |
| Kurla | 7 |
| Dongri | 8 |
| Mazgaon | 10 |
| Byculla | 10 |
| Kalina | 12 |
| Mahim | 12 |
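The notebook picks the sparse neighborhoods by inspection. An explicit cut-off is one way to make that step repeatable. The sketch below uses a plain dict as a stand-in for the `Count` column, and the threshold `min_venues` is a hypothetical value chosen only for illustration, not the rule actually applied in the next cells.

```python
# Counts copied from the table above
counts = {"Darukhana": 1, "Jogeshwari": 6, "Govandi": 6, "Malad": 7, "Kurla": 7,
          "Dongri": 8, "Mazgaon": 10, "Byculla": 10, "Kalina": 12, "Mahim": 12}

min_venues = 10  # hypothetical cut-off

# Neighborhoods with fewer venues than the cut-off are candidates to drop or merge
sparse = [name for name, count in counts.items() if count < min_venues]
print(sparse)
```

A fixed threshold is only a starting point: local knowledge of which areas adjoin each other still decides whether a candidate is dropped or merged, and into which neighbor.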


In [0]:
# save as csv
south_mumbai_venues_grp_nbhood_2.to_csv("south_mumbai_venues_grp_nbhood_2.csv")
# keep an independent backup copy
south_mumbai_venues_grp_nbhood_2_bkp = south_mumbai_venues_grp_nbhood_2.copy()
 

Drop Darukhana, which has only one venue.

In [82]:
print("before filter:",south_mumbai_venues.shape)
south_mumbai_venues = south_mumbai_venues[south_mumbai_venues.Neighborhood != 'Darukhana'].copy()  # copy so later .loc assignments modify this frame directly
print("after filter:",south_mumbai_venues.shape)

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Darukhana']
print("aft:",south_mumbai_data.shape)

before filter: (988, 7)
after filter: (987, 7)
bef: (39, 5)
aft: (38, 5)


Okay, got rid of it.

The other sparse neighborhoods will be merged with nearby neighborhoods. With this approach we will not lose much data.

In [83]:
# Get coordinates of Nagpada

Nagpada = south_mumbai_venues
Nagpada = Nagpada[Nagpada.Neighborhood == 'Nagpada' ].head(1)
nagpada_latitude = Nagpada['Neighborhood Latitude'] 
nagpada_latitude = nagpada_latitude.iloc[0]
nagpada_longitude = Nagpada['Neighborhood Longitude'] 
nagpada_longitude = nagpada_longitude.iloc[0]
print("Nagpada latitude={} and longitude={}".format(nagpada_latitude,nagpada_longitude))


Nagpada latitude=18.9681782 and longitude=72.8286009


Merge Pydhoni into Nagpada

In [84]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Pydhoni', 'Neighborhood Latitude'] = nagpada_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Pydhoni', 'Neighborhood Longitude'] = nagpada_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Pydhoni', 'Neighborhood'] = 'Nagpada'
south_mumbai_venues[south_mumbai_venues.Neighborhood == 'Pydhoni']
south_mumbai_venues[south_mumbai_venues.Venue == 'Kanji Manji Kothari']


print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Pydhoni']
print("aft:",south_mumbai_data.shape)


bef: (38, 5)
aft: (37, 5)
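The same three assignments are repeated for every merge in the cells that follow. They could be wrapped in a small helper, sketched below. `merge_neighborhood` is a hypothetical function, not part of the notebook, and the sample frame uses placeholder venue names and made-up coordinates for the source neighborhoods.

```python
import pandas as pd

def merge_neighborhood(venues, source, target):
    """Relabel every venue of `source` as `target`, copying the target's coordinates."""
    target_rows = venues[venues["Neighborhood"] == target]
    if target_rows.empty:
        raise ValueError("Target neighborhood not found: {}".format(target))
    target_lat = target_rows["Neighborhood Latitude"].iloc[0]
    target_lng = target_rows["Neighborhood Longitude"].iloc[0]

    merged = venues.copy()  # leave the caller's frame untouched
    mask = merged["Neighborhood"] == source
    merged.loc[mask, "Neighborhood Latitude"] = target_lat
    merged.loc[mask, "Neighborhood Longitude"] = target_lng
    merged.loc[mask, "Neighborhood"] = target
    return merged

# Tiny illustrative frame
venues = pd.DataFrame({
    "Neighborhood": ["Nagpada", "Pydhoni", "Dongri"],
    "Neighborhood Latitude": [18.9681782, 18.95, 18.96],
    "Neighborhood Longitude": [72.8286009, 72.83, 72.84],
    "Venue": ["Venue A", "Venue B", "Venue C"],
})

for source in ["Pydhoni", "Dongri"]:
    venues = merge_neighborhood(venues, source, "Nagpada")

print(venues["Neighborhood"].unique().tolist())
```

Looking the target's coordinates up inside the helper removes the separate "get coordinates" cell for each target and the risk of copy-paste slips in labels and comments.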


Merge Dongri into Nagpada

In [85]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Dongri', 'Neighborhood Latitude'] = nagpada_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Dongri', 'Neighborhood Longitude'] = nagpada_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Dongri', 'Neighborhood'] = 'Nagpada'
south_mumbai_venues.iloc[332:334]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Dongri']
print("aft:",south_mumbai_data.shape)

bef: (37, 5)
aft: (36, 5)


Merge Grand Road into Nagpada

In [86]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Grand Road', 'Neighborhood Latitude'] = nagpada_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Grand Road', 'Neighborhood Longitude'] = nagpada_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Grand Road', 'Neighborhood'] = 'Nagpada'
south_mumbai_venues.iloc[340:344]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Grand Road']
print("aft:",south_mumbai_data.shape)

bef: (36, 5)
aft: (35, 5)


Merge Byculla into Nagpada

In [87]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Byculla', 'Neighborhood Latitude'] = nagpada_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Byculla', 'Neighborhood Longitude'] = nagpada_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Byculla', 'Neighborhood'] = 'Nagpada'
south_mumbai_venues.iloc[334:340]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Byculla']
print("aft:",south_mumbai_data.shape)

bef: (35, 5)
aft: (34, 5)


Merge Govandi with Chembur

In [88]:
# Get coordinates of Chembur

Chembur = south_mumbai_venues
Chembur = Chembur[Chembur.Neighborhood == 'Chembur' ].head(1)
Chembur_latitude = Chembur['Neighborhood Latitude'] 
Chembur_latitude = Chembur_latitude.iloc[0]
Chembur_longitude = Chembur['Neighborhood Longitude'] 
Chembur_longitude = Chembur_longitude.iloc[0]
print("Chembur latitude={} and longitude={}".format(Chembur_latitude,Chembur_longitude))
Chembur[Chembur.Neighborhood == 'Chembur' ].head(1)

Chembur latitude=19.0612128 and longitude=72.8975909


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 894 | Chembur | 19.061213 | 72.897591 | Le Café | 19.061791 | 72.899479 | Café |


In [89]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Govandi', 'Neighborhood Latitude'] = Chembur_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Govandi', 'Neighborhood Longitude'] = Chembur_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Govandi', 'Neighborhood'] = 'Chembur'
south_mumbai_venues.iloc[688:693]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Govandi']
print("aft:",south_mumbai_data.shape)

bef: (34, 5)
aft: (33, 5)


Merge Kalina with Kurla

In [90]:
# Get coordinates of Kurla

Kurla = south_mumbai_venues
Kurla = Kurla[Kurla.Neighborhood == 'Kurla' ].head(1)
Kurla_latitude = Kurla['Neighborhood Latitude'] 
Kurla_latitude = Kurla_latitude.iloc[0]
Kurla_longitude = Kurla['Neighborhood Longitude'] 
Kurla_longitude = Kurla_longitude.iloc[0]
print("Kurla latitude={} and longitude={}".format(Kurla_latitude,Kurla_longitude))
Kurla[Kurla.Neighborhood == 'Kurla' ].head(1)

Kurla latitude=19.0652797 and longitude=72.8793805


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 866 | Kurla | 19.06528 | 72.87938 | Guru Nanak | 19.065634 | 72.878649 | Food Truck |


In [91]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Kalina', 'Neighborhood Latitude'] = Kurla_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Kalina', 'Neighborhood Longitude'] = Kurla_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Kalina', 'Neighborhood'] = 'Kurla'
south_mumbai_venues.iloc[588:594]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Kalina']
print("aft:",south_mumbai_data.shape)



bef: (33, 5)
aft: (32, 5)


Merge Malad with Andheri

In [92]:
# Get coordinates of Andheri

Andheri = south_mumbai_venues
Andheri = Andheri[Andheri.Neighborhood == 'Andheri' ].head(1)
Andheri_latitude = Andheri['Neighborhood Latitude'] 
Andheri_latitude = Andheri_latitude.iloc[0]
Andheri_longitude = Andheri['Neighborhood Longitude'] 
Andheri_longitude = Andheri_longitude.iloc[0]
print("Andheri latitude={} and longitude={}".format(Andheri_latitude,Andheri_longitude))
Andheri[Andheri.Neighborhood == 'Andheri' ].head(1)

Andheri latitude=19.1196976 and longitude=72.8464205


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 773 | Andheri | 19.119698 | 72.84642 | Merwans Cake shop | 19.1193 | 72.845418 | Bakery |


In [93]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malad', 'Neighborhood Latitude'] = Andheri_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malad', 'Neighborhood Longitude'] = Andheri_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malad', 'Neighborhood'] = 'Andheri'
south_mumbai_venues.iloc[773:780]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Malad']
print("aft:",south_mumbai_data.shape)

bef: (32, 5)
aft: (31, 5)


Merge Mahim into Bandra West

In [94]:
# Get coordinates of Bandra West

Bandra = south_mumbai_venues
Bandra = Bandra[Bandra.Neighborhood == 'Bandra West' ].head(1)
Bandra_latitude = Bandra['Neighborhood Latitude'] 
Bandra_latitude = Bandra_latitude.iloc[0]
Bandra_longitude = Bandra['Neighborhood Longitude'] 
Bandra_longitude = Bandra_longitude.iloc[0]
print("Bandra latitude={} and longitude={}".format(Bandra_latitude,Bandra_longitude))
Bandra[Bandra.Neighborhood == 'Bandra West' ].head(1)

Bandra latitude=19.0583358 and longitude=72.8302669


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 606 | Bandra West | 19.058336 | 72.830267 | Almeida Park | 19.057656 | 72.831541 | Park |


In [95]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Mahim', 'Neighborhood Latitude'] = Bandra_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Mahim', 'Neighborhood Longitude'] = Bandra_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Mahim', 'Neighborhood'] = 'Bandra West'
south_mumbai_venues.iloc[610:618]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Mahim']
print("aft:",south_mumbai_data.shape)

bef: (31, 5)
aft: (30, 5)


Merge Malabar Hill into Walkeshwar

In [96]:
# Get coordinates of Walkeshwar

Walkeshwar = south_mumbai_venues
Walkeshwar = Walkeshwar[Walkeshwar.Neighborhood == 'Walkeshwar' ].head(1)
Walkeshwar_latitude = Walkeshwar['Neighborhood Latitude'] 
Walkeshwar_latitude = Walkeshwar_latitude.iloc[0]
Walkeshwar_longitude = Walkeshwar['Neighborhood Longitude'] 
Walkeshwar_longitude = Walkeshwar_longitude.iloc[0]
print("Walkeshwar latitude={} and longitude={}".format(Walkeshwar_latitude,Walkeshwar_longitude))
Walkeshwar[Walkeshwar.Neighborhood == 'Walkeshwar' ].head(1)

Walkeshwar latitude=18.9553434 and longitude=72.8079469


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 314 | Walkeshwar | 18.955343 | 72.807947 | Soam | 18.957492 | 72.808884 | Indian Restaurant |


In [97]:
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malabar Hill', 'Neighborhood Latitude'] = Walkeshwar_latitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malabar Hill', 'Neighborhood Longitude'] = Walkeshwar_longitude
south_mumbai_venues.loc[south_mumbai_venues['Neighborhood'] == 'Malabar Hill', 'Neighborhood'] = 'Walkeshwar'
south_mumbai_venues.iloc[270:280]

print("bef:",south_mumbai_data.shape)
south_mumbai_data = south_mumbai_data[south_mumbai_data.Neighborhood != 'Malabar Hill']
print("aft:",south_mumbai_data.shape)

bef: (30, 5)
aft: (29, 5)


Okay, done with the neighborhoods.
Now, let's see what we can do with the categories.

First, let's check how many unique categories we are dealing with.

In [98]:
print('There are {} unique categories.'.format(len(south_mumbai_venues['Venue Category'].unique())))


There are 157 unique categories.


Okay, so there are 157 unique categories.

So, let's group by venue category to see which categories appear most often. Categories that are very common but relatively insignificant may clutter the analysis.

In [99]:
south_mumbai_venues_grp_ven_cat_1 = south_mumbai_venues_merge_1.groupby("Venue Category").count()
south_mumbai_venues_grp_ven_cat_2 = south_mumbai_venues_grp_ven_cat_1.sort_values("Neighborhood" , ascending=False)
south_mumbai_venues_grp_ven_cat_2 = south_mumbai_venues_grp_ven_cat_2.rename({"Neighborhood":"Count"},axis=1)
south_mumbai_venues_grp_ven_cat_2 = south_mumbai_venues_grp_ven_cat_2[["Count"]]
south_mumbai_venues_grp_ven_cat_2.head(10)


Unnamed: 0_level_0,Count
Venue Category,Unnamed: 1_level_1
Indian Restaurant,165
Café,60
Fast Food Restaurant,49
Coffee Shop,38
Bar,38
Chinese Restaurant,38
Hotel,35
Restaurant,32
Bakery,31
Ice Cream Shop,28
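The same ranking can be obtained more directly with `value_counts`. This is a minimal sketch on made-up data, assuming one row per venue and a `Venue Category` column as above. The share column is an addition for illustration.

```python
import pandas as pd

# Toy venue table (invented rows)
toy = pd.DataFrame({'Venue Category': ['Café', 'Indian Restaurant', 'Indian Restaurant',
                                       'Bakery', 'Indian Restaurant', 'Café']})

# Count venues per category, largest first
category_counts = toy['Venue Category'].value_counts().rename('Count')
# Share of all venues, useful for judging how dominant a category is
category_share = (category_counts / len(toy)).round(2)

print(category_counts.head(10))
print(category_share.head(10))
```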


So, there are too many Indian restaurants. This is expected, as they are very common in all areas of Mumbai. Therefore, I decided to drop Indian restaurants to begin with.

In [101]:
print("rows bef filter:",south_mumbai_venues.shape[0])
south_mumbai_venues = south_mumbai_venues[(south_mumbai_venues['Venue Category'] != "Indian Restaurant")].reset_index()
print("rows aft filter:",south_mumbai_venues.shape[0])




rows bef filter: 987
rows aft filter: 852


Okay, they are gone now.

In [0]:
# backup
south_mumbai_venues_minus_ind_rest = south_mumbai_venues
south_mumbai_venues_minus_ind_rest.to_csv("south_mumbai_venues_minus_ind_rest.csv")

Some more categories are also large in number and relatively insignificant. Let's get rid of them too.


In [104]:
# Note: 'Bar/Pub' and 'Train/Metro/Bus' are labels created by the clubbing step in the next cell,
# so at this point they appear to match no rows.
venue_category = ['Indian Restaurant','Bar/Pub','Coffee Shop','Restaurant','Fast Food Restaurant','Train/Metro/Bus','Café']
test_df = south_mumbai_venues
test_df = test_df[~test_df['Venue Category'].isin(venue_category)]
south_mumbai_venues = test_df
south_mumbai_venues.shape

(696, 8)

#### Some categories were found to be similar and can be clubbed together; for example, gardens and parks can be clubbed together as Park/Garden.

In [105]:
# A raw string keeps the literal backslash in the label ('\M' is an invalid escape sequence)
south_mumbai_venues.replace(to_replace=['Movie Theater', 'Theater', 'Multiplex', 'Amphitheater',
                                        'Indie Movie Theater', 'Performing Arts Venue'],
                            value=r'Theater\Multiplex', inplace=True)

south_mumbai_venues.replace(to_replace =['Garden','Park'],value = 'Park/Garden', inplace=True)

south_mumbai_venues.replace(to_replace=['Pub', 'Gastropub', 'Bar', 'Cocktail Bar',
                                        'Wine Bar', 'Hotel Bar', 'Sports Bar'],
                            value='Bar/Pub', inplace=True)

south_mumbai_venues.replace(to_replace=['Train Station', 'Bus Station', 'Metro Station', 'Train'],
                            value='Train/Metro/Bus', inplace=True)

south_mumbai_venues.replace(to_replace=['Spa', 'Gym', 'Gym / Fitness Center', 'Yoga Studio',
                                        'Track', 'Massage Studio'],
                            value='Fitness/Gym/Spa', inplace=True)

south_mumbai_venues.replace(to_replace=['Tennis Court', 'Athletics & Sports', 'Cricket Ground',
                                        'Hockey Arena', 'Field', 'Stadium', 'Soccer Field',
                                        'Playground'],
                            value='Sports', inplace=True)

south_mumbai_venues.shape

(696, 8)
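`DataFrame.replace` as used above scans every column, so a venue whose name is exactly "Spa" or "Park" would be relabelled along with the categories. A hedged sketch of an alternative that touches only the `Venue Category` column follows. The dictionary name `category_map` is illustrative, it holds only a subset of the mappings above, and the venue names are made up.

```python
import pandas as pd

# Subset of the mappings from the cell above; category_map is an illustrative name
category_map = {
    'Garden': 'Park/Garden', 'Park': 'Park/Garden',
    'Pub': 'Bar/Pub', 'Bar': 'Bar/Pub', 'Wine Bar': 'Bar/Pub',
    'Gym': 'Fitness/Gym/Spa', 'Spa': 'Fitness/Gym/Spa',
}

toy = pd.DataFrame({
    'Venue': ['Spa', 'Hanging Gardens', 'Some Pub'],   # made-up venue names
    'Venue Category': ['Hotel', 'Garden', 'Pub'],
})

# Only the category column is rewritten; unmapped categories are left unchanged
toy['Venue Category'] = toy['Venue Category'].replace(category_map)
print(toy)
```

Keeping the mapping in one dictionary also makes it easy to review which source categories end up under which label.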

Let's map some categories to a more standard category.

In [0]:
# Correct venue category to a more standard category
test_df = south_mumbai_venues

test_df.loc[test_df['Venue'] == 'Boat ride, gateway of india', 'Venue Category'] = 'Boat or Ferry'
south_mumbai_venues = test_df


In [0]:
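# backup (this binds a second name to the same DataFrame rather than copying it; the CSV below is the actual snapshot)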
south_mumbai_venues_bkp1 = south_mumbai_venues
south_mumbai_venues.to_csv("south_mumbai_venues_clean.csv")



In [108]:
print("After all cleanup: Total venues:", south_mumbai_venues.shape[0])
print("After all cleanup: Total neighborhoods:", neighborhoods.shape[0])


After all cleanup: Total venues: 696
After all cleanup: Total neighborhoods: 39


Data Cleansing ends here.

### **Methodology** <a name="methodology"></a>

In this study, we first cluster the neighborhoods with the k-means algorithm, based on the frequency of occurrence of each venue category within each neighborhood, and then visualize the clusters on a map using folium. In this iteration we eliminate only the clutter and keep food, shopping and places to visit for the analysis.

After analysing the results, we may run another iteration on a specific set of venue categories that are most significant from a tourist's point of view. This will show a different perspective of the city from the first one.


The idea of the code that follows is to see the top 10 venue categories within each neighborhood. Each neighborhood appears as one row and each column represents a venue category. The categories are then ordered by rank: the most frequent venue category gets the first rank, and so on.

In [109]:
# one hot encoding
south_mumbai_onehot = pd.get_dummies(south_mumbai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
south_mumbai_onehot['Neighborhood'] = south_mumbai_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [south_mumbai_onehot.columns[-1]] + list(south_mumbai_onehot.columns[:-1])
south_mumbai_onehot = south_mumbai_onehot[fixed_columns]

south_mumbai_onehot.shape


(696, 127)

In [110]:
#And let's examine the new dataframe size.
south_mumbai_onehot.shape  

(696, 127)

So, let's group by neighborhood and take the mean of the one-hot columns, which gives the frequency of occurrence of each category. The formula here is Count(venues of a category in the neighborhood) divided by Count(all venues in that neighborhood).

In [111]:
south_mumbai_grouped = south_mumbai_onehot.groupby('Neighborhood').mean().reset_index()
print("This matrix has {} rows and {} columns".format(south_mumbai_grouped.shape[0],south_mumbai_grouped.shape[1]))
south_mumbai_grouped.head(10)

This matrix has 29 rows and 127 columns


Unnamed: 0,Neighborhood,Zoo,Accessories Store,American Restaurant,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar/Pub,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Bengali Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Brewery,Bridal Shop,Burger Joint,Burrito Place,Chaat Place,Cheese Shop,Chinese Restaurant,Clothing Store,Club House,College Academic Building,College Auditorium,Comedy Club,Comfort Food Restaurant,Convenience Store,Cosmetics Shop,Creperie,...,Mexican Restaurant,Middle Eastern Restaurant,Monument / Landmark,Mughlai Restaurant,Music Store,Music Venue,Nightclub,North Indian Restaurant,Other Great Outdoors,Outdoors & Recreation,Paper / Office Supplies Store,Park/Garden,Parsi Restaurant,Pizza Place,Platform,Plaza,Pool,Racetrack,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Smoke Shop,Snack Place,Spanish Restaurant,Sports,Sports Club,Steakhouse,Tea Room,Tex-Mex Restaurant,Thai Restaurant,Theater\Multiplex,Toy / Game Store,Train/Metro/Bus,Vegetarian / Vegan Restaurant,Women's Store
0,Andheri,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0
1,Apollo Bandar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.025,0.0,0.025,0.05,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0
2,Bandra West,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.061728,0.012346,0.012346,0.074074,0.123457,0.0,0.0,0.0,0.012346,0.0,0.0,0.0,0.024691,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0,0.049383,0.012346,0.0,0.0,0.012346,0.012346,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0,0.0,0.012346,0.0,0.049383,0.0,0.0,0.0,0.0,0.012346,0.0,0.012346,0.0,0.0,0.037037,0.0,0.0,0.012346,0.024691,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0,0.012346,0.0,0.024691,0.012346,0.012346
3,Bhuleshwar,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.071429,0.071429,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0
4,Borivali,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.1,0.15,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0
5,Chembur,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.043478,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.086957,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130435,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.043478,0.0,0.0,0.086957,0.0
6,Churchgate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.05,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.175,0.0,0.0,0.0,0.0,0.0,0.075,0.0,0.05,0.025,0.0
7,Colaba,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0
8,Dadar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.125,0.0625
9,Fort,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019231,0.019231,0.0,0.019231,0.057692,0.0,0.0,0.0,0.0,0.0,0.019231,0.0,0.019231,0.019231,0.0,0.019231,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.019231,0.019231,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019231,0.038462,0.019231,0.0,0.038462,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.019231,0.0,0.019231,0.019231,0.0,0.0,0.019231,0.0,0.0,0.019231,0.0
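To make the frequency formula concrete, here is a small worked sketch on made-up data. It checks that the mean of the one-hot columns per neighborhood equals the count of a category divided by the total number of venues in that neighborhood. The toy names and numbers are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Hotel', 'Hotel', 'Beach', 'Bakery', 'Beach', 'Beach'],
})

# One-hot encode the category, then average per neighborhood
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])
grouped = onehot.groupby('Neighborhood').mean()

# Same numbers computed directly as count / total per neighborhood
shares = pd.crosstab(toy['Neighborhood'], toy['Venue Category'], normalize='index')

print(grouped)
print(shares)
```

Neighborhood A has 2 hotels among 4 venues, so its Hotel frequency is 0.5, and every row sums to 1.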


In [32]:
south_mumbai_grouped.shape

(29, 127)

In [0]:
# save to file
south_mumbai_grouped.to_csv("south_mumbai_grouped.csv")
# backup
south_mumbai_grouped_bkp = south_mumbai_grouped

Now, let's see the top 5 most common venues for each neighborhood.

In [113]:
num_top_venues = 5

for hood in south_mumbai_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = south_mumbai_grouped[south_mumbai_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Andheri----
               venue  freq
0             Bakery   0.1
1     Ice Cream Shop   0.1
2  Electronics Store   0.1
3       Burger Joint   0.1
4        Snack Place   0.1


----Apollo Bandar----
                venue  freq
0             Bar/Pub  0.20
1               Hotel  0.15
2       Boat or Ferry  0.10
3  Mughlai Restaurant  0.05
4     Fitness/Gym/Spa  0.05


----Bandra West----
                venue  freq
0             Bar/Pub  0.12
1              Bakery  0.07
2    Asian Restaurant  0.06
3  Chinese Restaurant  0.05
4         Pizza Place  0.05


----Bhuleshwar----
             venue  freq
0     Dessert Shop  0.14
1   Ice Cream Shop  0.14
2    Jewelry Store  0.07
3  Train/Metro/Bus  0.07
4           Arcade  0.07


----Borivali----
                venue  freq
0      Clothing Store  0.15
1      Ice Cream Shop  0.15
2  Chinese Restaurant  0.10
3      Sandwich Place  0.10
4              Lounge  0.05


----Chembur----
                           venue  freq
0             Seafood Res

Let's write a function below to sort the venues in descending order.

In [0]:

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
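A quick usage sketch of the function on a made-up row shows what it returns. The function is repeated here so that the example is self-contained, and the row values are invented. Note that categories with equal frequency may come out in any order, because the default sort is not stable.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as above: skip the first entry (the neighborhood name), sort the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: a neighborhood name followed by category frequencies (made up)
toy_row = pd.Series({'Neighborhood': 'Toyville', 'Bakery': 0.1, 'Beach': 0.6, 'Hotel': 0.3})
top_two = list(return_most_common_venues(toy_row, 2))
print(top_two)
```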

Now, let's see the top 10 venues for each neighborhood.

In [115]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no special suffix beyond 'rd': fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = south_mumbai_grouped['Neighborhood']

for ind in np.arange(south_mumbai_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(south_mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Andheri,Ice Cream Shop,Burger Joint,Dessert Shop,Electronics Store,Bowling Alley,Bakery,Snack Place,Sandwich Place,Food Court,Train/Metro/Bus
1,Apollo Bandar,Bar/Pub,Hotel,Boat or Ferry,Chinese Restaurant,Mughlai Restaurant,Fitness/Gym/Spa,Halal Restaurant,Mediterranean Restaurant,Monument / Landmark,Nightclub
2,Bandra West,Bar/Pub,Bakery,Asian Restaurant,Chinese Restaurant,Pizza Place,Fitness/Gym/Spa,Seafood Restaurant,Dessert Shop,Arcade,Bookstore
3,Bhuleshwar,Ice Cream Shop,Dessert Shop,Train/Metro/Bus,Hotel,Chinese Restaurant,Arcade,Jewelry Store,Asian Restaurant,BBQ Joint,Juice Bar
4,Borivali,Ice Cream Shop,Clothing Store,Sandwich Place,Chinese Restaurant,Snack Place,Burger Joint,Convenience Store,Lounge,Pizza Place,Platform


In [0]:
# save to csv file
neighborhoods_venues_sorted.to_csv("neighborhoods_venues_sorted.csv")
#backup 
neighborhoods_venues_sorted_bkp = neighborhoods_venues_sorted

In [117]:
neighborhoods_venues_sorted.shape

(29, 11)

Now let's cluster the neighborhoods with the k-means algorithm, setting the number of clusters (k) to 5.

In [118]:
kclusters = 5
south_mumbai_grouped_clustering = south_mumbai_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(south_mumbai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:40] 


array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 3, 1, 1, 1, 1, 4, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 2], dtype=int32)
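The notebook fixes `kclusters = 5` without showing how the value was chosen. One common way to check such a choice is the silhouette score. This is not the author's method, only a minimal sketch under stated assumptions: the matrix `X` below is synthetic and stands in for the neighborhood frequency matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the frequency matrix: three well-separated blobs of 10 rows each
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real data the scores are usually much closer together, so the result is a guide rather than a definitive answer.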

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

Add clustering labels.

In [0]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [41]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,4,Andheri,Hotel,Food Truck,Burger Joint,Platform,Bakery,Food Court,Sandwich Place,Gift Shop,Creperie,Cupcake Shop
1,4,Apollo Bandar,Bar/Pub,Hotel,Boat or Ferry,Chinese Restaurant,Fitness/Gym/Spa,Mughlai Restaurant,Beach,Juice Bar,Indian Sweet Shop,Diner
2,4,Bandra West,Bar/Pub,Bakery,Asian Restaurant,Chinese Restaurant,Fitness/Gym/Spa,Pizza Place,Seafood Restaurant,Snack Place,Dessert Shop,Bookstore
3,1,Bhuleshwar,Ice Cream Shop,Snack Place,Dessert Shop,Train/Metro/Bus,Arcade,Chinese Restaurant,BBQ Joint,Juice Bar,Zoo,Donut Shop
4,1,Borivali,Clothing Store,Ice Cream Shop,Chinese Restaurant,Sandwich Place,Vegetarian / Vegan Restaurant,Lounge,Department Store,Snack Place,Pizza Place,Burger Joint


Now merge both dataframes to include borough.

In [120]:
south_mumbai_merged = south_mumbai_data

# merge two dataframes
south_mumbai_merged = south_mumbai_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

south_mumbai_merged.head() 


Unnamed: 0,index,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Colaba,South Mumbai,18.915091,72.825969,1,Hotel,Fitness/Gym/Spa,Bar/Pub,Chinese Restaurant,Diner,German Restaurant,Japanese Restaurant,Italian Restaurant,Brewery,Donut Shop
1,1,Apollo Bandar,South Mumbai,18.918375,72.831443,1,Bar/Pub,Hotel,Boat or Ferry,Chinese Restaurant,Mughlai Restaurant,Fitness/Gym/Spa,Halal Restaurant,Mediterranean Restaurant,Monument / Landmark,Nightclub
2,2,Fort,South Mumbai,18.933266,72.834515,1,Dessert Shop,Chinese Restaurant,Seafood Restaurant,Bar/Pub,Irani Cafe,Clothing Store,Lounge,Parsi Restaurant,Plaza,Bookstore
3,3,Churchgate,South Mumbai,18.935957,72.82734,1,Sports,Ice Cream Shop,Hotel,Theater\Multiplex,Italian Restaurant,Pizza Place,Train/Metro/Bus,Bar/Pub,Bakery,Japanese Restaurant
4,4,Nariman Point,South Mumbai,18.925951,72.823208,1,Theater\Multiplex,Italian Restaurant,Hotel,Chaat Place,Mediterranean Restaurant,Sports,Bar/Pub,Shopping Mall,Lounge,Japanese Restaurant


In [0]:
# save to csv file
south_mumbai_merged.to_csv("south_mumbai_merged.csv")
# backup
south_mumbai_merged_bkp = south_mumbai_merged


Each cluster is given the most suitable name based on the ranking of its venue categories.

### **Cluster names**

Cluster Label 0 : Convenience Store, Smoke Shop

Cluster Label 1 : Scenic Lookout, Sports Club

Cluster Label 2 : Hotel, Fitness/Gym/Spa, Chinese Restaurant

Cluster Label 3 : Theater\Multiplex, Ice Cream Shop, Seafood Restaurant

Cluster Label 4 : American Restaurant, Market
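The names above were assigned by inspection. As a hedged sketch of how a first draft of such names could be derived programmatically, the example below averages the category frequencies inside each cluster and joins the two highest ones. The numbers are made up, and `cluster_profile` and `cluster_names` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy frequency matrix with cluster labels already attached (made-up numbers)
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 1],
    'Hotel':  [0.6, 0.4, 0.0],
    'Beach':  [0.1, 0.2, 0.7],
    'Bakery': [0.3, 0.4, 0.3],
})

# Average category frequency inside each cluster
cluster_profile = toy.groupby('Cluster Labels').mean()

# Name each cluster after its two highest-frequency categories
cluster_names = {
    label: ' & '.join(row.nlargest(2).index)
    for label, row in cluster_profile.iterrows()
}
print(cluster_names)
```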


### **Visualize Clusters**

Let's visualize the clusters on a map using the folium library.

In [122]:
latitude = mumbai_latitude
longitude = mumbai_longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(south_mumbai_merged['Latitude'], south_mumbai_merged['Longitude'], south_mumbai_merged['Neighborhood'], south_mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
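In the cell above, `rainbow[cluster-1]` maps label 0 to the last colour through negative indexing, which still gives each cluster its own colour. A minimal sketch of a more direct label-to-colour mapping is shown below, without the map itself. The dictionary name `color_of` is an assumption for illustration.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5
# One hex colour per cluster label, indexed directly by the label (0 .. kclusters-1)
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]
color_of = {label: rainbow[label] for label in range(kclusters)}
print(color_of)
```

A marker for a neighborhood in cluster `c` would then use `color_of[c]` for both `color` and `fill_color`.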


# **Second Iteration** 

It was noticed that the most significant venues, such as museums, beaches, the aquarium, the zoo and boat rides, are far fewer in number than food and shopping places, so they are overshadowed in the first iteration. Therefore, a second iteration was run on a focused set of venue categories: museum, aquarium, monument, beach, zoo, boat ride, etc.
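The overshadowing effect can be illustrated with a small sketch on invented data: a neighborhood with one museum among 39 food venues gives the museum a frequency of only 0.025, while the same museum has a frequency of 1.0 once only categories of tourist interest are kept.

```python
import pandas as pd

# Made-up neighborhood: 1 museum among 39 food venues
toy = pd.DataFrame({'Venue Category': ['History Museum'] + ['Restaurant'] * 39})

# Share of the museum among all venues
share_all = (toy['Venue Category'] == 'History Museum').mean()

# Keep only categories of tourist interest, then recompute the share
focused = toy[toy['Venue Category'].isin(['History Museum', 'Beach', 'Zoo'])]
share_focused = (focused['Venue Category'] == 'History Museum').mean()

print(share_all, share_focused)
```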

In [0]:
#backup iteration 1
south_mumbai_venues_it1 = south_mumbai_venues

Let's club certain categories first.

In [0]:
south_mumbai_venues.replace(to_replace =['Boat ride'],value = 'Boat or Ferry', inplace=True)
south_mumbai_venues.replace(to_replace =['Sculpture Garden'],value = 'Monument / Landmark', inplace=True)
south_mumbai_venues.replace(to_replace =['Other Great Outdoors'],value = 'Scenic Lookout', inplace=True)
south_mumbai_venues.replace(to_replace =['Jetty - Gateway of India'],value = 'Scenic Lookout', inplace=True)



Let's pick only specific categories of tourist interest.

In [125]:
test_df = south_mumbai_venues_bkp1
venue_cat_list = ['Monument / Landmark','Boat or Ferry','Sculpture Garden' \
                 ,'History Museum','Other Great Outdoors','Scenic Lookout','Aquarium' \
                  ,'Zoo','Racetrack','Beach','Boat ride'
                  ]

print("bef:",test_df.shape)
test_df = test_df[test_df['Venue Category'].isin(venue_cat_list) == True]
print("aft:",test_df.shape) 
test_df.head()
south_mumbai_venues = test_df
print("aft:" , south_mumbai_venues.shape)

south_mumbai_venues.head()


                  

bef: (696, 8)
aft: (22, 8)
aft: (22, 8)


Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
44,55,Apollo Bandar,18.918375,72.831443,Gateway of India,18.921856,72.834877,Monument / Landmark
71,97,Apollo Bandar,18.918375,72.831443,APM Deck @ Gateway of India,18.922174,72.833947,Boat or Ferry
72,98,Apollo Bandar,18.918375,72.831443,"Boat ride, gateway of india",18.921847,72.834652,Boat or Ferry
73,99,Apollo Bandar,18.918375,72.831443,Chhatrapati Shivaji Monument,18.922713,72.834172,Monument / Landmark
74,100,Apollo Bandar,18.918375,72.831443,Scenic Lookout,18.922558,72.83464,Boat or Ferry


We have 22 venues in this select group.

In [0]:
#backup iteration 2
south_mumbai_venues_it_2 = south_mumbai_venues
# save to file
south_mumbai_venues_it_2.to_csv("south_mumbai_venues_it_2.csv")

In [49]:
south_mumbai_venues_it_2.shape

(22, 8)

In [127]:
# one hot encoding
south_mumbai_onehot = pd.get_dummies(south_mumbai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
south_mumbai_onehot['Neighborhood'] = south_mumbai_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [south_mumbai_onehot.columns[-1]] + list(south_mumbai_onehot.columns[:-1])
south_mumbai_onehot = south_mumbai_onehot[fixed_columns]

south_mumbai_onehot.shape


(22, 9)

In [129]:
south_mumbai_grouped = south_mumbai_onehot.groupby('Neighborhood').mean().reset_index()
south_mumbai_grouped.head()

Unnamed: 0,Neighborhood,Aquarium,Beach,Boat or Ferry,History Museum,Monument / Landmark,Racetrack,Scenic Lookout,Zoo
0,Apollo Bandar,0.0,0.0,0.666667,0.0,0.333333,0.0,0.0,0.0
1,Churchgate,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0
2,Fort,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0
3,Juhu,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Mahalaxmi,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [0]:
# backup
south_mumbai_grouped_it_2 = south_mumbai_grouped
south_mumbai_grouped_it_2.to_csv("south_mumbai_grouped_it_2.csv")

Let's print each neighborhood with the top 2 most common venues.




In [131]:
num_top_venues = 2

for hood in south_mumbai_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = south_mumbai_grouped[south_mumbai_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Apollo Bandar----
                 venue  freq
0        Boat or Ferry  0.67
1  Monument / Landmark  0.33


----Churchgate----
            venue  freq
0           Beach   0.5
1  Scenic Lookout   0.5


----Fort----
                 venue  freq
0       History Museum   0.5
1  Monument / Landmark   0.5


----Juhu----
      venue  freq
0     Beach   1.0
1  Aquarium   0.0


----Mahalaxmi----
       venue  freq
0  Racetrack   1.0
1   Aquarium   0.0


----Marine Lines----
            venue  freq
0        Aquarium   0.5
1  Scenic Lookout   0.5


----Nagpada----
            venue  freq
0  History Museum   0.5
1             Zoo   0.5


----Nariman Point----
            venue  freq
0  Scenic Lookout   1.0
1        Aquarium   0.0


----Powai----
            venue  freq
0  Scenic Lookout   1.0
1        Aquarium   0.0


----Walkeshwar----
            venue  freq
0  History Museum   1.0
1        Aquarium   0.0


----Worli----
            venue  freq
0  Scenic Lookout   1.0
1        Aquarium   0.0


In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Take top 2 categories by rank.

In [133]:
num_top_venues = 2

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no special suffix beyond 'rd': fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = south_mumbai_grouped['Neighborhood']

for ind in np.arange(south_mumbai_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(south_mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue
0,Apollo Bandar,Boat or Ferry,Monument / Landmark
1,Churchgate,Scenic Lookout,Beach
2,Fort,Monument / Landmark,History Museum
3,Juhu,Beach,Zoo
4,Mahalaxmi,Racetrack,Zoo


In [0]:
# backup
neighborhoods_venues_sorted_it_2 = neighborhoods_venues_sorted
neighborhoods_venues_sorted_it_2.to_csv("neighborhoods_venues_sorted_it_2.csv")

Let's create 10 clusters.

In [135]:
kclusters = 10
south_mumbai_grouped_clustering = south_mumbai_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(south_mumbai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:40]


array([4, 7, 1, 0, 3, 8, 5, 2, 2, 6, 2], dtype=int32)

In [0]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [0]:
south_mumbai_merged = south_mumbai_data
south_mumbai_merged = pd.merge(south_mumbai_merged, neighborhoods_venues_sorted, on='Neighborhood', how='inner')


In [0]:
# backup
south_mumbai_merged.to_csv("south_mumbai_merged_It2.csv")
south_mumbai_merged_it_2 = south_mumbai_merged

Let's see how clusters are formed.

In [140]:
south_mumbai_merged.head(30)

Unnamed: 0,index,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue
0,1,Apollo Bandar,South Mumbai,18.918375,72.831443,4,Boat or Ferry,Monument / Landmark
1,2,Fort,South Mumbai,18.933266,72.834515,1,Monument / Landmark,History Museum
2,3,Churchgate,South Mumbai,18.935957,72.82734,7,Scenic Lookout,Beach
3,4,Nariman Point,South Mumbai,18.925951,72.823208,2,Scenic Lookout,Zoo
4,5,Marine Lines,South Mumbai,18.94567,72.823781,8,Scenic Lookout,Aquarium
5,6,Walkeshwar,South Mumbai,18.955343,72.807947,6,History Museum,Zoo
6,13,Nagpada,South Mumbai,18.968178,72.828601,5,Zoo,History Museum
7,19,Mahalaxmi,South Mumbai,18.982568,72.82416,3,Racetrack,Zoo
8,20,Worli,South Mumbai,19.011696,72.81807,2,Scenic Lookout,Zoo
9,23,Juhu,West Mumbai,19.107021,72.827528,0,Beach,Zoo


This data shows that all the key tourist spots are concentrated in South Mumbai; Juhu and Powai are the only exceptions.
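The grouped table above lists 11 neighborhoods, including Powai, while the merged table shows 10 rows, so the inner merge appears to drop any neighborhood that is missing from `south_mumbai_data`. A minimal sketch of how such dropped rows can be detected with an outer merge and `indicator=True` follows. The two toy frames are made-up stand-ins for the real tables.

```python
import pandas as pd

# Toy stand-ins for the neighborhood table and the clustered table (made-up rows)
toy_data = pd.DataFrame({'Neighborhood': ['Fort', 'Juhu'],
                         'Borough': ['South Mumbai', 'West Mumbai']})
toy_sorted = pd.DataFrame({'Neighborhood': ['Fort', 'Juhu', 'Powai'],
                           'Cluster Labels': [1, 0, 2]})

# An outer merge with indicator=True marks rows present on one side only
check = pd.merge(toy_data, toy_sorted, on='Neighborhood', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'Neighborhood'].tolist()
print(unmatched)
```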

In [0]:
# backup
south_mumbai_merged.to_csv("south_mumbai_merged_It2.csv")

Visualize clusters on the map using folium.

In [142]:
latitude = mumbai_latitude
longitude = mumbai_longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(south_mumbai_merged['Latitude'], south_mumbai_merged['Longitude'], south_mumbai_merged['Neighborhood'], south_mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


We can clearly see that all clusters are concentrated in South Mumbai, with only two exceptions.

# **4. Results** <a name="results"></a>

The first iteration gives the following five clusters.


Cluster 0 : “Convenience Store & Smoke Shop”

Cluster 1 : “Scenic Lookout & Sports”

Cluster 2 : “Hotel & Fitness” 

Cluster 3 : “Chinese & Seafood Restaurants”

Cluster 4 : “American Restaurant & Market”

Food and other common venues dominate the far more significant spots of tourist interest, such as beaches, museums, the aquarium, the zoo and boat rides, because beaches and museums are usually far fewer in number than restaurants and similar venues. This issue was resolved by a second iteration of the analysis on a focused set of venues only. The result shows very clearly that these significant places of interest are concentrated in South Mumbai; Juhu and Powai are the only exceptions.

# **5. Discussion** <a name="discussion"></a>

The study revealed that the major tourist spots, such as museums, the aquarium and beaches, are within South Mumbai. However, other boroughs give South Mumbai good competition as far as food and shopping are concerned. Still, I highly recommend that tourists book a hotel in South Mumbai, where they can find all places of interest either within walking distance or within minutes by cab. Travelling by bus or train is not recommended for tourists. If they book a hotel in another borough of Mumbai, travelling by cab will take a lot of time due to heavy traffic.

Mumbai attracts a lot of tourists each year, and each borough of Mumbai is distinct from a tourist's point of view. Therefore, tourists should be guided by an analysis of the venues in each neighborhood, visualized as a map that shows these clusters. Since the k-means approach makes no distinction between one venue category and another, common venues can outnumber the significant ones, which are always relatively few. To overcome this problem, a further iteration was done with a focused set of venues belonging to places to visit, such as museums, the aquarium and beaches.
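The reason common categories tend to dominate can be read off the standard k-means objective. The symbols below are not in the original and are introduced only for this sketch: $x_i$ is the frequency vector of neighborhood $i$, $C_j$ is the set of neighborhoods in cluster $j$, $\mu_j$ is the centroid of that cluster, and $c$ runs over venue categories.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\lVert x_i - \mu_j \rVert^2 = \sum_{c} \left( x_{ic} - \mu_{jc} \right)^2
```

Every category enters the sum with the same weight, so a category whose frequency varies widely between neighborhoods (typically a common food category) tends to contribute more to the distance than a rare one such as a museum.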

# **6. Conclusion** <a name="conclusion"></a>

In this analysis, I explored Mumbai neighborhoods and clustered them by similarity, based on venue categories and their frequency of occurrence. I used the k-means algorithm to create the clusters of Mumbai neighborhoods. These clusters reveal an important fact: most tourist spots are within South Mumbai, which also has hotels and offers many restaurants with international cuisines. Other boroughs of Mumbai, like West, East and North, can compete with South Mumbai as far as restaurants and shopping are concerned, but come nowhere near it for places like museums, monuments, beaches and boat rides. A hotel in South Mumbai will save a lot of time and the hassle of commuting on overcrowded trains and buses.

For future study, other focused groups can be picked for analysis, such as Italian, French and Mexican restaurants; parks, gardens and lounges; movie theaters and theaters; or shopping places like malls, clothing shops and shoe shops.
This approach is expected to reveal a lot more useful information from a tourist's point of view.