# Week 5: Assignment - The Battle of Neighborhoods

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Observations & Conclusions](#observations)

## Introduction: Business Problem <a name='introduction'></a>

Surat has many shopping malls, and more are being built. Malls give property developers a consistent rental income, but opening a new one is more complicated than it seems and requires serious consideration. In particular, the location is one of the most important decisions, as it largely determines whether the mall succeeds or fails.

The objective of this project is to analyse and select the best locations in Surat, Gujarat, to open a new shopping mall. Using data science methods such as visualisation and machine learning techniques such as clustering, the project aims to answer one question: ‘Where should a new shopping mall be opened in a developed city like Surat?’

## Data <a name='data'></a>

1. List of neighbourhoods in Surat:  
The Wikipedia page ‘https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Surat’ lists 76 neighbourhoods in Surat. Web scraping is used to extract this list from the page.


2. GPS coordinates of the neighbourhoods: 
<p>The geographical coordinates (latitude, longitude) of each neighbourhood are obtained with the Python Geocoder package. They are needed to plot the map and to query venue data.</p>


3. Foursquare API: 
<p>The Foursquare API is then used to retrieve venue data for each neighbourhood. This data is used to cluster the neighbourhoods with a machine learning model (k-means clustering) and to recommend the best places to build new malls.</p>

### Import Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported")

Libraries imported


### Scrape data from Wikipedia page into a DataFrame

In [2]:
# send GET request
data = requests.get('https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Surat').text

In [3]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store neighborhood data
nghList = []

In [5]:
# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    nghList.append(row.text)

In [6]:
# create a new DataFrame from the list
df_surat = pd.DataFrame({"Neighborhood": nghList})

df_surat.head()

Unnamed: 0,Neighborhood
0,Agnovad
1,"Akoti, Gujarat"
2,Amroli
3,Athwalines
4,Bajipura


In [7]:
df_surat.shape

(76, 1)
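
Some scraped names carry a Wikipedia disambiguation suffix (the table above shows "Akoti, Gujarat"), so a query built as `'{}, Surat, Gujarat'` would read "Akoti, Gujarat, Surat, Gujarat". The following is a minimal sketch of stripping that suffix before geocoding. The helper name `clean_name` and the choice to strip only a trailing ", Gujarat" are assumptions for illustration, not part of the original notebook.

```python
import re

# Hypothetical helper: remove a trailing ", Gujarat" disambiguation suffix
# from a scraped Wikipedia title. Other suffix styles are not handled.
_SUFFIX = re.compile(r",\s*Gujarat\s*$")

def clean_name(title):
    return _SUFFIX.sub("", title).strip()

# names taken from the first rows of the table above
names = ["Agnovad", "Akoti, Gujarat", "Amroli"]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

Whether the cleaned or the original name should be kept in the `Neighborhood` column is a design choice; keeping the original as the key and using the cleaned name only for the geocoding query avoids changing later joins.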

### Get Geographical Coordinates

In [8]:
# function to get coordinates
def get_latlng(neighborhood, max_retries=5):
    # query the ArcGIS geocoder, retrying a bounded number of times
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, Surat, Gujarat'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    # give up instead of looping forever when the geocoder keeps failing
    raise RuntimeError('Could not geocode {}'.format(neighborhood))

In [9]:
# function call to get the coordinates. Storing it in a new list using list comprehension
coords = [get_latlng(neighborhood) for neighborhood in df_surat['Neighborhood'].tolist()]
coords

[[21.185780000000022, 72.83679000000006],
 [21.17453000000006, 73.19443000000007],
 [21.237760000000037, 72.85623000000004],
 [21.182840000000056, 72.80776000000003],
 [22.68966000000006, 73.04369000000008],
 [21.13444000000004, 72.81642000000005],
 [21.220650000000035, 72.70805000000007],
 [21.133600000000058, 73.10661000000005],
 [21.185780000000022, 72.83679000000006],
 [21.185780000000022, 72.83679000000006],
 [21.197370000000035, 72.82697000000007],
 [21.26941434693289, 72.95542404302037],
 [21.131590000000074, 72.79619000000008],
 [21.155760000000043, 72.96000000000004],
 [21.276440000000036, 72.80655000000007],
 [21.08893000000006, 73.01481000000007],
 [21.176522999974978, 72.81900900000278],
 [21.169840000000022, 72.87613000000005],
 [21.275030000000072, 73.25078000000008],
 [21.193190000000072, 72.82404000000008],
 [21.208930000000066, 73.20511000000005],
 [21.193850000000054, 72.73244000000005],
 [21.193435725630138, 72.83283709510232],
 [21.223160000000064, 73.22745000000003

In [10]:
# create temp dataframe to store the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

# merge the coordinates into the original dataframe
df_surat['Latitude'] = df_coords['Latitude']
df_surat['Longitude'] = df_coords['Longitude']

# checking
print(df_surat.shape)
df_surat.head()

(76, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Agnovad,21.18578,72.83679
1,"Akoti, Gujarat",21.17453,73.19443
2,Amroli,21.23776,72.85623
3,Athwalines,21.18284,72.80776
4,Bajipura,22.68966,73.04369


In [11]:
# exporting dataframe
df_surat.to_csv("Surat_database.csv", index=False)

### Use geopy library to get the latitude and longitude values of Surat

In [12]:
# get the coordinates of Surat
address = 'Surat, Gujarat'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Surat, Gujarat are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Surat, Gujarat are 21.1864607, 72.8081281.
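
With the city centre known, the geocoded neighbourhood coordinates can be sanity-checked. In the outputs of this notebook, several neighbourhoods share exactly the same coordinates (21.18578, 72.83679), and Bajipura sits at latitude 22.69, well north of the rest. The sketch below flags both situations on a few rows copied from this notebook. The 50 km cut-off and the helper name `haversine_km` are assumptions. Identical coordinates may mean the geocoder returned a generic fallback location, but that is not confirmed here.

```python
import math
from collections import Counter

# City centre as printed above; sample rows copied from tables in this notebook.
CENTER = (21.1864607, 72.8081281)
sample = {
    "Agnovad": (21.18578, 72.83679),
    "Amroli": (21.23776, 72.85623),
    "Athwalines": (21.18284, 72.80776),
    "Bajipura": (22.68966, 73.04369),
    "Bedkuvadoor": (21.18578, 72.83679),
    "Bhadbhuja": (21.18578, 72.83679),
}

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

MAX_KM = 50  # assumed cut-off for "plausibly inside the Surat area"

# neighbourhoods whose geocoded point is far from the city centre
far_away = {name: round(haversine_km(CENTER, coord), 1)
            for name, coord in sample.items()
            if haversine_km(CENTER, coord) > MAX_KM}

# neighbourhoods that share exactly the same coordinate pair
counts = Counter(sample.values())
shared = sorted(name for name, coord in sample.items() if counts[coord] > 1)

print("Far from the centre:", far_away)
print("Sharing identical coordinates:", shared)
```

Rows flagged this way would return identical or irrelevant venue lists from the venue search, so they are worth reviewing before clustering.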


### Create a map of Surat with neighborhoods superimposed on top

In [13]:
map_surat = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, neighborhood in zip(df_surat['Latitude'], df_surat['Longitude'], df_surat['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_surat)
    
map_surat

In [14]:
# map can be saved as html file
map_surat.save('Surat_map.html')

## Methodology <a name='methodology'></a>

<p>In the first step, we collected the required data from several sources and visualized the neighborhoods on a map.</p>
<p>In the next step, we use the Foursquare API to get venue data for the neighborhoods. The API returns many venue categories, but we are particularly interested in the 'Shopping Mall' category.</p>
<p>Finally, we apply a machine learning algorithm (k-means clustering) and visualize the resulting clusters with folium to answer the question posed in the Business Problem section.</p>

## Analysis <a name='analysis'></a>

### Defining Foursquare API Credentials

In [17]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


### Exploring nearby venues (top 100 within 3000 metres)

In [18]:
radius = 3000
LIMIT = 100
venues = []

for lat, lng, neighborhood in zip(df_surat['Latitude'], df_surat['Longitude'], df_surat['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            lng, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
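
The loop above indexes `["response"]['groups'][0]['items']` and `['categories'][0]` directly, so a response that lacks any of these would raise a `KeyError` or `IndexError`. The following is a minimal sketch of a more defensive parser. The function name `extract_venues` is hypothetical, and the payload shape is assumed from the fields the loop above reads; the sample payload is hand-written, not real API output.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore-style response into rows, skipping incomplete items."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if "name" not in venue or not categories:
            continue  # no name or no category: nothing useful to cluster on
        rows.append((neighborhood, lat, lng, venue["name"],
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows

# Hand-written payload mimicking the fields read above (illustrative only).
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Mall",
               "location": {"lat": 21.19, "lng": 72.82},
               "categories": [{"name": "Shopping Mall"}]}},
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 21.2, "lng": 72.8},
               "categories": []}},
]}]}}

rows = extract_venues(fake_payload, "Agnovad", 21.18578, 72.83679)
empty = extract_venues({"response": {}}, "Agnovad", 21.18578, 72.83679)
print(rows)
print(empty)
```

Inside the loop, `venues.extend(extract_venues(requests.get(url).json(), neighborhood, lat, lng))` would be one way to use it, producing the same seven-field tuples as the original code.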

In [19]:
# convert the venues list into a new DataFrame
df_venues = pd.DataFrame(venues)

# define the column names
df_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

print(df_venues.shape)
df_venues.head()

(794, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,Agnovad,21.18578,72.83679,A-One Coco,21.197061,72.821175,Ice Cream Shop
1,Agnovad,21.18578,72.83679,Chamunda Restaurant,21.181735,72.826476,Tea Room
2,Agnovad,21.18578,72.83679,Cafe Coffee Day,21.197228,72.843714,Coffee Shop
3,Agnovad,21.18578,72.83679,Ganesh Omlet,21.197151,72.837306,Indian Restaurant
4,Agnovad,21.18578,72.83679,Gokulam Dairy,21.178771,72.810985,Dairy Store


### Number of venues returned for each neighborhood

In [20]:
df_venues.groupby(["Neighborhood"]).count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
Agnovad,39,39,39,39,39,39
Amroli,4,4,4,4,4,4
Athwalines,60,60,60,60,60,60
Bajipura,1,1,1,1,1,1
Bamroli,10,10,10,10,10,10
Barbodhan,2,2,2,2,2,2
Bardoli,4,4,4,4,4,4
Bedkuvadoor,39,39,39,39,39,39
Bhadbhuja,39,39,39,39,39,39
Bhagal,39,39,39,39,39,39


### Check unique categories

In [22]:
print('There are {} unique categories.'.format(len(df_venues['Venue Category'].unique())))

There are 76 unique categories.


### Analyze each neighborhood 

In [47]:
# one hot encoding
surat_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
surat_onehot['Neighborhood'] = df_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [surat_onehot.columns[-1]] + list(surat_onehot.columns[:-1])
surat_onehot = surat_onehot[fixed_columns]

print(surat_onehot.shape)
surat_onehot.head()

(794, 77)


Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Arcade,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Beach,Beer Garden,Breakfast Spot,Bridal Shop,Burger Joint,Bus Station,Café,Campground,Cheese Shop,Chinese Restaurant,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Dairy Store,Department Store,Dessert Shop,Diner,Dive Bar,Donut Shop,Electronics Store,Farm,Fast Food Restaurant,Food & Drink Shop,Food Court,Food Truck,Fried Chicken Joint,Frozen Yogurt Shop,Gas Station,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Juice Bar,Lake,Light Rail Station,Lounge,Market,Mattress Store,Miscellaneous Shop,Movie Theater,Moving Target,Multiplex,Music Venue,Park,Performing Arts Venue,Pizza Place,Platform,Plaza,Resort,Rest Area,Restaurant,River,Rock Climbing Spot,Sandwich Place,Seafood Restaurant,Shopping Mall,Smoke Shop,Snack Place,Supermarket,Tea Room,Train Station,Vegetarian / Vegan Restaurant
0,Agnovad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Agnovad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Agnovad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Agnovad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Agnovad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [31]:
surat_grouped=surat_onehot.groupby('Neighborhood').mean().reset_index()
print(surat_grouped.shape)
surat_grouped

(57, 77)


Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Arcade,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Beach,Beer Garden,Breakfast Spot,Bridal Shop,Burger Joint,Bus Station,Café,Campground,Cheese Shop,Chinese Restaurant,Coffee Shop,Concert Hall,Convenience Store,Cosmetics Shop,Dairy Store,Department Store,Dessert Shop,Diner,Dive Bar,Donut Shop,Electronics Store,Farm,Fast Food Restaurant,Food & Drink Shop,Food Court,Food Truck,Fried Chicken Joint,Frozen Yogurt Shop,Gas Station,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Juice Bar,Lake,Light Rail Station,Lounge,Market,Mattress Store,Miscellaneous Shop,Movie Theater,Moving Target,Multiplex,Music Venue,Park,Performing Arts Venue,Pizza Place,Platform,Plaza,Resort,Rest Area,Restaurant,River,Rock Climbing Spot,Sandwich Place,Seafood Restaurant,Shopping Mall,Smoke Shop,Snack Place,Supermarket,Tea Room,Train Station,Vegetarian / Vegan Restaurant
0,Agnovad,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.025641,0.0,0.0,0.051282,0.0,0.051282,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051282,0.0,0.025641,0.179487,0.0,0.0,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.076923,0.0,0.0,0.0,0.0,0.051282,0.0,0.0,0.025641,0.0,0.051282,0.0,0.0,0.0,0.076923,0.025641,0.0
1,Amroli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Athwalines,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.016667,0.0,0.016667,0.0,0.05,0.0,0.0,0.083333,0.016667,0.0,0.0,0.0,0.016667,0.0,0.033333,0.0,0.0,0.0,0.033333,0.0,0.133333,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.0,0.016667,0.0,0.033333,0.1,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.016667,0.016667,0.066667,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.033333,0.016667,0.016667,0.016667,0.05,0.0,0.016667
3,Bajipura,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bamroli,0.5,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Barbodhan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Bardoli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Bedkuvadoor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.025641,0.0,0.0,0.051282,0.0,0.051282,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051282,0.0,0.025641,0.179487,0.0,0.0,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.076923,0.0,0.0,0.0,0.0,0.051282,0.0,0.0,0.025641,0.0,0.051282,0.0,0.0,0.0,0.076923,0.025641,0.0
8,Bhadbhuja,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.025641,0.0,0.0,0.051282,0.0,0.051282,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051282,0.0,0.025641,0.179487,0.0,0.0,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.076923,0.0,0.0,0.0,0.0,0.051282,0.0,0.0,0.025641,0.0,0.051282,0.0,0.0,0.0,0.076923,0.025641,0.0
9,Bhagal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.025641,0.0,0.025641,0.0,0.0,0.025641,0.051282,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.0,0.0,0.051282,0.0,0.051282,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051282,0.0,0.025641,0.179487,0.0,0.0,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.076923,0.0,0.0,0.0,0.0,0.051282,0.0,0.0,0.025641,0.0,0.051282,0.0,0.0,0.0,0.076923,0.0,0.0
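
Taking the mean of one-hot columns gives, for each neighborhood, the share of its venues that fall in each category. For Agnovad, for example, the venue count table shows 39 venues, and the values 0.051282 and 0.179487 correspond to 2/39 and 7/39. The sketch below reproduces this arithmetic in plain Python on a toy list. The counts of 2 and 7 are inferred from those ratios, and the "Other" label is a placeholder for the remaining venues.

```python
from collections import Counter

# Toy venue list for one neighbourhood (illustrative, not the real rows):
# 39 venues in total, 2 shopping malls and 7 Indian restaurants.
categories = ["Shopping Mall"] * 2 + ["Indian Restaurant"] * 7 + ["Other"] * 30

# One-hot encode: one 0/1 indicator per venue for each category.
labels = sorted(set(categories))
onehot = [[1 if c == label else 0 for label in labels] for c in categories]

# Column means of the indicators ...
means = {label: sum(row[i] for row in onehot) / len(onehot)
         for i, label in enumerate(labels)}

# ... equal count / total for each category.
shares = {label: n / len(categories) for label, n in Counter(categories).items()}

print(round(means["Shopping Mall"], 6))      # 0.051282
print(round(means["Indian Restaurant"], 6))  # 0.179487
```

Because the value is a share rather than a count, a neighborhood with a single venue that happens to be a mall would score 1.0, so small venue counts deserve some caution when reading the clusters.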


In [32]:
len(surat_grouped[surat_grouped['Shopping Mall'] > 0])

19

### New dataframe for Shopping Mall data only

In [33]:
surat_mall = surat_grouped[["Neighborhood", "Shopping Mall"]]
surat_mall.head()

Unnamed: 0,Neighborhood,Shopping Mall
0,Agnovad,0.051282
1,Amroli,0.0
2,Athwalines,0.033333
3,Bajipura,0.0
4,Bamroli,0.0


### Cluster Neighborhoods

In [34]:
# set number of clusters
kclusters = 3

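# keep only the numeric feature; the positional axis argument is not accepted by recent pandas versions
surat_clustering = surat_mall.drop(columns=['Neighborhood'])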

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(surat_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 0, 1, 1, 1, 1, 0, 0, 0])
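
Since the clustering uses a single feature (the shopping mall share), k-means here amounts to splitting values on a line into `k` groups. The sketch below is a plain-Python version of Lloyd's algorithm on a subset of shares copied from the tables in this notebook. The evenly spaced initialisation is an assumption and differs from scikit-learn's default, so label numbering and results on the full data may differ from `KMeans`.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on a list of floats (k >= 2).

    Returns (centroids, labels), with labels aligned to `values`.
    """
    lo, hi = min(values), max(values)
    # assumed initialisation: k centroids evenly spaced over the data range
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iterations):
        # assignment step: nearest centroid for every value
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # update step: each centroid moves to the mean of its members
        new = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:
            break  # assignments are stable
        centroids = new
    return centroids, labels

# subset of 'Shopping Mall' shares shown in this notebook, sorted for readability
shares = [0.0, 0.0, 0.0, 0.033333, 0.039216, 0.043478, 0.045455, 0.051282,
          0.057143, 0.071429, 0.105263, 0.111111, 0.133333, 0.142857,
          0.142857, 0.166667]

centroids, labels = kmeans_1d(shares, k=3)
print([round(c, 4) for c in centroids])
print(labels)
```

On this subset the three groups are the zero-share neighborhoods, those around 0.03 to 0.07, and those above 0.10.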

### Create a new dataframe that includes the cluster label for each neighborhood

In [35]:
surat_merged = surat_mall.copy()

# add clustering labels
surat_merged["Cluster Labels"] = kmeans.labels_

surat_merged.head()

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels
0,Agnovad,0.051282,0
1,Amroli,0.0,1
2,Athwalines,0.033333,0
3,Bajipura,0.0,1
4,Bamroli,0.0,1


In [37]:
# join surat_merged with df_surat to add latitude, longitude for each neighborhood
surat_merged = surat_merged.join(df_surat.set_index("Neighborhood"), on="Neighborhood")

print(surat_merged.shape)
surat_merged.head()

(57, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Agnovad,0.051282,0,21.18578,72.83679
1,Amroli,0.0,1,21.23776,72.85623
2,Athwalines,0.033333,0,21.18284,72.80776
3,Bajipura,0.0,1,22.68966,73.04369
4,Bamroli,0.0,1,21.13444,72.81642


### Visualize resulting clusters

In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lng, poi, cluster in zip(surat_merged['Latitude'], surat_merged['Longitude'], surat_merged['Neighborhood'], surat_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [42]:
# export the clustered map as html 
map_clusters.save('Surat_clusters.html')

### Examine each cluster

#### Cluster 0

In [43]:
surat_merged.loc[surat_merged['Cluster Labels']==0]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Agnovad,0.051282,0,21.18578,72.83679
2,Athwalines,0.033333,0,21.18284,72.80776
7,Bedkuvadoor,0.051282,0,21.18578,72.83679
8,Bhadbhuja,0.051282,0,21.18578,72.83679
9,Bhagal,0.051282,0,21.19737,72.82697
13,Ghod Dod Road,0.039216,0,21.176523,72.819009
15,Gopipura,0.045455,0,21.19319,72.82404
17,Inderpura,0.057143,0,21.193436,72.832837
31,Mahidharpura,0.071429,0,21.20273,72.8342
34,"Mota, Gujarat",0.043478,0,21.19558,72.8202


#### Cluster 1

In [44]:
surat_merged.loc[surat_merged['Cluster Labels']==1]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
1,Amroli,0.0,1,21.23776,72.85623
3,Bajipura,0.0,1,22.68966,73.04369
4,Bamroli,0.0,1,21.13444,72.81642
5,Barbodhan,0.0,1,21.22065,72.70805
6,Bardoli,0.0,1,21.1336,73.10661
10,Bhavanivad,0.0,1,21.269414,72.955424
11,Bhimrad,0.0,1,21.13159,72.79619
12,Chalthan,0.0,1,21.15576,72.96
14,Godadara,0.0,1,21.16984,72.87613
16,Ichchhapor,0.0,1,21.19385,72.73244


#### Cluster 2

In [45]:
surat_merged.loc[surat_merged['Cluster Labels']==2]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
22,Katargam,0.142857,2,21.22561,72.82733
48,Utran,0.166667,2,21.22791,72.86106
50,Varachha,0.105263,2,21.20354,72.85385
52,Vastadevdi,0.111111,2,21.21877,72.83367
53,Vedachhi,0.142857,2,21.24566,72.81954
54,Vesu,0.133333,2,21.13592,72.77322
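
To interpret the labels, it helps to compare the average mall share per cluster, because k-means label numbers carry no order of their own. The sketch below does this in plain Python for the rows displayed above only, not for all 57 neighborhoods.

```python
from statistics import mean

# (cluster label, mall share) pairs copied from the rows displayed above
rows = [
    (0, 0.051282), (0, 0.033333), (0, 0.051282), (0, 0.051282), (0, 0.051282),
    (0, 0.039216), (0, 0.045455), (0, 0.057143), (0, 0.071429), (0, 0.043478),
    (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.0),
    (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.0),
    (2, 0.142857), (2, 0.166667), (2, 0.105263), (2, 0.111111),
    (2, 0.142857), (2, 0.133333),
]

# collect the shares belonging to each cluster label
by_cluster = {}
for label, share in rows:
    by_cluster.setdefault(label, []).append(share)

# average share per cluster, then labels ordered from lowest to highest share
summary = {label: round(mean(values), 4) for label, values in by_cluster.items()}
ranking = sorted(summary, key=summary.get)

print(summary)
print(ranking)
```

On the full dataframe, the equivalent pandas call would be `surat_merged.groupby('Cluster Labels')['Shopping Mall'].mean()`.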


## Observations & Conclusions <a name='observations'></a>

<p>Most shopping malls are concentrated in the central area of Surat. Cluster 2 has the highest share of shopping malls among its venues, and cluster 0 a moderate share. Cluster 1, in contrast, has few or no shopping malls in its neighborhoods. These neighborhoods represent an opportunity for new shopping malls, since there is little to no competition from existing ones. Shopping malls in cluster 2, meanwhile, are likely to face intense competition because of their high concentration.</p>
<p>From another perspective, the results suggest that the oversupply of shopping malls is mostly in the central area of the city, while the suburbs still have very few. This project therefore recommends that developers and investors open new shopping malls in cluster 1 neighborhoods, where there is little to no competition, and avoid cluster 2 neighborhoods, where the concentration of malls is already high.</p>