# Capstone Project: The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find out an optimal livable location in a city.
By livable location here it means the number of facilities, nearby venues provided in the vicinity of a location. This report specifically targets stakeholders who are interested in finding a livable location in **Pune city, India**.

Also, the stakeholder would choose a particular location which is closest to its workplace. Hence, we would categorize various locations in the city on the basis of their distance between **the most prominent workplaces/ industrial zones in the city** and **nearby venues to those locations**

We would cluster all the important/livable and recommended regions where someone can live which have the most number of amenities/venues in the locality. User/Stakeholder can choose according the location nearest to his/her office/campus. 

For example: - Hinjewadi and Magarpatta are two IT company zones which are situated in opposite direction to each other, If someone lives in a location near to Hinjewadi and his/her company moves to Magarpatta, which location should be suggested having similar facilities/ venues compared to his/her current living location.

We would identify areas with most promising characteristics and their advantages will then be clearly expressed, so that best possible final location may be chosen by our stakeholders.

## Data <a name="data"></a>

We have taken latitude and longitude of most prominent pune locations from ___[PMC Open Data Store](http://opendata.punecorporation.org/Citizen/CitizenDatasets/Index)___ offical website and __[Kaggle Dataset](https://www.kaggle.com/dynamic22/pune-property-prices)__. The data derived from kaggle dataset do not have latitude longitude information, hence we have used geopy library to fetch latitude and longitude values for such locations.

In [1]:
#Let's try to read dataset
import pandas as pd
location_df=pd.read_excel('dataset.xlsx', sheet_name='Sheet2',  header=0, nrows=199)


In [2]:
import requests
location_df

Unnamed: 0,Location,Latitude,Longitude
0,Bund Garden,18.539848,73.885117
1,Shivajinagar,18.510099,73.817398
2,Aundh,18.563162,73.809555
3,Kondhwa,18.478436,73.890213
4,Chinchwad,18.636131,73.796143
5,Satara Road,18.488499,73.857956
6,Kothrud,18.508699,73.812500
7,Senapati Bapat Road,18.534451,73.837349
8,Kalyani Nagar,18.548101,73.900070
9,Hinjewadi Phase1,18.586555,73.734741


Let's import folium and locate this points on Pune map

In [3]:
import folium
from geopy.geocoders import Nominatim
print('libraries imported')

libraries imported


#### Use geopy library to get longitude and latitude of Pune city, India

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>pune_explorer</em>, as shown below.

The below code is used to fetch and store latitude and longitude values in excel file.

In [4]:
import math
latlon = []
for index in range(66,121):
    if math.isnan(location_df.loc[index,'Latitude']):
        address = location_df.loc[index,'Location'] + ', Pune IN'
        geolocator = Nominatim(user_agent="pune_explorer")
        location = geolocator.geocode(address)
        if location is not None:
            location_df.loc[index,'Latitude'] = location.latitude
            location_df.loc[index,'Longitude'] = location.longitude
            print('The geograpical coordinate of {} are {}, {}.'.format(address, latitude, longitude))
location_df.to_excel('dataset1.xlsx')

We will also use location of 10 industrial areas in Pune. This data is also taken from ___[PMC Open Data Store](http://opendata.punecorporation.org/Citizen/CitizenDatasets/Index)___ offical website. Let's read that file

In [5]:
industries_df=pd.read_excel('Major Industries.xlsx', sheet_name='Sheet1',  header=0, nrows=10)
industries_df

Unnamed: 0,Industries,Latitude,Longitude
0,Pimpri Chinchwad MIDC,18.627929,73.800983
1,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,18.591684,73.734782
2,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,18.598255,73.706207
3,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,18.59177,73.733895
4,Magarpatta City,18.522141,73.93174
5,Kharadi Knowledge Park,18.550518,73.942494
6,Talawade InfoTech Park,18.739658,73.806857
7,Talegaon Floriculture Park,18.729488,73.654067
8,Ranjangaon Industrial Area,18.753635,74.244579
9,Chakan Industrial Area,18.762311,73.862545


Now, create a map of Pune city with nearby locations superimposed on top

In [6]:
# create map of Pune using latitude and longitude values
address = 'Pune, IN'
geolocator = Nominatim(user_agent="pune_explorer")
location = geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
map_pune = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, Location in zip(location_df['Latitude'], location_df['Longitude'], location_df['Location']):
    label = Location
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_pune)  
    
map_pune

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [7]:
CLIENT_ID = 'TR4OJXMM340Z5YTVDZ1QGAG1YHPBOJ4NEPK4V52K10Y0RYPY' # your Foursquare ID
CLIENT_SECRET = 'DZFZHGTHTBEWNOM25GNI24LBCGGUTLVPQMALSZQBQQWSIZMT' # your Foursquare Secret
VERSION = '20190324' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: TR4OJXMM340Z5YTVDZ1QGAG1YHPBOJ4NEPK4V52K10Y0RYPY
CLIENT_SECRET:DZFZHGTHTBEWNOM25GNI24LBCGGUTLVPQMALSZQBQQWSIZMT


## Explore Neighborhoods in Pune

Let's borrow the **get_category_type** function from the Foursquare lab.

In [8]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Let's create a function to retrieve nearby venues in Pune

We will use 1 Km as radius this time

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    LIMIT=100
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Now lets write the code to run the above function on each neighborhood and create a new dataframe.

In [10]:
pune_venues = getNearbyVenues(names=location_df['Location'],
                                   latitudes=location_df['Latitude'],
                                   longitudes=location_df['Longitude']
                                  )

Bund Garden
Shivajinagar
Aundh
Kondhwa
Chinchwad
Satara Road
Kothrud
Senapati Bapat Road
Kalyani Nagar
Hinjewadi Phase1
Hinjewadi Phase2
Magarpatta City
VimanNagar
Baner
Hinjewadi
Kirkee
Fatima Nagar
Pimpri
Model Colony - Wealth Branch
Pimple Saudagar
Sinhagad Road
Tilak Road
Bavdhan
Katraj
Aundh - Nagardas Road
Koregaon Park - Wealth Branch
Kharadi
Nigdi
Thermax Chowk
Mayur Colony
Sus Pashan Road
WTC-Kharadi
Navi Peth
Warje
Murti - Baramati
Karanjepul
Kamthadi
Kikvi
Malad Patas
Deulgaon Raje
Pimpalgaon - Daund
Pargaon
B T Kawade
Balewadi, Maharashtra
Karve Nagar
Nanded City, Maharashtra
Bhigwan
Sahakar Nagar
Bhandarkar Road
Raviwar Peth
Sadashiv Peth
Erandavana
Camp
Paud Road
Ghole Road
Blueridge Hinjewadi
Ravet
New Sanghvi
Pirangut
Narhe
Wagholi
Mohammadwadi
Bhosle Nagar
Undri Pisoli
E Square University Road
Vishrantwadi
Nana Peth
Fursungi
Salunke Vihar Road
Manjari Road Hadapsar
Phoenix Mall
Shivar Garden Chowk
Baner-D-Mart Complex
Null Stop-Karve Road
Market Yard
Pimple Nilakh
Prab

In [11]:
print(pune_venues.shape)
pune_venues.head()

(5431, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bund Garden,18.539848,73.885117,La Pizzeria,18.539621,73.883401,Italian Restaurant
1,Bund Garden,18.539848,73.885117,Hidden Place - The Hangout,18.539651,73.887023,Pub
2,Bund Garden,18.539848,73.885117,Savya Rasa,18.538874,73.886561,South Indian Restaurant
3,Bund Garden,18.539848,73.885117,Starbucks Coffee: A Tata Alliance,18.539341,73.886602,Coffee Shop
4,Bund Garden,18.539848,73.885117,Little Italy,18.539598,73.883464,Italian Restaurant


Let's check how many venues were returned for each neighborhood

In [12]:
pune_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adarsh Nagar,100,100,100,100,100,100
Akurdi,17,17,17,17,17,17
Alandi,5,5,5,5,5,5
Alandi Road,5,5,5,5,5,5
Ambedkar Nagar,100,100,100,100,100,100
Anand Nagar,12,12,12,12,12,12
Anand Park Nagar,15,15,15,15,15,15
Ashok Nagar,11,11,11,11,11,11
Aundh,59,59,59,59,59,59
Aundh - Nagardas Road,51,51,51,51,51,51


#### Let's find out how many unique categories can be curated from all the returned venues

In [13]:
print('There are {} uniques categories.'.format(len(pune_venues['Venue Category'].unique())))

There are 233 uniques categories.


I will use these 3 datasets to determine similarity across various locations in Pune city.

## Methodology <a name="methodology"></a>

We are going to analyze our datasets in two different cases: -

1.) Top 10 venues at each location will be calculated, then we will run K-Means clustering algorithm to determine common patterns in these locations.

2.) We will calculate the distance of each location from the ten industrial areas used in our analysis. And we will determine the top 10 venues at each location and then, we will run K-Means clustering algorithm to determine common patterns when distance from industrial areas is also considered.

#### Let's draw a scatter plot of all the industrial areas in Pune location

In [14]:
# create map of Pune containg industries superimposed on top using their latitude and longitude values
address = 'Pune, IN'
geolocator = Nominatim(user_agent="pune_explorer")
location = geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
map_pune_industries = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, Location in zip(industries_df['Latitude'], industries_df['Longitude'], industries_df['Industries']):
    label = Location
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_pune_industries)  
    
map_pune_industries

## Analyze Each Neighborhood <a name="analysis"></a>

We will analyze all the venues across various neighborhood

In [15]:
# one hot encoding
pune_onehot = pd.get_dummies(pune_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
pune_onehot['Neighborhood'] = pune_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [pune_onehot.columns[-1]] + list(pune_onehot.columns[:-1])
pune_onehot = pune_onehot[fixed_columns]

pune_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Andhra Restaurant,Arcade,...,Track Stadium,Trail,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Warehouse Store,Watch Shop,Women's Store,Yoga Studio,Zoo
0,Bund Garden,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bund Garden,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Bund Garden,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Bund Garden,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Bund Garden,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [16]:
pune_onehot.shape

(5431, 234)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [17]:
pune_grouped=pune_onehot.groupby('Neighborhood').mean().reset_index()
pune_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Andhra Restaurant,Arcade,...,Track Stadium,Trail,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Warehouse Store,Watch Shop,Women's Store,Yoga Studio,Zoo
0,Adarsh Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
1,Akurdi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Alandi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alandi Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ambedkar Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [18]:
pune_grouped.shape

(188, 234)

#### Let's write a function to sort the venues in descending order.

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [20]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = pune_grouped['Neighborhood']

for ind in np.arange(pune_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(pune_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adarsh Nagar,Indian Restaurant,Café,Chinese Restaurant,Pub,Bakery,Coffee Shop,Gym / Fitness Center,Asian Restaurant,Fast Food Restaurant,Cupcake Shop
1,Akurdi,Gym,Café,Coffee Shop,Convenience Store,Ice Cream Shop,Snack Place,Fast Food Restaurant,Middle Eastern Restaurant,Diner,Indian Restaurant
2,Alandi,Fast Food Restaurant,Bus Station,Indian Restaurant,River,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space
3,Alandi Road,Bakery,Supermarket,Juice Bar,Coffee Shop,Hotel,English Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Zoo
4,Ambedkar Nagar,Indian Restaurant,Bar,Italian Restaurant,Coffee Shop,Lounge,Dessert Shop,Bakery,Seafood Restaurant,Fast Food Restaurant,Ice Cream Shop


Let's create a copy of this dataframe as it will be used for 2 different analysis

In [21]:
neighborhoods_venues_sorted_2=neighborhoods_venues_sorted.copy(deep=True)

## Analysis 1: Cluster various locations on the basis of top venues in the neighborhood

Run *k*-means to cluster the neighborhood into 10 clusters.

In [22]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 10

pune_grouped_clustering = pune_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(pune_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([6, 0, 4, 0, 0, 4, 9, 0, 6, 6], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [23]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,6,Adarsh Nagar,Indian Restaurant,Café,Chinese Restaurant,Pub,Bakery,Coffee Shop,Gym / Fitness Center,Asian Restaurant,Fast Food Restaurant,Cupcake Shop
1,0,Akurdi,Gym,Café,Coffee Shop,Convenience Store,Ice Cream Shop,Snack Place,Fast Food Restaurant,Middle Eastern Restaurant,Diner,Indian Restaurant
2,4,Alandi,Fast Food Restaurant,Bus Station,Indian Restaurant,River,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space
3,0,Alandi Road,Bakery,Supermarket,Juice Bar,Coffee Shop,Hotel,English Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Zoo
4,0,Ambedkar Nagar,Indian Restaurant,Bar,Italian Restaurant,Coffee Shop,Lounge,Dessert Shop,Bakery,Seafood Restaurant,Fast Food Restaurant,Ice Cream Shop


In [24]:
pune_merged = location_df

# merge pune_grouped with location_df to add latitude/longitude for each neighborhood
pune_merged = pune_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Location')
pune_merged=pune_merged[pd.notnull(pune_merged['Cluster Labels'])]

pune_merged.head() # check the last columns!

Unnamed: 0,Location,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bund Garden,18.539848,73.885117,0.0,Café,Hotel,Indian Restaurant,Italian Restaurant,Bakery,Coffee Shop,Fast Food Restaurant,Chinese Restaurant,Lounge,Japanese Restaurant
1,Shivajinagar,18.510099,73.817398,6.0,Indian Restaurant,Ice Cream Shop,Dessert Shop,Café,Pizza Place,Coffee Shop,Breakfast Spot,Burger Joint,Fast Food Restaurant,Smoke Shop
2,Aundh,18.563162,73.809555,6.0,Indian Restaurant,Shopping Mall,Dessert Shop,Fast Food Restaurant,Restaurant,Bakery,Diner,Snack Place,Sporting Goods Shop,Ice Cream Shop
3,Kondhwa,18.478436,73.890213,0.0,Asian Restaurant,Coffee Shop,Bakery,Restaurant,Hookah Bar,Café,Sports Bar,Mughlai Restaurant,BBQ Joint,Ice Cream Shop
4,Chinchwad,18.636131,73.796143,6.0,Indian Restaurant,Shopping Mall,Hotel,Gym,Multiplex,Sandwich Place,Fast Food Restaurant,Bookstore,Bus Station,Restaurant


In [25]:
pune_merged['Cluster Labels']=pune_merged['Cluster Labels'].astype('int64')
pune_merged['Cluster Labels'].dtype

dtype('int64')

Finally, let's visualize the resulting clusters

In [26]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters=folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x=np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pune_merged['Latitude'], pune_merged['Longitude'], pune_merged['Location'], pune_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Analysis Part 2: Distance from nearest 3 Industries

Let's use geopy.distance api to write a method which returns distance between two points in map in KM unit

In [27]:
from geopy import distance
def calculate_distance(point1, point2 ):
    return distance.distance(point1, point2).km

def calculate_for_location(point1, df, j):
    for i in range(industries_df.shape[0]):
        p2=(industries_df.loc[i,'Latitude'],df.loc[i,'Longitude'])
        distance=calculate_distance(point1, p2)
        df.loc[j, industries_df.loc[i,'Industries']]=distance
        
# Usage example
#p1=(pune_merged.loc[2,'Latitude'],pune_merged.loc[2,'Longitude'])
#p2=(pune_merged.loc[1,'Latitude'],pune_merged.loc[1,'Longitude'])
#print("Distance between ",pune_merged.loc[1,'Location'], " and ", pune_merged.loc[2,'Location'], " is : ", calculate_distance(p1, p2))

Let's create another dataframe from location_df having distances from various industrial areas

In [28]:
# create columns according to distance from industries
columns=['Location','Latitude','Longitude']
for i in range(industries_df.shape[0]):
    columns.append( industries_df.loc[i,'Industries'])
distance_from_industries_df=pd.DataFrame(columns=columns)
distance_from_industries_df['Location']=location_df['Location']
distance_from_industries_df['Latitude']=location_df['Latitude']
distance_from_industries_df['Longitude']=location_df['Longitude']
for i in range(distance_from_industries_df.shape[0]):
    p1=(distance_from_industries_df.loc[i,'Latitude'],distance_from_industries_df.loc[i,'Longitude'])
    calculate_for_location(p1,distance_from_industries_df, i)
distance_from_industries_df

Unnamed: 0,Location,Latitude,Longitude,Pimpri Chinchwad MIDC,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase III (SEZ),Magarpatta City,Kharadi Knowledge Park,Talawade InfoTech Park,Talegaon Floriculture Park,Ranjangaon Industrial Area,Chakan Industrial Area
0,Bund Garden,18.539848,73.885117,9.74938,9.16629,10.2672,5.77213,9.59648,3.10114,23.4062,21.5874,23.7162,29.2928
1,Shivajinagar,18.510099,73.817398,14.8727,9.03037,9.79274,11.8664,2.61032,6.1929,25.4146,24.3746,28.3329,29.2475
2,Aundh,18.563162,73.809555,10.7235,3.26381,3.88438,9.08363,4.7562,5.29778,19.5385,18.6423,23.1451,23.414
3,Kondhwa,18.478436,73.890213,16.5557,14.7048,15.761,12.5446,11.0493,8.67514,30.0547,28.3428,30.4789,35.447
4,Chinchwad,18.636131,73.796143,9.43245,5.40698,4.4249,11.0755,12.6173,11.505,11.5884,11.2105,17.0101,15.3953
5,Satara Road,18.488499,73.857956,15.6972,12.1976,13.1795,11.9273,7.51488,6.86464,28.2111,26.7629,29.6819,32.979
6,Kothrud,18.508699,73.812500,15.2618,9.19987,9.91751,12.3229,2.27963,6.66779,25.5643,24.5788,28.6429,29.2463
7,Senapati Bapat Road,18.534451,73.837349,11.5099,6.67594,7.64754,8.44953,4.55915,2.80998,22.8648,21.5882,25.1475,27.4468
8,Kalyani Nagar,18.548101,73.900070,8.97573,9.97142,11.0502,4.94422,11.3428,4.45413,23.1288,21.1399,22.7501,29.4356
9,Hinjewadi Phase1,18.586555,73.734741,16.5176,8.7427,8.00177,16.4198,9.63601,13.6045,18.8279,19.1701,25.4212,19.4541


In [29]:
distance_from_industries_df.head()

Unnamed: 0,Location,Latitude,Longitude,Pimpri Chinchwad MIDC,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase III (SEZ),Magarpatta City,Kharadi Knowledge Park,Talawade InfoTech Park,Talegaon Floriculture Park,Ranjangaon Industrial Area,Chakan Industrial Area
0,Bund Garden,18.539848,73.885117,9.74938,9.16629,10.2672,5.77213,9.59648,3.10114,23.4062,21.5874,23.7162,29.2928
1,Shivajinagar,18.510099,73.817398,14.8727,9.03037,9.79274,11.8664,2.61032,6.1929,25.4146,24.3746,28.3329,29.2475
2,Aundh,18.563162,73.809555,10.7235,3.26381,3.88438,9.08363,4.7562,5.29778,19.5385,18.6423,23.1451,23.414
3,Kondhwa,18.478436,73.890213,16.5557,14.7048,15.761,12.5446,11.0493,8.67514,30.0547,28.3428,30.4789,35.447
4,Chinchwad,18.636131,73.796143,9.43245,5.40698,4.4249,11.0755,12.6173,11.505,11.5884,11.2105,17.0101,15.3953


Let's write a function to sort the distance in ascending order

In [30]:
def return_closest_industries(row, num_top_venues):
    row_categories = row.iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=True)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Let's create a new dataframe with closest 3 industrial areas to each location

In [31]:
import numpy as np
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Location']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Industry'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Industry'.format(ind+1))

# create a new dataframe
industries_location_sorted= pd.DataFrame(columns=columns)
industries_location_sorted['Location'] = distance_from_industries_df['Location']

for ind in np.arange(distance_from_industries_df.shape[0]):
    industries_location_sorted.iloc[ind, 1:] = return_closest_industries(distance_from_industries_df.iloc[ind, :], num_top_venues)

industries_location_sorted.head()

Unnamed: 0,Location,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
0,Bund Garden,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
1,Shivajinagar,Magarpatta City,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
2,Aundh,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Magarpatta City
3,Kondhwa,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...
4,Chinchwad,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Pimpri Chinchwad MIDC


Now, let's add this data into our neighborhoods_venues_sorted_2 dataframe for our analysis

In [32]:
neighborhoods_venues_sorted_2['1st Most Common Industry']=industries_location_sorted['1st Most Common Industry']
neighborhoods_venues_sorted_2['2nd Most Common Industry']=industries_location_sorted['2nd Most Common Industry']
neighborhoods_venues_sorted_2['3rd Most Common Industry']=industries_location_sorted['3rd Most Common Industry']

In [33]:
neighborhoods_venues_sorted_2.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
0,Adarsh Nagar,Indian Restaurant,Café,Chinese Restaurant,Pub,Bakery,Coffee Shop,Gym / Fitness Center,Asian Restaurant,Fast Food Restaurant,Cupcake Shop,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
1,Akurdi,Gym,Café,Coffee Shop,Convenience Store,Ice Cream Shop,Snack Place,Fast Food Restaurant,Middle Eastern Restaurant,Diner,Indian Restaurant,Magarpatta City,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
2,Alandi,Fast Food Restaurant,Bus Station,Indian Restaurant,River,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Magarpatta City
3,Alandi Road,Bakery,Supermarket,Juice Bar,Coffee Shop,Hotel,English Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Zoo,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...
4,Ambedkar Nagar,Indian Restaurant,Bar,Italian Restaurant,Coffee Shop,Lounge,Dessert Shop,Bakery,Seafood Restaurant,Fast Food Restaurant,Ice Cream Shop,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Pimpri Chinchwad MIDC


Let's create clustering data set on which K-Means clustering, adding all the distance columns in pune_grouped dataset

In [34]:
distance_from_industries_df.columns.shape[0]
for i in range(3, distance_from_industries_df.columns.shape[0]):
    pune_grouped[distance_from_industries_df.columns[i]]=distance_from_industries_df[distance_from_industries_df.columns[i]]

In [35]:
pune_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Andhra Restaurant,Arcade,...,Pimpri Chinchwad MIDC,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase III (SEZ),Magarpatta City,Kharadi Knowledge Park,Talawade InfoTech Park,Talegaon Floriculture Park,Ranjangaon Industrial Area,Chakan Industrial Area
0,Adarsh Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,9.74938,9.16629,10.2672,5.77213,9.59648,3.10114,23.4062,21.5874,23.7162,29.2928
1,Akurdi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,14.8727,9.03037,9.79274,11.8664,2.61032,6.1929,25.4146,24.3746,28.3329,29.2475
2,Alandi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.7235,3.26381,3.88438,9.08363,4.7562,5.29778,19.5385,18.6423,23.1451,23.414
3,Alandi Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,16.5557,14.7048,15.761,12.5446,11.0493,8.67514,30.0547,28.3428,30.4789,35.447
4,Ambedkar Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,...,9.43245,5.40698,4.4249,11.0755,12.6173,11.505,11.5884,11.2105,17.0101,15.3953


The values of distance in last 10 columns in much greater than remaining columns, here we will perform feature scaling to get an uniformly distributed feature values

In [37]:
from sklearn import preprocessing
pune_grouped=pune_grouped.drop('Neighborhood',1)
pune_grouped = preprocessing.StandardScaler().fit(pune_grouped).transform(pune_grouped)

  return self.partial_fit(X, y)
  This is separate from the ipykernel package so we can avoid doing imports until


In [38]:
pune_grouped

array([[-0.14720738, -0.17991822, -0.07312724, ..., -0.41292813,
        -0.4398507 , -0.33642179],
       [-0.14720738, -0.17991822, -0.07312724, ..., -0.34331094,
        -0.32433824, -0.3376475 ],
       [-0.14720738, -0.17991822, -0.07312724, ..., -0.48648841,
        -0.4541385 , -0.49535871],
       ...,
       [-0.14720738, -0.17991822, -0.07312724, ..., -0.17453601,
        -0.21397552, -0.08289283],
       [-0.14720738, -0.17991822, -0.07312724, ..., -0.2271175 ,
        -0.21661077, -0.20903189],
       [-0.14720738, -0.17991822, -0.07312724, ...,  1.97107405,
         2.00990175,  1.73301568]])

Run *k*-means to cluster the neighborhood into 10 clusters. 

In [39]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 10

#pune_grouped_clustering = pune_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(pune_grouped)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 3, 1, 1, 1, 4, 4], dtype=int32)

add clustering labels

In [40]:
# add clustering labels
neighborhoods_venues_sorted_2.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted_2.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
0,1,Adarsh Nagar,Indian Restaurant,Café,Chinese Restaurant,Pub,Bakery,Coffee Shop,Gym / Fitness Center,Asian Restaurant,Fast Food Restaurant,Cupcake Shop,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
1,1,Akurdi,Gym,Café,Coffee Shop,Convenience Store,Ice Cream Shop,Snack Place,Fast Food Restaurant,Middle Eastern Restaurant,Diner,Indian Restaurant,Magarpatta City,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase I
2,1,Alandi,Fast Food Restaurant,Bus Station,Indian Restaurant,River,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Magarpatta City
3,1,Alandi Road,Bakery,Supermarket,Juice Bar,Coffee Shop,Hotel,English Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Zoo,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...
4,3,Ambedkar Nagar,Indian Restaurant,Bar,Italian Restaurant,Coffee Shop,Lounge,Dessert Shop,Bakery,Seafood Restaurant,Fast Food Restaurant,Ice Cream Shop,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Pimpri Chinchwad MIDC


In [41]:
pune_merged_2 = location_df

# merge pune_grouped with location_df to add latitude/longitude for each neighborhood
pune_merged_2 = pune_merged_2.join(neighborhoods_venues_sorted_2.set_index('Neighborhood'), on='Location')
pune_merged_2=pune_merged_2[pd.notnull(pune_merged_2['Cluster Labels'])]

pune_merged_2.head() # check the last columns!

Unnamed: 0,Location,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
0,Bund Garden,18.539848,73.885117,1.0,Café,Hotel,Indian Restaurant,Italian Restaurant,Bakery,Coffee Shop,Fast Food Restaurant,Chinese Restaurant,Lounge,Japanese Restaurant,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Kharadi Knowledge Park,Pimpri Chinchwad MIDC
1,Shivajinagar,18.510099,73.817398,1.0,Indian Restaurant,Ice Cream Shop,Dessert Shop,Café,Pizza Place,Coffee Shop,Breakfast Spot,Burger Joint,Fast Food Restaurant,Smoke Shop,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II
2,Aundh,18.563162,73.809555,4.0,Indian Restaurant,Shopping Mall,Dessert Shop,Fast Food Restaurant,Restaurant,Bakery,Diner,Snack Place,Sporting Goods Shop,Ice Cream Shop,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Pimpri Chinchwad MIDC
3,Kondhwa,18.478436,73.890213,1.0,Asian Restaurant,Coffee Shop,Bakery,Restaurant,Hookah Bar,Café,Sports Bar,Mughlai Restaurant,BBQ Joint,Ice Cream Shop,Pimpri Chinchwad MIDC,Ranjangaon Industrial Area,Talegaon Floriculture Park
4,Chinchwad,18.636131,73.796143,7.0,Indian Restaurant,Shopping Mall,Hotel,Gym,Multiplex,Sandwich Place,Fast Food Restaurant,Bookstore,Bus Station,Restaurant,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Magarpatta City


In [42]:
pune_merged_2['Cluster Labels']=pune_merged_2['Cluster Labels'].astype('int64')
pune_merged_2['Cluster Labels'].dtype

dtype('int64')

Let's visualize resulting clusters

In [43]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters=folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x=np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pune_merged_2['Latitude'], pune_merged_2['Longitude'], pune_merged_2['Location'], pune_merged_2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The identification of these clusters is more clear and widely distributed across the city. Now let's analyse a few cluster and try to identify difference between them.

## Analysis <a name="analysis"></a>

#### Analysis of Part 1

Let us examine and compare any 3 clusters of Part 1 analysis with 3 clusters of Part 2 analysis clusters.

#### Cluster 1 of Part 1 analysis

In [60]:
pune_merged.loc[pune_merged['Cluster Labels'] == 4, pune_merged.columns[[0] + list(range(4, pune_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Sinhagad Road,Indian Restaurant,Diner,Ice Cream Shop,Café,Bakery,Fast Food Restaurant,Gym / Fitness Center,Pizza Place,Cupcake Shop,Dosa Place
33,Warje,Coffee Shop,Indian Restaurant,Pizza Place,Fast Food Restaurant,Grocery Store,Diner,Eastern European Restaurant,Donut Shop,Dosa Place,Dumpling Restaurant
53,Paud Road,Indian Restaurant,Café,Sporting Goods Shop,Motorcycle Shop,Breakfast Spot,Bus Station,Diner,Sandwich Place,Cafeteria,Liquor Store
67,Fursungi,Fast Food Restaurant,Hotel,Indian Restaurant,Rock Club,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space
76,Prabhadevi Tech Park,Indian Restaurant,Fast Food Restaurant,Snack Place,Asian Restaurant,Shopping Mall,Lounge,South Indian Restaurant,Bistro,Café,Bar
81,Alandi,Fast Food Restaurant,Bus Station,Indian Restaurant,River,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space
84,Anand Nagar,Fast Food Restaurant,Bakery,Indian Restaurant,Coffee Shop,Snack Place,Ice Cream Shop,Pizza Place,Gym / Fitness Center,Diner,Falafel Restaurant
90,Balewadi Phata,Indian Restaurant,Fast Food Restaurant,Breakfast Spot,Café,Lounge,Vegetarian / Vegan Restaurant,Ice Cream Shop,Market,Malay Restaurant,Shopping Mall
95,Bhusari Colony,Indian Restaurant,Café,Fast Food Restaurant,Bus Station,Breakfast Spot,Cafeteria,Diner,Motorcycle Shop,Liquor Store,Eastern European Restaurant
111,Digambar Nagar,Indian Restaurant,Fast Food Restaurant,Bakery,Hotel,Farm,Smoke Shop,Bookstore,Mobile Phone Shop,Juice Bar,Restaurant


pune_merged.loc[pune_merged['Cluster Labels'] == 0, pune_merged.columns[[1] + list(range(5, pune_merged.shape[1]))]]

#### Cluster 2 of Part 1 analysis

In [50]:
pune_merged.loc[pune_merged['Cluster Labels'] == 1, pune_merged.columns[[0] + list(range(4, pune_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
35,Karanjepul,Mobile Phone Shop,Zoo,Donut Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant
40,Pimpalgaon - Daund,Mobile Phone Shop,Zoo,Donut Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant
41,Pargaon,Mobile Phone Shop,Zoo,Donut Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant


#### Cluster 3 of Part 1 analysis

In [53]:
pune_merged.loc[pune_merged['Cluster Labels'] == 3, pune_merged.columns[[0] + list(range(4, pune_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
46,Bhigwan,Indian Restaurant,Train Station,Seafood Restaurant,Zoo,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space
69,Manjari Road Hadapsar,Indian Restaurant,Seafood Restaurant,Zoo,Distillery,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant
106,Dattawadi,Indian Restaurant,Trail,Pizza Place,Food Truck,Electronics Store,Donut Shop,Dosa Place,Dumpling Restaurant,Eastern European Restaurant,Zoo
114,Gadital,Indian Restaurant,Fast Food Restaurant,Farmers Market,Coffee Shop,Department Store,Bakery,Zoo,Dosa Place,Farm,Falafel Restaurant
151,Pashan,Indian Restaurant,Farmers Market,Shopping Mall,Seafood Restaurant,Distillery,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant
178,Sus,Indian Restaurant,Resort,Zoo,Diner,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,English Restaurant


#### Analysis of Part 2

#### Cluster 1 of Part 2 analysis

In [54]:
pune_merged_2.loc[pune_merged_2['Cluster Labels'] == 0, pune_merged_2.columns[[0] + list(range(4, pune_merged_2.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
5,Satara Road,Indian Restaurant,Bakery,Ice Cream Shop,Vegetarian / Vegan Restaurant,Southern / Soul Food Restaurant,Coffee Shop,Breakfast Spot,Shopping Mall,Fast Food Restaurant,Bistro,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...
47,Sahakar Nagar,Indian Restaurant,Southern / Soul Food Restaurant,Coffee Shop,Gym / Fitness Center,Ice Cream Shop,Shopping Mall,Breakfast Spot,Electronics Store,Bus Station,Bistro,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Magarpatta City
62,Bhosle Nagar,Indian Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Fast Food Restaurant,Asian Restaurant,Garden,Shoe Store,Tennis Court,Multiplex,Ice Cream Shop,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...
152,Pashan-Sus Road,Breakfast Spot,Vegetarian / Vegan Restaurant,Indian Restaurant,Coffee Shop,Mountain,Italian Restaurant,Diner,Beer Garden,Ice Cream Shop,Food Court,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Magarpatta City


#### Cluster 2 of Part 2 analysis

In [59]:
pune_merged_2.loc[pune_merged_2['Cluster Labels'] == 5, pune_merged_2.columns[[0] + list(range(4, pune_merged_2.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
31,WTC-Kharadi,Coffee Shop,Pizza Place,Indian Restaurant,Irani Cafe,Café,Cafeteria,Pub,Go Kart Track,North Indian Restaurant,Diner,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II
112,Eon Free Zone,Indian Restaurant,North Indian Restaurant,Coffee Shop,Fast Food Restaurant,Cafeteria,Café,Pub,Go Kart Track,Irani Cafe,Pizza Place,Kharadi Knowledge Park,Magarpatta City,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...


#### Cluster 3 of Part 2 analysis

In [58]:
pune_merged_2.loc[pune_merged_2['Cluster Labels'] == 4, pune_merged_2.columns[[0] + list(range(4, pune_merged_2.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,1st Most Common Industry,2nd Most Common Industry,3rd Most Common Industry
2,Aundh,Indian Restaurant,Shopping Mall,Dessert Shop,Fast Food Restaurant,Restaurant,Bakery,Diner,Snack Place,Sporting Goods Shop,Ice Cream Shop,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Pimpri Chinchwad MIDC
24,Aundh - Nagardas Road,Indian Restaurant,Fast Food Restaurant,Shopping Mall,Sporting Goods Shop,Dessert Shop,Restaurant,Snack Place,Bakery,Grocery Store,Diner,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Magarpatta City
87,Aundh Annexe,Indian Restaurant,Dessert Shop,Fast Food Restaurant,Shopping Mall,Restaurant,Café,Sandwich Place,Bakery,Sporting Goods Shop,Grocery Store,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Magarpatta City
88,Aundh Gaon,Indian Restaurant,Dessert Shop,Shopping Mall,Restaurant,Fast Food Restaurant,Coffee Shop,Snack Place,Sporting Goods Shop,Ice Cream Shop,Bakery,Kharadi Knowledge Park,Rajiv Gandhi InfoTech Park Hinjewadi Phase III...,Pimpri Chinchwad MIDC
134,Kunj Colony,Indian Restaurant,Café,Bar,Fast Food Restaurant,Ice Cream Shop,Snack Place,Coffee Shop,Vegetarian / Vegan Restaurant,South Indian Restaurant,Juice Bar,Rajiv Gandhi InfoTech Park Hinjewadi Phase I,Rajiv Gandhi InfoTech Park Hinjewadi Phase II,Magarpatta City


Now, These clusters have many areas in common, Some of the most common venues of each analysis is as follows: -
-  Analysis 1
    -  Cluster 1: - 
        Indian Restaurants, Bakery, Shopping Mall, Gym 
    -  Cluster 2: -
        Mobile Shops, Zoo, Donut Shop, Fast Food Restaurant 
    -  Cluster 3: -
        Indian Restaurant, Fast Food Restaurant, Seafood Restaurant, Resort 
-  Analysis 3
    -  Cluster 1: -
        Indian Restaurant, Coffee Shop, Icecream Shop and close to Kharadi Knowledge Park and Magarpatta city
    -  Cluster 2: -
        Coffee Shop, Indian Restaurant, Irani Cafe and close to Magarpatta city and Hinjewadi Phase 1
    -  Cluster 3: -
        Indian Restaurant, Dessert Shop, Shopping Mall and close to Rajiv Gandhi Infotech Park and Kharadi Knowledge Park.
        

## Results and Discussion <a name="results"></a>

Our analysis shows that although their are various locations in Pune city. They can be classified into common clusters  according to the venues around 1 K.M area of those clusters. 
First we used all the locations in the Pune city and clustered them according to the frequency of venues around them and classified these 197 locations into 10 clusters.

In our second analysis, we choose 10 industrial locations and included them in our analysis also, to check for any difference in the clusters formed.

We found that the locations are classified differently in both clusters, the distance of each industrial location is also taken into consideration and then classified into clusters.

Although, the clusters are unevenly distributed in each cases, hence grouping most of the locations in 3-4 clusters only, which proves that Pune city has almost all similar set of venues available in each and every geography of the city. If a person/stakeholder tries to locate a place similar to another place at a far distant location, the probability of finding such an location is higher in our case.

## Conclusion <a name="conclusion"></a>

The objective of this analysis was to classify similar locations of Pune city India. We have used two approaches to classify location across this city.

Although, the final decision of selecting a particular location will be made by our stakeholder based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors like attractiveness of each location (proximity to park or water), distance from work location, traffic on roads, levels of noise / proximity to major roads, real estate availability, prices, social and economic dynamics of every neighborhood etc.