# Sydney or Melbourne? Exploring and Clustering to Compare the Two Cities

Trong Dinh Thac Do

Feb 20, 2020

### I. Introduction

Migration to and tourism in Australia interest many people around the world. Sydney and Melbourne, the two largest cities in Australia, always attract heated debate: people often compare the two, yet several questions remain unanswered.

In this project, I use machine learning techniques to group the neighbourhoods of both cities and explore the venues in each. This work should help people make better decisions when they want to migrate to or visit Australia.

### II. Business problem

This work aims to help people decide where to migrate in Australia, focusing on the two largest cities, Sydney and Melbourne. The findings also help visitors plan their trips around the available venues. Further, stakeholders can find useful information on where to invest.

### III. Data Description

1. Data source

There are several sources of data, but I use the data from [1]. This source includes:

    a. Postal code
    b. Neighbourhood
    c. Latitude
    d. Longitude
    
of all states in Australia.

I also extract venue data from the Foursquare API.

2. Data preprocessing

We are only interested in the data from Sydney and Melbourne. Thus, I keep only the postcodes for Sydney (NSW) and Melbourne (VIC) from [2]. Further, I group the neighbourhoods that share the same geolocation (latitude and longitude).


### IV. Methodology

These packages are used for this project.

In [193]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

<ul>
    <li><b>pandas</b>: To perform the data analysis</li>
    <li><b>requests</b>: Handle HTTP requests</li>
    <li><b>matplotlib</b>: Plotting library, used here for the colour scheme of the cluster maps</li>
    <li><b>folium</b>: Generating maps of Sydney and Melbourne</li>
    <li><b>sklearn</b>: To import KMeans for clustering the neighbourhoods</li>
    <li><b>geopy</b>: To get the location of Sydney and Melbourne</li>
</ul>

The project proceeds in four steps. First, I preprocess the Australian geolocation data to extract the information for Melbourne and Sydney. Second, I draw maps of both cities. Next, I use the Foursquare API to get information on the venues in both cities. Lastly, I apply a clustering technique, k-means, to find similarities between neighbourhoods.

#### 1. Data Preprocessing

I get the location and geographic information from [1]. For convenience, I extract the raw data to a CSV file and embed it in the project. These data cover the whole of Australia.

We need only the data from NSW (the state of Sydney) and VIC (the state of Melbourne).

First, read the CSV file using pandas. We obtain 16742 rows from raw data. The results should look like this.

In [194]:
df = pd.read_csv("australian-postcodes.csv")
df.head()

Unnamed: 0,Postal Code,Neighbourhood,State,Latitude,Longitude
0,200,Australian National University,ACT,-35.28,149.12
1,221,Barton,ACT,-35.2,149.1
2,800,Darwin,NT,-12.8,130.96
3,801,Darwin,NT,-12.8,130.96
4,804,Parap,NT,-12.43,130.84


In [195]:
df.shape

(16742, 5)

The data include information from all Australian states. However, we are only interested in NSW and VIC, so we keep only those rows. After this step, 8281 rows remain. The results should look like this.

In [196]:
df = df[df['State'].isin(["NSW", "VIC"])].reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Neighbourhood,State,Latitude,Longitude
0,1001,Sydney,NSW,-33.79,151.27
1,1002,Sydney,NSW,-33.79,151.27
2,1003,Sydney,NSW,-33.79,151.27
3,1004,Sydney,NSW,-33.79,151.27
4,1005,Sydney,NSW,-33.79,151.27


In [197]:
df.shape

(8281, 5)

The data of interest are only from the Sydney CBD and Melbourne CBD as defined in [3]. Thus, we exclude the data from regional NSW and VIC. After this step, only 204 rows remain. The results should look like this.

In [198]:
sydney_range_1 = list(range(1100, 1300))
sydney_range_2 = [2000, 2001, 2007, 2009]
postcode_sydney = sydney_range_1 + sydney_range_2

melbourne_range_1 = list(range(3000, 3007))
melbourne_range_2 = [3205]
melbourne_range_3 = list(range(8000, 8400))
postcode_melbourne = melbourne_range_1 + melbourne_range_2 + melbourne_range_3

postcode_of_interest = postcode_sydney + postcode_melbourne

df = df[df['Postal Code'].isin(postcode_of_interest)].reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Neighbourhood,State,Latitude,Longitude
0,1100,Sydney,NSW,-33.79,151.27
1,1101,Sydney,NSW,-33.79,151.27
2,1105,Sydney,NSW,-33.79,151.27
3,1106,Sydney,NSW,-33.79,151.27
4,1107,Sydney,NSW,-33.79,151.27


In [199]:
df.shape

(204, 5)
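
The postcode ranges used in the filter are easier to audit when wrapped in a small lookup. The sketch below reuses the same ranges as the filtering cell; the helper name `city_for_postcode` is hypothetical and not part of the original notebook.

```python
# Postcode ranges copied from the filtering cell above.
SYDNEY_POSTCODES = set(range(1100, 1300)) | {2000, 2001, 2007, 2009}
MELBOURNE_POSTCODES = set(range(3000, 3007)) | {3205} | set(range(8000, 8400))


def city_for_postcode(postcode):
    """Return 'Sydney', 'Melbourne', or None for a postcode outside both ranges."""
    code = int(postcode)  # accepts ints and numeric strings
    if code in SYDNEY_POSTCODES:
        return "Sydney"
    if code in MELBOURNE_POSTCODES:
        return "Melbourne"
    return None


print(city_for_postcode(2009))    # Sydney
print(city_for_postcode("3205"))  # Melbourne
print(city_for_postcode(800))     # None
```

Applied to the data frame, something like `df['Postal Code'].map(city_for_postcode)` would add a city label per row, which makes it possible to check the split without relying on the `State` column.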

Next, as several neighbourhoods share the same geolocation (i.e., Latitude and Longitude), we group them into one row and join their postcodes and neighbourhood names. The results should look like this.

In [200]:
df['Postal Code'] = df['Postal Code'].astype(str)
df = df.groupby(['Latitude', 'Longitude', 'State'], as_index=False).agg({'Postal Code': ', '.join, 'Neighbourhood': ', '.join})
df.head()

Unnamed: 0,Latitude,Longitude,State,Postal Code,Neighbourhood
0,-38.37,144.77,VIC,"3001, 8001, 8045, 8051, 8060, 8061, 8066, 8069...","Melbourne, Melbourne, Melbourne, Melbourne, Me..."
1,-38.19,146.29,VIC,8010,Law Courts
2,-38.11,145.15,VIC,8002,East Melbourne
3,-37.93,145.03,VIC,"3205, 3205","South Melbourne, South Melbourne DC"
4,-37.84,144.98,VIC,"3004, 3004, 8004","Melbourne, St Kilda Road Central, St Kilda Road"


In [201]:
df.shape

(25, 5)

The joined Neighbourhood names are not pretty, but at least each row now has a unique geographical coordinate.
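
The repetition in the names arises because every row's value is joined, even when the same name or postcode occurs many times. One possible variation, sketched here on a few rows taken from the grouped output, joins only the distinct values in first-seen order. The helper `join_unique` is hypothetical.

```python
import pandas as pd


def join_unique(values):
    """Join distinct values in first-seen order, e.g. 'Sydney, Sydney' -> 'Sydney'."""
    return ", ".join(dict.fromkeys(str(v) for v in values))


# Sample rows mirroring two of the Melbourne locations shown above.
sample = pd.DataFrame({
    "Postal Code": [3004, 3004, 8004, 3205, 3205],
    "Neighbourhood": ["Melbourne", "St Kilda Road Central", "St Kilda Road",
                      "South Melbourne", "South Melbourne DC"],
    "State": ["VIC"] * 5,
    "Latitude": [-37.84, -37.84, -37.84, -37.93, -37.93],
    "Longitude": [144.98, 144.98, 144.98, 145.03, 145.03],
})

grouped = (
    sample.groupby(["Latitude", "Longitude", "State"], as_index=False)
          .agg({"Postal Code": join_unique, "Neighbourhood": join_unique})
)
print(grouped)
```

Because `join_unique` converts each value to a string itself, the separate `astype(str)` step would not be needed with this variant.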

Next, we need to separate the data for Sydney and Melbourne. It should look like this.

In [202]:
sydney_data = df[df['State'] == 'NSW'].reset_index(drop=True)
sydney_data

Unnamed: 0,Latitude,Longitude,State,Postal Code,Neighbourhood
0,-34.79,147.69,NSW,1192,Sydney
1,-33.89,151.18,NSW,"1209, 1210, 1211, 1212, 1213, 1214, 1215","Australia Square, Australia Square, Australia ..."
2,-33.88,151.2,NSW,"2007, 2007","Broadway, Ultimo"
3,-33.87,151.19,NSW,2009,Pyrmont
4,-33.87,151.21,NSW,"1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228...","Royal Exchange, Royal Exchange, Royal Exchange..."
5,-33.86,151.21,NSW,"2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000","Barangaroo, Dawes Point, Haymarket, Millers Po..."
6,-33.82,151.04,NSW,"1231, 1232, 1233, 1234, 1235","Sydney South, Sydney South, Sydney South, Sydn..."
7,-33.79,151.27,NSW,"1100, 1101, 1105, 1106, 1107, 1108, 1109, 1110...","Sydney, Sydney, Sydney, Sydney, Sydney, Sydney..."
8,-33.74,151.03,NSW,"1216, 1217, 1218, 1219, 1220","Grosvenor Place, Grosvenor Place, Grosvenor Pl..."
9,-33.67,150.87,NSW,1116,Sydney


In [203]:
melbourne_data = df[df['State'] == 'VIC'].reset_index(drop=True)
melbourne_data

Unnamed: 0,Latitude,Longitude,State,Postal Code,Neighbourhood
0,-38.37,144.77,VIC,"3001, 8001, 8045, 8051, 8060, 8061, 8066, 8069...","Melbourne, Melbourne, Melbourne, Melbourne, Me..."
1,-38.19,146.29,VIC,8010,Law Courts
2,-38.11,145.15,VIC,8002,East Melbourne
3,-37.93,145.03,VIC,"3205, 3205","South Melbourne, South Melbourne DC"
4,-37.84,144.98,VIC,"3004, 3004, 8004","Melbourne, St Kilda Road Central, St Kilda Road"
5,-37.82,144.95,VIC,"3005, 8005","World Trade Centre, World Trade Centre"
6,-37.82,144.96,VIC,8009,Flinders Lane
7,-37.82,144.97,VIC,"3006, 3006","South Wharf, Southbank"
8,-37.82,144.99,VIC,3002,East Melbourne
9,-37.81,144.94,VIC,3003,West Melbourne


#### 2. Map

We will explore the maps of both Sydney and Melbourne. First, we need to get the geographical coordinates of both cities.

In [204]:
address_sydney = 'Sydney'

geolocator = Nominatim(user_agent="to_explorer")
location_sydney = geolocator.geocode(address_sydney)
latitude_sydney = location_sydney.latitude
longitude_sydney = location_sydney.longitude
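# Format the Sydney variables. The original cell formatted leftover `latitude`/`longitude`
# variables, which is why the recorded output below shows Melbourne's coordinates.
print('The geographical coordinates of Sydney are {}, {}.'.format(latitude_sydney, longitude_sydney))

The geographical coordinates of Sydney are -37.8142176, 144.9631608.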


In [205]:
address_melbourne = 'Melbourne'

geolocator = Nominatim(user_agent="to_explorer")
location_melbourne = geolocator.geocode(address_melbourne)
latitude_melbourne = location_melbourne.latitude
longitude_melbourne = location_melbourne.longitude
print('The geographical coordinates of Melbourne are {}, {}.'.format(latitude_melbourne, longitude_melbourne))

The geographical coordinates of Melbourne are -37.8142176, 144.9631608.
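
A bare place name can be ambiguous, and geopy's `geocode` returns `None` when nothing is found, which would make the `.latitude` access fail. The sketch below wraps the lookup in a small guard. The helper `geocode_city` and the offline stand-in `fake_geocode` are hypothetical; in the notebook one would pass `geolocator.geocode` instead.

```python
from types import SimpleNamespace


def geocode_city(geocode, query):
    """Look up `query` with the given geocode callable and return (lat, lon).

    `geocode` is any callable that behaves like geopy's `Nominatim.geocode`:
    it returns an object with `.latitude` and `.longitude`, or None.
    """
    location = geocode(query)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(query))
    return location.latitude, location.longitude


# Stand-in geocoder so the sketch runs offline; the coordinates are the
# Melbourne values printed above.
def fake_geocode(query):
    known = {
        "Melbourne, Victoria, Australia": SimpleNamespace(
            latitude=-37.8142176, longitude=144.9631608
        )
    }
    return known.get(query)


lat, lon = geocode_city(fake_geocode, "Melbourne, Victoria, Australia")
print(lat, lon)
```

Spelling out state and country in the query, as in the example string, is one way to reduce the chance of matching a different place with the same name.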


The map of Sydney should look like this.

In [206]:
map_sydney = folium.Map(location=[latitude_sydney, longitude_sydney], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(sydney_data['Latitude'], sydney_data['Longitude'], sydney_data['Neighbourhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sydney)  
map_sydney

The map of Melbourne should look like this.

In [207]:
map_melbourne = folium.Map(location=[latitude_melbourne, longitude_melbourne], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(melbourne_data['Latitude'], melbourne_data['Longitude'], melbourne_data['Neighbourhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_melbourne)  
map_melbourne
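
Some of the grouped coordinates lie far from the city centre (for example the Sydney row at -34.79, 147.69), so a few markers end up well outside the CBD. A quick way to spot such rows is to compute the great-circle distance to a reference point. The sketch below uses the haversine formula; the helper name `haversine_km` is hypothetical and 6371 km is the usual mean Earth radius.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Two Sydney rows from the grouped table above.
cbd = (-33.87, 151.21)      # Royal Exchange
outlier = (-34.79, 147.69)  # postcode 1192
distance = haversine_km(*cbd, *outlier)
print(round(distance), "km")
```

Rows whose distance exceeds some chosen threshold could then be reviewed or excluded before querying venues; the threshold itself is a judgment call.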

#### 3. Get the venues from Foursquare API

In this section, we will explore the venues in both cities using the Foursquare API.

First, we need to define the Foursquare credentials.

In [208]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20210205' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET
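
Credentials written into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not a Foursquare convention.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print("CLIENT_ID loaded:", bool(CLIENT_ID))
```

Calling `load_foursquare_credentials()` without arguments reads from the real environment, and printing only whether a value was loaded avoids echoing the secret into the notebook output.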


In [209]:
import json
from pandas import json_normalize

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Then, we get the venues for both cities.

First, an example for one Sydney neighbourhood (row 3 of `sydney_data`, Pyrmont).

In [210]:
sydney_data.loc[3, 'Neighbourhood']

neighborhood_latitude = sydney_data.loc[3, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = sydney_data.loc[3, 'Longitude'] # neighborhood longitude value

neighborhood_name = sydney_data.loc[3, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
#results

Latitude and longitude values of Pyrmont are -33.87, 151.190.
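
Formatting the query string by hand is easy to get wrong (note the stray `?&` in the URL). A small sketch of an alternative is to let the standard library encode the parameters. The parameter names are the ones used in the URL above; the helper `build_explore_url` is hypothetical and the credentials are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # endpoint used above


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL with the same query parameters as the cell above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; Pyrmont coordinates from the output above.
url = build_explore_url("ID", "SECRET", "20210205", -33.87, 151.19)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

With `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` performs the encoding automatically.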


In [211]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Vic's Meat Market,Butcher,-33.871722,151.191372
1,Claudio's Seafoods,Fish Market,-33.870692,151.190882
2,Gallon,Bar,-33.869517,151.193928
3,Terminus Hotel,Pub,-33.867737,151.19258
4,Peter's Fish Market - BBQ Grill,Seafood Restaurant,-33.873145,151.192109


In [212]:
print('{} venues in Sydney were returned by Foursquare.'.format(nearby_venues.shape[0]))

25 venues in Sydney were returned by Foursquare.


Next, an example of Melbourne's venues.

In [213]:
melbourne_data.loc[3, 'Neighbourhood']

neighborhood_latitude = melbourne_data.loc[3, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = melbourne_data.loc[3, 'Longitude'] # neighborhood longitude value

neighborhood_name = melbourne_data.loc[3, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

Latitude and longitude values of South Melbourne, South Melbourne DC are -37.93, 145.030.


In [214]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,4 Seasons Laksa,Malay Restaurant,-37.932719,145.031916
1,Hungry Jack's,Fast Food Restaurant,-37.93002,145.032188
2,Coles,Grocery Store,-37.933422,145.028316
3,The Doghouse,Flea Market,-37.928986,145.027084
4,Aquayak,Sporting Goods Shop,-37.92669,145.028036


In [215]:
print('{} venues in Melbourne were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues in Melbourne were returned by Foursquare.


#### 4. Exploring the data

##### Exploring neighbourhoods with Foursquare

In [216]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request once and reuse the parsed response
        response = requests.get(url).json()["response"]
        if not response:
            continue
        results = response['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
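
The expression `v['venue']['categories'][0]['name']` raises an `IndexError` for any venue that comes back without a category. The sketch below shows one way to make the row extraction tolerant of that case. The helper `venue_rows` is hypothetical, and the sample items are made up to match the fields that `getNearbyVenues` reads.

```python
def venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items shaped like the fields accessed in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Terminus Hotel",
               "location": {"lat": -33.867737, "lng": 151.19258},
               "categories": [{"name": "Pub"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": -33.87, "lng": 151.19},
               "categories": []}},
]
rows = venue_rows("Pyrmont", -33.87, 151.19, sample_items)
for row in rows:
    print(row)
```

Keeping the parsing in a pure function like this also makes it testable without calling the API.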

Get all venues for Sydney.

In [217]:
sydney_venues = getNearbyVenues(names=sydney_data['Neighbourhood'],
                                   latitudes=sydney_data['Latitude'],
                                   longitudes=sydney_data['Longitude']
                                  )

In [218]:
print(sydney_venues.shape)
sydney_venues.head()

(372, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Australia Square, Australia Square, Australia ...",-33.89,151.18,IRO Café,-33.8878,151.18017,Café
1,"Australia Square, Australia Square, Australia ...",-33.89,151.18,Gather On The Green,-33.890266,151.176188,Café
2,"Australia Square, Australia Square, Australia ...",-33.89,151.18,Campos Coffee,-33.893007,151.183213,Coffee Shop
3,"Australia Square, Australia Square, Australia ...",-33.89,151.18,Acre Farm and Eatery,-33.888066,151.176885,Restaurant
4,"Australia Square, Australia Square, Australia ...",-33.89,151.18,Brewtown Newtown,-33.893849,151.182478,Café


In [219]:
sydney_venues.groupby('Neighbourhood').count()
print('There are {} unique categories.'.format(len(sydney_venues['Venue Category'].unique())))

There are 115 unique categories.


Get all venues for Melbourne.

In [220]:
melbourne_venues = getNearbyVenues(names=melbourne_data['Neighbourhood'],
                                   latitudes=melbourne_data['Latitude'],
                                   longitudes=melbourne_data['Longitude']
                                  )

In [221]:
print(melbourne_venues.shape)
melbourne_venues.head()

(540, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Melbourne, Melbourne, Melbourne, Melbourne, Me...",-38.37,144.77,St Johny Back Beach,-38.372046,144.771736,Beach
1,"Melbourne, Melbourne, Melbourne, Melbourne, Me...",-38.37,144.77,Laser Plumbing Blairgowrie,-38.368842,144.766824,Home Service
2,"Melbourne, Melbourne, Melbourne, Melbourne, Me...",-38.37,144.77,Riley Plumbing,-38.367948,144.766502,Home Service
3,Law Courts,-38.19,146.29,Raw Havest Cafe,-38.186066,146.29143,Café
4,Law Courts,-38.19,146.29,Odlums Pharmacy,-38.18577,146.29123,Pharmacy


In [222]:
melbourne_venues.groupby('Neighbourhood').count()
print('There are {} unique categories.'.format(len(melbourne_venues['Venue Category'].unique())))

There are 155 unique categories.


### V. Results

In this section, we explore the results returned from the Foursquare API. We also apply a machine learning technique, k-means, to examine the clusters in both cities.

##### a. Analyzing the results

The top venues for each Neighbourhood should look like this.

In [223]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

For Sydney.

In [224]:
# one hot encoding
sydney_onehot = pd.get_dummies(sydney_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sydney_onehot['Neighbourhood'] = sydney_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [sydney_onehot.columns[-1]] + list(sydney_onehot.columns[:-1])
sydney_onehot = sydney_onehot[fixed_columns]

sydney_grouped = sydney_onehot.groupby('Neighbourhood').mean().reset_index()
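
Taking the mean of the one-hot columns per neighbourhood yields the share of each venue category in that neighbourhood, and these shares are the features that are later ranked and clustered. The tiny example below, on made-up venues, illustrates this and cross-checks the result with `pd.crosstab`.

```python
import pandas as pd

# Made-up venues for two neighbourhoods.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Pub", "Hotel", "Café", "Park"],
})

# One-hot encode the category, then average per neighbourhood.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)

# The same shares computed directly, as a cross-check.
shares = pd.crosstab(venues["Neighbourhood"], venues["Venue Category"], normalize="index")
print(shares)
```

In neighbourhood A, two of the four venues are cafés, so the Café column holds 0.5 for that row.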

In [225]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sydney_neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
sydney_neighbourhoods_venues_sorted['Neighbourhood'] = sydney_grouped['Neighbourhood']

for ind in np.arange(sydney_grouped.shape[0]):
    sydney_neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sydney_grouped.iloc[ind, :], num_top_venues)

sydney_neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Australia Square, Australia Square, Australia ...",Café,Burger Joint,Pizza Place,Pub,Park,Convenience Store,Restaurant,Portuguese Restaurant,Motorcycle Shop,Spanish Restaurant
1,"Barangaroo, Dawes Point, Haymarket, Millers Po...",Café,Hotel,Australian Restaurant,Cocktail Bar,Pub,Hotel Bar,Museum,Steakhouse,Ice Cream Shop,Restaurant
2,"Broadway, Ultimo",Café,Thai Restaurant,Chinese Restaurant,Coffee Shop,Malay Restaurant,Japanese Restaurant,Dumpling Restaurant,Indonesian Restaurant,Hotpot Restaurant,Hotel
3,"Grosvenor Place, Grosvenor Place, Grosvenor Pl...",Metro Station,Wine Bar,Gourmet Shop,Electronics Store,Fast Food Restaurant,Fish Market,Flea Market,Food Court,Fountain,French Restaurant
4,Pyrmont,Seafood Restaurant,Café,Fish Market,Japanese Restaurant,Pub,Hotel,Coffee Shop,Sports Club,Butcher,Gym / Fitness Center
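
The column headers above rely on a `try`/`except` that falls back to "th" once the `indicators` list runs out. That works for ten columns but would produce "21th" if `num_top_venues` were ever raised beyond 20. A small hypothetical helper, `ordinal`, handles the general case:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[:4])
```

The resulting `columns` list matches the headers shown in the table for `num_top_venues = 10`.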


For Melbourne,

In [226]:
# one hot encoding
melbourne_onehot = pd.get_dummies(melbourne_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
melbourne_onehot['Neighbourhood'] = melbourne_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [melbourne_onehot.columns[-1]] + list(melbourne_onehot.columns[:-1])
melbourne_onehot = melbourne_onehot[fixed_columns]

melbourne_grouped = melbourne_onehot.groupby('Neighbourhood').mean().reset_index()

In [227]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
melbourne_neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
melbourne_neighbourhoods_venues_sorted['Neighbourhood'] = melbourne_grouped['Neighbourhood']

for ind in np.arange(melbourne_grouped.shape[0]):
    melbourne_neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(melbourne_grouped.iloc[ind, :], num_top_venues)

melbourne_neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Abeckett Street, Little Lonsdale Street",Café,Coffee Shop,Korean Restaurant,Japanese Restaurant,Bar,Malay Restaurant,Fried Chicken Joint,Donut Shop,Indonesian Restaurant,Shopping Mall
1,East Melbourne,Café,Pub,Coffee Shop,Thai Restaurant,Construction & Landscaping,Sporting Goods Shop,Bus Stop,Convenience Store,Pharmacy,Park
2,Flinders Lane,Café,Coffee Shop,Bar,Japanese Restaurant,Italian Restaurant,Cocktail Bar,Steakhouse,Australian Restaurant,Sandwich Place,Hotel
3,Law Courts,Café,Pharmacy,Women's Store,Falafel Restaurant,Football Stadium,Food Truck,Food Court,Flea Market,Filipino Restaurant,Fast Food Restaurant
4,Melbourne,Café,Cocktail Bar,Coffee Shop,Hotel,Bar,Wine Bar,Japanese Restaurant,Ramen Restaurant,Dessert Shop,Tapas Restaurant


##### b. Clustering the Neighbourhoods

For Sydney,

In [228]:
# set number of clusters
kclusters = 5

sydney_grouped_clustering = sydney_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sydney_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 0, 1, 1, 2, 4, 3, 1], dtype=int32)
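
The number of clusters is fixed at 5 here without comparison against other values. One common way to compare candidate values of k is the silhouette score. The sketch below demonstrates the idea on synthetic points, not on the venue data; applying it to the notebook would mean passing `sydney_grouped_clustering` instead, with k kept below the number of rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups of four points each (not the venue data).
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
points = np.vstack([square, square + [10, 10], square + [20, 0]])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

With only about ten neighbourhoods per city, any such score should be read with caution, since several clusters will contain a single row.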

In [229]:
# add clustering labels
sydney_neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sydney_merged = sydney_data

# join the cluster labels and top venues onto sydney_data to add latitude/longitude for each neighbourhood
sydney_merged = sydney_merged.join(sydney_neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

sydney_merged = sydney_merged.dropna() # drop neighbourhoods for which no venues were returned

The visualization of Sydney clusters should look like this.

In [230]:
# create map
map_clusters = folium.Map(location=[latitude_sydney, longitude_sydney], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

print(rainbow)
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sydney_merged['Latitude'], sydney_merged['Longitude'], sydney_merged['Neighbourhood'], sydney_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
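
In the colour-scheme code, the `x` and `ys` arrays only serve to count the clusters, and indexing with `cluster-1` maps cluster 0 to the last colour in the list. A simpler sketch that produces the same kind of palette and indexes it directly by label is shown below; the helper `cluster_palette` is hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return one hex colour per cluster label, sampled evenly from the rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters))]


palette = cluster_palette(5)
print(palette)

# Index directly by label so that cluster 0 takes the first colour.
colour_by_label = {label: palette[label] for label in range(5)}
print(colour_by_label)
```

Either indexing scheme gives each cluster a distinct colour; indexing by the label itself just makes the mapping easier to read off a legend.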


For Melbourne,

In [231]:
# set number of clusters
kclusters = 5

melbourne_grouped_clustering = melbourne_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(melbourne_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 0, 1, 2, 4, 3, 1, 1], dtype=int32)

In [232]:
# add clustering labels
melbourne_neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

melbourne_merged = melbourne_data

# join the cluster labels and top venues onto melbourne_data to add latitude/longitude for each neighbourhood
melbourne_merged = melbourne_merged.join(melbourne_neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

melbourne_merged = melbourne_merged.dropna() # drop neighbourhoods for which no venues were returned
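
The join uses the neighbourhood name as its only key. In `melbourne_data`, "East Melbourne" appears at two different coordinates (rows 2 and 8), so both rows receive the same venue profile and cluster label. A possible way to keep such locations apart is to key on the name together with the coordinates, as sketched below on made-up cluster labels.

```python
import pandas as pd

# Two locations that share a name, as rows 2 and 8 of melbourne_data do.
locations = pd.DataFrame({
    "Neighbourhood": ["East Melbourne", "East Melbourne", "Flinders Lane"],
    "Latitude": [-38.11, -37.82, -37.82],
    "Longitude": [145.15, 144.99, 144.96],
})

# Hypothetical per-location results keyed by name AND coordinates.
profiles = pd.DataFrame({
    "Neighbourhood": ["East Melbourne", "East Melbourne"],
    "Latitude": [-38.11, -37.82],
    "Longitude": [145.15, 144.99],
    "Cluster Labels": [2, 1],
})

keys = ["Neighbourhood", "Latitude", "Longitude"]
merged = locations.merge(profiles, on=keys, how="left")
print(merged)

# Rows without a match (no venues returned) show up as NaN and can be dropped.
matched = merged.dropna(subset=["Cluster Labels"])
print(matched)
```

In the notebook this would also require grouping the venues by `Neighbourhood`, `Neighbourhood Latitude` and `Neighbourhood Longitude` rather than by name alone.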

The visualization of Melbourne clusters should look like this.

In [233]:
# create map
map_clusters = folium.Map(location=[latitude_melbourne, longitude_melbourne], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

print(rainbow)
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(melbourne_merged['Latitude'], melbourne_merged['Longitude'], melbourne_merged['Neighbourhood'], melbourne_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']


##### c. Examining the clusters

For Sydney,

In [234]:
sydney_merged.loc[sydney_merged['Cluster Labels'] == 0.0, sydney_merged.columns[[4] + list(range(5, sydney_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Grosvenor Place, Grosvenor Place, Grosvenor Pl...",0.0,Metro Station,Wine Bar,Gourmet Shop,Electronics Store,Fast Food Restaurant,Fish Market,Flea Market,Food Court,Fountain,French Restaurant


In [235]:
sydney_merged.loc[sydney_merged['Cluster Labels'] == 1.0, sydney_merged.columns[[4] + list(range(5, sydney_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Australia Square, Australia Square, Australia ...",1.0,Café,Burger Joint,Pizza Place,Pub,Park,Convenience Store,Restaurant,Portuguese Restaurant,Motorcycle Shop,Spanish Restaurant
2,"Broadway, Ultimo",1.0,Café,Thai Restaurant,Chinese Restaurant,Coffee Shop,Malay Restaurant,Japanese Restaurant,Dumpling Restaurant,Indonesian Restaurant,Hotpot Restaurant,Hotel
3,Pyrmont,1.0,Seafood Restaurant,Café,Fish Market,Japanese Restaurant,Pub,Hotel,Coffee Shop,Sports Club,Butcher,Gym / Fitness Center
4,"Royal Exchange, Royal Exchange, Royal Exchange...",1.0,Café,Coffee Shop,Bar,Shopping Mall,Cocktail Bar,Hotel,Sandwich Place,Japanese Restaurant,Speakeasy,Bookstore
5,"Barangaroo, Dawes Point, Haymarket, Millers Po...",1.0,Café,Hotel,Australian Restaurant,Cocktail Bar,Pub,Hotel Bar,Museum,Steakhouse,Ice Cream Shop,Restaurant
7,"Sydney, Sydney, Sydney, Sydney, Sydney, Sydney...",1.0,Café,Park,Furniture / Home Store,Supermarket,Golf Course,Pet Store,Hardware Store,Sporting Goods Shop,Shopping Mall,Wine Bar


In [236]:
sydney_merged.loc[sydney_merged['Cluster Labels'] == 4.0, sydney_merged.columns[[4] + list(range(5, sydney_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,"Sydney South, Sydney South, Sydney South, Sydn...",4.0,Pier,Climbing Gym,Café,Wine Bar,Golf Course,Fast Food Restaurant,Fish Market,Flea Market,Food Court,Fountain


For Melbourne,

In [237]:
melbourne_merged.loc[melbourne_merged['Cluster Labels'] == 0.0, melbourne_merged.columns[[4] + list(range(5, melbourne_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Law Courts,0.0,Café,Pharmacy,Women's Store,Falafel Restaurant,Football Stadium,Food Truck,Food Court,Flea Market,Filipino Restaurant,Fast Food Restaurant


In [238]:
melbourne_merged.loc[melbourne_merged['Cluster Labels'] == 1.0, melbourne_merged.columns[[4] + list(range(5, melbourne_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,East Melbourne,1.0,Café,Pub,Coffee Shop,Thai Restaurant,Construction & Landscaping,Sporting Goods Shop,Bus Stop,Convenience Store,Pharmacy,Park
5,"World Trade Centre, World Trade Centre",1.0,Café,Coffee Shop,Hotel,Bar,Fast Food Restaurant,Japanese Restaurant,Thai Restaurant,Convenience Store,Salad Place,Pie Shop
6,Flinders Lane,1.0,Café,Coffee Shop,Bar,Japanese Restaurant,Italian Restaurant,Cocktail Bar,Steakhouse,Australian Restaurant,Sandwich Place,Hotel
7,"South Wharf, Southbank",1.0,Bar,Art Gallery,Park,Coffee Shop,Performing Arts Venue,Theater,Music Venue,Spanish Restaurant,Hotel,Ice Cream Shop
8,East Melbourne,1.0,Café,Pub,Coffee Shop,Thai Restaurant,Construction & Landscaping,Sporting Goods Shop,Bus Stop,Convenience Store,Pharmacy,Park
9,West Melbourne,1.0,Juice Bar,Clothing Store,Gift Shop,Shopping Mall,Dessert Shop,Café,Chinese Restaurant,Skating Rink,Burger Joint,Sandwich Place
10,"Abeckett Street, Little Lonsdale Street",1.0,Café,Coffee Shop,Korean Restaurant,Japanese Restaurant,Bar,Malay Restaurant,Fried Chicken Joint,Donut Shop,Indonesian Restaurant,Shopping Mall
11,Melbourne,1.0,Café,Cocktail Bar,Coffee Shop,Hotel,Bar,Wine Bar,Japanese Restaurant,Ramen Restaurant,Dessert Shop,Tapas Restaurant


In [239]:
melbourne_merged.loc[melbourne_merged['Cluster Labels'] == 3.0, melbourne_merged.columns[[4] + list(range(5, melbourne_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"South Melbourne, South Melbourne DC",3.0,Sporting Goods Shop,Flea Market,Malay Restaurant,Fast Food Restaurant,Grocery Store,French Restaurant,Football Stadium,Food Truck,Food Court,Filipino Restaurant


In [240]:
melbourne_merged.loc[melbourne_merged['Cluster Labels'] == 4.0, melbourne_merged.columns[[4] + list(range(5, melbourne_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,"Melbourne, St Kilda Road Central, St Kilda Road",4.0,Café,Hotel,Indian Restaurant,Tram Station,Park,Gym,Australian Restaurant,French Restaurant,Vietnamese Restaurant,Pizza Place


### VI. Discussion

From the clusters obtained in the previous section, we can draw some interesting findings:

In Sydney, the results show that the CBD neighbourhoods (Barangaroo, Ultimo, Pyrmont, etc.) and the north of Sydney are grouped into one cluster. This makes sense and suggests that the clustering works reasonably well. There are also some hints for investors. For example, the CBD has many cafés and restaurants, so anyone who wants to invest in such venues should take the existing competition into account. In Sydney South, by contrast, a climbing gym is among the most common venues. Thus, if you want to open a new gym there, you should be careful; but if you are a gym lover, you should consider moving there.

In Melbourne, we see similar results. Cafés and restaurants are among the most common venues in the CBD, and flea markets are more common than most other venues in South Melbourne.

To compare the two cities, we take the CBD as an example. If you love Japanese food, you should consider living in Melbourne, as Japanese restaurants are among the most common restaurants there. However, if you love Chinese and traditional Australian food, you should definitely consider migrating to Sydney.
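
The CBD comparison can be made more concrete by measuring how much the top-10 lists overlap. The sketch below copies the top-10 categories of the Sydney row "Barangaroo, Dawes Point, Haymarket, ..." and the Melbourne row "Melbourne" from the cluster tables and computes their Jaccard similarity. Choosing these two rows as "the CBD" is an assumption made for illustration.

```python
# Top-10 categories copied from the cluster tables in the Results section.
sydney_cbd = {
    "Café", "Hotel", "Australian Restaurant", "Cocktail Bar", "Pub",
    "Hotel Bar", "Museum", "Steakhouse", "Ice Cream Shop", "Restaurant",
}
melbourne_cbd = {
    "Café", "Cocktail Bar", "Coffee Shop", "Hotel", "Bar",
    "Wine Bar", "Japanese Restaurant", "Ramen Restaurant", "Dessert Shop", "Tapas Restaurant",
}

# Jaccard similarity: size of the intersection divided by size of the union.
shared = sydney_cbd & melbourne_cbd
jaccard = len(shared) / len(sydney_cbd | melbourne_cbd)

print("Shared:", sorted(shared))
print("Only Sydney:", sorted(sydney_cbd - melbourne_cbd))
print("Only Melbourne:", sorted(melbourne_cbd - sydney_cbd))
print("Jaccard similarity: {:.2f}".format(jaccard))
```

The categories unique to each side are the ones that distinguish the two CBDs in this data.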

These are just some of the findings; many more can be drawn by exploring the results further.

### VII. Conclusion

In this project, we explored data from Sydney and Melbourne and compared the results to find the similarities between the two cities. We applied data analysis to extract information from the raw data. Further, we used the Foursquare API to get useful hints on the common venues in both cities. Finally, we investigated the clusters of both cities using machine learning techniques. In doing so, we also obtained some hints for immigrants and visitors. This goes some way towards answering the question: Melbourne or Sydney?

### References

1. https://gist.github.com/randomecho/5020859
2. https://postcodez.com.au/postcodes
3. https://www.spidersoft.com.au/2013/australia-wide-postcode-breakdown/