# Introduction

## Background
Moving to another country, far from one's home city, can be hard, especially for someone who has no relatives or friends there to suggest the best borough of a city. The same holds for someone who wants to buy a house without having the chance to see it. In both cases, choosing the best area can be extremely difficult.

## Business Problem
In this scenario, Machine Learning can help those who want to move to Melbourne make wise and effective decisions. The business problem we pose is therefore: how can we support those who want to purchase suitable real estate in Melbourne?

To solve this business problem, we cluster Melbourne suburbs in order to recommend locations, together with the current average price of real estate, where homebuyers can invest. Recommendations are based on the amenities and essential facilities surrounding each location, e.g. elementary schools, high schools, hospitals and grocery stores.

# Data
Data on recent house transactions in Melbourne were extracted from the Kaggle dataset __Melbourne Housing Market__
(https://www.kaggle.com/anthonypino/melbourne-housing-market#Melbourne_housing_FULL.csv).
The fields include the suburb, the address, the number of rooms and the price in Australian dollars, among others, for more than 30k transactions.

To explore and target recommended locations according to the presence of amenities and essential facilities, we access venue data through the Foursquare API and arrange them as a dataframe for visualization. By merging the property price data from the Kaggle dataset with the Foursquare data on amenities and essential facilities surrounding those properties, we are able to recommend profitable real estate investments.

# Methodology section
The Methodology section describes the main components of our analysis and prediction system. It comprises four stages:

1. Collect Inspection Data
2. Explore and Understand Data
3. Data preparation and preprocessing 
4. Modeling

# 0. Import libraries

In [1]:
import os  # operating system interface
import numpy as np
import pandas as pd
import datetime as dt  # date and time handling
import json  # library to handle JSON files
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
import requests  # library to handle HTTP requests
from pandas import json_normalize  # transform JSON into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium  # map rendering library
from sklearn.cluster import KMeans

# 1. Collect Inspection Data

In [2]:
df = pd.read_csv("Melbourne_housing_FULL.csv")

# 2. Explore and Understand Data

In [3]:
print(df.shape)
df.head()

(34857, 21)


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


# 3. Data preparation and preprocessing

#### Format the date column and delete old transactions

In [4]:
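# Dates look day-first (e.g. 3/09/2016); parsing them that way is an assumption about the source format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)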
df.drop(df[df.Date.dt.year < 2016].index, inplace=True)
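The date strings shown in the preview above (for example `3/09/2016`) look like day/month/year values, which is an assumption based on the Australian convention rather than something the dataset states. A minimal sketch of why the parsing convention matters, using three date strings copied from the preview:

```python
import pandas as pd

# Date strings copied from the preview table; assumed to be day/month/year.
raw_dates = pd.Series(["3/09/2016", "4/02/2016", "4/03/2017"])

# An explicit format removes any ambiguity about which field is the day.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# Without a hint, an ambiguous string is read month-first.
ambiguous_default = pd.to_datetime("3/09/2016")

print(parsed.dt.month.tolist())
print(ambiguous_default.month)
```

The year filter used in this notebook is not affected by the day/month order, but any later analysis by month would be.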

#### Select useful columns

In [5]:
# List of columns
print(df.columns.values.tolist())

['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount']


In [6]:
# Only a few are relevant for us
df = df[['Suburb','Address','Price','CouncilArea']].copy()

#### Drop rows with missing values

In [7]:
# Drop rows with NaN
df.dropna(subset=['Price'], axis=0, inplace=True)
df.dropna(subset=['CouncilArea'], axis=0, inplace=True)
print(df.shape)
df.head()

(27244, 4)


Unnamed: 0,Suburb,Address,Price,CouncilArea
1,Abbotsford,85 Turner St,1480000.0,Yarra City Council
2,Abbotsford,25 Bloomburg St,1035000.0,Yarra City Council
4,Abbotsford,5 Charles St,1465000.0,Yarra City Council
5,Abbotsford,40 Federation La,850000.0,Yarra City Council
6,Abbotsford,55a Park St,1600000.0,Yarra City Council


In [8]:
# Check missing values
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")

Suburb
False    27244
Name: Suburb, dtype: int64

Address
False    27244
Name: Address, dtype: int64

Price
False    27244
Name: Price, dtype: int64

CouncilArea
False    27244
Name: CouncilArea, dtype: int64
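The same check can be written more compactly by counting missing values per column. This is a sketch on a small made-up frame (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative frame with one missing Price and one missing CouncilArea.
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Carlton"],
    "Price": [1480000.0, None, 1171193.0],
    "CouncilArea": ["Yarra City Council", None, "Melbourne City Council"],
})

# isnull() marks missing cells; sum() counts them column by column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values, which is what the loop output above shows for all four columns.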



#### Select the relevant areas

In [9]:
# Show all possible values of CouncilArea
print(len(df['CouncilArea'].unique().tolist()))
df['CouncilArea'].unique().tolist()

33


['Yarra City Council',
 'Moonee Valley City Council',
 'Port Phillip City Council',
 'Darebin City Council',
 'Hobsons Bay City Council',
 'Stonnington City Council',
 'Boroondara City Council',
 'Monash City Council',
 'Glen Eira City Council',
 'Whitehorse City Council',
 'Maribyrnong City Council',
 'Bayside City Council',
 'Moreland City Council',
 'Manningham City Council',
 'Melbourne City Council',
 'Banyule City Council',
 'Brimbank City Council',
 'Kingston City Council',
 'Hume City Council',
 'Knox City Council',
 'Maroondah City Council',
 'Casey City Council',
 'Melton City Council',
 'Greater Dandenong City Council',
 'Nillumbik Shire Council',
 'Whittlesea City Council',
 'Frankston City Council',
 'Macedon Ranges Shire Council',
 'Yarra Ranges Shire Council',
 'Wyndham City Council',
 'Cardinia Shire Council',
 'Moorabool Shire Council',
 'Mitchell Shire Council']

In [10]:
# Select only the three inner council areas: Melbourne, Yarra and Port Phillip
df1 = df[df['CouncilArea']=='Melbourne City Council']
df2 = df[df['CouncilArea']=='Yarra City Council']
df3 = df[df['CouncilArea']=='Port Phillip City Council']
df_new = pd.concat([df1, df2, df3]).reset_index()
print(df_new.shape)
df_new.head()

(3372, 5)


Unnamed: 0,index,Suburb,Address,Price,CouncilArea
0,2875,Carlton North,527 Nicholson St,1330000.0,Melbourne City Council
1,2876,Carlton North,593 Canning St,1540000.0,Melbourne City Council
2,2877,Carlton North,112 Newry St,1425000.0,Melbourne City Council
3,2878,Carlton North,122 Richardson St,1725000.0,Melbourne City Council
4,2880,Carlton North,632 Rathdowne St,1280000.0,Melbourne City Council
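The selection above can also be expressed with a single membership test. The sketch below uses `Series.isin` on a small illustrative frame; note that, unlike the concatenation above, it keeps the rows in their original order instead of grouping them by council.

```python
import pandas as pd

inner_councils = [
    "Melbourne City Council",
    "Yarra City Council",
    "Port Phillip City Council",
]

# Illustrative rows; only the first three belong to the selected councils.
toy = pd.DataFrame({
    "Suburb": ["Carlton", "Abbotsford", "Elwood", "Hawthorn"],
    "CouncilArea": [
        "Melbourne City Council",
        "Yarra City Council",
        "Port Phillip City Council",
        "Boroondara City Council",
    ],
})

# Keep only rows whose council is in the list, then renumber the index.
selected = toy[toy["CouncilArea"].isin(inner_councils)].reset_index(drop=True)
print(selected)
```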


#### Identify the suburbs and calculate the coordinates

In [11]:
df_AvgPrice = df_new.groupby(['Suburb'])['Price'].mean().reset_index()
print(df_AvgPrice.shape)
df_AvgPrice.head()

(29, 2)


Unnamed: 0,Suburb,Price
0,Abbotsford,1033549.0
1,Albert Park,1927651.0
2,Balaclava,820451.9
3,Burnley,1171751.0
4,Carlton,1171193.0
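An average based on very few sales can be misleading, so it may help to keep the number of transactions next to each mean. A sketch with made-up prices (the column names `avg_price` and `n_sales` are illustrative choices, not part of the notebook):

```python
import pandas as pd

# Illustrative transactions: suburb "A" has three sales, suburb "B" only one.
toy = pd.DataFrame({
    "Suburb": ["A", "A", "A", "B"],
    "Price": [900000.0, 1000000.0, 1100000.0, 800000.0],
})

# Named aggregation returns the mean and the number of sales per suburb.
summary = (
    toy.groupby("Suburb")["Price"]
    .agg(avg_price="mean", n_sales="count")
    .reset_index()
)
print(summary)
```

Suburbs with a very small `n_sales` could then be flagged or excluded before clustering.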


In [12]:
# Append the country code to every suburb name (vectorised, avoids chained assignment)
df_AvgPrice['Suburb'] = df_AvgPrice['Suburb'] + ', AU'

# Prefix the city name for selected rows (presumably to disambiguate geocoding)
ambiguous = [0, 8, 14, 28]
df_AvgPrice.loc[ambiguous, 'Suburb'] = 'Melbourne, ' + df_AvgPrice.loc[ambiguous, 'Suburb']

df_AvgPrice.head(10)

Unnamed: 0,Suburb,Price
0,"Melbourne, Abbotsford, AU",1033549.0
1,"Albert Park, AU",1927651.0
2,"Balaclava, AU",820451.9
3,"Burnley, AU",1171751.0
4,"Carlton, AU",1171193.0
5,"Carlton North, AU",1437974.0
6,"Clifton Hill, AU",1242392.0
7,"Collingwood, AU",913892.5
8,"Melbourne, Cremorne, AU",1022947.0
9,"Docklands, AU",800000.0


In [13]:
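geolocator = Nominatim(user_agent="melbourne_housing_notebook")  # recent geopy versions require a user_agent; this value is a placeholder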
df_AvgPrice['city_coord'] = df_AvgPrice['Suburb'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df_AvgPrice[['Latitude', 'Longitude']] = df_AvgPrice['city_coord'].apply(pd.Series)
df_AvgPrice.drop(columns=['city_coord'],inplace=True)
df_AvgPrice.head()


Unnamed: 0,Suburb,Price,Latitude,Longitude
0,"Melbourne, Abbotsford, AU",1033549.0,-37.809888,144.995489
1,"Albert Park, AU",1927651.0,-37.847772,144.962008
2,"Balaclava, AU",820451.9,-37.869921,144.993428
3,"Burnley, AU",1171751.0,-37.827622,145.008091
4,"Carlton, AU",1171193.0,-37.800423,144.968434
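A geocoder's `geocode` call returns `None` when it finds no match, in which case the lambda above would fail with an `AttributeError`. The sketch below shows one way to guard against that. The helper `coords_or_none` and the offline stand-ins `FakeLocation` and `fake_geocode` are hypothetical names introduced only so the example runs without network access.

```python
from typing import Callable, Optional, Tuple


def coords_or_none(geocode: Callable, query: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (latitude, longitude) for query, or (None, None) if nothing is found."""
    location = geocode(query)
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


class FakeLocation:
    """Offline stand-in for a geocoding result (hypothetical)."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    """Offline stand-in for geolocator.geocode (hypothetical)."""
    known = {"Carlton, AU": FakeLocation(-37.800423, 144.968434)}
    return known.get(query)


print(coords_or_none(fake_geocode, "Carlton, AU"))
print(coords_or_none(fake_geocode, "Nowhere, AU"))
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`, and rows with missing coordinates could be dropped or corrected by hand.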


#### Draw the map

In [14]:
address = 'Melbourne, AU'

geolocator = Nominatim(user_agent="melbourne_housing_notebook")  # recent geopy versions require a user_agent; this value is a placeholder
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Melbourne are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Melbourne are -37.8142176, 144.9631608.


In [15]:
# create map using latitude and longitude values
map_melb = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, price in zip(df_AvgPrice['Latitude'], df_AvgPrice['Longitude'], df_AvgPrice['Price']):
    label = '{}'.format(price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_melb)

map_melb

#### Connect to Foursquare

In [16]:
# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names below are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']  # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']  # Foursquare Secret
VERSION = '20180604'

#### Load the venues for each suburb

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Suburb', 
                  'Suburb Latitude', 
                  'Suburb Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
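The list comprehension above assumes that every returned item has at least one category, and it would raise an `IndexError` otherwise. The following sketch shows a more defensive way to flatten the items. The helper name `venue_rows` and the sample payload are hypothetical; the payload only mimics the nested keys that `getNearbyVenues` indexes.

```python
def venue_rows(suburb, lat, lng, items):
    """Flatten explore-style items into tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        if not categories:
            # No category to report, so the venue is skipped.
            continue
        rows.append((
            suburb,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0]["name"],
        ))
    return rows


# Hypothetical payload shaped like the items used in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": -37.80, "lng": 144.99},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": -37.81, "lng": 144.98},
               "categories": []}},
]

rows = venue_rows("Abbotsford", -37.809888, 144.995489, sample_items)
print(rows)
```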

In [18]:
location_venues = getNearbyVenues(names=df_AvgPrice['Suburb'],
                                  latitudes=df_AvgPrice['Latitude'],
                                  longitudes=df_AvgPrice['Longitude'])

In [19]:
location_venues.head()

Unnamed: 0,Suburb,Suburb Latitude,Suburb Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Melbourne, Abbotsford, AU",-37.809888,144.995489,Au79,-37.808806,144.996035,Café
1,"Melbourne, Abbotsford, AU",-37.809888,144.995489,Minh Phat Supermarket,-37.809652,144.996163,Grocery Store
2,"Melbourne, Abbotsford, AU",-37.809888,144.995489,Nhu Lan Bakery,-37.810375,144.996708,Bakery
3,"Melbourne, Abbotsford, AU",-37.809888,144.995489,Jinda Thai Restaurant,-37.809428,144.992345,Thai Restaurant
4,"Melbourne, Abbotsford, AU",-37.809888,144.995489,Three Bags Full,-37.807318,144.996603,Café


In [20]:
print('There are {} unique categories.'.format(len(location_venues['Venue Category'].unique())))
print(location_venues.shape)

There are 195 unique categories.
(1047, 7)


In [21]:
# index=False avoids writing the row index as an extra unnamed column
location_venues.to_csv('Melb_venues.csv', index=False)
df_AvgPrice.to_csv('Melb_AvgPrice.csv', index=False)

# 4. Modeling
After exploring the dataset and gaining insights into it, we are ready to use clustering to analyze real estate. We use the k-means clustering technique because it is fast and efficient in terms of computational cost, and flexible enough to be re-run as the real estate market in Melbourne changes.

In [22]:
location_venues = pd.read_csv('Melb_venues.csv')
df_AvgPrice = pd.read_csv('Melb_AvgPrice.csv')

#### Apply one-hot encoding to venues

In [23]:
venues_onehot = pd.get_dummies(location_venues[['Venue Category']], prefix="", prefix_sep="")

# add the suburb column back to the dataframe and attach the average price
venues_onehot['Suburb'] = location_venues['Suburb']
df_tomerge = df_AvgPrice[['Suburb','Price']].copy()
venues_onehot = pd.merge(venues_onehot, df_tomerge, on='Suburb')

# move the last column (Price, added by the merge) to the first position
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]

venues_onehot.head(10)

Unnamed: 0,Price,Accessories Store,African Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,Australian Restaurant,Austrian Restaurant,...,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit,Suburb
0,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
1,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
2,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
3,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
4,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
5,1033549.0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
6,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
7,1033549.0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
8,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"
9,1033549.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Melbourne, Abbotsford, AU"


In [24]:
Melb_grouped = venues_onehot.groupby('Suburb').mean().reset_index()
print(Melb_grouped.shape)
Melb_grouped.head()

(29, 197)


Unnamed: 0,Suburb,Price,Accessories Store,African Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,Australian Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit
0,"Albert Park, AU",1927651.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Balaclava, AU",820451.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,...,0.018182,0.018182,0.036364,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
2,"Burnley, AU",1171751.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Carlton North, AU",1437974.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.035714,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
4,"Carlton, AU",1171193.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.042553,0.021277,0.0,0.042553,0.0,0.0,0.021277,0.0,0.0
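Averaging one-hot columns per suburb turns venue counts into relative frequencies, which is why each row above contains values between 0 and 1. A small sketch on made-up venues (the suburbs "A" and "B" are illustrative):

```python
import pandas as pd

# Illustrative venues: three in suburb "A", two in suburb "B".
toy_venues = pd.DataFrame({
    "Suburb": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Bar", "Park", "Café"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot["Suburb"] = toy_venues["Suburb"]

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Suburb").mean().reset_index()
print(freq)
```

For suburb "A" the Café frequency is 2/3 and the Bar frequency is 1/3, so each row of category frequencies sums to 1.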


#### Define a function to return the most common venues/facilities nearby real estate investments

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
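The function skips the first entry of the row it receives, so the row passed in should start with a non-venue column. The sketch below repeats the function on two made-up rows to show both cases:

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the first entry of the row
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Row starting with a non-venue column: Price is skipped, as intended.
row_with_price = pd.Series({"Price": 1000000.0, "Bar": 0.2, "Café": 0.5, "Park": 0.3})
top_two = list(return_most_common_venues(row_with_price, 2))

# Row starting with a venue category: Café is silently dropped.
row_without_price = pd.Series({"Café": 0.5, "Bar": 0.2, "Park": 0.3})
top_one = list(return_most_common_venues(row_without_price, 1))

print(top_two)
print(top_one)
```

In the second case the most frequent category is lost, which is worth keeping in mind when slicing columns before calling the function.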

#### Sort the most frequent venues in each suburb

In [40]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Suburb']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

In [41]:
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Suburb'] = Melb_grouped['Suburb']
venues_sorted['Price'] = Melb_grouped['Price']

venues_sorted.head()

Unnamed: 0,Suburb,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Price
0,"Albert Park, AU",,,,,,,,,,,1927651.0
1,"Balaclava, AU",,,,,,,,,,,820451.9
2,"Burnley, AU",,,,,,,,,,,1171751.0
3,"Carlton North, AU",,,,,,,,,,,1437974.0
4,"Carlton, AU",,,,,,,,,,,1171193.0


In [42]:
for ind in np.arange(Melb_grouped.shape[0]):
    # pass Price plus all categories: the function itself skips the first entry (Price)
    venues_sorted.iloc[ind, 1:-1] = return_most_common_venues(Melb_grouped.iloc[ind, 1:], num_top_venues)

venues_sorted.head()

Unnamed: 0,Suburb,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Price
0,"Albert Park, AU",Café,Grocery Store,Light Rail Station,Hotel,Metro Station,Seafood Restaurant,Tennis Court,Golf Course,Athletics & Sports,Indian Restaurant,1927651.0
1,"Balaclava, AU",Café,Coffee Shop,Breakfast Spot,Pizza Place,Pharmacy,Japanese Restaurant,Vietnamese Restaurant,Bar,Bakery,Spa,820451.9
2,"Burnley, AU",Café,Pub,Breakfast Spot,Furniture / Home Store,Convenience Store,Park,Liquor Store,Cocktail Bar,Shop & Service,Fish & Chips Shop,1171751.0
3,"Carlton North, AU",Café,Bakery,Grocery Store,Flower Shop,Wine Bar,Pub,Light Rail Station,Burger Joint,Breakfast Spot,Liquor Store,1437974.0
4,"Carlton, AU",Italian Restaurant,Café,Coffee Shop,Ice Cream Shop,Vegetarian / Vegan Restaurant,Deli / Bodega,Bar,Wine Bar,Burger Joint,Lebanese Restaurant,1171193.0


In [43]:
Melb_grouped_clustering = Melb_grouped.drop(columns='Suburb')
Melb_grouped_clustering.head()

Unnamed: 0,Price,Accessories Store,African Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,Australian Restaurant,Austrian Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit
0,1927651.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,820451.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,...,0.018182,0.018182,0.036364,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
2,1171751.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1437974.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.035714,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
4,1171193.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.042553,0.021277,0.0,0.042553,0.0,0.0,0.021277,0.0,0.0
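In the table above, Price is in the order of millions while the venue frequencies lie between 0 and 1. Because k-means relies on Euclidean distance, an unscaled price column is likely to dominate the clustering. The sketch below illustrates this on made-up rows and shows standardization with `StandardScaler` as one possible remedy; it is an illustration under these assumptions, not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [average price, café frequency]
X = np.array([
    [800000.0, 0.9],
    [820000.0, 0.1],
    [1500000.0, 0.9],
    [1520000.0, 0.1],
])

# On raw values the clusters follow price only, ignoring the frequency column.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# Standardizing gives every column zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(raw_labels)
print(X_scaled.std(axis=0))
```

After scaling, price and venue frequencies contribute on a comparable scale, so the resulting clusters would reflect both.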


#### Apply k_means

In [44]:
kclusters = 5
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Melb_grouped_clustering)

kmeans.labels_[0:20]

array([0, 3, 4, 2, 4, 4, 3, 3, 2, 1, 4, 3, 3, 1, 1, 3, 1, 0, 3, 2])
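The number of clusters is fixed at five here. One common way to check such a choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic data with three well-separated groups; the data and the range of k are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups of ten points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Applied to `Melb_grouped_clustering`, the same loop would indicate whether five clusters are a reasonable choice for the 29 suburbs.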

In [45]:
#Dataframe to include Clusters
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
venues_sorted.head(10)

Unnamed: 0,Cluster Labels,Suburb,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Price
0,0,"Albert Park, AU",Café,Grocery Store,Light Rail Station,Hotel,Metro Station,Seafood Restaurant,Tennis Court,Golf Course,Athletics & Sports,Indian Restaurant,1927651.0
1,3,"Balaclava, AU",Café,Coffee Shop,Breakfast Spot,Pizza Place,Pharmacy,Japanese Restaurant,Vietnamese Restaurant,Bar,Bakery,Spa,820451.9
2,4,"Burnley, AU",Café,Pub,Breakfast Spot,Furniture / Home Store,Convenience Store,Park,Liquor Store,Cocktail Bar,Shop & Service,Fish & Chips Shop,1171751.0
3,2,"Carlton North, AU",Café,Bakery,Grocery Store,Flower Shop,Wine Bar,Pub,Light Rail Station,Burger Joint,Breakfast Spot,Liquor Store,1437974.0
4,4,"Carlton, AU",Italian Restaurant,Café,Coffee Shop,Ice Cream Shop,Vegetarian / Vegan Restaurant,Deli / Bodega,Bar,Wine Bar,Burger Joint,Lebanese Restaurant,1171193.0
5,4,"Clifton Hill, AU",Café,Pizza Place,Pharmacy,Park,Convenience Store,Stadium,Train Station,Seafood Restaurant,Fish & Chips Shop,Garden,1242392.0
6,3,"Collingwood, AU",Café,Cocktail Bar,Japanese Restaurant,Coffee Shop,Supermarket,Pizza Place,Grocery Store,Vietnamese Restaurant,Brewery,Ice Cream Shop,913892.5
7,3,"Docklands, AU",Italian Restaurant,Middle Eastern Restaurant,Hotel,Café,Asian Restaurant,Restaurant,Steakhouse,Coffee Shop,Indian Restaurant,Chinese Restaurant,800000.0
8,2,"East Melbourne, AU",Café,Hotel,Sculpture Garden,Light Rail Station,Wine Bar,Grocery Store,Australian Restaurant,Convenience Store,Fish & Chips Shop,Pharmacy,1374431.0
9,1,"Elwood, AU",Café,Fish & Chips Shop,Indian Restaurant,River,Bakery,Bar,Zoo Exhibit,Football Stadium,Food Truck,Food Court,991924.1


In [53]:
# merge the suburb coordinates and prices with the cluster labels and most common venues
Melb_merged = df_AvgPrice[['Suburb','Price','Latitude', 'Longitude']].copy()
df_tomerge = venues_sorted[['Suburb','Cluster Labels','1st Most Common Venue','2nd Most Common Venue','3rd Most Common Venue','4th Most Common Venue','5th Most Common Venue','6th Most Common Venue','7th Most Common Venue','8th Most Common Venue','9th Most Common Venue','10th Most Common Venue']].copy()
Melb_merged = Melb_merged.join(df_tomerge.set_index('Suburb'), on='Suburb')

Melb_merged

Unnamed: 0,Suburb,Price,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Melbourne, Abbotsford, AU",1033549.0,-37.809888,144.995489,1,Vietnamese Restaurant,Café,Thai Restaurant,Korean Restaurant,Chinese Restaurant,Asian Restaurant,Vegetarian / Vegan Restaurant,Grocery Store,Bakery,Bar
1,"Albert Park, AU",1927651.0,-37.847772,144.962008,0,Café,Grocery Store,Light Rail Station,Hotel,Metro Station,Seafood Restaurant,Tennis Court,Golf Course,Athletics & Sports,Indian Restaurant
2,"Balaclava, AU",820451.9,-37.869921,144.993428,3,Café,Coffee Shop,Breakfast Spot,Pizza Place,Pharmacy,Japanese Restaurant,Vietnamese Restaurant,Bar,Bakery,Spa
3,"Burnley, AU",1171751.0,-37.827622,145.008091,4,Café,Pub,Breakfast Spot,Furniture / Home Store,Convenience Store,Park,Liquor Store,Cocktail Bar,Shop & Service,Fish & Chips Shop
4,"Carlton, AU",1171193.0,-37.800423,144.968434,4,Italian Restaurant,Café,Coffee Shop,Ice Cream Shop,Vegetarian / Vegan Restaurant,Deli / Bodega,Bar,Wine Bar,Burger Joint,Lebanese Restaurant
5,"Carlton North, AU",1437974.0,-37.784559,144.972855,2,Café,Bakery,Grocery Store,Flower Shop,Wine Bar,Pub,Light Rail Station,Burger Joint,Breakfast Spot,Liquor Store
6,"Clifton Hill, AU",1242392.0,-37.788877,144.995363,4,Café,Pizza Place,Pharmacy,Park,Convenience Store,Stadium,Train Station,Seafood Restaurant,Fish & Chips Shop,Garden
7,"Collingwood, AU",913892.5,-37.802104,144.988139,3,Café,Cocktail Bar,Japanese Restaurant,Coffee Shop,Supermarket,Pizza Place,Grocery Store,Vietnamese Restaurant,Brewery,Ice Cream Shop
8,"Melbourne, Cremorne, AU",1022947.0,-37.825105,144.983743,1,Tennis Stadium,Football Stadium,Field,Stadium,Athletics & Sports,Café,Park,Tea Room,Tennis Court,Beer Garden
9,"Docklands, AU",800000.0,-37.817542,144.939492,3,Italian Restaurant,Middle Eastern Restaurant,Hotel,Café,Asian Restaurant,Restaurant,Steakhouse,Coffee Shop,Indian Restaurant,Chinese Restaurant


#### Create map

In [54]:
address = 'Melbourne, AU'

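geolocator = Nominatim(user_agent="melbourne_housing_notebook")  # recent geopy versions require a user_agent; this value is a placeholder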
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Melb_merged['Latitude'], Melb_merged['Longitude'], Melb_merged['Suburb'], Melb_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


#### Cluster 1

In [60]:
Melb_merged.loc[Melb_merged['Cluster Labels'] == 0, Melb_merged.columns[[0] + [1] + list(range(5, Melb_merged.shape[1]))]].head()

Unnamed: 0,Suburb,Price,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Albert Park, AU",1927651.0,Café,Grocery Store,Light Rail Station,Hotel,Metro Station,Seafood Restaurant,Tennis Court,Golf Course,Athletics & Sports,Indian Restaurant
16,"Middle Park, AU",2232148.0,Café,Light Rail Station,Grocery Store,Hotel,Seafood Restaurant,Thai Restaurant,Beach,Metro Station,Indian Restaurant,Athletics & Sports


#### Cluster 2

In [61]:
Melb_merged.loc[Melb_merged['Cluster Labels'] == 1, Melb_merged.columns[[0] + [1] + list(range(5, Melb_merged.shape[1]))]].head()

Unnamed: 0,Suburb,Price,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Melbourne, Abbotsford, AU",1033549.0,Vietnamese Restaurant,Café,Thai Restaurant,Korean Restaurant,Chinese Restaurant,Asian Restaurant,Vegetarian / Vegan Restaurant,Grocery Store,Bakery,Bar
8,"Melbourne, Cremorne, AU",1022947.0,Tennis Stadium,Football Stadium,Field,Stadium,Athletics & Sports,Café,Park,Tea Room,Tennis Court,Beer Garden
11,"Elwood, AU",991924.1,Café,Fish & Chips Shop,Indian Restaurant,River,Bakery,Bar,Zoo Exhibit,Football Stadium,Food Truck,Food Court
21,"Richmond, AU",1067585.0,Café,Sandwich Place,Vietnamese Restaurant,Fast Food Restaurant,Pub,Thai Restaurant,Japanese Restaurant,Gym,Asian Restaurant,Food & Drink Shop
24,"South Yarra, AU",1058338.0,Café,Italian Restaurant,Hotel,Japanese Restaurant,Grocery Store,Convenience Store,Pizza Place,Dessert Shop,Coffee Shop,French Restaurant


#### Cluster 3

In [62]:
Melb_merged.loc[Melb_merged['Cluster Labels'] == 2, Melb_merged.columns[[0] + [1] + list(range(5, Melb_merged.shape[1]))]].head()

Unnamed: 0,Suburb,Price,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,"Carlton North, AU",1437974.0,Café,Bakery,Grocery Store,Flower Shop,Wine Bar,Pub,Light Rail Station,Burger Joint,Breakfast Spot,Liquor Store
10,"East Melbourne, AU",1374431.0,Café,Hotel,Sculpture Garden,Light Rail Station,Wine Bar,Grocery Store,Australian Restaurant,Convenience Store,Fish & Chips Shop,Pharmacy
18,"Parkville, AU",1447563.0,Zoo Exhibit,Basketball Court,Hockey Arena,Food & Drink Shop,Food,Park,Sculpture Garden,Zoo,Austrian Restaurant,Frozen Yogurt Shop
20,"Princes Hill, AU",1633265.0,Breakfast Spot,Café,Light Rail Station,Park,Flower Shop,Zoo Exhibit,Fish Market,Football Stadium,Food Truck,Food Court
23,"South Melbourne, AU",1349624.0,Café,Bar,Coffee Shop,Gastropub,Wine Shop,Asian Restaurant,Mexican Restaurant,Gym,Malay Restaurant,Spa


#### Cluster 4

In [63]:
Melb_merged.loc[Melb_merged['Cluster Labels'] == 3, Melb_merged.columns[[0] + [1] + list(range(5, Melb_merged.shape[1]))]].head()

Unnamed: 0,Suburb,Price,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Balaclava, AU",820451.923077,Café,Coffee Shop,Breakfast Spot,Pizza Place,Pharmacy,Japanese Restaurant,Vietnamese Restaurant,Bar,Bakery,Spa
7,"Collingwood, AU",913892.473118,Café,Cocktail Bar,Japanese Restaurant,Coffee Shop,Supermarket,Pizza Place,Grocery Store,Vietnamese Restaurant,Brewery,Ice Cream Shop
9,"Docklands, AU",800000.0,Italian Restaurant,Middle Eastern Restaurant,Hotel,Café,Asian Restaurant,Restaurant,Steakhouse,Coffee Shop,Indian Restaurant,Chinese Restaurant
13,"Flemington, AU",841887.755102,Hotel,Racetrack,Pizza Place,Café,Bowling Green,Liquor Store,Light Rail Station,Zoo Exhibit,Fish Market,Food Truck
14,"Melbourne, Kensington, AU",850563.766667,Café,Pizza Place,Park,Fried Chicken Joint,Pub,Wine Shop,Burger Joint,Ice Cream Shop,Fish & Chips Shop,Gym


#### Cluster 5

In [64]:
Melb_merged.loc[Melb_merged['Cluster Labels'] == 4, Melb_merged.columns[[0] + [1] + list(range(5, Melb_merged.shape[1]))]].head()

Unnamed: 0,Suburb,Price,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"Burnley, AU",1171751.0,Café,Pub,Breakfast Spot,Furniture / Home Store,Convenience Store,Park,Liquor Store,Cocktail Bar,Shop & Service,Fish & Chips Shop
4,"Carlton, AU",1171193.0,Italian Restaurant,Café,Coffee Shop,Ice Cream Shop,Vegetarian / Vegan Restaurant,Deli / Bodega,Bar,Wine Bar,Burger Joint,Lebanese Restaurant
6,"Clifton Hill, AU",1242392.0,Café,Pizza Place,Pharmacy,Park,Convenience Store,Stadium,Train Station,Seafood Restaurant,Fish & Chips Shop,Garden
12,"Fitzroy, AU",1274792.0,Café,Bar,Cocktail Bar,Pub,Bookstore,Vietnamese Restaurant,Bakery,Japanese Restaurant,Ice Cream Shop,Deli / Bodega
19,"Port Melbourne, AU",1273470.0,Paintball Field,Café,Go Kart Track,Beach,Fish & Chips Shop,Food Truck,Food Court,Food & Drink Shop,Food,Flower Shop


# Results and Discussion section
We can analyze our results according to the five clusters we have produced. Although every cluster offers a good range of facilities and amenities, cafés and restaurants are prominent in all of them. Clusters 2 and 4 offer more Asian food, and houses there sell at lower prices, while clusters 1 and 3 have higher prices and offer more sport-related venues.
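The price differences between clusters can be summarized with a short aggregation. The sketch below uses only the first two suburbs displayed for each cluster above, so the figures are illustrative and do not cover all 29 suburbs.

```python
import pandas as pd

# First two suburbs shown for each cluster in the tables above (partial data).
shown = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Suburb": [
        "Albert Park", "Middle Park",
        "Abbotsford", "Cremorne",
        "Carlton North", "East Melbourne",
        "Balaclava", "Docklands",
        "Burnley", "Carlton",
    ],
    "Price": [
        1927651.0, 2232148.0,
        1033549.0, 1022947.0,
        1437974.0, 1374431.0,
        820451.9, 800000.0,
        1171751.0, 1171193.0,
    ],
})

# Mean, minimum and maximum price per cluster, cheapest cluster first.
summary = shown.groupby("Cluster")["Price"].agg(["mean", "min", "max"]).sort_values("mean")
print(summary)
```

On these rows, clusters 4 and 2 have the lowest average prices and clusters 3 and 1 the highest.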

# Conclusion
To solve the business problem, we clustered Melbourne suburbs in order to recommend locations, together with the current average price of real estate, where homebuyers can invest. Recommendations were based on the amenities and essential facilities surrounding each location, e.g. elementary schools, high schools, hospitals and grocery stores.

First, we gathered data on property transactions and the prices paid. Then, to explore and target recommended locations according to the presence of amenities and essential facilities, we accessed venue data through the Foursquare API and arranged them as a dataframe for visualization. By merging the price data on Melbourne properties with the Foursquare data on surrounding amenities and essential facilities, we were able to recommend profitable real estate investments.

The Methodology section comprised four stages: 1. Collect Inspection Data; 2. Explore and Understand Data; 3. Data preparation and preprocessing; 4. Modeling. In the modeling stage, we used the k-means clustering technique because it is fast and efficient in terms of computational cost, and flexible enough to be re-run as the real estate market in Melbourne changes.