# Coursera Applied Data Science Capstone
This notebook contains my work for the Applied Data Science Capstone course from IBM/Coursera.

# Introduction

One of the most difficult things about moving to a new city on the other side of the world is deciding where to live. While you can look up the characteristics of the most famous suburbs online, many others that could be hidden gems are overlooked. So what if you could use the place you currently live in as a reference when looking for recommendations on where to live in the new city? The recommendation could include suburb characteristics, such as the types of businesses and facilities in the area, along with some statistical information on rent prices, narrowing down your search for the perfect place to live on the other side of the world!
This service is aimed at people in our globalised world who are looking to move to a new city, either within their country or internationally. For this project I'll use myself as an example: back in the day I moved from Guadalajara, MX to Melbourne, AU. So I'll use the place where I used to live in Mexico to rank the suburbs in Melbourne that could suit me.

# Data

This is the data I plan to use for my solution:
## Venue data for a specific suburb in Guadalajara, MX.  
The location (lat, lon) of the specific suburb is needed for the Foursquare queries; it can easily be obtained through a Google search.  
This location will be used to obtain venue information through the Foursquare API. The focus will be on obtaining the top 5 venue categories by frequency in that area.  
## Rent price data for all suburbs in Guadalajara, MX.  
This is probably the most difficult data to obtain, as Mexico is not great at data gathering. The objective is to find the percentile the specific suburb falls in with respect to rent prices. This gives a better price comparison between the two cities, which may differ considerably in cost of living. It assumes that the person is looking to (at least) maintain their current living conditions.  
Thus, the median rental price for housing properties in every suburb of Guadalajara, MX is needed.
## Venue data for all Melbourne, AU suburbs.
Location data for all the Melbourne suburbs is needed for the Foursquare venue queries. This can be easily obtained from the Australian government's data access website: https://data.gov.au/dataset/ds-dga-af33dd8c-0534-4e18-9245-fc64440f742e/details  
Similarly to the venue information discussed before, the top 5 venue categories will be required for all Melbourne suburbs. The venue data will be used in the comparison between the two cities (by suburb).
## Rent price data for all suburbs in Melbourne, AU.  
Median rental price data for all the Melbourne suburbs can be obtained from the Victorian government's website: https://www.dhhs.vic.gov.au/publications/rental-report.  
This data will be used to determine each suburb's rent percentile, which will then be used as an additional feature for suburb comparison.

#### Imports
All the stuff needed for the project.

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle HTTP requests
import geocoder # alternative geocoding library
from sklearn.cluster import KMeans # k-means for the clustering stage
from sklearn.cluster import DBSCAN 
from sklearn.preprocessing import StandardScaler

# Source suburb data

## Geographical location

Let's use geolocator to get the geographical coordinates of the source suburb of interest.

In [2]:
source_suburb = 'providencia'
source_city = 'guadalajara, mexico'
address = source_suburb + ', ' + source_city

geolocator = Nominatim(user_agent="to_explorer", timeout=5)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
source_loc = [latitude, longitude]
print('The geographical coordinates of {} are {}, {}.'.format(source_suburb.upper(), latitude, longitude))

The geographical coordinates of PROVIDENCIA are 20.6973972, -103.3786781.


Let's show it on the map.

In [3]:
# create map of the source suburb using latitude and longitude values
source_map = folium.Map(location=source_loc, zoom_start=16)
source_map

That looks about right!

## Rent price data

After a fair amount of searching, I found a database containing rental information for the whole city of Guadalajara at the following link:
https://iieg.gob.mx/ns/?page_id=11967  
I've downloaded the spreadsheet and cleaned it up so I can easily import it. Let's check it out.

In [4]:
source_rent_df = pd.read_csv('Data/Clean/Guadalajara_Rent_Data.csv')
source_rent_df.head()

Unnamed: 0,Type,Rent,City,Suburb
0,Departamento,7000.0,Guadalajara,LAS TORRES
1,Casa,15000.0,Tlajomulco,BOSQUE REAL DE SANTA ANITA
2,Casa,15000.0,Tlajomulco,BOSQUE REAL DE SANTA ANITA
3,Departamento,16500.0,Tlajomulco,LA RIOJA
4,Departamento,17500.0,Guadalajara,JARDINES DEL BOSQUE


Alright, so we are interested in the suburb and rent columns. This dataset contains multiple entries per suburb, as it reflects the current market. Let's start by keeping only suburb and rent in the dataframe and changing the suburb names to uppercase.

In [5]:
source_rent_df = source_rent_df[['Suburb', 'Rent']] # Keep only Suburb and Rent
source_rent_df['Suburb'] = source_rent_df['Suburb'].str.upper() # Upper case
source_rent_df.head()

Unnamed: 0,Suburb,Rent
0,LAS TORRES,7000.0
1,BOSQUE REAL DE SANTA ANITA,15000.0
2,BOSQUE REAL DE SANTA ANITA,15000.0
3,LA RIOJA,16500.0
4,JARDINES DEL BOSQUE,17500.0


Now let's group by suburb and calculate the mean rent price.

In [6]:
source_rent_df = source_rent_df.groupby(['Suburb']).mean()
source_rent_df.reset_index(inplace=True)
source_rent_df.head()

Unnamed: 0,Suburb,Rent
0,AGRARIA,13000.0
1,ALBATERRA,7250.0
2,ALTAMIRA,23250.0
3,ALTEA,7000.0
4,AMERICANA,20026.136364
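
The plan in the Data section mentions the median rent, while the cell above aggregates with the mean. As a minimal sketch on made-up listings (the suburb names and rents below are illustrative, not taken from the dataset), both aggregates can be computed side by side to see how much a single expensive listing shifts each one:

```python
import pandas as pd

# Toy listings (made-up numbers); suburb "A" has one very expensive listing
listings = pd.DataFrame({
    'Suburb': ['A', 'A', 'A', 'B', 'B'],
    'Rent': [10000.0, 12000.0, 50000.0, 8000.0, 9000.0],
})

# Compute the mean and the median rent per suburb in one pass
summary = listings.groupby('Suburb')['Rent'].agg(['mean', 'median']).reset_index()
print(summary)
```

In this toy case the mean for suburb A is pulled well above the median by the outlier, so the choice of aggregate may matter for suburbs with few listings.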


Looking good, let's check some basic stats as a sanity check.

In [7]:
source_rent_stats = source_rent_df.describe()
source_rent_stats

Unnamed: 0,Rent
count,317.0
mean,17272.443272
std,9935.217925
min,1800.0
25%,9500.0
50%,15500.0
75%,23000.0
max,65000.0


Since the source and target cities may be in different countries, rent prices will be in different currencies, and the cost of living may differ too. We therefore need a relative measure of rent that can be compared across cities. I chose a percentile-style score: the suburb's position between the cheapest and the most expensive suburb in its city, on a scale from 0 to 100. Let's calculate the score for the source suburb.

In [8]:
def calc_percentile(max_value, min_value, value):
    return 100*((value - min_value)/(max_value-min_value))

In [9]:
# Create a percentile column for the source dataframe
source_rent_df['Rent Percentile'] = calc_percentile(source_rent_df['Rent'].max(), source_rent_df['Rent'].min(), source_rent_df['Rent'])
source_rent_df.head()

Unnamed: 0,Suburb,Rent,Rent Percentile
0,AGRARIA,13000.0,17.721519
1,ALBATERRA,7250.0,8.623418
2,ALTAMIRA,23250.0,33.939873
3,ALTEA,7000.0,8.227848
4,AMERICANA,20026.136364,28.838823
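
Note that `calc_percentile` measures where a rent sits between the minimum and maximum, which is not the same as a rank-based percentile. The sketch below contrasts the two on five illustrative rents; it is an alternative to consider, not what this notebook uses, and the variable names are my own.

```python
import pandas as pd

# Illustrative rents only (not the per-suburb data)
rents = pd.Series([1800.0, 9500.0, 15500.0, 23000.0, 65000.0], name='Rent')

# Min-max position: where the value sits between the cheapest and dearest rent
range_position = 100 * (rents - rents.min()) / (rents.max() - rents.min())

# Rank-based percentile: share of entries with a rent at or below this value
percentile_rank = 100 * rents.rank(pct=True)

comparison = pd.DataFrame({
    'Rent': rents,
    'Range position': range_position,
    'Percentile rank': percentile_rank,
})
print(comparison)
```

The rank-based version is less sensitive to a single extreme rent, since one very expensive suburb compresses every other min-max position towards zero.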


In [10]:
# Get the source suburb's mean rent and rent percentile
source_rent_df[source_rent_df['Suburb']==source_suburb.upper()]

Unnamed: 0,Suburb,Rent,Rent Percentile
313,PROVIDENCIA,26000.0,38.291139


## Venue data

Now we just need to collect venue information for the source suburb. We'll use the Foursquare API for this.

In [11]:
CLIENT_ID = '543RCQIDW44OE0ZDAPA5H5I3WDPAQTJZBSA1IIRNMYOV2B4W' # your Foursquare ID
CLIENT_SECRET = 'HTU223QZD4GOZ3W2H424YP2CZ1LGKOF1KSYNKICNEOYAV1V2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 200 # limit of number of venues returned by Foursquare API

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 543RCQIDW44OE0ZDAPA5H5I3WDPAQTJZBSA1IIRNMYOV2B4W
CLIENT_SECRET:HTU223QZD4GOZ3W2H424YP2CZ1LGKOF1KSYNKICNEOYAV1V2


#### Define function to explore neighborhoods

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
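
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back without a category. A minimal sketch of a more tolerant row builder follows; `extract_venue_rows` is a hypothetical helper name, and the sample items are hand-written to mirror only the fields the function above reads, not real API output.

```python
def extract_venue_rows(items, name, lat, lng):
    """Flatten 'explore' items into row tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category instead of raising
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Hand-written sample shaped like the fields used above (assumption)
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Mystery Spot', 'location': {'lat': 1.1, 'lng': 2.1},
               'categories': []}},
]
rows = extract_venue_rows(sample_items, 'DEMO', 1.0, 2.0)
print(rows)
```

Rows with a `None` category could then be dropped or kept, depending on whether uncategorised venues should count towards the suburb profile.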

#### Now run the above function on the source suburb and create a new dataframe called *source_venues*.

In [13]:
source_explore_data = pd.DataFrame({'Suburb':[source_suburb.upper()], 'Latitude':[source_loc[0]], 'Longitude':[source_loc[1]]})
source_venues = getNearbyVenues(names=source_explore_data['Suburb'],
                                 latitudes=source_explore_data['Latitude'],
                                 longitudes=source_explore_data['Longitude'], radius = 500)

PROVIDENCIA


#### Let's check the size of the resulting dataframe

In [14]:
print(source_venues.shape)
source_venues.head()

(59, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,PROVIDENCIA,20.697397,-103.378678,Sensecycle,20.698501,-103.377518,Other Event
1,PROVIDENCIA,20.697397,-103.378678,The Blooming Tea,20.697546,-103.380022,Tea Room
2,PROVIDENCIA,20.697397,-103.378678,Parque Dr. Atl,20.695867,-103.378644,Garden
3,PROVIDENCIA,20.697397,-103.378678,The Barre Studio,20.698223,-103.377308,Gym / Fitness Center
4,PROVIDENCIA,20.697397,-103.378678,Anytime Fitness Ottawa,20.698181,-103.377353,Gym / Fitness Center


#### Let's find out how many unique categories can be curated from all the returned venues

In [15]:
print('There are {} unique categories.'.format(len(source_venues['Venue Category'].unique())))

There are 43 unique categories.


### Suburb analysis

In [16]:
# Start with one hot encoding
source_onehot = pd.get_dummies(source_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
source_onehot['Neighborhood'] = source_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [source_onehot.columns[-1]] + list(source_onehot.columns[:-1])
source_onehot = source_onehot[fixed_columns]

source_onehot.head()

Unnamed: 0,Neighborhood,Argentinian Restaurant,Arts & Crafts Store,Asian Restaurant,Café,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,...,Shopping Mall,Snack Place,Spa,Supermarket,Sushi Restaurant,Taco Place,Tea Room,Toy / Game Store,Wings Joint,Yoga Studio
0,PROVIDENCIA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PROVIDENCIA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,PROVIDENCIA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PROVIDENCIA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PROVIDENCIA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [17]:
source_onehot.shape

(59, 44)

#### Next, let's group rows by neighborhood, taking the mean of each one-hot column to get the frequency of occurrence of each category

In [18]:
source_grouped = source_onehot.groupby('Neighborhood').mean().reset_index()
source_grouped

Unnamed: 0,Neighborhood,Argentinian Restaurant,Arts & Crafts Store,Asian Restaurant,Café,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,...,Shopping Mall,Snack Place,Spa,Supermarket,Sushi Restaurant,Taco Place,Tea Room,Toy / Game Store,Wings Joint,Yoga Studio
0,PROVIDENCIA,0.016949,0.016949,0.016949,0.033898,0.016949,0.067797,0.016949,0.016949,0.016949,...,0.016949,0.016949,0.033898,0.016949,0.016949,0.016949,0.016949,0.016949,0.016949,0.033898
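
To make the one-hot and group-by-mean step concrete, here is a small sketch on made-up venues (the neighborhood names and categories are illustrative). The mean of a 0/1 column within a neighborhood is the share of that neighborhood's venues in that category.

```python
import pandas as pd

# Toy venue list: neighborhood X has 4 venues, Y has 2
venues = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Venue Category': ['Café', 'Café', 'Gym', 'Park', 'Park', 'Park'],
})

# One 0/1 column per category, then the neighborhood label in front
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Averaging the 0/1 columns per neighborhood gives category frequencies
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1 across the category columns, which is why a suburb with 59 venues shows values that are multiples of 1/59 (about 0.016949).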


#### Combine with rent cost percentile

In [19]:
source_grouped_full = source_grouped.merge(source_rent_df, left_on='Neighborhood', right_on='Suburb')
source_grouped_full.drop(columns=['Suburb'], inplace=True)
source_grouped_full.head()

Unnamed: 0,Neighborhood,Argentinian Restaurant,Arts & Crafts Store,Asian Restaurant,Café,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,...,Spa,Supermarket,Sushi Restaurant,Taco Place,Tea Room,Toy / Game Store,Wings Joint,Yoga Studio,Rent,Rent Percentile
0,PROVIDENCIA,0.016949,0.016949,0.016949,0.033898,0.016949,0.067797,0.016949,0.016949,0.016949,...,0.033898,0.016949,0.016949,0.016949,0.016949,0.016949,0.016949,0.033898,26000.0,38.291139


#### Let's confirm the new size

In [20]:
source_grouped.shape

(1, 44)

#### Let's put the top venues into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [22]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
source_venues_sorted = pd.DataFrame(columns=columns)
source_venues_sorted['Neighborhood'] = source_grouped['Neighborhood']

for ind in np.arange(source_grouped.shape[0]):
    source_venues_sorted.iloc[ind, 1:] = return_most_common_venues(source_grouped.iloc[ind, :], num_top_venues)

source_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,PROVIDENCIA,Restaurant,Coffee Shop,Gym / Fitness Center,Nightclub,Yoga Studio,Café,Deli / Bodega,Mexican Restaurant,Ice Cream Shop,Spa
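
The column labels above rely on a list of three suffixes plus an exception for everything else. A sketch of the same idea with an explicit ordinal helper is shown below; `ordinal` and `top_categories` are hypothetical names introduced for illustration, and the frequencies are made up.

```python
import pandas as pd


def ordinal(n):
    """Return the English ordinal label for n, such as 1st, 2nd, 3rd or 11th."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


def top_categories(freq_row, n):
    """Names of the n most frequent categories in a row of frequencies."""
    return freq_row.sort_values(ascending=False).index[:n].tolist()


# Made-up category frequencies for a single suburb
row = pd.Series({'Café': 0.5, 'Gym': 0.2, 'Park': 0.3})
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(3)]
top = dict(zip(labels, top_categories(row, 3)))
print(top)
```

Keep in mind that categories with equal frequency have no meaningful order, so the lower ranks of a top-10 list can be arbitrary when many categories are tied.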


#### Combine with rent cost percentile

In [23]:
# Look up the percentile as a scalar; assigning the filtered Series directly
# would not line up, because its index (313) differs from this dataframe's (0)
source_percentile = source_rent_df.loc[source_rent_df['Suburb']==source_suburb.upper(), 'Rent Percentile'].iloc[0]
source_venues_sorted['Rent Percentile'] = source_percentile
source_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Rent Percentile
0,PROVIDENCIA,Restaurant,Coffee Shop,Gym / Fitness Center,Nightclub,Yoga Studio,Café,Deli / Bodega,Mexican Restaurant,Ice Cream Shop,Spa,38.291139


Now we have all the data from the source suburb to be able to compare it to suburbs in the target city.

# Target City Data

## Geographical location

Let's use geolocator to get the geographical coordinates of the target city.

In [24]:
target_city = 'melbourne'
target_state = 'victoria, australia'
address = target_city + ', ' + target_state

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
target_loc = [latitude, longitude]
print('The geographical coordinates of {} are {}, {}.'.format(target_city.upper(), latitude, longitude))

The geographical coordinates of MELBOURNE are -37.8142176, 144.9631608.


Let's show it on the map.

In [25]:
# create map of the target city using latitude and longitude values
target_map = folium.Map(location=target_loc, zoom_start=11)
target_map

That looks about right!

## Rent price data

For the target city, I'll use the rents by suburb provided by the Victorian government at https://www.dhhs.vic.gov.au/publications/rental-report.  
The data is provided as weekly median rent prices, so these need to be converted to monthly. Only the Melbourne figures will be used.  
I cleaned the data a bit beforehand to make it easier to import, as the original includes annual rates.

In [26]:
target_rent_df = pd.read_csv('Data/Clean/Melbourne_Rent_Data.csv')
target_rent_df['Rent'] = 52*target_rent_df['Rent']/12 # Convert to monthly rates
target_rent_df.head()

Unnamed: 0,Suburb,Rent
0,Albert Park-Middle Park-West St Kilda,2535.0
1,Armadale,2080.0
2,Carlton North,2491.666667
3,Carlton-Parkville,1841.666667
4,CBD-St Kilda Rd,2210.0
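
As a quick check of the weekly-to-monthly conversion, the sketch below applies the same 52/12 factor to a weekly rent of 585. That weekly figure is back-calculated from the 2535.0 shown above rather than read from the report, and the function name is my own.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def weekly_to_monthly(weekly_rent):
    """Convert a weekly rent to a monthly figure via the annual total."""
    return weekly_rent * WEEKS_PER_YEAR / MONTHS_PER_YEAR


# 585 per week is an assumed input chosen to match the first row above
monthly = weekly_to_monthly(585.0)
print(monthly)
```

Going through the annual total gives about 4.33 weeks per month, which is slightly more than the 4 weeks a naive conversion would use.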


We can see that some suburbs are grouped together, so we need to split those.

In [27]:
# Step 1
# Create a new dataframe from the series and stack to keep the original index so we can extract the data
new_df = pd.DataFrame(target_rent_df['Suburb'].str.split('-').tolist()).stack()

# Step 2
# Reset the index and rename columns so we can fill the rent data
new_df = new_df.reset_index()
new_df.columns = ['index', 'sub_i', 'Suburb']
new_df.drop(columns=['sub_i'], inplace=True)

# Step 3
# Also need to reset the index of the previous dataframe
target_rent_df.reset_index(inplace=True)

# Step 4
# Fit the rent data and drop the old index
new_df = new_df.merge(target_rent_df, on='index', how='right')
new_df.drop(columns=['index','Suburb_y'], inplace=True)
new_df.columns=['Suburb','Rent']
target_rent_df = new_df
target_rent_df.head()

Unnamed: 0,Suburb,Rent
0,Albert Park,2535.0
1,Middle Park,2535.0
2,West St Kilda,2535.0
3,Armadale,2080.0
4,Carlton North,2491.666667
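
The four steps above can also be expressed more compactly in recent pandas versions. This is a sketch of an alternative using `str.split` followed by `DataFrame.explode` (available from pandas 0.25), shown on two rows copied from the table above; it is not the method the notebook uses.

```python
import pandas as pd

# Two rows in the same shape as the rent table above
rent = pd.DataFrame({
    'Suburb': ['Albert Park-Middle Park-West St Kilda', 'Armadale'],
    'Rent': [2535.0, 2080.0],
})

# Turn each grouped name into a list, then give every list item its own row
split = (
    rent.assign(Suburb=rent['Suburb'].str.split('-'))
        .explode('Suburb')
        .reset_index(drop=True)
)
split['Suburb'] = split['Suburb'].str.strip()  # remove stray spaces around names
print(split)
```

Either way, every suburb in a group inherits the group's median rent, and a suburb whose own name contains a hyphen would be split incorrectly, so the result is worth a visual check.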


Looking good, let's check some basic stats as a sanity check.

In [28]:
target_rent_stats = target_rent_df.describe()
target_rent_stats

Unnamed: 0,Rent
count,158.0
mean,1959.681435
std,288.584545
min,1473.333333
25%,1733.333333
50%,1939.166667
75%,2145.0
max,2990.0


Let's calculate the suburb percentiles.

In [29]:
# Create a percentile column for the target dataframe
target_rent_df['Rent Percentile'] = calc_percentile(target_rent_df['Rent'].max(), target_rent_df['Rent'].min(), target_rent_df['Rent'])
target_rent_df.head()

Unnamed: 0,Suburb,Rent,Rent Percentile
0,Albert Park,2535.0,70.0
1,Middle Park,2535.0,70.0
2,West St Kilda,2535.0,70.0
3,Armadale,2080.0,40.0
4,Carlton North,2491.666667,67.142857


## Venue data

Now we need to collect venue information for all the suburbs in the target city. We'll use the Foursquare API for this.

#### First, we need to gather the geolocation data for the suburbs in the target city dataframe

In [30]:
geolocator = Nominatim(user_agent="to_explorer", timeout=5)
latitudes = []
longitudes = []
suburbs = []

# Repeat for each suburb
for suburb in target_rent_df['Suburb']:

    # Get location
    #address = suburb + ', ' + target_city + ', ' + target_state    
    address = suburb + ', ' + target_state    
    location = geolocator.geocode(address)
    
    if location is not None:
        latitude = location.latitude
        longitude = location.longitude

        latitudes.append(latitude)
        longitudes.append(longitude)
        suburbs.append(suburb)
    else:
        print(suburb + ' skipped.')

East Hawthorn skipped.
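
The loop above can be wrapped in a small function that also records which suburbs were skipped and pauses between requests. The sketch below takes the geocoding callable as a parameter so it can be tried without network access; `geocode_suburbs`, `delay_seconds`, and the stand-in lookup are assumptions made for illustration.

```python
import time
from collections import namedtuple

import pandas as pd


def geocode_suburbs(suburbs, geocode, suffix, delay_seconds=0.0):
    """Geocode each suburb; return (dataframe of hits, list of skipped names)."""
    rows, skipped = [], []
    for suburb in suburbs:
        location = geocode('{}, {}'.format(suburb, suffix))
        if location is None:
            skipped.append(suburb)
        else:
            rows.append({'Suburb': suburb,
                         'Latitude': location.latitude,
                         'Longitude': location.longitude})
        time.sleep(delay_seconds)  # pause between requests
    columns = ['Suburb', 'Latitude', 'Longitude']
    return pd.DataFrame(rows, columns=columns), skipped


# Stand-in geocoder: a dictionary lookup that returns None for unknown addresses
Location = namedtuple('Location', ['latitude', 'longitude'])
known = {'Armadale, victoria, australia': Location(-37.856762, 145.020691)}

geo_df, skipped = geocode_suburbs(['Armadale', 'East Hawthorn'],
                                  known.get, 'victoria, australia')
print(geo_df)
print(skipped)
```

With the real service, `geolocator.geocode` would be passed in place of `known.get`, and a delay of about one second per request is a reasonable starting point for the public Nominatim server.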


Create a dataframe with the geo data and save it to CSV, since it takes a long time to obtain, just in case something goes wrong.

In [31]:
target_geo_data = pd.DataFrame({'Suburb':suburbs, 'Latitude':latitudes, 'Longitude':longitudes})
target_geo_data.to_csv('Data/Clean/target_geo.csv', index=False)
target_geo_data.head()

Unnamed: 0,Suburb,Latitude,Longitude
0,Albert Park,-37.847772,144.962008
1,Middle Park,-37.851151,144.96204
2,West St Kilda,-37.863826,144.981637
3,Armadale,-37.856762,145.020691
4,Carlton North,-37.784559,144.972855


#### Now let's merge the geo data with the rent data

In [32]:
target_rent_df = target_rent_df.merge(target_geo_data, on='Suburb', how='right')
print(target_rent_df.shape)
target_rent_df.head()

(157, 5)


Unnamed: 0,Suburb,Rent,Rent Percentile,Latitude,Longitude
0,Albert Park,2535.0,70.0,-37.847772,144.962008
1,Middle Park,2535.0,70.0,-37.851151,144.96204
2,West St Kilda,2535.0,70.0,-37.863826,144.981637
3,Armadale,2080.0,40.0,-37.856762,145.020691
4,Carlton North,2491.666667,67.142857,-37.784559,144.972855


Looking pretty good!

#### OK! Let's get venue data now.

In [33]:
target_venues = getNearbyVenues(names=target_rent_df['Suburb'],
                                 latitudes=target_rent_df['Latitude'],
                                 longitudes=target_rent_df['Longitude'], radius = 500)

Albert Park
Middle Park
West St Kilda
Armadale
Carlton North
Carlton
Parkville
CBD
St Kilda Rd
Collingwood
Abbotsford
Docklands
East Melbourne
East St Kilda
Elwood
Fitzroy
Fitzroy North
Clifton Hill
Flemington
Kensington
North Melbourne
West Melbourne
Port Melbourne
Prahran
Windsor
Richmond
Burnley
South Melbourne
South Yarra
Southbank
St Kilda
Toorak
Balwyn
Blackburn
Box Hill
Bulleen
Templestowe
Doncaster
Burwood
Ashburton
Camberwell
Glen Iris
Canterbury
Surrey Hills
Mont Albert
Chadstone
Oakleigh
Clayton
Doncaster East
Donvale
Glen Waverley
Mulgrave
Hawthorn
Kew
Mount Waverley
Nunawading
Mitcham
Vermont
Forest Hill
Burwood East
Aspendale
Chelsea
Carrum
Bentleigh
Brighton
Brighton East
Carnegie
Caulfield
Cheltenham
Elsternwick
Hampton
Beaumaris
Malvern
Malvern East
Mentone
Parkdale
Mordialloc
Murrumbeena
Hughesdale
Altona
Footscray
Keilor East
Avondale Heights
Melton
Newport
Spotswood
St Albans
Deer Park
Sunshine
Sydenham
Werribee
Hoppers Crossing
West Footscray
Williamstown
Yarravill

#### Let's check the size of the resulting dataframe

In [34]:
print(target_venues.shape)
target_venues.head()

(2612, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albert Park,-37.847772,144.962008,Formula 1 Grand Prix Circuit,-37.848324,144.967346,Racetrack
1,Albert Park,-37.847772,144.962008,Albert Park Driving Range,-37.844778,144.962922,Golf Course
2,Albert Park,-37.847772,144.962008,Hot Honey Cafe,-37.850893,144.963772,Café
3,Albert Park,-37.847772,144.962008,The Armstrong Street Foodstore,-37.850394,144.964328,Café
4,Albert Park,-37.847772,144.962008,The Roti Man,-37.849948,144.964738,Indian Restaurant


#### Let's find out how many unique categories can be curated from all the returned venues

In [35]:
print('There are {} unique categories.'.format(len(target_venues['Venue Category'].unique())))

There are 264 unique categories.


### Suburb analysis

In [36]:
# Start with one hot encoding
target_onehot = pd.get_dummies(target_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
target_onehot['Neighborhood'] = target_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [target_onehot.columns[-1]] + list(target_onehot.columns[:-1])
target_onehot = target_onehot[fixed_columns]

target_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit
0,Albert Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Albert Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Albert Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Albert Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Albert Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [37]:
target_onehot.shape

(2612, 265)

#### Next, let's group rows by neighborhood, taking the mean of each one-hot column to get the frequency of occurrence of each category

In [38]:
target_grouped = target_onehot.groupby('Neighborhood').mean().reset_index()
target_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit
0,Abbotsford,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,Albert Park,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,Alphington,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,Altona,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,Armadale,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,West Melbourne,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
144,West St Kilda,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.027027,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
145,Williamstown,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
146,Windsor,0.0,0.0,0.0,0.0,0.013889,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0


#### Combine with rent cost percentile

In [39]:
target_grouped_full = target_grouped.merge(target_rent_df, left_on='Neighborhood', right_on='Suburb')
target_grouped_full.drop(columns=['Suburb'], inplace=True)
target_grouped_full.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Women's Store,Xinjiang Restaurant,Yoga Studio,Yunnan Restaurant,Zoo,Zoo Exhibit,Rent,Rent Percentile,Latitude,Longitude
0,Abbotsford,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2166.666667,45.714286,-37.804551,144.998854
1,Albert Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2535.0,70.0,-37.847772,144.962008
2,Alphington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1993.333333,34.285714,-37.778395,145.031282
3,Altona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1733.333333,17.142857,-37.867206,144.830142
4,Armadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2080.0,40.0,-37.856762,145.020691


#### Let's confirm the new size

In [40]:
target_grouped.shape

(148, 265)

#### Let's put the top venues into a *pandas* dataframe

Create the new dataframe and display the top 10 venues for each neighborhood.

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
target_venues_sorted = pd.DataFrame(columns=columns)
target_venues_sorted['Neighborhood'] = target_grouped['Neighborhood']

for ind in np.arange(target_grouped.shape[0]):
    target_venues_sorted.iloc[ind, 1:] = return_most_common_venues(target_grouped.iloc[ind, :], num_top_venues)

target_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Abbotsford,Pub,Café,Thrift / Vintage Store,Garden,Farmers Market,Vegetarian / Vegan Restaurant,Harbor / Marina,Coffee Shop,Japanese Restaurant,Furniture / Home Store
1,Albert Park,Café,Grocery Store,Indian Restaurant,Athletics & Sports,Racetrack,Tennis Court,Thai Restaurant,Light Rail Station,Seafood Restaurant,Golf Course
2,Alphington,Park,Train Station,Convenience Store,Liquor Store,Thai Restaurant,Gym / Fitness Center,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Falafel Restaurant
3,Altona,Bar,Café,Seafood Restaurant,Park,Burger Joint,Performing Arts Venue,Gym,Harbor / Marina,Fish & Chips Shop,Pizza Place
4,Armadale,Café,Convenience Store,Breakfast Spot,Grocery Store,Pizza Place,Train Station,Diner,Farmers Market,Football Stadium,Food Truck


#### Combine with rent cost percentile

In [42]:
target_venues_sorted = target_venues_sorted.merge(target_rent_df, left_on='Neighborhood', right_on='Suburb')
target_venues_sorted.drop(columns=['Suburb'], inplace=True)
target_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Rent,Rent Percentile,Latitude,Longitude
0,Abbotsford,Pub,Café,Thrift / Vintage Store,Garden,Farmers Market,Vegetarian / Vegan Restaurant,Harbor / Marina,Coffee Shop,Japanese Restaurant,Furniture / Home Store,2166.666667,45.714286,-37.804551,144.998854
1,Albert Park,Café,Grocery Store,Indian Restaurant,Athletics & Sports,Racetrack,Tennis Court,Thai Restaurant,Light Rail Station,Seafood Restaurant,Golf Course,2535.0,70.0,-37.847772,144.962008
2,Alphington,Park,Train Station,Convenience Store,Liquor Store,Thai Restaurant,Gym / Fitness Center,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Falafel Restaurant,1993.333333,34.285714,-37.778395,145.031282
3,Altona,Bar,Café,Seafood Restaurant,Park,Burger Joint,Performing Arts Venue,Gym,Harbor / Marina,Fish & Chips Shop,Pizza Place,1733.333333,17.142857,-37.867206,144.830142
4,Armadale,Café,Convenience Store,Breakfast Spot,Grocery Store,Pizza Place,Train Station,Diner,Farmers Market,Football Stadium,Food Truck,2080.0,40.0,-37.856762,145.020691


Awesome, so that's the end of the target city data. Now we can proceed to do a comparison and recommendation.

# Suburb Recommendation
We want to use the venue and rent percentile data for the source suburb to recommend the suburbs in the target city that are most similar to it. To do this, I'll cluster the suburbs in the target city by similarity using k-means. Then I'll use the fitted model to estimate which cluster the source suburb would belong to. The recommendation will be the suburbs in that cluster; if there are more than 5, it'll be the 5 closest to the source suburb in rent percentile.

#### Cluster Neighborhoods

Run *k*-means to cluster the suburbs. One thing to consider is that venue categories present in the source suburb may not be present in the target city suburbs, and vice versa. Therefore, we can only use the intersection of these features for the clustering and estimation.

In [43]:
print(target_grouped_full.columns.intersection(source_grouped_full.columns))

Index(['Neighborhood', 'Argentinian Restaurant', 'Arts & Crafts Store',
       'Asian Restaurant', 'Café', 'Clothing Store', 'Coffee Shop',
       'Convenience Store', 'Cosmetics Shop', 'Dance Studio', 'Deli / Bodega',
       'Department Store', 'Dessert Shop', 'Dive Bar', 'Food & Drink Shop',
       'Food Truck', 'French Restaurant', 'Furniture / Home Store', 'Garden',
       'Gym', 'Gym / Fitness Center', 'Hotel', 'Ice Cream Shop',
       'Italian Restaurant', 'Japanese Restaurant',
       'Latin American Restaurant', 'Mediterranean Restaurant',
       'Mexican Restaurant', 'Music Venue', 'Nightclub', 'Pharmacy',
       'Restaurant', 'Shopping Mall', 'Snack Place', 'Spa', 'Supermarket',
       'Sushi Restaurant', 'Taco Place', 'Tea Room', 'Toy / Game Store',
       'Wings Joint', 'Yoga Studio', 'Rent', 'Rent Percentile'],
      dtype='object')
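
To make the idea of the feature intersection concrete, here is a minimal sketch on two tiny, made-up frequency tables. The names `target_demo`, `source_demo`, `feature_cols` and all the values are assumptions for illustration only, not the real notebook data. The point is that both tables end up with exactly the same feature columns in the same order.

```python
import pandas as pd

# Hypothetical one-row-per-suburb frequency tables (all values are made up)
target_demo = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Café': [0.4, 0.1],
    'Pub': [0.2, 0.0],
    'Gym': [0.1, 0.3],
    'Rent Percentile': [45.0, 70.0],
})
source_demo = pd.DataFrame({
    'Neighborhood': ['S'],
    'Gym': [0.2],
    'Café': [0.3],
    'Taco Place': [0.1],
    'Rent Percentile': [38.0],
})

# Columns present in both tables, minus the suburb name (not a feature)
shared_cols = target_demo.columns.intersection(source_demo.columns)
feature_cols = shared_cols.drop('Neighborhood')

# Index both tables with the same column list so the feature order is identical
target_features = target_demo[feature_cols]
source_features = source_demo[feature_cols]

print(list(target_features.columns))
print(list(source_features.columns))
```

'Pub' and 'Taco Place' are dropped because each appears in only one of the two tables.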


In [44]:
# set number of clusters
kclusters = 20
krand = 0

# Only use the intersection of features for training and estimation
target_grouped_clustering = target_grouped_full[target_grouped_full.columns.intersection(source_grouped_full.columns)]
# Drop the neighborhood and raw rent columns: only the rent percentile is used as a feature.
# Assigning the result (instead of inplace=True on a slice) avoids the SettingWithCopyWarning.
target_grouped_clustering = target_grouped_clustering.drop(columns=['Neighborhood', 'Rent'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=krand).fit(target_grouped_clustering)

# count how many suburbs were assigned to each cluster
unique, counts = np.unique(kmeans.labels_, return_counts=True)
dict(zip(unique, counts))


{0: 4,
 1: 12,
 2: 13,
 3: 5,
 4: 8,
 5: 2,
 6: 11,
 7: 18,
 8: 3,
 9: 6,
 10: 3,
 11: 15,
 12: 17,
 13: 1,
 14: 7,
 15: 8,
 16: 4,
 17: 4,
 18: 5,
 19: 2}

After trying different numbers of clusters, 20 gave the best distribution of suburbs across clusters.
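
The comparison of cluster numbers could be done along these lines. This is only a sketch on synthetic data: `X_demo` stands in for `target_grouped_clustering`, and the candidate values of `k` are arbitrary choices of mine. It fits k-means for each `k` and reports the smallest and largest cluster sizes, which gives a quick feel for how evenly the suburbs are spread.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for target_grouped_clustering (assumption: not the real data)
X_demo, _ = make_blobs(n_samples=150, centers=6, n_features=5, random_state=0)

cluster_sizes = {}
for k in (5, 10, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).labels_
    # number of rows assigned to each of the k clusters
    sizes = np.bincount(labels, minlength=k)
    cluster_sizes[k] = sizes
    print("k={:2d}: smallest cluster = {}, largest cluster = {}".format(k, sizes.min(), sizes.max()))
```

A large gap between the smallest and largest cluster suggests that a few clusters are absorbing most of the suburbs.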

Add clusters to the target city venues so we can pick the recommended ones.

In [45]:
# add clustering labels to target venues
#target_venues_sorted.drop(columns=['Cluster Labels'], inplace=True)
target_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
target_venues_sorted.head() # check the first column!

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Rent,Rent Percentile,Latitude,Longitude
0,7,Abbotsford,Pub,Café,Thrift / Vintage Store,Garden,Farmers Market,Vegetarian / Vegan Restaurant,Harbor / Marina,Coffee Shop,Japanese Restaurant,Furniture / Home Store,2166.666667,45.714286,-37.804551,144.998854
1,0,Albert Park,Café,Grocery Store,Indian Restaurant,Athletics & Sports,Racetrack,Tennis Court,Thai Restaurant,Light Rail Station,Seafood Restaurant,Golf Course,2535.0,70.0,-37.847772,144.962008
2,2,Alphington,Park,Train Station,Convenience Store,Liquor Store,Thai Restaurant,Gym / Fitness Center,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Falafel Restaurant,1993.333333,34.285714,-37.778395,145.031282
3,1,Altona,Bar,Café,Seafood Restaurant,Park,Burger Joint,Performing Arts Venue,Gym,Harbor / Marina,Fish & Chips Shop,Pizza Place,1733.333333,17.142857,-37.867206,144.830142
4,16,Armadale,Café,Convenience Store,Breakfast Spot,Grocery Store,Pizza Place,Train Station,Diner,Farmers Market,Football Stadium,Food Truck,2080.0,40.0,-37.856762,145.020691


Let's check whether all neighbourhoods have been properly clustered.

In [46]:
print('There are {} neighbourhoods without results.'.format(target_venues_sorted['Cluster Labels'].isnull().sum()))

There are 0 neighbourhoods without results.


Looks good!

## Recommendation
Now we want to predict which cluster in the target city the source suburb belongs to.  
First, we need to select the intersection of features.

In [47]:
# Use the same column selection as for the target data so the feature order matches the fitted model
source_grouped_clustering = source_grouped_full[target_grouped_full.columns.intersection(source_grouped_full.columns)]
# Drop the neighborhood and raw rent columns: only the rent percentile is used as a feature
source_grouped_clustering = source_grouped_clustering.drop(columns=['Neighborhood', 'Rent'])


### K Means
Use kmeans to predict the source suburb cluster.

In [48]:
source_cluster = kmeans.predict(source_grouped_clustering)
print("The source suburb, {}, belongs to cluster {}.".format(source_suburb, source_cluster))

The source suburb, providencia, belongs to cluster [16].
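
For intuition on what the prediction step does: for a fitted k-means model, predicting a new row amounts to picking the centroid closest to it in Euclidean distance. Here is a small sketch on random data that checks this by hand. `X_demo`, `x_new` and `model` are made-up names and values, not the notebook's variables.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_demo = rng.rand(40, 4)   # made-up "target suburb" feature rows
x_new = rng.rand(1, 4)     # made-up "source suburb" feature row

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Euclidean distance from the new row to each cluster centroid
distances = np.linalg.norm(model.cluster_centers_ - x_new, axis=1)
nearest = int(np.argmin(distances))

# The model's own prediction for the same row
predicted = int(model.predict(x_new)[0])
print(nearest, predicted)
```

Both numbers should agree, since the predicted label is the index of the nearest centroid.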


Now we need to extract the target suburbs that belong to the cluster.

In [49]:
recommended_suburbs = target_venues_sorted[target_venues_sorted['Cluster Labels']==source_cluster.tolist()[0]]
recommended_suburbs.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Rent,Rent Percentile,Latitude,Longitude
4,16,Armadale,Café,Convenience Store,Breakfast Spot,Grocery Store,Pizza Place,Train Station,Diner,Farmers Market,Football Stadium,Food Truck,2080.0,40.0,-37.856762,145.020691
20,16,Brunswick,Café,Bar,Grocery Store,Pizza Place,Lebanese Restaurant,Supermarket,Thrift / Vintage Store,Dessert Shop,Thai Restaurant,Gastropub,2036.666667,37.142857,-37.766472,144.96131
32,16,Caulfield,Café,Grocery Store,Gym,Falafel Restaurant,Pizza Place,Convenience Store,Fast Food Restaurant,Filipino Restaurant,Farmers Market,Fish & Chips Shop,2058.333333,38.571429,-37.882265,145.022463
35,16,Cheltenham,Café,Bakery,Coffee Shop,Pet Store,Pizza Place,Gastropub,Fish & Chips Shop,Breakfast Spot,Park,Flea Market,2036.666667,37.142857,-37.967008,145.054695


Display the source suburb for comparison.

In [50]:
source_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Rent Percentile
0,PROVIDENCIA,Restaurant,Coffee Shop,Gym / Fitness Center,Nightclub,Yoga Studio,Café,Deli / Bodega,Mexican Restaurant,Ice Cream Shop,Spa,38.291139


The recommended cluster contains 4 suburbs: Armadale, Brunswick, Caulfield and Cheltenham. These sit between the 37th and 40th percentile in rent and have cafés as their most common venue. The other top venues are varied, but include gyms, various types of restaurants, and grocery stores. Funnily enough, I do live in a suburb next to one of these, which is a really interesting finding!
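
Since this cluster only has 4 suburbs, the "closest 5 in rent percentile" rule mentioned at the start of this section wasn't needed here. For a larger cluster, it could be applied roughly as follows. This is a sketch on made-up data: `cluster_demo` and `source_percentile_demo` are hypothetical, and I'm assuming the same `Rent Percentile` column name as in `recommended_suburbs`.

```python
import pandas as pd

# Made-up cluster members (assumption: same column names as recommended_suburbs)
cluster_demo = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Rent Percentile': [30.0, 37.1, 38.6, 40.0, 45.0, 52.0, 36.0],
})
source_percentile_demo = 38.3  # hypothetical rent percentile of the source suburb

# Absolute gap in rent percentile between each suburb and the source suburb
gap = (cluster_demo['Rent Percentile'] - source_percentile_demo).abs()

# Keep the 5 suburbs with the smallest gap, closest first
top5 = cluster_demo.loc[gap.nsmallest(5).index]
print(top5)
```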

## Display Results on Map

In [51]:
# create map
map_recommended = folium.Map(location=target_loc, zoom_start=11)

# add markers to map
for lat, lng, nb, rt, fst, snd, trd in zip(recommended_suburbs['Latitude'], recommended_suburbs['Longitude'],
                                      recommended_suburbs['Neighborhood'], recommended_suburbs['Rent'],
                                      recommended_suburbs['1st Most Common Venue'], recommended_suburbs['2nd Most Common Venue'], recommended_suburbs['3rd Most Common Venue']):
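    # {:,.2f} formats the rent with thousands separators and two decimals, without padding spaces after the $
    label = '<strong>Suburb:</strong> {}.<br>  <strong>Median monthly rent:</strong> ${:,.2f} AUD.<br>  <strong>Top 3 venues within 500m:</strong><br> {}, {}, {}.'.format(nb, rt, fst, snd, trd)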
    label = folium.Popup(label, parse_html=False)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_recommended)  
    
map_recommended