# Moving from Rome to Toronto

Using k-means clustering, I would like to address the following problem:  
    Rome and Toronto are quite different cities. To take just one example, the weather is very different.  
    Within Rome, too, neighbourhoods are quite different from each other. Someone born and raised in "Centocelle" will give you a different description of Rome than someone who lives in "Torre Gaia".  
    If someone from Rome would like to relocate to Toronto, in which Toronto neighbourhoods should he/she look for a flat, so that the area is as similar as possible to the one he/she lives in in Rome?
     

In [1]:
#Importing necessary packages
!conda install -c anaconda beautifulsoup4 --yes
!conda install lxml --yes
import pandas as pd
import numpy as np


### The below is quite similar to the labs in week 3, so I have grouped all the code in a single cell, as the reviewers will already be familiar with it  
My goal here is to build the Toronto dataset with Postcode, Borough, Neighbourhood, Latitude and Longitude


In [2]:
#Scraping the Canada postal code table
Canada_pc = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
Canada_pc = Canada_pc[Canada_pc['Borough']!='Not assigned']

#More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice 
#and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma
Canada_pc=Canada_pc.groupby('Postcode').agg(lambda x:','.join(set(x)))

#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the 
#Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park
Canada_pc['Neighbourhood'] = np.where(Canada_pc['Neighbourhood'] == 'Not assigned', Canada_pc['Borough'], Canada_pc['Neighbourhood'])

#import csv file
Lat_long = pd.read_csv('http://cocl.us/Geospatial_data')
#Add lat long to my table
Lat_long_Postcode = Canada_pc.join(Lat_long.set_index('Postal Code'), on='Postcode', how = 'left')
#As instructed, I only want to consider neighborhoods in Toronto
Toronto_pc = Lat_long_Postcode.query('Borough.str.contains("Toronto",case=False)')
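#Average the coordinates per neighbourhood (selecting the numeric columns explicitly keeps this working on recent pandas versions)
Toronto_pc = Toronto_pc.groupby('Neighbourhood')[['Latitude', 'Longitude']].mean()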
Toronto_pc.reset_index(level=0, inplace=True)
Toronto_pc['City'] = 'Toronto'
Toronto_pc.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,City
0,"Adelaide,Richmond,King",43.650571,-79.384568,Toronto
1,Berczy Park,43.644771,-79.373306,Toronto
2,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Toronto
3,Central Bay Street,43.657952,-79.387383,Toronto
4,Christie,43.669542,-79.422564,Toronto


Now I need to collect the same data for Rome.  
On the website http://download.geonames.org/export/zip/ I found all the Italian postcodes (zip codes) with their Latitude and Longitude.  
I also need another dataset that connects postcodes with Neighbourhoods inside the city of Rome. I found it here: http://www.roma-o-matic.com/zoom.php3?actions[Z]=x

In [3]:
Rome_zip_lat_long = pd.read_csv("/resources/labs/Data/Roman Zip Codes.csv")
Rome_zip_lat_long.head()

Unnamed: 0,Cap,Lat,Long
0,100,41.9228,12.501
1,118,41.8199,12.615
2,123,42.0231,12.323
3,133,41.862,12.633
4,185,41.8977,12.509


In [4]:
Rome_inner_Neigh = pd.read_csv('/resources/labs/Data/Rome_Neigh.csv')

Rome_inner_Neigh.head()

Unnamed: 0,Zona,CAP
0,Acqua Vergine,10
1,Acqua Vergine,155
2,Acqua Vergine,173
3,Aeroporto di Ciampino,40
4,Aeroporto di Ciampino,43


We immediately spot a problem with both datasets: the leading zeros are missing!  
All the inner Neighbourhoods of Rome have a zip code (CAP, in Italian) of the form 001 plus 2 more digits, so without the leading zeros they read as 100 to 199.  
Any zip code with fewer than 3 digits, or not starting with 1, is therefore of no interest to us (those are areas outside Rome, even if they may be considered within the Roman sphere of influence).
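
The filter in the following cells works directly on the integer codes. As an alternative sketch, the leading zeros could be restored as text first, assuming every CAP has five digits. The sample values and the `cap_str` column name are made up for illustration and are not part of the datasets above.

```python
import pandas as pd

# Hypothetical sample mirroring a 'Cap' column whose leading zeros were lost
sample = pd.DataFrame({'Cap': [100, 118, 40, 10, 185, 20121]})

# Restore the five-digit CAP as text (assumption: every Italian CAP has 5 digits)
sample['cap_str'] = sample['Cap'].astype(str).str.zfill(5)

# Inner Rome CAPs have the form 001xx
inner_rome = sample[sample['cap_str'].str.startswith('001')]
print(inner_rome)
```

Keeping the code as a string also avoids losing the zeros again when the table is saved and reloaded.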


In [5]:
Rome_zip_lat_long = Rome_zip_lat_long[(Rome_zip_lat_long['Cap']>=100) & (Rome_zip_lat_long['Cap']<200)]
Rome_zip_lat_long.head()

Unnamed: 0,Cap,Lat,Long
0,100,41.9228,12.501
1,118,41.8199,12.615
2,123,42.0231,12.323
3,133,41.862,12.633
4,185,41.8977,12.509


In [6]:
Rome_inner_Neigh = Rome_inner_Neigh[(Rome_inner_Neigh['CAP']>=100) & (Rome_inner_Neigh['CAP']<200)]
Rome_inner_Neigh.head()

Unnamed: 0,Zona,CAP
1,Acqua Vergine,155
2,Acqua Vergine,173
5,Aeroporto di Ciampino,178
6,Alessandrino,155
7,Alessandrino,169


Rome_inner_Neigh has a few duplicates that I can get rid of (just a data quality issue)

In [7]:
Rome_inner_Neigh.drop_duplicates(inplace=True)
Rome_zip_lat_long.head()

Unnamed: 0,Cap,Lat,Long
0,100,41.9228,12.501
1,118,41.8199,12.615
2,123,42.0231,12.323
3,133,41.862,12.633
4,185,41.8977,12.509


We can now join the 2 datasets together. Since it's quite possible for a Roman Neighbourhood to spread across several postcodes, I will take the average of the Lat/Long values within the Neighbourhood. This should give me an approximately central location.
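
To make the averaging step concrete, here is a minimal sketch on made-up rows. The neighbourhood labels `A` and `B` and their coordinates are hypothetical, chosen only to show how a neighbourhood spread over two postcodes collapses into a single point.

```python
import pandas as pd

# Hypothetical rows: neighbourhood A spans two postcodes, B has a single one
rows = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B'],
    'Latitude': [41.80, 41.90, 42.00],
    'Longitude': [12.50, 12.60, 12.30],
})

# One row per neighbourhood, with the mean of its postcode coordinates
centres = rows.groupby('Neighbourhood')[['Latitude', 'Longitude']].mean()
print(centres)
```

A plain mean of latitude and longitude is a reasonable approximation at city scale, though it ignores how large each postcode area is.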

In [8]:
Rome_inner_Neigh = Rome_inner_Neigh.join(Rome_zip_lat_long.set_index('Cap'), on='CAP', how = 'left')
#Eliminate the ones I don't have lat/long for
Rome_inner_Neigh.dropna(inplace=True)
Rome_inner_Neigh.rename(columns={"Zona": "Neighbourhood", "Lat": "Latitude", "Long" : "Longitude"},inplace=True)
Rome_inner_Neigh = Rome_inner_Neigh.groupby('Neighbourhood').mean()
Rome_inner_Neigh = Rome_inner_Neigh[['Latitude','Longitude']]
Rome_inner_Neigh.reset_index(level=0, inplace=True)
Rome_inner_Neigh['City']='Rome'
Rome_inner_Neigh

Unnamed: 0,Neighbourhood,Latitude,Longitude,City
0,Acqua Vergine,41.874100,12.589500,Rome
1,Alessandrino,41.893550,12.579500,Rome
2,Appio Claudio,41.842000,12.592000,Rome
3,Appio Latino,41.873600,12.511500,Rome
4,Ardeatino,41.852967,12.492000,Rome
...,...,...,...,...
97,Trieste,41.923460,12.517000,Rome
98,Trionfale,41.918517,12.412167,Rome
99,Tuscolano,41.877880,12.522800,Rome
100,Val Melaina,41.962475,12.530750,Rome


We can now append the Toronto and Rome datasets together

In [9]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
Rome_Toronto_Neigh = pd.concat([Toronto_pc, Rome_inner_Neigh], ignore_index=True)
Rome_Toronto_Neigh

Unnamed: 0,Neighbourhood,Latitude,Longitude,City
0,"Adelaide,Richmond,King",43.650571,-79.384568,Toronto
1,Berczy Park,43.644771,-79.373306,Toronto
2,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Toronto
3,Central Bay Street,43.657952,-79.387383,Toronto
4,Christie,43.669542,-79.422564,Toronto
...,...,...,...,...
136,Trieste,41.923460,12.517000,Rome
137,Trionfale,41.918517,12.412167,Rome
138,Tuscolano,41.877880,12.522800,Rome
139,Val Melaina,41.962475,12.530750,Rome


## Cluster the neighborhoods  
After running the notebook, I deleted my CLIENT_ID and CLIENT_SECRET for security

In [10]:
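CLIENT_ID = ''      # Foursquare client ID (removed for security)
CLIENT_SECRET = ''  # Foursquare client secret (removed for security)
VERSION = ''        # Foursquare API version (removed)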
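
One way to avoid deleting the credentials by hand after each run is to read them from environment variables. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and would have to be exported before starting the notebook.

```python
import os

# Assumption: the credentials were exported beforehand under these hypothetical names
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# Warn early instead of failing later inside the API call
if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; the API requests will fail.')
```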

In [11]:
# Define a function that gets up to LIMIT venues within a given radius of each neighbourhood's coordinates.

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [13]:
#I use the above functions to get the venues of the 2 cities

import requests # library to handle requests
LIMIT = 100 # limit of number of venues returned by Foursquare API

toronto_rome_venues = getNearbyVenues(names=Rome_Toronto_Neigh['Neighbourhood'],
                                   latitudes=Rome_Toronto_Neigh['Latitude'],
                                   longitudes=Rome_Toronto_Neigh['Longitude']
                                  )


Adelaide,Richmond,King
Berczy Park
Business Reply Mail Processing Centre 969 Eastern
Central Bay Street
Christie
Church and Wellesley
Commerce Court,Victoria Hotel
Davisville
Davisville North
Design Exchange,Toronto Dominion Centre
Dovercourt Village,Dufferin
Exhibition Place,Parkdale Village,Brockton
Forest Hill West,Forest Hill North
Harbourfront
Harbourfront East,Toronto Islands,Union Station
High Park,The Junction South
Kensington Market,Grange Park,Chinatown
Lawrence Park
Little Portugal,Trinity
North Toronto West
Parkdale,Roncesvalles
Queen's Park
Riverdale,The Danforth West
Rosedale
Roselawn
Ryerson,Garden District
South Niagara,Bathurst Quay,King and Spadina,Railway Lands,CN Tower,Harbourfront West,Island airport
St. James Town
St. James Town,Cabbagetown
Stn A PO Boxes 25 The Esplanade
Studio District
Summerhill East,Moore Park
Summerhill West,Forest Hill SE,Deer Park,South Hill,Rathnelly
Swansea,Runnymede
The Beaches
The Beaches West,India Bazaar
Underground city,First Canadia

In [14]:
#check the results
toronto_rome_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Adelaide,Richmond,King",43.650571,-79.384568,Four Seasons Centre for the Performing Arts,43.650592,-79.385806,Concert Hall
1,"Adelaide,Richmond,King",43.650571,-79.384568,Rosalinda,43.650252,-79.385156,Vegetarian / Vegan Restaurant
2,"Adelaide,Richmond,King",43.650571,-79.384568,Nathan Phillips Square,43.65227,-79.383516,Plaza
3,"Adelaide,Richmond,King",43.650571,-79.384568,The Keg Steakhouse + Bar,43.649937,-79.384196,Steakhouse
4,"Adelaide,Richmond,King",43.650571,-79.384568,Shangri-La Toronto,43.649129,-79.386557,Hotel


### Explore the categories I can use for clustering

In [15]:
#Let's explore the categories I can use for clustering

# one hot encoding
toronto_rome_onehot = pd.get_dummies(toronto_rome_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_rome_onehot['Neighborhood'] = toronto_rome_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_rome_onehot.columns[-1]] + list(toronto_rome_onehot.columns[:-1])
toronto_rome_onehot = toronto_rome_onehot[fixed_columns]

toronto_rome_onehot.head()

Unnamed: 0,Zoo,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
#Next, let's group rows by neighborhood, taking the mean of the one-hot columns to get the frequency of occurrence of each category

toronto_rome_grouped = toronto_rome_onehot.groupby('Neighborhood').mean().reset_index()
toronto_rome_grouped

Unnamed: 0,Neighborhood,Zoo,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Acqua Vergine,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000
1,"Adelaide,Richmond,King",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.020000,0.000000,0.0,0.01,0.0,0.0,0.0,0.01,0.000000
2,Alessandrino,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000
3,Appio Claudio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000
4,Appio Latino,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118,Tuscolano,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000
119,"Underground city,First Canadian Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.010000,0.000000,0.0,0.01,0.0,0.0,0.0,0.00,0.000000
120,"University of Toronto,Harbord",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.028571,0.0,0.00,0.0,0.0,0.0,0.00,0.028571
121,Val Melaina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.0,0.00,0.0,0.0,0.0,0.00,0.000000


I want to display the top 10 venues for each neighborhood

In [73]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10 

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_rome_grouped['Neighborhood']

for ind in np.arange(toronto_rome_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_rome_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Acqua Vergine,Italian Restaurant,Tattoo Parlor,Electronics Store,Yoga Studio,Event Space,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant
1,"Adelaide,Richmond,King",Coffee Shop,Bar,Café,Steakhouse,Thai Restaurant,Cosmetics Shop,Sushi Restaurant,Restaurant,Gym,Bakery
2,Alessandrino,Flea Market,Library,Pizza Place,Soccer Field,Italian Restaurant,Event Space,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
3,Appio Claudio,Hotel,Bus Station,Pizza Place,Fast Food Restaurant,Yoga Studio,Event Space,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,Appio Latino,Pizza Place,Hotel,Plaza,Café,Gastropub,Gym,Pub,Ice Cream Shop,Electronics Store,Dog Run


Now I am all set to calculate the distance between Neighborhoods.  
I am not after the clustering itself but rather the distance matrix.  
So ideally, I would like to have one cluster per Neighborhood, as euclidean_distances(kmeans.cluster_centers_) gives me the distances between cluster centres.  
Some Neighborhoods are so similar to each other that the algorithm creates a single cluster for them.  
This is the reason why I used 107 clusters instead of 123 (the number of distinct Neighborhoods).
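
Since the goal is the distance matrix rather than the clusters, one alternative (not what this notebook does) would be to compute the pairwise distances directly between the neighbourhood frequency vectors. The sketch below uses made-up frequencies; `freq` and its rows `N1` to `N3` are hypothetical stand-ins for the grouped frequency table, and the NumPy expression yields the same Euclidean distances that `euclidean_distances` would return for these rows.

```python
import numpy as np
import pandas as pd

# Hypothetical venue-category frequencies, one row per neighbourhood
freq = pd.DataFrame(
    {'Café': [0.5, 0.5, 0.0], 'Pizza Place': [0.5, 0.5, 0.0], 'Park': [0.0, 0.0, 1.0]},
    index=['N1', 'N2', 'N3'],
)

# Pairwise Euclidean distances between all rows via broadcasting
values = freq.to_numpy()
diff = values[:, None, :] - values[None, :, :]
direct = pd.DataFrame(np.linalg.norm(diff, axis=2), index=freq.index, columns=freq.index)

# Rank the other neighbourhoods by distance from N1 (smallest first)
ranked = direct['N1'].drop('N1').sort_values()
print(ranked)
```

With this approach, neighbourhoods with identical venue profiles simply end up at distance 0 from each other, so no cluster count has to be chosen.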

In [18]:
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

# set number of clusters
kclusters = 107

# use the columns= keyword; the positional axis argument was removed in pandas 2.0
toronto_rome_grouped_clustering = toronto_rome_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_rome_grouped_clustering)


dists = euclidean_distances(kmeans.cluster_centers_)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([ 28,  37,  44,   9,  52,  69,  43,   2, 104,  23], dtype=int32)

I want the cluster number per Neighbourhood

In [19]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_rome_merged = Rome_Toronto_Neigh

# join the cluster labels and top venues onto the table holding latitude/longitude for each neighbourhood
toronto_rome_merged = toronto_rome_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_rome_merged.head()

toronto_rome_cluter = toronto_rome_merged[['Neighbourhood','City', 'Cluster Labels']]

toronto_rome_cluter.head()

Unnamed: 0,Neighbourhood,City,Cluster Labels
0,"Adelaide,Richmond,King",Toronto,37.0
1,Berczy Park,Toronto,104.0
2,Business Reply Mail Processing Centre 969 Eastern,Toronto,62.0
3,Central Bay Street,Toronto,97.0
4,Christie,Toronto,57.0


Let's have a look at the distance matrix across clusters

In [20]:
#toronto_rome_merged.groupby('Cluster Labels').count()
dists[0]
pd.DataFrame(dists)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,97,98,99,100,101,102,103,104,105,106
0,0.000000,0.338937,1.016317,0.618789,0.238118,1.016317,0.702068,1.016317,1.016317,0.502892,...,0.148530,0.163611,0.160312,0.221477,0.352846,0.313847,0.162173,0.168099,0.128452,0.136748
1,0.338937,0.000000,1.062793,0.710302,0.255708,1.062793,0.586357,1.062793,1.062793,0.470496,...,0.382890,0.368955,0.363395,0.225824,0.122519,0.109128,0.335089,0.362676,0.321328,0.356014
2,1.016317,1.062793,0.000000,1.172604,1.031019,1.414214,1.224745,1.414214,1.414214,1.118034,...,1.023809,1.014361,1.013015,1.029554,1.068363,1.055178,1.012423,1.013304,1.018627,1.018037
3,0.618789,0.710302,1.172604,0.000000,0.654217,1.172604,0.935414,1.172604,1.172604,0.790569,...,0.636312,0.626229,0.629444,0.651417,0.711618,0.684398,0.616441,0.619620,0.642339,0.641405
4,0.238118,0.255708,1.031019,0.654217,0.000000,1.031019,0.694982,1.031019,1.031019,0.497996,...,0.268174,0.276827,0.254558,0.125490,0.277489,0.211187,0.243311,0.262814,0.234094,0.258457
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,0.313847,0.109128,1.055178,0.684398,0.211187,1.055178,0.611065,1.055178,1.055178,0.461952,...,0.359030,0.350761,0.343511,0.173141,0.140000,0.000000,0.318434,0.340965,0.301330,0.335857
103,0.162173,0.335089,1.012423,0.616441,0.243311,1.012423,0.703562,1.012423,1.012423,0.509902,...,0.191948,0.160287,0.128841,0.221895,0.357491,0.318434,0.000000,0.129560,0.139284,0.145602
104,0.168099,0.362676,1.013304,0.619620,0.262814,1.013304,0.713392,1.013304,1.013304,0.517549,...,0.195923,0.177040,0.172544,0.248378,0.378871,0.340965,0.129560,0.000000,0.151516,0.156889
105,0.128452,0.321328,1.018627,0.642339,0.234094,1.018627,0.691086,1.018627,1.018627,0.497594,...,0.153592,0.156987,0.156844,0.208150,0.337935,0.301330,0.139284,0.151516,0.000000,0.073485


I want to turn the cluster numbers into actual Neighbourhood names.  
Before I can do that, I need to group together the Neighbourhoods that are so similar to each other that they belong to the same cluster.
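
The grouping step can be sketched on a tiny made-up table. The labels `N1` to `N3` and their cluster numbers are hypothetical; the point is only to show how neighbourhoods sharing a cluster are merged into one comma-separated name.

```python
import pandas as pd

# Hypothetical labels: N1 and N2 landed in the same cluster
labels = pd.DataFrame({
    'Neighbourhood': ['N1', 'N2', 'N3'],
    'City': ['Rome', 'Rome', 'Toronto'],
    'Cluster Labels': [0, 0, 1],
})

# One row per cluster, with the distinct member values joined by commas
# (sorted here so that the order of the names is stable between runs)
per_cluster = labels.groupby('Cluster Labels').agg(lambda x: ','.join(sorted(set(x))))
print(per_cluster)
```

Note that a cluster containing neighbourhoods from both cities would get a combined City value such as `Rome,Toronto`, which an exact comparison against a single city name would not match.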

In [21]:
from difflib import SequenceMatcher

# reshape the distances into a DataFrame
Distances = pd.DataFrame(dists)

# drop neighbourhoods without a cluster label; reassigning (instead of inplace=True on a slice) avoids the SettingWithCopyWarning
toronto_rome_cluter = toronto_rome_cluter.dropna()
s = toronto_rome_cluter.sort_values('Cluster Labels')
s=s.groupby('Cluster Labels').agg(lambda x:','.join(set(x)))
s.reset_index(inplace=True)

#adding Neighbourhood to rows
Cluster_Matrix = Distances.join(s.set_index('Cluster Labels'), how = 'left')

cluster_neigh = s['Neighbourhood'].to_numpy()
c_n = cluster_neigh.tolist()

#create a list for columns
c_n.insert(len(c_n),'Neighbourhood')
c_n.insert(len(c_n),'City')

#renaming columns with Neighbourhood
Cluster_Matrix.columns = Cluster_Matrix.columns[:0].tolist() + c_n

Cluster_Matrix.head()


Unnamed: 0,"Harbourfront East,Toronto Islands,Union Station","Sallustiano,Campo Marzio",Aurelio (suburbio),Trastevere,"Ponte,Sant'Eustachio,Prati,Parione",Roselawn,"La Giustiniana,La Storta",Torricola,"Summerhill East,Moore Park","Casal Morena,Torre Maura,Capannelle,Appio Claudio",...,"Ryerson,Garden District",Monti,Castro Pretorio,Esquilino,St. James Town,Berczy Park,"Design Exchange,Toronto Dominion Centre","Underground city,First Canadian Place",Neighbourhood,City
0,0.0,0.338937,1.016317,0.618789,0.238118,1.016317,0.702068,1.016317,1.016317,0.502892,...,0.160312,0.221477,0.352846,0.313847,0.162173,0.168099,0.128452,0.136748,"Harbourfront East,Toronto Islands,Union Station",Toronto
1,0.338937,0.0,1.062793,0.710302,0.255708,1.062793,0.586357,1.062793,1.062793,0.470496,...,0.363395,0.225824,0.122519,0.109128,0.335089,0.362676,0.321328,0.356014,"Sallustiano,Campo Marzio",Rome
2,1.016317,1.062793,0.0,1.172604,1.031019,1.414214,1.224745,1.414214,1.414214,1.118034,...,1.013015,1.029554,1.068363,1.055178,1.012423,1.013304,1.018627,1.018037,Aurelio (suburbio),Rome
3,0.618789,0.710302,1.172604,0.0,0.654217,1.172604,0.935414,1.172604,1.172604,0.790569,...,0.629444,0.651417,0.711618,0.684398,0.616441,0.61962,0.642339,0.641405,Trastevere,Rome
4,0.238118,0.255708,1.031019,0.654217,0.0,1.031019,0.694982,1.031019,1.031019,0.497996,...,0.254558,0.12549,0.277489,0.211187,0.243311,0.262814,0.234094,0.258457,"Ponte,Sant'Eustachio,Prati,Parione",Rome


And now I can easily create a function that, given the name of a Neighbourhood (it doesn't have to be exact) and the city I'm interested in moving to (Rome or Toronto), sorts that city's Neighbourhoods from the least to the most distant from the one I entered
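
The inexact name lookup relies on `difflib.SequenceMatcher`, which scores how similar two strings are with a ratio between 0 and 1. A minimal sketch of that idea follows; the helper name `best_match` is hypothetical, and the candidate list is a small hand-picked subset of the column labels.

```python
from difflib import SequenceMatcher

# A few candidate labels, shaped like the merged cluster names
candidates = ['Torre Angela,Borghesiana', 'Monti', 'Trastevere']

def best_match(name, options):
    # Return the option with the highest similarity ratio to the free-text name
    return max(options, key=lambda option: SequenceMatcher(None, option, name).ratio())

match = best_match('Torre Angela', candidates)
print(match)
```

Because several neighbourhoods can share one comma-separated label, a short query such as a single neighbourhood name still resolves to the combined label that contains it.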

In [22]:
def closest_Neigh(name,town):

    print(name)
    result_array = np.array([])

    #since name is a free text, I am finding the closest name in my distance matrix
    for i in range(len(c_n)):
        result = SequenceMatcher(None, c_n[i], name).ratio()
        result_array = np.append(result_array, result)

    quartiere = c_n[np.argmax(result_array)] 

    list_q = Cluster_Matrix[[quartiere,'Neighbourhood','City']].sort_values(quartiere)
    list_q = list_q[list_q['City']==town]
    return list_q

Let's try an example. I'm from Torre Angela and I want to check which are the 5 closest Neighbourhoods in Toronto. I will start my house hunting from there!

In [23]:
closest_n = closest_Neigh('Torre Angela','Toronto')
closest_n.head(5)

Torre Angela


Unnamed: 0,"Torre Angela,Borghesiana",Neighbourhood,City
68,0.572334,"Dovercourt Village,Dufferin",Toronto
76,0.57336,Davisville,Toronto
63,0.573488,"Summerhill West,Forest Hill SE,Deer Park,South...",Toronto
37,0.579339,"Adelaide,Richmond,King",Toronto
92,0.581732,"Swansea,Runnymede",Toronto


Dovercourt Village,Dufferin seems to be the closest to my home

It also works the other way around. Let's say I have found a place in Central Bay Street and, since I only know Rome, I would like to know which Neighbourhood in Rome is the closest

In [74]:
closest_n = closest_Neigh('Central Bay Street','Rome')
closest_n.head(5)

Central Bay Street


Unnamed: 0,Central Bay Street,Neighbourhood,City
100,0.257668,Monti,Rome
86,0.262798,Tuscolano,Rome
4,0.268174,"Ponte,Sant'Eustachio,Prati,Parione",Rome
72,0.268279,San Saba,Rome
81,0.277145,Campitelli,Rome


### Thank you for reviewing!