# The Battle of Neighborhoods! Do you have to move? Don't panic!

# Data science as a tool for real estate rental agencies



# Table of Contents

* [Introduction to the Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)


# 1 Introduction to the Business Problem <a name="introduction"></a>

This section provides a description of the problem and a discussion of the background.

## 1.1 Background

Every year many people leave their home country for different reasons: some are starting a new job or a new business, others are going abroad for a semester of study, and others are moving for love.

All these people have something in common: they are looking for a place that is comparable to their current home.

In big cities such as Rome, Paris or London, which have a huge population of renters, it is common to use a real estate agent to find a rental property.

The main requests that real estate agencies receive from customers are:
+ to find a house in a neighborhood that is as similar as possible to the one they come from;
+ that the new neighborhood meets a list of requirements such as parks, traditional restaurants, and so on.

The aim of this work is to demonstrate how data science techniques can help real estate agencies find rental apartments that meet the needs of their customers.


## 1.2 Problem description

A family is moving from their hometown in Rome to Paris.
They ask a real estate agency to find an apartment for rent that is in a neighborhood similar to the one they are leaving and that has parks where they can walk their dog.

They would like to find a neighborhood with many restaurants and with several gyms to choose from for their training.
They would also like to have some grocery stores nearby, so they can buy the ingredients needed to cook Italian dishes.

In summary, the family would like to have the following venues nearby:

+ park;
+ gym;
+ restaurants & bars;
+ grocery store.

And the apartment should have:

+ a low price per m²;
+ a borough that is similar to the one they are currently living in.



# 2 Data  <a name="data"></a>

This section provides a description of the data and how it will be used to solve the problem.


## 2.1 Description of the Data
The following data will be used:

1. __Average cost of a rental house in Paris:__ This information is gathered from the webpage 'https://www.seloger.com/prix-de-l-immo/location/ile-de-france/paris.htm'. The dataset consists of the district number and the average monthly rental cost per square meter in that district.

2. __Annual burglaries in the boroughs of Paris:__ This information is gathered from the webpage 'https://www.bfmtv.com/societe/carte-delinquance-a-paris-quels-sont-les-arrondissements-ou-l-on-recense-le-plus-de-delits_AN-201910180103.html'. The dataset is composed of the district number and the number of annual burglaries in that district.

3. __Information about the venues in Paris neighborhoods:__ This information is gathered through the Foursquare API. The dataset contains Paris neighborhood information. It consists of the district number, the neighborhood name and all the venues that are present within a 750 meter radius of the neighborhood center.

4. __Information about the venues in the hometown neighborhood:__ This information is gathered through the Foursquare API. The dataset contains hometown neighborhood information. It consists of the district number, the neighborhood name and all the venues that are present within a 750 meter radius of the neighborhood center.

5. __The names of all Paris neighborhoods:__ This information is gathered from the webpage 'https://opendata.paris.fr/explore/dataset/quartier_paris'.

Not all the data is in the proper format, so it needs to be transformed.
The Geocoder Python package (https://geocoder.readthedocs.io/index.html) will be used to retrieve the latitude and longitude coordinates of all neighborhoods. The neighborhoods and their corresponding latitude and longitude will be used as input for Foursquare to get information about them.


## 2.2 How the data will be used to solve the problem

First we will analyze the distribution of venues in the Paris neighborhoods to find those neighborhoods that best suit the preferences of the family.

Next, we'll divide the neighborhoods of Paris into clusters to find the ones that are as similar as possible to the family's hometown neighborhood. One-hot encoding and k-means will be used for this purpose.

The last step is to use the average rental cost per square meter and the crime rate to create a ranking of neighborhoods that meet the customer's needs.


## 2.3 Data Preparation

Let's start by importing all the necessary Python libraries into our project.


In [1]:
import numpy as np
import pandas as pd
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### 2.3.1 Import  Paris boroughs dataset

Paris has 20 boroughs in total (called arrondissements in French), which are divided into 80 neighborhoods.

The dataset of Paris boroughs can be found at the following link:

https://opendata.paris.fr/explore/dataset/quartier_paris


In [2]:
df_neighbor = pd.read_csv("https://opendata.paris.fr/explore/dataset/quartier_paris/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B", sep=";") 
df_neighbor.head()

Unnamed: 0,N_SQ_QU,C_QU,C_QUINSEE,L_QU,C_AR,N_SQ_AR,PERIMETRE,SURFACE,Geometry X Y,Geometry
0,750000014,14,7510402,Saint-Gervais,4,750000004,2678.340923,422028.2,"48.8557186509,2.35816233385","{""type"": ""Polygon"", ""coordinates"": [[[2.363764..."
1,750000025,25,7510701,Saint-Thomas-d'Aquin,7,750000007,3827.253353,826559.4,"48.8552632694,2.32558765258","{""type"": ""Polygon"", ""coordinates"": [[[2.322133..."
2,750000038,38,7511002,Porte-Saint-Denis,10,750000010,2736.292954,472113.6,"48.873617661,2.35228289495","{""type"": ""Polygon"", ""coordinates"": [[[2.355344..."
3,750000001,1,7510101,Saint-Germain-l'Auxerrois,1,750000001,5057.549475,869000.7,"48.8606501352,2.33491032928","{""type"": ""Polygon"", ""coordinates"": [[[2.344593..."
4,750000073,73,7511901,Villette,19,750000019,5191.01883,1285705.0,"48.8876610888,2.37446821213","{""type"": ""Polygon"", ""coordinates"": [[[2.370498..."


We can remove the unnecessary columns.


In [3]:
# Drop useless columns

neighborhoods = df_neighbor.drop(['C_QU', 'C_QUINSEE', 'N_SQ_QU', 'N_SQ_AR', 'PERIMETRE', 'SURFACE', 'Geometry'], axis=1)

Next we add two columns, the latitude and longitude of the center point of each neighborhood, and populate them by splitting the values in "Geometry X Y".


In [4]:
neighborhoods[['Latitude', 'Longitude']] = neighborhoods['Geometry X Y'].str.split(',', n=1, expand=True)
neighborhoods.head()

Unnamed: 0,L_QU,C_AR,Geometry X Y,Latitude,Longitude
0,Saint-Gervais,4,"48.8557186509,2.35816233385",48.8557186509,2.35816233385
1,Saint-Thomas-d'Aquin,7,"48.8552632694,2.32558765258",48.8552632694,2.32558765258
2,Porte-Saint-Denis,10,"48.873617661,2.35228289495",48.873617661,2.35228289495
3,Saint-Germain-l'Auxerrois,1,"48.8606501352,2.33491032928",48.8606501352,2.33491032928
4,Villette,19,"48.8876610888,2.37446821213",48.8876610888,2.37446821213
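Note that `str.split` leaves the two new columns as strings. The sketch below, on a two-row toy frame that copies values from the table above, shows one way to cast them to floats in the same step. The variable names `toy` and `coords` are only illustrative, and the cast is an optional variation rather than part of the original workflow.

```python
import pandas as pd

# Toy frame mimicking the "Geometry X Y" column of the Paris dataset
toy = pd.DataFrame({
    'L_QU': ['Saint-Gervais', 'Villette'],
    'Geometry X Y': ['48.8557186509,2.35816233385', '48.8876610888,2.37446821213'],
})

# Split "lat,lon" into two columns, then cast the resulting strings to floats
coords = toy['Geometry X Y'].str.split(',', n=1, expand=True).astype(float)
toy[['Latitude', 'Longitude']] = coords

print(toy.dtypes)
```

Numeric columns make later arithmetic on the coordinates (for example, distance computations) possible without further conversion.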


Now we can drop the "Geometry X Y" column.


In [5]:
neighborhoods = neighborhoods.drop('Geometry X Y', axis=1)

Finally, we rename the columns.


In [6]:
# define the dataframe columns
column_names = {'C_AR':'Borough', 'L_QU':'Neighborhood'} 

neighborhoods.rename(columns=column_names, inplace=True)
neighborhoods.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Saint-Gervais,4,48.8557186509,2.35816233385
1,Saint-Thomas-d'Aquin,7,48.8552632694,2.32558765258
2,Porte-Saint-Denis,10,48.873617661,2.35228289495
3,Saint-Germain-l'Auxerrois,1,48.8606501352,2.33491032928
4,Villette,19,48.8876610888,2.37446821213


We check that the dataset contains all the needed data.


In [7]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 20 boroughs and 80 neighborhoods.


Using the geolocator we get the coordinates of Paris and we create a map centered on Paris showing the location of the neighborhoods.


In [8]:
address = 'Paris, France'

geolocator = Nominatim(user_agent="paris_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566969, 2.3514616.


In [11]:
# create map of Paris using latitude and longitude values obtained by geolocator
map_paris = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  
    
map_paris

In [12]:
paris_venues = pd.read_pickle('paris_neighborhood_2.plk')
paris_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Id,Venue Category
0,Quinze-Vingts,48.8469159441,2.37440162648,Promenade plantée – La Coulée Verte,48.847632,2.375107,4bf58dd8d48988d159941735,Trail
1,Quinze-Vingts,48.8469159441,2.37440162648,Les Embruns,48.8471,2.371883,52e81612bcbc57f1066b79f2,Creperie
2,Quinze-Vingts,48.8469159441,2.37440162648,Le Calbar,48.848702,2.375487,4bf58dd8d48988d11e941735,Cocktail Bar
3,Quinze-Vingts,48.8469159441,2.37440162648,Viaduc des Arts,48.848664,2.372931,4bf58dd8d48988d1df941735,Bridge
4,Quinze-Vingts,48.8469159441,2.37440162648,Rue Crémieux,48.847021,2.37111,52e81612bcbc57f1066b7a25,Pedestrian Plaza


### 2.3.2 Create Paris venues dataset

Using the Foursquare API we prepare and populate a dataset that will describe each district of Paris in terms of venues.

First we set the Foursquare credentials and version.


In [13]:
CLIENT_ID = '' # Foursquare ID has been removed
CLIENT_SECRET = '' # Foursquare Secret has been removed
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: [removed]
CLIENT_SECRET:[removed]


The following function is used to create a new dataset containing all the needed info about Paris venues.

For each venue we get its geographical position, its category (by name and by id) and its name.


In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['id'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Id',
                  'Venue Category']
    
    return(nearby_venues)
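The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue returned without any category. As a minimal sketch of a more defensive parsing step, the helper below (a hypothetical name, `extract_venue_rows`, not part of the original notebook) skips such venues. It is exercised on a hand-written payload that only assumes the same nested shape the function above relies on; the second sample venue is made up.

```python
def extract_venue_rows(items, name, lat, lng):
    """Turn the 'items' list of an explore response into flat tuples.

    Venues without any category are skipped instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        location = venue.get('location', {})
        if not categories:
            continue  # nothing to classify this venue by
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('id'),
            categories[0].get('name'),
        ))
    return rows

# Hand-written payload in the same shape the function above indexes into
sample_items = [
    {'venue': {'name': 'Le Calbar',
               'location': {'lat': 48.848702, 'lng': 2.375487},
               'categories': [{'id': '4bf58dd8d48988d11e941735', 'name': 'Cocktail Bar'}]}},
    {'venue': {'name': 'Uncategorised place',  # made-up venue with no category
               'location': {'lat': 48.85, 'lng': 2.37},
               'categories': []}},
]
rows = extract_venue_rows(sample_items, 'Quinze-Vingts', 48.8469159441, 2.37440162648)
print(rows)
```

The returned tuples have the same eight fields, in the same order, as the columns assigned to `nearby_venues`.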

In [12]:
paris_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Quinze-Vingts
Rochechouart
Bercy
Halles
Monnaie
Odéon
Champs-Elysées
Maison-Blanche
Croulebarbe
Vivienne
Enfants-Rouges
Saint-Germain-des-Prés
Saint-Vincent-de-Paul
Saint-Ambroise
Bel-Air
Montparnasse
Plaine de Monceaux
Saint-Victor
Madeleine
Saint-Fargeau
Porte-Dauphine
Grandes-Carrières
Saint-Merri
Notre-Dame
Gros-Caillou
Sainte-Avoie
Hôpital-Saint-Louis
Belleville
Ternes
Folie-Méricourt
Salpêtrière
Place-Vendôme
Combat
Charonne
Javel
Arsenal
Jardin-des-Plantes
Porte-Saint-Martin
Roquette
Picpus
Plaisance
Sorbonne
Saint-Georges
Chaussée-d'Antin
Palais-Royal
Ecole-Militaire
Grenelle
Auteuil
Saint-Gervais
Saint-Thomas-d'Aquin
Porte-Saint-Denis
Saint-Germain-l'Auxerrois
Villette
Val-de-Grâce
Necker
Père-Lachaise
La Chapelle
Notre-Dame-des-Champs
Petit-Montrouge
Pont-de-Flandre
Muette
Chaillot
Epinettes
Europe
Sainte-Marguerite
Parc-de-Montsouris
Saint-Lambert
Arts-et-Métiers
Archives
Faubourg-du-Roule
Mail
Bonne-Nouvelle
Gare
Clignancourt
Goutte-d'Or
Batignolles
Invalides
Faubourg-Montm

We save the dataset in case we have to use it again for another search.


In [13]:
paris_venues.to_pickle('paris_neighborhood_2.plk')

Let's take a look at the data.


In [15]:
print(paris_venues.shape)
paris_venues.head()

(5245, 8)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Id,Venue Category
0,Quinze-Vingts,48.8469159441,2.37440162648,Promenade plantée – La Coulée Verte,48.847632,2.375107,4bf58dd8d48988d159941735,Trail
1,Quinze-Vingts,48.8469159441,2.37440162648,Les Embruns,48.8471,2.371883,52e81612bcbc57f1066b79f2,Creperie
2,Quinze-Vingts,48.8469159441,2.37440162648,Le Calbar,48.848702,2.375487,4bf58dd8d48988d11e941735,Cocktail Bar
3,Quinze-Vingts,48.8469159441,2.37440162648,Viaduc des Arts,48.848664,2.372931,4bf58dd8d48988d1df941735,Bridge
4,Quinze-Vingts,48.8469159441,2.37440162648,Rue Crémieux,48.847021,2.37111,52e81612bcbc57f1066b7a25,Pedestrian Plaza


In [16]:
paris_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Id,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Amérique,12,12,12,12,12,12,12
Archives,100,100,100,100,100,100,100
Arsenal,72,72,72,72,72,72,72
Arts-et-Métiers,100,100,100,100,100,100,100
Auteuil,18,18,18,18,18,18,18
...,...,...,...,...,...,...,...
Sorbonne,100,100,100,100,100,100,100
Ternes,64,64,64,64,64,64,64
Val-de-Grâce,44,44,44,44,44,44,44
Villette,58,58,58,58,58,58,58


In [17]:
print('There are {} unique categories.'.format(len(paris_venues['Venue Category'].unique())))

There are 300 unique categories.


The paris_venues dataset contains 5245 venues divided into 300 categories.


### 2.3.3 Create datasets about family favorite places

Starting from the paris_venues dataset we create another one that contains only the family's favorite venues.

This dataset will be used to find all neighborhoods that meet the needs of the family.

We are interested in the main categories only, so we collapse the subcategories: for example, Italian Restaurant becomes Restaurant and Internet Café becomes Café.


In [18]:
# venues of customer interest
favorite_venues = paris_venues[(paris_venues['Venue Category'].str.match('Park')==True) | (paris_venues['Venue Category'].str.contains('Café')==True) |(paris_venues['Venue Category'].str.contains('Gym')==True) | (paris_venues['Venue Category'].str.contains('Restaurant')==True) | (paris_venues['Venue Category'].str.contains('Grocery')==True)].copy()

In [19]:
favorite_venues['Venue Category'].replace(to_replace ='.*Restaurant.*', value = 'Restaurant', regex = True, inplace=True)

In [20]:
favorite_venues['Venue Category'].replace(to_replace ='.*Gym.*', value = 'Gym', regex = True, inplace=True)

In [21]:
favorite_venues['Venue Category'].replace(to_replace ='.*Café.*', value = 'Café', regex = True, inplace=True)

In [22]:
favorite_venues['Venue Category'].replace(to_replace ='.*Grocery.*', value = 'Grocery', regex = True, inplace=True)

In [23]:
favorite_venues['Venue Category'].replace(to_replace ='.*Park.*', value = 'Park', regex = True, inplace=True)
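The five replacements above all follow the same pattern, so they can also be expressed as a single mapping function. The sketch below is one possible formulation, not the notebook's own code: `MAIN_CATEGORIES` and `collapse_category` are hypothetical names, and the sample category strings are only illustrative.

```python
MAIN_CATEGORIES = ['Restaurant', 'Gym', 'Café', 'Grocery', 'Park']

def collapse_category(category):
    """Return the first main category contained in the name, else the name unchanged."""
    for main in MAIN_CATEGORIES:
        if main in category:
            return main
    return category

# Illustrative category names, not taken from the dataset
examples = ['Italian Restaurant', 'Gym / Fitness Center', 'Café', 'Grocery Store', 'Park', 'Hotel']
print([collapse_category(c) for c in examples])
```

Applied to the dataframe, this would look like `favorite_venues['Venue Category'].map(collapse_category)`. Because the check is a plain substring test, a name such as "Parking" would also collapse to "Park", so the list of main categories may need tightening for other datasets.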

We create a map that represents the geographic distribution of favorite venues.


In [24]:
from folium.plugins import FastMarkerCluster

# create map of Paris using latitude and longitude values
favourite_map = folium.Map(location=[location.latitude, location.longitude], zoom_start=12)

# marker color for each favorite category; any other category is drawn in grey
category_colors = {
    'Park': 'green',
    'Café': 'orange',
    'Gym': 'yellow',
    'Restaurant': 'purple',
    'Grocery': 'blue',
}

def add_marker(row):
    # pick the color of the venue category and add a small circle at the venue position
    color = category_colors.get(row['Venue Category'], 'grey')
    marker = folium.CircleMarker([row['Venue Latitude'], row['Venue Longitude']], radius=2, color=color, popup=row['Venue Category'])
    marker.add_to(favourite_map)

favorite_venues.apply(add_marker, axis=1)
favourite_map

### 2.3.4 Create family hometown neighborhood dataset

Using the same steps as above we create a new dataset that describes the hometown neighborhood in terms of venues.

We start by getting the geographical coordinates of the hometown neighborhood.


In [25]:
address = 'San Paolo, Rome, Italy'

geolocator_rome = Nominatim(user_agent="rome_explorer")
location_rome = geolocator_rome.geocode(address)
latitude_rome = location_rome.latitude
longitude_rome = location_rome.longitude
print('The geographical coordinates of Rome are {}, {}.'.format(latitude_rome, longitude_rome))

The geographical coordinates of Rome are 41.8546357, 12.4799703.


Then we get the venue data from Foursquare.


In [26]:
rome_venues_list=[]

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude_rome, 
    longitude_rome, 
    750, 
    100)
            
# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']
        
# return only relevant information for each nearby venue
rome_venues_list.append([(
            'San Paolo', 
            latitude_rome, 
            longitude_rome, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['id'],
            v['venue']['categories'][0]['name']) for v in results])

rome_nearby_venues = pd.DataFrame([item for venue_list in rome_venues_list for item in venue_list])
rome_nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Id',
                  'Venue Category']

We quickly check the consistency of the data.


In [27]:
rome_nearby_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Id,Venue Category
0,San Paolo,41.854636,12.47997,Ilios,41.854703,12.478428,4bf58dd8d48988d10e941735,Greek Restaurant
1,San Paolo,41.854636,12.47997,Buskers Pub,41.852135,12.479969,4bf58dd8d48988d11b941735,Pub
2,San Paolo,41.854636,12.47997,Miami 3,41.851892,12.478228,4bf58dd8d48988d1c9941735,Ice Cream Shop
3,San Paolo,41.854636,12.47997,Bar San Paolo,41.85629,12.478663,4bf58dd8d48988d16d941735,Café
4,San Paolo,41.854636,12.47997,La Muffineria,41.853127,12.476754,4bf58dd8d48988d1bc941735,Cupcake Shop


### 2.3.5 Create average cost dataset

From 'https://www.seloger.com/prix-de-l-immo/location/ile-de-france/paris.htm' we create a simple table that contains the id of each borough and its average rental cost per square meter.


In [28]:
average_cost_list = [
        [1, 37.9],
        [2, 36.9],
        [3, 37.3],
        [4, 38.6],
        [5, 36.3],
        [6, 39.2],
        [7, 37.5],
        [8, 35.7],
        [9, 34.3],
        [10, 32.3],
        [11, 33.1],
        [12, 30.0],
        [13, 29.8],
        [14, 31.1],
        [15, 30.8],
        [16, 33.4],
        [17, 32.8],
        [18, 31.6],
        [19, 28.3],
        [20, 28.6],
        ]

df_average_cost = pd.DataFrame(data=average_cost_list, columns=['Borough', 'Cost'])

In [29]:
df_average_cost.head()

Unnamed: 0,Borough,Cost
0,1,37.9
1,2,36.9
2,3,37.3
3,4,38.6
4,5,36.3


### 2.3.6 Create burglary per year dataset

From https://www.bfmtv.com/societe/carte-delinquance-a-paris-quels-sont-les-arrondissements-ou-l-on-recense-le-plus-de-delits_AN-201910180103.html we create a simple table that contains the id of each borough and its number of burglaries per year.


In [30]:
burglary_year_list = [
        [1, 302],
        [2, 516],
        [3, 446],
        [4, 396],
        [5, 435],
        [6, 437],
        [7, 387],
        [8, 483],
        [9, 529],
        [10, 790],
        [11, 852],
        [12, 939],
        [13, 659],
        [14, 468],
        [15, 1025],
        [16, 958],
        [17, 986],
        [18, 1344],
        [19, 752],
        [20, 720],
        ]

df_burglary_year = pd.DataFrame(data=burglary_year_list, columns=['Borough', 'Burglary'])

In [31]:
df_burglary_year.head()

Unnamed: 0,Borough,Burglary
0,1,302
1,2,516
2,3,446
3,4,396
4,5,435
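The cost and burglary tables are meant to feed a ranking of candidate boroughs. The sketch below shows one possible way to combine them, assuming a min-max normalization of each column and an equal-weight average (both are assumptions made here for illustration, not the method stated in this document). It uses the values of four boroughs copied from the two lists above; `min_max`, `ranking` and `ranked` are hypothetical names.

```python
import pandas as pd

# Values copied from the two tables above for boroughs 1, 10, 16 and 17
cost = pd.DataFrame({'Borough': [1, 10, 16, 17], 'Cost': [37.9, 32.3, 33.4, 32.8]})
burglary = pd.DataFrame({'Borough': [1, 10, 16, 17], 'Burglary': [302, 790, 958, 986]})

# Join the two tables on the borough id
ranking = cost.merge(burglary, on='Borough')

def min_max(series):
    """Rescale a column to the [0, 1] range so both criteria are comparable."""
    return (series - series.min()) / (series.max() - series.min())

# Assumed scoring rule: equal-weight average of the two rescaled columns (lower is better)
ranking['Score'] = (min_max(ranking['Cost']) + min_max(ranking['Burglary'])) / 2
ranked = ranking.sort_values('Score').reset_index(drop=True)
print(ranked)
```

With these four boroughs and equal weights, the resulting order is 10, 1, 17, 16. Different weights would change the order, and the burglary figures are absolute counts rather than rates per inhabitant, which a more careful ranking might take into account.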


# 3 Methodology  <a name="methodology"></a>

This is the main part of the work.

We start by analyzing the Paris venues in order to find the list of neighborhoods that meet the family's requirements.



## 3.1 Neighborhoods that meet the family's requirements

We rearrange the favorite venues dataset into a table with one row per neighborhood and one column per venue category.


In [32]:
favorite_venues_grouped = favorite_venues.groupby('Neighborhood')['Venue Category'].value_counts().unstack().fillna(0)
favorite_venues_grouped.head()

Venue Category,Café,Grocery,Gym,Park,Restaurant
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Amérique,1.0,0.0,0.0,1.0,2.0
Archives,1.0,0.0,0.0,0.0,28.0
Arsenal,1.0,0.0,1.0,3.0,29.0
Arts-et-Métiers,1.0,1.0,0.0,1.0,40.0
Auteuil,0.0,0.0,0.0,0.0,1.0


Not all neighborhoods satisfy all of the family's needs, so we select only those that have at least one venue in every category.


In [33]:
favorite_venues_grouped[(favorite_venues_grouped[['Gym','Park', 'Café', 'Grocery', 'Restaurant']] != 0).all(axis=1)]

Venue Category,Café,Grocery,Gym,Park,Restaurant
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Batignolles,3.0,1.0,1.0,2.0,48.0
Hôpital-Saint-Louis,4.0,1.0,1.0,1.0,43.0
Palais-Royal,3.0,1.0,1.0,1.0,34.0
Porte-Dauphine,1.0,1.0,2.0,1.0,1.0


Only four neighborhoods meet all the needs of the family.


## 3.2 Neighborhoods similar to the one of the hometown

To find neighborhoods similar to the hometown one we use k-means clustering.
K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into groups of elements with similar characteristics.
In our case we want to group the neighborhoods according to the distribution of their venues.
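To make the idea concrete, here is a minimal sketch of k-means on a tiny, made-up table of venue frequencies. The neighborhood names `A` to `D`, the three venue types and the choice of two clusters are all assumptions chosen for illustration; they are not taken from the Paris data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up venue frequencies per neighborhood: [restaurant, park, hotel]
names = ['A', 'B', 'C', 'D']
X = np.array([
    [0.80, 0.10, 0.10],   # A: restaurant-heavy
    [0.75, 0.15, 0.10],   # B: restaurant-heavy
    [0.10, 0.80, 0.10],   # C: park-heavy
    [0.15, 0.75, 0.10],   # D: park-heavy
])

# Fit k-means with two clusters; random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = dict(zip(names, kmeans.labels_))
print(labels)
```

Neighborhoods with a similar venue mix end up with the same label. The label numbers themselves are arbitrary, so only the grouping is meaningful.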


### 3.2.1 Preparing data for clustering


We create a dataset that contains all the neighborhoods and venues of Paris together with the venues of the Rome neighborhood.


In [34]:
mixed_neighborhoods = neighborhoods

In [35]:
mixed_neighborhoods.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Saint-Gervais,4,48.8557186509,2.35816233385
1,Saint-Thomas-d'Aquin,7,48.8552632694,2.32558765258
2,Porte-Saint-Denis,10,48.873617661,2.35228289495
3,Saint-Germain-l'Auxerrois,1,48.8606501352,2.33491032928
4,Villette,19,48.8876610888,2.37446821213


We set Borough to 0 for the Rome neighborhood.


In [36]:
home_neighborhood = {'Neighborhood':'San Paolo', 'Borough':0, 'Latitude': latitude_rome, 'Longitude':longitude_rome}
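# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
mixed_neighborhoods = pd.concat([mixed_neighborhoods, pd.DataFrame([home_neighborhood])], ignore_index=True)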

In [37]:
mixed_venues = pd.concat([paris_venues, rome_nearby_venues])  # pd.concat replaces the removed DataFrame.append

In [38]:
mixed_venues.shape

(5305, 8)

In [39]:
mixed_neighborhoods.shape

(81, 4)

To apply the k-means clustering algorithm we have to transform the categorical variables into numeric ones.
The one-hot encoding technique will be used.
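As a small illustration of what this transformation produces, the sketch below one-hot encodes a made-up list of five venues in two hypothetical neighborhoods (`X` and `Y`) and averages the indicator columns per neighborhood. The result is the relative frequency of each category, which is the kind of numeric input k-means needs.

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods
toy_venues = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'Y', 'Y'],
    'Venue Category': ['Park', 'Café', 'Café', 'Gym', 'Park'],
})

# One indicator column per category (1.0 if the venue belongs to it, else 0.0)
onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot['Neighborhood'] = toy_venues['Neighborhood']

# The mean of each indicator column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In this toy table, two of the three venues in `X` are cafés, so its Café frequency is 2/3, while `Y` has no café at all.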


In [40]:
# one hot encoding
cluster_onehot = pd.get_dummies(mixed_venues[['Venue Category']], prefix="", prefix_sep="")
cluster_onehot.head()

Unnamed: 0,Accessories Store,Afghan Restaurant,African Restaurant,Alsatian Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
# add neighborhood column back to dataframe
cluster_onehot['Neighborhood'] = mixed_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [cluster_onehot.columns[-1]] + list(cluster_onehot.columns[:-1])
paris_onehot = cluster_onehot[fixed_columns]

cluster_onehot.head()

Unnamed: 0,Accessories Store,Afghan Restaurant,African Restaurant,Alsatian Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Venezuelan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit,Neighborhood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Quinze-Vingts
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Quinze-Vingts
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Quinze-Vingts
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Quinze-Vingts
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Quinze-Vingts


In [42]:
cluster_grouped = cluster_onehot.groupby('Neighborhood').mean().reset_index()
cluster_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Alsatian Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Amérique,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.000000,0.000000,0.00,0.00,0.0,0.0,0.0
1,Archives,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.04,...,0.000000,0.0,0.0,0.000000,0.000000,0.00,0.00,0.0,0.0,0.0
2,Arsenal,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.00,...,0.027778,0.0,0.0,0.000000,0.013889,0.00,0.00,0.0,0.0,0.0
3,Arts-et-Métiers,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.01,0.02,...,0.020000,0.0,0.0,0.030000,0.040000,0.02,0.00,0.0,0.0,0.0
4,Auteuil,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.000000,0.000000,0.00,0.00,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,Sorbonne,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.010000,0.020000,0.00,0.00,0.0,0.0,0.0
77,Ternes,0.0,0.0,0.0,0.0,0.015625,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.015625,0.015625,0.00,0.00,0.0,0.0,0.0
78,Val-de-Grâce,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.000000,0.022727,0.00,0.00,0.0,0.0,0.0
79,Villette,0.0,0.0,0.0,0.0,0.017241,0.00,0.0,0.00,0.00,...,0.000000,0.0,0.0,0.000000,0.000000,0.00,0.00,0.0,0.0,0.0


We explore the one-hot encoded dataset.

The top 5 venues per neighborhood:

In [43]:
num_top_venues = 5

for hood in cluster_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = cluster_grouped[cluster_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Amérique----
               venue  freq
0  French Restaurant  0.17
1              Plaza  0.17
2        Supermarket  0.17
3               Park  0.08
4            Theater  0.08


----Archives----
               venue  freq
0  French Restaurant  0.08
1     Clothing Store  0.05
2        Coffee Shop  0.05
3             Bistro  0.04
4              Hotel  0.04


----Arsenal----
               venue  freq
0  French Restaurant  0.18
1              Hotel  0.08
2               Park  0.04
3              Plaza  0.04
4   Tapas Restaurant  0.04


----Arts-et-Métiers----
                venue  freq
0   French Restaurant  0.13
1               Hotel  0.07
2        Cocktail Bar  0.05
3  Italian Restaurant  0.04
4            Wine Bar  0.04


----Auteuil----
               venue  freq
0       Tennis Court  0.28
1            Stadium  0.17
2             Garden  0.11
3  French Restaurant  0.06
4              Plaza  0.06


----Batignolles----
                venue  freq
0   French Restaurant  0.20
1       

                 venue  freq
0    French Restaurant  0.18
1                Hotel  0.10
2               Bakery  0.07
3  Japanese Restaurant  0.05
4               Bistro  0.04


----Odéon----
               venue  freq
0              Hotel  0.07
1  French Restaurant  0.07
2               Café  0.06
3             Bakery  0.04
4             Bistro  0.04


----Palais-Royal----
                 venue  freq
0  Japanese Restaurant  0.10
1                Hotel  0.06
2   Italian Restaurant  0.05
3                Plaza  0.05
4     Ramen Restaurant  0.05


----Parc-de-Montsouris----
                 venue  freq
0   Italian Restaurant  0.17
1  Japanese Restaurant  0.11
2                Hotel  0.06
3                 Café  0.06
4                 Park  0.06


----Petit-Montrouge----
                venue  freq
0               Hotel  0.16
1   French Restaurant  0.13
2  Italian Restaurant  0.07
3         Supermarket  0.06
4   Food & Drink Shop  0.04


----Picpus----
               venue  freq
0         

The top ten venues for each neighborhood are shown below in tabular form.


In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [45]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = cluster_grouped['Neighborhood']

for ind in np.arange(cluster_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(cluster_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amérique,Plaza,French Restaurant,Supermarket,Pool,Bed & Breakfast,Park,Café,Theater,Bistro,Zoo Exhibit
1,Archives,French Restaurant,Clothing Store,Coffee Shop,Bistro,Hotel,Art Gallery,Plaza,Bookstore,Burger Joint,Cocktail Bar
2,Arsenal,French Restaurant,Hotel,Plaza,Park,Tapas Restaurant,Boat or Ferry,Seafood Restaurant,Thai Restaurant,Cocktail Bar,Pedestrian Plaza
3,Arts-et-Métiers,French Restaurant,Hotel,Cocktail Bar,Italian Restaurant,Wine Bar,Bar,Vietnamese Restaurant,Restaurant,Chinese Restaurant,Coffee Shop
4,Auteuil,Tennis Court,Stadium,Garden,Outdoors & Recreation,French Restaurant,Racecourse,Sporting Goods Shop,Plaza,Museum,Botanical Garden
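
To see what `return_most_common_venues` does on a single row, here is a minimal sketch on a toy frame. The function body is copied from the cell above; the neighborhood name `A` and its three frequencies are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and rank the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical one-row frame: a name column followed by venue frequencies
toy = pd.DataFrame(
    {"Neighborhood": ["A"], "Café": [0.2], "Park": [0.5], "Hotel": [0.3]}
)

# The two most frequent categories, most frequent first
top = return_most_common_venues(toy.iloc[0, :], 2)
print(list(top))
```

The function returns column names rather than frequencies, which is why the table above contains venue categories only.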


In [46]:
neighborhoods_venues_sorted[neighborhoods_venues_sorted['Neighborhood'] == 'San Paolo']

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
75,San Paolo,Italian Restaurant,Café,Pizza Place,Ice Cream Shop,Park,Pub,Fast Food Restaurant,Asian Restaurant,Clothing Store,Bistro


I'm from Rome and I know the San Paolo neighborhood quite well. Since the district became the seat of the third university of Rome, many restaurants, pubs and fast food outlets have opened. The data we obtained from the Foursquare API are consistent with this picture of the venues in San Paolo.


### 3.2.2 Clustering

Everything is now ready for clustering. Let's see what happens.


In [47]:
# set number of clusters
kclusters = 7

cluster_grouped_clustering = cluster_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([5, 1, 6, 1, 4, 6, 3, 1, 1, 1], dtype=int32)
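
As a quick sanity check on the output above, the labels can be tallied to see how many neighborhoods fall into each cluster. The sketch below uses only the ten labels printed above; on the full array one would pass `kmeans.labels_` instead. The helper name `cluster_sizes` is hypothetical.

```python
from collections import Counter

def cluster_sizes(labels):
    """Return a dict mapping each cluster label to its number of members."""
    return dict(Counter(int(label) for label in labels))

# First ten labels printed above (a subset of the full kmeans.labels_ array)
first_ten_labels = [5, 1, 6, 1, 4, 6, 3, 1, 1, 1]

sizes = cluster_sizes(first_ten_labels)
for label in sorted(sizes):
    print(f"cluster {label}: {sizes[label]} neighborhood(s)")
```

A very uneven distribution (one large cluster and several tiny ones) would be a hint that the number of clusters deserves a second look.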

In [48]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cluster_merged = mixed_neighborhoods

# join the cluster labels and top venues onto mixed_neighborhoods, which holds latitude/longitude for each neighborhood
cluster_merged = cluster_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

cluster_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Saint-Gervais,4,48.8557186509,2.35816233385,1,French Restaurant,Clothing Store,Italian Restaurant,Hotel,Ice Cream Shop,Gay Bar,Thai Restaurant,Gourmet Shop,Pastry Shop,Bookstore
1,Saint-Thomas-d'Aquin,7,48.8552632694,2.32558765258,6,French Restaurant,Hotel,Café,Art Gallery,Coffee Shop,Italian Restaurant,American Restaurant,Historic Site,Sandwich Place,Tailor Shop
2,Porte-Saint-Denis,10,48.873617661,2.35228289495,1,Hotel,French Restaurant,Bakery,Bar,Bistro,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Indian Restaurant,Japanese Restaurant,Pizza Place
3,Saint-Germain-l'Auxerrois,1,48.8606501352,2.33491032928,6,French Restaurant,Hotel,Plaza,Coffee Shop,Art Museum,Historic Site,Bar,Italian Restaurant,Café,Cosmetics Shop
4,Villette,19,48.8876610888,2.37446821213,1,Hotel,Bar,French Restaurant,Café,Asian Restaurant,Food Truck,Fast Food Restaurant,Multiplex,Supermarket,Bistro


Let's show the geographic distribution of the clusters on a map.


In [49]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cluster_merged['Latitude'], cluster_merged['Longitude'], cluster_merged['Neighborhood'], cluster_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [50]:
cluster_merged.loc[cluster_merged['Borough'] == 0]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
80,San Paolo,0,41.8546,12.48,1,Italian Restaurant,Café,Pizza Place,Ice Cream Shop,Park,Pub,Fast Food Restaurant,Asian Restaurant,Clothing Store,Bistro


The hometown neighborhood belongs to cluster 1.

Here is the list of the Paris neighborhoods in cluster 1.


In [51]:
cluster_merged.loc[cluster_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Saint-Gervais,4,48.8557186509,2.35816233385,1,French Restaurant,Clothing Store,Italian Restaurant,Hotel,Ice Cream Shop,Gay Bar,Thai Restaurant,Gourmet Shop,Pastry Shop,Bookstore
2,Porte-Saint-Denis,10,48.873617661,2.35228289495,1,Hotel,French Restaurant,Bakery,Bar,Bistro,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Indian Restaurant,Japanese Restaurant,Pizza Place
4,Villette,19,48.8876610888,2.37446821213,1,Hotel,Bar,French Restaurant,Café,Asian Restaurant,Food Truck,Fast Food Restaurant,Multiplex,Supermarket,Bistro
5,Quinze-Vingts,12,48.8469159441,2.37440162648,1,French Restaurant,Coffee Shop,Sandwich Place,Hotel,Bakery,Bar,Farmers Market,Train Station,Cocktail Bar,Italian Restaurant
7,Bercy,12,48.8352090499,2.38621008421,1,Hotel,Italian Restaurant,Bus Stop,Bakery,Gym / Fitness Center,French Restaurant,Wine Bar,Plaza,Museum,Supermarket
8,Halles,1,48.8622891081,2.34489885831,1,French Restaurant,Bar,Bakery,Italian Restaurant,Chinese Restaurant,Pub,Café,Sandwich Place,Furniture / Home Store,Restaurant
9,Monnaie,6,48.8543844036,2.34003537113,1,French Restaurant,Plaza,Cocktail Bar,Ice Cream Shop,Bistro,Historic Site,Hotel,Creperie,Pub,Bookstore
10,Odéon,6,48.8478006293,2.33633882759,1,Hotel,French Restaurant,Café,Plaza,Bistro,Bakery,Italian Restaurant,Athletics & Sports,Pub,Cocktail Bar
13,Croulebarbe,13,48.8337336761,2.34767304607,1,French Restaurant,Sushi Restaurant,Bar,Park,Bakery,Sandwich Place,Italian Restaurant,Hotel,Cocktail Bar,Fast Food Restaurant
14,Vivienne,2,48.8691001998,2.33946074375,1,Japanese Restaurant,French Restaurant,Bistro,Wine Bar,Coffee Shop,Hotel,Bookstore,Korean Restaurant,Salad Place,Café


Recall the neighborhoods that satisfy the family's needs:


In [52]:
favorite_venues_grouped[(favorite_venues_grouped[['Gym','Park', 'Café', 'Grocery', 'Restaurant']] != 0).all(axis=1)]

Neighborhood,Café,Grocery,Gym,Park,Restaurant
Batignolles,3.0,1.0,1.0,2.0,48.0
Hôpital-Saint-Louis,4.0,1.0,1.0,1.0,43.0
Palais-Royal,3.0,1.0,1.0,1.0,34.0
Porte-Dauphine,1.0,1.0,2.0,1.0,1.0
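
The filter in the cell above keeps only the rows where every requested category has a non-zero count. A minimal sketch of the same idea, using two rows from the output above plus one hypothetical row (`Example-Quarter`, which lacks a gym) to show what gets dropped:

```python
import pandas as pd

# Two rows taken from the output above, plus one hypothetical row
# ("Example-Quarter") that has no gym.
counts = pd.DataFrame(
    {
        "Café": [3.0, 1.0, 2.0],
        "Grocery": [1.0, 1.0, 1.0],
        "Gym": [1.0, 2.0, 0.0],
        "Park": [2.0, 1.0, 4.0],
        "Restaurant": [48.0, 1.0, 10.0],
    },
    index=["Batignolles", "Porte-Dauphine", "Example-Quarter"],
)

required = ["Gym", "Park", "Café", "Grocery", "Restaurant"]

# (counts[required] != 0) is a boolean frame; .all(axis=1) is True only for
# rows where every required category appears at least once.
has_everything = (counts[required] != 0).all(axis=1)
print(counts[has_everything].index.tolist())
```

Note that the filter only checks presence, not quantity: Porte-Dauphine passes with a single restaurant.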


## 3.3 Average cost and burglary rate

Combining the venue filter above with the clustering result (the neighborhoods in the same cluster as San Paolo), we have identified two neighborhoods that meet all customer requirements:

+ Hôpital-Saint-Louis
+ Palais-Royal
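
This selection amounts to a set intersection between the neighborhoods that offer every requested venue type and the neighborhoods in the hometown's cluster. A minimal sketch, assuming a partial membership list for cluster 1 (only the rows displayed earlier plus the two neighborhoods named here; the full cluster is larger) and a hypothetical helper `shortlist`:

```python
def shortlist(meets_requirements, same_cluster):
    """Return, in sorted order, the neighborhoods present in both collections."""
    return sorted(set(meets_requirements) & set(same_cluster))

# Neighborhoods with at least one gym, park, café, grocery and restaurant
meets_requirements = [
    "Batignolles", "Hôpital-Saint-Louis", "Palais-Royal", "Porte-Dauphine",
]

# Partial membership of cluster 1 (assumption: not the complete list)
cluster_1_partial = [
    "Saint-Gervais", "Porte-Saint-Denis", "Villette", "Quinze-Vingts", "Bercy",
    "Halles", "Monnaie", "Odéon", "Croulebarbe", "Vivienne",
    "Hôpital-Saint-Louis", "Palais-Royal",
]

candidates = shortlist(meets_requirements, cluster_1_partial)
print(candidates)
```

In the notebook itself the same result could be obtained by filtering `cluster_merged` on the index of the filtered `favorite_venues_grouped` frame.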

Now let's look at the average rental cost and the burglary figures for these two neighborhoods. Both datasets are indexed by borough, so we first look up the borough of each neighborhood.


In [55]:
neighborhoods.loc[(neighborhoods['Neighborhood'] == 'Hôpital-Saint-Louis') | (neighborhoods['Neighborhood'] == 'Palais-Royal')]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
34,Hôpital-Saint-Louis,10,48.87600829,2.36812301789
52,Palais-Royal,1,48.8646599781,2.33630891897


In [58]:
df_average_cost.loc[(df_average_cost['Borough'] == 1) | (df_average_cost['Borough'] == 10)]

Unnamed: 0,Borough,Cost
0,1,37.9
9,10,32.3


In [59]:
df_burglary_year.loc[(df_burglary_year['Borough'] == 1) | (df_burglary_year['Borough'] == 10)]

Unnamed: 0,Borough,Burglary
0,1,302
9,10,790


# 4 Results <a name="results"></a>

We found four neighborhoods that have all the features the customer requested.
Using the k-means clustering algorithm, we found 38 neighborhoods that are similar to the customer's hometown neighborhood.
The intersection of these two results contains only two neighborhoods.

Using the cost and burglary data, we can summarize the result in the following table:

| Neighborhood        | Cost per sqm (euros) | Burglary Rate |
| :---                |        :----:        |          ---: |
| Hôpital-Saint-Louis |         32.3         |           790 |
| Palais-Royal        |         37.9         |           302 |


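For a 100-square-meter apartment, the difference in rent is 560 euros ((37.9 - 32.3) × 100): Palais-Royal is the more expensive option, but its burglary figure is less than half that of Hôpital-Saint-Louis (302 versus 790).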
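
The comparison can be reproduced directly from the values in the table. The sketch below is only arithmetic on those four numbers; the variable names are illustrative.

```python
# Figures from the summary table: cost per square meter and burglary figure
cost_per_sqm = {"Hôpital-Saint-Louis": 32.3, "Palais-Royal": 37.9}
burglary = {"Hôpital-Saint-Louis": 790, "Palais-Royal": 302}

area_sqm = 100  # apartment size used in the comparison

# Extra rent for Palais-Royal, rounded to absorb floating-point noise
rent_difference = round(
    (cost_per_sqm["Palais-Royal"] - cost_per_sqm["Hôpital-Saint-Louis"]) * area_sqm,
    2,
)

# Burglary figure of Palais-Royal relative to Hôpital-Saint-Louis
burglary_ratio = burglary["Palais-Royal"] / burglary["Hôpital-Saint-Louis"]

print(f"Rent difference for {area_sqm} sqm: {rent_difference} euros")
print(f"Burglary ratio (Palais-Royal / Hôpital-Saint-Louis): {burglary_ratio:.2f}")
```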

In any case, we leave the choice to the customer.

# 5 Discussion <a name="discussion"></a>

We used k-means, one of the simplest clustering algorithms. Other clustering algorithms could be tried to find which one is best suited to this type of problem.
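
One possible way to compare clustering algorithms is to score their results with the silhouette coefficient. The sketch below is an illustration under stated assumptions, not part of the analysis above: it uses synthetic two-dimensional data as a stand-in for the venue-frequency matrix, fixes the number of clusters at 3, and compares scikit-learn's `KMeans` with `AgglomerativeClustering`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the venue-frequency matrix: three separated groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([center + rng.normal(scale=0.3, size=(20, 2)) for center in centers])

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
scores = {
    name: silhouette_score(X, model.fit_predict(X))
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

On the real data one would replace `X` with `cluster_grouped_clustering` and could also loop over several values of `n_clusters` to check whether 7 is a reasonable choice.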

Moreover, given a history of customer requests, one could build user profiles to feed a recommendation system.


# 6 Conclusion <a name="conclusion"></a>

The aim of this project was to identify a neighborhood that is similar to the client's current one and that also has the venues that matter to the client.

We have shown that data science methods can be used to solve this type of problem.

As a future development, recommendation systems could be investigated to provide further guidance on which apartment to rent.