Importing relevant packages

In [5]:
import pandas as pd
import locale
import plotly.express as px

Importing the database

In [6]:
data = pd.read_csv('listings.csv', low_memory = False)

First, let's write a function that converts the columns holding prices, which are stored in money format, into numeric values

In [None]:
def money_to_numeric(array, locale):
    """
    Convert an array of money-formatted values to numeric values.
    
    Input:
    array         - An array of values in money format (e.g. with a leading "$")
    locale        - The locale module, already configured with setlocale
    
    Output:
    numeric_array - A list with the values converted to float
    """
    
    # Drop the currency symbol, then parse the number using the locale's separators
    numeric_array = [locale.atof(str(element).strip("$")) for element in array]
    
    return numeric_array

and apply it to the price-related columns

In [None]:
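# Set the locale so that locale.atof knows which thousands and decimal separators to expect
locale.setlocale(locale.LC_ALL, 'es_MX.UTF8')
# Columns stored in money format
array = ['price','weekly_price','monthly_price','security_deposit','cleaning_fee','extra_people']
data[array] = data[array].apply(lambda col: money_to_numeric(col, locale))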
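
The conversion above depends on the `es_MX.UTF8` locale being installed on the machine. As a rough alternative, here is a minimal sketch that strips the currency symbol and thousands separators with pandas string methods instead. It assumes the values look like `$1,250.00`; the helper name `money_to_numeric_series` and the toy values are hypothetical and not taken from `listings.csv`.

```python
import pandas as pd

def money_to_numeric_series(series):
    """Convert strings such as '$1,250.00' to floats; unparseable values become NaN."""
    # Cast to str so missing values do not break the .str accessor
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything that is not a number (e.g. "None") into NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"price": ["$1,250.00", "$980.00", None]})
toy["price"] = money_to_numeric_series(toy["price"])
print(toy["price"].tolist())
```

This version assumes "," is always a thousands separator and "." the decimal mark, so it would not suit data that uses the opposite convention.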

The questions addressed in this notebook are:
1. Which neighbourhoods are the most expensive?
2. Which neighbourhoods have the most super hosts?
3. Is there a relationship between super hosts and price in the neighbourhoods?
4. Why are these neighbourhoods expensive?

### Which neighbourhoods are more expensive?

In [26]:
top_5_neighbourhoods = data.groupby('neighbourhood_cleansed')                       # Grouping data by neighbourhood
top_5_neighbourhoods = top_5_neighbourhoods.agg({'price': 'mean'})                  # Calculating average price
top_5_neighbourhoods = top_5_neighbourhoods.sort_values('price', ascending = False) # Sorting from largest to smallest
top_5_neighbourhoods = top_5_neighbourhoods.head(5)
top_5_neighbourhoods

neighbourhood_cleansed,price
Cuajimalpa de Morelos,3131.604288
Iztapalapa,2689.812102
Xochimilco,2441.907801
Miguel Hidalgo,2021.49655
Cuauhtémoc,1655.697666


It is easy to see that, on average, Cuajimalpa de Morelos is the most expensive neighbourhood, followed by Iztapalapa and Xochimilco. The catch is that these are averages, and an average can be inflated by the presence of outliers.
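
To illustrate how a single outlier can distort the ranking, here is a small sketch comparing the mean with the median, which is less sensitive to extreme values. The toy data is hypothetical; only the column names `neighbourhood_cleansed` and `price` come from the notebook.

```python
import pandas as pd

# Toy listings (hypothetical values): neighbourhood "A" has one extreme price
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "price": [500.0, 600.0, 50000.0, 1500.0, 1600.0, 1700.0],
})

# Compute both statistics per neighbourhood
summary = toy.groupby("neighbourhood_cleansed")["price"].agg(["mean", "median"])
print(summary.sort_values("median", ascending=False))
```

In this toy case "A" ranks first by mean but second by median, which is the kind of discrepancy an outlier can produce.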

To account for that, it is a good idea to look at a box plot.

Note that, for visualization purposes, the plot omits data points with prices above $10,000

In [30]:
# First, get the rows from the 5 neighbourhoods
filtered_neighbourhoods = top_5_neighbourhoods.index
# Keep only listings from those neighbourhoods with prices at or below $10,000
filtered_data = data.loc[(data.neighbourhood_cleansed.isin(filtered_neighbourhoods)) & (data.price <= 10000)]
# Box plot
fig1 = px.box(filtered_data, 'neighbourhood_cleansed', 'price', points = 'all')
fig1.show()

The box plot shows that Cuauhtémoc and Miguel Hidalgo have more high-priced properties than the other three. A box plot with all neighbourhoods may help clarify the question

In [29]:
# Filter data to only take prices below $10,000
filtered_data = data.loc[data.price <=10000]
# Box plot
fig1 = px.box(filtered_data, 'neighbourhood_cleansed', 'price', points = 'all')
fig1.show()

The neighbourhoods with the most properties at the high end of the price scale are Cuauhtémoc and Miguel Hidalgo, followed by Coyoacán, Benito Juárez and Álvaro Obregón

### Which neighbourhoods have more super hosts?

In [24]:
top_by_super_hosts = data[data.host_is_superhost == 't']
top_by_super_hosts = top_by_super_hosts.groupby('neighbourhood_cleansed')    # Grouping by neighbourhood
top_by_super_hosts = top_by_super_hosts.agg({'id': 'count'})                 # Counting the listings with a super host
top_by_super_hosts = top_by_super_hosts.sort_values('id', ascending = False) # Sorting from largest to smallest
top_by_super_hosts.head(5)

neighbourhood_cleansed,id
Cuauhtémoc,3353
Miguel Hidalgo,1170
Benito Juárez,1077
Coyoacán,637
Álvaro Obregón,253


Cuauhtémoc, Miguel Hidalgo and Benito Juárez are the neighbourhoods with the most super hosts, followed by Coyoacán and Álvaro Obregón. It is interesting that the neighbourhoods with the most high-priced properties are also the ones with the most super hosts.
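
Since the count above is a count of listings whose host is a super host, neighbourhoods with many listings will naturally rank high. A possible complement, sketched here on hypothetical toy data, is to also compute the share of listings with a super host. The column names follow the notebook; the output names `superhost_listings` and `superhost_share` are made up for this example.

```python
import pandas as pd

# Toy listings (hypothetical values): 't' / 'f' flags as in host_is_superhost
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B"],
    "host_is_superhost": ["t", "t", "f", "f", "t", "t"],
})

# Boolean flag: True where the listing's host is a super host
is_super = toy["host_is_superhost"].eq("t")

# Sum gives the count of super host listings, mean gives their share
stats = is_super.groupby(toy["neighbourhood_cleansed"]).agg(["sum", "mean"])
stats = stats.rename(columns={"sum": "superhost_listings", "mean": "superhost_share"})
print(stats)
```

In the toy data both neighbourhoods have the same count, but very different shares, so the two measures can tell different stories.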

### Is there a relationship between super hosts and price in the neighbourhoods?

In [13]:
# Filter data to only take info from superhosts and prices below $10,000
filtered_data = data.loc[(data.host_is_superhost == 't') & (data.price <= 10000)]
# Box plot
fig = px.box(filtered_data, 'neighbourhood_cleansed', 'price', points = 'all')
fig.show()

The box plot makes it easy to see that Cuauhtémoc and Miguel Hidalgo have the most high-priced properties with super hosts, followed by Coyoacán
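
The plot only covers listings with super hosts, so it does not compare them with the rest. One way to make that comparison explicit, sketched here on hypothetical toy data, is to tabulate the median price by neighbourhood and super host status. The `difference` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# Toy listings (hypothetical values), using the notebook's column names
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
    "price": [1200.0, 1400.0, 900.0, 1100.0, 700.0, 900.0, 750.0, 850.0],
})

# Median price per neighbourhood (rows) and super host flag (columns)
median_price = toy.pivot_table(
    index="neighbourhood_cleansed",
    columns="host_is_superhost",
    values="price",
    aggfunc="median",
)

# Positive values mean super host listings are pricier in that neighbourhood
median_price["difference"] = median_price["t"] - median_price["f"]
print(median_price)
```

A difference close to zero across neighbourhoods would suggest that location, rather than super host status, drives the prices.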

### Why are these neighbourhoods expensive?

Cuauhtémoc is essentially the tourist center of Mexico City, with attractions such as Palacio de Bellas Artes, Museo del Templo Mayor and Catedral Metropolitana, and it is close to many more

Miguel Hidalgo is next to Cuauhtémoc, and some parts of Bosque de Chapultepec and Polanco are close by

Coyoacán is not close to the other two, but it has various attractions within walking distance

The three are among the most highly valued neighbourhoods in the city. That, together with their closeness to tourist attractions as mentioned before, explains the prices of their properties