## Airbnb data for Mexico City

This notebook analyses Airbnb listing data for Mexico City. The dataset can be found [here](http://insideairbnb.com/get-the-data.html).

Mexico City is one of the biggest cities in the world and, as such, it has a wide variety of attractions, neighbourhoods and, of course, Airbnb listing prices.

This notebook is divided into 3 stages:
* Questions we want to answer
* Analysis of data
* Model to predict prices

In [5]:
import pandas as pd
import locale
import plotly.express as px

In [6]:
data = pd.read_csv('listings.csv', low_memory = False)

In [None]:
def money_to_numeric(array, locale):
    """
    Convert a money-formatted array (e.g. "$1,234.00") to a numeric array
    
    Input:
    array         - An array of money-formatted values
    locale        - The locale module, with the locale already set via setlocale
    
    Output:
    numeric_array - A list with the values converted to floats
    """
    
    numeric_array = [locale.atof(str(element).strip("$")) for element in array]
    
    return numeric_array

In [None]:
locale.setlocale(locale.LC_ALL, 'es_MX.UTF8')
array = ['price','weekly_price','monthly_price','security_deposit','cleaning_fee','extra_people']
data[array] = data[array].apply(lambda col: money_to_numeric(col, locale))
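The conversion above depends on the `es_MX.UTF8` locale being installed on the machine. As a hedged alternative, here is a minimal sketch that does not use `locale` at all. It assumes the price strings use a `$` prefix, `,` as the thousands separator and `.` as the decimal separator; the helper name `money_series_to_numeric` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def money_series_to_numeric(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,234.50' to floats (hypothetical helper).

    Assumes '$' prefix, ',' thousands separator and '.' decimal separator.
    Values that cannot be parsed (including missing ones) become NaN.
    """
    cleaned = (
        series.astype(str)                      # missing values become text such as 'None' or 'nan'
        .str.replace(r"[$,]", "", regex=True)   # drop currency symbol and thousands separators
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample values, for illustration only
sample = pd.Series(["$1,234.50", "$80.00", None])
converted = money_series_to_numeric(sample)
print(converted)
```

Applied to the notebook's data, this could be used as `data[array] = data[array].apply(money_series_to_numeric)`, provided the format assumption holds for every money column.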

Question 1. Which neighbourhood is the most expensive?

In [26]:
top_5_neighbourhoods = data.groupby('neighbourhood_cleansed')                       # Grouping data by neighbourhood
top_5_neighbourhoods = top_5_neighbourhoods.agg({'price': 'mean'})                  # Calculating average price
top_5_neighbourhoods = top_5_neighbourhoods.sort_values('price', ascending = False) # Sorting from largest to smallest
top_5_neighbourhoods = top_5_neighbourhoods.head(5)
top_5_neighbourhoods

neighbourhood_cleansed,price
Cuajimalpa de Morelos,3131.604288
Iztapalapa,2689.812102
Xochimilco,2441.907801
Miguel Hidalgo,2021.49655
Cuauhtémoc,1655.697666
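The ranking above uses the mean, which a few very expensive listings can pull upwards. The sketch below shows one way to cross-check such a ranking with the median and the listing count. It runs on a made-up toy frame (neighbourhoods `A` and `B` and their prices are hypothetical), not on the real dataset.

```python
import pandas as pd

# Hypothetical toy data: neighbourhood A contains one extreme price
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "price": [500.0, 600.0, 50000.0, 900.0, 1000.0, 1100.0],
})

summary = (
    toy.groupby("neighbourhood_cleansed")["price"]
    .agg(["mean", "median", "count"])           # mean, median and number of listings
    .sort_values("median", ascending=False)     # rank by the outlier-robust statistic
)
print(summary)
```

In this toy example, `A` has the higher mean because of a single listing, while `B` has the higher median. Running the same aggregation on `data` would show whether the top 5 changes when the median is used.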


In [22]:
# First, get the rows from the 5 neighbourhoods
filtered_neighbourhoods = top_5_neighbourhoods.index
# Filter data to only take info from the neighbourhoods and prices below $10,000
filtered_data = data.loc[(data.neighbourhood_cleansed.isin(filtered_neighbourhoods)) & (data.price <= 10000)]
# Box plot
fig1 = px.box(filtered_data, 'neighbourhood_cleansed', 'price', points = 'all')
fig1.show()

Question 2. Which neighbourhood has the most superhosts?

In [24]:
top_by_super_hosts = data[data.host_is_superhost == 't']
top_by_super_hosts = top_by_super_hosts.groupby('neighbourhood_cleansed')    # Grouping by neighbourhood
top_by_super_hosts = top_by_super_hosts.agg({'id': 'count'})                 # Counting listings whose host is a superhost
top_by_super_hosts = top_by_super_hosts.sort_values('id', ascending = False) # Sorting from largest to smallest
top_by_super_hosts.head(5)

neighbourhood_cleansed,id
Cuauhtémoc,3353
Miguel Hidalgo,1170
Benito Juárez,1077
Coyoacán,637
Álvaro Obregón,253


Question 3. Is there a relationship between superhosts and price in the neighbourhoods?

In [13]:
# Filter data to only take info from superhosts and prices below $10,000
filtered_data = data.loc[(data.host_is_superhost == 't') & (data.price <= 10000)]
# Box plot
fig = px.box(filtered_data, 'neighbourhood_cleansed', 'price', points = 'all')
fig.show()
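The box plot above shows superhost listings only, so it cannot contrast them with other listings. One possible way to make the comparison explicit is a side-by-side table of median prices per neighbourhood, sketched below. The toy frame, the neighbourhood names `A` and `B`, and the output column names are hypothetical; only the column names `neighbourhood_cleansed`, `host_is_superhost` and `price` and the `'t'`/`'f'` encoding follow the notebook.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
    "price": [800.0, 1000.0, 600.0, 700.0, 400.0, 500.0, 450.0, 550.0],
})

comparison = (
    toy[toy.price <= 10000]                                   # same price cap as the plots
    .groupby(["neighbourhood_cleansed", "host_is_superhost"])["price"]
    .median()
    .unstack("host_is_superhost")                             # one column per superhost flag
    .rename(columns={"t": "superhost", "f": "non_superhost"})
)
# Positive values mean superhost listings are priced higher in that neighbourhood
comparison["difference"] = comparison["superhost"] - comparison["non_superhost"]
print(comparison)
```

A table like this only describes the difference in medians; it does not by itself establish that superhost status causes a price difference.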