# Seattle and Boston Airbnb Open Data
## Exploratory Data Analysis

**Author:** Paola Rocha  
**Date:** March 3rd, 2024

**Description**  
The objective of this notebook is to answer business questions using the [Boston](https://www.kaggle.com/datasets/airbnb/boston?select=calendar.csv) and [Seattle](https://www.kaggle.com/datasets/airbnb/seattle?resource=download) Airbnb Open Data. 

This notebook focuses on answering the following questions for each city:
1. Where are the superhosts located?
    - Use the listings dataset with longitude and latitude.
    - Use price with a color scale to show the most expensive superhosts.
2. Which qualities of the room matter most for becoming a superhost?
3. What kind of reviews do superhosts receive?
    - Use NLP to describe the reviews.
4. Predict average superhost prices for the next season.
    - Is there a difference in trends for hosts who are not superhosts?

**Notebook contents**
1. Libraries
2. Dataset description
3. Data acquisition
4. Cleaning data
5. Saving data

In [None]:
# Processing data
import pandas as pd

# Visualization
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# Reading data
seatle_listings = pd.read_csv('../data/processed/Seatle/listings.csv', dtype={'host_is_superhost': bool})
boston_listings = pd.read_csv('../data/processed/Boston/listings.csv', dtype={'host_is_superhost': bool})
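
The `dtype` argument above matters because later cells filter on `host_is_superhost` as a boolean mask. A minimal sketch of that behavior, assuming the processed CSV stores the flag as `True`/`False` text (the two-row sample below is made up and only stands in for `listings.csv`):

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for listings.csv (values are made up)
sample_csv = io.StringIO("id,host_is_superhost\n1,True\n2,False\n")
sample = pd.read_csv(sample_csv, dtype={'host_is_superhost': bool})

# The flag is parsed as a real boolean column, so it works directly as a mask
superhost_ids = sample.loc[sample['host_is_superhost'], 'id'].tolist()
print(sample.dtypes['host_is_superhost'], superhost_ids)
```

pandas cannot cast missing values to a plain `bool` column, so this approach relies on the cleaning step having already filled or dropped rows where the flag is empty.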

## 1. Where are the superhosts mainly located?

### Host locations

In [None]:
def pivot_listings(df_listings):
    """Group listings by neighbourhood and host type ('host' or 'superhost')
    and compute the listing count and the average price for each group.

    Args:
        df_listings (pd.DataFrame): Listings dataframe

    Returns:
        pd.DataFrame: One row per neighbourhood, with the listing count and average price for each host type.
    """
    # Grouping by host and neighbourhood and getting the average price
    df_neighbourhood = df_listings.groupby(by=['host_is_superhost', 'neighbourhood']).agg({'price':'mean', 'id':'count'}).reset_index()
    
    # Separation by host type
    neighbour_host = df_neighbourhood[df_neighbourhood['host_is_superhost'] == False].drop('host_is_superhost', axis=1)
    neighbour_superhost = df_neighbourhood[df_neighbourhood['host_is_superhost'] == True].drop('host_is_superhost', axis=1)
    
    # Merging dataframes
    df_neighbour_pivot = neighbour_host.merge(neighbour_superhost, on='neighbourhood', how='outer', suffixes=('_host', '_superhost')).fillna(0).sort_values('price_host')
    return df_neighbour_pivot
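
To make the shape of the pivot concrete, here is a small sketch that applies the same group, split, and outer-merge steps to a toy table. The neighbourhood names and prices are invented for illustration and do not come from the Airbnb datasets.

```python
import pandas as pd

# Toy listings table (made-up values, not taken from the Airbnb datasets)
toy_listings = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'host_is_superhost': [True, False, False, True, False],
    'neighbourhood': ['North', 'North', 'North', 'Central', 'South'],
    'price': [120.0, 80.0, 100.0, 150.0, 60.0],
})

# Same aggregation as pivot_listings: mean price and listing count per group
grouped = (
    toy_listings
    .groupby(['host_is_superhost', 'neighbourhood'])
    .agg({'price': 'mean', 'id': 'count'})
    .reset_index()
)

# Split by host type, then place both types side by side per neighbourhood
hosts = grouped[~grouped['host_is_superhost']].drop(columns='host_is_superhost')
superhosts = grouped[grouped['host_is_superhost']].drop(columns='host_is_superhost')

toy_pivot = (
    hosts.merge(superhosts, on='neighbourhood', how='outer', suffixes=('_host', '_superhost'))
    .fillna(0)
    .sort_values('price_host')
    .reset_index(drop=True)
)
print(toy_pivot)
```

Note that `fillna(0)` also fills the price columns: a neighbourhood with no regular hosts ("Central" in the toy data) gets `price_host` equal to 0, which is a placeholder for "no listings" rather than a real average price.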

In [None]:
# Listing count and average price per neighbourhood, split by host type
seattle_neighbour_pivot = pivot_listings(seatle_listings)
boston_neighbour_pivot = pivot_listings(boston_listings)

In [None]:
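def vbar_host_comparison(df_pivot, column_name:str, title:str):
    """Plot a diverging horizontal bar chart comparing hosts and superhosts by neighbourhood.

    Args:
        df_pivot (pd.DataFrame): Output of pivot_listings
        column_name (str): Column prefix to plot ('id' for listing counts, 'price' for average price)
        title (str): Chart title
    """
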
    df_pivot = df_pivot.sort_values(f'{column_name}_host')

    y = list(range(len(df_pivot)))

    fig = go.Figure(data=[
        go.Bar(y=y, x=df_pivot[f'{column_name}_superhost'], orientation='h', name="Superhost", base=0),
        go.Bar(y=y, x=-df_pivot[f'{column_name}_host'], orientation='h', name="Host", base=0),
    ])

    fig.update_layout(
        barmode='stack',
        title={'text': f"<b>{title}</b><br>Host vs Superhost",
            'x':0.5,
            'xanchor': 'center'
        },
        width=1000,
        height=1000,
        margin=dict(
            l=10,
            r=10,
            b=10,
            t=50,
            pad=0
        ),)

    fig.update_yaxes(
            ticktext=df_pivot['neighbourhood'],  # Updating y axis names with neighbourhood names
            tickvals=y
        )
    fig.show()

In [None]:
vbar_host_comparison(seattle_neighbour_pivot, 'id', 'How many Airbnb listings are there per neighbourhood in Seattle?')

In [None]:
vbar_host_comparison(boston_neighbour_pivot, 'id', 'How many Airbnb listings are there per neighbourhood in Boston?')

In [None]:
def geographical_price(df_listings, host_is_superhost:bool):
    """Filter the listings dataframe by host type and calculate the average price by coordinates.

    Args:
        df_listings (pd.DataFrame): Listings dataframe
        host_is_superhost (bool): Type of host. If True, keep superhosts; otherwise keep regular hosts

    Returns:
        pd.DataFrame: Dataframe with the coordinates and average price.
    """
    df_listings_filtered = df_listings[df_listings['host_is_superhost'] == host_is_superhost]
    df_coords = df_listings_filtered.groupby(by=['latitude', 'longitude']).agg({'price': 'mean'}).reset_index()
    return df_coords
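
The filter-then-average step can be checked on a toy table. This is a sketch with invented coordinates and prices, not values from the datasets:

```python
import pandas as pd

# Toy listings (made-up coordinates and prices)
toy = pd.DataFrame({
    'host_is_superhost': [True, True, True, False],
    'latitude': [47.60, 47.60, 47.65, 47.60],
    'longitude': [-122.33, -122.33, -122.30, -122.33],
    'price': [100.0, 200.0, 90.0, 50.0],
})

# Keep superhosts only, then average the price of listings sharing exact coordinates
superhosts = toy[toy['host_is_superhost']]
coords = superhosts.groupby(['latitude', 'longitude']).agg({'price': 'mean'}).reset_index()
print(coords)
```

The two superhost listings at the same point collapse into one row with their mean price, and the regular host at that point is excluded. Because the grouping matches coordinates exactly, listings that are merely close to each other stay as separate rows; any spatial smoothing comes from the density map, not from this step.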

In [None]:
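def mapbox_price(df_coords, color_continuous_scale:str='matter'):
    """Show a density map of average price on OpenStreetMap tiles, colored with the given scale."""
    fig = px.density_mapbox(df_coords, lat='latitude', lon='longitude', z='price', radius=15, zoom=0, color_continuous_scale=color_continuous_scale)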
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()

In [None]:
seattle_superhost_coords = geographical_price(seatle_listings, host_is_superhost=True)
seattle_host_coords = geographical_price(seatle_listings, host_is_superhost=False)

In [None]:
mapbox_price(seattle_superhost_coords)

In [None]:
mapbox_price(seattle_host_coords)

In [None]:
boston_superhost_coords = geographical_price(boston_listings, host_is_superhost=True)
boston_host_coords = geographical_price(boston_listings, host_is_superhost=False)

In [None]:
mapbox_price(boston_superhost_coords)

In [None]:
mapbox_price(boston_host_coords)