# Seattle and Boston Airbnb Open Data
## Exploratory Data Analysis

**Author:** Paola Rocha  
**Date:** March 3rd, 2024

**Description**  
The objective of this notebook is to answer business questions using the [Boston](https://www.kaggle.com/datasets/airbnb/boston?select=calendar.csv) and [Seattle](https://www.kaggle.com/datasets/airbnb/seattle?resource=download) Airbnb Open Data.

This notebook focuses on answering the following questions for each city:
1. Where are the superhosts located?
    - Use the listings dataset with longitude and latitude.
    - Use price on a color scale to show the most expensive superhosts.
2. Which qualities of the accommodation matter most for being a superhost?
3. What kind of reviews do superhosts get?
    - Use NLP to describe the reviews.
4. Predict average superhost prices for the next season.
    - Is there a difference in trends for hosts who are not superhosts?

**Notebook contents**
1. Libraries
2. Dataset description
3. Data acquisition
4. Cleaning data
5. Saving data

In [None]:
# Processing data
import pandas as pd

# NLP tools
from textblob import TextBlob

# Visualization
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

# sns.set_context("talk")

In [None]:
# Reading data
seatle_listings = pd.read_csv('../data/processed/Seatle/listings.csv', dtype={'host_is_superhost': bool})
boston_listings = pd.read_csv('../data/processed/Boston/listings.csv', dtype={'host_is_superhost': bool})
boston_reviews = pd.read_csv('../data/processed/Boston/reviews.csv')

## 1. Where are superhosts mainly located?

### Host locations

In [None]:
def pivot_listings(df_listings):
    """Group listings by neighbourhood and host type ('host' or 'superhost').
    Also compute the number of listings and the average price for each group.

    Args:
        df_listings (pd.DataFrame): Listings dataframe

    Returns:
        pd.DataFrame: Dataframe with the listing count and average price by host type.
    """
    # Grouping by host and neighbourhood and getting the average price
    df_neighbourhood = df_listings.groupby(by=['host_is_superhost', 'neighbourhood']).agg({'price':'mean', 'id':'count'}).reset_index()
    
    # Separation by host type
    neighbour_host = df_neighbourhood[df_neighbourhood['host_is_superhost'] == False].drop('host_is_superhost', axis=1)
    neighbour_superhost = df_neighbourhood[df_neighbourhood['host_is_superhost'] == True].drop('host_is_superhost', axis=1)
    
    # Merging dataframes
    df_neighbour_pivot = neighbour_host.merge(neighbour_superhost, on='neighbourhood', how='outer', suffixes=('_host', '_superhost')).fillna(0).sort_values('price_host')
    return df_neighbour_pivot
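
As a rough cross-check of the table that `pivot_listings` returns, the sketch below builds a comparable wide layout with pandas' `pivot_table` instead of the groupby-and-merge approach used above. The `toy_listings` data and the `toy_pivot` name are made up for illustration and are not part of the Airbnb datasets.

```python
import pandas as pd

# Toy listings; values are illustrative only, not taken from the Airbnb data.
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "neighbourhood": ["Ballard", "Ballard", "Ballard", "Fremont", "Fremont"],
    "host_is_superhost": [False, False, True, True, True],
    "price": [100.0, 120.0, 150.0, 90.0, 110.0],
})

# One row per neighbourhood, one column per (metric, host type) pair.
# Missing combinations (no regular host in Fremont) are filled with 0.
toy_pivot = toy_listings.pivot_table(
    index="neighbourhood",
    columns="host_is_superhost",
    values=["price", "id"],
    aggfunc={"price": "mean", "id": "count"},
    fill_value=0,
)

# Flatten the two-level columns into names such as price_host or id_superhost.
toy_pivot.columns = [
    f"{metric}_{'superhost' if is_superhost else 'host'}"
    for metric, is_superhost in toy_pivot.columns
]
toy_pivot = toy_pivot.reset_index().sort_values("price_host")
print(toy_pivot)
```

Neighbourhoods that lack one host type end up with 0 in that type's columns, which mirrors the `fillna(0)` step in `pivot_listings`.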

In [None]:
# Grouping by host type and neighbourhood and getting the average base price
seattle_neighbour_pivot = pivot_listings(seatle_listings)
boston_neighbour_pivot = pivot_listings(boston_listings)

In [None]:
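def vbar_host_comparison(df_pivot, column_name:str, title:str):
    """Plot a diverging horizontal bar chart comparing hosts (left) and superhosts (right) by neighbourhood."""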

    df_pivot = df_pivot.sort_values(f'{column_name}_host')

    y = list(range(len(df_pivot)))

    fig = go.Figure(data=[
        go.Bar(y=y, x=df_pivot[f'{column_name}_superhost'], orientation='h', name="Superhost", base=0),
        go.Bar(y=y, x=-df_pivot[f'{column_name}_host'], orientation='h', name="Host", base=0),
    ])

    fig.update_layout(
        barmode='stack',
        title={'text': f"<b>{title}</b><br>Host vs Superhost",
            'x':0.5,
            'xanchor': 'center'
        },
        width=1000,
        height=1000,
        margin=dict(
            l=10,
            r=10,
            b=10,
            t=50,
            pad=0
        ),)

    fig.update_yaxes(
            ticktext=df_pivot['neighbourhood'],  # Updating y axis names with neighbourhood names
            tickvals=y
        )
    fig.show()

In [None]:
vbar_host_comparison(seattle_neighbour_pivot, 'id', 'Number of Airbnb listings per neighbourhood in Seattle')

In [None]:
vbar_host_comparison(boston_neighbour_pivot, 'id', 'Number of Airbnb listings per neighbourhood in Boston')

In [None]:
def geographical_price(df_listings, host_is_superhost:bool):
    """Filter the listings dataframe by host type and calculate the average price by coordinates.

    Args:
        df_listings (pd.DataFrame): Listings dataframe
        host_is_superhost (bool): Type of host. If True, keep superhosts; otherwise keep regular hosts

    Returns:
        pd.DataFrame: Dataframe with the coordinates and average price.
    """
    df_listings_filtered = df_listings[df_listings['host_is_superhost'] == host_is_superhost]
    df_coords = df_listings_filtered.groupby(by=['latitude', 'longitude']).agg({'price': 'mean'}).reset_index()
    return df_coords

In [None]:
def mapbox_price(df_coords, color_continuous_scale:str='matter'):
    # Use the scale passed by the caller instead of a hardcoded value
    fig = px.density_mapbox(df_coords, lat='latitude', lon='longitude', z='price', radius=15, zoom=0, color_continuous_scale=color_continuous_scale)
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()

In [None]:
seattle_superhost_coords = geographical_price(seatle_listings, host_is_superhost=True)
seattle_host_coords = geographical_price(seatle_listings, host_is_superhost=False)

In [None]:
mapbox_price(seattle_superhost_coords)

In [None]:
mapbox_price(seattle_host_coords)

In [None]:
boston_superhost_coords = geographical_price(boston_listings, host_is_superhost=True)
boston_host_coords = geographical_price(boston_listings, host_is_superhost=False)

In [None]:
mapbox_price(boston_superhost_coords)

In [None]:
mapbox_price(boston_host_coords)

## 2. Which qualities of the accommodation matter most for being a superhost?
Rename the question to: How important is it to receive positive scores?

In [None]:
seatle_listings.columns

In [None]:
seatle_listings['amenities_count'] = seatle_listings['amenities'].str.count(',') + 1
acommodation_qualities_cols = [#'host_is_superhost', # 'property_type', 'room_type',  'bed_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities_count',
       'minimum_nights', 'review_scores_cleanliness', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value']
seatle_listings[acommodation_qualities_cols]

In [None]:
# Copy the selection so the assignment below does not write to a view of seatle_listings
seatle_accomodation = seatle_listings[acommodation_qualities_cols + ["host_is_superhost"]].copy()
seatle_accomodation['host_is_superhost'] = seatle_accomodation['host_is_superhost'].astype(int)

TODO: redo this chart as a vertical comparison bar chart.

We can observe that superhosts stand out for their good reviews in the Communication category, which suggests that attention to guests' comfort and needs is important for being a superhost. The same pattern shows up in their good cleanliness reviews.
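
One way to put a number on this kind of observation is to compare the mean score per category for each host type. The sketch below uses a small made-up table (`toy_scores`) with the same column names as the listings data. The values are illustrative only and say nothing about the real Seattle results.

```python
import pandas as pd

# Toy review scores; values are illustrative only.
toy_scores = pd.DataFrame({
    "host_is_superhost": [1, 1, 0, 0, 0],
    "review_scores_communication": [10, 10, 9, 8, 10],
    "review_scores_cleanliness": [10, 9, 9, 8, 7],
})

# Mean score per category for hosts (0) and superhosts (1).
mean_scores = toy_scores.groupby("host_is_superhost").mean()

# A positive gap means superhosts score higher on average in that category.
score_gap = mean_scores.loc[1] - mean_scores.loc[0]
print(score_gap.sort_values(ascending=False))
```

On the real data, the same idea would apply to `seatle_accomodation` after dropping rows with missing scores.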

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

fig.suptitle('Reviews frequency', fontsize=16)

sns.histplot(ax=axes[0, 0], data=seatle_accomodation.dropna(axis=0), x='review_scores_cleanliness', hue="host_is_superhost", kde=True, legend=False).set(title='Cleanliness')
# sns.histplot(ax=axes[0, 1], data=seatle_accomodation.dropna(axis=0), x='review_scores_rating', hue="host_is_superhost", kde=True, legend=False).set(title='')
sns.histplot(ax=axes[0, 1], data=seatle_accomodation.dropna(axis=0), x='review_scores_accuracy', hue="host_is_superhost", kde=True, legend=False).set(title='Accuracy')
sns.histplot(ax=axes[0, 2], data=seatle_accomodation.dropna(axis=0), x='review_scores_checkin', hue="host_is_superhost", kde=True, legend=True).set(title='Check In')
sns.histplot(ax=axes[1, 0], data=seatle_accomodation.dropna(axis=0), x='review_scores_communication', hue="host_is_superhost", kde=True, legend=False).set(title='Communication')
sns.histplot(ax=axes[1, 1], data=seatle_accomodation.dropna(axis=0), x='review_scores_location', hue="host_is_superhost", kde=True, legend=False).set(title='Location')
sns.histplot(ax=axes[1, 2], data=seatle_accomodation.dropna(axis=0), x='review_scores_value', hue="host_is_superhost", kde=True, legend=False).set(title='Overall value')
axes[0, 2].legend(title='Host type', loc='upper left', labels=['Super Host', 'Host'])
sns.move_legend(axes[0, 2], "upper left", bbox_to_anchor=(1, 1))

## 3. Analysis of superhost reviews

In [None]:
boston_listings.sort_values('id').head()

In [None]:
df_host_type_reviews = boston_reviews.merge(boston_listings[['id', 'name', 'host_is_superhost']], left_on='listing_id', right_on='id', how='inner')
df_host_type_reviews

In [None]:
df_host_type_reviews['comments'][100]

In [None]:
TextBlob(df_host_type_reviews['comments'][68270]).sentiment.polarity

In [None]:
def get_sentiment(text:str):
    blob = TextBlob(str(text))
    sentiment = blob.sentiment  # tuple of Sentiment(polarity=0.21547619047619046, subjectivity=0.4841269841269841)
    return sentiment.polarity

df_host_type_reviews['sentiment'] = df_host_type_reviews['comments'].apply(get_sentiment)

In [None]:
def categorize_sentiment(polarity:float):
    if polarity < 0:
        return 'Negative'
    elif polarity > 0:
        return 'Positive'
    else:
        return 'Neutral'

df_host_type_reviews['sentiment_category'] = df_host_type_reviews['sentiment'].apply(categorize_sentiment)

In [None]:
df_superhost = df_host_type_reviews[df_host_type_reviews['host_is_superhost'] == True]
df_host = df_host_type_reviews[df_host_type_reviews['host_is_superhost'] == False]

df_superhost_sentiment = df_superhost['sentiment_category'].value_counts() / df_superhost.shape[0]
df_superhost_sentiment

In [None]:
df_host_sentiment = df_host['sentiment_category'].value_counts() / df_host.shape[0]
df_host_sentiment
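
The two share calculations above can also be sketched in one step with `pd.crosstab` and `normalize="index"`, which gives each host type's sentiment shares as a row. The `toy_reviews` data and the `host_type` labels are assumptions made for this illustration, not values from the Boston reviews.

```python
import pandas as pd

# Toy reviews; categories are illustrative only.
toy_reviews = pd.DataFrame({
    "host_is_superhost": [True, True, True, True, False, False, False, False],
    "sentiment_category": ["Positive", "Positive", "Positive", "Neutral",
                           "Positive", "Positive", "Negative", "Neutral"],
})

# Readable row labels instead of True/False.
host_type = (
    toy_reviews["host_is_superhost"]
    .map({True: "Superhost", False: "Host"})
    .rename("host_type")
)

# Each row sums to 1: the share of every sentiment category within a host type.
sentiment_share = pd.crosstab(
    host_type,
    toy_reviews["sentiment_category"],
    normalize="index",
)
print(sentiment_share)
```

Keeping both host types in one table makes the shares easier to compare side by side than two separate Series.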

In [None]:
def plot_pie_chart(df, host_type:str):
    # Fix the category order so each slice matches its color and explode offset
    categories = ['Positive', 'Neutral', 'Negative']
    df = df.reindex(categories).fillna(0)
    colors = ['#3cb371', '#ffa500', '#FF0000']  # Positive, Neutral, Negative
    explode = (0.00, 0.05, 0.1)

    # Pie Chart
    plt.pie(df, colors=colors, labels=df.index,
            autopct='%1.1f%%', pctdistance=0.85,
            explode=explode)

    plt.title(f'Reviews categorization of {host_type}')
    plt.show()

In [None]:
plot_pie_chart(df_superhost_sentiment, host_type='Superhost')

In [None]:
plot_pie_chart(df_host_sentiment, host_type='Host')

Links:
- https://medium.com/@umarsmuhammed/how-to-perform-sentiment-analysis-using-python-step-by-step-tutorial-with-code-snippets-4ac3e9747fff
- https://textblob.readthedocs.io/en/dev/quickstart.html