<a href="https://colab.research.google.com/github/datascience-uniandes/data-analysis-tutorial/blob/master/airbnb/eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis (EDA)

MINE-4101: Applied Data Science  
Universidad de los Andes  
Lizeth Viviana Perdomo Castañeda  
  
**Dataset:** Airbnb Listings - Santiago, Región Metropolitana de Santiago, Chile [[dataset](http://insideairbnb.com/get-the-data/) | [dictionary](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?usp=sharing)]. This dataset contains information about Airbnb property listings in Santiago de Chile. It includes attributes such as neighborhood, property type, price per night, number of reviews, review scores, availability, and amenities.

**Business Context:** Property Investment and Vacation Rental Strategy. You're a consultant for individuals and firms looking to invest in properties for Airbnb rentals. They want to identify the most lucrative neighborhoods, optimal pricing strategies, and understand the factors that contribute to positive reviews and frequent bookings.

Last update: August, 2024

In [1]:
import subprocess
import sys

In [2]:
# Installing seaborn with "python -m pip" (the invocation pip recommends) instead of calling pip.main directly
subprocess.check_call([sys.executable, "-m", "pip", "install", "seaborn"])

0

In [3]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Pandas configuration for extending the number of columns and rows to show
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

### 1. Load the data

In [5]:
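# Loading the CSV file as dataframe
listings_Santiago_df = pd.read_csv("./listings.csv.gz")
# Shorter alias used by the analysis cells in the following sections
listings_df = listings_Santiago_df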

In [6]:
# Showing dataframe dimensions
listings_Santiago_df.shape

(13053, 75)

In [7]:
# Showing column types
listings_Santiago_df.dtypes

id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
source                                           object
name                                             object
description                                      object
neighborhood_overview                            object
picture_url                                      object
host_id                                           int64
host_url                                         object
host_name                                        object
host_since                                       object
host_location                                    object
host_about                                       object
host_response_time                               object
host_response_rate                               object
host_acceptance_rate                            

In [8]:
listings_Santiago_df.sample(5) # Showing a sample of n rows

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
2381,25416390,https://www.airbnb.com/rooms/25416390,20240629050059,2024-06-29,city scrape,Estudio en Valle Nevado Ski Resort,Magnificent studio in the newest building in V...,"Best ski slopes in sudamerica ski out, restaur...",https://a0.muscache.com/pictures/0f4ff7a2-9a73...,56324000,https://www.airbnb.com/users/show/56324000,Carlos,2016-01-27,"Las Condes, Chile",Me encanta viajar y por lo mismo se lo importa...,within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/04e69...,https://a0.muscache.com/im/pictures/user/04e69...,,1,1,"['email', 'phone']",t,t,"Farellones, Región Metropolitana, Chile",Lo Barnechea,,-33.35292,-70.24881,Entire rental unit,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Resort access"", ""Hot water kettle"", ""Freezer...","$260,000.00",4,1125,3,4,1125,1125,3.7,1125.0,,t,5,15,42,317,2024-06-29,23,7,0,2018-07-20,2023-09-22,4.96,4.96,4.96,5.0,5.0,4.91,4.43,,f,1,1,0,0,0.32
6116,677538361037515074,https://www.airbnb.com/rooms/677538361037515074,20240629050059,2024-06-29,city scrape,Habitación cerca metro Manquehue,Enjoy the simplicity of this quiet and central...,Residential neighborhood next to mall and metro,https://a0.muscache.com/pictures/miso/Hosting-...,471142764,https://www.airbnb.com/users/show/471142764,Melisa,2022-07-23,,"Mi nombre es Melisa, tengo dos hijos adolescen...",within a few hours,100%,0%,f,https://a0.muscache.com/im/pictures/user/78709...,https://a0.muscache.com/im/pictures/user/78709...,,1,1,"['email', 'phone']",t,t,"Las Condes, Región Metropolitana, Chile",Las Condes,,-33.41674,-70.57165,Private room in rental unit,Private room,1,1.0,1 shared bath,1.0,1.0,"[""Hot water kettle"", ""Dining table"", ""Freezer""...","$18,000.00",1,1125,1,1,1125,1125,1.0,1125.0,,t,1,1,1,272,2024-06-29,9,0,0,2022-07-25,2023-02-10,4.11,4.11,3.67,4.78,4.33,4.78,3.89,,f,1,0,1,0,0.38
7775,853529888138650919,https://www.airbnb.com/rooms/853529888138650919,20240629050059,2024-06-29,city scrape,"Pequeña habitación en barrio bohemio, Providen...",A central and safe location in the nearby Prov...,There are bars and restaurants in front of the...,https://a0.muscache.com/pictures/105adf7d-a915...,461358072,https://www.airbnb.com/users/show/461358072,Pedro Juan,2022-05-28,"Providencia, Chile",Me llamo Pedro. Soy artista conceptual y actu...,within an hour,100%,95%,t,https://a0.muscache.com/im/pictures/user/f8f5c...,https://a0.muscache.com/im/pictures/user/f8f5c...,,1,2,"['email', 'phone']",t,t,"Providencia, Región Metropolitana, Chile",Providencia,,-33.4311,-70.61892,Private room in condo,Private room,1,1.0,1 private bath,1.0,0.0,"[""Hot water kettle"", ""Dining table"", ""Freezer""...","$20,740.00",1,365,1,1,365,365,1.0,365.0,,t,4,4,32,128,2024-06-29,36,25,3,2023-04-02,2024-06-20,5.0,5.0,4.97,5.0,5.0,5.0,5.0,,f,1,0,1,0,2.37
3736,38146787,https://www.airbnb.com/rooms/38146787,20240629050059,2024-06-29,previous scrape,Casa Lily - Habitación doble,"It is well lit, with square, supermarkets, cen...",Votive Temple of Maipu,https://a0.muscache.com/pictures/a0b26a23-d574...,271107973,https://www.airbnb.com/users/show/271107973,Liliana,2019-06-24,"Santiago, Chile",,,,,f,https://a0.muscache.com/im/pictures/user/14e5a...,https://a0.muscache.com/im/pictures/user/14e5a...,,1,2,"['email', 'phone']",t,f,"Maipú, Región Metropolitana, Chile",Maipú,,-33.52527,-70.76594,Private room in bungalow,Private room,2,,1 shared bath,,,"[""Wifi"", ""Indoor fireplace"", ""TV"", ""Kitchen"", ...",,1,10,1,1,10,10,1.0,10.0,,,0,0,0,0,2024-06-29,7,0,0,2020-12-16,2023-01-29,5.0,5.0,5.0,5.0,5.0,5.0,5.0,,f,1,0,1,0,0.16
4512,46194392,https://www.airbnb.com/rooms/46194392,20240629050059,2024-06-29,previous scrape,Habitacion doble,Double room with classic and minimalist decora...,,https://a0.muscache.com/pictures/miso/Hosting-...,312594237,https://www.airbnb.com/users/show/312594237,Casona,2019-11-27,"Providencia, Chile","Casona, antigua inserta en la calle Viña de...",within a day,100%,100%,f,https://a0.muscache.com/im/pictures/user/a3685...,https://a0.muscache.com/im/pictures/user/a3685...,,12,13,"['email', 'phone']",t,t,,Providencia,,-33.44137,-70.63267,Private room in hostel,Private room,2,,1 private bath,1.0,,"[""Wifi"", ""TV"", ""Essentials"", ""Air conditioning...",,1,1125,1,1,1125,1125,1.0,1125.0,,t,30,60,90,365,2024-06-29,0,0,0,,,,,,,,,,,t,12,0,12,0,


### 2. Univariate analysis

In [None]:
# Showing the number of unique values of listing ids
# Comparing this number with the row count helps to detect duplicated listings
listings_df["id"].nunique()
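
As a complement to counting unique ids, the sketch below compares that count with the number of rows and lists any repeated ids. It runs on a small made-up frame (`toy_df` is hypothetical, not the Airbnb data); on the real data the same calls would be applied to `listings_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the listings table (id 2 is repeated)
toy_df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# If the number of unique ids equals the number of rows, there are no duplicates
n_rows = len(toy_df)
n_unique_ids = toy_df["id"].nunique()
has_duplicates = n_unique_ids < n_rows

# keep=False marks every occurrence of a repeated id, not only the later ones
duplicated_rows = toy_df[toy_df.duplicated(subset="id", keep=False)]
print(has_duplicates)
print(duplicated_rows)
```
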

In [None]:
# Showing unique values of neighborhoods
listings_df["neighbourhood_cleansed"].unique()

In [None]:
# Calculating the relative frequency of room types
listings_df["room_type"].value_counts(dropna=False, normalize=True) # Set normalize=False to get the absolute frequency

In [None]:
# Calculating basic statistics of accommodates
listings_df["accommodates"].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])

<span style="color: red;">What does a value of 0 mean for this attribute?</span>
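
One possible way to approach this question is to count the listings with a capacity of zero and inspect their other attributes. The sketch below is only illustrative: `toy_df` and its values are hypothetical, and it assumes the real column does contain zeros.

```python
import pandas as pd

# Hypothetical toy data; the real column is listings_df["accommodates"]
toy_df = pd.DataFrame({
    "accommodates": [0, 2, 4, 0, 1],
    "room_type": ["Hotel room", "Entire home/apt", "Entire home/apt", "Private room", "Private room"],
})

# Count how many listings report a capacity of zero guests
zero_capacity_count = int((toy_df["accommodates"] == 0).sum())

# Inspect the remaining attributes of those rows to judge whether 0 is a placeholder or an error
zero_capacity_rows = toy_df.loc[toy_df["accommodates"] == 0]
print(zero_capacity_count)
print(zero_capacity_rows["room_type"].value_counts())
```
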

In [None]:
# We need to transform the price attribute from object to float
listings_df["price_float"] = listings_df["price"].str.replace("[$,]", "", regex=True).astype(float)
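
To see what this conversion does, here is a minimal sketch on a few price strings in the format shown in the sample above. The series `raw_price` is hypothetical, and `pd.to_numeric(..., errors="coerce")` is used here as a more forgiving alternative to `astype(float)`, not as the notebook's method.

```python
import pandas as pd

# Hypothetical sample of raw price strings, including a missing value
raw_price = pd.Series(["$260,000.00", "$18,000.00", None, "$20,740.00"])

# Remove the currency symbol and the thousands separators, then convert to numbers
# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
clean_price = pd.to_numeric(
    raw_price.str.replace(r"[$,]", "", regex=True), errors="coerce"
)
print(clean_price.tolist())
```

Missing prices stay missing after the conversion, so later statistics and plots simply skip them.
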

In [None]:
# Plotting a histogram for price
plt.figure(figsize=(20, 5))
plt.hist(listings_df["price_float"], bins=50)
plt.title("Price distribution")
plt.show()

<span style="color: red;">This attribute has an extreme outlier that makes a proper visualization difficult.</span>

In [None]:
# Let's make some calculations for determining an outlier threshold
q1 = listings_df["price_float"].quantile(0.25)
q3 = listings_df["price_float"].quantile(0.75)
iqr = q3 - q1
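
The quartiles above feed the usual 1.5 × IQR rule (Tukey's fences), which the later cells apply as `q3 + 1.5 * iqr`. The sketch below works through the arithmetic on a hypothetical series; the `toy_` names are illustrative and chosen so they do not overwrite the notebook's variables.

```python
import pandas as pd

# Hypothetical prices with one extreme value
toy_prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 1000.0])

toy_q1 = toy_prices.quantile(0.25)  # 14.0
toy_q3 = toy_prices.quantile(0.75)  # 22.0
toy_iqr = toy_q3 - toy_q1           # 8.0

# Values above Q3 + 1.5 * IQR are treated as upper outliers
toy_upper_fence = toy_q3 + 1.5 * toy_iqr  # 22.0 + 12.0 = 34.0
toy_outliers = toy_prices[toy_prices > toy_upper_fence]
print(toy_upper_fence, toy_outliers.tolist())
```
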

In [None]:
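# Plotting the price histogram again, keeping only values up to the upper outlier threshold (Q3 + 1.5 * IQR)
plt.figure(figsize=(20, 5))
plt.hist(listings_df.loc[listings_df["price_float"] <= (q3 + 1.5 * iqr), "price_float"], bins=50)
plt.title("Price distribution (without upper outliers)")
plt.show()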

In [None]:
# Plotting bar charts for has availability and instant bookable
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(20, 5))
listings_df["has_availability"].value_counts().plot(kind="bar", ax=ax1, color="orange")
listings_df["instant_bookable"].value_counts().sort_index(ascending=False).plot(kind="bar", ax=ax2, color="green")
ax1.set_title("Has availability frequency")
ax2.set_title("Instant bookable frequency")
plt.show()

In [None]:
# Plotting a boxplot for number of reviews
plt.figure(figsize=(20, 5))
plt.boxplot(listings_df["number_of_reviews"], showmeans=True, vert=False)
plt.title("Number of reviews distribution")
plt.show()

### 3. Bivariate analysis

In [None]:
# Plotting correlation heatmap among review scores
plt.figure(figsize=(10, 8))
sns.heatmap(
    listings_df[["review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value"]].corr(),
    vmin=0.5, vmax=1,
    cmap="Blues"
)
plt.title("Correlation among review scores")
plt.show()

In [None]:
# For large datasets, some visualizations are ineffective when trying to represent individual instances
# A naive strategy is to visualize only a random sample
listings_sample_df = listings_df.loc[listings_df["price_float"] <= (q3 + 1.5 * iqr)].sample(frac=0.1)
listings_sample_df.shape
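
Because the sample is random, the scatter plot changes every time the cell is rerun. If reproducibility matters, one option is to pass `random_state` to `sample`. The sketch below illustrates this on a hypothetical 100-row frame (`toy_df` and the seed value 42 are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with 100 rows standing in for the filtered listings
toy_df = pd.DataFrame({"price_float": range(100)})

# random_state fixes the seed, so the same rows are drawn on every run
sample_a = toy_df.sample(frac=0.1, random_state=42)
sample_b = toy_df.sample(frac=0.1, random_state=42)

print(sample_a.shape)
print(sample_a.index.equals(sample_b.index))
```
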

In [None]:
# Plotting the relationship between price and review score value
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=listings_sample_df["price_float"],
    y=listings_sample_df["review_scores_value"]
)
plt.title("Relationship between price and review score value")
plt.grid()
plt.show()

In [None]:
# Another common strategy for datasets with high variability is to filter them by groups that are representative in the context
# Here, let's work only with the neighbourhoods that have the most listings (Pareto analysis)
neighbourhood_frec_cumsum = listings_df["neighbourhood_cleansed"].value_counts(normalize=True).cumsum()

In [None]:
# Plotting Pareto analysis for neighbourhood frequency
plt.figure(figsize=(20, 8))
neighbourhood_frec_cumsum.plot(kind="bar", color="steelblue")
plt.title("Pareto analysis for neighbourhood frequency")
plt.grid(axis="y")
plt.show()

In [None]:
most_representative_neighbourhoods = neighbourhood_frec_cumsum.loc[neighbourhood_frec_cumsum < 0.8].index.tolist()
most_representative_neighbourhoods
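
The selection above keeps the categories whose cumulative share stays strictly below 80%. A small sketch with hypothetical labels (not the Santiago neighbourhoods) shows how the cut-off behaves, including the fact that the category that crosses the threshold is left out.

```python
import pandas as pd

# Hypothetical neighbourhood labels: A = 50%, B = 25%, C = 15%, D = 10%
labels = pd.Series(["A"] * 50 + ["B"] * 25 + ["C"] * 15 + ["D"] * 10)

# Relative frequency sorted from most to least common, then accumulated
cumulative_share = labels.value_counts(normalize=True).cumsum()

# The strict "< 0.8" keeps A (0.50) and B (0.75) but drops C, which reaches 0.90
selected = cumulative_share.loc[cumulative_share < 0.8].index.tolist()
print(selected)
```

If the category that crosses the 80% mark should also be included, the threshold or the comparison would need to be adjusted.
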

In [None]:
# Plotting price distribution by neighbourhood
# Keeping only listings up to the outlier threshold that belong to the most representative neighbourhoods
filtered_df = listings_df.loc[
    (listings_df["price_float"] <= (q3 + 1.5 * iqr))
    & listings_df["neighbourhood_cleansed"].isin(most_representative_neighbourhoods)
]
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
sns.kdeplot(
    data=filtered_df,
    x="price_float",
    hue="neighbourhood_cleansed",
    bw_adjust=.3,
    ax=ax
)
# Dashed vertical lines mark the mean price of each neighbourhood
for (neighbourhood, color) in zip(most_representative_neighbourhoods, ["steelblue", "orange", "green"]):
    ax.axvline(filtered_df.loc[filtered_df["neighbourhood_cleansed"] == neighbourhood, "price_float"].mean(), color=color, linestyle="dashed", linewidth=2, ymax=0.2)
plt.title("Price distribution by neighbourhood (with means)")
plt.show()

In [None]:
# Plotting number of reviews distribution by neighbourhood
# Keeping only listings up to the price outlier threshold that belong to the most representative neighbourhoods
filtered_df = listings_df.loc[
    (listings_df["price_float"] <= (q3 + 1.5 * iqr))
    & listings_df["neighbourhood_cleansed"].isin(most_representative_neighbourhoods)
]
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
sns.kdeplot(
    data=filtered_df,
    x="number_of_reviews",
    hue="neighbourhood_cleansed",
    bw_adjust=.3,
    ax=ax
)
# Dashed vertical lines mark the mean number of reviews of each neighbourhood
for (neighbourhood, color) in zip(most_representative_neighbourhoods, ["steelblue", "orange", "green"]):
    ax.axvline(filtered_df.loc[filtered_df["neighbourhood_cleansed"] == neighbourhood, "number_of_reviews"].mean(), color=color, linestyle="dashed", linewidth=2, ymax=0.2)
plt.title("Number of reviews distribution by neighbourhood (with means)")
plt.xlim([0, 200])
plt.show()