![Airbnb_logo](Airbnb_logo.png)

<h1 style="color: #FF6B81;">Airbnb - Lisbon listings overview</h1>

<h2 style="color: #FF6B81;">Importing Libraries</h2>

In [74]:
import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

In [2]:
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore')

<h2 style="color: #FF6B81;">Data extraction</h2>

In [3]:
data_weekdays = pd.read_csv('lisbon_weekdays.csv')

In [4]:
data_weekends = pd.read_csv('lisbon_weekends.csv')

<h2 style="color: #FF6B81;">Exploring the datasets</h2>

##### Variables list:


- `realSum`: the full price of accommodation for two people and two nights in EUR
- `room_type`: the type of the accommodation
- `room_shared`: dummy variable for shared rooms 
- `room_private`: dummy variable for private rooms
- `person_capacity`: maximum number of guests
- `host_is_superhost`: dummy variable for superhost status
- `multi`: dummy for listings offered by hosts with 2–4 listings
- `biz`: dummy for listings offered by hosts with more than 4 listings
- `cleanliness_rating`: cleanliness rating
- `guest_satisfaction_overall`: overall rating of the listing (scale to 100)
- `bedrooms`: number of bedrooms (0 for studios)
- `dist`: distance to the city centre in km
- `metro_dist`: distance from nearest metro station in km
- `attr_index`: attraction index of the listing location
- `attr_index_norm`: normalised attraction index (0-100)
- `rest_index`: restaurant index of the listing location
- `rest_index_norm`: normalised restaurant index (0-100)
- `lng`: longitude of the listing location
- `lat`: latitude of the listing location


##### Notes on the listed price:

- The offers were collected four to six weeks in advance of the travel dates, and the collected prices refer to the full amount due for the accommodation, including the reservation fee and cleaning fee. Weekdays refer to Tuesday-Thursday and weekends to Friday-Sunday bookings. 


In [5]:
data_weekdays.sample(10)

Unnamed: 0.1,Unnamed: 0,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,attr_index,attr_index_norm,rest_index,rest_index_norm,lng,lat
990,990,285.64728,Entire home/apt,False,False,3.0,True,0,1,10.0,99.0,1,0.629722,0.331059,270.458919,8.929008,864.372716,38.657973,-9.1453,38.70884
16,16,270.637899,Entire home/apt,False,False,4.0,False,1,0,9.0,95.0,1,0.81077,0.586973,549.416394,18.138591,867.292565,38.78856,-9.13031,38.71212
291,291,305.816135,Entire home/apt,False,False,6.0,True,0,0,9.0,96.0,2,0.777644,0.498577,373.900958,12.344074,867.598835,38.802257,-9.13127,38.70988
158,158,310.037523,Entire home/apt,False,False,6.0,False,0,0,10.0,94.0,2,0.44797,0.370221,455.127317,15.025704,1043.259691,46.658466,-9.13546,38.71
2066,2066,163.69606,Entire home/apt,False,False,2.0,False,0,1,9.0,70.0,0,2.383527,1.804974,111.847432,3.692563,297.719705,13.315136,-9.16568,38.70568
1356,1356,295.028143,Entire home/apt,False,False,4.0,True,0,1,10.0,99.0,2,0.58048,0.547918,336.245209,11.100896,1037.030255,46.379863,-9.1462,38.71144
1513,1513,391.41651,Entire home/apt,False,False,6.0,False,0,0,10.0,93.0,2,6.588172,5.607164,190.58335,6.291973,160.425734,7.174837,-9.21232,38.69562
2836,2836,170.262664,Entire home/apt,False,False,4.0,False,0,0,2.0,20.0,2,1.618669,0.524593,165.773636,5.472898,363.135677,16.240783,-9.12263,38.71832
2539,2539,239.681051,Entire home/apt,False,False,2.0,True,0,0,10.0,99.0,1,0.42691,0.218784,350.306165,11.565108,925.918543,41.410532,-9.138,38.716
545,545,193.949343,Entire home/apt,False,False,2.0,False,0,0,10.0,91.0,1,0.63321,0.552186,445.45768,14.706468,905.576662,40.500768,-9.14691,38.71254


In [6]:
data_weekends.sample(10)

Unnamed: 0.1,Unnamed: 0,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,attr_index,attr_index_norm,rest_index,rest_index_norm,lng,lat
1130,1130,158.536585,Private room,False,True,2.0,False,1,0,9.0,93.0,3,0.258316,0.111305,399.900639,13.19003,1205.241542,67.737762,-9.14127,38.71043
2209,2209,184.099437,Private room,False,True,2.0,False,0,1,10.0,93.0,1,2.276772,0.102534,118.446612,3.906756,369.252124,20.752946,-9.14738,38.73197
1083,1083,334.427767,Entire home/apt,False,False,4.0,True,0,1,10.0,96.0,2,0.138039,0.195221,526.058363,17.351124,1304.88436,73.337952,-9.141,38.713
904,904,197.701689,Entire home/apt,False,False,4.0,False,1,0,9.0,90.0,1,0.578397,0.281456,284.51543,9.384249,921.19291,51.773478,-9.14433,38.71606
2704,2704,170.262664,Entire home/apt,False,False,2.0,False,1,0,10.0,96.0,1,0.622007,0.210778,350.976029,11.576336,708.546147,39.822167,-9.1338,38.71562
1222,1222,283.302064,Entire home/apt,False,False,6.0,False,0,0,10.0,98.0,2,3.713938,1.755158,82.651754,2.726125,174.98102,9.834396,-9.10572,38.73273
1854,1854,277.439024,Entire home/apt,False,False,4.0,False,0,1,9.0,91.0,1,7.18963,6.132681,127.139502,4.193476,109.094391,6.131393,-9.21994,38.69695
2128,2128,311.210131,Entire home/apt,False,False,6.0,False,0,1,9.0,83.0,2,2.23489,1.021632,115.21169,3.800058,305.252453,17.155995,-9.1652,38.71444
1916,1916,272.983114,Entire home/apt,False,False,4.0,False,0,1,10.0,94.0,1,2.305888,0.4645,112.890707,3.723504,332.639705,18.695231,-9.14049,38.73314
843,843,263.602251,Entire home/apt,False,False,3.0,True,0,1,10.0,97.0,0,0.574356,0.524001,336.998321,11.115306,1134.50639,63.76226,-9.146,38.711


In [7]:
data_weekdays.shape

(2857, 20)

In [8]:
data_weekends.shape

(2906, 20)

In [9]:
data_weekdays.columns

Index(['Unnamed: 0', 'realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index',
       'rest_index_norm', 'lng', 'lat'],
      dtype='object')

In [10]:
data_weekends.columns

Index(['Unnamed: 0', 'realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index',
       'rest_index_norm', 'lng', 'lat'],
      dtype='object')

**First impression:**
- `Unnamed: 0` will be dropped, since it is only an index column;
- `attr_index` and `rest_index` will be dropped too, given that `attr_index_norm` and `rest_index_norm` hold the same information on a more readable 0-100 scale;
- `room_shared` and `room_private` will also be dropped, given that this boolean information is already contained in `room_type` and adds no value to our analysis;
- `realSum` will be renamed to `listing_price` for better understanding;
- To distinguish weekend from weekday listings, a `weekend` flag column will be created, with 1 for weekend listings and 0 for weekday listings;
- Afterwards, `data_weekends` and `data_weekdays` will be concatenated to continue the cleaning process.


In [11]:
df_weekdays = data_weekdays.copy()

In [12]:
df_weekdends = data_weekends.copy()

<h2 style="color: #FF6B81;">Data Cleaning</h2>

<h3 style="color: #FF6B81;">Dropping columns</h3>

In [13]:
df_weekdays.drop(columns=["Unnamed: 0", "attr_index", "rest_index", "room_shared", "room_private"], inplace = True)

In [14]:
df_weekdends.drop(columns=["Unnamed: 0", "attr_index", "rest_index", "room_shared", "room_private"], inplace = True)

<h3 style="color: #FF6B81;">Boolean column weekend</h3>

In [15]:
df_weekdays["weekend"] = 0
df_weekdends["weekend"] = 1

<h3 style="color: #FF6B81;">Concatenate two dataframes</h3>

In [16]:
airbnb_lisbon = pd.concat([df_weekdays, df_weekdends], axis=0)

In [17]:
airbnb_lisbon.reset_index(drop=True, inplace=True)

In [18]:
airbnb_lisbon.rename(columns={"realSum":"listing_price"}, inplace=True)

<h3 style="color: #FF6B81;">Saving my airbnb_lisbon dataframe</h3>

In [19]:
airbnb_lisbon.to_csv('airbnb_lisbon.csv', index=False)

<h3 style="color: #FF6B81;">Checking Null values</h3>

In [20]:
airbnb_lisbon.isna().sum()

listing_price                 0
room_type                     0
person_capacity               0
host_is_superhost             0
multi                         0
biz                           0
cleanliness_rating            0
guest_satisfaction_overall    0
bedrooms                      0
dist                          0
metro_dist                    0
attr_index_norm               0
rest_index_norm               0
lng                           0
lat                           0
weekend                       0
dtype: int64

<h3 style="color: #FF6B81;">Checking Duplicates</h3>

In [21]:
airbnb_lisbon.duplicated().sum()

0

<h3 style="color: #FF6B81;">Checking Empty Spaces</h3>

In [22]:
airbnb_lisbon.eq(" ").sum()

listing_price                 0
room_type                     0
person_capacity               0
host_is_superhost             0
multi                         0
biz                           0
cleanliness_rating            0
guest_satisfaction_overall    0
bedrooms                      0
dist                          0
metro_dist                    0
attr_index_norm               0
rest_index_norm               0
lng                           0
lat                           0
weekend                       0
dtype: int64
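
Note that `eq(" ")` only matches cells that are exactly one space character. As a minimal sketch (the toy frame and the helper name `count_blank_strings` are illustrative assumptions, not part of the notebook), empty or whitespace-only strings of any length could be counted like this:

```python
import pandas as pd

def count_blank_strings(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells that are empty or whitespace-only strings."""
    def is_blank(value) -> bool:
        # Non-string values (numbers, booleans, NaN) are never counted as blank
        return isinstance(value, str) and value.strip() == ""
    return df.apply(lambda col: col.map(is_blank).sum())

# Hypothetical toy data: three blank variants in `room_type`, none in `bedrooms`
toy = pd.DataFrame({
    "room_type": ["Private room", " ", "   ", ""],
    "bedrooms": [1, 2, 0, 1],
})
blank_counts = count_blank_strings(toy)
print(blank_counts)
```

Applied to the real data, this would be called as `count_blank_strings(airbnb_lisbon)`.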

<h3 style="color: #FF6B81;">Checking and changing column types</h3>

In [23]:
airbnb_lisbon.dtypes

listing_price                 float64
room_type                      object
person_capacity               float64
host_is_superhost                bool
multi                           int64
biz                             int64
cleanliness_rating            float64
guest_satisfaction_overall    float64
bedrooms                        int64
dist                          float64
metro_dist                    float64
attr_index_norm               float64
rest_index_norm               float64
lng                           float64
lat                           float64
weekend                         int64
dtype: object

In [24]:
# Changing super host variable from boolean to integer
super_host_mapping = {False: 0 , True: 1}
airbnb_lisbon["host_is_superhost"] = airbnb_lisbon["host_is_superhost"].map(super_host_mapping)

In [25]:
# Changing person capacity, cleanliness_rating and guest_satisfaction_overall from float to integer
airbnb_lisbon["person_capacity"] = airbnb_lisbon["person_capacity"].astype(int)
airbnb_lisbon["cleanliness_rating"] = airbnb_lisbon["cleanliness_rating"].astype(int)
airbnb_lisbon["guest_satisfaction_overall"] = airbnb_lisbon["guest_satisfaction_overall"].astype(int)

In [26]:
airbnb_lisbon.dtypes

listing_price                 float64
room_type                      object
person_capacity                 int32
host_is_superhost               int64
multi                           int64
biz                             int64
cleanliness_rating              int32
guest_satisfaction_overall      int32
bedrooms                        int64
dist                          float64
metro_dist                    float64
attr_index_norm               float64
rest_index_norm               float64
lng                           float64
lat                           float64
weekend                         int64
dtype: object
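
`astype(int)` silently truncates any fractional part, and the resulting width depends on the platform (`int32` in the output above). A hedged sketch of a stricter cast follows; the helper name `to_int_checked` and the toy frame are assumptions made for illustration:

```python
import pandas as pd

def to_int_checked(series: pd.Series) -> pd.Series:
    """Cast a float column to int64, refusing to truncate fractional values."""
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name} contains non-integer values")
    return series.astype("int64")

# Hypothetical toy data: `person_capacity` is whole-numbered, `dist` is not
toy = pd.DataFrame({
    "person_capacity": [2.0, 4.0, 6.0],
    "dist": [0.63, 0.81, 0.78],
})
toy["person_capacity"] = to_int_checked(toy["person_capacity"])
print(toy.dtypes)
```

Calling `to_int_checked(toy["dist"])` raises a `ValueError` instead of rounding the distances down to 0.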

<h3 style="color: #FF6B81;">Rounding float variables (excluding latitude and longitude) </h3>

In [27]:
float_variables = ["dist", "metro_dist", "attr_index_norm", "rest_index_norm", "listing_price"]

In [28]:
airbnb_lisbon[float_variables] = airbnb_lisbon[float_variables].round(2)

<h3 style="color: #FF6B81;">Moving listing price to the right</h3>

In [29]:
column_to_move = airbnb_lisbon.pop("listing_price")
airbnb_lisbon["listing_price"] = column_to_move

<h2 style="color: #FF6B81;">EDA (Exploratory Data Analysis)</h2>

<h3 style="color: #FF6B81;">Categorical vs Numerical data</h3>

In [45]:
for col in airbnb_lisbon.columns:
    number_unique_values = airbnb_lisbon[f"{col}"].nunique()
    print(f"{col} number of unique values: {number_unique_values}")

room_type number of unique values: 3
person_capacity number of unique values: 5
host_is_superhost number of unique values: 2
multi number of unique values: 2
biz number of unique values: 2
cleanliness_rating number of unique values: 9
guest_satisfaction_overall number of unique values: 46
bedrooms number of unique values: 7
dist number of unique values: 626
metro_dist number of unique values: 341
attr_index_norm number of unique values: 1415
rest_index_norm number of unique values: 3270
lng number of unique values: 2190
lat number of unique values: 2054
weekend number of unique values: 2
listing_price number of unique values: 1159


From the number of unique values we can start separating categorical from numerical variables.

Categorical:
- `room_type` 
- `person_capacity`
- `host_is_superhost`
- `multi`
- `biz`
- `cleanliness_rating`
- `bedrooms`
- `weekend`

Note: for now, let's treat `guest_satisfaction_overall` as numerical.

Numerical:
- `guest_satisfaction_overall` 
- `dist`
- `metro_dist`
- `attr_index_norm`
- `rest_index_norm`
- `listing_price`

Geographical data:
- `lng`
- `lat`

In [None]:
# Splitting numerical and categorical variables into two dataframes
# (columns with fewer than 10 unique values are treated as categorical)
df_cat = airbnb_lisbon.loc[:, airbnb_lisbon.nunique() < 10]

df_num = airbnb_lisbon.drop(columns=df_cat.columns)
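
With this rule, `lng` and `lat` end up in `df_num` alongside the other numerical columns. As a sketch on invented toy data (the `_toy` names, the constant `MAX_CATEGORICAL_LEVELS` and the separate geographical frame are assumptions, not the notebook's method), the same threshold can be combined with an explicit list of geographical columns:

```python
import pandas as pd

# Hypothetical toy data with the same kinds of columns as airbnb_lisbon
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"] * 4,
    "weekend": [0, 1] * 6,
    "dist": [round(0.1 * i, 2) for i in range(12)],
    "lng": [-9.14 - 0.001 * i for i in range(12)],
    "lat": [38.71 + 0.001 * i for i in range(12)],
})

MAX_CATEGORICAL_LEVELS = 10   # same cut-off as in the cell above
geo_cols = ["lng", "lat"]     # kept apart from the other numerical columns

is_categorical = toy.nunique() < MAX_CATEGORICAL_LEVELS
df_cat_toy = toy.loc[:, is_categorical]
df_geo_toy = toy[geo_cols]
df_num_toy = toy.drop(columns=list(df_cat_toy.columns) + geo_cols)

print(list(df_cat_toy.columns), list(df_num_toy.columns))
```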

<h3 style="color: #FF6B81;">Statistical description (numerical variables)</h3>

In [49]:
df_num.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
guest_satisfaction_overall,5763.0,91.09,9.15,20.0,88.0,93.0,97.0,100.0
dist,5763.0,1.97,1.74,0.04,0.81,1.39,2.44,9.57
metro_dist,5763.0,0.71,0.92,0.01,0.29,0.45,0.67,6.16
attr_index_norm,5763.0,7.32,5.08,1.36,3.77,5.54,10.02,100.0
rest_index_norm,5763.0,28.27,17.88,3.35,14.99,24.27,37.82,100.0
lng,5763.0,-9.14,0.02,-9.23,-9.15,-9.14,-9.13,-9.09
lat,5763.0,38.72,0.02,38.69,38.71,38.72,38.73,38.79
listing_price,5763.0,238.21,108.97,70.59,160.18,225.38,286.35,1681.05


<h3 style="color: #FF6B81;">Categorical vs Categorical</h3>

In [86]:
# Cleanliness rating frequency by super_host condition
crosstab_clean_host = pd.crosstab(airbnb_lisbon["cleanliness_rating"], airbnb_lisbon["host_is_superhost"])
crosstab_clean_host

cleanliness_rating,host_is_superhost=0,host_is_superhost=1
2,13,0
3,4,0
4,7,2
5,7,0
6,65,0
7,79,2
8,494,8
9,1750,149
10,2111,1072


In [87]:
# Cleanliness rating frequency by room type
crosstab_clean_room = pd.crosstab(airbnb_lisbon["cleanliness_rating"], airbnb_lisbon["room_type"])
crosstab_clean_room

cleanliness_rating,Entire home/apt,Private room,Shared room
2,9,2,2
3,0,4,0
4,5,0,4
5,5,2,0
6,33,32,0
7,41,38,2
8,255,238,9
9,1301,569,29
10,2229,926,28


In [88]:
# Cleanliness rating frequency by person capacity
crosstab_clean_capacity = pd.crosstab(airbnb_lisbon["cleanliness_rating"], airbnb_lisbon["person_capacity"])
crosstab_clean_capacity

cleanliness_rating,person_capacity=2,person_capacity=3,person_capacity=4,person_capacity=5,person_capacity=6
2,0,2,7,0,4
3,4,0,0,0,0
4,0,7,2,0,0
5,2,2,3,0,0
6,32,8,21,2,2
7,39,3,27,5,7
8,276,56,104,23,43
9,744,251,552,125,227
10,1244,355,1073,161,350


In [89]:
# Room type frequency by super_host condition
crosstab_room_host = pd.crosstab(airbnb_lisbon["room_type"], airbnb_lisbon["host_is_superhost"])
crosstab_room_host

room_type,host_is_superhost=0,host_is_superhost=1
Entire home/apt,2978,900
Private room,1492,319
Shared room,60,14


##### Chi-square tests

Now let's evaluate whether there is a significant association between the categorical variables presented in the crosstab frequencies. In other words, let's evaluate whether we can reject the null hypothesis, which states that the variables are independent.

In [90]:
crosstab_results = [crosstab_clean_host, crosstab_clean_room, crosstab_clean_capacity, crosstab_room_host]

In [95]:
for crosstabs in crosstab_results:
    chi2_stats, chi2_pvalue, _, _ = chi2_contingency(crosstabs)
    print(f"p-value is {chi2_pvalue}")

p-value is 2.0713496927190733e-135
p-value is 1.8601164370012524e-46
p-value is 7.164859143880844e-16
p-value is 8.994563069235146e-06


All four p-values are far below the usual 0.05 significance level, so we have enough evidence to reject the null hypothesis of independence between:
- `cleanliness_rating` and `host_is_superhost`;
- `cleanliness_rating` and `room_type`;
- `cleanliness_rating` and `person_capacity`;
- `room_type` and `host_is_superhost`.
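
One caveat: the chi-square approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb asks for at least 5 in most cells), and the low cleanliness ratings have very few listings. The sketch below, using counts copied from the `cleanliness_rating` vs `host_is_superhost` crosstab, inspects the expected frequencies that `chi2_contingency` returns as its fourth value; the threshold of 5 is the rule of thumb, not something fixed by the test itself:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the cleanliness_rating x host_is_superhost crosstab above
clean_host = np.array([
    [13, 0], [4, 0], [7, 2], [7, 0], [65, 0],
    [79, 2], [494, 8], [1750, 149], [2111, 1072],
])

chi2_stat, p_value, dof, expected = chi2_contingency(clean_host)

# Number of cells whose expected count falls below the rule-of-thumb value of 5
small_cells = int((expected < 5).sum())
print(f"dof = {dof}, cells with expected count < 5: {small_cells} of {expected.size}")
```

If many cells fall below the threshold, merging the sparse rating levels (for example, ratings 2 to 7 into one group) before testing may be a reasonable option.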

##### Cramér's V

In [None]:
for crosstabs in crosstab_results:
    print(association(crosstabs, method="cramer"))    

0.3361994500947648
0.150740773468755
0.07858722448200578
0.06349989586449292


<h3 style="color: #FF6B81;">Categorical vs listing price</h3>

In [None]:
# Checking average price differences between weekday and weekend bookings
round(airbnb_lisbon.groupby("weekend")["listing_price"].mean().reset_index(), 2)

Unnamed: 0,weekend,listing_price
0,0,236.35
1,1,240.04


In [None]:
# Checking average price differences between room types
round(airbnb_lisbon.groupby("room_type")["listing_price"].mean().reset_index(), 2)

Unnamed: 0,room_type,listing_price
0,Entire home/apt,282.5
1,Private room,148.9
2,Shared room,103.06


In [None]:
# Checking average price differences between superhosts and non-superhosts
round(airbnb_lisbon.groupby("host_is_superhost")["listing_price"].mean().reset_index(), 2)

Unnamed: 0,host_is_superhost,listing_price
0,0,234.01
1,1,253.66


In [None]:
# Checking average price differences between cleanliness ratings
round(airbnb_lisbon.groupby("cleanliness_rating")["listing_price"].mean().reset_index(), 2)

Unnamed: 0,cleanliness_rating,listing_price
0,2,230.63
1,3,120.37
2,4,223.16
3,5,202.12
4,6,214.5
5,7,219.34
6,8,196.16
7,9,226.41
8,10,253.15


In [None]:
# Checking average price differences by number of bedrooms
round(airbnb_lisbon.groupby("bedrooms")["listing_price"].mean().reset_index(), 2)

Unnamed: 0,bedrooms,listing_price
0,0,218.44
1,1,204.21
2,2,317.89
3,3,370.84
4,4,358.9
5,9,129.98
6,10,77.74
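
Group means like these can rest on very few listings; the rows for 9 and 10 bedrooms, for instance, look implausible next to their low prices. A minimal sketch on invented toy values (not the real data) shows how the group size and median can be reported next to the mean:

```python
import pandas as pd

# Hypothetical toy data: one lone listing with 10 bedrooms
toy = pd.DataFrame({
    "bedrooms": [1, 1, 1, 2, 2, 10],
    "listing_price": [200.0, 210.0, 190.0, 300.0, 320.0, 77.74],
})

summary = (
    toy.groupby("bedrooms")["listing_price"]
    .agg(["mean", "median", "count"])
    .round(2)
)
print(summary)
```

On the real dataframe the same aggregation would be `airbnb_lisbon.groupby("bedrooms")["listing_price"].agg(["mean", "median", "count"])`.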

