### AirBnb listings dataset cleaning
Source: http://insideairbnb.com/get-the-data.html - listings.csv.gz - Detailed Listings data for Barcelona

The specific file used is the one related to the 14th of August 2018. The reason why a 2018 dataset was used is because the analysis will be carried out in relation to Barcelona's average rental prices as provided by the Ajuntament of Barcelona, which are released every year. For obvious reasons the complete data available currently stops at 2018.

In [76]:
import pandas as pd
import numpy as np

In [77]:
airbnb_full = pd.read_csv("/home/emanuele/Desktop/IronHack/Projects/Project-Week-2-Barcelona/your-project/listings.csv")

Given the sheer number of attributes in the listings.csv file it is preferrable to list the columns instead of using head() or describe()

In [78]:
list(airbnb_full.columns)

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


The columns that will be kept will be related only the data strictly needed for the analysis: 'id', for matters of indexing; 'price', since our analysis will be fully related to the prices of the listings in comparison to rental prices; the columns related to neighbourhood data; 'accomodates', which stores the maximum number of people allowed; 'availability_30', which can be used to determine the percentage of booked days in the 30 days prior to the 14th of August 2018 to better relate the listing price to a rental price, which of course assumes a "booked" percentage of 100%. We will also be using 'room_type' initially to limit the data to listings related to the Entire house, exluding Private and Shared rooms since we assume the prices to be skewed. 

Given the presence of 3 different neighbourhood attributes and the possibility of carrying out a more detailed process that doesn't stop at a per-district analysis but goes deeper to a per-neighbourhood one (e.g.: El Clot, La Verneda y La Pau, etc. instead of simply the whole Sant Martì district) we will first check a sample for the aforementioned columns to determine which ones, if not all, are useful in our case.

We will also be checking the different types of listings through the 'room_type' column.

In [79]:
airbnb_full.loc[:,('neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed')].sample(30)

Unnamed: 0,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed
15132,,la Dreta de l'Eixample,Eixample
10843,La Sagrada Família,la Sagrada Família,Eixample
10985,Dreta de l'Eixample,la Dreta de l'Eixample,Eixample
7460,L'Antiga Esquerra de l'Eixample,l'Antiga Esquerra de l'Eixample,Eixample
11069,Turó de la Peira - Can Peguera,el Turó de la Peira,Nou Barris
7559,El Gòtic,el Barri Gòtic,Ciutat Vella
431,La Barceloneta,la Barceloneta,Ciutat Vella
14282,El Raval,el Raval,Ciutat Vella
1937,La Nova Esquerra de l'Eixample,la Nova Esquerra de l'Eixample,Eixample
12228,La Sagrada Família,la Sagrada Família,Eixample


In [80]:
airbnb_full['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object)

We decided to use the cleansed columns given the absence of NaN values. We also confirmed that 'neighborhood_group_cleansed' refers to a district while 'neighborhood_cleansed' refers to the smaller "Barrios".

That said, we proceed with creating a new DataFrame including Entire homes only (since they relate better to a rental house agreement when compared to a Private or Shared room) and then dropping all the unneeded columns.

In [81]:
airbnb_full = airbnb_full[airbnb_full['room_type'] == 'Entire home/apt']

In [82]:
airbnb_full = airbnb_full.loc[:,('id', 'neighbourhood_group_cleansed', 'price', 'neighbourhood' ,'accommodates','availability_30','last_review')]

We decided to cutoff all homes with an available accomodation of more than 6 people to try and limit the effect that apartments renovated as vacation homes (i.e. with more beds or sleeping spaces that a normal house of the same size would accomodate, and therefore more people) will not severely influence our final result

In [83]:
airbnb_full = airbnb_full[airbnb_full['accommodates'] <= 6]

The data related to each listing's availability (which will be used in the analysis to weight the listing price when compared to a traditional monthly rental as accurately as possible) is, as explained in the source website, parsed from the calendar on the AirBnb website. This calendar **DOES NOT** discern unavailable days between those caused by a pre-existing booking or a simple choice to make the listing unavailable by the host.

To try and clean the data of listings that have been inactive for a long time as of 2018-08-14 we will be filtering out of all the listings with a last review that dates **more than 2 months before** the date of the parsing. Given the sheer number of observations available we believe that even a timeframe of 2 only months, at least regarding the review, will offer enough data for the analysis to be relevant.

The next cell's purpose is to confirm that the last date that a review was added to whatever listing matches the date of the parsing, so that the following slice will respect the 2-months cutoff timeframe.

In [84]:
airbnb_full.loc[:,('id','last_review')].sort_values('last_review', ascending=False).head()

Unnamed: 0,id,last_review
4784,8060075,2018-08-13
3164,4543132,2018-08-13
14714,23890155,2018-08-13
6015,11264883,2018-08-13
1892,2208080,2018-08-13


Slicing the dataframe again. The slice is from the last review date, 2018-08-14, to 2 months prior, 2018-06-14. Also dropping NaN values.

In [85]:
airbnb_full = airbnb_full[(airbnb_full['last_review'] < '2018-08-14') & (airbnb_full['last_review'] > '2018-06-14')]
airbnb_full = airbnb_full[(airbnb_full['last_review'] != np.nan)]


We can now drop the 'last_review' column since it fulfilled its purpose.

In [86]:
airbnb_full = airbnb_full.drop('last_review', axis=1)

Checking prices for possible outliers.

In [87]:
airbnb_full.sort_values('price', ascending=False).head(50)

Unnamed: 0,id,neighbourhood_group_cleansed,price,neighbourhood,accommodates,availability_30
574,721473,Eixample,$999.00,el Fort Pienc,4,20
573,721470,Eixample,$999.00,el Fort Pienc,2,13
1876,2187751,Eixample,$99.00,el Fort Pienc,3,4
7798,14497583,Sarrià-Sant Gervasi,$99.00,El Putget i Farró,5,8
17253,26352364,Ciutat Vella,$99.00,El Raval,3,23
1993,2360266,Eixample,$99.00,,2,6
9851,18309933,Sant Martí,$99.00,El Poblenou,4,0
2036,2417803,Eixample,$99.00,La Sagrada Família,4,14
9725,18146433,Gràcia,$99.00,Vila de Gràcia,4,3
9660,18038313,Sants-Montjuïc,$99.00,Sants-Montjuïc,3,2


Given the sorting the prices are currently saved as a string. Since they are shown with a "$" sign we will need to drop it to convert the type to int.

In [89]:
airbnb_full['price'] = airbnb_full['price'].replace("\$", "", regex=True)

Converting to int

In [90]:
airbnb_full['price'] = pd.to_numeric(airbnb_full['price'])

ValueError: Unable to parse string "1,000.00" at position 12

The presence of one of more commas is preventing the conversion, we will therefore remove them.

In [91]:
airbnb_full['price'] = airbnb_full['price'].replace(",", "", regex=True)

In [92]:
airbnb_full['price'] = pd.to_numeric(airbnb_full['price'])

TODO: Random checks on listings with a very high price per night have shows that the price in the dataset **DOES NOT** match the one on the website by a VERY HIGH MARGIN. We will need to do further checks to determine a cutoff value for a price to be considered a real outlier.

In [96]:
airbnb_full.sort_values('price', ascending=False)

Unnamed: 0,id,neighbourhood_group_cleansed,price,neighbourhood,accommodates,availability_30
8773,16607374,Les Corts,3000.0,La Maternitat i Sant Ramon,2,3
8774,16607759,Les Corts,3000.0,La Maternitat i Sant Ramon,2,4
6549,12534863,Sants-Montjuïc,1000.0,Sants-Montjuïc,6,18
13597,22757076,Gràcia,1000.0,Vila de Gràcia,4,5
13593,22756295,Gràcia,1000.0,Vila de Gràcia,4,5
13587,22742743,Gràcia,1000.0,Vila de Gràcia,4,5
13585,22741727,Gràcia,1000.0,Vila de Gràcia,4,10
13582,22741274,Gràcia,1000.0,Vila de Gràcia,5,13
13581,22740570,Gràcia,1000.0,Vila de Gràcia,5,11
13580,22740205,Gràcia,1000.0,Vila de Gràcia,5,14


Creating a new column for "revenue", which will simply be the price multiplied by the actual number of booked days (calculated as 30 - availability_30) to estimate a monthly revenue and compare it with the rental prices.

In [99]:
map_func = lambda x: 30 - airbnb_full['availability_30']

In [100]:
df = airbnb_full

In [103]:
df['revenue']

0        0        29
8        24
9        24
10       1...
8        0        29
8        24
9        24
10       1...
9        0        29
8        24
9        24
10       1...
10       0        29
8        24
9        24
10       1...
11       0        29
8        24
9        24
10       1...
19       0        29
8        24
9        24
10       1...
20       0        29
8        24
9        24
10       1...
26       0        29
8        24
9        24
10       1...
27       0        29
8        24
9        24
10       1...
28       0        29
8        24
9        24
10       1...
33       0        29
8        24
9        24
10       1...
34       0        29
8        24
9        24
10       1...
36       0        29
8        24
9        24
10       1...
37       0        29
8        24
9        24
10       1...
38       0        29
8        24
9        24
10       1...
39       0        29
8        24
9        24
10       1...
40       0        29
8        24
9        24
10       1.

In [102]:
df

Unnamed: 0,id,neighbourhood_group_cleansed,price,neighbourhood,accommodates,availability_30
0,11368,Ciutat Vella,100.0,,4,1
8,31823,Eixample,75.0,La Nova Esquerra de l'Eixample,6,6
9,31958,Gràcia,65.0,Camp d'en Grassot i Gràcia Nova,4,6
10,32471,Gràcia,95.0,Camp d'en Grassot i Gràcia Nova,5,13
11,32711,Gràcia,140.0,Camp d'en Grassot i Gràcia Nova,6,14
19,40983,Eixample,95.0,Dreta de l'Eixample,4,3
20,44868,Ciutat Vella,82.0,El Raval,2,4
26,58512,Sant Martí,105.0,El Camp de l'Arpa del Clot,6,3
27,61444,Eixample,105.0,Dreta de l'Eixample,6,14
28,66037,Ciutat Vella,150.0,El Born,6,7
