### Airbnb listings dataset cleaning
Source: http://insideairbnb.com/get-the-data.html - listings.csv.gz - Detailed Listings data for Barcelona

The file used is the snapshot dated 14 August 2018. A 2018 dataset was chosen because the analysis compares listing prices with Barcelona's average rental prices as provided by the Ajuntament of Barcelona, which are released every year. At the time of writing, the latest complete year available is 2018.

In [175]:
import pandas as pd
import numpy as np

In [176]:
airbnb_full = pd.read_csv("/home/emanuele/Desktop/IronHack/Projects/Project-Week-2-Barcelona/your-project/listings.csv")
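
Since the source distributes the data as listings.csv.gz, it may be convenient to read the compressed file directly and load only the columns of interest. The following is a minimal sketch under that assumption; the tiny file it writes and the names `demo_path` and `demo` are made up for illustration and are not part of the original notebook.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for listings.csv.gz: a tiny gzip-compressed CSV.
demo_path = os.path.join(tempfile.mkdtemp(), "listings_demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("id,name,price,room_type\n")
    fh.write("1,Flat A,$80.00,Entire home/apt\n")
    fh.write("2,Room B,$35.00,Private room\n")

# read_csv infers gzip compression from the ".gz" suffix by default;
# usecols restricts loading to the listed columns.
demo = pd.read_csv(demo_path, usecols=["id", "price", "room_type"])
print(list(demo.columns))
```

Loading a subset of columns up front would make the later column-dropping step unnecessary, at the cost of deciding the column list before inspecting the data.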

Given the sheer number of attributes in the listings.csv file, it is preferable to list the columns instead of using head() or describe().

In [177]:
list(airbnb_full.columns)

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


Only the columns strictly needed for the analysis will be kept: 'id', for indexing; 'price', since the analysis compares listing prices with rental prices; the columns holding neighbourhood data; 'accommodates', which stores the maximum number of people allowed; and 'availability_365', which is used to estimate the share of booked days so that the listing price can be related to a rental price, which of course assumes a "booked" percentage of 100%. 'last_review' and 'minimum_nights' are also kept for the filtering steps further down. We will also use 'room_type' initially to limit the data to entire homes, excluding private and shared rooms, since we assume their prices to be skewed.

Since there are 3 different neighbourhood attributes, and since the analysis could go beyond a per-district level down to a per-neighbourhood one (e.g. El Clot, La Verneda y La Pau, etc. instead of simply the whole Sant Martí district), we will first check a sample of these columns to determine which of them, if not all, are useful in our case.

We will also be checking the different types of listings through the 'room_type' column.

In [178]:
airbnb_full.loc[:, ['neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed']].sample(30)

Unnamed: 0,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed
11842,La Nova Esquerra de l'Eixample,la Nova Esquerra de l'Eixample,Eixample
4591,Sants-Montjuïc,el Poble Sec,Sants-Montjuïc
399,La Sagrada Família,la Sagrada Família,Eixample
10140,L'Antiga Esquerra de l'Eixample,la Nova Esquerra de l'Eixample,Eixample
8434,El Raval,el Raval,Ciutat Vella
11725,,la Vila de Gràcia,Gràcia
1246,Dreta de l'Eixample,la Dreta de l'Eixample,Eixample
15222,Sants-Montjuïc,Sants - Badal,Sants-Montjuïc
14127,Dreta de l'Eixample,la Dreta de l'Eixample,Eixample
14659,El Gòtic,el Barri Gòtic,Ciutat Vella


In [179]:
airbnb_full['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object)

We decided to use the cleansed columns since they contain no NaN values. We also confirmed that 'neighbourhood_group_cleansed' refers to a district, while 'neighbourhood_cleansed' refers to the smaller "barrios" within it.
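
A check along these lines could back up both observations. This is only a sketch on a made-up three-row sample (the `sample` frame is hypothetical); on the real data the same two calls would be run against `airbnb_full`.

```python
import numpy as np
import pandas as pd

# Made-up sample mimicking the three neighbourhood columns.
sample = pd.DataFrame({
    "neighbourhood": ["El Raval", np.nan, "El Gòtic"],
    "neighbourhood_cleansed": ["el Raval", "la Vila de Gràcia", "el Barri Gòtic"],
    "neighbourhood_group_cleansed": ["Ciutat Vella", "Gràcia", "Ciutat Vella"],
})

# Count missing values per column.
missing = sample.isna().sum()
print(missing)

# Number of distinct neighbourhoods inside each district.
barris_per_district = (
    sample.groupby("neighbourhood_group_cleansed")["neighbourhood_cleansed"].nunique()
)
print(barris_per_district)
```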

That said, we proceed with creating a new DataFrame including Entire homes only (since they relate better to a rental house agreement when compared to a Private or Shared room) and then dropping all the unneeded columns.

In [180]:
airbnb_full = airbnb_full[airbnb_full['room_type'] == 'Entire home/apt']

In [181]:
airbnb_full = airbnb_full.loc[:, ['id', 'neighbourhood_group_cleansed', 'price', 'neighbourhood_cleansed', 'accommodates', 'availability_365', 'last_review', 'minimum_nights']]

We decided to cut off all homes that accommodate more than 6 people, to limit the influence on our final result of apartments renovated as vacation homes (i.e. fitted with more beds or sleeping spaces, and therefore hosting more people, than a normal house of the same size would).

In [182]:
# Keep only listings with a minimum stay of at least 30 nights
airbnb_full = airbnb_full[airbnb_full['minimum_nights'] >= 30]

In [183]:
airbnb_full = airbnb_full[airbnb_full['accommodates'] <= 6]

The data on each listing's availability (which will be used in the analysis to weight the listing price so that it compares as accurately as possible with a traditional monthly rental) is, as explained on the source website, parsed from the calendar on the Airbnb website. This calendar **DOES NOT** distinguish between days that are unavailable because of an existing booking and days the host has simply chosen to make unavailable.

To clean the data of listings that had been inactive for a long time as of 2018-08-14, we will filter out all the listings whose last review dates **more than 2 months before** the date of the parsing. Given the sheer number of observations available, we believe that even a timeframe of only 2 months, at least regarding the review, will offer enough data for the analysis to be relevant.

The next cell's purpose is to check the latest date that a review was added, so that the following slice will respect the 2-months cutoff timeframe.

In [184]:
airbnb_full.loc[:, ['id', 'last_review']].sort_values('last_review', ascending=False).head()

Unnamed: 0,id,last_review
6015,11264883,2018-08-13
18799,27472765,2018-08-12
14689,23866376,2018-08-11
1642,1817640,2018-08-11
11589,20264909,2018-08-11


Slicing the DataFrame again. The slice runs from the latest review date, 2018-08-13, back to 2 months prior, 2018-06-13. NaN values are dropped as well.

In [185]:
airbnb_full = airbnb_full[(airbnb_full['last_review'] <= '2018-08-13') & (airbnb_full['last_review'] >= '2018-06-13')]
# NaN never compares equal to anything (itself included), so use notna() to drop missing dates
airbnb_full = airbnb_full[airbnb_full['last_review'].notna()]
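
As an alternative to hard-coded date strings, the two-month window could be derived from the data itself. The sketch below uses made-up review dates and assumes the window should end at the latest review found; `reviews`, `start` and `end` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up review dates; None stands for a listing with no reviews.
reviews = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "last_review": ["2018-08-13", "2018-06-12", None, "2018-07-01"],
})

# Parse the strings into datetimes; missing values become NaT.
reviews["last_review"] = pd.to_datetime(reviews["last_review"])

# Assumption: the window ends at the latest review found in the data.
end = reviews["last_review"].max()
start = end - pd.DateOffset(months=2)

# between() is inclusive on both ends and evaluates to False for NaT,
# so listings without reviews are dropped by the same mask.
recent = reviews[reviews["last_review"].between(start, end)]
print(recent["id"].tolist())
```

Working with real datetimes avoids relying on the lexicographic ordering of ISO date strings, although that ordering does happen to match chronological order for the `YYYY-MM-DD` format.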

We can now drop the 'last_review' column since it fulfilled its purpose.

In [186]:
airbnb_full = airbnb_full.drop('last_review', axis=1)

Checking prices for possible outliers.

In [187]:
airbnb_full.sort_values('price', ascending=False).head(50)

Unnamed: 0,id,neighbourhood_group_cleansed,price,neighbourhood_cleansed,accommodates,availability_365,minimum_nights
8752,16556943,Ciutat Vella,$99.00,el Barri Gòtic,4,305,31
7570,14152120,Ciutat Vella,$99.00,el Barri Gòtic,4,239,31
16482,25541279,Eixample,$99.00,la Sagrada Família,3,34,31
5523,9937559,Ciutat Vella,$99.00,el Barri Gòtic,6,316,31
120,206846,Ciutat Vella,$98.00,"Sant Pere, Santa Caterina i la Ribera",2,220,31
277,417239,Gràcia,$97.00,la Vila de Gràcia,5,287,30
1576,1691141,Eixample,$96.00,la Nova Esquerra de l'Eixample,5,363,30
6257,11762804,Ciutat Vella,$95.00,el Barri Gòtic,2,321,32
3215,4668430,Gràcia,$95.00,la Vila de Gràcia,4,276,32
10809,19464663,Eixample,$95.00,la Dreta de l'Eixample,6,128,32


Judging by the sort order, the prices are currently stored as strings. Since they are shown with a "$" sign (and possibly a thousands separator), we need to strip both before converting the column to a numeric type.

In [188]:
# Raw string so that "\$" reaches the regex engine as an escaped dollar sign
airbnb_full['price'] = airbnb_full['price'].replace(r"\$", "", regex=True)
airbnb_full['price'] = airbnb_full['price'].replace(",", "", regex=True)

Converting to a numeric type (float, since the prices carry decimals)

In [189]:
airbnb_full['price'] = pd.to_numeric(airbnb_full['price'])
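
The effect of the string type on sorting, and of the conversion, can be seen on a handful of made-up prices. This is a sketch only; `prices` and `numeric` are illustrative names, and the single-pass character class is a variation on the two replacements used above.

```python
import pandas as pd

# Made-up prices in the "$1,234.00" format used by the dataset.
prices = pd.Series(["$99.00", "$220.00", "$1,050.00"])

# As strings, values sort character by character, so "$99.00" ranks highest.
print(prices.sort_values(ascending=False).tolist())

# Strip "$" and "," in one pass, then convert to floats.
numeric = pd.to_numeric(prices.str.replace(r"[$,]", "", regex=True))
print(numeric.sort_values(ascending=False).tolist())
```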

Checking for outliers on high prices.

In [190]:
airbnb_full.sort_values('price', ascending=False)

Unnamed: 0,id,neighbourhood_group_cleansed,price,neighbourhood_cleansed,accommodates,availability_365,minimum_nights
8732,16506765,Ciutat Vella,220.0,la Barceloneta,2,171,32
14079,23244848,Ciutat Vella,200.0,la Barceloneta,2,267,32
14691,23868358,Ciutat Vella,199.0,el Raval,4,174,32
7229,13694580,Ciutat Vella,190.0,el Raval,2,125,32
7323,13811701,Eixample,180.0,la Dreta de l'Eixample,4,321,32
6800,13103577,Ciutat Vella,180.0,el Barri Gòtic,6,365,32
16600,25712520,Ciutat Vella,172.0,la Barceloneta,2,173,31
9098,17195079,Ciutat Vella,169.0,"Sant Pere, Santa Caterina i la Ribera",6,175,32
18794,27471887,Ciutat Vella,160.0,el Raval,4,133,31
7612,14209613,Ciutat Vella,155.0,"Sant Pere, Santa Caterina i la Ribera",4,365,32


Creating a new column for "revenue", which will simply be the price multiplied by the number of booked days in the year (calculated as 365 - availability_365) and divided by 12, to estimate a monthly revenue and compare it with the rental prices.

In [191]:
# Monthly revenue estimate: nightly price times booked days in the year, divided by 12 months
airbnb_full['revenue'] = (airbnb_full['price'] * (365 - airbnb_full['availability_365']) / 12).round(2)
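
The calculation in the cell above can be checked by hand on a single listing. The helper `monthly_revenue` below is hypothetical and the figures are made up; it also inherits the assumption that every unavailable day is a booked day.

```python
def monthly_revenue(price, availability_365):
    """Estimated monthly revenue: nightly price times booked nights per year, divided by 12."""
    # Assumption: every day not marked as available counts as booked.
    booked_nights = 365 - availability_365
    return round(price * booked_nights / 12, 2)


# Made-up listing: 100 per night, available 65 days over the next year.
print(monthly_revenue(100.0, 65))  # 100 * 300 / 12 = 2500.0
```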

In [193]:
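# numeric_only=True skips the text column 'neighbourhood_cleansed', which cannot be averaged
price_by_district = airbnb_full.groupby('neighbourhood_group_cleansed').mean(numeric_only=True).round(2)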
price_by_district

neighbourhood_group_cleansed,id,price,accommodates,availability_365,minimum_nights,revenue
Ciutat Vella,15951497.43,68.64,3.44,246.5,34.94,644.72
Eixample,16679385.24,72.3,3.75,241.73,32.06,754.13
Gràcia,13121482.21,70.0,3.68,225.21,31.64,808.24
Horta-Guinardó,17806375.17,48.33,2.83,207.83,31.33,789.11
Les Corts,8624766.5,81.0,5.0,226.75,31.5,643.02
Nou Barris,20772098.67,46.0,3.33,189.33,31.33,767.61
Sant Andreu,20929894.0,47.5,3.0,170.5,33.5,712.08
Sant Martí,17285030.22,61.03,3.97,256.09,33.03,526.14
Sants-Montjuïc,17093859.89,57.11,3.33,233.67,32.96,645.17
Sarrià-Sant Gervasi,13415083.75,59.38,3.25,273.0,31.5,514.33
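
The per-neighbourhood breakdown mentioned earlier could follow the same pattern. The sketch below runs on made-up listings (the `listings` frame and its revenue figures are invented) and also reports how many listings back each mean, which may help judge how reliable an average is.

```python
import pandas as pd

# Made-up listings; column names follow the notebook.
listings = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Eixample", "Eixample", "Eixample", "Gràcia"],
    "neighbourhood_cleansed": ["la Sagrada Família", "la Sagrada Família",
                               "la Dreta de l'Eixample", "la Vila de Gràcia"],
    "revenue": [600.0, 800.0, 900.0, 750.0],
})

# Mean revenue and number of listings per (district, neighbourhood) pair.
by_barri = (
    listings
    .groupby(["neighbourhood_group_cleansed", "neighbourhood_cleansed"])["revenue"]
    .agg(["mean", "count"])
    .round(2)
)
print(by_barri)
```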


In [196]:
price_by_district = pd.DataFrame(price_by_district.loc[:,'revenue'])
price_by_district.to_csv('price_by_district.csv')

In [145]:
# 'availability_30' is not among the columns kept in In [181], so this lookup raises a KeyError
test = airbnb_full['availability_30'].iloc[:50]
test


KeyError: 'availability_30'

In [None]:
test2 = test
test

In [None]:
for index in range(len(test)):
    if test.iloc[index] < 10:
        # Values below 10 take the previous value when that one is at least 10, otherwise NaN
        if index > 0 and test.iloc[index - 1] >= 10:
            test.iloc[index] = test.iloc[index - 1]
        else:
            test.iloc[index] = np.nan

test
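
The loop above appears to carry the last value of at least 10 forward over any smaller values. Assuming that reading is right and that the series has no missing values to begin with, the same result can be sketched without a loop; the `values` series below is made up.

```python
import pandas as pd

# Made-up availability values.
values = pd.Series([5, 20, 3, 7, 15, 2])

# Mask everything below 10 as NaN, then carry the last valid value forward.
cleaned = values.where(values >= 10).ffill()
print(cleaned.tolist())
```

The leading 5 has no earlier valid value to inherit, so it stays NaN, which matches what the loop does for the first element.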

In [None]:
test