# Exploring Airbnb data of London neighbourhoods

In this notebook, we will explore *Airbnb*'s data on London neighbourhoods in order to highlight *Airbnb* price differences among areas of England's capital.

This first exploration step is essential because we have a large amount of data gathered by the famous *Airbnb* website on a hundred cities of the world. However, we don't yet know the exact structure of the data, nor where and how the information we are interested in is stored. Even though our objective is to analyze multiple cities, we will concentrate on London first in order to develop methods that clean and isolate the needed information from this flow of data. These methods will then be generalized and used on other cities in the next notebook: *Airbnb Notebook - Cities Neighbourhoods*.

*Note: I chose London as the first city to analyze because I am going there in two weeks and I am still looking for an Airbnb at a fair price.*

# 0. Importing the data

First things first, we need to import the data and take a look at how it is structured.

In [61]:
# Usual imports
import pandas as pd
import numpy as np

# Import to display interactive maps (folium must be installed beforehand)
# To install it, run this command in your terminal: conda install -c ioos folium=0.2.1
import folium

We load London's data from an external CSV file. You can download the data from the following link: http://insideairbnb.com/get-the-data.html

In [62]:
df = pd.read_csv('data/london_listings.csv')

In [63]:
df.head(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,9554,https://www.airbnb.com/rooms/9554,20180910002454,2018-09-10,"Cozy, 3 minutes to Piccadilly Line",PLEASE CONTACT ME BEFORE BOOKING Homely apartm...,"Hello people, This is a bright, comfortable ro...",PLEASE CONTACT ME BEFORE BOOKING Homely apartm...,none,Details to follow..,...,f,,,f,f,strict_14_with_grace_period,t,f,4,1.71
1,11076,https://www.airbnb.com/rooms/11076,20180910002454,2018-09-10,The Sanctuary,The room has a double bed and a single foldawa...,This Listing is for The Sanctury The accommoda...,The room has a double bed and a single foldawa...,none,"Ealing Broadway, as short walk from our place ...",...,f,,,t,f,strict_14_with_grace_period,f,f,6,0.07


In [64]:
len(df.columns)

96

As we can see, there is a ton of information stored for every single Airbnb flat, and most of it is useless for our project's goal. Let's look at which columns can be used and which ones we definitely don't need.

In [65]:
df.columns.values

array(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name',
       'summary', 'space', 'description', 'experiences_offered',
       'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'picture_url', 'xl_picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
       'market', 'smart_location', 'country_code', 'country', 'latitude',
       'longitude', 'is_location_exact', 'property_type', 'room_type',
       'accommodates', 'bath

Let's check more precisely which columns may be interesting for us.

In [66]:
df['price'].head(4)

0     $35.00
1     $70.00
2     $45.00
3    $300.00
Name: price, dtype: object

In [67]:
df['number_of_reviews'].head(4)

0    133
1      2
2     14
3     35
Name: number_of_reviews, dtype: int64

In [68]:
df['description'].head(4)

0    PLEASE CONTACT ME BEFORE BOOKING Homely apartm...
1    The room has a double bed and a single foldawa...
2    My bright double bedroom with a large window h...
3    Open from June 2018 after a 3-year break, we a...
Name: description, dtype: object

In [69]:
df['neighbourhood_cleansed'].head(4)

0       Haringey
1         Ealing
2      Islington
3    Westminster
Name: neighbourhood_cleansed, dtype: object

In [70]:
df['smart_location'].head(4)

0       London, United Kingdom
1       Ealing, United Kingdom
2    Islington, United Kingdom
3       London, United Kingdom
Name: smart_location, dtype: object

After a quick overview of the data, we establish a list of only 14 columns that may be useful for our analysis in this project. Therefore, we create a generic function that loads only this information from the CSV files.

In [71]:
def load_cvs_airbnb(filename):
    
    interesting_columns = ['id',
                           'host_id',
                           'name',
                           'description',
                           'price',
                           'number_of_reviews',
                           'review_scores_rating',
                           'review_scores_location',
                           'review_scores_value',
                           'neighbourhood_cleansed',
                           'smart_location',
                           'country_code',
                           'latitude',
                           'longitude']
    
    df_city = pd.read_csv(filename, usecols=interesting_columns)
    
    return df_city
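
The column order returned by `load_cvs_airbnb` may look surprising, since it does not follow the order of `interesting_columns`. As far as we know, `read_csv` ignores the order of the `usecols` list and keeps the order of the columns in the file. Here is a minimal sketch of this behaviour on a tiny in-memory CSV; the sample values and the names `sample_csv`, `wanted`, `file_order` and `df_sample` are made up for the illustration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a listings file (made-up values, illustration only)
sample_csv = io.StringIO(
    "id,name,host_id,price,latitude\n"
    "1,Flat A,10,$35.00,51.5\n"
    "2,Flat B,11,$70.00,51.6\n"
)

# The order of this list differs from the file order on purpose
wanted = ['id', 'host_id', 'name', 'price']
df_sample = pd.read_csv(sample_csv, usecols=wanted)

# The columns come back in file order, not in the order of `wanted`
file_order = list(df_sample.columns)
print(file_order)

# Select the columns explicitly if a specific order matters
df_sample = df_sample[wanted]
print(list(df_sample.columns))
```

If the order matters for later steps, selecting the columns explicitly after loading, as in the last lines, is one simple option.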

In [72]:
df_lon = load_cvs_airbnb('data/london_listings.csv')
df_lon.set_index(['id'], inplace=True)
df_lon.sort_index(ascending=True, inplace=True)
df_lon.head(3)

id,name,description,host_id,neighbourhood_cleansed,smart_location,country_code,latitude,longitude,price,number_of_reviews,review_scores_rating,review_scores_location,review_scores_value
9554,"Cozy, 3 minutes to Piccadilly Line",PLEASE CONTACT ME BEFORE BOOKING Homely apartm...,31655,Haringey,"London, United Kingdom",GB,51.587767,-0.105666,$35.00,133,97.0,9.0,10.0
11076,The Sanctuary,The room has a double bed and a single foldawa...,40471,Ealing,"Ealing, United Kingdom",GB,51.515645,-0.314508,$70.00,2,90.0,9.0,9.0
13913,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,54730,Islington,"Islington, United Kingdom",GB,51.568017,-0.111208,$45.00,14,95.0,9.0,9.0


In [73]:
df_lon.index.is_unique

True

We can use *Airbnb*'s preconfigured **id** as our DataFrame index since the ids are unique. This will be very useful when we put multiple cities into a single DataFrame.
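
When several cities are later combined, it may be worth checking that the ids stay unique across files. The following is only a sketch on two made-up frames (`df_city_a` and `df_city_b` are hypothetical names): `pd.concat` with `verify_integrity=True` raises a `ValueError` when index values overlap, and passing `keys` builds a (city, id) index that stays unique even if ids clash.

```python
import pandas as pd

# Two hypothetical city frames indexed by listing id (made-up values)
df_city_a = pd.DataFrame({'price': [35.0, 70.0]},
                         index=pd.Index([9554, 11076], name='id'))
df_city_b = pd.DataFrame({'price': [50.0]},
                         index=pd.Index([9554], name='id'))  # id clashes with city A

# Option 1: fail loudly if the same id appears in two cities
try:
    pd.concat([df_city_a, df_city_b], verify_integrity=True)
    clash_detected = False
except ValueError:
    clash_detected = True
print(clash_detected)

# Option 2: keep the city as an extra index level, so (city, id) is unique
combined = pd.concat([df_city_a, df_city_b],
                     keys=['city_a', 'city_b'], names=['city', 'id'])
print(combined.index.is_unique)
```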

# 1. Cleaning the data

In this first exercise, we only want to compare the prices of the flats grouped by neighbourhood. Therefore, we can drop all information except the **neighbourhood** name, the Airbnb **price**, and the Airbnb **name**.

In [74]:
df_london_ngbh = df_lon.copy()[['neighbourhood_cleansed', 'price', 'name']]

# Let's rename 'neighbourhood_cleansed' column for more readability
df_london_ngbh.rename(columns={'neighbourhood_cleansed': 'neighbourhood'}, inplace = True)
df_london_ngbh.head(5)

id,neighbourhood,price,name
9554,Haringey,$35.00,"Cozy, 3 minutes to Piccadilly Line"
11076,Ealing,$70.00,The Sanctuary
13913,Islington,$45.00,Holiday London DB Room Let-on going
17402,Westminster,$300.00,Superb 3-Bed/2 Bath & Wifi: Trendy W1
24328,Wandsworth,$150.00,Battersea 2 bedroom house & parking


The first observation we can make is that the price is a string and not an integer or a float. This is annoying, especially if we want to compute the mean price of each neighbourhood. Thus, we need to transform these strings into float values.

Moreover, by looking into the data we saw that some prices are set to $0.00, which is obviously not correct. Let's find out how many entries are affected by this mistake:

In [75]:
df_london_ngbh[df_london_ngbh['price'] == '$0.00'].head(3)

id,neighbourhood,price,name
14352218,Hackney,$0.00,Bedroom for 2 in huge 3000sqf+ Hackney warehouse
18373061,Waltham Forest,$0.00,"Cosy loft with ensuite in relaxed, friendly home"
18545675,Wandsworth,$0.00,Wimbledon tennis own king size room


In [76]:
nbr_zero_price = len(df_london_ngbh[df_london_ngbh['price'] == '$0.00'])
nbr_total = len(df_london_ngbh)

print('Number of airbnb with null price: %d' %(nbr_zero_price))
print('Number of airbnb in total: %d' %(nbr_total))
print('Percentage of data to discard: %f' %( float(nbr_zero_price)/float(nbr_total) *100))


Number of airbnb with null price: 53
Number of airbnb in total: 75506
Percentage of data to discard: 0.070193


As we can see here, the Airbnb flats with a zero price represent about 0.07% of the total, far less than 1%. Therefore, we can safely delete them without losing too much information.
Let's create a function that "cleans" the price data by transforming it into floats and deleting the zero values.

In [77]:
# Convert a price string such as '$9,000.00' into a float
def priceStringIntoFloat(priceString):
    if isinstance(priceString, str):
        # Strip the leading dollar sign and the thousands separators
        priceString = priceString.split('$')[1]
        priceString = priceString.replace(",", "")
    return float(priceString)

# Clean data: convert price strings into floats and drop zero prices
def cleanPricesData(df):
    df_cleaned = df.copy()
    
    # Convert price strings into floats
    df_cleaned['price'] = df_cleaned['price'].apply(priceStringIntoFloat)
    
    # Drop the listings whose price is zero
    df_cleaned = df_cleaned[df_cleaned['price'] != 0.0]
    
    return df_cleaned
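
As a side note, the same cleaning can probably be written without a per-row function by using pandas string methods. This is only a sketch on made-up prices (the frame `raw` and its values are hypothetical), not the function used in the rest of this notebook.

```python
import pandas as pd

# Made-up price strings in the same format as the Airbnb data
raw = pd.DataFrame({'price': ['$35.00', '$9,000.00', '$0.00']}, index=[1, 2, 3])

cleaned = raw.copy()

# Remove the dollar sign and the thousands separators, then convert to float
cleaned['price'] = (cleaned['price']
                    .str.replace(r'[$,]', '', regex=True)
                    .astype(float))

# Drop the listings whose price is zero
cleaned = cleaned[cleaned['price'] != 0.0]

print(cleaned['price'].tolist())
```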

In [78]:
df_london_ngbh_cleaned = cleanPricesData(df_london_ngbh)

# Let's sort the values to see more clearly
df_london_ngbh_cleaned.sort_values(
                    by = ['neighbourhood', 'price', 'name'],
                    ascending = [True, False, True],
                    inplace = True)

df_london_ngbh_cleaned.head(5)

id,neighbourhood,price,name
24962253,Barking and Dagenham,440.0,Greater London Essex 11 people location property
23890501,Barking and Dagenham,331.0,Spacious Studio
27572868,Barking and Dagenham,250.0,Beautiful 2 Bedroom London Serviced Flat Barking
24399982,Barking and Dagenham,200.0,London (Greater) 4Bed 3Bath house
26729229,Barking and Dagenham,200.0,Luxury London 6 Bedroom House


In [79]:
df_london_ngbh_cleaned.tail(5)

id,neighbourhood,price,name
16431832,Westminster,12.0,My Listing
11323463,Westminster,10.0,Baker Street Sofa
18930044,Westminster,10.0,Fantastic 2 bed apartment close to Baker St st...
19186132,Westminster,10.0,Fantastic 2 bed apartment close to Baker St st...
5483803,Westminster,10.0,Immaculate 3 Bedroom Flat Central London.


Now that we have reformatted the prices and sorted the data, we want to compute the mean price for each area. However, we still have to deal with duplicates. We consider two entries to be duplicates if they are located in the same neighbourhood, have the same price, and have exactly the same name. For example, in the DataFrame above, *Fantastic 2 bed apartment close to Baker St st...* appears twice with different ids.

We can write a function that first sorts the values by neighbourhood, price and name, and then deletes the duplicates by checking whether two consecutive entries have exactly the same name.

In [80]:
def deleteDuplicates(df, printInfo=False):
    
    df_cleaned = df.copy()
    df_cleaned.sort_values(by = ['neighbourhood', 'price', 'name'], ascending = [True, False, True], inplace = True)
    
    # Because the data has been sorted, we only have to check whether a flat's name is identical
    # to the previous flat's name in the DataFrame. If it is, we drop this line.
    previous_name = ''
    num_deleted = 0
    for i in df_cleaned.index:
        current_name = df_cleaned.loc[i, 'name']
        
        # Check if two consecutive entries have the same name
        if(current_name == previous_name):
            # If yes, we drop the corresponding line
            df_cleaned.drop([i], inplace = True)
            num_deleted += 1
            
            # If we want to know which entries are deleted
            if(printInfo):
                print('deleting: ' + current_name)
        else:
            # If not, we don't do anything and continue to check the rest of the data
            previous_name = current_name
    
    print('TOTAL: %d duplicate(s) deleted' %num_deleted)
    return df_cleaned

In [81]:
df_london_ngbh_cleaned = deleteDuplicates(df_london_ngbh_cleaned, False) # Put 'True' to see which flats are deleted

TOTAL: 476 duplicate(s) deleted


Note that 476 entries were duplicates, which represents 0.63% of the data.

We can combine these two functions into a single one:
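
The duplicate definition given earlier (same neighbourhood, same price, same name) can probably also be expressed with pandas' built-in `drop_duplicates`, which avoids the explicit loop. This is only a sketch on made-up rows, not the function used above. Because `deleteDuplicates` compares only the names of consecutive rows, the two approaches may not remove exactly the same entries, so the count of 476 is not guaranteed to be reproduced.

```python
import pandas as pd

# Made-up listings: ids 101 and 102 share neighbourhood, price and name
listings = pd.DataFrame(
    {'neighbourhood': ['Westminster', 'Westminster', 'Westminster', 'Camden'],
     'price': [10.0, 10.0, 12.0, 10.0],
     'name': ['Flat X', 'Flat X', 'Flat X', 'Flat X']},
    index=pd.Index([101, 102, 103, 104], name='id'))

# Keep the first occurrence of each (neighbourhood, price, name) combination
deduplicated = listings.drop_duplicates(
    subset=['neighbourhood', 'price', 'name'], keep='first')

num_deleted = len(listings) - len(deduplicated)
print('TOTAL: %d duplicate(s) deleted' % num_deleted)
```

In this toy example only id 102 is removed; ids 103 and 104 share the name but differ in price or neighbourhood, so they are kept.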

In [82]:
def cleanNeighbourhoodsDataFrame(df, printInfo=False):
    df_cleaned = cleanPricesData(df)
    df_cleaned = deleteDuplicates(df_cleaned, printInfo)
    return df_cleaned

In [83]:
df_london_ngbh_cleaned = cleanNeighbourhoodsDataFrame(df_london_ngbh)
df_london_ngbh_cleaned.head(10)

TOTAL: 476 duplicate(s) deleted


id,neighbourhood,price,name
24962253,Barking and Dagenham,440.0,Greater London Essex 11 people location property
23890501,Barking and Dagenham,331.0,Spacious Studio
27572868,Barking and Dagenham,250.0,Beautiful 2 Bedroom London Serviced Flat Barking
24399982,Barking and Dagenham,200.0,London (Greater) 4Bed 3Bath house
26729229,Barking and Dagenham,200.0,Luxury London 6 Bedroom House
11274768,Barking and Dagenham,200.0,SPACIOUS 1-bed 1-bath flat in fantastic location
23805633,Barking and Dagenham,199.0,5 bedrooms House.Tourists welcome! LittleKingdom
26566031,Barking and Dagenham,198.0,"A place called home, uniquely beautiful"
25894417,Barking and Dagenham,187.0,DAGENHAM 4 BED HOUSE WITH FREE PARKING
19199377,Barking and Dagenham,175.0,"MAGNIFICENT4 BEDROOM, 2LOUNGE HOUSE"


# 2. Comparing the price among London neighbourhoods

Now that our data is clean and free of duplicates, we can compute the mean price for each London neighbourhood, determine which area is the most expensive, and show the result on an interactive map.

First, let's create a new DataFrame in which we group the information by neighbourhood.

In [84]:
serie_london_nghb = df_london_ngbh_cleaned['neighbourhood'].value_counts()
serie_london_nghb

Westminster               8011
Tower Hamlets             7316
Hackney                   5757
Kensington and Chelsea    5294
Camden                    5179
Islington                 4689
Southwark                 4495
Lambeth                   4479
Hammersmith and Fulham    3848
Wandsworth                3821
Brent                     2173
Lewisham                  2021
Haringey                  1972
Newham                    1833
Ealing                    1538
Greenwich                 1483
Barnet                    1375
Merton                    1269
Waltham Forest            1244
Richmond upon Thames      1074
Hounslow                   927
Croydon                    924
Bromley                    579
Redbridge                  573
Enfield                    508
Hillingdon                 459
Kingston upon Thames       443
City of London             424
Harrow                     411
Barking and Dagenham       243
Sutton                     227
Havering                   197
Bexley  

In [85]:
df_london_ngbh_means = pd.DataFrame({'neighbourhood': serie_london_nghb.index.values, 
                                      'counts': serie_london_nghb.values,
                                      'mean_price': 0}).reindex(['neighbourhood','counts','mean_price'], axis=1)
df_london_ngbh_means

Unnamed: 0,neighbourhood,counts,mean_price
0,Westminster,8011,0
1,Tower Hamlets,7316,0
2,Hackney,5757,0
3,Kensington and Chelsea,5294,0
4,Camden,5179,0
5,Islington,4689,0
6,Southwark,4495,0
7,Lambeth,4479,0
8,Hammersmith and Fulham,3848,0
9,Wandsworth,3821,0


Now we can define a new function that computes the mean of all Airbnb prices in one neighbourhood by first summing all the prices and then dividing the sum by the total number of flats counted.

In [88]:
def computingMeanPrice(df_neighbourhood, df_prices):
    for neighbourhoodName in df_neighbourhood['neighbourhood']:
        
        sumPrices = df_prices[df_prices['neighbourhood'] == neighbourhoodName]['price'].sum()
        totalNum = df_neighbourhood[df_neighbourhood['neighbourhood'] == neighbourhoodName]['counts']
        meanPrice = sumPrices / totalNum
        
        index = df_neighbourhood[df_neighbourhood['neighbourhood'] == neighbourhoodName].index
        
        df_neighbourhood.loc[index,'mean_price'] = meanPrice
    return
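
For comparison, the counts and the mean prices can probably be obtained in a single step with `groupby`. The following is a sketch on made-up prices, assuming a frame with the same `neighbourhood` and `price` columns as ours; the names `df_prices` and `summary` are hypothetical.

```python
import pandas as pd

# Made-up cleaned prices with the same two columns as our data
df_prices = pd.DataFrame(
    {'neighbourhood': ['Camden', 'Camden', 'Ealing', 'Camden'],
     'price': [100.0, 140.0, 70.0, 120.0]})

# Count the flats and average the prices per neighbourhood in one pass
summary = (df_prices.groupby('neighbourhood')['price']
           .agg(['count', 'mean'])
           .rename(columns={'count': 'counts', 'mean': 'mean_price'})
           .sort_values('counts', ascending=False)
           .reset_index())

print(summary)
```

The resulting frame has the same three columns (`neighbourhood`, `counts`, `mean_price`) as the one filled in by `computingMeanPrice`.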

In [89]:
computingMeanPrice(df_london_ngbh_means, df_london_ngbh_cleaned)
df_london_ngbh_means

Unnamed: 0,neighbourhood,counts,mean_price
0,Westminster,8011,179.536387
1,Tower Hamlets,7316,84.59322
2,Hackney,5757,84.724336
3,Kensington and Chelsea,5294,188.415754
4,Camden,5179,127.554161
5,Islington,4689,105.097036
6,Southwark,4495,92.792659
7,Lambeth,4479,86.808216
8,Hammersmith and Fulham,3848,112.93815
9,Wandsworth,3821,106.194452


# 3. Interactive visualization

Now that we have all the information we were looking for, we can display it in an interactive and visual way. To do so, we will use the famous and very helpful library called **folium**.

**Note that the map does not display correctly in GitHub's preview**. You can either download the file *london_airbnb_prices.html* and open it in your browser, or check the image *london_airbnb_prices.png* in the same *maps* folder.

In [117]:
def createInteractiveNeighbourhoodMap(df, jsonName, loc, zoom=10, colors='YlGn'):
    
    # Create a folium map 
    geoMap = folium.Map(location = loc, zoom_start = zoom)
    
    # Color the map with our data
    geoMap.choropleth(geo_path = jsonName,
                      data = df,
                      columns=['neighbourhood', 'mean_price'],
                      key_on='feature.properties.neighbourhood',
                      threshold_scale=[0, 60, 80, 100, 120, 200],
                      fill_color =colors, fill_opacity = 0.7, line_opacity = 0.2,
                      legend_name='Airbnb mean price per neighbourhood')
    
    return geoMap

In [120]:
map_london = createInteractiveNeighbourhoodMap(df_london_ngbh_means, 'geojson/london_neighbourhoods.geojson', [51.4894, -0.1], 10)
map_london.save('maps/london_airbnb_prices.html')
map_london
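
Since the map joins our data to the GeoJSON through `key_on='feature.properties.neighbourhood'`, a neighbourhood whose name is spelled differently in the two sources may not be coloured as expected. Here is a minimal sanity check, sketched on a made-up and heavily simplified GeoJSON structure; with the real file, the dictionary would come from `json.load`, and `data_names` would come from the `neighbourhood` column.

```python
# Hypothetical, heavily simplified GeoJSON content (real files also carry geometry)
geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'neighbourhood': 'Camden'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'neighbourhood': 'Ealing'}, 'geometry': None},
    ],
}

# Made-up set of names standing in for the 'neighbourhood' column of our data
data_names = {'Camden', 'Ealing', 'Westminster'}

# Names found under feature.properties.neighbourhood
geo_names = {feature['properties']['neighbourhood'] for feature in geojson['features']}

# Names that appear on one side only
missing_in_geojson = sorted(data_names - geo_names)
missing_in_data = sorted(geo_names - data_names)

print('In the data but not in the GeoJSON:', missing_in_geojson)
print('In the GeoJSON but not in the data:', missing_in_data)
```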