## Berlin AirBnB

Berlin is one of the most important cities in Europe and was at the heart of major historical events of the 20th century: it is definitely a city to visit at least once in a lifetime.

AirBnB widens the options for staying in Berlin, so let's take a look at what is on offer in terms of properties, availability, costs and so on.

The dataset used here is, for the sake of brevity, a _clean_ version of the one available in the Kernel (so that we can go straight to the analysis), plus another dataset which provides geographical coordinates of the city (used to create what I hope are easy-to-understand maps of the city).


<img src="https://i.imgur.com/PTZvS7M.jpg" alt="Drawing" style="width: 80%;"/>


###### <font size="-3" color="grey">Photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash</font>

## Index

1. [Loading data](#loading_data)<br>
2. [Overview](#overview)<br>
    2.1 [Property Types](#property_types)<br>
    2.2 [Property Sizes](#property_sizes)<br>
    2.3 [Where to sleep](#where_to_sleep)<br>
    2.4 [How long are these properties available during the year?](#available_for)<br>
3. [Prices](#prices)<br>
    3.1 [Averages around the city](#averages_and_outliers)<br>
    3.2 [Focusing a bit more on the Property Type](#prices_property_type)<br>
    3.3 [What parts of the city are hotter?](#scatter_prices)<br>
    3.4 [Location Appreciation](#location_appreciation)<br>
4. [Superhosts](#superhosts)<br>
    4.1 [...Who are they?](#who_are_they)<br>
    4.2 [Looking for a Superhost](#looking_for_superhost)<br>
    4.3 [Superhosts costs](#superhosts_costs)<br>
    4.4 [Superhosts... and super reviews?](#super_reviews)<br>
5. [Mapping](#mapping)<br>
    5.1 [Preparing data](#preparing_data)<br>
    5.2 [Where can we choose from? AKA: Apartments availability by Neighbourhood Group](#map_apt_by_neigh)<br>
    5.3 [This neighbourhood is hot... Prices Distribution by neighbourhood](#map_price_by_neigh)<br>
    5.4 [Location Appreciation... again](#location_appreciation_again)

<a id='loading_data'>&nbsp;</a>

## Loading Data

Loading the Pandas and NumPy libraries, plus Bokeh to help me create more sophisticated plots (even during the exploratory analysis, where it almost completely replaces Matplotlib and Seaborn... just so I can try it out!)

In [None]:
import pandas as pd
import numpy as np

from math import pi

import matplotlib.pyplot as plt
import seaborn as sns

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, FixedTicker, GeoJSONDataSource, LinearColorMapper, ColorBar, Span
from bokeh.models.tools import HoverTool
from bokeh.palettes import Category20c, brewer
from bokeh.layouts import row
from bokeh.transform import factor_cmap, cumsum, dodge

import json

pd.set_option('display.max_columns', None)

output_notebook()

In [None]:
listings = pd.read_csv('../input/berlin-listings-pulito/listings_pulito.csv', index_col = None)
listings.head()

In [None]:
print('The Dataset includes {} different properties'.format(listings.shape[0]))

<a id='overview'>&nbsp;</a>
## Overview

When I think about AirBnB I think about spare rooms in apartments to be shared with the owner or full apartments which give you a stronger taste of how life in that city could be. <br>
Let's see if Berlin follows this same kind of offer or it's possible to find more peculiar properties to rent and how these properties are distributed within the whole city.

<a id='property_types'>&nbsp;</a>
### Property Types

The plot below shows the ten most common property types:

In [None]:
properties = listings.property_type.value_counts(ascending=False).reset_index()
properties.columns = ['property_type', 'occurrencies']

properties_data = ColumnDataSource(properties[:10])

prop_type = properties_data.data['property_type'].tolist()

p = figure(plot_width=800, x_range = prop_type, title="Top 10 Property Types in Berlin", toolbar_location='right')

p.vbar(x='property_type', top='occurrencies', source = properties_data, width=.7,  line_color='black',
       fill_color='orange', fill_alpha = .7, line_alpha = .2)

hover = HoverTool()
hover.tooltips = [
    ("Property", "@property_type"),
    ("Occurrences", "@occurrencies")]

hover.mode = 'vline'

p.add_tools(hover)

p.grid.grid_line_color = None

show(p)

The offer is quite _in line_ with what I was expecting.<br>Focusing on the Apartments, what's the _room type_ and _bed type_ offer?<br>
Will we sleep in a private room and a _normal_ bed, or will we sleep in a more _exotic_ type of bed?

In [None]:
# creating plot 1: Room Type

room_type = listings[listings.property_type == 'Apartment'].room_type.value_counts(ascending=False).reset_index()
room_type.columns = ['Room_Type', 'Occurrencies']

room_type_list = room_type['Room_Type'].tolist()

room_type_plot = figure(plot_width=400, x_range = room_type_list, 
                        title="Apartments - Available Room Types", toolbar_location = None)

room_type_plot.vbar(x='Room_Type', top='Occurrencies', source = room_type, width=.5,  line_color='black', line_alpha = .2,
       fill_color='orange', fill_alpha = .7)

hover_room_type = HoverTool()
hover_room_type.tooltips = [
    ("Room Type", "@Room_Type"),
    ("Occurrences", "@Occurrencies"),
]

hover_room_type.mode = 'vline'

room_type_plot.add_tools(hover_room_type)

room_type_plot.grid.grid_line_color = None

In [None]:
# creating plot 2: Bed Type

bed_type = listings[listings.property_type == 'Apartment'].bed_type.value_counts(ascending=False).reset_index()
bed_type.columns = ['Bed_Type', 'Occurrencies']

bed_type_list = bed_type['Bed_Type'].tolist()

bed_type_plot = figure(plot_width=400, x_range = bed_type_list, 
                       title="Apartments - Available Bed Types", toolbar_location = None)

bed_type_plot.vbar(x='Bed_Type', top='Occurrencies', source = bed_type, width=.7,
                   line_color='black', line_alpha = .2,
       fill_color='orange', fill_alpha = .7)

hover_bed_type = HoverTool()
hover_bed_type.tooltips = [
    ("Bed Type", "@Bed_Type"),
    ("Occurrences", "@Occurrencies"),
]

hover_bed_type.mode = 'vline'

bed_type_plot.add_tools(hover_bed_type)

bed_type_plot.grid.grid_line_color = None

In [None]:
show(row(room_type_plot, bed_type_plot))

_Private Room_ and _Entire Apartment_ are (no surprise) the core of the offer. Likewise, _Real Bed_ is the most common type of bed you can sleep on in Berlin.<br>
I don't know about you, but I'd give Futon a try!

<img src='https://i.imgur.com/cg28FtX.jpg' width='80%'>

The offer is aligned with my basic idea of AirBnB. There's a large majority of apartments, followed by Condominiums and Lofts/Houses. More peculiar offers (like Guesthouses or Hostels) are far behind in terms of numerical availability.

<a id='property_sizes'>&nbsp;</a>
### Property Sizes


If you're planning a trip to this city, how many people can you bring with you? What's the most available accommodation type?

In [None]:
# grouping listings by the accommodates field, and in particular grouping in the '5+' group all those
# listings available for at least 5 people

accomm_overview = listings.accommodates.value_counts(ascending=False)
accom = []
five_or_more = 0
for i, v in accomm_overview.items():
    if i <=4:
        accom.append({'accommodates': i, 'occurrencies': v})
    else:
        five_or_more += v
accom.append({'accommodates': '5+', 'occurrencies': five_or_more})

accomm_sizes = pd.DataFrame(accom)

vals = accomm_sizes['accommodates']

accomm_sizes['angle'] = accomm_sizes.occurrencies/accomm_sizes.occurrencies.sum() * 2 * pi

colors = Category20c[5]

accomm_sizes['color'] = colors[::-1] 

In [None]:
p = figure(plot_width=800, plot_height = 400, title="Accommodations Availability in Berlin", toolbar_location='right',
           x_range=(-1, 1))

acc = accomm_sizes.accommodates.to_list()

p.annular_wedge(x=0, y=-1,  inner_radius=0.15, outer_radius=0.25, direction="anticlock",
                start_angle = cumsum('angle', include_zero = True), end_angle = cumsum('angle'),
        line_color="white", fill_color='color', legend_field = 'accommodates', source=accomm_sizes)

hover = HoverTool()
hover.tooltips = [
    ("Accommodates", " @accommodates"),
    ("Occurrences", "@occurrencies"),
]

p.add_tools(hover)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

As shown above, offers for 2 people cover more than 50% of the total availability, while the other accommodation sizes (for 1, 3, 4, 5 or more) have roughly equal shares.<br>
So, if you're going to Berlin, bring your partner with you... or sleep alone in a very comfortable bed for two!
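
The shares read off the donut can also be checked numerically. Here is a minimal sketch, assuming a short made-up list of `accommodates` values instead of the real column (the helper name `size_shares` is hypothetical); on the real data, `listings.accommodates.value_counts(normalize=True)` should give the same kind of figures.

```python
from collections import Counter

def size_shares(accommodates, cap=5):
    """Percentage share of each party size, pooling sizes >= cap into one 'cap+' label."""
    labels = [str(n) if n < cap else f"{cap}+" for n in accommodates]
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total, 1) for label, count in counts.items()}

# Made-up sample, not taken from the Berlin dataset
sample = [2, 2, 2, 1, 4, 2, 3, 6, 2, 8]
shares = size_shares(sample)
print(shares)
```

Pooling everything from 5 guests upwards mirrors the '5+' bucket used for the plot; a different `cap` would simply move that boundary.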

In the plot below we can see where _small_ and _big_ properties are located within the city:

In [None]:
listing_center = listings[(listings.latitude >= 52.45) & (listings.latitude <= 52.562) &
                             (listings.longitude >=13.26) & (listings.longitude <= 13.52)
                         & (listings.price <= 300)]
# index_cmap = factor_cmap('size', palette=['red', 'blue', 'yellow', 'green', 'black', 'cyan'], 
#                           factors=['1', '2', '3', '4', '5', '6'])

avgerage_accom = listing_center[listing_center.accommodates <= 4]
bigger_accom = listing_center[listing_center.accommodates >= 5]

p = figure(plot_width=800, plot_height=700, 
           title = "Berlin: Average sized properties (up to 4) vs bigger properties (5+)",
           toolbar_location='right'
)

p.circle(avgerage_accom.longitude, avgerage_accom.latitude,
         color='lightblue',
         fill_alpha=0.2,
         size=avgerage_accom.price / 20,
         legend_label = "For 1 to 4"
)

p.diamond(bigger_accom.longitude, bigger_accom.latitude,
         color='firebrick',
         fill_alpha=.2,
         size=bigger_accom.price / 20,
        legend_label = "5+"
)


p.xaxis.axis_label = 'longitude'
p.yaxis.axis_label = 'latitude'

show(p)

For the sake of simplicity, in the plot above I've split the properties into accommodations for 1 to 4 people and those for 5 or more guests.<br>
The 1-to-4 group is in general more evenly distributed around the city (and clearly larger in number); the bigger accommodations are more concentrated in the north-eastern area of the city centre and include (as could be expected) more high-priced listings (represented by larger diamonds).<br>
The idea I get from this plot is that, Berlin being such a big and important city, properties of every size are spread quite regularly around it, which makes it easier to find a solution that meets travellers' needs.

<a id='where_to_sleep'>&nbsp;</a>
### Where to sleep

Splitting the city by neighbourhoods, the following plot will tell us which neighbourhood offers the most properties:

In [None]:
listings_by_neigh = listings.neighbourhood_group_cleansed.value_counts().reset_index()

listings_by_neigh.columns = ['Neighbourhood_Group', 'Occurrencies']
listings_by_neigh['share'] = round(listings_by_neigh.Occurrencies / listings_by_neigh.Occurrencies.sum() * 100, 2)
listings_by_neigh['angle'] = listings_by_neigh.Occurrencies/listings_by_neigh.Occurrencies.sum() * 2 * pi

colors = ['#1784ba', '#5faed3', '#97cadf', '#c3dbee', '#ffd0a6', '#ffd0a6', '#ffd0a6', '#ffd0a6', '#ffd0a6', '#ffd0a6', '#ffd0a6', '#ffd0a6']

listings_by_neigh['color'] = colors

p = figure(plot_width=800, plot_height = 500, title="Accommodations Availability by Neighbourhood Group in Berlin",
           toolbar_location='right', x_range=(-1, 1))

neigh = listings_by_neigh.Neighbourhood_Group.to_list()

p.annular_wedge(x=0, y=-1,  inner_radius=0.15, outer_radius=0.25, direction="anticlock",
        start_angle = cumsum('angle', include_zero = True), end_angle = cumsum('angle'),
        line_color="white", fill_color='color', legend_field = 'Neighbourhood_Group', source=listings_by_neigh)

hover = HoverTool()
hover.tooltips = [
    ("Neighbourhood Group", " @Neighbourhood_Group"),
    ("Share", "@share{0.2f}%"),
]

p.add_tools(hover)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

As shown above, the neighbourhoods:

- **Friedrichshain-Kreuzberg**<br>
- **Mitte**<br>
- **Pankow**<br>
- **Neukölln**

offer the most properties (76.2% of the total). This information will be useful later, for some map plotting.
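
The combined share quoted above can be recomputed from raw counts rather than read off the chart. This is only a sketch on a made-up list of neighbourhood groups (the function name `top_groups` is hypothetical), but the same idea applies to the `neighbourhood_group_cleansed` column.

```python
from collections import Counter

def top_groups(groups, n=4):
    """Return the n most frequent groups and their combined share in percent."""
    counts = Counter(groups)
    top = counts.most_common(n)
    combined = 100 * sum(count for _, count in top) / sum(counts.values())
    return [name for name, _ in top], round(combined, 1)

# Made-up sample, not the real listings
sample = ["Mitte"] * 4 + ["Pankow"] * 3 + ["Neukölln"] * 2 + ["Spandau"]
names, share = top_groups(sample, n=2)
print(names, share)
```

Deriving the list this way keeps it in sync with the data if the dataset is refreshed, at the cost of a slightly less readable cell than a hand-written list.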

In [None]:
top_neigh = ['Friedrichshain-Kreuzberg', 'Mitte', 'Pankow', 'Neukölln']

<a id='available_for'>&nbsp;</a>

### How long are these properties available during the year?

The dataset allows us to see how many days each property is available (I focused on the availability over 365 days):

In [None]:
availabilities = listings.groupby('neighbourhood_group_cleansed').availability_365.mean().reset_index().sort_values(by = 'availability_365',ascending = False)
availabilities.columns = ['Neighbourhood', 'Availability']
availabilities_list = availabilities['Neighbourhood'].tolist()

p = figure(plot_width=800, x_range = availabilities_list, title="Average Availability (days)", toolbar_location='right')

p.vbar(x='Neighbourhood', top='Availability', source = availabilities, width=.5,  line_color='black', 
       line_alpha = .2, fill_color='orange', fill_alpha = .7)

hover = HoverTool()
hover.tooltips = [
    ("Neighbourhood", "@Neighbourhood"),
    ("Availability (days)", "@Availability{0.2f}"),
]

hover.mode = 'vline'

p.add_tools(hover)

p.xaxis.major_label_orientation = pi/4
p.grid.grid_line_color = None

show(p)

This plot tells us that, surprisingly, the properties in Spandau and Marzahn-Hellersdorf are the ones available for the largest part of the year (their average availability is about half of the year).<br>
The _fab four_ we saw earlier (Friedrichshain-Kreuzberg, Mitte, Pankow, Neukölln) sit in the last positions of the plot: their availability (about 25% of the year at most) is quite low compared to the high number of properties they offer.

Digging a bit deeper:

In [None]:
top_prop = listings[listings.neighbourhood_group_cleansed.isin(top_neigh)].sort_values(by='neighbourhood_group_cleansed')

plt.figure(figsize=(18, 8))

sns.boxplot(data=top_prop, x='neighbourhood_group_cleansed', y = 'availability_365', color='orange', saturation = .9);
plt.title('Availability (365 days) in the top 4 Neighbourhood Groups');
plt.xlabel('Neighbourhood Group');
plt.ylabel('Availability over the year (max. 365 days)');

Mitte and Pankow appear to be the most _consistent_ over the year, unlike _Friedrichshain-Kreuzberg_ and _Neukölln_, where being open for more than 100 days is in most cases almost an _exception_.
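
What the boxplot suggests can be put into numbers: the median availability per group and the share of listings open for more than 100 days. A rough sketch follows, assuming made-up `(group, days)` pairs and a hypothetical helper `availability_summary`; the 100-day threshold is simply the one mentioned above.

```python
from collections import defaultdict
from statistics import median

def availability_summary(rows, threshold=100):
    """rows: iterable of (group, days). Returns {group: (median_days, share_above_threshold)}."""
    by_group = defaultdict(list)
    for group, days in rows:
        by_group[group].append(days)
    return {
        group: (median(days), sum(d > threshold for d in days) / len(days))
        for group, days in by_group.items()
    }

# Made-up rows, not taken from the Berlin dataset
rows = [("A", 0), ("A", 20), ("A", 150), ("A", 30),
        ("B", 90), ("B", 200), ("B", 365)]
summary = availability_summary(rows)
print(summary)
```

A low median combined with a small share above the threshold is what "almost an exception" would look like in this summary.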

<a id='prices'>&nbsp;</a>

## Prices

So far we've been dreaming about sweet things like what bed to sleep on, what neighbourhood is the one keeping its doors open the most waiting for us, but now let's talk about something less pleasant... **prices**!

<a id='averages_and_outliers'>&nbsp;</a>

### Averages around the city

The median price considering all property types is:

In [None]:
print('{} Euro per night'.format(listings.price.median()))

<a id='prices_property_type'>&nbsp;</a>

### Focusing a bit more on the Property Type

Let's compare this value to the median price for each property type:

In [None]:
listings_price = listings.groupby('property_type').price.median().sort_values(ascending=False).reset_index()

property_type_list = listings_price['property_type'].tolist()

p = figure(plot_width=800, x_range = property_type_list, title="Prices per Property Type vs Median Price", 
           toolbar_location='right')

p.vbar(x='property_type', top='price', source = listings_price, width=.8,  line_color='black', line_alpha = .5,
        fill_color='orange', fill_alpha = .7)

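# dashed reference line at the overall median price, spanning all property types
x = property_type_list
y = [listings.price.median()] * len(x)

p.line(x, y,
       line_dash='dashed', line_width=.5, legend_label="Median price")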

p.grid.grid_line_color = None
p.xaxis.major_label_orientation = pi/4
show(p)


We find a number of property types with prices well above the Berlin median price. We have to keep in mind what we saw before: all these property types are almost irrelevant (in number) compared to the available apartments.

Plotting this information in a slightly different way:

In [None]:
price_arr = np.arange(0, 1100, 100)

plt.figure(figsize=(18, 4))
sns.boxplot(data=listings, x='price', color='orange', saturation = .9);
plt.xlim(0, 1000)
plt.title('Prices (all Property types)');
plt.xlabel('Prices in Euro');
plt.xticks(price_arr, price_arr);

We notice that the bulk of the prices lies below 300 Euro. Above that, the number of properties is much lower, which makes such properties less interesting to consider in the next plots.
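
One way to sanity-check a cut-off like 300 Euro is to look at how many listings it keeps, and at where a boxplot starts treating prices as outliers (Seaborn's default whiskers use 1.5 times the interquartile range). The sketch below runs on made-up prices; `upper_fence` and `share_at_or_below` are hypothetical helper names, and the quartile interpolation may differ slightly from the one used by the plotting library.

```python
from statistics import quantiles

def upper_fence(prices, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR, with linearly interpolated quartiles."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def share_at_or_below(prices, limit):
    """Fraction of prices that do not exceed the limit."""
    return sum(p <= limit for p in prices) / len(prices)

# Made-up prices, not taken from the Berlin dataset
prices = [20, 30, 40, 50, 60, 70, 80, 90, 100, 900]
fence = upper_fence(prices)
share = share_at_or_below(prices, 300)
print(fence, share)
```

If the chosen limit keeps the large majority of listings and sits well beyond the fence, dropping the rest should not distort the later plots much.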

In [None]:
price_arr = np.arange(0, 1050, 50)

plt.figure(figsize=(18, 4))
apt_listings = listings[listings.property_type == 'Apartment']
sns.boxplot(data=apt_listings, x='price', color='orange', saturation = .9);
plt.xlim(0, 1000)
plt.title('Prices (Apartments)');
plt.xlabel('Prices in Euro');
plt.xticks(price_arr, price_arr);

Same boxplot when limiting the listings to the Apartments only. From here on, we'll only focus on Apartments up to 300 Euro per night.

<a id='scatter_prices'>&nbsp;</a>

### What parts of the city are hotter?

... Under several points of view, which don't include the weather... at least, not in this notebook.

The scatterplot below gives us a quick look at how prices (Apartments, up to 300 Euro) are distributed, and it's pretty interesting:

In [None]:
# focusing on Apartments within 300 Euro per night
listings_300 = listings[(listings.price <= 300) & (listings.property_type == 'Apartment')]


In [None]:
listings_300.plot(kind='scatter', x = 'longitude', y = 'latitude', alpha = .4, label = 'Price', figsize=(20, 16),
             c = 'price', cmap = plt.get_cmap('afmhot'), colorbar = True);


plt.title('Prices Distribution - Properties within 300 Euro');
plt.xlabel('Longitude');
plt.ylabel('Latitude');

Prices seem to be distributed as two concentric circles: the inner circle (the very center of Berlin) has a lighter colour which means higher prices. From that, a bigger and darker circle develops, showing that the farther we move from the centre, the cheaper apartments get.
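
The "concentric circles" impression can be tested by grouping listings into distance bands around the centre and comparing median prices. This is a sketch under explicit assumptions: the Brandenburg Gate (approximate coordinates) stands in for "the centre", the listings are made up, and `haversine_km` and `median_price_by_band` are hypothetical helpers.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt
from statistics import median

# Assumption: the Brandenburg Gate (approximate coordinates) is used as the centre
CENTRE = (52.5163, 13.3777)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def median_price_by_band(rows, band_km=5):
    """rows: iterable of (latitude, longitude, price). Band 0 is 0-5 km, band 1 is 5-10 km, ..."""
    bands = defaultdict(list)
    for lat, lon, price in rows:
        band = int(haversine_km(CENTRE[0], CENTRE[1], lat, lon) // band_km)
        bands[band].append(price)
    return {band: median(prices) for band, prices in sorted(bands.items())}

# Made-up listings, not taken from the Berlin dataset
sample = [(52.52, 13.38, 90), (52.51, 13.40, 70), (52.60, 13.38, 40), (52.43, 13.38, 50)]
by_band = median_price_by_band(sample)
print(by_band)
```

If the pattern seen in the scatterplot holds, the medians should decrease as the band index grows.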

In the <a href='#mapping'>Mapping the results</a> section we'll graphically focus on how such prices are divided between the neighbourhood groups.

<a id='location_appreciation'>&nbsp;</a>
### Location appreciation

In the Superhosts chapter we'll focus on the scores given to individual _hosts_, but for now I would like to use one of the scores used to evaluate the property to see if it tells us something broader about the city.<br>
Focusing on the _location_ score, let's see if some neighbourhoods are clearly people's favourites.

In [None]:
by_location = listings.groupby('neighbourhood_group_cleansed').review_scores_location.mean().sort_values(ascending = True).reset_index()
by_location.columns = ['neighbourhood', 'avg_location_score']


In [None]:
location_plot = figure(plot_width=800, plot_height = 400, y_range = by_location.neighbourhood.to_list(),
                        title="Location Scores", toolbar_location = None)
# hand-picked orange gradient, one colour per neighbourhood group
colori = ['#ff4800', '#FF5400', '#FF6000', '#FF6D00', '#FF7900', '#FF8500', '#FF9100', '#FF9E00', '#FFAA00',
          '#FFB600', '#FFB600', '#FFB600']

location_plot.hbar('neighbourhood', left = 'avg_location_score', right = 0, height = .8, source = by_location,
                    line_color='black', line_alpha = .2,
                    fill_color=factor_cmap('neighbourhood', palette=colori[::-1], 
                                             factors=by_location.neighbourhood.to_list()
                    ), 
                    fill_alpha = .7
)

hover_locations = HoverTool()
hover_locations.tooltips = [
    ("Neighbourhood", "@neighbourhood"),
    ("Average Location Score", "@avg_location_score{2.5f}"),
]

hover_locations.mode = 'hline'

location_plot.add_tools(hover_locations)

location_plot.grid.grid_line_color = None


show(location_plot)

Friedrichshain-Kreuzberg and Pankow are the neighbourhoods which reach the very deep heart of Berlin. So, no surprise these two neighbourhoods get such good grades.<br>
Later we'll represent this chart on a map, to better understand where these neighbourhoods are compared to the centre.

<a id='superhosts'>&nbsp;</a>
## Superhosts

<a id='who_are_they'>&nbsp;</a>
### ... Who are they?

Among the features available on AirBnB there's the chance of becoming a _superhost_. The status of <a href='https://www.airbnb.com/help/article/828/what-is-a-superhost?_set_bev_on_new_domain=1597339231_wNoRxgqkPUhcM3qY&locale=en' target='_blank'>superhost</a> is given to _experienced_ and _successful_ hosts, who are required to meet certain <a href='https://www.airbnb.com/superhost' target='_blank'>performance</a> goals.

Can we find any differences in the experiences the Berlin tourists had between _regular_ and _super_ hosts? (let's keep in mind that I'm focusing on the most _common_ property type and price range offers, which are Apartments within 300 Euro per night).

<a id='looking_for_superhost'>&nbsp;</a>
### Looking for a SuperHost

How common is it to find a superhost?

In [None]:
norm_super = listings_300.groupby('host_is_superhost').experiences_offered.count().reset_index()
norm_super.rename(columns={'experiences_offered':'occurrencies'}, inplace=True)
norm_super.replace({'f': 'Regular Host', 't': 'Superhost'}, inplace = True)
norm_super['angle'] = norm_super.occurrencies/norm_super.occurrencies.sum() * 2 * pi
colors = Category20c[5]
norm_super['color'] = colors[::-3]

p = figure(plot_width=800, plot_height = 400, title="Distribution between Normal and Super Hosts",
           toolbar_location='right',
           x_range=(-1, 1))

no_su = norm_super.host_is_superhost.to_list()

p.annular_wedge(x=0, y=-1,  inner_radius=0.15, outer_radius=0.25, direction="anticlock",
                start_angle = cumsum('angle', include_zero = True), end_angle = cumsum('angle'),
        line_color="white", fill_color='color', legend_field = 'host_is_superhost', source=norm_super)

hover = HoverTool()
hover.tooltips = [
    ("Host Type", " @host_is_superhost"),
    ("Occurrences", "@occurrencies"),
]

p.add_tools(hover)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

Hmmm not that common!<br>
But since they're so _rare_ they're supposed to give you that _little something_ which can make your Berlin experience unforgettable.

<a id='superhosts_costs'>&nbsp;</a>
### Superhosts costs

Now I don't want to be so _romantic_ about superhosts. Let's compare them to the _regular_ hosts in terms of price per night!

In [None]:
norm_super = listings_300.groupby('host_is_superhost').price.median().reset_index()
norm_super.replace({'f': 'Regular Host', 't': 'Superhost'}, inplace = True)

ns_prices_plot = figure(plot_width=400, x_range = norm_super.host_is_superhost.to_list(),
                        title="Regular and Super Hosts - Median Prices", toolbar_location = None, 
                        x_axis_label = 'Host Type',
                        y_axis_label = "Prices in Euro"
                      )

colori = ['#e6550d', '#6baed6']

ns_prices_plot.vbar(x='host_is_superhost', top='price', source = norm_super, width=.7,
                    line_color='black', line_alpha = .2,
                    fill_color = factor_cmap('host_is_superhost', palette=colori, 
                                             factors=norm_super.host_is_superhost.to_list()
                                            ),
                    fill_alpha = .7
)

hover_ns_prices = HoverTool()
hover_ns_prices.tooltips = [
    ("Host Type", "@host_is_superhost"),
    ("Median Price", "@price{2.2f} Euro"),
]

hover_ns_prices.mode = 'vline'

ns_prices_plot.add_tools(hover_ns_prices)

ns_prices_plot.grid.grid_line_color = None

In [None]:
prices_hist = figure(plot_width = 400,
      title = "Prices Distribution",
      x_axis_label = 'Prices in Euro',
      y_axis_label = "Hosts in this price range"
    
)  

arr_hist_norm, edges_norm = np.histogram(listings_300[listings_300.host_is_superhost == 'f'].price, 
                                         bins = int(305 / 5), range = [0, 305])
arr_hist_super, edges_super = np.histogram(listings_300[listings_300.host_is_superhost == 't'].price, 
                                         bins = int(305 / 5), range = [0, 305])

prices_norm = pd.DataFrame({'prices':arr_hist_norm, 'left':edges_norm[:-1], 'right':edges_norm[1:]})
prices_super = pd.DataFrame({'prices':arr_hist_super, 'left':edges_super[:-1], 'right':edges_super[1:]})

prices_hist.quad(bottom=0, top =prices_norm.prices, left = prices_norm['left'], right = prices_norm['right'], 
                 fill_color = '#e6550d', fill_alpha = .6, line_color = 'black', line_alpha = .1
                 , legend_label = 'Regular Host'
)

prices_hist.quad(bottom=0, top =prices_super.prices, left = prices_super['left'], right = prices_super['right'], 
                 fill_color = '#6baed6', fill_alpha = .8, line_color = 'black', line_alpha = .2
                 , legend_label = 'Super Host'
)

prices_hist.grid.grid_line_color = None

In [None]:
show(row(ns_prices_plot, prices_hist))

In the left plot, we can see that Superhosts have a slightly higher median price than Regular hosts, which is understandable: if you want _quality_ you have to spend a little more.<br> In the right plot, we see that both price distributions follow a very similar right-skewed shape.

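
A right-skewed distribution has a long tail towards high prices, which usually pulls the mean above the median. The sketch below uses that gap as a rough hint only (it is a heuristic, not a formal skewness test), on made-up prices and with a hypothetical helper name `skew_hint`.

```python
from statistics import mean, median

def skew_hint(prices):
    """Rough direction of skew, judged only by the sign of mean minus median."""
    gap = mean(prices) - median(prices)
    if gap > 0:
        return "right-skewed"
    if gap < 0:
        return "left-skewed"
    return "symmetric"

# Made-up prices: most are cheap, one is very expensive
sample = [30, 35, 40, 45, 50, 60, 250]
hint = skew_hint(sample)
print(hint)
```

Running the same check separately on the regular-host and superhost prices would show whether both groups lean the same way.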
<a id='super_reviews'>&nbsp;</a>
### Superhosts... and super reviews?

On one side, we have hosts (either regular or _super_ ) who do their best to provide us  with the best experience in terms of accommodation.<br>
But how much do customers reward their effort? And how good is renting an AirBnB in lovely Berlin? (I say lovely Berlin because I've been an AirBnB guest in that city too... and the experience was a full 10 for me!)

In [None]:
rev_norm_super = listings_300.groupby('host_is_superhost').review_scores_value.mean().reset_index()
rev_norm_super.replace({'f': 'Regular Host', 't': 'Superhost'}, inplace = True)

rns_plot = figure(plot_width=800, plot_height = 200, y_range = rev_norm_super.host_is_superhost.to_list(),
                        title="Regular and Super Hosts - Average Review Scores", toolbar_location = None)
rns_plot.hbar('host_is_superhost', left = 'review_scores_value', right = 0, height = .8, source = rev_norm_super,
                    line_color='black', line_alpha = .2,
                   fill_color=factor_cmap('host_is_superhost', palette=colori, 
                                             factors=rev_norm_super.host_is_superhost.to_list()
                                            ), 
              fill_alpha = .7
)

hover_rns_prices = HoverTool()
hover_rns_prices.tooltips = [
    ("Host Type", "@host_is_superhost"),
    ("Average Review Score", "@review_scores_value{2.5f}"),
]

hover_rns_prices.mode = 'hline'

rns_plot.add_tools(hover_rns_prices)

rns_plot.grid.grid_line_color = None

# vertical reference line at the maximum possible score (10)
max_score_line = Span(location=10,
                      dimension='height', line_color='black',
                      line_dash='solid', line_width=.7, line_alpha = .3)
rns_plot.add_layout(max_score_line)

show(rns_plot)

Superhosts are closer to perfection! So, they're _doing their job_ quite well!

Checking a bit more _specific_ reviews:

In [None]:
col_colors = Category20c[3]

avg_accur = listings_300.groupby('host_is_superhost').review_scores_accuracy.mean()
avg_clean = listings_300.groupby('host_is_superhost').review_scores_cleanliness.mean()
avg_checkin = listings_300.groupby('host_is_superhost').review_scores_checkin.mean()
avg_comm = listings_300.groupby('host_is_superhost').review_scores_communication.mean()

avg_rs = pd.DataFrame({
        'host_is_superhost': ['Regular Host', 'Super Host'],
        'accuracy': avg_accur.to_list(),
        'checkin': avg_checkin.to_list(),
        'communication': avg_comm.to_list()
    }
)


p_avg_rs = figure(x_range=avg_rs.host_is_superhost.to_list(), y_range=(0, 11), plot_width = 800, plot_height=350, 
        title="Regular vs Super Hosts - Average Ratings Comparison",
        toolbar_location=None, tools=""
)

p_avg_rs.vbar(x=dodge('host_is_superhost', -0.25, range=p_avg_rs.x_range), top='accuracy', width=.2, 
        source=avg_rs,
        color = col_colors[0],
        legend_label="Accuracy"
)

p_avg_rs.vbar(x=dodge('host_is_superhost', 0, range=p_avg_rs.x_range), top='checkin', width=.2, 
        source=avg_rs,
        color = col_colors[1],
        legend_label="Check-in"
)

p_avg_rs.vbar(x=dodge('host_is_superhost', 0.25, range=p_avg_rs.x_range), top='communication', width=.2, 
            source=avg_rs,
            color=col_colors[2],
            legend_label="Communication"
)

p_avg_rs.x_range.range_padding = 0.1
p_avg_rs.xgrid.grid_line_color = None
p_avg_rs.legend.location = "bottom_right"
p_avg_rs.legend.orientation = "vertical"

show(p_avg_rs)


Superhosts are on average rated better than Regular Hosts (and apparently they excel at _communication_, being rated 9.94 out of 10).<br>
Overall, the difference is only slightly perceptible, so in my opinion being a Superhost is surely a sign of quality but not a big discriminating factor.
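
To see how small the difference really is, the per-group averages can be turned into explicit gaps. Here is a minimal sketch on made-up review rows (the helper `mean_scores` and the row layout are assumptions); on the real data, a `groupby('host_is_superhost')` followed by `.mean()` on the review columns does the same job.

```python
from collections import defaultdict
from statistics import mean

def mean_scores(rows, group_key, score_keys):
    """Average each score column within each group of rows (rows are dicts)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in score_keys}
        for group, members in grouped.items()
    }

# Made-up reviews, not taken from the Berlin dataset
rows = [
    {"host": "Regular Host", "accuracy": 9, "communication": 10},
    {"host": "Regular Host", "accuracy": 10, "communication": 9},
    {"host": "Superhost", "accuracy": 10, "communication": 10},
]
keys = ["accuracy", "communication"]
scores = mean_scores(rows, "host", keys)
gaps = {k: scores["Superhost"][k] - scores["Regular Host"][k] for k in keys}
print(gaps)
```

Looking at the gaps directly avoids over-reading bars that all sit close to the top of a 0 to 10 scale.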

<a id='mapping'>&nbsp;</a>
## Mapping the results

From the website <a href='https://daten.berlin.de/datensaetze/openstreetmap-daten-f%C3%BCr-berlin'>Berlin.de</a> I've downloaded a shape file with the coordinates which allowed me to build a map of the Neighbourhood Groups we find in the starting dataset.

<a id='preparing_data'>&nbsp;</a>
### Preparing data

In the following cells I'll clean the data up and merge it to the listings I've been working with so far. In order to keep the analysis easy to read, I'm going to hide the code cells, which are commented to give an idea of what they do.
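
The core of this clean-up is a lookup from district name to neighbourhood group. As a rough illustration, the sketch below inverts a small `{group: [districts]}` dictionary (only a subset of the real lists, and `build_lookup` is a hypothetical helper) and refuses districts that appear twice.

```python
def build_lookup(groups):
    """Invert {group: [districts]} into {district: group}, rejecting duplicates."""
    lookup = {}
    for group, districts in groups.items():
        for district in districts:
            if district in lookup:
                raise ValueError(f"{district!r} is listed under two groups")
            lookup[district] = group
    return lookup

# Small subset for illustration only
groups = {
    "Mitte": ["Mitte", "Wedding", "Moabit"],
    "Friedrichshain-Kreuzberg": ["Friedrichshain", "Kreuzberg"],
}
lookup = build_lookup(groups)
print(lookup.get("Wedding"))
print(lookup.get("Unknown place"))
```

With pandas, something like `ber_shp.name.map(lookup)` would apply such a dictionary to a whole column and leave unmatched names as missing values, which makes leftovers easy to spot.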

In [None]:
import geopandas as gpd
shapefile = '../input/berlin-neighbourhood-coords/Berlin.shp'

#Read shapefile using Geopandas
ber_shp = gpd.read_file(shapefile)
ber_shp.columns = ['osm_id', 'code', 'fclass', 'neighbourhood_group', 'name', 'geometry']

ber_shp.head()

In [None]:
#Some names must be edited in order to be correctly shown and spelled and the neighbourhood_group
# must be standardized with the one we have from the listings dataframe

neighbourhoods = ['Charlottenburg-Wilm.', 'Friedrichshain-Kreuzberg', 'Lichtenberg',
       'Marzahn - Hellersdorf', 'Mitte', 'Neukölln', 'Pankow',
       'Reinickendorf', 'Spandau', 'Steglitz - Zehlendorf',
       'Tempelhof - Schöneberg', 'Treptow - Köpenick']

pankow = ['Buch', 'Pankow', 'Blankenfelde', 'FranzÃ¶sisch Buchholz', 'Wilhelmsruh', 'Rosenthal', 'Blankenburg', 
         'NiederschÃ¶nhausen', 'Stadtrandsiedlung Malchow', 'Heinersdorf', 'WeiÃ\x9fensee', 'Prenzlauer Berg']

reinickendorf = ['Reinickendorf', 'Frohnau', 'Heiligensee', 'KonradshÃ¶he', 'Tegel', 'Waidmannslust', 'Wittenau',
                'MÃ¤rkisches Viertel', 'Borsigwalde']

charlottenburg_wilm = ['Charlottenburg', 'Westend', 'Grunewald', 'Halensee', 'Wilmersdorf']

friedrichshain_kreuzberg = ['Friedrichshain', 'Kreuzberg']

lichtenberg = ['Lichtenberg', 'Wartenberg', 'Malchow', 'Falkenberg', 'Neu-HohenschÃ¶nhausen', 'Alt-HohenschÃ¶nhausen',
              'Fennpfuhl', 'Rummelsburg', 'Friedrichsfelde', 'Karlshorst']

marzahn_hellersdorf = ['Marzahn', 'Hellersdorf', 'Biesdorf', 'Kaulsdorf', 'Mahlsdorf']

mitte = ['Mitte', 'Gesundbrunnen', 'Wedding', 'Moabit', 'Hansaviertel', 'Tiergarten']

neukolln = ['NeukÃ¶lln', 'Britz', 'Buckow', 'Gropiusstadt', 'Rudow']

spandau = ['Spandau', 'Hakenfelde', 'Falkenhagener Feld', 'Staaken', 'Kladow', 'Gatow', 'Wilhelmstadt', 'Haselhorst',
          'Siemensstadt']

steglitz_zehlendorf = ['Steglitz', 'Zehlendorf', 'Lankwitz', 'Lichterfelde', 'Dahlem', 'Nikolassee', 
                      'Wannsee']

tempelhof_schoneberg = ['Tempelhof', 'SchÃ¶neberg', 'Friedenau', 'Mariendorf', 'Marienfelde', 'Lichtenrade']

treptow_kopenick = ['Alt-Treptow', 'KÃ¶penick', 'PlÃ¤nterwald', 'OberschÃ¶neweide', 'Baumschulenweg', 'NiederschÃ¶neweide', 
                   'Johannisthal', 'Adlershof', 'Altglienicke', 'Bohnsdorf', 'GrÃ¼nau', 'Friedrichshagen',
                   'Rahnsdorf', 'MÃ¼ggelheim', 'SchmÃ¶ckwitz']


# small adjustments I couldn't complete with the code above

osm_id = ['409203', '407715', '410446', '407714', '164785', '16346', '55746', '55743', '409234', '55744', '404664',
         '25179280']
osm_id_neigh = ['Lichtenberg', 'Lichtenberg', 'Spandau', 'Spandau', 'Charlottenburg-Wilm.', 'Charlottenburg-Wilm.', 
               'Reinickendorf', 'Reinickendorf', 'Reinickendorf', 'Reinickendorf', 'Pankow', 'Steglitz - Zehlendorf']

to_check = [charlottenburg_wilm, friedrichshain_kreuzberg, lichtenberg, marzahn_hellersdorf, 
        mitte, neukolln, pankow, reinickendorf, spandau, steglitz_zehlendorf, 
        tempelhof_schoneberg, treptow_kopenick]
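
The odd-looking names in the lists above (for example `FranzÃ¶sisch Buchholz`) look like UTF-8 text that was decoded as Latin-1, and they are written that way so that they match the shapefile's `name` column. As a minimal sketch, assuming that really is the cause, the readable spelling can be recovered by reversing the mis-decoding. The helper `fix_mojibake` is hypothetical and not part of the original notebook.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage, leaving clean strings untouched."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8: assume the string was already fine.
        return text


samples = ['FranzÃ¶sisch Buchholz', 'WeiÃ\x9fensee', 'Mitte', 'Neukölln']
for raw in samples:
    print(repr(raw), '->', repr(fix_mojibake(raw)))
```

If the assumption holds, something like `ber_shp['name'].apply(fix_mojibake)` might allow the lists to be written with proper umlauts, but that would need checking against the actual shapefile.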



In [None]:
# checking all districts and setting their neighbourhood_group to the corresponding one
# whose name has been standardized with the neighbourhood group names available in the AirBnB dataset

for cerca, neigh in zip(to_check, neighbourhoods):
    ber_shp.loc[ber_shp.name.isin(cerca), 'neighbourhood_group'] = neigh

# setting the remaining neighbourhoods to the neighbourhood_group available from the AirBnB dataset
for i, neigh in zip(osm_id, osm_id_neigh):
    ber_shp.loc[ber_shp.osm_id == i, 'neighbourhood_group'] = neigh
    
ber_shp.drop(['osm_id', 'code', 'fclass'], axis = 1, inplace = True) # removing unnecessary columns
ber_shp.head(4)
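
The assignment above relies on the group names and the name lists being kept in the same order. An alternative that may be easier to audit is a single name-to-group dictionary, which also makes it possible to spot names listed twice or names that no list covers. The sketch below uses a tiny illustrative subset of the lists; `build_lookup` and `unmatched` are hypothetical helpers, not part of the original notebook.

```python
def build_lookup(groups):
    """Invert {group: [names]} into {name: group}, rejecting duplicates."""
    lookup = {}
    for group, names in groups.items():
        for name in names:
            if name in lookup:
                raise ValueError(f'{name!r} is listed under both '
                                 f'{lookup[name]!r} and {group!r}')
            lookup[name] = group
    return lookup


def unmatched(names, lookup):
    """Return the names that no group list covers, in sorted order."""
    return sorted(set(names) - set(lookup))


# Illustrative subset only; the real lists are much longer.
groups = {
    'Mitte': ['Mitte', 'Wedding', 'Moabit'],
    'Friedrichshain-Kreuzberg': ['Friedrichshain', 'Kreuzberg'],
}
lookup = build_lookup(groups)
shape_names = ['Mitte', 'Kreuzberg', 'Tegel']  # stand-in for ber_shp.name

print(lookup['Kreuzberg'])
print(unmatched(shape_names, lookup))
```

With the real data, `ber_shp.name.map(lookup)` would assign the groups in one step and leave a missing value for any name the dictionary does not contain.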

In [None]:
# The ber_shp data will be used to provide the right geometry to the maps plotted from now on.

# First things first: this function merges ber_shp with the df containing the info of interest
# (e.g. the mean price grouped by neighbourhood)
# and returns a df with only the strictly necessary columns: geometry, neighbourhood group, the info to plot

def merge_coords(df, merge_by, value):
    merged = pd.merge(ber_shp, df, left_on='neighbourhood_group', right_on=merge_by, how = 'left')
    columns = ['geometry', merge_by, value]
    merged = merged[columns]
    return merged
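
The `how = 'left'` argument keeps every shape in `ber_shp`, even when the other dataframe has no row for its neighbourhood group; such shapes end up with a missing value. The following plain-Python sketch imitates that behaviour on made-up rows. The `left_join` helper is hypothetical and only stands in for `pd.merge`; it is not how pandas implements the merge.

```python
def left_join(left_rows, right_rows, left_on, right_on):
    """Imitate a left merge on lists of dicts: every left row is kept."""
    # Index the right-hand rows by their key for quick lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_on], []).append(row)

    # Right-hand columns filled with None, used when nothing matches.
    empty = dict.fromkeys(right_rows[0]) if right_rows else {}

    joined = []
    for left in left_rows:
        matches = index.get(left[left_on], [])
        if not matches:
            joined.append({**empty, **left})
        for right in matches:
            joined.append({**left, **right})
    return joined


# Made-up rows, for illustration only.
shapes = [
    {'name': 'Wedding', 'neighbourhood_group': 'Mitte'},
    {'name': 'Moabit', 'neighbourhood_group': 'Mitte'},
    {'name': 'Tegel', 'neighbourhood_group': 'Reinickendorf'},
]
prices = [{'neighbourhood_group_cleansed': 'Mitte', 'price': 60.0}]

for row in left_join(shapes, prices,
                     'neighbourhood_group', 'neighbourhood_group_cleansed'):
    print(row['name'], row['price'])
```

A shape left without a value has nothing to colour it by, so it may be worth checking the merged dataframe for missing values before plotting.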

In [None]:
# This function creates a map of Berlin. It renders each neighbourhood with a colour
# corresponding to its value (e.g. the number of available properties).
# It receives as input:
# - a dataframe with the neighbourhood groups and the values I want to highlight (e.g. the median price per neighbourhood)
# - the column I want to focus on
# - a list of tick values to help me customize the ColorBar object
# - the plot title

def draw_map(df, focus, ticks_lab, title):
    #Read data to json.
    data_json = json.loads(df.to_json())

    #Convert to String like object.
    map_data = json.dumps(data_json)

    #Input GeoJSON source that contains features for plotting.
    geosource = GeoJSONDataSource(geojson = map_data)

    #Define a sequential single-hue color palette with one color per tick interval.
    palette = brewer['Oranges'][len(ticks_lab)-1]


    #Reverse color order
    palette = palette[::-1]

    #Instantiate LinearColorMapper that linearly maps numbers in a range, into a sequence of colors.
    color_mapper = LinearColorMapper(palette = palette, low = ticks_lab[0], high = ticks_lab[-1])

    #Define custom tick labels for color bar.
    tick_labels = ticks_lab
    
    #Create color bar. 
    cust_ticks = FixedTicker(ticks = ticks_lab)
    color_bar = ColorBar(color_mapper=color_mapper, label_standoff=5, width = 500, height = 20,
                border_line_color=None, location = (0,0) , orientation = 'horizontal'
                , ticker = cust_ticks
    )

    #Create figure object.
    tooltip = '@' + focus
    p = figure(title = title, plot_height = 600 , plot_width = 800, 
               toolbar_location = 'right' 
               , tooltips = [("Neigh", '@neighbourhood_group_cleansed'), (focus, tooltip)]
              )

    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.hover.point_policy = "follow_mouse"

    #Add patch renderer to figure. 
    p.patches('xs', 'ys', source = geosource,
              fill_color = {'field' :focus, 'transform' : color_mapper}, 
              line_color = None, line_width = .15, fill_alpha = 1)


    #Specify figure layout.
    p.add_layout(color_bar, 'below')
    
    p.xaxis.visible = False
    p.yaxis.visible = False

    #Display figure.
    show(p)
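
Because `draw_map` picks `len(ticks_lab) - 1` colours and stretches them linearly between the first and the last tick, each colour covers exactly one tick interval when the ticks are evenly spaced. The sketch below is a plain-Python approximation of that linear binning, not Bokeh's actual implementation. `colour_index` is a hypothetical helper, and clamping out-of-range values to the first or last colour is an assumption about the default behaviour.

```python
def colour_index(value, low, high, n_colours):
    """Map value to a palette index by splitting [low, high] into equal bins."""
    if value <= low:
        return 0
    if value >= high:
        return n_colours - 1
    # Multiply before dividing so that exact tick values land on bin edges.
    position = (value - low) * n_colours / (high - low)
    return min(int(position), n_colours - 1)


ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000]
n_colours = len(ticks) - 1  # same rule as in draw_map

for count in (250, 1000, 3500, 5999, 7000):
    print(count, '-> colour', colour_index(count, ticks[0], ticks[-1], n_colours))
```

Seen this way, the tick list passed to `draw_map` does double duty: it labels the colour bar and it fixes both the value range and the number of colours.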

<a id='map_apt_by_neigh'>&nbsp;</a>
### Where can we choose from? AKA: Apartments availability by Neighbourhood Group

Earlier we saw a donut plot showing that four neighbourhoods account for 76.2% of the offer in Berlin.<br>
This map shows where these neighbourhoods are located.

In [None]:
offer_by_neigh = listings_300.groupby('neighbourhood_group_cleansed').id.count().reset_index().sort_values(by='id', ascending = False)
offer_by_neigh.columns = ['neighbourhood_group_cleansed', 'occurrencies']
neigh_coord = merge_coords(offer_by_neigh, 'neighbourhood_group_cleansed', 'occurrencies')
str_labels = [0, 1000, 2000, 3000, 4000, 5000, 6000]
draw_map(neigh_coord, 'occurrencies', str_labels, 'Berlin Neighbourhoods by number of available properties')
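
The share of the offer quoted above is a cumulative figure: the listings in the four largest neighbourhood groups divided by all listings. Here is a minimal sketch of that arithmetic, using made-up counts that are not taken from the dataset; `top_share` is a hypothetical helper.

```python
def top_share(counts, n):
    """Percentage of all items that fall in the n largest groups."""
    total = sum(counts.values())
    largest = sorted(counts.values(), reverse=True)[:n]
    return 100 * sum(largest) / total


# Made-up listing counts per neighbourhood group, for illustration only.
counts = {'A': 400, 'B': 300, 'C': 150, 'D': 50, 'E': 60, 'F': 40}
print(f'{top_share(counts, 4):.1f}%')
```

Since `offer_by_neigh` is already sorted in descending order, the same figure could presumably be obtained from it as `offer_by_neigh.occurrencies.head(4).sum() / offer_by_neigh.occurrencies.sum()`.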

<a id='map_price_by_neigh'>&nbsp;</a>

### This neighbourhood is hot... Prices Distribution by neighbourhood

Let's talk again about hot stuff!<br>
The plot below shows which neighbourhoods have a higher average price and where they are located within Berlin.

In [None]:
prices_by_neigh_group = listings_300.groupby('neighbourhood_group_cleansed').price.mean().reset_index()
neigh_coord = merge_coords(prices_by_neigh_group, 'neighbourhood_group_cleansed', 'price')

str_labels = np.arange(45, 65, 3)

draw_map(neigh_coord, 'price', str_labels, 'Berlin Neighbourhoods by Average price (Euro)')

<a id='location_appreciation_again'>&nbsp;</a>
### Location appreciation, map version

A few plots above, we listed neighbourhoods by location score.<br>
In the following plot we turn that list of neighbourhoods (which may not mean much to anyone who isn't a Berlin local) into a map of the city, to help us understand where the best locations are.

In [None]:
by_location.columns = ['neighbourhood_group_cleansed', 'avg_location_score']
neigh_coord = merge_coords(by_location, 'neighbourhood_group_cleansed', 'avg_location_score')

str_labels = np.arange(8.8, 9.8, .1)

draw_map(neigh_coord, 'avg_location_score', str_labels, 'Berlin Neighbourhoods by Location Score')

### Conclusions

My analysis confirms the _true spirit_ of AirBnB: plenty of apartments and spare rooms (or single beds) available for travellers who may not be looking for the comfort of a hotel, but who can still find a simpler (and often cozy) place to spend a quick and satisfying vacation.

The average price is also very affordable, in perfect AirBnB style! But more sophisticated and expensive offers are available, too.

It's confirmed that the _best_ and _richest_ (in terms of available properties) locations are those surrounding the heart of the city, where guests give the most generous _location_ reviews.<br>

We also focused on Superhosts and saw that their average performance exceeds that of _regular_ hosts (even if just slightly); in my opinion, this is a sign that AirBnB's idea (offering a _mark of quality_ to some users) works as expected.