# Introduction
In this Jupyter notebook, we investigate the [Airbnb Munich dataset](http://insideairbnb.com/get-the-data.html) (snapshot of 24 December 2020).
<br> The structure of this notebook follows the CRISP-DM process. We cover the following five of its six phases (deployment is out of scope):

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modeling
5. Results Evaluation


# Business Understanding

In this project, we want to answer the following three questions, which might be relevant for tourists staying in an Airbnb listing in Munich in 2021:

1. In which months are most Airbnb listings still available (total and by room type)?
2. In which Munich areas (zip codes) are the best Airbnb listings (according to total rating)?
3. Do cheaper listings have a lower rating compared to more expensive listings? 

# Data Understanding

For answering the questions, we will focus on the following data:

1. For the first question, we will use data from the calendar.csv file. This file includes the date, price, availability, and the minimum and maximum number of nights for each Airbnb listing. Relevant columns for us are "listing_id", "date", and "available". In addition, we will use the column "room_type" from the listings.csv file, which includes a detailed description of each Airbnb listing.

2. For the second question, we will use data from the listings.csv file. Relevant columns for us are "id" (of the listing), "review_scores_rating" (total rating score), "latitude", and "longitude". We will use the latitude and longitude data to obtain the zip code for each listing, using [geographic data from Germany](https://www.suche-postleitzahl.org/plz-karte-erstellen). Relevant columns of the geographic data are "plz" (zip code) and "geometry".

3. For the third question, we will use data from the listings.csv file. Relevant columns for us are "id" (of the listing), "review_scores_rating" (total rating score), and "price".

In [1]:
# Import necessary libraries
%matplotlib inline
import altair as alt
import calendar
import geopandas as gpd
from geopandas import GeoDataFrame
import matplotlib as mpl
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sb
from shapely.geometry import Point

In [37]:
# Load relevant Airbnb data
calendar_df = pd.read_csv("./Data/calendar.csv")
listings_df = pd.read_csv("./Data/listings.csv")
# Load map data
gdf_locations = gpd.read_file('./Data/plz-5stellig.shp', dtype={'plz': str})

In [3]:
# Display the calendar data
calendar_df.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,172672,2020-12-24,f,$49.00,$49.00,1,4
1,172672,2020-12-25,f,$49.00,$49.00,1,4
2,97945,2020-12-30,f,$80.00,$80.00,2,90
3,97945,2020-12-31,f,$80.00,$80.00,2,90
4,97945,2021-01-01,f,$80.00,$80.00,2,90


In [4]:
# Display the listings data
pd.set_option('display.max_columns', 100)
listings_df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,97945,https://www.airbnb.com/rooms/97945,20201224153430,2020-12-30,Deluxw-Apartm. with roof terrace,"<b>The space</b><br />We offer a modern, quiet...",We are living in a outskirt of Munich its call...,https://a0.muscache.com/pictures/2459996/10b4c...,517685,https://www.airbnb.com/users/show/517685,Angelika,2011-04-18,"Munich, Bayern, Germany",Ich freue mich auf viele internationale Gäste!...,,,100%,t,https://a0.muscache.com/im/users/517685/profil...,https://a0.muscache.com/im/users/517685/profil...,Hadern,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Munich, Bavaria, Germany",Hadern,,48.11476,11.48782,Entire apartment,Entire home/apt,2,,1 bath,1.0,1.0,"[""Coffee maker"", ""Essentials"", ""Bed linens"", ""...",$80.00,2,90,2.0,2.0,90.0,90.0,2.0,90.0,,t,0,0,0,5,2020-12-30,130,0,0,2011-12-20,2019-10-03,97.0,10.0,10.0,10.0,10.0,9.0,9.0,,f,2,2,0,0,1.18
1,114695,https://www.airbnb.com/rooms/114695,20201224153430,2020-12-24,Apartment Munich/East with sundeck,<b>The space</b><br />It´s a quiet and sunny a...,,https://a0.muscache.com/pictures/21571874/960e...,581737,https://www.airbnb.com/users/show/581737,Stephan,2011-05-12,"Munich, Bayern, Germany",I am looking forward to meet interesting peopl...,,,100%,f,https://a0.muscache.com/im/users/581737/profil...,https://a0.muscache.com/im/users/581737/profil...,Berg am Laim,3.0,3.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,,Berg am Laim,,48.11923,11.63726,Entire apartment,Entire home/apt,5,,1 bath,1.0,3.0,"[""Hot water"", ""Coffee maker"", ""Essentials"", ""B...",$95.00,2,30,2.0,2.0,1125.0,1125.0,2.0,1125.0,,t,0,0,0,52,2020-12-24,53,0,0,2011-07-08,2019-10-06,95.0,9.0,10.0,10.0,10.0,9.0,9.0,,f,2,2,0,0,0.46
2,127383,https://www.airbnb.com/rooms/127383,20201224153430,2020-12-24,City apartment next to Pinakothek,<b>The space</b><br />My cosy apartment is loc...,,https://a0.muscache.com/pictures/79238c11-bc61...,630556,https://www.airbnb.com/users/show/630556,Sonja,2011-05-26,"Munich, Bayern, Germany","Hi, mein Name ist Sonja und ich freue mich net...",within a few hours,100%,88%,t,https://a0.muscache.com/im/users/630556/profil...,https://a0.muscache.com/im/users/630556/profil...,Maxvorstadt,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'selfie...",t,t,,Maxvorstadt,,48.15198,11.56486,Entire apartment,Entire home/apt,4,,1 bath,1.0,1.0,"[""Hot water"", ""Pack \u2019n Play/travel crib"",...",$99.00,2,14,2.0,2.0,14.0,14.0,2.0,14.0,,t,9,9,9,9,2020-12-24,93,11,0,2011-06-04,2020-10-20,98.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,1,1,0,0,0.8
3,170815,https://www.airbnb.com/rooms/170815,20201224153430,2020-12-24,Your own flat near central station!,<b>The space</b><br />It's a 1-room studio app...,,https://a0.muscache.com/pictures/86b4037c-098a...,814793,https://www.airbnb.com/users/show/814793,Inge,2011-07-13,"Munich, Bavaria, Germany",Die \nH\n\n,within a day,67%,71%,f,https://a0.muscache.com/im/pictures/user/d5dec...,https://a0.muscache.com/im/pictures/user/d5dec...,Neuhausen,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,,Neuhausen-Nymphenburg,,48.16132,11.54154,Entire apartment,Entire home/apt,2,,1 bath,1.0,1.0,"[""Hot water"", ""Dedicated workspace"", ""Coffee m...",$65.00,3,14,3.0,3.0,14.0,14.0,3.0,14.0,,t,0,0,0,0,2020-12-24,64,2,0,2011-08-31,2020-02-18,91.0,9.0,9.0,10.0,10.0,9.0,9.0,,f,1,1,0,0,0.56
4,171749,https://www.airbnb.com/rooms/171749,20201224153430,2020-12-24,1min to subway - Wettersteinplatz,The apartment is located in a very quiet locat...,"Nearby is the FC Bayern Munich area, about 10 ...",https://a0.muscache.com/pictures/88ca5688-2b45...,819382,https://www.airbnb.com/users/show/819382,Tarek,2011-07-14,"Munich, Bavaria, Germany",Lieber Besucher/in\n\nich bin Tarek und wohne ...,within an hour,100%,98%,t,https://a0.muscache.com/im/pictures/user/31c24...,https://a0.muscache.com/im/pictures/user/31c24...,Untergiesing - Harlaching,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Munich, Bavaria, Germany",Untergiesing-Harlaching,,48.10583,11.57843,Private room in apartment,Private room,1,,1 shared bath,1.0,1.0,"[""Hot water"", ""Coffee maker"", ""Essentials"", ""P...",$37.00,3,10,3.0,3.0,1125.0,1125.0,3.0,1125.0,,t,22,47,77,85,2020-12-24,357,21,1,2011-07-31,2020-12-04,98.0,10.0,10.0,10.0,10.0,10.0,10.0,,t,1,0,1,0,3.12


In [5]:
# Display the geographic data
gdf_locations.head()

Unnamed: 0,einwohner,note,plz,qkm,geometry
0,11957,01067 Dresden,1067,6.866862,"POLYGON ((13.68689 51.06395, 13.69570 51.06499..."
1,25491,01069 Dresden,1069,5.351816,"MULTIPOLYGON (((13.72635 51.02186, 13.73031 51..."
2,14821,01097 Dresden,1097,3.304476,"POLYGON ((13.72548 51.06860, 13.72629 51.06900..."
3,28018,01099 Dresden,1099,58.500065,"POLYGON ((13.74218 51.08979, 13.74335 51.08863..."
4,5876,01108 Dresden,1108,16.447222,"POLYGON ((13.76543 51.17491, 13.76637 51.17469..."


In [38]:
# Select columns relevant for all three questions
# (.copy() avoids pandas' SettingWithCopyWarning when columns are converted later)
calendar_df = calendar_df[["listing_id", "date", "available"]].copy()
listings_df = listings_df[["id", "room_type", "review_scores_rating", "latitude", "longitude", "price"]].copy()
gdf_locations = gdf_locations[["plz", "geometry"]].copy()

In [8]:
# Count number of rows
print(f"The number of rows for the calendar data is: {len(calendar_df.index):,}")
print(f"The number of rows for the listings data is: {len(listings_df.index):,}")
print(f"The number of rows for the geographic data is: {len(gdf_locations.index):,}")

The number of rows for the calendar data is: 1,754,920
The number of rows for the listings data is: 4,815
The number of rows for the geographic data is: 8,169


# Data Preparation

In the following, we perform some common data cleanup steps. We focus in particular on converting data types and on handling missing data.

In [103]:
# Identify duplicates
def duplicates(df, name):
    '''
    Checks a given dataframe (df) for fully duplicated rows and
    prints whether or not duplicates were found.
    name: label of the dataframe used in the printed message
    '''
    # keep=False marks every row that occurs more than once
    n_duplicates = df.duplicated(keep=False).sum()
    if n_duplicates == 0:
        print("No duplicates were found in dataframe " + name + ".")
    else:
        print("There are some duplicates in dataframe " + name + ".")

duplicates(calendar_df, "calendar_df")
duplicates(listings_df, "listings_df")
duplicates(gdf_locations, "gdf_locations")

No duplicates were found in dataframe calendar_df.
No duplicates were found in dataframe listings_df.
No duplicates were found in dataframe gdf_locations.
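
To illustrate what the duplicate check flags, here is a minimal sketch on hypothetical toy data (the values and the name `toy_df` are made up for illustration and are not part of the Airbnb dataset). With `keep=False`, pandas marks every member of a duplicate group, not only the repeated rows.

```python
import pandas as pd

# Hypothetical toy data: the first and the last row are identical
toy_df = pd.DataFrame({
    "listing_id": [1, 2, 1],
    "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
})

# keep=False flags all rows of a duplicate group
duplicate_mask = toy_df.duplicated(keep=False)
n_flagged = int(duplicate_mask.sum())
print(duplicate_mask.tolist(), n_flagged)
```

With the default `keep='first'`, only the last row would be flagged, which is why counting flagged rows gives different numbers depending on this argument.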


In [39]:
# Change the data type for calendar_df where necessary (columns date and available)
print("Before data preparation:")
print(calendar_df.dtypes)
calendar_df["date"] = pd.to_datetime(calendar_df["date"], errors='coerce')
# Map the flags 't'/'f' to booleans
calendar_df["available"] = calendar_df["available"].map({'t': True, 'f': False})
print("\nAfter data preparation: ")
print(calendar_df.dtypes)

Before data preparation:
listing_id     int64
date          object
available     object
dtype: object

After data preparation: 
listing_id             int64
date          datetime64[ns]
available               bool
dtype: object
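
The following sketch shows, on hypothetical toy values, how the two conversions behave. It assumes that malformed dates should become `NaT` (that is what `errors='coerce'` does) and that the flags only take the values `'t'` and `'f'`.

```python
import pandas as pd

# Hypothetical toy data with one malformed date
toy_calendar = pd.DataFrame({
    "date": ["2021-01-01", "not a date"],
    "available": ["t", "f"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed_dates = pd.to_datetime(toy_calendar["date"], errors="coerce")

# Map the string flags to Python booleans
available_flags = toy_calendar["available"].map({"t": True, "f": False})

print(parsed_dates.isna().tolist())
print(available_flags.tolist())
```

Because coerced values are silently replaced by `NaT`, it is worth checking for missing dates after the conversion.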


In [46]:
# Change the data type for listings_df for column price
print("Before data preparation:")
print(listings_df.dtypes)
# Remove the literal "$" and "," characters (regex=False, because "$" as a regex is the end-of-string anchor)
listings_df["price"] = listings_df["price"].str.replace("$", '', regex=False)
listings_df["price"] = listings_df["price"].str.replace(",", '', regex=False).astype(float)
print("\nAfter data preparation: ")
print(listings_df.dtypes)

Before data preparation:
id                        int64
room_type                object
review_scores_rating    float64
latitude                float64
longitude               float64
price                    object
dtype: object

After data preparation: 
id                        int64
room_type                object
review_scores_rating    float64
latitude                float64
longitude               float64
price                   float64
dtype: object
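
As a small illustration of the price cleanup, the sketch below converts a few hypothetical price strings to floats. It treats `$` and `,` as literal characters (`regex=False`), since `$` interpreted as a regular expression would match the end of the string rather than the dollar sign.

```python
import pandas as pd

# Hypothetical price strings in the format used by the listings data
raw_prices = pd.Series(["$80.00", "$1,250.00", "$37.00"])

# Strip the currency symbol and the thousands separator, then convert
cleaned_prices = (
    raw_prices
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(cleaned_prices.tolist())
```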


In [85]:
# Identify missing values
def missing_values(df, name):
    '''
    Calculates the number and percentage of missing values for a given dataframe (df)
    and prints the missing value statistics for each column.
    name: label of the dataframe used in the printed output
    '''
    print("Missing value statistics for dataframe: " + name)
    missing_stats = df.isnull().sum()
    for key, value in missing_stats.items():  # Series.iteritems() was removed in pandas 2.0
        percent = round(value * 100 / len(df.index))
        print(key + ": " + str(value) + " (" + str(percent) + "%)")
    print()

missing_values(calendar_df, "calendar_df")
missing_values(listings_df, "listings_df")
missing_values(gdf_locations, "gdf_locations")

Missing value statistics for dataframe: calendar_df
listing_id: 0 (0%)
date: 0 (0%)
available: 0 (0%)

Missing value statistics for dataframe: listings_df
id: 0 (0%)
room_type: 0 (0%)
review_scores_rating: 1220 (25%)
latitude: 0 (0%)
longitude: 0 (0%)
price: 0 (0%)

Missing value statistics for dataframe: gdf_locations
plz: 0 (0%)
geometry: 0 (0%)



Only the column "review_scores_rating" has missing values: 1,220 of 4,815 listings (25%) have no rating.

There are two common strategies for dealing with missing values (NaN): dropping the affected rows or columns, or imputing the missing values. Imputation means that you fill in a value where the original value is missing. Typical fill values are the mean, the median, or the mode of the column. The mode suits categorical data. For a numerical column such as "review_scores_rating", the mode makes less sense, and we could use the mean or the median instead.

Simply dropping data points with NaNs can result in the loss of a lot of information (here 25% of the listings). Imputed values will not be exactly right in most cases and may be systematically above or below the actual values. However, imputation usually gives more accurate results than dropping data points or columns entirely.

For this analysis, imputation is not necessary, because we expect it to produce similar results. The zip codes shown in grey on the map (Question 2) would still have too little data to work with.
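
The options discussed above can be compared on a small example. This is only a sketch on hypothetical rating values (not taken from the dataset). It shows dropping the missing value versus filling it with the mean or the median.

```python
import pandas as pd

# Hypothetical ratings with one missing value
ratings = pd.Series([97.0, 95.0, None, 91.0, 98.0], name="review_scores_rating")

# Option 1: drop rows without a rating
dropped = ratings.dropna()

# Option 2: impute with the mean or the median of the observed values
filled_mean = ratings.fillna(ratings.mean())
filled_median = ratings.fillna(ratings.median())

print(len(dropped), filled_mean.iloc[2], filled_median.iloc[2])
```

Imputation keeps all five rows, but the filled value is an estimate, so statistics such as the standard deviation will look smaller than they really are.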



In [58]:
print(len(calendar_df.index))

1754920


In [None]:
def missing_statistics(df):
    '''
        Calculates missing value statistics for a given dataframe and
        returns a dataframe containing number of missing values per column
        and the percentage of values missing per column.
        arguments:
            df: the dataframe for which missing values need to be calculated.
    '''
    missing_stats = df.isnull().sum().to_frame()
    missing_stats.columns = ['num_missing']
    missing_stats['pct_missing'] = np.round(100 * (missing_stats['num_missing'] / df.shape[0]))
    missing_stats.sort_values(by='num_missing', ascending=False, inplace=True)
    
    

    return missing_stats

In [None]:
print("The Airbnb Munich dataset originally includes data between " + min(calendar_df['date']).strftime("%Y/%m/%d") + " and " + max(calendar_df['date']).strftime("%Y/%m/%d") +".")
print("As we are only interested in data from 2021, we skip the December 2020 data.")
calendar_df = calendar_df.loc[(calendar_df["date"] >= "2021-01-01") & (calendar_df["date"] <= "2021-12-31")]
print("The Airbnb Munich dataset now includes data between " + min(calendar_df['date']).strftime("%Y/%m/%d") + " and " + max(calendar_df['date']).strftime("%Y/%m/%d") +".")





In [None]:
# Assign a zip code to each listing (spatial join of the listing coordinates and the zip code polygons)
listings_data_df = listings_df.copy()
# Build point geometries from longitude (x) and latitude (y)
points = gpd.points_from_xy(listings_data_df['longitude'], listings_data_df['latitude'])
geo_data = gpd.GeoDataFrame(listings_data_df, geometry=points, crs=gdf_locations.crs)
# Keep every listing and attach the zip code polygon that contains it
# (geopandas >= 0.10 uses `predicate`; older versions call this argument `op`)
geo_result = gpd.sjoin(geo_data, gdf_locations, how='left', predicate='within')
geo_result.head()
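
To make the idea behind the "within" join more concrete, here is a plain-Python sketch of a point-in-polygon test (ray casting). This is a simplified stand-in for illustration only: geopandas relies on shapely's geometry predicates, not on this loop. The zone names and coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a ray to the right of (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(x, y, zones):
    """Return the name of the first zone containing the point, or None."""
    for name, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Hypothetical zones given as (longitude, latitude) vertices
zones = {
    "zone_a": [(11.5, 48.1), (11.6, 48.1), (11.6, 48.2), (11.5, 48.2)],
    "zone_b": [(11.6, 48.1), (11.7, 48.1), (11.7, 48.2), (11.6, 48.2)],
}
print(find_zone(11.55, 48.15, zones), find_zone(12.0, 48.15, zones))
```

As in the left join above, a point outside all polygons yields no match (`None` here, NaN in the joined dataframe).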

In [None]:
###########################################################################################################
# Question 1: In which months are most Airbnb listings still available (total and by room type)?
###########################################################################################################

# Data Modeling

TODO: Describe the data distributions (bar charts, histograms) to better understand the data in the analysis.




In [None]:
# Number of calendar entries (one entry per listing and day) per month:
no_listings_per_month=calendar_df[['listing_id']].groupby([calendar_df['date'].dt.year.rename('year'), calendar_df['date'].dt.month.rename('month')]).count()
no_listings_per_month.rename(columns={'listing_id':'number_of_listings'}, inplace=True)
print(no_listings_per_month)
print("\n" + "The table shows that fewer rooms are listed for some months.")
print("December is the month with the lowest number of rooms listed on Airbnb. Keep in mind that we also have less data for that month (data for the 30th and 31st of December is missing).")

In [None]:


x = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
y = no_listings_per_month['number_of_listings'].values

fig, ax = plt.subplots() 

fig.set_size_inches(16, 6)
width = 0.75 # the width of the bars 
ind = np.arange(len(y))  # the x locations for the groups

ax.barh(ind, y, width, color="blue")
ax.set_yticks(ind)  # barh centers each bar on its position, so the tick belongs at ind
ax.set_yticklabels(x, minor=False)
for i, v in enumerate(y):
    ax.text(v, i, str(format(v, ',')), color='black')

plt.title('Number of Airbnb listings per month', fontsize=14)
plt.xlabel('Number of listings')
plt.ylabel('Month')      
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))


plt.show()

In [None]:
# merge the two dataframes listings_df and calendar_df 
calendar_df_with_room_types=pd.merge(calendar_df,listings_df, left_on='listing_id',right_on='id')
# select relevant columns
calendar_df_with_room_types = calendar_df_with_room_types[["date", "available", "listing_id", "room_type"]]

print("Question 1a: In which months are most Airbnb listings still available (total)?", end="\n\n")
# Count the calendar entries per month and availability status (summing the listing IDs would weight listings by their ID)
sum_1a = calendar_df_with_room_types.groupby([calendar_df_with_room_types['date'].dt.year.rename('year'), calendar_df_with_room_types['date'].dt.month.rename('month'), 'available'])['listing_id'].count()
result_1a = sum_1a / sum_1a.groupby(level=[0, 1]).transform("sum") * 100
result_1a_df = result_1a.to_frame().rename(columns = {'listing_id':'percentage'}) # Convert series to dataframe and rename column
print(result_1a_df, end="\n\n")
print("February 2021 is the month with the highest availability, followed by March 2021 and January 2021. This makes sense")
print("given the current Corona situation: people hesitate to book a room for the near future.")
print("October 2021 is the month with the lowest availability rate. People look forward to going to the Oktoberfest again.")


print("Question 1b: In which months are most Airbnb listings still available (by room_type)?", end="\n\n")
# Count the calendar entries per month, availability status and room type
sum_1b = calendar_df_with_room_types.groupby([calendar_df_with_room_types['date'].dt.year.rename('year'), calendar_df_with_room_types['date'].dt.month.rename('month'), 'available', 'room_type'])['listing_id'].count()
result_1b = sum_1b / sum_1b.groupby(level=[0, 1]).transform("sum") * 100
result_1b_df = result_1b.to_frame().rename(columns = {'listing_id':'percentage'}) # Convert series to dataframe and rename column
pd.set_option('display.max_rows', 1000)
print(result_1b_df, end="\n\n")
print("This data is hard to read. Let's split the data, make some nice graphics and try out two different Python libraries.")
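
The availability rate per month can also be read directly from a boolean column: the mean of `True`/`False` values is the share of `True`. The sketch below shows this on hypothetical toy data. It assumes one row per listing and day and a boolean `available` column, as in the prepared calendar data.

```python
import pandas as pd

# Hypothetical toy calendar: 4 entries in January, 2 in February
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
        "2021-02-01", "2021-02-02",
    ]),
    "available": [True, True, True, False, True, False],
})

# The mean of a boolean column is the fraction of True values
availability_rate = toy.groupby(toy["date"].dt.month)["available"].mean() * 100
print(availability_rate.to_dict())
```

This gives the share of available listing days per month; the unavailable share is simply 100 minus that value.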


In [None]:
# The final dataframes from here also serve as input for the second graphic (see below)
result_1b_df_ind = result_1b_df.reset_index()

print("Percentage of available rooms by month and room type:", end="\n\n")
result_1b_available_df_ind= result_1b_df_ind[result_1b_df_ind['available'] == True]
available_df = result_1b_available_df_ind.pivot(index='month', columns='room_type', values='percentage')
print(available_df, end="\n\n\n\n")

print("Percentage of unavailable rooms by month and room type:", end="\n\n")
result_1b_unvailable_df_ind= result_1b_df_ind[result_1b_df_ind['available'] == False]
unavailable_df = result_1b_unvailable_df_ind.pivot(index='month', columns='room_type', values='percentage')
print(unavailable_df, end="\n\n\n\n")


print("Entire home/apartments, private rooms and shared rooms have the highest availability rate in Jan, Feb and Mar 2021.")
print("Only hotel rooms are more booked in January 2021 compared to the other months.")

#Hint: the sum of the values from the first row of the two tables is 100 (the same holds for all other rows)
#first row: 33.857044 + 0.897881 + 31.324853 + 1.372434 + 21.252804 + 0.512711 + 10.579063 + 0.203210 = 100 
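
The reshaping from the long format (one row per month and room type) to the wide tables above can be sketched as follows. The percentages are hypothetical placeholders, and `long_df` and `wide_df` are illustrative names only.

```python
import pandas as pd

# Hypothetical long-format data: one row per month and room type
long_df = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "room_type": ["Private room", "Shared room", "Private room", "Shared room"],
    "percentage": [30.0, 5.0, 28.0, 4.0],
})

# One row per month, one column per room type
wide_df = long_df.pivot(index="month", columns="room_type", values="percentage")
print(wide_df)
```

Passing `index`, `columns` and `values` as keywords keeps the call readable and works across pandas versions.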

In [None]:
# Set the labels
fig = plt.figure()
fig.suptitle('Availability rate of Airbnb listings for 2021', fontsize=14)
plt.xlabel('Month', fontsize=10)
plt.ylabel('Percentage', fontsize=10)

# Transform the data 
result_1a_df_ind = result_1a_df.reset_index()
available = np.array(result_1a_df_ind[result_1a_df_ind['available'] == True].iloc[:,[3]].T.values[0])  # Transpose (.T)
unavailable = np.array(result_1a_df_ind[result_1a_df_ind['available'] == False].iloc[:,[3]].T.values[0])

# Make the stacked bar plot
columns = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pos = np.arange(len(columns))
p1 = plt.bar(pos, np.add(unavailable , available), color='red', edgecolor='red', label= 'unavailable')
p2 = plt.bar(pos, available, color='green', edgecolor='green', label= 'available')
plt.xticks(pos, columns)

# Create a legend
fontP = FontProperties()
fontP.set_size('medium')
plt.legend(handles=[p1, p2], bbox_to_anchor=(1.05, 1), loc='upper left', prop=fontP)

# Show graphic
plt.show()

In [None]:


# Change from pivot table to normal table format; add column available
def prepare_df(df, name):
    df = df.stack().reset_index()
    df.columns = ['month', 'room_type', 'values']
    df['available'] = name
    return df

# Settings for displaying the chart
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
order = ['available', 'unavailable'] 

# Prepare the data (for the dataframes, see two cells above)
unavailable_df = prepare_df(unavailable_df, 'unavailable')
available_df = prepare_df(available_df, 'available')
df = pd.concat([available_df, unavailable_df]) # join the two dataframes
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x]) # convert values from month column to month name


# Make the stacked bar plot
chart = alt.Chart(df, title=['Availability rate of different Airbnb room types', '   ']).mark_bar().encode(

    # Use room type as x-axis 
    x=alt.X('room_type:N', title=None),

    # Use percentage as y-axis 
    y=alt.Y('sum(values):Q',
        axis=alt.Axis(
            grid=False,
            title='Percentage')),

    # Use month as the set of columns to be represented in each group
    column=alt.Column('month:N',  title=None, sort=months),

    # Set the colours 
    color=alt.Color('available:N', sort=order, 
            scale=alt.Scale(
                # make it look pretty with a pleasant color palette
                range=['green', 'red']
            ), title =None, 
        ), 
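    # Determine the stacking order by the availability label (we want available on the bottom);
    # the dataframe has no "order" column, so we sort on the "available" field instead
    order=alt.Order('available:N', sort='ascending')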
    )

#Configure the chart and display it
chart.configure_view(strokeOpacity=0)

In [None]:
###########################################################################################################
# Question 2: In which Munich areas (zip codes) are the best Airbnb listings (according to total rating)?
###########################################################################################################

In [None]:
# Convert geodataframe to pandas-dataframe
munich_df  = pd.DataFrame(geo_result)
# Remove non-relevant columns
munich_df = munich_df.drop(columns=['index_right',  'geometry', 'latitude', 'longitude'])
munich_df.head()

In [None]:
print("- The number of rows is: " + str(f"{len(munich_df.index):,}"))

In [None]:
# Remove null values 
print(munich_df.isnull().sum())
munich_df = munich_df.dropna()

In [None]:
#Display amount of the different review_scores for each zip code
munich_df.groupby(['plz','review_scores_rating']).size().unstack()

In [None]:
# Calculate the number of ratings and the mean review score for each zip code
res_2 = munich_df.groupby('plz')['review_scores_rating'].agg(['count', 'mean']).reset_index()
print(res_2.head(100))
print()
print("We can see that the ratings are very high for all listings (all above 90).")
print("Some of the zip codes only have a few ratings.")

In [None]:
# Get polygon geometry data for the plot
pd.set_option('display.max_rows', 100)
# Resulting column order: plz, geometry, count, mean (the plots below select the mean by position)
plot_input_data = pd.merge(left=gdf_locations[['plz', 'geometry']], right=res_2, on='plz', how='right')
plot_input_data.head(100)

In [None]:
# Replace the mean rating with NaN for zip codes with fewer than 10 ratings in total
minimum_number = 10
plot_input_data['mean'] = plot_input_data['mean'].where(plot_input_data['count'] >= minimum_number)
plot_input_data = plot_input_data.sort_values(by='mean', ascending=False)
plot_input_data.head(100)
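
The masking step can be illustrated on a small table. This is a sketch with hypothetical zip code labels and values; it assumes that a zip code needs at least `minimum_number` ratings for its mean to be shown.

```python
import pandas as pd

# Hypothetical per-zip-code statistics
stats = pd.DataFrame({
    "plz": ["A", "B", "C"],
    "count": [25, 10, 3],
    "mean": [95.0, 93.5, 99.0],
})

minimum_number = 10
# where() keeps values for which the condition holds and sets the rest to NaN
stats["mean_masked"] = stats["mean"].where(stats["count"] >= minimum_number)
print(stats)
```

Zip code "C" has the highest mean but only three ratings, so it is masked. This is exactly the case the threshold is meant to catch.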

In [None]:

# Create the plot
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
%matplotlib inline
fig, ax = plt.subplots(figsize=(28,14))

GeoDataFrame(plot_input_data).plot(ax=ax, column=plot_input_data.columns[3], categorical=False, legend=True, cmap='summer_r',
                                   missing_kwds=dict(color='grey'))

#Set the title
ax.set_title('Average rating of Airbnb listings for each zip code', pad=10, fontsize=18)

# Remove axis labels for latitude and longitude
ax.axes.xaxis.set_visible(False)
ax.axes.yaxis.set_visible(False)
ax.set(facecolor='lightgrey');

# Add zip code labels
plot_input_data.apply(lambda x: ax.annotate(text=x.plz, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);

print("Note: Zip codes with fewer than 10 ratings were excluded from the analysis and are marked in dark grey.")
print("Results: It seems that most of the popular areas are located in the west of Munich.")
print("This area has very good transport connections to the inner city and also to the Oktoberfest.")
print("As some of the zip codes in the west of Munich, like Laim, have lower house/apartment prices compared to other areas,")
print("let's investigate the relationship between listing price and listing (total) rating.")

In [None]:
# same plot with all Airbnb listings included in the figure

# Create the plot
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
%matplotlib inline
fig, ax = plt.subplots(figsize=(28,14))

GeoDataFrame(plot_input_data).plot(ax=ax, column=plot_input_data.columns[3], categorical=False, legend=True, cmap='summer_r',
                                   missing_kwds=dict(color='grey'))

# Overlay the listing locations (point geometries created for the spatial join)
geo_data.plot(ax=ax, color='black', markersize=5);

#Set the title
ax.set_title('Average rating of Airbnb listings for each zip code (including the locations of the listings)', pad=10, fontsize=18)

# Remove axis labels for latitude and longitude
ax.axes.xaxis.set_visible(False)
ax.axes.yaxis.set_visible(False)
ax.set(facecolor='lightgrey');

print("Most of the Airbnb listings are located close to the inner city.")

In [None]:
###########################################################################################################
# Question 3: Do cheaper listings have a lower rating compared to more expensive listings? 
###########################################################################################################


In [None]:
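# Select the columns relevant for question 3
listings_data_df_price = listings_df[["id", "price", "review_scores_rating"]]
# As NaN values are not taken into account in the scatter plot, we do not need to remove them.
listings_data_df_price.isnull().sum()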

In [None]:
print(listings_data_df_price[["price", "review_scores_rating"]].describe())
print()
print("The high standard deviation (247) indicates that the prices are widely spread out.")
print("The maximum is also very high (8255).")
print("There is also at least one price value that equals 0 (see minimum).")

In [None]:
# Let's have a look at the listings with a price of 0.
# All rows with a price of 0 also have NaN rating values.
# As NaN values are not taken into account in the scatter plot, we do not need to remove them.
listings_data_df_price[listings_data_df_price["price"] == 0]

In [None]:
# Scatter plot to describe relationship between price and (total) rating

ax = listings_data_df_price.plot(kind='scatter', x='price', y='review_scores_rating')

plt.xlabel('Price in $')
plt.ylabel('Total review score')
plt.title('Scatter plot of total review rating vs. price')

ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()

print("From the scatter plot, we can see the outliers.")

In [None]:
# Remove outliers by keeping only the rows with a price of at most $600
price_less_than_600 = listings_data_df_price.query('price <= 600')
count_outliers = len(listings_data_df_price[listings_data_df_price['price'] > 600])
percentage_outliers = count_outliers / listings_data_df_price.shape[0] * 100
print('We found {} rows ({:.2f}%) with a price of more than $600.'.format(count_outliers, percentage_outliers))
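
The outlier filter can be checked on a few hypothetical prices. The sketch below uses the same fixed threshold of $600. The threshold is a judgment call based on the scatter plot, not a statistical rule.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([37.0, 65.0, 80.0, 95.0, 8255.0])
threshold = 600

# Boolean mask of the rows above the threshold
is_outlier = prices > threshold
count_outliers = int(is_outlier.sum())
percentage_outliers = is_outlier.mean() * 100

# Keep only the rows at or below the threshold
kept_prices = prices[~is_outlier]
print(count_outliers, percentage_outliers, len(kept_prices))
```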

In [None]:

plt.scatter(price_less_than_600['price'],price_less_than_600['review_scores_rating'])
plt.xlabel('Price in $')
plt.ylabel('Total review score')
plt.title('Scatter plot of total review rating vs. price (price less than $600)')
plt.show()

print("From the scatter plot, we can see that there is hardly any relationship between total review score and price.")
print("Many listings in the lower price range also have very high ratings.")

In [None]:


pearsoncorr = price_less_than_600[["price", "review_scores_rating"]].corr(method='pearson')

sb.heatmap(pearsoncorr, 
            xticklabels=pearsoncorr.columns,
            yticklabels=pearsoncorr.columns,
            cmap='RdBu',
            annot=True,
            linewidths=0.5)
print("The correlation coefficient leads us to the same conclusion:")
print("A correlation coefficient of -0.011 is considered a negligible correlation.")
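
For reference, the Pearson coefficient shown in the heatmap is the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical toy values and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and ratings
toy = pd.DataFrame({
    "price": [40.0, 60.0, 80.0, 100.0, 120.0],
    "review_scores_rating": [96.0, 94.0, 97.0, 95.0, 96.0],
})

x = toy["price"].to_numpy()
y = toy["review_scores_rating"].to_numpy()

# Center both variables, then divide the summed cross products
# by the square root of the product of the summed squares
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas computes the same quantity (Pearson is the default method)
r_pandas = toy["price"].corr(toy["review_scores_rating"])
print(round(r_manual, 3), round(r_pandas, 3))
```

Pearson's coefficient only measures linear association, so a value near zero does not rule out a non-linear relationship between price and rating.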