# Business Understanding and Set-up

## Background and Key Question

**Airbnb**

Brief description

**Key Question**

1. Using the data from a single listing snapshot, **can we accurately predict a listing's price** so that prospective hosts get a solid pricing estimate without needing an Airbnb account first?

**Assumptions**

Only publicly accessible data is used, which excludes information such as actual occupancy, so several assumptions were necessary for the analysis. The key ones are described below.

| **TOPIC** | **ASSUMPTION** |
| :----- | :----- |
| **Data date selection** | The main dataset "data" is taken from before April 2020 (in order to stay clear of COVID-19 effects). |



## Feature Glossary

[LINK](https://github.com/L-Lewis/Airbnb-neural-network-price-prediction/blob/master/Airbnb-price-prediction.ipynb)

| **FEATURE** | **DESCRIPTION** |
| :----- | :----- |
| **name** | header of Airbnb listing |



## Dataset Glossary

| **DATASET** | **DESCRIPTION** |
| :----- | :----- |
| **data_raw** | Originally imported dataset listings.csv.gz for the snapshot selected in the dashboard (`dataset_date`) |
| **data** | Naming for main working dataset throughout all notebooks |
| **data_clean** | Export from Notebook 1-Clean, import for Notebooks 2-EDA and 3-Feature Engineering |
| **data_engineered** | Export from Notebook 3-Feature Engineering, import for Notebook 4-Predictive Modeling |



## Target Feature(s) and Metric(s)

**Target 1**:
- Feature: Occupancy class
- Metric: F1-Score
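
For reference, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch in plain Python for a single positive class; the helper name `f1_score_binary` and the sample labels are illustrative assumptions and do not come from the notebooks. For a target with more than two classes, a per-class score would additionally have to be averaged.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_score_binary(y_true, y_pred)
```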

## Libraries and Dashboard

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import loadtxt
import os, glob
import geopandas as gpd
from datetime import datetime
from datetime import timedelta
%matplotlib inline

In [2]:
# Dashboard
dataset_loc = "berlin"  # "berlin", "paris", "amsterdam"
dataset_date = "2020-03-17"  # "2019-12-11", "2020-01-10", "2020-02-18", "2020-03-17", "2020-05-14"
dataset_date_min3mth = "2019-12-11"
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 100)
pd.options.display.max_seq_items = 300
#pd.options.display.max_rows = 4000
sns.set(style="white")
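
The dashboard values are combined into a folder path of the form `data/<location>_<date>/` by the `read_csv` calls in the cells that follow. A small helper could keep that convention in one place; this is only a sketch, and the function name `snapshot_file` and its `root` parameter are assumptions rather than part of the notebook.

```python
from pathlib import Path


def snapshot_file(loc, date, filename, root="data"):
    """Build the path <root>/<loc>_<date>/<filename> for one snapshot file."""
    return Path(root) / f"{loc}_{date}" / filename


# Same values as in the dashboard cell
path = snapshot_file("berlin", "2020-03-17", "listings.csv.gz")
```

The resulting `Path` object can be passed directly to `pd.read_csv`.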

# Data Mining

## Data Checks

The monthly data for Berlin consists of several files, which are briefly inspected here (for the snapshot selected in the dashboard):

- listings.csv.gz
- listings.csv
- reviews.csv.gz
- reviews.csv
- calendar.csv.gz
- neighbourhoods.csv
- neighbourhoods.geojson

**listings.csv.gz**

In [3]:
# Display contents of listings.csv.gz as well as its shape
data_listings_gz_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/listings.csv.gz")
print(data_listings_gz_insp.shape)
data_listings_gz_insp.head(3)

(25164, 106)




Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3176,https://www.airbnb.com/rooms/3176,20200317045838,2020-03-17,Fabulous Flat in great Location,This beautiful first floor apartment is situa...,1st floor (68m2) apartment on Kollwitzplatz/ P...,This beautiful first floor apartment is situa...,none,The neighbourhood is famous for its variety of...,We welcome FAMILIES and cater especially for y...,"We are 5 min walk away from the tram M2, whic...",The apartment will be entirely yours. We are c...,Feel free to ask any questions prior to bookin...,"It’s a non smoking flat, which likes to be tre...",,,https://a0.muscache.com/im/pictures/243355/84a...,,3718,https://www.airbnb.com/users/show/3718,Britta,2008-10-19,"Coledale, New South Wales, Australia",We love to travel ourselves a lot and prefer t...,within a few hours,100%,80%,f,https://a0.muscache.com/im/users/3718/profile_...,https://a0.muscache.com/im/users/3718/profile_...,Prenzlauer Berg,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"Berlin, Berlin, Germany",Prenzlauer Berg,Prenzlauer Berg Südwest,Pankow,Berlin,Berlin,10405,Berlin,"Berlin, Germany",DE,Germany,52.535,13.41758,t,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",720.0,$90.00,$520.00,"$1,900.00",$300.00,$100.00,2,$20.00,62,1125,62,62,1125,1125,62.0,1125.0,3 weeks ago,t,0,0,0,140,2020-03-17,145,1,2009-06-20,2019-06-27,93.0,9.0,9.0,9.0,9.0,10.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,1,1,0,0,1.11
1,3309,https://www.airbnb.com/rooms/3309,20200317045838,2020-03-17,BerlinSpot Schöneberg near KaDeWe,First of all: I prefer short-notice bookings. ...,"Your room is really big and has 26 sqm, is ver...",First of all: I prefer short-notice bookings. ...,none,"My flat is in the middle of West-Berlin, direc...",The flat is a strictly non-smoking facility! A...,The public transportation is excellent: Severa...,I do have a strictly non-smoker-flat. Keep th...,I'm working as a freelancing photographer. My ...,House-Rules and Information ..............(deu...,,,https://a0.muscache.com/im/pictures/29054294/b...,,4108,https://www.airbnb.com/users/show/4108,Jana,2008-11-07,"Berlin, Berlin, Germany",ENJOY EVERY DAY AS IF IT'S YOUR LAST!!! \r\n\r...,within a day,100%,100%,f,https://a0.muscache.com/im/pictures/user/d8049...,https://a0.muscache.com/im/pictures/user/d8049...,Schöneberg,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,f,"Berlin, Berlin, Germany",Schöneberg,Schöneberg-Nord,Tempelhof - Schöneberg,Berlin,Berlin,10777,Berlin,"Berlin, Germany",DE,Germany,52.49885,13.34906,t,Apartment,Private room,1,1.0,1.0,1.0,Pull-out Sofa,"{Internet,Wifi,""Pets live on this property"",Ca...",0.0,$28.00,$175.00,$599.00,$250.00,$30.00,1,$18.00,7,35,7,7,35,35,7.0,35.0,2 months ago,t,0,15,45,320,2020-03-17,27,1,2013-08-12,2019-05-31,89.0,9.0,9.0,9.0,10.0,9.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,1,0,1,0,0.34
2,6883,https://www.airbnb.com/rooms/6883,20200317045838,2020-03-17,Stylish East Side Loft in Center with AC & 2 b...,,Stay in a stylish loft on the second floor and...,Stay in a stylish loft on the second floor and...,none,The emerging and upcoming East of the new hip ...,Information on Berlin Citytax: English (Websit...,Location: - Very close to Alexanderplatz just ...,"More details: - Electricity, heating fees and ...",I rent out my space when I am travelling so I ...,No Pets. No loud Parties. Smoking only on th...,,,https://a0.muscache.com/im/pictures/274559/b0d...,,16149,https://www.airbnb.com/users/show/16149,Steffen,2009-05-07,"Berlin, Berlin, Germany",Hello and thanks for visitng my page. My name ...,within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/5df24...,https://a0.muscache.com/im/pictures/user/5df24...,Friedrichshain,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Berlin, Berlin, Germany",Friedrichshain,Frankfurter Allee Süd FK,Friedrichshain-Kreuzberg,Berlin,Berlin,10243,Berlin,"Berlin, Germany",DE,Germany,52.51171,13.45477,t,Loft,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",,$125.00,$599.00,"$1,399.00",$0.00,$39.00,1,$0.00,3,90,3,3,1125,1125,3.0,1125.0,4 days ago,t,0,0,0,0,2020-03-17,133,9,2010-02-15,2020-02-16,99.0,10.0,10.0,10.0,10.0,10.0,10.0,t,02/Z/RA/008250-18,,f,f,moderate,f,t,1,1,0,0,1.08


**listings.csv**

In [4]:
# Display contents of listings.csv as well as its shape
#data_listings_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/listings.csv")
#print(data_listings_insp.shape)
#data_listings_insp.head(2)

**reviews.csv.gz**

In [5]:
# Display contents of reviews.csv.gz as well as its shape
#data_reviews_gz_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/reviews.csv.gz")
#print(data_reviews_gz_insp.shape)
#data_reviews_gz_insp.head(2)

**reviews.csv**

In [6]:
# Display contents of reviews.csv as well as its shape
#data_reviews_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/reviews.csv")
#print(data_reviews_insp.shape)
#data_reviews_insp.head(2)

**calendar.csv.gz**

In [7]:
# Display contents of calendar.csv.gz as well as its shape
#data_cal_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/calendar.csv.gz")
#print(data_cal_insp.shape)
#data_cal_insp.head(2)

**neighbourhoods.csv**

In [8]:
# Display contents of neighbourhoods.csv as well as its shape
#data_neighb_insp = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/neighbourhoods.csv")
#print(data_neighb_insp.shape)
#data_neighb_insp.head(2)

**neighbourhoods.geojson**

In [9]:
# Display contents of neighbourhoods.geojson as well as its shape
#data_neighb_geojson_insp = gpd.read_file(f"data/{dataset_loc}_{dataset_date}/neighbourhoods.geojson")
#print(data_neighb_geojson_insp.shape)
#data_neighb_geojson_insp.head(2)

## Data Import

**Create main dataset (listings snapshot selected via `dataset_date`, here 2020-03-17, i.e. before April 2020)**

In [10]:
# Import dataset as DataFrame (as csv-file)
data_raw = pd.read_csv(f"data/{dataset_loc}_{dataset_date}/listings.csv.gz")



In [11]:
# Assign data_raw to data (in order to always keep a freshly imported data_raw) and set id as index
data = data_raw.copy()
data.set_index('id', inplace=True)

# Data Cleaning

## Pre-cleaning

In [12]:
# Display shape of "data"
data.shape

(25164, 105)

In [13]:
# Display head(1) of "data"
data.head(1)

Unnamed: 0_level_0,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
3176,https://www.airbnb.com/rooms/3176,20200317045838,2020-03-17,Fabulous Flat in great Location,This beautiful first floor apartment is situa...,1st floor (68m2) apartment on Kollwitzplatz/ P...,This beautiful first floor apartment is situa...,none,The neighbourhood is famous for its variety of...,We welcome FAMILIES and cater especially for y...,"We are 5 min walk away from the tram M2, whic...",The apartment will be entirely yours. We are c...,Feel free to ask any questions prior to bookin...,"It’s a non smoking flat, which likes to be tre...",,,https://a0.muscache.com/im/pictures/243355/84a...,,3718,https://www.airbnb.com/users/show/3718,Britta,2008-10-19,"Coledale, New South Wales, Australia",We love to travel ourselves a lot and prefer t...,within a few hours,100%,80%,f,https://a0.muscache.com/im/users/3718/profile_...,https://a0.muscache.com/im/users/3718/profile_...,Prenzlauer Berg,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"Berlin, Berlin, Germany",Prenzlauer Berg,Prenzlauer Berg Südwest,Pankow,Berlin,Berlin,10405,Berlin,"Berlin, Germany",DE,Germany,52.535,13.41758,t,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",720.0,$90.00,$520.00,"$1,900.00",$300.00,$100.00,2,$20.00,62,1125,62,62,1125,1125,62.0,1125.0,3 weeks ago,t,0,0,0,140,2020-03-17,145,1,2009-06-20,2019-06-27,93.0,9.0,9.0,9.0,9.0,10.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,1,1,0,0,1.11


In [14]:
# Display columns of "data"
#data.columns

In [15]:
# Define columns for pre-cleaning drop
select_columns = [
    'accommodates', 'amenities', 'availability_365', 'availability_90',
    'bathrooms', 'bed_type', 'bedrooms', 'beds',
    'calculated_host_listings_count', 'cancellation_policy', 'cleaning_fee',
    'description', 'experiences_offered', 'extra_people', 'first_review',
    'guests_included', 'has_availability', 'host_acceptance_rate',
    'host_has_profile_pic', 'host_identity_verified', 'host_is_superhost',
    'host_listings_count', 'host_location', 'host_response_rate',
    'host_response_time', 'house_rules', 'instant_bookable', 'interaction',
    'is_business_travel_ready', 'is_location_exact', 'last_review', 'latitude',
    'listing_url', 'longitude', 'maximum_nights', 'minimum_nights',
    'monthly_price', 'name', 'neighborhood_overview', 'neighbourhood_cleansed',
    'notes', 'number_of_reviews', 'number_of_reviews_ltm', 'price',
    'property_type', 'require_guest_phone_verification',
    'require_guest_profile_picture', 'requires_license',
    'review_scores_accuracy', 'review_scores_checkin',
    'review_scores_cleanliness', 'review_scores_communication',
    'review_scores_location', 'review_scores_rating', 'review_scores_value',
    'reviews_per_month', 'room_type', 'security_deposit', 'space',
    'square_feet', 'summary', 'transit', 'weekly_price', 'zipcode'
]

In [16]:
# Drop unnecessary columns and sort the remaining columns alphabetically
drop_columns = [el for el in data.columns if el not in select_columns]
data.drop(labels=drop_columns, inplace=True, axis=1)
data = data.reindex(sorted(data.columns, reverse=False), axis=1)
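
The drop-and-sort step can be checked on a toy frame. The sketch below uses invented columns and selects the wanted columns directly instead of dropping the unwanted ones; it is an equivalent illustration under those assumptions, not the notebook's own code.

```python
import pandas as pd

# Toy frame with two wanted and two unwanted columns (values are made up)
toy = pd.DataFrame(
    {"price": ["$90.00"], "scrape_id": [1], "beds": [2.0], "host_url": ["x"]}
)
keep = ["beds", "price", "zipcode"]  # "zipcode" is deliberately absent from toy

# Keep only the selected columns that exist, in alphabetical order
present = sorted(col for col in keep if col in toy.columns)
toy_selected = toy[present]
```

Filtering on `toy.columns` first means a name in the selection list that is missing from the data does not raise a `KeyError`.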

## Inspection

In [17]:
# Display shape of "data"
data.shape

(25164, 64)

In [18]:
# Display head(5) of remaining "data"
data.head(5)

Unnamed: 0_level_0,accommodates,amenities,availability_365,availability_90,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,cancellation_policy,cleaning_fee,description,experiences_offered,extra_people,first_review,guests_included,has_availability,host_acceptance_rate,host_has_profile_pic,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_response_rate,host_response_time,house_rules,instant_bookable,interaction,is_business_travel_ready,is_location_exact,last_review,latitude,listing_url,longitude,maximum_nights,minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,security_deposit,space,square_feet,summary,transit,weekly_price,zipcode
3176,4,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",140,0,1.0,Real Bed,1.0,2.0,1,strict_14_with_grace_period,$100.00,This beautiful first floor apartment is situa...,none,$20.00,2009-06-20,2,t,80%,t,t,f,1.0,"Coledale, New South Wales, Australia",100%,within a few hours,"It’s a non smoking flat, which likes to be tre...",f,Feel free to ask any questions prior to bookin...,f,t,2019-06-27,52.535,https://www.airbnb.com/rooms/3176,13.41758,1125,62,"$1,900.00",Fabulous Flat in great Location,The neighbourhood is famous for its variety of...,Prenzlauer Berg Südwest,We welcome FAMILIES and cater especially for y...,145,1,$90.00,Apartment,f,f,t,9.0,9.0,9.0,9.0,10.0,93.0,9.0,1.11,Entire home/apt,$300.00,1st floor (68m2) apartment on Kollwitzplatz/ P...,720.0,This beautiful first floor apartment is situa...,"We are 5 min walk away from the tram M2, whic...",$520.00,10405
3309,1,"{Internet,Wifi,""Pets live on this property"",Ca...",320,45,1.0,Pull-out Sofa,1.0,1.0,1,strict_14_with_grace_period,$30.00,First of all: I prefer short-notice bookings. ...,none,$18.00,2013-08-12,1,t,100%,t,f,f,1.0,"Berlin, Berlin, Germany",100%,within a day,House-Rules and Information ..............(deu...,f,I'm working as a freelancing photographer. My ...,f,t,2019-05-31,52.49885,https://www.airbnb.com/rooms/3309,13.34906,35,7,$599.00,BerlinSpot Schöneberg near KaDeWe,"My flat is in the middle of West-Berlin, direc...",Schöneberg-Nord,The flat is a strictly non-smoking facility! A...,27,1,$28.00,Apartment,f,f,t,9.0,9.0,9.0,10.0,9.0,89.0,9.0,0.34,Private room,$250.00,"Your room is really big and has 26 sqm, is ver...",0.0,First of all: I prefer short-notice bookings. ...,The public transportation is excellent: Severa...,$175.00,10777
6883,2,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0,0,1.0,Real Bed,1.0,1.0,1,moderate,$39.00,Stay in a stylish loft on the second floor and...,none,$0.00,2010-02-15,1,t,100%,t,t,f,1.0,"Berlin, Berlin, Germany",100%,within an hour,No Pets. No loud Parties. Smoking only on th...,f,I rent out my space when I am travelling so I ...,f,t,2020-02-16,52.51171,https://www.airbnb.com/rooms/6883,13.45477,90,3,"$1,399.00",Stylish East Side Loft in Center with AC & 2 b...,The emerging and upcoming East of the new hip ...,Frankfurter Allee Süd FK,Information on Berlin Citytax: English (Websit...,133,9,$125.00,Loft,t,f,t,10.0,10.0,10.0,10.0,10.0,99.0,10.0,1.08,Entire home/apt,$0.00,Stay in a stylish loft on the second floor and...,,,Location: - Very close to Alexanderplatz just ...,$599.00,10243
7071,2,"{Wifi,Heating,""Family/kid friendly"",Essentials...",45,45,1.0,Real Bed,1.0,2.0,2,moderate,$0.00,Cozy and large room in the beautiful district ...,none,$27.00,2009-08-18,1,t,96%,t,t,t,2.0,"Berlin, Berlin, Germany",100%,within an hour,Please take good care of everything during you...,f,I am glad if I can give you advice or help as ...,f,t,2020-03-06,52.54316,https://www.airbnb.com/rooms/7071,13.41509,10,1,,BrightRoom with sunny greenview!,"Great neighborhood with plenty of Cafés, Baker...",Helmholtzplatz,I hope you enjoy your stay to the fullest! Ple...,292,75,$33.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,97.0,9.0,2.27,Private room,$0.00,"The BrightRoom is an approx. 20 sqm (215ft²), ...",,Cozy and large room in the beautiful district ...,Best access to other parts of the city via pub...,,10437
9991,7,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",8,0,2.5,Real Bed,4.0,7.0,1,strict_14_with_grace_period,$80.00,4 bedroom with very large windows and outstand...,none,$10.00,2015-08-09,5,t,25%,t,t,f,1.0,"Berlin, Berlin, Germany",100%,within a day,,f,Guests will have the whole apartment to themse...,f,f,2020-01-04,52.53303,https://www.airbnb.com/rooms/9991,13.41605,14,6,,Geourgeous flat - outstanding views,Prenzlauer Berg is an amazing neighbourhood wh...,Prenzlauer Berg Südwest,,8,2,$180.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,100.0,10.0,0.14,Entire home/apt,$400.00,"THE APPARTMENT - 4 bedroom (US, Germany: 5 roo...",,4 bedroom with very large windows and outstand...,Excellent location regarding public transport ...,$650.00,10405


In [19]:
# Describe data (summary)
data.describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
accommodates,25164.0,2.7,1.6,1.0,2.0,2.0,3.0,24.0
availability_365,25164.0,72.81,112.2,0.0,0.0,0.0,101.0,365.0
availability_90,25164.0,21.9,31.32,0.0,0.0,0.0,45.0,90.0
bathrooms,25146.0,1.1,0.35,0.0,1.0,1.0,1.0,8.5
bedrooms,25131.0,1.16,0.68,0.0,1.0,1.0,1.0,12.0
beds,24951.0,1.61,1.23,0.0,1.0,1.0,2.0,24.0
calculated_host_listings_count,25164.0,2.44,5.44,1.0,1.0,1.0,2.0,57.0
guests_included,25164.0,1.37,0.92,1.0,1.0,1.0,1.0,24.0
host_listings_count,25142.0,3.92,39.34,0.0,1.0,1.0,2.0,1384.0
latitude,25164.0,52.51,0.03,52.34,52.49,52.51,52.53,52.66


In [20]:
# List datatypes (data.info()) (pre-cleaning)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25164 entries, 3176 to 42927052
Data columns (total 64 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   accommodates                      25164 non-null  int64  
 1   amenities                         25164 non-null  object 
 2   availability_365                  25164 non-null  int64  
 3   availability_90                   25164 non-null  int64  
 4   bathrooms                         25146 non-null  float64
 5   bed_type                          25164 non-null  object 
 6   bedrooms                          25131 non-null  float64
 7   beds                              24951 non-null  float64
 8   calculated_host_listings_count    25164 non-null  int64  
 9   cancellation_policy               25164 non-null  object 
 10  cleaning_fee                      17725 non-null  object 
 11  description                       24664 non-null  object 
 12

In [21]:
# Show maximum/minimum value for each numerical column
#num_features = list(data.columns[data.dtypes!=object])
#data[num_features].max()
#data[num_features].min()

Several features contain unusually high values; the affected rows may in some cases be dropped above a certain threshold during data handling. Notable examples:

| **FEATURE** | **MAX_VALUE** |
| :----- | :----- |
| **calculated_host_listings_count** | 57 |
| **accommodates** | 24 |
| **bedrooms** | 12 |
| **beds** | 24 |
| **minimum_nights** | 1,124 |
| **maximum_nights** | 10,000 |
| **number_of_reviews_ltm** | 516 (potentially misleading; the listing actually had fewer reviews on Airbnb) |
| **price** | 8,983 |
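
Dropping rows above a threshold, as mentioned above, could be sketched as follows. The upper bounds and the toy rows are hypothetical; the notebook does not state its actual thresholds at this point.

```python
import pandas as pd

# Toy data; the third row carries an extreme price
toy = pd.DataFrame(
    {"price": [90.0, 28.0, 8983.0], "minimum_nights": [62, 7, 3]},
    index=pd.Index([3176, 3309, 9999], name="id"),
)

# Hypothetical upper bounds per feature (assumption, not from the notebook)
max_allowed = {"price": 1000.0, "minimum_nights": 365}

# A row is kept only if it respects every bound
mask = pd.Series(True, index=toy.index)
for col, limit in max_allowed.items():
    mask &= toy[col] <= limit

toy_filtered = toy[mask]
```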

In [22]:
# List unique entries per column
data.nunique()

accommodates                           17
amenities                           22601
availability_365                      366
availability_90                        91
bathrooms                              17
bed_type                                5
bedrooms                               12
beds                                   19
calculated_host_listings_count         31
cancellation_policy                     6
cleaning_fee                          134
description                         24140
experiences_offered                     1
extra_people                           65
first_review                         2645
guests_included                        16
has_availability                        1
host_acceptance_rate                   98
host_has_profile_pic                    2
host_identity_verified                  2
host_is_superhost                       2
host_listings_count                    52
host_location                        1095
host_response_rate                

In [23]:
# List missing values (pre-cleaning)


def count_missing(data):
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)


count_missing(data)

square_feet                    24764
monthly_price                  22991
weekly_price                   22138
notes                          17288
interaction                    12577
house_rules                    12575
host_response_rate             12092
host_response_time             12092
neighborhood_overview          10950
transit                         9696
security_deposit                9696
space                           8848
host_acceptance_rate            8508
cleaning_fee                    7439
review_scores_value             5077
review_scores_checkin           5075
review_scores_location          5074
review_scores_communication     5056
review_scores_accuracy          5054
review_scores_cleanliness       5051
review_scores_rating            5027
last_review                     4528
reviews_per_month               4528
first_review                    4528
summary                         1249
zipcode                          513
description                      500
b

## Observations

- **host_response_rate** and **host_response_time** are missing for almost half of the dataset (12,092 of 25,164 listings); missing response times are set to "unknown" and missing rates are filled with a median value
- **review_scores** are difficult to replace when missing, and setting them to 0 would distort the modeling. Hence, missing values are set to the median of the column
- listings without **name** and the few rows without enhanced **host information** (e.g. superhost), **bedrooms** or **bathrooms** are removed; they are not substantial in number
- missing values for **summary** and **description** are replaced with "" and kept in order to calculate length during feature engineering
- several features with missing values are filled with 0 or "" here and are intended to be simplified to 1/0 (**house_rules, security_deposit, space, cleaning_fee, monthly_price, weekly_price**)
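
Statements such as "missing for half of the dataset" are easier to verify when the missing counts are expressed as shares. A minimal sketch on a toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "host_response_rate": ["100%", np.nan, np.nan, "90%"],
        "summary": ["a", "b", np.nan, "d"],
        "price": ["$1", "$2", "$3", "$4"],
    }
)

# Share of missing values per column in percent, largest first,
# restricted to columns that actually have missing values
missing_share = toy.isnull().mean().mul(100).round(1)
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
```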


## Data Handling

**Handle missing/incorrect values**

In [24]:
# Fill missing fee/price values with "0" (the strings are converted to float further below)
for col in ["security_deposit", "cleaning_fee", "monthly_price", "weekly_price"]:
    data[col] = data[col].fillna("0")

In [25]:
# Fill missing "beds" with 0, then set listings with bed_type "Real Bed" to at least 1 and any remaining 0 to 0.5
data["beds"] = data["beds"].fillna(0)
data["beds"] = np.where((data.beds == 0) & (data.bed_type == "Real Bed"), 1,
                        data.beds)
data["beds"] = np.where(data.beds == 0, 0.5, data.beds)
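
The effect of the three-step rule for `beds` is easiest to see on a few toy rows. The sketch below uses `Series.mask` rather than `np.where`; the data is invented and the alternative call is an illustrative choice, not the notebook's code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "beds": [2.0, np.nan, 0.0, np.nan],
        "bed_type": ["Real Bed", "Real Bed", "Couch", "Futon"],
    }
)

beds = toy["beds"].fillna(0)                                        # missing -> 0
beds = beds.mask((beds == 0) & (toy["bed_type"] == "Real Bed"), 1)  # real bed -> at least 1
beds = beds.mask(beds == 0, 0.5)                                    # remaining 0 -> 0.5
toy["beds"] = beds
```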

In [26]:
# Set listings with 0 "bathrooms" to 0.5
data["bathrooms"] = np.where(data.bathrooms == 0, 0.5, data.bathrooms)

In [27]:
# Set listings with 0 "bedrooms" to 0.5
data["bedrooms"] = np.where(data.bedrooms == 0, 0.5, data.bedrooms)

In [28]:
# Fill missing review_scores with the median of the respective column
review_cols = [col for col in data.columns if col.startswith("review_scores_")]
for col in review_cols:
    data[col] = data[col].fillna(data[col].median())

In [29]:
# Fill host_response_time with "unknown" and the two rate columns with their own median
data.host_response_time.fillna("unknown", inplace=True)
for col in ["host_acceptance_rate", "host_response_rate"]:
    # Rates are strings such as "80%", so convert them to numbers before taking the median
    rate = pd.to_numeric(data[col].astype(str).str.rstrip("%"), errors="coerce")
    data[col] = rate.fillna(rate.median())

In [30]:
# Fill missing text values with ""
data.description.fillna("", inplace=True)
data.interaction.fillna("", inplace=True)
data.house_rules.fillna("", inplace=True)
data.neighborhood_overview.fillna("", inplace=True)
data.notes.fillna("", inplace=True)
data.space.fillna("", inplace=True)
data.summary.fillna("", inplace=True)
data.transit.fillna("", inplace=True)

**Handle wrong/varying datatypes**

In [32]:
# Convert numeric objects to float
data.cleaning_fee = [
    float(i.strip("$").replace(",", "")) for i in data.cleaning_fee
]
data.extra_people = [
    float(i.strip("$").replace(",", "")) for i in data.extra_people
]
data.host_acceptance_rate = [
    float(str(i).strip("%")) for i in data.host_acceptance_rate
]
data.host_response_rate = [
    float(str(i).strip("%")) for i in data.host_response_rate
]
data.monthly_price = [
    float(i.strip("$").replace(",", "")) for i in data.monthly_price
]
data.price = [float(i.strip("$").replace(",", "")) for i in data.price]
data.security_deposit = [
    float(i.strip("$").replace(",", "")) for i in data.security_deposit
]
data.weekly_price = [
    float(i.strip("$").replace(",", "")) for i in data.weekly_price
]
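
The repeated list comprehensions above all apply the same cleaning rule. A minimal sketch of that rule as a reusable function on toy data; `currency_to_float` is a hypothetical helper name, not part of the notebook:

```python
import pandas as pd


def currency_to_float(series: pd.Series) -> pd.Series:
    """Remove "$" signs and thousands separators, then cast to float."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return cleaned.astype(float)


# Toy frame: "0" mimics a missing value that was filled with "0" beforehand
toy = pd.DataFrame({
    "price": ["$90.00", "$1,900.00"],
    "cleaning_fee": ["$30.00", "0"],
})

for col in ["price", "cleaning_fee"]:
    toy[col] = currency_to_float(toy[col])

print(toy.dtypes)
```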

In [33]:
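# Convert varying zipcode datatypes to string (first 5 characters, prefixed with "zip_");
# note that missing zipcodes become the literal "zip_nan" here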
data.zipcode = ["zip_" + str(i)[:5] for i in data.zipcode]

In [34]:
# Convert date objects to datetime
data.first_review = pd.to_datetime(data.first_review)
data.last_review = pd.to_datetime(data.last_review)

**Add select amenities as column to data**

In [35]:
# Create temporary list with all amenities per listing
amenities_temp = [
    data.amenities[i].strip("{").strip("}").split(',') for i in data.index
]

In [36]:
# Add all amenities to single list in order to count occurrences
amenities = []
for lst in amenities_temp:
    for item in lst:
        amenities.append(item)
amenities = pd.Series(amenities)

In [37]:
# Display count of individual amenities
#amenities.value_counts()
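
The counting steps above can also be expressed with pandas string methods. A minimal sketch on two made-up amenity strings; unlike the cells above it also removes the double quotes around multi-word amenities, which is an added assumption:

```python
import pandas as pd

# Two toy amenity strings in the same "{a,b,"c d"}" layout as the raw data
toy = pd.Series([
    '{Internet,Wifi,"Cable TV"}',
    '{Wifi,"Pets live on this property"}',
])

amenity_counts = (
    toy.str.strip("{}")                      # drop the surrounding braces
       .str.replace('"', "", regex=False)    # drop quotes around multi-word amenities
       .str.split(",")                       # one list of amenities per listing
       .explode()                            # one row per amenity
       .value_counts()
)

print(amenity_counts)
```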

Not every amenity in the full list is likely to have a noticeable effect on the price. For this analysis, an initial selection was made and then extended based on [previous work](https://github.com/L-Lewis/Airbnb-neural-network-price-prediction/blob/master/Airbnb-price-prediction.ipynb) on selecting relevant amenities. In addition, most amenities with a split of more than 90/10 between 1 and 0 have been **removed (strikethrough in the list)**, except for some that were deemed substantial (breakfast, essentials, nature and views).

| **NEW COLUMN** | **PREVIOUS AMENITY/IES** |
| :----- | :----- |
| <s>**am_check_in_24h**</s> | <s>24-hour check-in</s> |
| <s>**am_air_con**</s> | <s>Air conditioning/central air conditioning</s> |
| **am_balcony** | Balcony/patio or balcony |
| **am_nature_and_views** | Beach view/beachfront/lake access/mountain view/ski-in ski-out/waterfront (i.e. great location/views) |
| **am_breakfast** | Breakfast |
| **am_tv** | Cable TV/TV |
| **am_coffee_machine** | Coffee maker/espresso machine |
| **am_cooking_basics** | Cooking basics |
| **am_white_goods** | Dishwasher/Dryer/Washer/Washer and dryer |
| **am_elevator** | Elevator |
| <s>**am_gym**</s> | <s>Exercise equipment/gym/private gym/shared gym</s> |
| **am_essentials** | Essentials |
| **am_child_friendly** | Family/kid friendly, or anything containing 'children' |
| **am_parking** | Free parking on premises/free street parking/outdoor parking/paid parking off premises/paid parking on premises |
| <s>**am_outdoor_space**</s> | <s>Garden or backyard/outdoor seating/sun loungers/terrace</s> |
| <s>**am_wellness**</s> | <s>Hot tub/jetted tub/private hot tub/sauna/shared hot tub/pool/private pool/shared pool</s> |
| <s>**am_internet**</s> | <s>Internet/pocket wifi/wifi</s> |
| **am_pets_allowed** | Pets allowed/cat(s)/dog(s)/pets live on this property/other pet(s) |
| **am_private_entrance** | Private entrance |
| <s>**am_secure**</s> | <s>Safe/security system</s> |
| <s>**am_self_check_in**</s> | <s>Self check-in</s> |
| **am_smoking_allowed** | Smoking allowed |
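
The 90/10 rule described above can be checked directly from the share of 1s per flag column. A minimal sketch on made-up flags; the 0.1 and 0.9 cut-offs follow the text, everything else is illustrative:

```python
import pandas as pd

# Toy 1/0 flags for 20 listings (shares chosen to illustrate the rule)
toy = pd.DataFrame({
    "am_internet": [1] * 19 + [0],      # 95% ones
    "am_balcony": [1] * 6 + [0] * 14,   # 30% ones
    "am_gym": [1] + [0] * 19,           # 5% ones
})

# For 1/0 columns the mean equals the share of listings with the amenity
share_of_ones = toy.mean()

# Columns whose split is more extreme than 90/10 are removal candidates
unbalanced = share_of_ones[(share_of_ones > 0.9) | (share_of_ones < 0.1)].index.tolist()

print(unbalanced)
```

Columns flagged this way would still need a manual check, since some unbalanced amenities were kept on purpose.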

In [38]:
# Add select amenities as distinct columns to data

#data.loc[data.amenities.str.contains('24-hour check-in'), 'am_check_in_24h'] = 1
#data.am_check_in_24h.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Air conditioning|Central air conditioning'), 'am_air_con'] = 1
#data.am_air_con.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Balcony|Patio'), 'am_balcony'] = 1
data.am_balcony.fillna(0, inplace=True)
#print(data.am_balcony.value_counts())

data.loc[data.amenities.str.contains(
    'Beach view|Beachfront|Lake access|Mountain view|Ski-in/Ski-out|Waterfront'
), 'am_nature_and_views'] = 1
data.am_nature_and_views.fillna(0, inplace=True)
#print(data.am_nature_and_views.value_counts())

data.loc[data.amenities.str.contains('Breakfast'), 'am_breakfast'] = 1
data.am_breakfast.fillna(0, inplace=True)
#print(data.am_breakfast.value_counts())

data.loc[data.amenities.str.contains('TV'), 'am_tv'] = 1
data.am_tv.fillna(0, inplace=True)
#print(data.am_tv.value_counts())

data.loc[data.amenities.str.contains('Coffee maker|Espresso machine'
                                     ), 'am_coffee_machine'] = 1
data.am_coffee_machine.fillna(0, inplace=True)
#print(data.am_coffee_machine.value_counts())

data.loc[data.amenities.str.contains('Cooking basics'
                                     ), 'am_cooking_basics'] = 1
data.am_cooking_basics.fillna(0, inplace=True)
#print(data.am_cooking_basics.value_counts())

data.loc[data.amenities.str.contains('Dishwasher|Dryer|Washer'
                                     ), 'am_white_goods'] = 1
data.am_white_goods.fillna(0, inplace=True)
#print(data.am_white_goods.value_counts())

data.loc[data.amenities.str.contains('Elevator'), 'am_elevator'] = 1
data.am_elevator.fillna(0, inplace=True)
#print(data.am_elevator.value_counts())

data.loc[data.amenities.str.contains('Essentials'), 'am_essentials'] = 1
data.am_essentials.fillna(0, inplace=True)
#print(data.am_essentials.value_counts())

#data.loc[data.amenities.str.contains('Exercise equipment|Gym|gym'), 'am_gym'] = 1
#data.am_gym.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Family/kid friendly|Children|children'
                                     ), 'am_child_friendly'] = 1
data.am_child_friendly.fillna(0, inplace=True)
#print(data.am_child_friendly.value_counts())

data.loc[data.amenities.str.contains('parking'), 'am_parking'] = 1
data.am_parking.fillna(0, inplace=True)
#print(data.am_parking.value_counts())

#data.loc[data.amenities.str.contains('Garden|Outdoor|Sun loungers|Terrace'), 'am_outdoor_space'] = 1
#data.am_outdoor_space.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Hot tub|Jetted tub|hot tub|Sauna|Pool|pool'), 'am_wellness'] = 1
#data.am_wellness.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Internet|Pocket wifi|Wifi'), 'am_internet'] = 1
#data.am_internet.fillna(0, inplace=True)

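# Parentheses are escaped so that "Cat(s)"/"Dog(s)" match literally instead of forming regex groups
data.loc[data.amenities.str.contains(r'Pets|pet|Cat\(s\)|Dog\(s\)'
                                     ), 'am_pets_allowed'] = 1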
data.am_pets_allowed.fillna(0, inplace=True)
#print(data.am_pets_allowed.value_counts())

data.loc[data.amenities.str.contains('Private entrance'
                                     ), 'am_private_entrance'] = 1
data.am_private_entrance.fillna(0, inplace=True)
#print(data.am_private_entrance.value_counts())

#data.loc[data.amenities.str.contains('Safe|Security system'), 'am_secure'] = 1
#data.am_secure.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Self check-in'), 'am_self_check_in'] = 1
#data.am_self_check_in.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Smoking allowed'
                                     ), 'am_smoking_allowed'] = 1
data.am_smoking_allowed.fillna(0, inplace=True)
#print(data.am_smoking_allowed.value_counts())
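
The repeated `loc`/`fillna` pairs above follow one pattern per amenity column. A minimal sketch that drives the same logic from a dictionary on toy data; `amenity_patterns` is a hypothetical name and only a few of the columns are shown:

```python
import re

import pandas as pd

# Toy amenity strings in the raw "{...}" layout
toy = pd.DataFrame({"amenities": [
    '{TV,"Patio or balcony",Essentials}',
    '{Wifi,"Cat(s)"}',
    '{Breakfast,"Free street parking"}',
]})

# New column name -> regex pattern; literal strings with parentheses are escaped
amenity_patterns = {
    "am_balcony": "Balcony|Patio",
    "am_breakfast": "Breakfast",
    "am_parking": "parking",
    "am_pets_allowed": "|".join(
        re.escape(s) for s in ["Pets", "pet", "Cat(s)", "Dog(s)"]
    ),
}

# str.contains returns booleans, which are cast to 1.0/0.0 in one step
for column, pattern in amenity_patterns.items():
    toy[column] = toy["amenities"].str.contains(pattern, regex=True).astype(float)

print(toy.drop(columns="amenities"))
```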


**Remove low-frequency classes from categorical columns**

In [39]:
# Change neighbourhoods_cleansed that make up <0.25% of data to "nb_other"
nb_counts = data.neighbourhood_cleansed.map(
    data.neighbourhood_cleansed.value_counts())
data.neighbourhood_cleansed = data.neighbourhood_cleansed.mask(
    nb_counts < (0.0025 * len(data)), 'nb_other')

In [40]:
# Change zipcodes that make up <0.25% of data to "zip_other"
zip_counts = data.zipcode.map(data.zipcode.value_counts())
data.zipcode = data.zipcode.mask(
    zip_counts < (0.0025 * len(data)), 'zip_other')
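
Both cells above apply the same rule: classes below a minimum share of rows are merged into one "other" label. A minimal sketch of that rule as a function on toy data; `lump_rare` and the 20% threshold are illustrative assumptions (the notebook uses 0.25%):

```python
import pandas as pd


def lump_rare(series: pd.Series, min_share: float, other_label: str) -> pd.Series:
    """Replace classes that make up less than min_share of all rows."""
    # Map every row to the frequency of its own class
    counts = series.map(series.value_counts())
    return series.mask(counts < min_share * len(series), other_label)


# Toy zipcodes: 6, 3 and 1 occurrences out of 10 rows
toy = pd.Series(["zip_10405"] * 6 + ["zip_10777"] * 3 + ["zip_10243"])
lumped = lump_rare(toy, min_share=0.2, other_label="zip_other")

print(lumped.value_counts())
```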

**Drop irrelevant rows**

In [41]:
# Drop rows with missing values in columns where only a few values are missing
data.dropna(subset=[
    "name", "host_is_superhost", "bedrooms", "bathrooms",
    "neighbourhood_cleansed", "zipcode"
],
            inplace=True)

In [42]:
# Remove "poor" listings (price below 10 or at least 500, or minimum stay above 100 nights)
data = data[data.price < 500]
data = data[data.price >= 10]
data = data[data.minimum_nights <= 100]

In [43]:
# Remove listings where "accommodates" is lower than "guests_included"
data = data[data.accommodates - data.guests_included >= 0]

In [44]:
# Remove listings where "accommodates" > 10 (outliers)
data = data[data.accommodates <= 10]

In [45]:
# Remove listings where "accommodates" - "beds" < 0
data = data[data.accommodates - data.beds >= 0]

In [46]:
# Remove listings where "bedrooms" - "beds" > 2
data = data[data.bedrooms - data.beds <= 2]

In [47]:
# Remove listings where "beds" - "bedrooms" > 10
data = data[data.beds - data.bedrooms <= 10]

In [48]:
# Remove listings where "monthly_price" is more than 30x "price"
data = data[data.monthly_price / data.price <= 30]

In [49]:
# Remove listings where "weekly_price" is more than 7x "price"
data = data[data.weekly_price / data.price <= 7]

In [50]:
# Remove "inactive" or "new" listings with no reviews in last twelve months
data = data[data.number_of_reviews_ltm != 0]

In [51]:
# Remove listings with no "availability_365" and no reviews in last three months
data = data[(data.availability_365 != 0) |
            (data.last_review > dataset_date_min3mth)]

In [52]:
data.shape

(10635, 78)
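
Since the filters above are applied one after another, it can help to see how many rows each rule removes. A minimal sketch on toy data with three of the rules; the values and the `rules` dictionary are made up for illustration:

```python
import pandas as pd

# Toy listings: only the first row passes every rule
toy = pd.DataFrame({
    "price": [90.0, 5.0, 650.0, 125.0],
    "minimum_nights": [62, 2, 3, 120],
    "accommodates": [4, 2, 12, 2],
})

# Rule name -> boolean mask of rows that satisfy the rule
rules = {
    "price in [10, 500)": (toy.price >= 10) & (toy.price < 500),
    "minimum_nights <= 100": toy.minimum_nights <= 100,
    "accommodates <= 10": toy.accommodates <= 10,
}

# Apply the rules in order and report the rows newly removed by each one
keep = pd.Series(True, index=toy.index)
for name, mask in rules.items():
    before = keep.sum()
    keep &= mask
    print(f"{name}: removed {before - keep.sum()} rows")

filtered = toy[keep]
print(filtered.shape)
```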

## Final Check, Cleaning and Export

In [53]:
# Drop further columns
data.drop(
    [
        "bed_type",
        "experiences_offered",
        "has_availability",
#        "host_acceptance_rate",
        "host_location",
#        "host_response_rate",
#        "host_response_time",  
#        "number_of_reviews", 
#        "number_of_reviews_ltm",
        "requires_license",
        "is_business_travel_ready",
        "host_has_profile_pic",
        "host_listings_count",
        "require_guest_profile_picture",
        "require_guest_phone_verification",
        "reviews_per_month",
        "square_feet"
    ],
    inplace=True,
    axis=1)

In [54]:
# Sort columns in dataset
data = data.reindex(sorted(data.columns, reverse=False), axis=1)

| **FEATURE(S)** | **NOTES** |
| :----- | :----- | 
| **bed_type** | over 97% of values were "Real Bed", hence little added value |
| **experiences_offered** | all values are "none" |
| **has_availability** | all values are "t" |
| **requires_license, host_has_profile_pic** | almost all values are "t" |
| **is_business_travel_ready** | all values are "f" |
| **require_guest_xxx** | almost all values are "f" |
| **host_listings_count** | calculated_host_listings_count appears to be a sanitized version (ranges from 1 to 55) of host_listings_count (has values 0 and highest is 1397) |
| **other host_xyz** | too many missing values |
| **reviews_per_month** | reviews in last 2 yrs are calculated in feature engineering (number_of_reviews and number_of_reviews_ltm are kept) |
| **square_feet** | too many missing values |
| <s>**property_type**</s> | <s>90% of values are "apartment", too many unique values to sensibly classify</s> kept instead |

In [55]:
# List datatypes (data.info()) (post-cleaning)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10635 entries, 3176 to 42885615
Data columns (total 66 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   accommodates                    10635 non-null  int64         
 1   am_balcony                      10635 non-null  float64       
 2   am_breakfast                    10635 non-null  float64       
 3   am_child_friendly               10635 non-null  float64       
 4   am_coffee_machine               10635 non-null  float64       
 5   am_cooking_basics               10635 non-null  float64       
 6   am_elevator                     10635 non-null  float64       
 7   am_essentials                   10635 non-null  float64       
 8   am_nature_and_views             10635 non-null  float64       
 9   am_parking                      10635 non-null  float64       
 10  am_pets_allowed                 10635 non-null  float64       
 

In [56]:
# List missing values (post-cleaning)

def count_missing(data):
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)

count_missing(data)
#data.isnull().sum()

Series([], dtype: float64)


As the empty Series shows, no missing values remain after cleaning.

In [57]:
# Display cleaned dataset
print(data.shape)
data.head(3)

(10635, 66)


id,accommodates,am_balcony,am_breakfast,am_child_friendly,am_coffee_machine,am_cooking_basics,am_elevator,am_essentials,am_nature_and_views,am_parking,am_pets_allowed,am_private_entrance,am_smoking_allowed,am_tv,am_white_goods,amenities,availability_365,availability_90,bathrooms,bedrooms,beds,calculated_host_listings_count,cancellation_policy,cleaning_fee,description,extra_people,first_review,guests_included,host_acceptance_rate,host_identity_verified,host_is_superhost,host_response_rate,host_response_time,house_rules,instant_bookable,interaction,is_location_exact,last_review,latitude,listing_url,longitude,maximum_nights,minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,room_type,security_deposit,space,summary,transit,weekly_price,zipcode
3176,4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",140,0,1.0,1.0,2.0,1,strict_14_with_grace_period,100.0,This beautiful first floor apartment is situa...,20.0,2009-06-20,2,80.0,t,f,100.0,within a few hours,"It’s a non smoking flat, which likes to be tre...",f,Feel free to ask any questions prior to bookin...,t,2019-06-27,52.535,https://www.airbnb.com/rooms/3176,13.41758,1125,62,1900.0,Fabulous Flat in great Location,The neighbourhood is famous for its variety of...,Prenzlauer Berg Südwest,We welcome FAMILIES and cater especially for y...,145,1,90.0,Apartment,9.0,9.0,9.0,9.0,10.0,93.0,9.0,Entire home/apt,300.0,1st floor (68m2) apartment on Kollwitzplatz/ P...,This beautiful first floor apartment is situa...,"We are 5 min walk away from the tram M2, whic...",520.0,zip_10405
3309,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,"{Internet,Wifi,""Pets live on this property"",Ca...",320,45,1.0,1.0,1.0,1,strict_14_with_grace_period,30.0,First of all: I prefer short-notice bookings. ...,18.0,2013-08-12,1,100.0,f,f,100.0,within a day,House-Rules and Information ..............(deu...,f,I'm working as a freelancing photographer. My ...,t,2019-05-31,52.49885,https://www.airbnb.com/rooms/3309,13.34906,35,7,599.0,BerlinSpot Schöneberg near KaDeWe,"My flat is in the middle of West-Berlin, direc...",Schöneberg-Nord,The flat is a strictly non-smoking facility! A...,27,1,28.0,Apartment,9.0,9.0,9.0,10.0,9.0,89.0,9.0,Private room,250.0,"Your room is really big and has 26 sqm, is ver...",First of all: I prefer short-notice bookings. ...,The public transportation is excellent: Severa...,175.0,zip_10777
6883,2,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",0,0,1.0,1.0,1.0,1,moderate,39.0,Stay in a stylish loft on the second floor and...,0.0,2010-02-15,1,100.0,t,f,100.0,within an hour,No Pets. No loud Parties. Smoking only on th...,f,I rent out my space when I am travelling so I ...,t,2020-02-16,52.51171,https://www.airbnb.com/rooms/6883,13.45477,90,3,1399.0,Stylish East Side Loft in Center with AC & 2 b...,The emerging and upcoming East of the new hip ...,Frankfurter Allee Süd FK,Information on Berlin Citytax: English (Websit...,133,9,125.0,Loft,10.0,10.0,10.0,10.0,10.0,99.0,10.0,Entire home/apt,0.0,Stay in a stylish loft on the second floor and...,,Location: - Very close to Alexanderplatz just ...,599.0,zip_10243


**Export data_clean**

In [58]:
# Create path to export dataset (including the parent "saves" folder, if not existing)
os.makedirs(f"saves/{dataset_loc}_{dataset_date}/", exist_ok=True)

In [59]:
# Export dataset for further use in 2_Airbnb_EDA and 3_Airbnb_Feature_Engineering
data.to_pickle(f"saves/{dataset_loc}_{dataset_date}/data_clean.pkl")

In [60]:
# Alternative: Export with to_csv and save dtypes separately
#data.to_csv(r'saves/data_clean.csv', index = True)
#data.dtypes.to_frame('types').to_csv('saves/types_clean.csv')

**BACKUP**

In [61]:
# Import Airbnb listing data for the time period 04/2018-03/2020 (2 years)
#all_files = glob.glob(os.path.join("data", "*.csv.gz"))
#all_df = []
#for f in all_files:
#    df = pd.read_csv(f, sep=',')
#    df['file'] = f.split('/')[-1]
#    all_df.append(df)
#data_raw = pd.concat(all_df, ignore_index=True, sort=True)

In [68]:
# Count distinct (zipcode, neighbourhood_cleansed) combinations
data.groupby("zipcode")["neighbourhood_cleansed"].value_counts().count()

324

In [69]:
# Count distinct neighbourhoods (after grouping rare ones into "nb_other")
data.neighbourhood_cleansed.value_counts().count()

60