# Business Understanding and Set-up

## Background and Key Question

**Airbnb**

Brief description

**Key Question**

1. Taking listing data from a specific time stamp (in this case Jan 2020), **can we predict a listing's occupancy rate** in the following month, derived from reviews from March 2020? (regression)
2. Including changes in price and other features from previous months, **can we predict changes in occupancy levels** based on feature values and changes? (classification)

**Assumptions**

As the data contains only publicly accessible information and does not include data such as actual occupancy, several assumptions were necessary to perform the analysis. Some of the key ones are described below.

| **TOPIC** | **ASSUMPTION** |
| :----- | :----- |
| **Occupancy calculation** | Please refer to 3 - Feature Engineering for a detailed explanation of how to calculate occupancy |
| **Avg. length of stay (Berlin)** | 3 days, while adjusting for max. length <= 5 and min. length > 3 |
| **Days in advance booking** | Two weeks on average, while longer stays are typically booked further in advance (needs further validation) |
| **Data date selection** | Main dataset "data" is taken from Jan 10th 2020 (in order to stay clear of COVID effects). Airbnb allows reviews to be submitted 0-14 days after completion of a trip (avg. 3 days). Hence, reviews from Mar 1-31 2020 are used as a proxy to calculate occupancy in the month affected by the price on Jan 10th 2020 (approx. Jan 24th-Feb 23rd) |
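
The date window in the last row follows from simple date arithmetic. A minimal sketch, assuming the two-week booking lead time from the table and a 30-day observation window (the variable names are illustrative and not part of the notebooks):

```python
from datetime import date, timedelta

price_snapshot = date(2020, 1, 10)       # scrape date of the main dataset
booking_lead_time = timedelta(days=14)   # assumed average number of days booked in advance
window_length = timedelta(days=30)       # assumed length of the affected period

# Stays that were plausibly booked at the Jan 10th price
stay_window_start = price_snapshot + booking_lead_time
stay_window_end = stay_window_start + window_length

print(stay_window_start, stay_window_end)
```
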



## Feature Glossary

[LINK](https://github.com/L-Lewis/Airbnb-neural-network-price-prediction/blob/master/Airbnb-price-prediction.ipynb)

| **FEATURE** | **DESCRIPTION** |
| :----- | :----- |
| **name** | header of Airbnb listing |



## Dataset Glossary

| **DATASET** | **DESCRIPTION** |
| :----- | :----- |
| **data_raw** | Originally imported dataset listings.csv.gz (January 2020) |
| **data** | Naming for main working dataset throughout all notebooks |
| **data_clean** | Export from Notebook 1-Clean, import for Notebooks 2-EDA and 3-Feature Engineering |
| **data_engineered** | Export from Notebook 3-Feature Engineering, import for Notebook 4-Predictive Modeling |



## Target Feature(s) and Metric(s)

**Target 1**:
- Feature: Occupancy class
- Metric: F1-Score

## Libraries and Dashboard

In [1378]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import loadtxt
import os, glob
import geopandas as gpd
%matplotlib inline

In [1379]:
#Dashboard
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 100)
pd.options.display.max_seq_items = 300
#pd.options.display.max_rows = 4000
sns.set(style="white")

# Data Mining

## Data Checks

The monthly data for Berlin is composed of various files that are briefly visualized here (based on Dec 2019):

- listings.csv.gz
- listings.csv
- reviews.csv.gz
- reviews.csv
- calendar.csv.gz
- neighbourhoods.csv
- neighbourhoods.geojson

**listings.csv.gz**

In [1380]:
# Display contents of listings.csv.gz as well as its shape
data_2020_01_10_listings_gz = pd.read_csv("data/2020-01_listings.csv.gz")
print(data_2020_01_10_listings_gz.shape)
data_2020_01_10_listings_gz.head(3)

(25349, 106)




Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2015,https://www.airbnb.com/rooms/2015,20200110222913,2020-01-11,Berlin-Mitte Value! Quiet courtyard/very central,Great location! 30 of 75 sq meters. This wood...,A+++ location! This „Einliegerwohnung“ is an e...,Great location! 30 of 75 sq meters. This wood...,none,It is located in the former East Berlin area o...,"This is my home, not a hotel. I rent out occas...","Close to U-Bahn U8 and U2 (metro), Trams M12, ...","Simple kitchen/cooking, refrigerator, microwav...",Always available,"No parties No events No pets No smoking, not e...",,,https://a0.muscache.com/im/pictures/260fd609-7...,,2217,https://www.airbnb.com/users/show/2217,Ion,2008-08-18,"Key Biscayne, Florida, United States",Believe in sharing economy.,within an hour,100%,,f,https://a0.muscache.com/im/pictures/user/21428...,https://a0.muscache.com/im/pictures/user/21428...,Mitte,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Berlin, Berlin, Germany",Mitte,Brunnenstr. Süd,Mitte,Berlin,Berlin,10119,Berlin,"Berlin, Germany",DE,Germany,52.53454,13.40256,f,Guesthouse,Entire home/apt,3,1.0,1.0,0.0,Real Bed,"{TV,""Cable TV"",Wifi,Kitchen,Gym,""Free street p...",,$60.00,,,$250.00,$30.00,1,$28.00,4,1125,4,59,1125,1125,41.0,1125.0,a week ago,t,8,8,15,194,2020-01-11,130,10,2016-04-11,2019-10-12,93.0,10.0,9.0,10.0,10.0,10.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,6,6,0,0,2.84
1,3176,https://www.airbnb.com/rooms/3176,20200110222913,2020-01-11,Fabulous Flat in great Location,This beautiful first floor apartment is situa...,1st floor (68m2) apartment on Kollwitzplatz/ P...,This beautiful first floor apartment is situa...,none,The neighbourhood is famous for its variety of...,We welcome FAMILIES and cater especially for y...,"We are 5 min walk away from the tram M2, whic...",The apartment will be entirely yours. We are c...,Feel free to ask any questions prior to bookin...,"It’s a non smoking flat, which likes to be tre...",,,https://a0.muscache.com/im/pictures/243355/84a...,,3718,https://www.airbnb.com/users/show/3718,Britta,2008-10-19,"Coledale, New South Wales, Australia",We love to travel ourselves a lot and prefer t...,within a day,67%,,f,https://a0.muscache.com/im/users/3718/profile_...,https://a0.muscache.com/im/users/3718/profile_...,Prenzlauer Berg,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"Berlin, Berlin, Germany",Prenzlauer Berg,Prenzlauer Berg Südwest,Pankow,Berlin,Berlin,10405,Berlin,"Berlin, Germany",DE,Germany,52.535,13.41758,t,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",720.0,$90.00,$520.00,"$1,900.00",$300.00,$100.00,2,$20.00,62,1125,62,62,1125,1125,62.0,1125.0,4 months ago,t,0,0,14,289,2020-01-11,145,1,2009-06-20,2019-06-27,93.0,9.0,9.0,9.0,9.0,10.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,1,1,0,0,1.13
2,3309,https://www.airbnb.com/rooms/3309,20200110222913,2020-01-11,BerlinSpot Schöneberg near KaDeWe,First of all: I prefer short-notice bookings. ...,"Your room is really big and has 26 sqm, is ver...",First of all: I prefer short-notice bookings. ...,none,"My flat is in the middle of West-Berlin, direc...",The flat is a strictly non-smoking facility! A...,The public transportation is excellent: Severa...,I do have a strictly non-smoker-flat. Keep th...,I'm working as a freelancing photographer. My ...,House-Rules and Information ..............(deu...,,,https://a0.muscache.com/im/pictures/29054294/b...,,4108,https://www.airbnb.com/users/show/4108,Jana,2008-11-07,"Berlin, Berlin, Germany",ENJOY EVERY DAY AS IF IT'S YOUR LAST!!! \r\n\r...,within a day,100%,,f,https://a0.muscache.com/im/pictures/user/d8049...,https://a0.muscache.com/im/pictures/user/d8049...,Schöneberg,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,f,"Berlin, Berlin, Germany",Schöneberg,Schöneberg-Nord,Tempelhof - Schöneberg,Berlin,Berlin,10777,Berlin,"Berlin, Germany",DE,Germany,52.49885,13.34906,t,Apartment,Private room,1,1.0,1.0,1.0,Pull-out Sofa,"{Internet,Wifi,""Pets live on this property"",Ca...",0.0,$28.00,$175.00,$599.00,$250.00,$30.00,1,$18.00,7,35,7,7,35,35,7.0,35.0,6 weeks ago,t,0,10,40,315,2020-01-11,27,1,2013-08-12,2019-05-31,89.0,9.0,9.0,9.0,10.0,9.0,9.0,t,,,f,f,strict_14_with_grace_period,f,f,1,0,1,0,0.35


**listings.csv**

In [1381]:
# Display contents of listings.csv as well as its shape
data_2020_02_18_listings = pd.read_csv("data/2020-02-18/listings.csv")
print(data_2020_02_18_listings.shape)
data_2020_02_18_listings.head(2)

(25197, 16)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,3176,Fabulous Flat in great Location,3718,Britta,Pankow,Prenzlauer Berg Südwest,52.535,13.41758,Entire home/apt,90,62,145,2019-06-27,1.12,1,221
1,3309,BerlinSpot Schöneberg near KaDeWe,4108,Jana,Tempelhof - Schöneberg,Schöneberg-Nord,52.49885,13.34906,Private room,28,7,27,2019-05-31,0.34,1,293


**reviews.csv.gz**

In [1382]:
# Display contents of reviews.csv.gz as well as its shape
data_2020_02_18_reviews_gz = pd.read_csv("data/2020-02-18/reviews.csv.gz")
print(data_2020_02_18_reviews_gz.shape)
data_2020_02_18_reviews_gz.head(2)

(543302, 6)


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,3176,4283,2009-06-20,21475,Milan,"excellent stay, i would highly recommend it. a..."
1,3176,134722,2010-11-07,263467,George,Britta's apartment in Berlin is in a great are...


**reviews.csv**

In [1383]:
# Display contents of reviews.csv as well as its shape
data_2020_02_18_reviews = pd.read_csv("data/2020-02-18/reviews.csv")
print(data_2020_02_18_reviews.shape)
data_2020_02_18_reviews.head(2)

(543302, 2)


Unnamed: 0,listing_id,date
0,3176,2009-06-20
1,3176,2010-11-07


**calendar.csv.gz**

In [1384]:
# Display contents of calendar.csv.gz as well as its shape
data_2020_02_18_cal = pd.read_csv("data/2020-02-18/calendar.csv.gz")
print(data_2020_02_18_cal.shape)
data_2020_02_18_cal.head(2)

KeyboardInterrupt: 
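
The `KeyboardInterrupt` above suggests that loading the full calendar file was aborted, presumably because of its size. A minimal sketch of how a large CSV could be inspected without holding it in memory at once, using the `nrows` and `chunksize` options of `pd.read_csv`. The helper `preview_csv`, the demo file and its column names are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd


def preview_csv(path, n_rows=2, chunk_size=100_000):
    """Return the first n_rows and the total row count, reading the file in chunks."""
    head = pd.read_csv(path, nrows=n_rows)
    total_rows = sum(len(chunk) for chunk in pd.read_csv(path, chunksize=chunk_size))
    return head, total_rows


# Tiny stand-in file (hypothetical columns) so the sketch runs without the real calendar.csv.gz
demo_path = os.path.join(tempfile.mkdtemp(), "calendar_demo.csv.gz")
pd.DataFrame({
    "listing_id": [3176, 3176, 3309],
    "date": ["2020-02-18", "2020-02-19", "2020-02-18"],
    "available": ["t", "f", "t"],
}).to_csv(demo_path, index=False)

head, total_rows = preview_csv(demo_path)
print(total_rows, head.shape)
```

For the real file, `preview_csv("data/2020-02-18/calendar.csv.gz")` would return the same two pieces of information that the interrupted cell was meant to display.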

**neighbourhoods.csv**

In [None]:
# Display contents of neighbourhoods.csv as well as its shape
data_2020_02_18_neighb = pd.read_csv("data/2020-02-18/neighbourhoods.csv")
print(data_2020_02_18_neighb.shape)
data_2020_02_18_neighb.head(2)

**neighbourhoods.geojson**

In [None]:
# Display contents of neighbourhoods.geojson as well as its shape
data_2020_02_18_neighb_geojson = gpd.read_file('data/2020-02-18/neighbourhoods.geojson')
print(data_2020_02_18_neighb_geojson.shape)
data_2020_02_18_neighb_geojson.head(2)

## Data Import

**Create main dataset (listings on January 10th, i.e. pre-COVID-19)**

In [None]:
# Import dataset as DataFrame (as csv-file)
data_raw = pd.read_csv("data/2020-01_listings.csv.gz")

In [None]:
# Assign data_raw to data (in order to always keep a freshly imported data_raw) and set id as index
data = data_raw.copy()
data.set_index('id', inplace=True)

# Data Cleaning

## Pre-cleaning

In [None]:
# Display shape of "data"
data.shape

In [None]:
# Display head(1) of "data"
data.head(1)

In [None]:
# Display columns of "data"
#data.columns

In [None]:
# Define columns for pre-cleaning drop
drop_columns = ['access', 'availability_30',
       'availability_60',# 'availability_90',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'calendar_last_scraped',
       'calendar_updated', 'city', 
       'country', 'country_code', 
#       'first_review', 
       'host_about', 'host_id',
       'host_name', 'host_neighbourhood', 'host_picture_url',
       'host_thumbnail_url', 'host_total_listings_count', 'host_since', 'host_url',
       'host_verifications', 'jurisdiction_names', 
#       'last_review', 
       'last_scraped',
       'license', 'listing_url', 'market',
       'maximum_maximum_nights', 'maximum_minimum_nights',
       'maximum_nights_avg_ntm', 'medium_url', 'minimum_maximum_nights',
       'minimum_minimum_nights', 'minimum_nights_avg_ntm',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed',
       'picture_url', 'review_scores_accuracy',
       'review_scores_checkin', 'review_scores_cleanliness',
       'review_scores_communication', 'review_scores_location', 
       'review_scores_value', 'scrape_id',
       'smart_location', 'state', 'street', 
       'thumbnail_url', 'xl_picture_url']

In [None]:
# Drop unnecessary columns
data.drop(labels=drop_columns, inplace=True, axis=1)

## Inspection

In [None]:
# Display shape of "data"
data.shape

In [None]:
# Display head(5) of remaining "data"
data.head(5)

In [None]:
# Describe data (summary)
data.describe().round(2).T

In [None]:
# List datatypes (data.info()) (pre-cleaning)
data.info()

In [None]:
# Show maximum/minimum value for each numerical column
#num_features = list(data.columns[data.dtypes!=object])
#data[num_features].max()
#data[num_features].min()

Several rows with unusually high values can be identified and may in some cases be dropped at a certain threshold during data handling. Some particular features include:

| **FEATURE** | **MAX_VALUE** |
| :----- | :----- |
| **calculated_host_listings_count** | 55 |
| **accommodates** | 16 |
| **bedrooms** | 12 |
| **beds** | 24 |
| **minimum_nights** | 1,124 |
| **maximum_nights** | 10,000 |
| **number_of_reviews_ltm** | 516 (potentially misleading; the listing actually had fewer reviews on Airbnb) |
| **price** | 8,983 |
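
Extremes such as those in the table can be collected in one step. A minimal sketch on a toy DataFrame (the values are illustrative; in the notebook, `price` is still a string at this point and would need converting first):

```python
import pandas as pd

# Toy stand-in for "data"; the values are illustrative only
toy = pd.DataFrame({
    "accommodates": [3, 4, 16],
    "beds": [1.0, 2.0, 24.0],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})

# Keep numeric columns only, then report the extremes per column
numeric = toy.select_dtypes(include="number")
extremes = numeric.agg(["min", "max"]).T
print(extremes)
```
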

In [None]:
# List unique entries per column
data.nunique()

In [None]:
# List missing values (pre-cleaning)

def count_missing(data):
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)
    
count_missing(data)

## Observations

- **host_response_rate** and **host_response_time** are unfortunately not available for half of the dataset and consequently the columns have been removed
- **review_scores** are difficult to replace if they do not exist, but setting them to 0 would distort the modeling. Hence, missing values are set to the mean of the column
- listings without **name** and the few rows without enhanced **host information** (e.g. superhost), **bedrooms** or **bathrooms** are removed; they are not substantial in number
- missing values for **summary** and **description** are replaced with "" and kept in order to calculate length during feature engineering
- several features with missing values will be directly converted to 1/0 for simplification (**house_rules, security_deposit, space, cleaning_fee, monthly_price, weekly_price**)


## Data Handling

**Handle missing/incorrect values**

In [None]:
# Fill missing fee/price values with "0" (converted to float below); the 1/0 conversion is kept commented out
#data.security_deposit.where(data.security_deposit.isnull(), 1, inplace=True)
data["security_deposit"] = data["security_deposit"].fillna("0")

#data.cleaning_fee.where(data.cleaning_fee.isnull(), 1, inplace=True)
data["cleaning_fee"] = data["cleaning_fee"].fillna("0")

#data.monthly_price.where(data.monthly_price.isnull(), 1, inplace=True)
data["monthly_price"] = data["monthly_price"].fillna("0")

#data.weekly_price.where(data.weekly_price.isnull(), 1, inplace=True)
data["weekly_price"] = data["weekly_price"].fillna("0")

In [None]:
# Fill missing "beds" with 0, then set listings with "bed_type" Real Bed to at least 1 bed and all remaining zeros to 0.5
data["beds"] = data["beds"].fillna(0)
data.beds = np.where((data.beds==0) & (data.bed_type=="Real Bed"), 1, data.beds)
data.beds = np.where((data.beds==0), 0.5, data.beds)

In [None]:
# Set all with "bathrooms" 0 to at least 0.5
data.bathrooms = np.where(data.bathrooms==0, 0.5, data.bathrooms)

In [None]:
# Set all with "bedrooms" 0 to at least 0.5
data.bedrooms = np.where(data.bedrooms==0, 0.5, data.bedrooms)

In [None]:
# Fill review_scores with mean
data["review_scores_rating"] = data["review_scores_rating"].fillna(data["review_scores_rating"].mean())
#data.review_scores_value.fillna(data.review_scores_value.mean(), inplace=True)
#data.review_scores_checkin.fillna(data.review_scores_checkin.mean(), inplace=True)
#data.review_scores_location.fillna(data.review_scores_location.mean(), inplace=True)
#data.review_scores_communication.fillna(data.review_scores_communication.mean(), inplace=True)
#data.review_scores_accuracy.fillna(data.review_scores_accuracy.mean(), inplace=True)
#data.review_scores_cleanliness.fillna(data.review_scores_cleanliness.mean(), inplace=True)

In [None]:
# Fill missing text values with ""
text_columns = ["description", "interaction", "house_rules", "neighborhood_overview",
                "notes", "space", "summary", "transit"]
data[text_columns] = data[text_columns].fillna("")

**Handle wrong/varying datatypes**

In [None]:
# Convert numeric objects to float
data.cleaning_fee = [float(i.strip("$").replace(",","")) for i in data.cleaning_fee]
data.extra_people = [float(i.strip("$").replace(",","")) for i in data.extra_people]
data.monthly_price = [float(i.strip("$").replace(",","")) for i in data.monthly_price]
data.price = [float(i.strip("$").replace(",","")) for i in data.price]
data.security_deposit = [float(i.strip("$").replace(",","")) for i in data.security_deposit]
data.weekly_price = [float(i.strip("$").replace(",","")) for i in data.weekly_price]
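# Note: missing zipcodes become the string "zip_nan" here, so they are no longer counted as missing
data.zipcode = ["zip_"+str(i)[:5] for i in data.zipcode]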
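
The currency conversion can also be written as a vectorized string operation. A minimal sketch on an illustrative Series in the listings format (not the notebook's actual code):

```python
import pandas as pd

prices = pd.Series(["$60.00", "$1,900.00", "0"])  # illustrative values

# Remove "$" and thousands separators in one pass, then cast to float
prices_float = prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(prices_float.tolist())
```
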

In [None]:
# Convert date objects to datetime
data.first_review = pd.to_datetime(data.first_review)
data.last_review = pd.to_datetime(data.last_review)


**Add select amenities as column to data**

In [None]:
# Create temporary list with all amenities per listing
amenities_temp = [data.amenities[i].strip("{").strip("}").split(',') for i in data.index]

In [None]:
# Add all amenities to single list in order to count occurrences
amenities = []
for lst in amenities_temp:
    for item in lst:
        amenities.append(item)
amenities = pd.Series(amenities)

In [None]:
# Display count of individual amenities
#amenities.value_counts()
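
The same count can be obtained without explicit loops. A minimal sketch, assuming amenity strings in the `{a,"b c",d}` format shown in the listings file. The toy Series is illustrative, and unlike the cells above this version also strips the quotes around multi-word amenities:

```python
import pandas as pd

# Toy stand-in for data.amenities
amenities_raw = pd.Series([
    '{TV,"Cable TV",Wifi,Kitchen}',
    '{Internet,Wifi,Kitchen}',
    '{Wifi,"Pets live on this property"}',
])

amenity_counts = (
    amenities_raw.str.strip("{}")   # drop the surrounding braces
    .str.split(",")                 # one list of amenities per listing
    .explode()                      # one row per (listing, amenity) pair
    .str.strip('"')                 # remove quotes around multi-word names
    .value_counts()
)
print(amenity_counts.head())
```
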

Out of the full list of amenities, not all will have a significant impact on the price. For the purpose of this analysis, an initial selection has been made and then enhanced by some great [previous work](https://github.com/L-Lewis/Airbnb-neural-network-price-prediction/blob/master/Airbnb-price-prediction.ipynb) on selecting relevant amenities. Additionally, most amenities with a split of more than 90/10 between 1/0 have been **removed (strikethrough in the list)** - except for some that were deemed substantial (24-hour check-in, breakfast, essentials, nature and views)

| **NEW COLUMN** | **PREVIOUS AMENITY/IES** |
| :----- | :----- |
| <s>**am_check_in_24h**</s> | <s>24-hour check-in</s> |
| **<s>am_air_con</s>** | <s>Air conditioning/central air conditioning</s> |
| **am_balcony** | Balcony/patio or balcony |
| **am_nature_and_views** | Beach view/beachfront/lake access/mountain view/ski-in ski-out/waterfront (i.e. great location/views) |
| **am_breakfast** | Breakfast |
| **am_tv** | Cable TV/TV |
| **am_coffee_machine** | Coffee maker/espresso machine |
| **am_cooking_basics** | Cooking basics |
| **am_white_goods** | Dishwasher/Dryer/Washer/Washer and dryer |
| **am_elevator** | Elevator |
| <s>**am_gym**</s> | <s>Exercise equipment/gym/private gym/shared gym</s> |
| **am_essentials** | Essentials |
| **am_child_friendly** | Family/kid friendly, or anything containing 'children' |
| **am_parking** | Free parking on premises/free street parking/outdoor parking/paid parking off premises/paid parking on premises |
| <s>**am_outdoor_space**</s> | <s>Garden or backyard/outdoor seating/sun loungers/terrace</s> |
| <s>**am_wellness**</s> | <s>Hot tub/jetted tub/private hot tub/sauna/shared hot tub/pool/private pool/shared pool</s> |
| <s>**am_internet**</s> | <s>Internet/pocket wifi/wifi</s> |
| **am_pets_allowed** | Pets allowed/cat(s)/dog(s)/pets live on this property/other pet(s) |
| **am_private_entrance** | Private entrance |
| <s>**am_secure**</s> | <s>Safe/security system</s> |
| <s>**am_self_check_in**</s> | <s>Self check-in</s> |
| **am_smoking_allowed** | Smoking allowed |

In [None]:
# Add select amenities as distinct columns to data

#data.loc[data.amenities.str.contains('24-hour check-in'), 'am_check_in_24h'] = 1
#data.am_check_in_24h.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Air conditioning|Central air conditioning'), 'am_air_con'] = 1
#data.am_air_con.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Balcony|Patio'), 'am_balcony'] = 1
data.am_balcony.fillna(0, inplace=True)
#print(data.am_balcony.value_counts())

data.loc[data.amenities.str.contains('Beach view|Beachfront|Lake access|Mountain view|Ski-in/Ski-out|Waterfront'), 'am_nature_and_views'] = 1
data.am_nature_and_views.fillna(0, inplace=True)
#print(data.am_nature_and_views.value_counts())

data.loc[data.amenities.str.contains('Breakfast'), 'am_breakfast'] = 1
data.am_breakfast.fillna(0, inplace=True)
#print(data.am_breakfast.value_counts())

data.loc[data.amenities.str.contains('TV'), 'am_tv'] = 1
data.am_tv.fillna(0, inplace=True)
#print(data.am_tv.value_counts())

data.loc[data.amenities.str.contains('Coffee maker|Espresso machine'), 'am_coffee_machine'] = 1
data.am_coffee_machine.fillna(0, inplace=True)
#print(data.am_coffee_machine.value_counts())

data.loc[data.amenities.str.contains('Cooking basics'), 'am_cooking_basics'] = 1
data.am_cooking_basics.fillna(0, inplace=True)
#print(data.am_cooking_basics.value_counts())

data.loc[data.amenities.str.contains('Dishwasher|Dryer|Washer'), 'am_white_goods'] = 1
data.am_white_goods.fillna(0, inplace=True)
#print(data.am_white_goods.value_counts())

data.loc[data.amenities.str.contains('Elevator'), 'am_elevator'] = 1
data.am_elevator.fillna(0, inplace=True)
#print(data.am_elevator.value_counts())

data.loc[data.amenities.str.contains('Essentials'), 'am_essentials'] = 1
data.am_essentials.fillna(0, inplace=True)
#print(data.am_essentials.value_counts())

#data.loc[data.amenities.str.contains('Exercise equipment|Gym|gym'), 'am_gym'] = 1
#data.am_gym.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Family/kid friendly|Children|children'), 'am_child_friendly'] = 1
data.am_child_friendly.fillna(0, inplace=True)
#print(data.am_child_friendly.value_counts())

data.loc[data.amenities.str.contains('parking'), 'am_parking'] = 1
data.am_parking.fillna(0, inplace=True)
#print(data.am_parking.value_counts())

#data.loc[data.amenities.str.contains('Garden|Outdoor|Sun loungers|Terrace'), 'am_outdoor_space'] = 1
#data.am_outdoor_space.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Hot tub|Jetted tub|hot tub|Sauna|Pool|pool'), 'am_wellness'] = 1
#data.am_wellness.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Internet|Pocket wifi|Wifi'), 'am_internet'] = 1
#data.am_internet.fillna(0, inplace=True)

data.loc[data.amenities.str.contains(r'Pets|pet|Cat\(s\)|Dog\(s\)'), 'am_pets_allowed'] = 1
data.am_pets_allowed.fillna(0, inplace=True)
#print(data.am_pets_allowed.value_counts())

data.loc[data.amenities.str.contains('Private entrance'), 'am_private_entrance'] = 1
data.am_private_entrance.fillna(0, inplace=True)
#print(data.am_private_entrance.value_counts())

#data.loc[data.amenities.str.contains('Safe|Security system'), 'am_secure'] = 1
#data.am_secure.fillna(0, inplace=True)

#data.loc[data.amenities.str.contains('Self check-in'), 'am_self_check_in'] = 1
#data.am_self_check_in.fillna(0, inplace=True)

data.loc[data.amenities.str.contains('Smoking allowed'), 'am_smoking_allowed'] = 1
data.am_smoking_allowed.fillna(0, inplace=True)
#print(data.am_smoking_allowed.value_counts())
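
The repeated pattern in the cell above can be condensed into a mapping from new column to keywords. A minimal sketch on toy data, covering only a subset of the table. The name `amenity_map` is an assumption, and `re.escape` is used so that characters such as `(` and `)` are matched literally:

```python
import re

import pandas as pd

# Toy stand-in for "data"; the amenity strings are illustrative
toy = pd.DataFrame({"amenities": [
    '{TV,"Cable TV",Wifi,"Free street parking"}',
    '{Wifi,"Cat(s)","Patio or balcony"}',
    '{Essentials}',
]})

# New column -> literal amenity substrings (subset of the table above)
amenity_map = {
    "am_balcony": ["Balcony", "Patio"],
    "am_tv": ["TV"],
    "am_parking": ["parking"],
    "am_pets_allowed": ["Pets", "pet", "Cat(s)", "Dog(s)"],
}

for column, keywords in amenity_map.items():
    # Build one alternation pattern per column from the escaped keywords
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    toy[column] = toy["amenities"].str.contains(pattern, regex=True).astype(float)

print(toy.drop(columns="amenities"))
```

Adding or removing an amenity then only requires editing the mapping rather than a pair of lines per column.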


**Remove low-frequency classes from categorical columns**

In [None]:
# Change neighbourhoods that make up <0.1% of data to "other"
nb_counts = data.neighbourhood.map(data.neighbourhood.value_counts())
data["neighbourhood"] = data.neighbourhood.mask(nb_counts < 0.001*len(data), 'nb_other')

In [None]:
# Change zipcodes that make up <0.1% of data to "other"
zip_counts = data.zipcode.map(data.zipcode.value_counts())
data["zipcode"] = data.zipcode.mask(zip_counts < 0.001*len(data), 'zip_other')

**Drop irrelevant rows**

In [None]:
# Drop rows with missing values in columns where only a few rows are affected
data.dropna(subset=["name", "host_is_superhost", "bedrooms", "bathrooms", "neighbourhood", "zipcode"], inplace=True)

In [None]:
# Remove non-residential property_types (e.g. hotels, hostels, ...) --> INSTEAD ENGINEERED IN 3 | FEATURE ENGINEERING
#data = data[data.property_type.isin(["Apartment", "Condominium", "Loft", "House", "Townhouse", 
#                                     "Guest suite", "Bed and breakfast", "Bungalow", "Villa"])]

In [None]:
# Remove "poor" listings (value above/below a certain threshold)
data = data[data.price < 500]
data = data[data.price >= 10]
data = data[data.minimum_nights <= 100]

In [None]:
# Remove listings where "accommodates" is lower than "guests_included"
data = data[data.accommodates-data.guests_included >= 0]

In [None]:
# Remove listings where "bedrooms" - "beds" > 2
data = data[data.bedrooms-data.beds <= 2]

In [None]:
# Remove listings where "beds" - "bedrooms" > 10
data = data[data.beds-data.bedrooms <= 10]

In [None]:
# Remove listings where "accommodates" - "beds" < 0
data = data[data.accommodates-data.beds >= 0]

In [None]:
# Remove listings where "monthly_price" is more than 30x "price"
data = data[data.monthly_price/data.price <= 30]

In [None]:
# Remove listings where "weekly_price" is more than 7x "price"
data = data[data.weekly_price/data.price <= 7]
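
Because each of these filters looks at one row at a time, they can be collected as named boolean masks, which also shows how many rows each rule removes. A minimal sketch on toy data with a subset of the rules (values and rule names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [60.0, 5.0, 90.0, 700.0],
    "minimum_nights": [4, 2, 62, 1],
    "weekly_price": [0.0, 0.0, 520.0, 0.0],
})

# Each rule is a boolean mask of rows to KEEP
rules = {
    "price in [10, 500)": (toy.price >= 10) & (toy.price < 500),
    "minimum_nights <= 100": toy.minimum_nights <= 100,
    "weekly_price <= 7x price": toy.weekly_price / toy.price <= 7,
}

for name, keep in rules.items():
    print(f"{name}: removes {int((~keep).sum())} rows")

# A row survives only if it passes every rule
keep_all = pd.concat(rules, axis=1).all(axis=1)
filtered = toy[keep_all]
print(filtered.shape)
```
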

## Final Check, Cleaning and Export

In [None]:
# Check for listings without any availability that still received a review after 2020-01-28
data[(data.availability_365==0)&(data.last_review>"2020-01-28")]

In [1342]:
# Drop further columns (errors="ignore" keeps the cell re-runnable once the columns are already gone)
data.drop(["bed_type", "experiences_offered", "has_availability", "host_acceptance_rate", "host_location", 
           "host_response_rate", "host_response_time", "number_of_reviews", "number_of_reviews_ltm",
           "requires_license", "is_business_travel_ready", "host_has_profile_pic", "host_listings_count",
           "require_guest_profile_picture", "require_guest_phone_verification",
           "reviews_per_month", "square_feet"], inplace=True, axis=1, errors="ignore")

| **FEATURE(S)** | **NOTES** |
| :----- | :----- | 
| **bed_type** | over 97% of values were "Real Bed", hence little added value |
| **experiences_offered** | all values are "none" |
| **has_availability** | all values are "t" |
| **requires_license, host_has_profile_pic** | almost all values are "t" |
| **is_business_travel_ready** | all values are "f" |
| **require_guest_xxx** | almost all values are "f" |
| **host_listings_count** | calculated_host_listings_count appears to be a sanitized version (ranges from 1 to 55) of host_listings_count (has values 0 and highest is 1397) |
| **other host_xyz** | too many missing values |
| **reviews_per_month, number_of_reviews(_ltm)** | reviews in last 2 yrs are calculated in feature engineering |
| **square_feet** | too many missing values |
| <s>**property_type**</s> | <s>90% of values are "apartment", too many unique values to sensibly classify</s> kept instead |

In [1333]:
# List datatypes (data.info()) (post-cleaning)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23766 entries, 2015 to 41347401
Data columns (total 51 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   name                            23766 non-null  object 
 1   summary                         23766 non-null  object 
 2   space                           23766 non-null  object 
 3   description                     23766 non-null  object 
 4   neighborhood_overview           23766 non-null  object 
 5   notes                           23766 non-null  object 
 6   transit                         23766 non-null  object 
 7   interaction                     23766 non-null  object 
 8   house_rules                     23766 non-null  object 
 9   host_is_superhost               23766 non-null  object 
 10  host_identity_verified          23766 non-null  object 
 11  neighbourhood                   23766 non-null  object 
 12  zipcode                   

In [1334]:
# List missing values (post-cleaning)

#def count_missing(data):
#    null_cols = data.columns[data.isnull().any(axis=0)]
#    X_null = data[null_cols].isnull().sum()
#    X_null = X_null.sort_values(ascending=False)
#    print(X_null)
    
#count_missing(data)
data.isnull().sum()

name                              0
summary                           0
space                             0
description                       0
neighborhood_overview             0
notes                             0
transit                           0
interaction                       0
house_rules                       0
host_is_superhost                 0
host_identity_verified            0
neighbourhood                     0
zipcode                           0
latitude                          0
longitude                         0
is_location_exact                 0
property_type                     0
room_type                         0
accommodates                      0
bathrooms                         0
bedrooms                          0
beds                              0
amenities                         0
price                             0
weekly_price                      0
monthly_price                     0
security_deposit                  0
cleaning_fee                

As the output shows, no missing values remain.

In [1335]:
# Sort columns in dataset
data = data.reindex(sorted(data.columns, reverse=False), axis=1)

In [1336]:
# Display cleaned dataset
print(data.shape)
data.head(3)

(23766, 51)


id,accommodates,am_balcony,am_breakfast,am_child_friendly,am_coffee_machine,am_cooking_basics,am_elevator,am_essentials,am_nature_and_views,am_parking,am_pets_allowed,am_private_entrance,am_smoking_allowed,am_tv,am_white_goods,amenities,availability_365,bathrooms,bedrooms,beds,calculated_host_listings_count,cancellation_policy,cleaning_fee,description,extra_people,guests_included,host_identity_verified,host_is_superhost,house_rules,instant_bookable,interaction,is_location_exact,latitude,longitude,maximum_nights,minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood,notes,price,property_type,review_scores_rating,room_type,security_deposit,space,summary,transit,weekly_price,zipcode
2015,3,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,"{TV,""Cable TV"",Wifi,Kitchen,Gym,""Free street p...",194,1.0,1.0,1.0,6,strict_14_with_grace_period,30.0,Great location! 30 of 75 sq meters. This wood...,28.0,1,t,f,"No parties No events No pets No smoking, not e...",f,Always available,f,52.53454,13.40256,1125,4,0.0,Berlin-Mitte Value! Quiet courtyard/very central,It is located in the former East Berlin area o...,Mitte,"This is my home, not a hotel. I rent out occas...",60.0,Guesthouse,93.0,Entire home/apt,250.0,A+++ location! This „Einliegerwohnung“ is an e...,Great location! 30 of 75 sq meters. This wood...,"Close to U-Bahn U8 and U2 (metro), Trams M12, ...",0.0,zip_10119
3176,4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"{Internet,Wifi,Kitchen,""Buzzer/wireless interc...",289,1.0,1.0,2.0,1,strict_14_with_grace_period,100.0,This beautiful first floor apartment is situa...,20.0,2,t,f,"It’s a non smoking flat, which likes to be tre...",f,Feel free to ask any questions prior to bookin...,t,52.535,13.41758,1125,62,1900.0,Fabulous Flat in great Location,The neighbourhood is famous for its variety of...,Prenzlauer Berg,We welcome FAMILIES and cater especially for y...,90.0,Apartment,93.0,Entire home/apt,300.0,1st floor (68m2) apartment on Kollwitzplatz/ P...,This beautiful first floor apartment is situa...,"We are 5 min walk away from the tram M2, whic...",520.0,zip_10405
3309,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,"{Internet,Wifi,""Pets live on this property"",Ca...",315,1.0,1.0,1.0,1,strict_14_with_grace_period,30.0,First of all: I prefer short-notice bookings. ...,18.0,1,f,f,House-Rules and Information ..............(deu...,f,I'm working as a freelancing photographer. My ...,t,52.49885,13.34906,35,7,599.0,BerlinSpot Schöneberg near KaDeWe,"My flat is in the middle of West-Berlin, direc...",Schöneberg,The flat is a strictly non-smoking facility! A...,28.0,Apartment,89.0,Private room,250.0,"Your room is really big and has 26 sqm, is ver...",First of all: I prefer short-notice bookings. ...,The public transportation is excellent: Severa...,175.0,zip_10777


**Export data_clean**

In [1337]:
# Export dataset for further use in 2_Airbnb_EDA and 3_Airbnb_Feature_Engineering
data.to_pickle("saves/data_clean.pkl")

In [1338]:
# Alternative: Export with to_csv and save dtypes separately
#data.to_csv(r'saves/data_clean.csv', index = True)
#data.dtypes.to_frame('types').to_csv('saves/types_clean.csv')
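
The commented-out CSV alternative stores the dtypes separately because a CSV file does not carry them, whereas a pickle does. A minimal sketch of the difference on toy data, with a temporary folder standing in for `saves/` (all names here are illustrative):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"price": [60.0, 90.0], "last_review": pd.to_datetime(["2019-10-12", "2019-06-27"])},
    index=pd.Index([2015, 3176], name="id"),
)

save_dir = tempfile.mkdtemp()  # stand-in for the "saves" folder
pickle_path = os.path.join(save_dir, "data_clean.pkl")
csv_path = os.path.join(save_dir, "data_clean.csv")

toy.to_pickle(pickle_path)
toy.to_csv(csv_path, index=True)

from_pickle = pd.read_pickle(pickle_path)
from_csv = pd.read_csv(csv_path, index_col="id")

# The pickle keeps the datetime dtype; the CSV round trip returns plain strings
print(from_pickle["last_review"].dtype)
print(from_csv["last_review"].dtype)
```
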

**BACKUP**

In [1339]:
# Import Airbnb listing data for the time period 04/2018-03/2020 (2 years)
#all_files = glob.glob(os.path.join("data", "*.csv.gz"))
#all_df = []
#for f in all_files:
#    df = pd.read_csv(f, sep=',')
#    df['file'] = f.split('/')[-1]
#    all_df.append(df)
#data_raw = pd.concat(all_df, ignore_index=True, sort=True)