written by: Jana Vihs, vihsjana@student.hu-berlin.de, 604930
# Dear Jupyter Notebook Reader

Fancy seeing you here.

# Airbnb Price Predictor 

### Table of Contents
- Introduction
    - Meta Information
    - Tools 
        - Docker
        - DVC
- Exploratory Data Analysis
    - Numeric Features about the Airbnb 
    - Numeric Features about the Host
    - Text Data 
        - Reviews
    - Images 
- Feature Engineering 
    - Distance to City Center
    - Host since in years
    - Text Length
    - Sentiment Analysis
    - Images 
        - Colors and Brightness
- Feature Selection
    - Feature Importance 
    - Grid Search
- Benchmark Models
    - Multivariate Linear Regression
    - Neural Networks  
- Model Evaluation
- Final Method
    - Hyperparameter Tuning
- Conclusion and Outlook
- References 

# Introduction


In [6]:
# Import all necessary packages
# Standard libraries
import pandas as pd 
import numpy as np
import os 
import math
import sys

# Visualizations
import seaborn as sns
import folium
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

import datetime
import warnings
warnings.filterwarnings('ignore')



In [9]:
# Extend the Python path to include the modules I wrote myself
sys.path.append(os.path.dirname('../src'))
from src.features.preprocess.Processor import Processor
from src.features.preprocess.Textprocessor import Textprocessor

In [11]:
# load data set 
train = pd.read_csv('../data/raw/train.csv', index_col='listing_id')
test = pd.read_csv('../data/raw/test.csv', index_col='listing_id')
reviews = pd.read_csv('../data/raw/reviews.csv', index_col='listing_id')

# Meta Information 

The data set is available on Kaggle (https://www.kaggle.com/c/adams2021/data) and contains the following features:

* *name*:
* *summary*:
* *space*:
* *description*:
* *experiences_offered*:
* *neighborhood_overview*:
* *transit*:
* *house_rules*:                   
* *picture_url*:                      
* *host_id*:                           
* *host_since*:                        
* *host_response_time*:              
* *host_response_rate*:            
* *host_is_superhost*:                  
* *host_total_listings_count*:         
* *host_has_profile_pic*:               
* *host_identity_verified*:             
* *neighbourhood*:                     
* *neighbourhood_cleansed*:             
* *zipcode*:                          
* *latitude*:                           
* *longitude*:                          
* *property_type*:                      
* *room_type*:                          
* *accommodates*:                       
* *bathrooms*:                         
* *bedrooms*:                         
* *beds*:                             
* *bed_type*:                           
* *amenities*:                           
* *guests_included*:                    
* *review_scores_rating*:            
* *review_scores_accuracy*:          
* *review_scores_cleanliness*:       
* *review_scores_checkin*:           
* *review_scores_communication*:     
* *review_scores_location*:          
* *review_scores_value*:             
* *cancellation_policy*:                
* *reviews_per_month*:  
* *price*: **Target variable**

The review data set contains the following features:

* *reviewer_id*:
* *comments*:
* *review_id*:



In [27]:
print(f'Our train data consists of {train.shape[0]} rows and {train.shape[1]} columns, '
      f'while our test data contains {test.shape[0]} rows and {test.shape[1]} columns.')
print(f'The additional data set reviews consist of {reviews.shape[0]} rows and {reviews.shape[1]} columns')

Our train data consists of 55284 rows and 41 columns, while our test data contains 29769 rows and 40 columns.
The additional data set reviews consist of 1540778 rows and 3 columns


In [12]:
# Change data types to reduce memory usage
train = Processor().change_data_types(train)
test = Processor().change_data_types(test)
reviews = Processor().change_data_types(reviews)
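
The internals of `Processor.change_data_types` are not shown in this notebook. As a rough sketch of how such a memory reduction could work, assuming it downcasts numeric columns and turns low-cardinality text columns into categories, the hypothetical helper `shrink_dtypes` below illustrates the idea on toy data:

```python
import pandas as pd


def shrink_dtypes(df: pd.DataFrame, max_categories: int = 50) -> pd.DataFrame:
    """Hypothetical helper (not the author's Processor): reduce memory usage."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int8 when all values fit
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif pd.api.types.is_string_dtype(out[col]) and out[col].nunique() <= max_categories:
            # text columns with few distinct values are stored as categories
            out[col] = out[col].astype('category')
    return out


# Toy data using column names from the feature list above.
toy = pd.DataFrame({
    'accommodates': [2, 4, 1, 2] * 250,
    'price': [45.0, 130.0, 80.0, 99.5] * 250,
    'room_type': ['Entire home/apt', 'Private room', 'Private room', 'Shared room'] * 250,
})
small = shrink_dtypes(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')
```

Downcasting floats to `float32` loses precision, which is usually acceptable for prices and ratings but should be checked before applying it to coordinates.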

## Missing values

In [29]:
# Missing values in test data
test.isnull().sum()

name                              10
summary                         1577
space                           9057
description                      943
experiences_offered                0
neighborhood_overview          10551
transit                        10672
house_rules                    12580
picture_url                        0
host_id                            0
host_since                        65
host_response_time              9572
host_response_rate              9572
host_is_superhost                  0
host_total_listings_count         65
host_has_profile_pic               0
host_identity_verified             0
neighbourhood                     86
neighbourhood_cleansed             0
zipcode                          635
latitude                           0
longitude                          0
property_type                      0
room_type                          0
accommodates                       0
bathrooms                         50
bedrooms                          29
b

In [4]:
# Missing values  in train data 
train.isnull().sum()

name                              14
summary                         2954
space                          16881
description                     1726
experiences_offered                0
neighborhood_overview          19506
transit                        19807
house_rules                    23378
picture_url                        0
host_id                            0
host_since                       111
host_response_time             17802
host_response_rate             17802
host_is_superhost                  0
host_total_listings_count        111
host_has_profile_pic               0
host_identity_verified             0
neighbourhood                    147
neighbourhood_cleansed             0
zipcode                         1272
latitude                           0
longitude                          0
property_type                      0
room_type                          0
accommodates                       0
bathrooms                         70
bedrooms                          62
b

In [28]:
# Missing values reviews
reviews.isnull().sum()

reviewer_id      0
comments       691
review_id        0
dtype: int64
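
Raw counts are hard to compare across data sets of different size. For example, `house_rules` is missing in 23378 of the 55284 train rows, which is about 42%. A minimal sketch of a helper that reports both the count and the share of missing values could look like this (`missing_summary` is a name made up for illustration, and the data is a toy example):

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(df)).round(2),
    })
    # Keep only the columns that actually have missing values.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)


toy = pd.DataFrame({
    'name': ['Cosy flat', None, 'Loft', 'Studio'],
    'house_rules': [None, None, 'No smoking', None],
    'price': [45.0, 80.0, 130.0, 60.0],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

In the notebook the same helper could be called as `missing_summary(train)`, `missing_summary(test)` and `missing_summary(reviews)`.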

In [30]:
# Merge reviews on train using listing_id
trainReview = train.merge(reviews, on='listing_id')
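
Note that `merge` defaults to an inner join: listings without any review are dropped from the result, and listings with several reviews appear once per review. The toy example below is only a sketch of that behaviour with made-up data; `validate` and `indicator` are optional pandas parameters that help to check the join:

```python
import pandas as pd

# Made-up listings and reviews, both indexed by listing_id as in the notebook.
listings = pd.DataFrame(
    {'price': [45.0, 80.0, 130.0]},
    index=pd.Index([1, 2, 3], name='listing_id'),
)
toy_reviews = pd.DataFrame(
    {'comments': ['Great stay', 'Very clean', 'Too noisy']},
    index=pd.Index([1, 1, 3], name='listing_id'),
)

# Default inner join: listing 2 has no review and is dropped,
# listing 1 has two reviews and appears twice.
# validate raises an error if listing_id is not unique on the left side.
inner = listings.merge(toy_reviews, on='listing_id', validate='one_to_many')

# A left join keeps every listing; the _merge column shows which rows matched.
left = listings.merge(toy_reviews, on='listing_id', how='left', indicator=True)

print(len(inner), len(left))
print(left['_merge'].value_counts())
```

Whether the inner or the left join is the better choice depends on whether listings without reviews should stay in the training data.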

To get a better grip on the data, we split the training data into three dataframes, which makes the analysis simpler:
`host` holds information about the host of the Airbnb, `airbnb` holds information about the listing itself, and `review_scores` holds information about the reviews.

In [46]:
# Split the data set into 3 categories to make the analysis simpler
host, airbnb, review_scores = Processor().split_df(train)
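
`Processor.split_df` lives in the project's own `src` package and its code is not part of this notebook. Purely as an illustration, and assuming the split is driven by column names, a stand-in could group the columns by prefix as sketched below (`split_by_prefix` is a hypothetical name):

```python
import pandas as pd


def split_by_prefix(df: pd.DataFrame):
    """Hypothetical stand-in for Processor.split_df: group columns by name prefix."""
    host_cols = [c for c in df.columns if c.startswith('host_')]
    review_cols = [c for c in df.columns if c.startswith('review')]
    # Everything that is neither host nor review information describes the listing.
    other_cols = [c for c in df.columns if c not in host_cols + review_cols]
    return df[host_cols], df[other_cols], df[review_cols]


# One made-up row with a few of the columns listed in the meta information.
toy = pd.DataFrame({
    'host_since': ['2015-03-01'],
    'host_is_superhost': ['t'],
    'bedrooms': [1.0],
    'price': [80.0],
    'review_scores_rating': [95.0],
    'reviews_per_month': [1.2],
})
toy_host, toy_airbnb, toy_review_scores = split_by_prefix(toy)
print(list(toy_host.columns), list(toy_airbnb.columns), list(toy_review_scores.columns))
```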

Because some features are only considered later in the analysis, we narrow the dataframes down to specific columns for now.

In [47]:
airbnb = airbnb.drop(['picture_url', 'longitude', 'latitude', 'zipcode', 'neighbourhood'], axis=1)

# Exploratory Data Analysis

In [49]:
airbnb.describe()

| | accommodates | bathrooms | bedrooms | beds | price | guests_included |
|---|---|---|---|---|---|---|
| count | 55284.0 | 55214.0 | 55222.0 | 55022.0 | 55284.0 | 55284.0 |
| mean | 3.131756 | 1.28385 | 1.373873 | 1.710661 | 104.308754 | 1.574832 |
| std | 1.930209 | 0.566556 | 0.859448 | 1.224301 | 83.74041 | 1.263427 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 45.0 | 1.0 |
| 50% | 2.0 | 1.0 | 1.0 | 1.0 | 80.0 | 1.0 |
| 75% | 4.0 | 1.5 | 2.0 | 2.0 | 130.0 | 2.0 |
| max | 16.0 | 11.0 | 19.0 | 21.0 | 500.0 | 46.0 |


In [36]:
# Price distribution 
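
The summary statistics above show a mean price of about 104 against a median of 80, so the distribution is skewed to the right. One way to look at it, sketched here under the assumption that a plain histogram is what this cell is meant to show, is to plot the raw price next to its log transform. The helper name and the synthetic data are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_price_distribution(prices: pd.Series, bins: int = 50):
    """Draw histograms of the raw and the log-transformed price side by side."""
    fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(12, 4))
    ax_raw.hist(prices, bins=bins, color='#43d9de')
    ax_raw.set(title='Price', xlabel='price', ylabel='number of listings')
    ax_log.hist(np.log1p(prices), bins=bins, color='#43d9de')
    ax_log.set(title='log(1 + price)', xlabel='log(1 + price)', ylabel='number of listings')
    fig.tight_layout()
    return fig, (ax_raw, ax_log)


# Synthetic, right-skewed stand-in data clipped to the observed range of 10 to 500;
# in the notebook one would pass train['price'] instead.
rng = np.random.default_rng(0)
toy_prices = pd.Series(np.clip(rng.lognormal(mean=4.4, sigma=0.7, size=1000), 10, 500))
fig, axes = plot_price_distribution(toy_prices)
```

If the log-transformed histogram looks closer to symmetric, training on `log(1 + price)` may be worth considering for the regression models.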

In [None]:
# Correlation plot
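
Before drawing a heatmap it can help to rank the numeric features by their correlation with the target. The following sketch assumes Pearson correlation on the numeric columns only; `correlation_with_target` is a hypothetical helper and the data is a toy example:

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = 'price') -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    # Non-numeric columns such as room_type are left out.
    corr = df.select_dtypes(include='number').corr()
    ranked = corr[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried.
    return ranked.reindex(ranked.abs().sort_values(ascending=False).index)


toy = pd.DataFrame({
    'accommodates': [1, 2, 2, 4, 6],
    'bathrooms': [1.0, 1.0, 1.5, 1.0, 2.0],
    'room_type': ['Private room', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt'],
    'price': [30.0, 45.0, 80.0, 130.0, 200.0],
})
ranking = correlation_with_target(toy)
print(ranking)
```

The full matrix from `df.select_dtypes(include='number').corr()` is what one would pass to `sns.heatmap` for the actual plot.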

In [None]:
# Create map
# folium expects coordinates in the order [latitude, longitude]
latlon = list(zip(train.latitude, train.longitude))
mapit = folium.Map(location=[52.667989, -1.464582], zoom_start=6)
for lat, lon in latlon:
    # CircleMarker, unlike Marker, supports radius and fill_color
    folium.CircleMarker(location=[lat, lon], radius=8, fill=True, fill_color='#43d9de').add_to(mapit)

mapit.save('map.html')

# Feature Engineering 
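
The table of contents lists the distance to the city center as the first engineered feature. One common way to derive it from `latitude` and `longitude` is the haversine great-circle distance. The sketch below assumes that approach; the helper name and the center coordinates are placeholders and not values taken from this project:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat, lon, center_lat, center_lon):
    """Great-circle distance in km between each (lat, lon) and a fixed center point."""
    lat, lon = np.radians(lat), np.radians(lon)
    center_lat, center_lon = np.radians(center_lat), np.radians(center_lon)
    a = (np.sin((lat - center_lat) / 2) ** 2
         + np.cos(lat) * np.cos(center_lat) * np.sin((lon - center_lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy coordinates; CENTER is a placeholder, replace it with the real city center.
toy = pd.DataFrame({'latitude': [0.0, 0.0, 1.0], 'longitude': [0.0, 1.0, 0.0]})
CENTER = (0.0, 0.0)
toy['distance_to_center_km'] = haversine_km(toy['latitude'], toy['longitude'], *CENTER)
print(toy)
```

Because the function is vectorised, it can be applied to the full `train` and `test` dataframes in one call instead of looping over rows.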



## Host History 
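
The table of contents names "Host since in years" as a feature. A minimal sketch of how `host_since` could be turned into such a number is shown below, assuming the column holds date strings and that a fixed reference date is chosen. Both the helper name and the reference date are made up for illustration:

```python
import pandas as pd


def years_as_host(host_since: pd.Series, reference_date: str) -> pd.Series:
    """Years between host_since and a reference date; missing dates stay missing."""
    since = pd.to_datetime(host_since, errors='coerce')
    delta = pd.Timestamp(reference_date) - since
    return delta.dt.days / 365.25


# Toy data; the None mimics the rows where host_since is missing.
toy = pd.DataFrame({'host_since': ['2015-01-01', '2019-07-01', None]})
toy['host_since_years'] = years_as_host(toy['host_since'], reference_date='2020-01-01')
print(toy)
```

The missing entries (111 in train, 65 in test) remain `NaN` here and still need an imputation strategy before modelling.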
## Pictures
## Reviews

# Sentiment Analysis


str