## Introduction

Airbnb is an online platform that helps individuals connect who wants to rent their homes
to the one searching for accommodation. It presently encompasses more than 81,000 cities
worldwide and 220 countries. The total number of listings worldwide is about 6 million
and actively supports an average of 2 million overnight stays (Zhu, et al., 2020).

To captivate more guests , the host has to provide a detailed description of his property.
One of the strategies to attract more tenants includes a reasonable price. The star rating
system is usually used for determining the cost of the property; currently, there are no clear
price recommendations available (Li, et al., 2016; Zhu et al., 2020).


## Research Question. 

This study aims to analyze the Airbnb listing data and find the most relevant features that
can be used to suggest an ideal price to the host while listing a new property.

## Rationale.

In this study, we analyze relation within various features like property type, room type, location to develop a price recommendation model helping hosts decide fair pricing for the property.

## Data Description.

Publicly available Airbnb data for New York city (Anon., n.d.) has been used for analysis.
Variables that are irrelevant to the study has been excluded from the data set. The data set
has 59 columns and 36,922 observations.


Downloading Dataset from Google Basket. 

Moving file to HDFS


Importing python library.

In [6]:
%livy2.pyspark3

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField,StringType,IntegerType,StructType, DateType, DoubleType
from datetime import datetime as dt
from pyspark.sql.functions import isnan, when, count, col, lit
import pyspark.sql.functions as f
import pyspark.sql.types as t
import calendar
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator


spark = SparkSession.builder.appName("AirBnb").getOrCreate()


## Data Cleaning and preprocessing
Importing Dataset

In [8]:
%livy2.pyspark3

df1 = spark.read.format("csv").load("/tmp/listing.csv", header = True)
df1.show(2)

In [9]:
%livy2.pyspark3
for col in df1.columns:
    print(col)

Only selecting parameters that are relevant to the study. i.e features that host will enter while listing the property and can be related with price.

In [11]:
%livy2.pyspark3

df = df1.select(['host_since','price','room_type','bedrooms','host_verifications', 'property_type','latitude', 'longitude','beds','accommodates'])
df.show(2)

Only selecting parameters that are relevant to the study. i.e features that host will enter while listing the property and can be related with price.v

In [13]:
%livy2.pyspark3

df.printSchema()

In [14]:
%livy2.pyspark3
from pyspark.sql.functions import isnan, when, count, col

df = df.na.drop(subset = ['host_since'])
df.select([count(when(isnan(c) | col(c).isNull(), c)). alias(c)for c in df.columns]).show()


Changing Data type of variables and replacing numm values in bedrooms and bed with mean.

In [16]:
%livy2.pyspark3

df = df.withColumn("bedrooms", df['bedrooms'].cast(IntegerType()))
df = df.withColumn("beds", df['beds'].cast(IntegerType()))
df = df.withColumn("accommodates", df['accommodates'].cast(IntegerType()))

df = df.na.fill(int(df.select(f.mean(df['bedrooms'])).collect()[0][0]),subset = ["bedrooms"])
df = df.na.fill(int(df.select(f.mean(df['beds'])).collect()[0][0]),subset = ["beds"])

df = df.withColumn("latitude", df['latitude'].cast(DoubleType())).withColumn("longitude", df['longitude'].cast(DoubleType()))

df.printSchema()

Removing $ symbol from the data


In [18]:
%livy2.pyspark3
price_new = df.select('price').rdd.map(lambda x:float(x[0].replace('$','').replace(',',''))).map(lambda x: (x,)).toDF().withColumnRenamed('_1','price_new')
price_new = price_new.withColumn("id", f.monotonically_increasing_id())

 

Parcing **Host_since** columns as dates, and counting number of varification each host has. 


In [20]:
%livy2.pyspark3
new_dates = df.select('host_since').rdd.map(lambda x:dt.strptime(x[0], "%m/%d/%Y")).map(lambda x: (x, )).toDF().withColumnRenamed('_1', 'new_date')

new_host_verifications = df.select('host_verifications').rdd.map(lambda x:x[0].count("'")/2).map(lambda x: (x, )).toDF().withColumnRenamed('_1', 'new_host_verifications')

new_dates = new_dates.withColumn("id", f.monotonically_increasing_id())
df = df.withColumn("id", f.monotonically_increasing_id())
new_host_verifications = new_host_verifications.withColumn("id", f.monotonically_increasing_id())

df2 = df.join(new_dates, "id", "outer").join(new_host_verifications, "id", "outer").join(price_new, "id", "outer").orderBy('id').drop('id').drop('host_since').drop('price').drop('host_verifications')

df2.show(2)

 

Checking if all the datatypes are as expected or not


In [22]:
%livy2.pyspark3
df2.createOrReplaceTempView("air")
df2.printSchema()

## Visualization


In the chunk below we group the data by the day of registration 
and visualize the results.



Plotting relations between numbers of registrations on each  month 
and mean price of property.


In [26]:
%livy2.sql

Select day(new_date) as day,count(*) as no_of_properties_registered,mean(price_new) as mean_price from air group by day(new_date) order by day(new_date) 





In [27]:
%livy2.sql

Select month(new_date) as month,count(*) as no_of_properties_registered ,mean(price_new) as mean_price from air group by month(new_date) order by month(new_date) 

The plot above for counts (Numbers of registrations) is slightly skewed towards right i.e moreproperties have been registered in the second half of the years. However there is no evident effect of months on the price of the property


No relation can be seen between date and price of property.


Plotting relations between year of registration and number of registrations along with price.


In [31]:
%livy2.sql

Select year(new_date) as day,count(*) as no_of_properties_registered ,mean(price_new) as mean_price from air group by year(new_date) order by year(new_date) 

Although, there are more registrations from 2012 to 2017 no significant difference can be seen in the price of the property.
Thus, after analyzing the date of registration it shows no significant effect on the price of property.



In [33]:
%livy2.pyspark3

df2.groupBy('room_type').agg(f.mean('price_new'),f.count('room_type')).createOrReplaceTempView("room_type_count")

To find out if there exist any relation between room type and price of 
the property we plot pie chart between room type and mean price for the room type.

Understanding the types of properties that are majorly 
listed over the years.

In [36]:
%livy2.sql
select * from room_type_count


In [37]:
%livy2.sql
select * from room_type_count


Analyzing Relation between number of beds and the count of properties along with cost.


In [39]:
%livy2.sql

Select bedrooms, count(*) as no_of_properties_registered,mean(price_new) as mean_price from air group by bedrooms 

In [40]:
%livy2.sql
Select bedrooms, count(*) as no_of_properties_registered,mean(price_new) as mean_price from air group by bedrooms 

In [41]:
%livy2.pyspark3
df2.columns

Understanding relation between price and number of people property can accommodates.

 let us find the imapct of host verification on the listing of the property.

In [44]:
%livy2.sql

select accommodates, count(*)/10, mean(price_new) from air group by accommodates


In [45]:
%livy2.sql

select new_host_verifications,sum(price_new) from air group by new_host_verifications


Understanding impact of property type on price.

In [47]:
%livy2.sql
select property_type, mean(price_new) from air group by property_type order by mean(price_new)

In [48]:
%livy2.pyspark3
from pyspark.sql.functions import isnan, when, count, col

df2.drop('new_date').select([count(when(isnan(c), c)).alias(c) for c in df2.drop('new_date').columns]).show()


## Data analysis

By the above analysis we understand that room and property type plays important and are correlated to the price of the property.
Inorder to understand the relation better we calculate the pearson correlation between data. However, some of these features are in string format; to perform further analysis we perform String indexing on these features.
The code below is used to String index the above columns.



In [51]:
%livy2.pyspark3

from pyspark.ml.feature import StringIndexer


indexer = StringIndexer(inputCol="room_type", outputCol="Index_room")
indexed = indexer.fit(df2).transform(df2)

indexer = StringIndexer(inputCol="property_type", outputCol="Index_property")
indexed = indexer.fit(indexed).transform(indexed)

indexed= indexed.select([ 'bedrooms',  'latitude', 'longitude', 'beds', 'accommodates', 'new_host_verifications', 'price_new', 'Index_property','Index_room'])

In [52]:
%livy2.pyspark3


indexed.show(2)

In [53]:
%livy2.pyspark3

indexed.printSchema()

Now the data is finally ready to calculate the Pearson Correlation. The code below is used to find the correlation between the data.
Once the correlation have been calculated we filter the columns with correlation less than 0.05. The columns are filtered due to following reasons.

1.Reduce the dimensions of the data.
2.To Pervent over fitting or under fitting of the model.


In [55]:
%livy2.pyspark3

corr_list = [(c,indexed.select(f.corr(c,'price_new')).collect()[0][0]) for c in indexed.drop('price_new').columns]

deptSchema = StructType([       
    StructField('Feature', StringType(), True),
    StructField('correlation_with_price', DoubleType(), True)
])

corr_df = spark.createDataFrame(data=corr_list, schema = deptSchema)


corr_df.createOrReplaceTempView('corr_df')

Visualization of correlation results

In [57]:
%livy2.sql

select *  from corr_df


Only keeping the columns with higher corelation 

In [59]:
%livy2.pyspark3

 indexed = indexed.select(['bedrooms', 'longitude', 'beds', 'accommodates', 'price_new', 'Index_property', 'Index_room'])

In [60]:
%livy2.pyspark3
indexed.columns

Performing one hot encoding so that we can feed the data to the machine learning model

In [62]:
%livy2.pyspark3

from pyspark.ml.feature import OneHotEncoder

OHE_encoded = OneHotEncoder(inputCol = 'Index_room', outputCol = 'Index_room_').transform(indexed)
OHE_encoded = OneHotEncoder(inputCol = 'Index_property', outputCol = 'Index_property_').transform(OHE_encoded)

OHE_encoded.select(['bedrooms', 'longitude', 'beds', 'accommodates', 'price_new','Index_room_','Index_property_'])
OHE_encoded.show(2)

 
Creating dense vector 


In [64]:
%livy2.pyspark3
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols =['bedrooms', 'longitude', 'beds', 'accommodates', 'Index_room_', 'Index_property_'], outputCol = "features" )

output =assembler.transform(OHE_encoded)

output = output.select(['features', 'price_new'])

In [65]:
%livy2.pyspark3

train_data,test_data = output.randomSplit([0.8,0.2])
train_data.printSchema()

The Independent features has a large portion of binary variables, thus Random Forest Regression model is suitable for such applications.
The code below splits the data into train and test sets. After splitting, we pass the data to the Random Forest Regression engine with 100 Trees.


In [67]:
%livy2.pyspark3

rf = RandomForestRegressor(featuresCol="features", labelCol='price_new', numTrees = 100)
model = rf.fit(train_data)

Predicting the result

In [69]:
%livy2.pyspark3
from pyspark.sql.functions import monotonically_increasing_id 

predictions = model.transform(test_data)
predictions.createOrReplaceTempView('pred')

The chunk of code given below is used to visualize our predictions and give us a better understanding on the performance of our engine.

In [71]:
%livy2.sql

Select row_number() over (order by "price_new") as num, * from pred limit 500

Calculating RSME of the model.

In [73]:
%livy2.pyspark3

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="price_new", predictionCol="prediction", metricName="rmse")
    
rmse = evaluator.evaluate(predictions.limit(5))
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)


The RSME of the model is 54. However, just RSME is not sufficient to evaluate the model. Thus, the code below is used to calculate the mean of dependent variable.

In [75]:
%livy2.sql

select mean(price_new) from air


## Conclusion. 

Thus, we have successfully extracted features that can be used for recommending prices for a new listing. The date of registration does not play a significant role in determining the cost of the property. However, features such as room type, property type, bedrooms,longitude, number of beds, and the number of people a property can accommodate can be used to recommend a favorable fare.

Random Forest Regression engine is used to predict an ideal price rate for the property. The model has an RSME value of 54 and is reasonably suitable for the application. The performance of the recommendation engine can be improved by using complex modeling techniques and tuning hyperparameters.

## Improvisation

Using automated script to update date in every 15 days. Script is included in the Report.
## References

Anon., n.d. Inside Airbnb (Get the data). [Online] 
  Available at: http://insideairbnb.com/get-the-data.html
  [Accessed 1 2021].

Li, Y., Pan, Q., Yang, T. & Guo, L., 2016. Reasonable Price Recommendation on Airbnb Using Multi-Scale          Clustering. Chengdu, China, TCCT.

Zhu, A., Li, R. & Xie, Z., 2020. Machine Learning Prediction of New York Airbnb. Irvine, CA, USA, USA, IEEE.


