# Estimating AirBnB Prices

<img src="images/airbnb.jpg"/>

## Background Information on the Dataset

AirBnB is an online marketplace that allows members to offer or arrange lodging (primarily homestays) or tourism experiences. There are millions of listings in cities across the world, such as London, Paris, and New York. In this problem, we would like to understand the factors that influence the price of a listing.

To answer this question, we take a look at listing data released by AirBnB (downloaded in September 2018 from http://insideairbnb.com/get-the-data.html). We specifically focus on apartments listed in six representative neighborhoods of Boston, MA. Our data has a total of 12 columns and 1693 observations, split across a training set (1187 observations) and a test set (506 observations). Each observation corresponds to a different listing.

    Training data: airbnb-train.csv 

    Test data: airbnb-test.csv

### Here is a detailed description of the variables:

    id: A number that uniquely identifies the listing.

    host_is_superhost: Whether a host is a “superhost,” meaning they satisfy AirBnB’s criteria for high-quality listings, high response rate, and reliability.

    host_identity_verified: Whether the host has verified their identity with AirBnB, which is intended to promote trust between hosts and guests.

    neighborhood: The neighborhood that the listing is located in (Allston, Back Bay, Beacon Hill, Brighton, Downtown, or South End).

    room_type: The type of room provided in the listing (Entire home/apt, Private room, or Shared room).

    accommodates: The number of people that the listing can accommodate.
    
    bathrooms: The number of bathrooms in the listing.

    bedrooms: The number of bedrooms in the listing.

    beds: The number of beds in the listing.

    price: The price to stay in the listing for one night.

    logprice: The natural logarithm of the price variable.

    logacc: The natural logarithm of the accommodates variable.

### Exploratory Data Analysis
Load *airbnb-train.csv* into a data frame called train.

In [1]:
# Read in the training dataset

train = read.csv("data/airbnb-train.csv")

head(train)

| | id | host_is_superhost | host_identity_verified | neighborhood | room_type | accommodates | bathrooms | bedrooms | beds | price | logprice | logacc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 1 | 8792 | 0 | 0 | Downtown | Entirehome/apt | 2 | 1 | 1 | 1 | 154 | 5.036953 | 0.6931472 |
| 2 | 10810 | 0 | 0 | Allston | Entirehome/apt | 5 | 1 | 2 | 4 | 250 | 5.521461 | 1.6094379 |
| 3 | 10811 | 0 | 0 | BackBay | Entirehome/apt | 3 | 1 | 0 | 2 | 189 | 5.241747 | 1.0986123 |
| 4 | 22212 | 0 | 1 | BackBay | Entirehome/apt | 4 | 1 | 2 | 2 | 285 | 5.652489 | 1.3862944 |
| 5 | 28150 | 0 | 1 | BackBay | Entirehome/apt | 2 | 1 | 1 | 1 | 184 | 5.214936 | 0.6931472 |
| 6 | 47722 | 0 | 1 | BackBay | Privateroom | 2 | 1 | 1 | 1 | 479 | 6.171701 | 0.6931472 |


In [2]:
str(train)

'data.frame':	1187 obs. of  12 variables:
 $ id                    : int  8792 10810 10811 22212 28150 47722 60356 170715 307571 311240 ...
 $ host_is_superhost     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ host_identity_verified: int  0 0 0 1 1 1 0 0 0 1 ...
 $ neighborhood          : Factor w/ 6 levels "Allston","BackBay",..: 5 1 2 2 2 2 5 2 3 2 ...
 $ room_type             : Factor w/ 3 levels "Entirehome/apt",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ accommodates          : int  2 5 3 4 2 2 2 4 2 2 ...
 $ bathrooms             : num  1 1 1 1 1 1 1 1 1 1 ...
 $ bedrooms              : int  1 2 0 2 1 1 1 1 1 0 ...
 $ beds                  : int  1 4 2 2 1 1 1 2 1 1 ...
 $ price                 : int  154 250 189 285 184 479 175 200 150 185 ...
 $ logprice              : num  5.04 5.52 5.24 5.65 5.21 ...
 $ logacc                : num  0.693 1.609 1.099 1.386 0.693 ...
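
As a quick sanity check on the variable descriptions, the sketch below confirms that logprice and logacc are the natural logarithms of price and accommodates. It uses only the first six rows, typed in by hand from the head(train) output; the names sample_rows, price_ok, and acc_ok are illustrative and not part of the original notebook.

```r
# First six training rows, copied from the head(train) output
# (illustrative sample only, not the full training set)
sample_rows <- data.frame(
  accommodates = c(2, 5, 3, 4, 2, 2),
  price        = c(154, 250, 189, 285, 184, 479),
  logprice     = c(5.036953, 5.521461, 5.241747, 5.652489, 5.214936, 6.171701),
  logacc       = c(0.6931472, 1.6094379, 1.0986123, 1.3862944, 0.6931472, 0.6931472)
)

# log() in R is the natural logarithm; compare with a small tolerance
# because the printed values are rounded
price_ok <- isTRUE(all.equal(log(sample_rows$price), sample_rows$logprice,
                             tolerance = 1e-6))
acc_ok <- isTRUE(all.equal(log(sample_rows$accommodates), sample_rows$logacc,
                           tolerance = 1e-6))

print(c(price_ok = price_ok, acc_ok = acc_ok))
```

With the full data frame loaded, the same comparison could be run as all.equal(log(train$price), train$logprice).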


**How many rows are in the training dataset?**

In [3]:
# Calculate the number of rows in the training dataset
nrow(train)

**What is the mean price in the training dataset?**

In [5]:
# Find the mean price in the training set
mtp = mean(train$price)
round(mtp,2)

**What is the maximum price in the training dataset?**

In [6]:
# Find the max price in the training set
maxp = max(train$price)
round(maxp,2)

**What is the neighborhood with the highest number of listings in the training dataset?**

In [7]:
# Tabulate the number of listings for each neighborhood
table(train$neighborhood)


   Allston    BackBay BeaconHill   Brighton   Downtown   SouthEnd 
       176        279        155        135        208        234 

In [8]:
max(table(train$neighborhood))

Answer: BackBay.
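
Note that max() returns the largest count, not the name of the neighborhood. One way to get the name directly is which.max(), sketched here on the counts copied from the table above; the variable names counts, top_neighborhood, and top_count are illustrative.

```r
# Listing counts per neighborhood, copied from the table(train$neighborhood) output
counts <- c(Allston = 176, BackBay = 279, BeaconHill = 155,
            Brighton = 135, Downtown = 208, SouthEnd = 234)

# which.max() returns the position of the largest value, keeping its name
top_neighborhood <- names(which.max(counts))
top_count <- max(counts)

print(top_neighborhood)
print(top_count)
```

On the data frame itself, the equivalent call would be names(which.max(table(train$neighborhood))).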

**What is the neighborhood with the highest average price in the training dataset?**

In [14]:
# Compute the mean price for each neighborhood in the training dataset
tapply(train$price, train$neighborhood, mean)

Answer: Downtown.
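
To read the highest group mean off without scanning the output, the tapply() result can be sorted. The sketch below shows the pattern on the first six rows of the training data only, so its ranking does not reflect the full training set (where the answer is Downtown); sample_rows, mean_by_nbhd, ranked, and top_in_sample are illustrative names.

```r
# First six training rows, copied from the head(train) output
# (illustrative sample only; the full training set has 1187 rows)
sample_rows <- data.frame(
  neighborhood = c("Downtown", "Allston", "BackBay", "BackBay", "BackBay", "BackBay"),
  price        = c(154, 250, 189, 285, 184, 479)
)

# Mean price per neighborhood, then sort from most to least expensive
mean_by_nbhd <- tapply(sample_rows$price, sample_rows$neighborhood, mean)
ranked <- sort(mean_by_nbhd, decreasing = TRUE)

# Name of the neighborhood with the highest mean in this small sample
top_in_sample <- names(ranked)[1]
print(ranked)
```

Applied to the full data, the same idea would be sort(tapply(train$price, train$neighborhood, mean), decreasing = TRUE).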

### Simple Linear Regression
For the rest of this problem, we will be working with log(price) and log(accommodates). The log transformation compresses the upper end of each scale, which reduces the influence of outliers with unusually high prices or guest capacities. The values of log(price) and log(accommodates) are found in the columns logprice and logacc, respectively.
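
To illustrate why the log scale is less sensitive to large values, the sketch below compares how far the largest price sits from the median before and after taking logs. It uses only the six prices shown in the head(train) output, and the names price, ratio_raw, and ratio_log are illustrative.

```r
# Six prices copied from the head(train) output (illustrative sample only)
price <- c(154, 250, 189, 285, 184, 479)
logprice <- log(price)  # natural logarithm

# How many times larger than the median is the maximum, on each scale?
ratio_raw <- max(price) / median(price)
ratio_log <- max(logprice) / median(logprice)

print(round(c(raw = ratio_raw, log = ratio_log), 2))
```

On the raw scale the largest price is more than twice the median, while on the log scale it is only slightly above it, so a single expensive listing pulls a fitted line far less.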

Load *airbnb-test.csv* into a data frame called test.