# Project Planning
### Marcus Lim Group 15


In [None]:
# Importing libraries

# library(broom)
library(readr)
library(tidyverse)
library(GGally)

## 1) Data Description
Our dataset contains Airbnb prices for Budapest, London and Rome, from the [Airbnb Prices in European Cities](https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities/data) dataset. The data was collected from Airbnb listings.

There are 20 variables in the original dataset. However, one is an `id` column that simply repeats the row number (read in by `read_csv()` as `...1`), and four others (`attr_index`, `attr_index_norm`, `rest_index`, `rest_index_norm`) are not explained in detail. This leaves us with 15 variables, detailed below:

| Column name   | Description   | Data Type     |
| ------------- | ------------- | ------------- |
| `realSum` | The total price of the Airbnb listing (Euros). | Numeric |
| `room_type` | The type of room being offered (e.g. private, shared, etc.). | Categorical |
| `room_shared` | Whether the room is shared or not. | Boolean |
| `room_private` | Whether the room is private or not. | Boolean |
| `person_capacity` | The maximum number of people that can stay in the room. | Numeric |
| `host_is_superhost` | Whether the host is a superhost or not. | Boolean |
| `multi` | Whether the listing is for multiple rooms or not. | Boolean |
| `biz` | Whether the listing is for business purposes or not. | Boolean |
| `cleanliness_rating` | The cleanliness rating of the listing. | Numeric |
| `guest_satisfaction_overall` | The overall guest satisfaction rating of the listing. | Numeric |
| `bedrooms` | The number of bedrooms in the listing. | Numeric |
| `dist` | The distance from the city centre. | Numeric |
| `metro_dist` | The distance from the nearest metro station. | Numeric |
| `lng` | The longitude of the listing. | Numeric |
| `lat` | The latitude of the listing. | Numeric |


In order to get the size of our dataset, we will first read the data and merge the csv files into one table. In addition, we will create two more variables for a total of `17` variables. The new variables are as follows:

| Column name   | Description   | Data Type     |
| ------------- | ------------- | ------------- |
| `city` | The city location of the Airbnb listing. | Categorical |
| `isWeekend` | Whether the listing is for weekends or not. | Boolean |

In [None]:
# Function to format data: select vars from dataset and add city name + if it's weekend data
tidy_data <- function(data, cityName, isWeekend) {
    data <- data %>%
        select(-...1, -attr_index, -attr_index_norm, -rest_index, -rest_index_norm)
    data$city <- as.factor(cityName)
    data$isWeekend <- as.logical(isWeekend)
    return(data)
}

# Import and format data with tidy_data()
london_weekdays <- tidy_data(read_csv("data/london_weekdays.csv"), "London", 0)
london_weekends <- tidy_data(read_csv("data/london_weekends.csv"), "London", 1)
rome_weekdays <- tidy_data(read_csv("data/rome_weekdays.csv"), "Rome", 0)
rome_weekends <- tidy_data(read_csv("data/rome_weekends.csv"), "Rome", 1)
budapest_weekdays <- tidy_data(read_csv("data/budapest_weekdays.csv"), "Budapest", 0)
budapest_weekends <- tidy_data(read_csv("data/budapest_weekends.csv"), "Budapest", 1)

# Merge all the data together
data <- rbind(london_weekdays, london_weekends, rome_weekdays, rome_weekends, budapest_weekdays, budapest_weekends)

In [None]:
nrow(data)

From the above output, our combined dataset contains `23042` total observations.

## 2) Question
My question is:

> Is Airbnb price associated with the distance to the closest metro station and with whether the listing is for a weekend?

Justification: 

For tourists looking to explore a new country, access to the local transit network can be essential for getting around. It would also be valuable to know whether weekend listings are priced differently from weekday listings. Knowing whether these variables are associated with Airbnb prices can help travellers strike a good balance between cost and travel enjoyment.

The data is suitable for this question, as it contains `realSum`, the total price of an Airbnb listing, as well as `metro_dist`, the distance to the closest metro station, and `isWeekend`, whether a listing is for a weekend.

These columns will help determine whether distance to the metro and weekend status are statistically associated with Airbnb price. This is an inferential question involving hypothesis testing and associations, rather than predicting values.

## 3) Exploratory Data Analysis

Earlier, to get the number of observations overall, we demonstrated that the dataset can be loaded into R. 

Below we go through an EDA checklist:
- Packaging (& Wrangling as needed)
- Head and tail of data (presented in a tidy format)
- Check for NA values
- Check n's and summary statistics
- Plot visualizations for variables of interest

In [None]:
# Packaging
str(data)

`room_type` is stored as a character vector, but it should be a categorical variable (a factor). The boolean variables are also represented inconsistently (some as logical, others as numeric 0 or 1), but we will ignore this for now.

In [None]:
data <- data %>% mutate(room_type = as.factor(room_type))

# Fixed Packaging
str(data)

In [None]:
# Top and Bottom of data
head(data)
tail(data)

In [None]:
# Check for missing/NA values
sum(is.na(data))
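
`sum(is.na(data))` returns a single total, so it does not show which columns any missing values come from. A minimal sketch of a per-column count, using a small made-up data frame (`toy`, hypothetical values) in place of `data`:

```r
# Toy data frame standing in for `data` (hypothetical values, not the Airbnb data)
toy <- data.frame(
  realSum = c(120.5, NA, 310.0, 95.2),
  metro_dist = c(0.4, 1.2, NA, 2.5),
  isWeekend = c(TRUE, FALSE, TRUE, FALSE)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

The same call on the real table would be `colSums(is.na(data))`.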

In [None]:
# Compute summary statistics & check numbers
summary(data)

From the above summary statistics, we note the presence of outliers in `realSum` with a max of `15499.89`, which is much higher than the mean of `268.47`. 

I propose a histogram of `realSum` to visualize how the response is distributed, to choose a cutoff for removing outliers, and to make it easier to understand possible relationships between `realSum` and the predictors.

In [None]:
# Plot a histogram of our response
options(repr.plot.width = 8, repr.plot.height = 5)

realSum_hist <- data %>%
    ggplot(aes(x=realSum)) +
    geom_histogram() +
    labs(title="Histogram of Airbnb prices", x="Total Airbnb Price (Euros)", y="Count") +
    theme(axis.text = element_text(size = 16), 
          axis.title = element_text(size = 16),
         title=element_text(size = 20))

realSum_hist

The histogram above confirms that `realSum` has extreme outliers. I will filter the data to keep only listings priced below 1000 euros. In addition, I would like to mark the mean and the 95th percentile of price on the plot.

In [None]:
data_filt <- data %>% 
    filter(realSum < 1000)

realSum_hist_filt <- data_filt %>%
    ggplot(aes(x=realSum)) +
    geom_histogram() +
    # Mean and 95th percentile are computed on the unfiltered `data`, not `data_filt`
    geom_vline(xintercept=mean(data$realSum), color = "red") +
    geom_vline(xintercept=quantile(data$realSum, 0.95), color = "blue") +
    scale_x_continuous(breaks=seq(0, 1000, 100)) +
    ggtitle("Histogram of Airbnb prices < 1000\nwith 95% threshold (blue) and mean price (red)") +
    labs(x="Total Airbnb Price (Euros)", y="Count") +
    theme(axis.text = element_text(size = 15), 
          axis.title = element_text(size = 15),
         title=element_text(size = 16))
    

realSum_hist_filt

Our response is strongly right-skewed. 95% of Airbnb listing prices fall below `640`, with the mean at around `270`. The threshold of `1000` still covers over 95% of listings, and because my question is about keeping costs down, a threshold of `1000` euros is suitable for this EDA.
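
The quantities discussed above (the mean, the percentiles, and the share of listings under the cutoff) can also be computed directly rather than read off the plot. A minimal sketch, assuming simulated log-normal prices (`prices`, chosen only to mimic a right-skewed response) in place of `data$realSum`:

```r
set.seed(1)

# Simulated right-skewed prices (hypothetical; not the Airbnb data)
prices <- rlnorm(5000, meanlog = 5.3, sdlog = 0.6)

cutoff <- 1000
mean_price <- mean(prices)
pct_bounds <- quantile(prices, probs = c(0.05, 0.95))  # bounds of the middle 90%
share_kept <- mean(prices < cutoff)                    # proportion below the cutoff

mean_price
pct_bounds
share_kept
```

On the real data, replacing `prices` with `data$realSum` would give the values that correspond to the red and blue lines.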

Next, I propose plots of `realSum` against `metro_dist` and against `isWeekend`, as these help to visualize the relationship between the response and each predictor. Specifically, I will use a scatterplot for `metro_dist` (continuous) and a boxplot for `isWeekend` (binary).

(I also note that possible confounders, or problems such as Simpson's paradox, may affect the patterns seen in this EDA and may need to be addressed. However, that is beyond the scope of this planning assignment.)

In [None]:
realSum_metro <- data_filt %>%
    ggplot(aes(x=metro_dist,y=realSum)) +
    geom_point() +
    ggtitle("Airbnb Price vs. Distance to Metro") +
    ylab("Airbnb Price (Euros)") +
    xlab("Distance to closest Metro (km)") +
    theme(axis.text = element_text(size = 15), 
          axis.title = element_text(size = 15),
         title=element_text(size = 20))

realSum_metro

The plot above suggests a weak negative trend between `metro_dist` and `realSum`. The trend is weak because prices vary widely among listings at short distances (0-3 km).

In [None]:
realSum_weekend <- data_filt %>%
    ggplot(aes(x=isWeekend,y=realSum)) +
    geom_boxplot() +
    ggtitle("Airbnb Price for Weekends vs Weekdays") +
    ylab("Airbnb Price (Euros)") +
    xlab("On a weekend or not") +
    theme(axis.text = element_text(size = 15), 
          axis.title = element_text(size = 15),
         title=element_text(size = 16))

realSum_weekend

The plot above does not suggest a notable difference in median Airbnb price between weekend listings and weekday listings.
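
A numeric summary by group can back up the visual impression from the boxplot. A minimal sketch, assuming a simulated table (`toy`, hypothetical values) with the same two columns as `data_filt`:

```r
library(dplyr)

set.seed(3)

# Simulated stand-in for `data_filt` (hypothetical values, not the Airbnb data)
toy <- tibble(
  isWeekend = rep(c(FALSE, TRUE), each = 200),
  realSum = rlnorm(400, meanlog = 5.3, sdlog = 0.5)
)

# Group size, centre and spread of price for weekday vs weekend listings
weekend_summary <- toy %>%
  group_by(isWeekend) %>%
  summarise(
    n = n(),
    median_price = median(realSum),
    mean_price = mean(realSum),
    iqr_price = IQR(realSum)
  )

weekend_summary
```

The median and IQR correspond to the centre line and box height in the boxplot, so they are the most direct numbers to compare against it.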

## 4) Methods and Plan

To answer whether Airbnb price is associated with the distance to the closest metro station and with whether the listing is for a weekend, I will fit an additive multiple linear regression (MLR) model using `lm()`, with the formula `realSum ~ metro_dist + isWeekend`. This method is appropriate because `lm()` lets us quantify each association with an interpretable coefficient and test its significance. For each coefficient, a hypothesis test will either provide evidence of a statistically significant association (a p-value below a chosen significance level) or fail to provide enough evidence of one.
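
A minimal sketch of how the proposed model could be fitted and read. It assumes simulated data (`sim`) in place of `data_filt`, and the effect sizes used to generate it are invented for illustration only:

```r
set.seed(4)
n <- 1000

# Simulated stand-in for `data_filt` (hypothetical values and effect sizes)
sim <- data.frame(
  metro_dist = rexp(n, rate = 1.5),
  isWeekend = rep(c(FALSE, TRUE), length.out = n)
)
sim$realSum <- 250 - 15 * sim$metro_dist + 10 * sim$isWeekend + rnorm(n, sd = 100)

# Additive MLR: one slope for metro_dist, one shift for weekend listings
fit <- lm(realSum ~ metro_dist + isWeekend, data = sim)

# Estimate, standard error, t statistic and p-value for each coefficient
coef_table <- summary(fit)$coefficients
coef_table

# 95% confidence intervals for the coefficients
confint(fit, level = 0.95)
```

The `metro_dist` row estimates the change in price per extra kilometre from the metro with weekend status held fixed, and the `isWeekendTRUE` row estimates the weekend vs weekday difference at a fixed distance.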

However, we need to make the following assumptions:
- Linearity: There is a linear relationship between `realSum` and `metro_dist`.
- Independence: The errors (observations) are independent of each other.
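- Normality: The errors are normally distributed. With 23042 observations, the CLT means that the sampling distributions of the coefficient estimates are approximately normal, so inference remains approximately valid even if the errors themselves are not exactly normal.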
- Homoscedasticity: The errors have constant variance.
- No interaction: `metro_dist` and `isWeekend` do not interact in their effect on Airbnb price. That is, the estimated change in price per unit change in distance to the metro is the same whether or not the listing is for a weekend.
- No multicollinearity: The covariates are not highly correlated with each other.
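
Several of the assumptions listed above can be checked once a model is fitted. A minimal sketch, again assuming simulated data (`sim`, hypothetical values) in place of `data_filt`; the particular checks are one reasonable choice, not the only one:

```r
set.seed(5)
n <- 1000

# Simulated stand-in for `data_filt` (hypothetical values and effect sizes)
sim <- data.frame(
  metro_dist = rexp(n, rate = 1.5),
  isWeekend = rep(c(FALSE, TRUE), length.out = n)
)
sim$realSum <- 250 - 15 * sim$metro_dist + 10 * sim$isWeekend + rnorm(n, sd = 100)

additive <- lm(realSum ~ metro_dist + isWeekend, data = sim)
with_interaction <- lm(realSum ~ metro_dist * isWeekend, data = sim)

# Interaction: F-test comparing the nested additive and interaction models
interaction_test <- anova(additive, with_interaction)
interaction_test

# Multicollinearity: correlation between the two covariates
covariate_cor <- cor(sim$metro_dist, as.numeric(sim$isWeekend))
covariate_cor

# Linearity and homoscedasticity: residuals vs fitted values
plot(fitted(additive), resid(additive),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of errors: normal Q-Q plot of the residuals
qqnorm(resid(additive))
qqline(resid(additive))
```

A curved or funnel-shaped residual plot would point to a violation of linearity or constant variance, and a small p-value in `interaction_test` would suggest that the additive model is too simple.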

There are also the following limitations:
- Incorrect Assumptions: If some of our assumptions turn out to be false, then our results may be invalid or require corrections. For instance, if the relationship is non-linear, we may need to transform some covariates.
- Confounding: Since the model only includes two covariates, there may be other variables that influence both price and metro distance (e.g. city). Leaving these out of the model could bias the estimates.
- Outliers: Based on the EDA, the data is skewed with outliers, which may result in a model that does not capture representative patterns.
