# Individual Report | Alizah Irfan - Group 15

STAT 301 103

In [None]:
library(tidyverse)
library(broom)
library(dplyr)
library(GGally)

## (1) Data Description

The data come from the Kaggle dataset Airbnb Prices in European Cities [https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities/data], which is compiled from Airbnb listings. Each city and weekend status has its own dataset with the following variables:

| Column name   | Description   | Data Type     |
| ------------- | ------------- | ------------- |
| `realSum` | The total price of the Airbnb listing. | Numeric |
| `room_type` | The type of room being offered (e.g. private, shared, etc.). | Categorical |
| `room_shared` | Whether the room is shared or not. | Boolean |
| `room_private` | Whether the room is private or not. | Boolean |
| `person_capacity` | The maximum number of people that can stay in the room. | Numeric |
| `host_is_superhost` | Whether the host is a superhost or not. | Boolean |
| `multi` | Whether the listing is for multiple rooms or not. | Boolean |
| `biz` | Whether the listing is for business purposes or not. | Boolean |
| `cleanliness_rating` | The cleanliness rating of the listing. | Numeric |
| `guest_satisfaction_overall` | The overall guest satisfaction rating of the listing. | Numeric |
| `bedrooms` | The number of bedrooms in the listing. | Numeric |
| `dist` | The distance from the city centre. | Numeric |
| `metro_dist` | The distance from the nearest metro station. | Numeric |
| `lng` | The longitude of the listing. | Numeric |
| `lat` | The latitude of the listing. | Numeric |

We also found the variables `attr_index`, `attr_index_norm`, `rest_index`, `rest_index_norm` in the datasets. However, they were not described.

This study focuses on London, Rome, and Budapest, for both weekends and weekdays, totalling 6 datasets. We combine the 6 datasets into one, adding the `city` and `isWeekend` columns. We then keep only the variables used in this study (`realSum`, `dist`, `city`), so `isWeekend` is dropped after merging.

| Added Column name   | Description   | Data Type     |
| ------------- | ------------- | ------------- |
| `city` | City of Airbnb listing | Categorical |
| `isWeekend` | Whether or not a listing is for the weekend | Boolean |

In [None]:
# Function to format data: select vars from dataset and add city name + if it's weekend data
# (named format_listings() so it is not overwritten by the `tidy_data` data frame created later)
format_listings <- function(data, cityName, isWeekend) {
    data <- data %>%
            select(realSum, dist) # add in any other predictors
    data$city <- cityName
    data$isWeekend <- isWeekend # 1 = weekend data, 0 = weekday data
    return(data)
}

# Import and format data with format_listings()
london_weekdays <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/london_weekdays.csv"), "London", 0)
london_weekends <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/london_weekends.csv"), "London", 1)
rome_weekdays <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/rome_weekdays.csv"), "Rome", 0)
rome_weekends <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/rome_weekends.csv"), "Rome", 1)
budapest_weekdays <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/budapest_weekdays.csv"), "Budapest", 0)
budapest_weekends <- format_listings(read_csv("https://raw.githubusercontent.com/alizahirfan/stat301-project/refs/heads/main/data/budapest_weekends.csv"), "Budapest", 1)

# Merge all the data together
data <- rbind(london_weekdays, london_weekends, rome_weekdays, rome_weekends, budapest_weekdays, budapest_weekends)

data <- data %>%
        select(-isWeekend)

# View the first and last 6 rows (the default for head() and tail())
# head(data)
# tail(data)

Now we have a single dataset containing all the observations for the relevant cities with only our key variables selected. However, this dataset may contain outliers, and we have to ensure that the `city` column is a factor. The code below converts `city` to a factor and, within each city, keeps only the listings whose `realSum` lies between that city's 5th and 95th percentiles. It then prints the dimensions of the data along with its first and last 6 rows.

In [None]:
tidy_data <-
    data %>%
    mutate(
            across(c(city),as.factor)
    ) %>%
    group_by(city) %>%
    filter(
        realSum >= quantile(realSum, 0.05, na.rm = TRUE),
        realSum <= quantile(realSum, 0.95, na.rm = TRUE)) %>%
    ungroup()
dim(tidy_data)
head(tidy_data)
tail(tidy_data)

From the output, we can see that `realSum` and `dist` are stored as `<dbl>` and `city` as `<fct>`, which is correct for analysis. Additionally, we have a total of 20742 observations, which is a large sample size.

## (2) Question

The goal of this study is to produce results that would be helpful for people who are opening Airbnbs in one of the 3 mentioned cities (London, Rome, Budapest) and want insight into setting the price for their listing. The following question guides our model:

**How is the average price of an Airbnb listing affected by the city it's located in and distance from the city centre?**

This is an inference question: by analyzing this sample, we can infer the behaviour of the population. Our population in this case is all Airbnbs located in London, Rome, and Budapest. The response variable is the price of the Airbnb listing (`realSum`), and the covariates are the city of the listing (`city`) and the distance to the city centre (`dist`). We also allow for an interaction between the two covariates: some cities may have better transit options to the city centre than others despite longer distances, so the association between `dist` and `realSum` may differ by city. The model for this problem will follow this formula: `realSum ~ city * dist`. We can use exploratory visualizations to check that some of our assumptions for this formula are reasonable.
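
For reference, the formula `realSum ~ city * dist` can be written out as a regression equation. The version below is a sketch that assumes R's default treatment coding with Budapest as the baseline level (factor levels are sorted alphabetically by default); the symbols $\beta_0, \dots, \beta_5$ and the indicator variables $\text{London}_i$ and $\text{Rome}_i$ are notation introduced here, not part of the dataset.

$$
\text{realSum}_i = \beta_0 + \beta_1 \text{London}_i + \beta_2 \text{Rome}_i + \beta_3 \text{dist}_i + \beta_4 (\text{London}_i \times \text{dist}_i) + \beta_5 (\text{Rome}_i \times \text{dist}_i) + \varepsilon_i
$$

Under this coding, the slope of `dist` is $\beta_3$ for Budapest, $\beta_3 + \beta_4$ for London, and $\beta_3 + \beta_5$ for Rome, so $\beta_4$ and $\beta_5$ capture how the distance association differs from the baseline city.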

## (3) Exploratory Data Analysis and Visualization

We have imported and tidied the data in section (1). Now, we can take a closer look at the distribution of the variables and their relation to one another through visualizations.

Since `realSum` is our response variable, we can take a look at the distribution of all the values in that column. Additionally, we can add lines to mark the 5th and 95th percentiles and the mean value.

In [None]:
# Set the plot size for the notebook output
options(repr.plot.width = 8, repr.plot.height = 6)

# 5th and 95th percentiles of realSum across the whole (already trimmed) sample
q5 <- quantile(tidy_data$realSum, 0.05, na.rm = TRUE)
q95 <- quantile(tidy_data$realSum, 0.95, na.rm = TRUE)

tidy_data %>%
    ggplot(aes(x = realSum)) +
    geom_histogram(bins = 30, fill = "#69b3a2", color = "white", alpha = 0.9) +
    # Dashed line: sample mean (linewidth replaces the deprecated size argument for lines)
    geom_vline(aes(xintercept = mean(realSum, na.rm = TRUE)),
             color = "red", linetype = "dashed", linewidth = 1) +
    # Solid lines: 5th and 95th percentiles
    geom_vline(xintercept = q5, color = "red", linetype = "solid", linewidth = 0.7) +
    geom_vline(xintercept = q95, color = "red", linetype = "solid", linewidth = 0.7) +
    xlab("realSum") +
    ylab("Count") +
    ggtitle("Distribution of realSum values")

We can see that the distribution of the response variable in this sample is right-skewed. Consistent with this skew, the mean lies closer to the 5% quantile line than to the 95% quantile line. This is simply the distribution of our sample, and we'd have to apply other techniques (e.g. bootstrapping) to gain more insight into the population distribution.
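
As a rough illustration of the bootstrapping idea mentioned above, the sketch below builds a percentile bootstrap confidence interval for a mean. It uses simulated right-skewed prices (the `prices` vector, the seed, and the number of resamples are all assumptions made for this example), not the Airbnb data; in this notebook the same steps could be applied to `tidy_data$realSum`.

```r
# Illustrative only: simulated right-skewed prices, NOT the Airbnb data
set.seed(301)
prices <- rlnorm(500, meanlog = 5, sdlog = 0.5)

# Resample with replacement and record the mean of each resample
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(prices, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the population mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))

mean(prices)
boot_ci
```

The spread of `boot_means` approximates the sampling distribution of the mean, which is what lets us say something about the population rather than only this sample.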

We can also visualize a scatter plot of all our data points. We can colour code our points by city and set a lower transparency so we can see areas with a high density of points.

In [None]:
tidy_data %>%
    ggplot(aes(x = dist, y = realSum, color = city)) +
    geom_point(alpha = 0.2)

We can see from the plot above that certain areas of the graph have a high density of points for a particular city, so a potential pattern emerges between `realSum` and `dist` that requires us to factor in `city`. A line graph may show clearer relations. We 'bin' the values of `dist` by grouping together values that are equal when rounded to the nearest whole number. We can then visualize the average value of our response for each 'bin' and colour code by city.

In [None]:
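# Bin dist by rounding to the nearest whole number
data_binned <- tidy_data
data_binned$dist <- round(data_binned$dist, 0)

# Average price within each city and distance bin
data_binned <- data_binned %>%
                group_by(city, dist) %>%
                summarize(avg_realSum = mean(realSum), .groups = "drop")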

data_binned %>%
    ggplot(aes(x = dist, y = avg_realSum, color = city)) +
    geom_point() +
    geom_line() +
    xlab("Distance") +
    ylab("Average realSum") +
    ggtitle("Average Listing Price vs Distance from City Centre Based on Location") 

We can see a clearer pattern between our inputs and response variable through this plot. It seems that the slope of price against distance may depend on the city, which further supports using a model with an interaction term between the two inputs. A potential problem with a linear model is that the lines we see have a lot of kinks, so a straight-line model could over-generalize the data. What about excluding city from this model? The plot below shows the relation between distance and listing price through the same 'binning' process without separating by city.

In [None]:
data_binned <- tidy_data
data_binned$dist <- round(data_binned$dist, 0)

data_binned <- data_binned %>%
                group_by(dist) %>%
                summarize(avg_realSum = mean(realSum))

data_binned %>%
    ggplot(aes(x = dist, y = avg_realSum)) +
    geom_point() +
    geom_line() +
    xlab("Distance") +
    ylab("Average realSum") +
    ggtitle("Average Listing Price vs Distance from City Centre") 

We can see that including city reveals city-specific patterns that are hidden in the pooled distance vs listing price plot, so city should be included in our model to get more detailed results.

## (4) Methods and Plan

We plan to use multiple linear regression (MLR) with interaction to estimate Airbnb prices based on city and distance from city centre. An MLR model is an appropriate choice for answering an inference question with 2 inputs and a continuous numerical response. We will consider a model with interaction due to the results of the EDA: the slope between `dist` and `realSum` appears to depend on city. The formula for this regression will be: `realSum ~ city * dist`.<br>
As with any model, certain assumptions are made about the data. The assumptions of MLR with interaction, in relation to our data, are as follows: <br>

- **Linearity:** Based on our EDA, we are assuming that our response is approximately a linear function of our inputs and their interaction term.  
- **Independence of Errors:** We assume that our observations are independent of each other. This data is from an outside source so we can never know this for sure.  
- **Homoscedasticity:** We can test if the variance of errors is constant through a residuals vs fitted values plot.  
- **Normality of Residuals:** We can see if the residuals have a normal distribution through a Q-Q plot.  
- **Minimal Multicollinearity:** We can check if our inputs are highly correlated through their VIF score.  
- **Correct Specification:** We are assuming that we have correctly specified the model with an interaction term from the result of our EDA. 
- **Correct Variable Encoding:** We ensured that, when we loaded the data, each relevant variable had the correct variable type for the model.  

When these assumptions hold, the results from our model are appropriate for inference, since we can obtain reliable estimates, standard errors, and confidence intervals.
There are still possible difficulties and weaknesses in our choice of model. Violating some of the assumptions may mean we cannot answer our question, since it relies on making inferences. Adding an interaction term could possibly lead to overfitting the model, which may lead to poor predictions. Having too few input variables may result in a model with high errors if we are missing the potentially large impact of other variables. There's always the possibility of confounding variables as well. Since this is an external observational study, we cannot easily discover or isolate them from the data.
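
The diagnostic checks named in the assumptions list could be sketched roughly as follows. This is a minimal example on simulated listings, not the Airbnb data: the data frame `sim`, its city-specific intercepts and slopes, and the noise level are all hypothetical values chosen for illustration. In this notebook, the same calls would use `tidy_data` in place of `sim`.

```r
# Illustrative only: simulated listings, NOT the Airbnb data
set.seed(301)
n <- 300
sim <- data.frame(
    city = factor(rep(c("Budapest", "London", "Rome"), each = n / 3)),
    dist = runif(n, min = 0, max = 10)
)

# Hypothetical city-specific intercepts and slopes used to generate prices
intercepts <- c(Budapest = 200, London = 400, Rome = 250)
slopes <- c(Budapest = -5, London = -20, Rome = -10)
city_chr <- as.character(sim$city)
sim$realSum <- unname(intercepts[city_chr] + slopes[city_chr] * sim$dist) +
    rnorm(n, sd = 30)

# Fit the MLR model with interaction
fit_int <- lm(realSum ~ city * dist, data = sim)

# Coefficient estimates with 95% confidence intervals
cbind(estimate = coef(fit_int), confint(fit_int))

# Homoscedasticity: residuals vs fitted values; normality: Q-Q plot
par(mfrow = c(1, 2))
plot(fit_int, which = 1)
plot(fit_int, which = 2)

# VIF of dist in the additive model (realSum ~ city + dist),
# computed by hand as 1 / (1 - R^2) from regressing dist on city
r2_dist <- summary(lm(dist ~ city, data = sim))$r.squared
vif_dist <- 1 / (1 - r2_dist)
vif_dist
```

The VIF here is computed for the additive model because interaction terms are correlated with their own main effects by construction, which tends to inflate VIF values in the interaction model without necessarily signalling a problem.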

To summarize, we can use an MLR model with interaction to infer how the listing price of Airbnbs in Rome, Budapest, and London is associated with their city and distance from city centre. Our EDA supports the appropriateness of this model. However, there are assumptions for this type of model that we'd have to test further, and other possible weaknesses that may prevent us from answering our inference question.