# **Source**:

https://www.kaggle.com/datasets/manjitbaishya001/airbnb-new-york-jan-2024?select=detailed_reviews.csv, sourced from Airbnb

**Description:** Airbnb data from New York focussing on listings, locations, and user reviews of locations

# **What is the Dataset about?**:

This dataset contains Airbnb listings in New York, with features such as the type of housing (Rental Unit, Loft, etc.) and the price of the listing. Note that several rows have missing values in the price column; these are dealt with during Data Preprocessing. Other features include the name of the host, neighbourhood group, neighbourhood, latitude, longitude, minimum nights to stay, number of reviews of the listing, the date of the last review, and reviews per month. These features can be used to estimate a reasonable valuation for a listing.

There is also a dataset with detailed reviews for each Airbnb, which can be used to evaluate the quality of the listing based on user experiences.


# **Data Preprocessing**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# importing dataset and organizing
listings_df = pd.read_csv("../dataset/listings.csv")

# dropping all listings with an NA price: more than 5% of prices are NA, and
# imputing a feature as complex as a listing price seems unnecessary given
# the quantity of listings that do have an associated price
listings_df = listings_df.dropna(subset=['price'])

# all values are in the correct format
print("Data Types: ", list(listings_df.dtypes))


print(3*"\n", "Dataframe:")

listings_df.head()

Data Types:  [dtype('int64'), dtype('O'), dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('float64'), dtype('float64'), dtype('O'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('O'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('O')]



 Dataframe:


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 743430766348459803 | Rental unit in Brooklyn · 1 bedroom · 1 bed · ... | 83422715 | D | Brooklyn | Flatbush | 40.65375 | -73.95502 | Entire home/apt | 289.0 | 30 | 0 | | | 1 | 365 | 0 | |
| 6 | 11943 | Home in Brooklyn · 1 bedroom · 2 beds · 1 bath | 45445 | Harriet | Brooklyn | Flatbush | 40.63702 | -73.96327 | Private room | 150.0 | 30 | 0 | | | 1 | 0 | 0 | |
| 8 | 1312228 | Rental unit in Brooklyn · ★5.0 · 1 bedroom | 7130382 | Walter | Brooklyn | Clinton Hill | 40.68371 | -73.96461 | Private room | 55.0 | 30 | 3 | 2015-12-20 | 0.03 | 1 | 0 | 0 | |
| 24 | 13440481 | Rental unit in New York · 1 bedroom · 1 bed · ... | 17385374 | Dennis J. | Manhattan | Upper East Side | 40.7655 | -73.9708 | Private room | 301.0 | 30 | 0 | | | 1 | 0 | 0 | |
| 28 | 45277537 | Rental unit in New York · ★4.67 · 2 bedrooms ·... | 51501835 | Jeniffer | Manhattan | Hell's Kitchen | 40.76661 | -73.9881 | Entire home/apt | 144.0 | 30 | 9 | 2023-05-01 | 0.24 | 139 | 364 | 2 | |
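
The choice to drop rows with a missing price rests on how many rows are affected. A minimal sketch of that check is shown here on a small made-up table; the name `toy_listings` and its values are hypothetical and do not come from `listings.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for listings.csv; the values are made up.
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": [289.0, np.nan, 55.0, np.nan, 144.0],
})

# Share of rows with no price, expressed as a percentage.
missing_pct = toy_listings["price"].isna().mean() * 100
print(f"Missing price: {missing_pct:.1f}% of rows")

# Drop the rows with no price, mirroring the dropna call used on listings_df.
toy_clean = toy_listings.dropna(subset=["price"])
print(f"Rows kept: {len(toy_clean)} of {len(toy_listings)}")
```

Running the same two lines on `listings_df` before the `dropna` call would report the actual share of missing prices.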


# **Data exploration and summary statistics**

## **Statistical Method #1**

**Null Hypothesis ($H_{0}$) :** The neighborhood group of the Airbnb does not have a statistically significant impact on the availability for the Airbnb.

**Alternative Hypothesis ($H_{a}$) :** The neighborhood group of the Airbnb has a statistically significant impact on the availability for the Airbnb.

**Alpha-Value ($\alpha$) :** 0.05

**Confidence level:** 95%

In [None]:
import numpy as np
from scipy.stats import kruskal

# Group by neighborhood group and calculate mean availability
df_groups = listings_df.groupby('neighbourhood_group')['availability_365'].mean().sort_values(ascending=False)

# Convert mean availability to a list
mean_availability = df_groups.tolist()

# Perform Kruskal-Wallis test (rank-based, so it compares the availability
# distributions across neighbourhood groups rather than their means)
statistic, p = kruskal(*[listings_df[listings_df['neighbourhood_group'] == group]['availability_365'] for group in df_groups.index])

print("Kruskal-Wallis Test Statistic:", statistic)
print("P-Value:", p)
print(df_groups)

# Plotting the graph
df_groups.plot.bar()
plt.ylabel('Average Availability (days)')
plt.xlabel('Neighbourhood Groups')
plt.title('Average Availabilities Across Neighbourhood Groups')
plt.show()

Since the p-value is lower than the alpha value of 0.05, we can reject the null hypothesis. This means that the neighbourhood group has a statistically significant impact on the availability of the Airbnb. On average, the most available Airbnbs are located in Staten Island, and the least available Airbnbs are in Brooklyn.

This is important for the machine learning model because the model should take into account that availability differs disproportionately with the location of the Airbnb. The pricing model will weigh this as well.
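
The decision rule used for this test can be sketched on synthetic data. The example below is only an illustration under stated assumptions: the group labels `A`, `B`, `C` and the availability values are randomly generated, not taken from the listings. It also reports group medians, since the Kruskal-Wallis test works on ranks rather than means.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Synthetic stand-in for the listings: three hypothetical groups whose
# availability ranges barely overlap.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "neighbourhood_group": np.repeat(["A", "B", "C"], 50),
    "availability_365": np.concatenate([
        rng.integers(0, 120, 50),
        rng.integers(100, 250, 50),
        rng.integers(240, 366, 50),
    ]),
})

# One sample of availabilities per group.
samples = [
    group["availability_365"].to_numpy()
    for _, group in toy.groupby("neighbourhood_group")
]
statistic, p = kruskal(*samples)

# Reject the null hypothesis when the p-value falls below alpha.
alpha = 0.05
reject_null = p < alpha
medians = toy.groupby("neighbourhood_group")["availability_365"].median()

print("Kruskal-Wallis statistic:", statistic)
print("Reject null hypothesis:", reject_null)
print(medians)
```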

## **Statistical Method #2**

**Null Hypothesis ($H_{0}$) :** The neighbourhoods in Manhattan do not have a statistically significant impact on the mean prices for Entire home/apt units.

**Alternative Hypothesis ($H_{a}$) :** The neighbourhoods in Manhattan have a statistically significant impact on the mean prices for Entire home/apt units.

**Alpha-Value ($\alpha$) :** 0.05

**Confidence level:** 95%

In [None]:
from scipy.stats import f_oneway

# Filter the dataset to include only listings in Manhattan with 'Entire home/apt' room type
df_manhattan = listings_df[(listings_df['neighbourhood_group'] == 'Manhattan') & (listings_df['room_type'] == 'Entire home/apt')]

# Group by neighborhood and calculate mean prices for 'Entire home/apt' units
df_groups = df_manhattan.groupby('neighbourhood')['price'].mean().sort_values(ascending=False)
mean_prices = list(df_groups)

# Perform one-way ANOVA test to test for statistically significant differences between the mean prices of neighborhoods
statistic, p = f_oneway(*[df_manhattan[df_manhattan['neighbourhood'] == neighborhood]['price'] for neighborhood in df_groups.index])

print("P-Value:", p)

# Plot the scatter plot
plt.scatter(df_groups.index, mean_prices)
plt.xticks(rotation=90)
plt.ylabel('Average Price ($)')
plt.xlabel('Manhattan Neighbourhoods')
plt.title('Average Prices of Entire home/apt units in Manhattan Neighbourhoods')
plt.show()

As the p-value (7.9997e-22) is less than the alpha value of 0.05, we can reject the null hypothesis and conclude that there is a significant difference in the prices of Entire home/apt units between Manhattan neighbourhoods. Based on the data available, Tribeca is the most expensive Manhattan neighbourhood for Entire home/apt units, with an average price of approximately $595. These findings suggest that prices of Entire home/apt units can vary widely between the neighbourhoods of a single neighbourhood group, as they do in Manhattan.
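
One practical detail of a per-neighbourhood ANOVA is that a neighbourhood with a single listing carries no information about within-group variance. The sketch below shows one possible way to handle this, by keeping only groups with at least `min_listings` observations before calling `f_oneway`. The threshold, the neighbourhood names `N1` to `N4`, and the prices are all hypothetical, and this filtering step is not part of the analysis above.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic prices for four hypothetical neighbourhoods; N4 has one listing.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "neighbourhood": ["N1"] * 30 + ["N2"] * 30 + ["N3"] * 30 + ["N4"],
    "price": np.concatenate([
        rng.normal(150, 20, 30),
        rng.normal(300, 20, 30),
        rng.normal(600, 20, 30),
        [1000.0],
    ]),
})

# Assumed threshold: keep neighbourhoods with at least this many listings.
min_listings = 2
counts = toy["neighbourhood"].value_counts()
kept = counts[counts >= min_listings].index

# One price sample per remaining neighbourhood, then one-way ANOVA.
samples = [toy.loc[toy["neighbourhood"] == name, "price"] for name in kept]
statistic, p = f_oneway(*samples)

print("Neighbourhoods tested:", sorted(kept))
print("F-statistic:", statistic)
print("P-Value:", p)
```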

## **Statistical Method #3**

**Null Hypothesis ($H_{0}$) :** The number of reviews of an Airbnb does not have a statistically significant impact on the price of the Airbnb.

**Alternative Hypothesis ($H_{a}$) :** The number of reviews of an Airbnb has a statistically significant impact on the price of the Airbnb.

**Alpha-Value ($\alpha$) :** 0.05

**Confidence level:** 95%

In [None]:
from scipy import stats

# using a threshold of 50 reviews: listings with fewer than 50 reviews are
# regarded as having a 'Low' number of reviews, and those with 50 or more
# as having a 'High' number of reviews.

# removing outliers (only 7 listings have prices higher than 19000, and they
# may compromise the integrity of our dataset)
new_df = listings_df[listings_df["price"] < 19000.0]

low_df = new_df[new_df['number_of_reviews'] < 50]
high_df = new_df[new_df['number_of_reviews'] >= 50]

low_prices = low_df['price']
high_prices = high_df['price']

t_statistic, p_value = stats.ttest_ind(low_prices, high_prices)

print("T-statistic:", t_statistic)
print("P-value:", p_value)

# plotting the graph (palette is omitted: seaborn ignores it without a hue)
sns.scatterplot(data=new_df, x="price", y="number_of_reviews", linewidth=0.5, alpha=1)
plt.xscale('log')
plt.show()


The p-value of 1.444e-17 is extremely small, indicating strong evidence against the null hypothesis.

Given the very small p-value, much smaller than the typical significance level of 0.05, we reject the null hypothesis. Therefore, we conclude that there is a statistically significant difference in mean prices between the low and high review groups.

If there's a significant difference in mean prices between listings with low and high review counts, this finding could have implications for pricing strategy. For example, it might suggest that listings with higher review counts can command higher prices, potentially reflecting greater perceived value among customers.
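
Two details are worth checking when reading a two-sample t-test like this one: which group has the higher mean, and whether the two groups have similar variances. The sketch below uses randomly generated prices (hypothetical values that imply nothing about the real listings) to show Welch's variant of the test, selected with `equal_var=False`, which does not assume equal variances. This is an alternative to the default pooled-variance test, not the method used above.

```python
import numpy as np
from scipy import stats

# Synthetic prices for two hypothetical groups with unequal sizes and spreads.
rng = np.random.default_rng(2)
low_prices = rng.normal(200, 150, 500)
high_prices = rng.normal(150, 40, 100)

# Welch's t-test: does not assume the two groups share a variance.
t_statistic, p_value = stats.ttest_ind(low_prices, high_prices, equal_var=False)

# The sign of the t-statistic follows the sign of (first mean - second mean),
# so it shows which group has the higher average price.
mean_diff = low_prices.mean() - high_prices.mean()

print("T-statistic:", t_statistic)
print("P-value:", p_value)
print("Mean difference (low - high):", mean_diff)
```

On the real data, the sign of the printed t-statistic would indicate whether the low-review or the high-review group has the higher mean price.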

# **Initial Conclusions through Exploratory Analysis**

Building on our exploratory data analysis and basic data cleaning (the data was fairly clean to begin with), we hope to build a machine learning model that predicts the price of Airbnb listings from features such as location, property type, number of bedrooms, amenities, and historical booking data. This could help hosts optimize their pricing strategy and maximize their revenue.

Through the hypothesis testing above, we found statistically significant relationships between features in the dataset and the availability and price of a listing. This gives a positive outlook on being able to create a machine learning pricing model in the future.