# **"How does various factors of Airbnb listings affect the price?"**

# Project 1 

## Introduction


The main aim of this project is to understand how various factors, such as property type, room type, and the number of bathrooms/bedrooms, impact the price at which listings are offered on Airbnb. Airbnb is a home-sharing plaform that allows hosts to rent out their homes. The hosts get to set the prices for their listings, under certain regulations and guidance by the platform. There are many attributes of the listings (e.g., neighbourhood, amenities, reviews) that would possibly affect the pricing, but the type of property, being the most visible and distinguishable characteristic, certainly have a significant impact on the pricing. 


The dataset, sourced from Kaggle's Airbnb Boston listing dataset, serves as the foundation for this investigation, providing a rich canvas to analyze and derive insights from the evolving landscape of urban short-term rentals. Through data analysis and visualization, we seek to identify patterns and insights that could be valuable for hosts looking to price their listings competitively. Initial findings suggest that certain features significantly affect listing prices, offering a foundation for further investigation and analysis as the project progresses.

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
df = pd.read_csv('listings.csv')

In [None]:
df.head() 

X variable: property_type, room_type, bathrooms, bedrooms

Y variable: Price


The chosen X variables—property_type, room_type, bathrooms, and bedrooms—are critical in predicting the Y variable, price, because they fundamentally influence a property's appeal and functionality to potential guests. Property and room types provide insight into the listing's nature and privacy level, factors that significantly impact guest preferences and willingness to pay. The level of privacy varies significantly depending on what type of room/property it is. For example, 'entire home' for the property type would provide more privacy, while 'shared room' for room type would provide less privacy. The number of bathrooms and bedrooms directly correlates to the accommodation's capacity and comfort, affecting its market value. These variables align with the research question by investigating how tangible property features translate into economic value within the Airbnb market. Their importance lies in offering a quantifiable measure of what guests value, thus enabling hosts to strategically position their listings in the competitive landscape of short-term rental pricing.

## Data Cleaning

In [None]:
#Data cleaning

#Dropping NA values
df.isnull().sum()  # To check for missing values
columns_to_drop_na = ['price', 'property_type', 'room_type', 'bathrooms', 'bedrooms']
data_cleaned = df.dropna(subset=columns_to_drop_na)

#Dropping duplicates
df.drop_duplicates(inplace=False)


In [None]:
#Checking if all necessary values were removed 

if df.isna().any().any():
    print("There are still NA values in the DataFrame.")
else:
    print("All NA values have been removed.")

if df.duplicated().any():
    print("There are still duplicate rows in the DataFrame.")
else:
    print("All duplicates have been removed.")


In [None]:
#data cleaning for y variable 
#some abnormalities observed while looking at the raw data
#need to convert it to a numeric format and drop rows with non-numeric values to get summary statistics 

# remove the '$' sign and ',' from the 'price' column
df['price'] = df['price'].replace('[\$,]', '', regex=True).astype(str)

# convert 'price' to numeric, setting errors='coerce' will replace non-numeric values with NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# drop rows where 'price' is NaN 
df = df.dropna(subset=['price'])

# check if everything is accurately executed
df.head()


In [None]:
# Dropping na

#drop na
unique_property_types = df['property_type'].unique()
property_types_clean = [x for x in unique_property_types if x == x]
property_types_clean

#drop na 
num_bedroom = df['bedrooms'].unique()
num_bedroom_clean = num_bedroom[~np.isnan(num_bedroom)]
num_bedroom_clean

#drop na
num_bathrooms = df['bathrooms'].unique()
num_bathrooms_clean = num_bathrooms[~np.isnan(num_bathrooms)]
num_bathrooms_clean

## Summary Statistics

In [None]:
x = df[['property_type', 'room_type', 'bathrooms', 'bedrooms']]
y = df['price']

In [None]:
# Convert 'property_type' and 'room_type' into dummies
df['is_apartment'] = (df['property_type'] == 'Apartment').astype(int)
df['is_entire_home_apt'] = (df['room_type'] == 'Entire home/apt').astype(int)

# Calculate summary statistics for x 
x = df[['is_apartment', 'is_entire_home_apt', 'bathrooms', 'bedrooms']]
summary_x = x.describe(include=[object, 'category', float, int])
print("Summary Statistics for x:\n", summary_x)

# Calculate summary statistics for y
y = df['price']
summary_y = y.describe()
print("\nSummary Statistics for y:\n", summary_y)


The analysis of summary statistics for variables pertaining to the distribution and diversity of property and room types within the dataset reveals a predominance of apartments and entire homes/apartments as the most frequently listed categories. This observation suggests a significant trend among hosts to offer their entire apartments as shared-home spaces rather than allocating merely a portion of their premises for guest accommodation. Furthermore, the average figures for bathrooms and bedrooms—ranging between 1 and 2 for each—reflect the standard size of accommodations made available on the platform. These findings are indicative of Airbnb's primary market segment, which consists of travelers or short-term renters, often in small groups or as individuals.

In relation to the price variable, the dataset exhibits a considerable range, extending from $10 to $4000, with a mean price approximately situated at $174. This variation in price points underscores the extensive diversity present within Airbnb listings, encompassing options from budget-friendly to luxury accommodations. Such diversity is pivotal to the research's objective of exploring how the features of properties influence their pricing strategies on the platform.

The confidence interval for price, with a lower bound of $85 and an upper bound of $220, hints at a broader implication. Despite the myriad uses for Airbnb spaces, a substantial majority is utilized primarily for lodging purposes. This aspect of the analysis not only enriches our understanding of the platform's market dynamics but also contributes to the nuanced examination of factors affecting pricing within the sharing economy's accommodation sector.


## Visualization 

In [None]:
#import necessary packages for visualization
! pip install -q qeds fiona geopandas xgboost gensim folium pyLDAvis descartes seaborn

import matplotlib
import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm #for linear regression: sm.ols
import seaborn as sns


from pandas_datareader import DataReader

%matplotlib inline
# activate plot theme
import qeds

In [None]:
# Bar chart to show the relationship between room type and the price 

# Calculate mean prices for room type and property type
mean_price_by_room_type = df.groupby('room_type')['price'].mean().reset_index()
mean_price_by_property_type = df.groupby('property_type')['price'].mean().reset_index()

# Create a figure with two subplots
fig, ax = plt.subplots(1, 1, figsize=(4, 4))


# Bar chart for mean price by room type
sns.barplot(x='room_type', y='price', data=mean_price_by_room_type, color='blue')
ax.set_title('Mean Price by Room Type')
ax.set_xlabel('Room Type')
ax.set_ylabel('Mean Price')

plt.tight_layout()
plt.show()


The selected variables for these visualizations are crucial in understanding pricing on Airbnb. The "Mean Price by Room Type" plot shows that entire homes/apartments command higher prices than private or shared rooms, likely due to greater privacy and space. Notably, the pricing difference between private and shared rooms is minimal, suggesting that the valuation of privacy, in this context, transcends mere spatial considerations. Instead, it points towards a nuanced understanding of privacy, emphasizing the exclusivity of certain amenities, such as bathrooms. In cases of both private and shared rooms, the commonality of shared bathrooms underlines that the premium on privacy may not be placed on sleeping quarters but rather on the accommodation at large. This observation underscores a refined differentiation in the perception of privacy, which is pivotal in shaping pricing strategies on the Airbnb platform.


In [None]:

# Calculate mean prices for property type
mean_price_by_property_type = df.groupby('property_type')['price'].mean().reset_index()

# Calculate frequency of each property type
property_type_freq = df['property_type'].value_counts().reset_index()
property_type_freq.columns = ['property_type', 'frequency']

# Merge mean price with frequency for annotations
merged_data = pd.merge(mean_price_by_property_type, property_type_freq, on='property_type')

# Create a figure with two subplots
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Bar chart for mean price by property type on the first subplot
sns.barplot(x='price', y='property_type', data=merged_data, ax=ax[0], color='blue')
ax[0].set_title('Mean Price by Property Type')
ax[0].set_xlabel('Mean Price')
ax[0].set_ylabel('Property Type')

# Annotating with mean price to the right of the bar
for index, row in merged_data.iterrows():
    ax[0].text(row['price'] + 5, index, f'${row["price"]:.2f}', va='center', ha='left', color='black')

# Histogram for the frequency of each property type on the second subplot
ax[1].barh(merged_data['property_type'], merged_data['frequency'], color='green')
ax[1].set_title('Frequency of Each Property Type')
ax[1].set_xlabel('Frequency')
ax[1].set_ylabel('Property Type')

# Annotating with frequency to the right of the bar
for index, row in merged_data.iterrows():
    ax[1].text(row['frequency'] + 1, index, f'{row["frequency"]} listings', va='center', ha='left', color='black')

# Invert y-axis to have 'Apartment' on top
ax[1].invert_yaxis()

plt.tight_layout()
plt.show()



The analysis of "Mean Price by Property Type" uncovers significant variations in pricing across different types of accommodations, with properties such as villas and houses often commanding higher prices. This suggests a guest preference for certain property characteristics, for which they are willing to pay a premium.

A limited number of listings for property types like villas, boats, and guesthouses may signal niche markets with restricted supply or demand. These types command higher average prices, indicative of their position within luxury or specialized segments, appealing to wealthier guests in search of distinctive experiences.

Despite the prevalent availability of apartment listings, their maintained high mean price points to a robust market presence. This scenario suggests that the supply, while ample, meets a strong demand that supports higher pricing levels. This condition may be symptomatic of a competitive housing market, likely situated in an urban area characterized by a high cost of living.

In [None]:
# histogram to show the relationship between price and the number of bedrooms/bathrooms

# Create a figure with two subplots for histograms
fig, ax = plt.subplots(1, 2, figsize=(8, 3))

# Histogram for the frequency of number of bedrooms
ax[0].hist(data_cleaned['bedrooms'], bins=range(int(data_cleaned['bedrooms'].max()) + 1), color='blue', alpha=0.7)
ax[0].set_title('Frequency of Number of Bedrooms')
ax[0].set_xlabel('Number of Bedrooms')
ax[0].set_ylabel('Frequency')
ax[0].set_xticks(range(int(data_cleaned['bedrooms'].max()) + 1))

# Histogram for the frequency of number of bathrooms
ax[1].hist(data_cleaned['bathrooms'], bins=range(int(data_cleaned['bathrooms'].max()) + 1), color='red', alpha=0.7)
ax[1].set_title('Frequency of Number of Bathrooms')
ax[1].set_xlabel('Number of Bathrooms')
ax[1].set_ylabel('Frequency')
ax[1].set_xticks(range(int(data_cleaned['bathrooms'].max()) + 1))

plt.tight_layout()
plt.show()

The histograms for the number of bedrooms and bathrooms show that most listings have 1 bedroom and 1 bathroom, which is reflective of the accommodation types that are most commonly available and hence, most commonly priced.

In [None]:
# bar chart using grouping method 

# adding one more x variable to better explain the covariate and such
x = df[['property_type', 'room_type', 'bathrooms', 'bedrooms', 'neighbourhood']]

grouped_data = df.groupby(['neighbourhood', 'room_type'])['price'].mean().reset_index()

# Pivot the data so that each room type is a column
pivot_data = grouped_data.pivot(index='neighbourhood', columns='room_type', values='price')

# Calculate mean price for entire home/apt or average across all types
pivot_data['mean_entire_home_apt'] = pivot_data['Entire home/apt']
pivot_data['average_price'] = pivot_data.mean(axis=1)

# Sort by mean price of entire home/apt in descending order
sorted_data = pivot_data.sort_values(by='mean_entire_home_apt', ascending=False).reset_index()

# Melt the sorted data to have room types back in long form
melted_sorted_data = sorted_data.melt(id_vars='neighbourhood', value_vars=['Entire home/apt', 'Private room', 'Shared room'])

# Create the bar plot
plt.figure(figsize=(15, 10))
sns.barplot(x='neighbourhood', y='value', hue='room_type', data=melted_sorted_data)

# Adding plot title and labels
plt.title('Mean Price by Room Type in Different Neighborhoods Sorted by Entire Home/Apt Price')
plt.xlabel('Neighborhood')
plt.ylabel('Mean Price')
plt.xticks(rotation=45)  # Rotates the x labels to make them readable

# Display the plot
plt.tight_layout()
plt.show()


The visualization above illustrates the mean price for different types of rooms across various neighborhoods in Boston. This graph is directly related to the research question that investigates the correlation between room types and pricing while considering the location variable. The clear pattern that emerges is that entire homes/apartments typically fetch higher prices across all neighborhoods, followed by private rooms, and then shared rooms. This suggests that guests place a premium on privacy and space, which is consistently valued across different areas of the city. The disparities in pricing between neighborhoods also highlight the influence of location on the perceived value of Airbnb listings.

These visualizations all together directly tie to the research question by illustrating the relationship between property characteristics and listing prices. The observed patterns suggest a clear correlation: more private and spacious properties tend to be listed at higher prices. This reinforces the hypothesis that property type and the number of bedrooms and bathrooms can significantly influence the price. Understanding these patterns is essential for hosts to price their listings competitively and for guests to make informed choices based on their preferences and budget.

# Project 2

## The message

## Mapping

In [None]:
# installing necessary packages
!pip install q qeds fiona geopandas gensim folium pyLDAvis descartes contextily

import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import contextily as ctx
import folium

from shapely.geometry import Point

%matplotlib inline
# activate plot theme
#import qeds
#qeds.themes.mpl_style();

In [None]:
zipcode = df['zipcode']
latitude = df['latitude']
longitude = df['longitude']
neighbourhood = df['neighbourhood']
df_map = pd.DataFrame({
    'zipcode': df['zipcode'],
    'latitude': df['latitude'],
    'longitude': df['longitude'],
    'neighbourhood': df['neighbourhood']
})
df_map.head()

In [None]:
df_map["Coordinates"] = list(zip(df.longitude, df.latitude))
df_map.head()

In [None]:
df_map["Coordinates"] = df_map["Coordinates"].apply(Point)
df_map.head()

In [None]:
gdf_points = gpd.GeoDataFrame(df_map, geometry="Coordinates")
gdf_points.head()

In [None]:
print('\nThe geometry column is:', gdf_points.geometry.name)

In [None]:
# Map Boston city by neighbourhoods

gdf = gpd.read_file('Census2020_BG_Neighborhoods.shp')
gdf.crs = "EPSG:3857"

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, color='white', edgecolor='black')

# Add a basemap
# The crs parameter should match the one used in your geospatial data for Boston neighborhoods.
ctx.add_basemap(ax, crs=gdf.crs)
ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor="k")

for idx, row in gdf.iterrows():
    # Get the centroid of the polygon to place the text
    centroid = row.geometry.centroid
    ax.text(centroid.x, centroid.y, row['BlockGr202'], fontsize=8, ha='center', va='center')

# Set the bounds for the plot
ax.set_xlim(-71.2, -70.9)
ax.set_ylim(42.22, 42.4)


plt.show()

In [None]:
# Map Boston city by neighbourhoods

# Load the Boston neighborhoods shapefile
gdf = gpd.read_file('Census2020_BG_Neighborhoods.shp')

# Assuming df_map is your DataFrame with points to plot, having 'lat' and 'lon' columns
# Convert df_map to a GeoDataFrame with the correct CRS to match the basemap
gdf_points = gpd.GeoDataFrame(df_map, geometry=gpd.points_from_xy(df_map['longitude'], df_map['latitude']))
gdf_points.crs = "EPSG:4326"  # Set the CRS to WGS84 if your longitude and latitude are in degrees

# Now plot the neighborhoods
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, color='white', edgecolor='black')

# Add the points from df_map on top of the neighborhood plot
gdf_points.to_crs(gdf.crs).plot(ax=ax, color='red', markersize=5)  # Convert CRS and plot

# Add a basemap 
ctx.add_basemap(ax, crs=gdf.crs)

plt.show()

### Merging a new dataset

In [39]:
# Ensure the 'price' column is treated as a string
df['price'] = df['price'].astype(str)

# Remove '$' and ',' from the 'price' column, then convert it to numeric
df['price'] = pd.to_numeric(df['price'].str.replace('$', '').str.replace(',', ''), errors='coerce')

# Now proceed with your calculation
mean_price_by_neighborhood = df.groupby('neighbourhood')['price'].mean().reset_index()


In [40]:
mean_price_by_neighborhood

Unnamed: 0,neighbourhood,price
0,Allston-Brighton,114.162088
1,Back Bay,245.457045
2,Beacon Hill,212.08046
3,Brookline,130.375
4,Cambridge,203.0
5,Charlestown,210.050633
6,Chestnut Hill,70.75
7,Chinatown,235.410256
8,Dorchester,97.451282
9,Downtown,196.5


In [41]:
gdf

Unnamed: 0,OBJECTID,neighbourhood,Shape_Leng,Shape_Area,geometry,CombinedNeighborhood
0,1,Allston,35808.619278,41547600.0,"POLYGON ((758525.831 2959265.091, 758671.805 2...",Allston-Brighton
1,2,Back Bay,18815.103609,15387240.0,"POLYGON ((771539.219 2954877.239, 771575.861 2...",Back Bay
2,3,Beacon Hill,11668.951169,7891524.0,"POLYGON ((774297.440 2956963.715, 774312.270 2...",Beacon Hill
3,4,Brighton,47051.804654,76581560.0,"POLYGON ((754177.850 2955969.986, 754151.917 2...",Allston-Brighton
4,5,Charlestown,33910.754786,51270210.0,"POLYGON ((773132.501 2968902.714, 773021.919 2...",Charlestown
5,6,Chinatown,10843.828683,3436019.0,"POLYGON ((775639.044 2953734.864, 775595.372 2...",Chinatown
6,7,Dorchester,80692.139164,219303800.0,"POLYGON ((775867.212 2944875.352, 775903.995 2...",Dorchester
7,8,Downtown,32767.370822,21590100.0,"MULTIPOLYGON (((773867.690 2953737.779, 773824...",Downtown
8,9,East Boston,79266.383121,194861800.0,"POLYGON ((790588.304 2971526.017, 790707.690 2...",East Boston
9,10,Fenway,101396.628071,42813570.0,"POLYGON ((756955.208 2961112.383, 757042.651 2...",Fenway


In [42]:
# need to change the column name to merge datasets
gdf.rename(columns={'BlockGr202': 'neighbourhood'}, inplace=True)

In [None]:
# need to change names of neighbourhoods 

In [43]:
# Merge the GeoDataFrame with the mean price DataFrame
gdf_combined = gdf_combined.merge(mean_price_by_neighborhood, on='neighbourhood', how='left')

# Check the merge result
print(gdf_combined.head())


  CombinedNeighborhood                                           geometry  \
0     Allston-Brighton  POLYGON ((754796.098 2951330.551, 754762.975 2...   
1             Back Bay  POLYGON ((771539.219 2954877.239, 771575.861 2...   
2          Beacon Hill  POLYGON ((774297.440 2956963.715, 774312.270 2...   
3          Charlestown  POLYGON ((773132.501 2968902.714, 773021.919 2...   
4            Chinatown  POLYGON ((775639.044 2953734.864, 775595.372 2...   

   OBJECTID neighbourhood    Shape_Leng    Shape_Area     price_x     price_y  
0         1       Allston  35808.619278  4.154760e+07         NaN         NaN  
1         2      Back Bay  18815.103609  1.538724e+07  245.457045  245.457045  
2         3   Beacon Hill  11668.951169  7.891524e+06  212.080460  212.080460  
3         5   Charlestown  33910.754786  5.127021e+07  210.050633  210.050633  
4         6     Chinatown  10843.828683  3.436019e+06  235.410256  235.410256  


In [45]:
gdf_combined

Unnamed: 0,CombinedNeighborhood,geometry,OBJECTID,neighbourhood,Shape_Leng,Shape_Area,price_x,price_y
0,Allston-Brighton,"POLYGON ((754796.098 2951330.551, 754762.975 2...",1,Allston,35808.619278,41547600.0,,
1,Back Bay,"POLYGON ((771539.219 2954877.239, 771575.861 2...",2,Back Bay,18815.103609,15387240.0,245.457045,245.457045
2,Beacon Hill,"POLYGON ((774297.440 2956963.715, 774312.270 2...",3,Beacon Hill,11668.951169,7891524.0,212.08046,212.08046
3,Charlestown,"POLYGON ((773132.501 2968902.714, 773021.919 2...",5,Charlestown,33910.754786,51270210.0,210.050633,210.050633
4,Chinatown,"POLYGON ((775639.044 2953734.864, 775595.372 2...",6,Chinatown,10843.828683,3436019.0,235.410256,235.410256
5,Dorchester,"POLYGON ((775867.212 2944875.352, 775903.995 2...",7,Dorchester,80692.139164,219303800.0,97.451282,97.451282
6,Downtown,"MULTIPOLYGON (((773824.886 2953491.571, 773962...",8,Downtown,32767.370822,21590100.0,196.5,196.5
7,East Boston,"POLYGON ((790588.304 2971526.017, 790707.690 2...",9,East Boston,79266.383121,194861800.0,124.059829,124.059829
8,Fenway,"POLYGON ((756955.208 2961112.383, 757042.651 2...",10,Fenway,101396.628071,42813570.0,,
9,Harbor Islands,"MULTIPOLYGON (((802826.470 2939443.967, 803277...",11,Harbor Islands,113958.523276,93237860.0,,


In [44]:
gdf_combined.crs = "EPSG:3857"
print(gdf_combined.crs)

EPSG:3857


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
# Plot the neighborhoods with prices using OrRd colormap
gdf_combined.dropna(subset=['price']).plot(column='price', ax=ax, legend=True,
                                  legend_kwds={'label': "Mean Airbnb Price by Neighborhood",
                                               'shrink': 0.5},
                                  cmap='OrRd', edgecolor="k")

# Plot the neighborhoods without prices in grey and keep the borders
gdf_combined[gdf_combined['price'].isna()].plot(ax=ax, color='grey', edgecolor="k")

# Add a basemap for context
ctx.add_basemap(ax, crs=gdf_combined.crs)

# Remove axis
ax.set_axis_off()

plt.show()

## Merging with a new dataset

In [None]:
# Load the dataset
df1 = pd.read_csv('redistricting_data_tract20_nbhd_hhpopsize_ab-1.csv')

In [None]:
print(df.columns)
print(df1.columns)

In [None]:
# need to change the column name to merge datasets
df1.rename(columns={'tract20_nbhd': 'neighbourhood'}, inplace=True)

In [None]:
merged_df = pd.merge(df, df1, on='neighbourhood', how='right')

In [None]:
print(merged_df.head())

## Conclusion


The findings from the project underscore a significant correlation between property attributes and their pricing on Airbnb in Boston. The data indicates that entire homes or apartments generally command higher prices, reinforcing the notion that space and privacy are highly valued in the short-term rental market. The variability in pricing across neighborhoods suggests that location is a crucial determinant of pricing, potentially influenced by factors such as proximity to city attractions, neighborhood safety, and local amenities.


These observations align with economic theories on goods differentiation and consumer preference, highlighting that consumers are willing to pay premium prices for goods that better satisfy their preferences—in this case, accommodation that offers more space and privacy. The study's implication for hosts on Airbnb is clear: understanding these preferences can lead to more strategic pricing and better market positioning. Future research could expand on this by exploring the impact of additional factors such as seasonal trends, special events, and the effect of reviews on pricing