# 📊 NYC Airbnb Exploratory Data Analysis (EDA)

This notebook explores the cleaned NYC Airbnb dataset to uncover trends in pricing, neighborhood performance, and room types. These insights help inform our revenue prediction model and business interpretation.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for visualizations
plt.style.use('ggplot')
sns.set_palette("viridis")

In [2]:
# Load the cleaned dataset
df = pd.read_csv("../data/processed/airbnb_cleaned.csv")
df.head()
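
Before plotting, it can help to confirm that the loaded file actually contains the columns the later cells rely on. The sketch below is one possible check, not part of the original pipeline: the helper `missing_columns` and the list `REQUIRED_COLUMNS` are hypothetical names, the column names are taken from the plotting cells in this notebook, and the small `toy` frame holds made-up values standing in for the real CSV.

```python
import pandas as pd

# Assumption: these are the columns the plots in this notebook rely on.
REQUIRED_COLUMNS = ["neighbourhood group", "room type", "price"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# Toy frame with made-up values, standing in for the real CSV.
toy = pd.DataFrame({"neighbourhood group": ["Queens"], "price": [75]})

missing = missing_columns(toy)
print(missing)  # ['room type']
```

On the real data, the call would be `missing_columns(df)`, and an empty list would mean the plotting cells can find every column they need.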

## 1. Price Distribution by Neighborhood Group
A box plot of nightly price per borough shows how location shifts both the typical price and its spread, which informs the pricing strategy.

In [3]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='neighbourhood group', y='price', data=df)
plt.title('Price Distribution by Neighborhood Group')
plt.ylabel('Price per Night ($)')
plt.show()
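
A box plot is easier to interpret alongside the numbers it draws. As a rough sketch, the median, interquartile range, and listing count per group could be tabulated as follows. The helper `price_summary` is a hypothetical name, and the `toy` frame contains made-up values for illustration only; on the real data the call would be `price_summary(df)`.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; all values are made up.
toy = pd.DataFrame({
    "neighbourhood group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"],
    "price": [100, 200, 300, 80, 120, 60],
})


def price_summary(frame, group_col="neighbourhood group", price_col="price"):
    """Median, interquartile range, and listing count of price per group."""
    grouped = frame.groupby(group_col)[price_col]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
        "count": grouped.count(),
    })
    # Most expensive group (by median) first.
    return summary.sort_values("median", ascending=False)


summary = price_summary(toy)
print(summary)
```

The median and IQR are used here rather than the mean and standard deviation because nightly prices tend to be right-skewed, so a few very expensive listings would otherwise dominate the summary.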

## 2. Room Type Availability
This count plot shows the supply of each room type across the market, ordered from most to least common.

In [4]:
plt.figure(figsize=(10, 5))
sns.countplot(x='room type', data=df, order=df['room type'].value_counts().index)
plt.title('Market Supply: Count of Listings by Room Type')
plt.show()
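
Raw counts can be complemented with market shares, which are easier to compare across datasets of different sizes. The sketch below assumes the same `room type` column used in the plot; the `toy` frame is made-up data for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the mix of room types is made up.
toy = pd.DataFrame({
    "room type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"],
})

# value_counts(normalize=True) returns fractions sorted from most to least common.
shares = toy["room type"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data, replacing `toy` with `df` gives the percentage of listings per room type in the same order as the bars in the plot.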

## 3. Correlation Matrix
The heatmap shows pairwise Pearson correlations between the numerical features, giving a first look at what may drive revenue.

In [5]:
plt.figure(figsize=(10, 8))
# Select every numeric dtype, not only float64 and int64
numeric_cols = df.select_dtypes(include='number').columns
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
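
A full heatmap can be hard to scan when the question is about one variable. As a sketch, the correlations against a single target column can be pulled out and ranked by absolute strength. The helper `rank_correlations` is a hypothetical name, the column names `minimum nights` and `availability 365` are assumptions about the cleaned dataset, and the `toy` values are made up.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; all values are made up.
toy = pd.DataFrame({
    "price": [50, 100, 150, 200, 250],
    "minimum nights": [1, 2, 3, 4, 5],
    "availability 365": [300, 100, 250, 150, 200],
    "room type": ["Private room"] * 5,  # non-numeric, ignored below
})


def rank_correlations(frame, target):
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranked = rank_correlations(toy, "price")
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.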

## 4. Key Takeaways for Business Strategy
- **Pricing Strategy**: Price variance is highest in Manhattan and Brooklyn, which suggests these markets tolerate a wider range of prices and leave room for more aggressive pricing.
- **Supply Gap**: Entire homes dominate the 'High Revenue' segment, while private rooms are plentiful but often earn lower revenue.
- **Demand Factor**: A strong positive correlation between availability and reviews per month suggests that 'always available' listings accumulate more trust and revenue over time, although correlation alone does not establish which way the effect runs.
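
The availability and reviews relationship can be cross-checked with a rank-based (Spearman) correlation, which is less sensitive to outliers than the Pearson values in the heatmap. The sketch below computes it as a Pearson correlation on ranks, so it needs only pandas. The helper `spearman` is a hypothetical name, the column names `availability 365` and `reviews per month` are assumptions about the cleaned dataset, and the `toy` values are made up.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; all values are made up.
toy = pd.DataFrame({
    "availability 365": [10, 50, 120, 200, 365],
    "reviews per month": [0.2, 0.5, 0.4, 1.5, 3.0],
})


def spearman(frame, x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    pair = frame[[x, y]].dropna()
    return pair[x].rank().corr(pair[y].rank())


rho = spearman(toy, "availability 365", "reviews per month")
print(round(rho, 2))  # 0.9
```

If the rank correlation on the real data is much weaker than the Pearson value, a handful of extreme listings may be driving the apparent relationship.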