# Semester Project: NYC Airbnb Price Prediction & Analytics

## Overview
This project analyzes the **New York City Airbnb Open Data (2019)** dataset. 
Our goals are:
1.  **Exploratory Data Analysis (EDA)**: Understand the distribution of listings, prices, and host behaviors.
2.  **Geospatial Analysis**: Visualize listings on a map of NYC.
3.  **Machine Learning**: Build a regression model to predict the listing price based on features like location, room type, and reviews.

## Dataset
- **Source**: [AB_NYC_2019.csv](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
- **Size**: ~49,000 listings.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import joblib

# Plot Style
sns.set_theme(style="whitegrid")
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## 1. Data Acquisition & Cleaning
Downloading the dataset and handling missing values.

In [None]:
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv"
csv_path = "AB_NYC_2019.csv"

# Download if not exists
if not os.path.exists(csv_path):
    print(f"Downloading dataset from {url}...")
    urllib.request.urlretrieve(url, csv_path)
    print("Download complete.")
else:
    print("Dataset already exists locally.")

df = pd.read_csv(csv_path)
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Handling Missing Values
# reviews_per_month is NaN if number_of_reviews is 0. Fill with 0.
df.fillna({'reviews_per_month': 0}, inplace=True)

# name and host_name contain a few NaNs; drop those rows since they are a small share of the data.
df.dropna(subset=['name', 'host_name'], inplace=True)

# Drop columns not needed for prediction (id, host_name, last_review)
df_clean = df.drop(['id', 'host_name', 'last_review'], axis=1)

# Price filtering: keep listings with 0 < price < $1000.
# This removes $0 listings and extreme high-price outliers (rows are dropped, not capped).
df_clean = df_clean[(df_clean['price'] > 0) & (df_clean['price'] < 1000)]

print(f"Shape after cleaning: {df_clean.shape}")
df_clean.isnull().sum()
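
The $1000 cutoff above is a fixed threshold. As a hedged alternative, the sketch below derives the upper bound from a quantile of the positive prices instead. The helper name `filter_price_by_quantile`, the `upper_q` parameter, and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def filter_price_by_quantile(frame, column="price", upper_q=0.99):
    """Keep rows with 0 < price <= the upper_q quantile of the positive prices.

    Hypothetical helper for illustration; returns the filtered frame and the cap.
    """
    positive = frame[frame[column] > 0]
    cap = positive[column].quantile(upper_q)
    return positive[positive[column] <= cap], cap


# Toy data standing in for df_clean: one free listing and one extreme outlier.
toy = pd.DataFrame({"price": [0, 50, 80, 120, 150, 200, 10000]})
filtered, cap = filter_price_by_quantile(toy, upper_q=0.9)
print(f"cap={cap:.1f}, rows kept={len(filtered)}")
```

A quantile-based cap adapts to the data, while a fixed cutoff is easier to explain in a report; which one suits the project is a judgment call.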

## 2. Exploratory Data Analysis (EDA)
Visualizing trends and distributions.

In [None]:
# 1. Price Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df_clean['price'], bins=50, kde=True, color='purple')
plt.title('Distribution of Airbnb Prices (Filtered < $1000)')
plt.xlabel('Price ($)')
plt.show()
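
Listing prices are typically right-skewed, with a long tail of expensive listings. One common option, not applied in this notebook, is to work with `log1p(price)` and map predictions back with `expm1`. The sketch below uses a small made-up price array (an assumption for illustration) and a hand-written `skewness` helper to show the effect.

```python
import numpy as np

# Made-up prices with one expensive listing, standing in for df_clean['price'].
prices = np.array([40.0, 60.0, 75.0, 100.0, 150.0, 250.0, 900.0])

log_prices = np.log1p(prices)      # log(1 + price) compresses the long right tail
recovered = np.expm1(log_prices)   # inverse transform, back to dollars


def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print(f"skew(price)      = {skewness(prices):.2f}")
print(f"skew(log1p price) = {skewness(log_prices):.2f}")
```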

In [None]:
# 2. Listings by Neighbourhood Group
plt.figure(figsize=(10, 6))
sns.countplot(data=df_clean, x='neighbourhood_group', palette='viridis', order=df_clean['neighbourhood_group'].value_counts().index)
plt.title('Number of Listings by Borough')
plt.xlabel('Borough')
plt.show()

In [None]:
# 3. Price by Room Type and Borough
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_clean, x='neighbourhood_group', y='price', hue='room_type', palette='Set2')
plt.title('Price Distribution by Borough and Room Type')
plt.xlabel('Borough')
plt.ylabel('Price ($)')
plt.ylim(0, 500) # Zoom in for clarity
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.show()

In [None]:
# 4. Geospatial Map (Latitude vs Longitude)
plt.figure(figsize=(10, 8))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=df_clean, 
                palette='bright', alpha=0.5, s=20)
plt.title('Map of NYC Listings')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(loc='upper left')
plt.show()

In [None]:
# 5. Correlation Matrix
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
corr = df_clean[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()

## 3. Machine Learning: Price Prediction
We train a model to predict `price` from location, room type, and review features.
**Approach**: tree-ensemble regressors (Random Forest vs Gradient Boosting). Only the Random Forest is trained and evaluated in the cells that follow.
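
A minimal sketch of how the Random Forest vs Gradient Boosting comparison could be set up. It runs on synthetic data so that it is self-contained; `X_demo`, `y_demo`, and the hyperparameters are assumptions for illustration, and the numbers it prints say nothing about performance on the Airbnb data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): a nonlinear target plus noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(500, 3))
y_demo = 100 + 200 * X_demo[:, 0] + 50 * X_demo[:, 1] ** 2 + rng.normal(0, 5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit both models on the same split and compare test RMSE.
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_model[name] = float(np.sqrt(mse))
    print(f"{name} RMSE: {rmse_by_model[name]:.2f}")
```

Using the same train/test split for both models keeps the comparison fair; on the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the notebook would take the place of the synthetic arrays.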

In [None]:
# Feature Engineering
# Encode Categoricals
le_group = LabelEncoder()
le_room = LabelEncoder()
le_neigh = LabelEncoder()

ml_df = df_clean.copy()
ml_df['neighbourhood_group'] = le_group.fit_transform(ml_df['neighbourhood_group'])
ml_df['room_type'] = le_room.fit_transform(ml_df['room_type'])
ml_df['neighbourhood'] = le_neigh.fit_transform(ml_df['neighbourhood'])

# Drop the free-text 'name' and the high-cardinality 'host_id' (potentially useful, but hard to use in a simple model)
X = ml_df.drop(['price', 'name', 'host_id'], axis=1)
y = ml_df['price']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training Shape: {X_train.shape}")

# Model: Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest RMSE: {np.sqrt(mse):.2f}")
print(f"Random Forest R2 Score: {r2:.4f}")
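
To make the two reported metrics concrete, here is a small sketch that computes RMSE and R2 from their definitions in plain Python. The function names `rmse` and `r2` and the four example prices are illustrative assumptions; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices and predictions, each off by $10.
y_true_demo = [100.0, 150.0, 200.0, 250.0]
y_pred_demo = [110.0, 140.0, 190.0, 260.0]

print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.2f}")
print(f"R2:   {r2(y_true_demo, y_pred_demo):.4f}")

# A model that always predicts the mean price scores R2 = 0.
mean_baseline = [sum(y_true_demo) / len(y_true_demo)] * len(y_true_demo)
print(f"Baseline R2: {r2(y_true_demo, mean_baseline):.4f}")
```

RMSE is in dollars, so it reads directly as a typical prediction error, while R2 compares the model against the constant mean-price baseline.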

In [None]:
# Feature Importance
features = X.columns
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices], y=features[indices], palette='viridis')
plt.title('Feature Importance (Random Forest)')
plt.show()

In [None]:
# Save Model for Dashboard
# We'll save the model and the label encoders
artifacts = {
    'model': rf,
    'le_group': le_group,
    'le_room': le_room,
    'le_neigh': le_neigh
}
joblib.dump(artifacts, 'nyc_airbnb_model.pkl')
print("Model artifacts saved to nyc_airbnb_model.pkl")
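
A hedged sketch of how a dashboard might load such an artifact bundle and score a new listing. To stay self-contained it trains a tiny stand-in model on made-up rows with only two features; the file name `toy_model.pkl`, the `feature_cols` list, and the toy data are assumptions. The real bundle would need every feature column in the same order as `X`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the training step (assumption: two features only).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "minimum_nights": [1, 3, 1, 2],
    "price": [60, 180, 35, 220],
})
le_room_demo = LabelEncoder()
toy["room_type"] = le_room_demo.fit_transform(toy["room_type"])
feature_cols = ["room_type", "minimum_nights"]
model_demo = RandomForestRegressor(n_estimators=10, random_state=42)
model_demo.fit(toy[feature_cols], toy["price"])

# Save model and encoder together, mirroring the artifacts dict above.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
joblib.dump({"model": model_demo, "le_room": le_room_demo}, path)

# Dashboard side: load the bundle, encode the raw input, then predict.
loaded = joblib.load(path)
new_listing = pd.DataFrame({"room_type": ["Private room"], "minimum_nights": [2]})
new_listing["room_type"] = loaded["le_room"].transform(new_listing["room_type"])
predicted_price = float(loaded["model"].predict(new_listing[feature_cols])[0])
print(f"Predicted price: ${predicted_price:.2f}")
```

Note that `LabelEncoder.transform` raises a `ValueError` for categories it did not see during fitting, so a dashboard would likely need to restrict inputs to known values or handle that error.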