# Practical Exam Project – Applied Machine Learning & Statistics

#### The dataset used in this analysis was obtained from the New York City Airbnb Open Data collection on Kaggle, originally compiled by Dmitry Gomonov (2019). It contains detailed listing information for Airbnb properties in New York City. The dataset was accessed via Kaggle at:

#### Dataset link: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data

## STEP 1: Imports + Data Loading (Done by Saniya Shaikh)

In [None]:


import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import NearestNeighbors

from scipy import stats
from scipy.stats import shapiro, ttest_ind, f_oneway
from scipy.stats import pearsonr

In [None]:
df = pd.read_csv("Data/AB_NYC_2019.csv")

print(df.shape)
df.head()


In [None]:
df_denoised = df.copy()

## STEP 2: Basic Data Inspection (Saniya Shaikh)

In [None]:

df.info()
df.isnull().sum()


## STEP 3: HANDLE MISSING VALUES (Done by Saniya Shaikh)

In [None]:
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)


In [None]:
df.drop(columns=['last_review'], inplace=True)


In [None]:
df['name'] = df['name'].fillna("Unknown")
df['host_name'] = df['host_name'].fillna("Unknown")


In [None]:

print("Shape of dataset:", df.shape)

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nData types:")
print(df.dtypes)

df.head()


In [None]:
df.to_csv("AB_NYC_2019_basic_cleaned.csv", index=False)

## STEP 4: OUTLIER REMOVAL (RULE-BASED) (Done by Saniya Shaikh)

In [None]:


# Remove price <= 0 or price > 1000
df = df[(df["price"] > 0) & (df["price"] <= 1000)]

# Remove unrealistic minimum_nights > 365
df = df[df["minimum_nights"] <= 365]

df.shape


## STEP 5: FEATURE ENGINEERING (Done by Saniya Shaikh)

In [None]:
# Distance from Manhattan center (rough)
MAN_LAT = 40.7589
MAN_LON = -73.9851
df["dist_manhattan"] = np.sqrt(
    (df["latitude"] - MAN_LAT) ** 2 + (df["longitude"] - MAN_LON) ** 2
)
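
The distance above is a Euclidean distance in raw degrees, which treats one degree of longitude as equal to one degree of latitude. As a hedged alternative (not what the notebook uses), a great-circle distance in kilometres can be sketched with the haversine formula. The function name `haversine_km` and the Earth-radius constant are assumptions introduced for illustration only.

```python
import numpy as np

# Reference point copied from the cell above
MAN_LAT, MAN_LON = 40.7589, -73.9851
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat, lon, ref_lat=MAN_LAT, ref_lon=MAN_LON):
    """Great-circle distance in km from (lat, lon) to the reference point."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((lon - ref_lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two made-up points, each offset by 0.1 degrees from the reference
d_north = float(haversine_km(MAN_LAT + 0.1, MAN_LON))
d_east = float(haversine_km(MAN_LAT, MAN_LON + 0.1))
print(f"0.1 deg north: {d_north:.2f} km, 0.1 deg east: {d_east:.2f} km")
```

At this latitude the same 0.1 degree offset covers less ground east-west than north-south, which the degree-based distance does not account for. The function also accepts NumPy arrays or pandas Series, so it could be applied to whole `latitude` and `longitude` columns.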


## STEP 6: Encoding Categorical Features (Done by Saniya)

In [None]:
# One-hot encode neighbourhood_group and room_type
df = pd.get_dummies(
    df,
    columns=["neighbourhood_group", "room_type"],
    drop_first=True
)

# Label encode neighbourhood (many categories)
le_neigh = LabelEncoder()
df["neighbourhood_encoded"] = le_neigh.fit_transform(df["neighbourhood"])


## STEP 7: NOISE INJECTION (Done by Saniya)

In [None]:


df_noisy = df.copy()

# 7.1 Add Gaussian noise to price (mean=0, std=0.15 * price_std)
price_std = df_noisy["price"].std()
noise = np.random.normal(loc=0, scale=0.15 * price_std, size=len(df_noisy))
df_noisy["price_noisy"] = df_noisy["price"] + noise
df_noisy["price_noisy"] = df_noisy["price_noisy"].clip(lower=1)  # avoid <=0

# 7.2 Add outliers to latitude/longitude for 5% of rows
n_outliers = int(0.05 * len(df_noisy))
outlier_idx = np.random.choice(df_noisy.index, size=n_outliers, replace=False)

df_noisy.loc[outlier_idx, "latitude"] += 0.5
df_noisy.loc[outlier_idx, "longitude"] += 0.5
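
The noise above is drawn without a fixed seed, so each run produces a different noisy dataset. A minimal sketch of a reproducible variant follows, assuming a toy price column and an arbitrary seed of 42; the helper `add_price_noise` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up)
toy = pd.DataFrame({"price": [50.0, 80.0, 120.0, 200.0, 350.0]})

def add_price_noise(frame, rng, rel_scale=0.15):
    """Return price plus Gaussian noise, clipped at 1 as in step 7.1."""
    sigma = rel_scale * frame["price"].std()
    noise = rng.normal(loc=0.0, scale=sigma, size=len(frame))
    return (frame["price"] + noise).clip(lower=1)

# Two generators created with the same seed yield identical noise
run_a = add_price_noise(toy, np.random.default_rng(42))
run_b = add_price_noise(toy, np.random.default_rng(42))
print(run_a.equals(run_b))
```

Passing the generator explicitly keeps the randomness local to the function, which makes the noisy, cleaned, and plotted results comparable between runs.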


## STEP 8: NOISE CLEANING & OUTLIER HANDLING (Done by Saniya)

In [None]:


# 8.1 Remove price outliers in noisy price using IQR
Q1 = df_noisy["price_noisy"].quantile(0.25)
Q3 = df_noisy["price_noisy"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_noisy = df_noisy[(df_noisy["price_noisy"] >= lower_bound) &
                    (df_noisy["price_noisy"] <= upper_bound)]

# 8.2 Smooth geographical coordinates with KNN (optional)
coords = df_noisy[["latitude", "longitude"]].values
nbrs = NearestNeighbors(n_neighbors=5).fit(coords)
distances, indices = nbrs.kneighbors(coords)

lat_smoothed = []
lon_smoothed = []

for idx_list in indices:
    lat_smoothed.append(coords[idx_list, 0].mean())
    lon_smoothed.append(coords[idx_list, 1].mean())

df_noisy["latitude_smooth"] = lat_smoothed
df_noisy["longitude_smooth"] = lon_smoothed

# 8.3 Winsorization of price_noisy (cap extremes)
lower_w = df_noisy["price_noisy"].quantile(0.01)
upper_w = df_noisy["price_noisy"].quantile(0.99)
df_noisy["price_clean"] = df_noisy["price_noisy"].clip(lower=lower_w, upper=upper_w)
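
The smoothing loop in step 8.2 averages each point with its nearest neighbours. Because the query points are the same points the model was fitted on, each point counts among its own five neighbours. The sketch below, on made-up coordinates, suggests that the loop can be replaced by a single NumPy indexing operation that gives the same result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Made-up coordinates scattered around the reference point used earlier
coords = np.column_stack([
    40.75 + 0.05 * rng.standard_normal(200),
    -73.98 + 0.05 * rng.standard_normal(200),
])

nbrs = NearestNeighbors(n_neighbors=5).fit(coords)
_, indices = nbrs.kneighbors(coords)

# Loop version, equivalent to step 8.2
loop_smoothed = np.array([coords[idx].mean(axis=0) for idx in indices])

# Vectorized version: coords[indices] has shape (n_points, 5, 2)
vec_smoothed = coords[indices].mean(axis=1)

print(np.allclose(loop_smoothed, vec_smoothed))
```

Column 0 of `vec_smoothed` corresponds to the smoothed latitude and column 1 to the smoothed longitude.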


In [None]:
df.to_csv("AB_NYC_2019_denoised_cleaned.csv", index=False)

## STEP 9: Comparison Plots (Done by Saniya)

In [None]:


fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original price
sns.histplot(df["price"], kde=True, ax=axes[0])
axes[0].set_title("Original Price Distribution")

# Noisy price
sns.histplot(df_noisy["price_noisy"], kde=True, ax=axes[1])
axes[1].set_title("Noisy Price Distribution")

# Cleaned price
sns.histplot(df_noisy["price_clean"], kde=True, ax=axes[2])
axes[2].set_title("Cleaned Price Distribution")

plt.tight_layout()
plt.show()


# Scatter: original coordinates vs. KNN-smoothed coordinates
plt.figure(figsize=(8, 6))
plt.scatter(df["longitude"], df["latitude"], s=5, alpha=0.3, label="Original")
plt.scatter(df_noisy["longitude_smooth"], df_noisy["latitude_smooth"],
            s=5, alpha=0.3, label="Smoothed")
plt.legend()
plt.title("Geographical Points: Original vs Smoothed")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()


## STEP 10: FEATURE SCALING (Done by Saniya)

In [None]:

scaler = StandardScaler()

numeric_cols = [
    "price_clean",
    "latitude_smooth",
    "longitude_smooth",
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "calculated_host_listings_count",
    "availability_365",
    "dist_manhattan",
]



df_scaled = df_noisy.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])

df_scaled.head()


In [None]:
df.to_csv("AB_NYC_2019_Featured.csv", index=False)

# Descriptive Statistics (Done by Sakshi Manjrekar)

### 1. Numerical Features (Done by Sakshi Manjrekar)

In [None]:
df_stats = pd.read_csv("AB_NYC_2019_denoised_cleaned.csv")

In [None]:
df_stats.head()

In [None]:
num_features = ["price", "number_of_reviews", "reviews_per_month",
                "availability_365", "minimum_nights", 
                "calculated_host_listings_count", "dist_manhattan"]

desc_stats = pd.DataFrame({
    "Mean": df_stats[num_features].mean(),
    "Median": df_stats[num_features].median(),
    "Std Dev": df_stats[num_features].std(),
    "Variance": df_stats[num_features].var(),
    "Skewness": df_stats[num_features].skew()
})

desc_stats

### Statistical Analysis Explanations (Done by Sakshi Manjrekar)

**Price**
- Highly right-skewed (2.94) → many low-cost listings with few expensive outliers driving up the mean ($141) above median ($105).

**Number of Reviews**
- Extremely right-skewed (3.68) → most listings have few reviews (median = 5), while popular properties accumulate hundreds.

**Reviews per Month**
- Very high skewness (3.30) → most listings receive minimal monthly reviews (median = 0.38), indicating low booking frequency.

**Availability (365)**
- Moderate positive skewness (0.77) → wide variance due to diverse host behavior, with median (44 days) suggesting selective availability.

**Minimum Nights**
- Extreme skewness (11.63) → most require short stays (median = 3 nights), but long-term rental outliers enforce 30+ night minimums.

**Calculated Host Listings Count**
- Extremely skewed (7.92) → most hosts manage single properties (median = 1), while commercial operators run dozens.

**Distance from Manhattan**
- Moderately right-skewed (1.30) → listings cluster near Manhattan center with low variance, showing geographic concentration.


### 2. Categorical Features (Done by Sakshi Manjrekar)

In [None]:
df_stats = pd.read_csv("AB_NYC_2019_basic_cleaned.csv")

In [None]:
print(df.columns)


In [None]:
df_stats.columns.tolist()

In [None]:
# Mode
neigh_mode = df_stats["neighbourhood_group"].mode()[0]
room_mode = df_stats["room_type"].mode()[0]

# Frequency
neigh_freq = df_stats["neighbourhood_group"].value_counts()
room_freq = df_stats["room_type"].value_counts()

# Summary table
categorical_summary = pd.DataFrame({
    "Mode": [neigh_mode, room_mode],
    "Most_Frequent_Count": [neigh_freq.max(), room_freq.max()]
}, index=["Neighbourhood_Group", "Room_Type"])

categorical_summary


In [None]:
neigh_freq, room_freq

### Explanation (Done by Sakshi Manjrekar)

Most common borough: Manhattan

Most common room type: Entire home/apt

# Correlation Analysis (Done by Sakshi Manjrekar)

In [None]:
df_stats = pd.read_csv("AB_NYC_2019_denoised_cleaned.csv")

### 1. Correlation Matrix (Done by Sakshi Manjrekar)

In [None]:
corr_features = [
    "price",
    "number_of_reviews",
    "reviews_per_month",
    "availability_365",
    "calculated_host_listings_count",
    "dist_manhattan"
]


clean_df = df_stats[corr_features].dropna()

In [None]:
corr_matrix = clean_df.corr(method="pearson")
corr_matrix

### 2. Correlation with P-Values (Done by Sakshi Manjrekar)

In [None]:
pval_matrix = pd.DataFrame(
    np.ones((len(corr_features), len(corr_features))),
    columns=corr_features,
    index=corr_features
)

for i in corr_features:
    for j in corr_features:
        r, p = pearsonr(clean_df[i].values, clean_df[j].values)
        pval_matrix.loc[i, j] = p

pval_matrix


### Correlation Analysis Explanation (Done by Sakshi Manjrekar)

- **Price vs dist_manhattan (-0.31)** → Moderate negative correlation, and the largest in magnitude among the price correlations listed here; listings closer to the Manhattan reference point tend to be priced higher.

- **Number_of_reviews vs reviews_per_month (0.59)** → Moderately strong positive correlation; listings with more total reviews tend to have higher monthly review rates, indicating consistent popularity.

- **Price vs reviews_per_month (-0.056)** → Very weak negative correlation; expensive listings are reviewed slightly less often, possibly because they are booked less frequently.

- **Calculated_host_listings_count vs availability_365 (0.23)** → Weak positive correlation; hosts managing multiple properties tend to keep them available longer, which is consistent with professional/commercial operations.

- **Calculated_host_listings_count vs price (0.13)** → Weak positive correlation; multi-property hosts charge slightly higher prices on average, which may reflect professional pricing strategies.

- **Availability_365 vs reviews_per_month (0.17)** → Weak positive correlation; listings available on more days tend to be reviewed more frequently, plausibly because they can be booked more often.

- **Price vs number_of_reviews (-0.058)** → Very weak negative correlation; expensive listings accumulate slightly fewer total reviews, possibly due to lower turnover.

### **Summary:**
Location (distance from Manhattan) shows the strongest linear association with price among the features examined, while the review metrics correlate moderately with each other but only weakly with price.
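
Pearson's r measures linear association and can be sensitive to the heavy right skew reported for several of these features. As a hedged robustness check (a different technique from the one used above), a rank-based Spearman correlation can be computed alongside it. The sketch below uses synthetic skewed data, not the Airbnb listings, and the column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Synthetic right-skewed variables with a monotonic but non-linear relationship
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
y = np.exp(0.5 * np.log(x) + rng.normal(scale=0.5, size=500))
frame = pd.DataFrame({"x": x, "y": y})

r_pearson, p_pearson = pearsonr(frame["x"], frame["y"])
r_spearman, p_spearman = spearmanr(frame["x"], frame["y"])

# pandas computes the same rank correlation for a whole matrix at once
rank_matrix = frame.corr(method="spearman")

print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

On the notebook's data, the equivalent of the matrix call would be `clean_df.corr(method="spearman")`, which could be compared cell by cell with the Pearson matrix.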


# Statistical Visualizations (Done by Sakshi Manjrekar)

In [None]:
df_stats = pd.read_csv("AB_NYC_2019_denoised_cleaned.csv")

### Plot 1. Correlation Heatmap (Done by Sakshi Manjrekar)

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    square=True
)
plt.title("Correlation Heatmap")
plt.show()

The correlation heatmap shows generally weak linear relationships among the features, with a moderate positive association between reviews per month and total reviews. Other correlations with price and location are mild, indicating that no single feature has a strong linear effect on price in this subset.

### Plot 2. Price Distribution - Histogram + KDE (Done by Sakshi Manjrekar)

In [None]:
plt.figure(figsize=(10,6))

sns.histplot(
    data=df_stats[df_stats["price"] > 0],
    x="price",
    bins=50,
    kde=True
)

plt.xscale("log")

plt.title("Price Distribution with KDE (Log Scale)")
plt.xlabel("Price (Log Scale)")
plt.ylabel("Frequency")

plt.show()


The price distribution is strongly right‑skewed, with many low‑priced listings and a long tail of high prices, even after log scaling. This visualization highlights that most Airbnb listings cluster at lower price levels, while a smaller number extend into much higher price ranges.

### Plot 3. Boxplot - Price Across Boroughs (Done by Sakshi Manjrekar)

In [None]:
df_denoised = pd.read_csv("AB_NYC_2019_denoised_cleaned.csv")
print(df_denoised.columns)

In [None]:
borough_cols = [
    "neighbourhood_group_Brooklyn",
    "neighbourhood_group_Manhattan",
    "neighbourhood_group_Queens",
    "neighbourhood_group_Staten Island"
]

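# Create borough column from encoded variables
df_denoised["borough"] = df_denoised[borough_cols].idxmax(axis=1)
df_denoised["borough"] = df_denoised["borough"].str.replace("neighbourhood_group_", "")

# get_dummies(drop_first=True) dropped the Bronx column, so rows with no
# borough flag set are Bronx listings; idxmax would otherwise label them Brooklyn
no_flag = df_denoised[borough_cols].astype(int).sum(axis=1) == 0
df_denoised.loc[no_flag, "borough"] = "Bronx"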

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))

sns.boxplot(
    data=df_denoised,
    x="borough",
    y="price"
)

plt.yscale("log")  # Important for skewed price distribution

plt.title("Price Distribution Across Boroughs (Denoised Data)")
plt.xlabel("Borough")
plt.ylabel("Price (Log Scale)")

plt.show()

The boxplot shows clear differences in Airbnb prices across boroughs, with Manhattan having the widest range and highest median prices, followed by Brooklyn. This indicates a greater prevalence of expensive listings in Manhattan and higher overall price variability across boroughs.

### Plot 4. Barplot (Done by Sakshi Manjrekar)

In [None]:
plt.figure(figsize=(10,6))

sns.barplot(
    data=df_denoised,
    x="borough",
    y="price",
    estimator="mean"
)

plt.title("Average Price by Borough")
plt.xlabel("Borough")
plt.ylabel("Average Price")

plt.show()

In [None]:
plt.figure(figsize=(10,6))

sns.countplot(
    data=df_denoised,
    x="borough"
)

plt.title("Number of Listings by Borough")
plt.xlabel("Borough")
plt.ylabel("Count of Listings")

plt.show()

The bar chart shows Manhattan with the highest average price among NYC boroughs, followed by Brooklyn and Queens, which highlights the strong effect of location on pricing. The count plot shows Brooklyn with the most listings, so listings are unevenly distributed across boroughs.

### Plot 5. Scatter Plot (Done by Sakshi Manjrekar)

In [None]:
plt.figure(figsize=(10,6))

sns.scatterplot(
    data=df_denoised,
    x="number_of_reviews",
    y="price",
    alpha=0.5
)

plt.yscale("log")  # Recommended due to skewed price

plt.title("Price vs Number of Reviews")
plt.xlabel("Number of Reviews")
plt.ylabel("Price (Log Scale)")

plt.show()

The scatter plot shows a slight negative relationship between price and number of reviews, suggesting that lower-priced listings tend to receive more reviews. However, the relationship is weak, indicating that other factors also influence pricing.

## Price Insights (Done by Sakshi Manjrekar)

The plots above show only weak linear relationships between price and the review metrics, indicating that other factors influence price.
- Location strongly impacts price: Manhattan has the highest prices.
- Room type has a strong effect: entire homes cost the most.
- Reviews per month has a slight negative relationship with price (r ≈ -0.056).
- The price distribution is positively skewed.

## Feature Importance from EDA (Done by Sakshi Manjrekar)

Based on the statistical analysis, the following recommendations apply to predictive modeling:
- Strong predictors: location, room type.
- Secondary predictors: number of reviews, availability, reviews per month.
- Consider transforming price and minimum nights due to skewness.
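
One common way to act on the skewness recommendation is a `log1p` transform. The sketch below is an illustration on synthetic log-normal "prices" (not the real listings) and only shows how the skewness statistic changes; whether the transform helps a given model is a separate question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed prices; the parameters are arbitrary
price = pd.Series(rng.lognormal(mean=4.7, sigma=0.7, size=5000), name="price")

skew_raw = price.skew()
skew_log = np.log1p(price).skew()  # log1p(x) = log(1 + x), defined at x = 0

print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
```

Predictions made on the transformed scale can be mapped back to the original units with `np.expm1`.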

# Unsupervised Learning: Airbnb Listing Clustering (Done by Hadassah Mercy)

## Objective
In this section, we apply **K-Means Clustering** to segment Airbnb listings into meaningful groups based on pricing, availability, reviews, and geographic location.

### Features Used for Clustering:
- price
- number_of_reviews
- availability_365
- minimum_nights
- latitude
- longitude

All features are scaled before clustering.


### Import required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

sns.set(style="whitegrid")


## Step 1: Select Features for Clustering
We extract numerical features relevant to pricing behavior and geographic distribution.


## Feature Selection (Select clustering features)

In [None]:
features = ['price', 'number_of_reviews', 'availability_365',
            'minimum_nights', 'latitude', 'longitude']

X = df[features]

X.head()

## Step 2: Feature Scaling

Since K-Means is distance-based, we standardize the features using StandardScaler.


In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=features)
X_scaled_df.head()


All features were standardized using StandardScaler to ensure equal contribution during distance-based clustering.


## Step 3: Determine Optimal Number of Clusters (Elbow Method)

We test k values from 2 to 10 and analyze inertia to determine the optimal number of clusters.


In [None]:
inertia = []
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot Elbow Curve
plt.figure(figsize=(8,5))
plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()


### Optimal k Selection

Although the elbow curve shows gradual improvement up to k = 6,
we select k = 4 for better business interpretability and meaningful
market segmentation (Luxury, Budget, Mid-range, Private rooms).

This choice balances model complexity and practical insights.
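
When the elbow is gradual, a second criterion can help support the choice of k. The sketch below computes the silhouette score for several k values. It is an additional check that the notebook does not perform, shown on synthetic blobs from `make_blobs` rather than on the Airbnb features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four generated groups (a stand-in for X_scaled)
X_toy, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On a dataset as large as the listings table, `silhouette_score` accepts a `sample_size` argument to keep the computation manageable.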



In [None]:
# Apply KMeans with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X_scaled)

df[['price', 'cluster']].head()


## Step 4: Cluster Profile Analysis

We calculate the mean values of features per cluster to understand their characteristics.


In [None]:
cluster_profile = df.groupby('cluster')[features].mean()
cluster_profile
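
Cluster means are easier to interpret alongside cluster sizes, and the fitted centroids can be mapped back to original units with the scaler's `inverse_transform`. The sketch below illustrates both on a made-up two-feature dataset; the variable names are placeholders and the values are not taken from the listings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Two clearly separated made-up groups of listings
toy = pd.DataFrame({
    "price": np.concatenate([rng.normal(60, 10, 100), rng.normal(300, 40, 100)]),
    "availability_365": np.concatenate([rng.normal(50, 15, 100),
                                        rng.normal(250, 30, 100)]),
})

scaler = StandardScaler()
X = scaler.fit_transform(toy)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# Centroids live in scaled space; convert them back to original units
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=toy.columns)

# Cluster sizes and per-cluster means computed directly from the raw data
sizes = pd.Series(km.labels_).value_counts().sort_index()
means = toy.groupby(km.labels_).mean()

print(centers.round(1))
print(sizes)
```

For features that were part of the clustering, the back-transformed centroids should agree with the per-cluster means, so either table can serve as the cluster profile.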

### Cluster Interpretation

Cluster 0 represents higher-priced listings with long minimum stay requirements and high availability, suggesting premium or long-term rentals.

Cluster 1 contains moderately priced listings with very low availability, indicating high demand and frequent bookings.

Cluster 2 includes highly reviewed listings with lower minimum night requirements, suggesting popular short-term rental properties.

Cluster 3 represents lower-priced listings, likely budget accommodations or backpacker friendly budget rooms.


Based on average price and geographic concentration:

- Cluster 0 → Luxury Manhattan listings (High price)
- Cluster 1 → Budget Brooklyn/Bronx listings
- Cluster 2 → Mid-range entire homes
- Cluster 3 → Smart Budget Retreats

## Step 5: Geographic Visualization of Clusters
We visualize how clusters are distributed across NYC.


### Geographic Scatter Plot

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(df['longitude'], df['latitude'],
            c=df['cluster'], cmap='viridis', alpha=0.5)

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Airbnb Listings Clusters (Geographic Distribution)')
plt.colorbar(label='Cluster')
plt.show()


## Step 6: Price Distribution by Cluster


### Boxplot

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='cluster', y='price', data=df)
plt.title('Price Distribution by Cluster')
plt.show()


## Price Distribution by Cluster

The boxplot shows clear price differences across the four clusters.

- **Cluster 0** has the highest median price and the greatest variability, indicating higher-priced listings.
- **Clusters 1 and 2** fall in the mid-price range, with moderate spread and some high-value outliers.
- **Cluster 3** has the lowest median price, representing more affordable listings.

The presence of outliers in all clusters suggests that while most listings follow typical price patterns, some premium properties exist within each group.

Overall, the clustering effectively separates listings into distinct pricing tiers.


## Step 7: Reviews vs Price by Cluster

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='number_of_reviews',
                y='price',
                hue='cluster',
                data=df,
                palette='viridis',
                alpha=0.6)

plt.title('Reviews vs Price by Cluster')
plt.show()

## Step 8: Business Insights

### Cluster 0 — Luxury Manhattan
High price listings ($300+). Likely entire homes in prime locations.

### Cluster 1 — Budget Listings
Low-priced properties (<$80), high review volume, outer boroughs.

### Cluster 2 — Mid-Range
Moderate price ($150–$250), balanced availability.

### Cluster 3 — Smart Budget Retreats
Lower-mid price ($50–$120), high availability.


In [None]:
# Convert cluster to dummy variables
df = pd.get_dummies(df, columns=['cluster'], drop_first=True)

df.head()


## Step 9: Export K-Means Clustered Dataset

We save the dataset, with the cluster assignment stored as one-hot indicator columns, to a CSV file for further analysis, reporting, or visualization in external tools.


In [None]:
output_file = "airbnb_nyc_kmeans_clustered.csv"
df.to_csv(output_file, index=False)

print(f"Dataset successfully saved as {output_file}")


# Discussion & Creativity Section

---

## Methodological Justification

### Choice of Random Forest

Random Forest was selected for price prediction due to the presence of strong **non-linear relationships** between listing features and price. Factors such as location, room type, availability, and review count interact in complex ways that are not well captured by linear models.

Key reasons for selection:
- Captures non-linear feature interactions
- Handles mixed data types effectively
- Robust to outliers
- Reduces overfitting through ensemble averaging
- Provides feature importance for interpretability

The model achieved strong predictive performance, confirming that ensemble-based tree methods are well-suited for Airbnb pricing analysis.
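
To illustrate the points above, the following is a minimal sketch of a Random Forest regressor on synthetic data. It is not the model trained in this project: the feature names are loosely borrowed from the dataset, the price rule is invented to include a non-linear interaction, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic features (values are made up)
X = pd.DataFrame({
    "dist_manhattan": rng.uniform(0, 0.3, n),
    "entire_home": rng.integers(0, 2, n),
    "availability_365": rng.integers(0, 366, n),  # unrelated to price here
})

# Invented non-linear price rule with an interaction term plus noise
decay = np.exp(-10 * X["dist_manhattan"])
y = 80 + 120 * X["entire_home"] * decay + 40 * decay + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=X.columns)

print(f"R^2 on held-out data: {r2:.3f}")
print(importances.sort_values(ascending=False))
```

The impurity-based importances sum to 1 and rank features by how much they reduce prediction error in the trees, which is the kind of interpretability referred to in the list above.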

---

### Choice of k = 4 for K-Means

The elbow method indicated that clustering improvement gradually decreases as k increases. While additional clusters slightly reduce inertia, selecting **k = 4** provides:

- Clear economic segmentation
- Practical interpretability
- Distinct pricing tiers

The four clusters represent meaningful market segments such as higher-priced listings, mid-range properties, frequently booked units, and budget accommodations. This balance between statistical evidence and business intuition makes k=4 an appropriate choice.

---

### Use of StandardScaler

Clustering algorithms such as K-Means rely on distance calculations. Since the dataset contains features with different numerical scales (e.g., price vs latitude), scaling ensures:

- Equal contribution of all features
- Prevention of dominance by high-magnitude variables
- Improved clustering accuracy

StandardScaler standardizes features to zero mean and unit variance, making distance-based algorithms more reliable.
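
The effect of scaling on distances can be seen in a small worked example. StandardScaler applies z = (x - mean) / std to each column. The three listings below are made up, chosen so that one pair differs mainly in latitude and the other mainly in price.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up listings: [price in USD, latitude in degrees]
X = np.array([
    [100.0, 40.60],
    [105.0, 40.90],  # similar price to the first, far away in latitude
    [300.0, 40.61],  # very different price, nearly the same latitude
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the price difference dominates the distance
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# Standardized units: both features contribute on a comparable scale
Z = StandardScaler().fit_transform(X)
z_01, z_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:    d(0,1) = {raw_01:.2f}, d(0,2) = {raw_02:.2f}")
print(f"scaled: d(0,1) = {z_01:.2f}, d(0,2) = {z_02:.2f}")
```

In raw units the listing with the different price appears far more distant than the one in a different location; after standardization the two distances are of similar size, so K-Means no longer groups listings by price alone.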

---

## Interpretation of Unsupervised Learning

### K-Means Clustering Insights

The K-Means clustering algorithm successfully segmented Airbnb listings into four distinct groups based on price, availability, reviews, and geographic location.

The resulting clusters revealed:

- A higher-priced segment with longer minimum stays and high availability.
- A mid-range segment with moderate pricing and balanced availability.
- A highly reviewed cluster representing popular listings with strong demand.
- A lower-priced cluster representing budget accommodations.

This segmentation demonstrates that pricing patterns in NYC are influenced not only by location but also by booking activity and listing characteristics.

---

### Geographic Patterns

The geographic visualization of clusters shows that listings naturally group into spatial zones, reflecting neighborhood-level pricing differences.

Premium listings tend to concentrate in central and high-demand areas, while lower-priced listings are more dispersed across outer boroughs. This confirms that location plays a critical role in Airbnb pricing strategy.

---

### Business Implications

The unsupervised learning approach provides actionable insights:

- Identifies pricing tiers for strategic positioning
- Highlights high-demand listing types
- Reveals spatial pricing concentration
- Supports targeted marketing strategies

Overall, clustering complements supervised learning by uncovering hidden structure in the data, enhancing both interpretability and strategic value.
