# Notebook 6: Exploratory Data Analysis (EDA) & Feature Engineering

**Objective:** To understand the drivers of house prices in Bandung, identify premium neighborhoods, and prepare the dataset for Machine Learning.

**Input Data:** `df_platform_a_bandung_cleaned.csv` (Output from `05_outlier_removal`).

---

## Section 1: Feature Engineering (Price Metrics)

We cannot rely on `price` alone to compare neighborhoods. A 5 Billion IDR house might be "cheap" if it is huge, or "expensive" if it is tiny. 

We will create **Price per Square Meter (Land)** (`price_per_m2`). This is the standard metric for real estate value.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os

In [None]:
# --- 1. Load the Cleaned Data ---
# Pointing to the 'processed' directory
data_dir = os.path.join("..", "data", "processed")  # Portable across Windows/macOS/Linux
filename = "df_platform_a_bandung_cleaned.csv"
full_path = os.path.join(data_dir, filename)

try:
    print(f"Loading data from: {full_path}...")
    df_platform_a = pd.read_csv(full_path)
    print(f"✅ Data Loaded Successfully. Total Listings: {len(df_platform_a)}")
except FileNotFoundError:
    print("❌ Error: Could not find the file. Please check the path or run Notebook 05.")
    raise  # Stop here: every step below needs df_platform_a

# --- 2. Feature Engineering: Create 'Price per m2' ---
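# We use Land Size because in Indonesia, land value is the primary driver.
# Zero land sizes become NaN so the division cannot produce inf.
land_size = df_platform_a['land_size_sqm'].replace(0, np.nan)
df_platform_a['price_per_m2'] = df_platform_a['price'] / land_size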

# Helper: Create a "Juta per m2" column for easier reading/plotting
df_platform_a['price_per_m2_juta'] = df_platform_a['price_per_m2'] / 1_000_000

# --- 3. Inspect the New Feature ---
print("\n--- Price per Square Meter Summary (in Juta IDR) ---")
print(df_platform_a['price_per_m2_juta'].describe())

# --- 4. Visualize the Distribution ---
plt.figure(figsize=(12, 6))
sns.histplot(df_platform_a['price_per_m2_juta'], bins=100, kde=True, color='teal')
plt.title('Distribution of Land Price (Juta per m²)', fontsize=16)
plt.xlabel('Price per Meter (Millions IDR)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, 50)  # Zoom in on the bulk of listings; the tail above 50 is cut off
plt.grid(axis='y', alpha=0.3)
plt.show()

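**Mean vs. Median**  
- Mean = 15.5  
- Median = 13.9  
The mean sits about 12% above the median, which indicates a mild right skew: a tail of expensive listings pulls the mean upward.  
The bulk of the distribution is roughly bell-shaped, but the long right tail (up to 152 Juta/m²) means it is not strictly normal. A log transform is a common option if a model turns out to be sensitive to skew.  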
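
Comparing mean and median gives a first impression of skew; the sample skewness puts a number on it. The following is a minimal sketch on synthetic data. The lognormal sample is an assumption standing in for `price_per_m2_juta`, not the real listings.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for price_per_m2_juta (NOT the real listings):
# a lognormal sample is right-skewed, like many price distributions.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=2.6, sigma=0.5, size=5000))

skew_raw = sample.skew()          # > 0 means a longer right tail
skew_log = np.log(sample).skew()  # close to 0 after the log transform

print(f"mean={sample.mean():.2f}, median={sample.median():.2f}")
print(f"skew (raw)={skew_raw:.2f}, skew (log)={skew_log:.2f}")
```

On the real data, the same two calls would be applied to `df_platform_a['price_per_m2_juta']`.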

---

**The Standard Market**  
- 25th percentile (Q1) = 9.8 Juta/m²  
- 75th percentile (Q3) = 19.4 Juta/m²  
This interquartile range defines what a "normal" price looks like.  

The middle 50% of listings in this dataset are priced between **9.8 Juta** and **19.4 Juta per m²**.

---

**The Maximum**  
- Max = 152 Juta/m²  
This is high, but realistic for a prime location in Dago or a commercial area.  
It is not an error like the 950 Billion outlier was.  
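
A quick way to put the maximum in context is Tukey's fence rule, which flags values above Q3 + 1.5 × IQR for a closer look. The sketch below only reuses the quartiles reported above. The rule is a screening heuristic, so a flagged value may still be a genuine listing.

```python
# Quartiles and maximum reported above (Juta IDR per m²)
q1, q3 = 9.8, 19.4
max_value = 152.0

iqr = q3 - q1                     # interquartile range
upper_fence = q3 + 1.5 * iqr      # Tukey's conventional upper fence

print(f"IQR = {iqr:.1f}, upper fence = {upper_fence:.1f}")
print(f"Max exceeds fence: {max_value > upper_fence}")
```

The maximum lies well beyond the fence, so it is statistically unusual even if it is plausible for a prime location.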


## Important Notes

Price / Land ignores the value of the building. A luxury mansion on 100m² costs more than a tear-down shack on 100m². Our current metric assumes the price is driven entirely by land, which is an oversimplification.
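
To make the point concrete, here is a small sketch with two hypothetical listings on identical plots. All numbers are invented for illustration. It also computes a price per building m², which is an assumed complementary metric and not part of the analysis above.

```python
import pandas as pd

# Two hypothetical listings on identical 100 m² plots (illustrative numbers only)
houses = pd.DataFrame({
    "label": ["luxury build", "tear-down"],
    "price": [5_000_000_000, 1_200_000_000],
    "land_size_sqm": [100, 100],
    "building_size_sqm": [300, 40],
})

# Same land area, so any gap in the land-based metric comes from the building
houses["price_per_land_m2_juta"] = houses["price"] / houses["land_size_sqm"] / 1e6
houses["price_per_building_m2_juta"] = houses["price"] / houses["building_size_sqm"] / 1e6

print(houses[["label", "price_per_land_m2_juta", "price_per_building_m2_juta"]])
```

Neither ratio alone separates land value from building value; a regression that includes both sizes is one way to split the two.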

## Section 2: Basic Descriptive Statistics

In [None]:
# 1. Correlation Matrix (The "Drivers" of Price)
print("--- Calculating Correlations ---")
# Select only numeric columns of interest
numeric_cols = ['price', 'land_size_sqm', 'building_size_sqm', 'bedrooms', 'bathrooms']
correlation = df_platform_a[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation, 
    annot=True,         # Show the numbers
    cmap='coolwarm',    # Red = High Correlation, Blue = Low
    fmt=".2f",          # 2 decimal places
    linewidths=0.5
)
plt.title('Correlation Matrix: What drives the Price?', fontsize=16)
plt.show()

**1. The Big Winner: Land Size (0.81)**  
- Correlation between price and `land_size_sqm` = 0.81  
- **Interpretation:** Very strong positive correlation. Land size has the strongest linear association with price of any feature examined here.  
- **Actionable Insight:** Land size should be a core input to any machine learning model.  

---

**2. The Strong Second: Building Size (0.61)**  
- Correlation between price and `building_size_sqm` = 0.61  
- **Interpretation:** Strong, but significantly lower than land size. Buyers pay for the plot potential first, structure second.  
- **Warning (Multicollinearity):** Correlation between `land_size_sqm` and `building_size_sqm` = 0.71.  
  - This is high, meaning the variables are linked (big houses on big land).  
  - In linear regression, this can cause multicollinearity issues.  
  - However, they are distinct enough (0.71 is not 0.90) that we can likely keep both.  
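
The multicollinearity concern can be quantified with the Variance Inflation Factor (VIF). When a model contains exactly two predictors, the VIF of each reduces to 1 / (1 − r²), so the reported correlation is enough for a rough check. This is a sketch under that two-predictor assumption; adding bedrooms or bathrooms would change the value.

```python
# Correlation between land_size_sqm and building_size_sqm, as reported above
r = 0.71

# With exactly two predictors, each one's VIF reduces to 1 / (1 - r^2)
vif = 1.0 / (1.0 - r**2)
print(f"VIF = {vif:.2f}")
```

Common rules of thumb treat a VIF above 5 or 10 as problematic, so a value near 2 supports keeping both features.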

---

**3. The "Weak" Links: Rooms (0.34 & 0.32)**  
- **Bedrooms (0.34):** Surprisingly low correlation. Bedroom count tracks price far less closely than land size does.  
- **Bathrooms (0.32):** Weak correlation. Unlike some markets, bathroom count is not a strong predictor of value in Bandung compared to physical space.  

---

**Summary for Analysis**  
The market is driven primarily by physical space (land and building size), not by room count.  
- A small house with 10 bedrooms (a *Kost*) is likely worth less than a spacious house with 3 bedrooms (a luxury home).  
- The correlation (0.71) between land and building size shows they are linked, but Land Size (0.81) is clearly the dominant factor.  

**Conclusion:** Land size is the most important feature.

## Section 3: Neighborhood Ranking (The "Beverly Hills" of Bandung?)

Now we answer the key question: **Where is the land most valuable?**

We rank districts (`ADM4_EN`) by their **Median Price per m²**.
* **Metric:** Median Price (Juta/m²) is robust against outliers.
* **Filter:** We only include districts with at least **10 listings** so that each median rests on a reasonable sample rather than a handful of listings.

In [None]:
# --- Section 3: Neighborhood Ranking ---

# 1. Group by District (Kecamatan) and calculate stats
#    We calculate count (n), median, and mean
district_stats = df_platform_a.groupby('ADM4_EN')['price_per_m2_juta'].agg(
    ['count', 'median', 'mean']
).sort_values(by='median', ascending=False)

# 2. Filter for significant sample size (N >= 10)
significant_districts = district_stats[district_stats['count'] >= 10]

print(f"Ranking {len(significant_districts)} districts with significant data (N>=10)...")
print("-" * 50)

# 3. Display the Tables
print("🏆 Top 10 Most Expensive Districts (Median Juta/m²):")
print(significant_districts[['count', 'median']].head(10))

print("\n💰 Top 10 Most Affordable Districts (Median Juta/m²):")
print(significant_districts[['count', 'median']].tail(10))

# 4. Visualization: The Top 20 Bar Chart
plt.figure(figsize=(12, 8))
top_20 = significant_districts.head(20)

sns.barplot(
    x=top_20['median'], 
    y=top_20.index, 
    palette='viridis'
)

plt.title('Top 20 Most Expensive Districts in Bandung (Median Land Price)', fontsize=16)
plt.xlabel('Median Price (Juta per m²)', fontsize=12)
plt.ylabel('District (Kecamatan)', fontsize=12)
plt.grid(axis='x', alpha=0.3)
plt.show()

**Analysis: The "Beverly Hills" of Bandung?**

**To locals, this ranking might look nonsensical.** You might ask: *"Why is District X ranked lower than District Y? Everyone knows X is more elite!"*

This discrepancy highlights a critical limitation in geospatial analysis known as the **Modifiable Areal Unit Problem (MAUP)**. Here is why the data contradicts "Street Knowledge":

**1. The "District Trap" (Hidden Heterogeneity)**
Administrative borders (Kecamatan) are arbitrary and do not respect economic zones.
* **Example:** The district of **Coblong** contains the ultra-elite **Dago** area. However, it *also* contains dense, working-class neighborhoods (kampung) near the city center.
* **The Result:** When we calculate the **Median**, the hundreds of affordable homes pull the number down, "hiding" the luxury mansions. The statistic reflects the *typical* house in the district, not the *most famous* ones.

**2. The "Median" Effect**
We used the **Median** (the middle value) to avoid skew from outliers.
* While this is statistically safer, it suppresses the "Prestige Factor." A district defined by a few dozen mega-mansions (like Cidadap/Setiabudi) might rank lower than a district with consistently high-priced but smaller homes (like Bandung Wetan) because the *bulk* of its inventory is mid-range.
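
The effect is easy to reproduce on toy data. In the sketch below, both districts and all values are hypothetical: one district is mostly affordable with a small luxury pocket, the other is uniformly mid-to-high priced.

```python
import pandas as pd

# Hypothetical districts (illustrative values in Juta IDR per m², not real data)
mixed = [10.0] * 90 + [80.0] * 10      # mostly affordable, plus a luxury pocket
uniform = [14.0] * 100                 # consistently mid-to-high priced
df = pd.DataFrame({
    "district": ["Mixed"] * 100 + ["Uniform"] * 100,
    "price_per_m2_juta": mixed + uniform,
})

# Compare the median with an upper quantile that is sensitive to the luxury pocket
summary = df.groupby("district")["price_per_m2_juta"].agg(
    median="median",
    p95=lambda s: s.quantile(0.95),
    max="max",
)
print(summary)
```

Ranked by median, "Uniform" comes first; ranked by the 95th percentile, "Mixed" does. Reporting an upper quantile next to the median is one way to surface prestige pockets without giving up robustness.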



**3. Conclusion: Geography vs. Statistics**
This chart does not show where the *most expensive houses* are (absolute price); it shows where the *typical (median) land value* is highest.
* **To see the true "Elite Zones,"** we must look beyond district names and use the **Heatmap (Section 4)**, which ignores borders and spots the actual clusters of wealth.

## Section 4: Spatial Analysis (Land Value Heatmap)

Plotting every listing on a map, colored by its total asking price (the code below uses `price`, not `price_per_m2`), lets us see the "Economic Geography" of Bandung. We expect to see a "Hot Core" (North/Central) and cooler prices in the East/South.

In [None]:
# --- Section 4: Spatial Analysis ---
import geopandas as gpd

print("--- Plotting Final Verification Map (Price Heatmap) ---")

# 1. Load the Shapefile
# We use the raw shapefile to get the official district boundaries
shapefile_path = os.path.join("..", "data", "raw", "idn_admbnda_adm4_ID3_bps_20200401.shp")


try:
    gdf_adm = gpd.read_file(shapefile_path)
    
    # Filter for Kota Bandung
    bandung_gdf = gdf_adm[gdf_adm['ADM2_EN'] == 'Kota Bandung'].copy()
    bandung_gdf = bandung_gdf.to_crs("EPSG:4326") # Ensure Standard Lat/Lon
    print("Bandung map boundaries loaded.")

    # 2. Convert our dataframe to a GeoDataFrame
    # We use the cleaned 'df_platform_a' from Section 1
    gdf_listings = gpd.GeoDataFrame(
        df_platform_a, 
        geometry=gpd.points_from_xy(df_platform_a.longitude, df_platform_a.latitude), 
        crs="EPSG:4326"
    )
    print(f"✅ Loaded {len(gdf_listings)} listings for plotting.")

    # 3. Plot the Map
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.set_aspect('equal')
    
    # A. Plot Base Map (Grey Background)
    bandung_gdf.plot(
        ax=ax, 
        edgecolor='black', 
        facecolor='#dddddd', 
        label='Kota Bandung Boundary',
        zorder=1
    )
    
    # B. Plot Listings (Colored by Price)
    gdf_listings.plot(
        ax=ax, 
        column='price',         # Use price for color
        cmap='viridis_r',       # Reversed Viridis: yellow = low, dark purple = high
        markersize=15,          # Size of dots
        alpha=0.7,              # Transparency
        legend=True,
        legend_kwds={
            'label': "Price (IDR)", 
            'shrink': 0.6,
            'format': "%.0e"    # Scientific notation for cleaner legend
        },
        vmax=15_000_000_000,    # CAP visual scale at 15 Billion so normal houses show variation
        zorder=2
    )
    
    # C. Set Zoom Limits (Focus on Bandung)
    ax.set_xlim(107.55, 107.74)
    ax.set_ylim(-6.98, -6.83)
    
    # D. Formatting
    ax.set_title(f'Verification: {len(gdf_listings)} Clean Listings (Colored by Price)', fontsize=16) 
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()

except Exception as e:
    print(f"❌ An error occurred: {e}")

**Map Analysis: Price Clusters in Bandung**  

The color-coding (yellow = low prices, dark purple = high prices) highlights clear clusters of high-value properties.  
Listings above 10 billion IDR (10 Miliar) are concentrated in the northern and central-northern districts of Bandung.  

**Premium Residential Areas Identified:**  
- **Dago (Coblong/Cidadap):** Widely recognized as Bandung’s most expensive area, especially near the city core.  
- **Ciumbuleuit (Cidadap):** Known for luxury homes, apartments, and proximity to universities.  
- **Setiabudi (Setiabudi District):** A premium corridor, particularly along the main road.  
- **Northern Suburbs (Cimenyan, Ciburial Village, Lembang):** High land values driven by tourism and luxury developments.  

**Conclusion:**  
The visualization matches the known geography of wealth in Bandung: high-priced listings cluster in the north and center.  
This agreement suggests the spatial data (latitude and longitude) is broadly reliable and in line with external market reports.  


## Limitations and Possibilities

## 1. Limitations of the Current Analysis

While this dataset provides a valuable "signal" of the Bandung property market, it is important to acknowledge three critical limitations in its current state:

### A. The "Static Snapshot" Limitation (Temporal Bias)
Currently, the data represents a single window of time (September–October 2025).
* **No Velocity:** A snapshot has no movement. We cannot tell whether prices in Bandung are currently rising (appreciating) or falling (depreciating).
* **No Seasonality:** Real estate markets often fluctuate based on the time of year (e.g., higher activity during school holidays). This 2-month sample cannot capture those annual cycles.
* **Unknown "Time-on-Market":** We cannot distinguish between fresh listings and "stale" listings that have been sitting unsold for months because they are overpriced.

### B. The "Asking Price" Bias
The dataset consists entirely of **Listing Prices** (what sellers *want*), not **Transaction Prices** (what buyers actually *pay*).
* **Sentiment vs. Reality:** Asking prices represent seller aspiration. Final sales prices are typically 5–15% lower after negotiation.
* **The "Upward Skew":** Consequently, our calculated averages likely overestimate the true cost of housing in Bandung.
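
As a rough illustration of the size of this bias, the sketch below applies the 5–15% negotiation range to the median reported in Section 1. The range is taken from the text as an assumption; it was not measured from this dataset.

```python
median_asking = 13.9  # Juta IDR per m², the median reported in Section 1

# The 5-15% negotiation range quoted above is taken as an assumption here
high = median_asking * (1 - 0.05)   # after a 5% discount
low = median_asking * (1 - 0.15)    # after a 15% discount

print(f"Implied transaction median: {low:.1f} to {high:.1f} Juta/m²")
```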

### C. The "District Grouping" Issue (Spatial Granularity)
In our analysis, we grouped homes by `Kecamatan` (District).
* **Hidden Heterogeneity:** A district like *Coblong* contains both ultra-luxury villas (Dago Atas) and dense student housing. Collapsing them into a single number hides the massive variety within the district. Real estate value is often defined by specific streets, not broad administrative borders.

---

## 2. Roadmap: Future Possibilities for Improvement

To transform this project from a descriptive analysis into a predictive engine, we propose the following advancements:

### A. Longitudinal Data Collection (Time-Series)
**Goal:** Collect data consistently over a 12-month period.
* **Why it matters:** This would allow us to build a **Market Trend Index**. We could answer questions like *"Is Gedebage heating up?"* or *"Are prices in Setiabudi stagnating?"*
* **Gentrification Detection:** By tracking price changes over time, we could identify neighborhoods that are rapidly becoming expensive, spotting investment opportunities before the general market does.
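
A simple version of such an index could be built from monthly medians, rebased so the first month equals 100. The sketch below uses invented data, and `scrape_date` is an assumed column name that the current dataset may not contain.

```python
import pandas as pd

# Hypothetical multi-month scrape; `scrape_date` is an assumed column name
listings = pd.DataFrame({
    "scrape_date": pd.to_datetime([
        "2025-09-05", "2025-09-20", "2025-10-03",
        "2025-10-18", "2025-11-02", "2025-11-21",
    ]),
    "price_per_m2_juta": [13.0, 15.0, 14.0, 16.0, 15.0, 18.0],
})

# Median price per m² for each calendar month
monthly_median = (
    listings
    .groupby(listings["scrape_date"].dt.to_period("M"))["price_per_m2_juta"]
    .median()
)

# Rebase so the first month equals 100
trend_index = monthly_median / monthly_median.iloc[0] * 100
print(trend_index.round(1))
```

A raw median index like this does not control for changes in the mix of listings from month to month, so it should be read as a first approximation only.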

### B. Econometric Modeling: The Hedonic Pricing Model
**Goal:** Move beyond simple averages to a multivariate regression model that explains *why* a house costs what it costs.

**1. What is a Hedonic Model?**
In economics, the "Hedonic" theory treats a house not as a single object, but as a **"Bundle of Attributes"** (like a shopping cart full of different items).
* **The Theory:** A buyer isn't just buying "a house"; they are buying specific amounts of utility: 200m² of land, 3 bedrooms, 1 garage, and 10 minutes of saved commute time.
* **The Math:** The model uses regression analysis to assign a specific "Price Tag" (Coefficient) to each of these attributes.
    * *Formula:* $\text{Price} = (\text{Size} \times A) + (\text{Bedrooms} \times B) + (\text{Location} \times C) + \text{Error}$

**2. The Output (The "Why")**
Instead of just predicting a price, this model gives us explanatory power. It can quantify the marginal value of specific features in the Bandung market:
* *"Each additional bedroom adds **50 Million IDR** to the property value."*
* *"Each kilometer further from Gedung Sate reduces the value by **5%**."*
* *"Properties with 'Main Road Access' command a **20% premium** over those in alleys."*
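
The linear formula above can be estimated with ordinary least squares. The sketch below fits it on synthetic data generated from assumed "price tags", then checks that the regression recovers them. None of the numbers come from the Bandung listings. Percentage statements such as "5% per kilometer" would normally come from a variant that regresses the logarithm of price rather than price itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic attributes (illustrative only, not the Bandung listings)
land = rng.uniform(60, 400, n)              # land size in m²
bedrooms = rng.integers(1, 7, n)            # 1 to 6 bedrooms
dist_km = rng.uniform(0, 15, n)             # distance to a landmark in km

# "True" price tags used to generate the data (assumed values, in IDR)
true_coef = np.array([12e6, 50e6, -30e6])
price = (
    200e6
    + true_coef[0] * land
    + true_coef[1] * bedrooms
    + true_coef[2] * dist_km
    + rng.normal(0, 20e6, n)                # unexplained noise
)

# Ordinary least squares: the column of ones estimates the intercept
X = np.column_stack([np.ones(n), land, bedrooms, dist_km])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

for name, value in zip(["intercept", "land (per m²)", "bedroom", "per km"], coef):
    print(f"{name:>14}: {value / 1e6:8.1f} Juta IDR")
```

Each fitted coefficient is the estimated marginal value of one unit of that attribute, holding the others fixed, which is the interpretation the quoted statements rely on.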

**3. Implementation Requirements (What We Need)**
To build this model, we need to upgrade our data collection to capture more than just the basics:
* **Granular Location:** Calculate precise distances to key landmarks (e.g., Toll Gates, CBD, Universities, Malls).
* **Quality Attributes:** Scrape specific text details: *Does it have a swimming pool? Is it furnished? How wide is the access road (masuk mobil)?*
* **Structural Details:** Number of floors, electricity capacity (watts), and certification type (SHM vs. HGB).
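
For the distance features, a great-circle (haversine) distance from each listing's latitude and longitude is usually sufficient at city scale. The sketch below is one possible helper. The landmark coordinates are approximate and should be verified before use, and the function name is hypothetical.

```python
import math

# Approximate coordinates of Gedung Sate (assumed; verify before use)
GEDUNG_SATE = (-6.9025, 107.6187)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth (R = 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Example: a hypothetical listing due north of the landmark
print(f"{haversine_km(-6.8700, 107.6187, *GEDUNG_SATE):.2f} km")
```

Applied row by row to the `latitude` and `longitude` columns, this would yield one distance column per landmark.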

### C. Advanced Spatial Analytics (Heatmaps)
**Goal:** Abandon administrative borders in favor of **Kernel Density Estimation (KDE)**.
* **The Concept:** Instead of coloring a whole district one color, we plot a smooth "weather map" of prices.
* **The Output:** This would reveal precise "Hot Zones" (e.g., a cluster of expensive homes around a specific international school) that might otherwise be invisible on a standard district map.
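
Strictly speaking, KDE estimates where listings are dense, not how expensive they are. One way to get a smooth price surface is a kernel-weighted average of nearby prices (Nadaraya-Watson smoothing), which is the technique sketched below. The function name, bandwidth, and coordinates are all assumptions. Coordinates are taken to be projected into kilometres rather than raw latitude and longitude.

```python
import numpy as np

def kernel_smoothed_price(query_xy, listing_xy, prices, bandwidth_km=1.0):
    """Gaussian-kernel weighted average price at each query point."""
    query_xy = np.atleast_2d(query_xy)
    # Squared distance between every query point and every listing
    d2 = ((query_xy[:, None, :] - listing_xy[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2 * bandwidth_km**2))
    return (weights * prices).sum(axis=1) / weights.sum(axis=1)

# Hypothetical listings in projected coordinates (km), forming two clusters
listing_xy = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3],
                       [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
prices = np.array([30.0, 28.0, 32.0, 10.0, 11.0, 9.0])  # Juta IDR per m²

hot, cold = kernel_smoothed_price([[0.0, 0.0], [5.0, 5.0]], listing_xy, prices)
print(f"near cluster A: {hot:.1f}, near cluster B: {cold:.1f}")
```

Evaluating the function on a regular grid and passing the result to a contour plot would give the "weather map" described above. The bandwidth controls how far each listing's influence reaches and needs tuning.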