# Week 1 Executive Summary: Target Definition & Geographic EDA

**Lead:** Emircan Kasap
**Scope:** Target Variable Creation, Geographic Segmentation, Class Balance Analysis
**Status:** Completed

---

## 1. Objective & Methodology
The primary objective of Week 1 was to prepare the dataset for the "Value Classification" machine learning model. This involved defining what constitutes "good value" and accounting for the fact that the data covers two geographically distinct cities.

## 2. Key Findings & Actions

### A. Geographic Segmentation (Task 1.16)
* **Discovery:** The raw dataset contained two geographically distinct clusters.
* **Action:** A hard split was applied at **Longitude -120.0**.
    * **San Francisco:** Identified as the Western cluster (longitude below -120.0).
    * **San Diego:** Identified as the Eastern cluster (longitude above -120.0).
* **Neighborhood Analysis:** Since official neighborhood labels were inconsistent, we applied **K-Means Clustering** on coordinates to create "Virtual Neighborhoods":
    * *San Francisco:* 6 Clusters (High density).
    * *San Diego:* 8 Clusters (Spread out).
* **Outcome:** Two new features, `city_label` and `neighborhood_id`, were added to capture location value.
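
The split and the "Virtual Neighborhoods" step can be sketched roughly as follows. This is a minimal, standard-library illustration and not the notebook's actual code: the function names `city_label` and `kmeans`, the sample coordinates, and the seed are all hypothetical, and the plain Lloyd's iteration below stands in for whatever K-Means implementation the analysis used (plausibly scikit-learn's `KMeans`, which is an assumption).

```python
import random

SPLIT_LONGITUDE = -120.0  # hard split reported above


def city_label(longitude):
    """Western cluster -> San Francisco, eastern cluster -> San Diego."""
    return "San Francisco" if longitude < SPLIT_LONGITUDE else "San Diego"


def sq_dist(a, b):
    """Squared Euclidean distance between two (lat, lon) tuples."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical init: k random points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


# Hypothetical (latitude, longitude) pairs, not taken from the dataset.
listings = [
    (37.77, -122.42), (37.80, -122.41), (37.76, -122.48),
    (32.72, -117.16), (32.75, -117.13), (32.85, -117.27),
]
cities = [city_label(lon) for _, lon in listings]

# Cluster each city separately; k=2 here only because the sample is tiny
# (the report used 6 clusters for San Francisco and 8 for San Diego).
sf_points = [p for p, c in zip(listings, cities) if c == "San Francisco"]
neighborhood_ids = kmeans(sf_points, k=2)
```

Clustering each city on its own keeps the cluster ids local to that city, so a real `neighborhood_id` would need to be combined with `city_label` (or offset per city) to stay unique across the full dataset.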

### B. Target Variable Definition (Task 1.13)
* **Metric:** A custom **"FP Score"** (Fair Price Score) was calculated: $$\text{FP Score} = \frac{\text{Price}}{\text{Review Rating}}$$
* **Classification:** Listings were categorized into 3 classes based on FP Score quantiles. Because the score is the price paid per rating point, a lower score means better value:
    1.  **Excellent Value:** Best 33% (lowest FP Scores; best value for money).
    2.  **Fair Value:** Middle 33%.
    3.  **Poor Value:** Worst 33% (highest FP Scores; expensive for the rating).
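
As a rough illustration of the scoring and binning, the sketch below computes the FP Score and assigns tertile-based classes using only the standard library. It assumes that lower scores mean better value (consistent with "expensive for the rating" being the poor class) and that every rating is positive; the helper names and the sample prices and ratings are hypothetical, and the notebook itself may well have used something like `pandas.qcut` instead.

```python
from statistics import quantiles


def fp_score(price, rating):
    """Fair Price Score: price paid per rating point (assumes rating > 0)."""
    return price / rating


def value_categories(scores):
    """Label each score by tertile; lower score = better value (assumption)."""
    low_cut, high_cut = quantiles(scores, n=3)  # the two tertile cut points
    labels = []
    for s in scores:
        if s <= low_cut:
            labels.append("Excellent Value")
        elif s <= high_cut:
            labels.append("Fair Value")
        else:
            labels.append("Poor Value")
    return labels


# Hypothetical (price, review rating) pairs, not taken from the dataset.
sample = [(120, 4.8), (90, 4.5), (200, 4.0), (150, 5.0), (300, 4.0), (180, 4.5)]
scores = [fp_score(price, rating) for price, rating in sample]
categories = value_categories(scores)
```

Listings with a missing or zero rating would need to be dropped or imputed before this step; the summary does not say which approach was taken.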

### C. Class Distribution Analysis (Task 1.17)
* **Analysis:** We analyzed the distribution of the new `value_category` target variable.
* **Result:** The dataset is **BALANCED**, as expected when classes are defined by equal-sized quantile bins.
* **Imbalance Ratio:** The ratio between majority and minority classes is **< 1.5**.
* **Decision:** **No SMOTE or Class Weighting is required.** The model can be trained on the standard dataset without synthetic oversampling.
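
The imbalance check can be reproduced along these lines. This is a hedged sketch: `imbalance_ratio` is a hypothetical helper, the class counts are made up for illustration, and treating 1.5 as the cut-off for "balanced" is an assumption read off the figure reported above.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up class counts, not the real distribution.
value_category = (
    ["Excellent Value"] * 340 + ["Fair Value"] * 330 + ["Poor Value"] * 330
)
ratio = imbalance_ratio(value_category)
is_balanced = ratio < 1.5  # assumed threshold
```

A ratio close to 1.0 means the classes are nearly equal in size, which is what supports skipping SMOTE and class weighting.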

---




In [None]:
# Cell: Save Report and Final Dataset (T1.18)

import os

# --- Configuration ---
OUTPUT_DIR = '../../outputs/'
DATA_DIR = '../../data/processed/'
REPORT_FILE = os.path.join(OUTPUT_DIR, 'EDA_Executive_Summary.md')
DATA_FILE = os.path.join(DATA_DIR, 'listings_with_geo_features.csv')

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

# 1. Save the Executive Summary as a Markdown file (Deliverable)
report_text = """
# Airbnb Value Classification - Week 1 EDA Report
(See the Markdown cell above for full rendered content)
... [Full text content would ideally be pasted here for file export] ...
**Conclusion:** Data is Balanced. Geographic features added. Ready for Week 2.
"""

with open(REPORT_FILE, 'w', encoding='utf-8') as f:
    f.write(report_text)
print(f"Report saved to: {REPORT_FILE}")

# 2. Save the Final Processed Data (Deliverable)
# We ensure the final dataframe (df_final_geo) is saved for Member 1/Member 4
if 'df_final_geo' in globals():
    df_final_geo.to_csv(DATA_FILE, index=False)
    print(f"Final Dataset saved to: {DATA_FILE}")
    print(f"   Shape: {df_final_geo.shape}")
    print("   Key Features Added: city_label, neighborhood_id, value_category")
else:
    print("Warning: df_final_geo not found in memory. Please run the merge cell first.")