# 4. Question Formulation

## 4.1 Research Question 1: Unit Price Efficiency Analysis
**The Question**
How does the relationship between property size (`square_feet`) and unit price (`price_per_sqft`) vary across different BHK configurations (1, 2, 3, 4+) and area types (Super Area vs. Carpet Area)?

Specifically: Does an **"area threshold"** exist beyond which additional square footage no longer lowers the unit price (diminishing returns), and does this threshold differ between BHK groups?

**Motivation & Benefits**
* **Why this is worth investigating:** Real estate pricing is rarely linear. This analysis explores the "Core Pricing Logic," verifying whether larger properties offer economies of scale or if there is a point of diminishing returns.
* **Insights provided:** It identifies the "sweet spot" size for each apartment configuration (e.g., the optimal size for a 2BHK before it becomes overpriced relative to the market).
* **Stakeholders:**
    * **Home Buyers:** To identify "value-for-money" properties and avoid paying a premium for inefficiently large spaces.
    * **Investors:** To maximize rental yield potential by purchasing efficiently sized units.
    * **Real Estate Agents:** To justify pricing strategies to sellers based on market data.
* **Real-world decision:** This informs the decision of choosing a property that offers the best utility-to-price ratio.

# 5. Data Analysis

## 5.1 Analysis for Question 1

#### A. Preprocessing
**Written Explanation:**
To ensure a fair comparison between size and price, we need to handle statistical noise and group sparse data.
1.  **Outlier Removal:** The `price_per_sqft` column likely contains extreme outliers (e.g., data entry errors or ultra-luxury properties that skew the average). We will use the **Interquartile Range (IQR)** method to filter these out, keeping only values between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles.
2.  **BHK Grouping:** High BHK counts (5, 6, etc.) usually have very few data points. We will group all properties with 4 or more bedrooms into a single category, `4+ BHK`, so that each group has enough observations for a stable average.
3.  **Area Type Filtering:** We will focus our analysis on the two most common metrics, `Super Area` and `Carpet Area`, and exclude less common types such as `Plot Area`, which are priced differently.

In [None]:
# 1. Handling Outliers for price_per_sqft using IQR method
Q1 = df['price_per_sqft'].quantile(0.25)
Q3 = df['price_per_sqft'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_clean = df[(df['price_per_sqft'] >= lower_bound) & (df['price_per_sqft'] <= upper_bound)].copy()

# 2. Grouping BHK
def categorize_bhk(x):
    if pd.isna(x): return "Unknown"
    if x >= 4: return "4+ BHK"
    return f"{int(x)} BHK"

df_clean['bhk_group'] = df_clean['bhk'].apply(categorize_bhk)

# 3. Focus on main Area Types
target_area_types = ['Super Area', 'Carpet Area']
# .copy() so that columns added later do not trigger SettingWithCopyWarning
df_analysis = df_clean[df_clean['areaWithType'].isin(target_area_types)].copy()
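
As a quick sanity check of the preprocessing logic, the sketch below applies the same IQR fences and BHK grouping to a small made-up frame. The helper name `iqr_filter` and the toy values are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].copy()

def categorize_bhk(x):
    """Map a bedroom count to a label, pooling 4 and above into '4+ BHK'."""
    if pd.isna(x):
        return "Unknown"
    if x >= 4:
        return "4+ BHK"
    return f"{int(x)} BHK"

# Hypothetical listings: the last row is an obvious unit-price outlier
toy = pd.DataFrame({
    "price_per_sqft": [4000, 4500, 5000, 5500, 6000, 90000],
    "bhk": [1, 2, 3, 4, 6, 2],
})

toy_clean = iqr_filter(toy, "price_per_sqft")
toy_clean["bhk_group"] = toy_clean["bhk"].apply(categorize_bhk)
print(toy_clean)
```

On this toy frame the 90000 row falls outside the upper fence and is dropped, while the 4 and 6 BHK rows end up in the same `4+ BHK` group.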

#### B. Analysis
Our analytical approach focuses on visualizing the non-linear relationship between size and unit price.
* **Methodology:**
    1.  **Binning (Discretization):** We will slice the continuous `square_feet` variable into fixed-width intervals (500 sqft for the summary table, 250 sqft for the line plot). This allows us to calculate the average unit price for specific size ranges rather than looking at raw noisy data.
    2.  **Visualization - Scatter & Trend:** We will use a Scatter Plot with a regression line (Lowess smoothing) to visualize the general correlation.
    3.  **Visualization - Line Plot by Segments:** We will plot the average `price_per_sqft` against the `square_feet` bins, faceted by `bhk_group`.
* **Expected Outputs:**
    * A visual curve showing how unit price drops as size increases.
    * Identification of the "flattening point" (threshold) where the price curve stabilizes.

In [None]:
df_analysis['sqft_bin'] = pd.cut(df_analysis['square_feet'], 
                                 bins=range(0, 5000, 500), 
                                 labels=[f"{i}-{i+500}" for i in range(0, 4500, 500)])

pivot_price = df_analysis.groupby(['bhk_group', 'sqft_bin'], observed=True)['price_per_sqft'].mean().reset_index()

print("Average Price per Sqft by Size Buckets:")
display(pivot_price.head())
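
Because a bin mean based on one or two listings can be misleading, it may help to carry the listing count alongside the average. The following is a minimal sketch on made-up data; the `MIN_LISTINGS` cutoff and the toy values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical listings for two BHK groups
toy = pd.DataFrame({
    "bhk_group": ["2 BHK"] * 5 + ["3 BHK"] * 3,
    "square_feet": [600, 700, 900, 1200, 1300, 1100, 1400, 2600],
    "price_per_sqft": [8000, 7800, 7500, 7000, 6900, 7400, 7100, 12000],
})

# Fixed-width 500 sqft bins, labelled "lower-upper"
edges = list(range(0, 3500, 500))
labels = [f"{lo}-{lo + 500}" for lo in edges[:-1]]
toy["sqft_bin"] = pd.cut(toy["square_feet"], bins=edges, labels=labels)

# Mean unit price and number of listings per (group, bin)
summary = (
    toy.groupby(["bhk_group", "sqft_bin"], observed=True)["price_per_sqft"]
       .agg(mean_price="mean", n="count")
       .reset_index()
)

# Keep only bins backed by at least MIN_LISTINGS rows (assumed cutoff)
MIN_LISTINGS = 2
reliable = summary[summary["n"] >= MIN_LISTINGS]
print(reliable)
```

Here the single 2600 sqft listing forms a bin of its own and is excluded from `reliable`, so it cannot distort the curve for its group.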

In [None]:
plt.figure(figsize=(12, 6))

sns.scatterplot(
    data=df_analysis, 
    x='square_feet', 
    y='price_per_sqft', 
    hue='bhk_group',
    alpha=0.5,
    palette='viridis'
)

sns.regplot(
    data=df_analysis, 
    x='square_feet', 
    y='price_per_sqft', 
    scatter=False, 
    lowess=True,  # LOWESS smoothing, as stated in the methodology (requires statsmodels)
    color='black', 
    line_kws={"linestyle": "--", "label": "Overall Trend"}
)

plt.title('Relationship between Property Size and Unit Price', fontsize=15, fontweight='bold')
plt.xlabel('Property Size (sqft)', fontsize=12)
plt.ylabel('Price per Sqft (₹)', fontsize=12)
plt.legend(title='BHK Configuration')
plt.xlim(0, 8000)

plt.tight_layout()
plt.show()

# ==============================================================================
# VISUALIZATION 2: BINNED AVERAGE PRICE (IMPORTANT FOR FINDING THE THRESHOLD)
# Purpose: find the specific "break" point (threshold) for each BHK group
# ==============================================================================

# 1. Create size intervals (binning), e.g. one bin per 250 sqft
bins = range(0, 5000, 250)
labels = [f"{i}" for i in bins[:-1]] # Each label is the lower bound of its bin
df_analysis['sqft_bin_val'] = pd.cut(df_analysis['square_feet'], bins=bins, labels=labels)

# Convert the labels to numbers so the X axis is plotted in the correct order
df_analysis['sqft_bin_val'] = pd.to_numeric(df_analysis['sqft_bin_val'])

plt.figure(figsize=(14, 7))

# Draw a line plot of the average price in each size bin
sns.lineplot(
    data=df_analysis,
    x='sqft_bin_val',
    y='price_per_sqft',
    hue='bhk_group',
    style='bhk_group',
    markers=True,
    dashes=False,
    palette='viridis',
    linewidth=2.5,
    errorbar=None # Turn off error bars to reduce clutter
)

plt.title('Unit Price Efficiency Curves by BHK Group', fontsize=15, fontweight='bold')
plt.xlabel('Property Size (sqft) - Binned', fontsize=12)
plt.ylabel('Average Price per Sqft (₹)', fontsize=12)
plt.grid(True, which='both', linestyle='--', alpha=0.7)

# Add an annotation or reference line if needed (e.g., a threshold at 2000 sqft)
# plt.axvline(x=2000, color='red', linestyle=':', label='Potential Threshold')

plt.legend(title='BHK Configuration', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
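
The curves can be read by eye, but a rough numeric estimate of the "flattening point" is also possible. The sketch below is one possible heuristic, not an established method: it reports the first bin at which the mean unit price changes by less than a tolerance relative to the previous bin. The helper `find_flattening_point`, the 2% tolerance, and the toy curve are all assumptions for illustration.

```python
import pandas as pd

def find_flattening_point(binned, tol=0.02):
    """Return the first bin lower edge where the mean unit price changes by
    less than `tol` (as a fraction) relative to the previous bin.
    Returns None if the curve never flattens."""
    binned = binned.sort_index()
    # Absolute relative change between consecutive bins (first entry is NaN)
    rel_change = (binned.diff() / binned.shift()).abs()
    flat = rel_change[rel_change < tol]
    return flat.index[0] if not flat.empty else None

# Made-up binned averages for one BHK group: index = bin lower edge (sqft)
toy_curve = pd.Series(
    [9000.0, 8000.0, 7200.0, 6800.0, 6750.0, 6740.0],
    index=[500, 750, 1000, 1250, 1500, 1750],
)

threshold = find_flattening_point(toy_curve)
print(threshold)
```

With the real data, the same helper could be applied per group to the mean of `price_per_sqft` indexed by `sqft_bin_val`. The result depends heavily on the chosen tolerance and bin width, so it should be treated as a starting point for inspection rather than a definitive threshold.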