# Project Summary

### Key Findings
Based on our analysis of 4,500+ real estate listings in Surat, we derived the following core insights:

1.  **The "Floor Premium" Paradox (Market Split):** We discovered a distinct behavioral split between the *Primary* and *Secondary* markets. In **New Properties**, there is a strong positive correlation between floor height and price (the "Developer Premium" for views/status). However, in the **Resale Market**, this premium vanishes, with buyers showing a preference for mid-to-lower floors, likely prioritizing convenience and accessibility over vertical status.
2.  **Diminishing Returns are Area-Dependent:** The "bigger is cheaper per sqft" rule is not universal. While **Carpet Area** listings for 2-3 BHKs show clear diminishing returns (price/sqft drops as size increases), **Super Area** listings often maintain or increase their unit price as size grows. This suggests that large "Super Area" values act as a proxy for luxury projects with high common-area loading.
3.  **Locality quantifies the "Status Tax":** Using Hedonic Regression, we isolated the price effect of location. Areas like **Vesu** and **New Citylight** command a statistically significant **30-60% premium** over the market baseline after controlling for the property's size and furnishing status. Conversely, areas like **Olpad** trade at a ~40% discount.
4.  **The Volatility of Luxury Resale:** Our dispersion analysis (IQR) revealed that **4+ BHK Resale properties** are the most volatile segment in Surat. While new luxury projects have standardized pricing, large resale homes show much wider price dispersion, indicating a highly subjective market where valuation is difficult and negotiation is key.
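
The dispersion comparison in finding 4 can be sketched as follows. This is a minimal illustration, not our exact analysis code: the segment labels, the prices, and the choice to divide the IQR by the median (so segments at different price levels are comparable) are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median, quantiles


def relative_iqr(prices):
    """Return the interquartile range divided by the median (unit-free spread)."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return (q3 - q1) / median(prices)


def dispersion_by_segment(listings):
    """Group (segment, price) pairs and compute the relative IQR per segment."""
    groups = defaultdict(list)
    for segment, price in listings:
        groups[segment].append(price)
    # Skip tiny segments: quartiles from fewer than 4 listings are not informative.
    return {seg: relative_iqr(p) for seg, p in groups.items() if len(p) >= 4}


# Made-up prices (in lakh), for illustration only.
sample = [
    ("4+ BHK New", 250), ("4+ BHK New", 260),
    ("4+ BHK New", 270), ("4+ BHK New", 280),
    ("4+ BHK Resale", 120), ("4+ BHK Resale", 200),
    ("4+ BHK Resale", 320), ("4+ BHK Resale", 500),
]
dispersion = dispersion_by_segment(sample)
```

A segment with a larger relative IQR has a wider middle-50% price band relative to its typical price, which is the sense in which we call a segment "volatile".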

**Most Interesting Discovery:**
The most surprising finding came from the **Cluster Analysis (Q5)**, which segmented the market not by price, but by verticality. We found a distinct "Mid-rise Economy" segment (buildings topping out at 13-14 floors) that is clearly separated from the "Large-format Luxury" segment. This suggests that Surat’s zoning or development regulations likely create specific "building classes" that influence price more than the number of bedrooms alone.
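
To show how a clustering step can separate buildings by height, here is a from-scratch, one-dimensional K-Means sketch. It is only an illustration: the real analysis used several features, and the `floors` values, the deterministic initialisation, and the function name `kmeans_1d` are assumptions made for this example.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on a list of numbers; returns (centroids, inertia)."""
    data = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: total squared distance from each point to its nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia


# Invented "total floors" values with three visible groups.
floors = [4, 5, 5, 6, 13, 13, 14, 14, 22, 24, 25]
inertias = {k: kmeans_1d(floors, k)[1] for k in range(1, 6)}
centroids, _ = kmeans_1d(floors, 3)
```

Plotting `inertias` against `k` gives the elbow plot: the value of `k` after which inertia stops falling sharply is the usual candidate for the number of clusters.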

---

### Limitations
*   **Dataset Limitations:**
    *   **Asking vs. Sold Price:** The data represents *listing prices* (expectations), not final *transaction prices* (reality). The "Resale" volatility we observed may partly reflect unrealistic seller expectations rather than true market dispersion.
    *   **Static Snapshot:** The data is from a single time period (~2023). We cannot account for inflation, interest rate changes, or seasonality.
    *   **Description Data:** We dropped the `description` column (30% missing) because processing its free text was beyond our scope, potentially losing valuable details about amenities (e.g., "Gym", "Pool") that drive price.

*   **Analysis Limitations:**
    *   **Locality Extraction:** We used Regex to extract localities from text strings. This may have missed micro-markets or grouped distinct neighborhoods (e.g., "Vesu" vs. "Vesu Main Road") too broadly.
    *   **Confounding Variables:** The dataset has no "building age" or "builder reputation" fields, so our models cannot control for them, even though both are major drivers of price in India.

*   **Scope Limitations:**
    *   The analysis is strictly limited to **Surat**. The trends (like floor premiums) may differ substantially in a city like Mumbai (where space is scarcer) or Hanoi.

---

### Future Directions (If We Had More Time)
*   **Alternative Methods (NLP):** Instead of dropping the `description` column, we would use **TF-IDF** keyword features or **BERT** embeddings to capture details such as "Garden facing" or "Italian marble". We hypothesize this would significantly improve the $R^2$ of our price prediction model.
*   **Geospatial Analysis:** We would use a geocoding API (like Google Maps) to convert locality names into **Latitude/Longitude**. This would allow us to calculate "Distance to City Center" or "Distance to Airport" features, providing a continuous location metric rather than categorical labels.
*   **External Data:** Integrating **Government Guidance Values (Jantri Rates)** would allow us to compare market rates vs. official valuations to identify overheated zones.
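
As a rough sketch of the geospatial idea, the haversine formula turns two latitude/longitude pairs into a great-circle distance. The coordinates below are unverified placeholders, and `CITY_CENTER`, `locality_coords`, and `haversine_km` are hypothetical names; in practice the coordinates would come from a geocoding service.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Placeholder coordinates (assumed, not geocoded).
CITY_CENTER = (21.17, 72.83)
locality_coords = {
    "Locality A": (21.14, 72.77),
    "Locality B": (21.28, 72.70),
}

# One continuous "distance to city center" feature per locality.
distance_to_center = {
    name: haversine_km(lat, lon, *CITY_CENTER)
    for name, (lat, lon) in locality_coords.items()
}
```

The resulting distances could replace or complement the categorical locality labels in the regression and prediction models.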

---

# Individual Reflections


### Focus: Data Cleaning & Q1/Q4
*   **Challenges & Difficulties Encountered:**
The biggest obstacle was the **uncleaned state of the columns**. Features like `floor` contained mixed data ("Ground out of 5", "Basement"), and `price` mixed units ("Cr" vs "Lac"). Writing a robust Regex function that handled every edge case without crashing was technically challenging. Conceptually, deciding how to handle the "Call for Price" entries (whether to impute or drop) required careful thought to avoid biasing the distribution.
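
The kind of parsing described above can be sketched like this. It is a simplified illustration rather than the project's actual cleaning function: the helper names, the exact patterns, and the choice to encode "Basement" as `-1` and "Ground" as `0` are assumptions.

```python
import re

PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(Cr|Lac)", re.IGNORECASE)
UNIT_TO_RUPEES = {"cr": 10_000_000, "lac": 100_000}

FLOOR_RE = re.compile(
    r"^(ground|basement|\d+)(?:\s+out\s+of\s+(\d+))?$", re.IGNORECASE
)
NAMED_FLOORS = {"ground": 0, "basement": -1}  # assumed encoding


def parse_price(text):
    """Return the price in rupees, or None for entries like 'Call for Price'."""
    match = PRICE_RE.search(text or "")
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_RUPEES[match.group(2).lower()]


def parse_floor(text):
    """Return (floor, total_floors); either value is None when it cannot be read."""
    match = FLOOR_RE.match((text or "").strip())
    if match is None:
        return None, None
    level = match.group(1).lower()
    floor = NAMED_FLOORS.get(level)
    if floor is None:
        floor = int(level)  # numeric floor such as "3"
    total = int(match.group(2)) if match.group(2) else None
    return floor, total
```

Returning `None` instead of raising keeps the cleaning loop from crashing on unexpected strings and leaves the impute-or-drop decision to a later, explicit step.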

*   **Learning & Growth:**
I learned that **preprocessing is the bulk of Data Science work** (the often-quoted "80%"). I improved my skills in **Regular Expressions** significantly. I was surprised by how much the "Floor Effect" differed between New and Resale properties. It taught me that data often reveals human behaviors (like valuing views vs. convenience) that we wouldn't assume initially.

### Focus: Statistical Analysis Q2/Q3
*   **Challenges & Difficulties Encountered:**
My main challenge was **analytical depth**. For Question 2 (Locality Premium), simply calculating the average price per locality was misleading because some areas just happened to have bigger houses. Overcoming this required learning about **Hedonic Regression** to control for variables like size and BHK. Implementing **Levene’s Test** for Q3 was also new to me, as I had to ensure the data met the assumptions of the test.
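
A stripped-down version of the hedonic idea is sketched below. It assumes a log-log model with one intercept per locality and a single size control, `log(price) = alpha_locality + beta * log(area)`, whereas the actual regression also controlled for BHK. The data is synthetic and the locality names are hypothetical.

```python
from collections import defaultdict
from math import exp, log


def hedonic_premiums(listings, baseline):
    """listings: (locality, area_sqft, price) tuples.

    Returns (beta, premiums), where premiums maps each locality to its
    price premium relative to the chosen baseline locality.
    """
    groups = defaultdict(list)
    for locality, area, price in listings:
        groups[locality].append((log(area), log(price)))

    # Demeaning within each locality removes its intercept, so the pooled
    # slope equals the OLS coefficient from a locality-dummy regression.
    means = {}
    sxy = sxx = 0.0
    for loc, rows in groups.items():
        mx = sum(x for x, _ in rows) / len(rows)
        my = sum(y for _, y in rows) / len(rows)
        means[loc] = (mx, my)
        sxy += sum((x - mx) * (y - my) for x, y in rows)
        sxx += sum((x - mx) ** 2 for x, _ in rows)
    beta = sxy / sxx

    intercepts = {loc: my - beta * mx for loc, (mx, my) in means.items()}
    base = intercepts[baseline]
    premiums = {loc: exp(a - base) - 1 for loc, a in intercepts.items()}
    return beta, premiums


def synthetic_price(multiplier, area):
    """Synthetic price: locality multiplier times a size effect (elasticity 0.9)."""
    return multiplier * 5000 * area ** 0.9


listings = (
    [("baseline_area", a, synthetic_price(1.0, a)) for a in (800, 1200, 1600)]
    + [("premium_area", a, synthetic_price(1.4, a)) for a in (1500, 2000, 2500)]
    + [("discount_area", a, synthetic_price(0.6, a)) for a in (600, 900)]
)
beta, premiums = hedonic_premiums(listings, baseline="baseline_area")
```

In this synthetic data the premium locality also has larger homes, so a raw average price would overstate its premium; the regression recovers the locality multipliers built into `synthetic_price` because size is held fixed.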

*   **Learning & Growth:**
I learned that **summary statistics (mean/median) can lie**. Controlling for confounding variables is crucial for real-world insights. This project shaped my understanding of data science by showing me that coding is the easy part; *formulating the right question* and choosing the right statistical test is where the real value lies.

### Focus: Modeling & Clustering Q5/Q6
*   **Challenges & Difficulties Encountered:**
The specific obstacle I faced was **Data Leakage**. In the price prediction model (Q6), I initially included `price_per_sqft` as a feature, which gave a perfect (but fake) score because `price_per_sqft` is itself computed from the target price. I had to restructure the pipeline to ensure we only predicted based on structural attributes. For Clustering (Q5), the "Elbow Plot" was ambiguous to interpret, and assigning meaningful names to the clusters (e.g., "Mid-rise Economy") required deep domain interpretation of the boxplots.
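
The leakage problem can be demonstrated with a small check. This is a sketch under assumptions: only `price_per_sqft` comes from our dataset description, while `area_sqft`, `bhk`, `floor`, and both helper functions are hypothetical names used for illustration.

```python
from math import isclose

# The target plus any column computed from it must never reach the model.
LEAKY_COLUMNS = {"price", "price_per_sqft"}


def reproduces_target(rows, col_a, col_b, target="price", rel_tol=1e-6):
    """True if col_a * col_b matches the target on every row (a leakage signal)."""
    return all(
        isclose(r[col_a] * r[col_b], r[target], rel_tol=rel_tol) for r in rows
    )


def split_features_target(row, target="price", leaky=LEAKY_COLUMNS):
    """Return (features, target_value) with all leaky columns removed."""
    features = {k: v for k, v in row.items() if k not in leaky}
    return features, row[target]


# Invented rows for illustration.
rows = [
    {"area_sqft": 1200, "bhk": 3, "floor": 4,
     "price": 6_000_000, "price_per_sqft": 5000},
    {"area_sqft": 800, "bhk": 2, "floor": 1,
     "price": 3_200_000, "price_per_sqft": 4000},
]

leak_detected = reproduces_target(rows, "price_per_sqft", "area_sqft")
features, target = split_features_target(rows[0])
```

When `leak_detected` is true, a model can simply multiply the two columns to recover the price, so a perfect score says nothing about how well it would value an unpriced property.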

*   **Learning & Growth:**
I gained technical skills in **XGBoost and Hyperparameter tuning**. I was most surprised by how well the **K-Means clustering** separated the buildings by height without us explicitly telling it to. This project taught me that Machine Learning models are not black boxes; they require deep EDA to understand *why* the model is making certain predictions.