# Real Estate Valuation Model Report

## 1. Data Preprocessing, Exploratory Data Analysis (EDA), and Visualization

### 1.1 Data Overview
The dataset used for real estate valuation consists of 414 instances with six predictive features and one target variable:

- **X1**: Transaction date (continuous)
- **X2**: House age (years, continuous)
- **X3**: Distance to nearest MRT station (meters, continuous)
- **X4**: Number of convenience stores (integer)
- **X5**: Latitude (degrees, continuous)
- **X6**: Longitude (degrees, continuous)
- **Y**: House price per unit area (continuous, target variable)

### 1.2 Missing Values
The dataset contains **no missing values**, so no imputation or row removal was required.
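
A check of this kind can be sketched with pandas as below. The frame and its column names are made-up placeholders, not the report's data, and one value is deliberately left missing to show what a non-clean result would look like.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (NOT the report's data); column names are placeholders.
df = pd.DataFrame({
    "house_age": [10.0, 20.0, np.nan, 5.0],
    "mrt_distance_m": [100.0, 250.0, 900.0, 40.0],
    "price_per_unit_area": [40.0, 35.0, 22.0, 51.0],
})

# Count missing entries per column, then overall.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Any missing values?", total_missing > 0)
```

On a complete dataset such as the one described here, every per-column count would be zero.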

### 1.3 Summary Statistics
Key insights from summary statistics:
- The house price per unit area ranges from **7.6 to 117.5** with a mean of **37.98**.
- The standard deviation of **13.61** (about 36% of the mean) indicates substantial price variability.
- Strong negative correlation (**-0.75**) between **distance to MRT station** and **house price**.
- Positive correlation (**0.60**) between **number of convenience stores** and **house price**.
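
As an illustration of how such correlation figures are computed, the sketch below implements the Pearson coefficient with NumPy. It assumes the reported values are Pearson correlations, which the report does not state, and it runs on synthetic data, so the printed value will not match the figures above.

```python
import numpy as np

# Synthetic stand-in data (NOT the report's dataset): price falls as distance grows.
rng = np.random.default_rng(0)
distance_m = rng.uniform(50, 6000, size=414)
price = 60 - 5 * np.log(distance_m) + rng.normal(0, 3, size=414)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


r = pearson_r(distance_m, price)
print(f"Pearson r (distance vs. price): {r:.2f}")
```

A coefficient near -1 or +1 indicates a strong linear association; it does not by itself show that one variable causes the other.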

### 1.4 Outlier Detection and Treatment
- **Outliers detected** (number of values): Distance to MRT station (37), Latitude (8), Longitude (35), House price (3).
- **Handling method**: Outlying values were capped at the **Interquartile Range (IQR) boundaries** rather than removed.
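
A minimal sketch of IQR-based capping follows. It assumes the common 1.5 × IQR fences (the report does not state the multiplier), and the helper name `cap_iqr` and the sample values are illustrative only.

```python
import numpy as np


def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR].

    Returns the capped array and the number of values that fell outside
    the fences. k=1.5 is an assumed multiplier, not taken from the report.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return np.clip(values, lower, upper), n_outliers


# Made-up prices with one extreme value.
prices = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 117.5])
capped, n_out = cap_iqr(prices)
print(n_out)         # 1
print(capped.max())  # 72.5
```

Capping keeps every row in the dataset, which matters with only 414 instances, but it also compresses genuinely extreme prices toward the fence.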

### 1.5 Skewness and Transformations
- **Distance to MRT station** was strongly right-skewed (skewness **1.216**), so a **log transformation** was applied.
- The transformation reduced the skewness.
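
The effect of a log transform on skewness can be illustrated as follows. This is a sketch on synthetic right-skewed data, using the natural log and the biased moment-based skewness estimator; the report does not say which log variant or estimator was used.

```python
import numpy as np


def skewness(x):
    """Moment-based sample skewness: third central moment / variance**1.5."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d ** 3).mean() / (d ** 2).mean() ** 1.5)


# Synthetic right-skewed distances in meters (NOT the report's data).
rng = np.random.default_rng(42)
distance_m = rng.lognormal(mean=6.5, sigma=0.9, size=414)

# Natural log is safe here because all distances are strictly positive.
log_distance = np.log(distance_m)

skew_before = skewness(distance_m)
skew_after = skewness(log_distance)
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

If a feature can contain zeros, `np.log1p` is a common alternative because it is defined at zero.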

### 1.6 Feature Scaling
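- **Standardization** (zero mean, unit variance) was applied to all numerical features.
- This prevents features with large numerical values from dominating the model, which matters most for penalized models such as Ridge and Lasso, whose penalties depend on coefficient scale.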
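
Standardization can be sketched as below. The example fits the mean and standard deviation on the training rows only and reuses them for the test rows, which is a common way to avoid leaking test information; whether the report did this is not stated, so treat it as an assumption. The data and helper names are illustrative.

```python
import numpy as np


def fit_standardizer(X_train):
    """Return per-column mean and std computed from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(X, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return (X - mean) / std


# Synthetic features on very different scales (NOT the report's data).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 6500, size=414),                 # distance-like, meters
    rng.integers(0, 11, size=414).astype(float),     # store-count-like
    rng.uniform(24.93, 25.01, size=414),             # latitude-like, degrees
])
X_train, X_test = X[:331], X[331:]  # assumed 80/20 split

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```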

## 2. Feature Selection and Extraction

### 2.1 Correlation-Based Feature Selection
- Features that are **correlated** with the target variable were retained.
- **House age and MRT distance are negatively correlated with house price**, so they are retained.
- **Convenience stores, latitude, and longitude are positively correlated with house price**, making them relevant predictors.

### 2.2 Feature Engineering
- **Polynomial features** were explored to capture non-linear relationships between the features and price.
- **Polynomial Regression (degree 3) improved test R² from 0.73 (linear baseline) to 0.79**, suggesting a non-linear relationship.
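
One thing worth keeping in mind with a degree-3 expansion is how many terms it creates. For n input features and degree d, the number of monomials of total degree at most d is C(n + d, d). The sketch below computes this for six features and cross-checks it by enumeration; the helper name `n_poly_terms` is illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def n_poly_terms(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


# Cross-check by enumerating every monomial of degree 0, 1, 2 and 3.
enumerated = sum(
    1
    for d in range(0, 4)
    for _ in combinations_with_replacement(range(6), d)
)

print(n_poly_terms(6, 3))                      # 84
print(n_poly_terms(6, 3, include_bias=False))  # 83
print(enumerated)                              # 84
```

With 414 instances in total, 83 non-constant terms is a sizeable number of parameters, so comparing training and test scores is a sensible guard against overfitting.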

## 3. Model Development and Hyperparameter Tuning

### 3.1 Models Tested
#### 3.1.1 Multiple Linear Regression (Baseline)
- **Train R²**: 0.69
- **Test R²**: 0.73
- **Test RMSE**: 6.72

#### 3.1.2 Ridge Regression
- Best **alpha = 1** (determined via cross-validation)
- **Test R²**: 0.731
- **Test RMSE**: 6.71

#### 3.1.3 Lasso Regression
- Best **alpha = 0.0001**
- **Test R²**: 0.731
- **Test RMSE**: 6.72
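
A hedged sketch of how the Ridge and Lasso penalties might be tuned with scikit-learn is shown below. The alpha grid, the 80/20 split, the 5-fold setting for Lasso, and the synthetic data are all assumptions; the report only states that the Ridge alpha was chosen via cross-validation.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the report's dataset): 414 rows, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(414, 6))
true_coef = np.array([1.5, -2.0, -5.0, 3.0, 2.5, 0.5])
y = 38 + X @ true_coef + rng.normal(0, 4, size=414)

# Assumed 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed candidate grid: 13 values from 1e-4 to 1e2 on a log scale.
alphas = np.logspace(-4, 2, 13)

# Scaling lives inside the pipeline so it is fitted on training data only.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso = make_pipeline(
    StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10_000)
)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

best_ridge_alpha = ridge[-1].alpha_
best_lasso_alpha = lasso[-1].alpha_
print("Ridge alpha:", best_ridge_alpha, "test R2:", round(ridge.score(X_test, y_test), 3))
print("Lasso alpha:", best_lasso_alpha, "test R2:", round(lasso.score(X_test, y_test), 3))
```

When the selected alphas are small and the scores match plain linear regression, as in the results above, the penalty is doing little work.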

#### 3.1.4 Polynomial Regression (Degree 3)
- **Train R²**: 0.82
- **Test R²**: 0.79
- **Test RMSE**: 5.87
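
For reference, a degree-3 polynomial regression can be assembled in scikit-learn as a pipeline and compared against a linear baseline. This is a sketch under assumptions: synthetic data with a deliberately curved effect, an 80/20 split, and plain least squares on the expanded features. It is not the report's exact setup and will not reproduce its scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the report's dataset): 414 rows, 6 features,
# with a curved dependence on the feature at index 2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(414, 6))
y = 38 - 10 * X[:, 2] + 15 * X[:, 2] ** 2 + 4 * X[:, 3] + rng.normal(0, 3, size=414)

# Assumed 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear = make_pipeline(StandardScaler(), LinearRegression())
poly3 = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)

results = {}
for name, model in [("linear", linear), ("poly3", poly3)]:
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, pred_test),
        "test_rmse": float(np.sqrt(mean_squared_error(y_test, pred_test))),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Reporting train and test R² side by side, as the report does, makes it easy to see whether the extra polynomial terms generalize or merely fit the training rows.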

### 3.2 Best Model Selection
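- **Polynomial Regression (Degree 3) achieved the highest test R² (0.79)** and the lowest test RMSE (5.87).
- **Ridge and Lasso performed almost identically to MLR** (test R² 0.731 vs. 0.73; test RMSE 6.71 and 6.72 vs. 6.72), suggesting that the linear baseline was not overfitting.
- The **third-degree polynomial captured non-linear relationships effectively**, with only a small gap between train R² (0.82) and test R² (0.79), making it the best of the models tested.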

## 4. Model Evaluation and Interpretation

### 4.1 Performance Metrics
- **Cross-Validation Mean R²**: 0.669
- **Test MSE**: 34.47
- **Test RMSE**: 5.87
- **Test MAE**: 4.26
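
The error metrics above are related by simple formulas, which the sketch below spells out with NumPy. The helper name `regression_metrics` and the toy arrays are illustrative. The final lines check that the reported RMSE is consistent with the reported MSE, since RMSE is the square root of MSE; the value 5.87 also matches the polynomial model's test RMSE in Section 3.1.4.

```python
import math

import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for paired true/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float((residuals ** 2).mean())
    ss_res = float((residuals ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": float(np.abs(residuals).mean()),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy example with made-up values.
m = regression_metrics([3, 5, 7, 9], [2, 5, 8, 11])
print(m)

# Consistency check on the reported figures: sqrt(34.47) rounds to 5.87.
reported_mse = 34.47
reported_rmse_check = round(math.sqrt(reported_mse), 2)
print(reported_rmse_check)  # 5.87
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE, which is consistent with the reported 5.87 versus 4.26.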

### 4.2 Insights from Model Coefficients
- **Latitude and number of convenience stores** positively impact house prices.
- **House age and distance to MRT negatively impact prices**.
- **Transaction date has a minor positive impact**, indicating slight price appreciation over time.

## 5. Data-Driven Insights

### 5.1 Key Findings
- **Location Matters**: Proximity to MRT stations and commercial infrastructure strongly influences prices.
- **Age of the House**: Older houses tend to have lower prices.
- **Feature Scaling and Transformation**: Log-transforming MRT distance reduced skewness, and standardization put all features on a common scale.
- **Polynomial Regression** captured non-linear effects that the linear models missed (test R² 0.79 vs. 0.73).

### 5.2 Business Recommendations
- **Real estate investors should prioritize properties near transit hubs** for higher valuations.
- **Newer buildings hold better resale value**.
- **Nearby commercial establishments enhance property desirability**; in this dataset the evidence comes from the number of convenience stores.

## 6. Conclusion
- **Polynomial Regression (Degree 3) was the best-performing model among those tested** (test R² 0.79, test RMSE 5.87).
- **Feature selection and preprocessing played a crucial role** in improving model accuracy.
- **Data-driven insights provide valuable guidance for real estate investments**.
- **Further improvements**: Exploring ensemble methods (Random Forest, XGBoost) could enhance predictive performance.
