# Project Title
## 1. Introduction
### 1.1 Problem Statement

**Goal:** Predict house prices using regression models.  
**Dataset:** 10,000 observations from Kaggle.  
**Result:** Random Forest achieved an RMSE of 12,500.


**Key insight:** price is strongly correlated with `SquareFeet`.
*This assumption* is tested using correlation analysis.
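
A minimal sketch of such a correlation check, assuming the Pearson coefficient is the measure of interest. The `square_feet` and `price` values are hypothetical toy numbers, not rows from the Kaggle dataset, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values for illustration only.
square_feet = [850, 1200, 1500, 2100, 2600]
price = [155_000, 210_000, 250_000, 340_000, 410_000]

r = pearson_r(square_feet, price)
print(f"Pearson r between SquareFeet and price: {r:.3f}")
```

A value of `r` close to 1 would support the assumption. A value near 0 would suggest that `SquareFeet` alone explains little of the price variation.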


---

### Feature Scaling
We standardize numerical features to improve model convergence.
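
As a rough illustration, standardization (z-scoring) subtracts the mean and divides by the standard deviation of each feature. The sketch below assumes the population standard deviation is used. The helper name `standardize` and the sample values are hypothetical.

```python
import math


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical feature column for illustration only.
scaled = standardize([850, 1200, 1500, 2100, 2600])
print([round(s, 3) for s in scaled])
```

In practice the mean and standard deviation would be computed on the training split only and then reused for validation and test data, so that no information leaks from held-out rows.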


We chose Random Forest because it:
- Handles non-linear relationships well
- Is robust to outliers
- Needs little or no feature scaling


| Model          | RMSE  | R²   |
|----------------|-------|------|
| Linear Reg.    | 18,200| 0.71 |
| Random Forest  | 12,500| 0.86 |
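
For reference, the two metrics in the table can be computed as in the sketch below. The `y_true` and `y_pred` values are hypothetical and are not the predictions behind the reported numbers.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]

print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE is reported in price units, so lower is better, while R² is unitless and approaches 1 as the model explains more of the variance.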

**Figure 1:** The distribution of house prices is right-skewed,
which justifies a log transformation.
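
A small sketch of how a log transformation reduces right skew. The price series is a hypothetical geometric progression chosen so the effect is easy to see, and `skewness` is an illustrative helper based on the third standardized moment.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for a constant sequence")
    return m3 / m2 ** 1.5


# Hypothetical right-skewed prices: each one is 1.5x the previous.
prices = [100_000 * 1.5 ** k for k in range(8)]
log_prices = [math.log(p) for p in prices]

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after log: {skewness(log_prices):.2f}")
```

If the model is trained on log prices, predictions need to be mapped back with `math.exp` before they are compared with prices in the original units.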


> **Conclusion:** Feature engineering improved model performance
> more than hyperparameter tuning.


## Final Conclusion
The Random Forest model predicts house prices with an RMSE of 12,500 and an R² of 0.86, outperforming Linear Regression (RMSE 18,200, R² 0.71).

### Next Steps
- Add external economic data
- Test XGBoost
- Deploy model using Streamlit


<div class="alert alert-block alert-info"><b>Note:</b> This is an info box.</div>
<div class="alert alert-block alert-warning"><b>Warning:</b> Memory intensive!</div>

<div class="alert alert-block alert-info">
    <b>üí° MLE Insight:</b> We are using <b>XGBoost</b> here because the dataset is tabular and has high cardinality.
</div>

<div class="alert alert-block alert-warning">
    <b>⚠️ Memory Warning:</b> The next cell requires at least 16 GB of RAM to process the feature matrix.
</div>

<div class="alert alert-block alert-danger">
    <b>❌ Critical:</b> Ensure the <code>.env</code> file is configured before running the database connector.
</div>

### 📊 Data Dictionary

| Feature | Type | Description | Example |
| :--- | :--- | :--- | :--- |
| `user_id` | `UUID` | Unique identifier for the customer | `550e8400-e29b...` |
| `event_ts` | `Timestamp` | Event time in UTC | `2025-10-16 14:00:00` |
| `target` | `Boolean` | Whether the user churned within 30 days | `1` |

### 🧠 Loss Function Definition

Our model optimizes the **Mean Squared Error (MSE)** with an $L_2$ regularization term (Ridge):

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

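Where:
* $h_\theta(x^{(i)})$: Model prediction for the $i$-th training example.
* $y^{(i)}$: Observed target for the $i$-th training example.
* $\theta_j$: Weight of the $j$-th feature.
* $\lambda$: Regularization strength.
* $m$: Number of training examples.
* $n$: Number of features.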
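
The cost above can be evaluated directly, as in the sketch below. It assumes a linear hypothesis $h_\theta(x) = \theta_0 + \sum_j \theta_j x_j$, which the text does not specify, with `theta[0]` as an unpenalized intercept because the penalty sum starts at $j = 1$. The function names and data are hypothetical.

```python
def predict(theta, x):
    """Linear hypothesis (an assumption): theta[0] is the intercept."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/m) * sum of squared errors + lam * sum(theta_j^2, j >= 1)."""
    m = len(X)
    mse = sum((predict(theta, x) - target) ** 2 for x, target in zip(X, y)) / m
    # The intercept theta[0] is left out of the penalty, matching j = 1..n.
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return mse + penalty


# Hypothetical single-feature data where y = 2 * x exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]

perfect_fit = ridge_cost([0.0, 2.0], X, y, lam=0.1)  # zero error, penalty only
shrunk_fit = ridge_cost([0.0, 1.0], X, y, lam=0.1)   # smaller weight, larger error
print(perfect_fit, shrunk_fit)
```

With a larger `lam`, the penalty term grows relative to the error term, which pushes the optimum toward smaller weights.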

-----

### 5\. Task Lists for the Project Roadmap

Ideal at the top of the notebook, showing what has already been done.

```markdown
### 🗺️ Project Roadmap
- [x] **Data Ingestion:** Connected to AWS S3.
- [x] **Exploratory Data Analysis:** Identified outliers in `price` column.
- [ ] **Feature Engineering:** Implementing Target Encoding.
- [ ] **Model Training:** Hyperparameter tuning with Optuna.

### 📍 Quick Navigation
1. [Introduction](#introduction)
2. [Data Cleaning](#data-cleaning)
3. [Model Evaluation](#evaluation)

...

## <a id="evaluation"></a> 3. Model Evaluation 📈
(The link above takes the user directly here)

![Python](https://img.shields.io/badge/python-3.10+-blue.svg)
![Status](https://img.shields.io/badge/status-in--progress-orange.svg)
![Framework](https://img.shields.io/badge/framework-PyTorch-red.svg)