# House pricing

This project has two main phases:

1. Initial
2. Rerun

Each phase consists of the following steps:

1. Data preprocessing and analysis
    - Collect the initial (referred to as "raw") data from several datasets and two APIs
    - Preprocess the data
    - Perform the initial EDA
2. Modeling (referred to as the "first cycle")
    - Run a few different models
    - Pick the best ones overall

The second phase repeats the steps above with a slight modification: instead of only collecting data, new features are also engineered from the existing data. In addition, clustering by urbanization level was performed, and the resulting cluster labels were added as a candidate feature.

## General setup

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

from src.config import (
    AGE_BUCKET_FILE,
    DENSITY_CLEAN_FILE,
    HOUSE_CLEAN_FILE,
    INCOME_CLEAN_TOTAL,
    MASTER_DF_FILE,
    SERVICES_COLUMNS,
    SERVICES_FILE,
    WEATHER_QUARTER_FILE,
    URBAN_CLUSTER_FILE
)

warnings.filterwarnings("ignore")

## Phase 1

### Collecting and processing data

The directory `notebooks\eda` contains more detailed notebooks on data cleaning, preparation, and EDA.

---

Here's the result of that work: the master dataset, which combines all of the relevant data from 2 open APIs (weather from [OpenMeteoAPI](https://open-meteo.com/en/docs/historical-weather-api) and amenities from [OpenStreetMapAPI](https://www.openstreetmap.org/#map=7/39.606/-7.839)) and 4 datasets (all obtained from the Pordata portal):
- Average house pricing
- Average income (total across all education levels)
- Population density
- Age distributions

The master dataset contains 39 columns:
- 2 columns with the average quarterly house price (the raw price and its log)
- 3 temporal columns (`quarter_num`, `quarter_ord` and `year`), one of which (`year`) is mostly a helper/meta column
- 16 amenities columns
- 6 age bucket columns
- 1 income column
- 1 population density column
- 4 comfort level columns (weather-derived)
- 5 weather columns
- 1 municipality name column (meta column)
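
The column breakdown above can be sanity-checked by summing the group sizes. The sketch below is a minimal example: the group names are labels made up for this illustration, not identifiers from the project, and the sizes are copied from the list.

```python
# Column group sizes as listed in the breakdown above.
# The keys are illustrative labels, not names used in the project code.
COLUMN_GROUPS = {
    "house_price": 2,
    "temporal": 3,
    "amenities": 16,
    "age_buckets": 6,
    "income": 1,
    "population_density": 1,
    "comfort_level": 4,
    "weather": 5,
    "municipality": 1,
}

EXPECTED_COLUMNS = 39

# Sum the group sizes and compare against the documented column count.
total_columns = sum(COLUMN_GROUPS.values())
assert total_columns == EXPECTED_COLUMNS, (
    f"breakdown sums to {total_columns}, expected {EXPECTED_COLUMNS}"
)
print(total_columns)
```

The same total could be compared against the number of columns in the loaded dataframe, keeping in mind that a column used as the index is not counted in `df.shape[1]`.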

The dataset covers 230 municipalities, all at the NUTS4 level, over 17 quarters (4 full years plus 1 quarter), from the last quarter of 2019 to the end of 2023.
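
In the `df.head()` output, rows for Q4 2023 carry `quarter_ord` = 17, which is consistent with a running counter that starts at 1 in Q4 2019. The sketch below assumes that encoding; the function name `quarter_ordinal` is hypothetical and not taken from the project code.

```python
# First period covered by the dataset (last quarter of 2019).
FIRST_YEAR, FIRST_QUARTER = 2019, 4


def quarter_ordinal(year: int, quarter_num: int) -> int:
    """Return a 1-based running index of the quarter, with Q4 2019 mapped to 1.

    Assumption: this mirrors how `quarter_ord` relates to `year` and
    `quarter_num` in the master dataset.
    """
    if not 1 <= quarter_num <= 4:
        raise ValueError(f"quarter_num must be in 1..4, got {quarter_num}")
    return (year - FIRST_YEAR) * 4 + (quarter_num - FIRST_QUARTER) + 1


print(quarter_ordinal(2019, 4))  # 1
print(quarter_ordinal(2023, 4))  # 17
```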

In [3]:
df = pd.read_csv(MASTER_DF_FILE, index_col=0)
df.head()

| | municipality | house_price | total_sunshine_h | mean_sunshine_h | windspeed_mean_kmh | total_precipitation_mm | mean_precipitation_mm | year | quarter_num | quarter_ord | ... | library | mall | museum | pharmacy | police | post_office | school | station | theatre | university |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arcos de Valdevez | 813.0 | 487.521214 | 5.299144 | 9.177174 | 1405.1 | 15.272826 | 2023 | 4 | 17 | ... | 4.0 | 2.0 | 5.0 | 10.0 | 2.0 | 4.0 | 15.0 | 1.0 | 1.0 | 0.0 |
| 4 | Paredes de Coura | 723.0 | 472.135439 | 5.131907 | 10.283696 | 1237.1 | 13.446739 | 2023 | 4 | 17 | ... | 1.0 | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 7.0 | 1.0 | 0.0 | 0.0 |
| 5 | Ponte da Barca | 759.0 | 499.030875 | 5.424249 | 8.804348 | 1300.5 | 14.13587 | 2023 | 4 | 17 | ... | 3.0 | 2.0 | 7.0 | 10.0 | 2.0 | 3.0 | 16.0 | 1.0 | 1.0 | 0.0 |
| 6 | Ponte de Lima | 1128.0 | 513.344253 | 5.579829 | 11.211957 | 1132.9 | 12.31413 | 2023 | 4 | 17 | ... | 2.0 | 2.0 | 3.0 | 13.0 | 6.0 | 2.0 | 39.0 | 2.0 | 1.0 | 2.0 |
| 7 | Valença | 945.0 | 549.046297 | 5.967895 | 13.494565 | 386.7 | 4.203261 | 2023 | 4 | 17 | ... | 3.0 | 1.0 | 3.0 | 7.0 | 6.0 | 4.0 | 23.0 | 5.0 | 1.0 | 1.0 |


Our assumptions, both at the start and after the initial EDA (`01_eda.ipynb`), were that the key drivers of house prices are income and amenities. Given the geography of Portugal, we also considered wind as one of the possible key factors: wind speed is higher closer to the coastline, so it may act as a proxy for proximity to the coast.
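
One simple way to probe such assumptions is to correlate candidate drivers with the house price. The sketch below is only an illustration of the mechanics: it uses the five rows shown by `df.head()` (a single quarter, far too few observations for any conclusion), and the choice of `school` and `pharmacy` as amenity examples is arbitrary. The income column is hidden behind the `...` in that output, so it is left out here.

```python
import pandas as pd

# The five sample rows from the df.head() output (Q4 2023 only).
# This is NOT the full dataset; it only demonstrates the calculation.
sample = pd.DataFrame(
    {
        "municipality": [
            "Arcos de Valdevez",
            "Paredes de Coura",
            "Ponte da Barca",
            "Ponte de Lima",
            "Valença",
        ],
        "house_price": [813.0, 723.0, 759.0, 1128.0, 945.0],
        "windspeed_mean_kmh": [9.177174, 10.283696, 8.804348, 11.211957, 13.494565],
        "school": [15.0, 7.0, 16.0, 39.0, 23.0],
        "pharmacy": [10.0, 4.0, 10.0, 13.0, 7.0],
    }
)

# Candidate drivers to compare against the target (illustrative subset).
candidate_drivers = ["windspeed_mean_kmh", "school", "pharmacy"]

# Pearson correlation of each candidate column with the house price.
correlations = (
    sample[candidate_drivers]
    .corrwith(sample["house_price"])
    .sort_values(ascending=False)
)
print(correlations)
```

On the full master dataframe the same call could be run over all numeric feature columns, ideally per quarter or against the log price, since pooled correlations across quarters mix temporal and cross-sectional effects.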

### Modeling

## Phase 2

### Adding more data

### Clustering

### Modeling