# House pricing

This project has two main phases:

1. Initial
2. Rerun

Each phase consists of the following steps:

1. Data preprocessing and analysis
    - Collect the initial (referred to as "raw") data from several datasets and two APIs
    - Preprocess the data
    - Perform the initial EDA
2. Modeling (referred to as the "first cycle")
    - Run a few different models
    - Pick the best ones overall
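
The "run a few different models, pick the best ones" step can be illustrated with a minimal sketch. It assumes a simple hold-out split and RMSE as the selection metric. The helper names (`fit_mean`, `fit_ols`, `compare_models`) and the synthetic data are hypothetical and are not taken from this project, whose actual models and metric may differ.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def fit_mean(X, y):
    """Hypothetical baseline: always predict the training mean."""
    mean = float(np.mean(y))
    return lambda X_new: np.full(len(X_new), mean)


def fit_ols(X, y):
    """Hypothetical candidate: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def compare_models(candidates, X_train, y_train, X_test, y_test):
    """Fit every candidate and return {name: hold-out RMSE}, best first."""
    scores = {}
    for name, fit in candidates.items():
        predict = fit(X_train, y_train)
        scores[name] = rmse(y_test, predict(X_test))
    return dict(sorted(scores.items(), key=lambda kv: kv[1]))


# Synthetic stand-in data (assumption: the real features come from the master dataframe).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

scores = compare_models(
    {"mean_baseline": fit_mean, "ols": fit_ols},
    X[:150], y[:150], X[150:], y[150:],
)
best_model = next(iter(scores))  # name of the candidate with the lowest RMSE
print(scores, best_model)
```

Keeping a trivial baseline among the candidates makes it easy to see whether a more complex model actually adds value.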

The second phase repeats the steps above with a slight modification: in addition to collecting the data, new features are engineered from the existing data. Moreover, the data were clustered by urbanization level, and the resulting cluster labels were added as a candidate feature.
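
As a rough illustration of how cluster labels can become a feature, the sketch below groups regions by population density and stores the label in a new column. It rests on several assumptions: that density (the `people/km2` column used later in the EDA) is a reasonable proxy for urbanization, that a simple one-dimensional k-means with three clusters is adequate, and that the label column is called `urban_cluster`. The clustering method, the number of clusters, the helper `kmeans_1d`, and the toy values are all hypothetical, not this project's actual procedure.

```python
import numpy as np
import pandas as pd


def kmeans_1d(values, k=3, n_iter=50):
    """Minimal 1-D k-means (Lloyd's algorithm) with quantile initialization.

    Centers start in ascending order and stay that way, so label 0 is the
    lowest-valued cluster and label k-1 the highest.
    """
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its points (keep it if empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy densities (made-up values for illustration only).
toy = pd.DataFrame(
    {"people/km2": [12, 25, 40, 300, 450, 600, 4000, 6500, 9000]}
)

# Density is heavily right-skewed, so cluster on the log scale.
labels, centers = kmeans_1d(np.log10(toy["people/km2"]), k=3)
toy["urban_cluster"] = labels  # hypothetical name for the new feature
print(toy)
```

Because the labels are ordered from least to most dense, the column can be used either as an ordinal feature or one-hot encoded, depending on the model.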

## General setup

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np
import warnings

from src.config import (
    AGE_BUCKET_FILE,
    DENSITY_CLEAN_FILE,
    HOUSE_CLEAN_FILE,
    INCOME_CLEAN_TOTAL,
    MASTER_DF_FILE,
    SERVICES_COLUMNS,
    SERVICES_FILE,
    WEATHER_QUARTER_FILE,
    URBAN_CLUSTER_FILE
)

warnings.filterwarnings("ignore")

## Phase 1

### Collecting and processing data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('all_raw_features.csv', index_col=0)

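target = 'house_price'
# Correlate every numeric column with the target.
numeric_df = df.select_dtypes(include='number')
# Drop the target itself and log_price_sqm (if present) so they do not dominate the ranking.
correlations = numeric_df.corr()[target].drop(['log_price_sqm', target], errors='ignore').sort_values(ascending=False)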

print("Top 10 Correlations with House Price:")
print(correlations.head(10))

top_features = correlations.head(6).index.tolist()

plt.figure(figsize=(12, 10))
sns.heatmap(df[top_features + [target]].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix: Top Features vs House Price')
plt.show()

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

sns.scatterplot(data=df, x='avg_income', y=target, ax=axes[0], alpha=0.6)
axes[0].set_title('Average Income vs House Price')

sns.scatterplot(data=df, x='people/km2', y=target, ax=axes[1], alpha=0.6)
axes[1].set_title('Population Density vs House Price')
axes[1].set_xscale('log')

sns.scatterplot(data=df, x='total_sunshine_h', y=target, ax=axes[2], alpha=0.6)
axes[2].set_title('Total Sunshine vs House Price')

plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='year', y=target)
plt.title('House Price Distribution by Year')
plt.show()

### Modeling

## Phase 2

### Adding more data

### Clustering

### Modeling