# Chicago Housing Prediction

The goal of this task is to scrape housing data from the web and apply decision tree models to predict housing prices from it.


## Data Scraping

- Select a city of your choice for which you will scrape housing data. Examples include Chicago, New York, San Francisco, etc.
- Use web scraping tools to collect housing data from platforms like Zillow or Redfin.
- Ensure you gather relevant features such as the number of bedrooms, bathrooms, square footage, location (address + ZIP code), and price.

The scraping has already been done using `zillow_chicago_scraper.py`.

In [None]:
import polars as pl
import plotly.express as px

In [None]:
df = pl.read_csv('../data/raw/chicago_properties.csv', null_values=['N/A', 'null'])
df.head()

In [None]:
# List rows where bathrooms or bedrooms is less than 1 (both should be greater than 0)
df.filter((pl.col('bathrooms') < 1) | (pl.col('bedrooms') < 1))

## Data Preparation
- Clean and preprocess the data to handle any missing or inconsistent entries.
- Encode categorical variables if necessary.

In [None]:
# Find non-numeric price values
df.filter(pl.col('price').str.replace('$', '', literal=True).str.replace_all(',', '').cast(pl.Float32, strict=False).is_null())

In [None]:
# Clean the price column
df = df.with_columns(
    pl.col('price')
    .replace({'$279,000+': '279000', 'Est. $138.8K': '138800', 'Est. $290K': '290000'})
    .str.replace('$', '', literal=True)
    .str.replace_all(',', '').cast(pl.Float32)
)
df.head()
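
The cell above hard-codes the three irregular price strings found in this dataset. If the scraper is rerun and new variants appear, a more general parser may be easier to maintain. The following is only a sketch under assumptions: `parse_price` is a hypothetical helper (not part of the notebook), and it assumes that `K`/`M` suffixes mean thousands/millions and that prefixes such as `Est.` and a trailing `+` can be ignored.

```python
import re


def parse_price(raw):
    """Convert a scraped price string to a float, or None if it has no number."""
    if raw is None:
        return None
    # First number in the string, optionally followed by a K/M magnitude suffix.
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(?:([KkMm])\b)?", raw)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    multiplier = {"k": 1_000, "m": 1_000_000}.get(suffix, 1)
    return value * multiplier


# The three irregular strings from the cell above, plus a missing value.
for raw in ["$279,000+", "Est. $138.8K", "Est. $290K", "N/A"]:
    print(raw, "->", parse_price(raw))
```

In polars this could be applied with `map_elements`, at the cost of running Python code per row.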

In [None]:
# Location of the property also matters, let's extract the zip code from the address
df = df.with_columns(
    pl.col('address')
    .str.extract(r'IL (\d{5})$')
    .alias('zip_code')
)

df['zip_code'].value_counts()

In [None]:
# Check if there are any missing zip codes
df.filter(pl.col('zip_code').is_null())

In [None]:
# Fill in the missing zip code (found using Google Search)
df = df.with_columns(
    zip_code=pl.when(pl.col('address')=="Madison FP Plan, Madison").then(pl.lit('60601')).otherwise(pl.col('zip_code'))
)
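
The pattern `IL (\d{5})$` only matches addresses that end in exactly `IL` plus five digits, which is why the Madison listing came back null. A slightly more tolerant pattern is sketched below with the standard `re` module. `extract_zip` is a hypothetical helper, and the assumption that addresses may carry a ZIP+4 suffix or trailing whitespace is mine, not something observed in this dataset.

```python
import re

# Five digits at the end of the string, optionally followed by a ZIP+4
# suffix and/or trailing whitespace.
ZIP_PATTERN = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")


def extract_zip(address):
    """Return the 5-digit ZIP code at the end of an address, or None."""
    match = ZIP_PATTERN.search(address)
    return match.group(1) if match else None


print(extract_zip("1355 N Astor St, Chicago, IL 60610"))
print(extract_zip("Madison FP Plan, Madison"))
```

Addresses with no ZIP code at all still return `None` and need a manual lookup, as done above.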

In [None]:
# Filter non-numeric unique square footage values
df.filter(pl.col('square_footage').cast(pl.Float32, strict=False).is_null())['square_footage'].unique()

In [None]:
# Clean the square footage column, convert unknown values to null
df = df.with_columns(
    pl.col('square_footage').cast(pl.Float32, strict=False)
)
df.head()

In [None]:
px.box(df, x='bedrooms', y='square_footage')

In [None]:
# Find outliers
df.filter(
    (((pl.col('square_footage') > 9000) & (pl.col('bedrooms') == 3)) | ((pl.col('square_footage') > 20000) & (pl.col('bedrooms') == 5)))
)

In [None]:
# Let's take out these 2 properties which are clearly outliers
df = df.filter(
    ~pl.col('address').is_in(['1355 N Astor St, Chicago, IL 60610', '415 E North Water St #3205, Chicago, IL 60611'])
)

px.box(df, x='bedrooms', y='square_footage')

In [None]:
px.scatter(df, x='square_footage', y='price')

The scatter plot shows that a single square footage value can map to many different prices. The other features, bathrooms and bedrooms (which take only a few discrete values) and zip codes (few data points per zip code), do not seem sufficient to explain the price. We need additional information such as carpet area, house type, etc.

In [None]:
px.histogram(df, x='square_footage', nbins=50)

We can keep these extreme `square_footage` values because the price is clearly high for them as well. Because there are so few such points, however, they will affect the cross-validation score.
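The outliers earlier in this notebook were picked by eye from the box plot. For a more repeatable check, extreme values can be flagged with Tukey's 1.5 × IQR fences, the usual rule of thumb behind box-plot whiskers. This is a minimal sketch using only the standard library: `iqr_fences` is a hypothetical helper, the numbers are made up, and the quartile method may differ slightly from the one Plotly uses.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical square footage values for a single bedroom count.
square_footage = [1200, 1350, 1400, 1500, 1550, 1600, 1700, 1800, 9500]

lower, upper = iqr_fences(square_footage)
flagged = [v for v in square_footage if v < lower or v > upper]
print(f"fences=({lower:.1f}, {upper:.1f}), flagged={flagged}")
```

Flagging is not the same as dropping: a flagged point is only a candidate for inspection, and it can still be kept if its price looks consistent.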

In [None]:
px.density_heatmap(df, x='bathrooms', y='bedrooms', z='square_footage', histfunc='avg', title="Average square footage by number of bathrooms and bedrooms")

In [None]:
px.scatter(df.with_columns((pl.count('zip_code').over(['bathrooms', 'bedrooms']) / pl.count('zip_code').over(['bathrooms'])).alias('percentage').round(2)), x='bathrooms', y='bedrooms', size='percentage', title="Share of each bedroom count within each bathroom category")

In [None]:
# Missing values
df.select(pl.col('*').is_null().sum())

I tried imputing bathrooms and square_footage, but it did not improve the R² score. In regression tasks, decision trees are also sensitive to the bias that imputation introduces. For reference, here is the code that was used for imputation:

We can first fill in missing bathrooms and bedrooms using each other's most common value. Then we can impute square footage from properties with the same zip code, bathrooms, and bedrooms (the code below uses the group mean, falling back to the overall median).

```
def impute_bedrooms(num_bathrooms):
    if num_bathrooms <=2:
        return 2
    elif num_bathrooms <= 5:
        return num_bathrooms
    elif num_bathrooms <= 9:
        return num_bathrooms - 1
    else:
        return 10

# Impute bedrooms
df = df.with_columns(
    pl.when(
        pl.col('bedrooms').is_null()
    ).then(
        pl.col('bathrooms').map_elements(impute_bedrooms, return_dtype=pl.Float32)
    ).otherwise(
        pl.col('bedrooms')
    ).alias('bedrooms')  # without the alias, the result takes the name 'bathrooms'
)

# Impute square_footage: group mean first, then the overall median as a fallback
df = df.with_columns(
    pl.col('square_footage').fill_null(
        pl.col('square_footage').mean().over(['bedrooms', 'bathrooms', 'zip_code'])
    )
).with_columns(
    pl.col('square_footage').fill_null(pl.col('square_footage').median())
)
```

In [None]:
df.describe()

## Build a Decision Tree Model

- Use the scraped data to train a decision tree model.
- Experiment with different features to see which ones are most predictive of housing prices.
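
To make the model less of a black box, here is a rough sketch of how a regression tree picks a single split: it tries every threshold on a feature and keeps the one that leaves the smallest total squared error in the two children. This is the idea behind the default squared-error criterion of `DecisionTreeRegressor`, but it is not scikit-learn's implementation. `sse` and `best_split` are hypothetical helpers, the data is made up, and constraints such as `min_samples_leaf` are ignored.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, remaining_sse) minimising the children's total SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # no split at all is the baseline
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: small homes are cheap, large homes are expensive.
square_footage = [800, 950, 1100, 2400, 2600, 3000]
price = [200_000, 220_000, 210_000, 600_000, 640_000, 620_000]

threshold, remaining_sse = best_split(square_footage, price)
print(f"split at square_footage <= {threshold}, remaining SSE = {remaining_sse:.0f}")
```

A full tree repeats this search over all features and then recurses into each child until a stopping rule is hit.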

In [None]:
import numpy as np
from loguru import logger
from sklearn.model_selection import KFold
from sklearn.metrics import get_scorer

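def build_model(model_config=None):
    """Build a decision tree regressor with the specified configuration.

    Args:
        model_config (dict[str, Any], optional): Keyword arguments that
            override the defaults below. Defaults to None.

    Returns:
        DecisionTreeRegressor: Unfitted model object.
    """
    from sklearn.tree import DecisionTreeRegressor

    # Defaults used in this notebook; model_config can override them
    params = {"min_samples_leaf": 10, "random_state": 42}
    params.update(model_config or {})
    return DecisionTreeRegressor(**params)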


def train(
    X,
    y,
    model_params=None,
    cv=5,  # can also use train-test
    metrics=(),
    random_state=42,
):
    """Train a model and compute evaluation metrics.

    This function does three things in order:
    1. Perform training along with the validation setup.
    2. Evaluate the model on the specified metrics.
    3. Retrain the model on the full dataset.

    Args:
        X (DataFrame): Features.
        y (Series): Labels.
        model_params (dict, optional): Model parameters passed to `build_model`.
        cv (int, optional): Number of cross-validation folds. Defaults to 5.
        metrics (sequence of str, optional): Names of the scikit-learn scorers to compute. Defaults to ().
        random_state (int, optional): Seed for shuffling the folds. Defaults to 42.

    Returns:
        object: Model retrained on the full dataset.
        np.ndarray: Out-of-fold predictions.
        dict: Evaluation metrics.
    """
    # Perform CV
    kf = KFold(cv, random_state=random_state, shuffle=True)
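    # Apply each scorer's stored sign and kwargs so the values match the metric
    # names (e.g. 'neg_root_mean_squared_error' is a negated RMSE, not a raw MSE).
    scorers = {}
    for metric in metrics:
        scorer = get_scorer(metric)
        scorers[metric] = (
            lambda y_true, y_pred, s=scorer: s._sign * s._score_func(y_true, y_pred, **s._kwargs)
        )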

    train_scores = {metric: [] for metric in metrics}
    valid_scores = {metric: [] for metric in metrics}
    oof_preds = np.zeros(len(y), dtype=float)  # float, so predicted prices are not truncated
    models = []
    for fold, (tridx, validx) in enumerate(kf.split(X, y)):
        model_ = build_model(model_params)

        X_train, y_train = X[list(tridx)], y[list(tridx)]
        X_valid, y_valid = X[list(validx)], y[list(validx)]

        # Train model
        model_.fit(X_train.to_numpy(), y_train.to_numpy().ravel())
        models.append(model_)

        # Predict on the validation fold
        y_pred = model_.predict(X_valid.to_numpy())
        oof_preds[validx] = y_pred

        for metric in metrics:
            valid_score = scorers[metric](y_valid.to_numpy().ravel(), y_pred)
            valid_scores[metric].append(valid_score)

            train_score = scorers[metric](
                y_train.to_numpy().ravel(),
                model_.predict(X_train.to_numpy()),
            )
            train_scores[metric].append(train_score)

            logger.info(
                f"Fold: {fold+1}/{cv}, Train {metric}: {train_score}, Valid {metric}: {valid_score}"
            )

    # Compute CV mean and standard deviation of train and valid scores
    cv_mean_train_scores = {
        metric: np.mean(train_scores[metric]) for metric in metrics
    }
    cv_std_train_scores = {
        metric: np.std(train_scores[metric]) for metric in metrics
    }
    cv_mean_valid_scores = {
        metric: np.mean(valid_scores[metric]) for metric in metrics
    }
    cv_std_valid_scores = {
        metric: np.std(valid_scores[metric]) for metric in metrics
    }

    # Compute OOF scores
    oof_score = {metric: scorers[metric](y, oof_preds) for metric in metrics}

    evaluation_metrics = {
        "CV Mean Train score": cv_mean_train_scores,
        "CV Std Train score": cv_std_train_scores,
        "CV Mean Valid score": cv_mean_valid_scores,
        "CV Std Valid score": cv_std_valid_scores,
        "OOF Score": oof_score,
    }

    # Retrain model on full dataset
    model = build_model(model_params)
    model.fit(X.to_numpy(), y.to_numpy().ravel())

    return model, oof_preds, evaluation_metrics
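
The out-of-fold (OOF) bookkeeping in `train` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the pipeline above: `kfold_indices` is a hypothetical helper, the targets are made up, and the "model" simply predicts the mean of the training targets. What it shows is that every row receives exactly one prediction, made by a model that never saw that row.

```python
import random


def kfold_indices(n_samples, n_splits, seed=42):
    """Shuffle the row indices and deal them into n_splits validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


# Hypothetical target values.
y = [100.0, 120.0, 130.0, 150.0, 170.0, 200.0, 210.0, 260.0, 300.0, 320.0]
oof_preds = [None] * len(y)

for valid_idx in kfold_indices(len(y), n_splits=5):
    valid_set = set(valid_idx)
    train_targets = [y[i] for i in range(len(y)) if i not in valid_set]
    fold_prediction = sum(train_targets) / len(train_targets)  # the "model"
    for i in valid_idx:
        oof_preds[i] = fold_prediction  # predicted without ever seeing row i

print(oof_preds)
```

Scoring `oof_preds` against `y` once gives the OOF score, which can differ slightly from the mean of the per-fold scores.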

In [None]:
# One hot encode zip code
X = df.to_dummies('zip_code')

# Feature engineering
X = X.with_columns(
    (pl.col('bedrooms') + pl.col('bathrooms')).alias('B_plus_B'),
    (pl.col('bedrooms') * pl.col('bathrooms')).alias('B_prod_B'),
    (pl.col('square_footage') / pl.col('bedrooms')).alias('sq_div_bed'),
    (pl.col('square_footage') / pl.col('bathrooms')).alias('sq_div_bath'),
)

# Train the model
feature_cols = ['bedrooms', 'bathrooms', 'square_footage'] + [col for col in X.columns if col.startswith('zip_code')] + ['B_plus_B', 'B_prod_B', 'sq_div_bed', 'sq_div_bath']
target_col = 'price'

model, oof_preds, evaluation_results = train(
    X=X[feature_cols],
    y=df[target_col],
    cv=5,
    metrics=['neg_mean_squared_error', 'neg_root_mean_squared_error', 'r2'],
)

In [None]:
from rich import print
print(evaluation_results)

## Analysis and Reporting

- Analyze the results of your decision tree model.
- Discuss the features that were most influential in predicting housing prices.
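
For reading the evaluation results, it helps to recall what the metrics mean. The sketch below computes RMSE and R² by hand on made-up numbers. `rmse` and `r2` are hypothetical helpers written for illustration. Note that scikit-learn's `neg_*` scorers report the negated error so that higher is always better.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (dollars here)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions.
y_true = [200_000, 300_000, 400_000, 500_000]
y_pred = [220_000, 280_000, 390_000, 540_000]

print("RMSE:", rmse(y_true, y_pred))
print("neg RMSE (scikit-learn convention):", -rmse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```

An R² of 0 corresponds to always predicting the mean price, and a negative R² means the model does worse than that baseline.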

In [None]:
import plotly.graph_objects as go

fig = px.scatter(x=df[target_col], y=oof_preds, labels={'x': 'Ground Truth - Price', 'y': 'Predicted - Price'})
fig.add_trace(
    go.Scatter(x=df[target_col], y=df[target_col], name="linear", line_shape='linear')
)
fig.show()

In [None]:
feature_importances = sorted(list(zip(X[feature_cols].columns, model.feature_importances_)), key=lambda x: x[1], reverse=True)
feature_importances_X = [x[0] for x in feature_importances if x[1] > 0]
feature_importances_y = [x[1] for x in feature_importances if x[1] > 0]

px.bar(x=feature_importances_X, y=feature_importances_y, labels={'x': 'Feature', 'y': 'Importance'})

`sq_div_bed` (square footage divided by the number of bedrooms) and `bathrooms` are the most important features in determining the price.
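
Impurity-based importances such as `feature_importances_` can favour features with many distinct values, and correlated engineered features may share or steal importance from each other. Permutation importance is a common cross-check: shuffle one feature, and measure how much the error grows. The following is a model-agnostic sketch in plain Python, not what this notebook ran. `permutation_importance` here is a hypothetical helper (scikit-learn ships its own version in `sklearn.inspection`), and the data and the toy `predict` function are made up.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + (v,) + row[j + 1:] for row, v in zip(rows, column)]
            increases.append(mse(y, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical rows of (square_footage, bedrooms); the toy model ignores bedrooms.
rows = [(800, 1), (950, 3), (1100, 2), (2400, 4), (2600, 3), (3000, 5)]
y = [200 * sqft for sqft, _ in rows]


def predict(row):
    return 200 * row[0]


importances = permutation_importance(predict, rows, y)
print(dict(zip(["square_footage", "bedrooms"], importances)))
```

A feature the model never uses gets an importance of zero, while shuffling a feature it relies on makes the error jump.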