# Chicago Housing Prediction

The goal of this task is to use web scraping and apply decision tree models to predict housing prices based on scraped data.

Submission:

- Submit your Python notebook containing the code for data scraping, preprocessing, model training, and analysis.
- Include a brief report summarizing your findings and any insights gained from the task.


## Data Scraping

- Select a city of your choice for which you will scrape housing data. Examples include Chicago, New York, San Francisco, etc.
- Use web scraping tools to collect housing data from platforms like Zillow or Redfin.
- Ensure you gather relevant features such as the number of bedrooms, bathrooms, square footage, location (address and ZIP code), and price.

The scraping has already been done using `zillow_chicago_scraper.py`.

In [None]:
import polars as pl
import plotly.express as px

In [None]:
df = pl.read_csv('../data/raw/chicago_properties.csv', null_values=['N/A', 'null'])
df.head()

In [None]:
df.describe()

## Data Preparation
- Clean and preprocess the data to handle any missing or inconsistent entries.
- Encode categorical variables if necessary.

In [None]:
# List rows whose price still fails to parse as a number after stripping '$' and ','
df.filter(pl.col('price').str.replace('$', '', literal=True).str.replace_all(',', '').cast(pl.Float32, strict=False).is_null())

In [None]:
# Map the three irregular price strings to plain numbers, then strip '$' and ',' and cast to float
df = df.with_columns(
    pl.col('price')
    .replace({'$279,000+': '279000', 'Est. $138.8K': '138800', 'Est. $290K': '290000'})
    .str.replace('$', '', literal=True)
    .str.replace_all(',', '').cast(pl.Float32)
)
df.head()
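
The hard-coded replacements above cover the three irregular price strings found in this scrape. If the data is re-scraped, other variants may appear, so a more general parser could be useful. The following is a minimal sketch using only the standard library; `parse_price` is a hypothetical helper, not part of the original notebook, and it assumes prices look like `$279,000+`, `Est. $138.8K`, or `$1.2M`.

```python
import re

# Hypothetical helper: not part of the original notebook.
_PRICE_RE = re.compile(r'\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([KkMm])?')
_MULTIPLIERS = {'K': 1_000, 'M': 1_000_000}


def parse_price(raw):
    """Convert a listing price string to a float, or None if no number is found."""
    if raw is None:
        return None
    match = _PRICE_RE.search(raw)
    if match is None:
        return None
    # Drop thousands separators before converting the numeric part.
    value = float(match.group(1).replace(',', ''))
    # An optional K or M suffix scales the value; anything else (e.g. '+') is ignored.
    suffix = (match.group(2) or '').upper()
    return value * _MULTIPLIERS.get(suffix, 1)


for example in ['$279,000+', 'Est. $138.8K', 'Est. $290K', 'N/A']:
    print(example, '->', parse_price(example))
```

In polars, a function like this could be applied with `pl.col('price').map_elements(parse_price, return_dtype=pl.Float64)`, at the cost of running Python code row by row.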

In [None]:
# Extract the 5-digit ZIP code that follows "IL" at the end of the address
df = df.with_columns(
    pl.col('address')
    .str.extract(r'IL (\d{5})$')
    .alias('zip_code')
)

df['zip_code'].value_counts()

In [None]:
df.filter(pl.col('zip_code').is_null())

In [None]:
# Rows with this address have no ZIP code in the address string; assign 60601 by hand
df = df.with_columns(
    zip_code=pl.when(pl.col('address') == "Madison FP Plan, Madison")
    .then(pl.lit('60601'))
    .otherwise(pl.col('zip_code'))
)

In [None]:
# Preview the one-hot encoding of zip_code (the result is displayed, not assigned back to df)
df.to_dummies('zip_code')

In [None]:
df.filter(pl.col('square_footage').cast(pl.Float32, strict=False).is_null())['square_footage'].unique()

In [None]:
df = df.with_columns(
    pl.col('square_footage').cast(pl.Float32, strict=False)
)
df.head()

In [None]:
px.box(df, x='bathrooms', y='square_footage')

In [None]:
# Inspect the obvious outliers: listings over 8,000 sq ft with exactly 4 bathrooms
df.filter((pl.col('square_footage') > 8000) & (pl.col('bathrooms') == 4))

In [None]:
px.scatter(df, x='square_footage', y='price')

In [None]:
px.density_heatmap(df, x='bathrooms', y='bedrooms', z='square_footage', histfunc='avg')

In [None]:
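# Share of each bedroom count among listings with the same bathroom count
pct_by_bathrooms = df.with_columns(
    (
        pl.count('zip_code').over(['bathrooms', 'bedrooms'])
        / pl.count('zip_code').over(['bathrooms'])
    ).round(2).alias('percentage')
)
px.scatter(pct_by_bathrooms, x='bathrooms', y='bedrooms', size='percentage')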

In [None]:
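# Share of each bathroom count among listings with the same bedroom count
pct_by_bedrooms = df.with_columns(
    (
        pl.count('zip_code').over(['bathrooms', 'bedrooms'])
        / pl.count('zip_code').over(['bedrooms'])
    ).round(2).alias('percentage')
)
px.scatter(pct_by_bedrooms, x='bedrooms', y='bathrooms', size='percentage')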

All three columns (bedrooms, bathrooms, and square_footage) have some missing values.

We can first fill in missing bathrooms and bedrooms using the most common value of one given the other.
Then we can impute square footage with the median for each combination of bathrooms and bedrooms.

In [None]:
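def impute_bedrooms(num_bathrooms):
    """Rule-of-thumb bedroom count for a given bathroom count."""
    if num_bathrooms <= 2:
        return 2
    elif num_bathrooms <= 5:
        return num_bathrooms
    elif num_bathrooms <= 11:
        return num_bathrooms - 1
    else:
        return 12

def impute_bathrooms(num_bedrooms):
    """Rule-of-thumb bathroom count for a given bedroom count (same thresholds as above)."""
    # The original body referenced num_bathrooms, which is undefined here (NameError)
    if num_bedrooms <= 2:
        return 2
    elif num_bedrooms <= 5:
        return num_bedrooms
    elif num_bedrooms <= 11:
        return num_bedrooms - 1
    else:
        return 12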

In [None]:
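# IterativeImputer is still experimental in scikit-learn, so this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Model each of the three columns from the other two; sample_posterior=True draws imputed values
# from the predictive distribution rather than using the point estimate
imp = IterativeImputer(max_iter=10, random_state=42, sample_posterior=True)
imp.fit(df.select(['bedrooms', 'bathrooms', 'square_footage']))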

In [None]:
df[['bedrooms', 'bathrooms', 'square_footage']] = imp.transform(df.select(['bedrooms', 'bathrooms', 'square_footage'])).round(0)

In [None]:
df.select(pl.col('*').is_null().sum())

In [None]:
df

## Build a Decision Tree Model

- Use the scraped data to train a decision tree model.
- Experiment with different features to see which ones are most predictive of housing prices.
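
The notebook stops before this step, so the following is only a minimal sketch of how a decision tree regressor could be trained with scikit-learn. It uses synthetic data in place of the cleaned `df`; the feature set, the price formula, and the hyperparameters (`max_depth=5`, `min_samples_leaf=5`) are assumptions chosen for illustration, not results from the scraped data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned table (all values are invented).
rng = np.random.default_rng(42)
n = 300
bedrooms = rng.integers(1, 6, size=n)
bathrooms = rng.integers(1, 4, size=n)
square_footage = rng.uniform(500, 4000, size=n)
X = np.column_stack([bedrooms, bathrooms, square_footage])
y = 150 * square_footage + 20_000 * bedrooms + rng.normal(0, 10_000, size=n)

# Hold out 20% of the rows to estimate out-of-sample error.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Limit depth and leaf size so the tree does not memorize the training set.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)

mae = mean_absolute_error(y_test, tree.predict(X_test))
print(f"Test MAE: {mae:,.0f}")
```

With the real data, `X` would come from the cleaned columns (for example via `df.select([...]).to_numpy()`), and experimenting with features amounts to changing which columns go into `X` and comparing the held-out error.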

## Analysis and Reporting

- Analyze the results of your decision tree model.
- Discuss the features that were most influential in predicting housing prices.
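
One common way to see which features a fitted tree relies on is its `feature_importances_` attribute, which reports each feature's share of the total impurity reduction. The sketch below again uses synthetic data, in which price is driven mostly by square footage by construction, so the ranking it prints says nothing about the real Chicago listings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

feature_names = ["bedrooms", "bathrooms", "square_footage"]

# Synthetic data (invented): price depends mainly on square footage.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(1, 6, size=n),      # bedrooms
    rng.integers(1, 4, size=n),      # bathrooms
    rng.uniform(500, 4000, size=n),  # square_footage
])
y = 150 * X[:, 2] + 20_000 * X[:, 0] + rng.normal(0, 10_000, size=n)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X, y)

# Importances are normalized to sum to 1; higher means more impurity reduction.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a check on held-out data, such as permutation importance, may be worth adding before drawing conclusions in the report.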