# ***Houston Housing Price Prediction using Random Forest Regressor***

## Objective:
Predict the listing price (`unformattedPrice`) of a property in Houston from features such as Zestimate, area, and number of beds, using a Random Forest Regressor.

## Workflow Overview:
- **Import Dependencies** – Load required Python libraries
- **Load & Explore Data** – Load JSON housing data and inspect structure
- **Data Cleaning & Preprocessing** – Drop irrelevant columns, impute missing values
- **Feature Engineering** – Encode categorical variables using one-hot encoding
- **Train-Test Split** – Divide data into training and testing sets
- **Model Training** – Train a Random Forest Regressor
- **Model Evaluation** – Evaluate model using MAE (Mean Absolute Error)
- **Feature Importance** – Identify top predictors
- **Baseline Metrics** – Compare MAE to average house price

## Problem Type:
- Regression Problem
- Target Variable: unformattedPrice (continuous value representing house price in USD)

## Dataset:
- Source: Zillow Houston Housing Market Dataset (2024)
- Format: JSON
- Rows: ~25,900 properties (`house_data.shape` reports 25,948 rows and 58 columns)
- Features: Street address, beds, baths, Zestimate, area, price, images, and many binary attributes


### Step 1: Import Dependencies
Load libraries for data processing, modeling, and evaluation.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

### Step 2: Load & Explore the Dataset
Read the JSON file and understand its structure and contents.

In [6]:
# Unzip the dataset
!unzip -q houston_housing_market_2024_light.json.zip -d houston_house_data

In [7]:
# List files to verify
!ls houston_house_data

'houston_housing market 2024_light.json'


In [8]:
# Load the lighter JSON file
house_data = pd.read_json("/content/houston_house_data/houston_housing market 2024_light.json")

Initial structure and missing-value summary of the data

In [9]:
house_data.head()

Unnamed: 0,zpid,id,rawHomeStatusCd,marketingStatusSimplifiedCd,imgSrc,hasImage,detailUrl,statusType,statusText,countryCurrency,...,builderName,streetViewURL,streetViewMetadataURL,isPropertyResultCDP,flexFieldText,flexFieldType,info3String,info6String,info2String,availabilityDate
0,305340899,305340899,ForSale,For Sale by Agent,https://photos.zillowstatic.com/fp/c6c062dff72...,1.0,https://www.zillow.com/homedetails/13234-Valle...,FOR_SALE,House for sale,$,...,,,,,,,,,,
1,160934721,160934721,ForSale,For Sale by Agent,https://photos.zillowstatic.com/fp/dca8e79a0d4...,1.0,https://www.zillow.com/homedetails/12235-Green...,FOR_SALE,House for sale,$,...,,,,,,,,,,
2,121106557,121106557,ForSale,For Sale by Agent,https://photos.zillowstatic.com/fp/5d7fc370754...,1.0,https://www.zillow.com/homedetails/13807-Bend-...,FOR_SALE,House for sale,$,...,,,,,,,,,,
3,28549516,28549516,ForSale,For Sale by Agent,https://photos.zillowstatic.com/fp/f7c3f5735ac...,1.0,https://www.zillow.com/homedetails/14606-Kings...,FOR_SALE,House for sale,$,...,,,,,,,,,,
4,123238565,123238565,ForSale,For Sale by Agent,https://photos.zillowstatic.com/fp/274024ea250...,1.0,https://www.zillow.com/homedetails/14434-Mount...,FOR_SALE,House for sale,$,...,,,,,,,,,,


In [10]:
house_data.shape

(25948, 58)

In [11]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25948 entries, 0 to 25947
Data columns (total 58 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   zpid                         25948 non-null  int64  
 1   id                           25948 non-null  int64  
 2   rawHomeStatusCd              25948 non-null  object 
 3   marketingStatusSimplifiedCd  25948 non-null  object 
 4   imgSrc                       25948 non-null  object 
 5   hasImage                     25839 non-null  float64
 6   detailUrl                    25948 non-null  object 
 7   statusType                   25948 non-null  object 
 8   statusText                   25948 non-null  object 
 9   countryCurrency              25948 non-null  object 
 10  price                        25948 non-null  object 
 11  unformattedPrice             25948 non-null  int64  
 12  address                      25948 non-null  object 
 13  addressStreet   

In [12]:
house_data.isnull().sum()

zpid                           0
id                             0
rawHomeStatusCd                0
marketingStatusSimplifiedCd    0
imgSrc                         0
hasImage                     109
detailUrl                      0
statusType                     0
statusText                     0
countryCurrency                0


### Step 3: Data Cleaning & Preprocessing
Drop irrelevant columns and handle missing values based on data skewness.


In [13]:
# Drop columns with a high share of missing values
columns_to_drop = [
    "variableData", "hasOpenHouse", "openHouseStartDate",
    "openHouseEndDate", "openHouseDescription", "lotAreaString",
    "providerListingId", "builderName", "streetViewURL",
    "streetViewMetadataURL", "isPropertyResultCDP", "flexFieldText",
    "flexFieldType", "info3String", "info6String",
    "info2String", "availabilityDate",
]

house_data.drop(columns=columns_to_drop, inplace=True)

In [14]:
# Correlation matrix
correlation = house_data.corr(numeric_only=True)

# Show correlation with target
correlation['unformattedPrice'].sort_values(ascending=False)

unformattedPrice             1.000000
zestimate                    0.996109
area                         0.623328
baths                        0.269873
beds                         0.267893
hasVideo                     0.024171
isShowcaseListing            0.016465
zpid                         0.003945
id                           0.003945
hasAdditionalAttributions   -0.000671


In [15]:
# Check percentage of missing values
missing_percent = house_data.isnull().mean().sort_values(ascending=False) * 100
print(missing_percent)

zestimate                      17.061045
brokerName                     16.016649
area                           12.162787
beds                           12.151226
hasImage                        0.420071
carouselPhotos                  0.342994
baths                           0.142593
zpid                            0.000000
id                              0.000000
statusText                      0.000000
statusType                      0.000000
detailUrl                       0.000000
imgSrc                          0.000000
rawHomeStatusCd                 0.000000
marketingStatusSimplifiedCd     0.000000
addressStreet                   0.000000
countryCurrency                 0.000000
addressZipcode                  0.000000
addressState                    0.000000
addressCity                     0.000000
isUndisclosedAddress            0.000000
latLong                         0.000000
price                           0.000000
unformattedPrice                0.000000
address         

In [16]:
# Compute skewness to decide which statistic (mean or median) to impute with
print(f"Bed Skewness: {house_data['beds'].skew()}")
print(f"area Skewness: {house_data['area'].skew()}")
print(f"zestimate Skewness: {house_data['zestimate'].skew()}")
print(f"baths Skewness: {house_data['baths'].skew()}")

Bed Skewness: 0.7797275640195557
area Skewness: 14.093616987829206
zestimate Skewness: 15.694127856538481
baths Skewness: 0.8159236495361571



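**Interpretation (rule of thumb):**
- Skewness ≈ 0 → approximately symmetric
- Skewness > 0.5 → moderately right-skewed
- Skewness > 1 → highly right-skewed
- Skewness < -0.5 → moderately left-skewed
- Skewness < -1 → highly left-skewed

Here `area` and `zestimate` are highly right-skewed, and `beds` and `baths` are moderately right-skewed, so the median is used for imputation because it is less sensitive to extreme values than the mean.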
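
The rule of thumb above can be turned into a small helper. This is only a sketch on toy data: the function name `choose_fill_value` and the 0.5 cutoff are assumptions for illustration, not part of the original notebook, which uses the median for all four numeric columns.

```python
import numpy as np
import pandas as pd

def choose_fill_value(series: pd.Series, skew_threshold: float = 0.5) -> float:
    """Return the median for skewed data, otherwise the mean.

    skew_threshold is a hypothetical cutoff based on the rule of thumb above.
    """
    if abs(series.skew()) > skew_threshold:
        return float(series.median())
    return float(series.mean())

# Toy data: one symmetric column and one with a large outlier
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, np.nan],
})

# One fill value per column, applied in a single fillna call
fill_values = {col: choose_fill_value(toy[col]) for col in toy.columns}
toy_filled = toy.fillna(fill_values)
print(fill_values)
```

With the skewness values reported above (all greater than 0.5), this helper would also pick the median for `beds`, `baths`, `area`, and `zestimate`.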

**Impute Missing Values**

In [17]:
# Numerical columns (use median). Assigning the result back to the column
# avoids the pandas chained-assignment FutureWarning raised by inplace=True.
house_data['zestimate'] = house_data['zestimate'].fillna(house_data['zestimate'].median())
house_data['area'] = house_data['area'].fillna(house_data['area'].median())
house_data['beds'] = house_data['beds'].fillna(house_data['beds'].median())
house_data['baths'] = house_data['baths'].fillna(house_data['baths'].median())

In [18]:
# Categorical/binary columns (use mode). Assign back instead of using inplace=True
# on a column selection, which raises a FutureWarning in recent pandas versions.
house_data['brokerName'] = house_data['brokerName'].fillna(house_data['brokerName'].mode()[0])
house_data['hasImage'] = house_data['hasImage'].fillna(house_data['hasImage'].mode()[0])


**Code and meaning:**
- **`.mode()`**: returns a Series of the most frequent value(s); it holds more than one entry when values tie.
- **`.mode()[0]`**: picks the first mode, giving a single value suitable for filling.
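
A short illustration of the difference, using a made-up Series of broker names (the values are placeholders, not taken from the dataset):

```python
import pandas as pd

# Two names tie for most frequent; the missing entry is ignored by mode()
brokers = pd.Series(["A Realty", "B Homes", "A Realty", "B Homes", None])

modes = brokers.mode()   # Series with every most-frequent value, sorted
first_mode = modes[0]    # single value used for filling

filled = brokers.fillna(first_mode)
print(modes.tolist(), first_mode)
```

Because `.mode()` sorts its result, `.mode()[0]` is deterministic even when several values tie.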

In [19]:
# drop irrelevant columns
house_data.drop(columns=['carouselPhotos'], inplace=True)

In [20]:
print(house_data.isnull().sum())

zpid                           0
id                             0
rawHomeStatusCd                0
marketingStatusSimplifiedCd    0
imgSrc                         0
hasImage                       0
detailUrl                      0
statusType                     0
statusText                     0
countryCurrency                0
price                          0
unformattedPrice               0
address                        0
addressStreet                  0
addressCity                    0
addressState                   0
addressZipcode                 0
isUndisclosedAddress           0
beds                           0
baths                          0
area                           0
latLong                        0
isZillowOwned                  0
hdpData                        0
isSaved                        0
isUserClaimingOwner            0
isUserConfirmedClaim           0
pgapt                          0
sgapt                          0
zestimate                      0
shouldShow

In [21]:
house_data.shape

(25948, 40)

### Step 4: Feature Engineering
Remove unsupported (list/dict) columns and encode remaining categorical features.

In [23]:
# Work on a copy so the cleaned DataFrame stays intact
house_model = house_data.copy()

# Drop columns that contain lists or dicts
for col in house_model.columns:
    if house_model[col].apply(lambda x: isinstance(x, (dict, list))).any():
        print(f"Dropping column with dict/list: {col}")
        house_model.drop(columns=[col], inplace=True)

Dropping column with dict/list: latLong
Dropping column with dict/list: hdpData
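
Dropping `latLong` discards location information. As a possible alternative, a dict column can be expanded into separate numeric columns. The sketch below runs on toy data and assumes each `latLong` entry is a dict with `latitude` and `longitude` keys; those key names are an assumption, since the notebook never prints the column's contents.

```python
import pandas as pd

# Toy stand-in for house_data; the dict key names are assumed
toy = pd.DataFrame({
    "zpid": [1, 2, 3],
    "latLong": [
        {"latitude": 29.76, "longitude": -95.37},
        {"latitude": 29.80, "longitude": -95.40},
        None,  # a row without coordinates
    ],
})

# Replace non-dict entries with empty dicts so they expand to NaN
records = [v if isinstance(v, dict) else {} for v in toy["latLong"]]
coords = pd.DataFrame(records, index=toy.index)

# Swap the dict column for the expanded numeric columns
toy_expanded = pd.concat([toy.drop(columns=["latLong"]), coords], axis=1)
print(toy_expanded)
```

The resulting columns are plain floats, so they can be passed to the model without one-hot encoding.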


In [24]:
# Select object-type columns
categorical_cols = house_model.select_dtypes(include='object').columns

# Perform one-hot encoding
house_model_encoded = pd.get_dummies(house_model, columns=categorical_cols, drop_first=True)
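
One-hot encoding every object column creates one dummy per distinct value, so near-unique columns such as URLs or street addresses produce thousands of columns, and the formatted `price` string restates the target. The following sketch shows one way to inspect cardinality before encoding. It is a variation on the cell above, not the notebook's method; `MAX_CATEGORIES` and `leaky_cols` are hypothetical names, and the threshold is arbitrary.

```python
import pandas as pd

# Toy stand-in for house_model
toy = pd.DataFrame({
    "addressCity": ["Houston", "Houston", "Katy", "Houston"],
    "detailUrl": ["u1", "u2", "u3", "u4"],                      # unique per row
    "price": ["$100,000", "$200,000", "$300,000", "$400,000"],  # formatted target
    "unformattedPrice": [100000, 200000, 300000, 400000],
})

MAX_CATEGORIES = 2      # hypothetical cutoff for "low cardinality"
leaky_cols = ["price"]  # columns that restate the target

# Non-numeric candidate columns and their number of distinct values
candidates = toy.drop(columns=["unformattedPrice"]).select_dtypes(exclude="number").columns
cardinality = toy[candidates].nunique()

keep = [c for c in candidates if cardinality[c] <= MAX_CATEGORIES and c not in leaky_cols]
dropped = [c for c in candidates if c not in keep]

# Encode only the low-cardinality, non-leaky columns
toy_encoded = pd.get_dummies(toy.drop(columns=dropped), columns=keep, drop_first=True)
print(cardinality.to_dict(), list(toy_encoded.columns))
```

On the real data, a sensible threshold would depend on how many rows are available per category.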

### Step 5: Split Features and Target
Downsample the encoded data, then separate the input features (X) from the target price (Y).

In [30]:
# Downsample to 2,000 rows to avoid memory issues
house_model_sampled = house_model_encoded.sample(n=2000, random_state=42)

In [31]:
# Separate features and target
X = house_model_sampled.drop(columns='unformattedPrice')
Y = house_model_sampled['unformattedPrice']

### Step 6: Train-Test Split
Split the data into an 80% training set and a 20% test set.

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
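
For the 2,000 sampled rows, `test_size=0.2` yields 1,600 training rows and 400 test rows. A minimal check on placeholder arrays of the same length (the `_demo` names are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the sampled dataset
X_demo = np.arange(2000).reshape(-1, 1)
y_demo = np.arange(2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A fixed random_state makes the split reproducible across runs
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```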

### Step 7: Train Random Forest Model
Train a Random Forest Regressor on the training data.

In [35]:
# Train Random Forest
rf = RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)
rf.fit(X_train, Y_train)

In [36]:
# Predict
y_pred = rf.predict(X_test)

**Why Random Forest?**
- Handles non-linear relationships well
- Relatively robust to outliers (missing values were imputed in Step 3 rather than left to the model)
- Works well with mixed feature types
- Provides feature importance
- Requires minimal preprocessing
- A strong baseline model for tabular regression

### Step 8: Model Evaluation
Evaluate model performance using Mean Absolute Error (MAE).

In [37]:
mae = mean_absolute_error(Y_test, y_pred)
print(f"MAE on test set: ${mae:,.2f}")

MAE on test set: $111,063.29
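
MAE is the average of the absolute differences between actual and predicted prices, (1/n) Σ |y_i − ŷ_i|. The sketch below computes it by hand on three made-up prices and compares the result with `mean_absolute_error`; the numbers are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([200_000, 350_000, 500_000])
y_hat = np.array([210_000, 330_000, 560_000])

# Absolute errors are 10,000, 20,000 and 60,000; their mean is the MAE
manual_mae = np.mean(np.abs(y_true - y_hat))
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(f"manual: ${manual_mae:,.2f}  sklearn: ${sklearn_mae:,.2f}")
```

Because errors are not squared, a single very expensive listing influences MAE less than it would influence RMSE.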


### Step 9: Feature Importance
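Identify which features contribute most to the model's predictions. Note that, apart from `zestimate`, the top entries in the output are one-hot columns for a single address, URL, image, or formatted `price` string. These identify individual listings (and, in the case of `price`, restate the target) rather than describe general property traits.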

In [44]:
# Feature importances from previous RF model
importances = rf.feature_importances_
features = X_train.columns
importance_series = pd.Series(importances, index=features).sort_values(ascending=False)
print(importance_series.head(20))

zestimate                                                                                           0.516123
address_0 Fm 762 Rd, Richmond, TX 77469                                                             0.079872
addressStreet_0 Fm 762 Rd                                                                           0.069252
price_$15,500,000                                                                                   0.056752
detailUrl_https://www.zillow.com/homedetails/0-Fm-762-Rd-Richmond-TX-77469/2066674469_zpid/         0.054338
imgSrc_https://photos.zillowstatic.com/fp/61e50b4c65fe1a8cb2e94a995fbfc1d6-p_e.jpg                  0.022578
addressStreet_2819 Newhoff St                                                                       0.019676
detailUrl_https://www.zillow.com/homedetails/2819-Newhoff-St-Houston-TX-77026/27800263_zpid/        0.016456
imgSrc_https://photos.zillowstatic.com/fp/34982b4245454703e0a1510a029d57fb-p_e.jpg                  0.013489
price_$7,200,000   
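
Since `pd.get_dummies` names each dummy `<column>_<value>`, importances can be summed back to their source column to see which original fields matter. This is a sketch on a hand-written Series; the importance values are loosely rounded for illustration, and the grouping assumes the original column names contain no underscore.

```python
import pandas as pd

# Illustrative importances keyed by encoded feature name
toy_importance = pd.Series({
    "zestimate": 0.50,
    "address_0 Fm 762 Rd, Richmond, TX 77469": 0.08,
    "addressStreet_0 Fm 762 Rd": 0.07,
    "addressStreet_2819 Newhoff St": 0.02,
    "price_$15,500,000": 0.06,
})

# Group by the text before the first underscore (the source column name)
grouped = (
    toy_importance
    .groupby(lambda name: name.split("_", 1)[0])
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```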

### Step 10: Baseline Comparison
Compare the model's MAE to the average price in the 2,000-row sample (`Y`).

In [45]:
print(f"Average home price: ${Y.mean():,.2f}")

Average home price: $521,595.48
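
Another common reference point is the MAE of a model that always predicts a constant, such as the training median. The sketch below uses scikit-learn's `DummyRegressor` on made-up prices. It is not part of the original notebook, and the `_demo` arrays stand in for `Y_train` and `Y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices standing in for Y_train / Y_test
y_train_demo = np.array([100_000, 200_000, 300_000, 400_000, 2_000_000])
y_test_demo = np.array([150_000, 250_000, 900_000])

# DummyRegressor ignores the features, so zero-filled placeholders suffice
baseline = DummyRegressor(strategy="median")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
baseline_pred = baseline.predict(np.zeros((len(y_test_demo), 1)))

baseline_mae = mean_absolute_error(y_test_demo, baseline_pred)
print(f"Baseline MAE: ${baseline_mae:,.2f}")
```

A trained model is only useful to the extent that its MAE is clearly below this constant-prediction MAE.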


**Final Notes:**
- Model MAE: ~$111,000
- Average home price: ~$521,600
- The MAE is therefore about 21% of the average price ($111,063 / $521,595 ≈ 0.21), which is reasonable for real estate price prediction, especially with minimal feature tuning.
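
The percentage follows directly from the two figures reported in Steps 8 and 10:

```python
mae = 111_063.29         # MAE reported in Step 8
mean_price = 521_595.48  # average price reported in Step 10

# MAE expressed as a fraction of the average price
relative_error = mae / mean_price
print(f"MAE as a share of the average price: {relative_error:.1%}")
```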
