# Phase 1: Business Understanding

## 1. Business Objective
The goal of this project is to **predict house sale prices** across California districts using demographic, geographic, and socioeconomic features.  

This project simulates the role of a data scientist working for a **real estate analytics firm** aiming to:
- Identify the factors that most influence housing prices.
- Build a predictive model that can estimate house prices for new areas.
- Provide actionable insights to real estate investors, city planners, and mortgage institutions.

**Key Business Question:**
> "Which factors most strongly affect housing prices in California, and can we build a reliable model to predict prices for new districts?"

---

## 2. Business Success Criteria
The project will be considered successful if:
- The predictive model achieves a **Root Mean Squared Error (RMSE)** below \$50,000 on unseen data.
- The analysis identifies **interpretable and actionable insights** about housing price drivers (e.g., median income, proximity to the ocean, population density).
- The final model is **deployable** and can be used for predicting housing values on new data (from the test set).

---

## 3. Data Mining Goals
From a technical standpoint, the objectives are to:
1. Understand relationships between variables and the target (`median_house_value`).
2. Clean, preprocess, and transform data for modeling.
3. Develop and evaluate predictive regression models.
4. Interpret and visualize results for clear communication.

---

## 4. Project Constraints
- Dataset is **static** (no live updates or streaming data).
- Analysis limited to the provided features — **no external datasets** will be integrated.
- The model must balance **accuracy** and **interpretability**.
- The entire pipeline must be **reproducible in Google Colab**.

---

## 5. Dataset Overview
The dataset represents **California housing block groups** (small geographic areas).  
Each row corresponds to a single block group with aggregated attributes.

**Features include:**
- `longitude`, `latitude` — location coordinates  
- `housing_median_age` — median age of houses  
- `total_rooms`, `total_bedrooms`, `population`, `households` — housing and population metrics  
- `median_income` — median income of residents  
- `ocean_proximity` — categorical variable indicating distance to the ocean  
- **Target variable:** `median_house_value`

The data is provided in two files:
- `train.csv` — for model training and validation  
- `test.csv` — for model testing and deployment

---

## 6. Risks and Assumptions
| Risk | Description | Mitigation |
|------|--------------|-------------|
| Missing values | Some numeric features (e.g., bedrooms) may have NaNs | Use median imputation |
| Skewed distributions | Features like `population` and `total_rooms` may be highly skewed | Apply log or power transformations |
| Multicollinearity | Correlated variables can distort model interpretation | Analyze correlations and drop redundant features |
| Outliers | Extreme income or house values may bias results | Identify and cap or remove outliers |
| Overfitting | Model may fit training data too closely | Use cross-validation and regularization |
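
Three of the mitigations in the table (median imputation, log transformation, and outlier capping) might look roughly like this in pandas. The toy frame, its values, and the 1st/99th percentile caps are assumptions chosen for illustration, not settings taken from the project:

```python
import numpy as np
import pandas as pd

# Toy data standing in for skewed numeric features (values are made up)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, np.nan, 39320.0],
    "population": [322.0, 2401.0, 496.0, 558.0, 35682.0],
})

# Missing values: fill with the column median
df["total_rooms"] = df["total_rooms"].fillna(df["total_rooms"].median())

# Skewed distributions: log1p compresses the long right tail and handles zeros
df["population_log"] = np.log1p(df["population"])

# Outliers: cap at the 1st and 99th percentiles (thresholds are an assumption)
lower, upper = df["population"].quantile([0.01, 0.99])
df["population_capped"] = df["population"].clip(lower=lower, upper=upper)

print(df)
```

In a real workflow the median and the percentile caps would normally be computed on the training split only and then reused on new data.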

---

## 7. Deliverables
1. A **predictive regression model** for California housing prices.  
2. A **complete CRISP-DM workflow** (Business Understanding → Deployment).  
3. A **Medium article** summarizing project methodology, findings, and insights.  
4. A **Colab notebook** and **GitHub repository** with reproducible code, visualizations, and results.

---

## 8. Tools and Techniques
| Category | Tools/Methods |
|-----------|----------------|
| Data Manipulation | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Modeling | scikit-learn (Linear Regression, Random Forest, Gradient Boosting) |
| Evaluation | RMSE, MAE, R², cross-validation |
| Deployment | joblib, Colab runtime, GitHub documentation |

---


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('./housing_prices_dataset/train.csv')
print(train.head())

   Id            Address  Sold Price  \
0   0        540 Pine Ln   3825000.0   
1   1     1727 W 67th St    505000.0   
2   2     28093 Pine Ave    140000.0   
3   3  10750 Braddock Dr   1775000.0   
4   4  7415 O Donovan Rd   1175000.0   

                                             Summary          Type  \
0  540 Pine Ln, Los Altos, CA 94022 is a single f...  SingleFamily   
1  HURRY, HURRY.......Great house 3 bed and 2 bat...  SingleFamily   
2  'THE PERFECT CABIN TO FLIP!  Strawberry deligh...  SingleFamily   
3  Rare 2-story Gated 5 bedroom Modern Mediterran...  SingleFamily   
4  Beautiful 200 acre ranch land with several pas...    VacantLand   

   Year built                                       Heating  \
0      1969.0  Heating - 2+ Zones, Central Forced Air - Gas   
1      1926.0                                   Combination   
2      1958.0                                    Forced air   
3      1947.0                                       Central   
4         NaN          

## Phase 2: Data Understanding
Perform exploratory data analysis (EDA) to examine data distributions, missing values, and correlations.

In [5]:
train.describe()

Statistic,Id,Sold Price,Year built,Lot,Bathrooms,Full bathrooms,Total interior livable area,Total spaces,Garage spaces,Elementary School Score,Elementary School Distance,Middle School Score,Middle School Distance,High School Score,High School Distance,Tax assessed value,Annual tax amount,Listed Price,Last Sold Price,Zip
count,47439.0,47439.0,46394.0,33258.0,43974.0,39574.0,44913.0,46523.0,46522.0,42543.0,42697.0,30734.0,30735.0,42220.0,42438.0,43787.0,43129.0,47439.0,29673.0,47439.0
mean,23719.0,1296050.0,1956.634888,235338.3,2.355642,2.094961,5774.587,1.567117,1.491746,5.720824,1.152411,5.317206,1.691593,6.134344,2.410366,786311.8,9956.843817,1315890.0,807853.7,93279.178587
std,13694.604047,1694452.0,145.802456,11925070.0,1.188805,0.96332,832436.3,9.011608,8.964319,2.10335,2.332367,2.002768,2.462879,1.984711,3.59612,1157796.0,13884.254976,2628695.0,1177903.0,2263.459104
min,0.0,100500.0,0.0,0.0,0.0,1.0,1.0,-15.0,-15.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,85611.0
25%,11859.5,565000.0,1946.0,4991.0,2.0,2.0,1187.0,0.0,0.0,4.0,0.3,4.0,0.6,5.0,0.8,254961.5,3467.0,574500.0,335000.0,90220.0
50%,23719.0,960000.0,1967.0,6502.0,2.0,2.0,1566.0,1.0,1.0,6.0,0.5,5.0,1.0,6.0,1.3,547524.0,7129.0,949000.0,598000.0,94114.0
75%,35578.5,1525000.0,1989.0,10454.0,3.0,2.0,2142.0,2.0,2.0,7.0,1.0,7.0,1.8,8.0,2.4,937162.5,12010.0,1498844.0,950000.0,95073.0
max,47438.0,90000000.0,9999.0,1897474000.0,24.0,17.0,176416400.0,1000.0,1000.0,10.0,57.2,9.0,57.2,10.0,73.9,45900000.0,552485.0,402532000.0,90000000.0,96155.0
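
The summary contains values that are not physically plausible: `Year built` ranges from 0 to 9999, and `Total spaces` and `Garage spaces` have a minimum of -15. A minimal sketch of flagging such rows before modeling follows. The valid ranges are assumptions for illustration, not rules taken from the dataset documentation:

```python
import pandas as pd

# Toy rows mimicking the anomalies seen in the summary (values are made up)
df = pd.DataFrame({
    "Year built": [1969.0, 0.0, 9999.0, 1947.0],
    "Total spaces": [2.0, -15.0, 1.0, 0.0],
})

# Assumed plausible ranges (inclusive); adjust to domain knowledge
valid_ranges = {
    "Year built": (1800, 2025),
    "Total spaces": (0, 100),
}

# Mark a row as suspicious if any checked column falls outside its range.
# Note: NaN also fails between(), so missing values would be flagged too.
suspicious = pd.Series(False, index=df.index)
for column, (low, high) in valid_ranges.items():
    suspicious |= ~df[column].between(low, high)

print(df[suspicious])
```

Flagged rows could then be inspected, set to missing, or dropped, depending on how many there are.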


In [4]:
train.isnull().sum()

Column,Missing count
Id,0
Address,0
Sold Price,0
Summary,354
Type,0
Year built,1045
Heating,6852
Cooling,20694
Parking,1374
Lot,14181
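
Raw counts are easier to compare as a share of rows; for example, `Cooling` is missing in 20,694 of 47,439 rows, about 44%. The sketch below computes missing shares and, for the correlation part of this phase, the correlation of each numeric column with the target. The toy frame is made up; on the real data the same calls would presumably be applied to `train`:

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the real column names (values are made up)
df = pd.DataFrame({
    "Sold Price": [500_000.0, 750_000.0, 1_200_000.0, 300_000.0],
    "Bathrooms": [2.0, 2.0, 4.0, 1.0],
    "Lot": [5000.0, np.nan, 9000.0, np.nan],
    "Cooling": ["Central", None, None, "Wall"],
})

# Share of missing values per column, largest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Pearson correlation of each numeric column with the target
correlations = (
    df.select_dtypes(include="number")
    .corr()["Sold Price"]
    .drop("Sold Price")
    .sort_values(ascending=False)
)
print(correlations)
```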


## Phase 3: Data Preparation
Impute missing values, encode categorical features, and scale numeric features.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

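# Separate the predictors from the target
X = train.drop('Sold Price', axis=1)
y = train['Sold Price']

# Split columns by dtype so each group gets its own preprocessing steps
num_features = X.select_dtypes(include=['int64','float64']).columns
cat_features = X.select_dtypes(exclude=['int64','float64']).columns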

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features)])
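
Because `cat_features` includes every non-numeric column, free-text fields such as `Address` and `Summary` are one-hot encoded too, producing roughly one column per distinct value. One possible refinement is to keep only low-cardinality categoricals, on the assumption that identifier and free-text columns carry little signal once one-hot encoded. The helper `split_features` and the `max_categories` threshold are hypothetical:

```python
import pandas as pd


def split_features(X, max_categories=20):
    """Return (numeric, low-cardinality categorical, dropped) column lists."""
    numeric = X.select_dtypes(include="number").columns.tolist()
    other = X.select_dtypes(exclude="number").columns
    categorical = [c for c in other if X[c].nunique() <= max_categories]
    dropped = [c for c in other if c not in categorical]
    return numeric, categorical, dropped


# Toy frame with a few of the real column names (values are made up)
X_demo = pd.DataFrame({
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "Type": ["SingleFamily", "SingleFamily", "VacantLand", "Condo"],
    "Bathrooms": [2.0, 1.0, 0.0, 2.0],
})

num_cols, cat_cols, dropped_cols = split_features(X_demo, max_categories=3)
print(num_cols, cat_cols, dropped_cols)
```

The returned lists could be passed to `ColumnTransformer` in place of `num_features` and `cat_features`. A numeric identifier such as `Id` would still need to be dropped separately.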

## Phase 4: Modeling

Train a regression model on the preprocessed data.

The training data is first reduced to a random sample of 5,000 rows, which is then split 80/20 into training and test subsets.

In [16]:
# Reduce the size of the training data by sampling
train_sampled = train.sample(n=5000, random_state=42)

X_sampled = train_sampled.drop('Sold Price', axis=1)
y_sampled = train_sampled['Sold Price']

# Split the sampled data
X_train_sampled, X_test_sampled, y_train_sampled, y_test_sampled = train_test_split(
    X_sampled, y_sampled, test_size=0.2, random_state=42
)

# Replace the original training data with the sampled data
X_train = X_train_sampled
y_train = y_train_sampled

print(f"Original training data size: {len(X)}")
print(f"Sampled training data size: {len(X_train)}")

Original training data size: 47439
Sampled training data size: 4000


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

model = RandomForestRegressor(n_estimators=100, random_state=42)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', model)])

pipeline.fit(X_train, y_train)
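
A single fit gives no indication of how stable the model is. The cross-validation listed under evaluation techniques could be applied to the whole pipeline so that imputation is refit within each fold. A minimal sketch on synthetic data follows; the data, the 3-fold setting, and the small forest are assumptions made to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for the housing features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# scikit-learn maximizes scores, so RMSE is exposed as a negated value
scores = cross_val_score(
    demo_pipeline, X_demo, y_demo, cv=3, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

A large spread across folds would suggest that a single hold-out score is not a reliable estimate.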

## Phase 5: Evaluation
Evaluate model performance and inspect feature importances.

In [21]:
# Access feature importances from the regressor within the pipeline
importances = pipeline.named_steps['regressor'].feature_importances_
print(importances)

[9.36882205e-04 1.09699125e-03 1.38614795e-03 ... 8.86547164e-11
 3.86284016e-08 4.62133654e-08]


### Subtask: Evaluate Model Performance with Log RMSE

Evaluate the trained model using the root mean squared error (RMSE) between the logarithms of the predicted and observed sale prices.
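
Written out for $n$ held-out sales with observed prices $y_i$ and predictions $\hat{y}_i$, and assuming the `log1p` transform (the logarithm of one plus the price) used in the code for this subtask, the metric is:

```latex
\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^{2}}
```

Because the errors are measured on a log scale, the metric reflects relative rather than absolute price differences, so expensive homes do not dominate the score.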

In [22]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Make predictions on the test data
y_pred = pipeline.predict(X_test_sampled)

# Calculate the log of the predicted and actual values
y_test_sampled_log = np.log1p(y_test_sampled)
y_pred_log = np.log1p(y_pred)

# Calculate the log RMSE
log_rmse = np.sqrt(mean_squared_error(y_test_sampled_log, y_pred_log))

print(f"Log Root Mean Squared Error (Log RMSE): {log_rmse}")

Log Root Mean Squared Error (Log RMSE): 0.1727210572795126
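
A log RMSE of about 0.17 is not directly comparable with the dollar-denominated RMSE target in the success criteria. A sketch of reporting dollar-scale RMSE together with the MAE and R² listed under evaluation tools follows. The arrays are made-up stand-ins for `y_test_sampled` and `y_pred`, not project results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up stand-ins for the hold-out targets and predictions
y_true_demo = np.array([500_000.0, 750_000.0, 1_200_000.0, 300_000.0])
y_pred_demo = np.array([520_000.0, 700_000.0, 1_150_000.0, 330_000.0])

rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))  # dollars
mae = mean_absolute_error(y_true_demo, y_pred_demo)           # dollars
r2 = r2_score(y_true_demo, y_pred_demo)                       # unitless

print(f"RMSE: {rmse:,.0f}  MAE: {mae:,.0f}  R^2: {r2:.3f}")
```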


## Phase 6: Deployment
Save the model for reuse and demonstrate prediction.

In [20]:
import joblib
# Save the full pipeline (preprocessing + regressor) so raw rows can be passed to predict()
joblib.dump(pipeline, 'california_housing_model.pkl')
print('Model saved!')

Model saved!
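
To demonstrate prediction, a saved pipeline can be reloaded with `joblib.load` and applied to new rows. The sketch below is self-contained and uses a toy pipeline and a temporary file; the file name `demo_model.pkl` and the data are illustrative only:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data (values are made up)
X_demo = pd.DataFrame({"Bathrooms": [1.0, 2.0, 3.0, 4.0]})
y_demo = [300_000.0, 600_000.0, 900_000.0, 1_200_000.0]

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", LinearRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, path)   # persist preprocessing and model together
    reloaded = joblib.load(path)       # restore, as a later session would

# New rows must have the same columns the pipeline was trained on
new_rows = pd.DataFrame({"Bathrooms": [2.5]})
prediction = reloaded.predict(new_rows)
print(prediction)
```

Persisting the whole pipeline rather than only the regressor keeps the preprocessing attached, so new rows can be passed in their raw form.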
