
# Task 02 — House Price Prediction (Regression)

**Files this notebook will produce (replace `TeamCode` and `TeamName` with your actual team code/name before submission):**

- `TeamCode_TeamName_Task2_HousePrice.ipynb` — this notebook (you already have it).
- `TeamCode_TeamName_Task2_Metrics.png` — screenshot image with the evaluation metrics (RMSE, MAE, R²).
- `TeamCode_TeamName_Task2_Predictions.csv` — final submission CSV (**must** contain `house_id` and `predicted_price` columns).
- `TeamCode_TeamName_Task2_FinalModel.joblib` — trained final model (optional to submit).

**How to run (Google Colab recommended):**
1. Open this notebook in Colab (`File > Upload notebook`), then upload `train.csv` and `test.csv` into the Colab session. If the files are missing, the data-loading cell prompts you to upload them.
2. Run the cells from top to bottom. The notebook provides default preprocessing, modeling, and evaluation steps.
3. After running, the predictions CSV and metrics PNG are saved in the runtime working directory. Download them and add them to your GitHub repo under the required file names listed above.

> **Important:** Replace `TeamCode` and `TeamName` in filenames before uploading to the competition (or rename the produced files accordingly).


In [None]:

# Imports
import os
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import joblib

print('Environment ready. Current working dir:', os.getcwd())


In [None]:

# ---- Load train.csv and test.csv ----
# This block works in regular Jupyter and in Google Colab.
def _read_with_date(path):
    cols = pd.read_csv(path, nrows=0).columns.tolist()
    if 'sale_date' in cols:
        return pd.read_csv(path, parse_dates=['sale_date'])
    else:
        return pd.read_csv(path)

train_path = 'train.csv'
test_path = 'test.csv'

if os.path.exists(train_path) and os.path.exists(test_path):
    train = _read_with_date(train_path)
    test = _read_with_date(test_path)
else:
    # Colab upload flow
    try:
        from google.colab import files
        print('Please upload train.csv and test.csv in the file chooser. You can upload both at once.')
        uploaded = files.upload()
        # uploaded is a dict of filename -> bytes
        for fn in uploaded.keys():
            print('Uploaded:', fn)
        train = _read_with_date('train.csv')
        test = _read_with_date('test.csv')
    except Exception as e:
        raise FileNotFoundError('train.csv and test.csv not found. Upload them to the working directory or use Colab files.upload().') from e

print('Train shape:', train.shape)
print('Test shape:', test.shape)
train.head(3)


In [None]:

# Quick overview
display(train.describe(include='all').T)
print('\nMissing values in train:')
print(train.isna().sum().sort_values(ascending=False).head(20))
print('\nColumns:', train.columns.tolist())


In [None]:
# ✅ Safe and robust feature engineering function
def feature_engineer(df, zip_count_map=None, is_train=True):
    df = df.copy()

    # Ensure numeric columns are numeric
    for col in ['built_year', 'renovated_year', 'living_area', 'lot_area', 
                'above_area', 'basement_area', 'num_bedrooms', 'num_bathrooms', 
                'num_floors', 'view_rating', 'condition_index', 'construction_grade',
                'neighbor_living_area', 'neighbor_lot_area']:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

    # Parse sale_date safely
    if 'sale_date' in df.columns:
        df['sale_year'] = pd.to_datetime(df['sale_date'], errors='coerce').dt.year
    else:
        df['sale_year'] = pd.Timestamp.now().year  # fallback

    # Fill missing years with median to avoid NaN issues
    if 'built_year' in df.columns:
        df['built_year'] = df['built_year'].fillna(df['built_year'].median())

    # Age feature
    df['age'] = df['sale_year'] - df['built_year']

    # Renovation features
    if 'renovated_year' in df.columns:
        df['renovated_year'] = df['renovated_year'].fillna(0)
        df.loc[df['renovated_year'] < 0, 'renovated_year'] = 0
        df['years_since_renov'] = df['sale_year'] - df['renovated_year']
        df.loc[df['renovated_year'] <= 0, 'years_since_renov'] = 0
        df['was_renovated'] = (df['renovated_year'] > 0).astype(int)
    else:
        df['years_since_renov'] = 0
        df['was_renovated'] = 0

    # Area-related features
    df['above_area'] = df['above_area'].fillna(0)
    df['basement_area'] = df['basement_area'].fillna(0)
    df['total_area'] = df['above_area'] + df['basement_area']
    df['has_basement'] = (df['basement_area'] > 0).astype(int)

    # Ratio feature
    df['lot_living_ratio'] = df['lot_area'] / (df['living_area'].replace(0, np.nan))
    df['lot_living_ratio'] = df['lot_living_ratio'].replace([np.inf, -np.inf], np.nan).fillna(0)

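    # Zip frequency encoding: counts are learned on train and reused for test
    if 'zip_area' in df.columns:
        if is_train:
            zip_count_map = df['zip_area'].value_counts().to_dict()
        elif zip_count_map is None:
            zip_count_map = {}  # no training counts supplied: treat every zip as unseen
        df['zip_freq'] = df['zip_area'].map(zip_count_map).fillna(0)
    else:
        df['zip_freq'] = 0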

    return df, zip_count_map


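# Apply feature engineering: learn zip counts on train, reuse them on test
train_fe, zip_map = feature_engineer(train, is_train=True)
test_fe, _ = feature_engineer(test, zip_map, is_train=False)

print('✅ Feature engineering complete. Example:')
# Show only the preview columns that exist (sale_date may be absent)
preview_cols = ['sale_date', 'sale_year', 'age', 'years_since_renov', 'was_renovated', 'total_area']
display(train_fe[[c for c in preview_cols if c in train_fe.columns]].head())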
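The frequency encoding in `feature_engineer` learns zip counts from the training frame and reuses them for the test frame, so a zip code that never appears in training maps to 0. A minimal sketch of that idea on made-up data follows; the zip values and the helper name `freq_encode` are illustrative assumptions, not part of the competition dataset or this notebook's API.

```python
import pandas as pd

def freq_encode(train_col, other_col):
    """Map each category to its count in train_col; unseen categories get 0."""
    counts = train_col.value_counts().to_dict()
    return train_col.map(counts), other_col.map(counts).fillna(0).astype(int)

# Made-up zip codes, for illustration only
toy_train = pd.Series([98001, 98001, 98002, 98003, 98001])
toy_test = pd.Series([98001, 98003, 99999])  # 99999 never appears in training

train_freq, test_freq = freq_encode(toy_train, toy_test)
print(train_freq.tolist())  # [3, 3, 1, 1, 3]
print(test_freq.tolist())   # [3, 1, 0]
```

Learning the counts on the training data only keeps the test set from influencing the encoding.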


In [None]:

# Prepare feature list
target_col = 'target_price'
id_col = 'house_id'

# Drop columns that won't be used as features (but keep id_col for test)
drop_cols = ['sale_date'] if 'sale_date' in train_fe.columns else []

# Select candidate numerical features
numeric_feats = [
    'num_bedrooms','num_bathrooms','living_area','lot_area','num_floors',
    'is_waterfront','view_rating','condition_index','construction_grade',
    'above_area','basement_area','built_year','renovated_year',
    'neighbor_living_area','neighbor_lot_area',
    # engineered
    'sale_year','age','years_since_renov','total_area','has_basement','lot_living_ratio','zip_freq'
]

# Keep only columns that exist
numeric_feats = [c for c in numeric_feats if c in train_fe.columns]

# Treat 'zip_area' cautiously: one-hot encode it only if it has few unique values (<= 20).
cat_feats = []
if 'zip_area' in train_fe.columns:
    if train_fe['zip_area'].nunique() <= 20:
        cat_feats.append('zip_area')  # will one-hot encode
    else:
        # we already encoded zip_freq; skip one-hot to avoid huge dim
        pass

print('Numeric features used:', numeric_feats)
print('Categorical features used:', cat_feats)

# Train/validation split
X = train_fe[numeric_feats + cat_feats].copy()
y = train_fe[target_col].copy()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=42)

print('X_train shape:', X_train.shape, 'X_val shape:', X_val.shape)


In [None]:

# Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

if len(cat_feats) > 0:
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_feats),
        ('cat', categorical_transformer, cat_feats)
    ])
else:
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_feats)
    ])

# Example: a pipeline with RandomForest
rf_pipeline = Pipeline(steps=[
    ('preproc', preprocessor),
    ('model', RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1))
])

# Quick baseline: Linear Regression pipeline
lr_pipeline = Pipeline(steps=[
    ('preproc', preprocessor),
    ('model', LinearRegression())
])
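To see what the preprocessing above does to a single row, here is a small sketch on made-up data. It assumes one numeric column and one categorical column, and it sets `sparse_threshold=0.0` only so the result prints as a dense array; the toy frames and the name `toy_preproc` are illustrative and separate from the notebook's `preprocessor`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows, for illustration only
toy_fit = pd.DataFrame({
    'living_area': [1000.0, 1500.0, np.nan, 2000.0],
    'zip_area': ['A', 'B', 'A', 'B'],
})
toy_new = pd.DataFrame({'living_area': [1500.0], 'zip_area': ['C']})  # 'C' is unseen

toy_preproc = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), ['living_area']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['zip_area']),
    ],
    sparse_threshold=0.0,  # always return a dense array so it prints cleanly
)

toy_preproc.fit(toy_fit)
encoded = toy_preproc.transform(toy_new)
# One scaled numeric column followed by two one-hot columns (for 'A' and 'B')
print(encoded)
```

The missing `living_area` is imputed with the median (1500), so the new row's value of 1500 scales to 0. Because of `handle_unknown='ignore'`, the unseen category `'C'` becomes all zeros in the one-hot columns instead of raising an error.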


In [None]:
def rmse(y_true, y_pred): 
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Train linear regression baseline
lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_val)
print('Linear Regression  - RMSE: {:.2f}, MAE: {:.2f}, R2: {:.4f}'.format(
    rmse(y_val, y_pred_lr), 
    mean_absolute_error(y_val, y_pred_lr), 
    r2_score(y_val, y_pred_lr)
))

# Train random forest baseline
rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_val)
print('Random Forest      - RMSE: {:.2f}, MAE: {:.2f}, R2: {:.4f}'.format(
    rmse(y_val, y_pred_rf), 
    mean_absolute_error(y_val, y_pred_rf), 
    r2_score(y_val, y_pred_rf)
))

# Train a Gradient Boosting baseline (sklearn)
gbr = Pipeline(steps=[('preproc', preprocessor),
                      ('model', GradientBoostingRegressor(n_estimators=300, random_state=42))])
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_val)
print('GradientBoosting   - RMSE: {:.2f}, MAE: {:.2f}, R2: {:.4f}'.format(
    rmse(y_val, y_pred_gbr), 
    mean_absolute_error(y_val, y_pred_gbr), 
    r2_score(y_val, y_pred_gbr)
))
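The three metrics printed for each baseline can be checked by hand on a tiny example. The sketch below uses made-up prices and plain NumPy rather than the scikit-learn helpers, so each intermediate quantity is visible.

```python
import numpy as np

# Made-up prices, for illustration only
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_pred_toy - y_true_toy                         # [10, -10, 20, -20]
rmse_toy = np.sqrt(np.mean(errors ** 2))                 # sqrt(1000 / 4) = sqrt(250)
mae_toy = np.mean(np.abs(errors))                        # 60 / 4 = 15
ss_res = np.sum(errors ** 2)                             # 1000
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # 50000
r2_toy = 1.0 - ss_res / ss_tot                           # 0.98

print(f'RMSE: {rmse_toy:.2f}, MAE: {mae_toy:.2f}, R2: {r2_toy:.4f}')
# RMSE: 15.81, MAE: 15.00, R2: 0.9800
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two here even though the average absolute error is 15.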

In [None]:
# Small, fast randomized search over Random Forest hyperparameters
param_dist = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [10, 15],
    'model__min_samples_split': [5, 10],
}

rs = RandomizedSearchCV(
    rf_pipeline, 
    param_dist, 
    n_iter=4,  # Only 4 iterations
    scoring='neg_root_mean_squared_error', 
    cv=2,      # Only 2-fold CV
    random_state=42, 
    n_jobs=1,  # Use only 1 job to avoid overhead
    verbose=1
)
rs.fit(X_train, y_train)
print('Best RF params:', rs.best_params_)
best_rf = rs.best_estimator_
y_pred_best_rf = best_rf.predict(X_val)
print('Tuned Random Forest - RMSE: {:.2f}, MAE: {:.2f}, R2: {:.4f}'.format(
    rmse(y_val, y_pred_best_rf), 
    mean_absolute_error(y_val, y_pred_best_rf), 
    r2_score(y_val, y_pred_best_rf)
))
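`RandomizedSearchCV` also accepts `scipy.stats` distributions in place of fixed lists, which lets it sample from a wider range with the same number of iterations. The sketch below runs on synthetic data from `make_regression`; the ranges and the `toy_` names are arbitrary assumptions for illustration, not tuned recommendations for this dataset.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# randint(low, high) samples integers from low up to high - 1
toy_dist = {
    'n_estimators': randint(20, 61),
    'max_depth': randint(3, 11),
    'min_samples_split': randint(2, 11),
}

toy_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    toy_dist,
    n_iter=5,
    scoring='neg_root_mean_squared_error',
    cv=3,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)
print('Best params:', toy_search.best_params_)
print('Best CV RMSE:', -toy_search.best_score_)
```

When the estimator is a pipeline such as `rf_pipeline`, the keys need the step prefix, for example `model__n_estimators`.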

In [None]:

# Choose final model: pick the candidate with the lowest validation RMSE
candidates = {
    'LinearRegression': (lr_pipeline, rmse(y_val, y_pred_lr)),
    'RandomForest': (rf_pipeline, rmse(y_val, y_pred_rf)),
    'GradientBoosting': (gbr, rmse(y_val, y_pred_gbr)),
}
# Include the tuned Random Forest only if the tuning cell was run
if 'best_rf' in globals() and 'y_pred_best_rf' in globals():
    candidates['TunedRandomForest'] = (best_rf, rmse(y_val, y_pred_best_rf))

best_name = min(candidates.items(), key=lambda kv: kv[1][1])[0]
best_pipeline = candidates[best_name][0]
print('Selected final model:', best_name)

# Refit final model on full training data (train_fe)
X_full = train_fe[numeric_feats + cat_feats].copy()
y_full = train_fe[target_col].copy()
best_pipeline.fit(X_full, y_full)

# Prepare test features (ensure same feature columns/order)
X_test = test_fe[numeric_feats + cat_feats].copy()
preds = best_pipeline.predict(X_test)

# Prepare predictions DataFrame
preds_df = pd.DataFrame({
    id_col: test_fe[id_col],
    'predicted_price': np.round(preds).astype(int)
})

# Save the predictions CSV with required filename (update TeamCode and TeamName below)
teamcode = 'TeamCode'
teamname = 'TeamName'
pred_filename = f'{teamcode}_{teamname}_Task2_Predictions.csv'
preds_df.to_csv(pred_filename, index=False)
print('Predictions saved to', pred_filename)

# Save the final trained model
model_filename = f'{teamcode}_{teamname}_Task2_FinalModel.joblib'
joblib.dump(best_pipeline, model_filename)
print('Trained model saved to', model_filename)
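Before submitting, it can help to confirm programmatically that the saved CSV has exactly the two required columns and no stray index column. A minimal sketch follows; the helper name `check_submission` is hypothetical, and the example writes made-up predictions to a temporary directory so it does not touch the real output files.

```python
import os
import tempfile
import pandas as pd

def check_submission(path, expected_cols=('house_id', 'predicted_price')):
    """Return True if the CSV has exactly the expected columns and no missing values."""
    df = pd.read_csv(path)
    return list(df.columns) == list(expected_cols) and not df.isna().any().any()

# Made-up predictions, written to a temporary directory for illustration only
toy_preds = pd.DataFrame({'house_id': [1, 2, 3],
                          'predicted_price': [250000, 310000, 180000]})
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, 'good.csv')
    bad_path = os.path.join(tmp, 'bad.csv')
    toy_preds.to_csv(good_path, index=False)  # as in the notebook
    toy_preds.to_csv(bad_path)                # default index=True adds an unnamed first column
    good_ok = check_submission(good_path)
    bad_ok = check_submission(bad_path)

print(good_ok, bad_ok)  # True False
```

In the notebook itself, the same check would be `check_submission(pred_filename)`.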


In [None]:
# Compute evaluation metrics on validation for the selected model (if available)
if 'y_val' in globals():
    if best_name == 'LinearRegression':
        y_val_pred = y_pred_lr
    elif best_name == 'RandomForest':
        y_val_pred = y_pred_rf
    elif best_name == 'GradientBoosting':
        y_val_pred = y_pred_gbr
    else:
        y_val_pred = y_pred_best_rf if 'y_pred_best_rf' in globals() else y_pred_rf

    # Use our custom rmse function instead of mean_squared_error with squared=False
    final_rmse = rmse(y_val, y_val_pred)
    final_mae = mean_absolute_error(y_val, y_val_pred)
    final_r2 = r2_score(y_val, y_val_pred)
else:
    final_rmse = final_mae = final_r2 = None

# Save a PNG screenshot with the metrics for submission
metrics_filename = f'{teamcode}_{teamname}_Task2_Metrics.png'
plt.figure(figsize=(6,3))
plt.axis('off')
txt = f"""Validation metrics ({best_name})
RMSE: {final_rmse:.2f}
MAE: {final_mae:.2f}
R²: {final_r2:.4f}
""" if final_rmse is not None else 'Metrics not available'
plt.text(0.01, 0.5, txt, fontsize=12, va='center')
plt.savefig(metrics_filename, bbox_inches='tight', dpi=150)
plt.close()
print('Metrics screenshot saved to', metrics_filename)


---

## Final notes & checklist before submission

1. **Rename files**: Replace `TeamCode` and `TeamName` in the produced filenames with your official team code and team name exactly as required by the competition. Example final names:
   - `T01_CoolTeam_Task2_HousePrice.ipynb`
   - `T01_CoolTeam_Task2_Metrics.png`
   - `T01_CoolTeam_Task2_Predictions.csv`

2. **Verify CSV format**: Open the saved CSV and confirm it has exactly two columns: `house_id`, `predicted_price`. No index column should be included.

3. **Screenshot requirement**: The PNG `*_Metrics.png` is required as proof of your validation metrics (RMSE, MAE, R²); make sure it is visible in your repo.

4. **If you want to improve performance**:
   - Try feature selection, a log transformation of the target, stacking, a more extensive hyperparameter search, or LightGBM/XGBoost (both available in Colab).
   - Use K-Fold CV and out-of-fold predictions for more robust model selection.

5. **Troubleshooting**:
   - If any column names differ from the assumed ones above, edit the feature lists in the notebook accordingly.
   - If `zip_area` has more than 20 distinct values, the notebook skips one-hot encoding and relies on `zip_freq` (frequency encoding) to avoid a very wide feature matrix.
---
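Two of the suggestions in item 4 above, a log transformation of the target and out-of-fold predictions from K-Fold CV, can be combined in a few lines. The sketch below is one possible approach on synthetic data, not the notebook's method: it assumes scikit-learn's `TransformedTargetRegressor` with `np.log1p`/`np.expm1`, and the `toy_` data stands in for the real features and prices.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic, right-skewed positive "prices", for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = np.exp(12.0 + 0.5 * X_toy[:, 0] - 0.3 * X_toy[:, 1]
               + rng.normal(scale=0.1, size=300))

# Fit on log1p(price) and invert with expm1, so predictions come back in price units
log_model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(n_estimators=100, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Out-of-fold predictions: each row is predicted by a model that never saw it
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(log_model, X_toy, y_toy, cv=kf)

oof_rmse = np.sqrt(mean_squared_error(y_toy, oof_pred))
print('OOF RMSE:', round(oof_rmse, 2))
```

In the notebook, one of the existing pipelines (for example `gbr`) could be passed as `regressor` with `X_full` and `y_full` as the data. Comparing models on out-of-fold RMSE uses every training row for validation once, which is usually more stable than a single 15% hold-out split.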

Good luck in the competition — run this in Colab, upload the resulting CSV + PNG + notebook to your GitHub repository, and you should be ready for submission!
