# 🏠 Mohammed Real Estate Demand Prediction (Professional Edition)

**Author:** Mohammed Nasrallah — Data Scientist / ML Engineer  
**Email:** [mohammednasrallah82@gmail.com](mailto:mohammednasrallah82@gmail.com)

---

## 💼 Project Summary
End-to-end pipeline for **real estate demand forecasting** from monthly transaction data across city sectors.  
It covers **data merging**, **feature engineering**, and **LightGBM regression modeling** to predict  
new-house transaction amounts (`amount_new_house_transactions`) per sector and month, and to surface which features drive the predictions.
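
Feature importance is named as a goal, yet the training cell does not print it. A minimal sketch of one way to rank features is shown here, assuming the fitted `LGBMRegressor` exposes `feature_importances_` (split counts by default). The helper name `rank_importances` and the toy numbers are hypothetical and not part of the notebook.

```python
import pandas as pd

def rank_importances(feature_names, importances, top_n=10):
    """Return the top_n features sorted by importance, as shares of the total."""
    s = pd.Series(list(importances), index=list(feature_names), dtype=float)
    total = s.sum()
    if total > 0:
        s = s / total  # normalise so the shares sum to 1
    return s.sort_values(ascending=False).head(top_n)

# Toy values standing in for model.feature_importances_ (hypothetical numbers).
names = ["sector", "ym_idx", "month_num", "year"]
scores = [30, 50, 15, 5]
print(rank_importances(names, scores, top_n=3))
```

With the notebook's own objects this would be called as `rank_importances(X_train.columns, model.feature_importances_)` after `model.fit(...)`.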

---

## 🧠 Tech Stack
`Python`, `Pandas`, `NumPy`, `Scikit-Learn`, `LightGBM`, `Matplotlib`, `Seaborn`

---

## ⚙️ Repro Steps
1️⃣ Place all raw CSVs (`land_transactions`, `new_house_transactions`, `pre_owned_house_transactions`, etc.) under `/content`  
2️⃣ Run all notebook cells sequentially (top to bottom)  
3️⃣ Outputs: model metrics (MAE, RMSE, MAPE) and final prediction file → **submission_mohammed_final.csv**
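
Before running the cells, it can help to confirm that every CSV the loading stage reads is actually present. The sketch below uses the file names from Stage 1 of the notebook; the helper `missing_files` is a hypothetical name introduced only for illustration.

```python
from pathlib import Path

# File names as read by the notebook's loading stage.
REQUIRED_FILES = [
    "city_indexes.csv",
    "city_search_index.csv",
    "land_transactions.csv",
    "land_transactions_nearby_sectors.csv",
    "new_house_transactions.csv",
    "new_house_transactions_nearby_sectors.csv",
    "pre_owned_house_transactions.csv",
    "pre_owned_house_transactions_nearby_sectors.csv",
    "sector_POI.csv",
    "test.csv",
    "sample_submission.csv",
]

def missing_files(base_path="/content", required=REQUIRED_FILES):
    """Return the required file names that are not found under base_path."""
    base = Path(base_path)
    return [name for name in required if not (base / name).is_file()]

missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required CSVs found.")
```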

---

## 📊 Performance Summary
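Scores on the time-based validation split (the latest ~20% of rows by month index: 1,078 of 5,433), taken at the early-stopped iteration 93.

| Metric | Validation score |
|:-------|------:|
| **MAE**  | 1623.041 |
| **RMSE** | 4647.608 |
| **MAPE** | 0.214 (fraction, i.e. 21.4%) |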
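
For reference, the three metrics can be reproduced on a tiny example. This is a sketch with made-up numbers; the `mape` definition mirrors the one in the evaluation stage of the notebook (a fraction, not a percentage), while `mae` and `rmse` are plain NumPy stand-ins for the scikit-learn calls.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-9):
    # eps guards against division by zero when a true value is 0
    return np.mean(np.abs((y_true - y_pred) / np.clip(np.abs(y_true), eps, None)))

# Hypothetical values, chosen so the arithmetic is easy to check by hand.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

print(f"MAE  : {mae(y_true, y_pred):.3f}")   # 10.000
print(f"RMSE : {rmse(y_true, y_pred):.3f}")  # 12.910
print(f"MAPE : {mape(y_true, y_pred):.3f}")  # 0.067
```

RMSE weights large errors more heavily than MAE, so an RMSE well above the MAE, as in the table, usually points to a small number of large misses rather than uniformly poor predictions.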

---

## 📝 Notes
- Code is **fully commented** and designed for clarity & reproducibility.  
- Built for **interview-readiness** and **portfolio showcasing**.  
- Can be easily adapted to other cities or real-estate datasets.


In [19]:
# 🏗️ Stage 1: Setup & Data Loading
# ------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb

pd.set_option("display.max_columns", 50)
pd.set_option("display.float_format", "{:.2f}".format)

BASE_PATH = "/content"

# Load all CSV files
files = {
    "city_indexes": pd.read_csv(f"{BASE_PATH}/city_indexes.csv"),
    "city_search_index": pd.read_csv(f"{BASE_PATH}/city_search_index.csv"),
    "land_transactions": pd.read_csv(f"{BASE_PATH}/land_transactions.csv"),
    "land_transactions_nearby": pd.read_csv(f"{BASE_PATH}/land_transactions_nearby_sectors.csv"),
    "new_house": pd.read_csv(f"{BASE_PATH}/new_house_transactions.csv"),
    "new_house_nearby": pd.read_csv(f"{BASE_PATH}/new_house_transactions_nearby_sectors.csv"),
    "pre_owned": pd.read_csv(f"{BASE_PATH}/pre_owned_house_transactions.csv"),
    "pre_owned_nearby": pd.read_csv(f"{BASE_PATH}/pre_owned_house_transactions_nearby_sectors.csv"),
    "sector_poi": pd.read_csv(f"{BASE_PATH}/sector_POI.csv"),
    "test": pd.read_csv(f"{BASE_PATH}/test.csv"),
    "sample_submission": pd.read_csv(f"{BASE_PATH}/sample_submission.csv")
}

print("✅ Data loaded successfully.")

# ------------------------------------------------------------
# 🧹 Stage 2: Cleaning & Standardization
# ------------------------------------------------------------
for name, df in files.items():
    df.columns = df.columns.str.strip().str.lower()
    files[name] = df

for key in [
    "land_transactions", "land_transactions_nearby",
    "new_house", "new_house_nearby",
    "pre_owned", "pre_owned_nearby",
    "city_search_index"
]:
    if "month" in files[key].columns:
        files[key]["month"] = pd.to_datetime(
            files[key]["month"].str.replace("_", "-").str.replace(" ", "-"),
            errors="coerce", format="%Y-%b"
        )

print("✅ Columns standardized and dates parsed.")

# ------------------------------------------------------------
# 🧩 Stage 3: Merging All Datasets
# ------------------------------------------------------------
merged_df = files["new_house"].copy()
merge_list = [
    "land_transactions",
    "land_transactions_nearby",
    "pre_owned",
    "pre_owned_nearby",
    "new_house_nearby",
    "sector_poi"
]

for name in merge_list:
    df = files[name].copy()
    common_cols = [c for c in ["month", "sector"] if c in df.columns]
    merged_df = pd.merge(merged_df, df, on=common_cols, how="left")
    print(f"🔗 Merged with {name:<30} → shape: {merged_df.shape}")

print("✅ All files merged successfully.")

# ------------------------------------------------------------
# 🧠 Stage 4: Feature Engineering
# ------------------------------------------------------------
df = merged_df.copy()
df = df[~df["amount_new_house_transactions"].isna()].copy()

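# Handle missing values: median for numeric columns, most frequent value otherwise.
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which pandas flags as chained assignment.
for col in df.columns:
    if df[col].dtype in ["float64", "int64"]:
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])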

# Time features
df["month"] = pd.to_datetime(df["month"], errors="coerce")
df["year"] = df["month"].dt.year
df["month_num"] = df["month"].dt.month
df["ym_idx"] = (df["year"] - df["year"].min()) * 12 + df["month_num"]

# Label encode sector
le = LabelEncoder()
df["sector"] = le.fit_transform(df["sector"].astype(str))

print("✅ Feature engineering complete.")

# ------------------------------------------------------------
# ⚙️ Stage 5: Model Training (LightGBM)
# ------------------------------------------------------------
cat_cols = ["sector"]
num_cols = [c for c in df.columns if c not in ["month", "amount_new_house_transactions", "sector"]
            and pd.api.types.is_numeric_dtype(df[c])]

X_all = df[cat_cols + num_cols].fillna(0)
y_all = df["amount_new_house_transactions"].astype(float)

# Split (time-based)
cut_ym = np.quantile(df["ym_idx"], 0.80)
train_idx = df["ym_idx"] <= cut_ym
val_idx = df["ym_idx"] > cut_ym

X_train, y_train = X_all.loc[train_idx], y_all.loc[train_idx]
X_val, y_val = X_all.loc[val_idx], y_all.loc[val_idx]

print(f"Train: {X_train.shape}, Validation: {X_val.shape}")

# Train LightGBM model
model = lgb.LGBMRegressor(
    objective="regression",
    boosting_type="gbdt",
    learning_rate=0.05,
    num_leaves=31,
    max_depth=-1,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
    random_state=42,
    n_estimators=1000
)

model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          eval_metric="mae",
          callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)])

# ------------------------------------------------------------
# 📊 Stage 6: Evaluation
# ------------------------------------------------------------
def mape(y_true, y_pred, eps=1e-9):
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / np.clip(np.abs(y_true), eps, None)))

y_pred = model.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
mape_val = mape(y_val, y_pred)

print("\n📈 Model Evaluation Results:")
print(f"MAE   : {mae:.3f}")
print(f"RMSE  : {rmse:.3f}")
print(f"MAPE  : {mape_val:.3f}")

# ------------------------------------------------------------
# 🏁 Stage 7: Submission File Generation
# ------------------------------------------------------------
test_df = pd.read_csv(f"{BASE_PATH}/test.csv")
test_df["month"] = test_df["id"].apply(lambda x: "-".join(x.split("_")[:2]))
test_df["sector"] = test_df["id"].apply(lambda x: x.split("_")[-1])
test_df["month"] = pd.to_datetime(test_df["month"], errors="coerce")

test_merged = test_df.copy()
merge_files = [
    "land_transactions", "land_transactions_nearby",
    "new_house", "new_house_nearby",
    "pre_owned", "pre_owned_nearby", "sector_poi"
]

for name in merge_files:
    temp = files[name]
    if "month" in temp.columns and "sector" in temp.columns:
        test_merged = pd.merge(test_merged, temp, on=["month", "sector"], how="left")
    elif "sector" in temp.columns:
        test_merged = pd.merge(test_merged, temp, on="sector", how="left")

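# Recreate the Stage 4 time features; without this step year, month_num and
# ym_idx are absent from the test frame and would be filled with 0 below.
test_merged["year"] = test_merged["month"].dt.year
test_merged["month_num"] = test_merged["month"].dt.month
test_merged["ym_idx"] = (test_merged["year"] - df["year"].min()) * 12 + test_merged["month_num"]

test_merged = test_merged.fillna(0)
# Sectors never seen in training fall back to the first known class.
test_merged["sector"] = le.transform([s if s in le.classes_ else le.classes_[0] for s in test_merged["sector"].astype(str)])

# Any training feature that still has no test-time source is set to 0.
missing_cols = [c for c in (cat_cols + num_cols) if c not in test_merged.columns]
for c in missing_cols:
    test_merged[c] = 0

X_test = test_merged[cat_cols + num_cols]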
preds = model.predict(X_test)

submission = pd.read_csv(f"{BASE_PATH}/sample_submission.csv")
submission["new_house_transaction_amount"] = preds
submission.to_csv("/content/submission_mohammed_final.csv", index=False)

print("\n✅ Submission file created successfully! → submission_mohammed_final.csv")


✅ Data loaded successfully.
✅ Columns standardized and dates parsed.
🔗 Merged with land_transactions              → shape: (5433, 15)
🔗 Merged with land_transactions_nearby       → shape: (5433, 19)
🔗 Merged with pre_owned                      → shape: (5433, 23)
🔗 Merged with pre_owned_nearby               → shape: (5433, 27)
🔗 Merged with new_house_nearby               → shape: (5433, 36)
🔗 Merged with sector_poi                     → shape: (5433, 177)
✅ All files merged successfully.
✅ Feature engineering complete.
Train: (4355, 178), Validation: (1078, 178)
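
The train/validation shapes above come from a quantile cut on `ym_idx`. The same idea can be sketched on a toy frame as follows; `time_split` is a hypothetical helper and the data is invented for illustration.

```python
import numpy as np
import pandas as pd

def time_split(df, time_col="ym_idx", train_frac=0.80):
    """Return (train, val) where val holds only the latest periods."""
    cut = np.quantile(df[time_col], train_frac)
    train = df[df[time_col] <= cut]
    val = df[df[time_col] > cut]
    return train, val

# Toy frame: 10 months, 2 sectors per month (hypothetical data).
toy = pd.DataFrame({
    "ym_idx": np.repeat(np.arange(1, 11), 2),
    "sector": [0, 1] * 10,
})

train, val = time_split(toy)
print(len(train), len(val))                        # 16 4
print(train["ym_idx"].max(), val["ym_idx"].min())  # 8 9
```

Because rows are compared by month index rather than by row position, all rows of a given month land on the same side of the cut, so no month is shared between training and validation.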


pandas chained-assignment warning, recorded for the two `fillna(..., inplace=True)` calls:

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df[col].fillna(df[col].median(), inplace=True)
  df[col].fillna(df[col].mode()[0], inplace=True)
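
The warning above suggests passing a `{column: value}` mapping to `DataFrame.fillna`. A minimal sketch of that pattern follows. It also computes the fill values from training rows only and reuses them elsewhere, which is a design choice assumed here to keep validation and test data out of the imputation statistics; the notebook itself imputes over the full frame. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train):
    """Median for numeric columns, most frequent value otherwise."""
    fills = {}
    for col in train.columns:
        non_null = train[col].dropna()
        if non_null.empty:
            continue  # nothing to learn a fill value from
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = non_null.median()
        else:
            fills[col] = non_null.mode().iloc[0]
    return fills

def apply_fill_values(df, fills):
    """Fill each column with its stored value; returns a new frame."""
    return df.fillna(value=fills)

# Hypothetical toy data.
train = pd.DataFrame({"price": [1.0, np.nan, 3.0], "zone": ["a", None, "a"]})
later = pd.DataFrame({"price": [np.nan, 5.0], "zone": [None, "b"]})

fills = fit_fill_values(train)
print(apply_fill_values(train, fills))
print(apply_fill_values(later, fills))
```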


Training until validation scores don't improve for 100 rounds
[100]	valid_0's l1: 1661.14	valid_0's l2: 2.06569e+07
Early stopping, best iteration is:
[93]	valid_0's l1: 1623.04	valid_0's l2: 2.16003e+07

📈 Model Evaluation Results:
MAE   : 1623.041
RMSE  : 4647.608
MAPE  : 0.214

✅ Submission file created successfully! → submission_mohammed_final.csv
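
Stage 7 parses each test `id` into a month and a sector with `errors="coerce"`, so an unexpected id layout silently becomes `NaT` and every month-keyed merge then finds no match. The sketch below makes that failure loud instead. It assumes ids shaped like `2024_Aug_sector 1`, which is only inferred from the split logic in Stage 7 and should be checked against the real `test.csv`; `parse_ids` is a hypothetical helper.

```python
import pandas as pd

def parse_ids(ids, month_format="%Y-%b"):
    """Split ids shaped like '2024_Aug_sector 1' (assumed layout) into month and sector."""
    rows = []
    for raw in ids:
        parts = str(raw).split("_")
        rows.append({
            "id": raw,
            "month_text": "-".join(parts[:2]),  # e.g. '2024-Aug'
            "sector": parts[-1],                # e.g. 'sector 1'
        })
    out = pd.DataFrame(rows)
    out["month"] = pd.to_datetime(out["month_text"], format=month_format, errors="coerce")
    bad = out.loc[out["month"].isna(), "id"].tolist()
    if bad:
        raise ValueError(f"{len(bad)} ids did not match {month_format!r}, e.g. {bad[:3]}")
    return out[["id", "month", "sector"]]

print(parse_ids(["2024_Aug_sector 1", "2024_Sep_sector 2"]))
```

Passing an explicit `format` also avoids per-element format inference in `pd.to_datetime`.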


