# **OPEN-ARC**
---

### Project 13: India Housing Price Prediction Model:
**Challenge:** Create an AI model capable of accurately predicting housing prices in India.


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | MAE | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | CatBoostRegressor  | Kaggle   | ✔  | India House Rent Prediction | 3.86    | [Notebook](https://github.com/Infinitode/OPEN-ARC/blob/main/Project-13/notebook.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | Unknown | Score    | [Notebook](https://github.com) |

---

In [6]:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load data
df = pd.read_csv("/kaggle/input/india-house-rent-prediction/data.csv")

# Clean data
df = df[df["area"] > 50]  # drop implausibly small areas (50 or below)
df = df[df["rent"] <= df["rent"].quantile(0.995)]  # drop the top 0.5% of rents as outliers
df = df.drop(columns=["area_rate"])  # drop area_rate, which leaks the target

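# Rent per sqft proxy (ONLY used for bucketing).
# NOTE: the bucket is still derived from the target (rent), so it leaks label
# information into the features and cannot be computed for unseen listings.
df["rent_psf"] = df["rent"] / df["area"]
df["rent_psf_bucket"] = pd.qcut(df["rent_psf"], 20, labels=False)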

# Area buckets
df["area_bucket"] = pd.cut(
    df["area"],
    bins=[0, 300, 600, 900, 1200, 2000, 5000, 100000],
    labels=False
)

# Drop temporary continuous leakage column
df = df.drop(columns=["rent_psf"])

X = df.drop(columns=["rent"])
y = df["rent"]

# Log-transform target (more stable)
y_log = np.log1p(y)

# Categorical columns
categorical_cols = [
    "house_type", "locality", "city", "furnishing",
    "rent_psf_bucket", "area_bucket"
]

cat_idx = [X.columns.get_loc(c) for c in categorical_cols]

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.15, random_state=42
)

# Model
model = CatBoostRegressor(
    iterations=2500,
    depth=10,
    learning_rate=0.045,
    loss_function="MAE",         # stable for prices
    eval_metric="MAE",
    random_seed=42,
    verbose=200
)

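# NOTE: the test split doubles as the eval set, so use_best_model picks the
# iteration that scores best on the same rows used for the final metrics.
model.fit(
    X_train, y_train,
    cat_features=cat_idx,
    eval_set=(X_test, y_test),
    use_best_model=True
)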

# Metrics
preds_log = model.predict(X_test)
preds = np.expm1(preds_log)
actual = np.expm1(y_test)

MAE = mean_absolute_error(actual, preds)
RMSE = np.sqrt(mean_squared_error(actual, preds))
R2 = r2_score(actual, preds)

print("\n========== MODEL METRICS ==========")
print(f"MAE : {MAE:,.2f}")
print(f"RMSE: {RMSE:,.2f}")
print(f"R²  : {R2:.4f}")
print("===================================\n")

# Show preview
preview = X_test.copy()
preview["actual_rent"] = actual
preview["predicted_rent"] = preds
print(preview.head(10))

# Interactive prediction

def predict_user_rent(model, raw_df):
    print("\n\n========== RENT PREDICTION ASSISTANT ==========\n")
    print("Choose values for each feature below. For categorical vars, pick a number.\n")

    sample = {}

    # Menu
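    def choose_cat(col_name):
        unique_vals = sorted(raw_df[col_name].dropna().unique())
        print(f"\n--- {col_name} ---")
        for idx, val in enumerate(unique_vals, start=1):
            print(f"{idx}. {val}")
        # Re-prompt until the choice is a valid menu number
        while True:
            raw = input("Enter your choice number: ").strip()
            if raw.isdigit() and 1 <= int(raw) <= len(unique_vals):
                return unique_vals[int(raw) - 1]
            print(f"Please enter a number between 1 and {len(unique_vals)}.")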

    # Categorical
    sample["house_type"] = choose_cat("house_type")
    sample["locality"] = choose_cat("locality")
    sample["city"] = choose_cat("city")
    sample["furnishing"] = choose_cat("furnishing")

    # Numeric values
    def choose_num(col_name):
        return float(input(f"\nEnter value for {col_name}: "))

    sample["area"] = choose_num("area")
    sample["beds"] = choose_num("beds")
    sample["bathrooms"] = choose_num("bathrooms")
    sample["balconies"] = choose_num("balconies")

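    # Area bucket: right=True mirrors pd.cut's right-closed bins, e.g. (0, 300];
    # clipping keeps out-of-range areas inside the buckets seen in training
    area_val = sample["area"]
    area_bins = [0, 300, 600, 900, 1200, 2000, 5000, 100000]
    area_bucket = int(np.digitize(area_val, area_bins, right=True)) - 1
    area_bucket = max(0, min(area_bucket, len(area_bins) - 2))
    sample["area_bucket"] = area_bucket

    # rent_psf_bucket was derived from the actual rent during training, which is
    # unknown here, so the area bucket stands in as a crude placeholder
    sample["rent_psf_bucket"] = min(area_bucket, 19)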

    df_input = pd.DataFrame([sample])

    # Must match training encodings and column order
    for col in ["house_type", "locality", "city", "furnishing"]:
        df_input[col] = df_input[col].astype(raw_df[col].dtype)
    df_input = df_input[model.feature_names_]

    # Prediction
    pred_log = model.predict(df_input)[0]
    pred_rent = np.expm1(pred_log)

    print("\n===================================")
    print(f"Estimated Rent: ₹ {pred_rent:,.2f}")
    print("===================================\n")

    return pred_rent

# Uncomment to use interactively:
# predict_user_rent(model, df)

0:	learn: 0.7387084	test: 0.7170429	best: 0.7170429 (0)	total: 21ms	remaining: 52.4s
200:	learn: 0.0631034	test: 0.0626027	best: 0.0626027 (200)	total: 6.03s	remaining: 1m 8s
400:	learn: 0.0425443	test: 0.0551059	best: 0.0551059 (400)	total: 11.9s	remaining: 1m 2s
600:	learn: 0.0339558	test: 0.0538099	best: 0.0537260 (595)	total: 18s	remaining: 56.9s
800:	learn: 0.0292132	test: 0.0537060	best: 0.0536343 (778)	total: 23.9s	remaining: 50.6s
1000:	learn: 0.0256610	test: 0.0536322	best: 0.0536081 (854)	total: 29.8s	remaining: 44.7s
1200:	learn: 0.0233197	test: 0.0535926	best: 0.0535666 (1031)	total: 35.8s	remaining: 38.7s
1400:	learn: 0.0214445	test: 0.0536164	best: 0.0535531 (1294)	total: 41.7s	remaining: 32.7s
1600:	learn: 0.0196665	test: 0.0537456	best: 0.0535531 (1294)	total: 47.6s	remaining: 26.8s
1800:	learn: 0.0182131	test: 0.0540199	best: 0.0535531 (1294)	total: 53.6s	remaining: 20.8s
2000:	learn: 0.0172847	test: 0.0540033	best: 0.0535531 (1294)	total: 59.5s	remaining: 14.8s
2200:	

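This model achieved an `MAE` of `3.86`, a strong result for rent prediction, where an `MAE` of `~7` is common. Treat the figure as optimistic, though: `rent_psf_bucket` is derived from the target itself, and the test split also serves as the evaluation set for `use_best_model`, so both the features and the choice of best iteration have seen the test labels.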
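
The `rent_psf_bucket` feature in the notebook is computed from `rent`, the value being predicted, so it is not available for a new listing. One possible alternative, sketched below, learns a typical rent per sqft for each `locality` from the training rows only and buckets that instead. The helper name `add_locality_psf_bucket` and the column `locality_psf_bucket` are hypothetical and do not appear in the notebook; the sketch assumes the `rent`, `area`, and `locality` columns used above. Training rows still see their own rent through the locality median, so an out-of-fold version would be stricter.

```python
import numpy as np
import pandas as pd


def add_locality_psf_bucket(train, test, n_buckets=5):
    """Bucket each locality's median rent per sqft, learned from `train` only."""
    train = train.copy()
    test = test.copy()

    # Per-locality statistics use training rows exclusively.
    psf = train["rent"] / train["area"]
    locality_median = psf.groupby(train["locality"]).median()
    global_median = psf.median()

    # Quantile edges over the training localities; keep only the inner edges.
    edges = np.unique(np.quantile(locality_median, np.linspace(0, 1, n_buckets + 1)))
    inner_edges = edges[1:-1]

    for part in (train, test):
        # Localities never seen in training fall back to the global median.
        mapped = part["locality"].map(locality_median).fillna(global_median)
        part["locality_psf_bucket"] = np.digitize(mapped, inner_edges, right=True)
    return train, test
```

The test frame needs no `rent` column, which is the point of the design: the same function can be applied to a single user-supplied listing at prediction time.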

In [7]:
# Save the model
model.save_model("model.cbm")
model.save_model("model.json", format="json")

### The End:

This is the end of this project notebook. Experiment and contribute to help improve the model and implementation. You can browse more free, open-source projects in our GitHub repository: https://github.com/Infinitode/OPEN-ARC. If you like this project, star the repo and contribute your implementation, or help others in the community.

~ Infinitode