# PropMatch Egypt — Cairo Real Estate Pricing

**Purpose:** Clean, expand (via web scraping), model, and deploy a pricing tool for 2–3 bedroom apartments in New Cairo.

This notebook is intentionally **simple** (no heavy custom classes) and demonstrates data cleaning, feature engineering, training up to four models (XGBoost and LightGBM only if installed), selecting the best model each month, and MLflow integration for experiment tracking.

**Files:**
- Dataset (uploaded): `/mnt/data/cairo_real_estate_dataset.csv`



## 1) Imports & Setup

In [18]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib
import os
from datetime import datetime, timedelta

# Modeling imports
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# XGBoost / LightGBM (optional)
try:
    import xgboost as xgb
except Exception:
    xgb = None
try:
    import lightgbm as lgb
except Exception:
    lgb = None

# MLflow for experiment tracking (optional)
try:
    import mlflow
    import mlflow.sklearn
except Exception:
    mlflow = None

# Display settings
pd.options.display.max_columns = 200
plt.rcParams['figure.figsize'] = (8,5)


## 2) Load dataset (uploaded file)

In [19]:
# Adjust the path if needed. This assumes the CSV sits in the notebook's working directory.
data_path = Path("cairo_real_estate_dataset.csv")
if not data_path.exists():
    print(f"File not found at {data_path}. Please upload the dataset to this path.")
else:
    df = pd.read_csv(data_path)
    print('Loaded dataset with shape:', df.shape)
    display(df.head())

Loaded dataset with shape: (2000, 22)


Unnamed: 0,listing_id,price_egp,area_sqm,bedrooms,bathrooms,floor_number,building_age_years,district,compound_name,distance_to_auc_km,distance_to_mall_km,distance_to_metro_km,finishing_type,has_balcony,has_parking,has_security,has_amenities,view_type,listing_date,days_on_market,seller_type,is_negotiable
0,NCR-2024-00001,3650000,145,3,2,12,18,Madinaty,,15.9,2.1,13.1,Lux,Yes,Yes,Yes,No,Garden,2025-08-23,76,Broker,Yes
1,NCR-2024-00002,3900000,155,3,3,15,17,Fifth Settlement,,23.8,3.0,6.4,Lux,Yes,Yes,Yes,Yes,Compound,2025-08-12,87,Broker,Yes
2,NCR-2024-00003,2650000,109,2,3,5,14,Rehab City,,9.8,7.7,9.9,Lux,Yes,Yes,Yes,Yes,Garden,2025-09-20,48,Broker,Yes
3,NCR-2024-00004,5450000,219,3,4,10,1,Katameya,Lake View,4.5,4.4,4.5,Lux,Yes,Yes,Yes,Yes,Street,2025-09-10,58,Broker,Yes
4,NCR-2024-00005,2450000,96,2,3,4,13,Rehab City,Rehab 3,13.6,3.5,11.3,Lux,Yes,No,Yes,No,Street,2025-09-11,57,Owner,Yes


## 3) Basic EDA: inspect columns, types, missing values, duplicates, and summary statistics

In [20]:
# Basic checks (run after loading df)
def eda_overview(df):
    print('Columns and dtypes:\n', df.dtypes)
    print('\nMissing values (count):\n', df.isnull().sum().sort_values(ascending=False).head(20))
    print('\nDuplicates:', df.duplicated().sum())
    display(df.describe(include='all').T.head(30))

# If dataframe exists in notebook:
try:
    eda_overview(df)
except NameError:
    print('Load the dataset first (run the previous cell).')

Columns and dtypes:
 listing_id               object
price_egp                 int64
area_sqm                  int64
bedrooms                  int64
bathrooms                 int64
floor_number              int64
building_age_years        int64
district                 object
compound_name            object
distance_to_auc_km      float64
distance_to_mall_km     float64
distance_to_metro_km    float64
finishing_type           object
has_balcony              object
has_parking              object
has_security             object
has_amenities            object
view_type                object
listing_date             object
days_on_market            int64
seller_type              object
is_negotiable            object
dtype: object

Missing values (count):
 compound_name           471
listing_id                0
area_sqm                  0
bedrooms                  0
bathrooms                 0
price_egp                 0
floor_number              0
building_age_years        0
district   

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
listing_id,2000.0,2000.0,NCR-2024-00001,1.0,,,,,,,
price_egp,2000.0,,,,3845500.0,1117343.491608,1250000.0,3000000.0,3750000.0,4600000.0,8200000.0
area_sqm,2000.0,,,,150.721,43.785703,60.0,116.0,145.0,184.0,299.0
bedrooms,2000.0,,,,2.493,0.65587,1.0,2.0,2.0,3.0,4.0
bathrooms,2000.0,,,,2.7195,0.936621,1.0,2.0,3.0,3.0,5.0
floor_number,2000.0,,,,8.175,4.261743,1.0,5.0,8.0,12.0,15.0
building_age_years,2000.0,,,,10.3175,6.034227,0.0,5.0,11.0,15.0,20.0
district,2000.0,5.0,Fifth Settlement,700.0,,,,,,,
compound_name,1529.0,18.0,Mountain View,146.0,,,,,,,
distance_to_auc_km,2000.0,,,,13.4911,6.688537,2.0,7.7,13.2,19.4,25.0


## 4) Cleaning: handle missing values & duplicates

Strategy suggestions:
- Drop rows that are missing the target (`price_egp`) or `area_sqm`.
- Impute the remaining numeric features with the group median (e.g., median per district) when appropriate.
- For categorical missing values, fill with `'Unknown'` or a new category.
- Remove exact duplicates (same `listing_id`), and near-duplicates if needed.

The code below implements a simple, reproducible cleaning pipeline.

In [21]:
# Simple cleaning pipeline
def clean_real_estate(df):
    df = df.copy()
    # Standardize column names (lowercase, underscores)
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    # Ensure target exists
    if 'price_egp' not in df.columns and 'price' in df.columns:
        df = df.rename(columns={'price': 'price_egp'})
    # Drop rows missing target or area
    df = df.dropna(subset=['price_egp', 'area_sqm'])
    # Remove exact duplicates by listing_id if present, otherwise by all columns
    if 'listing_id' in df.columns:
        df = df.drop_duplicates(subset=['listing_id'])
    else:
        df = df.drop_duplicates()
    # Fill categorical NAs
    cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    for c in cat_cols:
        df[c] = df[c].fillna('Unknown')
    # Impute numeric NAs with median by district if available, else column median
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    for c in num_cols:
        if df[c].isnull().sum() > 0:
            if 'district' in df.columns:
                df[c] = df.groupby('district')[c].transform(lambda x: x.fillna(x.median()))
            df[c] = df[c].fillna(df[c].median())
    return df

# Run cleaning
try:
    df_clean = clean_real_estate(df)
    print('After cleaning shape:', df_clean.shape)
    display(df_clean.head())
except NameError:
    print('Load the dataset first.')

After cleaning shape: (2000, 22)


Unnamed: 0,listing_id,price_egp,area_sqm,bedrooms,bathrooms,floor_number,building_age_years,district,compound_name,distance_to_auc_km,distance_to_mall_km,distance_to_metro_km,finishing_type,has_balcony,has_parking,has_security,has_amenities,view_type,listing_date,days_on_market,seller_type,is_negotiable
0,NCR-2024-00001,3650000,145,3,2,12,18,Madinaty,Unknown,15.9,2.1,13.1,Lux,Yes,Yes,Yes,No,Garden,2025-08-23,76,Broker,Yes
1,NCR-2024-00002,3900000,155,3,3,15,17,Fifth Settlement,Unknown,23.8,3.0,6.4,Lux,Yes,Yes,Yes,Yes,Compound,2025-08-12,87,Broker,Yes
2,NCR-2024-00003,2650000,109,2,3,5,14,Rehab City,Unknown,9.8,7.7,9.9,Lux,Yes,Yes,Yes,Yes,Garden,2025-09-20,48,Broker,Yes
3,NCR-2024-00004,5450000,219,3,4,10,1,Katameya,Lake View,4.5,4.4,4.5,Lux,Yes,Yes,Yes,Yes,Street,2025-09-10,58,Broker,Yes
4,NCR-2024-00005,2450000,96,2,3,4,13,Rehab City,Rehab 3,13.6,3.5,11.3,Lux,Yes,No,Yes,No,Street,2025-09-11,57,Owner,Yes
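The strategy list in this section also mentions near-duplicates, which `clean_real_estate` does not handle. Below is a minimal sketch of one way to flag them. It assumes (this is not stated in the dataset) that two listings with the same district, area, bedrooms, floor, and price are the same apartment posted twice under different IDs. `NEAR_DUP_KEY` and `drop_near_duplicates` are hypothetical names.

```python
import pandas as pd

# Hypothetical key: columns assumed to identify one physical apartment.
NEAR_DUP_KEY = ["district", "area_sqm", "bedrooms", "floor_number", "price_egp"]

def drop_near_duplicates(df, key=NEAR_DUP_KEY):
    """Keep the first listing for each combination of the key columns."""
    key = [c for c in key if c in df.columns]
    return df.drop_duplicates(subset=key, keep="first")

# Made-up rows for illustration: A-1 and A-2 differ only in listing_id.
demo = pd.DataFrame({
    "listing_id": ["A-1", "A-2", "A-3"],
    "district": ["Madinaty", "Madinaty", "Katameya"],
    "area_sqm": [145, 145, 219],
    "bedrooms": [3, 3, 3],
    "floor_number": [12, 12, 10],
    "price_egp": [3650000, 3650000, 5450000],
})
deduped = drop_near_duplicates(demo)
print(deduped["listing_id"].tolist())
```

A stricter or looser key changes how aggressive the filter is, so it is worth checking a sample of the dropped rows by hand before relying on it.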


## 5) Feature engineering

Create business-relevant features:
- `price_per_sqm` (price divided by area)
- `finishing_score` (ordinal mapping of finishing types)
- `amenities_score` (count of the yes/no amenity flags)
- `proximity_score` (mean of `1 / (1 + distance)` over the three distance columns)
- `listing_age_days` (days since `listing_date`)
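The last item depends on the date the notebook is run, so reruns give different values. Below is a minimal sketch of the same calculation against a fixed reference date. The `as_of` parameter is a hypothetical addition, and the date used here is chosen only for illustration.

```python
from datetime import date

def listing_age_days(listing_date, as_of):
    """Days between the listing date and a fixed reference date (ISO strings)."""
    return (date.fromisoformat(as_of) - date.fromisoformat(listing_date)).days

# First listing in the dataset was posted on 2025-08-23.
age = listing_age_days("2025-08-23", as_of="2025-11-11")
print(age)  # 80
```

Passing the reference date explicitly makes the feature reproducible, at the cost of having to choose that date for each retraining run.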


In [22]:
# Feature engineering example
def feature_engineering(df):
    df = df.copy()
    # price per sqm
    df['price_per_sqm'] = df['price_egp'] / df['area_sqm']
    # finishing score mapping
    if 'finishing_type' in df.columns:
        mapping = {'Super Lux':4, 'Lux':3, 'Semi-finished':2, 'Unfinished':1,
                   'SuperLux':4, 'super lux':4, 'lux':3, 'semi-finished':2, 'unfinished':1}
        df['finishing_score'] = df['finishing_type'].map(mapping).fillna(2)
    # binary amenities composite
    amenities_cols = [c for c in ['has_balcony','has_parking','has_security','has_amenities'] if c in df.columns]
    if len(amenities_cols)>0:
        df['amenities_score'] = df[amenities_cols].apply(lambda row: sum([1 if str(x).lower() in ['yes','true', '1'] else 0 for x in row]), axis=1)
    # proximity score: mean of 1/(1+distance) over the available distance columns,
    # so a smaller distance gives a larger score; a missing distance contributes 0
    dist_cols = [c for c in ['distance_to_auc_km','distance_to_mall_km','distance_to_metro_km'] if c in df.columns]
    if len(dist_cols)>0:
        # make sure the distance columns are numeric (unparseable values become NaN)
        for c in dist_cols:
            df[c] = pd.to_numeric(df[c], errors='coerce')
        df['proximity_score'] = df[dist_cols].apply(lambda row: np.mean([1/(1+v) if pd.notnull(v) else 0 for v in row]), axis=1)
    # listing age in days from listing_date; falls back to days_on_market when the date cannot be parsed
    # NOTE: this uses today's date, so the value changes from run to run
    if 'listing_date' in df.columns:
        df['listing_date'] = pd.to_datetime(df['listing_date'], errors='coerce')
        today = pd.Timestamp.today()
        df['listing_age_days'] = (today - df['listing_date']).dt.days.fillna(df.get('days_on_market', np.nan))
    return df

try:
    df_feat = feature_engineering(df_clean)
    display(df_feat.head())
except NameError:
    print('Run cleaning first.')

Unnamed: 0,listing_id,price_egp,area_sqm,bedrooms,bathrooms,floor_number,building_age_years,district,compound_name,distance_to_auc_km,distance_to_mall_km,distance_to_metro_km,finishing_type,has_balcony,has_parking,has_security,has_amenities,view_type,listing_date,days_on_market,seller_type,is_negotiable,price_per_sqm,finishing_score,amenities_score,proximity_score,listing_age_days
0,NCR-2024-00001,3650000,145,3,2,12,18,Madinaty,Unknown,15.9,2.1,13.1,Lux,Yes,Yes,Yes,No,Garden,2025-08-23,76,Broker,Yes,25172.413793,3,3,0.150891,80
1,NCR-2024-00002,3900000,155,3,3,15,17,Fifth Settlement,Unknown,23.8,3.0,6.4,Lux,Yes,Yes,Yes,Yes,Compound,2025-08-12,87,Broker,Yes,25161.290323,3,4,0.141819,91
2,NCR-2024-00003,2650000,109,2,3,5,14,Rehab City,Unknown,9.8,7.7,9.9,Lux,Yes,Yes,Yes,Yes,Garden,2025-09-20,48,Broker,Yes,24311.926606,3,4,0.099759,52
3,NCR-2024-00004,5450000,219,3,4,10,1,Katameya,Lake View,4.5,4.4,4.5,Lux,Yes,Yes,Yes,Yes,Street,2025-09-10,58,Broker,Yes,24885.844749,3,4,0.182941,62
4,NCR-2024-00005,2450000,96,2,3,4,13,Rehab City,Rehab 3,13.6,3.5,11.3,Lux,Yes,No,Yes,No,Street,2025-09-11,57,Owner,Yes,25520.833333,3,2,0.124005,61
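As a check on the `proximity_score` column, the arithmetic for the first listing (distances 15.9, 2.1, and 13.1 km) can be reproduced by hand. This is a small standalone sketch of the same formula, not a replacement for `feature_engineering`.

```python
def proximity_score(distances_km):
    """Mean of 1 / (1 + d); a missing distance (None) contributes 0."""
    terms = [1 / (1 + d) if d is not None else 0 for d in distances_km]
    return sum(terms) / len(terms)

# Distances to AUC, mall, and metro for listing NCR-2024-00001
score = proximity_score([15.9, 2.1, 13.1])
print(round(score, 6))  # 0.150891
```

Because each term is `1 / (1 + d)`, the nearest amenity dominates the score: the 2.1 km mall distance contributes more than the other two distances combined.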


## 6) Modeling: train 4 models and compare

We train Linear Regression, Random Forest, XGBoost (if installed), and LightGBM (if installed). MAE is the primary metric, and model artifacts are saved with `joblib`. Section 8 shows MLflow logging when `mlflow` is available.
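For reference, the three metrics reported in this section can be computed by hand. The sketch below uses made-up prices and predictions, purely to show the definitions; the notebook itself uses the scikit-learn functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up values in EGP, for illustration only
y_true = [3_000_000, 4_000_000, 5_000_000]
y_pred = [3_100_000, 3_900_000, 5_300_000]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, R2: {r2:.3f}")
```

MAE is in EGP and is less sensitive to a few large misses than RMSE, which is one reason to prefer it as the headline number for a pricing tool.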

In [23]:
# Prepare dataset for modeling (filter 2-3 bedrooms if available)
def prepare_model_data(df):
    df = df.copy()
    # Focus on 2-3 bedroom apartments if column exists
    if 'bedrooms' in df.columns:
        df = df[df['bedrooms'].isin([2,3])]
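    # Choose features (numeric columns, including the engineered scores)
    # WARNING: price_per_sqm is price_egp / area_sqm, so keeping it next to area_sqm
    # leaks the target into X; add it to `ignore` before trusting the scores.
    ignore = ['listing_id','price_egp','listing_date']
    X = df[[c for c in df.columns if c not in ignore and (df[c].dtype in [np.float64, np.int64] or c in ['finishing_score','amenities_score','proximity_score','price_per_sqm'])]]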
    # One-hot encode categorical small-cardinality features
    cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
    cat_cols = [c for c in cat_cols if c in ['district','compound_name','view_type','finishing_type','seller_type','is_negotiable']]
    X = pd.get_dummies(pd.concat([X, df[cat_cols]], axis=1), drop_first=True)
    y = df['price_egp']
    return X.fillna(0), y

try:
    X, y = prepare_model_data(df_feat)
    print('Modeling matrix shape:', X.shape)
except NameError:
    print('Prepare previous steps first.')

Modeling matrix shape: (1820, 44)


In [24]:
# Train/test split (time-aware option if listing_date exists)
def split_data(X, y, df, test_size=0.2, time_based=False):
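    if time_based and 'listing_date' in df.columns:
        # restrict df to the rows kept in X (prepare_model_data filters on bedrooms),
        # then sort by listing_date and take the last test_size fraction as test
        order = df.loc[X.index].sort_values('listing_date').index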
        split_idx = int(len(order)*(1-test_size))
        train_idx = order[:split_idx]
        test_idx = order[split_idx:]
        return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]
    else:
        return train_test_split(X, y, test_size=test_size, random_state=42)

try:
    X_train, X_test, y_train, y_test = split_data(X, y, df_feat, test_size=0.2, time_based=False)
    print('Train/test sizes:', X_train.shape, X_test.shape)
except NameError:
    print('Prepare data first.')

Train/test sizes: (1456, 44) (364, 44)
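A single 80/20 split gives one MAE estimate. `KFold` is imported in the setup cell but not used, so here is a minimal sketch of 5-fold cross-validation on synthetic data (the area and price values are generated, not taken from the dataset). In the notebook the same call would take `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration: price grows roughly linearly with area.
rng = np.random.default_rng(42)
area = rng.uniform(60, 300, size=200)
price = 25_000 * area + rng.normal(0, 100_000, size=200)
X_demo = area.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# scikit-learn reports negated MAE so that larger is always better.
neg_mae = cross_val_score(LinearRegression(), X_demo, price,
                          cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae
print("MAE per fold:", mae_per_fold.round(0))
print("Mean MAE:", round(float(mae_per_fold.mean())))
```

The spread across folds gives a rough sense of how much the MAE ranking between models could change with a different split.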


In [25]:
# Define a train-evaluate helper
def evaluate_model(model, X_train, y_train, X_test, y_test, verbose=True):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    if verbose:
        print(f'MAE: {mae:.0f}, RMSE: {rmse:.0f}, R2: {r2:.3f}')
    return {'model': model, 'mae': mae, 'rmse': rmse, 'r2': r2, 'preds': preds}

models = {}
# Linear Regression
lr = LinearRegression()
models['LinearRegression'] = evaluate_model(lr, X_train, y_train, X_test, y_test)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
models['RandomForest'] = evaluate_model(rf, X_train, y_train, X_test, y_test)

# XGBoost (if available)
if xgb is not None:
    xgbr = xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
    models['XGBoost'] = evaluate_model(xgbr, X_train, y_train, X_test, y_test)
else:
    print('XGBoost not installed; skipping.')

# LightGBM (if available)
if lgb is not None:
    lgbr = lgb.LGBMRegressor(n_estimators=100, random_state=42)
    models['LightGBM'] = evaluate_model(lgbr, X_train, y_train, X_test, y_test)
else:
    print('LightGBM not installed; skipping.')

# Show summary table
try:
    summary = pd.DataFrame([{ 'model':k, 'mae':v['mae'], 'rmse':v['rmse'], 'r2':v['r2']} for k,v in models.items()])
    display(summary.sort_values('mae'))
except Exception as e:
    print('Couldn\'t summarize:', e)

MAE: 101542, RMSE: 151463, R2: 0.974
MAE: 40176, RMSE: 62756, R2: 0.996
MAE: 47693, RMSE: 64613, R2: 0.995
LightGBM not installed; skipping.


Unnamed: 0,model,mae,rmse,r2
1,RandomForest,40175.824176,62755.959395,0.995606
2,XGBoost,47693.441406,64612.808142,0.995343
0,LinearRegression,101542.213428,151463.171105,0.974407
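One caveat when reading these scores: the feature matrix contains both `area_sqm` and `price_per_sqm`, and their product is exactly the target. The short check below uses the first three rows of the feature table to show this. It suggests the very high R² values are at least partly due to target leakage rather than predictive skill.

```python
# (price_egp, area_sqm) for the first three listings in the feature table
rows = [(3650000, 145), (3900000, 155), (2650000, 109)]

for price_egp, area_sqm in rows:
    price_per_sqm = price_egp / area_sqm       # engineered feature
    reconstructed = price_per_sqm * area_sqm   # uses two "features" only
    print(price_egp, round(reconstructed))

# Largest gap between the true price and the product of the two features
max_error = max(abs(p - (p / a) * a) for p, a in rows)
print(max_error)
```

Adding `'price_per_sqm'` to the `ignore` list in `prepare_model_data` removes the leak. The metrics would then need to be recomputed, and they are likely to be noticeably worse.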


In [38]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

# Numeric features
numeric_features = [
    "area_sqm", "bedrooms", "bathrooms", "floor_number", "building_age_years",
    "distance_to_auc_km", "distance_to_mall_km", "distance_to_metro_km"
]

# Categorical features
categorical_features = ["district", "finishing_type", "view_type"]

# Preprocessor
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
])

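# Regressor: a fresh RandomForest with the same settings as in the model comparison.
# The RandomForest saved earlier was fitted on the 44-column get_dummies matrix,
# so it cannot be reused behind this preprocessor.
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Combine into pipeline
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", rf_model)
])

# Fit preprocessor and regressor together on the raw columns (2-3 bedroom listings only)
train_df = df_clean[df_clean["bedrooms"].isin([2, 3])]
pipeline.fit(train_df[numeric_features + categorical_features], train_df["price_egp"])

# Save full pipeline
joblib.dump(pipeline, r"C:\Users\shahd\Depi Data Science\Case Study\chat\models\best_model_pipeline.pkl")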


['C:\\Users\\shahd\\Depi Data Science\\Case Study\\chat\\models\\best_model_pipeline.pkl']

In [42]:
# Summary statistics; note the call parentheses (`df.describe` alone only echoes the bound method)
df.describe()


In [43]:
# Simulate user input (replace these with real values)
district = "Fifth Settlement"
finishing = "Super Lux"
view = "Garden"
area = 150
bedrooms = 3
bathrooms = 2
floor = 3
building_age = 5
distance_auc = 5.0
distance_mall = 3.0
distance_metro = 10.0

# Build a one-row DataFrame whose column names match the pipeline's training columns
input_df = pd.DataFrame([{
    "district": district,
    "finishing_type": finishing,
    "view_type": view,
    "area_sqm": area,
    "bedrooms": bedrooms,
    "bathrooms": bathrooms,
    "floor_number": floor,
    "building_age_years": building_age,
    "distance_to_auc_km": distance_auc,
    "distance_to_mall_km": distance_mall,
    "distance_to_metro_km": distance_metro
}])
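The cell above only builds the input frame. Below is a self-contained sketch of the step that would follow, fitting a preprocessing-plus-model pipeline and predicting for one row. It uses a tiny made-up training set and a reduced feature list so it runs on its own; `demo_pipeline` and `demo_input` are hypothetical names. In the notebook, the fitted `pipeline` and `input_df` would take their place.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set, for illustration only
train = pd.DataFrame({
    "area_sqm": [100, 150, 200, 120, 180, 90],
    "bedrooms": [2, 3, 3, 2, 3, 2],
    "district": ["Madinaty", "Fifth Settlement", "Katameya",
                 "Rehab City", "Katameya", "Madinaty"],
    "price_egp": [2500000, 3800000, 5000000, 3000000, 4500000, 2300000],
})
features = ["area_sqm", "bedrooms", "district"]

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["area_sqm", "bedrooms"]),
        # handle_unknown="ignore" lets unseen districts pass through as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["district"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])
# Fitting the pipeline fits the scaler, the encoder, and the model together
demo_pipeline.fit(train[features], train["price_egp"])

demo_input = pd.DataFrame([{"area_sqm": 150, "bedrooms": 3,
                            "district": "Fifth Settlement"}])
predicted_price = float(demo_pipeline.predict(demo_input)[0])
print(f"Estimated price: {predicted_price:,.0f} EGP")
```

Keeping preprocessing inside the pipeline means the prediction service only needs raw column values, not the one-hot column layout used during training.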


## 7) Select best model (monthly strategy) & save artifact

Strategy: each month, retrain models on new data, compare MAE on a validation slice, pick the best model and save it (joblib) and log with MLflow.

In [26]:
# Pick best model by MAE and save
def select_and_save(models_dict, X_test, y_test, model_dir='models'):
    # X_test and y_test are not used here (MAE was already computed by evaluate_model);
    # they are kept in the signature so existing calls keep working.
    os.makedirs(model_dir, exist_ok=True)
    # Lowest MAE wins; ties go to the model that was added first
    best_name = min(models_dict, key=lambda name: models_dict[name]['mae'])
    best = {'name': best_name, **models_dict[best_name]}
    # Save model artifact
    filename = f'{model_dir}/{best_name}_model.joblib'
    joblib.dump(best['model'], filename)
    print('Saved best model:', best['name'], '->', filename)
    return best, filename

best, fname = select_and_save(models, X_test, y_test)
best

Saved best model: RandomForest -> models/RandomForest_model.joblib


{'name': 'RandomForest',
 'model': RandomForestRegressor(n_jobs=-1, random_state=42),
 'mae': 40175.82417582418,
 'rmse': np.float64(62755.95939478927),
 'r2': 0.9956064673736594,
 'preds': array([3520500., 5711500., 3346000., 2809000., 5515000., 3893000., ...])}
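Because the strategy is monthly, a fixed filename such as `RandomForest_model.joblib` is overwritten whenever the same model type wins again. One option is to include the month in the artifact name, sketched below. `monthly_model_path` is a hypothetical helper and the naming scheme is an assumption, not something the notebook defines.

```python
from datetime import date
from pathlib import Path

def monthly_model_path(model_name, run_date, model_dir="models"):
    """Build a path such as models/<name>_<YYYY-MM>.joblib."""
    return Path(model_dir) / f"{model_name}_{run_date:%Y-%m}.joblib"

path = monthly_model_path("RandomForest", date(2025, 11, 1))
print(path.as_posix())  # models/RandomForest_2025-11.joblib
```

Keeping one artifact per month makes it possible to roll back to an earlier model if a retrain turns out worse in practice.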

## 8) MLflow: log runs and artifacts (optional)

If MLflow is installed, you can log parameters, metrics, and models. The `mlflow ui` command is only needed to browse the logged runs afterwards. The code below checks for mlflow availability.

In [27]:
if mlflow is None:
    print('MLflow not installed in this environment. Install mlflow to use tracking features.')
else:
    # Simple example logging the best model
    with mlflow.start_run(run_name='best_model_save'):
        mlflow.log_param('selected_model', best['name'])
        mlflow.log_metric('mae', float(best['mae']))
        mlflow.sklearn.log_model(best['model'], artifact_path='model')        
        print('Logged best model to MLflow run:', mlflow.active_run().info.run_id)



Logged best model to MLflow run: a73fdb2eb2874f14a028a92a9ad7fb5e


## 9) Web scraping to expand dataset (template)

**Important:** Always check each site's `robots.txt` and terms of service. Use polite scraping: `time.sleep`, caching, and consider official APIs if available.

Below is a template for scraping listing pages from Property Finder; the same pattern can be adapted for OLX or Aqar. It is only a *template*: adapt the selectors to the current site HTML, and obey robots.txt.
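The robots.txt check can be automated with the standard library. The sketch below parses a made-up rule set for a placeholder domain so it runs without network access. `is_allowed` is a hypothetical helper, and the rules shown are not those of any real site.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for a placeholder domain, for illustration only
sample_rules = """User-agent: *
Disallow: /admin/
"""
print(is_allowed(sample_rules, "https://example.com/en/search?page=1"))
print(is_allowed(sample_rules, "https://example.com/admin/users"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of parsing a string. The terms of service still need to be read separately, since robots.txt does not cover them.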

In [28]:
# Web scraping template with requests + BeautifulSoup
# NOTE: Live web requests may be blocked in hosted notebook environments; run this locally,
# and expect to adjust the placeholder selectors before it returns any rows.
import requests, time
from bs4 import BeautifulSoup

def fetch_propertyfinder_listings(search_url, max_pages=3, delay=2.0):
    listings = []
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0'}
    for page in range(1, max_pages+1):
        print('Fetching page', page, 'of', search_url)
        # let requests append the page number, whether or not search_url already has a query string
        r = requests.get(search_url, params={'page': page}, headers=headers, timeout=20)
        if r.status_code != 200:
            print('Bad status', r.status_code)
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        # TODO: adapt the selectors below to current site HTML
        cards = soup.select('.card') or soup.select('.listing-card')
        for c in cards:
            title_el = c.select_one('.title')
            price_el = c.select_one('.price')
            if title_el is None or price_el is None:
                continue  # skip cards that do not match the placeholder selectors
            # parse the price text to integer EGP here
            # other fields...
            listings.append({'title': title_el.get_text(strip=True),
                             'price_raw': price_el.get_text(strip=True)})
        time.sleep(delay)
    return pd.DataFrame(listings)

# Example usage (replace with actual search URL and test locally):
# df_pf = fetch_propertyfinder_listings('https://www.propertyfinder.eg/en/search?regions=New+Cairo', max_pages=2)
# display(df_pf.head())
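The template leaves the price parsing as a comment. Below is a minimal sketch of that step. It assumes the site prints the full amount with thousands separators (for example `EGP 3,650,000`); abbreviated forms such as `3.65M` or decimal prices would need extra handling. `parse_price_egp` is a hypothetical helper.

```python
import re

def parse_price_egp(price_text):
    """Return the price as an int, or None if the text contains no digits."""
    digits = re.sub(r"[^\d]", "", price_text or "")
    return int(digits) if digits else None

print(parse_price_egp("EGP 3,650,000"))
print(parse_price_egp("Ask for price"))
```

Returning `None` instead of raising keeps one odd listing from aborting a whole page, and the cleaning step already drops rows without a price.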

## 10) Monthly retrain & model selection (automation)

Outline for MLOps:
- Use a scheduler (cron on Linux, Task Scheduler on Windows, or Airflow) to run a monthly job.
- Job steps: pull newest scraped + internal listings, run cleaning+feature engineering, retrain 4 models, evaluate on hold-out validation set, pick best model, store artifact, update prediction API model, and notify sales team.

Example command (run via cron):
```
python retrain_and_deploy.py --data-folder /data/new_listings --output-model-folder /models/current
```
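The notebook does not include `retrain_and_deploy.py`. Below is a minimal sketch of how its command-line interface could be parsed, using the two flags from the command above. The defaults and help texts are assumptions.

```python
import argparse

def build_parser():
    """Command-line interface for the monthly retrain job (sketch)."""
    parser = argparse.ArgumentParser(description="Monthly retrain job (sketch)")
    parser.add_argument("--data-folder", required=True,
                        help="Folder containing the new listing CSVs")
    parser.add_argument("--output-model-folder", default="models",
                        help="Folder where the selected model is stored")
    return parser

# Parse the same arguments as the example command
args = build_parser().parse_args(
    ["--data-folder", "/data/new_listings",
     "--output-model-folder", "/models/current"]
)
print(args.data_folder, args.output_model_folder)
```

In the script itself, `parse_args()` would be called without a list so it reads `sys.argv`. A crontab entry such as `0 3 1 * * python retrain_and_deploy.py ...` would then run the job at 03:00 on the first day of each month.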


In [29]:
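# Example: simple monthly retrain function (reuses the helpers defined above)
def monthly_retrain_pipeline(data_folder, model_folder='models'):
    # 1. Load all data CSVs in the folder
    files = sorted(Path(data_folder).glob('*.csv'))
    if not files:
        print('No new data found in', data_folder)
        return None
    df_all = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    # 2. Same cleaning, feature engineering, and split as in the sections above
    df_all = clean_real_estate(df_all)
    df_all = feature_engineering(df_all)
    X, y = prepare_model_data(df_all)
    X_train, X_test, y_train, y_test = split_data(X, y, df_all, test_size=0.2, time_based=False)
    # 3. Retrain the candidate models (XGBoost / LightGBM only if installed)
    candidates = {
        'LinearRegression': LinearRegression(),
        'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    }
    if xgb is not None:
        candidates['XGBoost'] = xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
    if lgb is not None:
        candidates['LightGBM'] = lgb.LGBMRegressor(n_estimators=100, random_state=42)
    results = {name: evaluate_model(m, X_train, y_train, X_test, y_test, verbose=False)
               for name, m in candidates.items()}
    # 4. Pick the model with the lowest MAE and save it to model_folder
    best, filename = select_and_save(results, X_test, y_test, model_dir=model_folder)
    print('Monthly retrain pipeline executed; best model:', best['name'])
    return best, filename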

## 11) MLflow UI setup

In [30]:
# Optional: store all runs in a specific folder.
# Call this before any mlflow.start_run(); runs logged earlier stay in the previous tracking location.
import mlflow
mlflow.set_tracking_uri(r"file:C:\Users\shahd\Depi Data Science\Case Study\chat\mlruns")
mlflow.set_experiment("Cairo_RealEstate_Pricing")

print("✅ MLflow tracking set. Run `mlflow ui --port 5001` in terminal to open the dashboard.")

✅ MLflow tracking set. Run `mlflow ui --port 5001` in terminal to open the dashboard.




## Final Notes

- This notebook is a practical, reproducible starting point. Replace the scraping selectors with the actual HTML classes/IDs from the target sites, and obey their terms of service.
- For production: create a small REST API (FastAPI/Flask) that loads `models/current_model.joblib` and serves price predictions to agents.
- Add authentication and logging before exposing the model to users.
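Whichever web framework is used, the API should validate a request before it reaches the model. The sketch below is framework-independent. The field list mirrors the simulated user input from earlier in the notebook, and `REQUIRED_FIELDS` and `validate_request` are hypothetical names.

```python
# Expected fields and their accepted Python types (mirrors the simulated input)
REQUIRED_FIELDS = {
    "district": str, "finishing_type": str, "view_type": str,
    "area_sqm": (int, float), "bedrooms": int, "bathrooms": int,
    "floor_number": int, "building_age_years": (int, float),
    "distance_to_auc_km": (int, float), "distance_to_mall_km": (int, float),
    "distance_to_metro_km": (int, float),
}

def validate_request(payload):
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for {field}")
    return problems

good = {
    "district": "Fifth Settlement", "finishing_type": "Super Lux",
    "view_type": "Garden", "area_sqm": 150, "bedrooms": 3, "bathrooms": 2,
    "floor_number": 3, "building_age_years": 5, "distance_to_auc_km": 5.0,
    "distance_to_mall_km": 3.0, "distance_to_metro_km": 10.0,
}
print(validate_request(good))
print(validate_request({"district": "Katameya"}))
```

A FastAPI or Flask route would call `validate_request` first, return an error response if the list is non-empty, and only then build the one-row DataFrame for the model.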

