# **Project Name-** Real Estate Investment Advisor: Predicting Property Profitability & Future Value

# **Project Type-** Real Estate / Investment / Financial Analytics

# *Contributor-* Ayush Singh

# Problem Statement

Develop a machine learning application to assist potential investors in making real estate decisions. The system should:<br>
Classify whether a property is a "Good Investment" (Classification).<br>
Predict the estimated property price after 5 years (Regression).<br>
Use the provided dataset to preprocess and analyze the data, engineer relevant features, and deploy a user-interactive application using Streamlit that provides investment recommendations and price forecasts. MLflow will be used for experiment tracking.


# Project Summary

This end-to-end ML project develops a Real Estate Investment Advisor that analyzes residential properties across Indian cities to predict investment potential and 5-year future price appreciation. Starting from a 250k-row dataset, the pipeline engineers 28 predictive features including property attributes (BHK, size, age), location scores (school/hospital density), positional metrics (floor ratio), and amenity indicators, then trains dual XGBoost models: a classifier (ROC AUC=1.000, precision=0.956, recall=0.994 at a 0.3 threshold) for "good investment" decisions and a regressor (R²≈1.000, RMSE=0.009) for log-transformed price forecasts. <br>
Preprocessing uses RobustScaler for 10 numeric features, OneHotEncoder (drop='first') for 7 low-cardinality categoricals (yielding 11 dummies), and target encoding for high-cardinality Locality; together with 6 engineered features this gives consistent 28-column feature vectors. Models are tracked via MLflow with a SQLite backend and registered in the model registry (version 1, Production stage), and preprocessing artifacts (scalers, encoders) are serialized with joblib for inference reproducibility. The Streamlit app provides an interface with dropdowns populated from training data lookups, sidebar filters, and real-time predictions displaying investment classification, future price (₹ Lakhs), and ROI% metrics.
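
The 28-column figure can be checked with a little arithmetic. The sketch below is an illustration, not part of the pipeline. It assumes the category counts reported by `nunique()` later in this notebook and that `drop='first'` removes one dummy column per categorical feature.

```python
# Category counts per low-cardinality column, as reported by df[col].nunique() in this notebook.
low_cardinality_levels = {
    "Property_Type": 3,
    "Furnished_Status": 3,
    "Public_Transport_Accessibility": 3,
    "Parking_Space": 2,
    "Security": 2,
    "Owner_Type": 3,
    "Availability_Status": 2,
}

n_scaled_numeric = 10   # RobustScaler outputs
n_target_encoded = 1    # Locality
n_engineered = 6        # density scores, floor ratio, age score, amenity score, ready-to-move flag

# drop='first' keeps (levels - 1) dummy columns per categorical feature.
n_onehot = sum(levels - 1 for levels in low_cardinality_levels.values())
n_total = n_scaled_numeric + n_onehot + n_target_encoded + n_engineered

print(f"one-hot dummies: {n_onehot}, total features: {n_total}")
```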

# Github Link

https://github.com/AyushSinghRana15/Real-Estate-Investment-Advisor.git

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, TargetEncoder
from sklearn.impute import SimpleImputer


In [2]:
# Install if needed (run once)
!pip install gdown pandas



In [3]:
import gdown

# Direct download from Google Drive share link
url = "https://drive.google.com/uc?id=1ys25Eaqo2n8IeHhyI9s0kmJBgnNzxQHX"
output = "real_estate_dataset.csv"

gdown.download(url, output, quiet=False)

# Load dataset
df = pd.read_csv(output)
print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

Downloading...
From: https://drive.google.com/uc?id=1ys25Eaqo2n8IeHhyI9s0kmJBgnNzxQHX
To: /Users/ayushsingh/Internship Projects/11th Project/real_estate_dataset.csv
100%|██████████| 41.1M/41.1M [00:01<00:00, 26.8MB/s]


Dataset loaded successfully!
Shape: (250000, 23)


In [4]:
print("\nFirst 5 rows:")
df.head()


First 5 rows:


Unnamed: 0,ID,State,City,Locality,Property_Type,BHK,Size_in_SqFt,Price_in_Lakhs,Price_per_SqFt,Year_Built,...,Age_of_Property,Nearby_Schools,Nearby_Hospitals,Public_Transport_Accessibility,Parking_Space,Security,Amenities,Facing,Owner_Type,Availability_Status
0,1,Tamil Nadu,Chennai,Locality_84,Apartment,1,4740,489.76,0.1,1990,...,35,10,3,High,No,No,"Playground, Gym, Garden, Pool, Clubhouse",West,Owner,Ready_to_Move
1,2,Maharashtra,Pune,Locality_490,Independent House,3,2364,195.52,0.08,2008,...,17,8,1,Low,No,Yes,"Playground, Clubhouse, Pool, Gym, Garden",North,Builder,Under_Construction
2,3,Punjab,Ludhiana,Locality_167,Apartment,2,3642,183.79,0.05,1997,...,28,9,8,Low,Yes,No,"Clubhouse, Pool, Playground, Gym",South,Broker,Ready_to_Move
3,4,Rajasthan,Jodhpur,Locality_393,Independent House,2,2741,300.29,0.11,1991,...,34,5,7,High,Yes,Yes,"Playground, Clubhouse, Gym, Pool, Garden",North,Builder,Ready_to_Move
4,5,Rajasthan,Jaipur,Locality_466,Villa,4,4823,182.9,0.04,2002,...,23,4,9,Low,No,Yes,"Playground, Garden, Gym, Pool, Clubhouse",East,Builder,Ready_to_Move


In [5]:
print("\nColumn info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum())
print("\nDataset description:")
print(df.describe())


Column info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   ID                              250000 non-null  int64  
 1   State                           250000 non-null  object 
 2   City                            250000 non-null  object 
 3   Locality                        250000 non-null  object 
 4   Property_Type                   250000 non-null  object 
 5   BHK                             250000 non-null  int64  
 6   Size_in_SqFt                    250000 non-null  int64  
 7   Price_in_Lakhs                  250000 non-null  float64
 8   Price_per_SqFt                  250000 non-null  float64
 9   Year_Built                      250000 non-null  int64  
 10  Furnished_Status                250000 non-null  object 
 11  Floor_No                        250000 non-null  int64  
 12  To

In [6]:
# Check categorical columns unique values
cat_cols = df.select_dtypes(include=['object']).columns
for col in cat_cols:
    print(f"\n{col}: {df[col].nunique()} unique values")
    if df[col].nunique() < 20:
        print(df[col].value_counts().head())

# Check target candidates
price_cols = [col for col in df.columns if 'price' in col.lower() or 'value' in col.lower()]
area_cols = [col for col in df.columns if 'area' in col.lower() or 'sqft' in col.lower()]
print("\nPrice columns:", price_cols)
print("Area columns:", area_cols)

# Sample location/BHK parsing check
print("\nLocation sample:", df['location'].value_counts().head(10) if 'location' in df else "No location column")



State: 20 unique values

City: 42 unique values

Locality: 500 unique values

Property_Type: 3 unique values
Property_Type
Villa                83744
Independent House    83300
Apartment            82956
Name: count, dtype: int64

Furnished_Status: 3 unique values
Furnished_Status
Unfurnished       83408
Semi-furnished    83374
Furnished         83218
Name: count, dtype: int64

Public_Transport_Accessibility: 3 unique values
Public_Transport_Accessibility
High      83705
Low       83287
Medium    83008
Name: count, dtype: int64

Parking_Space: 2 unique values
Parking_Space
No     125456
Yes    124544
Name: count, dtype: int64

Security: 2 unique values
Security
Yes    125233
No     124767
Name: count, dtype: int64

Amenities: 325 unique values

Facing: 4 unique values
Facing
West     62757
North    62637
South    62337
East     62269
Name: count, dtype: int64

Owner_Type: 3 unique values
Owner_Type
Broker     83479
Owner      83268
Builder    83253
Name: count, dtype: int64

Availabil

In [7]:
df["Amenities"].unique()

array(['Playground, Gym, Garden, Pool, Clubhouse',
       'Playground, Clubhouse, Pool, Gym, Garden',
       'Clubhouse, Pool, Playground, Gym',
       'Playground, Clubhouse, Gym, Pool, Garden',
       'Playground, Garden, Gym, Pool, Clubhouse',
       'Playground, Clubhouse', 'Clubhouse, Garden, Playground',
       'Gym, Pool, Clubhouse, Playground',
       'Garden, Clubhouse, Playground',
       'Clubhouse, Playground, Garden, Gym', 'Clubhouse',
       'Clubhouse, Gym, Playground, Pool',
       'Clubhouse, Garden, Gym, Playground, Pool',
       'Garden, Gym, Playground', 'Playground',
       'Pool, Playground, Garden, Gym', 'Pool, Clubhouse, Gym',
       'Garden, Clubhouse, Pool, Gym, Playground',
       'Pool, Playground, Clubhouse',
       'Clubhouse, Gym, Garden, Pool, Playground',
       'Pool, Clubhouse, Gym, Playground, Garden',
       'Garden, Pool, Gym, Playground, Clubhouse', 'Pool, Gym, Clubhouse',
       'Clubhouse, Garden', 'Pool, Garden, Playground, Gym',
       'Garden

In [8]:
df['State'].unique()

array(['Tamil Nadu', 'Maharashtra', 'Punjab', 'Rajasthan', 'West Bengal',
       'Chhattisgarh', 'Delhi', 'Jharkhand', 'Telangana', 'Karnataka',
       'Uttar Pradesh', 'Assam', 'Uttarakhand', 'Bihar', 'Gujarat',
       'Haryana', 'Andhra Pradesh', 'Madhya Pradesh', 'Kerala', 'Odisha'],
      dtype=object)

In [9]:
# === DUPLICATE CHECK ===
print("=== DUPLICATE ANALYSIS ===")

# 1. Total duplicates (all columns)
total_dups = df.duplicated().sum()
print(f"Total duplicate rows: {total_dups}")

# 2. Duplicates by key business columns (location + specs)
key_cols = ['Locality', 'BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Property_Type']
business_dups = df.duplicated(subset=key_cols).sum()
print(f"✅ Business duplicates (Locality+BHK+Size+Price+Type): {business_dups}")

# 3. Show duplicate samples if any
if total_dups > 0:
    print("\nSample duplicate rows:")
    print(df[df.duplicated()].head(3))
    
    # Remove duplicates
    print(f"\nDataset shape BEFORE removing duplicates: {df.shape}")
    df_clean = df.drop_duplicates()
    print(f"Dataset shape AFTER removing duplicates: {df_clean.shape}")
    df = df_clean  # Update df
else:
    print("✅ No duplicates found!")

# 4. Report memory usage
print(f"\nDataset memory: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")


=== DUPLICATE ANALYSIS ===
Total duplicate rows: 0
✅ Business duplicates (Locality+BHK+Size+Price+Type): 0
✅ No duplicates found!

Dataset memory: 187.0 MB
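
As a rough illustration of what `df.duplicated(subset=key_cols)` counts, the pure-Python sketch below marks every repeat of a key tuple after its first occurrence, which mirrors pandas' default `keep='first'`. The helper name `count_duplicates` and the toy records are made up for the example.

```python
def count_duplicates(records, key_cols):
    """Count rows whose key tuple was already seen (the first occurrence is not counted)."""
    seen = set()
    duplicates = 0
    for row in records:
        key = tuple(row[col] for col in key_cols)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Hypothetical toy listings, not taken from the dataset.
toy_rows = [
    {"Locality": "Locality_1", "BHK": 2, "Size_in_SqFt": 1000, "Price_in_Lakhs": 50.0},
    {"Locality": "Locality_1", "BHK": 2, "Size_in_SqFt": 1000, "Price_in_Lakhs": 50.0},
    {"Locality": "Locality_2", "BHK": 3, "Size_in_SqFt": 1500, "Price_in_Lakhs": 90.0},
]
n_dups = count_duplicates(toy_rows, ["Locality", "BHK", "Size_in_SqFt", "Price_in_Lakhs"])
print(n_dups)
```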


In [10]:
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

# === NUMERIC COLUMNS ===
numeric_features = [
    'BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt',
    'Year_Built', 'Floor_No', 'Total_Floors', 'Age_of_Property',
    'Nearby_Schools', 'Nearby_Hospitals'
]

print(f"🔢 Scaling {len(numeric_features)} numeric features...")
print("Features:", numeric_features)

# === ROBUST SCALER (Outlier-proof for real estate prices) ===
numeric_scaler = RobustScaler()

# Fit and transform
df_numeric_scaled = pd.DataFrame(
    numeric_scaler.fit_transform(df[numeric_features]),
    columns=[f"{col}_scaled" for col in numeric_features],
    index=df.index
)

# Combine with original dataframe
df_scaled = pd.concat([df.reset_index(drop=True), df_numeric_scaled.reset_index(drop=True)], axis=1)

print("✅ Scaling COMPLETE!")
print("\nBefore vs After (first 5 rows):")
print(df[numeric_features].head())
print(df_scaled[[f"{col}_scaled" for col in numeric_features]].head())

# RobustScaler centers each feature on its median (not its mean) and scales by the IQR
print(f"\n📊 Scale verification:")
print("Mean (not necessarily 0):", [f"{col}_scaled mean: {df_scaled[f'{col}_scaled'].mean():.3f}" 
      for col in numeric_features[:3]])
print("Median = 0:", [f"{col}_scaled median: {df_scaled[f'{col}_scaled'].median():.3f}" 
      for col in numeric_features[:3]])


🔢 Scaling 10 numeric features...
Features: ['BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt', 'Year_Built', 'Floor_No', 'Total_Floors', 'Age_of_Property', 'Nearby_Schools', 'Nearby_Hospitals']
✅ Scaling COMPLETE!

Before vs After (first 5 rows):
   BHK  Size_in_SqFt  Price_in_Lakhs  Price_per_SqFt  Year_Built  Floor_No  \
0    1          4740          489.76            0.10        1990        22   
1    3          2364          195.52            0.08        2008        21   
2    2          3642          183.79            0.05        1997        19   
3    2          2741          300.29            0.11        1991        21   
4    4          4823          182.90            0.04        2002         3   

   Total_Floors  Age_of_Property  Nearby_Schools  Nearby_Hospitals  
0             1               35              10                 3  
1            20               17               8                 1  
2            27               28               9                
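
To make the scaling step concrete, here is a minimal standard-library sketch of what a robust scaler computes, assuming the default behaviour of centering on the median and dividing by the interquartile range (25th to 75th percentile). The helper `robust_scale` and the toy prices are illustrative, and the `inclusive` quantile method is used as a stand-in for scikit-learn's percentile computation.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Toy prices with one extreme outlier (made-up numbers).
prices = [1, 2, 3, 4, 100]
scaled = robust_scale(prices)

print("scaled:", scaled)
print("median:", statistics.median(scaled))
print("mean:  ", statistics.mean(scaled))
```

The scaled median is zero by construction, while the mean is pulled upward by the outlier. This is why a median check is the meaningful verification for this scaler.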

In [11]:
# === DEBUG: Check for None/NaN in YOUR categorical columns ===
your_categorical_features = [
    'Property_Type', 'Furnished_Status', 'Public_Transport_Accessibility', 
    'Parking_Space', 'Security', 'Owner_Type', 'Availability_Status'
]

print("🔍 DEBUGGING CATEGORICAL COLUMNS:")
for col in your_categorical_features + ['Locality']:
    null_count = df[col].isnull().sum()
    unique_count = df[col].nunique()
    print(f"{col:25} | Nulls: {null_count:4} | Unique: {unique_count:3}")
    
    if null_count > 0:
        print(f"   ❌ NULL VALUES FOUND → Filling with 'Unknown'")
        df[col] = df[col].fillna('Unknown')
    
    # Check first few unique values
    print(f"   Sample: {df[col].unique()[:3]}")

print("\n✅ Data cleaned! Ready for encoding.")


🔍 DEBUGGING CATEGORICAL COLUMNS:
Property_Type             | Nulls:    0 | Unique:   3
   Sample: ['Apartment' 'Independent House' 'Villa']
Furnished_Status          | Nulls:    0 | Unique:   3
   Sample: ['Furnished' 'Unfurnished' 'Semi-furnished']
Public_Transport_Accessibility | Nulls:    0 | Unique:   3
   Sample: ['High' 'Low' 'Medium']
Parking_Space             | Nulls:    0 | Unique:   2
   Sample: ['No' 'Yes']
Security                  | Nulls:    0 | Unique:   2
   Sample: ['No' 'Yes']
Owner_Type                | Nulls:    0 | Unique:   3
   Sample: ['Owner' 'Builder' 'Broker']
Availability_Status       | Nulls:    0 | Unique:   2
   Sample: ['Ready_to_Move' 'Under_Construction']
Locality                  | Nulls:    0 | Unique: 500
   Sample: ['Locality_84' 'Locality_490' 'Locality_167']

✅ Data cleaned! Ready for encoding.


In [12]:
from sklearn.preprocessing import OneHotEncoder, TargetEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

# === 8 CATEGORICAL COLUMNS (7 low-cardinality + Locality) ===
low_cardinality = [
    'Property_Type', 'Furnished_Status', 'Public_Transport_Accessibility', 
    'Parking_Space', 'Security', 'Owner_Type', 'Availability_Status'
]
high_cardinality = ['Locality']

cat_columns = low_cardinality + high_cardinality

# === ENCODING IN SEPARATE STEPS ===
print("🔄 ENCODING STEP-BY-STEP (Bulletproof)...")

# Step 1: OneHot Encoding (no y needed)
print("1️⃣ OneHot Encoding 7 columns...")
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
onehot_encoded = onehot_encoder.fit_transform(df[low_cardinality])
onehot_cols = onehot_encoder.get_feature_names_out(low_cardinality)
df_onehot = pd.DataFrame(onehot_encoded, columns=onehot_cols, index=df.index)

print(f"   ✅ OneHot shape: {df_onehot.shape}")

# Step 2: Target Encoding Locality (uses Price_in_Lakhs as target)
# Note: the encoder is fit on the full dataset here, before the train/test split
print("2️⃣ Target Encoding Locality...")
target_encoder = TargetEncoder(smooth=10.0)
df_locality_encoded = pd.DataFrame(
    target_encoder.fit_transform(df[['Locality']], df['Price_in_Lakhs']),  # y = Price_in_Lakhs
    columns=['Locality_target_encoded'],
    index=df.index
)

print(f"   ✅ Locality shape: {df_locality_encoded.shape}")

# Step 3: Combine one-hot and target-encoded blocks
df_cat_encoded = pd.concat([df_onehot, df_locality_encoded], axis=1)

print(f"\n🎉 TOTAL ENCODING SUCCESS!")
print(f"📊 Final shape: {df_cat_encoded.shape}")
print(f"📈 Features: {df_cat_encoded.shape[1]}")

print("\n🔍 FIRST 10 COLUMN NAMES:")
print(list(df_cat_encoded.columns[:10]))

print("\n📊 SAMPLE (first 3 rows, first 5 cols):")
print(df_cat_encoded.iloc[:3, :5].round(3))


🔄 ENCODING STEP-BY-STEP (Bulletproof)...
1️⃣ OneHot Encoding 7 columns...
   ✅ OneHot shape: (250000, 11)
2️⃣ Target Encoding Locality...
   ✅ Locality shape: (250000, 1)

🎉 TOTAL ENCODING SUCCESS!
📊 Final shape: (250000, 12)
📈 Features: 12

🔍 FIRST 10 COLUMN NAMES:
['Property_Type_Independent House', 'Property_Type_Villa', 'Furnished_Status_Semi-furnished', 'Furnished_Status_Unfurnished', 'Public_Transport_Accessibility_Low', 'Public_Transport_Accessibility_Medium', 'Parking_Space_Yes', 'Security_Yes', 'Owner_Type_Builder', 'Owner_Type_Owner']

📊 SAMPLE (first 3 rows, first 5 cols):
   Property_Type_Independent House  Property_Type_Villa  \
0                              0.0                  0.0   
1                              1.0                  0.0   
2                              0.0                  0.0   

   Furnished_Status_Semi-furnished  Furnished_Status_Unfurnished  \
0                              0.0                           0.0   
1     
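
The idea behind smoothed target encoding can be sketched without scikit-learn. The version below is a simplified illustration, assuming the usual shrinkage form in which each category mean is pulled toward the global mean with weight `smooth`. It does not reproduce the internal cross-fitting that `TargetEncoder.fit_transform` applies, and the helper names and toy data are hypothetical.

```python
from collections import defaultdict

def fit_smoothed_target_encoding(categories, targets, smooth=10.0):
    """Return (per-category encoding, global mean), shrinking category means toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }
    return encoding, global_mean

def encode(category, encoding, global_mean):
    """Unseen categories fall back to the global target mean."""
    return encoding.get(category, global_mean)

# Made-up localities and prices (in lakhs) for illustration.
localities = ["A", "A", "B"]
prices = [100.0, 200.0, 300.0]
enc, g_mean = fit_smoothed_target_encoding(localities, prices, smooth=10.0)
print(enc, encode("C", enc, g_mean))
```

With only a few rows per category, both encodings stay close to the global mean of 200. That is the intended effect of a large `smooth` value on sparse localities.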

In [13]:

def create_investment_features(df):
    df_feat = df.copy()

    # Price per SqFt (sanity version using Size_in_SqFt)
    df_feat['price_per_sqft_calc'] = df_feat['Price_in_Lakhs'] * 100000 / df_feat['Size_in_SqFt']

    # School density: schools per 1000 sqft
    df_feat['school_density_score'] = df_feat['Nearby_Schools'] / (df_feat['Size_in_SqFt'] / 1000)

    # Hospital density: hospitals per 1000 sqft
    df_feat['hospital_density_score'] = df_feat['Nearby_Hospitals'] / (df_feat['Size_in_SqFt'] / 1000)

    # Floor position ratio
    df_feat['floor_position_ratio'] = df_feat['Floor_No'] / df_feat['Total_Floors']

    # Age score (newer is better)
    df_feat['age_score'] = 1 / (1 + df_feat['Age_of_Property'])

    # Amenity score (simple sum of key amenities)
    df_feat['amenity_score'] = (
        (df_feat['Parking_Space'] == 'Yes').astype(int) +
        (df_feat['Security'] == 'Yes').astype(int) +
        (df_feat['Furnished_Status'] != 'Unfurnished').astype(int)
    )

    # Ready-to-move flag
    df_feat['ready_to_move'] = (df_feat['Availability_Status'] == 'Ready_to_Move').astype(int)

    return df_feat

df_features = create_investment_features(df)
print(df_features[['price_per_sqft_calc','school_density_score','amenity_score','ready_to_move']].head())


   price_per_sqft_calc  school_density_score  amenity_score  ready_to_move
0         10332.489451              2.109705              1              1
1          8270.727580              3.384095              1              0
2          5046.403075              2.471170              2              1
3         10955.490697              1.824152              3              1
4          3792.245490              0.829359              2              1
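
As a worked check of the formulas above, the snippet below recomputes a few engineered features by hand for the first row shown in `df.head()`. It is a plain-Python sketch for verification only. The input values are copied from that row.

```python
# First row of the dataset as shown in df.head() above.
row = {
    "Price_in_Lakhs": 489.76,
    "Size_in_SqFt": 4740,
    "Nearby_Schools": 10,
    "Nearby_Hospitals": 3,
    "Age_of_Property": 35,
}

# 1 lakh = 100,000 rupees, so this converts the price to rupees per square foot.
price_per_sqft_calc = row["Price_in_Lakhs"] * 100_000 / row["Size_in_SqFt"]

# Schools and hospitals per 1,000 sq ft of property size.
school_density_score = row["Nearby_Schools"] / (row["Size_in_SqFt"] / 1000)
hospital_density_score = row["Nearby_Hospitals"] / (row["Size_in_SqFt"] / 1000)

# Newer properties score closer to 1.
age_score = 1 / (1 + row["Age_of_Property"])

print(round(price_per_sqft_calc, 2), round(school_density_score, 4),
      round(hospital_density_score, 4), round(age_score, 4))
```

The first two values agree with the `price_per_sqft_calc` and `school_density_score` columns printed for row 0.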


# Good Investment label (domain rule)

In [14]:
def add_good_investment_label(df_feat: pd.DataFrame) -> pd.DataFrame:
    df_l = df_feat.copy()

    # Locality-wise median price per sqft (reference benchmark)
    locality_median_ppsf = df_l.groupby('Locality')['Price_per_SqFt'].median()

    # 1) Price bargain: at least 10% cheaper than locality median
    df_l['price_bargain'] = df_l['Price_per_SqFt'] < df_l['Locality'].map(locality_median_ppsf) * 0.9

    # 2) High appreciation potential (fundamentals)
    df_l['high_appreciation'] = (
        (df_l['BHK'] >= 2) &
        (df_l['school_density_score'] >= 0.3) &
        (df_l['amenity_score'] >= 2) &
        (df_l['Age_of_Property'] <= 10) &
        (df_l['ready_to_move'] == 1)
    )

    # Final binary label
    df_l['good_investment'] = (df_l['price_bargain'] & df_l['high_appreciation']).astype(int)

    print(f"Good Investment rate: {df_l['good_investment'].mean():.1%}")
    print(f"Price bargains: {df_l['price_bargain'].mean():.1%}")
    print(f"High appreciation: {df_l['high_appreciation'].mean():.1%}")

    return df_l

df_labeled = add_good_investment_label(df_features)
df_labeled[['Price_per_SqFt','school_density_score','amenity_score','good_investment']].head()


Good Investment rate: 2.7%
Price bargains: 47.1%
High appreciation: 5.9%


Unnamed: 0,Price_per_SqFt,school_density_score,amenity_score,good_investment
0,0.1,2.109705,1,0
1,0.08,3.384095,1,0
2,0.05,2.47117,2,0
3,0.11,1.824152,3,0
4,0.04,0.829359,2,0
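
The labeling rule is easier to reason about on a single record. The sketch below applies the same two conditions to one listing represented as a dict. The function name `is_good_investment`, the listing values, and the locality median are hypothetical and chosen only to exercise the rule.

```python
def is_good_investment(listing, locality_median_ppsf):
    """Apply the bargain + fundamentals rule to one listing (a plain dict)."""
    # At least 10% cheaper than the locality median price per sq ft.
    price_bargain = listing["Price_per_SqFt"] < 0.9 * locality_median_ppsf
    high_appreciation = (
        listing["BHK"] >= 2
        and listing["school_density_score"] >= 0.3
        and listing["amenity_score"] >= 2
        and listing["Age_of_Property"] <= 10
        and listing["ready_to_move"] == 1
    )
    return int(price_bargain and high_appreciation)

# Hypothetical listing and locality median.
listing = {
    "Price_per_SqFt": 0.05,
    "BHK": 3,
    "school_density_score": 2.0,
    "amenity_score": 2,
    "Age_of_Property": 5,
    "ready_to_move": 1,
}
print(is_good_investment(listing, locality_median_ppsf=0.08))
print(is_good_investment(dict(listing, Age_of_Property=20), locality_median_ppsf=0.08))
```

Because every condition must hold at once, the positive rate is much lower than either the bargain rate or the fundamentals rate alone.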


In [15]:
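df_l = df_features.copy()

# Exploratory variant: slightly easier bargain and fundamentals thresholds.
# This only changes the local df_l; df_labeled (used for training below) keeps the stricter rule.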
locality_median_ppsf = df_l.groupby('Locality')['Price_per_SqFt'].median()
df_l['price_bargain'] = df_l['Price_per_SqFt'] < df_l['Locality'].map(locality_median_ppsf) * 0.95

df_l['high_appreciation'] = (
    (df_l['BHK'] >= 2) &
    (df_l['school_density_score'] >= 0.2) &
    (df_l['amenity_score'] >= 1) &
    (df_l['Age_of_Property'] <= 15) &
    (df_l['ready_to_move'] == 1)
)

df_l['good_investment'] = (df_l['price_bargain'] & df_l['high_appreciation']).astype(int)
print(f"New Good Investment rate: {df_l['good_investment'].mean():.1%}")


New Good Investment rate: 7.1%


In [16]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Concatenate all feature blocks
# Note: X_num includes Price_in_Lakhs_scaled and Price_per_SqFt_scaled
X_num = df_scaled[[col for col in df_scaled.columns if col.endswith('_scaled')]]
X_cat = df_cat_encoded
X_eng = df_labeled[['school_density_score','hospital_density_score',
                    'floor_position_ratio','age_score','amenity_score','ready_to_move']]

X = pd.concat([X_num, X_cat, X_eng], axis=1)
y_cls = df_labeled['good_investment']
# Regression target: log1p of the listed price (Price_in_Lakhs)
y_reg = np.log1p(df_labeled['Price_in_Lakhs'])

X_train, X_test, y_cls_train, y_cls_test, y_reg_train, y_reg_test = train_test_split(
    X, y_cls, y_reg, test_size=0.2, random_state=42, stratify=y_cls
)
print(X_train.shape, X_test.shape, y_cls_train.mean(), y_cls_test.mean())


(200000, 28) (50000, 28) 0.027395 0.0274


# Model Training

In [None]:
# Classification Model

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    proba = model.predict_proba(X_test)[:, 1]
    preds = (proba >= threshold).astype(int)

    acc = accuracy_score(y_test, preds)
    prec = precision_score(y_test, preds, zero_division=0)
    rec = recall_score(y_test, preds, zero_division=0)
    roc = roc_auc_score(y_test, proba)

    print(f"Accuracy   : {acc:.3f}")
    print(f"Precision  : {prec:.3f}")
    print(f"Recall     : {rec:.3f}")
    print(f"ROC AUC    : {roc:.3f}")

    return {"accuracy": acc, "precision": prec, "recall": rec, "roc_auc": roc}

In [26]:
from sklearn.linear_model import LogisticRegression

# 1. Logistic Regression (baseline)
log_clf = LogisticRegression(
    max_iter=2000,       
    solver="lbfgs",        
    class_weight="balanced",
    n_jobs=-1
)
log_clf.fit(X_train, y_cls_train)

In [27]:
evaluate_classifier(log_clf, X_test, y_cls_test)

Accuracy   : 0.952
Precision  : 0.364
Recall     : 0.993
ROC AUC    : 0.991


{'accuracy': 0.9523,
 'precision': 0.36419587904736417,
 'recall': 0.9934306569343065,
 'roc_auc': 0.9908460128694101}
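
The gap between recall and precision is easier to see in confusion-matrix terms. The counts below are not printed by the notebook. They are reconstructed from the reported metrics and the test-set size (50,000 rows, positive rate 0.0274), so treat them as an inference rather than logged output.

```python
n_test = 50_000
n_pos = round(0.0274 * n_test)          # positives in the test split
recall = 0.9934306569343065
precision = 0.36419587904736417

tp = round(recall * n_pos)              # true positives
fn = n_pos - tp                         # missed good investments
predicted_pos = round(tp / precision)   # everything flagged as "good"
fp = predicted_pos - tp                 # false alarms
tn = n_test - n_pos - fp

accuracy = (tp + tn) / n_test
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}, accuracy={accuracy:.4f}")
```

The reconstructed accuracy matches the reported 0.9523. With `class_weight="balanced"` the baseline catches almost every positive but flags far more negatives than positives, hence the low precision.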

In [28]:
# 2. Random Forest Classifier (non-linear baseline)
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)
rf_clf.fit(X_train, y_cls_train)
evaluate_classifier(rf_clf, X_test, y_cls_test)

Accuracy   : 0.999
Precision  : 0.987
Recall     : 0.973
ROC AUC    : 1.000


{'accuracy': 0.99892,
 'precision': 0.9874074074074074,
 'recall': 0.972992700729927,
 'roc_auc': 0.9999621452619287}

In [29]:
# 3. XGBoost Classifier 
from xgboost import XGBClassifier
pos_weight = (y_cls_train == 0).sum() / (y_cls_train == 1).sum()

xgb_clf = XGBClassifier(
    objective='binary:logistic',
    learning_rate=0.05,
    n_estimators=400,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=pos_weight,
    eval_metric='aucpr',
    n_jobs=-1,
    random_state=42
)
xgb_clf.fit(X_train, y_cls_train)

evaluate_classifier(xgb_clf, X_test, y_cls_test, threshold=0.3)

Accuracy   : 0.999
Precision  : 0.956
Recall     : 0.994
ROC AUC    : 1.000


{'accuracy': 0.9986,
 'precision': 0.9564606741573034,
 'recall': 0.9941605839416059,
 'roc_auc': 0.9999713612845995}
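
For reference, the value passed as `scale_pos_weight` can be estimated from numbers already printed in this notebook. This is a back-of-the-envelope sketch that assumes the training split has 200,000 rows and the positive rate 0.027395 shown after the split.

```python
n_train = 200_000
positive_rate = 0.027395          # y_cls_train.mean() printed after the split

n_pos = round(n_train * positive_rate)
n_neg = n_train - n_pos

# Same ratio as (y_cls_train == 0).sum() / (y_cls_train == 1).sum()
pos_weight = n_neg / n_pos
print(f"positives={n_pos}, negatives={n_neg}, scale_pos_weight={pos_weight:.2f}")
```

Each positive example therefore counts roughly 35 times as much as a negative one in the loss, which compensates for the class imbalance.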

In [30]:
# Regression Model

In [31]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def evaluate_regressor(model, X_test, y_test):
    preds = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)

    print(f"RMSE : {rmse:.3f}")
    print(f"MAE  : {mae:.3f}")
    print(f"R²   : {r2:.3f}")

    return {"rmse": rmse, "mae": mae, "r2": r2}


In [32]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Linear Regression (baseline)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_reg_train)
lin_pred = lin_reg.predict(X_test)
evaluate_regressor(lin_reg, X_test, y_reg_test)


RMSE : 0.312
MAE  : 0.235
R²   : 0.852


{'rmse': 0.31192540507305666,
 'mae': 0.2348189382358499,
 'r2': 0.8516129797136546}
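
Because the regression target is `log1p(Price_in_Lakhs)`, these errors are in log units. The sketch below shows one common way to read them: a log-scale error of `e` corresponds roughly to a multiplicative factor of `exp(e)` on the price, and predictions are mapped back to lakhs with `expm1`. The helper `to_lakhs` is hypothetical, and the sample price is the first row of the dataset.

```python
import math

rmse_log = 0.312   # Linear Regression RMSE reported above, in log1p(price) units

# A log-scale error of e corresponds roughly to a multiplicative factor of exp(e) on price.
typical_factor = math.exp(rmse_log)

def to_lakhs(log_prediction):
    """Invert log1p: expm1(x) = exp(x) - 1."""
    return math.expm1(log_prediction)

price = 489.76
round_trip = to_lakhs(math.log1p(price))
print(f"factor ~ {typical_factor:.2f}, round trip = {round_trip:.2f}")
```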

In [33]:
from sklearn.ensemble import RandomForestRegressor
# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train, y_reg_train)
rf_reg_pred = rf_reg.predict(X_test)
evaluate_regressor(rf_reg, X_test, y_reg_test)


RMSE : 0.000
MAE  : 0.000
R²   : 1.000


{'rmse': 2.9367321226358262e-05,
 'mae': 1.010275298861667e-05,
 'r2': 0.9999999986847052}

In [34]:
from xgboost import XGBRegressor

# XGBoost Regressor (recommended main model)
xgb_reg = XGBRegressor(
    learning_rate=0.05,
    n_estimators=400,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=-1,
    random_state=42
)
xgb_reg.fit(X_train, y_reg_train)
xgb_reg_pred = xgb_reg.predict(X_test)
evaluate_regressor(xgb_reg, X_test, y_reg_test)


RMSE : 0.009
MAE  : 0.006
R²   : 1.000


{'rmse': 0.0093312699095966,
 'mae': 0.005886573885952311,
 'r2': 0.9998672067738846}

In [35]:
# From the above result 

1) XGBoostClassifier  ← main production model <br>
2) RandomForestClassifier  ← strong backup / baseline <br>
3) LogisticRegression  ← simple baseline for comparison


XGBoost and Random Forest both give ≈ 0.999 accuracy and ROC AUC ≈ 1.0, with very high precision and recall.<br>
XGBoost (evaluated at a 0.3 threshold) has better recall (0.994 vs. 0.973), which is valuable for not missing good deals, at the cost of somewhat lower precision (0.956 vs. 0.987).

1) XGBRegressor        ← main production model <br>
2) RandomForestRegressor  ← near-perfect backup <br>
3) Linear Regression   ← weaker baseline only

Linear Regression has RMSE ≈ 0.31 and R² ≈ 0.85 on the log-transformed price, while Random Forest and XGBoost both achieve R² ≈ 1.0 with extremely low errors, so those are clearly superior as the price model behind the 5-year forecast.
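
One possible reason for the near-perfect R² is worth checking: the feature matrix contains `Price_in_Lakhs_scaled`, and the regression target is `log1p(Price_in_Lakhs)`. The sketch below is an illustration under that reading of the notebook, with made-up center and scale values standing in for the fitted scaler statistics. It shows that the target can be recovered exactly from the scaled price, a relationship a tree ensemble can approximate very closely.

```python
import math

# Hypothetical RobustScaler statistics for Price_in_Lakhs (median and IQR are made up).
center, scale = 250.0, 200.0

def scale_price(price):
    """Forward transform: center on the median, divide by the IQR."""
    return (price - center) / scale

def target_from_scaled(scaled_price):
    """Undo the scaling, then apply the same log1p used for y_reg."""
    return math.log1p(scaled_price * scale + center)

prices = [489.76, 195.52, 183.79]
recovered = [target_from_scaled(scale_price(p)) for p in prices]
expected = [math.log1p(p) for p in prices]
max_error = max(abs(r - e) for r, e in zip(recovered, expected))
print(max_error)
```

If that reading is right, refitting the regressors without the price-derived columns would give a more realistic picture of predictive performance.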

# ML Flow

In [38]:
import mlflow
import mlflow.sklearn
import mlflow.xgboost
from mlflow import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("real_estate_investment_app")
client = MlflowClient()

In [39]:
# Log for classification model
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def log_classifier_to_mlflow(model, model_name, X_test, y_test, threshold=0.5):
    with mlflow.start_run(run_name=f"{model_name}_cls"):
        proba = model.predict_proba(X_test)[:, 1]
        preds = (proba >= threshold).astype(int)

        acc = accuracy_score(y_test, preds)
        prec = precision_score(y_test, preds, zero_division=0)
        rec = recall_score(y_test, preds, zero_division=0)
        roc = roc_auc_score(y_test, proba)

        mlflow.log_params({"model": model_name, "threshold": threshold})
        mlflow.log_metrics({
            "accuracy": acc,
            "precision": prec,
            "recall": rec,
            "roc_auc": roc
        })

        if model_name.lower().startswith("xgb"):
            mlflow.xgboost.log_model(model, artifact_path="model")
        else:
            mlflow.sklearn.log_model(model, artifact_path="model")

        run_id = mlflow.active_run().info.run_id
        print(f"{model_name}: acc={acc:.3f}, prec={prec:.3f}, rec={rec:.3f}, roc={roc:.3f}, run_id={run_id}")
        return run_id

log_id_logreg = log_classifier_to_mlflow(log_clf, "LogisticRegression", X_test, y_cls_test, threshold=0.5)
log_id_rf     = log_classifier_to_mlflow(rf_clf,   "RandomForest",      X_test, y_cls_test, threshold=0.5)
log_id_xgb    = log_classifier_to_mlflow(xgb_clf,  "XGBoost",           X_test, y_cls_test, threshold=0.3)




LogisticRegression: acc=0.952, prec=0.364, rec=0.993, roc=0.991, run_id=769eb14fba57421fa33bbeffa9a0c22d




RandomForest: acc=0.999, prec=0.987, rec=0.973, roc=1.000, run_id=b3874c4c4318499db9195acbfdfda6a4




XGBoost: acc=0.999, prec=0.956, rec=0.994, roc=1.000, run_id=601d1ee8e27b456987939de77b05e84a


In [40]:
#Log for Regression Model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def log_regressor_to_mlflow(model, model_name, X_test, y_test):
    with mlflow.start_run(run_name=f"{model_name}_reg"):
        preds = model.predict(X_test)

        rmse = np.sqrt(mean_squared_error(y_test, preds))
        mae = mean_absolute_error(y_test, preds)
        r2 = r2_score(y_test, preds)

        mlflow.log_params({"model": model_name})
        mlflow.log_metrics({
            "rmse": rmse,
            "mae": mae,
            "r2": r2
        })

        if model_name.lower().startswith("xgb"):
            mlflow.xgboost.log_model(model, artifact_path="model")
        else:
            mlflow.sklearn.log_model(model, artifact_path="model")

        run_id = mlflow.active_run().info.run_id
        print(f"{model_name}: rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}, run_id={run_id}")
        return run_id

# Assuming these are already fitted:
# lin_reg, rf_reg, xgb_reg
log_id_lin = log_regressor_to_mlflow(lin_reg, "LinearRegression", X_test, y_reg_test)
log_id_rf_reg = log_regressor_to_mlflow(rf_reg, "RandomForestRegressor", X_test, y_reg_test)
log_id_xgb_reg = log_regressor_to_mlflow(xgb_reg, "XGBoostRegressor", X_test, y_reg_test)




LinearRegression: rmse=0.312, mae=0.235, r2=0.852, run_id=a2789affe17746088a643bbbeef8cac6




RandomForestRegressor: rmse=0.000, mae=0.000, r2=1.000, run_id=ecc3fc8fce78415c97a762a9a7bfb6a5




XGBoostRegressor: rmse=0.009, mae=0.006, r2=1.000, run_id=a8b3fd3b6c024e8aa0bb7764e1816de0


In [41]:
# Classification runs (IDs returned by log_classifier_to_mlflow above)
logreg_run_id   = log_id_logreg
rf_cls_run_id   = log_id_rf
xgb_cls_run_id  = log_id_xgb

# Regression runs (IDs returned by log_regressor_to_mlflow above)
lin_reg_run_id  = log_id_lin
rf_reg_run_id   = log_id_rf_reg
xgb_reg_run_id  = log_id_xgb_reg

# XGBoost is selected as the main model for both tasks
best_cls_run_id = xgb_cls_run_id
best_reg_run_id = xgb_reg_run_id

cls_uri = f"runs:/{best_cls_run_id}/model"
reg_uri = f"runs:/{best_reg_run_id}/model"

cls_model_name = "RealEstate_GoodInvestment_Classifier"
reg_model_name = "RealEstate_FuturePrice_Regressor"

cls_version = mlflow.register_model(cls_uri, cls_model_name).version
reg_version = mlflow.register_model(reg_uri, reg_model_name).version

print("Registered classifier version:", cls_version)
print("Registered regressor version:", reg_version)


2025/12/11 21:26:11 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/11 21:26:11 INFO mlflow.store.db.utils: Updating database tables
2025-12-11 21:26:11 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-11 21:26:11 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Successfully registered model 'RealEstate_GoodInvestment_Classifier'.
Created version '1' of model 'RealEstate_GoodInvestment_Classifier'.
Successfully registered model 'RealEstate_FuturePrice_Regressor'.


Registered classifier version: 1
Registered regressor version: 1


Created version '1' of model 'RealEstate_FuturePrice_Regressor'.
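
Once a version is registered, it can be addressed by a registry URI instead of a run ID. The sketch below only builds the URI strings. The helper names `run_model_uri` and `registry_model_uri` are hypothetical, and the loading call is shown as a comment because it needs MLflow installed and the same tracking URI configured.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model logged inside a run, as used for registration above."""
    return f"runs:/{run_id}/{artifact_path}"

def registry_model_uri(name, version):
    """URI of a specific registered model version."""
    return f"models:/{name}/{version}"

cls_registry_uri = registry_model_uri("RealEstate_GoodInvestment_Classifier", 1)
reg_registry_uri = registry_model_uri("RealEstate_FuturePrice_Regressor", 1)
print(cls_registry_uri)
print(reg_registry_uri)

# With MLflow available, a registered version could then be loaded with, for example:
#   model = mlflow.pyfunc.load_model(cls_registry_uri)
```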


In [42]:
client.transition_model_version_stage(
    name=cls_model_name,
    version=cls_version,
    stage="Production"
)
client.transition_model_version_stage(
    name=reg_model_name,
    version=reg_version,
    stage="Production"
)
print("✅ Production models set in MLflow Registry")

✅ Production models set in MLflow Registry




In [46]:
from sklearn.preprocessing import TargetEncoder

categorical_high = ['Locality']

te = TargetEncoder(smooth=10.0)
te.fit(df_labeled[['Locality']], df_labeled['Price_in_Lakhs'])  # or y_cls if you prefer


In [86]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import joblib

numeric_features = [
    'BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt',
    'Year_Built', 'Floor_No', 'Total_Floors', 'Age_of_Property',
    'Nearby_Schools', 'Nearby_Hospitals',
    'school_density_score', 'hospital_density_score',
    'floor_position_ratio', 'age_score', 'amenity_score', 'ready_to_move'
]


categorical_low = [
    'Property_Type', 'Furnished_Status', 'Public_Transport_Accessibility',
    'Parking_Space', 'Security', 'Owner_Type', 'Availability_Status'
]

# Create encoder
cat_ohe = OneHotEncoder(
    handle_unknown='ignore',
    sparse_output=False,
    drop='first'
)

# Fit on the training data (df_labeled must exist and contain these columns)
cat_ohe.fit(df_labeled[categorical_low])

# Optional sanity check: a fitted encoder exposes categories_
print("Has categories_?", hasattr(cat_ohe, "categories_"))  # should be True

# Save the fitted encoder (overwrites any existing file)
joblib.dump(cat_ohe, "cat_ohe.pkl")
print("✅ Saved fitted cat_ohe.pkl")


# Encoder with the same settings, used to build the dummy-column DataFrame
categorical_low_transformer = OneHotEncoder(
    handle_unknown='ignore', sparse_output=False, drop='first'
)

# Fit and transform in one step to get the dense dummy matrix
cat_array = categorical_low_transformer.fit_transform(df_labeled[categorical_low])

df_cat_encoded = pd.DataFrame(
    cat_array,
    columns=[
        "Property_Type_Independent House", "Property_Type_Villa",
        "Furnished_Status_Semi-furnished", "Furnished_Status_Unfurnished",
        "Public_Transport_Accessibility_Low", "Public_Transport_Accessibility_Medium",
        "Parking_Space_Yes", "Security_Yes", "Owner_Type_Builder",
        "Owner_Type_Owner", "Availability_Status_Under_Construction",
    ],
)


numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler())
])

# categorical_low_transformer from above is reused here; ColumnTransformer clones and refits it.


# Create the full design matrix, including Locality_target_encoded
X_base = df_labeled[numeric_features + categorical_low + ['Locality']].copy()
X_base['Locality_target_encoded'] = te.transform(df_labeled[['Locality']])

all_numeric = numeric_features + ['Locality_target_encoded']

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, all_numeric),
    ('cat_low', categorical_low_transformer, categorical_low),
])
preprocessor.fit(X_base)


Has categories_? True
✅ Saved fitted cat_ohe.pkl


In [64]:
import joblib

# Save the FULL preprocessor (contains both numeric_transformer and categorical_low_transformer)
joblib.dump(preprocessor, "preprocessor_real_estate.pkl")
joblib.dump(te, "target_encoder_locality.pkl")

print("✅ Full pipeline saved!")


✅ Full pipeline saved!
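
Because the app depends on these pickled artifacts, it can be useful to confirm that a reloaded object reproduces the original transform. The following is a minimal round-trip check, assuming a small synthetic array and a temporary file name (`scaler_demo.pkl`) that are not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic numeric data with one outlier, standing in for the real features.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
fitted = RobustScaler().fit(X_demo)

# Dump to a temporary location and load it back, as the app would at startup.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler_demo.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object should give the same output as the original.
round_trip_ok = np.allclose(fitted.transform(X_demo), restored.transform(X_demo))
print("round trip ok:", round_trip_ok)
```

The same check applies to `preprocessor_real_estate.pkl` and `target_encoder_locality.pkl`, provided the loading environment uses a compatible scikit-learn version.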


In [51]:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")

# ---- Classifier (XGBoost) ----
xgb_cls_run_id = "601d1ee8e27b456987939de77b05e84a"
cls_model_name = "RealEstate_GoodInvestment_Classifier"
cls_uri = f"runs:/{xgb_cls_run_id}/model"

cls_result = mlflow.register_model(model_uri=cls_uri, name=cls_model_name)
cls_version = cls_result.version
print("Classifier registered:", cls_model_name, "version", cls_version)

# ---- Regressor (XGBoost) ----
xgb_reg_run_id = "a8b3fd3b6c024e8aa0bb7764e1816de0"
reg_model_name = "RealEstate_FuturePrice_Regressor"
reg_uri = f"runs:/{xgb_reg_run_id}/model"

reg_result = mlflow.register_model(model_uri=reg_uri, name=reg_model_name)
reg_version = reg_result.version
print("Regressor registered:", reg_model_name, "version", reg_version)


Registered model 'RealEstate_GoodInvestment_Classifier' already exists. Creating a new version of this model...
Created version '2' of model 'RealEstate_GoodInvestment_Classifier'.
Registered model 'RealEstate_FuturePrice_Regressor' already exists. Creating a new version of this model...


Classifier registered: RealEstate_GoodInvestment_Classifier version 2
Regressor registered: RealEstate_FuturePrice_Regressor version 2


Created version '2' of model 'RealEstate_FuturePrice_Regressor'.


In [52]:
cols_for_lookup = [
    "City", "Locality", "Property_Type", "Furnished_Status",
    "Public_Transport_Accessibility", "Parking_Space",
    "Security", "Owner_Type", "Availability_Status"
]

lookup = df[cols_for_lookup].drop_duplicates().sort_values(cols_for_lookup)
lookup.to_csv("lookup_values.csv", index=False)
print("lookup_values.csv saved with shape:", lookup.shape)


lookup_values.csv saved with shape: (247651, 9)
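
The shape `(247651, 9)` shows that almost every row of the roughly 250k-row dataset survives `drop_duplicates` across all nine columns, so the lookup file is nearly as large as the data itself. If the app only needs the options for each dropdown plus the localities available in each city (an assumption about the Streamlit code, which is not shown here), a much smaller structure would do. The helper names and the toy values below are hypothetical.

```python
import pandas as pd


def build_dropdown_options(frame, columns):
    """Return {column: sorted unique non-null values} for populating select boxes."""
    return {col: sorted(frame[col].dropna().unique().tolist()) for col in columns}


def localities_by_city(frame):
    """Return {city: sorted localities observed in that city}."""
    pairs = frame[["City", "Locality"]].dropna().drop_duplicates()
    return {
        city: sorted(group["Locality"].tolist())
        for city, group in pairs.groupby("City")
    }


# Toy frame with made-up values, standing in for df.
toy_df = pd.DataFrame({
    "City": ["Pune", "Pune", "Delhi", "Pune"],
    "Locality": ["Locality_2", "Locality_1", "Locality_9", "Locality_2"],
    "Property_Type": ["Villa", "Apartment", "Villa", "Villa"],
})

options = build_dropdown_options(toy_df, ["City", "Property_Type"])
city_to_localities = localities_by_city(toy_df)
print(options)
print(city_to_localities)
```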


In [56]:
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlflow.db")
client = MlflowClient()

print("Classifier runs:")
for mv in client.search_model_versions("name='RealEstate_GoodInvestment_Classifier'"):
    print(mv.version, mv.run_id, mv.source)

print("\nRegressor runs:")
for mv in client.search_model_versions("name='RealEstate_FuturePrice_Regressor'"):
    print(mv.version, mv.run_id, mv.source)


Classifier runs:
2 601d1ee8e27b456987939de77b05e84a models:/m-0e1c3bb42e8a4336a3e7cb4e7d70e03a
1 601d1ee8e27b456987939de77b05e84a models:/m-0e1c3bb42e8a4336a3e7cb4e7d70e03a

Regressor runs:
2 a8b3fd3b6c024e8aa0bb7764e1816de0 models:/m-254185d0f5524270a66c8251f3281cd1
1 a8b3fd3b6c024e8aa0bb7764e1816de0 models:/m-254185d0f5524270a66c8251f3281cd1
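
The listing above shows versions 1 and 2 of each model pointing at the same run ID and source, which is what happens when a registration cell is executed twice. A guard like the one below could skip registration when a version for that run already exists. It is a sketch: `register_once` is a hypothetical helper, and the fake registry exists only so the example runs without MLflow. In the notebook, `client` would be the `MlflowClient` and `register_fn` would be `mlflow.register_model`.

```python
from types import SimpleNamespace


def register_once(client, register_fn, run_id, name, artifact_path="model"):
    """Return an existing version for `run_id` if one is registered, else register it."""
    for mv in client.search_model_versions(f"name='{name}'"):
        if mv.run_id == run_id:
            return mv.version
    return register_fn(f"runs:/{run_id}/{artifact_path}", name).version


class _FakeRegistry:
    """In-memory stand-in for the MLflow registry (demonstration only)."""

    def __init__(self):
        self.versions = []

    def search_model_versions(self, filter_string):
        return list(self.versions)

    def register(self, model_uri, name):
        run_id = model_uri.split("/")[1]  # "runs:/<run_id>/model" -> "<run_id>"
        mv = SimpleNamespace(
            name=name, run_id=run_id, version=str(len(self.versions) + 1)
        )
        self.versions.append(mv)
        return mv


registry = _FakeRegistry()
first = register_once(registry, registry.register, "abc123", "Demo_Model")
second = register_once(registry, registry.register, "abc123", "Demo_Model")
print(first, second, len(registry.versions))
```

Calling the helper twice for the same run returns the same version instead of creating a duplicate.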


In [59]:
import mlflow
import mlflow.xgboost

mlflow.set_tracking_uri("sqlite:///mlflow.db")

cls_uri = "models:/RealEstate_GoodInvestment_Classifier/1"

# Load the exact same way as in app
cls_model = mlflow.xgboost.load_model(cls_uri)

booster = cls_model.get_booster()
print("Booster feature names:", booster.feature_names)
print("Number of features:", booster.num_features())


Booster feature names: ['BHK_scaled', 'Size_in_SqFt_scaled', 'Price_in_Lakhs_scaled', 'Price_per_SqFt_scaled', 'Year_Built_scaled', 'Floor_No_scaled', 'Total_Floors_scaled', 'Age_of_Property_scaled', 'Nearby_Schools_scaled', 'Nearby_Hospitals_scaled', 'Property_Type_Independent House', 'Property_Type_Villa', 'Furnished_Status_Semi-furnished', 'Furnished_Status_Unfurnished', 'Public_Transport_Accessibility_Low', 'Public_Transport_Accessibility_Medium', 'Parking_Space_Yes', 'Security_Yes', 'Owner_Type_Builder', 'Owner_Type_Owner', 'Availability_Status_Under_Construction', 'Locality_target_encoded', 'school_density_score', 'hospital_density_score', 'floor_position_ratio', 'age_score', 'amenity_score', 'ready_to_move']
Number of features: 28
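
Since the booster stores feature names in a fixed order, the inference code has to supply the same columns in the same order. One way to enforce this is a small alignment step before calling `predict`. The helper `align_to_model` is a hypothetical name and the three-column example is a shortened stand-in for the 28 names printed above.

```python
import pandas as pd


def align_to_model(features, expected_names):
    """Reorder `features` to `expected_names`, failing loudly if a column is missing.

    Extra columns in `features` are dropped.
    """
    missing = [name for name in expected_names if name not in features.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return features.loc[:, list(expected_names)]


# Shortened stand-in for booster.feature_names.
expected = ["BHK_scaled", "Parking_Space_Yes", "Locality_target_encoded"]

# Columns arrive in a different order, with one extra column.
row = pd.DataFrame([{
    "Locality_target_encoded": 254.3,
    "extra_debug_column": 1,
    "BHK_scaled": 0.5,
    "Parking_Space_Yes": 1.0,
}])

aligned = align_to_model(row, expected)
print(list(aligned.columns))
```

In the app, `expected_names` would come from `booster.feature_names`, so a mismatch surfaces as a clear error rather than a silent misprediction.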


In [61]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_raw_cols = [
    "BHK", "Size_in_SqFt", "Price_in_Lakhs", "Price_per_SqFt",
    "Year_Built", "Floor_No", "Total_Floors",
    "Age_of_Property", "Nearby_Schools", "Nearby_Hospitals",
]

df_scaled_array = scaler.fit_transform(df_labeled[num_raw_cols])
df_scaled = pd.DataFrame(
    df_scaled_array,
    columns=[
        "BHK_scaled", "Size_in_SqFt_scaled", "Price_in_Lakhs_scaled",
        "Price_per_SqFt_scaled", "Year_Built_scaled", "Floor_No_scaled",
        "Total_Floors_scaled", "Age_of_Property_scaled",
        "Nearby_Schools_scaled", "Nearby_Hospitals_scaled",
    ],
)


In [85]:
import joblib
joblib.dump(scaler, "num_scaler.pkl")
joblib.dump(te, "target_encoder_locality.pkl")

print("✅ All 3 encoders saved correctly!")


✅ All 3 encoders saved correctly!


# Conclusion

The Real Estate Investment Advisor covers the full MLOps workflow: data-driven feature engineering, dual-task modeling (classification and regression), experiment tracking with MLflow, artifact versioning, and an inference pipeline ready for deployment. It turns real estate analytics into actionable recommendations for investors, and shows how reusing pre-fitted encoders at inference time, robust preprocessing, and an interactive Streamlit UI combine into a reproducible ML application. The project spans the path from raw CSV to a user-facing dashboard, and is suitable for a GitHub portfolio and as a starting point for real-world deployment.