# 05_model_inference

This notebook demonstrates how to load the best trained model and use it for predictions.

Steps:
1. Load the saved model from `app/`.
2. Load the preprocessed feature data.
3. Run predictions on the test split.
4. Compare predicted and actual prices.
5. Predict the price for a single sample row.


In [1]:
import pandas as pd
import numpy as np
import joblib

# Load the saved best model
best_model = joblib.load("../app/xgboost_best.pkl")  # adjust the filename if the RF or Linear model was best

print("Loaded model:", type(best_model))


Loaded model: <class 'sklearn.pipeline.Pipeline'>


configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing.

  setstate(state)


In [2]:
# Load preprocessed data
X = pd.read_csv("../data/X_preprocessed.csv")
y = pd.read_csv("../data/y_preprocessed.csv").squeeze()

print("X shape:", X.shape)
print("y shape:", y.shape)


X shape: (1458, 241)
y shape: (1458,)


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Recreate the train/test split so the test rows match the earlier training run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

y_pred = best_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE (log-scale): {rmse:.4f}")
print(f"Test R2: {r2:.4f}")


ValueError: columns are missing: {'GarageQual', 'Condition2', 'MSZoning', 'SaleType', 'Neighborhood', 'PoolQC', 'MasVnrType', 'KitchenQual', 'Electrical', 'RoofStyle', 'SaleCondition', 'LandSlope', 'Heating', 'LandContour', 'GarageFinish', 'Utilities', 'FireplaceQu', 'LotConfig', 'Functional', 'BsmtQual', 'Fence', 'LotShape', 'Street', 'GarageCond', 'Id', 'Foundation', 'BsmtCond', 'Exterior1st', 'Exterior2nd', 'CentralAir', 'BsmtFinType1', 'Condition1', 'HeatingQC', 'Alley', 'BsmtFinType2', 'ExterQual', 'MiscFeature', 'PavedDrive', 'HouseStyle', 'BsmtExposure', 'BldgType', 'GarageType', 'RoofMatl', 'ExterCond'}
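
The `ValueError` above suggests that the saved `Pipeline` was fitted on raw columns (such as `Neighborhood` and `Id`) and does its own encoding, whereas `X_preprocessed.csv` appears to hold already one-hot-encoded columns (such as `SaleType_WD`). One way to confirm this before calling `predict` is to compare the column names the model expects with those in the frame. The sketch below is illustrative only: the helper name `column_mismatch` and both column lists are hypothetical.

```python
def column_mismatch(expected, available):
    """Return (missing, unexpected) column names, each as a sorted list."""
    expected, available = set(expected), set(available)
    missing = sorted(expected - available)      # model wants these, input lacks them
    unexpected = sorted(available - expected)   # input has these, model does not know them
    return missing, unexpected

# Illustrative names only: raw columns a pipeline might expect
# versus encoded columns found in a preprocessed file.
expected_cols = ["LotArea", "Neighborhood", "SaleType", "Id"]
available_cols = ["LotArea", "SaleType_WD", "SaleType_New"]

missing, unexpected = column_mismatch(expected_cols, available_cols)
print("Missing from input:", missing)
print("Not expected by model:", unexpected)
```

With the real objects, the expected names could probably be read from `best_model.feature_names_in_`, assuming the pipeline was fitted on a DataFrame with a recent scikit-learn version, and the available names from `X.columns`.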

In [None]:
# Remember: y was log-transformed using log1p
y_test_orig = np.expm1(y_test)
y_pred_orig = np.expm1(y_pred)

rmse_orig = np.sqrt(mean_squared_error(y_test_orig, y_pred_orig))
print(f"Test RMSE (original price scale): {rmse_orig:,.2f}")

# Show comparison for first 10 rows
pd.DataFrame({
    "ActualPrice": y_test_orig.head(10).round(0),
    "PredictedPrice": y_pred_orig[:10].round(0)
})


Test RMSE (original price scale): 20,643.52


|      | ActualPrice | PredictedPrice |
|------|-------------|----------------|
| 1320 | 190000.0    | 216450.0       |
| 836  | 100000.0    | 94945.0        |
| 413  | 115000.0    | 105889.0       |
| 522  | 159000.0    | 143703.0       |
| 1035 | 315500.0    | 329155.0       |
| 614  | 137500.0    | 137837.0       |
| 218  | 311500.0    | 236983.0       |
| 1031 | 310000.0    | 346081.0       |
| 1288 | 281000.0    | 277775.0       |
| 886  | 135500.0    | 137761.0       |
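
The back-transformation in this cell relies on `np.expm1` being the inverse of `np.log1p`. A minimal sketch with made-up prices (not taken from the dataset) shows the round trip, and why the price-scale RMSE has to be recomputed from back-transformed values and cannot be derived from the log-scale RMSE alone.

```python
import numpy as np

# Hypothetical prices, for illustration only.
actual = np.array([100_000.0, 200_000.0, 300_000.0])
predicted = np.array([110_000.0, 190_000.0, 330_000.0])

# The target was stored as log1p(price); expm1 undoes that transform.
actual_log = np.log1p(actual)
assert np.allclose(np.expm1(actual_log), actual)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_log = rmse(actual_log, np.log1p(predicted))  # dimensionless, on the log scale
rmse_price = rmse(actual, predicted)              # same units as the prices

print(f"RMSE (log scale): {rmse_log:.4f}")
print(f"RMSE (price scale): {rmse_price:,.2f}")
```

The two numbers answer different questions: the log-scale value penalizes relative misses evenly across cheap and expensive houses, while the price-scale value is dominated by the largest absolute misses.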


In [None]:
# Take one sample row and predict
sample = X_test.iloc[[0]]  # first test row
pred_log = best_model.predict(sample)[0]
pred_price = np.expm1(pred_log)

print("Sample features:\n", sample.head())
print("\nPredicted log-price:", round(pred_log, 4))
print("Predicted SalePrice (original scale):", round(pred_price, 2))


Sample features:
       MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
1320          60        107.0    10186            7            5       1992   

      YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  SaleType_ConLI  \
1320          1992         0.0         674           0  ...           False   

      SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  \
1320           False         False         False         True   

      SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  \
1320                  False                 False                 False   

      SaleCondition_Normal  SaleCondition_Partial  
1320                  True                  False  

[1 rows x 241 columns]

Predicted log-price: 12.2851
Predicted SalePrice (original scale): 216450.1
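
To score a genuinely custom input instead of an existing test row, the new row needs the same columns, in the same order, as the frame the model was trained on. The following is a minimal sketch, assuming the model accepts the one-hot-encoded layout printed above; the column subset and the values in `custom` are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the training columns; in the notebook this would be X.columns.
train_columns = ["LotArea", "OverallQual", "SaleType_New", "SaleType_WD"]

# A user-supplied house described with raw values (made up for illustration).
custom = pd.DataFrame([{"LotArea": 9000, "OverallQual": 6, "SaleType": "WD"}])

# One-hot encode the categorical column, then align to the training layout.
# Dummy columns this row does not produce are filled with 0;
# columns the training layout does not contain are dropped.
encoded = pd.get_dummies(custom, columns=["SaleType"])
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(aligned)
```

If the saved pipeline performs its own encoding, the raw row would be passed to `predict` instead, and the alignment step would not be needed.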


## Conclusions

- The saved model can be reloaded and used for predictions.  
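- RMSE on the original price scale (20,643.52) is in the same currency units as `SalePrice`, so it is easier to interpret than the log-scale RMSE.
- For the ten test rows shown, nine predictions fall within about 15% of the actual price; the largest miss is row 218, underpredicted by about 24%.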
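
The agreement between predicted and actual prices can be quantified as a mean absolute percentage error. The sketch below uses only the ten pairs from the comparison table shown earlier, so it illustrates the calculation and is not a full test-set metric.

```python
# (row index, actual price, predicted price) copied from the ten-row comparison.
rows = [
    (1320, 190000, 216450),
    (836, 100000, 94945),
    (413, 115000, 105889),
    (522, 159000, 143703),
    (1035, 315500, 329155),
    (614, 137500, 137837),
    (218, 311500, 236983),
    (1031, 310000, 346081),
    (1288, 281000, 277775),
    (886, 135500, 137761),
]

# Absolute error as a fraction of the actual price, keyed by row index.
pct_errors = {idx: abs(pred - actual) / actual for idx, actual, pred in rows}
mape = sum(pct_errors.values()) / len(pct_errors)
worst_idx = max(pct_errors, key=pct_errors.get)

print(f"MAPE over {len(rows)} rows: {mape:.1%}")
print(f"Largest error: row {worst_idx} at {pct_errors[worst_idx]:.1%}")
```

On these ten rows the mean error comes out at roughly 8%, with row 218 as the largest miss.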

✅ Next step: Build a `Streamlit` app (`app/app.py`) so users can input house features and get instant price predictions.
