# House Prices - End-to-End Machine Learning Pipeline 
This notebook walks through a complete machine learning workflow for a real-world regression problem: predicting house sale prices.
The project compares two models based on decision trees:

- Random Forest, with the number of trees tuned through cross-validation.
- XGBoost, tuned with a randomized hyperparameter search.

The result is a single pipeline that bundles preprocessing and the model, so after one training run it can be applied directly to new data.

## Data loading

I load the training and test datasets.
The training dataset contains the target variable "SalePrice", while the test dataset does not.

In [3]:
import pandas as pd

from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

train_data = pd.read_csv("C:\\Users\\lb_20\\Downloads\\train.csv", index_col = "Id")
test_data = pd.read_csv("C:\\Users\\lb_20\\Downloads\\test.csv", index_col = "Id")

train_data.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

In [4]:
test_data.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

## Separate target and features

The dataset was divided into:
- "y" — target variable ("SalePrice")
- "X" — all columns with features
- "X_test" — features for the test set

In [2]:
y = train_data["SalePrice"]
X = train_data.drop(columns = ["SalePrice"])
X_test = test_data.copy()
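
A fitted pipeline expects the same feature columns at prediction time that it saw during training, so a quick check after the split can catch mismatches early. The sketch below is not part of the original notebook; it uses two small made-up frames ("train_toy", "test_toy") in place of the real data.

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (values are made up for illustration).
train_toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})
test_toy = pd.DataFrame({
    "LotArea": [11622],
    "Street": ["Pave"],
})

# Same split as in the notebook: target on one side, features on the other.
y_toy = train_toy["SalePrice"]
X_toy = train_toy.drop(columns=["SalePrice"])

# The feature columns should match the test columns, in the same order.
columns_match = list(X_toy.columns) == list(test_toy.columns)
print(columns_match)
```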

## Determine the type of features

I divided the features into:
- categorical (dtype = "object")
- numerical (all others)

This makes it possible to use different preprocessing methods for each data type.

In [3]:
len(X.columns)

79

In [4]:
categorical_cols = []
numerical_cols = []
for col in X.columns:
    if X[col].dtype == "object":
        categorical_cols.append(col)
    else:
        numerical_cols.append(col)

In [5]:
print(f"The number of categorical columns: {len(categorical_cols)}")
print(f"The number of numerical columns: {len(numerical_cols)}")
print(len(categorical_cols) + len(numerical_cols) == 79)

The number of categorical columns: 43
The number of numerical columns: 36
True
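
The same split can probably be written more compactly with "DataFrame.select_dtypes". The sketch below is an alternative, not the notebook's method: it assumes every non-numeric column should be treated as categorical, and it runs on a small made-up frame ("X_toy").

```python
import pandas as pd

# Toy feature frame (made-up values) with two numeric and two text columns.
X_toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "OverallQual": [7, 6, 7],
    "MSZoning": ["RL", "RL", "RM"],
})

# Assumption: anything that is not numeric counts as categorical.
numerical_toy = X_toy.select_dtypes(include="number").columns.tolist()
categorical_toy = X_toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_toy)
print(categorical_toy)
```

Both lists keep the original column order, and together they cover every column exactly once.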


## Preprocessing pipeline

Preprocessing has two branches, one per feature type:

1. **Numeric features:**
   - Missing values are replaced with the median.

2. **Categorical features:**
   - Missing data is replaced with the most frequently occurring value.
   - Categories are converted using the One-Hot encoding method.

The "ColumnTransformer" applies each branch to its own columns and concatenates the results into a single feature matrix.

In [6]:
numerical_transformer = SimpleImputer(strategy = "median")

categorical_transformer = Pipeline(steps = [("imputer", SimpleImputer(strategy = "most_frequent")),
                                            ("onehot", OneHotEncoder(handle_unknown = "ignore"))])

preprocessor = ColumnTransformer(transformers = [("num", numerical_transformer, numerical_cols),
                                                 ("cat", categorical_transformer, categorical_cols)])
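
To see what this preprocessor does, it may help to run the same construction on a tiny frame. The sketch below uses made-up data and hypothetical names ("train_toy", "new_toy", "preprocessor_toy"). It shows a missing number filled with the training median, and a category unseen during fitting encoded as all zeros because of "handle_unknown = 'ignore'".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made-up values). "Street" is forced to object dtype,
# matching the dtype check used for categorical columns in the notebook.
train_toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Street": ["Pave", "Pave", "Grvl", np.nan],
}).astype({"Street": "object"})
new_toy = pd.DataFrame({
    "LotFrontage": [np.nan],   # missing number
    "Street": ["Dirt"],        # category never seen during fit
}).astype({"Street": "object"})

categorical_toy = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor_toy = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["LotFrontage"]),
    ("cat", categorical_toy, ["Street"]),
])

preprocessor_toy.fit(train_toy)
out = preprocessor_toy.transform(new_toy)
# The result may be sparse or dense depending on density; normalize to dense.
dense = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Columns: LotFrontage, Street=Grvl, Street=Pave
print(dense)
```

The median of the observed training values (60, 80, 70) is 70, so the missing "LotFrontage" becomes 70, and "Dirt" matches neither one-hot column.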

## Cross-validation helper

This function evaluates the entire pipeline using cross-validation and returns:
- MAE mean value
- MAE standard deviation

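Scikit-learn reports this metric as a negative number ("neg_mean_absolute_error") because its scorers follow a higher-is-better convention, so the helper flips the sign. The model with the lowest MAE is then preferred.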

In [7]:
def cross_val_mae(pipeline, X, y, cv = 5):
    scores = cross_val_score(pipeline, X, y,
                             cv = cv,
                             scoring = "neg_mean_absolute_error",
                             n_jobs = -1)
    mae = -scores
    return mae.mean(), mae.std()
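
The sign flip in the helper can be checked on synthetic data. The sketch below is illustrative only: it uses "make_regression" and a plain "LinearRegression" instead of the notebook's pipeline, and shows that the raw scores are non-positive while the negated values are ordinary MAEs.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (not the house price data).
X_toy, y_toy = make_regression(n_samples=100, n_features=5,
                               noise=10.0, random_state=0)

# One score per fold; scikit-learn returns the negated MAE.
scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         cv=5, scoring="neg_mean_absolute_error")
mae = -scores  # back to a positive, lower-is-better error

print(scores)
print(f"MAE {mae.mean():.2f} +- {mae.std():.2f}")
```

Note that "mae.std()" is NumPy's population standard deviation (ddof = 0) over the five folds.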

## Random Forest capacity search

I analyze Random Forest models by varying the "n_estimators" parameter.

For each selected configuration I:

1. Build a complete pipeline (data preprocessing + model)

2. Test it using cross-validation

3. Record the mean and standard deviation of the MAE


In [8]:
n_estimators_options = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
rf_results = []
for n in n_estimators_options:
    rf = rfr(n_estimators = n, random_state = 0, n_jobs = -1)
    rf_pipeline = Pipeline([("preprocess", preprocessor), ("model", rf)])
    mean_mae, std_mae = cross_val_mae(rf_pipeline, X, y)
    rf_results.append((n, mean_mae, std_mae))
    print(f"RF n_estimators = {n}: MAE {mean_mae:.2f} +- {std_mae:.2f}")

RF n_estimators = 50: MAE 17858.94 +- 971.67
RF n_estimators = 100: MAE 17699.58 +- 890.27
RF n_estimators = 150: MAE 17591.09 +- 918.68
RF n_estimators = 200: MAE 17572.66 +- 955.86
RF n_estimators = 250: MAE 17581.94 +- 975.40
RF n_estimators = 300: MAE 17563.37 +- 986.91
RF n_estimators = 350: MAE 17546.66 +- 954.77
RF n_estimators = 400: MAE 17567.88 +- 973.17
RF n_estimators = 450: MAE 17543.89 +- 950.93
RF n_estimators = 500: MAE 17552.44 +- 945.72
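
The best setting can be read off "rf_results" programmatically. The sketch below copies the numbers from the output above and also applies a simple heuristic that is not part of the notebook: it looks for the smallest forest whose mean MAE lies within one fold standard deviation of the best one.

```python
# (n_estimators, mean MAE, std MAE), copied from the printed results above.
rf_results = [
    (50, 17858.94, 971.67), (100, 17699.58, 890.27),
    (150, 17591.09, 918.68), (200, 17572.66, 955.86),
    (250, 17581.94, 975.40), (300, 17563.37, 986.91),
    (350, 17546.66, 954.77), (400, 17567.88, 973.17),
    (450, 17543.89, 950.93), (500, 17552.44, 945.72),
]

# Setting with the lowest mean MAE.
best_n, best_mae, best_std = min(rf_results, key=lambda r: r[1])

# Heuristic (an assumption, not the notebook's rule): the smallest forest
# whose mean MAE is within one standard deviation of the best mean.
threshold = best_mae + best_std
smallest_n = min(n for n, mean_mae, _ in rf_results if mean_mae <= threshold)

print(best_n)      # 450
print(smallest_n)  # 50
```

Every setting falls within that band, which suggests the differences between forest sizes are small relative to fold-to-fold noise.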


## Final Random Forest model

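Based on the cross-validation results, "n_estimators = 450" gave the lowest mean MAE (17543.89), so it is used for the final Random Forest pipeline. The differences beyond roughly 150 trees are far smaller than the fold-to-fold standard deviation, so the exact choice matters little.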

In [9]:
rf = rfr(n_estimators = 450, random_state = 0, n_jobs = -1)
rf_pipeline = Pipeline([("preprocess", preprocessor), ("model", rf)])

## XGBoost


As an alternative to Random Forest, I chose XGBoost, a gradient boosting method. It is also an ensemble of decision trees, but the trees are built sequentially, each one correcting the errors of the previous ones, and it often outperforms Random Forest on tabular data.

To explore the hyperparameters, I use "RandomizedSearchCV".

The parameters investigated include:
- number of trees ("n_estimators");
- learning rate ("learning_rate");
- tree depth ("max_depth");
- row and column subsampling ratios ("subsample", "colsample_bytree");
- regularization strength ("reg_lambda", "reg_alpha").

Each sampled configuration is evaluated using cross-validation and the MAE metric.

The best of the 20 sampled configurations is selected automatically.

In [10]:
xgb = XGBRegressor(objective = "reg:squarederror", random_state = 0, n_jobs = -1)
xgb_pipeline = Pipeline(steps = [("preprocess", preprocessor), ("model", xgb)])

params = {"model__n_estimators": [800, 1200, 1600, 2000],
          "model__learning_rate": [0.01, 0.03, 0.05],
          "model__max_depth": [3, 4, 5, 6],
          "model__subsample": [0.7, 0.8, 0.9, 1.0],
          "model__colsample_bytree": [0.7, 0.8, 0.9, 1.0],
          "model__reg_lambda": [0.5, 1.0, 2.0],
          "model__reg_alpha": [0.0, 0.1, 0.5]}

search = RandomizedSearchCV(estimator = xgb_pipeline,
                            param_distributions = params,
                            n_iter = 20,
                            scoring = "neg_mean_absolute_error",
                            cv = 5,
                            random_state = 0,
                            n_jobs = -1,
                            verbose = 1)
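
For context on how much of the search space 20 iterations cover, the number of possible combinations can be counted directly from the parameter lists. This is a small illustrative calculation, not part of the original notebook.

```python
from math import prod

# Same candidate lists as in the search above.
params = {"model__n_estimators": [800, 1200, 1600, 2000],
          "model__learning_rate": [0.01, 0.03, 0.05],
          "model__max_depth": [3, 4, 5, 6],
          "model__subsample": [0.7, 0.8, 0.9, 1.0],
          "model__colsample_bytree": [0.7, 0.8, 0.9, 1.0],
          "model__reg_lambda": [0.5, 1.0, 2.0],
          "model__reg_alpha": [0.0, 0.1, 0.5]}

# Size of the full grid: product of the list lengths.
n_combinations = prod(len(values) for values in params.values())
n_iter = 20
coverage = n_iter / n_combinations

print(n_combinations)       # 6912
print(f"{coverage:.2%}")    # 0.29%
```

An exhaustive grid search would need 6912 × 5 fits, which is why a randomized search over a small sample is used here.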

## Model comparison

I compare two models:

1. Random Forest with the selected number of trees
2. The best XGBoost model found by the randomized search

Both models are evaluated with the same 5-fold cross-validation and the MAE metric, so their scores are directly comparable.

In [11]:
search.fit(X, y)

best_cv_mae = -search.best_score_
print(f"Best tuned XGB CV MAE: {best_cv_mae:.2f}")
print("Best params: ", search.best_params_)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best tuned XGB CV MAE: 14923.86
Best params:  {'model__subsample': 0.8, 'model__reg_lambda': 2.0, 'model__reg_alpha': 0.1, 'model__n_estimators': 2000, 'model__max_depth': 3, 'model__learning_rate': 0.03, 'model__colsample_bytree': 0.7}


In [12]:
rf_mean, rf_std = cross_val_mae(rf_pipeline, X, y)
print(f"Random Forest CV MAE: {rf_mean:.2f} ± {rf_std:.2f}")

Random Forest CV MAE: 17543.89 ± 950.93


In [13]:
best_model = search.best_estimator_
final_model = best_model
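
To put the two scores side by side, the gap can be expressed in absolute terms, as a share of the Random Forest error, and relative to the Random Forest fold-to-fold spread. The sketch below only does arithmetic on the numbers printed above; the variable names are illustrative.

```python
# Cross-validated results copied from the outputs above.
rf_mae, rf_std = 17543.89, 950.93
xgb_mae = 14923.86

abs_gain = rf_mae - xgb_mae        # reduction in MAE
rel_gain = abs_gain / rf_mae       # as a fraction of the Random Forest MAE
gain_in_std = abs_gain / rf_std    # in units of the Random Forest fold std

print(f"{abs_gain:.2f}")     # 2620.03
print(f"{rel_gain:.1%}")     # 14.9%
print(f"{gain_in_std:.2f}")  # 2.76
```

The improvement is several times larger than the spread between folds, which supports choosing the tuned XGBoost pipeline as the final model.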

## Train final model and generate predictions

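Finally, I fit the best model on the entire training set, apply it to the test data, and produce the final predictions. Since "RandomizedSearchCV" already refits "best_estimator_" on the full training set by default ("refit = True"), the explicit "fit" call below is redundant but harmless.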

In [14]:
final_model.fit(X, y)
test_preds = final_model.predict(X_test)

In [15]:
final_outcome = pd.DataFrame({"Id": X_test.index,
                              "SalePrice": test_preds})
final_outcome

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 124466.046875 |
| 1 | 1462 | 171193.203125 |
| 2 | 1463 | 185719.953125 |
| 3 | 1464 | 196443.000000 |
| 4 | 1465 | 184259.343750 |
| ... | ... | ... |
| 1454 | 2915 | 82630.617188 |
| 1455 | 2916 | 76183.343750 |
| 1456 | 2917 | 175593.062500 |
| 1457 | 2918 | 118385.085938 |
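
A typical last step is to save the predictions to a CSV file. The sketch below is an assumption about how that could be done, not the notebook's own code: the file name "submission.csv" and the temporary directory are hypothetical, and the frame holds only the first two rows shown above. Passing "index = False" keeps the row index out of the file, so no extra unnamed column appears when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for final_outcome (first two rows from the output above).
final_outcome_toy = pd.DataFrame({"Id": [1461, 1462],
                                  "SalePrice": [124466.046875, 171193.203125]})

# Hypothetical output location; adjust the path as needed.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")

# index=False writes only the "Id" and "SalePrice" columns.
final_outcome_toy.to_csv(out_path, index=False)

# Read the file back to confirm its layout.
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```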
