# House Prices

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

# Recommended tools:
- Python 3.11+
- VSCode
- Data Wrangler - to explore data in output
- nbstripout - to automatically strip Jupyter notebook output

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_theme(style="whitegrid", font_scale=1.2)

# 1. Import and describe data

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
train.info()

### Count missing values and attach their data types

In [None]:
missing_values = train.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
missing_df = pd.DataFrame({'missing_count':missing_values})
missing_df['dtype'] = train[missing_df.index].dtypes
print(missing_df)

### Visualise missing values

In [None]:
plt.figure(figsize=(50,10))
sns.heatmap(train.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title("Missing values in train.csv")
plt.show()


# 2. Process missing values (mv)

I looked into data_description.txt and found the values we can substitute for NaN in the categorical columns.
Then I manually added missing_value_fill_type as the last column of the dataframe:
```
              missing_count    dtype missing_value_fill_type
PoolQC                 1453   object    NA
MiscFeature            1406   object    NA
Alley                  1369   object    NA
Fence                  1179   object    NA
MasVnrType              872   object    None
FireplaceQu             690   object    NA
LotFrontage             259  float64
GarageType               81   object    NA
GarageYrBlt              81  float64
GarageFinish             81   object    NA
GarageQual               81   object    NA
GarageCond               81   object    NA
BsmtExposure             38   object    NA
BsmtFinType2             38   object    NA
BsmtQual                 37   object    NA
BsmtCond                 37   object    NA
BsmtFinType1             37   object    NA
MasVnrArea                8  float64
Electrical                1   object    # Not defined
```

### Create df that will represent data described above

In [None]:
# Create a dictionary with default fill values for columns with gaps.
# Special values such as 'median_by_neighborhood' are handled by dedicated rules in fill_missing_values.
# Only special cases are listed here; the rest are handled automatically (e.g. 0 for int/float columns).
default_fill = {
    'PoolQC': 'NA',
    'MiscFeature': 'NA',
    'Alley': 'NA',
    'Fence': 'NA',
    'MasVnrType': 'None',
    'FireplaceQu': 'NA',
    'LotFrontage': 'median_by_neighborhood',
    'GarageType': 'NA',
    'GarageFinish': 'NA',
    'GarageQual': 'NA',
    'GarageCond': 'NA',
    'BsmtExposure': 'NA',
    'BsmtFinType1': 'NA',
    'BsmtFinType2': 'NA',
    'BsmtQual': 'NA',
    'BsmtCond': 'NA',
    'Electrical': 'SBrkr'  # Fill with the mode (most frequent value)
}

def get_fill_value(col, dtype):
    if col in default_fill:
        return default_fill[col]
    elif 'object' in str(dtype):
        return 'None'
    elif 'int' in str(dtype) or 'float' in str(dtype):
        return 0
    else:
        return None
    
missing_df['missing_value_fill_type'] = [
    get_fill_value(col, missing_df.loc[col, 'dtype']) for col in missing_df.index
]

print(missing_df)

### Decide which values to use for the missing cells of the numerical features: LotFrontage, GarageYrBlt, MasVnrArea

#### LotFrontage
There are 259 mv, so we plot the distribution to see whether the median is a reasonable fill value.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.histplot(train['LotFrontage'].dropna(), kde=True, bins=40)
plt.title('Distribution of LotFrontage')
plt.xlabel('LotFrontage')
plt.ylabel('Count')
plt.show()

We fill the LotFrontage (linear feet of street connected to property) mv with the median of this feature grouped by Neighborhood, since lots within one neighborhood tend to have similar frontage. The median is also more robust to outliers than the mean. There are other approaches, but we keep the median for now.

In [None]:
# train['LotFrontage'] = train.groupby('Neighborhood')['LotFrontage'].transform(
#     lambda x: x.fillna(x.median())
# )
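The grouped-median fill described above can be tried on a toy frame first. This is a minimal sketch: the column names mirror the dataset, the values are made up, and the global-median fallback is an assumption added here for neighborhoods that have no observed values at all.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 20.0, np.nan, 40.0],
})

# Median per neighborhood, broadcast back to the original row order.
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)

# Fallback (assumption): a neighborhood with no observed values would
# still be NaN at this point, so fall back to the global median.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

When the same fill is later applied to test.csv, it is probably safer to reuse the medians computed on the training data rather than recompute them on the test set.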

#### GarageYrBlt
We can see that the garage features are missing in the same number of rows:
```
GarageType               81   object                      NA
GarageYrBlt              81  float64                       0
GarageFinish             81   object                      NA
GarageQual               81   object                      NA
GarageCond               81   object                      NA
```

We can conclude that some houses have no garage. Later we can create a binary feature HasGarage and omit some garage features if they turn out to have no predictive power.

Fill mv with GarageYrBlt = 0 (garage does not exist).

In [None]:
# train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
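A possible shape for the HasGarage flag mentioned above, shown on a toy frame. This is only a sketch: the values are made up, and deriving the flag from GarageYrBlt (rather than from GarageType) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data: one house without a garage (all garage fields missing).
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# Derive the flag before filling, while NaN still marks "no garage".
toy["HasGarage"] = toy["GarageYrBlt"].notna().astype(int)

# Then fill the gaps the same way as in the notebook.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(0)
toy["GarageType"] = toy["GarageType"].fillna("NA")

print(toy)
```

Creating the flag before the fill keeps the "no garage" information explicit, so a model does not have to infer it from the artificial year 0.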

#### MasVnrArea
It is logical to fill with 0: no masonry veneer means no area.

In [None]:
# train['MasVnrArea'] = train['MasVnrArea'].fillna(0)

### Fill missing values using missing_df

In [None]:
def fill_missing_values(train, missing_df):
    for col, row in missing_df.iterrows():
        """
        col → index of the row in missing_df (the column name in our train/test df)
        row → the entire row as a Series with all its fields

        col = "LotFrontage"
        row = Series(
            missing_count=259,
            dtype="float64",
            missing_value_fill_type="median_by_neighborhood"
        )
        """

        fill_type = row['missing_value_fill_type']

        # train[col + '_was_missing'] = train[col].isnull().astype(int)
        
        if fill_type == "median_by_neighborhood":
            train[col] = train.groupby('Neighborhood')[col].transform(
                lambda x: x.fillna(x.median()) # By default, pandas ignores NaNs when calculating .median()
            )
            continue

        # Literal fill values; "SBrkr" is the mode used for Electrical in default_fill
        if fill_type in {0, "NA", "None", "Mix", "SBrkr"}:
            train[col] = train[col].fillna(fill_type)
            continue

        print(f"[WARN] Unknown fill_type '{fill_type}' for column '{col}'")

    return train

In [None]:
# grouped = train.groupby('Neighborhood')['LotFrontage']
# for name, group in grouped:
#     non_missing = group[~group.isnull()]
#     print(f"Neighborhood: {name}")
#     print(non_missing)
#     print()

In [None]:
train = fill_missing_values(train.copy(), missing_df)

Check if all missing values were filled. DataFrame should be empty.

In [None]:
missing_values = train.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
missing_df = pd.DataFrame({'missing_count':missing_values})
missing_df['dtype'] = train[missing_df.index].dtypes
print(missing_df)

# 3. Exploratory Data Analysis (EDA)

### Histogram of SalePrice

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(train["SalePrice"], kde=True)
plt.title("SalePrice Distribution")
plt.show()


In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(train["LotArea"], kde=True)
plt.title("LotArea Distribution")
plt.show()

### Boxplot for exploring outliers

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x=train["GrLivArea"])
plt.title("GrLivArea Boxplot")
plt.show()

Show the 10 records with the largest GrLivArea

In [None]:
train.nlargest(10, 'GrLivArea')[['GrLivArea', 'SalePrice']]


There are 2 points that do not seem logical:

GrLivArea = 5642, SalePrice = 160000

GrLivArea = 4676, SalePrice = 184750

These are very large areas for a very low price. You can also see them on the scatter plot "GrLivArea vs SalePrice" further down: the 2 points in the bottom right corner.

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x=train["LotArea"])
plt.title("LotArea Boxplot")
plt.show()


In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x=train["TotalBsmtSF"])
plt.title("TotalBsmtSF Boxplot")
plt.show()


### Scatter-plot: GrLivArea → SalePrice

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=train["GrLivArea"], y=train["SalePrice"])
plt.title("GrLivArea vs SalePrice")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()


In [None]:
plt.figure(figsize=(14, 10))
corr = train.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()


In [None]:
top_corr = corr["SalePrice"].abs().sort_values(ascending=False).head(10)
top_corr


We have multicollinearity among some features. We leave them as is for now and will see how it affects different models.
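To see which features are involved, the strongly correlated pairs can be listed explicitly. The helper below is a sketch on synthetic data: `correlated_pairs` and the 0.8 threshold are hypothetical choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List (feature_a, feature_b, |corr|) for pairs above the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Strict upper triangle, so each pair is reported once and the diagonal is skipped.
    rows, cols = np.where(np.triu(corr.to_numpy() > threshold, k=1))
    pairs = [(corr.index[i], corr.columns[j], corr.iat[i, j]) for i, j in zip(rows, cols)]
    return sorted(pairs, key=lambda p: -p[2])

# Synthetic data: the column names mirror the dataset, the values are made up.
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.3, size=200),  # driven by area
    "LotArea": rng.normal(10000, 2000, size=200),               # independent
})

pairs_found = correlated_pairs(toy)
print(pairs_found)
```

On the real data the same helper could be called as `correlated_pairs(train)`.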

### Search for the important categorical features

In [None]:
plt.figure(figsize=(8, 5))
sns.barplot(x="OverallQual", y="SalePrice", data=train, estimator="mean")
plt.title("Average SalePrice by OverallQual")
plt.show()


In [None]:
sns.boxplot(x='OverallQual', y='SalePrice', data=train)
plt.title("SalePrice vs OverallQual")
plt.show()

In [None]:

f_val, p_val = stats.f_oneway(
    train[train['OverallQual']==1]['SalePrice'],
    train[train['OverallQual']==2]['SalePrice'],
    train[train['OverallQual']==3]['SalePrice'],
    train[train['OverallQual']==4]['SalePrice'],
    train[train['OverallQual']==5]['SalePrice'],
    # train[train['OverallQual']==6]['SalePrice'],
    # train[train['OverallQual']==7]['SalePrice'],
    # train[train['OverallQual']==8]['SalePrice'],
    # train[train['OverallQual']==9]['SalePrice'],
    # train[train['OverallQual']==10]['SalePrice'],

    # ...
)
print(f"F={f_val}, p={p_val}")


## Drop outliers

In [None]:
outliers = train[(train['GrLivArea'] > 4500) & (train['SalePrice'] < 300000)].index

train = train.drop(outliers)

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=train["GrLivArea"], y=train["SalePrice"])
plt.title("GrLivArea vs SalePrice")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

# 4. Feature engineering

### Categorical features

#### Find categorical features and explore their values

We need this to understand which type of encoding to apply:
- OneHot Encoding - for nominal categorical features where there is no order.
- Ordinal Encoding - converts a categorical feature to numbers, preserving the order of categories.
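The difference between the two encodings is easiest to see on a tiny example. This is a sketch with made-up rows; `Street` stands in for a nominal feature and `ExterQual` for an ordinal one.

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],   # nominal: no natural order
    "ExterQual": ["TA", "Ex", "Fa"],      # ordinal: Po < Fa < TA < Gd < Ex
})

# Ordinal encoding: map each level to its rank, so the order is preserved.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["ExterQual"] = toy["ExterQual"].map(quality_order)

# One-hot encoding: one indicator column per category, no order implied.
toy = pd.get_dummies(toy, columns=["Street"], dtype=int)

print(toy)
```

Encoding a nominal feature with ranks would impose an order that does not exist, which is why the choice is made per feature.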

In [None]:
# take all categorical features represented by strings
categorical_cols = train.select_dtypes(include=['object']).columns.tolist()
categorical_cols

In [None]:
for col in categorical_cols:
    unique_vals = train[col].unique()
    print(f"\n{col} ({len(unique_vals)} unique values):")
    print(unique_vals)

#### Check whether some categorical features are represented by numbers

In [None]:
numeric_cols = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
for col in numeric_cols:
    if train[col].nunique() < 20: 
        print(col, '- categorical?')

```
MSSubClass - categorical ohe
OverallQual - categorical ordinal
OverallCond - categorical ordinal
BsmtFullBath - count
BsmtHalfBath - count
FullBath - count
HalfBath - count
BedroomAbvGr - count
KitchenAbvGr - count
TotRmsAbvGrd - count
Fireplaces - count
GarageCars - count
PoolArea - numerical
MoSold - numerical cyclic
YrSold - categorical ohe but we will create new feature based on this. no need for OHE
```

#### Transform numeric categorical features to string type

OHE

In [None]:
ohe_numeric_features = ['MSSubClass']
train[ohe_numeric_features] = train[ohe_numeric_features].astype(str)

Ordinal

In [None]:
ordinal_numeric_features = ['OverallQual', 'OverallCond']

Cyclic

In [None]:
cyclic_numeric_features = ['MoSold']

#### Ordinal encoding

I provided data_description.txt to ChatGPT to figure out where ordinal encoding should be applied.

In [None]:
ordinal_features = [
    'ExterQual',      
    'ExterCond',      
    'BsmtQual',       
    'BsmtCond',       
    'KitchenQual',    
    'GarageQual',     
    'GarageCond',     
    'FireplaceQu',    
    'PoolQC',         
    'Functional',     
    'GarageFinish',   
    'BsmtExposure',   
    'BsmtFinType1',   
    'BsmtFinType2',   
    'HeatingQC'       
] + ordinal_numeric_features

# ExterQual, ExterCond, BsmtQual, BsmtCond, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC
quality_map = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
# BsmtExposure
bsmt_exposure_map = {'NA': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}
# GarageFinish
garage_finish_map = {'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
# BsmtFinType1, BsmtFinType2
bsmt_fin_type_map = {'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}
# Functional
functional_map = {'NA': 0, 'Sal': 1, 'Sev': 2, 'Maj2': 3, 'Maj1': 4, 'Mod': 5, 'Min2': 6, 'Min1': 7, 'Typ': 8}

ordinal_maps = {
    'ExterQual': quality_map,
    'ExterCond': quality_map,
    'BsmtQual': quality_map,
    'BsmtCond': quality_map,
    'HeatingQC': quality_map,
    'KitchenQual': quality_map,
    'FireplaceQu': quality_map,
    'GarageQual': quality_map,
    'GarageCond': quality_map,
    'PoolQC': quality_map,

    'BsmtExposure': bsmt_exposure_map,
    'GarageFinish': garage_finish_map,

    'BsmtFinType1': bsmt_fin_type_map,
    'BsmtFinType2': bsmt_fin_type_map,

    'Functional': functional_map
}

for col, mapping in ordinal_maps.items():
    train[col] = train[col].map(mapping).fillna(0).astype(int)

#### One Hot encoding

I asked ChatGPT to find nominal features for plain OHE (without grouping).

Note: some of them may have numeric values. We converted such features to `str` earlier.

In [None]:
ohe_features = [
    'MSZoning', 
    'Street', 
    'Alley', 
    'LotShape', 
    'LandContour', 
    'Utilities', 
    'LotConfig', 
    'LandSlope', 
    'BldgType', 
    'HouseStyle', 
    'RoofStyle', 
    'MasVnrType', 
    'Foundation',
    'Heating', 
    'CentralAir', 
    'GarageType',
    'PavedDrive', 
    'MiscFeature', 
    'Fence',
    'SaleType', 
    'SaleCondition',
    'Electrical'
]

Find features that need grouping because some of their values are rare.

Rare categories can lead to overfitting, so we group rare values of categorical features into an 'Other' group.

In [None]:
ohe_with_grouping = [
    'Exterior1st',
    'Exterior2nd',
    'Neighborhood',
    'Condition1',
    'Condition2',
    'RoofMatl'
] + ohe_numeric_features

def group_rare_categories(df, col, min_freq=20):
    freqs = df[col].value_counts()
    rare = freqs[freqs < min_freq].index
    df[col] = df[col].replace(rare, 'Other')
    return df

for col in ohe_with_grouping:
    train = group_rare_categories(train, col)

In [None]:
# Collect all features for OHE
ohe_features += ohe_with_grouping

# Apply OHE. Drop the first level of each feature to avoid multicollinearity (needed for linear regression)
train = pd.get_dummies(
    train,
    columns=ohe_features,
    drop_first=True
)

# For tree models, use drop_first=False instead
# train = pd.get_dummies(
#     train,
#     columns=ohe_features,
#     drop_first=False
# )

#### Check if all categorical features were encoded

In [None]:
train.select_dtypes(include='object').columns

In [None]:
pd.set_option("display.max_rows", None)      # show all rows
pd.set_option("display.max_columns", None)   # show all columns
pd.set_option("display.max_colwidth", None)  # show full column names
pd.set_option("display.width", None)         # don't wrap lines
train.dtypes
# pd.reset_option("display.max_rows")
# pd.reset_option("display.max_columns")

In [None]:
train.select_dtypes(include='object').shape[1]

In [None]:
bad_cols = train.columns[
    (train.dtypes == 'object') | 
    (train.dtypes == 'category')
]

bad_cols

In [None]:
train.isnull().sum().sort_values(ascending=False).head(10)

### Numeric features

#### Cyclic data transformation - sin/cos

In [None]:
train['MoSold_sin'] = np.sin(2 * np.pi * train['MoSold'] / 12)
train['MoSold_cos'] = np.cos(2 * np.pi * train['MoSold'] / 12)

cyclic_numeric_features = ['MoSold_cos', 'MoSold_sin']
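Why sin/cos is used for months: on the raw 1-12 scale December and January are 11 units apart, while on the unit circle they are neighbors. A small sketch (the helper name `encode_month` is hypothetical):

```python
import numpy as np

def encode_month(month):
    """Map month numbers (1-12) to points on the unit circle."""
    angle = 2 * np.pi * np.asarray(month) / 12
    return np.column_stack([np.sin(angle), np.cos(angle)])

dec, jan, feb, jun = encode_month([12, 1, 2, 6])

# Distances between encoded months.
dist_dec_jan = np.linalg.norm(dec - jan)   # neighbors across the year boundary
dist_jan_feb = np.linalg.norm(jan - feb)   # ordinary neighbors
dist_dec_jun = np.linalg.norm(dec - jun)   # opposite sides of the year

print(dist_dec_jan, dist_jan_feb, dist_dec_jun)
```

Adjacent months end up equally far apart regardless of where the year wraps, and months half a year apart get the largest distance.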

#### Create new numerical features

In [None]:
# total square footage
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
# total number of bathrooms (half baths count as 0.5)
train['TotalBath'] = train['FullBath'] + 0.5 * train['HalfBath'] + train['BsmtFullBath'] + 0.5 * train['BsmtHalfBath']
# age of the house at the moment it was sold
train['AgeAtSale'] = train['YrSold'] - train['YearBuilt']
# whether the house was remodeled (renovated) or not
train['Remodeled'] = (train['YearRemodAdd'] != train['YearBuilt']).astype(int)
# garage age; GarageYrBlt was filled with 0 for houses without a garage,
# so set the age to 0 there instead of computing YrSold - 0
train['GarageAge'] = (train['YrSold'] - train['GarageYrBlt']).where(train['GarageYrBlt'] > 0, 0)
# whether the house is new (sold in the year it was built)
train['IsNew'] = (train['YrSold'] == train['YearBuilt']).astype(int)

new_features = ['TotalSF', 'TotalBath', 'AgeAtSale', 'Remodeled', 'GarageAge', 'IsNew']
print(train[new_features].head())

#### Remove redundant numeric features

In [None]:
features_to_drop = ['YrSold', 'MoSold']
train.drop(columns=features_to_drop, inplace=True)
numeric_cols = [x for x in numeric_cols if x not in features_to_drop]

### Log-transform numerical values

#### Log-transform target

In [None]:
train['SalePrice'] = np.log1p(train['SalePrice'])

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(train['SalePrice'], kde=True)
plt.title('SalePrice (Log) Distribution')
plt.show()

#### Log transform features

In [None]:
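# Exclude the target and the Id column (an identifier, not a feature; it is dropped before modeling)
numeric_cols = [col for col in numeric_cols if col not in ('SalePrice', 'Id')]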

# Exclude categorical features that are represented in numeric format
all_categorical_features = ordinal_features + ohe_features

# Exclude cyclic numerical features
transformed_features = all_categorical_features + cyclic_numeric_features

numeric_cols = [
    col for col in numeric_cols
    if col not in transformed_features
]

# Exclude binary features
numeric_cols = [
    col for col in numeric_cols
    if train[col].nunique() > 2
]

#### Summary for all numerical features

In [None]:
# summary = pd.DataFrame({
#     'nunique': train[numeric_cols].nunique(),
#     'min': train[numeric_cols].min(),
#     'max': train[numeric_cols].max(),
#     'skew': train[numeric_cols].apply(lambda x: skew(x.dropna()))
# }).sort_values('skew', ascending=False)

# summary

#### Find and exclude features that semantically do not need a log transform

We only need to log-transform continuous features such as area, volume or price.
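The effect of log1p on a right-skewed continuous feature can be checked on synthetic data. This is a sketch: the log-normal sample below only imitates an area-like feature and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "area" values (log-normal), for illustration only.
area = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=1000))

skew_before = area.skew()
skew_after = np.log1p(area).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Counts and small discrete scales do not benefit in the same way, which is why they are excluded before the transform.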

In [None]:
semantic_exclude = []

for col in numeric_cols:
    unique_vals = train[col].nunique()
    max_val = train[col].max()

    # categorical / discrete
    if unique_vals < 20:
        semantic_exclude.append(col)

    # counters
    if max_val <= 10 and unique_vals <= 10:
        semantic_exclude.append(col)

loggable_numeric_cols = [
    col for col in numeric_cols
    if col not in semantic_exclude
]

#### Plots of numeric feature distributions

In [None]:
# for col in loggable_numeric_cols:
#     plt.figure(figsize=(6,3))
#     sns.histplot(train[col], bins=50, kde=True)
#     plt.title(col)
#     plt.show()

#### Skew of distributions

In [None]:
from scipy.stats import skew

skewed_features = train[loggable_numeric_cols].apply(lambda x : skew(x.dropna()))
skewed_features = skewed_features[abs(skewed_features) > 0.75].index.tolist()

skewed_features

In [None]:
# for col in skewed_features:
#     plt.figure(figsize=(6,3))
#     sns.histplot(train[col], bins=50, kde=True)
#     plt.title(col)
#     plt.show()

#### Find features whose correlation with the target improves after log-transform

In [None]:
correlated_features = []
for col in loggable_numeric_cols:
    corr_original = train[col].corr(train['SalePrice'])
    corr_log = np.log1p(train[col]).corr(train['SalePrice'])
    if abs(corr_log) > abs(corr_original):
        correlated_features.append(col)

correlated_features

#### Find features that appear in only one of the two lists (skewed vs. correlated)

In [None]:
skewed_features_unique = [
    col for col in skewed_features 
    if col not in correlated_features
]
skewed_features_unique

In [None]:
log_candidates_unique = [
    col for col in correlated_features 
    if col not in skewed_features
]
log_candidates_unique

#### Unite skewed and correlated features

The two lists may contain different features, so we take their union.

In [None]:
log_candidates_features = set(skewed_features) | set(correlated_features)

#### Compare all loggable features with the log candidates

In [None]:
loggable_features_but_not_candidates = set(loggable_numeric_cols) - set(log_candidates_features)
loggable_features_but_not_candidates

In [None]:
# for col in loggable_features_but_not_candidates:
#     plt.figure(figsize=(6,3))
#     sns.histplot(train[col], bins=50, kde=True)
#     plt.title(col)
#     plt.show()

#### Apply log-transform

In [None]:
# after the set operations, log_candidates_features is a set;
# cast it back to list[str], sorted so the column order is reproducible between runs
log_candidates_features: list[str] = sorted(log_candidates_features)
for col in log_candidates_features:
    train[col] = np.log1p(train[col])

# for col in log_candidates_features:
#     plt.figure(figsize=(6,3))
#     sns.histplot(train[col], bins=50, kde=True)
#     plt.title(col)
#     plt.show()

#### Find sparse features and create binary columns for them

In [None]:
sparsity = train[log_candidates_features].apply(lambda x: (x == 0).mean())
sparse_features = sparsity[sparsity > 0.5].index.tolist()  # >50% zeros

for col in sparse_features:
    train[f'Has_{col}'] = (train[col] > 0).astype(int)

# 5. Prepare for modeling

### Create target and inputs

In [None]:
y_train = train['SalePrice']
x_train = train.drop(columns=['SalePrice', 'Id'])
print(x_train.shape)
print(y_train.shape)

### Transforming pipeline

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

standard_scaler_candidates = log_candidates_features + cyclic_numeric_features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), standard_scaler_candidates)
    ],
    remainder='passthrough'
)

### Split the data

#### Train / validation split

This is a good solution for an initial test.
It lets us check:
- that there is no leakage
- that the pipeline works correctly
- that the metrics are adequate

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

#### KFold CV

Good solution for:
- precise estimation
- fair comparison
- hyperparams tuning

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

#### Train on the whole dataset

Fit on the whole training set, then predict on the test set. This is done at the very end, right before submission.
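A rough sketch of that final step, on synthetic data. The arrays `X_full`, `X_test` and `price` are stand-ins (assumptions) for the fully preprocessed train/test matrices and the raw target; the point is that predictions come out on the log scale and need `np.expm1` to get back to prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the prepared train/test matrices and raw prices.
X_full = rng.normal(size=(100, 3))
price = np.exp(12 + X_full @ np.array([0.3, 0.1, -0.2]) + rng.normal(0, 0.05, 100))
X_test = rng.normal(size=(10, 3))

# Fit on all labelled rows, using the log1p-transformed target.
model = LinearRegression().fit(X_full, np.log1p(price))

# Predictions are on the log scale; expm1 inverts log1p.
pred_price = np.expm1(model.predict(X_test))

print(pred_price[:3])
```

The test set has to go through exactly the same preprocessing steps as the training set before `predict` is called.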

# 6. Baseline model

## Evaluation strategy

- Primary metric: RMSE
- Target transformation: log1p(SalePrice)
- Evaluation method: 5-fold cross-validation
- Model selection based on CV RMSE
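Because the target is log1p(SalePrice), the RMSE reported here is on the log scale and behaves like a relative error rather than an error in dollars. A small sketch with made-up prices (the helper `rmse_np` is hypothetical and only mirrors the metric in plain NumPy):

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error in plain NumPy."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up prices: each prediction is off by about 10%.
price_true = np.array([100_000.0, 200_000.0, 300_000.0])
price_pred = np.array([110_000.0, 180_000.0, 330_000.0])

rmse_log = rmse_np(np.log1p(price_true), np.log1p(price_pred))
rmse_raw = rmse_np(price_true, price_pred)

print(f"log-scale RMSE: {rmse_log:.4f}, raw RMSE: {rmse_raw:.0f}")
```

Errors of about 10% on every house give a log-scale RMSE near 0.1, while the raw RMSE is dominated by the most expensive house.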

In [None]:
from sklearn.metrics import root_mean_squared_error

def rmse(y_true, y_pred):
    return root_mean_squared_error(y_true, y_pred)

## Baseline №1 — DummyRegressor

Main purpose:
- predicts median: y_pred_dummy = median(y_train)
- adequacy control
- sanity check

Expected values:
- RMSE should be bad (~0.35-0.40)
- if RMSE is good - we have leakage

The baseline is a starting point, a simple model that demonstrates the minimum possible performance we expect from any "reasonable" model.
- It doesn't use features, so it honestly shows "how easy it is to guess without knowing anything about the data."
- All subsequent models should show improvement over the baseline.

1. Simple logic:
    - DummyRegressor simply calculates the mean/median of the target and always predicts the same value.
    - It doesn't use X → any improvement over it comes only from a model actually "learning from features."
2. Strong robustness to outliers (when using the median)
    - In our data, SalePrice is heavily skewed → the median provides a fair baseline, uncorrupted by rare expensive houses.
3. Leakage Control
   - If your complex model performs worse than Dummy → this is a sign that:
        - the pipeline is malfunctioning,
        - there are data leaks,
        - the features are not aligned with the target variable.

In [None]:
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

dummy = DummyRegressor(strategy="median")

dummy_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', dummy)
    ]
)

scores_dummy = cross_val_score(
    dummy_pipeline,
    x_train,
    y_train,
    cv=kf,
    scoring='neg_root_mean_squared_error'
)

dummy_rmse = -scores_dummy.mean()
print(dummy_rmse)

## Baseline №2 — LinearRegression

We build this model because:
- it's simple, so we can use it as a baseline to check more complicated models
    - LinearRegression → baseline with "minimal processing"
    - Ridge / Lasso → regularization check
    - RandomForest / Boosting → complex models
- it checks whether we have a linear signal
- it shows how adequate the features are
- expected RMSE ≈ 0.20–0.23

In [None]:
from sklearn.linear_model import LinearRegression

linear = LinearRegression()

linear_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', linear)
    ]
)

scores_linear = cross_val_score(
    linear_pipeline,
    x_train,
    y_train,
    cv=kf,
    scoring='neg_root_mean_squared_error'
)

linear_rmse = -scores_linear.mean()
print(linear_rmse)

We got a very good score for the 2nd baseline - LinearRegression.

The very good RMSE of LinearRegression indicates that there is a strong linear signal in the data.
However, due to the presence of highly correlated features, the coefficients of the linear model may be unstable and hard to interpret.

Multicollinearity does not necessarily worsen predictive performance, but it leads to unstable and non-interpretable coefficients in linear models. Therefore, highly correlated features are removed to obtain a stable and interpretable baseline.

The goal was not to improve the score, but to obtain a cleaner and more interpretable baseline.
This allows a fair comparison with more complex models and helps separate the effect of feature engineering from the effect of model complexity.

We remove highly correlated features not to handicap linear models, but to obtain a stable and interpretable baseline that allows us to isolate the effect of model complexity.

If we remove the correlated features and then look at the RMSE of the linear model:
- It's easier to see how much a complex model actually improves the results.
- It's easy to make statements like "Boosting improves RMSE by 0.03 over the linear model, which scored 0.18 against a dummy baseline of 0.22."

Why remove highly correlated features for LinearRegression when creating a baseline:

1. Coefficient Stability
    - Highly correlated features → linear model coefficients are unstable
    - With repeated splits (KFold), the coefficients "jump"
    - Removing correlations makes the baseline predictable and stable
2. Feature Interpretation
    - Coefficients become meaningful: it's easy to understand which feature really influences the target
    - With correlated features, it's impossible to give a correct explanation of, for example, why increasing the area of a house increases the price
3. Fair Comparison with More Complex Models
    - Removing redundant information ensures the baseline doesn't receive an "unnecessary bonus" from duplicate features
    - Allows you to accurately assess how much improvement a complex model (Boosting, RF, etc.) provides, and how much is simply due to good feature engineering
4. Isolating the Effect of the Model
    - Separating the contribution of the data (feature engineering) from that of the algorithm
    - The RMSE of the clean baseline shows how much a simple linear model can provide
    - Any improvement from a complex model now fairly reflects the advantage of the algorithm, not of correlated features

Conclusion:
Highly correlated features are removed in the linear baseline not to worsen RMSE, but to obtain stable, interpretable coefficients and a fair comparison with more complex models. This allows us to isolate the contribution of model complexity from the effect of feature engineering.
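The coefficient instability described above can be reproduced on synthetic data. This is a sketch under stated assumptions: `x2` is an artificial near-duplicate of `x1`, and `bootstrap_coef_std` is a hypothetical helper that refits ordinary least squares on bootstrap resamples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def bootstrap_coef_std(X, y, n_rounds=200):
    """Std of least-squares coefficients across bootstrap resamples."""
    coefs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y), size=len(y))      # resample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(beta)
    return np.std(coefs, axis=0)

std_both = bootstrap_coef_std(np.column_stack([x1, x2]), y)
std_single = bootstrap_coef_std(x1.reshape(-1, 1), y)

print("coef std with both features:", std_both)
print("coef std with x1 only:      ", std_single)
```

With the duplicate present, the weight is split between `x1` and `x2` almost arbitrarily from one resample to the next, even though the predictions barely change.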

We should find the most correlated features to exclude them from the model.

There are 2 approaches to doing this:
- feature–feature correlation
- VIF (Variance Inflation Factor)

Let's find the most correlated features:

In [None]:
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = x_train.select_dtypes(include=['object']).columns
boolean_features = x_train.select_dtypes(include=['boolean']).columns

print("numerical_features:", numerical_features)
print("categorical_features:", categorical_features)
print("boolean_features:", boolean_features)


In [None]:
# Correlation of numerical features with SalePrice
correlations = x_train[numerical_features].copy()
correlations['SalePrice'] = y_train
corr_matrix = correlations.corr()

# Sort by absolute correlation with SalePrice
top_corr = corr_matrix['SalePrice'].abs().sort_values(ascending=False)
print(top_corr.head(10))


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

top_features = top_corr.index[1:11]  # exclude SalePrice

plt.figure(figsize=(15, 10))
for i, feature in enumerate(top_features):
    plt.subplot(2, 5, i+1)
    sns.scatterplot(x=x_train[feature], y=y_train)
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
plt.tight_layout()
plt.show()
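For the VIF approach listed above, one compact way is to read the values off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each feature regressed on the others. The sketch below uses synthetic columns; `vif_table` is a hypothetical helper, and a cut-off around 5 to 10 is only a common rule of thumb.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column, from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns).sort_values(ascending=False)

# Synthetic data: "a_noisy" is strongly related to "a", "b" is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.2, size=300),
    "b": rng.normal(size=300),
})

vif = vif_table(toy)
print(vif)
```

On the real data this could be applied to `x_train[numerical_features]`; perfectly collinear columns would make the correlation matrix singular, so they need to be dropped first.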
