### Add the province column back to filtered_final_cleaned_data.csv

In [6]:
import pandas as pd

# We need to add a "province" column to filtered_final_cleaned_data.csv.
# The file only contains postal codes, so we first map each postal code to its province.
# The dictionary below lists the postal-code ranges that belong to each province.

# Load data
df = pd.read_csv("../data/raw/filtered_final_cleaned_data.csv")

# Define postal code ranges per province
postal_to_province = {
    "Antwerp": range(2000, 3000),
    "East-Flanders": range(9000, 10000),
    "West-Flanders": range(8000, 9000),
    "Flemish-Brabant": list(range(1500, 2000)) + list(range(3000, 3500)),
    "Brussels": range(1000, 1300),
    "Limburg": range(3500, 4000),
    "Liège": range(4000, 5000),
    "Namur": range(5000, 6000),
    "Hainaut": list(range(6000, 6600)) + list(range(7000, 8000)),
    "Luxembourg": range(6600, 7000),
    "Brabant-Wallon": range(1300, 1500)
}

# Helper function to find province for each postal code
def get_province(postal_code):
    try:
        postal_code = int(postal_code)
        for province, codes in postal_to_province.items():
            if postal_code in codes:
                return province
        return "Unknown"
    except (TypeError, ValueError, OverflowError):  # missing or non-numeric postal code
        return "Unknown"


# Apply the function to create a new column called "province" with the province names based on the postal codes.
df["province"] = df["postal_code"].apply(get_province)

print(df[["postal_code", "province"]].head())

# display(df.head())

# Save the updated dataset
df.to_csv("../data/raw/filtered_final_cleaned_data.csv", index=False)
print("CSV saved with 'province' column!")

# Load the new dataset with province column
df = pd.read_csv("../data/raw/filtered_final_cleaned_data.csv")






   postal_code province
0         2800  Antwerp
1         2200  Antwerp
2         2840  Antwerp
3         2440  Antwerp
4         2300  Antwerp
CSV saved with 'province' column!
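
The row-by-row lookup above works, but on a large dataset a vectorized lookup may be faster. The following is a minimal sketch of one possible alternative, not the notebook's method: it uses `numpy.searchsorted` over the lower bounds of the same postal-code ranges. The names `LOWER_BOUNDS`, `PROVINCES` and `map_provinces` are hypothetical and introduced only for illustration. One difference from `get_province`: strings such as `"2800.0"` are parsed as numbers here instead of becoming "Unknown".

```python
import numpy as np
import pandas as pd

# Lower bound of each postal-code range, in ascending order (same ranges as the dictionary above).
LOWER_BOUNDS = np.array([1000, 1300, 1500, 2000, 3000, 3500, 4000,
                         5000, 6000, 6600, 7000, 8000, 9000])
# Province for the range starting at the bound in the same position.
PROVINCES = np.array(["Brussels", "Brabant-Wallon", "Flemish-Brabant", "Antwerp",
                      "Flemish-Brabant", "Limburg", "Liège", "Namur", "Hainaut",
                      "Luxembourg", "Hainaut", "West-Flanders", "East-Flanders"])


def map_provinces(postal_codes: pd.Series) -> pd.Series:
    """Map a Series of postal codes to province names, 'Unknown' when out of range."""
    codes = pd.to_numeric(postal_codes, errors="coerce")  # non-numeric values become NaN
    valid = codes.between(1000, 9999)                     # NaN counts as not valid
    # Index of the last lower bound that is <= the code.
    idx = np.searchsorted(LOWER_BOUNDS, codes.fillna(0).to_numpy(), side="right") - 1
    idx = np.clip(idx, 0, len(PROVINCES) - 1)             # keep placeholder indices in range
    provinces = pd.Series(PROVINCES[idx], index=postal_codes.index)
    return provinces.where(valid, "Unknown")


sample = pd.Series([2800, "9000", 1499, 6600, 7000, None, "abc", 999])
print(map_provinces(sample).tolist())
```

With the notebook's DataFrame this would be used as `df["province"] = map_provinces(df["postal_code"])`.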


### Bring together and see your fully preprocessed dataset

All preprocessing steps are applied to X_train and X_test, not to the original df, so the original df remains:

- with NaNs
- not encoded
- not scaled

This is expected: by default, sklearn transformers return new objects and leave the original DataFrame unchanged.

By default, most sklearn transformers (OneHotEncoder, StandardScaler, etc.) output NumPy arrays or sparse matrices, which:

- don't keep column names,
- don't automatically merge back into X_train,
- won't be visible in df.head().

Unless you explicitly rebuild a DataFrame from the transformed results, nothing you inspect will change.

To see your fully preprocessed dataset, you must combine the transformed arrays back into a DataFrame yourself:


Option A: using ColumnTransformer + Pipeline

In [None]:
# Fit on train
X_train_processed = preprocessor.fit_transform(X_train)

# Transform test
X_test_processed = preprocessor.transform(X_test)

# Turn into DataFrame
processed_cols = preprocessor.get_feature_names_out()

X_train_final = pd.DataFrame(X_train_processed, columns=processed_cols, index=X_train.index)
X_test_final = pd.DataFrame(X_test_processed, columns=processed_cols, index=X_test.index)

X_train_final.head()


Option B: doing everything manually

In [None]:
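# Assumes `imputer` (e.g. a SimpleImputer) was created earlier in the notebook
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Work on copies so the original splits stay untouched
X_train, X_test = X_train.copy(), X_test.copy()

# 1. Impute (fit on train only, then reuse on test)
X_train["living_area"] = imputer.fit_transform(X_train[["living_area"]])
X_test["living_area"] = imputer.transform(X_test[["living_area"]])

# 2. Encode (sparse_output requires scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(X_train[["province"]])
cols = ohe.get_feature_names_out(["province"])
X_train_encoded = pd.DataFrame(encoded, columns=cols, index=X_train.index)
X_test_encoded = pd.DataFrame(ohe.transform(X_test[["province"]]), columns=cols, index=X_test.index)

# 3. Scale
scaler = StandardScaler()
scaled = scaler.fit_transform(X_train[["living_area"]])
X_train_scaled = pd.DataFrame(scaled, columns=["living_area_scaled"], index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test[["living_area"]]), columns=["living_area_scaled"], index=X_test.index)

# 4. Combine everything, dropping the raw columns that were encoded or scaled
raw_cols = ["province", "living_area"]
X_train_final = pd.concat([X_train.drop(columns=raw_cols), X_train_encoded, X_train_scaled], axis=1)
X_test_final = pd.concat([X_test.drop(columns=raw_cols), X_test_encoded, X_test_scaled], axis=1)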
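
Whichever option is used, it can help to confirm that the combined table really is fully preprocessed. The helper below is a hypothetical sketch (the name `summarize_preprocessing` is not part of the notebook or of any library). It reports how many missing values remain and which columns are still non-numeric.

```python
import pandas as pd


def summarize_preprocessing(frame: pd.DataFrame) -> dict:
    """Report remaining missing values and non-numeric columns in a feature table."""
    return {
        "n_missing": int(frame.isna().sum().sum()),
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }


# Invented example rows: one raw table and one processed table.
raw = pd.DataFrame({"living_area": [80.0, None], "province": ["Antwerp", "Namur"]})
processed = pd.DataFrame({"living_area_scaled": [-1.0, 1.0], "province_Antwerp": [1.0, 0.0]})

print(summarize_preprocessing(raw))
print(summarize_preprocessing(processed))
```

On the notebook's data this would be called as `summarize_preprocessing(X_train_final)`.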
