This notebook develops a `process_row` function for the Processor component, which processes rows from a dataset of apartments in Poland. The function transforms each record by applying cleaning, normalization, and transformation operations.

## Processing Goals:

1. Converting categorical features to numerical
2. Normalizing numerical features to the range [0, 1]
3. Filling missing values
4. Creating new informative features
5. Removing unnecessary features

## Function Requirements:
- Accepts a single data row (`pandas.Series`)
- Returns a processed data row (`pandas.Series`)
- Does not modify the original dataset
- Converts string binary features ('yes'/'no') to boolean (True/False)
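
The last requirement can be sketched in isolation. The helper below is hypothetical (the name `yes_no_to_bool` and its `default` parameter are assumptions, not part of this notebook), and a plain value stands in for a `pandas.Series` cell. It also passes through values that are already boolean, which may matter if a cleaned dataset stores these columns as `True`/`False`.

```python
import math


def yes_no_to_bool(value, default=False):
    """Map 'yes'/'no' (or an existing bool) to a bool; missing values get `default`."""
    # Missing values: None or a float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    # Already boolean: pass through unchanged
    if isinstance(value, bool):
        return value
    # Strings: compare case-insensitively, ignoring surrounding whitespace
    return str(value).strip().lower() == 'yes'


print(yes_no_to_bool('yes'))         # True
print(yes_no_to_bool('no'))          # False
print(yes_no_to_bool(float('nan')))  # False
print(yes_no_to_bool(True))          # True
```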

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Chart display settings
plt.style.use('seaborn-v0_8')
sns.set(font_scale=1.2)

# Display all DataFrame columns
pd.set_option('display.max_columns', None)

Let's load the cleaned dataset from the `cleaned_apartments.csv` file and examine its structure.

In [None]:
# Path to the data file
file_path = '../datasets/cleaned_apartments.csv'

# Loading data
try:
    df = pd.read_csv(file_path)
    print(f"Dataset successfully loaded, size: {df.shape}")
except FileNotFoundError:
    print(f"File {file_path} not found!")
    file_path = '../datasets/apartments.csv'  # Trying to load the original dataset
    try:
        df = pd.read_csv(file_path)
        print(f"Original dataset loaded, size: {df.shape}")
    except FileNotFoundError:
        print(f"File {file_path} also not found!")
        raise  # Without a dataset, `df` is undefined and the remaining cells cannot run

In [None]:
# Let's look at the first 5 rows of data
df.head()

In [None]:
# Check information about the data
df.info()

In [None]:
# Check statistics for numeric columns
df.describe()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values_percent = (df.isnull().sum() / len(df)) * 100

print("Number of missing values by column:")
for col, miss_count in sorted(zip(missing_values.index, missing_values), key=lambda x: x[1], reverse=True):
    if miss_count > 0:
        print(f"{col}: {miss_count} ({missing_values_percent[col]:.2f}%)")

Let's check which values appear in the binary columns that need to be transformed:
- hasParkingSpace
- hasBalcony
- hasElevator
- hasSecurity
- hasStorageRoom

In [None]:
# Check unique values in binary columns
binary_columns = ['hasParkingSpace', 'hasBalcony', 'hasElevator', 'hasSecurity', 'hasStorageRoom']

for col in binary_columns:
    if col in df.columns:
        print(f"{col}: {df[col].unique()}")

Let's check which values appear in the categorical columns that will be transformed:

In [None]:
# Check categorical features
categorical_columns = ['type', 'ownership', 'condition', 'city']

for col in categorical_columns:
    if col in df.columns:
        print(f"{col}: {df[col].unique()}")
        print(f"Number of unique values: {df[col].nunique()}")
        print(f"Value frequency:\n{df[col].value_counts()}\n")
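
One common way to turn such columns into numbers is a dictionary lookup with a fallback code for values outside the mapping. The sketch below reuses the building-type codes that `process_row` applies later in this notebook; the helper name `encode_type`, its parameters, and the sample value `'villa'` are hypothetical and only serve the illustration.

```python
import math

# Building-type codes used by process_row; unlisted types fall back to 3
TYPE_MAPPING = {'blockOfFlats': 0, 'tenement': 1, 'apartmentBuilding': 2}


def encode_type(value, unknown_code=3, missing_code=0):
    """Map a building type to its integer code, handling missing and unseen values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing_code
    return TYPE_MAPPING.get(value, unknown_code)


print(encode_type('tenement'))    # 1
print(encode_type('villa'))       # 3 (not in the mapping)
print(encode_type(float('nan')))  # 0 (missing value)
```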

Next, we develop the `process_row` function, which performs the following operations:

1. Converting string binary features to boolean values
2. Filling missing values
3. Converting categorical features to numerical
4. Creating new features
5. Removing unnecessary columns

Normalization of numerical features is handled separately, after the row-level processing.

In [None]:
def process_row(row: pd.Series) -> pd.Series:
    """
    Takes a row from a pandas DataFrame and returns an updated row
    with cleaned and transformed data.
    
    Args:
        row (pd.Series): Row from DataFrame with apartment data
        
    Returns:
        pd.Series: Processed row with transformed data
    """
    # Create a copy of the row to avoid modifying the original
    processed = row.copy()
    
    # 1. Converting string binary features to boolean values
    binary_columns = ['hasParkingSpace', 'hasBalcony', 'hasElevator', 'hasSecurity', 'hasStorageRoom']
    for col in binary_columns:
        if col in processed:
            # Fill missing values
            if pd.isna(processed[col]):
                processed[col] = 'no'  # By default, we assume the feature is absent
                
                # For hasElevator, we determine by building type and number of floors
                if col == 'hasElevator' and 'floorCount' in processed and 'type' in processed:
                    if (not pd.isna(processed['floorCount']) and processed['floorCount'] > 4) or \
                       (not pd.isna(processed['type']) and processed['type'] == 'blockOfFlats'):
                        processed[col] = 'yes'
            
            # Convert yes/no to True/False
            processed[col] = True if processed[col] == 'yes' else False
    
    # 2. Fill missing values in numeric columns
    numeric_columns = {
        'floor': 2,              # Median value
        'floorCount': 5,         # Median value
        'squareMeters': 50,      # Typical apartment size
        'rooms': 2,              # Typical number of rooms
        'centreDistance': 5.0,   # Typical distance from center
        'poiCount': 10           # Average number of POIs
    }
    
    for col, default_value in numeric_columns.items():
        if col in processed and pd.isna(processed[col]):
            processed[col] = default_value
    
    # 3. Convert categorical features to numerical
    # Building type
    if 'type' in processed:
        type_mapping = {
            'blockOfFlats': 0,
            'tenement': 1,
            'apartmentBuilding': 2
        }
        if not pd.isna(processed['type']):
            processed['type_numeric'] = type_mapping.get(processed['type'], 3)
        else:
            processed['type_numeric'] = 0  # Default value
        
        # Remove original column
        processed = processed.drop('type')
    
    # Apartment condition
    if 'condition' in processed:
        condition_mapping = {
            'very good': 4,
            'good': 3,
            'average': 2,
            'poor': 1,
            'to renovation': 0
        }
        if not pd.isna(processed['condition']):
            processed['condition_numeric'] = condition_mapping.get(processed['condition'], 2)
        else:
            processed['condition_numeric'] = 2  # Average condition by default
        
        # Remove original column
        processed = processed.drop('condition')
    
    # 4. Create new features
    # Floor ratio to total floors
    if 'floor' in processed and 'floorCount' in processed and processed['floorCount'] > 0:
        processed['floor_ratio'] = round(processed['floor'] / processed['floorCount'], 3)
    else:
        processed['floor_ratio'] = 0.5  # Default value
    
    # Price per square meter
    if 'price' in processed and 'squareMeters' in processed and processed['squareMeters'] > 0:
        processed['price_per_m2'] = round(processed['price'] / processed['squareMeters'], 2)
    
    # Combined comfort score
    comfort_features = ['hasParkingSpace', 'hasBalcony', 'hasElevator', 'hasSecurity', 'hasStorageRoom']
    comfort_score = 0
    for feature in comfort_features:
        if feature in processed and processed[feature]:
            comfort_score += 1
    processed['comfort_score'] = comfort_score
    
    # 5. Remove rarely used or uninformative columns
    columns_to_drop = [
        'buildYear', 'buildingMaterial', 'ownership', 
        'schoolDistance', 'clinicDistance', 'kindergartenDistance', 
        'restaurantDistance', 'collegeDistance', 'pharmacyDistance', 'postOfficeDistance',
        'id'  # ID is usually not needed for ML
    ]
    
    for col in columns_to_drop:
        if col in processed:
            processed = processed.drop(col)
    
    return processed
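
The elevator rule is the least obvious step in `process_row`, so here it is in isolation. This is a sketch using plain Python values instead of a `pandas.Series`; the names `infer_elevator` and `_is_missing` are hypothetical. It mirrors the rule above: a missing `hasElevator` is assumed to be present when the building has more than 4 floors or is a `blockOfFlats`.

```python
import math


def _is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_elevator(floor_count, building_type):
    """Guess elevator presence for a row where hasElevator is missing."""
    tall_building = not _is_missing(floor_count) and floor_count > 4
    block_of_flats = not _is_missing(building_type) and building_type == 'blockOfFlats'
    return tall_building or block_of_flats


print(infer_elevator(10, 'tenement'))      # True  (more than 4 floors)
print(infer_elevator(3, 'blockOfFlats'))   # True  (building type)
print(infer_elevator(3, 'tenement'))       # False
print(infer_elevator(float('nan'), None))  # False (nothing to infer from)
```

Because a low-rise `blockOfFlats` is also treated as having an elevator, the rule is a heuristic and is worth checking against rows where `hasElevator` is known.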

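Let's create an additional function to normalize numerical features. It is kept outside the main `process_row` function because min-max scaling needs the minimum and maximum of each whole column, which a single row does not provide. It can be applied to the data after row-level processing.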

In [None]:
def normalize_numeric_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalizes numerical features to the range [0, 1].
    
    Args:
        df (pd.DataFrame): DataFrame with numerical features
        
    Returns:
        pd.DataFrame: DataFrame with normalized features
    """
    # Create a copy of the DataFrame
    normalized_df = df.copy()
    
    # Define numerical columns (excluding boolean and price - the target variable)
    numeric_columns = [col for col in df.columns 
                      if df[col].dtype in ['int64', 'float64'] 
                      and col != 'price'
                      and not (df[col].isin([0, 1]).all() and df[col].nunique() <= 2)]
    
    # Create a MinMaxScaler instance
    scaler = MinMaxScaler()
    
    # Apply normalization to numerical columns
    if numeric_columns:
        normalized_df[numeric_columns] = scaler.fit_transform(df[numeric_columns])
    
    return normalized_df
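
For reference, min-max scaling to [0, 1] maps each value `x` in a column to `(x - min) / (max - min)`. A dependency-free sketch of that formula follows; the helper name `min_max_scale` and the sample values are assumptions made for illustration, not part of the notebook.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using the column-wide min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


square_meters = [30, 50, 80, 130]  # illustrative values only
print(min_max_scale(square_meters))  # [0.0, 0.2, 0.5, 1.0]
```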

Let's apply the `process_row` function to the dataset to test how it works.

In [None]:
# Apply the process_row function to each row of the dataset
processed_df = df.apply(process_row, axis=1)

# Display the first few rows of the processed dataset
processed_df.head()

In [None]:
# Information about the structure of the processed dataset
processed_df.info()

In [None]:
# Check unique values for binary features after processing
for col in binary_columns:
    if col in processed_df.columns:
        print(f"{col}: {processed_df[col].unique()}")

Let's check how the new features look:

In [None]:
# Check new features
new_features = ['floor_ratio', 'price_per_m2', 'comfort_score', 'type_numeric', 'condition_numeric']
for feature in new_features:
    if feature in processed_df.columns:
        plt.figure(figsize=(10, 4))
        if feature == 'comfort_score':
            plt.title(f'Distribution of {feature}')
            sns.countplot(x=processed_df[feature])
        else:
            plt.title(f'Distribution of {feature}')
            sns.histplot(processed_df[feature], kde=True)
        plt.show()
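
As a sanity check on the plots, the three engineered features can be computed by hand for a single record. The record below is made up for illustration (its values are not taken from the dataset), and a plain dict stands in for a processed row.

```python
# Hypothetical processed record (illustrative values only)
record = {
    'floor': 3, 'floorCount': 10,
    'price': 600000, 'squareMeters': 48,
    'hasParkingSpace': True, 'hasBalcony': True, 'hasElevator': True,
    'hasSecurity': False, 'hasStorageRoom': False,
}

comfort_features = ['hasParkingSpace', 'hasBalcony', 'hasElevator',
                    'hasSecurity', 'hasStorageRoom']

# Same formulas as in process_row
floor_ratio = round(record['floor'] / record['floorCount'], 3)
price_per_m2 = round(record['price'] / record['squareMeters'], 2)
comfort_score = sum(1 for f in comfort_features if record.get(f))

print(floor_ratio)    # 0.3
print(price_per_m2)   # 12500.0
print(comfort_score)  # 3
```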

Let's apply normalization to the processed dataset.

In [None]:
# Apply normalization to the processed dataset
normalized_df = normalize_numeric_features(processed_df)

# Check normalization results
normalized_df.head()

In [None]:
# Check the range of normalized features
numeric_columns = [col for col in normalized_df.columns 
                  if normalized_df[col].dtype in ['int64', 'float64'] 
                  and col != 'price'
                  and not (normalized_df[col].isin([0, 1]).all() and normalized_df[col].nunique() <= 2)]

print("Ranges of normalized features:")
for col in numeric_columns:
    print(f"{col}: [{normalized_df[col].min()}, {normalized_df[col].max()}]")

Let's check how features correlate with each other and with the target variable (price).

In [None]:
# Building a correlation matrix
plt.figure(figsize=(14, 12))
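# numeric_only=True (pandas >= 1.5) skips any remaining text columns, for example 'city',
# which would otherwise raise an error in pandas 2.x
correlation_matrix = normalized_df.corr(numeric_only=True)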
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

In [None]:
# Sort features by correlation with price
price_correlations = correlation_matrix['price'].sort_values(ascending=False)
print("Correlation of features with price:")
print(price_correlations)

# Visualize top-10 features by correlation with price
plt.figure(figsize=(12, 6))
price_correlations.drop('price').sort_values(ascending=False).head(10).plot(kind='bar')
plt.title('Top-10 Features by Correlation with Price')
plt.tight_layout()
plt.show()
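
`DataFrame.corr()` computes the Pearson correlation coefficient by default. A plain-Python sketch of that coefficient on made-up numbers shows what the values in the matrix mean; the helper name `pearson` and all sample values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


square_meters = [30, 50, 80, 130]
price = [300000, 500000, 800000, 1300000]  # exactly proportional to area
centre_distance = [8.0, 6.0, 4.0, 1.0]     # shrinks as price grows

print(round(pearson(square_meters, price), 3))  # 1.0
print(pearson(centre_distance, price) < 0)      # True
```

When reading the ranking, keep in mind that `price_per_m2` is computed from `price`, so part of its correlation with the target exists by construction.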

## Final Version

The `process_row` function developed above performs all the data transformations the project needs:

1. Converts binary features from strings ('yes'/'no') to boolean (True/False)
2. Fills missing values
3. Converts categorical features to numerical
4. Creates new informative features
5. Removes unnecessary features

An additional `normalize_numeric_features` function was also implemented to normalize numerical features.

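Below is the final version, `process_row_final`, which can be used in the Processor component. In addition to the steps above, it encodes `city` and can optionally normalize numerical features using predefined ranges: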

In [None]:
def process_row_final(row: pd.Series, normalize=True) -> pd.Series:
    """
    Final function for processing data rows.
    Includes all necessary transformations and optional normalization.
    
    Args:
        row (pd.Series): Row from DataFrame with apartment data
        normalize (bool): Flag indicating whether to normalize numerical features
        
    Returns:
        pd.Series: Processed row with transformed data
    """
    # Create a copy of the row
    processed = row.copy()
    
    # 1. Convert binary string features to boolean values
    binary_columns = ['hasParkingSpace', 'hasBalcony', 'hasElevator', 'hasSecurity', 'hasStorageRoom']
    for col in binary_columns:
        if col in processed:
            # Fill missing values
            if pd.isna(processed[col]):
                processed[col] = 'no'
                
                # For hasElevator, determine by building type and floor count
                if col == 'hasElevator' and 'floorCount' in processed and 'type' in processed:
                    if (not pd.isna(processed['floorCount']) and processed['floorCount'] > 4) or \
                       (not pd.isna(processed['type']) and processed['type'] == 'blockOfFlats'):
                        processed[col] = 'yes'
            
            # Convert yes/no to True/False
            processed[col] = True if processed[col] == 'yes' else False
    
    # 2. Fill missing values in numeric columns
    numeric_columns = {
        'floor': 2,              # Median value
        'floorCount': 5,         # Median value
        'squareMeters': 50,      # Typical apartment size
        'rooms': 2,              # Typical number of rooms
        'centreDistance': 5.0,   # Typical distance from center
        'poiCount': 10           # Average number of POIs
    }
    
    for col, default_value in numeric_columns.items():
        if col in processed and pd.isna(processed[col]):
            processed[col] = default_value
    
    # 3. Convert categorical features to numerical
    # Building type
    if 'type' in processed:
        type_mapping = {
            'blockOfFlats': 0,
            'tenement': 1,
            'apartmentBuilding': 2
        }
        if not pd.isna(processed['type']):
            processed['type_numeric'] = type_mapping.get(processed['type'], 3)
        else:
            processed['type_numeric'] = 0
        
        processed = processed.drop('type')
    
    # Apartment condition
    if 'condition' in processed:
        condition_mapping = {
            'very good': 4,
            'good': 3,
            'average': 2,
            'poor': 1,
            'to renovation': 0
        }
        if not pd.isna(processed['condition']):
            processed['condition_numeric'] = condition_mapping.get(processed['condition'], 2)
        else:
            processed['condition_numeric'] = 2
        
        processed = processed.drop('condition')
    
    # City (if present)
    if 'city' in processed:
        city_mapping = {
            'warszawa': 0,
            'krakow': 1,
            'wroclaw': 2,
            'gdansk': 3,
            'lodz': 4,
            'poznan': 5
        }
        if not pd.isna(processed['city']):
            processed['city_numeric'] = city_mapping.get(processed['city'].lower(), 6)
        else:
            processed['city_numeric'] = 0
        
        processed = processed.drop('city')
    
    # 4. Create new features
    # Floor ratio to total floors
    if 'floor' in processed and 'floorCount' in processed and processed['floorCount'] > 0:
        processed['floor_ratio'] = round(processed['floor'] / processed['floorCount'], 3)
    else:
        processed['floor_ratio'] = 0.5
    
    # Price per square meter
    if 'price' in processed and 'squareMeters' in processed and processed['squareMeters'] > 0:
        processed['price_per_m2'] = round(processed['price'] / processed['squareMeters'], 2)
    
    # Combined comfort score
    comfort_features = ['hasParkingSpace', 'hasBalcony', 'hasElevator', 'hasSecurity', 'hasStorageRoom']
    comfort_score = 0
    for feature in comfort_features:
        if feature in processed and processed[feature]:
            comfort_score += 1
    processed['comfort_score'] = comfort_score
    
    # 5. Remove rarely used or uninformative columns
    columns_to_drop = [
        'buildYear', 'buildingMaterial', 'ownership', 
        'schoolDistance', 'clinicDistance', 'kindergartenDistance', 
        'restaurantDistance', 'collegeDistance', 'pharmacyDistance', 'postOfficeDistance',
        'id'
    ]
    
    for col in columns_to_drop:
        if col in processed:
            processed = processed.drop(col)
    
    # 6. Normalize numerical features (if required)
    if normalize:
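        # Define numerical columns (excluding booleans and the price target variable).
        # np.integer/np.floating are included because values taken from a
        # pandas Series are often NumPy scalars, and np.int64 is not a Python int.
        numeric_cols = [col for col in processed.index
                        if isinstance(processed[col], (int, float, np.integer, np.floating))
                        and not isinstance(processed[col], (bool, np.bool_))
                        and col != 'price'
                        and col not in binary_columns]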
        
        # Normalization using predefined ranges
        normalization_ranges = {
            'squareMeters': (20, 200),
            'rooms': (1, 6),
            'floor': (0, 20),
            'floorCount': (1, 30),
            'centreDistance': (0, 20),
            'poiCount': (0, 50),
            'type_numeric': (0, 3),
            'condition_numeric': (0, 4),
            'city_numeric': (0, 6),
            'floor_ratio': (0, 1),
            'price_per_m2': (20, 500),
            'comfort_score': (0, 5)
        }
        
        for col in numeric_cols:
            if col in normalization_ranges:
                min_val, max_val = normalization_ranges[col]
                # Limit the value to the range and normalize
                val = max(min(processed[col], max_val), min_val)
                processed[col] = (val - min_val) / (max_val - min_val)
    
    return processed
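
Step 6 replaces data-driven scaling with fixed bounds, so a single row can be normalized without seeing the rest of the dataset. The core operation, clamping and then scaling, can be sketched on its own. The helper name `clamp_and_scale` is hypothetical; the `(20, 200)` bounds are the `squareMeters` range from the function above.

```python
def clamp_and_scale(value, min_val, max_val):
    """Clamp value to [min_val, max_val], then map it linearly to [0, 1]."""
    clamped = max(min(value, max_val), min_val)
    return (clamped - min_val) / (max_val - min_val)


print(clamp_and_scale(110, 20, 200))  # 0.5
print(clamp_and_scale(250, 20, 200))  # 1.0 (above the range, clamped)
print(clamp_and_scale(10, 20, 200))   # 0.0 (below the range, clamped)
```

Values outside a predefined range all collapse to 0 or 1, so the bounds in `normalization_ranges` should be checked against the output of `df.describe()` before relying on them.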

## Testing

To test the `process_row_final` function, refer to the separate notebook `test_processing.ipynb`. That notebook includes:

1. Complete testing environment with all necessary imports
2. Sample dataset with representative data
3. Tests with complete and incomplete data
4. Visualization of processing results
5. Examples of integration with a streaming pipeline using RabbitMQ

Using a separate test notebook keeps this development notebook focused while providing a clean environment for testing the functions independently.