# Airbnb NYC 2019 Data Preprocessing

This notebook focuses on preparing the data for machine learning models. The main objectives are:

1. **Data Cleaning**:
   - Handle missing values and outliers.
   - Correct data types and inconsistencies.

2. **Feature Engineering**:
   - Create new features from existing ones.
   - Encode categorical variables.
   - Scale numerical features.

3. **Data Splitting**:
   - Split the dataset into training and testing sets.

## 1. Import libraries and load data

In [1]:
import sys
import os
import numpy as np

sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))
from data_processing import read_csv, write_csv

In [2]:
data_numpy = read_csv('../data/raw/AB_NYC_2019.csv')

column_names = data_numpy[0]
data = data_numpy[1:]

print("Data loaded successfully")
print(f"Initial shape: {data.shape}")
print(f"Initial number of columns: {len(column_names)}")

Data loaded successfully
Initial shape: (48895, 16)
Initial number of columns: 16


## 2. Check initial data

In [3]:
print("Initial column list:")
for i, col in enumerate(column_names):
    print(f"  {i+1}. {col}")
    
print(f"\n{'='*50}")
print("Missing values in initial data:")
for i, col in enumerate(column_names):
    col_data = data[:, i]
    missing_count = np.sum(col_data == '')
    
    if missing_count > 0:
        missing_percent = (missing_count / len(data) * 100)
        print(f"\n{col}:")
        print(f"  - Count: {missing_count:,}")
        print(f"  - Percentage: {missing_percent:.3f}%")

Initial column list:
  1. id
  2. name
  3. host_id
  4. host_name
  5. neighbourhood_group
  6. neighbourhood
  7. latitude
  8. longitude
  9. room_type
  10. price
  11. minimum_nights
  12. number_of_reviews
  13. last_review
  14. reviews_per_month
  15. calculated_host_listings_count
  16. availability_365

Missing values in initial data:

name:
  - Count: 16
  - Percentage: 0.033%

host_name:
  - Count: 21
  - Percentage: 0.043%

last_review:
  - Count: 10,052
  - Percentage: 20.558%

reviews_per_month:
  - Count: 10,052
  - Percentage: 20.558%


## 3. Remove unnecessary columns

**Columns to drop:**
- `id`: A unique row identifier with no predictive value
- `name`: Free-text listing title, not used as a feature here
- `host_name`:
  - Has 21 missing values
  - `host_id` already identifies the host and has no missing values
- `last_review`:
  - Has 10,052 missing values (20.558%)
  - Less informative than review counts, which `number_of_reviews` and `reviews_per_month` already provide

In [4]:
columns_to_drop = ['id', 'name', 'host_name', 'last_review']

is_drop_col = np.isin(column_names, columns_to_drop)

keep_mask = ~is_drop_col

data_cleaned = data[:, keep_mask]
column_names_cleaned = column_names[keep_mask]

print(f"Shape before drop: {data.shape}")
print(f"Shape after drop: {data_cleaned.shape}")
print(f"\nDropped columns: {columns_to_drop}")

print(f"\nRemaining columns ({len(column_names_cleaned)}):")
print("\n".join([f"{i+1}. {col}" for i, col in enumerate(column_names_cleaned)]))

Shape before drop: (48895, 16)
Shape after drop: (48895, 12)

Dropped columns: ['id', 'name', 'host_name', 'last_review']

Remaining columns (12):
1. host_id
2. neighbourhood_group
3. neighbourhood
4. latitude
5. longitude
6. room_type
7. price
8. minimum_nights
9. number_of_reviews
10. reviews_per_month
11. calculated_host_listings_count
12. availability_365


## 4. Handle missing values in reviews_per_month

**Reason for assigning 0.00:**
- Column `reviews_per_month` has 10,052 missing values
- When cross-checking with `number_of_reviews`, listings with missing `reviews_per_month` all have `number_of_reviews = 0`
- Logic: If there are no reviews, then average reviews/month = 0.00
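
The cross-check described above can be sketched as follows. This is a minimal illustration on made-up toy values, not the notebook's actual data; the names `number_of_reviews_col` and `reviews_per_month_col` are assumptions introduced only for this example.

```python
import numpy as np

# Toy string columns standing in for the raw CSV columns (values are made up).
number_of_reviews_col = np.array(['9', '0', '45', '0'])
reviews_per_month_col = np.array(['0.21', '', '0.38', ''])

# Rows where reviews_per_month is missing.
missing = reviews_per_month_col == ''

# Rows where the listing has no reviews at all.
no_reviews = number_of_reviews_col.astype(float) == 0

# True only if every row with a missing reviews_per_month has zero reviews.
all_missing_have_no_reviews = bool(np.all(no_reviews[missing]))
print(all_missing_have_no_reviews)
```

If the check holds on the real columns, filling the gaps with 0.00 is consistent with the data rather than an arbitrary imputation.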

In [5]:
reviews_per_month_idx = np.where(column_names_cleaned == 'reviews_per_month')[0][0]

mask_missing = (data_cleaned[:, reviews_per_month_idx] == '')

missing_before = np.sum(mask_missing)
print(f"Missing values before processing: {missing_before:,}")

data_cleaned[mask_missing, reviews_per_month_idx] = '0.00'
missing_after = np.sum(data_cleaned[:, reviews_per_month_idx] == '')

print(f"Missing values after processing: {missing_after:,}")
print(f"\nReplaced {missing_before:,} missing values with 0.00")

Missing values before processing: 10,052
Missing values after processing: 0

Replaced 10,052 missing values with 0.00


## 5. Convert data to appropriate types

**Purpose:** Ensure consistent data types for computation
- String columns (`neighbourhood_group`, `neighbourhood`, `room_type`): kept as string arrays
- Numeric columns: `astype(float)`, which is 64-bit floating point (`np.float64`)

**Technique:** Use NumPy's `ndarray.astype()` to convert each column slice in a single vectorized call.
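
As a small sketch of the conversion step, the example below converts a string array with NumPy's `ndarray.astype()` and shows why the empty strings had to be filled in first. The sample values are illustrative only.

```python
import numpy as np

# Illustrative string values, as they would appear after reading a CSV as text.
raw = np.array(['40.64749', '-73.97237', '149'])
as_float = raw.astype(np.float64)
print(as_float.dtype)  # float64

# An empty string cannot be parsed as a number, so the conversion fails.
try:
    np.array(['']).astype(np.float64)
    empty_string_converts = True
except ValueError:
    empty_string_converts = False

print(empty_string_converts)
```

This is why the missing `reviews_per_month` values are replaced before the type conversion rather than after it.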

In [6]:
neighbourhood_group_idx = np.where(column_names_cleaned == 'neighbourhood_group')[0][0]
neighbourhood_idx = np.where(column_names_cleaned == 'neighbourhood')[0][0]
room_type_idx = np.where(column_names_cleaned == 'room_type')[0][0]
latitude_idx = np.where(column_names_cleaned == 'latitude')[0][0]
longitude_idx = np.where(column_names_cleaned == 'longitude')[0][0]
price_idx = np.where(column_names_cleaned == 'price')[0][0]
min_nights_idx = np.where(column_names_cleaned == 'minimum_nights')[0][0]
num_reviews_idx = np.where(column_names_cleaned == 'number_of_reviews')[0][0]
reviews_per_month_idx = np.where(column_names_cleaned == 'reviews_per_month')[0][0]
calc_host_idx = np.where(column_names_cleaned == 'calculated_host_listings_count')[0][0]
availability_idx = np.where(column_names_cleaned == 'availability_365')[0][0]

neighbourhood_groups = data_cleaned[:, neighbourhood_group_idx]
neighbourhoods = data_cleaned[:, neighbourhood_idx]
room_types = data_cleaned[:, room_type_idx]

latitudes = data_cleaned[:, latitude_idx].astype(float)
longitudes = data_cleaned[:, longitude_idx].astype(float)
prices = data_cleaned[:, price_idx].astype(float)
min_nights = data_cleaned[:, min_nights_idx].astype(float)
number_of_reviews = data_cleaned[:, num_reviews_idx].astype(float)
calc_host_listings = data_cleaned[:, calc_host_idx].astype(float)
reviews_per_month = data_cleaned[:, reviews_per_month_idx].astype(float)
availability = data_cleaned[:, availability_idx].astype(float)

print("\nData processing completed!")


Data processing completed!


## 6. Filter noisy data

**Purpose:** Remove records with invalid values to ensure data quality
- Focus on `price` column

**Technique:** Use boolean masking to filter out invalid rows (price <= 0)
- No loops needed
- The same mask is applied to every column array, so rows stay aligned
- Boolean indexing returns copies, not views, so the original arrays are left unchanged
- Clean and readable syntax
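
A short check of how boolean masking behaves with respect to memory, on a made-up array (`prices_demo` is a name used only here): the masked result is a new array, whereas a basic slice shares memory with its source.

```python
import numpy as np

prices_demo = np.array([149.0, 0.0, 225.0, 89.0])
mask = prices_demo > 0

# Boolean indexing allocates a new array holding only the selected rows.
filtered = prices_demo[mask]
mask_shares_memory = np.shares_memory(filtered, prices_demo)

# Basic slicing, by contrast, returns a view onto the same buffer.
sliced = prices_demo[1:]
slice_shares_memory = np.shares_memory(sliced, prices_demo)

print(filtered.size, mask_shares_memory, slice_shares_memory)
```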

In [7]:
valid_mask = prices > 0 
print(f"Rows before filtering: {len(prices):,}")
print(f"Rows with price <= 0: {np.sum(~valid_mask):,}")
print(f"Rows after filtering: {np.sum(valid_mask):,}")

prices = prices[valid_mask]
min_nights = min_nights[valid_mask]
latitudes = latitudes[valid_mask]
longitudes = longitudes[valid_mask]
neighbourhood_groups = neighbourhood_groups[valid_mask]
neighbourhoods = neighbourhoods[valid_mask]
room_types = room_types[valid_mask]
number_of_reviews = number_of_reviews[valid_mask]
reviews_per_month = reviews_per_month[valid_mask]
calc_host_listings = calc_host_listings[valid_mask]
availability = availability[valid_mask]

print(f"\nFiltering successful, {len(prices):,} valid rows remaining")

Rows before filtering: 48,895
Rows with price <= 0: 11
Rows after filtering: 48,884

Filtering successful, 48,884 valid rows remaining


## 7. One-hot encoding for neighbourhood_group and room_type columns

**Purpose:** Convert categorical variables `neighbourhood_group` and `room_type` into numeric form for use in machine learning models
- Create separate binary columns for each value in `neighbourhood_group` and `room_type`

**Technique:** One-hot encoding using broadcasting to convert `neighbourhood_group` and `room_type` columns into binary columns
- Broadcasting: `(N,1) == (C,)` → `(N,C)` 
- Cost: O(N×C) comparisons, but vectorized, so typically much faster than a Python loop
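
To make the broadcasting shapes concrete, here is a tiny worked example on four made-up labels. Note that the column order follows the sorted order returned by `np.unique`.

```python
import numpy as np

rooms = np.array(['Private room', 'Entire home/apt', 'Shared room', 'Private room'])

# Sorted unique labels, shape (3,).
classes = np.unique(rooms)

# rooms[:, None] has shape (4, 1); comparing with shape (3,) broadcasts to (4, 3).
one_hot = (rooms[:, None] == classes).astype(int)

print(classes)
print(one_hot)
```

Each row contains exactly one 1, in the column of its class.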

In [8]:
def fast_one_hot(arr):
    """
    One-hot encoding using Broadcasting
    arr: (N,) -> output: (N, C) where C is number of unique classes
    """
    classes = np.unique(arr)
    # Broadcasting: arr[:, None] shape (N,1), classes shape (C,)
    # Comparison result: (N,1) == (C,) -> (N,C)
    return (arr[:, None] == classes).astype(int)

# Apply one-hot encoding
oh_neighbourhood_group = fast_one_hot(neighbourhood_groups)
oh_room_type = fast_one_hot(room_types)

print("One-Hot Encoding Results:")
print(f"- neighbourhood_group: {oh_neighbourhood_group.shape}")
print(f"  Classes: {np.unique(neighbourhood_groups)}")
print(f"\n- room_type: {oh_room_type.shape}")
print(f"  Classes: {np.unique(room_types)}")

# Display sample
print(f"\nSample one-hot encoding for room_type (first 5 rows):")
print(oh_room_type[:5])

One-Hot Encoding Results:
- neighbourhood_group: (48884, 5)
  Classes: ['Bronx' 'Brooklyn' 'Manhattan' 'Queens' 'Staten Island']

- room_type: (48884, 3)
  Classes: ['Entire home/apt' 'Private room' 'Shared room']

Sample one-hot encoding for room_type (first 5 rows):
[[0 1 0]
 [1 0 0]
 [0 1 0]
 [1 0 0]
 [1 0 0]]


## 8. Splitting the train and test sets

**Purpose:** Split the data into train (80%) and test (20%) sets before performing Feature Engineering
- Avoid data leakage when applying target encoding and scaling
- Ensure the test set is completely independent from the train set

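**Technique:** Random sampling with `np.random.permutation`
- Shuffle the row indices with a fixed seed for reproducibility
- Take the first 80% of the shuffled indices for train and the remaining 20% for test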
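
Because the same index split has to be applied to many arrays, one option is to keep them in a dictionary and split them in one pass. The sketch below is an alternative layout, not the notebook's code: `split_by_indices` is a hypothetical helper, the arrays are toy data, and it uses `np.random.default_rng` instead of the global seed.

```python
import numpy as np

def split_by_indices(arrays, train_idx, test_idx):
    """Hypothetical helper: apply one index split to every array in a dict."""
    train = {name: arr[train_idx] for name, arr in arrays.items()}
    test = {name: arr[test_idx] for name, arr in arrays.items()}
    return train, test

# Toy data: 10 rows, two aligned columns.
n = 10
arrays = {
    'x': np.arange(n) * 10.0,
    'label': np.array(list('abcdefghij')),
}

# Shuffle row indices once, then cut at 80%.
rng = np.random.default_rng(42)
shuffled = rng.permutation(n)
train_size = int(0.8 * n)
train_idx, test_idx = shuffled[:train_size], shuffled[train_size:]

train, test = split_by_indices(arrays, train_idx, test_idx)
print(len(train['x']), len(test['x']))
```

Using a single pair of index arrays for every column keeps rows aligned across features, which is the property the split relies on.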

In [9]:
# Set random seed to ensure reproducibility
np.random.seed(42)

# Total number of samples
n_samples = len(prices)

# Create random indices
shuffled_indices = np.random.permutation(n_samples)

# Calculate train size (80%)
train_size = int(0.8 * n_samples)

# Split indices
train_indices = shuffled_indices[:train_size]
test_indices = shuffled_indices[train_size:]

# Split arrays into train and test
# Numerical features
prices_train, prices_test = prices[train_indices], prices[test_indices]
latitudes_train, latitudes_test = latitudes[train_indices], latitudes[test_indices]
longitudes_train, longitudes_test = longitudes[train_indices], longitudes[test_indices]
min_nights_train, min_nights_test = min_nights[train_indices], min_nights[test_indices]
number_of_reviews_train, number_of_reviews_test = number_of_reviews[train_indices], number_of_reviews[test_indices]
reviews_per_month_train, reviews_per_month_test = reviews_per_month[train_indices], reviews_per_month[test_indices]
calc_host_listings_train, calc_host_listings_test = calc_host_listings[train_indices], calc_host_listings[test_indices]
availability_train, availability_test = availability[train_indices], availability[test_indices]

# Categorical features
neighbourhood_groups_train, neighbourhood_groups_test = neighbourhood_groups[train_indices], neighbourhood_groups[test_indices]
neighbourhoods_train, neighbourhoods_test = neighbourhoods[train_indices], neighbourhoods[test_indices]
room_types_train, room_types_test = room_types[train_indices], room_types[test_indices]

# One-hot encoded features (created in step 7)
oh_neighbourhood_group_train, oh_neighbourhood_group_test = oh_neighbourhood_group[train_indices], oh_neighbourhood_group[test_indices]
oh_room_type_train, oh_room_type_test = oh_room_type[train_indices], oh_room_type[test_indices]

print("Split data into train and test:")
print(f"- Total samples: {n_samples:,}")
print(f"- Train set: {len(train_indices):,} samples ({len(train_indices)/n_samples*100:.0f}%)")
print(f"- Test set: {len(test_indices):,} samples ({len(test_indices)/n_samples*100:.0f}%)")
print(f"\nShape check:")
print(f"- prices_train: {prices_train.shape}")
print(f"- prices_test: {prices_test.shape}")
print(f"- neighbourhoods_train: {neighbourhoods_train.shape}")
print(f"- neighbourhoods_test: {neighbourhoods_test.shape}")

Split data into train and test:
- Total samples: 48,884
- Train set: 39,107 samples (80%)
- Test set: 9,777 samples (20%)

Shape check:
- prices_train: (39107,)
- prices_test: (9777,)
- neighbourhoods_train: (39107,)
- neighbourhoods_test: (9777,)


## 9. Feature Engineering - Create new features

**Purpose:** Create new variables from original data to improve machine learning model performance

### 9.1 Calculate distance to Times Square

**Purpose:** Calculate distance from each listing to Times Square to use as a feature in machine learning models
- Times Square coordinates: (40.7580, -73.9855)

**Technique:** Use `np.einsum` to compute row-wise squared norms
- Subtract the Times Square coordinates from each (latitude, longitude) pair
- `einsum('ij,ij->i')`: Element-wise multiplication, then sum along axis 1
- Equivalent to `np.sum(diff**2, axis=1)`, but without allocating the intermediate squared array
- The result is a Euclidean distance in degrees, which serves as a proxy rather than a physical distance
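
As a quick sanity check of what the `einsum` expression computes, the sketch below compares it with two more familiar formulations on three made-up coordinates. All three should agree up to floating-point rounding.

```python
import numpy as np

center = np.array([40.7580, -73.9855])

# Three illustrative points: the center itself, and two points 0.01 degrees away.
coords = np.array([
    [40.7580, -73.9855],
    [40.7480, -73.9855],
    [40.7580, -73.9755],
])
diff = coords - center

# Row-wise dot product of diff with itself, then square root.
dist_einsum = np.sqrt(np.einsum('ij,ij->i', diff, diff))

# Same quantity via an explicit square-and-sum, and via the vector norm.
dist_sum = np.sqrt(np.sum(diff ** 2, axis=1))
dist_norm = np.linalg.norm(diff, axis=1)

print(dist_einsum)
```

The values are in degrees. At New York's latitude a degree of longitude covers less ground than a degree of latitude, so equal values here do not correspond to equal distances on the ground.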

In [10]:
# Times Square coordinates
center = np.array([40.7580, -73.9855])

# Calculate distance for train set
coords_train = np.column_stack((latitudes_train, longitudes_train))
diff_train = coords_train - center
dist_sq_train = np.einsum('ij,ij->i', diff_train, diff_train)
dist_to_center_train = np.sqrt(dist_sq_train)

# Calculate distance for test set
coords_test = np.column_stack((latitudes_test, longitudes_test))
diff_test = coords_test - center
dist_sq_test = np.einsum('ij,ij->i', diff_test, diff_test)
dist_to_center_test = np.sqrt(dist_sq_test)

print("Distance to Times Square:")
print(f"\nTrain set:")
print(f"  - Min: {dist_to_center_train.min():.6f} degrees")
print(f"  - Max: {dist_to_center_train.max():.6f} degrees")
print(f"  - Mean: {dist_to_center_train.mean():.6f} degrees")
print(f"\nTest set:")
print(f"  - Min: {dist_to_center_test.min():.6f} degrees")
print(f"  - Max: {dist_to_center_test.max():.6f} degrees")
print(f"  - Mean: {dist_to_center_test.mean():.6f} degrees")

Distance to Times Square:

Train set:
  - Min: 0.000677 degrees
  - Max: 0.359430 degrees
  - Mean: 0.070638 degrees

Test set:
  - Min: 0.000744 degrees
  - Max: 0.363140 degrees
  - Mean: 0.069868 degrees


### 9.2 Binning for minimum_nights

**Purpose:** 
- Remove the influence of extreme outliers (e.g., 1250 nights) which can skew distance-based algorithms.
- Categorize minimum_nights into meaningful groups
    - Bin 0: ≤3 nights (short-term travelers)
    - Bin 1: 4-7 nights (weekly rental)
    - Bin 2: 8-31 nights (monthly rental)
    - Bin 3: >31 nights (long-term)

**Technique:** 
- `np.digitize(arr, bins, right=True)`: Categorize values into bins
- `np.eye(num_categories)[indices]`: Create one-hot encoding from indices
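
A small worked example of the binning, using the same bin edges and a few made-up values chosen to sit on both sides of each boundary. With `right=True`, a value equal to an edge falls into the lower bin.

```python
import numpy as np

bins = np.array([3, 7, 31])
nights = np.array([1, 3, 4, 7, 8, 31, 32, 1250])

# right=True means bins[i-1] < x <= bins[i], so 3 -> bin 0, 7 -> bin 1, 31 -> bin 2.
idx = np.digitize(nights, bins, right=True)
print(idx)  # [0 0 1 1 2 2 3 3]

# Row k of the identity matrix is the one-hot vector for bin k.
one_hot = np.eye(4)[idx]
print(one_hot.shape)  # (8, 4)
```

The extreme value 1250 lands in the same bin as 32, which is how binning removes the influence of outliers.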

In [11]:
# Define bin boundaries
bins = np.array([3, 7, 31])

# Binning for train set
min_nights_binned_indices_train = np.digitize(min_nights_train, bins, right=True)
num_categories = 4
oh_min_nights_train = np.eye(num_categories)[min_nights_binned_indices_train]

# Binning for test set
min_nights_binned_indices_test = np.digitize(min_nights_test, bins, right=True)
oh_min_nights_test = np.eye(num_categories)[min_nights_binned_indices_test]

print("Binning minimum_nights:")
print(f"\nTrain set - oh_min_nights shape: {oh_min_nights_train.shape}")
print("Distribution:")
for i in range(num_categories):
    count = np.sum(min_nights_binned_indices_train == i)
    percentage = (count / len(min_nights_binned_indices_train)) * 100
    if i == 0:
        label = "≤3 nights (short-term)"
    elif i == 1:
        label = "4-7 nights (weekly)"
    elif i == 2:
        label = "8-31 nights (monthly)"
    else:
        label = ">31 nights (long-term)"
    print(f"  Bin {i} ({label}): {count} listings ({percentage:.3f}%)")

print(f"\nTest set - oh_min_nights shape: {oh_min_nights_test.shape}")
print("Distribution:")
for i in range(num_categories):
    count = np.sum(min_nights_binned_indices_test == i)
    percentage = (count / len(min_nights_binned_indices_test)) * 100
    if i == 0:
        label = "≤3 nights (short-term)"
    elif i == 1:
        label = "4-7 nights (weekly)"
    elif i == 2:
        label = "8-31 nights (monthly)"
    else:
        label = ">31 nights (long-term)"
    print(f"  Bin {i} ({label}): {count} listings ({percentage:.3f}%)")

Binning minimum_nights:

Train set - oh_min_nights shape: (39107, 4)
Distribution:
  Bin 0 (≤3 nights (short-term)): 25960 listings (66.382%)
  Bin 1 (4-7 nights (weekly)): 7334 listings (18.754%)
  Bin 2 (8-31 nights (monthly)): 5374 listings (13.742%)
  Bin 3 (>31 nights (long-term)): 439 listings (1.123%)

Test set - oh_min_nights shape: (9777, 4)
Distribution:
  Bin 0 (≤3 nights (short-term)): 6448 listings (65.951%)
  Bin 1 (4-7 nights (weekly)): 1811 listings (18.523%)
  Bin 2 (8-31 nights (monthly)): 1411 listings (14.432%)
  Bin 3 (>31 nights (long-term)): 107 listings (1.094%)


### 9.3 Log-transform of calculated_host_listings_count

**Purpose:**
- Handle long-tail distribution: Most hosts have only one or a few listings, but a few have hundreds (up to 327).
- Narrow the value range so the model isn't dominated by very large values, while preserving rank order.

**Technique:**
- Log Transformation (log1p): Use function $y = \ln(x + 1)$.
- `log1p` is defined at x = 0 (it returns 0) and is more accurate than computing `log(1 + x)` directly for small x
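
The two properties of `log1p` mentioned above can be illustrated with a few values. The inputs are chosen for illustration; 1 and 327 match the range of listing counts reported in this notebook.

```python
import numpy as np

# Range compression: counts from 1 to 327 map to roughly 0.693 to 5.793.
counts = np.array([1.0, 327.0])
compressed = np.log1p(counts)
print(np.round(compressed, 3))  # [0.693 5.793]

# log1p is defined at zero.
at_zero = np.log1p(0.0)

# Precision for tiny inputs: 1.0 + 1e-20 rounds to exactly 1.0 in float64,
# so the naive form loses the value entirely, while log1p keeps it.
tiny = 1e-20
naive = np.log(1.0 + tiny)
accurate = np.log1p(tiny)
print(at_zero, naive, accurate)
```

Because the logarithm is monotonic, the ordering of hosts by listing count is unchanged by the transform.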

In [12]:
# Log-transform for train set
calc_host_listings_log_train = np.log1p(calc_host_listings_train)

# Log-transform for test set
calc_host_listings_log_test = np.log1p(calc_host_listings_test)

print("Log-Transform Results for calc_host_listings:")
print(f"\nTrain set:")
print(f"  - Original range: [{calc_host_listings_train.min()}, {calc_host_listings_train.max()}]")
print(f"  - Log-Transformed range: [{calc_host_listings_log_train.min():.3f}, {calc_host_listings_log_train.max():.3f}]")
print(f"\nTest set:")
print(f"  - Original range: [{calc_host_listings_test.min()}, {calc_host_listings_test.max()}]")
print(f"  - Log-Transformed range: [{calc_host_listings_log_test.min():.3f}, {calc_host_listings_log_test.max():.3f}]")

Log-Transform Results for calc_host_listings:

Train set:
  - Original range: [1.0, 327.0]
  - Log-Transformed range: [0.693, 5.793]

Test set:
  - Original range: [1.0, 327.0]
  - Log-Transformed range: [0.693, 5.793]


### 9.4 Log-transform of number_of_reviews

**Purpose:**
- Handle long-tail distribution: Most listings have few or no reviews, but a few have hundreds (up to 629).
- Narrow the value range so the model isn't dominated by very large values, while preserving rank order.

**Technique:**
- Log Transformation (log1p): Use function $y = \ln(x + 1)$.
- `log1p` is defined at x = 0 (it returns 0), which matters here because many listings have zero reviews

In [13]:
# Log-transform for train set
number_of_reviews_log_train = np.log1p(number_of_reviews_train)

# Log-transform for test set
number_of_reviews_log_test = np.log1p(number_of_reviews_test)

print("number_of_reviews transformation:")
print(f"\nTrain set:")
print(f"  - Original range: [{number_of_reviews_train.min():.2f}, {number_of_reviews_train.max():.2f}]")
print(f"  - Log range: [{number_of_reviews_log_train.min():.3f}, {number_of_reviews_log_train.max():.3f}]")
print(f"\nTest set:")
print(f"  - Original range: [{number_of_reviews_test.min():.2f}, {number_of_reviews_test.max():.2f}]")
print(f"  - Log range: [{number_of_reviews_log_test.min():.3f}, {number_of_reviews_log_test.max():.3f}]")

number_of_reviews transformation:

Train set:
  - Original range: [0.00, 629.00]
  - Log range: [0.000, 6.446]

Test set:
  - Original range: [0.00, 607.00]
  - Log range: [0.000, 6.410]


### 9.5 Log-transform of price

**Purpose:** Reduce the skewness of the price distribution and improve machine learning model performance. The resulting `price_log` is the target variable in the final matrix.

**Technique:**
- Log Transformation (log1p): Use function $y = \ln(x + 1)$.
- `log1p` is defined at x = 0 and is more accurate than computing `log(1 + x)` directly for small x
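
Since the model will be trained on the log-transformed price, predictions have to be mapped back to dollars at some point. A minimal sketch, assuming the inverse is applied with `np.expm1` (the exact inverse of `log1p`); the prices below are illustrative and include the minimum and maximum reported in this notebook.

```python
import numpy as np

prices_demo = np.array([10.0, 149.0, 10000.0])

# Forward transform: y = ln(x + 1).
price_log = np.log1p(prices_demo)
print(np.round(price_log, 3))

# Inverse transform: x = exp(y) - 1 recovers the original prices.
recovered = np.expm1(price_log)
print(recovered)
```

Errors measured on the log scale are relative rather than absolute, so a model evaluated on `price_log` is not being scored in dollars.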

In [14]:
# Log-transform for train set
price_log_train = np.log1p(prices_train)

# Log-transform for test set
price_log_test = np.log1p(prices_test)

print("Feature - Log Price:")
print(f"\nTrain set:")
print(f"  - Original price range: [${prices_train.min():.2f}, ${prices_train.max():.2f}]")
print(f"  - Log price range: [{price_log_train.min():.3f}, {price_log_train.max():.3f}]")
print(f"\nTest set:")
print(f"  - Original price range: [${prices_test.min():.2f}, ${prices_test.max():.2f}]")
print(f"  - Log price range: [{price_log_test.min():.3f}, {price_log_test.max():.3f}]")

Feature - Log Price:

Train set:
  - Original price range: [$10.00, $10000.00]
  - Log price range: [2.398, 9.210]

Test set:
  - Original price range: [$10.00, $10000.00]
  - Log price range: [2.398, 9.210]


## 10. Target Encoding with Smoothing for neighbourhood

**Purpose:** Convert categorical variable `neighbourhood` into numeric values based on its relationship with the target (price)
- Target encoding preserves information about the relationship between neighbourhood and price
- Smoothing helps avoid overfitting for neighbourhoods with few samples
- The means are computed on the raw `price`, not on `price_log`

**Technique:** Target Encoding with Smoothing
- Formula: `encoded_value = (count * mean + smoothing * global_mean) / (count + smoothing)`
- `smoothing`: hyperparameter that controls how strongly small categories are pulled toward the global mean (50 is used here)
- Fit on train set, transform both train and test to avoid data leakage
- Neighbourhoods that appear only in the test set fall back to the global mean
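
The smoothing formula can also be computed without a Python loop over categories. The sketch below is one possible vectorized variant, not the implementation used in this notebook; `smoothed_means` and `encode` are hypothetical helper names, and the data is a toy example small enough to verify by hand.

```python
import numpy as np

def smoothed_means(cats, target, smoothing=50):
    """Sketch: per-category smoothed mean using np.unique and np.bincount."""
    classes, inverse = np.unique(cats, return_inverse=True)
    counts = np.bincount(inverse)                 # rows per category
    sums = np.bincount(inverse, weights=target)   # count * mean per category
    global_mean = target.mean()
    smoothed = (sums + smoothing * global_mean) / (counts + smoothing)
    return classes, smoothed, global_mean

def encode(cats, classes, smoothed, fallback):
    """Sketch: look up each category; unseen categories get the fallback."""
    pos = np.clip(np.searchsorted(classes, cats), 0, len(classes) - 1)
    known = classes[pos] == cats
    return np.where(known, smoothed[pos], fallback)

# Toy train data: global mean = 300.
train_cat = np.array(['A', 'A', 'B'])
train_target = np.array([100.0, 200.0, 600.0])

classes, smoothed, global_mean = smoothed_means(train_cat, train_target, smoothing=2)
# A: (300 + 2 * 300) / (2 + 2) = 225,  B: (600 + 2 * 300) / (1 + 2) = 400
print(classes, smoothed)

# 'Z' never appears in train, so it receives the global mean.
test_encoded = encode(np.array(['B', 'Z', 'A']), classes, smoothed, global_mean)
print(test_encoded)
```

With only one sample, category B is pulled from its raw mean of 600 down to 400, which shows how smoothing limits the influence of sparsely populated categories.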

In [15]:
def target_encode_with_smoothing(train_cat, train_target, test_cat, smoothing=50):
    """
    Target encoding with smoothing
    
    Parameters:
    - train_cat: categorical values from train set
    - train_target: target values from train set (to calculate mean)
    - test_cat: categorical values from test set
    - smoothing: smoothing parameter (default=50)
    
    Returns:
    - train_encoded: target encoded values for train
    - test_encoded: target encoded values for test
    """
    # Calculate global mean from train set
    global_mean = np.mean(train_target)
    
    # Get unique categories from train
    unique_cats = np.unique(train_cat)
    
    # Create dictionary to store encoded values
    encoding_map = {}
    
    for cat in unique_cats:
        # Mask for this category in train
        mask = (train_cat == cat)
        
        # Count and calculate mean
        count = np.sum(mask)
        cat_mean = np.mean(train_target[mask])
        
        # Apply smoothing
        smoothed_value = (count * cat_mean + smoothing * global_mean) / (count + smoothing)
        encoding_map[cat] = smoothed_value
    
    # Encode train set
    train_encoded = np.array([encoding_map.get(cat, global_mean) for cat in train_cat])
    
    # Encode test set (use global_mean for unseen categories)
    test_encoded = np.array([encoding_map.get(cat, global_mean) for cat in test_cat])
    
    return train_encoded, test_encoded

# Apply target encoding with smoothing for neighbourhood
te_neighbourhood_train, te_neighbourhood_test = target_encode_with_smoothing(
    neighbourhoods_train, 
    prices_train,  # Use price as target
    neighbourhoods_test,
    smoothing=50
)

print("Target Encoding Results:")
print(f"- neighbourhood: {len(np.unique(neighbourhoods_train))} unique values in train")
print(f"\nTrain set:")
print(f"  - Encoded range: [{te_neighbourhood_train.min():.3f}, {te_neighbourhood_train.max():.3f}]")
print(f"  - Mean: {te_neighbourhood_train.mean():.3f}")
print(f"\nTest set:")
print(f"  - Encoded range: [{te_neighbourhood_test.min():.3f}, {te_neighbourhood_test.max():.3f}]")
print(f"  - Mean: {te_neighbourhood_test.mean():.3f}")
print(f"\nSample (Original -> Encoded) for train set:")
for i in range(5):
    print(f"  {neighbourhoods_train[i]} -> {te_neighbourhood_train[i]:.3f}")

Target Encoding Results:
- neighbourhood: 219 unique values in train

Train set:
  - Encoded range: [85.537, 386.425]
  - Mean: 153.548

Test set:
  - Encoded range: [85.537, 386.425]
  - Mean: 154.793

Sample (Original -> Encoded) for train set:
  Chelsea -> 242.826
  East Harlem -> 127.520
  Williamsburg -> 144.999
  Washington Heights -> 95.351
  Harlem -> 120.122


## 11. Data normalization

**Purpose:** Normalize features to range [0, 1] to ensure features have the same scale

**Technique:** Min-Max Scaling
- Formula: `(x - min) / (max - min)`
- **Important**: Fit on train set (calculate min, max from train), then transform both train and test
- Avoid data leakage by not using information from test set
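
A small numeric illustration of fitting on train and transforming test, using made-up values. A test value larger than the train maximum scales to more than 1, which is expected and not a bug.

```python
import numpy as np

train = np.array([0.0, 5.0, 10.0])
test = np.array([2.5, 12.0])

# Fit: min and max come from the train set only.
lo, hi = train.min(), train.max()

# Transform both sets with the train statistics.
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)  # spans exactly [0, 1]
print(test_scaled)   # 12.0 exceeds the train max, so it maps above 1
```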

In [16]:
def fit_min_max_scale(train_arr):
    """
    Fit min-max scaler on train set
    Returns: min_val, max_val, denom
    """
    min_val = np.min(train_arr)
    max_val = np.max(train_arr)
    denom = max_val - min_val
    
    if denom == 0:
        denom = 1  # Avoid division by zero
    
    return min_val, max_val, denom

def transform_min_max_scale(arr, min_val, max_val, denom):
    """
    Transform data using min, max from train set
    """
    if denom == 0:
        return np.zeros_like(arr)
    return (arr - min_val) / denom

# Features to scale (the first two are already log-transformed)
train_features_raw = [
    number_of_reviews_log_train,
    calc_host_listings_log_train,
    availability_train, 
    dist_to_center_train,
    reviews_per_month_train,
]

test_features_raw = [
    number_of_reviews_log_test,
    calc_host_listings_log_test,
    availability_test, 
    dist_to_center_test,
    reviews_per_month_test,
]

feature_names = [
    'number_of_reviews_log',
    'calc_host_listings_log',
    'availability_365',
    'dist_to_center',
    'reviews_per_month',
]

# Fit on train and transform both train and test
train_scaled_features = []
test_scaled_features = []
scaling_params = []

for train_f, test_f in zip(train_features_raw, test_features_raw):
    # Fit on train
    min_val, max_val, denom = fit_min_max_scale(train_f)
    scaling_params.append((min_val, max_val, denom))
    
    # Transform both train and test
    train_scaled = transform_min_max_scale(train_f, min_val, max_val, denom)
    test_scaled = transform_min_max_scale(test_f, min_val, max_val, denom)
    
    train_scaled_features.append(train_scaled)
    test_scaled_features.append(test_scaled)

print("Min-Max Scaling Results:")
print(f"Normalized {len(train_scaled_features)} features to range [0, 1]")
print(f"\nCheck range of scaled features:")
print("\nTrain set:")
for name, scaled_f in zip(feature_names, train_scaled_features):
    print(f"  - {name}: [{scaled_f.min():.3f}, {scaled_f.max():.3f}]")

print("\nTest set:")
for name, scaled_f in zip(feature_names, test_scaled_features):
    print(f"  - {name}: [{scaled_f.min():.3f}, {scaled_f.max():.3f}]")
    
print("\nNote: Test set may have values outside [0,1] if outliers not present in train")

Min-Max Scaling Results:
Normalized 5 features to range [0, 1]

Check range of scaled features:

Train set:
  - number_of_reviews_log: [0.000, 1.000]
  - calc_host_listings_log: [0.000, 1.000]
  - availability_365: [0.000, 1.000]
  - dist_to_center: [0.000, 1.000]
  - reviews_per_month: [0.000, 1.000]

Test set:
  - number_of_reviews_log: [0.000, 0.994]
  - calc_host_listings_log: [0.000, 1.000]
  - availability_365: [0.000, 1.000]
  - dist_to_center: [0.000, 1.010]
  - reviews_per_month: [0.000, 0.478]

Note: Test set may have values outside [0,1] if outliers not present in train


## 12. Array Manipulation - Stacking & Combining features

**Purpose:** Combine all processed features into final feature matrix for train and test

**Technique:** Stack all features into final matrix
- `np.hstack()`: Horizontal stacking
- Reshape 1D arrays to (N, 1) before stacking
- Column order: [Target Encoded, One-Hot Encoded, Scaled Numerical Features, Target (`price_log`)]
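
A miniature version of the stacking step, with two rows of made-up values. It also shows how a consumer of the matrix might separate features from the target, assuming the target is stored in the last column.

```python
import numpy as np

# Toy inputs: one target-encoded column, a 2-class one-hot block,
# two scaled 1D features, and the log-price target.
te = np.array([120.0, 240.0])
oh = np.array([[1, 0], [0, 1]])
scaled = [np.array([0.1, 0.9]), np.array([0.5, 0.2])]
target = np.array([4.6, 5.3])

# 1D arrays need shape (N, 1) so that hstack treats them as columns.
matrix = np.hstack(
    [te[:, None], oh] + [f[:, None] for f in scaled] + [target[:, None]]
)
print(matrix.shape)  # (2, 6)

# Split back into features and target.
X, y = matrix[:, :-1], matrix[:, -1]
print(X.shape, y.shape)
```

The integer one-hot block is promoted to float when stacked with float columns, so the final matrix has a single dtype.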

In [17]:
# Reshape 1D arrays to (N, 1) for stacking
train_scaled_reshaped = [f[:, None] for f in train_scaled_features]
test_scaled_reshaped = [f[:, None] for f in test_scaled_features]

# Create final matrix for train set (add price_log at the end)
final_matrix_train = np.hstack(
    [te_neighbourhood_train[:, None]] +  # Target encoded neighbourhood
    [oh_neighbourhood_group_train] +      # One-hot neighbourhood_group
    [oh_room_type_train] +                # One-hot room_type
    [oh_min_nights_train] +               # One-hot min_nights binned
    train_scaled_reshaped +               # Scaled numerical features
    [price_log_train[:, None]]            # Target variable (price_log)
)

# Create final matrix for test set (add price_log at the end)
final_matrix_test = np.hstack(
    [te_neighbourhood_test[:, None]] +   # Target encoded neighbourhood
    [oh_neighbourhood_group_test] +       # One-hot neighbourhood_group
    [oh_room_type_test] +                 # One-hot room_type
    [oh_min_nights_test] +                # One-hot min_nights binned
    test_scaled_reshaped +                # Scaled numerical features
    [price_log_test[:, None]]             # Target variable (price_log)
)

print("Final Feature Matrix:")
print(f"\nTrain set:")
print(f"  - Shape: {final_matrix_train.shape}")
print(f"  - Total features: {final_matrix_train.shape[1]}")

print(f"\nTest set:")
print(f"  - Shape: {final_matrix_test.shape}")
print(f"  - Total features: {final_matrix_test.shape[1]}")

print(f"\nBreakdown:")
print(f"  1. Target encoded neighbourhood: 1 feature")
print(f"  2. One-hot neighbourhood_group: {oh_neighbourhood_group_train.shape[1]} features")
print(f"  3. One-hot room_type: {oh_room_type_train.shape[1]} features")
print(f"  4. One-hot min_nights: {oh_min_nights_train.shape[1]} features")
print(f"  5. Scaled numerical features: {len(train_scaled_features)} features")
print(f"  6. Target variable (price_log): 1 feature")
print(f"\nData ready for Machine Learning!")

Final Feature Matrix:

Train set:
  - Shape: (39107, 19)
  - Total features: 19

Test set:
  - Shape: (9777, 19)
  - Total features: 19

Breakdown:
  1. Target encoded neighbourhood: 1 feature
  2. One-hot neighbourhood_group: 5 features
  3. One-hot room_type: 3 features
  4. One-hot min_nights: 4 features
  5. Scaled numerical features: 5 features
  6. Target variable (price_log): 1 feature

Data ready for Machine Learning!


## 13. Save feature data as CSV

**Purpose:** Save processed feature matrix as CSV separately for train and test sets
- `train_features.csv`: Train data for model training
- `test_features.csv`: Test data for model evaluation
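
For readers without the project's `write_csv` helper, a similar file can be produced with NumPy alone. This is a sketch using `np.savetxt` and `np.loadtxt` on an in-memory buffer and a toy matrix; it is an alternative, not the helper's actual implementation, and the column names are a small subset chosen for illustration.

```python
import io
import numpy as np

matrix = np.array([[1.0, 0.0, 4.6],
                   [0.0, 1.0, 5.3]])
names = ['ng_Bronx', 'ng_Brooklyn', 'price_log']

# Write a header row followed by the numeric rows.
# comments='' prevents savetxt from prefixing the header with '# '.
buffer = io.StringIO()
np.savetxt(buffer, matrix, delimiter=',', header=','.join(names), comments='')

# Read it back: first line is the header, the rest is numeric data.
buffer.seek(0)
header = buffer.readline().strip().split(',')
loaded = np.loadtxt(buffer, delimiter=',')

print(header)
print(loaded.shape)
```

Replacing the buffer with a file path writes to disk instead. A round trip like this is a cheap way to confirm that the column names and values survive saving.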

In [18]:
# Create column names for final_matrix
feature_column_names = ['neighbourhood_target_encoded']

# Add column names for one-hot neighbourhood_group
ng_classes = np.unique(neighbourhood_groups_train)
feature_column_names.extend([f'ng_{cls}' for cls in ng_classes])


# Add column names for one-hot room_type
room_type_classes = np.unique(room_types_train)
feature_column_names.extend([f'room_{cls}' for cls in room_type_classes])

# Add column names for one-hot min_nights
min_nights_labels = ['short_term', 'weekly', 'monthly', 'long_term']
feature_column_names.extend([f'nights_{label}' for label in min_nights_labels])

# Add scaled numerical feature names
feature_column_names.extend(feature_names)

# Add target variable column (price_log)
feature_column_names.append('price_log')

# Convert final_matrix to string array for CSV saving
final_matrix_train_str = final_matrix_train.astype(str)
final_matrix_test_str = final_matrix_test.astype(str)

# Save CSV file for train set
train_csv_path = '../data/processed/train_features.csv'
write_csv(train_csv_path, final_matrix_train_str, np.array(feature_column_names))

# Save CSV file for test set
test_csv_path = '../data/processed/test_features.csv'
write_csv(test_csv_path, final_matrix_test_str, np.array(feature_column_names))

print(f"Saved feature data as CSV:")
print(f"\nTrain set:")
print(f"  - File: {train_csv_path}")
print(f"  - Shape: {final_matrix_train.shape}")
print(f"  - Number of samples: {final_matrix_train.shape[0]:,}")
print(f"\nTest set:")
print(f"  - File: {test_csv_path}")
print(f"  - Shape: {final_matrix_test.shape}")
print(f"  - Number of samples: {final_matrix_test.shape[0]:,}")
print(f"\nNumber of features: {len(feature_column_names)}")
print(f"\nColumns in CSV file:")
for i, col_name in enumerate(feature_column_names):
    if i < len(feature_column_names) - 1:
        print(f"  {i+1}. {col_name}")
    else:
        print(f"  {i+1}. {col_name} ← Target variable")
    
print(f"\nComplete! Data has been split and saved separately to avoid data leakage.")
print(f"Column price_log (target variable) has been added at the end.")

Saved data to: ../data/processed/train_features.csv
Shape: (39107, 19)
Number of columns: 19
Saved data to: ../data/processed/test_features.csv
Shape: (9777, 19)
Number of columns: 19
Saved feature data as CSV:

Train set:
  - File: ../data/processed/train_features.csv
  - Shape: (39107, 19)
  - Number of samples: 39,107

Test set:
  - File: ../data/processed/test_features.csv
  - Shape: (9777, 19)
  - Number of samples: 9,777

Number of features: 19

Columns in CSV file:
  1. neighbourhood_target_encoded
  2. ng_Bronx
  3. ng_Brooklyn
  4. ng_Manhattan
  5. ng_Queens
  6. ng_Staten Island
  7. room_Entire home/apt
  8. room_Private room
  9. room_Shared room
  10. nights_short_term
  11. nights_weekly
  12. nights_monthly
  13. nights_long_term
  14. number_of_reviews_log
  15. calc_host_listings_log
  16. availability_365
  17. dist_to_center
  18. reviews_per_month
  19. price_log ← Target variable

Complete! Data has been split and saved separately to avoid data leakage.
Column pri