# Feature Engineering Notebook

## Objectives
- Handle Categorical Variables: Convert categorical variables into numerical formats (one-hot encoding, label encoding, etc.).
- Scale Numerical Variables: Apply scaling to numerical features for model compatibility.
- Create Interaction Features: Optionally create interaction features to enhance model performance.
- Handle Date/Time Features: Extract useful components from date/time features like year, month, and day.
- Generate Polynomial Features: Optionally create polynomial features for feature interactions.

## Inputs
- outputs/datasets/cleaned/TrainSetCleaned.csv: Cleaned training set.
- outputs/datasets/cleaned/TestSetCleaned.csv: Cleaned test set.

## Outputs
- outputs/datasets/featured/TrainSet_Featured.csv: Training set with engineered features.
- outputs/datasets/featured/TestSet_Featured.csv: Test set with engineered features.

1. Import Libraries

In [1]:
import os

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


2. Change Working Directory

In [2]:
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

os.chdir(os.path.dirname(current_dir))
print(f"New working directory: {os.getcwd()}")


Current directory: /workspace/bicycle_thefts_berlin/jupyter_notebooks
New working directory: /workspace/bicycle_thefts_berlin


3. Load Cleaned Data

In [3]:
train_data_path = 'outputs/datasets/cleaned/TrainSetCleaned.csv'
test_data_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'

TrainSet = pd.read_csv(train_data_path)
TestSet = pd.read_csv(test_data_path)


4. Handle Categorical Variables

In [23]:
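# Define categorical columns for encoding
categorical_columns = [
    "ART_DES_FAHRRADS",
    "DELIKT"
]

# Strip whitespace from column names before encoding
TrainSet.columns = TrainSet.columns.str.strip()
TestSet.columns = TestSet.columns.str.strip()

# One-hot encode the training set with pandas get_dummies
TrainSet_encoded = pd.get_dummies(
    TrainSet, columns=categorical_columns, drop_first=True
)

# Encode the test set without dropping a level, then align it to the
# training columns: levels missing from the test set become all-zero
# columns, and levels not kept in the training set are removed
TestSet_encoded = pd.get_dummies(TestSet, columns=categorical_columns)
TestSet_encoded = TestSet_encoded.reindex(
    columns=TrainSet_encoded.columns, fill_value=0
)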


5. Scale Numerical Features

In [28]:
# Define numerical columns for scaling
numerical_columns = ["TATZEIT_ANFANG_STUNDE"]

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the training set
TrainSet_encoded[numerical_columns] = scaler.fit_transform(
    TrainSet_encoded[numerical_columns]
)

# Transform the test set using the same scaler
TestSet_encoded[numerical_columns] = scaler.transform(
    TestSet_encoded[numerical_columns]
)
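
To make the fit/transform split concrete, the following sketch checks on made-up hour values that `StandardScaler` standardizes test data with the mean and standard deviation of the training data. The arrays are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up hour values; the real data comes from TATZEIT_ANFANG_STUNDE
train_hours = np.array([[0.0], [6.0], [12.0], [18.0]])
test_hours = np.array([[15.0]])

# The scaler stores the training mean and (population) standard deviation
demo_scaler = StandardScaler().fit(train_hours)
scaled = demo_scaler.transform(test_hours)

# Same computation by hand: (x - train_mean) / train_std, with ddof=0
manual = (test_hours - train_hours.mean()) / train_hours.std()

print(np.allclose(scaled, manual))
```

Fitting on the training set alone keeps test-set statistics out of the preprocessing step.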


6. Feature Engineering on DateTime Columns

In [29]:
# Parse 'TATZEIT_ANFANG_DATUM' once per dataset, then extract year and month
for df in (TrainSet_encoded, TestSet_encoded):
    parsed_dates = pd.to_datetime(df["TATZEIT_ANFANG_DATUM"])
    df["TATZEIT_ANFANG_YEAR"] = parsed_dates.dt.year
    df["TATZEIT_ANFANG_MONTH"] = parsed_dates.dt.month

    # Drop the original datetime column
    df.drop(columns=["TATZEIT_ANFANG_DATUM"], inplace=True)
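
The objectives also mention the day component. A small helper along the following lines could extract year, month, and day in one pass. The function name `add_date_parts`, the `_DAY` column, and the ISO-formatted sample dates are assumptions for illustration; the date format in the cleaned files may differ.

```python
import pandas as pd


def add_date_parts(df, column, prefix):
    """Return a copy of df with year/month/day columns derived from `column`."""
    parsed = pd.to_datetime(df[column])
    out = df.drop(columns=[column])  # drop() returns a new frame
    out[f"{prefix}_YEAR"] = parsed.dt.year
    out[f"{prefix}_MONTH"] = parsed.dt.month
    out[f"{prefix}_DAY"] = parsed.dt.day
    return out


# Made-up sample dates in ISO format
sample = pd.DataFrame({"TATZEIT_ANFANG_DATUM": ["2023-01-15", "2022-12-31"]})
featured = add_date_parts(sample, "TATZEIT_ANFANG_DATUM", "TATZEIT_ANFANG")
print(featured)
```

Returning a new frame instead of modifying in place makes the same helper safe to call on both the training and the test set.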


7. Create Polynomial Features

In [30]:
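poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit on the training set only, then apply the same expansion to the test set
TrainSet_poly = poly.fit_transform(TrainSet_encoded[numerical_columns])
TestSet_poly = poly.transform(TestSet_encoded[numerical_columns])

# Write the expanded columns back so they are saved with the datasets
for i, name in enumerate(poly.get_feature_names_out(numerical_columns)):
    TrainSet_encoded[name] = TrainSet_poly[:, i]
    TestSet_encoded[name] = TestSet_poly[:, i]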


8. Save the Feature-Engineered Datasets

In [31]:
# Create directory if it doesn't exist
os.makedirs("outputs/datasets/featured", exist_ok=True)

# Save processed datasets to CSV
TrainSet_encoded.to_csv(
    "outputs/datasets/featured/TrainSet_Featured.csv", index=False
)
TestSet_encoded.to_csv(
    "outputs/datasets/featured/TestSet_Featured.csv", index=False
)

print("Feature Engineering completed and datasets saved.")


Feature Engineering completed and datasets saved.
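
Before relying on the saved files, it can help to confirm that both datasets share the same columns. The following is a minimal sketch of such a check; `schema_diff` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd


def schema_diff(train, test):
    """Return (columns missing in test, columns only in test)."""
    missing = [c for c in train.columns if c not in test.columns]
    extra = [c for c in test.columns if c not in train.columns]
    return missing, extra


# Made-up frames: the test frame lacks one dummy column
train_demo = pd.DataFrame({"hour": [1], "DELIKT_B": [0], "DELIKT_C": [1]})
test_demo = pd.DataFrame({"hour": [2], "DELIKT_B": [1]})

missing, extra = schema_diff(train_demo, test_demo)
print(f"Missing in test: {missing}, unexpected in test: {extra}")
```

Applied to `TrainSet_encoded` and `TestSet_encoded`, two empty lists would indicate that a model trained on one file can consume the other.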
