# 1. Title

# 02  Feature Engineering

Goal:
Prepare the 311 dataset for machine learning modeling.

Strategy:

1. Target creation  
2. Smart grouping of service types  
3. Encoding  
4. Feature selection  
5. PCA (when appropriate)

Dataset:
data/processed/311_2022_2025_base.csv


## 2. Imports


In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA


## 3. Load Processed Dataset


In [2]:
df = pd.read_csv(
    "../data/processed/311_2022_2025_base.csv",
    parse_dates=["creation_date"]
)

print("Shape:", df.shape)
df.head()


Shape: (1678480, 14)


Unnamed: 0,creation_date,status,first_3_chars_of_postal_code,intersection_street_1,intersection_street_2,ward,service_request_type,division,section,source_year,year,month,day_of_week,hour
0,2022-01-01 00:12:24,Completed,M1E,,,Scarborough-Rouge Park (25),Fireworks,Municipal Licensing & Standards,Parks Enforcement,2022,2022,1,5,0
1,2022-01-01 00:20:34,Completed,M9N,,,York South-Weston (05),Amplified Sound,Municipal Licensing & Standards,Bylaw Enforcement,2022,2022,1,5,0
2,2022-01-01 00:31:00,Completed,M3J,,,York Centre (06),Amplified Sound,Municipal Licensing & Standards,Bylaw Enforcement,2022,2022,1,5,0
3,2022-01-01 00:31:28,Completed,M3H,,,York Centre (06),Fireworks,Municipal Licensing & Standards,Parks Enforcement,2022,2022,1,5,0
4,2022-01-01 00:37:50,Completed,M9C,,,Etobicoke Centre (02),Fireworks,Municipal Licensing & Standards,Parks Enforcement,2022,2022,1,5,0


## 4. Target Creation

Convert `status` into a binary target, `is_cancelled`:

- 1 → Cancelled  
- 0 → Not Cancelled (any other status)


In [3]:
df["is_cancelled"] = (df["status"] == "Cancelled").astype(int)

df["is_cancelled"].value_counts(normalize=True)


is_cancelled
0    0.885113
1    0.114887
Name: proportion, dtype: float64
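
With roughly 11.5% of requests cancelled, the classes are imbalanced, so plain accuracy can mislead: a model that always predicts 0 would already score about 88.5%. A minimal sketch of that baseline follows. The status values in the toy series are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Synthetic stand-in for df["status"]; the labels and counts are made up.
status = pd.Series(["Completed"] * 8 + ["Cancelled"] * 1 + ["In Progress"] * 1)

# Same target construction as in the notebook.
is_cancelled = (status == "Cancelled").astype(int)

# Accuracy of a model that always predicts the majority class.
majority_share = is_cancelled.value_counts(normalize=True).max()
print("Majority-class baseline accuracy:", majority_share)
```

Any model trained later would need to beat this baseline to be useful, which is one reason metrics such as recall or F1 on the cancelled class may be more informative than accuracy.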

## 5. Drop Problematic Columns

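Remove columns that add little for modeling:

- Raw `creation_date`, whose components are already available as `year`, `month`, `day_of_week`, and `hour`
- `intersection_street_1` and `intersection_street_2`, which are empty in the sample rows above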


In [4]:
df = df.drop(columns=[
    "creation_date",
    "intersection_street_1",
    "intersection_street_2"
])

print("Remaining columns:", df.columns.tolist())


Remaining columns: ['status', 'first_3_chars_of_postal_code', 'ward', 'service_request_type', 'division', 'section', 'source_year', 'year', 'month', 'day_of_week', 'hour', 'is_cancelled']
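
Before dropping columns, it can help to confirm how sparse they actually are. A minimal sketch on a toy frame follows. Only the column names come from the dataset; the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror the dataset.
toy = pd.DataFrame({
    "intersection_street_1": [np.nan, np.nan, "Main St", np.nan],
    "intersection_street_2": [np.nan, np.nan, np.nan, np.nan],
    "ward": ["A", "B", "A", "C"],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same `isna().mean()` check on the real frame would show whether the intersection columns are sparse across the whole dataset and not only in the first rows.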


## 6. Reduce Service Type Cardinality

`service_request_type` has 846 unique values, too many to one-hot encode directly.

Keep the 30 most frequent types and group the remaining 816 as "Other".


In [5]:
top_services = df["service_request_type"].value_counts().head(30).index

df["service_request_type"] = np.where(
    df["service_request_type"].isin(top_services),
    df["service_request_type"],
    "Other"
)

print("New unique service types:", df["service_request_type"].nunique())


New unique service types: 31
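
The same grouping can be wrapped in a small reusable function. The sketch below uses a hypothetical helper, `group_rare_categories`, and an invented toy series; neither is part of the notebook.

```python
import pandas as pd

def group_rare_categories(series, top_n, other_label="Other"):
    """Keep the top_n most frequent values and replace the rest with other_label."""
    top_values = series.value_counts().head(top_n).index
    # Series.where keeps values where the condition holds and substitutes elsewhere.
    return series.where(series.isin(top_values), other_label)

# Invented example values.
toy = pd.Series(["Noise", "Noise", "Noise", "Litter", "Litter", "Signs", "Graffiti"])
grouped = group_rare_categories(toy, top_n=2)
print(grouped.value_counts())
```

If the data is later split into training and test sets, one option is to derive the list of top categories from the training rows only and reuse it on the test rows.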


## 7. Encode Categorical Variables


In [6]:
categorical_cols = [
    "service_request_type",
    "first_3_chars_of_postal_code",
    "ward",
    "division",
    "section",
    "source_year"
]

df = pd.get_dummies(
    df,
    columns=categorical_cols,
    drop_first=True
)

print("Shape after encoding:", df.shape)


Shape after encoding: (1678480, 193)
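
With `drop_first=True`, each encoded column contributes one fewer indicator than it has levels, because the first level becomes the implicit baseline. A small sketch on invented data shows the effect:

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "ward": ["A", "B", "C", "A"],
    "division": ["X", "Y", "X", "Y"],
    "hour": [0, 5, 12, 23],
})

encoded = pd.get_dummies(toy, columns=["ward", "division"], drop_first=True)

# ward has 3 levels -> 2 indicators; division has 2 levels -> 1 indicator.
print(encoded.columns.tolist())
```

Dropping the first level avoids perfectly collinear indicator sets, which matters mainly for linear models.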


## 8. Separate Features and Target


In [7]:
X = df.drop(columns=["status", "is_cancelled"])
y = df["is_cancelled"]

print("Feature matrix shape:", X.shape)


Feature matrix shape: (1678480, 191)


## 9. Feature Selection (Chi-Square)

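Select the 50 features with the highest chi-square scores against `is_cancelled`. The test scores each feature on its own, so the ranking reflects univariate association with the target rather than joint predictive power. It also requires non-negative inputs, which the one-hot indicators and the time columns satisfy.

Note that the selector is fit on the full dataset here; for an unbiased evaluation it should be refit on the training split only.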


In [8]:
selector = SelectKBest(score_func=chi2, k=50)

X_selected = selector.fit_transform(X, y)

selected_columns = X.columns[selector.get_support()]

print("Selected feature count:", len(selected_columns))
selected_columns[:10]


Selected feature count: 50


Index(['month', 'hour', 'service_request_type_Amplified Sound',
       'service_request_type_Cadaver - Wildlife',
       'service_request_type_Construction Noise',
       'service_request_type_Injured - Wildlife',
       'service_request_type_Litter / Illegal Dumping Cleanup',
       'service_request_type_Missing/Damaged Signs',
       'service_request_type_Other',
       'service_request_type_Pick up Dead Wildlife'],
      dtype='object')
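
To see why a feature was selected, the chi-square scores themselves can be inspected through `selector.scores_`. A minimal sketch on synthetic data follows; the feature names and the data-generating rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 1000

# Synthetic binary target and three 0/1 indicator features (all invented).
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    # Agrees with y about 90% of the time, so it should score highest.
    "informative": np.where(rng.random(n) < 0.9, y, 1 - y),
    "noise_a": rng.integers(0, 2, size=n),
    "noise_b": rng.integers(0, 2, size=n),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# Rank features by chi-square score, highest first.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print("Selected:", X.columns[selector.get_support()].tolist())
```

On the real data, sorting the scores this way would show how sharply the ranking drops off around the 50th feature, which can inform the choice of `k`.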

## 10. PCA (Dimensionality Reduction)

Apply PCA after scaling.

Retain 95% variance.


In [9]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X_selected)

pca = PCA(n_components=0.95)

X_pca = pca.fit_transform(X_scaled)

print("Original feature count:", X_selected.shape[1])
print("Reduced feature count after PCA:", X_pca.shape[1])


Original feature count: 50
Reduced feature count after PCA: 37
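
Passing a float to `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below checks this against `explained_variance_ratio_` on synthetic data, where half of the features are deliberately built as noisy copies of the other half.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 features, the last 5 are noisy copies of the first 5.
base = rng.normal(size=(500, 5))
X = np.hstack([base, base + 0.1 * rng.normal(size=(500, 5))])

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95).fit(X_scaled)

# Running total of explained variance across the kept components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("Components kept:", pca.n_components_)
print("Cumulative variance retained:", round(cumulative[-1], 4))
```

Highly redundant features collapse into far fewer components, so the size of the reduction is a rough indicator of how much linear redundancy the inputs contain.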



The 50 selected features are reduced to 37 principal components.

Together these components retain at least 95% of the variance of the scaled, selected features (not of the original dataset).

Observations:

- PCA removed only 13 dimensions (50 → 37).
- This suggests moderate linear correlation among the selected features.
- The selected features are not highly redundant.
- The variance is spread across many directions rather than concentrated in a few.

Since PCA did not drastically reduce dimensionality, this implies:

- The service type indicators overlap little with one another.
- The temporal features (`month`, `hour`) appear to vary largely independently of the service type indicators.
- PCA ignores the target, so these results describe the correlation structure among features, not their predictive value.

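PCA may still help with:

- Model stability, since the components are uncorrelated
- Computational efficiency, with 37 inputs instead of 50
- Generalization, although this needs to be confirmed on held-out data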


## 11. Save Final Feature Dataset


In [10]:
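# Name the components so the CSV header is self-describing (pc_1, pc_2, ...)
pc_columns = [f"pc_{i + 1}" for i in range(X_pca.shape[1])]
feature_df = pd.DataFrame(X_pca, columns=pc_columns)
feature_df["is_cancelled"] = y.reset_index(drop=True)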

feature_df.to_csv(
    "../data/processed/311_features_pca.csv",
    index=False
)

print("Feature dataset saved.")


Feature dataset saved.


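Because PCA reduced dimensionality only moderately (50 → 37):

- Linear models (Logistic Regression) may perform well, since the principal components are uncorrelated.
- Tree-based models (Random Forest, XGBoost) may capture nonlinear interactions, though on the saved dataset they operate on principal components rather than directly on the service type and time features.
- The dataset appears information-rich rather than highly redundant.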
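
For the modeling step, one option is to wrap selection, scaling, and PCA in a scikit-learn `Pipeline`, so each step is fit on the training split only and no information from the test rows reaches the transformations. The sketch below runs on synthetic data. The classifier, split ratio, and `k` are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic non-negative 0/1 features and an imbalanced binary target (invented).
X = rng.integers(0, 2, size=(2000, 20))
y = ((X[:, 0] + X[:, 1] + rng.random(2000)) > 2.4).astype(int)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=10)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every step is fit on the training rows only.
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print("Held-out accuracy:", accuracy)
```

Swapping the final step for a tree-based estimator would allow the same comparison without refitting the preprocessing by hand.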

Next step:
Model training and performance comparison.