# Feature Engineering
## Smart Civic Issue & Waste Management System

This notebook performs feature engineering on the raw civic complaint dataset.
The objective is to transform exploratory insights into structured, numerical
features suitable for machine learning models.

Important Notes:
- The raw data file is NOT modified; all changes are made to the in-memory DataFrame.
- All engineered features are derived from existing attributes (category, status, and coordinates).
- The output of this notebook is a processed dataset used for ML modeling.

## 1. Importing Required Libraries

This section imports the necessary libraries for data manipulation,
feature engineering, and numerical computation.

In [3]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import NearestNeighbors

## 2. Loading Raw Dataset

The raw civic complaint dataset is loaded from the `data/raw` directory.
This dataset serves as the base for all feature engineering.

In [18]:
df = pd.read_csv("data/raw/urban_civic_reports_synthetic.csv")
df.head()

,ID,Category,Status,Latitude,Longitude
0,1,Drainage,Pending,23.257167,77.41364
1,2,Waste,Completed,23.258523,77.415287
2,3,Pothole,Completed,23.256676,77.411646
3,4,Waste,Completed,23.26152,77.407977
4,5,Streetlight,Pending,23.257405,77.412233


## 3. Data Cleaning and Standardization

Surrounding whitespace is stripped from column names, and records are sorted
by `ID` so that row order is deterministic. No data values are altered in this step.

In [21]:
df.columns = df.columns.str.strip()
df = df.sort_values(by='ID').reset_index(drop=True)
df.head()

,ID,Category,Status,Latitude,Longitude
0,1,Drainage,Pending,23.257167,77.41364
1,2,Waste,Completed,23.258523,77.415287
2,3,Pothole,Completed,23.256676,77.411646
3,4,Waste,Completed,23.26152,77.407977
4,5,Streetlight,Pending,23.257405,77.412233
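
A quick sanity check after cleaning can catch schema problems before any features are built. The sketch below is one possible way to do this, not part of the original notebook: the `check_schema` helper, the `REQUIRED` list, and the toy frame are all hypothetical names introduced for illustration.

```python
import pandas as pd

# Columns the later sections rely on (assumed from the table above).
REQUIRED = ["ID", "Category", "Status", "Latitude", "Longitude"]

def check_schema(frame):
    """Return a list of human-readable problems; empty means the frame looks sane."""
    missing = [c for c in REQUIRED if c not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if frame["ID"].duplicated().any():
        problems.append("duplicate ID values")
    if frame[REQUIRED].isna().any().any():
        problems.append("missing values in required columns")
    if not frame["Latitude"].between(-90, 90).all():
        problems.append("latitude outside [-90, 90]")
    if not frame["Longitude"].between(-180, 180).all():
        problems.append("longitude outside [-180, 180]")
    return problems

# Toy frame with padded headers, mimicking a messy CSV export (illustrative only).
toy = pd.DataFrame({
    " ID": [2, 1],
    "Category ": ["Waste", "Drainage"],
    "Status": ["Completed", "Pending"],
    "Latitude": [23.2585, 23.2571],
    "Longitude": [77.4152, 77.4136],
})

before = check_schema(toy)  # padded names are reported as missing
toy.columns = toy.columns.str.strip()
toy = toy.sort_values(by="ID").reset_index(drop=True)
after = check_schema(toy)   # empty list once the names are stripped
print(before, after)
```

The same helper could be called on the real `df` right after the cleaning cell; an empty list would indicate that the columns used in the following sections are present and plausible.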


## 4. Encoding Categorical Variables

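Machine learning models require numerical inputs. This step converts the
categorical `Category` and `Status` columns into integer codes with
`LabelEncoder`, which numbers the labels in sorted order. The codes are
identifiers only; their numeric order implies no ranking between labels.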

In [24]:
le_category = LabelEncoder()
le_status = LabelEncoder()

df['category_encoded'] = le_category.fit_transform(df['Category'])
df['status_encoded'] = le_status.fit_transform(df['Status'])

df[['Category', 'category_encoded', 'Status', 'status_encoded']].head()

,Category,category_encoded,Status,status_encoded
0,Drainage,0,Pending,2
1,Waste,3,Completed,1
2,Pothole,1,Completed,1
3,Waste,3,Completed,1
4,Streetlight,2,Pending,2
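
Since the integer codes depend only on the sorted order of the labels, it can help to print the mapping explicitly and, for models that would misread the codes as ordered, to consider a one-hot encoding instead. The following is a minimal sketch on a made-up frame; the `toy` data and the `cat` prefix are illustrative assumptions, and one-hot encoding is an alternative, not the method used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real complaint table (illustrative only).
toy = pd.DataFrame({"Category": ["Drainage", "Waste", "Pothole", "Waste"]})

le = LabelEncoder()
toy["category_encoded"] = le.fit_transform(toy["Category"])

# classes_ is sorted, so the position of each label is its integer code.
code_map = {label: int(code) for code, label in enumerate(le.classes_)}
print(code_map)

# One-hot alternative: one 0/1 column per category, with no implied ordering.
one_hot = pd.get_dummies(toy["Category"], prefix="cat", dtype=int)
print(one_hot.columns.tolist())
```

Keeping `code_map` alongside the saved features makes it possible to translate codes back to labels later; `le.inverse_transform` does the same while the fitted encoder is still in memory.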


## 5. Spatial Density Feature Engineering

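Local complaint density is used here as an indicator of urgency. This feature
approximates density by the mean distance from each complaint to its nearest
neighbours: the smaller the distance, the denser the neighbourhood. Distances
are Euclidean in raw latitude/longitude degrees, not metres. Because the model
is queried with the same points it was fitted on, each point is returned as its
own nearest neighbour at distance 0, so with `n_neighbors=5` the mean covers the
point itself plus its four nearest other complaints.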

In [27]:
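# (n_samples, 2) array of [latitude, longitude] in degrees
coords = df[['Latitude', 'Longitude']].values

# Default metric is Euclidean, so distances are expressed in degrees
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(coords)

# Querying the fitted points returns each point itself first (distance 0)
distances, _ = nbrs.kneighbors(coords)

# Mean over the 5 returned distances, including the zero self-distance
df['avg_neighbour_distance'] = distances.mean(axis=1)
df['avg_neighbour_distance'].describe()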

count    1600.000000
mean        0.000645
std         0.000784
min         0.000062
25%         0.000207
50%         0.000296
75%         0.000582
max         0.004682
Name: avg_neighbour_distance, dtype: float64
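
The cell above measures distance in degrees and includes each point's zero distance to itself in the mean. A possible variant, sketched here on toy coordinates and not used elsewhere in this notebook, excludes the query point and reports distances in metres via the haversine metric. The coordinates, `EARTH_RADIUS_M`, and the choice of three neighbours are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# Toy coordinates in degrees, loosely near the sample rows (illustrative only).
coords_deg = np.array([
    [23.2570, 77.4130],
    [23.2571, 77.4131],
    [23.2572, 77.4132],
    [23.2573, 77.4133],
    [23.2600, 77.4200],  # an isolated point
])

# The haversine metric expects [latitude, longitude] in radians.
coords_rad = np.radians(coords_deg)

nbrs = NearestNeighbors(n_neighbors=3, metric="haversine", algorithm="ball_tree")
nbrs.fit(coords_rad)

# Calling kneighbors() with no argument queries the fitted points
# and does not count a point as its own neighbour.
dist_rad, _ = nbrs.kneighbors()

# Haversine distances are angles on the unit sphere; scale to metres.
avg_dist_m = dist_rad.mean(axis=1) * EARTH_RADIUS_M
print(np.round(avg_dist_m, 1))
```

Switching to this variant would change every downstream value (`density_score`, `risk_signal`), so the outputs shown in the following sections apply only to the original degree-based computation.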

## 6. Normalized Density Score

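The average neighbour distance is first inverted, so that smaller distances
give larger values (the `1e-6` term guards against division by zero), and then
min-max scaled to the range [0, 1]. The densest record scores 1 and the most
isolated record scores 0.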

In [30]:
df['density_score'] = 1 / (df['avg_neighbour_distance'] + 1e-6)
df['density_score'] = (
    df['density_score'] - df['density_score'].min()
) / (
    df['density_score'].max() - df['density_score'].min()
)

df[['avg_neighbour_distance', 'density_score']].head()

,avg_neighbour_distance,density_score
0,0.000207,0.29549
1,0.000348,0.170649
2,0.000246,0.246742
3,0.000568,0.099254
4,0.000337,0.176402
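
To see how the inverse-then-scale transform behaves, the two steps can be wrapped in a small function and applied to a few known distances. This is a sketch under assumptions: the function name `density_score`, the `eps` parameter, and the guard for a constant input are additions for illustration, not part of the original cell.

```python
import pandas as pd

def density_score(avg_distance: pd.Series, eps: float = 1e-6) -> pd.Series:
    """Inverse-distance density, min-max scaled to [0, 1]."""
    inv = 1.0 / (avg_distance + eps)
    span = inv.max() - inv.min()
    if span == 0:
        # All points equally dense: avoid 0/0 and return a constant score.
        return pd.Series(0.0, index=avg_distance.index)
    return (inv - inv.min()) / span

# Min, median, and max of avg_neighbour_distance from the describe() output above.
d = pd.Series([0.000062, 0.000296, 0.004682])
print(density_score(d).round(4).tolist())
```

With these three inputs the median distance maps to a score of roughly 0.2, which shows that the inverse transform concentrates high scores on the few densest records. The score is also relative to the dataset: adding or removing an extreme record changes every other record's value.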


## 7. Complaint Frequency per Category

Frequently occurring categories often represent systemic issues.
This feature records, for every complaint, the total number of complaints
in the same category.

In [33]:
category_frequency = df['Category'].value_counts().to_dict()
df['category_frequency'] = df['Category'].map(category_frequency)

df[['Category', 'category_frequency']].head()

,Category,category_frequency
0,Drainage,214
1,Waste,497
2,Pothole,425
3,Waste,497
4,Streetlight,311
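
Raw counts depend on the size of the dataset, so a relative share is sometimes easier to compare across data snapshots. The sketch below shows both forms on a toy frame; the `category_share` column is a hypothetical addition and is not part of the saved feature set.

```python
import pandas as pd

# Toy data standing in for the real complaint table (illustrative only).
toy = pd.DataFrame({"Category": ["Waste", "Waste", "Pothole", "Drainage"]})

# Absolute count per category, broadcast back to every row.
toy["category_frequency"] = toy.groupby("Category")["Category"].transform("count")

# Relative share of each category: comparable across datasets of different sizes.
toy["category_share"] = toy["Category"].map(
    toy["Category"].value_counts(normalize=True)
)
print(toy)
```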


## 8. Pending Status Indicator

Unresolved complaints require higher attention. This binary feature
indicates whether a complaint is still pending.

In [36]:
# Case-insensitive match on 'pending'; vectorized, and missing values map to 0
df['is_pending'] = df['Status'].str.lower().eq('pending').astype(int)
df[['Status', 'is_pending']].head()

,Status,is_pending
0,Pending,1
1,Completed,0
2,Completed,0
3,Completed,0
4,Pending,1


## 9. Composite Risk Indicator (Preliminary)

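A preliminary risk indicator is constructed as a weighted sum of spatial density
(weight 0.6) and pending status (weight 0.4). Both inputs lie in [0, 1] and the
weights sum to 1, so the signal also lies in [0, 1]. This is NOT the final model
output, but an input signal.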

In [39]:
df['risk_signal'] = (
    0.6 * df['density_score'] +
    0.4 * df['is_pending']
)

df[['density_score', 'is_pending', 'risk_signal']].head()

,density_score,is_pending,risk_signal
0,0.29549,1,0.577294
1,0.170649,0,0.10239
2,0.246742,0,0.148045
3,0.099254,0,0.059552
4,0.176402,1,0.505841
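
The weighted sum can be written as a small function so that the weights are explicit and easy to vary. This is a sketch under assumptions: the function name `risk_signal`, its parameter names, and the weight check are introduced here for illustration; only the default weights 0.6 and 0.4 come from the cell above.

```python
import pandas as pd

def risk_signal(density: pd.Series, pending: pd.Series,
                w_density: float = 0.6, w_pending: float = 0.4) -> pd.Series:
    """Weighted sum of density score and pending flag."""
    if abs(w_density + w_pending - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1 to keep the signal in [0, 1]")
    return w_density * density + w_pending * pending

# Density scores and pending flags of the first two rows in the table above.
sig = risk_signal(pd.Series([0.29549, 0.170649]), pd.Series([1, 0]))
print(sig.round(6).tolist())
```

With a weight of 0.4 on the flag, any pending complaint scores at least 0.4, while a completed complaint scores at most 0.6, so pending status shifts a record's signal by a fixed amount regardless of density.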


## 10. Feature Selection for Machine Learning

This section selects the final set of engineered features that will be
used by downstream ML models. `ID` is carried along as a record identifier
and is not included in `feature_columns`.

In [42]:
feature_columns = [
    'category_encoded',
    'status_encoded',
    'density_score',
    'category_frequency',
    'is_pending',
    'risk_signal',
    'Latitude',
    'Longitude'
]

features_df = df[['ID'] + feature_columns]
features_df.head()

,ID,category_encoded,status_encoded,density_score,category_frequency,is_pending,risk_signal,Latitude,Longitude
0,1,0,2,0.29549,214,1,0.577294,23.257167,77.41364
1,2,3,1,0.170649,497,0,0.10239,23.258523,77.415287
2,3,1,1,0.246742,425,0,0.148045,23.256676,77.411646
3,4,3,1,0.099254,497,0,0.059552,23.26152,77.407977
4,5,2,2,0.176402,311,1,0.505841,23.257405,77.412233


## 11. Saving Engineered Feature Dataset

The processed feature dataset is saved for use in machine learning models.
This dataset represents the final output of the feature engineering stage.

In [45]:
features_df.to_csv(
    "data/processed/features_data.csv",
    index=False
)

print("features_data.csv saved successfully.")

features_data.csv saved successfully.
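
`to_csv` does not create missing directories, so the save fails on a fresh checkout where `data/processed` does not exist yet. A minimal sketch of a safer save follows. It writes to a temporary directory so that it does not touch the project tree, and `features_demo` is a made-up stand-in for `features_df`.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Small stand-in for features_df (illustrative only).
features_demo = pd.DataFrame({"ID": [1, 2], "density_score": [0.3, 0.17]})

# Temporary root for this sketch; in the notebook this would be the project root.
root = Path(tempfile.mkdtemp())
out_path = root / "data" / "processed" / "features_data.csv"

# Create data/processed if it is missing, then write without the index column.
out_path.parent.mkdir(parents=True, exist_ok=True)
features_demo.to_csv(out_path, index=False)

# Read back to confirm that shape and column order survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist(), reloaded.shape)
```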


## 12. Summary of Engineered Features

### Engineered Features:
- Encoded complaint category and status
- Spatial density score using nearest neighbors
- Category frequency indicator
- Pending status flag
- Composite risk signal

### Purpose:
These features transform raw civic complaint data into structured,
numerical inputs suitable for priority classification and hotspot
clustering models.

The next stage applies machine learning models using this dataset.


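As a final check, the saved file is read back to confirm the column names and their order.

In [48]:
# Load into a separate variable so the working DataFrame `df` is not overwritten
saved_df = pd.read_csv("data/processed/features_data.csv")
print(saved_df.columns.tolist())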

['ID', 'category_encoded', 'status_encoded', 'density_score', 'category_frequency', 'is_pending', 'risk_signal', 'Latitude', 'Longitude']
