# Data Preprocessing and Feature Engineering

In this notebook, we prepare the bus route dataset for machine learning.
The process includes data cleaning, handling missing values, feature creation,
data transformation, and saving a final clean dataset for modeling.


In [1]:
#import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from urllib.parse import quote_plus
from sklearn.preprocessing import StandardScaler


## Load Data from Database (PostgreSQL)

In [2]:
# PostgreSQL connection details
db_user = "postgres"
db_password = "Ganesha123@!"
db_host = "localhost"
db_port = "5432"
db_name = "bus_data"

# Encode password to handle special characters
db_password_encoded = quote_plus(db_password)

# Create SQLAlchemy engine
engine = create_engine(
    f"postgresql+psycopg2://{db_user}:{db_password_encoded}@{db_host}:{db_port}/{db_name}"
)

# Retrieve data from PostgreSQL table
query = "SELECT * FROM bus_routes"
df = pd.read_sql(query, engine)

df.head()

| | Bus_ID | Departure | Arrival | Duration | Duration_Minutes | Seats | Single_Seats | Price | Onwards | Operator | Bus_Type | Rating | Rating_Count | Live_Tracking | Source | Destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26409654 | 23:15 | 06:10 | 6h 55m | 415 | 42 | 10 | 1250 | Onwards | Jayavin Travels | A/C Seater / Sleeper (2+1) | 4.4 | 575 | Yes | Bangalore | Chennai |
| 1 | 44319378 | 21:55 | 04:30 | 6h 35m | 395 | 36 | 12 | 585 | Onwards | HYBUS | Bharat Benz A/C Sleeper (2+1) | 4.5 | 245 | Yes | Bangalore | Chennai |
| 2 | 36258160 | 21:35 | 05:00 | 7h 25m | 445 | 24 | 8 | 1600 | Onwards | PADMAVATHI TRAVELS | A/C Sleeper (2+1) | 4.3 | 797 | Yes | Bangalore | Chennai |
| 3 | 37971794 | 22:45 | 05:45 | 7h | 420 | 25 | 6 | 1400 | Onwards | Krish Travels | Bharat Benz A/C Seater /Sleeper (2+1) | 4.2 | 1377 | Yes | Bangalore | Chennai |
| 4 | 44319377 | 21:40 | 04:55 | 7h 15m | 435 | 36 | 12 | 585 | Onwards | HYBUS | Bharat Benz A/C Sleeper (2+1) | 4.5 | 166 | Yes | Bangalore | Chennai |
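
The connection cell encodes the password with `quote_plus` so that characters such as `@` and `!` do not break the URL. As a minimal sketch of the same idea without credentials written into the notebook, the URL could be assembled from environment variables. The helper name `build_postgres_url` and the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, and `DB_NAME` are assumptions for illustration, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a SQLAlchemy URL from environment variables (names are hypothetical)."""
    user = env.get("DB_USER", "postgres")
    # quote_plus percent-encodes characters such as '@' and '!' that would
    # otherwise be misread as URL delimiters.
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "bus_data")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# A plain dict stands in for os.environ here so the example is reproducible.
example_env = {"DB_USER": "postgres", "DB_PASSWORD": "p@ss!word"}
print(build_postgres_url(example_env))
```

The returned string can be passed to `create_engine` in place of the inline f-string.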


## Data Exploration

In [3]:
df.info()
df.describe()
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Bus_ID            178 non-null    object 
 1   Departure         178 non-null    object 
 2   Arrival           178 non-null    object 
 3   Duration          178 non-null    object 
 4   Duration_Minutes  178 non-null    int64  
 5   Seats             178 non-null    int64  
 6   Single_Seats      178 non-null    int64  
 7   Price             178 non-null    int64  
 8   Onwards           178 non-null    object 
 9   Operator          178 non-null    object 
 10  Bus_Type          178 non-null    object 
 11  Rating            178 non-null    float64
 12  Rating_Count      178 non-null    int64  
 13  Live_Tracking     178 non-null    object 
 14  Source            178 non-null    object 
 15  Destination       178 non-null    object 
dtypes: float64(1), int64(5), object(10)
memory u

Bus_ID              0
Departure           0
Arrival             0
Duration            0
Duration_Minutes    0
Seats               0
Single_Seats        0
Price               0
Onwards             0
Operator            0
Bus_Type            0
Rating              0
Rating_Count        0
Live_Tracking       0
Source              0
Destination         0
dtype: int64

## Data Cleaning

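This step removes duplicate records, handles missing values, and fixes data type issues. The exploration output shows no nulls in any column, so the imputation here acts as a safeguard rather than a required fix for this extract.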

In [4]:
#Remove duplicate Records
df = df.drop_duplicates()

#Handle missing values
# Numerical columns
num_cols = [
    'Duration_Minutes',
    'Seats',
    'Single_Seats',
    'Price',
    'Rating',
    'Rating_Count'
]

for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())

# Categorical columns
cat_cols = [
    'Source',
    'Destination',
    'Operator',
    'Bus_Type',
    'Live_Tracking'
]

for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
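
Since the bus dataset has no nulls, the imputation loops have no visible effect on it. The sketch below applies the same two rules to a small made-up frame (the values are illustrative, not taken from the dataset) so the behaviour is visible.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame({
    "Price": [500.0, np.nan, 700.0],
    "Live_Tracking": ["Yes", None, "Yes"],
})

# Numerical column: replace NaN with the mean of the observed values.
toy["Price"] = toy["Price"].fillna(toy["Price"].mean())

# Categorical column: replace missing entries with the most frequent value.
# mode() returns a Series because ties are possible, so [0] takes the first entry.
toy["Live_Tracking"] = toy["Live_Tracking"].fillna(toy["Live_Tracking"].mode()[0])

print(toy)
```

One caveat: `mode()` on a column with no observed values returns an empty Series, so `mode()[0]` would fail with an indexing error for an entirely missing column.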



**Fix Data Types**

In [5]:
df['Price'] = df['Price'].astype(float)
df['Rating'] = df['Rating'].astype(float)
df['Seats'] = df['Seats'].astype(int)

## Feature Engineering

New features are created to improve machine learning model performance.


**Handling Missing Values for Duration:**  

Any missing values in `Duration_Minutes` are filled with the **mean duration** so that the column is complete for modeling.
The cleaning step already applies the same mean imputation to this column, so this cell only repeats it as a safeguard.


In [6]:
#Duration in minutes
df['Duration_Minutes'] = df['Duration_Minutes'].fillna(
    df['Duration_Minutes'].mean()
)
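
Because the raw `Duration` column holds strings such as `6h 55m` and `7h`, a missing `Duration_Minutes` value could in principle be recovered from the string instead of being replaced by the mean. The following is a sketch under the assumption that every duration follows the `Xh Ym` pattern seen in the preview; `duration_to_minutes` is a hypothetical helper, not part of the original notebook.

```python
import re

# Optional hours part, optional minutes part, e.g. "6h 55m", "7h", "45m".
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def duration_to_minutes(text):
    """Convert strings like '6h 55m' or '7h' to minutes; return None if unparseable."""
    match = _DURATION_RE.match(text)
    # Reject non-matching strings and strings with neither hours nor minutes.
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes


print(duration_to_minutes("6h 55m"))  # 415, matching Duration_Minutes in row 0
print(duration_to_minutes("7h"))      # 420, matching Duration_Minutes in row 3
```

It could be applied with `df['Duration'].map(duration_to_minutes)` to fill only the rows where `Duration_Minutes` is missing.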


**Price Category**

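The target variable, which classifies bus tickets by price into five categories: Low, Medium, High, Very High, and Luxury. Because the category is derived directly from `Price`, the `Price` column would need to be excluded from the model inputs when predicting it.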

In [7]:
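df['Price_Category'] = pd.cut(
    df['Price'],
    # Open-ended top edge: using df['Price'].max() here makes pd.cut fail
    # whenever the maximum price is 2000 or less.
    bins=[0, 300, 600, 1000, 2000, np.inf],
    labels=['Low', 'Medium', 'High', 'Very High', 'Luxury']
)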
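
`pd.cut` requires strictly increasing bin edges, so a final edge taken from the data's maximum fails whenever that maximum is 2000 or less. An open-ended top edge removes that dependence on the data. The sketch below shows the resulting assignment on a few prices, some from the preview and some made up to reach the outer bins.

```python
import numpy as np
import pandas as pd

prices = pd.Series([250, 585, 1250, 1600, 2500])

categories = pd.cut(
    prices,
    # Intervals are right-closed: (0, 300], (300, 600], ..., (2000, inf).
    bins=[0, 300, 600, 1000, 2000, np.inf],
    labels=["Low", "Medium", "High", "Very High", "Luxury"],
)
print(categories.tolist())
```

Note that the first interval excludes 0 itself, so a price of exactly 0 would be left unlabelled unless `include_lowest=True` is passed.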


**Peak Hour Indicator**

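Binary feature showing whether the bus departs during peak travel hours, defined here as a departure hour of 6 to 10 or 17 to 21 inclusive (1 = peak, 0 = non-peak). Departure times that fail to parse are treated as non-peak.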

In [9]:
df['Departure_Hour'] = pd.to_datetime(
    df['Departure'],
    format='%H:%M',
    errors='coerce'
).dt.hour

df['Is_Peak_Hour'] = df['Departure_Hour'].apply(
    lambda x: 1 if (6 <= x <= 10) or (17 <= x <= 21) else 0
)
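
The same indicator can be computed without a row-wise `apply`. This is a sketch of an equivalent vectorized form on a few sample departure strings, including one deliberately invalid value, and not a change to the notebook's method.

```python
import pandas as pd

departures = pd.Series(["23:15", "21:55", "06:00", "not a time"])

# Unparseable strings become NaT, so their hour is NaN.
hours = pd.to_datetime(departures, format="%H:%M", errors="coerce").dt.hour

# between() is inclusive on both ends by default and evaluates to False for NaN.
is_peak = (hours.between(6, 10) | hours.between(17, 21)).astype(int)
print(is_peak.tolist())
```

Both versions map an unparseable departure time to 0, which silently merges "unknown" with "non-peak".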


## Data Transformation

Categorical variables are encoded and numerical features are scaled
to prepare the data for machine learning models.


In [10]:
#One-Hot Encoding for categorical variables
df_encoded = pd.get_dummies(
    df,
    columns=['Source', 'Destination', 'Operator', 'Bus_Type', 'Live_Tracking'],
    drop_first=True
)
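
To illustrate what `drop_first=True` does, here is a small sketch on made-up rows. With two categories, only one indicator column is kept, and the dropped category becomes the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [1250, 585, 1600],
    "Live_Tracking": ["Yes", "No", "Yes"],
})

encoded = pd.get_dummies(toy, columns=["Live_Tracking"], drop_first=True)

# Categories are sorted, so "No" is the dropped baseline and only
# Live_Tracking_Yes remains.
print(list(encoded.columns))
```

Columns not named in `columns` pass through unchanged, so in the notebook the text columns `Bus_ID`, `Departure`, `Arrival`, `Duration`, and `Onwards` remain in `df_encoded` as they were.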


**Feature Scaling**

Standardize numerical columns to mean 0 and standard deviation 1 so that features measured on different scales contribute comparably to the model.

In [11]:
scaler = StandardScaler()

scale_cols = [
    'Price',
    'Duration_Minutes',
    'Seats',
    'Rating',
    'Rating_Count'
]

df_encoded[scale_cols] = scaler.fit_transform(df_encoded[scale_cols])
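
`StandardScaler` applies z = (x - mean) / std to each column, using the population standard deviation (dividing by n rather than n - 1). The sketch below reproduces that arithmetic with the standard library on the five `Price` values from the preview, purely to make the calculation visible.

```python
from statistics import fmean, pstdev

prices = [1250, 585, 1600, 1400, 585]  # Price values from the df.head() preview

mean = fmean(prices)
std = pstdev(prices)  # population standard deviation (divides by n)

# Each value is expressed as its distance from the mean in units of std.
z_scores = [(p - mean) / std for p in prices]
print([round(z, 3) for z in z_scores])
```

After this transformation the values have mean 0 and population standard deviation 1, up to floating-point rounding.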


## Save Clean Dataset

The final preprocessed dataset is saved as a CSV file
for machine learning modeling.


In [12]:
df_encoded.to_csv("data/clean_bus_routes_ml.csv", index=False)
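
`to_csv` does not create missing directories, so the save fails if the `data/` folder does not exist yet. A minimal sketch of a guard follows; `save_clean_dataset` is a hypothetical helper, shown here against a temporary directory with made-up rows.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean_dataset(frame, path):
    """Write frame to CSV, creating parent directories first (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "clean_bus_routes_ml.csv"
    out = save_clean_dataset(pd.DataFrame({"Price": [1250, 585]}), target)
    reloaded = pd.read_csv(out)
    print(reloaded.shape)
```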


## Summary of Preprocessing and Feature Engineering

In this notebook, we prepared the dataset for machine learning by performing the following steps:

**1. Data Cleaning:**

- Removed duplicate records.

- Handled missing values: numerical columns filled with the mean, categorical columns with the mode.

- Fixed data types for Price, Rating, and Seats.

**2. Feature Engineering:**

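- Filled any missing Duration_Minutes values with the column mean.

- Created the Price_Category target by binning Price into Low, Medium, High, Very High, and Luxury.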

- Extracted Departure_Hour from the Departure time.

- Created a Peak Hour Indicator (Is_Peak_Hour) for high-traffic times.

**3. Categorical Encoding:**

- Converted categorical columns (Source, Destination, Operator, Bus_Type, Live_Tracking) into binary indicator columns using one-hot encoding, dropping the first level of each.

**4. Feature Scaling:**

- Standardized numerical columns (Price, Duration_Minutes, Seats, Rating, Rating_Count) using StandardScaler so that each scaled column has mean 0 and standard deviation 1.

**5. Final Dataset:**

- The cleaned, encoded, and scaled dataset was saved as a CSV file (data/clean_bus_routes_ml.csv) for modeling.

These steps leave the dataset deduplicated, free of missing values, and numerically encoded, so it can be passed directly to machine learning models.