# Data Preprocessing/Cleaning


---
### To make data machine readable


# Import Libraries/Dependencies


---


*Note: You can find the unprocessed data set, labeled traffic.log, on the GitHub page of this project. You may use your own data set, but make sure the columns line up (the code below uses timestamp, current_speed, free_flow_speed, and confidence).*


In [1]:
import pandas as pd
from datetime import time

We will first read the data (because how else are you supposed to use it).

In [None]:
df = pd.read_csv("traffic_log.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])  # read_csv loads timestamps as plain strings; parse them into pandas datetimes so the .dt accessor works later
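
If you bring your own data set, a quick check like the one below can confirm that the columns line up before the later cells run. This is only a sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, the required list is taken from the column names this notebook's code reads, and the sample frame is made up.

```python
import pandas as pd

# Columns the later cells read from the raw data (taken from this notebook's code)
REQUIRED_COLUMNS = ["timestamp", "current_speed", "free_flow_speed", "confidence"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    # Hypothetical helper: list the required columns that the DataFrame lacks
    return [col for col in required if col not in frame.columns]

# Tiny made-up frame that is missing the 'confidence' column
sample = pd.DataFrame({
    "timestamp": ["2024-01-01 07:30:00"],
    "current_speed": [30.0],
    "free_flow_speed": [60.0],
})
print(missing_columns(sample))
```

An empty list means every column the notebook needs is present.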

Next we will calculate the congestion percentage: (1 - current_speed / free_flow_speed) * 100.

We will also define a function to categorize congestion.
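
To get a feel for the formula, here is a small worked example on made-up speeds (the numbers are illustrative only, not from the data set). A vehicle moving at free-flow speed gives 0% congestion, half of free-flow speed gives 50%, and a standstill gives 100%.

```python
import pandas as pd

# Made-up speeds, purely for illustration
sample = pd.DataFrame({
    "current_speed": [60.0, 30.0, 0.0],
    "free_flow_speed": [60.0, 60.0, 60.0],
})

# Same formula as in this notebook: the share of free-flow speed that has been lost
sample["congestion_percent"] = (1 - sample["current_speed"] / sample["free_flow_speed"]) * 100
print(sample)
```

A row where free_flow_speed is 0 would not produce a finite percentage, so such rows may need to be filtered out first.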

In [None]:
df['congestion_percent'] = (1 - df['current_speed'] / df['free_flow_speed']) * 100

def categorize_data(p):
    # Bucket a congestion percentage: below 30 is Low, below 60 is Moderate, anything else is Severe
    if p < 30:
        return 'Low'
    elif p < 60:
        return 'Moderate'
    else:
        return 'Severe'
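
As an alternative, the same thresholds can be applied to a whole column at once with `pd.cut`. This is a sketch rather than the method used in this notebook; `categorize_series` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def categorize_series(percent):
    # Hypothetical vectorized counterpart to categorize_data:
    # [-inf, 30) -> Low, [30, 60) -> Moderate, [60, inf) -> Severe
    return pd.cut(
        percent,
        bins=[float("-inf"), 30, 60, float("inf")],
        labels=["Low", "Moderate", "Severe"],
        right=False,  # left-closed bins, so exactly 30 is Moderate and exactly 60 is Severe
    )

sample = pd.Series([0, 29.9, 30, 59.9, 60, 100])
print(categorize_series(sample).tolist())
```

One difference to keep in mind: `pd.cut` leaves missing values missing, whereas categorize_data returns 'Severe' for NaN because both comparisons evaluate to False.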

Here I've decided to add a rush hour feature. This will help us determine whether congestion will be bad or not.

In [None]:
def is_rush_hour(t):
    # Rush hour is 6:00-8:00 or 16:00-18:00, with both endpoints included
    return (time(6, 0) <= t <= time(8, 0)) or (time(16, 0) <= t <= time(18, 0))
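
A quick way to see where the rush hour windows begin and end is to try a few hand-picked times, including the edges. The function is repeated here so the snippet runs on its own; the test times are made up.

```python
from datetime import time

def is_rush_hour(t):
    # Same definition as in the cell above
    return (time(6, 0) <= t <= time(8, 0)) or (time(16, 0) <= t <= time(18, 0))

# A few hand-picked times, including the window edges
for t in [time(5, 59), time(6, 0), time(8, 0), time(8, 1), time(17, 30), time(18, 0, 1)]:
    print(t, is_rush_hour(t))
```

Both endpoints are inclusive, but only to the second: a reading at 08:00:30 already falls outside the morning window.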

Next we will add these columns to the DataFrame.

In [None]:
df['rush_hour'] = df['timestamp'].dt.time.apply(is_rush_hour)
df['congestion_level'] = df['congestion_percent'].apply(categorize_data)
df['low_confidence'] = df['confidence'] < 0.3
df['hour'] = df['timestamp'].dt.hour

Finally, we will convert the processed data into a CSV file, which we will use in model training.

In [None]:
df.to_csv("traffic_processed.csv", index=False)
print("✅ Data prepared and saved to traffic_processed.csv")
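
When the processed file is loaded again for model training, the timestamp column comes back as plain text unless it is parsed again. Below is a minimal sketch of that round trip. It uses an in-memory buffer and a few made-up rows instead of the real file, so the frame contents are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up rows standing in for the processed data
processed = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 07:30:00", "2024-01-01 12:00:00"]),
    "congestion_level": ["Moderate", "Low"],
    "rush_hour": [True, False],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type that CSV cannot store
reloaded = pd.read_csv(buffer, parse_dates=["timestamp"])
print(reloaded.dtypes)
```

With the real file, the same `parse_dates=["timestamp"]` argument would be passed when reading traffic_processed.csv.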