### **Analyzing and Forecasting Traffic Patterns in California District 3 Using Hierarchical Time Series and Deep Learning**

## **Introduction**

Traffic congestion remains a critical challenge in urban mobility, influencing commute times, environmental sustainability, and overall transportation efficiency. This study leverages historical traffic data from **California’s District 3**, obtained from the **Caltrans Performance Measurement System (PeMS)**, to explore traffic trends and develop predictive models that enhance traffic management and forecasting.

## **Objectives**
The primary goal of this analysis is to **build a hierarchical time series model** to forecast traffic patterns and later incorporate **deep learning models** for enhanced predictive performance. This study follows a structured approach:

1. **Exploratory Data Analysis (EDA):**
   - Conduct a thorough statistical and visual analysis of the traffic data.
   - Identify seasonality, trends, and anomalies in the dataset.
   - Evaluate key traffic variables such as **total flow, average speed, and direction of travel**.

2. **Basic Time Series Modeling:**
   - Apply classical **time series models** such as **ARIMA, SARIMA, and Exponential Smoothing** to develop initial forecasting benchmarks.
   - Assess the performance of these models using standard evaluation metrics.

3. **Hierarchical Time Series Modeling:**
   - Construct **hierarchical time series (HTS) models** to analyze traffic patterns at different levels (e.g., station-level, route-level, and district-level).
   - Explore aggregation and disaggregation techniques for improved forecasting accuracy.

4. **Deep Learning Integration:**
   - Incorporate **deep learning-based forecasting models** such as **LSTMs and Transformer-based models** for more complex pattern recognition.
   - Compare model performance against traditional time series approaches.

5. **Incorporation of Electric Vehicle (EV) Charging Data:**
   - Integrate **EV charging station data** to analyze its impact on traffic congestion and patterns.
   - Develop a unified deep learning model that incorporates both **traffic and EV charging data** for holistic forecasting.

## **Data Source**
The traffic data used in this study is sourced from **PeMS (Caltrans Performance Measurement System)** and can be accessed at the following link:
[PeMS District 3 Traffic Data](https://pems.dot.ca.gov/?dnode=Clearinghouse&type=station_hour&district_id=3&submit=Submit)
The dataset contains **hourly traffic measurements** across multiple stations in District 3, capturing critical variables such as **total flow, speed, and observed percentages**.

## **Significance of the Study**
Accurate traffic forecasting is essential for **transportation planning, congestion mitigation, and infrastructure optimization**. By combining traditional time series techniques with **hierarchical modeling and deep learning**, this study aims to provide a **robust predictive framework** for traffic management in District 3 and beyond.

In [19]:
import pandas as pd
import glob
import os

In [5]:
# Use a raw string (r"") to avoid path issues
folder_path = r"C:\Users\attafuro\Desktop\Traffic Analysis"  
output_file = os.path.join(folder_path, "merged_traffic_data.csv")

# Find all text files
file_paths = glob.glob(os.path.join(folder_path, "*.txt"))

# Open the output file and process each text file one by one
with open(output_file, "w") as output:
    first_file = True  # Track first file to write column headers

    for file in file_paths:
        print(f"Processing: {file}")

        # Read file in chunks to save memory
        for chunk in pd.read_csv(file, delimiter=",", chunksize=10000):  # Adjust delimiter if needed
            chunk.to_csv(output, index=False, header=first_file, mode="a")
            first_file = False  # After first file, don’t write headers again
        
        # Remove the processed file to free up space
        os.remove(file)
        print(f"Deleted: {file}")

print(f" Merge completed! CSV saved as '{output_file}'.")


Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_06.txt
Deleted: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_06.txt
Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_07.txt
Deleted: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_07.txt
Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_08.txt
Deleted: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_08.txt
Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_09.txt
Deleted: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_09.txt
Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_10.txt
Deleted: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_10.txt
Processing: C:\Users\attafuro\Desktop\Traffic Analysis\d03_text_station_hour_2024_11.txt
Deleted: C:\Users\attafuro\Desktop\T

In [24]:
# Load the merged file
df = pd.read_csv("C:/Users/attafuro/Desktop/Traffic Analysis/merged_traffic_data.csv")

# Display basic info
df.head()

Unnamed: 0,06/01/2024 00:00:00,308511,3,50,E,ML,3.134,216,100,39,...,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41
0,06/01/2024 00:00:00,308512,3,50,W,ML,3.995,195,0,598.0,...,,,,,,,,,,
1,06/01/2024 00:00:00,311831,3,5,S,OR,,108,100,39.0,...,,,,,,,,,,
2,06/01/2024 00:00:00,311832,3,5,S,FR,,108,100,0.0,...,,,,,,,,,,
3,06/01/2024 00:00:00,311844,3,5,N,OR,,216,100,107.0,...,,,,,,,,,,
4,06/01/2024 00:00:00,311847,3,5,N,OR,,324,100,101.0,...,,,,,,,,,,


In [25]:
# Define the correct column names
column_names = [
    "Timestamp", "Station", "District", "Route", "Direction of Travel", "Lane Type",
    "Station Length", "Samples", "% Observed", "Total Flow", "Avg Occupancy", "Avg Speed",
    "Delay (V_t=35)", "Delay (V_t=40)", "Delay (V_t=45)", "Delay (V_t=50)", "Delay (V_t=55)", "Delay (V_t=60)"
]

In [26]:
# Identify extra columns (Lane N data) and rename accordingly
num_extra_cols = len(df.columns) - len(column_names)
for i in range(1, num_extra_cols // 3 + 1):  
    column_names.extend([
        f"Lane {i} Flow", f"Lane {i} Avg Occ", f"Lane {i} Avg Speed"
    ])

# Apply new column names
df.columns = column_names

# Save the cleaned dataset
cleaned_file_path = "C:/Users/attafuro/Desktop/Traffic Analysis/cleaned_traffic_data.csv"
df.to_csv(cleaned_file_path, index=False)

print(f" Column names fixed! Cleaned data saved as '{cleaned_file_path}'.")

 Column names fixed! Cleaned data saved as 'C:/Users/attafuro/Desktop/Traffic Analysis/cleaned_traffic_data.csv'.


In [27]:
df = pd.read_csv(cleaned_file_path)
df.head()

Unnamed: 0,Timestamp,Station,District,Route,Direction of Travel,Lane Type,Station Length,Samples,% Observed,Total Flow,...,Lane 5 Avg Speed,Lane 6 Flow,Lane 6 Avg Occ,Lane 6 Avg Speed,Lane 7 Flow,Lane 7 Avg Occ,Lane 7 Avg Speed,Lane 8 Flow,Lane 8 Avg Occ,Lane 8 Avg Speed
0,06/01/2024 00:00:00,308512,3,50,W,ML,3.995,195,0,598.0,...,,,,,,,,,,
1,06/01/2024 00:00:00,311831,3,5,S,OR,,108,100,39.0,...,,,,,,,,,,
2,06/01/2024 00:00:00,311832,3,5,S,FR,,108,100,0.0,...,,,,,,,,,,
3,06/01/2024 00:00:00,311844,3,5,N,OR,,216,100,107.0,...,,,,,,,,,,
4,06/01/2024 00:00:00,311847,3,5,N,OR,,324,100,101.0,...,,,,,,,,,,


In [28]:
# Convert 'Timestamp' to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format="%m/%d/%Y %H:%M:%S")

# Define the final selected columns
selected_columns = [
    "Timestamp", "Station", "Route", "Direction of Travel",
    "Total Flow", "Avg Speed", "% Observed","Samples","Lane Type"
]

# Keep only the selected columns
df = df[selected_columns]

In [29]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type
0,2024-06-01,308512,50,W,598.0,63.7,0,195,ML
1,2024-06-01,311831,5,S,39.0,,100,108,OR
2,2024-06-01,311832,5,S,0.0,,100,108,FR
3,2024-06-01,311844,5,N,107.0,,100,216,OR
4,2024-06-01,311847,5,N,101.0,,100,324,OR


In [30]:
# Display initial summary
print(" Initial Data Overview:")
print(df.info())  # Check data types

 Initial Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9495975 entries, 0 to 9495974
Data columns (total 9 columns):
 #   Column               Dtype         
---  ------               -----         
 0   Timestamp            datetime64[ns]
 1   Station              int64         
 2   Route                int64         
 3   Direction of Travel  object        
 4   Total Flow           float64       
 5   Avg Speed            float64       
 6   % Observed           int64         
 7   Samples              int64         
 8   Lane Type            object        
dtypes: datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 652.0+ MB
None


In [22]:
print("\nMissing Values:\n", df.isnull().sum())  # Count missing values


Missing Values:
 Timestamp                    0
Station                      0
Route                        0
Direction of Travel          0
Total Flow              672053
Avg Speed              3603209
% Observed                   0
Samples                      0
Lane Type                    0
dtype: int64


In [32]:
# Fill Total Flow missing values with median of same Route & Lane Type
df['Total Flow'] = df.groupby(['Route', 'Lane Type'])['Total Flow'].transform(lambda x: x.fillna(x.median()))

In [33]:
# Set Avg Speed to 0 where Total Flow is 0
df.loc[df['Total Flow'] == 0, 'Avg Speed'] = 0

In [34]:
# Fill remaining missing Avg Speed values with median of similar Route & Lane Type
df['Avg Speed'] = df.groupby(['Route', 'Lane Type'])['Avg Speed'].transform(lambda x: x.fillna(x.median()))

In [35]:
print(" Missing Values After Cleaning:\n", df.isnull().sum())

 Missing Values After Cleaning:
 Timestamp                  0
Station                    0
Route                      0
Direction of Travel        0
Total Flow                 0
Avg Speed              33601
% Observed                 0
Samples                    0
Lane Type                  0
dtype: int64


In [37]:
# Fill any remaining missing values with the global median Avg Speed
df.loc[:, 'Avg Speed'] = df['Avg Speed'].fillna(df['Avg Speed'].median())

In [38]:
# Optimize numeric columns
df['Total Flow'] = df['Total Flow'].astype('int32')
df['Avg Speed'] = df['Avg Speed'].astype('int32')
df['% Observed'] = df['% Observed'].astype('int16')
df['Samples'] = df['Samples'].astype('int16')
df['Station'] = df['Station'].astype('int32')
df['Route'] = df['Route'].astype('int32')

# Convert categorical columns to category type
df['Direction of Travel'] = df['Direction of Travel'].astype('category')
df['Lane Type'] = df['Lane Type'].astype('category')

In [39]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type
0,2024-06-01,308512,50,W,598,63,0,195,ML
1,2024-06-01,311831,5,S,39,0,100,108,OR
2,2024-06-01,311832,5,S,0,0,100,108,FR
3,2024-06-01,311844,5,N,107,0,100,216,OR
4,2024-06-01,311847,5,N,101,0,100,324,OR


In [40]:
# Extract time features from Timestamp
df['Year'] = df['Timestamp'].dt.year
df['Month'] = df['Timestamp'].dt.month
df['Day'] = df['Timestamp'].dt.day
df['Day_of_Week'] = df['Timestamp'].dt.weekday  # Monday = 0, Sunday = 6
df['Hour'] = df['Timestamp'].dt.hour  

# Check updated DataFrame
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type,Year,Month,Day,Day_of_Week,Hour
0,2024-06-01,308512,50,W,598,63,0,195,ML,2024,6,1,5,0
1,2024-06-01,311831,5,S,39,0,100,108,OR,2024,6,1,5,0
2,2024-06-01,311832,5,S,0,0,100,108,FR,2024,6,1,5,0
3,2024-06-01,311844,5,N,107,0,100,216,OR,2024,6,1,5,0
4,2024-06-01,311847,5,N,101,0,100,324,OR,2024,6,1,5,0


In [41]:
df.loc[:, df.columns != 'Timestamp'] #drop the timstamp column

Unnamed: 0,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type,Year,Month,Day,Day_of_Week,Hour
0,308512,50,W,598,63,0,195,ML,2024,6,1,5,0
1,311831,5,S,39,0,100,108,OR,2024,6,1,5,0
2,311832,5,S,0,0,100,108,FR,2024,6,1,5,0
3,311844,5,N,107,0,100,216,OR,2024,6,1,5,0
4,311847,5,N,101,0,100,324,OR,2024,6,1,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9495970,3423094,99,S,68,64,96,118,ML,2024,12,31,1,23
9495971,3900021,50,E,803,66,67,292,ML,2024,12,31,1,23
9495972,3900022,50,E,509,68,0,0,HV,2024,12,31,1,23
9495973,3900023,50,W,881,67,67,289,ML,2024,12,31,1,23
