# Spatio-Temporal Prediction and Coordination of EV Charging Demand for Power System Resilience

## Research Objectives

Recent studies have explored electric vehicles (EVs) from different perspectives, ranging from estimating vehicle range based on battery capacity, model specifications, and internal components (Ahmed et al., 2022) to forecasting charging behavior using machine learning methods such as Random Forest and SVM with factors like previous payment data, weather, and traffic (Shahriar et al., 2020). In parallel, research on smart cities has focused on managing traffic flow efficiently to reduce congestion and energy consumption (Dymora, Mazurek, & Jucha, 2024).

Building on these insights, this study links traffic dynamics with EV energy consumption to better predict when and where charging demand will arise. By integrating spatio-temporal traffic features with deep learning models, the goal is to anticipate EV charging needs in real time and enable coordinated charging strategies that support overall power system resilience.


## Load Required Libraries 

In [34]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Load and Clean the Data 

In [35]:
df = pd.read_csv("cleaned_traffic_data.csv")

## How the data looks directly from PEMS

In [36]:
df.head()

Unnamed: 0,Timestamp,Station,District,Route,Direction of Travel,Lane Type,Station Length,Samples,% Observed,Total Flow,...,Lane 5 Avg Speed,Lane 6 Flow,Lane 6 Avg Occ,Lane 6 Avg Speed,Lane 7 Flow,Lane 7 Avg Occ,Lane 7 Avg Speed,Lane 8 Flow,Lane 8 Avg Occ,Lane 8 Avg Speed
0,10/01/2024 00:00:00,308512,3,50,W,ML,3.995,197,0,497.0,...,,,,,,,,,,
1,10/01/2024 00:00:00,311831,3,5,S,OR,,101,92,27.0,...,,,,,,,,,,
2,10/01/2024 00:00:00,311832,3,5,S,FR,,101,92,78.0,...,,,,,,,,,,
3,10/01/2024 00:00:00,311844,3,5,N,OR,,202,92,43.0,...,,,,,,,,,,
4,10/01/2024 00:00:00,311847,3,5,N,OR,,303,92,73.0,...,,,,,,,,,,


### We remove features that contain only NaN values and keep the remaining features.

In [37]:
# Define the final selected columns
selected_columns = [
    "Timestamp", "Station", "Route", "Direction of Travel",
    "Total Flow", "Avg Speed", "% Observed","Samples","Lane Type"
]

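# Keep only the selected columns; .copy() gives an independent frame so that
# later in-place edits do not act on a view of the original
df = df[selected_columns].copy()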

In [38]:
df

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type
0,10/01/2024 00:00:00,308512,50,W,497.0000,64.1000,0,197,ML
1,10/01/2024 00:00:00,311831,5,S,27.0000,,92,101,OR
2,10/01/2024 00:00:00,311832,5,S,78.0000,,92,101,FR
3,10/01/2024 00:00:00,311844,5,N,43.0000,,92,202,OR
4,10/01/2024 00:00:00,311847,5,N,73.0000,,92,303,OR
...,...,...,...,...,...,...,...,...,...
4114675,12/31/2024 23:00:00,3423094,99,S,68.0000,64.8000,96,118,ML
4114676,12/31/2024 23:00:00,3900021,50,E,803.0000,66.5000,67,292,ML
4114677,12/31/2024 23:00:00,3900022,50,E,509.0000,68.0000,0,0,HV
4114678,12/31/2024 23:00:00,3900023,50,W,881.0000,67.4000,67,289,ML


## Check the data types 

In [39]:
df.dtypes

Timestamp               object
Station                  int64
Route                    int64
Direction of Travel     object
Total Flow             float64
Avg Speed              float64
% Observed               int64
Samples                  int64
Lane Type               object
dtype: object

## Check the Percent of Missing Data in every feature 

In [40]:
pd.set_option('display.float_format', '{:.4f}'.format)

missing_percent = (df.isna().sum() / len(df)) * 100
print(missing_percent)


Timestamp              0.0000
Station                0.0000
Route                  0.0000
Direction of Travel    0.0000
Total Flow             7.3827
Avg Speed             38.4621
% Observed             0.0000
Samples                0.0000
Lane Type              0.0000
dtype: float64


## Imputation Strategy for Key Traffic Variables

We decided to retain both the Average Speed and Total Flow features instead of dropping them because they are core variables that capture the essence of traffic dynamics. Average Speed reflects congestion levels and driving conditions, while Total Flow represents the number of vehicles passing a station—both directly influencing how traffic impacts EV range and, ultimately, charging demand. Dropping them would mean ignoring the very behaviors that determine how energy is consumed on the road. Even though these features had missing values, the patterns in traffic data are strongly structured in time and space, making them ideal candidates for informed imputation rather than removal.

For Average Speed, we applied a two-step temporal–spatial imputation strategy. First, values are forward-filled within each station, which maintains continuity and preserves the natural hourly flow of traffic data. Second, a backward fill over the station-sorted series covers the gaps that forward filling cannot reach, such as missing readings at the start of a station's record. This approach works well because traffic speed rarely changes abruptly from one hour to the next unless influenced by an external event.

For Total Flow, the missingness was much lower (about 7.4% versus 38.5% for Average Speed), so a simpler approach was sufficient. We performed linear interpolation within each station to fill in small hourly gaps, so that flow values remained smooth and representative of actual traffic movement. These imputation steps allowed us to preserve critical information about how vehicles move through the network while limiting the artificial noise or bias that imputation can introduce. By reconstructing rather than discarding incomplete data, we maintained the integrity of the dataset and strengthened the foundation for accurate spatio-temporal modeling of EV charging demand and range prediction.

In [41]:
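# Order rows by station, then time. Timestamps in this extract are zero-padded
# MM/DD/YYYY strings from a single year, so string order matches time order.
df.sort_values(['Station', 'Timestamp'], inplace=True)
# Forward-fill within each station. The chained .bfill() then runs over the whole
# sorted column, so a station with no speed readings at all takes the first
# value of the next station in sort order.
df['Avg Speed'] = df.groupby('Station')['Avg Speed'].ffill().bfill()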
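
The chained call above forward-fills per station, while the trailing `.bfill()` operates on the full column. If values should never cross station boundaries, a strictly per-station variant can be sketched as follows. The toy frame and the helper name `fill_within_station` are assumptions made for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): station 1 has no speed readings at all
toy = pd.DataFrame({
    "Station": [1, 1, 2, 2, 2],
    "Avg Speed": [np.nan, np.nan, np.nan, 55.0, np.nan],
})

def fill_within_station(frame: pd.DataFrame, col: str = "Avg Speed") -> pd.Series:
    """Forward-fill, then backward-fill, without crossing station boundaries."""
    return frame.groupby("Station")[col].transform(lambda s: s.ffill().bfill())

# Chained form: ffill is per station, bfill is applied to the whole column
chained = toy.groupby("Station")["Avg Speed"].ffill().bfill()
# Grouped form: both fills stay inside each station
grouped = fill_within_station(toy)

print(pd.DataFrame({"Station": toy["Station"], "chained": chained, "grouped": grouped}))
```

With the grouped form, a station that never reports speed stays missing, which makes it easier to decide explicitly how such stations (for example ramps) should be handled.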

In [42]:
df['Total Flow'] = df.groupby('Station')['Total Flow'].transform(
    lambda x: x.interpolate(method='linear')
)
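
For reference, linear interpolation in pandas fills interior gaps and, with default settings, also carries the last valid value forward over trailing gaps, while leading gaps stay empty. The sketch below illustrates this on a made-up single-station series; `limit_area="inside"` is one option if only interior gaps should be filled.

```python
import numpy as np
import pandas as pd

# Toy series (illustrative): a leading gap, an interior gap, and a trailing gap
toy = pd.DataFrame({
    "Station": [1] * 6,
    "Total Flow": [np.nan, 10.0, np.nan, np.nan, 40.0, np.nan],
})

# Same call as in the cell above
default_fill = toy.groupby("Station")["Total Flow"].transform(
    lambda x: x.interpolate(method="linear")
)

# Variant that only fills gaps surrounded by valid values
inside_only = toy.groupby("Station")["Total Flow"].transform(
    lambda x: x.interpolate(method="linear", limit_area="inside")
)

print(pd.DataFrame({"default": default_fill, "inside_only": inside_only}))
```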

## How the Data Looks Now

In [43]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type
1827,10/01/2024 01:00:00,308511,50,E,12.0,67.5,100,202,ML
3688,10/01/2024 02:00:00,308511,50,E,12.0,67.0,100,197,ML
5549,10/01/2024 03:00:00,308511,50,E,20.0,66.3,92,197,ML
7410,10/01/2024 04:00:00,308511,50,E,55.0,67.4,100,197,ML
9271,10/01/2024 05:00:00,308511,50,E,228.0,66.1,83,168,ML


# Feature Engineering 

## How do we incorporate and account for the spatial relationships in our data?

### Incorporating Spatial Features into Linear Regression Models

To incorporate spatial features into linear regression models for the PEMS-based traffic data, we first need a systematic way to decide which stations are neighbors. Each record contains Station, Route, and Direction of Travel, which we use as a proxy for physical proximity. We partition the dataset by Route and Direction of Travel, then sort the stations in each partition by Station ID. This ordering leverages the typical installation sequence of PEMS sensors and is supported in transportation literature when actual milepost data are unavailable.

For any station $s$ at time $t$, let the set of spatial neighbors (commonly the immediate upstream and downstream stations) be denoted as $\mathcal{N}(s)$. The flow at station $s$, $y_{s,t}$, is modeled as a function of its own temporal history and the flows of neighboring stations:

$$
y_{s,t} = \beta_0 + \sum_{k=1}^{p} \beta_k x_{s,t-k} + \sum_{j \in \mathcal{N}(s)} \gamma_j y_{j,t-l_j} + \epsilon_{s,t}
$$

Here:
- $x_{s,t-k}$ represents temporal features (including lagged flows at station $s$)  
- $y_{j,t-l_j}$ are flows at adjacent stations $j$, possibly with their own lags $l_j$ (for the neighboring stations we consider $t$, $t-1$, and $t-2$)  
- $\beta_k$ and $\gamma_j$ are regression coefficients  
- $\epsilon_{s,t}$ is the error term  

This formulation captures spatial correlation as conditional dependence between adjacent sites, following spatial autoregressive principles within a linear regression framework.

In practice, using Python (pandas), after grouping by Route and Direction of Travel and sorting by Station, we generate for each observation the "upstream flow" and "downstream flow" variables, optionally at various lags (e.g., current or one-hour prior). These neighbor-based features are then included alongside traditional temporal predictors during model training. This ensures that spatial propagation and congestion effects, which are core to traffic dynamics, are represented in the model.

Even without explicit geo-coordinates, this approach lets the regression model capture spatial dependencies, which should yield more accurate and interpretable traffic flow predictions across the studied transportation corridor.
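
As a rough illustration of the model above, the following sketch fits the coefficients by ordinary least squares on synthetic data. The predictors, the "true" coefficient values, and the noise level are all assumptions chosen for the example; they are not estimates from the PEMS data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (assumed): one own-station lag and two neighbor flows
own_lag1 = rng.normal(size=n)
upstream = rng.normal(size=n)
downstream = rng.normal(size=n)

# Assumed coefficients: beta_0, beta_1, gamma_upstream, gamma_downstream
true_coef = np.array([2.0, 0.6, 0.3, 0.1])

# Design matrix with an intercept column, then the response with small noise
X = np.column_stack([np.ones(n), own_lag1, upstream, downstream])
y = X @ true_coef + rng.normal(scale=0.05, size=n)

# Ordinary least squares estimate of the coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(3))
```

In the real dataset, the columns of `X` would be the lagged flows of the station itself and the upstream/downstream flow features built in the cells below.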


In [44]:
import pandas as pd

grp_keys = ["Route", "Direction of Travel"]

#  Get unique stations per corridor with their spatial rank
corridor_stations = (
    df.groupby(grp_keys)["Station"]
    .unique()
    .apply(sorted)
    .reset_index()
    .rename(columns={"Station": "stations_list"})
)

# Explode to create a lookup table
neighbor_map = corridor_stations.explode("stations_list").reset_index(drop=True)
neighbor_map["station_rank"] = neighbor_map.groupby(grp_keys).cumcount()

# Create upstream/downstream mappings
neighbor_map["upstream_station"] = neighbor_map.groupby(grp_keys)["stations_list"].shift(1)
neighbor_map["downstream_station"] = neighbor_map.groupby(grp_keys)["stations_list"].shift(-1)

#  Merge back to original data
df = df.merge(
    neighbor_map.rename(columns={"stations_list": "Station"}),
    on=grp_keys + ["Station"],
    how="left"
)
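
One way to sanity-check the neighbor mapping is to rebuild it on a handful of stations. The sketch below uses a simpler construction (drop duplicates, sort, shift) that should yield the same upstream/downstream pairs as the cell above. The toy rows reuse a few station IDs from this dataset, but the frame itself is illustrative and covers only part of each corridor.

```python
import pandas as pd

keys = ["Route", "Direction of Travel"]

# Toy records (illustrative), deliberately out of order
toy = pd.DataFrame({
    "Route": [50, 50, 50, 5, 5],
    "Direction of Travel": ["E", "E", "E", "N", "N"],
    "Station": [311903, 308511, 311930, 311847, 311844],
})

# One row per (corridor, station), ordered by Station ID within each corridor
stations = (
    toy[keys + ["Station"]]
    .drop_duplicates()
    .sort_values(keys + ["Station"])
)

# Previous and next station in that order; corridor ends get NaN
by_corridor = stations.groupby(keys)["Station"]
stations["upstream_station"] = by_corridor.shift(1)
stations["downstream_station"] = by_corridor.shift(-1)

print(stations)
```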


In [45]:
# See all unique combinations
df[['Route', 'Direction of Travel', 'Station', 'upstream_station', 'downstream_station']]\
  .drop_duplicates()\
  .sort_values(['Route', 'Direction of Travel', 'Station'])


Unnamed: 0,Route,Direction of Travel,Station,upstream_station,downstream_station
8829,5,N,311844,,311847
11037,5,N,311847,311844,311864
13245,5,N,311864,311847,312133
48573,5,N,312133,311864,312134
50781,5,N,312134,312133,314780
...,...,...,...,...,...
2184496,267,W,320332,319284,
620215,275,W,314530,,316106
1075063,275,W,316106,314530,
3404652,505,N,3085051,,


In [46]:
# Inspect station 311903 as an example of an interior station
df[df['Station'] == 311903].head()


Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type,station_rank,upstream_station,downstream_station
15453,10/01/2024 00:00:00,311903,50,E,1198.0,66.6,0,300,ML,1,308511,311930
15454,10/01/2024 01:00:00,311903,50,E,1085.0,66.3,0,303,ML,1,308511,311930
15455,10/01/2024 02:00:00,311903,50,E,960.0,66.1,0,297,ML,1,308511,311930
15456,10/01/2024 03:00:00,311903,50,E,988.0,66.4,0,297,ML,1,308511,311930
15457,10/01/2024 04:00:00,311903,50,E,1325.0,66.8,0,297,ML,1,308511,311930


In [47]:
def merge_neighbor_flows(df, neighbor_col, new_col_prefix):
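    '''
    Attach each row's neighbor flow at t, t-1, and t-2.

    Assumes rows are sorted by Station and Timestamp with no missing hours,
    because the lags are taken by row position within each neighbor group.

    Parameters:
    -----------
    df : DataFrame with upstream_station/downstream_station columns
    neighbor_col : str, name of the neighbor column ('upstream_station' or 'downstream_station')
    new_col_prefix : str, prefix for new columns ('upstream' or 'downstream')

    Returns:
    --------
    DataFrame with {prefix}_flow, {prefix}_flow_lag1, and {prefix}_flow_lag2 added
    '''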
    
    # Create lookup table
    neighbor_flow = df[['Station', 'Timestamp', 'Total Flow']].copy()
    neighbor_flow.rename(columns={'Station': neighbor_col}, inplace=True)
    
    # Merge current hour (t)
    df = df.merge(
        neighbor_flow.rename(columns={'Total Flow': f'{new_col_prefix}_flow'}),
        on=[neighbor_col, 'Timestamp'],
        how='left'
    )
    
    # Create lag 1 (t-1)
    df[f'{new_col_prefix}_flow_lag1'] = df.groupby(neighbor_col)[
        f'{new_col_prefix}_flow'
    ].shift(1)
    
    # Create lag 2 (t-2)
    df[f'{new_col_prefix}_flow_lag2'] = df.groupby(neighbor_col)[
        f'{new_col_prefix}_flow'
    ].shift(2)
    
    return df


# Now apply the function

# Apply to upstream neighbors
df = merge_neighbor_flows(df, 'upstream_station', 'upstream')

# Apply to downstream neighbors
df = merge_neighbor_flows(df, 'downstream_station', 'downstream')

# Verify the results
print("Spatial features created successfully!")
print(df[[
    'Timestamp', 'Station', 'Total Flow',
    'upstream_flow', 'upstream_flow_lag1', 'upstream_flow_lag2',
    'downstream_flow', 'downstream_flow_lag1', 'downstream_flow_lag2'
]].head(10))

# Check for missing values


Spatial features created successfully!
             Timestamp Station  Total Flow  upstream_flow  upstream_flow_lag1  \
0  10/01/2024 01:00:00  308511     12.0000            NaN                 NaN   
1  10/01/2024 02:00:00  308511     12.0000            NaN                 NaN   
2  10/01/2024 03:00:00  308511     20.0000            NaN                 NaN   
3  10/01/2024 04:00:00  308511     55.0000            NaN                 NaN   
4  10/01/2024 05:00:00  308511    228.0000            NaN                 NaN   
5  10/01/2024 06:00:00  308511    258.0000            NaN                 NaN   
6  10/01/2024 07:00:00  308511    208.0000            NaN                 NaN   
7  10/01/2024 08:00:00  308511    288.0000            NaN                 NaN   
8  10/01/2024 09:00:00  308511    244.0000            NaN                 NaN   
9  10/01/2024 10:00:00  308511    301.0000            NaN                 NaN   

   upstream_flow_lag2  downstream_flow  downstream_flow_lag1  \
0    

In [48]:
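# Drop rows lacking any neighbor feature. Among others, this removes the first
# and last station of each corridor (no upstream or downstream neighbor) and
# the first two hours of each remaining station (no lag-2 value).
df = df.dropna(subset=[
    'upstream_flow', 'upstream_flow_lag1', 'upstream_flow_lag2',
    'downstream_flow', 'downstream_flow_lag1', 'downstream_flow_lag2'
])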
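
Before dropping rows, it can be useful to see how much each spatial feature is missing and which stations disappear entirely. The following is a minimal sketch on a made-up frame; the column subset and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

spatial_cols = ["upstream_flow", "upstream_flow_lag1", "downstream_flow"]

# Toy frame (illustrative): station 1 has no upstream neighbor
toy = pd.DataFrame({
    "Station": [1, 1, 2, 2],
    "upstream_flow": [np.nan, np.nan, 10.0, 12.0],
    "upstream_flow_lag1": [np.nan, np.nan, np.nan, 10.0],
    "downstream_flow": [5.0, 6.0, 7.0, 8.0],
})

# Percentage of missing values per spatial feature
missing_share = toy[spatial_cols].isna().mean() * 100

# Rows that survive the drop, and stations that vanish completely
kept = toy.dropna(subset=spatial_cols)
dropped_stations = set(toy["Station"]) - set(kept["Station"])

print(missing_share)
print(f"rows kept: {len(kept)} of {len(toy)}; stations dropped: {sorted(dropped_stations)}")
```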

In [49]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type,station_rank,upstream_station,downstream_station,upstream_flow,upstream_flow_lag1,upstream_flow_lag2,downstream_flow,downstream_flow_lag1,downstream_flow_lag2
6623,10/01/2024 02:00:00,311832,5,S,26.0,66.6,100,99,FR,1,311831,312132,21.0,24.0,27.0,191.0,175.0,292.0
6624,10/01/2024 03:00:00,311832,5,S,27.0,66.6,92,99,FR,1,311831,312132,34.0,21.0,24.0,248.0,191.0,175.0
6625,10/01/2024 04:00:00,311832,5,S,44.0,66.6,100,99,FR,1,311831,312132,56.0,34.0,21.0,446.0,248.0,191.0
6626,10/01/2024 05:00:00,311832,5,S,70.0,66.6,83,84,FR,1,311831,312132,72.0,56.0,34.0,753.0,446.0,248.0
6627,10/01/2024 06:00:00,311832,5,S,162.0,66.6,100,95,FR,1,311831,312132,160.0,72.0,56.0,1244.0,753.0,446.0


## Including Temporal Features in our Data


### 1. **Autoregressive Lags**
- **Names:** flow_lag_1, flow_lag_2, flow_lag_3, flow_lag_6, flow_lag_12, flow_lag_24
- **Role:** Capture short/intermediate/daily dependencies and persistence in traffic flow.
- **Model inclusion:**
$$ y_t = \beta_0 + \sum_{k \in \{1,2,3,6,12,24\}} \beta_k y_{t-k} + \epsilon_t $$
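
A minimal sketch of how these lag columns could be built with pandas is shown below. It assumes hourly, gap-free records per station; the helper name `add_flow_lags` and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

LAGS = [1, 2, 3, 6, 12, 24]

def add_flow_lags(frame: pd.DataFrame, lags=LAGS) -> pd.DataFrame:
    """Add flow_lag_k columns, shifting within each station only."""
    out = frame.sort_values(["Station", "Timestamp"]).copy()
    for k in lags:
        # Row-based shift: equals a k-hour lag only if no hours are missing
        out[f"flow_lag_{k}"] = out.groupby("Station")["Total Flow"].shift(k)
    return out

# Toy data (illustrative): two stations, 30 consecutive hours each
hours = pd.Timestamp("2024-10-01") + pd.to_timedelta(np.arange(30), unit="h")
toy = pd.DataFrame({
    "Station": [1] * 30 + [2] * 30,
    "Timestamp": list(hours) * 2,
    "Total Flow": list(range(30)) + list(range(100, 130)),
})

lagged = add_flow_lags(toy)
print(lagged[lagged["Station"] == 1].tail(3))
```

Grouping by station before shifting keeps the first hours of one station from inheriting the last hours of another.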

### 2. **Rolling Statistics: Trend and Volatility**
- **Names:** rolling_mean_24h, rolling_std_24h, rolling_min_24h, rolling_max_24h
- **Role:** Quantify average, spread, and extremes over the last day to smooth volatility and capture local behavior.
- **Formulas:**
  - Mean: $$ \text{rolling\_mean\_24h}(t) = \frac{1}{24} \sum_{i=1}^{24} y_{t-i} $$
  - Std Dev: $$ \text{rolling\_std\_24h}(t) = \sqrt{\frac{1}{24} \sum_{i=1}^{24}\left(y_{t-i} - \text{rolling\_mean\_24h}(t)\right)^2} $$
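
The formulas above use the 24 hours before $t$ and divide by 24, so a matching pandas sketch shifts the series by one hour before rolling and uses `ddof=0` (the pandas default for `std` is `ddof=1`). The helper name `add_rolling_24h` and the toy data are assumptions; rows are assumed sorted by time within each station with no missing hours.

```python
import numpy as np
import pandas as pd

def add_rolling_24h(frame: pd.DataFrame, col: str = "Total Flow", window: int = 24) -> pd.DataFrame:
    """Add rolling statistics over y_{t-1} ... y_{t-24}, computed per station."""
    out = frame.copy()
    g = out.groupby("Station")[col]
    # shift(1) excludes the current hour so the window matches the formulas
    out["rolling_mean_24h"] = g.transform(lambda s: s.shift(1).rolling(window).mean())
    out["rolling_std_24h"] = g.transform(lambda s: s.shift(1).rolling(window).std(ddof=0))
    out["rolling_min_24h"] = g.transform(lambda s: s.shift(1).rolling(window).min())
    out["rolling_max_24h"] = g.transform(lambda s: s.shift(1).rolling(window).max())
    return out

# Toy data (illustrative): one station, flow rising by 1 each hour
toy = pd.DataFrame({
    "Station": [1] * 30,
    "Total Flow": np.arange(30, dtype=float),
})

rolled = add_rolling_24h(toy)
print(rolled.iloc[22:26])
```

The first 24 rows of each station have no complete window and stay missing.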

### 3. **Periodicity Features (Cyclic Encoding)**
- **Names:** hour_sin, hour_cos, dow_sin, dow_cos, is_weekend, is_peak_hour
- **Role:** Represent daily and weekly periodicities.
- **Formulas:**
  - Hour: $$ \text{hour\_sin}_t = \sin\left(\frac{2\pi h_t}{24}\right), \quad \text{hour\_cos}_t = \cos\left(\frac{2\pi h_t}{24}\right) $$
  - DOW: $$ \text{dow\_sin}_t = \sin\left(\frac{2\pi d_t}{7}\right), \quad \text{dow\_cos}_t = \cos\left(\frac{2\pi d_t}{7}\right) $$
  - Binary: is_weekend = 1 on weekends, is_peak_hour = 1 during commuting hours
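
These encodings can be sketched as follows. The timestamp format matches the strings shown earlier in this notebook. The text does not specify the commuting hours, so the `peak_hours` default (7 to 9 and 16 to 18) is purely an assumption for illustration, as is the helper name `add_calendar_features`.

```python
import numpy as np
import pandas as pd

def add_calendar_features(
    frame: pd.DataFrame,
    ts_col: str = "Timestamp",
    peak_hours=(7, 8, 9, 16, 17, 18),  # assumed commuting hours, not from the source
) -> pd.DataFrame:
    """Add cyclic hour/day-of-week encodings and two binary calendar flags."""
    out = frame.copy()
    ts = pd.to_datetime(out[ts_col], format="%m/%d/%Y %H:%M:%S")
    hour = ts.dt.hour
    dow = ts.dt.dayofweek  # Monday = 0, ..., Sunday = 6
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    out["is_weekend"] = (dow >= 5).astype(int)
    out["is_peak_hour"] = hour.isin(list(peak_hours)).astype(int)
    return out

# Toy timestamps (illustrative): a Tuesday at 00:00 and 06:00, a Saturday at 08:00
toy = pd.DataFrame({
    "Timestamp": ["10/01/2024 00:00:00", "10/01/2024 06:00:00", "10/05/2024 08:00:00"]
})
feats = add_calendar_features(toy)
print(feats)
```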

### 4. **Coefficient of Variation (CV_24h)**
- **Name:** cv_24h
- **Definition:** Ratio of standard deviation to mean over a 24-hour window:
  - $$ \text{cv\_24h}(t) = \frac{\text{rolling\_std\_24h}(t)}{\text{rolling\_mean\_24h}(t)} $$
- **Role:** Quantifies relative volatility; high CV signals instability in traffic flow. Used for diagnosing traffic state (stable, congested, fluctuating).
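
A minimal sketch of the ratio, assuming the `rolling_mean_24h` and `rolling_std_24h` columns already exist. A window with zero mean flow would divide by zero, so this sketch leaves the CV undefined there; that choice and the helper name `add_cv_24h` are assumptions.

```python
import numpy as np
import pandas as pd

def add_cv_24h(frame: pd.DataFrame) -> pd.DataFrame:
    """Add cv_24h = rolling_std_24h / rolling_mean_24h, NaN where the mean is zero."""
    out = frame.copy()
    mean = out["rolling_mean_24h"]
    # Replace zero means with NaN so the ratio is undefined instead of infinite
    out["cv_24h"] = out["rolling_std_24h"] / mean.where(mean != 0)
    return out

# Toy values (illustrative)
toy = pd.DataFrame({
    "rolling_mean_24h": [100.0, 0.0, 50.0],
    "rolling_std_24h": [20.0, 0.0, 25.0],
})
cv = add_cv_24h(toy)["cv_24h"]
print(cv)
```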

## Mathematical and Applied Justification
- **Autoregressive lags** capture natural persistence and delayed effects in traffic, standard in time-series analysis.
- **Rolling statistics** (mean, std, min, max, CV) smooth local fluctuations and allow the model to react to recent volatility, supporting more robust predictions.
- **Cyclic features** reflect the inherent periodicity in urban traffic, improving fit and interpretability while preserving the adjacency of hour 23 and hour 0 (and of Sunday and Monday) that integer or one-hot hour/day encodings lose.
- **Coefficient of variation** is widely used in transportation for characterizing the steadiness of flows and identifying transition states between free-flow and congestion.

