# Predicting Wait Times with Lag Features and Differencing

In this notebook, we will build a time series forecasting model to predict wait times. We'll use a linear regression model and engineer features using two common techniques:

1.  **Lag Features**: We will create features that represent past values of the time series. This helps the model understand the state of the system at previous time steps.
2.  **Differencing**: To stabilize the time series and make it easier to model, we will predict the *change* in wait time from one period to the next, rather than the absolute wait time itself.
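
Before touching the real data, a minimal sketch on made-up numbers shows what the two techniques produce. The wait times below are hypothetical and are not taken from `final_df.csv`.

```python
import pandas as pd

# Hypothetical wait times (minutes) at five consecutive time steps.
wait = pd.Series([10.0, 12.0, 15.0, 11.0, 14.0], name="wait_time_minutes")

demo = pd.DataFrame({
    "wait_time_minutes": wait,
    "wait_time_minutes_lag1": wait.shift(1),  # value one step earlier
    "wait_time_minutes_diff": wait.diff(),    # current value minus previous value
})
print(demo)
```

The first row has no predecessor, so both derived columns are `NaN` there. This is why rows are dropped later in the notebook.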

Let's get started!

### 1. Importing Libraries and Loading Data

First, we import the necessary Python libraries, including `pandas` for data manipulation and `scikit-learn` for building the linear regression model. We then load the `final_df.csv` dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the dataset
try:
    df = pd.read_csv('final_df.csv')
except FileNotFoundError as err:
    # Stop here: every later cell depends on the data being loaded.
    raise FileNotFoundError("Please make sure 'final_df.csv' is in the same directory.") from err

# Assuming the unnamed column is an old index, we drop it.
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

df.head()

### 2. Creating Lag Features

Next, we build a lagged version of our dataframe. We create 10 lags for every column, so each row (time step) gains new columns holding the values of all features from the previous 10 time steps. This assumes the rows are already sorted chronologically, because `shift` works on row position, not on timestamps.

In [None]:
# Make a copy of the original dataframe
df_lagged = df.copy()

# Define the size of the trailing window for lagging
trailing_window_size = 10

# Loop to create lagged columns
for window in range(1, trailing_window_size + 1):
    shifted = df.shift(window)
    shifted.columns = [x + "_lag" + str(window) for x in df.columns]
    df_lagged = pd.concat((df_lagged, shifted), axis=1)

# Lagging leaves missing values (NaNs) in the first `trailing_window_size` rows.
# We drop these rows because they cannot be used for training.
# Note that dropna() also removes any later row that has a missing value.
df_lagged = df_lagged.dropna()

print("Shape of the new lagged dataframe:", df_lagged.shape)
df_lagged.head()
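
The same lagging logic can be wrapped in a small helper. The sketch below is one possible formulation, not the notebook's own code: the name `add_lags` and the toy data are hypothetical. It collects the shifted frames in a list and concatenates once at the end instead of on every iteration.

```python
import pandas as pd

def add_lags(frame: pd.DataFrame, n_lags: int) -> pd.DataFrame:
    """Return `frame` plus `n_lags` shifted copies of every column, NaN rows dropped."""
    pieces = [frame]
    for lag in range(1, n_lags + 1):
        # add_suffix renames every column, e.g. "a" -> "a_lag1"
        pieces.append(frame.shift(lag).add_suffix(f"_lag{lag}"))
    # One concat at the end avoids rebuilding the frame on every iteration.
    return pd.concat(pieces, axis=1).dropna()

toy = pd.DataFrame({"wait_time_minutes": [5.0, 6.0, 8.0, 7.0, 9.0]})
lagged_toy = add_lags(toy, n_lags=2)
print(lagged_toy)
```

With two lags, the first two rows are dropped, so five input rows yield three usable rows.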

### 3. Preparing Data for Modeling

Our goal is to predict `wait_time_minutes`. We use the lagged features we just created as our predictors (`X`). For our target (`y`), we use the *difference* in `wait_time_minutes` from one step to the next, which tends to make the target more stationary. A predicted change can be turned back into an absolute wait time by adding it to the wait time of the previous step.

In [None]:
# Only the lagged columns are used as predictors: the unlagged columns describe
# the current time step, which is the step we are trying to predict.
# Matching on '_lag' (the suffix added above) avoids picking up an original
# column whose name merely contains the letters 'lag'.
feature_columns = [col for col in df_lagged.columns if '_lag' in col]
X = df_lagged[feature_columns]

# Create the differenced target variable 'y'
# This calculates the change in wait time from the previous step to the current one.
y = df_lagged['wait_time_minutes'].diff()

# The differencing operation creates one NaN value at the start, which we remove.
# We then select the matching rows of 'X' by index label so the two stay aligned.
y = y.dropna()
X = X.loc[y.index]

print("Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)
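
Misaligned features and targets are a common source of silent errors in this kind of setup. The following sketch repeats the same steps on hypothetical toy data with two lags and checks two things: that `X` and `y` share the same row labels, and that each target value equals the current wait time minus its first lag.

```python
import pandas as pd

# Hypothetical toy series; column names mirror the notebook's naming scheme.
toy = pd.DataFrame({"wait_time_minutes": [5.0, 6.0, 8.0, 7.0, 9.0, 12.0]})
toy["wait_time_minutes_lag1"] = toy["wait_time_minutes"].shift(1)
toy["wait_time_minutes_lag2"] = toy["wait_time_minutes"].shift(2)
toy = toy.dropna()

X_toy = toy[[c for c in toy.columns if "_lag" in c]]
y_toy = toy["wait_time_minutes"].diff().dropna()
X_toy = X_toy.loc[y_toy.index]  # align features to the surviving target rows

# When rows are consecutive, the change equals the current value minus lag 1.
expected = toy.loc[y_toy.index, "wait_time_minutes"] - X_toy["wait_time_minutes_lag1"]
print("Same index:", X_toy.index.equals(y_toy.index))
print("Target matches current minus lag 1:", bool((y_toy == expected).all()))
```

If the second check fails on real data, the rows are probably not consecutive, for example because `dropna()` removed a row in the middle of the series.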

### 4. Splitting Data into Training and Testing Sets

For time series data, it's crucial to split the data chronologically: a random split would let the model train on observations that come after the ones it is tested on. We train on the earlier 80% of the data and test on the most recent 20%.

In [None]:
# Calculate the split point
split_index = int(len(X) * 0.8)

# Split the data by row position (.iloc), independent of the index labels
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))
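
A single 80/20 split gives one estimate of out-of-sample error. One way to get several estimates while still respecting time order is an expanding-window scheme, sketched below in plain Python. The helper name `expanding_window_splits` and its parameters are hypothetical. scikit-learn's `TimeSeriesSplit` offers similar functionality.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs where training always precedes testing."""
    for fold in range(n_folds):
        # Later folds test on later blocks and train on everything before them.
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("Not enough samples for the requested folds.")
        yield list(range(test_start)), list(range(test_start, test_end))

splits = list(expanding_window_splits(n_samples=10, n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each returned index list is positional, so it could be passed to `.iloc` on `X` and `y`.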

### 5. Training the Linear Regression Model

Now we can instantiate and train our `LinearRegression` model using the training data.

In [None]:
# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("Linear Regression model trained successfully!")

### 6. Making Predictions and Evaluating the Model

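With the model trained, we can make predictions on the test set and evaluate its performance using standard regression metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. Because the target is the *change* in wait time, all error values are in terms of that change, not the absolute wait time.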

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print('--- Model Evaluation ---')
print('Mean Absolute Error (MAE):', round(mae, 2))
print('Mean Squared Error (MSE):', round(mse, 2))
print('Root Mean Squared Error (RMSE):', round(rmse, 2))
print('R-squared (R²):', round(r2, 2))
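
Error metrics on a differenced target are easier to interpret next to a naive baseline. A natural one is "the wait time will not change", which means predicting a change of zero at every step. The sketch below compares MAE against that baseline. The arrays are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical actual and predicted changes in wait time (minutes).
y_true_demo = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
y_pred_demo = np.array([0.5, -1.0, 0.0, 2.0, -1.0])

def mae(actual, predicted):
    """Mean absolute error between two equally sized arrays."""
    return float(np.mean(np.abs(actual - predicted)))

model_mae = mae(y_true_demo, y_pred_demo)
# Persistence baseline: predicted change is zero at every step.
baseline_mae = mae(y_true_demo, np.zeros_like(y_true_demo))

print("Model MAE:   ", round(model_mae, 2))
print("Baseline MAE:", round(baseline_mae, 2))
```

In the notebook, the same comparison would use `y_test` and `y_pred` in place of the demo arrays. A model that does not beat the zero-change baseline adds little beyond carrying the last observed wait time forward.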

### 7. Visualizing the Results

Finally, let's create a plot to visually compare the actual changes in wait time versus the predicted changes. This gives us a more intuitive understanding of the model's performance.

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(15, 6))

ax.plot(y_test.values, label='Actual Change in Wait Time', color='dodgerblue', linewidth=2)
ax.plot(y_pred, label='Predicted Change in Wait Time', color='tomato', linestyle='--')

ax.set_title('Actual vs. Predicted Change in Wait Time', fontsize=16)
ax.set_xlabel('Time Steps (in test set)')
ax.set_ylabel('Change in Wait Time (Minutes)')
ax.legend()
ax.grid(True)
plt.show()
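
The plot above shows changes, not wait times. To get a one-step-ahead forecast of the absolute wait time, add each predicted change to the wait time observed one step earlier. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical values: the last observed wait time before each prediction
# (the role played by a lag-1 column) and the model's predicted change.
previous_wait = np.array([20.0, 22.0, 21.0, 25.0])
predicted_change = np.array([1.5, -0.5, 3.0, -2.0])

# One-step-ahead forecast of the absolute wait time.
predicted_wait = previous_wait + predicted_change
print(predicted_wait)
```

In this notebook the equivalent would be `X_test['wait_time_minutes_lag1'].to_numpy() + y_pred`, assuming the lag-1 column carries that name as produced by the lagging loop.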