### Imports and CUDA

In [49]:
# Matplotlib
import requests
import matplotlib.pyplot as plt
# Numpy
import numpy as np
# Torch
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import pandas as pd

In [50]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


### Objective

#### To develop a model that predicts taxi availability within a specific area for the next three hours. This means that if the model is run at 12 PM, it will provide predicted taxi availability for 1 PM, 2 PM, and 3 PM.

The area of interest is defined by the following geographical boundaries:

    North: 1.35106
    South: 1.32206
    East: 103.97839
    West: 103.92805

To identify the taxis currently available within this region, we use the TaxiAvailabilityScript.py.

This script collects real-time data, which serves as input for our predictive model.

By leveraging historical taxi availability trends and real-time data, our model aims to provide accurate forecasts, helping commuters, ride-hailing services, and urban planners make informed decisions.


# **To-Do List for Taxi Availability Prediction**

## **Step 1: Cleaning the Taxi Availability Data**
The first step involves retrieving and preprocessing the taxi availability dataset. The dataset consists of the following columns:

1. **DateTime**  
2. **Taxi Available Throughout Singapore**  
3. **Taxi Available in Selected Box Area**  
4. **Coordinates[]**  

For our specific use case, **the coordinates column will not be used for now**.  

To prepare the data for the neural network:  
- **Inputs:** We will use `DateTime` and `Taxi Available Throughout Singapore` as features.  
- **Output:** `Taxi Available in Selected Box Area` will be the target variable.  
- **DateTime Conversion:** Since `DateTime` is not in a format suitable for neural networks, we will extract relevant features:
  - **IsWeekend**: A binary feature (1 if it's a weekend, 0 otherwise).  
  - **Hour**: Transformed into a numerical value between **1 and 24** (avoiding 0, which may cause training issues).  

---

## **Step 2: Adding Additional Features**  
*(Partially completed; will be refined over time)*  

Aside from the existing columns, we aim to incorporate additional features that may improve prediction accuracy:  

1. **ERP Rates (Electronic Road Pricing) at the given time and location**  
   - Uncertain if this will significantly impact predictions. Further analysis is needed.  

2. **Number of LTA (Land Transport Authority) gantry locations**  
   - Again, its usefulness remains uncertain—further evaluation required.  

3. **Traffic Incidents in the Selected Area**  
   - A script (`TrafficIncidentScript.py`) has been written to update `traffic_incident.csv` with the latest traffic incidents.  
   - Over time, as the dataset grows, we hope this feature will become useful.  

4. **Number of Taxi Stands in the Area**  
   - Currently **not useful** because our area of interest is fixed.  
   - However, if we allow dynamic selection of areas in the future, this could become relevant.  

5. **Temperature at a Given Time and Date** *(To be implemented)*  

6. **Rainfall Data** *(To be implemented)*  

To ensure all features align properly, we will **synchronize all datasets based on DateTime** before feeding them into the model.  

---

## **Step 3: Creating the Training-Test Split**  
- Initially, we will perform an **80/20 Training-Test split** for simplicity.  
- In the future, we may introduce a **Training-Validation-Test split** to further refine model performance.  

---

## **Step 4: Building the Model**  
We will begin with an **LSTM model**, as LSTMs are well-suited for time-series forecasting.  
- **Initial Limitation:** The model, in its basic form, will only predict the next hour.  
- **Future Improvement:** A **sliding window approach** will be explored and implemented to extend predictions further.  

---

## **Step 5: Model Evaluation and Improvement**  
- After the initial model is trained, we will assess its performance.  
- Based on results, we will explore potential improvements, such as hyperparameter tuning, architectural modifications, or additional feature engineering.  

---

This structured approach will guide the development of a robust and accurate taxi availability prediction model. 🚖💡


## **Preparing the taxi_availability data here.**

Normalization of certain inputs are done as well, but I am unsure if it is the right thing to do as well.

In [51]:
taxi_availability_file_path = "taxi_availability.csv"

taxi_df = pd.read_csv(taxi_availability_file_path, delimiter=",")
# Save this just in case we need it in the future.
taxi_df_coordinates = taxi_df["Coordinates[]"]
taxt_df_datetime = taxi_df["DateTime"]
taxi_df = taxi_df.drop(columns = "Coordinates[]")

taxi_df["DateTime"] = pd.to_datetime(taxi_df["DateTime"])


taxi_df["IsWeekend"] = (taxi_df["DateTime"].dt.weekday >= 5).astype(int)
taxi_df["Hour"] = taxi_df["DateTime"].dt.hour + 1  # Convert 0-23 to 1-24
taxi_df = taxi_df.drop(columns = "DateTime")

#it takes 23:59:59 as midnight, same for every DateTime value.
print(taxi_df.iloc[0])


#---------------Normalise-----------------------
# Drop 'DateTime' as it's no longer needed

# Normalize the 'Hour' and 'IsWeekend' columns (if needed)
taxi_df = taxi_df[:5120]
print(taxi_df.shape)
scaler = MinMaxScaler()
taxi_df[["Hour", "IsWeekend"]] = scaler.fit_transform(taxi_df[["Hour", "IsWeekend"]])
taxi_df = taxi_df.apply(pd.to_numeric, errors='coerce')
taxi_df_output  = taxi_df["Taxi Available in Selected Box Area"]
taxi_df = taxi_df.drop(columns = "Taxi Available in Selected Box Area")

# Convert to NumPy arrays
input_data = taxi_df.values  # Shape: (5120, num_features)
output_data = taxi_df_output.values  # Shape: (5120,)

# Define sequence length
seq_length = 48

# Function to create sequences
def create_sequences(data, labels, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        xs.append(data[i : i + seq_length])  # Input sequence
        ys.append(labels[i + seq_length])   # Corresponding target
    return np.array(xs), np.array(ys)


Taxi Available throughout SG           1924
Taxi Available in Selected Box Area      79
IsWeekend                                 0
Hour                                     24
Name: 0, dtype: object
(5120, 4)


In [52]:
X, y = create_sequences(input_data, output_data, seq_length)

# 4️⃣ Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)  # Add channel dim
y = torch.tensor(y, dtype=torch.float32)  # Convert to tensor

# Split into training (80%) and testing (20%)
train_size = int(0.8 * len(X))
test_size = len(X) - train_size

trainX, testX = X[:train_size], X[train_size:]
trainY, testY = y[:train_size], y[train_size:]

# Convert to PyTorch tensors
trainX = torch.tensor(trainX, dtype=torch.float32)  # Shape: (train_size, 24, num_features)
trainY = torch.tensor(trainY, dtype=torch.float32)  # Shape: (train_size,)
testX = torch.tensor(testX, dtype=torch.float32)    # Shape: (test_size, 24, num_features)
testY = torch.tensor(testY, dtype=torch.float32)    # Shape: (test_size,)

train_dataset = TensorDataset(trainX, trainY)
test_dataset = TensorDataset(testX, testY)

batch_size = 64  # You can adjust the batch size as needed
#Shuffle is false because you need it to be in time sequenced, but please double check
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Example of accessing a batch of data
for inputs, targets in train_loader:
    print(f'Inputs: {inputs.shape}, Targets: {targets.shape}')
    break  # Only print the first batch for verification

Inputs: torch.Size([64, 48, 3]), Targets: torch.Size([64])


  trainX = torch.tensor(trainX, dtype=torch.float32)  # Shape: (train_size, 24, num_features)
  trainY = torch.tensor(trainY, dtype=torch.float32)  # Shape: (train_size,)
  testX = torch.tensor(testX, dtype=torch.float32)    # Shape: (test_size, 24, num_features)
  testY = torch.tensor(testY, dtype=torch.float32)    # Shape: (test_size,)


In [58]:
class LSTM_pt(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTM_pt, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = torch.nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, output_dim)

    def forward(self, x, h0=None, c0=None):
        if h0 is None or c0 is None:
            h0 = torch.randn(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
            c0 = torch.randn(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
        
        out, (hn, cn) = self.lstm(x, (h0, c0))
        out = self.fc(out[:, -1, :])
        return out, hn, cn

In [59]:
def train(model, dataloader,num_layers, hidden_size, num_epochs, learning_rate, device, ):
    # Set the loss function and optimizer
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    model.train()  # Set the model to training mode
    
    for epoch in range(num_epochs):
        loss=0
        hidden_state, cell_state = None, None   
        optimizer.zero_grad() #Double check if this is the right position
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            if batch_idx == len(train_loader) - 1:  
                break  # Skip the last batch
            # Initialize hidden state and cell state for each batch
            # Forward pass
            output, cell_state, hidden_state = model(inputs,cell_state, hidden_state)
            # Calculate loss
            loss = criterion(output, targets)
            loss.backward()
            optimizer.step()

            hidden_state = hidden_state.detach()
            cell_state = cell_state.detach()
        if epoch %100 == 0:
            print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss/len(dataloader)}')
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss/len(dataloader)}')

In [60]:
# Define the model parameters
# Following the research paper's instructions
input_size = 3
hidden_size = 20
num_layers = 1 # Can be changed to stack multiple LSTM layers!
output_size = 1
dataloader = train_loader
#Create the model
model = LSTM_pt(input_size, hidden_size, num_layers, output_size).to(device)
train(model, dataloader,num_layers, hidden_size, num_epochs = 200, learning_rate = 0.01, device = device)


  return F.mse_loss(input, target, reduction=self.reduction)


Epoch 1/200, Loss: 144.8883056640625
Epoch 101/200, Loss: 11.485030174255371
Epoch 200/200, Loss: 11.521540641784668


In [61]:

model.eval()

# Initialize variables to track loss
loss_value = 0
num_batches = 0

# Define the loss function
criterion = torch.nn.MSELoss()

# Initialize hidden state and cell state
hidden_state, cell_state = None, None  

# Disable gradient computation for validation
with torch.no_grad():
    for batch_idx, (inputs, targets) in enumerate(test_loader):
        if batch_idx == len(test_loader) - 1:  
            break  # Skip the last batch
        # Forward pass
        output, cell_state, hidden_state = model(inputs, cell_state, hidden_state)


        # Compute loss
        # Compute loss
        loss_value += criterion(output, targets)
# Compute average loss
loss_value = loss_value / (len(test_loader) -1 )
print("Predicted output: ", output)
print("True Output: ", targets)
# Print validation results
print(f'Average Validation Loss: {loss_value:.4f}')


Predicted output:  tensor([[95.8804],
        [92.7523],
        [93.7073],
        [94.8243],
        [91.0994],
        [92.5596],
        [91.2717],
        [95.9914],
        [93.0093],
        [92.7006],
        [92.0134],
        [94.5887],
        [93.3420],
        [95.9683],
        [96.0888],
        [94.1514],
        [93.0915],
        [92.4650],
        [91.6161],
        [95.1799],
        [95.9402],
        [92.3965],
        [95.1546],
        [95.8049],
        [95.0173],
        [92.4342],
        [91.3179],
        [91.5161],
        [91.5622],
        [95.6537],
        [94.2433],
        [92.6506],
        [94.1834],
        [95.9352],
        [92.8676],
        [94.7181],
        [95.1253],
        [93.2887],
        [93.1371],
        [94.0924],
        [92.3436],
        [96.0835],
        [92.0413],
        [92.4805],
        [94.7472],
        [92.6899],
        [94.6861],
        [92.8960],
        [92.6380],
        [91.5986],
        [96.1092],
        [96.

In [57]:

# model.eval()

# # Initialize variables to track loss
# loss_value = 0
# num_batches = 0

# # Define the loss function
# criterion = torch.nn.MSELoss()

# # Initialize hidden state and cell state
# hidden_state, cell_state = None, None  

# # Disable gradient computation for validation
# with torch.no_grad():
#     for batch_idx, (inputs, targets) in enumerate(test_loader):
#         if batch_idx == len(test_loader) - 1:  
#             break  # Skip the last batch
#         # Forward pass
#         output, cell_state, hidden_state = model(inputs, cell_state, hidden_state)
#         print("Predicted output: ", output)
#         print("True Output: ", targets)

#         # Compute loss
#         # Compute loss
#         loss_value += criterion(output, targets)
# # Compute average loss
# loss_value = loss_value / (len(test_loader) -1 )

# # Print validation results
# print(f'Average Validation Loss: {loss_value:.4f}')
