### Imports and CUDA

In [62]:
# Matplotlib
import requests
import matplotlib.pyplot as plt
# Numpy
import numpy as np
# Torch
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import pandas as pd

In [63]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


### Objective

#### To develop a model that predicts taxi availability within a specific area for the next three hours. This means that if the model is run at 12 PM, it will provide predicted taxi availability for 1 PM, 2 PM, and 3 PM.

The area of interest is defined by the following geographical boundaries:

    North: 1.35106
    South: 1.32206
    East: 103.97839
    West: 103.92805

To identify the taxis currently available within this region, we use the TaxiAvailabilityScript.py.

This script collects real-time data, which serves as input for our predictive model.

By leveraging historical taxi availability trends and real-time data, our model aims to provide accurate forecasts, helping commuters, ride-hailing services, and urban planners make informed decisions.


# **To-Do List for Taxi Availability Prediction**

## **Step 1: Cleaning the Taxi Availability Data**
The first step involves retrieving and preprocessing the taxi availability dataset. The dataset consists of the following columns:

1. **DateTime**  
2. **Taxi Available Throughout Singapore**  
3. **Taxi Available in Selected Box Area**  
4. **Coordinates[]**  

For our specific use case, **the coordinates column will not be used for now**.  

To prepare the data for the neural network:  
- **Inputs:** We will use `DateTime` and `Taxi Available Throughout Singapore` as features.  
- **Output:** `Taxi Available in Selected Box Area` will be the target variable.  
- **DateTime Conversion:** Since `DateTime` is not in a format suitable for neural networks, we will extract relevant features:
  - **IsWeekend**: A binary feature (1 if it's a weekend, 0 otherwise).  
  - **Hour**: Transformed into a numerical value between **1 and 24** (avoiding 0, which may cause training issues).  

---

## **Step 2: Adding Additional Features**  
*(Partially completed; will be refined over time)*  

Aside from the existing columns, we aim to incorporate additional features that may improve prediction accuracy:  

1. **ERP Rates (Electronic Road Pricing) at the given time and location**  
   - Uncertain if this will significantly impact predictions. Further analysis is needed.  

2. **Number of LTA (Land Transport Authority) gantry locations**  
   - Again, its usefulness remains uncertain—further evaluation required.  

3. **Traffic Incidents in the Selected Area**  
   - A script (`TrafficIncidentScript.py`) has been written to update `traffic_incident.csv` with the latest traffic incidents.  
   - Over time, as the dataset grows, we hope this feature will become useful.  

4. **Number of Taxi Stands in the Area**  
   - Currently **not useful** because our area of interest is fixed.  
   - However, if we allow dynamic selection of areas in the future, this could become relevant.  

5. **Temperature at a Given Time and Date** *(To be implemented)*  

6. **Rainfall Data** *(To be implemented)*  

To ensure all features align properly, we will **synchronize all datasets based on DateTime** before feeding them into the model.  

---

## **Step 3: Creating the Training-Test Split**  
- Initially, we will perform an **80/20 Training-Test split** for simplicity.  
- In the future, we may introduce a **Training-Validation-Test split** to further refine model performance.  

---

## **Step 4: Building the Model**  
We will begin with an **LSTM model**, as LSTMs are well-suited for time-series forecasting.  
- **Initial Limitation:** The model, in its basic form, will only predict the next hour.  
- **Future Improvement:** A **sliding window approach** will be explored and implemented to extend predictions further.  

---

## **Step 5: Model Evaluation and Improvement**  
- After the initial model is trained, we will assess its performance.  
- Based on results, we will explore potential improvements, such as hyperparameter tuning, architectural modifications, or additional feature engineering.  

---

This structured approach will guide the development of a robust and accurate taxi availability prediction model. 🚖💡


## **Preparing the taxi_availability data here.**

Normalization of certain inputs are done as well, but I am unsure if it is the right thing to do as well.

In [77]:
taxi_availability_file_path = "taxi_availability.csv"

taxi_df = pd.read_csv(taxi_availability_file_path, delimiter=",")
# Save this just in case we need it in the future.
taxi_df_coordinates = taxi_df["Coordinates[]"]
taxt_df_datetime = taxi_df["DateTime"]
taxi_df = taxi_df.drop(columns = "Coordinates[]")

taxi_df["DateTime"] = pd.to_datetime(taxi_df["DateTime"])


taxi_df["IsWeekend"] = (taxi_df["DateTime"].dt.weekday >= 5).astype(int)
taxi_df["Hour"] = taxi_df["DateTime"].dt.hour + 1  # Convert 0-23 to 1-24
taxi_df = taxi_df.drop(columns = "DateTime")

#it takes 23:59:59 as midnight, same for every DateTime value.
print(taxi_df.iloc[0])


#---------------Normalise-----------------------
# Drop 'DateTime' as it's no longer needed

# Normalize the 'Hour' and 'IsWeekend' columns (if needed)
taxi_df = taxi_df[:5120]
print(taxi_df.shape)
scaler = MinMaxScaler()
taxi_df[["Hour", "IsWeekend"]] = scaler.fit_transform(taxi_df[["Hour", "IsWeekend"]])
taxi_df = taxi_df.apply(pd.to_numeric, errors='coerce')
taxi_df_output  = taxi_df["Taxi Available in Selected Box Area"]
taxi_df = taxi_df.drop(columns = "Taxi Available in Selected Box Area")

# Convert to a tensor
tensor_input_data = torch.tensor(taxi_df.values, dtype=torch.float32)
tensor_output_data = torch.tensor(taxi_df_output, dtype=torch.float32)

# Print first row for verification
# print(taxi_df.head())   # To check the first few rows of the DataFrame
print("Feature Size: ",tensor_input_data.shape)
print("Target Size: ",tensor_output_data.shape)

Taxi Available throughout SG           1924
Taxi Available in Selected Box Area      79
IsWeekend                                 0
Hour                                     24
Name: 0, dtype: object
(5120, 4)
Feature Size:  torch.Size([5120, 3])
Target Size:  torch.Size([5120])


## **Splitting the data, using DataLoader 80/20** 

In [65]:
train_size = int(len(tensor_input_data) * 0.8)  # 80% for training
X_train, X_test = tensor_input_data[:train_size], tensor_input_data[train_size:]
y_train, y_test = tensor_output_data[:train_size], tensor_output_data[train_size:]

# Print the shapes to verify the split
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

batch_size = 64  # You can adjust the batch size as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Example of accessing a batch of data
for inputs, targets in train_loader:
    print(f'Inputs: {inputs.shape}, Targets: {targets.shape}')
    
    break  # Only print the first batch for verification

torch.Size([4096, 3]) torch.Size([1024, 3])
torch.Size([4096]) torch.Size([1024])
Inputs: torch.Size([64, 3]), Targets: torch.Size([64])


In [66]:
class LSTM_pt(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTM_pt, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        
        # LSTM cell
        self.lstm = torch.nn.LSTM(input_size, hidden_size, num_layers = self.num_layers, batch_first = True)
        
        # Linear layer for final prediction
        self.linear = torch.nn.Linear(hidden_size, output_size)

    def forward(self, inputs, cell_state=None, hidden_state=None):
        # Forward pass through the LSTM cell
        if hidden_state is None or cell_state is None:
            hidden_state = torch.zeros(1, inputs.size(0), 20).to(device)
            cell_state = torch.zeros(1, inputs.size(0), 20).to(device)
        hidden = cell_state, hidden_state
        output, new_memory = self.lstm(inputs, hidden)
        cell_state, hidden_state = new_memory
        output = self.linear(output[:, -1, :])  # Take only the last time step
        return output, cell_state, hidden_state, # Return correct order

In [81]:
def train(model, dataloader,num_layers, hidden_size, num_epochs, learning_rate, device, ):
    # Set the loss function and optimizer
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode
        optimizer.zero_grad()
        loss=0
        counter = 0
        hidden_state, cell_state = None, None   
        for inputs, targets in dataloader: 
            counter +=1
            inputs = inputs.unsqueeze(1).to(device)  # Reshape inputs
        
            targets = targets.to(device)
            # Initialize hidden state and cell state for each batch
            # Forward pass
            output, hidden_state, cell_state = model(inputs.to(device),cell_state, hidden_state)            # Zero the gradients

            # Calculate loss
            loss += criterion(output, targets.unsqueeze(1))             

        loss /= len(dataloader)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        if epoch %100 == 0:
            print(f"Epoch {epoch + 1}/{num_epochs}, Loss: { loss.item()/ len(dataloader):.4f}")
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: { loss.item()/ len(dataloader):.4f}")



In [82]:
# Define the model parameters
# Following the research paper's instructions
input_size = 3
hidden_size = 20
num_layers = 1 # Can be changed to stack multiple LSTM layers!
output_size = 1
dataloader = train_loader
#Create the model
model = LSTM_pt(input_size, hidden_size, num_layers, output_size).to(device)
train(model, dataloader,num_layers, hidden_size, num_epochs = 500, learning_rate = 0.1, device = device)


Epoch 1/500, Loss: 157.5854
Epoch 101/500, Loss: 21.5475
Epoch 201/500, Loss: 17.4772
Epoch 301/500, Loss: 16.7253
Epoch 401/500, Loss: 16.7857
Epoch 500/500, Loss: 16.6897
