# Predicting the level of busyness of a train

The initial focus is on CrossCountry trains at Reading; the analysis may later be extended to other stations and to other train lines from Reading. Credit goes to RealTimeTrains for their data. Because I was not able to get access to crowding or passenger count data, I am taking an unsupervised learning approach.

This is an initial version; a fully comprehensive version would probably be better built by an organisation with access to real-time, granular passenger data than by a single person.

Seats and standing capacity per CrossCountry train type:

- Class 220:
  - 4 cars: 26 first-class, 174 standard, 34 standing
- Class 221:
  - 4 cars: 26 first-class, 160 standard, 32 standing
  - 5 cars: 26 first-class, 220 standard, 44 standing

For standing capacity I'm using the estimate for CrossCountry trains given in the notes and definitions for the rail passenger numbers and crowding statistics (https://www.gov.uk/government/statistics/rail-passenger-numbers-and-crowding-on-weekdays-in-major-cities-in-england-and-wales-2024/rail-passenger-numbers-and-crowding-statistics-notes-and-definitions): 20% of standard class seats. The standing figures listed here are that 20% rounded down to a whole number.
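As a quick check of the 20% rule, the sketch below recomputes the standing figures from the standard class seat counts. The function name `standing_capacity` is hypothetical, and rounding down is an assumption inferred from the listed figures (174 × 0.2 = 34.8 is recorded as 34).

```python
# Assumption: standing capacity is 20% of standard class seats, rounded down.
STANDING_SHARE_PERCENT = 20


def standing_capacity(standard_seats: int) -> int:
    """Return the standing allowance for a given number of standard seats."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return standard_seats * STANDING_SHARE_PERCENT // 100


standard_seats_by_type = {
    "Class 220": 174,
    "Class 221 4 car": 160,
    "Class 221 5 car": 220,
}

standing_by_type = {
    train_type: standing_capacity(seats)
    for train_type, seats in standard_seats_by_type.items()
}
print(standing_by_type)
```

The results match the standing figures quoted in the list (34, 32 and 44).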

In [1]:
import requests
import pandas as pd
import numpy as np
import json
from requests.auth import HTTPBasicAuth
from datetime import datetime
import os

In [2]:
pd.set_option('display.max_colwidth', None)

In [3]:
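# Read the RealTimeTrains API credentials from the environment rather than
# hardcoding them. The variable names RTT_USERNAME and RTT_PASSWORD are assumed.
username = os.environ['RTT_USERNAME']
password = os.environ['RTT_PASSWORD']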

In [4]:
xc_configs = {
    'Class 220':{'standard_class_seats': 174, 'first_class_seats': 26, 'standing_cap': 34},
    'Class 221 4 car':{'standard_class_seats': 160, 'first_class_seats': 26, 'standing_cap': 32},
    'Class 221 5 car':{'standard_class_seats': 220, 'first_class_seats': 26, 'standing_cap': 44}
}

xc_config_df = pd.DataFrame.from_dict(xc_configs, orient='index')
xc_config_df.reset_index(inplace=True, names='train_type')
xc_config_df.head()

        train_type  standard_class_seats  first_class_seats  standing_cap
0        Class 220                   174                 26            34
1  Class 221 4 car                   160                 26            32
2  Class 221 5 car                   220                 26            44
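One way these figures might be used later is as a total capacity per train type, against which an estimated load could be compared. The sketch below is plain Python rather than pandas so it runs on its own. The helper `total_capacity` is hypothetical, and counting first class seats towards the total is an assumption, not something the notebook has settled.

```python
xc_configs = {
    "Class 220": {"standard_class_seats": 174, "first_class_seats": 26, "standing_cap": 34},
    "Class 221 4 car": {"standard_class_seats": 160, "first_class_seats": 26, "standing_cap": 32},
    "Class 221 5 car": {"standard_class_seats": 220, "first_class_seats": 26, "standing_cap": 44},
}


def total_capacity(config: dict) -> int:
    """Sum seated and standing capacity for one train configuration."""
    # Assumption: first class seats count towards overall capacity.
    return (
        config["standard_class_seats"]
        + config["first_class_seats"]
        + config["standing_cap"]
    )


total_by_type = {name: total_capacity(cfg) for name, cfg in xc_configs.items()}
print(total_by_type)
```

With the DataFrame from the cell above, the same sum could be added as a column by adding the three capacity columns together.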


## Collecting service data

In [6]:
month = '11'
day = '11'

url = f'https://api.rtt.io/api/v1/json/search/RDG/2025/{month}/{day}'
response = requests.get(url, auth=HTTPBasicAuth(username, password))
# Stop on an HTTP error; otherwise `data` would be undefined in the lines below.
response.raise_for_status()
data = response.json()  # The API returns JSON

all_services_data = pd.json_normalize(data['services'])

all_services_data.head()
all_services_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 664 entries, 0 to 663
Data columns (total 39 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   serviceUid                                 664 non-null    object
 1   runDate                                    664 non-null    object
 2   trainIdentity                              664 non-null    object
 3   runningIdentity                            614 non-null    object
 4   atocCode                                   664 non-null    object
 5   atocName                                   664 non-null    object
 6   serviceType                                664 non-null    object
 7   isPassenger                                664 non-null    bool  
 8   locationDetail.realtimeActivated           614 non-null    object
 9   locationDetail.tiploc                      664 non-null    object
 10  locationDetail.crs                    
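The request above covers a single day. To collect several days, the same URL pattern could be generated from `datetime.date` objects, which also keeps the month and day zero-padded. This is a minimal sketch: the URL pattern is copied from the cell above, while the helper names `search_url` and `search_urls` and the example date range are hypothetical.

```python
from datetime import date, timedelta

# URL prefix taken from the request in the cell above.
BASE_URL = "https://api.rtt.io/api/v1/json/search"


def search_url(station: str, day: date) -> str:
    """Build the search URL for one station and one day."""
    # %m and %d are zero-padded, matching the '11'/'11' strings used above.
    return f"{BASE_URL}/{station}/{day:%Y/%m/%d}"


def search_urls(station: str, start: date, end: date) -> list[str]:
    """Build one search URL per day from start to end, inclusive."""
    n_days = (end - start).days
    return [search_url(station, start + timedelta(days=i)) for i in range(n_days + 1)]


urls = search_urls("RDG", date(2025, 11, 10), date(2025, 11, 12))
for u in urls:
    print(u)
```

Each URL could then be requested in turn and the resulting frames concatenated, for example with `pd.concat`.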

In [6]:
# services_data = pd.read_csv('combined_service_data_RDG.csv')

# services_data = services_data[services_data['atocCode'] == 'XC']
# display(services_data.tail(20))

# services_data.info()