# Train Delay Prediction

### CONTEXT


Train delays are a significant issue in the railway industry, causing inconvenience to passengers and financial losses for railway operators.
Delays can occur due to various reasons such as signal failures, technical issues, weather conditions, and staffing shortages.
Predicting whether a train journey will be "On Time" or "Delayed" can help railway operators manage resources more effectively and improve passenger satisfaction.

The dataset provided contains historical booking and journey data, including details such as purchase date and time, ticket type, departure and arrival stations, journey status, and reasons for delays.
By analyzing this data, we can build a predictive model to classify train journeys as "On Time" or "Delayed."

### PROBLEM STATEMENT


Using the provided dataset, the goal is to predict whether a train journey will be "On Time" or "Delayed." 


### CONTENT



The dataset includes the following features:

Transaction ID: Unique identifier for each transaction.

Date of Purchase: The date when the ticket was purchased.

Time of Purchase: The time when the ticket was purchased.

Purchase Type: The method of purchase (e.g., Online, Station).

Payment Method: The payment method used (e.g., Credit Card, Contactless).

Railcard: Type of railcard used (e.g., Adult, Disabled).

Ticket Class: Class of the ticket (e.g., Standard, First Class).

Ticket Type: Type of ticket (e.g., Advance, Off-Peak).

Price: The price of the ticket.

Departure Station: The station where the journey starts.

Arrival Destination: The station where the journey ends.

Date of Journey: The date of the train journey.

Departure Time: The scheduled departure time.

Arrival Time: The scheduled arrival time.

Actual Arrival Time: The actual arrival time.

Journey Status: The status of the journey (e.g., On Time, Delayed).

Reason for Delay: The reason for the delay, if any.

Refund Request: Whether a refund was requested.
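
The date and time features above are stored as strings, so they usually need parsing before they are useful. The sketch below shows one way to do that on a few made-up rows that reuse the dataset's column names. The helper `to_timestamp` and the derived columns `delay_minutes`, `departure_hour`, and `journey_weekday` are hypothetical names, not part of the dataset. It assumes every journey arrives on the same calendar day it departs, which would not hold for trains crossing midnight.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
sample = pd.DataFrame({
    "Date of Journey": ["2024-01-01", "2024-01-02"],
    "Departure Time": ["11:00:00", "09:45:00"],
    "Arrival Time": ["13:30:00", "11:35:00"],
    "Actual Arrival Time": ["13:30:00", "11:40:00"],
})


def to_timestamp(dates: pd.Series, times: pd.Series) -> pd.Series:
    """Combine a date column and a time-of-day column into one datetime column."""
    return pd.to_datetime(dates + " " + times, format="%Y-%m-%d %H:%M:%S")


scheduled = to_timestamp(sample["Date of Journey"], sample["Arrival Time"])
actual = to_timestamp(sample["Date of Journey"], sample["Actual Arrival Time"])

# Delay in minutes (assumes arrival falls on the journey date).
sample["delay_minutes"] = (actual - scheduled).dt.total_seconds() / 60

# Simple calendar features taken from the scheduled times.
sample["departure_hour"] = pd.to_datetime(
    sample["Departure Time"], format="%H:%M:%S"
).dt.hour
sample["journey_weekday"] = pd.to_datetime(sample["Date of Journey"]).dt.dayofweek

print(sample[["delay_minutes", "departure_hour", "journey_weekday"]])
```

Because `delay_minutes` is computed from the actual arrival time, it is only known after the journey ends. It is probably better suited to exploration than to use as a model input.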

#### Dataset Overview:


Rows: 31,653

Columns: 18

Target Variable: Journey Status (values: "On Time" or "Delayed")
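
Before modelling a binary target, it helps to know how the two classes are split, since accuracy alone can mislead when one class dominates. The following is a minimal sketch on made-up labels. The helper `class_balance` is a hypothetical name and the 8-to-2 split is illustrative only, not the balance in this dataset.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each distinct label."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up labels for illustration; not the real distribution.
toy_status = pd.Series(["On Time"] * 8 + ["Delayed"] * 2, name="Journey Status")
balance = class_balance(toy_status)
print(balance)
```

Running the same helper on `df['Journey Status']` would also reveal any labels beyond the two expected values.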

# DATA EXPLORATION

#### Importing Libraries

In [1]:
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
sns.set(style="whitegrid")

##### Loading dataset

In [22]:
df = pd.read_csv("railway.csv")

# Display basic information
df.info(), df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31653 entries, 0 to 31652
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Transaction ID       31653 non-null  object
 1   Date of Purchase     31653 non-null  object
 2   Time of Purchase     31653 non-null  object
 3   Purchase Type        31653 non-null  object
 4   Payment Method       31653 non-null  object
 5   Railcard             10735 non-null  object
 6   Ticket Class         31653 non-null  object
 7   Ticket Type          31653 non-null  object
 8   Price                31653 non-null  int64 
 9   Departure Station    31653 non-null  object
 10  Arrival Destination  31653 non-null  object
 11  Date of Journey      31653 non-null  object
 12  Departure Time       31653 non-null  object
 13  Arrival Time         31653 non-null  object
 14  Actual Arrival Time  29773 non-null  object
 15  Journey Status       31653 non-null  object
 16  Reas

(None,
             Transaction ID Date of Purchase Time of Purchase Purchase Type  \
 0  da8a6ba8-b3dc-4677-b176       2023-12-08         12:41:11        Online   
 1  b0cdd1b0-f214-4197-be53       2023-12-16         11:23:01       Station   
 2  f3ba7a96-f713-40d9-9629       2023-12-19         19:51:27        Online   
 3  b2471f11-4fe7-4c87-8ab4       2023-12-20         23:00:36       Station   
 4  2be00b45-0762-485e-a7a3       2023-12-27         18:22:56        Online   
 
   Payment Method Railcard Ticket Class Ticket Type  Price  \
 0    Contactless    Adult     Standard     Advance     43   
 1    Credit Card    Adult     Standard     Advance     23   
 2    Credit Card      NaN     Standard     Advance      3   
 3    Credit Card      NaN     Standard     Advance     13   
 4    Contactless      NaN     Standard     Advance     76   
 
        Departure Station    Arrival Destination Date of Journey  \
 0      London Paddington  Liverpool Lime Street      2024-01-01   
 1     

In [23]:
df.shape

(31653, 18)

In [24]:
df.describe()

              Price
count  31653.000000
mean      23.439200
std       29.997628
min        1.000000
25%        5.000000
50%       11.000000
75%       35.000000
max      267.000000
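
By default `describe()` summarizes only numeric columns, which is why Price is the only column in the output. For the categorical columns, a comparable summary can be sketched as follows, using a made-up frame that borrows three of the dataset's column names.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Ticket Class": ["Standard", "Standard", "First Class"],
    "Ticket Type": ["Advance", "Off-Peak", "Advance"],
    "Price": [43, 23, 76],
})

# Keep the non-numeric columns; describe() then reports
# count, unique, top (most frequent value) and freq for each of them.
categorical_summary = toy.select_dtypes(exclude="number").describe()
print(categorical_summary)
```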


#### Check for duplicate values

In [25]:
duplicate_count = df.duplicated().sum()

#### Check for missing values

In [26]:
missing_values = df.isnull().sum()

duplicate_count, missing_values

(0,
 Transaction ID             0
 Date of Purchase           0
 Time of Purchase           0
 Purchase Type              0
 Payment Method             0
 Railcard               20918
 Ticket Class               0
 Ticket Type                0
 Price                      0
 Departure Station          0
 Arrival Destination        0
 Date of Journey            0
 Departure Time             0
 Arrival Time               0
 Actual Arrival Time     1880
 Journey Status             0
 Reason for Delay       27481
 Refund Request             0
 dtype: int64)
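
Raw counts are easier to judge as a share of the 31,653 rows: Railcard is missing in roughly 66% of rows, Reason for Delay in roughly 87%, and Actual Arrival Time in roughly 6%. A small helper along these lines can produce that view. `missing_report` is a hypothetical name and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """List columns with missing values, with counts and percentages."""
    missing = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(frame)).round(2),
    })
    # Keep only columns that have gaps, largest first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration.
toy = pd.DataFrame({
    "Railcard": ["Adult", np.nan, np.nan, np.nan],
    "Price": [43, 23, 3, 13],
    "Actual Arrival Time": ["13:30:00", "11:40:00", np.nan, "18:45:00"],
})
report = missing_report(toy)
print(report)
```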

#### Handling Missing Values

In [27]:
# Drop "Reason for Delay": it is missing for 27,481 of the 31,653 rows
df.drop(columns=['Reason for Delay'], inplace=True)

# Label missing Railcard values as "No Railcard".
# replace('None', ...) only matches the literal string and leaves NaN untouched, so use fillna.
df['Railcard'] = df['Railcard'].fillna('No Railcard')

# Impute missing values in "Actual Arrival Time" with the scheduled "Arrival Time".
# Assign the result back; inplace fillna on a selected column is deprecated in recent pandas.
df['Actual Arrival Time'] = df['Actual Arrival Time'].fillna(df['Arrival Time'])

# Verify missing values are handled
print(df.isnull().sum())

Transaction ID             0
Date of Purchase           0
Time of Purchase           0
Purchase Type              0
Payment Method             0
Railcard                   0
Ticket Class               0
Ticket Type                0
Price                      0
Departure Station          0
Arrival Destination        0
Date of Journey            0
Departure Time             0
Arrival Time               0
Actual Arrival Time        0
Journey Status             0
Refund Request             0
dtype: int64
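
Filling a missing actual arrival time with the scheduled time makes those rows look exactly on time, which may not reflect what happened. One option, sketched here on made-up rows, is to record which rows were imputed before overwriting the gaps. The flag column `arrival_imputed` is a hypothetical addition, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Railcard": ["Adult", np.nan, "Disabled"],
    "Arrival Time": ["13:30:00", "11:35:00", "18:15:00"],
    "Actual Arrival Time": ["13:30:00", np.nan, "18:45:00"],
})

# Remember which rows had no recorded arrival before filling them in.
toy["arrival_imputed"] = toy["Actual Arrival Time"].isnull()

# Categorical gap: fill with an explicit sentinel label.
toy["Railcard"] = toy["Railcard"].fillna("No Railcard")

# Time gap: fall back to the scheduled arrival time from the same row.
toy["Actual Arrival Time"] = toy["Actual Arrival Time"].fillna(toy["Arrival Time"])

print(toy)
```

Keeping the flag lets later steps exclude the imputed rows or treat them separately.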
