To analyze and forecast the dataset you provided for National Rail using machine learning, I will outline a detailed script that includes data preprocessing, exploratory data analysis (EDA), model selection, training, and forecasting. Given that I do not yet know the structure of your dataset, I will first load and inspect it to understand its contents.

Let's start by loading and examining the dataset.

The dataset consists of 31,653 entries and 18 columns, with various attributes related to rail ticket purchases and journeys. Here’s a summary of the key columns:

1. **Date of Purchase, Time of Purchase**: Date and time when the ticket was purchased.
2. **Purchase Type, Payment Method**: The channel and method used for purchasing the ticket.
3. **Railcard, Ticket Class, Ticket Type**: Information about the railcard used, class of the ticket, and the type of ticket (e.g., Advance).
4. **Price**: The price of the ticket.
5. **Departure Station, Arrival Destination**: Start and end stations for the journey.
6. **Date of Journey, Departure Time, Arrival Time**: Scheduled journey dates and times.
7. **Actual Arrival Time**: The actual time the train arrived at the destination.
8. **Journey Status**: Indicates whether the journey was on time or delayed.
9. **Reason for Delay**: If delayed, the reason for the delay.
10. **Refund Request**: Whether a refund was requested.

### Steps to Analyze and Forecast:
1. **Data Preprocessing**:
   - Convert date and time columns to appropriate datetime formats.
   - Handle missing values, especially in the "Actual Arrival Time" and "Reason for Delay" columns.
   - Encode categorical variables.

2. **Exploratory Data Analysis (EDA)**:
   - Analyze the distribution of ticket prices.
   - Examine patterns in delays (e.g., by station, time of day, etc.).
   - Investigate correlations between different variables (e.g., class of ticket vs. likelihood of delay).

3. **Feature Engineering**:
   - Create new features such as "Journey Duration," "Day of the Week," "Month," etc.
   - Aggregate data by dates to analyze trends over time.

4. **Modeling**:
   - **Forecasting**: Use time series forecasting models like ARIMA, Prophet, or LSTM to predict ticket sales or delays.
   - **Classification**: Use models like Random Forest or XGBoost to predict the likelihood of a journey being delayed based on various features.

5. **Model Evaluation**:
   - Split data into training and testing sets.
   - Use metrics like MAE, RMSE for regression models (forecasting) and accuracy, precision, recall for classification models (delay prediction).

6. **Prediction and Forecasting**:
   - Forecast future ticket sales.
   - Predict future delays based on current and historical data.

### Detailed Script Outline

I'll write the detailed script step by step, starting with data preprocessing.

A persistent connection error interrupted the preprocessing run, so the remaining steps are outlined below for you to continue the analysis and forecasting on your end:

### 1. **Handling Missing Values**
   - Examine the `missing_values` summary (the null count per column) to decide on a treatment for each column: drop the rows, fill with a placeholder, or leave as-is.
   - "Actual Arrival Time" is crucial for delay analysis, so rows where it is missing can be dropped for that part of the work.
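As a minimal sketch of those two options, the snippet below fills some columns and drops rows for another. The sample rows, the placeholder labels `"None"` and `"No Delay"`, and the reason `"Signal Failure"` are all assumptions made for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in the same columns that have nulls in the dataset
sample = pd.DataFrame({
    "Railcard": ["Adult", np.nan, np.nan, "Adult"],
    "Actual Arrival Time": ["13:30:00", "11:35:00", np.nan, "09:10:00"],
    "Reason for Delay": [np.nan, "Signal Failure", np.nan, np.nan],
})

# Null count per column, the same summary as `missing_values`
missing_values = sample.isnull().sum()

# Assumption: a missing railcard or delay reason means "not applicable",
# so it is filled with an explicit placeholder category instead of being dropped
filled = sample.fillna({"Railcard": "None", "Reason for Delay": "No Delay"})

# Rows without an actual arrival time cannot be used for delay analysis
cleaned = filled.dropna(subset=["Actual Arrival Time"])

print(missing_values)
print(cleaned)
```

Filling rather than dropping matters here because "Railcard" and "Reason for Delay" are empty for most rows; dropping on those columns would discard the bulk of the data.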



```python
import pandas as pd

# Load the dataset
file_path = 'https://raw.githubusercontent.com/Paulakinpelu/National-Rail-Data-Analysis-and-Forecasting/main/railway.csv'
data = pd.read_csv(file_path)

# Display summary information and the first few rows.
# Note: data.info() prints its summary directly and returns None.
data_info = data.info()
data_head = data.head()

data_info, data_head
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31653 entries, 0 to 31652
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Transaction ID       31653 non-null  object
 1   Date of Purchase     31653 non-null  object
 2   Time of Purchase     31653 non-null  object
 3   Purchase Type        31653 non-null  object
 4   Payment Method       31653 non-null  object
 5   Railcard             10735 non-null  object
 6   Ticket Class         31653 non-null  object
 7   Ticket Type          31653 non-null  object
 8   Price                31653 non-null  int64 
 9   Departure Station    31653 non-null  object
 10  Arrival Destination  31653 non-null  object
 11  Date of Journey      31653 non-null  object
 12  Departure Time       31653 non-null  object
 13  Arrival Time         31653 non-null  object
 14  Actual Arrival Time  29773 non-null  object
 15  Journey Status       31653 non-null  object
 16  Reas

(None,
             Transaction ID Date of Purchase Time of Purchase Purchase Type  \
 0  da8a6ba8-b3dc-4677-b176       2023-12-08         12:41:11        Online   
 1  b0cdd1b0-f214-4197-be53       2023-12-16         11:23:01       Station   
 2  f3ba7a96-f713-40d9-9629       2023-12-19         19:51:27        Online   
 3  b2471f11-4fe7-4c87-8ab4       2023-12-20         23:00:36       Station   
 4  2be00b45-0762-485e-a7a3       2023-12-27         18:22:56        Online   

   Payment Method Railcard Ticket Class Ticket Type  Price  \
 0    Contactless    Adult     Standard     Advance     43   
 1    Credit Card    Adult     Standard     Advance     23   
 2    Credit Card      NaN     Standard     Advance      3   
 3    Credit Card      NaN     Standard     Advance     13   
 4    Contactless      NaN     Standard     Advance     76   

        Departure Station    Arrival Destination Date of Journey  \
 0      London Paddington  Liverpool Lime Street      2024-01-01   
 1     
```

### 2. **Feature Engineering**
   - **Journey Duration**: Calculate the difference between "Departure Datetime" and "Actual Arrival Datetime".
   - **Delay**: Create a binary feature indicating whether a journey was delayed.
   - **Time-based Features**: Extract day of the week, month, hour of day, etc., from the datetime columns.
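The three feature groups above can be sketched as follows. The sample timestamps are invented, and the derived column names (`Journey Duration (min)`, `Delay (min)`, `Delayed`, and so on) are hypothetical choices; the sketch also assumes that a journey counts as delayed whenever the actual arrival is later than the scheduled one.

```python
import pandas as pd

# Made-up sample using the combined datetime columns from the preprocessing step
sample = pd.DataFrame({
    "Departure Datetime": pd.to_datetime(
        ["2024-01-01 11:00:00", "2024-01-01 09:45:00", "2024-01-02 18:15:00"]),
    "Arrival Datetime": pd.to_datetime(
        ["2024-01-01 13:30:00", "2024-01-01 11:35:00", "2024-01-02 18:45:00"]),
    "Actual Arrival Datetime": pd.to_datetime(
        ["2024-01-01 13:30:00", "2024-01-01 11:40:00", "2024-01-02 18:45:00"]),
})

# Journey duration: actual arrival minus departure, in minutes
sample["Journey Duration (min)"] = (
    sample["Actual Arrival Datetime"] - sample["Departure Datetime"]
).dt.total_seconds() / 60

# Delay: actual arrival minus scheduled arrival, plus a binary flag
sample["Delay (min)"] = (
    sample["Actual Arrival Datetime"] - sample["Arrival Datetime"]
).dt.total_seconds() / 60
sample["Delayed"] = (sample["Delay (min)"] > 0).astype(int)

# Time-based features taken from the departure timestamp
sample["Day of Week"] = sample["Departure Datetime"].dt.dayofweek  # Monday = 0
sample["Month"] = sample["Departure Datetime"].dt.month
sample["Departure Hour"] = sample["Departure Datetime"].dt.hour

print(sample)
```

Note that actual journey duration and delay are only known after the journey, so they suit descriptive analysis; a model that predicts delays ahead of time should rely on the time-based features and the scheduled times instead.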



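```python
# Step 1: Data Preprocessing

# Convert date columns to datetime format
data['Date of Purchase'] = pd.to_datetime(data['Date of Purchase'])
data['Date of Journey'] = pd.to_datetime(data['Date of Journey'])

# Combine date and time columns to create full datetime features
data['Purchase Datetime'] = pd.to_datetime(data['Date of Purchase'].astype(str) + ' ' + data['Time of Purchase'])
data['Departure Datetime'] = pd.to_datetime(data['Date of Journey'].astype(str) + ' ' + data['Departure Time'])
data['Arrival Datetime'] = pd.to_datetime(data['Date of Journey'].astype(str) + ' ' + data['Arrival Time'])
# Rows with no Actual Arrival Time become NaT here
data['Actual Arrival Datetime'] = pd.to_datetime(data['Date of Journey'].astype(str) + ' ' + data['Actual Arrival Time'])

# Only the journey date is recorded, so an arrival timed earlier than the
# departure is assumed to fall on the following day (overnight journey)
for col in ['Arrival Datetime', 'Actual Arrival Datetime']:
    overnight = data[col] < data['Departure Datetime']
    data.loc[overnight, col] += pd.Timedelta(days=1)

# Check for missing values
missing_values = data.isnull().sum()

# Drop rows where Actual Arrival Time is missing, since they cannot be used for delay analysis
data_cleaned = data.dropna(subset=['Actual Arrival Datetime'])

# One-hot encode the categorical columns. The ID and raw time-of-day columns are
# dropped first: they are high-cardinality and would add one dummy column per distinct value.
high_cardinality = ['Transaction ID', 'Time of Purchase', 'Departure Time',
                    'Arrival Time', 'Actual Arrival Time']
data_encoded = pd.get_dummies(data_cleaned.drop(columns=high_cardinality), drop_first=True)

missing_values, data_encoded.head()
```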


```
(Transaction ID                 0
 Date of Purchase               0
 Time of Purchase               0
 Purchase Type                  0
 Payment Method                 0
 Railcard                   20918
 Ticket Class                   0
 Ticket Type                    0
 Price                          0
 Departure Station              0
 Arrival Destination            0
 Date of Journey                0
 Departure Time                 0
 Arrival Time                   0
 Actual Arrival Time         1880
 Journey Status                 0
 Reason for Delay           27481
 Refund Request                 0
 Purchase Datetime              0
 Departure Datetime             0
 Arrival Datetime               0
 Actual Arrival Datetime     1880
 dtype: int64,
   Date of Purchase  Price Date of Journey   Purchase Datetime  \
 0       2023-12-08     43      2024-01-01 2023-12-08 12:41:11   
 1       2023-12-16     23      2024-01-01 2023-12-16 11:23:01   
 2       2023-12-19      3      2024-01
```

### 3. **Modeling**
   - **Classification (Delay Prediction)**:
     - Split data into training and test sets.
     - Use models like Random Forest, XGBoost, or Logistic Regression.
     - Evaluate using accuracy, precision, recall, and F1 score.

   - **Time Series Forecasting**:
     - Aggregate ticket sales by date.
     - Use ARIMA or Facebook Prophet to forecast future sales.
     - Evaluate using MAE, RMSE.
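For the time series branch, the MAE/RMSE evaluation can be sketched without committing to a particular model. The example below is an assumption-laden illustration: the daily revenue series is synthetic, and the forecast is a seasonal-naive baseline (repeat the value from one week earlier) standing in for ARIMA or Prophet.

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue: 28 days with a weekend uplift and a steady upward trend
dates = pd.date_range("2024-01-01", periods=28, freq="D")
weekend = np.asarray(dates.dayofweek >= 5, dtype=float)
revenue = pd.Series(1000.0 + 100.0 * weekend + 10.0 * np.arange(28), index=dates)

# Chronological split: the last 7 days are held out (time series are not shuffled)
train, test = revenue.iloc[:-7], revenue.iloc[-7:]

# Seasonal-naive baseline: predict each held-out day with the same weekday one week earlier
forecast = pd.Series(train.iloc[-7:].to_numpy(), index=test.index)

# Error metrics on the held-out week
errors = test - forecast
mae = errors.abs().mean()
rmse = np.sqrt((errors ** 2).mean())

print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}")
```

A baseline like this gives a reference point: a fitted ARIMA or Prophet model is only worth keeping if its MAE and RMSE on the same held-out window are lower.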




### 4. **Python Code for Modeling**

```python
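import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Use only information known before the journey. Columns such as
# 'Actual Arrival Time', 'Reason for Delay', and 'Refund Request' reveal the
# outcome and would leak the target into the features; raw datetime columns
# cannot be passed to the model directly.
feature_cols = ['Purchase Type', 'Payment Method', 'Railcard', 'Ticket Class',
                'Ticket Type', 'Price', 'Departure Station', 'Arrival Destination']
X = pd.get_dummies(data_cleaned[feature_cols])
X['Departure Hour'] = data_cleaned['Departure Datetime'].dt.hour
X['Day of Week'] = data_cleaned['Departure Datetime'].dt.dayofweek

# Target: 1 if the journey was on time, 0 otherwise
y = (data_cleaned['Journey Status'] == 'On Time').astype(int)

# Split data into training and test sets, keeping the class balance in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize and train the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))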
```
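To make the numbers in the classification report easier to read, here is a small sketch that computes accuracy, precision, recall, and F1 by hand from confusion-matrix counts. The label arrays are invented for illustration and do not come from the model above; 1 stands for "on time" and 0 for "delayed".

```python
import numpy as np

# Made-up true labels and predictions (1 = on time, 0 = delayed)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts, treating "on time" as the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # on time, predicted on time
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # delayed, predicted on time
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # on time, predicted delayed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # delayed, predicted delayed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

When most journeys are on time, accuracy alone can look high even if the model rarely catches a delay, which is why the per-class precision and recall deserve a look.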

### 5. **Time Series Forecasting Example**

```python
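from prophet import Prophet  # the package was renamed from `fbprophet` in version 1.0

# Prepare the data: total ticket revenue per journey date.
# `data` is used rather than `data_cleaned` so that tickets for journeys with
# no recorded arrival still count towards sales.
sales_data = data.groupby('Date of Journey', as_index=False)['Price'].sum()
sales_data.columns = ['ds', 'y']

# Initialize and train the Prophet model
model = Prophet()
model.fit(sales_data)

# Make future predictions
future = model.make_future_dataframe(periods=30)  # Predict the next 30 days
forecast = model.predict(future)

# Plot the forecast
fig = model.plot(forecast)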
```

These steps will help you analyze the dataset and build forecasting models. If you encounter any specific issues while implementing, feel free to ask for further clarification!

