# Forecasting Railroad Track Failures Due to Extreme Temperature Fluctuations

## Project Definition & Objectives

### Objective/Thesis
<p>The objective of this project is to develop a machine learning model that predicts potential derailment locations caused specifically by track failures resulting from extreme temperature fluctuations. By integrating national weather forecasts with historical derailment and track data, the model will identify high-risk areas where derailments are most likely to occur given the forecast. This analysis aims to enable predictive maintenance strategies that increase track inspections and repairs during and immediately after periods of extreme weather, mitigating derailment risk.</p>

### Scope
<ul>
    <li><strong>Data Collection</strong>: Gathering historical data on derailments, environmental factors (temperature fluctuations), track conditions, and weather forecasts.</li>
    <li><strong>Model Development</strong>: Building and validating a machine learning model capable of predicting derailment locations based on extreme weather patterns. [Possible options: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Neural Networks.]</li>
    <li><strong>Geospatial Analysis</strong>: Applying geospatial methods to identify high-risk regions for derailments due to wide gauge, a track defect in which the rails sit farther apart than the design spacing.</li>
    <li><strong>Real-Time Forecast Integration</strong>: Integrating national weather forecasts to provide real-time predictions of potential derailment locations.</li>
    <li><strong>Actionable Insights</strong>: Recommending areas for increased track maintenance and inspection based on predictions, with a focus on preventing accidents caused by wide gauge in extreme weather conditions.</li>
</ul>

### Significance
<p>Track failures are the leading cause of non-reportable derailments, with over 12,000 reportable events recorded, according to the Federal Railroad Administration (FRA). As climate change contributes to increasingly unpredictable and extreme weather patterns, the risk of derailments due to wide gauge is likely to rise. This project is significant because it will provide a data-driven approach to mitigating these risks. By predicting where derailments are likely to occur, rail companies can proactively focus maintenance and inspection efforts, reducing the likelihood of accidents, protecting human lives, and preserving infrastructure.</p>

## Data Preprocessing

### Import

In [2]:
import requests
import pandas as pd
from datetime import datetime, timedelta

#### Railroad Accident Data from Federal Railroad Administration (FRA)

In [3]:
# WARNING can take up to 6 minutes to download.
# API URL
url = "https://data.transportation.gov/resource/85tf-25kj.json"

# Set the parameters
limit = 1000  # The number of rows to fetch per request
offset = 0    # The starting point for the next batch of rows
all_data = [] # To store all the data

while True:
    # Create the query string with the limit and offset
    query_url = f"{url}?$limit={limit}&$offset={offset}"
    
    # Make the API request; the timeout keeps a stalled connection from hanging the loop
    response = requests.get(query_url, timeout=60)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Load the response into JSON format
        data = response.json()
        
        # If no data is returned, we've reached the end
        if not data:
            break
        
        # Append the data to our list
        all_data.extend(data)
        
        # Update the offset for the next batch of rows
        offset += limit
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        break

# Convert the list of records into a pandas DataFrame
df = pd.DataFrame(all_data)

# Display the number of rows fetched
#print(f"Total records retrieved: {len(df)}")

# Display the first few rows of the DataFrame
#print(df.head())
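
The paging logic above (request `limit` rows, advance `offset`, stop on an empty page) can be factored into a helper that is testable without network access. This is a minimal sketch, not the notebook's own code: `fetch_all`, `fetch_page`, and `fake_page` are hypothetical names, and the fake page function stands in for the real API call.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def fetch_all(fetch_page: Callable[[int, int], List[Record]], limit: int = 1000) -> List[Record]:
    """Collect all records by requesting pages of `limit` rows until an empty page is returned."""
    records: List[Record] = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        if not page:
            # An empty page means the previous request reached the end of the dataset
            break
        records.extend(page)
        offset += limit
    return records


# Stand-in for the real endpoint: 2,500 fake rows served in slices (illustrative data only)
_fake_rows = [{"accidentnumber": str(i)} for i in range(2500)]


def fake_page(limit: int, offset: int) -> List[Record]:
    return _fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_page, limit=1000)
print(len(all_rows))
```

Against the real endpoint, `fetch_page` would wrap the `requests.get` call and return `response.json()`.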

In [4]:
# Print all features (column names)
features = list(df.columns)
print(features)

['reportingrailroadcode', 'reportingrailroadname', 'year', 'accidentnumber', 'url', 'accidentyear', 'accidentmonth', 'maintenancerailroadcode', 'maintenancerailroadname', 'maintenanceaccidentnumber', 'maintenanceaccidentyear', 'maintenanceaccidentmonth', 'day', 'date', 'time', 'accident_type_code', 'accidenttype', 'hazmatcars', 'hazmatcarsdamaged', 'hazmatreleasedcars', 'personsevacuated', 'divisioncode', 'division', 'station', 'milepost', 'statecode', 'stateabbr', 'statename', 'countycode', 'countyname', 'district', 'temperature', 'visibility_code', 'visibility', 'weather_condition_code', 'weathercondition', 'track_type_code', 'tracktype', 'trackname', 'trackclass', 'trackdensity', 'train_direction_code', 'traindirection', 'equipment_type_code', 'equipmenttype', 'equipmentattended', 'trainnumber', 'trainspeed', 'recordedestimatedspeed', 'maximumspeed', 'grosstonnage', 'method_of_operation_code', 'firstcarinitials', 'firstcarnumber', 'firstcarposition', 'firstcarloaded', 'passengerstra

##### Create Analysis Dataframe of RR incidents with a Track-Related Primary Accident Cause

In [5]:
# Create a Track Accident DataFrame to filter only incidents with an identified "primaryaccidentcause" of a track type.
from primaryAccidentCodesLibrary import primary_accident_cause_codes

# Get a list of codes from the dictionary
codes_list = list(primary_accident_cause_codes.keys())

# Keep only the incidents whose primary cause code appears in the track-related list
track_accidents_df = df[df['primaryaccidentcausecode'].isin(codes_list)].copy()

# Relevant features to include in the analysis based on domain knowledge.
features_to_analyze = ['reportingrailroadcode', 'accidentnumber', 'date', 'time', 'accidenttype', 'hazmatreleasedcars', 'station', 'stateabbr', 'temperature', 'visibility_code', 'visibility', 'weathercondition', 'tracktype', 'equipmenttype', 'trainspeed', 'equipmentdamagecost', 'trackdamagecost', 'totaldamagecost', 'primaryaccidentcausecode', 'latitude', 'longitude']

# Drop all columns except the selected ones
track_accidents_df = track_accidents_df[features_to_analyze]

# Display the first few rows of the new DataFrame
#track_accidents_df.head()
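
The contents of `primaryAccidentCodesLibrary` are not shown in this notebook. The sketch below assumes it exposes a dictionary mapping cause codes to descriptions and illustrates the same `isin` filter on a toy DataFrame. The codes, descriptions, and rows here are illustrative placeholders, not the project's actual list.

```python
import pandas as pd

# Hypothetical stand-in for primary_accident_cause_codes (assumed shape: code -> description).
# The entries below are placeholders for illustration only.
primary_accident_cause_codes = {
    "T109": "Track alignment irregular (illustrative description)",
    "T110": "Wide gauge (illustrative description)",
}

# Toy incident records; "XXXX" stands for any non-track cause code
toy_df = pd.DataFrame(
    {
        "accidentnumber": ["A1", "A2", "A3"],
        "primaryaccidentcausecode": ["T109", "XXXX", "T110"],
        "temperature": ["95", "40", "12"],
        "station": ["X", "Y", "Z"],
    }
)

codes_list = list(primary_accident_cause_codes.keys())

# Keep rows whose cause code is in the list, then keep only the columns of interest
mask = toy_df["primaryaccidentcausecode"].isin(codes_list)
toy_track_df = toy_df.loc[mask, ["accidentnumber", "temperature", "primaryaccidentcausecode"]].copy()
print(toy_track_df)
```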

In [6]:
# Check the date range of incidents in our query.

# Drop rows where 'accidenttype' or 'visibility' have missing values
#track_accidents_df.dropna(subset=['accidenttype', 'visibility'], inplace=True)

# Ensure the 'date' column is in datetime format
track_accidents_df['date'] = pd.to_datetime(track_accidents_df['date'])

# Define the cutoff roughly 10 years before today (365*10 days does not account for leap days)
ten_years_ago = datetime.today() - timedelta(days=365*10)

# Filter the DataFrame to only include incidents from the past 10 years
track_accidents_df = track_accidents_df[track_accidents_df['date'] >= ten_years_ago]

# Check the new date range
date_min = track_accidents_df['date'].min()
date_max = track_accidents_df['date'].max()

print(f"The filtered incidents occurred between {date_min.date()} and {date_max.date()}.")

The filtered incidents occurred between 2014-10-19 and 2024-07-30.
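
Because `timedelta(days=365*10)` counts days rather than calendar years, the cutoff lands a few days later than a true ten-year boundary. The sketch below compares it with `pd.DateOffset(years=10)`. The reference date is a hypothetical value, fixed only so the result is reproducible.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical run date, fixed here so the comparison is reproducible
reference_date = pd.Timestamp("2024-10-16")

# Day-count cutoff: ignores the leap days that fall inside the window
cutoff_days = reference_date - timedelta(days=365 * 10)

# Calendar cutoff: same month and day, ten years earlier
cutoff_calendar = reference_date - pd.DateOffset(years=10)

print(cutoff_days.date())      # 2014-10-19
print(cutoff_calendar.date())  # 2014-10-16
```

Either cutoff is defensible for a "last 10 years" window. The calendar version is easier to explain when the date range is reported.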


In [7]:
track_accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4636 entries, 119 to 220537
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   reportingrailroadcode     4636 non-null   object        
 1   accidentnumber            4636 non-null   object        
 2   date                      4636 non-null   datetime64[ns]
 3   time                      4636 non-null   object        
 4   accidenttype              4636 non-null   object        
 5   hazmatreleasedcars        4636 non-null   object        
 6   station                   4636 non-null   object        
 7   stateabbr                 4636 non-null   object        
 8   temperature               4636 non-null   object        
 9   visibility_code           4636 non-null   object        
 10  visibility                4636 non-null   object        
 11  weathercondition          4636 non-null   object        
 12  tracktype            

In [8]:
null_counts = track_accidents_df.isnull().sum()
null_counts

reportingrailroadcode         0
accidentnumber                0
date                          0
time                          0
accidenttype                  0
hazmatreleasedcars            0
station                       0
stateabbr                     0
temperature                   0
visibility_code               0
visibility                    0
weathercondition              0
tracktype                     0
equipmenttype               295
trainspeed                    0
equipmentdamagecost           0
trackdamagecost               0
totaldamagecost               0
primaryaccidentcausecode      0
latitude                      0
longitude                     0
dtype: int64
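
Only `equipmenttype` has missing values (295 rows). One option, shown here as a sketch on toy data rather than as the project's chosen approach, is to measure the missing share and fill the gaps with an explicit `"Unknown"` category so those rows are not dropped. The category values below are illustrative.

```python
import pandas as pd

# Toy data with one missing equipment type (values are illustrative)
toy_df = pd.DataFrame(
    {
        "equipmenttype": ["Freight Train", None, "Yard/switching"],
        "trainspeed": ["10", "5", "25"],
    }
)

# Share of rows with no equipment type recorded
missing_share = toy_df["equipmenttype"].isna().mean()
print(f"Missing share: {missing_share:.1%}")

# Replace missing values with an explicit category instead of dropping the rows
toy_df["equipmenttype"] = toy_df["equipmenttype"].fillna("Unknown")
print(toy_df["equipmenttype"].value_counts())
```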

In [11]:
track_accidents_df.describe().T

| | count | mean | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| date | 4636 | 2019-05-27 11:28:19.050905856 | 2014-10-19 00:00:00 | 2016-12-23 18:00:00 | 2019-03-12 00:00:00 | 2021-10-19 06:00:00 | 2024-07-30 00:00:00 |
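
The `info()` output lists `temperature` as `object`, and `describe()` summarizes only `date`, which suggests the numeric fields arrived from the API as strings. A minimal sketch of converting them with `pd.to_numeric` follows. The toy values and the choice of columns are assumptions for illustration. `errors="coerce"` turns unparseable entries into `NaN` instead of raising.

```python
import pandas as pd

# Toy records with numeric fields stored as strings (illustrative values)
toy_df = pd.DataFrame(
    {
        "temperature": ["95", "-4", "n/a"],
        "trainspeed": ["10", "25", "5"],
        "totaldamagecost": ["12000", "350000", "48000"],
        "latitude": ["41.88", "35.22", "29.76"],
        "longitude": ["-87.63", "-80.84", "-95.37"],
    }
)

numeric_columns = ["temperature", "trainspeed", "totaldamagecost", "latitude", "longitude"]

# Convert each column; values that cannot be parsed become NaN
for col in numeric_columns:
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

# describe() now includes the converted columns
summary = toy_df.describe().T
print(summary)
```

After conversion, the null counts should be rechecked, since coerced values appear as new missing entries.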


In [12]:
# Export the DataFrame to a CSV file
track_accidents_df.to_csv('track_accidents_last_10_years.csv', index=False)

#### Weather Data from...

In [20]:
# Extract GPS location and Date from the Track Accident Dataframe
#track_accidents_df['incident_date'] = pd.to_datetime(track_accidents_df['incident_date']) 
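
The accident DataFrame stores the incident date in the `date` column (there is no `incident_date` column among the selected features). The sketch below shows one way to extract unique location and date combinations for a later weather lookup. It uses toy data, and `weather_keys` is a hypothetical name; the weather source itself is not specified here.

```python
import pandas as pd

# Toy incidents: two share the same location and date (illustrative values)
toy_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-07-01", "2023-07-01", "2024-01-15"]),
        "latitude": ["41.8800", "41.8800", "35.2200"],
        "longitude": ["-87.6300", "-87.6300", "-80.8400"],
    }
)

# Keep only the fields needed to look up weather for each incident
weather_keys = toy_df[["date", "latitude", "longitude"]].copy()

# Coordinates arrive as strings; convert them and drop rows that cannot be located
weather_keys["latitude"] = pd.to_numeric(weather_keys["latitude"], errors="coerce")
weather_keys["longitude"] = pd.to_numeric(weather_keys["longitude"], errors="coerce")
weather_keys = weather_keys.dropna().drop_duplicates().reset_index(drop=True)

print(weather_keys)
```

Deduplicating first avoids requesting the same location and day more than once from whichever weather source is used.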