# Human Factor Caused Railroad Accidents in the US
## Project Goals

* Find potential drivers for human factor caused railroad accidents in the US

* Produce a viable model to predict high risk factors contributing to human factor related railroad accidents.

* Propose actionable options for the FRA to change policy for railroad companies to further prevent human factor related railroad accidents.

In [2]:
### imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# `import datetime` would be shadowed by the line below, so only the names used are imported
from datetime import timedelta, datetime

# .py files
import acquire as acq
import prepare as prep
import explore as exp

# modeling and stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, plot_confusion_matrix
from scipy.stats import chi2_contingency, mannwhitneyu, ttest_ind


## Acquire
* Data is pulled from the Federal Railroad Administration (FRA) accident database as four .csv files covering 2019 to the present, stored in the project directory, and then combined in Python.
* The initial dataframe contains 9922 rows and 146 columns.
* Each row is an accident reported to the FRA.
* Each column is a required field entered on a form after an accident.

## Prepare
* dropped columns containing more than 50% nulls
* converted column strings to lower case
* after research, kept the columns that seemed most relevant, leaving 42 columns and 9900 rows
* dropped the small number of rows with nulls in particular columns
* converted most columns to integers
* replaced null values in engineer/conductor hours with the mean
* split the dataframe into train, validate, and test sets, then into X_train, y_train, etc.
* encoded and created dummies from categorical columns
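
The steps above could look roughly like the sketch below. It runs on a small made-up frame and is not the contents of `prepare.py`; the helper name `prep_sketch`, its `cat_cols` parameter, and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def prep_sketch(df, cat_cols):
    """Hypothetical helper illustrating the cleaning steps; not the project's prepare.py."""
    # Drop columns where more than 50% of the values are null
    df = df.loc[:, df.isna().mean() <= 0.5].copy()
    # Keep only the categorical columns that survived the null filter
    cat_cols = [col for col in cat_cols if col in df.columns]
    # Lower-case the string values
    for col in cat_cols:
        df[col] = df[col].str.lower()
    # Fill missing crew hours with the column mean
    for col in ("enghr", "cdtrhr"):
        if col in df.columns:
            df[col] = df[col].fillna(df[col].mean())
    # One-hot encode the categorical columns
    return pd.get_dummies(df, columns=cat_cols)

# Toy data, invented for this example
toy = pd.DataFrame({
    "weather": ["Clear", "RAIN", "Clear", "Fog"],
    "enghr": [8.0, np.nan, 10.0, 6.0],
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
})
clean = prep_sketch(toy, cat_cols=["weather"])
print(clean.columns.tolist())
```

Note that filling nulls with a mean computed on the full dataframe lets validate and test rows influence the fill value; computing the mean on the train split only is a common alternative.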

| Feature          | Definition                                                                       |
|------------------|----------------------------------------------------------------------------------|
| 'date'           | Year, Month, and Day                                                             |
| 'timehr'         | hour of the day (1-12)                                                           |
| 'timemin'        | minute of the hour (0-59)                                                        |
| 'ampm'           | AM or PM indicator                                                               |
| 'type'           | type of train (e.g. freight, passenger, work)                                    |
| 'state'          | state where the incident occurred                                                |
| 'temp'           | temperature in degrees Fahrenheit                                                |
| 'visibility'     | visibility measured in miles                                                     |
| 'weather'        | weather conditions (e.g. clear/PC, rain, fog, snow)                              |
| 'trnspd'         | train speed in miles per hour                                                    |
| 'tons'           | train weight in tons                                                             |
| 'loadf1'         | load factor (ratio of weight of train to max allowable weight)                   |
| 'emptyf1'        | empty factor (ratio of weight of train to min allowable weight)                  |
| 'cause' = target | cause of incident (e.g. human error, electrical/mechanical, signal, track, misc) |
| 'acctrk'         | contributing factors related to the track                                        |
| 'actrkcl'        | accident classification (e.g. derailment, collision)                             |
| 'enghr'          | number of hours worked by the engineer at the time of the incident               |
| 'cdtrhr'         | number of hours worked by the conductor at the time of the incident              |

In [None]:
# Read in the first CSV file and keep only the column titles
df = pd.read_csv('RRD_US_2023.csv', nrows=0)

# Loop through the remaining CSV files and stack them on top of the first one
for year in range(2022, 2018, -1):
    filename = f'RRD_US_{year}.csv'
    temp_df = pd.read_csv(filename, header=0)
    df = pd.concat([df, temp_df], axis=0, ignore_index=True)

# Write the combined CSV file to disk
df.to_csv('RRD_US_combined.csv', index=False)

In [None]:
# clean data
df = acq.clean_rrd()

In [None]:
# prepped data
df = prep.prep_rrd_data()

### Look at the Data

In [None]:
# look at dataframe
df.head(5)

In [None]:
# look at info
df.info()

In [None]:
## A summary of my data
df.describe().T

In [None]:
# split and see shape of data
train, val, test = prep.split_data(
    df, train_size=0.6, val_size=0.2, 
    test_size=0.2, random_state=123)
print('Train shape:', train.shape)
print('Validate shape:', val.shape)
print('Test shape:', test.shape)
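
For readers without access to `prepare.py`, a 60/20/20 split like the one above can be sketched with two calls to `train_test_split`. The function name `split_data_sketch`, the `target` parameter, and the use of stratification are assumptions; the project's `prep.split_data` may work differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data_sketch(df, train_size=0.6, val_size=0.2, test_size=0.2,
                      random_state=123, target="cause"):
    """Hypothetical stand-in for prep.split_data (stratifying on the target is an assumption)."""
    # First carve off the test set
    train_val, test = train_test_split(
        df, test_size=test_size, random_state=random_state, stratify=df[target])
    # Then split the remainder so that validate is val_size of the original frame
    val_share = val_size / (train_size + val_size)
    train, val = train_test_split(
        train_val, test_size=val_share, random_state=random_state,
        stratify=train_val[target])
    return train, val, test

# Toy data, invented for this example
toy = pd.DataFrame({"cause": [0, 1] * 50, "trnspd": range(100)})
train_s, val_s, test_s = split_data_sketch(toy)
print(train_s.shape, val_s.shape, test_s.shape)
```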

## Explore

### Does visibility affect human factor caused accidents?

In [None]:
exp.visibility_plot(train)

* Between 1/4 mile and 1/2 mile of visibility, human factor accidents look roughly proportional to the overall accident population. When visibility is 1/2 mile or more, however, human factor accidents appear to make up a larger share of all accidents.

* The chi-squared test of independence tests the null hypothesis that there is no relationship between the two variables (cause and visibility) in the accident population. If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant relationship between the two variables.

* $H_0$ = There is no relationship between cause and visibility
* $H_a$ = There is a relationship between cause and visibility

In [None]:
# get visibility chi2 stat
exp.visibility_stat(train)

* The p-value is less than the 0.05 alpha. Therefore we reject the null hypothesis and there is evidence of a relationship between cause and visibility. We will send it on to modeling.
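
The chi-squared test described above can be sketched with `scipy.stats.chi2_contingency` on a contingency table of cause against visibility. The counts and category labels below are invented for illustration, and this is not necessarily how `exp.visibility_stat` is implemented.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy counts, invented for this example: (cause, visibility) -> number of accidents
counts = {
    ("human factor", "day"): 30,
    ("human factor", "dark"): 10,
    ("other", "day"): 10,
    ("other", "dark"): 30,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["cause", "visibility"])

# Build the observed contingency table and run the test of independence
observed = pd.crosstab(toy["cause"], toy["visibility"])
chi2, p, dof, expected = chi2_contingency(observed)

alpha = 0.05
reject_null = p < alpha
print(observed)
print("reject H0:", reject_null)
```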

### Does weather affect human factor caused accidents?

In [None]:
exp.weather_plot(train)

* Across the different weather types in the plot, the share of human factor accidents does not look very different from the share under clear/PC conditions.

* The chi-squared test of independence tests the null hypothesis that there is no relationship between the two variables (cause and weather) in the accident population. If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant relationship between the two variables.

* $H_0$ = There is no relationship between cause and weather
* $H_a$ = There is a relationship between cause and weather

In [None]:
# get weather chi2 stat
exp.weather_stat(train)

* The p-value is less than the 0.05 alpha. Therefore we reject the null hypothesis; there is evidence of a relationship between cause and weather, although the plot suggests it is weak. We will send it on to modeling.

### Do conductor work hours affect human factor caused accidents?

In [None]:
# get conductor hours plot
exp.cdthr_plot(train)

* Mean conductor hours appear slightly higher for human factor accidents than for other accidents.

* The Mann-Whitney U test tests the null hypothesis that conductor hours (cdtrhr) follow the same distribution in the two groups (human factor accidents and all other accidents). It is rank-based, so it does not assume the hours are normally distributed. If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant difference between the two groups.

* $H_0$ = There is no difference in cdtrhr hours between human factor accidents and accidents that were not human factor
* $H_a$ = There is a difference in cdtrhr hours between human factor accidents and accidents that were not human factor

In [None]:
# get conductor hours Mann-Whit stat
exp.cdtrhr_stat(train)

* The p-value is less than the 0.05 significance level, so we reject the null hypothesis and will send conductor hours to modeling.
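
The comparison above can be sketched with `scipy.stats.mannwhitneyu`. The hour values below are invented for illustration, and the two-sided alternative is an assumption about how `exp.cdtrhr_stat` runs the test.

```python
from scipy.stats import mannwhitneyu

# Toy conductor hours, invented for this example
human_factor_hours = [9.5, 10.0, 11.0, 8.5, 12.0, 10.5, 9.0, 11.5]
other_cause_hours = [6.0, 7.5, 8.0, 5.5, 7.0, 6.5, 9.0, 8.0]

# Two-sided test: do the two groups differ in either direction?
stat, p = mannwhitneyu(human_factor_hours, other_cause_hours,
                       alternative="two-sided")

alpha = 0.05
reject_null = p < alpha
print("U statistic:", stat)
print("reject H0:", reject_null)
```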

### Does train speed affect human factor caused accidents?

In [None]:
# get trainspeed plot
exp.trnspd_plot(train)

* Visually, train speed does not appear higher for human factor accidents; if anything, the mean speed looks lower than for all other accidents.

* The Mann-Whitney U test tests the null hypothesis that there is no difference in train speed (trnspd) between the two groups (human factor accidents and all other accidents). If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant difference between the two groups.

* $H_0$ = There is no difference in train speed (trnspd) between human factor accidents and accidents that were not human factor
* $H_a$ = There is a difference in train speed (trnspd) between human factor accidents and accidents that were not human factor

In [None]:
# get trainspeed stat test
exp.trnspd_stat(train)

* The p-value is less than the 0.05 significance level, so we reject the null hypothesis and will send train speed to modeling.

### Exploration Summary
* Visibility has a relationship with cause (human factor); p-value: 4.350195748698547e-05
* Weather has a weak relationship with cause (human factor); p-value: 0.00013986800540663016
* Conductor hours differ between human factor accidents and other accidents; p-value: 0.005661266681436928
* Train speed differs between human factor accidents and other accidents; p-value: 2.1638787633218512e-77
* All four p-values are below the 0.05 significance level, so all four features are sent on to modeling.

## Modeling
* The purpose of modeling the features I have selected is to find the most accurate model for predicting whether or not an accident was caused by human factors.
* The baseline accuracy we are trying to beat is about 59% (3498 / 5939, the share of the majority class). I will be using Decision Tree, Random Forest, Logistic Regression, and K-Nearest Neighbors models to try to find the best model. I will also be using a grid search to find the hyperparameters with the highest accuracy scores.
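
As a rough illustration of the grid search mentioned above, the sketch below tunes a decision tree with `GridSearchCV` on synthetic data. The parameter grid, the 5-fold cross-validation, and the synthetic dataset are assumptions and do not reflect the settings used in `explore.py`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four modeling features (not the FRA data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=123)

# Hypothetical grid; the real search may cover other hyperparameters
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The cross-validated `best_score_` is the figure to compare against the baseline, since accuracy on the data used for fitting tends to be optimistic.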

### Features I will model
* Visibility
* Weather
* Conductor Hours
* Train Speed

### Prep Data for Modeling

In [None]:
# calculate baseline accuracy: majority-class count divided by total rows
baseline_accuracy = 3498 / (3498 + 2441)
baseline_accuracy
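
The same baseline can be derived from the labels rather than from hard-coded counts. In the sketch below, the series `y_toy` is rebuilt from the two counts in the cell above; the 0/1 class labels are placeholders, since the notebook does not say which class is the majority.

```python
import pandas as pd

# Rebuild a label series from the counts above (0/1 labels are placeholders)
y_toy = pd.Series([0] * 3498 + [1] * 2441, name="cause")

# Baseline = accuracy of always predicting the most common class
baseline_from_labels = y_toy.value_counts(normalize=True).max()
print(round(baseline_from_labels, 3))  # 0.589
```

With the real data, `y_train` would take the place of `y_toy`.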

In [None]:
# Get X_train
X_train, X_val, X_test, y_train, y_val, y_test = prep.X_train_data(train, val, test)

In [None]:
# Create X_train scaled
X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled, y_test = prep.scaled_data(X_train, X_val, X_test, y_train, y_val, y_test)
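
A minimal sketch of the scaling step follows, assuming a `MinMaxScaler`; the notebook does not show which scaler `prep.scaled_data` uses, and the arrays are invented. The key point is that the scaler is fit on the train split only and then applied to the other splits.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices, invented for this example
X_train_toy = np.array([[10.0, 2.0], [20.0, 4.0], [30.0, 6.0]])
X_val_toy = np.array([[15.0, 3.0], [40.0, 8.0]])

# Fit on train only so validate/test values do not leak into the scaling
scaler = MinMaxScaler().fit(X_train_toy)
X_train_toy_scaled = scaler.transform(X_train_toy)
X_val_toy_scaled = scaler.transform(X_val_toy)

print(X_train_toy_scaled)
print(X_val_toy_scaled)  # values above 1 mean the row lies outside the train range
```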


### Decision Tree Classifier

In [None]:
# Get Decision Tree Classifier
exp.DTC_model(X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled, y_test)

### Random Forest Classifier

In [None]:
exp.RF_model(X_train_scaled, y_train, X_val_scaled,
         y_val, X_test_scaled, y_test)

### Logistic Regression

### KNN Model