### <u>Homework 3: Trains</u>
- By Thomas Truong

<b>Description</b>
- Homework 3 analyzes data on freight train accidents (mostly minor incidents, such as driving into an occupied block, with no collisions).  You will create a Jupyter Notebook and include the data file FRAFirm2.csv, which is located in Canvas under Modules > Extra Files > Homework 3.

- Each record in the file contains information about one particular shift that an engineer or conductor worked.  Clock-in and clock-out information, plus many statistics, are provided. The clock-in time is in the column named 'Start', and this value is given as the number of minutes since midnight.  So if the person clocked in at 7:00 am, this would be recorded as 420 (7 hours * 60 minutes).  The clock-out time is in the column named 'End'.  The column labeled 'FIRMF' is the estimated fatigue level of the person, with values ranging from -10 (really fatigued) to +10 (really alert).

- One column, named 'Class', is the target.  This column contains an integer that can have one of three values:
  - 0: No accident occurred during this shift
  - 1: An accident of type '1' occurred during this shift
  - 2: An accident of type '2' occurred during this shift

- There are many other features (columns of input data), but you don't really know what each column represents.  You might make some educated guesses, but you shouldn't really need to know what each column represents to complete this work.  We are simply interested in finding the features (columns) that indicate a higher percentage of accidents.  Of even more interest is finding two or more columns that, when taken together, can be used to identify a higher probability of accident.


- Here is the interesting twist: Just because an engineer might be working with a certain combination of features does not guarantee that an accident will occur!  However, the hypothesis is that for some ranges of some features, the probability of an accident increases.  So the real question is, for given features, what is the probability of an accident?  Which features lead to an increased probability, and which do not really affect the probability? 

- Some of the features represent a range of values, such as the Start time.  This has values ranging from 0 to 1439, representing the clock-in time as minutes since midnight.  Rather than using this actual value, it might be best to divide the values into 'bins'.  For example, you might want to use 'hours of the day' rather than 'minutes of the day'.  You can make a new column that takes the values from 'Start' and rounds them to the nearest hour.  Or you might separate the values into 3-hour bins, so you only have bin values from 0 to 7.  You can then group the records based on this new column, and in each group determine the percentage of records that are accidents.  You might find that in the 7 am bin, there is a 28% chance of accident, but in the 6 am bin there is a 32% chance.
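The binning idea described above can be sketched with pandas as follows. This is only an illustration on a small made-up DataFrame (the `toy` frame, its values, and the column names `HourBin`, `Bin3h`, and `Accident` are assumptions, not taken from FRAFirm2.csv). It uses floor division, so a clock-in at minute 0 lands in bin 0, and `groupby` to get the share of accident shifts in each bin.

```python
import pandas as pd

# Made-up shifts for illustration only; not rows from FRAFirm2.csv.
toy = pd.DataFrame({
    "Start": [0, 59, 60, 170, 185, 420, 1439],
    "Class": [0, 1, 0, 2, 0, 1, 0],
})

# Hour-of-day bins (0-23) and 3-hour bins (0-7) via floor division.
toy["HourBin"] = toy["Start"] // 60
toy["Bin3h"] = toy["Start"] // 180

# Flag any accident (Class 1 or 2). The mean of this flag within a bin
# is the fraction of shifts in that bin that had an accident.
toy["Accident"] = toy["Class"] != 0
rate_by_bin = toy.groupby("Bin3h")["Accident"].agg(["mean", "size"])
print(rate_by_bin)
```

Keeping the `size` column next to the rate makes it easy to see which bins rest on very few records.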

- Here are some things to consider:
  - Because of the probabilistic nature of the problem, standard logistic regression might not work too well.  Run the logistic regressions and see if you can find some interesting results.  In addition, try other techniques, such as the grouping described above.
  - In addition to dividing individual features into bins, then determining the probabilities in each bin, you might consider examining two features, building a 2-dimensional set of bins.  It might be that certain combinations of features lead to a better identification of risk.
  - When determining the probability of risk in a 'bin', consider that if a bin only has a few records, then the probability estimate will not be very convincing.  For example, suppose you were looking at a clock-in bin but found only two records where the person clocked in at 3 am. (I don't know if this is actually the case; we are just making an assumption here.)  If both of those records happened to be accidents, you might conclude that starting work at 3 am is 100% guaranteed to result in an accident.  Again, these are probabilities, so it is not realistic to conclude a probability of 100% for any bin.  So be careful about this!
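One possible way to make the small-bin caution above concrete is to report a confidence interval next to each bin's accident rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the assignment prescribes; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(accidents, total, z=1.96):
    """Approximate 95% Wilson score interval for an accident rate."""
    if total == 0:
        return (0.0, 1.0)  # no records: the rate is unconstrained
    p = accidents / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Two accidents out of two shifts vs. 200 out of 200 (made-up counts).
low_small, high_small = wilson_interval(2, 2)
low_big, high_big = wilson_interval(200, 200)
print(f"2 of 2:     {low_small:.3f} to {high_small:.3f}")
print(f"200 of 200: {low_big:.3f} to {high_big:.3f}")
```

With only two records, the lower bound is roughly 0.34, so the data are consistent with an accident rate far below 100%. With 200 records, the interval is much tighter.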

- A number of people have analyzed this train data, and there are two schools of thought:
  - (A) The two types of accident, 1 and 2, really are different types of accident, caused by different situations, so these should be distinguishable by the various statistics in the file.  In other words, in one bin, type 1 accidents are much more likely than type 2 accidents, while in another bin the reverse is true.  If we can find a number of bins with these differences, this suggests that the two types of accident are distinguishable.
  - (B) The two types of accident are indistinguishable; there is no real difference.  There are different numbers of type 1 and type 2 accidents, so the probabilities won't match in every bin, but if the probabilities are roughly proportional, then we can't distinguish the accidents.  For example, if there were twice as many type 1 as type 2 accidents, then in any bin we would expect the probability of a type 1 accident to be roughly twice the probability of type 2.
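A rough first check of (A) versus (B) is to compare, bin by bin, the share of type 1 accidents among all accidents with the overall share. The sketch below does this on a made-up DataFrame (the `toy` data and the names `type1_share` and `overall_share` are assumptions for illustration, not results from FRAFirm2.csv).

```python
import pandas as pd

# Made-up data for illustration only; not taken from FRAFirm2.csv.
toy = pd.DataFrame({
    "Bin":   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Class": [1, 1, 2, 0, 1, 2, 2, 2, 0, 0],
})

# Count each class per bin: rows are bins, columns are Class values.
counts = pd.crosstab(toy["Bin"], toy["Class"])

# Share of type 1 among accident shifts, per bin and overall.
type1_share = counts[1] / (counts[1] + counts[2])
overall_share = counts[1].sum() / (counts[1].sum() + counts[2].sum())
print(type1_share)
print(overall_share)
```

Under (B), the per-bin shares should stay close to the overall share in every bin; under (A), some bins should deviate clearly. Bins with few accidents will fluctuate by chance, so a deviation in a sparse bin is weak evidence on its own.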

- The first question we would like answered is this: Are the two types of accident distinguishable?  Is there a fairly reliable way to tell these apart?

- The second question (which depends upon the answer to the first!) is this: Which features or combinations of features are the most useful for predicting accidents?  If we build bins using these features, some bins should have relatively low probabilities (and hence show good working conditions), while other bins have relatively high probabilities (and show poor working conditions).  If we can find answers to this question, we can identify good vs. poor working policies.

- And how do you choose to visualize your results?  Do you have graphs, confusion matrices, or charts?

- The third question is this: Which of the features (input columns) are not significant in performing the classification, and can hence be ignored?

- There is also an opportunity for some extra credit.  In the introduction I mentioned the FIRMF column, which contains our estimate of an employee's fatigue.  We actually have two similar calculations, the other using the FIRM column.  The question is: which is better, or are they both roughly the same?  If your results for the second question show that either FIRM or FIRMF is part of your solution, in this part show how your results would differ by using the other column.

- What I am looking for is not so much your results as your thought process: how you prepare and analyze the data.  So even if your results are inconclusive, if you are using good techniques in your work, you get a good grade!  But if you do have conclusive results, that would be great!

- In your Jupyter notebook, in addition to the cells which contain the code, include markdown cells explaining what you are doing, or highlighting conclusions that you can draw from the analysis.

- It is helpful if you do NOT clear the cells before turning in your results, because otherwise I have to run all of your results rather than just reading all of your results!

- Check your notebook in to Canvas to submit your homework!

##### <u>Extracting Data From CSV</u>

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Name of the file.
DATA_NAME = "FRAFirm2.csv"


# Reads the CSV into a dataframe.
def csv_to_dataframe():
  # pd.read_csv already returns a DataFrame, so no extra conversion is needed.
  return pd.read_csv(DATA_NAME)


# Main part of the program starts here.
dataframe = csv_to_dataframe()

# Display dataframe.
dataframe

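| Unnamed: 0 | FIRM | Class | Start | End | TOD | FIRMF | Length | Night | Gap | WS | ... | AFZ | WAFZ | MFZ | WMFZ | NBad | TIW | RIW | HIW | SIW | CCW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3 | 0 | 170 | 680 | -1 | -3 | 510 | 0 | 750 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2269 | 1 | 0 | 3 | 2.0 |
| 1 | 5 | 0 | 920 | 1295 | -1 | 5 | 375 | 0 | 855 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 465 | 0 | 0 | 1 | 1.0 |
| 2 | 1 | 1 | 1130 | 1865 | 1470 | -2 | 735 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2855 | 2 | 0 | 5 | 0.0 |
| 3 | 4 | 0 | 670 | 1265 | -1 | 4 | 595 | 0 | 605 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1325 | 0 | 0 | 2 | 1.5 |
| 4 | 3 | 0 | 330 | 1049 | -1 | 3 | 719 | 0 | 999 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 2434 | 2 | 0 | 4 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7067 | -1 | 0 | 225 | 855 | -1 | -1 | 630 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 824 | 0 | 0 | 3 | 0.0 |
| 7068 | 5 | 1 | 1245 | 2030 | 1410 | 0 | 785 | 0 | 999 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 7069 | 5 | 0 | 1201 | 1365 | -1 | 5 | 164 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2000 | 1 | 0 | 3 | 0.0 |
| 7070 | 5 | 0 | 945 | 1485 | -1 | 5 | 540 | 0 | 999 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 495 | 0 | 0 | 1 | 0.0 |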


##### <u>Create Bins For Start</u>

In [9]:
# Amount of hours per bin.
X_HOURS_PER_BIN = 1


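# Creates a bin column for the dataframe.
# Note: a value on an exact bin boundary (e.g. 60 with 1-hour bins) falls in
# the preceding bin, and a value of 0 maps to bin -1.
def create_bin_column(dataframe, for_column, bin_name, x_hours):
  # Vectorized over the whole column instead of looping with iterrows().
  dataframe[bin_name] = np.ceil(dataframe[for_column] / (x_hours * 60)) - 1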


# Create bin column.
create_bin_column(dataframe, "Start", "Bin", X_HOURS_PER_BIN)

# Display dataframe.
dataframe

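| Unnamed: 0 | FIRM | Class | Start | End | TOD | FIRMF | Length | Night | Gap | WS | ... | WAFZ | MFZ | WMFZ | NBad | TIW | RIW | HIW | SIW | CCW | Bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3 | 0 | 170 | 680 | -1 | -3 | 510 | 0 | 750 | 0 | ... | 0 | 0 | 0 | 0 | 2269 | 1 | 0 | 3 | 2.0 | 2.0 |
| 1 | 5 | 0 | 920 | 1295 | -1 | 5 | 375 | 0 | 855 | 0 | ... | 1 | 0 | 0 | 0 | 465 | 0 | 0 | 1 | 1.0 | 15.0 |
| 2 | 1 | 1 | 1130 | 1865 | 1470 | -2 | 735 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 2855 | 2 | 0 | 5 | 0.0 | 18.0 |
| 3 | 4 | 0 | 670 | 1265 | -1 | 4 | 595 | 0 | 605 | 0 | ... | 1 | 0 | 0 | 0 | 1325 | 0 | 0 | 2 | 1.5 | 11.0 |
| 4 | 3 | 0 | 330 | 1049 | -1 | 3 | 719 | 0 | 999 | 1 | ... | 1 | 0 | 1 | 0 | 2434 | 2 | 0 | 4 | 0.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7067 | -1 | 0 | 225 | 855 | -1 | -1 | 630 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 824 | 0 | 0 | 3 | 0.0 | 3.0 |
| 7068 | 5 | 1 | 1245 | 2030 | 1410 | 0 | 785 | 0 | 999 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 20.0 |
| 7069 | 5 | 0 | 1201 | 1365 | -1 | 5 | 164 | 0 | 999 | 0 | ... | 0 | 0 | 0 | 0 | 2000 | 1 | 0 | 3 | 0.0 | 20.0 |
| 7070 | 5 | 0 | 945 | 1485 | -1 | 5 | 540 | 0 | 999 | 0 | ... | 1 | 0 | 0 | 0 | 495 | 0 | 0 | 1 | 0.0 | 15.0 |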


##### <u>Calculate Start's Accident Probability</u>

In [18]:
# Calculates the accident probability for a bin.
def get_accident_probability(bin_number):
  classes = [0, 0, 0]
  for item in dataframe[dataframe.Bin == bin_number].iterrows():
    classes[int(item[1].get("Class"))] += 1
  
  # Get the probability of an accident: 1 - P(no accident).
  return 1 - classes[0] / sum(classes)


# Get chances of accident per bin.
print("Hour: Accident Chance")
for i in range(0, 24 // X_HOURS_PER_BIN):
  print(f"{i}: {round(get_accident_probability(i) * 100, 2)}%")

Hour: Accident Chance
0: 37.13%
1: 38.42%
2: 40.78%
3: 38.2%
4: 35.78%
5: 29.07%
6: 32.79%
7: 28.8%
8: 26.79%
9: 28.26%
10: 30.51%
11: 29.37%
12: 34.06%
13: 26.47%
14: 33.75%
15: 26.82%
16: 33.77%
17: 29.75%
18: 30.67%
19: 33.79%
20: 29.13%
21: 34.16%
22: 27.37%
23: 23.55%
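The homework description also suggests building a 2-dimensional set of bins from two features. A minimal sketch of that idea, on made-up data, might look like the following. The `toy` frame, the column names `StartBin`, `FatigueBin`, and `Accident`, and the choice to split FIRMF at zero are all assumptions for illustration, not findings from FRAFirm2.csv.

```python
import pandas as pd

# Made-up data for illustration only; not taken from FRAFirm2.csv.
toy = pd.DataFrame({
    "Start": [30, 45, 400, 410, 430, 1300, 1310, 1320],
    "FIRMF": [-4, -6, 3, 5, -2, -7, -8, 6],
    "Class": [1, 2, 0, 0, 1, 1, 0, 0],
})

# First feature: 3-hour clock-in bins (0-7).
toy["StartBin"] = toy["Start"] // 180
# Second feature: hypothetical two-way split, 1 = negative FIRMF.
toy["FatigueBin"] = (toy["FIRMF"] < 0).astype(int)
# 1 if any accident (Class 1 or 2) occurred during the shift.
toy["Accident"] = (toy["Class"] != 0).astype(int)

# Accident rate and record count for each (StartBin, FatigueBin) cell.
cells = toy.groupby(["StartBin", "FatigueBin"])["Accident"].agg(["mean", "count"])
print(cells)

# The same rates laid out as a grid: rows are StartBin, columns FatigueBin.
print(cells["mean"].unstack())
```

Two-dimensional bins split the records more finely, so the `count` column matters even more here: a cell with one or two records says very little about the underlying probability.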
