# Investigating Aviation Accidents
Our goal in this mission is to explore the time complexity of various approaches to searching a dataset. We'll be working with a dataset of aviation accident records for the US.

Let's read in the data, then search for a specific accident number, say "LAX94LA336".

In [17]:
# Read the whole file; the with block closes it automatically
with open("AviationData.txt", "r") as f:
    data = f.read()
# Split the data on the newline character
aviation_data = data.split("\n")
aviation_list = []
for d in aviation_data:
    temp = d.split("|")  # Further split each line into separate fields on the bar character
    temp = [x.strip() for x in temp]  # Strip whitespace at the beginning and end of each field
    aviation_list.append(temp)
target = "LAX94LA336"  # Search target

## First approach - nested loops

In [18]:
lax_code = []
for r in aviation_list:
    for c in r:
        if target in c:
            lax_code.append(r)
print(lax_code)

[['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']]


This approach takes time proportional to the number of rows multiplied by the number of columns, since it checks every field of every row. Can we make the code faster?

## Second approach - single loop

In [19]:
lax_code = []
for r in aviation_data:
    if target in r:
        lax_code.append(r.split("|"))
print(lax_code)

[['20001218X45447 ', ' Accident ', ' LAX94LA336 ', ' 07/19/1962 ', ' BRIDGEPORT, CA ', ' United States ', '  ', '  ', '  ', '  ', ' Fatal(4) ', ' Destroyed ', '  ', ' N5069P ', ' PIPER ', ' PA24-180 ', ' No ', ' 1 ', ' Reciprocating ', '  ', '  ', ' Personal ', '  ', ' 4 ', ' 0 ', ' 0 ', ' 0 ', ' UNK ', ' UNKNOWN ', ' Probable Cause ', ' 09/19/1996 ', ' ']]


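This approach is linear in the number of rows: it tests each raw line once and splits only the matching one. The total work still grows with the size of the data, but the substring test on a whole line is a single built-in operation, so in practice it is faster, and the code is shorter. Note that the fields in the result keep their surrounding whitespace, because they are not stripped. Splitting the row again is also redundant, since we already did that when building aviation_list.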
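To compare the first two approaches by measurement rather than by reasoning alone, a rough timing sketch with the standard `timeit` module can help. The synthetic `lines` and `rows`, the target `R1999F02`, and the function names `nested_search` and `line_search` are assumptions made for illustration; they stand in for the real file, and the timings will vary by machine.

```python
import timeit

# Synthetic stand-in for the real file: 2,000 pipe-delimited lines of 30 fields each.
lines = [" | ".join(f"R{i:04d}F{j:02d}" for j in range(30)) for i in range(2000)]
rows = [[field.strip() for field in line.split("|")] for line in lines]
target = "R1999F02"  # sits in the last row, the worst case for a scan

def nested_search(rows, target):
    """First approach: test every field of every row."""
    return [r for r in rows for c in r if target in c]

def line_search(lines, target):
    """Second approach: test each raw line once and split only the matches."""
    return [line.split("|") for line in lines if target in line]

# Run each search 20 times and report the total elapsed time.
nested_time = timeit.timeit(lambda: nested_search(rows, target), number=20)
line_time = timeit.timeit(lambda: line_search(lines, target), number=20)
print(f"nested loops: {nested_time:.4f} s, single loop: {line_time:.4f} s")
```

Both functions scan all of the data; the difference lies in how many separate substring tests Python has to perform.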

## Third approach - binary search

In [20]:
from math import floor
sorted_aviation_list = sorted(aviation_list, key = lambda x: x[2])
length = len(sorted_aviation_list)
upper = length - 1
lower = 0
index = floor((upper + lower) / 2)
code = sorted_aviation_list[index][2]
lax_code = []
while upper >= lower and target not in code:
    if target < code:
        upper = index - 1
    else:
        lower = index + 1
    index = int((upper + lower) / 2)
    code = sorted_aviation_list[index][2]
if target == code:
    lax_code.append(sorted_aviation_list[index])
    print(lax_code)
else:
    print("Code not found.")

IndexError: list index out of range

The cell raises an error, so let's debug the problem.

In [21]:
print(aviation_list[:3])

[['Event Id', 'Investigation Type', 'Accident Number', 'Event Date', 'Location', 'Country', 'Latitude', 'Longitude', 'Airport Code', 'Airport Name', 'Injury Severity', 'Aircraft Damage', 'Aircraft Category', 'Registration Number', 'Make', 'Model', 'Amateur Built', 'Number of Engines', 'Engine Type', 'FAR Description', 'Schedule', 'Purpose of Flight', 'Air Carrier', 'Total Fatal Injuries', 'Total Serious Injuries', 'Total Minor Injuries', 'Total Uninjured', 'Weather Condition', 'Broad Phase of Flight', 'Report Status', 'Publication Date', ''], ['20150908X74637', 'Accident', 'CEN15LA402', '09/08/2015', 'Freeport, IL', 'United States', '42.246111', '-89.581945', 'KFEP', 'albertus Airport', 'Non-Fatal', 'Substantial', 'Unknown', 'N24TL', 'CLARKE REGINALD W', 'DRAGONFLY MK', '', '', '', 'Part 91: General Aviation', '', 'Personal', '', '', '1', '', '', 'VMC', 'TAKEOFF', 'Preliminary', '09/09/2015', ''], ['20150906X32704', 'Accident', 'ERA15LA339', '09/05/2015', 'Laconia, NH', 'United States'

aviation_list is indeed a list of lists and its first rows look fine, so the cause of the error isn't clear yet. Let's check the last few rows.

In [22]:
print(aviation_list[-3:])

[['20010711X01367', 'Incident', 'DCA00WA063', '', 'Cuzco, Peru', 'Peru', '', '', '', '', 'Incident', '', '', '', 'Boeing', 'B-737', 'No', '', '', '', 'SCHD', '', '', '', '', '', '', '', '', 'Foreign', '07/12/2001', ''], ['20150729X33718', 'Accident', 'CEN15FA325', '', 'Truth or Consequences, NM', 'United States', '33.250556', '-107.293611', 'TCS', 'TRUTH OR CONSEQUENCES MUNI', 'Fatal(2)', 'Substantial', 'Airplane', 'N32401', 'PIPER', 'PA-28-151', 'No', '1', 'Reciprocating', 'Part 91: General Aviation', '', 'Personal', '', '2', '', '', '', '', 'UNKNOWN', 'Preliminary', '08/10/2015', ''], ['']]


The very last list is anomalous: it is [''], the product of splitting the empty string left after the file's final newline. It has no element at index 2, so the sort key x[2] raises the IndexError. Let's drop this last list.

In [23]:
aviation_list = aviation_list[:-1]
print(aviation_list[-3:]) # Check

[['20130128X92153', 'Accident', 'WPR12TA445', '', 'Unknown, UN', 'United States', '', '', '', '', 'Non-Fatal', 'Substantial', 'Airplane', 'N14CP', 'BEECH', 'C90', 'No', '2', 'Turbo Prop', 'Public Use', '', 'Public Aircraft - Federal', '', '', '', '', '1', '', '', 'Preliminary', '02/08/2013', ''], ['20010711X01367', 'Incident', 'DCA00WA063', '', 'Cuzco, Peru', 'Peru', '', '', '', '', 'Incident', '', '', '', 'Boeing', 'B-737', 'No', '', '', '', 'SCHD', '', '', '', '', '', '', '', '', 'Foreign', '07/12/2001', ''], ['20150729X33718', 'Accident', 'CEN15FA325', '', 'Truth or Consequences, NM', 'United States', '33.250556', '-107.293611', 'TCS', 'TRUTH OR CONSEQUENCES MUNI', 'Fatal(2)', 'Substantial', 'Airplane', 'N32401', 'PIPER', 'PA-28-151', 'No', '1', 'Reciprocating', 'Part 91: General Aviation', '', 'Personal', '', '2', '', '', '', '', 'UNKNOWN', 'Preliminary', '08/10/2015', '']]
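An alternative to slicing off the last row is to skip blank lines while parsing, so that the empty row never enters the list. The sketch below assumes a helper named `parse_records` and a two-column sample string; neither is part of the original notebook.

```python
def parse_records(text, delimiter="|"):
    """Split raw text into rows of stripped fields, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines anywhere in the input
        rows.append([field.strip() for field in line.split(delimiter)])
    return rows

# Hypothetical two-column sample that ends with a newline, like the real file.
sample = "Event Id | Accident Number \n20001218X45447 | LAX94LA336 \n"

print(sample.split("\n")[-1] == "")  # split("\n") leaves a trailing empty string
print(parse_records(sample))
```

`str.splitlines()` does not produce an empty final element for a trailing newline, and the explicit check also guards against blank lines in the middle of the file.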


In [24]:
sorted_aviation_list = sorted(aviation_list, key = lambda x: x[2])
length = len(sorted_aviation_list)
upper = length - 1
lower = 0
index = floor((upper + lower) / 2)
code = sorted_aviation_list[index][2]
lax_code = []
while target not in code and upper >= lower:
    if target < code:
        upper = index - 1
    else:
        lower = index + 1
    index = floor((upper + lower) / 2)
    code = sorted_aviation_list[index][2]
if target in code:
    lax_code.append(sorted_aviation_list[index])
    print(lax_code)
else:
    print("Code not found.")

[['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']]


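The search itself is logarithmic, since each step halves the range left to examine, which makes it the fastest of the three once the list is sorted. The sort is not free, though: it takes O(n log n) time, so binary search pays off only when the same sorted list serves many lookups. It also adds complexity to the script.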
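The hand-written loop can also be expressed with the standard library's `bisect` module. This is a minimal sketch rather than a drop-in replacement: the helper name `find_by_code`, the three-field sample rows, and the absent code `ZZZ00XX000` are assumptions for illustration, and the match is by exact equality instead of the substring test used above.

```python
from bisect import bisect_left

def find_by_code(sorted_rows, codes, target):
    """Return the row whose code equals target, or None if it is absent."""
    i = bisect_left(codes, target)  # leftmost position where target could be inserted
    if i < len(codes) and codes[i] == target:
        return sorted_rows[i]
    return None

# Shortened sample rows: the accident number is at index 2, as in aviation_list.
rows = [
    ["id3", "Accident", "NYC87FA264"],
    ["id1", "Accident", "LAX94LA336"],
    ["id2", "Incident", "DCA00WA063"],
]
sorted_rows = sorted(rows, key=lambda r: r[2])
codes = [r[2] for r in sorted_rows]  # parallel list of sort keys for bisect

print(find_by_code(sorted_rows, codes, "LAX94LA336"))
print(find_by_code(sorted_rows, codes, "ZZZ00XX000"))
```

Keeping a separate `codes` list avoids relying on the `key` parameter of `bisect_left`, which only exists in newer Python versions.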

## Alternative approach - dictionaries and a single loop

Another option is to use dictionaries. They organise the data more neatly and attach a named key to each field. The syntax is no more complex than that of lists.

In [25]:
# Let's convert all of our data into a list of dictionaries.
aviation_dict_list = []
headers = aviation_list[0]
aviation_list_slice = aviation_list[1:]
for r in aviation_list_slice:
    dic = {}
    for i, e in enumerate(r):
        dic[headers[i]] = e
    aviation_dict_list.append(dic)
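The inner loop above pairs each header with the field at the same position. The same pairing can be sketched with `zip`; the shortened `headers` and `rows` below are sample values for illustration, not the full dataset.

```python
# Shortened sample: three headers and two rows taken from the data shown above.
headers = ["Event Id", "Investigation Type", "Accident Number"]
rows = [
    ["20001218X45447", "Accident", "LAX94LA336"],
    ["20010711X01367", "Incident", "DCA00WA063"],
]

# zip pairs each header with the field in the same position.
records = [dict(zip(headers, row)) for row in rows]
print(records[0]["Accident Number"])
```

One difference to keep in mind: `zip` stops at the shorter of its inputs, so a row with more fields than headers is silently truncated, whereas the `enumerate` loop would raise an IndexError.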

In [26]:
lax_dict = []
for d in aviation_dict_list:
    if "LAX94LA336" in d[headers[2]]:
        lax_dict.append(d)
print(lax_dict)

[{'Event Id': '20001218X45447', 'Investigation Type': 'Accident', 'Accident Number': 'LAX94LA336', 'Event Date': '07/19/1962', 'Location': 'BRIDGEPORT, CA', 'Country': 'United States', 'Latitude': '', 'Longitude': '', 'Airport Code': '', 'Airport Name': '', 'Injury Severity': 'Fatal(4)', 'Aircraft Damage': 'Destroyed', 'Aircraft Category': '', 'Registration Number': 'N5069P', 'Make': 'PIPER', 'Model': 'PA24-180', 'Amateur Built': 'No', 'Number of Engines': '1', 'Engine Type': 'Reciprocating', 'FAR Description': '', 'Schedule': '', 'Purpose of Flight': 'Personal', 'Air Carrier': '', 'Total Fatal Injuries': '4', 'Total Serious Injuries': '0', 'Total Minor Injuries': '0', 'Total Uninjured': '0', 'Weather Condition': 'UNK', 'Broad Phase of Flight': 'UNKNOWN', 'Report Status': 'Probable Cause', 'Publication Date': '09/19/1996', '': ''}]


This approach labels our data, so we can look up specific fields by name. The search itself is still a linear scan over the list of dictionaries. The dictionary approach can be combined with binary search as well!
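If many lookups by accident number are expected, one more option is to build a dictionary keyed by that number once and then query it directly. This is a sketch under assumptions: the name `by_code` and the two-field sample records are illustrative, and lookups match the whole code exactly instead of using the substring test from the cells above.

```python
# Sample records with only two of the fields, for brevity.
records = [
    {"Event Id": "20001218X45447", "Accident Number": "LAX94LA336"},
    {"Event Id": "20010711X01367", "Accident Number": "DCA00WA063"},
]

# Build the index once: one pass over the data.
by_code = {}
for rec in records:
    # setdefault keeps the first record if an accident number repeats.
    by_code.setdefault(rec["Accident Number"], rec)

# Each lookup is then a single hash-table access on average.
print(by_code.get("LAX94LA336"))
print(by_code.get("ZZZ00XX000"))  # absent codes return None
```

Building the index costs one linear pass, so like sorting it is worthwhile mainly when the same data is searched repeatedly.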

## Alternative approach II - dictionaries and binary search

In [27]:
sorted_aviation_dict_list = sorted(aviation_dict_list, key = lambda x: x["Accident Number"])
length = len(sorted_aviation_dict_list)
upper = length - 1
lower = 0
index = floor((upper + lower) / 2)
code = sorted_aviation_dict_list[index]["Accident Number"]
lax_code = []
while target not in code and upper >= lower:
    if target < code:
        upper = index - 1
    else:
        lower = index + 1
    index = floor((upper + lower) / 2)
    code = sorted_aviation_dict_list[index]["Accident Number"]
if target in code:
    lax_code.append(sorted_aviation_dict_list[index])
    print(lax_code)
else:
    print("Code not found.")

[{'Event Id': '20001218X45447', 'Investigation Type': 'Accident', 'Accident Number': 'LAX94LA336', 'Event Date': '07/19/1962', 'Location': 'BRIDGEPORT, CA', 'Country': 'United States', 'Latitude': '', 'Longitude': '', 'Airport Code': '', 'Airport Name': '', 'Injury Severity': 'Fatal(4)', 'Aircraft Damage': 'Destroyed', 'Aircraft Category': '', 'Registration Number': 'N5069P', 'Make': 'PIPER', 'Model': 'PA24-180', 'Amateur Built': 'No', 'Number of Engines': '1', 'Engine Type': 'Reciprocating', 'FAR Description': '', 'Schedule': '', 'Purpose of Flight': 'Personal', 'Air Carrier': '', 'Total Fatal Injuries': '4', 'Total Serious Injuries': '0', 'Total Minor Injuries': '0', 'Total Uninjured': '0', 'Weather Condition': 'UNK', 'Broad Phase of Flight': 'UNKNOWN', 'Report Status': 'Probable Cause', 'Publication Date': '09/19/1996', '': ''}]


### Data analysis using dictionaries

Now that we have labels, we can use the "Location" field to work out which state had the most accidents.

In [28]:
location_accidents = []
for d in aviation_dict_list:
    location_accidents.append(d["Location"])
stateloc_accidents = [x.split(", ")[-1] for x in location_accidents]
# The -1 index is safe here: it still works when a faulty entry has no ", " and splits into a single element
from collections import Counter
state_accidents = Counter(stateloc_accidents)
print("State with most accidents:", state_accidents.most_common(1))

State with most accidents: [('CA', 8030)]
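The tally above takes the last element of every location, so a foreign entry such as 'Cuzco, Peru' contributes 'Peru' alongside the US state codes. A hedged variant filters on the "Country" field first; the function name `state_counts` is an assumption, and the sample pairs are copied from records shown elsewhere in this notebook.

```python
from collections import Counter

# A few (Location, Country) pairs taken from records shown in this notebook.
records = [
    {"Location": "BRIDGEPORT, CA", "Country": "United States"},
    {"Location": "Freeport, IL", "Country": "United States"},
    {"Location": "Laconia, NH", "Country": "United States"},
    {"Location": "Cuzco, Peru", "Country": "Peru"},
    {"Location": "BERMUDA, Bermuda", "Country": "Bermuda"},
]

def state_counts(records):
    """Count accidents per state, using only records located in the United States."""
    states = [
        rec["Location"].split(", ")[-1]
        for rec in records
        if rec["Country"] == "United States"
    ]
    return Counter(states)

counts = state_counts(records)
print(counts)
```

On the full dataset this filter is unlikely to change which state ranks first, but it keeps country names out of the lower ranks.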


We can also total the injuries by month, broken down by severity, alongside the number of accidents per month. This gives an idea of how often accidents end in fatalities rather than in less serious consequences.

In [29]:
total_accidents = {}
total_fatal_injuries = {}
total_serious_injuries = {}
total_minor_injuries = {}
total_uninjured = {}
for d in aviation_dict_list:
    mnth = d["Event Date"].split("/")[0]
    if mnth in total_accidents:
        total_accidents[mnth] += 1
    else:
        total_accidents[mnth] = 1
    if mnth in total_fatal_injuries:
        try:
            total_fatal_injuries[mnth] += int(d["Total Fatal Injuries"])
        except:
            total_fatal_injuries[mnth] += 0
    else:
        try:
            total_fatal_injuries[mnth] = int(d["Total Fatal Injuries"])
        except:
            total_fatal_injuries[mnth] = 0
    if mnth in total_serious_injuries:
        try:
            total_serious_injuries[mnth] += int(d["Total Serious Injuries"])
        except:
            total_serious_injuries[mnth] += 0
    else:
        try:
            total_serious_injuries[mnth] = int(d["Total Serious Injuries"])
        except:
            total_serious_injuries[mnth] = 0
    if mnth in total_minor_injuries:
        try:
            total_minor_injuries[mnth] += int(d["Total Minor Injuries"])
        except:
            total_minor_injuries[mnth] += 0
    else:
        try:
            total_minor_injuries[mnth] = int(d["Total Minor Injuries"])
        except:
            total_minor_injuries[mnth] = 0
    if mnth in total_uninjured:
        try:
            total_uninjured[mnth] += int(d["Total Uninjured"])
        except:
            total_uninjured[mnth] += 0
    else:
        try:
            total_uninjured[mnth] = int(d["Total Uninjured"])
        except:
            total_uninjured[mnth] = 0
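The cell above repeats the same pattern for each injury field. A more compact sketch of the same aggregation uses `collections.defaultdict`; the names `to_int`, `FIELDS`, and `monthly_totals` and the three sample records are assumptions for illustration. It returns one dictionary of per-field totals instead of four separate variables.

```python
from collections import defaultdict

FIELDS = [
    "Total Fatal Injuries",
    "Total Serious Injuries",
    "Total Minor Injuries",
    "Total Uninjured",
]

def to_int(value):
    """Convert a field to int, counting blank, missing, or non-numeric values as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0

def monthly_totals(records):
    """Return (accidents per month, {field: totals per month})."""
    accidents = defaultdict(int)
    totals = {field: defaultdict(int) for field in FIELDS}
    for rec in records:
        month = rec["Event Date"].split("/")[0]  # "" when the date is missing
        accidents[month] += 1
        for field in FIELDS:
            totals[field][month] += to_int(rec.get(field))
    return accidents, totals

# Three sample records; the blank strings mimic empty fields in the file.
sample = [
    {"Event Date": "07/19/1962", "Total Fatal Injuries": "4", "Total Serious Injuries": "0",
     "Total Minor Injuries": "0", "Total Uninjured": "0"},
    {"Event Date": "09/08/2015", "Total Fatal Injuries": "", "Total Serious Injuries": "1",
     "Total Minor Injuries": "", "Total Uninjured": ""},
    {"Event Date": "", "Total Fatal Injuries": "2", "Total Serious Injuries": "",
     "Total Minor Injuries": "", "Total Uninjured": ""},
]
accidents, totals = monthly_totals(sample)
print(dict(accidents))
print(dict(totals["Total Fatal Injuries"]))
```

Catching only `TypeError` and `ValueError` keeps unrelated errors visible, which a bare `except` would hide.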

A DataFrame can now be used to present the new data in a more compact fashion.

In [30]:
import pandas as pd
monthly_injuries = pd.DataFrame([total_accidents,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured])
monthly_injuries

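| | "" | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 4345 | 4597 | 5884 | 6430 | 7480 | 8342 | 9309 | 8760 | 7096 | 5918 | 4743 | 4374 |
| 1 | 2 | 3201 | 2974 | 2663 | 2972 | 3551 | 3557 | 5001 | 4855 | 4027 | 3738 | 3848 | 3636 |
| 2 | 0 | 1023 | 933 | 1134 | 1346 | 1404 | 1634 | 2002 | 2069 | 1502 | 1354 | 1147 | 1111 |
| 3 | 0 | 1434 | 1574 | 1754 | 2224 | 2368 | 2804 | 3700 | 3419 | 2305 | 1983 | 1594 | 1811 |
| 4 | 1 | 27633 | 26988 | 32778 | 30134 | 32991 | 41012 | 35418 | 39791 | 27659 | 28301 | 24941 | 28908 |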


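Good, but not very pretty: the rows are unlabelled, and the column with the empty name collects the few records that have no event date. Let's drop that column, label the rows, and transpose the DataFrame to give it a better presentation.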

In [31]:
monthly_injuries = monthly_injuries.drop([""],axis=1)
monthly_injuries.index = ["Total Accidents","Total Fatal Injuries","Total Serious Injuries","Total Minor Injuries","Total Uninjured"]
monthly_injuries = monthly_injuries.transpose()
monthly_injuries

| Month | Total Accidents | Total Fatal Injuries | Total Serious Injuries | Total Minor Injuries | Total Uninjured |
|---|---|---|---|---|---|
| 1 | 4345 | 3201 | 1023 | 1434 | 27633 |
| 2 | 4597 | 2974 | 933 | 1574 | 26988 |
| 3 | 5884 | 2663 | 1134 | 1754 | 32778 |
| 4 | 6430 | 2972 | 1346 | 2224 | 30134 |
| 5 | 7480 | 3551 | 1404 | 2368 | 32991 |
| 6 | 8342 | 3557 | 1634 | 2804 | 41012 |
| 7 | 9309 | 5001 | 2002 | 3700 | 35418 |
| 8 | 8760 | 4855 | 2069 | 3419 | 39791 |
| 9 | 7096 | 4027 | 1502 | 2305 | 27659 |
| 10 | 5918 | 3738 | 1354 | 1983 | 28301 |


In every month, the uninjured far outnumber the fatalities, so aviation accidents are perhaps not as lethal as one would imagine, although this depends on what we count as an aviation accident. At any rate, it's good practice to check that the calculations were done correctly.

In [33]:
# Print the first few lines of the dataframe to get some bearings on the columns
df = pd.DataFrame(aviation_dict_list)
df.head()

| | "" | Accident Number | Air Carrier | Aircraft Category | Aircraft Damage | Airport Code | Airport Name | Amateur Built | Broad Phase of Flight | Country | ... | Publication Date | Purpose of Flight | Registration Number | Report Status | Schedule | Total Fatal Injuries | Total Minor Injuries | Total Serious Injuries | Total Uninjured | Weather Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | CEN15LA402 | | Unknown | Substantial | KFEP | albertus Airport | | TAKEOFF | United States | ... | 09/09/2015 | Personal | N24TL | Preliminary | | | | 1.0 | | VMC |
| 1 | | ERA15LA339 | | Weight-Shift | Substantial | LCI | Laconia Municipal Airport | No | MANEUVERING | United States | ... | 09/10/2015 | Personal | N2264X | Preliminary | | 1.0 | | | | VMC |
| 2 | | GAA15CA251 | | | | | | | | United States | ... | | | N321DA | Preliminary | | | | | | |
| 3 | | WPR15FA256 | | Airplane | Substantial | SEE | GILLESPIE FIELD | No | TAKEOFF | United States | ... | 09/09/2015 | Instructional | N8441B | Preliminary | | 2.0 | | | | VMC |
| 4 | | ERA15LA338 | | Airplane | Destroyed | | | No | | United States | ... | 09/10/2015 | Aerial Observation | N758DK | Preliminary | | | | 2.0 | | VMC |


In [34]:
uninj = df["Total Uninjured"]
sumt = 0
for i in uninj:
    try:
        sumt += int(i)
    except:
        pass
sumt

376555

In [35]:
fatls = df["Total Fatal Injuries"]
sumt = 0
for i in fatls:
    try:
        sumt += int(i)
    except:
        pass
sumt

44025

In [36]:
srios = df["Total Serious Injuries"]
sumt = 0
for i in srios:
    try:
        sumt += int(i)
    except:
        pass
sumt

16659

In [37]:
minor = df["Total Minor Injuries"]
sumt = 0
for i in minor:
    try:
        sumt += int(i)
    except:
        pass
sumt

26970
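The four checks above repeat one loop. A compact alternative sketch uses `pd.to_numeric` with `errors="coerce"`, which turns blank or non-numeric strings into NaN so that `sum()` skips them. The small `df_sample` frame and the name `column_totals` are assumptions for illustration; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Small stand-in for df: counts are stored as strings and some are blank.
df_sample = pd.DataFrame({
    "Total Fatal Injuries": ["4", "", "2"],
    "Total Uninjured": ["0", "", "99"],
})

# Coerce each column to numbers (blank -> NaN), then sum; NaN is skipped by default.
column_totals = {
    col: int(pd.to_numeric(df_sample[col], errors="coerce").sum())
    for col in df_sample.columns
}
print(column_totals)
```

Unlike `int()`, `pd.to_numeric` also accepts decimal strings such as "1.5", so the two methods could differ if the data contained any.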

The four totals match the column sums of the monthly table once the few undated records are included, so the calculations are indeed correct. Let's single out some of the flights with large numbers of uninjured people and see if we can understand what happened.

In [38]:
uninj.sort_values(ascending = False).head(3)

58465    99
14734    99
57376    99
Name: Total Uninjured, dtype: object
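The output reports `dtype: object`, which means the counts are stored as strings and are sorted alphabetically. To rank them as numbers, the column has to be converted first. The sketch below uses made-up values ("588" and "120" are illustrative and not taken from the dataset) to show the difference.

```python
import pandas as pd

# Illustrative string counts, including a blank entry.
counts = pd.Series(["99", "588", "7", "", "120"], name="Total Uninjured")

# Sorting strings compares them character by character, so "99" ranks first.
as_text = counts.sort_values(ascending=False).head(2).tolist()

# Converting to numbers first gives the numeric ranking; blanks become NaN.
numeric = pd.to_numeric(counts, errors="coerce")
as_numbers = numeric.sort_values(ascending=False).head(2).tolist()

print(as_text)
print(as_numbers)
```

Applied to the real column, the numeric ranking may surface flights with far more than 99 uninjured occupants.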

In [39]:
df.loc[58465]

                                          
Accident Number                 NYC87FA264
Air Carrier               EASTERN AIRLINES
Aircraft Category                         
Aircraft Damage                      Minor
Airport Code                              
Airport Name                              
Amateur Built                           No
Broad Phase of Flight               CRUISE
Country                            Bermuda
Engine Type                      Turbo Fan
Event Date                      09/28/1987
Event Id                    20001213X32186
FAR Description                           
Injury Severity                  Non-Fatal
Investigation Type                Accident
Latitude                                  
Location                  BERMUDA, Bermuda
Longitude                                 
Make                              LOCKHEED
Model                               L-1011
Number of Engines                        3
Publication Date                11/15/1989
Purpose of 

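These are likely to be large airliner flights that suffered mechanical failures or turbulence-related injuries in flight but managed to land without fatalities! Note that the column's dtype is object: the counts are stored as strings, so the sort above is alphabetical and "99" ranks first, which means flights with even more uninjured occupants may well exist. [Here](https://app.ntsb.gov/pdfgenerator/ReportGeneratorFile.ashx?EventID=20001213X32186&AKey=1&RType=Final&IType=FA) is the final report for NYC87FA264.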