# DataViz Project - Airplane Crashes Since 1908

## 1. Data Preparation

### 1.1. Reading the data



In [1]:
import pandas as pd

file_path = "dataset/clean_v1_Airplane_Crashes_and_Fatalities_Since_1908.csv"
dframe = pd.read_csv(file_path)

# get information about data entries by column:
dframe.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 11 columns):
Date          5268 non-null object
Time          3049 non-null object
Location      5248 non-null object
Latitude      5240 non-null float64
Longitude     5240 non-null float64
Operator      5250 non-null object
Type          5241 non-null object
Aboard        5246 non-null float64
Fatalities    5256 non-null float64
Ground        5246 non-null float64
Summary       4878 non-null object
dtypes: float64(5), object(6)
memory usage: 452.8+ KB


The info() output shows that several columns have missing values. Each count below is the difference from the total of 5268 entries:

Date          5268 

Time          3049  *-> missing 2219 values*

Location      5248  *-> missing 20 values*

Latitude      5240  *-> missing 28 values (generated from Location, so 8 more are missing than in Location)*

Longitude     5240  *-> missing 28 values (generated from Location, so 8 more are missing than in Location)*

Operator      5250  *-> missing 18 values*

Type          5241  *-> missing 27 values*

Aboard        5246  *-> missing 22 values*

Fatalities    5256  *-> missing 12 values*

Ground        5246  *-> missing 22 values*

Summary       4878  *-> missing 390 values*
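
The differences listed above do not have to be worked out by hand. A minimal sketch, using a small made-up frame (`toy`, not the real dataset) in place of `dframe`, shows two equivalent ways to count missing values per column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dframe; the values are invented for illustration only
toy = pd.DataFrame({
    "Date": ["09/17/1908", "07/12/1912", "08/06/1913"],
    "Time": ["17:18", None, None],
    "Aboard": [2.0, 5.0, np.nan],
})

# isna() flags every missing cell; sum() counts the flags per column
missing = toy.isna().sum()

# Same result as "total entries minus non-null entries", as done by hand above
by_difference = len(toy) - toy.count()

print(missing)
```

On the real data, `dframe.isna().sum()` should reproduce the list above.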



In [2]:
# get a brief statistical summary of the data
dframe.describe(include="object") #dtype object

| | Date | Time | Location | Operator | Type | Summary |
|---|---|---|---|---|---|---|
| count | 5268 | 3049 | 5248 | 5250 | 5241 | 4878 |
| unique | 4753 | 999 | 4304 | 2476 | 2445 | 4673 |
| top | 08/28/1976 | 15:00 | Sao Paulo, Brazil | Aeroflot | Douglas DC-3 | Crashed during takeoff. |
| freq | 4 | 32 | 15 | 179 | 334 | 15 |


In [3]:
dframe.describe() #default only numerical values

| | Latitude | Longitude | Aboard | Fatalities | Ground |
|---|---|---|---|---|---|
| count | 5240.0 | 5240.0 | 5246.0 | 5256.0 | 5246.0 |
| mean | 27.010367 | -16.061076 | 27.554518 | 20.068303 | 1.608845 |
| std | 24.570333 | 83.419121 | 43.076711 | 33.199952 | 53.987827 |
| min | -77.529716 | -176.669861 | 0.0 | 0.0 | 0.0 |
| 25% | 10.944428 | -82.810095 | 5.0 | 3.0 | 0.0 |
| 50% | 34.392923 | -9.14948 | 13.0 | 9.0 | 0.0 |
| 75% | 43.801488 | 37.615021 | 30.0 | 23.0 | 0.0 |
| max | 80.449997 | 178.800476 | 644.0 | 583.0 | 2750.0 |


Only numerical columns are described unless we pass the argument include="object".
We can spot some possible errors in Aboard and Fatalities: the minimum shouldn't be zero for either of them, since this dataset should only cover crashes with reported fatalities. We'll have to investigate whether these zeros come from missing values or wrong values.

The fact that there are only 4673 unique crash summaries doesn't necessarily point to duplicate entries - different crashes may simply share the same short description, such as "Crashed on landing".
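
One way to tell repeated summaries apart from true duplicate rows is `DataFrame.duplicated`. The sketch below uses an invented three-row frame (`toy`) rather than the real data:

```python
import pandas as pd

# Invented rows: all three share a summary, but only the last two are identical
toy = pd.DataFrame({
    "Date": ["01/01/1950", "02/02/1960", "02/02/1960"],
    "Location": ["A", "B", "B"],
    "Summary": ["Crashed on landing."] * 3,
})

# Rows whose Summary text also appears in some other row
same_summary = toy[toy.duplicated(subset=["Summary"], keep=False)]

# Rows that repeat an earlier row in every column (true duplicate entries)
full_duplicates = toy[toy.duplicated(keep="first")]

print(len(same_summary), "rows share a summary;", len(full_duplicates), "row is a full duplicate")
```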


### 1.2. Cleaning and organizing the data

Correct types for columns:

In [4]:
dframe["Date"].head()

0    09/17/1908
1    07/12/1912
2    08/06/1913
3    09/09/1913
4    10/17/1913
Name: Date, dtype: object

In [5]:
# preview the conversion only: the result is not assigned back, so "Date" keeps dtype object for now
pd.to_datetime(dframe["Date"], format="%m/%d/%Y", errors="coerce")

0      1908-09-17
1      1912-07-12
2      1913-08-06
3      1913-09-09
4      1913-10-17
5      1915-03-05
6      1915-09-03
7      1916-07-28
8      1916-09-24
9      1916-10-01
10     1916-11-21
11     1916-11-28
12     1917-03-04
13     1917-03-30
14     1917-05-14
15     1917-06-14
16     1917-08-21
17     1917-10-20
18     1918-04-07
19     1918-05-10
20     1918-08-11
21     1918-12-16
22     1919-05-25
23     1919-07-19
24     1919-10-02
25     1919-10-14
26     1919-10-20
27     1919-10-30
28     1920-03-10
29     1920-03-30
          ...    
5238   2008-11-13
5239   2008-11-16
5240   2008-11-27
5241   2008-12-03
5242   2008-12-11
5243   2008-12-15
5244   2009-01-04
5245   2009-01-15
5246   2009-02-07
5247   2009-02-07
5248   2009-02-12
5249   2009-02-15
5250   2009-02-20
5251   2009-02-25
5252   2009-03-09
5253   2009-03-12
5254   2009-03-22
5255   2009-03-23
5256   2009-04-01
5257   2009-04-06
5258   2009-04-09
5259   2009-04-17
5260   2009-04-17
5261   2009-04-29
5262   200

In [6]:
dframe['Aboard'] = pd.to_numeric(dframe['Aboard'], downcast="integer", errors="coerce")
dframe['Aboard']
#the column stays float64 for now: NaN cannot be stored in a plain integer dtype, so the downcast has no effect yet
#casting to the nullable dtype "Int64" would also give integers while keeping the rows with NaN

0         2.0
1         5.0
2         1.0
3        20.0
4        30.0
5        41.0
6        19.0
7        20.0
8        22.0
9        19.0
10       28.0
11       20.0
12       20.0
13       23.0
14       21.0
15       24.0
16       18.0
17       18.0
18       23.0
19       22.0
20       19.0
21        1.0
22        1.0
23        1.0
24        1.0
25        1.0
26        NaN
27        1.0
28        1.0
29        1.0
        ...  
5238      7.0
5239      8.0
5240      7.0
5241      3.0
5242      3.0
5243     12.0
5244      9.0
5245    155.0
5246     28.0
5247      2.0
5248     49.0
5249     13.0
5250      5.0
5251    134.0
5252     11.0
5253     18.0
5254     14.0
5255      2.0
5256     16.0
5257     24.0
5258      6.0
5259     11.0
5260     11.0
5261      7.0
5262     18.0
5263    112.0
5264      4.0
5265    228.0
5266      1.0
5267     13.0
Name: Aboard, Length: 5268, dtype: float64


Before proceeding with changes to other datatypes, we need to check for NaN values, etc.

> isna(obj)	    Detect missing values for an array-like object.
>
> isnull(obj)	Detect missing values for an array-like object.
>
> notna(obj)	Detect non-missing values for an array-like object.
>
> notnull(obj)	Detect non-missing values for an array-like object.

Aboard's mean value is about 27.55, so we can use that as a temporary placeholder (rounded to 28, to keep it an integer).
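
A minimal sketch of the two options mentioned so far, on a made-up series (`aboard`, not the real column): filling with a numeric placeholder, or keeping the gap with the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the Aboard column
aboard = pd.Series([2.0, 5.0, np.nan, 20.0], name="Aboard")

# Option 1: fill with the rounded mean as a *number*, then cast to a plain integer dtype
placeholder = int(round(aboard.mean()))
filled = aboard.fillna(placeholder).astype("int64")

# Option 2: keep the missing value and use pandas' nullable integer dtype instead
nullable = aboard.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum(), "value still missing")
```

The first option makes later arithmetic simpler but hides which values were invented. The second keeps that information visible.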

In [7]:
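import numpy as np

aboard_series = dframe['Aboard']
#TODO: correct this - filling with the string "28" turns the series into dtype object (see the output below);
#filling with the number 28 would keep it numeric
aboard_series.fillna("28", inplace=True)
aboard_series

#alternatively, use replace(), e.g. replace([1.5, df00], [np.nan, 'a'])
# see http://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#numeric-replacement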

0         2
1         5
2         1
3        20
4        30
5        41
6        19
7        20
8        22
9        19
10       28
11       20
12       20
13       23
14       21
15       24
16       18
17       18
18       23
19       22
20       19
21        1
22        1
23        1
24        1
25        1
26       28
27        1
28        1
29        1
       ... 
5238      7
5239      8
5240      7
5241      3
5242      3
5243     12
5244      9
5245    155
5246     28
5247      2
5248     49
5249     13
5250      5
5251    134
5252     11
5253     18
5254     14
5255      2
5256     16
5257     24
5258      6
5259     11
5260     11
5261      7
5262     18
5263    112
5264      4
5265    228
5266      1
5267     13
Name: Aboard, Length: 5268, dtype: object

In [8]:
#are there NA values left? null_aboard holds the entries that are still missing,
#and is empty if there are none
null_aboard = aboard_series[aboard_series.isnull()]
if null_aboard.empty:
    print("No null values in Aboard series.")
else:
    print(null_aboard)
#the placeholder was filled in as a string, so convert the series back to numbers
dframe["Aboard"] = pd.to_numeric(aboard_series, errors="coerce", downcast="integer")
#check the dataframe to see if all replacements were done: Aboard should now have 5268 ints
dframe.info()

No null values in Aboard series.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 11 columns):
Date          5268 non-null object
Time          3049 non-null object
Location      5248 non-null object
Latitude      5240 non-null float64
Longitude     5240 non-null float64
Operator      5250 non-null object
Type          5241 non-null object
Aboard        5268 non-null int16
Fatalities    5256 non-null float64
Ground        5246 non-null float64
Summary       4878 non-null object
dtypes: float64(4), int16(1), object(6)
memory usage: 421.9+ KB


In [9]:
#quick plot - EDA
import matplotlib as mp
import matplotlib.pyplot as plt

#print(plt.style.available)
mp.style.use("tableau-colorblind10")

#extend the upper edge by one bin width so the maximum value is included in the last bin
dframe["Aboard"].hist(bins = range(dframe["Aboard"].min(), dframe["Aboard"].max() + 10, 10))
plt.ylabel("Count of Occurrences")
plt.xlabel("People Aboard per Flight")
plt.yticks(range(0, 2500, 250))
plt.xticks(range(0, 650, 50))
plt.show()



<Figure size 640x480 with 1 Axes>

In [10]:
dframe["Aboard"].max()


644

Doing the same thing for fatalities and ground... Let's check which values are missing or NaN

Also, let's check for values lower than 1 in Aboard and Fatalities

In [160]:
#NaN and lower-than-x checks (useful for spotting zeros, for instance)

def check_nulls(column, dataframe=dframe):
    column_series = dataframe[column]
    #isna() and isnull() are aliases, so one call is enough
    null_column = column_series[column_series.isna()]
    if null_column.empty:
        print("No null values in {0} series.".format(column))
    else:
        print(null_column)

def check_lower_than(column, value=1, dataframe=dframe):
    column_series = dataframe[column]
    filtered_column = column_series[column_series < value]
    if filtered_column.empty:
        print("No values lower than {1} in {0} series.".format(column, value))
    else:
        print(filtered_column)


#TODO: create a neat function for this
for column in ["Aboard", "Fatalities", "Ground"]:
    print("\n--- {}".format(column.upper()))
    check_nulls(column)
    check_lower_than(column, 1)



--- ABOARD
No null values in Aboard series.
No values lower than 1 in Aboard series.

--- FATALITIES
364   NaN
423   NaN
768   NaN
Name: Fatalities, dtype: float64
26      0.0
108     0.0
387     0.0
593     0.0
889     0.0
897     0.0
1265    0.0
1359    0.0
1440    0.0
1443    0.0
1610    0.0
1837    0.0
1868    0.0
1885    0.0
1927    0.0
1982    0.0
1983    0.0
2066    0.0
2247    0.0
2266    0.0
2359    0.0
2486    0.0
2590    0.0
2835    0.0
3182    0.0
3341    0.0
3366    0.0
3417    0.0
3428    0.0
3470    0.0
3541    0.0
3549    0.0
3611    0.0
3767    0.0
3927    0.0
3950    0.0
4068    0.0
4117    0.0
4171    0.0
4199    0.0
4231    0.0
4242    0.0
4273    0.0
4339    0.0
4543    0.0
4553    0.0
4594    0.0
4701    0.0
4748    0.0
4797    0.0
4875    0.0
5057    0.0
5074    0.0
5134    0.0
5178    0.0
5186    0.0
5197    0.0
5217    0.0
5245    0.0
Name: Fatalities, dtype: float64

--- GROUND
364   NaN
423   NaN
768   NaN
Name: Ground, dtype: float64
0       0.0
1       0.0

From the above report we can see the rows we need to fix (NaN) and the rows we should investigate to check that the values are correct (zeros). A 0 in the Ground column is not "suspicious", as that column counts fatalities on the ground *on impact*, but a 0 in the Aboard column certainly raises some flags - was it an unmanned flight?

Once all these values are filled in, we can downcast them to integers - there should be no "half" of a person accounted for. Right now they are floats because that is how pandas stores columns containing NaN values.
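
The report above could also be collected by a single helper that returns its findings instead of printing them. This is only a sketch: `column_report` is a hypothetical name, and the frame `toy` is invented for the example.

```python
import numpy as np
import pandas as pd

def column_report(frame, columns, minimum=1):
    """Per column, list the index labels of missing values and of values below `minimum`."""
    report = {}
    for column in columns:
        series = frame[column]
        report[column] = {
            "missing": series.index[series.isna()].tolist(),
            # comparisons with NaN are False, so missing values are not counted twice
            "below_minimum": series.index[series < minimum].tolist(),
        }
    return report

# Invented rows for illustration
toy = pd.DataFrame({
    "Aboard": [2, 0, 5],
    "Fatalities": [2.0, np.nan, 0.0],
})

report = column_report(toy, ["Aboard", "Fatalities"])
for column, found in report.items():
    print(column, found)
```

Returning index labels makes it easy to feed the rows that need attention into `dframe.loc[...]` afterwards.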

Starting with the two rows of the Aboard series that show 0s...

>--- ABOARD
>
>No null values in Aboard series.
>
>
>*3307*    0
>
>*3611*   0
>
>Name: Aboard, dtype: int16

In [12]:
dframe[3307:3308] #slicing selects rows instead of columns

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3307 | 09/22/1981 | 12:00 | Near Babaeski, Turkey | 41.428761 | 27.09429 | Military - Turkish Air Force | Northrop F-5A | 0 | 0.0 | 40.0 | The fighter crashed into a village after the p... |


For this record, we find the following information:

> A Turkish Air Force F-5A crashed on to an army camp at Pancarkoy, 276 n.m. north-west of Ankara, during a mock attack on September 22. 
>
> The pilot and 39 soldiers were killed, while 72 soldiers were severely injured. 
[Source](https://www.flightglobal.com/FlightPDFArchive/1981/1981%20-%203134.PDF)

and a news article the day after the crash:

>Hospital sources said more than 100 soldiers, including dead and injured, had been flown to Istanbul by helicopter from the crash site, near Babaeski, about 30 miles from the Greek border and 70 miles northwest of Istanbul.

>The Turkish military imposed a news blackout after initial reports that the jet was an F-104 and that at least 100 soldiers had been killed. Turkey's military ruler, Gen. Kenan Evren, announced over the state radio later that an F-5 had crashed and that there were ''several casualties.''

>The sources said the pilot was practicing a diving run over the bivouac area and was unable to pull the plane out of its descent. They said that he was killed in the crash, which occurred about noon, and that there were reports the jet hit a gasoline or jet fuel dump.

[NYTimes](https://www.nytimes.com/1981/09/23/world/26-killed-as-turkish-jet-crashes-into-area-set-for-nato-exercise.html)

So let's amend the row to read Aboard/Fatalities/Ground as 1/1/40. The summary also appears to be incorrect regarding this accident.
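
Manual corrections like this one recur for many rows, so they could be kept in one dictionary and applied in a loop. The helper below is a hypothetical sketch (`apply_corrections` is not part of pandas), shown on an invented two-row frame:

```python
import pandas as pd

def apply_corrections(frame, corrections):
    """Apply {row_label: {column: new_value}} to `frame` in place and return it."""
    for row_label, changes in corrections.items():
        for column, value in changes.items():
            frame.loc[row_label, column] = value
    return frame

# Invented rows; only the index label mirrors the row discussed above
toy = pd.DataFrame(
    {"Aboard": [0, 3], "Fatalities": [0.0, 3.0], "Ground": [40.0, 0.0]},
    index=[3307, 3308],
)

apply_corrections(toy, {3307: {"Aboard": 1, "Fatalities": 1}})
print(toy.loc[3307])
```

Keeping the corrections as data also documents, in one place, which rows were edited by hand.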

In [13]:
dframe.loc[3307, "Aboard"] = 1
dframe.loc[3307, "Fatalities"] = 1
dframe.loc[3307, "Summary"] = "A Turkish Air Force F-5A crashed on to an army camp at Pancarkoy, 276 n.m. north-west of Ankara, during a mock attack on September 22. The pilot and 39 soldiers were killed, while 72 soldiers were severely injured."

dframe.loc[3307, :]

Date                                                 09/22/1981
Time                                                      12:00
Location                                  Near Babaeski, Turkey
Latitude                                                41.4288
Longitude                                               27.0943
Operator                           Military - Turkish Air Force
Type                                              Northrop F-5A
Aboard                                                        1
Fatalities                                                    1
Ground                                                       40
Summary       A Turkish Air Force F-5A crashed on to an army...
Name: 3307, dtype: object

In [14]:
dframe[3611:3612]

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3611 | 03/27/1986 | NaN | Bangui, Central African Republic | 4.39669 | 18.558599 | Military - French Air Force | Sepecat Jaguar A | 0 | 0.0 | 35.0 | The jet fighter crashed into a school shortly ... |


Some info on this crash:

> Time:	08:00

> Fatalities:	Fatalities: 0 / Occupants: 1

> Other fatalities:	21

> Aircraft damage:	Written off (damaged beyond repair)

> Narrative: Crashed into a school shortly after taking off from Bangui Central African Republic, after experiencing engine failure. Pilot (Lt. M. Etcheberry) ejected safely, but the crippled Jaguar killed at least 21 people on the ground (some reports say "up to 35") when it came down on a school on the outskirts of the city.

[Source](https://aviation-safety.net/wikibase/wiki.php?id=64984)

So we can add Time and Aboard as 08:00/1.

In [15]:
dframe.loc[3611, "Aboard"] = 1
dframe.loc[3611, "Time"] = "08:00"

dframe.loc[3611, :]

Date                                                 03/27/1986
Time                                                      08:00
Location                       Bangui, Central African Republic
Latitude                                                4.39669
Longitude                                               18.5586
Operator                            Military - French Air Force
Type                                           Sepecat Jaguar A
Aboard                                                        1
Fatalities                                                    0
Ground                                                       35
Summary       The jet fighter crashed into a school shortly ...
Name: 3611, dtype: object

In [16]:
check_nulls("Aboard")
check_lower_than("Aboard", 1)


No null values in Aboard series.
No values lower than 1 in Aboard series.


Moving on to check Fatalities, we should keep in mind that this column details the number of people that died *aboard* - crew and passengers.

As we saw in the entry above (row 3611), detailing the accident in Bangui, the pilot was the sole person aboard and escaped alive, and all the fatalities were on impact. So Fatalities can be less than Ground but not more than Aboard. For now let's check the NaN values.
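
The rule that Fatalities cannot exceed Aboard can be turned into a quick consistency check. A minimal sketch on invented rows (`toy`), assuming the same column names as `dframe`:

```python
import numpy as np
import pandas as pd

# Invented rows: the second one breaks the rule, the third has a missing value
toy = pd.DataFrame({
    "Aboard": [1, 4, 12],
    "Fatalities": [0.0, 5.0, np.nan],
    "Ground": [35.0, 0.0, 0.0],
})

# Comparisons with NaN evaluate to False, so rows with missing Fatalities are not flagged here
inconsistent = toy[toy["Fatalities"] > toy["Aboard"]]
print(inconsistent)
```

Ground is deliberately left out of the comparison, since ground fatalities are independent of the number of people aboard.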


In [17]:
check_nulls("Fatalities")

26     NaN
333    NaN
364    NaN
423    NaN
537    NaN
570    NaN
571    NaN
573    NaN
593    NaN
678    NaN
768    NaN
4705   NaN
Name: Fatalities, dtype: float64


In [18]:
dframe[26:27]

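| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 10/20/1919 | NaN | English Channel | 49.035534 | -5.337111 | Aircraft Transport and Travel | De Havilland DH-4 | 28 | NaN | NaN | NaN |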


[Source 1](https://aviation-safety.net/wikibase/wiki.php?id=27974)

> Time:	day

> Fatalities: 0 / Occupants: 1

> Other fatalities:	0

[Source 2](https://www.baaa-acro.com/crash/crash-de-havilland-dh4a-folkestone)

> Crew on board: 1    Crew fatalities:  0

> Pax on board: 0     Pax fatalities:  0

> Other fatalities: 0 Total fatalities:  0

> Circumstances:  The pilot, Major-General Edward James Montagu-Stuart-Wortley, was returning to Croydon following an exhibition in Interlaken, Switzerland. After a fuel stop in Paris-Le Bourget, he continued his flight to the base in Croydon. While overflying The Channel and approaching the British coast, pilot encountered fog and the visibility was low. Eventually, aircraft crashed into the sea off Folkestone, Kent. The pilot was rescued while the aircraft was lost.

Aboard is 28 because we replaced the missing value with the mean earlier, but we can now fill in all the missing information. The time is uncertain but recorded as "day", so let's default to 12:00.

In [19]:
dframe.loc[26, "Time"] = "12:00"
dframe.loc[26, "Aboard"] = 1
dframe.loc[26, "Fatalities"] = 0
dframe.loc[26, "Ground"] = 0
dframe.loc[26, "Summary"] = "The pilot, Major-General Edward James Montagu-Stuart-Wortley, was returning to Croydon following an exhibition in Interlaken, Switzerland. After a fuel stop in Paris-Le Bourget, he continued his flight to the base in Croydon. While overflying The Channel and approaching the British coast, pilot encountered fog and the visibility was low. Eventually, aircraft crashed into the sea off Folkestone, Kent. The pilot was rescued while the aircraft was lost."

dframe.loc[26, :]

Date                                                 10/20/1919
Time                                                      12:00
Location                                        English Channel
Latitude                                                49.0355
Longitude                                              -5.33711
Operator                          Aircraft Transport and Travel
Type                                          De Havilland DH-4
Aboard                                                        1
Fatalities                                                    0
Ground                                                        0
Summary       The pilot, Major-General Edward James Montagu-...
Name: 26, dtype: object

In [20]:
dframe[333:334]

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 333 | 08/10/1934 | NaN | Ningbo, China | 29.873859 | 121.55027 | China National Aviation Corporation | Sikorsky S-38B | 28 | NaN | NaN | NaN |


[Source](https://www.baaa-acro.com/crash/crash-sikorsky-s-38-ningbo-1-killed)

> Date & Time:  Aug 10, 1934

> Crew fatalities:  1

> Total fatalities:  1

> Circumstances:  Crashed in unknown circumstances in the Bay of Ningbo. At least one crew member was killed.

In [21]:
dframe.loc[333, "Aboard"] = 1
dframe.loc[333, "Fatalities"] = 1
dframe.loc[333, "Ground"] = 0
dframe.loc[333, "Summary"] = "Crashed in unknown circumstances in the Bay of Ningbo. At least one crew member was killed."
dframe.loc[333, :]

Date                                                 08/10/1934
Time                                                        NaN
Location                                          Ningbo, China
Latitude                                                29.8739
Longitude                                                121.55
Operator                    China National Aviation Corporation
Type                                             Sikorsky S-38B
Aboard                                                        1
Fatalities                                                    1
Ground                                                        0
Summary       Crashed in unknown circumstances in the Bay of...
Name: 333, dtype: object

In [22]:
dframe[364:365]
#couldn't find info on this entry
#TODO: drop it?

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 364 | 08/13/1935 | NaN | Hangow, China | 36.563114 | 103.735809 | China National Aviation Corporation | Sikorsky S-38B | 28 | NaN | NaN | Destoryed in a storm. |


In [23]:
dframe[423:424]

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 423 | 12/26/1936 | NaN | Nanking, China | 32.05838 | 118.796471 | China National Aviation Corporation | Douglas DC-2 | 28 | NaN | NaN | NaN |


Not much information was found on this flight, but it seems to have been a rather minor incident:

[Source](https://aviation-safety.net/database/record.php?id=19361228-1)
> Emergency landed on a river sandbar near Nanking after engine trouble. Pilot Hiram Broiles injured. Aircraft recovered and flown again. Ex Pan Am NC14297

*Consider dropping this entry as well?*

In [24]:
#TODO:drop?
dframe.loc[423, "Summary"] = "Emergency landed on a river sandbar near Nanking after engine trouble. Pilot Hiram Broiles injured. Aircraft recovered and flown again. "
dframe.loc[423, :]

Date                                                 12/26/1936
Time                                                        NaN
Location                                         Nanking, China
Latitude                                                32.0584
Longitude                                               118.796
Operator                    China National Aviation Corporation
Type                                               Douglas DC-2
Aboard                                                       28
Fatalities                                                  NaN
Ground                                                      NaN
Summary       Emergency landed on a river sandbar near Nanki...
Name: 423, dtype: object

In [25]:
dframe[527:528]

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | 11/20/1939 | NaN | Gosport, England | 50.794418 | -1.12174 | British Airways | Airspeed Oxford | 2 | 2.0 | NaN | NaN |


[Source](https://aviation-safety.net/wikibase/wiki.php?id=24958)

>Time:	day

>Fatalities:	Fatalities: 2 / Occupants: 2 / Other fatalities: 0

>Written off (destroyed) when crashed after struck barrage balloon near Marchan, Gosport, Hampshire 20.11.39. 
 
>The pilot and the radio operator were killed when a civil aeroplane crashed near Gosport, Hants, yesterday after hitting a cable. The pilot, Arthur George Nicholson was 23. Before the war he flew on the Heston to Warsaw route for British Airways. He was a native of Aylesbury, Bucks., and was married. The radio operator, Arthur Edward Eady, a Londoner, was 27. He too, was a married man and before the war was with British Airways, serving on the Scandinavian and Brussels routes.

In [26]:
dframe.loc[527, "Time"] = "12:00"
dframe.loc[527, "Ground"] = 0
dframe.loc[527, "Summary"] = "Written off (destroyed) when crashed after struck barrage balloon near Marchan, Gosport, Hampshire 20.11.39. The pilot and the radio operator were killed when a civil aeroplane crashed near Gosport, Hants, yesterday after hitting a cable."
dframe.loc[527, :]

Date                                                 11/20/1939
Time                                                      12:00
Location                                       Gosport, England
Latitude                                                50.7944
Longitude                                              -1.12174
Operator                                        British Airways
Type                                            Airspeed Oxford
Aboard                                                        2
Fatalities                                                    2
Ground                                                        0
Summary       Written off (destroyed) when crashed after str...
Name: 527, dtype: object

In [27]:
dframe[537:538]

| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 537 | 07/07/1940 | NaN | Gulf of Tonkin | 19.837351 | 107.775848 | Air France | Dewoitine D-338 | 28 | NaN | NaN | Shot down by a Japanese military fighter. |


[Source](https://aviation-safety.net/database/record.php?id=19400707-0)

>Crew:	Fatalities: 4 / Occupants: 4

>Passengers:	Fatalities: 0 / Occupants: 0

>Total:	Fatalities: 4 / Occupants: 4

[Source 2]()

>Take-off at 0835 with two French officers and a Major of the IJN as passengers.

>Supposed to have been shot down by Japanese fighters.

>The wreck was located at 109°30 E and 21° N. Fishermen from Hong Kong retrieved the corpses 6 km off the coast.

>On 10th July, the Japanese apologized to the Gouverneur Général for this "mistake", thus confirming that the D.338 was actually shot down by their fighters.

From this we can confirm that the coordinates obtained from the data scraping are approximately right, and the 08:35 take-off gives us a time window. For practical purposes, let's round the crash time to 09:00. We could be more precise by working out how long this aircraft would take to cover the distance from its departure at Hanoi to where it crashed.
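
As a rough sketch of that refinement, the great-circle distance can be computed with the haversine formula and divided by a cruise speed. The Hanoi coordinates are approximate, and the cruise speed is a placeholder assumption, not a figure taken from any of the sources. The wreck position is the 21° N, 109°30' E quoted above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

hanoi = (21.03, 105.85)    # approximate city coordinates (assumption)
wreck = (21.0, 109.5)      # 21° N, 109°30' E, from the quoted source
cruise_speed_kmh = 250.0   # placeholder assumption, to be replaced with the real figure

distance_km = haversine_km(*hanoi, *wreck)
flight_hours = distance_km / cruise_speed_kmh
print("distance: {:.0f} km, flight time: {:.1f} h".format(distance_km, flight_hours))
```

Adding the resulting flight time to the 08:35 take-off would give an estimated crash time, subject to the assumed speed and a straight-line route.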


In [28]:
dframe.loc[537, "Time"] = "09:00"
dframe.loc[537, "Aboard"] = 4
dframe.loc[537, "Fatalities"] = 4
dframe.loc[537, "Ground"] = 0
dframe.loc[537, :]

Date                                         07/07/1940
Time                                              09:00
Location                                 Gulf of Tonkin
Latitude                                        19.8374
Longitude                                       107.776
Operator                                     Air France
Type                                    Dewoitine D-338
Aboard                                                4
Fatalities                                            4
Ground                                                0
Summary       Shot down by a Japanese military fighter.
Name: 537, dtype: object

In [29]:
check_nulls("Fatalities")
dframe[570:574]

364    NaN
423    NaN
570    NaN
571    NaN
573    NaN
593    NaN
678    NaN
768    NaN
4705   NaN
Name: Fatalities, dtype: float64


| | Date | Time | Location | Latitude | Longitude | Operator | Type | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 570 | 01/24/1942 | NaN | Near Samarinda, Borneo | -0.502183 | 117.153763 | KNILM | Douglas DC-3 | 28 | NaN | NaN | Shot down by Japanese military aircraft. |
| 571 | 01/26/1942 | NaN | Kupang, Timor | -10.172443 | 123.577942 | KNILM | Grumman G-21 Goose | 28 | NaN | NaN | Shot down by Japanese military aircraft. |
| 572 | 01/30/1942 | NaN | Near Kupang, Timor | -10.172443 | 123.577942 | Qantas | Short S-23 (flying boat) | 18 | 13.0 | 0.0 | Shot down by Japanese military aircraft. Owned... |
| 573 | 02/14/1942 | NaN | NaN | NaN | NaN | China National Aviation Corporation | Douglas DC-2 | 28 | NaN | NaN | NaN |


[Source](https://en.wikipedia.org/wiki/1942_KNILM_Douglas_DC-3_shootdown)


> Passengers	8 / Crew	4 / Fatalities	4 / Survivors	8

> On 3 March 1942, PK-AFV, a Douglas DC-3-194 airliner operated by KNILM, was shot down over Western Australia by Imperial Japanese Navy Air Service fighter aircraft, resulting in the deaths of four passengers fleeing the Japanese invasion of Java. Among the passengers were five pilots from the army as well navy, Pieter Cramerus, G.D Brinkman, Leon Vanderburg, Daan Hendriksz and H.M. Gerrits. The other three passengers were Maria van Tuyn, her baby son Johannes and trainee flight engineer H. van Romondt.

> The plane was airborne at 01.15am. At about 09.00am, as the DC-3 neared Broome, skirting the Kimberley coast, three Mitsubishi Zeroes dived at the DC-3 and fired at its port side, scoring numerous hits. The port engine caught fire, putting the aircraft into a steep spiral dive.

>The flight engineer and three passengers, including a baby, were killed and others seriously injured by bullets. 

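Time, Aboard, Fatalities and Ground can be filled in from this information. Note that the source describes a shootdown on 3 March 1942 near Broome, Western Australia, while row 570 is dated 01/24/1942 near Samarinda, Borneo, so this match should be double-checked.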

In [30]:
dframe.loc[570, "Time"] = "09:00"
dframe.loc[570, "Aboard"] = 12
dframe.loc[570, "Fatalities"] = 4
dframe.loc[570, "Ground"] = 0
dframe.loc[570, :]

Date                                        01/24/1942
Time                                             09:00
Location                        Near Samarinda, Borneo
Latitude                                     -0.502183
Longitude                                      117.154
Operator                                         KNILM
Type                                      Douglas DC-3
Aboard                                              12
Fatalities                                           4
Ground                                               0
Summary       Shot down by Japanese military aircraft.
Name: 570, dtype: object

[Source1](https://aviation-safety.net/wikibase/wiki.php?id=26184)

[Source2](http://www.hdekker.info/Nieuwe%20map/1940.htm#26.01.1942)

[Source3](https://www.baaa-acro.com/crash/crash-grumman-g-21a-goose-kupang-5-killed)

> Crew on board:  5 / Pax on board:  0

> Total fatalities:  5

> The crew left Kupang-Penfui Airport in the day to proceed to an aerial surveillance of the region of Kupang to inspect the evacuation of the civilians because of the impending Japanese invasion. En route, the seaplane was shot down by the pilot of a Japanese fighter and crashed in a field, killing all five crew members.

In [31]:
dframe.loc[571, "Aboard"] = 5
dframe.loc[571, "Fatalities"] = 5
dframe.loc[571, "Ground"] = 0
dframe.loc[571, "Summary"] = "The crew left Kupang-Penfui Airport in the day to proceed to an aerial surveillance of the region of Kupang to inspect the evacuation of the civilians because of the impending Japanese invasion. En route, the seaplane was shot down by the pilot of a Japanese fighter and crashed in a field, killing all five crew members."
dframe.loc[571, :]

Date                                                 01/26/1942
Time                                                        NaN
Location                                          Kupang, Timor
Latitude                                               -10.1724
Longitude                                               123.578
Operator                                                  KNILM
Type                                         Grumman G-21 Goose
Aboard                                                        5
Fatalities                                                    5
Ground                                                        0
Summary       The crew left Kupang-Penfui Airport in the day...
Name: 571, dtype: object

> Date & Time:  Feb 19, 1942

> Location:  Zhengzhou-Xinzheng Henan

> Circumstances:  At least one person was killed when the aircraft crashed near Zhengzhou in unknown circumstances.

[Source](https://www.baaa-acro.com/crash/crash-douglas-dc-2-near-zhengzhou-1-killed)

> Aboard:	1   (passengers:0  crew:1)

> Fatalities:	1   (passengers:0  crew:1)

[Source2](http://www.planecrashinfo.com/1942/1942-6.htm)

In [32]:
dframe.loc[573, "Date"] = "02/19/1942"
dframe.loc[573, "Location"] = "Zhengzhou-Xinzheng Henan, China"
dframe.loc[573, "Aboard"] = 1
dframe.loc[573, "Fatalities"] = 1
dframe.loc[573, "Ground"] = 0
dframe.loc[573, "Summary"] = "At least one person was killed when the aircraft crashed near Zhengzhou in unknown circumstances."
dframe.loc[573, :]

#TODO: run script to get location coordinates and fill those values

Date                                                 02/19/1942
Time                                                        NaN
Location                        Zhengzhou-Xinzheng Henan, China
Latitude                                                    NaN
Longitude                                                   NaN
Operator                    China National Aviation Corporation
Type                                               Douglas DC-2
Aboard                                                        1
Fatalities                                                    1
Ground                                                        0
Summary       At least one person was killed when the aircra...
Name: 573, dtype: object

In [33]:
dframe.loc[593, :]

#no more info found

Date                                                 10/01/1942
Time                                                        NaN
Location                                         Kunming, China
Latitude                                                24.8797
Longitude                                               102.833
Operator                    China National Aviation Corporation
Type                                               Douglas C-47
Aboard                                                       28
Fatalities                                                  NaN
Ground                                                      NaN
Summary       Crashed while attempting to land after losing ...
Name: 593, dtype: object

In [34]:
dframe.loc[678, :]

Date                              11/09/1944
Time                                     NaN
Location                     Seljord, Norway
Latitude                             59.4838
Longitude                            8.62663
Operator      Military - U.S. Army Air Corps
Type                                     NaN
Aboard                                    28
Fatalities                               NaN
Ground                                   NaN
Summary                                  NaN
Name: 678, dtype: object

The information for this entry in the dataset is wrong and very incomplete. Cross-referencing other sources suggests it refers to this crash:

[Source](https://www.norwegianamerican.com/featured/soldiers-last-flight-to-telemark/)

> The bomb squadron was equipped with a Consolidated B-24 Liberator with registration number 42-52196, and the bomber’s commander was the 22-year-old junior officer John B. O’Hara accompanied by nine other crew members.

[Source2](https://www.458bg.com/crewij3ohara.htm)

>On September 9, 1944 the crew took off from Scotland to their drop point, to drop arms and ammunition to the Gullknappen Norwegian resistance group. Flying B-24H-10-FO 42-52196 War Bride originally from the 453rd Bombardment Group, their aircraft was designated Crupper-5.  Also along on this mission as a gunner was a veteran from the August 1, 1943 Ploesti raid, S/Sgt John P. Morris of the 389th Bomb Group.  Due to fog, rain, and icing conditions the crew crashed into Skorve Mountain killing all eleven men on board.  A memorial has been built in the nearby town of Selfjord, Norway.  Wreckage can still be found scattered on the mountaintop today.

In [35]:
dframe.loc[678, "Date"] = "09/09/1944"
dframe.loc[678, "Time"] = "20:30"
dframe.loc[678, "Type"] = "B-24 Liberator"
dframe.loc[678, "Aboard"] = 11
dframe.loc[678, "Fatalities"] = 11
dframe.loc[678, "Ground"] = 0
dframe.loc[678, "Summary"] = "On September 9, 1944 the crew took off from Scotland to their drop point, to drop arms and ammunition to the Gullknappen Norwegian resistance group. Flying B-24H-10-FO 42-52196 War Bride originally from the 453rd Bombardment Group, their aircraft was designated Crupper-5. Also along on this mission as a gunner was a veteran from the August 1, 1943 Ploesti raid, S/Sgt John P. Morris of the 389th Bomb Group. Due to fog, rain, and icing conditions the crew crashed into Skorve Mountain killing all eleven men on board. A memorial has been built in the nearby town of Selfjord, Norway. Wreckage can still be found scattered on the mountaintop today."
dframe.loc[678, :]

Date                                                 09/09/1944
Time                                                      20:30
Location                                        Seljord, Norway
Latitude                                                59.4838
Longitude                                               8.62663
Operator                         Military - U.S. Army Air Corps
Type                                             B-24 Liberator
Aboard                                                       11
Fatalities                                                   11
Ground                                                        0
Summary       On September 9, 1944 the crew took off from Sc...
Name: 678, dtype: object
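The corrections in this section all follow the same pattern: several `.loc` assignments on a single row. A minimal sketch of how that pattern could be wrapped in a helper is shown below. `apply_fixes` is a hypothetical name, not something defined in this notebook, and the toy frame only stands in for `dframe`.

```python
import pandas as pd


def apply_fixes(frame, row_label, fixes):
    """Write each column -> value pair in `fixes` into one row of `frame`, in place."""
    unknown = set(fixes) - set(frame.columns)
    if unknown:
        # Guard against a typo silently creating a new column.
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    for column, value in fixes.items():
        frame.loc[row_label, column] = value


# Toy stand-in for one row of the dataset (values are illustrative only).
toy = pd.DataFrame(
    {"Aboard": [28.0], "Fatalities": [float("nan")]},
    index=[678],
)
apply_fixes(toy, 678, {"Aboard": 11, "Fatalities": 11})
```

The column check is the main reason to use a helper here: a misspelled column name in a plain `.loc` assignment would add a new column instead of failing.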

In [36]:
check_nulls("Fatalities")
dframe.loc[768, :]

364    NaN
423    NaN
593    NaN
768    NaN
4705   NaN
Name: Fatalities, dtype: float64


Date                                                03/18/1946
Time                                                       NaN
Location                 Between Chungking and Shanghai, China
Latitude                                               31.2304
Longitude                                              121.474
Operator                   China National Aviation Corporation
Type                                                       NaN
Aboard                                                      28
Fatalities                                                 NaN
Ground                                                     NaN
Summary       Disappeared while en route. Plane never located.
Name: 768, dtype: object

In [37]:
dframe.loc[4705, :]

Date                                                 03/22/2000
Time                                                        NaN
Location                                        Herreira, Spain
Latitude                                                37.3628
Longitude                                               -4.8507
Operator                          Military - Ej√©rcito del Aire
Type                                    CASA 212-DE Aviocar 200
Aboard                                                       28
Fatalities                                                  NaN
Ground                                                      NaN
Summary       Crashed while attempting to land in poor weather.
Name: 4705, dtype: object

[Source](http://www.airalandalus.org/content/eda-casa-aviocar-c-212-herrer%C3%AD-22-de-marzo-de-2000)

[Source2](https://aviation-safety.net/database/record.php?id=20000322-0&lang=es)

In [38]:
dframe.loc[4705, "Time"] = "17:45"
dframe.loc[4705, "Operator"] = "Military - Ejercito del Aire"
dframe.loc[4705, "Aboard"] = 7
dframe.loc[4705, "Fatalities"] = 7
dframe.loc[4705, "Ground"] = 0
dframe.loc[4705, "Summary"] = "Crashed in flames in bad weather. The CASA 212 belonged to the 408 Squadron of the Air Force as part of the Air Intelligence Center."
print(dframe.loc[4705, :])

Date                                                 03/22/2000
Time                                                      17:45
Location                                        Herreira, Spain
Latitude                                                37.3628
Longitude                                               -4.8507
Operator                           Military - Ejercito del Aire
Type                                    CASA 212-DE Aviocar 200
Aboard                                                        7
Fatalities                                                    7
Ground                                                        0
Summary       Crashed in flames in bad weather. The CASA 212...
Name: 4705, dtype: object


In [39]:
#only 4 rows left with nan values; consider dropping
check_nulls("Fatalities")
#checking how many entries had no fatalities
check_lower_than("Fatalities", 1)

364   NaN
423   NaN
593   NaN
768   NaN
Name: Fatalities, dtype: float64
26      0.0
108     0.0
387     0.0
889     0.0
897     0.0
1265    0.0
1359    0.0
1440    0.0
1443    0.0
1610    0.0
1837    0.0
1868    0.0
1885    0.0
1927    0.0
1982    0.0
1983    0.0
2066    0.0
2247    0.0
2266    0.0
2359    0.0
2486    0.0
2590    0.0
2835    0.0
3182    0.0
3341    0.0
3366    0.0
3417    0.0
3428    0.0
3470    0.0
3541    0.0
3549    0.0
3611    0.0
3767    0.0
3927    0.0
3950    0.0
4068    0.0
4117    0.0
4171    0.0
4199    0.0
4231    0.0
4242    0.0
4273    0.0
4339    0.0
4543    0.0
4553    0.0
4594    0.0
4701    0.0
4748    0.0
4797    0.0
4875    0.0
5057    0.0
5074    0.0
5134    0.0
5178    0.0
5186    0.0
5197    0.0
5217    0.0
5245    0.0
Name: Fatalities, dtype: float64


In [40]:
check_nulls("Ground")
dframe.info()

228    NaN
308    NaN
310    NaN
364    NaN
423    NaN
593    NaN
643    NaN
768    NaN
1177   NaN
1186   NaN
1822   NaN
4709   NaN
5264   NaN
Name: Ground, dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 11 columns):
Date          5268 non-null object
Time          3056 non-null object
Location      5249 non-null object
Latitude      5240 non-null float64
Longitude     5240 non-null float64
Operator      5250 non-null object
Type          5242 non-null object
Aboard        5268 non-null int16
Fatalities    5264 non-null float64
Ground        5255 non-null float64
Summary       4884 non-null object
dtypes: float64(4), int16(1), object(6)
memory usage: 421.9+ KB


In [41]:
dframe[228:229]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
228,11/18/1930,c: 2:00,"Techachapi Mountains, California",34.990406,-118.569565,PacifiAir Transport,Boeing 40,3,3.0,,Crashed into a mountainside at an altitude of ...


[Source](https://scvhistory.com/scvhistory/lw2998.htm)

In [42]:
dframe.loc[228, "Time"] = "02:00"
dframe.loc[228, "Ground"] = 0
#TODO: clean time column for things like "c:"
dframe.loc[228, :]

Date                                                 11/18/1930
Time                                                      02:00
Location                       Techachapi Mountains, California
Latitude                                                34.9904
Longitude                                               -118.57
Operator                                    PacifiAir Transport
Type                                                  Boeing 40
Aboard                                                        3
Fatalities                                                    3
Ground                                                        0
Summary       Crashed into a mountainside at an altitude of ...
Name: 228, dtype: object
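The TODO above mentions Time values with stray prefixes such as "c:". A minimal sketch of one way to normalise them follows, assuming the usable part of each value is always an hour and minute separated by a colon. `normalise_time` is a hypothetical helper and the sample values are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

# Matches an hour and a minute separated by ":" anywhere in the string.
_TIME_RE = re.compile(r"(\d{1,2}):(\d{2})")


def normalise_time(value):
    """Return value as a zero-padded 'HH:MM' string, or NaN if no valid time is found."""
    if pd.isna(value):
        return np.nan
    match = _TIME_RE.search(str(value))
    if match is None:
        return np.nan
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return np.nan
    return f"{hour:02d}:{minute:02d}"


sample = pd.Series(["c: 2:00", "17:18", np.nan, "c14:30", "unknown"])
cleaned = sample.map(normalise_time)
```

Values that contain no recognisable time become NaN rather than raising, so they show up in the same null checks used elsewhere in this section.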

In [43]:
dframe[308:311]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
308,11/09/1933,22:35,"Portland, Oregon",45.511791,-122.675629,United Air Lines,Boeing 247,9,4.0,,Crashed in a thickly wooded area upon taking o...
309,11/10/1933,,"Moriarty, New Mexico",35.006649,-106.045151,Trans Continental and Western Air,Northrop Delta,1,1.0,0.0,
310,11/20/1933,,"Near Tsinan, China",36.65184,117.120087,China National Aviation Corporation,Sinson,8,8.0,,Crashed into the Chingshan mountain range in fog.


[Portland crash source](http://offbeatoregon.com/1712e.1933-a-bad-year-for-plane-crashes-476.html)

In [44]:
dframe.loc[308, "Ground"] = 0
dframe.loc[310, "Ground"] = 0
check_nulls("Ground")
dframe.loc[593, :]

364    NaN
423    NaN
593    NaN
643    NaN
768    NaN
1177   NaN
1186   NaN
1822   NaN
4709   NaN
5264   NaN
Name: Ground, dtype: float64


Date                                                 10/01/1942
Time                                                        NaN
Location                                         Kunming, China
Latitude                                                24.8797
Longitude                                               102.833
Operator                    China National Aviation Corporation
Type                                               Douglas C-47
Aboard                                                       28
Fatalities                                                  NaN
Ground                                                      NaN
Summary       Crashed while attempting to land after losing ...
Name: 593, dtype: object

[Source](https://aviation-safety.net/database/record.php?id=19431006-0)

In [45]:
dframe.loc[593, "Date"] = "10/06/1943"
dframe.loc[593, "Aboard"] = 2
dframe.loc[593, "Fatalities"] = 0
dframe.loc[593, "Ground"] = 0
dframe.loc[593, "Summary"] = "Oil pressure on one engine was lost, soon after takeoff from Kunming, China. The captain feathered the propeller and came around for a landing but hydraulic pressure had dropped to zero so he couldn't get the landing gear down and locked. The C-47 crashed and caught fire."
print(dframe.loc[593, :])

Date                                                 10/06/1943
Time                                                        NaN
Location                                         Kunming, China
Latitude                                                24.8797
Longitude                                               102.833
Operator                    China National Aviation Corporation
Type                                               Douglas C-47
Aboard                                                        2
Fatalities                                                    0
Ground                                                        0
Summary       Oil pressure on one engine was lost, soon afte...
Name: 593, dtype: object


In [62]:
dframe.loc[643, "Ground"] = 0


dframe[1177:1178]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1177,02/07/1952,14:35,"Seguin, San Antonio, Texas",36.400002,139.0,Military - U.S. Air Force,Boeing B-29,13,13.0,5.0,Hit power lines and crashed into houses. Seven...


[Source](http://planecrashinfo.com/1952/1952-6.htm)

In [63]:
dframe.loc[1177, "Ground"] = 5
dframe.loc[1177, "Aboard"] = 13
dframe.loc[1177, "Fatalities"] = 13
dframe.loc[1177, "Location"] = "Kaneko, Japan"


dframe[1186:1187]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1186,03/12/1952,14:35,"Seguin, San Antonio, Texas",29.56951,-97.964661,Military - U.S. Air Force / U.S. Air Force,Boeing B-29 / Boeing B-29,15,15.0,0.0,While on a training mission and flying blind o...


[Source](https://aviation-safety.net/wikibase/wiki.php?id=178359)

In [61]:
dframe.loc[1186, "Ground"] = 0
dframe.loc[1186, "Aboard"] = 8+7
dframe.loc[1186, "Fatalities"] = 8+7
dframe.loc[1186, "Time"] = "14:35" #kept as an HH:MM string, like the rest of the column
dframe.loc[1186, "Location"] = "Seguin, San Antonio, Texas"
dframe[1186:1187]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1186,03/12/1952,14:35,"Seguin, San Antonio, Texas",29.56951,-97.964661,Military - U.S. Air Force / U.S. Air Force,Boeing B-29 / Boeing B-29,15,15.0,0.0,While on a training mission and flying blind o...


In [64]:
#1822   NaN
dframe[1822:1823]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1822,12/20/1962,10:00,"Kadena AB, Okinawa",25.660908,125.725212,Military - U.S. Air Force,KB-50,12,12.0,,"Twelve killed, including civilians. Two civili..."


[Source](https://aviation-safety.net/wikibase/wiki.php?id=192523)

In [65]:
dframe.loc[1822, "Time"] = "10:00" #kept as an HH:MM string, like the rest of the column
dframe.loc[1822, "Summary"] = "Tanker plane crashed into farm house while attempting go-around after aborted landing. Two occupants survived with serious injuries. Two in the house killed."
dframe.loc[1822, "Aboard"] = 7
dframe.loc[1822, "Fatalities"] = 5
dframe.loc[1822, "Ground"] = 2
dframe[1822:1823]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1822,12/20/1962,10:00,"Kadena AB, Okinawa",25.660908,125.725212,Military - U.S. Air Force,KB-50,7,5.0,2.0,Tanker plane crashed into farm house while att...


In [67]:
#4709   NaN
dframe[4709:4710]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
4709,06/23/2000,11:41,"Boca Raton, Florida",26.350491,-80.088409,Universal Jet Aviation,Learjet 55,3,3.0,0.0,Shortly after takeoff the aircraft impacted an...


[Source](https://aviation-safety.net/database/record.php?id=20000623-1)

In [66]:
dframe.loc[4709, "Ground"] = 0

#5264   NaN
dframe[5264:5265]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
5264,05/26/2009,,"Near Isiro, DemocratiRepubliCongo",2.757426,27.626169,Service Air,Antonov An-26,4,4.0,,The cargo plane crashed while on approach to I...


[Source](https://aviation-safety.net/database/record.php?id=20090526-0)

In [70]:
dframe.loc[5264, "Ground"] = 0
dframe.loc[5264, "Fatalities"] = 3
dframe.loc[5264, "Time"] = "16:16" #kept as an HH:MM string, like the rest of the column
dframe.loc[5264, "Location"] = "Isiro, Democratic Republic of Congo"
dframe.loc[5264, :]

check_nulls("Fatalities")
check_nulls("Ground")

364   NaN
423   NaN
768   NaN
Name: Fatalities, dtype: float64
364   NaN
423   NaN
768   NaN
Name: Ground, dtype: float64


We could not find any information for these three rows. Since we can't validate them, let's drop them and keep the result in a new dataframe, df:

In [171]:
df = dframe.drop([364, 423, 768])

df = df.reset_index(drop=True)


df[364:365]
df_dupes = df.duplicated() #returns a boolean Series, True for rows that repeat an earlier row
df_dupes[df_dupes] #filter: an empty Series means there are no duplicate rows... happy days!

Series([], dtype: bool)
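The effect of dropping rows and resetting the index can be seen on a toy frame. This is only an illustrative sketch with made-up values, not data from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"Fatalities": [1.0, None, 3.0, 3.0]}, index=[0, 1, 2, 3])

# Drop the row labelled 1, then renumber the remaining rows from 0.
kept = toy.drop([1]).reset_index(drop=True)

# duplicated() flags a row only if an identical row appeared earlier.
dupes = kept.duplicated()
```

After `reset_index(drop=True)` every row that came after a dropped one moves to a lower label, so the row labels used from here on refer to `df` and no longer line up with those of `dframe`.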

In [133]:
df.Date.count()

5265

Some additional clean up of the Location column: removing qualifiers such as "Near", "Off the" and "Over the", and replacing empty values with NaN so the null count is correct.


In [261]:
#check if these words are in the values (na=False treats missing locations as no match)
df[df.Location.str.contains(r"Near|Off the|Over the", na=False)]["Location"]

#to remove/replace those words and then strip leading and trailing whitespace
df.Location = df.Location.str.replace("Near", "")
df.Location = df.Location.str.replace("Off the", "")
df.Location = df.Location.str.replace("Off", "")
df.Location = df.Location.str.replace("Over the", "")
df.Location = df.Location.str.strip()
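Plain substring replacement removes "Near" or "Off" wherever they occur, including inside a place name. A hedged alternative is a single regular expression anchored at the start of the string, sketched below on made-up sample values. The pattern is an assumption about how the qualifiers appear in the data, not a description of what the cell above does.

```python
import pandas as pd

# Hypothetical sample values; only the first three carry a qualifier prefix.
locations = pd.Series(
    ["Near Tsinan, China", "Off the coast of Peru", "Over the North Sea", "Offenbach, Germany", None]
)

# ^ anchors the match at the start of the string,
# \b stops "Off" from matching inside "Offenbach".
prefix = r"^(?:Near|Off the|Off|Over the)\b\s*"
cleaned = locations.str.replace(prefix, "", regex=True).str.strip()
```

Missing values pass through untouched, so the null check that follows still reports them.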

In [161]:
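import numpy as np

#empty strings left by the clean up become NaN so they are counted as missing
df["Location"] = df["Location"].replace("", np.nan)

check_nulls("Location", dataframe=df)

#missing summaries get a placeholder text
df["Summary"] = df["Summary"].fillna("No details available.")

check_nulls("Summary", dataframe=df)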

142     NaN
410     NaN
562     NaN
586     NaN
594     NaN
702     NaN
901     NaN
1497    NaN
1913    NaN
1915    NaN
1975    NaN
2911    NaN
2950    NaN
3865    NaN
3877    NaN
4031    NaN
4040    NaN
4972    NaN
5036    NaN
Name: Location, dtype: object
No null values in Summary series.


Finally, let's try to add the missing locations:

In [179]:
#142     NaN
df[142:143]


Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
142,05/07/1928,,,,,Aeropostale,Latecoere 26,1,1,0,No details available.


[Source](https://aviation-safety.net/wikibase/wiki.php?id=190785)

> Date:	07-MAY-1928

> Time:	07:40 LT

> Fatalities:	Fatalities: 1 / Occupants: 4 / Other fatalities:	0

> Location:	Armação beach, Florianópolis, Santa Catarina -     Brazil 

> Forced landing on Armação beach 7.5.28 after engine fire shortly after take-off. The crew was performing an internal leg within Brazil on a flight from South America to France. While cruising in the region of Florianópolis, a fire erupted and the pilot Henri Delaunay reduced his altitude to make an emergency landing. Hurt by fire, he eventually lost control of the aircraft that crashed in flames in a prairie. A passenger was killed while both other passengers and the mechanic were injured. The pilot was seriously injured (several burns) but came back on the line after ten months in hospital.

In [181]:
df.loc[142, "Time"] = "07:40" #kept as an HH:MM string, like the rest of the column
df.loc[142, "Location"] = "Armação beach, Florianópolis, Santa Catarina, Brazil"
df.loc[142, "Aboard"] = 4
df.loc[142, "Summary"] = "Forced landing on Armação beach 7.5.28 after engine fire shortly after take-off. The crew was performing an internal leg within Brazil on a flight from South America to France. While cruising in the region of Florianópolis, a fire erupted and the pilot Henri Delaunay reduced his altitude to make an emergency landing. Hurt by fire, he eventually lost control of the aircraft that crashed in flames in a prairie. A passenger was killed while both other passengers and the mechanic were injured. The pilot was seriously injured (several burns) but came back on the line after ten months in hospital."

df[142:143]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
142,05/07/1928,07:40,"Armação beach, Florianópolis, Santa Catarina, ...",,,Aeropostale,Latecoere 26,4,1,0,Forced landing on Armação beach 7.5.28 after e...


In [182]:
#410     NaN
df[410:411]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
410,10/09/1936,,,,,North Sea Aerial and General Transport,Blackburn B-2,1,1,0,No details available.


[Source](https://aviation-safety.net/wikibase/wiki.php?id=25045)

In [183]:
df.loc[410, "Time"] = "12:00"
df.loc[410, "Location"] = "Ellerton, Selby, North Yorkshire"
df.loc[410, "Summary"] = "Written off (damaged beyond repair) when crashed at Ellerton, near Selby, North Yorkshire on 9.10.1936. Aircraft caught fire on impact and was burnt out. Pilot - Algernon Hinchliffe Simpson (aged 23) - killed. Registration G-ABWI cancelled by the Air Ministry 2.12.36 due to 'destruction or permanent withdrawl from use of aircraft'"

df[410:411]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
410,10/09/1936,12:00,"Ellerton, Selby, North Yorkshire",,,North Sea Aerial and General Transport,Blackburn B-2,1,1,0,Written off (damaged beyond repair) when crash...


In [188]:
#562     NaN
df[562:563]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
562,10/28/1941,12:00,"Gabrene, Petrich, Bulgaria",,,Deutsche Lufthansa,Junkers JU-53/3m,13,13,0,While on a regular scheduled flight from Athen...


[Source](http://www.planecrashinfo.com/1941/1941-21.htm)
[Source2](https://www.baaa-acro.com/crash/crash-junkers-ju523m-near-petrich-13-killed)

In [187]:
df.loc[562, "Date"] = "10/28/1941"
df.loc[562, "Time"] = "12:00"
df.loc[562, "Location"] = "Gabrene, Petrich, Bulgaria"
df.loc[562, "Summary"] = "While on a regular scheduled flight from Athens to Sofia, the three engine aircraft crashed in unknown circumstances in the region of Petrich, south Bulgaria. All 13 occupants were killed and the aircraft christened 'Otto von Beaulieu-Marconay' was destroyed."

In [192]:
#586     NaN
df[586:587]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
586,08/21/1942,12:00,"Mühlberg an der Elbe, Brandenburg",,,Deutsche Lufthansa,Siebel Si-204,3,3,0,The twin engine aircraft was performing a flig...


[Source](https://www.baaa-acro.com/crash/crash-siebel-si-204a-muhlberg-der-elbe-3-killed)

> The twin engine aircraft was performing a flight within Germany and left Berlin in the day with a pilot, a radio operator and a passenger on board. En route, it seems the crew encountered marginal weather conditions and following an unknown technical failure, the pilot lost control of the aircraft that crashed in Mühlberg an der Elbe. All three occupants were killed, among them Baron Carl August Freiherr von Gablenz, founder of the Deutsche Lufthansa. 


In [191]:
df.loc[586, "Location"] = "Mühlberg an der Elbe, Brandenburg"
df.loc[586, "Time"] = "12:00"
df.loc[586, "Aboard"] = 3
df.loc[586, "Fatalities"] = 3
df.loc[586, "Summary"] = "The twin engine aircraft was performing a flight within Germany and left Berlin in the day with a pilot, a radio operator and a passenger on board. En route, it seems the crew encountered marginal weather conditions and following an unknown technical failure, the pilot lost control of the aircraft that crashed in Mühlberg an der Elbe. All three occupants were killed, among them Baron Carl August Freiherr von Gablenz, founder of the Deutsche Lufthansa."

In [194]:
#594     NaN
df[594:595]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
594,10/22/1942,,,,,Deutsche Lufthansa,Junkers JU-52/3m,17,17,0,No details available.


[Source](https://www.baaa-acro.com/crash/crash-junkers-ju523m-near-bukovac-17-killed)

In [195]:
df.loc[594, "Location"] = "Bukovac, Vojvodina, Serbia"
df.loc[594, "Summary"] = "While passing over Novi Sad, bound for Belgrade, the captain encountered marginal weather conditions with low clouds. Some five km east of Bukovac, the three-engine aircraft christened 'Johannes Höroldt' hit the north slope of Mt Fruška Gora (380 meters high). The wreckage was found less than 30 meters from the summit, the aircraft was destroyed by impact forces and all 17 occupants were killed. Causes:  It appears that weather information transmitted to the crew were erroneous and did not reflect the reality. At the time of the accident, the cloud base was lower than the 600 meter base previously announced."


In [198]:
#702     NaN
df[702:703]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
702,04/20/1945,18:00,"Steinreich, Brandenburg",,,,Junkers JU-53/3m,18,18,0,Missing on an evacuation flight from Berlin to...


[Source](https://www.baaa-acro.com/crash/crash-junkers-ju523m-steinreich-18-killed)

In [197]:
df.loc[702, "Location"] = "Steinreich, Brandenburg"
df.loc[702, "Time"] = "18:00"

In [204]:
#901     NaN
df[901:902]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
901,10/27/1947,12:00,"Yulin, China",,,China National Aviation Corporation,Douglas DC-3,3,2,0,The cargo plane was shot down by communist ant...


[Source](https://aviation-safety.net/database/record.php?id=19471027-0)

In [203]:
df.loc[901, "Date"] = "10/27/1947"
df.loc[901, "Time"] = "12:00"
df.loc[901, "Location"] = "Yulin, China"

In [248]:
#1497    NaN
df[1497:1498]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1497,09/28/1957,,"Islay-Glenegedale Airport (ILY), United Kingdom",,,British European Airways,de Havilland DH-114 Heron,3,3,0,The pilot did not appreciate that the air ambu...


[Source](https://aviation-safety.net/database/record.php?id=19570928-0)

In [247]:
df.loc[1497, "Location"] = "Islay-Glenegedale Airport (ILY), United Kingdom"

In [209]:
#1913    NaN
df[1913:1914]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1913,05/27/1964,15:00,"Morro de São Lourenço, São Paulo, Brazil",,,VASP,Douglas C-47-DL,3,3,0,Crashed during a training flight. The aircraft...


[Source](https://aviation-safety.net/database/record.php?id=19640527-0)

In [208]:
df.loc[1913, "Time"] = "15:00"
df.loc[1913, "Location"] = "Morro de São Lourenço, São Paulo, Brazil"
df.loc[1913, "Summary"] = "Crashed during a training flight. The aircraft took off from São Paulo-Congonhas Airport, SP (CGH) at 14:06 hours. The aircraft came down near Morro do São Lourenço, near the Rodovia Régis Bittencourt highway."

In [212]:
#1915    NaN
df[1915:1916]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1915,06/13/1964,,"Jeddah, Saudi Arabia",,,Saudi Arabian Airlines,Douglas C-47A,2,2,0,Crashed into the Red Sea.


[Source](https://aviation-safety.net/database/record.php?id=19640613-2)

In [246]:
df.loc[1915, "Location"] = "Jeddah, Saudi Arabia"

In [216]:
#1975    NaN
df[1975:1976]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
1975,05/05/1965,21:17,"Los Rodeos International Airport (TCI), Teneri...",,,Iberia Airlines,Lockheed 1049G-55 Super Constellation,49,30,0,"The pilot, who saw the beginning of the runway..."


In [215]:
df.loc[1975, "Location"] = "Los Rodeos International Airport (TCI), Tenerife, Spain"

In [219]:
#2911    NaN
df[2911:2912]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
2911,08/09/1976,12:30,"Vejer de la Frontera, Spain",,,Military - Spanish Air Force,Douglas C-54E,33,11,0,"Crashed 15 minutes after taking off, in a hill..."


[Source](https://aviation-safety.net/database/record.php?id=19760809-2)

In [218]:
df.loc[2911, "Time"] = "12:30"
df.loc[2911, "Summary"] = "Crashed 15 minutes after taking off, in a hilly wooded area, and burned. The airplane carried officers and family members to the Canary Islands."
df.loc[2911, "Location"] = "Vejer de la Frontera, Spain"

In [223]:
#2950    NaN
df[2950:2951]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
2950,11/20/1977,12:00,"Hay River, Northwest Territories, Canada",,,North Canada Air,Bristol 170 Freighter 31M,2,1,0,The cargo plane stalled nearly vertical and cr...


[Source1](https://www.baaa-acro.com/crash/crash-bristol-170-freighter-hay-river-1-killed)
[Source2](https://aviation-safety.net/database/record.php?id=19771120-1)

In [222]:
df.loc[2950, "Date"] = "11/20/1977"
df.loc[2950, "Time"] = "12:00"
df.loc[2950, "Location"] = "Hay River, Northwest Territories, Canada"

In [224]:
#3865    NaN
df[3865:3866]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
3865,08/15/1989,,,,,China Eastern Airlines,Antonov AN-24RV,40,34,0,Lost power and crashed into a river shortly af...


In [225]:
df.loc[3865, "Location"] = "Shanghai-Hongqiao Airport (SHA), China"

In [228]:
#3877    NaN
df[3877:3878]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
3877,09/21/1989,,"Baghran, Helmond, Afghanistan",,,Military - Afghan Republican Air Force,Mil Mi-8 (helicopter),26,26,0,Crashed and burned.


[Source](https://aviation-safety.net/wikibase/wiki.php?id=58445)

In [227]:
df.loc[3877, "Location"] = "Baghran, Helmond, Afghanistan"
df.loc[3877, "Summary"] = "Crashed and burned."

In [229]:
#4031    NaN
df[4031:4032]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
4031,08/15/1991,,,,,Transports A√©riens de la Guinee-Bissau,Fokker F-27 Friendship 100,3,3,0,On a positioning flight the plane struck trees...


[Source](http://www.fokker-aircraft.info/f27accidents90-99.htm)
[Source2](http://aviation-safety.net/database/record.php?id=19910815-0)

In [241]:
df.loc[4031, "Location"] = "Dori Airport (DOR), Burkina Faso"

In [240]:
#4040    NaN
df[4040:4041]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
4040,09/17/1991,,"Mount Arey, Djibouti",,,Ethiopian Airlines,Lockheed L-100-30 Hercules,4,4,0,After experiencing a nose gear problem and att...


[Source](https://aviation-safety.net/database/record.php?id=19910917-1)

In [239]:
df.loc[4040, "Location"] = "Mount Arey, Djibouti"

In [236]:
#4972    NaN
df[4972:4973]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
4972,03/04/2004,09:40,"Baku Airport (BAK), Azerbeijan",,,Azov Avia Airlines,Ilyushin II-76,7,3,0,Shortly after taking off and climbing to about...


[Source](https://aviation-safety.net/database/record.php?id=20040304-0)

In [235]:
df.loc[4972, "Location"] = "Baku Airport (BAK), Azerbeijan"

In [238]:
#5036    NaN
df[5036:5037]

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
5036,02/22/2005,07:15,"Sarmi Airport (ZRM), Indonesia",,,Indonesian National Police,CASA 212 Aviocar,18,15,0,"On final approach, the aircraft crashed into t..."


[Source](https://aviation-safety.net/database/record.php?id=20050222-1)

In [249]:
df.loc[5036, "Location"] = "Sarmi Airport (ZRM), Indonesia"

#final check to see if we got all the locations with a value
check_nulls("Location", dataframe=df)

No null values in Location series.


### 1.3 Type Conversions

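Attempting to convert the values in the Time column to datetime objects. Note that the format '%HH:%MM' does not match values such as 17:18 (the strptime directives for hours and minutes are %H and %M), so with errors="ignore" the column comes back unchanged, still as strings with dtype object: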

In [260]:
df["Time"] = pd.to_datetime(df["Time"], format='%HH:%MM', errors="ignore")
    
df.Time.head()

0    17:18
1    06:30
2      NaN
3    18:30
4    10:30
Name: Time, dtype: object
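To obtain parsed times rather than the unchanged strings shown above, the directives `%H:%M` can be combined with `errors="coerce"`. The following is a minimal sketch on made-up sample values, not on the real column; `times`, `parsed` and `hours` are names introduced here for illustration.

```python
import pandas as pd

# Made-up sample in the same "HH:MM" shape as the Time column.
times = pd.Series(["17:18", "06:30", None, "18:30", "not a time"])

# %H is the hour (00-23) and %M the minute; with errors="coerce",
# missing or unparseable entries become NaT instead of raising.
parsed = pd.to_datetime(times, format="%H:%M", errors="coerce")

# The parsed values carry a placeholder date (1900-01-01), so in practice
# only the time components are meaningful, for example the hour.
hours = parsed.dt.hour
print(parsed)
```

Applying this to the real column would change the Time dtype from object to datetime, so any later cell that expects the original strings would need to be adjusted.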

***
Convert all dates to datetime:


In [257]:
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")
df.Date.head()

0   1908-09-17
1   1912-07-12
2   1913-08-06
3   1913-09-09
4   1913-10-17
Name: Date, dtype: datetime64[ns]
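Because `errors="coerce"` silently turns unparseable dates into NaT, it can be worth checking which rows were affected. A minimal sketch, assuming a small made-up sample in place of the real column (`raw_dates`, `parsed_dates` and `failed` are illustrative names):

```python
import pandas as pd

# Made-up sample: two valid dates, one in the wrong format, one missing.
raw_dates = pd.Series(["09/17/1908", "07/12/1912", "1913-08-06", None])

parsed_dates = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Rows that had a value before parsing but are NaT afterwards
# are the ones that coerce discarded.
failed = raw_dates.notna() & parsed_dates.isna()
print(raw_dates[failed])
```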


***
Convert the values in the Fatalities and Ground columns to integers; downcast="integer" selects the smallest integer dtype that fits the data (int16 here):

In [258]:
df['Fatalities'] = pd.to_numeric(df['Fatalities'], downcast="integer", errors="coerce")

df['Ground'] = pd.to_numeric(df['Ground'], downcast="integer", errors="coerce")

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5265 entries, 0 to 5264
Data columns (total 11 columns):
Date          5265 non-null datetime64[ns]
Time          3069 non-null object
Location      5265 non-null object
Latitude      5237 non-null float64
Longitude     5237 non-null float64
Operator      5247 non-null object
Type          5240 non-null object
Aboard        5265 non-null int16
Fatalities    5265 non-null int16
Ground        5265 non-null int16
Summary       5265 non-null object
dtypes: datetime64[ns](1), float64(2), int16(3), object(5)
memory usage: 360.0+ KB
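The downcast to int16 shown above relies on the columns having no missing values at this point. The sketch below uses made-up numbers to illustrate the difference, and shows pandas' nullable `Int16` dtype as one possible alternative when gaps remain; it is not part of the original workflow.

```python
import pandas as pd

# Made-up values: one complete series and one with a missing entry.
complete = pd.Series([1.0, 30.0, 644.0])
with_gap = pd.Series([1.0, None, 644.0])

# With no missing values, the floats are downcast to a small integer dtype.
small = pd.to_numeric(complete, downcast="integer")

# With a NaN present, the values cannot be stored as plain integers,
# so the series keeps its float dtype.
unchanged = pd.to_numeric(with_gap, downcast="integer")

# The nullable integer dtype keeps integers and marks the gap as <NA>.
nullable = with_gap.astype("Int16")

print(small.dtype, unchanged.dtype, nullable.dtype)
```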


### 1.4 Export clean data 

Export the cleaned data to CSV so we can run our Python script again to get the remaining GPS coordinates.
CSV does not store column types, so after reloading the file we need to run a couple of checks and force the types again before moving on to EDA.


In [262]:
df.to_csv("dataset/dataframe_export.csv", index=False)

file_path = "dataset/dataframe_export.csv"
test_exported_df = pd.read_csv(file_path)
test_exported_df

Unnamed: 0,Date,Time,Location,Latitude,Longitude,Operator,Type,Aboard,Fatalities,Ground,Summary
0,1908-09-17,17:18,"Fort Myer, Virginia",38.882431,-77.080750,Military - U.S. Army,Wright Flyer III,2,1,0,"During a demonstration flight, a U.S. Army fly..."
1,1912-07-12,06:30,"Atlantic City, New Jersey",39.362869,-74.426369,Military - U.S. Navy,Dirigible,5,5,0,First U.S. dirigible Akron exploded just offsh...
2,1913-08-06,,"Victoria, British Columbia, Canada",48.428551,-123.364448,Private,Curtiss seaplane,1,1,0,The first fatal airplane accident in Canada oc...
3,1913-09-09,18:30,North Sea,56.610554,4.387920,Military - German Navy,Zeppelin L-1 (airship),20,14,0,The airship flew into a thunderstorm and encou...
4,1913-10-17,10:30,"Johannisthal, Germany",52.799000,13.096050,Military - German Navy,Zeppelin L-2 (airship),30,30,0,Hydrogen gas which was being vented was sucked...
5,1915-03-05,01:00,"Tienen, Belgium",50.809681,4.930130,Military - German Navy,Zeppelin L-8 (airship),41,21,0,Crashed into trees while attempting to land af...
6,1915-09-03,15:20,"Cuxhaven, Germany",53.858299,8.697810,Military - German Navy,Zeppelin L-10 (airship),19,19,0,"Exploded and burned near Neuwerk Island, when..."
7,1916-07-28,,"Jambol, Bulgeria",42.487801,26.511030,Military - German Army,Schutte-Lanz S-L-10 (airship),20,20,0,"Crashed near the Black Sea, cause unknown."
8,1916-09-24,01:00,"Billericay, England",51.624851,0.416950,Military - German Navy,Zeppelin L-32 (airship),22,22,0,Shot down by British aircraft crashing in flames.
9,1916-10-01,23:45,"Potters Bar, England",51.693130,-0.178330,Military - German Navy,Zeppelin L-31 (airship),19,19,0,Shot down in flames by the British 39th Home D...
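Reading the export back with default settings yields plain strings for Date and default integer widths, so the types from section 1.3 have to be restored. One way to do that at read time is sketched below, using an in-memory CSV with a few columns copied from the rows above; `csv_text` and `reloaded` are names introduced for this example.

```python
import io

import pandas as pd

# Two rows in the exported layout, reduced to a few columns for brevity.
csv_text = (
    "Date,Time,Aboard,Fatalities,Ground\n"
    "1908-09-17,17:18,2,1,0\n"
    "1913-08-06,,1,1,0\n"
)

# parse_dates restores the datetime column; dtype restores the integer widths.
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    dtype={"Aboard": "int16", "Fatalities": "int16", "Ground": "int16"},
)

print(reloaded.dtypes)
```

Passing a plain integer dtype only works while those columns contain no missing values; otherwise the read fails and a nullable dtype would be needed.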


***
A second export contains only the Location series, which is the input for the script that fetches the remaining GPS coordinates:

In [264]:
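# write the locations without a header row (made explicit, since the default differs between pandas versions)
df["Location"].to_csv("dataset/locations_export.csv", index=False, header=False)

# the file has no header row, so name the column explicitly when reading it back;
# otherwise the first location is consumed as the column name
file_path = "dataset/locations_export.csv"
test_exported_df = pd.read_csv(file_path, header=None, names=["Location"])
test_exported_df

Unnamed: 0,Location
0,"Fort Myer, Virginia"
1,"Atlantic City, New Jersey"
2,"Victoria, British Columbia, Canada"
3,North Sea
4,"Johannisthal, Germany"
5,"Tienen, Belgium"
6,"Cuxhaven, Germany"
7,"Jambol, Bulgeria"
8,"Billericay, England"
9,"Potters Bar, England"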
