# Airplane Crash Reasons Text Mining #

## Prepare tools ##

In [1]:
# utility
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# import urllib.request # download files
from collections import Counter
import string
########

# text pre-processing
import nltk 
from nltk.corpus import stopwords, wordnet # stop words
from nltk.stem.snowball import SnowballStemmer # stemming
from nltk import pos_tag, word_tokenize # identify POS tag, required by lemmatizer
from nltk.stem import WordNetLemmatizer # lemmatization
########

# feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import FeatureHasher
########

from sklearn.model_selection import train_test_split
########

# for Latent Dirichlet Allocation
# import gensim
# from gensim import corpora, models
from sklearn.decomposition import LatentDirichletAllocation
########

**Resources to be downloaded with `nltk` downloader:**
-  `stopwords` - stop words list
-  `wordnet` - lemmatization
-  `punkt` - tokenization
-  `averaged_perceptron_tagger` - POS tag of tokens

In [2]:
# set working directory
working_dir = "/home/lee/Documents/Datasets for GitHub/kaggle_plane_crashes/"

# download to working directory
nltk.download(['stopwords', 'wordnet', 'punkt', 'averaged_perceptron_tagger'], download_dir=working_dir)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lee/Documents/Datasets for
[nltk_data]     GitHub/kaggle_plane_crashes/...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/lee/Documents/Datasets for
[nltk_data]     GitHub/kaggle_plane_crashes/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/lee/Documents/Datasets
[nltk_data]     for GitHub/kaggle_plane_crashes/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/lee/Documents/Datasets for
[nltk_data]     GitHub/kaggle_plane_crashes/...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Load & Inspect Data ##

In [3]:
df = pd.read_csv(working_dir+"Airplane_Crashes_and_Fatalities_Since_1908.csv", index_col=False)

In [4]:
print("dataframe shape: {}".format(df.shape))
print("\n")
print("preview:\n{}".format(df.head()))

dataframe shape: (5268, 13)


preview:
         Date   Time                            Location  \
0  09/17/1908  17:18                 Fort Myer, Virginia   
1  07/12/1912  06:30             AtlantiCity, New Jersey   
2  08/06/1913    NaN  Victoria, British Columbia, Canada   
3  09/09/1913  18:30                  Over the North Sea   
4  10/17/1913  10:30          Near Johannisthal, Germany   

                 Operator Flight #          Route                    Type  \
0    Military - U.S. Army      NaN  Demonstration        Wright Flyer III   
1    Military - U.S. Navy      NaN    Test flight               Dirigible   
2                 Private        -            NaN        Curtiss seaplane   
3  Military - German Navy      NaN            NaN  Zeppelin L-1 (airship)   
4  Military - German Navy      NaN            NaN  Zeppelin L-2 (airship)   

  Registration cn/In  Aboard  Fatalities  Ground  \
0          NaN     1     2.0         1.0     0.0   
1          NaN   NaN     5.0     

### Cleaning ###

In [5]:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# see debate: https://github.com/pandas-dev/pandas/issues/11453
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M', errors='coerce').dt.time

**As a quick aside, check which aircraft types account for the most fatalities overall.**

In [6]:
print("total fatality count by airplane type, all years, both military and non-military operators:\n {}"
      .format(df.groupby('Type')['Fatalities'].sum().nlargest(5)))

total fatality count by airplane type, all years, both military and non-military operators:
 Type
Douglas DC-3                 4793.0
Antonov AN-26                1068.0
Douglas DC-6B                1055.0
Douglas C-47                 1046.0
McDonnell Douglas DC-9-32     951.0
Name: Fatalities, dtype: float64


## Topic Mining ##
We now apply topic mining to the free-text `Summary` field, which describes each crash.

### Inspect the Field of Interest ###

In [7]:
for i in range(5):
    print(df.loc[i, 'Summary']+"\n")

During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge who was a passenger. This was the first recorded airplane fatality in history.  One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft.  Orville Wright suffered broken ribs, pelvis and a leg.  Selfridge suffered a crushed skull and died a short time later.

First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight.

The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed.

The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.

Hydrogen gas which was being vented was sucked i

### Check for Missing ###

In [8]:
print("Does the Summary field contain any missing record? {}"
      .format(df['Summary'].isnull().any()))

Does the Summary field contain any missing record? True


### Pre-processing Text Data ###

In [9]:
nltk.data.path.append(working_dir)

# replace NaN
df['summary_filled'] = df['Summary'].fillna(value=' ')

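# remove punctuation
# (regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal by default)
df['removepunc'] = df['summary_filled'].str.replace(r'[^\w\s]', ' ', regex=True)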

# lower casing
df['lower'] = df['removepunc'].str.lower()

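# tokenize and lemmatize
# WordNet POS codes: adjective, satellite adjective, adverb, noun, verb = 'a', 's', 'r', 'n', 'v'
# Note: pos_tag returns Penn Treebank tags, whose adjective tags start with 'J' (JJ, JJR, JJS),
# so the 'a' check below never matches and adjectives are passed through unlemmatized.
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_after_pos(summary):
    lemma_summary = []
    for word, tag in pos_tag(word_tokenize(summary)):
        wntag = tag[0].lower()
        wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
        lemma = lemmatizer.lemmatize(word, wntag) if wntag else word
        lemma_summary.append(lemma)
    return lemma_summary

df['lemmatized'] = df['lower'].apply(lemmatize_after_pos)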

# remove stopwords (a set makes each membership test constant-time)
stop = set(stopwords.words('english'))
df['removestop'] = df['lemmatized'].apply(lambda x: [item for item in x if item not in stop])

df['processed_summary'] = df['removestop']
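
The lemmatizer above derives the WordNet POS code from the first letter of each Penn Treebank tag. Adjective tags begin with `J` rather than `a`, so one possible variation is an explicit lookup table. The sketch below is only an illustration: `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical names, not part of NLTK or of this notebook, and switching to this mapping may change the processed tokens and the topics shown further down.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag to a WordNet POS code.
# Assumption: only the first letter of the tag matters, as in the notebook's own function.
PENN_PREFIX_TO_WORDNET = {"J": "a", "R": "r", "N": "n", "V": "v"}


def penn_to_wordnet(tag):
    """Return 'a', 'r', 'n' or 'v' for a Penn Treebank tag, or None if unmapped."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper())


for tag in ["JJ", "RB", "NNS", "VBD", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Inside `lemmatize_after_pos`, the result of such a helper could replace the `wntag` computation, with `None` still meaning "keep the word unchanged".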

**Check the results**

In [10]:
print(df.loc[0, 'processed_summary'])

['demonstration', 'flight', 'u', 'army', 'flyer', 'fly', 'orville', 'wright', 'nose', 'dive', 'ground', 'height', 'approximately', '75', 'foot', 'kill', 'lt', 'thomas', 'e', 'selfridge', 'passenger', 'first', 'record', 'airplane', 'fatality', 'history', 'one', 'two', 'propeller', 'separate', 'flight', 'tear', 'loose', 'wire', 'brace', 'rudder', 'cause', 'loss', 'control', 'aircraft', 'orville', 'wright', 'suffer', 'broken', 'rib', 'pelvis', 'leg', 'selfridge', 'suffer', 'crushed', 'skull', 'die', 'short', 'time', 'later']
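
The lone `'u'` in the output comes from the punctuation step: `U.S.` becomes `u s` once the periods are replaced by spaces, and the `s` is presumably dropped as a stop word. A standard-library sketch of the same three steps (punctuation removal, lower-casing, stop word filtering) makes this visible. It is a simplified stand-in, assuming whitespace splitting instead of `word_tokenize` and a tiny made-up stop word set instead of NLTK's English list; `basic_clean` and `SAMPLE_STOPWORDS` are hypothetical names.

```python
import re

# Tiny illustrative stop word set (an assumption); the notebook uses NLTK's English list.
SAMPLE_STOPWORDS = {"the", "a", "of", "into", "from", "by"}


def basic_clean(text):
    """Replace punctuation with spaces, lower-case, split on whitespace, drop stop words."""
    no_punct = re.sub(r"[^\w\s]", " ", text)  # same pattern as the notebook's str.replace
    tokens = no_punct.lower().split()
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]


print(basic_clean("A U.S. Army flyer nose-dived into the ground."))
```

With this small stop word set both `u` and `s` survive, and the hyphenated `nose-dived` is split into two tokens.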


In [11]:
del stop
df.drop(['summary_filled', 'removepunc', 'lower', 'removestop', 'lemmatized'], axis=1, inplace=True)
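
The `CountVectorizer` used in this notebook is configured with `min_df=2` and `max_df=0.95`, which prune the vocabulary by document frequency: a term is dropped if it occurs in fewer than 2 summaries or in more than 95% of them. The pure-Python sketch below illustrates that rule on made-up token lists. `filter_by_document_frequency` is a hypothetical helper, not a scikit-learn function, and it leaves out the `max_features` cap that the notebook also applies.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least min_df documents and at most max_df * len(docs) documents."""
    # Count each term once per document, regardless of how often it repeats inside it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_count = max_df * len(docs)
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)


# Made-up token lists, for illustration only.
docs = [
    ["engine", "failure", "crash"],
    ["fog", "crash"],
    ["engine", "fire", "crash"],
    ["fog", "mountain", "crash"],
]
print(filter_by_document_frequency(docs))
```

Here `crash` is removed because it appears in every document, and the terms that appear only once are removed as too rare.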

In [12]:
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message+"\n")
    
# Use tf (raw term count) features for LDA.
# print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(df['processed_summary'].apply(" ".join))

# Note: n_samples is only echoed in this message; the model below is fit on every row of tf.
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..." % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,\
                                learning_method='online',\
                                learning_offset=50.,\
                                random_state=0)

lda.fit(tf)

print("\nTopics in LDA model:")
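# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()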
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...

Topics in LDA model:
Topic #0: fuel engine cause problem cabin crew run tank passenger thrust plane door explosion failure damage pressure experience rear open lead

Topic #1: aircraft accident minute flight passenger report mile kill disappear later cause wreckage time military fly 30 day transport receive shoot

Topic #2: ft 000 flight mountain crash 500 strike miss contact hold minute 11 10 16 lightning explode pattern 14 test crew

Topic #3: aircraft pilot engine plane crash lose runway altitude failure power crew foot takeoff right fail ground turn control left hit

Topic #4: crash route mountain en land heavy mile fog rain attempt sea airport weather poor aircraft mt hit near north visibility

Topic #5: kill air helicopter aboard collision midair shoot dc force collide pilot jet atc rebel fighter cessna missile aircraft land avoid

Topic #6: approach pilot flight condition weather crash aircraft crew runw
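
The `print_top_words` helper picks each topic's words with `topic.argsort()[:-n_top_words - 1:-1]`: `argsort` orders the indices from smallest to largest weight, and the negative-step slice walks backwards from the end to take the last `n_top_words` of them, largest first. The sketch below reproduces that slice with plain lists, using `sorted` as a stand-in for NumPy's `argsort`. The vocabulary and weights are made up, and `top_indices` is a hypothetical name.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Stand-in for numpy's argsort: indices ordered by ascending weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: walk backwards from the end, n_top steps.
    return ascending[:-n_top - 1:-1]


# Made-up vocabulary and topic weights, for illustration only.
vocab = ["engine", "fog", "runway", "fuel", "pilot"]
weights = [0.9, 0.1, 0.4, 0.7, 0.2]
print([vocab[i] for i in top_indices(weights, 3)])
```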

The topics point to several common causes of plane crashes: engine failure, explosion, weather conditions (fog, lightning), and collision. Readers with aviation expertise will likely be able to interpret the generated topics in more depth.