# NLP Processing With SpaCy
---

## Contents
---
- [Data Retrieval](#Data-Retrieval)
- [SpaCy Processing](#SpaCy-Processing)

### Data Retrieval
---

**Library Imports**

In [1]:
import pandas as pd
import re
import spacy

**Read in clean_aviation_data.csv as cleaned_df**

In [2]:
cleaned_df = pd.read_csv('./data/clean_aviation_data.csv')

In [3]:
cleaned_df.head()

Unnamed: 0,event_type,event_date,tail_number,highest_injury_level,fatal_injury_count,serious_injury_count,minor_injury_count,probable_cause,latitude,longitude,airport_id,operator,make,aircraft_damage,model
0,INC,2023-11-30,N494HA,minor,0,0,1,unknown,20.899501,-156.42973,OGG,hawaiian,boeing,Minor,717
1,ACC,2023-09-30,N37560,none,0,0,0,unknown,39.849312,-104.67382,DEN,united,boeing,Substantial,737
2,ACC,2023-08-21,N516AS,none,0,0,0,unknown,33.675701,-117.86799,SNA,alaska,boeing,Substantial,737
3,INC,2023-08-11,"N7734H, N564HV",none,0,0,0,unknown,32.730189,-117.17562,SAN,southwest,boeing,no_damage,737
4,INC,2023-08-03,N649JB,none,0,0,0,unknown,30.486167,-81.750781,JAX,jetblue,airbus,no_damage,A320


After looking through the data, I decided that 'unknown' is not a good fill value for the NaNs in 'probable_cause'. I will change it to 'ReportUnavailable' so that the placeholder is a unique text value. Every row needs some value so that there are no NaNs during modeling, and we don't want to drop rows just because 'probable_cause' is blank.

It appears that 'probable_cause' is left blank either because the investigation isn't complete or because the 'report_status' is NA.

In [4]:
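# Note: str.replace matches substrings, so 'unknown' inside a longer cause text is replaced too
cleaned_df['probable_cause'] = cleaned_df['probable_cause'].str.replace('unknown', 'ReportUnavailable', regex=False)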
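
As a hedged illustration of the replacement step: `Series.str.replace` rewrites every occurrence of the substring, whereas `Series.replace` with scalar arguments only swaps values that match exactly. The toy Series below is made up for illustration and is not taken from the aviation data.

```python
import pandas as pd

# Hypothetical toy data, not rows from clean_aviation_data.csv
causes = pd.Series(['unknown', 'engine failure for unknown reasons'])

# Substring replacement: also rewrites 'unknown' inside the longer text
substring_replaced = causes.str.replace('unknown', 'ReportUnavailable', regex=False)

# Whole-value replacement: only rows equal to 'unknown' change
exact_replaced = causes.replace('unknown', 'ReportUnavailable')

print(substring_replaced.tolist())
print(exact_replaced.tolist())
```

Which variant fits depends on whether 'unknown' ever appears inside a real probable-cause narrative; if it can, the whole-value form keeps those narratives untouched.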

In [5]:
cleaned_df.head()

Unnamed: 0,event_type,event_date,tail_number,highest_injury_level,fatal_injury_count,serious_injury_count,minor_injury_count,probable_cause,latitude,longitude,airport_id,operator,make,aircraft_damage,model
0,INC,2023-11-30,N494HA,minor,0,0,1,ReportUnavailable,20.899501,-156.42973,OGG,hawaiian,boeing,Minor,717
1,ACC,2023-09-30,N37560,none,0,0,0,ReportUnavailable,39.849312,-104.67382,DEN,united,boeing,Substantial,737
2,ACC,2023-08-21,N516AS,none,0,0,0,ReportUnavailable,33.675701,-117.86799,SNA,alaska,boeing,Substantial,737
3,INC,2023-08-11,"N7734H, N564HV",none,0,0,0,ReportUnavailable,32.730189,-117.17562,SAN,southwest,boeing,no_damage,737
4,INC,2023-08-03,N649JB,none,0,0,0,ReportUnavailable,30.486167,-81.750781,JAX,jetblue,airbus,no_damage,A320


### SpaCy Processing
---

In [7]:
# Load the medium size pipeline
nlp = spacy.load('en_core_web_md')

In [8]:
# Make sure the placeholder is never treated as a stop word
word_exception = 'ReportUnavailable'
nlp.vocab[word_exception].is_stop = False
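
A quick sanity check for the placeholder can be sketched as follows. This assumes that `spacy.blank('en')`, a tokenizer-only English pipeline used here so that no model download is needed, is an acceptable stand-in for `en_core_web_md` when checking lexical flags; `nlp_check` and `survives_filter` are names introduced only for this illustration.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_md here
nlp_check = spacy.blank('en')

word_exception = 'ReportUnavailable'
nlp_check.vocab[word_exception].is_stop = False

doc = nlp_check(word_exception)
token = doc[0]

# The filter in this notebook keeps tokens that are alphabetic and not stop words
survives_filter = token.is_alpha and not token.is_stop
print(token.text, survives_filter)
```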

**Function that lets SpaCy process the text data so that it can be run through an apply method on the 'probable_cause' column of 'cleaned_df'.**

The function was created using the SpaCy lesson; references are https://spacy.io/api/token, https://realpython.com/natural-language-processing-spacy-python/#lemmatization, and ChatGPT for structure help. Hank reminded me that .apply will apply a function to a dataframe column.


In [9]:
def spacy_processor(text):
    # Run the text through the spaCy pipeline
    doc = nlp(text)

    # Keep only alphabetic tokens that are not spaCy stop words,
    # then lemmatize and lowercase each one
    tokens = [token.lemma_.lower().strip() for token in doc if token.is_alpha and not token.is_stop]

    # Join the processed tokens back into a single string
    processed_text = ' '.join(tokens)

    # Return the processed text to the dataframe
    return processed_text

In [10]:
# Apply the function to the probable_cause column of cleaned_df
cleaned_df['probable_cause'] = cleaned_df['probable_cause'].apply(spacy_processor)
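
`Series.apply` calls the function once per value in the column. The sketch below illustrates that on toy data; `toy_processor` is a hypothetical stand-in for `spacy_processor` (plain Python, no spaCy model), and the guard for missing values is an assumption added for illustration, since this notebook fills missing values beforehand.

```python
import pandas as pd

# Hypothetical stand-in for spacy_processor: lowercases and keeps alphabetic words
def toy_processor(text):
    # Guard (assumption): turn missing values into an empty string
    if not isinstance(text, str):
        return ''
    words = [word.lower() for word in text.split() if word.isalpha()]
    return ' '.join(words)

# Toy column, not the aviation data
toy_df = pd.DataFrame({'probable_cause': ['ReportUnavailable', 'Gear failure in 2019', None]})

# apply runs toy_processor on each value and returns a new Series
toy_df['probable_cause'] = toy_df['probable_cause'].apply(toy_processor)
print(toy_df['probable_cause'].tolist())
```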

In [11]:
#Checking out how it looks
cleaned_df.head()

Unnamed: 0,event_type,event_date,tail_number,highest_injury_level,fatal_injury_count,serious_injury_count,minor_injury_count,probable_cause,latitude,longitude,airport_id,operator,make,aircraft_damage,model
0,INC,2023-11-30,N494HA,minor,0,0,1,reportunavailable,20.899501,-156.42973,OGG,hawaiian,boeing,Minor,717
1,ACC,2023-09-30,N37560,none,0,0,0,reportunavailable,39.849312,-104.67382,DEN,united,boeing,Substantial,737
2,ACC,2023-08-21,N516AS,none,0,0,0,reportunavailable,33.675701,-117.86799,SNA,alaska,boeing,Substantial,737
3,INC,2023-08-11,"N7734H, N564HV",none,0,0,0,reportunavailable,32.730189,-117.17562,SAN,southwest,boeing,no_damage,737
4,INC,2023-08-03,N649JB,none,0,0,0,reportunavailable,30.486167,-81.750781,JAX,jetblue,airbus,no_damage,A320


In [12]:
# Checking out a non-'reportunavailable' entry in the probable_cause column
cleaned_df['probable_cause'][24]

'fatigue failure right main landing gear initiate liquid metal embrittlement cadmium arc burn location outer cylinder tooling hole area arc burn likely result operator error stylus cadmium plating operation overhaul'

In [14]:
cleaned_df.to_csv('./data/text_processed_aviation_data.csv', index = False)