<a href="https://colab.research.google.com/github/LukrecijaTudor/Aviation-Reports-Classification/blob/main/PreparingTheData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Preprocessing data for all Machine Learning (ML) tasks except BERT (BERT has its own way of preprocessing data)
Topics:
Loading the raw dataset
Cleaning the data
Replacing acronyms
Tokenisation
Restructuring the data
TF-IDF representation and building the sparse matrix
Saving the preprocessed data

In [None]:
pip install pycld2

In [7]:
import pandas as pd
import numpy as np
import pycld2 as cld2
import spacy
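# The "en" shortcut only works in spaCy 2; spaCy 3 needs the full model name
# (install it with: python -m spacy download en_core_web_sm)
sp = spacy.load("en_core_web_sm")
sp_stopwords = sp.Defaults.stop_words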

import math

Loading the data

In [8]:
document = pd.read_csv("./qa06_all.csv")
document = document.rename(columns={'qa06id': 'id', 'qa06name':'title', 'qa06wher':'location','qa06dsc':'report'})
document_risk = pd.read_csv("./qa06_only_having_risk_valuesl.csv")
document_risk = document_risk.rename(columns={'qa06id': 'id', 'qa06name':'title', 'qa06wher':'location','qa06dsc':'report','ty26colo':'label','ty26fakt':'factor'})

acronyms = pd.read_excel("./kratice.xlsx")
acronyms = acronyms.rename(columns={'kratice':'acronym','Unnamed: 1':'replacement'})

In [9]:
document.head()

Unnamed: 0,xx,id,title,location,report
0,1953679,CAA MOR-0010-2017,Precautionary return to LJU due to engine N1 v...,LOVV / NR / JP568,Engine fans were found frozen at morning prefl...
1,1650041,ADR MOR-0012-2014,EGPWS (terrain terrain 1x),KWSK / ADR 826,During visual app RWY 34 EGPWS terrain terrrai...
2,1639560,CAA MOR-0076-2013,L windshield shattered,LJLJ / ADR653,"Left windshield was shattered during descend, ..."
3,1639587,CAA MOR-0079-2013,Stick shaker in flare,Final approach / ADR284,Stick shaker on landing flare.
4,1632617,CAA MOR-0004-2013,Stick shaker during flare,Paris - CDG,Stick shaker was engaged at flare. Flare was r...


A few helper functions for preparing the raw data:

In [10]:
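def union(list1, list2):
    # Unique elements of both lists (order is not preserved)
    final_list = list(set(list1) | set(list2))
    return final_list

def isNaN(string):
    # NaN is the only value that is not equal to itself
    return string != string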

A function that replaces all the acronyms in the reports. Going through the reports, I collected the most important acronyms for each meaningful term <br>(e.g. rnw or rny for runway).

In [11]:
def acronyms_to_words (txt): 
    for i in range (len(acronyms)):
        txt=txt.replace(' ' + acronyms.acronym[i] + ' ',' ' + acronyms.replacement[i] + ' ')
    return txt
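
One limitation of the space-delimited replacement above is that an acronym at the very start or end of a report, or directly next to punctuation, is left untouched. A minimal sketch of an alternative based on regular-expression word boundaries follows. The acronym table `ACRONYMS` and the function name `replace_acronyms` are hypothetical and only illustrate the idea; the notebook loads the real table from kratice.xlsx.

```python
import re

# Hypothetical acronym table for illustration only;
# the notebook reads the real one from kratice.xlsx.
ACRONYMS = {"rnw": "runway", "rny": "runway", "app": "approach"}

def replace_acronyms(txt, acronyms=ACRONYMS):
    # \b matches at word boundaries, so acronyms next to punctuation or at
    # the start/end of the text are matched, but substrings of longer words are not.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in acronyms) + r")\b"
    )
    return pattern.sub(lambda m: acronyms[m.group(1)], txt)

print(replace_acronyms("rnw 34 closed, visual app."))
```

The word "approach" itself is not changed by this sketch, because "app" inside it is not followed by a word boundary.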

A function that tokenizes each report, lowercases the text, removes stop words and punctuation marks, and lemmatizes the remaining tokens. <br>The spaCy library is used for this.

In [12]:
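def tokenizing_text(txt):
    tok=[]
    pom=txt.lower()                # lowercase the report
    pom=acronyms_to_words(pom)     # expand the acronyms
    sp_pom=sp(pom)                 # run the spaCy pipeline
    # keep alphabetic tokens that are not stop words
    sp_pom=[x for x in sp_pom if not x.is_stop and x.is_alpha]
    for w in sp_pom:
        tok.append(w.lemma_)       # store the lemma of each token
    return tok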

The function below prepares the reports one by one. It first skips reports that are NaN, then keeps only those written in English (only English reports were used in this study). The boolean argument "isRisk" is needed because, besides this set of classified reports, an expanded set of unclassified reports is used later for another task. If the set has a risk classification, call the function with "True"; otherwise use "False".
The output is a DataFrame that contains the identification number of each report (xx) and its tokenized content (report). When "isRisk" is True, it also contains the label, the risk factor and the derived "accident" class of each report.

In [18]:
def prep_doc(document, isRisk):

    if isRisk:
        columns = ['xx', 'report', 'label', 'factor', 'accident']
    else:
        columns = ['xx', 'report']
    # Rows are collected in a list because DataFrame.append was removed in pandas 2.0
    rows = []

    for i in range(len(document)):

        if isNaN(document.report[i]):
            continue

        # Check the language: if the detection is reliable and the report is
        # in English, process it; otherwise move on to the next report.
        isReliable, textBytesFound, details = cld2.detect(document.report[i])
        if not isReliable:
            continue
        if details[0][0] != 'ENGLISH':
            continue

        txt = tokenizing_text(document.report[i])

        # Some background on aviation risk classification:
        # the reports are classified, according to predetermined parameters, by a
        # risk factor given on a non-equidistant scale from 1 to 2500.
        # Based on this factor, there are 3 classes of events:
        #   - No accident outcome
        #   - Minor injuries or damage
        #   - Major or catastrophic accident (including death)
        # The feature "accident" records which of these 3 classes each report belongs to.
        # This was the best way to obtain proper and relevant data to classify.
        # For a better understanding of the next few lines, see AvioClass.jpg.
        if isRisk:
            if document.factor[i] == 1:
                risk_level = 'no accident outcome'
            elif document.factor[i] in (2, 4, 20, 100):
                risk_level = 'minor injuries or damage'
            else:
                risk_level = 'major or catastrophic accident'

            rows.append({'xx': document.xx[i], 'report': txt,
                         'label': document.label[i], 'factor': document.factor[i],
                         'accident': risk_level})
        else:
            rows.append({'xx': document.xx[i], 'report': txt})

    return pd.DataFrame(rows, columns=columns)
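
The factor-to-class mapping used inside prep_doc can also be isolated in a small helper, which makes it easy to check on its own. The sketch below is only an illustration: the names `MINOR_FACTORS` and `risk_level` are hypothetical, and the factor values are copied from the branch conditions in prep_doc.

```python
# Factor values copied from the branch conditions in prep_doc;
# the helper name and the constant name are hypothetical.
MINOR_FACTORS = {2, 4, 20, 100}

def risk_level(factor):
    """Map a risk factor to one of the three accident classes."""
    if factor == 1:
        return 'no accident outcome'
    if factor in MINOR_FACTORS:
        return 'minor injuries or damage'
    # Every other value falls through to the most severe class,
    # exactly as the else branch in prep_doc does.
    return 'major or catastrophic accident'

for f in (1, 20, 2500):
    print(f, '->', risk_level(f))
```

With such a helper, the whole column could be derived in one step, for example `document_risk['factor'].map(risk_level)`. Note that a missing factor would also end up in the last class, as in the original loop.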

Safety report text representation using TF-IDF
There are many open-source libraries for this, but to understand how TF-IDF works, it is coded here from scratch.
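
For reference, the weighting computed by the function below can be written as follows. The symbols are notation introduced here: $n_{t,d}$ is the number of times token $t$ occurs in report $d$, $|d|$ is the number of tokens in that report, $N$ is the number of reports and $\mathrm{df}(t)$ is the number of reports that contain $t$.

$$
\mathrm{tf}(t,d) = \frac{n_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right), \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

The inverse document frequency here is $\ln(1 + N/\mathrm{df}(t))$ rather than the more common $\ln(N/\mathrm{df}(t))$, so a token that appears in every report still receives a positive weight ($\ln 2$).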

In [14]:
def TF_IDF_Data(doc):
    # For each document and each token in it, we build a doc-tok-number row that means:
    # "In this document, this token has this TF-IDF number"
    # Later, we can make a sparse matrix (dim.: DOC x TOK) from this DataFrame
    # which will represent the same thing
    doc_freq = {}   # token -> number of documents that contain it
    rows = []       # (document id, token, term frequency)
    n_data = len(doc)

    for i in range(n_data):
        tokenized_text = doc.report[i]
        n_tok = len(tokenized_text)

        # go through the unique tokens of this report
        for w in union([], tokenized_text):
            nmb = tokenized_text.count(w)
            doc_freq[w] = doc_freq.get(w, 0) + 1
            rows.append((doc.xx[i], w, nmb / n_tok))

    N = n_data
    # multiply every term frequency by idf = log(1 + N/df)
    TD = pd.DataFrame(
        [(d, w, tf * math.log(1 + N / doc_freq[w])) for d, w, tf in rows],
        columns=['doc', 'tok', 'tf_idf'])

    # one-row DataFrame (index 0) holding the document frequency of each token
    tokens = pd.DataFrame([doc_freq], index=[0])

    return (tokens, TD)

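Here is the function that builds the document-token matrix. Most of its entries are zero, hence "sparse", although it is stored as a regular (dense) DataFrame.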

In [15]:
def pdoc2sparsedf(pdoc):
    toks = list(set(pdoc.tok))   # one column per token
    docs = list(set(pdoc.doc))   # one row per document
    zero = np.zeros(shape=(len(docs), len(toks)))
    df = pd.DataFrame(zero, columns=toks, index=docs)
    for i in range(len(pdoc)):
        # write the TF-IDF value into the (document, token) cell
        df.at[pdoc.doc[i], pdoc.tok[i]] = pdoc.tf_idf[i]
    return df
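
The same long-to-wide reshaping can be sketched with pandas' built-in `pivot`, shown here on a made-up three-row table that has the same columns as the doc/tok/tf_idf DataFrame. This is an illustration of the idea rather than a replacement for the function above, and it assumes that each (doc, tok) pair occurs only once, which holds when the table has one row per unique token per report.

```python
import pandas as pd

# Toy long-format table (values made up for illustration).
pdoc = pd.DataFrame({
    'doc': [1, 1, 2],
    'tok': ['engine', 'flare', 'engine'],
    'tf_idf': [0.5, 0.25, 0.75],
})

# One row per document, one column per token;
# (doc, tok) pairs that never occur become 0.0.
matrix = pdoc.pivot(index='doc', columns='tok', values='tf_idf').fillna(0.0)
print(matrix)
```

Here document 2 never contains "flare", so that cell is filled with 0.0.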

This function is not necessary for this task, but it lets us check later whether the results improve (just exploring the data). It reduces the vocabulary to the tokens that are meaningful for the task: a token that appears in only a small number of reports may not be of great importance.
By default, only tokens that appear in 8 or more reports are kept, but any threshold can be chosen and explored.
The function outputs are: the new list of tokens with their document counts, the new TF-IDF representation (restricted to the kept tokens) and a list of excluded tokens.

In [16]:
def pdoc_reduced(pdoc, tokens, min_doc_appearance=8):
    new_pdoc = pd.DataFrame(columns=['doc','tok','tf_idf'])
    new_tokens = pd.DataFrame()
    ejected_tokens = []
    n = len(pdoc)

    # keep only the tokens that appear in at least min_doc_appearance reports
    for col in tokens.columns:
        if tokens.loc[0, col] < min_doc_appearance:
            ejected_tokens.append(col)
        else:
            new_tokens[col] = 0

    new_tokens.loc[0] = 0
    for col in new_tokens.columns:
        new_tokens.loc[0, col] = tokens.loc[0, col]

    # copy the doc-tok rows whose token was kept
    br = 0
    for i in range(n):
        if pdoc.tok[i] in new_tokens.columns:
            new_pdoc.loc[br] = [pdoc.doc[i], pdoc.tok[i], pdoc.tf_idf[i]]
            br += 1

    # rescale each value by (rows kept for the document) / (rows the document had before);
    # both counts refer to the same document, the one of row i in new_pdoc
    for i in range(len(new_pdoc)):
        d = new_pdoc.doc[i]
        kept = len(new_pdoc[new_pdoc.doc == d])
        total = len(pdoc[pdoc.doc == d])
        new_pdoc.loc[i, 'tf_idf'] = new_pdoc.tf_idf[i] * kept / total

    return new_pdoc, new_tokens, ejected_tokens
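
The core idea, dropping tokens that appear in too few reports, can be illustrated on a toy corpus with the standard library alone. The reports and the helper name `frequent_tokens` below are made up for this sketch; it only shows the document-frequency filter, not the rescaling of the TF-IDF values.

```python
from collections import Counter

# Toy tokenized reports (made up for illustration).
reports = [
    ['stick', 'shaker', 'flare'],
    ['stick', 'shaker', 'landing'],
    ['windshield', 'shatter', 'flare'],
]

def frequent_tokens(reports, min_doc_appearance):
    # Count each token once per report (document frequency).
    doc_freq = Counter(tok for rep in reports for tok in set(rep))
    return {tok for tok, df in doc_freq.items() if df >= min_doc_appearance}

kept = frequent_tokens(reports, min_doc_appearance=2)
filtered = [[tok for tok in rep if tok in kept] for rep in reports]

print(sorted(kept))   # ['flare', 'shaker', 'stick']
print(filtered)
```

With a threshold of 2, "landing", "windshield" and "shatter" are dropped because each appears in a single report.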

It is time to use our functions:

In [19]:
doc_risk_tok=prep_doc(document_risk, True) #tokenized data with risk
doc_no_risk_tok=prep_doc(document, False) #tokenized no risk data

tokens_risk, pdoc_risk = TF_IDF_Data(doc_risk_tok)

tokens_no_risk, pdoc_no_risk = TF_IDF_Data(doc_no_risk_tok)

sparse_risk=pdoc2sparsedf(pdoc_risk)
sparse_no_risk=pdoc2sparsedf(pdoc_no_risk)

In [21]:
doc_risk_tok.head()

Unnamed: 0,xx,report,label,factor,accident
0,2522377,"[perform, ils, raw, datum, approach, training,...",0,2,minor injuries or damage
1,2346650,"[bru, approach, rw, ils, asighne, heading, clo...",0,2,minor injuries or damage
2,2047008,"[security, seal, board, takeoff, seal, aircraf...",0,1,no accident outcome
3,2528229,"[vie, runway, position, rh, base, radar, head,...",1,20,minor injuries or damage
4,2139606,"[electrical, smell, find, cabin, crew, member,...",0,1,no accident outcome


In [22]:
pdoc_risk.head()

Unnamed: 0,doc,tok,tf_idf
0,2522377,vectore,0.0755675
1,2522377,app,0.06286
2,2522377,takeoff,0.110858
3,2522377,flight,0.0233324
4,2522377,time,0.0368316


A look at the sparse matrix, after which we save the newly prepared data:

In [23]:
sparse_risk.head()

Unnamed: 0,fill,loka,inferior,izpolniti,appliance,oral,smoking,rise,threaten,network,uf,uuee,clock,secodn,bolgaria,ignition,signature,cockpit,wasa,majority,panelu,cb,valve,payement,lszh,trough,arcking,fralju,eddt,discuss,responsibilite,lwa,heavily,noticeably,electricity,deselect,seam,forget,pulse,happend,...,rnav,etihad,cph,remarks,accidently,smartphone,hoče,date,cir,uncertain,fulfill,proximity,principle,adapter,collective,sept,figuere,flux,element,skj,copule,introduction,strong,sudden,party,juice,stickshaker,withe,black,instructed,dicharger,friendly,xferre,harmful,unfit,hude,grz,maja,heat,fuse
2605065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.151597,0.0,0.0,0.0,0.0,0.0
1949714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2580498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2187283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2129941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
doc_risk_tok.to_csv('./doc_risk_tok.csv')
doc_no_risk_tok.to_csv('./doc_no_risk_tok.csv')
pdoc_risk.to_csv('./pdoc_risk.csv')
tokens_risk.to_csv('./tokens_risk.csv')
pdoc_no_risk.to_csv('./pdoc_no_risk.csv')
tokens_no_risk.to_csv('./tokens_no_risk.csv')
sparse_risk.to_csv('./sparse_risk.csv')
sparse_no_risk.to_csv('./sparse_no_risk.csv')
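
One thing to keep in mind when loading these files again: a column that holds Python lists, such as the tokenized "report" column, is written to CSV as text, so it comes back as a string. The sketch below shows one possible way to restore the lists with `ast.literal_eval`. It uses a made-up two-row DataFrame and an in-memory buffer instead of the real files.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column, like the tokenized reports.
df = pd.DataFrame({'xx': [1, 2], 'report': [['stick', 'shaker'], ['flare']]})

buffer = io.StringIO()   # stands in for a file on disk
df.to_csv(buffer)

# Plain read: the lists come back as their string representation.
buffer.seek(0)
raw = pd.read_csv(buffer, index_col=0)
print(type(raw.report[0]))

# Read with a converter: the strings are parsed back into lists.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0,
                       converters={'report': ast.literal_eval})
print(restored.report[0])
```

`ast.literal_eval` only parses Python literals, so it is a safer choice here than `eval`.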