
# Safety Text Data Preprocessing Pipeline Demonstration

This notebook demonstrates **text data preprocessing for safety-related text data**. The following steps are demonstrated as a standard operating procedure (SOP) for text data preprocessing:\
    1. Document collection  
    2. Document standardization  
    3. Tokenization  
    4. Stopword and punctuation filtering  
    5. Stemming and lemmatization  
    6. Part-of-speech (POS) tagging  
    7. Phrase recognition  
    8. Named Entity Recognition (NER)  
    9. Parsing  
    10. Vector generation  
    11. Feature generation  



In [None]:
# Import required libraries
import pandas as pd
import re
import nltk  # Natural Language Toolkit

nltk.download('punkt')  # sentence and word tokenizer models
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')  # WordNet lexical database (synonyms, word relationships)
nltk.download('omw-1.4')  # Open Multilingual Wordnet 1.4, used for lemmatization
nltk.download('averaged_perceptron_tagger')  # pre-trained part-of-speech (POS) tagging model
nltk.download('maxent_ne_chunker')  # pre-trained Maximum Entropy chunker for Named Entity Recognition (persons, organizations, locations)
nltk.download('words')  # standard English word list


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk

True

## 1. Document Collection (Importing Text data)

In [132]:
#Importing Text data from Excel/CSV file

import pandas as pd

# df = pd.read_excel('C:/Users/hp/Desktop/SARALab_coding/3_unstruc_data_preproc/safety_narrative_1.xlsx')
df = pd.read_excel('safety_narrative_1.xlsx')
df

Unnamed: 0,Narratives,description
0,Narrative-4,A car moving on the left lane of the road is t...


## 2. Document Standardization

In [134]:
# Standardize the safety narratives: lowercase all text, remove non-letter characters, and collapse extra whitespace

def standardize_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['clean_text'] = df['description'].apply(standardize_text) #adding column 'clean_text'
df[['description', 'clean_text']]


Unnamed: 0,description,clean_text
0,A car moving on the left lane of the road is t...,a car moving on the left lane of the road is t...
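
To make the effect of the cleaning rules concrete, the sketch below applies the same `standardize_text` function to a made-up narrative (the sample sentence is hypothetical, not taken from the spreadsheet). It shows that the regular expression removes digits and symbols as well as punctuation, which may matter if speeds or distances in a narrative are relevant to the analysis.

```python
import re

def standardize_text(text):
    """Lowercase, keep only a-z and whitespace, collapse repeated spaces."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)      # drops digits, punctuation, symbols
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

# Hypothetical narrative used only for illustration
sample = "Biker was at 60 km/h; he didn't brake!"
cleaned = standardize_text(sample)
print(cleaned)  # biker was at kmh he didnt brake
```

If numeric values need to be preserved, one option is to widen the character class (for example to `[^a-z0-9\s]`); whether that is appropriate depends on the downstream model.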


## 3. Tokenization

In [1]:
# Tokenization

from nltk.tokenize import word_tokenize

df['tokens'] = df['clean_text'].apply(word_tokenize)

#Output
df[['description', 'clean_text', 'tokens']]


## 4. Stopword and Punctuation Filtering

In [138]:
# Stop word filtering
# Stop words are very common words (e.g. "the", "is", "on") that carry little meaning on their own.
# Punctuation was already removed in the standardization step.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df['tokens_no_stopwords'] = df['tokens'].apply(
    lambda x: [word for word in x if word not in stop_words]
)

#Output

df[['description', 'clean_text', 'tokens', 'tokens_no_stopwords']]

Unnamed: 0,description,clean_text,tokens,tokens_no_stopwords
0,A car moving on the left lane of the road is t...,a car moving on the left lane of the road is t...,"[a, car, moving, on, the, left, lane, of, the,...","[car, moving, left, lane, road, taking, right,..."
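
The filtering step is a plain set-membership test, so it is easy to extend. The sketch below uses a tiny stand-in stop list instead of NLTK's full English list, plus a hypothetical domain-specific addition (`"narrative"` is an assumed example, not part of NLTK), to show how extra stop words could be merged in.

```python
# Tiny stand-in stop list for illustration; the notebook uses NLTK's English list
stop_words = {"a", "on", "the", "of", "is"}

# Hypothetical domain-specific additions (an assumption, not part of NLTK)
domain_stop_words = {"narrative"}

# First tokens of the example narrative shown in the output above
tokens = ["a", "car", "moving", "on", "the", "left", "lane", "of", "the", "road"]

all_stop_words = stop_words | domain_stop_words
filtered = [word for word in tokens if word not in all_stop_words]
print(filtered)  # ['car', 'moving', 'left', 'lane', 'road']
```

Using a `set` rather than a list keeps each membership check fast, which helps when the corpus is large.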


## 5. Stemming and Lemmatization

In [139]:
# Stemming and lemmatization

# Stemming reduces inflected words to a root form (the "stem")
# by stripping affixes, mainly suffixes.

# Lemmatization reduces words to their dictionary form (lemma) based on their meaning and part of speech (POS).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

df['stemmed'] = df['tokens_no_stopwords'].apply(
    lambda x: [stemmer.stem(word) for word in x]
)

df['lemmatized'] = df['tokens_no_stopwords'].apply(
    lambda x: [lemmatizer.lemmatize(word) for word in x]  # lemmatize each token; without a pos argument, every word is treated as a noun
)

# For more accurate results on verbs, pass the part of speech explicitly:
# lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x]

#Output
df[['tokens_no_stopwords', 'stemmed', 'lemmatized']]


Unnamed: 0,tokens_no_stopwords,stemmed,lemmatized
0,"[car, moving, left, lane, road, taking, right,...","[car, move, left, lane, road, take, right, tur...","[car, moving, left, lane, road, taking, right,..."
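
The lemmatized column above is almost unchanged ("moving" and "taking" remain as they are) because the lemmatizer treats every word as a noun unless told otherwise. One common workaround is to map Penn Treebank POS tags, of the kind `nltk.pos_tag` returns, to the single-letter codes WordNet expects. The helper below is a minimal sketch of that mapping; `penn_to_wordnet` is a hypothetical name, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code (defaults to noun)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# Tagged tokens in the (word, tag) format produced by nltk.pos_tag
tagged = [("car", "NN"), ("moving", "VBG"), ("left", "VBD")]
word_pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(word_pos_pairs)  # [('car', 'n'), ('moving', 'v'), ('left', 'v')]
```

Each pair can then be passed to the lemmatizer as `lemmatizer.lemmatize(word, pos=pos)`. Tagging works best on the original token sequence, so it may be preferable to tag before removing stop words.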


## 6. Part-of-Speech (POS) Tagging

In [140]:
# Part-of-speech (POS) tagging
# Assigns a grammatical category (such as noun, verb, or adjective) to each word in a text

df['pos_tags'] = df['lemmatized'].apply(nltk.pos_tag)
df[['lemmatized', 'pos_tags']]


Unnamed: 0,lemmatized,pos_tags
0,"[car, moving, left, lane, road, taking, right,...","[(car, NN), (moving, VBG), (left, VBD), (lane,..."


In [141]:
# Save the DataFrame to a CSV file named 'Example1_output.csv'
df.to_csv('Example1_output.csv') 

## 7. Phrase Recognition (Bigrams)

In [142]:
# A bigram is a sequence of two adjacent words (tokens) in a text.
# BigramCollocationFinder scores bigrams to help analyze word co-occurrence; here the top 10 are ranked by likelihood ratio.

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

all_tokens = [word for tokens in df['lemmatized'] for word in tokens]
bigram_finder = BigramCollocationFinder.from_words(all_tokens)
bigrams = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)

#Output

bigrams


[('lane', 'road'),
 ('apply', 'brake'),
 ('avoid', 'collision'),
 ('balance', 'fall'),
 ('brake', 'avoid'),
 ('car', 'moving'),
 ('coming', 'high'),
 ('high', 'speed'),
 ('indicator', 'time'),
 ('lose', 'balance')]
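
To see what the collocation finder is working with, the sketch below builds bigrams by hand using only the standard library and ranks them by raw frequency. The token list is hypothetical, and raw counts are a simpler stand-in for the likelihood-ratio score used above, so the rankings will generally differ.

```python
from collections import Counter

# Hypothetical lemmatized tokens used only for illustration
tokens = ["car", "moving", "left", "lane", "road",
          "taking", "right", "turn", "left", "lane"]

# Pair each token with its right-hand neighbour
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

most_common = bigram_counts.most_common(1)
print(most_common)  # [(('left', 'lane'), 2)]
```

Frequency alone tends to favour pairs of common words, which is one reason association measures such as the likelihood ratio are often preferred for phrase recognition.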

## 8. Named Entity Recognition (NER)

In [143]:
# import nltk
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


True

In [144]:
# NER identifies and classifies key entities in unstructured text, such as people, organizations, locations, dates, and quantities.
# NER transforms raw text into structured data for better machine understanding
def get_named_entities(tokens):
    pos_tags = nltk.pos_tag(tokens)
    chunks = nltk.ne_chunk(pos_tags)
    return chunks

    
#Output
df['named_entities'] = df['lemmatized'].apply(get_named_entities)
df[['lemmatized', 'named_entities']]


Unnamed: 0,lemmatized,named_entities
0,"[car, moving, left, lane, road, taking, right,...","[(car, NN), (moving, VBG), (left, VBD), (lane,..."


## 9. Parsing (Shallow Parsing)

In [60]:
# Parsing analyzes a sequence of tagged tokens to determine grammatical structure and the relationships between words.
# Shallow parsing (chunking) groups tokens into phrases; the grammar below marks noun phrases (NP):
# an optional determiner, any number of adjectives, then a noun.


grammar = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(grammar)

#Output
df['parsed_phrases'] = df['pos_tags'].apply(cp.parse)
df[['pos_tags', 'parsed_phrases']]


Unnamed: 0,pos_tags,parsed_phrases
0,"[(car, NN), (moving, VBG), (left, VBD), (lane,...","[[(car, NN)], (moving, VBG), (left, VBD), [(la..."


## 10. Vector Generation (TF-IDF)

### Example of TF-IDF vector generation using sklearn library


In [99]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Let's say we have three sample documents
documents = ["I love playing guitar", "I love singing", "playing guitar is fun"]
# type(documents)  #list of strings

In [111]:
# Initialize and fit

vectorizer = TfidfVectorizer(stop_words='english')  # exclude English stop words
# vectorizer

matrix = vectorizer.fit_transform(documents)  # learn the vocabulary and convert text to TF-IDF vectors (rows are L2-normalized by default)


# matrix.shape

In [117]:
# Get all feature names

vectorizer.get_feature_names_out()

array(['fun', 'guitar', 'love', 'playing', 'singing'], dtype=object)

In [113]:
#To get the word index
vectorizer.vocabulary_

{'love': 2, 'playing': 3, 'guitar': 1, 'singing': 4, 'fun': 0}

In [114]:
# Convert to dense format to view the tf-idf vector
print(matrix.todense())
print(vectorizer.get_feature_names_out())

[[0.         0.57735027 0.57735027 0.57735027 0.        ]
 [0.         0.         0.60534851 0.         0.79596054]
 [0.68091856 0.51785612 0.         0.51785612 0.        ]]
['fun' 'guitar' 'love' 'playing' 'singing']
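
The numbers in the matrix above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The documents are written out already tokenized with stop words removed, which is an assumption about what the vectorizer keeps.

```python
import math
from collections import Counter

# The three sample documents after stop word removal
docs = [
    ["love", "playing", "guitar"],
    ["love", "singing"],
    ["playing", "guitar", "fun"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(docs[1])
print({term: round(value, 4) for term, value in row.items()})
# {'love': 0.6053, 'singing': 0.796}
```

"singing" appears in only one document, so it receives a larger IDF and therefore a larger weight than "love", which appears in two.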


### *Feature selection*

In [118]:
#Feature selection
import numpy as np

X = matrix.toarray()   # convert sparse → dense
feature_names = vectorizer.get_feature_names_out()

# max TF-IDF value of each feature across documents
max_tfidf = X.max(axis=0)

# select features with max TF-IDF ≥ 0.6
selected_features = feature_names[max_tfidf >= 0.6]

print(selected_features)

['fun' 'love' 'singing']


### *Continue with the example document*

In [119]:
# TF: Term Frequency
# IDF: Inverse Document Frequency

# A fundamental technique in NLP that transforms raw, unstructured text data into meaningful numerical vectors
# TF-IDF highlights important words in a document while reducing the weight of commonly occurring words across a corpus. 
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['clean_text'])

tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out()
)

# Vector tf-idf matrix
tfidf_df.head()


Unnamed: 0,and,apply,at,avoid,balance,bike,biker,brake,can,car,...,same,speed,suddenly,suffer,taking,the,time,to,turn,without
0,0.125,0.0625,0.1875,0.0625,0.0625,0.125,0.125,0.0625,0.0625,0.0625,...,0.0625,0.0625,0.0625,0.0625,0.0625,0.8125,0.0625,0.0625,0.0625,0.0625


In [126]:
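# In a multi-document corpus, terms that are rare across documents score higher
# and terms that appear in many documents score lower.
# With a single document, as here, the IDF is the same for every term, so the scores reflect term frequency only.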

# df['clean_text']

0    a car moving on the left lane of the road is t...
Name: clean_text, dtype: object

In [125]:
#To get the word index

# tfidf.vocabulary_

In [65]:
# To check all features

# tfidf.get_feature_names_out()

array(['and', 'apply', 'at', 'avoid', 'balance', 'bike', 'biker', 'brake',
       'can', 'car', 'collision', 'coming', 'crossing', 'fall', 'from',
       'he', 'high', 'his', 'indicator', 'injuries', 'is', 'lane', 'left',
       'lose', 'major', 'may', 'moving', 'of', 'off', 'on', 'right',
       'road', 'same', 'speed', 'suddenly', 'suffer', 'taking', 'the',
       'time', 'to', 'turn', 'without'], dtype=object)

In [35]:
# Save the DataFrame to a CSV file

tfidf_df.to_csv('Example1_vector.csv') 

In [122]:
#Saving features in array

# tfidf_matrix.toarray()

## 11. Feature Generation

In [123]:
# Feature generation 

df['word_count'] = df['tokens'].apply(len)
df['unique_word_count'] = df['tokens'].apply(lambda x: len(set(x)))
df['avg_word_length'] = df['clean_text'].apply(
    lambda x: sum(len(word) for word in x.split()) / len(x.split())
)

df[['description', 'word_count', 'unique_word_count', 'avg_word_length']]


Unnamed: 0,description,word_count,unique_word_count,avg_word_length
0,A car moving on the left lane of the road is t...,71,43,3.929577
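
The same three features can be computed without pandas, which makes it easier to check them on a single string. The sketch below uses a hypothetical helper, `text_features`, that splits on whitespace (for text that has already been standardized this should agree closely with the tokenizer) and guards against empty text, where the average word length would otherwise divide by zero.

```python
def text_features(clean_text):
    """Compute simple length-based features for one standardized narrative."""
    words = clean_text.split()
    if not words:
        # Avoid ZeroDivisionError on empty or whitespace-only text
        return {"word_count": 0, "unique_word_count": 0, "avg_word_length": 0.0}
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "avg_word_length": sum(len(word) for word in words) / len(words),
    }

features = text_features("a car moving on the left lane of the road")
print(features)
# {'word_count': 10, 'unique_word_count': 9, 'avg_word_length': 3.2}
```

With a DataFrame, such a helper could be applied per row and expanded into columns, for example with `df['clean_text'].apply(text_features).apply(pd.Series)`.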


In [128]:
# Maximum TF-IDF value of each feature across all documents
max_tfidf = tfidf_df.max(axis=0)

# Select features with TF-IDF >= 0.1
selected_features = max_tfidf[max_tfidf >= 0.1].index

# Reduced TF-IDF dataframe
tfidf_selected = tfidf_df[selected_features]

tfidf_selected.head()

Unnamed: 0,and,at,bike,biker,crossing,is,lane,may,of,on,right,road,the
0,0.125,0.1875,0.125,0.125,0.125,0.125,0.125,0.125,0.125,0.1875,0.125,0.125,0.8125
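
A fixed threshold is only one possible decision rule. As an alternative, the sketch below keeps the `k` highest-scoring features. The scores are copied from the output above, and `top_k_features` is a hypothetical helper name. Ties are broken alphabetically so the result is deterministic.

```python
# Maximum TF-IDF per feature, copied from the selected-features output above
max_tfidf = {
    "and": 0.125, "at": 0.1875, "bike": 0.125, "biker": 0.125,
    "crossing": 0.125, "is": 0.125, "lane": 0.125, "may": 0.125,
    "of": 0.125, "on": 0.1875, "right": 0.125, "road": 0.125,
    "the": 0.8125,
}

def top_k_features(scores, k):
    """Return the k feature names with the highest scores (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

top3 = top_k_features(max_tfidf, 3)
print(top3)  # ['the', 'at', 'on']
```

The top three here are all function words, which suggests that removing stop words before vectorizing (for example with `TfidfVectorizer(stop_words='english')`) may give more informative features for this narrative.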


In [129]:
# Save the DataFrame to a CSV file 

tfidf_selected.to_csv('Features_final.csv') 

### Next: Model Building (Do-It-Yourself)
### *Each selected feature becomes a predictor variable.*


## Conclusion

This notebook demonstrates a **complete text data preprocessing pipeline**
for **safety-related text data**, which can be extended directly to modeling tasks such as:
- Topic modeling (LDA / NMF)
- Accident severity classification
- Safety risk analysis
- NLP-based safety analytics
- Document clustering based on the selected features


# Problem Statement and To-Do

1. For the given safety text data in the Excel spreadsheet shared on **Moodle**, apply all the text data preprocessing steps discussed above.
2. Build a **TF-IDF** vector matrix.
3. Select important features using your **own decision rule**.