### Seed-Guided LDA with negative sentences input from BERT model
by: Anood Alkatheeri

#### 1) Importing negative sentences from JSON file (and removing duplicates)

In [1]:
import json
import lda.datasets as gldad
import numpy as np
from lda import guidedlda as glda
from collections import defaultdict
import random
import pandas as pd

f = open('extracted_neg_sentences_24th_report.json')

data = json.load(f)
  
# Iterating through the json
original_sentences24 = []

for i in data:
    sentence = i['sentence']
    if sentence not in original_sentences24:
        original_sentences24.append(sentence)

f = open('extracted_neg_sentences_31st_report.json')

data = json.load(f)
  
# Iterating through the json
original_sentences31 = []

for i in data:
    sentence = i['sentence']
    if sentence not in original_sentences31:
        original_sentences31.append(sentence)

f = open('extracted_neg_sentences_32nd_report.json')

data = json.load(f)
  
# Iterating through the json
original_sentences32 = []

for i in data:
    sentence = i['sentence']
    if sentence not in original_sentences32:
        original_sentences32.append(sentence)


In [2]:
combined_sentences = original_sentences24 + original_sentences31 + original_sentences32
len(combined_sentences)

1053

In [3]:
random.choices(combined_sentences, k=7)

['In August 2013, the primary reason for Unit 2 not having a Green rating was that the unit experienced a Rod Control Urgent Failure Alarm during a quarterly control rod operability test when Control Rod Groups 1 and 2 in Bank A were slightly out of their expected positions.',
 'Conclusions: DCPP experienced significant Feedwater Heater (FWH) tube failures in FWH 2-5B in mid-October 2021, which caused Unit 2 to be shut down twice to perform inspections and repairs.',
 'The event was caused by the operator’s misunderstanding the status of protected equipment which precluded work on a redundant component when another component is out of service.',
 '(A Cross-cutting aspect of H.1, "Inadequate Procedure," was assigned to this violation) Green Finding associated with the inadequate use of industry operating experience associated with environmental corrosion of outdoor piping (No Cross-Cutting aspects were assigned to this violation.) Green Non-Cited Violation associated with a Containment 

#### 2) Pre-processing: removing stop words/punctuations/numbers, lemmatization, lowerization...

In [4]:
import re
import spacy

# load the spacy model and stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words

def preprocess(combined_sentences):
    list_sentences = []
    # assume that 'df' is a list of lists containing strings
    for i in range(len(combined_sentences)):
        
        #join the list into a single string
        text = combined_sentences[i]
        
        #remove months
        text = re.sub('{January|February|March|April|May|June|July|August|September|October|November|December}', '', text)
        
        #remove prefixes
        prefixes = "(Mr|St|Mrs|Ms|Dr|Ph.D|Chair|ViceChair|Volume)"
        text  = re.sub(prefixes, '', text)

        #lemmatization
        lemma_list = [token.lemma_ for token in nlp(text)]

        #remove stopwords and -PRON- tags
        clean_list = [re.sub('-PRON-', '', word) for word in lemma_list if word not in stop_words]

        #join the cleaned tokens back into a string
        clean_text = ' '.join(clean_list).lower()

        #removes numbers and punctuations.
        clean_text=re.sub(r'\d+', '', clean_text)
        clean_text=re.sub(r'[\W_]+', ' ', clean_text).rstrip()

        #remove single character terms or 2-letter words
        terms_list = [word for word in clean_text.split() if len(word) > 2]
        clean_text = ' '.join(terms_list).lower()
            
        #assign the cleaned text back to list
        list_sentences.append(clean_text)
        
    return list_sentences
        
list_sentences = preprocess(combined_sentences)

2023-04-18 02:19:42.688264: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [5]:
random.choices(list_sentences, k=7)

['hipschman report inspection activity dcpp include review trip design error insulator replace offsite electrical system',
 'remain engine approximately bolt find slightly require torque',
 'backup pump start fail come speed pressure quickly decrease degraded backup oil pressure accumulator',
 'dcisc review safety system functional failures reference conclude following dcpp experience significant number safety system functional failure mid mid',
 'number breakdown planning conduct maintenance activity refueling outage',
 'presentation find design requirement assumption use nei unrealistic',
 'report period automatic manual number reactor trip continue commendably low']

#### 3) Creating sentence-term matrix and generating unique vocabulary

Convert a collection of text documents to a matrix of token counts.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_2 = vectorizer.fit_transform(list_sentences)
X_2.shape

(1053, 2486)

Convert matrix to DataFrame.

In [7]:
feature_names = vectorizer.get_feature_names_out()

X_df = pd.DataFrame(X_2.toarray(), columns=feature_names)

X_df.iloc[:10,0:20]

Unnamed: 0,ability,able,abnormal,abnormally,abort,abrupt,absence,absorb,accelerate,acceptable,acceptance,accepted,access,accident,accidents,accommodate,accordance,accordingly,account,accumulate
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Convert DataFrame to Array.

In [8]:
X_array = X_2.toarray()
X_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Extracting and encoding unique words/vocabulary (Creating Word-id dictionary).

In [9]:
vocab = tuple(feature_names)
word2id = dict((v, idx) for idx, v in enumerate(vocab))
len(vocab)

2486

#### 4) Defining seed words that correspond to each safety trait

Source of seed words: INPO 12-012- Traits of a Healthy Nuclear Safety Culture - April 2013.pdf

In [10]:
seed_words = [
['responsibility', 'accountability', 'help', 'support', 'trained', 'qualified', 'understand', 'complete', 'involvement'],
['complacency', 'complacent', 'challenge', 'error', 'hazard', 'caution', 'discrepancy', 'anomaly', 'assumption', 'question', 'uncertain', 'unknown', 'risk', 'trend', 'unexpected', 'unclear', 'degrading', 'aging'],
['communication', 'licensee', 'event', 'report', 'documentation', 'request', 'LER', 'information', 'safety', 'prompt', 'share', 'respond', 'listen', 'concern', 'expectation', 'clear'],
['leadership', 'management', 'leader', 'owner', 'ownership', 'program', 'guidance', 'policy', 'resource', 'staffing', 'oversight', 'reinforce', 'priority', 'plan', 'delegate', 'align', 'define', 'manage', 'resolve', 'address', 'translate', 'funding', 'implementation', 'violation'],
['thorough', 'conservative', 'systematic', 'consistent', 'process', 'choice', 'consequence', 'authority', 'future', 'timely', 'executive', 'senior'],
['trust', 'respect', 'opinion', 'dignity', 'fair', 'disagree', 'receptive', 'valuable', 'tolerate', 'value', 'insight', 'perspective', 'collaboration', 'conflict', 'listening'],
['learn', 'training', 'assessment', 'improve', 'performance', 'scrutiny', 'monitor', 'adopt', 'idea', 'benchmarking', 'knowledge', 'competent', 'skills', 'develop', 'acquire'],
['identify', 'corrective', 'action','issue', 'yellow', 'red', 'prevent', 'foreign', 'poor', 'inadequate', 'degraded', 'evaluation', 'problem', 'cause', 'root', 'investigation', 'investigate', 'recommendation', 'resolution', 'mitigate'],
['environment', 'fear', 'harassment', 'discrimination', 'promote', 'severity', 'failure', 'submit', 'report', 'expired', 'raise'],
['engineering', 'control', 'activity', 'contingency', 'production', 'schedule', 'work', 'margin', 'operate', 'maintain', 'maintenance', 'procedure', 'package', 'accurate', 'current', 'backlog', 'instruction', 'operation', 'design', 'requirement','standard']
]

# seed_words = [['responsible', 'encourage', 'accountable', 'help', 'ownership', 'support', 'trained', 'qualified', 'understand', 'non-cited', 'involvement', 'self-identifying'],
#               ['complacent', 'discrepancy', 'anomaly', 'assumption', 'question', 'uncertain', 'unknown', 'risk', 'investigate', 'unexpected', 'unclear', 'review', 'degrading', 'aging'],
#               ['communication', 'information', 'safety', 'share', 'respond', 'listen', 'concerns', 'expectations', 'clear'],
#               ['leadership', 'management', 'leaders', 'owner', 'guidance', 'decisions', 'policies', 'resources', 'staffing', 'expectations', 'oversight', 'reinforce', 'decision-making', 'mentoring', 'coaching', 'encourage', 'incentive', 'reward', 'priorities', 'plan', 'delegate', 'define', 'manage', 'support', 'resolve', 'motivate', 'address', 'resources', 'translate', 'funding', 'implementation', 'violation'],
#               ['decision-making', 'thorough', 'conservative', 'systematic', 'consistent', 'process', 'leaders', 'choices', 'consequences', 'reinforce', 'accountable', 'authority', 'responsibility'],
#               ['trust', 'respect', 'encourage', 'opinions', 'bullying', 'fair', 'productive', 'concerns', 'voice', 'discussions', 'value', 'insight', 'perspectives', 'collaboration', 'conflict', 'listening'],
#               ['learn', 'training', 'self-assessments', 'improve', 'performance', 'monitor', 'experience', 'assessment', 'corrective action', 'benchmarking', 'knowledge', 'competence', 'skills', 'develop', 'understand'],
#               ['identify', 'address', 'correct', 'corrective action', 'issues', 'deviation', 'degraded conditions', 'evaluation', 'problems', 'root cause', 'investigation', 'recommendation', 'cause analysis', 'resolution', 'mitigate', 'trends'],
#               ['concerns', 'confidence', 'address', 'feedback', 'listening', 'report'],
#               ['plan', 'control', 'activities', 'production', 'schedule', 'manage', 'execute', 'risk', 'coordinate', 'margins', 'operate', 'maintain', 'maintenance', 'document', 'complete', 'procedures', 'packages', 'accurate', 'current', 'backlog', 'follow', 'review', 'instructions', 'operations', 'design', 'requirements', 'operations']]

# seed_words = [['responsibility', 'authority', 'override', 'adherence', 'accountability', 'organization', 'meeting', 'standard', 'demonstrate', 'proper', 'reinforce', 'discussion', 'peer', 'personally', 'consistent', 'solicit', 'feedback', 'understand', 'foster', 'professional', 'teamwork', 'environment', 'raise', 'ownership', 'preparation', 'execute', 'assign', 'activity', 'work', 'participate', 'briefing', 'qualify', 'communicate', 'coordinate', 'boundary', 'sense', 'operation'],
#               ['complacency', 'challenge', 'discrepancy', 'error', 'anomaly', 'undesirable', 'complex', 'unpredictable', 'oversight', 'caution', 'hazard', 'radioactive', 'core', 'decay', 'heat', 'fuel', 'cooling', 'question', 'degrade', 'equipment', 'risk', 'uncertain', 'unexpected', 'attitude', 'resolve', 'rationalize', 'abnormal', 'investigate', 'supervisor', 'consult', 'expert', 'unclear', 'oppose', 'view', 'management', 'decision', 'contrary', 'possibility', 'mistake', 'inherit', 'risk', 'contingency', 'undesired'],
#               ['communication', 'safety', 'worker', 'equipment', 'labeling', 'operating', 'documentation', 'formal', 'informal', 'convey', 'flow', 'organization', 'group', 'frequent', 'status', 'supervisor', 'information', 'shift', 'turnover', 'briefing', 'meeting', 'daily', 'prompt', 'unintended', 'conflicting', 'ask', 'share', 'reason', 'implication', 'openly', 'candidly', 'respond', 'forthright', 'audit', 'solicit', 'listen', 'assess', 'expectation', 'reliability'],
#               ['leadership', 'commitment', 'decision', 'lead', 'advocate', 'corporate', 'policy', 'resource', 'reliability', 'staffing', 'sufficient', 'qualified', 'personnel', 'facility', 'maintain', 'emergency', 'executive', 'senior', 'manager', 'evaluation', 'ensure', 'expectation', 'disciplinary', 'consistent', 'raise', 'foster', 'oversight', 'cost', 'schedule', 'goal', 'establish', 'align', 'priority', 'systematic', 'process', 'change', 'implement', 'authority', 'plan', 'ownership', 'accountability', 'role', 'recommendation', 'feedback', 'governance', 'monitor', 'perspective', 'tool', 'survey', 'review', 'culture', 'detract', 'act', 'unsafe', 'decision-making'],
#               ['expectation', 'systematic', 'unexpected', 'uncertain', 'reinforce', 'conservative', 'decision', 'consistent', 'process', 'process', 'seek', 'group', 'organization', 'safety', 'risk', 'bias', 'effectiveness', 'future', 'choice', 'timely', 'commensurate', 'executive', 'senior', 'reinforce', 'procedure', 'reactor', 'margin', 'operation', 'shift', 'accountability', 'authority', 'responsibility'],
#               ['trust', 'respect', 'communication', 'opinion', 'employees', 'everyone', 'dignity', 'capability', 'experience', 'valuable', 'asset', 'group', 'bullying', 'humiliating', 'tolerate', 'behavior', 'disagree', 'fair', 'concern', 'suggestion', 'question', 'problems', 'receptive', 'differing', 'discussion', 'expertise', 'value', 'experience', 'perspective', 'program', 'personnel', 'lack', 'information', 'share', 'timely', 'milestone', 'positive', 'negative', 'confidentiality', 'conflict', 'objective', 'resolution', 'equitable', 'consistent', 'defined', 'result', 'professional'],
#               ['opportunities', 'implement', 'learn', 'training', 'assessment', 'benchmarking', 'stimulate', 'performance', 'improve', 'scrutiny', 'monitoring', 'institutionalized', 'procedures', 'adopt', 'ideas', 'routine', 'critical', 'practice', 'corrective', 'topics', 'needs', 'knowledge', 'skills', 'acquire', 'best', 'corrective', 'action', 'transfer', 'retention', 'strategy', 'competent', 'develop'],
#               ['issue', 'impact', 'identify', 'evaluate', 'address', 'correct', 'commensurate', 'significant', 'deviation', 'standard', 'action', 'corrective', 'document', 'threshold', 'describe', 'prioritize', 'assign', 'resolution', 'evaluation', 'classify', 'report', 'operability', 'investigation', 'root', 'cause', 'understand', 'conduct', 'resolution', 'trend', 'mitigate', 'routine'],
#               ['conscious', 'environments', 'fear', 'retaliation', 'intimidation', 'harassment', 'discrimination', 'policies', 'rights', 'responsibility', 'leaders', 'ownership', 'investigate', 'establish', 'support', 'promote', 'concern', 'raise'],
#               ['planning', 'controlling', 'work', 'activities', 'process', 'schedule', 'execute', 'criticize', 'management', 'incorporate', 'contingency', 'action', 'coordinate', 'probabilistic', 'consider', 'conflicting', 'modification', 'design', 'margin', 'operate', 'maintain', 'backlog', 'engineering', 'fission', 'prevent', 'documentation', 'procedures', 'complete', 'adherence', 'human', 'error', 'status', 'validate', 'implementation']]
              
safety_traits = ['Personal Accountability', 'Questioning Attitude', 'Effective Safety Communication', 
                 'Leadership Safety Values and Actions','Decision Making', 'Respectful Work Environment', 
                 'Continuous Learning','Problem Identification and Resolution','Environment for Raising Concerns',
                 'Work Processes']

seed_words_count = sum([len(listt) for listt in seed_words])
print("Total number of seed words:", seed_words_count)


Total number of seed words: 161


Pre-processing seed words (changing them to their basic form "lemmatization")

In [11]:
#Lemmatizing seed words

lemma_seed_words = []

for listt in seed_words:
    new_listt = []
    for words in listt:
        doc = nlp(words)
        i = 0
        for token in doc:
            if token.text == "-":
                continue
            new_listt.append(token.lemma_)
            i+=1
    lemma_seed_words.append(new_listt)
print(lemma_seed_words)
print("\nNumber of words:", sum([len(x) for x in lemma_seed_words]))

[['responsibility', 'accountability', 'help', 'support', 'train', 'qualify', 'understand', 'complete', 'involvement'], ['complacency', 'complacent', 'challenge', 'error', 'hazard', 'caution', 'discrepancy', 'anomaly', 'assumption', 'question', 'uncertain', 'unknown', 'risk', 'trend', 'unexpected', 'unclear', 'degrade', 'age'], ['communication', 'licensee', 'event', 'report', 'documentation', 'request', 'LER', 'information', 'safety', 'prompt', 'share', 'respond', 'listen', 'concern', 'expectation', 'clear'], ['leadership', 'management', 'leader', 'owner', 'ownership', 'program', 'guidance', 'policy', 'resource', 'staff', 'oversight', 'reinforce', 'priority', 'plan', 'delegate', 'align', 'define', 'manage', 'resolve', 'address', 'translate', 'funding', 'implementation', 'violation'], ['thorough', 'conservative', 'systematic', 'consistent', 'process', 'choice', 'consequence', 'authority', 'future', 'timely', 'executive', 'senior'], ['trust', 'respect', 'opinion', 'dignity', 'fair', 'disa

#### 5) Initializing Guided LDA model and mapping seed words to vocabulary terms 
Issue: many seed words do not exist in training data (Increase training size?)

In [12]:
model = glda.GuidedLDA(n_topics=10, n_iter=500, random_state=7, refresh=20)

lemm_seed_topics = {}
for t_id, st in enumerate(lemma_seed_words):
    for word in st:
        try:
            lemm_seed_topics[word2id[word]] = t_id
        except:
            continue
lemm_seed_topics

{1908: 0,
 2200: 0,
 2274: 6,
 1760: 0,
 2326: 0,
 387: 0,
 318: 1,
 762: 1,
 1008: 1,
 651: 1,
 140: 1,
 1768: 1,
 1942: 1,
 2289: 1,
 2332: 1,
 2321: 1,
 585: 7,
 56: 1,
 381: 2,
 1269: 2,
 777: 2,
 1875: 8,
 670: 2,
 1883: 2,
 1125: 2,
 1972: 2,
 1735: 2,
 1906: 2,
 406: 2,
 813: 2,
 343: 2,
 1249: 3,
 1330: 3,
 1248: 3,
 1570: 3,
 1571: 3,
 1728: 3,
 993: 3,
 1644: 3,
 1900: 3,
 2130: 3,
 1564: 3,
 1844: 3,
 1710: 3,
 1635: 3,
 68: 3,
 1329: 3,
 1898: 3,
 41: 3,
 2282: 3,
 945: 3,
 1067: 3,
 2400: 3,
 2248: 4,
 438: 4,
 443: 4,
 1721: 4,
 949: 4,
 2258: 4,
 801: 4,
 2016: 4,
 1902: 5,
 1525: 5,
 2379: 5,
 425: 5,
 1254: 6,
 132: 6,
 1074: 6,
 1602: 6,
 1420: 6,
 1052: 6,
 203: 6,
 1230: 6,
 621: 6,
 1054: 7,
 494: 7,
 27: 7,
 1205: 7,
 2479: 7,
 1825: 7,
 1697: 7,
 913: 7,
 1650: 7,
 1082: 7,
 774: 7,
 1715: 7,
 300: 7,
 1954: 7,
 1192: 7,
 1191: 7,
 1817: 7,
 1897: 7,
 1405: 7,
 752: 8,
 2031: 8,
 844: 8,
 2170: 8,
 820: 8,
 1780: 8,
 740: 9,
 472: 9,
 31: 9,
 461: 9,
 1725: 9,
 1

In [13]:
len(lemm_seed_topics)

119

#### 6) Fitting LDA model, guided by seed words.

seed_confidence: Measures how much the model is biasing the seeded words towards the seeded topics.

In [14]:
model.fit(X_array, seed_topics=lemm_seed_topics, seed_confidence=1)

INFO:lda:n_documents: 1053
INFO:lda:vocab_size: 2486
INFO:lda:n_words: 16371
INFO:lda:n_topics: 10
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -195413
INFO:lda:<20> log likelihood: -127885
INFO:lda:<40> log likelihood: -125077
INFO:lda:<60> log likelihood: -123758
INFO:lda:<80> log likelihood: -123093
INFO:lda:<100> log likelihood: -122794
INFO:lda:<120> log likelihood: -122457
INFO:lda:<140> log likelihood: -122023
INFO:lda:<160> log likelihood: -121742
INFO:lda:<180> log likelihood: -121541
INFO:lda:<200> log likelihood: -121274
INFO:lda:<220> log likelihood: -121329
INFO:lda:<240> log likelihood: -121148
INFO:lda:<260> log likelihood: -121078
INFO:lda:<280> log likelihood: -120842
INFO:lda:<300> log likelihood: -120880
INFO:lda:<320> log likelihood: -120908
INFO:lda:<340> log likelihood: -120679
INFO:lda:<360> log likelihood: -120569
INFO:lda:<380> log likelihood: -120589
INFO:lda:<400> log likelihood: -120531
INFO:lda:<420> log likelihood: -120482
INFO:lda:<440> log likelihoo

<lda.guidedlda.GuidedLDA at 0x7f9be3a78760>

#### 7) Printing top words per topic (topic-word distributions)

A mix of words that make up each topic.

In [15]:
n_top_words = 50
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('{}: [{}]'.format(safety_traits[i], ', '.join(topic_words)))
    print('\n')

Personal Accountability: [fuel, pump, spent, reactor, pool, door, system, water, containment, repair, outage, replace, auxiliary, replacement, fire, result, loss, window, additional, fan, core, heat, service, unit, margin, dcisc, motor, inventory, find, valve, vessel, yellow, cooling, assembly, inspection, coolant, crane, degraded, cfcu, reliability, currently, learn, cooler, voltage, cavity, remove, leak, maintain, dcpp, deferral]


Questioning Attitude: [dcpp, system, review, tube, safety, failure, dcisc, inspection, following, unit, outage, cause, reference, conclude, fwh, experience, concern, functional, significant, refueling, feedwater, design, heater, perform, mid, shut, nrc, program, opportunity, repair, plant, unnecessary, fact, miss, replacement, failures, update, resident, conclusion, trip, inspector, fuel, reactor, handling, shutdown, length, maintenance, historically, conclusions, meet]


Effective Safety Communication: [safety, public, meeting, state, dcpp, low, concern, 

Test model classification of the word "leadership"

In [16]:
np.set_printoptions(suppress=True)
word_index = vocab.index('leadership')
probabilities = model.word_topic_[word_index]
max_prob = max(probabilities)
max_prob_index = np.argmax(probabilities)
print(list(np.round(probabilities,4)))
print("Most probable topic:", safety_traits[max_prob_index])

[0.0009, 0.0009, 0.0009, 0.9919, 0.0009, 0.0009, 0.0009, 0.0009, 0.0009, 0.0009]
Most probable topic: Leadership Safety Values and Actions


#### 8) Using the GuidedLDA model to classify test sentences to safety trait labels (from training data)
Probability threshold assumption: 10%
<br> Actual labels are unknown

In [17]:
test = list_sentences[5]
original_test = combined_sentences[5]
print(original_test,"\n")
print(test, "\n")
counter = {}
traits = []
sum_prob = defaultdict(list)
for word in test.split():
    vocab_index = vocab.index(word)
    probabilities = model.word_topic_[vocab_index]
    max_prob = max(probabilities)
    max_prob_index = np.argmax(probabilities)
    if max_prob > 0.5:
        counter[safety_traits[max_prob_index]]=counter.get(safety_traits[max_prob_index],0)+1
    i = 0
    while i < len(probabilities):
        sum_prob[safety_traits[i]].append(probabilities[i])
        i+=1
for trait in safety_traits:
    avg_prob = sum(sum_prob[trait])/len(sum_prob[trait])
    print(trait+":", round(avg_prob,4))
    if (avg_prob >= 0.1) and (trait in counter):
        if trait not in traits:
            traits.append((trait,avg_prob))
top_traits = [x[0] for x in sorted(traits,key=lambda x: -x[1])[:4]]
print('\nLikely Traits:', top_traits)
print('\nCounter:', counter)

Problems have been mostly due to age-related issues and lack of adequate inspection, maintenance, and component replacement, especially electrical contacts. 

problem age relate issue lack adequate inspection maintenance component replacement especially electrical contact 

Personal Accountability: 0.0671
Questioning Attitude: 0.0601
Effective Safety Communication: 0.1179
Leadership Safety Values and Actions: 0.0226
Decision Making: 0.0902
Respectful Work Environment: 0.02
Continuous Learning: 0.0919
Problem Identification and Resolution: 0.048
Environment for Raising Concerns: 0.0702
Work Processes: 0.412

Likely Traits: ['Work Processes', 'Effective Safety Communication']

Counter: {'Work Processes': 3, 'Questioning Attitude': 1, 'Continuous Learning': 1, 'Effective Safety Communication': 1}


In [18]:
test = list_sentences[0]
original_test = combined_sentences[0]
print(original_test,"\n")
print(test,"\n")
counter = {}
traits = []
sum_prob = defaultdict(list)
for word in test.split():
    vocab_index = vocab.index(word)
    probabilities = model.word_topic_[vocab_index]
    max_prob = max(probabilities)
    max_prob_index = np.argmax(probabilities)
    if max_prob > 0.5:
        counter[safety_traits[max_prob_index]]=counter.get(safety_traits[max_prob_index],0)+1
    i = 0
    while i < len(probabilities):
        sum_prob[safety_traits[i]].append(probabilities[i])
        i+=1
for trait in safety_traits:
    avg_prob = sum(sum_prob[trait])/len(sum_prob[trait])
    print(trait+":", round(avg_prob,4))
    if (avg_prob >= 0.1) and (trait in counter):
        if trait not in traits:
            traits.append((trait,avg_prob))
top_traits = [x[0] for x in sorted(traits,key=lambda x: -x[1])[:4]]
print('\nLikely Traits:', top_traits)
print('\nCounter:', counter)

DCPP acted prompt with corrective actions and submitted a Licensee Event Report when it discovered Technical Specification non-compliance on the Low Temperature Overpressure Protection System. 

dcpp act prompt corrective action submit licensee event report discover technical specification non compliance low temperature overpressure protection system 

Personal Accountability: 0.0335
Questioning Attitude: 0.055
Effective Safety Communication: 0.2546
Leadership Safety Values and Actions: 0.0369
Decision Making: 0.0203
Respectful Work Environment: 0.0685
Continuous Learning: 0.092
Problem Identification and Resolution: 0.0816
Environment for Raising Concerns: 0.2944
Work Processes: 0.0632

Likely Traits: ['Environment for Raising Concerns', 'Effective Safety Communication']

Counter: {'Effective Safety Communication': 3, 'Environment for Raising Concerns': 5, 'Problem Identification and Resolution': 1}


In [19]:
test = list_sentences[57]
original_test = combined_sentences[57]
print(original_test,"\n")
print(test,"\n")
counter = {}
traits = []
sum_prob = defaultdict(list)
for word in test.split():
    vocab_index = vocab.index(word)
    probabilities = model.word_topic_[vocab_index]
    max_prob = max(probabilities)
    max_prob_index = np.argmax(probabilities)
    if max_prob > 0.5:
        counter[safety_traits[max_prob_index]]=counter.get(safety_traits[max_prob_index],0)+1
    i = 0
    while i < len(probabilities):
        sum_prob[safety_traits[i]].append(probabilities[i])
        i+=1
for trait in safety_traits:
    avg_prob = sum(sum_prob[trait])/len(sum_prob[trait])
    print(trait+":", round(avg_prob,4))
    if (avg_prob >= 0.1) and (trait in counter):
        if trait not in traits:
            traits.append((trait,avg_prob))
top_traits = [x[0] for x in sorted(traits,key=lambda x: -x[1])[:4]]
print('\nLikely Traits:', top_traits)
print('\nCounter:', counter)

This ultimately created an environment that promulgated a human error-likely environment.” More specifically, the RCE team determined that the environment consisted of poor communication, lack of engineering leadership, too much reliance on vendor designs, time pressure, and distractions. 

ultimately create environment promulgate human error likely environment specifically rce team determine environment consist poor communication lack engineering leadership reliance vendor design time pressure distraction 

Personal Accountability: 0.0202
Questioning Attitude: 0.0179
Effective Safety Communication: 0.0422
Leadership Safety Values and Actions: 0.2405
Decision Making: 0.0336
Respectful Work Environment: 0.2486
Continuous Learning: 0.0836
Problem Identification and Resolution: 0.0983
Environment for Raising Concerns: 0.0088
Work Processes: 0.2062

Likely Traits: ['Respectful Work Environment', 'Leadership Safety Values and Actions', 'Work Processes']

Counter: {'Respectful Work Environme

#### 9) Evaluating Model Accuracy
Using manually-labelled texts from "Diablo Canyon - On Nuclear Safety and Safety Culture - Hector & Parker 04-08-16" <br>
Number of test sentences: 27

In [20]:
# Metrics Functions
def accuracy(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection)/union

def precision(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    return float(intersection)/len(list1)

def recall(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    return float(intersection)/len(list2) 

def f1(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    precision_score = precision(list1, list2)
    recall_score = recall(list1, list2)
    try:
        f1_score = float(2*precision_score*recall_score)/(precision_score+recall_score)
    except:
        f1_score = 0
    return f1_score

Importing test dataset as DataFrame

In [24]:
import csv
import pandas as pd

test_data = []

# opening the CSV file
with open('test_set.csv', mode ='r', encoding='utf-8') as csvfile:    
    csvreader = csv.reader(csvfile, delimiter=',',)
    next(csvreader)
    for row in csvreader:
        sentence = row[0]
        true_labels = [i for i in row[1:] if i]
        test_data.append((sentence,true_labels))

df = pd.read_csv('test_set.csv')
df

Unnamed: 0,sentence,labels,labels.1,labels.2,labels.3,labels.4,labels.5
0,"DCISC identified 11 Non-cited Violations, one ...",leadership safety values and actions,problem identification and resolution,personal accountability,environment for raising concerns,effective safety communication,questioning attitude
1,The number of violations has increased.,leadership safety values and actions,work processes,continuous learning,,,
2,The DCISC has identified a number of potential...,effective safety communication,,,,,
3,New regulatory requirements were not adequatel...,leadership safety values and actions,work processes,,,,
4,The DCISC learned in December 2013 that 16 imp...,leadership safety values and actions,personal accountability,effective safety communication,,,
5,The DCPP Fuel Handling System has been problem...,problem identification and resolution,,,,,
6,Additional efforts also need to be devoted to ...,personal accountability,work processes,continuous learning,,,
7,The loss of power to Unit 2 4kV Bus G during R...,leadership safety values and actions,work processes,,,,
8,Three Station Level Human Performance Event Cl...,leadership safety values and actions,work processes,continuous learning,,,
9,Equipment problems and failures increased the ...,personal accountability,environment for raising concerns,effective safety communication,questioning attitude,,


In [25]:
test_sentences = [x[0] for x in test_data]
clean_sentences = preprocess(test_sentences)
clean_sentences

['dcisc identify non cited violations severity level violation',
 'number violation increase',
 'dcisc identify number potential nuclear safety issue use closed cooling dcpp',
 'new regulatory requirement adequately translate specific calculation plant design basis fail demonstrate prefer offsite power source adequate capacity capability supply minimum require terminal voltage plant engineering safety feature follow limit transmission system contingency',
 'dcisc learn december impaired fire door repair replace funding deferral find unacceptable follow dcisc find door repair replace remain high priority plant door life cycle management plan',
 'dcpp fuel handling system problematic refueling outage problem age relate issue lack adequate inspection maintenance component replacement especially electrical contact',
 'additional effort need devote reduce operator burden workaround backlog deficient critical component require involvement station work group operations',
 'loss power unit bus

Using GuidedLDA model to predict labels of test sentences. Actual labels are known.

In [26]:
all_accuracies = []
all_precision = []
all_recall = []
all_f1 = []

index = 0
for sentence in test_sentences:
    test = clean_sentences[index]
    original_test = test_sentences[index]
    print(original_test,"\n")
    print(test,"\n")
    counter = {}
    traits = []
    sum_prob = defaultdict(list)
    for word in test.split():
        try:
            vocab_index = vocab.index(word)
            probabilities = model.word_topic_[vocab_index]
            max_prob = max(probabilities)
            max_prob_index = np.argmax(probabilities)
        except:
            continue
        if max_prob > 0.5:
            counter[safety_traits[max_prob_index]]=counter.get(safety_traits[max_prob_index],0)+1
        i = 0
        while i < len(probabilities):
            sum_prob[safety_traits[i]].append(probabilities[i])
            i+=1
    for trait in safety_traits:
        avg_prob = sum(sum_prob[trait])/len(sum_prob[trait])
        print(trait+":", round(avg_prob,4))
        if (avg_prob >= 0.1) and (trait in counter):
            if trait not in traits:
                traits.append(trait.lower())

    accuracy_score = accuracy(traits, test_data[index][1])
    precision_score = precision(traits, test_data[index][1])
    recall_score = recall(traits, test_data[index][1])
    f1_score = f1(traits, test_data[index][1])
    
    all_accuracies.append(accuracy_score)
    all_precision.append(precision_score)
    all_recall.append(recall_score)
    all_f1.append(f1_score)
    
    print('\nLikely Traits:', sorted(traits))
    print('\nActual Traits:', sorted(test_data[index][1]))
    print('\nCounter:', counter)
    print("\nAccuracy score:", accuracy_score)
    print("\nPrecision score:", precision_score)
    print("\nRecall score:", recall_score)
    print("\nF1 score:", f1_score)
    print(100*'-')

    index+=1

DCISC identified 11 Non-cited Violations, one Severity Level IV violation. 

dcisc identify non cited violations severity level violation 

Personal Accountability: 0.0247
Questioning Attitude: 0.0612
Effective Safety Communication: 0.1998
Leadership Safety Values and Actions: 0.1268
Decision Making: 0.0017
Respectful Work Environment: 0.0582
Continuous Learning: 0.1219
Problem Identification and Resolution: 0.2964
Environment for Raising Concerns: 0.0423
Work Processes: 0.067

Likely Traits: ['effective safety communication', 'problem identification and resolution']

Actual Traits: ['effective safety communication', 'environment for raising concerns', 'leadership safety values and actions', 'personal accountability', 'problem identification and resolution', 'questioning attitude']

Counter: {'Problem Identification and Resolution': 1, 'Effective Safety Communication': 1}

Accuracy score: 0.3333333333333333

Precision score: 1.0

Recall score: 0.3333333333333333

F1 score: 0.5
--------

In [27]:
avg_acc = sum(all_accuracies)/len(all_accuracies)
avg_prec = sum(all_precision)/len(all_precision)
avg_rec = sum(all_recall)/len(all_recall)
avg_f1 = sum(all_f1)/len(all_f1)

print("Average Accuracy on Test Data:", round(avg_acc*100,2), "%")
print("Average Precision on Test Data:", round(avg_prec*100,2), "%")
print("Average Recall on Test Data:", round(avg_rec*100,2), "%")
print("Average F1-Score on Test Data:", round(avg_f1*100,2), "%")

Average Accuracy on Test Data: 30.03 %
Average Precision on Test Data: 50.93 %
Average Recall on Test Data: 44.57 %
Average F1-Score on Test Data: 42.97 %


#### Evaluating Model Accuracy
Using our manually-labelled texts from 24th annual report <br>
Number of test sentences: 65

In [28]:
# opening the CSV file
test_data2 = []
with open('test_set_new.csv', mode ='r', encoding='utf-8') as csvfile:    
    csvreader = csv.reader(csvfile, delimiter=',',)
    next(csvreader)
    for row in csvreader:
        sentence = row[0]
        true_labels = [i for i in row[1:] if i]
        test_data2.append((sentence,true_labels))

df_2 = pd.read_csv('test_set_new.csv')
df_2

Unnamed: 0,sentence,labels,labels.1,labels.2,labels.3
0,Three Station Level Human Performance Event Cl...,personal accountability,continuous learning,work processes,
1,Equipment problems due to aging have led to an...,work processes,problem identification and resolution,,
2,"The DCPP knowledge transfer program, “Passport...",continuous learning,decision making,,
3,RC1: The process for evaluating both the risk ...,problem identification and resolution,decision making,,
4,RC2: Maintenance leadership has not been proac...,work processes,continuous learning,,
...,...,...,...,...,...
60,DCPP experienced significant Feedwater Heater ...,problem identification and resolution,environment for raising concerns,,
61,The health of transmission systems at DCPP was...,work processes,continuous learning,,
62,The health of DCPP's Emergency Diesel Generato...,work processes,,,
63,A July 2021 failure of the same Emergency Dies...,personal accountability,effective safety communication,,


In [29]:
test_sentences2 = [x[0] for x in test_data2]
clean_sentences2 = preprocess(test_sentences2)
clean_sentences2

['ation level human performance event clock resets occur fourth quarter cause station month indicator resets yellow deficient event involve operations personnel operation performance respect human error rate red unsatisfactory component mispositioning appear continue contributor',
 'equipment problem aging lead increasingly negative trend station deficient critical component backlog orders dcpp performance reduce eliminate safety system functional failures improve despite implementation corrective action plan',
 'dcpp knowledge transfer program passport knowledge appear design implementation seat high priority item outage planning outage dcisc encourage dcpp forward program lose valuable job knowledge employee retire',
 'process evaluate risk outage emergent work outage protect equipment potential impact operate unit formal include prerequisite adequate analysis review approval prior decision work protect equipment',
 'maintenance leadership proactive approach shortfall human performan

Using GuidedLDA model to predict labels of test sentences. Actual labels are known.

In [32]:
all_accuracies = []
all_precision = []
all_recall = []
all_f1 = []

index = 0
for sentence in test_sentences2:
    test = clean_sentences2[index]
    original_test = test_sentences2[index]
    print(original_test,"\n")
    print(test,"\n")
    counter = {}
    traits = []
    sum_prob = defaultdict(list)
    for word in test.split():
        try:
            vocab_index = vocab.index(word)
            probabilities = model.word_topic_[vocab_index]
            max_prob = max(probabilities)
            max_prob_index = np.argmax(probabilities)
        except:
            continue
        if max_prob > 0.5:
            counter[safety_traits[max_prob_index]]=counter.get(safety_traits[max_prob_index],0)+1
        i = 0
        while i < len(probabilities):
            sum_prob[safety_traits[i]].append(probabilities[i])
            i+=1
    for trait in safety_traits:
        avg_prob = sum(sum_prob[trait])/len(sum_prob[trait])
        print(trait+":", round(avg_prob,4))
        if (avg_prob >= 0.1) and (trait in counter):
            if trait not in traits:
                traits.append((trait.lower(),avg_prob))
                
    top_traits = [x[0] for x in sorted(traits,key=lambda x: -x[1])[:4]]

    accuracy_score = accuracy(top_traits, test_data2[index][1])
    precision_score = precision(top_traits, test_data2[index][1])
    recall_score = recall(top_traits, test_data2[index][1])
    f1_score = f1(top_traits, test_data2[index][1])
    
    all_accuracies.append(accuracy_score)
    all_precision.append(precision_score)
    all_recall.append(recall_score)
    all_f1.append(f1_score)
    
    print('\nLikely Traits:', sorted(top_traits))
    print('\nActual Traits:', sorted(test_data2[index][1]))
    print('\nCounter:', counter)
    print("\nAccuracy score:", accuracy_score)
    print("\nPrecision score:", precision_score)
    print("\nRecall score:", recall_score)
    print("\nF1 score:", f1_score)
    print(100*'-')

    index+=1

Three Station Level Human Performance Event Clock Resets occurred during the fourth quarter of 2013, causing the station’s 18-month indicator for such Resets to become Yellow (deficient). Two of these three events involved Operations personnel. Operations performance with respect to human error rate has been Red (Unsatisfactory) since July 2013. Component mispositioning appears to continue to be a contributor. 

ation level human performance event clock resets occur fourth quarter cause station month indicator resets yellow deficient event involve operations personnel operation performance respect human error rate red unsatisfactory component mispositioning appear continue contributor 

Personal Accountability: 0.044
Questioning Attitude: 0.0106
Effective Safety Communication: 0.0259
Leadership Safety Values and Actions: 0.0991
Decision Making: 0.0413
Respectful Work Environment: 0.0627
Continuous Learning: 0.534
Problem Identification and Resolution: 0.0525
Environment for Raising Con

Personal Accountability: 0.0426
Questioning Attitude: 0.0682
Effective Safety Communication: 0.0069
Leadership Safety Values and Actions: 0.0165
Decision Making: 0.1154
Respectful Work Environment: 0.1414
Continuous Learning: 0.0485
Problem Identification and Resolution: 0.332
Environment for Raising Concerns: 0.0569
Work Processes: 0.1715

Likely Traits: ['decision making', 'problem identification and resolution', 'respectful work environment', 'work processes']

Actual Traits: ['problem identification and resolution', 'work processes']

Counter: {'Respectful Work Environment': 5, 'Problem Identification and Resolution': 12, 'Work Processes': 3, 'Personal Accountability': 1, 'Decision Making': 4, 'Questioning Attitude': 2, 'Environment for Raising Concerns': 1}

Accuracy score: 0.5

Precision score: 0.5

Recall score: 1.0

F1 score: 0.6666666666666666
----------------------------------------------------------------------------------------------------
Asymmetric deposition of “extra he

Personal Accountability: 0.0011
Questioning Attitude: 0.0011
Effective Safety Communication: 0.1501
Leadership Safety Values and Actions: 0.1455
Decision Making: 0.1231
Respectful Work Environment: 0.0011
Continuous Learning: 0.1614
Problem Identification and Resolution: 0.0011
Environment for Raising Concerns: 0.0842
Work Processes: 0.3311

Likely Traits: ['effective safety communication', 'leadership safety values and actions', 'work processes']

Actual Traits: ['leadership safety values and actions', 'work processes']

Counter: {'Effective Safety Communication': 1, 'Leadership Safety Values and Actions': 1, 'Work Processes': 1}

Accuracy score: 0.6666666666666666

Precision score: 0.6666666666666666

Recall score: 1.0

F1 score: 0.8
----------------------------------------------------------------------------------------------------
Life Cycle Plan not current. 

life cycle plan current 

Personal Accountability: 0.128
Questioning Attitude: 0.001
Effective Safety Communication: 0.001


Accuracy score: 0.4

Precision score: 0.5

Recall score: 0.6666666666666666

F1 score: 0.5714285714285715
----------------------------------------------------------------------------------------------------


In [33]:
avg_acc = sum(all_accuracies)/len(all_accuracies)
avg_prec = sum(all_precision)/len(all_precision)
avg_rec = sum(all_recall)/len(all_recall)
avg_f1 = sum(all_f1)/len(all_f1)

print("Average Accuracy on Test Data:", round(avg_acc*100,2), "%")
print("Average Precision on Test Data:", round(avg_prec*100,2), "%")
print("Average Recall on Test Data:", round(avg_rec*100,2), "%")
print("Average F1-Score on Test Data:", round(avg_f1*100,2), "%")

Average Accuracy on Test Data: 32.97 %
Average Precision on Test Data: 46.41 %
Average Recall on Test Data: 51.54 %
Average F1-Score on Test Data: 46.27 %
