# <center>Technical Information Extraction</center>
## <center>Week 4</center>
## <center>Interactive Notebook</center>


<center>📚 Source: W4 Technical Information Extraction</center>

⚠️ To make the most out of this notebook, some level of Python programming is desirable. If you've never used Python (or any other programming language), check out the resources [here](https://wiki.python.org/moin/BeginnersGuide/Programmers) to get up to speed. This is not mandatory, but will allow you to modify the code examples and complete the activities.

⚠️ The intention of this notebook is not to teach you how to build complex machine learning algorithms, but instead demonstrate how to go through the process of identifying a problem and using off-the-shelf models on your own data. The overarching goal here is to gain an intuition and practical understanding of supervised machine learning applied to technical texts and how it can support maintenance and reliability engineering workflows.

⚠️ The application herein is on maintenance work order records, but the techniques are domain-agnostic and can be applied to text from any domain or source.

## Notebook Overview
In the previous two notebooks (W3-1 and W3-2) we focused on eliciting end-of-life terms and phrases to help semi-automate the process of identifying failure and suspension data from maintenance work order records, so that statistical life data analysis can be performed at a greater scale. The process we explored leveraged natural language processing (NLP) concepts such as word embeddings to accelerate the identification of end-of-life events within the short text description of work order records.

In this notebook, we are going to continue with NLP, but instead of developing dictionaries of terms and phrases representing a specific concept, we'll use a machine learning approach. Specifically, the application of this notebook is to automatically extract information from work order descriptions, where our goal is to aggregate the information from thousands of work orders into knowledge, as exemplified by the figure below.

The motivation for this application is that there exists a wealth of valuable knowledge inside unstructured texts, but gaining insight into them is challenging and arduous due to the vast number created. If we were able to automatically extract information from texts such as work order descriptions, we could focus on answering, or gaining insight into, the following questions:
- Which assets have the most activities performed on them, and what activities are they?
- What undesirable states are assets participating in?
- What behaviour are assets demonstrating?

<center>
<img src="./images/W4_dik_real_data.png"/>
</center>

**Legend**
- ⚡ indicates a new concept
- 📌 indicates an activity or interactive part of the notebook

## Table of Contents
* [Week 3 - Recap](#week-3-recap)
* [4.1 Demonstration](#4.1-demonstration)
    * [Activity 4.1](#4.1-activity)
* [4.2 Fundamentals of supervised information extraction from unstructured texts](#4.2-fundamentals-supervised-ie)
    * [4.2.1 Supervised Machine Learning (learning from example)](#4.2.1-supervised-ml)
        * [Activity 4.2.1](#4.2.1-activity)
    * [4.2.2 Information Extraction (structuring the unstructured)](#4.2.2-fundamentals-ie)
        * [4.2.2.1 Named Entity Recognition](#4.2.2.1-fundamentals-ner)
        * [4.2.2.2 Relation Classification](#4.2.2.2-fundamentals-relation-classification)
        * [4.2.2.3 Knowledge Graphs (turning information into knowledge)](#4.2.3-activity)
        * [Activity 4.2.2](#4.2.2-activity)
* [4.3 Technical Information Extraction](#4.3-technical-information-extraction)
    * [Notebook Setup](#4.3-notebook-setup)
    * [4.3.A Process Overview](#4.3.A-process-overview)
    * [4.3.B Development of Conceptual Model](#4.3.B-development-of-conceptual-model)
        * [Activity 4.3.B](#4.3.B-activity)
    * [4.3.C Data Preparation](#4.3.C-data-prepartion)
    * [4.3.D Data Curation](#4.3.D-data-curation)
        * [Activity 4.3.D](#4.3.D-activity)
    * [4.3.E Model Development](#4.3.E-model-development)
    * [4.3.F Model Application](#4.3.F-model-application)
        * [4.3.F.1 Using a general model on technical data](#4.3.F.1-general-model)
        * [Activity 4.3.F](#4.3.F-activity)
    * [4.3.G Analysis](#4.3.G-analysis)
        * [4.3.G.1 Extracted States](#4.3.G.1-extracted-states)
        * [4.3.G.2 Extracted Activities](#4.3.G.2-extracted-activites)
        * [4.3.G.3 Extracted Physical Objects](#4.3.G.3-extracted-physical-objects)
        * [Activity 4.3.G](#4.3.G-activity)
* [4.4 Network Graph Analysis](#4.4-network-graph-analysis)
    * [4.4.1 Triple Generation and Network Creation](#4.4.1-triple-generation-network-creation)
    * [4.4.2 Network Visualisation and Analysis](#4.4.2-network-visualisation-analysis)
        * [4.4.2.1 Functional Location Graph](#4.4.2.1-functional-location-graph)
        * [4.4.2.2 Entire Graph](#4.4.2.2-entire-graph)
        * [4.4.2.3 Querying the Network Graph](#4.4.2.3-query-network-graph)
    * [4.4 Activity](#4.4-activity)
* [Summary](#summary)
* [Appendix](#appendix)

## Notebook Objectives
- Apply a supervised machine learning model to natural language texts in maintenance work order records
- Use a pretrained information extraction model to structure unstructured text and perform analysis
- Build a simple knowledge graph from maintenance work order records using natural language processing

## Learning Outcomes
- Understand the process of developing a conceptualisation of meaning to apply to natural language texts
- Understand the process of human-annotation to curate datasets for supervised machine learning
- Gain familiarity with training and using supervised machine learning algorithms for natural language processing tasks
- Understand the importance of domain experts (like yourselves) in the process of data-driven technology like supervised machine learning
- Understand how to use the outputs of deep learning models for gaining insight into maintenance data to support reliability engineering and maintenance decision-making

## Week 3 Recap <a class="anchor" id="week-3-recap"></a>

Post your answers to [Menti](https://www.menti.com/efoiig7u9u)
- Do you have any questions or comments from last week?

## 4.1 - Demonstration <a class="anchor" id="4.1-demonstration"></a>
Before we get into the details, let's first get a partial idea of what we're trying to accomplish.

### 📌 Activity 4.1 <a class="anchor" id="4.1-activity"></a>

Provide your answers on [Menti](https://www.menti.com/wsojfffxjp)

Given the following maintenance short texts:
1. CVR3 roller frame damaged
2. CVR 1 replace collapsed idler
3. CVR2 impact plate replace / repair
4. CVR1 impact plate failed
5. CVR4 DRV 1 - change out flexible coupling
6. CVR5 replace hard skirts and soft skirts
7. CVR6 - return roller collapsed
8. CVR7 replace rock jammed idler FR#61
9. CVR6 replace spraybar feed pipe elbow
10. CVR4 tighten impact frame bolts
11. CVR7 soft skirt popped out
12. CVR8 adjust left hand side guide rollers on hammocks

Identify and count the number of: **activities**, **undesirable states** and **physical objects**.

An example of this activity on the text "replace seized pump impeller" is:
- **activities**: replace (1)
- **undesirable states**: seized (1)
- **physical objects**: pump, impeller (2)

### Performing the same activity with a machine learning based NLP model
Here we are going to load a machine learning model trained to identify concepts such as `Activity`, `PhysicalObject` and `State` from maintenance work orders, essentially the same task that we performed manually above. We'll ignore the specifics of the code for now and focus on the process that is being performed. But first, let's install a third-party package for deep learning based NLP.

In [None]:
%%capture
!pip install flair

Now that we have installed the third party package (details to follow), we can import functions from it.

In [None]:
# Import required functions to download and use the machine learning model
from urllib.request import urlretrieve
import os

from flair.models import SequenceTagger
from flair.data import Sentence

In [None]:
dir_path = os.path.abspath("")

Here the documents in Activity 4.1 are put into a small corpus.

In [None]:
# Create a corpus containing the documents from Activity 4.1
demo_texts = [
"CVR3 roller frame damaged",
"CVR 1 replace collapsed idler",
"CVR2 impact plate replace / repair",
"CVR1 impact plate failed",
"CVR4 DRV 1 - change out flexible coupling",
"CVR5 replace hard skirts and soft skirts",
"CVR6 - return roller collapsed",
"CVR7 replace rock jammed idler FR#61",
"CVR6 replace spraybar feed pipe elbow",
"CVR4 tighten impact frame bolts",
"CVR7 soft skirt popped out",
"CVR8 adjust left hand side guide rollers on hammocks"
]

Using our third party package, we'll load a pretrained machine learning model. For now, we'll ignore the specifics, but the high-level intuition here is that we are loading a machine learning model that assigns a category (or label) to the tokens in each document within our corpus. This is similar to the activity we performed manually, e.g. "replace pump" $\rightarrow$ "replace" is an activity and "pump" is a physical object. This is an important and popular task in NLP called [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition).

Let's download and load the pretrained model. Note the download may take a minute or two.

In [None]:
# URL to trained model
url_to_ner_model = "https://coreskills.blob.core.windows.net/ds-reliability/ner/ds-best-model.pt"
path_to_save_ner_model = os.path.join(dir_path, "../data/ner/ds-best-model.pt")

# Make sure the target folder exists, otherwise the download has nowhere to be saved
os.makedirs(os.path.dirname(path_to_save_ner_model), exist_ok=True)

if os.path.exists(path_to_save_ner_model):
    print('Model already downloaded')
else:
    print('Downloading model...')
    urlretrieve(url_to_ner_model, path_to_save_ner_model)
    print('Download finished')

In [None]:
# Load the model that was downloaded (same location the download cell saved it to)
demo_model = SequenceTagger.load(path_to_save_ner_model)

Now that we have loaded our machine learning model (how easy was that!), we'll apply it to the documents in our corpus. Again, we'll ignore the details of this, but if you are interested, feel free to examine it closer. What we are doing here is encoding each document in our corpus into a special 'object' and then using the model to make predictions on the tokens. After this, we are extracting the predictions (labels on our tokens) and counting them. We use the cell magic `%%time` to record the time taken to perform this process.

In [None]:
%%time
# Perform inference on the demonstration texts and output time taken as well as counts
demo_sentences = [Sentence(text.lower()) for text in demo_texts]
demo_model.predict(demo_sentences)

demo_counts = {}
for sentence in demo_sentences:
    for entity in sentence.get_spans('ner'):
        entity_value = entity.get_label("ner").value
        # Increment the count for this label, starting from 0 if it has not been seen yet
        demo_counts[entity_value] = demo_counts.get(entity_value, 0) + 1
    print(f'{sentence}\n')

print(f'\n{demo_counts}\n')

As we can see, in only a few lines of code we were able to make a computer do the same type of task we did manually, except it was extremely quick whilst also being reproducible (e.g. the results would be the same if we ran this 100 times). For routine tasks such as determining what activities are being performed on assets and the behaviour exhibited by assets, using machine learning based NLP can be very useful for extracting information from thousands of work order records.
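
The counting step in the cell above can also be written with `collections.Counter` from the standard library. The sketch below is a minimal illustration only: the `example_predictions` list is hand-written and hypothetical, standing in for the `(text, label)` pairs that a model would produce, so it runs without loading any model.

```python
from collections import Counter

# Hypothetical (text, label) predictions written by hand for illustration;
# they are NOT output from the model used in this notebook.
example_predictions = [
    ("replace", "Activity"),
    ("idler", "PhysicalObject"),
    ("collapsed", "State"),
    ("roller frame", "PhysicalObject"),
    ("damaged", "State"),
]

# Counter tallies each label without checking whether the key already exists
label_counts = Counter(label for _, label in example_predictions)
print(dict(label_counts))
```

A `Counter` behaves like a dictionary, so it can be used wherever the manually built `demo_counts` dictionary is used.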

In [None]:
# Feel free to try out any text that you like
single_demo_text = "replace idler bearing - too hot"

# Encode (lowercased, as was done for the demonstration texts above) -> Predict -> Display
single_demo_sentence = Sentence(single_demo_text.lower())
demo_model.predict(single_demo_sentence)
print(single_demo_sentence)

⚠️ Note there are a few erroneous predictions in the demonstration example that are attributed to a number of factors that will be discussed in due course.

In [None]:
# Let's remove the model from memory (modern ML models take up a lot of space!)
try:
    del demo_model
except NameError:
    print('Model already deleted')

With this demonstration as motivation, in the following parts of this notebook we are going to:
- Review the fundamentals of extracting information from texts using machine learning based NLP, and
- Work through the typical process of using machine learning based NLP with a particular emphasis on supervised information extraction (as seen above)

## 4.2 - Fundamentals of supervised information extraction from unstructured texts <a class="anchor" id="4.2-fundamentals-supervised-ie"></a>

Before we can dive into the application of this notebook, as in *W3-1*, we first need to get on the same page about the fundamentals of machine learning based NLP and the specific NLP task of information extraction.

### ⚡ 4.2.1 - Supervised Machine Learning (learning from example) <a class="anchor" id="4.2.1-supervised-ml"></a>
Supervised learning is one of the most popular forms of machine learning (other forms include unsupervised and reinforcement learning). A semi-formal definition of this category of technique/algorithm is provided below, but in simple terms it is the process of teaching an algorithm to learn from example. Typically, examples are acquired through human elicitation of knowledge usually called **annotation**.

> "Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately." [IBM Cloud Education, 2020](https://www.ibm.com/cloud/learn/supervised-learning)

Many of the services that we use day-to-day, in some form or another, would have used or benefited from supervised learning. For example, services such as Alexa/Google Home and Google Translate leverage supervised machine learning to some extent (services such as Alexa convert your speech to natural language queries).

In general, we can characterise supervised machine learning as follows - an algorithm that learns a complex function that maps $x \rightarrow y$ when provided a set of $(x,y)$ examples. A set of examples could be $(mwo\_short\_text, failure\_or\_not)$ e.g. `(pump impeller seized, failure)`, `(conveyor belt holed, failure)`, and `(inspect pump bearing, not failure)`, where these examples are provided to a system to learn to classify texts as failures (or not).

<center>
    <img src="./images/W4_supervised_learning.jpg" alt="supervised learning diagram" width="75%"/>
</center>

The pairs of examples we supply to the algorithm are typically acquired through human annotation using a conceptual model for the application we want the algorithm to perform. In the example above, our examples were pairs of text and a label of 'failure' or 'not failure'. The label we provided is predefined and constitutes a model of the task we want to perform, in this case, binary classification of text. The algorithm we use to learn this mapping can vary in complexity and difficulty, but the process of $x \rightarrow y$ is shared regardless of these factors.
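
To make the idea of learning $x \rightarrow y$ from examples concrete, here is a minimal sketch of a word-count (naive Bayes style) text classifier trained on the three example pairs above. It is an illustration under simplifying assumptions, not the approach used later in this notebook: the function names `train` and `predict` are hypothetical, and three examples are far too few for a useful model.

```python
import math
from collections import Counter, defaultdict

# (x, y) examples taken from the text above
examples = [
    ("pump impeller seized", "failure"),
    ("conveyor belt holed", "failure"),
    ("inspect pump bearing", "not failure"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(text, word_counts, label_counts, vocabulary):
    """Return the label with the highest log-probability score."""
    total_examples = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from how common the label is in the training examples
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities for unseen pairs
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(examples)
print(predict("belt seized", *model))
print(predict("inspect bearing", *model))
```

The labels the classifier can return are exactly the labels present in the examples, which is why the choice of labels (the conceptual model) matters so much.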

There are multitudes of supervised learning tasks typically characterised as either classification (e.g. is A the class of X or Y?) or regression tasks (e.g. what is the value of X given A?), but focusing on natural language texts and maintenance/reliability, we could perform the following:
- Classify work order records into failure mode codes
- Estimate the hours of work or cost required for a work order given short/long text
- Classify work order records by information quality
- Translate noisy work order texts into clean work order texts
- Extract lubrication related information to support cost aggregation
- Classify work orders by risk level
- and so on

In particular to natural language and classification, we typically classify texts at two levels:
1. document-level (entire document is associated with a label)
    - `pump impeller blown` $\rightarrow$ `failure`
2. token-level (each token is associated with a label)
    - `pump impeller blown` $\rightarrow$ `[(pump, PhysicalObject), (impeller, PhysicalObject), (blown, FailureState)]`

⚠️ This is an important distinction as token-level classification is the focus of this notebook. The demonstration was a token-level classification task called named entity recognition (NER).
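
The two levels can be pictured as two different data shapes. The sketch below uses the example from the list above; the helper `document_label_from_tokens` and its rule are purely illustrative assumptions, included only to show how the two levels relate.

```python
# Document-level: one label for the whole text
document_example = ("pump impeller blown", "failure")

# Token-level: one label per token
token_example = [
    ("pump", "PhysicalObject"),
    ("impeller", "PhysicalObject"),
    ("blown", "FailureState"),
]


def document_label_from_tokens(token_labels):
    """Illustrative rule only: a document is a failure if any token is a FailureState."""
    has_failure = any(label == "FailureState" for _, label in token_labels)
    return "failure" if has_failure else "not failure"


# The tokens of the token-level example spell out the document-level text
assert " ".join(token for token, _ in token_example) == document_example[0]
print(document_label_from_tokens(token_example))
```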

Now that we have a high-level intuition towards what supervised machine learning is and a general understanding of it with respect to NLP, let's briefly look at the general steps required to make use of this type of technique. The process of supervised learning can be generalised as:
1. Development of task description and objectives
2. Curation of labelled data
3. Development of supervised machine learning algorithm
4. Training of supervised machine learning algorithm
5. Performance evaluation
6. Inference
7. Analysis

An important take-away when applying supervised machine learning to texts in technical domains such as maintenance and reliability engineering is that the involvement of subject matter experts (SME) is crucial for steps 1, 2, 5, 6 and 7.
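
As an illustration of step 5 (performance evaluation), the sketch below compares entities annotated by an SME (the "gold" set) with entities predicted by a model, using exact-match precision, recall and F1. Exact matching is one common convention rather than the only option, and both sets here are hypothetical, hand-written examples.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores between gold and predicted (text, label) entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for "replace engine oil"
gold_entities = {("replace", "Activity"), ("engine oil", "PhysicalObject")}
# Hypothetical prediction that only captured part of the physical object
predicted_entities = {("replace", "Activity"), ("engine", "PhysicalObject")}

precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Scores like these are only as trustworthy as the gold annotations, which is one reason SME involvement matters at this step.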

Why do we care about these details? For those of you that will leverage machine learning based NLP, it's useful to have an understanding of the steps required. For those of you that do not plan to perform these tasks, it is valuable to understand the process others may take for you or that have been taken in the services that you may consume in the future (if you aren't already).

<center>
<img src="./images/W4_ds_timespent.jpg"/>
</center>

### 📌 Activity 4.2.1 <a class="anchor" id="4.2.1-activity"></a>

Post your answer to [Menti](https://www.menti.com/kiosdn26mz)
- Why do you think that SMEs are crucial for these five steps in the process outlined above?

### 4.2.2 - ⚡ Information Extraction (structuring the unstructured) <a class="anchor" id="4.2.2-fundamentals-ie"></a>

Now that we have an intuition towards the two main types of classification tasks in NLP (document and token classification), we'll put our attention on a specific application of NLP called **information extraction**. The reason we care about information extraction is that, by commonly cited estimates, around 80% of information is unstructured. This is especially true for detailed observations made within maintenance such as short and long text in *maintenance records* and *notifications*, observation sections of *condition monitoring reports* (vibration analysis, lubrication laboratory results), comments/notes in *downtime records*, *work procedures*, and so forth. In maintenance and reliability, our ability to make decisions and understand whether our maintenance strategy is correct is predicated on understanding how our systems are performing. Information extraction allows us to bring structure to the texts people create that describe the way our systems are behaving.

Information extraction broadly comprises two types of NLP tasks:
- Named entity recognition, and
- Relation extraction or classification.

#### 4.2.2.1 - ⚡ Named Entity Recognition <a class="anchor" id="4.2.2.1-fundamentals-ner"></a>

A popular way of extracting information from natural language texts is called named entity recognition ([NER](https://en.wikipedia.org/wiki/Named-entity_recognition)). This technique aims to identify and classify spans of tokens within texts. This is what we performed in the first activity and demonstration.

Below is an example of named entity recognition (NER) applied to three work order descriptions for a conveyor. 

<center>
    <img src="./images/W4_fundamental_ner_markup.png" width="100%"/>
</center>

The core idea behind NER is to provide structure to unstructured text. The structure that is provided can be arbitrary, but usually is reflective of the domain the text is derived from and what the specific use-case is. In the example above, the structure is elicited by the concepts of `PhysicalObject`, `UndesirableState`, `Replace`, and `FailedState`. However, we could as easily extract `Asset`, `Technician`, `Alarm` from *condition monitoring reports*, `Consumable`, `Degradation` and `Alert` from *FLAC reports*, or `Injury`, `Body Part`, and `Location` from *HSE records*.

The high-level steps of NER are to:
1. Identify `spans` of `tokens` in a given text that might be of interest e.g. `roller frame` in "CVR3 roller frame damaged", and
2. Assign a predefined `category` of interest that represents the information we desire e.g. `(roller frame)[PhysicalObject]`.

We will use the notation of `(text)[label]` to denote **entities** that are extracted from texts using NER. Note "entity" refers to a span of text that is assigned a label e.g. `(roller frame)[PhysicalObject]` is an entity.

An example of NER on the text "replace engine oil and change out oil filter":
```
    replace engine oil and change out oil filter  ->
    (replace)[Activity] (engine oil)[PhysicalObject] and (change out)[Activity] (oil filter)[PhysicalObject]
```
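
The `(text)[label]` notation can be produced from a list of tokens plus a list of labelled spans. The sketch below is a minimal illustration; the function name `render_entities` and the `(start, end, label)` span format (with `end` exclusive) are assumptions made for this example, not the format used by any particular library.

```python
def render_entities(tokens, spans):
    """Render tokens with (start, end, label) spans as `(text)[label]` markup.

    Spans use token indexes, `end` is exclusive, and spans must not overlap.
    """
    span_starts = {start: (end, label) for start, end, label in spans}
    pieces = []
    i = 0
    while i < len(tokens):
        if i in span_starts:
            end, label = span_starts[i]
            pieces.append(f"({' '.join(tokens[i:end])})[{label}]")
            i = end  # jump past the whole span
        else:
            pieces.append(tokens[i])  # token is not part of any entity
            i += 1
    return " ".join(pieces)


tokens = "replace engine oil and change out oil filter".split()
spans = [
    (0, 1, "Activity"),        # replace
    (1, 3, "PhysicalObject"),  # engine oil
    (4, 6, "Activity"),        # change out
    (6, 8, "PhysicalObject"),  # oil filter
]
print(render_entities(tokens, spans))
```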

#### 4.2.2.2 - ⚡ Relation Classification <a class="anchor" id="4.2.2.2-fundamentals-relation-classification"></a>

In addition to extracting entities from unstructured text, we can also relate the entities to one another to better understand what is happening within the text. Consider the previous example "replace engine oil and change out oil filter". Using relations, we could link the `Activities` to the objects/items that they are being performed on, essentially telling us *"what is being done to what"*. Similar to NER, we need to specify the concepts that represent relationships, such as `has_part` for relating `items` or `PhysicalObjects`. With a relation such as `hasParticipant`, we could reasonably extract the following structure from the text "replace engine oil and change out oil filter":

- `(replace)[Activity]-[hasParticipant]->(engine oil)[PhysicalObject]`
- `(change out)[Activity]-[hasParticipant]->(oil filter)[PhysicalObject]`

where `(text)[label]-[relationship]->(text)[label]` denotes a relationship between two entities (called a *triplet*).
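
A triplet can be represented with two small data classes. The sketch below is one possible representation, with the hypothetical class names `Entity` and `Triple`; it simply reproduces the notation used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str

    def __str__(self):
        return f"({self.text})[{self.label}]"


@dataclass(frozen=True)
class Triple:
    head: Entity
    relation: str
    tail: Entity

    def __str__(self):
        return f"{self.head}-[{self.relation}]->{self.tail}"


triples = [
    Triple(Entity("replace", "Activity"), "hasParticipant", Entity("engine oil", "PhysicalObject")),
    Triple(Entity("change out", "Activity"), "hasParticipant", Entity("oil filter", "PhysicalObject")),
]

for triple in triples:
    print(triple)
```

Because the classes are frozen, entities and triples are hashable and can be stored in sets or used as dictionary keys, which is convenient when the same fact appears in many work orders.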

If we were to perform this process over thousands of maintenance texts, we'd be able to understand a lot about the nature of activities and behaviour of our assets. 

#### 4.2.2.3 - ⚡ Knowledge Graphs (turning information into knowledge) <a class="anchor" id="4.2.3-activity"></a>

As seen above, by extracting entities using NER and creating relations between them using relation classification/extraction, we end up with things called **triplets** (or triples). These pieces of information can represent **facts** about the domain our texts are created in. These triplets of facts, naturally, can be linked together through shared concepts to create graphs of knowledge, also known as **knowledge graphs**. Why do we care about graphs? Graphs provide a convenient way to interpret large amounts of disconnected data in a digestible way whilst being able to be queried. Moreover, *Gartner* ranked graph technology in its top 10 trending technologies over 2020-2022.

<center>
<img src="./images/W4_data_info_knowledge_insight.jpg"/>
</center>

An example of the process of converting unstructured text into knowledge can be seen in the diagram below, created from only a handful of documents.

<center>
<img src="./images/W4_dik_real_data.png"/>
</center>


Now, imagine you were to perform this process on 100,000 maintenance work orders, with a more detailed schema of categories for entities and relations. This process would provide rapid, updatable insight into the performance of assets, activities being performed, failure modes falling outside of maintenance strategies, and so forth. Moreover, the structured information could be combined with other structured fields such as resource and financial information to gain further insight.
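
The idea of linking triplets through shared entities, and then querying the result, can be sketched with plain dictionaries. The triples below are hypothetical, and the `hasState` relation is an assumption introduced only for this example; plain Python is used here to keep the sketch free of dependencies.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples for illustration only
triples = [
    ("replace", "hasParticipant", "engine oil"),
    ("change out", "hasParticipant", "oil filter"),
    ("replace", "hasParticipant", "oil filter"),
    ("oil filter", "hasState", "blocked"),
]

# Adjacency list: each head maps to its outgoing (relation, tail) edges
graph = defaultdict(list)
for head, relation, tail in triples:
    graph[head].append((relation, tail))


def heads_pointing_to(graph, relation, tail):
    """Find every node connected to `tail` through `relation`."""
    return sorted(head for head, edges in graph.items() if (relation, tail) in edges)


# Which activities are performed on the oil filter?
print(heads_pointing_to(graph, "hasParticipant", "oil filter"))
```

Because "oil filter" appears in several triples, facts from separate work orders become connected through that shared node.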

### 📌 Activity 4.2.2 <a class="anchor" id="4.2.2-activity"></a>

- Do you have any questions or comments on what we have gone through?

## 4.3 Technical Information Extraction <a class="anchor" id="4.3-technical-information-extraction"></a>

<!-- Now that we have a concrete intuition towards supervised machine learning and the NLP task of information extraction, we will step through the process of going from unstructured technical text data to knowledge which we will visualise in a network graph. The process we will follow is outined in the figure below, but before we dive in, lets first set up our notebook. -->
Before we dive in, let's first set up our notebook.

### Notebook Setup <a class="anchor" id="4.3-notebook-setup"></a>

Install the required third-party packages.

In [None]:
%%capture
!pip install pandas numpy flair torch plotly nb_black tqdm networkx bokeh panel

In [None]:
%%capture
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu116

In [None]:
# Package for ensuring code we write is formatted nicely
%load_ext nb_black

#### Import Packages

Import standard packages

In [None]:
import os
from pprint import pprint
import itertools
from collections import Counter
from typing import List
import random
import json
import math
from urllib.request import urlretrieve

Import third party packages

- [pandas](https://github.com/pandas-dev/pandas) - Package for data handling and wrangling.
- [numpy](https://numpy.org/) - Package for working with numerical arrays.
- [tqdm](https://github.com/tqdm/tqdm) - Package for monitoring progress of operations.
- [flair](https://github.com/flairNLP/flair) - Package for simple state-of-the-art NLP.
- [torch](https://pytorch.org/) - Package for machine learning in Python.
- [networkx](https://networkx.org/) - Package for working with complex network graphs.
- [plotly](https://plotly.com/) - Package for interactive visualisation.
- [bokeh](https://docs.bokeh.org/en/latest/) - Package for interactive visualisation.
- [panel](https://panel.holoviz.org/reference/panes/Bokeh.html) - Package for turning visualisations into interactive dashboards.

In [None]:
import pandas as pd
from tqdm import tqdm
import numpy as np

# Packages for machine learning
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import torch

# Packages for visualisation and network graph
import networkx as nx
import plotly.express as px
from plotly.subplots import make_subplots
import panel as pn
from bokeh.io import show, output_notebook
from bokeh.models import (
    BoxZoomTool,
    Circle,
    HoverTool,
    MultiLine,
    Plot,
    Range1d,
    ResetTool,
    ColumnDataSource,
    LabelSet,
    Legend,
    LegendItem,
    NodesAndLinkedEdges,
    EdgesAndLinkedNodes,
    PanTool,
)
from bokeh.palettes import Spectral4, Category20c
from bokeh.plotting import from_networkx

Configuration

In [None]:
pd.set_option("display.max_rows", None)
pn.extension()
output_notebook()

### 4.3.A Process Overview <a class="anchor" id="4.3.A-process-overview"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_overview.png" width="75%"/>
</center>

Now that we have an appreciation for the task of information extraction, what we are going to do is use this technique to try and gain insight into the behaviour of assets and the effectiveness of maintenance strategy using the unstructured text in maintenance work order records. To achieve this, we are going to work through the following steps:
1. Develop a conceptual model of information contained within work order texts that will support insight into asset behaviour and maintenance strategy
2. Prepare a set of maintenance work order records using what we learnt in Week 3 (*notebook W3-1*)
3. Acquire human-annotated training data to use in supervised machine learning
4. Develop a supervised machine learning model for information extraction on work order descriptions
5. Apply the developed machine learning model to the set of maintenance work order records
6. Perform analysis on the data acquired automatically via the machine learning model

### 4.3.B Development of Conceptual Model <a class="anchor" id="4.3.B-development-of-conceptual-model"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_step_B.png" width="75%"/>
</center>

As stated, our goal in this notebook is to automatically extract information from natural language descriptions in maintenance work order records to give us deeper insight into asset behaviour and the effectiveness of maintenance strategy. Before we can pursue this, we first need to concretely identify and define a conceptual model of the information we require to facilitate this.

Every supervised machine learning task requires some level of conceptual model, where a conceptual model is essentially a set of *concepts* that are used to help people communicate the subject that a model represents. For example, imagine you want to build a document-level classification model that detects whether a work order contains a failure. A conceptual model for this could consist of `failure` and `not failure`.

#### 📌 Activity 4.3.B <a class="anchor" id="4.3.B-activity"></a>

Post your answer to [Menti](https://www.menti.com/euqyg45d2t)
- What type of concepts could we use to extract information from maintenance work order descriptions? Can you think of any more in addition to `Activity`, `Undesirable State` and `Physical Object`?

There are numerous ways of defining the concepts you use in information extraction; for this notebook we'll stick with `Activity`, `Undesirable State` and `Physical Object`. For each of these, we have additional, more specific concepts below them such as `Activity/MaintenanceActivity` and `Activity/MaintenanceActivity/Replace`. Having finer detail on the concepts we use allows us to assign meaning to our unstructured texts more precisely, but comes at the cost of greater difficulty in acquiring examples and training machine learning models.
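
Because the finer concepts are written as `/`-separated paths, a fine-grained label can always be rolled up to a coarser one. The sketch below shows this with a hypothetical helper named `coarsen`; the `Inspect` label is an assumed example that follows the same naming pattern.

```python
from collections import Counter


def coarsen(label, depth):
    """Keep only the first `depth` levels of a '/'-separated concept path."""
    return "/".join(label.split("/")[:depth])


# Hypothetical fine-grained labels for illustration
fine_labels = [
    "Activity/MaintenanceActivity/Replace",
    "Activity/MaintenanceActivity/Inspect",
    "PhysicalObject",
]

print(coarsen("Activity/MaintenanceActivity/Replace", 2))
print(Counter(coarsen(label, 1) for label in fine_labels))
```

This lets annotation happen at a fine level of detail while analysis is performed at whichever level has enough examples.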

As stated at the start of this notebook, what we're learning is not specific to maintenance texts. Using NER, we could as easily extract `Asset`, `Technician`, `Alarm` from condition monitoring reports, `Consumable`, `Degradation` and `Alert` from FLAC reports, `Injury`, `Body Part`, and `Location` from HSE records, or `Cause` and `Effect` from work order notification long text.

To make the concepts we'll use concrete, they are defined as follows:
- `PhysicalObject` - *A thing that physically exists*. Examples include: *gearbox*, *light bulb*, *excavator* 
- `State` - *A condition that a physical object is in at a specific time*. Examples include: *blown*, *not working*, *snapped*
- `Activity` - *A condition in which things are happening or being done to a physical object or state*. Examples include: *overhaul*, *replace*, *diagnose*
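
One practical use of writing the conceptual model down is checking that annotations only use the agreed concepts. The sketch below stores the three definitions above in a dictionary; the helper `unknown_labels` and the `Person` label are hypothetical and included only to show a label falling outside the model.

```python
# The conceptual model: concept name -> definition (taken from the list above)
CONCEPTS = {
    "PhysicalObject": "A thing that physically exists",
    "State": "A condition that a physical object is in at a specific time",
    "Activity": "A condition in which things are happening or being done to a physical object or state",
}


def unknown_labels(annotations):
    """Return labels used in (text, label) annotations that are not in the conceptual model."""
    return sorted({label for _, label in annotations if label not in CONCEPTS})


# Hypothetical annotations; "Person" is deliberately outside the conceptual model
annotations = [
    ("replace", "Activity"),
    ("hose", "PhysicalObject"),
    ("blown", "State"),
    ("fitter", "Person"),
]
print(unknown_labels(annotations))
```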

An example of these concepts identified through information extraction in a work order description looks like:

```
    replace engine oil and inspect blown hose
    
    (replace)[Activity] (engine oil)[PhysicalObject] and (inspect)[Activity] (blown)[State] (hose)[PhysicalObject]
```

### 4.3.C Data Preparation <a class="anchor" id="4.3.C-data-prepartion"></a>

</br></br>

<center>
<img src="./images/W4_flow_diagram_step_C.png" width="75%"/>
</center>

In the previous notebooks (*W3-1 and W3-2*), we developed a process for preparing and cleaning our natural language texts. Here, we'll use this again on the same dataset we used previously (40,000 conveyor work order records). Unlike notebook W3-2, here we do not need to filter our dataset based on the availability of structured data like `actual_start_date` as we are not computing reliability metrics. However, if you wanted to improve the process of the data-driven reliability metrics by incorporating what we will learn in this notebook, then you would need to use the preparation script.

Let's load the script that we created in W3 to clean our work order descriptions.

In [None]:
from scripts import text_cleaner

Now let's load and prepare the data for this notebook.

In [None]:
# If you are using your own data, please change the name of the file to the name of your data.
# Otherwise, use the URL link provided in this part of the program.
path_to_data = "https://coreskills.blob.core.windows.net/ds-reliability/rh_mod_v1.csv"  # "../data/<YOUR_CSV_FILE>"

Note that we do not strictly need the additional structured fields to extract entities, but we will see later that they are useful for analysis.

In [None]:
expected_cols = [
    "id",
    "description",
    "wo_order_type",
    "total_actual_costs",
    "actual_start_date",
    "actual_finish_date",
    "functional_loc_desc",
    "functional_loc",
]

date_cols = [
    "actual_start_date",
    "actual_finish_date",
]

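# Treat the path as a URL only if it starts with a URL scheme; otherwise resolve it relative to dir_path
path_to_data = (
    path_to_data
    if path_to_data.startswith(("http://", "https://"))
    else os.path.join(dir_path, path_to_data)
)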

df = pd.read_csv(
    path_to_data,
    parse_dates=date_cols,
    dayfirst=True,
    encoding="ISO-8859-1",
    thousands=",",
    dtype={"description": "str", "total_actual_costs": "float"},
)

original_df_size = len(df)

assert set(expected_cols).issubset(
    set(df.columns)
), "Uploaded data does not have all the expected columns"

# Let's remove any descriptions that contain erroneous or missing content (e.g. numbers)
df = df[~df["description"].isna()]  # Removes rows that have no description
df = df[~df["description"].str.isnumeric()]  # Removes rows that are only numbers

# Let's clean the description column using our `text_cleaner` function
df["description"] = df["description"].apply(
    lambda text: text_cleaner.clean_text(text=text)
)

# Let's also remove any descriptions that contain a single token as these are unlikely to be useful (you can comment this out if you want to include them)
# split() with no argument ignores repeated and trailing spaces, so they are not counted as extra tokens
df = df[df["description"].apply(lambda text: 1 < len(text.split()))]

filtered_df_size = len(df)

print(f"Reduced dataframe from {original_df_size} to {filtered_df_size}")

Let's check that our data loading and preprocessing worked as expected. If using the `text_cleaner` script, we should see that the descriptions are lowercased.

In [None]:
df.head(5).T
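
The three filters applied above can be checked on a handful of hand-written strings before running them over the full dataset. The sketch below restates them in plain Python. Note that `simple_clean` is only a hypothetical stand-in for `text_cleaner.clean_text` (whose actual behaviour is not shown here), and `None` stands in for a missing description.

```python
def simple_clean(text: str) -> str:
    """Hypothetical stand-in for text_cleaner.clean_text: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def keep_description(text) -> bool:
    """Mirror the three filters: not missing, not purely numeric, more than one token."""
    if text is None:
        return False
    if text.isnumeric():
        return False
    return len(simple_clean(text).split()) > 1


raw_descriptions = ["Replace  Idler BEARING", "12345", None, "leak", "conveyor belt holed"]
kept = [simple_clean(text) for text in raw_descriptions if keep_description(text)]
print(kept)
```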

### 4.3.D Data Curation <a class="anchor" id="4.3.D-data-curation"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_step_D.png" width="75%"/>
</center>

As stated in the fundamentals section, we require training examples to teach a supervised machine learning algorithm. Here we'll call the process of acquiring these examples **data curation**. This process is critical in supervised learning and is challenging in technical domains like maintenance and reliability. Unlike in general, everyday scenarios, the knowledge required to understand content in technical domains is very specific and typically difficult to transfer (this is known as [*tacit knowledge*](https://en.wikipedia.org/wiki/Tacit_knowledge)).

Consider work orders raised against a bespoke asset that describe novel failure behaviour. Lay-people cannot readily understand them; parsing and interpreting this information requires specialist knowledge. For this reason, when curating data to teach supervised machine learning algorithms, it is paramount that those who understand the domain help acquire the data. Subject matter experts are therefore indispensable for many machine learning tasks that are intended to be used on complex technical information.

Unfortunately, many supervised machine learning processes require a large amount of data. In general settings, the cost of data acquisition can be reduced through crowdsourcing, where workers are paid per document to apply the labels that are then used to train machine learning models. In technical domains, this is difficult due to tacit knowledge requirements and data confidentiality. Instead, individuals usually use purpose-built **annotation software** to collect examples for supervised machine learning.

The annotation software used typically depends on the task you wish to perform. For information extraction, the annotation tool **Quickgraph**, developed by the [UWA NLP-TLP Group](https://nlp-tlp.org/), can be used to quickly obtain the data we require. Other options include [Prodigy](https://prodi.gy/) and [LabelStudio](https://labelstud.io/).

Unfortunately, due to the time constraints in this part of the program, we cannot go through the process of manually acquiring data. Instead, we'll load some data that has already been annotated by a human. The data we will use was acquired over the course of a few hours of effort.

Typically, you'll have three datasets for supervised machine learning - `training`, `validation` and `test`. Go [here](https://machinelearningmastery.com/difference-test-validation-datasets/) to find out more about what these mean. Let's download all three, then load the `training` dataset, which consists of the examples we'll use to teach our machine learning algorithm.

In [None]:
# Let's download all of the human-annotated datasets and save them to disk
data_urls = {
    "train": "https://coreskills.blob.core.windows.net/ds-reliability/ner/train.txt",
    "valid": "https://coreskills.blob.core.windows.net/ds-reliability/ner/valid.txt",
    "test": "https://coreskills.blob.core.windows.net/ds-reliability/ner/test.txt",
}

for split_name, split_url in data_urls.items():
    print(f"{split_name}: {split_url}")
    path_to_save_data = os.path.join(dir_path, f"../data/ner/{split_name}.txt")
    urlretrieve(split_url, path_to_save_data)
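
The three files downloaded here were split ahead of time. If you annotate your own data, you would need to create the splits yourself. The cell below is a minimal sketch of one way to do so. The 80/10/10 proportions, the fixed seed and the name `split_documents` are assumptions made for illustration, not the procedure used to create the files above.

```python
import random


def split_documents(docs, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle documents reproducibly and cut them into train/valid/test lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train : n_train + n_valid],
        "test": shuffled[n_train + n_valid :],
    }


docs = [f"work order {i}" for i in range(20)]
splits = split_documents(docs)
print({name: len(items) for name, items in splits.items()})
```

Shuffling before cutting avoids a split where, for example, all work orders from one period end up only in the test set.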

In [None]:
# Let's load one of the files that we downloaded (the training dataset)
with open("../data/ner/train.txt", "r", encoding="utf-8") as infile:
    training_data = infile.readlines()
    training_data = [line.replace("\n", "") for line in training_data]

Let's take a look at the first few rows of the dataset we have loaded.

In [None]:
for row in training_data[:20]:
    print(row)

Here, the tokens of each work order description are on the left and the labels applied to them by a human annotator are on the right. The prefixes on the labels indicate where a token sits within an entity: in this scheme, `B-` marks the beginning of an entity, `I-` marks a token inside (continuing) an entity, and `O` marks a token outside any entity. For example, two adjacent tokens labelled `B-PhysicalObject` and `I-PhysicalObject` together form a single two-token entity (a bigram). This notation is specific to token-classification NLP; more information can be found [here](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).
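
To see how these prefixes turn back into entities, the sketch below merges the token-level tags of one document into spans. It assumes each line holds a token and its tag separated by whitespace, and that tokens outside any entity are tagged `O`. `bio_to_spans` is a hypothetical helper for illustration only; Flair does this conversion for us when it loads the corpus.

```python
def bio_to_spans(lines):
    """Convert 'token TAG' lines for one document into (text, label) spans."""
    spans = []
    tokens, label = [], None
    for line in lines:
        token, tag = line.split()
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label and tokens:
            # The token continues the entity currently being built
            tokens.append(token)
            continue
        # Anything else closes the entity being built, if there is one
        if tokens:
            spans.append((" ".join(tokens), label))
        if prefix in ("B", "I"):
            # Start a new entity (a stray I- tag is treated like B-)
            tokens, label = [token], tag_label
        else:
            # "O" tag: the token is outside any entity
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


document = [
    "replace B-Activity",
    "engine B-PhysicalObject",
    "oil I-PhysicalObject",
    "and O",
    "inspect B-Activity",
    "blown B-State",
    "hose B-PhysicalObject",
]
spans = bio_to_spans(document)
print(spans)
```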

In [None]:
# Split the flat list of lines into documents; a blank line separates one document from the next
training_docs = [
    list(group)
    for is_token_line, group in itertools.groupby(training_data, key="".__ne__)
    if is_token_line  # skip the runs of blank separator lines
]

In [None]:
print(f"Number of training documents: {len(training_docs)}")

### 📌 Activity 4.3.D <a class="anchor" id="4.3.D-activity"></a>

Post your answer to [Menti](https://www.menti.com/6tap8gv81d)
- What factors do you think impact the process of acquiring human-labelled data for training machine learning systems?
- What do you think the limitations of this small set of data will be?

### 4.3.E Model Development <a class="anchor" id="4.3.E-model-development"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_step_E.png" width="75%"/>
</center>

We now have both the concepts we want to use to answer questions about our data and a set of human-annotated examples that use these concepts. With these, we can train a supervised machine learning model to try to apply the concepts automatically.

There are many frameworks/packages that allow the development and training of machine learning models for NLP in Python, such as [Flair](https://github.com/flairNLP/flair), [HuggingFace](https://github.com/huggingface/transformers), [PyTorch](https://github.com/pytorch/pytorch), [FairSeq](https://github.com/facebookresearch/fairseq), [Tensorflow](https://github.com/tensorflow/tensorflow), and [Keras](https://github.com/keras-team/keras). In this section, we will use **Flair** as it gives us easy access to state-of-the-art models with little code.

Below, we are going to train a named entity recognition model using Flair, but the specific details of the training process are outside the scope of this program. Refer to [this reference](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md) for a more detailed guide to training Flair models.

⚠️ Note that the model we train/use will not be perfect, as only a limited amount of data has been acquired for this part of the program. To improve this model, more data is required. If you're interested in extending the model that we are using, feel free to get in contact for more specific guidance.

In [None]:
def train_ner_model(data_dir: str = "../data/ner", model_dir: str = "../data/ner"):
    """Trains a simple named entity recognition model using Flair"""

    assert os.path.isdir(
        data_dir
    ), "Directory for data does not exist - please create and add data then try again."
    assert os.path.isdir(
        model_dir
    ), "Directory for model does not exist - please create and try again."

    flair.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    columns = {0: "text", 1: "ner"}

    # 1. load training data from disk
    corpus: Corpus = ColumnCorpus(
        data_dir,
        columns,
        train_file="train.txt",
        dev_file="valid.txt",
        test_file="test.txt",
    )

    # 2. specify the type of label we want to predict
    label_type = "ner"

    # 3. make the tag dictionary from the corpus
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    # 4. specify the type of embeddings we want to use
    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    # 5. initialize sequence tagger
    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    # 6. initialize trainer
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    # 7. start training
    trainer.train(
        model_dir,
        learning_rate=0.1,
        mini_batch_size=32,
        max_epochs=50,
        embeddings_storage_mode="none",  # Flair documents the options as the strings "none", "cpu" and "gpu"
    )

Here we'll either load a model we already have or train one from scratch. Training from scratch will take approximately 10 minutes.

In [None]:
# If you want to bypass the training (it may be too slow in the labs environment) we'll use the model we downloaded at the start of the notebook.
if os.path.isfile(r"../data/ner/ds-best-model.pt"):
    ner_model = SequenceTagger.load(r"../data/ner/ds-best-model.pt")
else:
    print("Model does not exist - go to the top of the notebook and download it.")

Uncomment the code in the two cells below (and comment the cell above) to train the model yourself.

In [None]:
# Train model from scratch
# train_ner_model()

In [None]:
# Load trained model (note the file name differs from the downloaded model: best-model.pt vs ds-best-model.pt)
# ner_model = SequenceTagger.load(r'../data/ner/best-model.pt')
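
Since the model will not be perfect, it helps to know how "not perfect" is usually measured for this task. A common choice is span-level precision, recall and F1, where a predicted entity only counts if its boundaries and label both match the human annotation. Flair reports its own scores during training; the sketch below is only an illustration of the idea, and representing spans as `(start_token, end_token, label)` tuples is an assumption made for this example.

```python
def span_prf(gold_docs, predicted_docs):
    """Micro-averaged precision, recall and F1 over exact (start, end, label) span matches."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, predicted_docs):
        gold_set, pred_set = set(gold), set(pred)
        true_pos += len(gold_set & pred_set)
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two toy documents: the second prediction in document 1 has the wrong start boundary
gold = [
    [(0, 1, "Activity"), (1, 3, "PhysicalObject")],
    [(0, 1, "State"), (1, 2, "PhysicalObject")],
]
predicted = [
    [(0, 1, "Activity"), (2, 3, "PhysicalObject")],
    [(0, 1, "State"), (1, 2, "PhysicalObject")],
]
precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```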

### 4.3.F Model Application <a class="anchor" id="4.3.F-model-application"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_step_F.png" width="75%"/>
</center>

Now that we have used our human-curated training data to train our supervised information extraction model (or loaded the pretrained model), let's try it out on a few texts. This is similar to the process we performed in the demonstration.

In [None]:
example_texts = [
    "change out leaking mechanical seal",
    "blown rubber",
    "replace idler bearing corroded",
    "conveyor belt holed",
]

In [None]:
# Create a sentence object for each example text
example_sentences = [Sentence(text.lower()) for text in example_texts]

In [None]:
# Make predictions using our trained model
ner_model.predict(example_sentences)

In [None]:
# Print out predictions on sentence objects
for sentence in example_sentences:
    print(f"{sentence}\n")

#### 4.3.F.1 Using a general model on technical data <a class="anchor" id="4.3.F.1-general-model"></a>
Similar to the notebook in Week 3, where we discussed the limitations of general word embeddings and the caution required when working with them, here we'll briefly explore and discuss the use of a general information extraction model on our technical text examples.

In [None]:
# This model is trained on a large corpus of general documents with a schema that does not capture the same information above.
general_ner_model = SequenceTagger.load("ner")

In [None]:
# Inference
example_sentences_general = [Sentence(text.lower()) for text in example_texts]
general_ner_model.predict(example_sentences_general)

for sentence in example_sentences_general:
    print(f"{sentence}\n")

As we can see, using a general model on our data without instilling our domain expertise leads to either no results or unexpected ones. This example is admittedly contrived, since the general model was not trained on the same semantics and conceptual model that we used previously, but it nonetheless highlights the need for specific models for specific problems and domains.

In [None]:
# We'll delete the general model to free up the memory it took up. Rerun the cells above to load the model again if you like.
try:
    del general_ner_model
except NameError:  # raised if the variable no longer exists
    print("General model already deleted")

#### 📌 Activity 4.3.F <a class="anchor" id="4.3.F-activity"></a>

Post your answer on [Menti](https://www.menti.com/jnh9u973nk)
- What are some observations you've made when changing the example texts?

### 4.3.G Analysis <a class="anchor" id="4.3.G-analysis"></a>

</br></br>

<center>
    <img src="./images/W4_flow_diagram_step_G.png" width="75%"/>
</center>

Now that we have a common understanding of the fundamentals underpinning machine learning for NLP tasks, we are going to focus our attention on using the model we have created to try and answer the following questions using our work order dataset:
1. Which assets have the most activities performed on them, and what activities are they?
2. What activities are being performed preventatively versus correctively?
3. What behaviour are assets demonstrating?
4. Are we seeing failure modes we expect?
5. Are we seeing failure modes we didn't expect?

Before we can attempt to answer these questions, we first need to extract all of the entities from the work order dataset. To summarise the code below, what we are doing is:
1. Extracting the work order descriptions from our work order dataframe
2. Converting each work order description into a sentence object (expected by our model)
3. Making predictions on our sentence objects in batches (to ensure our system doesn't run out of memory or take too long)
4. Extracting the predicted entities from the sentence objects
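
The batching in step 3 can be sketched independently of Flair. In the illustration below, `batched` is a hypothetical helper and the strings stand in for sentence objects; the batch size of 64 matches the value used in the function that follows.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


descriptions = [f"description {i}" for i in range(150)]
batch_sizes = [len(batch) for batch in batched(descriptions, batch_size=64)]
print(batch_sizes)  # prints [64, 64, 22]
```

Only one batch is handed to the model at a time, which keeps memory use bounded regardless of how many work orders are processed.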

In [None]:
def extract_entities(
    ner_model: SequenceTagger, df: pd.DataFrame, rows: int = None
) -> tuple:
    """Extracts entities from maintenance work order record descriptions.

    Returns a DataFrame of entity counts and the list of work order objects with predictions.
    """

    if rows is not None:
        df = df[:rows]

    # Convert the dataframe into a list of dictionary objects
    mwo_descriptions_with_ids = df[
        ["id", "functional_loc", "description", "wo_order_type"]
    ].to_dict(orient="records")

    # We need to build up a list of sentence objects to perform inference on
    sentence_objects = [
        Sentence(mwo_obj["description"]) for mwo_obj in mwo_descriptions_with_ids
    ]

    # Perform inference on the list of sentence objects (note: this may take a moment if there are lots of sentences)
    batch_size = 64
    for i in tqdm(range(0, len(sentence_objects), batch_size), desc="Processing texts"):
        ner_model.predict(sentence_objects[i : i + batch_size])

    # Add entities to mwo descriptions
    for idx, mwo_obj in tqdm(
        enumerate(mwo_descriptions_with_ids), desc="Extracting entities"
    ):
        # sentence_obj should be aligned with idx of mwo_obj
        entities = [
            (entity.text, entity.get_label("ner").value)
            for entity in sentence_objects[idx].get_spans("ner")
        ]
        mwo_obj["entities"] = entities

    mwo_obj_per_group = {}
    for mwo_obj in tqdm(mwo_descriptions_with_ids, desc="Aggregating entities"):
        floc = mwo_obj["functional_loc"]

        if floc in mwo_obj_per_group.keys():
            mwo_obj_per_group[floc].append(mwo_obj)
        else:
            mwo_obj_per_group[floc] = [mwo_obj]

    # Process aggregated entities into a format that can be converted into a pandas DataFrame for plotting and analysis.
    # We will separate out the dataframes into entities associated with corrective, preventative and any MWO order type.
    data_rows = []
    data_rows_corrective = []
    data_rows_preventative = []
    for group in tqdm(mwo_obj_per_group, desc="Counting entities"):
        # Corrective entities
        corrective_entities = list(
            itertools.chain.from_iterable(
                [
                    item["entities"]
                    for item in mwo_obj_per_group[group]
                    if item["wo_order_type"] == "PM01"
                ]
            )
        )
        corrective_entity_counts = Counter(corrective_entities)
        if 0 < len(corrective_entity_counts):
            data_rows_corrective.extend(
                [
                    {
                        "floc": group,
                        "text": k[0],
                        "label": k[1],
                        "freq": v,
                        "wo_type": "corrective",
                    }
                    for k, v in corrective_entity_counts.items()
                ]
            )

        # Preventative entities
        preventative_entities = list(
            itertools.chain.from_iterable(
                [
                    item["entities"]
                    for item in mwo_obj_per_group[group]
                    if item["wo_order_type"] == "PM02"
                ]
            )
        )
        preventative_entity_counts = Counter(preventative_entities)
        if 0 < len(preventative_entity_counts):
            data_rows_preventative.extend(
                [
                    {
                        "floc": group,
                        "text": k[0],
                        "label": k[1],
                        "freq": v,
                        "wo_type": "preventative",
                    }
                    for k, v in preventative_entity_counts.items()
                ]
            )

        # All entities
        all_entities = list(
            itertools.chain.from_iterable(
                [item["entities"] for item in mwo_obj_per_group[group]]
            )
        )
        all_entity_counts = Counter(all_entities)

        # Convert to dict objects
        data_rows.extend(
            [
                {
                    "floc": group,
                    "text": k[0],
                    "label": k[1],
                    "freq": v,
                    "wo_type": "all",
                }
                for k, v in all_entity_counts.items()
            ]
        )

    # Combine the rows for all, corrective and preventative work orders (converted to a DataFrame on return)
    data_rows_combined = data_rows + data_rows_corrective + data_rows_preventative

    # Save the sentence objects with predictions
    mwo_objects = [
        {**mwo, "sentence_obj": sentence_objects[idx]}
        for idx, mwo in enumerate(mwo_descriptions_with_ids)
    ]

    return pd.DataFrame(data_rows_combined), mwo_objects
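
The counting step inside `extract_entities` is easier to follow on a toy example. The sketch below reproduces only the "all work order types" aggregation, using `defaultdict(Counter)` instead of the explicit dictionary checks in the function above. The functional locations `CV-01` and `CV-02` and the entities are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up records standing in for work orders with predicted entities
records = [
    {"floc": "CV-01", "entities": [("replace", "Activity"), ("idler", "PhysicalObject")]},
    {"floc": "CV-01", "entities": [("replace", "Activity"), ("belt", "PhysicalObject")]},
    {"floc": "CV-02", "entities": [("holed", "State"), ("belt", "PhysicalObject")]},
]

# Count each (text, label) pair separately for every functional location
counts_per_floc = defaultdict(Counter)
for record in records:
    counts_per_floc[record["floc"]].update(record["entities"])

# Flatten the counts into one row per functional location and entity
rows = [
    {"floc": floc, "text": text, "label": label, "freq": freq}
    for floc, counts in counts_per_floc.items()
    for (text, label), freq in counts.items()
]
for row in rows:
    print(row)
```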

Let's extract all of the entities from the cleaned dataset we loaded earlier in the notebook. Please note that processing all 40,000 documents will take around 15-25 minutes, so by default we'll process a subset. However, feel free to run the entire process in your own time to gain further insights. To process the entire dataset, set `rows=None` when calling the `extract_entities` function.

In [None]:
%%time
# Set rows=None if you want to process the entire dataset
df_analysis, mwo_objects = extract_entities(ner_model=ner_model, df=df, rows=20000)

Using the entities we've extracted from our maintenance work order texts, we can easily aggregate the information that our concepts have been applied to automatically. Let's explore the types of entities that have been extracted. Note that we have a column called `wo_type` to help us separate the aggregated entities for preventative and corrective work order types.

In [None]:
# Let's check how large the resulting data are
print(
    f'Total {len(df_analysis)} - Corrective {len(df_analysis[df_analysis["wo_type"] == "corrective"])} Preventative {len(df_analysis[df_analysis["wo_type"] == "preventative"])}'
)

In [None]:
df_analysis.head()

#### 4.3.G.1 Extracted States <a class="anchor" id="4.3.G.1-extracted-states"></a>
What type of states have we extracted?

In [None]:
df_states = (
    df_analysis[
        (df_analysis["wo_type"] == "all") & (df_analysis["label"].str.contains("State"))
    ]
    .groupby(["text", "label"])[["freq"]]  # only sum the numeric freq column
    .sum()
    .reset_index()
)
total_states = df_states["text"].nunique()

In [None]:
fig = px.bar(
    df_states,
    y="text",
    x="freq",
    color="label",
    title=f"Distribution of States ({total_states} extracted)",
    orientation="h",
)
fig.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}, autosize=True
)
fig.show()

Let's look at the states that occur on corrective and preventative work orders.

In [None]:
# Prepare the data
df_states_corr = (
    df_analysis[
        (df_analysis["wo_type"] == "corrective")
        & (df_analysis["label"].str.contains("State"))
    ]
    .groupby(["text", "label"])[["freq"]]  # only sum the numeric freq column
    .sum()
    .reset_index()
)
df_states_prev = (
    df_analysis[
        (df_analysis["wo_type"] == "preventative")
        & (df_analysis["label"].str.contains("State"))
    ]
    .groupby(["text", "label"])[["freq"]]
    .sum()
    .reset_index()
)

In [None]:
fig_corr_state = px.bar(
    df_states_corr,
    y="text",
    x="freq",
    color="label",
    title="Distribution of Extracted Corrective States",
    orientation="h",
)
fig_corr_state.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}
)
fig_prev_state = px.bar(
    df_states_prev,
    y="text",
    x="freq",
    color="label",
    title="Distribution of Extracted Preventative States",
    orientation="h",
)
fig_prev_state.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}
)

fig_corr_state.show()
fig_prev_state.show()

#### 4.3.G.2 Extracted Activities <a class="anchor" id="4.3.G.2-extracted-activities"></a>
What type of activities have we extracted?

In [None]:
df_activities = (
    df_analysis[
        (df_analysis["wo_type"] == "all")
        & (df_analysis["label"].str.contains("Activity"))
    ]
    .groupby(["text", "label"])[["freq"]]  # only sum the numeric freq column
    .sum()
    .reset_index()
)
total_activities = df_activities["text"].nunique()

In [None]:
fig = px.bar(
    df_activities,
    y="text",
    x="freq",
    color="label",
    title=f"Distribution of Activities ({total_activities} extracted)",
    orientation="h",
)
fig.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}, autosize=True
)
fig.show()

Let's look at the activities performed under corrective and preventative maintenance.

In [None]:
df_activities_corr = (
    df_analysis[
        (df_analysis["wo_type"] == "corrective")
        & (df_analysis["label"].str.contains("Activity"))
    ]
    .groupby(["text", "label"])[["freq"]]  # only sum the numeric freq column
    .sum()
    .reset_index()
)
df_activities_prev = (
    df_analysis[
        (df_analysis["wo_type"] == "preventative")
        & (df_analysis["label"].str.contains("Activity"))
    ]
    .groupby(["text", "label"])[["freq"]]
    .sum()
    .reset_index()
)

In [None]:
fig_corr_act = px.bar(
    df_activities_corr,
    y="text",
    x="freq",
    color="label",
    title="Distribution of Extracted Corrective Activities",
    orientation="h",
)
fig_corr_act.update_layout(barmode="stack", yaxis={"categoryorder": "total descending"})
fig_prev_act = px.bar(
    df_activities_prev,
    y="text",
    x="freq",
    color="label",
    title="Distribution of Extracted Preventative Activities",
    orientation="h",
)
fig_prev_act.update_layout(barmode="stack", yaxis={"categoryorder": "total descending"})

fig_corr_act.show()
fig_prev_act.show()

#### 4.3.G.3 Extracted Physical Objects <a class="anchor" id="4.3.G.3-extracted-physical-objects"></a>
What type of physical objects have we extracted?

In [None]:
df_physical_objects = (
    df_analysis[
        (df_analysis["wo_type"] == "all")
        & (df_analysis["label"].str.contains("PhysicalObject"))
    ]
    .groupby(["text", "label"])[["freq"]]  # only sum the numeric freq column
    .sum()
    .reset_index()
)
total_physical_objects = df_physical_objects["text"].nunique()

df_physical_objects = df_physical_objects.sort_values("freq", ascending=False)[:50]

In [None]:
fig = px.bar(
    df_physical_objects,
    y="text",
    x="freq",
    color="label",
    title=f"Distribution of Top 50 of {total_physical_objects} Physical Objects Extracted",
    orientation="h",
)
fig.update_layout(
    barmode="stack",
    yaxis={"categoryorder": "total descending"},
    autosize=True,
)
fig.show()

#### 4.3.G.4 - Which assets have the most activities performed on them, and what activities are they?
Let's take a look at which assets have the most activities performed on them and what activities they are.

In [None]:
floc_with_most_activities = (
    df_analysis[
        (df_analysis["wo_type"] == "all")
        & (df_analysis["label"].str.contains("Activity"))
    ]
    .groupby("floc")[["freq"]]  # only sum the numeric freq column
    .sum()
)

In [None]:
floc_with_most_activities = floc_with_most_activities.sort_values(
    by="freq", ascending=False
).reset_index()["floc"][0]

In [None]:
df_floc_max_activities = df_analysis[
    (df_analysis["wo_type"] == "all")
    & (df_analysis["floc"] == floc_with_most_activities)
    & (df_analysis["label"].str.contains("Activity"))
]
fig = px.bar(
    df_floc_max_activities,
    y="text",
    x="freq",
    color="label",
    title=f"Group with Most Activities ({floc_with_most_activities})",
    orientation="h",
)
fig.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}, autosize=True
)
fig.show()

In [None]:
df[df["functional_loc"] == floc_with_most_activities].head()

Let's see which asset has the most corrective activities performed on it.

In [None]:
floc_with_most_corr_activities = (
    df_analysis[
        (df_analysis["wo_type"] == "corrective")
        & (df_analysis["label"].str.contains("Activity"))
    ]
    .groupby("floc")[["freq"]]  # only sum the numeric freq column
    .sum()
)

In [None]:
floc_with_most_corr_activities = floc_with_most_corr_activities.sort_values(
    by="freq", ascending=False
).reset_index()["floc"][0]

In [None]:
df_floc_max_corr_activities = df_analysis[
    (df_analysis["wo_type"] == "corrective")
    & (df_analysis["floc"] == floc_with_most_corr_activities)
    & (df_analysis["label"].str.contains("Activity"))
]
fig_floc_max_corr_act = px.bar(
    df_floc_max_corr_activities,
    y="text",
    x="freq",
    color="label",
    title=f"Group with Most Corrective Activities ({floc_with_most_corr_activities})",
    orientation="h",
)
fig_floc_max_corr_act.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}, autosize=True
)
fig_floc_max_corr_act.show()

#### 4.3.G.5 - Which group is exhibiting the most undesirable states?

In [None]:
floc_with_most_undesirable_states = (
    df_analysis[
        (df_analysis["wo_type"] == "all")
        & (df_analysis["label"].str.contains("UndesirableState"))
    ]
    .groupby("floc")[["freq"]]  # only sum the numeric freq column
    .sum()
)

In [None]:
floc_with_most_undesirable_states = floc_with_most_undesirable_states.sort_values(
    by="freq", ascending=False
).reset_index()["floc"][0]

In [None]:
df_floc_max_undesirable_states = df_analysis[
    (df_analysis["wo_type"] == "all")
    & (df_analysis["floc"] == floc_with_most_undesirable_states)
    & (df_analysis["label"].str.contains("UndesirableState"))
]
fig = px.bar(
    df_floc_max_undesirable_states,
    y="text",
    x="freq",
    color="label",
    title=f"Group with Most Undesirable States ({floc_with_most_undesirable_states})",
    orientation="h",
)
fig.update_layout(
    barmode="stack", yaxis={"categoryorder": "total descending"}, autosize=True
)
fig.show()

In [None]:
df[df["functional_loc"] == floc_with_most_undesirable_states].head()

### 📌 Activity 4.3.G <a class="anchor" id="4.3.G-activity"></a>

What are your thoughts on what we have done so far? Is there anywhere you can see this being useful in your workflows?

### 4.4 - Network Graph Analysis <a class="anchor" id="4.4-network-graph-analysis"></a>

One of the limitations of what we have done above is that we are unable to discern "who did what to whom". Eliciting this type of information is complex, so we are going to simplify it in this section of the notebook. The general process we are going to dive into belongs to the realm of network graph analysis, where we link the entities detected in our maintenance work order data with relationships. To keep this simple, we will use rules to link entities with the relations `appears_with` and `related_to`. For the interested individual, more complex automatic extraction of relationships between entities is performed as part of the NLP tasks *relation classification* and *relation extraction*.

To make this concrete, consider the work order description "replace pump impeller" where we can extract the entities `(replace)[Activity]`, `(pump)[PhysicalObject]` and `(impeller)[PhysicalObject]`. We can add our relations by associating our entities together e.g. `(replace)[Activity]-[appears_with]-(pump)[PhysicalObject]`, `(replace)[Activity]-[appears_with]-(impeller)[PhysicalObject]` and `(pump)[PhysicalObject]-[related_to]-(impeller)[PhysicalObject]`. These items are referred to as **triples** or **triplets**.
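
The rules described above can be applied to this single description as a worked example. `make_triples` below is a hypothetical helper written for illustration; it pairs every non-physical-object entity with every physical object (`appears_with`) and chains adjacent physical objects from left to right (`related_to`).

```python
def make_triples(entities):
    """Build (subject text, subject label, relation, target text, target label) tuples."""
    physical = [e for e in entities if "PhysicalObject" in e[1]]
    others = [e for e in entities if "PhysicalObject" not in e[1]]

    # Chain adjacent physical objects left to right
    triples = [
        (left[0], left[1], "related_to", right[0], right[1])
        for left, right in zip(physical[:-1], physical[1:])
    ]
    # Pair every activity or state with every physical object
    triples += [
        (other[0], other[1], "appears_with", po[0], po[1])
        for other in others
        for po in physical
    ]
    return triples


entities = [
    ("replace", "Activity"),
    ("pump", "PhysicalObject"),
    ("impeller", "PhysicalObject"),
]
triples = make_triples(entities)
for triple in triples:
    print(triple)
```

Because these are heuristics, they can produce links that a human reader would not make, for example when one description mentions two unrelated objects.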

⚠️ Unfortunately, we are limited by what Python can do for network analysis. Other tools exist that can make this easier and more insightful, such as all-in-one software like [Neo4J](https://neo4j.com/) or drawing libraries such as [d3](https://d3js.org/).

⚠️ A lot of the code used below has been developed for participants of the program and is required to enable data wrangling and visualisation of the network graphs we are going to create. Feel free to explore it further if you're interested; however, the intention here is to provide exposure to this type of analysis.

#### 4.4.1 - Triple Generation and Network Creation <a class="anchor" id="4.4.1-triple-generation-network-creation"></a>

Recall when we extracted entities from our maintenance work order data we created a large set of `sentence objects` and made predictions over them using our trained machine learning model. Here, we are going to use these objects to generate a set of *triples* that will be used to build our network graph. The triples that we create will be in the form of a list of tuples e.g. `[(subject text, subject label, relation, target text, target label)]`.

We are going to group all of our information based on the `functional location` but feel free to modify the code to group by other structured fields such as `sort fields` or `functional location descriptions`.

Let's create triples from the entities extracted from each functional location in our dataset.

In [None]:
unique_labels = {}
group_triples = {}
for mwo_object in mwo_objects:
    triples = []
    floc = mwo_object["functional_loc"]

    # Extract the entities from the sentence objects within the mwo_object dictionary
    entities = [
        (e.text, e.get_label("ner").value)
        for e in mwo_object["sentence_obj"].get_spans("ner")
    ]

    # Create relations using heuristics (real relation extraction/classification is out of scope)
    # - Activity APPEARS_WITH PhysicalObject
    # - State APPEARS_WITH PhysicalObject
    # - PhysicalObject RELATED_TO PhysicalObject (left to right)
    phys_obj_entities = [e for e in entities if "PhysicalObject" in e[1]]
    phys_obj_triples = [
        (po_es[0][0], po_es[0][1], "related_to", po_es[1][0], po_es[1][1])
        for po_es in zip(phys_obj_entities[:-1], phys_obj_entities[1:])
    ]
    triples.extend(phys_obj_triples)
    other_entities = [e for e in entities if "PhysicalObject" not in e[1]]

    # We are going to make pairwise relations between our entities
    for other_entity in other_entities:
        other_entity_text, other_entity_label = other_entity
        for po_entity in phys_obj_entities:
            po_entity_text, po_entity_label = po_entity

            # Create triple
            triples.append(
                (
                    other_entity_text,
                    other_entity_label,
                    "appears_with",
                    po_entity_text,
                    po_entity_label,
                )
            )

    if floc in group_triples.keys():
        group_triples[floc].extend(triples)
    else:
        group_triples[floc] = triples

Let's have a look at the number of triples created per functional location group.

In [None]:
# Get counts of triples on each group
group_triple_counts = sorted([len(triples) for triples in group_triples.values()])

In [None]:
fig_triple_dist = px.bar(
    x=range(len(group_triple_counts)), y=group_triple_counts, title="Triples per group"
)
fig_triple_dist.show()

Let's define some utility functions for the network we will build below. You can skip these sections, but in general we are defining functions to:
- Assign a color to the nodes in our network based on the entity label
- Assign a radius to the nodes in our network based on how frequently they occur in the dataset
- Aggregate triples together to visualise a large group-agnostic network
- Convert extracted triples to a Python NetworkX structure

We are using the popular package NetworkX for this analysis; find out more [here](https://networkx.org/).

In [None]:
def get_node_color(
    label: str, colors={"activity": "blue", "state": "red", "physicalobject": "green"}
):
    """Applies color based on parent name"""
    parent_label = label.lower().split("/")[0]
    return colors[parent_label]

In [None]:
def get_node_radius(freq: int, base_radius: int = 10):
    """Function for using cube root for node radii"""
    return base_radius * (freq ** (1 / 3))
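
To see the effect of the cube root, the sketch below compares a few raw frequencies with the radii they produce. The function is restated so the snippet runs on its own, and the frequencies are made-up values rather than counts from the dataset.

```python
def get_node_radius(freq: int, base_radius: int = 10) -> float:
    """Same cube-root scaling as above, restated so this snippet is self-contained."""
    return base_radius * (freq ** (1 / 3))


# Illustrative frequencies only (not taken from the dataset)
example_freqs = [1, 8, 125, 1000]
example_radii = [round(get_node_radius(f), 1) for f in example_freqs]

for freq, radius in zip(example_freqs, example_radii):
    print(f"freq={freq:>4} -> radius={radius}")
```

A node that occurs 1000 times is drawn about ten times wider than one that occurs once, rather than a thousand times wider, which keeps very frequent terms from swamping the plot.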

In [None]:
def aggregate_triples(group_triples: dict):
    """Returns a flat list of triples aggregated from a dict of grouped triples"""
    return list(
        itertools.chain.from_iterable([triples for triples in group_triples.values()])
    )

In [None]:
def convert_triples_to_networkx(triples: List[tuple]):
    """Converts a set of triples into network format (nodes, edges, edge list)"""

    # Get aggregate subgraph from all group triples
    all_nodes = list(
        itertools.chain.from_iterable(
            [[(triple[0], triple[1]), (triple[3], triple[4])] for triple in triples]
        )
    )

    # Get unique nodes and their frequency (this will be used in the node size/radius)
    # Create dict of node value/label tuples and counts
    node_counts = Counter(all_nodes)
    unique_nodes = list(node_counts.keys())

    # Add frequency to unique nodes
    unique_node_objects = [
        {
            "value": node[0],
            "label": node[1],
            "freq": node_counts[node],
            "radius": get_node_radius(node_counts[node]),
        }
        for node in unique_nodes
    ]

    # Create data required for network
    nx_nodes = [
        (
            node["value"],
            {
                "label": node["label"],
                "color": get_node_color(node["label"]),
                "value": node["value"],
                "freq": node["freq"],
                "radius": node["radius"],
                "id": idx,
            },
        )
        for idx, node in enumerate(unique_node_objects)
    ]

    nx_edges = [(triple[0], triple[3]) for triple in triples]
    nx_edge_labels = {(triple[0], triple[3]): triple[2] for triple in triples}

    # edge frequencies
    edge_counts = Counter(nx_edges)
    unique_edges_w_freq = list(edge_counts.keys())

    return nx_nodes, nx_edges, nx_edge_labels
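
The edge frequencies counted inside this function are not returned or used later. If you wanted to make use of them, for example to weight edges by how often a pair co-occurs, a sketch along the following lines could work. The triples and the `weight`/`relation` attribute names are made up for illustration.

```python
from collections import Counter

# Made-up triples in the notebook's (subject, label, relation, target, label) form
toy_triples = [
    ("replace", "Activity", "appears_with", "pump", "PhysicalObject"),
    ("replace", "Activity", "appears_with", "pump", "PhysicalObject"),
    ("leaking", "State", "appears_with", "pump", "PhysicalObject"),
    ("pump", "PhysicalObject", "related_to", "impeller", "PhysicalObject"),
]

# Count how often each (subject, relation, target) combination occurs
edge_weights = Counter((t[0], t[2], t[3]) for t in toy_triples)

# Weighted edge list in the (source, target, attribute dict) shape that
# NetworkX's add_edges_from accepts
weighted_edges = [
    (src, tgt, {"relation": rel, "weight": n})
    for (src, rel, tgt), n in edge_weights.items()
]
print(weighted_edges)
```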

In [None]:
def create_network_graph(nodes, edges, edge_labels):
    """Creates a network graph data structure using NetworkX"""

    # Create graph
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges)

    # node labels
    unique_node_labels = list(
        set([(node[1]["label"], node[1]["id"]) for node in G.nodes(data=True)])
    )

    # edge classes
    unique_edge_labels = list(set([edge for edge in edge_labels.values()]))

    return G, edge_labels, unique_node_labels, unique_edge_labels

In [None]:
def show_network_graph(
    nodes,
    edges,
    edge_labels,
    group_name: str = None,
    edge_color_map: dict = {"related_to": "black", "appears_with": "red"},
):
    """Creates a network graph using NetworkX and Bokeh/Panel"""

    # Create network graph
    G, edge_labels, unique_node_labels, unique_edge_labels = create_network_graph(
        nodes, edges, edge_labels
    )

    # Prepare data for renderers
    edge_attrs = {}

    for start_node, end_node, _ in G.edges(data=True):
        edge_label = edge_labels[(start_node, end_node)]
        edge_color = edge_color_map[edge_label]
        edge_attrs[(start_node, end_node)] = edge_color

    nx.set_edge_attributes(G, edge_attrs, "edge_color")

    # Show with Bokeh
    plot = Plot(
        width=800, height=600, x_range=Range1d(-1.5, 1.5), y_range=Range1d(-1.5, 1.5)
    )
    # Parenthesise the conditional so the base title is kept when no group is given
    plot.title.text = "Interactive Maintenance Graph" + (
        f" ({group_name})" if group_name is not None else ""
    )

    tooltips = None  # [("value", "@index"), ("label", "@label")]
    node_hover_tool = HoverTool(tooltips=tooltips)
    plot.add_tools(node_hover_tool, BoxZoomTool(), ResetTool(), PanTool())

    graph_renderer = from_networkx(G, nx.spring_layout, scale=1, center=(0, 0))

    graph_renderer.node_renderer.glyph = Circle(
        size="radius", fill_color="color", id="id"
    )
    graph_renderer.node_renderer.hover_glyph = Circle(size=15, fill_color=Spectral4[1])

    graph_renderer.edge_renderer.glyph = MultiLine(
        line_alpha=0.5, line_width=1, line_color="edge_color"
    )
    graph_renderer.edge_renderer.hover_glyph = MultiLine(
        line_color=Spectral4[1], line_width=5
    )
    plot.renderers.append(graph_renderer)

    # Add node labels
    try:
        x, y = zip(*graph_renderer.layout_provider.graph_layout.values())
        node_labels = nx.get_node_attributes(G, "value")
        source = ColumnDataSource(
            {"x": x, "y": y, "club": [label for label in node_labels.values()]}
        )
        labels = LabelSet(x="x", y="y", text="club", source=source)
        plot.renderers.append(labels)

        # Interactive
        graph_renderer.selection_policy = NodesAndLinkedEdges()
        graph_renderer.inspection_policy = EdgesAndLinkedNodes()

        # Add legend
        legend_items = [
            LegendItem(
                label=node_label[0],
                renderers=[graph_renderer.node_renderer],
                index=node_label[1],
            )
            for node_label in unique_node_labels
        ]
        legend = Legend(items=legend_items)
        plot.add_layout(legend)
        plot.legend.title = "Nodes"

        # Remove duplicates from legend
        legend_tmp = {x.label["value"]: x for x in plot.legend.items}
        plot.legend.items.clear()
        plot.legend.items.extend(legend_tmp.values())

        return show(plot)
    except Exception:
        print("Insufficient data")

#### 4.4.2 - Network Visualisation and Analysis <a class="anchor" id="4.4.2-network-visualisation-analysis"></a>
Now that we have defined the functions required to convert our extracted information into triples to build a network from, we can explore our information and gain insight into the behaviour of our assets more precisely. Note that we are using the relations `appears_with` (<span style="color:red;">red</span>) and `related_to` (black) in our network.

#### 4.4.2.1 Entire Graph <a class="anchor" id="4.4.2.1-entire-graph"></a>

We can also explore the entire graph we've created, but visualising thousands of nodes and edges is difficult for the third-party Python packages we are using. Instead, we'll render a subset of the entire graph. Feel free to increase the amount of data in your own time outside of the labs environment. You could also download the data we have created and visualise it in tools such as Neo4J.
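
One simple way to take the triples out of the notebook is a flat CSV file with one row per triple, a format most graph tools can import. The sketch below is only an illustration: the helper `triples_to_csv`, the column names and the toy triples are assumptions, and it writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Made-up triples in the notebook's (subject, label, relation, target, label) form
toy_triples = [
    ("replace", "Activity", "appears_with", "pump", "PhysicalObject"),
    ("pump", "PhysicalObject", "related_to", "impeller", "PhysicalObject"),
]


def triples_to_csv(triples, handle) -> None:
    """Hypothetical helper: write a header row followed by one row per triple."""
    writer = csv.writer(handle)
    writer.writerow(
        ["subject", "subject_label", "relation", "target", "target_label"]
    )
    writer.writerows(triples)


# In-memory buffer for demonstration; to write a file instead, pass
# open("triples.csv", "w", newline="") as the handle
buffer = io.StringIO()
triples_to_csv(toy_triples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```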

In [None]:
# We'll set the maximum number of triples we want to render in the graph; lower this if any problems arise.
MAX_TRIPLES = 100

In [None]:
# Create object of aggregated triples over all groups
all_triples = aggregate_triples(group_triples)
print(f"Number of triples in entire graph = {len(all_triples)}")

In [None]:
# Let's render the graph (or a subset of it based on our max triples variable)
if len(all_triples) > MAX_TRIPLES:
    print(
        f"Too many triples ({len(all_triples)}) - visualisation will have trouble rendering - reducing graph to max triples"
    )

# Slicing is a no-op when there are no more than MAX_TRIPLES triples
nx_nodes_agg, nx_edges_agg, nx_edge_labels_agg = convert_triples_to_networkx(
    triples=all_triples[:MAX_TRIPLES]
)
show_network_graph(nx_nodes_agg, nx_edges_agg, nx_edge_labels_agg)
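
Note that slicing with `[:MAX_TRIPLES]` keeps the first triples in the list, which come from whichever functional locations were processed first. If a more representative subset is wanted, drawing a random sample is one option. The sketch below uses made-up triples and a fixed seed (both assumptions) so that the result is repeatable.

```python
import random

MAX_TRIPLES = 100

# Made-up stand-in for all_triples: 500 dummy triples
toy_all_triples = [
    (f"activity_{i % 7}", "Activity", "appears_with", f"object_{i % 13}", "PhysicalObject")
    for i in range(500)
]

rng = random.Random(42)  # fixed seed so reruns draw the same sample
sample_size = min(MAX_TRIPLES, len(toy_all_triples))  # never ask for more than exist
sampled_triples = rng.sample(toy_all_triples, sample_size)

print(f"Sampled {len(sampled_triples)} of {len(toy_all_triples)} triples")
```

The sampled list can then be passed to `convert_triples_to_networkx` in place of the slice.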

#### 4.4.2.2 Functional Location Graph <a class="anchor" id="4.4.2.1-functional-location-graph"></a>

In the larger graph above, it is difficult to see precisely what is happening with a specific asset. Instead of looking at everything, let's examine networks for each functional location group individually. Here we'll randomly sample from the list of functional location groups we processed, but feel free to uncomment the commented code and enter any functional location.

In [None]:
# Show network for a randomly selected group
sampled_group = random.choice(list(group_triples.keys()))
# sampled_group = '1071-30-05-01-CVR102'

assert sampled_group in list(group_triples.keys()), "Group does not exist - try again."

nx_nodes, nx_edges, nx_edge_labels = convert_triples_to_networkx(
    triples=group_triples[sampled_group]
)
show_network_graph(
    nodes=nx_nodes, edges=nx_edges, edge_labels=nx_edge_labels, group_name=sampled_group
)

#### 4.4.2.3 Querying the Network Graph <a class="anchor" id="4.4.2.3-query-network-graph"></a>

Network graphs are useful for visualising extracted information to gain an intuition for what is happening within natural language texts, but we can also query them to show us:
- All the states and activities the physical object *X* is involved in, or
- All the states the physical object *X* appears with.

Note that purpose-built graph software is better suited to this type of query work, but NetworkX will suffice for these simple queries.
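
For instance, the second kind of query can be answered with a simple tally over the triples themselves. The sketch below uses made-up triples and a hypothetical helper `states_for_object`. It assumes, in line with the heuristics used to build the triples, that states are the subject of `appears_with` triples and physical objects are the target, and that state labels start with `State`.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str, str, str]

# Made-up triples; the label names are assumptions for illustration
toy_triples: List[Triple] = [
    ("leaking", "State/UndesirableState", "appears_with", "gbox", "PhysicalObject"),
    ("leaking", "State/UndesirableState", "appears_with", "gbox", "PhysicalObject"),
    ("noisy", "State/UndesirableState", "appears_with", "gbox", "PhysicalObject"),
    ("replace", "Activity", "appears_with", "gbox", "PhysicalObject"),
    ("worn", "State/UndesirableState", "appears_with", "brake", "PhysicalObject"),
]


def states_for_object(triples: List[Triple], object_term: str) -> Counter:
    """Hypothetical helper: tally the states paired with a physical object."""
    return Counter(
        subj
        for subj, subj_label, relation, tgt, tgt_label in triples
        if relation == "appears_with"
        and subj_label.lower().startswith("state")
        and "physicalobject" in tgt_label.lower()
        and object_term in tgt
    )


gbox_states = states_for_object(toy_triples, "gbox")
print(gbox_states.most_common())
```

Counting in this way also shows how often each state is reported, which the graph view only hints at through node size.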

Let's query the graph we have constructed.

In [None]:
# Enter a term to search the network with
search_term = "brake"  # Try 'worn', 'brake', 'gearbox', 'fail'

In [None]:
# Filter the triples with the search term (we'll limit the triples to the MAX_TRIPLES in case a large graph is returned)
search_triples = [
    triple
    for triple in all_triples
    if (search_term in triple[0]) or (search_term in triple[3])
]
print(f"Triples containing {search_term} = {len(search_triples)}")

nx_nodes_search, nx_edges_search, nx_edge_labels_search = convert_triples_to_networkx(
    triples=search_triples[:MAX_TRIPLES]
)
show_network_graph(nx_nodes_search, nx_edges_search, nx_edge_labels_search)

Let's query the graph to find all problems (undesirable states) with specific physical objects, for example all gearboxes.

In [None]:
# Enter a physical object to search the network with
physical_object_search_term = "gbox"

# Uncomment if you want to randomly sample a physical object to look at
# physical_object_search_term = random.choice(
#     df_physical_objects["text"].unique().tolist()
# )

In [None]:
# Filter the triples by physical object name, keeping only those paired with an undesirable state (we'll limit the triples based on the MAX_TRIPLES variable)
matched_triples = [
    triple
    for triple in all_triples
    if (
        (physical_object_search_term in triple[0])
        & ("physicalobject" in triple[1].lower())
        & ("undesirablestate" in triple[4].lower())
    )
    | (
        (physical_object_search_term in triple[3])
        & ("physicalobject" in triple[4].lower())
        & ("undesirablestate" in triple[1].lower())
    )
]
print(
    f'Triples relating to the physical object "{physical_object_search_term}" = {len(matched_triples)}'
)

(
    nx_nodes_matched,
    nx_edges_matched,
    nx_edge_labels_matched,
) = convert_triples_to_networkx(triples=matched_triples[:MAX_TRIPLES])
show_network_graph(nx_nodes_matched, nx_edges_matched, nx_edge_labels_matched)

## Wrap Up & Homework <a class="anchor" id="summary"></a>

Homework for next week:
- Load your own work order datasets into this notebook and rerun it
- Think about applications of supervised NLP in your own workflows
- If you're interested in supervised machine learning, check out the available annotation software

Your feedback today is welcome. Provide your answers in [Menti](https://www.menti.com/mqxk64gi8y):
- What is one thing you liked about today?
- What would you like to see more of?