# <center> Data-Driven Reliability Metrics </center>
## <center> Week 3 </center>
## <center> Interactive Notebook </center>

</br>
</br>

<center>📚 Source: W3-1 Data-Driven Reliability Metrics</center>

⚠️ To make the most out of this notebook, some level of Python programing is desirable. If you've never used Python (or any other programing language) checkout resources [here](https://wiki.python.org/moin/BeginnersGuide/Programmers) to get up to speed. This is not mandatory, but will allow you to modify the code examples and complete the activities.

⚠️ This notebook has interactive elements, pleasure ensure that your notebook is set-up to use `ipywidgets` by running `jupyter nbextension enable --py widgetsnbextension --sys-prefix` in the jupyterhub terminal.

## Notebook Overview
Within this notebook, we will move from toy to real failure datasets derived from maintenance work order records. The overarching application in this notebook is to semi-automatically calculate reliability metrics from maintenance work orders in a standardised, reproducible, low-resource, manner. To achieve this, first, we will gain insight into our data by performing exploratory data analysis (EDA). Second, we will become acquianted with fundamental natural language processing (NLP) techniques, that we will use to improve our datasets quality and for identification of end-of-life events. Third, we'll use expert logic to classify and reason about end-of-life events before performing statistical data life analysis via fitting to a 2-parameter Weibull distribution.

<img src="./images/nb_3-1_excel_to_distribution.png" alt="excel to reliability metrics" width="75%;"/>

Legend
- ⚡ indicates a new concept

## Table of Contents
* [Week 2 - Recap](#week-2-recap)
* [3 - Overview of data-driven reliability metrics from maintenance work order data](#3-overview)
* [3.A - Exploratory analysis of maintenance work order data](#3-A)
    * [Activity 3.A](#3-A-activity)
* [3.B - Preparing maintenance work order data for analysis](#3-B)
    * [Activity 3.B](#3-B-activity)
* [3.C - Identification of end-of-life events within unstructured maintenance text](#3-C)
    * [3.C.1 - Basic key-word search](#3-C-1)
        * [Activity 3.C.1 - Identifying end-of-life events (EOL) via key-word search](#3-C-1-activity)
    * [3.C.2 - Expanded key-word search using word embeddings](#3-C-2)
        * [Activity 3.C.2](#3-C-2-activity)
* [3.D - Classification of identified end-of-life events](#3-D)
    * [Activity 3.D](#3-D-activity)
* [3.E - Reasoning about classified end-of-life events](#3-E)
    * [Activity 3.E](#3-E-activity)
* [3.F - Scalable statistical data life analysis](#3-F)
    * [3.F.1 - Analysing the effect of our decisions on reliability measures](#3-F-1)
    * [Activity 3.F](#3-F-activity)
* [Summary](#summary)
* [Appendix](#appendix)

## Notebook Objectives
- Perform exploratory data analysis on maintenance work order data
- Load, wrangle and process maintenance work order data to extract MTBF in a standardised and reproducible way without external third-party software

## Learning Outcomes
- Understand how to load, wrangle and process maintenance work order data
- Be comfortable with exploratory data analysis
- Understand fundamental NLP concepts such as tokenization, ngrams, etc.
- Understand how to visualise information in text using visualisation packages and dimensionality reduction

## Recap from Week 2 <a class="anchor" id="week-2-recap"></a>

Provide your answers in [Menti]()
- Do you have any questions or comments from last week?

## Notebook Setup

In [1]:
# Package for ensuring code we write is formatted nicely
%load_ext nb_black

<IPython.core.display.Javascript object>

### Import Packages

Standard packages

In [2]:
import os
import itertools
import random
from collections import Counter

<IPython.core.display.Javascript object>

- [pandas](https://pandas.pydata.org/) - Package for data handling and wrangling
- [pandas_profiling](https://github.com/ydataai/pandas-profiling) - Package for generating a profile over a pandas dataframe
- [reliability](https://reliability.readthedocs.io/en/latest/index.html) - Package for performing Weibull analysis
- [plotly](https://plotly.com/graphing-libraries/) - Package for interactive visualisation

In [3]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import ipywidgets as widgets
from IPython.display import display, clear_output
from reliability.Fitters import Fit_Weibull_2P
import plotly.express as px

<IPython.core.display.Javascript object>

Configuration

In [4]:
dir_path = os.path.abspath("")
pd.set_option("display.max_rows", None)

<IPython.core.display.Javascript object>

## 3 - Overview of data-driven reliability metrics from maintenance work order data <a class="anchor" id="3-overview"></a>
This section will explore identifying end-of-life events from the unstructured fields of maintenance work orders to allow us to classify them as failures, suspensions, or other.

<img src="./images/nb_3-1_flow_diagram.png"/>

This section of the notebook will take us through different stages required to identify end-of-life events from real maintenance data, and using fortuitous data allow us to calculate reliability measures at scale. Illustrated in the figure, we will explore 6 mmain steps:

- A: Maintenance work order records
- B: **Data preparation**
- C: **Identification** of end-of-life events
- D: **Classification** of end-of-life events
- E: **Reasoning** about end-of-life events, and
- F: Statistical data life **analysis**

### 3.A - Maintenance work order records <a class="anchor" id="3-A"></a>

Here we'll load a dataset containing ~40,000 maintenance records. Before we can use it in our notebook, we'll first ensure that the data is in the right format and all records have a short text description. If you have brought your own .csv dataset of work orders, feel free to place it into the `data` directory and replace the path in `data_path_large` to your files name. Please ensure that the column headers of your data are aligned with the `expected_cols` below.

In [5]:
data_path_large = "../data/rh_mod_v1.csv"

<IPython.core.display.Javascript object>

In [6]:
expected_cols = [
    "id",
    "description",
    "wo_order_type",
    "total_actual_costs",
    "actual_start_date",
    "actual_finish_date",
    "total_actual_work",
    "functional_loc_desc",
    "functional_loc",
]

<IPython.core.display.Javascript object>

In [7]:
date_cols = [
    "basic_start_date",
    "created_on",
    "actual_finish_date",
    "basic_finish_date",
    "actual_start_date",
]

df_large = pd.read_csv(
    os.path.join(dir_path, data_path_large),
    parse_dates=date_cols,
    dayfirst=True,
    encoding="ISO-8859-1",
    thousands=",",
    dtype={"description": "str", "total_actual_costs": "float"},
)

assert set(expected_cols).issubset(
    set(df_large.columns)
), "Uploaded data does not have all the expected columns"

# Lets remove any records that do not have a short text description
df_large = df_large[~df_large["description"].isna()]

# Before we continue we'll convert the functional description into a more unique name for plotting later on
df_large["object_desc"] = df_large["functional_loc_desc"].apply(
    lambda desc: " ".join(
        [word for word in desc.replace("-", " ").split(" ") if word.isalpha()]
    )
)

<IPython.core.display.Javascript object>

Lets review the dataset we have loaded

In [8]:
df_large.head(2).T

Unnamed: 0,0,1
id,1,2
description,3Y MEC SDN REPL Conveyor Belt CVR068,CVR029 Replace Conveyor Belt
basic_start_date,2017-01-02 00:00:00,2020-10-28 00:00:00
priority,High,High
maint_activity_type,PM,PDM
wo_order_type,PM02,PM01
assembly_description,Conveyor Belt CVR068,"BELT,CONVEYOR,2200XST2500,19/6,FXS/SLL"
sort_field,261-CVR068,227-CVR029
created_on,2019-04-29 00:00:00,2019-11-06 00:00:00
total_actual_costs,1.43126e+06,990589


<IPython.core.display.Javascript object>

We can quickly gain insight into our work order data by using the python package [pandas profiling](https://github.com/ydataai/pandas-profiling) which profiles the data giving us a quick overview of its contents.

In [9]:
profile = ProfileReport(
    df_large,
    title="Maintenance Work Order Data Profile",
    explorative=True,
    progress_bar=False,
)

<IPython.core.display.Javascript object>

In [10]:
profile.to_notebook_iframe()

<IPython.core.display.Javascript object>

Further information about using pandas profiling can be found [here](https://pandas-profiling.ydata.ai/docs/master/). There are a lot of articles overviewing the capabilities of this exploratory data analysis tool such as [this one](https://towardsdatascience.com/learning-pandas-profiling-fc533336edc7).

### Activity 3.A <a class="anchor" id="3-A-activity"></a>

Provide your answer in [Menti]()

- What proportion of work orders are PMs compared to corrective activities?
- Find something interesting in the data to share with the class

### 3.B - Data preparation <a class="anchor" id="3-B"></a>
<img src="./images/nb_3-1_flow_diagram-B.png"/>
</br>
Now that we have familiarity with our dataset, we need to address the elephant in the room - data quality. Our dataset has many types of data classified as `date`, `categorical`, `numerical`, and `string`. In this section, we are particularly interested in the quality of the `string` data which manifests as the short text description of our work order records. Due to the unstructured nature of text in maintenance records, they can become unwieldly due to noise.


Noise in technical user-generated text is largely produced as a result of three factors:
- 1. limited time,
- 2. space constraints, and 
- 3. technical nature of the text

As a result of these factors, technical texts such as maintenance work order short/long text will contains a lot of noise resulting from domain-specific terms, erroneous spelling, phonetic substitutions, informal grammar, etc.

It should be noted that the considerations we will make, and processes we employ, are also directly applicable to other types of technical texts such as: condition monitoring reports (vibration analysis, etc.), notification text, down time accounting text, FLAC reports, maintenance procedures, etc.

#### 3.B.1 - Fundamentals of natural language processing
Before we dive into methods to improve the quality of our text data to support downstream tasks, first, lets familiarise ourself with some fundamental concepts of natural language processing (NLP) and get some some lingo/jargon along the way. There are a few more concepts also contained in the [Appendix]() for you to review at your leisure.

If you're interested in learning more about NLP concepts in general CORE offers a [data science springboard](https://www.corehub.com.au/professional-program) that includes this and much more!

For the following exercises we will use the texts contained within our maintenance record dataset. Moreover, we'll be using the Python [Natural Language Toolkit (NLTK)](https://www.nltk.org/) for this section of the notebook, however there are many packages available for processing, cleaning, and analysing text data. 

In [11]:
# Import required packages and functions from nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import ngrams
from nltk.text import Text

<IPython.core.display.Javascript object>

**⚡ Concept - Corpus**</br>
A corpus is a set of text documents, where a document can be a word, sentence, paragraph, report, etc.

In [12]:
mwo_corpus = df_large["description"].apply(lambda x: str(x)).tolist()
print(f"MWO corpus consists of {len(mwo_corpus)} texts")

MWO corpus consists of 45039 texts


<IPython.core.display.Javascript object>

**⚡ Concept - Tokenization**</br>
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing.

In [13]:
# Load the NLTK punkt tokenizer (find out more here - https://www.nltk.org/_modules/nltk/tokenize/punkt.html)
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tyler\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<IPython.core.display.Javascript object>

In [14]:
# Lets 'tokenize' a sentence using the NLTK (Tip: include punctuation (.,) and newlines (\n) to see how it impacts tokenization)
sentence = "[CA] replace suction pipe on pump pu001 u/s"
# sentence = random.sample(mwo_corpus, 1)[0] # Uncomment to randomly sample data from the MWO corpus

tokenized_sentence = word_tokenize(sentence)
print(
    f"Input:\n{sentence}\nTokens:\n{tokenized_sentence} ({len(tokenized_sentence)} tokens)"
)

Input:
[CA] replace suction pipe on pump pu001 u/s
Tokens:
['[', 'CA', ']', 'replace', 'suction', 'pipe', 'on', 'pump', 'pu001', 'u/s'] (10 tokens)


<IPython.core.display.Javascript object>

**⚡ Concept - Vocabulary**</br>
A vocabularly is set of terms that correspond to a particular subject matter. Now that we are familiar with tokenization, we can get a feel for how big our vocabulary is. The vocabulary of maintenance texts will be quite large, even though the length of the texts are small, as references to functional locations etc will be counted as unique words.

In [15]:
# Vocabulary depends on tokenization scheme, here we'll use an off-the-shelf tokenizer. Note we are removing all the casing from our
# texts in this operation, as the MWO data uses capitalization inconsistently and erroneous. Hence, for all intensive-purposes, the computer
# thinks that ON/on/On/oN are different tokens/words, although they mean the same thing.
mwo_tokens = list(itertools.chain.from_iterable([word_tokenize(text.lower()) for text in mwo_corpus]))
mwo_vocab = set(mwo_tokens)

# Lets get the vocab including casing for comparison.
mwo_tokens_cased = list(itertools.chain.from_iterable([word_tokenize(text) for text in mwo_corpus]))
mwo_vocab_cased = set(mwo_tokens_cased)

print(f"MWO corpus has a vocabulary size of {len(mwo_vocab)} (with casuing: {len(mwo_vocab_cased)}, {(1 - len(mwo_vocab)/len(mwo_vocab_cased)) * 100:0.1f}%)")

MWO corpus has a vocabulary size of 7100 (with casuing: 9187, 22.7%)


<IPython.core.display.Javascript object>

In addition to the unique set of tokens (words) in our corpus (e.g. our vocabulary), it's also useful to know the word frequency distribution.

In [16]:
# Here we'll use Pythons built in 'Counter' to count the tokens in our corpus.
# This is a very handy native object for counting (see: https://docs.python.org/3/library/collections.html#collections.Counter)
mwo_token_counts = Counter(mwo_tokens)

<IPython.core.display.Javascript object>

In [17]:
# Lets look at the some statistics of our corpus including the top 10 tokens and hapaxes. A hapaxe is a token with frequency of 1.
print(f"Total tokens:\n{sum(mwo_token_counts.values())}")
print(f"Top 10 tokens:\n{mwo_token_counts.most_common(10)}")
print(f"Rarest 10 tokens:\n{mwo_token_counts.most_common()[-10:]}")


Total tokens:
238967
Top 10 tokens:
[('on', 17588), ('insp', 14398), ('1w', 10264), ('blt', 8210), ('sdn', 5737), ('lub', 4537), ('mec', 4300), ('ins', 4265), ('replace', 3893), ('off', 3420)]
Rarest 10 tokens:
[('cance', 1), ('cvr162mdm', 1), ('cvr163mdm', 1), ('p10', 1), ('program', 1), ('cvr116-pws1576', 1), ('lay', 1), ('future', 1), ('use', 1), ('only', 1)]


<IPython.core.display.Javascript object>

**⚡ Concept - Stopwords**</br>
Stopwords usually refer to the most common words in a language. For natural language processing applications, stopwords may contain less information than rarer words. For example in the text `replace the pump`, the token `the` is usually considered a stopword as it does not add any extra information (we can still understand the text as `replace pump`).

For technical language like that found in maintenance work orders, many stopwords will not be used as these texts usually lack grammar. Moreover, standardised stopword lists include words that may be crucial for tasks such as identifying failure modes in text.

In [18]:
# Lets look at common stopwords (NLTK has these built in!)
general_stopwords = stopwords.words("english")
print(general_stopwords)
print(f"\nNumber of stopwords: {len(general_stopwords)}")

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

<IPython.core.display.Javascript object>

In [19]:
# Lets see the impact of stop word removal from technical texts
sentence_with_stopwords = "replace pump not pumping to capacity"
# sentence_with_stopwords = random.sample(mwo_corpus, 1)[0] # Uncomment to randomly sample data from the MWO corpus

# Here we'll tokenize the sentence on whitespace and then remove any token that is within the stopword list
# We lowercase the tokens when we check as the stopwords have no casing
sentence_stopwords_rmv = " ".join(
    [
        token
        for token in sentence_with_stopwords.split(" ")
        if token.lower() not in general_stopwords
    ]
)

print(f"Input:\t{sentence_with_stopwords}\nOutput:\t{sentence_stopwords_rmv}")

Input:	replace pump not pumping to capacity
Output:	replace pump pumping capacity


<IPython.core.display.Javascript object>

**⚡ Concept - N-grams**</br>
Now that we are familiar with the concept of `tokens`, n-grams are simply the process of representing a set of n continguous (adjacent) tokens. Common names used under the concept of n-gram are unigram (n=1), bigram (n=2), trigram (n=3) where n is the number of contiguous tokens in the set. To make this more concrete, consider the tokens `["idler", "not", "working"]`. Here, we can extract the following sets of n-grams:
- unigram (n=1)
    - token sets: `["idler"]`, `["not"]`, `["working"]`
    - grams: idler, not, working
- bigram (n=2)
    - token sets: `["idler", "not"]`, `["not", "working"]`
    - grams: idler not, not working
- trigram (n=3):
    - token sets: `["idler", "not", "working"]`
    - grams: idler not working
    
As we can see, chunks of grams can represent different concepts, in this example the bigram "not working" has important meaning in maintenance applications.

In [20]:
# Lets build some ngrams using NLTK (note: NLTK expects the sentence to be tokenized)
sentence_for_ngrams = "idler not working"
# sentence_for_ngrams = random.sample(mwo_corpus, 1)[0] # Uncomment to randomly sample data from the MWO corpus

# Tokenize sentence (we'll use the NLTK punkt tokenizer here)
sentence_for_ngrams_tokenized = word_tokenize(sentence_for_ngrams)

n_values = [1, 2, 3, 4]

for n in n_values:
    # The ngram function returns a `zip` object
    grams = [" ".join(gram) for gram in ngrams(sentence_for_ngrams_tokenized, n)]
    print(f"{n}-grams: {grams}\n")

1-grams: ['idler', 'not', 'working']

2-grams: ['idler not', 'not working']

3-grams: ['idler not working']

4-grams: []



<IPython.core.display.Javascript object>

⚡ **Concept - Phrases/Chunking**</br>
As seen above, tokens within a text can form many different sized ngrams ('not', 'working' vs 'not working'). How do we know which ngrams should be formed between which tokens? To do this we can "chunk" or "phrase" our text. Here we will train a phrasing algorithm that will learn to detect common phrases aka multi-word expressions automatically from our corpus.

This is an important tool in our NLP toolkit as it enables us to treat words as chunks of meaning rather than just a bag of individual words. 

In [21]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

<IPython.core.display.Javascript object>

In [22]:
print(ENGLISH_CONNECTOR_WORDS)

frozenset({'and', 'without', 'in', 'with', 'an', 'from', 'by', 'or', 'of', 'for', 'a', 'at', 'the', 'to', 'on'})


<IPython.core.display.Javascript object>

In [23]:
# Here we'll build a phrase model to detect common phrases
# Find out more: https://radimrehurek.com/gensim/models/phrases.html

# We need to feed in our corpus after it has been tokenized
tokenized_mwo_corpus_for_embedding = [
    word_tokenize(text.lower()) for text in mwo_corpus
]

# Now we'll train our phrasing model but only account for words with a minimum frequency of 5
# (you can make this lower such as 1 to account for hapaxes)
phrase_model = Phrases(
    tokenized_mwo_corpus_for_embedding,
    min_count=5,
    threshold=10,
    connector_words=ENGLISH_CONNECTOR_WORDS,
)

<IPython.core.display.Javascript object>

In [24]:
# Lets try the phrasing model out
text_to_phrase = 'change out pump impeller - not working'
text_to_phrase_tokenized = word_tokenize(text_to_phrase)

# Now we'll pass it into our phrasing model
phrase_model[text_to_phrase_tokenized]


['change_out', 'pump', 'impeller', '-', 'not_working']

<IPython.core.display.Javascript object>

**⚡ Concept - Word Representation and Similarities**</br>
The last concept we'll review is that of similarity between words. To make this concrete, lets think about the words `oil`, `fuel`, `coolant` and `water`. From our background knowledge, we know these share similar meaning, i.e. they are all fluids and they may be found in industrial contexts. However, how do we make machines gain this intution? One answer is `word embeddings`.

Like the saying "you can tell a word by the company that it keeps", word embeddings are numerical representations that allows word with similar meaning to have similar values. Given a large enough corpus, these word embeddings can be learnt. 

All state-of-the-art NLP use embeddings in one form or another, so an intuition toward them is important. However, diving into the details of embeddings is out of the scope of this notebook, but if you're interested check out [here](https://machinelearningmastery.com/what-are-word-embeddings/) for further information.

<img src="./images/nb_3-1_embeddings.png" alt="word embeddings" width="75%"/>

Here we'll load more packages
- [gensim](https://radimrehurek.com/gensim/) - package used to learn word embeddings from text
- [sklearn](https://scikit-learn.org/) - package for machine learning that we'll use for dimensionality reduction
- [numpy](https://numpy.org/) - package for dealing with arrays and numerical information
- [plotly](https://plotly.com/) - package for interactive visualisation

In [25]:
from gensim.models import Word2Vec
from gensim.models import phrases
from sklearn.decomposition import PCA

<IPython.core.display.Javascript object>

In [26]:
class EmbeddingTrainer:
    def __init__(self):
        self.min_token_count = 2
        self.window_size = 3
        self.model_size = 300

    def train_model(self, docs, iterations: int = 50):
        w2v_model = Word2Vec(
            sentences=docs,
            vector_size=self.model_size,
            min_count=self.min_token_count,
            window=self.window_size,
            epochs=iterations,
        )
        print(f"Model summary:\n {w2v_model}")
        return w2v_model

<IPython.core.display.Javascript object>

Here we'll train a word embedding model called Word2Vec ([find out more here](https://radimrehurek.com/gensim/models/word2vec.html)). Please note that we will be doing this on the original corpus without any normalisation or cleaning.

In [27]:
# First lets tokenize our corpus using the punkt tokenizer from NLTK; we'll also lower case everything to
# improve the support of each word. However, this is not prescriptive.
tokenized_mwo_corpus_for_embedding = [
    word_tokenize(text.lower()) for text in mwo_corpus
]

<IPython.core.display.Javascript object>

In [28]:
# Again, we'll build a phrase model to detect common phrases
phrase_model = phrases.Phrases(
    tokenized_mwo_corpus_for_embedding,
    min_count=5,
    threshold=10,
    connector_words=phrases.ENGLISH_CONNECTOR_WORDS,
)

<IPython.core.display.Javascript object>

In [29]:
# Lets learn word embeddings from the text in our corpus
# Here we will pass our tokenized corpus into our pretrained phrasing model

w2v_model = Word2Vec(
    sentences=[phrase_model[t] for t in tokenized_mwo_corpus_for_embedding],
    vector_size=300,
    min_count=5,
    window=3,
    epochs=50,
)

<IPython.core.display.Javascript object>

In [30]:
# Vocabulary size of the model we created
print(f"Size of word embedding vocabulary: {len(w2v_model.wv.index_to_key)}")

Size of word embedding vocabulary: 2070


<IPython.core.display.Javascript object>

Now that we have a numerical representation of words in our corpus, lets inspect some word similarities. Try words such as `seized`, `replaced`, etc.

In [31]:
def get_similar_words(model=w2v_model, word: str = ""):
    try:
        if word in w2v_model.wv.index_to_key:
            sim_words = model.wv.most_similar(word)
            print(
                "\n".join(
                    [f"{word} ({similarity*100:0.1f}%)" for word, similarity in sim_words]
                )
            )
        else:
            print("Word not in vocabulary - try again!")
    except Exception as e:
        print(f'Failed due to {e}')

<IPython.core.display.Javascript object>

In [32]:
# Caution: We need to ensure that we use words that exist in our vocabulary!
base_word = "seized"

get_similar_words(word=base_word)

siezed (80.8%)
collapsed (75.0%)
pizza_cutter (74.5%)
collasped (72.3%)
failed (70.9%)
noisy (70.7%)
colapsed (70.0%)
flat (69.6%)
collapsed_return (66.9%)
callapsed (65.7%)


<IPython.core.display.Javascript object>

⚠️ An important caveat for using embeddings in technical languages is the type of embedding may be restricted on what can be represented depending on the way the embeddings were learnt. In this notebook, we are using `word` embeddings, hence the embeddings are built off of the words in our vocabulary (over the minimum threshold we specified). Hence, the embedding model will NOT represent words that are hapaxes if we set our threshold to greater than 1, it will be considered OUT OF VOCABULARY.

This is an important area for performing NLP on technical languages when using off-the-shelf embedding models (which is the default approach in NLP for most domains). First, important domain specific words may not be present (e.g. `primary_scraper` is not a common word in the English language and has specific meaning in technical contexts). Second, even if the embedding model can take the word, it doesn't mean that the context it creates from it, is correct.

This isn't to say that off-the-shelf embeddings cannot be used, as they still have great utility (being trained on billions of words), they just need to be used with an understanding of their potential limitations in technical settings.

Note: Not all embeddings are WORD embeddings, there are also sub-word embeddings, character embeddings, document embeddings, etc.

Lets quickly quantify this issue that we've discussed above by comparing our domain-specific embeddings to those from a general domain.

In [33]:
# Lets quickly look at this in action for word embeddings.
# Source: https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html#how-to-download-pre-trained-models-and-corpora
import gensim.downloader as api
corpus = api.load('text8')
general_embedding_model = Word2Vec(corpus)

<IPython.core.display.Javascript object>

In [34]:
# Get similar words to our base word using the general model
get_similar_words(model=general_embedding_model, word=base_word)

attacked (78.5%)
surrendered (75.8%)
secured (75.6%)
ousted (74.8%)
captured (73.6%)
overthrown (73.1%)
besieged (72.2%)
recaptured (71.0%)
defeated (70.5%)
assassinated (69.4%)


<IPython.core.display.Javascript object>

In [35]:
# lets try a domain-specific abbreviation, 'c/o' and see what happens.
get_similar_words(model=general_embedding_model, word="c/o")

Failed due to "Key 'c/o' not present in vocabulary"


<IPython.core.display.Javascript object>

In [36]:
# Check how similar words are
word_a = 'seized'
word_b = 'failed'
print(f'Similarity between {word_a} and {word_b}')
print(f"Domain-specific model: {w2v_model.wv.similarity(word_a, word_b) * 100:0.1f}%")
print(f"General model: {general_embedding_model.wv.similarity(word_a, word_b) * 100:0.1f}%")

Similarity between seized and failed
Domain-specific model: 70.9%
General model: 35.9%


<IPython.core.display.Javascript object>

Before we move onto the next section, lets visualise the embeddings we have learnt from our MWOs using the package `plotly`. Recall that the embeddings we created are of the size 300, hence, each word has 300 numerical values. To make this interpretable, we need to reduce the dimensionality to 2 so we can visualise it on a plot. To do this, we'll use a dimensionality reduction technique called [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (note there are many other techniques out there).

The main reason for visualising embeddings is to identify semantic and syntactic trends in the data. For instance, words like 'replace', 'replacement', 'change out', 'repair' lie close together due to their semantics (they are activities), and 'frame' and 'frames' are close together due to syntax (plurality).

In [37]:
# First we need to build a vector of words and their embeddings
X = np.array([w2v_model.wv[word] for word in w2v_model.wv.index_to_key])
print(X.shape)

(2070, 300)


<IPython.core.display.Javascript object>

In [38]:
# Lets reduce the dimensionality of our embedding array
pca = PCA(n_components=2)
X_pca = pca.fit(X).transform(X)

<IPython.core.display.Javascript object>

In [39]:
# Visualise the 2-dimensional data and explore
# Note: the data will become less 'messy' if you were to subset the dataset on particular assets of interest

tokens_to_show = 500

df_embed_original = pd.DataFrame(
    {
        "x": X_pca[:, 0],
        "y": X_pca[:, 1],
        "word": w2v_model.wv.index_to_key,
        "label": np.array(["original"] * len(X_pca)),
    },
    columns=["x", "y", "word", "label"],
)

fig = px.scatter(
    df_embed_original.iloc[:tokens_to_show],
    x="x",
    y="y",
    text="word",
    title="Principle Component Analysis - Embedded Words from maintenance work orders",
)
fig.update_traces(textposition="top center")
fig.update_layout(xaxis_title="Dimension 1", yaxis_title="Dimension 2")
fig.show()

<IPython.core.display.Javascript object>

Being able to visualise our technical text can provide us with immediate insight into the way information is stated in text, but what else can we do with these numerical representations? We could:
- Find activities performed in response to an end-of-life event of interest
- Aggregate the word level embeddings for entire work orders and have a document embedding we could use to find work orders that share similar meaning.

## Activity 3.B.1

Post your answers on [Menti]()
- What are your thoughts on the fundamentals of NLP? Is there anything surprising that you have learnt?
- How do you deal with noisy text in your work?
- How would you tackle fixing this text - "re*pl;ace the ## 1 p/p   impeller!! @ the pump stn"?

### 3.B.1 - Takeaways
- Tokenization is a crucial part of natural language processing
- The cleanliness of data can impact on the ability to tokenize
- Using text preprocessing blindly can accidentally remove meaning from texts e.g. removing stopwords that have technical important
- Embeddings are a powerful technique for numerically representing meaning in text 

#### 3.B.2 - Text Cleaning
There are various methods that can be adopted to improve the quality of text data such as using off-the-shelf tools, dictionaries, learning algorithms, etc.

For technical language, which maintenance records are composed of, off-the-shelf tools do not work very effectively due to the specific language used. Dictionaries are a popular approach due to their simplicity, for example a dictionary could be of the format:

```
rp : replace
rple : replace
replc : replace
```

However, the use of such dictionaries is challenging as they are most useful in `find-and-replace` strategies. For example, `rple pump impeller` we could find that the word `rple` exists in our dictionary and normalise it to `replace`. This would result in improved coverage of any techniques we were to apply to our data, however in instances where words are ambiguous and change meaning based on their context, this becomes difficult. Consider the two texts `replace a/c cable` and `replace a/c vents`. The former may be an abbreviation for `a/c -> alternating current` and the latter `a/c -> air conditioner`, hence, by doing a `find-and-replace` strategy we may erroneous normalise our data.

Here we'll develop a function to perform some basic text cleaning. In this notebook, our main focus is on identifying words and phrases that indicate end-of-life, hence identifiers, numbes, etc, can be removed as they increase our vocabulary unncessarily. The steps we'll perform here, including:
- Removing special characters
    - `**c/out gearbox` → `c/out gearbox`
- Removing superfluous whitespace
    - `idler    bearing replacement` → `idler bearing replacement`
- Removing stopwords (cautiously)
    - `replace the bearing` → `replace bearing`
- Replacing known noisy words/abbreviations with a controlled dictionary
    - `c/out bearing` → `change out bearing`

Lets write functions to perform each of these cleaning stages. We'll use an exemplar text that has the form of `re*pl;ace the ## 1 p/p   impeller!! @ the pump stn` and our goal is to normalise it deterministacally to **`replace the number 1 pump impeller at the pump station`**.

Note: What we're creating here are called [lambda functions](https://www.w3schools.com/python/python_lambda.asp) in python.

Here we'll use the package called `re` which is for [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) in Python. There are many tools to help you develop regular expressions online such as [regexr](https://regexr.com/).

In [40]:
import re

<IPython.core.display.Javascript object>

In [41]:
test_text = "re*pl;ace the ## 1 p/p   impeller!! @ the pump stn"

<IPython.core.display.Javascript object>

In [42]:
# Remove all characters except for alphanumerical and reserved special characters e.g. @, -, #, /, and .
# Note: We're using regular expressions here, so some characters need to be 'escaped' as they are special 'metacharacters'
# Refer to this for further information: https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

chars_to_keep = "\.\#\@\/"

fnc_rmv_chars = lambda text: re.sub(rf"[^a-zA-Z0-9 {chars_to_keep}]", "", text)
fnc_rmv_chars(test_text)

'replace the ## 1 p/p   impeller @ the pump stn'

<IPython.core.display.Javascript object>

In [43]:
# Remove duplicate reserved special characters (like ##, @@@, etc.)
# This way if we want to normalise # to number and we have ## we wont' get numbernumber

fnc_rmv_dupe_chars = lambda text: re.sub(rf"([{chars_to_keep}])\1+", r"\1", text)

fnc_rmv_dupe_chars(test_text)

're*pl;ace the # 1 p/p   impeller!! @ the pump stn'

<IPython.core.display.Javascript object>

In [44]:
# Removing superfluous whitespace (e.g. more than 1 whitespace between words)
# Doing this means we do not have tokens that are whitespace e.g. 'replace   engine' -> ['replace', '', '', 'engine']

fnc_rmv_whitespace = lambda text: re.sub(r" {2,}", " ", text)
fnc_rmv_whitespace(test_text)

're*pl;ace the ## 1 p/p impeller!! @ the pump stn'

<IPython.core.display.Javascript object>

In [45]:
# Removing stopwords (cautiously)
# Note: stopwords are unigrams, so we can simply split our text on whitespace and check whether each token
# is in a stopword list; if it is, skip it, otherwise keep it.

# Recall that some stopwords are meaningful for us, hence we will filter the list down first.

stopwords_to_keep = [
    "on",
    "off",
    "over",
    "under",
    "no",
    "not",
    "don",
    "don't",
    "aren",
    "aren't",
    "no",
    "not",
    "didn't",
    "doesn",
    "doesn't",
    "hadn",
    "hadn't",
    "hasn",
    "hasn't",
    "haven",
    "haven't",
    "isn",
    "isn't",
    "won",
    "won't",
    "wouldn",
    "wouldn't",
]
filtered_stopwords = [
    word for word in stopwords.words("english") if word not in stopwords_to_keep
]

fnc_rmv_stopwords = lambda text: " ".join(
    [token for token in text.split(" ") if token not in filtered_stopwords]
)

fnc_rmv_stopwords(test_text)

're*pl;ace ## 1 p/p   impeller!! @ pump stn'

<IPython.core.display.Javascript object>

In [46]:
# Normalising noisy words with a controlled dictionary

word_normalisation_dictionary = {
    "#": "number",
    "@": "at",
    "u/s": "unserviceable",
    "changeout": "change out",
    "c/o": "change out",
    "c/out": "change out",
    "rplc": "replace",
    "p/p": "pump",
    "stn": "station",
    "repl": "replacement",
}

# Note this could be performed entirely with more complex regular expressions
def dictionary_normalisation(text: str, norm_dict: dict) -> str:
    for noisy_word, clean_word in norm_dict.items():
        if noisy_word in chars_to_keep:
            text = text.replace(noisy_word, f" {clean_word} ")
        text = re.sub(rf"(?<!-)\b{noisy_word}\b(?!-)", f" {clean_word} ", text)

    return text


# test on test_text
dictionary_normalisation(text=test_text, norm_dict=word_normalisation_dictionary)

're*pl;ace the  number  number  1  pump    impeller!!  at  the pump  station '

<IPython.core.display.Javascript object>

In [47]:
# Putting them all together. Note the order of these operations is important.
# We don't want to try and remove stopwords or normalise words when they have special characters, or extra whitespace etc.


def text_cleaner(text: str, norm_dict: dict) -> str:
    """Cleans and normalises a given text. Steps performed include: removing special characters, removing duplicate non-alphanumerical characters,
    removing stopwords, normalising terms using a dictionary, and remvoing superfluous whitespace"""
    text = fnc_rmv_chars(text)
    text = fnc_rmv_dupe_chars(text)
    text = fnc_rmv_stopwords(text)
    text = dictionary_normalisation(text, norm_dict)
    text = fnc_rmv_whitespace(text)

    return text

<IPython.core.display.Javascript object>

In [48]:
# Lets check whether we have achieved our goal of normalising the starting text
cleaned_text = text_cleaner(text=test_text, norm_dict=word_normalisation_dictionary)
print(f"Input: {test_text}\nOutput: {cleaned_text}")

Input: re*pl;ace the ## 1 p/p   impeller!! @ the pump stn
Output: replace number 1 pump impeller at pump station 


<IPython.core.display.Javascript object>

The steps we've seen are typical for preprocessing noisy natural language texts, however newer technology has currently started to focus on this problem using deep learning algorithms. Similar to technologies such as the spell checker on your mobile phone, and services such as Grammarly, these technologies aim to fix noisy text without writing heuristics as we've done here.

However, the technical text which you deal with is not amenable to these types of services as the language used is too domain-specific. We will not go into the details of it in this notebook, but research in the [UWA NLP-TLP group](https://nlp-tlp.org/) is developing translation models to go from noisy technical text to clean technical text without any rules. Just like English to French translation, the group treats noisy to clean text in a similar way, going from:

```
Input: re*pl;ace the ## 1 p/p   impeller!! @ the pump stn
Output: replace number 1 pump impeller at pump station
```

Before we move onto the next section, lets quickly overlay our data after its cleaned on the embeddings we made previously.

In [49]:
# Create dimensionality reduced embedding data for cleaned dataset
cleaned_tokenized_mwo_corpus_for_embedding = [
    word_tokenize(
        text_cleaner(text=text.lower(), norm_dict=word_normalisation_dictionary)
    )
    for text in mwo_corpus
]
cleaned_phrase_model = phrases.Phrases(
    cleaned_tokenized_mwo_corpus_for_embedding,
    min_count=5,
    threshold=10,
    connector_words=phrases.ENGLISH_CONNECTOR_WORDS,
)
cleaned_w2v_model = Word2Vec(
    sentences=[
        cleaned_phrase_model[t] for t in cleaned_tokenized_mwo_corpus_for_embedding
    ],
    vector_size=300,
    min_count=5,
    window=3,
    epochs=50,
)
X_cleaned = np.array(
    [cleaned_w2v_model.wv[word] for word in cleaned_w2v_model.wv.index_to_key]
)
pca_cleaned = PCA(n_components=2)
X_pca_cleaned = pca.fit(X_cleaned).transform(X_cleaned)

<IPython.core.display.Javascript object>

In [50]:
# Create dataframes of both datasets to plot
df_embed_cleaned = pd.DataFrame(
    {
        "x": X_pca_cleaned[:, 0],
        "y": X_pca_cleaned[:, 1],
        "word": cleaned_w2v_model.wv.index_to_key,
        "label": np.array(["cleaned"] * len(X_pca_cleaned)),
    },
    columns=["x", "y", "word", "label"],
)
df_embed_comparison = pd.concat([df_embed_original, df_embed_cleaned])

<IPython.core.display.Javascript object>

In [51]:
fig_embed_comparison = px.scatter(
    df_embed_comparison.sort_values(by="word").iloc[:500],  # .sample(500),
    x="x",
    y="y",
    text="word",
    title="Comparison of embedded words from maintenance work orders (original and cleaned)",
    color="label",
)
fig_embed_comparison.update_traces(textposition="top center")
fig_embed_comparison.update_layout(xaxis_title="Dimension 1", yaxis_title="Dimension 2")
fig_embed_comparison.show()

<IPython.core.display.Javascript object>

In [52]:
word_to_compare = "not_working"
print("Original")
original_sim = get_similar_words(model=w2v_model, word=word_to_compare)
print("\nCleaned")
cleaned_sim = get_similar_words(model=cleaned_w2v_model, word=word_to_compare)

Original
investigate/repair (67.2%)
faulted (64.0%)
false_trips (62.2%)
active (62.2%)
along (61.8%)
reattach (61.2%)
fault_finding (60.7%)
faulty (59.9%)
not_resetting (58.8%)
activated (58.7%)

Cleaned
investigate/repair (67.2%)
activated (65.6%)
faulted (64.7%)
fault_find (62.1%)
fault_finding (61.3%)
reconnect (60.7%)
faulty (60.5%)
active (60.4%)
ground (60.3%)
confirm (60.0%)


<IPython.core.display.Javascript object>

Because embedings are learnt based on co-occurrence, e.g. similiar words are used in similar contexts, we see that the cleaning effort doesn't make a massive difference on embedding similarities. However, in the next sections of this notebook, and next week we'll see why this stage is important for natural language processing on technical texts.

A natural question that occurs when using dictionary based lexical normalsiation is where do the dictionaries come from? There are two methods - 1. you can curate them manually from your own knowledge and available resources, or 2. use available software such as [LexiClean](https://lexiclean.nlp-tlp.org) (developed by the UWA NLP-TLP group) which helps you quickly normalise texts and build replacement dictionaries.

### Activity 3.B.2 Part A <a class="anchor" id="3-B-activity"></a>

Given the text below:
- Tokenize it using the NLTK tokenizer, and
- Extract all of the phrases.

In [None]:
# Activity 3.B.2 Part A
a3b2a_text = "[PM] change out idler bearing."

Uncomment and run the following cell to see the answer to this activity (after you try it!).

In [None]:
# %load ./solutions/W3/tokenization_phrasing.py

### Activity 3.B.2 Part B <a class="anchor" id="3-B-2-activity"></a>

Given the following MWO corpus, first clean the texts and then find the total number of tokens and the top 10 most frequent tokens.

Optional:
- Top 10 rarest tokens

In [53]:
# Corpus for getting intuition for fundemantals of NLP
# this will include noisy data to ensure that we can highlight the rarity of words, etc.
random.seed(1234)
a3b2b_corpus = random.sample(mwo_corpus, 25)
print("\n".join(a3b2b_corpus))

161/162 Transfer repairs parent metal
8W ELE ON INSP Conveyor CVR029
12W MEC SDN REPL Damaged Idlers CVR029
4W MEC ON INSP Transfer Chute CVR323
4W LUB ON INSP CVR003 SSC008
8W MEC SDN RPL Idlers CVR301
1W BLT ON INSP CVR161
48W OFF Replace Sec Scraper Tips CVR062
CVR005 Scope CVR005 Chute Mods Xfer
2W LUB ON INSP Conveyor CVR063
4W MEC ON INSP Conveyor CVR068
CVR023 BRK001 Replace Caliper/Brake Pad
CVR061 Replace Idlers
Modify Belt Reeler Kickrails
6M MEC SDN CPMS T/U Winch CVR123
1W BLT ON INSP CVR165
1W LUB ON INSP CVR122
4W CMN ON INSP CVR163
CVR115 Standard belt replacement
Supply Oil Cooler Motor CVR122
13W MEC SDN INS Brake CVR041
Impact idler frame tab broken off.
CVR065 Scope Boom Spray Water Valve
CVR061 Replace idlers and rollers
CVR241 RPL Sec & Tert Scrapers


<IPython.core.display.Javascript object>

In [54]:
# Use this area for activity 3.B


<IPython.core.display.Javascript object>

Uncomment and run the following cell to see the answer to this activity (after you try it!)

In [55]:
# %load ./solutions/W3/corpus_stats.py

<IPython.core.display.Javascript object>

### 3.C - Identification of end-of-life events  <a class="anchor" id="3-C"></a>
</br>
<center>
    <img src="./images/nb_3-1_flow_diagram-C.png"/>
</center>
</br>
Now that we have an intuition towards the fundamental concepts of natural language processing such that we can speak a common language, lets use what we've learnt to identify end-of-life events in maintenance work order short text. We treat end-of-life (EOL) events as any event that results in the inability of an item to function (e.g. functional failure). For example, "mechanical seal blown on pump" would indicate an EOL event for the pump.


There are numerous ways this can be tackled, but the two we will explore in this notebook include:
- 1. Basic keyword search (e.g. what we would do if we were using spreadsheet software), annd
- 2. Expanded keyword search by using word embeddings to identify trigger terms

For each of the proposed techniques that we will review, they become increasingly data intensive but more effective.
- Basic keyword search - no annotated data, least effective
- Expanded keyword search using embeddings - low amount of annotated data, more effective than basic keyword search

**⚡ Concept - Annotation**</br>
Annotation is the process of labeling data to guide models and teach them to predict the outcome we want. Consider the text `replace air compressor` - we may want a model that can predict the word `replace` is an `activity`, so we would annotate a number of examples texts expressing this information and hopefully be able to teach a model to perform this task automatically.

## 3.2.C.1 - Basic Keyword Search  <a class="anchor" id="3-C-1"></a>

The technique we'll explore is `basic keyword searching`, that is, like in spreadsheet software (Excel, etc.) we will search our data for words or phrases that may indicate an end-of-life event. For example, consider the texts:
- `replace` pump impeller
- Bredel pump `blown`
- `cracked` piston in piston pump
- PU001 `not pumping` efficiently

From these texts, we could elicit the terms `replace`, `blown`, `cracked`, `not pumping` as indicators of end-of-life. 

First, lets review our dataset and get some group intuitions for this task 

In [56]:
# Lets make a copy of our original dataset
df_kws = df_large.copy()

df_kws_len = len(df_kws)  # Get the length of the copied dataset to use for comparisons

<IPython.core.display.Javascript object>

In [57]:
print(f"Records in dataset: {len(df_kws)}")

Records in dataset: 45039


<IPython.core.display.Javascript object>

In [58]:
# Print dataset
df_kws.head(5)

Unnamed: 0,id,description,basic_start_date,priority,...,functional_loc_desc,functional_loc,area,object_desc
0,1,3Y MEC SDN REPL Conveyor Belt CVR068,2017-01-02,High,...,Belt-CVR068,1071-30-25-09-CVR068-MECH-BLT001,mine,Belt
1,2,CVR029 Replace Conveyor Belt,2020-10-28,High,...,Belt CVR029,1071-30-05-07-CVR029-MECH-BLT001,mine,Belt
2,3,144W MEC SDN REPL CVR069 Conveyor Belt,2018-10-13,High,...,Belt CVR069,1071-30-25-09-CVR069-MECH-BLT001,mine,Belt
3,4,3Y MEC SDN REPL Conveyor Belt CVR069,2021-05-05,High,...,Belt CVR069,1071-30-25-09-CVR069-MECH-BLT001,mine,Belt
4,5,CVR029 Replace Conveyor Belt,2018-06-22,High,...,Belt CVR029,1071-30-05-07-CVR029-MECH-BLT001,mine,Belt


<IPython.core.display.Javascript object>

### Activity 3.C.1 - Identifying end-of-life events (EOL) via key-word search  <a class="anchor" id="3-C-1-activity"></a>

Post your answers on [Menti]().

With reference to the work order dataset,
- What words or phrases would you use to search for end-of-life events?
- What limitations do you think there are to the approach of key-word search?

Lets use the `pandas` package to perform basic keyword search on our dataset.

This is how we could filter all of our records for the term `replace`, similar to what we saw in the previous weeks.

In [59]:
# Let's filter our pump dataset for all records that contain the term `replace`
# We will ignore the casing of the words here so that replace and Replace are treated equally.

search_term = "replace"  # Change to any EOL word/phrase

df_kws_replace = df_kws[df_kws["description"].str.contains(search_term, case=False)]
print(
    f"Number of records containing {search_term}: {len(df_kws_replace)}/{df_kws_len} ({len(df_kws_replace)/df_kws_len*100:0.2f}%)"
)

Number of records containing replace: 4370/45039 (9.70%)


<IPython.core.display.Javascript object>

We obviously know that there are numerous words and phrases that can indicate EOL, so lets expand our search.

In [60]:
# Lets expand our vocabulary and search for other words
# Note: the term `vocabulary` is used here similar to
# the dictionary definition, however the subject is specifically words that represent end-of-life events
eol_terms = ["replace", "change out"]
pattern = "|".join(eol_terms)

df_kws_eol = df_kws[df_kws["description"].str.contains(pattern, case=False)]
print(
    f"Number of records matching EOL dictionary: {len(df_kws_eol)}/{df_kws_len} ({len(df_kws_eol)/df_kws_len*100:0.2f}%)"
)

Number of records matching EOL dictionary: 4617/45039 (10.25%)


<IPython.core.display.Javascript object>

### 3.C.1 - Identifying end-of-life events (EOL) via key-word search
#### Take aways
- Key-word search is useful *IF* your data has no data quality issues *AND* you have an extensive vocabularly of search terms that have broad coverage of end-of-life terms.
- Key-word search does not help us distinguish whether the `item` in question is functional. For example, "replace pump paint" may include an EOL term "replace" but the semantics of the text is not something we would consider as evidence for statistical life data analysis.

#### Questions raised
- How do we find end-of-life terms to search for, outside of those most common such as `replace`, `change out`, etc?
- How do we deal with poor data quality that impacts our search? For example, `rpl` and `replc` being erroneous versions of `replace`.
- How do we isolate our searches for functional items that contribute to the lifetime of the asset rather than ancilliaries?

### 3.C.2 - Expanded keyword search using embeddings  <a class="anchor" id="3-C-2"></a>
The next technique we will explore uses the concept of word embeddings to create a dictionary (or "gazetteer") of search terms that we can use to expand our EOL terms. This is useful if we cannot exhaustively list all of the ways EOL events can be expressed in maintenance texts.

The initial steps we'll take here use the concepts we've already been exposed to, namely, ngrams and word embeddings.

In [61]:
# Lets make another copy of our original dataset
df_expanded_kws = df_large.copy()

<IPython.core.display.Javascript object>

In [62]:
# Lets create a corpus from our maintenance texts
mwo_corpus_expanded_kws = df_expanded_kws["description"].tolist()

print(mwo_corpus_expanded_kws[:3])

['3Y MEC SDN REPL Conveyor Belt CVR068', 'CVR029 Replace Conveyor Belt', '144W MEC SDN REPL CVR069 Conveyor Belt']


<IPython.core.display.Javascript object>

We can make the assumption that lowercasing the corpus will not impact the ability to identify end-of-life terms as these are not represented as proper nouns, etc. However, removing casing will improve our embeddings.

In [63]:
# Like we did in the fundamentals section, we will lowercase and tokenize all of our texts
tokenized_mwo_corpus_expanded_kws = [
    word_tokenize(
        text_cleaner(text=text.lower(), norm_dict=word_normalisation_dictionary)
    )
    for text in mwo_corpus_expanded_kws
]

print(tokenized_mwo_corpus_expanded_kws[:3])

[['3y', 'mec', 'sdn', 'replacement', 'conveyor', 'belt', 'cvr068'], ['cvr029', 'replace', 'conveyor', 'belt'], ['144w', 'mec', 'sdn', 'replacement', 'cvr069', 'conveyor', 'belt']]


<IPython.core.display.Javascript object>

In [64]:
# We know that ngrams capture important information, so we'll create a phraser model that automatically identifies ngrams from our corpus
phrase_model_expanded_kws = phrases.Phrases(
    tokenized_mwo_corpus_expanded_kws,
    min_count=5,
    threshold=10,
    connector_words=phrases.ENGLISH_CONNECTOR_WORDS,
)

<IPython.core.display.Javascript object>

In [65]:
# Lets check that our phraser model works as expected
phrase_model_expanded_kws[["change", "out", "converyor", "idler"]]

['change_out', 'converyor', 'idler']

<IPython.core.display.Javascript object>

In [66]:
# Here we'll train an embedding model on our phrased data
w2v_model_expanded_kws = Word2Vec(
    sentences=[phrase_model_expanded_kws[t] for t in tokenized_mwo_corpus_expanded_kws],
    vector_size=300,
    min_count=5,
    window=3,
    epochs=50,
)

<IPython.core.display.Javascript object>

Now that we have an embedding model trained on our corpus that has been passed through our phrasing model, we can exploit the numerical space of the embedding to find new EOL terms to improve the coverage of our keyword search process.

In [67]:
# Recall that given a word, we can find all of those that are similar.
# Hence, words used in similar contexts will share similar numerical values (be close in n-dimensional space)
# Therefore, we can start with a few seed terms and iteratively explore the embedding vector space to
# incrementally build up a vocabularly of EOL terms.

# Our initial kws terms (similar to the second part of the previous activity)
eol_expanded_kws_terms = ["replace", "change_out"]

<IPython.core.display.Javascript object>

In [68]:
def get_similar_terms(
    model, term: str, saved_terms: list, irrelevant_terms: list, topn: int
) -> tuple:
    """
    Function for finding terms similar to a provider term whilst discarding terms already seen.
    """

    tries = 0
    max_tries = 100
    similar_terms = []
    similar_terms_proba = []
    while len(similar_terms) <= topn:
        similar_terms_temp, similar_terms_proba_temp = zip(
            *model.wv.most_similar([term], topn=topn * 10)
        )

        # remove similar terms already in terms list
        new_terms_idx = [
            idx
            for idx, similar_term in enumerate(similar_terms_temp)
            if similar_term not in irrelevant_terms + saved_terms
        ]
        # slice terms and proba lists
        similar_terms_temp_slice = np.array(list(similar_terms_temp))[new_terms_idx]
        similar_terms_proba_temp_slice = np.array(list(similar_terms_proba_temp))[
            new_terms_idx
        ]

        similar_terms.extend(similar_terms_temp_slice)
        similar_terms_proba.extend(similar_terms_proba_temp_slice)

        tries += 1

        if tries == max_tries:
            return None, None

    return (similar_terms[:topn], similar_terms_proba[:topn])

<IPython.core.display.Javascript object>

In [69]:
test_terms, test_probs = get_similar_terms(
    model=cleaned_w2v_model,
    term="replace",
    saved_terms=["changed"],
    irrelevant_terms=["replaced", "change_out"],
    topn=10,
)
print(test_terms, test_probs)

['need_replacing', 'change', 'cvr023replace', 'cvr102replace', 'replace/repair', 'rep', 'repair', 'near', 'repalce', 'popped'] [0.6875156164169312, 0.5918599367141724, 0.5803054571151733, 0.5482270121574402, 0.5387700200080872, 0.536491870880127, 0.5334898233413696, 0.4799257218837738, 0.47497066855430603, 0.47295424342155457]


<IPython.core.display.Javascript object>

As we can see, we can easily identify top-n phrases that are similar to some other given phrase. Now, if we knew some seed end-of-life phrases (such as 'replace', 'change out', etc.), we could easily search the embedding space to find all of the other ones using the semantics and syntactic properties represented in our learnt embedding space. An overview of the process we are going to perform is shown below.

<img src="./images/nb_3-1_term_expansion.png" width="75%"/>

In [70]:
# Application state
idx = 0
topn = 10
terms = ["replace", "change_out", "failed"]
irrelevant_terms = []
finished = False
sampled_terms, _ = get_similar_terms(
    model=cleaned_w2v_model,
    term=terms[0],
    saved_terms=terms,
    irrelevant_terms=irrelevant_terms,
    topn=topn,
)

# Application and logic
title = widgets.HTML(
    value=f"Term <b>{terms[0]}</b> (1/{len(terms)})",
)

button = widgets.Button(
    description="Save and continue",
    tooltip="Click to save and continue to next term",
    button_style="",
    icon="check",
)

finish_btn = widgets.Button(
    description="Im finished", tooltip="Click to finish", button_style="", icon="check"
)

multiselect = widgets.SelectMultiple(
    options=sampled_terms,
    value=[],
    rows=topn,
)
output = widgets.Output(layout={"border": "1px solid black"})


@output.capture()
def on_button_click(b):
    global idx, finished, saved_terms, terms, sampled_terms
    clear_output()
    finished = idx == len(terms) - 1

    # Add selected terms to running list
    terms.extend(list(multiselect.value))
    # Add terms not selected to irrelevant list
    irrelevant_terms.extend(list(set(sampled_terms) - set(multiselect.value)))

    print(f"Terms saved: {len(terms)}\nTerms Discarded: {len(irrelevant_terms)}")

    if not finished:
        idx += 1
        next_term = terms[idx]
        title.value = f"Term <b>{terms[idx]}</b> ({idx+1}/{len(terms)})"

        # Sample terms
        sampled_terms, sampled_probs = get_similar_terms(
            model=cleaned_w2v_model,
            term=next_term,
            saved_terms=terms,
            irrelevant_terms=irrelevant_terms,
            topn=topn,
        )

        multiselect.options = sampled_terms

    else:
        b.disabled = True
        multiselect.disabled = True


button.on_click(on_button_click)


@output.capture()
def on_finish_btn_click(b):
    global finished
    b.disabled = True
    multiselect.disabled = True
    button.disabled = True


finish_btn.on_click(on_finish_btn_click)

box_layout = widgets.Layout(
    display="flex", flex_flow="column", align_items="center", width="50%"
)

box = widgets.HBox(
    children=[title, multiselect, button, finish_btn, output], layout=box_layout
)

<IPython.core.display.Javascript object>

Here we are going to use a widget built using [ipywidgets](https://ipywidgets.readthedocs.io/en/stable/index.html) to interactively identify end-of-life terms (phrases and words) to build up our end-of-life dictionary to improve coverage of our end-of-life identification.

In [71]:
# Show application
display(box)

HBox(children=(HTML(value='Term <b>replace</b> (1/3)'), SelectMultiple(options=('need_replacing', 'replaced', …

<IPython.core.display.Javascript object>

Now that we've elicited our end-of-life terms and phrases, lets have a look a the terms we identified and those we disregarded.

In [72]:
# Lets look at the terms we identified
list(set(terms))[:10]

['collasped',
 'failed',
 'need_replacing',
 'siezed',
 'change_out',
 'replacing',
 'collapsed',
 'callapsed',
 'seized',
 'colapsed']

<IPython.core.display.Javascript object>

In [73]:
# Lets look at the terms we discarded
irrelevant_terms[:10]

['changed',
 'near',
 'change',
 'cvr102replace',
 'replace/repair',
 'repair',
 'cvr023replace',
 'rep',
 'upgrade',
 'refit']

<IPython.core.display.Javascript object>

Let's compare the coverage we got from using NLP techniques to find additional EOL terms.

In [74]:
# We need to remove the underscore added by our phrasing model before we can string search as these do not exist in the original texts
eol_expanded_kws_terms = [token.replace("_", " ") for token in list(set(terms))]

<IPython.core.display.Javascript object>

In [75]:
print(f'Number of expanded search terms: {len(eol_expanded_kws_terms)}')

Number of expanded search terms: 13


<IPython.core.display.Javascript object>

In [76]:
# As we can see, with only a small amount of effort we're able to identify many more records that may have potential for evidence in statistical data life analysis.

expanded_pattern = "|".join(eol_expanded_kws_terms)

df_expanded_kws_eol = df_expanded_kws[
    df_expanded_kws["description"].str.contains(expanded_pattern, case=False)
]

<IPython.core.display.Javascript object>

In [77]:
print(
    f"""
1. Basic keyword search
{len(df_kws_eol)}/{df_kws_len} ({len(df_kws_eol)/df_kws_len*100:0.2f}%)

2. Expanded keyword search
{len(df_expanded_kws_eol)}/{df_kws_len} ({len(df_expanded_kws_eol)/df_kws_len*100:0.2f}%)

Additional records identified
{len(df_expanded_kws_eol)-len(df_kws_eol)} ({(1 - (len(df_kws_eol)/len(df_expanded_kws_eol))) * 100:0.1f}%)
"""
)


1. Basic keyword search
4617/45039 (10.25%)

2. Expanded keyword search
5179/45039 (11.50%)

Additional records identified
562 (10.9%)



<IPython.core.display.Javascript object>

### Activity 3.C.2 - Expanded keyword search using embeddings  <a class="anchor" id="3-C-2-activity"></a>

Post your answers in [Menti]()
- What are your thoughts on the process we've employed so far?
- What limitations do you think there are to this approach?

### 3.D - Classification of end-of-life events  <a class="anchor" id="3-D"></a>
<img src="./images/nb_3-1_flow_diagram-D.png"/>

Now that we have an idea of how to identify end-of-life events in maintenance texts, we need to know how to classify these pieces of evidence as either failure or suspension before we can perform Weibull analysis.


- End-of-life classification: 
Now that we have an idea of the process of eliciting end-of-life events from unstructured text in maintenance work orders, how do we classify these into failures or suspensions? 

- Primer on end-of-life identification and classification: 
Given a set of MWOs, how do we determine which mention end-of-life events, and of those that do, how do we determine whether they are failures or suspensions?

In [78]:
# Lets make another copy of the original dataframe
df_eol_clf = df_large.copy()

<IPython.core.display.Javascript object>

In [79]:
# Lets use the EOL terms we extracted to mark each row in the dataframe as either having an EOL or not
# Recall that the `expanded_pattern` was created in the previous section

df_eol_clf["eol"] = df_eol_clf["description"].str.contains(expanded_pattern, case=False)

<IPython.core.display.Javascript object>

In [80]:
# Lets look at the boolean column we added with the descriptions and order types
df_eol_clf[["description", "eol", "wo_order_type"]].iloc[:25]

Unnamed: 0,description,eol,wo_order_type
0,3Y MEC SDN REPL Conveyor Belt CVR068,False,PM02
1,CVR029 Replace Conveyor Belt,True,PM01
2,144W MEC SDN REPL CVR069 Conveyor Belt,False,PM02
3,3Y MEC SDN REPL Conveyor Belt CVR069,False,PM02
4,CVR029 Replace Conveyor Belt,True,PM01
5,CVR062 Change out belt,True,PM01
6,CVR030 (ENG) Project Works,False,PM01
7,2Y MEC SDN REPL Conveyor Belt CVR323,False,PM02
8,CB MEC SDN REPL Conveyor Belt CVR061,False,PM02
9,CVR064 Belt Replacement,True,PM01


<IPython.core.display.Javascript object>

Now that we have identified records with potential EOL events, we need to classify them as either failure or suspension. 

To classify an identified end-of-life event as failure or suspension, we'll leverage the fortuitous data fields accompanying the work order texts.

For our dataset, we'll use the PM codes where we'll use 'expert logic' to say that anything that is PM01/PM03 (corrective/breakdown - if PM03 is applicable) is a failure and anything that is preventative/planned (PM02) is a suspension. If the data provides more granular information, you could use that instead. At the minimum, maintenance work orders should contain PM codes.

In [81]:
# Here we will specify the column we'll use to classify the work as failure/suspension
fail_suspension_col = "wo_order_type"

<IPython.core.display.Javascript object>

In [82]:
# Lets take a look at all the unique values in this column
# Like we saw earlier, we can see a clear distribution of the types of work orders that are created.
df_eol_clf[fail_suspension_col].value_counts()

PM02    25640
PM01    19384
PM05        9
PM03        6
Name: wo_order_type, dtype: int64

<IPython.core.display.Javascript object>

Using the values in the failure/suspension column, we can perform the classification of the EOL events.

In [83]:
# Here we will say that anything that is corrective/breakdown is PM01 and
# anything that is planned/preventative is PM02. If the order type is anything else (e.g. PM03/PM05) we'll ignore it
fail_values = ["PM01"]
suspension_values = ["PM02"]

<IPython.core.display.Javascript object>

In [84]:
# Lets add a new column to our dataset that classifies the record as a failure (F) or suspension (S)
df_eol_clf["fs_clf"] = np.where(
    (df_eol_clf[fail_suspension_col].isin(suspension_values)),
    "suspension",
    np.where(df_eol_clf[fail_suspension_col].isin(fail_values), "failure", "other"),
)

<IPython.core.display.Javascript object>

In [85]:
# Lets take a look at the work orders classified as 'other' - do we need to include these?
df_eol_clf[df_eol_clf["fs_clf"] == "other"]

Unnamed: 0,id,description,basic_start_date,priority,...,area,object_desc,eol,fs_clf
56,57,CVR030 Belt Replacement Breakdown,2018-06-07,High,...,mine,Fixed Stacker Boom Conveyor,True,other
334,335,CVR068 Minprovise Recovery Team,2020-03-04,High,...,mine,Stockyard Reclaim Conveyor,False,other
1335,1336,CVR068 Eilbeck labour hire,2020-03-03,High,...,mine,Gravity Take Up,False,other
1458,1459,JAW CRUSHER INTEGRATION,2017-02-27,High,...,mine,Overland Conveyor South,False,other
3269,3270,Supply spare Rollers to SCT,2016-04-07,Immediate,...,mine,Idlers,False,other
5800,5801,Supply 5 carry rollers to SCT,2016-03-05,Urgent,...,mine,Idlers,False,other
27253,27254,Spares WO CVR030 Return Frames,2019-10-04,High,...,mine,Idlers,False,other
27254,27255,Supply spare Gearbox to SCT,2015-12-10,Urgent,...,mine,Gearbox,False,other
27447,27448,SCT request for spare - 1250kw motor,2016-04-20,Medium,...,port,Motor,False,other
27448,27449,Replace Drive 2 Motor,2016-10-01,Immediate,...,port,Drive Assembly,True,other


<IPython.core.display.Javascript object>

Now that we have both identified potential EOL events and provided a classification for them based on a structured fortuitous field, we can now filter the dataset for records that may be used as evidence for statistical data life analysis. Here, we will remove all records that do not have an EOL event identified.

In [86]:
# By default, pandas will return the slice of the dataframe for all the instances that are True
df_eol_filtered = df_eol_clf[df_eol_clf["eol"]]

<IPython.core.display.Javascript object>

In [87]:
print(f"Records with potential EOL events: {len(df_eol_filtered)}/{len(df_large)}")

Records with potential EOL events: 5179/45039


<IPython.core.display.Javascript object>

Before we can move onto using these records in our analysis for reliability measures, we first need to try and reason about whether the events actually manifested. For example, if a work order text indicated the change out of a component but it either hasn't been actioned (has no start date) or has no, or low, associated cost/time, do we believe that this should be included in our analysis?

### Activity 3.D - Classification of end-of-life events  <a class="anchor" id="3-D-activity"></a>

Post your answers to [Menti]()
- What fields would you use to help reason about whether an end-of-life event actually occurred?

### 3.2.E - Reasoning about end-of-life events  <a class="anchor" id="3-E"></a>
<img src="./images/nb_3-1_flow_diagram-E.png"/>

Using the dataframe with identified EOL events and classifications, we can further filter the dataframe using intuition. This approach is iterative, but gives us a good rule of thumb.

First, we know that we need the time between event to calculate reliability measures, so lets remove all records that do not have an actual start date.

In [88]:
# Lets perform a preliminary filter of the dataframe.
# We know to get reliability measures such as MTTF/MTBF we need to know a date, for this work we'll take the actual start date as being the point of end-of-life
# However, this date could also be the creation date of the work order or associated notification.
df_eol_filtered = df_eol_filtered[~df_eol_filtered["actual_start_date"].isna()]

<IPython.core.display.Javascript object>

In [89]:
print(f'Number of filtered records with actual start date {len(df_eol_filtered)}')

Number of filtered records with actual start date 4240


<IPython.core.display.Javascript object>

Second, we know that from experience that just because a maintenance work order is raised, it doesn't mean it is executed, so we need a proxy for determining whether work was carried out. Here, we'll use the fortuitious field of actual total cost and/or hours. However, fields from other linked data sources could also be used, such as spares information to determine whether a part was changed out.

Lets analyse the data we currently have to gauge what threshold seems reasonable to set initially. Note we could use our profiled data at the start of this notebook to gain similar insight.

In [91]:
# We can use the handly 
df_eol_filtered.describe()

Unnamed: 0,id,total_actual_costs,total_actual_work
count,4240.0,4240.0,4240.0
mean,14655.6,10020.8,9.60474
std,12486.2,48467.3,23.2375
min,2.0,-10351.1,0.0
25%,4155.25,378.772,2.0
50%,8964.5,907.115,4.0
75%,28646.2,3214.3,8.0
max,45055.0,1274710.0,385.0


<IPython.core.display.Javascript object>

Lets set a general threshold for all of the assets at once, however in the future you could improve this logic by finding asset specific thresholds.

In [97]:
cost_threshold = 2000  # dollars
work_threshold = 4  # hours

<IPython.core.display.Javascript object>

In [98]:
## Example process for finding mean values for total acutal cost and work for each floc in our data - We'll leave this as an exercise for those interested to extend this notebook.
#df_floc_groups = df_eol_filtered.groupby(by=['functional_loc'])

## Here we are going to find the mean value for total actual cost and work as an initial threshold for our model
#floc_thresholds = {}
#for name, group in df_floc_groups:
#    print(name, group)
#    
#    floc_thresholds[name] = {'total_actual_costs': group['total_actual_costs'].mean(), 'total_actual_work': group['total_actual_work'].mean()}

<IPython.core.display.Javascript object>

In [99]:
df_eol_filtered = df_eol_filtered[
    (cost_threshold <= df_eol_filtered["total_actual_costs"])
    | (work_threshold <= df_eol_filtered["total_actual_work"])
]

<IPython.core.display.Javascript object>

In [100]:
print(f"Number of work orders after filtering operations: {len(df_eol_filtered)}")

Number of work orders after filtering operations: 3059


<IPython.core.display.Javascript object>

Using the two expert thresholds on the structured fortuitious data fields, we can move onto converting our classified records into a format suitable for Weibull analysis. However, it should be noted that the thresholds we've blanket applied can be different for each asset. Therefore, eliciting more accurate representations of cost/work requirements for different assets will improve the classification process.

### Activity 3.E - Reasoning about end-of-life events  <a class="anchor" id="3-E-activity"></a>
</br>

Post your answers in [Menti]()
- What are the limitations to this approach? How could we make it better?

### 3.F - Statistical data life analysis  <a class="anchor" id="3-F"></a>
<img src="./images/nb_3-1_flow_diagram-F.png"/>

Now that we have filtered our initial dataset down to records that are likely to have had an end-of-life event manifest, we can convert our data into the format expected by the Reliability package for Weibull analysis.

Before we do this, lets first set a threshold for the minimum number of evidence we expect an asset to have before it can have an analysis performed on it. Typically, a minimum of 5 points of evidence are required for statistical significance. By setting it here, we have clear control of the process.

In [101]:
min_evidence_points = 5

<IPython.core.display.Javascript object>

In [102]:
# First lets check out how many failures/suspensions we have in our dataset
df_eol_filtered["fs_clf"].value_counts()

failure       2916
suspension     142
other            1
Name: fs_clf, dtype: int64

<IPython.core.display.Javascript object>

This next step will group each of the records by their functional location and create a failure/suspension dataset.

In [103]:
groups = df_eol_filtered.groupby(["functional_loc"])

<IPython.core.display.Javascript object>

In [104]:
print(f'Number of functional locations at the start of analysis: {len(groups)}')

Number of functional locations at the start of analysis: 535


<IPython.core.display.Javascript object>

In [105]:
# Here we will iterate over the groups of functional locations,
# skipping past any that do not meet the minimum evidence requirements
fs_data_per_group = {}
for name, values in groups:

    if len(values) < min_evidence_points:
        #         print('N\t', name)
        continue
    else:
        #         print(len(values))
        #         print('Y\t', name)
        values = values[["actual_start_date", "fs_clf"]]

        #         print(values.head())

        # Lets sort the groups data by actual_start_date so that the earliest come first
        fs_data = values.sort_values(by=["actual_start_date"], ascending=True)

        #         print(fs_data.head())

        # Calculate the time between event (we'll use days here)
        fs_data["time"] = fs_data["actual_start_date"].diff() / np.timedelta64(1, "D")
        fs_data["time"] = fs_data["time"].fillna(0)

        #         print(fs_data)

        # Encode failures and suspensions as 1 and 0, respectively
        fs_data["fs_clf"].replace(to_replace="failure", value=1, inplace=True)
        fs_data["fs_clf"].replace(to_replace="suspension", value=0, inplace=True)

        #         print(fs_data)

        fs_data_per_group[name] = fs_data[["fs_clf", "time"]]

<IPython.core.display.Javascript object>

In [106]:
# How many flocs do we have sufficient evidence for?
print(
    f"Number of groups with sufficient evidence: {len(fs_data_per_group)} / {len(groups)}"
)

Number of groups with sufficient evidence: 163 / 535


<IPython.core.display.Javascript object>

For each of the groups that have sufficient evidence, we can now use the Reliability package to compute their mean time values using 2-parameter Weibull analysis programmatically.

In [107]:
results_per_group = {}
for name, fs_data in fs_data_per_group.items():

    fs_data = fs_data[
        0 < fs_data["time"]
    ]  # Do not take into account any events that have 0 time

    failures = fs_data[fs_data["fs_clf"] == 1]
    right_censored = fs_data[fs_data["fs_clf"] == 0]  # suspensions

    failure_times = failures["time"].tolist()
    right_censored_times = right_censored["time"].tolist()

    #     print(failures)
    #     print(right_censored)

    if 1 < len(failures):
        wbfit = Fit_Weibull_2P(
            failures=failure_times,
            right_censored=right_censored_times
            if len(right_censored_times) > 0
            else None,
            show_probability_plot=False,
            print_results=False,
        )
        results_per_group[name] = {
            "alpha": wbfit.alpha,
            "beta": wbfit.beta,
            "mean": wbfit.distribution.mean,
            "time_on_test": fs_data["time"].sum(),
            "evidence": len(fs_data),
        }
    else:
        print(f"{name} only has censored events")
        continue

1071-30-05-01-CVR123-MECH-SCP001 only has censored events
1071-30-05-02-CVR223-MECH-SCP001 only has censored events
1071-30-05-07-CVR030-MECH-BLT001 only has censored events
1071-30-25-01-CVR063-MECH-SCP001 only has censored events


<IPython.core.display.Javascript object>

In [111]:
# Lets plot our results
df_results = pd.DataFrame.from_dict(results_per_group).T
df_results.head()

Unnamed: 0,alpha,beta,mean,time_on_test,evidence
1071-30-05-01-CVR101-MECH-IDL001,197.578,2.52736,175.352,1047,6
1071-30-05-01-CVR102,88.8807,1.05609,86.987,959,11
1071-30-05-01-CVR102-ELEC,383.683,1.23526,358.341,1431,4
1071-30-05-01-CVR102-ELEC-MDT001,257.378,1.00996,256.315,1282,5
1071-30-05-01-CVR102-MECH-IDL001,85.8879,1.51053,77.4701,1398,18


<IPython.core.display.Javascript object>

Lets join the object descriptions back onto our data using a `left join`, this will allow us to categorise our data when we're visualising our results.

In [112]:
# First we'll pop out the index and make it a column (we'll use this as a key to join on)
df_results = df_results.rename_axis("functional_loc").reset_index()

<IPython.core.display.Javascript object>

In [113]:
# Now we'll add the object_desc column to the results dataframe joining on the functional_loc column
df_results_with_objs = pd.merge(
    df_results,
    df_large[["functional_loc", "object_desc"]].drop_duplicates(),
    on="functional_loc",
    how="left",
)

<IPython.core.display.Javascript object>

In [114]:
df_results_with_objs.head(5)

Unnamed: 0,functional_loc,alpha,beta,mean,time_on_test,evidence,object_desc
0,1071-30-05-01-CVR101-MECH-IDL001,197.578,2.52736,175.352,1047,6,Idlers
1,1071-30-05-01-CVR102,88.8807,1.05609,86.987,959,11,Secondary Sizer Feed
2,1071-30-05-01-CVR102-ELEC,383.683,1.23526,358.341,1431,4,Electrical Inst Control Systems
3,1071-30-05-01-CVR102-ELEC-MDT001,257.378,1.00996,256.315,1282,5,Metal Detector
4,1071-30-05-01-CVR102-MECH-IDL001,85.8879,1.51053,77.4701,1398,18,Idlers


<IPython.core.display.Javascript object>

In [115]:
fig_bubble = px.scatter(
    df_results_with_objs,
    x="alpha",
    y="beta",
    size="evidence",
    color="mean",
    symbol="object_desc",
    hover_name=df_results.index,
    color_continuous_scale="Bluered_r",
)
fig_bubble = fig_bubble.update_layout(
    title_text=f"Overview of Results ({len(df_results)} assets)"
)
fig_bubble = fig_bubble.update_layout(coloraxis_colorbar=dict(orientation="h"))

<IPython.core.display.Javascript object>

Tips: 
- Double click legend items to isolate them
- Drag your mouse cursor to zoom into regions and navigate

In [116]:
fig_bubble.show()

<IPython.core.display.Javascript object>

The approach we've explored has allowed us to quickly identify, classify and reason about EOL events exhibited in maintenance work order records. Using this information, we have been able to extract reliability measures in a standardised, and scalable, manner. However, the approach we've demonstrated is not a panacea, but rather should be used as a rule of thumb and a method for quickly structuring data for this task.

In this brief next section, we'll review the classifications made and use our own judgement on them before computing the reliability measures.

In [None]:
# TODO: jupyter widget that iterates over the functional locations and lets users correct the classifications of their system

### 3.F.1 - Analysing the effect of our decisions on reliability measures  <a class="anchor" id="3-F-1"></a>

In this section, we will investigate how the thresholds we set impact the semi-automated reliability meaures we extract from our maintenance work order data.

To make this process interactive, what we've done so far has been wrapped into the function `get_measures`.

In [117]:
def get_measures(
    df: pd.DataFrame,
    eol_terms: list,
    fs_col_name: str = "wo_order_type",
    fs_col_fail_values: list = ["PM01", "PM03"],
    fs_col_suspension_values: list = ["PM02"],
    min_evidence_points: int = 5,
    cost_threshold: int = 2000,
    work_threshold: int = 4,
) -> pd.DataFrame:
    """Function for dynamically calculating reliability measures using expert logic and the Reliability package

    Arguments
        df : a Pandas DataFrame containing maintenance work order data
        eol_terms : a list of terms indicating end-of-life events
        fs_col_name : the name of the column in the supplied DataFrame that can be used to classify end-of-life events as failure or suspension
        fs_col_fail_values : a list of values that will be matched to indicate a failure event
        fs_col_suspension_values : a list of values that will be matched to indicate a suspension event

    Returns
        - df_results_with_objs : Pandas DataFrame with Weibull values for each functional location
        - df_filtered : Pandas DataFrame with original data filtered and classified

    """

    expected_cols = [
        "description",
        "total_actual_costs",
        "total_actual_work",
        "functional_loc",
        "actual_start_date",
    ]

    #     assert that required fields are supplied to the function

    # Make small dataframe containing object descriptions to join onto results
    df_obj_desc = df[["functional_loc", "object_desc"]].drop_duplicates()

    # Perform preliminary filtering of
    # - records with no actual_start_date
    # - records with neither cost or work information
    # - functional locations with # records below min_evidence_points

    df_filtered = df[~df["actual_start_date"].isna()]
    df_filtered = df_filtered[
        (
            ~df_filtered["total_actual_costs"].isna()
            | ~df_filtered["total_actual_work"].isna()
        )
    ]
    counts = df_filtered["functional_loc"].value_counts()
    idx = counts[counts <= min_evidence_points]
    df_filtered = df_filtered[~df_filtered["functional_loc"].isin(idx)]

    # Create pattern for matching
    eol_search_pattern = "|".join(eol_terms)

    # Perform EOL identification
    df_filtered["eol"] = df_filtered["description"].str.contains(
        eol_search_pattern, case=False
    )

    # Perform EOL classification
    df_filtered["fs_clf"] = np.where(
        (df_filtered[fs_col_name].isin(fs_col_suspension_values)),
        "suspension",
        np.where(df_filtered[fs_col_name].isin(fs_col_fail_values), "failure", "other"),
    )

    # Filter data on threshold
    df_filtered = df_filtered[
        (cost_threshold <= df_filtered["total_actual_costs"])
        | (work_threshold <= df_filtered["total_actual_work"])
    ]

    #     print('filtered df size', len(df_filtered))

    # Groupby functional_loc to get failure/suspension data for reliability measure estimations
    fs_data_per_group = {}
    for name, values in df_filtered.groupby(["functional_loc"]):
        if len(values) < min_evidence_points:
            continue
        else:
            values = values[["actual_start_date", "fs_clf"]]
            # Lets sort the groups data by actual_start_date so that the earliest come first
            fs_data = values.sort_values(by=["actual_start_date"], ascending=True)

            # Calculate the time between event (we'll use days here)
            fs_data["time"] = fs_data["actual_start_date"].diff() / np.timedelta64(
                1, "D"
            )
            fs_data["time"] = fs_data["time"].fillna(0)

            # Encode failures and suspensions as 1 and 0, respectively
            fs_data["fs_clf"].replace(to_replace="failure", value=1, inplace=True)
            fs_data["fs_clf"].replace(to_replace="suspension", value=0, inplace=True)

            fs_data_per_group[name] = fs_data[["fs_clf", "time"]]

    # Get reliability measures from fitting to 2P Weibull distribution
    results_per_group = {}
    for name, fs_data in fs_data_per_group.items():

        fs_data = fs_data[
            0 < fs_data["time"]
        ]  # Do not take into account any events that have 0 time

        failures = fs_data[fs_data["fs_clf"] == 1]
        right_censored = fs_data[fs_data["fs_clf"] == 0]  # suspensions

        failure_times = failures["time"].tolist()
        right_censored_times = right_censored["time"].tolist()

        if 1 < len(failures):
            wbfit = Fit_Weibull_2P(
                failures=failure_times,
                right_censored=right_censored_times
                if len(right_censored_times) > 0
                else None,
                show_probability_plot=False,
                print_results=False,
            )
            results_per_group[name] = {
                "alpha": wbfit.alpha,
                "beta": wbfit.beta,
                "mean": wbfit.distribution.mean,
                "time_on_test": fs_data["time"].sum(),
                "evidence": len(fs_data),
            }

    # Join object descriptions onto dataframe for visualisation
    df_results = pd.DataFrame.from_dict(results_per_group).T
    df_results = df_results.rename_axis("functional_loc").reset_index()
    df_results_with_objs = pd.merge(
        df_results, df_obj_desc, on="functional_loc", how="left"
    )

    return df_results_with_objs, df_filtered

<IPython.core.display.Javascript object>

In [118]:
# Here we'll look at what our aggregate function is outputting.
df_test_results, df_test_clf = get_measures(
    df=df_large,
    eol_terms=["replace", "change out", "c/o"],
    fs_col_name="wo_order_type",
    fs_col_fail_values=["PM01"],
    fs_col_suspension_values=["PM02"],
    min_evidence_points=5,
    cost_threshold=2000,
    work_threshold=4,
)

<IPython.core.display.Javascript object>

In [119]:
# Output results includes Weibull parameters and other meta data
df_test_results.head(5)

Unnamed: 0,functional_loc,alpha,beta,mean,time_on_test,evidence,object_desc
0,1071-30-05-01-CVR101,109.212,1.00924,108.793,1523,14,Spillage
1,1071-30-05-01-CVR101-ELEC,84.3982,0.855174,91.4765,913,10,Electrical Inst Control Systems
2,1071-30-05-01-CVR101-MECH-BLT001,226.139,1.29603,208.986,1550,10,Belt
3,1071-30-05-01-CVR101-MECH-IDL001,211.346,1.55681,189.996,1543,11,Idlers
4,1071-30-05-01-CVR101-MECH-SCP001,209.88,1.04572,206.165,1049,7,Scrapers


<IPython.core.display.Javascript object>

In [121]:
# Output classifications include classifications for EOL identification and failure suspension
df_test_clf.head(5)

Unnamed: 0,id,description,basic_start_date,priority,...,area,object_desc,eol,fs_clf
0,1,3Y MEC SDN REPL Conveyor Belt CVR068,2017-01-02,High,...,mine,Belt,False,suspension
1,2,CVR029 Replace Conveyor Belt,2020-10-28,High,...,mine,Belt,True,failure
2,3,144W MEC SDN REPL CVR069 Conveyor Belt,2018-10-13,High,...,mine,Belt,False,suspension
3,4,3Y MEC SDN REPL Conveyor Belt CVR069,2021-05-05,High,...,mine,Belt,False,suspension
4,5,CVR029 Replace Conveyor Belt,2018-06-22,High,...,mine,Belt,True,failure


<IPython.core.display.Javascript object>

Recalling that we set our thresholds to two values that were applied across all our assets, obviously this is not ideal nor representative. As indicated earlier, you could perform analysis on a functional location basis and update this notebook to account for specific thresholds for each asset/floc type. However, what we are going to do here is perform an exploratory visual analysis of our data using an interactive package called [Dash](https://dash.plotly.com/) (an extention of Ploty). If you're familiar with Business Intelligence tools, it will give you the same feel.

In [122]:
from dash import Dash, dcc, html, Input, Output
from jupyter_dash import JupyterDash

<IPython.core.display.Javascript object>

In [123]:
# Build App
app = JupyterDash(__name__)
app.layout = html.Div([
    html.H1("Interactive Data-Driven Reliability Metrics Results"),
    dcc.Graph(id='graph'),
    html.Label([
        "Min. evidence points",
            dcc.Slider(
        5,
        20,
        step=1,
        value=10,
        id='evidence-points-slider',
                            updatemode='mouseup'
    )]),
    html.Label([
    "Cost threshold ($)",
        dcc.Slider(
    0,
    100000,
    step=5000,
    value=50000,
    id='cost-threshold-slider',
            updatemode='mouseup'
    )]),
    html.Label([
    "Work threshold (hr)",
        dcc.Slider(
    0,
    48,
    step=4,
    value=24,
    id='work-threshold-slider',
            updatemode='mouseup'
            
    )])
])

# Define callback to update graph
@app.callback(
    Output('graph', 'figure'),
    [
        Input("evidence-points-slider", "value"),
        Input("cost-threshold-slider", "value"),
        Input("work-threshold-slider", "value")
    ]
)
def update_figure(evidence_points, cost_threshold, work_threshold):
    
    df_test_interactive, _ = get_measures(df = df_large,
                  eol_terms = ['replace', 'change out', 'c/o'],
                  fs_col_name = 'wo_order_type',
                  fs_col_fail_values = ['PM01'],
                  fs_col_suspension_values = ['PM02'],
                  min_evidence_points = evidence_points,
                  cost_threshold = cost_threshold,
                  work_threshold = work_threshold
                 )
    
    fig_interactive = px.scatter(df_test_interactive,
                        x="alpha",
                        y="beta",
                        size="evidence",
                        color="mean",
                        symbol ="object_desc",
                        hover_name="functional_loc",
                        color_continuous_scale='Bluered_r'
                                )
    fig_interactive = fig_interactive.update_layout(coloraxis_colorbar=dict(orientation='h'))
    
    fig_interactive = fig_interactive.update_layout(transition_duration=500)
    
    return fig_interactive


<IPython.core.display.Javascript object>

Lets run the application and explore our results.

⚠️ Please be aware that every time you interact with the sliders, it will recompute EVERY Weibull estimate for each applicable asset. So please be patient between updates.

In [124]:
# Run app and display result inline in the notebook
app.run_server(mode="inline")

<IPython.core.display.Javascript object>

Through the use of interactive visualisation using [Dash]() we can easily get an intuition for the mean values of a range of assets. For instance, if we set the thresholds very low, we are likely to capture evidence that may not be correct, however as we increase the thresholds our belief in evidence being correct also goes up. Hence, we can get a feel for the range of mean time estimates on a given asset through this minima/maxima analysis.

## Extra Analysis
In this section we'll use our Python programming skills to perform additional analysis on our maintenance work order dataset.

Before we do this, we'll create a dataframe containing the results of our analysis.

In [125]:
analysis_cost_threshold = 10000
analysis_work_threshold = 8
analysis_min_evidence = 10
analysis_eol_terms = ["replace", "change out", "c/o"]

<IPython.core.display.Javascript object>

In [126]:
df_analysis_results, df_analysis_filtered = get_measures(
    df=df_large,
    eol_terms=analysis_eol_terms,
    fs_col_name="wo_order_type",
    fs_col_fail_values=["PM01"],
    fs_col_suspension_values=["PM02"],
    min_evidence_points=10,
    cost_threshold=analysis_cost_threshold,
    work_threshold=analysis_work_threshold,
)

<IPython.core.display.Javascript object>

In [127]:
# Lets take a look at the data we are working with
df_analysis_results.head()

Unnamed: 0,functional_loc,alpha,beta,mean,time_on_test,evidence,object_desc
0,1071-30-05-01-CVR101-MECH-IDL001,273.713,2.37639,242.599,1543,9,Idlers
1,1071-30-05-01-CVR102,63.616,0.877455,67.8992,1667,27,Secondary Sizer Feed
2,1071-30-05-01-CVR102-ELEC,108.859,0.733518,132.065,1266,9,Electrical Inst Control Systems
3,1071-30-05-01-CVR102-MECH-BLT001,161.577,2.08306,143.117,1488,11,Belt
4,1071-30-05-01-CVR102-MECH-IDL001,68.4661,1.11653,65.7465,1598,26,Idlers


<IPython.core.display.Javascript object>

In [128]:
# Sort by mean value
df_analysis_results.sort_values(by=["mean"], inplace=True, ascending=True)

<IPython.core.display.Javascript object>

### Find the Top 10 Bad Actors
Here we'll identify the top 10 bad actors in terms of their mean time values (lower value is worse).

In [129]:
# Bad actors globally
top_10_bad_actors = df_analysis_results[:10]
top_10_bad_actors.head(10)

Unnamed: 0,functional_loc,alpha,beta,mean,time_on_test,evidence,object_desc
38,1071-30-05-05-CVR023-MECH-IDL001,30.5084,1.08319,29.5909,1815,72,Idlers
54,1071-30-05-07-CVR030,28.5267,0.880208,30.3923,1810,60,Fixed Stacker Boom Conveyor
124,1071-30-25-11-CVR070-MECH-IDL001,41.7191,1.09815,40.2777,1784,59,Idlers
50,1071-30-05-07-CVR029-MECH-IDL001,43.0598,1.1632,40.8462,1929,61,Idlers
121,1071-30-25-11-CVR070,42.1543,1.04884,41.3618,1739,42,Reclaimer Boom Conveyor
119,1071-30-25-09-CVR069-MECH-IDL001,41.7377,0.991843,41.884,1840,53,Idlers
122,1071-30-25-11-CVR070-ELEC,37.694,0.798002,42.7841,1666,39,Electrical Inst Control Systems
36,1071-30-05-05-CVR023-ELEC,35.1202,0.673827,46.1692,1552,33,Electrical Inst Control Systems
151,1082-30-20-01-CVR161,48.7959,1.17267,46.1829,1917,245,Overland Conveyor
35,1071-30-05-05-CVR023,48.9477,1.1361,46.7518,1863,120,Overland Conveyor South


<IPython.core.display.Javascript object>

In [134]:
# Top 10 bad actors based on FLOC
top_10_bad_actors_floc = df_analysis_results[:10]
top_10_bad_actors_floc.head(10)

Unnamed: 0,functional_loc,alpha,beta,mean,time_on_test,evidence,object_desc
38,1071-30-05-05-CVR023-MECH-IDL001,30.5084,1.08319,29.5909,1815,72,Idlers
54,1071-30-05-07-CVR030,28.5267,0.880208,30.3923,1810,60,Fixed Stacker Boom Conveyor
124,1071-30-25-11-CVR070-MECH-IDL001,41.7191,1.09815,40.2777,1784,59,Idlers
50,1071-30-05-07-CVR029-MECH-IDL001,43.0598,1.1632,40.8462,1929,61,Idlers
121,1071-30-25-11-CVR070,42.1543,1.04884,41.3618,1739,42,Reclaimer Boom Conveyor
119,1071-30-25-09-CVR069-MECH-IDL001,41.7377,0.991843,41.884,1840,53,Idlers
122,1071-30-25-11-CVR070-ELEC,37.694,0.798002,42.7841,1666,39,Electrical Inst Control Systems
36,1071-30-05-05-CVR023-ELEC,35.1202,0.673827,46.1692,1552,33,Electrical Inst Control Systems
151,1082-30-20-01-CVR161,48.7959,1.17267,46.1829,1917,245,Overland Conveyor
35,1071-30-05-05-CVR023,48.9477,1.1361,46.7518,1863,120,Overland Conveyor South


<IPython.core.display.Javascript object>

In [137]:
# Let get the FLOC associated with the worst actor
worst_actor_floc = top_10_bad_actors_floc.iloc[0]['functional_loc']

<IPython.core.display.Javascript object>

In [138]:
# Lets take a look at the records associated with the functional location
df_analysis_filtered_floc = df_analysis_filtered[
    df_analysis_filtered["functional_loc"] == worst_actor_floc
].sort_values(by=["eol"], ascending=False)
df_analysis_filtered_floc[
    ["description", "total_actual_costs", "total_actual_work", "eol", "fs_clf"]
]

Unnamed: 0,description,total_actual_costs,total_actual_work,eol,fs_clf
2607,CVR023 Replace Collapsed Rollers,4198.94,8.0,True,failure
1971,CVR023-Change out idlers,6490.45,32.0,True,failure
2068,replace seized idlers on conveyor cvr23,6087.58,8.0,True,failure
2089,"Replace return,trough rollers",6007.84,24.0,True,failure
2139,CVR023 change out rollers,5802.02,16.0,True,failure
2367,Stage rollers for C/O,4981.82,8.0,True,failure
2553,Change out idlers,4363.77,24.0,True,failure
2794,CVR023 Replace Failed Inverted V Roller,3674.82,18.0,True,failure
3019,1WK CVR023 Replace Severe Idlers WK48,3249.28,21.0,True,failure
3033,CVR023 Change out through Rollers,3225.88,12.0,True,failure


<IPython.core.display.Javascript object>

### Failure behavior classification
Here we'll use the alpha and beta values from our analysis to classify all of the applicable assets. Recall that $\beta$ < 1 is early life failures (infant mortality), $\beta$ = 1 is random failures, and $\beta$ > 1 is wear out failures.

In [139]:
df_analysis_results_fail_clf = df_analysis_results.copy()

<IPython.core.display.Javascript object>

Lets round the numbers to the first digit

In [140]:
df_analysis_results_fail_clf = df_analysis_results_fail_clf.round(1)

<IPython.core.display.Javascript object>

In [141]:
df_analysis_results_fail_clf["behaviour"] = np.where(
    df_analysis_results_fail_clf["beta"] < 1,
    "early_life",
    np.where(df_analysis_results_fail_clf["beta"] == 1, "random", "wear_out"),
)

<IPython.core.display.Javascript object>

Lets visualise the failure behaviour / Weibull shape parameter of our assets

In [142]:
fig = px.bar(
    df_analysis_results_fail_clf,
    x="functional_loc",
    y="mean",
    color="behaviour",
    title="Analysis of Webull shape parameter",
)
fig.show()

<IPython.core.display.Javascript object>

## Summary of Data-Driven Reliability Metrics  <a class="anchor" id="summary"></a>
### Wrap up & homework

Homework for next week
1. ...
2. ...

Your feedback today is welcome. Provide your answers in [Menti]()
- What is one thing you liked about today?
- What would you like to see more of?

# Appendix  <a class="anchor" id="appendix"></a>

**⚡ Concept - Stemming and lemmatization**</br>
Imagine you have the following texts `replaced pump - working well` and `replace pump - working good`. It may be desirable to reduce each word to a root form to reduce the size of the vocabulary within our corpus. Stemming is a technique that stems all words to a common root by removing or replacing word suffixes (e.g. "replaced" to "replace") whereas lemmatization returns the base form of words (e.g. "well" to "better")

In [143]:
# For this section we'll need access to an external resource called WordNet
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Tyler\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Tyler\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

<IPython.core.display.Javascript object>

In [144]:
# Lets instantiate a stemmer from NLTK
stemmer = PorterStemmer()

<IPython.core.display.Javascript object>

In [145]:
# Try 'replace pump - not pumping'
sentence_to_stem = 'replace pump - not pumping'
sentence_to_stem_tokenized = word_tokenize(sentence_to_stem)

# Like the other exercises, we need to tokenize our sentence
stemmed_sentence = [stemmer.stem(token) for token in sentence_to_stem_tokenized]
print(stemmed_sentence)

['replac', 'pump', '-', 'not', 'pump']


<IPython.core.display.Javascript object>

In [146]:
# Lets instantiate a lemmatizer from NLTK
lemmatizer = WordNetLemmatizer()

<IPython.core.display.Javascript object>

In [147]:
# interchange, interchanger, interchanging

sentence_to_lemma = 'interchange interchanger'
sentence_to_lemma_tokenized = word_tokenize(sentence_to_lemma)

lemmatized_sentence = [lemmatizer.lemmatize(token) for token in sentence_to_lemma_tokenized]
print(lemmatized_sentence)

['interchange', 'interchanger']


<IPython.core.display.Javascript object>

**⚡ Concept - Concordancing**</br>
Concordancing gives us a view of every occurrence of a given word, together with some context. For example, we may be interested in the context that the word `seized` occurs in which may result in `...bearing seized on pulley...` and `...lack of lube - seized ...`.

Being able to analyse the use of words in context allows us to quicky get an understanding of what is happening in our texts.

In [148]:
# Before we can perform concordancing, we need to convert our MWO corpus into a single string and then 
# into a special NLTK text object. We'll need to put a special token into the object to delineate between texts. 
special_token = " | "
texts_for_concordancing = Text(word_tokenize(f"{special_token}".join(mwo_corpus)))

<IPython.core.display.Javascript object>

Lets take a look at concordances for single words such as `broken`, or negation such as `not`.

In [149]:
texts_for_concordancing.concordance('not')

Displaying 25 of 284 matches:
ate & Inse > Inse | Outstanding PO DO NOT TECO | CVR123 Replace Impact Plate & 
 Idle | CVR063 REPL Lower Insert | Do Not TECO outstanding PO | 24W MEC SDN REP
TRIAL - CVR030 Impact Idlers | # # DO NOT TECO # # CVR241 Hot Splice Repair | 8
ing on overload . | Outstanding PO Do Not TECO | 24W MEC SDN REPL H/Chute Liner
rd & Soft Skirts | Reported load cell not communicating | CVR025 Replace Return
lers | CVR041 Replace GTU Rubber | DO NOT TECO , WAITING FOR PARTS , CVR223 M |
t | Preps - Conveyor Belt CVR123 | DO NOT TECO CVR062-Check brake fail to a | R
ace skirts and clean | CVR042 U/Speed Not Working | CB MEC SDN RPL Brake Pads D
Power Cell - Siemens | METAL DETECTOR NOT RESETTING | CVR123-CVR023 Repair Tran
CVR123 - replace damaged rollers | DO NOT TECO - IDLER FRAMES ON ORDER | Replac
CVR029 & CVR030 | RTD 0211 Drive side not reading | CVR341 Refurbish Secondary 
lty multi I\O card | CVR029 Sirens Do Not Sound on Start Up | SSC056-CVR001 TLO
DN REPL Im

<IPython.core.display.Javascript object>

We can also look at concordances between multiple words, such as "not working"

In [150]:
texts_for_concordancing.concordance(["not", "working"])

Displaying 25 of 66 matches:
skirts and clean | CVR042 U/Speed Not Working | CB MEC SDN RPL Brake Pads DRV1 
e Failed Idlers | Mulitple lights not working on CVR029 | CVR030 Replace DRV002
42 Weightometer Alignment | Brake not working | Relace 8 x Tail End Return Roll
ed V Worn Frame | DBD002 Lighting not working in Auto | CVR070 Fit Conduits to 
 Metal Detector | cvr030 lighting not working | Isolating and suspending for op
g repairs on CVR025 | Tilt switch not working | replace return rollers | CVR065
d Frame | CVR023 beacon HRN152 is not working | Scrubbing Line 5 BLS5118 U\S | 
sfer Chute CVR223 | CVR223 lights not working | CVR068 Poly Return Roller Upgra
t switches | CVR023 Tunnel lights not working | CVR023 Tunnel emergency lightin
Pizza Cutters | CVR123 thermostat not working . | CVR431 BDS 4134 Broken off | 
 ZSL3148 fa | CVR302 weightometer not working | Weighto no Speed reference CVR1
ft skirt | CVR030 Load Cell XT001 not working | CVR061 Inspect pullwire cotton 
56 | CVR023

<IPython.core.display.Javascript object>

Due to the terse nature of maintenance work order short text, concordancing only provides a glimpse into the context that word/phrases are occurring. However, if you have longer documents such as long text or reports, this can provide quick insight.