# Basic NLP Course

## Bag of Words Representation and First Machine Learning Model

Bag of Words (BoW) is a simple and widely used representation for text data in Natural Language Processing (NLP). It converts text into a numerical format that can be used as input for machine learning models.

- **Definition**: BoW represents text as a collection of words, disregarding grammar and word order, while keeping track of the frequency of each word.
- **Steps**:
    1. Tokenize the text into words.
    2. Create a vocabulary of unique words.
    3. Represent each document as a vector based on the frequency of words in the vocabulary.
- **Example**: For the sentences:
    - "I love NLP."
    - "NLP is fun."
    
    The vocabulary is: `["I", "love", "NLP", "is", "fun"]`. The sentences are represented as:
    - [1, 1, 1, 0, 0]
    - [0, 0, 1, 1, 1]

- **First ML Model**: Using BoW, you can train a simple machine learning model such as Naive Bayes or Logistic Regression for tasks like text classification.
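
The three steps and the two-sentence example above can be sketched in plain Python. This is a minimal illustration rather than what a library does internally: the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are made up for this sketch, and it keeps the original casing and vocabulary order so that the output matches the vectors listed above.

```python
import re

def tokenize(text):
    # Step 1: split into word tokens, dropping punctuation (a simplifying assumption)
    return re.findall(r"\w+", text)

def build_vocabulary(docs):
    # Step 2: collect unique words in order of first appearance
    vocab = []
    for doc in docs:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)
    return vocab

def vectorize(doc, vocab):
    # Step 3: count how often each vocabulary word occurs in the document
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]

docs = ["I love NLP.", "NLP is fun."]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Library implementations usually add preprocessing on top of this (for example lowercasing), so their vocabulary for the same sentences can differ.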

In [82]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [83]:
data = pd.read_csv('../data/work_orders_sample.csv')
data.head()

                  failure_mode                                        description
0             Internal leakage  Compressor CP-001 is experiencing internal lea...
1  Abnormal instrument reading  Compressor CP-101 is showing abnormal pressure...
2  Abnormal instrument reading  Compressor C-101 is giving an abnormal high pr...
3  Abnormal instrument reading  Compressor C-101-A is giving abnormal instrume...
4  Abnormal instrument reading  Compressor CP-101 is giving an abnormal instru...


In [84]:
# count the number of work orders per failure mode (class distribution)
data.failure_mode.value_counts()

failure_mode
External leakage - process medium    201
Erratic output                       201
Breakdown                            198
Abnormal instrument reading          192
External leakage - utility medium    144
Failure to start on demand            93
Internal leakage                      71
High output                           71
Low output                            71
Noise                                 71
Overheating                           71
Parameter deviation                   71
Plugged / Choked                      71
Minor in-service problems             71
Structural deficiency                 71
Failure to stop on demand             71
Spurious stop                         71
Vibration                             71
Name: count, dtype: int64

In [85]:
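# Create a set of unique failure modes
# (set iteration order can vary between runs, so the indices shown below may not be reproduced exactly)
failure_mode_set = set(data['failure_mode'])

# Map each failure mode to an integer index
failure_mode_mapping = {mode: idx for idx, mode in enumerate(failure_mode_set)}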

# Replace the failure_mode column with the corresponding indices
data['failure_mode'] = data['failure_mode'].map(failure_mode_mapping)

# Display the updated dataframe and the mapping
print(data.head())
print(failure_mode_mapping)

   failure_mode                                        description
0             6  Compressor CP-001 is experiencing internal lea...
1            17  Compressor CP-101 is showing abnormal pressure...
2            17  Compressor C-101 is giving an abnormal high pr...
3            17  Compressor C-101-A is giving abnormal instrume...
4            17  Compressor CP-101 is giving an abnormal instru...
{'External leakage - utility medium': 0, 'Failure to stop on demand': 1, 'Structural deficiency': 2, 'External leakage - process medium': 3, 'Failure to start on demand': 4, 'Noise': 5, 'Internal leakage': 6, 'Plugged / Choked': 7, 'Parameter deviation': 8, 'Spurious stop': 9, 'Vibration': 10, 'Erratic output': 11, 'Breakdown': 12, 'Minor in-service problems': 13, 'Low output': 14, 'High output': 15, 'Overheating': 16, 'Abnormal instrument reading': 17}
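
Because the mapping above is built from a `set`, the index assigned to each failure mode depends on that run's iteration order. One way to make the encoding reproducible is to sort the unique labels before enumerating them. The sketch below uses a small hypothetical label list (the names are taken from the dataset, the list itself is made up) instead of the real column.

```python
# Hypothetical sample of labels; in the notebook this would be data['failure_mode']
labels = ["Noise", "Breakdown", "Vibration", "Breakdown", "Noise"]

# Sorting the unique labels gives the same index assignment on every run
mode_to_index = {mode: idx for idx, mode in enumerate(sorted(set(labels)))}

# Encode the labels with the stable mapping
encoded = [mode_to_index[mode] for mode in labels]

print(mode_to_index)  # {'Breakdown': 0, 'Noise': 1, 'Vibration': 2}
print(encoded)        # [1, 0, 2, 0, 1]
```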


- This means we are building a multiclass classification model that assigns each work order to one of the 18 failure modes, which are now encoded as integer labels.

In [86]:
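# split into train and test sets, stratifying on failure_mode to preserve class proportions
# note: data[['failure_mode']] is a one-column DataFrame, which is what triggers the
# column_or_1d warning when fitting later; the Series data['failure_mode'] would avoid it
x_train, x_test, y_train, y_test = train_test_split(data.description, data[['failure_mode']], test_size=0.2, stratify=data[['failure_mode']])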

In [87]:
# print the shape of the train and test set
print(f'Train set shape: {x_train.shape}, Test set shape: {x_test.shape}')

Train set shape: (1504,), Test set shape: (377,)
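
To make the effect of `stratify` concrete, here is a rough plain-Python sketch of a stratified split: each class is shuffled and split separately, so the test set keeps roughly the same class proportions as the full data. The function `stratified_split` and the toy labels are hypothetical, and scikit-learn's own allocation of rounding remainders may differ from the simple `round` used here.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    indices_by_class = defaultdict(list)
    for i, label in enumerate(labels):
        indices_by_class[label].append(i)

    train_idx, test_idx = [], []
    for label, indices in indices_by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # simplistic rounding rule
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 10 of one class, 5 of another
labels = ["Breakdown"] * 10 + ["Noise"] * 5
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in test_idx))   # 2 Breakdown, 1 Noise
print(Counter(labels[i] for i in train_idx))  # 8 Breakdown, 4 Noise
```

With `train_test_split`, passing a `random_state` plays the role of the fixed seed here and makes the reported scores repeatable between runs.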


In [88]:
# fit CountVectorizer (BoW): build a vocabulary from the training documents, then create a vector for each document that counts how often each word occurs
bow = CountVectorizer()
x_train_bow = bow.fit_transform(x_train)
x_test_bow = bow.transform(x_test)


In [89]:
# convert the sparse matrices to dense arrays (fine for this small dataset; MultinomialNB also accepts sparse input)
x_train_bow = x_train_bow.toarray()
x_test_bow = x_test_bow.toarray()

In [90]:
print(x_train_bow.shape)

(1504, 1417)


In [91]:
# check vocabulary
bow.vocabulary_

{'compressor': 289,
 'breakdown': 185,
 'repair': 1052,
 'seized': 1127,
 'unit': 1332,
 'due': 450,
 'to': 1294,
 'thermal': 1280,
 'overload': 895,
 '101': 19,
 'is': 710,
 'experiencing': 497,
 'erratic': 474,
 'output': 888,
 'causing': 219,
 'temperature': 1271,
 'fluctuations': 547,
 'likely': 746,
 'piping': 933,
 'issues': 714,
 'specifically': 1189,
 'suspected': 1251,
 'blockage': 173,
 'in': 653,
 'the': 1278,
 'discharge': 419,
 'line': 749,
 'which': 1398,
 'may': 795,
 'require': 1061,
 'replacement': 1057,
 'of': 860,
 'valve': 1356,
 'cp': 348,
 '001': 1,
 'overheating': 894,
 'control': 323,
 'system': 1258,
 'malfunction': 785,
 'fluid': 548,
 'leaks': 734,
 'and': 117,
 'requires': 1063,
 'inspection': 683,
 'potential': 953,
 'sensor': 1130,
 'prevent': 962,
 'damage': 376,
 'downtime': 433,
 'characterized': 234,
 'by': 198,
 'oscillating': 883,
 'pressure': 960,
 'readings': 1011,
 'improper': 652,
 'installation': 685,
 'suction': 1236,
 'resulting': 1081,
 'devi
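
The vocabulary above contains lowercase words, and tags such as `CP-001` appear as the separate entries `'cp'` and `'001'`. This follows from `CountVectorizer`'s defaults: text is lowercased and tokens are extracted with the pattern `(?u)\b\w\w+\b`, which keeps runs of two or more word characters. The sketch below emulates only that default tokenization with the standard library; the function name and the example sentences are made up for illustration.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    # Lowercase first, then keep tokens of at least two word characters
    return TOKEN_PATTERN.findall(text.lower())

print(default_like_tokens("Compressor CP-001 is overheating."))
# ['compressor', 'cp', '001', 'is', 'overheating']

print(default_like_tokens("I love NLP."))
# ['love', 'nlp']  (the single-character token "I" is dropped)
```

If equipment tags should stay intact as single tokens, a custom `token_pattern` or `tokenizer` can be passed to `CountVectorizer`.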

In [92]:
# checking non-zero positions
sample = x_train_bow[0]
np.where(sample != 0)

(array([ 185,  289,  450,  895, 1052, 1127, 1280, 1294, 1332]),)
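
The non-zero positions are easier to interpret when mapped back to words. The sketch below inverts the part of the vocabulary printed above and looks up each position; only the entries needed for this sample are copied in. With the fitted vectorizer, `bow.get_feature_names_out()` provides the same index-to-word lookup.

```python
# Subset of bow.vocabulary_ copied from the output above (word -> column index)
vocabulary = {
    "compressor": 289, "breakdown": 185, "repair": 1052, "seized": 1127,
    "unit": 1332, "due": 450, "to": 1294, "thermal": 1280, "overload": 895,
}

# Non-zero column indices of the first training sample, from np.where above
nonzero_positions = [185, 289, 450, 895, 1052, 1127, 1280, 1294, 1332]

# Invert the mapping (column index -> word) and decode the sample
index_to_word = {idx: word for word, idx in vocabulary.items()}
words = [index_to_word[i] for i in nonzero_positions]

print(words)
# ['breakdown', 'compressor', 'due', 'overload', 'repair', 'seized', 'thermal', 'to', 'unit']
```

Note that the words come back in column order (alphabetical), not in the order they appeared in the description, which is exactly the word-order information BoW discards.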

In [93]:
# train Naive Bayes
nb = MultinomialNB()
nb.fit(x_train_bow, y_train)

  y = column_or_1d(y, warn=True)


MultinomialNB parameters:
alpha        1.0
force_alpha  True
fit_prior    True
class_prior
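
To show what the fitted model computes, here is a small plain-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, matching the parameter above). It is a simplified illustration under toy assumptions, not scikit-learn's implementation: the function names, the three-word vocabulary, and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    n_features = len(X[0])
    class_counts = Counter(y)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        for j, count in enumerate(row):
            feature_counts[label][j] += count

    model = {}
    for label, n_docs in class_counts.items():
        counts = feature_counts[label]
        total = sum(counts) + alpha * n_features  # smoothed denominator
        log_prior = math.log(n_docs / len(y))
        log_likelihood = [math.log((c + alpha) / total) for c in counts]
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(model, key=score)

# Toy BoW counts over a hypothetical vocabulary ["leak", "noise", "valve"]
X = [[2, 0, 1], [1, 0, 0], [0, 3, 0], [0, 1, 1]]
y = ["leakage", "leakage", "noise", "noise"]
model = fit_multinomial_nb(X, y)

print(predict(model, [1, 0, 0]))  # leakage
print(predict(model, [0, 2, 0]))  # noise
```

The smoothing term keeps a word that never appeared in a class from driving that class's probability to zero.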


In [94]:
# make predictions on the test set
y_hat_test = nb.predict(x_test_bow)

In [95]:
# print the classification report
print(classification_report(y_test, y_hat_test))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        29
           1       0.93      0.93      0.93        14
           2       0.93      1.00      0.97        14
           3       0.95      1.00      0.98        41
           4       0.90      0.95      0.92        19
           5       1.00      0.79      0.88        14
           6       0.80      0.86      0.83        14
           7       1.00      0.93      0.96        14
           8       0.93      0.93      0.93        14
           9       1.00      1.00      1.00        14
          10       0.77      0.71      0.74        14
          11       0.93      0.98      0.95        41
          12       0.90      0.95      0.93        40
          13       0.92      0.86      0.89        14
          14       1.00      0.86      0.92        14
          15       0.92      0.86      0.89        14
          16       0.92      0.86      0.89        14
          17       0.93    

In [96]:
# repeat the process using a Pipeline, which chains the vectorizer and the classifier into one estimator
pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('nb', MultinomialNB())
])
pipeline.fit(x_train, y_train)
y_hat_test_pipe = pipeline.predict(x_test)
print(classification_report(y_test, y_hat_test_pipe))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        29
           1       0.93      0.93      0.93        14
           2       0.93      1.00      0.97        14
           3       0.95      1.00      0.98        41
           4       0.90      0.95      0.92        19
           5       1.00      0.79      0.88        14
           6       0.80      0.86      0.83        14
           7       1.00      0.93      0.96        14
           8       0.93      0.93      0.93        14
           9       1.00      1.00      1.00        14
          10       0.77      0.71      0.74        14
          11       0.93      0.98      0.95        41
          12       0.90      0.95      0.93        40
          13       0.92      0.86      0.89        14
          14       1.00      0.86      0.92        14
          15       0.92      0.86      0.89        14
          16       0.92      0.86      0.89        14
          17       0.93    
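
The report lists classes by integer index, so it helps to map them back to failure mode names. The sketch below inverts the mapping printed earlier in this notebook; the indices are specific to that run, since they came from a `set`. Among the rows visible in the report, class 10 has the lowest F1 (0.74), followed by class 6 (0.83).

```python
# Mapping copied from the notebook output above (failure mode -> index)
failure_mode_mapping = {
    'External leakage - utility medium': 0, 'Failure to stop on demand': 1,
    'Structural deficiency': 2, 'External leakage - process medium': 3,
    'Failure to start on demand': 4, 'Noise': 5, 'Internal leakage': 6,
    'Plugged / Choked': 7, 'Parameter deviation': 8, 'Spurious stop': 9,
    'Vibration': 10, 'Erratic output': 11, 'Breakdown': 12,
    'Minor in-service problems': 13, 'Low output': 14, 'High output': 15,
    'Overheating': 16, 'Abnormal instrument reading': 17,
}

# Invert it so report rows can be read by name (index -> failure mode)
index_to_mode = {idx: mode for mode, idx in failure_mode_mapping.items()}

for class_index in (10, 6):
    print(class_index, index_to_mode[class_index])
# 10 Vibration
# 6 Internal leakage
```

The same dictionary can be used to turn the integer outputs of `pipeline.predict` back into readable labels.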

  y = column_or_1d(y, warn=True)
