# Basic NLP Course

In this notebook, we train text classification models on embeddings. Embeddings are dense vector representations of text that capture semantic meaning, so texts with similar meaning end up close together in vector space. We then discuss why, on this dataset, they underperform a Bag-of-n-Grams representation.

In [12]:
import spacy
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split


# load large model
nlp = spacy.load("en_core_web_lg")

In [2]:
# load work order samples
data = pd.read_csv('../data/work_orders_sample.csv')
data.head()

                  failure_mode                                        description
0             Internal leakage  Compressor CP-001 is experiencing internal lea...
1  Abnormal instrument reading  Compressor CP-101 is showing abnormal pressure...
2  Abnormal instrument reading  Compressor C-101 is giving an abnormal high pr...
3  Abnormal instrument reading  Compressor C-101-A is giving abnormal instrume...
4  Abnormal instrument reading  Compressor CP-101 is giving an abnormal instru...


In [3]:
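# Create a set of unique failure modes
# (set iteration order can change between runs, so the indices assigned below may differ on re-execution)
failure_mode_set = set(data['failure_mode'])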

# Map each failure mode to an index
failure_mode_mapping = {mode: idx for idx, mode in enumerate(failure_mode_set)}

# Replace the failure_mode column with the corresponding indices
data['failure_mode'] = data['failure_mode'].map(failure_mode_mapping)

# Display the updated dataframe and the mapping
print(data.head())
print(failure_mode_mapping)

   failure_mode                                        description
0             5  Compressor CP-001 is experiencing internal lea...
1            17  Compressor CP-101 is showing abnormal pressure...
2            17  Compressor C-101 is giving an abnormal high pr...
3            17  Compressor C-101-A is giving abnormal instrume...
4            17  Compressor CP-101 is giving an abnormal instru...
{'Failure to stop on demand': 0, 'High output': 1, 'Breakdown': 2, 'Overheating': 3, 'External leakage - utility medium': 4, 'Internal leakage': 5, 'Erratic output': 6, 'Low output': 7, 'Minor in-service problems': 8, 'Noise': 9, 'Vibration': 10, 'Structural deficiency': 11, 'Plugged / Choked': 12, 'Failure to start on demand': 13, 'Spurious stop': 14, 'External leakage - process medium': 15, 'Parameter deviation': 16, 'Abnormal instrument reading': 17}
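
Because the mapping above is built from a `set`, the index given to each failure mode depends on the set's iteration order, which Python does not keep stable across interpreter sessions for strings. A minimal sketch of a reproducible alternative, assuming it is acceptable to sort the label names first (the `labels` list is a hypothetical stand-in for `data['failure_mode']`):

```python
# Hypothetical label values standing in for data['failure_mode'].
labels = ["Noise", "Breakdown", "Vibration", "Breakdown", "Noise"]

# Sorting the unique labels fixes their order, so each failure mode
# receives the same index on every run.
failure_mode_mapping = {
    mode: idx for idx, mode in enumerate(sorted(set(labels)))
}

# Inverse mapping, handy for turning predicted indices back into names.
index_to_mode = {idx: mode for mode, idx in failure_mode_mapping.items()}

encoded = [failure_mode_mapping[mode] for mode in labels]
print(failure_mode_mapping)  # {'Breakdown': 0, 'Noise': 1, 'Vibration': 2}
print(encoded)               # [1, 0, 2, 0, 1]
```

With a sorted mapping, the class indices in saved predictions and reports stay comparable between runs.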


In [4]:
# create the vector column: doc.vector is the average of the document's token vectors
data['vector'] = data['description'].apply(lambda desc: nlp(desc).vector)
data.head()

   failure_mode                                        description                                             vector
0             5  Compressor CP-001 is experiencing internal lea...  [0.038941465, 0.26220626, -0.10477795, 0.04772...
1            17  Compressor CP-101 is showing abnormal pressure...  [-0.051272765, 0.27505192, -0.09706584, 0.0453...
2            17  Compressor C-101 is giving an abnormal high pr...  [-0.041457046, 0.3033054, -0.05593128, 0.02128...
3            17  Compressor C-101-A is giving abnormal instrume...  [-0.0275956, 0.27556953, -0.09625757, 0.077970...
4            17  Compressor CP-101 is giving an abnormal instru...  [-0.054595143, 0.28313974, -0.09315031, 0.0417...


In [5]:
# split the data
x_train, x_test, y_train, y_test = train_test_split(
    data['vector'],
    data['failure_mode'],
    test_size=0.2,
    random_state=42,
    stratify=data['failure_mode']
)

In [6]:
# stack the per-row vectors into 2-D arrays of shape (n_samples, n_features)
x_train = np.stack(x_train)
x_test = np.stack(x_test)

In [10]:
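# create a pipeline; MinMaxScaler maps each feature to [0, 1] because
# MultinomialNB requires non-negative inputs and embedding components can be negative
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', MultinomialNB())
])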

# train the model
pipeline.fit(x_train, y_train)

# make predictions
y_pred = pipeline.predict(x_test)
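
The scaler in the pipeline above is there because `MultinomialNB` expects non-negative feature values, while embedding components are often negative. The following is a minimal pure-Python sketch of per-column min-max scaling, not scikit-learn's implementation; `min_max_scale` and the toy vectors are hypothetical names and values chosen for illustration.

```python
def min_max_scale(rows):
    """Scale each column of equal-length rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no range; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical 2-dimensional "embeddings" with a negative component.
vectors = [[-0.5, 0.2], [0.0, 0.4], [0.5, 0.3]]
scaled = min_max_scale(vectors)
print(scaled)
```

In the pipeline, the minimum and maximum come from the training split only, so scaled test values can fall slightly outside [0, 1].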

In [11]:
# print classification report; the order of the mapping's keys matches the encoded class indices 0..n-1
print(classification_report(y_test, y_pred, target_names=list(failure_mode_mapping)))

                                   precision    recall  f1-score   support

        Failure to stop on demand       1.00      0.21      0.35        14
                      High output       0.56      0.36      0.43        14
                        Breakdown       0.68      0.97      0.80        79
                      Overheating       1.00      0.43      0.60        14
External leakage - utility medium       0.98      0.74      0.84        61
                 Internal leakage       0.43      0.21      0.29        14
                   Erratic output       0.85      0.97      0.91        80
                       Low output       0.43      0.21      0.29        14
        Minor in-service problems       0.88      0.50      0.64        14
                            Noise       0.88      0.50      0.64        14
                        Vibration       0.75      0.64      0.69        14
            Structural deficiency       1.00      0.71      0.83        14
                 Plugged
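
To read the report, it helps to recompute one row by hand. The sketch below uses the standard definitions of precision, recall, and F1. The counts are inferred from the rounded *Failure to stop on demand* row (support 14, recall 0.21, precision 1.00), so they are an assumption rather than values taken from the model.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw prediction counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Assumed counts: 3 of the 14 test samples found, no false positives.
p, r, f = precision_recall_f1(tp=3, fp=0, fn=11)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 1.00 0.21 0.35
```

A precision of 1.00 with a recall of 0.21 means the model rarely predicts this class, but is right whenever it does.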

In [13]:
# train a kNN model
knn_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5, metric='euclidean'))
])

knn_pipeline.fit(x_train, y_train)
y_pred = knn_pipeline.predict(x_test)

# print classification report; the order of the mapping's keys matches the encoded class indices 0..n-1
print(classification_report(y_test, y_pred, target_names=list(failure_mode_mapping)))

                                   precision    recall  f1-score   support

        Failure to stop on demand       0.58      0.79      0.67        14
                      High output       0.24      0.29      0.26        14
                        Breakdown       0.92      0.91      0.92        79
                      Overheating       1.00      0.57      0.73        14
External leakage - utility medium       0.95      0.98      0.97        61
                 Internal leakage       0.62      0.57      0.59        14
                   Erratic output       0.85      0.96      0.90        80
                       Low output       0.29      0.29      0.29        14
        Minor in-service problems       0.89      0.57      0.70        14
                            Noise       0.67      0.71      0.69        14
                        Vibration       0.83      0.71      0.77        14
            Structural deficiency       1.00      0.79      0.88        14
                 Plugged

# Why Embeddings Underperform Compared to Bag-of-n-Grams

## 1. Class Distribution and Rarity
- The dataset is **imbalanced**: in the 20% test split, the largest classes have about 80 samples, while many others have only 14.
- Averaged embeddings place semantically related descriptions close together, so **rare classes with overlapping semantics** (e.g., *High output* vs *Low output*, both with F1 below 0.45 in the reports above) are easily confused.
- Bag-of-n-Grams relies on **hard lexical cues** (“high”, “low”) that help distinguish these small classes.
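
The `support` column in the reports counts test samples only. As a rough sketch, the full class sizes can be estimated by dividing by `test_size=0.2`; the results are approximations, since the stratified split rounds per class.

```python
test_size = 0.2

# Support values copied from the classification reports above.
test_support = {
    "Erratic output": 80,
    "Breakdown": 79,
    "High output": 14,
    "Low output": 14,
}

# Estimated number of samples per class in the whole dataset.
estimated_total = {
    name: round(count / test_size) for name, count in test_support.items()
}

# Estimated number of samples per class left for training.
estimated_train = {
    name: estimated_total[name] - count for name, count in test_support.items()
}

imbalance_ratio = max(test_support.values()) / min(test_support.values())
print(estimated_total)
print(estimated_train)
print(round(imbalance_ratio, 2))
```

Under these assumptions, the smallest classes have on the order of 56 training examples each, against roughly 320 for the largest.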

## 2. Nature of Work Order Texts
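- Work order descriptions are **short, formulaic, and jargon-heavy**.
- The document vector used here is the **average of the token vectors**, which discards word order and dilutes the contribution of any single token. This works best when overall meaning, rather than one keyword, separates the classes.
- Here, **small token changes** (“fail to start” vs “fail to stop”) are crucial; Bag-of-n-Grams preserves these distinctions, whereas averaged embeddings blur them.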
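
The blurring effect can be illustrated with a toy example. The sketch below averages word vectors, which is what spaCy's `Doc.vector` does by default, and compares two phrases that differ in a single token. The 3-dimensional vectors are invented for illustration and do not come from `en_core_web_lg`.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
word_vectors = {
    "failed": [0.9, 0.1, 0.0],
    "to": [0.1, 0.1, 0.1],
    "start": [0.2, 0.8, 0.1],
    "stop": [0.2, 0.7, 0.3],
}


def mean_vector(text):
    """Average the vectors of the whitespace-separated tokens in text."""
    vectors = [word_vectors[token] for token in text.split()]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


start_vec = mean_vector("failed to start")
stop_vec = mean_vector("failed to stop")
similarity = cosine(start_vec, stop_vec)
print(round(similarity, 3))
```

The two phrases share two of their three tokens, so their averaged vectors are nearly parallel even though they describe opposite failure modes.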

## 3. General vs Domain-Specific Vocabulary
- spaCy’s pre-trained vectors come from **general text corpora** (e.g., Wikipedia and web crawl data), not from maintenance records.
- Technical terms (*choked, erratic output, parameter deviation*) and equipment tags such as *CP-101* may be poorly represented or missing.
- Bag-of-n-Grams learns **directly from the dataset**, avoiding this mismatch.
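
One way to gauge the mismatch is to list the tokens that the pre-trained vocabulary does not cover. The sketch below does this in plain Python against a hypothetical `known_words` set and made-up descriptions. With spaCy, a comparable check could use the `token.is_oov` attribute instead.

```python
def out_of_vocabulary(texts, known_words):
    """Return the sorted set of lowercase tokens not found in known_words."""
    missing = set()
    for text in texts:
        for token in text.lower().split():
            if token not in known_words:
                missing.add(token)
    return sorted(missing)


# Hypothetical stand-in for the vocabulary of a pre-trained model.
known_words = {
    "compressor", "is", "showing", "abnormal", "pressure", "valve", "on",
}

# Made-up descriptions in the style of the work orders.
descriptions = [
    "Compressor CP-101 is showing abnormal pressure",
    "Valve XV-204 is choked",
]

print(out_of_vocabulary(descriptions, known_words))
# ['choked', 'cp-101', 'xv-204']
```

A long list of uncovered tokens suggests that much of the domain vocabulary contributes little or nothing to the averaged vector.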

## 4. Model Bias Toward Frequent Classes
- Classifiers over dense vectors depend on **proximity in vector space**; with kNN in particular, **majority classes** supply more training points and therefore more potential neighbors near any query.
- Sparse Bag-of-n-Grams features let rare classes stand out when they use **unique terms**.
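
The neighbor-count effect can be seen in a deliberately exaggerated toy example. The sketch below is a plain-Python Euclidean kNN vote, not scikit-learn's `KNeighborsClassifier`; the 2-D points and the 5-to-2 class ratio are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k=5):
    """Predict by majority vote among the k nearest points (Euclidean)."""
    distances = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: five majority-class points and two minority-class points.
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.25, 0.35),
    (0.4, 0.4), (0.45, 0.45),
]
labels = ["Breakdown"] * 5 + ["High output"] * 2

# The query sits right next to the two minority-class points.
query = (0.42, 0.42)
print(knn_predict(query, points, labels, k=1))  # High output
print(knn_predict(query, points, labels, k=5))  # Breakdown
```

With `k=5`, the two nearest minority points are outvoted by three more distant majority points, which is how a frequent class can absorb a rare one.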

## 5. Task Requirements
- Failure mode classification depends on **precise keywords** (e.g., “overheating”, “leakage”).
- Embeddings may cluster “overheating” with “hot” or “temperature rise,” but in this dataset the label tends to follow the **exact term**.
- Bag-of-n-Grams outperforms because the task rewards **lexical precision over semantic similarity**.
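
To make the lexical-precision point concrete, here is a minimal sketch of word n-gram extraction (unigrams and bigrams). It illustrates the kind of features a Bag-of-n-Grams model sees and is not the implementation used for the comparison; `word_ngrams` and the example phrases are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams of length n_min..n_max in a lowercased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return Counter(grams)


start = word_ngrams("pump failed to start")
stop = word_ngrams("pump failed to stop")

# Features that appear in one description but not in the other.
only_start = sorted(set(start) - set(stop))
only_stop = sorted(set(stop) - set(start))
print(only_start)  # ['start', 'to start']
print(only_stop)   # ['stop', 'to stop']
```

Each description keeps dedicated features for the token that decides its class, so a linear or Naive Bayes model can weight them directly.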

---

## Key Takeaway
Embeddings underperform because they **over-generalize** in a setting where:
- The dataset is **small and imbalanced**.  
- The texts are **short and keyword-driven**.  
- The domain vocabulary is **poorly covered** by pre-trained vectors.  

Bag-of-n-Grams, despite being simple, is better aligned with the task because it captures **explicit lexical cues**.
