
# MOT Defect Embeddings Demo with MiniLM

**Author:** Donald Simpson  
**Data:** Contains public sector information licensed under the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

## What This Notebook Does

This notebook demonstrates how to use **embeddings** to analyse messy, unstructured text data - specifically MOT (Ministry of Transport) defect notes. You'll learn how to:

- **Convert text to numbers**: Transform defect notes into numerical vectors that capture meaning
- **Find hidden patterns**: Use clustering to group similar defects automatically  
- **Search by meaning**: Find related defects using semantic search instead of keyword matching

## Why This Matters

MOT testers write defect notes in their own words: "brake pipe corroded", "brake hose deteriorated", "brakes imbalanced". Keyword searches miss the connection between these phrasings: a search for "hose" never finds "pipe", and "brake" does not match "brakes" unless the search handles plurals. Embeddings address this by mapping texts with similar meanings to nearby vectors, so all three notes register as brake-related.
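
To make the keyword problem concrete, here is a minimal sketch of a naive whole-word search over the three phrasings quoted above. The `keyword_search` helper and the `sample_notes` list are hypothetical, written only for this illustration; they are not part of the notebook's pipeline.

```python
# Three phrasings quoted above, all describing brake problems.
sample_notes = [
    "brake pipe corroded",
    "brake hose deteriorated",
    "brakes imbalanced",
]

def keyword_search(query, notes):
    """Return the notes that contain every word of the query as a whole word."""
    words = query.lower().split()
    return [n for n in notes if all(w in n.lower().split() for w in words)]

# "hose" finds one note and misses the closely related "pipe" note.
print(keyword_search("hose", sample_notes))

# "brake" misses "brakes imbalanced" because the plural is a different token.
print(keyword_search("brake", sample_notes))

# "brake failure" finds nothing: no note contains the word "failure".
print(keyword_search("brake failure", sample_notes))
```

A real keyword engine would add stemming and synonym lists to patch these gaps one by one; embeddings sidestep that maintenance by comparing meaning directly.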

## Getting Started

Just run each cell in order: the notebook will automatically install any missing dependencies.

The first cell may take up to 30 seconds to install dependencies. The remaining cells run quickly, apart from a one-off model download in Step 1.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DonaldSimpson/mot_embeddings_demo/blob/main/mot_embeddings_demo.ipynb)


In [None]:

# Install dependencies automatically (only needed in Colab)
try:
    import sentence_transformers
    import sklearn
    import matplotlib
    print("✅ All dependencies already installed!")
except ImportError:
    print("📦 Installing required packages...")
    !pip install sentence-transformers scikit-learn matplotlib
    print("✅ Installation complete!")


In [None]:

# Suppress Hugging Face authentication warnings (optional authentication)
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="huggingface_hub")

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import numpy as np


## Sample MOT Defect Notes

Here's a small collection of real MOT defect notes for demonstration. Notice how different testers describe similar issues in various ways:

- **Brake issues**: "brake pipe corroded" vs "brake hose deteriorated" vs "brakes imbalanced"
- **Lighting problems**: "headlamp aim too high" 
- **Steering/suspension**: "steering rack gaiter damaged" vs "excessive play in steering column"

This variety is exactly why traditional keyword searches struggle - but embeddings can see the underlying patterns!

In [None]:

notes = [
    # Brake-related issues
    "Nearside rear brake pipe corroded",
    "Brake hose deteriorated", 
    "Brakes imbalanced across an axle",
    "Nearside rear brake hose perished",
    "Brake disc worn and pitted",
    "Brake pad material below minimum thickness",
    "Brake fluid level low",
    "Brake warning light illuminated",
    
    # Lighting problems
    "Headlamp aim too high",
    "Offside headlamp not working",
    "Nearside rear light bulb failed",
    "Fog light inoperative",
    "Number plate light not functioning",
    "Indicators not working properly",
    
    # Steering and suspension
    "Steering rack gaiter damaged",
    "Excessive play in steering column",
    "Nearside rear suspension arm corroded",
    "Shock absorber leaking",
    "Steering wheel alignment incorrect",
    "Suspension spring broken",
    "Track rod end worn",
    
    # Tyre and wheel issues
    "Offside front tyre worn close to legal limit",
    "Nearside rear tyre sidewall damaged",
    "Wheel bearing noisy",
    "Tyre pressure monitoring system fault",
    "Alloy wheel corroded",
    
    # Engine and exhaust
    "Exhaust leaking gases",
    "Engine oil leak",
    "Catalytic converter inefficient",
    "Air filter blocked",
    "Engine management light on",
    "Exhaust system corroded"
]


## Step 1: Generate Embeddings with MiniLM

Now we'll convert our text notes into numerical vectors (embeddings) using the MiniLM model. Think of this as creating a unique "fingerprint" for each piece of text that captures its meaning.

**What's happening here:**
- Each defect note becomes a 384-dimensional vector
- Notes with similar meanings will have similar vectors
- The model was fine-tuned on over a billion sentence pairs, so it has seen many ways of phrasing the same idea

**MiniLM** is a lightweight but effective model - perfect for this demo!

In [None]:

print("Loading MiniLM model (this may take a moment on first run)...")
model = SentenceTransformer("all-MiniLM-L6-v2")

print("Converting text to embeddings...")
embeddings = model.encode(notes)
print(f"✅ Generated embeddings: {embeddings.shape}")
print(f"📊 Processed {len(notes)} defect notes into {embeddings.shape[1]}-dimensional vectors")


## Step 2: Clustering Defects with K-Means

Now we'll group similar defects together automatically using K-means clustering. The algorithm will discover patterns in our embeddings and group related issues.

**What's happening here:**
- **K-means clustering** groups similar embeddings together
- We're using 5 clusters to group our diverse defect types (you can experiment with different numbers)
- **PCA (Principal Component Analysis)** reduces our 384-dimensional vectors to 2D for visualisation
- The scatter plot shows how defects cluster together

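**Look for patterns:** Ideally the clusters line up with brake issues, lighting problems, steering/suspension, tyres, and engine/exhaust defects. With only 32 short notes the split will not be perfectly clean: shared wording such as "Nearside rear" or "corroded" can pull notes from different categories together, and the 2D PCA plot only approximates distances in the full 384-dimensional space.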
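
If K-means is new to you, the sketch below shows the two steps it repeats: assign each point to its nearest centroid, then move each centroid to the mean of its members. This is a toy pure-Python version on made-up 2D points, not the scikit-learn implementation used in the next cell; the `kmeans` function and `toy_points` data are assumptions for illustration only.

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Toy K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (nearest by squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Made-up 2D points forming two obvious groups (not real embeddings).
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label. The notebook applies the same idea to 384-dimensional embeddings instead of 2D points.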

In [None]:

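print("Running K-means clustering...")
# n_init is set explicitly so results are consistent across scikit-learn versions
# (the default changed from 10 to "auto" in recent releases)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)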

print("Reducing dimensions for visualization...")
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

print("Creating visualization...")
plt.figure(figsize=(12,8))
scatter = plt.scatter(reduced[:,0], reduced[:,1], c=labels, cmap="viridis", s=100, alpha=0.7)

# Add labels for each point
for i, txt in enumerate(notes):
    plt.annotate(txt, (reduced[i,0]+0.01, reduced[i,1]+0.01), fontsize=8, alpha=0.8)

plt.title("Clustering MOT Defect Notes with MiniLM Embeddings", fontsize=14, fontweight='bold')
plt.xlabel("First Principal Component", fontsize=12)
plt.ylabel("Second Principal Component", fontsize=12)
# Cluster labels are discrete, so show one tick per cluster
plt.colorbar(scatter, ticks=range(kmeans.n_clusters), label="Cluster")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"✅ Created {len(set(labels))} clusters from {len(notes)} defect notes")


## Step 3: Semantic Search

Finally, let's try semantic search - finding defects that are similar in meaning to a query, not just exact word matches.

**What's happening here:**
- We encode our search query "brake failure" into the same embedding space
- **Cosine similarity** measures how similar our query is to each defect note
- We rank results by similarity score (closer to 1.0 = more similar in meaning, near 0 = unrelated; cosine similarity can in principle go as low as -1.0)
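
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The sketch below computes it in plain Python and ranks three notes against a query. The 3D vectors are made up to stand in for real 384-dimensional embeddings, and the `cosine` helper is written for illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3D vectors standing in for embeddings (real ones have 384 dimensions).
toy_vectors = {
    "brake pipe corroded": [0.9, 0.1, 0.0],
    "brake hose perished": [0.8, 0.2, 0.1],
    "headlamp aim too high": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # pretend this is the embedding of "brake failure"

# Rank notes from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda note: cosine(toy_query, toy_vectors[note]),
    reverse=True,
)
for note in ranked:
    print(f"{cosine(toy_query, toy_vectors[note]):.3f}  {note}")
```

The two brake notes score close to 1.0 and the headlamp note scores 0.0, which is the same ranking logic the search cell applies to the real embeddings.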

**Try different queries:** Change the query variable to "tyre wear", "steering problem", or "engine issue" to see how the model understands different concepts!

**The magic:** Notice how "brake failure" finds brake-related defects even though none of them contain the word "failure" - that's semantic understanding in action!

## Next Steps: Experiment and Explore

Now that you understand the basics, here are some ways to experiment with this notebook:

### Try These Modifications:

1. **Add your own defect notes** - Replace the sample data with notes from your own vehicle's MOT history
2. **Change the number of clusters** - Try `n_clusters=3`, `4`, or `6` instead of the default `5` to see how groupings change
3. **Experiment with different queries** - Try "safety concern", "performance issue", or "electrical fault"
4. **Try a different model** - Replace `"all-MiniLM-L6-v2"` with `"multi-qa-mpnet-base-dot-v1"`, a larger and slower model that may give better results (check its model card for the recommended similarity function)
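
One thing to watch when swapping models as in item 4: the suggested model's name ends in `dot`, which suggests it is meant to be scored with a dot product rather than cosine similarity (check the model card to confirm). The sketch below, using made-up vectors and hypothetical helper names, shows how the two scores relate: a raw dot product grows with vector length, while the dot product of unit-length vectors equals the cosine similarity.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale a vector to unit length (L2 normalisation)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

# Made-up vectors pointing in the same direction with very different lengths.
short_vec = [1.0, 2.0, 2.0]
long_vec = [10.0, 20.0, 20.0]

raw_score = dot(short_vec, long_vec)                          # depends on vector length
unit_score = dot(normalise(short_vec), normalise(long_vec))   # equals cosine similarity

print(raw_score)
print(round(unit_score, 6))
```

If a model's embeddings are not unit length, dot-product and cosine rankings can differ, so it is worth comparing both when you experiment.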

### Scale Up:

- **Larger datasets**: Try with hundreds or thousands of MOT records
- **Different domains**: Apply the same techniques to customer feedback, support tickets, or any unstructured text
- **Production deployment**: Check out the [MLOps blog post](https://www.donaldsimpson.co.uk/2025/09/22/mlops-for-devops-engineers-minilm-mlflow-demo/) to see how to turn this into a production pipeline

### Real-World Applications:

This same approach powers [CarHunch](https://www.carhunch.com) - a platform that analyses millions of MOT records to provide vehicle insights. 

The techniques you've just learned can help people make informed decisions about their vehicles.


## Let's Try It: Semantic Search in Action

Now let's run the semantic search to see how it works in practice. We'll search for "brake failure" and see which defect notes are most similar in meaning.

**What this code does:**
1. **Encode the query**: Convert "brake failure" into the same embedding space as our defect notes
2. **Calculate similarities**: Use cosine similarity to find how similar our query is to each defect note
3. **Rank results**: Sort by similarity score and show the top 5 matches
4. **Display results**: Show each match with its cosine similarity score (closer to 1.0 = more similar in meaning)


In [None]:
# Run the semantic search
query = "brake failure"
print(f"🔍 Searching for: '{query}'")
print("=" * 50)

qvec = model.encode([query])
sims = cosine_similarity(qvec, embeddings)[0]

print("Top 5 most similar defect notes:\n")
top = np.argsort(-sims)[:5]
for rank, i in enumerate(top, 1):
    print(f"{rank}. {notes[i]}")
    print(f"   Similarity: {sims[i]:.3f} ({sims[i]*100:.1f}% match)")
    print()

print("💡 Notice how 'brake failure' finds brake-related defects even though")
print("   none of them contain the word 'failure' - that's semantic understanding!")


## What You've Just Learned

Congratulations! You've successfully:

1. **Converted text to embeddings** - Transformed 32 MOT defect notes into 384-dimensional vectors
2. **Discovered hidden patterns** - Used clustering to automatically group similar defects
3. **Performed semantic search** - Found related defects by meaning, not just keywords

**Key Insight**: Embeddings understand that "brake failure" is related to "brake pipe corroded" and "brake hose deteriorated" even though neither note mentions failure. This is the power of semantic understanding!

**Real-World Impact**: These same techniques power [CarHunch](https://www.carhunch.com), helping people make informed decisions about their vehicles by analysing millions of MOT records.
