
# MOT Defect Embeddings Demo with MiniLM

Author: Donald Simpson  
Data: Contains public sector information licensed under the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

This notebook shows how to:
- Convert MOT defect notes into embeddings using MiniLM
- Cluster similar defects
- Run a simple semantic search query

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DonaldSimpson/mot_embeddings_demo/blob/main/mot_embeddings_demo.ipynb)


In [None]:

# Install dependencies (uncomment if running in Colab)
# !pip install sentence-transformers scikit-learn matplotlib


In [None]:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import numpy as np


## Sample MOT defect notes
A small sample of ten defect notes, used here for demonstration purposes.

In [None]:

notes = [
    "Nearside rear brake pipe corroded",
    "Brake hose deteriorated",
    "Brakes imbalanced across an axle",
    "Headlamp aim too high",
    "Exhaust leaking gases",
    "Offside front tyre worn close to legal limit",
    "Nearside rear suspension arm corroded",
    "Steering rack gaiter damaged",
    "Nearside rear brake hose perished",
    "Excessive play in steering column"
]


## Generate embeddings with MiniLM

In [None]:

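# Load the pretrained MiniLM sentence-embedding model (downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")
# Encode each note into a fixed-length vector: shape is (number of notes, embedding size)
embeddings = model.encode(notes)
print(f"Generated embeddings: {embeddings.shape}")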
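
Because the later steps compare embeddings by direction rather than length, it can help to see what L2 normalisation does to a matrix of vectors. The sketch below is illustrative only: `toy_embeddings` and `l2_normalise` are hypothetical names, and the small hand-written array stands in for the real `embeddings` so the example runs without downloading the model.

```python
import numpy as np

# Toy stand-in for `embeddings`: 3 vectors of dimension 4 (hypothetical values).
toy_embeddings = np.array([
    [3.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 2.0, 0.0],
])

def l2_normalise(matrix):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return matrix / norms

unit = l2_normalise(toy_embeddings)
print(np.linalg.norm(unit, axis=1))  # length of each row after normalisation
print(unit @ unit.T)                 # dot products of unit rows = cosine similarities
```

Once rows have unit length, a plain dot product gives the cosine similarity. As far as I know, `model.encode` also accepts `normalize_embeddings=True`, which should have a similar effect on the real embeddings.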


## Clustering defects with KMeans
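We group the embeddings into three clusters with KMeans, then project them to 2D with PCA for plotting. PCA is used only for the visualisation; the clustering itself runs on the full-dimensional embeddings.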

In [None]:

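# n_init is set explicitly so the result does not depend on the
# default of the installed scikit-learn version
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Project to two principal components for plotting only
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)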

plt.figure(figsize=(8,6))
plt.scatter(reduced[:,0], reduced[:,1], c=labels, cmap="viridis")
for i, txt in enumerate(notes):
    plt.annotate(txt, (reduced[i,0]+0.01, reduced[i,1]+0.01), fontsize=8)
plt.title("Clustering MOT Defect Notes with MiniLM Embeddings")
plt.show()
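
A 2D PCA plot keeps only part of the variation in the original vectors, so points that look close on screen may be further apart in the full space. The sketch below shows, under stated assumptions, how the projection and the retained variance can be computed with a plain SVD. `pca_project` is a hypothetical helper and the random `toy` matrix stands in for the real embeddings; it is not the notebook's method, which uses scikit-learn's `PCA`.

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project rows onto the top principal components using SVD.

    Returns the projected points and the fraction of total variance
    explained by each retained component.
    """
    centred = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    projected = centred @ vt[:n_components].T
    return projected, ratio[:n_components]

# Hypothetical data: 10 points in 6 dimensions.
rng = np.random.default_rng(0)
toy = rng.normal(size=(10, 6))

points, explained = pca_project(toy)
print(points.shape)      # one 2D point per input row
print(explained.sum())   # share of variance kept by the 2D view
```

With the notebook's fitted `pca` object, the same information should be available as `pca.explained_variance_ratio_`. The signs of the projected axes may differ from scikit-learn's output, which does not affect the plot's structure.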


## Semantic search
Encode a free-text query with the same model, then rank the defect notes by cosine similarity to it.

In [None]:

query = "brake failure"
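# Encode the query as a batch of one so cosine_similarity receives a 2D array
qvec = model.encode([query])
# Similarity of the query to every note; [0] selects the single query row
sims = cosine_similarity(qvec, embeddings)[0]

print(f"Top matches for query: '{query}'\n")
# Indices of the five highest-scoring notes, best first
top = np.argsort(-sims)[:5]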
for i in top:
    print(f"{notes[i]} (similarity {sims[i]:.2f})")
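
With only ten notes, a top-5 list will always return five results, including ones that are only weakly related to the query. One possible refinement is to wrap the ranking in a function that also applies a minimum score. The sketch below is a minimal version of that idea: `search`, `min_score` and the 2D toy vectors are hypothetical, and the threshold of 0.5 is an arbitrary illustration, not a recommended value.

```python
import numpy as np

def search(query_vec, doc_vecs, k=5, min_score=0.0):
    """Return (index, score) pairs for the k most similar rows, best first.

    Matches scoring below min_score are dropped, so fewer than k results
    may be returned. Assumes no vector has zero length.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]

toy_docs = np.array([
    [1.0, 0.0],   # same direction as the query
    [1.0, 1.0],   # 45 degrees away
    [0.0, 1.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
])
toy_query = np.array([2.0, 0.0])

results = search(toy_query, toy_docs, k=3, min_score=0.5)
print(results)
```

To try this on the notebook's data, the call would look like `search(qvec[0], embeddings)`, since `qvec` holds a single query row. A suitable `min_score` depends on the model and the data and would need to be chosen by inspecting real results.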
