### 💬 Tokenization and Embedding
In this notebook, we'll explore two fundamental Natural Language Processing (NLP) concepts:

1. ✂️ **Tokenization** — breaking text into smaller units like words or subwords
2. 🔢 **Embedding** — converting tokens into numerical vectors

We'll use a small set of machine log entries as examples, simulating predictive maintenance tasks.

In [1]:
# 📄 Sample Log Entries (Textual Features)
log_entries = [
    "[2025-01-01 07:12:34] Warning: Motor bearing temperature exceeded threshold (85.3°C). Recommended inspection within next 24 hours.",
    "[2025-01-01 13:45:10] Info: Vibration levels increased by 35% over baseline. Possible rotor imbalance detected.",
    "[2025-01-02 03:22:50] Error: Sudden voltage dip recorded during startup. Check power supply stability and soft starter settings.",
    "[2025-01-02 17:18:05] Maintenance: Lubrication applied to drive-end bearing. Next scheduled lubrication due in 500 operating hours."
]
log_entries

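['[2025-01-01 07:12:34] Warning: Motor bearing temperature exceeded threshold (85.3°C). Recommended inspection within next 24 hours.',
 '[2025-01-01 13:45:10] Info: Vibration levels increased by 35% over baseline. Possible rotor imbalance detected.',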
 '[2025-01-02 03:22:50] Error: Sudden voltage dip recorded during startup. Check power supply stability and soft starter settings.',
 '[2025-01-02 17:18:05] Maintenance: Lubrication applied to drive-end bearing. Next scheduled lubrication due in 500 operating hours.']

#### ✂️ Tokenization using `nltk`
**Tokenization** is the process of splitting text into individual components — usually words or subwords.
We'll use `nltk` for simple word-level tokenization.

In [2]:
# ✂️ Tokenize log entries
# !pip install nltk
import nltk
from nltk.tokenize import word_tokenize
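# Tokenizer data used by word_tokenize. Recent NLTK releases (3.9 and later) look for
# 'punkt_tab' instead, so download that resource if word_tokenize raises a LookupError.
nltk.download('punkt')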

tokenized_logs = [word_tokenize(entry) for entry in log_entries]
for i, tokens in enumerate(tokenized_logs):
    print(f"Log {i+1} Tokens:", tokens)

Log 2 Tokens: ['[', '2025-01-01', '13:45:10', ']', 'Info', ':', 'Vibration', 'levels', 'increased', 'by', '35', '%', 'over', 'baseline', '.', 'Possible', 'rotor', 'imbalance', 'detected', '.']
Log 3 Tokens: ['[', '2025-01-02', '03:22:50', ']', 'Error', ':', 'Sudden', 'voltage', 'dip', 'recorded', 'during', 'startup', '.', 'Check', 'power', 'supply', 'stability', 'and', 'soft', 'starter', 'settings', '.']
Log 4 Tokens: ['[', '2025-01-02', '17:18:05', ']', 'Maintenance', ':', 'Lubrication', 'applied', 'to', 'drive-end', 'bearing', '.', 'Next', 'scheduled', 'lubrication', 'due', 'in', '500', 'operating', 'hours', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alaa.rashwan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
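
The word-level tokens above keep the timestamp brackets and punctuation as separate tokens. For log data it can help to split the structured prefix from the free-text message first. The following is a minimal sketch that uses only the standard library; the `parse_log` helper and both regular expressions are hypothetical and assume every entry follows the `[timestamp] Level: message` layout of the sample logs.

```python
import re

# Assumed layout (taken from the sample logs): "[timestamp] Level: message"
LOG_PATTERN = re.compile(r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<level>\w+):\s+(?P<message>.*)$")
# Words (optionally hyphenated) or numbers with an optional decimal part
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+(?:\.\d+)?")

def parse_log(entry):
    """Split a log entry into timestamp, level, and lowercase message tokens."""
    match = LOG_PATTERN.match(entry)
    if match is None:
        raise ValueError(f"Unexpected log format: {entry!r}")
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(match["message"])]
    return {
        "timestamp": match["timestamp"],
        "level": match["level"],
        "tokens": tokens,
    }

sample = (
    "[2025-01-02 17:18:05] Maintenance: Lubrication applied to drive-end bearing. "
    "Next scheduled lubrication due in 500 operating hours."
)
parsed = parse_log(sample)
print(parsed["timestamp"], parsed["level"])
print(parsed["tokens"])
```

Unlike `word_tokenize`, this sketch discards punctuation and symbols such as `%` and `°`, which may or may not be what a downstream model needs.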


#### 🔢 Embedding using `sentence-transformers`
We’ll use the `all-MiniLM-L6-v2` model to generate a dense vector representation of each log entry.
Each embedding captures the semantic meaning of the whole entry rather than of individual tokens.

In [3]:
# 🔢 Sentence Embedding
# !pip install sentence_transformers
import os
# Set this before importing sentence_transformers: huggingface_hub reads it at import time
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(log_entries)

# Display embedding shape and sample
print(f"Each log entry is embedded into a vector of shape: {embeddings[0].shape}")
print("\nSample embedding (first log):\n", embeddings[0])

Each log entry is embedded into a vector of shape: (384,)

Sample embedding (first log):
 [-1.60595458e-02  8.76776502e-03  6.47896016e-03  4.76096272e-02
  5.92179857e-02 -6.62303194e-02 -3.72492634e-02 -1.50096947e-02
 -5.40389903e-02 -1.94809143e-03  5.81466109e-02  3.88616207e-03
  1.97070297e-02  2.65249610e-03 -8.87184590e-03 -3.46903615e-02
  3.17541398e-02 -9.15662423e-02 -6.81873709e-02 -5.25095873e-02
  2.51352228e-02  7.94864893e-02 -4.58747298e-02  6.54792562e-02
 -9.44329202e-02  1.81149878e-02  6.76388619e-03  6.98614419e-02
 -7.22240135e-02  8.91826078e-02 -6.64988607e-02  4.88785617e-02
 -5.80304079e-02  1.82077382e-02  2.72095725e-02  1.78410728e-02
 -6.17585965e-02 -7.34469742e-02 -3.29818167e-02 -7.36914799e-02
  3.90591174e-02 -1.34725524e-02  1.13352835e-01  8.22264776e-02
  2.80987378e-02  4.78578620e-02 -2.96776164e-02 -8.27977434e-02
  6.52854741e-02 -7.74518847e-02 -1.36947306e-03  2.03280225e-02
  1.44724339e-01 -8.22459683e-02 -7.66755119e-02 -1.95445567e-02
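
To build some intuition for where a single sentence vector comes from: to our understanding of the `all-MiniLM-L6-v2` model card, the model averages its per-token vectors (mean pooling, ignoring padding) and then scales the result to unit length. The sketch below illustrates that idea in plain Python. The helper names and the tiny 2-dimensional token vectors are made up for illustration; the real model works with 384-dimensional vectors and its own tokenizer.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

# Toy data (not real model output): two real tokens and one padding token
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
sentence_vector = l2_normalize(pooled)
print("Pooled:", pooled)
print("Unit-length sentence vector:", sentence_vector)
```

The padding vector is ignored entirely, so padding a batch to a common length does not change the pooled result.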


#### 🧠 Optional: Cosine Similarity Between Logs
We can measure how similar two log messages are by computing the cosine similarity between their embeddings.
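
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). As a rough, dependency-free sketch of that definition, with a hypothetical helper name and toy vectors that are not real embeddings:

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration
same_direction = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_manual([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector (as in the first pair) leaves the similarity unchanged.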

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
# Reuse the same labels for rows and columns
labels = [f"Log {i+1}" for i in range(len(log_entries))]
similarity_df = pd.DataFrame(similarity_matrix, index=labels, columns=labels)
similarity_df

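|       | Log 1    | Log 2    | Log 3    | Log 4    |
|-------|----------|----------|----------|----------|
| Log 1 | 1.0      | 0.358248 | 0.19575  | 0.502088 |
| Log 2 | 0.358248 | 1.0      | 0.281684 | 0.263989 |
| Log 3 | 0.19575  | 0.281684 | 1.0      | 0.016385 |
| Log 4 | 0.502088 | 0.263989 | 0.016385 | 1.0      |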
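
A similarity matrix is symmetric with ones on the diagonal, so only the pairs above the diagonal carry information. One possible way to read it is to rank those pairs from most to least similar. The sketch below hard-codes the values reported above so it runs on its own; `rank_pairs` is a hypothetical helper, not part of scikit-learn or pandas.

```python
from itertools import combinations

labels = ["Log 1", "Log 2", "Log 3", "Log 4"]
# Values copied from the similarity matrix above
similarity = [
    [1.0, 0.358248, 0.19575, 0.502088],
    [0.358248, 1.0, 0.281684, 0.263989],
    [0.19575, 0.281684, 1.0, 0.016385],
    [0.502088, 0.263989, 0.016385, 1.0],
]

def rank_pairs(labels, matrix):
    """Return (label_a, label_b, score) for each unique pair, highest score first."""
    pairs = [
        (labels[i], labels[j], matrix[i][j])
        for i, j in combinations(range(len(labels)), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

ranked = rank_pairs(labels, similarity)
for first, second, score in ranked:
    print(f"{first} vs {second}: {score:.3f}")
```

On these values the two bearing-related entries (Log 1 and Log 4) rank highest, while the voltage-dip error (Log 3) and the lubrication record (Log 4) rank lowest.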
