Columns:
TopologyId: Categorical (machine/station ID).
MessageTime, EventDateTime: Timestamps for error and resolution.
MessageText: PLC error text (e.g., "Error : 977 : ...").
translated_comments: Operator free-text observation (e.g., "detained EOL"); the code below reads this column as operator_comments.
maintenance_comments: Target variable (e.g., "Error removal, automatic mode reset...").
msg_hour, msg_dayofweek, msg_weekend, event_hour, event_dayofweek, event_weekend: Temporal features.
msg_shift, event_shift: Shift categories (e.g., "night").
response_time_minutes: Numerical (time to resolve).
Sample Insight:
Multiple MessageText values (e.g., "Error : 977", "Error : 1132") map to a single EventDateTime (04:20:51.840) and a single maintenance_comments entry, suggesting that errors raised within a 5-hour window were resolved together as one batch.
translated_comments ("detained EOL") is identical across these rows, so it is possibly tied to the batch rather than to each individual error.
Goal: Predict maintenance_comments from MessageText, translated_comments, and supporting features (TopologyId, response_time_minutes, etc.) using XGBoost as a baseline.
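
The batch pattern from the sample insight can be checked by counting distinct MessageText values per resolution event. This is a minimal sketch on a made-up frame: the rows, timestamps, and the helper name batch_sizes are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

def batch_sizes(df: pd.DataFrame) -> pd.Series:
    """Count distinct error messages per (machine, resolution time) pair."""
    return (
        df.groupby(["TopologyId", "EventDateTime"])["MessageText"]
          .nunique()
          .rename("errors_in_batch")
    )

# Toy rows (hypothetical): three errors on M1 share one resolution time
toy = pd.DataFrame({
    "TopologyId": ["M1", "M1", "M1", "M2"],
    "MessageText": ["Error : 977", "Error : 1132", "Error : 977", "Error : 12"],
    "EventDateTime": pd.to_datetime([
        "2024-01-01 04:20", "2024-01-01 04:20",
        "2024-01-01 04:20", "2024-01-01 02:30",
    ]),
})

sizes = batch_sizes(toy)
print(sizes)
```

A group with errors_in_batch greater than 1 is a candidate batch resolution.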

In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
import pickle




In [4]:

path = r'C:/Users/yasmin_s1/sanjida_projects/projects/mater_thesis/ML/Datasets/Dataset_7months/'
df = pd.read_csv(path + 'cleaned_dataset.csv')

The goal is to remove only true duplicates: rows that are identical in every meaningful way (same error, same operator comment, same solution, same machine, same error time). MessageTime records when the error occurred, while EventDateTime records when it was resolved, so a single EventDateTime can cover several errors.
Example: Rows 1-5 in my sample differ in MessageText and MessageTime but share EventDateTime and maintenance_comments. These are not duplicates; they are distinct errors resolved in one batch.
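
To make the distinction concrete, here is a small sketch with made-up rows (the values are assumptions for illustration only). Two rows share the resolution but differ in MessageText and MessageTime, and a third row is an exact repeat of the first; only the repeat is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "TopologyId": ["M1", "M1", "M1"],
    "MessageText": ["Error : 977", "Error : 1132", "Error : 977"],
    "operator_comments": ["detained EOL"] * 3,
    "maintenance_comments": ["Error removal"] * 3,
    "MessageTime": pd.to_datetime(
        ["2024-01-01 00:10", "2024-01-01 01:30", "2024-01-01 00:10"]
    ),
})

# Rows count as duplicates only if they match on every key column
key_cols = ["TopologyId", "MessageText", "operator_comments",
            "maintenance_comments", "MessageTime"]
deduped = toy.drop_duplicates(subset=key_cols)

print(len(toy), "->", len(deduped))
```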

In [6]:
# Convert timestamps
df['MessageTime'] = pd.to_datetime(df['MessageTime'])
df['EventDateTime'] = pd.to_datetime(df['EventDateTime'])

# Remove true duplicates (exact matches across key columns)
strict_subset = ['TopologyId', 'MessageText', 'operator_comments', 'maintenance_comments', 
                 'response_time_minutes', 'MessageTime']
df = df.drop_duplicates(subset=strict_subset)
print(f"Rows after deduplication: {len(df)}")

Rows after deduplication: 65325


In [8]:
# Handle missing values
df = df.dropna(subset=['MessageText', 'operator_comments', 'maintenance_comments'])
print(f"Rows after dropping NA: {len(df)}, unique maintenance_comments: {df['maintenance_comments'].nunique()}")

Rows after dropping NA: 65325, unique maintenance_comments: 1628


The 384D embeddings are vector representations in a 384-dimensional space. In machine learning and natural language processing (NLP), an embedding is a dense numeric representation of text (or another entity) in which semantically similar inputs map to nearby vectors. The all-MiniLM-L6-v2 model used below outputs one 384-dimensional vector per input string.
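
"Nearby" is usually measured with cosine similarity. The sketch below shows the computation on random 384-dimensional stand-in vectors; these are not real model outputs, and the names base, near, and far are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
base = rng.normal(size=384)               # stand-in for one text embedding
near = base + 0.1 * rng.normal(size=384)  # slightly perturbed copy ("similar text")
far = rng.normal(size=384)                # independent vector ("unrelated text")

sim_near = cosine_similarity(base, near)
sim_far = cosine_similarity(base, far)
print(f"near: {sim_near:.3f}, far: {sim_far:.3f}")
```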

In [9]:
# Step 2: Feature Engineering
# Text embeddings for MessageText and operator_comments (the operator free text)
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384D embeddings
msg_embeddings = model.encode(df['MessageText'].tolist(), show_progress_bar=True)
trans_embeddings = model.encode(df['operator_comments'].tolist(), show_progress_bar=True)

In [10]:
# Convert to DataFrames
msg_emb_df = pd.DataFrame(msg_embeddings, columns=[f'msg_emb_{i}' for i in range(msg_embeddings.shape[1])])
op_emb_df = pd.DataFrame(trans_embeddings, columns=[f'op_emb_{i}' for i in range(trans_embeddings.shape[1])])
df = pd.concat([df.reset_index(drop=True), msg_emb_df, op_emb_df], axis=1)

In [11]:
# Encode categorical features
le_topology = LabelEncoder()
le_shift_msg = LabelEncoder()
le_shift_event = LabelEncoder()
df['TopologyId_encoded'] = le_topology.fit_transform(df['TopologyId'])
df['msg_shift_encoded'] = le_shift_msg.fit_transform(df['msg_shift'])
df['event_shift_encoded'] = le_shift_event.fit_transform(df['event_shift'])

In [12]:
# Target encoding
le_target = LabelEncoder()
df['maintenance_comments_encoded'] = le_target.fit_transform(df['maintenance_comments'])

In [14]:
# Select features for XGBoost (excluding MessageTime, EventDateTime)
features = (['TopologyId_encoded', 'response_time_minutes', 'msg_hour', 'msg_dayofweek', 
             'msg_weekend', 'event_hour', 'event_dayofweek', 'event_weekend', 
             'msg_shift_encoded', 'event_shift_encoded'] + 
            [f'msg_emb_{i}' for i in range(msg_embeddings.shape[1])] + 
            [f'op_emb_{i}' for i in range(trans_embeddings.shape[1])])
X = df[features]
y = df['maintenance_comments_encoded']

In [15]:
# Save processed dataset
df.to_csv(path + "processed_dataset_for_XGB.csv", index=False)
print(f"Features shape: {X.shape}, unique labels: {y.nunique()}")

Features shape: (65325, 778), unique labels: 1628


In [65]:
df = pd.read_csv(path + "processed_dataset_for_XGB.csv")
print(f"Loaded rows: {len(df)}, columns: {len(df.columns)}")

Loaded rows: 65325, columns: 787


In [66]:
# Step 2: Filter Rare Classes (<2 instances)
class_counts = df['maintenance_comments_encoded'].value_counts()
rare_classes = class_counts[class_counts < 2].index
df = df[~df['maintenance_comments_encoded'].isin(rare_classes)]
print(f"Rows after removing rare classes: {len(df)}, unique labels: {df['maintenance_comments_encoded'].nunique()}")

Rows after removing rare classes: 65181, unique labels: 1484


In [67]:
# Step 3: Define Features and Target
features = (['TopologyId_encoded', 'response_time_minutes', 'msg_hour', 'msg_dayofweek', 
             'msg_weekend', 'event_hour', 'event_dayofweek', 'event_weekend', 
             'msg_shift_encoded', 'event_shift_encoded'] + 
            [f'msg_emb_{i}' for i in range(384)] + 
            [f'op_emb_{i}' for i in range(384)])
X = df[features]
y = df['maintenance_comments_encoded']
# Verify shapes
print(f"Features shape: {X.shape}, unique labels: {y.nunique()}")

Features shape: (65181, 778), unique labels: 1484


In [68]:
# Step 4: Train-Test Split with Dynamic Filter
# First split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"After first split - Train: {X_train.shape}, Temp: {X_temp.shape}")

After first split - Train: (45626, 778), Temp: (19555, 778)


In [69]:
# Filter rare classes in y_temp
temp_df = pd.DataFrame({'y_temp': y_temp}, index=X_temp.index)
temp_counts = temp_df['y_temp'].value_counts()
temp_rare = temp_counts[temp_counts < 2].index
keep_indices = ~temp_df['y_temp'].isin(temp_rare)
X_temp = X_temp[keep_indices]
y_temp = y_temp[keep_indices]
print(f"After filtering y_temp - Temp rows: {len(X_temp)}, unique labels: {y_temp.nunique()}")

After filtering y_temp - Temp rows: 19159, unique labels: 1088


In [70]:
# Second split
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print(f"Final split - Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

Final split - Train: (45626, 778), Val: (9579, 778), Test: (9580, 778)
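
The rare-class filters before each split are needed because a stratified split requires at least two rows per class (one for each side); train_test_split raises a ValueError otherwise. A minimal sketch of the filter on a toy label series, using the hypothetical helper name drop_singleton_classes:

```python
import pandas as pd

def drop_singleton_classes(y: pd.Series, min_count: int = 2) -> pd.Series:
    """Keep only labels whose class has at least min_count rows."""
    counts = y.value_counts()
    keep = counts[counts >= min_count].index
    return y[y.isin(keep)]

# Toy labels: class 2 appears once and would break a stratified split
y_toy = pd.Series([0, 0, 0, 1, 1, 2, 3, 3])
y_filtered = drop_singleton_classes(y_toy)

print(sorted(y_filtered.unique()))
```

Note that the filter has to be repeated after the first split, because a class with two or three rows overall can end up with a single row in the temp set.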


In [71]:
# Step 5: Re-encode Labels After Filtering
# Combine all final labels
final_labels = np.concatenate([y_train, y_val, y_test])
le_final = LabelEncoder()
final_labels_encoded = le_final.fit_transform(final_labels)  # Re-encode to 0 to num_classes-1


In [72]:
# Split re-encoded labels back
train_size = len(y_train)
val_size = len(y_val)
y_train = final_labels_encoded[:train_size]
y_val = final_labels_encoded[train_size:train_size + val_size]
y_test = final_labels_encoded[train_size + val_size:]

In [73]:
# Verify new range
print(f"Re-encoded y_train min/max: {y_train.min()}/{y_train.max()}")
print(f"Re-encoded y_val min/max: {y_val.min()}/{y_val.max()}")
print(f"Re-encoded y_test min/max: {y_test.min()}/{y_test.max()}")

Re-encoded y_train min/max: 0/1483
Re-encoded y_val min/max: 2/1483
Re-encoded y_test min/max: 2/1483
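
Re-encoding matters because the XGBoost multi-class objectives expect labels in the contiguous range 0 to num_class-1, and filtering leaves gaps in the original codes. The same mapping can be sketched with NumPy alone; the label values below are made up for illustration.

```python
import numpy as np

# Original codes with gaps left by filtering (hypothetical values)
original = np.array([5, 9, 9, 42, 5, 1200])

# classes holds the sorted unique codes; contiguous indexes into it
classes, contiguous = np.unique(original, return_inverse=True)
print(classes)      # sorted unique original codes
print(contiguous)   # labels in 0..len(classes)-1

# Indexing classes with the contiguous labels restores the original codes
recovered = classes[contiguous]
```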


In [74]:
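# Step 6: Recreate LabelEncoder for Reporting
# Refit on the filtered df so classes_ holds only the surviving comments;
# LabelEncoder sorts its classes, so the order matches the re-encoded labels
le_target = LabelEncoder()
le_target.fit(df['maintenance_comments'])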

In [51]:
from xgboost import XGBClassifier, DMatrix, train, callback

In [75]:
# Step 7: Convert to DMatrix and Train
dtrain = DMatrix(X_train, label=y_train)
dval = DMatrix(X_val, label=y_val)
dtest = DMatrix(X_test, label=y_test)

In [76]:
# Define parameters with correct num_class
params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'objective': 'multi:softmax',
    'num_class': len(np.unique(final_labels)),  # 1,484 classes (re-encoded labels span 0..1483)
    'random_state': 42,
    'tree_method': 'hist',
    'eval_metric': 'mlogloss'
}

In [77]:
# Define early stopping callback
early_stopping = callback.EarlyStopping(
    rounds=10,
    metric_name='mlogloss',
    data_name='val'
)


In [None]:
# Train model
evals = [(dtrain, 'train'), (dval, 'val')]
model = train(
    params,
    dtrain,
    num_boost_round=100,
    evals=evals,
    callbacks=[early_stopping],
    verbose_eval=True
)

[0]	train-mlogloss:5.13060	val-mlogloss:5.09918
[1]	train-mlogloss:4.48444	val-mlogloss:4.43134
[2]	train-mlogloss:6.42494	val-mlogloss:6.30355
[3]	train-mlogloss:5.50031	val-mlogloss:5.42696
[4]	train-mlogloss:7.08069	val-mlogloss:7.01117
[5]	train-mlogloss:5.10093	val-mlogloss:5.04813
[6]	train-mlogloss:4.77419	val-mlogloss:4.72750
[7]	train-mlogloss:4.16834	val-mlogloss:4.12108
[8]	train-mlogloss:3.39872	val-mlogloss:3.36767
[9]	train-mlogloss:3.14805	val-mlogloss:3.11778
[10]	train-mlogloss:2.87465	val-mlogloss:2.83800
[11]	train-mlogloss:2.23662	val-mlogloss:2.20020
[12]	train-mlogloss:2.07126	val-mlogloss:2.03026
[13]	train-mlogloss:1.93066	val-mlogloss:1.89311
[14]	train-mlogloss:1.81940	val-mlogloss:1.78452
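
As a reference point for reading the log, the multi-class log loss of a uniform guess over K classes is ln(K). The sketch below computes it with NumPy for K = 1484 (the number of re-encoded classes); the function name multiclass_log_loss and the toy labels are assumptions for illustration, not part of XGBoost.

```python
import numpy as np

def multiclass_log_loss(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15
    picked = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(-np.mean(np.log(picked)))

num_class = 1484
labels = np.array([0, 10, 500, 1483])                    # toy true labels
uniform = np.full((len(labels), num_class), 1.0 / num_class)

baseline = multiclass_log_loss(uniform, labels)
print(f"uniform-guess mlogloss: {baseline:.2f}")         # ln(1484), about 7.30
```

Values below this baseline mean the model assigns more probability to the true class than uniform guessing would.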


In [None]:
# Step 8: Evaluate Model and Feature Importance
y_pred_val = model.predict(dval).astype(int)
y_pred_test = model.predict(dtest).astype(int)

y_val_orig = le_final.inverse_transform(y_val)
y_test_orig = le_final.inverse_transform(y_test)
y_pred_val_orig = le_final.inverse_transform(y_pred_val)
y_pred_test_orig = le_final.inverse_transform(y_pred_test)

In [None]:
# Report on the re-encoded labels: le_target.classes_[k] names re-encoded label k.
# Passing labels= keeps target_names aligned with the classes actually present.
print("\nValidation Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
labels_val = np.unique(np.concatenate([y_val, y_pred_val]))
print(classification_report(y_val, y_pred_val, labels=labels_val,
                            target_names=le_target.classes_[labels_val], zero_division=0))

print("\nTest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
labels_test = np.unique(np.concatenate([y_test, y_pred_test]))
print(classification_report(y_test, y_pred_test, labels=labels_test,
                            target_names=le_target.classes_[labels_test], zero_division=0))

In [None]:
# The DMatrix was built from a DataFrame, so get_score() keys are the column names
importances = model.get_score(importance_type='weight')
feature_importance_df = pd.DataFrame({
    'Feature': list(importances.keys()),
    'Importance': list(importances.values())
})
print("\nFeature Importance (Top 10):")
print(feature_importance_df.sort_values(by='Importance', ascending=False).head(10))
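
With 768 embedding dimensions, a top-10 list of single features can be hard to interpret. One option is to sum the scores per feature group. This is a sketch with made-up scores; the helper name group_importance and the group labels are assumptions, and the prefixes follow the msg_emb_ / op_emb_ column names used above.

```python
from collections import defaultdict

def group_importance(importances: dict) -> dict:
    """Sum importance scores by feature group, based on column-name prefix."""
    totals = defaultdict(float)
    for name, score in importances.items():
        if name.startswith("msg_emb_"):
            group = "MessageText embedding"
        elif name.startswith("op_emb_"):
            group = "operator comment embedding"
        else:
            group = "tabular"
        totals[group] += score
    return dict(totals)

# Hypothetical scores in the shape returned by get_score()
example = {"msg_emb_3": 12.0, "msg_emb_200": 5.0,
           "op_emb_7": 9.0, "response_time_minutes": 4.0}
grouped = group_importance(example)
print(grouped)
```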

In [None]:
import joblib
import numpy as np
import os

path2 = r'C:/Users/yasmin_s1/sanjida_projects/projects/sanjida-dev1/log-parser/master-thesis/thesis work/ML/trained_model/XGB_model/'

In [None]:
# Step 9: Save Model and LabelEncoders with Confirmation
model_file = path2 + "xgboost_model.joblib"
joblib.dump(model, model_file)
joblib.dump(le_target, path2 + "label_encoder_target.joblib")
joblib.dump(le_final, path2 + "label_encoder_final.joblib")