# **Mini Project Title : Equipment Maintenance Scheduling**
# **Text Classification by Fine-tuning Language Model**

# **1. Data Loading**

In [None]:
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('NLP_dataset.csv', encoding='ISO-8859-1')

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['priority'].value_counts())  # Class distribution

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['complaint description'],
    'labels': train_data['priority']
})

val_df = pd.DataFrame({
    'text': val_data['complaint description'],
    'labels': val_data['priority']
})

# Display sample data
print(train_df.head())
print(val_df.head())



# **2. Text Processing**

In [None]:
# Use val_data as the test dataset
test_df = pd.DataFrame({
    'text': val_data['complaint description'],
    'labels': val_data['priority']
})

# Check sample test data
print(test_df.head())


                                                   text  labels
940   Power consumption by the pallet jack has spike...  Medium
297   The temperature control system on the industri...    High
271   I have been facing a persistent issue with my ...    High
948   There is a minor but persistent issue with the...  Medium
1065  The injection molding machines temperature co...  Medium


In [None]:
# Text preprocessing

import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers (anything that is not a letter or whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())

                                                   text  labels
532   the diesel engine cooling fan is unable to mai...  Urgent
1105  the hair dryer produces only cold air despite ...  Medium
1479  the water purifier screendisplay has dead pixe...     Low
945   there is a minor but persistent issue with the...  Medium
1477  the air conditioner is making a strange noise ...     Low
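
The sample output above contains `screendisplay`, which suggests that a separator between two words was deleted rather than replaced. A minimal sketch of a variant that substitutes a space for each removed character and then collapses whitespace; the function name `clean_text_spaced` and the sample string are illustrative assumptions, not part of the original notebook:

```python
import re

def clean_text_spaced(text: str) -> str:
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = text.lower()
    # Replace every character that is not a letter or whitespace with a space,
    # so "screen/display" becomes "screen display" instead of "screendisplay"
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "The water purifier screen/display has 3 dead pixels!"  # hypothetical input
print(clean_text_spaced(sample))
# the water purifier screen display has dead pixels
```

The trade-off is that apostrophes also become spaces (`machine's` turns into `machine s`), so whether this variant is preferable depends on how the raw complaints are punctuated.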


# **3. Text Embedding using BERT and RoBERTa**

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
train_df["labels"] = encoder.fit_transform(train_df["labels"])
val_df["labels"] = encoder.transform(val_df["labels"])
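
`LabelEncoder` assigns integer codes in sorted order of the class names, so the four priorities are coded alphabetically rather than by severity. A small pure-Python sketch of the same ordering (the variable names here are illustrative assumptions), useful for building the reverse mapping needed to decode predictions:

```python
# The four priority classes that appear in the dataset
priorities = ["Medium", "High", "Urgent", "Low"]

# LabelEncoder sorts the unique class names, so codes follow alphabetical order
classes = sorted(set(priorities))
label_to_id = {label: idx for idx, label in enumerate(classes)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(label_to_id)  # {'High': 0, 'Low': 1, 'Medium': 2, 'Urgent': 3}
print(id_to_label)  # {0: 'High', 1: 'Low', 2: 'Medium', 3: 'Urgent'}
```

With a fitted encoder, the same information is available from `encoder.classes_`, and `encoder.inverse_transform` turns predicted codes back into class names.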


In [None]:
#Text Embedding using BERT and RoBERTa
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=4, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=4, use_cuda=False)  # Set use_cuda=True if using a GPU


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
print(train_df.head())

                                                   text  labels
532   the diesel engine cooling fan is unable to mai...       3
1105  the hair dryer produces only cold air despite ...       2
1479  the water purifier screendisplay has dead pixe...       1
945   there is a minor but persistent issue with the...       2
1477  the air conditioner is making a strange noise ...       1


In [None]:
print(train_df["labels"].unique())
print(val_df["labels"].unique())
print(test_df["labels"].unique())


[3 2 1 0]
[2 0 3 1]
['Medium' 'High' 'Urgent' 'Low']


# **4. Model Training with BERT and RoBERTa**

#### **Basic Model Training**

#### **Train the BERT Model**

In [None]:
bert_model.train_model(train_df)


(149, 1.3387587114468518)

#### **Train the RoBERTa Model**

In [None]:
roberta_model.train_model(train_df, output_dir="new_outputs/")





(149, 1.335012248698497)

#### **Model Training with Hyperparameters**

In [None]:
from simpletransformers.classification import ClassificationArgs, ClassificationModel

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,
    eval_batch_size=8,
    learning_rate=3e-5,
    max_seq_length=128,
    weight_decay=0.01,
    warmup_steps=0,
    logging_steps=50,
    save_steps=200,
    overwrite_output_dir=True  # Allow overwriting existing outputs
)

# Train the BERT model
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=4, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Train the RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=4, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



(447, 1.1733353086632636)
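
The step counts in the training output follow from the batch size: each epoch runs ceil(n / batch_size) steps (assuming no gradient accumulation), and the returned global step is that figure times the number of epochs. A quick check, where the training-set size of 1192 is an assumption chosen to be consistent with the 149 steps per epoch shown above (the actual row count is not printed in the notebook):

```python
import math

def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

n_train = 1192        # assumed size; not shown in the notebook output
batch_size = 8        # train_batch_size from model_args
epochs = 3            # num_train_epochs from model_args

per_epoch = steps_per_epoch(n_train, batch_size)
total_steps = per_epoch * epochs
print(per_epoch, total_steps)  # 149 447
```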

# **5. Evaluation on Validation Set**
#### **Evaluate BERT Model**

In [None]:
# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)


BERT Evaluation Results:
{'mcc': np.float64(0.27088552309657965), 'eval_loss': 1.1418287456035614}


#### **Evaluate RoBERTa Model**

In [None]:
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)


RoBERTa Evaluation Results:
{'mcc': np.float64(0.24478121008311116), 'eval_loss': 1.132736811512395}
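
The `mcc` value reported by `eval_model` is the Matthews correlation coefficient, which ranges from -1 to 1, with 0 corresponding to chance-level agreement. A self-contained sketch of the multiclass form computed from label counts, written in plain Python for illustration (the function name and toy labels are assumptions; in practice a library routine such as `sklearn.metrics.matthews_corrcoef` would normally be used):

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from label counts."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_counts = Counter(y_true)                        # how often each class occurs
    p_counts = Counter(y_pred)                        # how often each class is predicted
    classes = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    denom = math.sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Convention: return 0.0 when the denominator vanishes
    return numerator / denom if denom else 0.0

y_true = [0, 1, 2, 3, 0, 1, 2, 3]   # toy labels, not from the dataset
y_pred = [0, 1, 2, 3, 0, 1, 3, 2]   # two of eight predictions are wrong
print(round(multiclass_mcc(y_true, y_pred), 4))  # 0.6667
```

On this toy example accuracy is 0.75 while MCC is about 0.67, which illustrates that MCC discounts the agreement expected from the class frequencies alone.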


# **6. Saving the Best Model**

#### **Saving the BERT Model**

In [None]:
bert_model.save_model('bert_best_model')

#### **Saving the RoBERTa Model**

In [None]:
roberta_model.save_model('roberta_best_model')

In [None]:
# Save BERT Model
bert_model.model.save_pretrained("bert_model")   # Saves the model weights
bert_model.tokenizer.save_pretrained("bert_model")  # Saves tokenizer
print("BERT model saved successfully!")

# Save RoBERTa Model
roberta_model.model.save_pretrained("roberta_model")   # Saves the model weights
roberta_model.tokenizer.save_pretrained("roberta_model")  # Saves tokenizer
print(" RoBERTa model saved successfully!")


BERT model saved successfully!
 RoBERTa model saved successfully!


In [None]:
# Save the trained model and tokenizer
bert_model.model.save_pretrained("bert_best_model")  # Saves model weights
bert_model.tokenizer.save_pretrained("bert_best_model")  # Saves tokenizer

print("Model saved successfully!")


Model saved successfully!


In [None]:
import os

print(" Files inside 'bert_best_model':", os.listdir("bert_best_model"))


 Files inside 'bert_best_model': ['special_tokens_map.json', 'vocab.txt', 'tokenizer.json', 'config.json', 'tokenizer_config.json', 'model.safetensors']


# **7. Prediction on Real-World Input**

#### **Prediction Using BERT Model**

In [None]:
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["The industrial air compressor has suddenly shut down during peak operation, causing unexpected shutdowns during critical operations. Immediate intervention is necessary to prevent further complications."]

# Predict the class
predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")



BERT Predictions: [3]


In [None]:
roberta_model.model.save_pretrained("roberta_best_model")  # Saves model weights
roberta_model.tokenizer.save_pretrained("roberta_best_model")  # Saves tokenizer

print("✅ Model saved successfully!")

✅ Model saved successfully!


#### **Prediction Using RoBERTa Model**

In [None]:
# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["The industrial air compressor has suddenly shut down during peak operation, causing unexpected shutdowns during critical operations. Immediate intervention is necessary to prevent further complications."]

# Predict the class
predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")


RoBERTa Predictions: [3]




In [None]:
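# LabelEncoder assigned codes in alphabetical order during training,
# so the test labels must use the same coding
priority_mapping = {
    "High": 0,
    "Low": 1,
    "Medium": 2,
    "Urgent": 3
}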
test_df = pd.DataFrame({
    'text': val_data['complaint description'],  # Using val_data as test
    'labels': val_data['priority'].map(priority_mapping)  # Convert using map
})
print(test_df.head())

                                                   text  labels
940   Power consumption by the pallet jack has spike...       2
297   The temperature control system on the industri...       1
271   I have been facing a persistent issue with my ...       1
948   There is a minor but persistent issue with the...       2
1065  The injection molding machines temperature co...       2


#### **Accuracy Calculation of BERT and RoBERTa Model**

In [None]:
from sklearn.metrics import accuracy_score

# Convert model outputs to predicted labels
predicted_labels_bert = model_outputs_bert.argmax(axis=1)
predicted_labels_roberta = model_outputs_roberta.argmax(axis=1)

# Calculate accuracy
accuracy_bert = accuracy_score(test_df['labels'], predicted_labels_bert)
accuracy_roberta = accuracy_score(test_df['labels'], predicted_labels_roberta)

print(f"BERT Model Accuracy: {accuracy_bert:.4f}")
print(f"RoBERTa Model Accuracy: {accuracy_roberta:.4f}")


BERT Model Accuracy: 0.3255
RoBERTa Model Accuracy: 0.3054
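
Accuracy is only meaningful when the reference labels use the same integer coding as the model's output columns. A toy sketch of the argmax-then-compare step, with made-up scores and labels, contrasting the alphabetical coding produced by `LabelEncoder` with a severity-ordered coding; all names and numbers below are illustrative assumptions:

```python
def argmax(row):
    """Index of the largest score in one row of model outputs."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy raw scores; columns follow the training coding High=0, Low=1, Medium=2, Urgent=3
outputs = [
    [0.1, 0.2, 0.3, 2.0],   # predicts Urgent
    [1.5, 0.0, 0.2, 0.1],   # predicts High
    [0.0, 0.1, 1.2, 0.3],   # predicts Medium
    [0.2, 1.1, 0.4, 0.0],   # predicts Low
]
true_names = ["Urgent", "High", "Medium", "Low"]

alphabetical = {"High": 0, "Low": 1, "Medium": 2, "Urgent": 3}   # matches training
by_severity = {"Urgent": 0, "High": 1, "Medium": 2, "Low": 3}    # different order

preds = [argmax(row) for row in outputs]
print(accuracy([alphabetical[n] for n in true_names], preds))  # 1.0
print(accuracy([by_severity[n] for n in true_names], preds))   # 0.25
```

The predictions are identical in both cases; only the coding of the reference labels changes, and that alone moves the score from perfect to chance level.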


In [None]:
# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

# Priority mapping: must match the LabelEncoder coding used in training (alphabetical order)
priority_labels = {0: "High", 1: "Low", 2: "Medium", 3: "Urgent"}

def get_priority(description):
    # Predict the class using the RoBERTa model
    predictions, _ = roberta_model.predict([description])

    # Convert model prediction to priority label
    priority = priority_labels.get(predictions[0], "Unknown")
    return priority

# Get user input dynamically
while True:
    user_input = input("Enter the system description (or type 'exit' to quit): ")

    if user_input.lower() == "exit":
        print("Exiting the program.")
        break

    priority = get_priority(user_input)
    print(f"Predicted Priority: {priority}\n")


Enter the system description (or type 'exit' to quit): The diesel engine cooling fan has developed a persistent rattling noise, indicating internal component wear, leading to prolonged downtime, affecting operational schedules. Immediate intervention is necessary to prevent further complications.



Predicted Priority: Urgent

Enter the system description (or type 'exit' to quit): The high-voltage electrical panel has suffered visible corrosion on critical components, threatening structural integrity, creating an environmental hazard due to leakage of hazardous substances. Immediate intervention is necessary to prevent further complications.



Predicted Priority: Low

Enter the system description (or type 'exit' to quit): The fire suppression system has suffered visible corrosion on critical components, threatening structural integrity, leading to unpredictable failures that could escalate if not repaired soon. Immediate intervention is necessary to prevent further complications.



Predicted Priority: Low

Enter the system description (or type 'exit' to quit): I have been facing a persistent issue with my power adapter. The printer leaves streaks on all documents, making them unreadable. It started happening unexpectedly, and despite trying multiple fixes, the problem still remains. This has severely impacted my work, and I am unable to use the device efficiently. I have checked online forums and tried some troubleshooting steps, but nothing seems to work. I would appreciate immediate assistance to resolve this matter as soon as possible.



Predicted Priority: High

Enter the system description (or type 'exit' to quit): exit
Exiting the program.


In [None]:
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Priority mapping: must match the LabelEncoder coding used in training (alphabetical order)
priority_labels = {0: "High", 1: "Low", 2: "Medium", 3: "Urgent"}

def get_priority_bert(description):
    # Predict the class using the BERT model
    predictions, _ = bert_model.predict([description])

    # Convert model prediction to priority label
    priority = priority_labels.get(predictions[0], "Unknown")
    return priority

# Get user input dynamically
while True:
    user_input = input("Enter the system description (or type 'exit' to quit): ")

    if user_input.lower() == "exit":
        print("Exiting the program.")
        break

    priority = get_priority_bert(user_input)
    print(f"BERT Model Predicted Priority: {priority}\n")


Enter the system description (or type 'exit' to quit):  The diesel engine cooling fan has developed a persistent rattling noise, indicating internal component wear, leading to prolonged downtime, affecting operational schedules. Immediate intervention is necessary to prevent further complications.



BERT Model Predicted Priority: Urgent

Enter the system description (or type 'exit' to quit):  I have been facing a persistent issue with my power adapter. The printer leaves streaks on all documents, making them unreadable. It started happening unexpectedly, and despite trying multiple fixes, the problem still remains. This has severely impacted my work, and I am unable to use the device efficiently. I have checked online forums and tried some troubleshooting steps, but nothing seems to work. I would appreciate immediate assistance to resolve this matter as soon as possible.



BERT Model Predicted Priority: Medium

Enter the system description (or type 'exit' to quit): The fire suppression system has suffered visible corrosion on critical components, threatening structural integrity, leading to unpredictable failures that could escalate if not repaired soon. Immediate intervention is necessary to prevent further complications.



BERT Model Predicted Priority: Low

Enter the system description (or type 'exit' to quit): exit
Exiting the program.


# **8. Analysis**
### Discussion of Results

1. **BERT**:
   - **Performance**: The BERT model achieved a Matthews Correlation Coefficient (MCC) of 0.2709 and an evaluation loss of 1.1418 on the validation set; the accuracy cell reports 0.3255.
   - **Analysis**: An MCC of about 0.27 indicates only weak agreement between predicted and true priorities. This is plausible after three epochs of CPU fine-tuning on limited data, possibly compounded by class imbalance. The result is best treated as a first baseline rather than a finished model.

2. **RoBERTa**:
   - **Performance**: RoBERTa achieved an MCC of 0.2448 and an evaluation loss of 1.1327 on the validation set; the accuracy cell reports 0.3054.
   - **Analysis**: RoBERTa often outperforms BERT, but here its MCC was slightly lower and its evaluation loss only marginally better, so the two models are close. The same factors apply: limited data, class imbalance, and the need for further fine-tuning.

The accuracy figures are only comparable with the MCC values if the integer labels in `test_df` use the same coding as the `LabelEncoder` applied to the training data (alphabetical order: High=0, Low=1, Medium=2, Urgent=3).


### Best Performing Feature Set

- **Transformer Models (BERT and RoBERTa)**: Transformer models like BERT and RoBERTa capture contextual meaning that traditional features (BoW, TF-IDF) cannot. This notebook does not train a traditional baseline, so the size of that advantage on this dataset is not measured here. Between the two transformers, BERT was slightly ahead on MCC (0.2709 vs. 0.2448).

### Challenges and Interesting Findings

- **Model Comparison**: BERT and RoBERTa reached similar validation scores, and neither clearly separated the four priority classes after three epochs.
- **Class Imbalance**: Uneven class distribution likely affected model performance; the notebook applies no resampling or class weighting to counter it.
- **Training Time**: Fine-tuning transformers requires significantly more computational resources and time than traditional models, especially on CPU (`use_cuda=False`).
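
One common way to counter class imbalance is to weight the loss by inverse class frequency. A minimal sketch of the "balanced" heuristic, n_samples / (n_classes * class_count); the class counts below are hypothetical because the notebook's `value_counts()` output is not shown:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count), ordered by class id."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[k]) for k in sorted(counts)]

# Hypothetical encoded labels: class 2 is common, class 3 is rare
labels = [0] * 30 + [1] * 20 + [2] * 40 + [3] * 10
weights = balanced_class_weights(labels)
print([round(w, 3) for w in weights])  # [0.833, 1.25, 0.625, 2.5]
```

Simple Transformers' `ClassificationModel` accepts a `weight` argument with one weight per class, so a list like this could be passed when constructing the model; whether it helps on this dataset would need to be checked empirically.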

### Potential Improvements and Further Experiments

1. **Fine-Tuning**: Training for more epochs, ideally on a GPU, and tuning the learning rate and sequence length on domain-specific data can improve model accuracy.
2. **Data Augmentation**: Generating additional examples for under-represented priorities can help balance classes and enhance model performance.
3. **Ensemble Methods**: Combining the predictions of BERT and RoBERTa may improve overall prediction accuracy.
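
A simple form of ensembling is soft voting: convert each model's raw outputs to probabilities with softmax, average them, and take the most probable class. A sketch with made-up scores for one complaint (the numbers and function names are illustrative assumptions, not results from this notebook):

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(*model_scores):
    """Average per-model probabilities and return (class index, averaged probs)."""
    probs = [softmax(s) for s in model_scores]
    avg = [sum(col) / len(probs) for col in zip(*probs)]
    return max(range(len(avg)), key=avg.__getitem__), avg

# Hypothetical raw outputs for one text; columns: High, Low, Medium, Urgent
bert_scores = [0.2, -0.5, 0.1, 1.4]
roberta_scores = [1.0, -0.3, 0.0, 0.9]

label, avg_probs = soft_vote(bert_scores, roberta_scores)
print(label, [round(p, 3) for p in avg_probs])
```

In this toy case the second model alone would pick class 0, but the averaged probabilities favor class 3. In the notebook's setting, the score rows would come from the second value returned by each model's `predict` call.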