
# BERT-based Text Classification for Safety Analytics  
## Predicting Fatal vs Nonfatal Accidents  
### Case Study: OSHA Accident Reports



## Objective
Classify OSHA accident descriptions as **Fatal** or **Nonfatal** by fine-tuning a pretrained BERT model.
This notebook is written for students without a computer science background.



## Step 0: Install Libraries (Run Once)


In [18]:

# openpyxl lets pandas read .xlsx files; accelerate is required by the Trainer
!pip install transformers torch scikit-learn pandas openpyxl accelerate





## Step 1: Import Libraries


In [19]:

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments



## Step 2: Load OSHA Dataset


In [20]:

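# Read the Excel sheet and keep only the report text and its severity class
df = pd.read_excel("/content/OSHA data1.xlsx")
df = df[["Description", "Class"]]
df.head()  # preview the first five rows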


|   | Description | Class |
|---|---|---|
| 0 | At 9:00 a.m. on August 10, 2017, an employee w... | Nonfatal |
| 1 | At 9:45 a.m. on July 17, 2017, an employee was... | Nonfatal |
| 2 | At 7:30 a.m. on June 30, 2017, an employee was... | Nonfatal |
| 3 | At 2:00 p.m. on June 30, 2017, an employee was... | Fatal |
| 4 | At 12:20 p.m. on June 23, 2017, an employee wa... | Nonfatal |



## Step 3: Encode Labels
Nonfatal → 0, Fatal → 1


In [21]:

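# Map the text classes to integers; a value missing from the dictionary would become NaN
df["Label"] = df["Class"].map({"Nonfatal": 0, "Fatal": 1})
df["Label"].value_counts()  # number of reports per label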


| Label | count |
|---|---|
| 1 | 441 |
| 0 | 59 |
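
The counts above show a strong class imbalance: 441 reports carry label 1 and 59 carry label 0. As a minimal sketch (not part of the original notebook), the snippet below computes each label's share and the "balanced" weights `n_samples / (n_classes * count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. The variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 441, 0: 59}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Share of each label in the dataset.
shares = {label: count / n_samples for label, count in label_counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

for label in sorted(label_counts):
    print(f"label {label}: share={shares[label]:.3f}, weight={class_weights[label]:.3f}")
```

With these counts the smaller class gets a weight of about 4.24 and the larger class about 0.57. The `Trainer` call in this notebook does not use class weights; they are shown here only to quantify the imbalance.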



## Step 4: Train-Test Split


In [22]:

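# Hold out 30% of the reports for testing. stratify keeps the label proportions
# the same in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df["Description"].astype(str).tolist(),
    df["Label"].tolist(),
    test_size=0.3,
    random_state=42,
    stratify=df["Label"]
)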
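
To see what `stratify` does, the sketch below repeats the same split on synthetic labels with the counts reported in Step 3 (441 ones, 59 zeros) and compares the share of label 1 in each part. The `labels` and `texts` lists are assumptions standing in for the real columns.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df["Label"]: 441 ones and 59 zeros, as counted in Step 3.
labels = [1] * 441 + [0] * 59
texts = [f"report {i}" for i in range(len(labels))]  # placeholder descriptions

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

def share_of_ones(values):
    """Fraction of entries equal to 1."""
    return sum(values) / len(values)

print("train size:", len(y_tr), "test size:", len(y_te))
print("share of label 1 (all / train / test):",
      round(share_of_ones(labels), 3),
      round(share_of_ones(y_tr), 3),
      round(share_of_ones(y_te), 3))
```

Both parts keep roughly the 88% / 12% label mix of the full dataset, which matters here because the smaller class has only 59 examples.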



## Step 5: Load BERT Tokenizer


In [23]:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")



## Step 6: Tokenize Text


In [24]:

def tokenize_text(texts):
    # Pad every text to the longest one in the list and cut anything beyond 128 tokens
    return tokenizer(texts, padding=True, truncation=True, max_length=128)

train_encodings = tokenize_text(X_train)
test_encodings = tokenize_text(X_test)
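
The `padding`, `truncation`, and `max_length` arguments are easier to picture on a toy example. The sketch below is plain Python, not the BERT tokenizer: `toy_encode` is a hypothetical helper that maps whitespace-separated words to made-up integer ids, then truncates and pads them into the same two fields the tokenizer returns (`input_ids`, plus an `attention_mask` with 1 for real tokens and 0 for padding).

```python
def toy_encode(texts, max_length, pad_id=0):
    """Toy illustration of truncation and padding (not WordPiece tokenization)."""
    vocab = {}
    batch = []
    for text in texts:
        # Give each new word the next free id, starting at 1 (0 is reserved for padding).
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
        batch.append(ids[:max_length])  # truncation: drop tokens beyond max_length

    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad to the longest sequence
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode(
    ["Employee fell from ladder", "Worker was struck by a falling steel beam"],
    max_length=6,
)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The real tokenizer additionally splits words into subword pieces and adds the special `[CLS]` and `[SEP]` tokens, so its ids and lengths will differ from this toy version.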



## Step 7: Create Dataset Class


In [25]:

class OSHA_Dataset(torch.utils.data.Dataset):
    """Pairs the tokenized texts with their labels so the Trainer can read one example at a time."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Return example idx as tensors (input_ids, attention_mask, ...) plus its label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = OSHA_Dataset(train_encodings, y_train)
test_dataset = OSHA_Dataset(test_encodings, y_test)



## Step 8: Load BERT Model


In [26]:

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Step 9: Training Configuration


In [27]:
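
training_args = TrainingArguments(
    output_dir="./bert_results",
    num_train_epochs=2,               # full passes over the training data
    per_device_train_batch_size=8,    # examples per training step
    per_device_eval_batch_size=8,
    eval_strategy="epoch",            # evaluate after every epoch (evaluation_strategy in older transformers releases)
    save_strategy="no",               # do not write checkpoints to disk
    logging_steps=50,                 # log the training loss every 50 steps
    report_to="none"                  # disable external experiment loggers
)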


## Step 10: Train the Model


In [28]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset  # the held-out test set is also used for the per-epoch validation loss
)

trainer.train()


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.391376 |
| 2 | 0.367500 | 0.249158 |


TrainOutput(global_step=88, training_loss=0.28403496742248535, metrics={'train_runtime': 19.032, 'train_samples_per_second': 36.78, 'train_steps_per_second': 4.624, 'total_flos': 46044434688000.0, 'train_loss': 0.28403496742248535, 'epoch': 2.0})
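
The `global_step=88` in the output and the `No log` entry for epoch 1 both follow from the settings in Step 9. A short worked calculation, assuming the 70/30 split leaves 350 of the 500 reports for training and that training ran on a single device:

```python
import math

n_total = 500        # 441 + 59 reports, from Step 3
test_size = 0.3      # from Step 4
batch_size = 8       # per_device_train_batch_size
num_epochs = 2       # num_train_epochs
logging_steps = 50   # logging_steps

n_train = n_total - math.ceil(n_total * test_size)  # reports left for training
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller
total_steps = steps_per_epoch * num_epochs

# The training loss is first logged at step `logging_steps`; find the epoch it falls in.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print("training examples:", n_train)
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("first epoch with a logged training loss:", first_logged_epoch)
```

Epoch 1 ends at step 44, before the first log at step 50, so its training-loss cell reads `No log`. Setting `logging_steps` to a value of 44 or less should fill that cell in.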


## Step 11: Evaluate the Model


In [None]:

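# predictions.predictions holds one row of two raw scores (logits) per test report
predictions = trainer.predict(test_dataset)
y_pred = predictions.predictions.argmax(axis=1)  # pick the higher-scoring label

# target_names must follow the label ids: 0 = Nonfatal, 1 = Fatal
print(classification_report(y_test, y_pred, target_names=["Nonfatal", "Fatal"]))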
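
When reading the report, per-class recall says more than overall accuracy because the labels are imbalanced. The sketch below uses plain Python and the full-dataset counts from Step 3 (not the actual test predictions, which are not shown here) to score a hypothetical classifier that always predicts the larger class.

```python
def recall_for(label, y_true, y_pred):
    """Share of examples with the given true label that were predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total

# Illustrative labels with the Step 3 counts: 441 ones and 59 zeros.
y_true_demo = [1] * 441 + [0] * 59
# Hypothetical baseline that ignores the text and always predicts label 1.
y_pred_demo = [1] * len(y_true_demo)

accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(f"accuracy:         {accuracy:.3f}")
print(f"recall (label 1): {recall_for(1, y_true_demo, y_pred_demo):.3f}")
print(f"recall (label 0): {recall_for(0, y_true_demo, y_pred_demo):.3f}")
```

This do-nothing baseline reaches about 88% accuracy while missing every label-0 report, so the fine-tuned model is only useful if its recall on the smaller class is clearly above zero.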



## Step 12: Predict New Accident Severity


In [30]:

def predict_severity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    # Move the inputs to the model's device (the GPU after training); leaving them
    # on the CPU raises "Expected all tensors to be on the same device".
    inputs = {key: val.to(model.device) for key, val in inputs.items()}
    model.eval()  # switch off dropout for inference
    with torch.no_grad():  # gradients are not needed for prediction
        outputs = model(**inputs)
    return "Fatal" if torch.argmax(outputs.logits, dim=1).item() == 1 else "Nonfatal"

predict_severity("The worker fell from a ladder during maintenance work.")
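
`predict_severity` returns only the winning label. A softmax over the two logits turns them into probabilities, which gives a rough sense of how confident the model is. The sketch below uses plain Python and made-up logit values; in the notebook the logits would come from `outputs.logits`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one report, ordered as [Nonfatal, Fatal] (label ids 0 and 1).
example_logits = [-0.4, 1.3]
probs = softmax(example_logits)

label_names = ["Nonfatal", "Fatal"]
predicted = label_names[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

With PyTorch tensors the same step is `torch.softmax(outputs.logits, dim=1)`.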