### Using Legal-BERT for Named Entity Recognition
#### Date: 2/26/2024

This notebook sets up an NER model using a specialized BERT variant, [Legal-BERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased?text=The+applicant+submitted+that+her+husband+was+subjected+to+treatment+amounting+to+%5BMASK%5D+whilst+in+the+custody+of+police.), which is pre-trained on legal documents. Because police complaint reports can be considered a form of legal text, it is a reasonable starting point. To implement the NER model, I follow these steps:

1. Select a sample of the text and pre-process the text data

After loading, the text needs preprocessing. This can include removing or replacing certain characters, handling case sensitivity (we use the uncased model, so case should not matter), and possibly segmenting the text into smaller chunks if the documents are lengthy.
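
The chunking step mentioned above is not implemented in this notebook. As a minimal sketch, assuming a simple word-based window is good enough, the hypothetical helper `chunk_words` below splits a long document into overlapping chunks. The window size and overlap are illustrative values, not tuned ones.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping chunks of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

The overlap keeps an entity that straddles a chunk boundary fully visible in at least one chunk, at the cost of labeling some words twice.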

In [1]:
import os
import random
from text_parser import TextParser

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jonathanjuarez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jonathanjuarez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Create a list of the txt files for processing NER later
PATH = "/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/text_files"
text_parser = TextParser(PATH, nlp_task=None)
cases = os.listdir(PATH)
cases = [case for case in cases if case.endswith(".txt")]

Initializing parsers for None


In [10]:
random.seed(43)
new_cases_sample = random.sample(cases, 5)

2. Load Data

In [11]:
case_texts = []
for case in new_cases_sample:
    with open(os.path.join(PATH, case), 'r', encoding='utf-8') as file:
        text = file.read()
        preprocessed_text = text_parser.preprocess(text, return_as_list=False, remove_numbers=False, stem=False)
        case_texts.append(preprocessed_text)

3. Loading Legal-BERT and tokenizing

Tokenize the preprocessed text data. This involves converting the raw text into tokens (words or subwords) that can be fed into the model. The tokenizer will also add necessary tokens like [CLS] and [SEP] for BERT models.
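
To make the subword idea concrete, here is a toy sketch of greedy longest-match WordPiece splitting. The vocabulary `toy_vocab` and the helpers `wordpiece` and `encode` are made up for illustration; the real Legal-BERT tokenizer has its own vocabulary and also handles normalization and punctuation, which this sketch ignores.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical mini-vocabulary, not the real Legal-BERT vocabulary
toy_vocab = {"log", "1086", "##919", "marsh", "##field", "ave"}
print(encode("log 1086919 Marshfield Ave", toy_vocab))
```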

The NER pipeline abstracts away most of the complex code the library would otherwise require: we only need to specify the task ('ner') and pass in our model and tokenizer.

In [87]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("nlpaueb/legal-bert-base-uncased")
inputs = tokenizer(case_texts, padding=True, truncation=True, return_tensors="pt")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [90]:
for text in case_texts:
    entities = ner_pipeline(text)
    print(entities)

[{'entity': 'LABEL_1', 'score': 0.5649708, 'index': 1, 'word': 'independent', 'start': 0, 'end': 11}, {'entity': 'LABEL_1', 'score': 0.6266297, 'index': 2, 'word': 'police', 'start': 12, 'end': 18}, {'entity': 'LABEL_1', 'score': 0.59128684, 'index': 3, 'word': 'review', 'start': 19, 'end': 25}, {'entity': 'LABEL_0', 'score': 0.52860135, 'index': 4, 'word': 'authority', 'start': 26, 'end': 35}, {'entity': 'LABEL_1', 'score': 0.6339626, 'index': 5, 'word': 'log', 'start': 36, 'end': 39}, {'entity': 'LABEL_0', 'score': 0.583323, 'index': 6, 'word': '1078', 'start': 40, 'end': 44}, {'entity': 'LABEL_0', 'score': 0.5433303, 'index': 7, 'word': '##819', 'start': 44, 'end': 47}, {'entity': 'LABEL_0', 'score': 0.5081024, 'index': 8, 'word': 'page', 'start': 48, 'end': 52}, {'entity': 'LABEL_0', 'score': 0.5477352, 'index': 9, 'word': '3', 'start': 53, 'end': 54}, {'entity': 'LABEL_0', 'score': 0.5800761, 'index': 10, 'word': 'of', 'start': 55, 'end': 57}, {'entity': 'LABEL_0', 'score': 0.5939

In [92]:
# Example post-processing
for entity in entities:
    # Filter by confidence score, if necessary
    if entity['score'] > 0.5:
        print(f"Entity: {entity['word']}, Type: {entity['entity']}, Score: {entity['score']}")


Entity: civilian, Type: LABEL_1, Score: 0.5814656019210815
Entity: office, Type: LABEL_1, Score: 0.6517918705940247
Entity: of, Type: LABEL_0, Score: 0.5196552276611328
Entity: poli, Type: LABEL_1, Score: 0.5830131769180298
Entity: ce, Type: LABEL_1, Score: 0.5410661697387695
Entity: account, Type: LABEL_1, Score: 0.5518920421600342
Entity: ##ability, Type: LABEL_1, Score: 0.553440272808075
Entity: log, Type: LABEL_1, Score: 0.5995700359344482
Entity: 2021, Type: LABEL_0, Score: 0.628811776638031
Entity: 234, Type: LABEL_0, Score: 0.5006281733512878
Entity: ##1, Type: LABEL_1, Score: 0.5527348518371582
Entity: 1, Type: LABEL_1, Score: 0.5344969630241394
Entity: summary, Type: LABEL_0, Score: 0.5381017327308655
Entity: report, Type: LABEL_1, Score: 0.6091073155403137
Entity: of, Type: LABEL_0, Score: 0.5527588725090027
Entity: investigation, Type: LABEL_1, Score: 0.6295204758644104
Entity: date, Type: LABEL_0, Score: 0.5843188166618347
Entity: time, Type: LABEL_0, Score: 0.5323267579078

The labels this model outputs (LABEL_0, LABEL_1) are undefined placeholders, and the scores sit only slightly above 0.5. Getting meaningful labels (e.g., Person, Organization, Location) out of a domain-specific model like Legal-BERT takes more effort, because the base Legal-BERT model is pre-trained on legal text for language understanding, not for NER. pipeline("ner") expects a model that has been fine-tuned for NER, with a final classification layer trained to recognize and classify named entities. Without that fine-tuning the classification layer is randomly initialized (see the warning above), so the model cannot classify tokens into entity categories. The next steps are to try a model that is already fine-tuned for NER, and then to fine-tune Legal-BERT ourselves.
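
One way to read the scores in the output above: with two labels and an untrained classification layer, the logits are small and essentially arbitrary, so the softmax stays close to 0.5. The sketch below simulates this with random logits. The uniform range of [-0.5, 0.5] is my assumption for illustration, not a measurement of the actual model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
top_scores = []
for _ in range(1000):
    # Assumption: an untrained 2-label head yields small logits,
    # modelled here as uniform draws in [-0.5, 0.5]
    logits = [random.uniform(-0.5, 0.5) for _ in range(2)]
    top_scores.append(max(softmax(logits)))

# The winning score can never fall below 0.5 with two labels
print(min(top_scores), max(top_scores))
```

Under this assumption the winning score is bounded by 1 / (1 + e^-1), roughly 0.73, which is consistent with the 0.50 to 0.65 range printed above but does not prove anything about the real weights.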

In [6]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the pre-trained model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

results = []
for text in case_texts:
    entities = ner_pipeline(text)
    results.append(entities)

# Print the identified entities
for entity in results:
    print(entity)

[{'entity': 'I-MISC', 'score': 0.632722, 'index': 1, 'word': 'C', 'start': 0, 'end': 1}, {'entity': 'I-MISC', 'score': 0.6865795, 'index': 2, 'word': '##I', 'start': 1, 'end': 2}, {'entity': 'I-MISC', 'score': 0.8707013, 'index': 4, 'word': '##L', 'start': 4, 'end': 5}, {'entity': 'I-LOC', 'score': 0.9282275, 'index': 78, 'word': 'Marsh', 'start': 217, 'end': 222}, {'entity': 'I-LOC', 'score': 0.941689, 'index': 79, 'word': '##field', 'start': 222, 'end': 227}, {'entity': 'I-LOC', 'score': 0.98169327, 'index': 81, 'word': 'Chicago', 'start': 232, 'end': 239}, {'entity': 'B-LOC', 'score': 0.6701342, 'index': 82, 'word': 'IL', 'start': 240, 'end': 242}, {'entity': 'I-LOC', 'score': 0.95550734, 'index': 127, 'word': 'Marsh', 'start': 455, 'end': 460}, {'entity': 'I-LOC', 'score': 0.9289607, 'index': 128, 'word': '##field', 'start': 460, 'end': 465}, {'entity': 'I-LOC', 'score': 0.852332, 'index': 129, 'word': 'Ave', 'start': 466, 'end': 469}, {'entity': 'I-LOC', 'score': 0.990915, 'index'
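
The raw pipeline output is one dictionary per subword token, which is hard to read. As a rough sketch, the hypothetical helper `group_entities` below merges adjacent tokens of the same type into spans, using four token records copied from the output above (scores and indices dropped). I treat a `B-` tag as always starting a new span, which is an assumption about the tagging scheme this checkpoint uses. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that does similar grouping internally.

```python
def group_entities(token_preds):
    """Merge consecutive token-level predictions into (text, type, start, end) spans."""
    spans = []
    for tok in token_preds:
        prefix, _, ent_type = tok["entity"].partition("-")
        word = tok["word"]
        is_subword = word.startswith("##")
        last = spans[-1] if spans else None
        # Tokens are contiguous if they touch or are separated by one character
        contiguous = last is not None and tok["start"] - last["end"] <= 1
        same_type = last is not None and last["type"] == ent_type
        if contiguous and same_type and (is_subword or prefix != "B"):
            last["text"] += word[2:] if is_subword else " " + word
            last["end"] = tok["end"]
        else:
            spans.append({
                "text": word[2:] if is_subword else word,
                "type": ent_type,
                "start": tok["start"],
                "end": tok["end"],
            })
    return [(s["text"], s["type"], s["start"], s["end"]) for s in spans]


# Token records taken from the pipeline output above
tokens = [
    {"entity": "I-LOC", "word": "Marsh", "start": 217, "end": 222},
    {"entity": "I-LOC", "word": "##field", "start": 222, "end": 227},
    {"entity": "I-LOC", "word": "Chicago", "start": 232, "end": 239},
    {"entity": "B-LOC", "word": "IL", "start": 240, "end": 242},
]
for span in group_entities(tokens):
    print(span)
```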

The model above was fine-tuned on the CoNLL-2003 shared task dataset, one of the best-known NER datasets, which contains annotations for English and German. The English part labels persons, organizations, locations, and miscellaneous entities. Next, we fine-tune Legal-BERT on the same dataset.

In [8]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("conll2003")
model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

After loading the CoNLL-2003 dataset, we need to preprocess it to make it compatible with the model and tokenizer. This involves tokenizing the text and aligning the entity labels with the tokens produced by the tokenizer.

In [9]:
# Function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding="max_length", max_length=128)
    
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their word in the input sequence
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)  # Special token or same word as previous token
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply the function to the dataset
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
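
To see what the alignment does, here is the same rule applied to a small hand-written example, with no tokenizer involved. The `word_ids` list is what I would expect for a phrase like "Marshfield Ave Chicago" split as `[CLS] marsh ##field ave chicago [SEP]`; both it and the word labels are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a later subword of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical tokenization: [CLS] marsh ##field ave chicago [SEP]
word_ids = [None, 0, 0, 1, 2, None]
# B-LOC, I-LOC, B-LOC as indices into the CoNLL-2003 label list
word_labels = [5, 6, 5]
print(align_labels(word_ids, word_labels))
```

Positions marked -100 are skipped by the loss, so only the first subword of each word contributes to training and evaluation.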

In [10]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})


In [19]:
ner_feature = tokenized_datasets['train'].features['ner_tags']
label_list = ner_feature.feature.names
print(label_list)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


A compute_metrics function lets the Trainer report entity-level precision, recall, and F1 (via seqeval) at each evaluation.

In [20]:
# pip install seqeval

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
from transformers import EvalPrediction
import numpy as np

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def compute_metrics(p: EvalPrediction):
    predictions = np.argmax(p.predictions, axis=2)
    true_labels = p.label_ids
    true_labels_list = [[] for _ in range(true_labels.shape[0])]
    predictions_list = [[] for _ in range(predictions.shape[0])]

    for i in range(true_labels.shape[0]):
        for j in range(true_labels.shape[1]):
            if true_labels[i, j] != -100:  # Ignore special tokens
                true_labels_list[i].append(label_list[true_labels[i][j]])
                predictions_list[i].append(label_list[predictions[i][j]])

    return {
        "precision": precision_score(true_labels_list, predictions_list),
        "recall": recall_score(true_labels_list, predictions_list),
        "f1": f1_score(true_labels_list, predictions_list),
        "classification_report": classification_report(true_labels_list, predictions_list)
    }
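
For intuition about what "entity-level" means here, the sketch below computes precision, recall, and F1 by hand over exact span matches. It is a simplified stand-in written for this note, not seqeval's implementation, and it may treat malformed tag sequences differently from seqeval.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        if prefix in ("B", "I") and not continues:
            start, current = i, ent_type
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, F1 over exactly matching spans."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One sentence: the person span matches, the location is mistyped as ORG
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_prf(y_true, y_pred))
```

A span only counts as correct when both its boundaries and its type match, which is why the mistyped location hurts precision and recall at the same time.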



Define Training Arguments and Trainer

After preprocessing, we define the training arguments and set up the Trainer. This part involves specifying the training parameters and linking the dataset for training and evaluation.

In [21]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)  # Ensure num_labels matches dataset's label count

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
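
One optional refinement, sketched here rather than used in the run above: passing `id2label` and `label2id` when loading the model stores the readable label names in the saved config, so a later `pipeline("ner")` call would report `B-LOC` instead of `LABEL_5`. The function `load_labelled_model` is a hypothetical wrapper and is only defined here, not called, because calling it downloads the weights.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Mappings between class indices and CoNLL-2003 label names
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}


def load_labelled_model(model_name="nlpaueb/legal-bert-base-uncased"):
    """Load the token-classification model with readable label names (downloads weights)."""
    from transformers import AutoModelForTokenClassification

    return AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(label_list),
        id2label=id2label,
        label2id=label2id,
    )


print(id2label[5], label2id["I-ORG"])
```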


Training the model locally on CPU took 6.3 hours. Running on CUDA in Google Colab would be preferable, but I ran into issues there.

In [22]:
trainer.train()

Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 0.4397, 'grad_norm': 1.677411437034607, 'learning_rate': 5e-05, 'epoch': 0.57}


{'eval_loss': 0.0800144299864769, 'eval_precision': 0.8938775510204081, 'eval_recall': 0.8855939342881213, 'eval_f1': 0.889716462124418, 'eval_classification_report': '              precision    recall  f1-score   support\n\n         LOC       0.93      0.93      0.93      1837\n        MISC       0.86      0.73      0.79       922\n         ORG       0.79      0.85      0.82      1341\n         PER       0.95      0.95      0.95      1835\n\n   micro avg       0.89      0.89      0.89      5935\n   macro avg       0.88      0.86      0.87      5935\nweighted avg       0.90      0.89      0.89      5935\n', 'eval_runtime': 245.8231, 'eval_samples_per_second': 13.221, 'eval_steps_per_second': 1.656, 'epoch': 1.0}
{'loss': 0.0923, 'grad_norm': 1.32707941532135, 'learning_rate': 3.828491096532334e-05, 'epoch': 1.14}
{'loss': 0.0547, 'grad_norm': 2.3681766986846924, 'learning_rate': 2.6569821930646678e-05, 'epoch': 1.71}


{'eval_loss': 0.06410541385412216, 'eval_precision': 0.918950731461241, 'eval_recall': 0.9208087615838247, 'eval_f1': 0.9198788082814342, 'eval_classification_report': '              precision    recall  f1-score   support\n\n         LOC       0.94      0.96      0.95      1837\n        MISC       0.89      0.83      0.86       922\n         ORG       0.86      0.88      0.87      1341\n         PER       0.96      0.96      0.96      1835\n\n   micro avg       0.92      0.92      0.92      5935\n   macro avg       0.91      0.91      0.91      5935\nweighted avg       0.92      0.92      0.92      5935\n', 'eval_runtime': 219.734, 'eval_samples_per_second': 14.791, 'eval_steps_per_second': 1.852, 'epoch': 2.0}
{'loss': 0.0368, 'grad_norm': 2.0753135681152344, 'learning_rate': 1.4854732895970011e-05, 'epoch': 2.28}
{'loss': 0.0228, 'grad_norm': 0.14437231421470642, 'learning_rate': 3.1396438612933463e-06, 'epoch': 2.85}


{'eval_loss': 0.07058090716600418, 'eval_precision': 0.9209951252311313, 'eval_recall': 0.923167649536647, 'eval_f1': 0.9220801077078424, 'eval_classification_report': '              precision    recall  f1-score   support\n\n         LOC       0.95      0.95      0.95      1837\n        MISC       0.87      0.85      0.86       922\n         ORG       0.87      0.89      0.88      1341\n         PER       0.96      0.96      0.96      1835\n\n   micro avg       0.92      0.92      0.92      5935\n   macro avg       0.91      0.91      0.91      5935\nweighted avg       0.92      0.92      0.92      5935\n', 'eval_runtime': 214.7963, 'eval_samples_per_second': 15.131, 'eval_steps_per_second': 1.895, 'epoch': 3.0}
{'train_runtime': 17031.1053, 'train_samples_per_second': 2.473, 'train_steps_per_second': 0.155, 'train_loss': 0.12362717505015472, 'epoch': 3.0}


TrainOutput(global_step=2634, training_loss=0.12362717505015472, metrics={'train_runtime': 17031.1053, 'train_samples_per_second': 2.473, 'train_steps_per_second': 0.155, 'train_loss': 0.12362717505015472, 'epoch': 3.0})
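
The step count and learning rates in the log can be checked by hand. The sketch below assumes the Trainer defaults that this run appears to use: a base learning rate of 5e-5 and a linear schedule with warmup. Neither is set explicitly in the training arguments above, so both are assumptions on my part. The function `linear_schedule_lr` is written for this note and is not a library call.

```python
import math


def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=2634):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0, total_steps - step) / (total_steps - warmup_steps)


# 14041 training rows, batch size 16, 3 epochs
steps_per_epoch = math.ceil(14041 / 16)
total_steps = steps_per_epoch * 3
print(steps_per_epoch, total_steps)

# Logging happened every 500 steps in the run above
for step in (500, 1000, 1500, 2000, 2500):
    print(step, linear_schedule_lr(step, total_steps=total_steps))
```

Under these assumptions the total matches the 2634 steps in the progress bar, and the values at steps 1000 through 2500 agree with the logged `learning_rate` entries.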

In [23]:
model_path = "/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/"

# Save model and tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/tokenizer_config.json',
 '/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/special_tokens_map.json',
 '/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/vocab.txt',
 '/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/added_tokens.json',
 '/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/tokenizer.json')

In [24]:
trainer.evaluate(tokenized_datasets['test'])

{'eval_loss': 0.151297926902771,
 'eval_precision': 0.8671902268760907,
 'eval_recall': 0.880716058135413,
 'eval_f1': 0.8739008090045726,
 'eval_classification_report': '              precision    recall  f1-score   support\n\n         LOC       0.91      0.91      0.91      1665\n        MISC       0.71      0.75      0.73       702\n         ORG       0.82      0.85      0.83      1661\n         PER       0.95      0.94      0.94      1614\n\n   micro avg       0.87      0.88      0.87      5642\n   macro avg       0.85      0.86      0.85      5642\nweighted avg       0.87      0.88      0.87      5642\n',
 'eval_runtime': 247.3041,
 'eval_samples_per_second': 13.963,
 'eval_steps_per_second': 1.747,
 'epoch': 3.0}

Presenting the Evaluation Results

In [30]:
eval_results = {
    'eval_loss': 0.151297926902771,
    'eval_precision': 0.8671902268760907,
    'eval_recall': 0.880716058135413,
    'eval_f1': 0.8739008090045726,
    'eval_classification_report': '''
              precision    recall  f1-score   support

         LOC       0.91      0.91      0.91      1665
        MISC       0.71      0.75      0.73       702
         ORG       0.82      0.85      0.83      1661
         PER       0.95      0.94      0.94      1614

   micro avg       0.87      0.88      0.87      5642
   macro avg       0.85      0.86      0.85      5642
weighted avg       0.87      0.88      0.87      5642
    ''',
    'eval_runtime': 247.3041,
    'eval_samples_per_second': 13.963,
    'eval_steps_per_second': 1.747,
    'epoch': 3.0
}

# Print each key-value pair in eval_results
for key, value in eval_results.items():
    print(f"{key}: {value}\n")


eval_loss: 0.151297926902771

eval_precision: 0.8671902268760907

eval_recall: 0.880716058135413

eval_f1: 0.8739008090045726

eval_classification_report: 
              precision    recall  f1-score   support

         LOC       0.91      0.91      0.91      1665
        MISC       0.71      0.75      0.73       702
         ORG       0.82      0.85      0.83      1661
         PER       0.95      0.94      0.94      1614

   micro avg       0.87      0.88      0.87      5642
   macro avg       0.85      0.86      0.85      5642
weighted avg       0.87      0.88      0.87      5642
    

eval_runtime: 247.3041

eval_samples_per_second: 13.963

eval_steps_per_second: 1.747

epoch: 3.0
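
As a quick sanity check on the numbers above, the overall F1 should be the harmonic mean of the reported precision and recall, and the macro and weighted averages should follow from the per-class rows. The per-class values below are the rounded ones from the report, so the averages are only expected to agree to two decimals.

```python
precision = 0.8671902268760907
recall = 0.880716058135413

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))

# Per-class F1 and support, copied from the classification report
per_class_f1 = {"LOC": 0.91, "MISC": 0.73, "ORG": 0.83, "PER": 0.94}
support = {"LOC": 1665, "MISC": 702, "ORG": 1661, "PER": 1614}

# Macro average weights every class equally; weighted average uses support
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[k] * support[k] for k in support) / sum(support.values())
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The gap between the macro and weighted averages comes mostly from MISC, which is both the weakest and the smallest class.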



### Applying the Model to Complaints Data

Load the model and tokenizer from our specified path.

In [8]:
# load model and tokenizer
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments, Trainer
model_path = "/Users/jonathanjuarez/Documents/Advanced ML/NLP-Police-Complaints/"

model_legalBERT = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Prepare the Complaint Text: We can store the complaint text into a variable, then tokenize.

In [12]:
print(case_texts[0])

  log 1086919   1     1         date incident  september 18 2017   time incident  700 pm   location incident  xxxx  marshfield ave chicago il   date copa notification  september 28 2017   time copa notification  430 pm      september 18th 2017 ch icago police officer executed search warrant  first  second floor unit  xxxx  marshfield ave nue  point evening  search completed owner resident building  involved civilian 1  involved  civilian 1  came home see someone searched building  target search  warrant  son involved civilian 2  lived n second floor unit building  involved civilian 1  involved c ivilian 1  alleges  chicago police officer illegally searched  basement marshfield address   involved civilian 1  alleges chicago police  officer damaged home stole co in safe second floor unit     ii involved  party     involved officer 1  involved officer  star xxxxx  employee id  xxxxxx  date  appointment xxxx2013  police officer  unit xxx   gang  investigation division dob xxxx1981  male  w

In [13]:
# Replace this with any complaint text
complaint_text = case_texts[0]
# Tokenize the text (truncation=True keeps only the tokens that fit the model's maximum input length)
inputs = tokenizer(complaint_text, padding=True, truncation=True, return_tensors="pt")

Get Model Predictions: Use the model to get predictions for the tokenized text:

In [14]:
# Get predictions for tokenized text
outputs = model_legalBERT(**inputs)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
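
Taking the argmax discards how confident the model was. The sketch below shows, on made-up logits, how a softmax turns each row into probabilities so that low-confidence predictions can be downgraded to 'O'. The helper `label_with_confidence`, the toy logits, and the 0.5 threshold are all illustrative assumptions. With the real tensors, `outputs.logits.softmax(-1)` would play the role of the `softmax` helper.

```python
import math

LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, labels=LABELS, threshold=0.0):
    """Return (label, probability) per token; weak predictions fall back to 'O'."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        label = labels[best] if probs[best] >= threshold else "O"
        results.append((label, probs[best]))
    return results


# Made-up logits for two tokens: one confident 'O', one weak 'B-ORG'
toy_logits = [
    [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(label_with_confidence(toy_logits))
print(label_with_confidence(toy_logits, threshold=0.5))
```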

Convert Predictions to Labels: Translate the numerical predictions back to their string labels:

In [16]:
# Convert predictions to labels
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
# argmax never yields -100, so every position is kept, including [CLS] and [SEP]
predicted_labels = [label_list[p] for p in predictions]
print(predicted_labels)


['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'I-ORG', 'I-ORG', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '

To align the predicted labels with their corresponding words from the input text, we'll need to process the tokenized input and match each token to its predicted label. 

In [17]:
def predict_and_align_labels(text, model, tokenizer, label_list):
    """
    Predicts the NER labels for the given text and aligns them with the words.

    Args:
    text (str): The text for which to predict NER labels.
    model (transformers.PreTrainedModel): The NER model.
    tokenizer (transformers.PreTrainedTokenizer): The tokenizer.
    label_list (list): The list of labels used by the NER model.

    Returns:
    list: A list of tuples where each tuple is (word, predicted_label).
    """
    # Tokenize the text and align labels
    tokenized_input = tokenizer(text, return_tensors="pt", truncation=True, padding=True, is_split_into_words=False, return_offsets_mapping=True)
    offset_mapping = tokenized_input.pop('offset_mapping')
    outputs = model(**tokenized_input)
    predictions = outputs.logits.argmax(-1).squeeze().tolist()

    # Decode the tokens and align them with their labels
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"].squeeze().tolist())
    word_level_predictions = []
    for idx, (offset, prediction) in enumerate(zip(offset_mapping.squeeze().tolist(), predictions)):
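        # NOTE: offsets are character positions in the full text, so offset[0] == 0 does not detect word starts; the revised function in the next cell merges subwords instead
        if offset[0] == 0 and tokens[idx] not in tokenizer.all_special_tokens: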
            aligned_label = label_list[prediction]
            word_level_predictions.append((tokens[idx], aligned_label))

    # Print statements for debugging
    print(f"Tokens: {tokens}")
    print(f"Predictions: {predictions}")
    print(f"Word-level predictions: {word_level_predictions}")

    return word_level_predictions


In [18]:
aligned_predictions = predict_and_align_labels(complaint_text, model_legalBERT, tokenizer, label_list)

# Display the results
for word, label in aligned_predictions:
    print(f"{word}: {label}")

Tokens: ['[CLS]', 'log', '1086', '##919', '1', '1', 'date', 'incident', 'september', '18', '2017', 'time', 'incident', '700', 'pm', 'location', 'incident', 'xxxx', 'marsh', '##field', 'ave', 'chicago', 'i', '##l', 'date', 'cop', '##a', 'notification', 'september', '28', '2017', 'time', 'cop', '##a', 'notification', '430', 'pm', 'september', '18', '##th', '2017', 'ch', 'i', '##ca', '##go', 'police', 'officer', 'executed', 'search', 'warrant', 'first', 'second', 'floor', 'unit', 'xxxx', 'marsh', '##field', 'ave', 'nu', '##e', 'point', 'evening', 'search', 'completed', 'owner', 'resident', 'building', 'involved', 'civilian', '1', 'involved', 'civilian', '1', 'came', 'home', 'see', 'someone', 'search', '##ed', 'building', 'target', 'search', 'warrant', 'son', 'involved', 'civilian', '2', 'lived', 'n', 'second', 'floor', 'unit', 'building', 'involved', 'civilian', '1', 'involved', 'c', 'iv', '##ilia', '##n', '1', 'alleges', 'chicago', 'police', 'officer', 'illegally', 'search', '##ed', 'bas

In [24]:
def predict_and_align_labels(text, model, tokenizer, label_list):
    """
    Predicts the NER labels for the given text and aligns them with the words.
    
    Args:
    text (str): The text for which to predict NER labels.
    model (transformers.PreTrainedModel): The NER model.
    tokenizer (transformers.PreTrainedTokenizer): The tokenizer.
    label_list (list): The list of labels used by the NER model.
    
    Returns:
    list: A list of tuples where each tuple is (word, predicted_label).
    """
    # Tokenize the text and exclude offset_mapping from the model input
    tokenized_input = tokenizer(text, return_tensors="pt", truncation=True, padding=True, is_split_into_words=False, return_offsets_mapping=True)
    offset_mapping = tokenized_input.pop('offset_mapping').squeeze()
    model_input = {key: val for key, val in tokenized_input.items()}  # Prepare model input excluding offset_mapping
    outputs = model(**model_input)
    predictions = outputs.logits.argmax(-1).squeeze().tolist()

    # Decode the tokens and align them with their labels
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"].squeeze().tolist())
    word_level_predictions = []
    current_word = ""
    current_label_index = -1

    for idx, (offset, token) in enumerate(zip(offset_mapping, tokens)):
        if token.startswith("##"):
            # Continuation subtoken: append it to the current word without the '##' prefix
            current_word += token[2:]
        else:
            # When reaching a new word, save the previous one if it's not empty
            if current_word and current_label_index != -1:
                word_level_predictions.append((current_word, label_list[current_label_index]))
                current_word = ""
            if token not in tokenizer.all_special_tokens:
                # Start a new word
                current_word = token
                current_label_index = predictions[idx]

    # Save the last accumulated word if present
    if current_word and current_label_index != -1:
        word_level_predictions.append((current_word, label_list[current_label_index]))

    # Print statements for debugging
    print(f"Tokens: {tokens}")
    print(f"Predictions: {predictions}")
    print(f"Word-level predictions: {word_level_predictions}")

    return word_level_predictions
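
An alternative to checking for the `##` prefix is to group tokens by their character offsets: a token that starts exactly where the previous one ended belongs to the same word. The sketch below applies that rule to hand-written offsets and predictions, so no model is needed. The helper `group_by_offsets` and the sample values are hypothetical. One known limitation is that punctuation glued to a word would be merged into it.

```python
def group_by_offsets(text, offsets, predictions, label_list):
    """Rebuild words from character offsets, keeping the first subword's label."""
    words = []
    for (start, end), pred in zip(offsets, predictions):
        if start == end:
            continue  # special tokens such as [CLS] and [SEP] map to an empty span
        if words and start == words[-1]["end"]:
            words[-1]["end"] = end  # token continues the previous word
        else:
            words.append({"start": start, "end": end, "label": label_list[pred]})
    return [(text[w["start"]:w["end"]], w["label"]) for w in words]


label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Hypothetical tokenization: [CLS] marsh ##field ave [SEP]
text = "marshfield ave"
offsets = [(0, 0), (0, 5), (5, 10), (11, 14), (0, 0)]
predictions = [0, 5, 6, 6, 0]
print(group_by_offsets(text, offsets, predictions, label_list))
```

Slicing the original text also returns the word exactly as written, instead of a string rebuilt from token pieces.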

In [25]:
word_level_predictions = predict_and_align_labels(complaint_text, model_legalBERT, tokenizer, label_list)

for word, label in word_level_predictions:
    print(f"{word}: {label}")

Tokens: ['[CLS]', 'log', '1086', '##919', '1', '1', 'date', 'incident', 'september', '18', '2017', 'time', 'incident', '700', 'pm', 'location', 'incident', 'xxxx', 'marsh', '##field', 'ave', 'chicago', 'i', '##l', 'date', 'cop', '##a', 'notification', 'september', '28', '2017', 'time', 'cop', '##a', 'notification', '430', 'pm', 'september', '18', '##th', '2017', 'ch', 'i', '##ca', '##go', 'police', 'officer', 'executed', 'search', 'warrant', 'first', 'second', 'floor', 'unit', 'xxxx', 'marsh', '##field', 'ave', 'nu', '##e', 'point', 'evening', 'search', 'completed', 'owner', 'resident', 'building', 'involved', 'civilian', '1', 'involved', 'civilian', '1', 'came', 'home', 'see', 'someone', 'search', '##ed', 'building', 'target', 'search', 'warrant', 'son', 'involved', 'civilian', '2', 'lived', 'n', 'second', 'floor', 'unit', 'building', 'involved', 'civilian', '1', 'involved', 'c', 'iv', '##ilia', '##n', '1', 'alleges', 'chicago', 'police', 'officer', 'illegally', 'search', '##ed', 'bas

In [22]:
def merge_entities(word_level_predictions):
    merged_predictions = []
    current_entity = []
    
    for word, label in word_level_predictions:
        if label.startswith("B-") or label == "O":
            if current_entity:
                merged_predictions.append((" ".join([w for w, _ in current_entity]), current_entity[0][1]))
                current_entity = []
        if label.startswith("B-") or label.startswith("I-"):
            current_entity.append((word, label))
        if label == "O":
            merged_predictions.append((word, label))
    
    # Add the last entity
    if current_entity:
        merged_predictions.append((" ".join([w for w, _ in current_entity]), current_entity[0][1]))
    
    return merged_predictions
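
The merge above joins any run of `I-` tags onto the current entity, even when the type changes, and reports the label of the first word. A stricter variant, sketched below as a possible refinement rather than what this notebook ran, splits whenever the entity type changes and returns only the entities with their bare type. The sample input is made up, loosely following the predictions printed in this notebook.

```python
def merge_bio(word_level_predictions):
    """Merge BIO-tagged words into (entity_text, entity_type) pairs, skipping 'O'."""
    entities = []
    current_words, current_type = [], None
    for word, label in word_level_predictions:
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and ent_type == current_type:
            current_words.append(word)  # continues the open entity
            continue
        if current_words:
            entities.append((" ".join(current_words), current_type))
        if prefix in ("B", "I"):
            # A stray I- tag with no open entity of that type starts a new one
            current_words, current_type = [word], ent_type
        else:
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical word-level predictions
sample = [
    ("marshfield", "B-LOC"), ("ave", "I-LOC"), ("nue", "I-ORG"),
    ("c", "O"), ("ivilian", "I-PER"), ("1", "O"),
]
print(merge_bio(sample))
```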

In [23]:
merged_predictions = merge_entities(word_level_predictions)
for word, label in merged_predictions:
    print(f"{word}: {label}")

log: O
1086919: O
1: O
1: O
date: O
incident: O
september: O
18: O
2017: O
time: O
incident: O
700: O
pm: O
location: O
incident: O
xxxx: O
marshfield: B-ORG
ave: O
chicago il: B-ORG
date: O
copa: B-ORG
notification: O
september: O
28: O
2017: O
time: O
copa: B-ORG
notification: O
430: O
pm: O
september: O
18th: O
2017: O
ch icago: B-ORG
police: O
officer: O
executed: O
search: O
warrant: O
first: O
second: O
floor: O
unit: O
xxxx: O
marshfield ave: B-ORG
nue point: B-LOC
evening: O
search: O
completed: O
owner: O
resident: O
building: O
involved: O
civilian: O
1: O
involved: O
civilian: O
1: O
came: O
home: O
see: O
someone: O
searched: O
building: O
target: O
search: O
warrant: O
son: O
involved: O
civilian: O
2: O
lived: O
n: O
second: O
floor: O
unit: O
building: O
involved: O
civilian: O
1: O
involved: O
c: O
ivilian: I-PER
1: O
alleges: O
chicago: B-ORG
police: O
officer: O
illegally: O
searched: O
basement: O
marshfield: B-LOC
address: O
involved: O
civilian: O
1: O
alleges: O
c