Efficient Deep Learning-based Sentence Boundary Detection in Legal Text

Accepted at NLLP 2022

Paper: https://aclanthology.org/2022.nllp-1.18

Introduction

Sentence Boundary Detection (SBD) is a key component of the Natural Language Processing (NLP) pipeline, and errors in SBD can propagate to later processing steps and reduce their performance. In well-formed corpora, a few rules based on punctuation and capitalization are usually enough to identify sentence boundaries. Legal text, however, has a complex structure with several grammatical ambiguities that make SBD difficult. In this paper, we train neural network models to identify the end of a sentence in legal text. We evaluated several state-of-the-art deep learning models and found that a Convolutional Neural Network (CNN) outperformed the other deep learning models. We also compared the results with rule-based, statistical, and transformer-based frameworks. The best neural network model outscored the popular rule-based framework with an improvement of 8% in the F1 score. Although domain-specific statistical models perform slightly better, the trained CNN is 80 times faster at run time and doesn't require much feature engineering. Furthermore, even after extensive pretraining, the transformer models fall short of the best deep learning model in overall performance.
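To illustrate why punctuation and capitalization rules struggle on legal text, here is a minimal sketch of a naive rule-based splitter. It is not the rule-based framework evaluated in the paper and not part of this repository; the regular expression, the function name `naive_split`, and both example sentences are assumptions made up for illustration.

```python
import re

# Hypothetical naive rule (an assumption for illustration only):
# a sentence ends at '.', '?' or '!' when it is followed by whitespace
# and then an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def naive_split(text):
    """Split text into sentences using the naive boundary rule above."""
    return [piece for piece in BOUNDARY.split(text.strip()) if piece]


# Ordinary prose: the rule finds the two sentences.
plain = "The court adjourned. The parties left."

# Legal-style citation: the period in "v." is followed by a capitalized
# name, so the rule wrongly reports a boundary inside the citation.
legal = "See Smith v. Jones, 410 U.S. 113 (1973). The appeal was denied."

if __name__ == "__main__":
    for sentence in naive_split(plain):
        print("plain:", sentence)
    for sentence in naive_split(legal):
        print("legal:", sentence)
```

On the legal example the splitter returns three pieces instead of two, because "v." looks like the end of a sentence. Abbreviations, citations, and enumerations of this kind are the sort of ambiguity that motivates a learned boundary detector.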

Results

Citation

If you find our code implementation helpful for your own research or work, please cite our paper.

@inproceedings{sheik-etal-2022-efficient,
    title = "Efficient Deep Learning-based Sentence Boundary Detection in Legal Text",
    author = "Sheik, Reshma  and
      T, Gokul  and
      Nirmala, S",
    booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.nllp-1.18",
    pages = "208--217",
    abstract = "A key component of the Natural Language Processing (NLP) pipeline is Sentence Boundary Detection (SBD). Erroneous SBD could affect other processing steps and reduce performance. A few criteria based on punctuation and capitalization are necessary to identify sentence borders in well-defined corpora. However, due to several grammatical ambiguities, the complex structure of legal data poses difficulties for SBD. In this paper, we have trained a neural network framework for identifying the end of the sentence in legal text. We used several state-of-the-art deep learning models, analyzed their performance, and identified that Convolutional Neural Network(CNN) outperformed other deep learning frameworks. We compared the results with rule-based, statistical, and transformer-based frameworks. The best neural network model outscored the popular rule-based framework with an improvement of 8{\%} in the F1 score. Although domain-specific statistical models have slightly improved performance, the trained CNN is 80 times faster in run-time and doesn{'}t require much feature engineering. Furthermore, after extensive pretraining, the transformer models fall short in overall performance compared to the best deep learning model.",
}

