Automatic Charge Identification in Indian Legal Documents

Identifying charges from the Indian Penal Code given the textual description of the charges and facts of a criminal case.

Introduction

This is the repository for the paper titled "Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need", which was presented at the 28th International Conference on Computational Linguistics (COLING), 2020.

Identifying the relevant charges, given the fact descriptions of a legal scenario and the statutory laws defining the charges, is one of the most important tasks in the judicial process of countries following the Civil Law system. This task is challenging because statutory laws are usually written in formal and abstract language to cover wide-ranging scenarios. Meanwhile, fact descriptions can be informal and can contain a lot of text (such as background information) that does not indicate any crime but is included for the sake of informativeness and completeness. Additionally, more than one charge may be relevant, and the frequency distribution of charges is usually highly skewed (a long-tail distribution).

We annotate a small set of fact descriptions with sentence-level charges, i.e., for every sentence in the fact description, we annotate the charges which may be relevant given that sentence alone. We use a model that treats text (fact and charge descriptions alike) as a hierarchy of sentences and words, and constructs intermediate sentence embeddings for each sentence as well as a document embedding for the entire text. We use multi-task learning to optimize both sentence and document-level losses simultaneously.
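
The multi-task objective described above can be illustrated with a minimal sketch. It assumes a multi-label binary cross-entropy for both tasks and a simple weighted sum; the `sent_weight` parameter and both function names are hypothetical, and the repository's actual loss and weighting may differ.

```python
import math
from typing import List


def binary_cross_entropy(probs: List[float], targets: List[int]) -> float:
    """Mean binary cross-entropy over charge labels (multi-label setting)."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def multi_task_loss(
    doc_probs: List[float],
    doc_targets: List[int],
    sent_probs: List[List[float]],
    sent_targets: List[List[int]],
    sent_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Document-level loss plus a weighted mean of the sentence-level losses."""
    doc_loss = binary_cross_entropy(doc_probs, doc_targets)
    sent_losses = [
        binary_cross_entropy(p, t) for p, t in zip(sent_probs, sent_targets)
    ]
    sent_loss = sum(sent_losses) / len(sent_losses)
    return doc_loss + sent_weight * sent_loss


# Toy example: two charges, one document with two sentences.
loss = multi_task_loss(
    doc_probs=[0.5, 0.5],
    doc_targets=[1, 0],
    sent_probs=[[0.5, 0.5], [0.5, 0.5]],
    sent_targets=[[1, 0], [0, 0]],
)
```

Setting `sent_weight=0.0` in this sketch reduces it to the document-level loss alone, which corresponds to single-task training.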

We make available:

(1) A dataset containing: (a) Charge descriptions of 20 charges (topics in the Indian Penal Code, 1860); (b) A training set consisting of 120 fact descriptions with relevant sentence and document-level charge labels; (c) A test set consisting of 70 fact descriptions with relevant document-level charge labels only.

(2) The implementation of our proposed approach.

Citation

If you use this code, please cite our paper:

@inproceedings{paul-etal-2020-automatic,
    title = "Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need",
    author = "Paul, Shounak  and
      Goyal, Pawan  and
      Ghosh, Saptarshi",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2020.coling-main.88",
    doi = "10.18653/v1/2020.coling-main.88",
    pages = "1011--1022"
}

Dataset

Charge Descriptions

The file "Labels.jsonl" contains the charge descriptions. Each line should contain a JSON string representing one charge description. Each charge description is a Python dict with the following keys:

  chargeid: str -> Charge ID
  text: List[str] -> List of sentences
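
As an illustration of this format, the following sketch parses charge descriptions from JSONL lines into a mapping from `chargeid` to sentences. The function name and the sample record are hypothetical and made up for illustration; only the keys `chargeid` and `text` come from the description above.

```python
import json
from typing import Dict, Iterable, List


def parse_charge_descriptions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each chargeid to its list of sentences, one JSON object per line."""
    charges: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        charges[record["chargeid"]] = record["text"]
    return charges


# Hypothetical record; the ID and the sentence are invented for this example.
sample_lines = ['{"chargeid": "C1", "text": ["Whoever does X shall be punished."]}']
charges = parse_charge_descriptions(sample_lines)
```

An open file handle can be passed in place of `sample_lines`, since iterating over a text file yields one line at a time.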

Fact Descriptions

The files "Train-Sent.jsonl" and "Test-Doc.jsonl" are the fact description datasets. Each line should contain a JSON string representing one fact description. Each fact description is a Python dict with the following keys:

  factid: str -> Fact ID
  text: List[str] -> List of sentences
  sent_labels: List[List[str]] -> List of lists of chargeid, each sublist holds the sentence-level charges of one sentence; optional, not needed for inference or vanilla single-task training
  doc_labels: List[str] -> List of chargeid, the document-level charges; optional, not needed for inference
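
A minimal sketch of reading one fact description is given below. It assumes that `sent_labels`, when present, has exactly one sublist per sentence in `text`; this alignment check, the function name, and the sample record are assumptions for illustration and are not taken from the repository code.

```python
import json
from typing import Any, Dict


def parse_fact_description(line: str) -> Dict[str, Any]:
    """Parse one JSONL line and check the optional sentence labels."""
    fact = json.loads(line)
    sentences = fact["text"]
    # sent_labels is optional; when present, assume one sublist per sentence.
    sent_labels = fact.get("sent_labels")
    if sent_labels is not None and len(sent_labels) != len(sentences):
        raise ValueError(f"{fact['factid']}: sent_labels does not align with text")
    return fact


# Hypothetical record; IDs and sentences are invented for this example.
sample_line = json.dumps(
    {
        "factid": "F1",
        "text": ["The accused entered the house.", "He took the jewellery."],
        "sent_labels": [[], ["C1"]],
        "doc_labels": ["C1"],
    }
)
fact = parse_fact_description(sample_line)
```

A record without `sent_labels` or `doc_labels` passes through unchanged, which matches the inference setting where neither field is needed.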

Pretrained Word Embeddings

Download this file and put it inside the ptembs folder.

Training

Input Data

Set up the Charge and Fact Description files as mentioned above. 'sent_labels' is compulsory for multi-task learning but not required for single-task learning.

Usage

Output Data

Inference

Input Data

Set up the Charge and Fact Description files as mentioned above. 'sent_labels' and 'doc_labels' are not compulsory.

Usage

The generalized usage command is given as:

  python main.py --[arg1] <arg1 param> --[arg2] <arg2 param> ...

To check out the details:

  python main.py -h

Output Data

The following are saved in the saved folder (specified by --save_path):

  model.pt: torch.nn.Module -> State dict of best model (based on macro-F1 on validation set)
  metrics.json: JSON Dict -> Best label-wise and macro Precision, Recall and F1 scores (based on macro-F1 on validation set)
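
For readers unfamiliar with the metrics named above, the sketch below computes label-wise precision, recall, and F1 for multi-label predictions, and macro-F1 as the unweighted mean of the per-label F1 scores. This is one common definition; the function names and dictionary keys are hypothetical, and the exact computation and the layout of metrics.json in this repository may differ.

```python
from typing import Dict, List, Set


def label_wise_prf(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for each label over all documents."""
    scores: Dict[str, Dict[str, float]] = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-label F1, so rare charges count as much as frequent ones."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Toy example with two hypothetical charge IDs and three documents.
gold = [{"C1"}, {"C1", "C2"}, {"C2"}]
pred = [{"C1"}, {"C1"}, {"C1", "C2"}]
scores = label_wise_prf(gold, pred, labels=["C1", "C2"])
macro = macro_f1(scores)
```

Because every label contributes equally to the mean, macro-F1 is sensitive to performance on the rare charges in a long-tail distribution.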
