# Introduction

This notebook implements LEGAL-BERT-based retrieval for the LEGSTAT IR term project.

The collection has 197 statutes (documents) and 50 training queries. The task is to generate a TREC run file for the 10 test queries.

## Authors
- Sayan Mahapatra
- Mainak Chowdhury
- Upasana Mandal
- Khyati Puhup


# Setup Environment


In [1]:
!rm -rf sample_data/
!rm -rf IRTP/
!git clone https://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@github.com/MeSayan/IRTP.git
!cd IRTP/
!chmod a+x IRTP/trec_eval.8.1/trec_eval.8.1/trec_eval

Cloning into 'IRTP'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 249 (delta 10), reused 247 (delta 8), pack-reused 0
Receiving objects: 100% (249/249), 547.48 KiB | 5.07 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [None]:
!echo -e " scikit-learn==1.0 \n numpy==1.19.5 \n pandas==1.1.5 \n nltk==3.4 \n transformers==4.12.3" > requirements.txt
!pip install -U -r requirements.txt

Collecting scikit-learn==1.0
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
Collecting nltk==3.4
  Downloading nltk-3.4.zip (1.4 MB)
Collecting transformers==4.12.3
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting singledispatch
  Downloading singledispatch-3.7.0-py2.py3-none-any.whl (9.2 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)

# Functions

- get_all_documents() // return list of documents
- get_all_queries() // return list of queries
- clean() // tokenization, stop-word and punctuation removal
- preprocessor() // lemmatization, stemming etc.
- generate_vectors() // return vectors (embeddings) for queries / docs
- evaluate_docs() // compute similarity of doc vector and query vector
- generate_test_trec_file() // generate test trec file
- generate_train_trec_file() // generate train trec file for evaluation by the trec_eval tool

In [None]:
import pandas as pd
import sklearn
import numpy as np
import string
import pprint

pp = pprint.PrettyPrinter()

import torch
import logging

import matplotlib.pyplot as plt
%matplotlib inline

import nltk
import os
import glob
import re

nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

from sklearn.preprocessing import normalize

print(sklearn.__version__)
print(np.__version__)
print(pd.__version__)
print(nltk.__version__)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


1.0
1.19.5
1.1.5
3.4


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def get_all_documents():
  """Return (doc_head, doc_cont): statute names and their full text."""
  path = "IRTP/Object_statutes/*.txt"
  doc_vex = glob.glob(path)
  # sort numerically by the digits that appear in each path
  doc_vex.sort(key=lambda f: int(re.sub(r'\D', '', f)))
  doc_head = []
  doc_cont = []
  for i in doc_vex:
    # read the whole file as one string; the file is closed automatically
    with open(i, "r") as f:
      doc_cont.append(f.read())
    # file name without the directory and the .txt extension
    doc_head.append(os.path.splitext(os.path.basename(i))[0])
  return doc_head, doc_cont

In [None]:
def get_all_queries(pathx):
  """Return (quer_vec_head, quer_vec_cont) from a file of 'name||text' lines."""
  quer_vec_head = []
  quer_vec_cont = []
  with open(pathx, "r") as fx:
    for j in fx:
      stor = j.split("||")
      quer_vec_head.append(stor[0]) # query name, e.g. AILA_Q1, AILA_Q2
      quer_vec_cont.append(stor[1]) # query text of AILA_Qi, i in 1...n, n is the number of queries
  return quer_vec_head,quer_vec_cont
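
The loader above assumes that every line of the query file has the form `name||text`. The sketch below shows that parsing step in isolation. The helper `parse_query_line` and the sample lines are hypothetical and are not taken from the dataset; it also passes `maxsplit=1`, an assumption that differs from the loader, so that a later `||` inside the query text is kept.

```python
def parse_query_line(line):
    """Split one 'name||text' line into (query_name, query_text)."""
    # maxsplit=1 keeps any later '||' as part of the query text
    query_name, query_text = line.rstrip("\n").split("||", 1)
    return query_name, query_text


# Hypothetical sample lines, made up for illustration
sample_lines = [
    "AILA_Q1||The appellant was convicted by the trial court.\n",
    "AILA_Q2||The dispute concerns a contract || and its breach.\n",
]
parsed = [parse_query_line(line) for line in sample_lines]
heads = [name for name, _ in parsed]
texts = [text for _, text in parsed]
print(heads)
```

With a plain `split("||")`, as in `get_all_queries`, the second sample would lose everything after the second `||`, because only `stor[1]` is kept.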


In [None]:
def clean(items):
  """ Tokenize string, remove punctuation & stopwords """
  words = []
  cleaned_docs = []
  st = set(stopwords.words('english'))
  for item in items:
    sentences = sent_tokenize(item)
    lowercase_words = [word.lower() for sentence in sentences for word in word_tokenize(sentence)]
    
    # custom Filtering
    # 1. Xw.e.f.<Date> -> [X, <Date>]
    # 2. Xw.r.e.f.<Date> -> [X, <Date>]
    # 3. X.-Y -> [X, Y]
    # 4. X.—Y -> [X, Y]
    # 5. X- -> X
    # 6. -X -> X
    # 7. .X -> X
    # 8. X. -> X
    # 9. 'X or X' -> X
    # 10. X-Y -> [X, Y]
    nl = []
    for word in lowercase_words:
      if 'w.e.f.' in word:
        a, b = word.split('w.e.f.', 1)
        nl.append(a)
        nl.append(b)
      elif 'w.r.e.f.' in word:
        a, b = word.split('w.r.e.f.', 1) # split on the full marker, including the final dot
        nl.append(a)
        nl.append(b)
      elif '.-' in word:
        nl.extend(word.split('.-'))
      elif '.—' in word:
        nl.extend(word.split('.—'))
      elif (word.endswith('-') and not word.endswith('/-')) or ((word.endswith('—') and not word.endswith('/—'))):
        nl.append(word[:-1])
      elif word.startswith('-') or word.startswith('—'):
        nl.append(word[1:])
      elif word.startswith("."):
        nl.append(word[1:])
      elif word.endswith("."):
        nl.append(word[:-1])
      elif word.startswith("'") and word.endswith("'"):
        nl.append(word[1:-1])
      elif word.startswith("'"):
        nl.append(word[1:])
      elif word.endswith("'"):
        nl.append(word[:-1])
      elif '-' in word:
        nl.extend(word.split('-'))
      else:
        nl.append(word)

    punctuation_symbols = string.punctuation + '‘’“”—``'
    punctuation_removed_words = [word for word in nl if word not in punctuation_symbols]
    stopwords_removed_words = [word for word in punctuation_removed_words if word not in st]
    n2 = [word for word in stopwords_removed_words 
          if (re.match(r"^[']?[a-z]*[-]{0,1}[a-z]*$", word) and 
          word not in ['title', 'desc'] and # Remove 'title' & 'desc'
          len(word) > 3 # remove words of 3 letters or fewer
          )]
    words.append(n2)

  for words_of_a_sentence in words:
    cleaned_docs.append(" ".join(words_of_a_sentence))

  return cleaned_docs
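
The last filter in `clean()` keeps a token only if it is purely alphabetic (with at most one hyphen and an optional leading apostrophe), is not `title` or `desc`, and has more than three characters. A minimal sketch of that filter on its own, so its effect can be checked without NLTK; the helper `keep_token` and the sample tokens are hypothetical.

```python
import re

# Same pattern and exclusions as the final filter in clean()
TOKEN_PATTERN = re.compile(r"^[']?[a-z]*[-]{0,1}[a-z]*$")
EXCLUDED = {"title", "desc"}


def keep_token(word):
    """Return True if `word` would pass the final filter of clean()."""
    return (
        TOKEN_PATTERN.match(word) is not None
        and word not in EXCLUDED
        and len(word) > 3
    )


# Hypothetical, already lowercased tokens
tokens = ["section", "302", "act", "title", "non-bailable", "punishment", "ipc"]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

Numbers such as section numbers and short words such as `act` or `ipc` are dropped by this filter, which may matter for statute retrieval.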


In [None]:
def preprocessor(items):
  items = clean(items)
  # items is now tokenized and stop words removed
  return items


In [None]:
model_name = 'nlpaueb/legal-bert-base-uncased'
from transformers import AutoTokenizer, AutoModel 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

Some weights of the model checkpoint at nlpaueb/legal-bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [None]:
def generate_vectors(model, tokenizer, docs):
  doc_vectors = []
  weights = []
  for i in range(len(docs)):
    doc = docs[i]
    print(f"\rVectorizing item {i+1} / {len(docs)}", end='')
    encoded_doc = tokenizer.encode_plus(
                          doc, 
                          add_special_tokens = True,    
                          truncation = True, 
                          max_length=512, 
                          return_attention_mask = True, 
                          return_tensors = "pt")
    with torch.no_grad():
      outputs = model(**encoded_doc, output_hidden_states=True)

    # take output from only the last layer
    tok_embedding = outputs.last_hidden_state[0] # shape: [#tokens, 768]
    
    # Sentence embedding
    # take mean of all token vectors (special tokens [CLS] and [SEP] included)
    sent_embedding = torch.mean(tok_embedding, dim=0) # shape: [768]
    # convert to numpy
    sent_vec = sent_embedding.detach().numpy()
    doc_vectors.append(sent_vec)
    
    # tokenized length of doc
    w = encoded_doc['input_ids'].size()[1]
    weights.append(w)

  # document length normalization
  weights = np.array(weights, dtype=np.float64)
  weights = np.mean(weights) / weights # avg. document length / document length
  weights = (weights.T)[:, None]
  D = np.array(doc_vectors)
  D = D * weights
  D = normalize(D, axis=1, norm='l2')
  return D
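
The pooling and normalization steps in `generate_vectors` can be sketched with NumPy alone. Random arrays stand in for the model's token embeddings, and the helper `l2_normalize` and the toy sizes are assumptions made for illustration. The sketch also checks one property of this pipeline: a positive per-row weight applied before row-wise L2 normalization does not change the normalized vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token embeddings for two documents of different lengths
# (random values stand in for the model's last hidden state).
token_embeddings = [rng.normal(size=(5, 8)), rng.normal(size=(12, 8))]

# Mean pooling: one vector per document
doc_vectors = np.stack([tok.mean(axis=0) for tok in token_embeddings])

# Length weights as in generate_vectors: average length / document length
lengths = np.array([tok.shape[0] for tok in token_embeddings], dtype=np.float64)
weights = (lengths.mean() / lengths)[:, None]


def l2_normalize(matrix):
    """Scale every row to unit Euclidean norm."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


weighted = l2_normalize(doc_vectors * weights)
unweighted = l2_normalize(doc_vectors)

# A positive per-row scale cancels under row-wise L2 normalization
print(np.allclose(weighted, unweighted))
```

If document length is meant to influence the ranking, the weight would have to be applied after normalization or folded into the score; which of the two is appropriate is a design choice this sketch does not settle.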


In [None]:
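def generate_test_trec_file(D, Q, C, queries, file_name, threshold=0):
  """Write a TREC run file for the test queries (AILA_TQ1, AILA_TQ2, ...).

  C[q][d] is the similarity of query q and document d; D and Q are not used.
  Document names are read from the global doc_head.
  """
  with open(file_name, "w") as f:
    for q in range(len(queries)):
      drv = C[q]
      sdrv = np.flip(np.argsort(drv), axis = 0) # document indices, highest score first
      c = 1 # rank of the next document written
      for d in sdrv:
        if C[q][d] > threshold:
          print(f"AILA_TQ{q+1} Q0 {doc_head[d]} {c} {C[q][d]} LEG_STAT_TRIER R3", file=f)
          c += 1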

In [None]:
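def generate_train_trec_file(D, Q, C, queries, file_name, threshold=0):
  """Write a TREC run file for the train queries (AILA_Q1, AILA_Q2, ...).

  C[q][d] is the similarity of query q and document d; D and Q are not used.
  Document names are read from the global doc_head.
  """
  with open(file_name, "w") as f:
    for q in range(len(queries)):
      drv = C[q]
      sdrv = np.flip(np.argsort(drv), axis = 0) # document indices, highest score first
      c = 1 # rank of the next document written
      for d in sdrv:
        if C[q][d] > threshold:
          print(f"AILA_Q{q+1} Q0 {doc_head[d]} {c} {C[q][d]} LEG_STAT_TRIER R3", file=f)
          c += 1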
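
The two writers above differ only in the query-name prefix. A possible consolidation is sketched below under stated assumptions: `write_run` is a hypothetical helper, it takes the query names and document names as arguments instead of reading the global `doc_head`, and the scores and names are toy values. Each line follows the usual six-field TREC run layout: query id, the literal `Q0`, document id, rank, score, run tag.

```python
import io


def write_run(scores, doc_ids, query_ids, out, run_tag="LEG_STAT_TRIER", threshold=0.0):
    """Write one run line per (query, document) pair scoring above threshold."""
    for query_id, row in zip(query_ids, scores):
        # document indices ordered by descending score
        ranked = sorted(range(len(row)), key=lambda d: row[d], reverse=True)
        rank = 1
        for d in ranked:
            if row[d] > threshold:
                print(f"{query_id} Q0 {doc_ids[d]} {rank} {row[d]} {run_tag}", file=out)
                rank += 1


# Toy similarity matrix: 2 queries x 3 documents (hypothetical values)
scores = [[0.2, 0.9, 0.0], [0.5, 0.1, 0.7]]
doc_ids = ["S1", "S2", "S3"]
query_ids = ["AILA_Q1", "AILA_Q2"]

buffer = io.StringIO()
write_run(scores, doc_ids, query_ids, buffer)
run_lines = buffer.getvalue().splitlines()
print("\n".join(run_lines))
```

Passing `["AILA_TQ1", ...]` as `query_ids` would cover the test case with the same function.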

# Generate Trec & Evaluate Trec File (Training)

In [None]:
#Generate
doc_head, docs = get_all_documents()
query_head, queries = get_all_queries("IRTP/Query_doc_train.txt")
docs = preprocessor(docs)
queries = preprocessor(queries)
print("Embedding documents")
D_tr = generate_vectors(model, tokenizer, docs)
print("\nEmbedding Queries")
Q_tr = generate_vectors(model, tokenizer, queries)
C_tr = Q_tr.dot(D_tr.T) # Q * D^T: cosine similarity, since all rows are L2-normalized
print("\nGenerating Trec File (Train)")
generate_train_trec_file(D_tr, Q_tr, C_tr, queries, "trec_output_file_train_data.txt")
print("Evaluating Trec File")
#Evaluate
!IRTP/trec_eval.8.1/trec_eval.8.1/trec_eval  IRTP/relevance_judgements_train.txt ./trec_output_file_train_data.txt

Embedding documents
Vectorizing item 197 / 197
Embedding Queries
Vectorizing item 50 / 50
Generating Trec File (Train)
Evaluating Trec File
num_q          	all	50
num_ret        	all	9850
num_rel        	all	221
num_rel_ret    	all	217
map            	all	0.0664
gm_ap          	all	0.0417
R-prec         	all	0.0490
bpref          	all	0.0371
recip_rank     	all	0.1630
ircl_prn.0.00  	all	0.1689
ircl_prn.0.10  	all	0.1689
ircl_prn.0.20  	all	0.1689
ircl_prn.0.30  	all	0.0644
ircl_prn.0.40  	all	0.0639
ircl_prn.0.50  	all	0.0460
ircl_prn.0.60  	all	0.0403
ircl_prn.0.70  	all	0.0366
ircl_prn.0.80  	all	0.0275
ircl_prn.0.90  	all	0.0253
ircl_prn.1.00  	all	0.0253
P5             	all	0.0440
P10            	all	0.0420
P15            	all	0.0400
P20            	all	0.0380
P30            	all	0.0313
P100           	all	0.0190
P200           	all	0.0217
P500           	all	0.0087
P1000          	all	0.0043
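
For readers unfamiliar with the `map` row above, the sketch below shows how mean average precision is commonly computed from ranked lists and relevance judgements. It is an illustration with made-up document names, not a reimplementation of `trec_eval`; the functions `average_precision` and `mean_average_precision` are hypothetical helpers.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # divide by all relevant documents, retrieved or not
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgements):
    """Mean of the per-query average precision over all judged queries."""
    values = [average_precision(runs.get(q, []), rel) for q, rel in judgements.items()]
    return sum(values) / len(values)


# Toy run and judgements (hypothetical names)
runs = {"Q1": ["S3", "S1", "S2", "S4"], "Q2": ["S2", "S1"]}
judgements = {"Q1": {"S1", "S4"}, "Q2": {"S2"}}

ap_q1 = average_precision(runs["Q1"], judgements["Q1"])
map_score = mean_average_precision(runs, judgements)
print(ap_q1, map_score)
```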


# Generate Trec File (Test)

In [None]:
# Generate
doc_head, docs = get_all_documents()
query_head, queries = get_all_queries("IRTP/Query_doc_test.txt")
docs = preprocessor(docs)
queries = preprocessor(queries)
print("Embedding documents")
D_te = generate_vectors(model, tokenizer, docs)
print("\nEmbedding Queries")
Q_te = generate_vectors(model, tokenizer, queries)
C_te = Q_te.dot(D_te.T) # Q * D^T: cosine similarity, since all rows are L2-normalized
print("\nGenerating Trec File (Test)")
generate_test_trec_file(D_te, Q_te, C_te, queries, "trec_output_file_test_data.txt")

Embedding documents
Vectorizing item 197 / 197
Embedding Queries
Vectorizing item 10 / 10
Generating Trec File (Test)


# References

- Chris McCormick and Nick Ryan. (2019, May 14). BERT Word Embeddings Tutorial. Retrieved from http://www.mccormickml.com
- https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209

