# Introduction

This notebook contains an implementation of TF-IDF retrieval for the LEGSTAT IR Term Project.

There are 197 statutes (documents) and 50 training queries. The task is to fit a TF-IDF model and generate a TREC file for the 10 test queries.

## Authors
- Sayan Mahapatra
- Mainak Chowdhury
- Upasana Mandal
- Khyati Puhup


# Setup Environment


In [1]:
!rm -rf sample_data/
!rm -rf IRTP/
!git clone https://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@github.com/MeSayan/IRTP.git
!cd IRTP/

Cloning into 'IRTP'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 249 (delta 10), reused 247 (delta 8), pack-reused 0
Receiving objects: 100% (249/249), 547.48 KiB | 13.04 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [None]:
!echo -e " scikit-learn==1.0 \n numpy==1.19.5 \n pandas==1.1.5 \n nltk==3.4" > requirements.txt
!pip install -U -r requirements.txt

Collecting scikit-learn==1.0
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
Collecting nltk==3.4
  Downloading nltk-3.4.zip (1.4 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting singledispatch
  Downloading singledispatch-3.7.0-py2.py3-none-any.whl (9.2 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4-py3-none-any.whl size=1436396 sha256=634c4fffe3cb342b282be98ab3b6990cd447d7c732587afa6de6a8e947307166
  Stored in directory: /root/.cache/pip/wheels/13/b8/81/2349be11dd144dc7b68ab983b58cd2fae353cdc50bbdeb09d0
Successfully built nltk
Installing collected packages: threadpoolctl, singledispatch, scikit-learn, nltk
  Attempting uninstall: scikit-learn
    Found exi

# Functions

- get_all_documents() // return the list of documents
- get_all_queries() // return the list of queries
- clean() // tokenization, stop-word and punctuation removal
- preprocessor() // lemmatization, stemming, etc.
- generate_doc_vectors() // TF-IDF vectors of the documents
- generate_query_vectors() // TF-IDF vectors of the queries
- evaluate_docs() // compute similarity of doc vector and query vector
- generate_trec_file() // generate the TREC file for evaluation by the trec_eval tool

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import sklearn
import numpy as np
import string

import nltk
import os
import glob
import re

nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

print(sklearn.__version__)
print(np.__version__)
print(pd.__version__)
print(nltk.__version__)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


1.0
1.19.5
1.1.5
3.4


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def get_all_documents():
  """Return (doc_head, doc_cont): document names and their raw text, in numeric order."""
  path = "IRTP/Object_statutes/*.txt"
  doc_vex = glob.glob(path)
  # Sort numerically by the digits in each path (so 2 sorts before 10)
  doc_vex.sort(key=lambda f: int(re.sub(r'\D', '', f)))
  doc_head = []
  doc_cont = []
  for i in doc_vex:
    with open(i, "r") as f:
      doc_cont.append(f.read())  # whole file content as a single string
    # File name without the directory and the ".txt" extension
    doc_head.append(os.path.splitext(os.path.basename(i))[0])
  return doc_head, doc_cont
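
The numeric sort key used above matters because a plain string sort puts `10` before `2`. The following is a minimal sketch of the difference; the file names are hypothetical and chosen only for illustration. Note that the key concatenates every digit in the string, so digits elsewhere in a path would also affect the order.

```python
import re

def numeric_key(path):
    """Sort key: the integer formed by all digits found in the path."""
    return int(re.sub(r"\D", "", path))

# Hypothetical file names, for illustration only
paths = ["S10.txt", "S2.txt", "S1.txt"]

print(sorted(paths))                   # lexicographic order
print(sorted(paths, key=numeric_key))  # numeric order
```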

In [None]:
def get_all_queries(pathx):
  """Return (quer_vec_head, quer_vec_cont) parsed from lines of the form 'ID||text'."""
  quer_vec_head = []
  quer_vec_cont = []
  with open(pathx, "r") as fx:
    for j in fx:
      stor = j.split("||")
      quer_vec_head.append(stor[0])  # query names such as AILA_Q1, AILA_Q2, ...
      quer_vec_cont.append(stor[1])  # query text of each AILA_Qi, i in 1..n, n = number of queries
  return quer_vec_head, quer_vec_cont
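
As a small illustration of the `ID||text` layout that the function above expects, the sketch below parses a single line. The helper name `parse_query_line` and the sample line are hypothetical. Unlike the notebook function, it strips the trailing newline and splits only on the first `||`.

```python
def parse_query_line(line):
    """Split one 'ID||text' line into (query_id, query_text)."""
    query_id, text = line.rstrip("\n").split("||", 1)
    return query_id, text

# Hypothetical query line, for illustration only
sample = "AILA_Q1||The appellant was convicted by the trial court.\n"
print(parse_query_line(sample))
```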


In [None]:
def clean(items):
  """ Tokenize string, remove punctuation & stopwords """
  words = []
  cleaned_docs = []
  st = set(stopwords.words('english'))
  for item in items:
    sentences = sent_tokenize(item)
    lowercase_words = [word.lower() for sentence in sentences for word in word_tokenize(sentence)]
    
    # custom Filtering
    # 1. w.e.f.<Date> -> [w.e.f., <Date>]
    # 2. w.r.e.f.<Date> -> [w.r.e.f, <Date>]
    # 3. X.-Y -> [X, Y]
    # 4. X.—Y -> [X, Y]
    # 5. X- -> X
    # 6. -X -> X
    # 7. .X -> X
    # 8. X. -> X
    # 9. 'X or X' -> X
    # 10. X-Y -> [X, Y]
    nl = []
    for word in lowercase_words:
      if 'w.e.f.' in word:
        a, b = word.split('w.e.f.', 1)
        nl.append(a)
        nl.append(b)
      elif 'w.r.e.f.' in word:
        a, b = word.split('w.r.e.f.', 1)
        nl.append(a)
        nl.append(b)
      elif '.-' in word:
        nl.extend(word.split('.-'))
      elif '.—' in word:
        nl.extend(word.split('.—'))
      elif (word.endswith('-') and not word.endswith('/-')) or ((word.endswith('—') and not word.endswith('/—'))):
        nl.append(word[:-1])
      elif word.startswith('-') or word.startswith('—'):
        nl.append(word[1:])
      elif word.startswith("."):
        nl.append(word[1:])
      elif word.endswith("."):
        nl.append(word[:-1])
      elif word.startswith("'") and word.endswith("'"):
        nl.append(word[1:-1])
      elif word.startswith("'"):
        nl.append(word[1:])
      elif word.endswith("'"):
        nl.append(word[:-1])
      elif '-' in word:
        nl.extend(word.split('-'))
      else:
        nl.append(word)

    punctuation_symbols = string.punctuation + '‘’“”—``'
    punctuation_removed_words = [word for word in nl if not word in punctuation_symbols]
    stopwords_removed_words = [word for word in punctuation_removed_words if not word in st]
    n2 = [word for word in stopwords_removed_words 
          if (re.match(r"^[']?[a-z]*[-]{0,1}[a-z]*$", word) and 
          word not in ['title', 'desc'] and # Remove 'title' & 'desc'
          len(word) > 3 # keep only words longer than 3 letters
          )]
    words.append(n2)

  for words_of_a_sentence in words:
    cleaned_docs.append(words_of_a_sentence)

  return cleaned_docs
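
The last filtering step in `clean()` can be tried in isolation. The sketch below reuses the same regular expression, exclusion list, and length threshold on a hypothetical token list; the helper name `keep` is an assumption made for this example, and tokenization and stop-word removal are left out so that it runs without NLTK.

```python
import re

# Same pattern as in clean(): lowercase letters, an optional leading
# apostrophe, and at most one hyphen.
TOKEN_PATTERN = re.compile(r"^[']?[a-z]*[-]{0,1}[a-z]*$")

def keep(token):
    """Return True if the token survives the final filter of clean()."""
    return (bool(TOKEN_PATTERN.match(token))
            and token not in ("title", "desc")
            and len(token) > 3)

# Hypothetical tokens, for illustration only
tokens = ["section", "act", "1956", "title", "code", "w.e.f", "42a"]
print([t for t in tokens if keep(t)])
```

Here `act` is dropped for its length, `1956`, `w.e.f` and `42a` fail the pattern, and `title` is on the exclusion list.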


In [None]:
def preprocessor(items):
  items = clean(items)
  # items is now tokenized and stop words removed
  return items


In [None]:
def pt(doc):
  # Use a pass through function since docs already tokenized and preprocessed
  return doc

def generate_doc_vectors(docs):
  global vocab
  doc_vectorizer = TfidfVectorizer(tokenizer=pt, preprocessor=pt, use_idf=True, smooth_idf=True)
  doc_vectors = doc_vectorizer.fit_transform(docs)
  vocab = doc_vectorizer.get_feature_names_out()
  df = pd.DataFrame(doc_vectors.todense(), 
                    index=range(1, len(docs)+1), 
                    columns=vocab, dtype=np.float64)
  return df, doc_vectorizer
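
To make the weighting concrete, here is a minimal pure-Python sketch of the scheme scikit-learn documents for `smooth_idf=True` with the default L2 normalization: raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, then each row scaled to unit length. The function name `tfidf_vectors` and the toy documents are assumptions for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF rows for pre-tokenized docs: smoothed IDF, L2-normalized."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of docs that contain each term
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard empty docs
        rows.append([w / norm for w in row])
    return vocab, rows

# Hypothetical pre-tokenized documents, for illustration only
docs = [["murder", "punishment"], ["murder", "theft", "theft"]]
vocab, rows = tfidf_vectors(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first row, `punishment` receives a larger weight than `murder` because it occurs in fewer documents.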


In [None]:
def generate_query_vectors(vectorizer, queries):
  query_vectors = vectorizer.transform(queries)
  df = pd.DataFrame(query_vectors.todense(), index=range(1, len(queries)+1),
                   columns=vocab, dtype=np.float64)
  return df

In [None]:
def generate_trec_file(file_name):
  with open(file_name, "w") as f:
    for q in range(len(queries)):
      drv = C[q]
      sdrv = np.flip(np.argsort(drv), axis = 0)
      c = 1
      epsilon = 0  # only documents scoring above this threshold are written
      for d in sdrv:
        if C[q][d] > epsilon:
          print(f"AILA_Q{q+1} Q0 {doc_head[d]} {c} {C[q][d]} LEG_STAT_TRIER R2", file=f)
          c += 1
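
The ranking logic above can be sketched without NumPy or global state. The example below emits the conventional six-column run format read by trec_eval (query ID, the literal `Q0`, document ID, rank, score, run tag); the function name `trec_lines`, the run tag `demo_run`, and the scores are hypothetical.

```python
def trec_lines(query_id, scores, doc_ids, run_tag="demo_run", epsilon=0.0):
    """Rank documents by descending score; keep those scoring above epsilon."""
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    lines = []
    rank = 1
    for d in ranked:
        if scores[d] > epsilon:
            lines.append(f"{query_id} Q0 {doc_ids[d]} {rank} {scores[d]} {run_tag}")
            rank += 1
    return lines

# Hypothetical scores for three documents, for illustration only
lines = trec_lines("AILA_Q1", [0.0, 0.42, 0.17], ["S1", "S2", "S3"])
print("\n".join(lines))
```

Passing the scores and document IDs as arguments, instead of reading globals such as `C` and `doc_head`, makes the function easier to test on small inputs.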

## Generate Trec File

In [None]:
doc_head, docs = get_all_documents()
query_head, queries = get_all_queries("IRTP/Query_doc_test.txt")
docs = preprocessor(docs)
queries = preprocessor(queries)
df_D, doc_vectorizer = generate_doc_vectors(docs)
df_Q = generate_query_vectors(doc_vectorizer, queries)
Q = df_Q.to_numpy()
D = df_D.to_numpy()
C = Q.dot(D.T) # Q * D^T
generate_trec_file("trec_output_file_test_data.txt")
queries
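
Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the product `Q.dot(D.T)` above yields cosine similarities. The sketch below checks this identity on two hypothetical vectors using only the standard library; the helper names are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical unnormalized query and document vectors
q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

# The dot product of the unit-length vectors equals the cosine similarity
print(dot(l2_normalize(q), l2_normalize(d)))
print(cosine(q, d))
```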

## Evaluate Trec File (For Training Data)

In [None]:
!chmod a+x IRTP/trec_eval.8.1/trec_eval.8.1/trec_eval
!IRTP/trec_eval.8.1/trec_eval.8.1/trec_eval  IRTP/relevance_judgements_train.txt ./trec_output_file_test_data.txt

num_q          	all	10
num_ret        	all	1175
num_rel        	all	44
num_rel_ret    	all	21
map            	all	0.0858
gm_ap          	all	0.0054
R-prec         	all	0.0750
bpref          	all	0.0625
recip_rank     	all	0.1574
ircl_prn.0.00  	all	0.1609
ircl_prn.0.10  	all	0.1609
ircl_prn.0.20  	all	0.1609
ircl_prn.0.30  	all	0.1422
ircl_prn.0.40  	all	0.1409
ircl_prn.0.50  	all	0.1268
ircl_prn.0.60  	all	0.0313
ircl_prn.0.70  	all	0.0312
ircl_prn.0.80  	all	0.0312
ircl_prn.0.90  	all	0.0255
ircl_prn.1.00  	all	0.0255
P5             	all	0.0600
P10            	all	0.0400
P15            	all	0.0400
P20            	all	0.0400
P30            	all	0.0300
P100           	all	0.0170
P200           	all	0.0105
P500           	all	0.0042
P1000          	all	0.0021
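
For reference, the `map` figure is the mean over queries of average precision (AP): the precision at each rank where a relevant document appears, summed and divided by the number of relevant documents for that query. The following is a minimal sketch with hypothetical rankings and relevance judgements, not a replacement for trec_eval.

```python
def average_precision(ranked_doc_ids, relevant):
    """AP: mean of precision@k over the ranks k that hold a relevant document."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical run and relevance judgements, for illustration only
runs = {"Q1": ["S3", "S1", "S7"], "Q2": ["S2", "S9"]}
qrels = {"Q1": {"S1", "S7"}, "Q2": {"S4"}}

ap = {q: average_precision(runs[q], qrels[q]) for q in runs}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```

For `Q1` the relevant documents appear at ranks 2 and 3, giving AP = (1/2 + 2/3) / 2; `Q2` retrieves no relevant document, so its AP is 0.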


# References

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d
- http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system
- https://radimrehurek.com/gensim/models/tfidfmodel.html


