### Data processing

This notebook processes the dataset, which is stored in JSON format. Each JSON file contains several samples, and each sample maps a question to the passages that answer it accurately and completely. For example, the following is a sample from the test set:

```json
  {
    "QuestionID": "7b3f8c8d-98a9-4a4b-9c1e-6a4c7aeb2f10",
    "Question": "If an eligible applicant later receives a payment outside this scheme for the same damage, what are their notification and repayment obligations to the local authority?",
    "Passages": [
      {
        "DocumentID": "si-2020-0025",
        "PassageID": "reg-11",
        "Passage": "11. (1) Without prejudice to the generality of Regulations 8(2)(h), 9(2)(m) \nand 10(2)(e), where, in relation to a relevant dwelling in respect of which a \nconfirmation of grant approval has been issued under Regulation 9 or in respect \nof which a payment has been made to an individual under these Regulations, a \npayment otherwise than under these Regulations is made to or for the benefit of \nthe eligible applicant or individual concerned, as the case may be, in respect of \ndamage to the dwelling arising out of or in connection with the use of defective \nconcrete blocks in its construction, that eligible applicant or individual, as the \ncase may be, shall give notice in writing to the relevant local authority of the \npayment and the amount thereof within 28 days of the making of that payment. \n\n(2) On receipt of a notice under paragraph (1), where a payment has \npreviously been made under these Regulations to the eligible applicant or \nindividual concerned, as the case may be, the relevant local authority shall give \nnotice in writing to the eligible applicant or individual concerned, as the case \nmay be, of the total amount paid under these Regulations to the eligible \napplicant or individual, as the case may be. \n\n(3) On receipt of the notice under paragraph (2), the eligible applicant or \nindividual concerned, as the case may be, shall be immediately liable to pay to \nthe relevant local authority the lesser of the following amounts: \n\n(a) \n\nthe amount equal to the payment or payments made under these \nRegulations as set out in the notice referred to in paragraph (2), \nor \n\n(b) \n\nthe amount equal to the payment referred to in paragraph (1)."
      }
    ],
    "Group": 1
  },
```
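
As a minimal sketch of how a sample like this can be traversed, the snippet below flattens each question into (question ID, passage key) pairs. It assumes each file is a JSON array of such objects; the IDs and texts in `sample_json` are placeholders, not real dataset entries.

```python
import json

# Placeholder data mirroring the sample's structure (assumed: a JSON array of samples).
sample_json = """
[
  {
    "QuestionID": "q-0001",
    "Question": "placeholder question text",
    "Passages": [
      {"DocumentID": "doc-a", "PassageID": "reg-1", "Passage": "placeholder passage text"},
      {"DocumentID": "doc-a", "PassageID": "reg-2", "Passage": "another placeholder passage"}
    ],
    "Group": 1
  }
]
"""

samples = json.loads(sample_json)

# One (QuestionID, "<DocumentID>-<PassageID>") pair per relevant passage.
pairs = [
    (s["QuestionID"], f"{p['DocumentID']}-{p['PassageID']}")
    for s in samples
    for p in s["Passages"]
]
print(pairs)
```

The `<DocumentID>-<PassageID>` key is the same identifier format the corpus cell in this notebook uses.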

In [None]:
# Import libraries
from datasets import Dataset # type: ignore
from pandas import to_pickle
import json
from re import compile
from sklearn.model_selection import train_test_split
import os

In [None]:
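def simple_cleaning(query: str) -> str:
    """Replace newlines, tabs and U+200E marks with spaces, then collapse repeated spaces."""
    pattern_newline = compile(r'[\n\t\u200e]')
    pattern_multiple_spaces = compile(r' +')

    # Replace control characters first so the resulting runs of spaces collapse too
    cln_query = pattern_newline.sub(' ', query)
    cln_query = pattern_multiple_spaces.sub(' ', cln_query).strip()
    return cln_query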
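
To illustrate what this cleaning does, here is a small standalone sketch applied to an excerpt of the sample passage shown earlier. `clean_text` is a hypothetical name used only here; it applies the same two patterns as `simple_cleaning`, but compiles them once at module level, which is a design choice of this sketch rather than of the notebook.

```python
import re

# Same patterns as simple_cleaning, compiled once instead of on every call.
_CONTROL_CHARS = re.compile(r'[\n\t\u200e]')
_MULTI_SPACE = re.compile(r' +')


def clean_text(text: str) -> str:
    """Hypothetical equivalent of simple_cleaning for this sketch."""
    text = _CONTROL_CHARS.sub(' ', text)
    return _MULTI_SPACE.sub(' ', text).strip()


# Excerpt taken from the sample passage above.
raw = "(a) \n\nthe amount equal to the payment or payments made under these \nRegulations"
cleaned = clean_text(raw)
print(cleaned)
```

The hard line breaks inherited from the source documents become single spaces, so the passage reads as continuous text.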

### Corpus

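Builds the corpus from the processed Irish Statutory Instruments documents and saves it to disk as a pickle. Passages are deduplicated by their `DocumentID`-`PassageID` key, and a passage is kept only if its text, including the prepended passage ID, is longer than 100 characters.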

In [None]:
with open('../all_data.json', encoding='utf-8') as f:
    all_data = json.load(f)

collection = []
seen = set()

for q in all_data:
    for psg in q['Passages']:
        psg_id = f"{psg['DocumentID']}-{psg['PassageID']}"
        if psg_id not in seen:
            passage_text = psg['PassageID'] + " " + psg['Passage']
            if len(passage_text) > 100:
                collection.append({
                    'text': passage_text,
                    'ID': psg_id,
                    'DocumentId': psg['DocumentID'],
                    'PassageId': psg['PassageID'],
                })
                seen.add(psg_id)
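
# Map each unique passage key ("<DocumentID>-<PassageID>") to its text
corpus = {doc['ID']: doc['text'] for doc in collection}
# Save the corpus to disk, creating the output directory if it does not exist
os.makedirs('./data', exist_ok=True)
to_pickle(corpus, './data/corpus.pkl')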
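
Because short passages are filtered out, some questions may reference passages that never make it into the corpus. The sketch below is one way to check for that, under the assumption that the data has the structure shown in the sample; `passage_key` and `missing_passages` are hypothetical helpers, and the toy data is made up for illustration.

```python
def passage_key(psg: dict) -> str:
    """Build the "<DocumentID>-<PassageID>" key used for corpus entries."""
    return f"{psg['DocumentID']}-{psg['PassageID']}"


def missing_passages(all_data: list, corpus: dict) -> dict:
    """Map each QuestionID to the passage keys it references that are absent from the corpus."""
    missing = {}
    for q in all_data:
        absent = [passage_key(p) for p in q["Passages"] if passage_key(p) not in corpus]
        if absent:
            missing[q["QuestionID"]] = absent
    return missing


# Toy data: the second passage is too short to pass a length filter.
toy_data = [
    {"QuestionID": "q-1", "Passages": [
        {"DocumentID": "doc-a", "PassageID": "reg-1", "Passage": "x" * 200},
        {"DocumentID": "doc-a", "PassageID": "reg-2", "Passage": "too short"},
    ]},
]
toy_corpus = {"doc-a-reg-1": "reg-1 " + "x" * 200}

result = missing_passages(toy_data, toy_corpus)
print(result)
```

On the real data this would be called as `missing_passages(all_data, corpus)`; an empty result means every referenced passage survived the filter.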


In [9]:
# Quick check: reload the pickle and print the start of the first passage
import pickle
with open('./data/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
first_key = next(iter(corpus))
print("Sample:", corpus[first_key][:2000])

Sample: reg-1 1. (1) These Regulations may be cited as the Betting Duty and Betting 

Intermediary Duty (Amendment) Regulations 2020. 

(2) These Regulations come into operation with immediate effect. 

Interpretation
