### Data processing

This notebook processes the dataset, which is stored in JSON format. Each JSON file contains several samples, and each sample maps a question to the passages that answer it accurately and completely. For instance, this is a sample from the test set:

```json
  {
    "QuestionID": "8d9c9c4a-2a66-4d1e-8b0b-0bbf3e3f1c2e",
    "Question": "When does a person who purchases a dwelling on or after the commencement of these Regulations cease to qualify as a “relevant owner” for the scheme?",
    "Passages": [
      {
        "DocumentID": "si-2020-0025",
        "PassageID": "reg-5",
        "Passage": "(2) Where a person purchases a dwelling on or after the date of the coming \ninto operation of these Regulations, he or she shall not be a relevant owner for \n\n \n \n \n\f[25] 7 \n\nthe purposes of these Regulations where he or she knew, or ought to have \nknown, that defective concrete blocks were used in the construction of the \ndwelling."
      }
    ],
    "Group": 1
  }
```
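
To make the record layout concrete, the sketch below walks one sample and builds a `DocumentID-PassageID` key for each relevant passage. The `sample` dict is a shortened, hand-written copy of the record above (the question and passage text are abbreviated), and `passage_keys` is a hypothetical helper name, not part of the notebook.

```python
from typing import List

# Shortened copy of the sample record above; long strings are abbreviated.
sample = {
    "QuestionID": "8d9c9c4a-2a66-4d1e-8b0b-0bbf3e3f1c2e",
    "Question": "When does a person cease to qualify as a relevant owner?",
    "Passages": [
        {
            "DocumentID": "si-2020-0025",
            "PassageID": "reg-5",
            "Passage": "(2) Where a person purchases a dwelling on or after the date of the coming into operation of these Regulations",
        }
    ],
    "Group": 1,
}


def passage_keys(record: dict) -> List[str]:
    """Return one 'DocumentID-PassageID' key per relevant passage (hypothetical helper)."""
    return [f"{p['DocumentID']}-{p['PassageID']}" for p in record["Passages"]]


print(sample["Question"])
print(passage_keys(sample))
```

A question can list more than one passage, so the helper returns a list even when, as here, it holds a single key.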

In [1]:
# Import libraries

# Data handling
from datasets import Dataset
from pandas import to_pickle

# Other utils
import json
import os
import re  # Imported as a module so that re.compile does not shadow the built-in compile

In [2]:
# Patterns for some basic clean-up of the query, compiled once
PATTERN_NEWLINE = re.compile(r'[\n\t\u200e]')  # New lines, tabs, and left-to-right marks (U+200E)
PATTERN_MULTIPLE_SPACES = re.compile(r' +')  # Runs of contiguous blank spaces


def simple_cleaning(query: str) -> str:
    """Replace new lines, tabs, and left-to-right marks with spaces, then collapse repeated spaces."""
    cln_query = PATTERN_NEWLINE.sub(' ', query)
    cln_query = PATTERN_MULTIPLE_SPACES.sub(' ', cln_query).strip()
    return cln_query
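
As a quick illustration of what this clean-up does and does not cover, the sketch below applies it to text shaped like the sample passage. The function is repeated here, with the same two patterns, only so that the block runs on its own; the input strings are shortened excerpts written for this example.

```python
import re

# Same two patterns as the notebook's simple_cleaning, repeated so this block is self-contained
_CONTROL = re.compile(r'[\n\t\u200e]')
_SPACES = re.compile(r' +')


def simple_cleaning(query: str) -> str:
    return _SPACES.sub(' ', _CONTROL.sub(' ', query)).strip()


raw = "he or she shall not be a relevant owner for \n\n \n \nthe purposes\tof these Regulations"
cleaned = simple_cleaning(raw)
print(cleaned)

# A form feed (\f) is not in the character class, so one inside the text is left in place
with_form_feed = simple_cleaning("owner for \f[25] 7 \n\nthe purposes")
print(repr(with_form_feed))
```

The second call suggests that page-break residue such as `\f[25] 7` in the sample passage would survive this step; whether that matters depends on how the text is used downstream.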

In [None]:
from sklearn.model_selection import train_test_split

# Load every sample from the combined JSON file
with open('../all_data.json', encoding='utf-8') as f:
    all_data = json.load(f)

# Split into train (70%), eval (15%), test (15%); the fixed seed keeps the split reproducible
data_train, temp = train_test_split(all_data, test_size=0.3, random_state=42)
data_eval, data_test = train_test_split(temp, test_size=0.5, random_state=42)

print(f"Train: {len(data_train)}, Eval: {len(data_eval)}, Test: {len(data_test)}")

Train: 2865, Eval: 614, Test: 615


In [4]:
len(data_train), len(data_eval), len(data_test)

(2865, 614, 615)
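
The sizes above follow from how `train_test_split` rounds a fractional `test_size`: scikit-learn takes the ceiling of `test_size * n_samples` for the test part and gives the remainder to the train part. The sketch below reproduces that arithmetic with the standard library only. The total of 4094 samples is inferred from the printed sizes, and `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
from math import ceil
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test), taking the ceiling for the test part (hypothetical helper)."""
    n_test = ceil(test_size * n_samples)
    return n_samples - n_test, n_test


total = 2865 + 614 + 615  # 4094, inferred from the output above

# First split: 30% held out, then the held-out part is halved into eval and test
n_train, n_temp = split_sizes(total, 0.3)
n_eval, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_eval, n_test)
```

Because of the rounding, the test set ends up one sample larger than the eval set, so the proportions are approximately rather than exactly 70/15/15.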

### Corpus

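This step builds a corpus from the passages of the 40 regulatory documents and saves it to disk as a pickle file. Each passage is keyed by `DocumentID-PassageID` and stored once, and texts of 100 characters or fewer (counting the prepended passage ID) are skipped.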

In [None]:
with open('../all_data.json', encoding='utf-8') as f:
    all_data = json.load(f)

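collection = []
seen = set()

for q in all_data:
    for psg in q['Passages']:
        # A passage is identified by its document ID and passage ID together
        psg_id = f"{psg['DocumentID']}-{psg['PassageID']}"
        if psg_id not in seen:
            # Prefix the passage ID so the text carries its own reference
            passage_text = psg['PassageID'] + " " + psg['Passage']
            # Skip very short texts (100 characters or fewer)
            if len(passage_text) > 100:
                collection.append({
                    'text': passage_text,
                    'ID': psg_id,
                    'DocumentId': psg['DocumentID'],
                    'PassageId': psg['PassageID'],
                })
                seen.add(psg_id)

# Map each composite ID to its passage text
corpus = {doc['ID']: doc['text'] for doc in collection}

# Save the corpus to disk, creating the output directory first if it is missing
os.makedirs('./data', exist_ok=True)
to_pickle(corpus, './data/corpus.pkl')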
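
The deduplication and length filter in this cell can be checked on a toy input. The sketch below uses made-up records and a hypothetical `build_corpus` helper that follows the same logic. The `min_chars` parameter is an assumption added for illustration, and it is lowered here only so that the toy passages stay short.

```python
def build_corpus(samples, min_chars=100):
    """Map 'DocumentID-PassageID' to passage text, keeping each passage once (hypothetical helper)."""
    corpus = {}
    for sample in samples:
        for psg in sample["Passages"]:
            key = f"{psg['DocumentID']}-{psg['PassageID']}"
            text = psg["PassageID"] + " " + psg["Passage"]
            # Keep the first occurrence only, and skip texts at or below the length threshold
            if key not in corpus and len(text) > min_chars:
                corpus[key] = text
    return corpus


# Made-up records: two questions share one passage, and one passage is too short
shared = {"DocumentID": "doc-a", "PassageID": "reg-1", "Passage": "These Regulations may be cited as an example."}
short = {"DocumentID": "doc-b", "PassageID": "reg-2", "Passage": "Short."}
toy = [
    {"Passages": [shared]},
    {"Passages": [shared, short]},
]

toy_corpus = build_corpus(toy, min_chars=20)
print(toy_corpus)
```

The shared passage appears once and the short one is dropped, which mirrors what the `seen` set and the length check do in the cell above.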


In [9]:
# Quick check:
import pickle
with open('./data/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
first_key = list(corpus.keys())[0]
print("Sample:", corpus[first_key][:2000])

Sample: reg-1 1. (1) These Regulations may be cited as the Betting Duty and Betting 

Intermediary Duty (Amendment) Regulations 2020. 

(2) These Regulations come into operation with immediate effect. 

Interpretation
