## Table of Contents
* [0. Import libraries and setup file path](#0)
* [1. Retrieve Domain Knowledge](#1)
    * [1.1. Extract domain document page content](#1.1)
    * [1.2. Document chunking](#1.2)
        * [1). Create chunks of same length from sentences](#1.2.1)
        * [2). Join chunk of sentences to a single string](#1.2.2)
    * [1.3. Chunk embedding](#1.3)
        * [1). Import pre-trained sentence-transformers model](#1.3.1)
        * [2). Save the domain knowledge with embeddings](#1.3.2)
* [2. Prompt Augmentation](#2)
    * [2.1. Semantic search top k relevant chunks](#2.1)
* [3. Text Generation](#3)
    * [3.1. Import pre-trained seq2seq model](#3.1)
    * [3.2. Test the Model with Zero-Shot Inferencing](#3.2)
* [4. Prompt Augmentation](#4)
    * [4.1. Fine-tune the generation model for abstractive question answering](#4.1)
        * [1). Preprocess the abstractive-qa dataset](#4.1.1)
        * [2). Setup the LoRA model for Fine-Tuning](#4.1.2)
        * [3). Train the LoRA Adapter](#4.1.3)
    * [4.2. Prompt engineering a causal language model for abstractive question answering](#4.2)
        * [1). Model selection and setup](#4.2.1)
        * [2). Answer from causal model without prompt engineering](#4.2.2)
        * [3). Answer from causal model with prompt engineering](#4.2.3)
* [5. Model Evaluation](#5)
    * [5.1. Evaluate the Model Quantitatively (with ROUGE Metric)](#5.1)
        * [1). Create sample questions and human_baseline answers](#5.1.1)
        * [2). Get answers from the causal language model](#5.1.2)
        * [3). Get answers from fine-tuned seq2seq language model](#5.1.3)
    * [5.2. Import the ROUGE and BLEU metric](#5.2)
    * [5.3. Evaluate models performance](#5.3)

## 0. Import libraries and setup file path<a id="0"></a>

In [1]:
import os
import re
import fitz 
import spacy
import torch
import transformers
import numpy as np
import pandas as pd

from tqdm.auto import tqdm
from huggingface_hub import login
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForCausalLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer, AutoModelForSeq2SeqLM
from datasets import load_dataset, load_metric

DOMAIN_DOCUMENT_PATH = 'domain_document/'
DOMAIN_KNOWLEDGE_PATH = 'domain_document/domain_knowledge/'
CHUNK_SIZE = 10 
INTERSECT_SIZE = 2

# Load the English language model
NLP = spacy.load("en_core_web_sm")

## 1. Retrieve Domain Knowledge<a id="1"></a>
### 1.1. Extract domain document page content<a id="1.1"></a>

In [2]:
def get_document(title):
    '''
    Extract document content and preprocess
    '''
    # Open the document with given name
    path = DOMAIN_DOCUMENT_PATH + title
    document = fitz.open(path)  
    # Read the document and get content page by page
    content_list = list()
    for index, page in tqdm(enumerate(document)): 
        raw_content = page.get_text() 
        content = raw_content.replace("\n", " ").strip()
        content_list.append({'title': title,
                             "page": (index + 1),  
                             "content": content})
    return content_list

def get_all_document():
    '''
    Extract content of all documents in the domain knowledge folder
    '''
    # Get titles of all documents
    path = DOMAIN_DOCUMENT_PATH
    doc_list = [name for name in os.listdir(path) if os.path.isfile(os.path.join(path, name))]
    # Extract content for each document
    all_content_list = list()
    for title in tqdm(doc_list):
        all_content_list.extend(get_document(title))
    return all_content_list

In [3]:
# Example of one extracted page of document
document_1 = get_all_document()
document_1[11]


{'title': 'Enterprise-Analytics-Report-508-FINAL.pdf',
 'page': 12,
 'content': "12    Leveraging Technologies to Analyze and Visualize  Data  The relationship between the CDO and CIO is also critical to the success of enterprise analytics  programs. The collaboration between these roles varies across agencies, with different models in  place such as reporting and non-reporting relationships. Regardless of the specific structure, the  interplay between the CDO and the CIO is crucial for effectively linking enterprise analytics  programs and IT strategies, thus ensuring the necessary technology and infrastructure are in  place to support the agency's enterprise analytics initiatives.  Table 2. Has your agency implemented an enterprise analytics platform for integrating and  analyzing data across various components and functional silos?  Agency  Type  Mature  IMP*   Pilot  IMP*  Experimental  Development  Planned  IMP*  No  Current  Plans  CFO Act  Agency*  2  4  1  3  2  Non-CFO  Act Ag

### 1.2. Document chunking<a id="1.2"></a>
#### 1.2.1. Create chunks of same length from sentences<a id="1.2.1"></a>

In [4]:
def get_sentences(content_list, nlp = NLP):
    '''
    Split page of content to sentences
    '''
    for page in tqdm(content_list):
        # Process the text and create sentences
        content = page['content']
        doc = nlp(content) 
        sentences = [sentence.text for sentence in doc.sents]
        # Create a new item in dictionary
        page["sentences"] = sentences
    return content_list

In [5]:
def get_chunks(content_list, s_1 = CHUNK_SIZE, s_2 = INTERSECT_SIZE):
    '''
    Create chunks of s_1 sentences; consecutive chunks overlap by s_2 sentences
    '''
    page_len = len(content_list)
    # Number of leading sentences already used to fill the previous page's last chunk
    skip_num = 0
    for page_i in tqdm(range(page_len)):
        page = content_list[page_i]
        sentences = page["sentences"]
        if skip_num < len(sentences):
            start_i = max(0, skip_num-s_2)
            chunks = [sentences[j:(j+s_1)] for j in range(start_i, 
                                                          len(sentences), 
                                                          (s_1-s_2))]
        # Reset so a stale offset is never applied to a later page
        skip_num = 0
        # Fill the last chunk to the same length with sentences from the next page
        if (len(chunks[-1]) < s_1) and (page_i != (page_len-1)):
            skip_num = s_1 - len(chunks[-1])
            next_page = content_list[page_i+1]
            fill_sentences = next_page["sentences"][:skip_num]
            chunks[-1].extend(fill_sentences)
        page["chunks"] = chunks
    return content_list
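
The sliding-window idea behind `get_chunks` may be easier to see in isolation on a toy list. The sketch below is a simplified, single-page version: the helper name `sliding_chunks` is hypothetical, and it does not borrow sentences from a following page. Each chunk holds up to `size` items and consecutive chunks share `overlap` items, which is the role `CHUNK_SIZE` and `INTERSECT_SIZE` play above.

```python
def sliding_chunks(items, size=10, overlap=2):
    """Split items into windows of `size` that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [items[start:start + size] for start in range(0, len(items), step)]


# 25 placeholder "sentences" named s0 .. s24
sentences = [f"s{n}" for n in range(25)]
chunks = sliding_chunks(sentences, size=10, overlap=2)
for chunk in chunks:
    # Print the chunk length with its first and last sentence
    print(len(chunk), chunk[0], chunk[-1])
```

The trailing chunks come out shorter than `size`, which is why the notebook's version tops up the last chunk of a page with sentences from the next page.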

In [6]:
# Example of extracted sentences from one page of document
document_2 = get_sentences(document_1)
#document_2[11]['sentences']

# Example of chunk of sentences of one page
document_3 = get_chunks(document_2)
example_chunk_len = len(document_3[11]['chunks'][-1])
print(f'Number of sentences in the chunk: {example_chunk_len}.')
print('-' * 25)
print(document_3[11]['chunks'][-1])

Number of sentences in the chunk: 10.
-------------------------
['Non-CFO Act Agency CFO Act Agency*', '13    Effective collaboration, both within and between departments, is crucial.', "The USDA's EDAPT  system and the State Department's Data.", 'State both facilitated collaboration by integrating data  from various sources into a single platform, allowing for easier sharing and analysis.', 'The  partnership models between the CDO and CIO also played an integral role in the successful  implementation and management of these systems.  ', 'Given the lack of maturity in platform and data integration but the relatively higher maturity in  enterprise analytics programs, CDOs should leverage the interest in analytics to accelerate the  maturity of their overall data infrastructure to make the most of their analytics capabilities.  ', 'Facilitate rapid prototyping and proofs of concept on advanced analytics tools and capabilities  with innovative programs and customer agencies, while enablin

In [7]:
document_3[11].keys()

dict_keys(['title', 'page', 'content', 'sentences', 'chunks'])

#### 1.2.2. Join chunk of sentences to a single string<a id="1.2.2"></a>

In [8]:
def join_chunk_sentences(chunk_list):
    '''
    Join the sentences in one chunk as a single string
    '''
    joined_chunk = " ".join(chunk_list)
    # Use re.sub to replace multiple spaces with a single space
    chunk_string = re.sub(r'\s+', ' ', joined_chunk).strip()
    return chunk_string

def get_chunk_df(content_list):
    '''
    Create a new content_list df based on generated chunks
    '''
    # Transform the original content_list to a pandas dataframe
    raw_df = pd.DataFrame(content_list)
    # Explode raw_df and create a new column of chunk string
    new_df = raw_df.explode(['chunks'], ignore_index=True)
    new_df['chunk_string'] = new_df['chunks'].apply(join_chunk_sentences)
    # Select columns of interest
    out_df = new_df[['title', 'page', 'chunk_string']]
    return out_df

chunk_df = get_chunk_df(document_3)
chunk_df.head(3)

|   | title | page | chunk_string |
|---|---|---|---|
| 0 | Enterprise-Analytics-Report-508-FINAL.pdf | 1 | FEDERAL CHIEF DATA OFFICERS (CDO) COUNCIL The ... |
| 1 | Enterprise-Analytics-Report-508-FINAL.pdf | 2 | 18 Appendix: Agency Case Studies ................ |
| 2 | Enterprise-Analytics-Report-508-FINAL.pdf | 3 | They worked to build critical relationships an... |


### 1.3. Chunk embedding<a id="1.3"></a>
#### 1.3.1. Import pre-trained sentence-transformers model<a id="1.3.1"></a> 

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [10]:
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device)
embedding_model.to(device) 

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [11]:
#chunk_df['chunk_embedding'] = chunk_df['chunk_string'].apply(lambda s: 
                                                             #embedding_model.encode(s))

# Batch processing
chunk_strings =  chunk_df['chunk_string'].tolist()
chunk_embeddings = embedding_model.encode(chunk_strings,
                                          batch_size=32, 
                                          convert_to_tensor=True)

In [12]:
chunk_embeddings.shape

torch.Size([1211, 768])

#### 1.3.2. Save the domain knowledge with embeddings<a id="1.3.2"></a>

In [13]:
chunk_df.to_csv((DOMAIN_KNOWLEDGE_PATH+'domain_knowledge_chunks.csv'), encoding='utf-8', index=False)
np.savetxt((DOMAIN_KNOWLEDGE_PATH+'domain_knowledge_embeddings.csv'), np.array(chunk_embeddings.cpu()), delimiter=",")
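
To reuse the saved embeddings in a later session, they need to be read back into an array. The following is a minimal round-trip sketch, assuming a small random matrix as a stand-in for the real `(n_chunks, 768)` embeddings and a temporary folder instead of `DOMAIN_KNOWLEDGE_PATH`.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embedding matrix (assumption: random values)
embeddings = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "domain_knowledge_embeddings.csv")
    np.savetxt(path, embeddings, delimiter=",")
    # loadtxt returns float64; ndmin=2 keeps a single row two-dimensional
    restored = np.loadtxt(path, delimiter=",", ndmin=2).astype(np.float32)

print(restored.shape)
print(np.allclose(embeddings, restored))
```

Before passing a restored array to `get_top_chunks`, it would need to be converted back to a tensor on the right device, for example with `torch.from_numpy(restored).to(device)`.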

## 2. Prompt Augmentation<a id="2"></a>
### 2.1. Semantic search top k relevant chunks<a id="2.1"></a>

In [14]:
def get_top_chunks(query, chunk_embeddings, model, k=5):
    '''
    Retrieve the top-k chunks most relevant to the input query
    Return a list of chunk strings
    '''
    # Embed the query with the same embedding model
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Get the similarity matrix with dot product
    similarity_mat = util.dot_score(query_embedding, chunk_embeddings)[0]
    # Get the top-k indices
    top_indices = torch.topk(similarity_mat, k=k)[1].tolist()
    # Get chunks
    top_chunks = chunk_df.iloc[top_indices]['chunk_string'].tolist()
    return top_chunks
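
Because the embedding model ends with a `Normalize()` layer, its vectors have unit length, so the dot product used above gives the same ranking as cosine similarity. The sketch below illustrates this with NumPy on made-up vectors; the array names and sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_vectors = rng.normal(size=(6, 4))   # 6 made-up chunk embeddings
query_vector = rng.normal(size=4)         # 1 made-up query embedding

# L2-normalize, mirroring the Normalize() layer of the embedding model
chunk_vectors /= np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
query_vector /= np.linalg.norm(query_vector)

dot_scores = chunk_vectors @ query_vector
cosine_scores = dot_scores / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

k = 3
# argsort is ascending, so reverse it and keep the first k indices
top_indices = np.argsort(dot_scores)[::-1][:k]
print(top_indices, dot_scores[top_indices])
print(np.allclose(dot_scores, cosine_scores))
```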

In [15]:
# Example: Get top 5 chunks related to the example query
example_query = 'How to protect personal data?'
top_chunks = get_top_chunks(example_query, chunk_embeddings, embedding_model)
# Print top 2 related chunks
print(top_chunks[:2])

['(77) Guidance on the implementation of appropriate measures and on the demonstration of compliance by the controller or the processor, especially as regards the identification of the risk related to the processing, their assessment in terms of origin, nature, likelihood and severity, and the identification of best practices to mitigate the risk, could be provided in particular by means of approved codes of conduct, approved certifications, guidelines provided by the Board or indications provided by a data protection officer. The Board may also issue guidelines on processing operations that are considered to be unlikely to result in a high risk to the rights and freedoms of natural persons and indicate what measures may be sufficient in such cases to address such risk. (78) The protection of the rights and freedoms of natural persons with regard to the processing of personal data require that appropriate technical and organisational measures be taken to ensure that the requirements of

## 3. Text Generation<a id="3"></a>
### 3.1. Import pre-trained seq2seq model<a id="3.1"></a>

In [16]:
generation_model_id = 'google/flan-t5-base'

In [17]:
original_model = AutoModelForSeq2SeqLM.from_pretrained(generation_model_id, 
                                                       torch_dtype=torch.bfloat16, 
                                                       low_cpu_mem_usage=False)
generation_tokenizer = AutoTokenizer.from_pretrained(generation_model_id)
generation_tokenizer.model_max_length=2048

# Send the model to GPU
original_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [19]:
generation_tokenizer.is_fast

True

### 3.2. Test the Model with Zero-Shot Inferencing<a id="3.2"></a>

In [18]:
def prepend_prompt(context, question):
    '''
    Prepend instruction to the question
    '''
    prepend_str = 'Based on the following context, please answer the query by extracting relevant passages from the context.\nContext:\n'
    adj_question = prepend_str + context + '\nQuestion:\n' + question + '\nAnswer:'
    return adj_question

In [21]:
# Zero-Shot inference with example context and question
example_context = '- ' + '\n- '.join(top_chunks)
example_prompt = prepend_prompt(example_context, example_query)
#print(example_prompt)
#print('-' * 100)

example_inputs = generation_tokenizer(example_prompt, padding=True, truncation=True, return_tensors='pt').to(device)
example_output = original_model.generate(example_inputs['input_ids'], max_new_tokens=256)[0]
original_model_answer = generation_tokenizer.decode(example_output, skip_special_tokens=True)
print(original_model_answer)

(78) The protection of the rights and freedoms of natural persons with regard to the processing of personal data require that appropriate technical and organisational measures be taken to ensure that the requirements of this Regulation are met. In order to maintain security and to prevent processing in infringement of this Regulation, the controller should adopt internal policies and implement measures which meet in particular the principles of data protection by design and data protection by default. Such measures could consist, inter alia, of minimising the processing of personal data, pseudonymising personal data as soon as possible, transparency with regard to the functions and processing of personal data, enabling the data subject to monitor the data processing, enabling the controller to create and improve security features. When developing, designing, selecting and using applications, services and applications that are based on the processing of personal data or process personal

## 4. Prompt Augmentation<a id="4"></a>
### 4.1. Fine-tune the generation model for abstractive question answering<a id="4.1"></a>
#### 4.1.1. Preprocess the abstractive-qa dataset<a id="4.1.1"></a>

In [25]:
qa_dataset = load_dataset('rajpurkar/squad')

In [33]:
# Example of qa_dataset's context and question-answer pair:
dash_line = '-' * 100
example_context = qa_dataset['train'][0]['context']
example_question = qa_dataset['train'][0]['question']
example_answers = qa_dataset['train'][0]['answers']['text'][0]
print(f'Context: {example_context}')
print(dash_line)
print(f'Query: {example_question}')
print(dash_line)
print(f'Answer: {example_answers}')

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
----------------------------------------------------------------------------------------------------
Query: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
----------------------------------------------------------------------------------------------------
Answer: Saint 

In [34]:
# Tokenize the question-answer pairs into explicit instructions for the generation model
def tokenize(examples, tokenizer=generation_tokenizer, device=device, max_length=2048, answer_max_length=256):
    '''
    Concatenate each context-question pair with the predefined prompt
    Tokenize the prompts, and use the tokenized answers as the labels
    '''
    # Prepend the instruction to the context and question
    prepend_str = 'Based on the following context, please answer the query by extracting relevant passages from the context.\nContext:\n'
    adj_questions = [prepend_str + context + '\nQuestion:\n' + question + '\nAnswer:'
                     for (context, question) 
                     in zip(examples['context'], examples['question'])]
    answers = [each['text'][0] for each in examples['answers']]
    
    # Tokenize the questions and answers
    tokenized_inputs = tokenizer(adj_questions, padding=True, truncation=True, 
                                 max_length=max_length, return_tensors="pt")
    tokenized_labels = tokenizer(answers, padding=True, truncation=True, 
                                 max_length=answer_max_length, return_tensors="pt")
    
    # Ensure the labels are prepared for the model
    labels = tokenized_labels['input_ids']
    # Replace the token IDs for padding with -100
    labels[labels == tokenizer.pad_token_id] = -100

    model_inputs = {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': labels
    }
    return model_inputs
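
Replacing padding ids with `-100` matters because that is the label value the cross-entropy loss skips by default in PyTorch, so padded positions do not contribute to training. The sketch below shows the effect with NumPy; the token ids and log-probabilities are made up, and the averaging is a simplified stand-in for the real loss.

```python
import numpy as np

PAD_TOKEN_ID = 0       # padding id used by T5 tokenizers
IGNORE_INDEX = -100    # label value skipped by the loss

# Two made-up tokenized answers, padded to the same length
labels = np.array([[71, 1712, 1, 0, 0],
                   [363, 1, 0, 0, 0]])
labels[labels == PAD_TOKEN_ID] = IGNORE_INDEX

# Made-up log-probabilities assigned to each target position
token_log_probs = np.full(labels.shape, -0.5)

# Average the negative log-likelihood over real tokens only
keep = labels != IGNORE_INDEX
loss = -(token_log_probs * keep).sum() / keep.sum()
print(labels)
print(keep.sum(), loss)
```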

In [35]:
tokenized_qa_dataset = qa_dataset.map(tokenize, batched=True)

#### 4.1.2. Setup the LoRA model for Fine-Tuning<a id="4.1.2"></a>

In [41]:
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=32,
    lora_dropout=0.1
)

# Add LoRA adapter layers/parameters to the original LLM to be trained:
peft_model = get_peft_model(original_model, lora_config)
# Print the number of trainable model parameters in the LoRA model:
peft_model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4092820552029972
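
The trainable-parameter count can be checked by hand. The sketch below assumes the adapter targets the `q` and `v` projections of every attention block (consistent with the module tree printed later, where `q` is wrapped and `k` is not) and that flan-t5-base has 12 encoder and 12 decoder layers.

```python
hidden_size = 768      # in_features and out_features of q and v
rank = 32              # r in LoraConfig

# Each adapted Linear gets lora_A (hidden -> r) and lora_B (r -> hidden)
params_per_module = rank * hidden_size + hidden_size * rank

encoder_attention_blocks = 12          # self-attention only
decoder_attention_blocks = 12 + 12     # self-attention + cross-attention
adapted_per_block = 2                  # assumed targets: q and v

trainable = ((encoder_attention_blocks + decoder_attention_blocks)
             * adapted_per_block * params_per_module)
all_params = 251_116_800               # taken from the printed summary

print(trainable)
print(100 * trainable / all_params)
```

Under these assumptions the count comes to 3,538,944, matching the printed summary.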


#### 4.1.3. Train the LoRA Adapter<a id="4.1.3"></a>

In [52]:
# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results_lora_qa",
    num_train_epochs=3,
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=8,
    learning_rate=1e-3, 
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs_lora_qa',
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=2,
    do_train=True,
    do_eval=True
)

# Define data collator
data_collator = DataCollatorForSeq2Seq(generation_tokenizer, model=peft_model)

# Create a Seq2SeqTrainer instance
peft_trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_qa_dataset['train'],
    eval_dataset=tokenized_qa_dataset['validation'],
    tokenizer=generation_tokenizer,
    data_collator=data_collator
)

In [53]:
peft_trainer.train()

| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 0.3849 | 0.322577 |
| 1000 | 0.3594 | 0.331443 |
| 1500 | 0.3118 | 0.333329 |
| 2000 | 0.3385 | 0.33971 |
| 2500 | 0.3481 | 0.33312 |
| 3000 | 0.3495 | 0.341716 |
| 3500 | 0.3271 | 0.339754 |
| 4000 | 0.4342 | 0.336479 |
| 4500 | 0.2907 | 0.336209 |
| 5000 | 0.39 | 0.318322 |


TrainOutput(global_step=32850, training_loss=0.3099739267474077, metrics={'train_runtime': 10391.7266, 'train_samples_per_second': 25.289, 'train_steps_per_second': 3.161, 'total_flos': 2.680364888277166e+17, 'train_loss': 0.3099739267474077, 'epoch': 3.0})
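
The `global_step` and throughput figures in `TrainOutput` follow from the dataset size and the training arguments. A quick check, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 87_599     # size of the SQuAD train split loaded above
batch_size = 8              # per_device_train_batch_size
epochs = 3                  # num_train_epochs
train_runtime = 10391.7266  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

This gives 32,850 steps in total, so the log excerpt above (up to step 5000) covers only the early part of the first epoch.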

In [54]:
# Save the model to a local folder
peft_model.save_pretrained('./flan-t5-base-qa-lora')

In [93]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Loading the adapter folder this way relies on the PEFT integration in transformers (peft must be installed)
peft_model = AutoModelForSeq2SeqLM.from_pretrained('./flan-t5-base-qa-lora')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

peft_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k): Linear(in_features=768, out_features=768, bias=False)
              (

In [19]:
# Inference with the fine-tuned LoRA model on the same example context and question
example_context = '- ' + '\n- '.join(top_chunks)
example_prompt = prepend_prompt(example_context, example_query)
#print(example_prompt)
#print('-' * 100)

example_inputs = tokenizer(example_prompt, padding=True, truncation=True, return_tensors='pt').to(device)
peft_output = peft_model.generate(example_inputs['input_ids'], max_new_tokens=256)[0]
peft_model_answer = tokenizer.decode(peft_output, skip_special_tokens=True)
print(peft_model_answer)

a controller determines the purposes and means of the processing jointly with other controllers or where a processing operation is carried


### 4.2. Prompt engineering a causal language model for abstractive question answering<a id="4.2"></a>
#### 4.2.1. Model selection and setup<a id="4.2.1"></a>

In [20]:
causal_model_id = "google/gemma-7b-it"

In [21]:
causal_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=causal_model_id)
causal_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=causal_model_id,
                                                    torch_dtype=torch.bfloat16,
                                                    low_cpu_mem_usage=False,
                                                    attn_implementation='flash_attention_2')
causal_model.to(device)

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.


GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaFlashAttention2(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): Gem

#### 4.2.2. Answer from causal model without prompt engineering<a id="4.2.2"></a>

In [22]:
def get_chat_prompt(raw_prompt, causal_tokenizer):
    '''
    Fill the prompt following huggingface chat model template
    Return a tokenizable string the model expects
    '''
    template = [{"role": "user", "content": raw_prompt}]
    chat_prompt = causal_tokenizer.apply_chat_template(conversation=template,
                                                       tokenize=False, 
                                                       add_generation_prompt=True)
    return chat_prompt

def get_answer(chat_prompt, causal_tokenizer, causal_model, 
               device = device,
               max_tokens = 256):
    '''
    Get formatted original answer given the prompt
    '''
    prompt_ids = causal_tokenizer(chat_prompt, return_tensors="pt").to(device)
    raw_answer_ids = causal_model.generate(**prompt_ids, max_new_tokens=max_tokens)
    # Get the answer in original format
    raw_answer = causal_tokenizer.decode(raw_answer_ids[0])
    # Change the format of the answer
    remove_list = [chat_prompt, '<bos>', '<eos>']
    big_regex = re.compile('|'.join(map(re.escape, remove_list)))
    answer = big_regex.sub('', raw_answer)
    return answer
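
The clean-up step in `get_answer` can be tried without loading a model. In the sketch below, the decoded output is a made-up string: the chat prompt in the Gemma format followed by an invented answer. `re.escape` is what lets characters such as `?` in the prompt match literally.

```python
import re

# Prompt in the chat format produced by get_chat_prompt
chat_prompt = ("<bos><start_of_turn>user\n"
               "How to protect personal data?<end_of_turn>\n"
               "<start_of_turn>model\n")
# Made-up decoded output: the prompt followed by a generated answer
raw_answer = chat_prompt + "Use encryption (e.g. AES) and access controls.<eos>"

remove_list = [chat_prompt, '<bos>', '<eos>']
# Escape each entry so regex metacharacters are treated as plain text
pattern = re.compile('|'.join(map(re.escape, remove_list)))
answer = pattern.sub('', raw_answer)
print(answer)
```

An alternative, not used in this notebook, would be to decode only the newly generated token ids, which avoids relying on the decoded text reproducing the prompt exactly.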

In [45]:
# Example of filling the chat model prompt
raw_prompt = 'How to protect personal data?'
chat_prompt = get_chat_prompt(raw_prompt, causal_tokenizer)
print(chat_prompt)

<bos><start_of_turn>user
How to protect personal data?<end_of_turn>
<start_of_turn>model



In [29]:
# Example of formatted answer w/o prompt engineering to the example prompt
example_answer = get_answer(chat_prompt, causal_tokenizer, causal_model)
print(example_answer)

**Personal Data Protection Measures:**

**1. Secure Data Collection and Storage:**

* Use encrypted data collection forms and secure storage solutions (e.g., encrypted hard drives, cloud storage with strong security protocols).
* Implement access controls to restrict unauthorized access to personal data.
* Comply with industry standards for data protection (e.g., GDPR, CCPA).

**2. Data Masking and Anonymization:**

* Mask sensitive personal data (e.g., names, addresses, social security numbers) with anonymization techniques.
* Use pseudonymization or data anonymization to remove identifying information from data.

**3. Access Control and Authentication:**

* Implement strong authentication methods (e.g., multi-factor authentication, biometric authentication).
* Limit data access to authorized personnel only.
* Use role-based access control (RBAC) to restrict data access based on user roles.

**4. Data Retention and Deletion:**

* Establish clear data retention policies to determine ho

#### 4.2.3. Answer from causal model with prompt engineering<a id="4.2.3"></a>

In [23]:
# Create an instruction prompt with few-shot inference

rag_prompt = '''Please answer the user's query with information from given context.
Please extract relevant passages from the context before responding.
Return the answer in an informative and concise manner. 
\nUse the following examples as reference for the ideal answer style:
\nExample 1:
Query: What is data risk?
Answer: Data risk refers to the potential for data to be lost, stolen, corrupted, or otherwise compromised, leading to negative consequences for individuals or organizations. This risk encompasses a wide range of issues, including but not limited to data breaches, unauthorized access, data corruption, and privacy violations. The implications of data risk can be severe, impacting the financial health, reputation, and legal standing of organizations, as well as the privacy and security of individuals' personal information.
\nExample 2:
Query: Where does data risk arise from?
Answer: Data risk arises from various sources, including cyberattacks (such as hacking, phishing, and malware), human error (like accidental deletion or mishandling of data), technical failures (such as software bugs or hardware malfunctions), and natural disasters. The growing reliance on digital technology and the increasing volume of data being stored and processed have heightened the importance of managing data risks effectively.
\nExample 3:
Query: How to conduct effective data risk management?
Answer: Effective data risk management involves identifying potential risks, assessing their likelihood and potential impact, and implementing measures to mitigate these risks. This includes employing strong cybersecurity measures, developing and enforcing robust data handling policies, regular auditing and monitoring of data access and usage, and ensuring compliance with data protection regulations.
\nNow use the following context to answer user's query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
Query: {query}
Answer:
'''

def set_prompt(top_chunks, query, rag_prompt, tokenizer = causal_tokenizer):
    '''
    Use few-shot inference to conduct prompt engineering
    '''
    # Create the context from top-k chunks
    context = "- " + "\n- ".join(top_chunks)
    # Create the prompt with examples  
    full_prompt = rag_prompt.format(context=context, query=query)
    chat_template = [
        {"role": "user",
        "content": full_prompt}
    ]
    out_prompt = tokenizer.apply_chat_template(conversation=chat_template,
                                               tokenize=False,
                                               add_generation_prompt=True)
    return out_prompt
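
The context-building and template-filling steps of `set_prompt` can be checked on their own. The sketch below uses a hypothetical, shortened template and made-up chunks. It also shows that braces inside a chunk are safe, since `str.format` only interprets braces in the template itself.

```python
# Hypothetical, shortened template; the notebook's rag_prompt is much longer
template = "Context:\n{context}\nQuery: {query}\nAnswer:"

# Made-up chunks; the second contains literal braces on purpose
top_chunks = ["Personal data must be processed lawfully.",
              "Config example: {retention_days: 30}"]

# One bullet per retrieved chunk
context = "- " + "\n- ".join(top_chunks)
prompt = template.format(context=context, query="How to protect personal data?")
print(prompt)
```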

In [24]:
# Print the example prompt with few-shot inference
example_prompt = set_prompt(top_chunks, 
                            example_query, 
                            rag_prompt)
print(example_prompt)

<bos><start_of_turn>user
Please answer the user's query with information from given context.
Please extract relevant passages from the context before responding.
Return the answer in an informative and concise manner. 

Use the following examples as reference for the ideal answer style:

Example 1:
Query: What is data risk?
Answer: Data risk refers to the potential for data to be lost, stolen, corrupted, or otherwise compromised, leading to negative consequences for individuals or organizations. This risk encompasses a wide range of issues, including but not limited to data breaches, unauthorized access, data corruption, and privacy violations. The implications of data risk can be severe, impacting the financial health, reputation, and legal standing of organizations, as well as the privacy and security of individuals' personal information.

Example 2:
Query: Where does data risk arise from?
Answer: Data risk arises from various sources, including cyberattacks (such as hacking, phishin

In [45]:
# Get the answer from the causal language model with few-shot inference

def get_answer_w_few_shot(prompt, temperature=0.9, model=causal_model, tokenizer=causal_tokenizer, device=device):
    '''
    Return the answer from the model with engineered prompt
    '''
    input_ids = tokenizer(prompt, return_tensors="pt").to(device)
    # Generate the answer tokens
    out_tokens = model.generate(**input_ids,
                                do_sample=True, 
                                max_new_tokens=512,
                                temperature=temperature)
    # Get the answer
    raw_answer = tokenizer.decode(out_tokens[0])
    # Strip the prompt and the <bos>/<eos> special tokens from the decoded text
    remove_list = [prompt, '<bos>', '<eos>']
    big_regex = re.compile('|'.join(map(re.escape, remove_list)))
    answer = big_regex.sub('', raw_answer)
    return answer
    
def get_pure_answers(raw_answer):
    '''
    Get answers without relevant passages
    '''
    # Return the text after the first marker found; fall back to the raw answer
    for marker in ('**Answer:**', '**Conclusion:**'):
        start_index = raw_answer.find(marker)
        if start_index != -1:
            # Skip the marker itself and any whitespace that follows it
            return raw_answer[start_index + len(marker):].lstrip()
    return raw_answer
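
`get_answer_w_few_shot` samples with `do_sample=True` and `temperature=0.9`, so repeated calls can return different answers. As a rough illustration of what temperature does, the sketch below applies the standard temperature-scaled softmax to a made-up logit vector. `softmax_with_temperature` and `toy_logits` are hypothetical names, and this is a teaching sketch rather than the library's own sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    '''
    Convert logits to probabilities after dividing them by the temperature
    '''
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens
toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make less likely tokens more probable.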

In [29]:
# Print the example answer with few-shot inference
example_answer = get_answer_w_few_shot(example_prompt)
print(example_answer)

**Relevant passages:**

**1. Data risk:**
- "Data risk refers to the potential for data to be lost, stolen, corrupted, or otherwise compromised, leading to negative consequences for individuals or organizations."
- "The protection of the rights and freedoms of natural persons with regard to the processing of personal data require that appropriate technical and organisational measures be taken to ensure that the requirements of this Regulation are met."

**2. Data protection measures:**
- "In order to maintain security and to prevent processing in infringement of this Regulation, the controller or processor should evaluate the risks inherent in the processing and implement measures to mitigate those risks."
- "To further strengthen the control over his or her own data, where the processing of personal data is carried out by automated means, the data subject should also be allowed to receive personal data concerning him or her which he or she has provided to a controller in a structured,

In [30]:
pure_example_answer = get_pure_answers(example_answer)
print(pure_example_answer)

To protect personal data, appropriate technical and organizational measures must be taken to ensure compliance with the European Union Regulation on the Protection of Personal Data (GDPR). These measures include evaluating the risks inherent in the processing, implementing measures to mitigate risks, and ensuring data protection by design and by default. Additionally, data subjects have the right to access, rectify, erase, or restrict the processing of their personal data. To enforce these rights, controllers should provide clear and concise information about the purpose, duration, and recipients of personal data processing. Measures to verify the identity of data subjects and prevent unauthorized access to personal data should also be implemented.


## 5. Model Evaluation<a id="5"></a>
### 5.1. Evaluate the Model Quantitatively (with ROUGE Metric)<a id="5.1"></a>
#### 5.1.1. Create sample questions and human_baseline answers<a id="5.1.1"></a>

In [115]:
question_list = [
    'What is data governance?',
    'When to establish a data governance body?',
    'Who is authorized to establish a data governance body?',
    'What is data identification as one key activity of data governance?',
    'What is data management policy as one key activity of data governance?',
    'What is joint controllers?',
    'How to handle cybersecurity risk for an organization?',
    'What is CSF Core?',
    'When should agencies consider a mixed-mode approach for survey data collection?',
    'How to enhance transparency and compliance with the regulation of GDPR?',
    'What are OMB statistical classifications?'
]

human_answers = [
    'Data governance is the process of setting and enforcing priorities for managing and using data as a strategic asset. A data governance body with authority and oversight over the management of agency data assets is a key piece of data infrastructure. These bodies are commonly called by such names as Data Governance Boards, Data Councils, or Data Strategy Teams. The data governance body establishes policies, procedures, and roles for developing, overseeing, and coordinating data management policy and helps prioritize data resource allocations to answer agency key questions and meet stakeholder needs.',
    'Agencies should make establishing a data governance body a top priority, thereby setting up the organizational structure to address data and related infrastructure needs.',
    'A data governance body is authorized and chartered by the agency head or delegated authority, chaired by the Chief Data Officer (CDO), and includes senior staff with responsibility for diverse aspects of data management as well as senior officials from agency program areas. In addition to the CDO, membership should include the Evaluation Official (EO) and the Statistical Official (SO) named in accordance with the Evidence Act.',
    'Identify data assets and develop a data inventory with appropriate metadata.',
    "Develop short statements of management intent and fundamental rules for governing the creation, acquisition, privacy, integrity, security, quality, and use of data and information.",
    "Where two or more controllers jointly determine the purposes and means of processing, they shall be joint controllers.",
    "An organization may choose to handle risk in one or more ways — including mitigating, transferring, avoiding, or accepting negative risks and realizing, sharing, enhancing, or accepting positive risks — depending on the potential impacts and likelihoods. Importantly, an organization can use the CSF both internally to manage its cybersecurity capabilities and externally to oversee or communicate with third parties.",
    "CSF Core, the nucleus of the CSF, which is a taxonomy of high-level cybersecurity outcomes that can help any organization manage its cybersecurity risks. The CSF Core components are a hierarchy of Functions, Categories, and Subcategories that detail each outcome.",
    "The two main reasons to consider using more than one mode of collection simultaneously are cost and response rates. The typical mixed mode approach is to use a less costly method for initial contact and a more costly mode for follow-up with nonrespondents, such as using a mail survey with telephone nonresponse follow-up or a telephone survey with an in-person nonresponse follow-up.",
    "In order to enhance transparency and compliance with this Regulation, the establishment of certification mechanisms and data protection seals and marks should be encouraged, allowing data subjects to quickly assess the level of data protection of relevant products and services.",
    "OMB currently has a number of different statistical classifications for demographic, economic, and geographic data, including data on race and ethnicity, industries, occupations, and statistical areas described in more detail in the following questions. In addition, there are some standard definitions of economic concepts for statistical purposes, and standard sources for Federal data for some demographic and economic statistics."
]

#### 5.1.2. Get answers from the causal language model<a id="5.1.2"></a>

In [35]:
def get_causal_answer(query, chunk_embeddings=chunk_embeddings, embedding_model=embedding_model, rag_prompt=rag_prompt):
    '''
    Return the output from prompt engineered causal language model
    '''
    top_chunks = get_top_chunks(query, chunk_embeddings, embedding_model)
    prompt = set_prompt(top_chunks, query, rag_prompt)
    raw_answer = get_answer_w_few_shot(prompt)
    return get_pure_answers(raw_answer)

In [88]:
causal_answers = [get_causal_answer(each) for each in question_list]

In [92]:
# Check causal answers format
print(causal_answers[4])

Data management policy is one key activity of data governance as stated in the text. It is the development of short statements of management intent and fundamental rules for governing the creation, acquisition, privacy, integrity, security, quality, and use of data and information.


#### 5.1.3. Get answers from fine-tuned seq2seq language model<a id="5.1.3"></a>

In [94]:
def get_seq2seq_answer(query, chunk_embeddings=chunk_embeddings, embedding_model=embedding_model, tokenizer=tokenizer, peft_model=peft_model, device=device):
    '''
    Return the output from the fine-tuned seq2seq language model
    '''
    top_chunks = get_top_chunks(query, chunk_embeddings, embedding_model)
    context = '- ' + '\n- '.join(top_chunks)
    prompt = prepend_prompt(context, query)
    # Get the input for seq2seq model
    inputs = tokenizer(prompt, padding=True, truncation=True, return_tensors='pt').to(device)
    # Get the answer from seq2seq model (inputs passed as keyword arguments)
    peft_output = peft_model.generate(input_ids=inputs['input_ids'],
                                      attention_mask=inputs['attention_mask'],
                                      max_new_tokens=512)[0]
    answer = tokenizer.decode(peft_output, skip_special_tokens=True)
    return answer

In [95]:
seq2seq_answers = [get_seq2seq_answer(each) for each in question_list]

In [109]:
# Check seq2seq answers format
print(seq2seq_answers[3])

Data Governance Boards, Data Councils, or Data Strategy Teams


### 5.2. Import the ROUGE and BLEU metrics<a id="5.2"></a>

In [119]:
import evaluate
rouge = evaluate.load('rouge')
bleu = evaluate.load('bleu')

### 5.3. Evaluate model performance<a id="5.3"></a>

In [120]:
# ROUGE metric
seq2seq_model_results = rouge.compute(
    predictions=seq2seq_answers,
    references=human_answers
)

causal_model_results = rouge.compute(
    predictions=causal_answers,
    references=human_answers
)

print('seq2seq model:')
print(seq2seq_model_results)
print('causal model:')
print(causal_model_results)

seq2seq model:
{'rouge1': 0.15428224385214537, 'rouge2': 0.02203652537232967, 'rougeL': 0.12614576381579035, 'rougeLsum': 0.12585373876963152}
causal model:
{'rouge1': 0.3617099989759032, 'rouge2': 0.2523914257083444, 'rougeL': 0.3010443648148077, 'rougeLsum': 0.3145119502546009}
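
For intuition about the `rouge1` numbers above, the following is a simplified sketch of a unigram-overlap F1 score. It assumes lowercase whitespace tokenization, whereas the `rouge` metric loaded through `evaluate` applies its own tokenization and aggregation, so the values will generally not match exactly. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    '''
    Simplified ROUGE-1 F1: clipped unigram overlap between two strings
    '''
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy strings: every predicted word is in the reference, but half the reference is missing
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

A short answer that only reuses words from the reference gets perfect precision but low recall, which is one way a terse model can end up with a low ROUGE score.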


In [123]:
# BLEU metric
seq2seq_model_results_1 = bleu.compute(
    predictions=seq2seq_answers,
    references=human_answers
)

causal_model_results_1 = bleu.compute(
    predictions=causal_answers,
    references=human_answers
)

print('seq2seq model:')
print(seq2seq_model_results_1)
print('causal model:')
print(causal_model_results_1)

seq2seq model:
{'bleu': 0.00956830823474908, 'precisions': [0.3312883435582822, 0.09868421052631579, 0.07092198581560284, 0.06153846153846154], 'brevity_penalty': 0.08754670802118468, 'length_ratio': 0.2910714285714286, 'translation_length': 163, 'reference_length': 560}
causal model:
{'bleu': 0.1018484734792892, 'precisions': [0.17144373673036092, 0.0998398291510945, 0.08431793770139635, 0.07455429497568881], 'brevity_penalty': 1.0, 'length_ratio': 3.3642857142857143, 'translation_length': 1884, 'reference_length': 560}
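
The component values in the BLEU output can be recombined by hand, which makes the gap between the two models easier to read. The sketch below applies the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions) to the numbers printed above. `bleu_from_components` is a hypothetical helper, not part of `evaluate`.

```python
import math

def bleu_from_components(precisions, translation_length, reference_length):
    '''
    Recombine n-gram precisions and lengths into (brevity_penalty, bleu)
    '''
    # Brevity penalty: 1 when the output is longer than the reference
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    # Geometric mean of the n-gram precisions (assumes all are non-zero)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty, brevity_penalty * geo_mean

# Values copied from the printed BLEU results above
seq2seq_bp, seq2seq_bleu = bleu_from_components(
    [0.3312883435582822, 0.09868421052631579, 0.07092198581560284, 0.06153846153846154],
    translation_length=163, reference_length=560)
causal_bp, causal_bleu = bleu_from_components(
    [0.17144373673036092, 0.0998398291510945, 0.08431793770139635, 0.07455429497568881],
    translation_length=1884, reference_length=560)
print(seq2seq_bp, seq2seq_bleu)
print(causal_bp, causal_bleu)
```

The seq2seq answers have the higher unigram precision (about 0.33 versus 0.17), but at 163 tokens against a 560-token reference the brevity penalty of about 0.088 pulls the score down to about 0.0096. The causal answers are more than three times longer than the references, so no penalty applies.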
