# Imports

In [1]:
# pinecone.init() used below belongs to the pre-3.0 client API, so the client is pinned
!pip install -qU datasets "pinecone-client<3" sentence-transformers torch
import pandas as pd                                         # DataFrame handling
from datasets import load_dataset                           # Hugging Face datasets
from tqdm.auto import tqdm                                  # progress bar
import pinecone                                             # vector database client
import torch                                                # device check (GPU/CPU)
from sentence_transformers import SentenceTransformer       # retriever (LLM1)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM # generator (LLM2)
from pprint import pprint                                   # better formatted output

# Loading Dataset

The context is loaded from Hugging Face. The dataset is named DUG_dataset and is shared under the profile BodoZnipes/Dual-Use-Goods. The relevant fields are extracted and stored in the list "docs", which is then converted into a pandas DataFrame.

In [2]:
# load the dataset from huggingface
dual_use_context = load_dataset('BodoZnipes/Dual-Use-Goods', split='train')


In [26]:
#checking context
dual_use_context[355]

{'passage_text': 'Capable of producing an ultimate vacuum better than 13 mPa.',
 'list_number': '2B231',
 'description': 'Vacuum pumps having all of the following characteristics are classified as dual-use-goods:',
 'paragraph': 'c',
 'id': 339}

In [4]:
docs = []
# iterate through the dataset and apply our filter
for d in tqdm(dual_use_context, desc="Processing"):
    # extract the fields we need
    doc = {
        "paragraph": d["paragraph"],
        "list_number": d["list_number"],
        "passage_text": d["passage_text"],
        "description": d["description"]
    }
    # add the dict containing the required fields to the docs list
    docs.append(doc)

In [5]:
# create a pandas dataframe with the documents we extracted
df_dug = pd.DataFrame(docs)
df_dug.head()

|   | paragraph | list_number | passage_text | description |
|---|---|---|---|---|
| 0 | 0 | 0A* | 0 | 0 |
| 1 | a | 0A001 | Nuclear reactors | Nuclear reactors and specially designed or pre... |
| 2 | b | 0A001 | Metal vessels, or major shop-fabricated parts ... | Nuclear reactors and specially designed or pre... |
| 3 | c | 0A001 | Manipulative equipment specially designed or p... | Nuclear reactors and specially designed or pre... |
| 4 | d | 0A001 | Control rods specially designed or prepared fo... | Nuclear reactors and specially designed or pre... |


# **Initialize Pinecone Index**

Before the Pinecone vector database can be used, a connection must be established and an index must be created. The index is named "question-answering-ba-thesis". It is created with the parameters metric (cosine similarity) and dimension (768, the size of the retriever's embeddings).

In [6]:
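import os

# connect to pinecone environment
# the API key is read from an environment variable instead of being hard-coded;
# the variable name PINECONE_API_KEY is an assumption, set it before running
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="gcp-starter"
)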


In [7]:
index_name = "question-answering-ba-thesis"

# check whether the index already exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=768,  # if the dimension is unknown, use retriever.get_sentence_embedding_dimension() once the retriever is loaded
        metric="cosine"
    )

# connect to the index
index = pinecone.Index(index_name)

# **Initialize Retriever**



Next, the retriever is initialized. The retriever does two main things:

*   Generate embeddings for all context passages (context vectors/embeddings)
*   Generate embeddings for the questions (query vector/embedding)

The retriever creates embeddings such that questions and the passages that hold their answers lie close to one another in the vector space. The retriever used here is a SentenceTransformer model based on Microsoft's MPNet, which performs well at comparing the similarity between queries and documents.

In [9]:
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface LLM1
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
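
To make "close to one another in the vector space" concrete, the following is a minimal sketch of ranking passages by cosine similarity to a query. The three-dimensional vectors and the passage names are made up for illustration; real embeddings from the retriever have 768 dimensions, and the helper names `cosine_similarity` and `top_k` are assumptions, not part of any library used here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, passages, k=1):
    """Return the k (passage_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in passages.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(pid, score) for score, pid in scored[:k]]

# toy embeddings (hypothetical values, not produced by the retriever)
toy_passages = {
    "vacuum_pump": [0.9, 0.1, 0.0],
    "control_rod": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

best = top_k(toy_query, toy_passages, k=1)
print(best[0][0])
```

Since the model printed above ends in a Normalize() layer, its embeddings have unit length, in which case the cosine similarity reduces to the plain dot product.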

# **Generate Embeddings and Upsert**
The next step is to generate embeddings for the context passages. Each entry passed to Pinecone consists of an id, the context embedding, and metadata for the document that represents a context passage in the dataset. The metadata is a dictionary containing data relevant to the embedding, such as the list_number, paragraph, passage_text and description.

In [10]:
#check first 5 entries
df_dug.iloc[:5].to_dict(orient="records")

[{'paragraph': '0',
  'list_number': '0A*',
  'passage_text': '0',
  'description': '0'},
 {'paragraph': 'a',
  'list_number': '0A001',
  'passage_text': 'Nuclear reactors',
  'description': 'Nuclear reactors and specially designed or prepared equipment and components therefor, as follows are classified as dual-use-goods:'},
 {'paragraph': 'b',
  'list_number': '0A001',
  'passage_text': 'Metal vessels, or major shop-fabricated parts therefor, including the reactor vessel head for a reactor \r\npressure vessel, specially designed or prepared to contain the core of a "nuclear reactor',
  'description': 'Nuclear reactors and specially designed or prepared equipment and components therefor, as follows are classified as dual-use-goods:'},
 {'paragraph': 'c',
  'list_number': '0A001',
  'passage_text': 'Manipulative equipment specially designed or prepared for inserting or removing fuel in a "nuclear \r\nreactor',
  'description': 'Nuclear reactors and specially designed or prepared equipme

In [11]:
batch_size = len(df_dug)

# generate embeddings for the entire DataFrame
emb = retriever.encode(df_dug["passage_text"].tolist()).tolist()
# get metadata
meta = df_dug.to_dict(orient="records")
# create unique IDs
ids = [f"{idx}" for idx in range(batch_size)]
# add all to upsert list
to_upsert = list(zip(ids, emb, meta))
# upsert/insert these records to pinecone
_ = index.upsert(vectors=to_upsert)
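
The cell above sends all records in a single request. For larger datasets it is common to upsert in smaller batches, so that no single request grows too large. The following is a minimal sketch of such batching; the helper name `batched`, the batch size of 100 and the dummy records are assumptions for illustration only.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# dummy (id, embedding, metadata) tuples standing in for `to_upsert`
records = [(str(i), [0.0, 0.0, 0.0], {"passage_text": f"passage {i}"}) for i in range(361)]

batch_sizes = [len(batch) for batch in batched(records, 100)]
print(batch_sizes)
```

In the notebook, the same helper could be used as `for batch in batched(to_upsert, 100): index.upsert(vectors=batch)`.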

# Initialize Generator

Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.
The generator (LLM2) is loaded from Hugging Face under the name 'google/flan-t5-large'. Two helper functions are implemented to run the search in Pinecone and to format the results. In the last part, the answer is generated; parameters such as the minimum and maximum answer length control the output.

In [12]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
generator = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

In [13]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    encoded_query = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    expected_passages = index.query(encoded_query, top_k=top_k, include_metadata=True)
    return expected_passages

def format_query(query, context):
    # extract passage_text and description from the Pinecone search results
    pinecone_results = [f"{m['metadata']['passage_text']},{m['metadata']['description']}" for m in context]
    # concatenate all context passages
    added_context = " ".join(pinecone_results)
    # concatenate the query and the context passages
    output = f"question: {query} context: {added_context}"
    return output

In [14]:
def query_pinecone(query, top_k):
    encoded_query = retriever.encode([query]).tolist()
    expected_passages = index.query(encoded_query, top_k=top_k, include_metadata=True)
    return expected_passages

def format_query(query, context):
    # label the description and the passage text of each Pinecone match
    # (this definition overrides the one in the previous cell)
    pinecone_results = [f"  descritpion: {m['metadata']['description']}, passage:{m['metadata']['passage_text']}" for m in context]
    added_context = " ".join(pinecone_results)
    output = f"question: {query} {added_context}"
    return output

In [15]:
# enter the query
query = "I have a magnetic coated tarp. May I use this?"

result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '71',
              'metadata': {'description': 'Manufactures of non-"fusible" '
                                          'aromatic polyimides in film, sheet, '
                                          'tape or ribbon form having any of '
                                          'the following are classified as '
                                          'dual-use-goods:',
                           'list_number': '1A003',
                           'paragraph': 'b',
                           'passage_text': 'Coated or laminated with carbon, '
                                           'graphite, metals or magnetic '
                                           'substances Note: 1A003 does not '
                                           'control manufactures when coated '
                                           'or laminated with copper and '
                                           'designed for the production \r\n'
                                         

In [16]:
# format the query in the form the generator expects as input
query = format_query(query, result["matches"])
pprint(query)


('question: I have a magnetic coated tarp. May I use this?   descritpion: '
 'Manufactures of non-"fusible" aromatic polyimides in film, sheet, tape or '
 'ribbon form having any of the following are classified as dual-use-goods:, '
 'passage:Coated or laminated with carbon, graphite, metals or magnetic '
 'substances Note: 1A003 does not control manufactures when coated or '
 'laminated with copper and designed for the production \r\n'
 'of electronic printed circuit boards.\r\n'
 'N.B. For "fusible" aromatic polyimides in any form, see 1C008.a.3.')


In [17]:
def generate_answer(query):
    # tokenize the query to get input_ids; inputs longer than max_length are truncated
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=70)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # return the answer so it can be reused; printing is left to the caller
    return answer


pprint(generate_answer(query))


('Yes, it is a dual-use-good. It is not a tarp. It is not a sheet. It is not a '
 'tape. It is not a ribbon. It is not a film. It is not a sheet. It is not a '
 'tape. It is not a ribbon')
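
The generated answer repeats the same phrases several times. One common way to reduce this is to pass repetition controls to `generate`. The following is a sketch of such a parameter set, not something that was run in this notebook; `no_repeat_ngram_size` and `repetition_penalty` are existing generation parameters in transformers, but the values chosen here are assumptions that would need tuning.

```python
# generation parameters: the first three match the call in generate_answer,
# the last two are assumed values meant to discourage repeated phrases
gen_kwargs = {
    "num_beams": 2,
    "min_length": 20,
    "max_length": 70,
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-token sequence
    "repetition_penalty": 1.2,   # down-weight tokens that were already generated
}
```

Inside generate_answer, the dictionary could be applied as `generator.generate(inputs["input_ids"], **gen_kwargs)`.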
