<a href="https://colab.research.google.com/github/R-ohit-B-isht/openfabrics-test/blob/main/Science_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Science Question Answering



We will need three components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers
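
To make the retrieval step concrete before any real model is involved, here is a toy sketch of semantic search over hand-written vectors. The passages, the 3-dimensional vectors and the `cosine` helper are made up for illustration; in the notebook the vectors come from the retriever model and the search is done by the vector index.

```python
import math

def cosine(a, b):
    # cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical "index": passage text -> toy embedding
toy_index = {
    "Telescopes collect light from distant objects.": [0.9, 0.1, 0.0],
    "Soil quality affects crop yields.": [0.0, 0.2, 0.9],
}

# hypothetical embedding of a question about telescopes
toy_query_vector = [0.8, 0.3, 0.1]

# semantic search: pick the passage whose vector is most similar to the query
best_passage = max(toy_index, key=lambda p: cosine(toy_query_vector, toy_index[p]))
print(best_passage)
```

The generator would then receive the question together with the best-matching passages as context.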

# Install Dependencies

In [None]:
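# pinecone-client is pinned below 3.0 because this notebook uses pinecone.init(),
# which later versions of the client removed
!pip install -qU datasets "pinecone-client<3" sentence-transformers torch transformers nltk textblob pandas tqdm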

In [35]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    corrected_text = str(blob.correct())
    return corrected_text

def convert_to_lowercase(text):
    return text.lower()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
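
As a rough illustration of what stop-word removal does to a question, here is a sketch that applies the same token filter with a tiny hand-picked stop-word set. The set `toy_stop_words` and the function name `remove_stopwords_demo` are hypothetical, chosen so the example runs without downloading the NLTK corpus.

```python
# a tiny, hypothetical stop-word set (the real one comes from nltk.corpus.stopwords)
toy_stop_words = {"what", "is", "the", "a", "of"}

def remove_stopwords_demo(text, stop_words):
    # keep only the tokens whose lowercase form is not a stop word
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords_demo("What is the War of Currents", toy_stop_words))
# War Currents
```

Filtering this aggressively can strip words that carry the intent of a question, which may be why only spelling correction and lowercasing are applied to queries later in this notebook.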


# Load and Prepare Dataset

In [6]:
from datasets import load_dataset

# load the dataset from Hugging Face and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True
).shuffle(seed=960)

In [15]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [18]:
# keep only documents whose section_title starts with 'Science'
Science = wiki_data.filter(
    lambda d: d['section_title'].startswith('Science')
)

Next, we iterate through the filtered dataset and collect 50,000 Science passages. We will extract `article_title` and `passage_text` from each document.

In [None]:
from tqdm.auto import tqdm

total_doc_count = 50000

docs = []
# iterate through the filtered dataset
for d in tqdm(Science, total=total_doc_count):
    # extract the fields we need
    doc = {
        "article_title": d["article_title"],
        "passage_text": d["passage_text"]
    }
    # add the dict containing fields we need to docs list
    docs.append(doc)

    # stop once we have collected 50k documents
    # (checking len(docs) after appending avoids collecting one extra)
    if len(docs) >= total_doc_count:
        break
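
An alternative way to cap a filtered stream at a fixed number of items is `itertools.islice`, which stops pulling from the iterator once the limit is reached. The sketch below uses a small hypothetical list of records (`toy_records`) in place of the streamed dataset, so the names and values are illustrative only.

```python
from itertools import islice

# hypothetical records standing in for the streamed dataset
toy_records = [
    {"article_title": "A", "section_title": "Science goals", "passage_text": "text a"},
    {"article_title": "B", "section_title": "History", "passage_text": "text b"},
    {"article_title": "C", "section_title": "Science", "passage_text": "text c"},
    {"article_title": "D", "section_title": "Science and technology", "passage_text": "text d"},
]

# lazily keep only records whose section title starts with 'Science'
science_only = (r for r in toy_records if r["section_title"].startswith("Science"))

# take at most `limit` records and keep only the fields we need
limit = 2
toy_docs = [
    {"article_title": r["article_title"], "passage_text": r["passage_text"]}
    for r in islice(science_only, limit)
]
print([d["article_title"] for d in toy_docs])
# ['A', 'C']
```

Because both the filter and `islice` are lazy, no records beyond the limit are read from the source.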

In [None]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df

# Initialize Pinecone Index

In [4]:
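import os
import pinecone

# connect to pinecone environment
# the API key is read from an environment variable (the name PINECONE_API_KEY
# is our choice) so that it is not stored in the notebook
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="asia-southeast1-gcp"  # find next to API key in console
)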



In [5]:
index_name = "science-question-answering"

# check if the science-question-answering index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=768,
        metric="cosine"
    )

# connect to the science-question-answering index we created
index = pinecone.Index(index_name)

In [12]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 700}},
 'total_vector_count': 700}

# Initialize Retriever

In [6]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
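
The final `Normalize()` module in the output above means every embedding is scaled to unit length. For unit-length vectors, cosine similarity (the metric chosen for the index) is the same as a plain dot product. A small sketch with made-up 2-dimensional vectors shows the equivalence; the helper names `l2_normalize` and `dot` are ours.

```python
import math

def l2_normalize(vec):
    # scale a vector so that its Euclidean length is 1
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# made-up vectors for illustration
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# cosine similarity computed directly from the unnormalized vectors
cosine_ab = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# both values agree: the dot product of unit vectors is the cosine similarity
print(round(dot(a_unit, b_unit), 6), round(cosine_ab, 6))
```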

# Generate Embeddings and Upsert

In [7]:
device

'cuda'

In [None]:
from tqdm.auto import tqdm  
# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()
    # get metadata
    meta = batch.to_dict(orient="records")
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()
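
The batching pattern used above can be sketched on its own, without the model or the index. The helper `iter_batches` and the toy passages are hypothetical; the point is that the last batch may be shorter than `batch_size`, and that the IDs are derived from row positions.

```python
def iter_batches(items, batch_size):
    # yield (start, end, slice) for consecutive batches; the last one may be shorter
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]

toy_passages = [f"passage {n}" for n in range(10)]

for start, end, batch in iter_batches(toy_passages, batch_size=4):
    # IDs follow the row positions, as in the upsert loop
    ids = [str(i) for i in range(start, end)]
    print(ids, len(batch))
```

Since the IDs depend only on position, re-running the upsert over the same dataframe overwrites the existing vectors instead of adding duplicates.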

# Initialize Generator

In [10]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to('cpu')

In [13]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(xq, top_k=top_k, include_metadata=True)
    return xc

In [14]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the query and context passages
    query = f"question: {query} context: {context}"
    return query
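
To see the prompt layout without calling Pinecone, `format_query` can be run on hand-made matches. The function is repeated here so the snippet is self-contained; `toy_matches` is hypothetical data that only mimics the `metadata.passage_text` shape of a search result.

```python
def format_query(query, context):
    # prefix each passage with the <P> tag, then join passages and question
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    context = " ".join(context)
    return f"question: {query} context: {context}"

# hypothetical matches shaped like Pinecone results
toy_matches = [
    {"metadata": {"passage_text": "First passage."}},
    {"metadata": {"passage_text": "Second passage."}},
]

prompt = format_query("what is a toy example?", toy_matches)
print(prompt)
# question: what is a toy example? context: <P> First passage. <P> Second passage.
```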

In [None]:
query = "What is Atacama Cosmology Telescope?"
result = query_pinecone(query, top_k=3)
result

In [18]:
from pprint import pprint

In [None]:
# format the query in the form the generator expects
query = format_query(query, result["matches"])
pprint(query)

In [34]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating to the 1024-token limit
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt")
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer (pprint returns None, so this function prints rather than returns)
    pprint(answer)

In [48]:
generate_answer(query)

('The Atacama Cosmology Telescope was built in the 1960s. It is the largest '
 'telescope in the world, and is the only one in the Southern Hemisphere. It '
 'is the largest telescope')


In [45]:
query = "when was Atacama Cosmology Telescope  ?"
query = correct_spelling(query)
query = convert_to_lowercase(query)
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The Atacama Cosmology Telescope was built in the 1960s. It is the largest '
 'telescope in the world, and is the only one in the Southern Hemisphere. It '
 'is the largest telescope')


In [46]:
# context["matches"]
query

'question: when was atacama cosmology telescope  ? context: <P> Atacama Pathfinder Experiment Science Submillimetre astronomy is a relatively unexplored frontier in astronomy and reveals a Universe that cannot be seen in the more familiar visible or infrared light. It is ideal for studying the "cold Universe": light at these wavelengths shines from vast cold clouds in interstellar space, at temperatures only a few tens of degrees above absolute zero. Astronomers use this light to study the chemical and physical conditions in these molecular clouds — the dense regions of gas and cosmic dust where new stars are being born. Seen in visible light, these regions of the Universe are <P> often dark and obscured due to the dust, but they shine brightly in the millimetre and submillimetre part of the spectrum. This wavelength range is also ideal for studying some of the earliest and most distant galaxies in the Universe, whose light has been redshifted into these longer wavelengths.\nAPEX scien

In [43]:
query = "where did it made?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is the right subreddit to ask this question, but I "
 "think it's interesting that you're asking about the Chinese. I'm not sure if "
 'this is the right')
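
Each question in this notebook goes through the same three steps: retrieve, format, generate. One possible way to bundle them is a small wrapper that takes the three steps as arguments. The name `answer_question` and the `fake_*` stand-ins below are hypothetical; in the notebook one would pass `query_pinecone`, `format_query` and `generate_answer` instead.

```python
def answer_question(question, retrieve, build_prompt, generate, top_k=3):
    """Run retrieve -> build prompt -> generate for one question."""
    result = retrieve(question, top_k)
    prompt = build_prompt(question, result["matches"])
    return generate(prompt)

# stand-ins so the sketch runs without Pinecone or BART
def fake_retrieve(question, top_k):
    passages = ["Passage one.", "Passage two.", "Passage three."]
    return {"matches": [{"metadata": {"passage_text": p}} for p in passages[:top_k]]}

def fake_build_prompt(question, matches):
    context = " ".join(f"<P> {m['metadata']['passage_text']}" for m in matches)
    return f"question: {question} context: {context}"

def fake_generate(prompt):
    # report how many passages reached the "generator"
    return f"(answer based on {prompt.count('<P>')} passages)"

print(answer_question("toy question?", fake_retrieve, fake_build_prompt, fake_generate, top_k=2))
# (answer based on 2 passages)
```

Passing the steps in as arguments keeps the wrapper easy to try with different `top_k` values or a different retriever.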


Let's try a final few questions.

In [50]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The war of currents was a naval battle between the Royal Navy and the Royal '
 'Navy of the United States. The Royal Navy was the largest naval force in the '
 'world, and the Royal Navy was')


In [51]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the moon was Neil Armstrong, who walked on the moon '
 'in 1969. He was the first man to walk on the moon, and he was the first man '
 'to')


In [52]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $100 billion to build.')


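As we can see, the model can generate some decent answers, such as the one about the first person on the moon. The answers are not always reliable, though: the war of the currents was not a naval battle, and the vague query "where did it made?" produced an unrelated reply. Several answers also stop mid-sentence because `max_length` is set to 40 tokens.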