# Mini Question Answering System with Llama2

This notebook uses Llama 2.0 model to develop a Question Answering system based on a wikepedia article on Space Exploration.

In [1]:
#Import Libraries
import re
import string
import nltk
import torch
import transformers
import warnings 

from time import time
from urllib.request import urlopen
from bs4 import BeautifulSoup
from math import ceil
from wordcloud import WordCloud, STOPWORDS
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords, wordnet as wn
from nltk import ngrams, FreqDist, word_tokenize, pos_tag

from qdrant_client import QdrantClient
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM,T5Tokenizer, T5ForConditionalGeneration
from torch import cuda, bfloat16
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain_community.vectorstores.qdrant import Qdrant
from langchain_community.embeddings.huggingface import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores.qdrant import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

In [2]:
warnings.filterwarnings('ignore')

In [3]:
#Download nltk packages for data cleaning
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rmunda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rmunda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\rmunda\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
#Class for web scraping
class WebScrap:
    def __init__(self, url):
        self.url = url
    
    def get_web_data(self):
        """
        This function extracts text from the provided web page url
        """
        source = urlopen(self.url).read()
        soup = BeautifulSoup(source, "html.parser")
        paras = []
        heads = []
        # Extract the plain text content from paragraphs
        for paragraph in soup.find_all('p'):
            paras.append(str(paragraph.text))
        # Extract text from paragraph headers
        for head in soup.find_all('span', attrs={'mw-headline'}):
            heads.append(str(head.text))
        # Interleave paragraphs & headers
        text = [val for pair in zip(paras, heads) for val in pair]
        text = ' '.join(text)
        # Drop footnote superscripts in brackets
        text = re.sub(r"\[.*?\]+", '', text)
        # Replace '\n' (a new line) with '' and end the string at $1000.
        text = text.replace('\n', '')[:-15]
        return text

In [5]:
# Specify url of the wikipedia page
url = 'https://en.wikipedia.org/wiki/Space_exploration'
web_scrapper = WebScrap(url)
text = web_scrapper.get_web_data()

In [6]:
text

' History of exploration Space exploration is the use of astronomy and space technology to explore outer space. While the exploration of space is currently carried out mainly by astronomers with telescopes, its physical exploration is conducted both by uncrewed robotic space probes and human spaceflight. Space exploration, like its classical form astronomy, is one of the main sources for space science. First telescopes While the observation of objects in space, known as astronomy, predates reliable recorded history, it was the development of large and relatively efficient rockets during the mid-twentieth century that allowed physical space exploration to become a reality. Common rationales for exploring space include advancing scientific research, national prestige, uniting different nations, ensuring the future survival of humanity, and developing military and strategic advantages against other countries. First outer space flights The early era of space exploration was driven by a "Sp

In [7]:
#Class for data cleaning
class DataCleaning:
    def __init__(self, text, min_length, summarizer_model, grammar_model, 
                 length_penalty, num_beams, truncation, early_stopping, skip_special_tokens):
        self.text = text
        self.min_length = min_length
        self.summarizer_model = summarizer_model
        self.grammar_model = grammar_model
        self.length_penalty = length_penalty
        self.num_beams = num_beams
        self.truncation = truncation
        self.early_stopping = early_stopping
        self.skip_special_tokens = skip_special_tokens
    
    def summarize_text(self, content, max_length, tokenizer, model):
        """
        This function summarizes long content into concised form using a provided model.
        """
        #Preprocess Content
        inputs = tokenizer.encode("summarize: " + content, return_tensors="pt", max_length=max_length, 
                                  truncation=self.truncation)
        #Generate Summary
        summary_ids = model.generate(inputs, max_length=max_length, min_length=self.min_length, 
                                     length_penalty=self.length_penalty, num_beams=self.num_beams, 
                                     early_stopping=self.early_stopping)
        #Decode Summary
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=self.skip_special_tokens)
        return summary
    
    def grammar_corrector(self, content, max_length, tokenizer, model):
        """
        This function corrects the grammar for a provided content.
        """
        #Preprocess Content
        input_ids = tokenizer("grammar: " + content, return_tensors='pt').input_ids
        #Generate Corrected Text
        outputs = model.generate(input_ids,max_length=max_length, min_length=self.min_length)
        #Decode Corrected Text
        corrected_content = tokenizer.decode(outputs[0], skip_special_tokens=self.skip_special_tokens)
        return corrected_content.lower().strip()
    
    def normalize_text(self):
        """
        This function normalizes a given text and removes whitespace characters
        """
        #Normalize Text
        normalized = " ".join([i for i in self.text.lower().split()])
        #Split lines based on delimiter
        lines = normalized.split(".")
        return lines
    
    def remove_special_characters(self, content):
        """
        This function removes special characters from a given text.
        """
        return re.sub('[%s]' % re.escape(string.punctuation.replace('.','') +r'“'+'”'+'’'+"'"), '', content)
    
    def clean_data(self):
        """
        This process preprocesses the text. It applies following transformations
        1. Normalization
        2. Summarization
        3. Grammar Correction
        4. Special Character Removal & Whitespaces
        """
        output = []
        #Normalization
        lines = self.normalize_text()
        #Initialize Summary Tokenizer
        summary_tokenizer = T5Tokenizer.from_pretrained(self.summarizer_model)
        #Initialize Summarization Model
        summary_model = T5ForConditionalGeneration.from_pretrained(self.summarizer_model)
        #Initialize Grammar Tokenizer
        grammar_tokenizer = AutoTokenizer.from_pretrained(self.grammar_model)
        #Initialize Grammar Model
        grammar_model = AutoModelForSeq2SeqLM.from_pretrained(self.grammar_model)
        #Loop for each line in the text
        for item in lines:
            max_length = ceil(len(item)/4)
            if max_length < self.min_length:
                max_length = self.min_length
            #Summarization
            summary = self.summarize_text(item, max_length, summary_tokenizer, summary_model)
            #Grammar Correction
            corrected_summary = self.grammar_corrector(item, max_length, grammar_tokenizer, grammar_model)
            output.append(corrected_summary)
        cleaned_text = " ".join([i for i in output])
        #Special Character Removal
        result = self.remove_special_characters(cleaned_text)
        return result

In [8]:
data_cleaner = DataCleaning(text=text, min_length=10, summarizer_model= "t5-base", 
                            grammar_model= "addy88/t5-grammar-correction", length_penalty=2.0,
                            num_beams=5, truncation=True, early_stopping=True,skip_special_tokens=True)
clean_text =  data_cleaner.clean_data()

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
clean_text

'exploration space exploration is the use of astronomy and space technology to explore outer space. while the exploration of space is currently carried out mainly by astronomers with telescopes its physical exploration is conducted both by uncrewed robotic space probes and human spaceflight. space exploration like its classical form astronomy is one of the main sources for space science. first telescopes while the observation of objects in space known as astronomy predates reliable recorded history it was the development of large and relatively efficient rockets during the midtwentieth century that allowed physical space exploration to become a reality. common rationales for exploring space include advancing scientific research national prestige uniting different nations ensuring the future survival of humanity and developing military and strategic advantages against other countries. the early era of space exploration was driven by a space race between the soviet union and the united s

In [10]:
#Class for creating & storing embeddings
class CreateStoreEmbeddings:
    def __init__(self, text, chunk_size, chunk_overlap, is_separator_regex, 
                 separators, length_function, embedding_model, model_kwargs,
                 encode_kwargs, vector_db_url,prefer_grpc,collection_name):
        self.text = text
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.is_separator_regex = is_separator_regex
        self.separators = separators
        self.length_function = length_function
        self.embedding_model = embedding_model
        self.model_kwargs = model_kwargs
        self.encode_kwargs = encode_kwargs
        self.vector_db_url = vector_db_url
        self.prefer_grpc = prefer_grpc
        self.collection_name = collection_name
    
    def create_chunks(self):
        """
        This function splits the text into chunks based on specified chunk size
        """
        # Initialize the text splitter with provided parameters
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap,
        length_function=self.length_function, is_separator_regex=self.is_separator_regex, separators = self.separators)
        #Create chunks
        text_docs = text_splitter.create_documents([self.text])
        chunks = text_splitter.split_documents(text_docs)
        return chunks
    
    def create_embeddings(self):
        """
        This function creates embedding model
        """
        # Load the embedding model 
        embeddings = HuggingFaceBgeEmbeddings(
            model_name=self.embedding_model,
            model_kwargs=self.model_kwargs,
            encode_kwargs=self.encode_kwargs
        )
        return embeddings
    
    def process_embeddings(self):
        """
        This function creates the vector embeddings and stores them in vector database
        """
        #Get document chunks
        chunks = self.create_chunks()
        #Get embedding model
        embeddings = self.create_embeddings()
        qdrant = Qdrant.from_documents(chunks, embeddings, url=self.vector_db_url,
        prefer_grpc=self.prefer_grpc, collection_name=self.collection_name)


In [11]:
#Get the device on system
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

In [12]:
embedder = CreateStoreEmbeddings(text=clean_text, chunk_size=300, chunk_overlap=50, 
                      is_separator_regex=True, separators=["."], length_function=len, 
                      embedding_model="BAAI/bge-large-en", model_kwargs={'device': device},
                      encode_kwargs={'normalize_embeddings': False}, vector_db_url="http://localhost:6333",
                      prefer_grpc=False, collection_name="qa_space_explr_db")
embedder.process_embeddings()

In [13]:
#Class for Question Answering System
class QuestionAnswer():
    def __init__(self, model_id, embedding_model, model_kwargs, encode_kwargs,
                 vector_db_url, prefer_grpc,collection_name, max_doc, 
                 temperature, prompt_template, query):
        self.model_id = model_id
        self.embedding_model = embedding_model
        self.model_kwargs = model_kwargs
        self.encode_kwargs = encode_kwargs
        self.vector_db_url = vector_db_url
        self.prefer_grpc = prefer_grpc
        self.collection_name = collection_name
        self.max_doc = max_doc
        self.temperature = temperature
        self.prompt_template = prompt_template
        self.query = query

    
    def load_embedding_model(self):
        """
        This functions loads the vector embedding model
        """
        embeddings = HuggingFaceBgeEmbeddings(
            model_name=self.embedding_model,
            model_kwargs=self.model_kwargs,
            encode_kwargs=self.encode_kwargs
        )
        return embeddings

    
    def get_vector_db_retriever(self):
        """
        This function generates the vector store retriever
        """
        embeddings = self.load_embedding_model()
        #Initialize Qdrant client
        client = QdrantClient(
            url=self.vector_db_url, prefer_grpc=self.prefer_grpc
        )
        db = Qdrant(client=client, embeddings=embeddings, collection_name=self.collection_name)
        #Create retriever instance
        retriever = db.as_retriever(search_kwargs={"k":self.max_doc})
        return retriever
    
    
    def create_llm_pipeline(self):
        """
        This function initializes model, tokenizer and query pipeline
        """
        #Set quantization configuration to load large model with less GPU memory
        bnb_config = transformers.BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=bfloat16)
        #Load model configuration
        model_config = transformers.AutoConfig.from_pretrained(self.model_id)
        #Instantiate the model class
        model = transformers.AutoModelForCausalLM.from_pretrained(self.model_id, trust_remote_code=True,
                                                                  config=model_config, quantization_config=bnb_config,
                                                                  device_map="auto")
        #Instantiate model tokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        #Instantiate query pipeline
        query_pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer,
        torch_dtype=torch.float16, device_map="auto")
        llm = HuggingFacePipeline(pipeline=query_pipeline,  model_kwargs={"temperature": self.temperature})
        return llm

    
    def query_resolver(self):
        """
        This function takes a user query and starts a QA chain to provide the relevant answer
        """
        retriever = self.get_vector_db_retriever()
        llm = self.create_llm_pipeline()
        #Question answer prompt template for model
        prompt = PromptTemplate(template=prompt_template, input_variables=['question','context'])
        #Initialize chain
        qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, 
                                         return_source_documents=False, chain_type_kwargs={"prompt": prompt}, 
                                         verbose=True)
        result = qa.run(self.query)
        return result

In [14]:
prompt_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer. Answer must be detailed and well explained.
Helpful answer:
"""

In [15]:
query = "Who was the first woman to go to space?"

In [16]:
time_1 = time()
resolver = QuestionAnswer(model_id=r"Llama-2-7b", embedding_model= "BAAI/bge-large-en", model_kwargs={'device': device},
                         encode_kwargs={'normalize_embeddings': False}, vector_db_url="http://localhost:6333",
                         prefer_grpc=False, collection_name="space_explr_db", max_doc=5, 
                         temperature=0.1, prompt_template=prompt_template, query=query)
answer = resolver.query_resolver()
time_2 = time()
print(f"Inference time: {round(time_2-time_1, 3)} sec.")
print(f"Query: {query}\n")
print("\nResult: ", answer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
Inference time: 32.559 sec.
Query: Who was the first woman to go to space?


Result:  Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: ure and pressure data was encoded in the duration of radio beeps. the satellite was not punctured by a meteoroid. sputnik 1 was launched by an it burned up upon reentry on 3 the first successful human spaceflight was vostok 1 east 1 carrying the 27yearold russian cosmonaut yuri gagar the spacecraft

after the first 20 years of exploration focus shifted from oneoff flights to renewable hardware. focus shifted from competition to cooperation with the international space station. iss is the first object in orbit with the substantial completion of iss following sts133 in march 2011. plans for space

as landmarks. the soviet space program achieved many of the first