<a href="https://colab.research.google.com/github/Shahriar10k/RAG-with-NSU/blob/main/RAG_with_GPT_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Use playground mode**

**File>Open in playground mode**

To convert text to CSV, [click here](https://cutt.ly/texttocsv).


# Install the dependencies

Install the packages. In Colab, restart the runtime after the first install.


In [None]:
# Install the latest release of Haystack in your own environment
! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]

!pip install openai

# Import the packages

In [None]:
# Standard library
import json
import logging
from collections import defaultdict
from pprint import pprint, PrettyPrinter
from typing import Any, Dict, List, Optional

# Third-party packages
import openai
import pandas as pd
import requests

# Haystack
from haystack.document_stores import FAISSDocumentStore
from haystack.document_stores.sql import DocumentORM
from haystack.nodes import RAGenerator, DensePassageRetriever
from haystack.pipelines import GenerativeQAPipeline, DocumentSearchPipeline
from haystack.schema import Document, Answer, SpeechAnswer
from haystack.utils import fetch_archive_from_http, print_answers

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
logger = logging.getLogger(__name__)

# Dataset

In [None]:
# Directory of the csv file in colab runtime folder
doc_dir = "/content/curated_dataset_100.csv"

# Set sep to "\t" if the file is tab-separated, or "," if it is comma-separated
df = pd.read_csv(doc_dir, sep=",")

# Minimal cleaning
df.fillna(value="", inplace=True)

print(df.head())

                                         title  \
0                 ENV 455 Research Methodology   
1  CEE 467 Irrigation and Drainage Engineering   
2          SAIDUR RAHMAN(administrative staff)   
3                 Dr. Mohammad Rashedur Rahman   
4                 Dr. Mohammad Rashedur Rahman   

                                                                              text  
0  ENV 455 Research Methodology Topics include purpose of scientific research; ...  
1  CEE 467 Irrigation and Drainage Engineering Importance of irrigation; source...  
2  SAIDUR RAHMAN Senior Programme Officer MBA in Human Resource, Southeast Univ...  
3  DR. MOHAMMAD RASHEDUR RAHMAN Professor & Graduate Co-ordinator Ph.D. in Comp...  
4  University of Manitoba, Canada. During his graduate studies in both schools ...  
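
Before casting the rows into documents, it can help to confirm that the file really exposes the `title` and `text` columns that the later cells read. The following is only a minimal sketch using the standard library; `check_columns`, `REQUIRED_COLUMNS`, and the sample string are hypothetical names introduced for illustration and are not part of the notebook.

```python
import csv
import io

# Columns the later cells rely on (taken from the df.head() output above)
REQUIRED_COLUMNS = {"title", "text"}


def check_columns(csv_text: str, sep: str = ",") -> list:
    """Return the required column names that are missing from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=sep)
    header = next(reader, [])  # empty input yields an empty header
    found = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - found)


# Hypothetical two-line sample standing in for the real dataset
sample = "title,text\nENV 455 Research Methodology,Topics include ...\n"

print(check_columns(sample))            # parsed with the matching separator
print(check_columns(sample, sep="\t"))  # parsed with the wrong separator
```

An empty list means both columns were found. A non-empty list usually points to a wrong `sep` value rather than a broken file.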


# Cast data into Haystack Document Objects


In [None]:
titles = list(df["title"].values)
texts = list(df["text"].values)
documents: List[Document] = []
for title, text in zip(titles, texts):
    documents.append(Document(content=text, meta={"name": title or ""}))

# FAISSDocumentStore, DensePassageRetriever and RAGenerator

In [None]:
# Initialize the FAISS document store.
# Set `return_embedding` to `True` so the generator doesn't have to re-embed the documents
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)

# Initialize the DPR retriever, which encodes documents and questions and queries the document store
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)



INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO:haystack.modeling.utils:Using devices: CPU
INFO:haystack.modeling.utils:Number of GPUs: 0


INFO:haystack.modeling.model.language_model:Auto-detected model language: english

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.

INFO:haystack.modeling.model.language_model:Auto-detected model language: english


In [None]:
# Initialize RAG Generator
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    use_gpu=True,
    top_k=1,
    #max_length=200,
    min_length=2,
    embed_title=True,
    num_beams=2,
)

INFO:haystack.modeling.utils:Using devices: CPU
INFO:haystack.modeling.utils:Number of GPUs: 0


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizerFast'.

Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.weight', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RagTokenForGeneration were not initialized from the model checkpoint at facebook/rag-token-nq and are newly initialized: ['rag.generator.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

# Update the document

We first delete any documents remaining in the DocumentStore, then write the new ones with write_documents(). The update_embeddings() method then uses the retriever to compute an embedding for each document.

In [None]:
# Delete existing documents in documents store
document_store.delete_documents()

# Write documents to document store
document_store.write_documents(documents)

# Add the document embeddings to the index
document_store.update_embeddings(retriever=retriever)

INFO:haystack.document_stores.faiss:Updating embeddings for 836 docs...
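
To give a feel for what the index does once the embeddings are stored, here is a toy sketch of exhaustive ("Flat") inner-product search. It is an illustration under stated assumptions, not Haystack's or FAISS's implementation: the three-dimensional vectors, the document ids, and the `flat_search` helper are all made up, whereas the real embeddings come from the DPR encoders.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(
    query_vec: List[float], index: Dict[str, List[float]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top_k best."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]


# Hypothetical document embeddings (real DPR vectors are much larger)
toy_index = {
    "admission": [0.9, 0.1, 0.0],
    "faculty": [0.1, 0.8, 0.2],
    "courses": [0.0, 0.3, 0.9],
}

# Hypothetical query embedding that points mostly along the first axis
results = flat_search([1.0, 0.0, 0.1], toy_index, top_k=2)
print([doc_id for doc_id, _ in results])
```

Because every vector is compared with the query, a flat index is exact but scales linearly with the number of documents, which is acceptable for a dataset of a few hundred rows.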

# Function for answering with the RAG generator

A custom function for printing the top answer and the top retrieved passage.

In [None]:

def print_ans(results: dict):
    """
    Utility function to print the question, the top answer, and the top
    retrieved passage from the results of a Haystack pipeline.
    :param results: Results that the pipeline returned.
    :return: None
    """
    if "answers" not in results:
        raise ValueError(
            "The results object does not seem to come from a Reader: "
            f"it does not contain the 'answers' key, but only: {results.keys()}.  "
            "Try print_documents or print_questions."
        )

    if "query" in results:
        print(f"\nQuestion: {results['query']}\nAnswer:")

    # Keep only the top answer and the top retrieved document
    answers = results["answers"][0]
    doc = results["documents"][0]

    print(answers.answer)
    pprint(doc.content)


Wrap retrieval and answer generation in a single function

In [None]:
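import warnings


def bolo_with_rag(question):
    # Silence library warnings so that only the answer is printed
    warnings.filterwarnings('ignore')
    # The retriever fetches 5 passages and the generator returns 1 answer
    pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)
    res = pipe.run(query=question, params={"Generator": {"top_k": 1}, "Retriever": {"top_k": 5}})
    print_ans(res)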
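
The printing helper indexes `results["answers"][0]` and `results["documents"][0]` directly, so an empty list would raise an `IndexError`. A more defensive variant might look like the sketch below. `top_answer_and_passage` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for Haystack's `Answer` and `Document` objects, assuming only their `.answer` and `.content` attributes.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def top_answer_and_passage(results: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (top answer text, top passage text); None where a list is empty."""
    answers = results.get("answers") or []
    documents = results.get("documents") or []
    answer = answers[0].answer.strip() if answers else None
    passage = documents[0].content if documents else None
    return answer, passage


# Stand-in for a pipeline result that has an answer but no retrieved documents
fake_results = {
    "query": "How long is the admission test at North South University?",
    "answers": [SimpleNamespace(answer=" two and a half hour")],
    "documents": [],
}

answer, passage = top_answer_and_passage(fake_results)
print(answer)
print(passage)
```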

# Function for answer using GPT


Custom function (by Jawad) for printing the retrieved passage

In [None]:
def custom_print_doc(results: dict, max_text_len: Optional[int] = None, print_name: bool = True, print_meta: bool = False, string_out: bool = False):
    """
    Print the content of every retrieved document. If string_out is True,
    return the content of the top document as a string instead of printing.
    max_text_len, print_name and print_meta are accepted but currently unused.
    """
    # Verify that the input contains Documents under the `documents` key
    if any(not isinstance(doc, Document) for doc in results["documents"]):
        raise ValueError(
            "This results object does not contain `Document` objects under the `documents` key. "
            "Please make sure the last node of your pipeline makes proper use of the "
            "new Haystack primitive objects, and if you're using Haystack nodes/pipelines only, "
            "please report this as a bug."
        )

    for doc in results["documents"]:
        content = str(doc.content)
        if string_out:
            # Return only the first (top-ranked) document
            return content
        print(content)

Helper functions that return the part of a string after or before a given delimiter

In [None]:
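def substring_after(s, delim):
    # Text after the first occurrence of delim ("" if delim is absent)
    return s.partition(delim)[2]

def substring_before(s, delim):
    # Text before the first occurrence of delim (all of s if delim is absent)
    return s.partition(delim)[0]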
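
A short usage sketch shows how the two helpers behave, including the case where the delimiter is missing. The helpers are repeated so that the snippet runs on its own, and the sample strings are made up for illustration.

```python
def substring_after(s: str, delim: str) -> str:
    # str.partition returns (head, delim, tail); the tail is "" if delim is absent
    return s.partition(delim)[2]


def substring_before(s: str, delim: str) -> str:
    # The head is the whole string if delim is absent
    return s.partition(delim)[0]


question = "How long is the admission test?"
completion = "\n\ntwo and a half hours"  # made-up completion with a leading blank line

stem = substring_before(question, "?")
answer = substring_after(completion, "\n\n")
missing = substring_after("no blank line here", "\n\n")

print(stem)
print(answer)
print(repr(missing))
```

Note the asymmetry: `substring_before` falls back to the whole string when the delimiter is missing, while `substring_after` returns an empty string, so callers of the latter may want a fallback.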

Retrieve a passage with the DPR retriever, pass it to GPT as context, and wrap the whole process in a function

In [None]:
import os


def bolo_with_gpt(question):
    # Retrieve the top 5 passages with DPR and keep the top-ranked one as context
    res = DocumentSearchPipeline(retriever).run(query=question, params={"Retriever": {"top_k": 5}})
    passage = custom_print_doc(res, string_out=True)
    # Keep the part of the question before the question mark
    q = substring_before(question, "?")
    prompt = "Answer the question from the given passage. " + "Question: " + q + " in North South University? " + "Passage: " + passage
    # Read the API key from the environment instead of hard-coding it in the notebook
    openai.api_key = os.environ["OPENAI_API_KEY"]
    # Legacy Completions API (openai<1.0)
    response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, temperature=0.15, max_tokens=128)

    # Take the generated text after the leading blank line, or all of it if there is none
    text = response["choices"][0]["text"]
    answer = (substring_after(text, "\n\n") or text).strip()
    print(question + '\n')
    print(answer + '\n')
    pprint(passage)
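
The prompt sent to GPT is easier to inspect when it is assembled by a small pure function that can be printed without calling the API. The sketch below is one possible layout, not the notebook's exact string: `build_prompt` is a hypothetical helper, and putting each part on its own line is an assumption made for readability.

```python
def build_prompt(question: str, passage: str) -> str:
    """Assemble the instruction, the question, and the retrieved passage."""
    # Drop the question mark and anything after it, as the notebook does
    stem = question.partition("?")[0].strip()
    return (
        "Answer the question from the given passage.\n"
        f"Question: {stem} in North South University?\n"
        f"Passage: {passage}"
    )


# Passage text taken from the sample output in this notebook
demo_prompt = build_prompt(
    "What is the office room number of Kazi Wahiduzzaman?",
    "MR. KAZI WAHIDUZZAMAN Record Assistant Office: SAC 502",
)
print(demo_prompt)
```

Printing the prompt before sending it makes it easy to see whether the retrieved passage actually mentions the person or topic in the question.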

# Go Nuts
Use the bolo_with_rag function for answers from the RAG generator and the bolo_with_gpt function for answers from the GPT generator.

In [None]:
bolo_with_rag("How long is the admission test at North South University?")


Query: How long is the admission test at North South University?
Answer:
 two and a half hour
[{'answer': ' two and a half hour'}]


In [None]:
bolo_with_gpt("What is the office room number of Mohammad Mahmud Hasan of Mathematics at North South University?")

What is the office room number of Mohammad Mahmud Hasan of Mathematics at North South University?

The office room number of Mohammad Mahmud Hasan of Mathematics at North South University is SAC 506.

('MR. KAZI WAHIDUZZAMAN Record Assistant Office: SAC 502 Email: '
 'kazi.wahiduzzaman@northsouth.edu')
