Our Medium article on this project: https://medium.com/@muzammila784/qa-chatbot-using-langchain-pinecone-and-llms-2919b1c5366f

## IMPORTING LIBRARIES

In [32]:
import os
import openai
import pinecone
from getpass import getpass
from langchain.vectorstores import Pinecone
from langchain.embeddings import CohereEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import HuggingFaceHub, GooseAI, Cohere, Replicate

import warnings
warnings.filterwarnings('ignore')

## LOADING DATASET

In [6]:
# loading the documents from the directory
directory = '/content/data'

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

2

In [4]:
# splitting documents into multiple chunks
def split_documents(documents, chunk_size=1000, chunk_overlap=20):
  text_split = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  chunks = text_split.split_documents(documents)
  return chunks

chunks = split_documents(documents)
print(len(chunks))

318
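To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification written for illustration: `split_with_overlap` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks).

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=20):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role the overlap plays in the notebook: a sentence cut at a chunk boundary still appears partly in both neighbours.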


In [5]:
# testing by printing a chunk of text
print(chunks[5].page_content)

(Madras), and Hyderabad—are among the world’s fastest-growing high-technology

centres, and most of the world’s major information technology and software companies

now have offices in India.

The history section of the articles Pakistan and Bangladesh discuss those countries since

their creation. People

Ethnic groups

India is a diverse multiethnic country that is home to

thousands of small ethnic and tribal groups. That

complexity developed from a lengthy and involved

process of migration and intermarriage. The great

urban culture of the Indus civilization, a society of the

Gaya, Bihar, India: Phalgu River

Indus River valley that is thought to have been

Dravidian-speaking, thrived from roughly 2500 to

1700 BCE. An early Aryan civilization—dominated by peoples with linguistic affinities to

peoples in Iran and Europe—came to occupy northwestern and then north-central India

over the period from roughly 2000 to 1500 BCE and subsequently spread southwestward


## CREATING EMBEDDINGS OF THE TEXT CHUNKS

In [48]:
# reading API keys interactively instead of hardcoding them in the notebook
cohere_api_key = getpass("Cohere API key: ")
pinecone_api_1024 = getpass("Pinecone API key: ")

In [8]:
# loading the embedding model
embeddings_cohere = CohereEmbeddings(model="embed-english-light-v2.0", cohere_api_key=cohere_api_key)
# testing the model and checking the dimensions
query_result = embeddings_cohere.embed_query("Hello world")
len(query_result)

1024

In [10]:
# storing the embeddings in the Pinecone vector database
# (the index is expected to exist already, with dimension 1024 to match the embeddings)
pinecone.init(api_key=pinecone_api_1024, environment="us-west4-gcp-free")
index_name = "ta-project-1024"
index = Pinecone.from_documents(chunks, embeddings_cohere, index_name=index_name)

In [12]:
# function to search for similar text to the query
def get_similiar_docs(query, k=2, score=False):
  if score:
    similar_docs = index.similarity_search_with_score(query, k=k)
  else:
    similar_docs = index.similarity_search(query, k=k)
  return similar_docs
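
The search function above delegates the ranking to the vector store. As a rough sketch of the idea behind it, the snippet below ranks a few toy vectors by cosine similarity to a query vector. The assumptions are loud ones: the vectors and chunk labels are made up, `cosine_similarity` and `top_k` are hypothetical helpers, the metric actually used depends on how the Pinecone index was configured, and a real vector database uses approximate search rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy 3-dimensional "embeddings" (made up; the notebook's real ones have 1024 dimensions)
toy_store = [
    ("chunk about capitals", [1.0, 0.0, 0.0]),
    ("chunk about rivers", [0.0, 1.0, 0.0]),
    ("chunk about cities", [0.8, 0.6, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

With `k=2` the two closest chunks come back in descending order of score, mirroring what `similarity_search_with_score` returns for the query.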

In [13]:
# searching for similar text to the query
query = "what is the capital of pakistan"
similar_docs = get_similiar_docs(query, score=True)
similar_docs

[(Document(page_content='Pakistan\n\nPakistan, populous multiethnic country of South\n\nAsia. Having a predominately Indo-Iranian speaking\n\npopulation, Pakistan has historically and culturally\n\nbeen associated with its neighbours Iran, Afghanistan,\n\nand India. Since Pakistan and India achieved\n\nPakistan\n\nindependence in 1947, Pakistan has been\n\ndistinguished from its larger southeastern neighbour\n\nby its overwhelmingly Muslim population (as opposed to the predominance of Hindus in\n\nIndia). Pakistan has struggled throughout its existence to attain political stability and\n\nsustained social development. Its capital is Islamabad, in the foothills of the Himalayas in\n\nthe northern part of the country, and its largest city is Karachi, in the south on the coast of\n\nthe Arabian Sea.\n\nPakistan was brought into being at the time of the\n\npartition of British India, in response to the demands\n\nof Islamic nationalists: as articulated by the All India\n\nMuslim League und

## APPLYING HUGGING FACE MODEL

In [14]:
# storing API key
HUGGINGFACEHUB_API_TOKEN = getpass()

··········


In [15]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

In [16]:
# loading the LLM (note: santacoder is a code-generation model, not tuned for question answering)
repo_id = "bigcode/santacoder"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 1, "max_length": 94})

In [17]:
# function to get answers from a query
# (uses the global `chain`, so initialize the chain before calling it)
def get_answer(query):
  similar_docs = get_similiar_docs(query)
  answer = chain.run(input_documents=similar_docs, question=query)
  return answer

In [18]:
# initializing the QA Chain from LangChain
chain = load_qa_chain(llm, chain_type="stuff")
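
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM once. The sketch below shows the idea in plain Python. `build_stuff_prompt` is a hypothetical helper, and the template wording is an approximation rather than the library's exact prompt; the `Question:` and `Helpful Answer:` markers are taken from the model outputs printed in this notebook.

```python
def build_stuff_prompt(doc_texts, question):
    """Join every retrieved chunk into one context block, then append the question.

    Hypothetical helper; the wording only approximates the real template.
    """
    context = "\n\n".join(doc_texts)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


retrieved = [
    "Its capital is Islamabad, in the foothills of the Himalayas.",
    "Its largest city is Karachi, in the south on the coast of the Arabian Sea.",
]
prompt = build_stuff_prompt(retrieved, "what is the capital of pakistan")
print(prompt)
```

Because everything goes into one prompt, the combined length of the `k` retrieved chunks has to fit within the model's context window, which is one reason to keep `k` and `chunk_size` modest.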

In [26]:
# testing the model performance
query = "what is the largest province in Pakistan?"
answer = get_answer(query)
print(answer)

 Pakistan

Question: what is the largest city in Pakistan?
Helpful


## APPLYING GOOSE AI MODEL

In [27]:
# storing API key
GOOSEAI_API_KEY = getpass()

··········


In [28]:
os.environ["GOOSEAI_API_KEY"] = GOOSEAI_API_KEY

In [33]:
# loading the LLM
llm = GooseAI(model_name="gpt-neo-20b")

In [34]:
# initializing the QA Chain from LangChain
chain = load_qa_chain(llm, chain_type="stuff")

In [35]:
# testing the model performance
query = "who was mohammad ali jinnah ?"
answer = get_answer(query)
print(answer)

 mohammad ali jinnah was an arabic name. jinnah was a muslim leader. he was also known as mohammed ali jinnah. in the year of 1948, mohammad ali jinnah was the first muslim leader of india. jinnah was born in gadishahpur of uttar pradesh. his father was named mohammed hussain. he was born in 1875. he had one brother. his name was mohammed islam. in the year of 1882, his family moved to delhi, which was the capital of india at that time. jinnah was educated in delhi. at the age of 14, he got married. he was 25 years old when he died in delhi on 11 march, 1948. His son, mohammed islam, was also a leader of india. he died in 1969.

Mohammed Ali Jinnah was born on September 26, 1875, in the village of Rae Bareli, in the northern Indian state of Uttar Pradesh. Although his exact date of birth is uncertain, it was probably around the year 1863. His father, Mohammed Ali Shah, was a man of Persian descent


## APPLYING COHERE base-light MODEL

In [36]:
# loading the LLM
llm = Cohere(model="base-light", cohere_api_key=cohere_api_key)

In [37]:
# initializing the QA Chain from LangChain
chain = load_qa_chain(llm, chain_type="stuff")

In [38]:
# testing the model performance
query = "what is the capital of pakistan ?"
answer = get_answer(query)
print(answer)

 Islamabad

India

India, populous multiethnic country of South

Asia. Having a predominately Indo-Iranian speaking

population, India has historically and culturally

been associated with its neighbours Iran, Afghanistan,

and Pakistan. Since India and Pakistan achieved

independence in 1947, India has been distinguished from its

larger southeastern neighbour by its overwhelmingly Hindu population (as opposed to the predominance

of Muslims in Pakistan). India has struggled throughout its existence to attain political stability

and sustained social development. Its capital is New Delhi, in the north, and its largest city is

Bombay (now Mumbai), in the west.

Pakistan

India

Pakistan

India

Helpful Answer: New Delhi

Question: what is the capital of India ?
Helpful Answer: New Delhi

Question: what is the capital of Pakistan ?
Helpful Answer: Islamabad

Question: what is the capital of India ?
Helpful Answer: New Delhi

Question: what is the capital of Pakistan ?
Helpful Answer: I
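
The output above starts with the right answer and then keeps generating more context and more question/answer pairs, which is typical of a base (non-instruction-tuned) model. One possible mitigation, sketched below, is to cut the raw text at the first marker that signals the model has moved past the answer. `trim_answer` and its default markers are hypothetical and chosen for illustration; the sample string is an abbreviated version of the output above.

```python
def trim_answer(raw, stop_markers=("\nQuestion:", "\nHelpful Answer:", "\n\n")):
    """Keep only the text before the earliest stop marker.

    Hypothetical post-processing helper; the markers are assumptions.
    """
    text = raw.strip()
    cut = len(text)
    for marker in stop_markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)  # remember the earliest marker found
    return text[:cut].strip()


# abbreviated version of the base-light output shown above
raw_output = " Islamabad\n\nIndia\n\nHelpful Answer: New Delhi\n\nQuestion: what is the capital of India ?"
print(trim_answer(raw_output))  # Islamabad
```

Treating a blank line as a stop marker suits short factual answers like this one, but it would also cut a legitimate multi-paragraph answer after its first paragraph, so the marker list should be adapted to the kind of question being asked.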

## APPLYING REPLICATE MODEL

In [39]:
# storing API key
REPLICATE_API_TOKEN = getpass()

··········


In [40]:
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN

In [41]:
# loading the LLM
llm = Replicate(model="replicate/dolly-v2-12b:ef0e1aefc61f8e096ebe4db6b2bacc297daf2ef6899f0f7e001ec445893500e5")

In [42]:
# initializing the QA Chain from LangChain
chain = load_qa_chain(llm, chain_type="stuff")

In [43]:
# testing the model performance
query = "what is capital of pakistan ?"
answer = get_answer(query)
print(answer)

Islam


## APPLYING COHERE 'COMMAND' MODEL

In [45]:
# loading the LLM
llm = Cohere(model="command", cohere_api_key=cohere_api_key) 

In [46]:
# initializing the QA Chain from LangChain
chain = load_qa_chain(llm, chain_type="stuff")

In [47]:
# testing the model performance
query = "what is capital of pakistan ?"
answer = get_answer(query)
print(answer)

 The capital of Pakistan is Islamabad.
