<a href="https://colab.research.google.com/github/Ansh23-BI/AI_Based_Usage_Codes/blob/main/RAG%20Based%20Code/Simple_Wiki_%7C_RAG_Based_%7C_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -qq langchain langchain-community langchain-core langchain-openai langchain-chroma langchain-text-splitters requests

In [None]:
import os
import requests
from langchain_openai import ChatOpenAI
from google.colab import userdata

os.environ["API_KEY"]=userdata.get("OPENROUTER_API_KEY")

In [None]:
title=input("Please enter the title of the wiki page: ")
url="https://en.wikipedia.org/w/api.php"
params={
    "action":"query",       # Use the query API
    "format": "json",       # Response format
    "prop": "extracts",     # Get article content
    "explaintext": 1,       # Return plain text (no HTML)
    "redirects": 1,         # Auto-follow redirects
    "titles": title         # Article to fetch
}

# requests converts params into URL query parameters automatically, for example:
# https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&explaintext=1&redirects=1&titles=Data_science

# Wikipedia may block requests that lack a User-Agent header (to limit bot traffic).
# A short project description plus a contact email is enough.
headers={
    'User-Agent':'Rag Project anshul@gmail.com'
}
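response=requests.get(url,params=params,headers=headers,timeout=30)
response.raise_for_status()   # fail early on an HTTP error instead of parsing an error body
doc=response.json()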
# The response body is JSON text; response.json() parses it into a Python dictionary.
# The article text sits under query -> pages -> <page id> -> extract, so we pull it out from there:
# {
#   "query": {
#     "pages": {
#       "12345": {
#         "pageid": 12345,
#         "title": "Data science",
#         "extract": "Data science is an interdisciplinary field..."
#       }
#     }
#   }
# }
# doc['query']['pages'] is keyed by page ID, which is not known in advance.
# .values() only gives a dict_values view, so next(iter(...)) pulls out its first (and only) page.
page=next(iter(doc['query']['pages'].values()))
text=page.get('extract',"")
print(text[0:20000])


Please enter the title of the wiki page: Data Science
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. 
Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.
Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer scienc
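The lookup above assumes the requested title exists. As a minimal sketch, a small helper can make the missing-page case explicit. `extract_page_text` is a hypothetical name, and the two dictionaries are hand-written stand-ins shaped like the response structure shown in the cell, not real API output. The `missing` key is how the query API marks a title it could not find.

```python
def extract_page_text(doc):
    """Return the plain-text extract of the first page, or "" if there is none."""
    pages = doc.get("query", {}).get("pages", {})
    if not pages:
        return ""
    page = next(iter(pages.values()))
    if "missing" in page:
        # The API flags unknown titles with a "missing" key instead of an extract.
        return ""
    return page.get("extract", "")


# Hand-written sample responses (assumptions for illustration only).
found = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Data science",
                "extract": "Data science is an interdisciplinary field...",
            }
        }
    }
}
not_found = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(repr(extract_page_text(found)))
print(repr(extract_page_text(not_found)))
```

Checking for an empty result before splitting avoids embedding an empty document further down the notebook.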

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
# split_text() returns plain strings. The split_documents() call in the next cell performs the same
# splitting but returns Document objects. That form is usually preferable for RAG work because the
# metadata is attached to every chunk, which makes chunks easier to trace and recombine later.
chunks=text_splitter.split_text(text)
print(chunks[0])



Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. 
Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.
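To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than `chunk_size` and the overlap is not exact. `sliding_window_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The shared characters at chunk boundaries (`hij`, `opq`, `vwx`) play the role of the 200-character overlap configured above: a sentence cut at a boundary still appears whole in at least one chunk.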


In [None]:
from langchain_core.documents import Document

docs=[Document(
    page_content=text,
    metadata={
        "url":f"https://en.wikipedia.org/wiki/{title}",
        "source":"wikipedia"
    }

)]     # split_documents() expects an iterable of Document objects, so the single Document is wrapped in a list.
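# Note: this rebinds `doc`, which held the parsed API response until now, to the list of chunked Documents.
doc=text_splitter.split_documents(docs)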
print(doc[0])



page_content='Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. 
Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.' metadata={'url': 'https://en.wikipedia.org/wiki/Data Science', 'source': 'wikipedia'}
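The metadata printed above contains a URL with a raw space (`.../wiki/Data Science`) because the user's input is inserted as typed. A minimal sketch of one way to build a cleaner link follows. `wiki_url` is a hypothetical helper, and it assumes the usual Wikipedia convention of underscores in place of spaces.

```python
from urllib.parse import quote


def wiki_url(title):
    """Build an article URL from a title, escaping characters that are unsafe in a path."""
    # Wikipedia URLs use underscores instead of spaces; quote() percent-encodes the rest.
    return "https://en.wikipedia.org/wiki/" + quote(title.strip().replace(" ", "_"))


print(wiki_url("Data Science"))
# https://en.wikipedia.org/wiki/Data_Science
print(wiki_url("C++"))
# https://en.wikipedia.org/wiki/C%2B%2B
```

Since the request sets `redirects`, the `title` field of the returned page may differ from what was typed, so passing `page.get("title", title)` to the helper would be one option for recording the resolved article.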


In [None]:
print(len(doc))

17


In [None]:
llm=ChatOpenAI(
    model_name="gpt-3.5-turbo",
    api_key=userdata.get("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1",
    temperature=0.6,
    max_tokens=300)


In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings=OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=userdata.get("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1",
)

In [None]:
# Embed each chunk and store the vectors in a Chroma collection (in memory, since no persist_directory is given).
from langchain_chroma import Chroma
vector_store=Chroma.from_documents(
    documents=doc,
    embedding=embeddings
)



In [None]:
print(vector_store._collection.get())

{'ids': ['6b6d29c3-812e-4fb4-b211-c13cf556df11', '6534af37-baa6-44dd-a232-53e3d83b7645', '1c34e76e-317d-49ae-a770-5bfbf0c83c76', '5704389b-878d-46b9-b685-85d8d2de4b42', '40b47479-4f4d-490d-9735-bf7ab9cbf554', '11df34c0-0edf-44d6-bd81-1c8c54fe4ca8', 'f81df042-bcb2-45b0-91c7-10641bf8371f', '58fbd495-23fe-47fe-8659-77ac7df6ced6', 'a03fcadc-213b-48a1-9c95-a863254fcf4d', '838287bf-e1ef-45f8-8126-9fc9bffae85d', '7e18c781-5c62-43e1-9885-b1ac98d4dd58', '7ebe2a65-4144-4502-891d-8944fd955538', 'd4f05a9e-1528-4024-bcc6-ec6e9967c46f', '7e306195-2b5b-4359-8694-bb7bb45d0b7d', '4f637d20-9fc8-478b-b4c3-99350e8fa351', 'c5241725-53f2-4437-ac47-e98d90f34e3b', 'c8035c78-2c84-4f0c-a3e5-94ed40aff627'], 'embeddings': None, 'documents': ['Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. \nDa

In [None]:
print(vector_store._collection.get(ids=["6534af37-baa6-44dd-a232-53e3d83b7645"],include=["embeddings","documents"]))

{'ids': ['6534af37-baa6-44dd-a232-53e3d83b7645'], 'embeddings': array([[-0.01768508, -0.01315481,  0.01878043, ..., -0.00473233,
         0.01237849,  0.03554031]]), 'documents': ['Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.'], 'uris': None, 'included': ['embeddings', 'documents'], 'data': None, 'metadatas': None}


In [None]:
# Retrieval step: expose the vector store as a retriever
retriever=vector_store.as_retriever()
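Called without arguments, `as_retriever()` performs a similarity search and returns the 4 closest chunks by default. The idea behind that search can be sketched with toy vectors. This is an illustration under assumptions, not Chroma's implementation: the three-dimensional vectors are made up, `top_k` is a hypothetical helper, and the distance metric a real collection uses depends on its configuration. Cosine similarity is used here only because it is easy to read.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the texts of the k stored items most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs standing in for the stored chunks.
toy_store = [
    ("chunk about statistics", [0.9, 0.1, 0.0]),
    ("chunk about ethics", [0.0, 0.2, 0.9]),
    ("chunk about machine learning", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]

print(top_k(query, toy_store, k=2))
# ['chunk about statistics', 'chunk about machine learning']
```

In the notebook, the number of returned chunks can be changed with `vector_store.as_retriever(search_kwargs={"k": 2})`.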

In [None]:
# Augmentation
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a helpful AI assistant. Use the provided context to answer the user's question. Give a precise and concise answer. Don't make answers longer than necessary. "
        "If the answer is not in the context, say you don't know.\n\nContext:\n{context}"
    ),
    ("human", "{question}")
])


In [None]:
from langchain_core.runnables import RunnablePassthrough   # Passes its input through unchanged; used here to forward the question to the prompt.

In [None]:
# The relevant information may be spread across several chunks, so the retriever returns multiple documents.
# This function takes the retriever's output and joins the chunk texts into one context string.
def format_docs(docs):
  return "\n".join(doc.page_content for doc in docs)
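To see what `format_docs` produces without calling the retriever, the sketch below feeds it stand-in objects. `SimpleNamespace` is used here only to mimic the `page_content` attribute of a `Document`, and the two chunk texts are invented. The result is then substituted into a shortened copy of the system message's `Context:` section.

```python
from types import SimpleNamespace


def format_docs(docs):
    # Join the text of every retrieved chunk, one chunk per line.
    return "\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects (assumption: only page_content matters here).
fake_docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]

context = format_docs(fake_docs)
system_message = "Context:\n{context}".format(context=context)
print(system_message)
# Context:
# First chunk.
# Second chunk.
```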

In [None]:
# RAG chain

from langchain_core.output_parsers import StrOutputParser
rag_chain= ({"context":retriever | format_docs,"question":RunnablePassthrough()}
            | prompt_template
            | llm
            | StrOutputParser())
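The data flow of the chain can be sketched with ordinary functions. The leading dictionary sends the same question down two branches (retrieval plus formatting for `context`, passthrough for `question`), and each later stage receives the previous stage's output. All functions below are hypothetical stubs: `fake_llm` just echoes the prompt, so nothing here calls a model or a vector store.

```python
def fake_retriever(question):
    # Stub: a real retriever would return the chunks closest to the question.
    return ["chunk A", "chunk B"]


def join_chunks(chunks):
    return "\n".join(chunks)


def build_prompt(inputs):
    # Stand-in for the prompt template: fills in context and question.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt):
    # Stub: echoes the prompt inside a message-like dict instead of generating text.
    return {"content": "ECHO: " + prompt}


def parse(message):
    # Stand-in for the output parser: keep only the text content.
    return message["content"]


def run_chain(question):
    # Step 1: the same input feeds both branches of the dictionary.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,  # passed through unchanged
    }
    # Steps 2-4: prompt -> model -> parser, each consuming the previous output.
    return parse(fake_llm(build_prompt(inputs)))


answer = run_chain("What is data science?")
print(answer)
```

Because the stubbed model echoes its input, the printed answer shows exactly what the model would receive: the retrieved context followed by the original question.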

In [None]:
rag_chain.invoke("tell me one main thing from the document?")

'Data science ethics courses are increasingly integrating human-centric topics like fairness, accountability, and responsible decision-making to help students understand the societal impacts of data-driven technologies.'