The drive link is at: https://drive.google.com/drive/folders/1V_f0z7rIP5zge9uI-OA1efvUYpISqnwB?usp=drive_link

In [None]:
!pip install -q pypdf langchain_community langchain openai sentence-transformers chromadb langchain_cohere tiktoken gdown pymupdf langchain-groq

When a document does not include an index, LDA can be used to extract its important topics, which are then used to summarize the document.

#Implementing LDA for standard PDFs

The Latent Dirichlet Allocation algorithm was a natural choice for this task. This algorithm takes a document and returns a list of topics for the document along with a list of words associated with each topic. What is important for our case is the list of words associated with each topic.

In [None]:
import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pypdf import PdfReader
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain.llms import OpenAI

##Loading Document

Loading the document using PyMuPDFLoader

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(
    file_path="/content/ecosystem.pdf"#, api_model="prebuilt-layout"
)

In [None]:
data=loader.load()

In [None]:
len(data)

17

**Preprocessing to remove stopwords:**

This preprocess function takes an input text and a set of stopwords as parameters. It tokenizes the input text, removes stopwords, and filters out tokens that are shorter than a specified length (in this case, tokens with a length less than or equal to 3 characters). Finally, it returns a list of preprocessed tokens, ready for further text analysis or processing.

In [None]:
def preprocess(text, stop_words):
    """
    Tokenizes and preprocesses the input text, removing stopwords and short
    tokens.

    Parameters:
        text (str): The input text to preprocess.
        stop_words (set): A set of stopwords to be removed from the text.
    Returns:
        list: A list of preprocessed tokens.
    """
    result = []
    for token in simple_preprocess(text, deacc=True):
        if token not in stop_words and len(token) > 3:
            result.append(token)
    return result

##Topic Extraction

**Extracting topics from document**

This function is used to extract topics and their associated words from a PDF document using the Latent Dirichlet Allocation (LDA) algorithm. It preprocesses the input text by tokenizing it, removing stopwords, and filtering out short tokens. Then, it creates a dictionary and a corpus required for training the LDA model. After building the LDA model with the specified number of topics and words per topic, it retrieves the topics along with their corresponding words. Finally, it returns a list of sublists, where each sublist contains the relevant words for a topic discovered by the LDA algorithm.

In [None]:
def get_topic_lists_from_pdf(text, num_topics, words_per_topic):
    """
    Extracts topics and their associated words from a PDF document using the
    Latent Dirichlet Allocation (LDA) algorithm.

    Parameters:
        text(str): The page content for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.

    Returns:
        list: A list of num_topics sublists, each containing relevant words
        for a topic.
    """
    # Preprocess the document
    nltk.download('stopwords')
    stop_words = set(stopwords.words(['english', 'spanish']))
    processed_text = preprocess(text, stop_words)

    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary([processed_text])
    corpus = [dictionary.doc2bow(processed_text)]

    # Build the LDA model
    lda_model = LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=15
        )

    # Retrieve the topics and their corresponding words
    topics = lda_model.print_topics(num_words=words_per_topic)

    # Store each list of words from each topic into a list
    topics_ls = []
    for topic in topics:
        words = topic[1].split("+")
        topic_words = [word.split("*")[1].replace('"', '').strip() for word in words]
        topics_ls.append(topic_words)

    return topics_ls
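
The string parsing at the end of `get_topic_lists_from_pdf` can also be written with a regular expression. The following is a minimal sketch, assuming each topic string has the `weight*"word" + weight*"word"` shape returned by `print_topics`. The helper name `words_from_topic_string` and the example tuple are hypothetical and are not part of the notebook.

```python
import re

def words_from_topic_string(topic_string):
    """Return the quoted words from a string like '0.05*"energy" + 0.04*"pond"'."""
    # Each word sits between double quotes, right after the '*' that follows its weight
    return re.findall(r'\*"([^"]+)"', topic_string)

# Hypothetical (topic_id, topic_string) tuple, shaped like one entry of print_topics()
example_topic = (0, '0.050*"ecosystem" + 0.041*"energy" + 0.030*"pond"')
print(words_from_topic_string(example_topic[1]))
```

This avoids the chained `split` and `replace` calls, but it relies on the same output format.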


**Generates a description along with each major topic in the document**

This function asks the LLM to describe the topics extracted from a PDF document. It takes the output of the `get_topic_lists_from_pdf` function, which is a list of topic-related words for each topic, builds a prompt from it, and returns the LLM's response as a string with one `Q: (sentence):(description)` entry per topic.

In [None]:
def topics_from_pdf(llm, text, num_topics, words_per_topic):
    """
    Generates descriptive prompts for LLM based on topic words extracted from a
    PDF document.

    This function takes the output of `get_topic_lists_from_pdf` function,
    which consists of a list of topic-related words for each topic, and
    generates an output string in bulleted nested list format.

    Parameters:
        llm (LLM): An instance of the Large Language Model (LLM) for generating
        responses.
        text (str): The page content for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.
    Returns:
        str: A response generated by the language model based on the provided
        topic words.
    """

    # Extract topics and convert them to string
    list_of_topicwords = get_topic_lists_from_pdf(text, num_topics,
                                                  words_per_topic)
    string_lda = ""
    # Avoid naming the loop variable `list`, which would shadow the builtin
    for topic_words in list_of_topicwords:
        string_lda += str(topic_words) + "\n"

    # Create the template
    template_string = '''Describe the topic of each of the {num_topics}
        double-quote delimited lists in a simple sentence and give a small description for the same. The lists are the result of an
        algorithm for topic discovery. Make sure the description for each of
        the topic descriptions have a word limit of {words_per_topic}.

        Do not provide an introduction or a conclusion, only describe the
        topics. Do not mention the word "topic" when describing the topics.
        Use the following template for the response.

        Q: (sentence describing the topic):(small description of the topic)

        Q: (sentence describing the topic):(small description of the topic)

        ...

        Q: (sentence describing the topic):(small description of the topic)

        Lists: """{string_lda}""" '''

    # LLM call
    prompt_template = ChatPromptTemplate.from_template(template_string)
    chain = LLMChain(llm=llm, prompt=prompt_template)
    response = chain.run({
        "string_lda" : string_lda,
        "num_topics" : num_topics,
        "words_per_topic":  words_per_topic
        })

    return response

Finding the average string length of the individual pages in the document

In [None]:
# Use a dedicated name so the builtin `sum` is not shadowed
total_chars = 0
for page in data:
    total_chars += len(page.page_content)
total_chars / len(data)

1881.8235294117646

Setting up LLMs for creating summaries

In [None]:
import os

# Do not hard-code credentials in the notebook; supply your own key here
os.environ["AZURE_OPENAI_API_KEY"] = "<your-azure-openai-api-key>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://gpt-res.openai.azure.com/"

In [None]:
from langchain_groq import ChatGroq
from langchain_community.chat_models import AzureChatOpenAI
# Alternative: OpenAI completion model
# llm = OpenAI(openai_api_key="<your-openai-api-key>", max_tokens=-1)
llm = AzureChatOpenAI(
  model_name = "gpt4-32k",
  openai_api_version="2023-05-15",
  azure_deployment="gpt4-inference"
  )
# Alternative: Mixtral served through Groq
# llm = ChatGroq(temperature=0, groq_api_key="<your-groq-api-key>", model_name="mixtral-8x7b-32768")

Chunks are sized according to the average page length for topic extraction and summarization. A chunk overlap is added so that semantic context is preserved across chunk boundaries.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter for topic extraction; the chunk size is close
# to the average page length computed above
text_splitter_gen = RecursiveCharacterTextSplitter(chunk_size=1800, chunk_overlap=150)

# Concatenate all pages, then split the full text into overlapping chunks
chunks = ''
for page in data:
    chunks += page.page_content
text_chunks = text_splitter_gen.split_text(chunks)
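
To illustrate what a chunk overlap does, here is a simplified sketch that splits text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences). The function `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.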

**To generate list**

This function takes input text containing topic descriptions and asks the LLM to return them as a single list, with one topic and its description per element.
In [None]:
from langchain.prompts import PromptTemplate
def generate_list(text):
  template="""
  Separate each topic and their descriptions into separate strings/elements in a list.\n
   Each topic and their descriptions here look something like this:

  Q: (sentence describing the topic):(small description of the topic)

  Ensure each of these are saved into separate strings of a list.\n
  The final response should be a list of Strings. DO NOT RETURN AN EMPTY LIST OR STRING.
  The final response will look like so:
  ['Q.(sentence describing the topic):(small description of the topic)','Q.(sentence describing the topic):(small description of the topic)','Q.(sentence describing the topic):(small description of the topic)','Q.(sentence describing the topic):(small description of the topic)','Q.(sentence describing the topic):(small description of the topic)','Q.(sentence describing the topic):(small description of the topic)']
  STRICTLY HAVE ONLY ONE LIST, DO NOT MAKE MULTIPLE LISTS. ALL THE TOPICS SHOULD BE SEPARATE ELEMENTS IN ONE LIST ONLY
  CONTEXT: {text}
  """
  # model = OpenAI(openai_api_key=openai_key, max_tokens=-1)
  model = AzureChatOpenAI(
  model_name = "gpt4-32k",
  openai_api_version="2023-05-15",
  azure_deployment="gpt4-inference"
  )
  prompt = PromptTemplate(
    template= template,
    input_variables=["text"],
    )

  chain = prompt | model
  response = chain.invoke({"text":text})
  return response


The chunks are now used to extract all topics from the PDF.

Checking the number of chunks generated

In [None]:
len(text_chunks)

20

The previous functions are now used to generate topics and descriptions for the document. Each chunk is fed to the functions individually, which takes multiple LLM API calls per chunk. The responses are then parsed into a single list in which each element holds one topic.

In [None]:
import ast

num_topics=3
words_per_topic=15
final_summary=""
topics_final_list=[]
for i in range(len(text_chunks)):
  summary = topics_from_pdf(llm, text_chunks[i], num_topics, words_per_topic)
  final_summary+=summary
  topics=generate_list(summary)
  try:
      # Parse the string representation of the list into a Python list.
      # ast.literal_eval only accepts literals, so unlike eval it cannot
      # execute arbitrary code returned by the model.
      topics_list = ast.literal_eval(topics.content)
      topics_final_list += topics_list
  except (SyntaxError, ValueError):
      # Skip responses that are not a valid list literal and continue
      pass

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Viewing the list

In [None]:
topics_final_list

['Q: Understanding the aquatic ecosystem: This list focuses on the structure, components, and energy cycling within an aquatic ecosystem.',
 'Q: Exploring forest ecosystems: This list discusses the function, structure, and ecological components of forest ecosystems including both biotic and abiotic elements.',
 'Q: Studying terrestrial ecosystems: The focus here is on the energy flow, ecological structure, and biotic components within terrestrial ecosystems.',
 'Q: Exploration of aquatic ecosystems and their functions: This entails understanding varied species, energy flow, abiotic components, and the functionality of aquatic ecosystems like ponds.',
 'Q: Analysis of components and energy flow in ecosystems: This pertains to the study of how different species, especially in aquatic ecosystems like ponds, interact with the abiotic components and contribute to energy flow.',
 "Q: Examination of species' roles in aquatic ecosystems: This involves assessing the different species in ecosyst

In this way, LDA topic extraction yields the final list of topics and descriptions that summarizes the document.

##MCQ Generation


This code snippet initializes and configures components for text processing and question generation. It splits the input text into manageable chunks using a recursive character-based text splitter. Then, it converts these chunks into documents suitable for further processing. The code also sets up embeddings using the Hugging Face framework, specifically the MiniLM-L6 model, for text representation. Finally, it constructs a vector store using Chroma from the processed documents to facilitate efficient retrieval and analysis of text data.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Initialize the text splitter for question generation
text_splitter_question_gen = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)

# Concatenate all pages and split the text into chunks for question generation
text_question_gen = ''
for page in data:
    text_question_gen += page.page_content
text_chunks_question_gen = text_splitter_question_gen.split_text(text_question_gen)

# Convert chunks into Documents for question generation
docs_question_gen = [Document(page_content=t) for t in text_chunks_question_gen]

# Embed the chunks locally on CPU and index them in Chroma
# Hosted alternative: HuggingFaceInferenceAPIEmbeddings(api_key="<your-hf-api-key>", model_name="BAAI/bge-small-en-v1.5")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                   model_kwargs={'device': 'cpu'})
vector_store = Chroma.from_documents(docs_question_gen, embeddings)

This code sets up a retrieval chain for answer generation. It uses an Azure OpenAI chat model for language generation and the Chroma vector store as the base retriever. Reranking with Cohere through a Contextual Compression Retriever was planned but is left commented out. The components are combined into a RetrievalQA chain, which generates answers from the retrieved chunks.

In [None]:
# Initialize retrieval chain for answer generation
import os
from langchain.chains import RetrievalQA
llm = AzureChatOpenAI(
  model_name = "gpt4-32k",
  openai_api_version="2023-05-15",
  azure_deployment="gpt4-inference"
  )
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
# Cohere reranking could not be used because the Cohere API did not allow multiple calls
# os.environ['COHERE_API_KEY']="<your-cohere-api-key>"
# compressor = CohereRerank()
# The number of retrieved chunks is set through search_kwargs
base_retriever=vector_store.as_retriever(search_kwargs={"k": 5})
#retriever=ContextualCompressionRetriever(base_compressor=compressor, base_retriever=base_retriever)
question_gen_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                                retriever=base_retriever)
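
As a rough illustration of what the retriever does when asked for the top `k` chunks, the sketch below ranks toy vectors by cosine similarity to a query vector. This is a simplification under stated assumptions: the vectors are made up, the helper names `cosine_similarity` and `top_k` are hypothetical, and the actual distance metric used by the vector store may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, doc_vectors, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), idx)
              for idx, vec in enumerate(doc_vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_docs, k=2))
# [0, 2]
```

With the "stuff" chain type, the text of the selected chunks is placed directly into the prompt that is sent to the LLM.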

In [None]:
len(topics_final_list)

60

This code iterates over each topic in topics_final_list and generates multiple-choice questions (MCQs) based on each topic. It constructs a question string template for each topic and then uses a retrieval chain to generate MCQs with options following a specified format. Finally, it appends the generated MCQs to the text variable.

In [None]:
text=""
for i in range(len(topics_final_list)):
  # Generate the string with the variable value
  question_string = f"""
  You are a UPSC Exam Expert in India. Your job is to test the knowledge of aspirants.\
  Generate the possible MCQ questions along with their options on the topic and description {topics_final_list[i]}.\
  Here is the format which must be strictly followed for the generated MCQs:

      Q. The Question for the MCQ Framed
      (a) Option 1
      (b) Option 2
      (c) Option 3
      (d) Option 4

      MAKE SURE EVERY QUESTION FOLLOWS THE EXACT SAME FORMAT WITHOUT FAIL


  """

  # Run the chain with the generated string
  answer = question_gen_chain.run(question_string)
  text += answer

In [None]:
print(text)

Q. Which of the following is the second trophic level in a lake ecosystem?
(a) Phytoplankton
(b) Zooplankton
(c) Benthos
(d) Fishes

Q. Secondary producers in an ecosystem are:
(a) Herbivores
(b) Producers
(c) Carnivores
(d) None of the above

Q. What is the percentage of photosynthetically active radiation (PAR) in the incident solar radiation?
(a) 100%
(b) 50%
(c) 1-5%
(d) 2-10%

Q. Which component distinguishes between a grazing food chain and a detritus food chain?
(a) Producers
(b) Primary consumers
(c) Secondary consumers
(d) Decomposers

Q. In the context of an ecosystem, distinguish between production and decomposition. Which statement is correct?
(a) Production refers to the creation of new organic matter, while decomposition refers to the breakdown of organic matter.
(b) Production refers to the breakdown of organic matter, while decomposition refers to the creation of new organic matter.
(c) Both production and decomposition refer to the creation of new organic matter.
(d) B


#When an Index page is present for the PDF

##Loading Document

Loading the Textbook using PyPDFLoader

In [None]:
loader = PyPDFLoader(
    #api_endpoint=endpoint, api_key=key,
    file_path="/content/1680673908_PT Sprint 2023 Science & Technology_Final (1).pdf"#, api_model="prebuilt-layout"
)
data=loader.load()

Storing the index pages of the textbook separately

In [None]:
index_list=data[1:5]

##Topic Extraction

This function generates a prompt to extract all possible topics from the Index in the provided context list. It utilizes an instance of an LLM to generate a response based on the given template and input context. The extracted topics are returned as a list in the response.

In [None]:
def extract_topics(index_list):

  template = """
  Your job is to extract all the topics from the CONTEXT provided\
  Generate all the possible Science and Technology topics from the information given in CONTEXT.\
  Store this in a list data type, with each of the topics being an element of the list

  Your final response should be a list format.
  CONTEXT: {index_list}
  """

  model = AzureChatOpenAI(
    model_name = "gpt4-32k",
    openai_api_version="2023-05-15",
    azure_deployment="gpt4-inference"
    )

  prompt = PromptTemplate(
    template= template,
    input_variables=["index_list"],
    )

  chain = prompt | model
  response = chain.invoke({"index_list":index_list})
  return response

In [None]:
topics=extract_topics(index_list)

In [None]:
final_index=topics.content
final_index

'["ChatGPT- Powered WhatsApp Chatbot", "Generative Artificial Intelligence", "Critical and Emerging Technologies Dialogue", "Muons Penetrate Ancient Xi’an Fortress Wall", "H5N1- Avian Influenza", "Combating Filariasis", "Sickle Cell Disease", "India’s Nuclear Power Capacity", "NISAR Mission", "ISRO’s SSLV-D2", "Lumpy Skin Disease", "Antimicrobial-Resistant Gonorrhea", "WHO Report on Global Trans Fat Elimination", "Hyderabad: Center for the Fourth Industrial Revolution", "Superconductivity", "Doppler Weather Radar Network", "Neuromorphic Computing", "Solitary Wave in Martian Magnetosphere", "India’s First Solar Mission", "Exoplanet", "Shukrayaan I", "Age Determination Techniques", "Virovore", "Stem Cell-derived Mitochondrial Transplant", "Local Bubbles", "Innovation in Agriculture", "GM Mustard", "Base Editing", "Whole Genome Sequencing", "Anticancer mRNA Vaccine", "Incovacc, Intranasal Covid-19 Vaccine", "Deepfake Technology", "iDEX and Defence India Start-Up Challenge", "Water Worlds"

In [None]:
import ast

# Convert the string to a list of strings
index_list = ast.literal_eval(final_index)
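
LLM responses do not always contain a clean list literal, so `ast.literal_eval` can raise an exception. The following is a minimal sketch of a more tolerant parser. The helper `parse_list_literal` is hypothetical, and the fallback of returning an empty list is an assumption about how failures should be handled.

```python
import ast

def parse_list_literal(raw):
    """Return the list of strings found in raw, or an empty list if parsing fails."""
    # Keep only the outermost [...] span, in case the model added extra text
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        value = ast.literal_eval(raw[start:end + 1])
    except (SyntaxError, ValueError, TypeError):
        return []
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, str)]

print(parse_list_literal('Here you go: ["NISAR Mission", "Exoplanet"]'))
# ['NISAR Mission', 'Exoplanet']
print(parse_list_literal('["truncated", "respon'))
# []
```

A caller can then check for an empty result and retry the request or skip the chunk.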

Chunking the textbook and creating a vector database from the chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize Text Splitter for question generation
text_splitter_question_gen = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# Split text into chunks for question generation
text_question_gen = ''
for page in data:
    text_question_gen += page.page_content
text_chunks_question_gen = text_splitter_question_gen.split_text(text_question_gen)
# Convert chunks into Documents for question generation
docs_question_gen = [Document(page_content=t) for t in text_chunks_question_gen]
# Create the vector database for answer generation
# Local alternative: HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
# Supply your own Hugging Face token; do not hard-code credentials in the notebook
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key="<your-hf-api-key>",model_name="BAAI/bge-small-en-v1.5")
vector_store = Chroma.from_documents(docs_question_gen, embeddings)

Setting up the RetrievalQA chain, which makes separate API calls for each topic

In [None]:
# Initialize retrieval chain for answer generation
from langchain.chains import RetrievalQA
llm = AzureChatOpenAI(
  model_name = "gpt4-32k",
  openai_api_version="2023-05-15",
  azure_deployment="gpt4-inference"
  )
# The number of retrieved chunks is set through search_kwargs
question_gen_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                                retriever=vector_store.as_retriever(search_kwargs={"k": 5}))

##MCQ Generation

Generating MCQ Questions for each topic as extracted from the index. Multiple API calls are made for each topic.

In [None]:
text=""
for i in range(len(index_list)):
  # Generate the string with the variable value
  question_string = f"""
  You are a UPSC Exam Expert in India. Your job is to test the knowledge of aspirants.\
  Generate 3-4 possible MCQ questions along with their options on the topic {index_list[i]}.\
    Here is the format which must be strictly followed for the generated MCQs:

      Q. The Question for the MCQ Framed
      (a) Option 1
      (b) Option 2
      (c) Option 3
      (d) Option 4

  """

  # Run the chain with the generated string
  answer = question_gen_chain.run(question_string)
  text += answer
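
Because the prompt fixes the MCQ format, the generated text can be turned into structured records. The following is a minimal sketch, assuming every question follows the `Q.` / `(a)` to `(d)` layout requested in the prompt. The pattern `QUESTION_RE` and the function `parse_mcqs` are hypothetical helpers. Responses that deviate from the layout, such as refusals or answers that run together without a line break, are not handled.

```python
import re

# Matches "Q." or "Q1." style prefixes, the question text, and options (a) to (d)
QUESTION_RE = re.compile(
    r"Q\d*[.:]\s*(?P<question>.+?)\s*"
    r"\(a\)\s*(?P<a>.+?)\s*"
    r"\(b\)\s*(?P<b>.+?)\s*"
    r"\(c\)\s*(?P<c>.+?)\s*"
    r"\(d\)\s*(?P<d>[^\n]+)",
    re.DOTALL,
)

def parse_mcqs(raw_text):
    """Return a list of dicts, each holding a question and its four options."""
    mcqs = []
    for match in QUESTION_RE.finditer(raw_text):
        mcqs.append({
            "question": match.group("question"),
            "options": [match.group(key) for key in "abcd"],
        })
    return mcqs

sample_output = """Q. Secondary producers in an ecosystem are:
(a) Herbivores
(b) Producers
(c) Carnivores
(d) None of the above
"""
print(parse_mcqs(sample_output))
```

Structured records make it easier to check that each question has exactly four options before the MCQs are used.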

Displaying the MCQs


In [None]:
print(text)

I'm sorry, but the provided context does not provide sufficient information on the topic "ChatGPT- Powered WhatsApp Chatbot" to generate the requested MCQs.Q1. Which of the following technologies is not typically associated with Generative Artificial Intelligence?
(a) Machine Learning
(b) Pattern Recognition
(c) Big Data
(d) Quantum Computing

Q2. What was the percentage increase in the use of AI in India following the outbreak of the virus, as per a PwC report?
(a) 35%
(b) 45%
(c) 55%
(d) 65%

Q3. Which of the following is not a barrier to the adoption of AI in India?
(a) Limited understanding of AI
(b) High Investments and Well Evolved Startup Ecosystem
(c) Limited Availability of AI Trained Talent
(d) Low Investments and Less Evolved Startup Ecosystem

Q4. Which AI-based portal is aimed at assisting judges with legal research in India?
(a) SUPACE
(b) AI Judiciary
(c) AI Court
(d) Legal AIQ1. What is the aim of the Atal Innovation Mission (AIM)?
(a) To provide a platform and collabor