# 📚 StudyForge: Multi-Agent AI Assistant for Exam Revision

This project is a comprehensive, multi-agent academic assistant designed to help students efficiently revise for exams using their own PDF documents (such as textbooks, notes, or research papers). Built with **CrewAI**, **LlamaIndex**, and **LangChain**, it leverages the power of LLMs and agent orchestration to automate the creation of study materials and interactive Q&A.

---

## 😫 The Problem

Students are overwhelmed by long PDFs, textbooks, and scattered online content — especially during exams. There's no time to revise everything.

---

## 💡 The Solution

This assistant turns your PDFs into a structured study guide, complete with notes, MCQs, explanations, and even videos — in minutes.

---
> ⚡️ *Turn any textbook or notes PDF into a complete, interactive study assistant in minutes!*

## 🚀 Key Features

- **PDF Upload & Processing:** Students can upload their study PDFs, which are automatically chunked and indexed for semantic search and retrieval.
- **Multi-Agent Workflow:** Specialized agents collaborate to generate:
    - 📄 **Summary:** Concise overview of the document.
    - 📝 **Notes:** Bullet-point revision notes.
    - ❓ **MCQs:** Multiple-choice questions for self-assessment.
    - 💡 **Simplified Explanation:** Easy-to-understand explanations of complex topics.
    - 📺 **YouTube Resources:** Curated educational videos for each topic.
    - 🌐 **Web Resources:** High-quality web articles and references.
    - 🧩 **Comprehensive Compilation:** All outputs are combined into a structured markdown report.
- **Downloadable Output:** The final study guide can be downloaded as a markdown or PDF file for offline use.
- **Agentic RAG (Retrieval-Augmented Generation):**
    - **Smart Q&A:** Ask questions about your document. If the answer is not found but is related, the agent fetches information from the web.
    - **Context Awareness:** If the question is unrelated to the document, the agent politely declines to answer.
    - **Web Augmentation:** Even when the answer is found, the agent can supplement it with additional web resources.
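
The three Q&A behaviours listed above amount to a routing decision. The sketch below is one minimal way to express it, assuming the question has already been scored against the document; `route_question`, both scores, and both thresholds are hypothetical and are not taken from this project's code.

```python
from enum import Enum


class Route(Enum):
    DOCUMENT = "answer from the document"
    WEB = "fetch from the web"
    DECLINE = "politely decline"


def route_question(retrieval_score: float, topic_score: float,
                   answer_threshold: float = 0.75,
                   related_threshold: float = 0.40) -> Route:
    """Pick a route from two similarity scores in [0, 1].

    retrieval_score: best similarity between the question and any chunk.
    topic_score: similarity between the question and the document's topics.
    Both thresholds are illustrative values, not tuned ones.
    """
    if retrieval_score >= answer_threshold:
        return Route.DOCUMENT   # the answer is in the PDF
    if topic_score >= related_threshold:
        return Route.WEB        # related topic, but not covered by the PDF
    return Route.DECLINE        # unrelated to the document


print(route_question(0.82, 0.90).value)
print(route_question(0.30, 0.65).value)
print(route_question(0.10, 0.05).value)
```

In practice an LLM-driven agent would make this decision from the retrieved context rather than from fixed thresholds; the sketch only makes the three outcomes explicit.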

---

## 🛠️ Technologies Used

- **CrewAI:** For orchestrating multiple specialized agents and tasks.
- **LlamaIndex:** For document ingestion, chunking, and vector-based retrieval.
- **LangChain:** For tool integration, prompt engineering, and agentic RAG.
- **Google Gemini & Mistral LLMs:** For high-quality language understanding and generation.
- **YouTube & Web Search APIs:** For fetching relevant external resources.
- **WeasyPrint:** For converting markdown study guides to PDF.

---

## 🧑‍💻 How It Works

1. **Document Ingestion:** Upload your PDF. The system splits it into manageable chunks and creates a semantic index.
2. **Agent Collaboration:** Each agent performs its specialized task (summarization, note-making, MCQ generation, etc.).
3. **Compilation:** Outputs from all agents are merged into a single, well-structured markdown document.
4. **Download:** Export your study guide as markdown or PDF.
5. **Interactive Q&A:** Use the Agentic RAG interface to ask follow-up questions, get explanations, or explore related topics with web-augmented answers.
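
Step 1 can be illustrated with a toy, dependency-free version of chunking and retrieval. This is only a sketch: it uses word windows and bag-of-words cosine similarity as stand-ins for the sentence splitter and embedding model, and `chunk_words`, `cosine`, and `top_chunk` are hypothetical helper names.

```python
import math
from collections import Counter


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping word windows."""
    words = text.split()
    step = max(1, chunk_size - overlap)  # guard against a non-positive step
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunk(chunks: list[str], question: str) -> str:
    """Return the chunk whose word counts are most similar to the question."""
    q = Counter(question.lower().split())
    return max(chunks, key=lambda c: cosine(Counter(c.lower().split()), q))


notes = "Supervised learning uses labeled data. Clustering groups similar points without labels."
chunks = chunk_words(notes, chunk_size=6, overlap=1)
print(top_chunk(chunks, "how does clustering work"))
```

A real semantic index replaces the word counts with dense embeddings, so that questions can match chunks that use different wording.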

---

## 🎯 Why Use This Project?

- **Saves Time:** Instantly generate summaries, notes, and practice questions from large PDFs.
- **Enhances Understanding:** Get simplified explanations and curated video and web resources.
- **Interactive Learning:** Ask questions and get context-aware answers, supplemented from the web when needed.
- **All-in-One Solution:** Summaries, MCQs, explanations, and curated resources for exam prep in one workflow.

---

## 💡 Example Use Cases

- **Exam Revision:** Quickly turn a textbook or lecture PDF into a personalized study guide.
- **Self-Assessment:** Practice with automatically generated MCQs.
- **Deep Dives:** Use the Q&A agent to clarify doubts or explore related concepts.
- **Resource Discovery:** Find the best videos and articles for each topic in your syllabus.

---

## 📦 Project Workflow Overview

1. **Upload PDF →**  
2. **Agents Generate Outputs (Summary, Notes, MCQs, Explanations, Resources) →**  
3. **Compile to Markdown →**  
4. **Download as PDF/Markdown →**  
5. **Interactive Q&A with Agentic RAG**

---

## 📝 Next Steps

- Try uploading your own PDF and see the multi-agent workflow in action!
- Use the interactive chat to ask questions about your document or related topics.
- Download your personalized study guide for offline revision.

---

> **Empowering students to study smarter, not harder!**

In [None]:
%%capture
!pip install crewai

In [None]:
%%capture
!pip install 'crewai[tools]'

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
%pip install llama-index-llms-google-genai llama-index-embeddings-google-genai

In [1]:
import os
GOOGLE_API_KEY = ""
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [None]:
from crewai import Agent, Task, Crew, LLM

In [None]:
# %pip install -U langchain-google-genai
# !pip install langchain
!pip install --upgrade langchain langchain-core langchain-google-genai

In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [None]:
# CrewAI LLM wrappers, one per provider. Fill in your own API keys.
mistral_llm = LLM(
    model="mistral/mistral-large-latest",
    temperature=0.7,
    api_key=""
)
my_llm = LLM(
    model="gemini/gemini-2.5-flash",
    temperature=0.7,
    api_key="",
    provider="gemini"
)
# Note: this rebinds `llm`, replacing the ChatGoogleGenerativeAI instance created above.
groq_llm = llm = LLM(
    model="groq/llama-3.3-70b-versatile",
    temperature=0.7,
    api_key=""
)

In [None]:
!pip install llama-index llama-index-vector-stores-chroma llama-index-llms-huggingface-api llama-index-embeddings-huggingface -U -q

In [None]:
!pip install llama-index-llms-langchain

In [None]:
%pip install llama-index-llms-mistralai
# !pip install llama-index

In [None]:
from llama_index.llms.mistralai import MistralAI
mistral_llamaindex_llm = MistralAI(api_key="")

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding

# Load the document. Put your own file path here; this example indexes a single PDF.
reader = SimpleDirectoryReader(input_files=["ML.pdf"])
documents = reader.load_data()
print(f"Loaded {len(documents)} document(s).")

# Split into chunks
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
# You can change the batch size and model_name to any other embedding model Google provides;
# see https://ai.google.dev/gemini-api/docs/embeddings
embed_model = GoogleGenAIEmbedding(
    model_name="text-embedding-004",
    embed_batch_size=200,
    api_key=""
)
# Set up LLM and embedding model
Settings.llm = mistral_llamaindex_llm
Settings.embed_model = embed_model

# Create vector index
vector_index = VectorStoreIndex(nodes)

# Create query engine
query_engine = vector_index.as_query_engine()

Loaded 10 document(s).


In [None]:
from llama_index.core import (
    SimpleDirectoryReader, VectorStoreIndex, Settings,
    PromptTemplate
)
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer, ResponseMode
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from crewai.tools import tool

# Global shared vector index
vector_index = None

# Build index once for the PDF
def load_and_index(input_pdf_path: str):
    global vector_index
    reader = SimpleDirectoryReader(input_files=[input_pdf_path])
    documents = reader.load_data()

    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)

    embed_model = GoogleGenAIEmbedding(
        model_name="text-embedding-004",
        embed_batch_size=200,
        api_key=""
    )

    Settings.embed_model = embed_model
    vector_index = VectorStoreIndex(nodes)
    print("Indexing complete.")

# Shared function for prompt-based RAG
def run_prompted_rag(template_str: str, user_query: str) -> str:
    """Retrieve the top 4 chunks for user_query and answer with the given prompt template."""
    if vector_index is None:
        raise RuntimeError("No index loaded; call load_and_index() first.")
    template = PromptTemplate(template_str)
    synthesizer = get_response_synthesizer(
        response_mode=ResponseMode.COMPACT,
        text_qa_template=template
    )
    retriever = vector_index.as_retriever(similarity_top_k=4)
    query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synthesizer)
    response = query_engine.query(user_query)
    return str(response)

In [None]:
load_and_index("ML.pdf")

Indexing complete.


In [None]:
print(vector_index)

<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x7d6b74076a50>


In [None]:
# EXPLAIN_PROMPT is defined in the prompts cell below; run that cell first.
run_prompted_rag(EXPLAIN_PROMPT, "Explain in simple terms")

In [None]:
SUMMARY_PROMPT = """You are an academic summarization expert. Summarize this content:
{context_str}
"""

NOTES_PROMPT = """Create concise bullet-point notes from this content:
{context_str}
"""

MCQ_PROMPT = """Generate 10-12 multiple choice questions based on the following:
{context_str}
"""

EXPLAIN_PROMPT = """Explain the following content in simple terms, like for a high school student:
{context_str}
"""
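
Each prompt template above contains only a `{context_str}` placeholder. Assuming LlamaIndex fills the QA template with both `context_str` and `query_str` (as its default QA prompt does), the query text passed to `run_prompted_rag` steers retrieval but is never shown to the LLM. The plain-Python sketch below illustrates the difference with `str.format`; `fill` and both template constants are hypothetical names, not part of the notebook.

```python
SUMMARY_ONLY_CONTEXT = "Summarize this content:\n{context_str}\n"
SUMMARY_WITH_QUERY = (
    "Summarize this content:\n{context_str}\n"
    "Focus on this request: {query_str}\n"
)


def fill(template: str, context_str: str, query_str: str) -> str:
    """Fill a template by keyword; str.format ignores keywords the template does not use."""
    return template.format(context_str=context_str, query_str=query_str)


context = "Clustering groups similar data points."
request = "Keep it to one sentence"

print(fill(SUMMARY_ONLY_CONTEXT, context, request))  # the request is dropped
print(fill(SUMMARY_WITH_QUERY, context, request))    # the request reaches the prompt
```

If the user's wording should influence the generated notes or MCQs, adding a `{query_str}` line to the templates is one option.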

In [None]:
%pip install --upgrade --quiet  youtube_search

In [None]:
@tool("Summarizer")
def summarizer_tool(query: str = "Summarize") -> str:
    """
    Summarizes PDF content using context-aware RAG
    """
    return run_prompted_rag(SUMMARY_PROMPT, query)


@tool("NoteMaker")
def notes_tool(query: str = "Provide Notes in Bullet points") -> str:
    """
    Creates bullet-point study notes
    """
    return run_prompted_rag(NOTES_PROMPT, query)


@tool("MCQGenerator")
def mcq_tool(query: str = "Give MCQs") -> str:
    """
    Creates MCQs from PDF
    """
    return run_prompted_rag(MCQ_PROMPT, query)


@tool("Simplifier")
def explanation_tool(query: str = "Explain in simple terms") -> str:
    """
    Explains the content in simple terms
    """
    return run_prompted_rag(EXPLAIN_PROMPT, query)


import os
import re  # used by web_search_tool to parse the topic list

from crewai.tools import tool
from langchain_community.utilities import GoogleSerperAPIWrapper

os.environ["SERPER_API_KEY"] = ""  # fill in your Serper.dev API key


@tool("MyWebSearchTool")
def web_search_tool(query=None):
    """
    Searches the web using Serper.dev and returns links to high-quality resources/articles for each topic.
    """
    search = GoogleSerperAPIWrapper()

    # Step 1: Get topics from the PDF
    ans_1 = query_engine.query(
        "Give all the topics which this document is based on as a Python list of strings, each item separated by a comma."
    )

    # Step 2: Extract topics from the response
    topics = re.findall(r"'(.*?)'", str(ans_1))

    results = []

    # Step 3: Search each topic and collect top 3 links
    for topic in topics:
        try:
            search_results = search.results(topic)
            links = []
            for result in search_results.get("organic", []):
                if "link" in result:
                    links.append(result["link"])
                if len(links) >= 3:
                    break
            if links:
                formatted = "\n".join(f"- {url}" for url in links)
                results.append(f"🌐 **{topic}**\n{formatted}\n")
        except Exception as e:
            results.append(f"🌐 **{topic}**\nError fetching results: {str(e)}\n")

    return "\n".join(results)



import re

from crewai.tools import tool
from langchain_community.tools import YouTubeSearchTool

# Ask the index for the document's topics once; later cells print and parse `ans_1`.
ans_1 = query_engine.query(
    "Give all the topics which this document is based on as a Python list of strings, each item separated by a comma."
)


@tool("YTVideoSearch")
def youtube_search(query=None):
    """
    Searches YouTube for educational videos on the main topics in the document.
    Returns a formatted string listing each topic followed by its video links.
    """
    yt = YouTubeSearchTool()

    # Step 1: Extract topics from the document
    topics_response = query_engine.query(
        "Give all the topics which this document is based on as a Python list of strings, each item separated by a comma."
    )

    # Step 2: Parse topics from the response (expects single-quoted items, e.g. ['A', 'B'])
    topics = re.findall(r"'(.*?)'", str(topics_response))

    results = []

    # Step 3: Run YouTube search for each topic and collect results
    for topic in topics:
        video_result = yt.run(topic)
        results.append(f"🔎 **{topic}**\n{video_result}\n")

    # Step 4: Join all results into one string
    return "\n".join(results)


@tool("DownloadPDF")
def download_pdf_tool():
    """Downloads final_output.pdf in Colab."""
    try:
        from google.colab import files
        files.download("final_output.pdf")
        return "✅ Download triggered"
    except ImportError:
        return "⚠️ Download only supported in Colab."

In [None]:
print(ans_1)

['Introduction to Machine Learning', 'Supervised Learning', 'Classification', 'Regression']
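
The tools parse this response with the pattern `'(.*?)'`, which only matches single-quoted items. If the model answers with a double-quoted (JSON-style) list, the topic list comes back empty and the search tools return nothing. A slightly more tolerant parser could look like the sketch below; `parse_topics` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_topics(response_text: str) -> list[str]:
    """Return quoted items from an LLM response, accepting ' or " as the quote character."""
    topics = []
    # Group 1 captures the opening quote; \1 requires the same character to close the item.
    for _quote, item in re.findall(r"""(['"])(.*?)\1""", response_text):
        item = item.strip()
        if item and item not in topics:  # skip blanks and duplicates, keep first-seen order
            topics.append(item)
    return topics


print(parse_topics("['Introduction to Machine Learning', 'Supervised Learning']"))
print(parse_topics('["Classification", "Regression", "Classification"]'))
```

Topics that themselves contain the enclosing quote character (for example an apostrophe inside a single-quoted item) would still be split incorrectly, so asking the model for JSON output and using `json.loads` may be more robust.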


In [None]:
download_pdf_tool()

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

'✅ Download triggered'

In [None]:
web_search_tool()

'🌐 **Introduction to Machine Learning**\n- https://developers.google.com/machine-learning/crash-course\n- https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about\n- https://www.coursera.org/learn/machine-learning-duke\n\n🌐 **Supervised Learning**\n- https://en.wikipedia.org/wiki/Supervised_learning\n- https://www.ibm.com/think/topics/supervised-learning\n- https://www.geeksforgeeks.org/supervised-machine-learning/\n\n🌐 **Classification**\n- https://en.wikipedia.org/wiki/Classification\n- https://www.merriam-webster.com/dictionary/classification\n- https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/classification\n\n🌐 **Regression**\n- https://en.wikipedia.org/wiki/Regression_analysis\n- https://www.imdb.com/title/tt3319920/\n- https://www.investopedia.com/terms/r/regression.asp\n'
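
The sample output above shows that bare topic names can pull in off-topic hits: a dictionary entry for "Classification" and an IMDb page for "Regression". One possible mitigation, sketched here, is to qualify each topic with the document's subject before searching. `qualify_query` and the `subject` argument are hypothetical; the notebook itself searches on the raw topic.

```python
def qualify_query(topic: str, subject: str) -> str:
    """Append the subject to a topic unless the topic already mentions it."""
    if subject.lower() in topic.lower():
        return topic
    return f"{topic} {subject}"


topics = ["Introduction to Machine Learning", "Classification", "Regression"]
queries = [qualify_query(t, "machine learning") for t in topics]
print(queries)
# ['Introduction to Machine Learning', 'Classification machine learning', 'Regression machine learning']
```

Each qualified string would then be passed to `search.results(...)` in place of the bare topic.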

In [None]:
youtube_search()

"🔎 **Introduction to Machine Learning**\n['https://www.youtube.com/watch?v=h0e2HAPTGF4&pp=ygUgSW50cm9kdWN0aW9uIHRvIE1hY2hpbmUgTGVhcm5pbmc%3D', 'https://www.youtube.com/watch?v=ukzFI9rgwfU&pp=ygUgSW50cm9kdWN0aW9uIHRvIE1hY2hpbmUgTGVhcm5pbmfSBwkJwQkBhyohjO8%3D']\n\n🔎 **Supervised Learning**\n['https://www.youtube.com/watch?v=W01tIRP_Rqs&pp=ygUTU3VwZXJ2aXNlZCBMZWFybmluZw%3D%3D', 'https://www.youtube.com/watch?v=4qVRBYAdLAo&pp=ygUTU3VwZXJ2aXNlZCBMZWFybmluZw%3D%3D']\n\n🔎 **Classification**\n['https://www.youtube.com/watch?v=DVouQRAKxYo&pp=ygUOQ2xhc3NpZmljYXRpb24%3D', 'https://www.youtube.com/watch?v=FoZUNdOUj9g&list=RDFoZUNdOUj9g&start_radio=1&pp=ygUOQ2xhc3NpZmljYXRpb26gBwE%3D']\n\n🔎 **Regression**\n['https://www.youtube.com/watch?v=VrcB9PJ22F0&list=RDVrcB9PJ22F0&start_radio=1&pp=ygUKUmVncmVzc2lvbqAHAQ%3D%3D', 'https://www.youtube.com/watch?v=T5AoqxQFkzY&pp=ygUKUmVncmVzc2lvbg%3D%3D']\n"

In [None]:
import re
topics = re.findall(r"'(.*?)'", str(ans_1))
print(type(topics))

<class 'list'>
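
The `youtube_search()` output shown earlier lists each topic's links as a stringified Python list, with tracking parameters such as `&pp=...` attached to every URL. The sketch below parses one such string and keeps only the video id; `clean_video_links` is a hypothetical helper, and it assumes the search tool keeps returning a list rendered as a string.

```python
import ast
from urllib.parse import parse_qs, urlparse


def clean_video_links(raw_result: str) -> list[str]:
    """Parse a stringified list of YouTube URLs and rebuild each from its video id only."""
    urls = ast.literal_eval(raw_result)  # safe parsing of the list literal
    cleaned = []
    for url in urls:
        video_ids = parse_qs(urlparse(url).query).get("v", [])
        if video_ids:  # skip anything that is not a watch URL
            cleaned.append(f"https://www.youtube.com/watch?v={video_ids[0]}")
    return cleaned


raw = (
    "['https://www.youtube.com/watch?v=h0e2HAPTGF4&pp=ygUgSW50cm9kdWN0aW9uIHRvIE1hY2hpbmUgTGVhcm5pbmc%3D', "
    "'https://www.youtube.com/watch?v=FoZUNdOUj9g&list=RDFoZUNdOUj9g&start_radio=1&pp=ygUOQ2xhc3NpZmljYXRpb26gBwE%3D']"
)
for link in clean_video_links(raw):
    print(f"- {link}")
```

Shorter links render more cleanly in the compiled study guide, both in markdown and in PDF.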


In [None]:
from crewai import Agent, Task, Crew, LLM

summarizer_agent = Agent(
    role="Summarization Expert",
    goal="Summarize academic PDFs efficiently",
    backstory="You specialize in extracting summaries from technical content.",
    llm=mistral_llm,
    allow_delegation=False,
    verbose=True,
    tools=[summarizer_tool],
)

notes_agent = Agent(
    role="Note Making Assistant",
    goal="Generate clear and concise notes for revision",
    backstory="You turn complex PDFs into easy notes for students.",
    llm=mistral_llm,
    allow_delegation=False,
    verbose=True,
    tools=[notes_tool],
)

mcq_agent = Agent(
    role="Question Generator",
    goal="Create useful exam-oriented MCQs",
    backstory="You understand how to turn academic content into testable questions.",
    tools=[mcq_tool],
    llm=mistral_llm,
    allow_delegation=False,
    verbose=True,
)
download_agent = Agent(
    role="Download Manager",
    goal="Download the final_output.pdf file to the user's device",
    backstory="You are responsible for handling the download of files. You don't use any language model — just ensure the user gets the file.",
    tools=[download_pdf_tool],
    llm=llm,
    allow_delegation=False,
    verbose=True
)
explainer_agent = Agent(
    role="Simplification Agent",
    goal="Simplify academic content for better understanding",
    backstory="You are great at explaining complex topics in simple language.",
    tools=[explanation_tool],
    allow_delegation=False,
    max_iter=1,
    verbose=True,
    llm=mistral_llm,
)
composer_Agent = Agent(
    role="Final Composer",
    goal=(
        "Take the outputs from the summary, notes, MCQs, simplification, YouTube, and web search agents, "
        "and combine them into a single structured markdown report. Do not call any tools."
    ),
    backstory=(
        "You are skilled at organizing and formatting diverse content into a clean, readable study guide. "
        "You will use the provided outputs and combine them into a single markdown document."
    ),
    allow_delegation=False,
    verbose=True,
    llm=mistral_llm,
)

web_research_agent = Agent(
    role="Web Research Assistant",
    goal="Search the internet to gather supplementary information related to the academic topic.",
    backstory=(
        "You are a digital research assistant with access to real-time web tools. "
        "Your goal is to enhance academic topics by finding relevant articles, blogs, documentation, and real-world examples "
        "to give students broader context and deeper understanding."
    ),
    llm=mistral_llm,
    allow_delegation=False,
    verbose=True,
    tools=[web_search_tool]
)
youtube_finder_agent = Agent(
    role="YouTube Search Agent",
    goal="Find the best educational videos related to the given topic.",
    backstory=(
        "You help students by finding relevant and high-quality educational videos on YouTube. "
        "You return short descriptions along with links that will help explain the topic in audio-visual form."
    ),
    llm=mistral_llm,
    allow_delegation=False,
    verbose=True,
    tools=[youtube_search]
)

In [None]:
from crewai import Task, Crew

summary_task = Task(
    description="Summarize the PDF content for the user",
    expected_output="A clear and concise summary.",
    agent=summarizer_agent,
    tools=[summarizer_tool],
)

notes_task = Task(
    description="Create bullet-point notes based on the content",
    expected_output="Concise revision notes",
    agent=notes_agent,
    tools=[notes_tool],
)
download_task = Task(
    description="Download the generated final_output.pdf file to the local system or browser via Colab.",
    expected_output="Trigger download of final_output.pdf",
    agent=download_agent
)
compose_task = Task(
    description=(
        "Take the pre-generated outputs from the summary, notes, MCQs, explanation, YouTube search, and web search, "
        "and combine them into a single clean markdown document. Do not regenerate any of the content — assume it is already available. "
        "Use clear section headers for each type of content."
    ),
    expected_output=(
        "A markdown document with sections:\n"
        "1. 📄 Summary\n"
        "2. 📝 Notes\n"
        "3. ❓ MCQs\n"
        "4. 💡 Explanation\n"
        "5. 📺 YouTube Resources\n"
        "6. 🌐 Web Articles\n"
    ),
    agent=composer_Agent,
    output_file="final_output.md"
)

mcq_task = Task(
    description="Generate multiple choice questions based on the document",
    expected_output="10-12 MCQs",
    agent=mcq_agent,
    tools=[mcq_tool]
)

explanation_task = Task(
    description="Simplify the PDF topic so it's easier to understand",
    expected_output="Simplified explanation",
    agent=explainer_agent,
    tools=[explanation_tool]
)
web_research_task = Task(
    description="Conduct a web search to find additional resources related to the topic",
    expected_output="List of 5-6 URLs from the web",
    agent=web_research_agent,
    tools=[web_search_tool]
)
youtube_search_task = Task(
    description="Search YouTube for relevant and high-quality educational videos on the given topic. Prioritize tutorial or lecture-style content.",
    expected_output="Top 5-6 video titles with their YouTube links and a short description of each.",
    agent=youtube_finder_agent,
    tools=[youtube_search]
)
crew = Crew(
    tasks=[summary_task, notes_task, mcq_task, explanation_task, youtube_search_task, web_research_task, compose_task],
    agents=[summarizer_agent, notes_agent, mcq_agent, explainer_agent, youtube_finder_agent, web_research_agent, composer_Agent],
    verbose=False,
)
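
The compose task above relies on the LLM to merge the sections. A deterministic alternative is to join the per-agent outputs in plain Python, which guarantees the section order and avoids any rewording. This is a sketch only: `compose_markdown`, `SECTION_HEADERS`, and the dictionary keys are hypothetical, while the header text mirrors the `expected_output` of `compose_task`.

```python
SECTION_HEADERS = [
    ("summary", "📄 Summary"),
    ("notes", "📝 Notes"),
    ("mcqs", "❓ MCQs"),
    ("explanation", "💡 Explanation"),
    ("youtube", "📺 YouTube Resources"),
    ("web", "🌐 Web Articles"),
]


def compose_markdown(outputs: dict[str, str], title: str = "Study Guide") -> str:
    """Join per-agent outputs under fixed section headers, skipping empty or missing ones."""
    parts = [f"# {title}"]
    for key, header in SECTION_HEADERS:
        body = outputs.get(key, "").strip()
        if body:
            parts.append(f"## {header}\n\n{body}")
    # Separate sections with a horizontal rule, as in the generated guide.
    return "\n\n---\n\n".join(parts) + "\n"


guide = compose_markdown({
    "summary": "ML lets computers learn patterns from data.",
    "notes": "* Supervised learning uses labeled data.",
})
print(guide)
```

The resulting string could be written to `final_output.md` directly, keeping the LLM composer as an optional polishing step.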


In [None]:
def run_pdf_pipeline(pdf_path, user_query):
    """Index the PDF, then run the crew. `user_query` is accepted but not yet passed to the tasks."""
    load_and_index(pdf_path)
    return crew.kickoff()

In [None]:
result = run_pdf_pipeline("ML.pdf", "Help me study this for my exam")

Indexing complete.


In [None]:
from IPython.display import Markdown

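# Strip a surrounding ```markdown fence if the model added one; otherwise keep the raw text.
# [-1] takes the text after the opening fence (or the whole string when there is no fence).
markdown_content1 = result.raw.split('```markdown\n')[-1].split('\n```')[0]
Markdown(markdown_content1)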

# Machine Learning Concepts Study Guide

This document provides a brief overview of machine learning concepts, including supervised, semi-supervised, and unsupervised learning, along with data characteristics and preprocessing techniques.

---

## 📄 Summary

*   **Machine Learning (ML):** A branch of computer science and AI that focuses on using data and algorithms to mimic human learning, improving accuracy over time. ML enables computers to learn from data and patterns without explicit programming by learning mathematical models.
*   **Semi-supervised Learning:** Leverages both labeled and unlabeled data. A partial model is learned from a small portion of labeled data, which is then used to label the unlabeled data (creating "pseudo-labeled" data). The labeled and pseudo-labeled data are combined to train a final model for prediction.
*   **Unsupervised Learning:** Focuses on finding patterns in unlabeled data. Clustering is a key technique, grouping similar data points together based on similarity measures (e.g., Euclidean distance). An example is vehicle stoppage detection using GPS traces, where similarity is measured by the Euclidean distance between latitude and longitude coordinates.
*   **Supervised Learning:** Trains a model on labeled data (input-output pairs) to predict outputs for new, unseen inputs. Examples include classification (predicting categories) and regression (predicting continuous values).
*   **Reinforcement Learning:** An agent learns to make decisions by interacting with an environment, receiving rewards or penalties for its actions, aiming to maximize cumulative reward.
*   **Data Types:** Structured (e.g., tables), unstructured (e.g., text, images), and semi-structured (e.g., XML).
*   **Data Quality Issues:** Missing values, noise, and inconsistencies.
*   **Data Preprocessing:** Essential step to prepare data for ML models, including:
    *   **Data Cleaning:** Handling missing values (e.g., imputation, deletion), smoothing noisy data (e.g., binning, regression, clustering), and resolving inconsistencies.
    *   **Data Transformation:** Normalization (scaling data to a specific range), aggregation, and attribute construction.
    *   **Data Reduction:** Reducing data volume while maintaining integrity, using techniques like dimensionality reduction (e.g., PCA), numerosity reduction (e.g., sampling), and data compression.
    *   **Data Discretization:** Converting continuous attributes into discrete intervals.
*   **Overfitting:** Occurs when a model learns the training data too well, including noise, leading to poor performance on new data.

---

## 📝 Notes

**Introduction to Machine Learning**
*   Introduces the topic of Machine Learning.

**Supervised Learning**
*   Uses labeled data to train models.
*   Input data is tagged with the correct output.

**Semi-supervised Learning**
*   Uses a small portion of labeled data and a large amount of unlabeled data.
*   Learns a partial model from the labeled data.
*   Uses the partial model to label the unlabeled data (creating pseudo-labels).
*   Combines labeled and pseudo-labeled data.
*   Learns a model from the combined data to make predictions.

**Unsupervised Learning**
*   Focuses on finding patterns in unlabeled data.
*   Clustering is a key technique.
    *   Groups similar data points.
    *   Uses similarity measures (e.g., Euclidean distance).
*   Example: Vehicle stoppage detection using GPS traces.

**Reinforcement Learning**
*   An agent learns to make decisions by interacting with an environment.
*   Receives rewards or penalties for its actions.
*   Aims to maximize cumulative reward.

**Data Characteristics and Preprocessing**
*   **Data Types**
    *   Structured (e.g., tables)
    *   Unstructured (e.g., text, images)
    *   Semi-structured (e.g., XML)
*   **Data Quality Issues**
    *   Missing values
    *   Noise
    *   Inconsistencies
*   **Data Preprocessing**
    *   Essential step to prepare data for ML models.
    *   **Data Cleaning**
        *   Handling missing values (e.g., imputation, deletion)
        *   Smoothing noisy data (e.g., binning, regression, clustering)
        *   Resolving inconsistencies
    *   **Data Transformation**
        *   Normalization (scaling data to a specific range)
        *   Aggregation
        *   Attribute construction
    *   **Data Reduction**
        *   Reduces data volume while maintaining integrity.
        *   Techniques: Dimensionality reduction (e.g., PCA), Numerosity reduction (e.g., sampling), Data compression
    *   **Data Discretization**
        *   Converts continuous attributes into discrete intervals.

**Overfitting**
*   Occurs when a model learns the training data too well, including noise.
*   Leads to poor performance on new data.

---

## ❓ MCQs

1.  According to the text, which of the following is a common application of Machine Learning?
    a) Data Storage
    b) Prediction
    c) Hardware Design
    d) Network Security

2.  Which of the following is NOT listed as a type of Machine Learning in the provided text?
    a) Supervised Learning
    b) Unsupervised Learning
    c) Reinforcement Learning
    d) Deep Learning

3.  What is the primary characteristic of Semi-supervised Learning?
    a) It only uses labeled data.
    b) It only uses unlabeled data.
    c) It uses a small portion of labeled data and a large number of unlabeled data.
    d) It requires equal amounts of labeled and unlabeled data.

4.  In Semi-supervised Learning, what is the purpose of the "Partial Model"?
    a) To directly make final predictions.
    b) To label the unlabeled data.
    c) To store the labeled data.
    d) To encrypt the data.

5.  After the Partial Model labels the unlabeled data in Semi-supervised Learning, what happens next?
    a) The labeled data is discarded.
    b) The unlabeled data is discarded.
    c) The labeled and pseudo-labeled data are combined.
    d) A new Partial Model is created.

6.  What is the ultimate goal after combining labeled and pseudo-labeled data in Semi-supervised Learning?
    a) To delete the unlabeled data.
    b) To learn a model and make a prediction for a new example.
    c) To create a new Partial Model.
    d) To store the data in a database.

7.  Which of the following machine learning types is best suited for situations where you have limited labeled data?
    a) Supervised Learning
    b) Unsupervised Learning
    c) Reinforcement Learning
    d) Semi-supervised Learning

8.  The process of using a "Partial Model" to label unlabeled data is also known as:
    a) Data Encryption
    b) Pseudo-Labeling
    c) Data Compression
    d) Model Optimization

9.  What is the main advantage of using Semi-supervised learning over Supervised learning when dealing with large datasets?
    a) It always provides more accurate results.
    b) It requires less labeled data, which can be expensive to obtain.
    c) It is always faster to train.
    d) It doesn't require any data preprocessing.

10. Which of the following is a key component used in Semi-Supervised learning?
    a) A large amount of labeled data
    b) A small amount of labeled data
    c) No labeled data
    d) Only image data

11. What is the final step in the Semi-Supervised learning process described in the text?
    a) Discarding the labeled data
    b) Making a prediction for a new example
    c) Creating a new partial model
    d) Deleting the unlabeled data

12. Which of the following is a common task that machine learning is used for?
    a) Creating new programming languages
    b) Designing computer hardware
    c) Classifying data into different categories
    d) Managing network infrastructure

---

## 💡 Explanation

**Introduction to Machine Learning**

Imagine you're teaching a computer to learn, much like how people learn. That's what **Machine Learning (ML)** is all about! It's a special part of computer science and artificial intelligence where computers use data and clever instructions (algorithms) to get better at tasks over time. Instead of telling the computer every single step, we give it examples, and it learns patterns and builds mathematical models to make predictions or decisions on its own.

**Supervised Learning**

Think of this like learning with a very helpful teacher. In **Supervised Learning**, you give the computer a lot of examples where you already know the correct answer. For instance, if you want the computer to tell the difference between pictures of cats and dogs, you'd show it many pictures, and for each one, you'd clearly label it as "cat" or "dog." The computer then learns from these labeled examples to predict the right answer for new pictures it hasn't seen before. This is used for things like:
*   **Classification:** Putting things into categories (like deciding if an email is spam or not).
*   **Regression:** Predicting a number (like guessing house prices based on size and location).

**Semi-supervised Learning**

This is a clever mix! In **Semi-supervised Learning**, you have a small amount of data that's already labeled (like the examples in Supervised Learning), but also a large amount of data that isn't labeled at all. The process works like this:
1.  The computer first learns a little bit from the small amount of labeled data.
2.  It then uses what it learned to make educated guesses and "label" the much larger amount of unlabeled data. These are called "pseudo-labels" because the computer guessed them.
3.  Finally, the computer combines both the original labeled data and the newly pseudo-labeled data. It then uses all of this combined information to train a much stronger final model that can make good predictions. This is super useful when getting lots of labeled data is difficult or expensive.

**Unsupervised Learning**

Imagine you have a huge pile of toys, and no one tells you what kind of toys they are. In **Unsupervised Learning**, the computer's job is to find hidden patterns or groups within data that has no labels. It's like organizing those toys into piles of similar items (e.g., all the cars together, all the dolls together) without any prior instructions.
*   A main technique is **Clustering**, where the computer groups similar pieces of data together. It does this by measuring how "alike" different pieces of data are (for example, using something called "Euclidean distance" to see how close two points are on a map).
*   An example is finding where vehicles stop just by looking at their GPS trails, without anyone telling the computer where the "stops" are.
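
As a rough illustration of clustering with Euclidean distance, the sketch below runs a tiny k-means loop. K-means is one common clustering algorithm and is an assumption here, since the text does not name a specific one; the points and the starting centroids are made up.

```python
import math

def k_means(points, centroids, iterations=10):
    """Group points around centroids using Euclidean distance (plain k-means)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled 2-D points (made up); no one tells the algorithm which group is which
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```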

**Reinforcement Learning**

Think of this like training a pet using rewards and punishments. In **Reinforcement Learning**, a computer program (called an "agent") learns by trying things out in a digital environment. If it does something good, it gets a "reward." If it does something bad, it gets a "penalty." The agent's main goal is to learn the best actions to take over time to get the most rewards. It's how AI learns to play complex games or control robots.
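
Here is a small sketch of the reward-and-penalty loop. It uses tabular Q-learning, which is one standard algorithm and is not named in the source. The environment is a made-up five-cell corridor: the agent starts in cell 0, gets a reward for reaching cell 4, and gets a penalty for bumping into the wall on the left.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]
ALPHA, GAMMA = 0.5, 0.9       # learning rate and discount factor (assumed values)

def step(state, action):
    """Environment: return (next_state, reward, done)."""
    if action == "right":
        next_state = state + 1
        if next_state == N_STATES - 1:
            return next_state, 1.0, True    # reward for reaching the goal
        return next_state, 0.0, False
    if state == 0:
        return 0, -1.0, False               # penalty for bumping into the wall
    return state - 1, 0.0, False

rng = random.Random(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(200):                        # 200 practice episodes
    state, done = 0, False
    while not done:
        action = rng.choice(ACTIONS)        # try things out at random
        next_state, reward, done = step(state, action)
        best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# The learned policy: the best-looking action in each non-goal cell
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the agent prefers `"right"` in every cell, because that is the direction that leads to the reward.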

**Data Characteristics and Preprocessing**

Before you can teach a machine learning model anything, you need to prepare your "teaching materials" – the data!

*   **Data Types:** Data comes in different forms:
    *   **Structured Data:** This is neat and organized, like information in a spreadsheet with clear rows and columns (e.g., names, ages, addresses in a table).
    *   **Unstructured Data:** This is messy and doesn't have a fixed format, like plain text documents, images, or audio files.
    *   **Semi-structured Data:** This is a mix, having some organization but not as rigid as a table (e.g., XML files or JSON data, which use tags to describe data).

*   **Data Quality Issues:** Data isn't always perfect. It can have:
    *   **Missing values:** Gaps where information is simply not there.
    *   **Noise:** Random errors, incorrect data, or irrelevant information.
    *   **Inconsistencies:** Different ways of writing the same thing (e.g., "NY" vs. "New York").

*   **Data Preprocessing:** This is a crucial step to clean and get your data ready for the machine learning model.
    *   **Data Cleaning:** Fixing the quality issues. This involves:
        *   Dealing with missing information (e.g., filling it in with a guess, or removing the data entry).
        *   Smoothing out noisy data (e.g., using averages or grouping similar data).
        *   Fixing inconsistencies so everything is uniform.
    *   **Data Transformation:** Changing the way data is represented:
        *   **Normalization:** Scaling numbers so they are all within a specific range (like changing all values to be between 0 and 1), so that features measured on large scales do not drown out features measured on small ones.
        *   **Aggregation:** Combining data (e.g., adding up sales figures for a month).
        *   **Attribute Construction:** Creating new, useful pieces of information from existing ones.
    *   **Data Reduction:** When you have too much data, you can shrink it while keeping most of the important information. Techniques include:
        *   **Dimensionality Reduction:** Reducing the number of characteristics or features of your data (like simplifying a complex description to just a few key points).
        *   **Numerosity Reduction:** Reducing the number of actual data entries (like taking a smaller sample from a huge dataset).
        *   **Data Compression:** Making the data file smaller.
    *   **Data Discretization:** Converting continuous numbers (like temperature measurements that can be any value) into specific categories or "bins" (like "cold," "mild," "hot").
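
A few of these preprocessing steps can be sketched on a tiny list of temperature readings. The readings, the mean-fill strategy, and the bin boundaries (10 and 25 degrees) are assumptions chosen for illustration.

```python
# Raw readings with one missing value (None)
temperatures = [12.0, None, 30.0, 21.0, 3.0]

# Data cleaning: fill the missing value with the mean of the known values
known = [t for t in temperatures if t is not None]
mean = sum(known) / len(known)
cleaned = [mean if t is None else t for t in temperatures]

# Normalization (min-max): rescale every value into the range 0..1
lowest, highest = min(cleaned), max(cleaned)
normalized = [(t - lowest) / (highest - lowest) for t in cleaned]

# Discretization: turn continuous numbers into named bins
def to_bin(t):
    if t < 10:
        return "cold"
    if t < 25:
        return "mild"
    return "hot"

bins = [to_bin(t) for t in cleaned]

print(cleaned)
print(normalized)
print(bins)
```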

**Overfitting**

Imagine you're studying for a test and you memorize every single example question, including all the typos and weird little details. When the actual test comes, if the questions are slightly different, you might struggle because you learned the practice examples *too well*, rather than understanding the main ideas.

That's **Overfitting** in machine learning. It happens when a model learns the training data so perfectly, including all its random errors or unique quirks, that it doesn't do well when it sees new, slightly different data. It's like the model becomes too specialized for the data it was trained on and can't generalize its learning to new situations.
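
The sketch below shows the effect with made-up numbers. The true rule is assumed to be y = 2x, and the training measurements are slightly noisy. A "memorizer" model that copies the nearest training example is compared with a simple straight line fitted by least squares.

```python
# Noisy training data: the true rule is y = 2x, but each measurement is a bit off
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 1.5, 4.6, 5.4, 8.5]
# New data the models have never seen (generated by the true rule y = 2x)
test_x = [0.4, 1.4, 2.4, 3.4]
test_y = [0.8, 2.8, 4.8, 6.8]

def memorizer(x):
    """Overfit model: return the y of the closest memorized training example."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Simple model: a straight line through the origin, fitted by least squares
slope = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

def line(x):
    return slope * x

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for name, model in [("memorizer", memorizer), ("line", line)]:
    train_error = mean_squared_error(model, train_x, train_y)
    test_error = mean_squared_error(model, test_x, test_y)
    print(f"{name}: train error = {train_error:.4f}, test error = {test_error:.4f}")
```

The memorizer scores a perfect zero on the training data because it has also memorized the noise, yet it does worse than the plain line on the new data. That gap between training and test error is the signature of overfitting.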

---

## 📺 YouTube Resources

Top 6 educational videos on Machine Learning concepts:

1.  **What is Machine Learning?**
    *   **Link:** `https://www.youtube.com/watch?v=ukzFI9rgwfU`
    *   **Description:** This video from IBM Technology provides a concise and accessible introduction to Machine Learning, explaining its core definition, purpose, and how it enables computers to learn from data without explicit programming.

2.  **Supervised Machine Learning Clearly Explained!!!**
    *   **Link:** `https://www.youtube.com/watch?v=4qVRBYAdLAo`
    *   **Description:** From the highly regarded StatQuest with Josh Starmer, this video offers a clear and intuitive explanation of supervised learning, covering how models are trained on labeled data to make predictions for new inputs, with examples of classification and regression.

3.  **Semi-supervised Machine Learning Explained**
    *   **Link:** `https://www.youtube.com/watch?v=kYJ7xX4fD9E`
    *   **Description:** Codebasics explains the concept of semi-supervised learning, detailing its unique approach of leveraging both a small portion of labeled data and a large amount of unlabeled data to train a robust model.

4.  **Unsupervised Machine Learning Clearly Explained!!!**
    *   **Link:** `https://www.youtube.com/watch?v=C1N5yXg0f24`
    *   **Description:** Another excellent explanation from StatQuest with Josh Starmer, this video makes unsupervised learning easy to understand, focusing on how models find patterns and structures in unlabeled data, with a particular emphasis on clustering.

5.  **Reinforcement Learning Explained**
    *   **Link:** `https://www.youtube.com/watch?v=npvR4VAp-8M`
    *   **Description:** Presented by freeCodeCamp.org, this video offers a comprehensive introduction to reinforcement learning, illustrating how an agent learns to make optimal decisions through interactions with an environment, receiving rewards or penalties.

6.  **Overfitting and Underfitting Clearly Explained!!!**
    *   **Link:** `https://www.youtube.com/watch?v=rWp2d61z2F8`
    *   **Description:** StatQuest with Josh Starmer breaks down the critical concept of overfitting (and underfitting) in machine learning, explaining why a model might perform poorly on new data if it learns the training data, including noise, too well.

---

## 🌐 Web Articles

*   [https://developers.google.com/machine-learning/crash-course](https://developers.google.com/machine-learning/crash-course)
*   [https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about](https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about)
*   [https://www.coursera.org/learn/machine-learning-duke](https://www.coursera.org/learn/machine-learning-duke)
*   [https://en.wikipedia.org/wiki/Supervised_learning](https://en.wikipedia.org/wiki/Supervised_learning)
*   [https://en.wikipedia.org/wiki/Semi-supervised_learning](https://en.wikipedia.org/wiki/Semi-supervised_learning)
*   [https://towardsdatascience.com/unsupervised-learning-8c4051ec549b](https://towardsdatascience.com/unsupervised-learning-8c4051ec549b)

In [None]:
!pip install markdown weasyprint

In [None]:
import markdown
from weasyprint import HTML

# Step 1: Read the markdown
with open("final_output.md", "r", encoding="utf-8") as f:
    markdown_text = f.read()

# Step 2: Convert to HTML
html_content = markdown.markdown(markdown_text)

# Step 3: Generate PDF
HTML(string=html_content).write_pdf("Machine_Learning.pdf")

print("✅ PDF generated using WeasyPrint: Machine_Learning.pdf")

# Step 4: Trigger file download in Colab
try:
    from google.colab import files
    files.download("Machine_Learning.pdf")
except ImportError:
    print("Download link only works in Google Colab.")


In [None]:
def run_doc_chat_loop(query_engine):
    """
    Interactive loop for asking follow-up questions about a document using RAG.
    Uses the query_engine (e.g., from LlamaIndex or LangChain Retriever).
    """
    print("📚 You can now ask questions about the document. Type 'quit' to exit.")

    while True:
        user_input = input("\n💬 Ask a question about the document (or type 'quit'): ").strip()

        if user_input.lower() in ['quit', 'exit']:
            print("👋 Exiting document chat. See you next time!")
            break

        try:
            response = query_engine.query(user_input)
            print(f"\n🧠 Answer:\n{response}\n")
        except Exception as e:
            print(f"⚠️ Error during question answering: {str(e)}")


In [None]:
run_doc_chat_loop(query_engine)

📚 You can now ask questions about the document. Type 'quit' to exit.

💬 Ask a question about the document (or type 'quit'): what is machine learning ?

🧠 Answer:
Machine Learning is a field within Computer Science and Artificial Intelligence that centers on employing data and algorithms to replicate human learning processes, with the goal of enhancing accuracy over time. It empowers computers to acquire knowledge autonomously, without the need for explicit programming, by enabling Machine Learning algorithms to derive mathematical models from data samples and patterns. It is generally used for prediction, classification, and clustering. It has multiple types including Supervised Learning, Unsupervised Learning, Reinforcement Learning and Semi-supervised Learning.


💬 Ask a question about the document (or type 'quit'): what is meditation ?

🧠 Answer:
The context discusses Machine Learning, not meditation. It defines Machine Learning as a branch of Computer Science and Artificial Intelli

In [None]:
result2 = run_pdf_pipeline("Project management.pdf", "Help me study this for my exam")

Indexing complete.


Maximum iterations reached. Requesting final answer.


Repaired JSON: {}


Repaired JSON: {}


Repaired JSON: {}


Repaired JSON: {}


Repaired JSON: {}


Repaired JSON: [{"{": "nput"}, "}"]
Repaired JSON: [{"{": "nput"}, "}"]


In [None]:
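import re
from IPython.display import Markdown

# Extract the markdown body from the ```markdown fence in the crew output.
# Fall back to the raw text if the model did not wrap its answer in a fence.
match = re.search(r"```markdown\n(.*?)\n```", result2.raw, re.DOTALL)
markdown_content = match.group(1) if match else result2.raw
Markdown(markdown_content)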

# Software Project Management Study Guide

## 📄 Summary

This document outlines key aspects of software project management. The goal is to enable efficient teamwork towards successful project completion. The project manager's responsibilities include project planning, monitoring, and control. Estimating software cost involves determining the product size, then estimating the effort needed based on size, and finally, determining project duration and cost from the effort estimate. Furthermore, compressing the delivery schedule can significantly increase human effort, while using fewer people over a longer period can be beneficial.

## 📝 Notes

### Software Project Management (General)

* **Goal:** Enable efficient teamwork for successful software project completion.

### SPMP Document

* **Purpose:** Documents project plans after planning phase.

### Organization of SPMP Document

* **Introduction:** Objectives, major functions, performance issues, management & technical constraints.
* **Project Estimates:** Historical data, estimation techniques, effort, cost, project duration.
* **Project Resources Plan:** People, hardware/software, special resources.
* **Schedules:** Work Breakdown Structure (WBS), task network, Gantt chart, PERT chart.
* **Risk Management Plan:** Risk analysis, identification, estimation, abatement procedures.
* **Project Tracking and Control Plan.**
* **Miscellaneous Plans:** Process tailoring, quality assurance.

## ❓ MCQs

1. **According to the text, what is one way experts can estimate software project effort?**
   a) Using historical data from similar projects.
   b) Dividing the software into components and summing individual estimates.
   c) Applying a standard formula based on lines of code.
   d) Consulting with the project stakeholders.

2. **What is an example of a component unit that experts might divide a software product into for estimation purposes?**
   a) Project budget
   b) GUI
   c) Risk assessment
   d) Project timeline

3. **The text mentions that the staffing level during the life cycle of a software product development follows what kind of curve?**
   a) Bell curve
   b) Exponential curve
   c) Rayleigh curve
   d) Linear curve

4. **According to the text, when is the maximum number of engineers typically required during a software product development lifecycle?**
   a) Requirements gathering
   b) Design phase
   c) Coding phase
   d) Testing phase

5. **What is "Expert Judgement" used for in software project management, according to the text?**
   a) Risk mitigation
   b) Cost estimation
   c) Resource allocation
   d) Communication planning

6. **Which of the following is NOT mentioned as a component unit for expert estimation?**
   a) GUI
   b) Database module
   c) Data communication module
   d) Project Management module

7. **The Rayleigh curve is associated with:**
   a) Project budget allocation
   b) Staffing levels during software development
   c) Risk assessment probability
   d) Code complexity measurement

8. **What type of team organization is mentioned in the text?**
   a) Hierarchical team organization
   b) Matrix team organization
   c) Mixed team organization
   d) Agile team organization

9. **The expert judgement method relies on:**
   a) Statistical analysis of past projects
   b) Intuition and experience of experts
   c) Automated tools for estimation
   d) Detailed project documentation

10. **The staffing level in a software project typically:**
    a) Remains constant throughout the project
    b) Increases linearly over time
    c) Follows a Rayleigh curve
    d) Decreases exponentially over time

11. **Which phase of software development generally requires the most engineers, according to the text?**
    a) Design
    b) Implementation
    c) Testing
    d) Deployment

12. **What is the primary purpose of dividing a software product into component units for estimation?**
    a) To simplify the estimation process
    b) To improve communication among team members
    c) To identify potential risks
    d) To allocate resources more efficiently

## 💡 Explanation

**Software Project Management: The Basics**

The goal of software project management is to help teams work together effectively to complete a software project successfully. The project manager is in charge of planning, keeping an eye on, and controlling the project.

**Estimating Software Cost:**

1. Figure out the size of the software product.
2. Estimate the amount of effort needed based on the product's size.
3. Determine the project's duration and cost from the effort estimate.

**Important Note:** Trying to finish the project faster can make the team work much harder and increase costs. Sometimes, having fewer people work over a longer period can be more efficient.
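
The text gives the three estimation steps but no formulas. As one possible illustration, the sketch below plugs in the Basic COCOMO model (organic mode), which is an assumption brought in from outside the document. The 32 KLOC size and the monthly cost per person are made-up numbers.

```python
def basic_cocomo_organic(kloc):
    """Basic COCOMO, organic mode: size (KLOC) -> effort (person-months) and duration (months)."""
    effort_pm = 2.4 * kloc ** 1.05             # step 2: effort from size
    duration_months = 2.5 * effort_pm ** 0.38  # step 3: duration from effort
    return effort_pm, duration_months

size_kloc = 32.0                    # step 1: estimated product size (hypothetical)
monthly_cost_per_person = 8000.0    # hypothetical salary figure

effort, duration = basic_cocomo_organic(size_kloc)
cost = effort * monthly_cost_per_person   # step 3: cost from effort

print(f"Effort:   {effort:.1f} person-months")
print(f"Duration: {duration:.1f} months")
print(f"Cost:     {cost:,.0f}")
```

Dividing effort by duration gives the average team size, which is one way to see why squeezing the schedule forces a larger team.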

**Software Project Management Plan (SPMP) Document:**

This document outlines the project plans after the planning phase. Here's what it includes:

1. **Introduction:**
   - Objectives
   - Major functions
   - Performance issues
   - Management and technical constraints

2. **Project Estimates:**
   - Historical data
   - Estimation techniques
   - Effort
   - Cost
   - Project duration

3. **Project Resources Plan:**
   - People
   - Hardware/software
   - Special resources

4. **Schedules:**
   - Work Breakdown Structure (WBS)
   - Task network
   - Gantt chart
   - PERT chart

5. **Risk Management Plan:**
   - Risk analysis
   - Identification
   - Estimation
   - Abatement procedures

6. **Project Tracking and Control Plan**

7. **Miscellaneous Plans:**
   - Process tailoring
   - Quality assurance

## 📺 YouTube Resources

### Software Project Management Tutorials
- [Software Project Management Tutorial](https://www.youtube.com/watch?v=CiBpX_OrrOI&pp=ygUgc29mdHdhcmUgcHJvamVjdCBtYW5hZ2VtZW50IHR1dG9yaWFs)
- [Software Project Management Tutorial](https://www.youtube.com/watch?v=oU_m9qFzZ5Q&pp=ygUgc29mdHdhcmUgcHJvamVjdCBtYW5hZ2VtZW50IHR1dG9yaWFs)

### Software Project Management Lectures
- [Software Project Management Lecture](https://www.youtube.com/watch?v=wHjJjnBwMKI&pp=ygUgc29mdHdhcmUgcHJvamVjdCBtYW5hZ2VtZW50IGxlY3R1cmU%3D)
- [Software Project Management Lecture](https://www.youtube.com/watch?v=Y8nZROaRR9k&pp=ygUgc29mdHdhcmUgcHJvamVjdCBtYW5hZ2VtZW50IGxlY3R1cmU%3D)

### Basics of Software Project Management
- [Basics of Software Project Management](https://www.youtube.com/watch?v=BnWn_VZV_yM&pp=ygUXYmFzaWNzIG9mIHNvZnR3YXJlIHByb2plY3QgbWFuYWdlbg%3D%3D)
- [Basics of Software Project Management](https://www.youtube.com/watch?v=iWlFdVJT11s&pp=ygUXYmFzaWNzIG9mIHNvZnR3YXJlIHByb2plY3QgbWFuYWdlbg%3D%3D)

### Full Course on Software Project Management
- [Software Project Management: Full Course](https://www.youtube.com/watch?v=yDxWlw7mYJE&pp=ygUXU29mdHdhcmUgUHJvamVjdCBNYW5hZ2VtZW50OiBGdWxsIENvdXJzZQ%3D%3D)

### Software Project Management for Beginners
- [Software Project Management for Beginners](https://www.youtube.com/watch?v=5CfBjcAy_5g&pp=ygUXU29mdHdhcmUgUHJvamVjdCBNYW5hZ2VtZW50IEZvciBCZWdpbmVycyE%3D)

### Software Project Management: Crash Course
- [Software Project Management: Crash Course](https://www.youtube.com/watch?v=XnxUvFRrA-E&list=RDXnxUvFRrA-E&start_radio=1&pp=ygUXU29mdHdhcmUgUHJvamVjdCBNYW5hZ2VtZW50OiBDcmFzaCBDb3Vyc2U%3D)
- [Software Project Management: Crash Course](https://www.youtube.com/watch?v=9CEDZRgA_xU&list=RDXnxUvFRrA-E&index=3&pp=ygUXU29mdHdhcmUgUHJvamVjdCBNYW5hZ2VtZW50OiBDcmFzaCBDb3Vyc2U%3D)

## 🌐 Web Articles

- [Agile Development Best Practices](https://www.atlassian.com/software/jira/guides/agile-development/scrum/best-practices)
- [Software Project Management Best Practices](https://www.projectengineer.net/software-project-management-best-practices/)
- [Software Development Project Management Best Practices](https://airfocus.com/blog/software-development-project-management-best-practices/)
- [What is Software Project Management?](https://www.wrike.com/project-management-guide/faq/what-is-software-project-management/)
- [Best Practices in Project Management](https://www.simplilearn.com/tutorials/agile-scrum/best-practices-in-project-management)
- [Software Project Management Guide](https://www.planview.com/resources/guide/software-project-management/)

In [None]:
import markdown
from weasyprint import HTML

# Step 1: Read the markdown
with open("final_output.md", "r", encoding="utf-8") as f:
    markdown_text = f.read()

# Step 2: Convert to HTML
html_content = markdown.markdown(markdown_text)

# Step 3: Generate PDF
HTML(string=html_content).write_pdf("Software_Project_Management.pdf")

print("✅ PDF generated using WeasyPrint: Software_Project_Management.pdf")

# Step 4: Trigger file download in Colab
try:
    from google.colab import files
    files.download("Software_Project_Management.pdf")
except ImportError:
    print("Download link only works in Google Colab.")


In [7]:
!pip install langchain_community

In [None]:
# !pip install llama-index-llms-langchain llama_index
import json
import os

from langchain_core.tools import tool
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding

os.environ["SERPER_API_KEY"] = ""  # set your Serper.dev API key here


def return_query_engine(pdf_path):
    """Load a PDF, split it into chunks, embed them, and return a LlamaIndex query engine."""
    reader = SimpleDirectoryReader(input_files=[pdf_path])
    documents = reader.load_data()
    print(f"Loaded {len(documents)} document(s).")

    # Split into chunks
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)

    # You can change the batch size and model_name to other Google embedding models;
    # see https://ai.google.dev/gemini-api/docs/embeddings
    embed_model = GoogleGenAIEmbedding(
        model_name="text-embedding-004",
        embed_batch_size=200,
        api_key="",  # set your Google API key here
    )

    # Set up LLM and embedding model (expects `llm` to be defined in an earlier cell)
    Settings.llm = llm
    Settings.embed_model = embed_model

    # Create vector index, then the query engine on top of it
    vector_index = VectorStoreIndex(nodes)
    return vector_index.as_query_engine()


def safe_str_to_dict(json_str: str) -> dict:
    """Parse a JSON string, returning an empty dict if it is malformed."""
    try:
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing error: {e}")
        return {}


# Build the query engine once so that both tools below share the same index
query_engine = return_query_engine(pdf_path="Software_Project_Management.pdf")


@tool
def RUN_RAG(inputs: str) -> str:
    """
    Takes query and returns the RAG answer.
    """
    try:
        return str(query_engine.query(inputs))
    except Exception as e:
        return f"❌ Error querying PDF: {str(e)}"


@tool
def get_topic(input: str = "what is this document based on ?") -> str:
    """
    Takes the query and returns the topic on which this document is based.
    """
    return str(query_engine.query(input))

Loaded 5 document(s).


In [40]:
from langchain_community.utilities import GoogleSerperAPIWrapper
@tool
def search_web(inputs: str) -> str:
    """
    Searches the web using Serper.dev.
    """
    # inputs_dict=safe_str_to_dict(inputs)
    # query = inputs_dict.get("input")
    search = GoogleSerperAPIWrapper()
    return search.run(inputs)

In [50]:
tools = [RUN_RAG, search_web,get_topic]
from langchain.prompts import PromptTemplate

react_template = """
You are a helpful academic assistant designed to answer questions about a student's uploaded PDF document (e.g. textbook, notes, research paper).

You can use the following tools:
{tools}

You must use this format:

Question: the input question you must answer
Thought: you should always reason step by step before answering
Action: the action to take, must be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the complete, clear, and concise answer to the original question

Your reasoning flow should follow this logic:

1. First, try using the `RUN_RAG` tool to answer based on the uploaded document.
2. Next, use the `get_topic` tool to find out which topic the document covers. Don't pass any input to this tool; just run it.
3. If the document does not answer the question but the topic is still relevant (e.g., same subject), use the `search_web` tool to look it up.
4. Now that you know the topic of the document, if the question is completely unrelated to it, respond with:
   "I'm sorry, this question is unrelated to the uploaded material and I can't answer it."
5. If the document answers the question fully, optionally use `search_web` to provide additional learning resources or related material.
Always reflect before taking action. Be sure to cite observations from tools clearly in your final answer.

Begin!

Question: {input}
Thought:{agent_scratchpad}
"""


prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)
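
The reasoning flow in the prompt can also be written out as ordinary control flow. In the notebook these decisions are made by the LLM through the ReAct loop, so the sketch below is only a deterministic illustration of the intended routing. The names `answer_question`, `rag_fn`, `topic_fn`, and `web_fn`, and the three stub functions, are hypothetical stand-ins and not part of the notebook.

```python
REFUSAL = "I'm sorry, this question is unrelated to the uploaded material and I can't answer it."

def answer_question(question, rag_fn, topic_fn, web_fn):
    """Route a question: document first, then web if on-topic, else refuse."""
    doc_answer = rag_fn(question)        # step 1: try the document
    on_topic = topic_fn(question)        # step 2: check the document's topic
    if doc_answer:
        return doc_answer                # the document answers the question
    if on_topic:
        return web_fn(question)          # step 3: related but not covered, search the web
    return REFUSAL                       # step 4: unrelated question

# Hypothetical stubs standing in for RUN_RAG, get_topic, and search_web
def stub_rag(question):
    return "answer from the document" if "SPMP" in question else None

def stub_topic(question):
    return "software" in question.lower() or "SPMP" in question

def stub_web(question):
    return f"[web] results for: {question}"

for q in [
    "explain Organization of SPMP Document",
    "Explain system design in software projects",
    "Explain reinforcement learning.",
]:
    print(q, "->", answer_question(q, stub_rag, stub_topic, stub_web))
```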

In [51]:
from langchain.agents import create_react_agent, AgentExecutor

agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    return_intermediate_steps=False
)


In [36]:
agent_executor.invoke({
    "input": "Explain reinforcement learning.",
})




> Entering new AgentExecutor chain...
Okay, I will start by using the `RUN_RAG` tool to see if the document explains reinforcement learning. Then, I will use the `get_topic` tool to determine the document's topic. If the document doesn't directly answer the question but the topic is relevant, I'll use `search_web` to find an answer. Finally, if the question is unrelated to the document's topic, I will respond accordingly.
Action: RUN_RAG
Action Input: Explain reinforcement learning.
I am sorry, but the context provided does not contain any information about reinforcement learning.
The document does not contain information about reinforcement learning. I will now use the `get_topic` tool to determine the document's topic.
Action: get_topic
Action Input: what is this document based on ?
This document is a study guide that outlines key aspects of software project management. It aims to enable efficient teamwork, an

{'input': 'Explain reinforcement learning.',
 'output': "I'm sorry, this question is unrelated to the uploaded material and I can't answer it."}

In [43]:
agent_executor.invoke({
    "input": "Explain system design in context of software project management.",
})




> Entering new AgentExecutor chain...
Okay, I understand the question. First, I will try to answer it using the uploaded document. Then, I will identify the topic of the document to ensure the question is relevant. If the document doesn't contain the answer but the topic is relevant, I will use the web search tool.

Action: RUN_RAG
Action Input: Explain system design in context of software project management.
The document does not contain an explanation of system design in the context of software project management.
Okay, the document doesn't directly answer the question. Now I need to determine the topic of the document to see if the question is related.

Action: get_topic
Action Input: what is this document based on ?
This document is a study guide that outlines key aspects of software project management. It aims to enable efficient teamwork towards successful project completion.
Okay, the docu

{'input': 'Explain system design in context of software project management.',
 'output': 'System design, in the context of software project management, is the process of defining the architecture, components, modules, interfaces, and data flow that collectively make up an application or software system. It outlines how various parts of a system interact with each other to meet performance, reliability, and scalability needs (search_web). The uploaded document is a study guide about software project management (get_topic), making this question relevant.'}

In [52]:
agent_executor.invoke({
    "input": "Explain Functional Point metrics",
})



> Entering new AgentExecutor chain...
Okay, I will start by using the `RUN_RAG` tool to see if the document explains Functional Point metrics. Then, I will use the `get_topic` tool to determine the document's topic. If the document doesn't fully answer the question but the topic is relevant, I'll use the `search_web` tool to find more information. Finally, I'll synthesize the information and provide a comprehensive answer.

Action: RUN_RAG
Action Input: Explain Functional Point metrics
I am sorry, but the document does not contain information about Functional Point metrics.
The document does not contain information about Functional Point metrics. I will now use the `get_topic` tool to determine the document's topic to see if the question is related to the topic.

Action: get_topic
Action Input: what is this document based on ?
This document is a study guide that outlines key aspects of software project managemen

{'input': 'Explain Functional Point metrics',
 'output': "Functional Point (FP) Analysis is a method used to quantify the amount of business functionality that a software application delivers to a user. It's a unit of measurement that expresses the size and complexity of a software system in terms of the functions it provides. Function points are used to estimate the effort and cost required to develop or maintain a software project. They provide a standardized way to measure the different functions of an application from the user's perspective, based on their requests and the responses they receive (search_web)."}

In [53]:
agent_executor.invoke({
    "input": "explain Organization of SPMP Document",
})



> Entering new AgentExecutor chain...
Okay, I understand. I need to explain the organization of the SPMP document. I will start by using the `RUN_RAG` tool to see if the document itself explains its organization. Then I will use `get_topic` to understand the topic of the document. If the document doesn't contain the answer, I will use `search_web` to find the answer online.

Action: RUN_RAG
Action Input: "Organization of SPMP Document"
The SPMP Document includes an introduction with objectives, major functions, performance issues, and management and technical constraints. It also contains project estimates with historical data, estimation techniques, effort, cost, and project duration, as well as a project resources plan covering people, hardware/software, and special resources. Furthermore, it includes schedules, a risk management plan, and miscellaneous plans.
Okay, the document provides a good overview of the SPMP document's org

{'input': 'explain Organization of SPMP Document',
 'output': 'The SPMP Document is organized to cover key aspects of software project management. It includes an introduction outlining objectives, major functions, performance issues, and management and technical constraints. It also contains project estimates with historical data, estimation techniques, effort, cost, and project duration, as well as a project resources plan covering people, hardware/software, and special resources. Furthermore, it includes schedules, a risk management plan, and miscellaneous plans.'}

In [54]:
# Estimating Software Cost:
agent_executor.invoke({
    "input": "how is Estimating Software Cost done ?",
})



> Entering new AgentExecutor chain...
Okay, I understand the question. I need to explain how estimating software cost is done, using the uploaded document as the primary source.

First, I'll use the `RUN_RAG` tool to see if the document directly addresses this question. If not, I'll use `get_topic` to understand the document's subject. If the subject is relevant to software cost estimation, I'll use `search_web` to find information and provide a comprehensive answer. If the subject is unrelated, I will respond that I cannot answer the question.

Action: RUN_RAG
Action Input: "how is Estimating Software Cost done?"
Estimating software cost involves determining the product size, then estimating the effort needed based on size, and finally, determining project duration and cost from the effort estimate.
The document provides a direct answer to the question. Now, I'll use `get_topic` to confirm the document's subject.

Action: get_topi

{'input': 'how is Estimating Software Cost done ?',
 'output': 'Estimating software cost involves determining the product size, then estimating the effort needed based on size, and finally, determining project duration and cost from the effort estimate (as stated in the document).\n\nSome common software cost estimation techniques include:\n\n*   **Bottom-Up Estimation:** Estimating individual components and summing them up.\n*   **Top-Down Estimation:** Estimating the overall project cost and then allocating it to individual components.\n*   **Parametric Estimation:** Using statistical relationships and historical data.\n*   **Three-Point Estimation:** Using optimistic, pessimistic, and most likely estimates.\n*   **Analogous Estimation:** Comparing to similar past projects.\n\nThese methods help in ensuring project success, budget management, and risk mitigation.'}