# Assignment: Build a Document Summarization Agent with LangChain


---

### Objective:
This assignment will guide you through building a **document summarization agent using LangChain**. You'll learn how to load documents, split them into manageable chunks, and apply different summarization techniques (like `stuff`, `map_reduce`, and `refine`) with a Large Language Model (LLM). By the end, you'll be able to summarize long texts efficiently and understand the trade-offs of various summarization strategies.

---

### Instructions:
1.  **LLM Access**: You'll need access to an LLM API (e.g., Google's Gemini, OpenAI's GPT-4). For this assignment, we'll primarily use **Google's Gemini Pro model** through LangChain's `langchain-google-genai` integration package.
2.  **Environment Setup**: Install the necessary Python libraries: `pip install langchain langchain-community langchain-google-genai unstructured pypdf`.
3.  **API Key**: Securely handle your API key. It's best practice to load it from an environment variable.
4.  **Jupyter Notebook**: All your code, outputs, observations, and analysis must be documented in this Jupyter Notebook.
5.  **Document Selection**: You'll need a long document (e.g., a research paper, an article, a chapter from a book) to summarize. A PDF file is recommended for testing `PyPDFLoader`.
6.  **Analysis**: Critically evaluate the summaries generated by each method.
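
One way to follow the API key guidance above is sketched here. The helper name `load_api_key` is hypothetical and not part of LangChain; it reads the key from an environment variable and prompts without echoing the value only when the variable is missing.

```python
import os
from getpass import getpass


def load_api_key(var_name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is missing.

    `load_api_key` is a hypothetical helper written for this assignment.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output.
        key = getpass(f"Enter a value for {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} is empty; an API key is required.")
        # Store it for the rest of the session so libraries that read the variable can find it.
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` once before initializing the model keeps the key out of the saved notebook, which matters if you share or submit the file.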

---

## Part 1: Setup and Document Loading
Begin by configuring your LLM and loading a document into LangChain.

### Task 1.1: API Configuration and LLM Initialization
Set up your Google Generative AI API key and initialize the `ChatGoogleGenerativeAI` model in LangChain.

In [None]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

# --- API KEY ---
# Load the key from the GOOGLE_API_KEY environment variable so it never appears in the notebook.
# If the variable is not set, prompt for it without echoing the value.
# Do not paste the key into a cell: it would be saved with the notebook and shared on submission.
if not os.environ.get("GOOGLE_API_KEY"):
    from getpass import getpass
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

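# Initialize the LLM (Gemini Pro).
# Available model names change over time; if "gemini-pro" is rejected, substitute a Gemini model your key can access.
# A lower temperature (e.g., 0.2) makes summaries more repeatable across runs; 0.7 allows more variation.
llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.7)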

print("LangChain LLM initialized with Gemini Pro!")

### Task 1.2: Document Loading and Splitting
Choose a document (e.g., a PDF, a long text file) and load it using an appropriate LangChain Document Loader. Then, split the document into smaller, manageable chunks using a `RecursiveCharacterTextSplitter`.

* **Choose a document**: Ideally, a PDF or a `.txt` file that is at least three to five pages long (or contains equivalent text).
    * If using PDF: Place the PDF file in the same directory as this notebook.
    * If using text: Create a `.txt` file with your long text.
* **Document Loader**: Use `PyPDFLoader` for PDFs or `TextLoader` for text files.
* **Text Splitter**: Use `RecursiveCharacterTextSplitter` with `chunk_size` and `chunk_overlap` of your choice. Experiment with these values.

* **Chosen Document**: [Specify the name and type of your chosen document, e.g., `my_research_paper.pdf`]
* **Chosen `chunk_size`**: [Your chosen chunk size]
* **Chosen `chunk_overlap`**: [Your chosen chunk overlap]

In [None]:
# Replace "sample_document.pdf" with the path to your own file.
# Example for PDF:
file_path = "sample_document.pdf"  # The path is resolved relative to the notebook's directory
loader = PyPDFLoader(file_path)
# If you're using a .txt file, uncomment the line below and comment the PyPDFLoader line:
# loader = TextLoader("sample_document.txt")

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Experiment with this value
    chunk_overlap=200, # Experiment with this value
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.split_documents(documents)

print(f"Original document loaded into {len(documents)} document(s).")
print(f"Document split into {len(chunks)} chunks.")
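if chunks:
    print(f"First chunk example (first 200 chars):\n{chunks[0].page_content[:200]}...")
else:
    print("No text was extracted. Check that the file contains selectable text rather than scanned images.")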

### Analysis for Task 1.2:
* Why is it necessary to split documents for summarization with LLMs?
* How do `chunk_size` and `chunk_overlap` affect the chunks? What are the potential trade-offs of using very small vs. very large chunks?
* What challenges might arise when splitting documents (e.g., losing context between chunks)?
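
To build intuition for the `chunk_size` and `chunk_overlap` question, here is a minimal sketch of a fixed-size sliding window over characters. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the 5,000-character sample are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters.

    Simplified illustration only; it ignores separators and word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample_text = "x" * 5000  # hypothetical 5,000-character document

for size, overlap in [(1000, 0), (1000, 200), (500, 100), (2500, 200)]:
    pieces = sliding_window_chunks(sample_text, size, overlap)
    print(f"chunk_size={size:>4} chunk_overlap={overlap:>3} -> {len(pieces)} chunks")
```

Smaller chunks and larger overlaps both increase the number of chunks, which in turn increases the number of model calls that `map_reduce` and `refine` make later in the assignment.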

---

## Part 2: Implementing Summarization Chains
Now, you'll apply different summarization strategies provided by LangChain.

### Task 2.1: 'Stuff' Chain Summarization
Use the `stuff` chain, which simply 'stuffs' all document chunks into a single prompt. This works best for smaller documents that fit within the LLM's context window.
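
The idea behind stuffing can be sketched without any LLM at all. The code below is a conceptual illustration, not LangChain's implementation: `stuff_summarize`, `estimate_tokens`, and `fake_llm` are hypothetical names, the four-characters-per-token ratio is a rough rule of thumb, and `max_input_tokens` is a budget you would have to look up for your actual model.

```python
from typing import Callable


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token) + 1


def stuff_summarize(chunks: list[str], llm: Callable[[str], str], max_input_tokens: int) -> str:
    """Concatenate every chunk into one prompt and make a single model call."""
    prompt = "Write a concise summary of the following:\n\n" + "\n\n".join(chunks)
    needed = estimate_tokens(prompt)
    if needed > max_input_tokens:
        # This is the failure mode of 'stuff': the whole document must fit in one prompt.
        raise ValueError(
            f"Prompt needs about {needed} tokens, but the assumed budget is {max_input_tokens}."
        )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the first 60 characters after the instruction line."""
    body = prompt.split("\n\n", 1)[-1]
    return body[:60]


demo_chunks = ["First passage of the document.", "Second passage of the document."]
print(stuff_summarize(demo_chunks, fake_llm, max_input_tokens=1000))
```

Because there is exactly one call, the model sees the full text at once, which is why this strategy is simple but limited by the context window.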

In [None]:
print("--- Running 'Stuff' Chain Summarization ---")
try:
    # 'stuff' sends every chunk in one prompt, so the combined text must fit in the
    # model's context window. For very long documents the call fails with a token-limit error.
    stuff_chain = load_summarize_chain(llm, chain_type="stuff")

    # If your document is very long, summarize only the first few chunks for demonstration:
    # summary_stuff = stuff_chain.invoke({"input_documents": chunks[:5]})["output_text"]

    # `invoke` replaces the deprecated `run`; the chain returns a dict with the summary under "output_text".
    summary_stuff = stuff_chain.invoke({"input_documents": chunks})["output_text"]

    print("\n--- 'Stuff' Summary ---")
    print(summary_stuff)
except Exception as e:
    print(f"Error with 'Stuff' chain: {e}")
    print("This often means your document is too long for the model's context window.")
    print("Try reducing the number of chunks passed to the chain (e.g., `chunks[:X]`) or using a smaller document.")

### Analysis for Task 2.1:
* What are the advantages of the 'stuff' chain (e.g., capturing full context, simpler implementation)?
* What are its primary limitations?
* How does the quality of the 'stuff' summary compare to your expectation?

### Task 2.2: 'Map-Reduce' Chain Summarization
Implement the `map_reduce` chain, which summarizes chunks individually (`map` step) and then combines those summaries (`reduce` step).
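
The control flow of the two steps can be sketched as follows. This is a conceptual illustration rather than LangChain's implementation: `map_reduce_summarize` and `fake_llm` are hypothetical names, the prompts are invented, and the sketch uses a single reduce call with no intermediate collapsing of partial summaries.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str], llm: Callable[[str], str]) -> str:
    """Summarize each chunk on its own (map), then summarize the joined partial summaries (reduce)."""
    if not chunks:
        return ""
    # Map step: one independent call per chunk; these calls could run in parallel.
    partial_summaries = [llm(f"Summarize this passage:\n\n{chunk}") for chunk in chunks]
    # Reduce step: the final call sees only the partial summaries, not the original text.
    combined = "\n".join(partial_summaries)
    return llm(f"Combine these partial summaries into one summary:\n\n{combined}")


call_log: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: logs the call and keeps the first word of each body line."""
    call_log.append(prompt)
    body = prompt.split("\n\n", 1)[-1]
    return " ".join(line.split()[0] for line in body.splitlines() if line.strip())


result = map_reduce_summarize(
    ["red apples are sweet", "green pears are crisp", "blue plums are tart"],
    fake_llm,
)
print(result)
print(f"model calls: {len(call_log)}")
```

With three chunks the sketch makes four calls (three map, one reduce). No single call ever sees the whole document, which is the source of both its scalability and its potential loss of global context.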

In [None]:
print("\n--- Running 'Map-Reduce' Chain Summarization ---")
map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce")

try:
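    # `invoke` replaces the deprecated `run`; the summary is returned under "output_text".
    summary_map_reduce = map_reduce_chain.invoke({"input_documents": chunks})["output_text"]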
    print("\n--- 'Map-Reduce' Summary ---")
    print(summary_map_reduce)
except Exception as e:
    print(f"Error with 'Map-Reduce' chain: {e}")

### Analysis for Task 2.2:
* How does the 'map-reduce' summary differ from the 'stuff' summary in terms of style, length, and detail?
* What are the main benefits of using 'map-reduce' for summarization?
* What are its potential drawbacks (e.g., loss of global context, repetition)?

### Task 2.3: 'Refine' Chain Summarization
Utilize the `refine` chain, which iteratively builds a summary by passing each chunk and the current summary to the LLM.
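
The iterative loop can be sketched like this. As with any simplified model of a library feature, treat it as a conceptual illustration and not LangChain's implementation: `refine_summarize` and `fake_llm` are hypothetical names and the prompts are invented.

```python
from typing import Callable


def refine_summarize(chunks: list[str], llm: Callable[[str], str]) -> str:
    """Summarize the first chunk, then update that summary once per remaining chunk."""
    if not chunks:
        return ""
    summary = llm(f"Summarize this passage:\n\n{chunks[0]}")
    for chunk in chunks[1:]:
        # Each call depends on the previous result, so the calls must run in order.
        summary = llm(
            "Here is the summary so far:\n\n"
            f"{summary}\n\n"
            "Refine it using this additional passage:\n\n"
            f"{chunk}"
        )
    return summary


call_log: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: appends the first word of the newest passage to the summary."""
    call_log.append(prompt)
    parts = prompt.split("\n\n")
    first_word = parts[-1].split()[0]
    if len(parts) == 2:  # initial call: instruction + passage
        return first_word
    return f"{parts[1]} {first_word}"  # refine call: parts[1] is the running summary


result = refine_summarize(
    ["red apples are sweet", "green pears are crisp", "blue plums are tart"],
    fake_llm,
)
print(result)
print(f"model calls: {len(call_log)}")
```

The sketch makes one call per chunk, strictly in sequence. That ordering is why `refine` cannot be parallelized the way the map step can, and why content from early chunks may carry more weight in the final summary.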

In [None]:
print("\n--- Running 'Refine' Chain Summarization ---")
refine_chain = load_summarize_chain(llm, chain_type="refine")

try:
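    # `invoke` replaces the deprecated `run`; the summary is returned under "output_text".
    summary_refine = refine_chain.invoke({"input_documents": chunks})["output_text"]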
    print("\n--- 'Refine' Summary ---")
    print(summary_refine)
except Exception as e:
    print(f"Error with 'Refine' chain: {e}")

### Analysis for Task 2.3:
* How does the 'refine' summary compare to 'stuff' and 'map-reduce'? Does it seem more coherent or detailed?
* What are the advantages of the 'refine' approach (e.g., better contextual understanding, iterative improvement)?
* What are its potential disadvantages (e.g., slower, potential for bias accumulation)?

---

## Part 3: Comparative Analysis and Evaluation
Compare the different summarization methods and discuss their suitability for various use cases.

### Task 3.1: Qualitative Comparison
Compare the three generated summaries based on the following criteria:
* **Conciseness**: How brief and to-the-point is each summary?
* **Completeness**: Does each summary capture all key information?
* **Coherence/Readability**: Does each summary flow well and make sense?
* **Accuracy**: Is the information in the summary faithful to the source document and free of hallucination?
* **Length**: How long is each summary (e.g., word count)?

**Comparison Table (Fill in your observations):**

| Criteria               | 'Stuff' Chain | 'Map-Reduce' Chain | 'Refine' Chain |
| :--------------------- | :------------ | :----------------- | :------------- |
| **Conciseness**        |               |                    |                |
| **Completeness**       |               |                    |                |
| **Coherence**          |               |                    |                |
| **Accuracy**           |               |                    |                |
| **Approx. Word Count** |               |                    |                |
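
To fill in the word count row consistently, a small helper like the one below may be useful. The name `summary_stats` is hypothetical, and splitting on whitespace is only an approximation of a word count.

```python
def summary_stats(source_text: str, summaries: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return the word count and summary-to-source length ratio for each named summary."""
    source_words = len(source_text.split())
    stats = {}
    for name, summary in summaries.items():
        words = len(summary.split())
        # Guard against an empty source so the ratio never divides by zero.
        ratio = words / source_words if source_words else 0.0
        stats[name] = {"words": words, "ratio": round(ratio, 3)}
    return stats


# Placeholder strings; in the notebook you would pass your own source text and summaries.
demo_source = "one two three four five six seven eight nine ten"
demo_summaries = {"stuff": "one two three", "map_reduce": "one two", "refine": "one two three four"}

for name, values in summary_stats(demo_source, demo_summaries).items():
    print(f"{name:<11} words={values['words']:<4} ratio={values['ratio']}")
```

In the notebook, the source text could be assembled by joining the `page_content` of each chunk, and the dictionary could hold `summary_stuff`, `summary_map_reduce`, and `summary_refine`.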

### Analysis for Task 3.1:
* Based on your comparison, which summarization method (`stuff`, `map_reduce`, `refine`) performed best for *your chosen document* and why?
* Under what circumstances would you recommend each method for a practical application?
    * When is `stuff` ideal?
    * When is `map_reduce` most suitable?
    * When does `refine` shine?
* Did you observe any instances of hallucination or factual inaccuracies in any of the summaries? If so, describe them.
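
When checking for hallucination, one narrow but cheap heuristic is to look for numbers that appear in a summary but nowhere in the source. The sketch below assumes that idea; `unsupported_numbers` is a hypothetical helper, and a clean result does not prove that a summary is accurate, so it complements a careful manual read rather than replacing it.

```python
import re

# Matches integers and numbers with internal separators, such as 2021, 3.4, or 1,200.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(source_text: str, summary: str) -> list[str]:
    """Return numbers that appear in the summary but not in the source, in order of appearance."""
    source_numbers = set(_NUMBER.findall(source_text))
    flagged: list[str] = []
    for number in _NUMBER.findall(summary):
        if number not in source_numbers and number not in flagged:
            flagged.append(number)
    return flagged


# Invented example text for illustration only.
demo_source = "Revenue grew 12% to 3.4 million in 2021."
demo_summary = "Revenue grew 21% to 3.4 million in 2021."
print(unsupported_numbers(demo_source, demo_summary))
```

Any flagged number is a good starting point for the manual comparison against the original document.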

---

## Part 4: Conclusion and Reflection
In this markdown cell, provide a comprehensive summary of your findings and reflections based on this assignment.

* **Key Learnings about LangChain Summarization**: What were your main takeaways regarding LangChain's capabilities for document summarization?
* **Challenges Encountered**: What difficulties did you face during the assignment (e.g., token limits, quality of summaries, debugging)?
* **Practical Applications**: How can document summarization agents be used in real-world scenarios (e.g., business, research, education)?
* **Future Improvements**: If you were to further develop this summarization agent, what features or techniques would you explore next (e.g., custom prompts, combining methods, RAG for better factual accuracy)?
* **Ethical Considerations**: Discuss any ethical implications related to automated document summarization, such as potential for bias, loss of nuance, or misrepresentation of original content.

---

### Submission:
* Ensure all code cells have been executed and their outputs (including prints and summaries) are visible.
* Ensure all analysis and reflections are clearly written in markdown cells.
* Save your Jupyter Notebook as `[YourName]_LangChain_Summarization_Assignment.ipynb`.