# Assignment Details
Your assignment is as follows:
You need to come up with a working algorithm that can take the following PDF and extract key information from it for an investor looking to evaluate the company. The investor looks at key elements such as future growth prospects, key changes in the business, key triggers, and important information that might have a material effect on next year's earnings and growth.

In this assignment you are free to use the GPT algorithm and APIs of your choice. You are free to use any Python libraries and packages to achieve this task. The code should be well documented and cleanly written.

Length of code is not important; its quality is. The quality of the output is also important. The code should work across other documents of the same kind and should preferably not have any hard-coded logic within it.

You will need to submit your code as a Google Colab link (please ensure the Google Colab link is visible to anyone who has the link).


Please use the below link to access the document that needs to be summarized:

https://drive.google.com/file/d/1jguXFqGgYkbeF5X9jFJ4ts50m-VJ1rCu/view?usp=sharing

# Document Summarization and Key Business Information Extraction Pipeline

## ✅ My Approach
## Overview

This pipeline summarizes large documents and extracts key business information relevant to investors. The solution leverages two models:
- **Summarization Model**: `allenai/led-base-16384` for summarizing large chunks of text.
- **Information Extraction Model**: `Gemma-2-2b-it` for identifying key business information from the summarized text.

The workflow divides the document into manageable sections, summarizes each section individually, and consolidates the results for a final business analysis.

---

## Step-by-Step Architecture

### Step 1: Document Chunking

- **Purpose**: Handle large documents that exceed the input capacity of the summarization model.
- **Process**:
  - Split the document into **chunks of up to 2000 words each**.
  - Each chunk is processed independently to avoid memory overload and improve model efficiency.
- **Implementation**:
  - The notebook's `split_text_into_chunks` function splits the text on whitespace into fixed-size groups of words. Splitting on sentence or paragraph boundaries instead would keep the chunks more coherent.

---

### Step 2: Chunk-wise Summarization with `allenai/led-base-16384`

- **Purpose**: Generate summaries for each document chunk.
- **Model**: `allenai/led-base-16384` (Longformer Encoder-Decoder model) optimized for long input sequences.
- **Process**:
  - Pass each chunk (2000 words max) into the model to obtain a summary.
  - Store each chunk’s summary in a list or concatenate into a single text.
- **Implementation**:
  - Use a loop to feed each chunk into the summarization model.
  - Capture and save each summarized output.

---

### Step 3: Combine Summaries

- **Purpose**: Consolidate individual chunk summaries for a coherent, reduced representation of the entire document.
- **Process**:
  - Concatenate all chunk summaries.
  - Optionally, re-run the combined text through the summarization model for further condensation.
- **Implementation**:
  - Combine summaries into a single string for downstream processing.
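
The optional second pass described above can be sketched as a small reduce loop. This is a minimal sketch rather than the notebook's code: `reduce_summaries`, `summarize_fn`, `max_words`, and `max_rounds` are hypothetical names, and `summarize_fn` stands in for any callable that maps a string to a shorter string (for example, a thin wrapper around the summarization pipeline).

```python
from typing import Callable, List


def reduce_summaries(summaries: List[str],
                     summarize_fn: Callable[[str], str],
                     max_words: int = 2000,
                     max_rounds: int = 5) -> str:
    """Concatenate chunk summaries and re-summarize until short enough."""
    combined = "\n".join(s.strip() for s in summaries if s.strip())
    for _ in range(max_rounds):
        words = combined.split()
        if len(words) <= max_words:
            break
        # Re-chunk the combined text and summarize each piece again
        pieces = [" ".join(words[i:i + max_words])
                  for i in range(0, len(words), max_words)]
        combined = "\n".join(summarize_fn(piece) for piece in pieces)
    return combined


# Stand-in summarizer for illustration only: keeps the first 10 words
def first_ten_words(text: str) -> str:
    return " ".join(text.split()[:10])


demo = reduce_summaries(["alpha " * 30, "beta " * 30], first_ten_words, max_words=25)
print(len(demo.split()), "words after reduction")
```

The `max_rounds` cap is there so the loop still terminates if the summarizer fails to shorten its input.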

---

### Step 4: Key Business Information Extraction with `Gemma-2-2b-it`

- **Purpose**: Extract key information useful to investors.
  - Focus on specific points:
    - **Future Growth Prospects**
    - **Key Changes in Business**
    - **Key Triggers**
    - **Material Impacts on Earnings and Growth**
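- **Model**: `Gemma-2-2b-it`, Google's instruction-tuned 2B-parameter Gemma 2 model. It is a general-purpose decoder-only language model rather than one fine-tuned for business text, so the extraction is steered entirely by the prompt.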
- **Process**:
  - Pass the combined summary into `Gemma-2-2b-it` using a custom prompt to extract the targeted information.
  - The output is a structured set of bullet points relevant to investors.
- **Prompt Design**:
  - Use a clear and specific prompt instructing the model to extract and organize the output according to investor interests.
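
One way to keep the prompt free of hard-coded wording is to build it from a table of focus areas. The sketch below is illustrative only: `FOCUS_AREAS` and `build_investor_prompt` are hypothetical names, and the descriptions are paraphrased from the focus points listed above.

```python
from typing import Dict

# Focus areas an investor cares about; edit this table to retarget the prompt
FOCUS_AREAS: Dict[str, str] = {
    "Future growth prospects": "Trends, opportunities, or strategies that suggest growth potential.",
    "Key changes in the business": "Significant changes in operations, structure, or strategy.",
    "Key triggers": "Events or factors that could affect performance or market position.",
    "Material impacts on earnings and growth": "Details that could significantly affect next year's results.",
}


def build_investor_prompt(context: str, focus_areas: Dict[str, str] = FOCUS_AREAS) -> str:
    """Assemble a single user prompt from the summary text and the focus areas."""
    bullets = "\n".join(f"- {name}: {desc}" for name, desc in focus_areas.items())
    return (
        f"Context:\n{context}\n\n"
        "Instructions:\n"
        "Based on the above context, extract the key information an investor "
        "would need to evaluate this company. Answer in markdown bullet points "
        "under these headings:\n"
        f"{bullets}"
    )


# Chat-style message list, as accepted by text-generation chat pipelines
messages = [{"role": "user", "content": build_investor_prompt("Example summary text.")}]
print(messages[0]["content"])
```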

---

## Final Output

- **Format**: A bullet-point summary of key business insights.
- **Content**: Structured to address investor priorities with insights on growth, strategic changes, potential triggers, and material impacts.
- **Usage**: Final output serves as a concise business overview, suitable for investor evaluation and decision-making.

---

## Summary of the Workflow

1. **Chunk Document** → **Summarize Each Chunk** → **Combine Summaries**
2. **Summarized Document** → **Extract Key Business Information**
3. **Final Output**: Key business insights in investor-focused bullet points.

---

## Advantages of This Approach

- **Efficiently Handles Large Texts**: Chunking allows summarization of documents that exceed the model’s input length.
- **Targeted Business Insights**: Extraction model focuses on specific investor needs, making the final output highly relevant for evaluation.


# Installing the Required Libraries

In [1]:
!pip install -U -q PyMuPDF transformers


# Importing the Required Modules

In [2]:
import os
import fitz  # PyMuPDF
import re    # For cleaning text
import json
import torch # For using 'cuda'
from transformers import pipeline # For using the transformer models from HF hub
from IPython.display import Markdown, display # For displaying the Final output in a structured format

# Document Parsing and Text Extraction

- `extract_text_from_pdf`: Uses PyMuPDF to extract the text from the document.
- `clean_extracted_text`: Calls the function above and then cleans the extracted text with regular expressions.

In [3]:
def extract_text_from_pdf(pdf_path):
    """Extract the text of every page and drop blank lines."""
    text = ""
    # The context manager closes the PDF once all pages have been read
    with fitz.open(pdf_path) as pdf_document:
        for page in pdf_document:
            text += page.get_text("text")
    # Post-process text to remove excessive whitespace
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return '\n'.join(lines)

def clean_extracted_text(pdf_path):
    """Extract text from the PDF and normalize its characters and whitespace."""
    extracted_text = extract_text_from_pdf(pdf_path)
    # Remove non-alphanumeric characters except for common punctuation and newlines
    text = re.sub(r'[^A-Za-z0-9\s,.\'@|\-\n]', '', extracted_text)
    # Collapse runs of spaces and tabs into a single space, keeping newlines intact
    text = re.sub(r'[^\S\n]+', ' ', text)
    # Replace multiple newlines (and spaces around them) with a single newline
    text = re.sub(r' ?\n[ \n]*', '\n', text)
    # Strip leading and trailing spaces and newlines
    return text.strip()
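
To see what the character filter keeps and drops, it can be applied to a single sentence. The sample sentence below is invented for illustration; the pattern is the same character class used in `clean_extracted_text`.

```python
import re

# Same character class as in clean_extracted_text: anything not listed is removed
DISALLOWED = r"[^A-Za-z0-9\s,.'@|\-\n]"

# Invented sample sentence, not taken from the document
sample = "Revenue grew 12.5% (YoY) to Rs.1,172 million; EBITDA margin: 25%"
cleaned = re.sub(DISALLOWED, "", sample)
print(cleaned)
# → Revenue grew 12.5 YoY to Rs.1,172 million EBITDA margin 25
```

Note that the percent signs are stripped along with the brackets. For financial text it may be worth adding `%` to the allowed set so that figures such as growth rates and margins stay unambiguous.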

# Text Splitting and saving into chunks

The document is too large to pass to the summarizer model in one go, so the text is first split into smaller chunks. Each chunk is summarized separately, and the chunk summaries are then combined.

In [4]:
# Function to split text into chunks of max 2000 words
def split_text_into_chunks(text, max_words=2000):
    words = text.split()
    total_words = len(words)
    print(f"Total words in document: {total_words}")

    # Slice the word list into consecutive chunks of max_words words
    chunks = []
    for i in range(0, total_words, max_words):
        chunk = " ".join(words[i:i + max_words])
        chunks.append(chunk)

    print(f"Total number of chunks created: {len(chunks)}")
    return chunks
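
The function above cuts purely by word count, so a chunk can end in the middle of a sentence. A sentence-aware variant might look like the sketch below. It is an assumption-laden alternative, not the notebook's method: `split_into_sentence_chunks` is a hypothetical name, and the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations such as "Ltd. is" are treated as sentence ends).

```python
import re
from typing import List


def split_into_sentence_chunks(text: str, max_words: int = 2000) -> List[str]:
    """Group whole sentences into chunks of at most max_words words."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


example = "One two three. Four five six. Seven eight nine ten."
print(split_into_sentence_chunks(example, max_words=6))
# → ['One two three. Four five six.', 'Seven eight nine ten.']
```

A single sentence longer than `max_words` still becomes one oversized chunk, so the limit is a target rather than a hard guarantee.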

# Summarization Function

>Using the Longformer Encoder-Decoder (LED) model to summarize each chunk. 👉[Model link](https://huggingface.co/allenai/led-base-16384)

This model accepts inputs of up to 16,384 tokens, so it can easily handle a 2000-word chunk at a time.

- `summarize_text_chunks`: Summarizes each chunk and joins the summaries with newlines.
- `process_pdf_for_summarization`: Calls the other functions in sequence: cleaning the text, splitting it into chunks, and summarizing the chunks.

In [5]:
# Initialize the Hugging Face summarization pipeline
summarizer = pipeline("summarization", model="allenai/led-base-16384")

# Summarization function for each chunk
def summarize_text_chunks(chunks):
    summarized_text = ""
    for idx, chunk in enumerate(chunks):
        print(f"Summarizing chunk {idx + 1}...")
        summary = summarizer(chunk, max_length=80, do_sample=False)
        summarized_text += summary[0]['summary_text'] + "\n"  # Add a newline between summaries

    return summarized_text


# Main process to clean text, split, and summarize
def process_pdf_for_summarization(pdf_path):
    # Extract and clean text from PDF
    text = clean_extracted_text(pdf_path)

    # Split text into manageable chunks of 2000 words
    text_chunks = split_text_into_chunks(text, max_words=2000)

    # Summarize each chunk and combine the results
    complete_summary = summarize_text_chunks(text_chunks)
    return complete_summary

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
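
The warning above reports that the summarizer ran on the CPU even though a GPU was available. A minimal sketch of choosing the device explicitly is shown below; the variable names are illustrative, and the commented-out pipeline call shows where the value would be passed.

```python
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # torch is not installed: fall back to CPU
    cuda_available = False

# transformers pipelines accept device=0 for the first GPU and device=-1 for CPU
device = 0 if cuda_available else -1
print(f"Pipeline device index: {device}")

# The summarizer could then be created with, for example:
# summarizer = pipeline("summarization", model="allenai/led-base-16384", device=device)
```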


### Uploading the Document and Calling the Function

Here the `pdf_path` variable holds the name of the uploaded document ([PDF link](https://drive.google.com/file/d/1jguXFqGgYkbeF5X9jFJ4ts50m-VJ1rCu/view)). Upload the document and pass its file path; in this case the file name is enough because the file is in the same directory.

In [6]:
# Run the function and print the complete summary
pdf_path = 'doc.pdf'  # Upload the document and pass the file path (here just the file name, as the file is in the same directory)
output_summary = process_pdf_for_summarization(pdf_path)
print("Output Summary 👇",'\n',output_summary)

Input ids are automatically padded from 2677 to 3072 to be a multiple of `config.attention_window`: 1024


Total words in document: 8834
Total number of chunks created: 5
Summarizing chunk 1...



Input ids are automatically padded from 2487 to 3072 to be a multiple of `config.attention_window`: 1024


Summarizing chunk 2...


Input ids are automatically padded from 2433 to 3072 to be a multiple of `config.attention_window`: 1024


Summarizing chunk 3...


Input ids are automatically padded from 2435 to 3072 to be a multiple of `config.attention_window`: 1024


Summarizing chunk 4...


Input ids are automatically padded from 1036 to 2048 to be a multiple of `config.attention_window`: 1024


Summarizing chunk 5...
Output Summary 👇 
 August 03, 2023 To, National Stock Exchange of India Limited Exchange Plaza, 5th Floor, Plot No. C1, G Block, Bandra Kurla Complex, Bandra E, Mumbai -400 051 Symbol SJS Enterprises July 27, 2023 Before I hand over to Mahendra, I would like to share one more update with you. On the CSR front
and cash equivalent were at Rs.2,849.7 million at the end of June '23. However, post Walter Pack India acquisition, it was at Rs.461.7 million, and our net debt post-acquisition has increased to Rs.539.8 million and equity share capital has increased to Rs.310.38 million post preferential allotment to our promoter
. So both for the EV segment and for the conventional IC engines, I think the IML and IMD technology that we have to offer, IME will piggyback on this, will find a lot of demand when the customer sees this. So at the moment we are working on some development projects which I think are very exciting technologies. So cross selling to two wheeler cust

# Using Gemma-2-2b-it to Extract Key Information from the Summary

The Gemma-2-2b-it model is available here: 👉 [Model Link](https://huggingface.co/google/gemma-2-2b-it).

 ⚠️⚠️ To use this model, first request access on its Hugging Face Hub page, then create an access token from your Hugging Face profile. Use that access token to log in to `huggingface_hub` below 👇

 ### ⚠️⚠️⚠️⚠️ Please create your own access token ⚠️⚠️⚠️⚠️

In [None]:
from huggingface_hub import login
login("hf_vHiHpOJxxxxxxxxxxxxxxxxxfOaRsp")
print("Logged in to Hugging Face Hub!")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Logged in to Hugging Face Hub!
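
Pasting a token into a notebook that is shared publicly exposes it to anyone with the link. One alternative, sketched here under the assumption that the token has been stored in an environment variable named `HF_TOKEN` before the notebook starts, is to read it at run time. `get_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token


# Usage, assuming the variable was set beforehand:
# from huggingface_hub import login
# login(token=get_hf_token())
```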


# Loading the Model using Pipeline Function

In [8]:
pipe2 = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)


# Creating a Custom Prompt for the Gemma-2-2b-it Model to Extract Information Relevant to the Investor

In [9]:
messages = [
    {"role": "user",
    "content": f"""Context:\n {output_summary}\n Instructions:\n Based on the above context, please extract and organize the key information an investor would need to evaluate this company in proper markdown format. Present the information in bullet points, focusing on:
    Future growth prospects: Any trends, opportunities, or strategies that suggest growth potential for the company.
    Key changes in the business: Significant changes in business operations, structure, or strategy.
    Key triggers: Events or factors that could impact the company’s performance or market position.
    Material impacts on earnings and growth: Important details that could significantly affect next year’s financial performance. """
    },
]


# Run the Model on the Custom prompt to generate the final output

In [10]:
outputs = pipe2(messages, max_new_tokens=1024)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
display(Markdown(assistant_response))

## SJS Enterprises Investor Information: Key Takeaways

**Future Growth Prospects:**

* **IML and IMD Technology Demand:**  The company's IML and IMD technology is expected to be in high demand due to its potential to address customer needs in the EV and conventional IC engine segments.
* **Development Projects:**  Exciting development projects are underway, suggesting potential for future growth and innovation.
* **Cross-selling Opportunities:**  The company is focusing on cross-selling its products to two-wheeler customers, potentially expanding its market reach.
* **Export Growth:**  Exports have shown significant growth in recent quarters, indicating potential for international expansion.

**Key Changes in Business:**

* **Acquisition Impact:**  The acquisition of Walter Pack India has led to a decrease in CSR and cash equivalent, an increase in net debt, and an increase in equity share capital through a preferential allotment.
* **Capacity Utilization:**  The company currently operates at a 60-70% capacity utilization, suggesting potential for increased revenue generation.

**Key Triggers:**

* **Geopolitical and Macroeconomic Factors:**  The company's export performance has been impacted by geopolitical and macroeconomic issues in Europe and North America.
* **Future Market Conditions:**  The long-term outlook for the company's growth is dependent on the evolution of the EV and IC engine markets, as well as the global economic landscape.

**Material Impacts on Earnings and Growth:**

* **Net Debt Increase:**  The acquisition of Walter Pack India has increased the company's net debt, which could impact its financial flexibility and profitability.
* **Preferential Allotment:**  The preferential allotment of equity shares to the promoter has increased the company's equity share capital, potentially impacting its financial structure and future growth.
* **Export Growth:**  The company's export performance is a key driver of its revenue and profitability, and its future growth is dependent on the global market conditions. 


**Note:** This information is based on the provided context and may not be exhaustive. Further research and analysis are recommended for a comprehensive understanding of the company's performance and future prospects.
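
If the response needs to be consumed by other code rather than just displayed, the markdown can be split into sections. The sketch below assumes the layout seen in the output above (a bold heading on its own line followed by `*` or `-` bullets); `parse_sections` is a hypothetical helper, and a model may well format a different response differently.

```python
import re
from typing import Dict, List, Optional


def parse_sections(markdown_text: str) -> Dict[str, List[str]]:
    """Map each bold heading line to the bullet lines that follow it."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        # A heading is a line that consists only of **bold text**
        heading = re.fullmatch(r"\*\*(.+?):?\*\*:?", line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = []
        elif current is not None and line.startswith(("* ", "- ")):
            sections[current].append(line[2:].strip())
    return sections


# Invented sample in the same layout as the model response
sample_response = "**Key Triggers:**\n\n* **Exports:** demand may vary\n* Second point\n\n**Note:** ignore"
print(parse_sections(sample_response))
# → {'Key Triggers': ['**Exports:** demand may vary', 'Second point']}
```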