**Task Overview and Problem Definition:**
Summarize a PDF with a focus on key elements: future growth prospects, key changes in the business, key triggers, and important information that might have a material effect on next year's earnings and growth.

To run the notebook, upload the PDF file to your Colab session storage or Drive, then change the PDF path in the main execution cell.

**STEPS:**

1. Install dependencies

2. Extract text from the PDF

3. Split the extracted text into chunks

4. Group the chunks into batches

5. Summarize the batches using the Groq API and combine the batch summaries into a final summary
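
The steps above amount to a map-reduce pattern: summarize each batch, then summarize the concatenated batch summaries. A minimal sketch under that reading, with a stand-in `summarize` function in place of the Groq call; `map_reduce_summarize` is a hypothetical name, not part of this notebook.

```python
def map_reduce_summarize(batches, summarize):
    """Summarize each batch of chunks, then summarize the joined partial summaries."""
    # Map: one summary per batch (each batch is a list of text chunks)
    partial = [summarize("\n".join(batch)) for batch in batches]
    # Reduce: a single summary over all partial summaries
    return summarize("\n\n".join(partial))


def first_word(text):
    """Stand-in summarizer (assumption for illustration): keeps only the first word."""
    return text.split()[0]


print(map_reduce_summarize([["alpha beta", "gamma"], ["delta epsilon"]], first_word))
```

Passing the summarizer in as a function keeps the control flow testable without an API key.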



In [None]:
import os

# Set the Groq API key as an environment variable (paste your key between the quotes)
os.environ["API_KEY"] = ""

# Retrieve the API key when needed
API_KEY = os.getenv("API_KEY")
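
The cell above leaves the key as an empty string, so it has to be filled in before any Groq call can succeed. A minimal sketch of a guard that fails early when the key is missing; `require_api_key` is a hypothetical helper, and it assumes the key lives in the `API_KEY` environment variable used above.

```python
import os


def require_api_key(env_var="API_KEY"):
    """Return the key stored in env_var, or raise if it is missing or blank."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return key
```

Calling `require_api_key()` before building the summarizer turns a late authentication error into an immediate, readable one.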

**Step 1: Installing Dependencies**

In [None]:
!pip install pymupdf
!pip install langchain
!pip install langchain_groq
!pip install python-dotenv
!pip install groq




**Step 2: Text Extraction from the PDF**

In [None]:
import fitz  # PyMuPDF

class PDFExtractor:
    def __init__(self, pdf_path):
        """Initializes the PDFExtractor with the path to the PDF file."""
        self.pdf_path = pdf_path
        self.doc = fitz.open(pdf_path)

    def extract_text(self):
        """Extracts all text from the PDF and returns it as a single string."""
        full_text = ""
        for page_num in range(len(self.doc)):
            page = self.doc[page_num]
            full_text += page.get_text("text") + "\n"  # Append text of each page
        return full_text

    def close(self):
        """Closes the document to free up resources."""
        self.doc.close()

# # Usage example
# pdf_path = "/content/drive/MyDrive/SJS Transcript Call.pdf"  # Replace with your PDF file path
# pdf_extractor = PDFExtractor(pdf_path)
# pdf_text = pdf_extractor.extract_text()
# pdf_extractor.close()
# print(pdf_text)
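
Text extracted from a PDF typically contains hard line breaks inside sentences, while the chunker in Step 3 splits only on spaces that follow sentence punctuation. A minimal sketch of a cleanup pass that could run between the two steps; `normalize_whitespace` is a hypothetical helper, not part of the original notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Revenue grew 20%\nyear over year.\n\nMargins   improved."
print(normalize_whitespace(sample))
# -> Revenue grew 20% year over year. Margins improved.
```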


**Step 3: Split Extracted Text into Chunks**

In [None]:

import re

class TextChunker:
    def __init__(self, max_words=300):
        """Initializes the TextChunker with a maximum number of words per chunk."""
        self.max_words = max_words

    def split_text_into_chunks(self, text):
        """
        Splits the given text into chunks with a maximum number of words per chunk.

        Parameters:
            text (str): The text to split.

        Returns:
            list: A list of text chunks.
        """
        sentences = re.split(r'(?<=[.!?]) +', text)
        chunks = []
        current_chunk = []
        current_word_count = 0

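        for sentence in sentences:
            sentence_word_count = len(sentence.split())
            # Close the current chunk once this sentence would exceed the limit;
            # the emptiness check avoids emitting an empty chunk for an oversized first sentence
            if current_chunk and current_word_count + sentence_word_count > self.max_words: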
                chunks.append(" ".join(current_chunk))
                current_chunk = []
                current_word_count = 0

            current_chunk.append(sentence)
            current_word_count += sentence_word_count

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
# text_chunker = TextChunker()
# chunked_text = text_chunker.split_text_into_chunks(pdf_text)

# print(chunked_text)
# # print(type(chunked_text))
# # print(len(chunked_text))
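
To make the chunking behavior concrete, here is a small self-contained sketch of the same sentence-grouping idea on a three-sentence input. `chunk_sentences` is a hypothetical function-style variant of `TextChunker`; unlike the class above, it splits on any whitespace after sentence punctuation, which is an assumption made so that newlines also count as sentence breaks.

```python
import re


def chunk_sentences(text, max_words):
    """Group sentences into chunks of at most max_words words where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk only if it already holds something
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Sales rose. Costs fell sharply this quarter. Outlook is stable."
for chunk in chunk_sentences(text, max_words=7):
    print(chunk)
# -> Sales rose. Costs fell sharply this quarter.
# -> Outlook is stable.
```

A single sentence longer than `max_words` is kept whole rather than cut mid-sentence, so a chunk can exceed the limit in that case.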




**Step 4: Group Chunks into Batches**

In [None]:
class TokenBatcher:
    def __init__(self, token_limit=4000, model="llama3-8b-8192"):
        self.token_limit = token_limit
        self.model = model

    def num_tokens_from_string(self, string):
        """
        Approximates the number of tokens in a string by counting
        whitespace-separated words (this is not a real tokenizer count).
        """
        return len(string.split())

    def create_batches(self, text_chunks):
        """
        Creates batches of text chunks, each within the token limit.
        """
        batches = []
        current_batch = []
        current_token_count = 0

        for chunk in text_chunks:
            chunk_tokens = self.num_tokens_from_string(chunk)

            if current_batch and current_token_count + chunk_tokens > self.token_limit:
                # Adding this chunk would exceed the token limit, so start a new batch
                # (the emptiness check avoids appending an empty first batch)
                batches.append(current_batch)
                current_batch = [chunk]
                current_token_count = chunk_tokens
            else:
                # Add the chunk to the current batch
                current_batch.append(chunk)
                current_token_count += chunk_tokens

        # Add the final batch if there are any remaining chunks
        if current_batch:
            batches.append(current_batch)

        return batches
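
`num_tokens_from_string` counts whitespace-separated words, which generally undercounts what a model tokenizer would produce. A rough sketch of a more conservative estimate, assuming the common rule of thumb of about four characters per token for English text; the ratio and the `estimate_tokens` name are assumptions, and the model's real tokenizer should be used where exact counts matter.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Estimate token count as the larger of word count and len(text) / chars_per_token."""
    word_count = len(text.split())
    char_estimate = math.ceil(len(text) / chars_per_token)
    return max(word_count, char_estimate)


print(estimate_tokens("internationalization"))  # one word, twenty characters
# -> 5
```

Taking the maximum of the two estimates errs on the high side, which keeps batches further below the model's context limit.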



**Step 5: Summarize Batches Using Groq API and Combine Batch Summaries into Final Summary**

In [None]:
from groq import Groq

class TextSummarizer:
    def __init__(self, groq_api_key, model="llama3-8b-8192", temperature=0.7, max_tokens=1000, top_p=1):
        """
        Initializes the TextSummarizer class with the Groq API key and default model parameters.

        Args:
            groq_api_key (str): The API key for accessing the Groq API.
            model (str): The Groq model to use for summarization.
            temperature (float): Controls the randomness in the output.
            max_tokens (int): The maximum number of tokens to generate.
            top_p (float): Controls the diversity of the output.
        """
        self.client = Groq(api_key=groq_api_key)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.top_p = top_p

    def clean_text(self, chunk):
        """
        Cleans the text and summarizes it with a focus on the elements that matter to investors.

        Args:
            chunk (str): The raw text to be cleaned and summarized.

        Returns:
            str: The investor-focused summary, or the original text if the API call fails.
        """
        # System message: the model's role and the cleanup rules
        context = """
                    You are an AI model trained to analyze financial documents from the perspective of an investor. Your task is to process the provided
                    financial data or company report, ensuring that all relevant information is extracted in a way that highlights the most important factors for investors.

                    You should clean and process the data by removing any irrelevant sections such as:
                    - **Table of contents**, legal disclaimers, boilerplate text, or other generic content that does not provide value to the analysis.
                    - **Historical or outdated data** that is not pertinent to forecasting the company's future.
                    - **Non-financial content**, like promotional material or unrelated charts and graphs.
                    - **Excessive formatting** that may obscure the core financial information."""

        prompt ="""Please analyze the following company financial data and provide an investor-focused summary that includes all numeric values for every metric, organized under these headings:

                    **Future Growth Prospects**: Identify any signs of potential growth for the company, including new market opportunities, business expansions, or projected increases in revenue. **Ensure that all numeric values** mentioned, such as growth percentages, future revenue figures, or market share projections, are captured and included in your analysis.

                    **Key Business Changes**: Highlight significant changes in the company’s structure, operations, or strategy that might affect its future performance. **Include all numeric values** tied to these changes, such as revenue impacts, cost savings, or any figures provided that illustrate how these changes will affect the business.

                    **Key Triggers**: Identify any external or internal factors that might significantly affect the company’s financial performance, such as regulatory changes, market trends, or new competitors. **Ensure that all relevant numeric values** are included, such as projected market share shifts, investment figures, or economic impacts linked to these triggers.

                    **Impact on Coming Year Earnings and Growth**: Assess how the factors identified above are likely to affect the coming year's earnings and growth.

                    Document text:
                    """ + chunk


        try:
            # Send the system and user messages to the Groq chat completions API
            chat_completion = self.client.chat.completions.create(
                messages=[
                    {"role": "system", "content": context},
                    {"role": "user", "content": prompt},
                ],
                model=self.model,
                temperature=self.temperature,
                max_tokens=self.max_tokens,
                top_p=self.top_p,
                stop=None,
                stream=False
            )

            # Extract the cleaned text
            cleaned_text = chat_completion.choices[0].message.content.strip()

            return cleaned_text

        except Exception as e:
            print(f"Error during Groq API call: {e}")
            return chunk  # Return the original text if there's an error
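
The `except` branch above gives up after a single failure and returns the input unchanged, so a transient error sends raw text into the final summary. A minimal sketch of retrying with exponential backoff before falling back; `call_with_retry` and its parameters are hypothetical, and it wraps any zero-argument callable rather than the Groq client specifically.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller decide how to fall back
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Inside `clean_text`, the existing API call could be wrapped in a `lambda` and passed as `fn`, leaving the current `except` branch as the last-resort fallback.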




**Main Execution Function**

In [None]:
# main.py

import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# import PDFExtractor
# import TextChunker
# import TextSummarizer
# import TokenBatcher

def main():
    # Initialize paths and configuration
    pdf_path = "/content/SJS Transcript Call.pdf"  # Replace with the actual PDF file path
    # The Groq API key is read from the API_KEY variable set in the first cell

    # Step 1: Extract text from the PDF
    pdf_extractor = PDFExtractor(pdf_path)
    pdf_text = pdf_extractor.extract_text()
    pdf_extractor.close()
    # print("Text extracted from PDF.")

    # Step 2: Split text into chunks
    text_chunker = TextChunker(max_words=300)
    text_chunks = text_chunker.split_text_into_chunks(pdf_text)
    # print("Text split into chunks.")

    # Step 3: Group chunks into batches that fit the token limit
    token_batcher = TokenBatcher(token_limit=4000)
    batches = token_batcher.create_batches(text_chunks)
    # print(f"Created {len(batches)} batches based on token limits.")

    # Step 4: Clean and summarize each batch using Groq API
    text_summarizer = TextSummarizer(API_KEY)
    final_text=""

    for i, batch in enumerate(batches):
        # print(f"Processing Batch {i + 1}...")
        combined_text = "\n".join(batch)
        cleaned_chunk = text_summarizer.clean_text(combined_text)
        if cleaned_chunk:
            final_text += cleaned_chunk + "\n\n"

    # Step 5: Summarize the combined batch summaries into the final summary
    final_summary = text_summarizer.clean_text(final_text)


    print("Final Summary:", final_summary)




if __name__ == "__main__":
    main()


Final Summary: **Future Growth Prospects:**

* The company expects to generate significant growth from the Walter Pack India business, with proforma revenue growth of 48.2% YoY and proforma EBITDA margin expansion of 120 bps.
* The company is targeting a growth rate of 20% to 25% organically, driven by the consolidation of SJS, Exotech, and Walter Pack.
* Walter Pack is expected to grow at a faster pace than the consolidated entity, driven by its strong demand from customers and new business opportunities.
* The company is also expecting growth from its consumer durables segment, driven by its expansion into new product lines and customer relationships.
* The company is targeting a growth rate of 50% in FY24, driven by the consolidation of SJS, Exotech, and Walter Pack.
* The company is expecting a significant increase in revenue from its exports, driven by its growth in Europe and North America.

**Key Business Changes:**

* The acquisition of Walter Pack India has added new technolog