<a href="https://colab.research.google.com/github/PranavGovindu/practice/blob/main/book_summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain
!pip install langchain-community langchain-core
!pip install tiktoken



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain_core.prompts import PromptTemplate


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

# Load your custom Hugging Face model
model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=4096)

# Integrate the pipeline into LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Use the LLM in LangChain
response = llm("What is LangChain?")
print(response)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
  llm = HuggingFacePipeline(pipeline=pipe)
  response = llm("What is LangChain?")
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


What is LangChain?

LangChain is an open-source framework designed to simplify the development of applications powered by large language models (LLMs). It provides a modular and flexible toolkit for building applications that leverage the capabilities of LLMs, such as:

**Key Features:**

* **Chain Creation:** LangChain allows you to create complex chains of prompts and actions, enabling LLMs to perform multiple tasks in sequence.
* **Memory Management:** It provides mechanisms for LLMs to retain context and information from previous interactions, improving the accuracy and coherence of responses.
* **Agent Capabilities:** LangChain enables the creation of agents that can interact with the world, access external data sources, and make decisions based on their environment.
* **Prompt Templates:** It offers a library of prompt templates that can be customized to tailor the interaction with LLMs to specific use cases.
* **Integration with Other Tools:** LangChain can be integrated with va

In [None]:
#entropy thing
#multiple agents to summarize

In [None]:
prompt_template="""
<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""


# Instantiation using from_template (recommended)
prompt = PromptTemplate.from_template(prompt_template)



In [None]:
import tiktoken
import numpy as np

def calculate_entropy(tokens):
    _, counts = np.unique(tokens, return_counts=True)
    probabilities = counts / len(tokens)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy

In [None]:
def split_text_by_entropy(text, max_tokens=2000, entropy_threshold=4.0, window_size=50):
    # Initialize the tokenizer
    encoder = tiktoken.get_encoding("gpt2")
    tokens = encoder.encode(text)

    # Calculate entropy over a sliding window
    entropies = [
        calculate_entropy(tokens[i:i + window_size]) for i in range(0, len(tokens) - window_size + 1)
    ]

    # Find split points where entropy changes significantly or falls below/above the threshold
    split_points = [0]
    for i in range(1, len(entropies)):
        if abs(entropies[i] - entropies[i - 1]) > 1.0 or entropies[i] < entropy_threshold:
            split_points.append(i * window_size)
    split_points.append(len(tokens))

    # Create chunks based on split points
    chunks = []
    for start, end in zip(split_points[:-1], split_points[1:]):
        chunk_tokens = tokens[start:end]
        if len(chunk_tokens) <= max_tokens:
            chunks.append(encoder.decode(chunk_tokens))
        else:
            # Further split large chunks to fit max_tokens
            sub_chunks = [
                chunk_tokens[i:i + max_tokens] for i in range(0, len(chunk_tokens), max_tokens)
            ]
            chunks.extend([encoder.decode(sub_chunk) for sub_chunk in sub_chunks])

    return chunks

In [None]:
# load and split book
def load_and_split_book(text):
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=4069,chunk_overlap=300)
    return text_splitter.split_text(text)


In [None]:
#here we summaize the chunks
def summarize_chunks(chunks, llm):
    chunk_summaries = []
    for chunk in chunks:
        prompt = f"Summarize the following text:\n\n{chunk}"
        summary = llm(prompt)
        chunk_summaries.append(summary)
    return chunk_summaries


In [None]:
#combine summaries into the final summary
def combine_summaries(chunk_summaries, llm):
    combined_text = " ".join(chunk_summaries)
    final_summary = llm(f"Summarize the following:\n\n{combined_text}")
    return final_summary

In [None]:
def main(file_path):
    # Load the book's text content
    with open(file_path, "r",encoding="ISO-8859-1") as file:
        text = file.read()

    # Split the text into chunks based on entropy
    print("Splitting text by entropy...")
    chunks=load_and_split_book(text)
    #chunks = split_text_by_entropy(text, max_tokens=4000, entropy_threshold=1.0, window_size=50)
    print(f"Book split into {len(chunks)} chunks.")

    # Summarize each chunk
    print("Summarizing chunks...")
    chunk_summaries = summarize_chunks(chunks, llm)

    # Combine summaries
    print("Combining summaries...")
    final_summary = combine_summaries(chunk_summaries, llm)

    return final_summary

In [1]:
texto="""
The Top 7 Habits for Success and Personal Growth

Set Clear Goals

Successful individuals consistently set clear, actionable goals. By defining what you want to achieve, you create a roadmap for your efforts. Use techniques like SMART (Specific, Measurable, Achievable, Relevant, Time-bound) goals to stay on track and measure your progress.

Prioritize Time Management

Time is a finite resource, and effective time management is essential. Use tools like planners, calendars, or apps to organize your day. The Pomodoro Technique, Eisenhower Matrix, and time-blocking can help ensure you focus on high-priority tasks while avoiding distractions.

Maintain a Growth Mindset

Adopting a growth mindset means believing that abilities and intelligence can be developed with effort and learning. Embrace challenges, seek feedback, and view failures as opportunities to grow. This perspective fuels resilience and adaptability in both personal and professional spheres.

Cultivate Healthy Habits

Physical and mental health are foundational to productivity. Prioritize regular exercise, a balanced diet, and sufficient sleep. Meditation or mindfulness practices can reduce stress and improve focus, while hydration and movement breaks enhance overall well-being.

Practice Effective Communication

Strong communication skills enable you to express ideas clearly and build meaningful relationships. Practice active listening, empathy, and assertiveness. Whether in personal or professional settings, effective communication fosters understanding and collaboration.

Commit to Lifelong Learning

Knowledge is a powerful tool for growth. Make it a habit to learn something new every day. Read books, take courses, attend workshops, or listen to podcasts in areas that interest you. Staying curious and informed keeps you competitive and intellectually engaged.

Build a Positive Network

Surround yourself with supportive, motivated individuals who inspire you to achieve your best. Networking isn’t just about making connections; it’s about fostering meaningful relationships. A strong, positive network provides guidance, encouragement, and valuable opportunities.

Conclusion
Developing these habits takes time and dedication, but the rewards are significant. By integrating them into your daily routine, you’ll cultivate a foundation for success, resilience, and continuous personal growth. Start small, stay consistent, and watch as these habits transform your life.

"""

In [None]:
main("/content/Communication-Skills.pdf")

Splitting text by entropy...
Book split into 73 chunks.
Summarizing chunks...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Combining summaries...


ValueError: Input length of input_ids is 4096, but `max_length` is set to 4096. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.