**LLM : Large Language Models**

**Frameworks:**

- ML == Scikit-Learn
- DL == TensorFlow/PyTorch/Keras
- NLP == NLTK
- GenAI + LLM == LangChain

- LangChain is a Python framework (a high-level package) used to integrate applications with LLMs.
- LangChain is an ocean: it supports models from 40+ companies. Some examples below:
    - LLM models == ChatGPT, Gemini, Bedrock

1. ChatGPT == OpenAI + Microsoft
2. Gemini == Google
3. Bedrock == Amazon
4. Llama == Meta

In [45]:
# Step-1: Install the packages

!pip install langchain
!pip install langchain-community
!pip install openai



In [1]:
# Step-2: Import the Packages

from langchain_community.document_loaders import TextLoader   # loaders live in langchain-community (installed in Step-1)
import pandas as pd
import openai

In [3]:
loader = TextLoader("Sample_text.txt")
documents = loader.load()
print(documents)
print(type(documents))

[Document(metadata={'source': 'Sample_text.txt'}, page_content='The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.')]
<class 'list'>


- Document
    - metadata (dict, e.g. the source file)
    - page_content (str, the text itself)
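
The structure above can be sketched without LangChain. This is only an illustration: `SimpleDocument` and `load_text` are hypothetical stand-ins written for this note, not the real `Document` class or `TextLoader`, and the file name `sample.txt` is made up.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path):
    """Read one text file and wrap it in a one-item list, mirroring the loader output above."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("First line.\nSecond line.", encoding="utf-8")
    sample_docs = load_text(sample_path)

print(len(sample_docs))                 # a list with one item
print(type(sample_docs[0].page_content))  # the text is a plain string
print(sample_docs[0].metadata)          # where the text came from
```

The point is only the shape: one file becomes a list holding one object, and that object keeps the text and its metadata side by side.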

In [4]:
len(documents)   # As it is a list with 1 item

1

In [5]:
print(type(documents[0]))    # Here the type is langchain document
documents[0]

<class 'langchain_core.documents.base.Document'>


Document(metadata={'source': 'Sample_text.txt'}, page_content='The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.')

In [6]:
documents[0].metadata      # source: the file this document was loaded from

{'source': 'Sample_text.txt'}

In [7]:
print(type(documents[0].page_content))    # type is string
documents[0].page_content

<class 'str'>


'The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.'

In [8]:
len(documents[0].page_content)

433

- Step-3: Text splitting, also called chunking

In [9]:
from langchain.text_splitter import CharacterTextSplitter

In [10]:
# chunk_size     == Maximum chunk size (counted in characters here, because the default length function is len)
# chunk_overlap  == Number of characters shared between consecutive chunks, to maintain continuity and keep the semantic meaning
# separator      == Defines where the text may be split (newline, space, custom delimiter)

text_splitter = CharacterTextSplitter(separator=' ', chunk_size=120, chunk_overlap=10)
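
To see how the three parameters interact, here is a simplified pure-Python sketch. It is an assumption-laden approximation, not LangChain's implementation: `split_with_overlap` is a hypothetical helper that splits on the separator, packs pieces greedily up to `chunk_size` characters, and carries trailing pieces forward as long as they fit within `chunk_overlap`.

```python
def split_with_overlap(text, separator=" ", chunk_size=14, chunk_overlap=5):
    """Greedy character-based chunking with overlap (simplified sketch)."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces that make up the chunk being built

    def joined_len(parts):
        # Length of the parts once they are re-joined with the separator.
        return len(separator.join(parts))

    for piece in pieces:
        # Close the current chunk if adding this piece would exceed chunk_size.
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the tail fits in chunk_overlap
            # and there is room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("one two three four five six seven")
for number, chunk in enumerate(demo_chunks, start=1):
    print(number, repr(chunk), len(chunk))
```

In this toy run each chunk starts with the last word of the previous one, which is the same effect as the repeated words in the notebook's chunks.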


In [11]:
docs = text_splitter.split_documents(documents)
docs

[Document(metadata={'source': 'Sample_text.txt'}, page_content='The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a'),
 Document(metadata={'source': 'Sample_text.txt'}, page_content='provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can'),
 Document(metadata={'source': 'Sample_text.txt'}, page_content='you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the'),
 Document(metadata={'source': 'Sample_text.txt'}, page_content='allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.')]

In [None]:
# No warning from the cell above means every chunk fit within chunk_size
# Each warning reports one chunk that is longer than chunk_size
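
A chunk can only exceed the limit when a single piece between two separators is already longer than `chunk_size`. The check below is a small sketch of that idea; `find_oversized` and the sample text are hypothetical and only illustrate the condition, they are not part of LangChain.

```python
def find_oversized(chunks, chunk_size):
    """Return (index, length) for every chunk longer than chunk_size."""
    return [(i, len(c)) for i, c in enumerate(chunks) if len(c) > chunk_size]


# Splitting only on newlines cannot shorten a line that has no newline inside it.
sample_text = "short line\nthis single line has no newline inside and is far too long\nok"
pieces = sample_text.split("\n")

problems = find_oversized(pieces, chunk_size=20)
print(problems)
```

Running the same check on the notebook's chunks, for example `find_oversized([d.page_content for d in docs], 120)`, should return an empty list when no warning was shown.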

In [12]:
len(docs)

4

In [13]:
docs[0]   # One Chunk

Document(metadata={'source': 'Sample_text.txt'}, page_content='The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a')

- Print all chunks

In [14]:
for i, chunk in enumerate(docs):
    print(f"chunk {i+1}:\n{chunk.page_content}")


chunk 1:
The Langchain framework is designed to help developers build applications powered by language models.
It provides a
chunk 2:
provides a variety of tools, including text splitters, to efficiently handle large documents.
With TextSplitter, you can
chunk 3:
you can break down documents into manageable chunks, ensuring that no token limits are exceeded.
This allows the
chunk 4:
allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.


- Print the lengths of all chunks

In [71]:
[len(doc.page_content) for doc in docs]

[115, 120, 112, 113]

In [72]:
# Total length 460 > 433 because adjacent chunks repeat the overlapping text (10 + 7 + 10 = 27 extra characters)
sum(len(doc.page_content) for doc in docs)

460
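
The difference between 460 and 433 can be checked directly from the chunk text printed earlier. The sketch below copies those four chunk strings and measures the text shared by each neighbouring pair; `shared_overlap` is a hypothetical helper written for this check.

```python
def shared_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# The four chunks exactly as shown in the notebook output.
chunk_texts = [
    "The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a",
    "provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can",
    "you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the",
    "allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.",
]

overlaps = [shared_overlap(a, b) for a, b in zip(chunk_texts, chunk_texts[1:])]
total_length = sum(len(c) for c in chunk_texts)

print(overlaps)                        # characters repeated at each boundary
print(total_length)                    # sum of all chunk lengths
print(total_length - sum(overlaps))    # length once repeats are removed
```

Subtracting the repeated characters from the summed chunk lengths gives back the length of the original `page_content`.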

In [73]:
chunks = text_splitter.split_text(documents[0].page_content)      # split_text takes a plain string and returns a list of strings
chunks

['The Langchain framework is designed to help developers build applications powered by language models.\nIt provides a',
 'provides a variety of tools, including text splitters, to efficiently handle large documents.\nWith TextSplitter, you can',
 'you can break down documents into manageable chunks, ensuring that no token limits are exceeded.\nThis allows the',
 'allows the language model to process large text in smaller, sequential parts, maintaining context and continuity.']

In [41]:
# from openai import OpenAI

# # Initialize the OpenAI client. Never hard-code the API key in a notebook;
# # with no arguments the client reads it from the OPENAI_API_KEY environment variable.
# llm = OpenAI()

# # Function to get a summary for a chunk of text using the Chat Completions endpoint
# def get_summary(chunk):
#     response = llm.chat.completions.create(   # openai>=1.0 client call; openai.ChatCompletion was removed
#         model="gpt-3.5-turbo",  # Use a chat-based model (e.g., gpt-3.5-turbo or gpt-4)
#         messages=[
#             {"role": "system", "content": "You are a helpful assistant."},
#             {"role": "user", "content": f"Summarize the following text:\n\n{chunk}"}
#         ],
#         max_tokens=150,  # Limit the response length
#         temperature=0.5  # Adjust the creativity level
#     )
#
#     # Extract and return the summary text from the response object
#     return response.choices[0].message.content.strip()

# # Process each chunk with the language model
# summaries = [get_summary(chunk) for chunk in chunks]

# # Print the summaries
# for i, summary in enumerate(summaries):
#     print(f"Summary of Chunk {i+1}:\n{summary}\n")
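
The per-chunk loop above can be tried without an API key by swapping the model call for a placeholder. This is a sketch under that assumption: `summarize_chunks` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in that keeps the first sentence, not a real summarizer.

```python
def summarize_chunks(chunks, summarize_fn):
    """Apply summarize_fn to each chunk and return (chunk_number, summary) pairs."""
    return [(i + 1, summarize_fn(chunk)) for i, chunk in enumerate(chunks)]


def first_sentence(text):
    """Placeholder 'summarizer': keep only the text up to the first period."""
    head, sep, _ = text.partition(".")
    return (head + sep).strip()


sample_chunks = [
    "Chunking splits long text. It keeps pieces small.",
    "Overlap preserves context. Neighbours share words.",
]

results = summarize_chunks(sample_chunks, first_sentence)
for number, summary in results:
    print(f"Summary of Chunk {number}:\n{summary}\n")
```

Because the summarizer is passed in as a function, a real model call such as `get_summary` could replace `first_sentence` without changing the loop.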