| **Category**                       | **Languages**                                                                 |
|-------------------------------------|-------------------------------------------------------------------------------|
| **Compiled Languages**              | C++, Go, Java, Kotlin, Rust, C, COBOL                                          |
| **Interpreted Languages**           | Python, Ruby, Perl, Lua, PowerShell                                           |
| **Scripting Languages**             | JavaScript (JS), TypeScript (TS), PHP, Elixir                                  |
| **Markup/Formatting Languages**     | Markdown, LaTeX, HTML                                                          |
| **Functional and Declarative**      | Haskell, Scala, Elixir                                                         |
| **Specialized Languages**           | Protocol Buffers (Proto), Solidity (SOL), RST (reStructuredText)              |
| **Object-Oriented Languages**       | Java, C#, Swift, Ruby, Python                                                 |

In [1]:
! pip install -qU langchain-text-splitters

- The full list of supported languages can be read from the `Language` enum:

In [2]:
from langchain_text_splitters import (Language, RecursiveCharacterTextSplitter)

supported_languages = [e.value for e in Language]
print(supported_languages)

['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'haskell', 'elixir', 'powershell']


- You can also inspect the separators used for a given language, for example Markdown:

In [3]:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN)

['\n#{1,6} ',
 '```\n',
 '\n\\*\\*\\*+\n',
 '\n---+\n',
 '\n___+\n',
 '\n\n',
 '\n',
 ' ',
 '']
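The separators are regular expressions listed from coarsest to finest. To see what the first one does, it can be tried directly with Python's `re` module. This is only a sketch: the sample Markdown string is made up, and splitting on a lookahead is an assumption chosen to keep each heading attached to its section, not a copy of the library's internals.

```python
import re

# First Markdown separator from the list above:
# a newline, then one to six '#' characters, then a space
HEADING_SEPARATOR = r"\n#{1,6} "

# Hypothetical sample document (not part of the original notebook)
sample = "# Title\nIntro text.\n## Section A\nBody A.\n## Section B\nBody B."

# Split just before each match (lookahead) so the heading stays with its section
pieces = re.split(f"(?={HEADING_SEPARATOR})", sample)

for piece in pieces:
    print(repr(piece))
```

The very first heading is not a split point because no newline precedes it, so the sample yields three pieces, one per heading.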

In [4]:
from datetime import datetime

from langchain_core.documents import Document
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Step 1: Define the Python code to be split into chunks
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

# Step 2: Initialize the text splitter for Python
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # Use the Python-specific separators
    chunk_size=50,             # Maximum size of each chunk, in characters
    chunk_overlap=0,           # Number of characters shared by adjacent chunks
)

# Step 3: Split the code into chunks
chunks_without_metadata = python_splitter.split_text(PYTHON_CODE)

# Step 4: Build one metadata dict per chunk
metadata = []
cursor = 0  # Position in PYTHON_CODE from which to search for the next chunk
for idx, chunk in enumerate(chunks_without_metadata):
    # The splitter strips whitespace between chunks, so locate each chunk in
    # the source text instead of summing the lengths of the preceding chunks
    start = PYTHON_CODE.find(chunk, cursor)
    cursor = start + len(chunk)
    chunk_metadata = {
        "document_id": "Python_Code_Snippet",    # Name/ID for the code snippet
        "chunk_index": idx + 1,                  # Index of the chunk (starting from 1)
        "text_length": len(chunk),               # Length of the chunk in characters
        "start_position": start,                 # Offset of the chunk's first character
        "end_position": cursor,                  # Offset just past the chunk's last character
        "language": "Python",                    # Programming language of the code
        "timestamp": datetime.now().isoformat(), # Current timestamp in ISO 8601 format
    }
    metadata.append(chunk_metadata)

# Step 5: Combine the chunks with their metadata.
# create_documents() takes one metadata dict per input text, not per chunk,
# so each chunk is paired with its own metadata directly.
python_docs = [
    Document(page_content=chunk, metadata=chunk_metadata)
    for chunk, chunk_metadata in zip(chunks_without_metadata, metadata)
]
chunks = python_docs

# Step 6: Print details about the chunks
print('Total number of chunks is:', len(chunks))  # Total number of chunks
print('First chunk:\n', chunks[0].page_content[:150])  # Display up to the first 150 characters of the first chunk

# Step 7: Analyze chunk sizes
chunk_sizes = [len(chunk.page_content) for chunk in chunks]  # Calculate sizes of all chunks
print('Maximum chunk size:', max(chunk_sizes))  # Print the largest chunk size
print('Minimum chunk size:', min(chunk_sizes))  # Print the smallest chunk size


Total number of chunks is: 2
First chunk:
 def hello_world():
    print("Hello, World!")
Maximum chunk size: 45
Minimum chunk size: 33
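The reported sizes can be cross-checked against the source string without the library. The sketch below hard-codes the two chunks: the first is copied from the output above, and the second is inferred from the remaining code and the reported size of 33 characters, so treat it as an assumption. The helper name `locate_chunks` is made up for illustration.

```python
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

# First chunk copied from the output above; second chunk inferred (assumption)
chunks = [
    'def hello_world():\n    print("Hello, World!")',
    "# Call the function\nhello_world()",
]


def locate_chunks(text, chunks):
    """Return the (start, end) offsets of each chunk in text, searching left to right."""
    spans = []
    cursor = 0
    for chunk in chunks:
        start = text.find(chunk, cursor)
        if start == -1:
            raise ValueError(f"chunk not found in text: {chunk!r}")
        cursor = start + len(chunk)
        spans.append((start, cursor))
    return spans


spans = locate_chunks(PYTHON_CODE, chunks)
sizes = [end - start for start, end in spans]

print("Spans:", spans)                     # Spans: [(1, 46), (48, 81)]
print("Sizes:", sizes)                     # Sizes: [45, 33]
print("Source length:", len(PYTHON_CODE))  # Source length: 82
```

The chunk sizes add up to 78 characters while the source string has 82. The four missing characters are the newlines around and between the chunks, which the splitter strips by default, so a running sum of chunk lengths drifts away from the true offsets in the original text.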
