## Data processing components

| Component category | Component name |
| --- | --- |
| Data preprocessing | DocumentCleaner |
| Data preprocessing | DocumentSplitter |
| Data extraction | LinkContentFetcher |
| Data caching | URLCacheChecker |
| Audio to text processing | LocalWhisperTranscriber |
| Audio to text processing | RemoteWhisperTranscriber |
| File converter | AzureOCRDocumentConverter |
| File converter | HTMLToDocument |
| File converter | MarkdownToDocument |
| File converter | PyPDFToDocument |
| File converter | TikaDocumentConverter |
| File converter | TextFileToDocument |
| Language classifier | DocumentLanguageClassifier |
| Language classifier | TextLanguageClassifier |


### Document Cleaner

Exercise: Remove extra whitespace and punctuation from a document using the DocumentCleaner component.


In [4]:
from haystack.preview.components.preprocessors import DocumentCleaner 
from haystack.preview.dataclasses import Document

A simple example: the cleaner below removes empty lines, extra whitespace, and a set of specific characters (here, punctuation) described by a regular expression.

In [21]:
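# Regular expression matching any single ASCII punctuation character.
# "-", "[", "\" and "]" are escaped so they are read as literals inside the character class.
punctuation_regex = r"[!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]"

# Create an instance of DocumentCleaner with the regex.
# remove_substrings expects a list of literal strings, so the pattern is passed as remove_regex.
cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False,
    remove_substrings=None,
    remove_regex=punctuation_regex
)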

# Sample document with exclamation marks and punctuation
sample_document = Document(content="This is a simple document! <<With some extra spaces... and punctuation!!", meta={"name": "test_doc"})

# Using the cleaner
cleaned_documents = cleaner.run([sample_document])

# Extracting the cleaned document
cleaned_document = cleaned_documents['documents'][0]

# Output the cleaned content
print("Cleaned Document Content:", cleaned_document.content)

Cleaned Document Content: This is a simple document With some extra spaces and punctuation
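
To check a punctuation pattern before handing it to the cleaner, it can help to try it with Python's standard `re` module first. The sketch below is plain Python, not Haystack code: it assumes the pattern is built from `string.punctuation` with `re.escape`, and the names `punctuation_pattern`, `text`, and `cleaned` are illustrative only.

```python
import re
import string

# Build a character class from every ASCII punctuation character.
# re.escape takes care of characters that are special inside a class ("-", "[", "\", "]", "^").
punctuation_pattern = "[" + re.escape(string.punctuation) + "]"

text = "This is a simple document! <<With some extra spaces... and punctuation!!"

# Remove punctuation, then collapse any runs of whitespace left behind.
without_punctuation = re.sub(punctuation_pattern, "", text)
cleaned = " ".join(without_punctuation.split())

print(cleaned)
```

A pattern that behaves as expected here is a reasonable candidate for a regex-based cleaning step.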


### Document Splitter

Exercise: Document Splitting for Language Model Processing

Objective:

Write a Python script that splits a long text document into smaller segments using the DocumentSplitter component. The script should support splitting by word, sentence, or passage. Test the splitter with different configurations and observe how each one affects the output.

In [22]:
from haystack.preview.components.preprocessors import DocumentSplitter

In [23]:
# Document and DocumentSplitter were imported in the cells above

# Create a long text document
text_content = """
Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...
"""

# Create a Document object
long_document = Document(content=text_content, meta={"name": "long_text_doc"})

# Initialize DocumentSplitters with different configurations
word_splitter = DocumentSplitter(split_by="word", split_length=50, split_overlap=10)
sentence_splitter = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=1)
passage_splitter = DocumentSplitter(split_by="passage", split_length=2, split_overlap=0)

# Function to print split documents
def print_splits(documents, title):
    print(f"--- {title} ---")
    for i, doc in enumerate(documents['documents'], 1):
        print(f"Segment {i}:\n{doc.content}\n")

# Split the document in different ways
word_splits = word_splitter.run([long_document])
sentence_splits = sentence_splitter.run([long_document])
passage_splits = passage_splitter.run([long_document])

# Print the results
print_splits(word_splits, "Word Splits")
print_splits(sentence_splits, "Sentence Splits")
print_splits(passage_splits, "Passage Splits")


--- Word Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...


--- Sentence Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...

Segment 2:
.


--- Passage Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...
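
The placeholder text is shorter than the configured split lengths, so the output above shows almost no splitting. To see how a length and an overlap interact on a longer input, here is a minimal plain-Python sketch of a sliding window over words. It is an illustration under stated assumptions, not the DocumentSplitter implementation: `split_words` is a hypothetical helper, and its parameter names only mirror the component's `split_length` and `split_overlap`.

```python
def split_words(text, split_length, split_overlap):
    """Split text into segments of up to split_length words,
    where consecutive segments share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return segments


# Ten placeholder words: w1 ... w10
sample_text = " ".join(f"w{i}" for i in range(1, 11))
segments = split_words(sample_text, split_length=4, split_overlap=1)

for i, segment in enumerate(segments, 1):
    print(f"Segment {i}: {segment}")
```

With a length of 4 and an overlap of 1, each segment starts on the last word of the previous one. A text with fewer words than the length comes back as a single segment, which matches what the word splitter printed for the placeholder text.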


