## Data processing components

| Component category | Component name |
| --- | --- |
| Data preprocessing | DocumentCleaner |
| Data preprocessing | DocumentSplitter |
| Data extraction | LinkContentFetcher |
| Data caching | CacheChecker |
| Audio to text processing | LocalWhisperTranscriber |
| Audio to text processing | RemoteWhisperTranscriber |
| File converter | AzureOCRDocumentConverter |
| File converter | HTMLToDocument |
| File converter | MarkdownToDocument |
| File converter | PyPDFToDocument |
| File converter | TikaDocumentConverter |
| File converter | TextFileToDocument |
| Language classifier | DocumentLanguageClassifier |
| Language classifier | TextLanguageClassifier |


In [None]:
# Install only haystack-ai: it conflicts with farm-haystack when both share one environment
!pip install haystack-ai python-dotenv

### Document Cleaner

Exercise: Remove extra whitespace and punctuation from a document using the DocumentCleaner component.


In [1]:
from haystack.components.preprocessors import DocumentCleaner 
from haystack.dataclasses import Document

A simple instance that removes extra whitespace, empty lines, and specific characters such as punctuation.

In [2]:
# Define a regular expression matching ASCII punctuation.
# '-', '[', '\' and ']' are escaped so they are read as literal characters inside the class.
punctuation_regex = r"[!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]"

# Create an instance of DocumentCleaner with the regex.
# remove_regex takes a pattern; remove_substrings expects a list of literal strings.
cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False,
    remove_substrings=None,
    remove_regex=punctuation_regex
)

# Sample document with exclamation marks and punctuation
sample_document = Document(content="This is a simple document! <<With some extra spaces... and punctuation!!", meta={"name": "test_doc"})

# Using the cleaner
cleaned_documents = cleaner.run([sample_document])

# Extracting the cleaned document
cleaned_document = cleaned_documents['documents'][0]

# Output the cleaned content
print("Cleaned Document Content:", cleaned_document.content)

Cleaned Document Content: This is a simple document With some extra spaces and punctuation
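
To see what the punctuation pattern does independently of Haystack, here is a minimal standard-library sketch. It is not how DocumentCleaner is implemented; the helper name `strip_punctuation` is hypothetical, and building the character class from `string.punctuation` is an assumption made to avoid escaping symbols by hand.

```python
import re
import string

# Build the character class from string.punctuation; re.escape handles
# the characters that are special inside a class, such as ']' and '\'.
PUNCTUATION_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation, then collapse runs of whitespace."""
    without_punctuation = PUNCTUATION_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_punctuation).strip()


cleaned = strip_punctuation(
    "This is a simple document! <<With some extra spaces... and punctuation!!"
)
print(cleaned)
```

Running the helper on the sample sentence gives the same text as the cleaned document above, which is a quick way to check a pattern before passing it to the component.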


### Document Splitter

Exercise: Document Splitting for Language Model Processing

Objective:

Write a Python script to split a long text document into smaller segments using the DocumentSplitter component. The script should be able to handle splitting by words, sentences, or passages. You'll test the splitter with different configurations and observe how it affects the output.

In [3]:
from haystack.components.preprocessors import DocumentSplitter

In [4]:
# Document and DocumentSplitter were imported in the cells above

# Create a long text document
text_content = """
Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...
"""

# Create a Document object
long_document = Document(content=text_content, meta={"name": "long_text_doc"})

# Initialize DocumentSplitters with different configurations
word_splitter = DocumentSplitter(split_by="word", split_length=50, split_overlap=10)
sentence_splitter = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=1)
passage_splitter = DocumentSplitter(split_by="passage", split_length=2, split_overlap=0)

# Function to print split documents
def print_splits(documents, title):
    print(f"--- {title} ---")
    for i, doc in enumerate(documents['documents'], 1):
        print(f"Segment {i}:\n{doc.content}\n")

# Split the document in different ways
word_splits = word_splitter.run([long_document])
sentence_splits = sentence_splitter.run([long_document])
passage_splits = passage_splitter.run([long_document])

# Print the results
print_splits(word_splits, "Word Splits")
print_splits(sentence_splits, "Sentence Splits")
print_splits(passage_splits, "Passage Splits")


--- Word Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...


--- Sentence Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...

Segment 2:
.


--- Passage Splits ---
Segment 1:

Your long text content goes here. It should include multiple paragraphs, sentences, and a variety of words.
...
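
The placeholder text has far fewer than 50 words, which is why the word splitter returns a single segment. The effect of `split_length` and `split_overlap` is easier to see on a small input. The following is an illustrative sketch of a sliding word window, not DocumentSplitter's actual implementation; the function name `split_by_words` is hypothetical.

```python
from typing import List


def split_by_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into windows of split_length words that share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return segments


numbered_words = " ".join(f"w{i}" for i in range(1, 13))  # "w1 w2 ... w12"
segments = split_by_words(numbered_words, split_length=5, split_overlap=2)
for segment in segments:
    print(segment)
```

Each printed segment starts with the last two words of the previous one, so context that straddles a boundary appears in both segments.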




### Fetching data from a link

Exercise: Implementing and Testing LinkContentFetcher

Objective:

In this exercise, you will configure and test the LinkContentFetcher component to fetch content from a list of URLs. The component handles different content types, retries failed requests, and rotates user agents across web requests.
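
Retrying with rotating user agents can be pictured with a small sketch that needs no network access. This is not LinkContentFetcher's code: `fetch_with_retries` and `fake_fetch` are hypothetical names, and treating `retry_attempts` as the number of retries after the first try is an assumption.

```python
from itertools import cycle
from typing import Callable, Dict, List, Optional

# Hypothetical signature for a function that performs one HTTP request
FetchFn = Callable[[str, Dict[str, str]], bytes]


def fetch_with_retries(
    url: str, fetch: FetchFn, user_agents: List[str], retry_attempts: int
) -> Optional[bytes]:
    """Try once, then retry up to retry_attempts times, switching user agent each time."""
    agents = cycle(user_agents)
    for _ in range(retry_attempts + 1):
        headers = {"User-Agent": next(agents)}
        try:
            return fetch(url, headers)
        except ConnectionError:
            continue  # move on to the next user agent
    return None  # give up quietly, loosely analogous to raise_on_failure=False


# Stand-in for a real HTTP call: rejects the first user agent, accepts the second
seen_agents: List[str] = []


def fake_fetch(url: str, headers: Dict[str, str]) -> bytes:
    seen_agents.append(headers["User-Agent"])
    if headers["User-Agent"] == "UserAgent1":
        raise ConnectionError("simulated failure")
    return b"<!DOCTYPE "


content = fetch_with_retries(
    "https://example.com", fake_fetch, ["UserAgent1", "UserAgent2"], retry_attempts=3
)
print(content, seen_agents)
```

Passing the request function in as an argument keeps the retry logic testable without real web traffic.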

In [5]:
from haystack.components.fetchers import LinkContentFetcher

In [8]:

# Initialize LinkContentFetcher
fetcher = LinkContentFetcher(
    raise_on_failure=False,
    user_agents=["UserAgent1", "UserAgent2"],
    retry_attempts=3,
    timeout=5
)

# List of URLs to test
urls = [
    "https://en.wikipedia.org/wiki/Barbie_(film)",
    "https://en.wikipedia.org/wiki/Oppenheimer_(film)",
]

# Fetch content from URLs
results = fetcher.run(urls)

# Analyze the fetched content
for stream in results['streams']:
    print(f"URL: {stream.meta['url']}")
    print(f"Content Type: {stream.meta['content_type']}")
    print(f"First 10 characters: {stream.data[:10]} ...")
    print("\n")


URL: https://en.wikipedia.org/wiki/Barbie_(film)
Content Type: text/html
First 10 characters: b'<!DOCTYPE ' ...


URL: https://en.wikipedia.org/wiki/Oppenheimer_(film)
Content Type: text/html
First 10 characters: b'<!DOCTYPE ' ...




The fetcher downloads the raw content of each page and returns it as a `ByteStream` object.

In [None]:
#results['streams']

###ByteStream(data=b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled 


We can save this into a Document.

In [10]:
from haystack.dataclasses import Document

web_document_barbie = Document(blob=results['streams'][0], meta=results['streams'][0].meta)
web_document_oppenheimer = Document(blob=results['streams'][1], meta=results['streams'][1].meta)

Then save our Documents into a Document store.

In [11]:
from haystack.document_stores.in_memory.document_store import InMemoryDocumentStore

sample_docstore = InMemoryDocumentStore()
web_docs = [web_document_barbie, web_document_oppenheimer]
sample_docstore.write_documents(documents=web_docs)

2

### Implementing CacheChecker

In this exercise, you will use the CacheChecker component to check whether documents from specific URLs are already present in a document store. The goal is to understand how to add caching to web retrieval pipelines using a document store.

In [14]:
from haystack.components.caching import CacheChecker

# Initialize CacheChecker, matching on the 'url' metadata field
url_cache_checker = CacheChecker(document_store=sample_docstore, cache_field='url')

# List of URLs to check
urls_to_check = [
    "https://en.wikipedia.org/wiki/Oppenheimer_(film)", # This URL should be a hit
    "https://en.wikipedia.org/wiki/Avengers:_Endgame",  # This URL should be a miss

]

# Run CacheChecker
cache_results = url_cache_checker.run(urls_to_check)

# Analyze Results
print("Hits (Found in Store):")
for doc in cache_results['hits']:
    print(f"URL: {doc.meta['url']} - Content: {doc.blob.data[0:10]} ... ")

print("\nMisses (Not Found in Store):")
for url in cache_results['misses']:
    print(url)


Hits (Found in Store):
URL: https://en.wikipedia.org/wiki/Oppenheimer_(film) - Content: b'<!DOCTYPE ' ... 

Misses (Not Found in Store):
https://en.wikipedia.org/wiki/Avengers:_Endgame
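
The hit-or-miss logic can be sketched with plain dictionaries in place of a document store. This is only an illustration of the idea, not the component's implementation; `check_cache` and the dictionary layout are hypothetical.

```python
from typing import Any, Dict, List


def check_cache(
    stored_docs: List[Dict[str, Any]], items: List[str], cache_field: str
) -> Dict[str, list]:
    """Return stored documents whose metadata matches an item, plus the unmatched items."""
    hits: List[Dict[str, Any]] = []
    misses: List[str] = []
    for item in items:
        matches = [doc for doc in stored_docs if doc["meta"].get(cache_field) == item]
        if matches:
            hits.extend(matches)
        else:
            misses.append(item)
    return {"hits": hits, "misses": misses}


# Plain dictionaries standing in for the documents written to the store
stored_docs = [
    {"content": "barbie page", "meta": {"url": "https://en.wikipedia.org/wiki/Barbie_(film)"}},
    {"content": "oppenheimer page", "meta": {"url": "https://en.wikipedia.org/wiki/Oppenheimer_(film)"}},
]

sketch_results = check_cache(
    stored_docs,
    [
        "https://en.wikipedia.org/wiki/Oppenheimer_(film)",
        "https://en.wikipedia.org/wiki/Avengers:_Endgame",
    ],
    cache_field="url",
)
print(sketch_results["misses"])
```

In a pipeline, only the misses would need to go on to a fetcher, while the hits can be reused directly.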


### RemoteWhisperTranscriber

Objective:

Write a Python script to use the RemoteWhisperTranscriber component for transcribing audio files using OpenAI's Whisper API. The goal is to understand how to interact with remote machine learning models for audio transcription.





In [15]:
from haystack.components.audio import RemoteWhisperTranscriber
from haystack.dataclasses import ByteStream
from pathlib import Path
import os
from dotenv import load_dotenv



In [17]:

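# Load OPENAI_API_KEY from the .env file into the environment
load_dotenv("../../.env")
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

# No key is passed explicitly, so the transcriber relies on the OPENAI_API_KEY environment variable
transcriber = RemoteWhisperTranscriber()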

# List of audio file paths
audio_file_paths = ["./audio-files/harvard.wav", "./audio-files/jackhammer.wav"]

# Convert audio files to ByteStream objects
audio_streams = [ByteStream.from_file_path(Path(file_path)) for file_path in audio_file_paths]

# Transcribe audio files
transcription_results = transcriber.run(audio_streams)

# Process and display results
for doc in transcription_results['documents']:
    print(f"Transcribed Text: {doc.content}")
    print(f"Metadata: {doc.meta}")
    print("-----------")



Transcribed Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.
Metadata: {}
-----------
Transcribed Text: The stale smell of old beer lingers.
Metadata: {}
-----------
