## Building LLM-powered pipelines to extract and process data with Haystack

In this notebook, we will show how to build a pipeline to extract and process data using Haystack by deepset.

For extraction, we will look into extracting and processing content from: 

* The internet 

* Files of different formats: PDF, txt, Markdown, JSON, CSV 

For cleaning and processing, we will focus on: 

* Removing certain characters and white space 

* Chunking and splitting text  

Once the data has been extracted and cleaned, we will store it in a Haystack document store.

### Extracting content from the internet

We will use the following components to extract content from the internet:

* `SerperDevWebSearch()` - this component will enable us to perform web searches using natural language queries.
* `LinkContentFetcher()` - this component will enable us to fetch content from the links returned by the web search.

In [None]:
!pip install --upgrade haystack-ai

Let's start by importing the appropriate modules and classes.

In [1]:
from haystack.components.websearch import SerperDevWebSearch
from haystack.components.fetchers import LinkContentFetcher

Let's initialize the `SerperDevWebSearch()` and `LinkContentFetcher()` components and use them to perform a web search and fetch content from the links returned by the web search. 

**Note:** You will need a Serper API key to use the `SerperDevWebSearch()` component. You can get a free or paid API key by signing up at [https://serper.dev/](https://serper.dev/).

This notebook assumes you have a `.env` file in the root directory of this repository with the following content:

```bash
SERPERDEV_API_KEY=your_api_key
OPENAI_API_KEY=your_api_key
```

In [2]:
from dotenv import load_dotenv
import os

load_dotenv("./../../.env")

open_ai_key = os.getenv("OPENAI_API_KEY")
serper_api_key = os.getenv("SERPERDEV_API_KEY")
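
Because `os.getenv` returns `None` when a variable is missing, a mistyped variable name only shows up later as an authentication error. The following is a minimal sketch of a fail-fast check; `require_env` is a hypothetical helper introduced here for illustration and is not part of Haystack or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

It could be used in place of the plain lookup, for example `serper_api_key = require_env("SERPERDEV_API_KEY")`.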

Initialize the components.

For the `SerperDevWebSearch()` component, we will use the following parameters:

* `api_key` - by default, the key is read from the `SERPERDEV_API_KEY` environment variable, so as long as that variable has been loaded, we don't need to pass it explicitly
* `top_k` - the number of search results to return
* `allowed_domains` - a list of domains to restrict the search to
* `search_params` - a dictionary of search parameters to pass to the Serper API
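
To build intuition for `allowed_domains`, one common way to restrict a web search to certain sites is to add `site:` operators to the query. The sketch below illustrates that idea only; whether the component builds its request in exactly this way is an assumption, and `restrict_query_to_domains` is a hypothetical helper, not a Haystack function.

```python
def restrict_query_to_domains(query: str, allowed_domains: list[str]) -> str:
    """Prefix `query` with `site:` operators so results come only from the given domains."""
    if not allowed_domains:
        return query
    site_filter = " OR ".join(f"site:{domain}" for domain in allowed_domains)
    return f"{site_filter} {query}"


print(restrict_query_to_domains("year of the dragon",
                                ["en.wikipedia.org", "www.britannica.com"]))
# prints: site:en.wikipedia.org OR site:www.britannica.com year of the dragon
```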

For the `LinkContentFetcher()` component, we will use the following parameters:

* `retry_attempts` - the number of times to retry fetching content from a link
* `timeout` - the maximum time, in seconds, to wait for a response from a link before the request is abandoned
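
The retry behaviour can be pictured with a small, library-independent sketch. Here `fetch` stands in for a single HTTP request, and `fetch_with_retries` and `backoff_seconds` are hypothetical names chosen for illustration; this is not how `LinkContentFetcher` is implemented internally.

```python
import time
from typing import Callable


def fetch_with_retries(fetch: Callable[[str], bytes], url: str,
                       retry_attempts: int = 3, backoff_seconds: float = 0.0) -> bytes:
    """Call `fetch(url)`, retrying up to `retry_attempts` more times if it raises."""
    last_error = None
    for attempt in range(retry_attempts + 1):  # first try plus the retries
        try:
            return fetch(url)
        except Exception as error:  # a real fetcher would catch narrower exceptions
            last_error = error
            if attempt < retry_attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError(f"Giving up on {url} after {retry_attempts + 1} attempts") from last_error
```

Passing the request function in as an argument keeps the sketch testable without network access.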


Let's limit our search to five results from Wikipedia and Encyclopedia Britannica.

In [3]:
web_search = SerperDevWebSearch(top_k=5, 
                                allowed_domains=["https://en.wikipedia.org/",
                                                 "https://www.britannica.com/"],
                                search_params={"type":"search"})
link_content = LinkContentFetcher(retry_attempts=3,
                                  timeout=10)

### Connecting the components

We will initialize the `Pipeline()` class, add the components and connect them.


In [5]:
from haystack import Pipeline

# Initialize pipeline
pipeline = Pipeline()

# Add components
pipeline.add_component(name="search", instance=web_search)
pipeline.add_component(name="fetcher", instance=link_content)

# Connect components to one another
pipeline.connect("search.links", "fetcher.urls")

# Draw pipeline
pipeline.draw("./images/search_fetch_pipeline.png")
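
As a rough mental model of what a connection such as `"search.links"` to `"fetcher.urls"` expresses, the plain-Python analogy below wires two stand-in functions together by hand. The functions `search` and `fetcher` and the `example.com` URLs are invented for illustration; the real components perform network calls and the real `Pipeline` handles the wiring and validation.

```python
def search(query: str) -> dict:
    """Stand-in for a web search component: returns an output named `links`."""
    return {"links": [f"https://example.com/{word}" for word in query.lower().split()]}


def fetcher(urls: list[str]) -> dict:
    """Stand-in for a fetcher component: takes an input named `urls`."""
    return {"streams": [f"<html>{url}</html>".encode() for url in urls]}


# The edge "search.links" -> "fetcher.urls" means: take the `links` output
# of `search` and pass it as the `urls` input of `fetcher`.
search_output = search("dragon year")
fetcher_output = fetcher(urls=search_output["links"])

print(fetcher_output["streams"][0])
# prints: b'<html>https://example.com/dragon</html>'
```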



We can execute the pipeline as follows:

In [6]:
query = "What can you tell me about the year of the dragon?"
output = pipeline.run(data={"search":{"query":query}})

The output is a dictionary keyed by component name:

In [8]:
output.keys()

dict_keys(['search', 'fetcher'])

Let's look at the documents returned by the search. Because the content is long, we will display only the first 50 characters of each document.

In [30]:
for item in output["search"]["documents"]:
    print(item.content[0:50])
    print("Title: ", item.meta['title'], "URL: ", item.meta['link'])
    print("-------")

The Dragon, also known as Loong is the fifth of th
Title:  Dragon (zodiac) - Wikipedia URL:  https://en.wikipedia.org/wiki/Dragon_(zodiac)
-------
Dragon, Chenshi, 07:00 to 08:59, Dragons are hover
Title:  Chinese zodiac - Wikipedia URL:  https://en.wikipedia.org/wiki/Chinese_zodiac
-------
The Chinese Dragon, also known as the loong, long 
Title:  Chinese dragon - Wikipedia URL:  https://en.wikipedia.org/wiki/Chinese_dragon
-------
The Dragon Years is a compilation album by the New
Title:  The Dragon Years - Wikipedia URL:  https://en.wikipedia.org/wiki/The_Dragon_Years
-------
Note: according to this website, Abraham Lincoln w
Title:  Talk:Dragon (zodiac) - Wikipedia URL:  https://en.wikipedia.org/wiki/Talk%3ADragon_(zodiac)
-------


Let's take a look at the scraped content from the first link. We will only show the first 50 characters.

In [33]:
output["fetcher"]['streams'][0].data[0:50]

b'<!DOCTYPE html>\n<html class="client-nojs vector-fe'
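
The fetched data is raw HTML bytes, so it has to be decoded and stripped of markup before the text is useful. As a standard-library illustration of that idea, the sketch below extracts visible text with `html.parser`; `TextExtractor` and `html_bytes_to_text` are hypothetical names, the sample HTML is made up, and Haystack's own converter uses its own extraction logic rather than this code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw HTML bytes and return the visible text, one fragment per line."""
    parser = TextExtractor()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return "\n".join(parser.parts)


sample = (b"<!DOCTYPE html>\n<html><head><style>p{}</style></head>"
          b"<body><h1>Dragon</h1><p>Fifth sign.</p></body></html>")
print(html_bytes_to_text(sample))
```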

## Adding cleaning and splitting components

We will reinitialize the `Pipeline()` class and add new instances of each of the following components:

* `SerperDevWebSearch()` - to perform web searches using natural language queries
* `LinkContentFetcher()` - to fetch content from the links returned by the web search
* `HTMLToDocument()` - to convert the HTML content to a Haystack document
* `DocumentCleaner()` - to clean the content of the document
* `DocumentWriter()` - to write the document to a Haystack document store
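
To make the cleaning step concrete, here is a small regex-based sketch of what removing empty lines and extra whitespace can look like. The parameter names mirror the `DocumentCleaner()` options used in this notebook, but `clean_text` is a hypothetical function and only approximates the idea; it does not reproduce Haystack's exact output.

```python
import re


def clean_text(text: str, remove_empty_lines: bool = True,
               remove_extra_whitespaces: bool = True) -> str:
    """Approximate two common cleaning steps on a block of text."""
    if remove_extra_whitespaces:
        # Collapse runs of spaces/tabs within each line and trim line ends.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
        text = "\n".join(lines)
    if remove_empty_lines:
        # Drop lines that contain nothing but whitespace.
        text = "\n".join(line for line in text.split("\n") if line.strip())
    return text


print(repr(clean_text("Dragon   (zodiac)\n\n\n  39 languages  \n")))
# prints: 'Dragon (zodiac)\n39 languages'
```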

In [38]:
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter

# Initialize document store
document_store = InMemoryDocumentStore()

# Initialize components
web_search = SerperDevWebSearch(top_k=5, 
                                allowed_domains=["https://en.wikipedia.org/",
                                                 "https://www.britannica.com/"],
                                search_params={"type":"search"})
link_content = LinkContentFetcher(retry_attempts=3,
                                  timeout=10)
html_to_document = HTMLToDocument()
cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False)
writer = DocumentWriter(document_store=document_store)

# Initialize pipeline
pipeline = Pipeline()

# Add components
pipeline.add_component(name="search", instance=web_search)
pipeline.add_component(name="fetcher", instance=link_content)
pipeline.add_component(name="html_to_document", instance=html_to_document)
pipeline.add_component(name="cleaner", instance=cleaner)
pipeline.add_component(name="writer", instance=writer)

# Connect components to one another
pipeline.connect("search.links", "fetcher.urls")
pipeline.connect("fetcher", "html_to_document")
pipeline.connect("html_to_document.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "writer.documents")

# Draw pipeline
pipeline.draw("./images/search_fetch_clean_save_pipeline.png")

In [39]:
query = "What can you tell me about the year of the dragon?"
result = pipeline.run(data={"search":{"query":query}})

Show documents in the document store.

In [41]:
document_store.filter_documents()

[Document(id=c7843dbad70fe9f99b7674017b0ec29dccaa6c668bd3fd6845aca6d97f6416cb, content: 'Dragon (zodiac)
 39 languages
 Dragon
 Zodiac dragon, showing the lóng (龍) character for dragon
 It has ...', meta: {'content_type': 'text/html', 'url': 'https://en.wikipedia.org/wiki/Dragon_(zodiac)'}),
 Document(id=44088fd4874191fb4410c1718f1e907342da2ffc25446ce0bd9923f9049ebee6, content: 'Chinese zodiac
 44 languages
 Lunar calendar classification in a 12-year cycle
 Unless otherwise specif...', meta: {'content_type': 'text/html', 'url': 'https://en.wikipedia.org/wiki/Chinese_zodiac'}),
 Document(id=0cda3111923c649feeb40e8a9e8a1b16ca2c371dfc67c4e19bf2f874171b144b, content: 'Chinese dragon
 57 languages
 "Loong" redirects here. For the airlines, see Loong Air .
 This article h...', meta: {'content_type': 'text/html', 'url': 'https://en.wikipedia.org/wiki/Chinese_dragon'}),
 Document(id=fedf29d7ee1459fe0929c99453480b068d1ab56886221d8724f22fbd6ece8b3e, content: 'The Dragon Years
 The Dragon Years (