# Preprocessing

Haystack includes a suite of tools to extract text from different file types, normalize whitespace,
and split text into smaller pieces to optimize retrieval.
These preprocessing steps can have a big impact on the system's performance, and handling data effectively is key to getting the most out of Haystack.
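
To make the two cleaning steps concrete, here is a minimal pure-Python sketch of whitespace normalization and fixed-length splitting. It is not Haystack's implementation; the helper names `normalize_whitespace` and `split_by_words` and the `split_length` parameter are assumptions made for illustration only.

```python
import re
from typing import List


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and squeeze long runs of blank lines."""
    # Collapse horizontal whitespace inside each line and trim the line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Reduce three or more consecutive newlines to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()


def split_by_words(text: str, split_length: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most `split_length` words."""
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]


# Hypothetical input with irregular spacing and extra blank lines.
raw = "The  soundtrack \t album\n\n\n\nwas released   digitally."
clean = normalize_whitespace(raw)
chunks = split_by_words(clean, split_length=3)
```

Shorter chunks give a retriever more focused units to match against a query, at the cost of less context per chunk.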

Ultimately, Haystack expects data to be provided as a list of documents in the following dictionary format:
``` python
docs = [
    {
        'content': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
```
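
As a rough illustration of this format, the sketch below builds such a list by hand from a mapping of file names to texts. The sample texts, the `raw_texts` variable, and the `to_haystack_dicts` helper are hypothetical and only serve to show the shape of the data.

```python
from typing import Dict, List

# Hypothetical raw inputs: file name -> already extracted text.
raw_texts = {
    "doc1.txt": "The soundtrack received mostly positive reviews from critics.",
    "doc2.txt": "The album was released digitally on June 10, 2014.",
}


def to_haystack_dicts(texts: Dict[str, str]) -> List[dict]:
    """Wrap each text in the {'content': ..., 'meta': {...}} dictionary format."""
    return [
        {"content": text, "meta": {"name": name}}
        for name, text in texts.items()
    ]


docs = to_haystack_dicts(raw_texts)
```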

This tutorial will show you all the tools that Haystack provides to help you cast your data into this format.

## Installing Haystack

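To start, install the latest release of Haystack with `pip`. The 1.x releases that provide the `haystack.nodes` imports used in this tutorial are published on PyPI as `farm-haystack`.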

### Enabling Telemetry
Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the line that enables it. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details.

## Logging

We configure how log messages are displayed and which log level is used before importing Haystack.
Example log message:
`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`
The default log level in `basicConfig` is WARNING, so passing it explicitly is not necessary, but doing so makes it easy to change.
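
A minimal sketch of such a configuration, using only the standard `logging` module, could look like this. The format string is an assumption inferred from the example message above, and raising the `haystack` logger to INFO is one possible choice rather than a requirement.

```python
import logging

# Root logger: show the level, the logger name, and the message.
# WARNING is already the default level; it is spelled out so it is easy to change.
logging.basicConfig(
    format="%(levelname)s - %(name)s -  %(message)s",
    level=logging.WARNING,
)

# Let loggers under the "haystack" namespace emit INFO messages as well.
logging.getLogger("haystack").setLevel(logging.INFO)
```

Because child loggers inherit from their parent, a logger such as `haystack.utils.preprocessing` picks up the INFO level set on `haystack`.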

## Converters

Haystack's converter classes are designed to help you turn files on your computer into documents
that can be processed by a Haystack pipeline.
There are file converters for txt, pdf, and docx files, as well as a converter powered by Apache Tika.
The `valid_languages` parameter does not convert files to the target language; it checks whether the conversion worked as expected. Here are some examples of how you would use file converters:

``` python
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor


converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="/home/nozander/Workspace/doc-similar/data/doc1.txt", meta={"filename": "classic"})[0]

# converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
# doc_pdf = converter.convert(file_path="data/tutorial8/bert.pdf", meta=None)[0]

# converter = DocxToTextConverter(remove_numeric_tables=False, valid_languages=["en"])
# doc_docx = converter.convert(file_path="data/tutorial8/heavy_metal.docx", meta=None)[0]
```

Haystack also has a convenience function that will automatically apply the right converter to each file in a directory:

<Document: {'content': '\n\nThe soundtrack album of the fourth season of HBO series \'\'Game of Thrones\'\', titled \'\'\'\'\'Game of Thrones: Season 4\'\'\'\'\' was released digitally on June 10, 2014, and on CD on July 1, 2014. Season 4 of \'\'Game of Thrones\'\' saw the Icelandic band Sigur Rós perform their rendition of "The Rains of Castamere" in a cameo appearance at King Joffrey\'s wedding in the second episode, "The Lion and the Rose".\n\n==Reception==\nThe soundtrack received mostly positive reviews from critics. The soundtrack was awarded a score of 4/5 by Heather Phares of AllMusic.\n\n==Track listing==\n\n\n==Credits and personnel==\nPersonnel adapted from the album liner notes.\n\n* David Benioff – liner notes\n* Ramin Djawadi – composer, primary artist, producer\n* Sigur Rós – primary artist \n* George R.R. Martin – lyricist\n* D.B. Weiss – liner notes \n\n\n==Charts==\n\n\n\n\n\n\n Peak position\n\n\n\n\n\n\n\n', 'content_type': 'text', 'score': None, 'meta': {'filename'

``` python
from haystack.utils import convert_files_to_docs


all_docs = convert_files_to_docs(dir_path="/home/nozander/Workspace/doc-similar/data")
```
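
To give an intuition for what "applying the right converter to each file" involves, here is a small standard-library sketch that groups the files in a directory by suffix. It is not how `convert_files_to_docs` is implemented; the `SUFFIX_TO_CONVERTER` mapping and the `group_files_by_converter` helper are hypothetical and only name the converter classes mentioned above.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping from file suffix to the name of a converter class.
SUFFIX_TO_CONVERTER = {
    ".txt": "TextConverter",
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
}


def group_files_by_converter(dir_path: str) -> Dict[str, List[Path]]:
    """Group the files under `dir_path` by the converter that would handle them."""
    groups: Dict[str, List[Path]] = {}
    for path in sorted(Path(dir_path).rglob("*")):
        converter_name = SUFFIX_TO_CONVERTER.get(path.suffix.lower())
        # Skip directories and files with an unsupported suffix.
        if path.is_file() and converter_name is not None:
            groups.setdefault(converter_name, []).append(path)
    return groups
```

Each group could then be passed to the matching converter, which is the bookkeeping the convenience function saves you from writing yourself.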

``` python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers

document_store: InMemoryDocumentStore = InMemoryDocumentStore(use_bm25=True, use_gpu=True, similarity="cosine")
retriever = BM25Retriever(document_store=document_store)
```

``` python
document_store.write_documents(all_docs)
```

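``` python
# NOTE: `document_id` is not defined anywhere in this section;
# set it to the ID you want to look up before running this cell.
document_store.get_all_documents(filters={"$and": {"document_id": {"$eq": document_id}}})
```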

[<Document: {'content': '\n\nThe soundtrack album of the fourth season of HBO series \'\'Game of Thrones\'\', titled \'\'\'\'\'Game of Thrones: Season 4\'\'\'\'\' was released digitally on June 10, 2014, and on CD on July 1, 2014. Season 4 of \'\'Game of Thrones\'\' saw the Icelandic band Sigur Rós perform their rendition of "The Rains of Castamere" in a cameo appearance at King Joffrey\'s wedding in the second episode, "The Lion and the Rose".\n\n==Reception==\nThe soundtrack received mostly positive reviews from critics. The soundtrack was awarded a score of 4/5 by Heather Phares of AllMusic.\n\n==Track listing==\n\n\n==Credits and personnel==\nPersonnel adapted from the album liner notes.\n\n* David Benioff – liner notes\n* Ramin Djawadi – composer, primary artist, producer\n* Sigur Rós – primary artist \n* George R.R. Martin – lyricist\n* D.B. Weiss – liner notes \n\n\n==Charts==\n\n\n\n\n\n Chart (2014)\n\n Peak position\n\n\n\n\n\n\n\n', 'content_type': 'text', 'score': None, 'me

``` python
all_docs[0]
```

<Document: {'content': '\n\nThe soundtrack album of the fourth season of HBO series \'\'Game of Thrones\'\', titled \'\'\'\'\'Game of Thrones: Season 4\'\'\'\'\' was released digitally on June 10, 2014, and on CD on July 1, 2014. Season 4 of \'\'Game of Thrones\'\' saw the Icelandic band Sigur Rós perform their rendition of "The Rains of Castamere" in a cameo appearance at King Joffrey\'s wedding in the second episode, "The Lion and the Rose".\n\n==Reception==\nThe soundtrack received mostly positive reviews from critics. The soundtrack was awarded a score of 4/5 by Heather Phares of AllMusic.\n\n==Track listing==\n\n\n==Credits and personnel==\nPersonnel adapted from the album liner notes.\n\n* David Benioff – liner notes\n* Ramin Djawadi – composer, primary artist, producer\n* Sigur Rós – primary artist \n* George R.R. Martin – lyricist\n* D.B. Weiss – liner notes \n\n\n==Charts==\n\n\n\n\n\n Chart (2014)\n\n Peak position\n\n\n\n\n\n\n\n', 'content_type': 'text', 'score': None, 'met