# Explore the Haystack preprocessing suite on my own document

### usual install & import

In [1]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]

!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

Collecting pip
  Downloading pip-22.0.4-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 4.3 MB/s
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.0.4
Collecting farm-haystack[colab,ocr]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-ot296ua_/farm-haystack_050efd1c25b548a7a5572db6c8e17d6f
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-ot296ua_/farm-haystack_050efd1c25b548a7a5572db6c8e17d6f
  Resolved https://github.com/deepset-ai/haystack.git to commit ae712fe6bf087c717f3e38e4e87d2347165fc12b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting mlflow<=1.13.1
  Downloading mlfl

In [2]:
# Here are the imports we need
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs, fetch_archive_from_http

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/


To test the suite, I will use a paper I love, the bias-variance PDF: https://homes.cs.washington.edu/~pedrod/papers/mlc00a.pdf. The next cell expects it to be saved as /content/bias_variance.pdf.

In [3]:
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="/content/bias_variance.pdf", meta=None)[0]

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


A useful function when you have files in multiple formats, or if you are lazy:

In [4]:
# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.

# all_docs = convert_files_to_docs(dir_path=doc_dir)
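
To make the idea of "the right converter for each file" concrete, here is a minimal pure-Python sketch of a dispatch by file suffix. It does not call Haystack and is not how `convert_files_to_docs` is implemented; the mapping `CONVERTER_BY_SUFFIX` and the helper `plan_conversion` are hypothetical names introduced only for illustration, and the three suffixes simply mirror the converters imported above.

```python
from pathlib import Path
import tempfile

# Hypothetical mapping for illustration only: it mirrors the three converters
# imported earlier in the notebook, not Haystack's internal logic.
CONVERTER_BY_SUFFIX = {
    ".txt": "TextConverter",
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
}


def plan_conversion(dir_path):
    """Group the supported files in dir_path by the converter that would handle them."""
    plan = {}
    for path in sorted(Path(dir_path).iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        converter_name = CONVERTER_BY_SUFFIX.get(path.suffix.lower())
        if converter_name is None:
            continue  # unsupported suffix: skip the file
        plan.setdefault(converter_name, []).append(path.name)
    return plan


# Small demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.txt", "b.pdf", "c.docx", "d.png"):
        (Path(tmp) / file_name).touch()
    demo_plan = plan_conversion(tmp)

print(demo_plan)
```

In the notebook itself, the commented-out call above remains the practical route once `doc_dir` points at a real directory.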

It is important to apply the PreProcessor suite to our documents. It speeds up retrieval and brings many benefits when using a Dense Passage Retriever (DPR). The documentation suggests documents of about 100 words for DPR and up to 10000 words for sparse methods (e.g. BM25).

Before proceeding, let's understand the parameters:

* clean_empty_lines will normalize 3 or more consecutive empty lines to just two empty lines
* clean_whitespace will remove any whitespace at the beginning or end of each line in the text
* clean_header_footer will remove any long header or footer texts that are repeated on each page
* split_by="word" will split the text by words
* split_length=100 is the maximum number of the above split units (e.g. words) allowed in one document. For instance, if split_length=10 and split_by="sentence", then each output document will have 10 sentences.
I am a little perplexed by this parameter: it seems to split the text into documents of up to 1000 words even though we pass 100 as input.
* split_respect_sentence_boundary=True ensures that documents do not start or end midway through a sentence
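
The parameters above can be approximated in a few lines of plain Python. This is only a sketch written from the descriptions in the list, not Haystack's implementation: `clean_text` and `split_by_words` are hypothetical helpers, and the sentence splitter is a naive regular expression assumed here for brevity.

```python
import re


def clean_text(text):
    """Approximate clean_whitespace and clean_empty_lines as described above."""
    cleaned = []
    empty_run = 0
    for line in text.split("\n"):
        line = line.strip()  # clean_whitespace: trim both ends of each line
        if line == "":
            empty_run += 1
            if empty_run > 2:
                continue  # clean_empty_lines: keep at most two empty lines in a row
        else:
            empty_run = 0
        cleaned.append(line)
    return "\n".join(cleaned)


def split_by_words(text, split_length=100, respect_sentence_boundary=True):
    """Split text into chunks of at most split_length words.

    With respect_sentence_boundary=True, whole sentences are packed into a
    chunk until adding the next one would exceed split_length, so chunks can
    be shorter than split_length (or longer, if a single sentence is).
    """
    if not respect_sentence_boundary:
        words = text.split()
        return [
            " ".join(words[i:i + split_length])
            for i in range(0, len(words), split_length)
        ]

    # Naive sentence splitter (assumption): break after '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine."
print(split_by_words(sample, split_length=5))
print(split_by_words(sample, split_length=5, respect_sentence_boundary=False))
```

The sketch also shows why respecting sentence boundaries tends to produce documents with fewer words than split_length: a chunk is closed as soon as the next sentence would not fit.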



In [41]:
# This is a default usage of the PreProcessor.
# Here, it cleans empty lines and whitespace
# and splits a single large document into smaller documents.
# Each document is up to 100 words long (split_length=100) and document breaks cannot fall in the middle of sentences.
# Note how the single document passed into the PreProcessor gets split into 64 smaller documents.

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs_default = preprocessor.process([doc_pdf])
print(f"n_docs_input: 1\nn_docs_output: {len(docs_default)}")

100%|██████████| 1/1 [00:00<00:00, 35.65docs/s]

n_docs_input: 1
n_docs_output: 64





Checking the documents' size to understand split_length

In [38]:
len(docs_default[2].content)

104

In [26]:
for d in docs_default:
  print(len(d.content))

739
555
580
587
521
573
548
454
469
518
496
510
501
452
537
584
561
266
393
456
611
534
504
551
535
517
374
604
564
555
457
613
526
452
554
371
420
613
598
321
634
371
604
600
580
387
432
478
555
529
444
595
533
546
517
548
552
608
703
685
761
650
714
645
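
One thing worth checking here: `len()` on a string counts characters, not words. A minimal sketch of a word count follows, assuming whitespace-separated tokens are a good enough proxy for the "word" unit; `count_words` is a hypothetical helper and `sample_docs` is stand-in data, not output from the PreProcessor.

```python
def count_words(text):
    """Count whitespace-separated tokens in text."""
    return len(text.split())


# Stand-in strings for illustration only; in the notebook these would be
# the .content of each document returned by the PreProcessor.
sample_docs = [
    "Bias and variance trade off.",
    "A second, slightly longer stand-in document.",
]

for text in sample_docs:
    print(f"{len(text)} characters, {count_words(text)} words")
```

In the notebook, `[count_words(d.content) for d in docs_default]` would give the per-document word counts to compare against split_length.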


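I will have to investigate this parameter (split_length) further. At first glance the documents look about 10 times larger than the value I selected, but len(d.content) counts characters rather than words, so lengths of roughly 450 to 750 characters are consistent with documents of at most about 100 words.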