## Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

- Database

- Retriever

- Reader

In this notebook we will set up the first component, the database, using Elasticsearch.

In [1]:
!pip install pypdf2

Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1


**Import the PDF file**

In [74]:
from PyPDF2 import PdfReader

# Open the PDF file
pdf_reader = PdfReader('./data/Error-Analysis-PLMs.pdf')

**Preprocess the page text**

In [78]:
def preprocess_text(text):
    """
    Preprocesses the text by replacing unwanted characters with spaces.

    Args:
    - text (str): The input text to be preprocessed.

    Returns:
    - str: The preprocessed text.
    """
    # Map each unwanted character to its replacement:
    # non-breaking spaces and newlines both become regular spaces
    replacements = {'\xa0': ' ', '\n': ' '}

    for old_text, new_text in replacements.items():
        text = text.replace(old_text, new_text)
    return text
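
The output of `preprocess_text` can still contain runs of consecutive spaces, for example where a line already ended in a space before its newline was replaced. As a minimal sketch, assuming that collapsing all whitespace is acceptable for this corpus, a regular expression can handle every whitespace character in one pass. The name `normalize_whitespace` is hypothetical and not part of the original notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space (hypothetical helper)."""
    # For str patterns in Python 3, \s matches Unicode whitespace,
    # so it covers '\n', '\t' and the non-breaking space '\xa0'.
    return re.sub(r'\s+', ' ', text).strip()


sample = 'Human-Centric\xa0Intelligent Systems \nhttps://doi.org'
cleaned = normalize_whitespace(sample)
```

Unlike the dictionary of replacements, this variant also trims leading and trailing whitespace, which may or may not be desirable depending on how the text is used later.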


**Extract the text of every page**

In [79]:
# Extract and clean the text of each page (one string per page)
pages_text = [preprocess_text(page.extract_text()) for page in pdf_reader.pages]

len(pages_text)

14
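
Each entry in `pages_text` holds a full page, which can be a long passage for a retriever to score. One option, sketched here as an assumption rather than as a step of the original notebook, is to split each page into overlapping word windows. The function name `split_into_chunks` and the `max_words` and `overlap` values are illustrative only.

```python
def split_into_chunks(text, max_words=100, overlap=20):
    """Split text into windows of at most max_words words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError('overlap must be smaller than max_words')
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks


# Small demonstration on ten placeholder words
sample_page = ' '.join(f'w{i}' for i in range(10))
chunks = split_into_chunks(sample_page, max_words=4, overlap=1)
```

Applied to the notebook data, this would be called once per page, for example `[c for page in pages_text for c in split_into_chunks(page)]`. The rest of this notebook keeps one document per page.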

In [80]:
pages_text[0]

'Vol.:(0123456789)Human-Centric Intelligent Systems  https://doi.org/10.1007/s44230-024-00061-7 RESEARCH ARTICLE Error Analysis of Pretrained Language Models (PLMs)  in English‑to‑Arabic Machine Translation Hend Al‑Khalifa1,3  · Khaloud Al‑Khalefah2 · Hesham Haroon3 Received: 3 October 2023 / Accepted: 4 January 2024  © The Author(s) 2024 Abstract Advances in neural machine translation utilizing pretrained language models (PLMs) have shown promise in improving the  translation quality between diverse languages. However, translation from English to languages with complex morphology,  such as Arabic, remains challenging. This study investigated the prevailing error patterns of state-of-the-art PLMs when  translating from English to Arabic across different text domains. Through empirical analysis using automatic metrics (chrF,  BERTScore, COMET) and manual evaluation with the Multidimensional Quality Metrics (MQM) framework, we compared  Google Translate and five PLMs (Helsinki, Marefa, F

In [41]:
import requests

In [44]:
requests.get('http://localhost:9200/_cluster/health').json()

{'cluster_name': 'docker-cluster',
 'status': 'green',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 3,
 'active_shards': 3,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 0,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 100.0}
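
The health response above can also be checked programmatically before indexing. The following is a minimal sketch, assuming that a `green` or `yellow` status is good enough for a local single-node setup. The helper name `cluster_is_usable` is hypothetical.

```python
def cluster_is_usable(health):
    """Return True if a _cluster/health response describes a usable cluster."""
    # green  = all primary and replica shards are assigned
    # yellow = all primaries are assigned, some replicas are not
    # red    = at least one primary shard is unassigned
    status_ok = health.get('status') in ('green', 'yellow')
    return status_ok and not health.get('timed_out', False)


# Subset of the response shown above
example_health = {'cluster_name': 'docker-cluster', 'status': 'green', 'timed_out': False}
usable = cluster_is_usable(example_health)
```

In the notebook it could be called as `cluster_is_usable(requests.get('http://localhost:9200/_cluster/health').json())`.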

In [45]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green open .geoip_databases vK8uAcb1QJmzR1OndcuG8A 1 0 41 0 38.4mb 38.4mb



**Delete the existing `error_plms` index**

In [81]:
from elasticsearch import Elasticsearch

# Create a connection
es = Elasticsearch([{'host':'localhost', 'port':9200}])

# Delete the index (raises NotFoundError if it does not exist)
es.indices.delete(index='error_plms')




{'acknowledged': True}
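
Deleting an index that does not exist raises an error, which makes the cell above fail on a first run. A minimal sketch of a guard, assuming a client object that exposes `indices.exists` and `indices.delete` as the `es` client does. The helper name `delete_index_if_exists` is hypothetical.

```python
def delete_index_if_exists(client, index):
    """Delete `index` only when it exists; return the response, or None if skipped."""
    # Check first so that a missing index does not raise an exception
    if client.indices.exists(index=index):
        return client.indices.delete(index=index)
    return None
```

With the connection created above, this would be called as `delete_index_if_exists(es, 'error_plms')`.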

Now let's initialize a new index, `error_plms`, which will store the pages of the Error Analysis of PLMs paper.

In [82]:
from haystack.document_stores import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='error_plms'
)

In [83]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases vK8uAcb1QJmzR1OndcuG8A 1 0 41 0 38.4mb 38.4mb
yellow open label            zi9jIquWRaCd7ANkED4aLw 1 1  0 0   227b   227b
yellow open error_plms       u53WxAAFTDq-Oe3-zZW9cQ 1 1  0 0   227b   227b
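
The plain-text listing above is easier to inspect once it is parsed into records. This is a minimal sketch that assumes the ten default columns visible in the listing (health, status, index name, UUID, primary and replica counts, document counts and store sizes). The names `CAT_INDICES_COLUMNS` and `parse_cat_indices` are hypothetical.

```python
# Assumed column order, matching the ten fields in the listing above
CAT_INDICES_COLUMNS = [
    'health', 'status', 'index', 'uuid', 'pri', 'rep',
    'docs.count', 'docs.deleted', 'store.size', 'pri.store.size',
]


def parse_cat_indices(text):
    """Parse a plain-text _cat/indices listing into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and rows that do not have the ten expected columns
        if len(fields) != len(CAT_INDICES_COLUMNS):
            continue
        rows.append(dict(zip(CAT_INDICES_COLUMNS, fields)))
    return rows


listing = """green  open .geoip_databases vK8uAcb1QJmzR1OndcuG8A 1 0 41 0 38.4mb 38.4mb
yellow open label            zi9jIquWRaCd7ANkED4aLw 1 1  0 0   227b   227b
yellow open error_plms       u53WxAAFTDq-Oe3-zZW9cQ 1 1  0 0   227b   227b
"""
indices = {row['index']: row for row in parse_cat_indices(listing)}
```

The `yellow` health of `label` and `error_plms` is expected on a single-node cluster: each index has one replica (`rep` is 1) that cannot be assigned to a second node. Requesting `_cat/indices?format=json` returns the same information as JSON and avoids manual parsing.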



Next, we format the data as a list of dictionaries before passing it to Elasticsearch. Each dictionary has the form:

```
    {
        'content': '<page text>',
        'meta': {
            'source': 'Human-Centric'
        }
    }
```

In [86]:
data_json = [
    {
        'content': paragraph,
        'meta': {
            'source': 'Human-Centric'
        }
    } for paragraph in pages_text
]

In [87]:
data_json[:3]

[{'content': 'Vol.:(0123456789)Human-Centric Intelligent Systems  https://doi.org/10.1007/s44230-024-00061-7 RESEARCH ARTICLE Error Analysis of Pretrained Language Models (PLMs)  in English‑to‑Arabic Machine Translation Hend Al‑Khalifa1,3  · Khaloud Al‑Khalefah2 · Hesham Haroon3 Received: 3 October 2023 / Accepted: 4 January 2024  © The Author(s) 2024 Abstract Advances in neural machine translation utilizing pretrained language models (PLMs) have shown promise in improving the  translation quality between diverse languages. However, translation from English to languages with complex morphology,  such as Arabic, remains challenging. This study investigated the prevailing error patterns of state-of-the-art PLMs when  translating from English to Arabic across different text domains. Through empirical analysis using automatic metrics (chrF,  BERTScore, COMET) and manual evaluation with the Multidimensional Quality Metrics (MQM) framework, we compared  Google Translate and five PLMs (Helsin

In [88]:
len(data_json)

14
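
The list comprehension above can be extended to carry more metadata. The sketch below also records the page number and skips pages with no extractable text. The `page` field and the function name `build_documents` are assumptions for illustration and do not appear in the original notebook.

```python
def build_documents(pages, source):
    """Turn page strings into dicts with 'content' and 'meta' keys."""
    documents = []
    for page_number, content in enumerate(pages, start=1):
        # Skip pages that contain only whitespace (e.g. image-only pages)
        if not content.strip():
            continue
        documents.append({
            'content': content,
            # 'page' is an illustrative extra field
            'meta': {'source': source, 'page': page_number},
        })
    return documents


docs = build_documents(['First page text', '   ', 'Third page text'], 'Human-Centric')
```

Page numbers keep counting from the original position, so the third page is still labelled 3 even though the empty second page is dropped.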

**Now we simply write our data to Elasticsearch.**

In [89]:
doc_store.write_documents(data_json)

In [90]:
requests.get('http://localhost:9200/error_plms/_count').json()

{'count': 14,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
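
Beyond counting documents, a quick full-text query can confirm that the stored pages are searchable. This is a minimal sketch of a request body for the Elasticsearch `_search` endpoint, assuming the text is stored in a field named `content`, as the `'content'` key in `data_json` suggests. The helper name `build_match_query` is hypothetical.

```python
def build_match_query(question, size=3, field='content'):
    """Build a query-DSL body for a full-text match query."""
    return {
        'size': size,  # maximum number of hits to return
        'query': {'match': {field: question}},
    }


body = build_match_query('English to Arabic translation errors')
```

The body could then be sent with `requests.post('http://localhost:9200/error_plms/_search', json=body).json()`, and the matching pages would appear under `hits.hits` in the response.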