# Open-Domain QA Project

In this project, we build an open-domain question answering (ODQA) system. Such a system has three major components:

- Database

- Retriever

- Reader
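To make the division of labour concrete, here is a toy sketch of how the three components hand data to one another. It is illustrative only: the word-overlap retriever and the pass-through reader are stand-ins invented for this example, not the Elasticsearch and BERT components used later in the notebook.

```python
import re

# Database: in this toy version, just a list of paragraphs.
database = [
    "The question of Palestine was brought before the United Nations.",
    "Elasticsearch is a search engine.",
]

def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, top_k=1):
    """Retriever stand-in: rank documents by the number of words shared with the query."""
    ranked = sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)
    return ranked[:top_k]

def read(query, doc):
    """Reader stand-in: a real reader would extract an answer span from the document."""
    return {"answer": doc, "context": doc}

def answer(query, docs):
    """Pipeline: retrieve candidate documents, then let the reader process each one."""
    return [read(query, d) for d in retrieve(query, docs)]

print(answer("United Nations Palestine", database)[0]["answer"])
```

The retriever narrows the whole database down to a few candidates cheaply, so the slower reader only has to look at those.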

### Step 1: prepare the Database


#### a. Load our data

We'll be using the following UN report, "The Question of Palestine":

https://raw.githubusercontent.com/Azizkhaled/NLP/main/Data/UN_text.txt

In [1]:
import requests

In [2]:
# function to remove duplicates while keeping the same order

def remove_duplicates_keep_order(input_list):
    seen = set()
    result = []

    for item in input_list:
        if item not in seen:
            result.append(item)
            seen.add(item)

    return result


In [3]:
data = requests.get('https://raw.githubusercontent.com/Azizkhaled/NLP/main/Data/UN_text.txt')
data.raise_for_status()  # fail early if the download did not succeed
text = data.text.split('\n')

print('Before\n', len(text), ': ', text[0:3])

# remove duplicate lines while preserving their order
text = remove_duplicates_keep_order(text)
print('\nAfter\n', len(text),': ', text[0:3])


Before
 1500 :  ['\r', 'The question of Palestine was brought before the United Nations shortly after the end of the Second World War.\r', '\r']

After
 889 :  ['\r', 'The question of Palestine was brought before the United Nations shortly after the end of the Second World War.\r', 'The origins of the Palestine problem as an international issue, however, lie in events occurring towards the end of the First World War. These events led to a League of Nations decision to place Palestine under the administration of Great Britain as the Mandatory Power under the Mandates System adopted by the League. In principle, the Mandate was meant to be in the nature of a transitory phase until Palestine attained the status of a fully independent nation, a status provisionally recognized in the League’s Covenant, but in fact the Mandate’s historical evolution did not result in the emergence of Palestine as an independent nation.\r']
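The output above shows that every line still carries a trailing `'\r'` and that a bare `'\r'` survives as its own entry. A possible alternative, sketched below, splits with `splitlines()` and drops blank lines while deduplicating. The helper name `clean_paragraphs` is made up for this example, and using it would change the counts reported above.

```python
def clean_paragraphs(raw_text):
    """Split raw text into unique, non-empty paragraphs, preserving order."""
    seen = set()
    paragraphs = []
    for line in raw_text.splitlines():  # handles both '\n' and '\r\n' line endings
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            paragraphs.append(line)
    return paragraphs

sample = "\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\nFirst paragraph.\r\n"
print(clean_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```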


#### b. Setting up an index in Elasticsearch

If you haven't already, you will need to install [Elasticsearch](https://www.elastic.co/downloads/past-releases/elasticsearch-7-11-2)

In [4]:
# confirm Elasticsearch is up and running 

requests.get('http://localhost:9200/_cluster/health').json()


{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 3,
 'active_shards': 3,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 3,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 50.0}
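On a single-node setup like this one, a `yellow` status usually means that all primary shards are allocated but their replicas have nowhere to go, which is fine for local experiments. A small check along these lines could guard the rest of the notebook; the helper name `cluster_is_usable` is an assumption for this sketch.

```python
def cluster_is_usable(health):
    """Return True if the health response reports green or yellow and did not time out."""
    return health.get('status') in ('green', 'yellow') and not health.get('timed_out', False)

# Abbreviated copy of the response shown above.
health = {'cluster_name': 'elasticsearch', 'status': 'yellow', 'timed_out': False}
print(cluster_is_usable(health))  # True
```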

##### Initialize the new index for the UN dataset

In [5]:
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='un'
)

08/20/2023 12:28:29 - INFO - elasticsearch -   HEAD http://localhost:9200/un [status:200 request:0.005s]
08/20/2023 12:28:29 - INFO - elasticsearch -   GET http://localhost:9200/un [status:200 request:0.003s]
08/20/2023 12:28:29 - INFO - elasticsearch -   PUT http://localhost:9200/un/_mapping [status:200 request:0.019s]
08/20/2023 12:28:29 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.003s]


In [6]:
# Check that the 'un' index exists
print(requests.get('http://localhost:9200/_cat/indices').text)


yellow open squad_docs s6GZ48hkQq23SCjlfytjwg 1 1 1204 0  18.4mb  18.4mb
yellow open un         q0OxhRVAQLelbesmP3kDtQ 1 1 1778 0 674.8kb 674.8kb
yellow open label      N6xlf6qPSz-glW15SVWhkA 1 1    0 0    208b    208b



#### c. Format the data

Each document passed to `doc_store.write_documents` is a dictionary of the following form:

    {
        'text': '<paragraph>',
        'meta': {
            'source': '<source name>'
        }
    }

In [7]:
data_json = [
    {
        'text': paragraph,
        'meta': {
            'source': 'un palestine'
        }
    } for paragraph in text
]

In [8]:
data_json[:3]

[{'text': '\r', 'meta': {'source': 'un palestine'}},
 {'text': 'The question of Palestine was brought before the United Nations shortly after the end of the Second World War.\r',
  'meta': {'source': 'un palestine'}},
 {'text': 'The origins of the Palestine problem as an international issue, however, lie in events occurring towards the end of the First World War. These events led to a League of Nations decision to place Palestine under the administration of Great Britain as the Mandatory Power under the Mandates System adopted by the League. In principle, the Mandate was meant to be in the nature of a transitory phase until Palestine attained the status of a fully independent nation, a status provisionally recognized in the League’s Covenant, but in fact the Mandate’s historical evolution did not result in the emergence of Palestine as an independent nation.\r',
  'meta': {'source': 'un palestine'}}]

In [9]:
len(data_json)

889
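Note that `data_json[0]` above is a document whose text is just `'\r'`. If that is unwanted, the list comprehension could be wrapped in a helper that strips whitespace and skips empty paragraphs, as sketched below. `to_documents` is a hypothetical name, and applying it would lower the document count shown above.

```python
def to_documents(paragraphs, source):
    """Build document dictionaries with 'text' and 'meta' keys, skipping blank paragraphs."""
    docs = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # ignore entries such as '\r' that contain no text
        docs.append({'text': paragraph, 'meta': {'source': source}})
    return docs

print(to_documents(['\r', 'A paragraph.\r'], 'un palestine'))
# [{'text': 'A paragraph.', 'meta': {'source': 'un palestine'}}]
```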

#### d. Upload the data to Elasticsearch

In [20]:
doc_store.write_documents(data_json)


08/20/2023 12:31:50 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.545s]
08/20/2023 12:31:51 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.069s]


In [21]:
requests.get('http://localhost:9200/un/_count').json()


{'count': 889,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

### Step 2: Retriever (BM25) and Reader (bert-base-cased-squad2)

In [12]:
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='un'
)

08/20/2023 12:28:31 - INFO - elasticsearch -   HEAD http://localhost:9200/un [status:200 request:0.005s]
08/20/2023 12:28:31 - INFO - elasticsearch -   GET http://localhost:9200/un [status:200 request:0.003s]
08/20/2023 12:28:31 - INFO - elasticsearch -   PUT http://localhost:9200/un/_mapping [status:200 request:0.014s]
08/20/2023 12:28:31 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.002s]


#### a. Retriever

In [13]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(doc_store)  # BM25
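For intuition about what BM25 ranking does, here is a minimal pure-Python sketch of the textbook scoring formula. It is not Elasticsearch's implementation: the tokenizer is a simple regex, and the defaults `k1=1.2` and `b=0.75` are assumed here as commonly used values.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc_tokens in tokenized:
        df.update(set(doc_tokens))
    scores = []
    for doc_tokens in tokenized:
        tf = Counter(doc_tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "The Mandate for Palestine was assigned to Great Britain.",
    "Elasticsearch stores documents in an index.",
    "The League of Nations adopted the Mandates System.",
]
scores = bm25_scores("Palestine Mandate", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because matching is purely lexical, "Mandates" in the third document does not match the query term "Mandate"; this is the kind of miss that a sparse retriever can make.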


#### b. Reader

In [15]:
from haystack.reader.farm import FARMReader

reader = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2',
                    context_window_size=1500,
                    use_gpu=True)

08/20/2023 12:28:32 - INFO - farm.utils -   Using device: CUDA 
08/20/2023 12:28:32 - INFO - farm.utils -   Number of GPUs: 1
08/20/2023 12:28:32 - INFO - farm.utils -   Distributed Training: False
08/20/2023 12:28:32 - INFO - farm.utils -   Automatic Mixed Precision: None


08/20/2023 12:30:22 - INFO - farm.utils -   Using device: CUDA 
08/20/2023 12:30:22 - INFO - farm.utils -   Number of GPUs: 1
08/20/2023 12:30:22 - INFO - farm.utils -   Distributed Training: False
08/20/2023 12:30:22 - INFO - farm.utils -   Automatic Mixed Precision: None
08/20/2023 12:30:22 - INFO - farm.infer -   Got ya 3 parallel workers to do inference ...
08/20/2023 12:30:22 - INFO - farm.infer -    0    0    0 
08/20/2023 12:30:22 - INFO - farm.infer -   /w\  /w\  /w\
08/20/2023 12:30:22 - INFO - farm.infer -   /'\  / \  /'\
08/20/2023 12:30:22 - INFO - farm.infer -       


#### c. Retriever-Reader ODQA pipeline

In [23]:
from haystack.pipeline import ExtractiveQAPipeline

qa = ExtractiveQAPipeline(reader=reader, retriever=retriever)

### Running Queries

In [35]:
qa.run(query='What were the origins of the Palestine problem as an international issue?', top_k_reader=3)


08/20/2023 13:06:39 - INFO - elasticsearch -   POST http://localhost:9200/un/_search [status:200 request:0.018s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.08 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]


{'query': 'What were the origins of the Palestine problem as an international issue?',
 'no_ans_gap': 2.5004959106445312,
 'answers': [{'answer': 'towards the end of the First World War',
   'score': 9.186897277832031,
   'probability': 0.7592116315941247,
   'context': 'The origins of the Palestine problem as an international issue, however, lie in events occurring towards the end of the First World War. These events led to a League of Nations decision to place Palestine under the administration of Great Britain as the Mandatory Power under the Mandates System adopted by the League. In principle, the Mandate was meant to be in the nature of a transitory phase until Palestine attained the status of a fully independent nation, a status provisionally recognized in the League’s Covenant, but in fact the Mandate’s historical evolution did not result in the emergence of Palestine as an independent nation.\r',
   'offset_start': 97,
   'offset_end': 135,
   'offset_start_in_doc': 97,
   'off

In [34]:
qa.run(query='Why did the Mandate for Palestine not result in the emergence of an independent nation?', top_k_reader=3)


08/20/2023 13:05:50 - INFO - elasticsearch -   POST http://localhost:9200/un/_search [status:200 request:0.017s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.27 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.99 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.99 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.99 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.66 Batches/s]


{'query': 'Why did the Mandate for Palestine not result in the emergence of an independent nation?',
 'no_ans_gap': 5.823172092437744,
 'answers': [{'answer': 'Palestinian Arabs could not present their own views',
   'score': 12.28139877319336,
   'probability': 0.8227621947973142,
   'context': 'Both Governments then requested the views of the independent Arab Governments which, in the meantime, had formed the Arab League in March 1945, envisioning the future membership of an eventually independent Palestine. Since the Palestinian Arabs could not present their own views, the Arab Governments actively advocated their case, and obtained assurances from the United States Government of consultation on any formula for Palestine. They now proposed a conference to discuss the Palestine problem.\r',
   'offset_start': 228,
   'offset_end': 279,
   'offset_start_in_doc': 228,
   'offset_end_in_doc': 279,
   'document_id': 'c9ec55d9-605e-4fad-b95b-8c48322f6c8c',
   'meta': {'source': 'un palest

In [33]:
qa.run(query='Who is stealing land in Palestine?', top_k_reader=3)


08/20/2023 13:03:06 - INFO - elasticsearch -   POST http://localhost:9200/un/_search [status:200 request:0.012s]
Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.05 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.99 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.40 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.63 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.63 Batches/s]


{'query': 'Who is stealing land in Palestine?',
 'no_ans_gap': -9.19066047668457,
 'answers': [{'answer': 'various Jewish agencies',
   'score': 0.30732420086860657,
   'probability': 0.5096027003720435,
   'context': '“Land available for settlement. It has emerged quite definitely that there is at the present time and with the present methods of Arab cultivation no margin of land available for agricultural settlement by new immigrants with the exception of such undeveloped land as the various Jewish agencies hold in reserve.” 77\r',
   'offset_start': 272,
   'offset_end': 295,
   'offset_start_in_doc': 272,
   'offset_end_in_doc': 295,
   'document_id': 'a00e49e8-568b-4f43-b029-7c3c43f3c589',
   'meta': {'source': 'un palestine'}},
  {'answer': 'Jewish population of Palestine who lived there before the War never had any trouble with their Arab neighbours',
   'score': -3.7163710594177246,
   'probability': 0.38590785130289834,
   'context': '“… We wish to point out here that the Jewi
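
The raw dictionaries returned by `qa.run` are long, and the answers vary a lot in confidence (the top answer to the third query has a probability of only about 0.51). A small helper like the one below could condense a result into answer and probability pairs and drop weak candidates. It relies only on the `answers`, `answer`, and `probability` keys visible in the outputs above; the name `summarize_answers`, the 0.5 threshold, and the second entry in the sample are assumptions made for illustration.

```python
def summarize_answers(result, min_probability=0.5):
    """Return (answer, probability) pairs whose probability meets the threshold."""
    rows = []
    for ans in result.get('answers', []):
        if ans.get('answer') is None:
            continue  # skip entries that carry no answer text
        if ans.get('probability', 0.0) < min_probability:
            continue  # skip low-confidence candidates
        rows.append((ans['answer'], round(ans['probability'], 3)))
    return rows

# First entry copied from the first query's output; second entry is hypothetical.
sample_result = {
    'query': 'What were the origins of the Palestine problem as an international issue?',
    'answers': [
        {'answer': 'towards the end of the First World War', 'probability': 0.7592116315941247},
        {'answer': None, 'probability': 0.2},
    ],
}
print(summarize_answers(sample_result))
# [('towards the end of the First World War', 0.759)]
```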

Interesting Stuff 
### FREE PALESTINE 