## QA with a retriever and Haystack

Unfortunately, this notebook has to be run locally. You will need to install [Elasticsearch](https://www.elastic.co/downloads/past-releases/elasticsearch-7-11-2).

On Windows, install Haystack with the following command:

         pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html

In [3]:
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='squad_docs')

08/17/2023 21:42:01 - INFO - elasticsearch -   PUT http://localhost:9200/squad_docs [status:200 request:0.436s]
08/17/2023 21:42:01 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.288s]


Good, our connection to Elasticsearch is up and running. Let's investigate the connection:

In [5]:
import requests
res = requests.get('http://localhost:9200/_cluster/health')

res.json()

{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 2,
 'active_shards': 2,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 2,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 50.0}

Okay, we can see that the cluster is definitely running. The cluster status is yellow. Ideally we want green, but we see yellow here because not all replica shards have been allocated to nodes (note the 2 unassigned shards in the response). The details don't really matter for this notebook: it just means that we don't have a full set of backup (replica) shards, which is only a problem if our primary shards get corrupted or lost.
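To make the meaning of the three status colours concrete, here is a minimal sketch that turns a health response into a one-line summary. The helper name `describe_health` is hypothetical (it is not part of Elasticsearch or Haystack), and the dictionary is a trimmed copy of the response above, so the snippet runs without a live cluster.

```python
def describe_health(health: dict) -> str:
    """Summarise a _cluster/health response in one sentence (illustrative helper)."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    if status == "green":
        return "green: all primary and replica shards are allocated"
    if status == "yellow":
        # On a single-node cluster replicas cannot be placed next to their
        # primaries, so they stay unassigned and the status stays yellow.
        return f"yellow: all primaries are allocated, {unassigned} replica shard(s) are not"
    return "red: at least one primary shard is not allocated"


# Values copied from the response shown above.
health = {"status": "yellow", "active_primary_shards": 2, "unassigned_shards": 2}
print(describe_health(health))
```

With a running instance, the same function could be fed `requests.get('http://localhost:9200/_cluster/health').json()` directly.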

### Adding data

Right now our Elasticsearch instance contains an empty index called 'squad_docs'. We need to populate it with our SQuAD data.

In [11]:
import json
devpath = r'dev_2.json'
with open(devpath, 'r') as f:
    squad = json.load(f)

In [12]:
squad_docs = []

for sample in squad:
    squad_docs.append({
        'text': sample['context']
    })

Let's preview the first two documents, then add our data to the index:

In [16]:
squad_docs[:2]

[{'text': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
 {'text': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") r

In [13]:
document_store.write_documents(squad_docs)

08/17/2023 21:54:03 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.531s]
08/17/2023 21:54:05 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.306s]
08/17/2023 21:54:06 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.076s]
08/17/2023 21:54:07 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.067s]
08/17/2023 21:54:08 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.098s]
08/17/2023 21:54:09 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.034s]
08/17/2023 21:54:10 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.167s]
08/17/2023 21:54:11 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.050s]


### Retrieving Data with TF-IDF

When retrieving documents from Elasticsearch, we will use either the TF-IDF or the BM25 algorithm.

**TF-IDF** is a common relevance scoring algorithm. The score is calculated from:

- **TF** (term frequency), how often words from the query (question) appear in the document.

- **IDF** (inverse document frequency), the inverse of the fraction of documents that contain the same word (e.g. common words like 'the' don't score well, whereas 'Beyonce' would).
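The two ingredients above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by Haystack's `TfidfRetriever`: it assumes whitespace tokenisation, a raw count for TF, and `log(N / df)` for IDF. The function name `tfidf_score` and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def tfidf_score(query: str, doc: str, corpus: list) -> float:
    """Sum of TF * IDF over the query terms (simplified, illustrative variant)."""
    doc_counts = Counter(doc.lower().split())
    tokenized = [d.lower().split() for d in corpus]
    n_docs = len(tokenized)
    score = 0.0
    for term in query.lower().split():
        tf = doc_counts[term]                            # occurrences in this document
        df = sum(1 for d in tokenized if term in d)      # documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(n_docs / df)                      # rare terms get a larger weight
        score += tf * idf
    return score


corpus = [
    "the normans gave their name to normandy",
    "the treaty of rome established free movement",
    "the brotherhood announced the formation of a new group",
]

# 'the' appears in every document, so its IDF is log(3/3) = 0 and it adds nothing.
for doc in corpus:
    print(round(tfidf_score("the normans", doc, corpus), 3), doc)
```

Only the first document receives a non-zero score here, because 'normans' is the only query word that distinguishes one document from the others.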

In [17]:
from haystack.retriever.sparse import TfidfRetriever

retriever = TfidfRetriever(document_store)

08/17/2023 22:01:55 - INFO - elasticsearch -   POST http://localhost:9200/squad_docs/_search?scroll=1d&size=10000 [status:200 request:0.458s]
08/17/2023 22:01:55 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.163s]
08/17/2023 22:01:55 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.065s]
08/17/2023 22:01:55 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.008s]
08/17/2023 22:01:55 - INFO - haystack.retriever.sparse -   Found 11873 candidate paragraphs from 11873 docs in DB


In [21]:
query = "Free Palestine"

retriever.retrieve(query)

[{'text': 'For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a "quiescent" stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel\'s "indulgence" to build up a network of mosques and charitable organizations. As the First Intifada gathered momentum and Palestinian shopkeepers closed their shops in support of the uprising, the Brotherhood announced the formation of HAMAS ("zeal"), devoted to Jihad against Israel. Rather than being more moderate than the PLO, the 1988 Hamas charter took a more uncompromising stand, calling for the destruction of Israel and the establishment of an Islamic state in Palestine. It was soon competing with and then overtaking the PLO for control of the intifada. The Brotherhood\'s base of devout middle class found common cause with the impoverished youth of the intifada in their cultural conservatism and antipathy for activities of the secular middle class s

### Remove duplicates

So many duplicates! Let's fix that.

In [22]:
context = [sample["context"] for sample in squad]

In [23]:
len(context)

11873

Let's convert it to a set and then back to a list to remove the duplicates:

In [24]:
context_no_duplicates = list(set(context))

In [25]:
len(context_no_duplicates)

1204
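One side effect of going through a `set` is that the original order of the contexts is lost. If the order matters, a possible alternative is `dict.fromkeys`, which keeps the first occurrence of each item in place. The sketch below uses a small toy list rather than the real `context` variable.

```python
# Toy stand-in for the list of contexts (the real list has 11873 entries).
contexts = ["Normans", "Intifada", "Normans", "Treaty of Rome", "Intifada"]

# dict keys are unique and remember insertion order (Python 3.7+),
# so this removes duplicates while keeping the first occurrence of each item.
unique_contexts = list(dict.fromkeys(contexts))

print(unique_contexts)  # ['Normans', 'Intifada', 'Treaty of Rome']
```

For this notebook either approach works, since the retriever does not depend on the order in which documents were written.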

#### Delete the index data

We use the following endpoint: 'http://localhost:9200/squad_docs/**_delete_by_query**'

In [31]:
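# count the documents currently in the index
res = requests.get('http://localhost:9200/squad_docs/_count')

print('before', res.json())

# delete every document matching the query (match_all matches everything)
res = requests.post('http://localhost:9200/squad_docs/_delete_by_query',
                    json={
                        'query': {
                            'match_all': {}
                        }
                    })

# count again to confirm that the index is now empty
res = requests.get('http://localhost:9200/squad_docs/_count')

print('after', res.json())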

before {'count': 0, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
after {'count': 0, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
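As a hedged variation on the cell above, the sketch below builds the same request but adds the `refresh` query parameter, which asks Elasticsearch to refresh the affected shards once the deletion finishes so that a following `_count` or search sees the result. The helper `delete_all_request` is hypothetical, and the actual HTTP call is left as a comment because it needs a running instance.

```python
def delete_all_request(host: str, index: str) -> tuple:
    """Return (url, params, body) for deleting every document in an index.

    Illustrative helper only; it prepares the request but does not send it.
    """
    url = f"{host}/{index}/_delete_by_query"
    params = {"refresh": "true"}           # make the deletions visible right away
    body = {"query": {"match_all": {}}}    # match_all selects every document
    return url, params, body


url, params, body = delete_all_request("http://localhost:9200", "squad_docs")
print(url)

# With Elasticsearch running, the request could be sent like this:
# import requests
# res = requests.post(url, params=params, json=body)
# print(res.json()["deleted"])  # number of documents that were removed
```

Checking the `deleted` field of the response is a quick way to confirm how many documents were actually removed.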


#### Now re-upload the data without duplicates

In [32]:
# convert back to the dictionary format we need
squad_docs = [{'text': sample} for sample in context_no_duplicates]

In [34]:
document_store.write_documents(squad_docs)

08/17/2023 22:20:52 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.324s]
08/17/2023 22:20:53 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.190s]
08/17/2023 22:20:55 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.054s]


In [35]:
retriever = TfidfRetriever(document_store)

08/17/2023 22:20:57 - INFO - elasticsearch -   POST http://localhost:9200/squad_docs/_search?scroll=1d&size=10000 [status:200 request:0.023s]
08/17/2023 22:20:57 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.004s]
08/17/2023 22:20:57 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.002s]
08/17/2023 22:20:57 - INFO - haystack.retriever.sparse -   Found 1204 candidate paragraphs from 1204 docs in DB


In [37]:
query = "Free Palestine"

retriever.retrieve(query)

[{'text': 'While the concept of a "social market economy" was only introduced into EU law in 2007, free movement and trade were central to European development since the Treaty of Rome 1957. According to the standard theory of comparative advantage, two countries can both benefit from trade even if one of them has a less productive economy in all respects. Like in other regional organisations such as the North American Free Trade Association, or the World Trade Organisation, breaking down barriers to trade, and enhancing free movement of goods, services, labour and capital, is meant to reduce consumer prices. It was originally theorised that a free trade area had a tendency to give way to a customs union, which led to a common market, then monetary union, then union of monetary and fiscal policy, political and eventually a full union characteristic of a federal state. In Europe, however, those stages were considerably mixed, and it remains unclear whether the "endgame" should be the sa

Now we're returning a set of relevant documents, without duplicates. GOOD!

### Retrieving data with BM25

Finally, let's return to the other sparse retriever that we can use with Elasticsearch. We already used TF-IDF. By swapping TfidfRetriever for ElasticsearchRetriever we switch to the BM25 algorithm, which is an improved version of TF-IDF and is recommended by Haystack.
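To give an intuition for what "improved" means, here is a minimal sketch of a textbook BM25 score. It is an illustration under stated assumptions, not Elasticsearch's exact scoring code: it uses whitespace tokenisation and a smoothed IDF, and the function name `bm25_score` and the toy corpus are made up. The defaults `k1=1.2` and `b=0.75` match the defaults documented for Elasticsearch's BM25 similarity.

```python
import math


def bm25_score(query: str, doc: str, corpus: list, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 score of one document for a query (illustrative sketch)."""
    doc_terms = doc.lower().split()
    tokenized = [d.lower().split() for d in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    score = 0.0
    for term in query.lower().split():
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in tokenized if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Longer-than-average documents are penalised, controlled by b.
        length_norm = 1 - b + b * len(doc_terms) / avg_len
        # The term frequency saturates: extra repetitions add less and less.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


corpus = [
    "free x x x",
    "free free free free",
    "normans in normandy",
]

once = bm25_score("free", corpus[0], corpus)
four_times = bm25_score("free", corpus[1], corpus)
print(once, four_times)
```

The second document repeats the query word four times but scores well under four times as much as the first. That saturation, together with the document-length normalisation, is the main difference from plain TF-IDF.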

In [38]:
# import BM25 retriever
from haystack.retriever.sparse import ElasticsearchRetriever

query = "Free Palestine"
# initialize
retriever = ElasticsearchRetriever(document_store)

# and query
retriever.retrieve(query)

08/17/2023 22:24:02 - INFO - elasticsearch -   POST http://localhost:9200/squad_docs/_search [status:200 request:0.058s]


[{'text': 'For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a "quiescent" stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel\'s "indulgence" to build up a network of mosques and charitable organizations. As the First Intifada gathered momentum and Palestinian shopkeepers closed their shops in support of the uprising, the Brotherhood announced the formation of HAMAS ("zeal"), devoted to Jihad against Israel. Rather than being more moderate than the PLO, the 1988 Hamas charter took a more uncompromising stand, calling for the destruction of Israel and the establishment of an Islamic state in Palestine. It was soon competing with and then overtaking the PLO for control of the intifada. The Brotherhood\'s base of devout middle class found common cause with the impoverished youth of the intifada in their cultural conservatism and antipathy for activities of the secular middle class s