
This live notebook was created by Owais Ahmad. For contact and questions: [owaiskhan9654.github.io](https://owaiskhan9654.github.io/)

# Elasticsearch Basics
- [Elasticsearch installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html)
- [Elasticsearch Python client](https://elasticsearch-py.readthedocs.io/en/master/)
- BioASQ Task 9a dataset [Link](http://participants-area.bioasq.org/Tasks/9a/trainingDataset/raw/allMeSH/)

## Installing the Python client

```!pip install elasticsearch```

### Importing packages

In [1]:
# Import packages
import numpy as np
import pandas as pd
import re
import ijson
from pprint import pprint
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
from elasticsearch import Elasticsearch

### Configuring Elasticsearch

In [2]:
# Elasticsearch configuration
# Pass the node URL directly; uppercase HOST= and PORT= are not parameters of the client.

es = Elasticsearch("http://localhost:9200")
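
Before indexing anything, it can help to confirm that the client actually reaches the node. The following is a minimal sketch, assuming `es` is the client created above. `check_connection` is a hypothetical helper name; `ping()` and `info()` are methods of the Python client.

```python
def check_connection(es):
    """Return the server version string, or None if the node is unreachable.

    `es` is assumed to be an elasticsearch.Elasticsearch client instance.
    """
    # ping() returns False instead of raising when the node does not answer
    if not es.ping():
        return None
    # info() returns the node's root-endpoint document, which includes its version
    return es.info()["version"]["number"]
```

Calling `check_connection(es)` right after creating the client gives an early signal if Elasticsearch is not running on `localhost:9200`.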

### Loading the BioASQ Task 9a dataset

The full BioASQ Task 9a training set contains 15,559,157 articles (about 25.6 GB in size) covering 29,369 MeSH headings. This notebook works with only a sample of the first 10,000 articles.


## Dataset [Link](http://participants-area.bioasq.org/Tasks/9a/trainingDataset/raw/allMeSH/)

In [3]:
%%time

abstractText=[]
meshMajor=[]
pmid=[]
title=[]
journal=[]
year=[]
count=0 # articles read so far; raise the limit in the check below (10000) to load more
f = open(r"D:/Lab Backup by Sushil/OWAIS/BIO ASQ DATASET TASK A/allMeSH_2021.json")
objects = ijson.items(f, 'articles.item')
for obj in tqdm(objects):
    abstractText.append(obj["abstractText"].strip())
    meshMajor.append(obj["meshMajor"])
    pmid.append(obj["pmid"])
    title.append(obj['title'])
    journal.append(obj['journal'])
    year.append(obj['year'])
    count =count +1
    if count==10000:
        break

data = pd.DataFrame({'abstractText': abstractText, 'journal':journal,'meshMajor': meshMajor,'pmid':pmid,'title':title,'year':year})
data=data.to_dict(orient='records')

9999it [00:00, 75381.85it/s]

Wall time: 202 ms
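
The loading loop above can also be written as a generator, which avoids the six parallel lists and makes the sample size a parameter. This is a sketch, not the author's code: `take_articles` and `FIELDS` are hypothetical names, and the field list is the one read in the cell above.

```python
from itertools import islice

# Fields read from each article in the loading cell
FIELDS = ("abstractText", "journal", "meshMajor", "pmid", "title", "year")

def take_articles(objects, limit=10000):
    """Yield at most `limit` records as plain dicts with the fields above.

    `objects` can be any iterable of dict-like articles, for example the
    iterator returned by ijson.items(f, 'articles.item').
    """
    for obj in islice(objects, limit):
        record = {field: obj[field] for field in FIELDS}
        # Match the original cell, which strips whitespace around the abstract
        record["abstractText"] = record["abstractText"].strip()
        yield record
```

With the file opened in a `with` block, `data = list(take_articles(ijson.items(f, 'articles.item')))` should give an equivalent list of records without going through a DataFrame.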





In [4]:
pprint(data[509]) # An arbitrary record, at index 509

{'abstractText': 'Resin acids in pulp and paper mills wastewater are '
                 'potentially partitioned in the solids in post-primary '
                 'clarification due to higher hydrophobicity with log Kow '
                 '?1.74-5.80. They are known to adversely affect anaerobic '
                 'digestion (AD) process, although the effect has not been '
                 'quantified deterministically in control studies. The '
                 'objective of the present work was to determine the effect of '
                 'untreated and ozonated spiked resin acids on AD of primary '
                 'sludge. Batch adsorption tests were conducted to determine '
                 'the solid-liquid partition coefficient (Kd) of resin acids '
                 'on the primary sludge. Higher Kd was obtained at pH 4; '
                 'however, it was decreased by 78-98% at pH 8. Thereafter, '
                 'batch AD of model resin acids in primary sludge using food '
   

### Indexing each article

In [5]:
%%time

for i, a_data in enumerate(tqdm(data)):
    # Use the record's position in the list as the document id
    result=es.index(index='bioasq_task_9a',body=a_data,id=i)
pprint(result)
print('\n\nThis is only showing last inserted element')

100%|███████████████████████████████████████████████████████████████████████████| 10000/10000 [00:35<00:00, 285.22it/s]

{'_id': '9999',
 '_index': 'bioasq_task_9a',
 '_primary_term': 1,
 '_seq_no': 9999,
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': '_doc',
 '_version': 1,
 'result': 'created'}


This is only showing last inserted element
Wall time: 35.1 s
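
Indexing one document per request took about 35 seconds for 10,000 records in the run above. The client's `elasticsearch.helpers.bulk` function sends many documents per request and is typically much faster. The following is a minimal sketch, assuming the same `es` client and `data` list; `generate_actions` and `bulk_index` are hypothetical helper names.

```python
def generate_actions(records, index_name="bioasq_task_9a"):
    """Yield one bulk action per record, using its position as the document id."""
    for doc_id, record in enumerate(records):
        yield {"_index": index_name, "_id": doc_id, "_source": record}

def bulk_index(es, records, index_name="bioasq_task_9a"):
    """Index all records through the bulk API; returns (success_count, errors)."""
    # Imported here so generate_actions can be used without the client installed
    from elasticsearch import helpers
    return helpers.bulk(es, generate_actions(records, index_name))
```

A call such as `bulk_index(es, data)` would replace the per-document loop while keeping the same index name and document ids.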





### Retrieving the indexed data for confirmation

In [6]:
for i in tqdm(range(len(data))):
    result=es.get(index="bioasq_task_9a",id=i)
pprint(result)
print('\n\nPrinting every document would make this notebook too large, so only the last inserted element is shown')

100%|██████████████████████████████████████████████████████████████████████████| 10000/10000 [00:08<00:00, 1209.17it/s]

{'_id': '9999',
 '_index': 'bioasq_task_9a',
 '_primary_term': 1,
 '_seq_no': 9999,
 '_source': {'abstractText': 'The coronavirus disease (COVID-19), while mild '
                             'in most cases, has nevertheless caused '
                             'significant mortality. The measures adopted in '
                             'most countries to contain it have led to '
                             'colossal social and economic disruptions, which '
                             'will impact the medium- and long-term health '
                             'outcomes for many communities. In this paper, we '
                             'deliberate on the reality and facts surrounding '
                             'the disease. For comparison, we present data '
                             'from past pandemics, some of which claimed more '
                             'lives than COVID-19. Mortality data on road '
                             'traffic crashes and other non-com
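
Fetching every document is one way to confirm the indexing; a cheaper check is to compare the index's document count with `len(data)`. The following sketch assumes the same `es` client. `indexed_count` is a hypothetical helper built on the client's `indices.refresh` and `count` methods.

```python
def indexed_count(es, index_name="bioasq_task_9a"):
    """Return the number of documents currently searchable in the index."""
    # Refresh first so that recently indexed documents are included in the count
    es.indices.refresh(index=index_name)
    return es.count(index=index_name)["count"]
```

For example, `indexed_count(es) == len(data)` should hold once all 10,000 records are indexed.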




### If a mistake occurs while indexing, run the command below to delete the index

In [7]:
#es.indices.delete(index="bioasq_task_9a")
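
Deleting an index that does not exist raises an error, so a guarded version can be convenient when re-running the notebook. This is a sketch under the same assumptions (`es` is the client created earlier); `delete_index_if_exists` is a hypothetical name, and it relies on the client's `indices.exists` and `indices.delete` calls.

```python
def delete_index_if_exists(es, index_name="bioasq_task_9a"):
    """Delete the index if it is present; return True if something was deleted."""
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)
        return True
    return False
```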

In [8]:
print(es.indices.get_alias(index="*")) # List every index present in the cluster

{'bioasq_task_9a': {'aliases': {}}}


### Defining the Elastic_ser function, which matches documents against a given query and returns a given number of results

In [9]:
# Creating Query Function

## Match Query
def Elastic_ser(Query="COVID-19",Result_size=2): # Defaults: query "COVID-19", result size 2
    body = {
        "from":0,
        "size":Result_size,
        "query": {
            "match": {
                "meshMajor":Query
            }
        }
    }

    results = es.search(index="bioasq_task_9a", body=body)
    return results

In [10]:
Elastic_ser('SARS-CoV-2',1)

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2630, 'relation': 'eq'},
  'max_score': 5.066309,
  'hits': [{'_index': 'bioasq_task_9a',
    '_type': '_doc',
    '_id': '2044',
    '_score': 5.066309,
    '_source': {'abstractText': 'OBJECTIVE: To analyze the clinical manifestations of heart, liver and kidney damages in the early stage of COVID-19 to identify the indicators for these damages.METHODS: We analyzed the clinical features, underlying diseases, and indicators of infection in 12 patients with COVID-19 on the second day after their admission to our hospital between January 20 and February 20, 2020.The data including CK-MB, aTnI, BNP, heart rate, changes in ECG, LVEF (%), left ventricular general longitudinal strain (GLS, measured by color Doppler ultrasound) were collected.The changes of liver function biochemical indicators were dynamically reviewed.BUN, UCR, eGFR, Ccr, and UACR and the level
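
A `match` query analyzes the query text, so with the default analyzer a heading such as 'SARS-CoV-2' is likely split into several tokens, and a document containing any one of them counts as a hit. That may explain the large hit count above. To match a MeSH heading exactly, one option is a `term` query on the `meshMajor.keyword` sub-field. This sketch assumes Elasticsearch's default dynamic mapping (the notebook defines no explicit mapping), under which string fields also get a `keyword` sub-field; `exact_mesh_query` is a hypothetical helper.

```python
def exact_mesh_query(heading, size=2):
    """Build a search body that matches a MeSH heading exactly."""
    return {
        "from": 0,
        "size": size,
        "query": {
            # term queries are not analyzed, so the heading must equal the stored value
            "term": {"meshMajor.keyword": heading}
        },
    }
```

The body can be passed to the same search call, for example `es.search(index="bioasq_task_9a", body=exact_mesh_query("SARS-CoV-2", 1))`.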

### Combining queries

#### must, must_not and should


In [11]:
 
body = {
    "from":0,
    "size":1, # increase this to return more results
    "query": {
        "bool": {
            "must_not": {
                "match": {
                    "meshMajor":"COVID-19"
                }
            },
            "should": {
                "match": {
                    "meshMajor": "Betacoronavirus"
                }
            },
            "must": {
                "match": {
                    "meshMajor": "Pneumonia, Viral"
                }
            }
        }
    }
}

res = es.search(index="bioasq_task_9a", body=body)
res

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 160, 'relation': 'eq'},
  'max_score': 3.3100905,
  'hits': [{'_index': 'bioasq_task_9a',
    '_type': '_doc',
    '_id': '9700',
    '_score': 3.3100905,
    '_source': {'abstractText': "BACKGROUND: This longitudinal study aimed to examine the changes in psychological distress of the general public from the early to community-transmission phases of the COVID-19 pandemic and to investigate the factors related to these changes.METHODS: An internet-based survey of 2,400 Japanese people was conducted in two phases: early phase (baseline survey: February 25-27, 2020) and community-transmission phase (follow-up survey: April 1-6, 2020). The presence of severe psychological distress (SPD) was measured using the Kessler's Six-scale Psychological Distress Scale. The difference of SPD percentages between the two phases was examined. Mixed-effects ordinal logistic r
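
In a `bool` query, `must` clauses are required and contribute to the score, `should` clauses are optional here but raise the score of documents that match them, and `must_not` clauses exclude documents. The query above can be generated from lists of headings. The following is a sketch with a hypothetical helper name, `mesh_bool_query`, not part of the original notebook.

```python
def mesh_bool_query(must=(), should=(), must_not=(), size=1):
    """Build a bool query body with one match clause per MeSH heading."""
    def clauses(headings):
        # One match clause on meshMajor for each heading in the group
        return [{"match": {"meshMajor": heading}} for heading in headings]

    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": clauses(must),
                "should": clauses(should),
                "must_not": clauses(must_not),
            }
        },
    }
```

For instance, `mesh_bool_query(must=["Pneumonia, Viral"], should=["Betacoronavirus"], must_not=["COVID-19"])` builds a body with the same three conditions as the cell above, written as lists so that more headings can be added to each group.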