In [1]:
#importing all packages required
from urllib.request import urlopen
import pandas as pd
import json
from elasticsearch import Elasticsearch, RequestsHttpConnection
from newsapi import NewsApiClient
import warnings

In [2]:
#Elasticsearch connection function
def Elasticsearch_connection(host_link, user_auth):
    # Silence warnings, including those raised because verify_certs=False
    warnings.filterwarnings("ignore")
    es = Elasticsearch(
        hosts=host_link,
        verify_certs=False,  # TLS certificate verification is disabled
        http_auth=user_auth,
        connection_class=RequestsHttpConnection,
    )
    print("ElasticSearch connection has been established and this connection instance is stored in es variable")
    return es

In [3]:
es = Elasticsearch_connection(['https://tux-es1.cci.drexel.edu:9200/','https://tux-es2.cci.drexel.edu:9200/','https://tux-es3.cci.drexel.edu:9200/'],'ms4976:Phooh3ahkei7')

ElasticSearch connection has been established and this connection instance is stored in es variable


**Testing and evaluating search engine indices:**

__Use case 1 : "Crimes in US" on index "ms4976_info624_201904_newsproject1"__

In [19]:
es.search(index = 'ms4976_info624_201904_newsproject1', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "crimes in US",
          "fields": ["title^2","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

{'took': 30,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 347, 'relation': 'eq'},
  'max_score': 18.714148,
  'hits': [{'_index': 'ms4976_info624_201904_newsproject',
    '_type': '_doc',
    '_id': '2223',
    '_score': 18.714148,
    '_source': {'source': 'Wired',
     'author': 'Matt Burgess, WIRED UK',
     'title': 'A British AI Tool to Predict Violent Crime Is Too Flawed to Use',
     'description': 'A government-funded system known as Most Serious Violence was built to predict first offenses but turned out to be wildly inaccurate.',
     'url': 'https://www.wired.com/story/a-british-ai-tool-to-predict-violent-crime-is-too-flawed-to-use/',
     'publishedAt': '2020-08-09T13:00:00Z',
     'timestamp': 0.00023691392416765148}},
   {'_index': 'ms4976_info624_201904_newsproject',
    '_type': '_doc',
    '_id': '2237',
    '_score': 12.209831,
    '_source': {'source': 'Reuters',
     'author': 'Reuters Editor

The information need for this query ("crimes in US") is to find all news articles in the "ms4976_info624_201904_newsproject1" index that report on crimes occurring in the United States. Among the top 10 results, Doc 1, Doc 5 and Doc 8 are not relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    Relevant = Doc 2, Doc 3, Doc 4, Doc 6, Doc 7, Doc 9, Doc 10

    Non-Relevant = Doc 1, Doc 5, Doc 8

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 7/(7+3)
              = 7/10
              = 0.7

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 0+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (0/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (1/log(9))+(1/log(10))
        = 0 + 1/1 + 1/1.58 +1/2 + 0/2.32 + 1/2.58 + 1/2.8 + 0/3 + 1/3.17+ 1/3.32
        = 1 + 0.63 + 0.5 + 0.39 +0.36 + 0.31 + 0.3
        =3.49
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+(1/log(2)) +(1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) +(0/log(8)) +(0/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 +0.43 + 0.39 +0.36 
        =4.31
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         = 3.49/4.31
         = 0.81
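
The hand calculation above can be cross-checked with a few lines of Python. This is a minimal sketch rather than part of the original notebook: the function names `precision_at_k`, `dcg` and `ndcg` are hypothetical helpers, and the list `crimes_in_us` encodes the binary judgments stated for this use case (Doc 1, Doc 5 and Doc 8 non-relevant). It follows the same convention as the formula above, where rank 1 is undiscounted and rank i >= 2 is divided by log2(i).

```python
import math

def precision_at_k(rels):
    """Fraction of the retrieved documents that are relevant."""
    return sum(rels) / len(rels)

def dcg(rels):
    """DCG with rank 1 undiscounted and rank i >= 2 divided by log2(i)."""
    return sum(rel if rank == 1 else rel / math.log2(rank)
               for rank, rel in enumerate(rels, start=1))

def ndcg(rels):
    """DCG divided by the DCG of the ideal (relevance-sorted) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Binary judgments for "crimes in US": Doc 1, Doc 5 and Doc 8 are non-relevant.
crimes_in_us = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]

print(f"Precision = {precision_at_k(crimes_in_us):.2f}")
print(f"DCG       = {dcg(crimes_in_us):.2f}")
print(f"nDCG      = {ndcg(crimes_in_us):.2f}")
```

Because the code does not round the intermediate terms, the ideal DCG comes out near 4.30 rather than the hand-rounded 4.31, so differences in the last digit are expected.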

__Use case 2 : "articles by BBC news" on index "ms4976_info624_201904_newsproject1"__

In [14]:
es.search(index = 'ms4976_info624_201904_newsproject1', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "articles by BBC news",
          "fields": ["source^3","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

{'took': 8,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 947, 'relation': 'eq'},
  'max_score': 21.565105,
  'hits': [{'_index': 'ms4976_info624_201904_newsproject',
    '_type': '_doc',
    '_id': '206',
    '_score': 21.565105,
    '_source': {'source': 'BBC News',
     'author': 'https://www.facebook.com/bbcnews',
     'title': 'Musicians hire fishing boat to beat France quarantine - BBC News',
     'description': '<ol><li>Musicians hire fishing boat to beat France quarantine\xa0\xa0BBC News\r\n</li><li>British tourists rush back from France to avoid quarantine\xa0\xa0CNN\r\n</li><li>Thousands of Britons return from France to avoid quarantine\xa0\xa0ABC News\r\n</li><li>The French quarantine f…',
     'url': 'https://www.bbc.com/news/uk-scotland-53792546',
     'publishedAt': '2020-08-15T13:07:04Z',
     'timestamp': 0.00023699089174106783}},
   {'_index': 'ms4976_info624_201904_newsproject',
    '_type': '_d

The information need for this query ("articles by BBC news") is to find all news articles in the "ms4976_info624_201904_newsproject1" index that were published by the BBC News source. All of the top 10 retrieved results are relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = None

This query returns no false positives, since all retrieved results are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 10/(10+0)
              = 10/10
              = 1

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =5.25/5.25
         =1
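
The `should` clause in these queries adds a recency score taken from the `timestamp` rank feature. The sketch below assumes that the `sigmoid` function of the Elasticsearch `rank_feature` query scores a feature value S as S^exponent / (S^exponent + pivot^exponent); this formula should be checked against the documentation for the cluster's version. The helper name `sigmoid_score` is hypothetical, and the `timestamp` value is copied from the first hit shown above for "articles by BBC news".

```python
def sigmoid_score(value, pivot, exponent):
    """Assumed rank_feature sigmoid: value^exp / (value^exp + pivot^exp)."""
    scaled = value ** exponent
    return scaled / (scaled + pivot ** exponent)

# `timestamp` of the first hit shown above for "articles by BBC news".
timestamp = 0.00023699089174106783

# Same pivot and exponent as in the search requests of this notebook.
recency = sigmoid_score(timestamp, pivot=5, exponent=0.6)
print(f"recency score = {recency:.4f}")
```

Under this assumption the recency contribution is on the order of 0.0025, which is small next to the text-match scores of roughly 8 to 21 reported in these use cases. The `should` clause is therefore unlikely to change the ranking much unless the pivot is chosen closer to the stored `timestamp` values.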

__Use case 3 : "COVID-19" on index "ms4976_info624_201904_newsproject1"__

In [12]:
es.search(index = 'ms4976_info624_201904_newsproject1', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "COVID-19",
          "fields": ["title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

{'took': 10,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 308, 'relation': 'eq'},
  'max_score': 8.097325,
  'hits': [{'_index': 'ms4976_info624_201904_newsproject',
    '_type': '_doc',
    '_id': '108',
    '_score': 8.097325,
    '_source': {'source': 'TheChronicleHerald.ca',
     'author': 'Reuters Inc.',
     'title': 'Chinese COVID-19 vaccine candidate shows promise in animal tests - TheChronicleHerald.ca',
     'description': "<ol><li>Chinese COVID-19 vaccine candidate shows promise in animal tests\xa0\xa0TheChronicleHerald.ca\r\n</li><li>Russia claims it's in the last phase of COVID-19 vaccine trials\xa0\xa0ABC News\r\n</li><li>COVID-19 Vaccines ‘Making Good Progress’ in Trials But Won’t Be Useab…",
     'url': 'https://www.thechronicleherald.ca/news/world/chinese-covid-19-vaccine-candidate-shows-promise-in-animal-tests-477118/',
     'publishedAt': '2020-07-24T09:57:45Z',
     'timestamp': 0.00023670722

The information need for this query ("COVID-19") is to find all news articles in the "ms4976_info624_201904_newsproject1" index that are related to the COVID-19 virus. All of the top 10 retrieved results are relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = None

This query returns no false positives, since all retrieved results are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 10/(10+0)
              = 10/10
              = 1

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =5.25/5.25
         =1
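
The search requests in these use cases share one structure and differ only in the query text and the boosted fields. As a sketch, assuming a hypothetical helper named `build_query` (not part of the original notebook), the request body could be generated instead of retyped for every use case:

```python
def build_query(text, fields, size=10):
    """Return a search body: a multi_match on `fields` plus the recency boost."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": text, "fields": fields}}],
                "should": {
                    "rank_feature": {
                        "field": "timestamp",
                        "sigmoid": {"pivot": 5, "exponent": 0.6},
                    }
                },
            }
        },
    }

body = build_query("COVID-19", ["title", "description"])
# The body can then be passed to the client as in the cells above, e.g.
# es.search(index="ms4976_info624_201904_newsproject1", body=body)
print(body["query"]["bool"]["must"][0]["multi_match"])
```

Keeping the body in one function makes it harder for the three indices to be queried with accidentally different parameters, which matters when the goal is to compare their rankings.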

__Use case 4 : "Annie Karni articles on Donald Trump" on index "ms4976_info624_201904_newsproject1"__

In [24]:
es.search(index = 'ms4976_info624_201904_newsproject1', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "Annie Karni articles on Donald Trump",
          "fields": ["author^5","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

{'took': 207,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 395, 'relation': 'eq'},
  'max_score': 12.603967,
  'hits': [{'_index': 'ms4976_info624_201904_newsproject1',
    '_type': '_doc',
    '_id': '863',
    '_score': 12.603967,
    '_source': {'source': 'Pypi.org',
     'author': 'lucasyangpersonal@gmail.com',
     'title': 'nw-kaya added to PyPI',
     'description': 'Simplified python article discovery & extraction.',
     'url': 'https://pypi.org/project/nw-kaya/',
     'publishedAt': '2020-08-21T15:08:02Z',
     'timestamp': 0.0002370688731388191}},
   {'_index': 'ms4976_info624_201904_newsproject1',
    '_type': '_doc',
    '_id': '3143',
    '_score': 11.348137,
    '_source': {'source': 'New York Times',
     'author': 'Ginia Bellafante',
     'title': 'Why the Big City President Made Cities the Enemy',
     'description': 'Donald Trump — a lifelong New Yorker — declares war on urban America.',
     

The information need for this query ("Annie Karni articles on Donald Trump") is to find all news articles in the "ms4976_info624_201904_newsproject1" index that were written by the author Annie Karni on any topic related to Donald Trump. Among the top 10 results, only Doc 1 is not relevant to this information need.

Based on this information need, the results can be categorized into relevant and non-relevant documents:

    relevant = Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Doc 1

For this query, only one of the top 10 retrieved results is a false positive; the remaining nine are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 9/(9+1)
              = 9/10
              = 0.9

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG =  0+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        =  0 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        = 4.25
 
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0
        =4.95
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         = 4.25/4.95
         = 0.86
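
Precision for this use case is 0.9 wherever the non-relevant document appears, whereas nDCG penalizes it for sitting at rank 1. The sketch below makes that difference visible by moving the single non-relevant document from rank 1 to rank 10. The helper `ndcg` and the second ranking are hypothetical illustrations, and the discount convention is the one used in the formulas above.

```python
import math

def ndcg(rels):
    """nDCG with rank 1 undiscounted and rank i >= 2 divided by log2(i)."""
    def dcg(values):
        return sum(v if i == 1 else v / math.log2(i)
                   for i, v in enumerate(values, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

miss_at_rank_1 = [0] + [1] * 9    # the ranking judged in this use case
miss_at_rank_10 = [1] * 9 + [0]   # hypothetical ranking with the miss last

for name, rels in [("rank 1", miss_at_rank_1), ("rank 10", miss_at_rank_10)]:
    precision = sum(rels) / len(rels)
    print(f"miss at {name}: precision = {precision:.2f}, nDCG = {ndcg(rels):.2f}")
```

Both rankings have the same precision, but only the second one matches the ideal ordering, which is why nDCG is the more informative measure when the position of the errors matters.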

**Evaluation on the "ms4976_info624_201904_newsproject2" index, which has a different similarity configuration**

__Use case 1 : "crimes in US" on index "ms4976_info624_201904_newsproject2"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject2', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "crimes in US",
          "fields": ["title^2","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("crimes in US") is to find all news articles in the "ms4976_info624_201904_newsproject2" index that report on crimes occurring in the United States. Among the top 10 results, Doc 1, Doc 5 and Doc 8 are not relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 2, Doc 3, Doc 4, Doc 6, Doc 7, Doc 9, Doc 10
    non-relevant = Doc 1, Doc 5, Doc 8

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 7/(7+3)
              = 7/10
              = 0.7

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 0+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (0/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (1/log(9)) +(1/log(10))
        = 0 + 1/1 + 1/1.58 +1/2 + 0/2.32 + 1/2.58 + 1/2.8 + 0/3 + 1/3.17+ 1/3.32
        = 1 + 0.63 + 0.5 + 0.39 +0.36 + 0.31 + 0.3
        =3.49
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (0/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 +0.43 + 0.39 +0.36 
        =4.31

Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         = 3.49/4.31
         = 0.81

__Use case 2 : "articles by BBC news" on index "ms4976_info624_201904_newsproject2"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject2', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "articles by BBC news",
          "fields": ["source^3","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("articles by BBC news") is to find all news articles in the "ms4976_info624_201904_newsproject2" index that were published by the BBC News source. All of the top 10 retrieved results are relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = None

This query returns no false positives, since all retrieved results are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 10/(10+0)
              = 10/10
              = 1

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =5.25/5.25
         =1

__Use case 3 : "COVID-19" on index "ms4976_info624_201904_newsproject2"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject2', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "COVID-19",
          "fields": ["title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("COVID-19") is to find all news articles in the "ms4976_info624_201904_newsproject2" index that are related to the COVID-19 virus. All of the top 10 retrieved results are relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = None

This query returns no false positives, since all retrieved results are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 10/(10+0)
              = 10/10
              = 1

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =5.25/5.25
         =1
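
The relevance judgments in this notebook refer to documents by rank (Doc 1 to Doc 10), which have to be read off the nested search response. As a sketch, a hypothetical helper `summarize_hits` can flatten a response into one line per rank to make the judging step easier. The sample response below is an assumption built for illustration: it is trimmed to the two hits that are fully visible in the use case 4 output for the "ms4976_info624_201904_newsproject1" index.

```python
def summarize_hits(response):
    """Return one (rank, doc_id, source, title) tuple per hit, ranks from 1."""
    rows = []
    for rank, hit in enumerate(response["hits"]["hits"], start=1):
        src = hit["_source"]
        rows.append((rank, hit["_id"], src.get("source"), src.get("title")))
    return rows

# Trimmed sample shaped like the es.search() responses shown earlier.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "863", "_score": 12.603967,
             "_source": {"source": "Pypi.org",
                         "title": "nw-kaya added to PyPI"}},
            {"_id": "3143", "_score": 11.348137,
             "_source": {"source": "New York Times",
                         "title": "Why the Big City President Made Cities the Enemy"}},
        ]
    }
}

for rank, doc_id, source, title in summarize_hits(sample_response):
    print(f"Doc {rank} (id {doc_id}) [{source}] {title}")
```

Applied to a live response, the same call would be `summarize_hits(es.search(index=..., body=...))`, with the index name and body as in the cells of this notebook.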

__Use case 4 : "Annie Karni articles on Donald Trump" on index "ms4976_info624_201904_newsproject2"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject2', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "Annie Karni articles on Donald Trump",
          "fields": ["author^5","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("Annie Karni articles on Donald Trump") is to find all news articles in the "ms4976_info624_201904_newsproject2" index that were written by the author Annie Karni on any topic related to Donald Trump. Among the top 10 results, Doc 1, Doc 2 and Doc 3 are not relevant to this information need.

Based on this information need, the results can be categorized into relevant and non-relevant documents:

    relevant = Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Doc 1, Doc 2, Doc 3

For this query, three of the top 10 retrieved results are false positives; the remaining seven are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 7/(7+3)
              = 7/10
              = 0.7

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG =  0+ (0/log(2)) + (0/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        =  0 + 0 + 0 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        = 2.62
 
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (0/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0 + 0 + 0
        =4.31
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =2.62/4.31
         =0.61
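
To compare the two indices evaluated so far, the per-query judgments recorded above can be averaged per index. The sketch below is one possible way to summarize them, not the notebook's own method: the names `judgments`, `summary` and `ndcg` are hypothetical, the index names are shortened for readability, and the lists only re-use the binary relevance labels stated in the text.

```python
import math

def ndcg(rels):
    """nDCG with rank 1 undiscounted and rank i >= 2 divided by log2(i)."""
    def dcg(values):
        return sum(v if i == 1 else v / math.log2(i)
                   for i, v in enumerate(values, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

# Binary relevance labels per rank, as stated in the text for each use case.
judgments = {
    "newsproject1": {
        "crimes in US": [0, 1, 1, 1, 0, 1, 1, 0, 1, 1],
        "articles by BBC news": [1] * 10,
        "COVID-19": [1] * 10,
        "Annie Karni articles on Donald Trump": [0] + [1] * 9,
    },
    "newsproject2": {
        "crimes in US": [0, 1, 1, 1, 0, 1, 1, 0, 1, 1],
        "articles by BBC news": [1] * 10,
        "COVID-19": [1] * 10,
        "Annie Karni articles on Donald Trump": [0, 0, 0] + [1] * 7,
    },
}

summary = {}
for index, queries in judgments.items():
    mean_precision = sum(sum(r) / len(r) for r in queries.values()) / len(queries)
    mean_ndcg = sum(ndcg(r) for r in queries.values()) / len(queries)
    summary[index] = (mean_precision, mean_ndcg)
    print(f"{index}: mean precision = {mean_precision:.2f}, mean nDCG = {mean_ndcg:.2f}")
```

With only four queries per index, these averages are dominated by the single query on which the indices differ, so they are best read as a compact restatement of the results above rather than as a statistically meaningful comparison.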

**Evaluation on the "ms4976_info624_201904_newsproject3" index, which has a different similarity configuration**

__Use case 1 : "crimes in US" on index "ms4976_info624_201904_newsproject3"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject3', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "crimes in US",
          "fields": ["title^2","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("crimes in US") is to find all news articles in the "ms4976_info624_201904_newsproject3" index that report on crimes occurring in the United States. Among the top 10 results, Doc 1, Doc 5 and Doc 8 are not relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 2, Doc 3, Doc 4, Doc 6, Doc 7, Doc 9, Doc 10
    non-relevant = Doc 1, Doc 5, Doc 8

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 7/(7+3)
              = 7/10
              = 0.7

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 0+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (0/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (1/log(9)) +(1/log(10))
        = 0 + 1/1 + 1/1.58 +1/2 + 0/2.32 + 1/2.58 + 1/2.8 + 0/3 + 1/3.17+ 1/3.32
        = 1 + 0.63 + 0.5 + 0.39 +0.36 + 0.31 + 0.3
        =3.49

IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (0/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 +0.43 + 0.39 +0.36 
        =4.31

Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         = 3.49/4.31
         = 0.81

__Use case 2 : "articles by BBC news" on index "ms4976_info624_201904_newsproject3"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject3', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "articles by BBC news",
          "fields": ["source^3","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("articles by BBC news") is to find all news articles in the "ms4976_info624_201904_newsproject3" index that were published by the BBC News source. All of the top 10 retrieved results are relevant to this information need.

The relevant and non-relevant documents for this information need are listed below:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = None

This query returns no false positives, since all retrieved results are relevant to the information need.

Precision is the fraction of retrieved documents that are relevant:

    Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
              = 10/(10+0)
              = 10/10
              = 1

DCG is higher when the relevant documents appear near the top of the ranking, because the gain of each document is discounted by the logarithm of its rank.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant documents come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG (nDCG) is the ratio of DCG to IDCG:

    nDCG = DCG/IDCG
         =5.25/5.25
         =1

__Use case 3 : "COVID-19" on index "ms4976_info624_201904_newsproject3"__

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject3', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "COVID-19",
          "fields": ["title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("COVID-19") is to find all news articles in the "ms4976_info624_201904_newsproject3" index that are related to the COVID-19 virus. The query results show that all of the top 10 retrieved results are relevant to this information need.

Below are the relevant and non-relevant documents for this information need:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Null
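Relevance judgments like the ones listed here are made by reading the top 10 hits. The sketch below shows one way to condense a search response into one row per hit for that purpose. It assumes the response has the `hits.hits[]._source.title` shape shown in the earlier output; `summarize_hits` is a hypothetical helper, and `sample_response` is a hand-made placeholder rather than real query output.

```python
def summarize_hits(response, k=10):
    """Return (rank, _id, _score, title) tuples for the top-k hits of a search response."""
    hits = response["hits"]["hits"][:k]
    return [(rank, hit["_id"], hit["_score"], hit["_source"].get("title", ""))
            for rank, hit in enumerate(hits, start=1)]

# Tiny hand-made response with the same shape as the es.search output.
# The ids, scores and titles are placeholders, not real results.
sample_response = {
    "hits": {"hits": [
        {"_id": "a1", "_score": 7.5, "_source": {"title": "Placeholder title one"}},
        {"_id": "b2", "_score": 6.0, "_source": {"title": "Placeholder title two"}},
    ]}
}

for row in summarize_hits(sample_response):
    print(row)
```

In the notebook, the dictionary returned by `es.search(...)` would be passed in place of `sample_response`.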

For this query there are no false positives, since all the retrieved results are relevant to the information need.

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 10/(10+0)
              =10/10
              =1

DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below

    nDCG = DCG/IDCG
         =5.25/5.25
         =1

Use case 4 : "Annie Karni articles on Trump" on index "ms4976_info624_201904_newsproject3"

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject3', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "Annie Karni articles on Donald Trump",
          "fields": ["author^5","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 
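Every search cell in this evaluation uses the same query shape: a `multi_match` clause under `must` and a `rank_feature` clause on `timestamp` under `should`. Only the query text, the boosted fields and the index change. The sketch below builds that request body from its varying parts; `build_query` is a hypothetical helper, not something defined in the notebook.

```python
def build_query(text, fields, size=10, pivot=5, exponent=0.6):
    """Build the bool query body used in the search cells of this evaluation."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                # Text relevance: match the query text against the (optionally boosted) fields.
                "must": [{"multi_match": {"query": text, "fields": fields}}],
                # Score contribution from the timestamp rank feature.
                "should": {
                    "rank_feature": {
                        "field": "timestamp",
                        "sigmoid": {"pivot": pivot, "exponent": exponent},
                    }
                },
            }
        },
    }

body = build_query("Annie Karni articles on Donald Trump",
                   ["author^5", "title", "description"])
# With the live connection from the top of the notebook:
# es.search(index="ms4976_info624_201904_newsproject3", body=body)
```

Keeping the body in one function makes it easier to confirm that the four indices are compared with identical queries.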

The information need for this query ("Annie Karni articles on Donald Trump") is to find all news articles in the "ms4976_info624_201904_newsproject3" index that were written by the author Annie Karni on any Donald Trump topic. Among the top 10 retrieved results, Doc 1, Doc 2, Doc 3, and Doc 4 are not relevant to this information need.

Based on this information need, the results can be categorized into relevant and non-relevant documents:

    relevant = Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Doc 1, Doc 2, Doc 3, Doc 4

For this query there are four false positives among the top 10 retrieved results; the remaining six are relevant to the information need.

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 6/(6+4)
              =6/10
              =0.6
          
DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG =  0+ (0/log(2)) + (0/log(3)) + (0/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        =  0 + 0 + 0 + 0 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        = 2.12

IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (0/log(7)) + (0/log(8)) + (0/log(9)) +(0/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 0/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0 + 0 + 0 + 0
        =3.95
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below

    nDCG = DCG/IDCG
         =2.12/3.95
         =0.54
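This use case shows why nDCG is reported alongside precision: precision only counts relevant documents, while nDCG also penalizes relevant documents that are ranked low. The sketch below compares the ranking judged above with a hypothetical re-ordering of the same ten documents. It assumes binary relevance and the same base-2 discount as the formula; the `reordered` list is illustrative, not an actual query result.

```python
import math

def ndcg(rels):
    """nDCG for binary relevance labels given in rank order."""
    def dcg(values):
        return sum(rel if rank == 1 else rel / math.log2(rank)
                   for rank, rel in enumerate(values, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

observed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]    # ranking judged above: Doc 1-4 not relevant
reordered = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical: same documents, relevant ones first

for name, rels in [("observed", observed), ("reordered", reordered)]:
    print(name, sum(rels) / len(rels), round(ndcg(rels), 2))
# prints:
# observed 0.6 0.54
# reordered 0.6 1.0
```

Both orderings have a precision of 0.6, but only the observed one is penalized by nDCG for placing the four non-relevant documents first.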

**Evaluation on the "ms4976_info624_201904_newsproject4" index, which has a different settings configuration**

Use case 1 : "Crimes in US" on index "ms4976_info624_201904_newsproject4"

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject4', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "crimes in US",
          "fields": ["title^2","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("crimes in US") is to find all news articles in the "ms4976_info624_201904_newsproject4" index that contain details about crimes in the United States. Among the top 10 retrieved results, Doc 1, Doc 5, and Doc 8 are not relevant to this information need.

Below are the relevant and non-relevant documents for this information need:

    relevant = Doc 2, Doc 3, Doc 4, Doc 6, Doc 7, Doc 9, Doc 10
    non-relevant = Doc 1, Doc 5, Doc 8

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 7/(7+3)
              =7/10
              =0.7

DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 0+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (0/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (1/log(9)) +(1/log(10))
        = 0 + 1/1 + 1/1.58 +1/2 + 0/2.32 + 1/2.58 + 1/2.8 + 0/3 + 1/3.17+ 1/3.32
        = 1 + 0.63 + 0.5 + 0.39 +0.36 + 0.31 + 0.3
        =3.49
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (0/log(8)) + (0/log(9)) +(0/log(10))
         = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 0/3 + 0/3.17+ 0/3.32
        = 1 + 1 + 0.63 + 0.5 +0.43 + 0.39 +0.36 
        =4.31
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below

    nDCG = DCG/IDCG
         =3.49/4.31
         =0.8

Use case 2 : "articles by BBC news" on index "ms4976_info624_201904_newsproject4"

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject4', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "articles by BBC news",
          "fields": ["source^3","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("articles by BBC news") is to find all news articles in the "ms4976_info624_201904_newsproject4" index that were published by the BBC News source. The query results show that all of the top 10 retrieved results are relevant to this information need.

Below are the relevant and non-relevant documents for this information need:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Null

For this query there are no false positives, since all the retrieved results are relevant to the information need.

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 10/(10+0)
              =10/10
              =1
          
DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
         = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
         = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
         =5.25
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below

    nDCG = DCG/IDCG
         =5.25/5.25
         =1

Use case 3 : "COVID-19" on index "ms4976_info624_201904_newsproject4"

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject4', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "COVID-19",
          "fields": ["title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("COVID-19") is to find all news articles in the "ms4976_info624_201904_newsproject4" index that are related to the COVID-19 virus. The query results show that all of the top 10 retrieved results are relevant to this information need.

Below are the relevant and non-relevant documents for this information need:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Null

For this query there are no false positives, since all the retrieved results are relevant to the information need.

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 10/(10+0)
              =10/10
              =1

DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below

    nDCG = DCG/IDCG
         =5.25/5.25
         =1

#### Use case 4 : "Annie Karni articles on Trump" on index "ms4976_info624_201904_newsproject4"

In [None]:
es.search(index = 'ms4976_info624_201904_newsproject4', body=
{"from":0, "size":10,
 "query": 
 {
   "bool":
   {
     "must":
     [{
        "multi_match": 
        {
          "query": "Annie Karni articles on Donald Trump",
          "fields": ["author^5","title","description"]
        }
     }],
     "should":
     {
       "rank_feature":
       {
         "field": "timestamp",
         "sigmoid":{"pivot": 5,"exponent":0.6}
       }
     } 
   }
  }
}) 

The information need for this query ("Annie Karni articles on Donald Trump") is to find all news articles in the "ms4976_info624_201904_newsproject4" index that were written by the author Annie Karni on any Donald Trump topic. The query results show that all of the top 10 retrieved results are relevant to this information need.

Based on this information need, the results can be categorized into relevant and non-relevant documents:

    relevant = Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9, Doc 10
    non-relevant = Null

For this query there are no false positives among the top 10 retrieved results; all of them are relevant to the information need.

Precision is defined as the fraction of retrieved documents that are relevant:

    Precision = True positive(TP)/(True positive(TP) + False Positive(FP))
              = 10/(10+0)
              =10/10
              =1

DCG is high when the relevant documents appear near the top of the ranking.

Discounted cumulative gain is given by the formula below, where log is the base-2 logarithm:

    DCG = rel1 + (rel2/log(2)) + (rel3/log(3)) + (rel4/log(4)) + (rel5/log(5)) + (rel6/log(6)) + (rel7/log(7)) + (rel8/log(8)) + (rel9/log(9)) + (rel10/log(10))

Here the relevance value of a document is 1 if the retrieved document is relevant to the information need and 0 if it is not.

    DCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 + 0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
IDCG (ideal DCG) is calculated by re-ordering the retrieved results so that all relevant results come first, in decreasing order of relevance:

    IDCG = 1+ (1/log(2)) + (1/log(3)) + (1/log(4)) + (1/log(5)) + (1/log(6)) + (1/log(7)) + (1/log(8)) + (1/log(9)) +(1/log(10))
        = 1 + 1/1 + 1/1.58 +1/2 + 1/2.32 + 1/2.58 + 1/2.8 + 1/3 + 1/3.17+ 1/3.32
        = 1 + 1 + 0.63 + 0.5 + 0.43 + 0.39 +0.36 + 0.33 + 0.31 + 0.3
        =5.25
    
Normalized DCG(nDCG) can be calculated by using DCG and IDCG as shown below:

    nDCG = DCG/IDCG
         =5.25/5.25   
         =1

**Comparison of all Indices**

|Index|Use case 1 precision|Use case 1 nDCG|Use case 2 precision|Use case 2 nDCG|Use case 3 precision|Use case 3 nDCG|Use case 4 precision|Use case 4 nDCG|Average precision|
|------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
|ms4976_info624_201904_newsproject1|0.7|0.8|1|1|1|1|0.9|0.85|0.9|
|ms4976_info624_201904_newsproject2|0.7|0.8|1|1|1|1|0.7|0.61|0.85|
|ms4976_info624_201904_newsproject3|0.7|0.8|1|1|1|1|0.6|0.54|0.83|
|ms4976_info624_201904_newsproject4|0.7|0.8|1|1|1|1|1|1|0.93|

In evaluating our indices, we focus on the precision and nDCG metrics rather than recall, because recall is not important for our domain objective. Search engine precision cannot be judged from the results of a single query, so we considered several use cases and use the average precision and nDCG values to evaluate index performance.
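The averages can be recomputed from the per-use-case values in the comparison table. The sketch below copies those reported values into a dictionary and takes the unweighted mean over the four use cases for each index. The structure and the names `scores`, `mean` and `best_by_ndcg` are illustrative assumptions, not code from the notebook.

```python
# Per-use-case values copied from the comparison table (use cases 1 to 4).
scores = {
    "ms4976_info624_201904_newsproject1": {"precision": [0.7, 1, 1, 0.9], "ndcg": [0.8, 1, 1, 0.85]},
    "ms4976_info624_201904_newsproject2": {"precision": [0.7, 1, 1, 0.7], "ndcg": [0.8, 1, 1, 0.61]},
    "ms4976_info624_201904_newsproject3": {"precision": [0.7, 1, 1, 0.6], "ndcg": [0.8, 1, 1, 0.54]},
    "ms4976_info624_201904_newsproject4": {"precision": [0.7, 1, 1, 1], "ndcg": [0.8, 1, 1, 1]},
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

averages = {index: {metric: mean(vals) for metric, vals in metrics.items()}
            for index, metrics in scores.items()}

# Index with the highest average nDCG over the four use cases.
best_by_ndcg = max(averages, key=lambda index: averages[index]["ndcg"])

for index, avg in averages.items():
    print(f"{index}: precision={avg['precision']:.3f} nDCG={avg['ndcg']:.3f}")
print("highest average nDCG:", best_by_ndcg)
```

With only four queries per index these averages are coarse, so small differences between indices should be read with care.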

For all indices we obtained almost the same precision for most of the queries/use cases. Our data collection is very small compared to real-world situations, so the results show no significant difference. Considering the nDCG evaluations, the index "ms4976_info624_201904_newsproject4" gives better results than the other indices, and since we are more interested in the top-ranked relevant results than in lower-ranked ones, we have chosen this index as the best in our case.

Note: All the index settings and mapping details are provided in the Custom Similarities Comparision Jupyter notebook.