# **Elasticsearch - 2: Basic Search & Semantic Search**

#### This notebook corresponds to the slide "4.2. ES  -  Basic Search  &  Semantic Search"

* Useful Links:
    * ES Doc - REST APIs: https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html 
    * ES Doc - Search APIs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html 
    * "Learn Elasticsearch from practical examples": https://medium.com/codex/learn-elasticsearch-from-practical-examples-495f2f8db83e
    * "Deep Dive into Querying Elasticsearch. Filter vs Query. Full-text search": https://towardsdatascience.com/deep-dive-into-querying-elasticsearch-filter-vs-query-full-text-search-b861b06bd4c0

*------- All relative file paths must be changed to match your own file locations before running this notebook -------*

## **1.Imports**

In [1]:
import json
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch
from elasticsearch.client import IndicesClient

## **2.ES Data Structure & Type**

This step maps out the structure and data types of the ES search results, as a reference for the later data processing and calculations.

* **Result:**   (Dict)
    * {
    * "took" (int): 3 -- execution time in milliseconds
    * "timed_out" (bool): false -- whether the request timed out
    * **"_shards":**  (Dict) -- shard-level statistics (each index here appears to have a single shard, so the counts match the number of indexes)
        * {
        * "total" (int): 1 -- number of shards queried
        * "successful" (int): 1 -- number of shards that executed the request successfully
        * "skipped" (int): 0 -- number of shards skipped
        * "failed" (int): 0 -- number of shards that failed
        * }
    * **"hits":** (Dict) -- doc-level request results
        * {
        * "total" (Dict)
            * {
            * "value" (int): 826 -- number of matching docs
            * "relation" (str): "eq" -- how to read "value", 
                                        "eq" means "value" is the exact number,
                                        "gte" means the actual number is greater than or equal to "value"
            * }
        * **<u>"max_score" (float): 1.0 -- highest relevance score among the matching docs</u>**
            * <u> About Relevance Scores: a relevance score is a positive number indicating how well a document matches the query.</u>
            * <u> By default, Elasticsearch scores with **Okapi Best Match 25 (BM25)**, an enhanced Term Frequency/Inverse Document Frequency similarity algorithm; it calculates a relevance score for each result, and results are returned in descending score order unless a "sort" is specified</u>
        * **"hits":** (list of Dicts) -- content-level request results
            * [
              * {
              * "_index" (str): "sus_reports_1" -- index name
              * "_type" (str): "_doc" -- mapping type; here we only have documents, so "_doc"
              * "_id" (str): "86.1" -- doc ID
              * "_score" (float): 1.0 -- relevance score on the "doc" level
              * **"_source":** (Dict) -- data-level request results
                  * {
                  * "id" (str): "86.1" -- doc ID
                  * "label" (int): 1 -- PDF label, 1 is positive report, 0 is negative report
                  * "company" (str): "SFL Corp Ltd" -- company name
                  * "industry" (str): "Industrials" -- industry name
                  * "country" (str): "Norway" -- country name
                  * "date" (int): 2021 -- date of this PDF
                  * "filename" (str): "86.pdf" -- original PDF file name
                  * "page" (int): 1 -- the corresponding page number of this doc
                  * "text_len" (int): 493 -- total number of words extracted from this page
                  * "text" (str): "........" -- extracted text of this page
                  * "emb_text_vector" (list): [-0.04, 0.018, 0.026, ...] -- vector embedding of the "text"
                  * }
              * }
              * {...} -- further hits, each a Dict with the same structure as the one above
            * ]
        * }
    * }

<br>

* 【"Pretty" print of ES results】: `j = json.dumps(result, indent=1)`
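
The BM25 relevance score mentioned in the structure above can be sketched for a single query term as follows. This is a minimal illustration of the textbook formula using Elasticsearch's default parameters `k1 = 1.2` and `b = 0.75`; the function name `bm25_term_score` and all sample numbers are assumptions made for illustration, and Lucene's actual implementation differs in details such as constant factors and how document length is stored.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document (illustrative only)."""
    # Inverse document frequency: terms found in few documents weigh more
    idf = math.log(1 + (n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))
    # Length normalisation: longer-than-average documents are penalised
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating term frequency: repeated occurrences add less and less
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return idf * tf_part

# Hypothetical numbers: a rare term versus a very common term
rare = bm25_term_score(tf=3, doc_len=400, avg_doc_len=400, n_docs=93950, n_docs_with_term=200)
common = bm25_term_score(tf=3, doc_len=400, avg_doc_len=400, n_docs=93950, n_docs_with_term=50000)
print(rare > common)
```

A document's total score for a multi-term query is the sum of such per-term scores, which is why "_score" values well above 1.0 appear in some of the outputs in this notebook.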

## **3.ES Search Examples - Basic**

In [2]:
# Create an ES client for operating ES
es_client = Elasticsearch("localhost:9200", # Default port
                          http_auth=["elastic", "ING_project"],
                          timeout=300) # Need to set "timeout" parameter to allow longer data loading time

# Create an ES index client to create indexes
es_index_client = IndicesClient(es_client)

In [55]:
#【1】Count the number of indexes and the number of docs/pages
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_count
        # {
        #   "query": {
        #     "match_all": {}
        #   }
        # }
search_query = {
  "query": {
    "match_all": {}
  }
}

# Get "COUNT" results, es_client uses "count" not "search"!
result = es_client.count(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "count": 93950,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 }
}


In [56]:
#【2】Search a specific index "sus_reports_2", return ONLY selected fields, and limit the number of results to 2
# Scope: 1 index: "sus_reports_2"
        # GET sus_reports_2/_search
        # {
        #   "size": 2,
        #   "query": {
        #     "match_all": {}
        #   },
        #   "_source": ["id","label", "company", "industry", "country", "date","filename", "page", "text_len"] 
        # }
search_query = {
  "size":2,
  "query": {
    "match_all": {}
  },
  "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"] 
}

# Get search results 
result = es_client.search(index="sus_reports_2", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 1,
 "timed_out": false,
 "_shards": {
  "total": 1,
  "successful": 1,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 1034,
   "relation": "eq"
  },
  "max_score": 1.0,
  "hits": [
   {
    "_index": "sus_reports_2",
    "_type": "_doc",
    "_id": "16359.1",
    "_score": 1.0,
    "_source": {
     "date": 2021,
     "country": "China",
     "text_len": 145,
     "filename": "16359.pdf",
     "company": "CGN Wind Energy Ltd",
     "industry": "Energy",
     "id": "16359.1",
     "label": 1,
     "page": 1
    }
   },
   {
    "_index": "sus_reports_2",
    "_type": "_doc",
    "_id": "16359.2",
    "_score": 1.0,
    "_source": {
     "date": 2021,
     "country": "China",
     "text_len": 296,
     "filename": "16359.pdf",
     "company": "CGN Wind Energy Ltd",
     "industry": "Energy",
     "id": "16359.2",
     "label": 1,
     "page": 2
    }
   }
  ]
 }
}
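
A response shaped like the one above can be flattened into one row per hit for further processing. The sketch below is an illustration only: the helper name `hits_to_rows` is hypothetical, and `sample_result` is a trimmed copy of the output shown above rather than a live ES response.

```python
def hits_to_rows(result):
    """Flatten an ES search response into a list of flat dicts, one per hit."""
    rows = []
    for hit in result["hits"]["hits"]:
        # Keep the hit-level metadata, then merge in the document fields
        row = {"_index": hit["_index"], "_id": hit["_id"], "_score": hit["_score"]}
        row.update(hit.get("_source", {}))
        rows.append(row)
    return rows

# Trimmed copy of the response printed above (only a few "_source" fields kept)
sample_result = {
    "hits": {
        "total": {"value": 1034, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_index": "sus_reports_2", "_id": "16359.1", "_score": 1.0,
             "_source": {"company": "CGN Wind Energy Ltd", "page": 1, "text_len": 145}},
            {"_index": "sus_reports_2", "_id": "16359.2", "_score": 1.0,
             "_source": {"company": "CGN Wind Energy Ltd", "page": 2, "text_len": 296}},
        ],
    }
}

rows = hits_to_rows(sample_result)
print(rows[0]["company"], rows[0]["page"])
```

Since `pandas` is already imported in this notebook, `pd.DataFrame(rows)` would turn these rows into a table.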




In [58]:
#【3】Match part of a word using the "ngram_analyzer" defined in the ES settings
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #     "query": {
        #         "match": {
        #         "filename.ngrams": "21"
        #         }
        #     }
        # }
search_query = {
    "query": {
        "match": {
            "filename.ngrams": "21" # "filename" has an n-gram sub-field "filename.ngrams" defined in the ES mapping
        }
    },
    "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"] 
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 8,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 200,
   "relation": "eq"
  },
  "max_score": 4.225013,
  "hits": [
   {
    "_index": "sus_reports_38",
    "_type": "_doc",
    "_id": "219.1",
    "_score": 4.225013,
    "_source": {
     "date": 2019,
     "country": "Singapore",
     "text_len": 41,
     "filename": "219.pdf",
     "company": "Tiong Seng Holdings Ltd",
     "industry": "Consumer discretionary",
     "id": "219.1",
     "label": 1,
     "page": 1
    }
   },
   {
    "_index": "sus_reports_38",
    "_type": "_doc",
    "_id": "219.2",
    "_score": 4.225013,
    "_source": {
     "date": 2019,
     "country": "Singapore",
     "text_len": 321,
     "filename": "219.pdf",
     "company": "Tiong Seng Holdings Ltd",
     "industry": "Consumer discretionary",
     "id": "219.2",
     "label": 1,
     "page": 2
    }
   },
   {
    "_index": "sus_reports_38",
    "_typ

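
To see why the query "21" matches a file name such as "219.pdf", it helps to look at what an n-gram analyzer produces. The sketch below is a plain-Python approximation: the notebook does not show the "ngram_analyzer" settings, so the `min_gram=2` and `max_gram=3` values and the helper name `char_ngrams` are assumptions for illustration.

```python
def char_ngrams(text, min_gram=2, max_gram=3):
    """Return all character n-grams of the lowercased text, shortest first."""
    text = text.lower()
    grams = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the text
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

grams = char_ngrams("219.pdf")
print("21" in grams)                    # the fragment "21" is one of the indexed tokens
print("21" in char_ngrams("86.pdf"))    # no such fragment in this file name
```

Because every fragment is indexed as its own token, a `match` query on the n-gram sub-field can find documents from only part of a value.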


In [59]:
#【4】Aggregate over "country" as buckets and count the number of docs per bucket
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "size":0,
        #   "aggs": {
        #     "country-count-agg": {
        #       "terms": {
        #         "field": "country.keyword"
        #       }
        #     }
        #   }
        # }
search_query = {
  "size": 0, # 0 means: return no search hits, only the aggregation results
  "aggs": {
    "country-count-agg": {
      "terms": { # "terms" is a multi-bucket aggregation: one bucket per distinct field value (top 10 by default)
        "field": "country.keyword" # aggregate on the "keyword" sub-field; analyzed "text" fields cannot be aggregated by default
      }
    }
  }
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 5,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 10000,
   "relation": "gte"
  },
  "max_score": null,
  "hits": []
 },
 "aggregations": {
  "country-count-agg": {
   "doc_count_error_upper_bound": 0,
   "sum_other_doc_count": 31141,
   "buckets": [
    {
     "key": "United States",
     "doc_count": 12464
    },
    {
     "key": "Spain",
     "doc_count": 11596
    },
    {
     "key": "China",
     "doc_count": 6506
    },
    {
     "key": "Italy",
     "doc_count": 6118
    },
    {
     "key": "India",
     "doc_count": 5958
    },
    {
     "key": "South Korea",
     "doc_count": 4781
    },
    {
     "key": "Japan",
     "doc_count": 4585
    },
    {
     "key": "Netherlands",
     "doc_count": 4380
    },
    {
     "key": "Sweden",
     "doc_count": 3568
    },
    {
     "key": "Germany",
     "doc_count": 2853
    }
   ]
  }
 }
}
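
The aggregation part of this response can be turned into a simple lookup table. The sketch below copies the bucket values from the output above and assumes every doc has a "country" value; the variable names are illustrative. Note that "hits.total" in this response is capped at 10000 with relation "gte", so the full doc count is recovered here from the buckets plus "sum_other_doc_count" (adding `"track_total_hits": true` to the request body is another way to obtain an exact total).

```python
# Aggregation section copied from the response above
aggregation = {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 31141,
    "buckets": [
        {"key": "United States", "doc_count": 12464},
        {"key": "Spain", "doc_count": 11596},
        {"key": "China", "doc_count": 6506},
        {"key": "Italy", "doc_count": 6118},
        {"key": "India", "doc_count": 5958},
        {"key": "South Korea", "doc_count": 4781},
        {"key": "Japan", "doc_count": 4585},
        {"key": "Netherlands", "doc_count": 4380},
        {"key": "Sweden", "doc_count": 3568},
        {"key": "Germany", "doc_count": 2853},
    ],
}

# Map each country to its doc count
counts = {bucket["key"]: bucket["doc_count"] for bucket in aggregation["buckets"]}

# Docs in the top-10 buckets plus docs in all remaining countries
total_docs = sum(counts.values()) + aggregation["sum_other_doc_count"]

# Percentage share of each listed country
shares = {country: round(100 * n / total_docs, 1) for country, n in counts.items()}

print(total_docs)
print(shares["United States"])
```

The total agrees with the "count" returned by the first query in this section (93950).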




In [61]:
#【5】Aggregate and see which industries tend to have super long sustainability reports
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "size": 0,
        #   "aggs": {
        #     "brand-count-agg": {
        #       "terms": {
        #         "field": "industry.keyword"
        #       }
        #     }
        #   },
        #   "query": {
        #     "range": {
        #       "page": {
        #         "gte": 600
        #       }
        #     }
        #   }
        # }
search_query = {
  "size": 0,
  "aggs": {
    "brand-count-agg": {
      "terms": {
        "field": "industry.keyword" # Aggregate over "industry"
      }
    }
  },
  "query": {
    "range": {
      "page": {
        "gte": 600 # pages numbered 600 or higher only exist in super long reports
      }
    }
  }
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

# All pages numbered 600 or higher come from the "Industrials" industry!

{
 "took": 5,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 171,
   "relation": "eq"
  },
  "max_score": null,
  "hits": []
 },
 "aggregations": {
  "brand-count-agg": {
   "doc_count_error_upper_bound": 0,
   "sum_other_doc_count": 0,
   "buckets": [
    {
     "key": "Industrials",
     "doc_count": 171
    }
   ]
  }
 }
}




In [62]:
#【6】Search a field whose value is in a range
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "query": {
        #     "range": {
        #       "date": {
        #         "gte": 2017,
        #         "lte": 2018
        #       }
        #     }
        #   },
        #   "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"] 
        # }
search_query = {
  "query": {
    "range": {
      "date": { # "date" is stored as an integer year here, so a numeric range applies
        "gte": 2017,
        "lte": 2018
      }
    }
  },
  "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"]
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 8,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 185,
   "relation": "eq"
  },
  "max_score": 1.0,
  "hits": [
   {
    "_index": "sus_reports_2",
    "_type": "_doc",
    "_id": "4045.1",
    "_score": 1.0,
    "_source": {
     "date": 2017,
     "country": "United States",
     "text_len": 212,
     "filename": "4045.pdf",
     "company": "Antelope Valley-East Kern Water Agency Financing Authority",
     "industry": "U.S. Municipal",
     "id": "4045.1",
     "label": 1,
     "page": 1
    }
   },
   {
    "_index": "sus_reports_2",
    "_type": "_doc",
    "_id": "4045.2",
    "_score": 1.0,
    "_source": {
     "date": 2017,
     "country": "United States",
     "text_len": 123,
     "filename": "4045.pdf",
     "company": "Antelope Valley-East Kern Water Agency Financing Authority",
     "industry": "U.S. Municipal",
     "id": "4045.2",
     "label": 1,
     "page": 2
    }




In [63]:
#【7】Search a field and sort the results by another field
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "query": {
        #     "match": {
        #       "country": "Norway"
        #     }
        #   },
        #   "sort": [
        #     {
        #       "date": {
        #         "order": "desc"
        #       }
        #     }
        #   ],
        #   "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"]
        # }
search_query = {
  "query": {
    "match": {
      "country": "Norway"
    }
  },
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"] 
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 12,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 1803,
   "relation": "eq"
  },
  "max_score": null,
  "hits": [
   {
    "_index": "sus_reports_1",
    "_type": "_doc",
    "_id": "86.1",
    "_score": null,
    "_source": {
     "date": 2021,
     "country": "Norway",
     "text_len": 493,
     "filename": "86.pdf",
     "company": "SFL Corp Ltd",
     "industry": "Industrials",
     "id": "86.1",
     "label": 1,
     "page": 1
    },
    "sort": [
     2021
    ]
   },
   {
    "_index": "sus_reports_1",
    "_type": "_doc",
    "_id": "86.2",
    "_score": null,
    "_source": {
     "date": 2021,
     "country": "Norway",
     "text_len": 698,
     "filename": "86.pdf",
     "company": "SFL Corp Ltd",
     "industry": "Industrials",
     "id": "86.2",
     "label": 1,
     "page": 2
    },
    "sort": [
     2021
    ]
   },
   {
    "_index": "sus_reports_1",
    "_type": "_



## **4.ES Search Examples - Advanced**

In [64]:
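#【1】Search pages [300, 400, 500] of all reports; sort by a【"painless" script】flag (1 for positive reports, else 0), then by date ascending
# Note: the script only affects the sort order; it does not filter out negative reports. In the output below the script value is 0.0
# even for docs with "label": 1, which suggests that 'label.keyword' holds the string "1", so the comparison with the number 1 never matches.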
# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "sort": [
        #     {
        #       "_script": {
        #         "type": "number",
        #         "script": {
        #           "lang": "painless",
        #           "source": """
        #               if (doc['label.keyword'].value == 1) {
        #                 return 1;
        #               } else {
        #                 return 0;
        #               }
        #             """
        #         }
        #       }
        #     },
        #     {
        #       "date": {
        #         "order": "asc"
        #       }
        #     }
        #   ],
        #   "query": {
        #     "terms": {
        #       "page": [300, 400, 500]
        #       }
        #   },
        #   "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"]     
        # }
search_query = {
  "sort": [
    {
      "_script": { # Condition-1
        "type": "number",
        "script": {
          "lang": "painless", 
          "source": """ 
              if (doc['label.keyword'].value == 1) {  
                return 1;
              } else {
                return 0;
              }
            """
        }
      }
    },
    {
      "date": { # Condition-2
        "order": "asc"
      }
    }
  ],
  "query": {
    "terms": { # Condition-3
      "page": [300, 400, 500]
      }
  },
  "_source": ["id","label", "company", "industry", "country", "date", "filename", "page", "text_len"] 
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 14,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 66,
   "relation": "eq"
  },
  "max_score": null,
  "hits": [
   {
    "_index": "sus_reports_19",
    "_type": "_doc",
    "_id": "3759.300",
    "_score": null,
    "_source": {
     "date": 2016,
     "country": "United States",
     "text_len": 842,
     "filename": "3759.pdf",
     "company": "State of Connecticut",
     "industry": "U.S. Municipal",
     "id": "3759.300",
     "label": 1,
     "page": 300
    },
    "sort": [
     0.0,
     2016
    ]
   },
   {
    "_index": "sus_reports_19",
    "_type": "_doc",
    "_id": "3759.400",
    "_score": null,
    "_source": {
     "date": 2016,
     "country": "United States",
     "text_len": 433,
     "filename": "3759.pdf",
     "company": "State of Connecticut",
     "industry": "U.S. Municipal",
     "id": "3759.400",
     "label": 1,
     "page": 400
    },
    "sort": [
   

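
The two-level sort in this query (script value first, then "date") behaves like sorting on a tuple key. The sketch below emulates that ordering in plain Python on a handful of hypothetical docs; the ids, labels and years are made up for illustration, and the labels are treated as integers.

```python
# Hypothetical docs standing in for ES hits
hits = [
    {"id": "a", "label": 1, "date": 2019},
    {"id": "b", "label": 0, "date": 2016},
    {"id": "c", "label": 1, "date": 2016},
    {"id": "d", "label": 0, "date": 2021},
]

def sort_key(doc):
    # Same logic as the painless script: 1 for positive reports, 0 otherwise
    flag = 1 if doc["label"] == 1 else 0
    # The second sort level is the date, ascending
    return (flag, doc["date"])

ordered = [doc["id"] for doc in sorted(hits, key=sort_key)]
print(ordered)
```

Because a script sort is ascending unless told otherwise, docs with flag 0 come first; adding `"order": "desc"` inside the "_script" clause would put the positive reports at the top instead.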


In [65]:
#【2】Search on Multiple Conditions -- using【Boolean queries】
# A "Boolean query" combines one or more Boolean clauses: "must", "should", 
# "filter" and "must_not". ES ranks matching results by "relevance score", which measures how well each 
# document matches a query. A clause runs in one of two contexts:
# In 【--- Query context ---】("must", "should") matching clauses contribute to the "relevance score" 
# In 【--- Filter context ---】("filter", "must_not") clauses only include or exclude docs; the scoring is ignored!

# Scope: all indexes: "sus_reports_*"
        # GET sus_reports_*/_search
        # {
        #   "query": {
        #     "bool": {
        #       "must": [
        #         {"match": {"text.ngrams": "environmental"}},
        #         {"match":  {"text.ngrams":  "have achieved"}}
        #       ],
        #       "should":[
        #                 {"match":  {"text.ngrams":  "recycle"}},
        #                 {"match":  {"text.ngrams":  "circularity"}},
        #                 {"match":  {"text.ngrams":  "waste"}},
        #                 {"match":  {"text.ngrams":  "water"}},
        #                 {"match":  {"text.ngrams":  "emission"}},
        #                 {"match":  {"text.ngrams":  "biodiversity"}},
        #                 {"match":  {"text.ngrams":  "co2"}},
        #                 {"match":  {"text.ngrams":  "plastics"}},
        #                 {"match":  {"text.ngrams":  "innovation"}},
        #                 {"match":  {"text.ngrams":  "technology"}},
        #                 {"match":  {"text.ngrams":  "have reduced emission"}},
        #                 {"match":  {"industry.keyword": "Utilities"}},
        #                 {"match":  {"industry.keyword": "Energy"}},
        #                 {"match":  {"industry.keyword": "Materials"}}
        #               ],
        #       "must_not": {
        #         "range": {
        #           "page": {
        #             "gt": 5
        #           }
        #         }
        #       },
        #       "filter": {
        #         "bool": {
        #           "must_not": {
        #             "match": {"label": 0}
        #           }
        #         }
        #       },
        #       "minimum_should_match": 12
        #     }
        #   }
        # } 
search_query = {
  "query": {
    "bool": {
      "must": [ # Condition Group-1
        {"match": {"text.ngrams": "environmental"}},
        {"match":  {"text.ngrams":  "have achieved"}}
      ],
      "should":[ # Condition Group-2
                {"match":  {"text.ngrams":  "recycle"}},
                {"match":  {"text.ngrams":  "circularity"}},
                {"match":  {"text.ngrams":  "waste"}},
                {"match":  {"text.ngrams":  "water"}},
                {"match":  {"text.ngrams":  "emission"}},
                {"match":  {"text.ngrams":  "biodiversity"}},
                {"match":  {"text.ngrams":  "co2"}},
                {"match":  {"text.ngrams":  "plastics"}},
                {"match":  {"text.ngrams":  "innovation"}},
                {"match":  {"text.ngrams":  "technology"}},
                {"match":  {"text.ngrams":  "have reduced emission"}},
                {"match":  {"industry.keyword": "Utilities"}},
                {"match":  {"industry.keyword": "Energy"}},
                {"match":  {"industry.keyword": "Materials"}}
              ],
      "must_not": { # Condition-3: all conditions above must be met in the first 5 pages per doc
        "range": {
          "page": {
            "gt": 5
          }
        }
      },
      "filter": {
        "bool": {
          "must_not": { # Condition-4: exclude negative reports
            "match": {"label": 0}
          }
        }
      },
      "minimum_should_match": 12 # Condition-5: must match at least 12 "should" clauses
    }
  }
}

# Get search results 
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)
print(json.dumps(result, indent=1))

{
 "took": 59,
 "timed_out": false,
 "_shards": {
  "total": 94,
  "successful": 94,
  "skipped": 0,
  "failed": 0
 },
 "hits": {
  "total": {
   "value": 42,
   "relation": "eq"
  },
  "max_score": 30.505863,
  "hits": [
   {
    "_index": "sus_reports_31",
    "_type": "_doc",
    "_id": "3394.3",
    "_score": 30.505863,
    "_source": {
     "id": "3394.3",
     "label": 1,
     "company": "Nordic Renewable Power AB",
     "industry": "Energy",
     "country": "Sweden",
     "date": 2020,
     "filename": "3394.pdf",
     "page": 3,
     "text_len": 704,
     "text": "I think often about the impact \nI want to make on the world\n in my personal life and in the organisation \nI am honoured to lead. \nFor 3M, our purpose is articulated in our \nvision statement, ending with \ufb01improving \n\ncommitment to Sustainability, which is a \nvalue that matters deeply to our people, \nto our customers and to me personally.\nWe started our Pollution Prevention Pays \nprogramme back in 1975 \

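
Writing fourteen near-identical "should" clauses by hand is error-prone, so the same request body can be generated from keyword lists. This is a sketch only: the helper `build_bool_query` and its parameter names are hypothetical, and it simply reproduces the structure of the query above.

```python
def build_bool_query(must_phrases, should_phrases, should_industries,
                     max_page, min_should_match, text_field="text.ngrams"):
    """Build a bool query body with the same shape as the hand-written one above."""
    must = [{"match": {text_field: phrase}} for phrase in must_phrases]
    should = [{"match": {text_field: phrase}} for phrase in should_phrases]
    should += [{"match": {"industry.keyword": name}} for name in should_industries]
    return {
        "query": {
            "bool": {
                "must": must,
                "should": should,
                # Exclude pages after max_page
                "must_not": {"range": {"page": {"gt": max_page}}},
                # Exclude negative reports without affecting the score
                "filter": {"bool": {"must_not": {"match": {"label": 0}}}},
                "minimum_should_match": min_should_match,
            }
        }
    }

query = build_bool_query(
    must_phrases=["environmental", "have achieved"],
    should_phrases=["recycle", "circularity", "waste", "water", "emission",
                    "biodiversity", "co2", "plastics", "innovation", "technology",
                    "have reduced emission"],
    should_industries=["Utilities", "Energy", "Materials"],
    max_page=5,
    min_should_match=12,
)
print(len(query["query"]["bool"]["should"]))
```

The resulting dict can be passed as the `body` argument in the same way as `search_query` above.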


## **5.ES Semantic Search**

#### 5.1. Load the USE Encoder and the Function

In [3]:
# Import the essential TensorFlow libraries:
import tensorflow.compat.v1 as tf 
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (USE) model into its own graph:
graph = tf.Graph()

with graph.as_default():
    print("Downloading pre-trained embeddings from tensorflow hub…")
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2") 
    text_ph = tf.placeholder(tf.string)
    embeddings_1 = embed(text_ph)
    # Build the initializer ops while this graph is still the default graph
    init_ops = [tf.global_variables_initializer(), tf.tables_initializer()]
    print("Done.")

print("Creating tensorflow session…")
# Keep one long-lived session bound to the graph; text_to_vector() reuses it
session = tf.Session(graph=graph)
session.run(init_ops)
print("Done.")
    
# Define a function that uses the USE to convert a list of texts to vectors:
def text_to_vector(text):
    vectors = session.run(embeddings_1, feed_dict={text_ph: text})
    return [vector.tolist() for vector in vectors]

#【Time of running this cell: 9sec】

2022-01-25 15:52:06.670720: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Downloading pre-trained embeddings from tensorflow hub…
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Done.
Creating tensorflow session…
Done.


#### 5.2. Define the Standard Texts

In [4]:
# Define example text (a positive ESG report) and vectorize it
# Adjacent string literals inside the parentheses are joined into one string
ref_text_1 = (
    "We show our efforts to help the green economy, creating business and value by recycling plastic waste. We focus on the positive impact on the environment and people through further growing our sustainable offering. We create technologies and solutions to advance a more efficient, sustainable, resilient and environment-friendly world for all. We perform detailed analysis to evaluate the significance of working activities that influence the environment. Our Environmental policy is also defined in the engineering phase which is an opportunity to propose technological modifications which can result in energy saving and cleaner emissions, leading to environmental benefits for the customer, stakeholders and the whole community. We are using only renewable energy. All our electricity is from renewable sources. Our electricity mainly come from solar panels and wind power. We demonstrate our commitment to this policy by striving to ensure that our actions have no or minimal impact on our planet. We have reduced our green house gas emissions. We are committed to promote decarbonization and better use of energy, continuously implement energy efficiency initiatives. Water consumption has been reduced and water has been recycled with innovative technologies.  It’s essential to protect water, not only for our business needs, but also for the sake of the communities in which we operate, because access to clean, fresh water is a critical human need. We also implemented a comprehensive water management system that includes a rainwater harvesting system. We have undertaken careful and comprehensive collection, transportation and final treatment of waste. Our digitalization of documents assists a paper-less approach which helps to reduce paper waste. We have successfully used innovative technologies to minimize hazard wastes. Negative impact on the environment has been reduced. "
    "During each audit we inspect environmental permits, waste management, and effluent treatment plants. We began an office eco-efficiency program aimed at reduction, reuse and recycling of waste. Each office has designated recycling bins. We have eliminated plastic from our packaging. We also committed to a plastic-free future. We have reduced carbon (CO2 ) emissions and reduced our carbon footprint of our operations, products and services. We achieved net-zero operational emissions. Circularity is part of our business model and we are expanding our environmental commitments to integrate biodiversity. We have started a series of initiatives to protect animal and plants biodiversity. We have been actively source sustainable green materials during our production."
)
ref_text_vec_1 = text_to_vector([ref_text_1])[0]
ref_text_2 = "It is our policy to ensure that all activities contribute to the social and economic and environmental welfare of our stakeholders through efficient and sustainable use of labor, land and capital without degradation to our natural environment. Building our business based on ethical, moral principles, respecting our employees and seeking to understand and support the interest of the communities whose environmental resources we share. We have formed professional links with the Institute of Environmental Management and Assessment (IEMA) and participated in the consultation on Guidance for Greenhouse Gas Reporting. We have achieved the ISO 14001 accreditation. Our operational sites have been recognised by the environment agencies and the environmental protection agencies. Our suppliers and we have obtained all necessary environmental certifications. We check and evaluate our suppliers on a regular basis to make sure that all parts of our supply chain comply with our environmental and sustainability standards. We have joined the Climate Pledge Coalition. We have engaged actively with sustainable business initiatives such as the U.N. Global Compact. We are publishing end-to-end biodiversity footprint reports using the new Global Biodiversity Score (GBS) tool from CDC Biodiversité. We already set internal waste targets for ISO 14000-certified sites and a goal to achieve higher standards to protect water, soil, air, animals and plants."
ref_text_vec_2 = text_to_vector([ref_text_2])[0]

# Store the reference text vectors into a list
ref_text_vec_list = [ref_text_vec_1, ref_text_vec_2]

# Print some stats of the text embeddings
print("--- 'ref_text_1' - Environmental key words Text to be embedded: {}".format(ref_text_1), "\n")
print("--- 'ref_text_vec_1' - Embedding size: {}".format(len(ref_text_vec_1)), "\n")
print("--- 'ref_text_vec_1' - Obtained Embedding[{},…]\n".format(ref_text_vec_1[:5]))
print("--- 'ref_text_2' - Environmental engagement Text to be embedded: {}".format(ref_text_2), "\n")
print("--- 'ref_text_vec_2' - Embedding size: {}".format(len(ref_text_vec_2)), "\n")
print("--- 'ref_text_vec_2' - Obtained Embedding[{},…]\n".format(ref_text_vec_2[:5]))
print("----------- The 'ref_text_vec_list' has: {} lists of text embedding vectors".format(len(ref_text_vec_list)), "\n")

#【Time of running this cell: 1sec】

--- 'ref_text_1' - Environmental key words Text to be embedded: We show our efforts to help the green economy, creating business and value by recycling plastic waste. We focus on the positive impact on the environment and people through further growing our sustainable offering. We create technologies and solutions to advance a more efficient, sustainable, resilient and environment-friendly world for all. We perform detailed analysis to evaluate the significance of working activities that influence the environment. Our Environmental policy is also defined in the engineering phase which is an opportunity to propose technological modifications which can result in energy saving and cleaner emissions, leading to environmental benefits for the customer, stakeholders and the whole community. We are using only renewable energy. All our electricity is from renewable sources. Our electricity mainly come from solar panels and wind power. We demonstrate our commitment to this policy by striving to

#### 5.3. Use ES Built-in Functions to Calculate Cosine & Euclidean Similarity Scores
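**1. Cosine similarity:**
measures the cosine of the angle between two vectors in a multi-dimensional space. It compares orientation rather than magnitude, and ranges from -1 (opposite directions) to 1 (same direction).

**2. Euclidean similarity:**
is derived from the Euclidean (L2) distance, i.e. the square root of the sum of squared differences between corresponding elements of the two vectors. A smaller distance means more similar vectors, so this notebook scores docs with the reciprocal `1/l2norm` to make closer vectors score higher.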
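
As a minimal sketch of the two measures, the plain-Python functions below compute cosine similarity and Euclidean distance for two toy vectors. The function names and the example vectors are illustrative assumptions and are not part of this notebook. For unit-length vectors the two measures are linked by `distance = sqrt(2 - 2 * cosine)`, which the last line checks.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Square root of the sum of squared element-wise differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical unit-length vectors
u = [1.0, 0.0]
v = [0.6, 0.8]

cos_uv = cosine_similarity(u, v)
dist_uv = euclidean_distance(u, v)
print("cosine similarity:", cos_uv)
print("euclidean distance:", dist_uv)
# For unit vectors: distance == sqrt(2 - 2 * cosine)
print("relation holds:", math.isclose(dist_uv, math.sqrt(2 - 2 * cos_uv)))
```

If the stored embeddings are unit length, ranking by cosine similarity and ranking by `1/l2norm` should give the same order, since one is a monotonic function of the other.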

* Useful Links:
    * ES Script Score Query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html
    * ES Similarity Module: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#index-modules-similarity
    * "Text similarity search with vector fields": https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch 
    * "Vector-Based Semantic Search using Elasticsearch": https://medium.com/version-1/vector-based-semantic-search-using-elasticsearch-48d7167b38f5
    * Alternative method - "Add semantic search to Elasticsearch": https://dev.to/neuml/add-semantic-search-to-elasticsearch-3ifb
    * "A Comparison of Semantic Similarity Methods for Maximum Human Interpretability": https://arxiv.org/pdf/1910.09129.pdf

In [7]:
#【Calculate!!!!!】 "Cosine similarity" and "Euclidean similarity" of text from each doc/page using the USE text embeddings

# Create empty Dict to hold final calculations results
result_dict = {}
# Name two similarity/distance measures to be calculated in a list
dis_measure_list = ["Cosine similarity", "Euclidean similarity"]
# List out the integer number of each index following the pattern "sus_reports_*" 
index_list = list(np.arange(1, 95))
# Use a counter to generate ref_text_vec names following the pattern "ref_text_vec_*"
counter = 1

# Iterate through all reference vectors
for ref_text_vec in ref_text_vec_list: 
    
    # Create an empty Dict for each ref_text_vec
    ref_text_vec_dict = {}
    
    # Iterate through all distance measures
    for measure in dis_measure_list: 
        
        # ---------------------------------------------------------------------------------------------------------------------
        # Calculate "Cosine similarity" for all indexes:
        if measure == "Cosine similarity":
            # Create an empty list to hold all cosine similarity results for ref_text_vec_*
            cos_sim_list = [] 
            # Iterate through all indexes individually to avoid data overload
            for index_nr in index_list: 
                # Use "script_query" to calculate the similarity scores on the index level
                script_query = {
                    "script_score": {
                        "query": {
                            "match_all": {}
                        },
                        "script": {
                            "source": 
                                # Artificially add 1.0 to the cosineSimilarity scores to avoid ES "negative score" error
                                # Later will subtract this 1.0 in dataframe processing
                                "cosineSimilarity(params.query_vector, doc['emb_text_vector']) +1.0",
                            "params": {
                                "query_vector": 
                                    # Calculate with the current ref_text_vec_*
                                    ref_text_vec 
                            }
                        }
                    }
                }
                # Call the "es_client" to calculate the cos_similarity score with enriched data 
                cos_similarity = es_client.search(index="sus_reports_"+str(index_nr), 
                                                  body={"size": 2000, 
                                                        "query": script_query,
                                                            "_source": {
                                                                "includes": ["id", "label", "company", "industry", "country", 
                                                                             "date", "filename", "page", "text_len"]
                                                                }
                                                        })
                # Append result from each index to cos_sim_list to form a list of all results for 1 distance measure for 1 reference vector
                cos_sim_list.append(cos_similarity)
    
            # Put all results into ref_text_vec_dict as values with keys being the distance measure name
            key = "Cosine similarity"
            ref_text_vec_dict[key] = cos_sim_list
        
        
        # ---------------------------------------------------------------------------------------------------------------------
        # Calculate "Euclidean similarity" for all indexes:
        if measure == "Euclidean similarity":    
            # Same logic and steps
            euc_sim_list = [] 
            for index_nr in index_list: 
                # Use "script_query" to calculate the similarity scores on the index level
                script_query = {
                    "script_score": {
                        "query": {
                            "match_all": {}
                        },
                        "script": {
                            "source": 
                                # Use 1/l2norm to invert the Euclidean distance so that closer (more similar) vectors score higher
                                "1/l2norm(params.query_vector, doc['emb_text_vector'])",
                            "params": {
                                "query_vector": 
                                    ref_text_vec 
                            }
                        }
                    }
                }
                # Call the "es_client" to calculate the euc_similarity score with enriched data 
                euc_similarity = es_client.search(index="sus_reports_"+str(index_nr), 
                                                  body={"size": 2000, 
                                                        "query": script_query,
                                                            "_source": {
                                                                "includes": ["id", "label", "company", "industry", "country", 
                                                                             "date", "filename", "page", "text_len"]
                                                                }
                                                        })
                # Append result from each index to euc_sim_list to form a list of all results for 1 distance measure for 1 reference vector
                euc_sim_list.append(euc_similarity)
    
            # Put all results into ref_text_vec_dict as values with keys being the distance measure name
            key = "Euclidean similarity"
            ref_text_vec_dict[key] = euc_sim_list

    # Put all result Dicts into result_dict as values with keys being the reference text name
    result_key = "ref_text_vec_" + str(counter)
    result_dict[result_key] = ref_text_vec_dict
    counter += 1

# Show result stats
print("--- result_dict has {} Dicts corresponding to {} reference vectors".format(len(result_dict), len(ref_text_vec_list)))
for level_1_key, level_1_value in result_dict.items():
    print("--- Reference text vector {}".format(level_1_key), 
          " as a 'key' has {} Dicts corresponding to {} distance measures!".format(len(level_1_value), len(dis_measure_list)))
    for level_2_key, level_2_value in level_1_value.items():
        print("--- Distance measure {}".format(level_2_key), 
              " as a 'key' has {} Lists corresponding to {} indexes!".format(len(level_2_value), len(index_list)))

#【Time of running this cell: 1.5 min】



--- result_dict has 2 Dicts corresponding to 2 reference vectors
--- Reference text vector ref_text_vec_1  as a 'key' has 2 Dicts corresponding to 2 distance measures!
--- Distance measure Cosine similarity  as a 'key' has 94 Lists corresponding to 94 indexes!
--- Distance measure Euclidean similarity  as a 'key' has 94 Lists corresponding to 94 indexes!
--- Reference text vector ref_text_vec_2  as a 'key' has 2 Dicts corresponding to 2 distance measures!
--- Distance measure Cosine similarity  as a 'key' has 94 Lists corresponding to 94 indexes!
--- Distance measure Euclidean similarity  as a 'key' has 94 Lists corresponding to 94 indexes!
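
The two branches in the cell above differ only in the Painless script source, so the query could be built by one small helper. The sketch below is an assumption about how that might look: the function name `build_script_score_query` and its `vector_field` parameter are hypothetical, while the script sources and the `emb_text_vector` field name are taken from the cell above. It only builds the query Dict and does not contact ES.

```python
def build_script_score_query(measure, query_vector, vector_field="emb_text_vector"):
    # Painless script source per measure (same expressions as in the cell above)
    sources = {
        # +1.0 keeps the score non-negative, as required by ES
        "Cosine similarity": "cosineSimilarity(params.query_vector, doc['{f}']) + 1.0",
        # Reciprocal of the L2 distance so that closer vectors score higher
        "Euclidean similarity": "1 / l2norm(params.query_vector, doc['{f}'])",
    }
    if measure not in sources:
        raise ValueError("Unknown measure: {}".format(measure))
    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": sources[measure].format(f=vector_field),
                "params": {"query_vector": list(query_vector)},
            },
        }
    }

# Example with a hypothetical 3-dimensional query vector
example_query = build_script_score_query("Cosine similarity", [0.1, 0.2, 0.3])
print(example_query["script_score"]["script"]["source"])
```

With such a helper, the loop over `dis_measure_list` would need only one search call per index instead of two near-identical branches.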


In [10]:
# Define a function to quickly extract a specific result
def check_bulk_sim_cal(result_dict, ref_text_vec, dis_measures, index_nr, doc_idx):
    
    print("--- The 'result' is of type ",type(result_dict), 
          " with {} ref_text_vec".format(len(result_dict)))
    print("--- The '{}' is of type ".format(ref_text_vec), type(result_dict[ref_text_vec]), 
          " with {} dis_measures".format(len(result_dict[ref_text_vec])))
    print("--- The '{}' is of type ".format(dis_measures), type(result_dict[ref_text_vec][dis_measures]), 
          " with {} indexes".format(len(result_dict[ref_text_vec][dis_measures])))
    print("--- The number of {} results in index '{}' is {}".format(dis_measures,
                                                                  "sus_reports_"+str(index_nr),
                                                                  len(result_dict[ref_text_vec][dis_measures][index_nr-1]["hits"]["hits"])))
    print("--- Have a look at the result of '{}th' doc in this index".format(doc_idx), json.dumps(result_dict[ref_text_vec][dis_measures][index_nr-1]["hits"]["hits"][doc_idx-1], indent=1))


check_bulk_sim_cal(result_dict = result_dict, 
                   ref_text_vec = "ref_text_vec_2", 
                   dis_measures = "Euclidean similarity", 
                   index_nr = 2, 
                   doc_idx = 24)

--- The 'result' is of type  <class 'dict'>  with 2 ref_text_vec
--- The 'ref_text_vec_2' is of type  <class 'dict'>  with 2 dis_measures
--- The 'Euclidean similarity' is of type  <class 'list'>  with 94 indexes
--- The number of Euclidean similarity results in index 'sus_reports_2' is 1034
--- Have a look at the result of '24th' doc in this index {
 "_index": "sus_reports_2",
 "_type": "_doc",
 "_id": "16347.21",
 "_score": 1.4056617,
 "_source": {
  "date": 2021,
  "country": "Greece",
  "text_len": 1204,
  "filename": "16347.pdf",
  "company": "Mytilineos SA",
  "industry": "Industrials",
  "id": "16347.21",
  "label": 1,
  "page": 21
 }
}


In [11]:
# Check again the data structure and types
print("--1-- 'result_dict' has:", result_dict.keys())
print("--2-- 'result_dict' type:", type(result_dict.values()))
print("--3-- 'ref_text_vec_1' has:", result_dict["ref_text_vec_1"].keys())
print("--4-- 'ref_text_vec_1' type:",type(result_dict["ref_text_vec_1"].values()))
print("--5-- 'Cosine similarity' type:", type(result_dict["ref_text_vec_1"]["Cosine similarity"]))
print("--6-- 1st item in 'Cosine similarity' LIST is a Dict:", json.dumps(result_dict["ref_text_vec_1"]["Cosine similarity"][0]["hits"]["hits"][0], indent=1))

--1-- 'result_dict' has: dict_keys(['ref_text_vec_1', 'ref_text_vec_2'])
--2-- 'result_dict' type: <class 'dict_values'>
--3-- 'ref_text_vec_1' has: dict_keys(['Cosine similarity', 'Euclidean similarity'])
--4-- 'ref_text_vec_1' type: <class 'dict_values'>
--5-- 'Cosine similarity' type: <class 'list'>
--6-- 1st item in 'Cosine similarity' LIST is a Dict: {
 "_index": "sus_reports_1",
 "_type": "_doc",
 "_id": "16456.33",
 "_score": 1.8316375,
 "_source": {
  "date": 2021,
  "country": "United States",
  "text_len": 370,
  "filename": "16456.pdf",
  "company": "Dana Inc",
  "industry": "Consumer discretionary",
  "id": "16456.33",
  "label": 1,
  "page": 33
 }
}


#### 5.4. Load the Data and Similarity Score into a Dataframe

In [12]:
#【Bulk process】ES calculation results into a dataframe

# Create an empty dataframe 
df_USE = pd.DataFrame(columns=["index", "id", "label", "company", "industry", "country", "date", "filename", "page", "text_len"])

# Use a counter as a helper
counter = 1 

# Iterate through the 'result_dict', with keys being "ref_text_vec_*" and value being Dicts
for rtv_key, rtv_value in result_dict.items():  
    
    # Append the full doc info only for the 1st reference vector; later vectors only add score columns
    if counter == 1:  
        
        # Iterate through the "ref_text_vec_1" Dict, with keys being the similarity measure name and values being Lists of ES results
        for dis_measure, result_list in rtv_value.items(): 
        
            # Detect "similarity measure name" for unique column creation in the next line
            if dis_measure == "Cosine similarity":  
                column_name = "ref_" + str(counter) + "_cos_Score"
                
                # Iterate through the "Cosine similarity" List, with each item being a Dict of ES result data for 1 index
                for result in result_list: 
                    
                    # Iterate through the ES result List, with each item being a Dict of data for 1 doc/page
                    for res in result["hits"]["hits"]:
                        
                        # Append the doc/page-level data into the dataframe "df_USE"
                        df_USE = df_USE.append({"index": res["_index"],  # doc-level
                                                "id": res["_id"],
                                                "label": res["_source"]["label"],  # doc["source"]-level
                                                "company": res["_source"]["company"], 
                                                "industry": res["_source"]["industry"], 
                                                "country": res["_source"]["country"], 
                                                "date": res["_source"]["date"], 
                                                "filename": res["_source"]["filename"], 
                                                "page": res["_source"]["page"], 
                                                "text_len": res["_source"]["text_len"], 
                                                column_name: res["_score"]-1}, # Minus 1 that was added during ES calculation
                                                ignore_index=True)
            
            else:
                # Create unique column name for reference text vector 1 and Euclidean similarity scores
                column_name = "ref_" + str(counter) + "_euc_Score"
                # Iterate through the "Euclidean similarity" List, with each item being a Dict of ES result data for 1 index
                for result in result_list: 
                    # Iterate through the ES result List, with each item being a Dict of data for 1 doc/page
                    for res in result["hits"]["hits"]: 
                        # Find the matching id (string)
                        match_id = res["_id"]  
                        # Add the column and the Euclidean similarity score for each matching id
                        df_USE.loc[df_USE["id"] == match_id, column_name] = res["_score"]-1 
                        # Subtract 1 for easier interpretation of the inverted Euclidean similarity scores

            # Print progress info
            print("--- {} + {} is loaded into dataframe as a new column!".format(rtv_key, dis_measure))
    
    
    else:
        # Iterate through the remaining "ref_text_vec_*" Dicts (counter > 1)
        # The rest of the logic is the same as above, except that only score columns are added
        for dis_measure, result_list in rtv_value.items():
            if dis_measure == "Cosine similarity":
                column_name = "ref_" + str(counter) + "_cos_Score"
                for result in result_list: 
                    for res in result["hits"]["hits"]: 
                        match_id = res["_id"] 
                        df_USE.loc[df_USE["id"] == match_id, column_name] = res["_score"]-1
            else:
                column_name = "ref_" + str(counter) + "_euc_Score"
                for result in result_list: 
                    for res in result["hits"]["hits"]: 
                        match_id = res["_id"] 
                        df_USE.loc[df_USE["id"] == match_id, column_name] = res["_score"]-1
            # Print progress info
            print("--- {} + {} is loaded into dataframe as a new column!".format(rtv_key, dis_measure))
    
    # Increase counter by 1
    counter += 1

# Display the dataframe
df_USE

#【Time of running this cell: 50 min】


--- ref_text_vec_1 + Cosine similarity is loaded into dataframe as a new column!
--- ref_text_vec_1 + Euclidean similarity is loaded into dataframe as a new column!
--- ref_text_vec_2 + Cosine similarity is loaded into dataframe as a new column!
--- ref_text_vec_2 + Euclidean similarity is loaded into dataframe as a new column!


Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
0,sus_reports_1,16456.33,1,Dana Inc,Consumer discretionary,United States,2021,16456.pdf,33,370,0.831638,0.723306,0.788395,0.537169
1,sus_reports_1,16456.3,1,Dana Inc,Consumer discretionary,United States,2021,16456.pdf,3,301,0.789938,0.542806,0.786803,0.531420
2,sus_reports_1,16357.119,1,CHN Energy New Energy Co Ltd,Energy,China,2021,16357.pdf,119,254,0.789372,0.540731,0.684293,0.258470
3,sus_reports_1,16447.4,1,ZF Finance GmbH,Consumer discretionary,Germany,2021,16447.pdf,4,354,0.786902,0.073286,0.589408,0.103519
4,sus_reports_1,16456.34,1,Dana Inc,Consumer discretionary,United States,2021,16456.pdf,34,441,0.785670,0.527369,0.815695,0.647086
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93863,sus_reports_94,15241.5,0,Lotte Global Logistics Co Ltd,Industrials,South Korea,2021,15241_2020.pdf,5,29,0.260075,-0.177964,0.309389,-0.149120
93864,sus_reports_94,15241.9,0,Lotte Global Logistics Co Ltd,Industrials,South Korea,2021,15241_2020.pdf,9,41,0.243503,-0.187017,0.342438,-0.128000
93865,sus_reports_94,15241.65,0,Lotte Global Logistics Co Ltd,Industrials,South Korea,2021,15241_2020.pdf,65,9,0.145189,-0.235196,0.249563,-0.183741
93866,sus_reports_94,15241.1,0,Lotte Global Logistics Co Ltd,Industrials,South Korea,2021,15241_2020.pdf,1,1,0.000104,-0.292857,0.037749,-0.279157
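
The cell above takes about 50 minutes, largely because it grows the dataframe row by row and runs a `.loc` lookup for every score. Note also that `DataFrame.append` was removed in pandas 2.0. A possible alternative, sketched below under stated assumptions, collects every doc into a plain Dict first. The helper names `collect_rows` and `fake_response` and the toy data are hypothetical. Unlike the cell above, this sketch keys rows on the `(index, id)` pair rather than on `id` alone, because ids such as `463.16` appear under more than one index later in this notebook.

```python
def collect_rows(result_dict):
    # Short column suffix per similarity measure, matching the df_USE column names
    suffix = {"Cosine similarity": "cos", "Euclidean similarity": "euc"}
    rows = {}
    for ref_nr, measures in enumerate(result_dict.values(), start=1):
        for measure, responses in measures.items():
            column = "ref_{}_{}_Score".format(ref_nr, suffix[measure])
            for response in responses:
                for hit in response["hits"]["hits"]:
                    # Assumption: (index, id) identifies one doc/page
                    key = (hit["_index"], hit["_id"])
                    row = rows.setdefault(
                        key, {**hit["_source"], "index": hit["_index"], "id": hit["_id"]}
                    )
                    # Same "-1" shift as in the cell above
                    row[column] = hit["_score"] - 1
    return list(rows.values())

def fake_response(index, doc_id, score):
    # Hypothetical stand-in for one ES search response with a single hit
    return {"hits": {"hits": [{"_index": index, "_id": doc_id, "_score": score,
                               "_source": {"id": doc_id, "company": "Example Co"}}]}}

toy_result_dict = {
    "ref_text_vec_1": {
        "Cosine similarity": [fake_response("sus_reports_1", "1.1", 1.8)],
        "Euclidean similarity": [fake_response("sus_reports_1", "1.1", 1.5)],
    }
}
rows = collect_rows(toy_result_dict)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then build the dataframe in a single step. The result may differ slightly from `df_USE` wherever the same id occurs in several indexes.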


#### 5.5. Sanity Check of the calculated scores

In [18]:
# Search for all docs to check the total number of docs/pages,
# then compare it with the row count of the df_USE dataframe to see how many docs/pages have been lost
search_query = {
    "size": 100000, 
    "query": {
        "match_all": {}
        },
    "_source": ["id","label", "company", "industry", "country", "date","filename", "page", "text_len"] 
    # Need to specify a few fields to limit the size of data retrieved to avoid ES crash
}
result = es_client.search(index="sus_reports_*", body=search_query, request_timeout=1000)

Total_page_count = len(result["hits"]["hits"])
print("--- Total number of docs/pages retrieved: ", Total_page_count)
print("--- The number of docs/pages lost during the process: ", Total_page_count - df_USE.shape[0])



--- Total number of docs/pages retrieved:  93950
--- The number of docs/pages lost during the process:  82
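
The count above says how many docs/pages are missing but not which ones. A minimal sketch of one way to find them follows. The helper names and the toy hits are hypothetical; in this notebook the hits would come from `result["hits"]["hits"]` and the loaded keys from the `index` and `id` columns of `df_USE`.

```python
from collections import Counter

def find_id_collisions(hits):
    # Ids that occur more than once across all hits (e.g. in several indexes)
    counts = Counter(hit["_id"] for hit in hits)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}

def find_missing(hits, loaded_keys):
    # (index, id) pairs returned by ES but absent from the loaded data
    return [(hit["_index"], hit["_id"]) for hit in hits
            if (hit["_index"], hit["_id"]) not in loaded_keys]

# Hypothetical hits and loaded keys for illustration only
toy_hits = [
    {"_index": "sus_reports_1", "_id": "10.1"},
    {"_index": "sus_reports_2", "_id": "10.1"},
    {"_index": "sus_reports_2", "_id": "20.5"},
]
toy_loaded = {("sus_reports_1", "10.1"), ("sus_reports_2", "10.1")}

print("colliding ids:", find_id_collisions(toy_hits))
print("missing docs:", find_missing(toy_hits, toy_loaded))
```

Listing the missing pairs makes it possible to check whether the loss comes from a particular index or from ids shared between indexes.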


In [66]:
# Define a Sanity Check Function:
# Compare a specific score from ES result with the row from 'df_USE' with the same id
def check_df_score_match(result_dict, ref_text_vec, dis_measures, index_name_str, doc_id):
    if len(result_dict) == 0:
        print("result_dict is empty!!!")
    else:
        for result in result_dict[ref_text_vec][dis_measures]:
            for res in result["hits"]["hits"]:
                if res["_index"] == index_name_str and res["_id"] == str(doc_id):
                    score = res["_score"] -1
                    print("--- For embedding '{}', the '{}' score of doc '{}' from index '{}'  is {}!".format(ref_text_vec,
                                                                                                    dis_measures,
                                                                                                    doc_id,
                                                                                                    index_name_str,
                                                                                                    score),
                            "\n--- Check this score with the row with the same id from the 'df_USE' dataframe below: ")
                    display(df_USE[df_USE["id"] == str(doc_id)])


# Call the function and check the results
check_df_score_match(result_dict = result_dict, 
                     ref_text_vec = "ref_text_vec_2", 
                     dis_measures = "Euclidean similarity", 
                     index_name_str = "sus_reports_45", 
                     doc_id = 3975.43)

--- For embedding 'ref_text_vec_2', the 'Euclidean similarity' score of doc '3975.43' from index 'sus_reports_45'  is 0.7971127! 
--- Check this score with the row with the same id from the 'df_USE' dataframe below: 


Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
43522,sus_reports_45,3975.43,0,Public Service Co of Oklahoma,Utilities,United States,2021,3975_2020.pdf,43,333,0.892594,1.157597,0.845183,0.797113


In [21]:
# Save dataframe into a CSV file
df_USE.to_csv('ES_df_to_csv/ES_df_to_csv.csv')  


#### 5.6. Compare high/low-score Pages with Real Texts

In [27]:
df_high_ref_1 = df_USE[(df_USE["ref_1_cos_Score"] >= 0.5) & (df_USE["ref_1_euc_Score"] >= 0.5)]
df_high_ref_1.sort_values(["ref_1_cos_Score", "ref_1_euc_Score"], ascending=False).head()

Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
83036,sus_reports_82,482.28,0,Whirlpool Corp,Consumer discretionary,United States,2021,482_2020.pdf,28,363,0.900901,1.246213,0.798832,0.576542
43522,sus_reports_45,3975.43,0,Public Service Co of Oklahoma,Utilities,United States,2021,3975_2020.pdf,43,333,0.892594,1.157597,0.845183,0.797113
2706,sus_reports_4,4040.2,1,Solarfield Energy Pvt Ltd,Utilities,India,2021,4040.pdf,20,542,0.890549,1.137347,0.868813,0.952268
13028,sus_reports_16,463.16,1,Pacific Life Global Funding II,Financials,United States,2021,463.pdf,16,384,0.879648,1.038257,0.773007,0.484153
47689,sus_reports_49,463.16,0,Pacific Life Global Funding II,Financials,United States,2021,463_2019.pdf,16,384,0.879648,1.038257,0.773007,0.484153


In [29]:
df_high_ref_2 = df_USE[(df_USE["ref_2_cos_Score"] >= 0.5) & (df_USE["ref_2_euc_Score"] >= 0.5)]
df_high_ref_2.sort_values(["ref_2_cos_Score", "ref_2_euc_Score"], ascending=False).head()

Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
32767,sus_reports_34,3295.108,1,Ryobi Ltd,Industrials,Japan,2020,3295.pdf,108,318,0.398553,0.656067,0.89929,1.228174
64395,sus_reports_65,3295.108,0,Ryobi Ltd,Industrials,Japan,2020,3295_2019.pdf,108,479,0.817688,0.656067,0.89929,1.228174
8127,sus_reports_11,3987.42,1,Difer Enerji Sanayi Ve Ticaret AS,Energy,Turkey,2021,3987.pdf,42,384,0.859979,0.889681,0.897437,1.207953
59582,sus_reports_61,3469.7,0,Norther SA,Energy,Belgium,2020,3469_2019.pdf,70,501,0.746299,0.40386,0.890862,1.140408
12048,sus_reports_15,14883.42,1,BKK AS,Utilities,Norway,2020,14883.pdf,42,258,0.809221,0.618898,0.887261,1.10595


In [68]:
df_low_ref_1 = df_USE[(df_USE["ref_1_cos_Score"] <= 0.5) 
                      & (df_USE["ref_1_euc_Score"] <= 0.5)
                      & (df_USE["text_len"] >= 100)]
df_low_ref_1.sort_values(["ref_1_cos_Score", "ref_1_euc_Score"], ascending=True).head()

Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
85164,sus_reports_84,16276.51,0,China Jushi Co Ltd,Materials,China,2021,16276_2020.pdf,51,181,-0.020826,-0.300143,0.065289,-0.268614
15083,sus_reports_17,25.32,1,NRG Energy Inc,Utilities,United States,2020,25.pdf,32,116,0.087009,0.208873,0.717299,0.329906
8815,sus_reports_11,391.81,1,Alibaba Group Holding Ltd,Consumer discretionary,China,2021,391.pdf,81,110,0.10462,-0.252724,0.21076,-0.204059
53649,sus_reports_54,20.189,0,Arcadis NV,Industrials,Netherlands,2020,20_2019.pdf,189,588,0.117274,-0.247386,0.130189,-0.24182
75056,sus_reports_74,9521.204,0,Tibagi Energia SPE S/A,Utilities,Brazil,2019,9521_2018.pdf,204,102,0.118522,-0.246854,0.156162,-0.23024


In [69]:
df_low_ref_2 = df_USE[(df_USE["ref_2_cos_Score"] <= 0.5) 
                      & (df_USE["ref_2_euc_Score"] <= 0.5)
                      & (df_USE["text_len"] >= 100)]
df_low_ref_2.sort_values(["ref_2_cos_Score", "ref_2_euc_Score"], ascending=True).head()

Unnamed: 0,index,id,label,company,industry,country,date,filename,page,text_len,ref_1_cos_Score,ref_1_euc_Score,ref_2_cos_Score,ref_2_euc_Score
17107,sus_reports_19,447.1,1,Vienna Insurance Group AG Wiener Versicherung ...,Financials,Austria,2021,447.pdf,100,175,0.649899,-0.315097,-0.06509,-0.31484
33022,sus_reports_35,3211.2,1,Green Tower III GmbH & Co KG,Energy,Germany,2019,3211.pdf,20,161,0.703424,-0.312788,-0.051657,-0.310478
40080,sus_reports_41,10155.13,1,Greenko Solar Mauritius Ltd,Energy,India,2019,10155.pdf,13,268,0.603834,-0.310782,-0.034855,-0.304903
5578,sus_reports_7,588.38,1,Quimper AB,Industrials,Sweden,2021,588.pdf,38,321,0.591434,-0.30583,-0.033656,-0.3045
33149,sus_reports_35,3211.16,1,Green Tower III GmbH & Co KG,Energy,Germany,2019,3211.pdf,16,370,0.631765,-0.316871,-0.033183,-0.304341
