Check the percentage of time-consuming queries (TEST)
====
This notebook checks whether the percentage of queries with high completion times (longer than **nsec** seconds), computed over time bins of a few minutes (**interval**), exceeds a given value (**percentlimit**) in any bin during the scanned period (the last **nhours** hours). If it does, it sends mail to everyone subscribed to that alert. It is meant to run every half hour from a cron job (not set up yet).
In this way we can detect spikes that tend to cause server malfunctions.

In [115]:
from subscribers import subscribers
import alerts
import es_query

import datetime
import re
import json
import sys
from elasticsearch import Elasticsearch, exceptions as es_exceptions
from elasticsearch.helpers import scan

### Variables for this script:
1. Maximum allowed percentage of queries taking more than **nsec** seconds (10 s here), relative to the total number of queries. The alert goes off when it is exceeded
2. Time interval to calculate the percentage
3. Time period for the scan

In [116]:
# Percentage of queries taking > 10s
percentlimit=10
# Time limit in seconds (defines 'high' completion times)
nsec=10
# Testing interval in minutes
interval="3m"
# Time period to scan from now backwards
nhours=1

### Get starting and current time for query interval 

We need :
1. Current UTC time (as set in timestamp on ES DB)
2. Previous date stamp (**nhours** ago) obtained from a time delta

In order to subtract the time difference we need **ct** to be a datetime object

In [117]:
# Get current UTC time (as set in timestamp on ES DB)
# In order to subtract the time difference we need ct to be a datetime object

# Following 2 lines are for testing purposes only
#curtime = '20170126T120000.000Z'
#ct = datetime.datetime.strptime(curtime, "%Y%m%dT%H%M%S.%fZ")

ct = datetime.datetime.utcnow()
ind = 'frontier-new-%d-%02d' % (ct.year, ct.month)
print('INDEX: ',ind)
curtime = ct.strftime('%Y%m%dT%H%M%S.%f')[:-3]+'Z'

td = datetime.timedelta(hours=nhours)
st = ct - td
starttime = st.strftime('%Y%m%dT%H%M%S.%f')[:-3]+'Z'

print('start time', starttime)
print('current time',curtime)


INDEX:  frontier-new-2017-11
start time 20171109T112253.338Z
current time 20171109T122253.338Z
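
The window computation above can be wrapped in a small function so that it can be tested with a fixed time. This is only a sketch: `query_window` is a hypothetical helper, not part of the notebook. It also illustrates one assumption worth checking: when the scan period starts in the previous month, the documents from that month live in a different monthly index, so the sketch returns the index names for both ends of the window (it assumes the window never spans more than two months). Elasticsearch accepts several index names joined by commas, e.g. `index=','.join(indices)`.

```python
import datetime

def query_window(now, nhours):
    """Return (starttime, curtime, indices) for a window of nhours ending at now."""
    fmt = '%Y%m%dT%H%M%S.%f'
    start = now - datetime.timedelta(hours=nhours)
    # %f gives microseconds; dropping the last three digits leaves milliseconds,
    # which matches the "basic_date_time" format used in the query
    starttime = start.strftime(fmt)[:-3] + 'Z'
    curtime = now.strftime(fmt)[:-3] + 'Z'
    # Monthly index of each end of the window (a single name if they coincide)
    indices = []
    for t in (start, now):
        name = 'frontier-new-%d-%02d' % (t.year, t.month)
        if name not in indices:
            indices.append(name)
    return starttime, curtime, indices

# Fixed time, as in the commented-out test lines of the cell above
same_month = query_window(datetime.datetime(2017, 1, 26, 12, 0, 0), 1)
# A window that crosses a month boundary
two_months = query_window(datetime.datetime(2017, 2, 1, 0, 30, 0), 1)
```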


### Establish connection to ES-DB and submit query

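Query the ES-DB for the number of queries that took less than and more than **nsec** seconds, per **interval** time bin and per Frontier server. For each server we keep the highest percentage of slow queries seen in any bin, together with the timestamp of that bin.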

In [118]:
es = Elasticsearch(hosts=[{'host':'atlas-kibana.mwt2.org', 'port':9200}],timeout=60)

condition='rejected:true OR disconn:true OR procerror:true'

my_query={
   "size": 0,
   "query": {
      "range": {
         "@timestamp": {
            "gte": starttime,
            "lte": curtime,
            "format": "basic_date_time"
         }
      }
   },
   "aggs": {
     "dhist": {
       "date_histogram": {
         "field": "@timestamp",
         "interval": interval,
         "time_zone": "UTC",
         "min_doc_count": 1
       },
       "aggs": {
         "frserver": {
           "terms": {
             "field": "frontierserver",
             "size": 20,
             "order": {
               "_term": "asc"
             }
           },
           "aggs": {
             "amount": {
               "range": {
                 "field": "querytime",
                 "ranges": [
                   {
                     "from": 0,
                     "to": nsec*1000
                   },
                   {
                     "from": nsec*1000,
                     "to": 100000000
                   }
                 ]
               }
             }
           }
         }
       }
     }
   }
}

res = es.search(index=ind, body=my_query, request_timeout=600)

# For each server keep (highest percentage of slow queries, time bin where it occurred)
frontierservers = {}
for tbucket in res['aggregations']['dhist']['buckets']:
   tim = tbucket['key_as_string']
   for sbucket in tbucket['frserver']['buckets']:
      frs = sbucket['key']
      # First range bucket: queries faster than nsec; second: nsec or slower
      low = sbucket['amount']['buckets'][0]['doc_count']
      high = sbucket['amount']['buckets'][1]['doc_count']
      if high + low == 0:
         # No document fell into either range; skip to avoid dividing by zero
         continue
      perc = 100. * high / (high + low)
      if frs not in frontierservers or perc > frontierservers[frs][0]:
         frontierservers[frs] = (perc, tim)

print(frontierservers)


{'frontier-atlas2.lcg.triumf.ca': (0.1422475106685633, '2017-11-09T11:24:00.000Z'), 'aiatlas149.cern.ch': (0.03963535473642489, '2017-11-09T11:57:00.000Z'), 'aiatlas036.cern.ch': (0.0, '2017-11-09T11:21:00.000Z'), 'ccosvms0014': (0.0, '2017-11-09T11:21:00.000Z'), 'ccosvms0013.in2p3.fr': (0.0, '2017-11-09T11:21:00.000Z'), 'aiatlas147.cern.ch': (0.0, '2017-11-09T11:21:00.000Z'), 'aiatlas146.cern.ch': (0.14556040756914118, '2017-11-09T11:39:00.000Z'), 'ccosvms0012.in2p3.fr': (0.0, '2017-11-09T11:21:00.000Z'), 'aiatlas037.cern.ch': (0.0, '2017-11-09T11:21:00.000Z'), 'frontier-atlas1.lcg.triumf.ca': (0.12804097311139565, '2017-11-09T11:54:00.000Z'), 'aiatlas038.cern.ch': (0.0, '2017-11-09T11:21:00.000Z'), 'ccsvli200': (0.0, '2017-11-09T11:21:00.000Z'), 'aiatlas073.cern.ch': (0.12738853503184713, '2017-11-09T11:51:00.000Z'), 'frontier-atlas3.lcg.triumf.ca': (0.11961722488038277, '2017-11-09T11:48:00.000Z'), 'aiatlas148.cern.ch': (0.0, '2017-11-09T11:21:00.000Z')}
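
To make the reduction from the nested aggregation to one `(percentage, time)` pair per server easier to follow, here is a minimal sketch that runs on a hand-written response instead of a live ES-DB. The function name `max_slow_percentage`, the server names and all counts are made up for illustration; only the nesting (`dhist` > `frserver` > `amount`) is taken from the query above.

```python
def max_slow_percentage(res):
    """Return {server: (highest slow-query percentage, time bin)}."""
    worst = {}
    for tbucket in res['aggregations']['dhist']['buckets']:
        tim = tbucket['key_as_string']
        for sbucket in tbucket['frserver']['buckets']:
            low = sbucket['amount']['buckets'][0]['doc_count']
            high = sbucket['amount']['buckets'][1]['doc_count']
            if low + high == 0:
                continue  # nothing counted in this bin for this server
            perc = 100.0 * high / (low + high)
            server = sbucket['key']
            if server not in worst or perc > worst[server][0]:
                worst[server] = (perc, tim)
    return worst

def _server(name, low, high):
    # Build one hypothetical "frserver" bucket with its two range counts
    return {'key': name,
            'amount': {'buckets': [{'doc_count': low}, {'doc_count': high}]}}

# Hypothetical response with two time bins and two servers
mock = {'aggregations': {'dhist': {'buckets': [
    {'key_as_string': '2017-01-01T00:00:00.000Z',
     'frserver': {'buckets': [_server('srv-a', 90, 10), _server('srv-b', 50, 0)]}},
    {'key_as_string': '2017-01-01T00:03:00.000Z',
     'frserver': {'buckets': [_server('srv-a', 75, 25), _server('srv-b', 0, 0)]}},
]}}}

worst = max_slow_percentage(mock)
```

Here `srv-a` goes from 10% in the first bin to 25% in the second, so the second bin is kept; `srv-b` has no counted queries in the second bin, so its entry stays at the 0% of the first bin.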


### Submit an alert if any server had a percentage of slow queries above the established limit

For every server whose highest percentage of slow queries in any **interval** bin exceeded **percentlimit**, report the server name, that percentage and the time bin in which it occurred.

In [111]:
percmat={}
for frsrvr in frontierservers:
   if frontierservers[frsrvr][0] > percentlimit:
      percmat[frsrvr] = ("%3.2f"% (frontierservers[frsrvr][0]),frontierservers[frsrvr][1])
        
print(percmat)

{'frontier-atlas2.lcg.triumf.ca': ('1.03', '2017-11-08T09:15:00.000Z')}
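
The selection step can be sketched as a small function, which makes the comparison explicit: the test is strict, so a server sitting exactly at the limit is not reported. `servers_over_limit` is a hypothetical name and the values below are made up for illustration.

```python
def servers_over_limit(worst, percentlimit):
    """Keep servers whose worst-bin percentage is strictly above the limit."""
    return {
        server: ('%3.2f' % perc, tim)
        for server, (perc, tim) in worst.items()
        if perc > percentlimit
    }

# Made-up values: one server above the limit, one exactly at it
worst = {
    'srv-a.example.org': (12.3456, '2017-01-01T00:03:00.000Z'),
    'srv-b.example.org': (10.0, '2017-01-01T00:06:00.000Z'),
}
flagged = servers_over_limit(worst, 10)
```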


In [114]:
if len(percmat) > 0:
    S = subscribers()
    A = alerts.alerts()

    test_name = 'Failed queries'
    users =  S.get_immediate_subscribers(test_name)
    for user in users:
        body = 'Dear ' + user.name +',\n\n'
        body += '\tthis mail is to let you know that the percentage of long time queries (>'
        body += str(nsec)+'s) is\n\n'
        for fkey in percmat:
          body += fkey
          body += ' : '
          body += str(percmat[fkey][0]) + '%'
          body += ' on ' + percmat[fkey][1] + ' UTC time\n'
        body += '\nBest regards,\nATLAS AAS'
        body += '\n\n To change your alert preferences please use the following link:\n' + user.link

        A.sendMail(test_name, user.email, body)
##        A.addAlert(test_name, user.name, str(res_page))


1
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Failed queries
From: AAAS@mwt2.org
To: ilijav@gmail.com

Dear Ilija Vukotic,

	this mail is to let you know that the percentage of long time queries (>10s) is

frontier-atlas2.lcg.triumf.ca : 1.03% on 2017-11-08T09:15:00.000Z UTC time

Best regards,
ATLAS AAS

 To change your alerts preferences please you the following link:
https://docs.google.com/forms/d/e/1FAIpQLSeedRVj0RPRadEt8eGobDeneix_vNxUkqbtdNg7rGMNOrpcug/viewform?edit2=2_ABaOnufrzSAOPoVDl6wcXDnQKk0EfkQRmlxj04nw9npJrTAK5BZPijqoLhg
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Failed queries
From: AAAS@mwt2.org
To: julio.lozano.bahilo@cern.ch

Dear Julio Lozano Bahilo,

	this mail is to let you know that the percentage of long time queries (>10s) is

frontier-atlas2.lcg.triumf.ca : 1.03% on 2017-11-08T09:15:00.000Z UTC time

Best regards,
ATLAS AAS

 To change your a