Checks number of failed queries (rejected/unprocessed queries and DB disconnections) (TEST)
====
This notebook checks whether any queries failed in one of three ways:
- Rejected queries: the server was busy and did not respond to the query
- DB disconnections: the query was processed by the Frontier server, but the Oracle DB terminated the connection
- Unprocessed queries: the Oracle DB returned data, but it was not sent to the querying job

It sends mail to everyone subscribed to this alert. It is meant to run every half hour from a cron job (not yet set up).

In [1]:
from subscribers import subscribers
import alerts
import es_query

import datetime
import re
import json
import sys
from elasticsearch import Elasticsearch, exceptions as es_exceptions
from elasticsearch.helpers import scan

# Period to check from now backwards
nhours=24

In [2]:
# Following 2 lines are for testing purposes only
#curtime = '20170126T120000.000Z'
#ct = datetime.datetime.strptime(curtime, "%Y%m%dT%H%M%S.%fZ")

### Get starting and current time for query interval

We need:
1. The current UTC time (matching the timestamp stored in the ES DB)
2. The start of the interval (**nhours** ago), obtained by subtracting a time delta

To subtract the time difference, **ct** must be a datetime object.

In [3]:
ct = datetime.datetime.utcnow()
ind = 'frontier-new-%d-%02d' % (ct.year, ct.month)
print(ind)
curtime = ct.strftime('%Y%m%dT%H%M%S.%f')[:-3]+'Z'

td = datetime.timedelta(hours=nhours)
st = ct - td
starttime = st.strftime('%Y%m%dT%H%M%S.%f')[:-3]+'Z'

print('start time', starttime)
print('current time',curtime)

frontier-new-2017-11
start time 20171105T163733.811Z
current time 20171106T163733.811Z
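
The cell above builds the index name from the current month only, so a window that starts in the previous month would miss that month's index. The following is a minimal sketch of one way to handle this, not the notebook's own method: the helper names `basic_date_time`, `monthly_indices` and `query_window` are hypothetical, and the `frontier-new-YYYY-MM` pattern is taken from the cell above. It also uses a timezone-aware `datetime.now(datetime.timezone.utc)`, since `datetime.utcnow()` is deprecated from Python 3.12 onwards.

```python
import datetime


def basic_date_time(dt):
    """Format a datetime like the notebook does: yyyyMMddTHHmmss.SSSZ."""
    return dt.strftime('%Y%m%dT%H%M%S.%f')[:-3] + 'Z'


def monthly_indices(start, end, prefix='frontier-new'):
    """List the monthly index names touched by the interval [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append('%s-%d-%02d' % (prefix, year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def query_window(now, nhours):
    """Return (starttime, curtime, index names) for the last nhours hours."""
    start = now - datetime.timedelta(hours=nhours)
    return basic_date_time(start), basic_date_time(now), monthly_indices(start, now)


# Timezone-aware current time in UTC
now = datetime.datetime.now(datetime.timezone.utc)
starttime, curtime, indices = query_window(now, 24)
print('start time', starttime)
print('current time', curtime)
print('indices', ','.join(indices))
```

Elasticsearch accepts a comma-separated list of index names in a search request, so the joined string could be passed as `index`. Whether every listed index exists on the cluster is an assumption that should be checked before relying on this.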


### Establish connection to ES-DB and submit query

Query the ES-DB for documents that record failed queries.

In [6]:
es = Elasticsearch(hosts=[{'host':'atlas-kibana.mwt2.org', 'port':9200}],timeout=60)

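# Combined failure condition, kept for reference; the aggregation below filters on each term separately
condition='rejected:true OR disconn:true OR procerror:true'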

my_query={
   "size": 0,
   "query": {
       "range": {
          "@timestamp": {
             "gte": starttime,
             "lte": curtime,
             "format": "basic_date_time"
          }
       }
   },
   "aggs" : {
      "servers": {
         "terms" : {
             "size" : 20,
             "field" : "frontierserver"
         },
         "aggs" : {
            "unserved": {
               "filters": {
                  "filters": {
                     "rejected" : {
                        "query_string": {
                           "query": "rejected:true"
                        }
                     },
                     "disconnect" : {
                        "query_string": {
                           "query": "disconn:true"
                        }
                     },
                     "procerror" : {
                        "query_string": {
                           "query": "procerror:true"
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

res = es.search(index=ind, body=my_query, request_timeout=600)

#print(res)

# Build a per-server summary of failed queries; servers without failures are skipped
frontiersrvr = {}
res = res['aggregations']['servers']['buckets']
for r in res:
    ub = r['unserved']['buckets']
    rej = ub['rejected']['doc_count']
    dis = ub['disconnect']['doc_count']
    pre = ub['procerror']['doc_count']
    if rej + dis + pre == 0:
        continue
    mes = ''
    if rej > 0:
        mes += str(rej) + " rejected\t"
    if dis > 0:
        mes += str(dis) + " disconnected\t"
    if pre > 0:
        mes += str(pre) + " unprocessed "
    frontiersrvr[r['key']] = mes + 'queries.'

print('problematic servers:', frontiersrvr)


problematic servers: {'aiatlas146.cern.ch': '4 unprocessed queries.', 'aiatlas073.cern.ch': '4 unprocessed queries.', 'aiatlas147.cern.ch': '4 unprocessed queries.', 'ccosvms0013.in2p3.fr': '2 unprocessed queries.', 'aiatlas036.cern.ch': '4 unprocessed queries.', 'ccosvms0014': '3 unprocessed queries.', 'ccsvli200': '1 unprocessed queries.', 'ccosvms0012.in2p3.fr': '2 unprocessed queries.'}
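
The bucket parsing above can be tried without a live cluster. The sketch below is an illustration under stated assumptions: `summarize_failures` is a hypothetical helper, the sample data is hand-written to mimic the shape of `res['aggregations']['servers']['buckets']` (a named `filters` aggregation returns its buckets as a dict keyed by filter name), and the hostnames and counts are made up. Unlike the cell above, it joins the parts with commas rather than tabs.

```python
def summarize_failures(buckets):
    """Map each server with failures to a message such as '4 unprocessed queries.'"""
    # (filter name in the aggregation, word used in the message)
    labels = (
        ('rejected', 'rejected'),
        ('disconnect', 'disconnected'),
        ('procerror', 'unprocessed'),
    )
    summary = {}
    for bucket in buckets:
        counts = bucket['unserved']['buckets']
        parts = [
            '%d %s' % (counts[key]['doc_count'], label)
            for key, label in labels
            if counts[key]['doc_count'] > 0
        ]
        if parts:  # skip servers with no failures at all
            summary[bucket['key']] = ', '.join(parts) + ' queries.'
    return summary


# Made-up sample in the shape of the aggregation response
sample = [
    {'key': 'server-a.example.org', 'doc_count': 100, 'unserved': {'buckets': {
        'rejected': {'doc_count': 2},
        'disconnect': {'doc_count': 0},
        'procerror': {'doc_count': 4}}}},
    {'key': 'server-b.example.org', 'doc_count': 50, 'unserved': {'buckets': {
        'rejected': {'doc_count': 0},
        'disconnect': {'doc_count': 0},
        'procerror': {'doc_count': 0}}}},
]

failures = summarize_failures(sample)
print('problematic servers:', failures)
```

Keeping the parsing in a pure function like this makes it possible to check the message format against a fixed sample before the notebook is scheduled.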


### Any non-zero value for any Frontier server triggers the alert

The alert lists every Frontier server with failed queries and the kinds of failure that occurred.
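
The message assembly can be separated from the sending step so that the text is easy to inspect. This is a minimal sketch, assuming a hypothetical helper `build_alert_body`; the subscriber name, link and server entry are placeholders, and the wording follows the mail text used in this notebook.

```python
def build_alert_body(name, link, nhours, failures):
    """Assemble the plain-text alert mail for one subscriber."""
    lines = [
        'Dear %s,' % name,
        '',
        '\tthis mail is to let you know that the following servers present '
        'failed queries in the past %d hours:' % nhours,
        '\t(attached numbers correspond to rejected, disconnected and unprocessed queries)',
        '',
    ]
    # One line per problematic server, e.g. "host : 4 unprocessed queries."
    lines += ['%s : %s' % (server, message) for server, message in failures.items()]
    lines += [
        '',
        'Best regards,',
        'ATLAS AAS',
        '',
        'To change your alerts preferences please use the following link:',
        link,
    ]
    return '\n'.join(lines)


# Placeholder subscriber and failure data, for illustration only
body = build_alert_body(
    'Jane Doe',
    'https://example.org/preferences',
    24,
    {'server-a.example.org': '4 unprocessed queries.'},
)
print(body)
```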

In [8]:
if len(frontiersrvr) > 0:
    S = subscribers()
    A = alerts.alerts()

    test_name = 'Failed queries'
    users =  S.get_immediate_subscribers(test_name)
    for user in users:
        body = 'Dear ' + user.name +',\n\n'
        body += '\tthis mail is to let you know that the following servers present failed queries in the past ' 
        body += str(nhours)+' hours: \n'
        body += '\t(attached numbers correspond to rejected, disconnected and unprocessed queries) \n\n'
        for fkey in frontiersrvr:
            body += fkey
            body += ' : '
            body += frontiersrvr[fkey]
            body += '\n'
        body += '\nBest regards,\nATLAS AAS'
        body += '\n\n To change your alerts preferences please use the following link:\n' + user.link
        A.sendMail(test_name, user.email, body)
##        A.addAlert(test_name, user.name, str(res_page))


Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Failed queries
From: AAAS@mwt2.org
To: ilijav@gmail.com

Dear Ilija Vukotic,

	this mail is to let you know that the following servers present failed queries in the past 24 hours: 
	(attached numbers correspond to rejected, disconnected and unprocessed queries) 

aiatlas146.cern.ch : 4 unprocessed queries.
aiatlas073.cern.ch : 4 unprocessed queries.
aiatlas147.cern.ch : 4 unprocessed queries.
ccosvms0013.in2p3.fr : 2 unprocessed queries.
aiatlas036.cern.ch : 4 unprocessed queries.
ccosvms0014 : 3 unprocessed queries.
ccsvli200 : 1 unprocessed queries.
ccosvms0012.in2p3.fr : 2 unprocessed queries.

Best regards,
ATLAS AAS

 To change your alerts preferences please use the following link:
https://docs.google.com/forms/d/e/1FAIpQLSeedRVj0RPRadEt8eGobDeneix_vNxUkqbtdNg7rGMNOrpcug/viewform?edit2=2_ABaOnufrzSAOPoVDl6wcXDnQKk0EfkQRmlxj04nw9npJrTAK5BZPijqoLhg
Content-Type: text/plain; charse