# Send alert emails about packet loss based on alarms and user subscriptions

This notebook is run by a cron job every hour. It sends alert emails about packet loss for user-specified sites, based on alarms and user subscription records.

The notebook proceeds as follows:

(1) Get all the alarms of type packetloss for the most recent 3 hours (call it NEW) and for the 3 hours before that (call it OLD) from Elasticsearch

(2) Get the user subscription records from Google Sheets by calling the APIs in subscribers.py

(3) Process the alarms data and subscribing data to make them easier to use for this monitoring task

(4) TN_old is the total number of alarmed links involving a specific site ip (as either source or destination) in the OLD time period

(5) TN_new is the total number of alarmed links involving a specific site ip (as either source or destination) in the NEW time period

(6) TN_delta is the change from TN_old to TN_new, i.e. TN_new - TN_old. We compare TN_delta against +N and -N (N is tuned later)

(7) If a site ip occurs in neither NEW nor OLD, it has no alarmed links in either period (TN_old == TN_new == TN_delta == 0), so we ignore it

(8) If a site ip occurs in NEW, OLD, or both, TN_delta may be positive, zero, or negative, so we examine it in steps (9), (10), and (11)

(9) If TN_delta >= +N, the links connected to this site are getting worse overall, so we send an email

(10) If TN_delta <= -N, the links connected to this site are getting better overall, so we send an email

(11) Otherwise, the overall status of this site is unchanged or has changed only slightly, so we do not send an email

(12) In order to send email, we need a dictionary whose key is site ip and value is a list of relevant user emails
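
The decision rule in steps (6) through (11) can be sketched as a small standalone function. This is only an illustration under stated assumptions: the name `classify_sites` and the sample counts are made up and do not appear in the notebook, which implements the same rule inline further down.

```python
def classify_sites(tn_old, tn_new, n):
    """Split site ips into (worse, better) lists using the threshold n."""
    worse, better = [], []
    # Only ips seen in OLD or NEW can have a non-zero delta (steps 7 and 8).
    for ip in sorted(set(tn_old) | set(tn_new)):
        delta = tn_new.get(ip, 0) - tn_old.get(ip, 0)
        if delta >= n:       # step 9: noticeably more alarmed links
            worse.append(ip)
        elif delta <= -n:    # step 10: noticeably fewer alarmed links
            better.append(ip)
        # step 11: small or no change, so no email
    return worse, better


# Made-up counts, for illustration only.
tn_old = {'10.0.0.1': 17, '10.0.0.2': 3}
tn_new = {'10.0.0.1': 4, '10.0.0.2': 4, '10.0.0.3': 6}
worse, better = classify_sites(tn_old, tn_new, n=5)
print(worse, better)
```

With `n=5`, the site that went from 17 to 4 alarmed links lands in `better`, the site that appeared with 6 links lands in `worse`, and the site that moved from 3 to 4 triggers nothing.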



## Import necessary packages and classes

In [1]:
# Retrieve user subscription records from Google Sheets; alerts is used to send mail and record alerts.
from subscribers import subscribers
import alerts

S = subscribers()
A = alerts.alerts()

# Related to Elasticsearch queries
from elasticsearch import Elasticsearch, exceptions as es_exceptions, helpers
import datetime

# Regular Expression
import re

## Establish Elasticsearch connection

In [2]:
es = Elasticsearch(hosts=[{'host':'atlas-kibana.mwt2.org', 'port':9200}],timeout=60)

## List all alarms-yyyy.mm indices

In [3]:
indices = es.cat.indices(index="alarms-*", h="index", request_timeout=600).split('\n')
indices = [x for x in indices if x != '']
indices = [x.strip() for x in indices]
print(indices)

['alarms-2017-03', 'alarms-2016-12', 'alarms-2017-01', 'alarms-2017-02', 'alarms-2016-08', 'alarms-2016-10', 'alarms-2017-05', 'alarms-2016-09', 'alarms-2016-11', 'alarms-2017-04']


## Find indices to be used

In [4]:
cday = datetime.datetime.utcnow()
pday = cday - datetime.timedelta(days=1)
ind1 = 'alarms-%d-%02d' % (cday.year, cday.month)
ind2 = 'alarms-%d-%02d' % (pday.year, pday.month)

print('checking for indices:', ind1, ind2)

ind=[]
if ind1 in indices:
    ind.append(ind1)
if ind2 != ind1 and ind2 in indices and cday.hour<6:   # the queries look back 6 hours, so early on the first day of a month the previous index is still needed
    ind.append(ind2)

if len(ind)==0:
    # A bare `exit` is only a name lookup and does nothing; raise SystemExit to actually stop.
    raise SystemExit('no current indices found. Aborting.')
else:
    print('will use indices:', ind)

checking for indices: alarms-2017-05 alarms-2017-05
will use indices: ['alarms-2017-05']
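
As a sketch of an alternative to the hour-based check, the indices can be derived from the start of the lookback window itself. The helper name `monthly_indices` and its parameters are hypothetical; the sketch assumes the `alarms-YYYY-MM` naming shown above and a lookback shorter than one month.

```python
import datetime


def monthly_indices(now, lookback_hours, prefix='alarms'):
    """Return the monthly index names touched by [now - lookback, now]."""
    start = now - datetime.timedelta(hours=lookback_hours)
    names = []
    # With a lookback under one month, only the two end points can differ.
    for moment in (now, start):
        name = '%s-%d-%02d' % (prefix, moment.year, moment.month)
        if name not in names:
            names.append(name)
    return names


# 02:00 on the first of May with a 6 hour lookback reaches back into April.
print(monthly_indices(datetime.datetime(2017, 5, 1, 2, 0), 6))
```

The returned names would still need to be filtered against the list of indices that actually exist, as the cell above does.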


## Queries to find all the alarms of type Packet Loss for the NEW period (last 3 hours) and the OLD period (the 3 hours before that)

In [5]:
query_new = {
    "size": 1000,
    "query": {
        "bool": {
            "must": [
                {"term": { "_type": "packetloss" }}
            ],
            "filter": {
                "range": {
                    "alarmTime": {
                        "gt": "now-3h"
                    }
                }
            }
        }
    }
}
# +SPM Changed time queries 20-Apr-2017:  New is last 3 hours now, and Old is the previous 3 hours before that.
query_old = {
    "size": 1000,
    "query": {
        "bool": {
            "must": [
                {"term": { "_type": "packetloss" }}
            ],
            "filter": {
                "range": {
                    "alarmTime": {
                        "gt": "now-6h",
                        "lt": "now-3h"
                    }
                }
            }
        }
    }
}

print(query_new)
print(query_old)

{'query': {'bool': {'must': [{'term': {'_type': 'packetloss'}}], 'filter': {'range': {'alarmTime': {'gt': 'now-3h'}}}}}, 'size': 1000}
{'query': {'bool': {'must': [{'term': {'_type': 'packetloss'}}], 'filter': {'range': {'alarmTime': {'lt': 'now-3h', 'gt': 'now-6h'}}}}}, 'size': 1000}
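
The two query bodies differ only in their time range, so they could be produced by one helper. The sketch below builds plain dictionaries and makes no Elasticsearch calls; the function name `packetloss_query` and its parameter names are assumptions introduced for illustration.

```python
def packetloss_query(newer_than, older_than=None, size=1000):
    """Build a packetloss alarm query body for the given alarmTime window."""
    time_range = {"gt": newer_than}
    if older_than is not None:
        time_range["lt"] = older_than
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"term": {"_type": "packetloss"}}],
                "filter": {"range": {"alarmTime": time_range}},
            }
        },
    }


query_new = packetloss_query("now-3h")
query_old = packetloss_query("now-6h", "now-3h")
```

Because both windows use strict bounds (`gt` and `lt`), an alarm stamped exactly at the boundary would fall into neither; whether that matters in practice depends on the timestamp resolution.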


## Execute the query

In [6]:
result_new = es.search(index=ind, body=query_new, request_timeout=120)
print('Number of hits of new alarms:', result_new['hits']['total'] )

result_old = es.search(index=ind, body=query_old, request_timeout=120)
print('Number of hits of old alarms:', result_old['hits']['total'] )

hits_new = result_new['hits']['hits']
hits_old = result_old['hits']['hits']

Number of hits of new alarms: 76
Number of hits of old alarms: 85


## Generate the two dictionaries for sites, one is from ip to name, one is from name to ip

In [7]:
site_ip_name = {}

for hit in hits_new:
    info = hit['_source']
    site_ip_name[info['src']] = info['srcSite']
    site_ip_name[info['dest']] = info['destSite']

for hit in hits_old:
    info = hit['_source']
    site_ip_name[info['src']] = info['srcSite']
    site_ip_name[info['dest']] = info['destSite']

print(site_ip_name)

{'131.225.205.12': 'UnknownSite', '192.41.236.31': 'AGLT2', '141.34.200.28': 'DESY-ZN', '144.206.236.189': 'RRC-KI-T1', '134.158.103.10': 'IN2P3-LAPP', '192.108.47.12': 'FZK-LCG2', '206.12.154.60': 'CA-VICTORIA-WESTGRID-T2', '148.187.64.25': 'CSCS-LCG2', '206.12.24.251': 'SFU-LCG2', '193.109.172.188': 'pic', '130.209.239.124': 'UKI-SCOTGRID-GLASGOW', '147.156.116.40': 'IFIC-LCG2', '130.246.47.129': 'UKI-SOUTHGRID-RALPP', '206.12.9.2': 'TRIUMF-LCG2', '129.93.183.249': 'Nebraska', '144.206.237.142': 'RRC-KI', '131.111.66.196': 'UKI-SOUTHGRID-CAM-HEP', '85.122.31.74': 'RO-16-UAIC', '138.253.60.82': 'UKI-NORTHGRID-LIV-HEP', '194.80.35.169': 'UKI-NORTHGRID-LANCS-HEP', '200.136.80.20': 'SPRACE', '129.215.213.70': 'UKI-SCOTGRID-ECDF', '192.41.230.59': 'AGLT2', '128.142.223.247': 'CERN-PROD', '194.85.69.75': 'ITEP', '134.158.159.85': 'GRIF', '202.122.32.170': 'BEIJING-LCG2', '194.36.11.38': 'UKI-LT2-QMUL', '62.40.126.129': 'UnknownSite', '142.150.19.61': 'CA-SCINET-T2', '147.231.25.192': 'prag

In [8]:
site_name_ip = {}

for ip in site_ip_name:
    name = site_ip_name[ip]
    if name in site_name_ip:
        site_name_ip[name].append(ip)
    else:
        site_name_ip[name] = [ip]

print(site_name_ip)

{'TRIUMF-LCG2': ['206.12.9.2'], 'CERN-PROD': ['128.142.223.247'], 'DESY-ZN': ['141.34.200.28'], 'SARA-MATRIX': ['145.100.17.8'], 'INFN-ROMA1': ['141.108.35.18'], 'UKI-SOUTHGRID-OX-HEP': ['163.1.5.210'], 'MWT2': ['149.165.225.223'], 'UKI-SCOTGRID-GLASGOW': ['130.209.239.124'], 'UKI-NORTHGRID-LIV-HEP': ['138.253.60.82'], 'UTA_SWT2': ['129.107.255.29'], 'RO-16-UAIC': ['85.122.31.74'], 'pic': ['193.109.172.188'], 'RU-Protvino-IHEP': ['194.190.165.192'], 'CSCS-LCG2': ['148.187.64.25'], 'UKI-LT2-QMUL': ['194.36.11.38'], 'RRC-KI-T1': ['144.206.236.189'], 'ITEP': ['194.85.69.75'], 'SFU-LCG2': ['206.12.24.251'], 'FZK-LCG2': ['192.108.47.12'], 'GRIF': ['134.158.159.85'], 'RO-07-NIPNE': ['81.180.86.64'], 'UNI-FREIBURG': ['132.230.202.235'], 'UKI-SCOTGRID-ECDF': ['129.215.213.70'], 'FMPhI-UNIBA': ['158.195.14.26'], 'IN2P3-LAPP': ['134.158.103.10'], 'BEIJING-LCG2': ['202.122.32.170'], 'TECHNION-HEP': ['192.114.101.125'], 'IFIC-LCG2': ['147.156.116.40'], 'UKI-NORTHGRID-MAN-HEP': ['195.194.105.178'],
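
The name-to-ip dictionary is an inversion of the ip-to-name dictionary in which one site name may own several ips. A minimal sketch of the same inversion using `collections.defaultdict` follows; the function name `invert_ip_to_name` is hypothetical, and the sample entries are taken from the output above.

```python
from collections import defaultdict


def invert_ip_to_name(ip_to_name):
    """Group ips by site name; a site may have more than one ip."""
    name_to_ips = defaultdict(list)
    for ip, name in ip_to_name.items():
        name_to_ips[name].append(ip)
    return dict(name_to_ips)


sample = {
    '192.41.236.31': 'AGLT2',
    '192.41.230.59': 'AGLT2',
    '206.12.9.2': 'TRIUMF-LCG2',
}
inverted = invert_ip_to_name(sample)
print(inverted)
```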

## Calculate TN_old, the total number of alarmed links involving a specific site ip (either as source site or as destination site) for the OLD time period

In [9]:
TN_old = {}

def TN_old_add_one(ip):
    if ip in TN_old:
        TN_old[ip] += 1
    else:
        TN_old[ip] = 1

for alarm in hits_old:
    TN_old_add_one(alarm['_source']['src'])
    TN_old_add_one(alarm['_source']['dest'])

#TN_old

## Calculate TN_new, the total number of alarmed links involving a specific site ip (either as source site or as destination site) for the NEW time period

In [10]:
TN_new = {}

def TN_new_add_one(ip):
    if ip in TN_new:
        TN_new[ip] += 1
    else:
        TN_new[ip] = 1

for alarm in hits_new:
    TN_new_add_one(alarm['_source']['src'])
    TN_new_add_one(alarm['_source']['dest'])

#TN_new
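
`TN_old_add_one` and `TN_new_add_one` differ only in the dictionary they update, so the counting could be shared. The sketch below uses `collections.Counter`; the function name `count_links` and the sample hits are made up for illustration, and the hit layout (`_source`, `src`, `dest`) is the one used in the cells above.

```python
from collections import Counter


def count_links(hits):
    """Count, per site ip, the alarmed links where it is source or destination."""
    counts = Counter()
    for alarm in hits:
        source = alarm['_source']
        counts[source['src']] += 1
        counts[source['dest']] += 1
    return counts


# Made-up hits, for illustration only.
sample_hits = [
    {'_source': {'src': '10.0.0.1', 'dest': '10.0.0.2'}},
    {'_source': {'src': '10.0.0.1', 'dest': '10.0.0.3'}},
]
counts = count_links(sample_hits)
print(counts['10.0.0.1'], counts['10.0.0.2'], counts['10.0.0.9'])
```

A `Counter` returns 0 for an ip it has never seen, so a difference such as `tn_new[ip] - tn_old[ip]` can be computed without first checking membership.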

## Calculate TN_delta, which is equal to ( TN_new - TN_old )

In [11]:
TN_delta = {}

for ip in TN_old:
    if ip in TN_new:
        TN_delta[ip] = TN_new[ip] - TN_old[ip]
    else:
        TN_delta[ip] = -TN_old[ip]

for ip in TN_new:
    if ip not in TN_old:
        TN_delta[ip] = TN_new[ip]

TN_delta

{'128.142.223.247': -1,
 '129.107.255.29': -1,
 '129.215.213.70': 0,
 '129.93.183.249': 1,
 '130.209.239.124': 0,
 '130.246.176.109': 0,
 '130.246.47.129': -1,
 '131.111.66.196': 0,
 '131.169.98.30': 0,
 '131.225.205.12': 1,
 '132.230.202.235': 0,
 '134.158.103.10': -1,
 '134.158.159.85': 1,
 '138.253.60.82': -1,
 '141.108.35.18': 1,
 '141.34.200.28': 0,
 '142.150.19.61': 0,
 '143.167.3.116': 0,
 '144.206.236.189': 0,
 '144.206.237.142': 1,
 '145.100.17.8': -1,
 '147.156.116.40': -1,
 '147.231.25.192': -1,
 '148.187.64.25': 0,
 '149.165.225.223': 1,
 '158.195.14.26': -13,
 '163.1.5.210': 0,
 '18.12.1.171': 0,
 '192.108.47.12': 0,
 '192.114.101.125': 0,
 '192.41.230.59': -1,
 '192.41.236.31': -1,
 '193.109.172.188': 0,
 '193.136.75.146': 0,
 '194.190.165.192': 0,
 '194.36.11.38': 0,
 '194.80.35.169': -1,
 '194.85.69.75': 0,
 '195.194.105.178': -2,
 '200.136.80.20': 2,
 '202.122.32.170': 0,
 '206.12.154.60': 0,
 '206.12.24.251': 0,
 '206.12.9.2': 1,
 '62.40.126.129': 0,
 '81.180.86.64': 

## Look at the distribution of TN_delta, so that we can tune the parameter N

In [12]:
for N in range(10):
    count_worse = 0
    count_better = 0
    count_stable = 0
    for ip in TN_delta:
        if TN_delta[ip] > N:
            count_worse += 1
        elif TN_delta[ip] < -N:
            count_better += 1
        else:
            count_stable += 1
    print('N=%d     links went bad=%d     links went good=%d     unchanged=%d' % (N, count_worse, count_better, count_stable))

N=0     links went bad=8     links went good=14     unchanged=25
N=1     links went bad=1     links went good=2     unchanged=44
N=2     links went bad=0     links went good=1     unchanged=46
N=3     links went bad=0     links went good=1     unchanged=46
N=4     links went bad=0     links went good=1     unchanged=46
N=5     links went bad=0     links went good=1     unchanged=46
N=6     links went bad=0     links went good=1     unchanged=46
N=7     links went bad=0     links went good=1     unchanged=46
N=8     links went bad=0     links went good=1     unchanged=46
N=9     links went bad=0     links went good=1     unchanged=46


## Let's use N=5 for now, and we will tune later

In [13]:
N = 5

ip_list_worse = []
ip_list_better = []

for ip in TN_delta:
    if TN_delta[ip] >= N:
        ip_list_worse.append(ip)
    elif TN_delta[ip] <= -N:
        ip_list_better.append(ip)

print('--- The ip of the site(s) which got worse:')
print(ip_list_worse)
print('--- The ip of the site(s) which got better:')
print(ip_list_better)

--- The ip of the site(s) which got worse:
[]
--- The ip of the site(s) which got better:
['158.195.14.26']


## Generate the dictionary: key = site name, value = a list of the users subscribed to that site

In [18]:
user_interest_site_name = {}

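def reg_user_interest_site_name(sitename, user):
    # Register each user at most once per site, even when several of their patterns
    # match it (e.g. 'MWT2' and '*'); otherwise the same alert text is appended twice.
    registered = user_interest_site_name.setdefault(sitename, [])
    if user not in registered:
        registered.append(user)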

test_name = 'PerfSONAR [Packet loss change for link(s) where your site is a source or destination]'
emailSubject = 'Significant change in the number of network paths with large packet loss where your subscribed site(s) are the source or destination'

users = S.get_immediate_subscribers(test_name)

# Handle blank answer, one site, several sites separated by comma, wildcard such as prefix* etc.
for user in users:
    sitenames = user.sites
    print(user.to_string(), sitenames)
    if len(sitenames) == 0:
        sitenames = ['.']  # Handle blank answer, so match all site names
    sitenames = [x.replace('*', '.') for x in sitenames]  # Handle several site names, and wildcard
    for sn in sitenames:
        p = re.compile(sn, re.IGNORECASE)
        for sitename in site_name_ip:
            if p.match(sitename):
                reg_user_interest_site_name(sitename, user)


user name:Ilija Vukotic  email:ilija@vukotic.me ['MWT2']
user name:Ilija Vukotic  email:ilijav@gmail.com ['MWT2', '*']


## Generate the dictionary: key = site ip, value = a list of the users subscribed to that site

In [19]:
user_interest_site_ip = {}

def reg_user_interest_site_ip(siteip, email):
    if siteip in user_interest_site_ip:
        user_interest_site_ip[siteip].append(email)
    else:
        user_interest_site_ip[siteip] = [email]

for sitename in user_interest_site_name:
    for siteip in site_name_ip[sitename]:
        for user in user_interest_site_name[sitename]:
            reg_user_interest_site_ip(siteip, user)

print(user_interest_site_ip)

{'131.225.205.12': [<subscribers_new.user object at 0x7fc3db3ea470>], '131.111.66.196': [<subscribers_new.user object at 0x7fc3db3ea470>], '141.34.200.28': [<subscribers_new.user object at 0x7fc3db3ea470>], '144.206.236.189': [<subscribers_new.user object at 0x7fc3db3ea470>], '134.158.103.10': [<subscribers_new.user object at 0x7fc3db3ea470>], '192.108.47.12': [<subscribers_new.user object at 0x7fc3db3ea470>], '206.12.154.60': [<subscribers_new.user object at 0x7fc3db3ea470>], '148.187.64.25': [<subscribers_new.user object at 0x7fc3db3ea470>], '206.12.24.251': [<subscribers_new.user object at 0x7fc3db3ea470>], '193.109.172.188': [<subscribers_new.user object at 0x7fc3db3ea470>], '130.209.239.124': [<subscribers_new.user object at 0x7fc3db3ea470>], '85.122.31.74': [<subscribers_new.user object at 0x7fc3db3ea470>], '130.246.47.129': [<subscribers_new.user object at 0x7fc3db3ea470>], '206.12.9.2': [<subscribers_new.user object at 0x7fc3db3ea470>], '129.93.183.249': [<subscribers_new.user 

## Generate info for sending alert emails (for the sites getting worse)

In [20]:
for ip in ip_list_worse:
    text = "The site %s (%s)'s network paths have worsened, the count of src-destination paths with packet-loss went from %d to %d.\n" % (site_ip_name[ip], ip, TN_old.get(ip,0), TN_new.get(ip,0))
    text += "These are all the problematic src-destination paths for the past 3 hours:\n"
    for alarm in hits_new:
        src_ip = alarm['_source']['src']
        dest_ip = alarm['_source']['dest']
        if src_ip == ip:
            text += '    %s (%s)  --->  %s (%s) \n' % (site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
    for alarm in hits_new:
        src_ip = alarm['_source']['src']
        dest_ip = alarm['_source']['dest']
        if dest_ip == ip:
            text += '    %s (%s)  --->  %s (%s) \n' % (site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
    print(text)
    for user in user_interest_site_ip.get(ip, []):   # a site with no subscribers has no entry
        user.alerts.append(text)

## Generate info for sending alert emails (for the sites getting better)

In [21]:
for ip in ip_list_better:
    text = "The site %s (%s)'s network paths have improved, the count of src-destination paths with packet-loss went from %d to %d.\n" % (site_ip_name[ip], ip, TN_old.get(ip,0), TN_new.get(ip,0))
    wtext=""   # collects the paths that are still alarmed in the NEW period
    for alarm in hits_new:
        src_ip = alarm['_source']['src']
        dest_ip = alarm['_source']['dest']
        if src_ip == ip:
            wtext += '    %s (%s)  --->  %s (%s) \n' % (site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
    for alarm in hits_new:
        src_ip = alarm['_source']['src']
        dest_ip = alarm['_source']['dest']
        if dest_ip == ip:
            wtext += '    %s (%s)  --->  %s (%s) \n' % (site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
    if len(wtext)>0:
        text += "These are the remaining problematic src-destination paths for the past 3 hours:\n"
        text += wtext
#    print(text)
    for user in user_interest_site_ip.get(ip, []):   # a site with no subscribers has no entry
        user.alerts.append(text)

# user_alert_all
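
The two alert cells above build the same indented path lines with two nearly identical loops each. A shared helper is one way to remove that duplication. The name `format_paths` and the sample data below are hypothetical; the line format and the ordering (paths from the site first, then paths to it) mirror the cells above.

```python
def format_paths(ip, hits, ip_to_name):
    """List alarmed paths from the given ip first, then alarmed paths to it."""
    outgoing, incoming = [], []
    for alarm in hits:
        src = alarm['_source']['src']
        dest = alarm['_source']['dest']
        line = '    %s (%s)  --->  %s (%s) \n' % (ip_to_name[src], src, ip_to_name[dest], dest)
        if src == ip:
            outgoing.append(line)
        if dest == ip:
            incoming.append(line)
    return ''.join(outgoing + incoming)


# Made-up data, for illustration only.
names = {'10.0.0.1': 'SITE-A', '10.0.0.2': 'SITE-B', '10.0.0.3': 'SITE-C'}
sample_hits = [
    {'_source': {'src': '10.0.0.2', 'dest': '10.0.0.1'}},
    {'_source': {'src': '10.0.0.1', 'dest': '10.0.0.3'}},
    {'_source': {'src': '10.0.0.2', 'dest': '10.0.0.3'}},
]
paths_text = format_paths('10.0.0.1', sample_hits, names)
print(paths_text)
```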

## Send out alert email customized for each user

In [22]:
for user in users:
    if len(user.alerts)>0:
        body = 'Dear ' + user.name + ',\n\n'
        body = body + '\tThis mail is to let you know that there are significant changes in the number of paths with large packet-loss detected by perfSONAR for sites you requested alerting about.\n\n'
        for a in user.alerts:
            body = body + a + '\n'
   
        # Add in two items: 1) Where to go for more information and 2) who to contact to pursue fixing this   +SPM 20-Apr-2017
        body += '\n To get more information about this alert message and its interpretation, please visit:\n'
        body += '  http://twiki.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG/PacketLossAlert\n'
        body += '\n If you suspect a network problem and wish to follow up on it please email the appropriate support list:\n'
        body += '     For OSG sites:  goc@opensciencegrid.org using Subject: Possible network issue\n'
        body += '     For WLCG sites:  wlcg-network-throughput@cern.ch using Subject: Possible network issue\n'
        body += ' Please include this alert email to help expedite your request for network debugging support.\n'
        body += '\n To change your alerts preferences please use the following link:\n' + user.link
        body += '\n\nBest regards,\nATLAS Networking Alert Service'
        #print(body)
        A.sendMail(emailSubject, user.email, body)
        A.addAlert(test_name, user.name,'change in packet loss')

Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Significant change in the number of network paths with large packet loss where your subscribed site(s) are the source or destination
From: AAAS@mwt2.org
To: ilijav@gmail.com

Dear Ilija Vukotic,

	This mail is to let you know that there are significant changes in the number of paths with large packet-loss detected by perfSONAR for sites you requested alerting about.

The site FMPhI-UNIBA (158.195.14.26)'s network paths have improved, the count of src-destination paths with packet-loss went from 17 to 4.
    CA-VICTORIA-WESTGRID-T2 (206.12.154.60)  --->  FMPhI-UNIBA (158.195.14.26) 
    SFU-LCG2 (206.12.24.251)  --->  FMPhI-UNIBA (158.195.14.26) 
    TRIUMF-LCG2 (206.12.9.2)  --->  FMPhI-UNIBA (158.195.14.26) 
    CA-SCINET-T2 (142.150.19.61)  --->  FMPhI-UNIBA (158.195.14.26) 


 To get more information about this alert message and its interpretation, please visit:
  http://twiki.ope