# Send alert emails based on alarms

This Jupyter notebook is run by a cron job every hour. Its purpose is to send alert emails based on alarms and user subscription records.

NOTE: currently this code only deals with packet loss!

This notebook follows this procedure:

(1) Get the (well-formatted) user subscription data for the Packet Loss monitoring task (user site as source and/or destination) from Google Sheets.

(2) Get the packet loss alarms for the past hour (call it NEW) and for the hour before that (call it OLD) from Elasticsearch.

(3) If an alarm is in NEW but not in OLD, a new problem has appeared, and we send an email to the relevant users.

(4) If an alarm is in OLD but not in NEW, an old problem has disappeared, and we send an email to the relevant users.

(5) If an alarm is in both OLD and NEW, the problem is already known to users, so we do not send any email.

(6) There is a corner case within (5): a newly registered user would never receive an email about an existing old problem.

(7) So we send all the alarms in NEW (i.e. those that currently exist) to newly registered users (i.e. those who registered within the past hour).

(8) There is another corner case: when a user updates settings to include new sites or tasks, that user is effectively newly registered for the added sites or tasks.

(9) Note also that when a user updates settings, Google Sheets updates the contents of the original record (instead of creating a new record) and sets its timestamp to the time of the update.

(10) Combining the above: because we detect newly registered users based only on timestamp, a user who updates settings is treated as a new user and is sent all currently existing alarms. This user probably already knows most of these alarms (assuming the settings update is minor), but such repeated alarms are sent only once per settings update. So I think this scheme is reasonable overall.
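Steps (3) to (7) reduce to a few set operations. The following is a minimal sketch rather than the notebook's actual code: the function name `alarms_to_report`, the labels, and the `fresh_user` flag are illustrative, and alarms are assumed to be hashable keys such as `'srcIP_destIP'` strings.

```python
def alarms_to_report(old, new, fresh_user=False):
    """Return (label, alarm) pairs that one subscriber should be told about.

    old, new   -- sets of alarm keys from the previous and the current hour
    fresh_user -- True if the user registered (or updated settings) in the past hour
    """
    report = [('new', alarm) for alarm in sorted(new - old)]        # step (3)
    report += [('cleared', alarm) for alarm in sorted(old - new)]   # step (4)
    if fresh_user:
        # steps (6)-(7): fresh users also hear about alarms that are still ongoing
        report += [('ongoing', alarm) for alarm in sorted(old & new)]
    return report


old = {'A_B', 'B_C'}
new = {'B_C', 'C_D'}
print(alarms_to_report(old, new))
print(alarms_to_report(old, new, fresh_user=True))
```

A regular user gets only the changes, while a fresh user additionally gets everything in both OLD and NEW, which together with the new alarms covers all of NEW.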

## Import necessary packages and classes

In [1]:
# Retrieve user subscription records from Google Sheets, using Xinran's version (based on Ilija's version).
from subscribers import subscribers
google = subscribers()

# Related to Elasticsearch queries
from elasticsearch import Elasticsearch, exceptions as es_exceptions, helpers
import sys
import datetime

## Establish Elasticsearch connection

In [2]:
es = Elasticsearch(hosts=[{'host':'atlas-kibana.mwt2.org', 'port':9200}],timeout=60)

## List all alarms-yyyy.mm indices

In [3]:
indices = es.cat.indices(index="alarms-*", h="index", request_timeout=600).split('\n')
indices = [x for x in indices if x != '']
indices = [x.strip() for x in indices]
print(indices)

['alarms-2016.08']


## Find indices to be used

In [4]:
cday = datetime.datetime.utcnow()
pday = cday - datetime.timedelta(days=1)
ind1 = 'alarms-%d.%02d' % (cday.year, cday.month)
ind2 = 'alarms-%d.%02d' % (pday.year, pday.month)

print('checking for indices:', ind1, ind2)

ind=[]
if ind1 in indices:
    ind.append(ind1)
if ind2 != ind1 and ind2 in indices and cday.hour<3:
    ind.append(ind2)

if len(ind)==0:
    print('no current indices found. Aborting.')
    sys.exit(1)
else:
    print('will use indices:', ind)


checking for indices: alarms-2016.08 alarms-2016.08
will use indices: ['alarms-2016.08']
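The index selection can also be written as a small pure function, which makes the month-boundary case easy to test. This is a sketch of an alternative formulation, not the cell above: the name `candidate_indices` and the `lookback_hours` parameter are assumptions, and it looks back exactly as far as the queries do (two hours) instead of using the one-day and `cday.hour<3` checks.

```python
import datetime


def candidate_indices(now, available, lookback_hours=2):
    """Return the monthly alarms-YYYY.MM indices covering [now - lookback, now]."""
    start = now - datetime.timedelta(hours=lookback_hours)
    names = []
    for moment in (now, start):
        name = 'alarms-%d.%02d' % (moment.year, moment.month)
        # keep only indices that exist, and avoid listing the same month twice
        if name in available and name not in names:
            names.append(name)
    return names


# Half an hour after a month boundary, both months are needed.
print(candidate_indices(datetime.datetime(2016, 9, 1, 0, 30),
                        ['alarms-2016.08', 'alarms-2016.09']))
```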


## Queries to find all the alarms of type Packet Loss for the past hour and the hour before that

In [5]:
query_new = {
    "size": 1000,
    "query": {
        "bool": {
            "must": [
                {"term": { "_type": "packetloss" }}
            ],
            "filter": {
                "range": {
                    "alarmTime": {
                        "gt": "now-1h"
                    }
                }
            }
        }
    }
}

query_old = {
    "size": 1000,
    "query": {
        "bool": {
            "must": [
                {"term": { "_type": "packetloss" }}
            ],
            "filter": {
                "range": {
                    "alarmTime": {
                        "gt": "now-2h",
                        "lt": "now-1h"
                    }
                }
            }
        }
    }
}

print(query_new)
print(query_old)

{'size': 1000, 'query': {'bool': {'filter': {'range': {'alarmTime': {'gt': 'now-1h'}}}, 'must': [{'term': {'_type': 'packetloss'}}]}}}
{'size': 1000, 'query': {'bool': {'filter': {'range': {'alarmTime': {'lt': 'now-1h', 'gt': 'now-2h'}}}, 'must': [{'term': {'_type': 'packetloss'}}]}}}
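The two queries differ only in their time range, so they could be produced by one helper. A minimal sketch, assuming a hypothetical helper name `packet_loss_query`; the query body itself is copied from the cell above.

```python
def packet_loss_query(newer_than, older_than=None, size=1000):
    """Build a packet loss alarm query for the window (newer_than, older_than)."""
    time_range = {"gt": newer_than}
    if older_than is not None:
        time_range["lt"] = older_than
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"term": {"_type": "packetloss"}}],
                "filter": {"range": {"alarmTime": time_range}},
            }
        },
    }


query_new = packet_loss_query("now-1h")
query_old = packet_loss_query("now-2h", "now-1h")
print(query_new)
print(query_old)
```

Both bounds are exclusive here, as in the cell above, so an alarm stamped exactly one hour before the query runs would match neither window. Whether that matters depends on how `alarmTime` is written, which this notebook does not show.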


## Execute the query

In [6]:
result_new = es.search(index=ind, body=query_new, request_timeout=120)
print('Number of hits of new alarms:', result_new['hits']['total'] )

result_old = es.search(index=ind, body=query_old, request_timeout=120)
print('Number of hits of old alarms:', result_old['hits']['total'] )

hits_new = result_new['hits']['hits']
hits_old = result_old['hits']['hits']

Number of hits of new alarms: 239
Number of hits of old alarms: 255
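Both queries cap results at `"size": 1000`, so the OLD/NEW comparison is only reliable when every hit was actually returned; a missing hit would look like a disappeared or newly appeared alarm. The sketch below shows one way to detect that case. The helper names `total_hits` and `is_truncated` are assumptions; the only library fact relied on is that newer Elasticsearch versions (7 and later) report `hits.total` as an object with a `value` field instead of a plain integer.

```python
def total_hits(result):
    """Return the total hit count from a search response."""
    total = result['hits']['total']
    # Elasticsearch 7+ returns {'value': N, 'relation': ...}; older versions return an int.
    return total['value'] if isinstance(total, dict) else total


def is_truncated(result):
    """True if the response holds fewer hits than the query matched."""
    return total_hits(result) > len(result['hits']['hits'])


# Hand-written stand-ins for search responses, for illustration only.
complete = {'hits': {'total': 2, 'hits': [{}, {}]}}
capped = {'hits': {'total': {'value': 3, 'relation': 'eq'}, 'hits': [{}, {}]}}
print(is_truncated(complete), is_truncated(capped))
```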


## Generate the dictionary from site IP to site name, and the reverse

In [7]:
site_ip_name = {}

for hit in hits_new:
    info = hit['_source']
    site_ip_name[info['src']] = info['srcSite']
    site_ip_name[info['dest']] = info['destSite']

for hit in hits_old:
    info = hit['_source']
    site_ip_name[info['src']] = info['srcSite']
    site_ip_name[info['dest']] = info['destSite']

site_ip_name

{'109.105.124.86': 'NDGF-T1',
 '109.105.125.232': 'FI_HIP_T2',
 '117.103.105.191': 'Taiwan-LCG2',
 '128.142.223.247': 'CERN-PROD',
 '128.227.221.44': 'UFlorida-HPC',
 '129.107.255.29': 'UTA_SWT2',
 '129.15.40.231': 'OU_OCHEP_SWT2',
 '129.215.213.70': 'UKI-SCOTGRID-ECDF',
 '129.93.239.148': 'Nebraska',
 '130.246.176.109': 'RAL-LCG2',
 '130.246.47.129': 'UKI-SOUTHGRID-RALPP',
 '131.111.66.196': 'UKI-SOUTHGRID-CAM-HEP',
 '131.154.254.12': 'INFN-T1',
 '131.169.98.30': 'DESY-HH',
 '131.225.205.12': 'UnknownSite',
 '131.243.24.11': 'UnknownSite',
 '132.195.125.213': 'wuppertalprod',
 '132.206.245.252': 'CA-MCGILL-CLUMEQ-T2',
 '132.230.202.235': 'UNI-FREIBURG',
 '134.158.103.10': 'IN2P3-LAPP',
 '134.158.123.183': 'IN2P3-LPC',
 '134.158.132.200': 'GRIF',
 '134.158.159.85': 'GRIF',
 '134.158.20.192': 'IN2P3-CPPM',
 '134.158.73.243': 'GRIF',
 '134.61.24.193': 'UnknownSite',
 '134.75.125.241': 'UnknownSite',
 '134.79.118.72': 'WT2',
 '137.222.74.15': 'UKI-SOUTHGRID-BRIS-HEP',
 '138.253.60.82': 'U

In [8]:
site_name_ip = {}

for ip in site_ip_name:
    name = site_ip_name[ip]
    if name in site_name_ip:
        site_name_ip[name].append(ip)
    else:
        site_name_ip[name] = [ip]

site_name_ip

{'AGLT2': ['192.41.236.31', '192.41.230.59'],
 'Australia-ATLAS': ['192.231.127.41'],
 'BEIJING-LCG2': ['202.122.32.170'],
 'BNL-ATLAS': ['192.12.15.26'],
 'BUDAPEST': ['148.6.8.251'],
 'BU_ATLAS_Tier2': ['192.5.207.251'],
 'CA-MCGILL-CLUMEQ-T2': ['132.206.245.252'],
 'CA-SCINET-T2': ['142.150.19.61'],
 'CA-VICTORIA-WESTGRID-T2': ['206.12.154.60'],
 'CERN-PROD': ['128.142.223.247'],
 'CSCS-LCG2': ['148.187.64.25'],
 'CYFRONET-LCG2': ['212.191.227.174'],
 'DESY-HH': ['131.169.98.30'],
 'DESY-ZN': ['141.34.200.28'],
 'EELA-UTFSM': ['146.83.90.7'],
 'FI_HIP_T2': ['109.105.125.232'],
 'FMPhI-UNIBA': ['158.195.14.26'],
 'FZK-LCG2': ['192.108.47.12'],
 'GLOW': ['144.92.180.75'],
 'GRIF': ['134.158.132.200',
  '192.54.207.250',
  '134.158.159.85',
  '134.158.73.243'],
 'IEPSAS-Kosice': ['147.213.204.112'],
 'IFCA-LCG2': ['193.146.75.138'],
 'IFIC-LCG2': ['147.156.116.40'],
 'IN2P3-CC': ['193.48.99.76'],
 'IN2P3-CPPM': ['134.158.20.192'],
 'IN2P3-LAPP': ['134.158.103.10'],
 'IN2P3-LPC': ['134.
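The name-to-IP dictionary is the inverse of the IP-to-name one, where a site name may own several IPs. As a sketch of the same inversion with `collections.defaultdict` (the function name `invert_ip_to_name` is illustrative, and the sample uses a few entries from the output above):

```python
from collections import defaultdict


def invert_ip_to_name(ip_to_name):
    """Map each site name to the list of IPs that resolve to it."""
    name_to_ips = defaultdict(list)
    for ip, name in ip_to_name.items():
        name_to_ips[name].append(ip)
    return dict(name_to_ips)   # plain dict, so missing names raise KeyError as usual


sample = {
    '134.158.132.200': 'GRIF',
    '134.158.159.85': 'GRIF',
    '128.142.223.247': 'CERN-PROD',
}
print(invert_ip_to_name(sample))
```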

## Generate a suitable IP-based representation of the old and new packet loss alarms

In [9]:
srcIP_destIP_alarm_new = set()
srcIP_destIP_alarm_old = set()

for hit in hits_new:
    info = hit['_source']
    srcIP_destIP_alarm_new.add('{}_{}'.format(info['src'], info['dest']))

for hit in hits_old:
    info = hit['_source']
    srcIP_destIP_alarm_old.add('{}_{}'.format(info['src'], info['dest']))

## Distinguish the three scenarios described at the beginning of this notebook

In [10]:
# Scenario 1: in NEW but not in OLD. Scenario 2: in OLD but not in NEW. Scenario 3: in both.
alarm_kind_new_appear = srcIP_destIP_alarm_new - srcIP_destIP_alarm_old
alarm_kind_old_disappear = srcIP_destIP_alarm_old - srcIP_destIP_alarm_new
alarm_kind_exist_as_before = srcIP_destIP_alarm_old & srcIP_destIP_alarm_new

print('Number of new appear alarm: {}'.format(len(alarm_kind_new_appear)))
print('Number of old disappear alarm: {}'.format(len(alarm_kind_old_disappear)))
print('Number of exist as before alarm: {}'.format(len(alarm_kind_exist_as_before)))
print()
print('Number of currently existing alarm (i.e. what newly registered users will receive, == 1 + 3): {}'.format(len(srcIP_destIP_alarm_new)))

Number of new appear alarm: 51
Number of old disappear alarm: 67
Number of exist as before alarm: 188

Number of currently existing alarm (i.e. what newly registered users will receive, == 1 + 3): 239


## The variable user_alert_all holds all the needed info to send an email to a specific user

In [11]:
user_alert_all = {}

for user in google.getUserInfo():
    user_info = {}
    user_info['email'] = user[0]
    user_info['fullname'] = user[1]
    user_info['link'] = user[2]
    user_info['freshuser'] = user[3]
    user_info['alerts'] = []
    user_alert_all[user[0]] = user_info    # email should be unique globally, so it is used as key

user_alert_all

{'LINCOLNB@UCHICAGO.EDU': {'alerts': [],
  'email': 'LINCOLNB@UCHICAGO.EDU',
  'freshuser': False,
  'fullname': 'Lincoln Bryant',
  'link': 'https://docs.google.com/forms/d/e/1FAIpQLSfwwtAvMrqp4Ot_LYfmNu75_v33dtAxiXg7ZvVdn1X5v7TEgg/viewform?edit2=2_ABaOnufEkC3ZkGEJM7YrjIK5M9zZCa8Fg0C58snIuKzECJRbScqS5po96bBVKA'},
 'elancon@bnl.gov': {'alerts': [],
  'email': 'elancon@bnl.gov',
  'freshuser': False,
  'fullname': 'Eric Lancon',
  'link': 'https://docs.google.com/forms/d/e/1FAIpQLSfwwtAvMrqp4Ot_LYfmNu75_v33dtAxiXg7ZvVdn1X5v7TEgg/viewform?edit2=2_ABaOnuf_umrjRScBiYvJ_aUfteZA2eiV-yUK9DMa3q7VefVJtC5bEQnNSPOg4A'},
 'ilijav@gmail.com': {'alerts': [],
  'email': 'ilijav@gmail.com',
  'freshuser': False,
  'fullname': 'Ilija Vukotic',
  'link': 'https://docs.google.com/forms/d/e/1FAIpQLSfwwtAvMrqp4Ot_LYfmNu75_v33dtAxiXg7ZvVdn1X5v7TEgg/viewform?edit2=2_ABaOnuezuteui57-PrydNWrUuZf5fmChNqtjEeDab6h5V6lik_-x790uKsPu5Q'},
 'ivukotic@cern.ch': {'alerts': [],
  'email': 'ivukotic@cern.ch',
  'freshuse
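The `freshuser` flag comes from the `subscribers` class, whose code is not shown here. As a sketch of the timestamp rule described at the top of this notebook (a record counts as fresh if its timestamp falls within the past hour), assuming the sheet timestamp has already been parsed into a `datetime` in the same timezone as `now`; the name `is_fresh_user` is hypothetical:

```python
import datetime


def is_fresh_user(record_time, now, window=datetime.timedelta(hours=1)):
    """True if the record was created or updated within the last `window`."""
    age = now - record_time
    # a negative age means a timestamp in the future, which is not treated as fresh
    return datetime.timedelta(0) <= age < window


now = datetime.datetime(2016, 8, 15, 12, 0)
print(is_fresh_user(datetime.datetime(2016, 8, 15, 11, 30), now))
print(is_fresh_user(datetime.datetime(2016, 8, 15, 10, 30), now))
```

Because the cron job runs hourly, a window equal to the run interval means each registration or update is treated as fresh in at most one run, provided the runs are evenly spaced.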

## Generate the mapping from each site to the users interested in packet loss alerts where that site is the _source_

In [12]:
interest_packet_loss_source = {}

def reg_interest_src(sitename, email):
    if sitename in interest_packet_loss_source:
        interest_packet_loss_source[sitename].append(email)
    else:
        interest_packet_loss_source[sitename] = [email]

taskName = 'Packet loss increase for link(s) where your site is a source'

subscribe_records = google.getSubscribers_withSiteName(taskName)

for record in subscribe_records:
    sitenames = record[3]
    email = record[1]
    if len(sitenames) == 0:   # blank, so all sites
        reg_interest_src('*', email)
    else:
        for sitename in sitenames:
            reg_interest_src(sitename, email)

interest_packet_loss_source

{'*': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 'CERN-PROD': ['xinran@uchicago.edu'],
 'MWT2': ['LINCOLNB@UCHICAGO.EDU', 'xinran@uchicago.edu'],
 'SFU-LCG2': ['xinran@uchicago.edu']}

## Convert from site name -> email to site ip -> email

In [13]:
interest_packet_loss_source_ip = {}

def reg_interest_src_ip(siteip, email):
    if siteip in interest_packet_loss_source_ip:
        interest_packet_loss_source_ip[siteip].append(email)
    else:
        interest_packet_loss_source_ip[siteip] = [email]

for sitename in interest_packet_loss_source:
    if sitename == '*':
        for ip in site_ip_name:
            for email in interest_packet_loss_source[sitename]:
                reg_interest_src_ip(ip, email)
    else:
        for ip in site_name_ip.get(sitename, []):    # a subscribed site may have no alarms in the last two hours
            for email in interest_packet_loss_source[sitename]:
                reg_interest_src_ip(ip, email)

interest_packet_loss_source_ip

{'109.105.124.86': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '109.105.125.232': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '117.103.105.191': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '128.142.223.247': ['elancon@bnl.gov',
  'marian.babik@cern.ch',
  'xinran@uchicago.edu'],
 '128.227.221.44': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '129.107.255.29': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '129.15.40.231': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '129.215.213.70': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '129.93.239.148': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '130.246.176.109': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '130.246.47.129': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '131.111.66.196': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '131.154.254.12': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '131.169.98.30': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '131.225.205.12': ['elancon@bnl.gov', 'marian.babik@cern.ch'],
 '131.243.2

## Iterate over all alarms in the three scenarios, and generate the info for emails

In [14]:
for alarm in alarm_kind_new_appear:
    src_ip = alarm.split('_')[0]    # Note that [0] is source ip, [1] is destination ip
    dest_ip = alarm.split('_')[1]
    for email in interest_packet_loss_source_ip.get(src_ip, []):    # skip sources nobody subscribed to; when dealing with as_destination_site, change this line accordingly
        line = '[New problem appeared]  ' + taskName
        line += '\n\n'
        line += '    From {} ({}) to {} ({})'.format(site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
        line += '\n'
        print(email)
        print(line)
        user_alert_all[email]['alerts'].append(line)

elancon@bnl.gov
[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From UKI-NORTHGRID-LANCS-HEP (194.80.35.169) to INFN-ROMA1 (141.108.35.18)

marian.babik@cern.ch
[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From UKI-NORTHGRID-LANCS-HEP (194.80.35.169) to INFN-ROMA1 (141.108.35.18)

elancon@bnl.gov
[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From CSCS-LCG2 (148.187.64.25) to OU_OCHEP_SWT2 (129.15.40.231)

marian.babik@cern.ch
[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From CSCS-LCG2 (148.187.64.25) to OU_OCHEP_SWT2 (129.15.40.231)

elancon@bnl.gov
[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From CYFRONET-LCG2 (212.191.227.174) to UnknownSite (161.116.81.235)

marian.babik@cern.ch
[New problem appeared]  Packet loss increase for link(s) where your site is a source

 

In [15]:
for alarm in alarm_kind_old_disappear:
    src_ip = alarm.split('_')[0]    # Note that [0] is source ip, [1] is destination ip
    dest_ip = alarm.split('_')[1]
    for email in interest_packet_loss_source_ip.get(src_ip, []):    # skip sources nobody subscribed to; when dealing with as_destination_site, change this line accordingly
        line = '[Old problem disappeared]  ' + taskName
        line += '\n\n'
        line += '    From {} ({}) to {} ({})'.format(site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
        line += '\n'
        print(email)
        print(line)
        user_alert_all[email]['alerts'].append(line)

elancon@bnl.gov
[Old problem disappeared]  Packet loss increase for link(s) where your site is a source

    From praguelcg2 (147.231.25.192) to UKI-NORTHGRID-MAN-HEP (195.194.105.178)

marian.babik@cern.ch
[Old problem disappeared]  Packet loss increase for link(s) where your site is a source

    From praguelcg2 (147.231.25.192) to UKI-NORTHGRID-MAN-HEP (195.194.105.178)

elancon@bnl.gov
[Old problem disappeared]  Packet loss increase for link(s) where your site is a source

    From CA-VICTORIA-WESTGRID-T2 (206.12.154.60) to UKI-NORTHGRID-MAN-HEP (195.194.105.178)

marian.babik@cern.ch
[Old problem disappeared]  Packet loss increase for link(s) where your site is a source

    From CA-VICTORIA-WESTGRID-T2 (206.12.154.60) to UKI-NORTHGRID-MAN-HEP (195.194.105.178)

elancon@bnl.gov
[Old problem disappeared]  Packet loss increase for link(s) where your site is a source

    From JINR-T1 (159.93.229.151) to INDIACMS-TIFR (144.16.111.26)

marian.babik@cern.ch
[Old problem disappeared]  P

In [16]:
for alarm in alarm_kind_exist_as_before:
    src_ip = alarm.split('_')[0]    # Note that [0] is source ip, [1] is destination ip
    dest_ip = alarm.split('_')[1]
    for email in interest_packet_loss_source_ip.get(src_ip, []):    # skip sources nobody subscribed to; when dealing with as_destination_site, change this line accordingly
        if user_alert_all[email]['freshuser']:
            line = '[Old problem continues]  ' + taskName
            line += '\n\n'
            line += '    From {} ({}) to {} ({})'.format(site_ip_name[src_ip], src_ip, site_ip_name[dest_ip], dest_ip)
            line += '\n'
            print(email)
            print(line)
            user_alert_all[email]['alerts'].append(line)

In [17]:
user_alert_all

{'LINCOLNB@UCHICAGO.EDU': {'alerts': ['[New problem appeared]  Packet loss increase for link(s) where your site is a source\n\n    From MWT2 (149.165.225.223) to BEIJING-LCG2 (202.122.32.170)\n',
   '[New problem appeared]  Packet loss increase for link(s) where your site is a source\n\n    From MWT2 (72.36.96.4) to BEIJING-LCG2 (202.122.32.170)\n',
   '[New problem appeared]  Packet loss increase for link(s) where your site is a source\n\n    From MWT2 (72.36.96.4) to RAL-LCG2 (130.246.176.109)\n',
   '[New problem appeared]  Packet loss increase for link(s) where your site is a source\n\n    From MWT2 (72.36.96.4) to UKI-NORTHGRID-MAN-HEP (195.194.105.178)\n'],
  'email': 'LINCOLNB@UCHICAGO.EDU',
  'freshuser': False,
  'fullname': 'Lincoln Bryant',
  'link': 'https://docs.google.com/forms/d/e/1FAIpQLSfwwtAvMrqp4Ot_LYfmNu75_v33dtAxiXg7ZvVdn1X5v7TEgg/viewform?edit2=2_ABaOnufEkC3ZkGEJM7YrjIK5M9zZCa8Fg0C58snIuKzECJRbScqS5po96bBVKA'},
 'elancon@bnl.gov': {'alerts': ['[New problem appeare

## Dummy sendMail function for development purposes

In [18]:
def sendMailDummy(subject, to, body):
    if len(body['alerts']) == 0:
        print('======== Do not send alert email to {} as there is no alert for this user ========'.format(to))
    else:
        print('========= Send the following email to a user =========')
        print('------ Email subject ------')
        # NOTE: the subject argument is ignored; the subject is always generated here.
        subject = 'Alert email customized for {}'.format(body['fullname'])
        print(subject)
        print('------ Email to -----------')
        print(to)
        print('------ Email body ---------')
        text = 'Hi {},\n\n'.format(body['fullname'])
        text += '    The following are all the alerts about packet loss that you are interested in:\n\n\n'
        for alert in body['alerts']:
            text += alert
            text += '\n\n'
        text += 'Thank you for using this system. If you want to update your settings or unsubscribe, please use this link: {}'.format(body['link'])
        text += '\n\nBest,\nThe team\n\n\n'
        print(text)
        print('======================================================')
    print()
    print()
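When moving from the dummy function to real emails, the same per-user dictionary could be turned into a message object with the standard library. This is a sketch under assumptions, not the production sender: the function name `build_alert_message`, the sender address, and the example data are made up, and nothing is sent here.

```python
from email.message import EmailMessage


def build_alert_message(sender, to, body):
    """Build (but do not send) the alert email for one user_alert_all entry."""
    msg = EmailMessage()
    msg['Subject'] = 'Alert email customized for {}'.format(body['fullname'])
    msg['From'] = sender
    msg['To'] = to
    text = 'Hi {},\n\n'.format(body['fullname'])
    text += 'The following are all the alerts about packet loss that you are interested in:\n\n'
    text += '\n\n'.join(body['alerts'])
    text += '\n\nTo update your settings or unsubscribe, use this link: {}\n'.format(body['link'])
    msg.set_content(text)
    return msg


# Hypothetical entry shaped like the values of user_alert_all.
example = {
    'fullname': 'Example User',
    'link': 'https://example.org/settings',
    'alerts': ['[New problem appeared]  Packet loss increase\n\n    From MWT2 to RAL-LCG2\n'],
}
msg = build_alert_message('alerts@example.org', 'user@example.org', example)
print(msg)
```

An `EmailMessage` built this way can be handed to `smtplib.SMTP(...).send_message(msg)`; which SMTP host to use is deployment-specific and not covered by this notebook.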

## Send out alert email customized for each user

In [19]:
for email in user_alert_all:
    sendMailDummy('auto', email, user_alert_all[email])



------ Email subject ------
Alert email customized for Xinran Wang
------ Email to -----------
xinran@uchicago.edu
------ Email body ---------
Hi Xinran Wang,

    The following are all the alerts about packet loss that you are interested in:


[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From SFU-LCG2 (206.12.24.251) to RAL-LCG2 (130.246.176.109)


[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From MWT2 (149.165.225.223) to BEIJING-LCG2 (202.122.32.170)


[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From CERN-PROD (128.142.223.247) to BEIJING-LCG2 (202.122.32.170)


[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From MWT2 (72.36.96.4) to BEIJING-LCG2 (202.122.32.170)


[New problem appeared]  Packet loss increase for link(s) where your site is a source

    From SFU-LCG2 (206.12.24.251) to UKI-NORTHGRID-MAN-H