Setup Cloudwatch Logging and Alarms #61

daltonfury42 · 2020-07-08T07:30:01Z

Aim: Avoid downtime by proactive monitoring.

The CloudWatch free tier allows 10 custom metrics and 10 alarms, which can be used for this.

1. If CPU, memory, or disk utilisation stays above a threshold for a set period, send an alert to the dev team. If the server goes down for some reason, or an EC2 health check fails, a notification should be set up for that as well.
2. If there is an unusually high rate of HTTP 500s or 400s, send an alert to the dev team.
   For item 2 we need to start using API Gateway so that these metrics start coming to CloudWatch, so this is blocked on Setup API Gateway #64.
3. Logs and metrics should be stored.
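
A rough sketch of how the CPU and status-check alarms could be defined, written as keyword arguments for boto3's `put_metric_alarm`. The alarm names, thresholds, periods, instance ID, and SNS topic ARN are placeholders and assumptions, not values from this setup.

```python
def cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Keyword arguments for put_metric_alarm: sustained high average CPU."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per datapoint
        "EvaluationPeriods": 2,  # assumed: two breaching datapoints in a row
        "Threshold": threshold,  # assumed threshold, in percent
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def status_check_alarm(instance_id, topic_arn):
    """Keyword arguments for put_metric_alarm: any EC2 status check failure."""
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarms(cloudwatch, instance_id, topic_arn):
    """Create both alarms using a CloudWatch client passed in by the caller."""
    for kwargs in (cpu_alarm(instance_id, topic_arn),
                   status_check_alarm(instance_id, topic_arn)):
        cloudwatch.put_metric_alarm(**kwargs)
```

With boto3 installed and credentials configured, this would be called as `create_alarms(boto3.client("cloudwatch"), instance_id, topic_arn)`. Memory and disk utilisation are not among the default EC2 metrics, so those alarms would need metrics published by the CloudWatch agent first.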

Currently the logs are in syslog. Can we send them to CloudWatch?
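
One possible way to do this is the CloudWatch agent, which reads a JSON config listing the files to ship. The sketch below only builds that config; the `/var/log/syslog` path (the usual location on Ubuntu) and the log group name are assumptions.

```python
import json


def syslog_agent_config(log_group="syslog"):
    """Build a CloudWatch agent config that collects /var/log/syslog."""
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": "/var/log/syslog",  # assumed path
                            "log_group_name": log_group,     # hypothetical name
                            # The agent expands this placeholder itself.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        }
    }


def render_config(log_group="syslog"):
    """Return the config as pretty-printed JSON, ready to write to a file."""
    return json.dumps(syslog_agent_config(log_group), indent=2)
```

The output of `render_config()` would be saved as the agent's config file; where that file lives and how the agent is started depends on how it is installed.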

Also, if the two services restart more than 3 times in half an hour, can we get an alert?
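
The restart rule can be sketched as a sliding-window check over restart timestamps. How the timestamps are collected (for example from the service manager's logs) is left open here; the function name and defaults are illustrative.

```python
from collections import deque
from datetime import datetime, timedelta


def too_many_restarts(restart_times, max_restarts=3,
                      window=timedelta(minutes=30)):
    """Return True if more than max_restarts fall inside any one window."""
    recent = deque()
    for t in sorted(restart_times):
        recent.append(t)
        # Drop restarts that are older than the window relative to t.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) > max_restarts:
            return True
    return False


start = datetime(2021, 1, 2, 10, 0)
burst = [start + timedelta(minutes=m) for m in (0, 5, 10, 20)]
spread = [start + timedelta(minutes=m) for m in (0, 5, 10, 45)]
print(too_many_restarts(burst))   # True: 4 restarts within 20 minutes
print(too_many_restarts(spread))  # False: the 4th restart is outside the window
```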

daltonfury42 · 2020-07-08T07:30:24Z

Health checks would be good to have

daltonfury42 · 2020-08-03T17:18:44Z

We are hitting this issue for the second time. The instance becomes totally unreachable.

daltonfury42 · 2020-10-02T15:30:51Z

The CPU utilisation spiked when this happened today:

daltonfury42 · 2020-12-07T16:56:39Z

For now, I have enabled a CloudWatch alarm for just this status check failure.

daltonfury42 · 2021-01-02T10:35:56Z

This happened while I had shell access today. Looks like a memory leak.

Disk:

df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev            476M     0  476M   0% /dev
tmpfs            98M  820K   98M   1% /run
/dev/xvda1      7.7G  7.1G  673M  92% /
tmpfs           490M     0  490M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           490M     0  490M   0% /sys/fs/cgroup
/dev/loop1       56M   56M     0 100% /snap/core18/1932
/dev/loop3       77M   77M     0 100% /snap/bombsquad/3
/dev/loop4       62M   62M     0 100% /snap/core20/875
/dev/loop5       29M   29M     0 100% /snap/amazon-ssm-agent/2333
/dev/loop6       90M   90M     0 100% /snap/bombsquad/4
/dev/loop8       98M   98M     0 100% /snap/core/10444
/dev/loop9       33M   33M     0 100% /snap/amazon-ssm-agent/2996
/dev/loop10      62M   62M     0 100% /snap/core20/904
/dev/loop0       98M   98M     0 100% /snap/core/10577
/dev/loop11      56M   56M     0 100% /snap/core18/1944
tmpfs            98M     0   98M   0% /run/user/1000

Memory:

              total        used        free      shared  buff/cache   available
Mem:           978M        868M         61M        868K         48M         15M
Swap:            0B          0B          0B

Top Output:

top - 10:31:20 up 22 days,  2:07,  1 user,  load average: 24.29, 16.00, 7.62
Tasks: 124 total,   2 running,  81 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us, 18.0 sy,  0.0 ni,  0.0 id, 80.6 wa,  0.0 hi,  0.2 si,  0.9 st
KiB Mem :  1002124 total,    71044 free,   891008 used,    40072 buff/cache
KiB Swap:        0 total,        0 free,        0 used.    19312 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                  
28294 root      20   0  0.619g 0.014g 0.000g S  6.0  1.4   1:28.08 snapd                                                                                                    
   82 root      20   0  0.000g 0.000g 0.000g S  5.4  0.0   0:45.90 kswapd0                                                                                                  
 1227 root      20   0  0.702g 0.013g 0.000g S  2.6  1.4   1:03.32 ssm-agent-worke                                                                                          
  790 root      20   0  0.686g 0.008g 0.000g S  2.0  0.9   1:06.09 amazon-ssm-agen                                                                                          
11060 ubuntu    20   0  2.298g 0.197g 0.000g S  1.3 20.6   0:38.65 java                                                                                                     
   94 root       0 -20  0.000g 0.000g 0.000g I  0.8  0.0   0:08.22 kworker/0:1H-kb                                                                                          
  832 ubuntu    20   0  2.306g 0.249g 0.000g S  0.6 26.0  27:52.29 java                                                                                                     
   10 root      20   0  0.000g 0.000g 0.000g S  0.4  0.0   0:08.22 ksoftirqd/0                                                                                              
    1 root      20   0  0.215g 0.003g 0.000g D  0.4  0.3   0:24.02 systemd                                                                                                  
  835 root      20   0  0.632g 0.016g 0.000g S  0.4  1.7  14:11.97 containerd                                                                                               
11841 root      20   0  0.024g 0.003g 0.000g R  0.4  0.3   0:02.33 lsb_release                                                                                              
11853 ubuntu    20   0  0.042g 0.001g 0.001g R  0.3  0.1   0:00.47 top                                                                                                      
  393 root      19  -1  0.117g 0.011g 0.000g D  0.2  1.2   0:25.07 systemd-journal                                                                                          
11744 root      39  19  0.192g 0.072g 0.000g D  0.2  7.5   0:21.22 apt-check                                                                                                  
28175 root       0 -20  0.000g 0.000g 0.000g D  0.2  0.0   0:01.83 loop0                                                                                                    
  462 root       0 -20  0.000g 0.000g 0.000g D  0.2  0.0   0:02.05 loop9                                                                                                    
 1099 root      20   0  0.521g 0.126g 0.000g D  0.2 13.2   3:01.21 ruby                                                                                                     
11850 root      20   0  0.063g 0.001g 0.000g D  0.2  0.1   0:00.50 sshd                                                                                                     
11858 root      20   0  0.056g 0.000g 0.000g S  0.1  0.0   0:00.14 cron                                                                                                     
   11 root      20   0  0.000g 0.000g 0.000g I  0.1  0.0   0:13.86 rcu_sched                                                                                                
  789 root      20   0  0.275g 0.001g 0.000g S  0.1  0.1   0:30.52 accounts-daemon                                                                                          
11070 root      20   0  0.000g 0.000g 0.000g I  0.1  0.0   0:00.56 kworker/0:1-eve                                                                                          
11246 ubuntu    20   0  0.103g 0.001g 0.000g S  0.1  0.1   0:00.11 sshd                                                                                                     
  774 root      20   0  0.030g 0.000g 0.000g S  0.0  0.0   0:02.77 cron                                                                                                     
  830 syslog    20   0  0.255g 0.002g 0.000g S  0.0  0.2   0:04.88 rsyslogd                                                                                                 
 1086 root      20   0  0.112g 0.013g 0.000g S  0.0  1.4   1:17.50 ruby

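Both Java processes (the prod and dev backend services) show 2GB+ of virtual memory each (VIRT), with roughly 0.2GB resident (RES) each, on an instance with under 1GB of RAM and no swap. This should not be happening.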
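
To separate virtual from resident memory in the `top` output above, the per-process rows can be summed by command name. This is a small sketch that assumes the column layout and the `g` unit suffix of this particular paste; the rows are copied from it.

```python
def memory_by_command(top_rows, command):
    """Sum the VIRT and RES columns (in GB) for rows matching a command."""
    virt = res = 0.0
    for row in top_rows:
        fields = row.split()
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) >= 12 and fields[11] == command:
            virt += float(fields[4].rstrip("g"))
            res += float(fields[5].rstrip("g"))
    return virt, res


rows = [
    "11060 ubuntu    20   0  2.298g 0.197g 0.000g S  1.3 20.6   0:38.65 java",
    "  832 ubuntu    20   0  2.306g 0.249g 0.000g S  0.6 26.0  27:52.29 java",
    " 1099 root      20   0  0.521g 0.126g 0.000g D  0.2 13.2   3:01.21 ruby",
]
virt, res = memory_by_command(rows, "java")
print(f"java VIRT={virt:.3f}g RES={res:.3f}g")  # java VIRT=4.604g RES=0.446g
```

The resident total is the part that competes for the instance's physical RAM, so it is the more useful number to track when chasing the leak.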

daltonfury42 mentioned this issue Jan 2, 2021

Memory Leak #105

Closed

daltonfury42 closed this as completed Jan 3, 2021
