Setup Cloudwatch Logging and Alarms #61

Closed
daltonfury42 opened this issue Jul 8, 2020 · 5 comments

Comments

@daltonfury42
Collaborator

daltonfury42 commented Jul 8, 2020

Aim: Avoid downtime by proactive monitoring.

The CloudWatch free tier allows 10 custom metrics and 10 alarms. It can be used for this.

  1. If CPU, memory, or disk utilisation goes above a threshold, send an alert to the dev team. If the server goes down or an EC2 health check fails, a notification should also be set up.

  2. If there is an unusually high rate of HTTP 500s or 400s, send an alert to the dev team.
    For item 2 we need to start using API Gateway so that these metrics start coming to CloudWatch, so this is blocked on Setup API Gateway #64.

  3. Logs and metrics should be stored.
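A minimal sketch of how the alarms in item 1 could be defined, assuming Python with boto3. The alarm names, the 80% CPU threshold, the periods, and the SNS topic ARN passed in are all hypothetical placeholders, not values from this issue. Only `CPUUtilization` and `StatusCheckFailed` are shown because EC2 publishes those by default; memory and disk utilisation need the CloudWatch agent or custom metrics.

```python
from typing import Any, Dict, List


def build_alarm(name: str, metric: str, threshold: float, instance_id: str,
                topic_arn: str, statistic: str = "Average",
                period_s: int = 300, periods: int = 2) -> Dict[str, Any]:
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": statistic,
        "Period": period_s,              # seconds per evaluation period
        "EvaluationPeriods": periods,    # consecutive periods that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],     # SNS topic that notifies the dev team
    }


def default_alarms(instance_id: str, topic_arn: str) -> List[Dict[str, Any]]:
    """Two illustrative alarms; the thresholds are assumptions to tune."""
    return [
        build_alarm("high-cpu", "CPUUtilization", 80.0, instance_id, topic_arn),
        build_alarm("status-check-failed", "StatusCheckFailed", 1.0,
                    instance_id, topic_arn, statistic="Maximum", period_s=60),
    ]


def apply_alarms(alarms: List[Dict[str, Any]]) -> None:
    """Create or update the alarms; needs boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the builders work without it

    client = boto3.client("cloudwatch")
    for alarm in alarms:
        client.put_metric_alarm(**alarm)
```

Calling `apply_alarms(default_alarms(instance_id, topic_arn))` with a real instance ID and topic ARN would create two alarms, which stays within the 10-alarm limit mentioned above.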

Currently the logs are in syslog. Can we send them to CloudWatch?
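One possible way to ship syslog is the CloudWatch agent, which reads a JSON config listing the files to collect. The sketch below generates such a config in Python. The `/var/log/syslog` path (the Ubuntu default) and the `backend-syslog` log group name are assumptions, not settings taken from this server.

```python
import json
from typing import Any, Dict


def syslog_agent_config(log_group: str) -> Dict[str, Any]:
    """Build the "logs" section of a CloudWatch agent config for syslog."""
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": "/var/log/syslog",  # assumed path
                            "log_group_name": log_group,
                            # The agent expands {instance_id} at runtime.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        }
    }


# Print the JSON so it can be saved as the agent's config file.
print(json.dumps(syslog_agent_config("backend-syslog"), indent=2))
```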

Also, if the two services restart more than 3 times in half an hour, can we get an alert?
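The restart rule could be checked with a sliding window over restart timestamps. This is only a sketch: `RestartWindow` is a hypothetical helper, and how restarts get detected (for example from syslog lines) and how the alert is delivered are left open. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class RestartWindow:
    """Track restart timestamps and flag too many within a sliding window."""

    def __init__(self, max_restarts: int = 3, window_s: float = 30 * 60) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._times: Deque[float] = deque()

    def record(self, now_s: float) -> bool:
        """Record a restart at now_s (seconds); return True if an alert is due."""
        self._times.append(now_s)
        # Drop restarts that are older than the window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()
        # "More than 3 times" means the alert fires on the 4th restart.
        return len(self._times) > self.max_restarts
```

One `RestartWindow` per service keeps the counts for the prod and dev backends separate.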

@daltonfury42
Collaborator Author

Health checks would be good to have

@daltonfury42
Collaborator Author

image

We are getting this issue for the second time. The instance becomes totally unreachable.

@daltonfury42
Collaborator Author

The CPU utilisation spiked when this happened today:

image

@daltonfury42
Collaborator Author

I have enabled CloudWatch alarms for just this status check failure, for now.

@daltonfury42
Collaborator Author

This happened while I had shell access today. It looks like a memory leak.

Disk:

df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev            476M     0  476M   0% /dev
tmpfs            98M  820K   98M   1% /run
/dev/xvda1      7.7G  7.1G  673M  92% /
tmpfs           490M     0  490M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           490M     0  490M   0% /sys/fs/cgroup
/dev/loop1       56M   56M     0 100% /snap/core18/1932
/dev/loop3       77M   77M     0 100% /snap/bombsquad/3
/dev/loop4       62M   62M     0 100% /snap/core20/875
/dev/loop5       29M   29M     0 100% /snap/amazon-ssm-agent/2333
/dev/loop6       90M   90M     0 100% /snap/bombsquad/4
/dev/loop8       98M   98M     0 100% /snap/core/10444
/dev/loop9       33M   33M     0 100% /snap/amazon-ssm-agent/2996
/dev/loop10      62M   62M     0 100% /snap/core20/904
/dev/loop0       98M   98M     0 100% /snap/core/10577
/dev/loop11      56M   56M     0 100% /snap/core18/1944
tmpfs            98M     0   98M   0% /run/user/1000

Memory:

              total        used        free      shared  buff/cache   available
Mem:           978M        868M         61M        868K         48M         15M
Swap:            0B          0B          0B

Top Output:

top - 10:31:20 up 22 days,  2:07,  1 user,  load average: 24.29, 16.00, 7.62
Tasks: 124 total,   2 running,  81 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us, 18.0 sy,  0.0 ni,  0.0 id, 80.6 wa,  0.0 hi,  0.2 si,  0.9 st
KiB Mem :  1002124 total,    71044 free,   891008 used,    40072 buff/cache
KiB Swap:        0 total,        0 free,        0 used.    19312 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                  
28294 root      20   0  0.619g 0.014g 0.000g S  6.0  1.4   1:28.08 snapd                                                                                                    
   82 root      20   0  0.000g 0.000g 0.000g S  5.4  0.0   0:45.90 kswapd0                                                                                                  
 1227 root      20   0  0.702g 0.013g 0.000g S  2.6  1.4   1:03.32 ssm-agent-worke                                                                                          
  790 root      20   0  0.686g 0.008g 0.000g S  2.0  0.9   1:06.09 amazon-ssm-agen                                                                                          
11060 ubuntu    20   0  2.298g 0.197g 0.000g S  1.3 20.6   0:38.65 java                                                                                                     
   94 root       0 -20  0.000g 0.000g 0.000g I  0.8  0.0   0:08.22 kworker/0:1H-kb                                                                                          
  832 ubuntu    20   0  2.306g 0.249g 0.000g S  0.6 26.0  27:52.29 java                                                                                                     
   10 root      20   0  0.000g 0.000g 0.000g S  0.4  0.0   0:08.22 ksoftirqd/0                                                                                              
    1 root      20   0  0.215g 0.003g 0.000g D  0.4  0.3   0:24.02 systemd                                                                                                  
  835 root      20   0  0.632g 0.016g 0.000g S  0.4  1.7  14:11.97 containerd                                                                                               
11841 root      20   0  0.024g 0.003g 0.000g R  0.4  0.3   0:02.33 lsb_release                                                                                              
11853 ubuntu    20   0  0.042g 0.001g 0.001g R  0.3  0.1   0:00.47 top                                                                                                      
  393 root      19  -1  0.117g 0.011g 0.000g D  0.2  1.2   0:25.07 systemd-journal                                                                                          
11744 root      39  19  0.192g 0.072g 0.000g D  0.2  7.5   0:21.22 apt-check                                                                                                  
28175 root       0 -20  0.000g 0.000g 0.000g D  0.2  0.0   0:01.83 loop0                                                                                                    
  462 root       0 -20  0.000g 0.000g 0.000g D  0.2  0.0   0:02.05 loop9                                                                                                    
 1099 root      20   0  0.521g 0.126g 0.000g D  0.2 13.2   3:01.21 ruby                                                                                                     
11850 root      20   0  0.063g 0.001g 0.000g D  0.2  0.1   0:00.50 sshd                                                                                                     
11858 root      20   0  0.056g 0.000g 0.000g S  0.1  0.0   0:00.14 cron                                                                                                     
   11 root      20   0  0.000g 0.000g 0.000g I  0.1  0.0   0:13.86 rcu_sched                                                                                                
  789 root      20   0  0.275g 0.001g 0.000g S  0.1  0.1   0:30.52 accounts-daemon                                                                                          
11070 root      20   0  0.000g 0.000g 0.000g I  0.1  0.0   0:00.56 kworker/0:1-eve                                                                                          
11246 ubuntu    20   0  0.103g 0.001g 0.000g S  0.1  0.1   0:00.11 sshd                                                                                                     
  774 root      20   0  0.030g 0.000g 0.000g S  0.0  0.0   0:02.77 cron                                                                                                     
  830 syslog    20   0  0.255g 0.002g 0.000g S  0.0  0.2   0:04.88 rsyslogd                                                                                                 
 1086 root      20   0  0.112g 0.013g 0.000g S  0.0  1.4   1:17.50 ruby 

Both Java processes (the prod and dev backend services) show 2GB+ of virtual memory (VIRT) each, which should not be happening. Their resident memory (RES) is about 0.2GB each, roughly 47% of RAM combined.
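Since EC2 does not publish memory or disk metrics to CloudWatch by default, a small script on the instance could compute them for use as custom metrics. The sketch below parses `/proc/meminfo`-style text and uses `shutil.disk_usage`. The function names are hypothetical, and the disk figure may differ slightly from the `Use%` that `df` reports because of reserved blocks.

```python
import shutil
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}, mostly in kB."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_available_pct(meminfo_text: str) -> float:
    """Percentage of RAM the kernel estimates is still available."""
    info = parse_meminfo(meminfo_text)
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

On the instance, `memory_available_pct(open("/proc/meminfo").read())` would give the live figure. With the totals from the top output above (1002124 KiB total, 19312 KiB available), it comes to just under 2%.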
