<?xml version="1.0"?>
<!DOCTYPE section [
<!ENTITY % entities SYSTEM "entity-decl.ent"> %entities;
]>
<section xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="storage-alarmdefinitions">
<title>Storage Alarms</title>
<para>
These alarms appear under the Storage section of the &productname; &opscon;.
</para>
<section>
<title>SERVICE: OBJECT-STORAGE</title>
<informaltable>
<?dbhtml table-width="99%" ?>
<tgroup cols="2">
<colspec colname="c1" colnum="1" colwidth="1*"/>
<colspec colname="c2" colnum="2" colwidth="2*"/>
<thead>
<row>
<entry>Alarm Information</entry>
<entry>Mitigation Tasks</entry>
</row>
</thead>
<tbody valign="top">
<row>
<entry>
<para>
<emphasis role="bold">Name: swiftlm-scan monitor</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
<literal>swiftlm-scan</literal> cannot execute a monitoring task.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
<literal>swiftlm-scan</literal> program is used to monitor and measure
a number of metrics. If it is unable to monitor or measure something,
it raises this alarm.
</para>
</entry>
<entry>
<para>
Click on the alarm to examine the <literal>Details</literal> field and
look for a <literal>msg</literal> field. The text may explain the
problem. To confirm this, you can also log in to the host specified
by the <literal>hostname</literal> dimension and then run this
command:
</para>
<screen>sudo swiftlm-scan | python -mjson.tool</screen>
<para>
The <literal>msg</literal> field is contained in the
<literal>value_meta</literal> item.
</para>
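<para>
For example, to narrow the output to entries that carry a message, you can
filter for the <literal>msg</literal> key (the grep filter is just one
illustrative approach):
</para>
<screen>sudo swiftlm-scan | python -mjson.tool | grep -B 2 -A 2 '"msg"'</screen>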
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; account replicator last</emphasis>
completed in 12 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if an
<literal>account-replicator</literal> process did not complete a
replication cycle within the last 12 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the <literal>account-replication</literal> process is stuck.
</para>
</entry>
<entry>
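<para>
SSH to the affected host and restart the process (this mirrors the
mitigation for the container and object replicators below; the service
name follows the same pattern):
</para>
<screen>sudo systemctl restart swift-account-replicator</screen>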
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; container replicator last</emphasis>
completed in 12 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a
<literal>container-replicator</literal> process did not complete a
replication cycle within the last 12 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the container-replication process is stuck.
</para>
</entry>
<entry>
<para>
SSH to the affected host and restart the process with this command:
</para>
<screen>sudo systemctl restart swift-container-replicator</screen>
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; object replicator last</emphasis>
completed in 24 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if an
<literal>object-replicator</literal> process did not complete a
replication cycle within the last 24 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the object-replication process is stuck.
</para>
</entry>
<entry>
<para>
SSH to the affected host and restart the process with this command:
</para>
<screen>sudo systemctl restart swift-object-replicator</screen>
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>
&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; configuration file</emphasis>
ownership
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
files/directories in <literal>/etc/swift</literal> are not owned by
&o_objstore;.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> For files in
<literal>/etc/swift</literal>, somebody may have manually edited or
created a file.
</para>
</entry>
<entry>
<para>
For files in <literal>/etc/swift</literal>, use this command to change
the file ownership:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;sudo chown swift.swift /etc/swift/ /etc/swift/*</screen>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; data filesystem ownership</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if files or
directories in <literal>/srv/node</literal> are not owned by &o_objstore;.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> For directories in
<literal>/srv/node/*</literal>, it may happen that the root partition
was reimaged or reinstalled and the UID assigned to the &o_objstore; user
changed. The directories and files would then not be owned by the UID
assigned to the &o_objstore; user.
</para>
</entry>
<entry>
<para>
For directories and files in <filename>/srv/node/*</filename>, compare
the swift UID of this system and other systems and the UID of the owner
of <filename>/srv/node/*</filename>. If possible, make the UID of the
&o_objstore; user match the directories or files. Otherwise, change the
ownership of all files and directories under the
<filename>/srv/node</filename> path using a similar <command>chown
swift.swift</command> command as above.
</para>
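<para>
For example, to compare the UID of the &o_objstore; user with the owner of
the data directories on the affected node (a sketch):
</para>
<screen>id -u swift
stat -c '%u %U %n' /srv/node/*</screen>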
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: Drive URE errors detected</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
<literal>swift-drive-audit</literal> reports an unrecoverable read
error on a drive used by the &o_objstore; service.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> An unrecoverable read
error occurred when &o_objstore; attempted to access a directory.
</para>
</entry>
<entry>
<para>
The UREs reported only apply to file system metadata (that is,
directory structures). For UREs in object files, the &o_objstore; system
automatically deletes the file and replicates a fresh copy from one of
the other replicas.
</para>
<para>
UREs are a normal occurrence on large disk drives. A URE does not mean
that the drive has failed. However, if you get regular UREs on a specific
drive, then this may indicate that the drive has indeed failed and
should be replaced.
</para>
<para>
You can use standard XFS repair actions to correct the UREs in the file
system.
</para>
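<para>
For example, a minimal repair sequence (the file system must be unmounted
before running <command>xfs_repair</command>; the device name is
illustrative):
</para>
<screen>sudo umount /dev/sd<drive_name>
sudo xfs_repair /dev/sd<drive_name></screen>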
<para>
If the XFS repair fails, you should wipe the GPT table as follows
(where <drive_name> is replaced by the actual drive name):
</para>
<screen>&prompt.ardana;sudo dd if=/dev/zero of=/dev/sd<drive_name> \
bs=$((1024*1024)) count=1</screen>
<para>
Then follow the steps below, which will reformat the drive, remount it,
and restart &o_objstore; services on the affected node.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook, specifying the affected node:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts _swift-configure.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
It is safe to reformat drives containing &o_objstore; data because &o_objstore;
maintains other copies of the data (usually, &o_objstore; is configured to
have three replicas of all data).
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a &o_objstore;
process, specified by the <literal>component</literal> field, is not
running.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> A daemon specified by
the <literal>component</literal> dimension on the host specified by the
<literal>hostname</literal> dimension has stopped running.
</para>
</entry>
<entry>
<para>
Examine the <filename>/var/log/swift/swift.log</filename> file for
possible error messages related to the &o_objstore; process. The process in
question is listed in the alarm dimensions in the
<literal>component</literal> dimension.
</para>
<para>
Restart &o_objstore; processes by running the
<filename>swift-start.yml</filename> playbook, with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; start playbook against the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; filesystem mount point</emphasis>
status
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a file
system/drive used by &o_objstore; is not correctly mounted.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The device specified by
the <literal>device</literal> dimension is not correctly mounted at the
mountpoint specified by the <literal>mount</literal> dimension.
</para>
<para>
The most probable cause is that the drive has failed or that it had a
temporary failure during the boot process and remained unmounted.
</para>
<para>
Another possible cause is file system corruption that prevents the
device from being mounted.
</para>
</entry>
<entry>
<para>
Reboot the node and see if the file system remains unmounted.
</para>
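<para>
After the reboot, you can confirm whether the device was mounted (use the
mount point from the alarm's <literal>mount</literal> dimension):
</para>
<screen>mount | grep <mount_point></screen>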
<para>
If the file system is corrupt, see the process used for the "Drive URE
errors" alarm to wipe and reformat the drive.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; uptime-monitor status</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the
swiftlm-uptime-monitor has errors using &o_ident; (<literal>keystone-get-token</literal>),
&o_objstore; (<literal>rest-api</literal>) or &o_objstore;'s healthcheck.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
swiftlm-uptime-monitor cannot get a token from &o_ident; or cannot get a
successful response from the &o_objstore; Object-Storage API.
</para>
</entry>
<entry>
<para>
Check that the &o_ident; service is running:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Check the status of the &o_ident; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts keystone-status.yml</screen>
</step>
<step>
<para>
If it is not running, start the service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts keystone-start.yml</screen>
</step>
<step>
<para>
Contact the support team if you need further assistance
troubleshooting the &o_ident; service.
</para>
</step>
</procedure>
<para>
Check that &o_objstore; is running:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Check the status of the &o_objstore; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
<step>
<para>
If it is not running, start the service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml</screen>
</step>
</procedure>
<para>
Restart the swiftlm-uptime-monitor as follows:
</para>
<procedure>
<step>
<para>
Log in to the first server running the swift-proxy-server service. Use
the playbook below to determine which host this is:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml \
--limit SWF-PRX[0]</screen>
</step>
<step>
<para>
Restart the swiftlm-uptime-monitor with this command:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;sudo systemctl restart swiftlm-uptime-monitor</screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; &o_ident; server connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the &o_ident; service (used for token validation).
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The Identity service
(&o_ident;) server may be down. Another possible cause is that the
network between the host reporting the problem and the &o_ident; server
or the <literal>haproxy</literal> process is not forwarding requests to
&o_ident;.
</para>
</entry>
<entry>
<para>
The <literal>URL</literal> dimension contains the virtual
IP address. Use cURL or a similar program to confirm whether a
connection can be made to the virtual IP address. Check that
<literal>haproxy</literal> is running. Check that the &o_ident; service
is working.
</para>
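<para>
For example (the port shown is the conventional &o_ident; public API port
and is illustrative; substitute the address from the
<literal>URL</literal> dimension):
</para>
<screen>curl -i https://<vip>:5000/v3
sudo systemctl status haproxy</screen>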
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service listening on ip</emphasis>
and port
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a &o_objstore;
service is not listening on the correct IP address or port.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The &o_objstore; service may be
down.
</para>
</entry>
<entry>
<para>
Verify the status of the &o_objstore; service on the affected host, as
specified by the <literal>hostname</literal> dimension.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook to confirm status:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml \
--limit <hostname></screen>
</step>
</procedure>
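<para>
You can also check locally on the affected host which sockets the
&o_objstore; processes are listening on (a sketch):
</para>
<screen>sudo ss -ltnp | grep swift</screen>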
<para>
If you find an issue, you can stop and restart the &o_objstore; service
with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Stop the &o_objstore; service on the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-stop.yml \
--limit <hostname></screen>
</step>
<step>
<para>
Restart the &o_objstore; service on the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; rings checksum</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the &o_objstore; rings
checksums do not match on all hosts.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The &o_objstore; ring files
must be the same on every node. The files are located in
<filename>/etc/swift/*.ring.gz</filename>.
</para>
<para>
If you have just changed any of the rings and you are still deploying
the change, it is normal for this alarm to trigger.
</para>
</entry>
<entry>
<para>
If you have just changed any of your &o_objstore; rings, wait until the
changes complete; the alarm will then likely clear on its own. If it
does not, continue with these steps.
</para>
<para>
Use <command>sudo swift-recon --md5</command> to find which node has
outdated rings.
</para>
<para>
Run the <filename>swift-reconfigure.yml</filename> playbook, using the
steps below. This deploys the same set of rings to every node.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml</screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; memcached server connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the specified memcached server.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The server may be down,
or the memcached daemon running on the server may have stopped.
</para>
</entry>
<entry>
<para>
If the server is down, restart it.
</para>
<para>
If memcached has stopped, you can restart it by using the
<filename>memcached-start.yml</filename> playbook, using the steps
below. If this fails, rebooting the node will restart the process.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the memcached start playbook against the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts memcached-start.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
If the server is running and memcached is running, there may be a
network problem blocking port 11211.
</para>
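<para>
A quick connectivity check from another node (the hostname is a
placeholder):
</para>
<screen>nc -vz <hostname> 11211</screen>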
<para>
If you see sporadic alarms on different servers, the system may be
running out of resources. Contact &serviceteam; for advice.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; individual disk usage
exceeds 80%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a disk drive
used by &o_objstore; exceeds 80% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> Generally, all disk
drives fill at roughly the same rate. If an individual disk drive
fills faster than the others, it can indicate a problem with
the replication process.
</para>
</entry>
<entry>
<para>
If many or most of your disk drives are 80% full, you need to add more
nodes to your system or delete existing objects.
</para>
<para>
If one disk drive is noticeably (more than 30%) more utilized than the
average of other disk drives, check that &o_objstore; processes are working on
the server (use the steps below) and also look for alarms related to
the host. Otherwise continue to monitor the situation.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
</procedure>
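<para>
To compare per-drive utilization on the affected node, a simple check is
(assuming the drives are mounted under <filename>/srv/node</filename>):
</para>
<screen>df -h /srv/node/*</screen>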
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; individual disk usage exceeds
90%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a disk drive
used by &o_objstore; exceeds 90% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> Generally, all disk
drives fill at roughly the same rate. If an individual disk drive
fills faster than the others, it can indicate a problem with
the replication process.
</para>
</entry>
<entry>
<para>
If one disk drive is noticeably (more than 30%) more utilized than the
average of other disk drives, check that &o_objstore; processes are working on
the server, using these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
</procedure>
<para>
Also look for alarms related to the host. An individual disk drive
filling up faster than the others can indicate a problem with the
replication process.
</para>
<para>
Restart &o_objstore; on that host using the <literal>--limit</literal>
argument to target the host:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Stop the &o_objstore; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-stop.yml \
--limit <hostname></screen>
</step>
<step>
<para>
Start the &o_objstore; service back up:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
If the utilization does not return to values similar to those of the
other disk drives, you can reformat the disk drive. Only do this if the
average utilization of all disk drives is less than 80%. To format a
disk drive, contact &serviceteam; for instructions.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; total disk usage exceeds
80%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when the average
disk utilization of &o_objstore; disk drives exceeds 80% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The number and size of
objects in your system are beginning to fill the available disk space.
Account and container storage is included in disk utilization. However,
it generally consumes only 1-2% of the space that objects do, so object
storage is the dominant consumer of disk space.
</para>
</entry>
<entry>
<para>
You need to add more nodes to your system or delete existing objects to
remain under 80% utilization.
</para>
<para>
If you delete a project/account, the objects in that account are not
removed until a week later by the <literal>account-reaper</literal>
process, so this is not a good way of quickly freeing up space.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; total disk usage exceeds
90%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when the average
disk utilization of &o_objstore; disk drives exceeds 90% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The number and size of
objects in your system are beginning to fill the available disk space.
Account and container storage is included in disk utilization. However,
it generally consumes only 1-2% of the space that objects do, so object
storage is the dominant consumer of disk space.
</para>
</entry>
<entry>
<para>
If your disk drives are 90% full, you must immediately stop all
applications that put new objects into the system. At that point you
can either delete objects or add more servers.
</para>
<para>
Using the steps below, set <literal>fallocate_reserve</literal>
to a value higher than the space currently available on the disk
drives. This prevents more objects from being created.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Edit the configuration files below and change the value for
<literal>fallocate_reserve</literal> to a value higher than the
currently available space on the disk drives:
</para>
<screen>~/openstack/my_cloud/config/swift/account-server.conf.j2
~/openstack/my_cloud/config/swift/container-server.conf.j2
~/openstack/my_cloud/config/swift/object-server.conf.j2</screen>
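<para>
For example, the relevant line in each file might look like this (the
value is in bytes and purely illustrative):
</para>
<screen>fallocate_reserve = 10737418240</screen>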
</step>
<step>
<para>
Commit the changes to git:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;git add -A
&prompt.ardana;git commit -a -m "changing &o_objstore; fallocate_reserve value"</screen>
</step>
<step>
<para>
Run the configuration processor:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/openstack/ardana/ansible
&prompt.ardana;ansible-playbook -i hosts/localhost config-processor-run.yml</screen>
</step>
<step>
<para>
Update your deployment directory:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/openstack/ardana/ansible
&prompt.ardana;ansible-playbook -i hosts/localhost ready-deployment.yml</screen>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook to deploy the change:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml</screen>
</step>
</procedure>
<para>
If you allow your file systems to become full, you will be unable to
delete objects or add more nodes to the system. This is because the
system needs some free space to handle the replication process when
adding nodes. With no free space, the replication process cannot work.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service per-minute
availability</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the &o_objstore;
service was reported as unavailable during the previous minute.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
<literal>swiftlm-uptime-monitor</literal> service runs on the first
proxy server. It monitors the &o_objstore; endpoint and reports latency data.
If the endpoint stops responding, the monitor generates this alarm.
</para>
</entry>
<entry>
<para>
There are many reasons why the endpoint may stop responding. Check the
following (example commands appear after this list):
</para>
<itemizedlist>
<listitem>
<para>
Is <literal>haproxy</literal> running on the control nodes?
</para>
</listitem>
<listitem>
<para>
Is <literal>swift-proxy-server</literal> running on the &o_objstore; proxy
servers?
</para>
</listitem>
</itemizedlist>
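<para>
A quick way to check both (run on the relevant nodes; the service names
mirror those used elsewhere in this section):
</para>
<screen>sudo systemctl status haproxy
sudo systemctl status swift-proxy-server</screen>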
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; rsync connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the specified rsync server.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The rsync daemon on the
specified node cannot be contacted. The most probable cause is that the
node is down. The rsync service might also have been stopped on the
node.
</para>
</entry>
<entry>
<para>
Reboot the server if it is down.
</para>
<para>
Attempt to restart rsync with this command:
</para>
<screen>sudo systemctl restart rsync.service</screen>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; smart array controller
status</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if there is a
failure in the Smart Array.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The Smart Array or Smart
HBA controller has a fault, a component of the controller (such as a
battery) has failed, or caching is disabled.
</para>
<para>
The HPE Smart Storage Administrator (HPE SSA) CLI component must be
installed for SSACLI status to be reported. HPE-specific binaries that
are not based on open source are distributed directly by, and supported
by, HPE. To download and install the SSACLI utility, refer to:
<link
xlink:href="https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f"/>
</para>
</entry>
<entry>
<para>