<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter
[
<!ENTITY % entities SYSTEM "entity-decl.ent">
%entities;
]>
<chapter version="5.0" xml:id="cha.admin"
xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink">
<info>
<title>Cluster Management</title>
<dm:docmanager xmlns:dm="urn:x-suse:ns:docmanager">
<dm:bugtracker/>
<dm:translation>yes</dm:translation>
</dm:docmanager>
</info>
<!-- FIXME, mnapp 04/09/18 fill in these sections
<sect1 xml:id="sec.admin.concepts">
<title>Concepts</title>
</sect1>
-->
<sect1 xml:id="sec.admin.kubernetes.install-kubectl">
<title>Interacting With &kube;</title>
<para>
&kube; requires the use of <literal>kubectl</literal> for many tasks.
You can perform most of these actions while logged in to an SSH session on
the master node of your &productname; cluster. <literal>kubectl</literal>
is a pre-installed component of &productname;.
</para>
<para>
The proxy functionality requires <literal>kubectl</literal> to be installed
on your local machine to act as a proxy between the local workstation and the
remote cluster.
</para>
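  <para>
    For example, once <literal>kubectl</literal> is configured on your local
    workstation, you can start a proxy to the remote cluster. The address and
    port shown below are the <literal>kubectl</literal> defaults and serve
    only as an illustration:
  </para>
<screen>&prompt.user;<command>kubectl proxy</command>
Starting to serve on 127.0.0.1:8001</screen>
  <para>
    While the proxy is running, the &kube; API is reachable on the local
    machine under <literal>http://127.0.0.1:8001</literal>.
  </para>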
<important>
    <title>&sle; Desktop 12 SP3 / 15.0 - Installation from PackageHub</title>
<para>
The use of PackageHub is <link xlink:href="https://packagehub.suse.com/support/">exempt from commercial support</link>.
</para>
<para>
If you are using &sle; 12 SP3 or 15.0, you must
<link xlink:href="https://www.suse.com/documentation/sled-15/book_quickstarts/data/sec_modules_installing.html">enable the PackageHub Extension</link>.
</para>
<para>
The instructions are identical for both versions.
</para>
</important>
<tip>
<title>Installing <command>kubectl</command> on Non-SUSE OS or Old Release</title>
<para>
If you are using an operating system other than the current &sle; 12 SP3/15.0
or &opensuse; Tumbleweed/Leap please consult the
<link xlink:href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">
installation instructions</link> from the &kube; project.
</para>
</tip>
<tip>
<title>The KUBECONFIG Variable</title>
<para>
&kubectl; uses an environment variable named <varname>KUBECONFIG</varname>
to locate your &kubeconfig; file. If this variable is not specified, it
defaults to <filename>$HOME/.kube/config</filename>. To use a different
location, run
</para>
<screen>&prompt.user;<command>export KUBECONFIG=<replaceable>/PATH/TO/KUBE/CONFIG/FILE</replaceable></command></screen>
</tip>
<procedure>
<title>Install the <literal>kubectl</literal> package</title>
<step>
<para>
Install the <filename>kubectl</filename> package:
</para>
<screen>&prompt.sudo;<command>zypper in kubectl</command></screen>
</step>
<step>
<para>
      To use <literal>kubectl</literal> to connect to the cluster from a local machine, you must first perform the steps in <xref linkend="sec.admin.security.auth.kubeconfig" /> against the &kube; master node. Download the <filename>.kubeconfig</filename> file from &dashboard; and place it in <filename>~/.kube/config</filename>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_status.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_status.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Verify that <literal>kubectl</literal> was installed and is configured correctly:
</para>
<screen>&prompt.user;<command>kubectl get nodes</command>
NAME              STATUS    ROLES     AGE       VERSION
caasp3-master     Ready     master    1d        v1.9.8
caasp3-worker-1   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-2   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-3   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-4   Ready     &lt;none&gt;    1d        v1.9.8</screen>
<para>
You should see the list of nodes known to &productname;.
</para>
</step>
</procedure>
</sect1>
<sect1 xml:id="sec.admin.salt">
<title>Interacting with Salt</title>
<para>
You can run commands across all nodes in the cluster by running them via
<literal>salt</literal>.
</para>
<para>
Log in to the admin node and run:
</para>
<screen>&prompt.user;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt -P 'roles:(admin|kube-master|kube-minion)' \
cmd.run "<replaceable>df -h</replaceable>"</command>
</screen>
<para>
This command tells <literal>docker</literal> to find the
<literal>salt-master</literal> container and execute the command on all nodes
that match the roles <literal>admin</literal>, <literal>kube-master</literal>,
and <literal>kube-minion</literal> (which is all nodes).
</para>
<para>
Replace the example <command>df -h</command> with a command of your choice.
The output will be produced in your current terminal session.
</para>
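  <para>
    You can also restrict a command to a subset of nodes by adjusting the
    target expression. The following sketch runs <command>uptime</command> on
    the worker nodes only; <command>uptime</command> is merely a placeholder
    for a command of your choice:
  </para>
<screen>&prompt.user;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt -G 'roles:kube-minion' cmd.run "<replaceable>uptime</replaceable>"</command>
</screen>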
<sect2 xml:id="sec.admin.salt.worker_threads">
<title>Adjusting The Number Of Salt Worker Threads</title>
<para>
      It will sometimes be necessary to resize the &kube; cluster to adjust for
      workloads or other factors. Salt will run into problems if the number of
      nodes grows too large without the number of available Salt worker
      threads being adjusted accordingly.
</para>
<para>
For the correct value, refer to
<xref linkend="sec.deploy.requirements.system.cluster.salt_cluster_size"/>.
</para>
<procedure>
<title>Adjust The Salt Worker Count</title>
<step>
<para>
Log in to your admin node via SSH.
</para>
</step>
<step>
<para>
Run the following command to adjust the configured number of workers
(here: <literal>20</literal>).
</para>
        <screen>&prompt.root.admin;<command>echo "worker_threads: <replaceable>20</replaceable>" > /etc/salt/salt-master-custom.conf</command>
</screen>
</step>
<step>
<para>
Find the ID of the &smaster; container.
</para>
        <screen>&prompt.root.admin;<command>saltid=$(docker ps -q -f name=salt-master)</command>
</screen>
</step>
<step>
<para>
And restart the &smaster;.
</para>
<screen>&prompt.root.admin;<command>docker kill $saltid</command>
</screen>
</step>
</procedure>
<para>
Now, Salt will restart and adjust the number of workers in the cluster.
</para>
</sect2>
</sect1>
<sect1 xml:id="sec.admin.nodes">
<title>Node Management</title>
<para>
After you complete the deployment and you bootstrap the cluster, you may
need to perform additional changes to the cluster. By using &dashboard; you
can add additional nodes to the cluster. You can also delete some nodes, but
in that case make sure that you do not break the cluster.
</para>
<sect2 xml:id="sec.admin.nodes.add">
<title>Adding Nodes</title>
<para>
      You may need to add additional &worker_node;s to your cluster. The
      following steps guide you through that procedure:
</para>
<procedure>
<title>Adding Nodes to Existing Cluster</title>
<step>
<para>
Prepare the node as described in
<xref linkend="sec.deploy.nodes.worker_install"/>
</para>
</step>
<step>
<para>
Open &dashboard; in your browser and login.
</para>
</step>
<step>
<para>
You should see the newly added node as a node to be accepted in
<guimenu>Pending Nodes</guimenu>. Click on <guimenu>Accept Node</guimenu>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_pending_nodes.png" format="PNG" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_pending_nodes.png" width="100%" format="png"
/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
          In the <guimenu>Summary</guimenu> you can see a <guimenu>New</guimenu>
          button that appears next to <guimenu>New nodes</guimenu>. Click the
          <guimenu>New</guimenu> button.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_unassigned_nodes.png" width="100%"
format="png"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_unassigned_nodes.png" width="100%"
format="png"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Select the node to be added and click <guimenu>Add nodes</guimenu>.
</para>
</step>
<step>
<para>
The node has been added to your cluster.
</para>
</step>
</procedure>
<sect3 xml:id="sec.admin.nodes.create_autoyast_profile">
<title>The <command>create_autoyast_profile</command> Command</title>
<para>
The <command>create_autoyast_profile</command> command creates an autoyast
profile for fully automatic installation of &productname;. You can use the
following options when invoking the command:
</para>
<variablelist>
<varlistentry>
<term><literal>-o|--output</literal>
</term>
<listitem>
<para>
Specify to which file the command should save the created profile.
</para>
<screen>&prompt.root;<command>create_autoyast_profile -o <replaceable>FILENAME</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--salt-master</literal>
</term>
<listitem>
<para>
Specify the host name of the &smaster;.
</para>
<screen>&prompt.root;<command>create_autoyast_profile --salt-master <replaceable>SALTMASTER</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--smt-url</literal>
</term>
<listitem>
<para>
Specify the URL of the SMT server.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --smt-url <replaceable>SMT_URL</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--regcode</literal>
</term>
<listitem>
<para>
Specify the registration code for &productname;.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --regcode <replaceable>REGISTRATION_CODE</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--reg-email</literal>
</term>
<listitem>
<para>
Specify an e-mail address for registration.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --reg-email <replaceable>E-MAIL_ADDRESS</replaceable></command></screen>
</listitem>
</varlistentry>
</variablelist>
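      <para>
        The options described above can be combined in a single invocation.
        The values below are placeholders, not working defaults:
      </para>
<screen>&prompt.root;<command>create_autoyast_profile -o <replaceable>FILENAME</replaceable> \
--salt-master <replaceable>SALTMASTER</replaceable> \
--smt-url <replaceable>SMT_URL</replaceable> \
--regcode <replaceable>REGISTRATION_CODE</replaceable> \
--reg-email <replaceable>E-MAIL_ADDRESS</replaceable></command></screen>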
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.remove">
<title>Removing Nodes</title>
<warning>
<para>
        If you attempt to remove more nodes than are required for the minimum
        cluster size (3 nodes: 1 master, 2 workers), &dashboard; will display a
        warning. Your cluster will be dysfunctional until you restore the
        minimum number of nodes.
</para>
</warning>
<note>
<para>
        As each node in the cluster also runs an instance of
        <literal>etcd</literal>, &productname; has to ensure that removing
        several nodes does not break the <literal>etcd</literal> cluster. If,
        for example, you have three nodes in the <literal>etcd</literal>
        cluster and you delete two of them, &productname; deletes one node,
        recovers the cluster, and only if the recovery is successful allows the
        next node to be removed. If a node runs just an
        <literal>etcd-proxy</literal>, nothing special has to be done, because
        deleting any number of <literal>etcd-proxy</literal> instances cannot
        break the <literal>etcd</literal>
        cluster.
</para>
</note>
<note>
<para>
If you have only one master node configured, &dashboard; will not allow you
to remove it. You must first add a second master node as a replacement.
</para>
</note>
<procedure>
<step>
<para>
          Log in to &dashboard; on your &productname; Admin node.
Then, click <guimenu>Remove</guimenu> next to the node you wish to remove.
A dialog will ask you to confirm the removal.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_status.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
The cluster will then attempt to remove the node in a controlled manner.
          Progress is indicated by a spinning icon and the words <literal>Pending removal</literal>
          in the location where the <guimenu>Remove</guimenu> button was displayed before.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_pending_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<para>
This should conclude the regular removal process. If the node is successfully
removed, it will disappear from the list after a few moments.
</para>
</step>
<step>
<para>
          In some cases nodes cannot be removed in a controlled manner and must
          be forced out of the cluster. A typical scenario is a machine
          instance that was removed externally or has lost connectivity. In
          such cases, the removal will fail. You then get the option to
          <guimenu>Force remove</guimenu>. A dialog will ask you to confirm the
          removal.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_failed_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<para>
Additionally, a large warning dialog will ask you to confirm the forced
removal. Click <guimenu>Proceed with forcible removal</guimenu> if you
are sure you wish to force the node out of the cluster.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_force_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
</procedure>
</sect2>
<sect2 xml:id="sec.admin.nodes.remove.unassigned">
<!-- FIXME mnapp 2018-07-03, replace terminology and screenshots once
bsc#1100113 has been resolved -->
    <title>Removing Unassigned Nodes</title>
<para>You might run into the situation where you have (accidentally) added
new nodes to a cluster but did not wish to bootstrap them. They are now
registered against the cluster and show up in "Unassigned nodes".
You might have already configured the machine to register with another cluster
and want to clean out this entry from the "Unassigned Nodes" view.
You must perform the following steps:
</para>
<procedure>
<step>
<para>
          Find the "Unassigned nodes" line in the overview and click <guimenu>(new)</guimenu>
          next to the count. You will be shown the "Unassigned Nodes" view
          where all the unassigned nodes are listed. Make sure that you first assign
          roles to all nodes that you wish to keep and proceed with bootstrapping them.
          Once the list only shows the nodes you are sure you want to remove, copy
          the ID of the node you wish to drop.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_unassigned_nodes.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
          Log in to the Admin node of your cluster via SSH.
</para>
</step>
<step>
<para>
Run the following command and replace <replaceable>$ID_FROM_UNASSIGNED_QUEUE</replaceable>
with the node ID that you copied from the "Unassigned nodes" view in &dashboard;.
</para>
<warning>
<para>
            Make absolutely sure that the node ID you have copied belongs to the
            node you wish to drop. This command is <emphasis>irreversible</emphasis>
            and will remove the specified node from the cluster without confirmation.
</para>
</warning>
<screen>&prompt.root;<command>docker exec -it $(docker ps | grep "velum-dashboard" | awk '{print $1}') \
entrypoint.sh bundle exec rails runner 'puts Minion.find_by(minion_id: "<replaceable>$ID_FROM_UNASSIGNED_QUEUE</replaceable>").destroy'</command>
</screen>
</step>
</procedure>
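    <para>
      If you are unsure about the ID, you can list the IDs of all registered
      nodes before destroying one. This sketch assumes the
      <literal>Minion</literal> model used in the command above and standard
      ActiveRecord behavior:
    </para>
<screen>&prompt.root;<command>docker exec -it $(docker ps | grep "velum-dashboard" | awk '{print $1}') \
entrypoint.sh bundle exec rails runner 'puts Minion.pluck(:minion_id)'</command>
</screen>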
</sect2>
</sect1>
<sect1 xml:id="sec.admin.nodes.graceful_shutdown">
<title>Graceful Shutdown and Startup</title>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.overview">
<title>Overview</title>
<para>
      &kube;, being a self-healing solution, tries to keep all pods and
      services available. In general, this is one of its core features and
      desired functions. But it is important to take this into account if
      you are doing a complete shutdown of the infrastructure.
</para>
<para>
There are two ways of shutting down the whole cluster: Shut down
and start all nodes at once or restart them sequentially in
segments. In both cases, &productname; expects that IP addresses do
not change after the restart, even when using dynamic IP addresses.
</para>
<para>
When restarting segments of nodes, it is possible to avoid
downtime.
</para>
<note>
<title>Deviating from Shutdown and Startup Procedures</title>
<para>
The procedures described in this section are recommended to
reduce logged errors. However, it is possible to not follow this
order as long as all nodes are stopped in a graceful way.
</para>
</note>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.nodes">
<title>Node Types</title>
<para>
For shutting down and starting nodes, three different types of nodes
are relevant:
</para>
<itemizedlist>
<listitem>
<para>
The &admin_node; contains state and needs to be shut down in a graceful
way to ensure that all state has been synced to disk in a clean way.
</para>
</listitem>
<listitem>
<para>
Nodes with <literal>etcd</literal> contain state and also need to be shut
down in a graceful way. They will usually be a subset of the master nodes.
But it can happen that some workers run <literal>etcd</literal> members.
</para>
</listitem>
<listitem>
<para>
The rest (masters and workers not running <literal>etcd</literal>
members): These nodes contain local state possibly created by
applications running on top of the cluster. They need to be
shut down in a graceful way too, when possible.
</para>
</listitem>
</itemizedlist>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.complete">
<title>Complete Shutdown</title>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.complete.shutdown">
<title>Shutting Down</title>
<para>
All commands are executed on the admin node.
</para>
<procedure>
<step>
<para>
Disable scheduling on the whole cluster. This will avoid
&kube; rescheduling jobs while you are shutting down nodes.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes -o name | xargs -I{} kubectl cordon {}</command></screen>
</step>
<step>
<para>
Gracefully shut down all worker nodes.
</para>
<screen>&prompt.root.admin;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt --async -G 'roles:kube-minion' cmd.run 'systemctl poweroff'</command></screen>
</step>
<step>
<para>
Gracefully shut down all master nodes.
</para>
<screen>&prompt.root.admin;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt --async -G 'roles:kube-master' cmd.run 'systemctl poweroff'</command></screen>
</step>
<step>
<para>
Shut down the &admin_node;:
</para>
<screen>&prompt.root.admin;<command>systemctl poweroff</command></screen>
</step>
</procedure>
</sect3>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.complete.startup">
<title>Starting Up</title>
<note>
<title><literal>kubectl</literal> Needs Master Nodes To Function</title>
<para>
<command>kubectl</command> requires use of the &kube; API hosted on the
master nodes. Therefore, until at least some of the master nodes have
started successfully, you will see error messages of the type
<literal>HTTP 503</literal>.
</para>
<screen>Error from server (InternalError): an error on the server
("&lt;html&gt;&lt;body&gt;&lt;h1&gt;503 Service Unavailable&lt;/h1&gt;\nNo server is available
to handle this request.\n&lt;/body&gt;&lt;/html&gt;") has prevented the request
from succeeding (get nodes)</screen>
</note>
<procedure>
<step>
<para>
Start the &admin_node; up. All commands are executed on the
&admin_node;.
</para>
</step>
<step>
<para>
          Once the admin node is up, start the master nodes. Keep checking
the status of the master nodes. Continue as soon as all master nodes are
<literal>Ready</literal>.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes</command>
NAME       STATUS                        ROLES     AGE       VERSION
master-0   Ready,SchedulingDisabled      master    21h       v1.9.8
master-1   Ready,SchedulingDisabled      master    21h       v1.9.8
master-2   Ready,SchedulingDisabled      master    21h       v1.9.8
worker-0   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-1   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-2   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-3   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-4   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8</screen>
</step>
<step>
<para>
Continue by starting all the worker nodes. Keep checking the
status of the nodes. Continue when all nodes are <literal>Ready</literal>.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes</command>
NAME       STATUS                     ROLES     AGE       VERSION
master-0   Ready,SchedulingDisabled   master    21h       v1.9.8
master-1   Ready,SchedulingDisabled   master    21h       v1.9.8
master-2   Ready,SchedulingDisabled   master    21h       v1.9.8
worker-0   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-1   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-2   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-3   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-4   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8</screen>
</step>
<step>
<para>
Uncordon all nodes so they can receive new workloads:
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes -o name | xargs -I{} kubectl uncordon {}</command></screen>
</step>
</procedure>
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.segmented">
<title>Segmented Reboots</title>
<para>
A sequential reboot of cluster segments is a way to completely
avoid the downtime of services or at least reduce it as much as
possible. However, downtime of services occurs if:
</para>
<itemizedlist>
<listitem>
<para>
All pods of a service are forced on one node
</para>
</listitem>
<listitem>
<para>
A pod has only one replica
</para>
</listitem>
</itemizedlist>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.segmented.worker">
<title>Rebooting Worker Nodes</title>
<para>
The number of worker nodes to reboot at once depends on the number
of total worker nodes and their labels.
</para>
<para>
For example: If there are 5 worker nodes with 2 of them having the label
<literal>diskType: ssd</literal>, then the two nodes with SSDs must not be
shut down at the same time.
</para>
<para>
The size of segments for simultaneous reboots depends on the
topology of the cluster and the workload. We recommend to use
small segment sizes. This makes it less likely that all nodes
running replicas of the same pod are shut down at the same time.
</para>
<para>
        During this migration period, the worker nodes need to be able
        to reach the master nodes at all times. This includes master nodes
        that have already been rebooted as well as those that have not.
</para>
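      <para>
        Before rebooting a worker node, you can move its workloads away in a
        controlled fashion with <command>kubectl drain</command>.
        <replaceable>NODE</replaceable> is a placeholder for the node name as
        shown by <command>kubectl get nodes</command>:
      </para>
<screen>&prompt.root.admin;<command>kubectl drain <replaceable>NODE</replaceable> --ignore-daemonsets</command></screen>
      <para>
        After the node has rebooted and reports <literal>Ready</literal> again,
        re-enable scheduling on it:
      </para>
<screen>&prompt.root.admin;<command>kubectl uncordon <replaceable>NODE</replaceable></command></screen>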
</sect3>
<sect3>
<title>Rebooting Master Nodes</title>
<para>
Master nodes should not run user workloads. This means that the
decision to batch the reboots of master nodes depends on whether
you want to keep control of the cluster while the reboot is
taking place.
</para>
<para>
If all the master nodes disappear at the same time, the worker
nodes continue serving the services they are running. No further operation
will take place on the worker nodes, since they cannot contact an
<literal>apiserver</literal> to discover new workloads or perform any other
operations.
</para>
<para>
It is safe to choose batches as desired. Rebooting one by one is
the safest, two by two is generally safe too. For larger batches
than two, certain cluster services, for example
<literal>dex</literal>, could be completely shut down.
</para>
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.etcd">
<title>Behavior of <literal>etcd</literal></title>
<para>
<literal>etcd</literal> is a distributed key-value store. Some
nodes on the cluster run <literal>etcd</literal> members that
sync with other peers in order to provide a fault-tolerant storage
that &kube; uses for persistence.
</para>
<para>
<literal>etcd</literal> is the central component where &kube; reads and
writes in order to have global knowledge about the cluster status
and desired state.
</para>
<para>
      It is important to note that <literal>etcd</literal> automatically
      recovers from temporary failures such as machine reboots.
</para>
<para>
      <literal>etcd</literal> knows how many peers form the
<literal>etcd</literal> cluster; based on this information the
<literal>etcd</literal> cluster can be in three different states:
healthy, degraded or unavailable.
</para>
<variablelist>
<varlistentry>
<term>Healthy</term>
<listitem>
<para>
All <literal>etcd</literal> members are working as expected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Degraded</term>
<listitem>
<para>
            Some <literal>etcd</literal> members are not working as
            expected, but a majority of them still are. This means the
            cluster is still working, because it has quorum.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Unavailable</term>
<listitem>
<para>
There is no working majority of peers. The cluster is not
available and cannot be used because the quorum is lost.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
Whether <literal>etcd</literal> is available or not depends on how many
<literal>etcd</literal> members are available/not available at a given
moment. It is important to differentiate between transient and
permanent failures. Transient failures happen when a member is
temporarily not available, for example when a machine running one
<literal>etcd</literal> member is rebooting. Permanent failures
happen when a member was irrevocably lost, for example a machine
hard disk failure. The <literal>etcd</literal> cluster can tolerate
up to (N - 1) / 2 permanent failures, where N is the number of
<literal>etcd</literal> members; a subset of masters and possibly
workers. The number of etcd nodes must always maintain
<literal>Majority</literal> quorum.
</para>
<para>
      <literal>Majority</literal> means that the number of available etcd
      cluster members must always be greater than the number of unavailable
      members. If, for example, you have only <literal>1</literal> or
      <literal>2</literal> etcd members, the cluster has a fault tolerance of
      <literal>0</literal>, because not a single member can fail without the
      cluster losing <literal>Majority</literal>.
</para>
<para>
      If you have <literal>6</literal> nodes, a maximum of <literal>2</literal>
      nodes can become faulty while the cluster remains in a degraded but
      working state. If <literal>3</literal> or more nodes fail, there is no
      longer a majority of working nodes, and the cluster becomes unavailable.
</para>
<para>
For example: The fault tolerance of a cluster with <literal>7</literal>
nodes is <literal>3</literal>, because you need at least <literal>4</literal>
active nodes to maintain majority.
</para>
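    <para>
      The fault tolerance for a given member count can be computed directly as
      (N - 1) / 2 using integer division, as illustrated by this small shell
      loop (a standalone illustration, not part of &productname;):
    </para>
<screen>&prompt.user;<command>for n in 1 2 3 5 6 7; do echo "$n members: fault tolerance $(( (n - 1) / 2 ))"; done</command>
1 members: fault tolerance 0
2 members: fault tolerance 0
3 members: fault tolerance 1
5 members: fault tolerance 2
6 members: fault tolerance 2
7 members: fault tolerance 3</screen>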
<para>
When (N - 1) / 2 or fewer permanent failures happen in a given
<literal>etcd</literal> cluster, the cluster still has a quorum. It
is then possible to remove the faulty members and add new ones. The
new members will synchronize with the existing ones. This does not
require an explicit backup/restore procedure, as it is normal
<literal>etcd</literal> operation.
</para>
<para>
When more than (N - 1) / 2 permanent failures happen in a given
<literal>etcd</literal> cluster, the quorum is lost irrevocably.
That means that there is no way to recover from that situation,
because it is no longer possible to remove faulty members or add
new members. In this case, it is necessary to start a new
<literal>etcd</literal> cluster from a backup, and grow it.
</para>
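    <para>
      To inspect the current health of the <literal>etcd</literal> cluster, you
      can query a member directly. The exact invocation depends on the
      <literal>etcd</literal> version and the certificate paths in your
      deployment; the following is a sketch for the v3 API with placeholder
      certificate locations:
    </para>
<screen>&prompt.root;<command>ETCDCTL_API=3 etcdctl \
--cacert <replaceable>/PATH/TO/CA.crt</replaceable> \
--cert <replaceable>/PATH/TO/CLIENT.crt</replaceable> \
--key <replaceable>/PATH/TO/CLIENT.key</replaceable> \
endpoint health</command></screen>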
</sect2>
</sect1>
<sect1 xml:id="sec.admin.scale_cluster">
<title>Scaling the Cluster</title>
<para>
    The default maximum number of nodes in a cluster is 40. The Salt
    Master configuration needs to be adjusted to handle installation and
    updating of a larger cluster:
</para>
<table>
<title>Node Count and Salt Worker Threads</title>
<tgroup cols="2">
<thead>
<row>
<entry>
<para>
Nodes
</para>
</entry>
<entry>
<para>
Salt Worker Threads
</para>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<para>
>40
</para>
</entry>
<entry>
<para>
20
</para>
</entry>
</row>
<row>
<entry>
<para>
>60
</para>
</entry>
<entry>
<para>
30
</para>
</entry>
</row>
<row>
<entry>
<para>
>75
</para>
</entry>
<entry>
<para>
40
</para>
</entry>
</row>
<row>
<entry>
<para>
>85
</para>
</entry>
<entry>
<para>
50
</para>
</entry>
</row>
<row>
<entry>
<para>
>95
</para>
</entry>
<entry>
<para>
60
</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
To change the variable in the &smaster; configuration, run the
following on the &admin_node;:
</para>
<screen>&prompt.root;<command>echo "worker_threads: 20" > /etc/caasp/salt-master-custom.conf</command>
&prompt.root;<command>docker restart $(docker ps | grep salt-master | awk '{print $1}')</command></screen>
<para>
&smaster; will be automatically restarted by kubelet.
</para>
<para>
    If bootstrapping fails, you can check whether the number of Salt
    worker threads is configured too low:
</para>
  <screen>&prompt.root;<command>docker logs $(docker ps | grep salt-master | \
awk '{print $1}') 2>&amp;1 | grep -i worker_threads</command></screen>
</sect1>
<sect1 xml:id="sec.admin.velum.registry">
<title>Configuring Remote Container Registry</title>
<para>
    A remote registry allows your cluster to pull container images from a
    registry other than the default one. This is commonly used in cases where
    a &productname; cluster is not allowed to have direct access to the
    internet. You can create a local registry with the images that you will
    need and add the information for that registry here. If the registry uses
    a self-signed certificate, the certificate can be added here to establish
    trust between &kube; and the registry.
</para>
<para>
By default, the &suse; container registry is configured as the only remote
registry and has the name <literal>SUSE</literal>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_registry_overview.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_registry_overview.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<sect2 xml:id="sec.admin.velum.registry.add">
<title>Adding A Remote Registry</title>
<procedure>
<step>
<para>
Log in to &dashboard; and navigate to
<guimenu>Settings → Remote Registries</guimenu>.
</para>
</step>
<step>
<para>
Click on <guimenu>Add Remote Registry</guimenu> to add a new remote
registry configuration.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_remote_registry.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_remote_registry.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Fill in the options for the new registry.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_new_registry.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_new_registry.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<variablelist>
<varlistentry>
<term>Name</term>
<listitem>
<para>
Define a name for the registry.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>URL</term>
<listitem>