Operations Documentation (#6)
* New operations documentation
enolfc committed Mar 3, 2020
1 parent 458b9ed commit 457c15e
Showing 3 changed files with 359 additions and 10 deletions.
14 changes: 6 additions & 8 deletions source/technical.rst → source/architecture.rst
@@ -1,5 +1,5 @@
-Service Architecture
---------------------
+Internal Service Architecture
+-----------------------------

The EGI Notebooks service relies on the following technologies to provide its
functionality:
@@ -18,16 +18,13 @@ functionality:

* `Prometheus <https://prometheus.io/>`_ for monitoring resource consumption.

-* Specific EGI hooks for `monitoring <https://github.com/EGI-Foundation/egi-notebooks-monitoring>`_
-  and `accounting <https://github.com/EGI-Foundation/egi-notebooks-accounting>`_.
+* Specific EGI hooks for `monitoring <https://github.com/EGI-Foundation/egi-notebooks-monitoring>`_,
+  `accounting <https://github.com/EGI-Foundation/egi-notebooks-accounting>`_
+  and `backup <https://github.com/EGI-Foundation/egi-notebooks-backup>`_.

* VO-Specific storage/Big data facilities or any pluggable tools into the
notebooks environment can be added to community specific instances.

-.. image:: /_static/egi_notebooks_architecture.png
-
-.. [[File:EGI_Notebooks_Stack.png|center|650px|EGI Notebooks Achitecture]]
Kubernetes
::::::::::

@@ -76,6 +73,7 @@ EGI Customisations
EGI Notebooks is deployed as a set of customisations of the `JupyterHub helm
charts <https://jupyterhub.github.io/helm-chart/>`_.

+.. image:: /_static/egi_notebooks_architecture.png

Authentication
==============
5 changes: 3 additions & 2 deletions source/index.rst
@@ -34,9 +34,10 @@ or any additional service requests.
data
integration
communities
-technical
-training
faq
+training
+architecture
+operations

.. to be added back
customisation
350 changes: 350 additions & 0 deletions source/operations.rst
@@ -0,0 +1,350 @@
Service Operations
------------------

In this section you can find the common operational activities needed to keep
the service available to our users.

Initial set-up
==============

Notebooks VO
::::::::::::

The resources used for the Notebooks deployments are managed with the
``vo.notebooks.egi.eu`` VO. Operators of the service should join the VO; check
its entry at the `operations portal <https://operations-portal.egi.eu/vo/view/voname/vo.notebooks.egi.eu>`_
and at `AppDB <https://appdb.egi.eu/store/vo/vo.notebooks.egi.eu>`_.

Clients installation
::::::::::::::::::::

In order to manage the resources you will need these tools installed
on your client machine (an installation sketch follows the list):

* ``egicli`` for discovering sites and managing tokens,

* ``terraform`` to create the VMs at the providers,

* ``ansible`` to configure the VMs and install kubernetes at the providers,

* ``terraform-inventory`` to get the list of hosts to use from terraform.
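
A minimal installation sketch, assuming ``egicli`` and ``ansible`` are
available from PyPI and that ``terraform`` and ``terraform-inventory`` are
fetched as binaries from their upstream release pages; adapt versions and
paths to your environment:

.. code-block:: shell

   # egicli and ansible are distributed as Python packages (PyPI names assumed)
   $ pip install --user egicli ansible

   # terraform and terraform-inventory ship as single binaries; download them
   # from their upstream release pages and put them somewhere on your PATH

   # quick sanity check
   $ egicli --help
   $ terraform version
   $ ansible --version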

Get the configuration repo
::::::::::::::::::::::::::

All the configuration of the notebooks is stored in a git repository available
on Keybase. You will need to be a member of the ``opslife`` Keybase team to
access it. Start by cloning the repo:

.. code-block:: shell

   $ git clone keybase://team/opslife/egi-notebooks

Kubernetes
==========

We use ``terraform`` and ``ansible`` to build the cluster at one of the EGI Cloud
providers. If you are building the cluster for the first time, create a new
directory in your local git repository from the template, add it to the
repo, and get ``terraform`` ready:

.. code-block:: shell

   $ cp -a template <new provider>
   $ git add <new provider>
   $ cd <new provider>/terraform
   $ terraform init

Using ``egicli`` you can get the list of projects and their IDs
for a given site:

.. code-block:: shell

   $ egicli endpoint projects --site CESGA
   id                                Name                 enabled    site
   --------------------------------  -------------------  ---------  ------
   3a8e9d966e644405bf19b536adf7743d  vo.access.egi.eu     True       CESGA
   916506ac136741c28e4326975eef0bff  vo.emso-eric.eu      True       CESGA
   b1d2ef2cc2284c57bcde21cf4ab141e3  vo.nextgeoss.eu      True       CESGA
   eb7ff20e603d471cb731bdb83a95a2b5  fedcloud.egi.eu      True       CESGA
   fcaf23d103c1485694e7494a59ee5f09  vo.notebooks.egi.eu  True       CESGA

And with the project ID, you can obtain all the environment variables needed
to interact with the OpenStack APIs of the site:

.. code-block:: shell

   $ eval "$(egicli endpoint env --site CESGA --project-id fcaf23d103c1485694e7494a59ee5f09)"

Now you are ready to use ``openstack`` or ``terraform`` at the site. The token
obtained is valid for 1 hour; you can refresh it at any time with:

.. code-block:: shell

   $ eval "$(egicli endpoint token --site CESGA --project-id fcaf23d103c1485694e7494a59ee5f09)"

First get the network IDs and pool to use for the site:

.. code-block:: shell

   $ openstack network list
   +--------------------------------------+-------------------------+--------------------------------------+
   | ID                                   | Name                    | Subnets                              |
   +--------------------------------------+-------------------------+--------------------------------------+
   | 1aaf20b6-47a1-47ef-972e-7b36872f678f | net-vo.notebooks.egi.eu | 6465a327-c261-4391-a0f5-d503cc2d43d3 |
   | 6174db12-932f-4ee3-bb3e-7a0ca070d8f2 | public00                | 6af8c4f3-8e2e-405d-adea-c0b374c5bd99 |
   +--------------------------------------+-------------------------+--------------------------------------+

In this case we will use ``public00`` as the pool for public IPs and
``1aaf20b6-47a1-47ef-972e-7b36872f678f`` as the network ID. Check with the
provider which network is the right one to use. Use these values in the
``terraform.tfvars`` file:

.. code-block:: terraform

   ip_pool = "public00"
   net_id = "1aaf20b6-47a1-47ef-972e-7b36872f678f"

You may want to check which flavors are right for your VMs and adapt other
variables in ``terraform.tfvars``. To get a list of flavors you can use:

.. code-block:: shell

   $ openstack flavor list
   +--------------------------------------+----------------+-------+------+-----------+-------+-----------+
   | ID                                   | Name           | RAM   | Disk | Ephemeral | VCPUs | Is Public |
   +--------------------------------------+----------------+-------+------+-----------+-------+-----------+
   | 26d14547-96f2-4751-a686-f89a9f7cd9cc | cor4mem8hd40   |  8192 |   40 |         0 |     4 | True      |
   | 42eb9c81-e556-4b63-bc19-4c9fb735e344 | cor2mem2hd20   |  2048 |   20 |         0 |     2 | True      |
   | 4787d9fc-3923-4fc9-b770-30966fc3baee | cor4mem4hd40   |  4096 |   40 |         0 |     4 | True      |
   | 58586b06-7b9d-47af-b9d0-e16d49497d09 | cor24mem62hd60 | 63488 |   60 |         0 |    24 | True      |
   | 635c739a-692f-4890-b8fd-d50963bff00e | cor1mem1hd10   |  1024 |   10 |         0 |     1 | True      |
   | 6ba0080d-d71c-4aff-b6f9-b5a9484097f8 | small          |   512 |    2 |         0 |     1 | True      |
   | 6e514065-9013-4ce1-908a-0dcc173125e4 | cor2mem4hd20   |  4096 |   20 |         0 |     2 | True      |
   | 85f66ce6-0b66-4889-a0bf-df8dc23ee540 | cor1mem2hd10   |  2048 |   10 |         0 |     1 | True      |
   | c4aa496b-4684-4a86-bd7f-3a67c04b1fa6 | cor24mem50hd50 | 51200 |   50 |         0 |    24 | True      |
   | edac68c3-50ea-42c2-ae1d-76b8beb306b5 | test-bigHD     |  4096 |  237 |         0 |     2 | True      |
   +--------------------------------------+----------------+-------+------+-----------+-------+-----------+

Finally, ensure your public SSH key is also listed in the ``cloud-init.yaml``
file; then you are ready to deploy the cluster with:

.. code-block:: shell

   $ terraform apply

Once your VMs are up and running, it is time to get Kubernetes configured and
running with ansible:

.. code-block:: shell

   $ cd ..  # you should be now in <new provider>
   $ ANSIBLE_TRANSFORM_INVALID_GROUP_CHARS=silently TF_STATE=./terraform \
       ansible-playbook --inventory-file=$(which terraform-inventory) \
       playbooks/k8s.yaml

Interacting with the cluster
::::::::::::::::::::::::::::

As the master will be on a private IP, you won't be able to interact with it
directly, but you can still ssh into the VM using the ingress node as a gateway
host (you can get the different hosts with
``TF_STATE=./terraform terraform-inventory --inventory``):

.. code-block:: shell

   $ ssh -o ProxyCommand="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -W %h:%p -q egi@<ingress ip>" \
         -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null egi@<master ip>
   egi@k8s-master:~$ kubectl get nodes
   NAME            STATUS   ROLES    AGE   VERSION
   k8s-master      Ready    master   33m   v1.15.7
   k8s-nfs         Ready    <none>   16m   v1.15.7
   k8s-w-ingress   Ready    <none>   16m   v1.15.7
   egi@k8s-master:~$ helm list
   NAME             REVISION  UPDATED                   STATUS    CHART                         APP VERSION  NAMESPACE
   certs-man        2         Wed Jan 8 15:56:58 2020   DEPLOYED  cert-manager-v0.11.0          v0.11.0      cert-manager
   cluster-ingress  3         Wed Jan 8 15:56:53 2020   DEPLOYED  nginx-ingress-1.7.0           0.24.1       kube-system
   nfs-provisioner  3         Wed Jan 8 15:56:43 2020   DEPLOYED  nfs-client-provisioner-1.2.8  3.1.0        kube-system

Modifying/Destroying the cluster
::::::::::::::::::::::::::::::::

You should be able to change the number of workers in the cluster, re-apply
terraform to start them, and then run the playbook again to get them added to
the cluster, as in the sketch below.
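
A possible sequence, assuming the number of workers is exposed as a variable
in ``terraform.tfvars`` (the exact variable name depends on the template):

.. code-block:: shell

   # after raising the worker count in terraform.tfvars
   $ cd <provider>/terraform
   $ terraform apply

   # then re-run the playbook so the new VMs join the cluster
   $ cd ..
   $ ANSIBLE_TRANSFORM_INVALID_GROUP_CHARS=silently TF_STATE=./terraform \
       ansible-playbook --inventory-file=$(which terraform-inventory) \
       playbooks/k8s.yaml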

Any changes to the master, NFS or ingress VMs should be done carefully, as
those will probably break the configuration of the Kubernetes cluster and of
any applications running on top.

.. TODO: remove nodes?
.. TODO: update master/ingress/nfs

Destroying the cluster can be done with a single command:

.. code-block:: shell

   $ terraform destroy

Notebooks deployments
=====================

Once the k8s cluster is up and running, you can deploy a notebooks instance.
For each deployment you should create a file in the ``deployments`` directory
following the template provided:

.. code-block:: shell

   $ cp deployments/hub.yaml.template deployments/hub.yaml

Each deployment will need a domain name pointing to your ingress host; you
can create one at the `FedCloud dynamic DNS service <https://nsupdate.fedcloud.eu/>`_.
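
A quick way to verify that the name resolves to your ingress node before
continuing (``<your host domain name>`` is a placeholder):

.. code-block:: shell

   $ host <your host domain name>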

Then you will need to create an OpenID Connect client for EGI Check-in to
authorise users into the new deployment. You can create a client by going to the
`Check-in demo OIDC clients management <https://aai-demo.egi.eu/oidc/manage/admin/clients>`_.
Use the following as redirect URL: ``https://<your host domain name>/hub/oauth_callback``.

In the ``Access`` tab, add ``offline_access`` to the list of scopes. Save the
client and take note of the client ID and client secret for later.

Finally, you will also need three different random strings, generated with
``openssl rand -hex 32``, that will be used as secrets in the file describing
the deployment.
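
For instance, the three values can be generated in one go and then copied into
the deployment file:

.. code-block:: shell

   $ for i in 1 2 3; do openssl rand -hex 32; done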

Go and edit the deployment description file to add this information (search for
``# FIXME NEEDS INPUT`` in the file to quickly get there).
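
For example, to list the lines that still need input in the ``hub.yaml``
created above:

.. code-block:: shell

   $ grep -n "FIXME NEEDS INPUT" deployments/hub.yaml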

For deploying the notebooks instance we will also use ``ansible``:

.. code-block:: shell

   $ ANSIBLE_TRANSFORM_INVALID_GROUP_CHARS=silently TF_STATE=./terraform ansible-playbook \
       --inventory-file=$(which terraform-inventory) playbooks/notebooks.yaml

The first deployment attempt may fail with a timeout while the required
container images are downloaded. You can retry after a while to re-deploy.
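
Retrying is simply a matter of running the same playbook again:

.. code-block:: shell

   $ ANSIBLE_TRANSFORM_INVALID_GROUP_CHARS=silently TF_STATE=./terraform ansible-playbook \
       --inventory-file=$(which terraform-inventory) playbooks/notebooks.yaml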

On the master you can check the status of your deployment (the name of the
deployment will be the same as the name of your local deployment file):

.. code-block:: shell

   $ helm status hub
   LAST DEPLOYED: Thu Jan 9 08:14:49 2020
   NAMESPACE: hub
   STATUS: DEPLOYED

   RESOURCES:
   ==> v1/ServiceAccount
   NAME            SECRETS  AGE
   hub             1        6m46s
   user-scheduler  1        3m34s

   ==> v1/Service
   NAME          TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)                     AGE
   hub           ClusterIP  10.100.77.129  <none>       8081/TCP                    6m46s
   proxy-public  NodePort   10.107.127.44  <none>       443:32083/TCP,80:30581/TCP  6m45s
   proxy-api     ClusterIP  10.103.195.6   <none>       8001/TCP                    6m45s

   ==> v1/ConfigMap
   NAME            DATA  AGE
   hub-config      4     6m47s
   user-scheduler  1     3m35s

   ==> v1/PersistentVolumeClaim
   NAME        STATUS   VOLUME  CAPACITY  ACCESS MODES  STORAGECLASS         AGE
   hub-db-dir  Pending                                  managed-nfs-storage  6m46s

   ==> v1/ClusterRole
   NAME                              AGE
   hub-user-scheduler-complementary  3m34s

   ==> v1/ClusterRoleBinding
   NAME                              AGE
   hub-user-scheduler-base           3m34s
   hub-user-scheduler-complementary  3m34s

   ==> v1/RoleBinding
   NAME  AGE
   hub   6m46s

   ==> v1/Pod(related)
   NAME                            READY  STATUS   RESTARTS  AGE
   continuous-image-puller-flf5t   1/1    Running  0         3m34s
   continuous-image-puller-scr49   1/1    Running  0         3m34s
   hub-569596fc54-vjbms            0/1    Pending  0         3m30s
   proxy-79fb6d57c5-nj8n2          1/1    Running  0         2m22s
   user-scheduler-9685d654b-9zt5d  1/1    Running  0         3m30s
   user-scheduler-9685d654b-k8v9p  1/1    Running  0         3m30s

   ==> v1/Secret
   NAME        TYPE    DATA  AGE
   hub-secret  Opaque  3     6m47s

   ==> v1/DaemonSet
   NAME                     DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
   continuous-image-puller  2        2        2      2           2          <none>         3m34s

   ==> v1/Deployment
   NAME            DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
   hub             1        1        1           0          6m45s
   proxy           1        1        1           1          6m45s
   user-scheduler  2        2        2           2          3m32s

   ==> v1/StatefulSet
   NAME              DESIRED  CURRENT  AGE
   user-placeholder  0        0        6m44s

   ==> v1beta1/Ingress
   NAME        HOSTS                                 ADDRESS  PORTS    AGE
   jupyterhub  notebooktest.fedcloud-tf.fedcloud.eu           80, 443  6m44s

   ==> v1beta1/PodDisruptionBudget
   NAME              MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
   hub               1              N/A              0                    6m48s
   proxy             1              N/A              0                    6m48s
   user-placeholder  0              N/A              0                    6m48s
   user-scheduler    1              N/A              1                    6m47s

   ==> v1/Role
   NAME  AGE
   hub   6m46s

   NOTES:
   Thank you for installing JupyterHub!

   Your release is named hub and installed into the namespace hub.

   You can find if the hub and proxy is ready by doing:

    kubectl --namespace=hub get pod

   and watching for both those pods to be in status 'Running'.

   You can find the public IP of the JupyterHub by doing:

    kubectl --namespace=hub get svc proxy-public

   It might take a few minutes for it to appear!

   Note that this is still an alpha release! If you have questions, feel free to
   1. Read the guide at https://z2jh.jupyter.org
   2. Chat with us at https://gitter.im/jupyterhub/jupyterhub
   3. File issues at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues

Updating a deployment
:::::::::::::::::::::

Just edit the deployment description file and run ansible again; the helm
release will be upgraded on the cluster.
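
One way to confirm the upgrade went through is to check the release history on
the master (``helm history`` is a standard helm command for this):

.. code-block:: shell

   egi@k8s-master:~$ helm history hub
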
.. TODO:
   prometheus
   grafana
   accounting
   backups
   capacity management
   share the terraform status
