WMAgent deployment

Todor Ivanov edited this page Dec 20, 2023 · 56 revisions

Pre-requisites

  • A condor_schedd daemon must be deployed and running on your node.
  • The schedd needs to be added to the glideinWMS pool (if it is not there yet).
  • Create an environment setup file under /data/admin/wmagent/env.sh (check other agents to see its content). This file needs to be sourced each time you want to operate WMAgent.
  • Create a secrets file with service information/URLs and database credentials under /data/admin/wmagent/WMAgent.secrets (check other agents to see its content). This file is used during WMAgent deployment in order to override some default configuration.
  • NOTE: be very careful with this file, especially if you are copying it from another agent. Make sure to:
      • overwrite the Oracle settings or replace them with MySQL credentials. Otherwise, you may delete a production Oracle database!
      • update COUCH_HOST with the proper node IP
      • update the service URLs in case you are using cmsweb-testbed or your own private virtual machine
  • Copy the service certificate files (service{cert,key}.pem from vocms0230) to the /data/certs/ directory. Note that their permissions must be at least 600.
  • Copy the short-term proxy (myproxy.pem from vocms0230) to the /data/certs/ directory.
  • Finally, this script will be used for the deployment: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent.sh
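
The checklist above can be turned into a quick pre-flight check. The sketch below is not part of WMCore: check_prereqs and its optional path prefix are hypothetical helpers, and the certificate file names are derived from the service{cert,key}.pem and myproxy.pem names mentioned above. It only verifies that the files exist and lists them so that the permissions can be inspected by eye.

```shell
# Pre-flight check for the files listed in the pre-requisites.
# The optional first argument is a prefix prepended to every path
# (hypothetical, handy for dry runs); leave it empty for the real node.
check_prereqs() {
    prefix=${1:-}
    missing=0
    for f in \
        /data/admin/wmagent/env.sh \
        /data/admin/wmagent/WMAgent.secrets \
        /data/certs/servicecert.pem \
        /data/certs/servicekey.pem \
        /data/certs/myproxy.pem
    do
        if [ -f "$prefix$f" ]; then
            # ls -l prints the permission bits for a visual check
            ls -l "$prefix$f"
        else
            echo "MISSING: $f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every file was found
    [ "$missing" -eq 0 ]
}

# Example (not executed here):
# check_prereqs || echo "fix the missing files before deploying"
```

The check says nothing about the content of WMAgent.secrets, so the database settings still need to be reviewed by hand.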

Deployment procedure

1. Initial setup (example for CERN agents)

At this point, you should have gone through the pre-requisites, especially the changes required to WMAgent.secrets (if not, go back there!). From lxplus or aiadm, access the node with your own account and then switch to cmst1.

ssh vocmsXXX
sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc

Download the deployment script (the master branch should work, but if you prefer you can replace master with the WMCore tag you want):

cd /data/srv
wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh
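
If you prefer to pin the script to a tag instead of master, only the branch part of the URL changes. A minimal sketch, where deploy_script_url is a hypothetical helper and 2.2.3.1 is just the example version used on this page:

```shell
# Build the raw GitHub URL of the deployment script for a branch or tag.
# With no argument it falls back to the master branch.
deploy_script_url() {
    ref=${1:-master}
    echo "https://raw.githubusercontent.com/dmwm/WMCore/$ref/deploy/deploy-wmagent.sh"
}

# Example (not executed here): fetch the script as of tag 2.2.3.1
# cd /data/srv && wget -nv "$(deploy_script_url 2.2.3.1)"
```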

2. Deploying the agent

First, read the help/usage of the script by running:

sh deploy-wmagent.sh

There are several things you need to provide on the command line; again, read the script help printed by the command above. As an example, a WMAgent deployment could look like this:

sh deploy-wmagent.sh -w 2.2.3.1 -t testbed-dev -p "111 222"

The command above would deploy WMAgent version 2.2.3.1, setting the agent team name to testbed-dev and applying patches 111 and 222 from official pull requests in the WMCore repo.

3. Final check and starting services

Once you finish the deployment of the agent, it is worth checking whether config.py contains the correct configuration (according to the arguments from the command line and the secrets file). Run:

source /data/admin/wmagent/env.sh  # or you can use the alias agentenv
less config/wmagentpy3/config.py

If everything is OK, you just need to start the components, since the services (couchdb and mysql) are started during the deployment procedure. To start all the components (the agent itself), run:

$manage start-agent

4. Additional commands

If you made some changes to the code and want to restart the agent (all components), type:

$manage stop-agent
$manage start-agent

If you want to restart only specific components, type:

$manage execute-agent wmcoreD --restart --components=DBS3Upload

Deployment of a new agent in production

CERN agents

1. New machine

Ask the VOC for a new machine configured by puppet. The machine needs to be registered as a proper schedd in the CERN HTCondor global pool. Then follow the procedure explained above.

2. Upgrading an existing agent

Log in to the machine and set up the environment:

ssh vocmsXXX

sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc

agentenv

Check the version of the agent currently installed and make sure the agent is completely drained.

!!! DO NOT START !!! any further actions if the agent is not completely drained.

condor_q

You should see an empty queue:

-- Schedd: vocms0283.cern.ch : <137.138.153.30:4080?... @ 11/15/19 16:21:15
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
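
The queue check can also be scripted so that a non-empty queue is hard to overlook. This is only a sketch: queue_is_empty is a hypothetical helper, and it assumes that condor_q prints the "Total for query" summary line in the format shown above, which may differ between HTCondor versions.

```shell
# Succeeds only if the condor_q summary read from stdin reports
# zero jobs for the query.
queue_is_empty() {
    grep -q '^Total for query: 0 jobs;'
}

# Example (not executed here):
# if condor_q | queue_is_empty; then
#     echo "condor queue is empty"
# else
#     echo "agent is NOT drained, stop here" >&2
# fi
```

An empty queue covers only part of the drain check, so the remaining checks on this page still apply.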

runningagent

cmst1    1376194  0.0  0.0 112712   948 pts/1    S+   16:22   0:00 grep -E couch|wmcore|mysql|beam

Check the status of the agent:

$manage status

And in case there is something still running:

$manage stop-agent

$manage stop-services

Unregister the agent from WMStats by cleaning its document from the WMStats database:

$manage execute-agent wmagent-unregister-wmstats `hostname -f`

Clean the database

$manage execute-agent clean-oracle

Executing clean-oracle  ...
Are you sure you want to wipe out CMS_WMBS_PROD13 oracle database (yes/no): yes
Alright, dropping and purging everything

SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 15 16:26:10 2019

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> 

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
Done!

Copy the old config:

cp -av /data/srv/wmagent/current/config/wmagent/config.py /data/srv/config.py.$(date -I)

Remove the old agent

rm -fr /data/srv/wmagent/v1.2.4.patch2/

Restart the whole node

Log out from the cmst1 account and reboot:

exit
sudo reboot

Run puppet manually

Once the machine is up again, log in and run puppet manually. Even though the machines run puppet on startup, sometimes more than a single run is needed to apply a new change:

[lxplus** ]$ ssh vocms**.cern.ch
sudo -s 
sudo /opt/puppetlabs/bin/puppet agent -tv
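
Since more than one run may be needed, the repetition can be sketched as a small loop. It relies on the detailed exit codes that puppet agent reports when run with -t (0 for no changes, 2 for changes applied, anything else treated here as a failure). run_until_converged is a hypothetical helper and the limit of 3 runs is an arbitrary choice.

```shell
# Re-run a command until it reports "no changes" (exit code 0),
# giving up after max_runs attempts. Exit code 2 means "changes were
# applied, run again"; any other code is treated as a failure.
run_until_converged() {
    max_runs=$1
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        rc=0
        "$@" || rc=$?
        case $rc in
            0) echo "converged after $run run(s)"; return 0 ;;
            2) echo "run $run applied changes, running again" ;;
            *) echo "run $run failed with exit code $rc" >&2; return "$rc" ;;
        esac
        run=$((run + 1))
    done
    echo "still applying changes after $max_runs runs" >&2
    return 1
}

# Example (not executed here):
# run_until_converged 3 sudo /opt/puppetlabs/bin/puppet agent -tv
```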

Delete any leftovers from previous deployments

sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
cd /data/srv
rm -rf deploy*

Check the WMAgent.secrets file:

vi /data/admin/wmagent/WMAgent.secrets

Watch for 'ORACLE_TNS' and 'RUCIO_ACCOUNT'
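
Instead of opening the file in an editor, the two settings can be printed directly. A minimal sketch, assuming the secrets file uses plain KEY=value lines (check your own file first); show_keys is a hypothetical helper.

```shell
# Print the lines that define the given keys in a KEY=value style file,
# and warn on stderr about keys that are not defined at all.
show_keys() {
    file=$1
    shift
    for key in "$@"; do
        grep -E "^${key}=" "$file" || echo "$key is not set in $file" >&2
    done
}

# Example (not executed here):
# show_keys /data/admin/wmagent/WMAgent.secrets ORACLE_TNS RUCIO_ACCOUNT
```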

Download the wmagent deploy script from master this time.

wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh

Run the wmagent deployment script

Before executing the command check for the correct versions of:

  • agent tag: for example "1.2.8"; check for the correct tag in the comments here
  • team name: "production"
  • agent number: "13"

Take those from the previous wmagent config file, and run:

agentTag='1.3.0'
teamName=$(grep -i teamName config.py.$(date -I)|awk '{print $3}') && teamName=${teamName#\'} && teamName=${teamName%\'}
agentNumber=$(grep -i agentNumber config.py.$(date -I)|awk '{print $3}')

sh deploy-wmagent.sh -w $agentTag -t $teamName -n $agentNumber |tee -a /data/srv/deployment.log.$(date -I) 

Or in case we need a patched deployment:

agentTag='1.3.0'
teamName=$(grep -i teamName config.py.$(date -I)|awk '{print $3}') && teamName=${teamName#\'} && teamName=${teamName%\'}
agentNumber=$(grep -i agentNumber config.py.$(date -I)|awk '{print $3}')
patchNum="9439"

sh deploy-wmagent.sh -w $agentTag -t $teamName -n $agentNumber -p "$patchNum" |tee -a /data/srv/deployment.log.$(date -I) 
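
The teamName and agentNumber extraction used above can be wrapped in one small function, which makes it easier to check the values before passing them to the deployment script. This is a sketch rather than part of the official procedure: config_value is a hypothetical helper, it assumes config lines of the form `config.Agent.teamName = 'production'`, and unlike the one-liners above it keeps only the first matching line and strips both single and double quotes.

```shell
# Print the value assigned on the first config line matching a key,
# assuming lines of the form:  <name> = <value>
config_value() {
    key=$1
    file=$2
    grep -i "$key" "$file" | head -n 1 | awk '{print $3}' | tr -d "'\""
}

# Example (not executed here):
# teamName=$(config_value teamName "config.py.$(date -I)")
# agentNumber=$(config_value agentNumber "config.py.$(date -I)")
# echo "team=$teamName number=$agentNumber"
```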

Watch out for errors. You need to go through every step of the installation and confirm that it finished with no errors, especially the parts related to CouchDB.
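
Because the deployment output is also written to /data/srv/deployment.log.$(date -I), a first pass over it can be automated. The sketch below is only a coarse filter and does not replace reading the log: scan_deploy_log is a hypothetical helper, and the keywords it searches for are a guess, not the actual messages printed by deploy-wmagent.sh.

```shell
# Print suspicious lines (with line numbers) from a deployment log.
# Returns 1 if any were found, 0 otherwise.
scan_deploy_log() {
    log=$1
    if grep -inE 'error|fail|traceback' "$log"; then
        return 1
    fi
    echo "no suspicious lines in $log"
}

# Example (not executed here):
# scan_deploy_log "/data/srv/deployment.log.$(date -I)"
```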

Check that the newly generated config file differs from the previous one only by the agent version (or reflects changes that you made intentionally):

agentenv
diff -u config/wmagent/config.py /data/srv/config.py.$(date -I) | less

Check status

Check the status of the agent in its local couchdb by visiting the following (change the machine name):

https://cmsweb.cern.ch/couchdb/_utils/document.html?reqmgr_auxiliary/WMAGENT_CONFIG_vocms0283.cern.ch

Run the agent

agentenv
$manage start-agent

Once the agent is validated, you no longer need the deployment output and the old config, so clean them up:

rm /data/srv/*$(date -I)
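
Note that the $(date -I) suffix matches only files created today, so if the validation finishes on a later day the pattern will not find the files from the deployment. A small sketch for listing the files of a given day before deleting anything; list_dated_files is a hypothetical helper and the date in the example is a placeholder.

```shell
# List files in a directory whose names end with the given ISO date
# (defaults to today). Prints nothing if there is no match.
list_dated_files() {
    dir=$1
    day=${2:-$(date -I)}
    ls -1 "$dir"/*"$day" 2>/dev/null
}

# Example (not executed here): review, then remove by hand
# list_dated_files /data/srv 2019-11-15
```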

Edit the relevant twiki:

Current set of dev/testbed WMAgents

Node             Site  Responsible  Condor pool
vocms0192        CERN  DMWM stable  Global
vocms0193        CERN  DMWM stable  Global
vocms0260        CERN  Todor        ITB
vocms0261        CERN  Alan         Global
vocms0262        CERN  Alan         ITB
vocms0263        CERN  Kenyi        ITB
vocms0264        CERN  Todor        ITB
vocms0265        CERN  Kenyi        Global
vocms0267        CERN  CMS@Home     CMS@Home
vocms0290        CERN  Todor        Global
vocms0291        CERN  Valentin     Global
cmsgwms-submit1  FNAL  DMWM stable  ITB
cmssrv217        FNAL  ???          HEPCloud
cmssrv620        FNAL  ???          HEPCloud