Merge 6172ca7 into 36fb0f4
stevenwinship committed Apr 10, 2024
2 parents 36fb0f4 + 6172ca7 commit a10295a
Showing 12 changed files with 191 additions and 20 deletions.
3 changes: 3 additions & 0 deletions doc/release-notes/make-data-count-.md
@@ -0,0 +1,3 @@
### Counter Processor 1.05 Support

This release includes support for counter-processor-1.05 for processing Make Data Count metrics. If you are running Make Data Count support, you should reinstall/reconfigure counter-processor as described in the latest Guides. (For existing installations, note that counter-processor-1.05 requires Python 3, so you will need to follow the full counter-processor install. Also note that if you configure the new version the same way, it will reprocess the days in the current month when it is first run. This is normal and will not affect the metrics in Dataverse.)
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/_static/util/counter_daily.sh
@@ -1,6 +1,6 @@
#! /bin/bash

COUNTER_PROCESSOR_DIRECTORY="/usr/local/counter-processor-0.1.04"
COUNTER_PROCESSOR_DIRECTORY="/usr/local/counter-processor-1.05"
MDC_LOG_DIRECTORY="/usr/local/payara6/glassfish/domains/domain1/logs/mdc"

# counter_daily.sh
8 changes: 4 additions & 4 deletions doc/sphinx-guides/source/admin/make-data-count.rst
@@ -16,7 +16,7 @@ Architecture

Dataverse installations that would like support for Make Data Count must install `Counter Processor`_, a Python project created by California Digital Library (CDL), which is part of the Make Data Count project and which runs the software in production as part of their `DASH`_ data sharing platform.

.. _Counter Processor: https://github.com/CDLUC3/counter-processor
.. _Counter Processor: https://github.com/gdcc/counter-processor
.. _DASH: https://cdluc3.github.io/dash/

The diagram below shows how Counter Processor interacts with your Dataverse installation and the DataCite hub, once configured. Dataverse installations using Handles rather than DOIs should note the limitations in the next section of this page.
@@ -84,9 +84,9 @@ Configure Counter Processor

* Change to the directory where you installed Counter Processor.

* ``cd /usr/local/counter-processor-0.1.04``
* ``cd /usr/local/counter-processor-1.05``

* Download :download:`counter-processor-config.yaml <../_static/admin/counter-processor-config.yaml>` to ``/usr/local/counter-processor-0.1.04``.
* Download :download:`counter-processor-config.yaml <../_static/admin/counter-processor-config.yaml>` to ``/usr/local/counter-processor-1.05``.

* Edit the config file and pay particular attention to the FIXME lines.
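
As an illustrative sketch of what that editing typically involves (the key names below are taken from the environment overrides used by the processing script elsewhere in this commit; the values are assumptions for a hypothetical installation, not the shipped defaults):

```yaml
# Hypothetical excerpt of counter-processor-config.yaml -- adjust for your installation
platform: "My Dataverse Installation"
log_name_pattern: "/usr/local/counter-processor-1.05/log/counter_(yyyy-mm-dd).log"
output_file: "/usr/local/counter-processor-1.05/tmp/make-data-count-report"
upload_to_hub: False
hub_base_url: "https://api.datacite.org"
```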

@@ -99,7 +99,7 @@ Soon we will be setting up a cron job to run nightly, but we start with a single run

* Change to the directory where you installed Counter Processor.

* ``cd /usr/local/counter-processor-0.1.04``
* ``cd /usr/local/counter-processor-1.05``

* If you are running Counter Processor for the first time in the middle of a month, you will need to create blank log files for the previous days, e.g.:

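A minimal sketch of creating those blank files, assuming a hypothetical log directory and that the first real run happens on the 15th of the month:

```shell
# Hypothetical log directory; counter-processor expects one log file per day
logdir=/tmp/counter-processor-logs
mkdir -p "$logdir"

# Create empty logs for the days of the month that have already passed
for d in $(seq -w 1 14); do
    touch "$logdir/counter_2024-04-${d}.log"
done
```

In a real installation the files would go in the directory named by ``log_name_pattern`` in your config, e.g. ``/usr/local/counter-processor-1.05/log``.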
6 changes: 3 additions & 3 deletions doc/sphinx-guides/source/developers/make-data-count.rst
@@ -1,7 +1,7 @@
Make Data Count
===============

Support for Make Data Count is a feature of the Dataverse Software that is described in the :doc:`/admin/make-data-count` section of the Admin Guide. In order for developers to work on the feature, they must install Counter Processor, a Python 3 application, as described below. Counter Processor can be found at https://github.com/CDLUC3/counter-processor
Support for Make Data Count is a feature of the Dataverse Software that is described in the :doc:`/admin/make-data-count` section of the Admin Guide. In order for developers to work on the feature, they must install Counter Processor, a Python 3 application, as described below. Counter Processor can be found at https://github.com/gdcc/counter-processor

.. contents:: |toctitle|
:local:
@@ -49,7 +49,7 @@ Once you are done with your configuration, you can run Counter Processor like this:

``su - counter``

``cd /usr/local/counter-processor-0.1.04``
``cd /usr/local/counter-processor-1.05``

``CONFIG_FILE=counter-processor-config.yaml python39 main.py``

@@ -82,7 +82,7 @@ Second, if you are also sending your SUSHI report to Make Data Count, you will n

``curl -H "Authorization: Bearer $JSON_WEB_TOKEN" -X DELETE https://$MDC_SERVER/reports/$REPORT_ID``

To get the ``REPORT_ID``, look at the logs generated in ``/usr/local/counter-processor-0.1.04/tmp/datacite_response_body.txt``
To get the ``REPORT_ID``, look at the logs generated in ``/usr/local/counter-processor-1.05/tmp/datacite_response_body.txt``

To read more about the Make Data Count API, see https://github.com/datacite/sashimi

17 changes: 9 additions & 8 deletions doc/sphinx-guides/source/installation/prerequisites.rst
@@ -434,7 +434,7 @@ firewalled from your Dataverse installation host).
Counter Processor
-----------------

Counter Processor is required to enable Make Data Count metrics in a Dataverse installation. See the :doc:`/admin/make-data-count` section of the Admin Guide for a description of this feature. Counter Processor is open source and we will be downloading it from https://github.com/CDLUC3/counter-processor
Counter Processor is required to enable Make Data Count metrics in a Dataverse installation. See the :doc:`/admin/make-data-count` section of the Admin Guide for a description of this feature. Counter Processor is open source and we will be downloading it from https://github.com/gdcc/counter-processor

Installing Counter Processor
============================
@@ -444,9 +444,9 @@ A scripted installation using Ansible is mentioned in the :doc:`/developers/make-data-count` section of the Developer Guide.
As root, download and install Counter Processor::

cd /usr/local
wget https://github.com/CDLUC3/counter-processor/archive/v0.1.04.tar.gz
tar xvfz v0.1.04.tar.gz
cd /usr/local/counter-processor-0.1.04
wget https://github.com/gdcc/counter-processor/archive/refs/tags/v1.05.tar.gz
tar xvfz v1.05.tar.gz
cd /usr/local/counter-processor-1.05

Installing GeoLite Country Database
===================================
@@ -457,22 +457,23 @@ The process required to sign up, download the database, and to configure automat

As root, change to the Counter Processor directory you just created, download the GeoLite2-Country tarball from MaxMind, untar it, and copy the geoip database into place::

<download or move the GeoLite2-Country.tar.gz to the /usr/local/counter-processor-0.1.04 directory>
<download or move the GeoLite2-Country.tar.gz to the /usr/local/counter-processor-1.05 directory>
tar xvfz GeoLite2-Country.tar.gz
cp GeoLite2-Country_*/GeoLite2-Country.mmdb maxmind_geoip
Note: GeoLite2-Country_20191217 is already included in the installation. If you use this version, you can skip the download and untar steps and simply run ``cp maxmind_geoip/GeoLite2-Country_20191217/GeoLite2-Country.mmdb maxmind_geoip``

Creating a counter User
=======================

As root, create a "counter" user and change ownership of Counter Processor directory to this new user::

useradd counter
chown -R counter:counter /usr/local/counter-processor-0.1.04
chown -R counter:counter /usr/local/counter-processor-1.05

Installing Counter Processor Python Requirements
================================================

Counter Processor version 0.1.04 requires Python 3.7 or higher. This version of Python is available in many operating systems, and is purportedly available for RHEL7 or CentOS 7 via Red Hat Software Collections. Alternately, one may compile it from source.
Counter Processor version 1.05 requires Python 3.7 or higher. This version of Python is available in many operating systems, and is purportedly available for RHEL7 or CentOS 7 via Red Hat Software Collections. Alternately, one may compile it from source.

The following commands are intended to be run as root but we are aware that Pythonistas might prefer fancy virtualenv or similar setups. Pull requests are welcome to improve these steps!

@@ -483,7 +484,7 @@ Install Python 3.9::
Install Counter Processor Python requirements::

python3.9 -m ensurepip
cd /usr/local/counter-processor-0.1.04
cd /usr/local/counter-processor-1.05
pip3 install -r requirements.txt

See the :doc:`/admin/make-data-count` section of the Admin Guide for how to configure and run Counter Processor.
167 changes: 167 additions & 0 deletions scripts/makedatacount/process_mdc_logs.sh
@@ -0,0 +1,167 @@
#! /bin/bash
set -x

# This script processes each file from the S3 bucket where archived log files are stored:
# 1. Loop through each file not already processed (by date).
# 2. Call counter-processor to convert the log files to SUSHI-formatted files.
# 3. counter-processor will call the Dataverse API: /api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=... to store dataset metrics in the Dataverse DB.
# 4. counter-processor will upload the data to DataCite if upload_to_hub is set to True.
# 5. The state of each file is recorded in the Dataverse DB, which allows failed files to be retried and limits the number of files processed in each run.

# MDC logs. There is one log per node per day, .../domain1/logs/counter_YYYY-MM-DD.log
# To enable MDC logging set the following settings:
# curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:DisplayMDCMetrics
# curl -X PUT -d '/opt/dvn/app/payara6/glassfish/domains/domain1/logs' http://localhost:8080/api/admin/settings/:MDCLogPath
declare -a NODE=("app-1" "app-2")
COUNTERPROCESSORDIR=/usr/local/counter-processor-1.05
LOGDIR=/opt/dvn/app/payara6/glassfish/domains/domain1/logs
ARCHIVEDIR=s3://dvn-cloud/Admin/logs/payara/counter
REPORTONDISKDIR=$LOGDIR
CopyFromArchiveCmd="aws s3 cp ${ARCHIVEDIR}"
RunAsCounterProcessorUser="sudo -u counter"
upload_to_hub=False
platform_name="Harvard Dataverse"
hub_base_url="https://api.datacite.org"
# If uploading to DataCite make sure the hub_api_token is defined in COUNTERPROCESSORDIR/config/secrets.yaml and not hard coded in this script

# Testing with dataverse running in docker
if [ -d docker-dev-volumes/ ]; then
echo "Docker Directory exists."
RunAsCounterProcessorUser="sudo"
DATAVERSESOURCEDIR=$PWD
#COUNTERPROCESSORDIR=$DATAVERSESOURCEDIR/../counter-processor
LOGDIR=$DATAVERSESOURCEDIR/docker-dev-volumes/app/data/temp
ARCHIVEDIR=$DATAVERSESOURCEDIR/tests/data
REPORTONDISKDIR=/dv/temp
CopyFromArchiveCmd="cp -v ${ARCHIVEDIR}"
platform_name="Harvard Dataverse Test Account"
hub_base_url="https://api.test.datacite.org"
upload_to_hub=False
fi

log_name_pattern="${COUNTERPROCESSORDIR}/log/counter_(yyyy-mm-dd).log"
output_report_file=$COUNTERPROCESSORDIR/tmp/make-data-count-report

# This config file contains the settings that cannot be overridden here.
# path_types:
# investigations:
# requests:
export CONFIG_FILE="${COUNTERPROCESSORDIR}/config/counter-processor-config.yaml"
# See: https://guides.dataverse.org/en/latest/admin/make-data-count.html#configure-counter-processor
# and download https://guides.dataverse.org/en/latest/_downloads/f99910a3cc45e4f68cc047f7c033c7f0/counter-processor-config.yaml

function process_json_file () {
# Process the logs by calling counter-processor
year_month="${1}"
cd $COUNTERPROCESSORDIR

l=$(ls log/counter_${year_month}-*.log | sort -r)
log_date=${l:12:10}
# Note: BSD/macOS date syntax; on GNU/Linux use: sim_date=$(date -d "${log_date} + 1 day" +%F)
sim_date=$(date -j -v +1d -f "%Y-%m-%d" "${log_date}" +%F)
response=$(curl -sS -X GET "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState" 2>/dev/null)
state=$(echo "$response" | jq -j '.data.state')
# Always clean for rerun; a month whose previous run FAILED is reprocessed from scratch
rerun=True
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState?state=processing"
: > $COUNTERPROCESSORDIR/tmp/datacite_response_body.txt
eval "$RunAsCounterProcessorUser YEAR_MONTH=${year_month} SIMULATE_DATE=${sim_date} PLATFORM='${platform_name}' LOG_NAME_PATTERN='${log_name_pattern}' OUTPUT_FILE='${output_report_file}' UPLOAD_TO_HUB='${upload_to_hub}' HUB_BASE_URL='${hub_base_url}' CLEAN_FOR_RERUN='${rerun}' python3 main.py &> $COUNTERPROCESSORDIR/tmp/counter.log"
cat $COUNTERPROCESSORDIR/tmp/counter.log
cat $COUNTERPROCESSORDIR/tmp/datacite_response_body.txt
report=counter_${log_date}.json
cp -v ${output_report_file}.json ${LOGDIR}/${report}
response=$(curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=${REPORTONDISKDIR}/${report}" 2>/dev/null)
if [[ "$(echo "$response" | jq -j '.status')" != "OK" ]]; then
state="failed"
else
state="done"
# ok to delete the report now. The original is still in counter-processor if needed
rm -rf ${LOGDIR}/${report}
fi
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState?state="$state
# If the month is complete update the year_month
if [[ "${sim_date:8:2}" == "01" ]]; then
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState?state="$state
else
# TODO: will we ever encounter a tar file with an incomplete month? If so then we need to figure out how to skip it until it's complete
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState?state=skip"
fi
}

function process_archived_files () {
# Check each node for the newest file. If multiple nodes have the same date file we need to merge the files
nodeArraylength=${#NODE[@]}
for (( i=0; i<${nodeArraylength}; i++ ));
do
echo "index: $i, value: ${NODE[$i]}"
ls ${ARCHIVEDIR}/${NODE[$i]}/counter_*.tar | sort -r | while read l
do
year_month=${l:(-11):7}
echo "Found archive file for "$year_month
response=$(curl -sS -X GET "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState" 2>/dev/null)
state=$(echo "$response" | jq -j '.data.state')
if [[ "${state}" == "DONE" ]] || [[ "${state}" == "SKIP" ]]; then
echo "Skipping due to state:${state}"
else
NEW_LOGDIR=${LOGDIR}/${NODE[$i]}_${year_month}
mkdir -p ${NEW_LOGDIR}
# Copy the tar file from archive back to local, un-tar it and clean up intermediate files.
eval "$CopyFromArchiveCmd/${NODE[$i]}/counter_${year_month}.tar ${NEW_LOGDIR}/counter_${year_month}.tar"
tar -xvzf ${NEW_LOGDIR}/counter_${year_month}.tar --directory ${NEW_LOGDIR}
ls ${NEW_LOGDIR}/counter_${year_month}-* | while read l
do
gzip -d $l
done
rm -r ${NEW_LOGDIR}/counter_${year_month}.tar
break
fi
done
done

# Determine which node/nodes have the newest files. Unless a node was down for the month they should all have files
# for the same dates so merging is a must.
# Get a list of directories under LOGDIR that are in format NODE_yyyy-mm and strip to get yyyy_mm
# Sort so newest yyyy-mm is first in the list
ls -1d $LOGDIR/*/ | rev | cut -d'_' -f1 | rev | sort -r | uniq > /tmp/archived_files
# Read first line and strip off trailing '/' to get the newest year_month to process
read -r line < /tmp/archived_files
year_month=${line:(-8):7}
echo $year_month
# year_month will be empty if no more files to process
if [ ! -z "$year_month" ]; then
# Get the list of directories to merge for this year_month
ls -1d $LOGDIR/*_$year_month/ > /tmp/archived_files

# Merge subsequent directories into firstDirectory. Note: firstDirectory may or may not be NODE 1. It shouldn't matter
read -r firstDirectory < /tmp/archived_files
tail -n +2 /tmp/archived_files | while read l
do
ls ${l}counter_*.log | while read f
do
# It should never happen, but if one of the files is missing, create it so the merge will not fail
if [ ! -e "$f" ]; then
touch $f
fi
# Strip off just the file name, i.e. counter_2024-02-01.log
log_file=${f:(-22)}
sort -um -o ${firstDirectory}${log_file} ${firstDirectory}${log_file} ${f}
done
done

# Now firstDirectory has all the merged data so we can move it to the counter_processor log directory and clean up the NODE directories
eval "$RunAsCounterProcessorUser cp ${firstDirectory}*.log $COUNTERPROCESSORDIR/log"
for (( i=0; i<${nodeArraylength}; i++ ));
do
rm -rf $LOGDIR/${NODE[$i]}*
done

process_json_file "$year_month"

# After processing is done delete the log files from counter_processor log directory
eval "$RunAsCounterProcessorUser rm -rf $COUNTERPROCESSORDIR/log/counter_*.log"
fi
}

# Main
process_archived_files
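
The script above is written for one-off runs; to automate it, a cron entry along these lines could be used (the install path, schedule, and log location are assumptions, not part of this commit):

```
# /etc/cron.d/mdc-processing (hypothetical): process archived MDC logs nightly at 02:00
0 2 * * * counter /usr/local/bin/process_mdc_logs.sh >> /var/log/process_mdc_logs.log 2>&1
```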
@@ -117,10 +117,10 @@ public class DatasetMetrics implements Serializable {
* For an example of sending various metric types (total-dataset-requests,
* unique-dataset-investigations, etc) for a given month (2018-04) per
* country (DK, US, etc.) see
* https://github.com/CDLUC3/counter-processor/blob/5ce045a09931fb680a32edcc561f88a407cccc8d/good_test.json#L893
* https://github.com/gdcc/counter-processor/blob/5ce045a09931fb680a32edcc561f88a407cccc8d/good_test.json#L893
*
* counter-processor uses GeoLite2 for IP lookups according to their
* https://github.com/CDLUC3/counter-processor#download-the-free-ip-to-geolocation-database
* https://github.com/gdcc/counter-processor#download-the-free-ip-to-geolocation-database
*/
@Column(nullable = true)
private String countryCode;
@@ -27,15 +27,15 @@
* How to Make Your Data Count July 10th, 2018).
*
* The recommended starting point to implement Make Data Count is
* https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md
* https://github.com/gdcc/Make-Data-Count/blob/master/getting-started.md
* which specifically recommends reading the "COUNTER Code of Practice for
* Research Data" mentioned in the user facing docs.
*
* Make Data Count was first implemented in DASH. Here's an example dataset:
* https://dash.ucmerced.edu/stash/dataset/doi:10.6071/M3RP49
*
* For processing logs we could try DASH's
* https://github.com/CDLUC3/counter-processor
* https://github.com/gdcc/counter-processor
*
* Next, DataOne implemented it, and you can see an example dataset here:
* https://search.dataone.org/view/doi:10.5063/F1Z899CZ
Binary file added tests/data/app-1/counter_2024-02.tar
Binary file not shown.
Binary file added tests/data/app-1/counter_2024-03.tar
Binary file not shown.
Binary file added tests/data/app-1/counter_2024-04.tar
Binary file not shown.
Binary file added tests/data/app-2/counter_2024-02.tar
Binary file not shown.
