Merge 6172ca7 into 36fb0f4
stevenwinship committed Apr 10, 2024
2 parents 36fb0f4 + 6172ca7 commit a10295a
Showing 12 changed files with 191 additions and 20 deletions.
3 changes: 3 additions & 0 deletions doc/release-notes/make-data-count-.md
@@ -0,0 +1,3 @@
### Counter Processor 1.05 Support

This release includes support for counter-processor-1.05 for processing Make Data Count metrics. If you are running Make Data Count support, you should reinstall/reconfigure counter-processor as described in the latest Guides. (For existing installations, note that counter-processor-1.05 requires Python 3, so you will need to follow the full counter-processor install. Also note that if you configure the new version the same way, it will reprocess the days in the current month when it is first run. This is normal and will not affect the metrics in Dataverse.)
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/_static/util/counter_daily.sh
@@ -1,6 +1,6 @@
#! /bin/bash

COUNTER_PROCESSOR_DIRECTORY="/usr/local/counter-processor-0.1.04"
COUNTER_PROCESSOR_DIRECTORY="/usr/local/counter-processor-1.05"
MDC_LOG_DIRECTORY="/usr/local/payara6/glassfish/domains/domain1/logs/mdc"

# counter_daily.sh
8 changes: 4 additions & 4 deletions doc/sphinx-guides/source/admin/make-data-count.rst
@@ -16,7 +16,7 @@ Architecture

Dataverse installations that would like support for Make Data Count must install `Counter Processor`_, a Python project created by California Digital Library (CDL), which is part of the Make Data Count project and which runs the software in production as part of their `DASH`_ data sharing platform.

.. _Counter Processor: https://github.com/CDLUC3/counter-processor
.. _Counter Processor: https://github.com/gdcc/counter-processor
.. _DASH: https://cdluc3.github.io/dash/

The diagram below shows how Counter Processor interacts with your Dataverse installation and the DataCite hub, once configured. Dataverse installations using Handles rather than DOIs should note the limitations in the next section of this page.
@@ -84,9 +84,9 @@ Configure Counter Processor

* Change to the directory where you installed Counter Processor.

* ``cd /usr/local/counter-processor-0.1.04``
* ``cd /usr/local/counter-processor-1.05``

* Download :download:`counter-processor-config.yaml <../_static/admin/counter-processor-config.yaml>` to ``/usr/local/counter-processor-0.1.04``.
* Download :download:`counter-processor-config.yaml <../_static/admin/counter-processor-config.yaml>` to ``/usr/local/counter-processor-1.05``.

* Edit the config file and pay particular attention to the FIXME lines.
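
As an illustrative sketch of what that editing typically involves (the key names below are taken from the environment overrides used by the processing script elsewhere in this commit; the values are assumptions for a hypothetical installation, not the shipped defaults):

```yaml
# Hypothetical excerpt of counter-processor-config.yaml -- adjust for your installation
platform: "My Dataverse Installation"
log_name_pattern: "/usr/local/counter-processor-1.05/log/counter_(yyyy-mm-dd).log"
output_file: "/usr/local/counter-processor-1.05/tmp/make-data-count-report"
upload_to_hub: False
hub_base_url: "https://api.datacite.org"
```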

@@ -99,7 +99,7 @@ Soon we will be setting up a cron job to run nightly, but we start with a single run

* Change to the directory where you installed Counter Processor.

* ``cd /usr/local/counter-processor-0.1.04``
* ``cd /usr/local/counter-processor-1.05``

* If you are running Counter Processor for the first time in the middle of a month, you will need to create blank log files for the previous days, e.g.:

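A minimal sketch of creating those blank files, assuming a hypothetical log directory and that the first real run happens on the 15th of the month:

```shell
# Hypothetical log directory; counter-processor expects one log file per day
logdir=/tmp/counter-processor-logs
mkdir -p "$logdir"

# Create empty logs for the days of the month that have already passed
for d in $(seq -w 1 14); do
    touch "$logdir/counter_2024-04-${d}.log"
done
```

In a real installation the files would go in the directory named by ``log_name_pattern`` in your config, e.g. ``/usr/local/counter-processor-1.05/log``.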
6 changes: 3 additions & 3 deletions doc/sphinx-guides/source/developers/make-data-count.rst
@@ -1,7 +1,7 @@
Make Data Count
===============

Support for Make Data Count is a feature of the Dataverse Software that is described in the :doc:`/admin/make-data-count` section of the Admin Guide. In order for developers to work on the feature, they must install Counter Processor, a Python 3 application, as described below. Counter Processor can be found at https://github.com/CDLUC3/counter-processor
Support for Make Data Count is a feature of the Dataverse Software that is described in the :doc:`/admin/make-data-count` section of the Admin Guide. In order for developers to work on the feature, they must install Counter Processor, a Python 3 application, as described below. Counter Processor can be found at https://github.com/gdcc/counter-processor

.. contents:: |toctitle|
:local:
@@ -49,7 +49,7 @@ Once you are done with your configuration, you can run Counter Processor like this:

``su - counter``

``cd /usr/local/counter-processor-0.1.04``
``cd /usr/local/counter-processor-1.05``

``CONFIG_FILE=counter-processor-config.yaml python39 main.py``

@@ -82,7 +82,7 @@ Second, if you are also sending your SUSHI report to Make Data Count, you will n

``curl -H "Authorization: Bearer $JSON_WEB_TOKEN" -X DELETE https://$MDC_SERVER/reports/$REPORT_ID``

To get the ``REPORT_ID``, look at the logs generated in ``/usr/local/counter-processor-0.1.04/tmp/datacite_response_body.txt``
To get the ``REPORT_ID``, look at the logs generated in ``/usr/local/counter-processor-1.05/tmp/datacite_response_body.txt``

To read more about the Make Data Count API, see https://github.com/datacite/sashimi

17 changes: 9 additions & 8 deletions doc/sphinx-guides/source/installation/prerequisites.rst
@@ -434,7 +434,7 @@ firewalled from your Dataverse installation host).
Counter Processor
-----------------

Counter Processor is required to enable Make Data Count metrics in a Dataverse installation. See the :doc:`/admin/make-data-count` section of the Admin Guide for a description of this feature. Counter Processor is open source and we will be downloading it from https://github.com/CDLUC3/counter-processor
Counter Processor is required to enable Make Data Count metrics in a Dataverse installation. See the :doc:`/admin/make-data-count` section of the Admin Guide for a description of this feature. Counter Processor is open source and we will be downloading it from https://github.com/gdcc/counter-processor

Installing Counter Processor
============================
@@ -444,9 +444,9 @@ A scripted installation using Ansible is mentioned in the :doc:`/developers/make-data-count` section of the Developer Guide.
As root, download and install Counter Processor::

cd /usr/local
wget https://github.com/CDLUC3/counter-processor/archive/v0.1.04.tar.gz
tar xvfz v0.1.04.tar.gz
cd /usr/local/counter-processor-0.1.04
wget https://github.com/gdcc/counter-processor/archive/refs/tags/v1.05.tar.gz
tar xvfz v1.05.tar.gz
cd /usr/local/counter-processor-1.05

Installing GeoLite Country Database
===================================
@@ -457,22 +457,23 @@ The process required to sign up, download the database, and to configure automat

As root, change to the Counter Processor directory you just created, download the GeoLite2-Country tarball from MaxMind, untar it, and copy the geoip database into place::

<download or move the GeoLite2-Country.tar.gz to the /usr/local/counter-processor-0.1.04 directory>
<download or move the GeoLite2-Country.tar.gz to the /usr/local/counter-processor-1.05 directory>
tar xvfz GeoLite2-Country.tar.gz
cp GeoLite2-Country_*/GeoLite2-Country.mmdb maxmind_geoip
Note: GeoLite2-Country_20191217 is already included in the installation. If you use this version, you can skip the download and untar steps and simply run ``cp maxmind_geoip/GeoLite2-Country_20191217/GeoLite2-Country.mmdb maxmind_geoip``

Creating a counter User
=======================

As root, create a "counter" user and change ownership of Counter Processor directory to this new user::

useradd counter
chown -R counter:counter /usr/local/counter-processor-0.1.04
chown -R counter:counter /usr/local/counter-processor-1.05

Installing Counter Processor Python Requirements
================================================

Counter Processor version 0.1.04 requires Python 3.7 or higher. This version of Python is available in many operating systems, and is purportedly available for RHEL7 or CentOS 7 via Red Hat Software Collections. Alternately, one may compile it from source.
Counter Processor version 1.05 requires Python 3.7 or higher. This version of Python is available in many operating systems, and is purportedly available for RHEL7 or CentOS 7 via Red Hat Software Collections. Alternately, one may compile it from source.

The following commands are intended to be run as root but we are aware that Pythonistas might prefer fancy virtualenv or similar setups. Pull requests are welcome to improve these steps!

@@ -483,7 +484,7 @@ Install Python 3.9::
Install Counter Processor Python requirements::

python3.9 -m ensurepip
cd /usr/local/counter-processor-0.1.04
cd /usr/local/counter-processor-1.05
pip3 install -r requirements.txt

See the :doc:`/admin/make-data-count` section of the Admin Guide for how to configure and run Counter Processor.
167 changes: 167 additions & 0 deletions scripts/makedatacount/process_mdc_logs.sh
@@ -0,0 +1,167 @@
#! /bin/bash
set -x

# This script processes each file from the S3 bucket where archived log files are stored:
# 1. Loop through each file not already processed (by date).
# 2. Call counter-processor to convert the log files to SUSHI-formatted files.
# 3. counter-processor will call the Dataverse API: /api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=... to store dataset metrics in the Dataverse DB.
# 4. counter-processor will upload the data to DataCite if upload_to_hub is set to True.
# 5. The state of each file is recorded in the Dataverse DB, which allows failed files to be retried and limits the number of files processed in each run.

# MDC logs. There is one log per node per day, .../domain1/logs/counter_YYYY-MM-DD.log
# To enable MDC logging set the following settings:
# curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:DisplayMDCMetrics
# curl -X PUT -d '/opt/dvn/app/payara6/glassfish/domains/domain1/logs' http://localhost:8080/api/admin/settings/:MDCLogPath
declare -a NODE=("app-1" "app-2")
COUNTERPROCESSORDIR=/usr/local/counter-processor-1.05
LOGDIR=/opt/dvn/app/payara6/glassfish/domains/domain1/logs
ARCHIVEDIR=s3://dvn-cloud/Admin/logs/payara/counter
REPORTONDISKDIR=$LOGDIR
CopyFromArchiveCmd="aws s3 cp ${ARCHIVEDIR}"
RunAsCounterProcessorUser="sudo -u counter"
upload_to_hub=False
platform_name="Harvard Dataverse"
hub_base_url="https://api.datacite.org"
# If uploading to DataCite make sure the hub_api_token is defined in COUNTERPROCESSORDIR/config/secrets.yaml and not hard coded in this script

# Testing with dataverse running in docker
if [ -d docker-dev-volumes/ ]; then
echo "Docker Directory exists."
RunAsCounterProcessorUser="sudo"
DATAVERSESOURCEDIR=$PWD
#COUNTERPROCESSORDIR=$DATAVERSESOURCEDIR/../counter-processor
LOGDIR=$DATAVERSESOURCEDIR/docker-dev-volumes/app/data/temp
ARCHIVEDIR=$DATAVERSESOURCEDIR/tests/data
REPORTONDISKDIR=/dv/temp
CopyFromArchiveCmd="cp -v ${ARCHIVEDIR}"
platform_name="Harvard Dataverse Test Account"
hub_base_url="https://api.test.datacite.org"
upload_to_hub=False
fi

log_name_pattern="${COUNTERPROCESSORDIR}/log/counter_(yyyy-mm-dd).log"
output_report_file=$COUNTERPROCESSORDIR/tmp/make-data-count-report

# This config file contains the settings that cannot be overridden here.
# path_types:
# investigations:
# requests:
export CONFIG_FILE="${COUNTERPROCESSORDIR}/config/counter-processor-config.yaml"
# See: https://guides.dataverse.org/en/latest/admin/make-data-count.html#configure-counter-processor
# and download https://guides.dataverse.org/en/latest/_downloads/f99910a3cc45e4f68cc047f7c033c7f0/counter-processor-config.yaml

function process_json_file () {
# Process the logs by calling counter-processor
year_month="${1}"
cd $COUNTERPROCESSORDIR

l=$(ls log/counter_${year_month}-*.log | sort -r)
log_date=${l:12:10}
# Note: BSD/macOS date syntax; on GNU/Linux use: sim_date=$(date -d "${log_date} + 1 day" +%F)
sim_date=$(date -j -v +1d -f "%Y-%m-%d" "${log_date}" +%F)
response=$(curl -sS -X GET "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState" 2>/dev/null)
state=$(echo "$response" | jq -j '.data.state')
# Always clean for rerun; a month whose previous run FAILED is reprocessed from scratch
rerun=True
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState?state=processing"
: > $COUNTERPROCESSORDIR/tmp/datacite_response_body.txt
eval "$RunAsCounterProcessorUser YEAR_MONTH=${year_month} SIMULATE_DATE=${sim_date} PLATFORM='${platform_name}' LOG_NAME_PATTERN='${log_name_pattern}' OUTPUT_FILE='${output_report_file}' UPLOAD_TO_HUB='${upload_to_hub}' HUB_BASE_URL='${hub_base_url}' CLEAN_FOR_RERUN='${rerun}' python3 main.py &> $COUNTERPROCESSORDIR/tmp/counter.log"
cat $COUNTERPROCESSORDIR/tmp/counter.log
cat $COUNTERPROCESSORDIR/tmp/datacite_response_body.txt
report=counter_${log_date}.json
cp -v ${output_report_file}.json ${LOGDIR}/${report}
response=$(curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=${REPORTONDISKDIR}/${report}" 2>/dev/null)
if [[ "$(echo "$response" | jq -j '.status')" != "OK" ]]; then
state="failed"
else
state="done"
# ok to delete the report now. The original is still in counter-processor if needed
rm -rf ${LOGDIR}/${report}
fi
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$log_date/processingState?state="$state
# If the month is complete update the year_month
if [[ "${sim_date:8:2}" == "01" ]]; then
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState?state="$state
else
# TODO: will we ever encounter a tar file with an incomplete month? If so then we need to figure out how to skip it until it's complete
curl -sS -X POST "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState?state=skip"
fi
}

function process_archived_files () {
# Check each node for the newest file. If multiple nodes have the same date file we need to merge the files
nodeArraylength=${#NODE[@]}
for (( i=0; i<${nodeArraylength}; i++ ));
do
echo "index: $i, value: ${NODE[$i]}"
ls ${ARCHIVEDIR}/${NODE[$i]}/counter_*.tar | sort -r | while read l
do
year_month=${l:(-11):7}
echo "Found archive file for "$year_month
response=$(curl -sS -X GET "http://localhost:8080/api/admin/makeDataCount/$year_month/processingState" 2>/dev/null)
state=$(echo "$response" | jq -j '.data.state')
if [[ "${state}" == "DONE" ]] || [[ "${state}" == "SKIP" ]]; then
echo "Skipping due to state:${state}"
else
NEW_LOGDIR=${LOGDIR}/${NODE[$i]}_${year_month}
mkdir -p ${NEW_LOGDIR}
# Copy the tar file from archive back to local, un-tar it and clean up intermediate files.
eval "$CopyFromArchiveCmd/${NODE[$i]}/counter_${year_month}.tar ${NEW_LOGDIR}/counter_${year_month}.tar"
tar -xvzf ${NEW_LOGDIR}/counter_${year_month}.tar --directory ${NEW_LOGDIR}
ls ${NEW_LOGDIR}/counter_${year_month}-* | while read l
do
gzip -d $l
done
rm -r ${NEW_LOGDIR}/counter_${year_month}.tar
break
fi
done
done

# Determine which node/nodes have the newest files. Unless a node was down for the month they should all have files
# for the same dates so merging is a must.
# Get a list of directories under LOGDIR that are in format NODE_yyyy-mm and strip to get yyyy_mm
# Sort so newest yyyy-mm is first in the list
ls -1d $LOGDIR/*/ | rev | cut -d'_' -f1 | rev | sort -r | uniq > /tmp/archived_files
# Read first line and strip off trailing '/' to get the newest year_month to process
read -r line < /tmp/archived_files
year_month=${line:(-8):7}
echo $year_month
# year_month will be empty if no more files to process
if [ ! -z "$year_month" ]; then
# Get the list of directories to merge for this year_month
ls -1d $LOGDIR/*_$year_month/ > /tmp/archived_files

# Merge subsequent directories into firstDirectory. Note: firstDirectory may or may not be NODE 1. It shouldn't matter
read -r firstDirectory < /tmp/archived_files
tail -n +2 /tmp/archived_files | while read l
do
ls ${l}counter_*.log | while read f
do
# It should never happen, but if one of the files is missing, create it so the merge will not fail
if [ ! -e "$f" ]; then
touch $f
fi
# Strip off just the file name, i.e. counter_2024-02-01.log
log_file=${f:(-22)}
sort -um -o ${firstDirectory}${log_file} ${firstDirectory}${log_file} ${f}
done
done

# Now firstDirectory has all the merged data so we can move it to the counter_processor log directory and clean up the NODE directories
eval "$RunAsCounterProcessorUser cp ${firstDirectory}*.log $COUNTERPROCESSORDIR/log"
for (( i=0; i<${nodeArraylength}; i++ ));
do
rm -rf $LOGDIR/${NODE[$i]}*
done

process_json_file "$year_month"

# After processing is done delete the log files from counter_processor log directory
eval "$RunAsCounterProcessorUser rm -rf $COUNTERPROCESSORDIR/log/counter_*.log"
fi
}

# Main
process_archived_files
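
The script above is written for one-off runs; to automate it, a cron entry along these lines could be used (the install path, schedule, and log location are assumptions, not part of this commit):

```
# /etc/cron.d/mdc-processing (hypothetical): process archived MDC logs nightly at 02:00
0 2 * * * counter /usr/local/bin/process_mdc_logs.sh >> /var/log/process_mdc_logs.log 2>&1
```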
@@ -117,10 +117,10 @@ public class DatasetMetrics implements Serializable {
* For an example of sending various metric types (total-dataset-requests,
* unique-dataset-investigations, etc) for a given month (2018-04) per
* country (DK, US, etc.) see
* https://github.com/CDLUC3/counter-processor/blob/5ce045a09931fb680a32edcc561f88a407cccc8d/good_test.json#L893
* https://github.com/gdcc/counter-processor/blob/5ce045a09931fb680a32edcc561f88a407cccc8d/good_test.json#L893
*
* counter-processor uses GeoLite2 for IP lookups according to their
* https://github.com/CDLUC3/counter-processor#download-the-free-ip-to-geolocation-database
* https://github.com/gdcc/counter-processor#download-the-free-ip-to-geolocation-database
*/
@Column(nullable = true)
private String countryCode;
@@ -27,15 +27,15 @@
* How to Make Your Data Count July 10th, 2018).
*
* The recommended starting point to implement Make Data Count is
* https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md
* https://github.com/gdcc/Make-Data-Count/blob/master/getting-started.md
* which specifically recommends reading the "COUNTER Code of Practice for
* Research Data" mentioned in the user facing docs.
*
* Make Data Count was first implemented in DASH. Here's an example dataset:
* https://dash.ucmerced.edu/stash/dataset/doi:10.6071/M3RP49
*
* For processing logs we could try DASH's
* https://github.com/CDLUC3/counter-processor
* https://github.com/gdcc/counter-processor
*
* Next, DataOne implemented it, and you can see an example dataset here:
* https://search.dataone.org/view/doi:10.5063/F1Z899CZ
Binary file added tests/data/app-1/counter_2024-02.tar
Binary file not shown.
Binary file added tests/data/app-1/counter_2024-03.tar
Binary file not shown.
Binary file added tests/data/app-1/counter_2024-04.tar
Binary file not shown.
Binary file added tests/data/app-2/counter_2024-02.tar
Binary file not shown.
