Spark/Hadoop commands for pipelines

This page describes some HDFS and Spark commands that are useful for pipelines.
```shell
# Create the dwca-imports directory
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports

# Copy a dataset from HDFS to the local filesystem
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp

# Remove all datasets from /pipelines-data
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*

# Remove all datasets plus all derived outputs
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
```
If you are trying to remove everything, an alternative is to:
- shut down the Hadoop cluster
- reformat the HDFS namenode:

```shell
hdfs namenode -format
```

(this was suggested by Dave in Slack).
If a dr has duplicate keys it cannot be indexed, and you will see a log entry like:

`The dataset can not be indexed. See logs for more details: HAS_DUPLICATES`

In this case a `duplicateKeys.csv` file is generated with details of the duplicate records. You can copy these files to the local filesystem with:
```shell
for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v " 0 " | cut -d "/" -f 3`; do
  sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv
done
```
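As a sanity check on that loop: `grep -v " 0 "` drops zero-length files, and `cut -d "/" -f 3` pulls the dataset id out of each listed path (everything before the first `/` is field 1, so the id is field 3). A minimal local sketch, using a fabricated line in the shape of `hdfs dfs -ls` output:

```shell
# Fabricated `hdfs dfs -ls`-style line (dr251 is a sample dataset id)
line='-rw-r--r--   3 spark supergroup      2048 2024-01-01 00:00 /pipelines-data/dr251/1/validation/duplicateKeys.csv'
# Field 3 of the "/"-separated path is the dataset id
echo "$line" | cut -d '/' -f 3
```

This prints `dr251`, which is then interpolated into the `-copyToLocal` path.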
During the migration of uuids you can find occurrences of drs that no longer exist in your collectory. In this case you will get indexing errors for those missing drs with the message `NOT_AVAILABLE`, and in HDFS you only have those uuids under `identifiers`. So we'll delete them from biocache-store. You have to install `yq` and `avro-tools` and follow these steps:

- Create a file with all these drs, let's call it `/tmp/missing`
- Copy the avro files of those drs:

```shell
for i in `cat /tmp/missing`; do
  mkdir -p /tmp/missing-uuids/$i/
  sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/identifiers/ala_uuid/* /tmp/missing-uuids/$i/
done
```

- Join all the uuids to delete into one file:

```shell
for i in `ls /tmp/missing-uuids/dr*/*avro`; do
  avrocat $i | jq .uuid.string | sed 's/"//g' >> /tmp/del_uuids
done
```

- `scp` that `/tmp/del_uuids` file to your biocache-store server.
- Delete in biocache-store with `biocache-store delete-records -f /tmp/del_uuids`.
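The uuid-collection loop can be dry-run locally without HDFS or `avrocat`. In this sketch, plain JSON files stand in for `avrocat` output and `grep`/`sed` stand in for `jq`; all paths and uuids are demo-only names, not real pipeline data:

```shell
# Local dry run of the uuid-collection loop (demo paths, fabricated uuids)
mkdir -p /tmp/missing-uuids-demo/dr1 /tmp/missing-uuids-demo/dr2
echo '{"uuid": {"string": "aaaa-1111"}}' > /tmp/missing-uuids-demo/dr1/part.json
echo '{"uuid": {"string": "bbbb-2222"}}' > /tmp/missing-uuids-demo/dr2/part.json
: > /tmp/del_uuids_demo
for f in /tmp/missing-uuids-demo/dr*/*.json; do
  # Pull the quoted value of "string", then strip the key and the quotes
  grep -o '"string": "[^"]*"' "$f" | sed 's/.*: "//; s/"$//' >> /tmp/del_uuids_demo
done
cat /tmp/del_uuids_demo
```

This prints one bare uuid per line (`aaaa-1111`, `bbbb-2222`), the same shape `biocache-store delete-records -f` expects in its input file.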
To restart the Spark and Hadoop services, clearing Spark's temporary data along the way:

```shell
sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh
```
```shell
#!/bin/bash
# Find and delete all 'ala_uuid_backup' in any sub-directory of '/pipelines-data/*/1/identifiers/'
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}' | while read -r file
do
  if [[ -n "$file" ]]; then
    sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$file"
  fi
done
```
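The `awk '{print $8}'` in that pipeline relies on the path being the 8th whitespace-separated field of `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path). A local sketch with two fabricated listing lines, only one of which mentions `ala_uuid_backup`:

```shell
# Fabricated `hdfs dfs -ls`-style lines; grep selects the backup entry,
# awk extracts the 8th field (the path)
printf '%s\n' \
  'drwxr-xr-x   - spark supergroup          0 2024-01-01 00:00 /pipelines-data/dr1/1/identifiers/ala_uuid' \
  'drwxr-xr-x   - spark supergroup          0 2024-01-01 00:00 /pipelines-data/dr1/1/identifiers/ala_uuid_backup' \
  | grep 'ala_uuid_backup' | awk '{print $8}'
```

This prints only `/pipelines-data/dr1/1/identifiers/ala_uuid_backup`, the path that the script would then pass to `hdfs dfs -rm -r`.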
If you want to delete everything except the identifiers info (so the same uuids are preserved), use:
```shell
#!/bin/bash
# List all the base directories in /pipelines-data
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data | grep '^d' | awk '{print $8}' | while read -r base_dir
do
  # Ensure the base_dir variable is not empty
  if [[ -z "$base_dir" ]]; then
    continue
  fi
  # List all subdirectories except the '1' directory which should contain the 'identifiers'
  sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir" | grep -v '^d.*\/1$' | awk '{print $8}' | while read -r sub_dir_to_delete
  do
    # Ensure the sub_dir_to_delete variable is not empty
    if [[ -z "$sub_dir_to_delete" ]]; then
      continue
    fi
    # Delete the subdirectory
    sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$sub_dir_to_delete"
  done
  # If the '1' directory exists, list and delete all contents except the 'identifiers' subdirectory
  if sudo -u spark /data/hadoop/bin/hdfs dfs -test -e "$base_dir/1"; then
    sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir/1" | grep -v 'identifiers' | awk '{print $8}' | while read -r content_to_delete
    do
      # Ensure the content_to_delete variable is not empty
      if [[ -z "$content_to_delete" ]]; then
        continue
      fi
      # Delete the content
      sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$content_to_delete"
    done
  fi
done
```
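The filter that protects the identifiers, `grep -v '^d.*/1$'`, keeps every listed directory except the one whose path ends in `/1`. A local sketch with fabricated listing lines (demo paths only):

```shell
# Fabricated directory listing: the /1 directory (which holds identifiers)
# must survive; every sibling version directory is selected for deletion
printf '%s\n' \
  'drwxr-xr-x   - spark supergroup          0 2024-01-01 00:00 /pipelines-data/dr1/1' \
  'drwxr-xr-x   - spark supergroup          0 2024-01-01 00:00 /pipelines-data/dr1/2' \
  | grep -v '^d.*/1$' | awk '{print $8}'
```

This prints only `/pipelines-data/dr1/2`; the `/1` entry is excluded and therefore never reaches the `hdfs dfs -rm -r` step.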