# Verify the completeness of downloaded files 

Open Targets (OT) provides a SHA-1 checksum for each file. We can compute checksums for the files we downloaded and compare them with the ones OT provides.

**Steps in general**

1. Download the files from the OT FTP site
2. Create checksums of the downloaded files
3. Download the checksum file that OT provides
4. Compare the downloaded files' checksums with the ones OT provides



------

### Download OT files from FTP (Bash or Tools)

URL: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/ 

You can use a tool such as FileZilla to download the files to your local environment, or you can use wget:

```console
foo@bar:~$ for i in {1..30}; do wget -e robots=off --recursive --no-parent -c -N --no-host-directories --cut-dirs 8 https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.09/output/etl/json ; done
```

---



### Create a checksum of downloaded files (Bash)


We use Bash to generate a checksum for each downloaded file and store the results in a file.

```console
# Quote the pattern so the shell does not expand *.json before find sees it.
find ./json -type f -name '*.json' -exec shasum "{}" + > downloaded.sha1
```
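
If `shasum` is not available, the same checksum file can be produced in Python. The following is a minimal sketch, not part of the original workflow: the function names `sha1_of_file` and `write_checksums` are hypothetical, and the output mimics the `<checksum>  <path>` layout that `shasum` writes by default.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(root, out_path, pattern="*.json"):
    """Write one '<sha1>  <path>' line per matching file under root."""
    files = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    with open(out_path, "w") as out:
        for path in files:
            out.write(f"{sha1_of_file(path)}  {path.as_posix()}\n")
    return len(files)
```

Calling `write_checksums("./json", "downloaded.sha1")` would cover the same files as the `find` command above and return the number of files hashed.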

---

### Get and edit OT's checksum (Bash)

OT's checksum file lists checksums for all input and output files, but we only care about the ETL output files.
Keep only those entries and save them to a separate file.

```console
# Keep only the ETL JSON outputs; ">" overwrites, so reruns do not append duplicates.
sed -n '/output\/etl\/json\/.*\.json/p' release_data_integrity.sha1 > updated_release_data_integrity.sha1
```
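
The same filter can be sketched in Python, which may be handier if the comparison step is scripted end to end. The helper name `filter_etl_json` is hypothetical, and the sample lines in the usage note are made up for illustration; the real layout of `release_data_integrity.sha1` may differ.

```python
import re

# Same pattern as the sed command: any path containing output/etl/json/...json
ETL_JSON = re.compile(r"output/etl/json/.*\.json")


def filter_etl_json(lines):
    """Keep only the checksum lines that refer to ETL JSON output files."""
    return [line for line in lines if ETL_JSON.search(line)]
```

For example, given the made-up lines `"aaa  ./input/foo/bar.json"` and `"bbb  ./output/etl/json/targets/part-0.json"`, only the second is kept.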

---

### Compare the checksums (Python)


```python
OT_UPDATED_CHECKSUM_ADDR = "/Users/cheny39/Documents/work/22.09/updated_release_data_integrity.sha1"
DOWNLOAD_FILE_CHECKSUM_ADDR = "/Users/cheny39/Documents/work/22.09/downloaded.sha1"


def file_key(path):
    # Key on "<parent directory>/<file name>" so local and OT paths line up.
    return "/".join(path.split("/")[-2:])


# Build a lookup of the downloaded files: key = file path, value = checksum.
dic_download = {}
with open(DOWNLOAD_FILE_CHECKSUM_ADDR, "r") as download_checksum_file:
    for line in download_checksum_file:
        array = line.split()
        if len(array) < 2:
            continue  # skip blank or malformed lines
        dic_download[file_key(array[1])] = array[0]

# Compare every entry in OT's list against the downloaded checksums.
diff_err_count = 0
miss_err_count = 0
with open(OT_UPDATED_CHECKSUM_ADDR, "r") as ot_checksum_file:
    for line in ot_checksum_file:
        array = line.split()
        if len(array) < 2 or "error" in array[1]:
            continue  # skip malformed lines and paths containing "error"
        key = file_key(array[1])
        if key not in dic_download:
            miss_err_count += 1
            print("[ERROR] File", key, "Not present")
        elif dic_download[key] != array[0]:
            diff_err_count += 1
            print("[ERROR] File", key, "Not identical with OT")

print(diff_err_count, "file(s) different from OT")
print(miss_err_count, "file(s) listed by OT but missing locally")
```

Output:

```console
0 file(s) different from OT
0 file(s) listed by OT but missing locally
```
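
The comparison only walks OT's list, so it will not notice local files that OT does not list. A function-based sketch of the same idea, with that reverse check added, could look like this. The names `load_checksums` and `compare` are hypothetical, the skip for paths containing "error" is left out for brevity, and stripping a leading `*` assumes the listing may have been written by `shasum` in binary mode.

```python
def load_checksums(lines):
    """Map '<parent directory>/<file name>' to its checksum for each line."""
    table = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, path = parts[0], parts[1].strip().lstrip("*")
        table["/".join(path.split("/")[-2:])] = checksum
    return table


def compare(local, remote):
    """Return (differing, missing locally, extra locally) as sorted key lists."""
    differing = sorted(k for k in remote if k in local and local[k] != remote[k])
    missing = sorted(k for k in remote if k not in local)
    extra = sorted(k for k in local if k not in remote)
    return differing, missing, extra
```

Because the key is built from the last two path components, `./json/targets/part-0.json` and `./output/etl/json/targets/part-0.json` both map to `targets/part-0.json` and can be matched directly.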


---

### Count the records in each directory (Bash)


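```console
# Each file is assumed to hold one record per line, so line count = record count.
addr=$(pwd)

for dir in "$addr"/*/; do
    basename "$dir" >> ls3.txt
    # Concatenate first so wc prints one total and no per-file lines to double count.
    find "$dir" -type f -name '*.json' -exec cat {} + | wc -l >> ls3.txt
done
```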
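
For comparison, here is a minimal Python sketch of the same per-directory count. The function name `count_records` is hypothetical, and it also assumes one record per line. Unlike `wc -l`, it counts a final line that lacks a trailing newline.

```python
from pathlib import Path


def count_records(root):
    """Return {subdirectory name: total line count of its *.json files}."""
    totals = {}
    for directory in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        lines = 0
        for path in directory.rglob("*.json"):
            if not path.is_file():
                continue
            with open(path, "rb") as handle:
                # Iterating a file object yields one item per line.
                lines += sum(1 for _ in handle)
        totals[directory.name] = lines
    return totals
```

Calling `count_records(".")` from the download directory returns a dictionary that can be printed or written to a file in place of `ls3.txt`.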