# Download the pathology images and annotations

#### The dataset comes from this Hugging Face repo: [Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset](https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset)

## Download the reference table

In [4]:
from datasets import load_dataset

# load_dataset returns a DatasetDict keyed by split name, not a DataFrame,
# so column access such as df.diagnosis raises
# AttributeError: 'DatasetDict' object has no attribute 'diagnosis'.
ds = load_dataset("Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset")
print(ds)  # inspect the available splits and columns

Reading `PRAD.csv` directly with pandas gives a DataFrame instead:

In [None]:
import pandas as pd

# pandas resolves hf:// paths through huggingface_hub, which must be installed.
df = pd.read_csv("hf://datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset/dataset/PRAD/PRAD.csv")
print(df.diagnosis[0])
print(df.slide_id[0])
print(df.slide_name[0])
print(df.label[0])
print(len(df))

This slide contains predominately Gleason Grade 5 tumor, characterized by the absence of glandular differentiation, alongside Gleason Grade 4 tumor, characterized by glandular fusion or colloid structure or cribriform structure.
eb00cbed-63c4-4d47-9b6a-9dde1306b8cd
TCGA-KK-A7B2-01Z-00-DX1.3E779031-6FE4-4BD0-838C-D9ED49E1B9A7.svs
Gleason Score 5+4
138


#### The PRAD.csv table stores the slide IDs, annotation labels, and diagnosis text. There are 138 cases/images.
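
The `label` column stores the grade as text, such as `Gleason Score 5+4` in the first row. The following is a minimal sketch of splitting such a label into its primary and secondary Gleason patterns. It assumes every label follows the `Gleason Score A+B` format seen in that row; `parse_gleason_label` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Assumed label format, based on the first row of PRAD.csv: "Gleason Score 5+4".
_GLEASON_RE = re.compile(r"Gleason Score\s*(\d)\s*\+\s*(\d)")


def parse_gleason_label(label):
    """Return (primary, secondary, total), or None if the label does not match."""
    match = _GLEASON_RE.search(label)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    return primary, secondary, primary + secondary


print(parse_gleason_label("Gleason Score 5+4"))  # (5, 4, 9)
```

With the table loaded, `df.label.map(parse_gleason_label)` would apply the helper to every row; labels in another format come back as `None` and can be inspected separately.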

## Download client tool

Download the gdc-client tool, which is needed to fetch the whole slide images from TCGA.
``` bash
wget https://gdc.cancer.gov/system/files/public/file/gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
unzip gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
chmod 755 gdc-client
# Add the folder containing gdc-client to PATH (here /data/jjiang10/Tools; adjust to your location).
# Single quotes keep $PATH unexpanded until ~/.bashrc is sourced.
echo 'export PATH=/data/jjiang10/Tools:$PATH' >> ~/.bashrc
source ~/.bashrc
```

Use gdc-client to download the slide image.
For example:
``` bash
./gdc-client download eb00cbed-63c4-4d47-9b6a-9dde1306b8cd
```
Here `eb00cbed-63c4-4d47-9b6a-9dde1306b8cd` is the first slide ID in `df.slide_id`.

The following instructions are from the [GDC documentation](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/), section "Downloading Data Using GDC File UUIDs":

The GDC Data Transfer Tool also supports downloading of one or more individual files using UUID(s) instead of a manifest file. To do this, enter the UUID(s) after the download command:
``` bash
gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
```
Multiple UUIDs can be specified, separated by a space:
``` bash
gdc-client download e5976406-473a-4fbd-8c97-e95187cdc1bd fb3e261b-92ac-4027-b4d9-eb971a92a4c3
```
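
Because several UUIDs can share one command, a long list of slide IDs can be split into a few smaller commands rather than one very long line. The sketch below shows one possible way to do that. `build_download_commands` and its `batch_size` parameter are hypothetical names chosen for illustration, and the UUIDs are the ones quoted above.

```python
import shlex


def build_download_commands(uuids, batch_size=20, client="gdc-client"):
    """Split UUIDs into batches and return one download command per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    uuids = list(uuids)
    commands = []
    for start in range(0, len(uuids), batch_size):
        batch = uuids[start:start + batch_size]
        # shlex.join quotes arguments only when the shell would need it.
        commands.append(shlex.join([client, "download", *batch]))
    return commands


example_uuids = [
    "e5976406-473a-4fbd-8c97-e95187cdc1bd",
    "fb3e261b-92ac-4027-b4d9-eb971a92a4c3",
    "22a29915-6712-4f7a-8dba-985ae9a1f005",
]
for command in build_download_commands(example_uuids, batch_size=2):
    print(command)
```

Smaller batches also make it easier to rerun only the part that failed if a download is interrupted.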


## Download a snapshot of the dataset
The snapshot contains all the files in this dataset, including the ROI annotations, which are our main interest.

In [None]:
# download a snapshot (the latest version of data) to local drive.
from huggingface_hub import snapshot_download
local_dir = "/data/jjiang10/Data/ProstatePathology"
ddir = snapshot_download(repo_id="Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset", local_dir=local_dir, repo_type="dataset")
ddir

In [None]:
# Count the GeoJSON files, which hold the regional (ROI) annotations.
import os, glob
geojson_files = glob.glob(os.path.join(local_dir, "dataset/PRAD/*.geojson"))
len(geojson_files)
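
To get a first look at what an annotation file contains, one option is to count its features by geometry type. The sketch below assumes the files are standard GeoJSON, either a `FeatureCollection`, a single `Feature`, or a bare list of features; the actual layout of the files in this dataset has not been checked here. `summarize_geojson` is a hypothetical helper, and the sample content is made up for illustration.

```python
import json
from collections import Counter


def summarize_geojson(data):
    """Count features per geometry type in parsed GeoJSON content."""
    if isinstance(data, dict) and data.get("type") == "FeatureCollection":
        features = data.get("features", [])
    elif isinstance(data, dict) and data.get("type") == "Feature":
        features = [data]
    elif isinstance(data, list):
        features = data  # assumed to be a bare list of Feature objects
    else:
        features = []
    return Counter(
        (feature.get("geometry") or {}).get("type", "unknown")
        for feature in features
    )


# Made-up example content, not taken from the dataset.
sample_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 0]]],
            },
        },
        {"type": "Feature", "properties": {}, "geometry": None},
    ],
})
print(summarize_geojson(json.loads(sample_text)))
```

For a downloaded file, the same helper could be called on `json.load(f)` after opening one of the paths in `geojson_files`.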

## Download whole slide images
Whole slide images (WSIs) need to be downloaded with the gdc-client tool from TCGA.
The following code generates the command line to download all the original WSIs.

In [None]:
# Create WSI download command
cmd_str = "gdc-client download " + " ".join(list(df.slide_id))
print(cmd_str)



Create a folder to store the original whole slide images:
``` bash
cd /data/jjiang10/Data/ProstatePathology
mkdir -p WSIs
cd WSIs
```

In [None]:
import subprocess

# Write a bash script that downloads all WSIs into the WSIs folder.
with open("download_script.sh", "w") as f:
    f.write("#!/bin/bash\n")
    f.write("cd /data/jjiang10/Data/ProstatePathology/WSIs\n")
    f.write(cmd_str + "\n")
    f.write("ls -l\n")

# Make the script executable.
subprocess.run(["chmod", "+x", "download_script.sh"], check=True)

Run the generated bash script (`./download_script.sh`) to download the images.

In [None]:
# List all downloaded WSI files; gdc-client saves each file in a folder named after its UUID.
import glob
import os
local_dir = "/data/jjiang10/Data/ProstatePathology"
wsi_list = glob.glob(os.path.join(local_dir, "WSIs", "*", "*.svs"))
print(os.path.join(local_dir, "WSIs", "*", "*.svs"))
print(len(wsi_list))
wsi_list
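
Once the files are listed, it can be useful to check which slides from the table are still missing. The sketch below compares expected file names against downloaded paths. It assumes each downloaded file keeps the name stored in `df.slide_name`, which has not been verified here; `find_missing_slides` is a hypothetical helper, and the names in the example are placeholders.

```python
import os


def find_missing_slides(expected_names, wsi_paths):
    """Return the expected slide file names that have no downloaded file."""
    downloaded = {os.path.basename(path) for path in wsi_paths}
    return sorted(name for name in expected_names if name not in downloaded)


# Placeholder names for illustration only.
expected = ["slide_a.svs", "slide_b.svs"]
paths = ["WSIs/uuid-1/slide_a.svs"]
print(find_missing_slides(expected, paths))  # ['slide_b.svs']
```

In the notebook, `find_missing_slides(df.slide_name, wsi_list)` would return the slides to download again.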