## Step-by-step instructions to download HEST-1k 

This tutorial will guide you to:

- Download HEST-1k in its entirety (scanpy, whole-slide images, patches, nuclear segmentation, alignment preview)
- Download some samples of HEST-1k 
- Download samples with some attributes (e.g., all breast cancer cases) 
- Inspect freshly downloaded samples


## Instructions for Setting Up HuggingFace Account and Token

### 1. Create an Account on HuggingFace
Follow the instructions provided on the [HuggingFace sign-up page](https://huggingface.co/join).

### 2. Accept the HEST terms of use

1. Go to the [HEST HuggingFace page](https://huggingface.co/datasets/MahmoodLab/hest)
2. Request access (access will be automatically granted)
3. At this stage, you can already inspect the data manually by browsing the `Files and versions` tab

### 3. Create a Hugging Face Token

1. **Go to Settings:** Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting `Settings` from the dropdown menu.

2. **Access Tokens:** In the settings menu, find and click on `Access tokens`.

3. **Create New Token:**
   - Click on `New token`.
   - Set the token name (e.g., `hest`).
   - Set the access level to `Write`.
   - Click on `Create`.

4. **Copy Token:** After the token is created, copy it to your clipboard. You will need this token for authentication.

### 4. Log in

Install the Python libraries `datasets` and `huggingface-hub`, paste your token into `login(token="")`, and run the cells below. If the login succeeds, you should see:

```
Your token has been saved to /home/usr/.cache/huggingface/token
Login successful
```
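
Hard-coding the token in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read it from an environment variable. The helper name `read_hf_token` is hypothetical and not part of `huggingface_hub`; the default variable name `HF_TOKEN` is only a convention, and any other name would work.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    `read_hf_token` is a hypothetical helper written for this tutorial.
    """
    # Strip whitespace that often sneaks in when copy-pasting the token
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"No token found: set the {env_var} environment variable first."
        )
    return token
```

The returned value can then be passed to the login call, e.g. `login(token=read_hf_token())`.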

In [None]:
%%bash
pip install datasets
pip install huggingface-hub

In [None]:
import os
import zipfile

from huggingface_hub import login, snapshot_download
from tqdm import tqdm

# Paste the token created in step 3 between the quotes
login(token="")

def download_hest(patterns, local_dir):
    """Download the files of MahmoodLab/hest that match `patterns` into `local_dir`."""
    repo_id = 'MahmoodLab/hest'
    snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type="dataset", local_dir=local_dir)

    # If nuclear segmentation archives were downloaded, extract them in place
    seg_dir = os.path.join(local_dir, 'cellvit_seg')
    if os.path.exists(seg_dir):
        print('Unzipping CellViT segmentation...')
        for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):
            path_zip = os.path.join(seg_dir, filename)

            with zipfile.ZipFile(path_zip, 'r') as zip_ref:
                zip_ref.extractall(seg_dir)

### Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)

In [None]:
import datasets
import pandas as pd

local_dir='../../data/Brest_spatialTranscriptome/' # HEST will be downloaded to this folder

meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_1_0.csv")

# Filter the dataframe by organ, oncotree code...
meta_df = meta_df[meta_df['organ'] == 'Breast']

ids_to_query = meta_df['id'].values

# One glob pattern per sample; the loop variable avoids shadowing the built-in id()
list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
download_hest(list_patterns, local_dir) # see method definition above
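
To see why each pattern ends with `[_.]`, it can help to test the patterns locally before starting a large download. The sketch below assumes that `allow_patterns` follows the same shell-style rules as Python's `fnmatch`; the helper names, the sample id `TENX15`, and the file paths are all hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatch


def build_patterns(sample_ids):
    """Build one glob pattern per sample id, in the same form as the cell above."""
    return [f"*{sample_id}[_.]**" for sample_id in sample_ids]


def is_selected(path, patterns):
    """Return True if `path` matches at least one pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


# Hypothetical repository paths, for illustration only
candidate_paths = [
    "st/TENX15.h5ad",
    "tissue_seg/TENX15_mask.jpg",
    "st/TENX159.h5ad",
]

patterns = build_patterns(["TENX15"])
for path in candidate_paths:
    print(path, is_selected(path, patterns))
```

Requiring `_` or `.` right after the id keeps `TENX15` from also selecting the files of `TENX159`, which shares the same prefix.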

In [None]:
import datasets
import pandas as pd
meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_1_0.csv")
meta_df['oncotree_code'].unique()
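
Several metadata keys can be combined in a single filter. The sketch below uses a small toy table instead of the real CSV so that it runs offline; the column names come from the metadata file, while the ids and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the metadata table; ids and values are illustrative only
toy_meta = pd.DataFrame(
    {
        "id": ["S1", "S2", "S3", "S4"],
        "organ": ["Breast", "Breast", "Brain", "Breast"],
        "oncotree_code": ["IDC", "ILC", None, "IDC"],
        "st_technology": ["Visium", "Xenium", "Visium", "Xenium"],
    }
)

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (toy_meta["organ"] == "Breast") & (toy_meta["oncotree_code"] == "IDC")
ids_to_query = toy_meta.loc[mask, "id"].tolist()
print(ids_to_query)  # ['S1', 'S4']

# isin() keeps rows whose value belongs to a set of accepted values
xenium_or_hd = toy_meta[toy_meta["st_technology"].isin(["Xenium", "Visium HD"])]
print(xenium_or_hd["id"].tolist())  # ['S2', 'S4']
```

The resulting list of ids can be turned into download patterns in the same way as in the previous section.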

### Inspect freshly downloaded samples

For each sample, we provide:

- **wsis/**: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)
- **st/**: Spatial transcriptomics expressions in a scanpy .h5ad object
- **metadata/**: Metadata
- **spatial_plots/**: Overlay of the WSI with the ST spots
- **thumbnails/**: Downscaled version of the WSI
- **tissue_seg/**: Tissue segmentation masks:
    - `{id}_mask.jpg`: Downscaled or full resolution greyscale tissue mask
    - `{id}_mask.pkl`: Tissue/holes contours in a pickle file
    - `{id}_vis.jpg`: Visualization of the tissue mask on the downscaled WSI
- **pixel_size_vis/**: Visualization of the pixel size
- **patches/**: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see **st/**) with a barcode.
- **patches_vis/**: Visualization of the mask and patches on a downscaled WSI.
- **transcripts/**: Individual transcripts aligned to H&E for Xenium samples. Read them with `pandas.read_parquet`; the aligned pixel coordinates are in the columns `['he_x', 'he_y']`.
- **cellvit_seg/**: CellViT nuclei segmentation
- **xenium_seg/**: Xenium segmentation on DAPI, aligned to H&E
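
A quick way to check what was actually downloaded for one sample is to walk the folders listed above. The following is a minimal sketch using only the standard library; `inventory_sample` is a hypothetical helper, and it assumes that every file of a sample is named with the sample id followed by `_` or `.`, as in the `{id}_mask.jpg` naming shown in the list.

```python
import os
import re


def inventory_sample(root, sample_id):
    """List the files of `sample_id` found in each subfolder of `root`.

    A file is attributed to the sample when its name starts with the id
    followed by '_' or '.' (assumed naming convention).
    """
    belongs = re.compile(rf"^{re.escape(sample_id)}[_.]")
    found = {}
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # skip loose files such as the metadata CSV
        matches = sorted(f for f in os.listdir(folder_path) if belongs.match(f))
        if matches:
            found[folder] = matches
    return found
```

Calling `inventory_sample(local_dir, sample_id)` returns a dictionary that maps each subfolder name to the files found for that sample; a folder missing from the result means nothing was downloaded for it.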


In [None]:
from hest import iter_hest
import pandas as pd

# Ex: inspect all the Invasive Lobular Carcinoma samples (ILC)
meta_df = pd.read_csv('../assets/HEST_v1_1_0.csv')

id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values

print('load hest...')
# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=id_list):
    print(st)

In [2]:
import pandas as pd

# Load and display the full HEST metadata table
meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_1_0.csv")
meta_df

| Unnamed: 0 | dataset_title | id | image_filename | organ | disease_state | oncotree_code | species | patient | st_technology | data_publication_date | ... | treatment_comment | pixel_size_um_embedded | pixel_size_um_estimated | magnification | fullres_px_width | fullres_px_height | tissue | disease_comment | subseries | hest_version_added |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fresh Frozen Mouse Brain Hemisphere with 5K Mo... | TENX159 | TENX159.tif | Brain | Healthy | | Mus musculus | | Xenium | 7/31/24 | ... | | 0.273768 | 0.274027 | 40x | 17051 | 24689 | Brain | | | v1_1_0 |
| 1 | FFPE Human Skin Primary Dermal Melanoma with 5... | TENX158 | TENX158.tif | Skin | Cancer | SKCM | Homo sapiens | | Xenium | 7/31/24 | ... | | 0.273777 | 0.273754 | 40x | 18669 | 35787 | Skin | Primary Dermal Melanoma | | v1_1_0 |
| 2 | FFPE Human Prostate Adenocarcinoma with 5K Hum... | TENX157 | TENX157.tif | Prostate | Cancer | PRAD | Homo sapiens | | Xenium | 7/31/24 | ... | | 0.273772 | 0.273741 | 40x | 25002 | 49976 | Prostate | | | v1_1_0 |
| 3 | Characterization of immune cell populations in... | TENX156 | TENX156.tif | Bowel | Cancer | COAD | Homo sapiens | Patient 1 | Visium HD | 7/11/24 | ... | | 0.264583 | 0.273802 | 40x | 71106 | 58791 | Colon | Stage II-A | Visium HD, Sample P1 CRC | v1_1_0 |
| 4 | Characterization of immune cell populations in... | TENX155 | TENX155.tif | Bowel | Cancer | COAD | Homo sapiens | Patient 1 | Visium HD | 7/11/24 | ... | | 0.264583 | 0.273874 | 40x | 75250 | 48740 | Colon | | Visium HD, Sample P2 CRC | v1_1_0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1224 | spatialLIBD | MISC5 | MISC5.tif | Brain | Healthy | | Homo sapiens | | Visium | | ... | | | 0.727113 | 20x | 13332 | 13332 | dorsolateral prefrontal cortex | | 151672 | v1_0_0 |
| 1225 | spatialLIBD | MISC4 | MISC4.tif | Brain | Healthy | | Homo sapiens | | Visium | | ... | | | 0.726109 | 20x | 13332 | 13332 | dorsolateral prefrontal cortex | | 151673 | v1_0_0 |
| 1226 | spatialLIBD | MISC3 | MISC3.tif | Brain | Healthy | | Homo sapiens | | Visium | | ... | | | 0.725124 | 20x | 13332 | 13332 | dorsolateral prefrontal cortex | | 151674 | v1_0_0 |
| 1227 | spatialLIBD | MISC2 | MISC2.tif | Brain | Healthy | | Homo sapiens | | Visium | | ... | | | 0.726109 | 20x | 13332 | 13332 | dorsolateral prefrontal cortex | | 151675 | v1_0_0 |
