<a href="https://colab.research.google.com/github/DivyaSwamy/Channel-Dynamics/blob/master/access_datasets_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### ***Author - Divya Swaminathan***
##### ***Date - August 2025***
---

### **Goals** -
> In this tutorial we will load vision datasets from Hugging Face and explore them.

**Step 1** - Generate a token by logging in to Hugging Face (optional but recommended)
  * Log in to Hugging Face,
  * Navigate to Access Tokens and generate a token for Google Colab,
  * In your Google Colab notebook, open Secrets (the key icon in the left panel) and paste the copied token.

**Step 2** -
  * We will load and explore a dataset, and in the process learn how datasets work within Hugging Face.
    * Navigate to the Hugging Face portal, search for relevant datasets and copy the path to the dataset. This path is what you use to access the dataset from Hugging Face.
    * I have chosen "Aliounethegoat/classification-medicale-multi-cancer" for this tutorial.

**Step 3** -
  * Exercise - load and explore an EHR dataset:
  * "QuirkyDataScientist/synthetic_ehr_dataset_part_5"


> Install transformers and datasets

In [None]:
%pip install -q transformers
%pip install -U datasets[vision]

# Use `%pip install -U datasets` to install the core library only.
# The `datasets[vision]` extra additionally installs the image dependencies (e.g. Pillow) needed to decode vision datasets.

> The following bit of code checks for the huggingface token.

In my notebook, the token is saved in Colab Secrets under the name 'huggingface_token'.
If you save it under a different name, pass that name to ***userdata.get(name)*** instead.

In [None]:
from huggingface_hub import login
from google.colab import userdata

HF_TOKEN = userdata.get('huggingface_token') # Retrieve the token from secrets

if HF_TOKEN:
  login(HF_TOKEN)
  print("Successfully logged in to Hugging Face!")
else:
  print("Hugging Face token not found in Colab Secrets.")
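
Outside Colab, `google.colab` is not importable, so the cell above will fail. A minimal sketch of a fallback, assuming the token has been exported as an environment variable. `resolve_hf_token` is a hypothetical helper name, and the variable names it checks are assumptions for illustration:

```python
import os

def resolve_hf_token(env=None, names=("HF_TOKEN", "HUGGINGFACE_TOKEN")):
    """Return the first non-empty token found under `names`, else None.

    Hypothetical helper: `env` defaults to os.environ, and the variable
    names are assumptions, not something this notebook defines.
    """
    env = os.environ if env is None else env
    for name in names:
        value = (env.get(name) or "").strip()
        if value:
            return value
    return None

token = resolve_hf_token()
print("token found" if token else "no token in environment")
```

If a token is returned, it can be passed to `login(token)` in the same way as in the Colab cell.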


In [None]:
from datasets import load_dataset_builder, load_dataset
from datasets import get_dataset_split_names, DatasetDict, Dataset



* The dataset path copied from the Hugging Face portal (see Step 2) is all you need to access a dataset.

* *`load_dataset_builder(path_to_dataset)`* - loads a dataset builder so that you can inspect a dataset's attributes without committing to downloading it.

* For every dataset you can check its *info* (description of the dataset), *features* (what's in it: images, labels, etc.) and *splits* (train, test, validation) fields.


In [None]:
path_to_dataset = "Aliounethegoat/classification-medicale-multi-cancer"
#path_to_dataset = "cornell-movie-review-data/rotten_tomatoes"

ds_builder = load_dataset_builder(path_to_dataset)

* *`ds_builder = load_dataset_builder(path_to_dataset)`* - you will get a warning here if you are not logged in to your Hugging Face account. Once logged in, the warning disappears.


* If you are happy with the dataset, use *`load_dataset()`* to download it.


In [None]:
# ds_builder.info exposes three fields of interest: description, features and splits.

print(ds_builder.info.description)
print('-------')
print(ds_builder.info.features)
print('-------')
print(ds_builder.info.splits)
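
`ds_builder.info.splits` can be `None` when no split metadata is available for a dataset, so printing it directly is not always informative. A minimal sketch of a defensive summary: `summarize_splits` is a hypothetical helper, and the stand-in object below only mimics the `num_examples` attribute of the entries in `info.splits`:

```python
from types import SimpleNamespace

def summarize_splits(splits):
    """Map split name -> number of examples; empty dict if metadata is missing."""
    if not splits:
        return {}
    return {name: getattr(info, "num_examples", None) for name, info in splits.items()}

# Stand-in for ds_builder.info.splits (assumption: entries expose num_examples)
fake_splits = {"train": SimpleNamespace(num_examples=130002)}
print(summarize_splits(fake_splits))
print(summarize_splits(None))
```

With the real builder, the call would be `summarize_splits(ds_builder.info.splits)`.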


In [None]:
for elem in ds_builder.info.features:
  print(elem, ':-',ds_builder.info.features[elem] )

In [None]:
# Load the dataset
dataset = load_dataset(path_to_dataset)
print('---------')
dataset

In [None]:
# The dataset has only a train split, with 130002 images.
# Each image has a label and a label_name associated with it.
dataset

In [None]:
# Lists the splits available for this dataset. If the DatasetDict had train, test
# and validation splits, all of them would appear in the returned list.
get_dataset_split_names(path_to_dataset)

In [None]:
print(dataset['train'][0])
print(dataset['train'][1500])
print(dataset['train'][13000])



* What is the difference between label and label_name?

The dataset has images from various cancer types. `label_name` identifies the cancer type, while `label` is the sub-classification within that cancer type. Without a data dictionary, it is unclear what each sub-classification refers to.
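
One way to partly recover the missing mapping is to tabulate which `label` values occur under each `label_name`. A minimal sketch on made-up values: `labels_by_type` is a hypothetical helper and, apart from `cancer_sein`, the strings below are invented for illustration:

```python
from collections import defaultdict

def labels_by_type(label_names, labels):
    """Group the distinct sub-class labels seen under each cancer type."""
    groups = defaultdict(set)
    for name, label in zip(label_names, labels):
        groups[name].add(label)
    return {name: sorted(values) for name, values in groups.items()}

# Toy columns standing in for dataset['train']['label_name'] and dataset['train']['label']
toy_names = ["cancer_sein", "cancer_sein", "type_b", "type_b", "cancer_sein"]
toy_sub_labels = ["sub_1", "sub_2", "sub_3", "sub_3", "sub_1"]
grouped = labels_by_type(toy_names, toy_sub_labels)
print(grouped)
```

On the real data the call would be `labels_by_type(dataset['train']['label_name'], dataset['train']['label'])`, which reads only the two label columns.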

In [None]:
from collections import defaultdict

count_label_names = defaultdict(int)
count_label = defaultdict(int)

for item in dataset['train']['label']:
  count_label[item]+= 1

for item in dataset['train']['label_name']:
  count_label_names[item]+= 1


In [None]:
count_label

In [None]:
count_label_names

##### Can you isolate all the breast cancer images?

Yes.

* Filtering on `label_name` works; slicing with dataset['train'][25000:35000] is also viable, though slow.


In [None]:
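# This runs very slowly because filter() visits every example in the train split.
# 'cancer_sein' is the label_name for breast cancer (French: cancer du sein).
breast_cancer = dataset['train'].filter(lambda x: x['label_name'] == 'cancer_sein')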
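
A possibly faster alternative, sketched under the assumption that reading the `label_name` column on its own avoids touching the images: collect the matching row indices first, then pass them to `Dataset.select`. `matching_indices` is a hypothetical helper, and the toy list stands in for `dataset['train']['label_name']`:

```python
def matching_indices(column, target):
    """Return the positions in `column` whose value equals `target`."""
    return [i for i, value in enumerate(column) if value == target]

toy_column = ["cancer_sein", "other", "cancer_sein", "other"]
idx = matching_indices(toy_column, "cancer_sein")
print(idx)

# With the real dataset (not executed here):
# idx = matching_indices(dataset['train']['label_name'], 'cancer_sein')
# breast_cancer = dataset['train'].select(idx)
```

Whether this is actually faster depends on the dataset and library version, so it is worth timing both approaches.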


In [None]:
breast_cancer

In [None]:
print(breast_cancer[500])
print(breast_cancer[-500])

In [None]:
# next(iter(dataset['train']))

> Let's plot a random selection of images from the breast_cancer dataset

In [None]:
from PIL import Image
import random
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Sample 10 distinct indices that are guaranteed to be in range.
idx = random.sample(range(len(breast_cancer)), 10)

fig, axs = plt.subplots(2, 5, figsize=(10, 5))
axs = axs.ravel()

for ax, j in zip(axs, idx):
  sample = breast_cancer[j]  # fetch (and decode) the example only once
  image_np = np.array(sample['image'])
  # print('Image Dimensions', image_np.shape)
  ax.imshow(image_np)
  # An f-string works whether 'label' is stored as a string or as an integer.
  ax.set_title(f"{sample['label_name']}: {sample['label']}", fontsize=6)
  ax.set_axis_off()


In [None]:
# Is it a balanced dataset?

count_label = defaultdict(int)

for item in breast_cancer['label']:
  count_label[item]+= 1

print(count_label)
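
To put a number on balance, the counts can be reduced to per-class shares and a largest-to-smallest ratio. A minimal sketch using `collections.Counter`: `balance_report` is a hypothetical helper, the toy labels are invented, and the function assumes a non-empty list of labels:

```python
from collections import Counter

def balance_report(labels):
    """Return per-class shares and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio

# Toy labels standing in for breast_cancer['label']
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 3
shares, ratio = balance_report(toy_labels)
print(shares, ratio)
```

A ratio close to 1 suggests balanced classes. On the real data the call would be `balance_report(breast_cancer['label'])`.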

> Using iterable datasets -> convert the dataset to an iterable dataset and explore

 *`There are two types of dataset objects, a Dataset and an IterableDataset. Whichever type of dataset you choose to use or create depends on the size of the dataset. In general, an IterableDataset is ideal for big datasets (think hundreds of GBs!) due to its lazy behavior and speed advantages, while a Dataset is great for everything else.`* (quoted from the Hugging Face Datasets documentation)

 > Since breast_cancer is a subset of the full dataset, it wasn't necessary to use an iterable dataset here. Nevertheless, it is good practice.

In [None]:
iterable_bc = breast_cancer.to_iterable_dataset()

In [None]:
# Draw one example from the shuffled iterable dataset.
# Try this: what happens if you use x = next(iter(iterable_bc)) instead?
x = next(iter(iterable_bc.shuffle()))

image_np = np.array(x['image'])
pil_image = Image.fromarray(image_np)
plt.imshow(pil_image)
plt.title(str(x['label']))



For an iterable dataset, use .shuffle() to shuffle the images.
  * .take(n) returns the first n examples; it does not look up a specific image id.
  * Wrapping .take(n) in list() materializes those n examples. The larger n is, the longer it takes to list all the images.
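
The Hugging Face documentation describes `IterableDataset.shuffle` as an approximate shuffle that samples from a fixed-size buffer. A pure-Python sketch of that idea, not the library's implementation: `buffered_shuffle` is a hypothetical name, `buffer_size` must be at least 1, and `itertools.islice` plays the role of `.take(n)`:

```python
import random
from itertools import islice

def buffered_shuffle(items, buffer_size, seed=None):
    """Lazily yield `items` in an approximately shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]        # emit a random buffered item
        buffer[pos] = item       # and replace it with the incoming one
    rng.shuffle(buffer)          # flush whatever is left at the end
    yield from buffer

stream = buffered_shuffle(range(100), buffer_size=10, seed=0)
first_five = list(islice(stream, 5))  # analogous to list(ds.shuffle().take(5))
print(first_five)
```

Only the buffer is held in memory, which is why this style of shuffling suits datasets that are streamed rather than fully loaded. The trade-off is that an item can only move as far as the buffer allows.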



In [None]:

list(iterable_bc.shuffle().take(5))


In [None]:
# .take(1000) returns a new dataset holding the first 1000 examples, not a single image.
iterable_bc.take(1000)

In [None]:
from ipywidgets import Widget
Widget.close_all()