<a href="https://colab.research.google.com/github/activeloopai/examples/blob/istranic-adding-colabs/Getting_Started_with_Hub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Step 1**: _Hello World_

## Installing Hub

Hub can be installed via `pip`.

In [None]:
from IPython.display import clear_output
!pip3 install hub
clear_output()

In [None]:
# IMPORTANT - Please restart your Colab runtime after installing Hub!
# This is a Colab-specific issue that prevents some imports from working properly.
import os
os.kill(os.getpid(), 9)

## Fetching your first Hub dataset

Begin by loading in [MNIST](https://en.wikipedia.org/wiki/MNIST_database), the hello world dataset of machine learning. 

First, load the `Dataset` by pointing to its storage location. Datasets hosted on the Activeloop Platform are typically identified by the namespace of the organization followed by the dataset name: `activeloop/mnist-train`.

In [None]:
import hub

dataset_path = 'hub://activeloop/mnist-train'
ds = hub.load(dataset_path) # Returns a Hub Dataset but does not download data locally

## Reading Samples From a Hub Dataset

Data is not immediately read into memory because Hub operates [lazily](https://en.wikipedia.org/wiki/Lazy_evaluation). You can fetch data by calling the `.numpy()` method, which reads data into a NumPy array.


In [None]:
# Indexing
W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array if possible
                               # This method produces an exception if
                               # the shape of the images is not equal
Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store as list of 
                                           # NumPy arrays

In [None]:
print('X is {}'.format(X))
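To make the idea of lazy loading concrete, here is a minimal, standard-library sketch. The `LazyArray` class and the `load_labels` function are hypothetical names invented for this illustration; they are not part of Hub and do not reflect how Hub is implemented. The sketch only shows the general pattern: indexing records what was requested, and the read happens when the data is explicitly materialized (analogous to calling `.numpy()`).

```python
class LazyArray:
    """Toy stand-in for a lazily loaded tensor (hypothetical, not Hub's implementation)."""

    def __init__(self, loader, index=None):
        self._loader = loader   # callable that performs the expensive read
        self._index = index     # selection recorded by indexing, applied later

    def __getitem__(self, index):
        # Indexing only records what was asked for; nothing is read yet.
        return LazyArray(self._loader, index)

    def materialize(self):
        # The read happens here, analogous to calling .numpy() on a Hub tensor.
        data = self._loader()
        return data if self._index is None else data[self._index]


reads = []

def load_labels():
    reads.append("read")        # record every time "storage" is actually touched
    return [5, 0, 4, 1, 9]

labels = LazyArray(load_labels)
first_two = labels[0:2]         # no read so far
reads_before = len(reads)
values = first_two.materialize()  # triggers exactly one read
print(reads_before, values, len(reads))  # prints: 0 [5, 0] 1
```

In this toy, slicing costs nothing until `materialize()` is called, which is the same reason `ds.images[0:100]` returns quickly while `.numpy()` does the actual fetching.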

Congratulations, you've got Hub working on your local machine! 🤓

# **Step 2**: _Creating Hub Datasets_
*Creating and storing Hub Datasets manually.*

Creating Hub datasets is simple, and you have full control over connecting your source data (files, images, etc.) to specific tensors in the Hub Dataset.

## Manual Creation

Let's follow along with the example below to create our first dataset. First, download and unzip the small classification dataset below called the *animals dataset*.

In [None]:
# Download dataset
from IPython.display import clear_output
!wget https://firebasestorage.googleapis.com/v0/b/gitbook-28427.appspot.com/o/assets%2F-M_MXHpa1Cq7qojD2u_r%2F-MbI7YlHiBJg6Fg-HsOf%2F-MbIUlXZn7EYdgDNncOI%2Fanimals.zip?alt=media&token=c491c2cb-7f8b-4b23-9617-a843d38ac611
clear_output()

In [None]:
# Unzip to './animals' folder
!unzip -qq /content/assets%2F-M_MXHpa1Cq7qojD2u_r%2F-MbI7YlHiBJg6Fg-HsOf%2F-MbIUlXZn7EYdgDNncOI%2Fanimals.zip?alt=media

The dataset has the following folder structure:

animals
- cats
  - image_1.jpg
  - image_2.jpg
- dogs
  - image_3.jpg
  - image_4.jpg

Now that you have the data, you can **create a Hub `Dataset`** and initialize its tensors. Running the following code will create a Hub dataset inside of the `./animals_hub` folder.


In [None]:
import hub
from PIL import Image
import numpy as np
import os

ds = hub.empty('./animals_hub') # Creates the dataset

Next, let's inspect the folder structure for the source dataset `'./animals'` to find the class names and the files that need to be uploaded to the Hub dataset.

In [None]:
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
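One detail worth keeping in mind: `os.listdir` returns entries in arbitrary order and also returns plain files, so label numbers derived from it can vary between machines or pick up stray entries. The sketch below is one possible variant, not the tutorial's method. It uses only the standard library, and the helper name `list_classes_and_files` is an assumption made for this example. It keeps only subdirectories, sorts them, and pairs every file with its label index. It scans one level of class folders rather than walking recursively.

```python
import os
import tempfile

def list_classes_and_files(dataset_folder):
    """Return (class_names, [(file_path, label_index), ...]) for a folder of class subfolders."""
    # Keep only directories and sort them so label numbers are reproducible.
    class_names = sorted(
        name for name in os.listdir(dataset_folder)
        if os.path.isdir(os.path.join(dataset_folder, name))
    )
    samples = []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(dataset_folder, name)
        for filename in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, filename), label))
    return class_names, samples

# Build a throwaway copy of the folder layout shown earlier (empty placeholder files).
layout = {"cats": ["image_1.jpg", "image_2.jpg"], "dogs": ["image_3.jpg", "image_4.jpg"]}
with tempfile.TemporaryDirectory() as root:
    for cls, files in layout.items():
        os.makedirs(os.path.join(root, cls))
        for f in files:
            open(os.path.join(root, cls, f), "w").close()
    open(os.path.join(root, "README.txt"), "w").close()  # stray file that is not a class

    class_names, samples = list_classes_and_files(root)
    labels = [label for _, label in samples]
    print(class_names, labels)  # prints: ['cats', 'dogs'] [0, 0, 1, 1]
```

The stray `README.txt` is ignored because it is not a directory, so it never becomes a class name.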

Next, let's **create the dataset tensors and upload metadata**. Check out our page on [Storage Synchronization](https://docs.activeloop.ai/how-hub-works/storage-synchronization) for details about the `with` syntax below.


In [None]:
with ds:
  # Create the tensors with names of your choice.
  ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
  ds.create_tensor('labels', htype = 'class_label', class_names = class_names)

  # Add arbitrary metadata - Optional
  ds.info.update(description = 'My first Hub dataset')
  ds.images.info.update(camera_type = 'SLR')

**Note:** Specifying `htype` and `dtype` is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use `dtype` to specify the numeric type of tensor data, and use `htype` to specify the underlying data structure. More information on `htype` can be found [here](https://api-docs.activeloop.ai/htypes.html).

Finally, let's **populate the data** in the tensors.         

In [None]:
with ds:
    # Iterate through the files and append to hub dataset
    for file in files_list:
        label_text = os.path.basename(os.path.dirname(file))
        label_num = class_names.index(label_text)
        
        ds.images.append(hub.read(file))  # Append to images tensor using hub.read
        ds.labels.append(np.uint32(label_num)) # Append to labels tensor

**Note:** `ds.images.append(hub.read(path))` is functionally equivalent to `ds.images.append(np.array(PIL.Image.open(path)))`. However, the `hub.read()` method is significantly faster because it does not decompress and recompress the image if the compression matches the `sample_compression` for that tensor. Further details are available in **Step 3**.

Check out the first image from this dataset. More details about Accessing Data are available in **Step 4**.

In [None]:
Image.fromarray(ds.images[0].numpy())

## Automatic Creation

The above animals dataset can also be converted to Hub format automatically using 1 line of code:

In [None]:
src = "./animals"
dest = './animals_hub_auto'

ds = hub.ingest(src, dest)

In [None]:
Image.fromarray(ds.images[0].numpy())

**Note**: Automatic creation currently only supports image classification datasets, though support for other dataset types is continually being added. A full list of supported datasets is available [here](https://api-docs.activeloop.ai/#hub.ingest).

## Creating Tensor Hierarchies

Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled—such as bounding boxes and their corresponding labels. Hierarchy can be created using tensor `groups`:

In [None]:
ds = hub.empty('./groups_test') # Creates the dataset

# Create tensor hierarchies
ds.create_group('my_group')
ds.my_group.create_tensor('my_tensor')

# Alternatively, a group can be created using create_tensor with '/'
ds.create_tensor('my_group_2/my_tensor') #Automatically creates the group 'my_group_2'

Tensors in groups are accessed via:

In [None]:
ds.my_group.my_tensor
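As a mental model for how a `/`-separated name maps to a hierarchy, the following standard-library sketch builds a nested dictionary from tensor paths. It is purely illustrative: the `create_tensor` function here is a hypothetical toy operating on a plain `dict`, not Hub's `ds.create_tensor`, and it says nothing about how Hub stores groups internally.

```python
def create_tensor(tree, path):
    """Create an empty 'tensor' (a list) at a '/'-separated path, creating groups as needed."""
    *groups, name = path.split("/")
    node = tree
    for group in groups:
        node = node.setdefault(group, {})  # a group is modelled as a nested dict
    node[name] = []
    return node[name]

ds_tree = {}
create_tensor(ds_tree, "my_group/my_tensor")
create_tensor(ds_tree, "my_group_2/my_tensor")  # the group is created implicitly
print(ds_tree)
# prints: {'my_group': {'my_tensor': []}, 'my_group_2': {'my_tensor': []}}
```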

For more detailed information regarding accessing datasets and their tensors, check out **Step 4**.

# **Step 3**: _Understanding Compression_

*Using compression to achieve optimal performance.*

All sample data in Hub can be stored in a raw uncompressed format. However, in order to achieve optimal performance in terms of speed and memory, it is critical to specify an appropriate compression method for your data.

For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the `sample_compression` input:

In [None]:
import hub

ds = hub.empty('./compression_test')

In [None]:
ds.create_tensor("images_example", htype = "image", sample_compression = "jpeg")

In this example, every image added in subsequent `.append(...)` calls is compressed using the specified `sample_compression` method. If the source data is already in the correct compression format, it is saved as-is. Otherwise, it is recompressed to the specified format, as described in detail below. 

#### **When choosing the optimal compression, the primary tradeoffs are lossiness, memory, and runtime:**

**Lossiness** - Certain compression techniques are lossy, meaning that there is irreversible information loss when saving the data in the compressed format. 

**Memory** - Different compression techniques have substantially different memory footprints. For instance, `png` vs `jpeg` compression may result in a 10X difference in the size of a Hub dataset. 

**Runtime** - The highest upload speeds can be achieved when the `sample_compression` value matches the compression of the source data, such as:

In [None]:
# Both sample_compression and the source image are "jpeg"
ds.create_tensor("images_jpeg", htype = "image", sample_compression = "jpeg")
ds.images_jpeg.append(hub.read("/content/animals/dogs/image_3.jpg"))

However, a mismatch between compression of the source data and `sample_compression` in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified `sample_compression` before saving:

In [None]:
# sample_compression is "png" and the source image is "jpeg"
ds.create_tensor("images_png", htype = "image", sample_compression = "png")
ds.images_png.append(hub.read("/content/animals/dogs/image_3.jpg"))

**Note:** Because decompressing and recompressing data is computationally expensive, consider the runtime implications before uploading source data whose compression differs from the specified `sample_compression`.
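If you want to check ahead of time whether a source file already matches a tensor's `sample_compression`, one option is to inspect the file's leading bytes. The sketch below uses only the standard library and relies on the well-known JPEG and PNG file signatures. The helper names `detect_compression` and `needs_recompression` are hypothetical and not part of Hub, and this is not necessarily how Hub itself makes the decision.

```python
import os
import tempfile

# File signatures (magic numbers) for the two formats discussed above.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}

def detect_compression(path):
    """Return 'jpeg', 'png', or None based on the first bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None

def needs_recompression(path, sample_compression):
    """True when the detected source format differs from sample_compression."""
    return detect_compression(path) != sample_compression

# Demo with a placeholder file that contains only a JPEG signature and padding.
with tempfile.TemporaryDirectory() as tmp:
    fake_jpeg = os.path.join(tmp, "image_3.jpg")
    with open(fake_jpeg, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
    same = needs_recompression(fake_jpeg, "jpeg")
    different = needs_recompression(fake_jpeg, "png")
    print(same, different)  # prints: False True
```

Checking the bytes rather than the file extension avoids being misled by files that were renamed without being converted.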

# **Step 4**: _Accessing Data_
_Accessing and loading Hub Datasets._

## Loading Datasets

Hub Datasets can be loaded and created in a variety of storage locations with minimal configuration. 

In [None]:
import hub

In [None]:
# Local Filepath
ds = hub.load('./animals_hub') # Dataset created in Step 2 in this Colab Notebook

In [None]:
# S3
# ds = hub.load('s3://my_dataset_bucket', creds={...})

In [None]:
# Public Dataset hosted by Activeloop
ds = hub.load('hub://activeloop/k49-train')

In [None]:
# Dataset in another workspace on Activeloop Platform
# ds = hub.load('hub://workspace_name/dataset_name')

**Note:** Since `ds = hub.dataset(path)` can be used to both create and load datasets, you may accidentally create a new dataset if there is a typo in the path you provided while intending to load a dataset. If that occurs, simply use `ds.delete()` to remove the unintended dataset permanently.
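Since a typo in a path can silently point at a different location, it can help to sanity-check a path string before passing it on. The sketch below classifies the three path styles used in the cells above by their prefix. `describe_path` is a hypothetical helper written for this illustration with plain string operations; it is not part of Hub and does not contact any storage.

```python
def describe_path(path):
    """Classify a dataset path by its prefix (hypothetical helper, not part of Hub)."""
    if path.startswith("hub://"):
        # Activeloop paths have the form hub://workspace_name/dataset_name
        workspace, _, dataset = path[len("hub://"):].partition("/")
        return {"storage": "activeloop", "workspace": workspace, "dataset": dataset}
    if path.startswith("s3://"):
        return {"storage": "s3", "location": path[len("s3://"):]}
    return {"storage": "local", "location": path}

for p in ["./animals_hub", "s3://my_dataset_bucket", "hub://activeloop/k49-train"]:
    print(describe_path(p))
```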

## Referencing Tensors

Hub allows you to reference specific tensors using keys or via the `.` notation outlined below. 


**Note:** Data is still not loaded by these commands.

In [None]:
ds = hub.dataset('hub://activeloop/k49-train')

In [None]:
### NO HIERARCHY ###
ds.images # is equivalent to
ds['images']

ds.labels # is equivalent to
ds['labels']

### WITH HIERARCHY ###
# ds.localization.boxes # is equivalent to
# ds['localization/boxes']

# ds.localization.labels # is equivalent to
# ds['localization/labels']

## Accessing Data

Data within the tensors is loaded and accessed using the `.numpy()` command:

In [None]:
# Indexing
ds = hub.dataset('hub://activeloop/k49-train')

W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a 
                                    # list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array
                             # The method above produces an exception if 
                             # the images are not all the same size

Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store 
                                        # them as a list of NumPy arrays

**Note:** The `.numpy()` method will produce an exception if the samples in the requested tensor do not all have the same shape. If that's the case, running `.numpy(aslist=True)` solves the problem by returning a list of NumPy arrays, where the indices of the list correspond to different samples.
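The reason for the two return styles is that samples can only be combined into a single array when they all share one shape. The following toy, written with nested Python lists instead of NumPy arrays, mimics that rule. The functions `shape_of`, `fetch`, and `image` are hypothetical names for this illustration; this is a model of the behavior described in the note, not Hub's implementation.

```python
def shape_of(sample):
    """Shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(sample, list):
        shape.append(len(sample))
        sample = sample[0] if sample else None
    return tuple(shape)

def fetch(samples, aslist=False):
    """Toy model of the uniform-shape rule (not Hub's implementation)."""
    if aslist:
        return list(samples)  # one entry per sample, shapes may differ
    shapes = {shape_of(s) for s in samples}
    if len(shapes) > 1:
        raise ValueError(f"samples have different shapes: {sorted(shapes)}")
    return samples            # uniform shapes, safe to treat as one stacked array

def image(height, width):
    return [[0] * width for _ in range(height)]

uniform = [image(2, 2), image(2, 2)]
ragged = [image(2, 2), image(3, 2)]

fetch(uniform)                # works: every sample is 2x2
try:
    fetch(ragged)
    raised = False
except ValueError:
    raised = True
as_list = fetch(ragged, aslist=True)
print(raised, [shape_of(s) for s in as_list])  # prints: True [(2, 2), (3, 2)]
```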

# **Step 5**: _Using Activeloop Storage_

_Storing and loading datasets from Activeloop Platform Storage._

You can store your Hub Datasets on Activeloop Platform by first creating an account in the CLI using:

In [None]:
!activeloop register

In order for the Python API to authenticate with the Activeloop Platform, you should log in from the CLI using:

In [None]:
!activeloop login -u username -p password

# Alternatively use "activeloop login" ... which is followed by prompts for username and password

You can then access or create Hub Datasets by passing the Activeloop Platform path to `hub.dataset()`.

In [None]:
import hub

# Replace the placeholder with your own path, e.g. 'hub://jane_smith/my_awesome_dataset'
platform_path = 'hub://workspace_name/dataset_name'

ds = hub.dataset(platform_path)

**Note**: 

When you create an account in Activeloop Platform, a default workspace is created that has the same name as your username. You are also able to create other workspaces that represent organizations, teams, or other collections of multiple users. 

Public datasets such as `'hub://activeloop/mnist-train'`  can be accessed without logging in.

# **Step 6**: _Connecting Hub Datasets to ML Frameworks_

_Connecting Hub Datasets to machine learning frameworks such as PyTorch and TensorFlow._

You can connect Hub Datasets to popular ML frameworks such as PyTorch and TensorFlow using minimal boilerplate code, and Hub takes care of the parallel processing!

## PyTorch

You can train a model by creating a PyTorch DataLoader from a Hub Dataset using `ds.pytorch()`.

In [None]:
import hub
from torch.utils.data import DataLoader

ds = hub.dataset('hub://activeloop/cifar100-train') # Hub Dataset
dataloader = ds.pytorch(batch_size = 16, num_workers = 2) #PyTorch DataLoader

for data in dataloader:
    print(data)
    break
    # Training Loop

## TensorFlow

Similarly, you can convert a Hub Dataset to a TensorFlow Dataset via the `tf.data` API.

In [None]:
ds # Hub Dataset object, to be used for training
ds_tf = ds.tensorflow() # A TensorFlow Dataset

# **Step 7**: _Parallel Computing_

_Running computations and processing data in parallel._

Hub enables you to easily run computations in parallel and significantly accelerate your data processing workflows. This example primarily focuses on parallel dataset uploading, and other use cases such as dataset transformations can be found in [this tutorial](https://docs.activeloop.ai/tutorials/data-processing-using-parallel-computing).

Parallel compute using Hub has two core elements: (1) defining a function or pipeline that will run in parallel, and (2) evaluating it using the appropriate inputs and outputs. Let's start with (1) by defining a function that processes files and appends their data to the labels and images tensors.

**Defining the parallel computing function**

The first step for running parallel computations is to define a function that will run in parallel by decorating it using `@hub.compute`. In the example below, `file_to_hub` converts data from files into Hub format, just like the manual creation in **Step 2: Creating Hub Datasets**. If you have not completed Step 2, please complete the section that downloads and unzips the *animals* dataset.

In [None]:
import hub
from PIL import Image
import numpy as np
import os

@hub.compute
def file_to_hub(file_name, sample_out, class_names):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample
    # Other arguments are optional
    
    # Find the label number corresponding to the file
    label_text = os.path.basename(os.path.dirname(file_name))
    label_num = class_names.index(label_text)
    
    # Append the label and image to the output sample
    sample_out.labels.append(np.uint32(label_num))
    sample_out.images.append(hub.read(file_name))
    
    return sample_out

In all functions decorated using `@hub.compute`, the first argument must be a single element of any input iterable that is being processed in parallel. In this case, that is a filename `file_name`, because `file_to_hub` reads image files and populates data in the dataset's tensors.

The second argument is a dataset sample `sample_out`, which can be operated on using similar syntax to dataset objects, such as `sample_out.append(...)`, `sample_out.extend(...)`, etc.

The function decorated using `@hub.compute` must return `sample_out`, which represents the data that is added or modified by that function.

**Executing the transform**

To execute the transform, you must define the dataset that will be modified by the parallel computation.

In [None]:
ds = hub.empty('./animals_hub_transform') # Creates the dataset

Next, you define the input iterable that describes the information that will be operated on in parallel. In this case, that is a list of files `files_list` from the animals dataset in Step 2.

In [None]:
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

You can now create the tensors for the dataset and **run the parallel computation** using the `.eval` syntax. Pass only the optional input arguments (here `class_names`) to `file_to_hub`; the first two default arguments, `file_name` and `sample_out`, are skipped because they are supplied during evaluation.

The input iterable `files_list` and output dataset `ds` are passed to the `.eval` method as the first and second arguments, respectively.

In [None]:
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
    file_to_hub(class_names=class_names).eval(files_list, ds, num_workers = 2)
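The two-stage call pattern, where extra arguments are bound first and `.eval` is called afterwards, can look unusual at first. The sketch below imitates that calling convention with the standard library only. Everything in it (`compute`, `Bound`, `SampleOut`, `path_to_label`) is a hypothetical toy for illustration: it is not Hub's implementation, it uses threads rather than whatever scheduler Hub uses, and its `sample_out.append(tensor, value)` signature is a simplification of Hub's `sample_out.tensor_name.append(value)`.

```python
from concurrent.futures import ThreadPoolExecutor

class SampleOut:
    """Collects the values one function call appends, keyed by tensor name."""
    def __init__(self):
        self.data = {}

    def append(self, tensor, value):
        self.data.setdefault(tensor, []).append(value)

def compute(fn):
    """Toy decorator: fn(element, sample_out, **kwargs) becomes a factory with .eval()."""
    class Bound:
        def __init__(self, **kwargs):
            self.kwargs = kwargs  # optional arguments bound before evaluation

        def _run(self, element):
            # The first two arguments are supplied here, not by the caller.
            return fn(element, SampleOut(), **self.kwargs)

        def eval(self, iterable, output, num_workers=1):
            with ThreadPoolExecutor(max_workers=num_workers) as pool:
                # map() yields results in input order even when calls overlap.
                for sample in pool.map(self._run, iterable):
                    for tensor, values in sample.data.items():
                        output.setdefault(tensor, []).extend(values)
    return Bound

@compute
def path_to_label(path, sample_out, class_names):
    label_text = path.split("/")[-2]  # name of the parent folder
    sample_out.append("labels", class_names.index(label_text))
    return sample_out

out = {}
files = ["animals/cats/image_1.jpg", "animals/dogs/image_3.jpg", "animals/cats/image_2.jpg"]
path_to_label(class_names=["cats", "dogs"]).eval(files, out, num_workers=2)
print(out)  # prints: {'labels': [0, 1, 0]}
```

The point of the sketch is the shape of the call: `path_to_label(class_names=...)` only stores the optional argument, and the per-element work happens inside `.eval`.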

In [None]:
Image.fromarray(ds.images[0].numpy())

Congrats! You just created a dataset using parallel computing! 🎈

# **Step 8**: _Version Control_

_Managing changes to your datasets using version control._

Hub version control allows users to manage changes to datasets with commands very similar to Git. It provides critical insights into how data is evolving, and it works with datasets of any size!


Let's create a hub dataset and check out how version control works!

In [None]:
import hub
import numpy as np

# Set overwrite = True for re-runability
ds = hub.dataset('./version_control', overwrite = True)

# Create a tensor and append 200X 100x100x3 arrays
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.images.extend(np.ones((200, 100, 100, 3), dtype = 'uint8'))

## Commit

To commit the data added above, simply run `ds.commit`:


In [None]:
first_commit_id = ds.commit('Added 200X 100x100x3 arrays')

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))

The printout shows that the first commit has 200 samples. Next, let's add 50 more samples and commit the update:

In [None]:
with ds:
    ds.images.extend(np.ones((50, 150, 150, 3), dtype = 'uint8'))
    
second_commit_id = ds.commit('Added 50X 150x150x3 arrays')
print('Dataset in commit {} has {} samples'.format(second_commit_id, len(ds)))

The printout now shows that the second commit has 250 samples. 


## Log

The commit history starting from the current commit can be shown using `ds.log`:


In [None]:
log = ds.log()

This command prints the log to the console and also assigns it to the specified variable `log`. The author of the commit is the username of the [Activeloop account](https://docs.activeloop.ai/getting-started/using-activeloop-storage) that logged in on the machine.

## Branch

Branching takes place by running the `ds.checkout` command with the parameter `create = True`. Let's create a new branch, add a `labels` tensor, populate it with data, create a new commit on that branch, and display the log.

In [None]:
ds.checkout('new_branch', create = True)

with ds:
    ds.create_tensor('labels', htype = 'class_label')
    ds.labels.extend(np.zeros((250,1), dtype = 'uint32'))
    
new_branch_commit_id = ds.commit('Added labels tensor and 250X labels')
print('Dataset in commit {} has tensors: {}'.format(new_branch_commit_id, ds.tensors))

The printout shows that the dataset on the `new_branch` branch contains `images` and `labels` tensors.


The log now shows a commit on `new_branch` as well as the previous commits on `main`:

In [None]:
ds.log()

## Checkout

A previous commit or branch can be checked out using `ds.checkout`:

In [None]:
ds.checkout('main')

print('Dataset in branch {} has tensors: {}'.format('main', ds.tensors))

As expected, the printout shows that the dataset on `main` only contains the `images` tensor, since the `labels` tensor was added on `new_branch`.

## HEAD Commit


Unlike Git, Hub's version control does not have a staging area because changes to datasets are not stored locally before they are committed. All changes are automatically reflected in the dataset's permanent storage (local or cloud). **Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch**. This means that the uncommitted changes do not appear on other branches. Let's see how this works:

You should currently be on the `main` branch, which has 250 samples. Let's add 75 more samples:


In [None]:
print('Dataset on {} branch has {} samples'.format('main', len(ds)))

with ds:
    ds.images.extend(np.zeros((75, 100, 100, 3), dtype = 'uint8'))
    
print('After updating, the HEAD commit on {} branch has {} samples'.format('main', len(ds)))

Next, if you check out the first commit, the dataset contains 200 samples, which is the sample count from when the first commit was made. Therefore, the 75 uncommitted samples that were added to the `main` branch above are not reflected when other branches or commits are checked out.

In [None]:
ds.checkout(first_commit_id)

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))

Finally, when checking out the `main` branch again, the prior uncommitted changes are visible, and they are stored in the `HEAD` commit on `main`:

In [None]:
ds.checkout('main')

print('Dataset in {} branch has {} samples'.format('main', len(ds)))
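The HEAD behavior walked through above can be summarized with a small toy model. The `ToyRepo` class below is a hypothetical, standard-library illustration of the bookkeeping (frozen snapshots per commit, one mutable HEAD per branch); it is not Hub's implementation and ignores details such as storage and commit messages. The sample counts mirror the ones in this section.

```python
class ToyRepo:
    """Toy model of commits and a per-branch HEAD (hypothetical, not Hub's implementation)."""

    def __init__(self):
        self.commits = {}               # commit id -> frozen snapshot of the samples
        self.heads = {"main": []}       # branch name -> working (uncommitted) samples
        self.branch = "main"
        self.view = self.heads["main"]  # what len() currently reports

    def extend(self, samples):
        # Changes always land in the HEAD of the current branch.
        self.heads[self.branch].extend(samples)

    def commit(self):
        commit_id = f"c{len(self.commits) + 1}"
        self.commits[commit_id] = tuple(self.heads[self.branch])
        return commit_id

    def checkout(self, target):
        if target in self.heads:
            self.branch = target
            self.view = self.heads[target]    # branch HEAD, including uncommitted changes
        else:
            self.view = self.commits[target]  # read-only snapshot of an earlier commit

    def __len__(self):
        return len(self.view)

repo = ToyRepo()
repo.extend(range(200))
first = repo.commit()
repo.extend(range(50))
repo.commit()
repo.extend(range(75))       # left uncommitted on main

counts = [len(repo)]         # HEAD of main
repo.checkout(first)
counts.append(len(repo))     # snapshot from the first commit
repo.checkout("main")
counts.append(len(repo))     # back on main, uncommitted samples still there
print(counts)                # prints: [325, 200, 325]
```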

## Diff - Coming Soon

Understanding changes between commits is critical for managing the evolution of datasets. The `diff` function will enable users to determine the number of samples that were added, removed, or updated for each tensor. Activeloop is currently working on an implementation.

## Merge - Coming Soon


Merging is a critical feature for collaborating on datasets, and Activeloop is currently working on an implementation.

Congrats! You are now an expert in dataset version control! 🎓