## Store reuse with multiple datasets

#### This notebook describes a store reuse scenario with multiple datasets containing common datapoints.

When the same file is used in multiple datasets, ml-git adds that file to the bucket only once, which saves space in the bucket. To illustrate this use case, we will create two entities: the people entity contains 10 images of people's faces, while the famous entity contains 7 images of famous people's faces, 5 of which also appear in the people entity.

As a result, when the files are sent to the repository, the 5 images shared by the two entities are not duplicated in the bucket. Both entities refer to the same images stored in the bucket.
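
The idea behind this kind of reuse can be illustrated with a toy content-addressed store. This is only a minimal sketch and not ml-git's actual implementation: the `ContentAddressedStore` class, the file names, and the byte contents are all hypothetical, and SHA-256 is chosen here purely for illustration.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory bucket keyed by content hash (hypothetical, not ml-git code)."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # setdefault stores the bytes only the first time this key is seen
        self.blobs.setdefault(key, data)
        return key


store = ContentAddressedStore()

# Two datasets that share one file (contents are made up for the example)
people = {"alice.jpg": b"image-a", "bob.jpg": b"image-b"}
famous = {"star.jpg": b"image-a", "carol.jpg": b"image-c"}

# Each dataset keeps its own index of file name -> content hash
people_index = {name: store.put(data) for name, data in people.items()}
famous_index = {name: store.put(data) for name, data in famous.items()}

print(len(store.blobs))  # 3 blobs for 4 files
print(people_index["alice.jpg"] == famous_index["star.jpg"])  # True
```

Because the key is derived from the content rather than from the file name or the dataset, the shared file is stored once and both indexes point to the same entry.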

#### To start using the ml-git API, we need to import it into our script

In [None]:
from ml_git import api

#### After that, we define some variables that will be used by the notebook

In [None]:
# The type of entity we are working on
entity = 'dataset'

# Name of the first dataset (faces of people)
entity_name_people = 'people_faces'

# Name of the second dataset (faces of famous people)
entity_name_famous = 'famous_faces'

#### To start, we assume that you have a repository with the git settings needed to clone it. If this is not your scenario, you will need to configure ml-git outside this notebook (at the moment, the API does not have the methods needed to perform this configuration).

#### Alternatively, you can configure the repository manually using the command line, following the steps in the [First Project](https://github.com/HPInc/ml-git/blob/development/docs/first_project.md) documentation.

In [None]:
repository_url = '/local_ml_git_config_server.git'

api.clone(repository_url)

#### Create the people dataset

![dataset](people_faces.jpg)

In [None]:
!ml-git dataset create people_faces --category=computer-vision --category=images --store-type=s3h --bucket-name=faces_bucket --version=1 --import='people_faces' --unzip

#### We can now proceed with the necessary steps to send the new data to the store.

In [None]:
api.add(entity, entity_name_people, bumpversion=True)

Commit the changes

In [None]:
# Custom commit message
message = 'Commit example'

api.commit(entity, entity_name_people, message)

#### As we are using MinIO locally to store the data in the bucket, we can check the number of files in the local bucket.

In [None]:
import os

def get_bucket_files_count():
    # Count the entries in the local directory that backs the MinIO bucket
    files_count = len(os.listdir('../../data/faces_bucket'))
    print(f"Number of files on bucket: {files_count}")

#### Number of files in the bucket before pushing the people dataset

In [None]:
get_bucket_files_count()

As we have not yet uploaded any version of our dataset, the bucket is empty.

#### Pushing the people dataset

In [None]:
api.push(entity, entity_name_people)

#### Number of files in the bucket after pushing the people dataset

In [None]:
get_bucket_files_count()

After sending the data, we can observe the presence of 20 blobs related to the 10 images that were versioned. In this case, two blobs were added for each image in our dataset.
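
To measure how many blobs a push adds, rather than reading two printed numbers, the count can be returned and compared. The sketch below is an assumption-laden illustration: `count_bucket_files` is a hypothetical helper, a temporary directory stands in for the local bucket, and writing 20 placeholder files stands in for the push.

```python
import os
import tempfile


def count_bucket_files(bucket_path):
    """Return the number of entries directly inside bucket_path (hypothetical helper)."""
    return len(os.listdir(bucket_path))


# A temporary directory stands in for the local bucket directory
with tempfile.TemporaryDirectory() as bucket_path:
    before = count_bucket_files(bucket_path)

    # Stand-in for the push: write 20 placeholder blobs
    for i in range(20):
        with open(os.path.join(bucket_path, f"blob_{i}"), "wb") as blob:
            blob.write(b"placeholder")

    after = count_bucket_files(bucket_path)

print(f"Blobs added by the push: {after - before}")
```

In this notebook, the path would be `'../../data/faces_bucket'` and the step between the two counts would be the `api.push` call.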

#### Create the famous dataset

Let's create our second dataset, which shares some images with the first dataset.

![dataset](famous_faces.jpg)

In [None]:
!ml-git dataset create famous_faces --category=computer-vision --category=images --store-type=s3h --bucket-name=faces_bucket --version=1 --import='famous_faces' --unzip

#### We can now proceed with the necessary steps to send the new data to the store.

In [None]:
api.add(entity, entity_name_famous, bumpversion=True)

Commit the changes

In [None]:
# Custom commit message
message = 'Commit example'

api.commit(entity, entity_name_famous, message)

Finally, push the data

In [None]:
api.push(entity, entity_name_famous)

#### Number of files in the bucket after pushing the famous dataset

In [None]:
get_bucket_files_count()

As you can see, only 4 blobs were added to our bucket. Of the 7 images in the famous dataset, only 2 differ from the images in the people dataset, so ml-git optimizes storage by adding only the blobs related to these new images.
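
The blob counts reported in this notebook can be checked with a little set arithmetic. This is a sketch under stated assumptions: the image identifiers are made up, and the ratio of two blobs per image is the one observed in this notebook, not a general guarantee of ml-git.

```python
# Observed in this notebook: each versioned image produced two blobs
BLOBS_PER_IMAGE = 2

# Hypothetical identifiers: 10 people images, 5 of them shared with famous
people_images = {f"person_{i}" for i in range(10)}
shared_images = {f"person_{i}" for i in range(5)}
famous_images = shared_images | {"famous_new_0", "famous_new_1"}

# Blobs present after pushing only the people dataset
blobs_after_people = BLOBS_PER_IMAGE * len(people_images)

# Only images not already stored contribute new blobs
new_images = famous_images - people_images
blobs_added_by_famous = BLOBS_PER_IMAGE * len(new_images)

# Total blobs once both datasets have been pushed
total_blobs = BLOBS_PER_IMAGE * len(people_images | famous_images)

print(blobs_after_people, blobs_added_by_famous, total_blobs)  # 20 4 24
```

Without reuse, the famous dataset would have added 14 blobs (7 images at two blobs each) instead of 4.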