# Moving Datasets Across Stores

This notebook shows you how to move datasets from one or more stores to one or more other stores.

## Prep work

In [1]:
from __future__ import print_function  # For Python 2 compatibility
import kosh
import os
import random

def prep_stores(source_name="my_source_store.sql", dest_name="my_dest_store.sql", data_dir="my_data_dir"):
    """
    Create two new stores and add a dataset with 3 associated files to the first one.
    """

    # Remove any stores left over from a previous run
    for name in (source_name, dest_name):
        try:
            os.remove(name)
        except OSError:
            pass
    # Let's create a "source" and a "destination" store
    source_store = kosh.create_new_db(source_name)
    dest_store = kosh.create_new_db(dest_name)

    # Let's create a dataset we'd like to transfer
    ds = source_store.create(name="a_dataset", metadata={"int_attr":1, "float_attr":2., "str_attr": "string"})
    
    # let's create some files to associate
    # first a directory
    try:
        os.makedirs(data_dir)
    except Exception:
        pass
    filenames = ["a.txt", "b.txt", "c.py"]
    filenames = [os.path.join(data_dir, f) for f in filenames]
    
    ds.associate(filenames, "test")
    for filename in filenames:
        with open(filename, "w") as f:
            print("some data", file=f)
            print(random.randint(0, 10000000), file=f)  # to ensure unique SHAs


## All data resides on the same file system

In this case a simple Python script will suffice.

In [2]:
# Let's prepare the stores
prep_stores()

# Let's open our source store:
my_store = kosh.KoshStore("my_source_store.sql")

# Let's open our target store
target_store = kosh.KoshStore("my_dest_store.sql")

# Let's search for the dataset(s) of interest in the source
datasets = my_store.search(name="a_dataset")

# And let's transfer
for dataset in datasets:
    target_store.import_dataset(dataset.export())
    
# Voila! Let's check

print(target_store.search(name="a_dataset")[0])



KOSH DATASET
	id: d5e29d3941e444789a96fb621f466280
	name:a_dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	float_attr: 2.0
	int_attr: 1
	name: a_dataset
	str_attr: string
--- Associated Data (3)---
	Mime_type: test
		/g/g19/cdoutrix/git/kosh/examples/my_data_dir/a.txt ( 2a0f1640f59342c6b0c1f89e2889e8b6 )
		/g/g19/cdoutrix/git/kosh/examples/my_data_dir/b.txt ( 1d70c00db96440fe8ade9252f161df76 )
		/g/g19/cdoutrix/git/kosh/examples/my_data_dir/c.py ( 0f0041df119a4895a8e81dc2bfb890ff )



## Data needs to be moved or copied


### On the same file system

If you need to move some files, simply use `kosh mv`.

Example: moving `file.py` to `new_named_file.py`

```bash
kosh mv --stores store1.sql --sources file.py --destination new_named_file.py
```


```
usage: kosh mv --stores STORES [--destination-stores DESTINATION_STORES] --sources SOURCES [SOURCES ...]
            [--dataset_record_type DATASET_RECORD_TYPE] [--dataset_matching_attributes DATASET_MATCHING_ATTRIBUTES]
            --destination DESTINATION [--version]
```

You can also copy files to another location, and optionally into another store, with `kosh cp`:

```bash
kosh cp --stores store1.sql --sources file.py --destination new_named_file.py
```

```
usage: kosh cp --stores STORES [--destination-stores DESTINATION_STORES] --sources SOURCES [SOURCES ...]
            [--dataset_record_type DATASET_RECORD_TYPE] [--dataset_matching_attributes DATASET_MATCHING_ATTRIBUTES]
            --destination DESTINATION [--version]
```


Kosh should properly handle directories and patterns (`*`).

### After the fact

Oops! You moved the files to a new place but forgot to do so via `kosh mv`.

Fear not! Kosh can probably help you fix your stores with `kosh reassociate`:

```
usage: kosh reassociate --stores STORES --new_uris NEW_URIS [NEW_URIS ...] [--original_uris ORIGINAL_URIS [ORIGINAL_URIS ...]]
            [--no_absolute_path]
```


#### Option 1: just point to the new files

```bash
kosh reassociate --stores store.sql --new_uris new_named_file.py
```

Kosh will compute the "fast sha" of the target(s) and try to find a match among the files already associated in the store.

The `--new_uris` argument can be a directory or a pattern.
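To illustrate the idea of finding a moved file by its content rather than by its name, here is a minimal sketch using only the standard library. It is not Kosh's implementation: `content_hash` hashes the whole file with SHA-256 as a stand-in for Kosh's fast sha, and `match_moved_files` and the `recorded` mapping are hypothetical names introduced for this example.

```python
import hashlib
import os
import tempfile


def content_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file (a stand-in for Kosh's fast sha)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_moved_files(recorded, new_paths):
    """Map each recorded (old) URI to the new path holding the same content.

    `recorded` is a hypothetical {old_uri: hash} mapping, standing in for
    what a store remembers about its associated files.
    """
    by_hash = {content_hash(p): p for p in new_paths}
    return {old: by_hash[h] for old, h in recorded.items() if h in by_hash}


# Demo: "move" a file behind the store's back, then find it again by content.
workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "file.py")
new_path = os.path.join(workdir, "new_named_file.py")
with open(old_path, "w") as f:
    f.write("print('hello')\n")
recorded = {old_path: content_hash(old_path)}
os.rename(old_path, new_path)

matches = match_moved_files(recorded, [new_path])
print(matches)
```

Because the match relies only on content, it works even when the file was renamed, which is why pointing at the new files alone can be enough.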

#### Option 2: Using the old name

```bash
kosh reassociate --stores store.sql --new_uris new_named_file.py --original_uris file.py
```

#### Option 3: I know the fast sha

```bash
kosh reassociate --stores store.sql --new_uris new_named_file.py --original_uris c6a15fa59ae2d070a88a6a96503543d4baeb8f381f247854ef04adb67f79d818
```

### Moving files across filesystems (remote host)

Here we assume that we need to bring data from a remote machine.

Because Kosh will need to do a **LOT** of talking with the remote host, it is preferable to set up an ssh agent so you do not need to type your password over and over.

Please [see this guide](https://docs.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) to set up your keys and agent properly.

In [None]:
# Let's ask for the password and set up the ssh agent
import getpass
password = getpass.getpass()+"\n"

from subprocess import Popen, PIPE
import shlex
agent = Popen("ssh-agent", stdin=PIPE, stdout=PIPE, stderr=PIPE)
o,e = agent.communicate()
for line in o.decode().split("\n"):
    sp = line.split("=")
    if len(sp) > 1:
        variable = sp[0]
        value = sp[1].split(";")[0]
        os.environ[variable] = value
add = Popen("ssh-add", stdin=PIPE, stdout=PIPE, stderr=PIPE)
add.communicate(password.encode())
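The loop in the cell above picks the environment variables out of the shell snippet that `ssh-agent` prints. As a sketch, the same parsing can be written as a small function that is easy to test on its own. `parse_ssh_agent_output` is a hypothetical helper name, and the sample values below are made up; only the Bourne-shell output format is assumed.

```python
import re


def parse_ssh_agent_output(text):
    """Extract NAME=value pairs from the shell snippet printed by `ssh-agent`."""
    env = {}
    # Each relevant line looks like: NAME=value; export NAME;
    for match in re.finditer(r"^(\w+)=([^;\n]*);", text, flags=re.MULTILINE):
        env[match.group(1)] = match.group(2)
    return env


# Sample text in the format ssh-agent prints (values are made up).
sample = (
    "SSH_AUTH_SOCK=/tmp/ssh-abc123/agent.4242; export SSH_AUTH_SOCK;\n"
    "SSH_AGENT_PID=4243; export SSH_AGENT_PID;\n"
    "echo Agent pid 4243;\n"
)
agent_env = parse_ssh_agent_output(sample)
print(agent_env)
```

With such a helper, the loop could be replaced by `os.environ.update(parse_ssh_agent_output(o.decode()))`.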

In [4]:
# Let's prepare our data

prep_stores()

Now let's fake our "remote host"

In [5]:
import socket

user = getpass.getuser()
hostname = socket.gethostname()

OK, all we need to do is copy the data from the remote host to a new local directory.

In [6]:
# Let's cleanup first
import shutil
shutil.rmtree("my_new_data_dir", ignore_errors=True)
os.makedirs("my_new_data_dir")

Let's build the command line to copy the data over

In [7]:
import sys
cmd = "{}/bin/kosh cp --stores my_source_store.sql --destination_stores my_dest_store.sql --sources {}@{}:{}/my_data_dir --destination my_new_data_dir".format(sys.prefix, user, hostname, os.getcwd())
print("We will be executing:\n{}".format(cmd))

We will be executing:
/g/g19/cdoutrix/miniconda3/envs/kosh/bin/kosh cp --stores my_source_store.sql --destination_stores my_dest_store.sql --sources cdoutrix@surface86:/g/g19/cdoutrix/git/kosh/examples/my_data_dir --destination my_new_data_dir


In [8]:
p = Popen(shlex.split(cmd), stdin=PIPE, stdout=PIPE, stderr=PIPE)
o, e = p.communicate()
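The cell above discards the command's output and exit status, so a failed copy would go unnoticed. A minimal sketch of a wrapper that surfaces failures is shown here. `run_checked` is a hypothetical helper, not part of Kosh, and the demo runs a harmless Python one-liner instead of the actual `kosh cp` command.

```python
import shlex
import sys
from subprocess import PIPE, Popen


def run_checked(cmd):
    """Run a command line and return its decoded stdout; raise if it exits non-zero."""
    p = Popen(shlex.split(cmd), stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise RuntimeError(
            "command failed ({}): {}".format(p.returncode, err.decode())
        )
    return out.decode()


# Demo with a harmless command; in the notebook you would pass `cmd` instead.
demo_output = run_checked('"{}" -c "print(42)"'.format(sys.executable))
print(demo_output)
```

Raising on a non-zero exit code makes it obvious when the transfer did not happen, rather than discovering an empty destination store later.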

In [9]:
# Now let's check that our second store (on the remote) contains data
remote_store = kosh.KoshStore("my_dest_store.sql")
print(remote_store.search())

[<kosh.sina.core.KoshSinaDataset object at 0x2aaafbaa8470>]


### Moving files across disconnected filesystems

Let's assume you have a LOT of data that you need to move to another computer, but you have a **VERY** slow connection to it.

Using scp/rsync will take months and you can't wait.

Kosh's solution at this point is to `tar` your data on the original machine, manually transfer the tarball to the other machine (USB stick, DVD, etc.) and run `tar` again on the other end.

Kosh will look for the datasets referencing the files you're tarring and add them to the tarball.

When extracting, Kosh will add these datasets (with the new local paths) to your destination store.

The syntax is the same as for regular `tar` (you can pass any option accepted by `tar`), except that you need to point to the Kosh store(s) and the tarball name must be specified via `-f`.

Example:

```bash
kosh tar cv --stores store1.sql store2.sql -f my_big_tar.tgz *.hdf5
```

Once on the destination machine you can do:

```bash
kosh tar xv --stores destination_store.sql -f my_big_tar.tgz
```

Your files are untarred and the datasets originally in store1 and store2 that pointed to these files are now in destination_store.
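The general idea of shipping dataset descriptions alongside the files can be sketched with the standard library alone. This is only an illustration of the concept, not how Kosh stores its records: the manifest file name, the `pack`/`unpack` helpers, and the dictionary layout are all assumptions made for this example.

```python
import io
import json
import os
import tarfile
import tempfile

MANIFEST = "datasets_manifest.json"  # hypothetical name, not what Kosh uses


def pack(tar_path, files, datasets):
    """Write `files` plus a JSON manifest describing `datasets` into a tarball."""
    payload = json.dumps(datasets).encode()
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=os.path.basename(path))
        info = tarfile.TarInfo(MANIFEST)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


def unpack(tar_path, dest_dir):
    """Extract the tarball; return the datasets with paths rewritten to dest_dir."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # The archive is flat, so only the base name is kept.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with open(target, "wb") as out:
                out.write(tar.extractfile(member).read())
    with open(os.path.join(dest_dir, MANIFEST)) as f:
        datasets = json.load(f)
    for ds in datasets:
        ds["files"] = [os.path.join(dest_dir, name) for name in ds["files"]]
    return datasets


# Demo: pack one fake data file with its dataset description, then restore it.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
data_file = os.path.join(src, "run1.hdf5")
with open(data_file, "wb") as f:
    f.write(b"not really hdf5")
tarball = os.path.join(src, "my_big_tar.tgz")
pack(tarball, [data_file], [{"name": "a_dataset", "files": ["run1.hdf5"]}])
restored = unpack(tarball, dst)
print(restored)
```

The key point is that the paths recorded in the descriptions are rewritten at extraction time, so they point at the files' new location on the destination machine.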