# Data Versioning with lakeFS: All You Need to Know
## Manage your data like you manage code
<img src='images/lake.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@heiner-56542?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Heiner</a>
        on 
        <a href=''>Pexels</a>
    </strong>
</figcaption>

### Introduction to Data Versioning

### Getting Started With lakeFS, Docker Installation

From personal experience, Docker Compose is the easiest way to install lakeFS and run a local instance.

First of all, ensure that you have Docker installed with Docker Compose version `1.25.04` or higher. If you don't have Docker installed, here are the installation guides: [macOS](https://docs.docker.com/docker-for-mac/install/), [Windows](https://docs.docker.com/docker-for-windows/install/), [Linux distros](https://docs.docker.com/engine/install/centos/).

You can verify that Docker is installed correctly by running `docker version` in the shell:

```bash
docker version
```

```
Client: Docker Engine - Community
 Cloud integration: 1.0.2
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 17:00:27 2020
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:07:04 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```


Next, start a `lakeFS` instance with a single command:

```bash
curl https://compose.lakefs.io | docker-compose -f - up
```

If your output is anything like this, you are on the right track:

<img src='images/1.png'></img>

You can also see the container running in your Docker Desktop console:

<img src='images/2.png'></img>

When you run the `docker-compose` command for the first time, you should set up an admin user by opening `http://127.0.0.1:8000/setup` in your browser. It will open this page:

<img src='images/3.png'></img>

Enter a username of your choice and it will give you credentials that are shown only once. Store them securely in a file somewhere, because we will need them later.

Next, proceed to `http://127.0.0.1:8000/login`, where you can log in using your credentials. As soon as you log in, you will land on your repositories page. Think of it like your GitHub account, but for `lakeFS`:

<img src='images/4.png'></img>

This page is your UI for interacting with all of your repos and your user account. 

However, for peeps who love the shell, `lakeFS` provides an even more powerful Command Line Interface, which we will cover in the next section.

### Installing the lakeFS CLI (Command Line Interface)

The `lakeFS` CLI, `lakectl`, is distributed as a standalone binary. First, go to the [GitHub releases page](https://github.com/treeverse/lakeFS/releases) of `lakeFS`. Click on the latest release and scroll to the bottom. You will find download options depending on your OS:

<img src='images/5.png'></img>

Download yours and place it somewhere in your `PATH`. If you want to run the CLI for a single project, you can extract it directly into the root directory of your project:

<img src='images/6.png'></img>

Before running any CLI commands, check in your Docker console that the `lakeFS` container is still running. Then, run this command to check that the CLI is working:

```
lakectl help
```

If the shell displays the help page, congratulations, you are running the CLI on your local machine!
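
If the help page does not show up, the usual suspect is a binary that is not on your `PATH`. Here is a minimal sketch of a check for the three tools used in this tutorial. `missing_tools` is a helper name I made up for illustration; it is not part of `lakectl` or Docker:

```bash
# missing_tools (hypothetical helper): print every given command name
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools docker docker-compose lakectl)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing >&2
else
    echo "docker, docker-compose and lakectl are all on PATH"
fi
```

If `lakectl` is reported as missing but you extracted it into your project folder, run it from that folder or add the folder to your `PATH`.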

### lakeFS namespace

Before we move on, it is important that you know the `lakeFS` namespace. Every operation refers to the components of a `lakeFS` repository through URIs that start with `lakefs://`. Here is a reference list of patterns for referring to different components:
- Repositories: `lakefs://<repo-name>`
- Commits: `lakefs://<repo-name>@<commit-id>`
- Branches: `lakefs://<repo-name>@<branch-id>`
- Files (objects): `lakefs://<repo-name>@<branch-id>/<object path>`
> The angle brackets (`<>`) only mark placeholders; leave them out of real commands.
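
To make the patterns concrete, here is a small shell sketch that assembles these URIs from their parts. `lakefs_uri` is a hypothetical helper, not a `lakectl` command, and it assumes the `repo@ref/path` notation used in this article:

```bash
# lakefs_uri (hypothetical helper): build lakefs://<repo>[@<ref>[/<path>]]
# where <ref> is a branch name or a commit ID.
lakefs_uri() {
    if [ "$#" -lt 1 ] || [ -z "$1" ]; then
        echo "usage: lakefs_uri <repo> [ref] [path]" >&2
        return 1
    fi
    uri="lakefs://$1"
    if [ -n "${2:-}" ]; then
        uri="$uri@$2"
        # An object path is only added when a ref was given as well.
        if [ -n "${3:-}" ]; then
            uri="$uri/$3"
        fi
    fi
    printf '%s\n' "$uri"
}

lakefs_uri example                 # prints lakefs://example
lakefs_uri example master          # prints lakefs://example@master
lakefs_uri example master data/    # prints lakefs://example@master/data/
```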

### `lakectl` authorization

To start interacting with the repositories under your account, you first need to authorize `lakectl` with your credentials (`lakectl` saves them to a configuration file, by default `~/.lakectl.yaml`, so this is normally a one-time step). Start by running `lakectl config`, which should show this output:

<img src='images/7.png'></img>

Paste the access key ID you saved in the earlier section (you did, right?). Do the same for your secret key:

<img src='images/8.png'></img>

It does not ask twice for each field; the prompt is just displayed that way. The server endpoint URL is `http://127.0.0.1:8000/api/v1`. After you enter these values, you will be authorized and able to control pretty much everything related to `lakeFS`. You can check whether you are authorized by running this command:

```
lakectl repo list
```

<img src='images/9.png'></img>

It should give you an empty table, since we have not created any repos yet. So, shall we?

### Working With Repos in General

To create a repository, we will use the `lakectl repo` command, which groups all the subcommands for managing repositories:

```
lakectl repo create lakefs://example local://storage
```

The above command creates a repo named `example` in local storage, since we are using the `local://` prefix. The word `storage` is arbitrary:

<img src='images/10.png'></img>

From now on, this repository is referenced by the `lakefs://example` URI (Uniform Resource Identifier). If we run `lakectl repo list`, we should be able to see it now:

```bash
lakectl repo list
```

<img src='images/12.png'></img>

Just like in Git, each repo has a default `master` branch when created. You can delete repositories with the `delete` subcommand:

```
lakectl repo delete lakefs://repo-name
```

<img src='images/13.png'></img>

For full repository commands, check out the [CLI reference](https://docs.lakefs.io/reference/commands.html) of lakeFS.

### Loading Data To Repositories

At this point, we start to interact with our data. Remember our main aim in using `lakeFS`: we want to manage data of any size just like we manage our code. So, what you will find useful is to integrate `lakeFS` with `Git` itself.

The idea is that we control any data-related changes through `lakeFS` and manage our code with `Git`. To achieve this, add the extensions of all your data files to the `.gitignore` file so that Git no longer tracks them.

Now, say we want to upload some audio files, stored inside the `data` directory, to our `lakeFS` repo:

<img src='images/14.png'></img>

Since they have `.wma` extension, make sure you add `*.wma` as a new line to `.gitignore`.
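
If you script your project setup, you probably don't want the same pattern appended twice. A minimal sketch, assuming a hypothetical helper named `ignore_pattern` that adds a line to `.gitignore` only when that exact line is missing:

```bash
# ignore_pattern (hypothetical helper): append a pattern to a
# .gitignore-style file unless that exact line is already present.
ignore_pattern() {
    # -x matches whole lines only, -F treats "*.wma" as a literal string
    # rather than a regex, and -- guards against patterns starting with "-".
    if ! grep -qxF -- "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}

ignore_pattern '*.wma' .gitignore
```

Running it a second time leaves the file unchanged, so it is safe to keep in a setup script.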

Let's upload all the files in `data`. Just like the `lakectl repo` command, `lakectl fs` groups the subcommands for manipulating files and objects. We will use the `upload` command, which has this pattern:

```bash
lakectl fs upload --recursive --source path/ lakefs://repo-name@branch-name/extra-path/
```

The above command works for uploading either a single file or many files from a given directory. Provide the source path after the `--source` flag. For the destination, you must include the repository name followed by a branch name. It is also very important to end both the source and the destination path with a `/`; otherwise, the command fails.

Here is the sample command to upload the 4 audio files:

```bash
lakectl fs upload --recursive --source data/ lakefs://example@master/data/
```

<img src='images/15.png'></img>
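
Since a forgotten trailing `/` makes the upload fail, it can help to normalize both paths before building the command. Here is a rough sketch; `with_slash` is a hypothetical helper of mine, and the script only prints the `lakectl` command so you can review it before running it yourself:

```bash
# with_slash (hypothetical helper): append a "/" to the path unless it
# already ends in one.
with_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

src=$(with_slash "data")
dst=$(with_slash "lakefs://example@master/data")

# Print the command instead of executing it.
echo lakectl fs upload --recursive --source "$src" "$dst"
# prints: lakectl fs upload --recursive --source data/ lakefs://example@master/data/
```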

`lakectl fs upload` is roughly the equivalent of `git add`. To list the contents of a directory, we can run:

```bash
lakectl fs ls lakefs://repo@branch/path/
```

<img src='images/16.png'></img>

The above command is the equivalent of the shell's `ls` command. Just like in the other commands, the path should start with `lakefs://`, followed by the repository name, the branch name and the path.

### Making Commits With `lakectl`

We just uploaded new files to our repository. Notice we did not write any code so we do not need to make a commit through `git`. However, to save the changes on our lakeFS repo, we should make a commit. 

`lakectl` commit commands generally follow this pattern:

```bash
lakectl commit lakefs://repo-name@branch-name --message "Commit Message"
```

But before committing, it is usually helpful to see the changes we have made since the last commit. Just like `git diff`, `lakectl` has a similar command, which follows this pattern:

```bash
lakectl diff lakefs://repo-name@branch-name
```

<img src='images/19.png'></img>

The `diff` command shows all the uncommitted changes made to the lakeFS repository on the branch you specify.

Now, after making sure everything is good, we can commit our changes:

<img src='images/17.png'></img>

We will get a success message once the changes are committed. The output also gives some details, such as the commit ID and timestamp.

You can see the list of commits on your repo with this command:

```bash
lakectl log lakefs://repo-name@branch
```

<img src='images/18.png'></img>

### Working With Branches

The real power of `lakeFS` shows in branches. Creating a branch in `git` lets you duplicate your code base and work on it in isolation to try out experiments and features. However, doing this for repositories with enormous amounts of data is not feasible, either storage-wise or time-wise.

`lakeFS` prides itself on solving this problem. For example, if you create a branch for your `lakeFS` repository, the task is performed instantaneously and without duplicating data. Creating a branch at a particular point in your repo's commit history creates a snapshot of the repo's state at that commit, again without duplication. The official website says that it is all about handling file metadata under the hood.

Before we get to creating branches, I will upload some more data and make a few commits for example purposes. 

<img src='images/20.png'></img>

Next, we will create a new branch at the head, meaning from the latest commit. First, get yourself acquainted with the command to create branches:

```bash
lakectl branch create lakefs://repo-name@new-branch-name --source lakefs://repo-name@source-branch
```

When you create a branch, you should specify both the new branch's name and the branch it should start from. This means that you can create branches from any existing branch; it does not have to be `master`.

```bash
lakectl branch create lakefs://example@new --source lakefs://example@master
```

This creates a branch named `new`. You can list the existing branches with this command:

```bash
lakectl branch list lakefs://example
```

Now, suppose you did some experimentation with your data and tested out new features. When you are satisfied, you may want to merge this newly created branch back into `master`.

First of all, make sure that you commit any unsaved changes on your new branch:

```bash
lakectl commit lakefs://example@new --message "Tested some new features"
```

Before merging, you may want to see what will change when you merge the two branches. For this, you can use the `diff` command again. The command below shows the difference between two branches:

```bash
lakectl diff lakefs://example@master lakefs://example@new
```

Once you are satisfied, merge the branches with:

```bash
lakectl merge lakefs://example@new lakefs://example@master
```

Note that any uncommitted changes will be committed and merged with the above command.
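
Putting the branch commands from this section together, the whole experiment cycle can be sketched as one small script. The `run` and `experiment_flow` helpers are hypothetical names of mine, the `data/` source folder is just the one from the upload example, and by default the script only prints the `lakectl` commands (a dry run) instead of executing them:

```bash
# run (hypothetical wrapper): with DRY_RUN=1, the default here, a command
# is only printed; with DRY_RUN=0 it is actually executed.
# Note: the printed form is not re-quoted, so the commit message appears
# without its surrounding quotes.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# experiment_flow <repo> <branch>: branch off master, upload data, commit,
# review the difference and merge back. Stops at the first failing step.
experiment_flow() {
    run lakectl branch create "lakefs://$1@$2" --source "lakefs://$1@master" &&
    run lakectl fs upload --recursive --source data/ "lakefs://$1@$2/data/" &&
    run lakectl commit "lakefs://$1@$2" --message "Tested some new features" &&
    run lakectl diff "lakefs://$1@master" "lakefs://$1@$2" &&
    run lakectl merge "lakefs://$1@$2" "lakefs://$1@master"
}

experiment_flow example new
```

Once the printed commands look right, running the script with `DRY_RUN=0` executes them against your running `lakeFS` instance.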

One final point about working with branches: if you are unsatisfied with the changes in any branch, you can always revert them with `lakectl`. The CLI provides 4 options depending on the situation. I won't list them here, but you can always learn about them from the [CLI reference](https://docs.lakefs.io/reference/commands.html).