# Archgit - How To Create A Commit From Scratch

Many software developers use Git every day, but have you ever wondered how all of the Git magic actually works? "Sure, it uses hashes and stuff" is what you may be thinking right now. That is correct, by the way, but it is only part of the story. In this interactive notebook I will try to shed some light on the guts of Git.

To do so, I will explain how to create a basic Git repository and commit a file without using any Git commands at all. Welcome to Archgit, or how to create a commit from scratch! Without further ado, let's get started!

First we need to create a bare-bones repository. A repository created with `git init` is already bloated with lots of stuff we do not need...

In [None]:
!git init -q ./bloated
!tree ./bloated/.git/
!rm -rf ./bloated

All we actually need is the `.git` directory with the `HEAD` file, which tells Git where to start when building our working directory, as well as the `objects` and `refs` directories. The `objects` directory is where Git stores all of its data, and the `refs` directory lets us save references to specific points in time under human-readable, memorable names.

In [None]:
!mkdir -p basicrepo/.git/refs/heads
!mkdir -p basicrepo/.git/objects
!echo "ref: refs/heads/master" > ./basicrepo/.git/HEAD
!tree basicrepo/.git
!git --git-dir=$PWD/basicrepo/.git --work-tree=$PWD/basicrepo status

As we can see `git status` is happy with our minimal repository and says that we are on the *master* branch, which is a little bit of a lie at the moment since there is no reference to the *master* branch in `refs/heads`.
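
The same skeleton can also be sketched in plain Python, which makes it easy to see that `HEAD` points at a ref file that does not exist yet. The helper name `create_minimal_repo` is made up for this illustration, and the sketch works in a temporary directory so it does not touch *basicrepo/*.

```python
import tempfile
from pathlib import Path


def create_minimal_repo(root: Path) -> Path:
    """Create the minimal .git layout described above and return its path."""
    git_dir = root / '.git'
    (git_dir / 'refs' / 'heads').mkdir(parents=True, exist_ok=True)
    (git_dir / 'objects').mkdir(parents=True, exist_ok=True)
    # write_text overwrites the file, so rerunning never duplicates the line
    (git_dir / 'HEAD').write_text('ref: refs/heads/master\n')
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = create_minimal_repo(Path(tmp))
    head = (git_dir / 'HEAD').read_text().strip()
    # Strip the 'ref: ' prefix to get the path of the branch file
    ref_path = git_dir / head[len('ref: '):]
    ref_exists = ref_path.exists()
    print('HEAD says:', head)
    print('Ref file exists:', ref_exists)
```

The second line printed reports that the ref file is missing, which is exactly the "little bit of a lie" mentioned above.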

Let's take this opportunity to look at how Git actually stores *things* and, first and foremost, what kinds of *things* there are.

## Things in Git - Blobs, Trees and Commits

Git basically knows three kinds of objects (there is a fourth, the annotated tag, but it is not needed for this tutorial): blobs, trees and commits. Blobs represent the contents of files, trees represent the structure of the repository (one can think of them as the filesystem of Git) and commits represent snapshots of the repository together with some additional information.

All of these objects are stored in a compressed (using *zlib*) binary format in the `.git/objects` directory. To reference them, the SHA-1 hash function is applied to the uncompressed object (header plus content), which yields a 20-byte fingerprint that is written as 40 hex characters. The first two hex characters name a subdirectory and the remaining 38 characters are the filename inside it. This avoids one huge, flat directory containing every object, which is easier for an OS to handle.
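
As a rough sketch of this naming scheme, the two helpers below compute an object's fingerprint and the path it would be stored under. The function names `git_object_id` and `loose_object_path` are invented for this example and are not part of Git.

```python
from hashlib import sha1


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash '<type> <size>\\0<payload>' the way Git fingerprints an object."""
    store = f'{obj_type} {len(payload)}'.encode('ascii') + b'\x00' + payload
    return sha1(store).hexdigest()


def loose_object_path(digest: str) -> str:
    """Split the 40 hex characters into a directory and a filename."""
    return f'.git/objects/{digest[:2]}/{digest[2:]}'


digest = git_object_id('blob', b'Hello World\n')
print(digest)
print(len(bytes.fromhex(digest)), 'bytes,', len(digest), 'hex characters')
print(loose_object_path(digest))
```

For the content `Hello World\n` this yields the digest starting with `557d`, the same object that is created by hand in the next section.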

## Blobs

Let's start all the way at the bottom. A blob represents a chunk (or blob for that matter) of binary data. This is the basic way that Git saves all its files.

The format of a blob is as follows:  
![blob_format](img/blob_format.png)

So let's create our very own blob:

In [None]:
import os
import zlib
from hashlib import sha1

blob_content = 'Hello World\n'
# The header stores the content length in bytes, not in characters
blob_header = f"blob {len(blob_content.encode('utf-8'))}\x00"
blob_store = blob_header + blob_content

# Create the fingerprint to reference our blob later
blob_digest = sha1(blob_store.encode('utf-8')).hexdigest()
blob_compressed = zlib.compress(blob_store.encode('utf-8'))

print('Content: ', blob_content)
print('Header:', blob_header)
print('Store:', blob_store)
print('Digest:', blob_digest)
print('Dir:', blob_digest[:2])
print('File:', blob_digest[2:])
print('\nCompressed:', blob_compressed)

# exist_ok avoids an error if another object already shares this prefix
os.makedirs(f'basicrepo/.git/objects/{blob_digest[:2]}', exist_ok=True)
with open(f'basicrepo/.git/objects/{blob_digest[:2]}/{blob_digest[2:]}', 'wb') as blob:
    blob.write(blob_compressed)


We have now created our first Git object and will examine it more closely to verify that Git actually understands the file that we have just created.  
For this we will use the Git command (yes, yes I said no Git commands but bear with me) `git cat-file`.

In [None]:
!tree basicrepo/.git
!echo '\nType of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -t 557d
!echo 'Content of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -p 557d
!echo 'Raw content:'
!cat basicrepo/.git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238

As you can see, Git understands that our object is a blob and that its content is *'Hello World\n'*. We referenced it by an abbreviated digest: in most cases the first four hex characters are enough to uniquely identify an object, although we could have been more exact here. So far so good!  
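
What `git cat-file` does for a loose object can be approximated in a few lines of Python: decompress, split at the first NUL byte and read the header. This is only a sketch; `parse_loose_object` is a name made up here, and the sample is compressed in memory instead of being read from *basicrepo/*.

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Return (type, payload) of a zlib-compressed Git object."""
    store = zlib.decompress(compressed)
    # The header ends at the first NUL byte; everything after it is content
    header, _, payload = store.partition(b'\x00')
    obj_type, size = header.decode('ascii').split(' ')
    if int(size) != len(payload):
        raise ValueError(f'header says {size} bytes, found {len(payload)}')
    return obj_type, payload


sample = zlib.compress(b'blob 12\x00Hello World\n')
obj_type, payload = parse_loose_object(sample)
print('Type:', obj_type)
print('Content:', payload)
```

The size check mirrors the role of the length field in the header: an object whose header and content disagree is rejected.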

An interesting thing to note here is that we stored only the content of a file in the blob, not including a filename, for example. This means that two files with identical content will result in the same blob and therefore are only saved once.  
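
To illustrate, here is a small sketch that uses a plain dictionary as a stand-in for the `objects` directory. The filenames are arbitrary examples, and they never enter the hash.

```python
from hashlib import sha1


def blob_id(content: bytes) -> str:
    """Fingerprint of a blob; only the content is hashed, never a filename."""
    return sha1(b'blob %d\x00' % len(content) + content).hexdigest()


files = {
    'hello.txt': b'Hello World\n',
    'copy_of_hello.txt': b'Hello World\n',
    'other.txt': b'Something else\n',
}

# Dictionary keyed by digest, standing in for .git/objects
object_store = {}
for name, content in files.items():
    object_store[blob_id(content)] = content

print(len(files), 'files ->', len(object_store), 'blobs')
```

The two files with identical content collapse into a single entry, so three files end up as two blobs.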

But how does Git then know my filenames? Good question, let's move on!

## Trees
Trees are used to represent the file system in a repository. They can answer questions like: What is the name of a file? Which directories are there? etc.  
They are also used to build your working tree when you check out a branch, by the way.

![tree1](img/tree1.png)

The filesystem is built by trees referencing other trees or blobs, as seen in the figure above. Each entry in a tree object associates a name with another object, which is either a tree or a blob. The resulting structure of the figure above would be a root directory with two files (README and Rakefile) and a subdirectory called lib, which in turn is a tree object that references a third file named simplegit.rb.  
So a tree object could have two entries with different names but referencing the same blob which would create two files with identical content but different names.

![tree2](img/tree2.png)

The format of a tree object is as follows:

![tree_format](img/tree_format.png)
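
A tree with several entries follows the same pattern, one entry after another. The sketch below assumes the layout `<mode> <name>\0<20-byte digest>` per entry and Git's rule of sorting entries by name, with subdirectories compared as if their name ended in `/`. The helper `build_tree_store` is a hypothetical name.

```python
from hashlib import sha1


def build_tree_store(entries):
    """Build the uncompressed tree object from (mode, name, hex_digest) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        # Mode 40000 marks a subtree; it sorts as if it had a trailing slash
        return name.encode('utf-8') + (b'/' if mode == '40000' else b'')

    body = b''
    for mode, name, hex_digest in sorted(entries, key=sort_key):
        body += mode.encode('ascii') + b' ' + name.encode('utf-8')
        body += b'\x00' + bytes.fromhex(hex_digest)
    return f'tree {len(body)}'.encode('ascii') + b'\x00' + body


hello_blob = '557db03de997c86a4a028e1ebd3a1ceb225be238'
# Two names pointing at the same blob: two files with identical content
store = build_tree_store([
    ('100644', 'hello.txt', hello_blob),
    ('100644', 'copy.txt', hello_blob),
])
print(store)
print('Digest:', sha1(store).hexdigest())
```

Both entries carry the same 20 raw bytes, which is how one blob can appear under two filenames.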

Now that we know everything we need let's get to it!

In [None]:
tree_filename = "hello.txt"
tree_content = b'100644 ' + tree_filename.encode('utf-8') + b'\x00' + bytes.fromhex(blob_digest)
tree_header = f'tree {len(tree_content)}\x00'
tree_store = tree_header.encode('utf-8') + tree_content

tree_digest = sha1(tree_store).hexdigest()
tree_compressed = zlib.compress(tree_store)

print('Ref (blob) Hash: ', blob_digest)
print('Header:', tree_header)
print('Store:', tree_store)
print('Digest:', tree_digest)
print('Dir:', tree_digest[:2])
print('File:', tree_digest[2:])
print('Compressed:', tree_compressed)

# exist_ok avoids an error if another object already shares this prefix
os.makedirs(f'basicrepo/.git/objects/{tree_digest[:2]}', exist_ok=True)
with open(f'basicrepo/.git/objects/{tree_digest[:2]}/{tree_digest[2:]}', 'wb') as tree:
    tree.write(tree_compressed)

As before we can now verify that our tree object is valid and check out its content using `git cat-file`

In [None]:
!tree basicrepo/.git
!echo '\nType of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -t 97b4
!echo 'Content of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -p 97b4
!echo 'Raw content:'
!cat basicrepo/.git/objects/97/b49d4c943e3715fe30f141cc6f27a8548cee0e

## Commits

Lastly, we need to create a commit object that points to our tree. Commits basically mark a point in time of a repository. They annotate a tree with metadata such as author and committer information, timestamps and a commit message.

The structure of a commit object is as follows:

![commit_format](img/commit_format.png)
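
The author and committer lines share one shape: name, email in angle brackets, seconds since the epoch and a UTC offset. As a small sketch, the hypothetical helper `format_signature` below builds such a line and derives the `+hhmm` offset from minutes, which is an assumption of this example and not something the notebook itself needs.

```python
def format_signature(name, email, timestamp, utc_offset_minutes=0):
    """Format 'Name <email> <seconds since epoch> <+hhmm or -hhmm>'."""
    sign = '+' if utc_offset_minutes >= 0 else '-'
    hours, minutes = divmod(abs(utc_offset_minutes), 60)
    return f'{name} <{email}> {timestamp} {sign}{hours:02d}{minutes:02d}'


signature = format_signature('John Doe', 'jd@someplace.com', 1562917933)
print('author', signature)
print('committer', signature)
```

Passing `int(time.time())` as the timestamp would give the current time, at the cost of a different commit digest on every run.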

A commit always refers to a single tree object which will be placed in the root of the Git repository and then expanded to create the working directory.

![commit](img/commit2.png)

In [None]:
import time

author_name = 'John Doe'
author_email = 'jd@someplace.com'
# using a constant value instead of int(time.time())
# for stability and to be able to use it in the following script
seconds_since_epoch = 1562917933
time_zone = '+0000'
commit_message = 'This is it! We made it!\n'

commit_content = f'tree {tree_digest}'
commit_content += f'\nauthor {author_name} <{author_email}> {seconds_since_epoch} {time_zone}'
commit_content += f'\ncommitter {author_name} <{author_email}> {seconds_since_epoch} {time_zone}'
commit_content += f'\n\n{commit_message}'

# The header stores the content length in bytes, not in characters
commit_header = f"commit {len(commit_content.encode('utf-8'))}\x00"
commit_store = commit_header.encode('utf-8') + commit_content.encode('utf-8')

commit_digest = sha1(commit_store).hexdigest()
commit_compressed = zlib.compress(commit_store)

# exist_ok avoids an error if another object already shares this prefix
os.makedirs(f'basicrepo/.git/objects/{commit_digest[:2]}', exist_ok=True)
with open(f'basicrepo/.git/objects/{commit_digest[:2]}/{commit_digest[2:]}', 'wb') as commit:
    commit.write(commit_compressed)

print('Header:', commit_header)
print('Content:\n', commit_content)
print('Store:', commit_store)
print('Digest:', commit_digest)
print('Dir:', commit_digest[:2])
print('File:', commit_digest[2:])
print('Compressed:', commit_compressed)


Again we use `git cat-file` to check the content of our commit.

In [None]:
!tree basicrepo/.git
!echo '\nType of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -t ebc0
!echo 'Content of object:'
!git --git-dir=$PWD/basicrepo/.git cat-file -p ebc0
!echo 'Raw content:'
!cat basicrepo/.git/objects/eb/c094d762552e26513c7a9d64bfa8441c309cc6

Now that we have created a commit, all that is left to do is create a new branch and check out our commit.  
A branch is only a friendly name for a hash, so creating a branch is really simple. We just have to write the commit hash to a file with our desired branch name as the filename. This file has to be saved in the *refs/heads/* directory.

In [None]:
!echo ebc094d762552e26513c7a9d64bfa8441c309cc6 > basicrepo/.git/refs/heads/custom-branch
!git --git-dir=$PWD/basicrepo/.git --work-tree=$PWD/basicrepo checkout custom-branch
!git --git-dir=$PWD/basicrepo/.git --work-tree=$PWD/basicrepo status
!tree -a basicrepo/

As you can see, we have successfully created a commit and checked it out in a custom branch. Our *hello.txt* file was created in our working directory during checkout using only the Git objects that we created earlier. That's it! If you want to experiment some more yourself, have a look at the scripts provided in the Git repository that also contains this notebook. They offer some more flexibility to create blobs, trees and commits with your own content.
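
As a recap, the chain that a checkout follows (HEAD, branch file, commit, tree, blob) can be sketched end to end in Python. This is a simplified illustration that works in a temporary directory and handles only a single-entry tree; the helper names `write_object`, `read_object` and `resolve_head` are made up for this example.

```python
import tempfile
import zlib
from hashlib import sha1
from pathlib import Path


def write_object(git_dir, obj_type, payload):
    """Store '<type> <size>\\0<payload>' as a loose object, return its digest."""
    store = f'{obj_type} {len(payload)}'.encode('ascii') + b'\x00' + payload
    digest = sha1(store).hexdigest()
    path = git_dir / 'objects' / digest[:2] / digest[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return digest


def read_object(git_dir, digest):
    """Return the payload of a loose object, without its header."""
    raw = (git_dir / 'objects' / digest[:2] / digest[2:]).read_bytes()
    _, _, payload = zlib.decompress(raw).partition(b'\x00')
    return payload


def resolve_head(git_dir):
    """Follow 'ref: <path>' in HEAD to the commit hash stored in that file."""
    head = (git_dir / 'HEAD').read_text().strip()
    if head.startswith('ref: '):
        return (git_dir / head[len('ref: '):]).read_text().strip()
    return head  # HEAD may also contain a hash directly


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / '.git'
    (git_dir / 'refs' / 'heads').mkdir(parents=True)

    # Same blob, tree and commit content as in the notebook
    blob = write_object(git_dir, 'blob', b'Hello World\n')
    tree = write_object(git_dir, 'tree',
                        b'100644 hello.txt\x00' + bytes.fromhex(blob))
    identity = 'John Doe <jd@someplace.com> 1562917933 +0000'
    commit = write_object(git_dir, 'commit', (
        f'tree {tree}\nauthor {identity}\ncommitter {identity}\n\n'
        'This is it! We made it!\n').encode('utf-8'))
    (git_dir / 'refs' / 'heads' / 'custom-branch').write_text(commit + '\n')
    (git_dir / 'HEAD').write_text('ref: refs/heads/custom-branch\n')

    # Walk HEAD -> branch -> commit -> tree -> blob
    commit_id = resolve_head(git_dir)
    first_line = read_object(git_dir, commit_id).split(b'\n', 1)[0]
    tree_id = first_line.decode('ascii').split(' ')[1]
    entry, _, rest = read_object(git_dir, tree_id).partition(b'\x00')
    mode, name = entry.decode('utf-8').split(' ', 1)
    blob_id = rest[:20].hex()
    content = read_object(git_dir, blob_id)
    print(f'{name} ({mode}) -> {content!r}')
```

Every hop is just a file lookup by name or by digest, which is all the earlier sections had to set up by hand.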

## Cleanup

If you want to rerun this notebook, you should first delete the *basicrepo/* directory to avoid conflicts. You should also restart the Python kernel.

In [None]:
!rm -rf basicrepo/

## References
- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://stackoverflow.com/questions/22968856/what-is-the-file-format-of-a-git-commit-object
- https://stackoverflow.com/questions/14790681/what-is-the-internal-format-of-a-git-tree-object