Building PES-VCS — A Version Control System from Scratch

Objective: Build a local version control system that tracks file changes, stores snapshots efficiently, and supports commit history. Every component maps directly to operating system and filesystem concepts.

Platform: Ubuntu 22.04

Getting Started

Prerequisites

sudo apt update && sudo apt install -y gcc build-essential libssl-dev

Using This Repository

This is a template repository. Do not fork it.

Click "Use this template" → "Create a new repository" on GitHub
Name your repository (e.g., SRN-pes-vcs) and set it to public. Replace SRN with your actual SRN, e.g., PESXUG24CSYYY-pes-vcs
Clone this repository to your local machine and do all your lab work inside this directory.
Important: Remember to commit frequently as you progress. You are required to have a minimum of 5 detailed commits per phase. Refer to Submission Requirements for more details.
Clone your new repository and start working

The repository contains skeleton source files with // TODO markers where you need to write code. Functions marked // PROVIDED are complete — do not modify them.

Building

make          # Build the pes binary
make all      # Build pes + test binaries
make clean    # Remove all build artifacts

Author Configuration

PES-VCS reads the author name from the PES_AUTHOR environment variable:

export PES_AUTHOR="Your Name <PESXUG24CS042>"

If unset, it defaults to "PES User <pes@localhost>".

File Inventory

File	Role	Your Task
`pes.h`	Core data structures and constants	Do not modify
`object.c`	Content-addressable object store	Implement `object_write`, `object_read`
`tree.h`	Tree object interface	Do not modify
`tree.c`	Tree serialization and construction	Implement `tree_from_index`
`index.h`	Staging area interface	Do not modify
`index.c`	Staging area (text-based index file)	Implement `index_load`, `index_save`, `index_add`
`commit.h`	Commit object interface	Do not modify
`commit.c`	Commit creation and history	Implement `commit_create`
`pes.c`	CLI entry point and command dispatch	Do not modify
`test_objects.c`	Phase 1 test program	Do not modify
`test_tree.c`	Phase 2 test program	Do not modify
`test_sequence.sh`	End-to-end integration test	Do not modify
`Makefile`	Build system	Do not modify

Understanding Git: What You're Building

Before writing code, understand how Git works under the hood. Git is a content-addressable filesystem with a few clever data structures on top. Everything in this lab is based on Git's real design.

The Big Picture

When you run git commit, Git doesn't store "changes" or "diffs." It stores complete snapshots of your entire project. Git uses two tricks to make this efficient:

Content-addressable storage: Every file is stored by the SHA hash of its contents. Same content = same hash = stored only once.
Tree structures: Directories are stored as "tree" objects that point to file contents, so unchanged files are just pointers to existing data.

Your project at commit A:          Your project at commit B:
                                   (only README changed)

    root/                              root/
    ├── README.md  ─────┐              ├── README.md  ─────┐
    ├── src/            │              ├── src/            │
    │   └── main.c ─────┼─┐            │   └── main.c ─────┼─┐
    └── Makefile ───────┼─┼─┐          └── Makefile ───────┼─┼─┐
                        │ │ │                              │ │ │
                        ▼ ▼ ▼                              ▼ ▼ ▼
    Object Store:       ┌─────────────────────────────────────────────┐
                        │  a1b2c3 (README v1)    ← only this is new   │
                        │  d4e5f6 (README v2)                         │
                        │  789abc (main.c)       ← shared by both!    │
                        │  fedcba (Makefile)     ← shared by both!    │
                        └─────────────────────────────────────────────┘

The Three Object Types

1. Blob (Binary Large Object)

A blob is just file contents. No filename, no permissions — just the raw bytes.

blob 16\0Hello, World!\n
     ↑    ↑
     │    └── The actual file content
     └─────── Size in bytes

The blob is stored at a path determined by its SHA-256 hash. If two files have identical contents, they share one blob.

2. Tree

A tree represents a directory. It's a list of entries, each pointing to a blob (file) or another tree (subdirectory).

100644 blob a1b2c3d4... README.md
100755 blob e5f6a7b8... build.sh        ← executable file
040000 tree 9c0d1e2f... src             ← subdirectory
       ↑    ↑           ↑
       │    │           └── name
       │    └── hash of the object
       └─────── mode (permissions + type)

Mode values:

100644 — regular file, not executable
100755 — regular file, executable
040000 — directory (tree)

3. Commit

A commit ties everything together. It points to a tree (the project snapshot) and contains metadata.

tree 9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d
parent a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0
author Alice <alice@example.com> 1699900000
committer Alice <alice@example.com> 1699900000

Add new feature

The parent pointer creates a linked list of history:

    C3 ──────► C2 ──────► C1 ──────► (no parent)
    │          │          │
    ▼          ▼          ▼
  Tree3      Tree2      Tree1

How Objects Connect

                    ┌─────────────────────────────────┐
                    │           COMMIT                │
                    │  tree: 7a3f...                  │
                    │  parent: 4b2e...                │
                    │  author: Alice                  │
                    │  message: "Add feature"         │
                    └─────────────┬───────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────────────┐
                    │         TREE (root)             │
                    │  100644 blob f1a2... README.md  │
                    │  040000 tree 8b3c... src        │
                    │  100644 blob 9d4e... Makefile   │
                    └──────┬──────────┬───────────────┘
                           │          │
              ┌────────────┘          └────────────┐
              ▼                                    ▼
┌─────────────────────────┐          ┌─────────────────────────┐
│      TREE (src)         │          │     BLOB (README.md)    │
│ 100644 blob a5f6 main.c │          │  # My Project           │
└───────────┬─────────────┘          └─────────────────────────┘
            ▼
       ┌────────┐
       │ BLOB   │
       │main.c  │
       └────────┘

References and HEAD

References are files that map human-readable names to commit hashes:

.pes/
├── HEAD                    # "ref: refs/heads/main"
└── refs/
    └── heads/
        └── main            # Contains: a1b2c3d4e5f6...

HEAD points to a branch name. The branch file contains the latest commit hash. When you commit:

Git creates the new commit object (pointing to parent)
Updates the branch file to contain the new commit's hash
HEAD still points to the branch, so it "follows" automatically

Before commit:                    After commit:

HEAD ─► main ─► C2 ─► C1         HEAD ─► main ─► C3 ─► C2 ─► C1

The Index (Staging Area)

The index is the "preparation area" for the next commit. It tracks which files are staged.

Working Directory          Index               Repository (HEAD)
─────────────────         ─────────           ─────────────────
README.md (modified) ──── pes add ──► README.md (staged)
src/main.c                            src/main.c          ──► Last commit's
Makefile                               Makefile                snapshot

The workflow:

pes add file.txt → computes blob hash, stores blob, updates index
pes commit -m "msg" → builds tree from index, creates commit, updates branch ref

Content-Addressable Storage

Objects are named by their content's hash:

# Pseudocode
def store_object(content):
    hash = sha256(content)
    path = f".pes/objects/{hash[0:2]}/{hash[2:]}"
    write_file(path, content)
    return hash

This gives us:

Deduplication: Identical files stored once
Integrity: Hash verifies data isn't corrupted
Immutability: Changing content = different hash = different object

Objects are sharded by the first two hex characters to avoid huge directories:

.pes/objects/
├── 2f/
│   └── 8a3b5c7d9e...
├── a1/
│   ├── 9c4e6f8a0b...
│   └── b2d4f6a8c0...
└── ff/
    └── 1234567890...

Exploring a Real Git Repository

You can inspect Git's internals yourself:

mkdir test-repo && cd test-repo && git init
echo "Hello" > hello.txt
git add hello.txt && git commit -m "First commit"

find .git/objects -type f          # See stored objects
git cat-file -t <hash>            # Show type: blob, tree, or commit
git cat-file -p <hash>            # Show contents
cat .git/HEAD                     # See what HEAD points to
cat .git/refs/heads/main          # See branch pointer

What You'll Build

PES-VCS implements five commands across four phases:

pes init              Create .pes/ repository structure
pes add <file>...     Stage files (hash + update index)
pes status            Show modified/staged/untracked files
pes commit -m <msg>   Create commit from staged files
pes log               Walk and display commit history

The .pes/ directory structure:

my_project/
├── .pes/
│   ├── objects/          # Content-addressable blob/tree/commit storage
│   │   ├── 2f/
│   │   │   └── 8a3b...   # Sharded by first 2 hex chars of hash
│   │   └── a1/
│   │       └── 9c4e...
│   ├── refs/
│   │   └── heads/
│   │       └── main      # Branch pointer (file containing commit hash)
│   ├── index             # Staging area (text file)
│   └── HEAD              # Current branch reference
└── (working directory files)

Architecture Overview

┌───────────────────────────────────────────────────────────────┐
│                      WORKING DIRECTORY                        │
│                  (actual files you edit)                       │
└───────────────────────────────────────────────────────────────┘
                              │
                        pes add <file>
                              ▼
┌───────────────────────────────────────────────────────────────┐
│                           INDEX                               │
│                (staged changes, ready to commit)              │
│                100644 a1b2c3... src/main.c                    │
└───────────────────────────────────────────────────────────────┘
                              │
                       pes commit -m "msg"
                              ▼
┌───────────────────────────────────────────────────────────────┐
│                       OBJECT STORE                            │
│  ┌───────┐    ┌───────┐    ┌────────┐                         │
│  │ BLOB  │◄───│ TREE  │◄───│ COMMIT │                         │
│  │(file) │    │(dir)  │    │(snap)  │                         │
│  └───────┘    └───────┘    └────────┘                         │
│  Stored at: .pes/objects/XX/YYY...                            │
└───────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────┐
│                           REFS                                │
│       .pes/refs/heads/main  →  commit hash                    │
│       .pes/HEAD             →  "ref: refs/heads/main"         │
└───────────────────────────────────────────────────────────────┘

Phase 1: Object Storage Foundation

Filesystem Concepts: Content-addressable storage, directory sharding, atomic writes, hashing for integrity

Files: pes.h (read), object.c (implement object_write and object_read)

What to Implement

Open object.c. Two functions are marked // TODO:

object_write — Stores data in the object store.
- Prepends a type header ("blob <size>\0", "tree <size>\0", or "commit <size>\0")
- Computes SHA-256 of the full object (header + data)
- Writes atomically using the temp-file-then-rename pattern
- Shards into subdirectories by first 2 hex chars of hash
object_read — Retrieves and verifies data from the object store.
- Reads the file, parses the header to extract type and size
- Verifies integrity by recomputing the hash and comparing to the filename
- Returns the data portion (after the \0)

Read the detailed step-by-step comments in object.c before starting.

Testing

make test_objects
./test_objects

The test program verifies:

Blob storage and retrieval (write, read back, compare)
Deduplication (same content → same hash → stored once)
Integrity checking (detects corrupted objects)

📸 Screenshot 1A: Output of ./test_objects showing all tests passing.

📸 Screenshot 1B: find .pes/objects -type f showing the sharded directory structure.

Phase 2: Tree Objects

Filesystem Concepts: Directory representation, recursive structures, file modes and permissions

Files: tree.h (read), tree.c (implement all TODO functions)

What to Implement

Open tree.c. Implement the function marked // TODO:

tree_from_index — Builds a tree hierarchy from the index.
- Handles nested paths: "src/main.c" must create a src subtree
- This is what pes commit uses to create the snapshot
- Writes all tree objects to the object store and returns the root hash

Testing

make test_tree
./test_tree

The test program verifies:

Serialize → parse roundtrip preserves entries, modes, and hashes
Deterministic serialization (same entries in any order → identical output)

📸 Screenshot 2A: Output of ./test_tree showing all tests passing.

📸 Screenshot 2B: Pick a tree object from find .pes/objects -type f and run xxd .pes/objects/XX/YYY... | head -20 to show the raw binary format.

Phase 3: The Index (Staging Area)

Filesystem Concepts: File format design, atomic writes, change detection using metadata

Files: index.h (read), index.c (implement all TODO functions)

What to Implement

Open index.c. Three functions are marked // TODO:

index_load — Reads the text-based .pes/index file into an Index struct.
- If the file doesn't exist, initializes an empty index (this is not an error)
- Parses each line: <mode> <hash-hex> <mtime> <size> <path>
index_save — Writes the index atomically (temp file + rename).
- Sorts entries by path before writing
- Uses fsync() on the temp file before renaming
index_add — Stages a file: reads it, writes blob to object store, updates index entry.
- Use the provided index_find to check for an existing entry

index_find , index_status and index_remove are already implemented for you — read them to understand the index data structure before starting.

Expected Output of `pes status`

Staged changes:
  staged:     hello.txt
  staged:     src/main.c

Unstaged changes:
  modified:   README.md
  deleted:    old_file.txt

Untracked files:
  untracked:  notes.txt

If a section has no entries, print the header followed by (nothing to show).

Testing

make pes
./pes init
echo "hello" > file1.txt
echo "world" > file2.txt
./pes add file1.txt file2.txt
./pes status
cat .pes/index    # Human-readable text format

📸 Screenshot 3A: Run ./pes init, ./pes add file1.txt file2.txt, ./pes status — show the output.

📸 Screenshot 3B: cat .pes/index showing the text-format index with your entries.

Phase 4: Commits and History

Filesystem Concepts: Linked structures on disk, reference files, atomic pointer updates

Files: commit.h (read), commit.c (implement all TODO functions)

What to Implement

Open commit.c. One function is marked // TODO:

commit_create — The main commit function:
- Builds a tree from the index using tree_from_index() (not from the working directory — commits snapshot the staged state)
- Reads current HEAD as the parent (may not exist for first commit)
- Gets the author string from pes_author() (defined in pes.h)
- Writes the commit object, then updates HEAD

commit_parse, commit_serialize, commit_walk, head_read, and head_update are already implemented — read them to understand the commit format before writing commit_create.

The commit text format is specified in the comment at the top of commit.c.

Testing

./pes init
echo "Hello" > hello.txt
./pes add hello.txt
./pes commit -m "Initial commit"

echo "World" >> hello.txt
./pes add hello.txt
./pes commit -m "Add world"

echo "Goodbye" > bye.txt
./pes add bye.txt
./pes commit -m "Add farewell"

./pes log

You can also run the full integration test:

make test-integration

📸 Screenshot 4A: Output of ./pes log showing three commits with hashes, authors, timestamps, and messages.

📸 Screenshot 4B: find .pes -type f | sort showing object store growth after three commits.

📸 Screenshot 4C: cat .pes/refs/heads/main and cat .pes/HEAD showing the reference chain.

Phase 5 & 6: Analysis-Only Questions

The following questions cover filesystem concepts beyond the implementation scope of this lab. Answer them in writing — no code required.

Branching and Checkout

Q5.1: A branch in Git is just a file in .git/refs/heads/ containing a commit hash. Creating a branch is creating a file. Given this, how would you implement pes checkout <branch> — what files need to change in .pes/, and what must happen to the working directory? What makes this operation complex?

Q5.2: When switching branches, the working directory must be updated to match the target branch's tree. If the user has uncommitted changes to a tracked file, and that file differs between branches, checkout must refuse. Describe how you would detect this "dirty working directory" conflict using only the index and the object store.

Q5.3: "Detached HEAD" means HEAD contains a commit hash directly instead of a branch reference. What happens if you make commits in this state? How could a user recover those commits?

Garbage Collection and Space Reclamation

Q6.1: Over time, the object store accumulates unreachable objects — blobs, trees, or commits that no branch points to (directly or transitively). Describe an algorithm to find and delete these objects. What data structure would you use to track "reachable" hashes efficiently? For a repository with 100,000 commits and 50 branches, estimate how many objects you'd need to visit.

Q6.2: Why is it dangerous to run garbage collection concurrently with a commit operation? Describe a race condition where GC could delete an object that a concurrent commit is about to reference. How does Git's real GC avoid this?

Submission Checklist

Screenshots Required

Phase	ID	What to Capture
1	1A	`./test_objects` output showing all tests passing
1	1B	`find .pes/objects -type f` showing sharded directory structure
2	2A	`./test_tree` output showing all tests passing
2	2B	`xxd` of a raw tree object (first 20 lines)
3	3A	`pes init` → `pes add` → `pes status` sequence
3	3B	`cat .pes/index` showing the text-format index
4	4A	`pes log` output with three commits
4	4B	`find .pes -type f \| sort` showing object growth
4	4C	`cat .pes/refs/heads/main` and `cat .pes/HEAD`
Final	--	Full integration test (`make test-integration`)

Code Files Required (5 files)

File	Description
`object.c`	Object store implementation
`tree.c`	Tree serialization and construction
`index.c`	Staging area implementation
`commit.c`	Commit creation and history walking

Analysis Questions (written answers)

Section	Questions
Branching (analysis-only)	Q5.1, Q5.2, Q5.3
GC (analysis-only)	Q6.1, Q6.2

Submission Requirements

1. GitHub Repository

You must submit the link to your GitHub repository via the official submission link (which will be shared by your respective faculty).
The repository must strictly maintain the directory structure you built throughout this lab.
Ensure your github repository is made public

2. Lab Report

Your report, containing all required screenshots and answers to the analysis questions, must be placed at the root of your repository directory.
The report must be submitted as either a PDF (report.pdf) or a Markdown file (README.md).

3. Commit History (Graded Requirement)

Minimum Requirement: You must have a minimum of 5 commits per phase with appropriate commit messages. Submitting fewer than 5 commits for any given phase will result in a deduction of marks.
Best Practices: We highly prefer more than 5 detailed commits per phase. Granular commits that clearly show the delta in code block changes allow us to verify your step-by-step understanding of the concepts and prevent penalties <3

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.gitignore		.gitignore
Makefile		Makefile
PES2UG24CS430.pdf		PES2UG24CS430.pdf
README.md		README.md
commit.c		commit.c
commit.h		commit.h
index.c		index.c
index.h		index.h
object.c		object.c
pes.c		pes.c
pes.h		pes.h
test_objects.c		test_objects.c
test_sequence.sh		test_sequence.sh
test_tree.c		test_tree.c
tree.c		tree.c
tree.h		tree.h

Folders and files

Latest commit

History

Repository files navigation

Building PES-VCS — A Version Control System from Scratch

Getting Started

Prerequisites

Using This Repository

Building

Author Configuration

File Inventory

Understanding Git: What You're Building

The Big Picture

The Three Object Types

1. Blob (Binary Large Object)

2. Tree

3. Commit

How Objects Connect

References and HEAD

The Index (Staging Area)

Content-Addressable Storage

Exploring a Real Git Repository

What You'll Build

Architecture Overview

Phase 1: Object Storage Foundation

What to Implement

Testing

Phase 2: Tree Objects

What to Implement

Testing

Phase 3: The Index (Staging Area)

What to Implement

Expected Output of pes status

Testing

Phase 4: Commits and History

What to Implement

Testing

Phase 5 & 6: Analysis-Only Questions

Branching and Checkout

Garbage Collection and Space Reclamation

Submission Checklist

Screenshots Required

Code Files Required (5 files)

Analysis Questions (written answers)

Submission Requirements

Further Reading

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Expected Output of `pes status`

Packages