Objective: Build a local version control system that tracks file changes, stores snapshots efficiently, and supports commit history. Every component maps directly to operating system and filesystem concepts.
Platform: Ubuntu 22.04
sudo apt update && sudo apt install -y gcc build-essential libssl-devThis is a template repository. Do not fork it.
- Click "Use this template" → "Create a new repository" on GitHub
- Name your repository (e.g.,
SRN-pes-vcs) and set it to public. ReplaceSRNwith your actual SRN, e.g.,PESXUG24CSYYY-pes-vcs - Clone this repository to your local machine and do all your lab work inside this directory.
- Important: Remember to commit frequently as you progress. You are required to have a minimum of 5 detailed commits per phase. Refer to Submission Requirements for more details.
- Clone your new repository and start working
The repository contains skeleton source files with // TODO markers where you need to write code. Functions marked // PROVIDED are complete — do not modify them.
make # Build the pes binary
make all # Build pes + test binaries
make clean # Remove all build artifactsPES-VCS reads the author name from the PES_AUTHOR environment variable:
export PES_AUTHOR="Your Name <PESXUG24CS042>"If unset, it defaults to "PES User <pes@localhost>".
| File | Role | Your Task |
|---|---|---|
pes.h |
Core data structures and constants | Do not modify |
object.c |
Content-addressable object store | Implement object_write, object_read |
tree.h |
Tree object interface | Do not modify |
tree.c |
Tree serialization and construction | Implement tree_from_index |
index.h |
Staging area interface | Do not modify |
index.c |
Staging area (text-based index file) | Implement index_load, index_save, index_add |
commit.h |
Commit object interface | Do not modify |
commit.c |
Commit creation and history | Implement commit_create |
pes.c |
CLI entry point and command dispatch | Do not modify |
test_objects.c |
Phase 1 test program | Do not modify |
test_tree.c |
Phase 2 test program | Do not modify |
test_sequence.sh |
End-to-end integration test | Do not modify |
Makefile |
Build system | Do not modify |
Before writing code, understand how Git works under the hood. Git is a content-addressable filesystem with a few clever data structures on top. Everything in this lab is based on Git's real design.
When you run git commit, Git doesn't store "changes" or "diffs." It stores complete snapshots of your entire project. Git uses two tricks to make this efficient:
- Content-addressable storage: Every file is stored by the SHA hash of its contents. Same content = same hash = stored only once.
- Tree structures: Directories are stored as "tree" objects that point to file contents, so unchanged files are just pointers to existing data.
Your project at commit A: Your project at commit B:
(only README changed)
root/ root/
├── README.md ─────┐ ├── README.md ─────┐
├── src/ │ ├── src/ │
│ └── main.c ─────┼─┐ │ └── main.c ─────┼─┐
└── Makefile ───────┼─┼─┐ └── Makefile ───────┼─┼─┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
Object Store: ┌─────────────────────────────────────────────┐
│ a1b2c3 (README v1) ← only this is new │
│ d4e5f6 (README v2) │
│ 789abc (main.c) ← shared by both! │
│ fedcba (Makefile) ← shared by both! │
└─────────────────────────────────────────────┘
A blob is just file contents. No filename, no permissions — just the raw bytes.
blob 16\0Hello, World!\n
↑ ↑
│ └── The actual file content
└─────── Size in bytes
The blob is stored at a path determined by its SHA-256 hash. If two files have identical contents, they share one blob.
A tree represents a directory. It's a list of entries, each pointing to a blob (file) or another tree (subdirectory).
100644 blob a1b2c3d4... README.md
100755 blob e5f6a7b8... build.sh ← executable file
040000 tree 9c0d1e2f... src ← subdirectory
↑ ↑ ↑
│ │ └── name
│ └── hash of the object
└─────── mode (permissions + type)
Mode values:
100644— regular file, not executable100755— regular file, executable040000— directory (tree)
A commit ties everything together. It points to a tree (the project snapshot) and contains metadata.
tree 9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d
parent a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0
author Alice <alice@example.com> 1699900000
committer Alice <alice@example.com> 1699900000
Add new feature
The parent pointer creates a linked list of history:
C3 ──────► C2 ──────► C1 ──────► (no parent)
│ │ │
▼ ▼ ▼
Tree3 Tree2 Tree1
┌─────────────────────────────────┐
│ COMMIT │
│ tree: 7a3f... │
│ parent: 4b2e... │
│ author: Alice │
│ message: "Add feature" │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ TREE (root) │
│ 100644 blob f1a2... README.md │
│ 040000 tree 8b3c... src │
│ 100644 blob 9d4e... Makefile │
└──────┬──────────┬───────────────┘
│ │
┌────────────┘ └────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ TREE (src) │ │ BLOB (README.md) │
│ 100644 blob a5f6 main.c │ │ # My Project │
└───────────┬─────────────┘ └─────────────────────────┘
▼
┌────────┐
│ BLOB │
│main.c │
└────────┘
References are files that map human-readable names to commit hashes:
.pes/
├── HEAD # "ref: refs/heads/main"
└── refs/
└── heads/
└── main # Contains: a1b2c3d4e5f6...
HEAD points to a branch name. The branch file contains the latest commit hash. When you commit:
- Git creates the new commit object (pointing to parent)
- Updates the branch file to contain the new commit's hash
- HEAD still points to the branch, so it "follows" automatically
Before commit: After commit:
HEAD ─► main ─► C2 ─► C1 HEAD ─► main ─► C3 ─► C2 ─► C1
The index is the "preparation area" for the next commit. It tracks which files are staged.
Working Directory Index Repository (HEAD)
───────────────── ───────── ─────────────────
README.md (modified) ──── pes add ──► README.md (staged)
src/main.c src/main.c ──► Last commit's
Makefile Makefile snapshot
The workflow:
pes add file.txt→ computes blob hash, stores blob, updates indexpes commit -m "msg"→ builds tree from index, creates commit, updates branch ref
Objects are named by their content's hash:
# Pseudocode
def store_object(content):
hash = sha256(content)
path = f".pes/objects/{hash[0:2]}/{hash[2:]}"
write_file(path, content)
return hashThis gives us:
- Deduplication: Identical files stored once
- Integrity: Hash verifies data isn't corrupted
- Immutability: Changing content = different hash = different object
Objects are sharded by the first two hex characters to avoid huge directories:
.pes/objects/
├── 2f/
│ └── 8a3b5c7d9e...
├── a1/
│ ├── 9c4e6f8a0b...
│ └── b2d4f6a8c0...
└── ff/
└── 1234567890...
You can inspect Git's internals yourself:
mkdir test-repo && cd test-repo && git init
echo "Hello" > hello.txt
git add hello.txt && git commit -m "First commit"
find .git/objects -type f # See stored objects
git cat-file -t <hash> # Show type: blob, tree, or commit
git cat-file -p <hash> # Show contents
cat .git/HEAD # See what HEAD points to
cat .git/refs/heads/main # See branch pointerPES-VCS implements five commands across four phases:
pes init Create .pes/ repository structure
pes add <file>... Stage files (hash + update index)
pes status Show modified/staged/untracked files
pes commit -m <msg> Create commit from staged files
pes log Walk and display commit history
The .pes/ directory structure:
my_project/
├── .pes/
│ ├── objects/ # Content-addressable blob/tree/commit storage
│ │ ├── 2f/
│ │ │ └── 8a3b... # Sharded by first 2 hex chars of hash
│ │ └── a1/
│ │ └── 9c4e...
│ ├── refs/
│ │ └── heads/
│ │ └── main # Branch pointer (file containing commit hash)
│ ├── index # Staging area (text file)
│ └── HEAD # Current branch reference
└── (working directory files)
┌───────────────────────────────────────────────────────────────┐
│ WORKING DIRECTORY │
│ (actual files you edit) │
└───────────────────────────────────────────────────────────────┘
│
pes add <file>
▼
┌───────────────────────────────────────────────────────────────┐
│ INDEX │
│ (staged changes, ready to commit) │
│ 100644 a1b2c3... src/main.c │
└───────────────────────────────────────────────────────────────┘
│
pes commit -m "msg"
▼
┌───────────────────────────────────────────────────────────────┐
│ OBJECT STORE │
│ ┌───────┐ ┌───────┐ ┌────────┐ │
│ │ BLOB │◄───│ TREE │◄───│ COMMIT │ │
│ │(file) │ │(dir) │ │(snap) │ │
│ └───────┘ └───────┘ └────────┘ │
│ Stored at: .pes/objects/XX/YYY... │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ REFS │
│ .pes/refs/heads/main → commit hash │
│ .pes/HEAD → "ref: refs/heads/main" │
└───────────────────────────────────────────────────────────────┘
Filesystem Concepts: Content-addressable storage, directory sharding, atomic writes, hashing for integrity
Files: pes.h (read), object.c (implement object_write and object_read)
Open object.c. Two functions are marked // TODO:
-
object_write— Stores data in the object store.- Prepends a type header (
"blob <size>\0","tree <size>\0", or"commit <size>\0") - Computes SHA-256 of the full object (header + data)
- Writes atomically using the temp-file-then-rename pattern
- Shards into subdirectories by first 2 hex chars of hash
- Prepends a type header (
-
object_read— Retrieves and verifies data from the object store.- Reads the file, parses the header to extract type and size
- Verifies integrity by recomputing the hash and comparing to the filename
- Returns the data portion (after the
\0)
Read the detailed step-by-step comments in object.c before starting.
make test_objects
./test_objectsThe test program verifies:
- Blob storage and retrieval (write, read back, compare)
- Deduplication (same content → same hash → stored once)
- Integrity checking (detects corrupted objects)
📸 Screenshot 1A: Output of ./test_objects showing all tests passing.
📸 Screenshot 1B: find .pes/objects -type f showing the sharded directory structure.
Filesystem Concepts: Directory representation, recursive structures, file modes and permissions
Files: tree.h (read), tree.c (implement all TODO functions)
Open tree.c. Implement the function marked // TODO:
tree_from_index— Builds a tree hierarchy from the index.- Handles nested paths:
"src/main.c"must create asrcsubtree - This is what
pes commituses to create the snapshot - Writes all tree objects to the object store and returns the root hash
- Handles nested paths:
make test_tree
./test_treeThe test program verifies:
- Serialize → parse roundtrip preserves entries, modes, and hashes
- Deterministic serialization (same entries in any order → identical output)
📸 Screenshot 2A: Output of ./test_tree showing all tests passing.
📸 Screenshot 2B: Pick a tree object from find .pes/objects -type f and run xxd .pes/objects/XX/YYY... | head -20 to show the raw binary format.
Filesystem Concepts: File format design, atomic writes, change detection using metadata
Files: index.h (read), index.c (implement all TODO functions)
Open index.c. Three functions are marked // TODO:
-
index_load— Reads the text-based.pes/indexfile into anIndexstruct.- If the file doesn't exist, initializes an empty index (this is not an error)
- Parses each line:
<mode> <hash-hex> <mtime> <size> <path>
-
index_save— Writes the index atomically (temp file + rename).- Sorts entries by path before writing
- Uses
fsync()on the temp file before renaming
-
index_add— Stages a file: reads it, writes blob to object store, updates index entry.- Use the provided
index_findto check for an existing entry
- Use the provided
index_find , index_status and index_remove are already implemented for you — read them to understand the index data structure before starting.
Staged changes:
staged: hello.txt
staged: src/main.c
Unstaged changes:
modified: README.md
deleted: old_file.txt
Untracked files:
untracked: notes.txt
If a section has no entries, print the header followed by (nothing to show).
make pes
./pes init
echo "hello" > file1.txt
echo "world" > file2.txt
./pes add file1.txt file2.txt
./pes status
cat .pes/index # Human-readable text format📸 Screenshot 3A: Run ./pes init, ./pes add file1.txt file2.txt, ./pes status — show the output.
📸 Screenshot 3B: cat .pes/index showing the text-format index with your entries.
Filesystem Concepts: Linked structures on disk, reference files, atomic pointer updates
Files: commit.h (read), commit.c (implement all TODO functions)
Open commit.c. One function is marked // TODO:
commit_create— The main commit function:- Builds a tree from the index using
tree_from_index()(not from the working directory — commits snapshot the staged state) - Reads current HEAD as the parent (may not exist for first commit)
- Gets the author string from
pes_author()(defined inpes.h) - Writes the commit object, then updates HEAD
- Builds a tree from the index using
commit_parse, commit_serialize, commit_walk, head_read, and head_update are already implemented — read them to understand the commit format before writing commit_create.
The commit text format is specified in the comment at the top of commit.c.
./pes init
echo "Hello" > hello.txt
./pes add hello.txt
./pes commit -m "Initial commit"
echo "World" >> hello.txt
./pes add hello.txt
./pes commit -m "Add world"
echo "Goodbye" > bye.txt
./pes add bye.txt
./pes commit -m "Add farewell"
./pes logYou can also run the full integration test:
make test-integration📸 Screenshot 4A: Output of ./pes log showing three commits with hashes, authors, timestamps, and messages.
📸 Screenshot 4B: find .pes -type f | sort showing object store growth after three commits.
📸 Screenshot 4C: cat .pes/refs/heads/main and cat .pes/HEAD showing the reference chain.
The following questions cover filesystem concepts beyond the implementation scope of this lab. Answer them in writing — no code required.
Q5.1: A branch in Git is just a file in .git/refs/heads/ containing a commit hash. Creating a branch is creating a file. Given this, how would you implement pes checkout <branch> — what files need to change in .pes/, and what must happen to the working directory? What makes this operation complex?
Q5.2: When switching branches, the working directory must be updated to match the target branch's tree. If the user has uncommitted changes to a tracked file, and that file differs between branches, checkout must refuse. Describe how you would detect this "dirty working directory" conflict using only the index and the object store.
Q5.3: "Detached HEAD" means HEAD contains a commit hash directly instead of a branch reference. What happens if you make commits in this state? How could a user recover those commits?
Q6.1: Over time, the object store accumulates unreachable objects — blobs, trees, or commits that no branch points to (directly or transitively). Describe an algorithm to find and delete these objects. What data structure would you use to track "reachable" hashes efficiently? For a repository with 100,000 commits and 50 branches, estimate how many objects you'd need to visit.
Q6.2: Why is it dangerous to run garbage collection concurrently with a commit operation? Describe a race condition where GC could delete an object that a concurrent commit is about to reference. How does Git's real GC avoid this?
| Phase | ID | What to Capture |
|---|---|---|
| 1 | 1A | ./test_objects output showing all tests passing |
| 1 | 1B | find .pes/objects -type f showing sharded directory structure |
| 2 | 2A | ./test_tree output showing all tests passing |
| 2 | 2B | xxd of a raw tree object (first 20 lines) |
| 3 | 3A | pes init → pes add → pes status sequence |
| 3 | 3B | cat .pes/index showing the text-format index |
| 4 | 4A | pes log output with three commits |
| 4 | 4B | find .pes -type f | sort showing object growth |
| 4 | 4C | cat .pes/refs/heads/main and cat .pes/HEAD |
| Final | -- | Full integration test (make test-integration) |
| File | Description |
|---|---|
object.c |
Object store implementation |
tree.c |
Tree serialization and construction |
index.c |
Staging area implementation |
commit.c |
Commit creation and history walking |
| Section | Questions |
|---|---|
| Branching (analysis-only) | Q5.1, Q5.2, Q5.3 |
| GC (analysis-only) | Q6.1, Q6.2 |
1. GitHub Repository
- You must submit the link to your GitHub repository via the official submission link (which will be shared by your respective faculty).
- The repository must strictly maintain the directory structure you built throughout this lab.
- Ensure your github repository is made
public
2. Lab Report
- Your report, containing all required screenshots and answers to the analysis questions, must be placed at the root of your repository directory.
- The report must be submitted as either a PDF (
report.pdf) or a Markdown file (README.md).
3. Commit History (Graded Requirement)
- Minimum Requirement: You must have a minimum of 5 commits per phase with appropriate commit messages. Submitting fewer than 5 commits for any given phase will result in a deduction of marks.
- Best Practices: We highly prefer more than 5 detailed commits per phase. Granular commits that clearly show the delta in code block changes allow us to verify your step-by-step understanding of the concepts and prevent penalties <3
- Git Internals (Pro Git book): https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain
- Git from the inside out: https://codewords.recurse.com/issues/two/git-from-the-inside-out
- The Git Parable: https://tom.preston-werner.com/2009/05/19/the-git-parable.html