# Git For the Modern Data Scientist: 9 Git Concepts You Can't Ignore
![](images/galaxy.png)

### Introduction

Most data scientists feel like fish out of water when it comes to Git. There are software engineers who talk about nothing but Git things, and there are data scientists who say "Huh?" (I wish I could add a sound to this) every time.

That stops today! Since Git is an unignorable tool essential to collaboration, I will break down _nine_ of the most critical Git concepts data scientists must know like the back of their hand. 

I can promise that you won't be nodding your head in fake understanding next time someone talks about Git or version control.

Let's get started!

### For the 1000th time...

You may have heard it a few hundred times already, but I will err on the side of caution and say it for the few-hundred-and-first time:

> Git is one of the most critical tools in developing ML and AI systems.

![](images/git_proof.png)

If your idea of a machine learning or data science project involves models cooked up in notebooks with creatively named files such as "notebook1", "notebook2", "notebook_final", and "notebook_final_final", then don't bother with Git.

However, if you aim to deploy models that others can use without migraines, Git is a relatively small price to pay.

Git allows you to keep track of changes to your code and data, collaborate with others, and maintain a history of your project. With Git, you can easily revert to a previous version of your work, compare different versions, and merge changes made by multiple contributors.

Moreover, Git easily integrates with other popular MLOps tools like DVC for data version control, making it an essential tool for data scientists.

### 0. Repository

Basically, a repository is this:

![](images/1_repo.png)

It is a folder on your machine. It can have no files, three files or a hundred. The only thing needed to convert that folder into a Git repository is to call `git init` inside it.

![](images/2_init.png)
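Here is a minimal sketch of that conversion. The folder name `my_ml_project` is made up for illustration, and everything runs in a temporary directory so it can't touch a real project.

```shell
set -e
workdir=$(mktemp -d)              # scratch location for the demo
mkdir "$workdir/my_ml_project"    # hypothetical project folder
cd "$workdir/my_ml_project"

git init -q                       # turns the folder into a Git repository
ls -a                             # the hidden .git folder now lives here

# Ask Git whether we are inside a repository's working tree
inside=$(git rev-parse --is-inside-work-tree)
echo "inside a Git repo: $inside"
```

Deleting the `.git` folder turns the repository back into a plain folder, which is why you should leave it alone.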

A machine learning repository usually has folders to store data, models, and code for loading/cleaning/transforming data and/or for selecting/training/saving models for deployment.

There will be other miscellaneous files and folders too, like the `.git` folder for Git internals, metadata files, etc.

All these make up a single repo, and Git is usually enough to track them (except data and models; for those, see [this article](https://towardsdatascience.com/how-to-version-gigabyte-sized-datasets-just-like-code-with-dvc-in-python-5197662e85bd) afterwards).

### 1. Tracked, untracked

When you initialize Git inside a directory, by default any existing or new files/directories you create will be untracked by Git. 

![](images/3_status.png)

This means any future changes you make to them will be untracked as well. So, you have to put those files under Git supervision by calling `git add path/to/file.py`.


![](images/4_add.png)

After calling `git add` on files, they will be under Git-watch. 

If you wish to add all files in the repository (highly unlikely though), you can call `git add .`.
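To see the difference between tracked and untracked files, here is a small sketch in a throwaway repository. The file names `clean.py` and `train.py` are placeholders.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q

echo "print('cleaning')" > clean.py    # hypothetical project files
echo "print('training')" > train.py

git status --short          # both files are listed with ?? (untracked)
git add train.py            # put only train.py under Git's watch
git status --short          # train.py is now marked A, clean.py is still ??

tracked=$(git ls-files)                                # files Git tracks
untracked=$(git ls-files --others --exclude-standard)  # files it doesn't
echo "tracked:   $tracked"
echo "untracked: $untracked"
```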

There are also cases where you never want files to be tracked by Git. This is when you create a `.gitignore` file.

As the name suggests, files listed in `.gitignore` won't be tracked or indexed by Git for as long as they are listed there (files Git already tracks stay tracked until you remove them from the index). Typical candidates for `.gitignore` in data projects are large data files like CSVs, parquets, images, video or audio. Git is historically horrible at handling those.

It handles the rest like a champ.

P.S. you can create `.gitignore` in the terminal with `touch .gitignore` and add files/folders to it with `echo "filename" >> .gitignore` on new lines.
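A short sketch of `.gitignore` in action, assuming a hypothetical `data/` folder that holds a large CSV. The patterns below are examples, not a recommended list.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q

mkdir data
echo "a,b" > data/big_dataset.csv      # stand-in for a large data file
echo "print('training')" > train.py

touch .gitignore
echo "data/" >> .gitignore             # ignore the whole data folder
echo "*.parquet" >> .gitignore         # ignore every parquet file

git add .                              # ignored paths are skipped
staged=$(git ls-files)
echo "$staged"                         # lists .gitignore and train.py only

# check-ignore exits with 0 when a path matches an ignore rule
git check-ignore -q data/big_dataset.csv && echo "data/big_dataset.csv is ignored"
```

The `.gitignore` file itself is usually committed so that collaborators ignore the same paths.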

### 2. Commit

A Git commit is a precious, precious thing. The whole idea of version control is based on it.

When you call `git commit` inside a Git repository, you are taking a snapshot of every Git-tracked file for that specific point in time. Think of it like a time capsule with contents (__versions__) of your project from different time periods.

![image.png](attachment:a221e0c0-25d1-4d89-90f0-4579fb64dfc9.png)

All the commits you make will form your Git history or Git tree like below. 

![](images/6_git_linear.png)

A good Git tree organizes the linear progression of your repository. 

By breaking down your code changes into discrete, well-defined commits, you can map out the progress of your repository almost like a book.

Then, you can browse through the pages of this Git book through commits.

Just like a writer puts a lot of effort into writing each page of their book, you should treat your commits with care.

You shouldn't be making commits for the sake of commits. Consider them as little pieces of history, and know that future versions of yourself and other developers should look at them with delight, rather than disgust.

> Traditional advice: A good commit has an informative message describing the changes made.

Some common scenarios to commit in a typical machine learning project:
- Implementing a new feature: writing code that adds a new functionality like a new function, class, class method, training a new model, new data cleaning operation, etc.
- Fixing a bug: documenting bug fixes to existing functions, methods and classes
- Improving performance: writing code that enhances an existing feature like optimizing blocks of code
- Updating docs and dependencies
- Machine learning experiments: in a project, you will run dozens of experiments to choose and tune the best model. [Each model run should be tracked as a commit](https://pub.towardsai.net/how-to-track-ml-experiments-with-dvc-inside-vscode-to-boost-your-productivity-a654ace60bab).
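As a rough illustration of small, well-described commits, the sketch below makes two of them in a throwaway repository. The file contents, commit messages, and the "Demo User" identity are all placeholders.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "def clean(df): return df.dropna()" > clean.py
git add clean.py
git commit -q -m "Add clean() to drop rows with missing values"

echo "def train(X, y): pass" > train.py
git add train.py
git commit -q -m "Add train() stub for the baseline model"

git log --oneline                           # one line per snapshot, newest first
n_commits=$(git rev-list --count HEAD)
echo "commits so far: $n_commits"
```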

### 3. Staging area

By talking about commits, we have got ahead of ourselves. Before closing the cap of the commit capsule, you have to make sure the contents within are right.

This involves telling Git exactly which changes from which files you want to commit. Sometimes, new changes might come from several files and you may only want to commit some of them and leave the rest for future commits. 

This is where we lift the curtains and reveal the staging area (pun intended):

![](images/7_stage.png)

The idea is that you must have some way of double-checking, editing or undoing the changes you want to add to your Git history before you press that commit button. 

Adding the new changes to the staging area (or __Git index__ as some kids call it) allows you to do that. The staging area holds the changes you want to include in the next commit. 

Let's say you made changes to both `clean.py` and `train.py`. If you add the changes in `train.py` with `git add train.py` to the staging area, the next commit will only include that change. 

The modified `clean.py` will stay as is (uncommitted). 

![](images/7_stage.png)

So, here is an easy workflow for you:

1. Track new files with Git (only done once)
2. Add changes in tracked files to the staging area with `git add changed_file.extension`
3. Commit the changes in the staging area to history with `git commit -m "Commit message"`.
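
The three steps above can be sketched as follows, using the `clean.py` and `train.py` example. The file contents and commit messages are made up, and the repository lives in a temporary folder.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

# 1. Track new files (only done once) and make a first commit
echo "# cleaning code" > clean.py
echo "# training code" > train.py
git add clean.py train.py
git commit -q -m "Add cleaning and training scripts"

# Now edit both files
echo "# drop duplicates" >> clean.py
echo "# try a bigger learning rate" >> train.py

# 2. Stage only the change in train.py
git add train.py
git diff --cached --name-only     # staged: what the next commit will contain
git diff --name-only              # modified but not staged

# 3. Commit what is in the staging area
git commit -q -m "Increase learning rate in train.py"

committed=$(git diff --name-only HEAD~1 HEAD)   # files changed by that commit
pending=$(git diff --name-only)                 # still waiting for a future commit
echo "committed: $committed"
echo "pending:   $pending"
```

If you stage something by mistake, `git restore --staged train.py` (Git 2.23 or newer) takes it back out of the staging area without touching the file.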

### 4. Hashes and tags

Apart from messages, all Git commits have hashes so you can point to them more easily.

![](images/8_hashes.png)

A hash is a string of 40 hexadecimal characters that gives each commit a unique ID, like `1a3b5c7d9e2f4a6b8c0d1e2f3a4b5c6d7e8f9a0b`. 

They make it easy to switch between commits (different versions of your code base) with `git checkout HASH`. You don't have to write the full hash when switching. The first few characters are enough, as long as they are unique within the repository. 

You can list all the commits you've made with their hashes using `git log` (this also shows the author, date and message of each commit).

To list only the hash and the message without cluttering up your screen, you can use `git log --oneline`.

![](images/9_log.png)
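
Here is a small sketch of working with hashes in a throwaway repository. The hashes on your machine will differ, so the example reads them with `git rev-parse` instead of hard-coding them. File names and messages are placeholders.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# cleaning code" > clean.py
git add clean.py
git commit -q -m "Add cleaning script"
echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"

full_hash=$(git rev-parse HEAD)             # full hash of the latest commit
short_hash=$(git rev-parse --short HEAD)    # abbreviated version of the same hash
first_hash=$(git rev-parse --short HEAD~1)  # abbreviated hash of the first commit
echo "full:  $full_hash"
echo "short: $short_hash"

git log --oneline                           # short hash + message per commit

# Jump back to the first commit using only the abbreviated hash
git checkout -q "$first_hash"
files_at_first=$(ls)                        # train.py did not exist yet
echo "files at first commit: $files_at_first"

git checkout -q -                           # return to the branch we came from
```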

If hashes intimidate you, there are also Git tags. A Git tag is a friendly nickname you can give to some important commits (or any) to remember and refer to them even more easily. 

![](images/10_tags.png)

You can give tags to commits where you wrote an important feature, released a new version of your code base like `v1.0.0`, or fine-tuned your best model like `random_forest_best`, using the command `git tag "tag_name"`.

Think of tags as little human-readable milestones that stand out among all the commit hashes.

> `git tag "tag_name"` only adds a tag to the latest commit (the one HEAD points to). To tag any other commit, provide its hash right after the tag name: `git tag "tag_name" HASH`.
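
A minimal sketch of both ways of tagging, in a throwaway repository. The tag names `v1.0.0` and `baseline_model` are just examples.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# baseline model" > train.py
git add train.py
git commit -q -m "Add baseline model"
echo "# tuned model" >> train.py
git commit -q -am "Tune the model"

git tag v1.0.0                              # tags the latest commit
first=$(git rev-parse HEAD~1)
git tag baseline_model "$first"             # tags an earlier commit by its hash

git tag                                     # list all tags
git log --oneline --decorate                # tags appear next to their commits

tagged_latest=$(git rev-parse v1.0.0)
tagged_first=$(git rev-parse baseline_model)
```

Once a tag exists, you can use it anywhere a hash is expected, for example `git checkout baseline_model`.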



### 5. Branch

After commits, branches are the bread and butter of Git. 99% of the time, you will be working inside a Git branch.

By default, the branch you are on when you initialize Git inside a folder will be named either `main` or `master`.

![](images/11_master.png)

You can think of other branches as alternate realities of your code base.

By creating a Git branch, you can test and experiment with new features, ideas and fixes without the fear that you will mess up your code base.

For example, you can test a new algorithm for a classification task in a new branch without disrupting the main code base:

![](images/12_sgd.png)

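Git branches are very cheap. When you call `git branch new_branch_name`, Git creates a new pointer to the commit you are currently on (a pseudo-copy of the current branch, usually master) without duplicating any of the files. Note that the command only creates the branch; you still have to move to it with `git checkout new_branch_name`.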

Then, after you tinker around with your new ideas in the new branch, you can delete it if the idea doesn't look promising or join the branch to master if otherwise.
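
Here is a sketch of the "experiment, then throw it away" path in a throwaway repository. The branch name `try_sgd` is hypothetical, and the default branch name is read from Git because it can be `main` or `master` depending on your setup.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"
main_branch=$(git symbolic-ref --short HEAD)    # main or master

git branch try_sgd                  # create the branch (we are still on main)
git checkout -q try_sgd             # move to the alternate reality
echo "# SGD classifier experiment" >> train.py
git commit -q -am "Try an SGD classifier"

git checkout -q "$main_branch"      # back to the main code base
main_content=$(cat train.py)        # the experiment never touched it
echo "train.py on $main_branch: $main_content"

# The idea didn't look promising: delete the branch.
# -D forces the deletion because the branch was never merged.
git branch -q -D try_sgd
remaining=$(git branch --list try_sgd)
```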

### 6. HEAD

A Git repository can have several branches and hundreds of commits. So you might raise the excellent question of "How does Git know which branch or commit you are at?".

And the answer is the __special pointer__ called HEAD that Git uses.

![](images/13_head.png)

Basically, HEAD is you. Wherever you are, HEAD follows you in Git. 99% of the time, HEAD will be pointing to the latest commit in the current branch.

If you make a new commit, HEAD will move on to that. If you switch to a new or an old branch, HEAD will switch to the latest commit in that branch.

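HEAD is useful when comparing changes in different commits to each other. For example, calling `git diff HEAD~1 HEAD` will compare the latest commit to the commit immediately before it (plain `git diff HEAD~1` compares your current working directory to that earlier commit instead). 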

This also means that `HEAD~n` syntax in Git refers to the _nth_ commit _before_ wherever the HEAD is.

![](images/14_headn.png)
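
A quick sketch of the `HEAD~n` syntax with three commits in a throwaway repository. The commit messages are placeholders. Note that `git diff HEAD~1 HEAD` compares two commits, while `git diff HEAD~1` on its own compares the working directory to the earlier commit.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"
echo "# tune learning rate" >> train.py
git commit -q -am "Tune learning rate"
echo "# add early stopping" >> train.py
git commit -q -am "Add early stopping"

git rev-parse --short HEAD      # the latest commit
git rev-parse --short HEAD~1    # one commit before it
git rev-parse --short HEAD~2    # two commits before it

git diff HEAD~1 HEAD            # what the latest commit changed

oldest_msg=$(git log -1 --format=%s HEAD~2)     # message of the oldest commit
changed=$(git diff --name-only HEAD~1 HEAD)     # files touched by the latest commit
echo "HEAD~2 is: $oldest_msg"
```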

You may also go into the dreaded __detached HEAD state__. This doesn't mean Git has lost track of you and doesn't know where to point.

Detached HEAD state happens when, instead of `git checkout branch_name`, you check out a specific commit with `git checkout HASH`, forcing the HEAD not to be at the tip of a branch but somewhere in the middle.

![](images/15_new_head.png)

Any commits you make in the detached HEAD state won't belong to any branch. Once you switch away, they become orphaned: hard to find again and eventually cleaned up by Git. The reason is that HEAD is, well, the head of branches. It strongly prefers to attach itself to branch tips or heads, not their stomachs or legs.

So, if you want to make changes in a detached HEAD state, you should call `git switch -c new_branch` to create a new branch at the current commit. This gets you out of the state and moves the HEAD.
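
Here is a sketch of entering and leaving the detached HEAD state in a throwaway repository. The branch name `rescue_branch` is made up, and `git switch` needs Git 2.23 or newer.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"
echo "# tune learning rate" >> train.py
git commit -q -am "Tune learning rate"

first=$(git rev-parse --short HEAD~1)
git checkout -q "$first"        # HEAD now points at a commit, not at a branch

# symbolic-ref fails while HEAD is detached, so it doubles as a check
if git symbolic-ref -q HEAD > /dev/null; then
    state="attached"
else
    state="detached"
fi
echo "HEAD is $state"

git switch -q -c rescue_branch  # create a branch right here and attach HEAD to it
echo "# alternative idea" >> train.py
git commit -q -am "Try an alternative idea"

current=$(git symbolic-ref --short HEAD)
echo "now on branch: $current"
```

Because the new commit sits on `rescue_branch`, it stays reachable after you switch to another branch.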

Getting the hang of the HEAD will go a long way in helping you navigate any tangled Git tree.

### 7. Merge

So, what happens after you create a new branch? 

Do you discard it with `git branch -D branch_name` if your experiment doesn't pan out (the lowercase `-d` refuses to delete branches that haven't been merged)? Or do you perform a fabled Git merge?

Basically, a Git merge is a fancy party where two or even more branches come together to create a single thicker branch.

![](images/16_merge.png)

When you merge branches, Git takes the code from each branch and combines them into a single cohesive code base.

If there are overlapping changes in the branches, e.g. both branches have changed lines 5-10 in `train.py`, Git raises a merge conflict.

A merge conflict is as nasty as it sounds. To resolve the conflict, you have to decide which branch's changes you want to keep.

Solving merge conflicts without swearing and boiling from the ears is a rare skill developed over time. So, I won't talk much about them and refer you to [this excellent article](https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts) from Atlassian.
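
To make this less abstract, here is a tiny merge conflict staged on purpose in a throwaway repository. Both branches edit the same line of a hypothetical `train.py`. The resolution shown (keeping the other branch's version wholesale with `git checkout --theirs`) is only one option; in real projects you would usually edit the conflicted file by hand.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "learning_rate = 0.1" > train.py
git add train.py
git commit -q -m "Add training config"
main_branch=$(git symbolic-ref --short HEAD)    # main or master

git checkout -q -b experiment               # create and switch in one step
echo "learning_rate = 0.01" > train.py
git commit -q -am "Lower the learning rate"

git checkout -q "$main_branch"
echo "learning_rate = 0.5" > train.py
git commit -q -am "Raise the learning rate"

# Both branches changed the same line, so the merge stops with a conflict
if git merge experiment > /dev/null 2>&1; then
    merge_result="clean"
else
    merge_result="conflict"
fi
echo "merge result: $merge_result"
git diff --name-only --diff-filter=U        # files with unresolved conflicts

# Resolve by keeping the experiment branch's version of the file
git checkout --theirs train.py
git add train.py
git commit -q -m "Merge experiment, keep its learning rate"

final_content=$(cat train.py)
echo "after the merge: $final_content"
```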

### 8. Stash

I tend to screw up a lot when coding. An idea strikes me; I try it out only to realize that it was rubbish.

In the beginning, I would foolishly erase the mess into oblivion but I would later regret it. Even though the idea was rubbish, it doesn't mean I couldn't use certain code blocks in the future.

Then, I discovered Git stashes and they quickly became my favorite Git feature. 

When you call `git stash`, Git automatically stashes or hides both staged and unstaged changes in the working directory. The files revert to the state they were in right after the last commit. 

![image.png](attachment:a3f6a2c2-7688-44ca-9131-ebc8fcfb8945.png)

After stashing, you can continue working on whatever you were working on. But whenever you want to reapply the changes saved in your stash, you can call `git stash apply` or `git stash pop` and they will magically reappear in the working directory. 

> Note that `git stash` doesn't hide untracked files. To stash both tracked and untracked files (ignored files still won't be included), you have to add the `-u` flag to the command.
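
A small sketch of stashing with and without `-u` in a throwaway repository. The file `notes.txt` is a made-up untracked file.

```shell
set -e
repo=$(mktemp -d)           # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"

echo "# half-baked idea" >> train.py        # change to a tracked file
echo "scratch notes" > notes.txt            # brand-new, untracked file

git stash                                   # hides the train.py change only
lines_after_stash=$(grep -c "" train.py)    # line count of the committed version
git status --short                          # notes.txt is still listed as ??

git stash pop                               # the half-baked idea is back

git stash -u                                # this time notes.txt is stashed too
status_after_u=$(git status --porcelain)    # empty: nothing left to report
git stash pop                               # both changes reappear
```

The difference between the two commands from earlier: `git stash pop` removes the entry from the stash after reapplying it, while `git stash apply` keeps it around.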

### 9. GitHub

So, we come to the age-old question - what is the difference between Git and GitHub?

This is like asking the difference between a burger and a cheeseburger.

Git is a version control system that tracks repositories. On the other hand, GitHub is a web-based platform used to store Git-controlled repositories online.

Git really shines when its repositories are made online and hence, open for collaboration. If a repository is only on your local machine, people can't work on it with you. 

So, think of GitHub as a remote mirror of your local repo that people can clone, fork, and open pull requests against.
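
The local/remote relationship can be sketched without a GitHub account. In the example below, a bare repository on disk stands in for the GitHub-hosted copy; that is an assumption made purely so the sketch runs offline. With GitHub, the path given to `git remote add` would be the repository's `https://` or `git@` URL instead.

```shell
set -e
workdir=$(mktemp -d)        # throwaway folder for the demo
cd "$workdir"

# A bare repository plays the role of the GitHub-hosted copy
git init -q --bare remote.git

# Your local repository
git init -q local
cd local
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "# training code" > train.py
git add train.py
git commit -q -m "Add training script"
branch=$(git symbolic-ref --short HEAD)     # main or master

git remote add origin "$workdir/remote.git" # register the remote under the name origin
git push -q origin "$branch"                # upload your commits

# A collaborator gets their own copy by cloning the remote
cd "$workdir"
git clone -q remote.git collaborator

local_head=$(git -C local rev-parse HEAD)
collab_head=$(git -C collaborator rev-parse HEAD)
echo "local:        $local_head"
echo "collaborator: $collab_head"
```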

And if these terms sound alien to you, stick around for my next article where I explain N (I don't know how many right now) GitHub concepts that will clear the confusion right away.

### Wrap