# Git For the Modern Data Scientist: 9 Git Concepts You Can't Ignore
![](images/galaxy.png)

### Introduction

Most data scientists feel like fish out of water when it comes to Git. There are software engineers who talk about nothing but Git things, and there are data scientists who say "Huh?" (I wish I could add a sound to this) every time.

That stops today! Since Git is an unignorable tool essential to collaboration, I will break down _nine_ of the most critical Git concepts data scientists must know like the back of their hand. 

I can promise that you won't be nodding your head in fake understanding next time someone talks about Git or version control.

Let's get started!

### For the 1000th time...

You may have heard it a few hundred times already, but I will err on the side of caution and say it for the few-hundred-and-first time:

> Git is one of the most critical tools in developing ML and AI systems.

![](images/git_proof.png)

If your idea of a machine learning or data science project involves models cooked up in notebooks with creatively named files such as "notebook1", "notebook2", "notebook_final", and "notebook_final_final", then don't bother with Git.

However, if you aim to deploy models that others can use without migraines, Git is a relatively small price to pay.

Git allows you to keep track of changes to your code and data, collaborate with others, and maintain a history of your project. With Git, you can easily revert to a previous version of your work, compare different versions, and merge changes made by multiple contributors.

Moreover, Git easily integrates with other popular MLOps tools like DVC for data version control, making it an essential tool for data scientists.

### 0. Repository

Basically, a repository is this:

![](images/1_repo.png)

It is a folder on your machine. It can have no files, three files, or a hundred. The only thing needed to convert that folder into a Git repository is to call `git init` inside it.

![](images/2_init.png)
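
To make this concrete, here is a minimal sketch you can paste into a terminal. It assumes Git is installed and runs in a throwaway temporary directory; the `data`, `models`, and `src` folder names are placeholders I made up for illustration.

```shell
# Create a throwaway folder that stands in for your project
project=$(mktemp -d)
cd "$project"

# A made-up layout for an ML project (folder names are placeholders)
mkdir data models src

# Turn the folder into a Git repository
git init

# The hidden .git folder now holds all of Git's internals
ls -a

# Ask Git what it sees in the new repository
git status
```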

A machine learning repository usually has folders to store data, models, and code for loading/cleaning/transforming data and/or for selecting/training/saving models for deployment.

There will also be miscellaneous files and folders, like the `.git` folder for Git internals, metadata files, etc.

All of these make up a single repo, and Git is usually enough to track them. The exceptions are data and models; for those, see [this article](https://towardsdatascience.com/how-to-version-gigabyte-sized-datasets-just-like-code-with-dvc-in-python-5197662e85bd) afterwards.

### 1. Tracked, untracked

When you initialize Git inside a directory, by default any existing or new files/directories you create will be untracked by Git. 

![](images/3_status.png)

This means Git won't record any future changes you make to them either. So, you have to put those files under Git supervision by calling `git add path/to/file.py`.


![](images/4_add.png)

After calling `git add` on files, they will be under Git-watch. 

If you wish to add all files in the repository (highly unlikely though), you can call `git add .`.

There are also cases where you never want files to be tracked by Git. This is when you create a `.gitignore` file.

Like the name suggests, files listed in `.gitignore` won't be tracked or indexed by Git for as long as they are listed there. (This only applies to untracked files; a file Git already tracks stays tracked even if you add it to `.gitignore` later.) Typical things to add to `.gitignore` in data projects are large data files like CSVs, Parquet files, images, video, or audio. Git is historically horrible at handling those.

It handles the rest like a champ.

P.S. You can create `.gitignore` in the terminal with `touch .gitignore` and append files/folders to it, one per line, with `echo "filename" >> .gitignore`.
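
Here is a small sketch of the whole tracked/untracked/ignored dance. It assumes Git is installed and works in a throwaway directory; `train.py` and `data/big_dataset.csv` are hypothetical file names used only for illustration.

```shell
# Throwaway repository
repo=$(mktemp -d)
cd "$repo"
git init -q

# Two hypothetical files: a script we want tracked and a dataset we don't
echo 'print("training...")' > train.py
mkdir data
echo "id,label" > data/big_dataset.csv

# Both show up as untracked at this point
git status --short

# Tell Git to ignore everything under data/
echo "data/" >> .gitignore

# Put the script (and the ignore file itself) under Git's watch
git add train.py .gitignore

# train.py and .gitignore are now staged; data/ is no longer listed
git status --short
```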

### 2. Commit

- A git commit is a precious, precious thing
- When you call `git commit` in the terminal, you are taking a snapshot of your code base
- It will be saved as a time capsule of your code base, frozen at a specific point in time

![image.png](attachment:e720370f-4ade-403f-93ee-2ed3f27dce5c.png)

- All the commits you have saved form your Git history, or Git tree
- Each commit also requires a message that says what the capsule contains, i.e. what the new changes are
- Commits are also a way to organize the linear progression of your repository

![image.png](attachment:23f49ed8-fef2-4a6d-a2ba-24afba21c77d.png)

- By breaking your code changes into discrete, well-defined commits with proper messages, you can map out the progress of your project almost like a book
- Then, you can browse through the pages of this git book
- Just like a writer puts real effort into writing each page of a book, you should treat your commits with care
- You shouldn't be making them for the sake of committing
- Consider them little pieces of history, and know that future versions of yourself and other developers should look at them with delight, rather than disgust
- Some common scenarios to commit in a typical ML project:
    - Implementing a new feature: writing code that adds a new functionality, like a new function (tested), method or class, training a new model, new data cleaning operation, etc.
    - Fixing a bug: documenting bug fixes to existing functions, methods and classes
    - Improving performance: writing code that enhances an existing feature like optimizing blocks of code
    - Updating docs and dependencies
    - Machine learning experiments: in a project, you will run dozens of experiments to choose and tune the best model. Each model run should be tracked as a commit
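
As a rough sketch, two small, well-described commits could look like this. It assumes Git is installed and runs in a throwaway directory; the file names, the commit messages, and the "Demo User" identity are all made up for illustration.

```shell
# Throwaway repository with a hypothetical identity (needed to commit)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# First capsule: a new data cleaning function
echo 'def clean(df): return df.dropna()' > clean.py
git add clean.py
git commit -q -m "Add clean() to drop rows with missing values"

# Second capsule: a training stub
echo 'def train(df): ...' > train.py
git add train.py
git commit -q -m "Add train() stub for the baseline model"

# Browse the pages of the git book, newest first
git log --oneline
```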

### 3. Staging area

- By talking about commits, we have got ahead of ourselves
- Before closing the cap of the commit capsule, you should be deliberate about what goes inside it
- This involves telling git exactly which changes from which files you want to commit
- You may have changed several files but want to commit the changes from only some of them
- This is where we lift the curtains and reveal the staging area

![image.png](attachment:61701c49-8702-4e62-b1ad-06af2902e223.png)

- The staging area in Git holds the changes you want to include in the next commit
- Whenever you modify a tracked file, you record the change in the staging area by calling `git add path/to/another_file.py` so that it is included in the next commit
- So, the commit workflow is this:
    1. Track new files with `git add` (only done once per file)
    2. Add changes in tracked files to the staging area with `git add changed_file.txt`
    3. Commit the changes in the staging area to history with `git commit -m "Commit message"`
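
The workflow above can be sketched as follows. This is an illustration under assumptions, not a recipe: it needs Git installed, runs in a throwaway directory, and `config.py`, `train.py`, and the "Demo User" identity are hypothetical.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Step 1: track two new files and commit them
echo "alpha = 0.1" > config.py
echo "print('train')" > train.py
git add config.py train.py
git commit -q -m "Add config and training script"

# Now edit both files
echo "alpha = 0.5" > config.py
echo "print('train v2')" > train.py

# Step 2: stage only one of the two changes
git add config.py

git diff --staged --name-only   # what WILL go into the next commit
git diff --name-only            # what is changed but NOT staged

# Step 3: commit what is in the staging area
git commit -q -m "Increase alpha to 0.5"

# train.py is still modified, waiting for a later commit
git status --short
```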

### 4. Hashes and tags

- All commits in git have hashes so you can point to them more easily

![image.png](attachment:680ca781-31de-48fb-ba31-f8922bf53c7e.png)

- A hash is a string of 40 hexadecimal characters that gives each commit a unique ID, like `1a3b5c7d9e2f4a6b8c0d1e2f3a4b5c6d7e8f9a0b`
- Hashes make it easy to switch between commits (different versions of your codebase) with `git checkout HASH`.
- Note that you only have to provide the first few characters of the hash (5-10 is usually plenty, as long as they are unique within the repo), not all 40
- You can list all your commits with their hashes with `git log` (this also shows the commit author, date, and message)
- Or list only their messages next to abbreviated hashes (typically the first 7 characters) with `git log --oneline`

![image.png](attachment:5ef5bbec-ad6b-4e39-8148-a9cc024f10fe.png)

- If hashes intimidate you, there are also git tags
- A git tag is a friendly nickname you can give to some important commits to remember and refer to them more easily

![image.png](attachment:62e99e8b-e9d3-4319-9ce1-24ac1e81922c.png)

- You can give tags to commits where you wrote an important feature, released a new version of your code base like `v1.0.0`, or fine-tuned your best model like `random_forest_best`
- Think of tags as little milestones that stand out among all your commit IDs.
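
Here is a minimal sketch of hashes and tags working together. It assumes Git is installed and runs in a throwaway directory; `model.txt`, the commit messages, and the "Demo User" identity are invented for the example, while the tag names are the ones mentioned above.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Two commits, standing in for two versions of a model
echo "baseline" > model.txt
git add model.txt
git commit -q -m "Train baseline model"

echo "tuned" > model.txt
git commit -q -am "Tune random forest"

# Abbreviated hashes next to commit messages
git log --oneline

# Grab the first commit's hash in full and in short form
first_hash=$(git rev-parse HEAD~1)
short_hash=$(git rev-parse --short HEAD~1)
echo "$first_hash"

# Tag the older commit via its short hash, and the current commit by default
git tag v1.0.0 "$short_hash"
git tag random_forest_best

git tag --list

# Travel back to the tagged version, look around, then return
git checkout -q v1.0.0
cat model.txt
git checkout -q -
```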


### 5. Branch

- Branches are the bread and butter of git, after commits 
- 99% of the time, you will be working in a git branch
- Mostly, it will be the `main` or `master` branch

![image.png](attachment:e3b81604-5958-4314-837a-2f1c458030fc.png)

- You can think of them as alternate realities for your code base
- By creating a git branch, you can test and experiment with new features, ideas and fixes without the fear that you will mess up your code base
- For example, you can test a new algorithm for a classification vision task in a new branch without disrupting the main code base
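
![image.png](attachment:328005b9-dc70-4564-ade6-1a05c0a27f9b.png)

- Branches are also very cheap
- When you call `git branch new_branch_name`, git doesn't copy any files; it only creates a new pointer to the current commit, so the branch behaves like a full copy of the repo at almost no cost
- Note that `git branch` only creates the branch; `git switch new_branch_name` moves you onto it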
- Then, after you tinker around with your new ideas in a branch, you can either delete the branch if the idea doesn't look promising or merge it to the main development branch
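
A quick sketch of an experiment living in its own alternate reality. It assumes a reasonably recent Git (for `git switch` and `git branch --show-current`) and runs in a throwaway directory; the branch name `try_random_forest`, the file contents, and the "Demo User" identity are hypothetical.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "model = 'logistic_regression'" > train.py
git add train.py
git commit -q -m "Add baseline training script"

# Remember the default branch name ("main" or "master", depending on your setup)
main_branch=$(git branch --show-current)

# Create a new branch and switch to it in one go
git switch -q -c try_random_forest
echo "model = 'random_forest'" > train.py
git commit -q -am "Try a random forest"

# Back on the main branch, the experiment is nowhere to be seen
git switch -q "$main_branch"
cat train.py
git branch

# The idea didn't look promising: delete the unmerged branch (-D forces it)
git branch -D try_random_forest
```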

### 6. HEAD

- How does git know which branch or commit you are at?
- It uses a special pointer called HEAD
- HEAD is basically you
- Wherever you are, HEAD follows you in git

![image.png](attachment:3c77d9e7-7d1b-4818-bc72-afc5b131e57d.png)

- 99% of the time, HEAD will be pointing to the latest commit in the current branch
- If you make a new commit, HEAD will move on to that
- If you switch to a new branch or an old branch, HEAD will switch to the last commit in that branch
- HEAD is useful when comparing the changes in different commits to each other
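- For example, calling `git diff HEAD~1 HEAD` will compare the latest commit to the commit immediately before it (plain `git diff HEAD~1` compares your current working directory, uncommitted changes included, to that earlier commit)
- This means `HEAD~n` refers to the nth commit _before_ wherever HEAD is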
- There is also the detached HEAD state

![image.png](attachment:308d8eb3-aa73-4379-a7a6-016ffab85c25.png)

- This happens when, instead of switching to a branch with `git checkout branch_name`, you switch to a commit with `git checkout commit_hash`

![image.png](attachment:a994c98b-0e6d-4853-9963-8a2bad500f5f.png)

- Any commits you make in detached HEAD state won't belong to any branch, so they are easy to lose once you switch away
- The reason is that HEAD is, well, the head of branches: it normally attaches itself to branch tips (heads), not to their stomachs or legs
- So, if you want to keep changes made in detached HEAD state, call `git switch -c new_branch` to create a new branch at the current commit. This gets you out of the state and attaches HEAD to the new branch.
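
Here is a small sketch of HEAD moving around, including a trip into detached HEAD state and back out. It assumes a recent Git and a throwaway directory; `notes.txt`, the branch name `from_version_2`, and the "Demo User" identity are made up.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Three commits so there is some history to walk through
for i in 1 2 3; do
  echo "version $i" > notes.txt
  git add notes.txt
  git commit -q -m "Version $i"
done

# HEAD sits on the latest commit
git log --oneline

# What changed between the previous commit and the latest one
git diff HEAD~1 HEAD

# The message of the commit two steps before HEAD
git show -s --format=%s HEAD~2

# Check out a commit instead of a branch: detached HEAD
git checkout -q HEAD~1
git status

# Create a branch right here to re-attach HEAD
git switch -q -c from_version_2
git branch --show-current
```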


### 7. Merge

- Like I mentioned, a git merge is a fancy party where two or even more branches come together to create a single thicker branch

![image.png](attachment:cdc444b1-def7-4f85-adc9-28b679d8548e.png)

- When you merge branches, git takes the code from each branch and combines them into a single cohesive codebase.
- If there are overlapping changes in the branches, for example when both branches change lines 5-10 in `train.py`, git raises a merge conflict
- Basically, git asks you to decide which changes from the two versions of the file you want to keep
- Solving merge conflicts without swearing and boiling from the ears is a rare skill developed over time. So, I won't talk much about them and will instead refer you to [this excellent article](https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts).
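
If you want to see a conflict in a safe place first, here is a deliberately tiny sketch. It assumes a recent Git and runs in a throwaway directory; the `tune_lr` branch, the contents of `train.py`, and the "Demo User" identity are hypothetical, and the resolution shown (overwriting the file with the version you want) is just one simple way to do it.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'lr = 0.1\nepochs = 10\n' > train.py
git add train.py
git commit -q -m "Add training config"
main_branch=$(git branch --show-current)

# One branch lowers the learning rate...
git switch -q -c tune_lr
printf 'lr = 0.01\nepochs = 10\n' > train.py
git commit -q -am "Lower learning rate"

# ...while the main branch changes the very same line
git switch -q "$main_branch"
printf 'lr = 0.5\nepochs = 10\n' > train.py
git commit -q -am "Raise learning rate"

# Overlapping edits: git stops and reports a conflict
git merge tune_lr || echo "Merge conflict detected"

# The file now contains both versions between conflict markers
cat train.py

# Decide what to keep, then stage and commit to finish the merge
printf 'lr = 0.01\nepochs = 10\n' > train.py
git add train.py
git commit -q -m "Merge tune_lr, keep the lower learning rate"

git log --oneline --graph
```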

### 8. Stash

- One of my favorite features of Git is stashes
- You can think of git stashes as hiding places for your dirty laundry
- When you call `git stash`, it automatically stashes (hides) both staged and unstaged changes in your working directory, and everything reverts to the last commit's version
- Note that `git stash` doesn't hide untracked files. For that, you have to include the `-u` flag.
- Then, in any branch or anywhere down the git tree, you can call `git stash apply` or `git stash pop` to apply the stashed changes to the working directory (`pop` also removes them from the stash afterwards)
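
A minimal sketch of hiding and retrieving your dirty laundry. It assumes Git is installed and runs in a throwaway directory; `train.py`, `notes.txt`, and the "Demo User" identity are invented for the example.

```shell
# Throwaway repository with a hypothetical identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "clean version" > train.py
git add train.py
git commit -q -m "Add training script"

# Make a mess: one change to a tracked file, one brand-new untracked file
echo "half-finished idea" >> train.py
echo "scratch" > notes.txt

# Hide everything; -u includes the untracked file as well
git stash -u

# The working directory matches the last commit again
git status --short
git stash list

# Bring the changes back and remove them from the stash
git stash pop
git status --short
```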

### 9. GitHub

- So, the age-old question: what is the difference between Git and GitHub?
- This is like asking about the difference between a burger and a cheeseburger
- Git is a version control system that holds a repository together
- GitHub on the other hand is a web-based platform used to store repositories controlled by Git
- By storing your git repos on such platforms, you make them open to collaboration with others
- While Git is still immensely useful even when you work on repos on your own, it really shines when the repo is open for collaboration
- If your repo is only on your local machine, people can't collaborate on it
- So, think of GitHub as a remote mirror of your local repo that people can clone, fork, and open pull requests against
- And if these terms sound alien to you, stick around for my next article where I explain 9 GitHub concepts that will clear the confusion right away
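
To get a feel for the "remote mirror" idea without an account or an internet connection, here is a sketch that uses a bare repository in a local folder as a stand-in for GitHub. That substitution is my assumption for the demo: with the real thing, the remote would be the URL of your GitHub repo. The folder names and the "Demo User" identity are hypothetical.

```shell
# Throwaway workspace
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a repo hosted on GitHub: a bare repository in a local folder
git init -q --bare remote.git

# Your local repository
git init -q local
cd local
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# My ML project" > README.md
git add README.md
git commit -q -m "Add README"

# Point the local repo at the remote and upload the current branch
git remote add origin "$workdir/remote.git"
git push -q -u origin HEAD

# A collaborator can now grab their own copy
cd "$workdir"
git clone -q remote.git collaborator
ls collaborator
```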

### Wrap