# Git For the Modern Data Scientist: 9 Git Concepts You Can't Ignore
![](images/galaxy.png)

### Introduction

Most data scientists feel like fish out of water when it comes to Git. There are software engineers who talk about nothing but Git-things and there are data scientists who say "Huh?" (I wish I could add a sound to this) every time. 

That stops today! Since Git is an unignorable tool essential to collaboration, I will break down _nine_ of the most critical Git concepts data scientists must know like the back of their hand. 

I can promise that you won't be nodding your head in fake understanding next time someone talks about Git or version control.

Let's get started!

### For the 1000th time...

- You may have heard it a few hundred times already
- But I will err on the side of caution and reiterate
- Git is a critical tool to developing machine learning and AI systems
- If your idea of a machine learning or data science project is a model you produced after tinkering around in a notebook, then Git is not strictly necessary

- But if you want to deploy models that others can actually use, Git is a must-know tool

![image.png](attachment:18cc846f-1dd6-42dc-b2c8-c5fee8c6d3a1.png)

- You will have lots of code changes, data changes, and so on.
- You don't want notebook1, notebook2, notebook_final, notebook_final_finals littering your environment
- Git is a very small price to pay to alleviate all these problems and save you hours of headache and manual work
- It helps to distinguish two things: machine learning models and machine learning systems
- A model is typically a single file with weights and hyperparameters saved for later
- A machine learning system can involve dozens of tiny moving components, which together produce a single product like ChatGPT

### 0. Repository

![image.png](attachment:4ba1ada8-780d-4032-820c-11ecc263f0d6.png)

- A repository is just a folder on your machine that Git keeps track of, regardless of whether it is empty or contains three files or a hundred
- A typical machine learning repository usually contains directories to store data/models, code for loading/cleaning/transforming data, selecting/training/saving models and deployment.
![image.png](attachment:9a99a6f7-c7c0-45c6-895e-3a7369ec9a4f.png)
- There will be other miscellaneous files and folders, like the `.git` folder itself for Git internals, metadata files, etc.
- All these combined make up a single repository. 
- For the purposes of this article, our sample repository will have two files and a single directory
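
To make this concrete, here is a minimal sketch of such a sample repository. It runs in a throwaway directory created with `mktemp -d`, and the names `sample_repo`, `data`, `train.py` and `README.md` are made up for illustration.

```shell
# Work in a scratch location so the sketch doesn't touch a real project (assumption)
cd "$(mktemp -d)"
mkdir sample_repo && cd sample_repo

# Turn the folder into a Git repository; this creates the hidden .git folder
git init -q

# Two files and a single directory, with hypothetical names
mkdir data
touch train.py README.md

# List everything, including the hidden .git folder
ls -A
```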

### 1. Tracked, untracked

- When you initialize Git inside a directory, by default any existing or new files/directories you create will be untracked by Git

![image.png](attachment:33e54596-a124-42fb-a31b-6e43a686fc81.png)

- This means any future changes you make to them will be untracked as well
- So, you have to put those files under git supervision by calling `git add path/to/file.py`

![image.png](attachment:694f440b-6629-4d8b-9553-fba77366ed7d.png)

- If you wish to add everything inside the repo to git, call `git add .`
- In cases where you want to hide files from Git tracking, you have to create a file called `.gitignore`
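- Like the name suggests, files listed in `.gitignore` won't be indexed or tracked by Git for as long as they are in there. Files that Git already tracks are not affected; you would have to untrack them first with `git rm --cached path/to/file.py`
- You can create the file with `touch .gitignore`
- And add files to it on new lines with `echo "file.py" >> .gitignore`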
- Or you can use your IDE if you want to look less cool
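
Here is a small sketch of how tracking and ignoring play out on the terminal. It assumes a scratch repository, and `train.py` and `secrets.env` are hypothetical file names.

```shell
cd "$(mktemp -d)" && git init -q      # scratch repo (assumption)
touch train.py secrets.env            # hypothetical file names

git status --short                    # both files are listed with "??" (untracked)

git add train.py                      # put train.py under Git's supervision
echo "secrets.env" >> .gitignore      # ask Git to ignore secrets.env

# train.py now shows as "A" (added); secrets.env no longer appears at all.
# .gitignore itself is a new untracked file until you add it too.
git status --short
```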

### 2. Commit

- A git commit is a precious precious thing
- When you call `git commit` on the terminal, you are taking a snapshot of your code base
- It will be saved as a time capsule of your code base frozen at a specific point in time

![image.png](attachment:1b3bf9c4-79c6-48e7-9eb6-4b90d96ada24.png)

- All the commits you have saved will form your Git history or Git tree.
- Each commit also requires a message that says what the capsule contains, i.e. what the new changes are
- Commits are also a way to organize the progression of your repository

![image.png](attachment:23f49ed8-fef2-4a6d-a2ba-24afba21c77d.png)

- By breaking your code changes into discrete, well-defined commits with proper messages, you can map out the progress of your project almost like a book
- Then, you can browse through the pages of this git book
- Just like a writer puts real effort into writing each page of a book, you should treat your commits with care
- You shouldn't be making them for the sake of committing
- Consider them little pieces of history, and know that your future self and other developers should look at them with delight, rather than disgust
- Some common occasions to commit in a typical ML project:
    - Implementing a new feature: writing code that adds a new functionality, like a new function (tested), method or class, training a new model, new data cleaning operation, etc.
    - Fixing a bug: documenting bug fixes to existing functions, methods and classes
    - Improving performance: writing code that enhances an existing feature like optimizing blocks of code
    - Updating docs and dependencies
    - Machine learning experiments: in a project, you will run dozens of experiments to choose and tune the best model. Each model run should be tracked as a commit

### 3. Staging area

- By talking about commits, we have got ahead of ourselves
- Before closing the cap of the commit capsule, you should take deliberate action when filling its contents
- This involves telling git exactly which changes from which files you want to commit
- You might have changed several files, but you may want to commit the changes from only some of them
- This is where we lift the curtains and reveal the staging area
![image.png](attachment:b08502f8-9b4d-48e7-9151-6cc8f01da106.png)

- The staging area in Git holds the changes you want to include in the next commit.
- When you modify a tracked file, you record the change in the staging area by calling `git add path/to/another_file.py` so that it will be included in the next commit
- So, the commit workflow is this:
    1. Track new files with git (only done once)
    2. Add changes in tracked files to the staging area with `git add changed_file.txt`
    3. Commit the changes in the staging area to history with `git commit -m "Commit message"`
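
The workflow above can be sketched roughly like this. It is only an illustration in a scratch repository: `load.py` and `train.py` are hypothetical files, and the name and email are placeholders so that committing works on a machine without a Git identity.

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

echo "print('load data')" > load.py            # hypothetical files
echo "print('train model')" > train.py

# 1. Track the new files and commit them
git add load.py train.py
git commit -q -m "Add data loading and training scripts"

# 2. Change both files, but stage only one of them
echo "print('clean data')" >> load.py
echo "print('tune model')" >> train.py
git add load.py

git status --short     # load.py is staged ("M" in the first column), train.py is not

# 3. Commit what is in the staging area; the train.py change stays behind
git commit -q -m "Add data cleaning step"
git status --short     # only train.py is still listed as modified
```

Staging only `load.py` keeps the commit focused on one change, which is the whole point of having a staging area.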

### 4. Hashes and tags

- All commits in git have hashes so you can point to them more easily

![image.png](attachment:680ca781-31de-48fb-ba31-f8922bf53c7e.png)

- A hash is a string of 40 hexadecimal characters (digits 0-9 and letters a-f) that gives each commit a unique ID, like `1a3b5c7d9e2f4a6b8c0d1e2f3a4b5c6d7e8f9a0b`
- They make it easier to switch between commits (different versions of your codebase) with `git checkout HASH`. 
- Note that you only have to provide the first few characters of the hash (at least four, and enough to be unambiguous; 7-10 is common), not all 40 characters
- You can list all your commits with their hashes with `git log` (this also shows the commit author, date and message)
- Or list only their messages next to abbreviated hashes (7 characters by default, more if needed to stay unique) with `git log --oneline`

![image.png](attachment:3d1e252f-77c3-4701-8a22-a7b196701fb5.png)

- If hashes intimidate you, there are also git tags
- A git tag is a friendly nickname you can give to some important commits to remember and refer to them more easily

![image.png](attachment:62e99e8b-e9d3-4319-9ce1-24ac1e81922c.png)

- You can give tags (with `git tag tag_name`) to commits where you wrote an important feature, released a new version of your code base like `v1.0.0` or fine-tuned your best model like `random_forest_best`
- Think of tags as little milestones that stand out among all your commit IDs.
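
As a quick sketch of hashes and tags in action, the snippet below makes two commits in a scratch repository and tags the latest one. The file `model.txt`, the commit messages and the identity are placeholders; your hashes will differ from anyone else's.

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

echo "v1" > model.txt                          # hypothetical file
git add model.txt
git commit -q -m "Train baseline model"
echo "v2" > model.txt
git commit -q -am "Tune random forest"         # -a stages changes to tracked files

git log --oneline                              # one line per commit: short hash + message

# Give the latest commit a friendly name
git tag random_forest_best

# The tag and HEAD resolve to the same commit hash
git rev-parse --short HEAD
git rev-parse --short random_forest_best
```

From here, `git checkout random_forest_best` would work just like `git checkout HASH`.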


### 5. Branch

- Branches are the bread and butter of git, after commits 
- 99% of the time, you will be working in a git branch
- Mostly, it will be the `main` or `master` branch

![image.png](attachment:e3b81604-5958-4314-837a-2f1c458030fc.png)

- You can think of them as alternate realities for your code base
- By creating a git branch, you can test and experiment with new features, ideas and fixes without the fear that you will mess up your code base
- For example, you can test a new algorithm for a classification vision task in a new branch without disrupting the main code base
![image.png](attachment:328005b9-dc70-4564-ade6-1a05c0a27f9b.png)
- Branches are also very cheap
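- When you call `git branch new_branch_name`, Git doesn't copy any files; it only creates a new lightweight pointer to the current commit. Note that this creates the branch without moving you onto it; for that, call `git switch new_branch_name`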
- Then, after you tinker around with your new ideas in a branch, you can either delete the branch if the idea doesn't look promising or merge it to the main development branch
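
Here is a rough sketch of that experiment-then-discard loop. It assumes a scratch repository with a hypothetical `train.py`, a placeholder identity, and a branch name (`new_algorithm`) made up for the example.

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

echo "baseline" > train.py
git add train.py
git commit -q -m "Add baseline training script"
git branch -M main                             # make sure the starting branch is called main

git branch new_algorithm                       # create the branch (just a new pointer)
git switch -q new_algorithm                    # move onto it
echo "experimental" >> train.py
git commit -q -am "Try a new algorithm"

git switch -q main                             # back on main...
cat train.py                                   # ...where train.py is untouched

# The idea didn't pan out: delete the branch (-D forces deletion of unmerged work)
git branch -D new_algorithm
git branch --list
```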

### 6. HEAD

- How does git know which branch or commit you are at?
- It uses a special pointer called HEAD
- HEAD is basically you
- Wherever you are, HEAD follows you in git

![image.png](attachment:3c77d9e7-7d1b-4818-bc72-afc5b131e57d.png)

- 99% of the time, HEAD will be pointing to the latest commit in the current branch
- If you make a new commit, HEAD will move on to that
- If you switch to a new branch or an old branch, HEAD will switch to the last commit in that branch
- HEAD is useful when comparing the changes in different commits to each other
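- For example, calling `git diff HEAD~1 HEAD` will compare the latest commit to the commit immediately before it (plain `git diff HEAD~1` compares your current working directory to that earlier commit instead)
- This means `HEAD~n` refers to the nth commit _before_ wherever HEAD is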
- There is also the detached HEAD state

![image.png](attachment:308d8eb3-aa73-4379-a7a6-016ffab85c25.png)

- This happens when, instead of switching to a branch with `git checkout branch_name`, you switch to a commit with `git checkout commit_hash`

![image.png](attachment:a994c98b-0e6d-4853-9963-8a2bad500f5f.png)

- Any commits you make in detached HEAD state won't belong to any branch. Once you switch away, they become hard to find and may eventually be cleaned up by Git's garbage collection
- The reason is that HEAD is, well, the head of branches: it normally attaches itself to branch tips or heads, not to a commit somewhere in a branch's stomach or legs.
- So, if you want to keep changes made in detached HEAD, you should call `git switch -c new_branch` to create a new branch at the current commit. This attaches HEAD to the new branch and gets you out of the detached state. 
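
The following sketch walks through HEAD in a scratch repository with three commits. The file `notes.txt`, the branch name `experiment_from_first` and the identity are placeholders chosen for the example.

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

# Three commits, each overwriting the same hypothetical file
for i in 1 2 3; do
  echo "version $i" > notes.txt
  git add notes.txt
  git commit -q -m "Commit $i"
done

git diff HEAD~1 HEAD           # what changed between the previous commit and the latest

# Check out an older commit directly: HEAD is now detached (not on any branch)
git checkout -q HEAD~2
cat notes.txt                  # the file as it was in the first commit

# Create a branch right here so that any new commits have a home
git switch -q -c experiment_from_first
git branch --show-current
```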


### 7. Merge

- Like I mentioned, a git merge is a fancy party where two or even more branches come together to create a single thicker branch

![image.png](attachment:cdc444b1-def7-4f85-adc9-28b679d8548e.png)

- When you merge branches, git takes the changes from each branch and combines them into a single cohesive codebase.
- If there are overlapping changes in the branches, for example when both branches change lines 5-10 in `train.py`, git raises a merge conflict
- Basically, git asks you to decide which changes from the two versions of the file you want to keep
- Solving merge conflicts without swearing and boiling from the ears is a rare skill developed over time. So, I won't talk much about them and will instead refer you to this excellent article:

https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts
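
If you want to see a conflict in a safe place first, here is a minimal sketch that provokes one on purpose and then resolves it. Everything in it is made up for illustration: the scratch repository, the one-line `train.py`, the `tune_lr` branch and the placeholder identity.

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

echo "lr = 0.1" > train.py
git add train.py
git commit -q -m "Add training config"
git branch -M main

git switch -q -c tune_lr                       # branch 1 changes the line...
echo "lr = 0.01" > train.py
git commit -q -am "Lower learning rate"

git switch -q main                             # ...and main changes the same line
echo "lr = 0.5" > train.py
git commit -q -am "Raise learning rate"

# Both branches touched the same line, so Git stops and reports a conflict
git merge tune_lr || echo "Merge conflict: resolve it by hand"

cat train.py           # both versions, wrapped in <<<<<<<, ======= and >>>>>>> markers

# Resolve by writing the version you want to keep, then stage and commit
echo "lr = 0.01" > train.py
git add train.py
git commit -q -m "Merge tune_lr, keep the lower learning rate"
git log --oneline --graph
```

In a real project you would usually edit the file between the markers rather than overwrite it, but the final `git add` and `git commit` steps are the same.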

### 8. Stash

- One of my favorite features of Git is stashes
- You can think of git stashes as hiding places for your dirty laundry
- When you call `git stash`, it automatically stashes or hides both staged and unstaged changes in your current working directory
- Everything then reverts to the last commit's version
- Note that git stash doesn't hide untracked files. For that, you would have to include the `-u` flag. 
- Then, in any branch or anywhere down the git tree, you can call `git stash apply` (which keeps the stash around) or `git stash pop` (which applies the stash and then removes it) to bring the stashed changes back to the working directory
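
A short sketch of a stash round trip, assuming a scratch repository with one tracked file (`train.py`) and one untracked file (`notes.txt`), both hypothetical, plus a placeholder identity:

```shell
cd "$(mktemp -d)" && git init -q               # scratch repo (assumption)
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"

echo "baseline" > train.py
git add train.py
git commit -q -m "Add training script"

echo "half-finished idea" >> train.py          # unstaged change to a tracked file
echo "scratch" > notes.txt                     # untracked file

git stash -u -q                                # -u also stashes untracked files
git status --short                             # prints nothing: the working directory is clean

git stash pop -q                               # bring the changes back and drop the stash entry
git status --short                             # train.py is modified and notes.txt is back
```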

### 9. GitHub

- So the age-old question, what is the difference between Git and GitHub?
- This is like asking the difference between a burger and a cheeseburger
- Git is a version control system that holds a repository together
- GitHub on the other hand is a web-based platform used to store repositories controlled by Git
- By storing your git repos on such platforms, you make them open to collaboration with others
- While Git is still immensely useful even when you work on repos on your own, it really shines when the repo is open for collaboration
- If your repo is only on your local machine, people can't collaborate on it
- So, think of GitHub as a remote mirror of your local repo that people can clone, fork, and open pull requests against
- And if these terms sound alien to you, stick around for my next article where I explain 9 GitHub concepts that will clear the confusion right away
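
To get a feel for the local/remote split without needing a GitHub account, here is a sketch that uses a bare repository on disk as a stand-in for the remote. That stand-in is an assumption of the example; with GitHub you would pass the repository's URL to `git remote add` instead of a local path. The directory names and identity are placeholders.

```shell
cd "$(mktemp -d)"                              # scratch location (assumption)
git init -q --bare remote.git                  # stand-in for a repo hosted on GitHub
remote_path="$(pwd)/remote.git"

git init -q local && cd local                  # your local repository
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"
echo "# My project" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main

git remote add origin "$remote_path"           # on GitHub this would be the repo's URL
git push -q -u origin main                     # upload the local history to the remote

cd ..
git clone -q -b main remote.git collaborator_copy   # what a collaborator would do
ls collaborator_copy
```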

### Wrap