# SLU03 - Git Basics

In this notebook we will be covering the following:

* [1. (Painless) Introduction to Version Control](#1_Painless_Introduction_to_Version_Control)
    - [1.1 So what is a Version Control System?](#1_1_So_what_is_a_Version_Control_System)
    - [1.2 Collaboration](#1_2_Collaboration)
    - [1.3 Storing versions properly](#1_3_Storing_versions_properly)
    - [1.4 Auditability](#1_4_Auditability)
* [2. Repositories: where it all begins](#2_Repositories_where_it_all_begins)
    - [2.1 Creating a repository](#2_1_Creating_a_repository)
    - [2.2 README](#2_2_README)
    - [2.3 .gitignore](#2_3_gitignore)
    - [2.4 Creating a repository not hosted on GitHub servers](#2_4_Creating_a_repository_not_hosted_on_GitHub_servers)
* [3. Working with Git: the basics](#3_Working_with_Git_the_basics)
    - [3.1 Main Git commands](#3_1_Main_Git_commands)
        - [git add](#git_add)
        - [git commit](#git_commit)
        - [git push](#git_push)
        - [git pull](#git_pull)
        - [git status](#git_status)
        - [git log](#git_log)
* [4. Summary workflow](#4_Summary_workflow)
* [5. GitHub and other Version Control Systems](#5_GitHub_and_other_Version_Control_Systems)
* [Useful links](#Useful_links)

We will be using the command line to interact with Git, since that is where you can run every Git command. There are also GUI clients for Git, of varying capabilities.

## <a name="1_Painless_Introduction_to_Version_Control"></a>1. (Painless) Introduction to Version Control

**Imagine this scenario with me...**

It's two AM and you've just finished your first programming project. Everything works, your "*hello worlds*" are all very hello'y, your *1*s and *0*s are all very *True* and *False*, and there's not a *ZeroDivisionError* to be seen in the land. Everything is wonderful and you go to sleep as a happy programmer, albeit a tired one... 

You wake up the next morning, relaxed and happy with no undereye bags to be seen on your lovely face. Your amazing programming project is due that afternoon and you go check if everything is running smoothly... but something terrible happened! All your beautiful code has been replaced by cat emojis! You look suspiciously at Whiskers who you once considered your four-legged furry best friend. You haven't given him as much attention as you should have lately, with all the late-night coding... And he just did you good. **Never trust a cat.**

If only there was a way to "*go back in time*" and save all your precious work... 

**With version control there is!** 

*And all is good in the coding land!*

### <a name="1_1_So_what_is_a_Version_Control_System"></a>1.1 So what is a Version Control System?

In short, a [Version Control System (VCS)](https://en.wikipedia.org/wiki/Version_control) records changes to a file or set of files over time so that you can recall specific versions later.

It allows you to revert selected files to a previous state, revert the entire project to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.

There are a lot of hosting services for version-controlled projects, such as [GitHub](https://github.com/), [GitLab](https://about.gitlab.com/) and [Bitbucket](https://bitbucket.org/), and all of them rely on [Git](https://git-scm.com/), the VCS that powers these services and records all the changes. FYI: Git is free and open-source, meaning that there are no barriers to taking advantage of it. :)

You may have noticed that all of our [Prep Course Material](https://github.com/LDSSA/ds-prep-course-2021) is hosted on GitHub. This makes **our** lives easier, as all changes to the learning materials are tracked together with who made them, and it also makes **your** life easier when accessing the materials.

Saving your work from evil cats is not the only reason VCS is useful. There are three main reasons why VCS are used: 
* collaboration, 
* storing versions properly,
* auditability. 

Let's try to understand all of them.

### <a name="1_2_Collaboration"></a>1.2 Collaboration
Remember when you were editing a text document with 10 other people at the same time? In the end, it was a mess of differently colored crossed out lines. Version control systems were developed to solve the problem of **multiple people working on the same code at the same time**. Each person works on their own copy and when they're done, the changes are merged back to the original file, resolving conflicts along the way. You will learn more about this in SLU05.

![image](media/git_collaboration.png)

While you'll be working on your own in the prep course, it is still important for you to know how Git facilitates the development process when multiple people are involved. In the industry, almost no developer works alone (*thankfully!*).

### <a name="1_3_Storing_versions_properly"></a>1.3 Storing versions properly
<img width="400" src="https://www.groovecommerce.com/hs-fs/hub/188845/file-4063238065-png/blog-files/version-control-comic.png"/>

**Is this a familiar scenario to you?** Imagine this but with many *many* files in a software program... Words are not enough to explain the utter horror of such an event... Worse... Imagine collaborating on a software project *over email*.  (>_<) 

Saving a version of your project after making changes is an essential habit. In VCS, we call this a **commit**. Just like in an old-school videogame, when you feel you have made some progress and a boss around the corner might take you down, you save your game. The same logic applies to a **commit**. 

(*Note for your future self who will thank me later: don't commit broken or unfinished work. Your future self will have zero clues about what the heck was going on back when you committed your code...*)

Another important feature of the **commit**, which we will revisit later on, is that it asks you for a message. This is to help your future self, and others collaborating with you, know what is going on. A succinct but clear message about the changes that were made is a very good practice (and a mandatory one!). 

In the end, when using a VCS, you only keep **one** version of the project in your working folder: the one you're currently working on. Everything else - all the past versions and variants - is neatly packed inside the repository (and can be backed up on a server such as GitHub). When/if you need them, you can request any version at any time, and you'll have a snapshot of the complete project right at hand, including the history of all the changes done thus far.

### <a name="1_4_Auditability"></a>1.4 Auditability

When you're working with a team on a single project, sometimes it gets hard to know "who did what". VCS offers you an easy way to track "who made what change and when". It unlocks the ability to:
- Debug more effectively by finding when a breaking change was introduced
- Track the reason why certain changes were made
- Find the person who made a change and ask them why they did it (not with violence, but with love!)

<img width="600" src="https://miro.medium.com/max/1400/1*wQ2mtIZHzVkJ0Y2suuVGpQ.jpeg">

## <a name="2_Repositories_where_it_all_begins"></a>2. Repositories: where it all begins

A repository is simply a **place that stores and tracks changes to the files contained in it**. It associates identities with changes to those files. 

Take a look at [a familiar repository](https://github.com/LDSSA/ds-prep-course-2022):
![image](media/repo_example.png)

This is the Prep Course public repository, from where all of you pull the SLUs every week. Under the hood, our lovely instructors (shown in **contributors**) work really hard to make the SLUs available to you every week. You can see that at the time of the creation of this Learning Notebook, **12 commits** had already been **pushed** (meaning saved) to the Prep Course repository, but more on this later.

### <a name="2_1_Creating_a_repository"></a>2.1 Creating a repository

Congrats! You may not have realized it at the time, but you've already created a repository by following these instructions: https://github.com/LDSSA/ds-prep-course-2022#12-setup-git-and-github. 

By following these steps, you have:
- Installed Git, a VCS
- Created an account on GitHub, a development platform that hosts Git repositories
- Set up your own (private) repository to save your progress throughout the Prep Course
- Cloned the Prep Course repository to access the materials

You're basically a master at this already!

Now, you'll learn a little bit more about the README file and the .gitignore file.

### <a name="2_2_README"></a>2.2 README

The **README** is probably **the single most important document in a repository**. It is the starting point: when someone arrives at the repo, the README tells them what the repository is all about and what they should do next. It is the documentation of the repo. It uses [Markdown](https://en.wikipedia.org/wiki/Markdown) (which you have learned in the Jupyter Notebook SLU). The .md extension identifies Markdown files.

(*Fun fact: this notebook is also written in markdown!*)


### <a name="2_3_gitignore"></a>2.3 .gitignore
<img width="600" src="https://imgur.com/VmhvbAt.jpeg">

The **.gitignore** is a special file that contains information about the files, file types, and/or directories in your git repo that you **do not want to track automatically** (*just like that poor dude is being ignored, so will these files/directories*). 

It's very easy to accidentally commit a file that you never intended to make available in the repository (especially if you use `git add .` to stage all files in the current directory). That's where a .gitignore file comes in handy! It lets Git know that it should ignore certain files and not track them.

So, which files would you normally not want to track?
- Any file larger than 2 MB, so always add the dataset that you're using to the .gitignore
- Log files
- Files with API keys/secrets, credentials, or sensitive information
- Useless system files such as the annoying macOS .DS_Store
- Dependencies which can be downloaded from a package manager. They already have their own VC systems and you only need a particular version of it. No need to do their job for them and track their changes!

There's a nice website which tells you what to ignore depending on your operating system, text editor or IDE, languages, and frameworks: https://www.gitignore.io/. Another good source of .gitignore examples is GitHub: https://github.com/github/gitignore

Additionally, when setting up a repository, GitHub allows you to add a pre-set .gitignore file depending on the language you're using (Python, R, C...). Very handy!

To include files in the .gitignore, just type the name of the file. The same thing for directories, except that directories should end with a slash (/). 

**Ignoring any file with a given extension:**
Some slightly more advanced ways of excluding files make use of the wildcard (`*`). The wildcard matches 0 or more characters. So, for example, if you want to exclude every .log file you would include `*.log` in your .gitignore. 

However, when using this rule, you may end up ignoring a specific file that you want to commit, e.g. you may want to commit a specific log named important.log and exclude all the others. If that happens, you can use a `!` to specifically negate a file that would be ignored. Our .gitignore would be like:

    *.log
    !important.log

(*Note: The negation must be placed after the rule from which you're excluding the file*)  
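
The two rules above can be tried out in a throwaway repository. This is only a minimal sketch, assuming Git is installed; the temporary folder and the file names (`debug.log`, `notes.txt`) are made up for the illustration.

```shell
# Throwaway repository in a temporary folder (hypothetical file names).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Two log files and one ordinary file.
echo "debug output" > debug.log
echo "keep me" > important.log
echo "notes" > notes.txt

# Ignore every .log file, then make an exception for important.log.
printf '%s\n' '*.log' '!important.log' > .gitignore

# Stage everything; files matched by .gitignore are skipped.
git add .

# List the files that actually reached the staging area.
staged_files="$(git ls-files)"
echo "$staged_files"
```

The listing should contain `important.log` and `notes.txt` (plus the `.gitignore` itself) but not `debug.log`.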

### <a name="2_4_Creating_a_repository_not_hosted_on_GitHub_servers"></a>2.4 Creating a repository not hosted on GitHub servers

This is what you know until now:
- How to create a repo in GitHub
- How to create a README and understand why it is important
- How to create a .gitignore and add files that you don't want to see committed

But what if you wanted to benefit from the advantages of a version control system, but you don't want the content of your work to be hosted on GitHub servers?
Imagine this: you're a writer and you're working on your new book. For obvious reasons you don't want the draft version of your book to be public, but actually you're not even interested in sharing it with anyone at this stage, so not even a private repo makes sense.
Imagine that you delete some text as you're writing because you don't like it. But later on, you often find yourself wanting to bring that text back. You know you can do that with a version control system, but you want neither a private nor a public repository. What do you do?
Luckily for you, there's a way to create a repo on your machine without hosting it on GitHub. Let's see how:
- Open the terminal and go to the folder that contains the files you would like to track
- Run `git init`. This creates a local repository inside that folder (in a hidden `.git` directory), which is something that you'll learn about in the next section of this SLU :) No remote server is involved: the repository lives only on your machine. Don't add the `--bare` option here, because a bare repository has no workspace in which to edit files.
- There is no need to run `git remote add`. A remote is only needed when you want to synchronize with a copy of the repository stored somewhere else.
- Run `git checkout -b main`. You'll only learn about branches in SLU05, so for now don't think too much about this command :)

And done! Whenever you feel like using the advantages of a Version Control System but you don't want to host anything on GitHub, you now know how to! 
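
As a rough illustration of a repository that never leaves your machine, here is a small sketch built on plain `git init`. The folder, the file name `draft.txt`, and the author name and e-mail are placeholders invented for this example.

```shell
# A purely local repository for a (hypothetical) book draft.
book_dir="$(mktemp -d)"
cd "$book_dir"
git init --quiet

# Placeholder identity, stored only in this repository's configuration.
git config user.name "Example Writer"
git config user.email "writer@example.com"

# Save a first version of the draft.
echo "Chapter 1" > draft.txt
git add draft.txt
git commit --quiet -m "Add first draft of chapter 1"

# No remote is configured, yet the history is already being recorded.
configured_remotes="$(git remote)"
commit_count="$(git rev-list --count HEAD)"
echo "remotes: '${configured_remotes}', commits: ${commit_count}"
```

The list of remotes stays empty while the commit count grows, which is exactly the "version control without hosting" situation described above.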

## <a name="3_Working_with_Git_the_basics"></a>3. Working with Git: the basics

So... This is how Git works! Looks pretty confusing, right?

![image](media/git_spaces.png)

Let's break this diagram into parts, using the concepts that we've already acquired in this SLU. And remember, you've already done most of this stuff!

The **remote repository** is the GitHub repository that is hosted on the GitHub servers. The first question you may ask is: **why do we need a workspace and a local repository? And why is there a staging area in between them?**

You can think about it as building a photo album. The world around you is changing (the workspace), you take snapshots of it and store them in a box (staging area) to later paste them into the photo album (the local repository). The snapshots pasted in the photo album are there forever. Whether the snapshots in the box land in the photo album or not is still up to your decision. You also decide which of them will be pasted together.

In more technical terms:

- The **workspace** consists of files that you are currently working on. You can think of the workspace as a file system where you can view and modify files. 
- The **staging area** is where commits are prepared. It basically represents the snapshots of files taken at a given time that you plan to move to the local repository. The staging area allows you to group file snapshots that should be committed together because the changes you made in them are somehow related. You can always change your mind and decide not to commit a certain snapshot by removing it from the staging area. The staging area can also be referred to as the **index**, because the list that contains the file snapshots is stored in a file named index.
- The **local repository** holds all the **commits** — snapshots of files at a point in time. Once the file snapshots are committed to the local repository they disappear from the staging area. These snapshots are permanent and show the history of how each file was changing. You can access the previous file versions if you want. It's a good idea to create small and frequent commits so that it’s easy to track down bugs and revert changes with minimal impact on the rest of the project.

The **basic Git workflow** goes something like this:
1. You modify files in your working tree.
2. You selectively stage just those changes that you want to be part of your next commit.
3. You do a commit, which takes the files as they are in the staging area and stores those versions permanently to your local repository.
4. You again start modifying your files and the cycle continues... That's Git *circle of life* for files!
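
The cycle above can be sketched in a scratch repository. This is an illustration under assumptions, not a prescribed procedure: the file names and the identity are invented, and only one of the two new files is deliberately staged.

```shell
work_dir="$(mktemp -d)"
cd "$work_dir"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# 1. Modify files in the workspace.
echo "print('hello')" > hello.py
echo "scratch ideas" > scratch.txt

# 2. Stage only the change that belongs in the next commit.
git add hello.py

# 3. Commit what is in the staging area to the local repository.
git commit --quiet -m "Add hello script"

# scratch.txt was never staged, so it is not part of the commit.
committed_files="$(git ls-tree --name-only HEAD)"
echo "$committed_files"
git status --short
```

Here `scratch.txt` remains in the workspace as an untracked file, ready for a later round of the cycle.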


    
It is also possible to skip the staging area and commit files directly to the local repository; we will talk about this later. Git tracks all the files in the workspace that were staged or committed sometime in the past and notices if they were changed since the last staging or commit. It also shows you the state of each file in the Git life cycle. Go on to the next section to learn more about this!

### <a name="3_1_Main_Git_commands"></a>3.1 Main Git commands
We will now look at the six most frequent Git commands that move the files through the Git circle of life, steer the traffic between the local and remote repositories, and inform you about the status of your files: add, commit, push, pull, status, and log.

### <a name="git_add"></a>`git add`

<img width="500" src="https://imgur.com/Qit7nJ2.png"/>

The `git add` command adds snapshots of selected files in the working directory to the staging area. It tells Git that you want to include the current version of these files in the next commit. 

In **Step 5** of the [guide to the learning materials](https://github.com/LDSSA/ds-prep-course-2022#22---working-on-the-learning-units), we ask you to stage all changes with the `git add .` command. 

Useful command options:
- `git add <file>` stages a snapshot of the given file for the next commit.
- `git add <file1> <file2>` stages the snapshots of these two files for the next commit. You can include more than two files.
- `git add <directory>` stages snapshots of all the files inside the given directory for the next commit.
- `git add .` stages **everything** in your current directory for the next commit.
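
To see how the options above differ, the following sketch stages files step by step and lists the staging area after each step. The folder layout and file names are hypothetical.

```shell
add_dir="$(mktemp -d)"
cd "$add_dir"
git init --quiet

# A small made-up project: two top-level files and a data/ folder.
mkdir data
echo "a" > one.txt
echo "b" > two.txt
echo "c" > data/three.txt
echo "d" > data/four.txt

# Stage a single file.
git add one.txt
after_file="$(git ls-files)"

# Stage a whole directory.
git add data/
after_directory="$(git ls-files)"

# Stage everything that is left in the current directory.
git add .
after_everything="$(git ls-files)"

echo "$after_everything"
```

`git ls-files` prints the files currently recorded in the staging area, so each variable captures how far the staging has progressed.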

### <a name="git_commit"></a>`git commit`

<img width="500" src="https://imgur.com/Dh1vMsm.png"/>

By now you already know what this command does: it simply stores the snapshots of the files that you have in the staging area to the local repository. The command launches the editor of your choice, where you need to type the commit message:

![image](media/git_commit.png)


Afterwards, if the editor that opens is nano, type *ctrl + X* to exit, type *Y* to save the changes, then type *Enter* to confirm the commit file name (other editors use different keys). If you want to avoid this process, you can write your commit message directly on the `git commit` command, like this: `git commit -m "commit message"`.

In **Step 5** of the [guide to the learning materials](https://github.com/LDSSA/ds-prep-course-2022#22---working-on-the-learning-units), we ask you to commit your resolved notebook with the command `git commit -m "Exercises for Week <week number>"`, where you substitute the `<week number>` by the corresponding value. Adding an explicit message to the commit will make your life easier down the road to trace back to where you were at the time. 

Other useful command options:
- `git commit -a` automatically stages every tracked file that was modified or deleted and then commits. New (untracked) files are not included, so they still need a `git add`.
- `git commit --amend` modifies the last commit so that currently staged files will be added to it. It also lets you change the commit message.
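
The three variants can be compared in a scratch repository. This is a sketch with invented file names and a placeholder identity; `--no-edit` is an extra `git commit` option used here only to keep the previous message without opening an editor.

```shell
commit_dir="$(mktemp -d)"
cd "$commit_dir"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# Regular commit with an inline message.
echo "v1" > report.txt
git add report.txt
git commit --quiet -m "Add report"

# Change a tracked file; -a stages it automatically before committing.
echo "v2" >> report.txt
git commit --quiet -a -m "Update report"

# A forgotten file: stage it and fold it into the last commit.
echo "appendix" > appendix.txt
git add appendix.txt
git commit --quiet --amend --no-edit

total_commits="$(git rev-list --count HEAD)"
last_commit_files="$(git diff-tree --no-commit-id --name-only -r HEAD)"
echo "commits: $total_commits"
echo "$last_commit_files"
```

Amending replaces the last commit instead of creating a new one, so the history still has two commits and the second one now contains both files.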

### <a name="git_push"></a>`git push`
![image](media/git_push.png)

The `git push` command is used to upload local repository content to a remote repository. Pushing is how you transfer commits from your local repository to a remote repo. The push will share the modifications you made with remote team members.

In **Step 5** of the [guide to the learning materials](https://github.com/LDSSA/ds-prep-course-2022#22---working-on-the-learning-units), you finish with the `git push` command to send all your changes to your working repository.

This command only works if you cloned from a server to which you have write access and if nobody has pushed in the meantime. If you and someone else clone at the same time and they push upstream and then you push upstream, your push will rightly be rejected. You’ll have to fetch their work first and incorporate it into yours before you’ll be allowed to push. More details on this in SLU05.
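
Pushing can be tried without GitHub by letting a local folder play the role of the remote. In this sketch, which is an assumption-laden stand-in for a real server, `remote.git` is a bare repository created only for the demonstration, and the file name and identity are placeholders.

```shell
push_sandbox="$(mktemp -d)"

# A bare repository stands in for the remote server.
git init --quiet --bare "$push_sandbox/remote.git"

# Clone it, as you would clone a repository from GitHub.
git clone --quiet "$push_sandbox/remote.git" "$push_sandbox/local"
cd "$push_sandbox/local"
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# Commit locally: at this point the remote knows nothing about it.
echo "week 3 answers" > exercises.txt
git add exercises.txt
git commit --quiet -m "Exercises for Week 3"

# Upload the current branch to the remote named origin.
git push --quiet origin HEAD

# Count the commits that the remote now holds.
remote_commits="$(git --git-dir="$push_sandbox/remote.git" rev-list --count --all)"
echo "commits on the remote: $remote_commits"
```

Cloning an empty repository prints a warning, which is harmless here. `git push origin HEAD` spells out the destination explicitly; in a repository cloned from GitHub that already has commits, a plain `git push` is usually enough.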

### <a name="git_pull"></a>`git pull`
![image](media/git_pull.png)

The `git pull` command is used to fetch and download content from a remote repository and immediately update the local repository to match that content. It’s an easy way to synchronize your local repository with upstream changes. In a way, the opposite of the `git push` command.

We asked in [the guide of week 00](https://github.com/LDSSA/ds-prep-course-2022#21-weekly-setup---get-the-learning-materials) to start the set-up of the learning materials with a `git pull` of the [ds-prep-course-2022](https://github.com/LDSSA/ds-prep-course-2022). You already had to do it again for **Week 01 and Week 02** when you followed the [basic workflow to update the learning units](https://github.com/LDSSA/ds-prep-course-2022#3-updates-to-learning-units), so you're already very familiar with this command.

Sometimes conflicts arise when you have files conflicting with the remote repository, but don't worry about it now. We'll learn how to solve those in SLU05!
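
The weekly "pull the new materials" routine can be imitated locally. In this hedged sketch a bare repository plays the remote, one clone plays an instructor and another plays a student; all names, files and identities are invented.

```shell
pull_sandbox="$(mktemp -d)"
git init --quiet --bare "$pull_sandbox/remote.git"

# The instructor publishes the first week.
git clone --quiet "$pull_sandbox/remote.git" "$pull_sandbox/instructor"
cd "$pull_sandbox/instructor"
git config user.name "Example Instructor"   # placeholder identity
git config user.email "instructor@example.com"
echo "materials for week 1" > week01.txt
git add week01.txt
git commit --quiet -m "Add Week 01"
git push --quiet origin HEAD

# The student clones the repository and gets week 1.
git clone --quiet "$pull_sandbox/remote.git" "$pull_sandbox/student"

# Later, the instructor publishes week 2.
echo "materials for week 2" > week02.txt
git add week02.txt
git commit --quiet -m "Add Week 02"
git push --quiet origin HEAD

# The student pulls to catch up with the remote.
cd "$pull_sandbox/student"
git pull --quiet
ls
```

After the pull, the student's folder contains both `week01.txt` and `week02.txt`, without the student having copied anything by hand.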

### <a name="git_status"></a>`git status`

The `git status` command is very straightforward. It lets you see the status of the files in your workspace and the staging area. The files in these two areas have 4 possible states:

- **New**: files in the staging area that were not committed before in this workspace (newly tracked)
- **Modified**: tracked files in the workspace or staging area that have changed since you last staged or committed them
- **Unmodified**: tracked files in the workspace or staging area that were not changed since you last staged or committed them
- **Untracked**: files that were not yet staged or committed in this workspace, e.g. newly created files

In a graphical way - remember the Git circle of life?

![image](media/git_circle_of_life.png)


The image below shows an example of running a `git status` command where: 
- the README.md file is newly staged (it was untracked before and was staged just now - notice the git add command in the previous line)
- all other files are untracked

![image](media/git_status_0.png)

Example of modifying a staged file:

![image](media/git_status_1.png)

Here we have staged the Learning notebook.ipynb file, then modified it. Notice that the file appears both in the staging area and the workspace - these are different versions/snapshots of the same file!

Example of git status after staging the modified Learning notebook and committing:

![image](media/git_status_2.png)

There are no modified files and nothing in the staging area.
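
The same sequence can be reproduced in text form with `git status --short`, a compact variant of the command. This is a sketch with made-up file names; in the short format the first column describes the staging area and the second the workspace.

```shell
status_dir="$(mktemp -d)"
cd "$status_dir"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

echo "first draft" > notebook.txt
echo "unrelated" > todo.txt

# Stage notebook.txt, then modify it again in the workspace.
git add notebook.txt
echo "second draft" >> notebook.txt

# notebook.txt is "A" (new, staged) and "M" (modified since staging);
# todo.txt is "??" (untracked).
status_before="$(git status --short)"
echo "$status_before"

# Stage the latest version and commit it.
git add notebook.txt
git commit --quiet -m "Add notebook"

# Only the untracked file is left to report.
status_after="$(git status --short)"
echo "$status_after"
```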

Bonus commands:

You may have noticed two more git commands in the git status output: reset and checkout.
- `git reset HEAD <file>` unstages a file
- `git checkout -- <file>` will undo the changes to a file in the workspace you've made after staging - be careful with this command to avoid losing data
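
Here is a small sketch of both bonus commands, assuming a scratch repository with one committed file (`config.txt` is a made-up name). Note that the second command throws away the uncommitted edit for good.

```shell
undo_dir="$(mktemp -d)"
cd "$undo_dir"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# A committed starting point.
echo "stable" > config.txt
git add config.txt
git commit --quiet -m "Add config"

# Make an experimental change and stage it.
echo "experiment" >> config.txt
git add config.txt

# Unstage the file: the change stays in the workspace only.
git reset --quiet HEAD config.txt
staged_after_reset="$(git diff --cached --name-only)"

# Discard the workspace change: the file returns to its last staged version.
git checkout -- config.txt
content_after_checkout="$(cat config.txt)"
echo "$content_after_checkout"
```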

### <a name="git_log"></a>`git log`

The `git log` command shows the commit history of the local repo. This history will help you in case you need to revert to a previous commit, something that you'll also learn in the next Git SLU! For now, see what the output of a `git log` command looks like:

<img width="700" src="https://imgur.com/Ys9IF79.png"/>

Each commit has a unique identifier and you can see its author, date, and commit message. There is more information in the history that you can get with different command options. To exit the log, type q.

Other useful command options:
- `git log -n <limit>` shows the last n commits. For example, `git log -n 3` will display only the last 3 commits.
- `git log --stat` includes the information about which files were altered and the number of lines that were added or deleted from each of them.
- `git log --author="<pattern>"` filters the commits by the person who made changes in the committed files. For those of you who are familiar with regular expressions, you can also use a regular expression instead of a string.
- `git log --grep="<pattern>"` filters the commits by searching for a specific pattern in the commit message. One example would be to filter commits that mention a specific detail you're interested in. Again, you can search for a string or use a regular expression.
- `git log <file>` shows only the commits that include the specified file.
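
The filters above can be tried on a tiny, invented history. In this sketch the commits, file names and author are placeholders, and `--format=%s` (which prints only each commit's message subject) is an additional option used just to keep the output short.

```shell
log_dir="$(mktemp -d)"
cd "$log_dir"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# Build a history of three commits.
echo "# Project" > README.md
git add README.md
git commit --quiet -m "Add README"

echo "answers" > exercises.txt
git add exercises.txt
git commit --quiet -m "Add exercises"

echo "# My Project" > README.md
git commit --quiet -a -m "Fix typo in README"

# Last two commits only.
last_two="$(git log -n 2 --format=%s)"
# Commits whose message mentions "Fix".
fix_commits="$(git log --grep="Fix" --format=%s)"
# Commits by a given author.
by_author="$(git log --author="Example Student" --format=%s)"
# Commits that touched README.md.
readme_commits="$(git log --format=%s -- README.md)"

# Files changed and lines added or removed in the latest commit.
git log --stat -n 1
```

The `--` before `README.md` simply tells Git that what follows is a file path.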

## <a name="4_Summary_workflow"></a>4. Summary workflow

Now you know what each command does. The **git status, add, and commit** can be chained in a workflow that makes sense.

A typical use-case would be:
- use `git status` to check for any modified files, 
- use `git add` to stage the modified files, then `git status` again to check that those files are now staged 
- use `git commit -m "<commit message>"` to commit them
- finally, use `git push` to send all your changes to the remote repository

So now you know how to chain these commands in a way that makes sense! With experience, you may end up dropping one or both of the git status commands as you'll intuitively know in which states your files are, but it's always ok to do a check once in a while. 
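
Put together, the whole chain might look like the sketch below. A local bare repository stands in for GitHub, and the file name, identity and commit message are placeholders.

```shell
flow_sandbox="$(mktemp -d)"
git init --quiet --bare "$flow_sandbox/remote.git"     # stand-in for GitHub
git clone --quiet "$flow_sandbox/remote.git" "$flow_sandbox/workspace"
cd "$flow_sandbox/workspace"
git config user.name "Example Student"                 # placeholder identity
git config user.email "student@example.com"

# Do some work.
echo "answers" > exercise_notebook.txt

git status --short                  # what changed?
git add exercise_notebook.txt       # stage it
git status --short                  # is it staged now?
git commit --quiet -m "Exercises for Week 3"
git push --quiet origin HEAD        # share it with the remote

# Read the commit message back from the remote side.
pushed_message="$(git --git-dir="$flow_sandbox/remote.git" log --all --format=%s)"
echo "$pushed_message"
```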

<img width="500" src="https://imgs.xkcd.com/comics/git_2x.png"/>

## <a name="5_GitHub_and_other_Version_Control_Systems"></a>5. GitHub and other Version Control Systems

Git differs from many other version control systems in that it stores snapshots of files at a given time instead of just storing the changes. It is a distributed version control system - the whole repository is copied (cloned) to your PC, so that you can work offline. It is also lightweight and very fast.

So that's it! Hope you've enjoyed this Learning Notebook and please take your time to assimilate all the concepts. The Exercise Notebook consists of multiple choice questions, so it's really important that you have a good grasp of everything before starting it. **Good luck!**

<a name="Useful_links"></a>**Useful links:**

Pro Git book by Scott Chacon and Ben Straub https://git-scm.com/book/en/v2 - currently available in 14 languages, with translations started for 15 other languages