# GitHub For The Modern Data Scientist: 7 Concepts You Can't .gitignore
## Explained with puns, fun, wit and visuals
![image.png](attachment:040209c3-3411-4b36-9a8a-fca6ae006cbd.png)

### Introduction

### 0. Remote repository

Git repositories come in two flavors: local and remote. A local Git repository is stored only on your computer. Its remote counterpart, on the other hand, is stored on a server (usually GitHub) rather than on some rusty laptop so that others can collaborate with you on the project without having to be physically in the room. 

![](images/remote_repo.png)

However, a remote repository isn't only for collaboration. Since it is on a server, you can access it from any machine with an internet connection. This beats carrying around your heavy laptop from office to home to local cafe to office to home.

### 1. README

READMEs are an essential part of any project, yet they are often overlooked. We all know what a README is. We are not children. So why don't we take the time to add it to our repositories? 

![image.png](images/readme.png)

Perhaps because, since we know full well what our code is about and how it works, we feel we don't need to explain it to others. But that isn't always the case.

For example, imagine taking a break for a couple of months from a project and then getting back to it. Would you remember what each and every one of the files does? 

Personally, when I come back to a long-paused project, it always feels like I am reading someone else's code and ideas. In situations like this, a README can save the project by telling you where you left off and what you were doing. You won't have to give up the entire thing because you feel like you have to start from scratch.

There are two types of READMEs: personal READMEs and public-facing READMEs. For personal READMEs, the purpose is to help you, the developer. 

On the other hand, public-facing READMEs are for others who might be interested in your project. Most of the time, when a person comes across a GitHub repository, they judge it not by how well the code is written (they rarely look at the code) but by how well it is presented. Think of the README as the user manual for your project. It is the first thing people see when they visit the repo, and it should give them a clear idea of what your project is and how to use it.

In addition, a public-facing README is your best chance to explain your thought process behind model selection, your validation methodology, and the results of model testing in a clear and concise manner. If you still hate writing READMEs after all this, you should check out the Awesome README repository, put together for lazies like us.

https://github.com/matiassingers/awesome-readme

### 2. Clone it or fork it?

When someone looks at a remote repository, there are four things that may happen. The likeliest course of action (case 0) is they ignore it or give it a star if they are feeling generous. 

In the first case, if the README was convincing enough, they might clone it. 

![image.png](images/clone.png)

Cloning a remote repository with a command like `git clone https://github.com/username/awesome_repo` creates an exact copy of `awesome_repo` on your local machine, giving you the project's entire Git history and the freedom to edit any file. However, changes you make to this local copy stay local: the remote copy won't feel a thing unless you push them, which requires write permission on the remote.

In the second case, if the README was even more convincing, a person might fork it. 

![image.png](images/fork.png)

When you fork `awesome_person`'s `awesome_repo` on GitHub, you get an exact copy of `awesome_repo` under your own account. 

Your GitHub page will have a new `your_username/awesome_repo` repository with the same content and history as `awesome_person/awesome_repo`. If you want to make changes to this copy, you can clone `your_username/awesome_repo` so that it is also on your local machine. 

There are a number of reasons why someone might fork another's repo. The number one reason is to contribute to `awesome_person/awesome_repo` by submitting pull requests. Another reason is to create a new project based on the original code, without affecting it. 

A notable example of this is [the Manim Community repository](https://github.com/ManimCommunity/manim), which is a more maintained and documented fork of [the legendary Manim repository](https://github.com/3b1b/manim) by Grant Sanderson (creator of 3Blue1Brown and all its videos). 

> To differentiate between originals and forks, GitHub adds a "forked from original_repo" label on repository pages.

The third case is when you access one of your own remote repositories from a different machine. For instance, you left your laptop in a dry cleaner's, and you want to continue working on the project in the office. 

In this case, all you have to do is clone the repo to download its contents to the office Mac. However, Git on that Mac must be authenticated with your GitHub account (for example, with an SSH key or a personal access token) if you want to push your changes back.

### 3. Push and pull

When using Git, "push" and "pull" move commits between your local repository and a remote one: a push uploads your commits, and a pull downloads commits made elsewhere. 
Let's go through the most common cases of using these commands. 

In Case 0, you have a local repository with many commits and branches, and you want to send them all to a new empty remote repository on GitHub. First, you need to specify the web address of the remote repository in Git using the command `git remote add remote_name https://github.com/username/repo_name.git`. Usually, the replacement for `remote_name` is `origin`. 

![](images/push.gif)

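Then, you perform your first "push" by calling `git push -u origin main` (replace `main` with your branch name), which sends all the commits in the current branch to GitHub. The `-u` flag links the local branch to its remote counterpart, so a plain `git push` is enough from then on. Congrats!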

In Case 1, you make new local commits to the repository and want to send them to the remote, so you call `git push` again. Run this command whenever you want to bring the remote repository up to date with your local one. 

In Case 2, you may be working on a big project with many contributors, and the copy of the project on your machine is two days old. To download the changes others made during this time, you need to perform a `git pull`, which fetches the new commits and merges them into your current branch. 

![](images/pull.gif)

This command ensures that your local repository is up-to-date with the remote in case multiple people are making changes to it.
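
The three cases above boil down to a handful of commands. As a minimal sketch, the Python helper below assembles them as argument lists and only prints them by default; calling `run(..., dry_run=False)` would hand them to `subprocess.run` instead. The URL, the `origin` remote name, and the `main` branch are placeholders, and the function names are made up for this example.

```python
import subprocess


def first_push(remote_url, remote_name="origin", branch="main"):
    """Case 0: register the remote, then push and set the upstream branch."""
    return [
        ["git", "remote", "add", remote_name, remote_url],
        # -u (--set-upstream) links the local branch to the remote one
        ["git", "push", "-u", remote_name, branch],
    ]


def sync_commands():
    """Cases 1 and 2: once the upstream is set, the short forms are enough."""
    return {"upload": ["git", "push"], "download": ["git", "pull"]}


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if git reports a failure
            subprocess.run(cmd, check=True)


# Placeholder URL taken from the text above, not a real repository.
commands = first_push("https://github.com/username/repo_name.git")
run(commands)
```

Keeping the commands as lists rather than one long string avoids shell-quoting problems when a path or URL contains spaces.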

### 4. Pull requests

- Pull requests are like knocking on your neighbor's door and saying, "Hey, I made some changes to your lawn. Take a look and see if you like it!" But instead of lawns, we're talking about repositories.
- Pull requests are a key part of the collaborative nature of open-source projects.
- They allow contributors (regular programmers) to suggest changes and improvements to an existing project, and they let the project maintainers review and approve those changes before they are merged into the main codebase
- Let's say you want to make a pull request to the Scikit-learn repo (sweet dream :)
- Here is the workflow you have to follow (it works for making a pull request to any repo):
- Step 0: Fork Scikit-learn so that you have a copy under your account
- Step 1: Clone your Scikit-learn copy so that you have a copy on your machine
- Step 2: Create a new branch so you can tinker around without affecting the main branch
- Step 3: Make your changes - write new code, fix bugs or typos in the docs, or make other improvements
- Step 4: Test your changes - make sure your changes are error-free and work as intended
- Step 5: Push the branch with your changes (`git push -u origin your-branch-name` will send it to your remote copy of Scikit-learn)
- Step 6: Create the request - from your forked repository, click on the "New pull request" button. This will put your changes under the eyes of Scikit-learn maintainers
- Step 7: Wait, possibly until your hair turns gray, for maintainers to review your request
- Step 8: Be disappointed or happy as they either reject or accept your request
- Conditional step 9: The maintainers merge your request so that it will be included in the next release
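
Steps 1 to 5 translate into a short command sequence. The sketch below only builds and prints that sequence, so nothing is cloned or pushed when you run it. The username, branch name, and commit message are hypothetical, and `python -m pytest` is an assumption about the test runner; use whatever the project's contributing guide says. Step 0 (fork) and step 6 (open the pull request) happen on the GitHub website.

```python
def pull_request_commands(username, repo, branch):
    """Return the shell commands for steps 1 to 5 of the workflow above."""
    fork_url = f"https://github.com/{username}/{repo}.git"
    return [
        f"git clone {fork_url}",                 # step 1: copy your fork locally
        f"cd {repo}",
        f"git checkout -b {branch}",             # step 2: create and switch to a new branch
        "git add .",                             # step 3: stage your edits...
        'git commit -m "Describe your change"',  # ...and commit them
        "python -m pytest",                      # step 4: run the tests (assumed runner)
        f"git push -u origin {branch}",          # step 5: send the branch to your fork
    ]


# Hypothetical values for illustration only.
steps = pull_request_commands("your_username", "scikit-learn", "fix-docs-typo")
for command in steps:
    print(command)
```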

### 5. GitHub issues

- Think of GitHub issues as a more democratic, civilized but less organized version of Stack Overflow threads
- People can use issues to track and report problems, bugs or feature requests for a specific repository
- Maintainers of the repo might assign issues to individuals or teams, or label them according to their severity and impact
- In simple words, GitHub issues are management tools for software projects
- Most popular repositories have issue templates so that you can ask useful, _reviewable_ and _answerable_ questions without rambling

### 6. GitHub Actions

- The rabbit hole of GitHub Actions is very deep
- But we will only go a few meters below the surface and learn the basics
- As a data scientist, you have a million things to do like analyzing data, building models, writing code, getting coffee...
- All of this can be quite repetitive and overwhelming
- But if you know GitHub Actions, it will be like having a personal assistant for all repository-related tasks
- With GitHub Actions, you can set up automated workflows that run whenever certain events occur
- For example, you can create an action that automatically validates the model performance on the test data when you train a new model
- Or you could set up an action that deploys the model once you push a new release of your repo
- As you can imagine, there are many other workflows that can be automated this way. Here are some examples:

1. Run tests - automatically run unit or integration tests whenever you push new code
2. Train models - retrain existing ML models on a regular basis (e.g. daily, weekly, etc.)
3. Perform pre-processing - run common processing tasks such as data cleaning, normalization, missing value imputation, feature extraction, etc. 

- GitHub Actions are stupidly easy to set up. 
- They are defined inside YAML files stored in the `.github/workflows/` directory of your repo, which you can edit in any way to meet your needs.
- Here is a sample YAML for an action that runs unit tests:

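```yaml
name: Run Tests

# Trigger the workflow on pushes and pull requests that target main
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  tests:
    runs-on: ubuntu-latest

    steps:
    # Check out the repository so the job can see the code
    - uses: actions/checkout@v4
    - name: Set up Python 3.9
      uses: actions/setup-python@v5
      with:
        python-version: '3.9'
    # Assumes pytest is listed in requirements.txt
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    # The job fails if any test fails
    - name: Run unit tests
      run: |
        python -m pytest tests/
```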

To understand what is happening above, read the official GitHub Actions docs [here](https://docs.github.com/en/actions).
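
A workflow step fails whenever the command in its `run:` block exits with a non-zero status, and that is what makes ideas like automatic model validation possible. The sketch below shows the core of a hypothetical checking script: the function name, the example score, and the 0.85 threshold are all made up for illustration. In a real project, the score would come from evaluating your model on the test data.

```python
def exit_code(score, threshold=0.85):
    """Return 0 if the score meets the threshold, 1 otherwise."""
    if score >= threshold:
        print(f"OK: score {score:.3f} meets the threshold {threshold:.3f}")
        return 0
    print(f"FAIL: score {score:.3f} is below the threshold {threshold:.3f}")
    return 1


# Made-up score; a real script would compute it from the test set.
code = exit_code(0.91)
print("exit code:", code)

# In a real script, finish with `sys.exit(code)` so that a failing
# score makes the workflow step (and therefore the whole job) fail.
```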

### 7. CI/CD pipelines

- CI/CD (continuous integration and continuous delivery or deployment) is a concept stolen from DevOps in software engineering
- CI/CD pipelines are a series of automated steps that your code goes through to ensure it's in tip-top shape before you release it into the wild
- We can apply the same process so that our ML models and the code to train/deploy them are of the highest quality as well

![image.png](images/cicd.png)

- In other words, CI/CD pipelines are like well-oiled machines that shift into gear whenever you make a new commit
- Before the commit is merged, it goes through a series of steps that test the code and the models from various angles
- And you guessed it, each step can be defined with GitHub Actions
- Let's take a helicopter view of a sample CI/CD pipeline for the machine learning lifecycle:

1. Data collection and preprocessing - data from various sources is combined and transformed before training.

2. Model training - machine learning models are trained on the preprocessed data.

3. Model evaluation - the trained models are evaluated on a separate set of data to assess their performance.

4. Model deployment - the best performing model is deployed to production.

5. Continuous monitoring - the deployed model is continuously monitored to ensure it's performing as expected.

- Each of these steps deserves an article of its own or even an entire course. 
- In short, CI/CD pipelines are silent heroes in most (active) machine learning projects. 
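
To make the five stages a bit more concrete, here is a toy sketch in plain Python rather than a real pipeline. Every function name, the tiny dataset, the "model" (it just predicts the mean), and the error threshold are hypothetical stand-ins; in practice each stage would likely be its own workflow job calling your own scripts. The point is the control flow: deployment only happens if evaluation passes.

```python
def collect_and_preprocess():
    """Stage 1: gather raw values and drop missing entries."""
    raw = [2.0, None, 4.0, 6.0, None, 8.0]
    return [x for x in raw if x is not None]


def train(data):
    """Stage 2: 'train' a trivial model that always predicts the mean."""
    return sum(data) / len(data)


def evaluate(model, holdout):
    """Stage 3: mean absolute error of the model on held-out data."""
    return sum(abs(y - model) for y in holdout) / len(holdout)


def run_pipeline(holdout, max_error=2.0):
    """Run the stages in order; deploy only if evaluation passes."""
    data = collect_and_preprocess()
    model = train(data)
    error = evaluate(model, holdout)
    deployed = error <= max_error  # Stage 4: deployment is gated on the error
    # Stage 5 (monitoring) would repeat the evaluation on live data over time.
    return {"model": model, "error": error, "deployed": deployed}


result = run_pipeline(holdout=[4.0, 5.0, 6.0])
print(result)
```

Calling `run_pipeline` with a holdout set far from the training data, such as `[20.0, 30.0]`, pushes the error above the threshold, so the result reports `deployed` as `False`.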

### Conclusion