<a href="https://colab.research.google.com/github/Tarleton-Math/data-science-20-21-hwk01/blob/master/git_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Git commands
## Data Science (masters)
## Math 5364 & 5366, Fall 20 & Spring 21
## Tarleton State University
## Dr. Scott Cook

In light of the volatility we face in Fall 2020, I've tried to setup our course to be as flexible as possible regarding compute resources.  We'll use a combinations of excellent and free cloud-based services so that you are not rooted to any specific machine or location.  

Let's discuss these services, how they interact, and a few catches to watch for.

### Git & GitHub

[Git](https://en.wikipedia.org/wiki/Git) is the most-widely used version control system.  It is designed to allow teams to colaborate without stepping on each other's toes.  It also allows single users to same history and revert if they screw something else.  The basic unit for git is called a *repository* (or repo).

[GitHub](https://github.com/) is the most-widely used cloud git platform.

Generally, you have copies of a *local* repo on your machine and the *origin* on GitHub which you sync using either
- command line
- GUI

My favorite git resource is https://try.github.io/.  Honestly, I find git difficult in part because the terminology is dense and cryptic.  But it is a key skill for many jobs, so I want you to gain experience with it (even if I am not a good person to teach it to you).

### Python

[Python](https://en.wikipedia.org/wiki/Python_%28programming_language%29) and [R](https://en.wikipedia.org/wiki/R_(programming_language) are the most important languages for data science.  We do Python in Data Science and R in Statistical Models (Fall 21).

One difference is that Python started as a general purpose language that was adopted by scientists and mathematicians, whereas R was built specifically to handle data.  Thus, there are a lot of things Python can do that R can not.  But, Python has a steeper learning curve too.

Python is like an onion - it comes with many layers
- Core python
    - lists, sets, strings, comprehension, classes, etc
    - My favorite intro is Jake Vanderplas's [Whirlwind Tour Of Python](https://github.com/jakevdp/WhirlwindTourOfPython).  It's a little dated, but covers the key intro material for a data scientist.  And its free and comes as Jupyer notebooks you can run and manipulate!
- [SciPy](https://scipy.org)
    - Python modules built for science.
    - Most of our coding will utilize SciPy (with core python always playing a role)
    - My favorite intro is Jake Vanderplas's [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook).  It's a little dated, but covers the key intro material for a data scientist.  And its free and comes as Jupyer notebooks you can run and manipulate!

(Yes - I'm an acolyte of Vanderplas.  He moved from Univ of Washington to Google and is now part of the team building Jax, currently the nicest way to use GPU's in Python.  He's also the guy in the intro to Colab video.)

### Jupyter

Python started as command line, then added interactive iPython, then added notebook-driven [Jupyter](https://jupyter.org/).  Serious developers oftern prefer IDEs like [PyCharm, Spyder, others](https://www.guru99.com/python-ide-code-editor.html).  But data science is primarily done in Jupyter notebooks because you can easily combine text ([Markdown](https://daringfireball.net/projects/markdown/)), code, output, and visualizations.

Almost everything you submit for this class will be in the form of a Jupyter notebook (or files supporting a notebook).

### Google Colab

Python is an onion - packages built on packages built on packages ... That's great, except for [dependency hell](https://en.wikipedia.org/wiki/Dependency_hell).  One important solution to dependency hell are package managers and distributions, such as [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)) and the awesome (and Austin-based) [Anaconda](https://www.anaconda.com/) products.  These focus on local installations to your own machine.

More and more activity is occuring on cloud services like [Google Cloud](https://cloud.google.com/), [Amazon Web Services](https://aws.amazon.com/), and [Microsoft Azure](https://azure.microsoft.com).  These provide (essentially) unlimited compute power (at a cost), but require some configuration and maintenance.

[Google Colaboratory (Colab)](https://colab.research.google.com) launched about 2 years ago and provides a free, pre-configured environment that is accessible anywhere and support synchronous editing (like google docs).  I honestly love Colab; it is the perfect place to get started.  In past year, we spent the first 2 days of this class just setting up and getting configured.  It is instantaneous with Colab.

Colab does have hardware limitations that we may eventually hit against that will push us to beefier options.  We'll handle that when we get there.

### Google Drive

One limitation of Colab is that it "recycles" resources.  Like a library, it loans you a resource (computer) and returns it to the stacks when you're done.

Like all cloud services, Colab give you access to a DIFFERENT machine living somewhere in Googlandia.  So, everything you do lives on *that* machine (not yours).  Click the folder icon in the lef navigation bar to see its file structure.

When the loaner machine gets returned, it resets and erases everything you did.  You will find this a bit vexing if you had planned to turn that in as homework.

One solution is to mount your google drive to this loaner machine so you can access files on it and save your work to it.  See below.

### GitHub Classroom

GitHub created a tool for teachers that let's me create a repo and then create copies for each of you.  My repo for this assignment is data-science-20-21-hwk01 and yours is data-science-20-21-hwk01-yourusername.

You will work in your repo and push your progress.  I have access to it and will grade directly from your repo - no need to "turn in" anything.

Repos are public by default, but I'll make your repos private so only you and I can see them.  This complicates thing a bit.

Another complication is that I may update assignments after posted to clarify questions or fix errors.  You'll want to pull these updates without overwriting your own work.

### Workflow

Git allows "branches" of the same repo/project so multiple users can work without overwriting each other.  The original branch is called the "master".  I suggest you reserve that branch to stay in sync with my original and simply create and work on another branch called "working".  When I grade, I grade this "working" branch and may send comment back via a branch called "feedback."

### "Open in Colab"

This year, Colab & GitHub learned to play nicely with each.  If you opened this file in your GitHub repo, you may see the "Open in Colab" button on top.  This is SOOO handy.

Conversely, it is easy to push & pull individual files to GitHub from Colab using "File → Save a copy in GitHub" and "File → open notebook → GitHub".

***WARNING: SELECT THE CORRECT BRANCH (usually "working")*** - If you push to "master" (default), your work may get overwritten if you sync changes from MY original later.

That works for individual files, but we often have multiple files (data, images, output, etc).  Plus, we want to save everything on your Google drive, not the Colab instance's file system because that erases when the instance recycles.

The code below is my attempt to make things easier to manage.  It attempts to automatically mount your Google Drive to the Colab instance, clone the repo (if needed), give functions to push and pull btw google drive & GitHub, and sync changes I make to my original to your master branch.

Catches:
- May need to retype "preferred_path" more than once
- Must create & securely store a GitHub [personal access token](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token)
    - From GitHub, click your face → Settings → Developer settings → Personal access tokens → Generate new token → check "repo" → store securely (I like KeePass, but any manager will suffice)
    - Token might [auto-delete](https://github.community/t/personal-access-token-deleting-itself/13955) if you're not careful.  No problem - just create another one.
- (complicated thing coming) If you "Open in colab" this file, clone/pull the repo it came from, edit the notebook, and push() to repo ... nothing will change (probably).  Your edits here will (likely) not show up on the repo.
    - Why? When you "open in colab", a temporary copy of the notebook is created.  That's the one you've edited.  When you mount google drive & clone/pull the repo, that brings in *another* copy of this file which you are *not* editing.  When you push, that *other* copy gets pushed back to GitHub, not the copy you edited.
    - This shouldn't be an issue b/c I don't *think* you'll need to edit this file.
    - But if you do need to edit this file, use the alternate way to push a single to GitHub described above: "File → Save a copy in GitHub".
        - Again - make sure you push to the correct branch (working).

In [1]:
preferred_path = f"My Drive/active/m53646-data-science-20-21/as-student/hwk"   # your preferred google drive path
hwk_num = "01"

####### I don't think you'll need to change anything below this line #######

import os
import google.colab
import getpass

# Mount your Google Drive to this Colab instance
root_path = "/content/drive/"
google.colab.drive.mount(root_path)

# Get you github credentials
user  = getpass.getpass(prompt="Enter your github user name")
token = getpass.getpass(prompt="Enter your github access token")
email = getpass.getpass(prompt="Enter your email address (optional)")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive/
Enter your github user name··········
Enter your github access token··········
Enter your email address (optional)··········


In [3]:
def push(msg=f"enter commit message here"):
    %cd "{local_path}"
    ! git add .  # Add any new files
    ! git commit -a -m "{msg}"  # Commit changes
    ! git push  # push changes to origin's working branch


def pull():
    %cd "{orig_path}"
    ! git pull
    %cd "{local_path}"
    ! git pull


! git config --global user.name "{user}"    # set git info
! git config --global user.email "{email}"  # set git info

# recall the terminology
# origin = repo living on Github
# local  = repo living on the machine you're working on (Colab instance, your laptop, your mom's computer, etc)

orig_name  = f"data-science-20-21-hwk{hwk_num}"
local_name = orig_name
if user != "drscook" and user != "":
    local_name += f"-{user}"  # appends your username so you push/pull your repo, not Cook's original

orig_url  = f"https://github.com/Tarleton-Math/{orig_name}.git"  # url to Cook's original public GitHub repo
local_url = f"https://{user}:{token}@github.com/Tarleton-Math/{local_name}.git"  # url to your private GitHub repo

orig_path  = os.path.join(root_path, preferred_path, orig_name)  # path to folder containing your local repo
local_path = os.path.join(root_path, preferred_path, local_name)  # path to folder containing your local repo

os.makedirs( orig_path, exist_ok=True)  # create repo_path if necessary
os.makedirs(local_path, exist_ok=True)  # create repo_path if necessary

%cd "{orig_path}"
! git clone "{orig_url}" .  # clone origin to a local repo in repo_path (if necessary)

%cd "{local_path}"
! git clone "{local_url}" .  # clone origin to a local repo in repo_path (if necessary)

pull()  # pull changes from origin to local (does nothing the first time, but gets changes made later)

/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01
fatal: destination path '.' already exists and is not an empty directory.
/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
fatal: destination path '.' already exists and is not an empty directory.
/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01
Already up to date.
/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
Already up to date.


In [28]:
# recall the terminology
# origin = repo living on Github
# local  = repo living on the machine you're working on (Colab instance, your laptop, your mom's computer, etc)

hwk_name = f"data-science-20-21-hwk{hwk_num}"
repo_name = hwk_name
if user != "drscook" and user != "":
    repo_name += f"-{user}"  # appends your username so you push/pull your repo, not Cook's original
repo_path = os.path.join(root_path, preferred_path, repo_name)  # path to folder containing your local repo
repo_url = f"https://{user}:{token}@github.com/Tarleton-Math/{repo_name}.git"  # url to your private GitHub repo
original_url = f"https://github.com/Tarleton-Math/{hwk_name}.git"  # url to Cook's original public GitHub repo

def push(msg=f"enter commit message here"):
    %cd "{repo_path}"
    ! git checkout working  # switch to working branch
    ! git add .  # Add any new files
    ! git commit -a -m "{msg}"  # Commit changes
    ! git push origin working  # push changes to origin's working branch

def sync_upstream():
    # get & synch changes from Cook's original to your master branch
    # this overwrites all changes you've made to your master branch
    # that's why you should always work on your "working" branch
    %cd "{repo_path}"
    ! git pull origin master # pull changes from origin to local
    ! git checkout master  # switch to master branch
    ! git fetch upstream  # pull changes from Cook's original
    ! git reset --hard upstream/master  # force those change onto local's master branch
    ! git push -f origin master  # force those changes up to origin's master branch
    ! git checkout working  # switch to working branch

def pull(sync_original=False):
    %cd "{repo_path}"
    ! git pull origin working  # pull changes from origin to local
    ! git checkout working  # switch to working branch
    if sync_original:
        sync_upstream()


os.makedirs(repo_path, exist_ok=True)  # create repo_path if necessary
%cd "{repo_path}"
! git config --global user.name "{user}"    # set git info
! git config --global user.email "{email}"  # set git info
! git clone "{repo_url}" .  # clone origin to a local repo in repo_path (if necessary)
! git remote add upstream "{original_url}"  # add Cook's original repo as a "remote" named upstream
! git checkout -b working master  # create (if necessary) and switch to a branch named "working"
pull()  # pull changes from origin to local (does nothing the first time, but gets changes made later)

/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
fatal: destination path '.' already exists and is not an empty directory.
fatal: remote upstream already exists.
fatal: A branch named 'working' already exists.
/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
From https://github.com/Tarleton-Math/data-science-20-21-hwk01-scook4242
 * branch            working    -> FETCH_HEAD
Already up to date.
Already on 'working'


In [31]:
sync_upstream()

/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
From https://github.com/Tarleton-Math/data-science-20-21-hwk01-scook4242
 * branch            master     -> FETCH_HEAD
Already up to date.
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
HEAD is now at 6ee016d Created using Colaboratory
Everything up-to-date
Switched to branch 'working'


In [33]:
pull(sync_original=False)

/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
From https://github.com/Tarleton-Math/data-science-20-21-hwk01-scook4242
 * branch            working    -> FETCH_HEAD
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.
README.md: needs merge
error: you need to resolve your current index first


In [24]:
push()

/content/drive/My Drive/active/m53646-data-science-20-21/as-student/hwk/data-science-20-21-hwk01-scook4242
Already on 'working'
On branch working
nothing to commit, working tree clean
Everything up-to-date


In [None]:
pull()