# Recipes Modules: 
* `Recipes/jupyter_utils.py` 
* `Recipes/git_info_exclude.py`

Please consult `Recipes/jupyter_utils.py` for other functions more specific to the Jupyter environment.  
I just want to highlight these two, which help determine whether the Python code is running in a notebook:  

* `test_ipkernel`:
```
import sys


def test_ipkernel(verbose=False):
    found = 'ipykernel_launcher.py' in sys.argv[0]
    if verbose:
        verb = 'IS' if found else 'IS NOT'
        msg = f'Code *{verb}* running in Jupyter platform (notebook, lab, etc.)'       
        print(msg)
    return found
```
Note: I've only tested the above function in JupyterLab and VS Code; it might not work in Spyder.

* `is_lab_notebook`:
```
def is_lab_notebook():
    import re
    import psutil

    return any(re.search('jupyter-lab-script', x)
               for x in psutil.Process().parent().cmdline())
``` 
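
As a usage sketch, the `sys.argv[0]` check can drive a small branch on the environment. The names `running_under_ipykernel` and `describe_environment`, and the `argv0` parameter, are hypothetical (they are not part of `jupyter_utils.py`); the parameter only exists so the check can be exercised outside a notebook.

```python
import sys


def running_under_ipykernel(argv0=None):
    """Same test as test_ipkernel, with the launcher path injectable."""
    if argv0 is None:
        argv0 = sys.argv[0]
    return 'ipykernel_launcher.py' in argv0


def describe_environment(argv0=None):
    """Return 'notebook' under an ipykernel launcher, else 'script'."""
    return 'notebook' if running_under_ipykernel(argv0) else 'script'


if __name__ == '__main__':
    # Uses the real sys.argv[0] of the current process.
    print(describe_environment())
```

Like the original function, this only inspects the launcher script name, so the same caveat about untested front ends applies.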

---
---
# Updating a repo's `.git/info/exclude` file to exclude large files
---

## Two possible ways, only one taken...

There are two ways to exclude files (for whatever reason) within a folder that is a git repo:  
1. Use a `.gitignore` file
2. Use the `exclude` file in the `.git/info` folder

I have chosen the second option to implement the automatic exclusion of files larger than the GitHub limit (currently 100 MB).  
The reasons are logic and portability:  
* The `.git` folder is unique, at least under _normal usage_ of code versioning with git, and its `info/exclude` file won't work from any other folder.  
* The `.gitignore` file can be copied into multiple subfolders within a repo and amended differently in each; the desired outcome might be achieved, but it depends on your keeping track of the precedence hierarchy git uses to apply the exclusions. Here is a small portion of what the official [Git docs say](https://git-scm.com/docs/gitignore):
>When deciding whether to ignore a path, Git normally checks gitignore patterns from multiple sources, with the following order of precedence, from highest to lowest (within one level of precedence, the last matching pattern decides the outcome):

> * Patterns read from the command line for those commands that support them.
> * Patterns read from a `.gitignore` file in the same directory as the path, or in any parent directory, with patterns in the higher level files (up to the toplevel of the work tree) being overridden by those in lower level files down to the directory containing the file. These patterns match relative to the location of the `.gitignore` file. A project normally includes such `.gitignore` files in its repository, containing patterns for files generated as part of the project build.
> * Patterns read from `$GIT_DIR/info/exclude`.
> * Patterns read from the file specified by the configuration variable `core.excludesFile`.
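
Rather than tracing this hierarchy by hand, git can report which source and pattern decide the outcome for a given path through `git check-ignore -v`. Below is a minimal sketch wrapping that command; the function name `which_rule_ignores` is hypothetical, and it assumes `git` is on the `PATH`.

```python
import shutil
import subprocess


def which_rule_ignores(repo, rel_path):
    """Return git's `check-ignore -v` report for rel_path, or None.

    The report reads '<source>:<line number>:<pattern>', then a tab,
    then the path. None means that no pattern ignores the path, that
    the folder is not a repo, or that git is not installed.
    """
    if shutil.which('git') is None:
        return None
    result = subprocess.run(
        ['git', 'check-ignore', '-v', str(rel_path)],
        cwd=str(repo), capture_output=True, text=True,
    )
    # Exit code 0: the path is ignored; 1: it is not; 128: fatal error.
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == '__main__':
    # 'data/big_file.csv' is a placeholder path for illustration.
    print(which_rule_ignores('.', 'data/big_file.csv'))
```

If the reported source is `.git/info/exclude`, the exclusion comes from the second option described above rather than from a `.gitignore` file.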

---
## Implementation:

I only use `pathlib.Path` and the `sys` module, both from the standard library. The latter is used to refine the update message when the code is run from a Jupyter notebook (lab or classic).

I have written 3 functions:

1. `find_bigfiles(data_folder, gte_size=100, verbose=False)`
```
    Return a list of file paths (starting from the data_folder.name part)
    for all files of at least gte_size MB.
```
2. `is_git_repo(folder_path)`
```
    Return True if folder contains a .git folder, else False.
```
3. `update_git_info_exclude(top_folder_path, data_folder_name, gte_size=100)`
```
    Update the $GIT_DIR/info/exclude file with the paths of
    the big files found, so that a separate .gitignore file
    stays generic (portable).
```
---
## Module: `git_info_exclude.py`

Although I am going to add the functions to my JupyterLab notebook template, I've also put them in a module, `Recipes/git_info_exclude.py`:

```
# git_info_exclude.py

from pathlib import Path
import sys


def is_git_repo(folder_path):
    """
    Return True if folder contains a .git folder
    """
    return Path(folder_path).joinpath('.git').is_dir()


def find_bigfiles(data_folder, gte_size=100, verbose=False):
    """
    Return a list of file paths (starting from the
    data_folder.name part) for all files of at least
    gte_size MB.
    gte_size: multiple of 1 MB => 100 MB default.
           :: github file size limit = 100 * 2^20
    """
    if gte_size == 0:
        print('No file size limit: gte_size=0.')
        # return an empty list so callers can still test its length
        return []

    mb = 2**20  #kb = 2**10   #gb = 2**30
    size_limit = gte_size * mb
    
    out = []
    p = Path(data_folder)
    
    # output only parts from data_folder name, i.e.: data/file.csv
    # rglob('*') also catches files without an extension; skip directories
    for f in p.rglob('*'):
        if not f.is_file():
            continue
        size = f.stat().st_size
        if size >= size_limit:
            # path relative to data_folder's parent, with forward slashes
            out.append(f.relative_to(p.parent).as_posix())

            if verbose:
                print(f, '\t', size)
    return out


def update_git_info_exclude(top_folder_path,
                            data_folder_name,
                            gte_size=100):
    """Update the $GIT_DIR/info/exclude file with path of
       found big files, so that separate .gitignore file
       stays generic (portable).
    """
    repo = Path(top_folder_path)
    if not is_git_repo(repo):
        msg = f'Folder: {top_folder_path} is not a repo.'
        msg += '\nType `>git init .` to initialize.'
        print(msg)
        return

    data_folder = repo.joinpath(data_folder_name)
    bigones = find_bigfiles(data_folder, gte_size=gte_size)
    if not bigones:  # also covers a None result
        print(f'No big files (>= {gte_size} MB) found in {data_folder}.')
        return
    
    git_exclude = repo.joinpath('.git', 'info', 'exclude')

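    with git_exclude.open(mode='r+') as fh:
        # reading to the end leaves the position at EOF, so writes append
        content = [line.strip('\n') for line in fh.readlines()]
        added = [fname for fname in bigones if fname not in content]
        for fname in added:
            fh.write('\n' + fname)

    if not added:
        print('.git/info/exclude already lists all big files found.')
        return

    msg = 'Updated .git/info/exclude file.\n'
    if 'ipykernel_launcher.py' in sys.argv[0]:
        msg += 'Enter `%load .git/info/exclude` in a cell to verify.'
    print(msg)
    return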
```

# Example (empty data folder)

In [1]:
from pathlib import Path

import git_info_exclude

In [2]:
repo = Path.cwd()

git_info_exclude.update_git_info_exclude(repo, 'data')

Folder: C:\Users\catch\Documents\GitHub\Jupyter_Sphere\Recipes is not a repo.
Type `>git init .` to initialize.


Uncomment the next cell to verify the file:

In [None]:
#%load .git/info/exclude
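
The run above stops at the "not a repo" message, so it never reaches the update step. The following self-contained sketch exercises that step in a temporary folder. It is a condensed stand-in for the module, not the module itself: the names `big_files`, `append_to_exclude` and `demo` are hypothetical, the `.git/info/exclude` file is created by hand instead of through `git init`, and the threshold is lowered to 1 KB so that no large file has to be written.

```python
import tempfile
from pathlib import Path


def big_files(data_folder, gte_mb):
    """List files of at least gte_mb MB, as paths starting at data_folder's name."""
    data_folder = Path(data_folder)
    limit = gte_mb * 2**20
    return sorted(
        f.relative_to(data_folder.parent).as_posix()
        for f in data_folder.rglob('*')
        if f.is_file() and f.stat().st_size >= limit
    )


def append_to_exclude(repo, paths):
    """Append the paths missing from .git/info/exclude; return those added."""
    exclude = Path(repo, '.git', 'info', 'exclude')
    text = exclude.read_text()
    existing = text.splitlines()
    added = [p for p in paths if p not in existing]
    if added:
        # Start on a fresh line if the file does not end with a newline.
        prefix = '' if (not text or text.endswith('\n')) else '\n'
        with exclude.open('a') as fh:
            fh.write(prefix + '\n'.join(added) + '\n')
    return added


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        # Stand-in for `git init`: create only the file this sketch needs.
        (repo / '.git' / 'info').mkdir(parents=True)
        (repo / '.git' / 'info' / 'exclude').write_text('# local excludes\n')
        data = repo / 'data'
        (data / 'raw').mkdir(parents=True)
        (data / 'small.csv').write_bytes(b'x' * 10)
        (data / 'raw' / 'big.bin').write_bytes(b'x' * 2048)

        found = big_files(data, gte_mb=1 / 1024)   # 1 KB threshold for the demo
        first = append_to_exclude(repo, found)     # adds the big file
        second = append_to_exclude(repo, found)    # nothing new the second time
        return found, first, second


if __name__ == '__main__':
    print(demo())
```

Calling the update twice shows that paths already listed are not appended again.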