# Tools for Data Pipeline Automation

Our team uses a variety of tools to automate data pipelines. This document covers four main tools (Git Bash, Github Desktop, Miniconda, and Task Scheduler) and how they are used together to manage virtual environments, handle version control, and automate data pipelines.

# Table of Contents

1. <a href="#Git-Bash">Git Bash</a>  
    1.a. <a href="#How-to-Install-Git-and-Bash">How to Install Git and Bash</a>  
    1.b. <a href="#How-to-Install-Github-Desktop">How to Install Github Desktop</a>  
    1.c. <a href="#Github-Repositories">Github Repositories</a>  
2. <a href="#Miniconda">Miniconda</a>  
    2.a. <a href="#How-to-Install-Miniconda">How to Install Miniconda</a>  
    2.b. <a href="#How-to-Create-a-Conda-Virtual-Environment">How to Create a Conda Virtual Environment</a>
3. <a href="#Running-Code-Within-an-Environment">Running Code Within an Environment</a>  
    3.a. <a href="#Use-a-Batch-File-to-Run-Code-in-an-Environment">Use a Batch File to Run Code in an Environment</a>  
4. <a href="#Appendix">Appendix</a>  
    4.a. <a href="#Jupyter-Lab">Jupyter Lab</a>  
  

# Git Bash

Git Bash is a Unix shell emulator that lets you run bash commands on Windows as if you were working on a Linux machine. It also provides the git command line on Windows. For example, you can clone a repository by typing `git clone [repository URL]` at the shell prompt. For more information, follow [this link](https://www.geeksforgeeks.org/working-on-git-bash/).

## How to Install Git and Bash

Please follow the directions [here for installing Git Bash on Windows.](https://www.makeuseof.com/install-git-git-bash-windows/)

## How to Install Github Desktop

If your machine does not already have Github Desktop, [please follow these directions to install it](https://docs.github.com/en/desktop/installing-and-authenticating-to-github-desktop/installing-github-desktop). 

## Github Repositories

A Github repository is a place where you can store your code, files, and each file's revision history on the Github cloud. Repositories can have multiple collaborators and can be either public or private.

Copies of Github repositories can be pulled from Github onto your local machine (local repository). You can edit and add changes to your local repository and then push your changes back to the Github cloud (remote repository). Other collaborators can then pull the latest version of the remote repository to add their changes. 
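
The pull, edit, and push cycle described above can be sketched with plain git commands in Git Bash. This is a minimal illustration rather than the team's actual workflow: a local bare repository stands in for the Github remote, and the collaborator names (Alice, Bob), the email address, and the `pipeline.py` file are all hypothetical.

```shell
# Throwaway demo: a local bare repository stands in for the Github remote.
demo_dir="$(mktemp -d)"
git init --quiet --bare "$demo_dir/remote.git"

# Hypothetical identity so the demo commits work without any global git config.
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

# Alice clones the (empty) remote, adds a file, commits, and pushes.
git clone --quiet "$demo_dir/remote.git" "$demo_dir/alice" 2>/dev/null
cd "$demo_dir/alice"
echo "print('step 1')" > pipeline.py
git add pipeline.py
git commit --quiet -m "Add pipeline script"
git push --quiet origin HEAD

# Bob clones the remote and receives Alice's first commit.
git clone --quiet "$demo_dir/remote.git" "$demo_dir/bob"

# Alice edits the file and pushes the change to the remote...
echo "print('step 2')" >> pipeline.py
git commit --quiet -am "Extend pipeline script"
git push --quiet origin HEAD

# ...and Bob pulls to bring his local repository up to date.
cd "$demo_dir/bob"
git pull --quiet --ff-only
cat pipeline.py
```

With a real Github repository, the only difference is that the clone URL points at Github instead of a local folder.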

<img src="files/onboarding_guide_1.png" width=700/>

### Add a Local Repository to Github

Follow these [directions](https://docs.github.com/en/enterprise-server@3.6/desktop/adding-and-cloning-repositories/adding-a-repository-from-your-local-computer-to-github-desktop) if you already have files on your local machine that you would like to turn into a Github Repository.

First, make sure all the files you want to add to the repository are in the same folder. Use a file path for your local repository that every team member can replicate, so that code does not break because collaborators' file paths differ. We recommend storing local repositories in the following location: `"C:\Users\[USER NAME]\GitHub\[REPOSITORY NAME]"`

### Clone a Repository from Github

Follow these [directions](https://docs.github.com/en/desktop/adding-and-cloning-repositories/cloning-and-forking-repositories-from-github-desktop#cloning-a-repository) if there is an existing remote Github repository that you would like to copy (or clone) to your local machine. 

Whenever you want to collaborate on a Github repo, you must 'map' the remote repository to a local repository on your machine by cloning it. Use a file path for your local repository that every team member can replicate, so that code does not break because collaborators' file paths differ. We recommend storing local repositories in the following location: `"C:\Users\[USER NAME]\GitHub\[REPOSITORY NAME]"`
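
For anyone cloning from Git Bash instead of Github Desktop, the recommended location can be built from the clone URL. The sketch below assumes that `$HOME` in Git Bash points at your `C:\Users\[USER NAME]` folder (the usual default); the repository URL and both helper function names are hypothetical. It only prints the `git clone` command so you can review it before running it.

```shell
# Derive the repository name from a clone URL (strips a trailing ".git").
repo_name_from_url() {
  basename "$1" .git
}

# Build the recommended local path: <home folder>/GitHub/<repository name>.
local_repo_path() {
  printf '%s/GitHub/%s\n' "$HOME" "$(repo_name_from_url "$1")"
}

# Hypothetical repository URL; replace with the URL of your remote repository.
repo_url="https://github.com/example-org/example-repo.git"

# Print the clone command for review instead of running it directly.
echo "git clone $repo_url $(local_repo_path "$repo_url")"
```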

# Miniconda

## How to Install Miniconda

1. Navigate to [this link](https://docs.conda.io/projects/miniconda/en/latest/) and select the Windows 64-bit installer.
2. You should see a "Welcome to Miniconda3" popup. Select "Next" until you reach the "Choose Install Location" window. 
    - To match the team's legacy Miniconda setup, change the destination folder to `C:\Users\[USER NAME]\Miniconda3`
        - Make sure the M is capitalized in Miniconda3. This is important for running legacy code.  
        <img src="files/onboarding_guide_2.png" width=500/>
3. On the next page, select the second checkbox and press 'Install'. <br>  
    <img src="files/onboarding_guide_3.png" width=500/>

## How to Create a Conda Virtual Environment

Virtual environments in Python and R store specified versions of the modules and packages needed to run a particular program. By using virtual environments, you can "freeze" the packages you need at a point in time, so that later package updates do not affect your program. Miniconda is a minimal installer for conda, a package and environment manager that works with both R and Python.

### Create a YAML file 

YAML files can be used to define virtual environment specifications. You can create a YAML file using any text editor. When saving, give the file the .yml extension.

#### Example 1: R virtual environment

This environment will have all the R packages listed under `dependencies` and will use the `conda-forge` channel to download the R packages. 

*Read [here](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html#what-is-a-conda-channel) for more information on Conda Channels. See which packages/modules are available on which conda channels [here.](https://anaconda.org/anaconda/repo)*
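
A file matching the Example 1 description might look like the sketch below, written from Git Bash with a heredoc. The file name `r_environment.yml`, the environment name `Rtransform23`, and the package list are illustrative assumptions, not the team's actual environment; replace them with the packages your program needs.

```shell
# Write a hypothetical R environment file to the current directory.
cat > r_environment.yml <<'EOF'
name: Rtransform23
channels:
  - conda-forge
dependencies:
  - r-base
  - r-dplyr
  - r-readr
EOF

# Show the file that was written.
cat r_environment.yml
```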

#### Example 2: Python virtual environment

This environment will have all the Python modules listed under `dependencies` and will use the default channel to download the Python modules. It will use Python version 3.9.
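
A file matching the Example 2 description might look like the following sketch. The environment name `transform22` and the two packages are illustrative assumptions; only the Python 3.9 pin and the default channel come from the description above.

```shell
# Write a hypothetical Python environment file to the current directory.
cat > environment.yml <<'EOF'
name: transform22
channels:
  - defaults
dependencies:
  - python=3.9
  - pandas
  - requests
EOF

# Show the file that was written.
cat environment.yml
```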

For more information on creating YAML files, see [how to create a virtual environment .yml file manually](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-file-manually).

### Create a Virtual Environment From a YAML file

In this example, we will use the Git Bash terminal to manage virtual conda environments.

<img src="files/onboarding_guide_4.png" width=500/>

**In Git Bash:**

1. Save the .yml file to your directory of choice and then navigate to that directory using the command `cd [file path]`

2. Create the environment from the `environment.yml` file:  
`conda env create -f environment.yml`

3. Activate the new environment:  
`conda activate myenv`

4. Verify the new environment was installed:  
`conda env list`

<img src="files/onboarding_guide_5.png" width=500/>  

* The asterisk next to the transform22 environment in the list of environments indicates that it is the active environment   
* The environment takes its name from the `name:` field inside the .yml file, not from the file name
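
The steps above can be sketched as a short Git Bash script. The helper `env_name_from_yaml` is a hypothetical convenience function that reads the `name:` field from the environment file, so you know which name to pass to `conda activate`. The conda commands only run if `environment.yml` exists in the current directory and conda is on your path.

```shell
# Print the environment name declared in a conda environment file.
# Strips carriage returns in case the file was saved with Windows line endings.
env_name_from_yaml() {
  sed -n 's/^name:[[:space:]]*//p' "$1" | tr -d '\r' | head -n 1
}

yaml_file="environment.yml"

if [ -f "$yaml_file" ] && command -v conda >/dev/null 2>&1; then
  # Create the environment, then list all environments to verify it exists.
  conda env create -f "$yaml_file"
  conda env list
  echo "Activate with: conda activate $(env_name_from_yaml "$yaml_file")"
fi
```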

# Running Code Within an Environment

## Use a Batch File to Run Code in an Environment

### Create a batch file

Batch files allow you to activate an environment and run code from a single file. To create a batch file, open any text editor and, using the examples below, add your directory information, environment name, and file name. Then save the file with the .bat extension.

**Example 1: Batch file to run a Python program**

Line one is where your activate.bat file is stored. This file is automatically added during Miniconda installation and should be in the same location as this example.  
Line two activates the 'transform22' environment. Replace with the name of your desired environment.  
Line three calls the Python file to be run. Replace with the file path to your Python file.  
Line four will then deactivate the environment after the code has run. 
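
A four-line batch file matching this description might look like the sketch below, written here from Git Bash with a heredoc (you can equally type the four lines into a text editor). The `activate.bat` location assumes Miniconda was installed to `C:\Users\[USER NAME]\Miniconda3` as recommended earlier; the batch file name and `my_script.py` are hypothetical.

```shell
# Write a hypothetical batch file that runs a Python program in 'transform22'.
cat > run_transform22.bat <<'EOF'
call "%USERPROFILE%\Miniconda3\Scripts\activate.bat"
call conda activate transform22
python "%USERPROFILE%\GitHub\[REPOSITORY NAME]\my_script.py"
call conda deactivate
EOF

# Show the file that was written.
cat run_transform22.bat
```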

**Example 2: Batch file to run an R program**

Line one is where your activate.bat file is stored. This file is automatically added during Miniconda installation and should be in the same location as this example.  
Line two activates the 'Rtransform23' environment. Replace with the name of your desired environment.  
Line three calls the R file to be run. Replace with the file path to your R file.  
Line four will then deactivate the environment after the code has run. 
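
The R version differs only in the environment name and in line three, which uses `Rscript` to run the program. As before, this is a sketch: the Miniconda location is assumed, and the batch file name and `my_script.R` are hypothetical.

```shell
# Write a hypothetical batch file that runs an R program in 'Rtransform23'.
cat > run_Rtransform23.bat <<'EOF'
call "%USERPROFILE%\Miniconda3\Scripts\activate.bat"
call conda activate Rtransform23
Rscript "%USERPROFILE%\GitHub\[REPOSITORY NAME]\my_script.R"
call conda deactivate
EOF

# Show the file that was written.
cat run_Rtransform23.bat
```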

### How to run your batch file

**Using the Git Bash terminal**

Although simply double-clicking the batch file will run it, we recommend running it from within your Git Bash terminal, as this lets you see output and error messages while the code runs. 

1. Navigate to the directory containing your batch file:  
`cd [filepath]`  

2. Run your batch file:  
`./[batch file name].bat`
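
If you also want to keep the output for later review, one option is to copy it into a log file while it prints to the terminal. The `run_and_log` helper below is a hypothetical sketch, not part of Git Bash; it relies on bash's `PIPESTATUS` to report the exit status of the command itself rather than of `tee`. The batch file and log file names are placeholders.

```shell
# Run a command, show its output and errors live, and save a copy to a log file.
# Usage: run_and_log <log file> <command> [arguments...]
run_and_log() {
  local log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
  # Return the exit status of the command, not of tee.
  return "${PIPESTATUS[0]}"
}

# Example: run a (hypothetical) batch file if it exists in the current directory.
if [ -f ./run_transform22.bat ]; then
  run_and_log run_transform22.log ./run_transform22.bat
fi
```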

### Use Task Scheduler to Run Your Batch File Automatically

1. Open Windows Task Scheduler. Under the 'Action' menu, select 'Create Task'.<br>  
<img src="files/onboarding_guide_6.png" width=300/>  
2. In the task popup, enter a name for the job and select the check box for 'Run with highest privileges'.<br>    
<img src="files/onboarding_guide_7.png" width=500/> 
3. Under 'Triggers', set the desired time and frequency for running the job.<br>
<img src="files/onboarding_guide_8.png" width=500/>
4. Under 'Actions', add the path to the batch file or script for the task to run. In the 'Start in' box, add the path to the parent folder where the script is stored (excluding the actual file name). Leave the final backslash.<br>
<img src="files/onboarding_guide_9.png" width=500/>
5. Close Windows Task Scheduler. 
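
As an alternative to the graphical steps, Windows also ships a `schtasks` command-line tool. The sketch below only prints a command for a daily task so you can review it before running it; the helper name, task name, start time, and batch file path are hypothetical. `MSYS_NO_PATHCONV=1` stops Git Bash from rewriting the `/Create`-style switches as file paths. The sketch does not set the 'Start in' folder from step 4, and `/RL HIGHEST` (the equivalent of 'Run with highest privileges') may require a terminal opened as administrator.

```shell
# Print a schtasks command that registers a daily task for a batch file.
# Arguments: task name, Windows path to the batch file, start time (HH:MM).
print_schtasks_command() {
  printf 'MSYS_NO_PATHCONV=1 schtasks /Create /TN "%s" /TR "%s" /SC DAILY /ST %s /RL HIGHEST\n' "$1" "$2" "$3"
}

# Hypothetical values; replace the placeholders before running the printed command.
print_schtasks_command "transform22_daily" \
  'C:\Users\[USER NAME]\GitHub\[REPOSITORY NAME]\run_transform22.bat' \
  "06:00"
```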

# Appendix

## Jupyter Lab

Jupyter Lab is a useful tool for editing and running both R and Python Code. Although not required for automating data pipelines, it is a great way to run code within various Conda environments. 

### Launch Jupyter Lab Within an Environment

**Option 1: In Git Bash**

1. Activate the new environment:  
`conda activate myenv`  

2. Manually install jupyter into your new environment. This is required to launch jupyter lab from that environment:  
`conda install jupyter`

3. Navigate to your directory that has the python files you want to work in:  
ex. `cd C:/Users/ekp0303/Github/redcap_api_test`  
<img src="files/onboarding_guide_14.png" width=500/> <br>
    *In this example "(Caitlin_edits)" is the current Git branch you are in for that repo*  

4. Launch Jupyter Lab:  
`jupyter lab`

**Option 2: In Anaconda Navigator**  

1. Open Anaconda Navigator. Navigator ships with the full Anaconda distribution; on a Miniconda-only install it may first need to be added with `conda install anaconda-navigator`.  

2. Select the environment you want in the top left drop-down menu.  

3. Launch Jupyter Lab. *If Jupyter Lab is not yet installed in the environment you selected, the button will say "install" instead of "launch". Install Jupyter Lab, then launch it.*  

<img src="files/onboarding_guide_10.png" width=900/>

### Use multiple environments within one Jupyter Lab instance

* The `nb_conda_kernels` module allows you to select which environment your kernel runs in from within Jupyter Lab. [Read the documentation here.](https://github.com/Anaconda-Platform/nb_conda_kernels)
* You must launch Jupyter Lab from Git Bash to use `nb_conda_kernels` features

**In Git Bash**

1. Install nb_conda_kernels from within your 'base' environment:   
`conda activate base`  
`conda install nb_conda_kernels`

2. Navigate to your directory that has the python files you want to work in:  
ex. `cd C:/Users/ekp0303/Github/redcap_api_test`  

3. Launch Jupyter Lab:  
`jupyter lab`  
*You should now see a kernel for each environment.*<br>  
<img src="files/onboarding_guide_11.png" width=600/>

4. If you want to run a file with a certain environment, open the program you want to run. From within the program, right click and select "Create Console for Editor".
<img src="files/onboarding_guide_12.png" width=700/>  

5. Select the kernel in the environment you want.
<img src="files/onboarding_guide_13.png" width=700/>

6. If an environment is not showing in Jupyter Lab, install ipykernel into that environment from Git Bash (replace `python_env` with the environment's name):  
`conda install -n python_env ipykernel`

### Faster loading of R environments

Initializing a Conda R environment for the first time can be time-consuming. As a workaround, run the following commands in order from your base environment in Git Bash:

`conda update -n base conda`  
`conda install -n base conda-libmamba-solver`  
`conda config --set solver libmamba`

Now you should be able to set up an environment from a .yml file faster.

Source: see the response from [Mike Battaglie](https://stackoverflow.com/questions/53250933/conda-takes-20-minutes-to-solve-environment-when-package-is-already-installed)  
*Conda plans to make this solver the default moving forward, so this step will hopefully not be necessary soon.*