# Intro to Bash & Bioinformatics: Start of Your Wizard Journey

Tired of having to farm out your gene seqs to Todd down the street? Never fear, bioinformatics is here! By following along through this notebook, you can learn how to take your biological data from raw to processed awesomeness.

# Chapters:

### Ch. 1: Bash, Terminal, and Becoming a Terminal Ninja 🥷

### Ch. 2: So What, How Does This Help in Bioinformatics?

### Ch. 3: Putting it All Together: A Crash Course in Metabolite Prediction!
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# <u> Ch. 1: Bash, Terminal, and Becoming a Terminal Ninja </u> 🥷 
Traditionally, we biologists try to spend as *little* time in front of a computer as possible. However, in order to keep ourselves employed in the face of the terror that our AI overlords will eventually inflict upon us, or just to stick it to Todd down the street, learning how to use your terminal is <span style="color: #FFAD69;">**CRUCIAL**</span>.
<br><br>
Given the fact that we are *mainly* going to be talking about bash, there will be very little "code" within this section, but lots of things that we can do in our terminals.
<br><br>
## <u>Bash basics: How to Move Around</u>

### How to move around in your terminal

`cd YOUR_FOLDER` ## Move into a folder

`cd ../` ## Move back ONE folder
<br>

`cd ../../` ## Move back TWO folders

### How to make a directory
`mkdir YOUR_NEW_DIR`

### How to see what is inside a dir
`ls` ## Lists the dir you are currently in; use `ls YOUR_FOLDER` to look inside a different one
<br>

`ls -lah` ## List things inside AND see metadata about it
<br>

`## NOTE: Anything that starts with "-" is called a flag, and this tells your computer, HEY this is a special instruction I want you to do. Here "-lah" bundles three flags: -l, -a, and -h`
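
To see the commands above working together, here is a small practice session you can paste into your terminal. It is only a sketch: it works inside a throwaway folder made with `mktemp -d`, so nothing in your real folders is touched, and the names `project`, `raw_data`, and `notes.txt` are made up for this example.

```shell
# Make a scratch folder so we cannot break anything real
SCRATCH="$(mktemp -d)"
cd "$SCRATCH"

# Make a directory, then a directory inside it
mkdir project
mkdir project/raw_data

# Move into the nested folder and check where we are
cd project/raw_data
pwd                      # prints the full path of the current folder

# Move back TWO folders: we land in the scratch folder again
cd ../../

# Make an empty file, then list the folder three ways
touch project/notes.txt
ls project               # names only
ls -lah project          # long format (-l), hidden files (-a), human-readable sizes (-h)
ls -l -a -h project      # the same three flags written one at a time
```

The last two commands should print the same listing, which is a handy way to convince yourself that `-lah` is just shorthand.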

### There are <span style="color: #FFAD69;">***many***</span> kinds of flags, and we will use them intermittently throughout this course. Here is a list of some common ones for the `bash` command itself:


| Flag               | Description                                                                                       | Example                              |
|--------------------|---------------------------------------------------------------------------------------------------|--------------------------------------|
| -c string          | Reads commands from the provided string rather than from a file or standard input.               | `bash -c 'echo "Hello from Bash!"'`   |
| -i                 | Starts Bash in interactive mode, even if standard input is not a terminal.                       | `bash -i`                            |
| -l or --login      | Starts Bash as a login shell. Reads login-specific startup files (e.g., /etc/profile, ~/.bash_profile). | `bash -l`                          |
| -r or --restricted | Starts a restricted shell session, disabling certain functionalities (like changing directories or modifying `PATH`). | `bash -r`                      |
| -s                 | Tells Bash to read commands from standard input. Useful when supplying commands via pipe or redirection. | `cat script.sh \| bash -s`           |
| --posix            | Changes the behavior of Bash to more closely conform with the POSIX standard.                    | `bash --posix`                       |
| --help             | Displays help information about Bash options and then exits.                                   | `bash --help`                        |
| --version          | Outputs the version information for Bash and exits.                                            | `bash --version`                     |
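
A few of these flags are easy to try right away. The sketch below assumes nothing beyond Bash itself; `steps.sh` is a made-up file name, created in a temporary folder so it does not clutter your work.

```shell
# -c: run the commands given as a string
bash -c 'echo "Hello from Bash!"'

# -s: read commands from standard input (here, through a pipe)
WORKDIR="$(mktemp -d)"
printf 'echo "step 1"\necho "step 2"\n' > "$WORKDIR/steps.sh"
cat "$WORKDIR/steps.sh" | bash -s

# --version: print version information and exit (first line only)
bash --version | head -n 1
```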



# <u> Ch. 2: So What, How Does This Help in Bioinformatics? </u>

I will talk about this, but here is a brief summary:

Bash is a critical tool in bioinformatics because it streamlines the process of handling and processing large datasets, automates repetitive tasks, and effectively orchestrates complex workflows. For scientists, mastering Bash can significantly boost productivity and reproducibility in research. Here are some key reasons:

<span style="color: #FFAD69;">**Automation and Scripting:**</span> Bash allows you to write scripts that automate data preprocessing, file management, and the execution of other bioinformatics tools. This saves time and minimizes human error.

<span style="color: #FFAD69;">**Integration with Other Tools:**</span> The Unix-based environment is the backbone of many bioinformatics tools. Bash enables seamless integration of various command-line utilities, custom Python scripts, and other software, allowing you to build efficient data processing pipelines.

<span style="color: #FFAD69;">**Handling Big Data:**</span> Bioinformatics often involves working with large genome sequences and complex datasets. Bash’s ability to filter, search, and manipulate text files quickly is essential when dealing with such data.

<span style="color: #FFAD69;">**Reproducible Research:**</span> Scripts written in Bash provide a detailed record of the exact sequence of commands used in an analysis, fostering reproducibility and sharing of methodologies within the research community.

<span style="color: #FFAD69;">**Efficiency in Workflow Management:**</span> Bash’s support for piping (the output of one command becoming the input of another) and job control helps in managing multi-step analyses, making it easier to execute and debug complex workflows.
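
As a small illustration of piping, the sketch below counts the sequences in a FASTA file and then lists the two longest. The file `toy.fasta` and its three sequences are invented for this example, and the length step assumes each sequence sits on a single line; `grep`, `awk`, `sort`, and `head` are standard Unix tools.

```shell
# Build a tiny FASTA file to play with
WORKDIR="$(mktemp -d)"
FASTA="$WORKDIR/toy.fasta"
printf '>seq1\nATGCATGC\n>seq2\nATG\n>seq3\nATGCA\n' > "$FASTA"

# Count sequences: every FASTA record starts with a ">" header line
grep -c '^>' "$FASTA"

# Pipe three commands together:
#   awk  prints "name length" for each record (single-line sequences only)
#   sort orders the records by length, longest first
#   head keeps the top 2
awk '/^>/ {name = substr($0, 2); next} {print name, length($0)}' "$FASTA" \
  | sort -k2,2nr \
  | head -n 2
```

Each `|` hands the output of the command on its left to the command on its right, so no intermediate files are needed.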

For newcomers, learning Bash alongside Python offers a powerful combination—Bash can handle the day-to-day data manipulation and environment management, while Python excels in data analysis and visualization. Together, they form a robust foundation for bioinformatics research.

# <u> Ch. 3: Putting it all Together: Metabolite Prediction! </u>


### We will be going through how to utilize FunBGC, a CLI tool for biosynthetic gene cluster prediction and annotation.
For more information on FunBGC check out these links:
<br>--https://link.springer.com/chapter/10.1007/978-981-97-5131-0_22
<br>
--https://github.com/ydmatsd/funbgcex?tab=readme-ov-file

### <u> Let's Get Coding!</u>

First, create a New Conda Env

```conda create -n funbgc```

Then we need to activate our env so that we can get to installing packages.
<br>

```conda activate funbgc```

Within this env, we need to install three packages:

DIAMOND
<br>
<br>
HMMER
<br>
<br>
Python
<br>

```conda install PACKAGE```

Then we can install the actual package itself, <span style="color: #FFAD69;">**FunBGC**</span>. If you look at the GitHub installation guide, you will see that we have to use <span style="color: #FFAD69;">**pip**</span> instead of conda, so we install it last, once all of our other packages are at the right versions. Likewise, Python should be the LAST thing you install with conda; installing it earlier can create package compatibility issues. I have tried it both ways, and I promise that if you install it last it will be OK.
<br>

```pip install funbgcex```
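
Putting the install steps in order, a dry-run sketch might look like the following. By default it only prints the commands (set `DRY_RUN=0` to actually run them). Two things here are my assumptions rather than part of the original steps: the `bioconda` and `conda-forge` channels as the source for DIAMOND and HMMER, and the use of `-n funbgc` and `conda run` in place of `conda activate`, which does not always work inside scripts. The `run` helper is hypothetical.

```shell
# Print each command instead of running it, unless DRY_RUN=0
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

ENV_NAME="funbgc"

# 1. Create the empty env
run conda create -y -n "$ENV_NAME"
# 2. DIAMOND and HMMER first (channels are an assumption, see above)
run conda install -y -n "$ENV_NAME" -c bioconda -c conda-forge diamond hmmer
# 3. Python last among the conda packages
run conda install -y -n "$ENV_NAME" -c conda-forge python
# 4. The pip package goes in after everything else
run conda run -n "$ENV_NAME" pip install funbgcex
```

Reading the printed commands before running them for real is a cheap way to catch a typo in an env or package name.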
<br>
<br>

<span style="color: #FFAD69;">**NOTE:**</span> If you want to double check the install of all of these packages, simply type

```PACKAGE_NAME -h``` or ```PACKAGE_NAME --help``` and you should see some form of help message in your output.
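
One catch: HMMER does not install a command called `hmmer`; its programs have names like `hmmsearch` and `hmmscan`. A small helper like the hypothetical `check_tool` below reports which commands are on your `PATH` without stopping if one is missing. The list of command names is my guess at what you would want to check in this env.

```shell
# Report whether a command is available on the PATH
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK       $1"
  else
    echo "MISSING  $1"
    return 1
  fi
}

# "|| true" keeps the loop going even when a tool is missing
for tool in diamond hmmsearch python funbgcex; do
  check_tool "$tool" || true
done
```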

<span style="color: #FFAD69;">**Now we are ready to do some BGC mining!**</span>

### <u> Basic Structure of FunBGC</u>
<br>

```funbgcex input_directory output_directory```
<br>

When working with FunBGC, you have to specify an input directory, **NOT** a file. The files in that directory need to be GenBank files, such as the .gbff files that can be downloaded from NCBI's Genome webpage: https://www.ncbi.nlm.nih.gov/datasets/genome/
<br>
-- Search for your organisms of interest
<br><br>
<img src="./photos/Screenshot 2025-04-18 at 10.47.54.png" alt="NCBI Genome Screenshot"> <br>
<br>
-- Click "Download"<br>
-- Download the <span style="color: #FFAD69;">**Sequence and Annotation**</span> file<br><br>
<img src="./photos/Screenshot 2025-04-18 at 10.48.33.png" alt="NCBI Genome Screenshot"> <br>
<br><br>
-- Open the zip file and navigate to the <span style="color: #FFAD69;">**GCA_XX**</span> folder. There you will find the .gbff file, which is your input for FunBGC.
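
If you download several genomes, copying each .gbff into one input folder by hand gets tedious, and files from different genomes may share the same file name. The sketch below assumes the unzipped download looks like `ncbi_dataset/data/GCA_XX/genomic.gbff` (please check your own download, since this layout is an assumption) and copies each file into a single folder, named after the folder it came from. `collect_gbff` is a hypothetical helper, not part of FunBGC.

```shell
# Copy every .gbff under SRC into DEST, renamed after its parent folder
# (e.g. GCA_XX/genomic.gbff -> DEST/GCA_XX.gbff)
collect_gbff() {
  local src="$1" dest="$2" file accession
  mkdir -p "$dest"
  find "$src" -type f -name '*.gbff' | while IFS= read -r file; do
    accession="$(basename "$(dirname "$file")")"
    cp "$file" "$dest/$accession.gbff"
  done
}

# Example call (paths are placeholders for your own folders):
# collect_gbff ./ncbi_dataset/data ./input_gbff
```

Keep the destination folder outside the source folder, so that `find` does not pick up the copies it has just made.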
<br><br>

For the purposes of this tutorial, I have added some files for you all to use as an example; they're located in the data folder. I have chosen to run this code as a <span style="color: #FFAD69;">**bash script**</span> rather than typing it into the terminal, because I wanted to be able to simply define my paths once and run everything in one go. My script is below:

```shell
#!/bin/bash

## Get Paths
INPUT_DIR="../../Data/PMI_DSE_Metabolome_antiSMASH/"
OUTPUT_DIR="./output/"

## Run FunBGC on all GenBank Files (NO OTHER FILES IN DIR)
## Using default settings to find all possible (according to their database) BGCs
funbgcex "$INPUT_DIR" "$OUTPUT_DIR"
```
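
Because the input folder should hold GenBank files and nothing else, it can help to check it before a long run. The `check_input_dir` helper below is a hypothetical sketch, not part of FunBGC, and the extensions it accepts (`.gbff`, `.gbk`, `.gb`) are my assumption about what counts as a GenBank file here. Hidden files are not checked.

```shell
# Succeeds only if DIR exists and every visible entry looks like a GenBank file
check_input_dir() {
  local dir="$1" entry found=0
  [ -d "$dir" ] || { echo "Not a directory: $dir" >&2; return 1; }
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue          # empty folder: the glob matched nothing
    case "$entry" in
      *.gbff|*.gbk|*.gb) found=1 ;;
      *) echo "Unexpected entry: $entry" >&2; return 1 ;;
    esac
  done
  [ "$found" -eq 1 ] || { echo "No GenBank files in: $dir" >&2; return 1; }
}

# Example use (INPUT_DIR and OUTPUT_DIR are whatever your own script defines):
# check_input_dir "$INPUT_DIR" && funbgcex "$INPUT_DIR" "$OUTPUT_DIR"
```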
<br>
I have also included the script in the GitHub repo if you all want to use mine as a template. If you read through the documentation on FunBGC, you will see LOTS of other functionality that it has, so feel free to play around with it! It can also work with metagenomic data, to look for fungal BGCs within it. Food for thought.
<br><br>
When you run this code you will get a series of output files in CSV format, so you can import the output straight into Python or R and get to doing some analysis. This is a simple example, but the tool can give some really interesting results!
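
Before opening Python or R, a quick look from the terminal can tell you what you got. The sketch below loops over every CSV in a folder and prints its row count and header line. It builds a tiny fake `demo.csv` so that it runs anywhere; the file name and columns are invented and are not real FunBGC output, so point the `summarize_csvs` helper (also hypothetical) at your own output folder and expect different names. If your CSVs sit in subfolders, adjust the path accordingly.

```shell
# Print "<file>: N rows" and the header line for each CSV in DIR
summarize_csvs() {
  local dir="$1" csv rows
  for csv in "$dir"/*.csv; do
    [ -e "$csv" ] || continue            # no CSVs: the glob matched nothing
    # Rows of data = total lines minus the header line
    rows=$(( $(wc -l < "$csv") - 1 ))
    echo "$(basename "$csv"): $rows rows"
    head -n 1 "$csv"
  done
}

# Fake output folder with one made-up CSV, just to show the helper running
OUTPUT_DIR="$(mktemp -d)"
printf 'cluster,contig,start,end\nBGC1,c1,100,900\nBGC2,c4,50,700\n' > "$OUTPUT_DIR/demo.csv"
summarize_csvs "$OUTPUT_DIR"
```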
<br><br>
<span style="color: #FFAD69;">Happy coding!</span>
