# Installing and Managing Bioinformatic Software

------
### Learning Objectives:

+ Create a conda environment with a yml file

+ Build and add to a conda environment with a list of software

+ Discuss basic quality metrics for FASTQ files

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Tip: </b>  Creating and switching between environments can be tricky, so if you're having trouble feel free to leverage Gemini (Google's advanced generative AI model) at the bottom of this module.
</div>  

## Creating a New Terminal Console in Jupyter Lab

Jupyter Lab allows you to create a new console window within its layout. This is also possible in Jupyter Lab on AWS SageMaker. Having a terminal console open alongside a Jupyter notebook tab is helpful when you need to run command line code.

### 1) In Jupyter lab within a Jupyter notebook, right click on any whitespace in the notebook. 

<p align="center">
  <img src="images/Right_click_notebook.png" width="100%"/>
</p>

### 2) Select 'New Console for Notebook'

<p align="center">
  <img src="images/Right_click_notebook_highlighted.png" width="100%"/>
</p>


### 3) Select the + sign next to the new console you just made. This will open up a new launcher tab. 

<p align="center">
  <img src="images/New_console.png" width="100%"/>
</p>


### 4) Select Terminal under the Other section when selecting the type of launcher. 

<p align="center">
  <img src="images/Terminal_launcher.png" width="100%"/>
</p>


This new console window can be dragged elsewhere on the screen to create the layout you want.

<p align="center">
  <img src="images/Split_view.png" width="100%"/>
</p>


## Accessing Bioinformatic Software 
-----

Bioinformatic software can be installed and managed in a number of ways. It is important to keep track of software versions so that you can report what was used for specific analyses/projects. Similar to how you would report the methods of a bench experiment, you should report any information about a bioinformatic analysis that would enable another researcher to repeat the experiment that you performed and get the same or very similar results. This includes the source of the data you used, the version of the software used, and any arguments or settings that were used during the analysis. 

### Software Pre-Installed on the System
Linux systems come with many core utilities for navigating the file system, creating, editing, and removing files, downloading and uploading files, and much more. These utilities are commonly found in `/usr/bin`. Commands such as `ls` and `cat` are examples of software that is pre-installed on the system. To get an idea of the software that is pre-installed, use the command `ls /usr/bin/` (make sure you've enabled scrolling on the results by right clicking on the notebook and selecting *Enable Scrolling for Outputs*).


In [None]:
%%bash

# List preinstalled software
ls /usr/bin/


### Full Package and Environment Management (e.g. Conda, Mambaforge)

[Conda](https://docs.conda.io/projects/conda/en/latest/) is an open source package and environment manager that runs on Windows, macOS, and Linux. Conda allows you to install and update software packages as well as organize them efficiently into *environments* that you can switch between to manage software collections and versions. 

One way you might use these environments is to create an environment for each type of analysis you perform. For example, one environment could be for RNAseq preprocessing and another for variant calling. Conda allows you to create a virtually unlimited number of software environments for specific analyses, and therefore presents an efficient and reproducible way to manage your software across multiple projects. Importantly, Conda also enables you to share environments with others interested in replicating an analysis you've performed.


# Getting Started with Conda in Jupyter Notebooks
--------

In this lesson we will create a conda environment and install software that will enable us to check the quality of some FASTQ files. 

## Creating a Conda Environment
--------

There are two ways to create a conda environment. You can create an environment by specifying the name of the environment and the names of the software that you would like installed, or you can create an environment from a yml file, which is a file that contains a list of all software that should be installed. Let's start with installing from a yml file. In your home directory we have created a file called `env.yml`. Let's take a minute to look inside this file. 


In [None]:
%%bash

## Print the contents of the yml file to the screen
cat env.yml

You can see that the file has three components: the `name:`, the `channels:`, and the `dependencies:`.

The `name` component indicates the name for the conda environment to be created, in this case the environment will be called **test_env**. 

The `channels` component precedes a list indicating the locations of the software packages to be installed. Conda packages are downloaded from remote locations called channels, which are URLs to directories containing conda packages. This file specifies that the packages can be found on the *bioconda* channel, a popular channel for bioinformatic software packages.

The third component is `dependencies`, which precedes a list containing the software packages to be installed in the environment. Here we are installing python version 3.9, ipykernel, fastqc, and multiqc. You can see that the version of python is specified with the equal sign. 
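To make the three components concrete, the Python sketch below assembles a file with the same layout from the values described above. The exact contents of `env.yml` are not reproduced in this lesson, so the text it produces is an assumption based on that description (the real file may, for example, list additional channels). `build_env_yml` is a hypothetical helper written for illustration and is not part of conda.

```python
def build_env_yml(name, channels, dependencies):
    """Return the text of a conda environment file with the three components."""
    lines = [f"name: {name}", "channels:"]
    # Each channel and each package becomes one indented list item
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {package}" for package in dependencies]
    return "\n".join(lines) + "\n"


# Values taken from the description of env.yml above (assumed, not copied from the file)
env_text = build_env_yml(
    name="test_env",
    channels=["bioconda"],
    dependencies=["python=3.9", "ipykernel", "fastqc", "multiqc"],
)
print(env_text)
```

Comparing the printed text with the output of `cat env.yml` is a quick way to check that you can identify each component in the real file.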

Now, switching over to the terminal, we can create a new conda environment using the `env.yml` file with the command `conda env create` and the flag `--file`. This command should take between 1 and 5 minutes. 

`conda env create --file env.yml`

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  This command can take a while because it downloads all of the software and their dependencies.
</div>  

This command has created the conda environment named **test_env** which contains the software specified in the yml file. 

To activate a conda environment in the terminal, use the command `conda activate test_env`. The text in parentheses before your prompt will change from `base` to `test_env`. If you are not sure which conda environments are available for you to load (or you forgot the spelling of the environment name), you can use the command `conda env list` to list the names of the available conda environments.

To ensure that your conda environment is activated let's list the available software with the command `conda list`. 

You can see that in addition to the software listed in the yml file, many additional packages have been installed. These are the software dependencies that are required to use the programs we specified in the yml file. This is one of the strengths of using a package manager like conda - the appropriate version of each dependency required for the specified software is automatically added to your environment. 

If you do not have a yml file that specifies the software needed for an analysis, you can create conda environments by listing the software you would like installed with the `conda create` command. Here we use the `-n` flag to specify the name of our environment, the `--channel` flag to specify the channel software packages should be loaded from, and the names of the software as the arguments for the command. The `-y` flag answers yes to the confirmation prompt automatically.

`conda create -n test --channel bioconda python=3.9 ipykernel fastqc -y`

I know we are missing MultiQC in this list of software - we will get to that later.

It is easy enough to enter this command manually if you have only a small number of packages to be installed, but the command can get unwieldy if there are many packages to be installed. Furthermore, having the yml file creates some record of how the environment was created and enables you to share the file with others who may want to create and use the same environment.

## Modifying a Conda Environment
------

Generally, once a conda environment is created it won't need to be modified for as long as the analysis pipeline is stable. However, there are times when software will need to be added, or versions of software might need to be updated. For the purpose of this lesson, I have left one of the software packages, `multiqc`, off of the list of dependencies in the "test" environment. Let's add this package to our environment now with the `conda install` command. Remember to first activate your environment using `conda activate test`.

`conda install --channel bioconda multiqc -y`

## Quality Control of FastQ Files
----------

Now that we have created a conda environment, let's make use of some of the tools that we installed using the looping structures that we learned about in lesson 4. The two pieces of software we will use are FastQC and MultiQC. FastQC checks the quality of FASTQ files, and MultiQC organizes the output of the quality reports to improve readability. These should have been installed in both the "test" and "test_env" environments. Make sure that one of those is activated in the terminal.

### FastQC

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is an excellent tool that is specifically designed to assess the quality of FASTQ files. FastQC is composed of a number of analysis modules that calculate various QC metrics from FASTQ files (such as GC content, distribution of base quality, etc.) and summarize them in a QC report in HTML format that can be opened in a web browser. Checking the quality of raw reads is important before you begin analyzing data, as your analysis pipeline might have to be modified to mitigate poor quality or contaminated data. 
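To give a feel for what a QC module computes, here is a minimal Python sketch that reads FASTQ records and reports GC content and mean base quality for each read. It is a simplified illustration rather than FastQC's implementation. It assumes four lines per record and Phred+33 quality encoding, and the two reads in `example` are invented for the demonstration.

```python
def read_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ text with 4 lines per record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading '@' from the header


def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GCgc" for base in sequence) / len(sequence)


def mean_quality(quality, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two invented reads used only for illustration
example = "@read1\nGATTACAG\n+\nIIIIIIII\n@read2\nGGCC\n+\n5555\n"
for read_id, sequence, quality in read_fastq(example):
    print(read_id, gc_content(sequence), mean_quality(quality))
```

FastQC aggregates values like these across every read in a file and plots their distributions, which is what makes the report useful for spotting problems at a glance.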


<div style="display:flex">
     <div style="flex:1;padding-right:10px;">
          <img src="images/fastqc-good.png" width="70%"/>
     </div>
     <div style="flex:1;padding-left:10px;">
          <img src="images/fastqc-bad.png" width="70%"/>
     </div>
</div>


In the figure above is an example of a high quality FASTQ dataset on the left, and a poor quality FASTQ dataset on the right. You can see on the left panel of each image are a series of quality metrics. In the high quality data each of the metrics has a green check next to it, indicating the file has met the QC thresholds for that metric. In the poor quality FASTQ file some of the metrics have an orange exclamation point, and some have a red X indicating there are some issues with this data. 

Not all high quality data will pass all quality metrics, but generally high quality data will pass most of the QC metrics. Failed QC metrics are not an indication that you should toss the data and start over; rather, they indicate that your analysis might need to be modified to mitigate some of the quality issues in the data. 

The first two quality metrics from the left panel, *Basic Statistics* and *Per base sequence quality* are also shown in the figures above. You can see that the basic statistics are very similar in the two reports, but there is a stark difference in per base sequence quality in the high quality dataset (left) and poor quality dataset (right).

As a reminder, quality scores indicate the confidence the base caller has in the base call that was made. A quality score of 30 indicates that there is a 0.1% chance that the base call is not accurate; a score of 20 indicates a 1% chance that the base call is inaccurate. 
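The relationship behind these numbers is the standard Phred definition, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below converts in both directions; the function names are hypothetical and chosen only for this illustration.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for score in (20, 28, 30, 40):
    print(score, phred_to_error_probability(score))
```

Because the scale is logarithmic, every 10 point increase in the quality score corresponds to a tenfold drop in the chance of an incorrect base call.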

Ideally, most bases will be in the green region of the plots with a quality score of 28 or higher, and a very low probability of an inaccurate base call. This is what we see in the high quality data. In the poor quality data there are some bases in the high quality range, but the median base call at most positions is well below this threshold. Before analyzing data like this I would recommend trimming poor quality base calls, a step that isn't necessary in the high quality dataset. 

In both the high quality data and the poor quality data you can see that quality scores begin to drop off toward the end of the read. This is a known feature of Illumina sequencing data, and the drop off is very slight in the high quality data. A significant drop off, as in the poor quality data, might be an indication of issues with sample quality or the sequencing run. 

It is important to remember that FastQC expects a random and diverse library, and that a failure of any metric could be a feature of the experiment. Below is a table outlining what many of the metrics are measuring in the dataset, as well as the thresholds for failure. More information about these metrics can be found in the analysis modules section of the [FastQC manual](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).


<table>
<tr><th>FastQC quality metrics and their thresholds for failing the QC metric.</th></tr>
<tr><td><table></table>


|Metric | Measurement | Failure threshold|
|--- | --- | ---|
|Per base sequence quality | Overall range in quality statistics by base position| lower quartile of quality score at any position is less than 5, or median less than 20 |
|Per tile sequence quality | Only reported for Illumina data, average quality per flow cell tile| Any tile with a mean quality score 5 less than other tiles for a given base |
|Per sequence quality score | Distribution of overall quality scores per read| Mean quality less than 20, corresponding to a 1% error rate |
|Per base sequence content | The proportion of each base (A, T, G, C) at each read position | difference in proportions of any base greater than 20%|
|Per sequence GC content | GC content of reads in the library | The GC content of more than 30% of reads in the library deviate from the normal distribution |
|Per base N content | Proportion of N base calls by position| more than 20% N base calls at any position|
|Sequence length distribution | Distribution of read lengths | Any sequence has a length of 0|
|Sequence duplication levels| Degree of duplication for each sequence; this metric only uses the first 100,000 reads | Duplicated sequences make up more than 50% of the library |
|Overrepresented sequences| For each sequence making up more than 0.1% of the total library, matches to common contaminants are reported| Any match makes up more than 1% of reads|
|Adapter content| Proportion of reads with adapters detected at any position | Adapter sequences are present in more than 10% of reads |
</td></tr> </table>


Now that we have some familiarity with the report and how it can be used to inform your downstream analysis, let's take a look at the manual for FastQC to get an idea of the flags available to customize QC reporting.

`fastqc --help`


In the simplest case this tool can be used by supplying a list of fastq files to be analyzed - this will work fine for most datasets. 

However, reading through the help page you can see that there are a number of flags that enable you to run the analysis more efficiently. For example, you can use the `-o, --outdir` flag to specify the name of a directory to store QC reports from the FastQC analysis. The `-t, --threads` flag accepts an integer as an argument and can be used to speed up the analysis by using multiple threads. 

Let's run an analysis now and use the `-o` flag to designate a directory for the results called `fastqc`.

`mkdir fastqc`

`fastqc -o fastqc gcp_research_workflow/SRR1039508_1.chr20.fastq.gz gcp_research_workflow/SRR1039508_2.chr20.fastq.gz`

Take a moment to go into the `fastqc` directory you just made and have a look at the FastQC report you generated for one of these datasets. 

- How do the data look, do they resemble the high quality report or the poor quality report?
- Were any of the quality metrics flagged (orange !), or did any of them fail (red x)?
- Were the metrics the same between the forward and reverse files?
- Does this seem like a dataset that needs adjustment prior to an analysis, or do these data seem fairly high quality?
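The question about flagged metrics can also be answered without opening the HTML report. Alongside each HTML report FastQC writes a zip archive, and to the best of my knowledge that archive contains a tab-separated `summary.txt` with one line per module (status, module name, file name). The Python sketch below assumes that layout. `parse_fastqc_summary` and `flagged_modules` are hypothetical helpers, and the sample text is made up for illustration rather than taken from the files analyzed above.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN, or FAIL)."""
    statuses = {}
    for line in text.strip().splitlines():
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses


def flagged_modules(statuses):
    """Return the modules that did not pass, in the order they were reported."""
    return [module for module, status in statuses.items() if status != "PASS"]


# Made-up example in the assumed summary.txt layout
sample_summary = (
    "PASS\tBasic Statistics\texample.fastq.gz\n"
    "PASS\tPer base sequence quality\texample.fastq.gz\n"
    "WARN\tPer sequence GC content\texample.fastq.gz\n"
    "FAIL\tAdapter Content\texample.fastq.gz\n"
)
print(flagged_modules(parse_fastqc_summary(sample_summary)))
```

To try this on real output you would first extract `summary.txt` from one of the zip files in the `fastqc` directory and pass its contents to `parse_fastqc_summary`.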



We discussed parallelization in the first lesson, and demonstrated how it can be used to speed up an analysis. We can use the `-t` flag to parallelize a FastQC run.

`fastqc -t 2 -o fastqc gcp_research_workflow/SRR1039509_1.chr20.fastq.gz gcp_research_workflow/SRR1039509_2.chr20.fastq.gz`

You will notice that in the first run, where we only used one thread, the output indicated that each file was analyzed sequentially, whereas when we used two threads the files were analyzed in parallel and finished at almost the same time. The parallelization in this case uses one thread per file, so using more than two threads would not have sped up the analysis.
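The one-thread-per-file behavior described above can be written as a small calculation. This Python sketch is only a model of that description, not FastQC's actual scheduler, and `files_in_parallel` and `batches_needed` are hypothetical names.

```python
import math


def files_in_parallel(threads, n_files):
    """Number of files processed at once when each file uses a single thread."""
    return min(threads, n_files)


def batches_needed(threads, n_files):
    """Number of sequential rounds needed to get through all of the files."""
    if n_files == 0:
        return 0
    return math.ceil(n_files / files_in_parallel(threads, n_files))


# Two input files, as in the commands above, with increasing thread counts
for threads in (1, 2, 8):
    print(threads, files_in_parallel(threads, 2), batches_needed(threads, 2))
```

With two input files the number of rounds stops shrinking once two threads are requested, which matches what we observed in the two runs.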

In many cases you will be analyzing a group of fastq files that will be compared downstream. In this case we have 4 samples in our group: SRR1039508, SRR1039509, SRR1039512, and SRR1039513. As we learned in the lesson beyond_basic_bash, it can be helpful to use looping to perform the same function multiple times on a group of files. Use the code box below to write a loop that analyzes all of the fastq files in the `gcp_research_workflow` directory.


<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Write a loop to analyze all fastq files in the gcp_research_workflow directory using 4 threads.<br><br>Multithreading doesn't speed up your analysis. Why?</div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers in the terminal)

## Use a loop to analyze all fastq files in the gcp_research_workflow directory

## Using 4 threads to analyze these files in a loop won't speed up your analysis, why?


In [None]:
from IPython.display import IFrame
IFrame("quiz_files/quiz5-1.html", width=600, height=350)

The QC report that we looked at earlier, `SRR1039508_1.chr20_fastqc.html`, had a few warnings but mostly passes. The warnings are flagged for the metrics *per tile sequence quality*, *per sequence GC content*, and *sequence duplication levels*. These warnings could indicate a problem with the library or the run. If the problem was with only that sample (i.e. an issue with the library), then I would expect the same warnings to be present in the paired file `SRR1039508_2.chr20_fastqc.html` but not in other files from the same run. 


However, if the issue was with the sequencing run I would expect the warnings to be present in all samples in the group. Rather than click through each of the 8 HTML reports and compare the results we can use the software **MultiQC** to collate the results from all of our fastqc reports into one single report. 

Let's begin by looking at the help page for MultiQC.


`multiqc --help`

This help page is organized nicely into blocks indicating the categories of flags that can be used to modify the analysis and summary output of the tool.

Again you can see that there are several flags that would enable you to customize the report, or the tool can be run by supplying the directories with the files to be summarized listed as arguments. MultiQC also provides the option of modifying the run with a config file `--config` rather than using multiple flags at the command line to control the format of the report. The *MultiQC* config file is similar to the yml file that we used to build our conda environment in that it serves as a record of which options were used in the command to generate the report. Additionally the config file can be used many times to generate the same format report across multiple projects.

Here we are going to use the simplest case and run the tool with only the flag `--flat` to suppress the creation of interactive figures which are difficult to view in Jupyter notebook.



`multiqc --flat .`


If you double click the HTML file called `multiqc_report.html`, you will notice that, as with the FastQC report, there are links on the left that let you navigate quickly to each section of the report. Unlike the FastQC report, these links are not colored according to warnings, but all samples within the directory are summarized in a single report. Perhaps more useful than the flagged metrics in the FastQC report is the ability to look for consistency across samples that will be analyzed as a group. You can see in the screenshot of the report below that all samples are grouped into a single report.


<p align="center">
<img src="images/multiqc.png" alt="multiqc" width="80%"/>
</p>


We can see that there is some slight variation in sample size, but all of our samples are very high quality with base quality scores above 30 even at the end of the read. 

Perhaps most importantly, and easiest to see in the collated report, all samples are consistent with each other in all metrics: similar GC content, duplication rates, and quality score distributions. The small variation in library sizes is expected and can be mitigated with the normalization methods of most RNAseq differential expression software, so it should not pose a problem.


<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> To fully interact with an HTML file don't forget to click the <img src="images/trust_html.png" alt="trust HTML" width="10%"/> button on the top left side of the HTML viewer.
</div>

## Gemini (Optional)
--------

If you're having trouble with this submodule (or others within this tutorial), feel free to leverage Gemini by running the cell below. Gemini is Google's advanced generative AI model designed to enhance the capabilities of AI applications across various domains.

In [None]:
# Ensure you have the necessary libraries installed
!pip install -q google-generativeai google-cloud-secret-manager
!pip install -q git+https://github.com/NIGMS/NIGMS-Sandbox-Repository-Template.git#subdirectory=llm_integrations
!pip install -q ipywidgets

import sys
import os
util_path = os.path.join(os.getcwd(), 'util')
if util_path not in sys.path:
    sys.path.append(util_path)

from gemini import run_gemini_widget, create_gemini_chat_widget 
from IPython.display import display

run_gemini_widget()