In [None]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

<img style="float: left; margin-right:20px;" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/trifusion-icon-64.png"><p style="font-weight:bold; font-size:23px; line-height:23px; margin-bottom:50px; color:#37abc8;">Alignment data visualization with TriFusion's Statistics</p>

This is a walkthrough of the data visualization capabilities of TriFusion's __Statistics__ module. It is not intended to provide a description of each graphical/statistical option in this module (that is already provided in the user guide), but rather to show a typical usage of the module and some useful utilities that can help you in your data visualization. Here we'll cover:

- [Summary statistics overview](#Summary-statistics-overview)
    - [Overall summary statistics](#Overall-summary-statistics)
    - [Gene specific summary statistics](#Gene-specific-summary-statistics)
    - [Displaying summary statistics](#Displaying-summary-statistics)
- [Discover and select data exploration analyses](#Discover-and-select-data-exploration-analyses)
    - [How to view analysis specific information](#How-to-view-analysis-specific-information)
    - [The available plot types per analysis and how to select them](#The-available-plot-types-per-analysis-and-how-to-select-them)
    - [Executing an analysis](#Executing-an-analysis)
    - [Quickly changing the plot type for the current analysis](#Quickly-changing-the-plot-type-for-the-current-analysis)
    - [Taking advantage of fast plot switching](#Taking-advantage-of-fast-plot-switching)
    - [The Single Gene analysis](#The-Single-Gene-analysis)
- [How to switch between plot types](#Quickly-changing-the-plot-type-for-the-current-analysis)
- [How to change/update the active data set](#How-to-change/update-the-active-data-set)
- [How to export figures and tables](#How-to-export-figures-and-tables)
    - [Export a figure](#Export-a-figure)
    - [Export a table](#Export-a-table)
- [Dealing with outliers](#Dealing-with-outliers)

__Input data:__ For this tutorial, we will use a medium-sized protein data set of 614 genes and 48 species (you can download it [here](https://github.com/ODiogoSilva/TriFusion-tutorials/raw/master/tutorials/Datasets/Process/medium_protein_dataset/medium_protein_dataset.zip)).

# Summary statistics overview

As soon as you load your data into TriFusion and navigate to the __Statistics__ module, the computation of general and gene specific summary statistics will start. This computation runs in the background and continues unless you start generating a plot or load more data into TriFusion. When finished, a summary statistics overview for the currently active data set will be displayed in the __Statistics__ screen.

## Overall summary statistics

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_sum_stat_general.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_overall_stats.gif">
</figure>



Information is sorted into three main categories: _General_, _Missing data_ and _Sequence variation_.

The values in the _General_ section are mostly self-explanatory. We only note that the _Total alignment length_ refers to the length of the alignment as a whole, not the sum of each sequence in the alignment.

The _Missing data_ section separates gaps (usually denoted by "-" in the alignment file) from true missing data (usually "N" in nucleotide sequences and "X" in protein sequences). The _Gaps_ and _Missing data_ values count individual characters across all sequences, not alignment columns. Therefore, the associated percentages are relative to the total number of characters in the alignment (in this case, 48 * 350 725).
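
To make the denominator explicit, here is a minimal Python sketch of how such a percentage can be computed. The function name and the gap count in the example are illustrative assumptions and are not taken from this data set; only the 48 taxa and the alignment length of 350 725 come from the text above.

```python
def character_percentage(count, n_taxa, alignment_length):
    """Return `count` as a percentage of all characters in the alignment.

    Every sequence in an alignment has the same length, so the total
    number of characters is n_taxa * alignment_length.
    """
    total_characters = n_taxa * alignment_length
    return 100.0 * count / total_characters


# 48 taxa and 350 725 alignment columns, as in this tutorial's data set.
total = 48 * 350725

# Hypothetical gap count, chosen only to illustrate the calculation.
example_gaps = 4208700

print(total)
print(character_percentage(example_gaps, 48, 350725))
```

With these numbers the alignment holds 16 834 800 characters in total, and the hypothetical gap count corresponds to 25% of them.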

The _Sequence variation_ section provides the number of variable sites (at least one variant) and informative sites (one of the variants must be represented in at least two taxa) across the data set. These values count alignment columns, so the percentages are relative to the _Total alignment length_.
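
The column-based counting can be sketched in Python as follows. This is not TriFusion's code: the function name is hypothetical, gaps and "X" are assumed to be ignored, and the informative criterion used here is the common parsimony-informative rule (at least two residues that each occur in two or more taxa), which may differ in detail from what TriFusion applies.

```python
from collections import Counter

# Characters treated as gaps or missing data (protein alignment assumed).
IGNORED = set("-X")


def classify_sites(sequences):
    """Count variable and informative columns in a list of aligned sequences."""
    variable = informative = 0
    for column in zip(*sequences):
        counts = Counter(c for c in column if c not in IGNORED)
        if len(counts) > 1:
            variable += 1
            # Assumed criterion: two or more residues occur in at least two taxa.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


# Toy alignment of four taxa and four columns (hypothetical data).
alignment = ["MKVA", "MKLA", "MRLA", "MRLS"]
variable, informative = classify_sites(alignment)

# Percentages are relative to the number of columns, not to all characters.
n_columns = len(alignment[0])
print(variable, 100.0 * variable / n_columns)
print(informative, 100.0 * informative / n_columns)
```

In the toy alignment, three of the four columns are variable but only the second one is informative under the assumed criterion.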

## Gene specific summary statistics

To visualize the same statistics as in the previous section broken down by alignment file, click the __Display gene table__ button at the bottom of the screen. This will change the display to show a list with individual alignment files as rows and summary statistics in the different columns.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_gene_specific_stats.png">

Note that, for performance reasons, only the first 50 alignments are shown by default. You can increase the number of shown alignments by scrolling to the bottom and clicking the __Show more 25__ button. Alternatively, you can export this data into a .csv file that can be read by LibreOffice or MS Excel by clicking the __Export as table__ button.

As in the previous section, there are three main summary statistic categories, which are color coded along the table for convenience. A legend for each summary statistic is provided at the top of the table.

To switch back to the overall summary statistics view, click the __Display overall table__ button.

## Displaying summary statistics

At any time, you can return to the summary statistics display by clicking the _Summary statistics_ icon button at the edge of the __Statistics__' side panel.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_summary_button2.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_select_stats.gif">
</figure>


# Discover and select data exploration analyses

All data exploration analyses are contained within the four main category buttons found in the __Statistics__ side panel. Clicking any of these buttons will expand all available analyses under that category. For example, clicking the __Polymorphism and Variation__ button will show four individual analyses.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_variation_category.png">

## How to view analysis specific information

A complete and detailed description of each analysis is provided in TriFusion's user guide, but you can also click the information button (__i__) that accompanies every analysis button. For instance, clicking the information button of the _Pairwise sequence similarity_ analysis shows a pop-up with a short description of the analysis, the available plot types and what the axes represent.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_pairwise_help.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_info.gif">
</figure>

## The available plot types per analysis and how to select them

Most individual analyses offer up to three plot types that represent different perspectives of the same analysis:

- Single gene: You choose a single gene from the data set and the analysis is performed on that gene (usually a sliding window plot).
- Per species: The analysis gathers information for each taxon or discriminates the results by taxon in some way.
- Average: The analysis produces an average distribution/result across the whole data set.

For example, clicking the _Pairwise sequence similarity_ button will ask you which plot type you wish to produce.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_all_plot_types.png">

In this case, all three plot types are available. However, depending on the analysis, only two or even a single plot type may be offered.

## Executing an analysis

Let's explore the distribution of sequence similarity across our entire data set. Since we are interested in an average of the data set, click on the __Average__ button. Sequence similarity and segregating sites are among the most computationally intensive analyses in TriFusion, so the first run may take some time. However, TriFusion uses a hash look-up table technique that considerably speeds up later computations of these analyses in the same session. Once complete, you should see a bar plot with the distribution and mean of the pairwise sequence similarity across the data set.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_sequence_sim_avg.png">

## Quickly changing the plot type for the current analysis

To change the plot type of the current analysis, use the floating box in the top right of the screen.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_change_plot_type.png">

The current plot type appears with a filled blue background ("Average" in this case). To change to the _Per species_ plot type, simply click the corresponding button and a new analysis will start. At the end of the analysis, you should see a triangular heatmap matrix with the sequence similarity between every species pair in the data set.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_seq_sim_sp.png">
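
The matrix is triangular because the similarity between species A and B is the same as between B and A, so each pair only needs to be stored once. The following Python sketch shows one way such a matrix could be assembled. The data layout (one dictionary of taxon to sequence per gene), the function names and the averaging across genes are assumptions for illustration, not a description of TriFusion's internals.

```python
from itertools import combinations


def identity(seq1, seq2):
    """Proportion of identical residues, ignoring columns with gaps or missing data."""
    pairs = [(a, b) for a, b in zip(seq1, seq2)
             if a not in "-X" and b not in "-X"]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else None


def species_similarity(genes):
    """Average similarity of every species pair across all genes.

    `genes` is a list of dicts mapping taxon name to aligned sequence.
    Each unordered pair is stored once, which yields a triangular matrix.
    """
    values = {}
    for gene in genes:
        for tx1, tx2 in combinations(sorted(gene), 2):
            value = identity(gene[tx1], gene[tx2])
            if value is not None:
                values.setdefault((tx1, tx2), []).append(value)
    return {pair: sum(v) / len(v) for pair, v in values.items()}


# Two toy genes with hypothetical taxa; sp3 is absent from the second gene.
genes = [
    {"sp1": "MKVA", "sp2": "MKLA", "sp3": "MRLA"},
    {"sp1": "AAGT", "sp2": "AAGT"},
]
matrix = species_similarity(genes)
for pair in sorted(matrix):
    print(pair, matrix[pair])
```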

## Taking advantage of fast plot switching

While the active data set remains the same, all generated plots are stored locally. This means that if you need to visualize an analysis that you already performed in your current session, you do not have to repeat the entire computation. For instance, we are currently visualizing the _Per species_ plot type of the _Pairwise sequence similarity_ analysis. If you click the __Average__ button in the floating box to change the plot type, you'll notice that the switch is almost instantaneous.

## The Single Gene analysis

Some analyses can be performed on a single gene as a sliding window analysis, which offers additional features. Let's investigate the averaged pairwise sequence similarity for a single gene in our data set. Click the _Pairwise sequence similarity_ analysis and then the _Single gene_ plot type.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_single_gene.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_single_gene.gif">
</figure>

In this dialog, you can choose the sliding window range and the target gene. We'll leave the default sliding window value of 10 and choose the first gene in the list. 

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_single_gebe.png">
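
For readers unfamiliar with sliding window plots, the Python sketch below shows the general technique: a per-column statistic is averaged over consecutive windows along the alignment. The function names, the per-column similarity definition and the use of overlapping windows with a step of 1 are assumptions; TriFusion may define its windows differently.

```python
from itertools import combinations


def column_similarity(column):
    """Proportion of identical residue pairs in one alignment column."""
    residues = [c for c in column if c not in "-X"]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def sliding_window(values, window=10, step=1):
    """Mean of `values` over consecutive windows of the given size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, step)
    ]


# Toy alignment (hypothetical); a window of 2 is used because it is so short.
alignment = ["MKVA", "MKLA", "MRLA", "MRLS"]
per_column = [column_similarity(col) for col in zip(*alignment)]
print(per_column)
print(sliding_window(per_column, window=2))
```

A larger window smooths the curve at the cost of resolution, while a smaller one follows local changes more closely.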

In the footer of the screen, you can set a horizontal line to evaluate regions above or below a given threshold.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_single_gene_threshold.png">

If you want to change to another gene, proceed as in a regular plot type switch. The floating box on the top right of the screen has a _Change gene_ option, where you can select a new gene and/or a new sliding window range.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_single_gene_threshold.png">


# How to change/update the active data set

The analyses in the __Statistics__ module are not limited to the total data set loaded into TriFusion. You can modify the active file/taxa data sets or create data set groups in TriFusion (see tutorial [Creating and using active data set groups](http://nbviewer.jupyter.org/github/ODiogoSilva/TriFusion-tutorials/blob/master/tutorials/Creating%20and%20using%20active%20data%20set%20groups..ipynb)), and then select them at the bottom of the __Statistics__ side panel.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_active_dataset.png" alt="Static Image" data-alt="https://github.com/ODiogoSilva/TriFusion-tutorials/raw/master/tutorials/gifs/stats_tutorial1_change_active.gif">
</figure>

Following the guidelines in the [Creating and using active data set groups](http://nbviewer.jupyter.org/github/ODiogoSilva/TriFusion-tutorials/blob/master/tutorials/Creating%20and%20using%20active%20data%20set%20groups..ipynb) tutorial, we created a taxa group named <i>A_to_C</i>, which contains the 12 taxa whose names start with "A", "B" or "C".
To change the taxa data set to the newly defined group, click the dropdown menu for the taxa data set and select the <i>A_to_C</i> option.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_selecting_taxa_group.png">

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_taxa_selected.png">

Now, all selected analyses will use this set of 12 taxa instead of the full 48 taxa data set. If you want to update the currently displayed analysis, click the refresh button next to the data set selection dropdown menus.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_subset.png">
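
The rule behind the <i>A_to_C</i> group (taxa whose names start with "A", "B" or "C") is easy to reproduce outside TriFusion, for example to prepare a text file with one taxon name per line. The Python sketch below uses hypothetical taxon names and a hypothetical function name; it only illustrates the selection rule.

```python
def select_taxa(taxa, initials=("A", "B", "C")):
    """Return the taxa whose names start with one of the given initials."""
    return sorted(t for t in taxa if t.startswith(tuple(initials)))


# Hypothetical taxon names, not the ones in the tutorial data set.
all_taxa = ["Aspergillus", "Botrytis", "Candida", "Fusarium", "Neurospora"]
group = select_taxa(all_taxa)

# One taxon name per line, ready to be written to a plain text file.
file_content = "\n".join(group)
print(file_content)
```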

# How to export figures and tables

All plots generated in TriFusion can be exported as a graphics file and almost all can be exported in table format. These functions are available in the plot screen bar at the right of the screen. 

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_export_bar.png">

## Export a figure

Click the __Export as graphics__ button in the plot screen right bar. This will open a file chooser where you can choose where to export the figure, its name and graphics format.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_export_figure.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_export_graphic.gif">
</figure>

Here we gave the figure a name and set the image format to _svg_. If you wish to convert the plot into grayscale, you can do so by checking the _Grayscale_ box. Finally, click __Save__ and the figure will be exported.

## Export a table

Click the __Export as table__ button in the plot screen right bar. As in the previous section, this will open a file chooser where you can choose where to export the table and its name.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_export_table.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_export_table.gif">
</figure>

Then click __Save__ to export the table. The generated table will be in _csv_ format, which can be readily imported by LibreOffice or MS Excel or viewed as a plain text file. 
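
Since the exported table is plain text, it can also be processed with a script. The Python sketch below reads such a table with the standard `csv` module. The column names, the sample values and the comma delimiter are assumptions for illustration; check the header and separator of your own exported file and adjust them accordingly.

```python
import csv
import io


def read_exported_table(handle, delimiter=","):
    """Read a delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


# In-memory stand-in for an exported file (hypothetical columns and values).
# For a real file, use: open("my_table.csv", newline="")
sample = io.StringIO(
    "Gene,Gaps,Missing data\n"
    "gene_1.fas,120,15\n"
    "gene_2.fas,340,0\n"
)
rows = read_exported_table(sample)

# Example query: which gene has the most gaps?
most_gaps = max(rows, key=lambda row: int(row["Gaps"]))
print(most_gaps["Gene"])
```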


# Dealing with outliers

Outlier analyses in TriFusion are a bit different because they offer you the option to remove files and/or taxa that behave as outliers for a given statistic. If you click on the _Outlier Detection_ category in the __Statistics__ side panel, you'll see three outlier detection analyses: by _missing data_, _segregating sites_ and _sequence size_.

<img width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_outlier_analyses.png">

Let's illustrate outlier handling by checking for taxa that are outliers for missing data, that is, taxa that contain unusual amounts of missing data. Click on the __Missing data outliers__ button, and then the __Per species__ plot type.

<figure>
    <br>
    <p style="font-size: 14px; text-align: center; font-weight: bold;">Click figure to animate</p>
    <img class="animation" width="90%" src="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/stats_taxa_missing_outlier.png" alt="Static Image" data-alt="https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/gifs/stats_tutorial1_outliers.gif">
</figure>

You can see that the missing data distribution is bimodal (two peaks) and that one outlier taxon was found (see the footer of the screen). The footer also provides three functions to handle potential outliers:

- Remove: Clicking the __Remove__ button will remove the outlier taxa from the current TriFusion session. This is equivalent to manually removing the taxa in TriFusion's side panel.
- Export: Clicking the __Export__ button will save the outlier taxa to a _csv_ file, where each line will contain a taxon name. This can be used to change the active data set in TriFusion using a text file (see this section of the [Creating and using active data set groups](http://nbviewer.jupyter.org/github/ODiogoSilva/TriFusion-tutorials/blob/master/tutorials/Creating%20and%20using%20active%20data%20set%20groups..ipynb#Import-selection-from-file) tutorial).
- View: Clicking the __View__ button will display a list of the outlier taxa.
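
As a rough illustration of what flagging an outlier taxon involves, the Python sketch below applies a modified z-score based on the median absolute deviation. This tutorial does not state which criterion TriFusion uses, so the method, the threshold of 3.5, the function name and the sample values are all assumptions.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Return taxa whose modified z-score exceeds `threshold`.

    `values` maps taxon name to a statistic, such as the proportion of
    missing data. The score is based on the median absolute deviation.
    """
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return []
    return sorted(
        taxon for taxon, v in values.items()
        if 0.6745 * abs(v - median) / mad > threshold
    )


# Hypothetical proportions of missing data per taxon.
missing = {
    "Taxon_A": 0.10, "Taxon_B": 0.12, "Taxon_C": 0.11,
    "Taxon_D": 0.09, "Taxon_E": 0.13, "Taxon_F": 0.60,
}
outliers = find_outliers(missing)

# Same layout as the exported file described above: one taxon name per line.
print("\n".join(outliers))
```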

In [2]:
%%javascript
// Collect the GIF URL stored in the data-alt attribute of every animated figure.
var getGif = function() {
    var gif = [];
    $('.animation').each(function() {
        gif.push($(this).attr('data-alt'));
    });
    return gif;
};
var gif = getGif();

// Preload all the GIFs so that the first click animates without delay.
var image = [];

$.each(gif, function(index) {
    image[index]     = new Image();
    image[index].src = gif[index];
});

// On every click, swap the displayed image with the one stored in data-alt,
// toggling between the static PNG and the animated GIF.
$('figure').on('click', function() {

    var $img    = $(this).children('img'),
        $imgSrc = $img.attr('src'),
        $imgAlt = $img.attr('data-alt');

    $img.attr('src', $imgAlt).attr('data-alt', $imgSrc);
});
