# Backbones of the Analysis

- **License:** [MIT License](https://opensource.org/licenses/MIT)
- **Version:** 0.2
- **Edit Log:** 
    - 2024-01-19: Initial version of the notebook
    - 2024-02-21: Revised the notes 

This notebook presents the essential packages that facilitated the data analysis and construction of the QuEStVar framework. While these packages are mentioned in the manuscript and throughout the notebooks, they are displayed here with their references, links, and the versions used.

The references were primarily sourced from citebay.com, and the links point to each project's main page. Please note that the versions used in the analysis may not be the most recent releases available at the time of writing this notebook.

## Details on the Libraries

The libraries are listed in order of their criticality to the analysis. Some packages, only used once or twice, are included for completeness, but their exclusion would not disrupt the analysis.

1. [Python 3](https://www.python.org/)
    - **Version:** 3.9.18
    - **Reference:** [Van Rossum and Drake 2009](https://dl.acm.org/doi/book/10.5555/1593511)
    - **Description:** Python is an interpreted, high-level and general-purpose programming language.
    - **How it was used:** The whole analysis was done in Python 3. The main libraries that were used are listed below.
2. [Jupyter](https://jupyter.org/)
    - **Version:** 3.5.3
    - **Reference:** [Kluyver et al. 2016](https://ebooks.iospress.nl/publication/42900)
    - **Description:** Jupyter is a project that supports interactive data science and scientific computing across many programming languages. It provides a browser-based notebook interface that can be used to write and execute code, and makes these forms of computing intuitive and reproducible.
    - **How it was used:** The whole analysis was done in Jupyter notebooks. The `IPython`, `jupyterlab`, `notebook`, and `nbconvert` packages were used to create the notebooks and convert them to HTML and PDF formats.
2. [Pandas](https://pandas.pydata.org/)
    - **Version:** 1.5.2
    - **Reference:** [McKinney 2010](https://conference.scipy.org/proceedings/scipy2010/mckinney.html)
    - **Description:** Pandas is a versatile data analysis and manipulation library for Python.
    - **How it was used:** Pandas is used throughout the analysis, from reading and writing files to manipulating tables for summaries, plotting, and analysis.
3. [NumPy](https://numpy.org/)
    - **Version:** 1.24.3
    - **Reference:** [Harris et al. 2020](https://www.nature.com/articles/s41586-020-2649-2)
    - **Description:** NumPy is the fundamental package for numerical computing in Python, providing efficient array operations.
    - **How it was used:** NumPy is used for the computationally intensive steps of the analysis, mainly masked arrays and vectorized operations. Although pandas is built on top of NumPy, it is less efficient than NumPy for these operations.
4. [Matplotlib](https://matplotlib.org/)
    - **Version:** 3.7.2
    - **Reference:** [Hunter 2007](https://ieeexplore.ieee.org/document/4160265)
    - **Description:** Matplotlib is a comprehensive plotting library for Python.
    - **How it was used:** Matplotlib's core functionality is used to complement pandas, seaborn, and the other plotting-specific libraries.
5. [Seaborn](https://seaborn.pydata.org/)
    - **Version:** 0.11.2
    - **Reference:** [Waskom 2021](https://joss.theoj.org/papers/10.21105/joss.03021)
    - **Description:** Seaborn is a high-level interface for statistical data visualization built on top of Matplotlib.
    - **How it was used:** Seaborn is used to create the majority of the plots along with Matplotlib.
6. [SciPy](https://www.scipy.org/)
    - **Version:** 1.10.1
    - **Reference:** [Virtanen et al. 2020](https://www.nature.com/articles/s41592-019-0686-2)
    - **Description:** SciPy is a collection of mathematical algorithms and convenience functions built on NumPy.
    - **How it was used:** SciPy's `interpolate`, `special`, and `stats` modules provide the fundamental functions behind the masked-array-based statistical testing.
7. [Feather-format](https://arrow.apache.org/docs/python/feather.html)
    - **Version:** 0.4.1
    - **Reference:** [Apache-Feather](https://github.com/wesm/feather)
    - **Description:** Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames.
    - **How it was used:** Feather is used to store the data frames in a binary format that can be read very quickly to optimize storage and read/write time.
8. [g:Profiler Python API](https://biit.cs.ut.ee/gprofiler/page/apis)
    - **Version:** 1.0.0
    - **Reference:** [Reimand et al. 2016](https://academic.oup.com/nar/article/44/W1/W83/2499353)
    - **Description:** g:Profiler is a public web server for characterizing and manipulating gene lists.
    - **How it was used:** g:Profiler is used to annotate the protein sets in single or multi-query modes.
9. [BioPython](https://biopython.org/)
    - **Version:** 1.81
    - **Reference:** [Cock et al. 2009](https://academic.oup.com/bioinformatics/article/25/11/1422/330687)
    - **Description:** Biopython is a collection of modules for biological computation.
    - **How it was used:** Biopython is used to parse FASTA files and convert them into tables that include the molecular weight of each protein.
10. [Polars](https://docs.pola.rs/user-guide/)
    - **Version:** 0.18.1
    - **Reference:** [Pola.rs](https://github.com/pola-rs/polars)
    - **Description:** Polars is a very fast, Rust-based library for handling labeled 2D tables. It is a new-generation, rapidly evolving library that is likely to replace pandas in the future.
    - **How it was used:** Polars is used for very intensive calculations that are best done on labeled tables, such as `protein_status_matrix`.
11. [Sklearn](https://scikit-learn.org/stable/)
    - **Version:** 1.1.1
    - **Reference:** [Pedregosa et al. 2011](https://www.semanticscholar.org/paper/Scikit-learn%3A-Machine-Learning-in-Python-Pedregosa-Varoquaux/168f28ac3c8c7ea63bf7ed25f2288e8b67e2fe74)
    - **Description:** Scikit-learn is a collection of machine learning functionality.
    - **How it was used:** Used, when needed, for dimensionality reduction methods such as PCA and t-SNE and for clustering methods such as KMeans and DBSCAN.
12. [PyComplexHeatmap](https://dingwb.github.io/PyComplexHeatmap/build/html/index.html)
    - **Version:** 1.6.4
    - **Reference:** [Ding et al. 2023](https://onlinelibrary.wiley.com/doi/10.1002/imt2.115)
    - **Description:** PyComplexHeatmap is a plotting library for complex heatmaps.
    - **How it was used:** Used to add enhanced features beyond those of seaborn's `clustermap`.
13. [Upsetplot](https://upsetplot.readthedocs.io/en/stable/)
    - **Version:** 0.7.0
    - **Description:** Upsetplot is a library for creating UpSet plots.
    - **How it was used:** UpSet plots are used to display the overlap among protein sets when more than three sets are compared in the analysis.
14. [session_info](https://gitlab.com/joelostblom/session_info)
    - **Version:** 1.0.0
    - **Description:** Simple function to display the session info of the libraries used in the analysis.
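    - **How it was used:** Used as intended: to print the versions of the imported packages, their dependencies, and the environment, as in the session info section of this notebook.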
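
To illustrate the NumPy entry above, the following is a minimal sketch of how masked arrays and vectorized operations can handle missing values. The `intensities` matrix and its values are hypothetical and not taken from the analysis; the sketch only shows the general technique, not the framework's actual implementation.

```python
import numpy as np

# Hypothetical intensity matrix: rows are proteins, columns are replicates.
# NaN marks a missing measurement (illustrative values only).
intensities = np.array([
    [10.0, 12.0, np.nan],
    [np.nan, np.nan, 8.0],
    [5.0, 7.0, 9.0],
])

# Mask the missing values so the reductions below ignore them.
masked = np.ma.masked_invalid(intensities)

# Vectorized per-protein statistics computed over valid values only.
row_means = masked.mean(axis=1)
valid_counts = masked.count(axis=1)

print(row_means)
print(valid_counts)
```

Masking keeps the array shape intact, so per-row statistics can be computed in one call without looping over proteins or dropping rows that contain missing values.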


## References

1. Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.
1. Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., … Willing, C. (2016). Jupyter Notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt (Eds.), Positioning and Power in Academic Publishing: Players, Agents and Agendas (pp. 87–90). IOS Press. https://doi.org/10.3233/978-1-61499-649-1-87
1. McKinney, W. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56). Austin, TX.
1. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
1. Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science and Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
1. Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021
1. Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
1. Feather File Format — Apache Arrow v15.0.0, https://arrow.apache.org/docs/python/feather.html (accessed January 23, 2024).
1. Reimand, J., Arak, T., Adler, P., Kolberg, L., Reisberg, S., Peterson, H., & Vilo, J. (2016). g:Profiler—a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Research, 44(W1), W83–W89. https://doi.org/10.1093/nar/gkw199
1. Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., … others. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423.
1. pola-rs/polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust, https://github.com/pola-rs/polars (accessed January 23, 2024).
1. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
1. Ding, W., Goldberg, D. and Zhou, W. (2023), PyComplexHeatmap: A Python package to visualize multimodal genomics data. iMeta e115. https://doi.org/10.1002/imt2.115


## The Session Info

The session info shows the full details of the packages used, their dependencies, and the environment in which the analysis was conducted.

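```python
# Import every package listed above so that session_info reports its version
import session_info

import feather as ft

import numpy as np
import pandas as pd
import polars as pl

import scipy as sp
import sklearn as sk
import Bio as bio
import gprofiler as gp

import seaborn as sns
import matplotlib as mpl
import upsetplot as up
import PyComplexHeatmap as pch

# Print the versions of the imported packages, their dependencies, and the environment
session_info.show()
```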
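
As a complement to the session info, the versions listed in this notebook could also be compared against the local installation. The sketch below uses only the standard library (`importlib.metadata`); the helper names `installed_version` and `compare_versions` are hypothetical, and the distribution names in `EXPECTED` are assumed to match how each package is published on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this notebook. The keys are assumed PyPI
# distribution names; only a subset of the packages is shown.
EXPECTED = {
    "pandas": "1.5.2",
    "numpy": "1.24.3",
    "matplotlib": "3.7.2",
    "seaborn": "0.11.2",
    "scipy": "1.10.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (listed, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_versions(EXPECTED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: listed {wanted}, installed {found} ({status})")
```

A mismatch does not necessarily break the analysis; it only flags where the local environment differs from the one documented here.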