BLOSUM Calculation Tool made with tkinter in python3.

Kryptagora/pysum

🌼 PYSUM 🌼

This tool calculates a BLOSUM matrix (log-odds ratios) from arbitrary sequences by elimination. The sequences can use any alphabet, not only DNA or amino acids.

Usage

To run this application, open a command prompt in the folder that contains main.py and type: python3 main.py

[TBA] You can also run this application without a GUI by typing: python3 main.py --nogui --path path_to_sequence_file --degree [0,100]
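
Because the non-GUI mode is still marked [TBA], the following is only a sketch of how the three flags shown above could be parsed with argparse; it is not the project's code. The names `parse_args` and `degree`, the choice to accept fractional degrees, and the rule that `--nogui` needs `--path` are all assumptions made for illustration.

```python
import argparse


def degree(value):
    """Parse --degree as a number in the closed interval [0, 100]."""
    number = float(value)  # a ValueError here is reported by argparse
    if not 0 <= number <= 100:
        raise argparse.ArgumentTypeError("degree must be between 0 and 100")
    return number


def parse_args(argv=None):
    """Parse the command line flags (hypothetical helper, not from the repo)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--nogui", action="store_true",
                        help="run without the tkinter GUI")
    parser.add_argument("--path", help="path to the sequence file")
    parser.add_argument("--degree", type=degree,
                        help="identity threshold in percent, 0 to 100")
    args = parser.parse_args(argv)
    # Assumption: without a GUI there is no file dialog, so a path is needed.
    if args.nogui and args.path is None:
        parser.error("--path is required together with --nogui")
    return args
```

Validating the range inside the `type` callable keeps the error message next to the flag it belongs to, so an out-of-range degree is rejected before any sequence file is read.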

Requirements

This application requires Python 3 and the following Python packages:

package version
numpy >= 1.18
tkinter >= 8.6

Note that tkinter is installed by default on Windows 10, but not on Linux, where it usually has to be installed separately.

Mathematical Foundation

The matrix is calculated in the following steps:

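  1. Sequences with at least p% identity to each other are clustered; the remaining sequences are eliminated. The degree determines how similar the sequences must be.

  2. The sequences are then compared column by column, and the sequence letters (e.g. DNA bases) in each column are counted. Consider this example:
    ATGTACGT
    TAGCTAGA
    GTACGACC
    Each column k is examined separately, so column $k = 1$ is ATG, and so on. With $n_i^k$ the number of occurrences of letter i in column k, the pair counts $c_{i,j}^k$ form a matrix:
    $$c_{i,j}^k = \begin{cases} \binom{n_i^k}{2} & \text{for } i = j \\ n_i^k \, n_j^k & \text{for } i \neq j \end{cases}$$
    Note that this matrix is symmetric.

  3. The counts summed over all columns, and the normalization factor Z, are given by:
    $$c_{i,j} = \sum_{k} c_{i,j}^k, \qquad Z = \sum_{i \geq j} c_{i,j} = \frac{L \, N (N - 1)}{2}$$
    where L is the sequence length (the number of columns, e.g. L = 8 for ATGTACGT) and N is the number of sequences.

  4. Then $c_{i,j}$ is normalized to obtain the Q-matrix:
    $$q_{i,j} = \frac{c_{i,j}}{Z}$$

  5. The probability of occurrence of a sequence letter i is:
    $$p_i = q_{i,i} + \sum_{j \neq i} \frac{q_{i,j}}{2}$$

  6. Finally, the log-odds ratios are computed with:
    $$\mathrm{BLOSUM}_{i,j} = \begin{cases} 2 \log_2 \dfrac{q_{i,i}}{p_i^2} & \text{for } i = j \\ 2 \log_2 \dfrac{q_{i,j}}{2 \, p_i \, p_j} & \text{for } i \neq j \end{cases}$$
    Every entry of the result is rounded to an integer.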
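
As a hand check on the three example sequences, the counts for the first column can be worked out directly. This sketch assumes the usual BLOSUM pair count: if letter i occurs $n_i^k$ times in column k, the column contributes $\binom{n_i^k}{2}$ pairs of equal letters and $n_i^k \, n_j^k$ pairs of different letters i and j. Column 1 is ATG, so every letter occurs once, each mixed pair is counted once, and no equal pair occurs. The normalization factor is then the total number of letter pairs over all L = 8 columns and N = 3 sequences.

```latex
n_A^1 = n_T^1 = n_G^1 = 1, \qquad
c_{A,T}^1 = c_{A,G}^1 = c_{G,T}^1 = 1 \cdot 1 = 1, \qquad
c_{A,A}^1 = c_{T,T}^1 = c_{G,G}^1 = \binom{1}{2} = 0

Z = \frac{L \, N (N - 1)}{2} = \frac{8 \cdot 3 \cdot 2}{2} = 24
```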

This calculation is based on Dr. Sepp Hochreiter's script Bioinformatics I.
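
Steps 2 to 6 can be sketched in plain Python as follows. This is a minimal illustration, not the code in this repository: it assumes the standard BLOSUM pair-count formulas (spelled out in the comments), skips the clustering of step 1, and uses only the standard library instead of numpy. The function name `blosum_scores` and the choice to return `None` for pairs that never occur are assumptions.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def blosum_scores(sequences):
    """Return {(i, j): score} for all letter pairs with i <= j.

    A score is None when the pair never occurs in any column, because the
    log-odds ratio is undefined there (an assumption of this sketch).
    """
    n_seqs, length = len(sequences), len(sequences[0])
    if n_seqs < 2 or any(len(s) != length for s in sequences):
        raise ValueError("need at least two sequences of equal length")

    alphabet = sorted(set("".join(sequences)))
    pairs = list(combinations_with_replacement(alphabet, 2))

    # Steps 2 and 3: count letter pairs per column and sum over all columns.
    c = dict.fromkeys(pairs, 0)
    for column in zip(*sequences):
        n = Counter(column)  # n[i] = occurrences of letter i in this column
        for i, j in pairs:
            c[i, j] += (n[i] * (n[i] - 1) // 2) if i == j else (n[i] * n[j])

    # Step 3: Z = L * N * (N - 1) / 2, the total number of letter pairs.
    z = length * n_seqs * (n_seqs - 1) // 2

    # Step 4: normalize the counts to the Q-matrix.
    q = {pair: count / z for pair, count in c.items()}

    # Step 5: p_i = q_ii + (sum over j != i of q_ij) / 2.
    p = {}
    for i in alphabet:
        off_diagonal = sum(q[min(i, j), max(i, j)] for j in alphabet if j != i)
        p[i] = q[i, i] + off_diagonal / 2

    # Step 6: 2 * log2(observed / expected), rounded to an integer.
    scores = {}
    for i, j in pairs:
        expected = p[i] ** 2 if i == j else 2 * p[i] * p[j]
        observed = q[i, j]
        scores[i, j] = (round(2 * math.log2(observed / expected))
                        if observed > 0 else None)
    return scores


scores = blosum_scores(["ATGTACGT", "TAGCTAGA", "GTACGACC"])
print(scores["A", "A"], scores["A", "T"])  # prints: -2 1
```

Only one triangle of the symmetric matrix is stored, keyed by sorted letter pairs, so a lookup for (T, A) has to be made as (A, T). Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from other rounding conventions.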

Input Files

The input file can have any extension. The sequences in the input file must fulfill the following properties:

  • All sequences have the same length.
  • Sequences are separated by newlines.
  • At least two sequences are given.
  • Any input line starting with - is ignored.
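
The rules above can be checked with a few lines of Python. This is a sketch under stated assumptions, not the parser used by the application: the function name `parse_sequences` is hypothetical, and stripping surrounding whitespace and skipping blank lines are choices made here that the rules do not mention.

```python
def parse_sequences(text):
    """Return the sequences found in `text`, or raise ValueError."""
    sequences = []
    for line in text.splitlines():
        line = line.strip()  # assumption: surrounding whitespace is not data
        # Lines starting with "-" are ignored; blank lines are skipped too
        # (the latter is an assumption of this sketch).
        if not line or line.startswith("-"):
            continue
        sequences.append(line)

    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return sequences
```

Raising an exception instead of returning a partial list means a malformed file is rejected before any counting starts.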

Examples: Valid ✔️

-This is a valid input file, this line is ignored.
TACGTAGCTAGC
TGCATGCTAGCC
TGCTGCTGCCCA
TGTGTACACCCC
-This line is also ignored.

Not Valid

-This is an invalid input file, because the sequences differ in length.
TACGTAGCTAGC
TGCATGCT
TGCTGCTGCCCA
TGTGTACACCC