In [1]:
#!pip install pysam



In [1]:
import pysam
from Bio import SeqIO 
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

Open the file containing the reads

In [2]:
reads = list(SeqIO.parse('single_Pfal_dat.fq','fastq'))

Open the BAM file

In [14]:
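samfile = pysam.AlignmentFile('./single_Pfal_dat.bam', 'rb')
# until_eof=True iterates over every record without requiring an index;
# the name `alignments` avoids shadowing the built-in `iter`
alignments = samfile.fetch(until_eof=True)

First record in the file

In [15]:
# Take the first alignment record from the iterator
firstMapp = next(alignments)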

It corresponds to the read:

In [16]:
firstMapp.qname

'NC_004325.2-100000'

Look up the read with the matching id:

In [17]:
# Scan the FASTQ records until the id matches the alignment's query name
for i, read in enumerate(reads):
    if read.id == firstMapp.qname:
        readInt = read
        print(i)
        break

0


In [18]:
readInt

SeqRecord(seq=Seq('TTTCCTTTTTAAGCGTTTTATTTTTTAATAAAAAAAATATAGTATTATATAGTA...TAA'), id='NC_004325.2-100000', name='NC_004325.2-100000', description='NC_004325.2-100000', dbxrefs=[])

In [19]:
print(firstMapp)

NC_004325.2-100000	16	#0	131735	99	80=1X19=	*	0	0	TTAATATATTCCTCATATATTTATTTATATGGATCTTTTCACCCGTTACTATATAATACTATATTTTTTTTATTAAAAAATAAAACGCTTAAAAAGGAAA	array('B', [32, 35, 35, 5, 35, 34, 32, 33, 35, 36, 35, 35, 35, 35, 35, 26, 35, 34, 35, 32, 33, 32, 37, 33, 33, 35, 37, 20, 41, 36, 35, 37, 35, 32, 33, 40, 30, 33, 35, 40, 36, 33, 40, 34, 36, 41, 40, 29, 26, 40, 34, 41, 40, 38, 41, 41, 41, 41, 40, 39, 39, 37, 38, 40, 40, 41, 41, 41, 40, 40, 41, 40, 41, 41, 41, 27, 41, 36, 23, 40, 7, 41, 40, 10, 41, 39, 41, 39, 39, 35, 31, 37, 37, 29, 15, 35, 37, 34, 31, 31])	[]


## SAM file fields

### Query Name - QNAME
Identifier that is unique to the read within the file and can be used to identify any individual read. 

In [20]:
firstMapp.qname

'NC_004325.2-100000'

### FLAG
Also sometimes called a flag score (a slight misnomer), the FLAG is a decimal (base-10) number encoding a binary (base-2) number whose digits each represent a true/false statement about the alignment of the read. A bit set to zero means false, a bit set to one means true.

|Decimal|    Binary    |  Exp.  |Meaning|
|------:|-------------:|-------:|:------|
|    $1$|           $1$| $2^{0}$|This is a paired read |
|    $2$|          $10$| $2^{1}$| This read is part of a pair that aligned properly*   |
|    $4$|         $100$| $2^{2}$| This read was not aligned |
|    $8$|        $1000$| $2^{3}$|This read is part of a pair and its mate was not aligned|
|   $16$|       $10000$| $2^{4}$|This read aligned in the reverse direction**|
|   $32$|      $100000$| $2^{5}$|This read is part of a pair and its mate aligned in the reverse direction**|
|   $64$|     $1000000$| $2^{6}$|This read is the first in the pair (read 1)|
|  $128$|    $10000000$| $2^{7}$|This read is the second in pair (read 2)|
|  $256$|   $100000000$| $2^{8}$|The given alignment is a secondary alignment***|
|  $512$|  $1000000000$| $2^{9}$|Read failed quality check (such as Illumina quality filtering)|
| $1024$| $10000000000$|$2^{10}$|Read was flagged as a duplicate (such as a PCR duplicate)|
| $2048$|$100000000000$|$2^{11}$|Supplementary alignment (Exact meaning varies by aligner)|


`*` Proper alignment indicates both reads in a pair are oriented towards one another (one forward, one reverse), are both on the same contig, and are within the expected distance from one another.

`**` Direction is relative to the reference sequence used for alignment

`***` The read had multiple potential alignments; this was one of them, but not the first choice from among them

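Example: $99 = 64 + 32 + 2 + 1$, i.e. first read of a pair, mate aligned in the reverse direction, properly aligned pair, paired read.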

In [21]:
firstMapp.flag

16
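To make the bit arithmetic concrete, here is a minimal sketch that decodes a FLAG value into the meanings listed in the table above. The names `FLAG_BITS` and `decode_flag` are illustrative helpers, not part of pysam.

```python
# Bit values and (shortened) meanings taken from the FLAG table above.
FLAG_BITS = {
    1: "paired",
    2: "proper pair",
    4: "unmapped",
    8: "mate unmapped",
    16: "reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "failed quality check",
    1024: "duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """Return the meanings of all bits set in a SAM FLAG value."""
    # A bit is set when the bitwise AND with its value is non-zero.
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


print(decode_flag(16))  # ['reverse strand']
print(decode_flag(99))  # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']
```

The flag of 16 seen here therefore only says that the read aligned in the reverse direction; none of the pairing bits are set, as expected for single-end data.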

### Reference Name - RNAME
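Identifies the contig of the reference genome to which the read was aligned. In pysam, `rname` returns the index of that contig in the BAM header rather than its name, hence the `0` below.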

In [23]:
firstMapp.rname

0

### Position - POS
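Indicates the leftmost position on the reference to which the first aligned base of the read maps. SAM stores this coordinate 1-based, whereas pysam returns it 0-based, which is why the value below (131734) is one less than the 131735 shown in the printed record above.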

In [26]:
firstMapp.pos

131734

### Mapping Quality - MAPQ
Phred-scaled confidence score, $-10\log_{10}$ of the probability that the mapping position is wrong, so higher values mean a more reliable mapping. A value of 255 for this field indicates that no probability is given and is considered a placeholder value.

In [29]:
firstMapp.mapq

99
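As a rough illustration of the Phred scale, the sketch below converts a MAPQ value back into the implied probability of a wrong mapping, $p = 10^{-Q/10}$. The helper name `mapq_to_error_probability` is hypothetical, and the handling of 255 simply follows the placeholder convention described above.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled MAPQ into the probability that the mapping is wrong."""
    if mapq == 255:
        # 255 is the placeholder meaning "no probability available"
        return None
    return 10 ** (-mapq / 10)


for q in (0, 10, 20, 30, 99):
    print(q, mapq_to_error_probability(q))
```

Aligners differ in the range of MAPQ values they emit, so the resulting probabilities are best read as an aligner-specific confidence rather than a calibrated error rate.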

### CIGAR - Concise Idiosyncratic Gapped Alignment Report string
Sequence of numbers and letters (in that order) indicating continuities or discontinuities in the alignment caused by inserted or deleted bases (or other causes for discontinuity).

| Op.| Description |
|:--:|:------------|
|M   | Match (alignment column containing two letters). This could contain two different letters (mismatch) or two identical letters. USEARCH generates CIGAR strings containing Ms rather than X's and ='s (see below).|
|D   | Deletion (gap in the target sequence).|
|I   | Insertion (gap in the query sequence). |
|S   | Segment of the query sequence that does not appear in the alignment. This is used with soft clipping, where the full-length query sequence is given (field 10 in the SAM record). In this case, S operations specify segments at the start and/or end of the query that do not appear in a local alignment.|
|H   | Segment of the query sequence that does not appear in the alignment. This is used with hard clipping, where only the aligned segment of the query sequences is given (field 10 in the SAM record). In this case, H operations specify segments at the start and/or end of the query that do not appear in the SAM record.|
|=   | Alignment column containing two identical letters.|
|X   | Alignment column containing a mismatch, i.e. two different letters. |

*Source : https://www.drive5.com/usearch/manual/cigar.html*

In [31]:
firstMapp.cigarstring

'80=1X19='
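The string `'80=1X19='` reads as 80 identical bases, 1 mismatch, then 19 identical bases. A minimal sketch of how such a string could be split into (length, operation) pairs is given below. The helpers `parse_cigar` and `cigar_lengths` are illustrative, not pysam functions; the operations `N` and `P` do not appear in the table above and are included on the assumption that the full SAM operation set should be accepted.

```python
import re

# One CIGAR element is a run length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the read (query) and along the reference.
QUERY_OPS = set("MIS=X")
REFERENCE_OPS = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def cigar_lengths(cigar):
    """Return (bases consumed on the query, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in QUERY_OPS)
    reference_len = sum(n for n, op in ops if op in REFERENCE_OPS)
    return query_len, reference_len


print(parse_cigar("80=1X19="))    # [(80, '='), (1, 'X'), (19, '=')]
print(cigar_lengths("80=1X19="))  # (100, 100)
```

For this read both lengths are 100: the alignment has no insertions, deletions, or clipping, only a single mismatch at position 81 of the aligned sequence.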

### Reference Name for Mate - RNEXT 
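Analogous to field 3 (Reference Name) and follows the same rules, except that it describes the paired-end mate of the read (if there is one). To save space, this value will be “=” if it is identical to the Reference Name value, which should be the case most often. In pysam, `rnext` returns a contig index, with -1 when the field is unset (`*` in the printed record above), as for this single-end read.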

In [32]:
firstMapp.rnext

-1

### Position of Mate - PNEXT 
Analogous to field 4 (Position) and follows the same rules as that field. Because pysam reports it 0-based, an unset value (0 in the printed record above) appears as -1.

In [33]:
firstMapp.pnext

-1

### Template Length - TLEN 
Signed observed length of the template, i.e. the span on the reference from the leftmost mapped base to the rightmost mapped base of the read pair (this field is sometimes confused with the read length, which it is not). It is set to 0 for single-segment templates or when the information is unavailable, as for the single-end read shown here. When RNA or cDNA is aligned to a genomic DNA reference, the template length may reach tens of thousands of bases even for short reads because of the presence of an intron.

In [34]:
firstMapp.tlen

0

### Sequence - SEQ 
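Actual read sequence. It should generally follow the sequence line from the source FASTQ. For a read aligned in the reverse direction (flag 16, as here), it is stored as the reverse complement of the FASTQ sequence, which is why the value below does not start like `readInt` above.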

In [35]:
firstMapp.seq

'TTAATATATTCCTCATATATTTATTTATATGGATCTTTTCACCCGTTACTATATAATACTATATTTTTTTTATTAAAAAATAAAACGCTTAAAAAGGAAA'
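Since this read aligned in the reverse direction (flag 16), its SEQ can be checked against the FASTQ record by reverse-complementing it. The sketch below does this in plain Python on short excerpts only: the last 19 bases of the SEQ value above and the first 19 bases of `readInt`. The helper `reverse_complement` is illustrative; Biopython's `Seq.reverse_complement()` would be an alternative on full records.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence given as a string."""
    return seq.translate(COMPLEMENT)[::-1]


sam_seq_end = "AAAACGCTTAAAAAGGAAA"  # last 19 bases of firstMapp.seq
fastq_start = "TTTCCTTTTTAAGCGTTTT"  # first 19 bases of readInt.seq

print(reverse_complement(sam_seq_end) == fastq_start)  # True
```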

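### Quality String - QUAL
Base qualities, Phred-scaled and stored as ASCII characters offset by 33. They should generally follow the quality string from the source FASTQ file, in reversed order when the read aligned in the reverse direction.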

In [36]:
firstMapp.qual

'ADD&DCABDEDDDDD;DCDABAFBBDF5JEDFDABI?BDIEBICEJI>;ICJIGJJJJIHHFGIIJJJIIJIJJJ<JE8I(JI+JHJHHD@FF>0DFC@@'
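The characters of the quality string map back to the integer scores shown in the printed record above. A minimal sketch, assuming the usual Phred+33 ASCII encoding (the helper name `decode_phred33` is illustrative):

```python
def decode_phred33(qual):
    """Convert a Phred+33 encoded quality string into a list of integer scores."""
    return [ord(char) - 33 for char in qual]


# First eight characters of the quality string above
print(decode_phred33("ADD&DCAB"))  # [32, 35, 35, 5, 35, 34, 32, 33]
```

These values match the first eight entries of the `array('B', ...)` displayed by `print(firstMapp)`.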

### Predefined Tags
Optional tags predefined in the SAM/BAM file standard that carry additional information about the alignment or the read.

*Source : https://www.zymoresearch.com/blogs/blog/what-are-sam-and-bam-files*