
##**Importing Section**

In [25]:
%pip install biopython



In [26]:
import Bio
from Bio.Seq import Seq  # sequence objects
from Bio import SeqIO  # reads and iterates over sequence files
from Bio import Entrez  # provides code to access NCBI over the web
from Bio import motifs  # creates and analyses motif objects

##**Start initializing Motif Sequences/Instances**


- Create the motif from a list of sequences of equal length.
- The function is motifs.create(), which we covered in the previous lab.
- The individual sequences that make up a motif are called its instances.

In [27]:
instances = [Seq("TACAA"),
             Seq("TACGC"),
             Seq("TACAC"),
             Seq("TACCC"),
             Seq("AACCC"),
             Seq("AATGC"),
             Seq("AATGC")]
#create motif
motifs_seq = motifs.create(instances)
print(motifs_seq)
print(motifs_seq.counts)
motifs_seq

TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC

        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00



<Bio.motifs.Motif at 0x7fdd58005e50>
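
The counts table printed above can be rebuilt by hand, which makes it clearer what motifs_seq.counts holds. This is only a plain-Python sketch for illustration, not Biopython's own implementation; the instances are written as ordinary strings here.

```python
# Hand-rolled count matrix (teaching sketch, not Biopython's code).
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
alphabet = "ACGT"
length = len(instances[0])

# counts[letter][i] = how many instances have `letter` at position i
counts = {letter: [0] * length for letter in alphabet}
for instance in instances:
    for i, letter in enumerate(instance):
        counts[letter][i] += 1

for letter in alphabet:
    print(letter, counts[letter])
```

Every column sums to 7, the number of instances.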

##**General Initializing**

###Create any DNA sequence

In [28]:
dna_sequence = Seq("TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATT")

###Get the reverse complement of the DNA sequence

In [29]:
reversed_dna = dna_sequence.reverse_complement()
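
Note that reversed_dna holds the reverse complement, not just the reversed string. A minimal plain-Python sketch of the same operation, assuming the sequence contains only A, C, G and T (the helper name is made up for this example):

```python
# Complement every base, then reverse the result.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string of A, C, G, T."""
    return dna.translate(COMPLEMENT)[::-1]

# First ten bases of dna_sequence
print(reverse_complement("TATACATTAA"))  # TTAATGTATA
```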

##**PWM**
Position-weight matrices

- We add pseudocounts to avoid overfitting and to prevent probabilities from becoming zero.
- The next step is to normalize the counts.
- To add the pseudocounts before normalizing, pass a value to the pseudocounts argument of the normalize() function.

####pseudocounts
Pseudocounts guard against overfitting to the few instances we have.

In [30]:
pwm = motifs_seq.counts.normalize(pseudocounts=0.5)
print(pwm)

        0      1      2      3      4
A:   0.39   0.83   0.06   0.28   0.17
C:   0.06   0.06   0.61   0.28   0.72
G:   0.06   0.06   0.06   0.39   0.06
T:   0.50   0.06   0.28   0.06   0.06



Each base can also be given its own pseudocount by passing a dictionary, as in the next cell.

In [31]:
pwm2 = motifs_seq.counts.normalize(pseudocounts={"A":0.6, "C": 0.4, "G": 0.4, "T": 0.6})
print(pwm2)

        0      1      2      3      4
A:   0.40   0.84   0.07   0.29   0.18
C:   0.04   0.04   0.60   0.27   0.71
G:   0.04   0.04   0.04   0.38   0.04
T:   0.51   0.07   0.29   0.07   0.07
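
Both PWMs above can be reproduced by hand, assuming each entry is (count + pseudocount) divided by the column total including all pseudocounts. The function below is a sketch written for this note (it only borrows the name normalize); it accepts either a single number or a per-base dictionary.

```python
# Counts copied from the motifs_seq.counts table above.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}

def normalize(counts, pseudocounts):
    """Column-normalize counts; pseudocounts is a number or a per-base dict."""
    if not isinstance(pseudocounts, dict):
        pseudocounts = {letter: pseudocounts for letter in counts}
    length = len(next(iter(counts.values())))
    pwm = {letter: [] for letter in counts}
    for i in range(length):
        # Column total = observed counts plus every base's pseudocount
        total = sum(counts[l][i] + pseudocounts[l] for l in counts)
        for letter in counts:
            pwm[letter].append((counts[letter][i] + pseudocounts[letter]) / total)
    return pwm

pwm = normalize(counts, 0.5)
pwm2 = normalize(counts, {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6})
print(round(pwm["A"][0], 2), round(pwm2["A"][0], 2))
```

For example, A at position 0 is (3 + 0.5) / (7 + 4 * 0.5) = 0.39 with the uniform pseudocount, matching the first table.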



Get the consensus and anticonsensus sequences of the PWM

In [32]:
print(pwm.consensus)
print(pwm.anticonsensus)

TACGC
CCATG
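
The consensus takes the most probable base in each column and the anticonsensus the least probable one. The sketch below works on the raw counts, since a uniform pseudocount does not change the ranking inside a column. Ties are broken here by alphabetical order, which happens to reproduce the output above; how Biopython breaks ties is an assumption on my part.

```python
# Counts copied from the motifs_seq.counts table above.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
alphabet = "ACGT"

consensus = ""
anticonsensus = ""
for i in range(5):
    column = {letter: counts[letter][i] for letter in alphabet}
    # max/min return the first best letter, so ties go to the earlier letter
    consensus += max(alphabet, key=column.get)
    anticonsensus += min(alphabet, key=column.get)

print(consensus)
print(anticonsensus)
```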


Take the reverse complement of the PWM

In [33]:
rpwm = pwm.reverse_complement()
print(rpwm)

        0      1      2      3      4
A:   0.06   0.06   0.28   0.06   0.50
C:   0.06   0.39   0.06   0.06   0.06
G:   0.72   0.28   0.61   0.06   0.06
T:   0.17   0.28   0.06   0.83   0.39
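
Comparing the two tables shows what the reverse complement of a matrix does: each row is swapped with the row of its complementary base (A with T, C with G) and the column order is reversed. A small sketch using the rounded values printed above (the helper name is made up for this example):

```python
# Rounded values copied from the first PWM printed above.
pwm = {
    "A": [0.39, 0.83, 0.06, 0.28, 0.17],
    "C": [0.06, 0.06, 0.61, 0.28, 0.72],
    "G": [0.06, 0.06, 0.06, 0.39, 0.06],
    "T": [0.50, 0.06, 0.28, 0.06, 0.06],
}
PARTNER = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement_matrix(matrix):
    """Swap complementary rows and reverse the column order."""
    return {letter: matrix[PARTNER[letter]][::-1] for letter in matrix}

rpwm = reverse_complement_matrix(pwm)
for letter in "ACGT":
    print(letter, rpwm[letter])
```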



##**Position-Specific Scoring Matrix**
The PSSM holds the log-odds scores derived from the position-weight matrix.

###Calculate the pssm matrix

- log_odds(background=None)
- The Position-Specific Scoring Matrix (PSSM) is computed by the log_odds() function of the PWM.
- It returns the Position-Specific Scoring Matrix.
- If the background parameter is None, a uniform background is assumed.
> - A positive score means the symbol is more frequent in the motif than in the background.
> - A negative score means the symbol is more frequent in the background than in the motif.
> - A score of 0.0 means the symbol is equally likely in the background and in the motif.

In [34]:
pssm = pwm.log_odds()
print(pssm)

        0      1      2      3      4
A:   0.64   1.74  -2.17   0.15  -0.58
C:  -2.17  -2.17   1.29   0.15   1.53
G:  -2.17  -2.17  -2.17   0.64  -2.17
T:   1.00  -2.17   0.15  -2.17  -2.17



To calculate the PSSM against a background with unequal probabilities for A, C, G, T, use the background parameter

In [35]:
background = {"A":0.3,"C":0.2,"G":0.2,"T":0.3}
pssm = pwm.log_odds(background)
print(pssm)

        0      1      2      3      4
A:   0.37   1.47  -2.43  -0.11  -0.85
C:  -1.85  -1.85   1.61   0.47   1.85
G:  -1.85  -1.85  -1.85   0.96  -1.85
T:   0.74  -2.43  -0.11  -2.43  -2.43
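
Both PSSM tables can be checked by hand, assuming each score is the base-2 logarithm of the PWM probability divided by the background probability. The printed values agree with that reading: T at position 0 has probability 0.5, and log2(0.5 / 0.25) = 1.00. The function below is a sketch that only borrows the name log_odds.

```python
import math

# Counts copied from the motifs_seq.counts table; pseudocount 0.5 as in the PWM cell.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
total = 7 + 4 * 0.5  # 7 instances plus one pseudocount per base
pwm = {l: [(c + 0.5) / total for c in row] for l, row in counts.items()}

def log_odds(pwm, background=None):
    """Return log2(p / b) per entry; a uniform background is used when None."""
    if background is None:
        background = {letter: 1 / len(pwm) for letter in pwm}
    return {l: [math.log2(p / background[l]) for p in row] for l, row in pwm.items()}

uniform = log_odds(pwm)
weighted = log_odds(pwm, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
print(round(uniform["T"][0], 2), round(weighted["T"][0], 2))
```

Raising the background probability of A and T lowers their scores, while C and G score higher than under the uniform background.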



###Calculate the PSSM scores at all positions of the given sequence (dna_sequence)


In [None]:
print(pssm.calculate(dna_sequence))

[  3.841277    -4.865589     1.2783408   -8.77248     -3.7660534
  -3.1286235   -5.087981    -2.1810908   -1.4732716    0.11169091
  -5.2806263   -7.017592    -3.1107016   -4.210237    -6.4326296
  -5.4326296   -8.087981   -10.409909     2.3267038   -5.2806263
  -0.38823548  -5.4326296   -8.087981    -9.409909    -1.4441253
  -5.4505515   -2.7660534   -4.865589     0.8486565   -3.7951996
  -3.9180565   -7.087981    -4.766053    -1.4732716    2.8121307
  -4.695664    -1.9731979   -6.503019    -7.087981    -4.766053
  -0.4732716   -0.95869845  -4.865589    -4.766053    -0.4732716
  -0.95869845  -4.865589    -3.7660534   -1.5436609   -1.5436609
  -1.5436609   -3.1286235   -4.865589     1.8486565   -4.865589
  -6.087981    -3.3875418   -0.22173283  -3.7280855   -2.1810908
  -7.672944    -2.7660534   -4.865589     1.8486565   -6.4505515
  -6.824947    -3.0290878   -7.7724795    0.14083725  -5.3801622
   0.474261    -5.865589   -10.994872    -1.0656136    0.1567788
  -0.4586248    2.2418149 
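
Each number in the array is the score of one 5-base window: the PSSM entries for the bases in that window are simply added up. The sketch below uses the rounded PSSM values from the non-uniform background table, so results agree with the array only to about two decimals; the function merely borrows the name calculate.

```python
# Rounded values copied from the PSSM printed for the non-uniform background.
pssm = {
    "A": [0.37, 1.47, -2.43, -0.11, -0.85],
    "C": [-1.85, -1.85, 1.61, 0.47, 1.85],
    "G": [-1.85, -1.85, -1.85, 0.96, -1.85],
    "T": [0.74, -2.43, -0.11, -2.43, -2.43],
}

def calculate(pssm, sequence):
    """Score every window of motif width along the forward strand."""
    width = len(pssm["A"])
    return [
        sum(pssm[sequence[i + j]][j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

scores = calculate(pssm, "TATACATTAA")  # first ten bases of dna_sequence
print([round(s, 2) for s in scores])
```

A sequence of length n gives n - 4 scores for this 5-column PSSM. The first window, TATAC, scores 0.74 + 1.47 - 0.11 - 0.11 + 1.85 = 3.84, which matches the first value in the array.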

####Calculate the scores for reverse strand 
The scores returned by pssm.calculate are for the forward strand only. To obtain the scores on the reverse strand, you can take the reverse complement of the PSSM:


In [None]:
rpssm = pssm.reverse_complement()
print(rpssm.calculate(dna_sequence))

[ -9.994872    -3.7660534   -7.187517     0.263694    -4.503019
  -6.672944    -2.7660534   -7.672944    -8.672944    -4.365515
  -4.62855     -1.4806511   -2.6650758   -4.043587     0.2418149
   2.8267775    1.8792448    1.841277    -6.70947     -4.62855
  -5.9180565    3.3122041   -1.8211949    4.4262395   -4.70947
  -4.351016    -5.4505515   -5.351016   -10.994872    -2.0435872
  -0.4586248    0.51934886  -4.972504    -8.672944    -6.950478
  -4.043587    -0.6326542    2.7272418   -3.1810908   -4.972504
  -6.087981    -4.950478    -3.7660534   -4.972504    -6.087981
  -4.950478    -3.7660534   -2.3875418   -6.672944    -6.672944
  -6.672944    -7.035514    -5.351016    -8.409909    -2.6285498
  -2.5436609   -1.0656136   -8.77248     -2.1810908   -4.017592
  -3.7660534   -5.4505515   -5.351016    -8.409909    -2.9911199
   1.7257998   -5.07204     -0.44412524  -9.994872    -6.351016
  -2.7805529   -2.4061575    4.063669    -3.3875418   -8.672944
  -3.9180565   -6.017592    -2.4732716
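
Another way to read the reverse-strand scores: the value at position i is what the forward PSSM would give to the reverse complement of the window starting at i. A sketch with the rounded PSSM values (the function name is made up for this example):

```python
# Rounded values copied from the PSSM printed for the non-uniform background.
pssm = {
    "A": [0.37, 1.47, -2.43, -0.11, -0.85],
    "C": [-1.85, -1.85, 1.61, 0.47, 1.85],
    "G": [-1.85, -1.85, -1.85, 0.96, -1.85],
    "T": [0.74, -2.43, -0.11, -2.43, -2.43],
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_strand_scores(pssm, sequence):
    """Score the reverse complement of every window with the forward PSSM."""
    width = len(pssm["A"])
    scores = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width].translate(COMPLEMENT)[::-1]
        scores.append(sum(pssm[base][j] for j, base in enumerate(window)))
    return scores

rscores = reverse_strand_scores(pssm, "TATACATTAA")  # first ten bases of dna_sequence
print([round(s, 2) for s in rscores])
```

The first values agree with the array above to about two decimals (-9.99, -3.76, ...).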

###The maximum and minimum of the PSSM
The highest and lowest scores obtainable from the PSSM are stored in the .max and .min properties.

In [36]:
print("%4.2f" % pssm.max)
print("%4.2f" % pssm.min)

6.63
-11.58


> - The % operator passes a value into the format string on its left.
> - Here pssm.max is the value that is substituted into "%4.2f".
> - In 4.2f, the 4 is the minimum field width and .2 keeps only two digits after the decimal point.
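
The maximum is the sum of the best score in every column and the minimum the sum of the worst. The sketch below checks this with the rounded PSSM values and also shows what the width part of the format does. Because the table is rounded, the minimum comes out as -11.57 rather than the -11.58 printed above.

```python
# Rounded values copied from the PSSM printed for the non-uniform background.
pssm = {
    "A": [0.37, 1.47, -2.43, -0.11, -0.85],
    "C": [-1.85, -1.85, 1.61, 0.47, 1.85],
    "G": [-1.85, -1.85, -1.85, 0.96, -1.85],
    "T": [0.74, -2.43, -0.11, -2.43, -2.43],
}

# Best and worst achievable scores: pick the extreme entry in each column.
best = sum(max(pssm[l][i] for l in "ACGT") for i in range(5))
worst = sum(min(pssm[l][i] for l in "ACGT") for i in range(5))

print("%4.2f" % best)     # two decimals, at least 4 characters wide
print("%4.2f" % worst)
print("[%8.2f]" % best)   # a wider field pads with spaces on the left
```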

###The mean and standard deviation

In [37]:
mean = pssm.mean(background)
std = pssm.std(background)
print("mean = %0.2f, standard deviation = %0.2f" % (mean, std))

mean = 3.21, standard deviation = 2.60
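
One reading that reproduces the printed values: the mean is the expected score when every position is drawn from the motif's own probabilities, and the standard deviation is the square root of the per-column variances added together. The sketch below is written under that assumption and is not taken from Biopython's source.

```python
import math

# Counts copied from the motifs_seq.counts table; pseudocount 0.5 as in the PWM cell.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
total = 7 + 4 * 0.5
pwm = {l: [(c + 0.5) / total for c in row] for l, row in counts.items()}

mean = 0.0
variance = 0.0
for i in range(5):
    column_mean = 0.0
    column_square = 0.0
    for letter in "ACGT":
        p = pwm[letter][i]                       # probability under the motif
        score = math.log2(p / background[letter])  # log-odds score
        column_mean += p * score
        column_square += p * score * score
    mean += column_mean
    variance += column_square - column_mean ** 2  # columns treated as independent
std = math.sqrt(variance)

print("mean = %0.2f, standard deviation = %0.2f" % (mean, std))
```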


##**Searching**

### Searching for a motifs/instances
We are going to search the sequence for exact occurrences of the motif instances.


In [44]:
for pos, found_motifs in motifs_seq.instances.search(dna_sequence):
    print("%i %s" % (pos, found_motifs))

To search for exact matches we use motifs_seq.instances.search(sequence):
- It finds the positions at which any of the instances in motifs_seq occurs in the given sequence.
- It returns each position together with the instance found there.
- No output follows the cell above, which is consistent with none of the instances occurring exactly in dna_sequence.
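
The exact-match search can be pictured as a sliding window that is compared against the set of instances. A plain-Python sketch; the function name and the short test sequence are made up for this example, with the sequence chosen so that two instances occur in it.

```python
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]

def search_instances(instances, sequence):
    """Yield (position, instance) for every exact occurrence in sequence."""
    width = len(instances[0])
    wanted = set(instances)
    for pos in range(len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        if window in wanted:
            yield pos, window

test_sequence = "GGTACAAGGAATGC"  # hypothetical sequence, not from the lab
hits = list(search_instances(instances, test_sequence))
for pos, found in hits:
    print("%i %s" % (pos, found))
```

This prints 2 TACAA and 9 AATGC.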

#### Searching for motifs in the reverse complement of DNA seq

In [43]:
for pos, found_motifs in motifs_seq.instances.search(reversed_dna):
    print("%i %s" % (pos, found_motifs))

###Searching Using the PSSM Score

In [45]:
for pos, score in pssm.search(dna_sequence, threshold=3.0):
    print("Position %d: score = %5.3f" % (pos, score))

Position 0: score = 3.841
Position -119: score = 3.312
Position -117: score = 4.426
Position -68: score = 4.064
Position 130: score = 3.479


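> The search() function of the PSSM scans the sequence for positions whose score meets the threshold.
- Its parameters are the given sequence and the threshold.
- Negative positions refer to hits on the reverse strand and follow the Python convention for negative indices.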
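
Putting the pieces together, a threshold search on both strands can be sketched as follows. It uses the rounded PSSM values, and the way reverse-strand hits are reported (window start minus the sequence length) is inferred from the output above rather than taken from Biopython's source; whether the comparison is >= or > is also an assumption.

```python
# Rounded values copied from the PSSM printed for the non-uniform background.
pssm = {
    "A": [0.37, 1.47, -2.43, -0.11, -0.85],
    "C": [-1.85, -1.85, 1.61, 0.47, 1.85],
    "G": [-1.85, -1.85, -1.85, 0.96, -1.85],
    "T": [0.74, -2.43, -0.11, -2.43, -2.43],
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def search(pssm, sequence, threshold):
    """Yield (position, score) for hits on either strand."""
    width = len(pssm["A"])
    n = len(sequence)
    for i in range(n - width + 1):
        window = sequence[i:i + width]
        forward = sum(pssm[b][j] for j, b in enumerate(window))
        if forward >= threshold:
            yield i, forward
        rc_window = window.translate(COMPLEMENT)[::-1]
        reverse = sum(pssm[b][j] for j, b in enumerate(rc_window))
        if reverse >= threshold:
            yield i - n, reverse  # negative position marks the reverse strand

prefix = "TATACATTAAAGGAGGGGGATGCGGATAAA"  # first 30 bases of dna_sequence
hits = list(search(pssm, prefix, threshold=3.0))
for pos, score in hits:
    print("Position %d: score = %5.2f" % (pos, score))
```

On this 30-base prefix the hits are at 0, -9 and -7. The last two are the same windows (starting at 21 and 23) that appear as -119 and -117 in the full 140-base sequence.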

--------------