### 1. Write a function that, given a sequence as an argument, detects whether it contains repeated sub-sequences of size k (the second argument of the function). The result should be a dictionary where keys are sub-sequences and values are the number of times they occur (at least 2). Use the function in a program that reads the sequence and k and prints the result by decreasing frequency.

In [1]:
def find_repeated_subsequences(sequence, k):
    """Count the sub-sequences of size k that occur at least twice in the sequence.
       Return a dictionary {sub-sequence: count} ordered by decreasing frequency."""

    subseq_count = {}  # dictionary that keeps track of each sub-sequence and its count

    for i in range(len(sequence) - k + 1):    # visit every start index of a window of length k
        subseq = sequence[i:i + k]            # extract the sub-sequence of length k starting at index i
        if subseq in subseq_count:            # already seen: increment its count by 1
            subseq_count[subseq] += 1
        else:                                 # first occurrence: add it with a count of 1
            subseq_count[subseq] = 1

    # keep only the sub-sequences that appear at least twice
    repeated_subseq = {subseq: count for subseq, count in subseq_count.items() if count > 1}

    # sort by count in descending order and convert back to a dictionary (ties keep their first-seen order)
    sorted_repeated_subseq = dict(sorted(repeated_subseq.items(), key=lambda item: item[1], reverse=True))

    return sorted_repeated_subseq

In [2]:
# test the code with the given DNA sequence
find_repeated_subsequences('ATCGTCGTAGTACTGTTCGGTATGATGAGTA', 3)

{'GTA': 4, 'TCG': 3, 'CGT': 2, 'AGT': 2, 'ATG': 2, 'TGA': 2}
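The task also asks for a program that reads the sequence and k and prints the result by decreasing frequency, while the cells above only call the function directly. One way such a program could look is sketched below. The names `count_repeats` and `run` are introduced here for illustration and are not part of the original notebook, and `collections.Counter` stands in for the manual dictionary. The `read` and `write` parameters default to `input` and `print`; they are an assumption made so the sketch can be exercised without typing at a prompt.

```python
from collections import Counter


def count_repeats(sequence, k):
    """Count every window of length k and keep those seen at least twice."""
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    counts = Counter(windows)
    # most_common() orders by decreasing count; ties keep first-seen order
    return {sub: n for sub, n in counts.most_common() if n > 1}


def run(read=input, write=print):
    """Read a sequence and k, then report repeats by decreasing frequency."""
    sequence = read("Sequence: ").strip().upper()
    k = int(read("k: "))
    if k < 1:
        raise ValueError("k must be a positive integer")
    repeats = count_repeats(sequence, k)
    if not repeats:
        write(f"No sub-sequence of size {k} occurs more than once.")
    for sub, n in repeats.items():
        write(f"{sub}\t{n}")  # one tab-separated line per repeated sub-sequence
    return repeats
```

Calling `run()` in a notebook cell prompts for the sequence and for k, then prints one line per repeated sub-sequence, most frequent first.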

### 2. Write a function that reads a file in the FASTA format and returns a list with all sequences.

In [3]:
def read_fasta(file_path):
    """ this function reads a file in the FASTA format and returns a list with all sequences."""
    
    sequences = []    # initialize an empty list to store sequences 
    with open(file_path, 'r') as file:    # open the file at the given path for reading
        sequence = ""                     # initialize an empty string to build each sequence
        for line in file:                 # iterate over each line in the file
            line = line.strip()           # remove leading and trailing white space from the line
            if line.startswith(">"):      # check if the line is a header (starts with '>')
                if sequence:              # if a sequence has been collected, add it to the list
                    sequences.append(sequence)
                    sequence = ""
            else:                         # if not a header, add the line to the current sequence
                sequence += line
                
        if sequence:  # add the last collected sequence to the list
            sequences.append(sequence)
    return sequences                      # return the list of sequences

In [4]:
# test the function with a file that contains 3 nucleotide and 2 amino acid sequences in FASTA format.
read_fasta('Assignmnet _Abebe_FASTA.txt')

['AGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATCATTTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAATTAAATGATCTCTGCTTTACTAATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACAATCTTGATTCTAAGGTTGGTGGTAATTATAATTACCGGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGAGATATTTCAACTGAAATCTATCAGGCCGGTAGCAAACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACTTTCCTTTACAATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTAACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTCTGCCTTTCCAACAATTTGGCAGAGACATTGCTGACACTACTGAT',
 'CCAATTTTATATCAACATTTATTTTGATTTTTTGGTCATCCTGAAGTTTATATTCTAATTTTACCAGGATTCGGTATAATTTCACATATTATTAGTCAAGAATCAGGAAAAAAGGAAACATTCGGATCATTAGGAATAATTTATGCTATGTTAGCTATTGGACTATTAGGATTTATTGTATGAGCCCATCATATATTTAC
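Because the test above depends on a local file, a self-contained check can be useful. The sketch below writes a made-up two-record FASTA file to a temporary directory and parses it with the same header-and-accumulate logic. The function name `read_fasta_sequences`, the file name `demo.fasta` and the toy records are all assumptions for illustration; they are not the assignment data.

```python
import os
import tempfile


def read_fasta_sequences(file_path):
    """Return the sequences of a FASTA file as a list of strings."""
    sequences = []
    parts = []                      # lines of the record currently being read
    with open(file_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):        # header: close the previous record
                if parts:
                    sequences.append("".join(parts))
                parts = []
            elif line:                      # sequence line: skip blank lines
                parts.append(line)
    if parts:                               # close the last record
        sequences.append("".join(parts))
    return sequences


# Made-up two-record file, written to a temporary location for the demo.
demo_text = (
    ">seq1 toy nucleotide record\n"
    "ATCG\n"
    "GGTA\n"
    ">seq2 toy protein record\n"
    "MKQYHLTDF\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(demo_text)
    demo_sequences = read_fasta_sequences(demo_path)

print(demo_sequences)  # ['ATCGGGTA', 'MKQYHLTDF']
```

Collecting the lines in a list and joining them once per record avoids repeated string concatenation, which can matter for long sequences.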

### 3. In many membrane proteins, there is a conserved motif that allows them to be identified while they are transported by the endosomes to be degraded in the lysosomes. This motif occurs in the last 10 positions of the protein and is characterized by the amino acid tyrosine (Y), followed by any two amino acids, and terminating in a hydrophobic amino acid from the following set: phenylalanine (F), tyrosine (Y) or threonine (T).
#### a. Write a function that, given a protein (sequence of amino acids), returns an integer value indicating the position where the motif occurs in the sequence, or -1 if it does not occur.

In [5]:
def find_motif(protein_sequence):
    """Check for the motif YXX[FYT] in the last ten positions of the given protein sequence.
       Return the 0-based start position of the motif if found, otherwise return -1."""

    motif_length = 4  # length of the motif: Y, any two amino acids, then F, Y or T
    if len(protein_sequence) < motif_length:    # a sequence shorter than the motif cannot contain it
        return -1

    # the motif must lie entirely within the last ten positions
    first_start = max(0, len(protein_sequence) - 10)
    last_start = len(protein_sequence) - motif_length
    for i in range(first_start, last_start + 1):
        # motif pattern YXX[FYT]: XX refers to any two amino acids
        if protein_sequence[i] == 'Y' and protein_sequence[i + 3] in {'F', 'Y', 'T'}:
            return i      # return the starting position of the motif
    return -1             # -1 cannot be confused with a valid position such as 1

In [6]:
# test the code with a protein sequence that contains the motif in the last ten positions of the sequence
find_motif('MRPSGASRAKAWVTRLYYREAKTYNAYWVQD')

23

In [7]:
# test the code with a protein sequence that contains the motif, but not in the last ten positions of the sequence
find_motif('MDYAGFKKIPVLLVGAAGILAVLLPCLLLVTVLNGFAEKNPGLYFKLSL')

-1

In [8]:
# test the code with a protein sequence that does not contain the motif at all
find_motif('MRPSGTAGAALLALLAALCPASRAKAWVTRLRIVVQGRKQCVSA')

-1

In [9]:
# test the code with a protein sequence that is shorter than the motif
find_motif('YFK')

-1

In [10]:
# test the code with a protein sequence that contains the motif and has the same length as the motif
find_motif('YGAF')

0

In [11]:
# test the code with a protein sequence that has the same length as the motif but does not contain it
find_motif('YGAM')

-1

#### b. Write a function that, given a list of protein sequences, returns a list of tuples, containing the sequences that contain the previous motif (in the first position of the tuple), and the position where it occurs (in the second position). Use the previous function.

In [12]:
def find_motif_in_list(protein_list):
    """Check each protein sequence in the given list for the motif. Return a list of tuples with the
       sequence and the position of the motif for every sequence in which it is found."""

    result = []  # initialize an empty list to store results

    for protein_sequence in protein_list:        # iterate over each protein sequence in the list
        position = find_motif(protein_sequence)  # check for the motif in the sequence
        if position != -1:  # if the motif is found, add the sequence and the position to the result list
            result.append((protein_sequence, position))
    return result  # return the list of results

In [13]:
# test the code with a list of protein sequences with and without the motif.
protein_list = ["MKQYHLTDF","MDFTRKLVY","GYPYHLTTY","YYYPYTFFF","AFYHLYTAYF","MKQDHYLDMT"]
find_motif_in_list(protein_list)

[('MKQYHLTDF', 3), ('GYPYHLTTY', 3), ('YYYPYTFFF', 1), ('AFYHLYTAYF', 2)]
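
The motif described in the task (Y, any two amino acids, then F, Y or T) can also be written as a regular expression. The sketch below is an alternative formulation, not the notebook's method: `find_motif_regex` and `MOTIF` are hypothetical names, and the function returns `None` rather than an integer when the motif is absent, which is a choice made for this sketch so that "not found" can never be mistaken for a position.

```python
import re

# Y, then any two characters, then one of F, Y or T
MOTIF = re.compile(r"Y..[FYT]")


def find_motif_regex(protein_sequence):
    """Return the 0-based start of the motif within the last ten positions, or None."""
    tail_start = max(0, len(protein_sequence) - 10)
    # start searching at tail_start; a match must fit inside the string,
    # so it necessarily lies within the last ten positions
    match = MOTIF.search(protein_sequence, tail_start)
    return match.start() if match else None


print(find_motif_regex("MRPSGASRAKAWVTRLYYREAKTYNAYWVQD"))  # 23
print(find_motif_regex("YYYPYTFFF"))                        # 1
print(find_motif_regex("YFK"))                              # None
```

Positions reported by `match.start()` are relative to the whole sequence, so they are directly comparable with the indices used in the loop-based function.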