<a href="https://colab.research.google.com/github/SisekoC/My-Notebooks/blob/main/Bioinformatics_exercises_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 1. Write a Python function that, given a DNA sequence, detects whether it contains repeated sub-sequences of size k (where k is passed as an argument to the function). The result should be a dictionary with the sub-sequences as keys and their frequencies as values.

In [1]:
def find_repeated_sequences(dna_sequence, k):
    if k <= 0 or k > len(dna_sequence):
        raise ValueError("k must be a positive integer less than or equal to the length of the DNA sequence.")

    sequence_count = {}
    for i in range(len(dna_sequence) - k + 1):
        subsequence = dna_sequence[i:i + k]
        if subsequence in sequence_count:
            sequence_count[subsequence] += 1
        else:
            sequence_count[subsequence] = 1

    # Filter out sequences that are not repeated
    repeated_sequences = {seq: count for seq, count in sequence_count.items() if count > 1}

    return repeated_sequences

# Example usage
dna_sequence = "ACGTACGTAC"
k = 3
repeated_sequences = find_repeated_sequences(dna_sequence, k)
print(repeated_sequences)  # Output: {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}


{'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
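
The same k-mer counting can be sketched more compactly with `collections.Counter` from the standard library. This is only an alternative formulation of the loop above; the name `count_repeated_kmers` is introduced here for illustration and is not part of the exercise.

```python
from collections import Counter

def count_repeated_kmers(dna_sequence, k):
    """Count every sub-sequence of size k and keep those seen more than once."""
    if k <= 0 or k > len(dna_sequence):
        raise ValueError("k must be between 1 and the length of the DNA sequence.")
    # Slide a window of width k over the sequence and count each window
    counts = Counter(dna_sequence[i:i + k] for i in range(len(dna_sequence) - k + 1))
    # Keep only the sub-sequences that occur at least twice
    return {kmer: n for kmer, n in counts.items() if n > 1}

print(count_repeated_kmers("ACGTACGTAC", 3))
# {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```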


2. Most introns can be recognized by their consensus sequence, which is defined as GT...TACTAAC...AC, where ... stands for an unknown number of nucleotides (between 1 and 10). Write a Python function that, given a DNA sequence, checks whether it contains an intron according to this definition. The result should be a list with the initial positions of all introns (an empty list if there are none).

In [3]:
import re

def find_introns(dna_sequence):
    pattern = r'GT[ATGC]{1,10}TACTAAC[ATGC]{1,10}AC'
    matches = [match.start() for match in re.finditer(pattern, dna_sequence)]
    return matches

# Example usage
dna_sequence = "GTACGTACTAACACGTGTACTAACGACGT"
introns = find_introns(dna_sequence)
print(introns)  # Output: [0] (re.finditer reports only non-overlapping matches)


[0]
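
Because `re.finditer` consumes the text of each match, a candidate intron that starts inside a previous match is not reported. If overlapping introns should also be listed, one possible sketch wraps the same pattern in a lookahead so that every start position is tested. The function name `find_introns_overlapping` is an assumption made for this illustration.

```python
import re

def find_introns_overlapping(dna_sequence):
    # The lookahead (?=...) matches without consuming characters, so the
    # scan moves forward one position at a time and overlapping candidates
    # are all reported.
    pattern = r'(?=GT[ATGC]{1,10}TACTAAC[ATGC]{1,10}AC)'
    return [match.start() for match in re.finditer(pattern, dna_sequence)]

print(find_introns_overlapping("GTACGTACTAACACGTGTACTAACGACGT"))
# [0, 14]
```

In the example sequence, the second candidate starts at position 14, inside the first match, which is why the non-overlapping search returns only `[0]`.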


 3. In many membrane proteins, there is a conserved motif that allows them to be identified in the process by which endosomes transport these proteins to the lysosomes for degradation. This motif occurs in the last 10 positions of the protein and is characterized by the amino acid tyrosine (Y), followed by any two amino acids, and terminating in a hydrophobic amino acid from the following set: phenylalanine (F), tyrosine (Y) or threonine (T).

 a. Write a function that, given a protein (a sequence of amino acids), returns an integer value indicating the position where the motif occurs in the sequence, or −1 if it does not occur.

In [4]:
def find_protein_motif(protein_sequence):
    # Motif: tyrosine (Y), any two amino acids, then F, Y or T
    motif_pattern = re.compile(r'Y..[FYT]')

    # Index where the last 10 positions start (0 for proteins shorter than 10,
    # which would otherwise produce a negative offset)
    offset = max(len(protein_sequence) - 10, 0)

    # Search for the motif in the last 10 positions
    match = motif_pattern.search(protein_sequence[offset:])

    if match:
        # Return the position where the motif starts in the full sequence
        return offset + match.start()
    else:
        # Return -1 if the motif is not found
        return -1

# Example usage
protein_sequence = "MKFPYWDTYLGYPFYTYA"
motif_position = find_protein_motif(protein_sequence)
print(motif_position)  # Output: 8 (-1 would be returned if the motif were not found)


8


 b. Write a function that, given a list of protein sequences, returns a list of tuples, containing the sequences that contain the previous motif (in the first position of the
 tuple), and the position where it occurs (in the second position). Use the previous
 function.

In [6]:
def find_motif_in_proteins(protein_list):
    # List to store the results
    results = []

    # Loop through each protein sequence in the list
    for protein_sequence in protein_list:
        # Find the position of the motif using the previous function
        motif_position = find_protein_motif(protein_sequence)

        # If the motif is found, add the sequence and position to the results list
        if motif_position != -1:
            results.append((protein_sequence, motif_position))

    return results

# Example usage
protein_list = [
    "MKFPYWDTYLGYPFYTYA",  # Contains motif at position 8
    "AKFTYAGTYC",          # Contains motif at position 4
    "TRYGKDLPFYK",         # No motif
    "PLTAYIVPTLF"          # No motif
]
motif_results = find_motif_in_proteins(protein_list)
print(motif_results)  # Output: [('MKFPYWDTYLGYPFYTYA', 8), ('AKFTYAGTYC', 4)]


[('MKFPYWDTYLGYPFYTYA', 8), ('AKFTYAGTYC', 4)]


 4. Write a function that, given two sequences of the same length, determines whether they have at most d mismatches (d is an argument of the function). The function returns True if the number of mismatches is less than or equal to d, and False otherwise. Using the previous function, write another function to find all approximate matches of a pattern in a sequence. An approximate match of the pattern can have at most d characters that do not match (d is an argument of the function).

Let's start by writing a function to determine if two sequences of the same length have at most d mismatches. Then, we'll use this function to find all approximate matches of a pattern in a sequence.

In [7]:
def has_mismatches(seq1, seq2, d):
    if len(seq1) != len(seq2):
        raise ValueError("The sequences must be of the same length.")

    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches <= d

# Example usage
seq1 = "ACGT"
seq2 = "AGGT"
d = 1
print(has_mismatches(seq1, seq2, d))  # Output: True


True


Next, here's a function to find all approximate matches of a pattern in a sequence:

In [8]:
def find_approximate_matches(pattern, sequence, d):
    pattern_length = len(pattern)
    matches = []

    for i in range(len(sequence) - pattern_length + 1):
        subsequence = sequence[i:i + pattern_length]
        if has_mismatches(pattern, subsequence, d):
            matches.append(i)

    return matches

# Example usage
pattern = "ACGT"
sequence = "ACGTACTGACGT"
d = 1
approximate_matches = find_approximate_matches(pattern, sequence, d)
print(approximate_matches)  # Output: [0, 8]


[0, 8]
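
To see why only positions 0 and 8 qualify for d = 1, it can help to report the number of mismatches next to each matching position. The following is a minimal sketch of that idea; `hamming_distance` and `approximate_matches_with_counts` are names introduced here for illustration and are not part of the exercise.

```python
def hamming_distance(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("The sequences must be of the same length.")
    return sum(a != b for a, b in zip(seq1, seq2))

def approximate_matches_with_counts(pattern, sequence, d):
    """Return (position, mismatches) for every window with at most d mismatches."""
    width = len(pattern)
    hits = []
    for i in range(len(sequence) - width + 1):
        distance = hamming_distance(pattern, sequence[i:i + width])
        if distance <= d:
            hits.append((i, distance))
    return hits

print(approximate_matches_with_counts("ACGT", "ACGTACTGACGT", 1))
# [(0, 0), (8, 0)]
print(approximate_matches_with_counts("ACGT", "ACGTACTGACGT", 2))
# [(0, 0), (4, 2), (8, 0)]
```

The window at position 4 (`ACTG`) differs from the pattern in two positions, so it is accepted only once d is raised to 2.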


 5. Write a function that reads a file in the FASTA format and returns a list with all sequences.

In [9]:
def read_fasta(file_path):
    sequences = []
    with open(file_path, 'r') as file:
        sequence = ''
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                if sequence:
                    sequences.append(sequence)
                    sequence = ''
            else:
                sequence += line
        if sequence:
            sequences.append(sequence)
    return sequences

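# Example usage: write a small FASTA file first so that the example is runnable
# (the headers 'seq1' and 'seq2' are placeholders chosen for this example)
file_path = 'example.fasta'
with open(file_path, 'w') as file:
    file.write(">seq1\nATCGTACG\nATCG\n>seq2\nCGTACGTAGCTAG\n")

sequences = read_fasta(file_path)
print(sequences)  # Output: ['ATCGTACGATCG', 'CGTACGTAGCTAG']


['ATCGTACGATCG', 'CGTACGTAGCTAG']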

6. Files from UniProt saved in the FASTA format have a specific header structure given by:

 db|Id|Entry Protein OS=Organism [GN=Gene] PE=Existence SV=Version

 Write a function that uses regular expressions to parse a string in this format and returns a dictionary with the different fields (the key should be the field name). Note that the part in square brackets is optional, the parts in italics are the values of the fields, and the parts in upper case are constant placeholders.

In [11]:
import re

def parse_uniprot_fasta_header(header):
    # Remove the leading '>' if present
    if header.startswith('>'):
        header = header[1:]

    # Define the regular expression pattern for the header
    pattern = (
        r'^(?P<db>[^|]+)\|'            # Database
        r'(?P<id>[^|]+)\|'             # ID
        r'(?P<entry>[^\s]+)\s+'        # Entry name
        r'(?P<protein>.*?)'            # Protein name (lazy matching)
        r'\s+OS=(?P<organism>.*?)'     # Organism
        r'(?:\s+GN=(?P<gene>.*?))?'    # Optional Gene name
        r'\s+PE=(?P<existence>\d+)'    # Protein existence
        r'\s+SV=(?P<version>\d+)$'     # Sequence version
    )

    # Use re.match to parse the header
    match = re.match(pattern, header)

    if not match:
        raise ValueError("The header does not match the expected format.")

    # Return the matched groups as a dictionary
    return match.groupdict()

# Example usage with a header that returns a match
header = ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=2"
parsed_header = parse_uniprot_fasta_header(header)
print(parsed_header)


{'db': 'sp', 'id': 'P04637', 'entry': 'P53_HUMAN', 'protein': 'Cellular tumor antigen p53', 'organism': 'Homo sapiens', 'gene': 'TP53', 'existence': '1', 'version': '2'}
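
Since the GN field is optional, it is worth checking what happens when it is missing. The sketch below uses a compact pattern built on the same idea as the one above and applies it to the p53 header with the GN field removed by hand (a modified header made up for this test, not a real UniProt record). The names `HEADER_PATTERN` and `parse_header_fields` are assumptions introduced for this illustration.

```python
import re

# Compact pattern for: db|Id|Entry Protein OS=Organism [GN=Gene] PE=Existence SV=Version
HEADER_PATTERN = re.compile(
    r'^>?(?P<db>[^|]+)\|(?P<id>[^|]+)\|(?P<entry>\S+)\s+'
    r'(?P<protein>.*?)\s+OS=(?P<organism>.*?)'
    r'(?:\s+GN=(?P<gene>\S+))?'          # optional gene name
    r'\s+PE=(?P<existence>\d+)\s+SV=(?P<version>\d+)$'
)

def parse_header_fields(header):
    match = HEADER_PATTERN.match(header)
    if not match:
        raise ValueError("The header does not match the expected format.")
    # Groups that did not take part in the match (here: gene) are None
    return match.groupdict()

# Hand-edited header without the GN field
fields = parse_header_fields(
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens PE=1 SV=2"
)
print(fields['organism'], fields['gene'])
# Homo sapiens None
```

A caller that needs a dictionary without empty entries could drop the keys whose value is `None` after parsing.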
