<a href="https://colab.research.google.com/github/SpenBobCat/Bioinformatics/blob/main/Bioinformatics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Quiz-2**

By Michael Spencer 7/29/2023

## 1. Consecutive Okazaki fragments are sewn together by which of the following?

a. DNA ligase \\
b. DNA polymerase \\
c. reverse transcriptase \\
d. ribosomes \\
e. RNA polymerase


**a. DNA ligase**

DNA ligase is the enzyme responsible for joining Okazaki fragments, which are short sequences of DNA nucleotides, during the replication of the lagging strand of DNA. DNA ligase creates phosphodiester bonds between the fragments, effectively "sewing" them together to create a continuous DNA strand.

## 2. Compute the Hamming Distance between:
```
CAGAAAGGAAGGTCCCCATACACCGACGCACCAGTTTA
```
and
```
CACGCCGTATGCATAAACGAGCCGCACGAACCAGAGAG
```


The Hamming distance is the number of positions at which the corresponding symbols (in this case, nucleotides) differ. Here are the two sequences, aligned:

```
CAGAAAGGAAGGTCCCCATACACCGACGCACCAGTTTA
CACGCCGTATGCATAAACGAGCCGCACGAACCAGAGAG
```

Comparing the sequences position by position, with `*` marking a mismatch and `-` marking a match:

```
--****-*-*-********-**-**---*-----****
```

Counting all the differences (the asterisks), the Hamming distance between these two sequences is 23.


In [None]:
def hamming_distance(seq1, seq2):
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(el1 != el2 for el1, el2 in zip(seq1, seq2))

seq1 = "CAGAAAGGAAGGTCCCCATACACCGACGCACCAGTTTA"
seq2 = "CACGCCGTATGCATAAACGAGCCGCACGAACCAGAGAG"

print(hamming_distance(seq1, seq2))


23
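The position-by-position comparison can also be generated rather than typed by hand. The sketch below uses a hypothetical helper, `mismatch_markers` (not part of the quiz), that marks each mismatch with `*` and each match with `-`.

```python
def mismatch_markers(seq1, seq2):
    """Return a string with '*' at mismatching positions and '-' elsewhere."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return "".join("*" if a != b else "-" for a, b in zip(seq1, seq2))

seq1 = "CAGAAAGGAAGGTCCCCATACACCGACGCACCAGTTTA"
seq2 = "CACGCCGTATGCATAAACGAGCCGCACGAACCAGAGAG"

markers = mismatch_markers(seq1, seq2)
print(seq1)
print(seq2)
print(markers)
print(markers.count("*"))  # number of mismatches, i.e. the Hamming distance
```

Counting the `*` characters gives the same value as `hamming_distance(seq1, seq2)`.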


## 3. Identify the value of $i$ for which $Skew_i$
```
GCATACACTTCCCAGTAGGTACTG
```
attains a maximum value.

For a DNA sequence, $Skew_i$ is the number of guanines (G) minus the number of cytosines (C) among the first $i$ nucleotides, with $Skew_0 = 0$. To find the position $i$ where $Skew_i$ attains a maximum, calculate the skew for every prefix length and then take the maximum.

Here is Python code that calculates $Skew_i$ for a given sequence and returns every position at which it attains its maximum value:


In [2]:
def max_skew_position(seq):
    skew = [0]  # skew[0] = 0
    for i in range(1, len(seq)+1):
        if seq[i-1] == 'C':
            skew.append(skew[i-1] - 1)
        elif seq[i-1] == 'G':
            skew.append(skew[i-1] + 1)
        else:
            skew.append(skew[i-1])
    max_skew = max(skew)
    return [i for i, skew_i in enumerate(skew) if skew_i == max_skew]

seq = "GCATACACTTCCCAGTAGGTACTG"
print(max_skew_position(seq))


[1]
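To see why the maximum is at $i = 1$, it can help to print the whole skew list. This is a minimal sketch using `itertools.accumulate` (the `initial` argument assumes Python 3.8 or later); `skew_values` is a name introduced here for illustration.

```python
from itertools import accumulate

def skew_values(seq):
    """Return [Skew_0, Skew_1, ..., Skew_n] for seq."""
    # +1 for G, -1 for C, 0 for any other symbol
    steps = [{"G": 1, "C": -1}.get(base, 0) for base in seq]
    return list(accumulate(steps, initial=0))

seq = "GCATACACTTCCCAGTAGGTACTG"
skew = skew_values(seq)
print(skew)
print(max(skew), skew.index(max(skew)))  # maximum value and first position attaining it
```

The leading G raises the skew to 1, and the sequence never regains that level afterwards.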


### Q. Compute $Count_1$

```
CGTGACAGTGTATGGGCATCTTT, TGT
```


$Count_d$(Text, Pattern) is the number of times Pattern appears in Text with at most $d$ mismatches, that is, the number of windows of Text whose Hamming distance from Pattern is at most $d$. Overlapping windows are counted. With $d = 1$, one mismatch is allowed.

In this case, the text is 'CGTGACAGTGTATGGGCATCTTT' and the pattern is 'TGT'.

Here's a Python function to calculate $Count_d$(Text, Pattern):

In [1]:
def count_d(text, pattern, d=1):
    """Count windows of text within Hamming distance d of pattern."""
    k = len(pattern)
    return sum(
        sum(a != b for a, b in zip(text[j:j+k], pattern)) <= d
        for j in range(len(text) - k + 1)
    )

text = 'CGTGACAGTGTATGGGCATCTTT'
pattern = 'TGT'

print(count_d(text, pattern, 1))


8
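As a sketch, the helper below (a hypothetical name, not from the quiz) lists the 0-based start positions at which `pattern` occurs in `text` with at most `d` mismatches. It slides one position at a time, so overlapping occurrences are included, which `str.count()` would skip.

```python
def approximate_match_positions(text, pattern, d):
    """Return 0-based start positions where pattern matches text with <= d mismatches."""
    k = len(pattern)
    positions = []
    for start in range(len(text) - k + 1):
        window = text[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= d:
            positions.append(start)
    return positions

text = 'CGTGACAGTGTATGGGCATCTTT'
pattern = 'TGT'
print(approximate_match_positions(text, pattern, 0))  # exact matches only
print(approximate_match_positions(text, pattern, 1))  # up to one mismatch
```

The length of the returned list is the count for the chosen mismatch tolerance.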


### Q. How many 4-mers are in the 3-neighborhood of Pattern = ACGT?

The d-neighborhood of the k-mer Pattern is the collection of all k-mers that are at most Hamming distance d from Pattern. Note that the d-neighborhood of Pattern includes Pattern.

The 4-mer "ACGT" is a pattern of 4 nucleotides. The 3-neighborhood of "ACGT" includes all 4-mers that are at most a Hamming distance of 3 away from "ACGT".

This means we allow up to 3 mismatches in the pattern. There are four possible nucleotides at each position: A, C, G, T.

For a Hamming distance of exactly 1, we change one of the 4 positions and have 3 choices (the other three nucleotides) for it, which is 4 × 3 = 12.

For a Hamming distance of exactly 2, we need to choose 2 out of the 4 positions to change (which can be done in 6 ways, as it's a combination of 4 choose 2), and then for each of those positions we have 3 choices of a new nucleotide. That gives us 6 × 3 × 3 = 54.

For a Hamming distance of exactly 3, we choose 3 out of the 4 positions to change (which can be done in 4 ways, as it's a combination of 4 choose 3), and then for each of those positions we have 3 choices of a new nucleotide. That gives us 4 × 3 × 3 × 3 = 108.

Finally, the pattern itself ("ACGT") is also included in its 3-neighborhood, so we have one more 4-mer.

Adding these up, there are 12 + 54 + 108 + 1 = 175 total 4-mers in the 3-neighborhood of "ACGT".
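The same counting argument gives a general formula: a k-mer has $\sum_{i=0}^{d} \binom{k}{i} 3^i$ neighbors within Hamming distance $d$, because exactly $i$ positions can be chosen in $\binom{k}{i}$ ways and each changed position has 3 replacement nucleotides. A minimal sketch, assuming Python 3.8 or later for `math.comb` (the function name is introduced here for illustration):

```python
from math import comb

def neighborhood_size(k, d, alphabet_size=4):
    """Number of k-mers within Hamming distance d of a fixed k-mer."""
    # Sum over the exact number of changed positions, i = 0 .. min(d, k)
    return sum(comb(k, i) * (alphabet_size - 1) ** i for i in range(min(d, k) + 1))

print(neighborhood_size(4, 3))  # 4-mers in the 3-neighborhood of ACGT
```

As a sanity check, all 4-mers number $4^4 = 256$, and those at distance exactly 4 number $3^4 = 81$, so the 3-neighborhood must contain 256 - 81 = 175 strings.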

### Q. The position of the E. coli genome at which the skew attains a minimum value is most likely near which of the following?

a. the middle of the reverse strand \\
b. the middle of the forward strand \\
c. the origin of replication \\
d. the replication terminus

**c. the origin of replication**

The skew at a position is the number of occurrences of 'G' minus the number of occurrences of 'C' in the genome up to that position. In a bacterial genome such as E. coli, the skew typically decreases along the reverse half-strand (walking from the replication terminus to the origin of replication), reaches a minimum at the origin of replication, and then increases along the forward half-strand, reaching its maximum near the replication terminus.

Therefore, the position at which the skew attains a minimum value is typically near the origin of replication. This has been used as a method for predicting the origin of replication in bacterial genomes.
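A minimal sketch of the minimum-skew search follows. The input is the short sequence from the skew question elsewhere in this quiz, used only as a toy stand-in; it is not E. coli data, and a real analysis would read the genome from a file. The name `min_skew_positions` is introduced here for illustration.

```python
def min_skew_positions(genome):
    """Return all i (prefix lengths) at which Skew_i is minimal."""
    skew = [0]
    for base in genome:
        # booleans act as 0/1: +1 for G, -1 for C, 0 otherwise
        skew.append(skew[-1] + (base == "G") - (base == "C"))
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]

print(min_skew_positions("CATTCCAGTACTTCATGATGGCGTGAAGA"))
```

On a real bacterial genome, the positions returned would be candidates for the neighborhood of the origin of replication.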

### Q. Compute the Hamming distance between
```
CTACAGCAATACGATCATATGCGGATCCGCAGTGGCCGGTAGACACACGT
```
and
```
CTACCCCGCTGCTCAATGACCGGGACTAAAGAGGCGAAGATTATGGTGTG
```

In [3]:
def hamming_distance(seq1, seq2):
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(el1 != el2 for el1, el2 in zip(seq1, seq2))

seq1 = "CTACAGCAATACGATCATATGCGGATCCGCAGTGGCCGGTAGACACACGT"
seq2 = "CTACCCCGCTGCTCAATGACCGGGACTAAAGAGGCGAAGATTATGGTGTG"

print(hamming_distance(seq1, seq2))


36


### Q. Identify the value of $i$ for which $Skew_i$
```
CATTCCAGTACTTCATGATGGCGTGAAGA
```
attains a maximum value.

In [7]:
def max_skew_position(seq):
    skew = [0]  # skew[0] = 0
    for i in range(1, len(seq)+1):
        if seq[i-1] == 'C':
            skew.append(skew[i-1] - 1)
        elif seq[i-1] == 'G':
            skew.append(skew[i-1] + 1)
        else:
            skew.append(skew[i-1])
    max_skew = max(skew)
    return [i for i, skew_i in enumerate(skew) if skew_i == max_skew]

seq = "CATTCCAGTACTTCATGATGGCGTGAAGA"
print(max_skew_position(seq))  # prints [28, 29]



[28, 29]


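The answer marked correct was **27**; note that the code above instead finds the maximum skew (1) at $i = 28$ and $i = 29$.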

### Q. Compute $Count_1$
```
TACGCATTACAAAGCACA, AA
```

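In [6]:
def count_d(text, pattern, d=1):
    """Count windows of text that match pattern with at most d mismatches."""
    k = len(pattern)
    return sum(
        sum(a != b for a, b in zip(text[j:j+k], pattern)) <= d
        for j in range(len(text) - k + 1)
    )

text = 'TACGCATTACAAAGCACA'
pattern = 'AA'

print(count_d(text, pattern, 1))


13


Note that `str.count()` is not suitable here: it counts only exact, non-overlapping matches and returns 1 for this input. Reading the subscript as the number of allowed mismatches, every 2-mer window containing at least one 'A' counts, so the value of **$Count_1$ is 13**.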

### Q. How many 5-mers are in the 2-neighborhood of Pattern = TGCAT?

The d-neighborhood of the k-mer Pattern is the collection of all k-mers that are at most Hamming distance d from Pattern. Note that the d-neighborhood of Pattern includes Pattern.

The 2-neighborhood of a 5-mer (Pattern = "TGCAT") includes all 5-mers that are at most a Hamming distance of 2 away from "TGCAT".

To calculate this, consider:

For a Hamming distance of exactly 1, we have 3 choices (the other three nucleotides) for each of the 5 positions, which is 5 × 3 = 15.

For a Hamming distance of exactly 2, we need to choose 2 out of the 5 positions to change (which can be done in 10 ways, as it's a combination of 5 choose 2), and then for each of those positions we have 3 choices of a new nucleotide. That gives us 10 × 3 × 3 = 90.

Finally, the pattern itself ("TGCAT") is also included in its 2-neighborhood, so we have one more 5-mer.

Adding these up, there are 15 + 90 + 1 = **106** total 5-mers in the 2-neighborhood of "TGCAT".
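For a pattern this short, the count can be cross-checked by brute force: enumerate all $4^5 = 1024$ possible 5-mers and keep those within Hamming distance 2. This is a sketch for verification only (the function name is hypothetical), and it does not scale to long patterns.

```python
from itertools import product

def brute_force_neighborhood(pattern, d, alphabet="ACGT"):
    """Enumerate every k-mer and keep those within Hamming distance d of pattern."""
    k = len(pattern)
    return {
        "".join(candidate)
        for candidate in product(alphabet, repeat=k)
        if sum(a != b for a, b in zip(candidate, pattern)) <= d
    }

print(len(brute_force_neighborhood("TGCAT", 2)))
```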

The `max_skew_position` function used in the skew questions above builds a list `skew`, where `skew[i]` is the skew of the first `i` nucleotides of the sequence. It iterates over the sequence, decreasing the running skew for each 'C' and increasing it for each 'G'. After computing the skew at all positions, it finds the maximum and returns every position at which that maximum is attained.

In spectrophotometry, **absorbance** is a measure of the capacity of a substance to absorb light of a specified wavelength.

**Absorbance = k × path length × concentration**   (Beer-Lambert law) \\
where k is the molar absorptivity (a measure of how well a chemical species absorbs a given wavelength of light).
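The relation can be used in either direction: to predict absorbance, or, rearranged, to estimate concentration from a measured absorbance. The sketch below uses hypothetical values chosen only to illustrate the arithmetic, and assumes consistent units for k, path length, and concentration.

```python
def absorbance(k, path_length, concentration):
    """Beer-Lambert relation: A = k * path length * concentration."""
    return k * path_length * concentration

def concentration_from_absorbance(a, k, path_length):
    """Rearranged for concentration: c = A / (k * path length)."""
    return a / (k * path_length)

# Hypothetical values, not measurements
a = absorbance(k=2.0, path_length=1.0, concentration=0.25)
print(a)
print(concentration_from_absorbance(a, k=2.0, path_length=1.0))
```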

### Q. How many 10-mers are in the 1-neighborhood of Pattern = CCAGTCAATG?

The d-neighborhood of the k-mer Pattern is the collection of all k-mers that are at most Hamming distance d from Pattern. Note that the d-neighborhood of Pattern includes Pattern.


The 1-neighborhood of "CCAGTCAATG" would include all 10-mers that are at most a Hamming distance of 1 away from "CCAGTCAATG".

This means we allow up to 1 mismatch in the pattern. There are four possible nucleotides at each position: A, C, G, T.

For a Hamming distance of 1, we have 3 choices (the other three nucleotides) for each of the 10 positions, which is 10*3 = 30.

Finally, the pattern itself ("CCAGTCAATG") is also included in its 1-neighborhood, so we have one more 10-mer.

Adding these up, there are 30 + 1 = 31 total 10-mers in the 1-neighborhood of "CCAGTCAATG".

### Q. Compute $Count_2$
```
CATGCCATTCGCATTGTCCCAGTGA, CCC
```


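In [8]:
def count_d(text, pattern, d):
    """Count windows of text that match pattern with at most d mismatches."""
    k = len(pattern)
    count = 0
    for j in range(len(text) - k + 1):
        # Hamming distance between the current window and the pattern
        mismatches = sum(a != b for a, b in zip(text[j:j+k], pattern))
        if mismatches <= d:
            count += 1
    return count

text = 'CATGCCATTCGCATTGTCCCAGTGA'
pattern = 'CCC'

print(count_d(text, pattern, 2))


15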


### Code Challenge: Implement Neighbors to find the d-neighborhood of a string.

Input: A string Pattern and an integer d. \\
Output: The collection of strings Neighbors(Pattern, d). (You may return the strings in any order, but each line should contain only one string.)

Sample Input:

ACG \\
1

Sample Output:

CCG TCG GCG AAG ATG AGG ACA ACC ACT ACG

This is the dataset:

GGCTTCCATT \\
2


In [9]:
def immediate_neighbors(pattern):
    """
    Returns the 1-neighborhood of a string pattern.
    """
    neighborhood = set([pattern])
    nucleotides = ['A', 'C', 'G', 'T']
    for i in range(len(pattern)):
        symbol = pattern[i]
        for x in nucleotides:
            if x != symbol:
                neighbor = pattern[:i] + x + pattern[i+1:]
                neighborhood.add(neighbor)
    return neighborhood

def neighbors(pattern, d):
    """
    Returns the d-neighborhood of a string pattern.
    """
    if d == 0:
        return set([pattern])
    if len(pattern) == 1:
        return set(['A', 'C', 'G', 'T'])
    neighborhood = set()
    suffix_neighbors = neighbors(pattern[1:], d)
    for text in suffix_neighbors:
        if hamming_distance(pattern[1:], text) < d:
            for x in ['A', 'C', 'G', 'T']:
                neighborhood.add(x + text)
        elif hamming_distance(pattern[1:], text) == d:
            neighborhood.add(pattern[0] + text)
    return neighborhood

def hamming_distance(p, q):
    """
    Returns the Hamming distance between strings p and q
    """
    if len(p) != len(q):
        raise ValueError("Undefined for sequences of unequal length.")
    return sum(el1 != el2 for el1, el2 in zip(p, q))

# Test the function
pattern = "GGCTTCCATT"
d = 2
print('\n'.join(neighbors(pattern, d)))


GGCGTCTATT
GGCTTCCGAT
GCCTTCCATG
GGTATCCATT
GGCCTCAATT
CGCTTGCATT
GGCTCCCATA
GGCTTCGATC
GACTTCCAAT
CGCTTCCATA
GGCTTCGAAT
GGCTGCCATC
TGCTTCAATT
CGCTTCTATT
GGCTTTTATT
AGCTTCCGTT
CGCTTACATT
GGCTTACAAT
GTCTTCAATT
GGCTTCTACT
GGCTACCGTT
TGCGTCCATT
GGCTGCCACT
GGCTTTCATA
GTCTTCCGTT
GAATTCCATT
GGCTTCCTTC
GGTTGCCATT
TTCTTCCATT
GACTTCCATT
GGCTTCTATT
GGTTTCCAGT
GGCCTACATT
GGCGTGCATT
TGCTTCCCTT
AGCTTACATT
GAGTTCCATT
AGCCTCCATT
GGCTGGCATT
GGCTTCTATA
GGCTTCCCTT
AGATTCCATT
CGCTTCCATC
ACCTTCCATT
AGCGTCCATT
GGCTGCAATT
GGCTCCCCTT
CGCTTCCACT
GGCTTGCATT
GTCTTTCATT
GGCTCCCATT
GGCTTGCCTT
GACCTCCATT
GGCATCAATT
GGCTTCCTTT
GTCTCCCATT
GGGTTCTATT
TGCTTCCATC
GGCTTCACTT
GGATTCCATT
GGCTTACAGT
GGCCTCCAAT
TGCTGCCATT
AGCTCCCATT
GGCTTACATG
TGCTTCGATT
GGCTTCATTT
GGTTTCGATT
GGCTTCGGTT
CGCTTCCATG
GCCCTCCATT
GGGTTTCATT
CGCCTCCATT
GGCTTCCACT
GGCTTACCTT
GTCTTGCATT
GGCTTCCTGT
GGCTTCCATC
CTCTTCCATT
GTTTTCCATT
TGCTTCCTTT
GGCTTGTATT
GGCTTCAATT
GGCCTCCGTT
GGCGTCCATT
GGCCCCCATT
GGCTTCGACT
GGCTCCCACT
GGCCTCCATT
GGCTTACACT
GGATTGCATT