# Bioinformatics Python programming placement assessment

---

Author: Xiayan Li <br>
ID: #00181874 <br>
Python Version: 3.9.1 64-bit <br>
Created Date: 06/28/2022 <br>
Last Modified Date: 06/30/2022 <br>
Description: This notebook is a submission for the BIOINF 529 placement exam <br>


---

## Purpose

The purpose of this assessment is to determine whether you already have the Python programming skills that are prerequisites for **BIOINF 529: Bioinformatic Concepts & Algorithms** and, if so, to waive your requirement for **BIOINF 575: Programming Laboratory in Bioinformatics**.

---

## Instructions

* *Carefully* read **all** instructions for *each* section before attempting to write any code
* <u>Do not alter</u> **any** of the provided source code other than what is requested

---

## Information

### General:

---
### <font color="red">Warning</font>
Because we know that there is a strong urge to perform at the edge of your ability in graduate school, we **strongly** urge you to <u>refrain from getting assistance with this assessment</u> and to <u>submit only work that you know you are currently capable of producing</u>.

This assessment evaluates only your Python programming abilities; it is not used to evaluate you in any other way.

That is, if you turn in work that is *not your own* or that you *do not fully understand* and we, consequently, waive the *BIOINF 575: Programming Laboratory in Bioinformatics* requirement, <font color="red">you will likely suffer</font> from a steeper learning curve during *BIOINF 529: Bioinformatic Concepts & Algorithms*.

---

This is only an *assessment* and will not be used for any form of evaluation other than that stated in the [Purpose](#Purpose) section of this file.

After your assessment is evaluated, the instructors of *BIOINF 529: Bioinformatic Concepts & Algorithms* will determine whether you meet the prerequisites for the course.

**Note**: We fully acknowledge that you may (or may not) have prior programming experience. This experience may be in other programming languages (e.g., MATLAB or R) or in the language of instruction: **Python**. However, *BIOINF 529: Bioinformatic Concepts & Algorithms* relies on fundamental *Python* programming concepts that are not readily used in other languages or regularly covered in self-taught curricula.

Therefore, please know that we are not assessing your ability to program, but rather your ability to program using intermediate/advanced Python concepts.

# Important Readme:

1. You do **not** have to take this assessment
    * If you do take this assessment, you have the *opportunity* to waive the *BIOINF 575: Programming Laboratory in Bioinformatics* requirement
    * If you do not take this assessment, you will just have to take *BIOINF 575: Programming Laboratory in Bioinformatics*
* You do **not** have to pass this assessment should you choose to take it
    * If you pass, the *BIOINF 575: Programming Laboratory in Bioinformatics* requirement will be waived
    * If you do not pass, you will just have to take *BIOINF 575: Programming Laboratory in Bioinformatics*
* This assessment will **not** be stored after the assessment is complete
* If you find yourself not readily understanding what is asked of you in this assessment, do **not** worry. 
    * You **will** be taught these concepts in *BIOINF 575: Programming Laboratory in Bioinformatics* 
    * You **will** find yourself well-prepared for *BIOINF 529: Bioinformatic Concepts & Algorithms*.
* This assessment will **not** be used in any other way to evaluate you as a student or your place in this program

### Points of Assessment

You will be assessed on your ability to understand and implement the following *Points of Assessment*:

|#|Concept|
|--|:--|
|1| [Jupyter notebooks](#Part-0)|
|2| [git](#Part-0)|
|3| [I/O operations](#Part-1%3A-I%2FO-and-string-operations)|
|4| [String operations](#Part-1%3A-I%2FO-and-string-operations)|
|5| [Numpy arrays and simple Python data structures](#Part-2%3A-Numpy%2C-data-structures%2C-%3Ccode%3Erandom%3C%2Fcode%3E%2C-slicing%2C-indexing%2C-and-subsetting)|
|6| [Random selection techniques](#Part-2%3A-Numpy%2C-data-structures%2C-%3Ccode%3Erandom%3C%2Fcode%3E%2C-slicing%2C-indexing%2C-and-subsetting)| 
|7| [Indexing, slicing, and subsetting data](#Part-2%3A-Numpy%2C-data-structures%2C-%3Ccode%3Erandom%3C%2Fcode%3E%2C-slicing%2C-indexing%2C-and-subsetting)|
|8| [Object-Oriented Programming](#Part-3%3A-Object-Oriented-Progamming)|
|9| Coding by Contract|
|10| Iteration methods|
|11| Python control structures|

---

## Administration

### Timeline

You will have one (1) week to complete the assessment.

### Submission

Please attach the file to an email addressed to **both** Ryan E Mills (remills@umich.edu) and Alan Boyle (apboyle@umich.edu).

---

# Part 0

You may not know it, but you are already on your way to fulfilling one of the [Points of Assessment: *Jupyter Notebooks*](#Points-of-Assessment). Jupyter Notebooks (\*.ipynb files) are the primary medium of instruction in *BIOINF 529: Bioinformatic Concepts & Algorithms*. Since you were able to view this file, it means you have a basic understanding of using Jupyter Notebooks.

## Instructions

* Use `git` to acquire the data from this [repository](https://github.com/dcmb-courses/bioinf529-assessment-data)
* ***Move this notebook*** file into the newly acquired repository

You will be using this data for [Part 1](#Part-1%3A-I%2FO-and-string-operations) of this assessment.

---

# Part 1: I/O and string operations

## Instructions

### A. Write the following function
0. <u>No `import` statements are allowed in this section</u>
1. Write a function called <u>`placement_io`</u> with the following positional arguments:
    1. <u>`file`</u>: a string path and filename to data that will be ingested (e.g. '/c/user/me/some_file.txt')
    * <u>`delim`</u>: the delimiter of the file (e.g. tab, comma, etc.)
2. The function must open the declared `file` and process it *row-by-row*:
    1. Each row must be split on `delim` and converted into a `list`
    * Check whether each row contains a variant:
        1. If there is no variant, add a period ('.') to the end of the row
        * If there is a variant, create a new string from the reference sequence such that the variant is present in the correct position and add it to the end of the row
    * Add the newly modified row into another `list` that contains all the modified rows
* The function must <u>`return`</u> the modified rows `list`

In [1]:
# Use this cell to write the function described above
def placement_io(file: str, delim: str):
    modified_rows = []
    # the context manager closes the file even if a row fails to parse
    with open(file, "r") as f:
        for line in f:
            # remove the newline character and split the row into fields
            row = line.strip().split(delim)

            if row[3] == "No_variant":
                # no variant: append a period to the variant field
                row[3] += '.'
                modified_rows.append(row)
            else:
                # the variant field looks like 'Variant_<position>_<base>'
                position, letter = row[3].split('_')[1:3]
                # convert the absolute position into an offset from the row's start coordinate
                offset = int(position) - int(row[0])
                seq_with_v = row[2][:offset] + letter + row[2][offset + 1:]

                # append the modified sequence to the end of the row
                modified_rows.append(row + [seq_with_v])

    return modified_rows
            
    

### B. Use the function on the data file provided in the git repo

In [2]:
# Use this cell to write the code that uses the above function on the provided file
for row in placement_io("sample_data.tsv", "\t"):
    print(row)


['4863', '4883', 'GATATAGCACACAAGTAGAC', 'No_variant.']
['2310', '2330', 'AAAATTAGTAGATTTCAGAG', 'Variant_2327_T', 'AAAATTAGTAGATTTCATAG']
['5313', '5333', 'CCATTTTCAGAATTGGGTGT', 'No_variant.']
['6302', '6322', 'AACTTGATATAATACCAATA', 'Variant_6303_C', 'ACCTTGATATAATACCAATA']
['7673', '7693', 'TTAACAATTACACAAGCTTA', 'No_variant.']
['2378', '2398', 'CACATCCCGCAGGGTTAAAA', 'Variant_2390_C', 'CACATCCCGCAGCGTTAAAA']
['7355', '7375', 'CTATGGGCGCAGCCTCAATG', 'No_variant.']
['2022', '2042', 'ATTAGTAGGACCTACACCTG', 'Variant_2025_G', 'ATTGGTAGGACCTACACCTG']
['8537', '8557', 'GTGGGTTTTCCAGTCACACC', 'Variant_8545_A', 'GTGGGTTTACCAGTCACACC']
['436', '456', 'AGTATGGGCAAGCAGGGAGC', 'Variant_454_T', 'AGTATGGGCAAGCAGGGATC']
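
The position arithmetic used for the variant rows can be checked on a single row in isolation. The sketch below is only an illustration: `apply_variant` is a hypothetical helper name, and it assumes the variant field always has the form `Variant_<position>_<base>` with the position on the same coordinate system as the row's start value, as in the rows printed above.

```python
def apply_variant(start: int, reference: str, variant: str) -> str:
    """Return `reference` with the base named in `variant` substituted.

    Assumes `variant` looks like 'Variant_<position>_<base>'.
    """
    _, position, base = variant.split("_")
    # 0-based offset of the variant within the reference string
    offset = int(position) - start
    return reference[:offset] + base + reference[offset + 1:]


# second row of the output above: position 2327 is offset 17 from start 2310
print(apply_variant(2310, "AAAATTAGTAGATTTCAGAG", "Variant_2327_T"))
# prints AAAATTAGTAGATTTCATAG
```

The subtraction turns an absolute coordinate into a string index, and the two slices keep everything on either side of that index unchanged.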


---

# Part 2: Numpy, data structures, `random`, slicing, indexing, and subsetting

## Instructions

### A. Setup your workspace
0. The *only* `import` statements allowed in this section are: `numpy`, `random`, and `itertools`
* Load in the following libraries:
    1. Numpy (aliased as `np`)
    * `random` from the standard library
    * any *specific* tools needed from the `itertools` library -> Do **not** import `itertools` directly, just the sub-functions needed

In [3]:
# Use this cell to complete the directions detailed under 'A. Setup your workspace'
import numpy as np
import random
from itertools import product


### B. Write the following functions

#### a. Numpy array

1. Write a function called <u>`placement_numpy`</u> with the following **keyword** arguments:
    1. <u>`min`</u>: the minimum number to randomly select from (default: 0)
    * <u>`max`</u>: the maximum number to randomly select from (default: 42)
    * <u>`n_rows`</u>: the number of rows in the array (default: 3)
    * <u>`n_cols`</u>: the number of columns in the array (default: 4)
    * <u>`n`</u>: the number of random samples (default: 1000)
    * <u>`fn`</u>: the name of a function that will be used to aggregate results (default: `sum`)
    * <u>`seed`</u>: allow the user to define a random seed for all operations (default: None)
* The function must set the seed declared by <u>`seed`</u> using the Numpy library
* The function must use the Numpy library to create a random array of integers ranging from <u>`min`</u> to <u>`max`</u> such that it fills a Numpy array of dimensions (<u>`n_rows`</u>, <u>`n_cols`</u>)
* Collect <u>`n`</u> random samples with replacement
* Iterate through all the random samples
    * Use <u>`fn`</u> to aggregate the results
* When all iterations are complete, the function must <u>`return`</u> the aggregate answer

In [4]:
# Use this cell to complete the directions detailed under 'a. Numpy array'
def placement_numpy(min: int = 0, max: int = 42, n_rows: int = 3, n_cols: int = 4, n: int = 1000, fn=sum, seed: int = None):
    if seed is not None:
        # seed NumPy's generator, since every random call below goes through np.random
        np.random.seed(seed)
    # create an (n_rows, n_cols) array of random integers in [min, max)
    np_array = np.random.randint(min, max, (n_rows, n_cols))

    # collect n random samples with replacement from np_array
    samples = np.random.choice(np_array.flatten(), n, replace=True)

    # aggregate the samples with fn
    results = fn(samples)
    return results


In [13]:
# Use this cell to write the code that uses the placement_numpy function
placement_numpy(1,9, 3, 4, 3, sum , seed=42)

8
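
As a side note on seeding, NumPy also offers a local `Generator` object through `np.random.default_rng`. The sketch below is an alternative that the assessment does not ask for: `sample_and_aggregate` and its parameter names are hypothetical, and it only illustrates that a fixed seed makes the sample-and-aggregate steps repeatable.

```python
import numpy as np


def sample_and_aggregate(low=0, high=42, n_rows=3, n_cols=4, n=1000, fn=sum, seed=None):
    """Hypothetical variant of placement_numpy built on a local Generator."""
    # a Generator created here does not touch NumPy's global random state
    rng = np.random.default_rng(seed)
    # integers() excludes `high` by default, like np.random.randint
    grid = rng.integers(low, high, size=(n_rows, n_cols))
    # draw n values from the flattened grid, with replacement
    samples = rng.choice(grid.ravel(), size=n, replace=True)
    return fn(samples)


# the same seed gives the same aggregate on repeated calls
first = sample_and_aggregate(1, 9, n=3, seed=42)
second = sample_and_aggregate(1, 9, n=3, seed=42)
print(first == second)
```

Keeping the generator local means that seeding one function call cannot change the random numbers drawn elsewhere in the notebook.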

#### b. Python data structure: `list`

1. Write a function called <u>`placement_list`</u> with the following **keyword** arguments:
    1. <u>`left`</u>: a Python `list` object (default: None)
    * <u>`right`</u>: a Python `list` object (default: None)
* The function must ensure that both `left` and `right` are the same size or `left` is one (1) item larger than `right`
* The function must ensure that both `left` and `right` are larger than, or equal to, two (2) items
* Subset `left` by just the <u>odd</u> numbered items within it
* Subset `right` by just the <u>even</u> numbered items within it
* The function must <u>`return`</u> the Cartesian product of both subsets as a `list`

In [6]:
# Use this cell to complete the directions detailed under 'b. Python data structure: list'
def placement_list(left: list = None, right: list = None):

    # both lists are required; `or` also catches the case where only one is missing
    if not left or not right:
        raise ValueError("left and right must both be non-empty lists!")
    # raise an error if input list size is incorrect
    same_or_one_larger = len(left) == len(right) or len(left) - 1 == len(right)
    if not same_or_one_larger or len(left) < 2 or len(right) < 2:
        raise ValueError("Input list size error!")

    # trim left to an odd length
    sub_left = left[:-1] if len(left) % 2 == 0 else left
    # trim right to an even length
    sub_right = right[:-1] if len(right) % 2 == 1 else right

    # calculate the Cartesian product of sub_left and sub_right
    result = product(sub_left, sub_right)

    # convert the result to a list
    return list(result)

In [7]:
# Use this cell to write the code that uses the placement_list function
result = placement_list([1,2], [1,2])
print(result)


[(1, 1), (1, 2)]
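
The subsetting instruction can also be read as selecting items by their position in the list rather than trimming the list to an odd or even length. The sketch below shows that reading with extended slices. It is an assumption about the intended meaning, not the graded answer, and `product_by_position` is a hypothetical name.

```python
from itertools import product


def product_by_position(left, right):
    """Hypothetical alternative: subset by item position, then take the product."""
    # 1st, 3rd, 5th, ... items (odd-numbered when counting from one)
    odd_items = left[::2]
    # 2nd, 4th, 6th, ... items (even-numbered when counting from one)
    even_items = right[1::2]
    return list(product(odd_items, even_items))


print(product_by_position([1, 2, 3], [4, 5, 6]))
# prints [(1, 5), (3, 5)]
```

The slice `[::2]` starts at index 0 and takes every second item, while `[1::2]` starts at index 1, so the two subsets never depend on the values stored in the lists.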


#### c. Python data structure: `dict`

1. Write a function called <u>`placement_dict`</u> with the following **keyword** arguments:
    1. <u>`seq`</u>: a `list` of numbers ranging from one (1) to sixty-four (64) (inclusive) (default: None)
    * <u>`codon_key`</u>: a `dict` where each codon is associated with a number (default: None)
    * <u>`codon_trans`</u>: a `dict` where each codon is mapped to its single-letter protein designation (default: None)
* The function must use the provided code `codon_key` and `codon_trans`
* The function must reverse `codon_key` such that the number is the key and the codon is the value
* The function must iterate through both dictionaries, creating a dictionary of dictionaries:
    1. The outer key is the number of the codon
    * The outer value is a dictionary:
        1. The inner key is the codon
        * The inner value is the codon's protein translation
    * If a codon is in `codon_key` and not `codon_trans`, skip it
    * If a codon is in `codon_trans` and not `codon_key`, skip it
* The function must then process the `seq`:
    1. It must generate the nucleotide sequence given a list of numbers
    * It must generate the protein translation of the sequence given a list of numbers
* The function must `return` a `tuple` with the following items:
    1. The first item must be a string representation of the nucleotide sequence translation of `seq`
    * The second item must be a string representation of the protein sequence translation of `seq`

In [8]:
# This code is for you to use to complete the directions detailed under 'c. Python data structure: dict' 
nts = 'ACGT'
codon_key = {codon: i for i, codon in enumerate((x+y+z for x in nts for y in nts for z in nts), 1)}

trans_table = {
    'TTT': 'F', 'TTC': 'F', 'TTY': 'F', 'TTA': 'L', 'TTG': 'L', 'TTR': 'L', 'TCT': 'S', 'TCC': 'S',
    'TCA': 'S', 'TCG': 'S', 'TCN': 'S', 'TCY': 'S', 'TCR': 'S', 'TAT': 'Y', 'TAC': 'Y', 'TAY': 'Y',
    'TAA': 'X', 'TAG': 'X', 'TAR': 'X', 'TGT': 'C', 'TGC': 'C', 'TGY': 'C', 'TGA': 'X', 'TGG': 'W',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'CTY': 'L', 'CTR': 'L', 'CTN': 'L', 'YTG': 'L',
    'YTA': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CCY': 'P', 'CCR': 'P', 'CCN': 'P',
    'CAT': 'H', 'CAC': 'H', 'CAY': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CAR': 'Q', 'CGT': 'R', 'CGC': 'R',
    'CGA': 'R', 'CGG': 'R', 'CGY': 'R', 'CGR': 'R', 'CGN': 'R', 'ATT': 'I', 'ATC': 'I', 'ATA': 'I',
    'ATY': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'ACY': 'T', 'ACR': 'T',
    'ACN': 'T', 'AAT': 'N', 'AAC': 'N', 'AAY': 'N', 'AAA': 'K', 'AAG': 'K', 'AAR': 'K', 'AGT': 'S',
    'AGC': 'S', 'AGY': 'S', 'AGA': 'R', 'AGG': 'R', 'AGR': 'R', 'GTT': 'V', 'GTC': 'V', 'GTA': 'V',
    'GTG': 'V', 'GTY': 'V', 'GTR': 'V', 'GTN': 'V', 'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'GCY': 'A', 'GCR': 'A', 'GCN': 'A', 'GAT': 'D', 'GAC': 'D', 'GAY': 'D', 'GAA': 'E', 'GAG': 'E',
    'GAR': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', 'GGY': 'G', 'GGR': 'G', 'GGN': 'G'
}

In [9]:
# Use this cell to complete the directions detailed under 'c. Python data structure: dict'
def placement_dict(seq: list = None, codon_key: dict = None, codon_trans: dict = None):
    # reverse codon_key so that the number is the key and the codon is the value
    number_to_codon = {v: k for k, v in codon_key.items()}

    # merge number, codon, and protein translation into one nested dict,
    # skipping codons that are missing from codon_trans
    nested_dict = {}
    for num, codon in number_to_codon.items():
        if codon in codon_trans:
            nested_dict[num] = {codon: codon_trans[codon]}

    nct_seq = ""
    ptn_seq = ""
    for number in seq:
        # each inner dict holds exactly one codon -> protein pair
        nct, ptn = next(iter(nested_dict[number].items()))
        nct_seq += nct
        ptn_seq += ptn

    return (nct_seq, ptn_seq)
        

In [10]:
# Use this cell to write the code that uses the placement_dict function
result = placement_dict([3,2,6,3,42,2,19,54,32,1], codon_key, trans_table)
print(result)

('AAGAACACCAAGGGCAACCAGTCCCTTAAA', 'KNTKGNQSLK')
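
The dictionary-of-dictionaries layout is easier to see on a toy input. The sketch below uses two made-up three-entry tables (`toy_key` and `toy_trans` are hypothetical, not the tables provided in the assessment) to show the shape of the nested structure and how codons missing from either table drop out.

```python
# toy inputs for illustration only
toy_key = {"AAA": 1, "AAC": 2, "AAN": 3}          # 'AAN' has no translation below
toy_trans = {"AAA": "K", "AAC": "N", "TTT": "F"}  # 'TTT' has no number above

# outer key: number, inner key: codon, inner value: protein letter
nested = {
    number: {codon: toy_trans[codon]}
    for codon, number in toy_key.items()
    if codon in toy_trans
}
print(nested)
# prints {1: {'AAA': 'K'}, 2: {'AAC': 'N'}}
```

Iterating over one table and testing membership in the other covers both skip rules at once: a codon must appear in both tables to reach the result.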


---

# Part 3: Object-Oriented Progamming

## Instructions

A rudimentary `class` has been created for you below called `FASTA`. It represents a canonical-only <u>nucleotide</u> FASTA sequence. As objects are meant to contain and/or group attributes and methods that relate to a specific type of object, it is important that you know how to read and use them.

Use the `FASTA class` provided below to complete the following instructions:
1. You may not modify any of the pre-existing code. That is, you are allowed to add to it, but not take away from it
* All class methods below must be added to the functionality of the `FASTA` object.
    1. Write a class method called `complement`:
        1. This class method takes no arguments
        * This class method must `return` the complement sequence of the `seq` attribute contained within the object
    * Write a class method called `reverse`:
        1. This class method takes no arguments
        * This class method must `return` the reverse sequence of the `seq` attribute contained within the object
    * Write a class method called `rev_comp`:
        1. This class method takes no arguments
        * This class method must `return` the reverse complement sequence of the `seq` attribute contained within the object
    * Write a class method called `is_complete`:
        1. This class method takes the argument `kmer` (default: 3)
        * This class method must `return` whether or not the `seq` attribute can be divided into n `kmer`-length kmers such that there is no remainder
    * Write a class method called `translate`:
        1. This class method takes a `dict` as an argument:
            * The keys of the dictionary are codons
            * The values of the dictionary are the protein translations
        * This class method must translate the `seq` attribute contained within the object into its single-letter protein translation using the `dict` argument for translation:
            * Ignore detection of 'Stop' codons
            * If a codon is observed in `seq` that is not in `dict`, the class method should translate this as 'X' instead
            * Only translate <u>complete</u> codons
        * This class method must `return` the translated sequence
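
The complement and reverse-complement operations described above can be sketched outside the class with `str.maketrans` and `str.translate` from the standard library. The standalone function names below are hypothetical, and the sketch assumes the sequence contains only upper-case canonical bases.

```python
# translation table mapping each canonical base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    # swap every base for its complement in one pass
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    # complement first, then reverse with a negative-step slice
    return complement(seq)[::-1]


print(complement("AGCTGGGTGACCGT"))          # prints TCGACCCACTGGCA
print(reverse_complement("AGCTGGGTGACCGT"))  # prints ACGGTCACCCAGCT
```

A translation table avoids building an intermediate list of characters, which is one possible design choice among several.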

In [11]:
# Use the FASTA class below to complete the instructions detailed above
class FASTA:
    
    __slots__ = ['_header', '_seq']
    
    def __init__(self, sequence = None, header = None):
        self._header = header
        self._seq = sequence
    
    @property
    def header(self):
        if self._header is None:
            return '>'
        else:
            return self._header
    
    @header.setter
    def header(self, header):
        if not header.startswith('>'):
            header = '>' + header
        self._header = header
    
    @property
    def seq(self):
        return self._seq
    
    @seq.setter
    def seq(self, sequence):
        sequence = sequence.upper()
        if all(True if letter in 'ACGT' else False for letter in sequence):
            self._seq = sequence
        else:
            raise ValueError('Sequence contains non-canonical nucleotides')
    
    def __len__(self):
        return len(self.seq)
    
    def __str__(self):
        formatted_seq = '\n'.join([self.seq[i:i+80] for i in range(0, len(self.seq), 80)])
        return f"{self.header if self.header else ' '}Sequence Length: {len(self)}\n{formatted_seq}"

    def __repr__(self):
        return f"FASTA(header='{self.header}', sequence='{self.seq}')"

    def complement(self):
        translation = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
        complement_seq = ''.join([translation[letter] for letter in self.seq])
        return complement_seq
    
    def reverse(self):
        return self.seq[::-1]
    
    def rev_comp(self):
        return self.complement()[::-1]
    
    def is_complete(self, kmer:int = 3):
        return len(self.seq) % kmer == 0

    def translate(self, codon_ptn: dict = None):
        if not codon_ptn:
            raise ValueError("No codon protein dictionary provided")

        # translate the sequence one codon at a time
        translated_seq = ""
        for i in range(0, len(self.seq), 3):
            codon = self.seq[i:i+3]
            if len(codon) < 3:
                # only translate complete codons; skip a trailing partial codon
                continue
            # codons missing from codon_ptn are translated as 'X'
            translated_seq += codon_ptn.get(codon, "X")
        return translated_seq

        

In [12]:
# Use this cell to implement the FASTA class and show how your newly-written methods work
test = FASTA("AGCTGGGTGACCGT", ">test")

print(f"original test object is: {repr(test)}\n")
print(f"the complement sequence: {test.complement()}")
print(f"the reverse sequence: {test.reverse()}")
print(f"the reverse complement sequence: {test.rev_comp()}")
print(f"the sequence is complete by 4-mers : {test.is_complete(4)}")
print(f"the translated sequence: {test.translate(trans_table)}")

original test object is: FASTA(header='>test', sequence='AGCTGGGTGACCGT')

the complement sequence: TCGACCCACTGGCA
the reverse sequence: TGCCAGTGGGTCGA
the reverse complement sequence: ACGGTCACCCAGCT
the sequence is complete by 4-mers : False
the translated sequence: SWVT
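
The rule that only complete codons are translated, with unknown codons becoming 'X', can be illustrated on the same test sequence with a deliberately incomplete table. In the sketch below, `split_codons` and `toy_table` are hypothetical names used for illustration only.

```python
def split_codons(seq: str, k: int = 3) -> list:
    """Split `seq` into consecutive k-mers, dropping a trailing partial k-mer."""
    # length of the longest prefix that divides evenly into k-mers
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


toy_table = {"AGC": "S", "TGG": "W", "GTG": "V"}  # 'ACC' left out on purpose
codons = split_codons("AGCTGGGTGACCGT")
print(codons)
# prints ['AGC', 'TGG', 'GTG', 'ACC']
print("".join(toy_table.get(codon, "X") for codon in codons))
# prints SWVX
```

The trailing `GT` is dropped because the 14-base sequence is not a multiple of three, and `ACC` becomes 'X' because it is absent from the toy table.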


---

# Assessment Rubric

1. [ ] Jupyter notebooks
* [ ] git
* [ ] I/O operations
* [ ] String operations
* [ ] Numpy arrays and simple Python data structures
* [ ] Random selection techniques
* [ ] Indexing, slicing, and subsetting data
* [ ] Object-Oriented Programming
* [ ] Coding by Contract
* [ ] Iteration methods
* [ ] Python control structures