# Part 1 Short Answer Questions

### 1 List comprehension.

```py
results=[]
for i in range(10):
    if (i % 2 == 0):
        results.append(i**2)
```

In [1]:
[i ** 2 for i in range(10) if i % 2 == 0]

[0, 4, 16, 36, 64]

### 2 Python data

A matrix is a 2-dimensional object, very much like a table, that has rows and columns.

1. Give an example of how you can represent a matrix that has 6 columns and 4 rows using lists.


In [2]:
# I use lists inside a list
# list[nrow][ncolumn]

nrow = 4
ncol = 6

matrix_ls = []

for i in range(nrow):
    ls_row = []
    for j in range(ncol):
        ls_row.append(10 * i + j + 11) # easy for debug
    matrix_ls.append(ls_row)

matrix_ls

[[11, 12, 13, 14, 15, 16],
 [21, 22, 23, 24, 25, 26],
 [31, 32, 33, 34, 35, 36],
 [41, 42, 43, 44, 45, 46]]

2. Explain how you would retrieve values from the three different scenarios (the code/solution does not have to be written in just one line):
    1. the value in row 2, column 4.
    2. all values in row 3
    3. all values in column 2

In [3]:
# the value in row 2, column 4
# in my case it should be 24
# As I said above, list[nrow][ncol]
# first index is row index, the second is column
matrix_ls[1][3]

24

In [4]:
# all values in row 3
# I set row as the outer list
# so I use the statement below
matrix_ls[2]

[31, 32, 33, 34, 35, 36]

In [5]:
# all values in column 2
# column is the inner list
# I use a list comprehension to pull them out
[row[1] for row in matrix_ls]

[12, 22, 32, 42]
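As an alternative sketch for the column case, `zip(*rows)` transposes a list of lists so that each column comes out as a tuple. The name `matrix` below is a stand-in (not from the cells above) that rebuilds the same 4 x 6 values as `matrix_ls`, so the block runs by itself.

```python
# Rebuild the same 4 x 6 matrix as above (stand-in name: matrix)
matrix = [[10 * i + j + 11 for j in range(6)] for i in range(4)]

# zip(*matrix) groups the k-th element of every row, i.e. it yields the columns
columns = list(zip(*matrix))

# column 2 sits at index 1 because Python indexing starts at 0
column_2 = list(columns[1])
print(column_2)
```

This is handy when several columns are needed at once; for a single column, the list comprehension above is just as clear.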

### 3 Python object (30 pts)

Create a class called `Gene` which stores:

- Name - a string
- Species - as a string
- Sequence - as a string
- Coordinates - as a list of two values: the first is the start coordinate and the second is the end coordinate (up to but not including).


- The `__init__` function should expect the user to provide all the variables requested above when creating a new instance of gene.
- Overload the operator `__len__` such that it returns the difference between the end and start coordinates. For example, if the coordinates for a gene called BRCA1 are 1200, 1500, then `len(BRCA1)` should return 300.

In [6]:
class Gene:
    Name = ""
    Species = ""
    Sequence = ""
    Coordinates = list() # [start, end): end is up to but not including

    def __init__(self, name: str, species: str, sequence: str, coordinates: list):
        """
        Coordinates - as a list of two values
        1. first is start coordinate
        2. the second is the end coordinate (up to but not including).
        """
        self.Name = name
        self.Species = species
        self.Sequence = sequence
        self.Coordinates = coordinates

    def __len__(self):
        """
        Return the length of the gene
        """
        return self.Coordinates[1] - self.Coordinates[0]

In [7]:
# Test my class
BRCA1 = Gene("BRCA1", "Homo sapiens", "AAAAAATTTTTCCCCCGGGG", [1200, 1500])

In [8]:
BRCA1.Name

'BRCA1'

In [9]:
BRCA1.Species

'Homo sapiens'

In [10]:
BRCA1.Sequence

'AAAAAATTTTTCCCCCGGGG'

In [11]:
BRCA1.Coordinates

[1200, 1500]

In [12]:
len(BRCA1)

300
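For comparison, here is a minimal sketch of the same idea written with `dataclasses`, which generates `__init__` from the field list. The class name `GeneRecord` and the lowercase field names are hypothetical, chosen only to avoid clashing with the `Gene` class above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneRecord:  # hypothetical name, not the Gene class defined above
    name: str
    species: str
    sequence: str
    coordinates: List[int]  # [start, end): end is not included

    def __len__(self) -> int:
        # length is end minus start because the end coordinate is exclusive
        start, end = self.coordinates
        return end - start


brca1 = GeneRecord("BRCA1", "Homo sapiens", "AAAAAATTTTTCCCCCGGGG", [1200, 1500])
print(len(brca1))
```

Because every field is an instance attribute set in the generated `__init__`, there is no shared class-level list to worry about.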

# 2 GFF parser

The file format that contains gene locus annotations is called a GFF file. There may be modules
written to parse them, so you are more than welcome to use them. However, the format is a
simple tab-delimited format containing 9 columns, so it may be easier to simply write your own
parser. The nine columns of a GFF file are:
1) Reference sequence – for example a chromosome name
2) Source – who created the annotations
3) Feature – genomic features such as gene, exons, CDS, etc. Notice they are hierarchical. For our task we are only interested in the “gene” features.
4) Start – Start position of the feature on the Reference Sequence
5) End – Stop position of the feature on the Reference Sequence
6) Score – Such as Blast score, if there isn’t one you will see a “.”
7) Strand – Positive or Negative
8) Phase – Used for CDS to explain which of the three phases is being translated.
9) Annotation – Details of the feature such as name and parent feature. For this
example we are only interested in “Name”
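
To make the column layout concrete, here is a minimal sketch that splits a single record into the nine columns and then breaks column 9 into key/value pairs. The record is the AT1G01010 gene line that appears in the test output later in this notebook; the field names in `FIELDS` are hypothetical labels chosen for this illustration.

```python
# One record, copied from the test output shown later in this notebook
line = ("Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\t"
        "ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010\n")

# Hypothetical labels, one per GFF column, in order
FIELDS = ["seqid", "source", "feature", "start", "end",
          "score", "strand", "phase", "annotation"]

# Split on tabs and pair each value with its column label
record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
record["start"] = int(record["start"])
record["end"] = int(record["end"])

# Column 9 is a ';'-separated list of key=value pairs
attributes = dict(pair.split("=", 1)
                  for pair in record["annotation"].split(";") if pair)

print(record["feature"], record["start"], record["end"])
print(attributes["Name"])
```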



In [13]:
import re

class GFF:
    """
    Stores one row (one feature record) of a `.gff` file
    """
    ReferenceSequence = ""
    Source = ""
    Feature = ""
    Start = int()
    End = int()
    Score = ""
    Strand = ""
    Phase = ""
    Annotation = ""

    def __str__(self):
        return "ReferenceSequence\t" + self.ReferenceSequence \
            + "\nSource\t\t\t" + self.Source \
            + "\nFeature\t\t\t" + self.Feature \
            + "\nStart\t\t\t" + str(self.Start) \
            + "\nEnd\t\t\t" + str(self.End) \
            + "\nScore\t\t\t" + self.Score \
            + "\nStrand\t\t\t" + self.Strand \
            + "\nPhase\t\t\t" + self.Phase \
            + "\nAnnotation\t\t" + self.Annotation

def open_gff(path: str) -> list:
    """
    open the gff file, and return a list of GFF
    """
    ls_gff = []

    with open(path) as f:
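        for line in f:
            # skip blank lines and "#" comment/header lines, which have no 9 columns
            if not line.strip() or line.startswith("#"):
                continue
            gff = GFF()
            data = line.strip().split("\t")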
            gff.ReferenceSequence = data[0]
            gff.Source = data[1]
            gff.Feature = data[2]
            gff.Start = int(data[3])
            gff.End = int(data[4])
            gff.Score = data[5]
            gff.Strand = data[6]
            gff.Phase = data[7]
            gff.Annotation = data[8]
            
            ls_gff.append(gff)

    return ls_gff

def get_name_from_gff(gff: GFF) -> str:
    """
    return the name from gff annotation
    """
    match = re.search(r"Name=(.*?)(;|$)", gff.Annotation)
    return match.group(1) if match else ""

Test my functions

In [14]:
TAIR10_GFF3 = open_gff("TAIR10_GFF3_genes.gff")

In [15]:
print(TAIR10_GFF3[1])
get_name_from_gff(TAIR10_GFF3[1])

ReferenceSequence	Chr1
Source			TAIR10
Feature			gene
Start			3631
End			5899
Score			.
Strand			+
Phase			.
Annotation		ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010


'AT1G01010'

In [16]:
print(TAIR10_GFF3[2])
get_name_from_gff(TAIR10_GFF3[2])

ReferenceSequence	Chr1
Source			TAIR10
Feature			mRNA
Start			3631
End			5899
Score			.
Strand			+
Phase			.
Annotation		ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1


'AT1G01010.1'

My `gffparser`

In class, Prof. Katari said that a gene overlaps the selected range when:
- range start < gene end
- range end > gene start

```
--- nucleotides
=== genes
|   range selected

     |                |
----====-----===----====----====----
     ✓        ✓      ✓      X
```
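
The two conditions can be sketched as a small predicate. The function name `overlaps` and the coordinates are assumptions for illustration: the positions are read off the diagram as character columns (gene end exclusive) and are not real genomic coordinates.

```python
def overlaps(range_start: int, range_end: int,
             gene_start: int, gene_end: int) -> bool:
    """True when the gene shares at least one position with the range."""
    return range_start < gene_end and range_end > gene_start


# Character columns read off the diagram: (start, end) for each === block
genes = [(4, 8), (13, 16), (20, 24), (28, 32)]
selected = (5, 22)  # the two | markers

print([overlaps(*selected, *gene) for gene in genes])
```

The first and third genes only partly fall inside the range, yet both conditions still hold for them, which matches the check marks in the diagram.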

In [17]:
def gffparser(path: str, chromosome: str, start: int, end: int):
    gffs = open_gff(path)

    # only get the gene name
    return [get_name_from_gff(i) for i in gffs
        if i.ReferenceSequence == chromosome and
            i.Feature == "gene" and # make sure it is a gene
            i.End > start and # make sure it is in the range
            i.Start < end]

In [18]:
gffparser("TAIR10_GFF3_genes.gff", "Chr1", 1, 10000)
# AT1G01010
# AT1G01020

['AT1G01010', 'AT1G01020']