# Problem 0: Read data from a file (19 points)

In this problem, you'll read in the data from a file of human gene coordinates taken from the GENCODE database. You'll store the four columns of data from the file as four individual lists. 

To do this homework, make sure `gene_table.txt` and `test_table.txt` are in the same folder as this notebook. (They should already be included in the downloaded ps3 folder.)

**The goal of problem 0 is to end up with lists holding gene names, start coordinates, stop coordinates, and number of exons per gene - for the entire GENCODE gene set.** As demonstrated in class, you'll use a `for` loop to read in and process lines from the file, one at a time.

In the block of the `for` loop we need code to:

1. Strip the trailing newline from each line
2. Extract the individual values from the four data columns
3. Append those values (as the correct data type) to the corresponding lists
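
The three steps above can be sketched on a couple of made-up lines held in a list rather than a real file. The two-column layout, the sample values, and the names `sample_lines`, `labels`, and `counts` are assumptions chosen only for illustration; your own code will use the file and list names given later in this problem.

```python
# Hypothetical tab-delimited lines (label, count); not taken from the homework files
sample_lines = ['geneA\t12\n', 'geneB\t7\n']

labels = []
counts = []

for line in sample_lines:
    stripped = line.strip()          # step 1: remove the trailing newline
    split = stripped.split('\t')     # step 2: separate the columns
    labels.append(split[0])          # step 3: keep text as a string...
    counts.append(int(split[1]))     #         ...and convert numbers to int

print(labels, counts)
```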


### Coding Strategy

To successfully write code that carries out multiple steps of data processing, it helps to have a strategy. For this problem, you'll use an inside-out coding strategy. The key idea is that we don't usually write code the way you might write, say, an essay – you don't start at the beginning and work through to the end. Instead, we write code best by:

1. Breaking the overall task into small steps.
2. Writing and testing individual pieces of code for those small steps, using simple examples.
3. Assembling the small steps into a larger whole that executes the main task.


Refer to the Lecture 3 notes and the lecture activity for code examples as you go along. Some of this problem will feel repetitive, since we worked through nearly identical code in class. This is by design! To solve statistical problems later in the course, it is critical to be comfortable with variables, lists, loops, conditionals, and reading files.

One more note: some of you may be aware that there are existing Python packages to read in file data. Later in the course we'll use one or two, associated with numpy and pandas. However, you won't always have files that work nicely with pre-written functions. It is critical to know how to use basic Python syntax to handle file input/output.

# Problem 0.1: Processing file lines

## Processing lines, step 1: Strip trailing newline characters

In the cell below, a variable called `line` holds a sample line from a larger file of genomic data. Use the `.strip()` command to remove the trailing newline, and assign the result to a new variable `stripped`.
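
As a small illustration on a made-up string (the variable names here are hypothetical, not the ones the exercise asks for): `.strip()` with no arguments removes whitespace from *both* ends of a string, while `.rstrip('\n')` removes only trailing newline characters. Printing with `repr()` makes the invisible characters visible.

```python
# Hypothetical line: leading spaces, tab-separated fields, trailing newline
example = '  sample1\t42\n'

both_ends = example.strip()         # removes whitespace from both ends
right_only = example.rstrip('\n')   # removes only trailing newline characters

print(repr(both_ends))    # 'sample1\t42'
print(repr(right_only))   # '  sample1\t42'
```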

In [None]:
line = 'chr1\t33385\t33445\tchr1.17\n' # line of data from a DNaseI bed file in K562 cells

# YOUR ANSWER HERE

stripped # to display the result in output

In [None]:
assert len(stripped) == 24
assert '\n' not in stripped

## Processing lines, step 2: Split a line of data into a list of elements

The next task is to split the string `stripped` into individual data values. As we saw in class, we do this using the function `.split()`.

In genomic data files, columns are sometimes delimited (separated) by tabs, in what are called *tab-delimited files*. You may have seen people export Microsoft Excel workbooks as `.csv` files. CSV stands for *comma-separated values*. So if you were reading in a `.csv` file, you'd tell `.split()` to divide the line by commas instead of tabs.

The function `.split()` turns a *string* into a *list*, splitting the string at the whitespace (tabs or spaces) or at another delimiter (such as commas) that separates the values. If you don't give `.split()` any arguments (meaning you leave the parentheses empty), it splits on any run of whitespace: spaces, tabs, and newlines alike. If your data include both tabs and spaces, that default can split in the wrong places, and it will not split on commas at all. So I recommend always including the delimiter argument when you call `.split()`.

NOTE: Delimiter arguments for `.split()` must be strings.
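
To see why the delimiter argument matters, here is a minimal sketch on a made-up record whose first field contains a space. The string and variable names are assumptions for illustration only.

```python
# Hypothetical tab-delimited record; the first field contains a space
record = 'Homo sapiens\tchr1\t100'

by_whitespace = record.split()      # splits on every run of whitespace
by_tab = record.split('\t')         # splits only on tabs

print(by_whitespace)   # ['Homo', 'sapiens', 'chr1', '100']
print(by_tab)          # ['Homo sapiens', 'chr1', '100']
```

Without the delimiter, the species name is broken into two elements and every later column shifts by one position.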

In the cell below, I've given you a line that has already been stripped of the newline character. Split this line **by its given delimiter** and save the resulting list as `split`.

In [None]:
stripped = 'ENST00000511072.1,PRDM16,chr1,16,protein_coding,2985731,364430,365175,653,92'

# YOUR ANSWER HERE

print(split) # to display the result

In [None]:
assert len(split) == 10
assert split[6] == '364430'
assert split[-6] == 'protein_coding'

## Processing lines, step 3: Convert numerical data to integers or floats

The function `.split()` breaks up a string into a list of individual values, but those values are still strings - even the numbers. (Look at the output of the last problem. How can you tell that the list elements are strings?)

To perform calculations on the data, we need to convert numerical elements to either integers (using `int()`) or floating point numbers (using `float()`).
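
A quick sketch of the difference, using made-up values (the variable names are hypothetical): strings that look like numbers still behave like strings until you convert them. Note that `int()` only accepts text that looks like a whole number, so a value such as `'0.25'` needs `float()`.

```python
# Hypothetical string values, as they would come out of .split()
count_text = '17'
ratio_text = '0.25'

count = int(count_text)      # whole number -> int
ratio = float(ratio_text)    # decimal number -> float

print(count_text + count_text)   # strings concatenate: 1717
print(count + count)             # integers add: 34
print(type(ratio))               # <class 'float'>
```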

In the cell below, the list `split` holds data from one subject in a diabetes study. There are three elements in the list: Category (case or control), age, and fasting blood sugar in mmol/L.

Below, define three empty lists called `category`, `age`, and `blood_sugar`. Then take each element of `split`, convert that element to the appropriate data type as needed (integer or float), and append the result to the proper list.

In [None]:
split = ['case', '64', '5.2'] # category, age, fasting blood sugar

# YOUR ANSWER HERE

print("Category:", category, "\nAge:", age, "\nBlood Sugar:", blood_sugar)

In [None]:
assert len(category) == 1
assert category[0] == 'case'
assert age[0] == 64
assert type(blood_sugar[0]) == float

## A review of how to read lines from files.

Now that we've gone over how to process individual lines from a file, let's briefly recap the three ways to read those lines into your notebook.

Let's say we opened a file and assigned the file object to the variable name `file`. Here are three ways to read in lines:
```python
file.readline() # reads just one line
file.read() # reads the rest of the file at once, as a single string

for line in file: # reads all file lines, one by one
    print(line)
```

Recall that each time a line from the file is read, Python remembers where it left off, no matter which method you use to read in a file line. Given that, why is the following code wrong?

```python
file = open('test_table.txt')

for line in file:
    line = file.readline()
    print(line)
```

The answer is that the **`for`** loop will read in file lines without any help from functions like `.readline()`. In the above code, one line from the file is read with the command `for line in file:`. Then, *the next line* is read with the command `line = file.readline()`. The above code would therefore only print every other line of a file. Make sure you understand why this code is wrong. If you want to test it out, feel free to create a blank cell below and run the code.
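
If you'd like to see the effect without touching a real file, here is a minimal sketch that uses `io.StringIO` from the standard library as a stand-in for an open file. The four made-up lines and the list `printed` are assumptions for illustration.

```python
import io

# io.StringIO stands in for an open file; the contents are made up
file = io.StringIO('line 1\nline 2\nline 3\nline 4\n')

printed = []
for line in file:               # the loop reads one line...
    line = file.readline()      # ...and this reads the next one, replacing it
    printed.append(line.strip())

print(printed)   # ['line 2', 'line 4']
```

Lines 1 and 3 were read by the `for` statement and then immediately overwritten, so they never reach the output.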

## Putting the code together to read an entire file

Now we put all the code together into something that will read in the entire file `gene_table.txt`. To do this you can type the code *almost* exactly as you typed it above, even using the same variable names `line`, `stripped`, and `split`. The individual steps above should be placed within the block of a `for` loop that reads in all the data lines.

`gene_table.txt` has four columns of data: gene names, start coordinates, stop coordinates, and the number of exons in the gene. (You can verify this by opening the text file from your Jupyter dashboard.)

In the cell below, write code to do the following:

1. Create empty lists called `names`, `starts`, `stops`, and `exons`. (NOTE: the names are plural, a handy convention to indicate lists.)
2. Open the file. (Refer to lecture 3 notes for the syntax).
3. Read in the header line by itself with `.readline()`. **Don't forget this step!** You don't want to mistakenly add the header to your lists of data.
4. For each line after the header:
    - Strip the trailing newline
    - Split the line into individual elements (our file is **tab-delimited**)
    - Convert numerical elements to **integers**
    - Append data values to their respective lists
5. Close the file using the command *file_variable*`.close()` (Fill in whatever variable name you choose for your file.)

**A suggestion**: Try your code on the smaller file `test_table.txt` first. When you think it's working, change your answer in the cell below to open the file `gene_table.txt`.


In [None]:
# Create the four blank lists below:

# YOUR ANSWER HERE

# Open the file. Choose an informative variable name.
# I suggest 'file'.

# YOUR ANSWER HERE

# Use .readline() to first read in just the file header as we did in class
# Then write the for loop to read the rest of the lines. In the block of the loop,
# include the line processing operations listed above.

# YOUR ANSWER HERE

# Close the file. If you named your file object 'file', 
# then you would write file.close()

# YOUR ANSWER HERE

# Feel free to create a new blank cell below to
# test out what's in your lists, to make sure you got the 
# answer you expect.

In [None]:
assert len(names) == 197782
assert type(exons[2873]) == int
assert len(stops) == len(names)
assert starts[2847] == 26432397

# Problem 0.2: How long are human genes?

We opened `gene_table.txt`, read in the data, and closed the file. Our data is now stored in four lists. We can now calculate the lengths of all human genes. 

The problem is basically like our red/green luciferase assay problem from class. We need to loop over two lists simultaneously, perform a calculation, and append the result to a new list. Using 4 lines of code (or fewer), we can calculate the lengths of all human genes using basic Python syntax.

In the cell below, do the following:

1. Loop over the lists `stops` and `starts`, using `zip()` as demonstrated in class.

2. In the block of the `for` loop, calculate gene lengths by subtracting each start coordinate from its corresponding stop coordinate. Append the result to a list called `lengths`.


Remember to define `lengths` as an empty list before you start your **`for`** loop.
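
As a reminder of the pattern, here is a minimal sketch on two short made-up lists. The names `highs`, `lows`, and `differences` and their values are assumptions for illustration, not the data from class or from this problem.

```python
# Hypothetical paired measurements
highs = [10, 25, 40]
lows = [4, 20, 31]

differences = []                       # define the empty list before the loop
for high, low in zip(highs, lows):     # zip pairs up elements by position
    differences.append(high - low)

print(differences)   # [6, 5, 9]
```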

In [None]:
# YOUR ANSWER HERE

In [None]:
assert len(lengths) == 197782
assert lengths[1827] == 16858
assert lengths[53] == 104
assert lengths[75] - lengths[20938] == 3273

# Problem 0.3: Calculate the mean length of human genes

## Easy summary functions for lists

This problem will demonstrate some simple functions that summarize properties of our lists. You've already seen one: `len()`. You could write your own length function using the syntax you've learned already. But often there is a built-in Python function that saves you the trouble.

**These next problems will require a little searching online, or some experimenting to find built-in Python functions that summarize your list in some way.** The functions are actually easy to guess, so I encourage you to just try what you think would be an obvious name for a function. If `len()` is a function to find the length of a list, what would a function for calculating the sum of list elements look like?

**Summing numbers in a list:** 
How many total exons are there in the human genome? Find a Python function to **sum** all of the values in the list `exons`. Save the answer to the variable `exons_sum`.

NOTE: This isn't quite the correct number of total exons, since there is some redundancy in our table. We downloaded a table of human transcripts, many of which overlap and share the same exons.

In [None]:
# YOUR ANSWER HERE

print(exons_sum)

In [None]:
assert exons_sum == 1230194

**Largest and smallest values in a list:** 
How long is the longest human gene? How long is the shortest? Find Python functions that will identify the maximum and minimum values in a list, and apply them to your list of human gene lengths. Assign the answers to the variables `longest` and `shortest`.

In [None]:
# YOUR ANSWER HERE
print('longest:', longest, 'shortest:', shortest)

In [None]:
assert longest == 2304640
assert shortest == 5

What is the maximum number of exons in a human gene? Use one of the functions from the last question to answer this. Assign the result to `most_exons`.

In [None]:
# YOUR ANSWER HERE

print(most_exons) # display the answer

In [None]:
assert most_exons == 363

## Calculate the mean length of human genes

To calculate the mean of a set of data, you take the **sum** of those data and divide by the **number** of data points. What two simple list summarizing functions could you use to calculate the mean transcript length of the list `lengths`?

In the cell below, use two simple list summarizing functions to calculate the mean length of human genes. (Take the sum of the list `lengths` and divide it by the number of list elements.) Assign your answer to the variable `mean_length`.
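
The arithmetic itself, sketched on a tiny made-up list so you can check it by hand (the names `values` and `example_mean` are hypothetical): the values 2, 4, and 9 sum to 15, and there are 3 of them, so the mean is 15 / 3 = 5.0.

```python
# Hypothetical data, small enough to verify by hand
values = [2, 4, 9]

example_mean = sum(values) / len(values)   # total divided by number of elements

print(example_mean)   # 5.0
```

Note that `/` always returns a float in Python 3, even when the division comes out even.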

In [None]:
# YOUR ANSWER HERE

print(mean_length)

In [None]:
assert 34491.72 < mean_length < 34491.73