# 1: Lists of lists and building your own complex data structures

You have worked with lists whose elements are basic data types - strings, integers, floating point numbers. 

As discussed in the lecture, a key concept is that lists and dictionaries don't have to hold only these basic data types - they can hold more complex things, like other lists or dictionaries. 

In this problem, you'll practice working with a list of lists.

When would you want to make a list of lists? One case would be when you have a data table with many columns. Our example datasets in this class have had only a few columns of data, and so a separate list for each column has been sufficient. 

But imagine a table with 1000 patients (rows), and dozens of health measurements (columns) for each one. Storing each column as its own list quickly becomes unwieldy in this case. Or imagine a set of gene expression data that include a series of time points - you may have data for 100 genes (rows) and 20 time points (columns). For this data, it makes sense to keep the data for each gene together, rather than distributing it among different lists.

In this activity, you'll load a table of data that shows the expression of 99 yeast genes at 25 different time points across the cell cycle. You'll read in the file as you've done before: taking each line one at a time, removing the trailing newline, etc. But this time, instead of appending individual values to separate lists, you'll append an entire list to the list that holds all the data.

Once the data is loaded, you'll practice accessing data from lists of lists. 

**NOTE:** There are other data structures in Python that make rows with many columns easy to work with, which we'll see later in the course. However, there are always cases when your data requires you to build your own structure, so it is good to practice the skills in this problem set.

## 1.0 A simple list of lists (Non-scored warmup)

To practice working with lists of lists, work through the toy example in the next few cells. The graded part of the homework will begin with the import of the cell cycle timecourse data.

In [None]:
# Gene expression at five different time points for 4 genes

g1 = [0,2,5,8,2] # gene 1 data
g2 = [0,1,7,12,6] # gene 2 data
g3 = [1,3,5,5,0] # etc
g4 = [9,4,0,1,8]

# Below, create an empty list named expression_data. 
# Then write 4 lines of code appending each of the above lists to 'expression_data'
expression_data = []


# Now look at the result - just type the variable name and run the cell
expression_data

Look at the pattern of square brackets and commas in the output above, and make sure that you understand it. How can you tell that this is a list of lists?

### Accessing rows
You can think of this list of lists as somewhat like a data table with rows and columns, in which the genes (`g1` through `g4`) are the rows and the time points are the columns. (Important: a list of lists isn't necessarily equivalent to a table with rows and columns. The individual lists inside a list of lists don't have to be the same length.)
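To make the caveat about unequal lengths concrete, here is a minimal sketch. The list `ragged` is made-up data for illustration and is not part of the exercise.

```python
# Hypothetical toy data: three "rows" of unequal length
ragged = [[1, 2, 3], [4, 5], [6]]

# Each inner list has its own length, so there is no fixed column count
row_lengths = [len(row) for row in ragged]
print(row_lengths)  # [3, 2, 1]

# A table-like list of lists has exactly one distinct row length
is_rectangular = len(set(row_lengths)) == 1
print(is_rectangular)  # False
```

A check like this can be a handy sanity test after loading a file that is supposed to be a rectangular table.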

Since our 'rows' are just the individual lists that make up the list `expression_data`, we use regular list indexing to access an entire row. In the cell below, use indexing on `expression_data` to show the **third** 'row' of data (or the third list in the list of lists). Your result should match the values of `g3` above. 

In [None]:
# Access the third row of data from expression_data


To belabor the obvious, what kind of data is the output? 

### Accessing individual data points

Now, rather than grabbing an entire row, let's just get a single data point. Use indexing to get the **fourth data point** in the **third row** of `expression_data`, which is 5.

Remember, in the cell above, you used indexing to access an element of the list `expression_data`. That list element was itself a list, so how would you use indexing to access one of its elements?
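As a hint, here is the same idea sketched on a different, made-up table (`toy_table` is a hypothetical name, not part of the exercise). The first index picks out an inner list, and the second index picks an item from that inner list.

```python
# Hypothetical toy table, unrelated to expression_data
toy_table = [[10, 11, 12],
             [20, 21, 22],
             [30, 31, 32]]

second_row = toy_table[1]     # first index selects a row, which is itself a list
value = second_row[2]         # second index selects an item within that row
same_value = toy_table[1][2]  # the two steps chained into one expression

print(value, same_value)  # 22 22
```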

### Accessing multiple rows

You can grab multiple rows at a time from a list of lists using a slice. In the cell below, use a slice to get the first two rows of `expression_data`. What is the data structure of the output?
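One detail worth noticing is how a slice differs from a single index, even when the slice covers only one row. The sketch below uses a made-up `toy_table` rather than the exercise data.

```python
# Hypothetical toy table, unrelated to expression_data
toy_table = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

last_two_rows = toy_table[1:]   # slice from index 1 to the end
one_row_slice = toy_table[1:2]  # a slice that happens to cover one row
one_row_index = toy_table[1]    # plain indexing of the same row

print(last_two_rows)  # [[20, 21, 22], [30, 31, 32]]
print(one_row_slice)  # [[20, 21, 22]]
print(one_row_index)  # [20, 21, 22]
```

Compare the brackets in the last two outputs: the slice keeps the outer list, while the index returns the inner list itself.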

### Accessing columns
One disadvantage of the list of lists structure is that it's a little more complicated to access data by columns. As we saw during the lecture, you can't do it with indexing alone. The easiest approach is to use the nifty Python trick of **list comprehension**, as we discussed in the previous lecture.

Remember, a list comprehension is basically a concise **`for`** loop that *builds a new list from an existing one*:

```python
# Create a new list by doubling the values of the first list
first_list = [1, 2, 3, 4]
second_list = [item * 2 for item in first_list]
```

In the list `expression_data`, the list items are themselves lists. So inside of a list comprehension, you can use indexing to access the same column from each list in `expression_data`. The syntax goes like this:

```python
first_column = [row[0] for row in list_of_lists]
```
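To see that this comprehension really is a compact `for` loop, here is a sketch on a made-up `toy_table` (a hypothetical name) that builds the same column both ways.

```python
# Hypothetical toy table, unrelated to expression_data
toy_table = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

# List comprehension: take the item at index 1 from every row
second_column = [row[1] for row in toy_table]

# The equivalent for loop, written out step by step
second_column_loop = []
for row in toy_table:
    second_column_loop.append(row[1])

print(second_column)       # [11, 21, 31]
print(second_column_loop)  # [11, 21, 31]
```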

In the cell below, use list comprehension to get the last column of data from `expression_data`. Before moving on, make sure you understand the syntax.



## 1.1 Read in cell cycle time course data as a list of lists

For this next exercise, you'll use a list of lists to hold time series data of gene expression over the yeast cell cycle, from the file `cell_cycle_timed.txt`. (The data is a selection of the microarray results of this paper:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1553209/.) The expression of many genes oscillates with the cell cycle, with minimum expression in one phase and maximum expression in another. We'll practice working with complex data structures like lists of lists by performing some calculations on the cell cycle data.

There are two main purposes of this activity:

1) Practice reading in numerical data from a file as a list of lists.

2) Practice performing calculations over 'rows' and 'columns'.

To create a list of lists from file data, the steps are largely the same as what you've done previously in this course, with one difference, as you'll see. In the cell below, write code that does the following:

**1) Create a blank list called `cc_data` to hold the rows of the file.** This will be our list of lists.

**2) Open `cell_cycle_timed.txt` and save the header row as a list called `header`.** Use the `readline()` method to read in just the single header line. The header has the time points (in minutes) at which the measurements were taken. IMPORTANT: Apply `.strip('\n').split('\t')` to the header line so that the variable `header` is a list of time points, not a single string. This will work (assuming your file handle is named `file`):

```python
header = file.readline().strip('\n').split('\t')
```

**3) Read in the remaining lines with a for loop.** To process each line, you can perform the strip and split operations all in one step:

```python
row = line.strip('\n').split('\t') # file is tab delimited
```

**4) Convert string values to floats:** Remember that `.split()` produces a list of strings, but we want floating point numbers. To do math with our data, we need to convert each element of the list `row` into a float with `float()`. A quick way to convert all elements of a list is to use list comprehension:

```python
# Whenever you need to create a new list from an existing list, use list comprehension
row_float = [float(i) for i in row]
```

**5) Append the entire list of floats `row_float` to `cc_data`.**

**6) Close the file.**
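To show how the pieces fit together without touching the real data file, here is a rough sketch that runs the same steps on a tiny in-memory table. The names `fake_file`, `toy_header`, and `toy_data` and the numbers inside are made up for illustration; `io.StringIO` simply stands in for an open file handle.

```python
import io

# Hypothetical stand-in for a tab-delimited file: one header line, two data rows
fake_file = io.StringIO("0\t7\t14\n0.5\t-1.25\t2.0\n1.5\t0.0\t-0.75\n")

toy_data = []  # will become the list of lists

# readline() consumes only the first line, leaving the rest for the loop
toy_header = fake_file.readline().strip('\n').split('\t')

for line in fake_file:
    row = line.strip('\n').split('\t')   # list of strings
    row_float = [float(i) for i in row]  # list of floats
    toy_data.append(row_float)           # append the whole row as one item

fake_file.close()

print(toy_header)  # ['0', '7', '14']
print(toy_data)    # [[0.5, -1.25, 2.0], [1.5, 0.0, -0.75]]
```

Note that the header values stay as strings here, while the data rows are converted to floats.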

Write the code to carry out these steps in the cell below. Above, I've given you almost all of the syntax you need. Your main task is to put it all together.

In [None]:
cc_data = []

file = open('cell_cycle_timed.txt')

# YOUR ANSWER HERE
file.close()

In [None]:
# Run this cell to test your code

assert len(cc_data[0]) == 25 # 25 columns
assert len(cc_data) == 99 # 99 rows
assert cc_data[0][10] == -0.128817812

## 1.2 Calculating the mean expression over rows and columns.

Now that you've read in the data, you'll finish this activity by taking an average over a row of data (mean expression of one gene across all time points), and an average over a column of data (mean expression of all genes at one time point).

### Averaging across a row.

In the cell below, take the mean of a single row of data across all time points. More specifically, do the following:

1) Sum the values of the 51st row of `cc_data`.

2) Get the mean by dividing that sum by the number of items in the sum. You get the number of items with the `len()` of the 51st row.

3) Save the answer as `mean_51`. Your answer should be about `0.1184596...`
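The same recipe on a made-up table looks like the sketch below (`toy_table` and `mean_second` are hypothetical names). Keep in mind that Python counts from zero, so the 2nd row lives at index 1.

```python
# Hypothetical toy table, unrelated to cc_data
toy_table = [[1.0, 2.0, 3.0],
             [4.0, 6.0, 8.0]]

second_row = toy_table[1]  # the 2nd row sits at index 1
mean_second = sum(second_row) / len(second_row)

print(mean_second)  # 6.0
```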

In [None]:
# YOUR ANSWER HERE
print(mean_51)

In [None]:
assert int(mean_51 * 10000000) == 1184596

### Averaging across a column

Columns are more challenging. In the next task, you'll average over all genes at **time point 77 min**. First you have to figure out which column represents that time point. Here is where `enumerate()` comes in handy. Run the cell below to see the matchup between column number and cell cycle time point.

In [None]:
# Below I've used a for loop with enumerate() to get the column indices for the different time points.
# Which index corresponds to time point 77 min?

for index, time in enumerate(header):
    print(index, time)

Looking at the output above, you can see that time point 77 corresponds to column 11.
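As an aside, a column index can also be looked up in code rather than read off the printout, using the list method `.index()`. The sketch below uses a made-up `toy_header` and assumes the header entries are strings that match the search value exactly; check how the values in your own `header` are formatted before relying on this.

```python
# Hypothetical header of time points, stored as strings (as .split() produces)
toy_header = ['0', '7', '14', '21']

# enumerate() pairs each position with its value
pairs = list(enumerate(toy_header))
print(pairs)  # [(0, '0'), (1, '7'), (2, '14'), (3, '21')]

# .index() returns the position of the first exact match
col = toy_header.index('14')
print(col)  # 2
```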

In the cell below, take the average of time point 77.

1) Write a list comprehension to build a new list of all values at position 11 in each of the "rows" of `cc_data`. Call this list `t_77` (time point 77 min).

2) Calculate the mean of `t_77` using `sum()` and `len()`. (Note that you could apply these functions directly to the list comprehension without bothering to create `t_77`. This accomplishes the mean calculation in one line of code rather than two.)

3) Save the answer as the variable `mean_77`.
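Here is the same pattern sketched on a made-up table, in both the two-line and one-line forms mentioned above. The names `toy_table`, `col_1`, and `mean_col_1` are hypothetical.

```python
# Hypothetical toy table, unrelated to cc_data
toy_table = [[1.0, 2.0, 3.0],
             [4.0, 6.0, 8.0],
             [7.0, 1.0, 4.0]]

# Two steps: build the column, then average it
col_1 = [row[1] for row in toy_table]
mean_col_1 = sum(col_1) / len(col_1)

# One step: the number of rows equals the number of values in the column
mean_col_1_short = sum([row[1] for row in toy_table]) / len(toy_table)

print(col_1)             # [2.0, 6.0, 1.0]
print(mean_col_1)        # 3.0
print(mean_col_1_short)  # 3.0
```

The one-line form is shorter, but keeping the intermediate list around makes it easier to check its length and contents.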


In [None]:
# YOUR ANSWER HERE

print(int(mean_77*10000000), len(t_77))

In [None]:
assert len(t_77) == 99
assert int(mean_77*10000000) == -1981588

Complex data structures like lists of lists are relatively common in Python code. Learning how to comfortably work with them is critical.