# Problem 1: `numpy` arrays

Arrays are part of the `numpy` package, which is used frequently in scientific computing. In fact, `matplotlib` converts all lists into arrays before making plots. In this course we have emphasized building skills with Python's basic syntax and data structures. But some tasks are cumbersome with basic Python tools (such as working with lists of lists). `numpy` arrays and associated functions make many computing tasks easier and faster, as long as you're working with floats or integers. Just as it's sometimes useful to buy a miniprep kit or polymerase master mix rather than make them yourself, it's also often more efficient to work with pre-written software. `numpy` provides a lot of powerful data analysis software that you don't want to bother writing from scratch, so it is worth becoming familiar with the basics.

## Key learning goal: Understand how arrays are different from lists

Arrays are in many ways like lists and lists of lists. But there are critical differences between them. Given what you already know about lists, extending your knowledge to arrays should be straightforward.

This problem is mostly a self-directed tutorial, with only a few graded problems. The goal is to become familiar with the features of arrays.

Here's the official `numpy` documentation, which will be useful to you as you continue to write code after finishing this course: https://docs.scipy.org/doc/numpy-1.13.0/index.html

To begin the problem, import `numpy` as `np` in the cell below. (`np` is the standard nickname used with a `numpy` import.)

In [None]:
# import numpy here

import numpy as np

## 1.0 Creating an array

Numpy functions almost always return an array, so calling a `numpy` function is one way to generate an array. But to create an array from scratch, you typically first create a list, then use `np.array()` to convert the list to an array:

```python
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) # create a 1D array by calling np.array on a list

# an alternative - create the list first:
list_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

a = np.array(list_a)
```

### How not to create an array:

The `np.array()` function takes *one* data object and converts it to an array. So the following code will not work:

```python
b = np.array(0,1,2,3)
```

Can you see why?


In the cell below, choose one of the correct methods to create an array called `a` with the numbers 0 through 9.

In [None]:
# YOUR ANSWER HERE

In [None]:
assert len(a) == 10

## 1.2 Higher dimensional arrays
Arrays don't just have to hold a single sequence of values. Arrays can be two dimensional, like lists of lists (e.g., something with rows and columns). Arrays can have higher dimensions or axes as well, but we usually don't need that for biological data analysis.

In the cell below, do the following:

1) Create a list of lists that holds the data below (containing hypothetical fluorescence intensities of several pixels in a microscopy image). Each list represents the x, y, and intensity values for a pixel, and the list of lists holds all three pixels. (Make up your own variable name for the list of lists.)

pixel 1: (10, 0, 35)

pixel 2: (3, 8, 89)

pixel 3: (6, 5, 23)

2) Convert the list to an array called `pixels` (using `np.array()`), and print `pixels`. 


Use `print()` to look at the array `pixels`. Does the output make sense? You can use `type()` to verify that `pixels` is an array. 

There is one other way to tell whether something is an array or a list. Look carefully at the output of `print(pixels)` – what punctuation mark is missing between the data values that you would expect to see in a list?



In [None]:
# YOUR ANSWER HERE

print(pixels)

In [None]:
assert pixels.size == 9
assert pixels.shape == (3,3)

## Array attributes (not graded exercise)

Because arrays can have one or more dimensions, there are a few commands that help you determine the properties of an array. In the cell below, try the following commands on our arrays `a` and `pixels`. (Replace `array_name` with the actual array variable.)

```python
array_name.size
array_name.shape
array_name.ndim
```
What information does each command give you? Also try the function `len()` - what does that tell you about an array?

NOTE: There are no parentheses after these commands. Why not? It's because they aren't functions, but rather *attributes* of an array. These are intrinsic properties of the array object, kind of like metadata of files on your computer. The above commands report those properties.
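
As a quick illustration, here is a minimal sketch using a small made-up array (the name `example` and its values are placeholders, not part of the exercise). It shows what each attribute reports, and how `len()` differs from `.size` for a 2D array.

```python
import numpy as np

# Hypothetical 2 x 3 array, used only for illustration
example = np.array([[1, 2, 3],
                    [4, 5, 6]])

print(example.size)   # total number of elements: 6
print(example.shape)  # length of each dimension: (2, 3)
print(example.ndim)   # number of dimensions: 2
print(len(example))   # length of the first dimension only: 2
```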

In [None]:
# Try the array attribute commands here (not graded)


To see an example of a higher (3) dimensional array, type the following into the cell below:
```python
np.ones((3,3,3)) # not np.array() - what is this function doing?
```

## Some useful array functions
There is a `numpy` version of `range()` just for arrays. Type the following code into the cell below, then `print()` `d` and `e` to see the differences in output:
```python
d = list(range(10))
e = np.arange(10)
```
Look carefully at the output – `e` may look like a list, but there is an important difference in the printed output. Can you spot it?

Here's a function similar to `np.arange()` for creating an array of evenly spaced numbers. Run this code in the cell below.

```python
np.linspace(0, 1, 6) # start, stop, number of points
```
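The difference between the two functions is easiest to see side by side. In this sketch (the variable names are just for illustration), `np.arange()` is given a step size and leaves out the stop value, while `np.linspace()` is given the number of points and includes the stop value by default.

```python
import numpy as np

by_step = np.arange(0, 10, 2)     # start, stop (excluded), step size
by_count = np.linspace(0, 10, 6)  # start, stop (included), number of points

print(by_step)   # integers 0, 2, 4, 6, 8
print(by_count)  # floats 0, 2, 4, 6, 8, 10
```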

The next function is useful for initializing an empty array (filled with zeros) - run the code in the cell below:
```python
np.zeros((2, 10))
```

In [None]:
np.zeros((2, 10))

Why create an array filled with zeros? **As you'll see below, appending to arrays (unlike lists) is not very efficient.** Generally, you want to avoid appending new data to arrays. The best practice is to either: 

1) Create a list using `.append()`, then convert the list to an array once all items have been appended.


2) First create an empty array of the size of your final dataset (using `np.zeros()`), then fill in the array with your data.

Bottom line, `np.zeros()` is a handy function for pre-building arrays before you add your actual data.
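
Here is a minimal sketch of both approaches, using made-up measurements (the names `readings`, `collected`, `from_list`, and `prefilled` are hypothetical). Both end with the same array.

```python
import numpy as np

readings = [0.5, 1.5, 2.5, 3.5]  # hypothetical data arriving one value at a time

# Approach 1: append to a list, then convert once at the end
collected = []
for value in readings:
    collected.append(value)
from_list = np.array(collected)

# Approach 2: pre-build an array of the final size, then fill it in by index
prefilled = np.zeros(len(readings))
for i, value in enumerate(readings):
    prefilled[i] = value

print(np.array_equal(from_list, prefilled))  # True
```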

## 1.3 Indexing and slicing arrays
Indexing a 1D array is similar to indexing a list, and slices work too. In the cell below, do the following, using the array `a` that you created at the beginning:

1) Use indexing to assign the 4th element of the array `a` to the variable `b`.

2) Use a slice to grab the last three elements of `a` and assign the result to the variable `c`. Print `c` to check the result.

In [None]:
# Graded answer

# YOUR ANSWER HERE

In [None]:
assert type(c) == type(a)
assert isinstance(b, np.integer)  # np.integer also accepts platforms where the default integer is not 64-bit

Indexing a 2D array works just as you would expect from a list of lists, with one index for the row and one for the column:
```python
my_array[row][column] # the same applies to higher dimensions
```

In the cell below, use indexing to get *the y coordinate of the third pixel in the array `pixels`* (third row, second column). Assign the answer to the variable `pixel3`.

In [None]:
# Graded answer

# YOUR ANSWER HERE

In [None]:
assert pixel3 == 5
assert isinstance(pixel3, np.integer)

## An even better way to index arrays with more than one dimension
There is a better (and, under the hood, faster) way to do this. Rather than using a separate set of square brackets for each array dimension, you put all the indices within a single set of square brackets, separated by commas:

```python
my_array[2, 0] # better than writing my_array[2][0]
```
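To check that the two forms select the same element, here is a small sketch with a made-up array (`grid` is a hypothetical name, not one of the arrays from this problem).

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

chained = grid[2][0]  # first selects row 2, then takes element 0 of that row
single = grid[2, 0]   # selects row 2, column 0 in one step

print(chained, single)  # both are 7
```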
Try accessing the y coordinate of the third pixel again, but use this more efficient approach.

In [None]:
# Non-graded answer. Try faster index below



## 1.4 Accessing rows and columns with arrays

Another great feature of arrays is that you can slice over more than one dimension. Remember, we could not do that for lists of lists. For example, this doesn't do what you might hope:

```python
list_of_lists[:3][:2] # no error, but the second slice acts on the rows again, not the columns
```
But with arrays, you can, as long as you put your slices in a single set of square brackets:
```python
my_array[:3, :2] # first three rows, first two columns
```
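A small sketch with made-up data shows the difference (the names `nested`, `grid`, and `block` are hypothetical). Chaining two slices on a list of lists runs without an error but slices the rows twice, whereas a single pair of brackets on an array slices rows and columns together.

```python
import numpy as np

nested = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
grid = np.array(nested)

# The second slice acts on the list of rows again, so this keeps two whole rows
print(nested[:3][:2])  # [[1, 2, 3], [4, 5, 6]]

# One set of brackets: first three rows, first two columns
block = grid[:3, :2]
print(block.shape)     # (3, 2)
```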
In the cell below, use this syntax to grab the pixel intensities (values of the third column) of all three pixels in `pixels`. (Created above in section 1.2.) Assign the answer to the variable `intensities`.

In [None]:
# Graded answer

# YOUR ANSWER HERE

In [None]:
assert intensities[1] == 89
assert isinstance(intensities, np.ndarray)

## Math is concise and fast with arrays - no need for `for` loops

If you wanted to square all numbers in a list, you would do it with a for loop, or more concisely, with a list comprehension. Squaring all numbers in an array is easier and faster. More generally, math operations applied to an array operate element by element, *without the need to write a `for` loop*. Run the cell below to see this.

In [None]:
my_array = np.array([1, 2, 3, 4])
my_array**2 # square each element - what would happen if you tried this with a list?

To see how much faster math is on arrays, let's use the `%timeit` magic function. Use `%timeit` before a statement to see how long it takes to execute:

```python
my_list = [1, 2, 3]
%timeit [i + 1 for i in my_list] # time a list comprehension to add one to each element
```
In the cell below, do the following:

1. Use `list(range(1000))` to create a list of the numbers 0 to 999, and assign it to the variable `numbers_list`.
2. Then use `%timeit` to see how long it takes to square all numbers in `numbers_list` using a list comprehension.

In [None]:
# Non-graded answer - just try the code in the next two answer cells


In the next cell, do the same with an array. Use `np.arange()` to create an array with the numbers 0 to 999, and call it `numbers_array`. Then use `%timeit` to see how long it takes to square all numbers in the array. For an array of this size, `numpy` can be more than 200 times faster than a regular `for` loop, although the exact speedup depends on your machine.
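
`%timeit` only works inside IPython or Jupyter. As a rough sketch of the same comparison in plain Python, the standard-library `timeit` module can be used instead. The repeat count below is an arbitrary choice, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

numbers_list = list(range(1000))
numbers_array = np.arange(1000)

# Total time in seconds for 1000 repetitions of each statement
list_time = timeit.timeit(lambda: [n**2 for n in numbers_list], number=1000)
array_time = timeit.timeit(lambda: numbers_array**2, number=1000)

print(f"list comprehension: {list_time:.4f} s")
print(f"numpy array:        {array_time:.4f} s")
```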

## 1.5 How arrays are *not* like lists

You've seen a few ways in which arrays are not like lists: 

1) Indexing and slicing across multiple dimensions is simpler with arrays. 

2) Element by element math works on arrays without a `for` loop.

Another important way arrays are different from lists is that *arrays only hold a single data type.* 

In the cell below, create a list called `list_h` with the following three values: the integer 1, the float 2.5, and the string `'DNA'` (remember you need quotes for strings).

Then use the `np.array()` function to convert `list_h` to an array called `array_h`. Finally, print out `list_h` and `array_h`. 

In [None]:
# Graded answer

# YOUR ANSWER HERE

print(list_h, array_h)

In [None]:
assert array_h[1] == '2.5'

As you can see, by converting `list_h` into an array, all the list elements were converted into a single data type. What data type is that? Write your answer in the cell below. (This answer is graded.)

YOUR ANSWER HERE

### Key takeaway: arrays generally hold only one data type

Because all elements of an array have to be of the same type, the `np.array()` function coerces all elements to be a single data type, in a way that depends on what can be coerced into what. For example, strings can't be forced into integers, but it's easy to make an integer into a string.
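
The `.dtype` attribute reports which type an array ended up with. Here is a minimal sketch, using made-up values that differ from the exercise above (the variable names are hypothetical):

```python
import numpy as np

ints_only = np.array([1, 2, 3])
ints_and_floats = np.array([1, 2.5, 3])  # the integers are converted to floats

print(ints_only.dtype)        # an integer type (the exact size depends on your platform)
print(ints_and_floats.dtype)  # float64
print(ints_and_floats)        # the 1 and 3 are now stored as 1.0 and 3.0
```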

You can actually include multiple data types in an array by defining your own complex data type. It gets ugly quickly, but if you want to see how to do it, check out this page: http://www.python-course.eu/numpy_dtype.php

### Iterating over an array

You can iterate over arrays with a for loop using the same syntax used for looping over lists (`for item in list:`). It's easiest with a one-dimensional array. In the cell below, write a for loop that iterates over the 1D array `a` and prints out each value. Then do the same thing for the 2D array `pixels`. Do the results make sense?

In [None]:
# Non-graded answer

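# Iterate over the 1D array: each element is a single value
for element in a:
    print(element)

# Iterate over the 2D array: each element is one row
for element in pixels:
    print(element)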

## Appending to arrays
Arrays have their own special append function, but beware, it's different from list append:
```python
a = np.append(a,10) # note, np.append returns a *copy* of the array, so you need to assign to variable
```
Try that in the cell below. Then type `np.append?` to learn more. Appending to arrays can get complicated – you have to think about array dimensions and data types. Try experimenting with the append function and the 2D array `pixels`. (For example, append a list of three numbers to `pixels`. Then try appending a number to *one* of the lists of three numbers in `pixels`, to make a list of 4 numbers. What works and what gives you an error?)

**NOTE:** `np.append()` is generally not something you want to use much because it's inherently inefficient. We append to lists frequently, but arrays are not meant to be used in this way. This has to do with how arrays versus lists are stored in memory. Appending something to an array changes the array size, so every call to `np.append()` allocates new memory and copies the whole array, which a list does not need to do on each append. Therefore it's best to first build your dataset as a list, then convert it to an array.
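
Here is a short sketch of this behavior, using made-up arrays (all names are hypothetical). It shows that `np.append()` leaves the original array untouched, and that for a 2D array the `axis` argument determines whether the result keeps its rows and columns.

```python
import numpy as np

small = np.array([1, 2, 3])
bigger = np.append(small, 4)   # returns a new array
print(small)                   # unchanged: still three elements
print(bigger)                  # four elements

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Without axis, the result is flattened to 1D
flat = np.append(table, [7, 8, 9])
print(flat.shape)              # (9,)

# With axis=0, the new row must itself be 2D, with a matching number of columns
taller = np.append(table, [[7, 8, 9]], axis=0)
print(taller.shape)            # (3, 3)
```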

In [None]:
# Non-graded cell


## 1.6 Math functions:

Some convenient math functions for arrays will look familiar. Try these for array `a` in the cell below. (Use print, or just rerun the cell with each new function.)

```python
array_name.sum()
array_name.min()
array_name.max()
array_name.mean()
array_name.var()
array_name.std()
```

In [None]:
# Non-graded cell


The same functions work on multi-dimensional arrays too. For example, for our array `pixels` you can calculate sums or means along rows or columns using the `axis` parameter like this:

```python
pixels.sum(axis=0)
```
Try this in the cell below, first for axis 0 of array `pixels`, then for axis 1. Which axis represents rows, and which columns? Are the results of each function what you expected?

Then, in the next cell, use indexing and the built-in array mean function to take the mean of the pixel intensities (third column) in the array `pixels`. Assign the result to the variable `mean_intensity`.

In [None]:
# Non-graded cell to test functions


In [None]:
# Calculate mean of 3rd column of pixels
# Graded answer

# YOUR ANSWER HERE

In [None]:
assert int(mean_intensity) == 49

## 1.7 Read in file data with NumPy

In this course we emphasize writing code from basic building blocks. You've learned how to read in and process file data by writing a for loop. However, there is a shortcut to use when you want to load your data into a numpy array:

```python
data = np.loadtxt(fname='my_file.csv', delimiter=',')
```

Use this command to load the data from the file `cell_cycle.txt` and save it as an array called `data`. (The delimiter will be the tab character, `'\t'`.)

This is the yeast cell cycle gene expression data we briefly worked with in earlier lectures. There is a header row, indicating each time point. The remaining data are gene expression values, one gene per row, across the cell cycle.
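
To see how `np.loadtxt()` handles tab-delimited text with a numeric header row, here is a self-contained sketch. It reads from an in-memory text buffer instead of a real file, and the numbers are invented for illustration; they are not taken from `cell_cycle.txt`. The names `fake_file` and `table` are hypothetical.

```python
import io
import numpy as np

# Invented stand-in for a small tab-delimited file: one header row of
# time points, then two rows of measurements
fake_file = io.StringIO("0\t10\t20\n0.1\t0.4\t-0.2\n-0.3\t0.0\t0.5\n")

table = np.loadtxt(fake_file, delimiter='\t')

print(table.shape)  # (3, 3): the header is read as an ordinary row of numbers
print(table.dtype)  # float64
```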

In [None]:
# Load cell_cycle.txt
# Graded answer

# YOUR ANSWER HERE

In [None]:
assert data.ndim == 2

Once you've loaded the file data into an array called `data`, experiment with indexing, slices, `len()`, `.shape`, and `.ndim` to look at the array. For example, try printing out just the header row, or the last three rows. Is this a two-dimensional array, as expected?

In [None]:
# Non-graded cell to test array features

Next, plot some of this data.

In the cell below, first run the standard commands to set up and import `matplotlib.pyplot`:

```python
%matplotlib inline
import matplotlib.pyplot as plt
```

Then, assign the header row of `data` to a new variable `time_points`. These will be our x-axis values. The y-axis will show gene expression levels.

What data type is `time_points`? (Use `type()` to check.)

In [None]:
# Graded answer - create variable time_points

# YOUR ANSWER HERE
print(time_points, type(time_points))

In [None]:
assert 25 == len(time_points)

Next, plot the *non-header* rows of `data` (y-axis values) against `time_points` (x-axis values), using the `plt.plot()` function. Remember, `plt.plot()` takes two arguments here: a list (or array) of x values and a list (or array) of y values.

In the cell below, start by plotting just *the first non-header* row of `data`, using `time_points` and an index or slice on `data`.

In [None]:
# Non-graded answer


Next, plot all of the non-header rows by using a `for` loop. The loop should iterate over *all but the first row of `data`.* (Use a slice of `data` in your `for` statement, like this: `for row in data[1:]:`.) Each time through the loop, call `plt.plot()` to plot the current row.

In [None]:
# Non-graded answer


## Averaging over columns and rows in a 2D array

When we worked with lists of lists, we had to use a for loop (or list comprehension) to take averages over rows or columns of data. This averaging over a data column is much easier with arrays. 

Your next task is to calculate the average expression for each gene (that is, each row in `data` **except** the header row), over the **first six** time points (i.e., the first six columns).

To do this without a for loop, you can simply apply the `np.mean()` function to `data`. Here's an example:

```python
col_averages = np.mean(my_array, axis=0) # axis = 0 averages over the *rows* for each column
row_averages = np.mean(my_array, axis=1) # axis = 1 averages over *columns* for each row
```

The `axis` argument is important to understand, since it's used in many functions that operate on arrays. Axis 0 of an array refers to the first array dimension, e.g., the elements you would get with your first indexing number, like `data[2]`. A one-dimensional array has only one axis, obviously, axis 0. A two-dimensional array has axis 0 and axis 1, a 3D array has axes 0-2, etc.

In a 2D array, axis 0 can be thought of as the rows, and axis 1 as the columns. So `np.mean(my_array, axis = 0)` gives you *column* averages because you are averaging *over the row axis*, that is, over all the rows.
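
A small worked example may help make the axis rule concrete. The array below is made up for illustration (`scores` is a hypothetical name).

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [5.0, 6.0, 7.0]])  # 2 rows, 3 columns

col_averages = np.mean(scores, axis=0)  # averages over the rows: one value per column
row_averages = np.mean(scores, axis=1)  # averages over the columns: one value per row

print(col_averages)  # three values: 3, 4, 5
print(row_averages)  # two values: 2, 6
```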

In the cell below, use `np.mean()` to calculate the average expression of each gene (i.e., *row* averages) for the first six time points. The examples above show how to do this for all columns. Use a slice on `data` to restrict the averages to the first six columns. Remember to exclude the header row!

Assign the result to `gene_avgs`.

In [None]:
# Graded answer

# YOUR ANSWER HERE


In [None]:
assert gene_avgs.shape == (99,)
assert 0.2629 < gene_avgs[10] < 0.2630
assert -0.0312 < np.mean(gene_avgs) < -0.0311

## 1.8 Practicing Element by Element Operations With Arrays

Below are the results of a two-color luciferase reporter gene assay, performed in a 96 well plate. For each well, there is a green measurement (the control reporter, the list `green`) and a red measurement (the test reporter, the list `red`). Run the cell below to load the data.

In [None]:
# Run this cell
red = [  75.06789969,  102.6761147 ,  105.80468261,   91.32053303,
         94.81085526,  123.13513035,  105.18699967,  102.52030289,
         88.56186704,   97.56377875,  118.16973774,   97.25039381,
         94.02892511,   70.88551784,   89.90737712,   92.46419686,
         75.01544288,  105.93009074,   89.42360156,   71.93031026,
        111.45803265,  109.2387486 ,   87.16146403,   74.54965171,
         93.14081199,   85.15774173,  119.83568919,  131.46453267,
        124.73590733,   80.20070033,  102.53917546,  106.38963365,
        107.69692525,  105.97778478,  102.39125882,   93.91543072,
         82.38649633,   95.98291161,   93.38359085,  107.33904336,
        108.59808569,  110.15603312,   86.71854631,   97.57525572,
         88.41521212,   99.21128443,   93.31129435,  105.98018753,
         80.36089216,  112.72033678,   99.54035572,  132.52076639,
        110.73406648,  104.70161138,  112.78952785,   96.07567285,
         91.97611987,  116.21238255,   76.59076397,   96.48804188,
         87.51085807,   87.36755846,  111.6409671 ,  104.72468052,
        121.33189269,  101.74353757,   70.43838411,  109.27861276,
         88.37094794,   71.65313041,   94.40397085,  104.22731807,
        100.51691116,   71.3525646 ,   94.88456141,   97.67093936,
         97.74755503,   93.38834725,   76.36626406,  124.94518032,
         98.39223498,   96.84341551,  117.88054516,  102.6533986 ,
         82.83899181,   71.54005462,  120.25959834,  100.6473819 ,
         98.83913644,  139.37448277,  105.36258277,  113.30474148,
         87.00635691,  101.88296106,  106.88937415,   78.80771799]

green = [  9.44794457,  11.02191701,   9.12734783,  12.06936672,
         9.75529923,  10.09647493,  11.30731366,   6.77353594,
        12.30478573,   8.3746339 ,   8.33274671,   9.98814856,
         6.95522297,   9.97492289,  13.74362915,   5.78911019,
        11.97888824,   6.06880946,  10.92527125,  11.5591475 ,
        12.17907135,  10.86482997,  10.05318254,   8.88161624,
         9.29767555,  11.96684094,   8.02057314,   8.30002577,
        10.24994511,   7.07860487,  11.4631994 ,   8.70756227,
         6.52934609,  10.95439525,   9.12974836,   7.54861306,
        10.14259259,  11.38768161,   9.37378845,   8.3745961 ,
        12.5271594 ,   9.24121746,   6.08488598,  11.16757983,
        12.54897223,  14.56168373,  15.32761802,   7.2858463 ,
         7.20713393,   9.07103501,  12.47403878,  10.50002976,
         5.67374945,  11.02314955,  11.66384027,   9.04565327,
        12.93272115,  10.49581989,  11.34283136,   8.72258147,
         9.66582856,   9.0737323 ,   9.43100488,   7.93671749,
        11.02899911,  12.03540981,   9.03346529,   8.31858648,
        13.00019911,  10.83834909,   5.7958448 ,   6.02205257,
        14.13883493,   9.74121018,  11.71495384,   8.93387281,
        12.51727825,  12.36010242,   8.55949812,   7.14193118,
         9.5120317 ,  11.58300315,  11.88483374,   8.16378893,
        13.52081441,  10.31226078,  12.95468397,   9.13677285,
         8.81048839,   8.94386741,  10.10855814,   7.61843472,
        12.5949694 ,   9.71627537,  12.80195694,  10.28982774]

Your task is to take the ratio of red versus green for each measurement. You've done this before for lists, using `zip()` and a for loop (or a list comprehension). Below is the code to do this using `zip()`. Run the cell below.

In [None]:
# Just run this cell
# The syntax should be familiar

ratios_list = []
for r, g in zip(red, green):
    ratios_list.append(r/g)

You can perform the same normalization more efficiently and concisely with arrays rather than lists. As you saw, arrays let you perform element by element operations without using a for loop. 

In the cell below, convert the lists `red` and `green` to arrays, using the `np.array()` function. Call them `red_array` and `green_array`. Then take the ratio of the two arrays and assign the result to `ratios_array`, using the natural element by element operation syntax for arrays. (If you need to check your answers, print out `ratios_list` and `ratios_array` to see if they match.) 
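
If you want to compare a list and an array without reading through the printed values, `np.allclose()` checks whether they match to within floating-point rounding. Here is a minimal sketch, using element-by-element addition on invented numbers rather than the luciferase data (all names are hypothetical):

```python
import numpy as np

first = [1.5, 2.5, 3.5]      # invented values
second = [0.5, 0.25, 0.125]  # invented values

# Loop version, as with lists
sums_list = []
for x, y in zip(first, second):
    sums_list.append(x + y)

# Array version: the + operates element by element
sums_array = np.array(first) + np.array(second)

print(np.allclose(sums_list, sums_array))  # True
```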

In [None]:
# Graded cell

# YOUR ANSWER HERE

In [None]:
assert ratios_array.ndim == 1
assert 7.566 < ratios_array[3] < 7.567 

To learn more about `numpy`, check out the `numpy` site, which has a few nice tutorials: https://numpy.org/devdocs/user/quickstart.html