# 1 Filtering sequence barcodes by standard error

## As with `pandas` in problem 0, you'll practice using `numpy`'s data loading function and boolean masks to work with `numpy` arrays.


In this problem you'll practice `numpy` techniques to work with a largish dataset. The techniques in this problem will come in handy in the last two homeworks of the class. One of the most important techniques in this homework involves the concept of a *boolean mask*. A boolean mask is a conditional expression that creates a set of true/false values, which are then used to index data structures like a `numpy` array.

**Background on sequence barcodes:** A common approach in highly multiplexed experiments is to use uniquely identifying sequence "barcodes" to measure the output of an experiment. In these experiments, large libraries of experimental perturbations are pooled and then sequenced. The results of individual perturbations are measured by the number of sequencing reads of a barcode that corresponds to a particular perturbation.

For example, a "deep mutational scanning" assay measures the effect of thousands of different mutations on the activity of a protein. Each mutant version of the corresponding gene is identified by a sequence barcode, which is recovered after a cell growth or ligand binding assay (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4410700/). In "massively parallel reporter assays", activities of barcode-tagged plasmid reporter genes in a large library are measured by detecting sequence barcodes in the transcribed reporter genes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3511131/). And in pooled CRISPR assays, sequence barcodes link single-cell transcriptomes with their corresponding guide RNA (https://www.ncbi.nlm.nih.gov/pubmed/27984732).

A key part of the data processing step in these assays is to filter out sequence barcodes that were poorly measured. In a recent paper, Rubin and colleagues propose a statistical framework for doing this, which involves removing barcodes based on their standard error across replicates (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1272-5#Abs1).

In this problem, you'll implement a simplified version of standard error filtering on sequence barcodes. Your task is to use `numpy` tools to do the following:

1. Read in a file of barcode sequencing data
2. Calculate the standard error for each barcode across replicates 
3. Identify and remove barcodes with extreme standard error values
4. Write the filtered data to a new file.

The structure of this homework is more of a guided tour of a `numpy` workflow, rather than a set of problems to be solved. There is a lot of text in this notebook, but each answer requires only a few lines of code.

## 1.1 Read in data from file
Open up the data file `barcode_reads.txt` by clicking on it in your Jupyter dashboard. (It will open a new browser window.) This file contains sequencing read counts (normalized by the total reads in the sequencing run, so the units are reads per million total reads) for 12,999 distinct barcodes, measured as a pooled library in a cell assay.

The tab-delimited file contains eight columns: the barcode sequence, the number of input DNA reads (representation of that barcode in the original pooled library), and six replicates of assay output reads.

We want to read in the data as a `numpy` array, but there is a problem - **one column of data consists of strings, while the others consist of float numbers.** As you'll recall from class, arrays are meant to hold only one data type. 

(Technically it's possible to mix data types, but it's generally not a good idea. If you're curious, here's a short tutorial: https://scipython.com/book/chapter-6-numpy/examples/using-numpys-loadtxt-method/.)

We want to **skip the first column** in the data file and import the rest. Fortunately, the useful `numpy` file input function `np.loadtxt()` has a `usecols` argument that allows you to select which columns in the file to read in. Like everything else in Python, column numbers start with 0.

```python
# An example of reading in just 3 columns from a file:
data = np.loadtxt(fname = 'my_file.txt', usecols = (2, 4, 6)) # load only columns 3, 5, 7
```
Here are example values for the `usecols` argument:
```python
usecols = 2 # integer to select 3rd column

usecols = (2, 5) # tuple to select 3rd and 6th columns

usecols = range(2, 7) # range function, cols. 3 - 7 - remember last number of range() is 1 beyond last desired value
```
Note carefully how parentheses are used in the above examples.
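To see `usecols` in action without touching the real data, here is a minimal sketch that reads from an in-memory stand-in for a file. The toy contents and the names `toy_file` and `toy` are made up for illustration; the sketch relies on `np.loadtxt()` accepting a file-like object as well as a file name.

```python
import io
import numpy as np

# Hypothetical two-row, tab-delimited "file": one string column, three float columns
toy_file = io.StringIO("AAAC\t1.0\t2.0\t3.0\nGGTT\t4.0\t5.0\t6.0\n")

# Skip column 0 (the strings) and read only the numeric columns
toy = np.loadtxt(toy_file, usecols=(1, 2, 3))

print(toy.shape)  # two rows, three columns
print(toy.dtype)  # floats only, because the string column was never read
```

Because the string column is never read, the resulting array holds a single data type.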

In the cell below, import `numpy` as `np`, then **load columns 2-8** (the second through eighth columns, that is, everything except the barcode sequences) from `barcode_reads.txt`, and assign the resulting array to the variable `data`. Note that `np.loadtxt()` produces a numpy array - thus `data` is an array.

In [None]:
# YOUR ANSWER HERE

# Check your result
print(data[10])

In [None]:
assert len(data[0]) == 7
assert len(data) == 12999
assert data.ndim == 2

## 1.2 Filter out barcodes with no input DNA reads using a boolean mask

**NOTE: If you completed problem 0 already, you're familiar with boolean masks.**

The library was designed with 12,999 barcodes, but not all of these barcodes were detected in the DNA input – some dropped out during cloning. Before we start filtering barcodes based on their standard errors, we first want to remove barcodes that were not detected in the DNA input (column 0 of the `numpy` array `data`).

To put it more precisely:

**We want to keep only those rows of the array `data` for which the value of the first column is > 0.**

Extracting particular rows and columns from a data table is a common problem in exploratory data analysis, so it's important to know how it's done on data structures like `numpy` arrays.

We can solve this problem **by placing a conditional statement inside of the indexing brackets.** For example, if I wanted to get all rows of `data` for which the sum of replicates 1-3 was > 200, I could write something like this:


```python
# pick rows in which columns 1-3 sum to >200
# Recall columns are axis=1 on 2D numpy arrays
data[data[:,1:4].sum(axis=1) > 200] 
```

This code example may look very obscure. The key principle here is that between the outer square brackets is an expression that defines a set of true/false conditions, called a boolean mask. Those true/false conditions can be used as indexing values on `data`.

This is a really handy and common trick. Let's take the code example one piece at a time. The statement of true/false conditions is:

```data[:,1:4].sum(axis=1) > 200```

The left part of this expression, `data[:,1:4].sum(axis=1)`, calculates the *sum across columns 1-3*. (Remember how a slice works - `my_array[:, 1:4]` gives you columns 1, 2, 3 of all rows.) 

Then we ask whether that sum is greater than 200, for each row:

```python
data[:,1:4].sum(axis=1) > 200
```

Retype this true/false statement into the cell below, then run the cell. What do you get? (No need to write the answer  - just make sure you understand the output.)

Running that code gives you a one-dimensional array made up of 12,999 true or false values, identifying the truth value of our conditional statement *for each row of `data`*.
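The same pattern can be traced by hand on a tiny array. This is only a sketch on made-up numbers (the array `toy` and the names `row_sums` and `mask` are hypothetical), but each step mirrors the expression above.

```python
import numpy as np

# Hypothetical data: column 0 plays the role of DNA input, columns 1-3 are replicates
toy = np.array([[0.5, 100.0, 90.0, 80.0],
                [0.0,  10.0, 20.0, 30.0],
                [1.2, 150.0, 60.0,  5.0]])

row_sums = toy[:, 1:4].sum(axis=1)  # one sum per row: 270, 60, 215
mask = row_sums > 200               # one True/False value per row

print(mask)
print(mask.sum())       # True counts as 1, so this is the number of rows kept
print(toy[mask].shape)  # only the rows where the mask is True remain
```

Summing a boolean mask is a quick way to check how many rows pass a filter before you apply it.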

You can now use that set of true/false values to pick out the matching rows in `data`. In the cell below is a toy example - run it to see the result. 

(The technical name for what we're creating with the conditional statements is a *boolean mask*. See this for more - scroll down to the boolean mask section: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html.)

In [None]:
# Run this cell and be sure to understand what's happening

my_2d_array = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) # make a 2d array 3 rows, 3 columns
boolean_mask = np.array([False, True, False], dtype=bool) # 1D array of booleans to select rows
my_2d_array[boolean_mask] # will pick out middle row

### Use a boolean mask to pick out rows with DNA input reads > 0.02
In the cell below, write a statement that creates a boolean mask that **identifies all rows of data for which column 0 is greater than 0.02**. (We'll pick the slightly more stringent threshold of > 0.02  rather than > 0 for inclusion in our barcode analysis.)

Use the boolean mask inside indexing brackets to extract the appropriate rows from `data`, and save the result as `data_dna_filter`.

Your answer will be only a slight modification of the code example four cells up.

**Note: You can define the boolean mask as a separate variable and then use it, or you can place the boolean statement directly within indexing brackets. In problem 0 we defined a variable first, but feel free to take either approach here.**
 

In [None]:
# YOUR ANSWER HERE
print(data_dna_filter.ndim, data_dna_filter.shape) # Show how many rows were removed

In [None]:
assert data_dna_filter.shape == (12735, 7)
assert 0.526 < data_dna_filter[123, 3] < 0.527

## 1.3: Add Pseudocounts

For some replicates of some barcodes, there were no matching output reads, thus the read count is 0. Zero values can cause problems for downstream calculations, such as taking the log, or calculating a ratio when the denominator is zero. One way to handle zeros is to simply remove them from the data. Another way, followed by Rubin et al. in the paper linked above, is to add a 'pseudocount' – in this case, adding a single read to all measurements. This eliminates any zeros and makes calculations easier, but it doesn't significantly change the measurements.

Our data has been normalized by total reads, and thus no longer is in units of absolute read counts. So rather than add 1 pseudocount, we'll add a "normalized" pseudovalue of 0.01 to each entry in the array.

In the cell below, **using the ability of arrays to handle element by element math operations, add 0.01 to all values** in the array `data_dna_filter`. (Recall the example in class in which an array of chip peak widths was divided by a mean value. This syntax here will be the same, except you'll use addition rather than division.)

Assign the result to a new variable called `data_pseudocounts`. (This way we don't overwrite the original in case we need to go back to it.)
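As a reminder of how element-by-element arithmetic with a single number behaves, here is a small sketch on a made-up array (`toy_counts` and `shifted` are hypothetical names, not part of the assignment):

```python
import numpy as np

toy_counts = np.array([[0.0, 1.5],
                       [2.0, 0.0]])

# The scalar is applied to every element; the result is a new array
shifted = toy_counts + 0.01

print(shifted)
print(toy_counts)  # the original array is unchanged
```

Because the addition returns a new array, the original values stay available under the old name.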

In [None]:
# YOUR ANSWER HERE

# See if you get the result you expected:
print(data_dna_filter[2, 3], data_pseudocounts[2, 3])

In [None]:
assert len(data_pseudocounts[0]) == 7
assert len(data_pseudocounts) == 12735
assert data_pseudocounts.ndim == 2
assert 0.0099999 < data_pseudocounts[2, 3] - data_dna_filter[2, 3] < 0.01000001 # floating point precision limits!

## 1.4: Calculate standard errors for each barcode

Standard error of the mean is covered in more detail in the statistics lectures. Here we'll just do the calculation without diving into the justification. Essentially, the standard error of the mean gives an estimate of how precisely you have estimated the population mean with your sample. The standard error is the sample standard deviation divided by the square root of the sample size:

$\textit{s.e.m.} = \frac{\textit{sample_sd}}{\sqrt{\textit{n}}}$

Your next task is to calculate the standard error of the mean for each barcode (in other words, for each row of the data). **Remember, we only want to calculate standard error using the last 6 columns** – the first column contains input DNA reads, which we only needed in our first filter step. So you need to:

1. Take a slice on the array `data_pseudocounts` to exclude the first column.
2. Use an element-by-element operation to calculate the standard deviation of each row, divided by the square root of the number of replicates. Use the `numpy` standard deviation function (`my_array.std()`) and square root function (`np.sqrt()`).

**HINT 1:** `my_array.std()` will calculate a single standard deviation of *all* values in the array. We want the standard deviation for each row (axis 0), which is calculated by taking the measurements across all *columns* (axis 1). To do this problem right, you'll have to use the axis argument to specify the correct axis when you call `.std()`. You did something similar in the Lecture 10 in-class activity, when calculating means for each row.
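To make the effect of the axis argument concrete, here is a sketch on a made-up 2 x 3 array (the names `toy`, `overall_sd`, `row_sd`, and `col_sd` are hypothetical). It uses the default settings of `.std()`, which divide by n (`ddof=0`).

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 8.0]])

overall_sd = toy.std()    # a single number computed from all six values
row_sd = toy.std(axis=1)  # collapses the columns: one value per row
col_sd = toy.std(axis=0)  # collapses the rows: one value per column

print(overall_sd)
print(row_sd.shape, col_sd.shape)
print(toy[:, 1:].shape)   # .shape also works on a slice
```

A quick check of the output shape is usually the fastest way to confirm you picked the axis you intended.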

**HINT 2**: To get `n` for the square root, you can just hard-code 6, since you know that there are six replicates. But to write more general code, you could get the length of each row of your array, using the `shape` attribute of the array. `shape` returns a tuple with the dimensions. You can then use indexing to access just one of the numbers of that tuple:

```python
my_array.shape[0] # length of axis 0
my_array.shape[1] # length of axis 1
```

Finally, `.shape` works on slices of arrays. Use the blank cell below to try it out. (This is not for points, just for fun.) Then, in the following cell, **write your code to calculate standard errors for each row. Assign the result to `barcode_errors`.** 

In [None]:
# Experiment with shape here


In [None]:
# Write one line of code that calculates the standard error for the 6 replicates of each row.
# DO NOT include column 0 (input DNA) in the std error calculation!

# YOUR ANSWER HERE

print(barcode_errors.max(), barcode_errors[10], barcode_errors.shape)

## 1.5: Normalize standard errors by the mean

Right now, barcodes with larger means will tend to have larger standard errors. If we removed barcodes with larger standard errors at this point, we would tend to simply remove barcodes with higher means. The next step is to normalize standard errors by dividing them by the mean, which puts the standard errors in terms of fraction of the mean, making it easier to compare barcodes with different means.
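A toy illustration of why this matters: the two made-up "barcodes" below have the same relative spread, but one is measured at 100 times the level of the other. The arrays and names here are hypothetical, and the sketch uses the default `.std()` settings.

```python
import numpy as np

low = np.array([9.0, 10.0, 11.0, 9.0, 10.0, 11.0])
toy = np.vstack([low, 100 * low])  # row 1 is row 0 scaled up 100-fold

n = toy.shape[1]                   # number of replicates per row
sems = toy.std(axis=1) / np.sqrt(n)
normalized = sems / toy.mean(axis=1)

print(sems)        # the second value is 100 times the first
print(normalized)  # the two values are the same
```

Filtering on `sems` alone would single out the second row simply because its values are large, whereas the normalized values treat both rows as equally well measured.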

In the cell below, **calculate the mean values of columns 1-6 for each row of `data_pseudocounts`. Assign the result to a new array called `barcode_means`.** 

(HINT: Use numpy's mean function (e.g., `my_array.mean()`) in the same way you used the standard deviation function above – you have to specify the axis.)

Then create a new array of normalized errors by dividing `barcode_errors` by `barcode_means`. Save the result as `normalized_errors`.

In [None]:
# Write two lines of code to solve this problem

# YOUR ANSWER HERE

print(normalized_errors.shape, normalized_errors[2], max(barcode_means))

In [None]:
assert len(barcode_means) == 12735 
assert 8860.6 < max(barcode_means) < 8860.7
assert normalized_errors.shape == (12735,)
assert 0.0441 < normalized_errors[2] < 0.0442

## 1.6: Plot normalized standard errors

Import `seaborn` and make a quick histogram of `normalized_errors`. What does the shape of the distribution of errors look like? Look at the x-axis - what standard error cutoff would you use for filtering out noisy barcodes? 

Here's the standard way to import `seaborn`:

```python
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
```

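To make your plot, use the `sns.distplot()` function. The only argument it needs is the array `normalized_errors`. (In `seaborn` 0.11 and later, `distplot()` is deprecated in favor of `sns.histplot()` and `sns.displot()`, which accept the array in the same way.)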
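If you'd also like numbers to go with the picture, `np.histogram()` returns the bin counts and bin edges directly. This is a minimal sketch on made-up values; the array `toy_errors` and the choice of four bins between 0 and 1 are assumptions for illustration, not part of the assignment.

```python
import numpy as np

toy_errors = np.array([0.05, 0.10, 0.12, 0.30, 0.80])

# Four equal-width bins spanning 0 to 1
counts, edges = np.histogram(toy_errors, bins=4, range=(0.0, 1.0))

print(edges)   # five edges define four bins
print(counts)  # most values fall in the first bin, with one far out in the tail
```

A long right tail like this is the kind of feature to look for when choosing a cutoff.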

In [None]:
# YOUR ANSWER HERE


## 1.7: Remove barcodes with high normalized errors.

For purposes of this programming exercise, we'll somewhat arbitrarily choose a normalized error of 0.3 as our filter to select the final list of barcodes.

In the cell below, construct a boolean mask, using `normalized_errors` to select rows from `barcode_means` for only those barcodes with a normalized error < 0.3. 

(In **section 1.2**, you used `data` itself to create a boolean mask for `data`. In this case, you'll use a boolean mask defined on `normalized_errors` to select elements from `barcode_means`.)

Save the result as `filtered_means`.
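The idea of building a mask on one array and applying it to another can be sketched with two made-up arrays of equal length (`toy_scores`, `toy_values`, `keep`, and `kept_values` are hypothetical names):

```python
import numpy as np

toy_scores = np.array([0.10, 0.50, 0.20, 0.90])
toy_values = np.array([10.0, 20.0, 30.0, 40.0])

keep = toy_scores < 0.3         # mask defined on one array...
kept_values = toy_values[keep]  # ...used to index a different array

print(kept_values)
print(len(toy_values) - len(kept_values))  # number of elements removed
```

This only works because the two arrays line up element for element, so position `i` in the mask refers to the same item in both.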

In [None]:
# YOUR ANSWER HERE


print('number of removed barcodes:',len(barcode_means) - len(filtered_means))

In [None]:
assert len(barcode_means) - len(filtered_means) == 542
assert barcode_means[10] == filtered_means[10]

## 1.8 Write the filtered barcode means to a file

Now that we've processed our data, we want to save the result. It's simple to write numpy arrays to a file using `np.savetxt`:

```python
np.savetxt('my_file_name.txt', my_array, fmt= 'number formatting string')
```
The full documentation for this function is worth looking at, because there are many options: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html

**In the cell below, save `filtered_means` to a file. To trim your numbers to five decimal places, use the following formatting string for the `fmt` argument: `%.5f`.** (The `%` sign indicates the beginning of a formatting string, `.5` indicates five decimal places, and `f` indicates that the number should be in floating point decimal format. A tutorial on formatting is here: https://www.python-course.eu/python3_formatted_output.php.)
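To see what the `%.5f` format does to the written values, here is a sketch that saves a made-up array to a temporary file and reads the text back. The array `toy_means`, the file name `toy_means.txt`, and the use of a temporary directory are all assumptions for illustration.

```python
import os
import tempfile
import numpy as np

toy_means = np.array([1.0, 2.123456789, 3.5])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_means.txt")
    np.savetxt(out_path, toy_means, fmt="%.5f")  # one value per line
    with open(out_path) as handle:
        written_lines = handle.read().splitlines()

print(written_lines)  # each number is rounded to five decimal places
```

Note that the formatting only affects the text in the file; the array in memory keeps its full precision.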

In [None]:
# Non-graded answer, but try it. It's useful to know how
# to write the results of your analysis to a file.



To keep things relatively simple in this programming problem, we left out some things. For example, it would be helpful to read in the barcode sequences (in the first column of the actual text file), and then write them out with their respective filtered means. To do this, we could have run `np.loadtxt` a second time, reading in just the barcode column as an array of strings. 

(If you try this and see some weird output like `b'TGCAATACG'`, be aware that there is a bug that adds the `b` when you read in strings: https://github.com/numpy/numpy/issues/2715. But you could easily trim this off.)
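One way this could look is sketched below on a made-up three-line file held in memory. The toy contents and all variable names are hypothetical, and the sketch assumes a recent `numpy` version, where `dtype=str` returns ordinary strings.

```python
import io
import numpy as np

toy_text = "AAAC\t1.0\t2.0\nGGTT\t0.0\t5.0\nCCGA\t3.0\t4.0\n"

# First pass: only the barcode column, read as strings
toy_barcodes = np.loadtxt(io.StringIO(toy_text), dtype=str, usecols=0)
# Second pass: only the numeric columns
toy_values = np.loadtxt(io.StringIO(toy_text), usecols=(1, 2))

# The same mask can index both arrays because their rows line up
keep = toy_values[:, 0] > 0.02
for barcode, mean in zip(toy_barcodes[keep], toy_values[keep].mean(axis=1)):
    print(f"{barcode}\t{mean:.5f}")
```

Keeping the barcodes in a separate array of strings avoids mixing data types in the numeric array while still letting one mask filter both.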

The workflow you just completed is something for which a Jupyter notebook is well suited. You could write a data processing workflow and just run new data through the notebook as you get it. The notebook is also a useful way to share your code with others.