# Homework03 (10 points): Parsing single-cell genomics data

Here we parse the output of a single-cell genomics experiment, or more precisely just the first few cells from one. The data is formatted in the style of 10X Genomics output.

There are some potentially useful code snippets at the end of the notebook.

Please post questions to the `lectures-homework` Slack channel. Phil is also available via email (pbradley@fredhutch.org).

## Looking at the input files
In the folder where this homework notebook lives, there's a directory `data/` which contains another directory, `filtered_feature_bc_matrix/`, with the results of a single-cell genomics experiment.

In [None]:
# We can list the contents of that directory to see that it contains three files.
# Here %ls is a Jupyter notebook feature, not a Python one: the % prefix means we are
# calling a built-in Jupyter "magic" command.
%ls data/filtered_feature_bc_matrix

Here's what the files look like, using the Unix `head` command to print out the first 5 lines of each.

```
rhino01$ head -5 data/filtered_feature_bc_matrix/barcodes.tsv
AAAGCAACAAGCGTAG-1
AAAGTAGCATACGCTA-1
AAAGTAGGTCTCTCGT-1
AAAGTAGTCCACGTGG-1
AAATGCCTCACGACTA-1

rhino01$ head -5 data/filtered_feature_bc_matrix/features.tsv
ENSG00000243485	MIR1302-2HG	Gene Expression
ENSG00000237613	FAM138A	Gene Expression
ENSG00000186092	OR4F5	Gene Expression
ENSG00000238009	AL627309.1	Gene Expression
ENSG00000239945	AL627309.3	Gene Expression

rhino01$ head -5 data/filtered_feature_bc_matrix/matrix.mtx
%%MatrixMarket matrix coordinate integer general
%
36620 10 8350
42 1 1
49 1 1
```

The `barcodes.tsv` file contains the DNA barcodes associated with the different cells that were profiled.

The `features.tsv` file describes the features that were analyzed (mRNA transcripts and, in this case, also surface proteins measured using DNA-barcoded antibodies).

The `matrix.mtx` file contains the results of the experiment. Each line after the first three header lines contains three integers, FEATURE CELL COUNT, which indicate the COUNT of transcripts mapping to feature number FEATURE in cell number CELL. FEATURE and CELL are numbered 1...N_FEATURES and 1...N_CELLS, respectively. 
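As a small illustration of the FEATURE CELL COUNT layout described above, here is a sketch that parses the two entry lines shown in the `head` output. The name `sample_entry_lines` is made up for this example, and the lines are typed in by hand rather than read from the file.

```python
# Two entry lines copied from the `head` output of matrix.mtx shown above
# (sample_entry_lines is a hypothetical name used only for this illustration)
sample_entry_lines = ['42 1 1\n', '49 1 1\n']

entries = []
for line in sample_entry_lines:
    # split() breaks the line on whitespace and also drops the trailing newline
    feature, cell, count = [int(x) for x in line.split()]
    print('feature', feature, 'cell', cell, 'count', count)
    entries.append((feature, cell, count))
```

Both entries belong to cell number 1: one transcript was counted for feature number 42 and one for feature number 49.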


In [None]:
## to get started, execute the code in this cell to define some filenames and a useful function

barcodes_file = 'data/filtered_feature_bc_matrix/barcodes.tsv'
features_file = 'data/filtered_feature_bc_matrix/features.tsv'
matrix_file = 'data/filtered_feature_bc_matrix/matrix.mtx'

def read_lines_from_file( filename ):
    '''Returns a list containing all the lines in the given file
    Note: the lines will end with the newline character ('\n'),
    with the possible exception of the last line
    
    Note: we could also do this in a single line:  return open(filename,'r').readlines()
       The downside to that is that the file object may stay open for a little while after
       we read from it. So it's safer to explicitly close the file by calling data.close()
       or by using something we haven't talked about yet called a 'with' statement
    '''
    # open a file object for reading ('r')
    data = open(filename, 'r')
    # read all the lines into a list of strings
    lines = data.readlines()
    # close the file
    data.close()
    # return the lines
    return lines
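For the curious, here is a rough sketch of the `with` statement mentioned in the docstring above. The function name `read_lines_with` and the throwaway demo file are assumptions made for this illustration; you do not need them for the homework.

```python
import os
import tempfile

def read_lines_with(filename):
    '''Hypothetical variant of read_lines_from_file that uses a with statement.
    The file is closed automatically when the indented block ends.
    '''
    with open(filename, 'r') as data:
        lines = data.readlines()
    return lines

# small demonstration on a throwaway file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tsv')
with open(demo_path, 'w') as out:
    out.write('AAAGCAACAAGCGTAG-1\nAAAGTAGCATACGCTA-1\n')

demo_lines = read_lines_with(demo_path)
print(demo_lines)
print(len(demo_lines))
```

The result should match what `read_lines_from_file` returns for the same file; the only difference is that there is no explicit `close()` call.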



## Q1 (1 pt). How many cells are represented in this experiment? 
Figure this out by reading the lines from `barcodes_file` into a list using the function `read_lines_from_file`, and printing the length of that list.

In [None]:
# put your code here in this block


## Q2 (1 pt). How many features (mRNA transcripts or surface proteins) are represented in this experiment? 
Figure this out by reading the lines from `features_file` into a list, and printing the length of that list.

In [None]:
# put your code here in this block


## Q3 (1 pt). How many nonzero entries are there in the counts matrix, which stores all the observed features (mRNA transcripts or surface proteins) for all the cells? In other words, how many cell-feature combinations were observed in the experiment?
Figure this out by reading the lines from `matrix_file` into a list, and printing the length of that list *MINUS 3* because the matrix file starts with three header lines (two comments and a line showing the overall size of the matrix).

In [None]:
# your code here


## Q4 (1 pt). Print the third line in the matrix file (counting lines as we normally do 1,2,3,...). 
This line should relate to your previous answers.

In [None]:
# your code here

## Q5 (2 pts). Write a function that reads the features file and returns a list of the gene/protein names, as a list of strings. Use this function to get a list of the features and print the name of the 100th feature (here counting 1,2,3...). 
The features file has three tab-separated columns and no header line. The names that we are looking for are in the second (i.e., middle) column. These are the usual human-readable gene names like 'CD4' and 'CCL5'.
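One thing to watch for with tab-separated columns: the third column contains a space ('Gene Expression'), so splitting on any whitespace gives a different result than splitting on tabs. A quick sketch on a single line typed in by hand from the `head` output above:

```python
# One line from features.tsv, typed in by hand ('\t' is the tab character)
line = 'ENSG00000243485\tMIR1302-2HG\tGene Expression\n'

# remove the trailing newline, then split on tabs only
columns = line.rstrip('\n').split('\t')
print(columns)
print(len(columns))

# for comparison: split() with no argument splits on spaces as well as tabs
pieces = line.split()
print(pieces)
print(len(pieces))
```

Splitting on `'\t'` gives three columns, while plain `split()` gives four pieces because 'Gene Expression' is broken in two.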

In [None]:
## you can fill out this function

def read_feature_names(filename):
    ''' Return a list of the feature names
    '''
    lines = read_lines_from_file(filename)
    # etc, etc
    
    
features = read_feature_names(features_file)
print(features[YOU_PUT_THE_RIGHT_NUMBER_IN_HERE])

## Q6 (2 pts). Write a function that takes as input a cell number and a matrix filename and returns the total transcript count for that cell: the sum of the transcript counts over all the features that were mapped to that cell, as a single number.
Start from the template below. The matrix file starts with three header lines. Every other line in the file consists of three integers: the feature number, the cell number, and the count for that feature (the number of unique transcripts mapping to that feature).

In [None]:
def get_total_count_for_cell(cell_number, matrix_filename):
    ''' Read through the matrix file and sum up all the counts for features 
    that mapped to the given cell_number. cell_number is a 1-indexed integer,
    just like the feature and cell indexes in the matrix file. I.e., it starts at 1.
    '''
    assert cell_number>0 # cell_number should be 1-indexed. 0 is not a valid cell number.
    
    # here we read the lines from the file, and use slicing to remove the first three header lines:
    lines = read_lines_from_file(matrix_filename)[3:]
    total_count = 0
        
    # etc etc
            


## Q7 (2 pts). Use the function you created above to print out the total feature count for each cell in the dataset. Which cell number (1-10) has the highest total number of transcripts? Which has the lowest?
Use a `for` loop over a `range` expression to loop over the cell numbers when printing out the total feature counts. You can just eyeball the numbers to identify the cells with the largest and smallest counts, or you could append them to a list and use the `max()`, `min()`, and `list.index()` functions.
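Here is a sketch of that loop pattern on made-up numbers. The function `toy_total` and its values are hypothetical stand-ins, not real data and not the function from Q6.

```python
def toy_total(cell_number):
    '''Hypothetical stand-in that returns a made-up total for cells 1, 2 and 3'''
    made_up_totals = {1: 120, 2: 75, 3: 310}
    return made_up_totals[cell_number]

totals = []
# range(1, 4) yields 1, 2, 3: the stop value is not included
for cell_number in range(1, 4):
    total = toy_total(cell_number)
    print(cell_number, total)
    totals.append(total)

# list positions start at 0, so add 1 to get back to a 1-indexed cell number
highest_cell = totals.index(max(totals)) + 1
lowest_cell = totals.index(min(totals)) + 1
print('highest:', highest_cell, 'lowest:', lowest_cell)
```

Note the `+ 1`: `totals[0]` holds the total for cell number 1, so the position returned by `index()` is one less than the cell number.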

In [None]:
# your code here


## Q8 (2 pts extra credit). Write a function that takes as input a cell number, a list of all the features, and a matrix filename and returns a tuple (name, count) consisting of the name of the feature with the highest transcript count in that cell and the count for that feature. Use this to figure out the highest count feature for each of the 10 cells.
One tricky thing is the disconnect between how the features are numbered in the matrix file and how they are numbered in python lists: the matrix file numbers features starting at 1, whereas python list indexes start at 0.
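A tiny sketch of that numbering difference, using a hand-typed list of the first three names from the `head` output of `features.tsv` (the name `toy_features` is made up for this example):

```python
# First three feature names from features.tsv, typed in by hand
toy_features = ['MIR1302-2HG', 'FAM138A', 'OR4F5']

# a feature number as it would appear in the matrix file (1-indexed)
feature_number = 2

# subtract 1 to convert to a python list index (0-indexed)
name = toy_features[feature_number - 1]
print(name)
```

Feature number 2 is the second name in the file, which lives at list index 1.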

In [None]:
def get_highest_feature(cell_number, features, matrix_filename):
    ''' Read through the matrix file and find the feature with the highest count
    for the given cell. Return the feature name and the count for that feature
    '''
    # fill this in
    

# Potentially useful code snippets

In [None]:
# parsing a single line...

line = '4 6 20\n'
line_split = line.split()
print(line_split) # strings

# this is something called a list comprehension: it's a nice shorthand way of creating a list in python
# the general syntax is:
#  [ expression_involving_variable for variable in list ]
# or you can add an if-statement to filter out some elements:
#  [ expression_involving_variable for variable in list if boolean_involving_variable ]

counts = [int(x) for x in line_split] 
 
print(counts) # now integers

a, b, c = counts # in python we can assign to multiple names at once by unpacking a list (or other "iterable" like a tuple)
print('a=', a)
print('b=', b)
print('c=', c)
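The cell above uses the first form of list comprehension. Here is a short sketch of both forms, including the one with an if-statement filter, on a made-up list of numbers:

```python
numbers = [3, 7, 4, -1, 10, 6]

# expression applied to every element
doubled = [2 * x for x in numbers]
print(doubled)

# the if-clause keeps only the elements for which the condition is True
positive_even = [x for x in numbers if x > 0 and x % 2 == 0]
print(positive_even)
```

The filter is applied before the expression, so elements that fail the condition never make it into the new list.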


In [None]:
# max and min and index

l = [3, 7, 4, -1, 10, 6]

# max and min are built-in python functions that can operate on integers, floats, lists, and many other objects  
print('max(3,4)=', max(3,4))
print('min(2,10,-1,59)=', min(2,10,-1,59))

print('max(l)=', max(l))
print('min(l)=', min(l))

# l.index is a function that returns the index where a given element first occurs in a list

print('index for max(l) in l:', l.index(max(l)))
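Two details of `list.index()` that are easy to trip over, sketched on a made-up list: when a value appears more than once you get the position of the first occurrence, and asking for a value that is not in the list raises a `ValueError`.

```python
values = [5, 9, 2, 9]

# 9 appears twice; index() reports the first position only
first_nine = values.index(9)
print('first index of 9:', first_nine)

# looking up a missing value raises ValueError
try:
    values.index(42)
    found = True
except ValueError:
    found = False
    print('42 is not in the list')
```

So if two cells happened to tie for the largest total, `index(max(...))` would only report the first of them.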

