# Script Documentation and Commenting

Just like keeping a lab notebook to track all of your experiments, documentation helps users know how to use your code, explains what stage it's at, and records any rationales or results you got from running that code. That user could even be you!

Here are some tips and tricks that I've used while writing my scripts.

# Jupyter Notebooks

Jupyter notebooks are really useful for documenting everything that you've been doing because not only can you separate your code into cells, but you can add text cells and write in LaTeX using the ```Markdown``` cell format. 

You can write LaTeX-formatted **equations**:

$$ \frac{-b\pm\sqrt{b^2-4ac}}{2a} $$

You can also add **images** to your notebooks:

<img src="spbob.jpeg">

**Resources**

- [Markdown Style Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

## Jupyter Lab

There's a new web-based interface that Jupyter just released called **JupyterLab**: 

<img src="jupyterlab.png">

JupyterLab has some cool new features such as:

- Multi-tab in-browser environment
- Collapse Cells
- Drag and drop cells
- Side by side editing (like markdown viewer below)
- Console Editor
- Single Document Mode
- New File browser
- File Grids + CSV viewer on Big Data
- Image file viewing

More on JupyterLab: 
- [News Update](https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906)
- [Data Centric Review](https://medium.com/@brianray_7981/jupyterlab-first-impressions-e6d70d8a175d)
- [Advanced JupyterLab Documentation](https://media.readthedocs.org/pdf/jlab/debug-rtd/jlab.pdf)

This was just released, so there's still room to improve, but overall people in the data science community speak highly of it and its potential!

# docstrings

The basic purpose of a docstring is to explain what your function or class does, what its parameters are, and what it returns.

Jessime and I both use [NumPy's documentation style](http://www.numpy.org/devdocs/docs/howto_document.html) mostly because it's pretty straightforward and legible to anyone who is familiar with code.

**Example docstring**


In [None]:
'''
This is a one-sentence explanation of what your function does.

Parameters
----------
x : type
    Description of parameter 'x'.
y
    Description of parameter 'y' with type not specified. 
    
Returns
-------
z : type
    Description of return variable 'z'. 
'''

In [None]:
def is_nucleic_acid(seq):
    '''
    Indicates whether a string is a valid nucleic acid sequence.
    
    Parameters
    ----------
    seq : str
        Case-insensitive string
    
    Returns
    -------
    bool
    
    '''
    seq = seq.upper()
    na = 'ACTGUWSMKRYBDHVNZ' # Why are there so many nucleic acids?
    return (all(i in na for i in seq))

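A ```comment``` is used when a line or a block of code isn't clear. Compare the comment on the line defining na in the cell above with the one in the cell below: a useful comment answers the reader's question instead of raising one.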

In [None]:
def is_nucleic_acid(seq):
    '''
    Indicates whether a string is a valid nucleic acid sequence.
    
    Parameters
    ----------
    seq : str
        Case-insensitive string
    
    Returns
    -------
    bool
    
    '''
    seq = seq.upper()
    na = 'ACTGUWSMKRYBDHVNZ' # Includes all IUPAC standard notations
    return (all(i in na for i in seq))
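
Another difference between the two: Python stores the docstring on the function object, where ```help()``` and ```inspect.getdoc``` can read it, while comments are discarded when the code runs. Here is a minimal sketch that reuses the function above; the test sequences are made up for illustration.

```python
import inspect


def is_nucleic_acid(seq):
    '''
    Indicates whether a string is a valid nucleic acid sequence.

    Parameters
    ----------
    seq : str
        Case-insensitive string

    Returns
    -------
    bool
    '''
    seq = seq.upper()
    na = 'ACTGUWSMKRYBDHVNZ'  # Includes all IUPAC standard notations
    return all(i in na for i in seq)


# getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(is_nucleic_acid)
summary = doc.splitlines()[0]
print(summary)          # Indicates whether a string is a valid nucleic acid sequence.
print('IUPAC' in doc)   # False: the comment is not part of the docstring

print(is_nucleic_acid('acgu'))   # True
print(is_nucleic_acid('ACGTX'))  # False, 'X' is not in the allowed letters
```

Running ```help(is_nucleic_acid)``` in a notebook or console prints the same docstring, which is why it pays to keep the first line short and descriptive.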

# Saving command line history

# Level 4 Review using docstrings!

You have two files: theraptrix_protein_orders.txt, which contains a list of names of all the proteins that Theraptrix has purchased recently, and locus_data.tab, which contains tabular data of a series of experiments. 

*Parameters*
- paths to both locus_data.tab and theraptrix_protein_orders.txt 

*Returns*
- 0.200 : maximum average correlation value (rounded to 3 decimals)
- ('K562', 'apoptosis') : corresponding cell type and phenotype

*Steps*
- Filter locus_data.tab by removing rows where the expression of the protein is not significant
- Filter locus_data.tab by removing rows where the protein name is not in theraptrix_protein_orders.txt
- Then, for each cell type within the filtered data set, average the correlation values of all proteins for each individual phenotype. 
- Write the maximum average correlation value (rounded to 3 decimals) among all of the average values, along with the cell type and phenotype corresponding to this max value, to the file results/4.txt.

In [None]:
# Import modules
import pandas as pd
from sys import argv

# Main function
def pheno_corr(locus_data, proteins):
    pass  # the analysis steps go here
    with open('results/4.txt', 'w') as outfile:
        outfile.write('')  # placeholder: write the results here

# Take command-line arguments
if __name__ == '__main__':
    pheno_corr(argv[1], argv[2])
    # pheno_corr('locus_data.tab', 'theraptrix_protein_orders.txt')


In [None]:
def pheno_corr(locus_data, proteins):
    '''
    The strongest phenotype in a single cell type expressing select proteins.

    Parameters
    ----------
    locus_data : str
        path to experimental data in a tab-separated file
    
    proteins : str
        path to list of proteins in a txt file
    
    Returns
    -------
    Txt File
        - correlation value
        - cell type and phenotype
    
    '''

In [None]:
data = pd.read_table(locus_data)  # keep the table in 'data', which the cells below use
with open(proteins) as f:
    thera_proteins = [line.strip() for line in f]
print(data)
print(thera_proteins)

In [None]:
data = data[(data['Exp Sig'] == True)]
print(data)

In [None]:
data = data[(data['Exp Sig'] == True) & (data['Protein'].isin(thera_proteins))]
print(data)    

In [None]:
grouped_means = data.groupby('Cell Type').mean(numeric_only=True) # skip text columns such as 'Protein'
print(grouped_means)
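
To see what the grouping step produces, here is a small sketch on invented numbers. The column names ```apoptosis``` and ```growth``` and all of the values are assumptions for illustration; they are not taken from locus_data.tab.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Cell Type': ['K562', 'K562', 'HeLa', 'HeLa'],
    'Protein': ['P1', 'P2', 'P1', 'P3'],
    'apoptosis': [0.1, 0.3, -0.2, 0.0],
    'growth': [0.5, -0.1, 0.2, 0.4],
})

# One row per cell type, one column per numeric column;
# numeric_only=True leaves out the text column 'Protein'
toy_means = toy.groupby('Cell Type').mean(numeric_only=True)
print(toy_means)
```

Each cell of the result is the average of one phenotype column over all rows of one cell type, which is what the third step of the exercise asks for.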

In [None]:
pheno_data = grouped_means.iloc[:,-3:] # change this when submitting
print(pheno_data) 

In [None]:
max_value = pheno_data.max().max()
print(max_value)

In [None]:
cell_type = pheno_data.max(axis=1).idxmax()
print(cell_type)

In [None]:
pheno = pheno_data.max(axis=0).idxmax()
print(pheno)
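
The two lookups above find the row label and the column label separately. As an alternative sketch (not necessarily how the exercise expects you to do it), ```stack()``` followed by ```idxmax()``` returns both labels at once. The table below is invented for illustration.

```python
import pandas as pd

# Hypothetical per-cell-type averages (values invented for illustration)
toy_pheno = pd.DataFrame(
    {'apoptosis': [0.2, -0.1], 'growth': [0.15, 0.05]},
    index=['K562', 'HeLa'],
)

# Two-step lookup, as in the cells above
row_label = toy_pheno.max(axis=1).idxmax()
col_label = toy_pheno.max(axis=0).idxmax()

# One-step lookup: stack() gives a Series indexed by (row, column) pairs
both = toy_pheno.stack().idxmax()

print(row_label, col_label)  # K562 apoptosis
print(both)                  # ('K562', 'apoptosis')
```

The one-step version always returns a pair that belongs to a single cell of the table. The two-step version can, in principle, pair a row and a column from different cells if the maximum value is tied.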

In [None]:
with open('results/4.txt', 'w') as outfile:
    outfile.write(f'{round(max_value, 3)}\n')
    outfile.write(f'{(cell_type, pheno)}')
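
Putting the cells together, one possible version of the finished function looks like the sketch below. Several things here are assumptions rather than part of the exercise: the ```out_path``` and ```n_pheno``` parameters, the tuple return value, and the toy input files (all names and numbers in them are invented so the example can run on its own). It also assumes the phenotype columns are the last three columns of the table, as in the ```iloc[:,-3:]``` cell.

```python
import os
import tempfile

import pandas as pd


def pheno_corr(locus_data, proteins, out_path='results/4.txt', n_pheno=3):
    '''
    The strongest phenotype in a single cell type expressing select proteins.

    Parameters
    ----------
    locus_data : str
        path to experimental data in a tab-separated file
    proteins : str
        path to list of proteins in a txt file
    out_path : str
        where to write the result (hypothetical parameter)
    n_pheno : int
        number of trailing phenotype columns (assumed to be 3)

    Returns
    -------
    tuple
        (max average correlation, (cell type, phenotype)),
        also written to out_path
    '''
    data = pd.read_table(locus_data)
    with open(proteins) as f:
        thera_proteins = [line.strip() for line in f]

    # Keep significant rows for proteins that were ordered
    data = data[(data['Exp Sig'] == True) & (data['Protein'].isin(thera_proteins))]

    # Average each phenotype column within each cell type
    pheno_cols = list(data.columns[-n_pheno:])
    pheno_data = data.groupby('Cell Type')[pheno_cols].mean()

    max_value = round(float(pheno_data.max().max()), 3)
    cell_type = str(pheno_data.max(axis=1).idxmax())
    pheno = str(pheno_data.max(axis=0).idxmax())

    # Create the output folder if it does not exist yet
    os.makedirs(os.path.dirname(out_path) or '.', exist_ok=True)
    with open(out_path, 'w') as outfile:
        outfile.write(f'{max_value}\n')
        outfile.write(f'{(cell_type, pheno)}')
    return max_value, (cell_type, pheno)


# Toy inputs, invented for this example (not the real course files)
toy = pd.DataFrame({
    'Protein': ['P1', 'P2', 'P3', 'P1', 'P2'],
    'Cell Type': ['K562', 'K562', 'K562', 'HeLa', 'HeLa'],
    'Exp Sig': [True, True, True, False, True],
    'apoptosis': [0.1, 0.3, 0.9, 0.8, -0.2],
    'growth': [0.0, 0.1, 0.9, 0.8, 0.15],
    'migration': [-0.1, 0.0, 0.9, 0.8, 0.05],
})

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, 'toy_locus_data.tab')
    protein_path = os.path.join(tmp, 'toy_orders.txt')
    out_path = os.path.join(tmp, 'results', '4.txt')

    toy.to_csv(data_path, sep='\t', index=False)
    with open(protein_path, 'w') as f:
        f.write('P1\nP2\n')

    result = pheno_corr(data_path, protein_path, out_path=out_path)
    with open(out_path) as f:
        written = f.read()

print(result)
print(written)
```

In the toy table, the P3 row is dropped because P3 is not in the order list, and the first HeLa row is dropped because it is not significant, so neither of their large values affects the averages.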