<a href="https://colab.research.google.com/github/AndreMacedo88/VEnCode-Notebooks/blob/main/Getting_VEnCodes_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **How to use this notebook**

### **What is this notebook for?**
This notebook allows users to **find, rank, and retrieve VEnCodes** without installing or configuring any software on their own machines.
For in-depth information about the VEnCode tool, please refer to [Macedo and Gontijo, The intersectional genetics landscape for humans, GigaScience, 2020](https://doi.org/10.1093/gigascience/giaa083).

Using the CAGE-seq dataset from the FANTOM5 consortium, this notebook retrieves VEnCodes for the **primary cells in the FANTOM5 CAGE-seq dataset**. Other notebooks by our team support other analyses of different cell types, cell lines, or cell states.

This notebook also runs the base algorithm, which works on any numeric data matrix of rows and columns: it finds combinations of n rows that exhibit values above a set threshold in one desired column and below another threshold in all other columns. The notebook can therefore be used for applications outside transcriptomics.
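
To make the idea concrete, here is a minimal brute-force sketch in plain Python. It is not the VEnCode implementation (the tool uses its own "heuristic" and "sampling" algorithms). The function name `find_combinations`, the toy matrix, and the exact reading of the thresholds are assumptions: every row in a combination must be above `active_above` in the target column, and each other column must have at least one row of the combination at or below `inactive_at_or_below`.

```python
from itertools import combinations


def find_combinations(rows, target, n, active_above, inactive_at_or_below):
    """Brute-force search over all n-row combinations (illustration only).

    rows: dict mapping row name -> dict mapping column name -> numeric value.
    """
    # Assumption: "active" means strictly above the activity threshold.
    candidates = [name for name, values in rows.items()
                  if values[target] > active_above]
    other_columns = {c for values in rows.values() for c in values} - {target}

    found = []
    for combo in combinations(candidates, n):
        # Assumption: each non-target column needs at least one row of the
        # combination at or below the inactivity threshold.
        if all(any(rows[r][c] <= inactive_at_or_below for r in combo)
               for c in other_columns):
            found.append(combo)
    return found


# Hypothetical toy matrix: 4 rows (elements) x 3 columns (cell types).
toy = {
    "re1": {"A": 2.0, "B": 0.0, "C": 1.5},
    "re2": {"A": 1.0, "B": 3.0, "C": 0.0},
    "re3": {"A": 0.0, "B": 0.0, "C": 0.0},
    "re4": {"A": 4.0, "B": 2.0, "C": 2.0},
}
result = find_combinations(toy, target="A", n=2,
                           active_above=0.1, inactive_at_or_below=0)
print(result)  # [('re1', 're2')]
```

Exhaustive enumeration like this grows combinatorially with the number of rows, so it is only practical for small matrices.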

### **Rationale**
While the full VEnCode tool (available for Python 3) offers many more functions, analyses, and flexibility, there are times when a more user-friendly method is important, for example:

*   Users working "on the go" without access to their own machines;
*   Users who don't want to set up a Python environment.

Moreover, Google Colab sets up identical machines for all its users, which makes this code reproducible for everyone who has access to this notebook.

### **Requirements**
There is only one requirement to use this notebook successfully:
**You must have .csv data in the form of cell type x regulatory element (columns x rows) stored in your Google Drive.**
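
As an illustration of that layout, the sketch below parses a tiny made-up file with Python's standard `csv` module. The element names, the values, and the `element` label of the first column are hypothetical; the notebook does not state what label, if any, VEnCode expects for the first column.

```python
import csv
import io

# Hypothetical file content: the first column holds regulatory element names,
# every other column is one cell type. Values are made-up expression levels.
example_csv = (
    "element;Hepatocyte;Astrocyte;Keratinocyte\n"
    "chr1:100-200;1.2;0;0\n"
    "chr2:300-400;0.4;0.9;0\n"
)

reader = csv.reader(io.StringIO(example_csv), delimiter=";")
header = next(reader)
cell_types = header[1:]  # column names = cell types
# One entry per row (regulatory element), values ordered like cell_types.
matrix = {row[0]: [float(v) for v in row[1:]] for row in reader}

print(cell_types)              # ['Hepatocyte', 'Astrocyte', 'Keratinocyte']
print(matrix["chr1:100-200"])  # [1.2, 0.0, 0.0]
```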

### **Considerations on how to use the notebook**
1. Before running, **there are two things you need to address**:

*   Load your Google Drive and starting data
*   Define the starting parameters to get VEnCodes

We will explain how to address these issues in the next two sections.

2. Ideally, for fully reproducible results, **when you are ready to run the code you should click on "Runtime" and then "Restart and run all"**. However, the cells of this notebook were separated so that you can run them independently if you want to make quick tweaks to the parameters.

3. Note that **your attention will be needed once after starting the run**: to access your Google Drive, Google Colab will ask you to follow a link, copy the code it provides, paste it in the code block as instructed, and press enter.

4. After running the code engine, **your results should be in the Google Drive folder** that contains your initial data set. If they are not, check any error messages that might have been triggered in the code blocks.

## **Preliminary steps**

First, we need to import your data. This is the data that will be used to find and retrieve VEnCodes. Put the data file in your Google Drive for quick and easy access, then run the next code cell.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Next, specify where the data to be searched for VEnCodes is located:

In [None]:
# Change the file name accordingly. Don't forget the file extension.
file_name = "enhancers_154cells_non_binarized.csv"
# Check which separator your file uses to split the values.
# Common separators are a comma "," or a semicolon ";".
separator_style = ";"

# The next variable can be left unchanged if your file is at the root of your Google Drive folder.
# Otherwise, check the file structure on the left of this page in Google Colab
# and change the path to the folder that contains the file named above.
path_to_folder = '/content/drive/My Drive/'
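
If you are unsure which separator your file uses, one option is to let Python's standard `csv.Sniffer` guess it from the first few kilobytes. The sketch below is only a convenience check; the helper name `guess_separator` and the demo file are hypothetical, and the sniffer can guess wrong on unusual files, so it is worth confirming the result by eye.

```python
import csv
import os
import tempfile


def guess_separator(path, candidates=",;\t"):
    """Guess the delimiter of a text file from its first few kilobytes."""
    with open(path, newline="") as handle:
        sample = handle.read(4096)
    # Raises csv.Error if no candidate delimiter fits the sample.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Demo on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("element;A;B\nre1;1.2;0\nre2;0.4;0.9\n")
    detected = guess_separator(demo_path)

print(repr(detected))
```

In this notebook you could call the helper on `os.path.join(path_to_folder, file_name)` after mounting your Drive and compare the result with `separator_style`.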


Install the VEnCode tool in the Google Colab machine:

In [None]:
!pip install VEnCode

## **Define the Parameters to get VEnCodes**

Change the following parameters as desired (see [Macedo and Gontijo, The intersectional genetics landscape for humans, GigaScience, 2020](https://doi.org/10.1093/gigascience/giaa083) for more details):

In [None]:
# Input the target cell type to get VEnCodes. This should be the name of a column in your data. 
cell_type = "Hepatocyte"  # If using FANTOM5 data, run the code in the next cell group to see available cell types in FANTOM5.

# Choose the number of VEnCodes to find
number_vencodes = 3

# Define the number of regulatory elements that should comprise the VEnCodes
number_of_re = 4

# Choose the algorithm used to find VEnCodes. Can be "heuristic" or "sampling".
algorithm = "heuristic"

# Choose whether to determine the E values for the VEnCodes that will be retrieved.
e_values = True

# Next we will define the threshold parameters.
# First, target_celltype_activity is the TPM threshold above which we consider the target cell type as active.
# For FANTOM5 promoter data we recommend a value of 0.5. For enhancer data we recommend 0.1.
target_celltype_activity = 0.1

# Then, non_target_celltypes_inactivity is the TPM threshold above which we consider a non-target cell type as active. A value of 0 is recommended.
non_target_celltypes_inactivity = 0
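
Before starting a long run, it can help to sanity-check these parameters. The following is a minimal sketch, not part of VEnCode: the function `validate_parameters` is hypothetical, and it only checks what the comments above state (the cell type must be a column name, the algorithm must be "heuristic" or "sampling") plus the assumption that both counts should be at least 1.

```python
def validate_parameters(columns, cell_type, number_vencodes, number_of_re, algorithm):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if cell_type not in columns:
        problems.append(f"cell_type {cell_type!r} is not a column in the data")
    if number_vencodes < 1:
        problems.append("number_vencodes should be at least 1")
    if number_of_re < 1:
        problems.append("number_of_re should be at least 1")
    if algorithm not in ("heuristic", "sampling"):
        problems.append('algorithm should be "heuristic" or "sampling"')
    return problems


# Hypothetical column names; in practice, read them from your file's header.
columns = ["Hepatocyte", "Astrocyte"]
problems = validate_parameters(columns, "hepatocyte", 3, 4, "heuristics")
for problem in problems:
    print(problem)
```

In this example both the lowercase cell type name and the misspelled algorithm are reported, since column names are matched exactly.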


Run the next code block if you need to check the exact name of a cell type in the FANTOM5 data sets.

In [None]:
from VEnCode import common_variables as cv
from pprint import pprint as pp

# show all names of primary cell types:
pp(cv.primary_cell_list, compact=True)  # Turn compact to False to see an easier to read, albeit longer, list of cell types.
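
If the list is long, a fuzzy lookup can be quicker than scanning it by eye. The sketch below uses `difflib.get_close_matches` from the standard library on a small hypothetical list of names; the helper `suggest_cell_types` is not part of VEnCode, and in the notebook you could pass the names shown by the cell above instead, assuming they are plain strings.

```python
from difflib import get_close_matches

# Hypothetical subset of cell type names, for illustration only.
known_cell_types = [
    "Hepatocyte",
    "Astrocyte - cerebellum",
    "Keratinocyte - epidermal",
    "Hepatic Stellate Cells (lipocyte)",
]


def suggest_cell_types(query, names, limit=3):
    """Return up to `limit` names that resemble `query`, ignoring case."""
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(lowered), n=limit, cutoff=0.6)
    # Map the lowercase hits back to their original spelling.
    return [lowered[hit] for hit in hits]


suggestions = suggest_cell_types("hepatocytes", known_cell_types)
print(suggestions)
```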

## **Code engine (no need to change)**

Next comes the actual code to retrieve VEnCodes. Do not change it unless you understand the code:

In [None]:
import VEnCode
import os

# Join folder and file name; this also works if the folder path lacks a trailing slash.
path_to_data = os.path.join(path_to_folder, file_name)

# Define some default initial Parameters
if algorithm == "heuristic":
    reg_element_sparseness = 0
else:
    reg_element_sparseness = 90

# Open the data
data_tpm = VEnCode.DataTpm(path_to_data, sep=separator_style)
data_tpm.load_data()
data_tpm.make_data_celltype_specific(target_celltype=cell_type, replicates=False)

# Filter the data
data_tpm.filter_by_target_celltype_activity(threshold=target_celltype_activity, binarize=False)
data_tpm.define_non_target_celltypes_inactivity(threshold=non_target_celltypes_inactivity)
data_tpm.filter_by_reg_element_sparseness(threshold=reg_element_sparseness)
if algorithm != "sampling":
    data_tpm.sort_sparseness()

# Find VEnCodes
vencodes = VEnCode.Vencodes(data_tpm, algorithm=algorithm, number_of_re=number_of_re)
vencodes.next(amount=number_vencodes)
if not vencodes.vencodes:
    raise Exception("No VEnCodes found for {}!".format(cell_type))
if e_values:
    vencodes.determine_e_values(repetitions=100)
    # Export the VEnCodes and E values to a file:
    vencodes.export("vencodes", "e-values", path=path_to_folder)
else:
    # Export the VEnCodes to a file:
    vencodes.export("vencodes", path=path_to_folder)

Finally, make sure that the result files are saved to your Google Drive:

In [None]:
drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in your Google Drive.')