<a href="https://colab.research.google.com/github/STRIDES-Codes/NL4Cell-Integrating-NLP-with-single-cell-data-analysis/blob/main/NL4Cell_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started

Welcome to NL4Cell! This interactive tutorial will take you all the way from data preprocessing through training our first model.

To get started with this notebook, first we need to connect it to Google Drive; this way we have all of our data in one nice spot. Run the blocks of code below and follow the prompts.


In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now that we've connected this notebook to Google Drive, we can install the required packages.

In [3]:
%%capture
!pip install datasets

In [4]:
# %%capture
!pip install --force-reinstall git+https://github.com/justinphan3110/transformers.git

Collecting git+https://github.com/justinphan3110/transformers.git
  Cloning https://github.com/justinphan3110/transformers.git to /tmp/pip-req-build-g_d4etmu
  Running command git clone -q https://github.com/justinphan3110/transformers.git /tmp/pip-req-build-g_d4etmu
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 9.0 MB/s
Collecting numpy>=1.17
  Downloading numpy-1.21.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 76 kB/s
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting pyyaml
  Downloading PyYAML-5.4.1-cp37

Now we handle the imports and make sure that we're using a GPU if it is available.

In [5]:
from transformers import EncoderDecoderModel, BertTokenizer, BertConfig, BertModel
import torch
import numpy as np

In [6]:
if torch.cuda.is_available():
  device = torch.device('cuda')
  print(torch.cuda.get_device_name())
else:
  device = torch.device('cpu')
  

Tesla T4


## Handling preprocessed data

To make our lives a little easier, we can start off by handling preprocessed data. 

In short, the data originally came as a Pandas dataframe in which each row is an individual cell and each column is a gene, so the dataframe forms a gene expression matrix. Additional preprocessing was done; for details, see our more comprehensive [documentation](#null).



Let's start by copying the zipped dataframe into our space and expanding it.

Taking a peek at this data, we can see that the matrix holds numeric gene expression values. This is more information than we need, so let's start by discretizing the data into positive and negative.

## Preprocessed data



Each column (i.e. the expression values for one gene across all cells in the data set) was taken in turn, and a threshold for that gene was determined using Otsu's method. Cells whose expression value fell below the gene's threshold were classified as negative (-) for that gene, and cells above the threshold were classified as positive (+).
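To make the thresholding step concrete, here is a minimal sketch of Otsu's criterion applied to a single gene column. It is not the preprocessing code used for this project: the function names and the toy `cd4` values are made up for illustration, and it searches every cut between sorted values instead of using the histogram form that image libraries usually implement.

```python
def otsu_threshold(values):
    """Return the cut that maximizes between-class variance (Otsu's criterion)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_t, best_var = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cut can fall between two equal values
        w0, w1 = i / n, (n - i) / n                      # class weights
        mu0, mu1 = left_sum / i, (total - left_sum) / (n - i)  # class means
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = (xs[i - 1] + xs[i]) / 2             # midpoint of the gap
    return best_t


def discretize(gene, values):
    """Label each cell as gene+ or gene- relative to the gene's own threshold."""
    t = otsu_threshold(values)
    return [f"{gene}{'+' if v > t else '-'}" for v in values]


# Hypothetical expression values for one gene across six cells.
cd4 = [0.1, 0.2, 0.15, 3.9, 4.2, 4.0]
threshold = otsu_threshold(cd4)
labels = discretize("cd4", cd4)
print(threshold, labels)
```

Because the threshold is computed per column, a value that counts as positive for one gene can be negative for another.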

These files were then converted to a text representation, pickled, and zipped, and we can access them now.
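The text representation can be sketched as follows, assuming one line per cell and one signed token per gene, which is the layout the files in this tutorial have. The gene list, the boolean rows, and the helper name `cell_to_sentence` are illustrative only, and the pickling and zipping steps are left out.

```python
# Hypothetical gene order and per-cell flags (True = above that gene's threshold).
genes = ["cd4", "cd8", "cd3"]
cells = [
    [True, False, True],
    [False, True, True],
]


def cell_to_sentence(genes, flags):
    """Join one cell's signed gene tokens into a space-separated line."""
    return " ".join(f"{g}{'+' if f else '-'}" for g, f in zip(genes, flags))


lines = [cell_to_sentence(genes, row) for row in cells]
print("\n".join(lines))
```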

In [11]:
!mkdir -p /content/drive/My\ Drive/NL4Cell_data
!gsutil -q -m cp -r gs://cytereader/*txt /content/drive/My\ Drive/NL4Cell_data
!gsutil -q -m cp -rv gs://cytereader/otsu /content/drive/My\ Drive/NL4Cell_data


Now let's list out the directories to see if everything is there.

In [15]:
!ls /content/drive/My\ Drive/NL4Cell_data
!ls /content/drive/My\ Drive/NL4Cell_data/otsu

otsu  vocab.txt
vocab.txt		 Z2L2-otsu.3.txt	 Z3YR-otsu.12.txt
Z23S-otsu.0.txt		 Z2L2-otsu.4.txt	 Z3YR-otsu.13.txt
Z23S-otsu.10.txt	 Z2L2-otsu.5.txt	 Z3YR-otsu.14.txt
Z23S-otsu.11.txt	 Z2L2-otsu.6.txt	 Z3YR-otsu.15.txt
Z23S-otsu.12.txt	 Z2L2-otsu.7.txt	 Z3YR-otsu.16.txt
Z23S-otsu.13.txt	 Z2L2-otsu.8.txt	 Z3YR-otsu.17.txt
Z23S-otsu.14.txt	 Z367-otsu.0.txt	 Z3YR-otsu.18.txt
Z23S-otsu.15.txt	 Z367-otsu.10.txt	 Z3YR-otsu.19.txt
Z23S-otsu.16.txt_.gstmp  Z367-otsu.11.txt	 Z3YR-otsu.1.txt
Z23S-otsu.1.txt		 Z367-otsu.1.txt	 Z3YR-otsu.20.txt
Z23S-otsu.2.txt		 Z367-otsu.2.txt	 Z3YR-otsu.21.txt
Z23S-otsu.3.txt		 Z367-otsu.3.txt	 Z3YR-otsu.22.txt
Z23S-otsu.4.txt		 Z367-otsu.4.txt_.gstmp  Z3YR-otsu.23.txt
Z23S-otsu.5.txt		 Z367-otsu.5.txt_.gstmp  Z3YR-otsu.2.txt
Z23S-otsu.6.txt		 Z367-otsu.6.txt_.gstmp  Z3YR-otsu.3.txt
Z23S-otsu.7.txt		 Z367-otsu.7.txt	 Z3YR-otsu.4.txt
Z23S-otsu.8.txt		 Z367-otsu.8.txt	 Z3YR-otsu.5.txt
Z23S-otsu.9.txt		 Z367-otsu.9.txt	 Z3YR-otsu.6.txt
Z2L2-otsu.0.txt		 Z3YR-otsu.0.

Now let's take a peek at one of the files...

In [16]:
!head /content/drive/My\ Drive/NL4Cell_data/otsu/Z23S-otsu.0.txt

igd- cd19- cd45ra- cd141+ cd4+ cd8- cd16- cd127+ cd1c- cd123- cd66b- cd27+ cd14- cd56- cd24- cd3+ cd38+ cd161+ cd25-
igd- cd19- cd45ra+ cd141+ cd4- cd8+ cd16- cd127+ cd1c- cd123+ cd66b- cd27+ cd14- cd56- cd24+ cd3+ cd38- cd161+ cd25-
igd- cd19- cd45ra+ cd141- cd4+ cd8- cd16- cd127+ cd1c- cd123- cd66b- cd27- cd14- cd56+ cd24- cd3+ cd38- cd161- cd25-
igd- cd19- cd45ra+ cd141- cd4- cd8- cd16- cd127+ cd1c- cd123- cd66b- cd27+ cd14- cd56+ cd24+ cd3+ cd38- cd161- cd25-
igd- cd19- cd45ra- cd141- cd4+ cd8- cd16- cd127+ cd1c- cd123- cd66b- cd27+ cd14- cd56- cd24- cd3+ cd38+ cd161- cd25-
igd- cd19- cd45ra+ cd141- cd4+ cd8- cd16- cd127- cd1c- cd123- cd66b- cd27- cd14- cd56+ cd24- cd3+ cd38+ cd161- cd25-
igd- cd19- cd45ra- cd141+ cd4+ cd8- cd16+ cd127- cd1c+ cd123- cd66b- cd27- cd14+ cd56- cd24- cd3- cd38+ cd161- cd25-
igd- cd19- cd45ra- cd141+ cd4+ cd8- cd16- cd127+ cd1c- cd123- cd66b- cd27+ cd14- cd56- cd24+ cd3+ cd38- cd161- cd25-
igd- cd19- cd45ra+ cd141+ cd4+ cd8- cd16- cd127- cd1c- cd123- cd

Each row in this example is one cell, written as a "sentence" in the NLP sense, and each signed gene (such as `cd4+`) is a word.

Additionally we can take a look at `vocab.txt`. This file contains every gene (we'll use a placeholder, `gene`) across all of the data with three variants:
* `gene`
* `gene-`
* `gene+`

The `gene+` and `gene-` tokens are the values as we'd see them in a sentence, whereas the neutral `gene` token is used for training: it is a placeholder that lets us predict whether a specific gene is positive or negative without the model throwing in some other, random gene.

The file also contains a number of special tokens, but these become more important further down the road.

In [18]:
!head -n15 /content/drive/My\ Drive/NL4Cell_data/vocab.txt

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
ccr4
ccr4-
ccr4+
ccr5
ccr5-
ccr5+
ccr6
ccr6-
ccr6+
ccr7
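The layout of `vocab.txt` shown above can be reproduced with a short sketch. This is an illustration rather than the script that generated the real file; the function name `build_vocab` and the three-gene input are assumptions, and the ordering (special tokens first, then `gene`, `gene-`, `gene+` for each gene in sorted order) is inferred from the listing.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(genes):
    """Special tokens first, then neutral/negative/positive variants per gene."""
    vocab = list(SPECIAL_TOKENS)
    for gene in sorted(set(genes)):
        vocab.extend([gene, f"{gene}-", f"{gene}+"])
    return vocab


vocab = build_vocab(["ccr5", "ccr4", "ccr6"])
# A token's ID is its zero-based line number in the file.
token_to_id = {token: i for i, token in enumerate(vocab)}
print(vocab[:8], token_to_id["[MASK]"])
```

With this layout, the special tokens always occupy IDs 0 through 4, and each gene contributes three consecutive IDs.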


And that's it for what the data looks like. There are a couple of different ways to prepare the data (binary vs. stratified encoding, different thresholding methods), so for more information check out the relevant notebook.

## Train

Now that we've taken a look at the data we'll be using, we can start training. First we need to download a package that helps us tokenize the data.



In [19]:
!pip install tokenizers



Then we configure a modified version of BERT so that it can process these gene "sentences" via regular tokenization methods. The model is modified so that the positional encoding doesn't matter (since `cd4+ cd8-` is exactly the same as `cd8- cd4+`).
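How the fork achieves order independence is not shown in this notebook. One way to see why it is possible at all: self-attention with no positional term treats its input as a set. The sketch below is a toy single-head attention in plain Python, with identity projections and random made-up embeddings; it is an assumption-laden illustration, not the modified BERT itself.

```python
import math
import random


def self_attention(X):
    """Single-head self-attention with identity projections and no positional term."""
    dim = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in X]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, X)) / norm for d in range(dim)])
    return out


def mean_pool(rows):
    """Average the output vectors across tokens."""
    return [sum(col) / len(rows) for col in zip(*rows)]


rng = random.Random(0)
tokens = ["cd4+", "cd8-", "cd3+"]
embedding = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in tokens}

forward = self_attention([embedding[t] for t in tokens])
backward = self_attention([embedding[t] for t in reversed(tokens)])

# Each token receives the same output vector wherever it sits in the sentence.
same_per_token = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_f, row_b in zip(forward, reversed(backward))
    for a, b in zip(row_f, row_b)
)
# An order-independent summary, such as the mean, is therefore unchanged too.
same_pooled = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(mean_pool(forward), mean_pool(backward))
)
print(same_per_token, same_pooled)
```

Adding a position-dependent vector to each embedding would break both checks, which is why the positional encoding has to be neutralized for unordered gene sets.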

In [20]:
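from transformers import BertConfig, BertTokenizer, BertLMHeadModel
# CellBertTokenizer comes from the transformers fork installed above.
from transformers.models.bert.tokenization_bert import CellBertTokenizer

# vocab_size must be at least the number of lines in vocab.txt.
configuration = BertConfig(vocab_size=1000)
model = BertLMHeadModel(configuration)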

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


In [21]:
!mkdir cellAttention

!gsutil cp gs://cytereader/vocab.txt cellAttention/

Copying gs://cytereader/vocab.txt...
/ [1 files][  978.0 B/  978.0 B]
Operation completed over 1 objects/978.0 B.


The `tokenizer` maps each token in the vocabulary to an integer ID, so a gene "sentence" becomes a list of numbers. See the following cells for an example.

In [22]:
tokenizer = CellBertTokenizer.from_pretrained('./cellAttention',vocab_file="./cellAttention/vocab.txt")


In [25]:
tokenizer.encode('CD45+ CD196_CCR6+'.lower())

[2, 97, 1, 3]
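Reading that output against `vocab.txt`: ID 2 is `[CLS]`, ID 3 is `[SEP]`, and ID 1 is `[UNK]`, which suggests that `cd196_ccr6+` was not found in the vocabulary as a single token. The sketch below mimics this behaviour with a toy vocabulary and plain whitespace splitting. Both are assumptions, since the internals of `CellBertTokenizer` are not shown here, so the ID for `cd45+` differs from the real one.

```python
# Toy vocabulary; the real vocab.txt is much longer.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "cd45", "cd45-", "cd45+"]
token_to_id = {token: i for i, token in enumerate(vocab)}


def encode(text):
    """Wrap the sentence in [CLS] ... [SEP] and map unknown tokens to [UNK]."""
    unk = token_to_id["[UNK]"]
    ids = [token_to_id["[CLS]"]]
    ids += [token_to_id.get(tok, unk) for tok in text.lower().split()]
    ids.append(token_to_id["[SEP]"])
    return ids


encoded = encode("CD45+ CD196_CCR6+")
print(encoded)
```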

Now it's time to begin training on a small subset of encoded data. First we load two of the text files and run the tokenizer over every line.

In [31]:
%%time
import os
from datasets import load_dataset

# Take two of the Otsu-thresholded files, skipping vocab.txt and
# partially downloaded files such as *.gstmp.
data_dir = '/content/drive/My Drive/NL4Cell_data/otsu'
files_name = sorted(
    os.path.join(data_dir, x)
    for x in os.listdir(data_dir)
    if x.endswith('.txt') and x != 'vocab.txt'
)[0:2]
dataset = load_dataset("text", data_files=files_name, split='train')
max_length = 128  # pad or truncate every cell "sentence" to this many tokens
batch_size = 64   # number of lines passed to each tokenize_function call

def tokenize_function(examples):
    # Lowercase to match vocab.txt, then pad/truncate to a fixed length.
    return tokenizer([x.lower() for x in examples['text']], max_length=max_length, truncation=True, padding='max_length')

train_data_batch = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=batch_size,
)


Using custom data configuration default-d2ed2ec12f0b89a9


Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-d2ed2ec12f0b89a9/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...


0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-d2ed2ec12f0b89a9/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.


  0%|          | 0/31250 [00:00<?, ?ba/s]

CPU times: user 2min 39s, sys: 5.2 s, total: 2min 44s
Wall time: 2min 48s


Then set up the data collator which helps form batches.

In [None]:
from transformers.data.data_collator import DataCollatorForLanguageModeling, DataCollatorForSOP, DataCollatorForNetutralCellModeling

data_collator = DataCollatorForNetutralCellModeling(
    tokenizer=tokenizer, ncm=True, mlm_probability=0.15
)
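The collator used here lives in the fork and its source is not shown in this notebook. Based on the earlier description of the neutral `gene` token, the idea can be sketched as follows: with some probability, a signed token is swapped for its neutral form, and the model is asked to recover the sign. The function name `neutralize` and its return format are hypothetical and are not the fork's API.

```python
import random


def neutralize(tokens, prob=0.15, rng=random):
    """Swap signed tokens for neutral ones with probability `prob`.

    Returns the model inputs and the per-position targets, where None
    marks positions that do not contribute to the loss.
    """
    inputs, labels = [], []
    for tok in tokens:
        if tok[-1] in "+-" and rng.random() < prob:
            inputs.append(tok[:-1])  # e.g. "cd4+" becomes "cd4"
            labels.append(tok)       # the model must recover "cd4+"
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


sentence = ["cd4+", "cd8-", "cd3+"]
all_inputs, all_labels = neutralize(sentence, prob=1.0)
no_inputs, no_labels = neutralize(sentence, prob=0.0)
print(all_inputs, all_labels)
```

Setting `prob=1.0` and `prob=0.0` shows the two extremes; with the `mlm_probability=0.15` used in this tutorial, roughly one token in seven would be selected under this sketch.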


Now we define the arguments for training...

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./cellAttention",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    learning_rate=1e-4,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data_batch,
)

And train!

In [None]:
# %%time
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertLMHeadModel.forward` and have been ignored: text.
***** Running training *****
  Num examples = 2000000
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 62500


| Step | Training Loss |
|------|---------------|
| 500 | 0.7371 |
| 1000 | 0.4934 |
| 1500 | 0.4608 |
| 2000 | 0.4469 |
| 2500 | 0.4351 |
| 3000 | 0.4334 |
| 3500 | 0.4229 |
| 4000 | 0.4241 |
| 4500 | 0.414 |
| 5000 | 0.4079 |


Saving model checkpoint to ./cellAttention/checkpoint-10000
Configuration saved in ./cellAttention/checkpoint-10000/config.json
Model weights saved in ./cellAttention/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to ./cellAttention/checkpoint-20000
Configuration saved in ./cellAttention/checkpoint-20000/config.json
Model weights saved in ./cellAttention/checkpoint-20000/pytorch_model.bin
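The step count and checkpoint names in the log above follow directly from the dataset size and the training arguments. The sketch below redoes that arithmetic, assuming a single device and no gradient accumulation, as the log reports; the variable names are illustrative.

```python
import math

num_examples = 2_000_000      # "Num examples" in the training log
per_device_batch_size = 32    # per_device_train_batch_size
num_epochs = 1                # num_train_epochs
save_steps = 10_000
save_total_limit = 2
map_batch_size = 64           # batch_size passed to dataset.map

# One optimizer step per batch, assuming one device and no accumulation.
total_steps = math.ceil(num_examples / per_device_batch_size) * num_epochs

# A checkpoint is written every save_steps steps.
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

# With save_total_limit set, only the most recent checkpoints stay on disk.
kept = checkpoints[-save_total_limit:]

# Number of batches processed during tokenization.
tokenize_batches = math.ceil(num_examples / map_batch_size)

print(total_steps, [f"checkpoint-{s}" for s in kept], tokenize_batches)
```

The tokenization batch count matches the `31250` shown in the progress bar of the tokenization cell.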


The remaining cells are code for you to play around with. First we save the model, then set up a pipeline that lets us predict which token belongs in a masked position. For example, if we were dealing with a typical NLP model we could make it fill in the mask like so:

`the chef [MASK] the meal` --> `the chef prepared the meal`. 

Likewise, we can pass in a sequence of genes with one position masked and have the model infer which token fits best in its place.


In [None]:
trainer.save_model("./cellAttention")

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./cellAttention",
    tokenizer="./cellAttention"
)

Predict if a cell corresponding to this vector will express `CD182_CXCR2`.

In [None]:
fill_mask("CD45+ CD196_CCR6+ CD181_CXCR1- HLA_DR- CD15- CD31_PECAM1- CD8a- CD182_CXCR2[MASK] CD66ace- CD63- CD14- CD66b- CD62L_Lselectin- CD3+ CD27- CD86+ CD10- CD197_CCR7+ CD28- CD11c- CD33- CD161- CD45RO- CD24- CD38+ CD278_ICOS- CD32- CD152_CTLA4+ IgM+ CD184_CXCR4+ CD279_PD1- CD56+ CD16-")

And that's it for the tutorial! Now you have a model that understands genes like a normal NLP model would understand words.