# Introduction

This notebook attempts a literature validation of the top-k scoring pathways using Natural Language Processing (NLP) with BioBERT. The goal is to show that the DRW-based scoring algorithms give more biologically plausible results than the eigengene (EG) scoring method.

It is still to be decided whether this will be used in the BOO report.

## Libs

In [19]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from Bio import Entrez
from tqdm import tqdm

In [14]:
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## Functions
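
The helpers in this section reduce the per-token BioBERT vectors to one vector per text and then compare two texts by cosine similarity. The sketch below reproduces that arithmetic in plain Python on toy 2-dimensional vectors, so it runs without the model. The names `masked_mean_pool` and `cosine` are illustrative and not part of the notebook, and the attention-mask step is an assumption about how batched input would need to be handled, not something the notebook currently does.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vec):
                total[i] += value
    if n_real == 0:
        raise ValueError("attention_mask marks no real tokens")
    return [value / n_real for value in total]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy "token vectors"; the last row stands in for a padding token.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
similarity = cosine(pooled, [1.0, 1.0])
```

For a single input string the tokenizer adds no padding, so the plain `mean(dim=1)` in `get_embedding` gives the same result as the masked version. The mask would only start to matter if several texts of different lengths were embedded in one batch.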

In [16]:
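def get_embedding(text):
    """Embed a single text string with BioBERT by mean-pooling the last hidden layer.

    Returns a 1-D NumPy array (768 values for this model).
    """
    # Cap the input at the model's 512 positions; longer texts are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token vectors into one text-level vector.
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding.squeeze().numpy()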

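def compute_similarity(term, abstract):
    """Return the cosine similarity between the BioBERT embeddings of a term and an abstract."""
    term_emb = get_embedding(term)
    abs_emb = get_embedding(abstract)
    # cosine_similarity expects 2-D inputs, hence the list wrapping.
    return cosine_similarity([term_emb], [abs_emb])[0][0]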

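def fetch_abstracts(query, retmax=50):
    """Search PubMed for `query` and return up to `retmax` abstracts as one plain-text string."""
    # NCBI asks callers to identify themselves: set Entrez.email before calling.
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    ids = record["IdList"]
    if not ids:
        return ""  # no hits, so skip the efetch call
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")
    text = handle.read()
    handle.close()
    return text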
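
`fetch_abstracts` returns all retrieved records as one block of text, while the model accepts at most 512 positions per input, so the block probably needs to be split into individual records before scoring. A minimal sketch of one way to do that follows. It rests on an assumption about the layout of PubMed's text output: that every record starts on a line of the form `<n>. <citation>` preceded by a blank line. The function name `split_records` and the sample text are made up for illustration and are not real PubMed output.

```python
import re


def split_records(blob):
    """Split a PubMed text blob into one string per record.

    Assumption: a new record begins at a line starting with "<n>. " that
    follows at least one blank line. A numbered paragraph inside an abstract
    would be split off by mistake, so the result should be spot-checked.
    """
    parts = re.split(r"\n\s*\n(?=\d+\. \S)", blob.strip())
    return [part.strip() for part in parts if part.strip()]


# Synthetic stand-in for the text returned by fetch_abstracts.
sample_blob = """
1. J Example. 2020;1(1):1-2.

First synthetic title.

First synthetic abstract text.

PMID: 1


2. J Example. 2021;2(1):3-4.

Second synthetic title.

Second synthetic abstract text.

PMID: 2
"""

records = split_records(sample_blob)
```

Each element of `records` could then be passed to `compute_similarity` on its own, rather than scoring the whole block at once.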

# RPTEC

In [3]:
data_dir = "C:/Users/semde/Documents/BOO_Scripts/Data/RPTEC_TXG-MAPr"

In [13]:
df_eg_RPTEC = pd.read_csv(os.path.join(data_dir, "eg_joined_RPTEC.csv"))
df_drw_RPTEC = pd.read_csv(os.path.join(data_dir, "drw_joined_RPTEC.csv"))

## Eigengene scoring

### Preprocessing

In [10]:
df_eg_RPTEC = df_eg_RPTEC[["sample_id", "abs_eg_score", "module_number", "annotation"]]

In [12]:
df_eg_RPTEC = df_eg_RPTEC.sort_values(by="abs_eg_score", ascending=False)
print(df_eg_RPTEC.head())

                                          sample_id  abs_eg_score  \
2487   LU_HRPTECTERT1_SINGLE_ARISTOLOCHICACID_T3_C2      9.060121   
53575       LU_HRPTECTERT1_SINGLE_LEADACETATE_T3_C3      8.701595   
2547   LU_HRPTECTERT1_SINGLE_ARISTOLOCHICACID_T2_C3      8.459835   
515         LU_HRPTECTERT1_SINGLE_OCHRATOXINA_T2_C2      7.878635   
525         LU_HRPTECTERT1_SINGLE_OCHRATOXINA_T2_C3      7.812291   

       module_number                                         annotation  
2487              11  immune(immune, natural killer cell, lymphocyte...  
53575            264  metabolism(metabolism), rna processing(transcr...  
2547              11  immune(immune, natural killer cell, lymphocyte...  
515                3  metabolism(metabolism), mitochondria(mitochond...  
525                3  metabolism(metabolism), mitochondria(mitochond...  
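
In the sorted output above, module 11 and module 3 each appear twice among the first five rows, so taking the first k rows directly would not give k distinct modules. The sketch below shows one possible way to select the top-k distinct modules, using the five rows printed above as input. Whether duplicates should be collapsed like this is an assumption about the intended analysis, and `top_k_modules` is a hypothetical helper written in plain Python so that it runs without the CSV files.

```python
def top_k_modules(rows, k):
    """Return the k highest-scoring rows, keeping only the best row per module."""
    if k <= 0:
        return []
    ranked = sorted(rows, key=lambda row: row["abs_eg_score"], reverse=True)
    seen = set()
    top = []
    for row in ranked:
        if row["module_number"] in seen:
            continue  # a higher-scoring row for this module was already kept
        seen.add(row["module_number"])
        top.append(row)
        if len(top) == k:
            break
    return top


# The five rows shown in the output above (annotation column omitted).
rows = [
    {"sample_id": "LU_HRPTECTERT1_SINGLE_ARISTOLOCHICACID_T3_C2", "abs_eg_score": 9.060121, "module_number": 11},
    {"sample_id": "LU_HRPTECTERT1_SINGLE_LEADACETATE_T3_C3", "abs_eg_score": 8.701595, "module_number": 264},
    {"sample_id": "LU_HRPTECTERT1_SINGLE_ARISTOLOCHICACID_T2_C3", "abs_eg_score": 8.459835, "module_number": 11},
    {"sample_id": "LU_HRPTECTERT1_SINGLE_OCHRATOXINA_T2_C2", "abs_eg_score": 7.878635, "module_number": 3},
    {"sample_id": "LU_HRPTECTERT1_SINGLE_OCHRATOXINA_T2_C3", "abs_eg_score": 7.812291, "module_number": 3},
]

top_modules = [row["module_number"] for row in top_k_modules(rows, k=3)]
print(top_modules)  # [11, 264, 3]
```

On the real DataFrame, which is already sorted by `abs_eg_score`, the same selection could likely be written as `df_eg_RPTEC.drop_duplicates(subset="module_number").head(k)`, since `drop_duplicates` keeps the first occurrence by default.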


## Weighted Directed Random Walk (wDRW)

## Weighted Significant Directed Random Walk (s-wDRW)

# PHH

## Eigengene scoring

## Weighted Directed Random Walk (wDRW)

## Weighted Significant Directed Random Walk (s-wDRW)