# Create gene vectors from mutations and CNA data

Use co-occurrence statistics to create gene vectors.

[Stop Using word2vec](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/)

[PMI Word Vectors from Wikipedia](https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors)

[Improving Distributional Similarity with Lessons Learned from Word Embeddings](https://aclanthology.org/Q15-1016/) (LGD15)



# Create Gene Embeddings from Tumor Sample "Sentences" 

We are going to create gene vectors by treating genes like words in sentences.
We can then combine these to make tumor sample embeddings.
The descriptions below are copied from LGD15. In their work, they refer to "words and their contexts".
We will map these ideas to "genes and their contexts".
To create our gene embeddings we will:

* create gene-gene co-occurrence matrices. 
* calculate pointwise mutual information matrices. 
* reduce the dimensionality using singular value decomposition. 
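
The last step in the list above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the `nextgenlp` implementation: the helper name `svd_embed`, the toy matrix, and the default exponent are all made up for the example. Weighting the singular values by a power `p` follows the options discussed in LGD15 (0, 0.5, and 1).

```python
import numpy as np


def svd_embed(matrix, embedding_size, p=0.5):
    """Return `embedding_size`-dimensional row vectors via truncated SVD.

    Hypothetical helper: rows are computed as U_d * S_d**p.
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # np.linalg.svd returns singular values in descending order,
    # so the leading columns carry the most variance
    return u[:, :embedding_size] * s[:embedding_size] ** p


# Toy symmetric "PPMI-like" matrix for 4 genes (made-up values)
toy = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [1.2, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.9],
    [0.3, 0.0, 0.9, 0.0],
])
gene_vecs = svd_embed(toy, embedding_size=2)
print(gene_vecs.shape)  # (4, 2)
```

Each row of `gene_vecs` is one gene vector; in this notebook the target dimension would come from `EMBEDDING_SIZES`.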


# Pointwise Mutual Information Matrices (Notation from LGD15)


## Notation


We assume a collection of words $w \in V_W$ and their
contexts $c \in V_C$, where $V_W$ and $V_C$
are the word and context vocabularies, and denote
the collection of observed word-context pairs as $D$.

We use $\#(w,c)$ to denote the number of times the pair
$(w,c)$ appears in $D$ and $\#(w)$ and $\#(c)$ to denote 
the number of times $w$ and $c$ occurred in $D$, respectively.

$$
\begin{align}
\#(w) = \sum_{c^{\prime}} \#(w, c^{\prime})
, \quad
\#(c) = \sum_{w^{\prime}} \#(w^{\prime}, c)
, \quad
\lvert D \rvert = \sum_{w,c} \#(w, c)
\end{align}
$$


$$
\begin{align}
\hat{P}(w) = \frac{\#(w)}{\lvert D \rvert}
, \quad
\hat{P}(c) = \frac{\#(c)}{\lvert D \rvert}
, \quad
\hat{P}(w,c) = \frac{\#(w,c)}{\lvert D \rvert}
\end{align}
$$
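
As a small illustration of the notation, the counts and empirical probabilities can be computed from a list of observed pairs with `collections.Counter`. The pair list `D` below is made-up toy data, and the variable names are chosen only for this sketch.

```python
from collections import Counter

# Toy collection D of observed (word, context) pairs (made-up data)
D = [("a", "b"), ("a", "c"), ("b", "a"), ("a", "b")]

pair_counts = Counter(D)                    # #(w, c)
word_counts = Counter(w for w, _ in D)      # #(w) = sum over c' of #(w, c')
context_counts = Counter(c for _, c in D)   # #(c) = sum over w' of #(w', c)
size_D = sum(pair_counts.values())          # |D|

# Empirical probabilities: each count divided by |D|
p_w = {w: n / size_D for w, n in word_counts.items()}
p_c = {c: n / size_D for c, n in context_counts.items()}
p_wc = {wc: n / size_D for wc, n in pair_counts.items()}

print(pair_counts[("a", "b")], word_counts["a"], context_counts["b"], size_D)
```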


## Contexts

$D$ is commonly obtained by taking a
corpus $w_1, w_2, \ldots, w_n$ and defining the contexts
of word $w_i$ as the words surrounding it in an 
$L$-sized window $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$.

In our case, each tumor sample plays the role of a sentence: the words are the genes 
altered in that sample, and the contexts of a gene are the other genes that co-occur in the same sample.
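
A minimal sketch of this sample-level context definition, using only the standard library. The helper `count_gene_pairs` is hypothetical and is not the `nextgenlp` code. It also assumes that a gene appearing several times in one sample should count once for that sample, which the notebook itself does not state.

```python
import itertools
from collections import Counter


def count_gene_pairs(sample_genes):
    """Count ordered (gene, context gene) pairs that share a sample.

    Hypothetical helper: each sample's gene list is de-duplicated first
    (an assumption), and a gene is never its own context.
    """
    pair_counts = Counter()
    for genes in sample_genes:
        # permutations yields both (g1, g2) and (g2, g1), so counts are symmetric
        for g1, g2 in itertools.permutations(sorted(set(genes)), 2):
            pair_counts[(g1, g2)] += 1
    return pair_counts


# Toy "sentences": one list of mutated genes per sample (made-up data)
samples = [
    ["KRAS", "TP53", "APC"],
    ["KRAS", "TP53"],
    ["BRAF"],
]
pair_counts = count_gene_pairs(samples)
print(pair_counts[("KRAS", "TP53")])  # 2
```

A sample with a single gene, like the last one, contributes no pairs because it has no contexts.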


## Definitions

$$
\begin{align}
PMI(w, c) = 
\log \frac
{\hat{P}(w,c)}
{\hat{P}(w)\hat{P}(c)} =
\log \frac
{\#(w,c) \, \cdot \lvert D \rvert}
{\#(w) \cdot \#(c)}
\end{align}
$$

$$
\begin{align}
PPMI(w, c) = {\rm max} \left[ PMI(w, c), 0 \right]
\end{align}
$$
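
The two definitions translate directly into code. The sketch below uses made-up counts and hypothetical function names, and recomputes the marginals on every call for clarity rather than speed. Treating unobserved pairs as 0 in PPMI is a convention assumed here, since $\log 0$ is undefined.

```python
import math
from collections import Counter


def pmi(pair_counts, w, c):
    """PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) ); requires #(w,c) > 0."""
    size_D = sum(pair_counts.values())
    n_w = sum(n for (w2, _), n in pair_counts.items() if w2 == w)
    n_c = sum(n for (_, c2), n in pair_counts.items() if c2 == c)
    return math.log(pair_counts[(w, c)] * size_D / (n_w * n_c))


def ppmi(pair_counts, w, c):
    """PPMI(w, c) = max(PMI(w, c), 0); unobserved pairs map to 0 (assumption)."""
    if pair_counts[(w, c)] == 0:
        return 0.0
    return max(pmi(pair_counts, w, c), 0.0)


# Toy counts #(w, c) (made-up data)
toy_counts = Counter({
    ("w1", "c1"): 8, ("w1", "c2"): 1,
    ("w2", "c1"): 1, ("w2", "c2"): 8,
})
print(pmi(toy_counts, "w1", "c1"))   # positive: the pair is over-represented
print(pmi(toy_counts, "w1", "c2"))   # negative: the pair is under-represented
print(ppmi(toy_counts, "w1", "c2"))  # 0.0 after clipping
```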


## Context Distribution Smoothing

$$
\begin{align}
PMI_{\alpha}(w, c) = 
\log \frac
{\hat{P}(w,c)}
{\hat{P}(w)\hat{P}_{\alpha}(c)} = 
\log \frac
{\#(w,c) \cdot \sum_{c^{\prime}} \#(c^{\prime})^{\alpha}}
{\#(w) \cdot \#(c)^{\alpha}}
\end{align}
$$

$$
\begin{align}
\hat{P}_{\alpha}(c) = 
\frac
{\#(c)^{\alpha}}
{\sum_{c^{\prime}} \#(c^{\prime})^{\alpha}}
\end{align}
$$
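
A minimal sketch of the smoothed variant, again with made-up counts and a hypothetical function name. The default $\alpha = 0.75$ is the value used in LGD15. Setting $\alpha = 1$ recovers the unsmoothed PMI, because $\sum_{c^{\prime}} \#(c^{\prime}) = \lvert D \rvert$.

```python
import math
from collections import Counter


def smoothed_pmi(pair_counts, w, c, alpha=0.75):
    """PMI_alpha(w, c) with context counts raised to the power alpha."""
    n_w = sum(n for (w2, _), n in pair_counts.items() if w2 == w)
    context_counts = Counter()
    for (_, c2), n in pair_counts.items():
        context_counts[c2] += n
    # Normaliser of the smoothed context distribution: sum over c' of #(c')^alpha
    norm = sum(n ** alpha for n in context_counts.values())
    return math.log(
        pair_counts[(w, c)] * norm / (n_w * context_counts[c] ** alpha)
    )


# Toy counts with one frequent and one rare context (made-up data)
toy_counts = Counter({
    ("w1", "common"): 9,
    ("w2", "common"): 9,
    ("w2", "rare"): 1,
})
plain = smoothed_pmi(toy_counts, "w2", "rare", alpha=1.0)
smooth = smoothed_pmi(toy_counts, "w2", "rare", alpha=0.75)
print(plain, smooth)
```

In this toy example the smoothed score for the rare context is lower than the plain one, which is the intended effect: smoothing raises $\hat{P}_{\alpha}(c)$ for rare contexts and so damps their PMI.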

In [1]:
import itertools
import json
import math
import os
import pandas as pd

In [2]:
from nextgenlp import synapse 
from nextgenlp import genie
from nextgenlp import embedders
from nextgenlp.config import config

2022-11-02 02:01:54.238 | INFO     | nextgenlp.synapse:<module>:36 - SYNC_PATH=/home/galtay/data/hack4nf-2022/synapse
2022-11-02 02:01:54.239 | INFO     | nextgenlp.synapse:<module>:40 - SECRETS_PATH=/home/galtay/data/hack4nf-2022/secrets.json


In [3]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [4]:
pd.set_option('display.max_columns', 200)

In [5]:
#GENIE_VERSION = "genie-12.0-public"
GENIE_VERSION = "genie-13.3-consortium"

In [6]:
SYNC_PATH = config['Paths']['SYNAPSE_PATH']
EMBEDDINGS_PATH = config['Paths']['EMBEDDINGS_PATH']
print(SYNC_PATH)
print(EMBEDDINGS_PATH)

/home/galtay/data/hack4nf-2022/synapse
/home/galtay/data/hack4nf-2022/embeddings


In [7]:
MIN_UNIGRAM_COUNT = 10
EMBEDDING_SIZES = [25, 50, 100, 200, 300]

In [8]:
syn_file_paths = synapse.get_file_name_to_path(genie_version=GENIE_VERSION)
syn_file_paths

2022-11-02 02:01:56.926 | INFO     | nextgenlp.synapse:get_file_name_to_path:86 - genie_path=/home/galtay/data/hack4nf-2022/synapse/syn36709873


{'assay_information': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/assay_information_13.3-consortium.txt'),
 'data_clinical_patient': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_clinical_patient_13.3-consortium.txt'),
 'data_clinical_sample': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_clinical_sample_13.3-consortium.txt'),
 'data_fusions': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_fusions_13.3-consortium.txt'),
 'data_gene_matrix': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_gene_matrix_13.3-consortium.txt'),
 'data_mutations_extended': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_mutations_extended_13.3-consortium.txt'),
 'data_CNA': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/data_CNA_13.3-consortium.txt'),
 'data_cna_hg19': PosixPath('/home/galtay/data/hack4nf-2022/synapse/syn36709873/genie_private_data_cna_hg19_13.3-consortium.seg')}

# RAS Pathway data

In [9]:
df_ras = pd.read_excel(os.path.join(SYNC_PATH, '../nci-ras-initiative/ras-pathway-gene-names.xlsx'))

In [10]:
df_ras

| Unnamed: 0 | Gene name | Protein name from BioGPS | Alternative gene names from BioGPS |
|---|---|---|---|
| 0 | AKT1 | RAC-alpha serine/threonine-protein kinase | AKT, CWS6, PKB, PKB-ALPHA, PRKBA, RAC, RAC-ALPHA |
| 1 | AKT2 | RAC-beta serine/threonine-protein kinase | HIHGHH, PKBB, PKBBETA, PRKBB, RAC-BETA |
| 2 | AKT3 | RAC-gamma serine/threonine-protein kinase | MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-ga... |
| 3 | ALK | anaplastic lymphoma receptor tyrosine kinase; ... | CD246, NBLST3 |
| 4 | APAF1 | apoptotic peptidase activating factor 1 | APAF-1, CED4 |
| ... | ... | ... | ... |
| 222 | TSC2 | Tuberous Sclerosis 2, Tuberin | LAM, PPP1R160, TSC4 |
| 223 | TYMS | thymidylate synthetase | HST422, TMS, TS |
| 224 | UNG | uracil-DNA glycosylase | DGU, HIGM4, HIGM5, UDG, UNG1, UNG15, UNG2 |
| 225 | VAV1 | vav 1 guanine nucleotide exchange factor | VAV |


# GENIE Joined Mutation Data 

In [11]:
# Joined patient, sample, and mutation data; later cells refer to this frame as df_mut_all
df_mut_all = genie.read_pat_sam_mut(
    syn_file_paths["data_clinical_patient"],
    syn_file_paths["data_clinical_sample"],
    syn_file_paths["data_mutations_extended"],
)

# GENIE - Clinical Sample

In [None]:
df_dcs_all = genie.read_clinical_sample(syn_file_paths["data_clinical_sample"])
df_dcs_all['CENTER'] = df_dcs_all['SAMPLE_ID'].apply(lambda x: x.split('-')[1])

In [None]:
df_cen_all = df_dcs_all['CENTER'].value_counts().to_frame('count')
df_cen_all['frac'] = df_cen_all['count'] / df_cen_all['count'].sum()
df_cen_all

In [None]:
df_dcs_all[df_dcs_all['CENTER']=='MSK']['SEQ_ASSAY_ID'].value_counts()

# GENIE - Data CNA (Discrete Copy Number Alteration Data)

https://docs.cbioportal.org/file-formats/#discrete-copy-number-data

For each gene-sample combination, a copy number level is specified:

* "-2" is a deep loss, possibly a homozygous deletion.
* "-1" is a single-copy loss (heterozygous deletion).
* "0" is diploid.
* "1" indicates a low-level gain.
* "2" is a high-level amplification.

In [None]:
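df_cna_all = genie.read_cna(syn_file_paths['data_CNA'])
# Treat missing calls as diploid (0) and keep only the magnitude of each alteration,
# so losses (-2, -1) and gains (1, 2) are not distinguished downstream
df_cna_all = df_cna_all.fillna(0.0).abs()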
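
To make the effect of this preprocessing concrete, here is a small pandas sketch on a made-up CNA matrix. It assumes samples are rows and genes are columns, and it borrows the column names `hugo` and `dcna` from the cells further down. It is an illustration only and does not reproduce `genie.get_melted_cna`.

```python
import pandas as pd

# Toy discrete CNA matrix: rows are samples, columns are genes (made-up values)
df_toy = pd.DataFrame(
    {"KRAS": [2.0, 0.0], "TP53": [-2.0, None], "NF1": [None, -1.0]},
    index=pd.Index(["S1", "S2"], name="SAMPLE_ID"),
)

# Missing calls become 0 (diploid); the sign (loss vs. gain) is discarded
df_abs = df_toy.fillna(0.0).abs()

# Long format: one row per (sample, gene), keeping only altered genes
df_long = (
    df_abs.reset_index()
    .melt(id_vars="SAMPLE_ID", var_name="hugo", value_name="dcna")
    .query("dcna != 0")
)

# One list of (gene, weight) tokens per sample
tokens = {
    sample_id: list(zip(grp["hugo"], grp["dcna"]))
    for sample_id, grp in df_long.groupby("SAMPLE_ID")
}
print(tokens)
```

Here the deep loss of `TP53` and the amplification of `KRAS` in sample `S1` both end up with weight 2.0.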

# Subset for embedding

### MSK-IMPACT468

In [None]:
subset_name = "MSK-IMPACT468"
df_dcs = df_dcs_all[df_dcs_all['SEQ_ASSAY_ID']=='MSK-IMPACT468']

df_mut = df_mut_all[df_mut_all['SAMPLE_ID'].isin(df_dcs['SAMPLE_ID'])]
ser_mut_tokens = df_mut.groupby('SAMPLE_ID')['Hugo_Symbol'].apply(list)

df_cna = df_cna_all.loc[df_dcs['SAMPLE_ID']]
df_cna_melted = genie.get_melted_cna(df_cna, drop_nan=True, drop_zero=True)
ser_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(
    lambda x: list(zip(x['hugo'], x['dcna']))
)

In [None]:
df_mut.groupby('SAMPLE_ID').apply(lambda x: list(zip(x['Hugo_Symbol'], [1.0]*len(x['Hugo_Symbol']))))

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_mut = embedders.GeneMutationEmbeddings(
        ser_mut_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_mut.create_embeddings()
    embds_mut.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'dme_{subset_name}')

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_cna = embedders.GeneCnaEmbeddings(
        ser_cna_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_cna.create_embeddings()
    embds_cna.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'cna_{subset_name}')

### ALL

In [None]:
subset_name = "ALL"
df_dcs = df_dcs_all

df_mut = df_mut_all
ser_mut_tokens = df_mut.groupby('SAMPLE_ID')['Hugo_Symbol'].apply(list)

df_cna = df_cna_all
df_cna_melted = genie.get_melted_cna(df_cna, drop_nan=True, drop_zero=True)
ser_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(
    lambda x: list(zip(x['hugo'], x['dcna']))
)

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_mut = embedders.GeneMutationEmbeddings(
        ser_mut_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_mut.create_embeddings()
    embds_mut.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'dme_{subset_name}')

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_cna = embedders.GeneCnaEmbeddings(
        ser_cna_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_cna.create_embeddings()
    embds_cna.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'cna_{subset_name}')

In [None]:
df_v = pd.read_csv(
    os.path.join(EMBEDDINGS_PATH, 'dme_MSK-IMPACT468_gene_svd_200_vecs.tsv'), 
    sep='\t', 
    header=None,
)

In [None]:
df_v

In [None]:
df_m = pd.read_csv(os.path.join(EMBEDDINGS_PATH, 'dme_MSK-IMPACT468_sample_meta.tsv'), sep='\t')

In [None]:
df_m