# Create gene vectors from mutations and CNA data

Use co-occurrence statistics to create gene vectors.

https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/

https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors

https://aclanthology.org/Q15-1016/ (LGD15)



# Create CNA "Sentences" 

We create gene vectors by treating genes like words in sentences, 
where the list of altered genes in each sample acts as a sentence. 
First we build a gene-gene co-occurrence matrix. 
Then we compute a pointwise mutual information (PMI) matrix from it. 
Finally we reduce the dimensionality of that matrix to obtain dense vectors. 
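
The dimensionality-reduction step can be sketched as a truncated SVD of the (positive) PMI matrix. This is an assumption: the notebook delegates the real computation to `embedders`, whose code is not shown here, although the output file name `dme_MSK-IMPACT468_gene_svd_200_vecs.tsv` read at the end of the notebook suggests an SVD. The function name `svd_embed`, the toy matrix, and the square-root weighting of the singular values are illustrative choices only.

```python
import numpy as np


def svd_embed(pmi_matrix, embedding_size):
    """Return one `embedding_size`-dimensional vector per row of `pmi_matrix`."""
    u, s, _ = np.linalg.svd(pmi_matrix, full_matrices=False)
    # Keep the leading components and scale by sqrt(singular value).
    # The exponent 0.5 is one common choice, not necessarily what `embedders` uses.
    return u[:, :embedding_size] * np.sqrt(s[:embedding_size])


# Toy symmetric matrix for 4 hypothetical genes.
toy_pmi = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [1.2, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.9],
    [0.3, 0.0, 0.9, 0.0],
])
vecs = svd_embed(toy_pmi, embedding_size=2)
print(vecs.shape)
```

Each row of `vecs` is the dense vector for the gene in the corresponding row of the input matrix.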


# Pointwise Mutual Information Matrices (Notation from LGD15)


## Notation


We assume a collection of words $w \in V_W$ and their
contexts $c \in V_C$, where $V_W$ and $V_C$
are the word and context vocabularies, and denote
the collection of observed word-context pairs as $D$.

We use $\#(w,c)$ to denote the number of times the pair
$(w,c)$ appears in $D$ and $\#(w)$ and $\#(c)$ to denote 
the number of times $w$ and $c$ occurred in $D$, respectively.

$$
\begin{align}
\#(w) = \sum_{c^{\prime}} \#(w, c^{\prime})
, \quad
\#(c) = \sum_{w^{\prime}} \#(w^{\prime}, c)
, \quad
\lvert D \rvert = \sum_{w,c} \#(w, c)
\end{align}
$$


$$
\begin{align}
\hat{P}(w) = \frac{\#(w)}{\lvert D \rvert}
, \quad
\hat{P}(c) = \frac{\#(c)}{\lvert D \rvert}
, \quad
\hat{P}(w,c) = \frac{\#(w,c)}{\lvert D \rvert}
\end{align}
$$
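
As a small illustration of the notation, the counts and probability estimates above can be computed from a list of observed pairs with `collections.Counter`. The pair list `D` below is made-up toy data, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical collection D of observed (word, context) pairs.
D = [("KRAS", "TP53"), ("KRAS", "TP53"), ("KRAS", "NF1"), ("NF1", "TP53")]

pair_counts = Counter(D)                    # #(w, c)
word_counts = Counter(w for w, _ in D)      # #(w) = sum over c' of #(w, c')
context_counts = Counter(c for _, c in D)   # #(c) = sum over w' of #(w', c)
n_pairs = sum(pair_counts.values())         # |D|

# Empirical probability estimates.
p_w = {w: n / n_pairs for w, n in word_counts.items()}
p_c = {c: n / n_pairs for c, n in context_counts.items()}
p_wc = {pair: n / n_pairs for pair, n in pair_counts.items()}

print(p_w["KRAS"], p_c["TP53"], p_wc[("KRAS", "TP53")])
```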


## Contexts

$D$ is commonly obtained by taking a
corpus $w_1, w_2, \ldots, w_n$ and defining the contexts
of word $w_i$ as the words surrounding it in an 
$L$-sized window $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$.

In our case, the words are genes and the contexts of a gene are the 
other genes that it co-occurs with. 
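
A minimal sketch of how such gene-gene pairs could be counted, assuming that the whole sample is the context window and that a gene is counted at most once per sample. Both are assumptions; the actual logic lives in `embedders` and is not shown in this notebook. The helper `cooccurrence_counts` and the toy sentences are hypothetical.

```python
import itertools
from collections import Counter


def cooccurrence_counts(sentences):
    """Count ordered (gene, context gene) pairs over all samples."""
    counts = Counter()
    for genes in sentences:
        # Deduplicate so a gene altered twice in a sample is counted once.
        unique_genes = sorted(set(genes))
        # Ordered pairs make the resulting count matrix symmetric.
        for w, c in itertools.permutations(unique_genes, 2):
            counts[(w, c)] += 1
    return counts


toy_sentences = [
    ["KRAS", "TP53", "NF1"],
    ["KRAS", "TP53"],
    ["NF1"],  # a single-gene sample contributes no pairs
]
counts = cooccurrence_counts(toy_sentences)
print(counts[("KRAS", "TP53")], counts[("TP53", "KRAS")])
```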


## Definitions

$$
\begin{align}
PMI(w, c) = 
\log \frac
{\hat{P}(w,c)}
{\hat{P}(w)\hat{P}(c)} =
\log \frac
{\#(w,c) \, \cdot \lvert D \rvert}
{\#(w) \cdot \#(c)}
\end{align}
$$

$$
\begin{align}
PPMI(w, c) = {\rm max} \left[ PMI(w, c), 0 \right]
\end{align}
$$
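
The two definitions above can be sketched in plain Python as follows. The function name `ppmi` and the toy counts are hypothetical. Only observed pairs are stored: an unobserved pair has $\#(w,c) = 0$, so its PMI is $-\infty$ and its PPMI is 0, which a sparse representation can simply omit.

```python
import math
from collections import Counter


def ppmi(pair_counts):
    """Map each observed (w, c) pair to max(PMI(w, c), 0)."""
    word_counts, context_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        context_counts[c] += n
    total = sum(pair_counts.values())  # |D|

    scores = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[w] * context_counts[c]))
        scores[(w, c)] = max(pmi, 0.0)
    return scores


toy_counts = {("KRAS", "TP53"): 2, ("KRAS", "NF1"): 1, ("NF1", "TP53"): 1}
scores = ppmi(toy_counts)
print(scores)
```

In this toy example the pair `("KRAS", "TP53")` has a negative PMI, $\log(8/9)$, so its PPMI is clipped to 0.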


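## Context Distribution Smoothing

Context distribution smoothing raises each context count to a power
$\alpha$ (typically below 1) before normalizing, which reduces the
bias of PMI toward rare contexts.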

$$
\begin{align}
PMI_{\alpha}(w, c) = 
\log \frac
{\hat{P}(w,c)}
{\hat{P}(w)\hat{P}_{\alpha}(c)} = 
\log \frac
{\#(w,c) \cdot \sum_{c^{\prime}} \#(c^{\prime})^{\alpha}}
{\#(w) \cdot \#(c)^{\alpha}}
\end{align}
$$

$$
\begin{align}
\hat{P}_{\alpha}(c) = 
\frac
{\#(c)^{\alpha}}
{\sum_{c^{\prime}} \#(c^{\prime})^{\alpha}}
\end{align}
$$
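
A minimal sketch of the smoothed variant, using the same kind of toy counts. The function name `smoothed_pmi` is hypothetical, and the default `alpha=0.75` is the value used in LGD15, not necessarily the value used by `embedders`. Setting `alpha=1` recovers the unsmoothed PMI, because $\sum_{c^{\prime}} \#(c^{\prime}) = \lvert D \rvert$.

```python
import math


def smoothed_pmi(pair_counts, alpha=0.75):
    """PMI with context counts raised to the power `alpha`."""
    word_counts, context_counts = {}, {}
    for (w, c), n in pair_counts.items():
        word_counts[w] = word_counts.get(w, 0) + n
        context_counts[c] = context_counts.get(c, 0) + n
    # Normalizer of the smoothed context distribution.
    norm = sum(n ** alpha for n in context_counts.values())
    return {
        (w, c): math.log(n * norm / (word_counts[w] * context_counts[c] ** alpha))
        for (w, c), n in pair_counts.items()
    }


toy_counts = {("KRAS", "TP53"): 2, ("KRAS", "NF1"): 1, ("NF1", "TP53"): 1}
plain = smoothed_pmi(toy_counts, alpha=1.0)
smooth = smoothed_pmi(toy_counts, alpha=0.75)

# The rare context NF1 scores lower after smoothing.
print(plain[("KRAS", "NF1")] > smooth[("KRAS", "NF1")])
```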

In [None]:
import itertools
import json
import math
import os
import pandas as pd

In [None]:
from hack4nf import synapse 
from hack4nf import genie
from hack4nf import embedders

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
pd.set_option('display.max_columns', 200)

In [None]:
#genie_dataset_version = "genie-12.0-public"
genie_dataset_version = "genie-13.3-consortium"

In [None]:
SYNC_PATH = synapse.SYNC_PATH
print(SYNC_PATH)

In [None]:
EMBEDDINGS_PATH = os.path.join(SYNC_PATH, "../embeddings")
print(EMBEDDINGS_PATH)
MIN_UNIGRAM_COUNT = 10
EMBEDDING_SIZES = [50, 100, 200, 300, 400]

In [None]:
FILE_NAME_TO_PATH = synapse.get_file_name_to_path(sync_path=SYNC_PATH)
syn_file_paths = {
    'data_clinical_patient': FILE_NAME_TO_PATH[genie_dataset_version]['data_clinical_patient'],
    'data_clinical_sample': FILE_NAME_TO_PATH[genie_dataset_version]['data_clinical_sample'],
    'data_mutations_extended': FILE_NAME_TO_PATH[genie_dataset_version]['data_mutations_extended'],
    'data_CNA': FILE_NAME_TO_PATH[genie_dataset_version]['data_CNA'],
    'data_cna_hg19_seg': FILE_NAME_TO_PATH[genie_dataset_version]['data_cna_hg19'],
}
syn_file_paths

# RAS Pathway data

In [None]:
df_ras = pd.read_excel(os.path.join(SYNC_PATH, '../nci-ras-initiative/ras-pathway-gene-names.xlsx'))

In [None]:
df_ras

# GENIE Joined Mutation Data 

In [None]:
df_mut_all = genie.read_pat_sam_mut(
    syn_file_paths["data_clinical_patient"],
    syn_file_paths["data_clinical_sample"],
    syn_file_paths["data_mutations_extended"],
)

# GENIE - Clinical Sample

In [None]:
df_dcs_all = genie.read_clinical_sample(syn_file_paths["data_clinical_sample"])
# The second dash-separated field of SAMPLE_ID is used as the contributing center.
df_dcs_all['CENTER'] = df_dcs_all['SAMPLE_ID'].apply(lambda x: x.split('-')[1])

In [None]:
df_cen_all = df_dcs_all['CENTER'].value_counts().to_frame('count')
df_cen_all['frac'] = df_cen_all['count'] / df_cen_all['count'].sum()
df_cen_all

In [None]:
df_dcs_all[df_dcs_all['CENTER']=='MSK']['SEQ_ASSAY_ID'].value_counts()

# GENIE - Data CNA (Discrete Copy Number Alteration Data)

https://docs.cbioportal.org/file-formats/#discrete-copy-number-data

For each gene-sample combination, a copy number level is specified:

* "-2" is a deep loss, possibly a homozygous deletion
* "-1" is a single-copy loss (heterozygous deletion)
* "0" is diploid
* "1" indicates a low-level gain
* "2" is a high-level amplification.
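
To make the encoding concrete, here is a toy table with these levels and the effect of the `fillna(0.0).abs()` step that the notebook applies to the real data. The layout (samples as rows, genes as columns), the sample IDs, and the values are assumptions for illustration; the token construction only mimics the `(gene, level)` pairs built later and is not the `genie.get_melted_cna` implementation.

```python
import pandas as pd

# Hypothetical discrete CNA table: rows are samples, columns are genes.
toy_cna = pd.DataFrame(
    {"KRAS": [2.0, 0.0], "CDKN2A": [-2.0, None], "NF1": [0.0, -1.0]},
    index=pd.Index(["S1", "S2"], name="SAMPLE_ID"),
)

# Missing calls become 0 (diploid); abs() keeps only the magnitude,
# so a deep loss (-2) and a high-level amplification (2) both become 2.
magnitude = toy_cna.fillna(0.0).abs()

# One "sentence" per sample: (gene, level) pairs for non-zero calls.
tokens = {
    sample: [(gene, level) for gene, level in row.items() if level != 0]
    for sample, row in magnitude.iterrows()
}
print(tokens)
```

Note that after `abs()` the direction of the alteration is no longer distinguishable in the tokens.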

In [None]:
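df_cna_all = genie.read_cna(syn_file_paths['data_CNA'])
# Treat missing calls as diploid (0) and keep only the magnitude of each alteration,
# so losses (-1, -2) and gains (1, 2) of the same size map to the same value.
df_cna_all = df_cna_all.fillna(0.0).abs()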

# Subset for embedding

### MSK-IMPACT468

In [None]:
subset_name = "MSK-IMPACT468"
# Keep only samples sequenced with the assay named by subset_name.
df_dcs = df_dcs_all[df_dcs_all['SEQ_ASSAY_ID'] == subset_name]

# One "sentence" per sample: the list of mutated gene symbols.
df_mut = df_mut_all[df_mut_all['SAMPLE_ID'].isin(df_dcs['SAMPLE_ID'])]
ser_mut_tokens = df_mut.groupby('SAMPLE_ID')['Hugo_Symbol'].apply(list)

# One "sentence" per sample: (hugo, dcna) pairs for the non-zero CNA calls.
df_cna = df_cna_all.loc[df_dcs['SAMPLE_ID']]
df_cna_melted = genie.get_melted_cna(df_cna, drop_nan=True, drop_zero=True)
ser_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(
    lambda x: list(zip(x['hugo'], x['dcna']))
)

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_mut = embedders.GeneMutationEmbeddings(
        ser_mut_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_mut.create_embeddings()
    embds_mut.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'dme_{subset_name}')

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_cna = embedders.GeneCnaEmbeddings(
        ser_cna_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_cna.create_embeddings()
    embds_cna.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'cna_{subset_name}')

### ALL

In [None]:
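subset_name = "ALL"
# Use every sample, regardless of center or assay.
df_dcs = df_dcs_all

# One "sentence" per sample: the list of mutated gene symbols.
df_mut = df_mut_all
ser_mut_tokens = df_mut.groupby('SAMPLE_ID')['Hugo_Symbol'].apply(list)

# One "sentence" per sample: (hugo, dcna) pairs for the non-zero CNA calls.
df_cna = df_cna_all
df_cna_melted = genie.get_melted_cna(df_cna, drop_nan=True, drop_zero=True)
ser_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(
    lambda x: list(zip(x['hugo'], x['dcna']))
)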

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_mut = embedders.GeneMutationEmbeddings(
        ser_mut_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_mut.create_embeddings()
    embds_mut.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'dme_{subset_name}')

In [None]:
for embedding_size in EMBEDDING_SIZES:
    embds_cna = embedders.GeneCnaEmbeddings(
        ser_cna_tokens, 
        subset_name, 
        min_unigram_count=MIN_UNIGRAM_COUNT,
        embedding_size=embedding_size,
    )
    embds_cna.create_embeddings()
    embds_cna.write_projector_files(df_dcs, df_ras, EMBEDDINGS_PATH, f'cna_{subset_name}')

In [None]:
df_v = pd.read_csv(
    os.path.join(EMBEDDINGS_PATH, 'dme_MSK-IMPACT468_gene_svd_200_vecs.tsv'), 
    sep='\t', 
    header=None,
)

In [None]:
df_v

In [None]:
df_m = pd.read_csv(os.path.join(EMBEDDINGS_PATH, 'dme_MSK-IMPACT468_sample_meta.tsv'), sep='\t')

In [None]:
df_m