# Skip-Gram

This notebook provides functions to train a Skip-Gram model in order to learn patient representations from sequences of medical codes.

### References 

Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov et al (2013)


Learning Low-Dimensional Representations of Medical Concepts, Youngduck Choi (2016)


Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction, Edward Choi (2016)


Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework, Steiger (2023)


Doctor AI: Predicting Clinical Events via Recurrent Neural Networks, Edward Choi (2016)


#### Python environment 
- Python 3.10
- Required packages: PyTorch, NumPy, pandas and scikit-learn


#### Functionalities

- A function to prepare the data: Word2VecDataset
- A class for the model: Word2Vec
- A grid search function for the hyperparameters: Gridsearch_Train_Word2Vec
- A training function: Train_Word2Vec

Each of these is explained in the sections below.

Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

## Data Preparation Step

This function creates word pairs.

For each time sequence (i.e., each patient):

- It iterates over each code to find its neighbors. If window_size = 5, it takes up to 5 codes before and up to 5 codes after the target (fewer near the start or end of the sequence).
- It generates 'negative samples' for each target code: num_negative_samples sets how many are drawn.


This function returns a list of [target_word, context_word, label] triples, where the words are given by their vocabulary ids.

The label indicates whether the pair is a positive sample (a true neighbor, label = 1) or a negative one (a false neighbor, label = 0).

Example:
If window_size = 5 and num_negative_samples = 3, then one obtains:

[[451, 525, 1],

 [451, 817, 1],

 [451, 818, 1],

 [451, 579, 1],

 [451, 579, 1],

 [451, 530, 0],

 [451, 530, 0],
 
 [451, 659, 0],...]

Here 525, 817, 818 and 579 are true neighbors (label = 1) of the target code 451, while 530 and 659 are negative samples (label = 0).
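To make the windowing concrete, here is a minimal, framework-free sketch of how positive pairs arise for a toy sequence. The helper name `positive_pairs` and the toy codes are hypothetical, chosen only for illustration. The window is taken on both sides of the target, which is what the loop bounds in the function below do. Negative sampling is left out.

```python
def positive_pairs(sentence, window_size):
    """Return [target, context, 1] for every neighbor within window_size positions."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window_size)                  # clip at the start of the sequence
        stop = min(len(sentence), i + window_size + 1)   # clip at the end of the sequence
        for j in range(start, stop):
            if j != i:                                   # a code is not its own neighbor
                pairs.append([target, sentence[j], 1])
    return pairs


toy_sequence = ["A", "B", "C", "D"]  # hypothetical codes for one patient
pairs = positive_pairs(toy_sequence, window_size=1)
for pair in pairs:
    print(pair)
```

With window_size = 1, the codes at the two ends of the sequence have one neighbor each and the inner codes have two, so a sequence of 4 codes yields 6 positive pairs.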

In [2]:
def Word2VecDataset(corpus, vocab, window_size, num_negative_samples=0) :
    '''
    Parameters
    ----------
    corpus : list
        Code sequences per patient: [[Code1, Code2, Code3], [Patient 2]]
    vocab : dict
        Mapping from word (code) to integer id: {'word': id}
    window_size : int
        Size of the window (i.e., number of neighbors taken on each side of the target)
    num_negative_samples : int
        Number of negative samples to generate per target word. Default = 0

    Returns
    -------
    word_pairs : list
        Lists [target_word, context_word, label]. The label indicates whether the pair is a positive sample (a true neighbor) or a negative one (a false one).
    '''
    word_pairs = []
    for sentence in corpus:
        for i, target_word in enumerate(sentence):
            context_words = []
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if i != j:
                    context_word = sentence[j]
                    context_words.append(context_word)
                    word_pairs.append([vocab[target_word], vocab[context_word], 1])

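            # Negative samples: drawn uniformly (with replacement) from the vocabulary,
            # excluding the target and its true context words (assumed intent of the filter).
            context_set = set(context_words)
            candidates = [w for w in vocab if w != target_word and w not in context_set]
            if num_negative_samples > 0 and candidates:
                for k in np.random.randint(len(candidates), size=num_negative_samples):
                    word_pairs.append([vocab[target_word], vocab[candidates[k]], 0])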

    return word_pairs

## Word2Vec Class


This class constructs the Word2Vec model.

It takes two parameters:
- vocab_size (int): the number of unique words (i.e., codes)
- embedding_dim (int): the size of the embedding space


The forward function takes two tensors as input:
- target: the word on which Skip-Gram is applied
- context: the context word


It returns the score:

$$v(c_{target})^Tv(c_{context})$$

This is the dot product between the embedding (i.e., the representation) of the target word and that of its neighbor. Both embeddings are looked up in the same embedding table.
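As a small numeric illustration of the score, the sketch below computes the dot product of two vectors in plain Python and passes it through a sigmoid to read it as a probability of being a true neighbor. The vectors and the helper names `score` and `sigmoid` are hypothetical. The class below does the same computation on batches of tensors.

```python
import math


def score(target_vec, context_vec):
    """Dot product v(target)^T v(context) of two embedding vectors."""
    return sum(t * c for t, c in zip(target_vec, context_vec))


def sigmoid(x):
    """Map a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


v_target = [1.0, 0.0, 2.0]       # hypothetical embedding of the target code
v_neighbor = [0.5, 1.0, 1.0]     # hypothetical embedding pointing in a similar direction
v_unrelated = [-1.0, 3.0, -0.5]  # hypothetical embedding pointing elsewhere

s_pos = score(v_target, v_neighbor)    # 0.5 + 0.0 + 2.0
s_neg = score(v_target, v_unrelated)   # -1.0 + 0.0 - 1.0
print(s_pos, sigmoid(s_pos))
print(s_neg, sigmoid(s_neg))
```

A large positive score maps to a probability close to 1 and a large negative score to a probability close to 0, which is what training with labels 1 and 0 pushes the embeddings towards.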

In [3]:
class Word2Vec(nn.Module):
    '''
    Parameters
    ----------
    vocab_size : int
        Number of unique words (ie. codes)
    embedding_dim : int
        Size of the embedding space.
    '''
    
    def __init__(self, vocab_size, embedding_dim):

        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        super(Word2Vec, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim).to(device)
    
    def forward(self, target, context):
        '''
        Inputs
        ------
        target : tensor
            The word on which one applies Word2Vec.
        context : tensor
            The context word.

        Returns
        -------
        dot_product : tensor
            Score.
        '''
        target_embeds = self.embeddings(target)
        context_embeds = self.embeddings(context)
        dot_product = torch.sum(target_embeds * context_embeds, dim=1, keepdim=True) 
        return dot_product


## Training of the model

This function is used to train the model.

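It takes 6 parameters:

- data_loader (DataLoader): batches of (target, context, label)
- vocab (dict): mapping from each unique word in the corpus to its id
- embedding_dim (int): the dimension of the embedding
- nb_epoch (int): the number of epochs (passes over the training data)
- learning_rate (float): the step size of the optimizer
- verbose (bool): whether to print the loss at each epoch


Some notes regarding training:
- Batching: the batch size is set when building the DataLoader passed to the function
- Loss: Binary Cross-Entropy (BCE), computed from the raw scores with BCEWithLogitsLoss
- Optimizer: Adam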
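The training code below uses `nn.BCEWithLogitsLoss`, which applies a sigmoid to the raw score $s$ and then the binary cross-entropy with label $y$, averaged over the batch by default:

$$\ell(s, y) = -\left[y \log \sigma(s) + (1 - y) \log(1 - \sigma(s))\right]$$

The plain-Python sketch below evaluates this per-sample loss on a few hypothetical (score, label) pairs. The function names are illustrative, and the numerically stable form used in `bce_with_logits` is a standard rewriting of the formula above, not code taken from PyTorch.

```python
import math


def bce_with_logits(score, label):
    """Stable form of the per-sample loss: max(s, 0) - s*y + log(1 + exp(-|s|))."""
    return max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))


def naive_bce(score, label):
    """Direct translation of the formula: sigmoid first, then cross-entropy."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical (score, label) pairs: confident and right, confident and wrong.
examples = [(2.5, 1), (2.5, 0), (-2.0, 0), (-2.0, 1)]
for s, y in examples:
    print(f"score={s:+.1f} label={y} loss={bce_with_logits(s, y):.4f}")
```

The loss is small when the sign of the score agrees with the label and grows roughly linearly with the score when it does not. Since each batch loss is already a mean, the total loss printed per epoch is a sum of batch means.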

In [4]:
def Train_Word2Vec(data_loader, vocab, embedding_dim, nb_epoch, learning_rate, verbose=True) :
    '''
    Parameters
    ----------
    data_loader : torch.utils.data.DataLoader
        Batches of (target, context, label)
    vocab : dict
        Mapping from word (code) to integer id: {'word': id}
    embedding_dim : int
        Size of the embedding.
    nb_epoch : int
        The number of epochs.
    learning_rate : float
        The learning rate of the Adam optimizer.
    verbose : boolean
        If True, the loss per epoch is printed. Default = True

    Returns
    -------
    model : Word2Vec
        The trained model.
    '''
    
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
    model = Word2Vec(len(vocab), embedding_dim)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(nb_epoch):
        total_loss = 0
        loss_batch = []
        for target, context, label in data_loader:
            target, context, label = target.to(device), context.to(device), label.to(device)
            optimizer.zero_grad()
            output = model(target, context)
            label_tensor = label.clone().detach().view(label.size()[0], 1).to(dtype=torch.float)
            loss = criterion(output, label_tensor)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            loss_batch.append(loss.detach())  # detach so the autograd graph is not kept alive
        
        avg_loss = torch.stack(loss_batch).mean().item()

        if verbose:
            print('Epoch {}: Total Loss = {:.4f} - Average Loss : {:.4f}'.format(epoch, total_loss, avg_loss))
    
    return model

## Gridsearch of hyperparameters

This function optimizes the model hyperparameters.

The hyperparameters tested are:

- window_size: the size of the window (i.e., the number of neighbors considered on each side of the target)
- embedding_dim: the size of the embedding
- num_negative_samples: the number of negative samples generated per target
- batch_size: the number of (target, context, label) samples per batch


Here are the different steps of the function:

1. Create the dataset based on
    - window_size
    - num_negative_samples

2. Split the word pairs into a train (80%) and a test (20%) sample:
    - Train: for model learning
    - Test: for evaluating the loss on held-out pairs

3. Create dataloaders based on batch_size

4. Train the model

5. Compute the loss on the test sample

6. Store, for each set of hyperparameters, the test loss in a dataframe

7. Return a dataframe with the following columns:
    - window_size
    - num_negative_samples
    - batch_size
    - embedding_dim
    - Total_Loss: the sum of the batch losses on the test sample
    - Avg_Loss: the average loss per batch on the test sample
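To get a feel for the cost of the search, the sketch below enumerates a grid with `itertools.product`. The candidate values are hypothetical and only serve the illustration. The function below uses nested loops instead, which lets it rebuild the dataset only when window_size or num_negative_samples changes.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
window_size_list = [2, 5]
num_negative_samples_list = [3, 5]
batch_size_list = [64, 128]
embedding_dim_list = [16, 32, 64]

# One model is trained per combination of the four hyperparameters.
grid = list(product(window_size_list,
                    num_negative_samples_list,
                    batch_size_list,
                    embedding_dim_list))

# The word pairs depend only on the first two hyperparameters.
n_datasets = len(window_size_list) * len(num_negative_samples_list)

print("models to train:", len(grid))
print("datasets to build:", n_datasets)
```

With these values the search trains 24 models on 4 different datasets, each for nb_epoch epochs, so the lists are best kept short on large corpora.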

In [5]:
def Gridsearch_Train_Word2Vec(corpus, vocab, nb_epoch, learning_rate, window_size_list, num_negative_samples_list, batch_size_list, embedding_dim_list):
    '''
    Parameters
    ----------
    corpus : list
        Code sequences per patient: [[Code1, Code2, Code3], [Patient 2]]
    vocab : dict
        Mapping from word (code) to integer id: {'word': id}
    nb_epoch : int
        The number of epochs.
    learning_rate : float
        The learning rate of the Adam optimizer.
    window_size_list : list
        List of window sizes (i.e., numbers of neighbors) to test.
    num_negative_samples_list : list
        List of the numbers of negative samples to test.
    batch_size_list : list
        List of the numbers of samples per batch to test.
    embedding_dim_list : list
        List of the embedding dimensions to test.
    
    Returns
    -------
    df_hyperparameters : DataFrame 
        DataFrame with the total loss and average loss per batch on the held-out test set for all combinations of hyperparameters tested.
    '''

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
    df_hyperparameters = pd.DataFrame(columns=['window_size', 'num_negative_samples', 'batch_size', 'embedding_dim', 'Total_Loss', 'Avg_Loss'])

    i=0
    
    for window_size in window_size_list :

        for num_negative_samples in num_negative_samples_list :

            # Data Preparation
            word_pairs = Word2VecDataset(corpus, vocab, window_size, num_negative_samples=num_negative_samples)

            # Train / Test Split
            word_pairs_train, word_pairs_test= train_test_split(word_pairs, test_size=0.2, random_state=42)

            # Dataloader constructions
            for batch_size in batch_size_list :
                dataloader_train, dataloader_test = DataLoader(word_pairs_train, batch_size=batch_size, shuffle=True), DataLoader(word_pairs_test, batch_size=batch_size, shuffle=True)
                        
                # Training of the model on the training set
                for embedding_dim in embedding_dim_list :
                    print('Window_size : {} - Num_neg_samples : {} - Batch_size : {} - Emb_size : {}'.format(window_size, num_negative_samples, batch_size, embedding_dim))
                    model = Train_Word2Vec(dataloader_train, vocab, embedding_dim, nb_epoch, learning_rate, verbose=False)

                    # Evaluation on the held-out test set (no gradients needed)
                    criterion = nn.BCEWithLogitsLoss()
                    total_loss = 0
                    loss_batch_test = []
                    with torch.no_grad():
                        for target, context, label in dataloader_test:
                            target, context, label = target.to(device), context.to(device), label.to(device)
                            output = model(target, context)
                            label_tensor = label.clone().detach().view(label.size()[0], 1).to(dtype=torch.float)
                            loss = criterion(output, label_tensor)
                            total_loss += loss.item()
                            loss_batch_test.append(loss)

                    avg_loss_test = torch.stack(loss_batch_test).mean().item()

                    # Save the loss in the final dataframe
                    df_hyperparameters.loc[i] = [window_size, num_negative_samples, batch_size, embedding_dim, total_loss, avg_loss_test]
                    
                    i+=1

    return df_hyperparameters