# Embedding Analysis

In this notebook we explore how GloVe embeddings can be used to build a noisy training set. Given a set of `BIO`-encoded sequences, we extract the positively labeled words and phrases, analyze them in the embedding space, and use that analysis to augment the set of positive labels.

## Algorithm

```
glove_embeddings <- load_glove()

gather positive labeled words (both word and phrase level)

augment set of positive labeled words
    - kNN with FAISS
    - logistic regression / SVM kernels for hyperplane
    - linear / affine transforms

Analyze augmented set
```

### Resources

- FAISS (Facebook AI Similarity Search) for kNN-style search with
    - L2 distance
    - Cosine similarity
- scikit-learn: logistic regression
- scikit-learn: SVM kernels
- PyTorch: linear transforms
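
To make the kNN step concrete, here is a minimal brute-force sketch of cosine-similarity search in plain NumPy. It is a stand-in for FAISS, not the FAISS API: the function name `knn_cosine` and the toy 2-d vectors are made up for illustration. For a real GloVe vocabulary, an inner-product FAISS index over L2-normalized vectors should give the same ranking far more efficiently.

```python
import numpy as np


def knn_cosine(query, vocab, matrix, k=3):
    """Return the k (word, score) pairs most cosine-similar to `query`."""
    # Normalize the rows and the query so a dot product equals cosine similarity.
    unit_matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_matrix @ unit_query
    # argsort is ascending, so negate the scores to get the largest first.
    top = np.argsort(-scores)[:k]
    return [(vocab[i], float(scores[i])) for i in top]


# Toy 2-d "embeddings" (hypothetical values, not real GloVe vectors).
toy_vocab = ["headache", "migraine", "nausea", "table"]
toy_matrix = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.5, 0.5],
    [-1.0, 0.3],
])

neighbours = knn_cosine(toy_matrix[0], toy_vocab, toy_matrix, k=2)
```

Querying with a word's own vector returns that word first, so when augmenting the positive set the seed words themselves would need to be filtered out of the neighbours.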

In [27]:
from typing import (
    Dict,
    List,
    Tuple,
    Callable,
    Optional,
)

import os
import sys
from tqdm.notebook import tqdm  # `tqdm_notebook` is deprecated in recent tqdm releases

import torch
import altair as alt
import pandas as pd
import numpy as np
import nltk
import spacy

import allennlp

# local imports
import dpd
from dpd.dataset.bio_dataset import BIODataset
from dpd.utils import (
    remove_bio
)

from dpd.constants import (
    CONLL2003_TRAIN,
    CONLL2003_VALID,
    CADEC_TRAIN,
    CADEC_VALID,
)

In [3]:
# Some generic constants to help throughout
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))

GLOVE_EMBEDDING_DIR = 'data/glove.6B'
GLOVE_DIMS = [50, 100, 200, 300]

## Loading Embeddings
In this section we load the GloVe word embeddings from disk.

Input: `file_path: str`

Output: `Dict[str, np.ndarray]`, a mapping from each word to its embedding

In [7]:
# Useful TypeDefs
EmbeddingType = np.ndarray
EmbeddingSpaceType = Dict[str, EmbeddingType]

In [8]:
def get_glove_dim_file(dims: int, include_base=True) -> str:
    '''
    Given a number of dimensions, return the associated GLOVE embedding file
    
    Input: ``dims`` int, the number of dims
            ``include_base``, should include the full file path or just the file name
    Output: ``file_name`` str, the name of the associated file
    
    raises: Exception if number of dims is not available
    '''
    if dims not in GLOVE_DIMS:
        raise Exception(f'Unknown dims: {dims} only have {GLOVE_DIMS}')
    
    glove_file = f'glove.6B.{dims}d.txt'
    if include_base:
        glove_file = os.path.join(GLOVE_EMBEDDING_DIR, glove_file)
    return glove_file

def load_glove(dims: int) -> EmbeddingSpaceType:
    '''
    Given a number of dimensions load the embedding space for the associated GLOVE embedding
    
    Input: ``dims``: int the number of dimensions to use
    
    Output: ``EmbeddingSpace`` EmbeddingSpaceType, the entire embedding space embedded in the file
    '''
    glove_file = get_glove_dim_file(dims, include_base=True)
    # GloVe files are UTF-8 text: one token per line followed by its vector
    with open(glove_file, 'r', encoding='utf-8') as f:
        embedding_space = {}
        for line in tqdm(f):
            split_line = line.split()
            word = split_line[0]
            embedding = np.array([float(val) for val in split_line[1:]])
            embedding_space[word] = embedding
    return embedding_space

In [9]:
glove_embeddings = load_glove(300)
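
For the later kNN and hyperplane steps it is usually more convenient to hold the embeddings as one matrix plus a parallel vocabulary list rather than as a dictionary. The sketch below shows one way to do that conversion. The helper name `to_matrix` and the toy vectors are hypothetical, and the cast to `float32` assumes the matrix will be handed to FAISS, which works on `float32` arrays.

```python
from typing import Dict, List, Tuple

import numpy as np


def to_matrix(embedding_space: Dict[str, np.ndarray]) -> Tuple[List[str], np.ndarray]:
    """Stack a word -> vector dict into (vocab, matrix); row i belongs to vocab[i]."""
    vocab = list(embedding_space)
    matrix = np.stack([embedding_space[word] for word in vocab]).astype(np.float32)
    return vocab, matrix


# Toy 3-d embedding space (hypothetical values, not real GloVe vectors).
toy_space = {
    "pain": np.array([0.1, 0.2, 0.3]),
    "ache": np.array([0.0, 0.5, 0.1]),
}
space_vocab, space_matrix = to_matrix(toy_space)
```

With the real data this would be called as `to_matrix(glove_embeddings)`, giving a matrix with one row per vocabulary word and 300 columns.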


## Loading Data
We use the dataset readers we implemented to load the CADEC dataset. We focus on the `ADR` tag, but the loading code should generalize, so we implement a set of functions that load the data given:

- Dataset type (`CADEC`, `CONLL`)
- Dataset class (e.g. `ADR`, `PER`)

In [23]:
def get_dataset_files(dataset_type: str) -> Tuple[str, str]:
    if dataset_type == 'CONLL':
        return CONLL2003_TRAIN, CONLL2003_VALID
    elif dataset_type == 'CADEC':
        return CADEC_TRAIN, CADEC_VALID
    else:
        raise Exception(f'Unknown dataset: {dataset_type}')

def load_data(
    dataset_type: str,
    binary_class: Optional[str] = None,
) -> Tuple[BIODataset, BIODataset]:
    '''
    Load BIODataset for a given dataset type with the binary
    class if specified
    '''
    train_file, valid_file = get_dataset_files(dataset_type)

    train_dataset = BIODataset(
        dataset_id=0,
        file_name=train_file,
        binary_class=binary_class,
    )
    
    train_dataset.parse_file()
    
    valid_dataset = BIODataset(
        dataset_id=1,
        file_name=valid_file,
        binary_class=binary_class, 
    )
    
    valid_dataset.parse_file()
    
    return train_dataset, valid_dataset

train_data, valid_data = load_data('CADEC', 'ADR')

96867it [00:00, 264653.12it/s]
24143it [00:00, 337541.35it/s]


## Process BIO Data

Now that the data is loaded, we process the dataset to build two dictionaries of labeled text:

1. Word level
2. Phrase level

In [28]:
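def get_words(sentence: List[str], tags: List[str]) -> Dict[str, List[str]]:
    '''
    Group the non-stop-word tokens of a sentence by their tag class
    (the tag with its BIO prefix removed)
    '''
    output: Dict[str, List[str]] = {}
    for word, tag in zip(sentence, tags):
        # STOP_WORDS is lowercase, so compare case-insensitively
        if word.lower() in STOP_WORDS:
            continue
        r_tag = remove_bio(tag)
        # append rather than overwrite so every word for a tag is kept
        output.setdefault(r_tag, []).append(word)
    return output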

def get_phrase(sentence: List[str], tags: List[str]) -> List[str]:
    pass
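
`get_phrase` is left unimplemented above. As a rough sketch of what phrase-level extraction could look like, the function below groups consecutive `B-`/`I-` tokens into phrases keyed by tag class. It is an assumption about the intended behaviour, not the notebook's implementation: the name `extract_phrases`, the dictionary return type, and the assumption that tags look like `B-ADR` / `I-ADR` / `O` are all illustrative.

```python
from typing import Dict, List


def extract_phrases(sentence: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group consecutive B-/I- tokens into space-joined phrases, keyed by tag class."""
    phrases: Dict[str, List[str]] = {}
    current: List[str] = []
    current_class = None
    # A trailing "O" sentinel flushes a phrase that ends the sentence.
    for word, tag in zip(sentence + [""], tags + ["O"]):
        prefix, _, tag_class = tag.partition("-")
        if prefix == "I" and current and tag_class == current_class:
            # Continuation of the phrase that is currently open.
            current.append(word)
            continue
        if current:
            phrases.setdefault(current_class, []).append(" ".join(current))
        if prefix in ("B", "I"):
            # "B" opens a phrase; an orphaned "I" is leniently treated as a start.
            current, current_class = [word], tag_class
        else:
            current, current_class = [], None
    return phrases


example_sentence = ["I", "had", "severe", "stomach", "pain", "and", "nausea"]
example_tags = ["O", "O", "B-ADR", "I-ADR", "I-ADR", "O", "B-ADR"]
example_phrases = extract_phrases(example_sentence, example_tags)
```

Unlike `get_words`, this sketch keeps stop words inside a phrase, since dropping them would change the phrase text.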