# Embedding Analysis

In this notebook we explore how glove embeddings can be used to create a noisy set for training. In particular given a set of `BIO` encoded sequences, extract the positive labels, and use them to do some analysis in the embedding space/augment the set of positive labels.

## Algorithm

```
glove_embeddings <- load_glove()

gather positive labeled words (both word and phrase level)

augment set of positive labeled words
    - kNN with FAISS
    - logistic regression / SVM kernels for hyperplane
    - linear transforms/Affine Transforms

Analyze augmented set
```

### Resources

- FAISS (Facebook AI Similarity Search) for kNN style search with
    - L2 distance
    - Cosine Simlarity
- Scikit Learn: Logistic Regression
- Scikit Learn: SVM Kernels
- PyTorch: Linear Transforms

In [54]:
from typing import (
    Dict,
    List,
    Tuple,
    Callable,
    Optional,
)
from enum import Enum


import os
import sys
import random
from tqdm import tqdm
from collections import Counter

import torch
import faiss
import altair as alt
import pandas as pd
import numpy as np
import nltk
import spacy

import allennlp

# local imports
import dpd
from dpd.dataset.bio_dataset import BIODataset
from dpd.utils import (
    remove_bio,
    explain_labels,
    get_words,
)

from dpd.constants import (
    CONLL2003_TRAIN,
    CONLL2003_VALID,
    CADEC_TRAIN,
    CADEC_VALID,
)

In [2]:
# Some generic constants to help throughout
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))

GLOVE_EMBEDDING_DIR = 'data/glove.6B'
GLOVE_DIMS = [50, 100, 200, 300]

## Loading Embeddings
In this section we look to load the GLOVE word embeddings.

Input: `file_path: str`

Output: `Dict[str, np.ndarray]` word to embedding

In [3]:
# Useful TypeDefs
EmbeddingType = np.ndarray
EmbeddingSpaceType = Dict[str, EmbeddingType]

In [4]:
def get_glove_dim_file(dims: int, include_base=True) -> str:
    '''
    Given a number of dimensions, return the associated GLOVE embedding file
    
    Input: ``dims`` int, the number of dims
            ``include_base``, should include the full file path or just the file name
    Output: ``file_name`` str, the name of the associated file
    
    raises: Exception if number of dims is not available
    '''
    if dims not in GLOVE_DIMS:
        raise Exception(f'Unknown dims: {dims} only have {GLOVE_DIMS}')
    
    glove_file = f'glove.6B.{dims}d.txt'
    if include_base:
        glove_file = os.path.join(GLOVE_EMBEDDING_DIR, glove_file)
    return glove_file

def load_glove(dims: int) -> EmbeddingSpaceType:
    '''
    Given a number of dimensions load the embedding space for the associated GLOVE embedding
    
    Input: ``dims``: int the number of dimensions to use
    
    Output: ``EmbeddingSpace`` EmbeddingSpaceType, the entire embedding space embedded in the file
    '''
    glove_file = get_glove_dim_file(dims, include_base=True)
    with open(glove_file, 'r') as f:
        embedding_space = {}
        for line in tqdm(f):
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            embedding_space[word] = embedding
    return embedding_space

In [57]:
glove_embedding_dim = 300
glove_embeddings = load_glove(glove_embedding_dim)

400000it [00:44, 8928.75it/s]


## Loading Data
We will use our dataset readers that we implemented and load the CADEC dataset
In particular we will look at the ADR tag, however this should be generalized, so
we implement a set of functions to load our data given

- DataSet Type (CADEC, CONLL)
- Dataset Class (e.g. `ADR`, `PER`)

In [19]:
def get_dataset_files(dataset_type: str) -> Tuple[str, str]:
    if dataset_type == 'CONLL':
        return CONLL2003_TRAIN, CONLL2003_VALID
    elif dataset_type == 'CADEC':
        return CADEC_TRAIN, CADEC_VALID
    else:
        raise Exception(f'Unknown dataset: {dataset_type}')

def load_data(
    dataset_type: str,
    binary_class: Optional[str] = None,
) -> Tuple[BIODataset, BIODataset]:
    '''
    Load BIODataset for a given dataset type with the binary
    class if specified
    '''
    train_file, valid_file = get_dataset_files(dataset_type)

    train_dataset = BIODataset(
        dataset_id=0,
        file_name=train_file,
        binary_class=binary_class,
    )
    
    train_dataset.parse_file()
    
    valid_dataset = BIODataset(
        dataset_id=1,
        file_name=valid_file,
        binary_class=binary_class, 
    )
    
    valid_dataset.parse_file()
    
    return train_dataset, valid_dataset

train_data, valid_data = load_data('CADEC', 'ADR')

96867it [00:00, 343559.76it/s]
24143it [00:00, 360596.54it/s]


## Process BIO Data

Now that we have loaded all the proper data, we need to process the dataset, and in particular create two dictionaries

1. Word level
2. Phrase level

In [45]:
def bio_random_sample(data: BIODataset, sample_size: int) -> BIODataset:
    '''
    Genereate a random sample of the data
    '''
    bio_data = list(data.data)
    bio_dataset = BIODataset(
        dataset_id=2,
        file_name='temp.txt',
        binary_class=data.binary_class,
    )

    def _random_sample_array(array: List[object], size: int):
        for i in range(len(array)):
            ind = random.randint(0, len(array) - 1)
            temp = array[i]
            array[i] = array[ind]
            array[ind] = temp

        return array[:size]
    
    bio_dataset.data = bio_data[:sample_size] #_random_sample_array(bio_data, sample_size)
    
    return bio_dataset
train_data_sample = bio_random_sample(train_data, 50)

In [48]:
word_counter: Dict[str, Counter] = {'pos': Counter(), 'neg': Counter()}
phrase_counter: Counter = Counter()
for entry in train_data_sample:
    sentence, tags = entry['input'], entry['output']
    pos_words = get_words(sentence, tags, train_data_sample.binary_class)
    
    # get the negative words
    neg_words = get_words(sentence, tags, 'O')

    pos_ranges, pos_phrases = explain_labels(sentence, tags)
    for w in pos_words:
        word_counter['pos'][w] += 1

    for w in neg_words:
        word_counter['neg'][w] += 1

    for phrase in pos_phrases:
        phrase_counter[tuple(phrase)] += 1

# Analysis

Now we have everything setup

- `glove_embeddings`: this contains a map of word to numpy array (`np.ndarray`) of the glove embedding
- `train_data_sample`: a random sample of the training data
- `word_counter`: pos word to count and neg word to count
- `phrase_counter`: count of tuples of the most popular entities

Now we can use this data to determine how to best augment our dictionaries and define functions seperating them

### k-Nearest Neighbors

First lets look at how we can augment this dictionary through the embedding space with kNN.

- `L2 Distance` find the K nearest neighbors through the L2 distance
- `Cosine Similarity` find the K nearest neighbors through the cosine similarity

Algorithm:

1. Build Index
    - build a `FAISS` index
    - map word to index in glove
    - map index to word in glove
2. Search Index
    - build a `np.ndarray` query
3. Experiment with Hyper Parameters
    - different algorithms listed above
    - experiment with various values of k

#### Hypothesis

While visualizing the embedding space an help, our thought is that there may be different hyper planes that are closer than we anticipate, something like logisitc regression, SVM, or affine transforms might be better.

In [185]:
class SimilarityAlgorithm(Enum):
    L2Distance = 1
    CosineSimilarity = 2

class WordEmbeddingIndex(object):
    '''
    Build a FAISS index for an EmbeddingSpaceType
    object
    
    Its a nice wrapper around the faiss index, to allow easily searching
    and converting vectors to words and vice versa
    '''
    def __init__(
        self,
        embedding_space: EmbeddingSpaceType,
        embedding_space_dims: int,
        similarity_algorithm: SimilarityAlgorithm,
    ):
        self.embedding_space = embedding_space
        self.embedding_space_dims = embedding_space_dims
        self.similarity_algorithm = SimilarityAlgorithm
        self.index_np, self.word_to_index, self.index_to_word = (
            WordEmbeddingIndex.build_index(
                embedding_space,
                embedding_space_dims,
            )
        )
        
        # for FAISS we need float32 instead of float64
        self.index_np = self.index_np.astype('float32')
        
        self.faiss_index = faiss.IndexFlatIP(embedding_space_dims)
        if similarity_algorithm == SimilarityAlgorithm.CosineSimilarity:
            # normalize with L2 as a proxy for cosine search
            faiss.normalize_L2(self.index_np)
        self.faiss_index.add(self.index_np)
    
    def find_similar(
        self,
        query_np: np.ndarray,
        k: int,
        remove_first_row: bool = True,
    ) -> Counter:
        '''
        given a query retreive similar words
        
        input:
            - ``query_np`` np.ndarray
                The query to search the embedding space for
        output:
            - ``Counter``
                The count of kNN results
        '''
        query_np = query_np.astype('float32')
        distances, indexes = self.faiss_index.search(query_np, k)
        
        if remove_first_row:
            first_row = indexes[:, 0]
#             assert (self.index_np[embedding_indicies[0]] - self.index_np[first_row[0]]).sum() == 0
#             assert (query_np[0] - self.index_np[first_row[0]]).sum() == 0
            similar_words_i = indexes[:, 0:]
        else:
            similar_words_i = indexes
        
        similar_words_i = similar_words_i.flatten()
        
        similar_words = Counter()
        for word_index in similar_words_i:
            similar_words[self.index_to_word[word_index]] += 1
        return similar_words
    
    def _phrase_embedding(
        self,
        phrase: List[str],
    ) -> np.ndarray:
        '''
        compute an embedding representation given a list of words
        input:
            - ``phrase`` List[str]
                
        output:
            - ``np.ndarray``
                the embedding for the phrase
        '''
        phrase_embedding = np.zeros((self.embedding_space_dims,))
        for w in phrase:
            embedding_index = self.get_embedding_index(w)
            if w not in STOP_WORDS and embedding_index > 0:
                continue
            embedding_vec = self.index_np[embedding_index]
            phrase_embedding += embedding_vec
        phrase_embedding /= len(phrase)
        return phrase_embedding
    
    def find_similar_phrases(
        self,
        query: List[List[str]],
        k: int = 5,
    ) -> Counter:
        query_vecs = [self._phrase_embedding(q) for q in query]
        query_np = np.array(query_vecs)
        similar_words = self.find_similar(query_np, k, remove_first_row=False)
        for phrase in query:
            for word in phrase:
                del similar_words[word]
        return similar_words

    def get_embedding(
        self,
        word: str,
    ) -> np.ndarray:
        '''
        Retrieving the embedding for a specific word
        '''
        embedding_i = self.get_embedding_index(word)
        return self.index_np[embedding_i]

    def find_similar_words(
        self,
        query: List[str],
        k: int = 5,
    ) -> Counter:
        '''
        Using the specified search algorithm and the query passed in, the method
        returns similar words
        
        The algorithm builds a query of embedding vectors by retrieving the cached embedding vectors
        Then uses the `similarity search` specified in the constructor to find similary queries
        Finally the indexes are converted to words and a list of query words is retrieved and ranked
        by ocurrence.
        
        input:
            - ``query``: List[str]
                a list of all the query words, this should be the dictionary (or a subset) that
                we are augmenting
            - ``k``: int,
                the number of instances to search over (the k in kNN)
            - ``result_size``: Optional[int]
                if specified will limit the results to be of the result size
        output:
            - ``similar words`` Counter[str]
                a counting occurence of all the words retrieved from the query
        '''
        embedding_indicies_list = list(set([self.get_embedding_index(w) for w in query]))
        embedding_indicies = list(filter(lambda x: x > 0, embedding_indicies_list))
        embedding_indicies = np.array(embedding_indicies)
        
        query_np = self.index_np[embedding_indicies]
        
        similar_words = self.find_similar(query_np, k)
        for word in query:
            del similar_words[word]
        return similar_words
    
    def get_embedding_index(self, word: str) -> np.ndarray:
        if word not in self.word_to_index:
            word = 'UNK'
        return self.word_to_index[word]   
    
    @classmethod
    def build_index(
        cls,
        embedding_space: EmbeddingSpaceType,
        embedding_space_dims: int,
    ) -> Tuple[np.ndarray, Dict[str, int], Dict[int, str]]:
        '''
        Builds 3 objects specified in the output, meant for searching in the
        embedding space
        
        input:
            - ``embedding_space``: EmbeddingSpaceType
                this is the embedding space mapping keys to embeddings, we use this
                to create a nice wrapper around FAISS to enable fast searching
            - ``embedding_space_dims``: int
                the number of dimensions in each embedding
        output Tuple of 3 object:
            - ``index_np`` np.ndarray
                shape: (len(embedding_space), embedding space dimensions)
                this contains the entire index of the embedding space in a continous numpy
                ndarray for searching
            - ``word_to_index`` Dict[str, int]
                maps each word to the associated index in the index_np
            - ``index_to_word`` Dict[int, str]
                maps each index to the associated word
        '''
        word_to_index = {'UNK': 0}
        index_to_word = {0: 'UNK'}
        for word in embedding_space:
            word_to_index[word] = len(word_to_index)
            index_to_word[
                word_to_index[word]
            ] = word
        
        index_np = np.ndarray((len(word_to_index), embedding_space_dims))

        # first dimension is UNK
        index_np[0] = np.zeros((embedding_space_dims,))
        for word, embedding in embedding_space.items():
            word_i = word_to_index[word]
            index_np[word_i] = embedding
        
        return index_np, word_to_index, index_to_word

word_embedding_index = WordEmbeddingIndex(
    glove_embeddings,
    glove_embedding_dim,
    SimilarityAlgorithm.CosineSimilarity,
)

In [186]:
# embedding space search: word_embedding_index
# query: word_counter['pos'] (len: 293)
# results come from: word_embedding_index.find_similar_words(word_counter['pos'], k=5, result_size=10)
similar_words = word_embedding_index.find_similar_words(
    list(word_counter['pos'].keys()),
    k=5,
)

similar_phrases = word_embedding_index.find_similar_phrases(
    list(phrase_counter.keys()),
    k=5,
)

In [187]:
similar_words.most_common(20)

[('ankle', 7),
 ('vomiting', 6),
 ('shortness', 6),
 ('groin', 4),
 ('anxiety', 4),
 ('hamstring', 3),
 ('elbow', 3),
 ('maybe', 3),
 ('worried', 3),
 ('lips', 3),
 ('losing', 3),
 ('wrist', 3),
 ('hand', 3),
 ('migraines', 3),
 ('itching', 3),
 ('you', 3),
 ('but', 2),
 ('dark', 2),
 ('blue', 2),
 ('increase', 2)]

## kNN Results

The results above show that the embedding space search is bring in relevant items, however it seems like some of these items are not relevant, which could be the issue related to the hyperplanes mentioned earlier.

To counter act this next we take a look at logistic regression.

## Logistic Regression

Here we employ the algorithm to learn a logistic regression classifier.

1. Create a train set with all positive words and an equal amount of negative words (exclude stop words)
2. Fit a logisitc regression model
3. Use this to classifiy every point in the embedding space as a part of the dictionary or not

The hope here is that the logistic regression model can form a hyperplane that seperate concepts better than the kNN model could, since nearest neighbors could be in some random dimensions.

In [228]:
train_set_embeddings = []
train_set_labels = []
import string

num_pos = 0
for w in word_counter['pos'].keys():
    if w in STOP_WORDS or w in string.punctuation:
        continue
    embedding_vec = word_embedding_index.get_embedding(w)
    train_set_labels.append(1)
    train_set_embeddings.append(embedding_vec)
    num_pos += 1

num_neg = 0
for w in word_counter['neg'].keys():
    if w in STOP_WORDS or w in string.punctuation:
        continue
    embedding_vec = word_embedding_index.get_embedding(w)
    train_set_labels.append(0)
    train_set_embeddings.append(embedding_vec)
    num_neg += 1

In [229]:
from sklearn.linear_model import LogisticRegression

x_train = np.array(train_set_embeddings)
y_train = np.array(train_set_labels)
logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [240]:
labels = logisticRegr.predict(word_embedding_index.index_np)
probs = logisticRegr.predict_proba(word_embedding_index.index_np)

In [247]:
# probs.shape (index_size, 2)
similar_words_lr = {'pos': Counter(), 'neg': Counter()}
for i, (label, prob) in enumerate(zip(labels, probs)):
    key = 'pos' if label == 1 else 'neg'
    counter = similar_words_lr[key]
    word = word_embedding_index.index_to_word[i]
    if word in STOP_WORDS or word in string.punctuation or w in word_counter['pos']:
        continue
    counter[word] += prob[label]

In [261]:
similar_words_lr['pos'].most_common(10)

[('numbness', 0.9174100396266176),
 ('rashes', 0.9012706273738842),
 ('dizziness', 0.8934524597467283),
 ('protruding', 0.890337491119129),
 ('bruised', 0.8853579871679472),
 ('irritability', 0.8801254161047158),
 ('itchy', 0.879740547126885),
 ('swollen', 0.8773934862144145),
 ('lethargy', 0.8740313251993611),
 ('blisters', 0.8736541852951585)]

## SVMs

Lets try the same thing with SVMs

In [296]:
from sklearn.svm import SVC  
svclassifier = SVC(kernel='linear', probability=True) 
svclassifier.fit(x_train, y_train)
labels = svclassifier.predict(word_embedding_index.index_np)
probs = svclassifier.predict_proba(word_embedding_index.index_np)
similar_words_svm = {'pos': Counter(), 'neg': Counter()}
for i, (label, prob) in enumerate(zip(labels, probs)):
    key = 'pos' if label == 1 else 'neg'
    counter = similar_words_svm[key]
    word = word_embedding_index.index_to_word[i]
    if word in STOP_WORDS or word in string.punctuation or w in word_counter['pos']:
        continue
    counter[word] += prob[label]

In [297]:
similar_words_svm['pos'].most_common(10)

[('numbness', 0.9374348638457773),
 ('dizziness', 0.9318870428755427),
 ('rashes', 0.9279577857644434),
 ('blisters', 0.9260811004453761),
 ('faint', 0.919302595417729),
 ('itching', 0.9160872487326459),
 ('bruised', 0.9144254605848144),
 ('tingling', 0.9092695025666933),
 ('slurred', 0.9092653569745792),
 ('coughing', 0.9091544259898877)]

### SVM RBF?
Lets try the same thing with an `rbf` kernel instead of a `linear` kernel.


In [298]:
svclassifier = SVC(kernel='rbf', probability=True) 
svclassifier.fit(x_train, y_train)
labels = svclassifier.predict(word_embedding_index.index_np)
probs = svclassifier.predict_proba(word_embedding_index.index_np)
similar_words_svm_rbf = {'pos': Counter(), 'neg': Counter()}
for i, (label, prob) in enumerate(zip(labels, probs)):
    if prob[1] > prob[0]:
        label = 1
    else:
        label = 0
    key = 'pos' if label == 1 else 'neg'
    counter = similar_words_svm_rbf[key]
    word = word_embedding_index.index_to_word[i]
    if word in STOP_WORDS or word in string.punctuation or w in word_counter['pos']:
        continue
    counter[word] += prob[label]



In [299]:
similar_words_svm_rbf['pos'].most_common(10)

[('numbness', 0.9754717961932726),
 ('dizziness', 0.9703606725879994),
 ('rashes', 0.9639200074280232),
 ('blisters', 0.9593204541186061),
 ('nausea', 0.9588400021115162),
 ('cramps', 0.9584717111222454),
 ('cramping', 0.9572098637354005),
 ('twitching', 0.9568297853283206),
 ('vomiting', 0.9529760285921908),
 ('aches', 0.9519459599576058)]

#### SVM Polynomial Degree 2?

You know the drill

In [300]:
svclassifier = SVC(kernel='poly', degree=2, probability=True)
svclassifier.fit(x_train, y_train)
labels = svclassifier.predict(word_embedding_index.index_np)
probs = svclassifier.predict_proba(word_embedding_index.index_np)
similar_words_svm_ply = {'pos': Counter(), 'neg': Counter()}
for i, (label, prob) in enumerate(zip(labels, probs)):
    if prob[1] > prob[0]:
        label = 1
    else:
        label = 0
    key = 'pos' if label == 1 else 'neg'
    counter = similar_words_svm_ply[key]
    word = word_embedding_index.index_to_word[i]
    if word in STOP_WORDS or word in string.punctuation or w in word_counter['pos']:
        continue
    counter[word] += prob[label]



In [301]:
similar_words_svm_ply['pos'].most_common(10)

[('dizziness', 0.9729899342296005),
 ('nausea', 0.9691237668776509),
 ('numbness', 0.9625200586567948),
 ('headaches', 0.9544747070153324),
 ('vomiting', 0.9526546886960093),
 ('cramps', 0.9509906356922635),
 ('aches', 0.9445665677362626),
 ('cramping', 0.9305996447244286),
 ('sore', 0.9267912627285241),
 ('aching', 0.9263559710717603)]

# Error Analysis

In this section lets take a look at an error analysis of the validation set.

Sine we are looking at dictionary models, lets use a dictionary based approach.

In [303]:
valid_data_sample = bio_random_sample(valid_data, 100)
valid_word_counter: Dict[str, Counter] = {'pos': Counter(), 'neg': Counter()}
valid_phrase_counter: Counter = Counter()
for entry in valid_data_sample:
    sentence, tags = entry['input'], entry['output']
    pos_words = get_words(sentence, tags, train_data_sample.binary_class)
    
    # get the negative words
    neg_words = get_words(sentence, tags, 'O')

    pos_ranges, pos_phrases = explain_labels(sentence, tags)
    for w in pos_words:
        valid_word_counter['pos'][w] += 1

    for w in neg_words:
        valid_word_counter['neg'][w] += 1

    for phrase in pos_phrases:
        valid_phrase_counter[tuple(phrase)] += 1

In [379]:
empty_counter = {'pos': Counter(), 'neg': Counter()}
def analyze_dict(known_dict: Counter, aug_dict: Counter, valid_dict: Dict[str, Counter]) -> Tuple[float, float, float]:
    correct = 0
    incorrect = 0
    not_found = 0
    
    pred_dict = {
        'pos': known_dict['pos'] + aug_dict['pos'],
        'neg': known_dict['neg'] + aug_dict['neg'],
    }

    # pos_dict = [w for (w, _) in pred_dict['pos'].most_common(N)]
    
    for valid_w in valid_dict['pos']:
        if valid_w in STOP_WORDS:
            continue
        if valid_w in pred_dict['pos']:
            correct += 1
        elif valid_w in pred_dict['neg']:
            incorrect += 1
        else:
            not_found += 1
    
    return (correct, incorrect, not_found, correct / (correct + incorrect + not_found))

def compute_f1(dict_model: Counter, valid_data) -> Tuple[float, float, float]:
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    def _pred(dict_model: Counter, word: str, pred_class: str) -> str:
        if word in dict_model:
            return pred_class
        else:
            return 'O'
    for entry in valid_data_sample:
        sentence, tags = entry['input'], entry['output']
        for i, (s_i, t_i) in enumerate(zip(sentence, tags)):
            t_i = remove_bio(t_i)
            p_i = _pred(dict_model, s_i, 'ADR')
            
            if p_i == 'ADR':
                if t_i == p_i:
                    tp += 1
                else:
                    fp += 1
            else:
                if t_i == p_i:
                    tn += 1
                else:
                    fn += 1
    precision = float(tp) / float(tp + fp + 1e-13) # avoid / 0
    recall = float(tp) / float(tp + fn + 1e-13) # avoid / 0
    f1_measure = 2. * ((precision * recall) / (precision + recall + 1e-13))
    return (precision, recall, f1_measure)

def counter_f1(known_dict: Counter, aug_dict: Counter, valid_data):
    def _merge_counter(known_dict, aug_dict, top_k = None):
        if top_k is None:
            return list((known_dict['pos'] + aug_dict['pos']).keys())
        result = list(known_dict.keys())
        aug_dict_top_k = [w for w, _ in aug_dict['pos'].most_common(top_k)]
        return result + aug_dict_top_k

    dict_model = _merge_counter(known_dict, aug_dict, top_k=None)
    return compute_f1(dict_model, valid_data)
            

In [362]:
ply = analyze_dict(word_counter, similar_words_svm_ply, valid_word_counter)
norm_dict = analyze_dict(word_counter, empty_counter, valid_word_counter)
knn_dict = analyze_dict(word_counter, {'pos': similar_words, 'neg': Counter()}, valid_word_counter)

In [363]:
print(knn_dict)
print(ply)
print(norm_dict)

(202, 38, 244, 0.41735537190082644)
(197, 177, 110, 0.40702479338842973)
(137, 56, 291, 0.2830578512396694)


In [380]:
# Logistic Regression
print('lr', counter_f1(word_counter, similar_words_lr, valid_data))

# SVM Linear
print('svm-linear-kernel', counter_f1(word_counter, similar_words_svm, valid_data))

# SVM RBF
print('svm-rbf-kernel', counter_f1(word_counter, similar_words_svm_rbf, valid_data))

# SVM PLY
print('svm-ply-2-kernel', counter_f1(word_counter, similar_words_svm_ply, valid_data))

# kNN
print('kNN', counter_f1(word_counter, {'pos': similar_words, 'neg': Counter()}, valid_data))

# No augmentation
print('no-aug', counter_f1(word_counter, empty_counter, valid_data))

lr (0.3632124352331606, 0.5603517186250999, 0.4407419050612537)
svm-linear-kernel (0.36512820512820515, 0.5691446842525979, 0.4448609809434076)
svm-rbf-kernel (0.359979633401222, 0.5651478816946442, 0.4398133748055512)
svm-ply-2-kernel (0.3619246861924686, 0.5531574740207834, 0.4375592791653016)
kNN (0.2041994750656168, 0.6219024780175859, 0.3074491207270918)
no-aug (0.33520809898762655, 0.47641886490807356, 0.39352921756350384)


In [381]:
len(similar_words_lr['pos'])

24085