### Learning cross-lingual embeddings with Skip-gram Negative Sampling

Author: Jeanne Elizabeth Daniel

November 2019

Skip-gram negative sampling (SGNS) is one of the two variants of the popular Word2Vec model proposed by Mikolov et al. (2013). The skip-gram and continuous bag-of-words (CBOW) models are conceptually similar to the neural network language model (NNLM) proposed by Bengio et al. (2003), except that they omit the hidden layer of the NNLM.
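
For reference, the negative-sampling objective of Mikolov et al. (2013) for a single (center word, context word) pair can be written as below. The notation follows that paper and is not defined elsewhere in this notebook: $v_{w_I}$ is the input vector of the center word, $v'_{w_O}$ the output vector of the observed context word, $K$ the number of negative samples (written $K$ here to avoid a clash with the $k$ of the nearest-neighbour classifiers), $U(w)$ the unigram distribution, and $P_n(w)$ the noise distribution from which negative words are drawn.

$$
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right],
\qquad P_n(w) \propto U(w)^{3/4}
$$

Training maximises this quantity over all pairs in the corpus. In the `create_word2vec` cell, `negative = 5` plays the role of $K$ and `ns_exponent = 0.75` is the exponent of the noise distribution.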

The key difference between the skip-gram model and the CBOW model is that CBOW aims to predict the center word given the surrounding words, while skip-gram aims to predict the words surrounding the center word. The vector representations of words can be extracted from the projection layer and used for various down-stream tasks, such as sentiment classification.
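
The difference in prediction direction can be made concrete with a small sketch of how training examples are formed from a context window. This is an illustration only, not gensim's implementation (which, among other things, varies the effective window size); the helper names `skipgram_pairs` and `cbow_examples` and the fixed `window` argument are assumptions introduced here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


tokens = ["when", "can", "i", "stop", "breastfeeding"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])  # one training pair per (center, neighbour)
print(cb[:2])  # one training example per center word
```

Skip-gram therefore produces several training pairs per center word, while CBOW produces one example per center word with the whole context as input.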

To encourage the model to learn a cross-lingual embedding space, we train it on a corpus of multilingual questions found in the MomConnect training set. Previous examinations of these questions estimated the prevalence of code-switching to be about 10%. We construct a massive multilingual vocabulary from all the words found in the training set. 
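
As a rough sketch of what a shared multilingual vocabulary looks like, the snippet below counts tokens over a few already-tokenised questions and assigns each distinct word an index, regardless of language. The three example questions are hypothetical stand-ins (the third is an invented code-switched sentence), and the variable names are assumptions made for illustration; in the notebook itself gensim builds the vocabulary internally.

```python
from collections import Counter

# Hypothetical, already-tokenised questions; the real ones come from the training set.
questions = [
    ["when", "can", "i", "stop", "breastfeeding"],
    ["ngingakuyeka", "nini", "ukuncelisa", "ibele"],
    ["can", "i", "drink", "coca", "cola", "uma", "ukhulelwe"],  # invented code-switched example
]

# Count every token across all questions, ignoring which language it belongs to.
word_counts = Counter(token for question in questions for token in question)

# Assign indices in order of decreasing frequency, so all languages share one index space.
vocab = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}

print(len(vocab), word_counts["can"])
```

Because English and Zulu words live in the same vocabulary, code-switched questions provide contexts in which words from both languages co-occur.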

After training the embedding models on the questions found in the training set, we extract the cross-lingual word embeddings. We construct a sentence embedding by taking the average of all the word embeddings in the sentence (Wieting et al., 2015). Then we train $k$-nearest neighbour classifiers to predict the most appropriate answer, with $k = 1, 5, 25, 50$. The best validation scores were achieved by using cosine distance as the metric together with weighted majority voting, where the contribution of each nearest neighbour is inversely proportional to its distance from the query vector.
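
The weighted voting scheme can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function names, the small `eps` that guards against division by zero, and the toy 2-D vectors and labels are all made up for the example.

```python
import numpy as np


def cosine_distances(query, matrix):
    """Cosine distance (1 - cosine similarity) between a query and each row of matrix."""
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - sims


def weighted_knn_predict(query, x_train, y_train, k, eps=1e-12):
    """Predict a label by inverse-distance weighted voting among the k nearest neighbours."""
    dists = cosine_distances(query, x_train)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for idx in nearest:
        weight = 1.0 / (dists[idx] + eps)  # closer neighbours cast larger votes
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + weight
    return max(votes, key=votes.get)


# Toy 2-D "sentence embeddings" with made-up labels, for illustration only.
x_train = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]])
y_train = np.array([0, 1, 1, 1])
query = np.array([1.0, 0.1])

print(weighted_knn_predict(query, x_train, y_train, k=3))
```

In this toy case two of the three nearest neighbours carry label 1, so uniform voting would pick 1, but the single very close neighbour with label 0 outweighs them under inverse-distance weighting.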

In [1]:
import pandas as pd
import gensim
import numpy as np
from gensim.models import Word2Vec, FastText

In [2]:
import preprocess_data

In [3]:
data = pd.read_csv('dataset_7B', delimiter = ';', engine = 'python')
data = data[['helpdesk_question', 'helpdesk_reply', 'set', 'low_resource']] 

In [4]:
responses = pd.DataFrame(data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts()).reset_index()
responses['reply'] = responses['index']
responses['index'] = responses.index
responses = dict(responses.set_index('reply')['index'])

In [5]:
def create_word2vec(data, skip_gram = 1, size = 100):
    
    """ Create word2vec embedding model. Word2Vec has two variants - CBOW and SGNS.  
    
    Args:
        data: dataframe that contains the questions 
        skip_gram: binary indicator to use either skip-gram negative sampling or 
            continuous bag-of-words (Mikolov et al., 2013)
        size: number of dimensions in embedding
    
    Returns:
        Trained embedding model
    
    """
    
    documents = data['helpdesk_question']
    documents['index'] = documents.index
    processed_docs = documents.apply(preprocess_data.preprocess, args = [0, False])
    print(len(processed_docs))
    model = Word2Vec(processed_docs, min_count = 1, sg = skip_gram, seed= 1, size = size,
                     negative = 5, ns_exponent =  0.75, workers = 5)  
    
    return model
    

In [6]:
def create_sentence_embeddings(embedding_model, sentence):
    
    """ We create sentence embeddings by averaging the embeddings of the words found in the sentence. 
    If no words match, we return a vector of random values.
    
    Args:
        embedding_model: trained gensim embedding model whose word vectors are averaged
        sentence: list of words found in sentence
        
    Returns:
        A sentence embedding
    
    """
        
    sentence_vector = np.zeros(100)
    length = 0
    if len(sentence) == 0:
        return (np.random.random(100) - 0.5)/100
    
    # Start from the first word's vector if it is in the vocabulary
    if embedding_model.wv.vocab.get(sentence[0]) is not None:
        sentence_vector = embedding_model.wv[sentence[0]]
        length += 1
    
    # Add the vectors of the remaining in-vocabulary words
    for word in sentence[1:]:
        if embedding_model.wv.vocab.get(word) is not None:
            sentence_vector = sentence_vector + embedding_model.wv[word]
            length += 1
            
    if length == 0:
        return (np.random.random(100) - 0.5)/100
   
    return sentence_vector/length

In [7]:
def create_batch(df, embedding_model, D): 
        
    """ Create batch of feature vectors in matrix form
    
    Args:
        df: dataset of questions
        embedding_model: pretrained embedding model
        D: size of embedding
        
    Returns:
        matrix where rows are embeddings of questions
    
    """    
    
    matrix = np.zeros((df.shape[0], D, ))
    all_text = list(df['helpdesk_question'].apply(preprocess_data.preprocess)) 

    # Iterate over every question; range(len(all_text) - 1) left the last row as zeros
    for i in range(len(all_text)):
        sentence_vector = create_sentence_embeddings(embedding_model, all_text[i])
        matrix[i] += np.array(sentence_vector)
            
    return matrix 

def label_preprocess(entry):
        
    """ Returns integer ID corresponding to response for easy comparison and classification
    
    Args:
        entry: helpdesk reply text to look up
        
    Uses the global `responses` dict, which maps each template response to its ID.
        
    Return: 
        integer corresponding to each response     
        
    """
    
    # Replies not seen in the training set share one default "unknown" class
    return responses.get(entry, len(responses))

In [8]:
train_df   = data.loc[data['set'] == 'Train']
valid_df   = data.loc[data['set'] == 'Valid']
test_df    = data.loc[data['set'] == 'Test']
test_LR_df = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True')]

y_train   = data.loc[data['set'] == 'Train']['helpdesk_reply'].apply(label_preprocess)
y_valid   = data.loc[data['set'] == 'Valid']['helpdesk_reply'].apply(label_preprocess)
y_test    = data.loc[data['set'] == 'Test']['helpdesk_reply'].apply(label_preprocess)
y_test_LR = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True')]['helpdesk_reply'].apply(label_preprocess)

In [9]:
w2v = create_word2vec(train_df)


96413


In [10]:
from sklearn.neighbors import KNeighborsClassifier

def train_knn_model(x_train, y_train, metric, k, weights):
    
    """ Fit k-nearest neighbour model to the sentence embeddings
    
    Args:
        x_train: matrix of sentence embeddings
        y_train: class labels associated with each sentence embedding 
        metric: distance metric to use
        k: number of neighbours to consider
        weights: 'uniform' for equal voting, or 'distance' for weighted voting (the weight of 
        each vote is inversely proportional to its distance from the query)
        
    Returns:
        A trained KNN classifier
    
    """
    
    
    clf = KNeighborsClassifier(n_neighbors=k, weights= weights, metric = metric)
    clf.fit(x_train, y_train)
    return clf

### Results for Word2Vec Embeddings

In [11]:
x_train = create_batch(train_df, w2v, 100)
x_valid = create_batch(valid_df, w2v, 100)
x_test  = create_batch(test_df, w2v, 100)
x_test_LR = create_batch(test_LR_df, w2v, 100)

In [12]:
clf_1NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 1, weights = 'distance')
score = clf_1NN.score(x_train, y_train)
print("Train accuracy", score)
score = clf_1NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Train accuracy 0.9681471186159399
Validation accuracy 0.48824910029729307


In [13]:
clf_5NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 5, weights = 'distance')
score = clf_5NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.5322797684243468


In [14]:
clf_25NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 25, weights = 'distance')
score = clf_25NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.5714598654357691


In [15]:
clf_50NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 50, weights = 'distance')
score = clf_50NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.5739007979971835


In [16]:
score = clf_1NN.score(x_test, y_test)
print("Test accuracy on 1-NN", score)
score = clf_5NN.score(x_test, y_test)
print("Test accuracy on 5-NN", score)
score = clf_25NN.score(x_test, y_test)
print("Test accuracy on 25-NN", score)
score = clf_50NN.score(x_test, y_test)
print("Test accuracy on 50-NN", score)

Test accuracy on 1-NN 0.4868612912232805
Test accuracy on 5-NN 0.5351037756336674
Test accuracy on 25-NN 0.5731393292588342
Test accuracy on 50-NN 0.575373064871405


In [17]:
score = clf_1NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 1-NN", score)
score = clf_5NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 5-NN", score)
score = clf_25NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 25-NN", score)
score = clf_50NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 50-NN", score)

LR Test accuracy on 1-NN 0.373046875
LR Test accuracy on 5-NN 0.4230769230769231
LR Test accuracy on 25-NN 0.48001802884615385
LR Test accuracy on 50-NN 0.4846754807692308


### Assessing the quality of cross-lingual embeddings


We design a small experiment to assess the quality of the cross-lingual embeddings for English and Zulu. The English sentences were synthesized based on frequently occurring questions found in the dataset. The Zulu translations were obtained using Google Translate and verified by a Zulu speaker. We compute the sentence embedding for each English-Zulu translation pair and calculate the cosine distance between the two embeddings.
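
To help read the distances reported in this section, the sketch below spells out the cosine distance as one minus the cosine similarity, which is the quantity `scipy.spatial.distance.cosine` returns. The helper `cosine_distance` and the toy vectors are assumptions introduced for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (||u|| * ||v||)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Toy vectors, for illustration only.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

The distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction), so smaller values indicate that a translation pair lies closer together in the embedding space.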

In [30]:
eng_A  = "can you drink coca cola when you are pregnant"
zulu_A = "ungayiphuza yini i-coca cola uma ukhulelwe"

eng_B  = "when can i stop breastfeeding"
zulu_B = "ngingakuyeka nini ukuncelisa ibele"

eng_C  = "when can I start feeding my baby solid food"
zulu_C = "ngingaqala nini ukondla ingane yami ukudla okuqinile"

eng_D  = "what are the signs of labour"
zulu_D = "yiziphi izimpawu zokubeletha"

eng_E  = "when can I learn the gender of my baby"
zulu_E = "ngingabazi ubulili bengane yami"

In [31]:
create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_A))

array([-0.43539676,  0.14723022, -0.09262004,  0.25502288,  0.11587235,
        0.06259349,  0.38609362,  0.21637774, -0.21829106,  0.64426017,
       -0.20095043, -0.18074936,  0.07898352, -0.01266432,  0.63484   ,
        0.17975637, -0.02500954, -0.4267427 ,  0.04626761,  0.31362945,
        0.7310216 , -0.04467263, -0.45416588,  0.31655535,  0.4234427 ,
       -0.15194374, -0.17476863,  0.31114313, -0.24059613, -0.11271609,
        0.23048925,  0.19298944,  0.19004875,  0.13038677,  0.07424899,
        0.16243826, -0.4898702 , -0.33368504, -0.4785539 ,  0.12046936,
       -0.5566835 ,  0.37200242,  0.03245384, -0.64333767,  0.37079924,
        0.2541796 ,  0.02759663,  0.16897526, -0.97489905, -0.02927784,
        0.00379283,  0.6215036 , -0.2587112 , -0.02903298,  0.63286567,
       -0.0719777 ,  0.80922073,  0.00908484,  0.38366628, -0.3711188 ,
       -0.3862779 ,  0.33875817, -0.5076158 , -0.5550014 , -0.6456312 ,
        0.02900601,  0.96982616,  0.6402915 ,  0.0903128 , -0.25

In [32]:
embed_eng_A = create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_A))
embed_eng_B = create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_B))
embed_eng_C = create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_C))
embed_eng_D = create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_D))
embed_eng_E = create_sentence_embeddings(w2v, preprocess_data.preprocess(eng_E))

In [33]:
embed_zulu_A = create_sentence_embeddings(w2v, preprocess_data.preprocess(zulu_A))
embed_zulu_B = create_sentence_embeddings(w2v, preprocess_data.preprocess(zulu_B))
embed_zulu_C = create_sentence_embeddings(w2v, preprocess_data.preprocess(zulu_C))
embed_zulu_D = create_sentence_embeddings(w2v, preprocess_data.preprocess(zulu_D))
embed_zulu_E = create_sentence_embeddings(w2v, preprocess_data.preprocess(zulu_E))

In [34]:
from scipy.spatial.distance import cosine

In [35]:
print("Sentence A:", cosine(embed_eng_A, embed_zulu_A))
print("Sentence B:", cosine(embed_eng_B, embed_zulu_B))
print("Sentence C:", cosine(embed_eng_C, embed_zulu_C))
print("Sentence D:", cosine(embed_eng_D, embed_zulu_D))
print("Sentence E:", cosine(embed_eng_E, embed_zulu_E))

Sentence A: 0.39793431758880615
Sentence B: 0.5983215216141248
Sentence C: 0.5964771807193756
Sentence D: 0.5928864777088165
Sentence E: 0.6021465587010766
