### Learning cross-lingual embedding with FastText Embeddings

Author: Jeanne Elizabeth Daniel

November 2019

FastText (Bojanowskiet al., 2017) is conceptually similar to SGNS, except it also takes into account subword information. Each word is broken up into a set of character $n$-grams, with special boundary symbols at the beginning and end of each word. The original word is also retained in the set. Thus, fir $n=3$, the word "there" is represented by the following $n$-grams "<th", "the", "her", "ere", "re>" and the special token "\< there>".

This simple approach enables sharing representations across the vocabulary,can handle rare words better, and can even handle unseen words (a property theat SGNS models lack). We learn a cross-lingual embedding space from the multilingual questions by relying on the estimated 10% prevalence of code-mixing. 

After training the embedding models on the questions found in the training set, we extract the cross-lingual word embeddings. We construct a sentence embedding by taking the average of all the word embeddings in the sentence (Wieting et al., 2015). Then we train $k$-nearest neighbour classifiers to predict the most appropriate answer, with $k = 1, 5, 25, 50$. The best validation scores were achieved by using cosine as the distance metric and using weighted majority voting, where the contribution of each nearest neighbour is inversely proportion to its distance from the query vector.

In [1]:
import pandas as pd
import gensim
import numpy as np
from gensim.models import Word2Vec, FastText

In [2]:
import preprocess_data

In [3]:
data = pd.read_csv('dataset_7B', delimiter = ';', engine = 'python')
data = data[['helpdesk_question', 'helpdesk_reply', 'set', 'low_resource']] 

In [4]:
responses = pd.DataFrame(data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts()).reset_index()
responses['reply'] = responses['index']
responses['index'] = responses.index
responses = dict(responses.set_index('reply')['index'])

In [5]:
def create_fasttext(data, skip_gram= 1, min_count = 1, size = 100):
    
    """ Train a fasttext embedding model. 
    The FastText model implicitly creates a multilingual vocabulary from the multilingual dataset. 
    The estimate 10% code-switching is used as a weak cross-lingual signal to construct 
    cross-lingual embeddings
    
    Args:
        data: dataframe that contains the questions 
        skip_gram: binary indicator to use either skip-gram negative sampling or 
            continuous bag-of-words (Mikolov et al., 2013)
        size: number of dimensions in embedding
        
    Returns:
        Trained embedding model
    
    """
    
    documents = data['helpdesk_question']
    documents['index'] = documents.index
    processed_docs = documents.apply(preprocess_data.preprocess, args = [0, False])
    
    model = FastText(sentences=processed_docs, sg=skip_gram, size=size, window=5, min_count=min_count, 
                     word_ngrams=1, sample=0.001, seed=1, workers=5, negative=5, ns_exponent=0.75,
                     iter=5, min_n=3, max_n=6, trim_rule=None)
    
    return model
    

In [6]:
def create_sentence_embeddings(embedding_model, sentence):
    
    """ We create sentence embeddings by averaging the embeddings of the words found in the sentence. 
    If no words match, we return a vector of random values.
    
    Args:
        embedding_model: pretrained word embedding model
        sentence: list of words found in sentence
        
    Returns:
        A sentence embedding for the input sentence
    
    """
    
    sentence_vector = np.zeros(100)
    length = 0
    
    if len(sentence) == 0:
        return (np.random.random(100) - 0.5)/100
    
    if embedding_model.wv.vocab.get(sentence[0]) != None:
        sentence_vector = embedding_model.wv[sentence[0]]
        length += 1
    
    for word in sentence[1:]:
        if embedding_model.wv.vocab.get(word) != None:
            sentence_vector = sentence_vector + 1*np.array(embedding_model.wv[word])
            length += 1
            
    if length == 0:
        return (np.random.random(100) - 0.5)/100
   
    return sentence_vector/length

In [7]:
def create_batch(df, embedding_model, D): 
        
    """ Create batch of feature vectors in matrix form
    
    Args:
        df: dataset of questions
        embedding_model: pretrained embedding model
        D: size of embedding
        
    Returns:
        matrix where rows are embeddings of questions
    
    """    
    
    matrix = np.zeros((df.shape[0], D, ))
    all_text = list(df['helpdesk_question'].apply(preprocess_data.preprocess)) 
    
    for i in range(len(all_text) -1):
        sentence_vector = create_sentence_embeddings(embedding_model, all_text[i])
        matrix[i] += np.array(sentence_vector)
            
    return matrix 

def label_preprocess(entry):
    
    """ Returns integer ID corresponding to response for easy comparison and classification
    
    Args:
        entry: query item 
        responses: dict containing all the template responses with their corresponding IDs
        
    Return: 
        integer corresponding to each response     
        
    """
    
    if responses.get(entry) != None:
        return responses[entry]
    else:
        return len(responses) #default unknown class

In [8]:
train_df   = data.loc[data['set'] == 'Train']
valid_df   = data.loc[data['set'] == 'Valid']
test_df    = data.loc[data['set'] == 'Test']
test_LR_df = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True')]

y_train   = data.loc[data['set'] == 'Train']['helpdesk_reply'].apply(label_preprocess)
y_valid   = data.loc[data['set'] == 'Valid']['helpdesk_reply'].apply(label_preprocess)
y_test    = data.loc[data['set'] == 'Test']['helpdesk_reply'].apply(label_preprocess)
y_test_LR = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True')]['helpdesk_reply'].apply(label_preprocess)

In [9]:
fast = create_fasttext(train_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [10]:
from sklearn.neighbors import KNeighborsClassifier

def train_knn_model(x_train, y_train, metric, k, weights):
    
    """ Fit k-nearest neighbour model to the sentence embeddings
    
    Args:
        x_train: matrix of sentence embeddings
        y_train: class labels associated with each sentence embedding 
        metric: distance metric to use
        k: number of neighbours to consider
        weights: to either use uniform voting (equal weighting) or weighted voting (the weight of 
        each vote is proportional to its distance to query)
        
    Returns:
        A trained KNN classifier
    
    """
    
    clf = KNeighborsClassifier(n_neighbors=k, weights= weights, metric = metric)
    clf.fit(x_train, y_train)
    return clf

### Results for FastText

In [11]:
x_train = create_batch(train_df, fast, 100)
x_valid = create_batch(valid_df, fast, 100)
x_test  = create_batch(test_df, fast, 100)
x_test_LR = create_batch(test_LR_df, fast, 100)

In [12]:
clf_1NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 1, weights = 'distance')
score = clf_1NN.score(x_train, y_train)
print("Train accuracy", score)
score = clf_1NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Train accuracy 0.9681471186159399
Validation accuracy 0.4981380065717415


In [13]:
clf_5NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 5, weights = 'distance')
score = clf_5NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.546894069785636


In [14]:
clf_25NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 25, weights = 'distance')
score = clf_25NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.5837271162572367


In [15]:
clf_50NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 50, weights = 'distance')
score = clf_50NN.score(x_valid, y_valid)
print("Validation accuracy", score)

Validation accuracy 0.584040056329213


In [16]:
score = clf_1NN.score(x_test, y_test)
print("Test accuracy on 1-NN", score)
score = clf_5NN.score(x_test, y_test)
print("Test accuracy on 5-NN", score)
score = clf_25NN.score(x_test, y_test)
print("Test accuracy on 25-NN", score)
score = clf_50NN.score(x_test, y_test)
print("Test accuracy on 50-NN", score)

Test accuracy on 1-NN 0.5005118977445475
Test accuracy on 5-NN 0.5444110073527131
Test accuracy on 25-NN 0.5848043930133714
Test accuracy on 50-NN 0.5839357180529271


In [17]:
score = clf_1NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 1-NN", score)
score = clf_5NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 5-NN", score)
score = clf_25NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 25-NN", score)
score = clf_50NN.score(x_test_LR, y_test_LR)
print("LR Test accuracy on 50-NN", score)

LR Test accuracy on 1-NN 0.3942307692307692
LR Test accuracy on 5-NN 0.4391526442307692
LR Test accuracy on 25-NN 0.494140625
LR Test accuracy on 50-NN 0.49759615384615385


### Assessing the quality of cross-lingual embeddings

We design a small experiment to assess the quality of the cross-lingual embeddings for English and Zulu. The translations were obtained using google translate and verified by a Zulu speaker. We compute the sentence embedding for each English-Zulu translation pair and calculate the cosine distance between the two embeddings. 

In [27]:
eng_A  = "can you drink coca cola when you are pregnant"
zulu_A = "ungayiphuza yini i-coca cola uma ukhulelwe"

eng_B  = "when can i stop breastfeeding"
zulu_B = "ngingakuyeka nini ukuncelisa ibele"

eng_C  = "when can I start feeding my baby solid food"
zulu_C = "ngingaqala nini ukondla ingane yami ukudla okuqinile"

eng_D  = "what are the signs of labour"
zulu_D = "yiziphi izimpawu zokubeletha"

eng_E  = "when can I learn the gender of my baby"
zulu_E = "ngingabazi ubulili bengane yami"

In [28]:
embed_eng_A = create_sentence_embeddings(fast, preprocess_data.preprocess(eng_A))
embed_eng_B = create_sentence_embeddings(fast, preprocess_data.preprocess(eng_B))
embed_eng_C = create_sentence_embeddings(fast, preprocess_data.preprocess(eng_C))
embed_eng_D = create_sentence_embeddings(fast, preprocess_data.preprocess(eng_D))
embed_eng_E = create_sentence_embeddings(fast, preprocess_data.preprocess(eng_E))

In [29]:
embed_zulu_A = create_sentence_embeddings(fast, preprocess_data.preprocess(zulu_A))
embed_zulu_B = create_sentence_embeddings(fast, preprocess_data.preprocess(zulu_B))
embed_zulu_C = create_sentence_embeddings(fast, preprocess_data.preprocess(zulu_C))
embed_zulu_D = create_sentence_embeddings(fast, preprocess_data.preprocess(zulu_D))
embed_zulu_E = create_sentence_embeddings(fast, preprocess_data.preprocess(zulu_E))

In [30]:
from scipy.spatial.distance import cosine

In [31]:
print("Sentence A:", cosine(embed_eng_A, embed_zulu_A))
print("Sentence B:", cosine(embed_eng_B, embed_zulu_B))
print("Sentence C:", cosine(embed_eng_C, embed_zulu_C))
print("Sentence D:", cosine(embed_eng_D, embed_zulu_D))
print("Sentence E:", cosine(embed_eng_E, embed_zulu_E))

Sentence A: 0.3769022226333618
Sentence B: 0.5876132222684076
Sentence C: 0.5461249947547913
Sentence D: 0.647421807050705
Sentence E: 0.6305571736233155
