# Preliminaries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [2]:
# install sent2vec
!pip install git+https://github.com/epfml/sent2vec
# install annoy
!pip install annoy

Collecting git+https://github.com/epfml/sent2vec
  Cloning https://github.com/epfml/sent2vec to /tmp/pip-req-build-_75a1j6t
  Running command git clone -q https://github.com/epfml/sent2vec /tmp/pip-req-build-_75a1j6t
Building wheels for collected packages: sent2vec
  Building wheel for sent2vec (setup.py) ... done
  Created wheel for sent2vec: filename=sent2vec-0.0.0-cp36-cp36m-linux_x86_64.whl size=1137474 sha256=12703f152ae23c528b3cee90794587b2ca576b30b34b54dd0638a5b3dcb3864f
  Stored in directory: /tmp/pip-ephem-wheel-cache-gbc30ioy/wheels/7b/db/9d/db816db406f182ce88bf1b90e6d41313bf16c7ed72a9626ac7
Successfully built sent2vec
Installing collected packages: sent2vec
Successfully installed sent2vec-0.0.0


Write the installed package versions to a file every time the notebook runs, in case you need to go back and recover the dependencies.

The latest known requirements for each notebook are hosted in the companion GitHub repo, and can be pulled down and installed here if needed. The companion repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [3]:
!pip freeze > kaggle_image_requirements.txt

Check what is in the input folder

In [4]:
# Input data files are available in the "../input/" directory.
# Any results you write to the current directory are saved as output.

!ls ../input

jw300entw  sent2vec


# Load The Raw Data

In [5]:
import re
import time

start = time.time()
english_sentences = []
with open("../input/jw300entw/jw300.en-tw.en") as f:
    for line in f:
        english_sentences.append(re.sub(r'[\W\d]', " ",line.lower())) # clean and normalize
end = time.time()
print("Loading the english sentences took %d seconds"%(end-start))

Loading the english sentences took 5 seconds


In [6]:
print("A sample of the english sentences is:")
print(english_sentences[:10])
print("The length of the list is:")
print(len(english_sentences))

A sample of the english sentences is:
['  oh   jehovah   keep my young girl faithful     ', 'i was born in      in alsace   france   into an artistic family   ', 'during the evenings   father   sitting in his lounge chair   would be reading some books about geography or astronomy   ', 'my doggy would be sleeping by his feet   and daddy would be sharing with mum some highlights from his reading while she was knitting for her family   ', 'how much i enjoyed those evenings   ', 'religion played a big part in our lives   ', 'we were staunch catholics   and people who saw us going to church on sunday morning would say     it s nine o clock   ', 'the arnolds are going to church     ', 'every day before going to school   i went to church   ', 'but because of the priest s misbehavior   mum forbade me to go to church alone   ']
The length of the list is:
606197
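
The cleaning step explains the runs of spaces in the sample above: every character that is not a letter (punctuation, digits, the trailing newline) is replaced by a single space rather than removed. Below is a minimal sketch of that step on made-up input; the helper name `normalize` and the example strings are illustrative assumptions, not part of the notebook.

```python
import re

def normalize(line):
    """Lowercase a line and replace each non-word character or digit with a space."""
    return re.sub(r"[\W\d]", " ", line.lower())

# Hypothetical input, chosen to resemble the corpus sample above
cleaned = normalize("It's 9 o'clock!\n")
print(repr(cleaned))

# For str patterns in Python 3, \W is Unicode-aware, so Twi letters such as ɔ survive
print(repr(normalize("Ɔsɔfo 2")))

# Optional extra step (not done in the notebook): collapse repeated spaces
print(" ".join(cleaned.split()))
```

The notebook keeps the extra spaces as they are; collapsing them is shown only as an option.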


In [7]:
twi_sentences = []
with open("../input/jw300entw/jw300.en-tw.tw") as f:
    for line in f:
        twi_sentences.append(re.sub(r'[\W\d]', " ", line.lower())) # clean and normalize

In [8]:
print("A sample of the twi sentences is:")
print(twi_sentences[:10])
print("The length of the list is:")
print(len(twi_sentences))

A sample of the twi sentences is:
['  oo   yehowa   boa me babea kumaa yi ma onni nokware     ', 'wɔwoo me too abusua a wonim adwinne di mu wɔ alsace   france   wɔ      mu   ', 'ná papa taa pa twere n agua mu kenkan asase ho nsɛm anaa ewim nneɛma ho nhoma bi anwummere anwummere   ', 'ná me kraman da papa nan so   na na ɔka nsɛntitiriw a epue wɔ n akenkan no mu kyerɛ maame bere a ɔnwene abusua no nneɛma no   ', 'ná m ani gye anwummere a ɛtete saa no ho kɛse   ', 'ná yɛpɛ nyamesom kɛse   ', 'ná yɛyɛ katolekfo amapa   na na nkurɔfo a wohu yɛn sɛ yɛrekɔ asɔre kwasida anɔpa no ka sɛ     abɔ nɔnkron   ', 'arnold abusua no rekɔ asɔre     ', 'ná mekɔ asɔre anɔpa biara ansa na makɔ sukuu   ', 'nanso esiane ɔsɔfo no suban bɔne nti   maame amma me nkutoo ankɔ asɔre bio   ']
The length of the list is:
606197
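
Both lists contain 606197 entries, which matters because the retrieval step later in the notebook relies on entry `i` of one list being the translation of entry `i` of the other. A minimal sketch of a guard for that assumption follows; `check_alignment` is a hypothetical helper, and the toy sentences are copied (with whitespace trimmed) from the samples printed above.

```python
def check_alignment(source_sentences, target_sentences):
    """Return (source, target) pairs, raising if the two lists differ in length."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            "Parallel corpora differ in length: %d vs %d"
            % (len(source_sentences), len(target_sentences))
        )
    return list(zip(source_sentences, target_sentences))

# Toy data taken from the printed samples (entries 5 and 7 of each list)
toy_en = ["religion played a big part in our lives", "the arnolds are going to church"]
toy_tw = ["ná yɛpɛ nyamesom kɛse", "arnold abusua no rekɔ asɔre"]

pairs = check_alignment(toy_en, toy_tw)
for en, tw in pairs:
    print(en, "->", tw)
```

Equal length is a necessary condition only; it does not prove that every pair is a correct translation.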


# Vectorize Subset of English Sentences with sent2vec

In [9]:
MAXNSENT = 10000 # how many sentences to take from top of data for now (small experiment)

In [10]:
import time
import sent2vec

model = sent2vec.Sent2vecModel()
start=time.time()
model.load_model('../input/sent2vec/wiki_unigrams.bin')
end = time.time()
print("Loading the sent2vec embedding took %d seconds"%(end-start))

Loading the sent2vec embedding took 10 seconds


In [11]:
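def assemble_embedding_vectors(data):
    # embed each sentence; embed_sentence returns an array of shape (1, dim)
    vectors = []
    for item in data:
        vec = model.embed_sentence(item)
        if vec is not None:
            vectors.append(vec)
    # concatenate once at the end instead of re-copying the array on every iteration
    return np.concatenate(vectors, axis=0) if vectors else None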

In [12]:
start=time.time()
EmbeddingVectors = assemble_embedding_vectors(english_sentences[:MAXNSENT])
end = time.time()
print("Computing all embeddings took %d seconds"%(end-start))
print(EmbeddingVectors)

Computing all embeddings took 31 seconds
[[ 1.18584879e-01 -2.13146985e-01 -3.34718257e-01 ...  3.40525210e-01
   2.87434489e-01 -1.28532574e-01]
 [ 2.40930080e-01 -4.59281594e-01 -3.28957111e-01 ...  3.89363542e-02
   3.97189409e-01 -3.97409573e-02]
 [-2.81732213e-02 -6.40245825e-02  3.14420313e-01 ... -4.41329218e-02
  -2.96688620e-02  3.62253864e-04]
 ...
 [-5.24406433e-02  4.13849652e-01 -1.14870206e-01 ...  7.23682165e-01
  -2.16702804e-01 -2.00715717e-02]
 [ 2.01901551e-02 -3.09839964e-01 -1.10622898e-01 ... -4.58649695e-02
   8.25465545e-02  1.08406030e-01]
 [ 7.83808351e-01  1.92601413e-01  5.50085455e-02 ...  1.12677053e-01
   3.22284132e-01  3.28299969e-01]]


In [13]:
print("The shape of embedding matrix:")
print(EmbeddingVectors.shape)

The shape of embedding matrix:
(10000, 600)


In [14]:
# Save embeddings for later use
np.save("english_sent2vec_vectors_jw.npy",EmbeddingVectors)

# Build and Test Index w/ Annoy for Fast Nearest-Neighbor Retrieval

First, build the Annoy index for the available English sent2vec vectors.

In [15]:
from annoy import AnnoyIndex

start = time.time()
dimension = EmbeddingVectors.shape[1] # Length of item vector that will be indexed
english_NN_index = AnnoyIndex(dimension, 'angular')  
for i in range(EmbeddingVectors.shape[0]): # go through every embedding vector
    english_NN_index.add_item(i, EmbeddingVectors[i]) # add to index

english_NN_index.build(10) # 10 trees
english_NN_index.save('en_sent2vec_NN_index.ann') # save index
end = time.time()
print("Building the NN index took %d seconds"%(end-start))

Building the NN index took 1 seconds
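
Annoy returns approximate neighbors, so it can be useful to compare a few queries against an exact search. According to the Annoy documentation, the `'angular'` metric is the Euclidean distance between unit-normalized vectors, i.e. sqrt(2(1 - cos(u, v))). The sketch below implements that formula and a brute-force search in plain Python; the function names and the toy vectors are assumptions made for illustration, and it is far too slow for the full dataset.

```python
import math

def angular_distance(u, v):
    """Euclidean distance between normalized vectors: sqrt(2 * (1 - cos(u, v))).

    Assumes both vectors are nonzero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # clamp at zero to guard against tiny negative values from rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

def brute_force_neighbors(vectors, query_idx, k):
    """Exact k nearest neighbors of vectors[query_idx], query included, nearest first."""
    distances = [
        (angular_distance(vectors[query_idx], v), i) for i, v in enumerate(vectors)
    ]
    return [i for _, i in sorted(distances)[:k]]

# Hypothetical 2-d vectors standing in for the 600-d sent2vec embeddings
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
print(brute_force_neighbors(toy_vectors, 0, 3))
```

On a small slice of `EmbeddingVectors`, the ids returned this way could be compared with those from `get_nns_by_item` to get a rough sense of how much accuracy the 10-tree index gives up.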


Test the built index

In [16]:
test_english_NN_index = AnnoyIndex(dimension, 'angular')
test_english_NN_index.load('en_sent2vec_NN_index.ann') # super fast, will just mmap the file

True

In [17]:
translation_idx = 15 # choose index of sentence to focus on in english_sentences/twi_sentences

annoy_out = test_english_NN_index.get_nns_by_item(translation_idx, 5) # return the 5 nearest neighbors of the sentence at translation_idx (the first result is the sentence itself)

In [18]:
print(annoy_out)

[15, 8659, 9452, 1636, 4009]


In [19]:
print("- The sentence we are finding nearest neighbors of:\n")
print(english_sentences[annoy_out[0]])
print("\n\n- The 4 nearest neighbors found:\n")
for i in range(1,5):
    print(str(i) + ". "+ english_sentences[annoy_out[i]])

- The sentence we are finding nearest neighbors of:

but mother was so enthusiastic about the truth that she decided to do some bible reading with me   


- The 4 nearest neighbors found:

1. after extended discussions mother too became convinced that what she had learned from the bible was the truth   
2. he also spoke to me about maria stossier   our neighbor hans   younger sister   who had taken a stand for bible truth   
3.   however   we feel that the real encouragement was for the rest of us who were able to meet them and witness their love and zeal for bible truth     
4. my grandfather listened to the same lecture   and he too was convinced that what he heard was the truth   


In [20]:
print("- In other words, if we were translating the english sentence:\n")
print(english_sentences[annoy_out[0]])
print("  where the known correct translation is:")
print(twi_sentences[annoy_out[0]])
print("\n\n- The top 4 translations suggested by our retrieval system above are:\n")
for i in range(1,5):
    print(str(i) + ". "+ twi_sentences[annoy_out[i]])

- In other words, if we were translating the english sentence:

but mother was so enthusiastic about the truth that she decided to do some bible reading with me   
  where the known correct translation is:
nanso na maame ani gye nokware no ho araa ma ɔne me boom kenkan bible no   


- The top 4 translations suggested by our retrieval system above are:

1. bere a wɔne maame bɔɔ nkɔmmɔ pii akyi no   ɔno nso begye dii sɛ nea na wasua afi bible mu no ne nokware no   
2. ɔkaa maria stossier   yɛn fipamfo hans nuabea kumaa a na wagyina bible mu nokware akyi   ho asɛm kyerɛɛ me   
3. nanso   yɛte nka sɛ yɛn a yɛaka a yenyaa hokwan hyiaa wɔn na yehuu wɔn dɔ ne wɔn mmɔdenbɔ wɔ bible mu nokware ho no na yɛanya nkuranhyɛ kɛse     
4. me nana barima tiee saa ɔkasa no bi   na ɔno nso begye dii sɛ nokware no na wate no   


This seems to work! Now we just need to scale it out to the whole dataset and test it with random input.
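
The lookup in the last few cells can be packaged as one function: find the nearest English neighbors of a sentence, then return the Twi sentences stored at the same positions. The sketch below is one possible shape for it, not the notebook's own code; `suggest_translations`, `fake_neighbors` and the toy sentences are hypothetical, and the neighbor function is assumed to take `(index, n)` and return `n` ids, nearest first, as `get_nns_by_item` did above.

```python
def suggest_translations(query_idx, find_neighbors, target_sentences, k=4):
    """Return the target-side sentences aligned with the k nearest source neighbors.

    find_neighbors(query_idx, n) is assumed to return n sentence ids, nearest first.
    The query's own id is dropped if it appears in the result.
    """
    neighbor_ids = find_neighbors(query_idx, k + 1)
    return [target_sentences[i] for i in neighbor_ids if i != query_idx][:k]

# Stand-in for an index lookup: always answers with a fixed ranking (illustration only)
def fake_neighbors(query_idx, n):
    return [0, 2, 1][:n]

toy_targets = ["target sentence 0", "target sentence 1", "target sentence 2"]
suggestions = suggest_translations(0, fake_neighbors, toy_targets, k=2)
for rank, sentence in enumerate(suggestions, start=1):
    print(str(rank) + ". " + sentence)
```

In this notebook the call would look like `suggest_translations(translation_idx, test_english_NN_index.get_nns_by_item, twi_sentences)`, keeping in mind that the index currently covers only the first `MAXNSENT` sentences.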