# Assignment 4 - Naive Machine Translation and LSH

You will now implement your first machine translation system and then you
will see how locality sensitive hashing works. Let's get started by importing
the required functions!

If you are running this notebook on your local computer, don't forget to
download the Twitter samples and stopwords from NLTK.

```
import nltk

nltk.download('stopwords')
nltk.download('twitter_samples')
```

**NOTE**: The `Exercise xx` numbers in this assignment **_are inconsistent_** with the `UNQ_Cx` numbers.

### This assignment covers the following topics:

- [1. The word embeddings data for English and French words](#1)
  - [1.1 Generate embedding and transform matrices](#1-1)
      - [Exercise 1](#ex-01)
- [2. Translations](#2)
  - [2.1 Translation as linear transformation of embeddings](#2-1)
      - [Exercise 2](#ex-02)  
      - [Exercise 3](#ex-03)  
      - [Exercise 4](#ex-04)        
  - [2.2 Testing the translation](#2-2)
      - [Exercise 5](#ex-05)
      - [Exercise 6](#ex-06)      
- [3. LSH and document search](#3)
  - [3.1 Getting the document embeddings](#3-1)
      - [Exercise 7](#ex-07)
      - [Exercise 8](#ex-08)      
  - [3.2 Looking up the tweets](#3-2)
  - [3.3 Finding the most similar tweets with LSH](#3-3)
  - [3.4 Getting the hash number for a vector](#3-4)
      - [Exercise 9](#ex-09)  
  - [3.5 Creating a hash table](#3-5)
      - [Exercise 10](#ex-10)  
  - [3.6 Creating all hash tables](#3-6)
      - [Exercise 11](#ex-11)  

In [1]:
import nltk
import pdb
import pickle
import string
import time
import gensim
import matplotlib.pyplot as plt
import numpy as np
import scipy
import sklearn
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.tokenize import TweetTokenizer
from utils_nb import cosine_similarity
from utils_nb import process_tweet
from utils_nb import get_dict
from os import getcwd

In [2]:
# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

<a name="1"></a>

# 1. The word embeddings data for English and French words

Write a program that translates English to French.

## The data

The full dataset for English embeddings is about 3.64 gigabytes, and the French
embeddings are about 629 megabytes. To prevent the Coursera workspace from
crashing, we've extracted a subset of the embeddings for the words that you'll
use in this assignment.

If you want to run this on your local computer and use the full dataset,
you can download:
* the English embeddings from the Google Code Archive word2vec page
[look for GoogleNews-vectors-negative300.bin.gz](https://code.google.com/archive/p/word2vec/)
    * You'll need to unzip the file first.
* the French embeddings from
[cross_lingual_text_classification](https://github.com/vjstark/crosslingual_text_classification).
    * In the terminal, type (on one line)
    `curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec`

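The cell below loads the full dataset on your local machine and extracts the subset used in this assignment. It expects both embedding files to be present in the current directory.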

In [16]:
# Use this code to load and process the full dataset on your local computer

from gensim.models import KeyedVectors

en_embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
fr_embeddings = KeyedVectors.load_word2vec_format('./wiki.multi.fr.vec')


# loading the english to french dictionaries
en_fr_train = get_dict('en-fr.train.txt')
print('The length of the english to french training dictionary is', len(en_fr_train))
en_fr_test = get_dict('en-fr.test.txt')
print('The length of the english to french test dictionary is', len(en_fr_test))

# Note: `.vocab` is available in gensim < 4.0; in gensim >= 4.0 use `.key_to_index` instead
english_set = set(en_embeddings.vocab)
french_set = set(fr_embeddings.vocab)
en_embeddings_subset = {}
fr_embeddings_subset = {}
french_words = set(en_fr_train.values())

for en_word in en_fr_train.keys():
    fr_word = en_fr_train[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


for en_word in en_fr_test.keys():
    fr_word = en_fr_test[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


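# save the subsets; the `with` blocks close the files once they are written
with open("en_embeddings.p", "wb") as en_file:
    pickle.dump(en_embeddings_subset, en_file)
with open("fr_embeddings.p", "wb") as fr_file:
    pickle.dump(fr_embeddings_subset, fr_file)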

FileNotFoundError: [Errno 2] No such file or directory: './GoogleNews-vectors-negative300.bin'
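
The cell above needs the multi-gigabyte embedding files, so it cannot run without them. As a rough illustration of the same filtering idea, here is a minimal sketch on toy data. The 3-dimensional vectors, the word pairs, and the helper name `subset_embeddings` are all made up for this example and are not part of the assignment's utilities. It keeps a word pair only when both words have an embedding, then round-trips one subset through `pickle`.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical toy data: 3-dimensional vectors stand in for the real 300-dimensional ones.
toy_en_vectors = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
toy_fr_vectors = {"chat": np.array([0.9, 0.1, 0.0]), "maison": np.array([0.0, 0.0, 1.0])}
toy_en_fr = {"cat": "chat", "dog": "chien", "house": "maison"}


def subset_embeddings(pairs, en_vectors, fr_vectors):
    """Keep only the pairs for which both the English and French word have a vector."""
    en_subset, fr_subset = {}, {}
    for en_word, fr_word in pairs.items():
        if en_word in en_vectors and fr_word in fr_vectors:
            en_subset[en_word] = en_vectors[en_word]
            fr_subset[fr_word] = fr_vectors[fr_word]
    return en_subset, fr_subset


en_subset, fr_subset = subset_embeddings(toy_en_fr, toy_en_vectors, toy_fr_vectors)

# Save and reload one subset to check that pickling preserves the vectors.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_en_embeddings.p")
    with open(path, "wb") as f:
        pickle.dump(en_subset, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(sorted(en_subset))  # ['cat']
print(sorted(fr_subset))  # ['chat']
```

Only the `cat`/`chat` pair survives: `chien` has no French vector and `house` has no English vector, so those pairs are dropped, just as out-of-vocabulary pairs are dropped in the cell above.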