# Fastext_demo : Alignment of Multilingual Word Embeddings 

The following demo is a script based on Alex Meritts' usage at https://github.com/JairParra/NLPChallengeAI4Good/blob/master/TextClassification/Text%20classification.ipynb for the AI4Good NLPChallenge. 

The idea is the following: we are given a huge dataset with text in both English and French, and the goal is to classify the text according to a given set of labels; however, the text itself is unlabeled. 

We will then make use of multilingual word vector alignments so that we can compare text in any language (say English **or** French) with the labels, and then assign the set of labels that are most similar to it. These labels could be in any language! 

For this, we need the appropriate pre-trained word vectors. We will be borrowing the fastText_multilingual wrapper (1) as well as the per-language word vectors from Facebook's fastText (2). We will also need the `langdetect` module, as well as `spacy`, which will help us pre-process the text. 

This script requires the following dependencies: 
1. https://github.com/Babylonpartners/fastText_multilingual (wrapper + fastText vector alignment) 
2. https://fasttext.cc/docs/en/crawl-vectors.html (to download the vectors) 
3. https://spacy.io/ (spaCy)  
4. https://pypi.org/project/langdetect/ (language detection) 

## 1. Imports 


In [5]:
import os
import time
import spacy # for language pre-processing 
import gensim 
import numpy as np 
import pandas as pd

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA 

from gensim.test.utils import datapath, get_tmpfile 
from gensim.models import KeyedVectors 
from gensim.scripts.glove2word2vec import glove2word2vec 
    
from nltk.corpus import stopwords
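# FastVector is defined in fasttext.py from the fastText_multilingual repo (1), not in the pip fasttext package 
from fasttext import FastVector 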
from langdetect import detect 

## 2. Loading the tools 

We will do the following:  
- Obtain stopwords for both languages
- Initialize both language models with spaCy 
- Load the vectors into appropriate objects 

In [24]:
# Obtain stopwords for both English and French  
english_stopwords = stopwords.words('english') 
french_stopwords = stopwords.words('french') 

# initialize spaCy's medium French news model and medium English web model 
nlp_fr = spacy.load("fr_core_news_md") 
nlp_en = spacy.load("en_core_web_md") 
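
Once the stopword lists are available, filtering is a simple membership test. Here is a minimal sketch in plain Python, assuming a tiny hand-written list (`toy_stopwords`, hypothetical) in place of the NLTK lists. It converts the list to a set and compares in lowercase, so that capitalised words such as `Le` are removed as well:

```python
def remove_stopwords(tokens, stopword_list):
    """Keep alphabetic tokens that are not stopwords (case-insensitive)."""
    stopword_set = {w.lower() for w in stopword_list}  # set membership is O(1)
    return [t for t in tokens if t.isalpha() and t.lower() not in stopword_set]

# Hypothetical miniature list standing in for stopwords.words('french')
toy_stopwords = ["le", "la", "de", "et"]
tokens = ["Le", "plaisir", "de", "la", "lecture", "2018"]

filtered = remove_stopwords(tokens, toy_stopwords)
print(filtered)
```

The same function would accept the full NLTK lists unchanged, since it only needs an iterable of strings.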

Now we can load the vectors. Note that here I am loading the vectors directly from a hard-coded path. In practice **you shouldn't do this**; use a relative path instead. The reason I'm doing it here is that the vector files themselves are quite heavy (around 4 GB or more), and since I'm using them for various projects in different locations, it doesn't make sense to keep multiple copies everywhere. 
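
One way to avoid the hard-coded path while still keeping a single copy of the vectors is to read the folder from an environment variable. This is only a sketch: the variable name `FASTTEXT_VECTORS_DIR` and the fallback folder `vectors` are assumptions made for illustration, not something the original project defines.

```python
import os

def vectors_path(filename, env_var="FASTTEXT_VECTORS_DIR", default_dir="vectors"):
    """Build the path to a vectors file from an environment variable,
    falling back to a relative folder when the variable is not set."""
    base_dir = os.environ.get(env_var, default_dir)
    return os.path.join(base_dir, filename)

# Same file names as in this notebook; only the folder lookup changes
fr_path = vectors_path("cc.fr.300.vec")
en_path = vectors_path("cc.en.300.vec")
print(fr_path)
print(en_path)
```

Each machine can then point the variable at its own shared folder, and the notebook itself carries no user-specific path.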

### 2.1 Loading the fasttext Facebook research vectors 

First we build the paths to the vector files, then we load them using the `FastVector` class from the `fasttext` module. Notice that these embedding files are HUGE, so loading each one takes roughly 3 to 4 minutes (see the timings below). 

In [2]:
# direct path to source folder , then join the paths to each vectors file
direct_PATH = "C:\\Users\\jairp\\Desktop\\BackUP\\CODE-20180719T021021Z-001\\CODE\\Python\\Datasets\\vectors"
fr_vecs_PATH = os.path.join(direct_PATH, "cc.fr.300.vec") 
en_vecs_PATH = os.path.join(direct_PATH, "cc.en.300.vec") 

print(fr_vecs_PATH)

C:\Users\jairp\Desktop\BackUP\CODE-20180719T021021Z-001\CODE\Python\Datasets\vectors\cc.fr.300.vec


In [3]:
# load the French vectors
t0 = time.time()
fr_dictionary = FastVector(vector_file=fr_vecs_PATH, encoding='utf-8')
t1 = time.time() 
print("Done in {} seconds.".format(t1-t0)) 

reading word vectors from C:\Users\jairp\Desktop\BackUP\CODE-20180719T021021Z-001\CODE\Python\Datasets\vectors\cc.fr.300.vec


loading vector...: 2000000it [03:39, 9125.35it/s] 


In [6]:
# load the alignment matrix 
t0 = time.time()
fr_dictionary.apply_transform('alignment_matrices/fr.txt')
t1 = time.time() 
print("Done in {} seconds.".format(t1-t0)) 

Time elapsed:  10.193409442901611


In [7]:
# Load the English vectors 
t0 = time.time()
en_dictionary = FastVector(vector_file=en_vecs_PATH, encoding='utf-8')
t1 = time.time() 
print("Done in {} seconds.".format(t1-t0)) 

reading word vectors from C:\Users\jairp\Desktop\BackUP\CODE-20180719T021021Z-001\CODE\Python\Datasets\vectors\cc.en.300.vec


loading vector...: 2000000it [03:48, 8751.66it/s] 


Done in 228.54795336723328 seconds.


In [8]:
# load the alignment matrix (English vectors, English matrix) 
t0 = time.time()
en_dictionary.apply_transform('alignment_matrices/en.txt')
t1 = time.time() 
print("Done in {} seconds.".format(t1-t0)) 

Done in 46.83586049079895 seconds.
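
Conceptually, an alignment matrix is a linear map that moves one language's vectors into a space shared with the other languages, so that translations end up close together. The sketch below illustrates this with hypothetical 2-dimensional vectors and a hand-picked rotation matrix; the real matrices are much larger and are read from the `alignment_matrices` files. That each vector is simply multiplied by the matrix is an assumption about the wrapper's internals, not something this notebook verifies.

```python
import math

def apply_alignment(vec, matrix):
    """Multiply a row vector by an alignment matrix (vec @ matrix)."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

# Toy data: the "French" space is the "English" space rotated by 90 degrees
en_cat = [1.0, 0.0]
fr_chat = [0.0, 1.0]
fr_to_shared = [[0.0, -1.0],
                [1.0, 0.0]]  # hypothetical alignment matrix that rotates back

before = cosine(en_cat, fr_chat)
after = cosine(en_cat, apply_alignment(fr_chat, fr_to_shared))
print(before, after)
```

Before alignment the two translations look unrelated; after alignment they coincide, which is what makes cross-language similarity scores meaningful.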


## 3. Loading the data 

First, we will load only the labels from Sheet1 of the Excel file. We open the file in binary mode before handing it to pandas. 

In [20]:
df = pd.read_excel(open('Baseresultats.xlsx', 'rb'),
              sheet_name='Sheet1', header=None)
indicators = df[0].tolist()
indicators = [x.strip() for x in indicators]

print(indicators)

['Autonomie', "Bris de l'isolement", 'Communication', 'Compétences', 'Confiance en soi', 'Connaissance de soi', 'Connaissances', 'Conscientisation / Esprit critique', "Développement (de l'enfant)", 'Développement de pratiques démocratiques', 'Empowerment collectif', 'Empowerment individuel', 'Estime de soi', 'Habiletés cognitives', 'Habiletés dans la vie quotidienne', 'Habiletés sociales', 'Habitudes de vie', 'Identification des besoins', 'Intégration sociale', 'Lien de confiance', 'Liens familiaux', 'Mixité sociale et culturelle', 'Participation citoyenne', 'Plaisir', 'Prise de parole', 'Réciprocité', 'Répit', "Réseau d'entraide", 'Résultats scolaires', 'Sécurité', "Sentiment d'appartenance", 'Sentiment de valorisation', 'Socialisation', 'Soutien']
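
Several of these indicators contain punctuation (`Conscientisation / Esprit critique`, `Développement (de l'enfant)`), so splitting them on spaces yields tokens such as `/` or `(de` that are unlikely to have a word vector. A possible alternative, sketched here with a regular expression rather than anything the original notebook uses, keeps only runs of letters. The helper name `indicator_tokens` is hypothetical.

```python
import re

# One or more Unicode letters: word characters that are neither digits nor underscores
LETTERS = re.compile(r"[^\W\d_]+")

def indicator_tokens(indicator):
    """Split an indicator label into alphabetic tokens, dropping punctuation."""
    return LETTERS.findall(indicator)

print(indicator_tokens("Développement (de l'enfant)"))
print(indicator_tokens("Conscientisation / Esprit critique"))
```

Elided articles such as `l` survive as one-letter tokens; a stopword filter can remove them afterwards.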


## 4. Defining useful functions 

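We will define three helper functions: cosine similarity, a per-language word vector lookup, and the similarity between a document and an indicator. 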

In [9]:
# cosine similarity  
def similarity(v1, v2):
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    return np.dot(v1, v2) / n1 / n2
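
One edge case is worth keeping in mind: this notebook maps unknown words to zero vectors, so a document made up entirely of unknown words averages to the zero vector, and dividing by a zero norm gives `nan`. A hedged variant that returns 0.0 in that case is sketched below. It is written in plain Python so that it runs without NumPy, and the name `safe_similarity` is mine.

```python
import math

def safe_similarity(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # cosine is undefined here; treat it as "no similarity"
    return sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)

print(safe_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(safe_similarity([0.0, 0.0], [1.0, 0.0]))  # zero vector
```

Returning 0.0 is a design choice: it ranks such documents below any real match instead of propagating `nan` through later sorting.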

In [11]:
def vectorByLanguage(word, lang):
    """ 
    Returns the vector representation of the input word 
    according to the language specified. If the word is not 
    in the vocabulary, or the language is neither English 
    nor French, returns a vector of zeros. 
    """

    try:
        if lang == 'en':
            return en_dictionary[word]
        elif lang == 'fr':
            return fr_dictionary[word]
        else:
            print("Found a non-English, non-French doc. Language detected : " + str(lang))
            return np.zeros(300)
    except KeyError:
        # the word is missing from the vocabulary
        return np.zeros(300)
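
The fallback behaviour can be tried without loading the full embeddings. The sketch below swaps the FastVector dictionaries for small hypothetical Python dicts (`toy_vectors`) holding 3-dimensional vectors, and returns zeros both for unknown words and for unsupported languages.

```python
DIM = 3  # toy dimension; the real vectors are far larger

# Hypothetical stand-ins for en_dictionary and fr_dictionary
toy_vectors = {
    "en": {"trust": [0.9, 0.1, 0.0]},
    "fr": {"confiance": [0.8, 0.2, 0.0]},
}

def lookup(word, lang):
    """Return the vector for `word` in `lang`, or zeros if either is unknown."""
    return toy_vectors.get(lang, {}).get(word, [0.0] * DIM)

print(lookup("confiance", "fr"))   # known word, known language
print(lookup("confiance", "en"))   # word missing from the English dict
print(lookup("Vertrauen", "de"))   # unsupported language
```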

In [27]:
def docIndicatorSimilarity(indicator, row, doc_lang): 
    """
    Cosine similarity between an indicator and the document stored in a row. 

    args: 
        @ indicator: indicator (label) text, in French 
        @ row: DataFrame row whose 'RESULTATS_2018' column holds the document 
        @ doc_lang: input document language ('en' or 'fr') 
    """
    
    # POS tags to filter out of the text 
    tags = ["DET","ADP","PUNCT","CONJ","CCONJ","NUM","SYM","SPACE"]
    
    # tokenize the indicator by spaces
    indicator_words = indicator.split(' ')
    
    # obtain word vectors 
    indicator_word_vecs = [vectorByLanguage(word, 'fr') for word in indicator_words]
    
    # component-wise mean of the indicator word vectors 
    indicator_avg_vec = np.mean(indicator_word_vecs, axis=0)
    
    doc = str(row['RESULTATS_2018']) 
    
    # pick the spaCy model and stopword list that match the document language 
    nlp, stops = (nlp_en, english_stopwords) if doc_lang == 'en' else (nlp_fr, french_stopwords)
    
    # obtain sentences 
    sent_toks = [sent for sent in nlp(doc).sents]
    
    # clean the sentence tokens by filtering certain tags and stopwords, numbers, etc. 
    flt_sent_toks = [[token for token in toks if token.pos_ not in tags and token.text.isalpha()
                     and token.text not in stops] for toks in sent_toks]
    
    # flatten the per-sentence lists into a single list of words 
    flat_flt_sent_toks = [item.text for sublist in flt_sent_toks for item in sublist]

    # average the document word vectors and compare with the indicator 
    doc_word_vecs = [vectorByLanguage(word, doc_lang) for word in flat_flt_sent_toks]
    doc_avg_vec = np.mean(doc_word_vecs, axis=0)
    return similarity(indicator_avg_vec, doc_avg_vec)
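
With one similarity score per indicator, labelling a document comes down to ranking the indicators and keeping the best ones. Here is a minimal sketch with hypothetical 2-dimensional vectors. The helper `rank_labels`, the toy vectors, and the choice of `top_k` are illustrative assumptions, not part of the original notebook.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(v1, v2):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)

def rank_labels(doc_word_vecs, label_vecs, top_k=2):
    """Return the top_k (label, score) pairs, most similar first."""
    doc_vec = mean_vector(doc_word_vecs)
    scored = [(label, cosine(doc_vec, vec)) for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical vectors, assumed to live in an already aligned space
doc_word_vecs = [[1.0, 0.0], [0.8, 0.2]]
label_vecs = {"Autonomie": [1.0, 0.1], "Plaisir": [0.0, 1.0], "Soutien": [0.5, 0.5]}

ranking = rank_labels(doc_word_vecs, label_vecs)
print(ranking)
```

In the notebook's setting, the scores would come from `docIndicatorSimilarity` for each entry of `indicators`; the ranking step itself stays the same.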
    

Now we can load the word vectors with gensim instead. 

In [32]:
# load the french vectors. This might take a good damn while 
model_fr = KeyedVectors.load_word2vec_format(fr_vecs_PATH)

KeyboardInterrupt: 

In [None]:
# Load the English vectors
model_en = KeyedVectors.load_word2vec_format(en_vecs_PATH)

In [None]:
# use a distinct name so the lookup function from section 4 is not shadowed 
aimer_vec = model_fr['aimer'] 

# We can see the shape of the vector 
print(aimer_vec.shape)

In [34]:
# Peek at the first five lines. The file is UTF-8, so pass the encoding explicitly:
# without it, Windows falls back to the 'charmap' codec and raises a UnicodeDecodeError.
with open(fr_vecs_PATH, 'r', encoding='utf-8') as infile: 
    for count, line in enumerate(infile): 
        if count >= 5:
            break
        print(line) 

2000000 300

, 0.0058 0.0478 0.1094 -0.0839 -0.2092 0.0072 -0.0780 0.0683 0.0120 -0.0314 -0.0695 -0.0938 -0.0006 0.0257 0.0215 0.1130 0.0517 0.0191 -0.0224 -0.0168 0.0723 0.0711 -0.0505 -0.0987 -0.0960 -0.0695 0.0191 -0.0003 -0.1440 -0.0528 0.0305 0.0586 -0.0246 0.0195 -0.0040 0.0421 -0.0361 0.0546 0.1568 0.0482 -0.0072 -0.0352 -0.0004 0.1192 0.1274 0.1168 -0.0188 -0.0482 0.0467 0.0487 -0.0213 -0.0177 -0.0399 0.0466 0.0376 -0.0011 0.0841 0.0149 -0.2848 0.0367 0.0917 0.0908 0.0493 -0.1145 0.0352 -0.0179 -0.0245 0.0516 0.0297 0.0141 -0.0582 -0.0562 -0.1111 -0.0624 -0.1561 -0.0105 0.0271 -0.0011 0.0857 0.0516 -0.0387 -0.0856 0.0198 -1.1291 -0.0349 -0.0315 0.0705 -0.0057 -0.0195 -0.0522 0.0336 -0.0265 0.0823 0.0362 0.0892 -0.0831 -0.0747 0.1039 -0.0266 0.4814 0.0162 -0.0484 -0.0033 0.0761 -0.0312 0.0213 -0.0188 0.0121 -0.0537 0.0473 0.0583 0.0292 0.0655 0.0111 0.1129 0.0659 -0.0759 0.0795 -0.0425 -0.0335 -0.1145 0.0100 0.0197 -0.1981 -0.0385 0.0319 0.0612 0.0273 -0.0384 0.0350 -0.0085 -0.2
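
The first line of the output, `2000000 300`, is the header of the `.vec` text format: the number of words followed by the vector dimension. Every following line holds a token and then its values, separated by spaces. Below is a small sketch that reads the header and the first few entries, using an in-memory sample (`sample`, made up for illustration) in place of the multi-gigabyte file.

```python
import io
from itertools import islice

def peek_vec(handle, n=2):
    """Read the header and the first n entries of a .vec-style text stream."""
    count, dim = (int(x) for x in handle.readline().split())
    entries = []
    for line in islice(handle, n):  # stops after n lines, never reads the rest
        parts = line.rstrip().split(" ")
        entries.append((parts[0], [float(x) for x in parts[1:]]))
    return count, dim, entries

# Made-up sample: 3 words of dimension 2
sample = io.StringIO("3 2\n, 0.1 0.2\nde 0.3 0.4\nla 0.5 0.6\n")

count, dim, entries = peek_vec(sample)
print(count, dim)
print(entries)
```

For the real file, the same function should work on a handle opened with `open(fr_vecs_PATH, encoding='utf-8')` instead of `sample`.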