
In [3]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.9.2-py2.py3-none-any.whl (213 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3141037 sha256=e084e484b6ae5aa31c2abe739fc31439c29b189db9f62f706db5052b5e4746a4
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a6597a29c8f4f19e38f9c02a345bab9b
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.9.2


In [6]:
import re
import fasttext

## On the first run, load the pretrained vectors that were downloaded.
## Download link: https://fasttext.cc/docs/en/pretrained-vectors.html

#pt_dictionary = FastVector(vector_file='wiki.pt.vec')
#en_dictionary = FastVector(vector_file='wiki.en.vec')

## Then align them with these transformations and save the result to files.
# pt_dictionary.apply_transform('alignment_matrices/pt.txt')
# en_dictionary.apply_transform('alignment_matrices/en.txt')
# print("Writing the files:")
# pt_dictionary.export("pt_model.txt")
# print("Done 1")
# en_dictionary.export("en_model.txt")
# print("Done 2")
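
The commented-out steps above right-multiply each language's embedding matrix by an alignment matrix so that both vocabularies end up in one shared space. A minimal sketch of that operation on toy data follows. It assumes the alignment matrix is orthogonal (how the files in `alignment_matrices/` were produced is not shown in this notebook), and the function name `apply_alignment` and all values are made up for illustration.

```python
import numpy as np


def apply_alignment(embed, transform):
    """Right-multiply embeddings (n_words x n_dim) by a (n_dim x n_dim) matrix."""
    return np.matmul(embed, transform)


# Toy embeddings: three words, two dimensions (made-up values).
embed = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 4.0]])

# Hypothetical alignment matrix: a 90-degree rotation, which is orthogonal.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])

aligned = apply_alignment(embed, rotation)

# An orthogonal transform keeps vector lengths and pairwise dot products,
# so similarities inside one language are unchanged by the alignment.
norms_before = np.linalg.norm(embed, axis=1)
norms_after = np.linalg.norm(aligned, axis=1)
```

Under that orthogonality assumption, the alignment only rotates a language's space, which is why nearest-neighbour lookups within one language behave the same before and after the transform.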

In [7]:
#
# Copyright (c) 2017-present, babylon health
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
#

import numpy as np


class FastVector:
    """
    Minimal wrapper for fastvector embeddings.
    ```
    Usage:
        $ model = FastVector(vector_file='/path/to/wiki.en.vec')
        $ 'apple' in model
        > True
        $ model['apple'].shape
        > (300,)
    ```
    """

    def __init__(self, vector_file='', transform=None):
        """Read in word vectors in fasttext format"""
        self.word2id = {}

        # Captures word order, for export() and translate methods
        self.id2word = []

        print('reading word vectors from %s' % vector_file)
        # Explicit UTF-8 so accented words load the same way on every platform
        with open(vector_file, 'r', encoding='utf-8') as f:
            (self.n_words, self.n_dim) = \
                (int(x) for x in f.readline().rstrip('\n').split(' '))
            self.embed = np.zeros((self.n_words, self.n_dim))
            for i, line in enumerate(f):
                elems = line.rstrip('\n').split(' ')
                self.word2id[elems[0]] = i
                self.embed[i] = elems[1:self.n_dim+1]
                self.id2word.append(elems[0])
        
        # Used in translate_inverted_softmax()
        self.softmax_denominators = None
        
        if transform is not None:
            print('Applying transformation to embedding')
            self.apply_transform(transform)

    def apply_transform(self, transform):
        """
        Apply the given transformation to the vector space

        Right-multiplies given transform with embeddings E:
            E = E * transform

        Transform can either be a string with a filename to a
        text file containing a ndarray (compat. with np.loadtxt)
        or a numpy ndarray.
        """
        transmat = np.loadtxt(transform) if isinstance(transform, str) else transform
        self.embed = np.matmul(self.embed, transmat)

    def export(self, outpath):
        """
        Transforming a large matrix of word vectors is expensive.
        This method writes the transformed matrix back to a file for future use.

        :param outpath: path of the output file to be written
        """
        # The context manager closes the file even if a write fails
        with open(outpath, "w", encoding="utf-8") as fout:
            # Header records the number of words and the vector dimension
            fout.write(str(self.n_words) + " " + str(self.n_dim) + "\n")
            for token in self.id2word:
                vector_components = ["%.6f" % number for number in self[token]]
                vector_as_string = " ".join(vector_components)

                out_line = token + " " + vector_as_string + "\n"
                fout.write(out_line)

    def translate_nearest_neighbour(self, source_vector):
        """Obtain translation of source_vector using nearest neighbour retrieval"""
        similarity_vector = np.matmul(FastVector.normalised(self.embed), source_vector)
        target_id = np.argmax(similarity_vector)
        return self.id2word[target_id]

    def translate_inverted_softmax(self, source_vector, source_space, nsamples,
                                   beta=10., batch_size=100, recalculate=True):
        """
        Obtain translation of source_vector using sampled inverted softmax retrieval
        with inverse temperature beta.

        nsamples vectors are drawn from source_space in batches of batch_size
        to calculate the inverted softmax denominators.
        Denominators from previous call are reused if recalculate=False. This saves
        time if multiple words are translated from the same source language.
        """
        embed_normalised = FastVector.normalised(self.embed)
        # calculate contributions to softmax denominators in batches
        # to save memory
        if self.softmax_denominators is None or recalculate:
            self.softmax_denominators = np.zeros(self.embed.shape[0])
            while nsamples > 0:
                # get batch of randomly sampled vectors from source space
                sample_vectors = source_space.get_samples(min(nsamples, batch_size))
                # calculate cosine similarities between sampled vectors and
                # all vectors in the target space
                sample_similarities = \
                    np.matmul(embed_normalised,
                              FastVector.normalised(sample_vectors).transpose())
                # accumulate contribution to denominators
                self.softmax_denominators \
                    += np.sum(np.exp(beta * sample_similarities), axis=1)
                nsamples -= batch_size
        # cosine similarities between source_vector and all target vectors
        similarity_vector = np.matmul(embed_normalised,
                                      source_vector/np.linalg.norm(source_vector))
        # exponentiate and normalise with denominators to obtain inverted softmax
        softmax_scores = np.exp(beta * similarity_vector) / \
                         self.softmax_denominators
        # pick highest score as translation
        target_id = np.argmax(softmax_scores)
        return self.id2word[target_id]

    def get_samples(self, nsamples):
        """Return a matrix of nsamples randomly sampled vectors from embed"""
        sample_ids = np.random.choice(self.embed.shape[0], nsamples, replace=False)
        return self.embed[sample_ids]

    @classmethod
    def normalised(cls, mat, axis=-1, order=2):
        """Utility function to normalise the rows of a numpy array."""
        norm = np.linalg.norm(
            mat, axis=axis, ord=order, keepdims=True)
        norm[norm == 0] = 1
        return mat / norm
    
    @classmethod
    def cosine_similarity(cls, vec_a, vec_b):
        """Compute cosine similarity between vec_a and vec_b"""
        return np.dot(vec_a, vec_b) / \
            (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

    def __contains__(self, key):
        return key in self.word2id

    def __getitem__(self, key):
        return self.embed[self.word2id[key]]
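
The two retrieval methods of `FastVector` can be compared on toy data. The sketch below re-implements them as standalone functions. The names `nearest_neighbour` and `inverted_softmax` and every vector value are made up for illustration, and all source samples are used at once rather than drawn randomly in batches as the class does.

```python
import numpy as np


def normalise_rows(mat):
    """Scale each row to unit length (zero rows are left unchanged)."""
    norm = np.linalg.norm(mat, axis=-1, keepdims=True)
    norm[norm == 0] = 1
    return mat / norm


def nearest_neighbour(target, source_vector):
    """Index of the target row with the highest cosine similarity."""
    sims = normalise_rows(target) @ (source_vector / np.linalg.norm(source_vector))
    return int(np.argmax(sims))


def inverted_softmax(target, source_vector, source_samples, beta=10.0):
    """Index of the best target row after dividing by per-target denominators."""
    target_n = normalise_rows(target)
    # Similarity of every target word to every sampled source word
    sample_sims = target_n @ normalise_rows(source_samples).T
    # A target word that is close to many source words gets a large denominator
    denominators = np.exp(beta * sample_sims).sum(axis=1)
    sims = target_n @ (source_vector / np.linalg.norm(source_vector))
    scores = np.exp(beta * sims) / denominators
    return int(np.argmax(scores))


# Toy target space: row 0 lies close to most source samples, row 1 does not.
target = np.array([[0.6, 0.8],
                   [-1.0, 1.0]])
# Toy source space; the last row is also the query.
source_samples = np.array([[1.0, 0.0],
                           [0.8, 0.6],
                           [0.6, 0.8],
                           [0.0, 1.0]])
query = np.array([0.0, 1.0])

nn_choice = nearest_neighbour(target, query)
isf_choice = inverted_softmax(target, query, source_samples)
```

With these toy values, plain nearest-neighbour retrieval picks row 0, while the inverted softmax picks row 1, because row 0 is penalised for being similar to almost every source sample. This is only an illustration of the mechanism; how much it helps on the real vectors is not measured in this notebook.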


In [8]:
pt_dictionary = FastVector(vector_file='/content/drive/MyDrive/SPLN/pt_model.txt')
en_dictionary = FastVector(vector_file='/content/drive/MyDrive/SPLN/en_model.txt')

reading word vectors from /content/drive/MyDrive/SPLN/pt_model.txt
reading word vectors from /content/drive/MyDrive/SPLN/en_model.txt
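
The files loaded above use the plain-text layout that `export()` writes and `__init__` reads: a header line with the word count and the vector dimension, then one line per word with its components. A small round-trip sketch of that layout, using an in-memory buffer instead of a file, is shown here. The helper names `write_vectors` and `read_vectors` and the toy values are hypothetical.

```python
import io

import numpy as np


def write_vectors(words, embed, fout):
    """Write a header line, then one 'word v1 v2 ...' line per word."""
    n_words, n_dim = embed.shape
    fout.write("%d %d\n" % (n_words, n_dim))
    for word, vector in zip(words, embed):
        fout.write(word + " " + " ".join("%.6f" % x for x in vector) + "\n")


def read_vectors(fin):
    """Read the layout produced by write_vectors back into a list and a matrix."""
    n_words, n_dim = (int(x) for x in fin.readline().rstrip("\n").split(" "))
    words = []
    embed = np.zeros((n_words, n_dim))
    for i, line in enumerate(fin):
        elems = line.rstrip("\n").split(" ")
        words.append(elems[0])
        embed[i] = [float(x) for x in elems[1:n_dim + 1]]
    return words, embed


# Toy data (made-up values) written to and read from an in-memory buffer.
words = ["olá", "janela"]
embed = np.array([[0.5, -1.25, 2.0],
                  [0.0, 3.5, -0.125]])

buffer = io.StringIO()
write_vectors(words, embed, buffer)
buffer.seek(0)
words_back, embed_back = read_vectors(buffer)
```

Because components are written with six decimal places, a round trip through this format can lose precision beyond that point.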


In [9]:
## Sentence translation, word by word (EN -> PT)
aux = "always sat with his back to the window in his office on the ninth floor"
aux = re.split(r" ", aux)
res = ""
for word in aux:
    en_vector = en_dictionary[word]
    res += " " + pt_dictionary.translate_nearest_neighbour(en_vector)

print("Translation: ", res)

Translation:   sempre cnvp com seu voltar para a janela em seu escritório em a oitavo térreo
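
Word-by-word output like the line above returns only the single best match per word, so a near miss cannot be told apart from a wild guess. One way to inspect the alternatives is to list the top-k neighbours with their cosine similarities. The sketch below does this on toy data. The function `top_k_neighbours`, the word list, and the vectors are all hypothetical and do not come from the real `pt_model.txt`.

```python
import numpy as np


def top_k_neighbours(words, embed, source_vector, k=3):
    """Return the k (word, cosine similarity) pairs closest to source_vector."""
    rows = embed / np.linalg.norm(embed, axis=1, keepdims=True)
    sims = rows @ (source_vector / np.linalg.norm(source_vector))
    best = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(words[i], float(sims[i])) for i in best]


# Hypothetical target vocabulary with made-up 2-D vectors.
words = ["nono", "oitavo", "andar", "térreo"]
embed = np.array([[1.0, 0.0],
                  [0.9, 0.3],
                  [0.0, 1.0],
                  [0.2, 0.9]])

# Hypothetical query vector standing in for an aligned English word.
query = np.array([1.0, 0.1])

candidates = top_k_neighbours(words, embed, query, k=2)
```

In the notebook itself, `words` and `embed` would correspond to `pt_dictionary.id2word` and `pt_dictionary.embed`, and the query to a row from `en_dictionary`.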


In [10]:
## Translate a word given as input (EN -> PT)
while True:
    print("Enter a word to translate into Portuguese: ")
    valor = input()
    print("---> ", valor)
    en_vector = en_dictionary[valor]
    print("Translation: ", pt_dictionary.translate_nearest_neighbour(en_vector))

Enter a word to translate into Portuguese: 
hello
--->  hello
Translation:  olá
Enter a word to translate into Portuguese: 
coisas
--->  coisas
Translation:  #¿por
Enter a word to translate into Portuguese: 
vaca
--->  vaca
Translation:  grijalva
Enter a word to translate into Portuguese: 
cavalo
--->  cavalo
Translation:  guaraldo
Enter a word to translate into Portuguese: 
mãe
--->  mãe
Translation:  aracy
Enter a word to translate into Portuguese: 
tua
--->  tua
Translation:  fili
Enter a word to translate into Portuguese: 
cona
--->  cona
Translation:  conchiglia
Enter a word to translate into Portuguese: 
vagina
--->  vagina
Translation:  vagina
Enter a word to translate into Portuguese: 
penis
--->  penis
Translation:  pênis
Enter a word to translate into Portuguese: 
mamilo
--->  mamilo


KeyError: ignored
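
The session above stops with a `KeyError` because the last input is not in the English vocabulary, and the loop has no way to exit cleanly. A minimal sketch of a more forgiving loop follows. It relies on the `in` and `[]` operators that `FastVector` defines, but uses a hypothetical stand-in class (`ToyVectors`), a hypothetical helper (`translate_words`), and a fixed list of inputs in place of `input()` so that it runs on its own.

```python
import numpy as np


class ToyVectors:
    """Hypothetical stand-in exposing the same `in` / `[]` interface as FastVector."""

    def __init__(self, table):
        self.table = table

    def __contains__(self, key):
        return key in self.table

    def __getitem__(self, key):
        return self.table[key]


def translate_words(words, source, translate):
    """Translate each word; unknown words yield None instead of raising KeyError."""
    results = []
    for word in words:
        word = word.strip()
        if not word:  # an empty entry ends the session
            break
        if word not in source:
            results.append((word, None))
            continue
        results.append((word, translate(source[word])))
    return results


# Toy English vocabulary with one made-up vector.
en = ToyVectors({"hello": np.array([1.0, 0.0])})

# The lambda stands in for pt_dictionary.translate_nearest_neighbour.
results = translate_words(["hello", "mamilo", "", "ignored"], en, lambda v: "olá")
```

In the notebook, `source` would be `en_dictionary`, `translate` would be `pt_dictionary.translate_nearest_neighbour`, and the words would come from `input()`. Note that several inputs in the session above are Portuguese words looked up in the English vocabulary, which may explain the unrelated results they return.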