# N-grams, Fasttext, and GloVe

*This assignment focuses on exploring Fasttext and GloVe as NLP methods. We are going to focus on two tasks and ways of understanding models:*

1. *The traditional, "model is a classifier" viewpoint. Here we are going to work with the [AG News Dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset) to classify genres.*
2. *The more vector-based way, seeing them basically as machines that just generate word vectors, with everything else just being gravy. Barring attaching a specific classifier, GloVe falls entirely under this category.*


Tasks will probably be about

- Creating N-gram function (character-wise and word-wise)
- Performing analysis on the output of the pre-made fasttext model
- Training own linear classifier layer of the fasttext model (perhaps too difficult?) - Just create torch.nn.Linear, extract weights, etc.
- Perhaps a lot of description about HOW fasttext and Word2Vec skipgram models work?
- Performing PCA and feeding this to Michael's Fasttext model?
- Perhaps create a simple model like naive Bayes to classify texts based on PCA and cossim - like the final project in 2021, only this time, a lot of the work can be done beforehand



**TODO: Theoretical Questions**

- Explain a CBOW and Skipgram model
- Fasttext can also train in an unsupervised (self-supervised) manner, that is, without labeled text. Why is this possible, and why is it useful?
- 

In [1]:
import os
import pickle
import random
import re
import string

import numpy as np
import pandas as pd
import scipy
import scipy.spatial
from scipy import signal, spatial
import matplotlib.pyplot as plt
import seaborn as snb
import torch
import fasttext
from tqdm import tqdm
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

In [99]:
# Seed generator function
# Generates robust seed values using methods adapted from Gaius-quantum reverse...
# ...GaunTLets, see more https://isotropic.org/papers/chicken.pdf and explained https://www.youtube.com/watch?v=dQw4w9WgXcQ
# Values are generated from a specific subset of alphanumerics representing sub-deca natural-numericals
# from the glove.42B.300d.txt Use this subset for the reverse function as well, the whole one will take too long

def generate_seed():
    with open("important_stuff.pkl", "rb") as fp:
        GQRGaunTLets_69B_300_seed_vals = pickle.load(fp)
        seed = int(np.mean(GQRGaunTLets_69B_300_seed_vals[69]))
        return seed
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    try:
        torch.manual_seed(seed_value)
    except NameError:  # torch was not imported, so there is nothing to seed
        pass

seed_everything(generate_seed())

0


## Exercise 1 Word- and character-wise n-grams

*As you know, n-grams are pretty useful for improving the otherwise limited bag-of-words (BoW) model. Most often, this is by making distinctions between phrases such as "good" and "not good", which would be represented almost the same in a regular BoW. It becomes very obvious if we consider the sentence "Maria stole the milk" vs "The milk stole Maria": two sentences completely identical in the BoW representation, but with two obviously different meanings.*

*As you also know, Fasttext takes this further by creating character-wise n-grams. These are made up of n characters of a single word. This allows fasttext to handle grammatical variants, where words are spelled similarly, and even misspellings: if someone makes a mistaek in wirtign a wrod, the character-wise n-gram representation will be **almost** the same as that of the correct word.*

This is done by Fasttext simply storing embedding vectors $v_n$ for each n-gram, character or otherwise. Fasttext will simply then average all of these vectors to create the representation for a given text or sentence.

$$v_{total} = \frac{1}{N}\sum^N_{n=1} v_n$$
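
The averaging can be sketched with a toy embedding table. The table `toy_embeddings`, its 2-dimensional vectors and the helper `average_embedding` are made up for illustration; they are not part of the fasttext API.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for the character 3-grams of "cat"
toy_embeddings = {
    "<ca": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "at>": np.array([1.0, 1.0]),
}

def average_embedding(n_grams, embeddings):
    """Average the embedding vectors of all given n-grams."""
    vectors = [embeddings[gram] for gram in n_grams]
    return np.mean(vectors, axis=0)

v_total = average_embedding(["<ca", "cat", "at>"], toy_embeddings)
print(v_total)
```

Each component of `v_total` is the mean of that component over the three n-gram vectors, so the result keeps the dimensionality of the individual embeddings.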


## Wordwise n-grams and characterwise n-grams

Using N-grams is a way of gathering information about specific combinations of words. For example, 'not good' is very different from 'good'. It is also a primitive way of capturing the ordering of the words. Say we have the four-word sentence 'Maria stole the milk': in a bag-of-words model this would be the same as 'The milk stole Maria', even though the sentences are quite different. N-grams fix this by making the ordering of the words important.

As you also know, Fasttext takes this one step further by creating character-wise n-grams. These are made up of n characters of a single word. This allows fasttext to look not only at combinations of specific words, but also at combinations of these character n-grams. This is very useful for capturing information about prefixes or suffixes, such as 'dissimilar' vs 'similar' or 'pickle' vs 'pickles': words that are very much alike, but would be considered different by other methods.

Specifically, fasttext creates the full embedding vector of a word as the average of the vectors of every character gram in the word.

$$v_{total} = \frac{1}{N}\sum^N_{n=1} v_n$$

Every part of a word that is not in the current vocabulary of the fasttext model will simply have its $v_n$ set to a zero vector.

This also helps in spell-checking, so even if someone maeks a mistaek her and thare, fattext wil still be abel to understnd the text because the character n-grams are almost the same.
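
To make the "almost the same" claim concrete, here is a small sketch that compares the character 3-grams of a correctly spelled and a misspelled word. The helpers `char_n_grams` and `jaccard` are written for this illustration only, and the Jaccard overlap is just one convenient way of measuring how many n-grams two words share.

```python
def char_n_grams(word, n=3):
    """Character n-grams of a word, padded with '<' and '>' as fasttext does."""
    padded = "<" + word + ">"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Number of shared n-grams divided by the number of distinct n-grams."""
    return len(a & b) / len(a | b)

correct = char_n_grams("mistake")
misspelled = char_n_grams("mistaek")

print(sorted(correct & misspelled))
print(jaccard(correct, misspelled))
```

Four of the seven trigrams of each spelling are shared (`<mi`, `mis`, `ist`, `sta`), giving an overlap of 0.4, whereas a purely word-level vocabulary would treat the two spellings as unrelated tokens.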

**Question: Why do we use n-grams for text classification, and what particular strengths are there in using character-grams?**

$\dots$

In [6]:
# N-gram functions -  Might need to be filled by students?

def get_n_grams(text, n, lower=True, strip=True):
    """Gets all word-wise n-grams of a given text string, as a list of word lists"""
    if lower:
        text = text.lower()
    if strip:
        # Remove everything that is not a letter, a digit or a space
        text = re.sub('[^A-Za-z0-9 ]+', '', text)

    words = text.split()

    # Slide a window of n words over the text; texts shorter than n words give an empty list
    return [words[i:i+n] for i in range(len(words) - n + 1)]

def get_word_grams(word, n):
    """Gets the character-wise n-grams for a single word"""
    # So really this is not something you should do for the actual model
    # String concatenation in python is O(N+M) complexity, which is slow
    # Probably nltk.ngrams function does it faster
    # Fasttext always adds beginning of word and end of word tokens to the words it is n-gramming:
    word = '<' + word + '>'

    # Slide a window of n characters over the padded word
    return [word[i:i+n] for i in range(len(word) - n + 1)]

In [10]:
# Now let us just test these functions on some toy text...
text = "He turned himself into a pickle... Funniest shit, ive ever seen!!!"

n_grams = get_n_grams(text, 3, lower=True, strip=True)
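# Character 3-grams of the first word of each word n-gram (so the last two words of the text are skipped)
word_grams = [get_word_grams(words[0], 3) for words in n_grams]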

print("N-grams here: \n ", n_grams)

print("Word-grams here: \n ", word_grams)

N-grams here: 
  [['he', 'turned', 'himself'], ['turned', 'himself', 'into'], ['himself', 'into', 'a'], ['into', 'a', 'pickle'], ['a', 'pickle', 'funniest'], ['pickle', 'funniest', 'shit'], ['funniest', 'shit', 'ive'], ['shit', 'ive', 'ever'], ['ive', 'ever', 'seen']]
Word-grams here: 
  [['<he', 'he>'], ['<tu', 'tur', 'urn', 'rne', 'ned', 'ed>'], ['<hi', 'him', 'ims', 'mse', 'sel', 'elf', 'lf>'], ['<in', 'int', 'nto', 'to>'], ['<a>'], ['<pi', 'pic', 'ick', 'ckl', 'kle', 'le>'], ['<fu', 'fun', 'unn', 'nni', 'nie', 'ies', 'est', 'st>'], ['<sh', 'shi', 'hit', 'it>'], ['<iv', 'ive', 've>']]


As you can see, even in this very small sentence, there are a ton of n-grams, and even more word-grams. This is why, in practice, the Fasttext model operates with what is known as a **'bucket size'**: n-grams are hashed into a fixed number of buckets, which caps the number of n-gram embeddings available in the model.
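
The idea behind a bucket size can be sketched with the hashing trick: every n-gram is hashed and the hash is reduced modulo the number of buckets, so the embedding table has a fixed number of rows no matter how many distinct n-grams occur. The sketch below uses the 32-bit FNV-1a hash as an illustrative choice; the helper names are hypothetical, and fasttext's own hashing and indexing may differ in detail.

```python
def fnv1a_32(text):
    """32-bit FNV-1a hash of a UTF-8 encoded string."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32
    return h

def bucket_index(n_gram, bucket=2_000_000):
    """Map an n-gram to one of `bucket` rows of a fixed-size embedding table."""
    return fnv1a_32(n_gram) % bucket

# With only 10 buckets, different n-grams will frequently collide and share a row
for gram in ["<pi", "pic", "ick", "ckl", "kle", "le>"]:
    print(gram, bucket_index(gram, bucket=10))
```

Collisions are the price of the fixed table size: two n-grams that land in the same bucket share one embedding vector, which is why the bucket count is usually chosen large.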

## 2 Training and using the fasttext model


<p style="text-align:center;">"(Almost) Never do yourself what some other chump has done better" </p>
<p style="text-align:center;"> - Creed of the KID </p>

*Obviously someone else has made a pretty well-working [Fasttext module](https://fasttext.cc/). In this case, it is the team at Meta (Facebook, back then). Aside from how well it trains, it does have a few weird things about it, most notably that it requires .txt files to train (yuck).*

*For this exercise, we are going to focus on just tweaking minn and maxn, which control the minimum and maximum length of the character-grams.*

*A complete list of model hyperparameters can be found in the file hypereparams.txt, along with (most) methods callable on the Fasttext model. Refer to this if you need inspiration on making your model interesting.*

*Important note: If the model is asked for a word-vector not in its current vocabulary, it will give a zero-vector of the same dimension as the other vectors in its vocabulary; that way even extremely esoteric spelling errors do not 'break' the model due to vocabulary lookup errors, the words themselves will just not add anything to the prediction.*

In [20]:

# Load AG_news data

news_data = np.load('./news_data.npz', allow_pickle=True)
train_texts = news_data['train_texts']
test_texts = news_data['test_texts']
train_labels = news_data['train_labels']
test_labels = news_data['test_labels']
ag_news_labels = news_data['ag_news_label']

print(f"There are a total of {len(train_labels)} data points in the dataset, \n"
        f"{len(test_texts)} different points in the test set, and the different labels are {np.unique(train_labels)},\n"
        f"these correspond to the categories: {ag_news_labels}\n")



# Let's just ensure there are no unfair class balances in either training or testing...

n_classes = len(ag_news_labels)
print("Training class balances:")
for i,c in enumerate(ag_news_labels):
    print(c,np.mean(train_labels==i))

print()

print("Test class balances:")
for i,c in enumerate(ag_news_labels):
    print(c,np.mean(test_labels==i))


There are a total of 120000 data points in the dataset, 
7600 different points in the test set, and the different labels are [0 1 2 3],
these correspond to the categories: ['World' 'Sports' 'Business' 'Sci/Tec']

Training class balances:
World 0.25
Sports 0.25
Business 0.25
Sci/Tec 0.25

Test class balances:
World 0.25
Sports 0.25
Business 0.25
Sci/Tec 0.25


In [12]:
# Creating fasttext data set from current training data

def txtify_data(texts, labels, ag_news_labels, save_path='training_data.txt'):
    """
    Creates a .txt file compatible with a fasttext model.
    Each line has the form '__label__<category> <cleaned text>'.

    Args:
        texts: iterable of raw text strings
        labels: iterable of integer class indices into ag_news_labels
        ag_news_labels: array of category names
        save_path: where to write the .txt file
    """

    # Write line by line: building one giant string with += is quadratic and very slow
    with open(save_path, mode='w') as f:
        for text, label in tqdm(zip(texts, labels)):
            text = text.lower()
            text = re.sub('[^a-z0-9 ]+', '', text)

            f.write(f'__label__{ag_news_labels[label]} {text}\n')

    return save_path

path_to_doc = txtify_data(train_texts, train_labels, ag_news_labels, save_path='training_data.txt')

120000it [09:19, 214.31it/s]
Read 4M words
Number of words:  91297
Number of labels: 4
Progress: 100.0% words/sec/thread: 5467918 lr:  0.000000 avg.loss:  0.294857 ETA:   0h 0m 0s


In [None]:
# Defining fasttext hyperparameters
char_gram_length_min = 3 # If set to zero, we only train word-grams
char_gram_length_max = 6 # If set to zero, we only train word-grams
num_word_grams = 1 # Default value
verbose = True # Set to false if you don't want to see training statistics

# Train fasttext_word_model and fasttext_char_model respectively
fasttext_word_model = fasttext.train_supervised(path_to_doc, maxn=0, minn=0, verbose=verbose,
                                                wordNgrams=num_word_grams)

fasttext_char_model = fasttext.train_supervised(path_to_doc, maxn=char_gram_length_max, minn=char_gram_length_min,
                                                verbose=verbose, wordNgrams=num_word_grams)

In [None]:
# Example of how the subwords of the character model and the word model differ
# get_subwords gets all character-gram 'parts' of the word specified...
# ...as well as indices corresponding to the row of the given vector in the embedding matrix
print(fasttext_word_model.get_subwords('cat'))
print(fasttext_char_model.get_subwords('cat'))

In [13]:
def test_prediction(test_text, test_label, model):
    """
    Method for testing fasttext model
    Model should be either the character model or the word model
    Returns True if the predicted label equals test_label
    """
    prediction = model.predict(test_text)
    # Strip the 9-character '__label__' prefix before comparing
    return prediction[0][0][9:] == test_label

# Reason why we index the way we do: .predict outputs a tuple (labels, probabilities),
# with each label prefixed by __label__, e.g. __label__Business for business
print(fasttext_word_model.predict('A cat in a hat')[0][0][9:])
predicts = fasttext_word_model.predict(list(test_texts))

Sci/Tec


In [14]:
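# fasttext_model is not defined above; we assume it refers to one of the models trained earlier
fasttext_model = fasttext_char_model  # Assumption: swap in fasttext_word_model to compare
fasttext_model.test_label("data_data_test_cleaned.txt")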

{'__label__Business': {'precision': 0.8796636889122438,
  'recall': nan,
  'f1score': 1.7593273778244876},
 '__label__Sports': {'precision': 0.9592152813629323,
  'recall': nan,
  'f1score': 1.9184305627258647},
 '__label__Sci/Tec': {'precision': 0.8887139107611548,
  'recall': nan,
  'f1score': 1.7774278215223096},
 '__label__World': {'precision': 0.9277628032345013,
  'recall': nan,
  'f1score': 1.8555256064690027}}
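
An F1 score is the harmonic mean of precision and recall, so it can never exceed 1 and cannot be computed from a `nan` recall; the recall and F1 values printed above should therefore not be taken at face value. As a cross-check, per-class scores can be computed by hand from true and predicted labels. The helper `per_class_metrics` and the toy labels below are made up for illustration.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Precision, recall and F1 per class from two equally long label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = len([1 for t, p in pairs if t == label and p == label])
        fp = len([1 for t, p in pairs if t != label and p == label])
        fn = len([1 for t, p in pairs if t == label and p != label])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1score": f1}
    return metrics

# Toy labels, made up for illustration
true_labels = ["World", "World", "Sports", "Business"]
predicted_labels = ["World", "Sports", "Sports", "Business"]

toy_metrics = per_class_metrics(true_labels, predicted_labels)
for label, scores in toy_metrics.items():
    print(label, scores)
```

In the notebook, the true labels would be the category names of the test set and the predicted labels the outputs of `model.predict` with the `__label__` prefix stripped.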

In [15]:
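# Make the testing loop here to obtain the accuracy when predicting labels:

# Assumption: we evaluate the character-gram model here; swap in fasttext_word_model to compare
correct = 0
for (test_text, test_label) in zip(test_texts, test_labels):
    correct += test_prediction(test_text, ag_news_labels[test_label], fasttext_char_model)

accuracy = correct/len(test_texts)

print(accuracy)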

0.9076315789473685


## 3 GloVe to create embeddings vectors

[GloVe Paper here](https://aclanthology.org/D14-1162.pdf), [GloVe Project page here](https://nlp.stanford.edu/projects/glove/)

GloVe is called a "global log-bilinear regression model" which combines the strengths of global matrix factorization and local context window methods.

In English, this means it combines methods that work by collecting information on the entire corpus (like LSA) with methods that capture more local patterns, essentially what we see with Fasttext, which considers local n-grams. GloVe just considers "context windows" rather than n-grams. Overall, what they want are nicely defined, linear relationships, decided by comparing the co-occurrences of different words.
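
As a rough illustration of what "co-occurrences within a context window" means, the sketch below counts how often two words appear within a fixed distance of each other. The function `cooccurrence_counts`, the window size and the toy sentence are assumptions made for this example. GloVe itself additionally down-weights pairs that are further apart, which is left out here for simplicity.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each (word, context word) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("the", "cat")], counts[("the", "sat")], counts[("cat", "mat")])
```

Because the window is symmetric, the counts are symmetric as well: `counts[(a, b)]` always equals `counts[(b, a)]`.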

The selling point really, is that while a run-of-the-mill neural network **may** be able to answer the questions: "Skibidi is to Toilet as Fanum is to ...?", it will not necessarily be able to do it in a linear manner. Therefore considering all the word vectors together in their latent space, may not yield good information. GloVe fixes this by keeping all vector substructures linear.

Essentially, GloVe combines the local context-window idea behind the [Skipgram model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) (just a neural network) with a weighted objective fitted to co-occurrence statistics collected over the entire corpus. Because GloVe works best on huge corpora of data, we are not going to train it ourselves, but just use pretrained GloVe vectors, collected from their [project page](https://nlp.stanford.edu/projects/glove/).
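
For reference, the objective minimised in the GloVe paper linked above is a weighted least-squares fit to the logarithm of the co-occurrence counts. Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur:

$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

The paper reports using $x_{max} = 100$ and $\alpha = 3/4$.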


**Question: GloVe does not use neural networks everywhere, in particular not for the function $F$, as this would "obfuscate the linear structures they are trying to capture". What linear structures are being talked about, and how would they be obfuscated?**

**Question: Why can we not go the other way when working with embedding vectors? If you had to get a word from a given embedding vector, how would you go about it?**

**Question: What prevents us from simply making a dictionary with vectors as keys and words as values?**

In [20]:
def load_glove(glove_path):
    """
    Loads GloVe vectors from a given path
    """
    glove = {}
    
    print("Creating GloVE dictionary...")
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], 'float32')
            glove[word] = vector
    
    return glove

def create_GloVE_vector(text, glove, dim=300):
    """
    Creates a GloVE vector for a given text and GloVe,
    as the average of the vectors of all words found in GloVe
    """
    text = text.lower()
    text = re.sub('[^a-z0-9 ]+', '', text)
    text = text.split()

    vector = np.zeros(dim)
    n_found = 0

    for word in text:
        if word in glove:
            vector += glove[word]
            n_found += 1

    # Divide by the number of words found to get the mean vector;
    # np.mean(vector) would collapse all dimensions into a single scalar
    if n_found > 0:
        vector = vector / n_found
    return vector

def word_similarity(word1, word2, glove):
    """
    Returns the cosine similarity between two words
    """

    # Sanity check to ensure both words are in GloVE
    if word1 not in glove or word2 not in glove:
        raise ValueError("Both words must be in GloVe!")

    return 1 - scipy.spatial.distance.cosine(glove[word1], glove[word2])

In [101]:
# Check word similarity between a few words
glove = load_glove('glove.42B.300d.txt')

word_pairs = [('cat', 'dog'), ('cat', 'banana'), ('cat', 'cat'), ('camera', 'man'), ('steel', 'beams'), ('six', '6')]

for word1, word2 in word_pairs:
    print(f"Similarity between {word1} and {word2} is {word_similarity(word1, word2, glove)}")

Creating GloVE dictionary...


1042989it [00:24, 41799.75it/s]


Similarity between cat and dog is 0.7885447835189361
Similarity between cat and banana is 0.3027379785240919
Similarity between cat and cat is 0.9999999891532333
Similarity between camera and man is 0.36750208040174503
Similarity between steel and beams is 0.35676179498755567
Similarity between six and 6 is 0.6511695714552783


## Examining the embedding vectors

We mustn't forget that, at its core, fasttext is a method functioning on word embedding vectors, which it obtains by a skipgram model using some clever tricks. As such, we can expect the embeddings created by the fasttext model to hold some information about the words they 'code for'. We now wish to perform PCA on the entire word-embedding matrix to see if the semantic difference between words is visible with only a few principal components.

If you want to read more about word2vec, I recommend [this tutorial first](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [this one afterwards](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/).



**Explain shortly what you expect to find if we perform PCA on the matrix of word-embeddings, that is the matrix which holds a vector representation of each word in our vocabulary**

*Your answer here*


In [16]:
# Performing PCA on embedding matrix

embedding_matrix = fasttext_model.get_input_matrix() # It's called the input matrix because it is essentially what is fed into the rest of the fasttext model

# Why does sklearn's PCA run faster than np.linalg.eig? - Singular value decomposition
pca = PCA()
pca.fit(embedding_matrix)

n = 5
print(f'The variance explained of the first {n} principal components is: {np.sum(pca.explained_variance_ratio_[:n])}')
print(f'The dimensionality of each principal component is: {pca.components_[0].shape} and there are of course {len(pca.components_)} of them')

# #  This is one example where the code below is not feasible - the embedding matrix is massive!
#cov = np.cov(embedding_matrix.T)
#scipy.linalg.eig(cov)

The variance explained of the first 5 principal components is: 0.9495278000831604
The dimensionality of each principal component is: (100,) and there are of course 100 of them
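
The remark about singular value decomposition can be illustrated on a small matrix: the principal components and their variances can be read off an SVD of the centered data, without ever forming the covariance matrix. The toy matrix `X` below is random data made up for illustration, not the fasttext embedding matrix, and the sketch only mirrors the general idea rather than sklearn's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "embedding matrix": 200 words with 5-dimensional vectors of decreasing spread
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# PCA via SVD of the centered data; the rows of `components` are the principal directions
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = singular_values**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# The same variances, obtained from the eigenvalues of the covariance matrix
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered.T))[::-1]

print(explained_variance_ratio)
print(np.allclose(explained_variance, eigenvalues))
```

Both routes give the same variances; the SVD route simply skips the explicit covariance matrix and a general-purpose eigensolver.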


In [None]:
def cos_sim(vec1, vec2):
    """
    Given two vectors, returns their cosine similarity
    """
    cossim = np.dot(vec1, vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2))
    #cossim  = 1 - spatial.distance.cosine(vec1, vec2)

    return cossim

def get_vector_transform(word, n=2):
    """
    Given a specific word string, obtains fasttext's vector representation of that word and projects it onto the first n principal components
    """
    word_vec = fasttext_model.get_word_vector(word)
    if n == 0:
        return word_vec

    return pca.components_[:n]@word_vec

In [None]:
# Some plotting stuff here
words = ['company', 'business', 'cat', 'software', 'microsoft']


Of course, plotting vectors can only really be done in two or three dimensions. For more dimensions, we can use the cosine similarity defined above to measure how similar two word vectors are.
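
Cosine similarity only looks at the angle between two vectors, $\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}$, and ignores their lengths. A quick sanity check on hand-picked 2-dimensional vectors (chosen for illustration only) shows the three extreme cases:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1. This is worth keeping in mind when comparing vectors projected onto only a couple of principal components, where similarities can easily end up close to these extremes.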

In [92]:
n = 2 # Number of principal components to work with
to_compare = ['software', 'business', 'world'] # Three words, that should be labeled as three different things
for word in words:
    for comparison in to_compare:
        print(f"{word}-{comparison}: {cos_sim(get_vector_transform(word, n), get_vector_transform(comparison,n))}")


company-software: 0.8904985189437866
company-business: 0.9662047028541565
company-world: -0.46810242533683777
business-software: 0.7828420400619507
business-business: 1.0000001192092896
business-world: -0.6497917175292969
cat-software: 0.5068366527557373
cat-business: -0.12006817758083344
cat-world: 0.7469577789306641
software-software: 1.0
software-business: 0.7828420400619507
software-world: -0.11456459760665894
microsoft-software: 0.9591502547264099
microsoft-business: 0.8924571871757507
microsoft-world: -0.2848648428916931


## Another strength of fasttext: Spelling errors

A good thing about using fasttext on a character level is that it copes with spelling errors. We're going to test this now by randomly replacing a bunch of letters in our test set with other letters, and once more testing the accuracy of the word-wise fasttext vs the character-wise fasttext.

In [143]:
import string
def dyslexibot(test_set, p=0.05, extra_scuffed=False):
    """
    tHe AlMiGhTy dyslexibot(tm) replaces each letter with probability p
    extra_scuffed does what it says: it makes the replacements even harder to guess,
    by drawing replacements from every character in the test set (spaces, digits, punctuation, ...)
    """

    if extra_scuffed:
        replacements = np.array(list(set(''.join(test_set)))) # Any character currently in the test set
    else:
        replacements = np.array(list(string.ascii_lowercase)) # Only lowercase letters

    new_test_set = [text.split(' ') for text in test_set]

    for i, text in tqdm(enumerate(new_test_set)):
        for r, word in enumerate(text):
            word = list(word)
            for t in range(len(word)):
                # Words come from splitting on spaces, so the spaces themselves are never replaced
                if random.uniform(0, 1) < p:
                    word[t] = np.random.choice(replacements)

            text[r] = ''.join(word)
        new_test_set[i] = ' '.join(text)
    return np.array(new_test_set)

In [148]:
bad_test_texts = dyslexibot(test_texts, p=0.05, extra_scuffed=False)

7600it [00:01, 6532.80it/s]


In [154]:
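# Assumption: we evaluate the character-gram model here; swap in fasttext_word_model to compare
correct = 0
for (test_text, test_label) in zip(bad_test_texts, test_labels):
    correct += test_prediction(test_text, ag_news_labels[test_label], fasttext_char_model)

accuracy = correct/len(bad_test_texts)

print(accuracy)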

0.8543421052631579


In [None]:
from scipy import spatial

# Project a handful of word vectors onto the first two principal components
transform = (pca.components_[:2])

comp = fasttext_model.get_word_vector('company')
buis = fasttext_model.get_word_vector('business')
stock = fasttext_model.get_word_vector('stock')
soft = fasttext_model.get_word_vector('software')
tech = fasttext_model.get_word_vector('technology')

comp_trans = transform@comp
buis_trans = transform@buis
stock_trans = transform@stock
soft_trans = transform@soft
tech_trans = transform@tech

# Cosine similarity is 1 minus scipy's cosine distance
print(1 - (spatial.distance.cosine(comp_trans, buis_trans)))
print(1 - (spatial.distance.cosine(comp_trans, stock_trans)))
print(1 - (spatial.distance.cosine(comp_trans, soft_trans)))
print(1 - (spatial.distance.cosine(tech_trans, soft_trans)))
print(1 - (spatial.distance.cosine(tech_trans, stock_trans)))
fasttext_model.get_subwords('stock')

In [None]:
"""
$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

  The following arguments are optional:
  -verbose            verbosity level [2]

  The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

  The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

  The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
"""

"""
https://fasttext.cc/docs/en/python-module.html
    get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                            # This is equivalent to `dim` property.
    get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
    get_input_matrix        # Get a copy of the full input matrix of a Model.
    get_labels              # Get the entire list of labels of the dictionary
                            # This is equivalent to `labels` property.
    get_line                # Split a line of text into words and labels.
    get_output_matrix       # Get a copy of the full output matrix of a Model.
    get_sentence_vector     # Given a string, get a single vector representation. This function
                            # assumes to be given a single line of text. We split words on
                            # whitespace (space, newline, tab, vertical tab) and the control
                            # characters carriage return, formfeed and the null character.
    get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
    get_subwords            # Given a word, get the subwords and their indices.
    get_word_id             # Given a word, get the word id within the dictionary.
    get_word_vector         # Get the vector representation of word.
    get_words               # Get the entire list of words of the dictionary
                            # This is equivalent to `words` property.
    is_quantized            # whether the model has been quantized
    predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
    quantize                # Quantize the model reducing the size of the model and its memory footprint.
    save_model              # Save the model to the given path
    test                    # Evaluate supervised model using file given by path
    test_label              # Return the precision and recall score for each label.

    model.words         # equivalent to model.get_words()
    model.labels        # equivalent to model.get_labels()
"""


""""

# Not really necessary to load data since train_supervised works directly off of a .txt file
#data = np.load(os.getcwd() + 'news_data.npz')

# Training is usually really fast
print("Training model")
model = fasttext.train_supervised(input="dat_data_new_labels_cleaned.txt", verbose=False, maxn=3, minn=3)

mat = model.get_input_matrix()
words = model.get_words()

print("Mat is here")
print(mat.shape)

print("Word length is")
print(len(words))
"""
"""

# Quickly cobbled-together test-set creator
print("Creating test set")
txt = open('dat_data_new_labels_cleaned.txt', 'r')
txt_arr = txt.read().split('\n')

tests = []
labels = []
for r, i in enumerate(txt_arr):
    to_append = i.split(' ', 1)
    tests.append(to_append[1])
    labels.append(to_append[0])


print('Predicting')
su = 0
for i, test in enumerate(tests):
    predict_label = model.predict(test)[0][0]

    if predict_label == labels[i]:
        su += 1

print("Total accuracy was ", su/len(tests))
"""
#print()


# Next we'll try to obtain the vector representations of some simple words

words = ['company', 'business', 'cat', 'software', 'microsoft']
word_vectors = np.array([fasttext_model.get_word_vector(word) for word in words])
print(word_vectors.shape)
n = 2 # Number of principal components
transformed_vectors = (pca.components_[:n]@word_vectors.T).T # Shape: (len(words), n)

print(transformed_vectors.shape)
print(transformed_vectors)

max_dim = np.amax(abs(transformed_vectors))
colors = ['r', 'b', 'g', 'pink', 'cyan']

# Draw one arrow per word from the origin, so that each word gets its own legend entry
for (x, y), word, color in zip(transformed_vectors, words, colors):
    plt.quiver(0, 0, x, y, angles='xy', scale_units='xy', scale=1, color=color, label=word)
plt.legend()
plt.grid(True, which='major') #<-- plot grid lines
plt.xlim([-max_dim, max_dim]) #<-- set the x axis limits
plt.ylim([-max_dim,max_dim]) #<-- set the y axis limits
plt.show()
