<a href="https://colab.research.google.com/github/StevenPeutz/Masterthesis-Disinformation-NLP/blob/master/CODE/2_DimReduction_Word2Vec_FastText(GH).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dimension reduction for pretrained embeddings
- FastText (300 -> 50)
- Word2Vec (300 -> 50)
- GloVe: the 50-dimensional version is used, so no reduction is required


<br>
This is done for the following reasons:

*   Comparing architectures is fairer if all pretrained embeddings have the same number of dimensions.
*   Reduced RAM usage (16 GB locally, 35 GB on Google Colab).
*   Reduced storage space for the embedding files (GitHub file size limits).
*   It allows all embedding and model combinations to be tested within a single 'overview' environment. Of course, this comes at the cost of classification performance. Therefore, all embedding and model combinations are also run in separate environments where RAM is not a limiting factor.  
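
To give a feel for the RAM and storage argument, here is a rough back-of-the-envelope estimate. It assumes vectors stored as 32-bit floats and an illustrative vocabulary of 3 million words; both numbers are assumptions for this sketch, not measurements from this notebook, and `embedding_size_gb` is a hypothetical helper.

```python
def embedding_size_gb(vocab_size, dim, bytes_per_value=4):
    """Approximate size of a dense embedding matrix in gigabytes (10^9 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e9

vocab = 3_000_000  # assumed vocabulary size, for illustration only

print(f"300 dimensions: {embedding_size_gb(vocab, 300):.1f} GB")
print(f" 50 dimensions: {embedding_size_gb(vocab, 50):.1f} GB")
```

The matrix size scales linearly with the number of dimensions, so going from 300 to 50 dimensions cuts it by a factor of six under these assumptions.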
<br>
<br>
The technique used for dimension reduction is PCA for Word2Vec and SVD for FastText. (For FastText I also tested a PCA-reduced version to check that the difference in method was not a major factor.)
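
PCA and SVD are closely related: projecting mean-centered data onto its top right-singular vectors gives the same coordinates as PCA computed from the covariance matrix, up to the sign of each component. The following is a minimal NumPy sketch on random toy data (not the actual embeddings); the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def reduce_with_svd(X, k):
    """Project rows of X onto the top-k right-singular vectors of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def reduce_with_covariance_pca(X, k):
    """Project rows of X onto the top-k eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top

rng = np.random.default_rng(0)
# Toy data: 200 "vectors" with 10 dimensions and clearly different variances per dimension
X = rng.normal(size=(200, 10)) * np.arange(1, 11)

A = reduce_with_svd(X, 3)
B = reduce_with_covariance_pca(X, 3)

# Components may differ in sign between the two routes, so compare absolute values
print(np.allclose(np.abs(A), np.abs(B), atol=1e-6))
```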



# Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

In [None]:
import gzip
import io
import shutil

In [None]:
!pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2
  Using cached pybind11-2.10.3-py3-none-any.whl (222 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp38-cp38-linux_x86_64.whl size=4402288 sha256=74b7076be46ff5c3523c931fd873cc638e261333db9a36184cadf80832c9190a
  Stored in directory: /root/.cache/pip/wheels/93/61/2a/c54711a91c418ba06ba195b1d78ff24fcaad8592f2a694ac94
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.10.3


In [None]:
import fasttext
import fasttext.util

# Reducing FastText (300d-1M.bin)

For FastText I use the library's built-in utility, `fasttext.util.reduce_model`, for dimension reduction instead of scikit-learn's PCA. (The built-in method is similar in principle but uses an SVD-based approach.)

In [None]:
# Optional: download cc.en.300.bin into the working directory (skipped if it already exists there)
fasttext.util.download_model('en', if_exists='ignore')  # English

# Load the model from Drive storage
ft = fasttext.load_model('/content/drive/MyDrive/MYDATA/Embeddings_PreTrained/FastText/cc.en.300.bin')

# The file above can be downloaded from Facebook AI Research (FAIR) with this URL:
# 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin'

print(ft.get_dimension())



300


In [None]:
fasttext.util.reduce_model(ft, 50)  # reduces the model in place
print(ft.get_dimension())

50


In [None]:
ft.save_model("FastText_SVD_Reduced50dim.bin")

In [None]:
ft_ft = fasttext.load_model("FastText_SVD_Reduced50dim.bin")



In [None]:
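with open('FastText_SVD_Reduced50dim.txt', 'w', encoding='utf-8') as f:  # save_model() always writes binary, so write the text file by hand
    f.writelines(f"{w} {' '.join(map(str, ft_ft.get_word_vector(w)))}\n" for w in ft_ft.get_words())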

In [None]:
with open('FastText_SVD_Reduced50dim.txt', 'rb') as f_in, gzip.open('FastText_SVD_Reduced50dim.txt.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
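
Note that fastText's `save_model` writes its own binary format whatever the file extension, so a plain-text vector file has to be produced by looping over the vocabulary. The sketch below writes one `word v1 v2 ...` line per word straight into a gzip file, skipping the intermediate uncompressed file. `write_vectors_gz` is a hypothetical helper, and the toy dictionary stands in for a real model so the example runs without fastText installed.

```python
import gzip
import os
import tempfile

def write_vectors_gz(words, get_vector, path):
    """Write one 'word v1 v2 ...' line per word to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for word in words:
            values = " ".join(str(float(x)) for x in get_vector(word))
            f.write(f"{word} {values}\n")

# Toy stand-in for a model: a word -> vector lookup
toy = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
demo_path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt.gz")
write_vectors_gz(toy.keys(), toy.__getitem__, demo_path)

# Read the file back to check its contents
with gzip.open(demo_path, "rt", encoding="utf-8") as f:
    print(f.read())
```

With the reduced fastText model from above, the call would presumably look like `write_vectors_gz(ft.get_words(), ft.get_word_vector, "FastText_SVD_Reduced50dim.txt.gz")`.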

In [None]:
!cp FastText_SVD_Reduced50dim.txt.gz /content/drive/MyDrive/MYDATA/Embeddings_PreTrained/FastText/

# Reducing Word2Vec (w2v.bin)

In [None]:
#from drive storage
model_path = '/content/drive/MyDrive/MYDATA/Embeddings_PreTrained/word2vec/w2v.bin'
#the file could previously be downloaded from 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin' but this is no longer supported.
#it is still available for download through Kaggle (https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300)
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

In [None]:
# Extract the word vectors from the model
word_vectors = model.vectors

# Reduce the dimensionality of the vectors to 50 using PCA
pca = PCA(n_components=50)
word_vectors_50d = pca.fit_transform(word_vectors)
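
A useful sanity check after a reduction like this is how much of the total variance the kept components retain. With scikit-learn this is available as `pca.explained_variance_ratio_.sum()` on the fitted object. The sketch below computes the same quantity with plain NumPy on random toy data (not the Word2Vec vectors); `explained_variance_ratio` is a hypothetical helper name.

```python
import numpy as np

def explained_variance_ratio(X, k):
    """Fraction of total variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)  # sorted in descending order
    variances = singular_values ** 2
    return float(variances[:k].sum() / variances.sum())

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(500, 20))  # toy stand-in for an embedding matrix

ratio_5 = explained_variance_ratio(toy_vectors, 5)
ratio_20 = explained_variance_ratio(toy_vectors, 20)
print(f"top 5 of 20 components keep {ratio_5:.2%} of the variance")
print(f"all 20 components keep {ratio_20:.2%} of the variance")
```

On real embeddings the retained fraction depends on the data, so it is worth checking rather than assuming.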

In [None]:
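# Save the reduced vectors to a gzipped text file, one "word v1 ... v50" line per word (SLOW!)
# gensim >= 4.0 renamed index2word to index_to_key, so support both
vocab = model.index_to_key if hasattr(model, "index_to_key") else model.index2word
with gzip.open("/content/drive/MyDrive/MYDATA/Embeddings_PreTrained/word2vec/w2v_PCA_reduced-vectors.txt.gz", "wt", encoding="utf-8") as f:
    for word, vector in zip(vocab, word_vectors_50d):
        f.write(f"{word} {' '.join(map(str, vector))}\n")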

KeyboardInterrupt: ignored



*   `gzip.open()` is used in place of `open()` to read and write compressed files directly
*   `wt` = write text
*   `rt` = read text
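
As a small illustration of the `wt` and `rt` modes, the sketch below writes a tiny gzipped vector file and reads it back into a dictionary. The file contents and the `load_vectors_gz` helper are made up for this example; the line layout (`word v1 v2 ...`, no header line) follows the format written above.

```python
import gzip
import os
import tempfile

def load_vectors_gz(path):
    """Read 'word v1 v2 ...' lines from a gzip-compressed text file into a dict."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:  # rt = read text
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Write a tiny sample file first (wt = write text)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("king 0.5 -1.0\nqueen 0.25 2.0\n")

loaded = load_vectors_gz(sample_path)
print(loaded["king"])
```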






