# Word Embedding with Word2Vec

This notebook loads pre-trained Word2Vec embeddings and uses them to convert each word in the training and test texts into its corresponding vector representation.

## Overview
1. **Loading Libraries and Data**: Import necessary libraries and datasets.
2. **Loading Pre-trained Word2Vec Model**: Load the Spanish Billion Words Corpus (SBWC) Word2Vec model.
3. **Embedding Transformation**: Transform the words in the training and test datasets into their respective Word2Vec vector representations.
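
The transformation in step 3 comes down to a per-token lookup with a fallback for words the model does not know. Below is a minimal sketch of that idea. It assumes a toy dictionary (`toy_vectors`) in place of the SBWC model, and the helper name `embed_sentence` is hypothetical rather than part of gensim or of this notebook.

```python
import numpy as np


def embed_sentence(tokens, vectors, dim):
    """Map each token to its vector, or to a zero vector if it is out of vocabulary."""
    return [
        vectors[token] if token in vectors else np.zeros(dim, dtype=np.float32)
        for token in tokens
    ]


# Hypothetical 3-dimensional toy vocabulary standing in for the real 300-dimensional model.
toy_vectors = {
    "hola": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "mundo": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

# "xyzzy" is not in the toy vocabulary, so it falls back to the zero vector.
embedded = embed_sentence(["hola", "mundo", "xyzzy"], toy_vectors, dim=3)
```

A zero vector keeps every sentence aligned token by token with its source text, at the cost of making all unknown words indistinguishable from one another.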

## Goal
The goal of this notebook is to produce word embeddings that downstream models and analyses can consume directly. By the end of the notebook, each tokenized sentence is represented as a list of 300-dimensional word vectors and saved to a JSON file.
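
Many downstream models expect fixed-shape input, while sentences differ in length. One common way to bridge that gap, shown here only as a sketch and not as something this notebook does, is to pad every sentence to the length of the longest one and keep a mask of the real tokens. The function name `pad_sentences` and the toy data are assumptions made for illustration.

```python
import numpy as np


def pad_sentences(sentences, dim):
    """Stack variable-length sentences into one (n_sentences, max_len, dim) array.

    Returns the padded array and a boolean mask that is True at real token positions.
    """
    max_len = max((len(sentence) for sentence in sentences), default=0)
    batch = np.zeros((len(sentences), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(sentences), max_len), dtype=bool)
    for i, sentence in enumerate(sentences):
        if sentence:
            batch[i, :len(sentence)] = np.asarray(sentence, dtype=np.float32)
            mask[i, :len(sentence)] = True
    return batch, mask


# Toy input: two sentences of 2-dimensional vectors with different lengths.
toy_sentences = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
batch, mask = pad_sentences(toy_sentences, dim=2)
```

The mask matters because padding positions are filled with zeros, which would otherwise be indistinguishable from a word that was mapped to a zero vector.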

In [1]:
import json
import torch
import gensim
import gensim.downloader as api
from nltk import sent_tokenize, word_tokenize
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Location of the SBWC vectors (300 dimensions) in binary word2vec format.
model_path = 'http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz'
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)

In [3]:
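# Context managers close each file as soon as it has been read.
with open("full_training_set_CRF_tagged.json", encoding="utf-8") as f:
    training_set = json.load(f)

with open("full_test_set_CRF_tagged.json", encoding="utf-8") as f:
    test_set = json.load(f)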

In [14]:
def word_extraction(document):
    """Split a document's text into sentences and return each one as a list of word tokens."""
    sentences = []
    for sentence in sent_tokenize(document['text']):
        # Skip fragments that contain no letters (e.g. punctuation-only or purely numeric ones).
        if any(char.isalpha() for char in sentence):
            sentences.append(word_tokenize(sentence))
    return sentences

In [15]:
# Flat list of tokenized sentences across all test documents.
test_set_words = []
for document in test_set:
    test_set_words.extend(word_extraction(document))

In [16]:
print(len(test_set_words))

3211


In [17]:
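word_embeddings = []
for sentence in test_set_words:
    vectors = []
    for word in sentence:
        if word in word2vec_model:
            vectors.append(word2vec_model.get_vector(word))
        else:
            # Out-of-vocabulary words are mapped to a zero vector of the model's dimensionality (300).
            vectors.append(np.zeros(word2vec_model.vector_size))
    word_embeddings.append(vectors)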

In [18]:
# Convert NumPy arrays to plain lists so the result is JSON-serializable.
word_embeddings = [[vector.tolist() for vector in sentence] for sentence in word_embeddings]

In [20]:
len(word_embeddings[0][0])  # dimensionality of a single word vector

300

In [21]:
with open("test_word_embeddings.json", "w") as f:
    json.dump(word_embeddings, f)
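
To use the saved file later, the nested lists have to be read back and turned into arrays again. The sketch below shows one way to do that round trip. It uses a small toy structure and a temporary file (both assumptions for illustration) rather than the real `test_word_embeddings.json`.

```python
import json
import os
import tempfile

import numpy as np

# Toy stand-in for `word_embeddings`: two sentences of 3-dimensional vectors.
toy_embeddings = [
    [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]],
    [[0.4, 0.5, 0.6]],
]

# Hypothetical temporary path, used so the example does not touch the notebook's output file.
path = os.path.join(tempfile.mkdtemp(), "toy_word_embeddings.json")
with open(path, "w") as f:
    json.dump(toy_embeddings, f)

with open(path) as f:
    loaded = json.load(f)

# Sentences differ in length, so keep one (n_words, dim) array per sentence.
sentence_arrays = [np.asarray(sentence, dtype=np.float32) for sentence in loaded]
shapes = [array.shape for array in sentence_arrays]
```

JSON stores every float as text, so for large corpora a binary format may be more compact. The list-of-lists layout is kept here because it matches what the notebook writes.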