# Character Embedding Representation Notebook

This notebook builds vector representations of words using the pre-trained FastText vectors in `fasttext-sbwc.3.6.e20.vec`. The objective is to convert each word in the training texts into its corresponding vector, which has a fixed length of 300 dimensions. These representations are used for the task described in the paper "Extending a Deep Learning Approach for Negation Cues Detection in Spanish."

## Overview
The notebook includes the following key steps:
1. **Loading Data**: Import training and test datasets from JSON files (`full_training_set_CRF_tagged.json` and `full_test_set_CRF_tagged.json`).
2. **Loading Pre-trained Model**: Load the pre-trained FastText word vectors (`fasttext-sbwc.3.6.e20.vec`).
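3. **Word Vector Conversion**: Define functions that look up each word in the pre-trained model and return its vector, or `None` when the word is not in the loaded vocabulary.
4. **Saving the Vectors**: Write the resulting per-sentence word vectors to the file `w_fast_vectors` as JSON.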

## Files and Resources
- **Pre-trained FastText Model**: `fasttext-sbwc.3.6.e20.vec`, downloaded from [FastText](https://fasttext.cc/).
- **Training and Test Data**: `full_training_set_CRF_tagged.json` and `full_test_set_CRF_tagged.json`, containing the data for training and evaluation.
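
For orientation, a `.vec` file in the word2vec text format starts with a header line giving the vocabulary size and the vector dimension, followed by one line per word with its components. The sketch below parses a tiny in-memory sample using only the standard library. The sample words and their 3-dimensional values are made up for illustration, and `read_vec_lines` is a hypothetical helper, not part of gensim. Its `limit` argument mimics the idea of reading only the first entries of the file.

```python
def read_vec_lines(lines, limit=None):
    """Parse word2vec text-format lines into (dimension, {word: vector})."""
    lines = iter(lines)
    # Header line: "<vocabulary size> <vector dimension>"
    declared_size, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        # Stop early once `limit` words have been read.
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} components for '{word}'")
        vectors[word] = values
    return dim, vectors


# Made-up sample: 3 words, 3 dimensions (the real file uses 300 dimensions).
sample = [
    "3 3",
    "no 0.1 0.2 0.3",
    "nunca -0.4 0.5 0.0",
    "casa 0.7 -0.1 0.2",
]

dim, toy_vectors = read_vec_lines(sample, limit=2)
print(dim, list(toy_vectors))
```

With `limit=2`, only the first two words of the sample are kept, which is the same trade-off the notebook makes when it loads a subset of the full vocabulary.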

## Goal
The goal is to use pre-trained embeddings to represent words as 300-dimensional vectors, so that the texts can be processed for tasks such as negation cue detection in Spanish.

By the end of this notebook, you will have a set of word vectors ready for use in further natural language processing tasks, particularly those involving character-level analysis as mentioned in the referenced paper.


In [None]:
import json

from nltk import sent_tokenize, word_tokenize

# Load the CRF-tagged training and test sets; `with` closes each file after reading.
with open("full_training_set_CRF_tagged.json", encoding="utf-8") as f:
    full_training_set_CRF = json.load(f)

with open("full_test_set_CRF_tagged.json", encoding="utf-8") as f:
    full_test_set_CRF = json.load(f)

In [None]:
from gensim.models.keyedvectors import KeyedVectors
wordvectors_file_vec = 'fasttext-sbwc.3.6.e20.vec'
cantidad = 100000  # load only the first 100,000 vectors in the file
wordfast_model = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)

def get_word_vector(word):
    """
    Take a word as input and return its corresponding vector
    using the FastText model loaded above.

    :param word: Word whose vector is requested.
    :return: The word's vector if it is in the model's vocabulary, otherwise None.
    """
    if word in wordfast_model:
        return wordfast_model[word]
    else:
        print(f"The word '{word}' is not in the vocabulary.")
        return None
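
Because `get_word_vector` returns `None` for out-of-vocabulary words, any later step that expects a fixed-length vector for every token has to decide how to handle those gaps. One common option, shown here only as a sketch, is to substitute an all-zero vector. The helper `vector_or_zeros`, the dictionary `toy_model`, and the dimension of 3 are assumptions made for illustration. The notebook itself keeps `None`.

```python
DIM = 3  # toy dimension; the real vectors have 300 components

# Hypothetical stand-in for the loaded model: a plain dict of word -> vector.
toy_model = {"no": [0.1, 0.2, 0.3], "nunca": [-0.4, 0.5, 0.0]}


def lookup(word):
    """Mirror get_word_vector: return the vector, or None if the word is unknown."""
    return toy_model.get(word)


def vector_or_zeros(word, dim=DIM):
    """Return the word's vector, or a zero vector of length `dim` if it is unknown."""
    vector = lookup(word)
    return vector if vector is not None else [0.0] * dim


sentence = ["no", "xyzzy", "nunca"]
filled = [vector_or_zeros(word) for word in sentence]
print(filled)
```

Whether zeros, a dedicated "unknown" vector, or dropping the token is the better choice depends on the downstream model, so this is a design decision rather than a fixed rule.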


In [None]:
def word_extraction_wordfast(document, n):
    """Receive a document and its index (n, currently unused); return its word vectors grouped by sentence."""
    text = document['text']
    sentence_vectors = []

    for sentence in sent_tokenize(text):
        # Skip sentences that contain no alphabetic characters.
        if any(char.isalpha() for char in sentence):
            # One entry per token; None marks an out-of-vocabulary token.
            vectors = [get_word_vector(word) for word in word_tokenize(sentence)]
            sentence_vectors.append(vectors)

    return sentence_vectors
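
The effect of the sentence filter and the nested output shape can be seen without NLTK or the model. In the sketch below, `toy_sentences` and `toy_words` are simplified regex stand-ins for `sent_tokenize` and `word_tokenize` (assumptions for illustration only, and much cruder than the real tokenizers), and tokens are kept as strings instead of being replaced by vectors.

```python
import re


def toy_sentences(text):
    # Stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_words(sentence):
    # Stand-in for word_tokenize: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def group_tokens(text):
    """Return tokens grouped by sentence, skipping sentences without letters."""
    grouped = []
    for sentence in toy_sentences(text):
        if any(char.isalpha() for char in sentence):
            grouped.append(toy_words(sentence))
    return grouped


grouped = group_tokens("No vino nadie. 12:30. Nunca llama.")
print(grouped)
```

The middle fragment `12:30.` contains no letters, so it contributes no inner list, and the result has one list of tokens per remaining sentence.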

In [None]:
word_vectors_fast = []
for i, document in enumerate(full_training_set_CRF):
    word_vectors_fast += word_extraction_wordfast(document, i)

In [None]:
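with open('w_fast_vectors', 'w', encoding='utf-8') as f:
    # NumPy arrays are not JSON serializable; `default` converts each vector to a plain list.
    json.dump(word_vectors_fast, f, ensure_ascii=False, default=lambda v: v.tolist())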
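
To use the saved vectors later, the file can be read back with `json.load`. The sketch below writes a tiny made-up sample to a temporary file and reads it back, to show the structure one would expect: a list of sentences, each a list of vectors, with `null` (Python `None`) wherever a word was out of vocabulary. The sample values and the temporary path are illustrative assumptions. The notebook's real output file is `w_fast_vectors`.

```python
import json
import os
import tempfile

# Made-up sample: two sentences, 3-dimensional vectors, one missing word.
sample = [
    [[0.1, 0.2, 0.3], None],
    [[-0.4, 0.5, 0.0]],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "w_fast_vectors_sample")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Count out-of-vocabulary placeholders across all sentences.
missing = sum(vector is None for sentence in loaded for vector in sentence)
print(len(loaded), missing)
```

Counting the `None` entries after loading is a quick way to see how much of the text fell outside the loaded vocabulary.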