## PyTorch word2vec tutorial from scratch
### Torchtext overview

Torchtext is a Python library that provides a suite of tools for natural language processing (NLP) tasks such as text preprocessing, tokenization, and vocabulary management. It is specifically designed for working with text data and provides several functions for cleaning and preparing text data, including removing punctuation, extra whitespaces, and other special characters, splitting text into words or tokens, and converting text to lowercase.

torchtext is often used in conjunction with PyTorch, a popular deep learning framework, to build end-to-end NLP pipelines. It can be particularly useful for preparing text data for tasks such as text classification, sentiment analysis, and machine translation.

To use torchtext, install the libraries with the commands below. Note that the PyPI package for PyTorch is named `torch`: running `pip install pytorch` fails with the error `You tried to install "pytorch". The package named for PyTorch is "torch"`.

In [57]:
! pip install torch
! pip install torchtext
! pip install transformers
! pip install SentencePiece

### Tokenizer
A tokenizer is a natural language processing (NLP) tool that splits a text into tokens.
Tokenization is the process of breaking down a text into individual words, phrases, symbols, or other meaningful elements, which are referred to as "tokens". These tokens can then be used for a variety of NLP tasks, such as language modeling, text classification, and sentiment analysis.
To load a tokenizer, import `get_tokenizer` from `torchtext.data.utils`.

#### get_tokenizer
`get_tokenizer` from `torchtext.data.utils` returns a tokenizer function for the requested tokenization method and language:
tokenizer = get_tokenizer(tokenizer_method, language)

In [27]:
from torchtext.data.utils import get_tokenizer
def get_english_tokenizer():
    """
    Documentation:
    https://pytorch.org/text/stable/_modules/torchtext/data/utils.html#get_tokenizer
    """
    tokenizer = get_tokenizer("basic_english", language="en")

    return tokenizer
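
To make the idea concrete, here is a minimal pure-Python sketch of a normalizing tokenizer. It is only a rough stand-in for illustration: `simple_basic_tokenizer` is a hypothetical name, and its rules (lowercasing, isolating a few punctuation marks, splitting on whitespace) are an assumed simplification, not torchtext's actual implementation.

```python
import re

def simple_basic_tokenizer(text):
    """Hypothetical, simplified tokenizer (not torchtext's code)."""
    text = text.lower()
    # Surround common punctuation with spaces so each mark becomes its own token.
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    # Splitting on whitespace also discards repeated or trailing spaces.
    return text.split()

tokens = simple_basic_tokenizer("Hello, world! PyTorch is fun.")
print(tokens)  # ['hello', ',', 'world', '!', 'pytorch', 'is', 'fun', '.']
```

The tokenizer returns a plain list of strings, which is the shape the vocabulary step expects as input.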

### Vocabulary
A vocabulary maps each word to an integer index (idx). In simplified form:
words = ["je", "vais", "sortir"]
vocab(words) ----> {"je": 0, "vais": 1, "sortir": 2}

The text must therefore be tokenized before the vocabulary is built.

To build a vocabulary with torchtext, use the following code snippet.

`build_vocab` takes two parameters: the data, which can be a list of sentences (["je suis parti", "Il est rentré"]), and the tokenizer function.
The parameter `specials=["<unk>"]` adds a special token for unknown words, and `set_default_index` makes every out-of-vocabulary word map to that token's index.


In [28]:
from torchtext.vocab import build_vocab_from_iterator
def build_vocab(data_iter, tokenizer):
    """Builds vocabulary from iterator"""
    
    vocab = build_vocab_from_iterator(
        map(tokenizer, data_iter),
        specials=["<unk>"],
        min_freq=1,
    )
    vocab.set_default_index(vocab["<unk>"])
    return vocab


In [30]:
# Example
data_iter = ["je suis sorti voir Nicolas" , "Il est parti regarder un film " ,  "soit sérieux"]
tokenizer = get_english_tokenizer()
build_vocab(data_iter, tokenizer)

Vocab()


In [49]:
data_iter =  ["je suis sorti voir Nicolas" , "Il est parti regarder un film " ,  "soit sérieux"]
tokenizer = get_english_tokenizer()
vocab = build_vocab(data_iter, tokenizer)

In [50]:
text_pipeline = lambda x: vocab(tokenizer(x))

vect = [text_pipeline(item) for item in data_iter]

In [51]:
vect

[[4, 10, 9, 13, 5], [3, 1, 6, 7, 12, 2], [8, 11]]

In [52]:
# Counter-example: a word outside the vocabulary maps to <unk> (index 0)
text_pipeline("pitier")

[0]

In [53]:
text_pipeline("ruiser")

[0]
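
The indices above can be reproduced with a small pure-Python sketch. `simple_tokenize`, `build_simple_vocab` and `encode` are hypothetical helpers, and the ordering rule (special tokens first, then words by descending frequency with alphabetical tie-breaking) is an assumption chosen because it matches the output shown above; it is not presented as torchtext's code.

```python
from collections import Counter

def simple_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase, then split on whitespace.
    return text.lower().split()

def build_simple_vocab(sentences, specials=("<unk>",), min_freq=1):
    """Hypothetical helper: map each token to an integer index."""
    counts = Counter(tok for s in sentences for tok in simple_tokenize(s))
    # Most frequent first; ties are broken alphabetically (assumed ordering).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(text, stoi, unk="<unk>"):
    # Unknown words fall back to the index of the <unk> token.
    return [stoi.get(tok, stoi[unk]) for tok in simple_tokenize(text)]

sentences = ["je suis sorti voir Nicolas", "Il est parti regarder un film ", "soit sérieux"]
stoi = build_simple_vocab(sentences)
encoded = [encode(s, stoi) for s in sentences]
print(encoded)                 # [[4, 10, 9, 13, 5], [3, 1, 6, 7, 12, 2], [8, 11]]
print(encode("pitier", stoi))  # [0]
```

Because `<unk>` is listed first, it receives index 0, which is why both counter-examples return `[0]`.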

### Context in natural language processing
Suppose you are reading a book and you come across a word that you don't understand. What can you do to understand it? First, reread the entire sentence and look at the words surrounding the unfamiliar word. Once you have a clearer understanding of the context, you can work out the meaning of the unknown word. This method of understanding an unknown word by using the context of the text is known as contextualization.

Contextualization matters in NLP because it lets the model relate each word to the other words in the sentence.
The parameter that controls how many words are taken before and after the current (unknown) word is called N_CONTEXT_WORDS.

The full window, context words plus the current word, therefore contains N_CONTEXT_WORDS * 2 + 1 tokens; this notebook uses that value as the maximum sequence length.
A batch is a group of sentences processed together.

In [34]:
sentence = "je suis sorti voir Nicolas" 
N_CONTEXT_WORDS = 2
tokens = sentence.split()

In [35]:
for item in tokens:
    print(item)

je
suis
sorti
voir
Nicolas
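
Using the same tokens, the context of a word can be sketched as the slices just before and just after its position. `context_of` is a hypothetical helper written for this illustration; note how words near the edges of the sentence get fewer than `N_CONTEXT_WORDS` neighbours on one side.

```python
def context_of(tokens, position, n_context_words):
    """Hypothetical helper: words before and after tokens[position]."""
    before = tokens[max(0, position - n_context_words):position]
    after = tokens[position + 1:position + 1 + n_context_words]
    return before, after

tokens = "je suis sorti voir Nicolas".split()
N_CONTEXT_WORDS = 2

# Only the middle word has a full context on both sides.
middle = context_of(tokens, 2, N_CONTEXT_WORDS)
first = context_of(tokens, 0, N_CONTEXT_WORDS)
print(middle)  # (['je', 'suis'], ['voir', 'Nicolas'])
print(first)   # ([], ['suis', 'sorti'])

# The full window is the context on both sides plus the current word.
window_size = N_CONTEXT_WORDS * 2 + 1
print(window_size)  # 5
```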


In [19]:
CBOW_N_WORDS = 2
MAX_SEQUENCE_LENGTH = CBOW_N_WORDS * 2 +1

In [55]:

import torch

def collate_cbow(batch, tokenizer, vocab):
    """
    Collate_fn for CBOW model to be used with Dataloader.
    `batch` is expected to be list of text paragraphs.
    
    Context is represented as N=CBOW_N_WORDS past words 
    and N=CBOW_N_WORDS future words.
    
    Long paragraphs will be truncated to contain
    no more than MAX_SEQUENCE_LENGTH tokens.
    
    Each element in `batch_input` is N=CBOW_N_WORDS*2 context words.
    Each element in `batch_output` is a middle word.
    """
    # Map raw text to a list of token ids.
    text_pipeline = lambda x: vocab(tokenizer(x))

    batch_input, batch_output = [], []
    for text in batch:
        text_tokens_ids = text_pipeline(text)

        # Skip texts that are too short to hold one full window.
        if len(text_tokens_ids) < CBOW_N_WORDS * 2 + 1:
            continue

        # Truncate long texts. With MAX_SEQUENCE_LENGTH equal to the window
        # size, each text yields at most one training sample.
        if MAX_SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:MAX_SEQUENCE_LENGTH]

        # Slide the window over the text; the middle word is the target.
        for idx in range(len(text_tokens_ids) - CBOW_N_WORDS * 2):
            token_id_sequence = text_tokens_ids[idx : (idx + CBOW_N_WORDS * 2 + 1)]
            output = token_id_sequence.pop(CBOW_N_WORDS)
            input_ = token_id_sequence
            batch_input.append(input_)
            batch_output.append(output)

    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output



In [56]:
batch = ["je suis sorti voir Nicolas" , "Il est parti regarder un film " ,  "soit sérieux"]
data_iter = batch
tokenizer = get_english_tokenizer()
vocab = build_vocab(data_iter, tokenizer)

collate_cbow(batch , tokenizer , vocab)

(tensor([[ 4, 10, 13,  5],
         [ 3,  1,  7, 12]]),
 tensor([9, 6]))
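
The second sentence has six tokens, yet it produces a single sample. The pure-Python sketch below suggests why: truncating to a maximum length equal to the window size leaves room for only one window. `cbow_samples` is a hypothetical helper that works on plain lists instead of tensors, and the token ids are copied from the output above.

```python
def cbow_samples(token_ids, n_words, max_len=None):
    """Hypothetical helper: (contexts, targets) for one list of token ids."""
    if max_len:
        token_ids = token_ids[:max_len]
    contexts, targets = [], []
    for idx in range(len(token_ids) - 2 * n_words):
        window = token_ids[idx : idx + 2 * n_words + 1]
        targets.append(window[n_words])  # middle word
        contexts.append(window[:n_words] + window[n_words + 1 :])
    return contexts, targets

ids = [3, 1, 6, 7, 12, 2]  # "Il est parti regarder un film"

truncated = cbow_samples(ids, n_words=2, max_len=5)
full = cbow_samples(ids, n_words=2)
print(truncated)  # ([[3, 1, 7, 12]], [6])
print(full)       # ([[3, 1, 7, 12], [1, 6, 12, 2]], [6, 7])
```

A larger maximum length would therefore let longer sentences contribute several samples each.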

### DataLoader

    dataloader = DataLoader(
        data_iter,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=partial(collate_fn, text_pipeline=text_pipeline),
    )

`DataLoader` is the PyTorch class that builds batches for training models.
Its main parameters here are `data_iter`, the list of sentences; `batch_size`, the number of sentences grouped into each batch; and `shuffle`, which controls whether the data is reshuffled at every epoch.
`collate_fn` is the function that applies the tokenizer and vocabulary (through `text_pipeline`) to each batch and returns the tensors required by the chosen model (CBOW or skip-gram).
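
As a rough mental model, the batching step can be sketched in pure Python. This is a simplified stand-in, not PyTorch's implementation: `iterate_batches` and `collate_lengths` are hypothetical names, there is no shuffling or tensor conversion, and the whitespace-splitting `text_pipeline` only imitates `vocab(tokenizer(x))`. It shows how `functools.partial` pre-binds `text_pipeline` so that the collate function can be called with the batch alone.

```python
from functools import partial

def iterate_batches(data, batch_size, collate_fn):
    """Hypothetical helper: yield collate_fn(batch) for consecutive slices."""
    for start in range(0, len(data), batch_size):
        yield collate_fn(data[start : start + batch_size])

def collate_lengths(batch, text_pipeline):
    """Toy collate function: number of tokens in each sentence of the batch."""
    return [len(text_pipeline(text)) for text in batch]

sentences = ["je suis sorti voir Nicolas", "Il est parti regarder un film ", "soit sérieux"]
text_pipeline = lambda text: text.lower().split()  # stand-in pipeline (assumption)

# partial() fixes text_pipeline, leaving `batch` as the only argument.
collate_fn = partial(collate_lengths, text_pipeline=text_pipeline)
batches = list(iterate_batches(sentences, batch_size=2, collate_fn=collate_fn))
print(batches)  # [[5, 6], [2]]
```

With `batch_size=2`, the three sentences form one batch of two and a final batch of one.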

In [None]:
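from functools import partial
from torch.utils.data import DataLoader

def get_dataloader_and_vocab(
    model_name, ds_name, ds_type, data_dir, batch_size, shuffle, vocab=None
):
    # NOTE: `get_data_iterator` is not defined in this notebook; it is assumed
    # to return the text samples for the requested dataset split.
    data_iter = get_data_iterator(ds_name, ds_type, data_dir)
    tokenizer = get_english_tokenizer()

    if not vocab:
        vocab = build_vocab(data_iter, tokenizer)
        
    text_pipeline = lambda x: vocab(tokenizer(x))

    # Bind the extra arguments each collate function expects:
    # collate_cbow(batch, tokenizer, vocab) vs. collate_skipgram(batch, text_pipeline).
    if model_name == "cbow":
        collate_fn = partial(collate_cbow, tokenizer=tokenizer, vocab=vocab)
    elif model_name == "skipgram":
        collate_fn = partial(collate_skipgram, text_pipeline=text_pipeline)
    else:
        raise ValueError("Choose model from: cbow, skipgram")

    dataloader = DataLoader(
        data_iter,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=collate_fn,
    )
    return dataloader, vocab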

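#### Another collate function: skip-gram
Skip-gram reverses the CBOW setup: instead of predicting the middle word from its context words, it predicts each context word from the middle word.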

In [None]:
import torch

# NOTE: SKIPGRAM_N_WORDS is not defined elsewhere in this notebook;
# 2 is an assumed value chosen to mirror CBOW_N_WORDS.
SKIPGRAM_N_WORDS = 2

def collate_skipgram(batch, text_pipeline):
    """
    Collate_fn for Skip-Gram model to be used with Dataloader.
    `batch` is expected to be list of text paragraphs.
    
    Context is represented as N=SKIPGRAM_N_WORDS past words 
    and N=SKIPGRAM_N_WORDS future words.
    
    Long paragraphs will be truncated to contain
    no more than MAX_SEQUENCE_LENGTH tokens.
    
    Each element in `batch_input` is a middle word.
    Each element in `batch_output` is a context word.
    """
    batch_input, batch_output = [], []
    for text in batch:
        text_tokens_ids = text_pipeline(text)

        if len(text_tokens_ids) < SKIPGRAM_N_WORDS * 2 + 1:
            continue

        if MAX_SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:MAX_SEQUENCE_LENGTH]

        for idx in range(len(text_tokens_ids) - SKIPGRAM_N_WORDS * 2):
            token_id_sequence = text_tokens_ids[idx : (idx + SKIPGRAM_N_WORDS * 2 + 1)]
            input_ = token_id_sequence.pop(SKIPGRAM_N_WORDS)
            outputs = token_id_sequence

            for output in outputs:
                batch_input.append(input_)
                batch_output.append(output)

    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output
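
To see the pairing without tensors, here is a minimal pure-Python sketch. `skipgram_pairs` is a hypothetical helper, the window of 2 is an assumed value, and the token ids are those of "je suis sorti voir Nicolas" from the vocabulary example. Each window yields one (middle word, context word) pair per context word.

```python
def skipgram_pairs(token_ids, n_words):
    """Hypothetical helper: (inputs, outputs) lists of skip-gram pairs."""
    inputs, outputs = [], []
    for idx in range(len(token_ids) - 2 * n_words):
        window = token_ids[idx : idx + 2 * n_words + 1]
        center = window[n_words]
        for pos, context_id in enumerate(window):
            if pos != n_words:  # skip the middle word itself
                inputs.append(center)
                outputs.append(context_id)
    return inputs, outputs

ids = [4, 10, 9, 13, 5]  # "je suis sorti voir Nicolas"
inputs, outputs = skipgram_pairs(ids, n_words=2)
print(inputs)   # [9, 9, 9, 9]
print(outputs)  # [4, 10, 13, 5]
```

The same window gives CBOW a single sample (context `[4, 10, 13, 5]`, target `9`), whereas skip-gram turns it into four samples that all share the input `9`.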