# Let's Build an Encoder (Part 1): Task

### Recap of the Full Process
- Given a Corpus of Text (List of Sentences)
- **Normalize** each sentence
  - Lower case
  - Lemmatization & Stemming
  - Remove symbols
- Apply **Tokenization** on the normalized sentence
- Build **Vocabulary** of such Tokens
- **Encode** the tokens given their IDs
- Apply **Padding** on each Sequence of IDs

### Illustrative Figure For Encoder Pipeline

![Encoder-Pipeline](../imgs/Encoder-Pipeline.png)

### Import Packages

In [33]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import numpy as np
from itertools import chain
from typing import List, Dict

In [34]:
sm = PorterStemmer()
wn = WordNetLemmatizer()

### Corpus

In [None]:
data = [
    "Machine Can only understand digits.",
    "Tokenizing Sentence into Sequence of Tokens",
    "Encoding Tokens into Sequence of IDs",
    "Apply Padding on each sequence",
    "Pass sequences into our model",
]

### Helper Methods

In [46]:
def clean_text(sentence: str) -> List[str]:
    """Lowercase a sentence, split it into tokens, and normalize each token
    (lemmatization or stemming)

    Args:
        sentence (str): raw sentence to normalize

    Returns:
        List[str]: list of normalized tokens
    """
    return ...
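
As a rough illustration of the normalization and tokenization steps, the sketch below lowercases a sentence, strips ASCII punctuation, and splits on whitespace. The name `simple_normalize` is hypothetical and not part of this notebook, and the sketch deliberately skips lemmatization and stemming, so it is a simplified stand-in rather than the expected solution for `clean_text`.

```python
import string
from typing import List


def simple_normalize(sentence: str) -> List[str]:
    """Hypothetical helper: lowercase, drop ASCII punctuation, split on whitespace."""
    lowered = sentence.lower()
    # A translation table with a deletion set removes every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any run of whitespace.
    return no_punct.split()


print(simple_normalize("Machine Can only understand digits."))
# ['machine', 'can', 'only', 'understand', 'digits']
```

Because punctuation is stripped and no lemmatizer is applied here, the tokens will not match the reference output in this notebook exactly.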

In [66]:
def create_vocab(sentences: List[str]) -> Dict[str, int]:
    """Generate a vocabulary from the given corpus

    Args:
        sentences (List[str]): List of sentences (all data)

    Returns:
        Dict[str, int]: dictionary with the token (string) as key and its index as value

        Example:
        {
            "token-x": 1, # index 0 is reserved for the padding token
            "token-y": 2,
            ...
        }
    """
    vocab = {...}

    return vocab
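
One possible way to build such a mapping is sketched below. It assumes the input is already a list of token lists (not raw sentences), and the name `build_vocab_sketch` is hypothetical. IDs are assigned in first-seen order starting at 1, so the exact numbers will differ from the reference output in this notebook, where the ordering is arbitrary.

```python
from itertools import chain
from typing import Dict, List


def build_vocab_sketch(token_lists: List[List[str]]) -> Dict[str, int]:
    """Hypothetical helper: map each unique token to an ID starting at 1."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_tokens = dict.fromkeys(chain.from_iterable(token_lists))
    # IDs start at 1 so that 0 stays free for the padding token.
    return {token: idx for idx, token in enumerate(unique_tokens, start=1)}


toy_tokens = [["apply", "padding"], ["padding", "on", "sequence"]]
print(build_vocab_sketch(toy_tokens))
# {'apply': 1, 'padding': 2, 'on': 3, 'sequence': 4}
```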

In [78]:
def encode_tokens(data_normalized: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode each tokenized sentence using the vocabulary

    Args:
        data_normalized (List[List[str]]): List of normalized, tokenized sentences to be encoded
        vocab (Dict[str, int]): dictionary mapping each token to its index

    Returns:
        List[List[int]]: one list of indexes (encoded tokens) per sentence
    """
    return ...
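
The lookup itself can be sketched as a nested list comprehension. The name `encode_sketch` and the toy vocabulary are hypothetical. The sketch assumes every token is present in the vocabulary; how to handle unknown tokens is left open here.

```python
from typing import Dict, List


def encode_sketch(token_lists: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Hypothetical helper: replace each token with its vocabulary ID."""
    # Direct indexing raises KeyError for a token missing from the vocabulary.
    return [[vocab[token] for token in tokens] for tokens in token_lists]


toy_vocab = {"apply": 1, "padding": 2, "on": 3, "sequence": 4}
print(encode_sketch([["apply", "padding"], ["padding", "on", "sequence"]], toy_vocab))
# [[1, 2], [2, 3, 4]]
```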

In [162]:
def pad_seq(inp_seq: List[int], max_length: int) -> List[int]:
    """Pad a sequence of encoded tokens to a fixed length

    Args:
        inp_seq (List[int]): Sequence of tokens after encoding
        max_length (int): target length that every sequence should have after padding

    Returns:
        List[int]: the encoded sequence padded to max_length
    """
    
    return ...
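
A minimal sketch of right-padding with the reserved ID 0 follows. The name `pad_sketch` and the `pad_id` parameter are hypothetical, and truncating sequences longer than `max_length` is an assumption made for this sketch, not something the task specifies.

```python
from typing import List


def pad_sketch(seq: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Hypothetical helper: right-pad with pad_id up to max_length."""
    # Assumption: sequences longer than max_length are truncated.
    trimmed = seq[:max_length]
    # Append as many pad IDs as needed to reach max_length.
    return trimmed + [pad_id] * (max_length - len(trimmed))


print(pad_sketch([2, 6, 20, 13, 19], max_length=6))
# [2, 6, 20, 13, 19, 0]
```

A natural choice for `max_length` in this setting is the length of the longest encoded sequence, for example `max(map(len, sequences))`.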


In [163]:
# Write your code in this cell to avoid removing the output of the next cell

In [None]:
# Clean and tokenize Corpus
data_tokens = ...
data_tokens

[['machine', 'can', 'only', 'understand', 'digits.'],
 ['tokenizing', 'sentence', 'into', 'sequence', 'of', 'token'],
 ['encoding', 'token', 'into', 'sequence', 'of', 'id'],
 ['apply', 'padding', 'on', 'each', 'sequence'],
 ['pas', 'sequence', 'into', 'our', 'model']]

In [None]:
# Write your code in this cell to avoid removing the output of the next cell

In [None]:
# Create Vocabulary
vocab = ...
vocab

{'sequence': 1,
 'machine': 2,
 'padding': 3,
 'on': 4,
 'model': 5,
 'can': 6,
 'into': 7,
 'id': 8,
 'of': 9,
 'our': 10,
 'apply': 11,
 'pas': 12,
 'understand': 13,
 'token': 14,
 'each': 15,
 'sentence': 16,
 'encoding': 17,
 'tokenizing': 18,
 'digits.': 19,
 'only': 20}

In [None]:
# Write your code in this cell to avoid removing the output of the next cell

In [None]:
enc = ...
enc

[[2, 6, 20, 13, 19],
 [18, 16, 7, 1, 9, 14],
 [17, 14, 7, 1, 9, 8],
 [11, 3, 4, 15, 1],
 [12, 1, 7, 10, 5]]

In [None]:
# Write your code in this cell to avoid removing the output of the next cell

In [None]:
max_len = ...
data_tokens_padded = ...
data_tokens_padded

[[2, 6, 20, 13, 19, 0],
 [18, 16, 7, 1, 9, 14],
 [17, 14, 7, 1, 9, 8],
 [11, 3, 4, 15, 1, 0],
 [12, 1, 7, 10, 5, 0]]

In [None]:
# Check that all sequences have the same length
assert len(set(map(len, data_tokens_padded))) == 1

### Great Work! 🎉  
Now that you have learned the main concepts of the Encoder, let's work on a real-world scenario and build a real tokenizer using actual data from [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews), with training and testing datasets.