- Before diving into the three most common subword tokenization algorithms used with Transformer models (Byte-Pair Encoding [BPE], WordPiece, and Unigram), take a look at the steps in the tokenization pipeline:
- Normalization -> Pre-tokenization -> Model -> Post-processor
- Before the model is applied, the tokenizer performs two preprocessing steps: normalization and pre-tokenization.

In [10]:
import transformers



In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))

<class 'tokenizers.Tokenizer'>


- Normalization: general cleanup of the raw text, such as removing needless whitespace, lowercasing, and/or removing accents.

In [3]:
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


- Pre-tokenization: splits the raw text into words, typically on whitespace and punctuation. These words are the boundaries of the subword tokens the tokenizer can learn during training, so a learned token never spans two words. We can inspect this step with the pre_tokenize_str() method of the pre_tokenizer attribute of the backend tokenizer:

In [4]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('how', (7, 10)),
 ('are', (11, 14)),
 ('you', (16, 19)),
 ('?', (19, 20))]

In [5]:
# Unlike the BERT tokenizer, the GPT-2 tokenizer keeps spaces (replacing them with Ġ) and does not ignore the double space.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]

In [6]:
# Like the GPT-2 tokenizer, the T5 tokenizer keeps spaces and replaces them with a specific symbol (▁),
# but it only splits on whitespace, not punctuation.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('▁Hello,', (0, 6)),
 ('▁how', (7, 10)),
 ('▁are', (11, 14)),
 ('▁you?', (16, 20))]

In [8]:
#three main subword tokenization algorithms: 
#BPE (used by GPT-2 and others), WordPiece (used for example by BERT), and Unigram (used by T5 and others)
#https://huggingface.co/learn/nlp-course/en/chapter6/4?fw=pt

#BPE
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

from transformers import AutoTokenizer
from collections import defaultdict

tokenizer = AutoTokenizer.from_pretrained("gpt2")

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})


In [9]:
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)

[',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


In [11]:
splits = {word: [c for c in word] for word in word_freqs.keys()}
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs

In [12]:
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3


In [13]:
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

('Ġ', 't') 7


In [14]:
vocab = ["<|endoftext|>"] + alphabet.copy()
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")

In [15]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

In [16]:
splits = merge_pair("Ġ", "t", splits)
print(splits["Ġtrained"])

['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']


In [17]:
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

In [18]:
print(merges)

{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa', ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe', ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}


In [19]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    return sum(splits, [])

In [20]:
tokenize("This is not a token.")

['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']

#### **Summary**
- Consider a simple corpus: "hug", "pug", "pun", "bun", "hugs". The base vocabulary starts as ["b", "g", "h", "n", "p", "s", "u"]. The pair ("u", "g") appears most frequently across words, so it is merged into a new token "ug". Every word containing "u" followed by "g" is then re-split using "ug", and the vocabulary now includes "ug". This process repeats, gradually creating larger tokens (for example, merging "u" and "n" into "un") until the vocabulary reaches the desired size.
- Key benefits for NLP: subword tokenization strikes a balance between character-level and full-word tokenization, and it can adapt to new words by splitting them into known subword units.
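
The merge loop described in the summary can be sketched on that toy corpus as follows. This is a minimal sketch rather than a reference implementation: the word frequencies are an assumption (the summary does not give any), and `count_pairs`, `apply_merge`, and `train_bpe` are hypothetical helper names.

```python
from collections import Counter

# Assumed word frequencies; the summary above only lists the words.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def apply_merge(pair, splits):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_splits = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_splits[word] = out
    return new_splits


def train_bpe(word_freqs, num_merges):
    """Start from single characters and learn `num_merges` merge rules."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(splits, word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        splits = apply_merge(best, splits)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges, splits


vocab, merges, splits = train_bpe(word_freqs, num_merges=2)
print(merges)          # [('u', 'g'), ('u', 'n')]
print(splits["hugs"])  # ['h', 'ug', 's']
```

With these assumed frequencies the first two merges match the ones named in the summary ("ug", then "un").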

##### **WordPiece**
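- WordPiece is a tokenization algorithm developed by Google for pretraining BERT, and it has since been used in other Transformer models such as DistilBERT and MobileBERT. Unlike BPE, which merges the most frequent pair, WordPiece computes a score for each pair (the pair frequency divided by the product of the frequencies of its two parts) and merges the pair with the highest score. This favors pairs whose individual parts are rare in the corpus.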

In [21]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [22]:
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

In [23]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

In [24]:
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.1111111111111111
('##h', '##i'): 0.02564102564102564
('##i', '##s'): 0.029585798816568046
('Ġ', '##i'): 0.005698005698005698
('Ġ', '##t'): 0.018518518518518517
('##t', '##h'): 0.023809523809523808
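
The cells above stop at printing pair scores. The next step of WordPiece training, picking the highest-scoring pair and forming the merged token, could look roughly like the sketch below. It runs on a small toy corpus with assumed word frequencies instead of the notebook's corpus, and `pair_scores` and `merged_token` are hypothetical helper names.

```python
from collections import Counter

# Toy corpus with assumed frequencies (not taken from the notebook's corpus).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# WordPiece marks every non-initial character with the "##" prefix.
splits = {
    word: [c if i == 0 else "##" + c for i, c in enumerate(word)]
    for word in word_freqs
}


def pair_scores(splits, word_freqs):
    """score(pair) = freq(pair) / (freq(first) * freq(second))."""
    symbol_freqs, pair_freqs = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for symbol in symbols:
            symbol_freqs[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_freqs[(left, right)] += freq
    return {
        pair: freq / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


def merged_token(left, right):
    """Join two symbols, dropping the '##' prefix of the second one."""
    return left + (right[2:] if right.startswith("##") else right)


scores = pair_scores(splits, word_freqs)
best = max(scores, key=scores.get)
print(best, scores[best], merged_token(*best))
```

On this toy corpus the most frequent pair is ("##u", "##g"), but the highest-scoring pair is ("##g", "##s"), because "##s" is rare on its own. That is the practical difference from BPE's pure frequency criterion.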


#### **Unigram tokenization**
- Unlike BPE and WordPiece, which start with a small vocabulary and expand it by learning rules, Unigram begins with a large vocabulary and prunes tokens until reaching the desired size.
- It evaluates all possible segmentations of a given word and selects the segmentation with the highest probability, based on token frequencies. This model treats all tokens as independent, simplifying the calculation of a word's likelihood.

    1. Establishes an initial large vocabulary and calculates probabilities based on token frequencies.
    2. For a given word, considers all possible segmentations, calculates the probability of each, and selects the one with the highest probability.
    3. Calculates the total loss using the current vocabulary for the entire corpus and identifies less necessary tokens for removal based on the increase in loss they would cause.

- During training, Unigram computes how much the total loss would increase if each token were removed from the vocabulary. Tokens whose removal causes a smaller increase are considered "less important" for the model and become candidates for removal. Repeating this step progressively shrinks the vocabulary until it reaches the desired size.
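
Step 2 above (choosing the most probable segmentation of a word) can be done without enumerating every segmentation by using dynamic programming. The sketch below is one way to do it under the independence assumption, where the probability of a segmentation is the product of its token probabilities. The token probabilities are hypothetical values chosen for illustration, and `best_segmentation` is a hypothetical helper name.

```python
import math

# Hypothetical token probabilities (they sum to 1.0).
token_probs = {
    "h": 0.05, "u": 0.1, "g": 0.05, "s": 0.05, "n": 0.1,
    "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3,
}


def best_segmentation(word, token_probs):
    """Return (tokens, log_prob) of the most probable segmentation of `word`.

    best[i] stores the best log-probability of word[:i] and the start index
    of the last token used to reach position i.
    """
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None, -math.inf  # the word cannot be built from the vocabulary
    tokens, end = [], len(word)
    while end > 0:  # walk the back-pointers from the end of the word
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[-1][0]


tokens, log_prob = best_segmentation("hugs", token_probs)
print(tokens, math.exp(log_prob))
```

The negative of `log_prob`, weighted by word frequency and summed over the corpus, gives the loss that the pruning step compares before and after removing a token.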

### **Generative Adversarial Network (GAN)**
https://www.geeksforgeeks.org/generative-adversarial-network-gan/
- Generative Adversarial Networks (GANs) are a class of neural networks used for unsupervised learning.
- The goal of generative modeling is to autonomously identify patterns in input data, enabling the model to produce new examples that plausibly resemble the original dataset.
- GANs are made up of two neural networks, a discriminator and a generator. They use adversarial training to produce artificial data that is hard to distinguish from actual data.

##### **Architecture of GANs**

<div>
<img src = "nlp_images/GAN.png" width = "700">
</div>

- A Generative Adversarial Network (GAN) is composed of two primary parts, which are the Generator and the Discriminator.
- Generator
    - The generator model creates realistic data from random noise, continually refining its output to mirror actual data through an adaptive training process. It aims to minimize its loss by producing samples that effectively deceive the discriminator into classifying them as authentic.

- Discriminator
    - The discriminator serves as a binary classifier, distinguishing between real data and synthetic samples produced by the generator by evaluating their authenticity. It learns to improve its accuracy over time, aiming to maximize its ability to identify generated data as fake and real data as authentic, thereby refining the GAN's output to produce highly realistic synthetic data.
    
1. **Initialization**: Two neural networks, the Generator (G) and the Discriminator (D), are set up, where G aims to generate data resembling real data, and D evaluates the authenticity of data from both G and actual datasets.

2. **Generator’s First Move**: G starts by converting a random noise vector into a data sample (e.g., an image), simulating the process of creating new, realistic data.

3. **Discriminator’s Turn**: D assesses both real data from its training set and the synthetic data from G, assigning a probability score to gauge if the data is real (closer to 1) or fake (closer to 0).

4. **The Learning Process**: The adversarial dynamic kicks in: D is rewarded when it classifies real and fake data correctly, while G is rewarded when its output deceives D.

5. **Generator’s Improvement**: G is updated positively when it successfully fools D, motivating it to enhance its data generation to be more convincing.

6. **Discriminator’s Adaptation**: D is reinforced for correctly identifying fake data, sharpening its ability to discriminate between real and generated samples, which in turn drives the iterative improvement of both models.
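
The opposing objectives in steps 3 to 6 can be made concrete with the loss values each network would see. The sketch below assumes binary cross-entropy for D and the commonly used non-saturating loss for G; the text above does not name a specific loss, so both choices and the function names are assumptions.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real samples should score 1, fakes 0.

    d_real is D's output on a real sample, d_fake its output on a generated one.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: G wants D to score its fakes as 1."""
    return -math.log(d_fake)


# Hypothetical discriminator outputs (probabilities that a sample is real).
confident_d = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_d = discriminator_loss(d_real=0.6, d_fake=0.7)

print(confident_d < fooled_d)                     # D's loss is lower when it classifies correctly
print(generator_loss(0.7) < generator_loss(0.1))  # G's loss is lower when it fools D
```

In an actual training loop these two losses are minimized in alternation, each with respect to its own network's parameters.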


### **RNN**
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

##### **Architecture of a traditional RNN**
- Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They typically look as follows:

<div>
<img src = "nlp_images/RNN.png" width = "900">
</div>

##### **Applications of RNN**

<div>
<img src = "nlp_images/Applications_RNNs.png" width = "600">
</div>

##### **Loss function**
- In the case of a recurrent neural network, the loss function *L* of all time steps is defined based on the loss at every time step as follows:
    <div>
    <img src = "nlp_images/LossFunction.png" width = "200">
    </div>

##### **Backpropagation through time**
- Backpropagation is done at each point in time. At timestep *T*, the derivative of the loss *L* with respect to the weight matrix *W* is expressed as follows:
    <div>
    <img src = "nlp_images/Backpropagation.png" width = "180">
    </div>

##### **Commonly used activation functions**
- The most common activation functions used in RNN modules are described below:
    <div>
    <img src = "nlp_images/ActivationFunctions.png" width = "800">
    </div>

- RNNs are useful for tasks involving sequential data, such as sentence generation and machine translation.
- However, RNNs struggle with long-term dependencies for the following two reasons:
    1. **Vanishing/exploding gradient**: During training, the error signal is multiplied by the recurrent weights at each time step, so gradients vanish (if the weights are smaller than 1 in magnitude) or explode (if they are larger than 1).
    2. **Short-term memory bias**: The basic RNN structure tends to be heavily influenced by recent inputs.
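
A quick numeric illustration of the first point. This is a deliberately simplified sketch: it treats the recurrent weight as a single scalar, whereas a real RNN multiplies by a Jacobian matrix at each step, and `gradient_factor` is a hypothetical helper name.

```python
def gradient_factor(weight, num_steps):
    """Factor by which an error signal is scaled after `num_steps` time steps
    when every step multiplies it by the same scalar `weight`."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= weight
    return factor


for w in (0.9, 1.0, 1.1):
    print(w, gradient_factor(w, 50))
```

After 50 steps a weight of 0.9 shrinks the signal to well under 1% of its original size, while a weight of 1.1 amplifies it more than a hundredfold.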

- **Gradient clipping**: It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.
    <div>
    <img src = "nlp_images/gradient_clipping.png" width = "300">
    </div>
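
A minimal sketch of gradient clipping, assuming the common variant that rescales the gradient when its L2 norm exceeds a threshold (the note above only says the gradient is capped, so the choice of norm is an assumption). `clip_by_norm` is a hypothetical helper name.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # norm 5 is scaled down to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))   # already small enough, unchanged
```

Deep learning frameworks ship their own utilities for this; the function above only illustrates the arithmetic.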

- To remedy the vanishing gradient problem, some types of RNNs use specific gates, each with a well-defined purpose. They are usually denoted **Γ** and are equal to:
    <div>
    <img src = "nlp_images/gates.png" width = "550">
    </div>

### **LSTM (Long Short-Term Memory units)**
<div>
<img src = "nlp_images/GRU_LSTM.png" width="800">
</div>

##### **A single LSTM Cell**
- LSTMs (long short-term memory) are a type of RNN that can learn long-term dependencies
- https://medium.com/analytics-vidhya/lstms-explained-a-complete-technically-accurate-conceptual-guide-with-keras-2a650327e8f2
<div>
<img src = "nlp_images/LSTM.png" width="600">
</div>

- The *hidden state* serves as an encoding of the most recent input, capturing immediate past information relevant for tasks like sentiment analysis or next-word prediction. 
- The *cell state*, on the other hand, acts as the network's long-term memory, integrating data from all previous time steps through selective filtering by forget and input gates, allowing LSTMs to dynamically decide the importance of historical information in time-series data analysis. Together, these components enable LSTMs to effectively process and remember information over both short and long durations.
- The cell state represents the memory of the network, storing information over time
    - LSTMs have a complex architecture with forget gates, input gates, and output gates 
    - The forget gate decides how much of the past information the LSTM should remember
    - The input gate decides how much new information to add to the cell state.
    - The output gate decides what part of the current cell state makes it to the output.
    - These gates control the flow of information in the network
    - LSTMs are powerful tools, but they can be difficult to train
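
How the three gates interact in a single time step can be sketched in plain Python. This follows the standard LSTM formulation, in which each gate sees the concatenation of the previous hidden state and the current input; the `params` dictionary layout and the helper names are assumptions made for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x  # list concatenation: [h_prev, x]
    f = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]    # forget gate
    i = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]    # input gate
    o = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(params["W_g"], z, params["b_g"])]  # candidate values
    # Cell state: keep part of the old memory, add part of the new candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: expose part of the (squashed) cell state as output.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


hidden_size, input_size = 2, 1
zeros_W = [[0.0] * (hidden_size + input_size) for _ in range(hidden_size)]
zeros_b = [0.0] * hidden_size
params = {name: zeros_W for name in ("W_f", "W_i", "W_o", "W_g")}
params.update({name: zeros_b for name in ("b_f", "b_i", "b_o", "b_g")})

h, c = lstm_step(x=[1.0], h_prev=[0.0, 0.0], c_prev=[2.0, -2.0], params=params)
print(c)  # with all-zero weights every gate outputs 0.5, so c is half of c_prev
```

The all-zero weights are only there to make the result easy to check by hand; trained weights let each gate open or close depending on the input.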

### **Bidirectional LSTM**
- https://www.geeksforgeeks.org/bidirectional-lstm-in-nlp/

##### **Bidirectional LSTM layer Architecture**
<div>
<img src = "nlp_images/BidLSTM.png" width="800">
</div>

- Bidirectional LSTMs are a type of recurrent neural network model that is able to process information in both the forward and backward direction 
- Bidirectional LSTM is a sequence model which contains two LSTM layers, one for processing input in the forward direction and the other for processing in the backward direction. This allows the model to better understand the relationships between words in a sequence
- Bidirectional LSTMs are useful for tasks like sentiment analysis, text classification, and machine translation

### **Transformers**
- https://serokell.io/blog/transformers-in-ml

##### **Architecture of Transformers**
<div>
<img src = "nlp_images/transformer.png" width="400">
</div>

This architecture is composed of two main parts: the encoder on the left and the decoder on the right. Each part consists of multiple identical layers (stacked N times, as indicated by "Nx" in the diagram). The key components are:

- Add & Norm: The output of each sublayer (attention or feed-forward) is added to that sublayer's input (residual connection) and then layer-normalized. This helps stabilize training.
- Feed Forward: Each position passes through a feed-forward neural network that processes the information from the attention mechanism.

- There are 3 key elements that make transformers so powerful:
    1. Self-attention 
    2. Positional embeddings 
    3. Multihead attention
    ##### **1. Self-attention**
    - Self-attention allows a model to weigh the importance of different words in a sentence relative to each other. It calculates a score indicating how much focus to place on other parts of the input sentence as it processes a word. This mechanism enables the model to capture relationships and dependencies between words, regardless of their positional distance in the sentence. For example, in the sentence "The cat that sat on the mat was fluffy," self-attention helps the model understand that "cat" is related to "fluffy."
    - The self-attention mechanism enables the model to detect the connection between different elements even if they are far from each other and assess the importance of those connections, therefore, improving the understanding of the context.
    ##### **2. Positional Embeddings**
    - In NLP, sequences of text (such as sentences or documents) are typically represented as vectors of word embeddings. However, a word embedding does not capture the position of the word in the sequence.
    - Positional embeddings give the model information about the order of words in a sentence, since models like the Transformer, which rely heavily on attention mechanisms, do not inherently process sequential data in order.
    - Positional embeddings are added to the input embeddings and provide a unique signal for each position in the sequence, allowing the model to recognize word order and use this information to better understand the context and meaning of a sentence.
    ##### **3. Multihead attention**
    - Multihead attention is an extension of the attention mechanism that allows the model to focus on different parts of the input sentence for different "reasons" or "aspects." Instead of having a single set of attention weights, the model has multiple sets (heads), each capable of focusing on different parts of the input. 
    - This architecture enables the model to capture a richer array of relationships in the data, as each head can learn to attend to different features of the input. 
    - Multihead attention combines these diverse perspectives to produce a more comprehensive understanding of the input data. For instance, in one head, the model might focus on the syntactic relationship between words, while another might focus on semantic aspects.

Because self-attention has no recurrence and positional embeddings supply the order information, transformers can process all positions of a sequence simultaneously (parallelization).
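
The self-attention computation described above can be sketched with the standard scaled dot-product formulation, softmax(QKᵀ / √d_k) V. The notes do not spell out this formula, so treat it as an assumption. To stay short, the sketch uses the input vectors directly as queries, keys, and values, whereas a real Transformer first applies learned projection matrices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Three hypothetical 2-dimensional token vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1, and each token's output is computed independently of the others, which is what makes the computation parallelizable. Multihead attention runs several such computations with different projections and concatenates the results.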


##### How do **transformers** work?
1. Input Embedding
    - The transformer starts by converting input tokens (words, characters, etc.) into vectors using embedding layers. These embeddings capture the semantic information of each token.

2. Positional Encoding
    - Since transformers do not inherently process data in order, positional encodings are added to the input embeddings to provide information about the position of each token in the sequence. This allows the transformer to maintain the sequence's order.

3. Self-attention Mechanism
    - The self-attention mechanism allows each token to interact with every other token in the input. It computes attention scores to determine how much focus each token should have on every other token, enabling the model to capture relationships between all parts of the input, regardless of their distance.

4. Multi-head Attention
    - Instead of one set of attention scores, the transformer uses multiple sets or "heads" to allow the model to pay attention to different parts of the input for different reasons simultaneously. The outputs of these attention heads are then combined.

5. Feed-Forward Neural Networks
    - After attention computation, each token's representation is passed through a feed-forward neural network (FFNN). Each token is processed independently, with the same FFNN applied to each position.

6. Residual Connections and Layer Normalization
    - To help mitigate the vanishing gradient problem and facilitate training of deep networks, residual connections followed by layer normalization are applied after the multi-head attention and FFNN layers.

7. Output
    - For each token in the input sequence, the transformer outputs a vector. This output can be used for various tasks, such as classification, translation, or generation, depending on the specific configuration of the output layer.

8. Decoder (For Encoder-Decoder Architecture)
    - In tasks like translation, a decoder is used. It is similar to the encoder but has additional attention layers that focus on relevant parts of the input sequence. The decoder takes the encoder output and generates the target sequence step by step.
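
Step 2 (positional encoding) can be illustrated with the sinusoidal scheme from the original Transformer paper. This is one possible choice and an assumption here, since the notes above do not say which scheme is used and many models learn their positional embeddings instead. The function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Even dimensions use sine, odd dimensions use cosine, with wavelengths
    that grow geometrically across the embedding dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def add_positions(embeddings):
    """Add the positional signal to a list of token embedding vectors."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


pe = sinusoidal_positional_encoding(num_positions=3, d_model=4)
print(pe[0])  # position 0: sine terms are 0, cosine terms are 1

# Two identical token embeddings become distinguishable once positions are added.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
print(add_positions(same_token_twice))
```

Because each position receives a different vector, the same token at two positions no longer has the same representation, which is how order information reaches the attention layers.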

### How Transformers Surpassed RNNs in NLP Performance

**Parallel Processing:**

- RNN: Difficult to parallelize due to sequential processing.
- Transformer: Can process all inputs simultaneously using self-attention mechanisms, making it much faster.


**Long-range Dependency Handling:**

- RNN: Struggles to connect information from distant parts in long sequences (vanishing gradient problem).
- Transformer: Self-attention mechanism allows direct connections between all positions regardless of sequence length.


**Fixed Context Size:**

- RNN: Must compress information into a fixed-size hidden state.
- Transformer: Can dynamically allocate attention to all input tokens.


**Positional Information Processing:**

- RNN: Naturally includes positional information through sequential processing.
- Transformer: Uses explicit positional encoding for more flexible handling of position information.


**Computational Efficiency:**

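- RNN: Requires one sequential step per token, so wall-clock time grows with sequence length and cannot be parallelized across time steps.
- Transformer: Processes all positions in parallel, although the cost of self-attention grows quadratically with sequence length.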


**Multi-scale Feature Extraction:**

- RNN: Extracts features at a single time scale.
- Transformer: Can extract features at various scales simultaneously through multi-head attention.


**Model Depth:**

- RNN: Difficult to create deep structures.
- Transformer: Can effectively train very deep structures through residual connections and layer normalization.

In [2]:
import pandas as pd
df = pd.read_csv('TalkFile_ner_2.csv')
df.head()

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


In [3]:
import ast

# Convert the stringified lists into actual Python lists.
# ast.literal_eval only evaluates literals, so it is safer than eval().
df['Tag'] = df['Tag'].apply(ast.literal_eval)
df.head()

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, I-geo..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, O, O,..."


In [4]:
list_tag = df['Tag'].to_list()
list_tag[0]

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-gpe',
 'O',
 'O',
 'O',
 'O',
 'O']

In [7]:
from itertools import chain
list_labels = ['O'] + [i for i in list(set(chain.from_iterable(list_tag))) if i != 'O']
list_labels

['O',
 'B-tim',
 'B-art',
 'B-gpe',
 'B-per',
 'B-org',
 'I-per',
 'I-gpe',
 'B-eve',
 'I-org',
 'I-eve',
 'I-nat',
 'I-tim',
 'B-geo',
 'B-nat',
 'I-art',
 'I-geo']

In [8]:
list_labels = ['O'] + [i for tag in list_tag for i in tag if i!='O']
list_labels = list(set(list_labels))
list_labels

['B-tim',
 'B-art',
 'B-gpe',
 'B-per',
 'B-org',
 'I-per',
 'I-gpe',
 'B-eve',
 'I-org',
 'I-eve',
 'I-nat',
 'I-tim',
 'B-geo',
 'B-nat',
 'O',
 'I-art',
 'I-geo']

In [9]:
dic = {}
for i, v in enumerate(list_labels):
    dic[v] = i
dic

{'B-tim': 0,
 'B-art': 1,
 'B-gpe': 2,
 'B-per': 3,
 'B-org': 4,
 'I-per': 5,
 'I-gpe': 6,
 'B-eve': 7,
 'I-org': 8,
 'I-eve': 9,
 'I-nat': 10,
 'I-tim': 11,
 'B-geo': 12,
 'B-nat': 13,
 'O': 14,
 'I-art': 15,
 'I-geo': 16}

In [30]:
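# chain.from_iterable flattens a list of lists (or any iterable of iterables) into a single iterable.
# Note: set() ordering is arbitrary, so the label order (and the index mapping below) can differ between runs;
# wrapping the set in sorted() would make the mapping reproducible.
from itertools import chain
list_labels = ['O'] + [i for i in list(set(chain.from_iterable(list_tag))) if i != 'O']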
print(list_labels)
#encoding: maps each unique label to a unique integer so that model can understand
label_index = {}
#decoding: map each index back to its corresponding label, making the predictions understandable
index_label = {}
for i, l in enumerate(list_labels):
    label_index[l] = i
    index_label[i] = l
    

['O', 'B-gpe', 'I-art', 'I-org', 'I-per', 'I-eve', 'I-geo', 'B-geo', 'B-tim', 'B-art', 'I-gpe', 'B-eve', 'B-nat', 'I-nat', 'B-org', 'I-tim', 'B-per']


In [31]:
label_index

{'O': 0,
 'B-gpe': 1,
 'I-art': 2,
 'I-org': 3,
 'I-per': 4,
 'I-eve': 5,
 'I-geo': 6,
 'B-geo': 7,
 'B-tim': 8,
 'B-art': 9,
 'I-gpe': 10,
 'B-eve': 11,
 'B-nat': 12,
 'I-nat': 13,
 'B-org': 14,
 'I-tim': 15,
 'B-per': 16}

In [32]:
index_label

{0: 'O',
 1: 'B-gpe',
 2: 'I-art',
 3: 'I-org',
 4: 'I-per',
 5: 'I-eve',
 6: 'I-geo',
 7: 'B-geo',
 8: 'B-tim',
 9: 'B-art',
 10: 'I-gpe',
 11: 'B-eve',
 12: 'B-nat',
 13: 'I-nat',
 14: 'B-org',
 15: 'I-tim',
 16: 'B-per'}

In [33]:
labels_ind_list = df['Tag'].apply(lambda x: [label_index[i] for i in x]).to_list()
text_list = df['Sentence'].apply(lambda x:x.split(' ')).to_list()
data_dict = {'id':list(range(len(text_list))),'tokens':text_list,'ner_tags':labels_ind_list}
new_df = pd.DataFrame(data_dict)
new_df.head()

Unnamed: 0,id,tokens,ner_tags
0,0,"[Thousands, of, demonstrators, have, marched, ...","[0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, ..."
1,1,"[Families, of, soldiers, killed, in, the, conf...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,2,"[They, marched, from, the, Houses, of, Parliam...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 0]"
3,3,"[Police, put, the, number, of, marchers, at, 1...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
4,4,"[The, protest, comes, on, the, eve, of, the, a...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 14,..."


In [34]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=17, id2label=index_label, label2id=label_index
)
model2 = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=17, id2label=index_label, label2id=label_index
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Special tokens get -100 so the loss ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:  # Remaining sub-tokens of the same word are ignored as well.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
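
The alignment rule inside `tokenize_and_align_labels` can be checked in isolation, without loading a tokenizer. The sketch below applies the same rule to a hand-written `word_ids` list; the example values and the `align_labels` helper are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(word_ids, word_labels):
    """Give each word's label to its first sub-token and IGNORE_INDEX to
    special tokens and to the remaining sub-tokens of the same word."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous:      # first sub-token of a new word
            aligned.append(word_labels[word_idx])
        else:                           # continuation of the previous word
            aligned.append(IGNORE_INDEX)
        previous = word_idx
    return aligned


# Hypothetical example: three words, the second one split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 7, 6]
print(align_labels(word_ids, word_labels))  # [-100, 0, 7, -100, 6, -100]
```

Labelling only the first sub-token keeps one prediction per original word, so the metrics stay comparable to the word-level annotations.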

In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [37]:
from sklearn.model_selection import train_test_split
train_df,test_df = train_test_split(new_df,test_size=0.2,random_state=42)

In [38]:
import datasets
dataset_dict = datasets.DatasetDict()
dataset_dict['train'] = datasets.Dataset.from_pandas(train_df)
dataset_dict['test'] = datasets.Dataset.from_pandas(test_df)

tokenized_dataset = dataset_dict.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/38367 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 38367/38367 [00:02<00:00, 16383.35 examples/s]
Map: 100%|██████████| 9592/9592 [00:00<00:00, 17661.62 examples/s]


In [39]:
example = tokenized_dataset['train'][0]
example

{'id': 7707,
 'tokens': ['The',
  '58-year-old',
  'former',
  'analyst',
  'says',
  'he',
  'provided',
  'information',
  'to',
  'an',
  'official',
  'at',
  'the',
  'Israeli',
  'embassy',
  'and',
  'to',
  'two',
  'members',
  'of',
  'a',
  'lobbying',
  'group',
  'called',
  'the',
  'American',
  'Israel',
  'Public',
  'Affairs',
  'Committee',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  14,
  3,
  3,
  0],
 '__index_level_0__': 7707,
 'input_ids': [101,
  1996,
  5388,
  1011,
  2095,
  1011,
  2214,
  2280,
  12941,
  2758,
  2002,
  3024,
  2592,
  2000,
  2019,
  2880,
  2012,
  1996,
  5611,
  8408,
  1998,
  2000,
  2048,
  2372,
  1997,
  1037,
  19670,
  2177,
  2170,
  1996,
  2137,
  3956,
  2270,
  3821,
  2837,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,


In [40]:
import evaluate
import numpy as np
from transformers import DataCollatorForTokenClassification

# Pads the inputs and labels of each batch to a common length.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

seqeval = evaluate.load("seqeval")

# String labels of the example shown above, for inspection.
labels = [index_label[i] for i in example["ner_tags"]]


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [index_label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [index_label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
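
As a rough illustration of what `compute_metrics` hands to seqeval, the sketch below repeats the argmax and the `-100` filtering on toy data in plain Python. The helper name `strip_ignored`, the two-label mapping `toy_id2label`, and the logits are hypothetical; the real mapping is `index_label` with 17 entries.

```python
def strip_ignored(logits, label_ids, id2label, ignore_index=-100):
    """Return (predictions, references) as label strings, skipping masked positions."""
    preds_out, refs_out = [], []
    for seq_logits, seq_labels in zip(logits, label_ids):
        # Argmax over the class dimension for every token.
        seq_preds = [max(range(len(row)), key=row.__getitem__) for row in seq_logits]
        # Keep only the positions whose reference label is not masked.
        kept = [(p, l) for p, l in zip(seq_preds, seq_labels) if l != ignore_index]
        preds_out.append([id2label[p] for p, _ in kept])
        refs_out.append([id2label[l] for _, l in kept])
    return preds_out, refs_out


toy_id2label = {0: "O", 1: "B-geo"}  # hypothetical label names
# One sequence of four tokens with two classes per token.
toy_logits = [[[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.2, 1.0]]]
toy_labels = [[-100, 1, -100, 0]]

preds, refs = strip_ignored(toy_logits, toy_labels, toy_id2label)
print(preds)  # [['B-geo', 'B-geo']]
print(refs)   # [['B-geo', 'O']]
```

Only the two unmasked positions survive, so special tokens and non-initial sub-tokens do not affect the reported precision, recall, or F1.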

In [41]:
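from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=".",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # Newer transformers releases rename this argument to eval_strategy.
    save_strategy="epoch",  # Must match the evaluation strategy when load_best_model_at_end=True.
    load_best_model_at_end=True,
    push_to_hub=False,
)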

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

{'loss': 0.2909, 'grad_norm': 1.853703498840332, 'learning_rate': 1.791492910758966e-05, 'epoch': 0.21}
{'loss': 0.14, 'grad_norm': 0.963873565196991, 'learning_rate': 1.5829858215179316e-05, 'epoch': 0.42}
{'loss': 0.1275, 'grad_norm': 0.8642410635948181, 'learning_rate': 1.3744787322768976e-05, 'epoch': 0.63}
{'loss': 0.1179, 'grad_norm': 0.8836368322372437, 'learning_rate': 1.1659716430358635e-05, 'epoch': 0.83}
{'eval_loss': 0.11251270025968552, 'eval_precision': 0.8030616284696257, 'eval_recall': 0.815509693558474, 'eval_f1': 0.809237793390811, 'eval_accuracy': 0.9658309880726743, 'eval_runtime': 103.603, 'eval_samples_per_second': 92.584, 'eval_steps_per_second': 5.791, 'epoch': 1.0}
{'loss': 0.112, 'grad_norm': 0.8961143493652344, 'learning_rate': 9.57464553794829e-06, 'epoch': 1.04}
{'loss': 0.0989, 'grad_norm': 0.9021720886230469, 'learning_rate': 7.4895746455379494e-06, 'epoch': 1.25}
{'loss': 0.0983, 'grad_norm': 0.9885796904563904, 'learning_rate': 5.404503753127606e-06, 'epoch': 1.46}
{'loss': 0.0979, 'grad_norm': 0.9366612434387207, 'learning_rate': 3.319432860717265e-06, 'epoch': 1.67}
{'loss': 0.0918, 'grad_norm': 0.7471380233764648, 'learning_rate': 1.2343619683069227e-06, 'epoch': 1.88}
{'eval_loss': 0.10554414987564087, 'eval_precision': 0.8175175977902521, 'eval_recall': 0.8197087465380148, 'eval_f1': 0.8186117059243398, 'eval_accuracy': 0.967841211494649, 'eval_runtime': 102.1942, 'eval_samples_per_second': 93.861, 'eval_steps_per_second': 5.871, 'epoch': 2.0}
{'train_runtime': 3315.332, 'train_samples_per_second': 23.145, 'train_steps_per_second': 1.447, 'train_loss': 0.12816107044426772, 'epoch': 2.0}

TrainOutput(global_step=4796, training_loss=0.12816107044426772, metrics={'train_runtime': 3315.332, 'train_samples_per_second': 23.145, 'train_steps_per_second': 1.447, 'train_loss': 0.12816107044426772, 'epoch': 2.0})
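
The step count and the logged `learning_rate` values can be reproduced by hand. The sketch below assumes a single device, no gradient accumulation, and a linear decay to zero with no warmup steps (none of these are set explicitly in the `TrainingArguments` above); the function name `linear_lr` is hypothetical.

```python
import math

num_train_examples = 38367  # size of the train split reported by the Map output
batch_size = 16             # per_device_train_batch_size
num_epochs = 2              # num_train_epochs
base_lr = 2e-5              # learning_rate

# Assumption: one device and no gradient accumulation, so one batch per optimizer step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Learning rate after `step` updates under linear decay to zero (no warmup assumed)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


print(steps_per_epoch, total_steps)  # 2398 4796
for step in (500, 2500, 4500):
    print(step, linear_lr(step))
```

Under these assumptions the values at steps 500, 2500, and 4500 agree with the logged learning rates (about 1.79e-05, 9.57e-06, and 1.23e-06), and 2398 steps per epoch matches the `checkpoint-2398` save at the end of the first epoch.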

In [None]:
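from transformers import pipeline

# Rebuild a plain sentence from the word tokens of the first test example.
text = " ".join(tokenized_dataset["test"][0]["tokens"])
print(text)

# Run the fine-tuned model as a token-classification (NER) pipeline.
classifier = pipeline("ner", model=model, tokenizer=tokenizer)
classifier(text)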