In [4]:
import pandas as pd
import transformers
import tensorflow as tf
from sentence_transformers import SentenceTransformer



In [3]:
!pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl.metadata (11 kB)
Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.3.1


<b>So far, you have learned how to design a Natural Language Processing (NLP) architecture to achieve successful task performance with transformers. In this chapter, you will learn how to make efficient models out of trained models using distillation, pruning, and quantization.</b>

- In a typical experimental setup, a GPU with 16 GB of memory can handle sentences of up to 512 tokens for training and inference. Longer inputs, however, can cause problems.

Model size reduction can be achieved through three main approaches: knowledge distillation, pruning, and quantization.

1. **Knowledge Distillation:**
   - Knowledge distillation involves training a smaller model (student) to mimic the behavior of a larger pre-trained model (teacher). The idea is to transfer the knowledge learned by the larger model to the smaller one. This process helps in reducing the size of the model while preserving its performance to a certain extent.

2. **Pruning:**
   - Pruning involves removing certain connections or parameters from the neural network based on their importance. This can be done during or after the training process. Pruning techniques identify and eliminate redundant or less important weights, leading to a sparser model with reduced size. Pruning can be magnitude-based, sensitivity-based, or use other criteria to determine the importance of weights.

3. **Quantization:**
   - Quantization involves reducing the precision of the weights and activations in a neural network. Typically, deep learning models use 32-bit floating-point numbers to represent weights and activations, but quantization reduces these to lower bit precision (e.g., 8-bit integers). This results in a smaller memory footprint and often faster inference, usually at the cost of some model accuracy.

By employing these techniques individually or in combination, practitioners can significantly reduce the size of neural network models, making them more suitable for deployment on resource-constrained devices or environments where computational resources are limited. Each approach has its strengths and trade-offs, and the choice of which method to use may depend on the specific requirements of the application.
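
To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only: the function names and the toy weight values are assumptions, and real toolkits use more elaborate schemes (per-channel scales, calibration data, and so on).

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Toy weights (hypothetical values, not taken from any real model).
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized, scale = quantize_symmetric_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, and `max_error` shows the precision given up in exchange.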

- DistilBERT is 1.7x compressed and 1.6x faster with 97% relative performance (compared to original BERT).
- Mini-BERT is 6x compressed, 3x faster, and has 98% relative performance.
- TinyBERT is 7.5x compressed, has 9.4x speed, and 97% relative performance.
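
Students such as DistilBERT are trained with a distillation objective. The sketch below shows only the classic soft-target term (KL divergence between temperature-softened teacher and student distributions). It is a simplified illustration under assumed names and made-up logits, not the training recipe of any of the models listed above, which combine several loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.2]
close_student = [3.8, 1.1, 0.3]
far_student = [0.2, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose predictions resemble the teacher's (`close_student`) gets a much smaller loss than one that disagrees with it (`far_student`).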

- In the context of neural network training, the gradients are calculated during the backward pass of the training process using techniques like backpropagation. The gradient indicates the direction and magnitude in which the parameters should be adjusted to reduce the loss.
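
As a toy illustration of how a gradient drives a parameter update, the following sketch minimizes a one-parameter quadratic loss with plain gradient descent. The loss function, learning rate, and step count are assumptions chosen for readability; the gradient is derived by hand here, whereas backpropagation computes it automatically for every parameter of a network.

```python
def loss(w):
    """Quadratic loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
history = [loss(w)]

for _ in range(20):
    # Move against the gradient: its sign gives the direction,
    # its magnitude scales the step size.
    w -= learning_rate * grad(w)
    history.append(loss(w))
```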

## Pruning transformers
Pruning sets weights at each layer to zero based on a pre-specified criterion.
For example, a simple pruning algorithm could take the weights of each layer and set those that are
below a threshold to zero. This eliminates weights that are very small in magnitude and do not affect
the results much.
Likewise, we can prune redundant parts of the transformer network. Pruned networks are often more
likely to generalize better than the original one, probably because the pruning process keeps the
true underlying explanatory factors and discards the redundant subnetwork. However, we still need to
train a large network first. A reasonable strategy is to train a neural network as large as possible
and then discard the less salient weights or units, whose removal has only a small effect on model
performance.
There are two approaches:
* <b>Unstructured pruning</b>: individual weights with a small saliency (or the smallest magnitude) are removed no matter which part of the neural network they are located in.
* <b>Structured pruning</b>: entire heads or layers are pruned.
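
The two approaches can be sketched on a small weight matrix in plain Python. This is a simplified illustration: the function names and values are assumptions, and the rows of the matrix merely stand in for larger units such as heads or layers.

```python
def prune_by_magnitude(weights, sparsity):
    """Unstructured pruning: zero the fraction `sparsity` of smallest |w|."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    # Entries tied with the threshold are pruned as well.
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_rows_by_l1(weights, keep):
    """Structured pruning: keep only the `keep` rows with the largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


# Hypothetical 2x3 layer.
layer = [[0.8, -0.05, 0.3],
         [-0.01, 0.6, -0.2]]

unstructured = prune_by_magnitude(layer, sparsity=0.5)
structured = prune_rows_by_l1(layer, keep=1)
```

Unstructured pruning keeps the matrix shape and only makes it sparser, while structured pruning actually shrinks it, which is why the latter tends to translate more directly into speedups on standard hardware.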

# L1 pruning

In [13]:
distilroberta = SentenceTransformer('stsb-distilroberta-base-v2')


In [14]:
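from datasets import load_metric, load_dataset

# Note: recent releases of `datasets` removed `load_metric`; the `evaluate`
# package offers `evaluate.load('glue', 'stsb')` as a replacement.
stsb_metric = load_metric('glue', 'stsb')
stsb = load_dataset('glue', 'stsb')
mrpc_metric = load_metric('glue', 'mrpc')
mrpc = load_dataset('glue', 'mrpc')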



In [15]:
import math


In [22]:
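def roberta_sts_benchmark(batch):
    # L2-normalize both embeddings so that their dot product is the cosine similarity
    sts_encode1 = tf.nn.l2_normalize(distilroberta.encode(batch['sentence1']), axis=1)
    sts_encode2 = tf.nn.l2_normalize(distilroberta.encode(batch['sentence2']), axis=1)
    cosine_similarities = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
    # Clip to [-1, 1] so that acos never receives an out-of-range value
    clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
    # Convert the angle between the embeddings into a similarity score in [0, 1]
    scores = 1.0 - tf.acos(clip_cosine_similarities) / math.pi
    return scores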
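
The scoring step of `roberta_sts_benchmark` can be checked on toy vectors without loading a model. The sketch below re-implements the same formula (one minus the angle divided by pi) in plain Python; the helper name and the example vectors are assumptions for illustration.

```python
import math


def angular_similarity(u, v):
    """Return 1 - angle(u, v) / pi, the score used in the benchmark cell."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clip to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return 1.0 - math.acos(cosine) / math.pi


same = angular_similarity([1.0, 2.0], [2.0, 4.0])        # close to 1.0
orthogonal = angular_similarity([1.0, 0.0], [0.0, 1.0])  # 0.5
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])   # 0.0
```

Parallel vectors score about 1, orthogonal vectors 0.5, and opposite vectors 0, so the score always lies in [0, 1].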


## Cross-Lingual and Multilingual Language Modeling

- These are the objectives used for monolingual models. So, what can be done for cross-lingual models?
    The answer is Translation Language Modeling (TLM), which is very similar to Masked Language Modeling (MLM),
    with a few changes. Instead of a sentence from a single language, the model is given a pair of sentences
    in different languages, separated by a special token. Tokens are randomly masked in either language,
    and the model is required to predict them.
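
How a TLM training example might be assembled can be sketched as follows. This is a simplified illustration: the function name, the `</s>` separator, the `[MASK]` string, the masking probability, and the whitespace tokenization are all assumptions, and the sketch omits details of real implementations such as language embeddings.

```python
import random


def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0,
                     sep_token="</s>", mask_token="[MASK]"):
    """Concatenate a translation pair and mask random tokens on both sides."""
    rng = random.Random(seed)
    tokens = src_tokens + [sep_token] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        # The separator is never masked; every other token may be.
        if tok != sep_token and rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels


src_tokens = "Transformers changed natural language processing".split()
tgt_tokens = "Transformerlar doğal dil işlemeyi değiştirdiler".split()
inputs, labels = make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.3)
```

Because both sentences sit in the same input, the model can use the unmasked translation on one side as context for predicting a masked token on the other.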

In [1]:
from transformers import pipeline
unmasker=pipeline('fill-mask',model="bert-base-multilingual-uncased")


In [3]:
sentences=[
"Transformers changed the [MASK] language processing",
"Transformerlar [MASK] dil işlemeyi değiştirdiler",
"ترنسفرمرها پردازش زبان [MASK] را تغییر دادند"
]

In [7]:
for sentence in sentences:
    print(sentence)
    print(unmasker(sentence)[0]["sequence"])
    print("="*50)

Transformers changed the [MASK] language processing
transformers changed the english language processing
Transformerlar [MASK] dil işlemeyi değiştirdiler
transformerlar bu dil islemeyi degistirdiler
ترنسفرمرها پردازش زبان [MASK] را تغییر دادند
ترنسفرمرها پردازش زبانی را تغییر دادند


In [9]:
from transformers import pipeline
unmasker=pipeline('fill-mask',model="xlm-roberta-base")



In [10]:
sentences=[
"Transformers changed the <mask> language processing",
"Transformerlar <mask> dil işlemeyi değiştirdiler",
"ترنسفرمرها پردازش زبان <mask> را تغییر دادند"
]

In [11]:
# XLM-R expects <mask> as its mask token; passing BERT's [MASK] raises
# "PipelineException: No mask_token (<mask>) found on the input".
for sentence in sentences:
    print(sentence)
    print(unmasker(sentence)[0]["sequence"])
    print("="*50)
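
BERT-style checkpoints use `[MASK]` as the mask token, while XLM-R uses `<mask>`. One way to reuse a single sentence list across models is to substitute the model-specific token before calling the pipeline. The helper below is a hypothetical sketch (its name and the case-insensitive matching are assumptions), written so that it runs without downloading a model.

```python
import re


def use_mask_token(sentence, mask_token, placeholder="[MASK]"):
    """Swap a generic placeholder for the model-specific mask token."""
    pattern = re.compile(re.escape(placeholder), flags=re.IGNORECASE)
    # A function replacement avoids any escape handling in mask_token.
    return pattern.sub(lambda _match: mask_token, sentence)


sentences = [
    "Transformers changed the [MASK] language processing",
    "Transformerlar [MASK] dil işlemeyi değiştirdiler",
    "ترنسفرمرها پردازش زبان [mask] را تغییر دادند",
]

# "<mask>" is the mask token of xlm-roberta-base.
xlmr_sentences = [use_mask_token(s, "<mask>") for s in sentences]
```

With a pipeline already loaded, the token does not need to be hard-coded: `unmasker.tokenizer.mask_token` returns the string the model expects.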