- The ALBERT (A Lite BERT) model, introduced by the Google team, addresses the memory and training-time costs of large language models by reimagining the BERT architecture. It scales better than BERT: according to the ALBERT paper, an ALBERT configuration comparable to BERT-large has about 18 times fewer parameters and trains about 1.7 times faster. The key modifications are factorized embedding parameterization, cross-layer parameter sharing, and Sentence-Order Prediction (SOP), which replaces the Next Sentence Prediction (NSP) task. Factorized embedding parameterization splits the large vocabulary-embedding matrix into two smaller matrices, so the embedding size no longer has to equal the hidden size. Cross-layer parameter sharing keeps the parameter count from growing with the depth of the network. The SOP loss focuses on inter-sentence coherence instead of NSP's topic prediction, providing finer-grained distinctions at the discourse level. Together these changes make ALBERT more parameter-efficient for language representation learning.
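
As a rough illustration of factorized embedding parameterization, the sketch below compares a single vocabulary-by-hidden embedding matrix with the two smaller matrices ALBERT uses instead. The sizes (a 30,000-token vocabulary, hidden size 768, embedding size 128) are assumptions chosen to resemble a base-sized model; bias terms are ignored.

```python
# Assumed sizes for illustration only (not taken from a specific checkpoint)
vocab_size = 30_000      # V: number of tokens in the vocabulary
hidden_size = 768        # H: width of the transformer layers
embedding_size = 128     # E: width of the factorized token embeddings

# BERT-style: one V x H matrix
unfactorized = vocab_size * hidden_size

# ALBERT-style: a V x E matrix followed by an E x H projection
factorized = vocab_size * embedding_size + embedding_size * hidden_size

print(f"unfactorized: {unfactorized:,} parameters")
print(f"factorized:   {factorized:,} parameters")
print(f"reduction:    {unfactorized / factorized:.2f}x")
```

The saving grows with the vocabulary size, because the large V dimension is multiplied by E instead of H.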

In [2]:
from transformers import BertConfig, BertModel

In [3]:
# The default BertConfig matches BERT-base; the weights are randomly initialized
bert_config = BertConfig()
model = BertModel(bert_config)


In [4]:
print(f"{model.num_parameters() /(10**6)}\
million parameters")

109.48224million parameters


- The code above instantiates a BERT-BASE model from the default `BertConfig` of the Transformers library. The model has 12 layers, a hidden size of 768, and 12 attention heads, which is usually rounded to 110 million parameters. Because the model is built from a configuration rather than loaded from a checkpoint, its weights are randomly initialized. The last cell divides the parameter count by one million and prints it, giving approximately 109.48 million parameters for BERT-BASE.

- In the context of BERT (Bidirectional Encoder Representations from Transformers), "12 layers" refers to the number of transformer layers in the model architecture. Each layer is a block of computations that processes the input data. The BERT model architecture consists of multiple identical layers stacked on top of each other. 

Here's a brief overview of what a single transformer layer typically consists of:

1. **Self-Attention Mechanism:** This mechanism allows each word/token in the input sequence to focus on different parts of the sequence, capturing dependencies between words.

2. **Feedforward Neural Network:** After the self-attention mechanism, the output passes through a feedforward neural network that processes each position independently.

3. **Normalization:** The output of each sub-layer (self-attention and feedforward network) is added back to its input through a residual connection and then passed through layer normalization.

BERT-BASE specifically has 12 of these identical layers stacked on top of each other. This stacking allows the model to capture complex patterns and relationships within the input data. The parameters within each layer are learned during the training process to enable the model to generate representations that are useful for various natural language processing tasks.
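
The three steps above can be sketched in plain Python. This is a toy illustration, not BERT's implementation: it uses a single attention head with no learned query/key/value projections, omits biases, uses ReLU where BERT uses GELU, and all function names and weights are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def self_attention(tokens):
    """Scaled dot-product attention with Q = K = V = tokens (toy simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def feed_forward(x, w1, w2):
    """Position-wise network applied to one token vector: ReLU(x @ w1) @ w2."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)))
              for j in range(len(w1[0]))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden)) for k in range(len(w2[0]))]

def transformer_layer(tokens, w1, w2):
    # 1. self-attention, then residual connection + layer normalization
    attended = self_attention(tokens)
    x = [layer_norm([a + b for a, b in zip(t, att)]) for t, att in zip(tokens, attended)]
    # 2. feedforward network per position, then residual connection + layer normalization
    return [layer_norm([a + b for a, b in zip(xi, feed_forward(xi, w1, w2))]) for xi in x]

# Three tokens with a hidden size of 4; made-up weights mapping 4 -> 8 -> 4
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.5], [0.5, 0.5, 0.0, 1.0]]
w1 = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
w2 = [[0.05 * (i - j) for j in range(4)] for i in range(8)]

output = transformer_layer(tokens, w1, w2)
print(len(output), len(output[0]))  # same shape as the input: 3 tokens, 4 values each
```

Because the output has the same shape as the input, layers of this kind can be stacked, which is what the 12 layers of BERT-BASE do.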

In BERT, the "hidden size" is the dimensionality of the vector that represents each token inside the model, that is, the number of units in each hidden layer. For BERT-BASE it is 768.

## What the 110 million parameters include

The 110 million parameters of the BERT-BASE configuration are the trainable weights and biases throughout the entire model architecture. These parameters are learned during the training process to enable the model to capture and represent patterns in the input data. Here's a breakdown of what these parameters generally include:

1. **Embedding Parameters:** Parameters associated with the input embeddings for each token in the vocabulary. This involves mapping each token to a continuous vector representation.

2. **Attention Parameters:** Parameters associated with the self-attention mechanism, including weights and biases for each attention head in each layer. These parameters allow the model to learn how to attend to different parts of the input sequence.

3. **Feedforward Neural Network Parameters:** Parameters for the feedforward neural network within each transformer layer, including weights and biases. This network processes the output of the attention mechanism.

4. **Layer Normalization Parameters:** Parameters for layer normalization applied after each sub-layer (e.g., after the attention mechanism and after the feedforward neural network).

5. **Position Embedding Parameters:** Parameters for the position embeddings that help the model understand the order of tokens in the input sequence. In BERT these are learned, with one vector per position.

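6. **Output Projection Parameters:** Parameters associated with projecting the final representations to the desired output dimensionality. In `BertModel` this is the pooler, a dense layer applied to the final representation of the first token.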

These parameters collectively make up the model's capacity to represent complex relationships and patterns in natural language data. The specific breakdown may vary based on the exact implementation and configuration details of the BERT model.
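
The breakdown above can be checked with simple arithmetic. The sketch below is a hand count, not a call into the Transformers library; the helper names are hypothetical, and the sizes (vocabulary 30,522, 512 positions, 2 token types, intermediate size 3,072) are assumed to be the `BertConfig` defaults. Under those assumptions the total reproduces the 109.48224 million printed earlier.

```python
def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

def layer_norm(size):
    """Scale and shift vectors of a layer normalization."""
    return 2 * size

def bert_parameter_breakdown(vocab=30522, hidden=768, layers=12,
                             intermediate=3072, max_positions=512, type_vocab=2):
    # Token, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm(hidden)
    # Query, key, value and output projections, followed by one LayerNorm.
    # The number of heads does not change the count: heads split the hidden size.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # Two dense layers (hidden -> intermediate -> hidden), followed by one LayerNorm
    feed_forward = (linear(hidden, intermediate) + linear(intermediate, hidden)
                    + layer_norm(hidden))
    # Dense layer applied to the first token's final representation
    pooler = linear(hidden, hidden)
    return {
        "embeddings": embeddings,
        "attention (all layers)": layers * attention,
        "feed-forward (all layers)": layers * feed_forward,
        "pooler": pooler,
    }

breakdown = bert_parameter_breakdown()
for name, count in breakdown.items():
    print(f"{name:28s}{count:>14,}")
total = sum(breakdown.values())
print(f"{'total':28s}{total:>14,}")
```

Most of the parameters sit in the 12 stacked layers, with the feedforward networks accounting for roughly half of the model.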

# ALBERT MODEL

In [16]:
from transformers import AlbertConfig, AlbertModel, AlbertTokenizer


In [10]:
# Match the BERT-base dimensions; all other values keep the AlbertConfig defaults
albert_config = AlbertConfig(
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = AlbertModel(albert_config)

In [12]:
print(f"{model.num_parameters() /(10**6)}million parameters")

11.683584million parameters
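
The 11.68 million figure can be reproduced by hand, which also shows where the saving comes from. The sketch below is an arithmetic check rather than library code: the function names are hypothetical, and the sizes not set in the cell above (vocabulary 30,000, embedding size 128, 512 positions, 2 token types, 12 layers) are assumed to be the `AlbertConfig` defaults.

```python
def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

def albert_parameter_count(share_layers=True, vocab=30000, embedding=128,
                           hidden=768, layers=12, intermediate=3072,
                           max_positions=512, type_vocab=2):
    # Factorized embeddings: small E-dimensional tables plus a LayerNorm ...
    embeddings = (vocab + max_positions + type_vocab) * embedding + 2 * embedding
    # ... projected up to the hidden size by one dense layer
    projection = linear(embedding, hidden)
    # One transformer layer: attention (4 dense + LayerNorm) and FFN (2 dense + LayerNorm)
    one_layer = (4 * linear(hidden, hidden) + 2 * hidden
                 + linear(hidden, intermediate) + linear(intermediate, hidden)
                 + 2 * hidden)
    # With cross-layer sharing the same weights are reused at every depth
    stack = one_layer if share_layers else layers * one_layer
    pooler = linear(hidden, hidden)
    return embeddings + projection + stack + pooler

shared = albert_parameter_count(share_layers=True)
unshared = albert_parameter_count(share_layers=False)
print(f"with cross-layer sharing:    {shared / 10**6} million parameters")
print(f"without cross-layer sharing: {unshared / 10**6} million parameters")
```

Under these assumptions, cross-layer sharing accounts for most of the reduction, while the factorized embeddings account for the rest.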


In [17]:
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
text = "The cat is so sad ."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)



In [21]:
import pandas as pd
from transformers import pipeline

# Predict the masked token with the pretrained ALBERT checkpoint
fillmask = pipeline('fill-mask', model='albert-base-v2')
pd.DataFrame(fillmask("The cat is so [MASK] ."))

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForMaskedLM: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


|   | score | token | token_str | sequence |
|---|----------|-------|--------------|-----------------------------|
| 0 | 0.281032 | 10901 | cute | the cat is so cute. |
| 1 | 0.094895 | 26354 | adorable | the cat is so adorable. |
| 2 | 0.042963 | 1700 | happy | the cat is so happy. |
| 3 | 0.040976 | 5066 | funny | the cat is so funny. |
| 4 | 0.024234 | 28803 | affectionate | the cat is so affectionate. |


# RoBERTa

- Robustly Optimized BERT pre-training Approach (RoBERTa) is another popular BERT reimplementation. Its improvements lie far more in training strategy than in architectural design. It outperformed BERT in almost all individual tasks on GLUE. Dynamic masking is one of its original design choices. Although static masking is better for some tasks, the RoBERTa team showed that dynamic masking performs well overall. The changes from BERT can be summarized as follows.

- The changes in architecture are as follows:
    - Removing the next sentence prediction training objective.
    - Dynamically changing the masking patterns instead of static masking, by generating a new masking pattern whenever a sequence is fed to the model.
    - A BPE sub-word tokenizer.

- The changes in training are as follows:
    - Controlling the training data: more data is used, such as 160 GB instead of the 16 GB originally used in BERT. Not only the size of the data but also its quality and diversity were taken into consideration in the study.
    - Longer training, with up to 500K pretraining steps.
    - A larger batch size.
    - Longer sequences, which leads to less padding.
    - A large 50K BPE vocabulary instead of a 30K BPE vocabulary.

- Thanks to the uniform Transformers API, RoBERTa can be used in the same way as the ALBERT model above.

+ It uses a BPE tokenizer.
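
The difference between static and dynamic masking can be illustrated with a small sketch. This is a simplification under stated assumptions, not RoBERTa's training code: the `mask_tokens` helper is hypothetical, tokens are split on whitespace instead of BPE, and every token is masked independently with probability 0.15, whereas BERT-style masking also replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK_TOKEN = "<mask>"  # mask token used by RoBERTa tokenizers

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with each token masked with probability mask_prob."""
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the cat sat on the mat and looked at the dog".split()
rng = random.Random(0)  # fixed seed so the run is repeatable

# Static masking: one pattern is drawn during preprocessing and reused every epoch
static_pattern = mask_tokens(tokens, rng)
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is fed to the model
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees the same masked positions in every epoch; with dynamic masking the positions are re-sampled, so over many epochs the model is asked to predict different tokens of the same sentence.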

# See ELECTRA
