# Transformers

Transformers are a type of deep learning model introduced by Google researchers in the 2017 paper "Attention Is All You Need". Hugging Face, a company known for its work in natural language processing (NLP), maintains the open-source Transformers library used throughout this notebook. Transformer models are designed to handle sequential data, such as text, and have been highly successful in NLP tasks like text classification, question answering, and language translation.

https://github.com/huggingface/notebooks/tree/main/course/en

In [2]:
pip show transformers

Name: transformers
Version: 4.43.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: C:\Users\Catello\anaconda3\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.


### BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model developed by Google. It is designed to understand the context of words in a sentence by processing text bidirectionally, meaning it considers both the left and right context of a word.

#### Sentiment analysis
The `sentiment-analysis` pipeline is provided by the Hugging Face Transformers library and performs sentiment analysis on text in a few lines of code. It abstracts away the loading of models and tokenizers, making it straightforward to use pre-trained models for a specific task.

#### DistilBERT
DistilBERT is a smaller, faster, cheaper, and lighter version of the BERT model. It is designed to retain most of the language understanding capabilities of BERT while being more efficient.

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I am having a wonderful day today!")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.999886155128479}]

### Classification
Classification refers to the process of categorizing or labeling data into predefined groups or classes based on certain features or characteristics. In natural language processing (NLP), classification covers tasks such as sentiment analysis, where text is classified into categories like positive, negative, or neutral.

In [6]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
     

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### Zero-shot classification
Zero-shot classification is a type of classification task where the model is asked to classify text into categories that it has not been explicitly trained on. This is particularly useful when you have a pre-trained model and you want to use it for new, unseen categories without having to retrain the model.

In [8]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "Exploring the wonders of space travel is an exciting adventure",
    candidate_labels=["science", "travel", "entertainment"],
)


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

KeyboardInterrupt



### Text generation
Text generation is a natural language processing (NLP) task where a model generates coherent and contextually relevant text based on an initial input or prompt. This can be used for a variety of applications, such as writing assistance, story generation, chatbots, and more.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("Today, we are going to ")


### GPT-2 model
GPT-2 (Generative Pre-trained Transformer 2) is a large-scale language model developed by OpenAI. It is designed to generate human-like text based on an input prompt.

#### distilgpt2
The `model` parameter of `pipeline` specifies the pre-trained model to use for text generation. In this case, `distilgpt2` is a distilled version of the GPT-2 model. Distillation is a technique used to create smaller, faster models that retain most of the performance of the original model.

### Parameter Engineering
Parameter engineering here means adjusting the generation parameters of the text generation pipeline to control its output:
- **max_length**: sets the maximum length of the generated text in tokens, including the prompt. Tokens are the basic units of text that the model processes, which can be words or subwords.
- **num_return_sequences**: specifies the number of different sequences (or continuations) that the model should generate from the input prompt.
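
To make the two parameters above concrete without downloading a model, here is a minimal sketch built on a toy next-word table. The table `NEXT_WORDS` and the helper `toy_generate` are hypothetical and are not part of the Transformers library; they only mimic how a length limit and a number of returned sequences shape the output.

```python
import random

# Hypothetical toy "language model": maps a word to its possible next words.
# It is a stand-in for illustration only and has nothing to do with GPT-2.
NEXT_WORDS = {
    "the": ["secrets", "world", "sea"],
    "secrets": ["of"],
    "of": ["the"],
    "world": ["of"],
    "sea": ["of"],
}


def toy_generate(prompt, max_length, num_return_sequences, seed=0):
    """Return several continuations, each at most max_length tokens long.

    The prompt tokens count toward max_length.
    """
    rng = random.Random(seed)
    prompt_tokens = prompt.lower().split()
    results = []
    for _ in range(num_return_sequences):
        tokens = list(prompt_tokens)
        while len(tokens) < max_length:
            candidates = NEXT_WORDS.get(tokens[-1])
            if not candidates:
                break  # no known continuation: stop early
            tokens.append(rng.choice(candidates))
        results.append(" ".join(tokens))
    return results


outputs = toy_generate("the secrets of", max_length=8, num_return_sequences=2)
for text in outputs:
    print(text)
```

Each returned string starts with the prompt and never exceeds eight whitespace-separated tokens; a real model counts subword tokens rather than words.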

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "Let's discover the secrets of",
    max_length=30,
    num_return_sequences=2,
)


### Word prediction

The model generates predictions for the masked token based on the context of the surrounding words.

#### fill-mask
Fill-mask is a common NLP task where the model predicts the most likely word(s) to fill in a masked token in a sentence. It is particularly useful for language modeling, text completion, and understanding context. The mask token depends on the model: the example below uses `<mask>`, while BERT checkpoints use `[MASK]`.


In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("The secret to success is <mask> and hard work.", top_k=2)

### Named Entity Recognition (NER)
NER is a common NLP task where the model identifies and classifies named entities in a text, such as names of people, organizations, locations, dates, and more.

In [None]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("Carl lives in Paris and works at Amazon.")
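
The grouping option in the call above merges adjacent tokens that belong to the same entity into a single result. The sketch below illustrates the idea in plain Python; the token-level predictions and the helper `group_entities` are hypothetical, and the real pipeline also returns scores and character offsets.

```python
# Hypothetical token-level predictions as (token, tag) pairs.
# "O" marks tokens that are not part of any entity.
token_predictions = [
    ("Jack", "PER"),
    ("Sparrow", "PER"),
    ("works", "O"),
    ("at", "O"),
    ("Amazon", "ORG"),
    (".", "O"),
]


def group_entities(predictions):
    """Merge consecutive tokens that share the same entity tag."""
    groups = []
    current_words, current_tag = [], None
    for token, tag in predictions:
        if tag == current_tag and tag != "O":
            current_words.append(token)  # same entity continues
            continue
        if current_tag not in (None, "O"):
            # The previous entity ended: store it before starting a new run.
            groups.append({"entity_group": current_tag, "word": " ".join(current_words)})
        current_words, current_tag = [token], tag
    if current_tag not in (None, "O"):
        groups.append({"entity_group": current_tag, "word": " ".join(current_words)})
    return groups


grouped = group_entities(token_predictions)
print(grouped)
# [{'entity_group': 'PER', 'word': 'Jack Sparrow'}, {'entity_group': 'ORG', 'word': 'Amazon'}]
```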


### Summarization
Summarization is a natural language processing (NLP) task that involves generating a concise and coherent summary of a longer text. 

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    The field of artificial intelligence has seen remarkable advancements in recent years. Machine learning, a subset of AI, has become increasingly important in various industries, from healthcare to finance. Deep learning, a more specialized form of machine learning, has enabled significant breakthroughs in areas such as image recognition, natural language processing, and autonomous systems.

    Companies and researchers around the world are investing heavily in AI research and development. This has led to the creation of sophisticated algorithms and models that can perform complex tasks with high accuracy. However, there are also ethical considerations and challenges that need to be addressed, such as data privacy, bias, and the potential impact on employment.

    As AI continues to evolve, it is crucial to ensure that its benefits are distributed equitably and that its potential risks are managed responsibly. This requires collaboration between policymakers, industry leaders, and the public to develop robust frameworks for the ethical use of AI.
"""
)


### SentencePiece
SentencePiece is an unsupervised text tokenizer and detokenizer that splits raw text into subword units, using algorithms such as byte-pair encoding (BPE) or a unigram language model. Several transformer tokenizers depend on it, including the one used by the translation model below, so the `sentencepiece` package must be installed.

#### Helsinki-NLP/opus-mt-fr-en
This is a machine learning model trained to translate automatically from French to English, used here through the Hugging Face Transformers library.

In [None]:
pip show sentencepiece

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

translator("Elle adore lire des livres.")


### Bias and limitations (BERT)
It is important to be aware of the biases and limitations of pre-trained language models like BERT, and to take steps to mitigate them as much as possible. This can include using techniques like debiasing, fine-tuning the model on specific tasks or domains, and carefully evaluating the outputs of the model.

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

result = unmasker("He is a [MASK] of many talents.")
print([r["token_str"] for r in result])

result = unmasker("She is a [MASK] of many talents.")
print([r["token_str"] for r in result])

# PyTorch

PyTorch is an open-source framework for machine learning and artificial intelligence, developed by Facebook's AI Research lab. It is designed to be flexible and easy to use, making it popular among researchers and developers working on advanced AI projects.

### Behind the pipeline (PyTorch)

**A tensor** is a multi-dimensional array of numbers and the basic data structure in PyTorch. Vectors and matrices are tensors with one and two dimensions; tensors generalize them to any number of dimensions.

In the Transformers library, tensors store the inputs and outputs of transformer models.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I'm so excited to start this course!",
        "I can't stand this anymore.",
    ]
)

The next cell takes the same two sentences as input and converts them into token identifiers and attention masks, which can be fed to a transformer model for sentiment classification.

- **input_ids**: a tensor that contains the token identifiers for each input sentence, padded to the same length.
- **attention_mask**: a binary tensor that tells the model which tokens to attend to (1) and which are padding to ignore (0).
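
As a rough illustration of how the two fields above relate, the sketch below pads a batch of ID sequences by hand and builds the matching attention mask. The helper `pad_batch`, the padding ID of 0, and the short ID lists are assumptions made for the example, not values taken from the tokenizer.

```python
PAD_ID = 0  # assumed padding id, for illustration only


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```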

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
     

raw_inputs = [
        "I'm so excited to start this course!",
        "I can't stand this anymore.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

### The DistilBERT Model
This code loads a fine-tuned DistilBERT model for English sentiment classification from a given checkpoint, and then uses this model to process the previously generated tokenized inputs.

#### AutoModel

The model output is a tensor of shape 2 x 16 x 768: 2 input examples (batch size), 16 tokens per padded example (sequence length), and 768 values per token (hidden size). Each 768-value vector is a contextual representation of a specific token in a specific input example. These contextual representations can be used as inputs for further processing layers, or to make decisions based on the content of the input examples.

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
     

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

#### AutoModelForSequenceClassification

This code loads a sequence classification model fine-tuned from a DistilBERT checkpoint on the SST-2 task (sentiment analysis on the Stanford Sentiment Treebank), and then uses this model to process the previously generated tokenized inputs.

The model output is a tensor containing logits for 2 input examples, each with 2 possible output classes (positive or negative).

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
print(outputs.logits.shape)


#### Classification scores
The raw classification scores returned by the model are logits: unnormalized values, one per class. They can already be used to predict the most probable output class for each input example by selecting the class with the highest score.

In [None]:
print(outputs.logits)

#### Normalized classification scores
Applying softmax to the logits gives normalized classification scores: probabilities that sum to 1 across the output classes for each input example. The most probable output class is the one with the highest probability.

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

In [None]:
model.config.id2label
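
To tie the last steps together, here is a small pure-Python sketch that turns logits into probabilities and then into labels. The logit values are made up for the example, and the label mapping is an assumption standing in for whatever `model.config.id2label` reports.

```python
import math

# Hypothetical logits for two inputs; real values come from outputs.logits.
logits = [[-1.5, 1.6], [4.0, -3.5]]
# Assumed label mapping for illustration; the notebook reads it from the model config.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


predicted_labels = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of highest probability
    predicted_labels.append(id2label[best])
    print(id2label[best], round(probs[best], 4))
```

Because softmax preserves the ordering of the scores, picking the largest probability selects the same class as picking the largest logit.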

## Models (PyTorch)
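This code builds a BERT model from a default configuration. A model created this way has randomly initialized weights, so it must be trained, or loaded from a pre-trained checkpoint, before the numbers it produces for a sentence (its "embeddings") are useful as input to another machine learning model.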

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [None]:
print(config)

### bert-base-cased
`bert-base-cased` is a pre-trained BERT checkpoint with 12 layers, a hidden size of 768, and 12 self-attention heads. It is case-sensitive and was pre-trained on the English Wikipedia and BookCorpus datasets.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

In [None]:
model.save_pretrained("C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models")


### Embeddings
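The correspondence between words and numbers is how the model represents text numerically: each token is mapped to an integer ID, and the lists of IDs are converted into a tensor that the model accepts as input.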
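
A minimal sketch of that correspondence, assuming a made-up four-entry vocabulary and a made-up table of 3-dimensional vectors (real models use tens of thousands of entries and hundreds of dimensions):

```python
# Hypothetical toy vocabulary: token -> integer id.
vocab = {"[UNK]": 0, "hello": 1, "cool": 2, "nice": 3}

# Hypothetical embedding table: row i is the vector for token id i.
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.3, -0.2],
    [0.5, -0.1, 0.4],
    [0.2, 0.2, 0.9],
]


def encode(words):
    """Map words to ids, falling back to [UNK] for unknown words."""
    return [vocab.get(word.lower(), vocab["[UNK]"]) for word in words]


def embed(ids):
    """Look up the vector for each id."""
    return [embedding_table[i] for i in ids]


ids = encode(["Hello", "Cool", "Bye"])
print(ids)         # [1, 2, 0]
print(embed(ids))  # [[0.1, 0.3, -0.2], [0.5, -0.1, 0.4], [0.0, 0.0, 0.0]]
```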

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]


In [None]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

In [None]:
output = model(model_inputs)

## Tokenizers (PyTorch)

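A tokenizer converts text into a sequence of tokens (words, subwords, or characters) and then into the integer IDs that a language model takes as input. The simplest approach is to split on whitespace, as in the first cell below; the BERT tokenizer used afterwards splits text into subwords.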

In [9]:
tokenized_text = "Jack Sparrow was pirate".split()
print(tokenized_text)

['Jack', 'Sparrow', 'was', 'pirate']


In [11]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [26]:
tokenizer("My mom is a good cooker")

{'input_ids': [101, 1422, 4113, 1110, 170, 1363, 9834, 1200, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [28]:
tokenizer.save_pretrained("C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models")

('C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models\\tokenizer_config.json',
 'C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models\\special_tokens_map.json',
 'C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models\\vocab.txt',
 'C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models\\added_tokens.json',
 'C:/Users/Catello/Desktop/Protfolio/prompt_engineering/models\\tokenizer.json')

In [32]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "My mom is a good cooker"
tokens = tokenizer.tokenize(sequence)

print(tokens)
     

['My', 'mom', 'is', 'a', 'good', 'cook', '##er']
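
The `cook` / `##er` split above comes from subword tokenization. As a simplified sketch of the idea, the function below applies a greedy longest-match-first search over a tiny vocabulary. Both `wordpiece_tokenize` and `toy_vocab` are hypothetical; the real tokenizer has a much larger vocabulary and extra normalization steps.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, just large enough for the example sentence.
toy_vocab = {"My", "mom", "is", "a", "good", "cook", "##er"}

tokens = []
for word in "My mom is a good cooker".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)  # ['My', 'mom', 'is', 'a', 'good', 'cook', '##er']
```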


In [34]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1422, 4113, 1110, 170, 1363, 9834, 1200]


In [36]:
decoded_string = tokenizer.decode([1422, 4113, 1110, 170, 1363, 9834, 1200])
print(decoded_string)

My mom is a good cooker
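
The IDs returned by `convert_tokens_to_ids` are shorter than the `input_ids` returned earlier by calling the tokenizer directly, because the direct call also wraps the sequence in special tokens. The sketch below rebuilds that output by hand, using the IDs 101 and 102 seen at the start and end of the earlier result; the helper `add_special_tokens` is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids, as seen in the tokenizer output earlier


def add_special_tokens(ids):
    """Wrap a single sequence of ids in [CLS] ... [SEP]."""
    return [CLS_ID] + ids + [SEP_ID]


ids = [1422, 4113, 1110, 170, 1363, 9834, 1200]
input_ids = add_special_tokens(ids)
attention_mask = [1] * len(input_ids)  # no padding, so every position is attended to
token_type_ids = [0] * len(input_ids)  # a single sentence, so one segment

print(input_ids)
# [101, 1422, 4113, 1110, 170, 1363, 9834, 1200, 102]
```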
