<a href="https://colab.research.google.com/github/MiguelPartosa/HuggingFaceNLP_Course/blob/main/HuggingFaceTransformers_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What can Transformers do

## Pipeline - the most basic object
- These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.

In [None]:
import datasets
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

In [None]:
classifier = pipeline("sentiment-analysis")
print(classifier("Tummy is the best cat in the world."))
print(classifier("tummy is the best cat in the world."))

# The scores are identical: no model was specified, so the pipeline fell back to an
# uncased default, which lowercases the input before tokenizing it.

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9997889399528503}]
[{'label': 'POSITIVE', 'score': 0.9997889399528503}]
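
To see why the two inputs score the same, it may help to look at what "uncased" preprocessing roughly does. The sketch below is only an approximation of that normalization (lowercasing plus accent stripping); `uncased_normalize` is a hypothetical helper written for illustration, not the tokenizer's actual code.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Approximate uncased preprocessing: lowercase, then strip combining accents."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category "Mn") left over after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


a = uncased_normalize("Tummy is the best cat in the world.")
b = uncased_normalize("tummy is the best cat in the world.")
print(a == b)  # True: both strings look the same to the model after normalization
```

Because both sentences collapse to the same normalized string, the model receives identical inputs and returns identical scores.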


#### Example
Iterating over a dataset with a pipeline

In [None]:
# !pip install datasets

# Load the speech-recognition pipeline and the SUPERB ASR test split used in the next cell.
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
dataset = datasets.load_dataset("superb", name="asr", split="test")


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

In [None]:
# KeyDataset (PyTorch only) returns just the requested key from each dataset item,
# since we are not interested in the *target* part of the dataset. For sentence pairs, use KeyPairDataset.
max_outputs = 10
for i, out in enumerate(tqdm(pipe(KeyDataset(dataset, "file")), total=max_outputs)):
    print(out)
    if i + 1 >= max_outputs:  # Stop after printing 10 items
        break

{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'}
{'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'}
{'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'}
{'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}
{'text': 'NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND'}
{'text': "THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLY'S FRAGMENT UPON THE MOON WANDERING COMPANIONLESS PALE FOR WEARINESS"}
{'text': 'THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WHEREON ANOTHER EQUATION BEGAN TO UNFOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TAIL'}
{'text': 'A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL'}
{'text': 'THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITSELF WAS A COLD INDIFFERENT KNOWLEDGE OF HIMSELF'}
{'text': 'AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLESSING HE FLED FROM 
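
Since the pipeline yields results lazily, another way to cap the number of outputs is `itertools.islice`. The following is a minimal sketch under stated assumptions: `fake_pipe` is a hypothetical stand-in generator so the example runs without downloading a model, and `take` is an illustrative helper, not part of `transformers`.

```python
from itertools import islice


def fake_pipe(items):
    """Hypothetical stand-in for a pipeline: lazily yields one result dict per input."""
    for item in items:
        yield {"text": item.upper()}


def take(results, limit):
    """Consume at most `limit` results from a lazy iterator."""
    return list(islice(results, limit))


first_ten = take(fake_pipe(f"clip {i}" for i in range(100)), 10)
print(len(first_ten))  # 10
print(first_ten[0])    # {'text': 'CLIP 0'}
```

Only the first ten inputs are ever processed, because `islice` stops pulling from the generator once the limit is reached.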

## Zero-shot Classification Tasks
- Comparing DeBERTa and mBERT

In [None]:
def is_medical_text(text_classifier, text, threshold=0.5):
    """Return (is_medical, score) for `text` using a zero-shot classifier."""
    result = text_classifier(text, candidate_labels=["medical"])
    score = result["scores"][0]  # Score for the single "medical" label
    return score >= threshold, score  # True when the score meets or exceeds the threshold
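
The zero-shot pipeline returns a dict with `sequence`, `labels`, and `scores` keys, so the same thresholding idea extends to several candidate labels. A minimal sketch, assuming that result shape: `best_label` is a hypothetical helper, and the scores in `stub_result` are made up for illustration rather than produced by a model.

```python
def best_label(result, threshold=0.5):
    """Return (label, score) for the highest-scoring label, or (None, score) if below threshold."""
    scores = result["scores"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    label = result["labels"][best] if scores[best] >= threshold else None
    return label, scores[best]


# Stub result in the pipeline's output shape; the numbers are invented for this example.
stub_result = {
    "sequence": "This article discusses the side effects of a new drug.",
    "labels": ["medical", "sports", "politics"],
    "scores": [0.82, 0.11, 0.07],
}

print(best_label(stub_result))                 # ('medical', 0.82)
print(best_label(stub_result, threshold=0.9))  # (None, 0.82)
```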

Both models are evaluated on the same English sentence:

In [None]:
text_to_classify = "This article discusses the side effects of a new drug."

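### Multilingual BERT
Zero-shot classification with multilingual BERT (mBERT). Note that `bert-base-multilingual-cased` is a base checkpoint without NLI fine-tuning, so its classification head is randomly initialized (see the warnings in the output below) and the resulting score should not be read as a real prediction.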

In [None]:
mbert_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # Or specify a different mBERT variant
mbert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased")

mbert_text_classifier = pipeline("zero-shot-classification", model=mbert_model, tokenizer=mbert_tokenizer)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
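
The last warning indicates that the pipeline looks for an "entailment" entry in the model's `label2id` mapping and falls back to -1 when none is found. A rough sketch of such a lookup, assuming a case-insensitive prefix match: `find_entailment_id` and both mappings below are hypothetical illustrations, not the library's code or any real model's config.

```python
def find_entailment_id(label2id):
    """Return the id of the first label starting with 'entail', or -1 if there is none."""
    for label, idx in label2id.items():
        if label.lower().startswith("entail"):
            return idx
    return -1


# Generic labels, as a freshly initialized classification head would have (assumed for illustration).
generic_labels = {"LABEL_0": 0, "LABEL_1": 1}
# A mapping of the kind an NLI-fine-tuned checkpoint might carry (hypothetical values).
nli_labels = {"contradiction": 0, "neutral": 1, "entailment": 2}

print(find_entailment_id(generic_labels))  # -1
print(find_entailment_id(nli_labels))      # 2
```

A result of -1 is a sign that the checkpoint was not trained for natural language inference, which zero-shot classification relies on.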


In [None]:
mbert_scores = is_medical_text(mbert_text_classifier, text_to_classify)
print(f"Mbert Results:\nIs medical: {mbert_scores[0]}, Score: {mbert_scores[1]}")

Mbert Results:
Is medical: True, Score: 0.6101920008659363


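### DeBERTa
For comparison, the same English text is classified with DeBERTa. `microsoft/deberta-v3-large` is a base checkpoint without NLI fine-tuning, so its classification head is randomly initialized (see the warnings in the output below) and the score is not a meaningful prediction.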

In [None]:
daberta_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
daberta_model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-large")

# Create the pipeline
daberta_text_classifier = pipeline("zero-shot-classification", model=daberta_model, tokenizer=daberta_tokenizer)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [None]:
daberta_scores = is_medical_text(daberta_text_classifier, text_to_classify)
print(f"Daberta Results:\nIs medical: {daberta_scores[0]}, Score: {daberta_scores[1]}")

Daberta Results:
Is medical: True, Score: 0.6157760620117188


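Comparing both: mBERT scores about 0.610 and DeBERTa about 0.616, so both clear the 0.5 threshold by a similar margin. Because both classifiers were loaded with newly initialized heads (per the warnings above), this near-tie says little about either model's actual zero-shot ability.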


## Aside: Text Generation

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"

textgen_tokenizer = AutoTokenizer.from_pretrained(model_name)
textgen_model = AutoModelForCausalLM.from_pretrained(model_name)

generator = pipeline("text-generation", model=textgen_model, tokenizer=textgen_tokenizer)

prompt = "in this course, we will"
generator(prompt, max_length=20, num_return_sequences=3)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'in this course, we will review:\n• Introduction to the world of finance\n• How the'},
 {'generated_text': 'in this course, we will be discussing how to find the best fit parameters for your curve, in'},
 {'generated_text': "in this course, we will learn how the body makes and uses adenosine. We'll have"}]
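
The generated texts are short because `max_length` caps the total sequence length, prompt included. The sketch below shows only the arithmetic: `new_token_budget` is a hypothetical helper, and the prompt length of 6 tokens is an assumed value (the real count depends on the tokenizer).

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for generation when max_length bounds prompt and output together."""
    return max(max_length - prompt_token_count, 0)


assumed_prompt_tokens = 6  # assumption for illustration; measure with the tokenizer in practice
print(new_token_budget(assumed_prompt_tokens, max_length=20))  # 14
```

To bound only the newly generated tokens, `max_new_tokens` can be passed instead of `max_length`.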