<a href="https://colab.research.google.com/github/MorojMunshi/Lab/blob/main/UsingHuggingFacePipelinesforMultipleTasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In this lab, students will explore how to use Hugging Face pipelines to perform multiple tasks such as text classification, question answering, text generation, and summarization. By the end of the exercise, students will understand how to work with pipelines and integrate them into practical applications.

# Pipelines

Pipelines are an easy way to use pre-trained models for inference. They hide most of the library's complex code behind a simple API tailored to specific tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering.

**How Pipelines Work:**
1. You provide an input (e.g., a text or a question).
2. The pipeline processes it using a pre-trained model.
3. It returns the result in an easy-to-understand format.

Simply put, pipelines simplify what we did in the last lab!

As illustrated in the figure below (adapted from the Hugging Face course [1]), we can pass a sentence (e.g., "This course is amazing") to the pipeline and specify the task, such as text classification. The pipeline will then process the input by first passing it through the pre-trained tokenizer, followed by the model. Finally, in the post-processing step, it will classify the input and produce a prediction.

[Figure: en_chapter2_full_nlp_pipeline.svg]
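
The post-processing step described above can be sketched in plain Python. This is a minimal illustration rather than the library's actual implementation: the logits are made-up values, and the `id2label` mapping is an assumption chosen to resemble a two-class sentiment model.

```python
import math

def softmax(logits):
    """Turn raw model scores (logits) into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report it the way a pipeline result looks."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Assumed label mapping and made-up logits, for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.1, 3.4]

prediction = postprocess(logits, id2label)
print(prediction)
```

The tokenizer and model stages come before this step; the pipeline chains all three so that a single call returns the label and score.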

## Getting Started

Installing Hugging Face Transformers and Importing Libraries

## Task 1: Sentiment Analysis

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline

In [None]:
# Uses the default model chosen by Hugging Face; pass model="..." to pipeline() to use a different model
sa_pipeline = pipeline("sentiment-analysis")

result = sa_pipeline("I hate you!")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = sa_pipeline("I love you!")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


label: NEGATIVE, with score: 0.9987
label: POSITIVE, with score: 0.9999


When you use the Hugging Face pipeline function without specifying a model, it automatically selects the default pre-trained model for the specific task.

**Question #1:** What is the default pre-trained model used by the pipeline? Can we check that from the object _sa_pipeline_?


**The default model used for the sentiment-analysis task is:**
distilbert-base-uncased-finetuned-sst-2-english

It is a version of the **DistilBERT** model fine-tuned on the SST-2 (Stanford Sentiment Treebank) dataset for **sentiment** analysis.

In [None]:
print(sa_pipeline.model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


**Question #2:**
Now test different sentences with a different model; for example, you can use "bert-base-uncased".

In [None]:
from transformers import pipeline

# Use a fine-tuned model for sentiment analysis
sa_pipeline = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

# Test with different sentences
sentence1 = "I hate you!"
sentence2 = "I love programming!"

# Run predictions
result1 = sa_pipeline(sentence1)[0]
result2 = sa_pipeline(sentence2)[0]

# Print results
print(f"Sentence: '{sentence1}', Label: {result1['label']}, Score: {result1['score']:.4f}")
print(f"Sentence: '{sentence2}', Label: {result2['label']}, Score: {result2['score']:.4f}")



Sentence: 'I hate you!', Label: 1 star, Score: 0.6606
Sentence: 'I love programming!', Label: 5 stars, Score: 0.8565


In [None]:
from transformers import pipeline

# Try bert-base-uncased, a base checkpoint whose classification head is newly initialized (not fine-tuned for sentiment)
sa_pipeline = pipeline("sentiment-analysis", model="bert-base-uncased")

# Test a sentence
result = sa_pipeline("I love programming!")
print(result)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




[{'label': 'LABEL_0', 'score': 0.5064240097999573}]


## Task 2: Named Entity Recognition

Named Entity Recognition (NER) is an NLP task that involves identifying key information (entities) in text and classifying it into predefined categories such as names of people, organizations, locations, dates, monetary values, and more.

**Question #3:**

A. Import the Necessary Classes:
Use the _AutoTokenizer_ and _AutoModelForTokenClassification_ from Hugging Face's transformers library.

B. Load the Tokenizer:
Use the AutoTokenizer class to load the tokenizer for the pre-trained model "dslim/bert-base-NER"

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

# Load the model
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is ...... and I live in Makkah"  # complete the sentence with your name

ner_results = ner_pipeline(example)
print(ner_results)


[{'entity': 'B-LOC', 'score': 0.9982666, 'index': 14, 'word': 'Ma', 'start': 32, 'end': 34}, {'entity': 'I-LOC', 'score': 0.9925713, 'index': 15, 'word': '##kka', 'start': 34, 'end': 37}, {'entity': 'I-LOC', 'score': 0.96728235, 'index': 16, 'word': '##h', 'start': 37, 'end': 38}]
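
The output above splits "Makkah" into three word pieces ('Ma', '##kka', '##h'), each tagged separately. As a rough sketch, the pieces can be merged back into one entity using the `start`/`end` character offsets. The helper `merge_entities` and the output keys `label` and `word` are hypothetical names used only for this illustration; they are not part of the library.

```python
# Token-level output and input sentence copied from the NER cell above.
ner_results = [
    {"entity": "B-LOC", "score": 0.9982666, "index": 14, "word": "Ma", "start": 32, "end": 34},
    {"entity": "I-LOC", "score": 0.9925713, "index": 15, "word": "##kka", "start": 34, "end": 37},
    {"entity": "I-LOC", "score": 0.96728235, "index": 16, "word": "##h", "start": 37, "end": 38},
]
example = "My name is ...... and I live in Makkah"

def merge_entities(tokens, text):
    """Group a B- token with the I- tokens that directly follow it (hypothetical helper)."""
    groups = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix == "I"
            and last["label"] == label
            and tok["index"] == last["last_index"] + 1
        )
        if continues:
            # Extend the current entity to cover this word piece.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            groups.append({"label": label, "start": tok["start"], "end": tok["end"],
                           "last_index": tok["index"], "scores": [tok["score"]]})
    # Read the surface text from the original sentence and average the piece scores.
    return [
        {"label": g["label"], "word": text[g["start"]:g["end"]],
         "start": g["start"], "end": g["end"],
         "score": sum(g["scores"]) / len(g["scores"])}
        for g in groups
    ]

entities = merge_entities(ner_results, example)
print(entities)
```

The pipeline can also do this grouping itself: passing `aggregation_strategy="simple"` when creating the "ner" pipeline returns grouped entities instead of individual word pieces.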


## Task 3: Question Answering

**Question #4:**

A. Import the Necessary Classes:
Use the _BertTokenizer_ and _BertForQuestionAnswering_ from Hugging Face's transformers library.

B. Load the Tokenizer:
Use the BertTokenizer class to load the tokenizer for the pre-trained model "salti/bert-base-multilingual-cased-finetuned-squad"

Note: Since this is a question-answering system, we need to provide the model with the question and the context separately so that it can extract the answer from the context.

In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering

In [None]:
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained("salti/bert-base-multilingual-cased-finetuned-squad")

# Load the model
model = BertForQuestionAnswering.from_pretrained("salti/bert-base-multilingual-cased-finetuned-squad")


In [None]:
question = "How many parameters does BERT-large have?"
context = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
qa_result = qa_pipeline(question=question, context=context)
print(qa_result)



{'score': 0.6954228281974792, 'start': 92, 'end': 96, 'answer': '340M'}


The model identified the answer correctly:

**"340M"**
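
The `start` and `end` values in the result are character offsets into the context, so the answer can be recovered by slicing the context string. The sketch below reuses the context and the result printed above; the helper `highlight_answer` and its `window` parameter are hypothetical names added for illustration.

```python
# Context and question-answering output copied from the cells above.
context = (
    "BERT-large is really big... it has 24-layers and an embedding size of 1,024, "
    "for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take "
    "a couple minutes to download to your Colab instance."
)
qa_result = {"score": 0.6954228281974792, "start": 92, "end": 96, "answer": "340M"}

def highlight_answer(context, result, window=30):
    """Show the answer in brackets with some surrounding context (hypothetical helper)."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[{context[start:end]}]{right}..."

# Slicing with the offsets gives back the answer string.
span = context[qa_result["start"]:qa_result["end"]]
print(span)  # 340M
print(highlight_answer(context, qa_result))
```

This shows that the model selects a span of the given context rather than writing new text.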

## Task 4: Fill the mask

In [None]:
ftm_pipeline = pipeline("fill-mask", model="ixa-ehu/ixambert-base-cased")  # multilingual model en/es/eu

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.




**Question #5:**
Create a sentence with a mask and set top_k=5. Print the result and explain what this parameter does.

In [None]:
from transformers import pipeline

# Create a pipeline for the fill-mask model
ftm_pipeline = pipeline("fill-mask", model="ixa-ehu/ixambert-base-cased")  # multilingual model

# Create the sentence with the missing word [MASK]
sentence = "The capital of France is [MASK]."

# Get the results with top_k=5
results = ftm_pipeline(sentence, top_k=5)

# Print the results
print(results)


[{'score': 0.427861750125885, 'token': 3471, 'token_str': 'Paris', 'sequence': 'The capital of France is Paris.'}, {'score': 0.04406878352165222, 'token': 33117, 'token_str': 'Nice', 'sequence': 'The capital of France is Nice.'}, {'score': 0.039858605712652206, 'token': 59886, 'token_str': 'Amiens', 'sequence': 'The capital of France is Amiens.'}, {'score': 0.03319675847887993, 'token': 17670, 'token_str': 'Lyon', 'sequence': 'The capital of France is Lyon.'}, {'score': 0.03223274275660515, 'token': 56810, 'token_str': 'Grenoble', 'sequence': 'The capital of France is Grenoble.'}]


Meaning of top_k:
It is a parameter that sets how many of the most likely words the model returns to replace [**MASK**].

- With top_k=5, the model returns 5 candidate words for the missing word, ranked by probability.
- Each suggestion includes:
  1. The suggested word.
  2. The probability score (confidence).
  3. The complete sentence with the suggestion filled in.
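
The ranking behind top_k can be sketched as "keep the k highest-probability candidates". In the example below, the first five probabilities are rounded from the output above, while the last two entries are hypothetical lower-ranked candidates added only so that the cut-off is visible. The function `top_k_tokens` is an illustrative name, not a library function.

```python
import heapq

# First five values rounded from the fill-mask output above.
token_probs = {
    "Paris": 0.4279,
    "Nice": 0.0441,
    "Amiens": 0.0399,
    "Lyon": 0.0332,
    "Grenoble": 0.0322,
    "Marseille": 0.0200,  # hypothetical value, for illustration only
    "Toulouse": 0.0150,   # hypothetical value, for illustration only
}

def top_k_tokens(probs, k):
    """Return the k (token, probability) pairs with the highest probability, best first."""
    return heapq.nlargest(k, probs.items(), key=lambda item: item[1])

sentence = "The capital of France is [MASK]."
for token, score in top_k_tokens(token_probs, k=5):
    # Fill the mask with each candidate, as the pipeline's 'sequence' field does.
    print(f"{score:.4f}  {sentence.replace('[MASK]', token)}")
```

A smaller value such as `k=1` would keep only the single most likely word.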

# Reference

1. Hugging Face. (2022). https://huggingface.co

- Check code and tasks here: https://huggingface.co/docs/transformers/v4.15.0/en/task_summary

- Check pretrained models here: https://huggingface.co/models

- Check pipelines here: https://huggingface.co/docs/transformers/main_classes/pipelines