<a href="https://colab.research.google.com/github/CodeHunterOfficial/NLP-2024-2025/blob/main/Lecture_8_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face for NLP Tasks

#### Introduction to Hugging Face

Hugging Face is a company and open-source community that develops and distributes libraries and models for natural language processing (NLP). Its `transformers` library and model Hub cover a wide range of tasks, including text classification, question answering, and text generation.

#### Example tasks and how to run them with Hugging Face

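1. **Text Classification**
   - **Task definition:** Assign a label to a text based on its content.
   - **Example with Hugging Face:** Classify a short review as positive or negative. The base checkpoint `distilbert-base-uncased` has no trained classification head, so when no model is specified the `sentiment-analysis` pipeline defaults to the fine-tuned variant `distilbert/distilbert-base-uncased-finetuned-sst-2-english`.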


In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# Output (score rounded): [{'label': 'POSITIVE', 'score': 0.9999}]

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9998855590820312}]
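
For a two-class model like this one, the `score` is a softmax probability computed from the classifier's raw output logits. The following is a minimal sketch of that last step in plain Python; the logits and the `id2label` mapping are made-up illustrative values, not numbers taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input; a real model produces these in its forward pass.
logits = [-4.2, 4.5]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # illustrative label mapping

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

A large gap between the two logits is what pushes the reported score close to 1.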



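2. **Token Classification**
   - **Task definition:** Assign a class to every token in a text.
   - **Example with Hugging Face:** Named entity recognition (NER), which finds and labels entities such as people, organizations, and locations. A base checkpoint such as `bert-base-cased` has to be fine-tuned first; when no model is specified, the `ner` pipeline defaults to `dbmdz/bert-large-cased-finetuned-conll03-english`.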


In [2]:
from transformers import pipeline

ner_model = pipeline("ner")
result = ner_model("Hugging Face is a company based in New York City.")
print(result)
# Output (abridged, score rounded): [{'entity': 'I-ORG', 'score': 0.9943, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, ...]

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'entity': 'I-ORG', 'score': 0.9943229, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.89372134, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9618119, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-LOC', 'score': 0.99907184, 'index': 9, 'word': 'New', 'start': 35, 'end': 38}, {'entity': 'I-LOC', 'score': 0.99889284, 'index': 10, 'word': 'York', 'start': 39, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99922264, 'index': 11, 'word': 'City', 'start': 44, 'end': 48}]
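
The output lists sub-word pieces (`Hu`, `##gging`, `Face`) rather than whole entities. The sketch below merges them into entity spans using the character offsets from the output above; the helper `group_entities` and its "gap of at most one character" rule are illustrative assumptions, not the library's own algorithm.

```python
text = "Hugging Face is a company based in New York City."

# Token-level predictions copied from the output above (scores omitted for brevity).
tokens = [
    {"entity": "I-ORG", "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "word": "Face", "start": 8, "end": 12},
    {"entity": "I-LOC", "word": "New", "start": 35, "end": 38},
    {"entity": "I-LOC", "word": "York", "start": 39, "end": 43},
    {"entity": "I-LOC", "word": "City", "start": 44, "end": 48},
]

def group_entities(text, tokens):
    """Merge neighbouring tokens of the same type into character spans."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        # Extend the previous span if the type matches and the gap is at most one character.
        if spans and spans[-1]["label"] == label and tok["start"] - spans[-1]["end"] <= 1:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

entities = group_entities(text, tokens)
for entity in entities:
    print(entity["label"], entity["text"])
```

In practice the pipeline can do this grouping itself through its `aggregation_strategy` argument, for example `pipeline("ner", aggregation_strategy="simple")`.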



3. **Table Question Answering**
   - **Task definition:** Answer questions about data given in tabular form.
   - **Example with Hugging Face:** Query a small table of films. This task needs a model trained for table question answering, for example the TAPAS checkpoint `google/tapas-base-finetuned-wtq`; general-purpose encoders such as `roberta-large` or `bert-large-uncased` are not table QA models and do not work with this pipeline.


In [None]:
import pandas as pd
from transformers import pipeline

# Load a TAPAS model fine-tuned for table question answering;
# a plain BERT checkpoint such as bert-large-uncased is not supported by this pipeline.
qa_pipeline = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")

# Example table with film data. TAPAS expects every cell to be text,
# so the years are written as strings.
table_data = pd.DataFrame([
    {"Title": "The Shawshank Redemption", "Year": "1994", "Director": "Frank Darabont"},
    {"Title": "The Godfather", "Year": "1972", "Director": "Francis Ford Coppola"},
    {"Title": "The Dark Knight", "Year": "2008", "Director": "Christopher Nolan"},
    {"Title": "Pulp Fiction", "Year": "1994", "Director": "Quentin Tarantino"},
])

# Example questions (in English, the language the model was fine-tuned on)
questions = [
    "Who directed The Godfather?",
    "In what year was Pulp Fiction released?",
    "Which director made The Dark Knight?",
]

# Answer each question against the table
for question in questions:
    answer = qa_pipeline(table=table_data, query=question)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}")
    print()
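
TAPAS tokenizers expect every table cell to be text, so numeric values such as release years have to be converted before the table is passed in. The following sketch does that in plain Python; the helper name `rows_to_string_table` and the two sample rows are hypothetical and serve only as an illustration.

```python
def rows_to_string_table(rows):
    """Turn a list of row dicts into a column-oriented dict whose cells are all strings."""
    columns = list(rows[0].keys())
    return {col: [str(row[col]) for row in rows] for col in columns}

# Sample rows with an integer column, as data often arrives from a file or database.
rows = [
    {"Title": "The Godfather", "Year": 1972, "Director": "Francis Ford Coppola"},
    {"Title": "Pulp Fiction", "Year": 1994, "Director": "Quentin Tarantino"},
]

table = rows_to_string_table(rows)
print(table)
```

The resulting dictionary of columns can be wrapped in `pandas.DataFrame(table)` or passed directly as the `table` argument of the pipeline.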


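4. **Question Answering**
   - **Task definition:** Answer a natural-language question using a supplied context passage.
   - **Example with Hugging Face:** Extract the answer span from a short passage. A base checkpoint such as `albert-base-v2` must first be fine-tuned on a QA dataset; when no model is specified, the `question-answering` pipeline defaults to `distilbert/distilbert-base-cased-distilled-squad`.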


In [4]:
from transformers import pipeline

qa_model = pipeline("question-answering")
context = "Hugging Face is a company based in New York City."
question = "Where is Hugging Face based?"
result = qa_model(question=question, context=context)
print(result)
# Output (score rounded): {'score': 0.968, 'start': 35, 'end': 48, 'answer': 'New York City'}

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'score': 0.9679760932922363, 'start': 35, 'end': 48, 'answer': 'New York City'}
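
The `start` and `end` fields are character offsets into the context, which makes it easy to locate or highlight the answer in the original passage. A short sketch using the result shown above (the bracket-style highlighting is just an illustrative choice):

```python
context = "Hugging Face is a company based in New York City."

# Result copied from the output above.
result = {"score": 0.9679760932922363, "start": 35, "end": 48, "answer": "New York City"}

# `start` is inclusive and `end` is exclusive, so ordinary slicing recovers the answer.
span = context[result["start"]:result["end"]]
print(span)

# Mark the answer inside its surrounding context, e.g. for display in a UI.
highlighted = context[:result["start"]] + "[" + span + "]" + context[result["end"]:]
print(highlighted)
```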



5. **Zero-Shot Classification**
   - **Task definition:** Classify text into labels the model was never explicitly trained to predict.
   - **Example with Hugging Face:** The `facebook/bart-large-mnli` model, trained on natural language inference, scores a text against arbitrary candidate labels without any fine-tuning on a dataset labeled with those categories.


In [5]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
sequence = "I am looking for a restaurant in Paris."
candidate_labels = ["food", "politics", "travel"]
result = classifier(sequence, candidate_labels)
print(result)
# Output (abridged): {'sequence': 'I am looking for a restaurant in Paris.', 'labels': ['food', 'travel', 'politics'], ...}

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'sequence': 'I am looking for a restaurant in Paris.', 'labels': ['food', 'travel', 'politics'], 'scores': [0.7707962989807129, 0.2245446890592575, 0.004658977966755629]}
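
Zero-shot classification recasts the problem as natural language inference: each candidate label is placed into a hypothesis sentence, and the model scores how strongly the input text entails it. The sketch below shows how such pairs can be formed and how the result above can be read. The template string is an assumption (it is believed to match the pipeline's default `hypothesis_template`, but check the documentation for your version).

```python
sequence = "I am looking for a restaurant in Paris."
candidate_labels = ["food", "politics", "travel"]
hypothesis_template = "This example is {}."  # assumed template

# One (premise, hypothesis) pair per label; an NLI model scores each pair for entailment.
pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")

# Result copied from the output above: labels come back sorted by descending score.
result = {
    "labels": ["food", "travel", "politics"],
    "scores": [0.7707962989807129, 0.2245446890592575, 0.004658977966755629],
}
ranking = dict(zip(result["labels"], result["scores"]))
top_label = result["labels"][0]
print(top_label, round(sum(result["scores"]), 4))
```

In this single-label mode the scores are normalized across the candidate labels, which is why they add up to roughly 1.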



6. **Translation**
   - **Task definition:** Translate text from one language to another.
   - **Example with Hugging Face:** The `Helsinki-NLP/opus-mt-en-ro` model for English-to-Romanian translation. Multi-task models such as `t5-small` can also translate, for example from English to French.


In [6]:
from transformers import pipeline

# This Marian checkpoint translates in one fixed direction (English -> Romanian),
# so no language argument is needed. Passing target_lang="ro" to the call raises
# "ValueError: The following `model_kwargs` are not used by the model: ['target_lang']".
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")
result = translator("This is a test.")
print(result)
# The result is a list of the form [{'translation_text': '<Romanian translation>'}]


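7. **Summarization**
   - **Task definition:** Produce a short summary of a source text.
   - **Example with Hugging Face:** BART models fine-tuned on news data, such as `facebook/bart-large-cnn`, summarize news articles. When no model is specified, the `summarization` pipeline defaults to the distilled variant `sshleifer/distilbart-cnn-12-6`.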


In [7]:
from transformers import pipeline

summarizer = pipeline("summarization")
text = "Hugging Face is a technology company specializing in NLP."
summary = summarizer(text, max_length=50, min_length=20, do_sample=False)
print(summary)
# Note: this input is only 14 tokens long, so min_length=20 forces the model to write
# more than the source contains, and the summary may include invented statements.

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Your max_length is set to 50, but your input_length is only 14. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=7)


[{'summary_text': " Hugging Face is a technology company specializing in NLP . Hugging face is based in New York City, New Jersey . Hugged Face is based on the company's NLP technology ."}]
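
The warning above points at the cause of the invented sentences: the length limits exceed the length of the input. One way to avoid this is to derive `min_length` and `max_length` from the input length. The helper below is a hypothetical heuristic with arbitrary default ratios, shown only as a sketch; in real code the token count would come from the model's tokenizer.

```python
def summary_length_bounds(input_tokens, max_ratio=0.5, min_ratio=0.2):
    """Pick min/max summary lengths as fractions of the input length (hypothetical heuristic)."""
    max_length = max(2, int(input_tokens * max_ratio))
    # Keep min_length strictly below max_length and at least 1.
    min_length = max(1, min(int(input_tokens * min_ratio), max_length - 1))
    return min_length, max_length

# The warning above reports an input length of 14 tokens.
min_length, max_length = summary_length_bounds(14)
print(min_length, max_length)

# A longer article of about 400 tokens gets proportionally larger bounds.
long_bounds = summary_length_bounds(400)
print(long_bounds)
```

For the 14-token input this yields a `max_length` of 7, the same value the warning suggests.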



8. **Feature Extraction**
   - **Task definition:** Convert text into numeric vectors (embeddings) for use in downstream analysis.
   - **Example with Hugging Face:** Use `bert-base-uncased` to obtain one contextual vector per token. These vectors are general-purpose features; detecting sentiment or emotion requires a classifier trained on top of them.


In [8]:
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "Hugging Face is a company based in New York City."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is switched off
with torch.no_grad():
    outputs = model(**inputs)

# One hidden vector per token: shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)
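
The model returns one vector per token, while many downstream uses need a single vector per sentence. A common way to get one is mean pooling: averaging the token vectors while ignoring padding positions. The sketch below shows the idea in plain Python on toy three-dimensional vectors; the numbers are invented, and real hidden states are much larger tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy 3-dimensional "hidden states" for four positions; the last one is padding.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```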


9. **Text Generation**
   - **Task definition:** Generate a continuation of an input prompt.
   - **Example with Hugging Face:** The `gpt2` model continues a story opening.


In [9]:
from transformers import pipeline

text_generator = pipeline("text-generation")
prompt = "Once upon a time"
generated_text = text_generator(prompt, max_length=50, num_return_sequences=1)
print(generated_text[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.




Once upon a time, humans and other creatures became the dominant form of the humanoid race and they were the ones created with the essence of the spirit and the power of the spirit. Even now, their actions have inspired this human race even more than any
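
When sampling is enabled, each run can produce a different continuation, because the next token is drawn at random from the model's probability distribution instead of always taking the most likely one. The sketch below illustrates temperature sampling on a toy distribution; the tokens and logits are invented, and `sample_next_token` is a hypothetical helper, not part of the library.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token from softmax(logits / temperature)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical next-token logits after the prompt "Once upon a time".
logits = {",": 3.0, " there": 2.5, " in": 1.0, " the": 0.5}

greedy = max(logits, key=logits.get)  # deterministic: always the highest-scoring token
rng = random.Random(0)                # seeded so the sketch is repeatable
samples = [sample_next_token(logits, temperature=0.8, rng=rng) for _ in range(5)]
print(greedy, samples)
```

Lower temperatures concentrate the probability on the top tokens, and higher temperatures make unlikely tokens more frequent.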



10. **Text2Text Generation**
    - **Task definition:** Transform an input text into an output text, for example by translating or summarizing it.
    - **Example with Hugging Face:** The `t5-base` model, which selects its task from a prefix at the start of the input, such as `translate English to German: `.


In [10]:
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")
# T5 selects its task from a prefix it saw during training. The bare prompt
# "Translate: How are you?" names no target language, and the model returned
# ": How are you? Translate: How are you?" for it.
input_text = "translate English to German: How are you?"
generated_text = text2text_generator(input_text)
print(generated_text[0]['generated_text'])

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
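
T5 was trained with a short task prefix in front of every input, so the prefix decides what the model does with the text. The sketch below keeps a few of the prefixes from T5's original training tasks in one place; the dictionary keys and the helper `build_t5_prompt` are hypothetical names chosen for this illustration.

```python
# A small selection of task prefixes from T5's original training tasks.
T5_PREFIXES = {
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
    "summarize": "summarize: ",
}

def build_t5_prompt(task, text):
    """Prepend the task prefix that tells T5 what to do with the text."""
    if task not in T5_PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(T5_PREFIXES)}")
    return T5_PREFIXES[task] + text

prompt = build_t5_prompt("en_to_de", "How are you?")
print(prompt)
```

The resulting string can be passed to a `text2text-generation` pipeline as its input.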



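11. **Fill-Mask**
    - **Task definition:** Predict the word hidden behind a mask token in a sentence.
    - **Example with Hugging Face:** Multilingual models such as `bert-base-multilingual-cased` fill gaps in sentences in many languages. When no model is specified, the `fill-mask` pipeline defaults to `distilbert/distilroberta-base`, whose mask token is `<mask>` rather than BERT's `[MASK]`.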


In [11]:
from transformers import pipeline

fill_mask = pipeline("fill-mask")
# Take the mask token from the loaded tokenizer instead of hard-coding it: the default
# model here expects "<mask>", and an input containing "[MASK]" raises
# "PipelineException: No mask_token (<mask>) found on the input".
text = f"Hugging Face is a company based in {fill_mask.tokenizer.mask_token} York City."
result = fill_mask(text)
print(result)
# Each candidate in the result is a dict with 'score', 'token', 'token_str', and 'sequence' keys.

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
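
Mask tokens differ between tokenizer families: BERT-style tokenizers use `[MASK]`, while RoBERTa-style tokenizers use `<mask>`. One way to keep example sentences independent of the model is to write them with a neutral placeholder and substitute the right token later. The helper `with_mask` and the `___` placeholder below are hypothetical conventions for this sketch.

```python
# Mask tokens of two common tokenizer families.
MASK_TOKENS = {"bert": "[MASK]", "roberta": "<mask>"}

def with_mask(template, family, placeholder="___"):
    """Replace a neutral placeholder with the mask token of the given model family."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, MASK_TOKENS[family])

template = "Hugging Face is a company based in ___ York City."
bert_text = with_mask(template, "bert")
roberta_text = with_mask(template, "roberta")
print(bert_text)
print(roberta_text)
```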


12. **Sentence Similarity**
    - **Task definition:** Measure how semantically close two sentences are.
    - **Example with Hugging Face:** The `sentence-transformers/paraphrase-MiniLM-L6-v2` model embeds each sentence as a vector; the cosine similarity of the two vectors serves as the similarity score.


In [12]:
# transformers.pipeline has no "sentence-similarity" task (requesting it raises
# "KeyError: Unknown task sentence-similarity"), so this cell uses the separate
# sentence-transformers package instead (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
pair1 = ["I like cats", "I like dogs"]
pair2 = ["I enjoy reading", "Books are fun"]

for pair in (pair1, pair2):
    # Embed both sentences, then compare the two vectors by cosine similarity
    embeddings = model.encode(pair, convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(pair, round(score, 4))
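
Whichever model produces the sentence embeddings, the similarity score itself is usually the cosine of the angle between the two vectors. The sketch below computes it in plain Python on toy three-dimensional vectors; the numbers are invented for illustration and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the first two point in nearly the same direction, the third does not.
cats = [1.0, 2.0, 0.0]
dogs = [1.0, 2.0, 0.5]
taxes = [-2.0, 1.0, 0.0]

close = cosine_similarity(cats, dogs)
far = cosine_similarity(cats, taxes)
print(round(close, 4), round(far, 4))
```

Because cosine similarity ignores vector length, it compares only the direction of the embeddings.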