<a href="https://colab.research.google.com/github/AhmedSSoliman/HuggingFace-Tutorial/blob/main/HuggingFace_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Hugging Face](https://miro.medium.com/max/2000/1*Z4mGaMsu34LfyE76QAi9qA.png)


# Using HuggingFace

In this tutorial, we will learn how to use the various functionalities offered by [HuggingFace](https://huggingface.co/).

## About Hugging Face

- HuggingFace is "on a mission to solve NLP, one commit at a time", as per their tagline.

- The HuggingFace Transformers library is closing in on 26K stars on GitHub and provides state-of-the-art Transformer-based models, their pretrained weights, and a lot more (as we will see today).

- They recently released their Tokenizers library. Its core is written in Rust, which makes tokenization fast, so that's an added advantage.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b2

# Pipeline

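A pipeline bundles a pretrained model with its tokenizer and the pre- and post-processing steps, so a task can be run in a couple of lines. See the quick tour of pipelines: https://github.com/huggingface/transformers#quick-tour-of-pipelines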

In [2]:
from transformers import pipeline

## Sentiment Analysis

In [3]:
nlp = pipeline('sentiment-analysis')
nlp('We are very happy to include pipeline into the transformers repository.')

[{'label': 'POSITIVE', 'score': 0.9978193640708923}]

## Question Answering

In [4]:
nlp = pipeline('question-answering')
nlp({
    'question': 'What is my name ?',
    'context': 'My name is Ahmed, I am working with HuggingFace'
})

{'answer': 'Ahmed', 'end': 16, 'score': 0.9860308170318604, 'start': 11}
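
The `start` and `end` values in the result are character offsets into the context string, so slicing the context recovers the answer. A minimal sketch, assuming the result dict is copied by hand from the output above rather than produced by running the model here:

```python
# Context and result copied from the question-answering cell above.
context = 'My name is Ahmed, I am working with HuggingFace'
result = {'answer': 'Ahmed', 'end': 16, 'score': 0.9860308170318604, 'start': 11}

# `start` is inclusive and `end` is exclusive, like a Python slice.
span = context[result['start']:result['end']]
print(span)
```

This is handy when you want to highlight the answer inside the original text instead of only printing it.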

## Predicting Masks

In [5]:
nlp = pipeline('fill-mask')
nlp('I hope you <mask> this video')

[{'score': 0.7073913812637329,
  'sequence': 'I hope you enjoyed this video',
  'token': 3776,
  'token_str': ' enjoyed'},
 {'score': 0.1367352455854416,
  'sequence': 'I hope you enjoy this video',
  'token': 2254,
  'token_str': ' enjoy'},
 {'score': 0.1335318684577942,
  'sequence': 'I hope you liked this video',
  'token': 6640,
  'token_str': ' liked'},
 {'score': 0.005779118277132511,
  'sequence': 'I hope you like this video',
  'token': 101,
  'token_str': ' like'},
 {'score': 0.005615219473838806,
  'sequence': 'I hope you appreciated this video',
  'token': 10874,
  'token_str': ' appreciated'}]
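
Each candidate comes with a `score`, which reads as the model's probability for that token at the masked position. The sketch below reuses the five scores printed above (copied by hand, trimmed to two fields) to pick the best candidate and to check how much probability mass the top five cover; the variable names are only illustrative.

```python
# Scores and tokens copied from the fill-mask output above.
predictions = [
    {'score': 0.7073913812637329, 'token_str': ' enjoyed'},
    {'score': 0.1367352455854416, 'token_str': ' enjoy'},
    {'score': 0.1335318684577942, 'token_str': ' liked'},
    {'score': 0.005779118277132511, 'token_str': ' like'},
    {'score': 0.005615219473838806, 'token_str': ' appreciated'},
]

# The candidate with the highest score is the model's preferred fill.
best = max(predictions, key=lambda p: p['score'])

# The five scores do not sum to 1: the rest of the mass is spread
# over all the other tokens in the vocabulary.
covered = sum(p['score'] for p in predictions)

print(best['token_str'].strip(), round(covered, 3))
```

Note that `token_str` carries a leading space, which is why the sketch strips it before printing.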

## Named Entity Recognition (NER)

In [8]:
nlp = pipeline('ner')
nlp('It is me, Ahmed, I am working with HuggingFace')

[{'end': 15,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9993723630905151,
  'start': 10,
  'word': 'Ahmed'},
 {'end': 37,
  'entity': 'I-ORG',
  'index': 11,
  'score': 0.9980229735374451,
  'start': 35,
  'word': 'Hu'},
 {'end': 42,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.9865249991416931,
  'start': 37,
  'word': '##gging'},
 {'end': 43,
  'entity': 'I-ORG',
  'index': 13,
  'score': 0.996594250202179,
  'start': 42,
  'word': '##F'},
 {'end': 46,
  'entity': 'I-ORG',
  'index': 14,
  'score': 0.993009626865387,
  'start': 43,
  'word': '##ace'}]
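
The output above is per sub-token: `HuggingFace` was split into `Hu`, `##gging`, `##F` and `##ace`, where the `##` prefix marks a piece that continues the previous one. Below is a minimal sketch of gluing those pieces back into whole words. `merge_wordpieces` is a hypothetical helper written for this tutorial, not part of the transformers library, and averaging the sub-token scores is just one possible choice.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token."""
    merged = []
    for tok in tokens:
        if tok['word'].startswith('##') and merged:
            prev = merged[-1]
            prev['word'] += tok['word'][2:]   # drop the '##' marker
            prev['end'] = tok['end']          # extend the character span
            prev['scores'].append(tok['score'])
        else:
            merged.append({'word': tok['word'], 'entity': tok['entity'],
                           'start': tok['start'], 'end': tok['end'],
                           'scores': [tok['score']]})
    # Assumption: report the mean sub-token score for each merged word.
    return [{'word': m['word'], 'entity': m['entity'],
             'start': m['start'], 'end': m['end'],
             'score': sum(m['scores']) / len(m['scores'])} for m in merged]

# Tokens copied from the NER output above.
tokens = [
    {'end': 15, 'entity': 'I-PER', 'index': 5, 'score': 0.9993723630905151, 'start': 10, 'word': 'Ahmed'},
    {'end': 37, 'entity': 'I-ORG', 'index': 11, 'score': 0.9980229735374451, 'start': 35, 'word': 'Hu'},
    {'end': 42, 'entity': 'I-ORG', 'index': 12, 'score': 0.9865249991416931, 'start': 37, 'word': '##gging'},
    {'end': 43, 'entity': 'I-ORG', 'index': 13, 'score': 0.996594250202179, 'start': 42, 'word': '##F'},
    {'end': 46, 'entity': 'I-ORG', 'index': 14, 'score': 0.993009626865387, 'start': 43, 'word': '##ace'},
]

entities = merge_wordpieces(tokens)
for ent in entities:
    print(ent['word'], ent['entity'], ent['start'], ent['end'])
```

With the tokens above this yields two entities, `Ahmed` (I-PER) and `HuggingFace` (I-ORG), each with a character span into the input sentence.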

In [9]:
# Save the NER model's weights and config to the current directory
nlp.model.save_pretrained('.')

# Text Generation

In [10]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [11]:
tokeniser = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

## Single Word Prediction

In [12]:
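text = "let us see how this turns"
indexed_tokens = tokeniser.encode(text)  # text -> list of token ids
tokens_tensor = torch.tensor([indexed_tokens])  # add a batch dimension
model.eval()  # switch off dropout for inference
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)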

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


In [13]:
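with torch.no_grad():  # no gradients are needed for inference
  outputs = model(tokens_tensor)
  predictions = outputs[0]  # logits, shape (batch, sequence length, vocabulary size)

print(predictions.shape)

# Greedy choice: the highest-scoring token at the last position
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokeniser.decode(indexed_tokens + [predicted_index])

print(predicted_text)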

torch.Size([1, 6, 50257])
let us see how this turns out
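
The printed shape `[1, 6, 50257]` reads as one sequence, six input tokens, and one score (logit) per entry of GPT-2's 50257-token vocabulary. Only the scores at the last position matter for predicting the next word. The sketch below illustrates that step in plain Python with a made-up four-word vocabulary and made-up logits; it is not the model's real output.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and last-position logits, for illustration only.
vocab = ['out', 'up', 'around', 'off']
last_logits = [4.0, 2.0, 1.0, 0.5]

probs = softmax(last_logits)

# Softmax keeps the ordering of the logits, so taking the argmax of the
# logits (as the cell above does) picks the same token as the argmax of probs.
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(vocab[predicted_index])
```

This is why the cell above can skip the softmax entirely and call `torch.argmax` on the raw logits.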


## Looping for multi-word

In [14]:
text = "i am very exited to present to you this"
# Greedy decoding: append the single most likely next token, 50 times
for _ in range(50):
  indexed_tokens = tokeniser.encode(text)
  # Keep the input on the same device as the model
  tokens_tensors = torch.tensor([indexed_tokens]).to(model.device)
  with torch.no_grad():
    outputs = model(tokens_tensors)
    predictions = outputs[0]  # logits for every position
  predicted_index = torch.argmax(predictions[0,-1,:]).item()
  text = tokeniser.decode(indexed_tokens + [predicted_index])

print(text)

i am very exited to present to you this new book. I am very excited to share with you the first chapter of the book, "The Secret of the Soul."


The Secret of the Soul is a book that I have been reading for over a year now. I have been


## Or: use the run_generation.py script

In [17]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 73432, done.
remote: Counting objects: 100% (193/193), done.
remote: Compressing objects: 100% (141/141), done.
remote: Total 73432 (delta 100), reused 95 (delta 41), pack-reused 73239
Receiving objects: 100% (73432/73432), 56.42 MiB | 29.25 MiB/s, done.
Resolving deltas: 100% (52193/52193), done.


In [23]:
%cd /content/

/content


In [24]:
!python pytorch-transformers/examples/pytorch/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2

2021-05-25 15:59:37.547932: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
05/25/2021 15:59:45 - INFO - __main__ -   Namespace(device=device(type='cuda'), fp16=False, k=0, length=100, model_name_or_path='gpt2', model_type='gpt2', n_gpu=1, no_cuda=False, num_return_sequences=1, p=0.9, padding_text='', prefix='', prompt='', repetition_penalty=1.0, seed=42, stop_token=None, temperature=1.0, xlm_language='')
Model prompt >>> How are you?
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
=== GENERATED SEQUENCE 1 ===
How are you? He was really crazy," said his mother, Carolyn Hickman, who suffered from Mettler's disease. "We all knew it, but this year? It wasn't so bad. Everything's great."

Not so great, Hickman said, for about 50 years. In 1993, she moved to Canada and moved with her family and enjoyed living at her aunt's cottage, in Canada's west end. At the time, she was traveling because of his
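
The `Namespace(...)` line in the log shows `p=0.9`, `k=0` and `temperature=1.0`. As far as I can tell, the script samples the next token instead of always taking the most likely one, and `p` is the nucleus (top-p) threshold from the "Curious Case of Neural Text Degeneration" paper linked at the end. Here is a rough plain-Python sketch of the filtering idea, with a made-up toy distribution; `top_p_filter` is a hypothetical helper, not the script's actual implementation.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose total
    probability reaches p, then renormalise so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

# Hypothetical next-token distribution, for illustration only.
toy = {'out': 0.6, 'up': 0.25, 'around': 0.1, 'off': 0.05}

nucleus = top_p_filter(toy, p=0.9)

# Sample from the filtered distribution (seeded for repeatability).
rng = random.Random(42)
next_token = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
print(sorted(nucleus), next_token)
```

With these toy numbers the unlikely token `off` is cut, and the next word is drawn at random from the remaining three. That randomness is why the script's output wanders, while the greedy loop earlier always produces the same continuation.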

# Summarising text using HuggingFace

In [25]:
text = 'Shakespeare occupies a position unique in world literature. Other poets, such as Homer and Dante, and novelists, such as Leo Tolstoy and Charles Dickens, have transcended national barriers, but no writer’s living reputation can compare to that of Shakespeare, whose plays, written in the late 16th and early 17th centuries for a small repertory theatre, are now performed and read more often and in more countries than ever before. The prophecy of his great contemporary, the poet and dramatist Ben Jonson, that Shakespeare “was not of an age, but for all time,” has been fulfilled. It may be audacious even to attempt a definition of his greatness, but it is not so difficult to describe the gifts that enabled him to create imaginative visions of pathos and mirth that, whether read or witnessed in the theatre, fill the mind and linger there. He is a writer of great intellectual rapidity, perceptiveness, and poetic power. Other writers have had these qualities, but with Shakespeare the keenness of mind was applied not to abstruse or remote subjects but to human beings and their complete range of emotions and conflicts. Other writers have applied their keenness of mind in this way, but Shakespeare is astonishingly clever with words and images, so that his mental energy, when applied to intelligible human situations, finds full and memorable expression, convincing and imaginatively stimulating. As if this were not enough, the art form into which his creative energies went was not remote and bookish but involved the vivid stage impersonation of human beings, commanding sympathy and inviting vicarious participation. Thus, Shakespeare’s merits can survive translation into other languages and into cultures remote from that of Elizabethan England.'

In [28]:
from transformers import BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

In [29]:
# Truncate explicitly so inputs longer than 1024 tokens are cut instead of raising a warning
inputs = tokenizer([text], max_length=1024, truncation=True, return_tensors='pt')

# Beam search with 4 beams; the summary is capped at 100 tokens
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=False)

for ids in summary_ids:
    short = tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Character counts of the original text and of the summary
    print(len(text), len(short))
    print(short)


1761 341
Shakespeare occupies a position unique in world literature. He is a writer of great intellectual rapidity, perceptiveness, and poetic power. Other writers have had these qualities, but with Shakespeare the keenness of mind was applied not to abstruse or remote subjects but to human beings and their complete range of emotions and conflicts.


HuggingFace Transformers: https://github.com/huggingface/transformers

BART: https://arxiv.org/abs/1910.13461

Curious Case of Neural Text Degeneration: https://arxiv.org/abs/1904.09751