# Hugging Face
Hugging Face is an American company that develops tools for building applications with machine learning. It also runs a website where people can share their ML models. It is best known for its transformers library, which is used to perform a variety of NLP tasks.

This is a Python notebook based on this playlist: https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o

## The Pipeline Function
The pipeline function is the highest-level API that the Hugging Face transformers library offers.

In [5]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I hade a healthy breakfast this morning")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9974707365036011}]
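
The warning above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the checkpoint name and revision hash reported in the log are the ones you want to pin. `PINNED` and `build_classifier` are illustrative names, not part of the library:

```python
# Settings copied from the log message above. Pinning them keeps results
# stable even if the default model for the task changes later.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported here so the settings can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` downloads the checkpoint on first use, and the returned object is called the same way as `classifier` above.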

In [11]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about trandformers library",
    candidate_labels=["education", "politics", "business"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about trandformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.7521424889564514, 0.1648690402507782, 0.08298840373754501]}

In [12]:
# Use pipeline with a specific model instead of the task default

In [13]:
from transformers import pipeline

generator = pipeline('text-generation', model='distilgpt2')
generator(
    "India is a very",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'India is a very strong player in the NBA Draft, and their team has tremendous potential at this point in time.\n\n\n\n\n\n\n'},
 {'generated_text': 'India is a very good choice and is a very attractive value for investors and companies to invest in. However it could have been different, especially in India'}]

Use the save_pretrained() method to save the configs, model weights and vocabulary:

classifier.save_pretrained('/some/directory')  

### Different pipelines
Text classification
<br>
Zero-shot classification
<br>
Text Generation
<br>
Text completion
<br>
Token classification
<br>
Question answering
<br>
Summarization
<br>
Translation

## Transfer learning 

Transfer learning, as used here, means fine-tuning an existing model. When we train a model from scratch, we initialize its weights randomly. In fine-tuning (transfer learning), we start from the weights of a pretrained model instead.

Transfer learning has long been used successfully on image datasets, but it is comparatively newer in NLP. It works well on NLP tasks too, but the fine-tuned model inherits the biases of the pretrained model. If a model was pretrained mostly on US data, the fine-tuned model will be biased towards the linguistic characteristics of US English.
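
The difference between random initialization and starting from pretrained weights can be illustrated with a toy model. The sketch below is not an NLP model: it fits a straight line with gradient descent, and the two tasks, the learning rate and the step counts are all made-up assumptions chosen for illustration.

```python
import random


def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, b, xs, ys, steps, lr=0.1):
    """Run `steps` iterations of gradient descent on the MSE loss."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
source_ys = [2.0 * x + 1.0 for x in xs]   # stand-in for the pretraining task
target_ys = [2.2 * x + 0.8 for x in xs]   # related downstream task

# Pretraining: many steps on the source task.
pre_w, pre_b = train(0.0, 0.0, xs, source_ys, steps=500)

# From scratch: random initial weights, only 5 steps on the target task.
rng = random.Random(0)
scratch_w, scratch_b = train(
    rng.uniform(-1, 1), rng.uniform(-1, 1), xs, target_ys, steps=5
)

# Fine-tuning: start from the pretrained weights, same 5 steps.
tuned_w, tuned_b = train(pre_w, pre_b, xs, target_ys, steps=5)

scratch_loss = mse(scratch_w, scratch_b, xs, target_ys)
tuned_loss = mse(tuned_w, tuned_b, xs, target_ys)
print(f"from scratch: {scratch_loss:.4f}, fine-tuned: {tuned_loss:.4f}")
```

With the same small training budget, the run that starts from pretrained weights ends with a lower loss because it begins close to the solution of the related task.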

## Transformer Architecture

The Transformer architecture consists of two parts: an encoder and a decoder.
The encoder and the decoder can each be used as an independent component, or they can be combined.

**The Encoder** is a bi-directional model: when generating the vector for a word, it takes context from the words before it as well as the words after it. It uses the self-attention mechanism. It outputs one vector per input word.
<br>
Encoder examples - BERT, RoBERTa, ALBERT
<br>
Encoders are best for extracting meaningful information, i.e. Natural Language Understanding (NLU): sequence classification (sentiment analysis), question answering, masked language modeling.
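
To make "bi-directional" concrete, here is a minimal sketch of self-attention in plain Python. It assumes made-up 2-dimensional embeddings and skips the learned query/key/value projections that a real encoder uses, so it only illustrates the flow of information, not an actual model.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Bidirectional self-attention without learned projections (a simplification)."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current token against every token, before and after it.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output vector is the weighted average of all token vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional embeddings for a three-word input.
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)

print(len(outputs))   # one output vector per input word
print(weights[0])     # the first word also puts weight on the words after it
```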

**The Decoder** is a uni-directional model: when predicting the next word, it takes context only from the words that come before it. Generation is auto-regressive: each word the decoder produces is appended to the input for the next step. It uses the masked self-attention mechanism. It can generate many words from a given input sequence.
<br>
Decoder Examples - GPT-2, GPT Neo
<br>
Decoders are best for Natural Language generation
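
The two decoder ideas, masked self-attention and auto-regressive generation, can be sketched in plain Python. This is an illustration under stated assumptions: the attention scores are dummy values, and `NEXT_WORD` is a hypothetical lookup table standing in for a trained model.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask row by row, then softmax: position i sees only 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # drop scores for future positions
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get weight 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Hypothetical next-word table standing in for a trained decoder.
# A real decoder conditions on all previous words; this toy looks only at the last one.
NEXT_WORD = {"India": "is", "is": "a", "a": "very", "very": "large", "large": "country"}


def generate(prompt, max_new_words):
    """Greedy auto-regressive loop: each new word is appended to the input."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)


weights = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
print(weights)
print(generate("India is a very", max_new_words=2))
```

Each row of `weights` has zeros to the right of the diagonal, which is what "masked" means here: a position never attends to words that have not been generated yet.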

Combining both is best for many-to-many (sequence-to-sequence) tasks.<br>Weights are not necessarily shared between the encoder and the decoder.<br>The input distribution is different from the output distribution.<br>Best for translation, where we have to understand the meaning of a sentence to generate the output, and for summarization.

## What happens inside the pipeline function

The pipeline consists of 3 stages 

**Tokenizer** -> **Model** -> **Postprocessing**

Raw text -> **Tokenizer** (adds special tokens for start and end) -> Input IDs [100, 4054, ...]

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much",
]
inputs = tokenizer(raw_inputs, padding=True, return_tensors="pt")  # "pt" = return PyTorch tensors
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
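
The zeros in the second row come from padding. Here is a minimal sketch of what `padding=True` appears to do in the output above, assuming right-padding with ID 0 as seen there. `pad_batch` is an illustrative helper, not a library function.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest length and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, without the padding.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence is extended to the length of the longer one so both fit in one rectangular tensor, and the attention mask records which positions are real.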

Input IDs [100, 4054, ...] -> **Model** -> Logits [-4.3343, 4.4343]

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) # returns [batch size, sequence length, hidden size]

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 15, 768])

AutoModel loads only the base model and returns hidden states, which is why the classifier weights are reported as unused above. To get logits, load the same checkpoint with AutoModelForSequenceClassification, which adds the classification head.


In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

tensor([[-1.4683,  1.5105],
        [ 4.2141, -3.4158]], grad_fn=<AddmmBackward0>)


Logits [-4.3343, 4.4343] -> **Postprocessing** -> Predictions [Positive : 99%, Negative : 0.11%]

In [7]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.8393e-02, 9.5161e-01],
        [9.9951e-01, 4.8549e-04]], grad_fn=<SoftmaxBackward0>)
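
For reference, the softmax step can be reproduced by hand. This sketch uses plain Python on the rounded logits printed earlier, so the results agree with the tensor above only up to rounding.

```python
import math


def softmax(logits):
    """Softmax over one row of logits: exponentiate, then normalize."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (rounded to 4 decimals).
logits = [[-1.4683, 1.5105], [4.2141, -3.4158]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, so the values can be read as class probabilities for the corresponding input sentence.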


In [11]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
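
Putting the last two outputs together gives predictions in the same shape the pipeline returned at the start of the notebook. This is a sketch of that final step, not the library's actual implementation. `to_predictions` is an illustrative name.

```python
# Mapping copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [
    [4.8393e-02, 9.5161e-01],
    [9.9951e-01, 4.8549e-04],
]


def to_predictions(probabilities, id2label):
    """Pick the most likely class per row and attach its label name."""
    predictions = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)   # index of the largest probability
        predictions.append({"label": id2label[best], "score": row[best]})
    return predictions


predictions = to_predictions(probabilities, id2label)
print(predictions)
```

The first sentence is labeled POSITIVE and the second NEGATIVE, matching what one would expect from the raw inputs.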