# A basic guide to the HuggingFace pipeline
                                                                  - By Anirudh Gupta
For further reference and clarity, see this video: https://www.youtube.com/watch?v=QEaBAZQCtwE.
For more pipelines, see: https://huggingface.co/docs/transformers/main_classes/pipelines.
To browse other models and pipelines, visit the HuggingFace website. The material here is drawn from that website and from YouTube videos.

In [6]:
from transformers import pipeline #must include this

classifier = pipeline("sentiment-analysis")   # "sentiment-analysis" is the task name (not a model name); pipeline(...) handles tokenization, inference and post-processing
result = classifier(["I am very happy", "I am very sad"])

print(result)
# Notes:
# result[0]['score'] is the model's confidence in the predicted label (closer to 1 means more confident)
# result[0]['label'] is the predicted label; the set of labels depends on the model
# If no model is specified, the pipeline uses the task's default, here distilbert-base-uncased-finetuned-sst-2-english


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998795986175537}, {'label': 'NEGATIVE', 'score': 0.9994852542877197}]
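
The result is a plain list of dictionaries, one per input, so it can be post-processed with ordinary Python. A minimal sketch, reusing the output printed above instead of calling the model again; the `summarize` helper and its `min_score` threshold are hypothetical names chosen for illustration, not part of the transformers API:

```python
# Inputs and output copied from the cell above, so no model download is needed here.
texts = ["I am very happy", "I am very sad"]
result = [
    {"label": "POSITIVE", "score": 0.9998795986175537},
    {"label": "NEGATIVE", "score": 0.9994852542877197},
]


def summarize(texts, result, min_score=0.9):
    """Pair each input text with its label and flag low-confidence predictions.

    min_score is an illustrative threshold (an assumption), not a library default.
    """
    rows = []
    for text, pred in zip(texts, result):
        confident = pred["score"] >= min_score
        rows.append((text, pred["label"], round(pred["score"], 4), confident))
    return rows


rows = summarize(texts, result)
for row in rows:
    print(row)
```

The same pattern should work for any pipeline that returns one dictionary per input; only the keys differ between tasks.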


In [10]:
# Now we choose a model explicitly and pass extra parameters to control the output
generator = pipeline("text-generation", model="distilgpt2")   # "generator" and "classifier" are just variable names
result = generator(
    "Today i ate", max_length=20, num_return_sequences=2
)
print(result)
# Notes:
# result[0]['generated_text'] is the first generated text
# num_return_sequences is the number of texts to generate
# max_length is the maximum length of each result in tokens, counting the prompt
# On first use the model is downloaded and cached in a local folder, so later runs load it from disk

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today i ate a snack on a plate and I never said how excited I was to get the dish'}, {'generated_text': 'Today i ate one morning from two old students. i can understand when i first heard of this and'}]


# Understanding the Underlying Idea of pipeline
A pipeline combines a model and a tokenizer. To fine-tune a model, we need to understand how the two work together.


In [12]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier(["I am very happy", "I am very sad"])
print(res)
# Notes:
# The tokenizer converts text into the numerical representation (token ids) that the model takes as input
# AutoTokenizer.from_pretrained(model_name) loads the tokenizer that matches the pre-trained model
# AutoModelForSequenceClassification.from_pretrained(model_name) loads the pre-trained model
# In both cases, just call .from_pretrained(model_name)

# Notice we get the same result as before

[{'label': 'POSITIVE', 'score': 0.9998795986175537}, {'label': 'NEGATIVE', 'score': 0.9994852542877197}]


# Understanding Tokenizer more
The tokenizer converts input text into the numerical representation that the model takes as input.

In [5]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sequence = "Understanding Tokenizer in easy way"

res = tokenizer(sequence)
print(res)
# input_ids holds the id of each token; 101 and 102 are the special tokens [CLS] and [SEP] that mark the start and end of the sequence
# In attention_mask, 1 marks a real token the model attends to and 0 marks padding that the model ignores
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
decoded_string = tokenizer.decode(ids)
print(decoded_string)
# These steps show how res is built: text -> tokens -> ids; decode turns the ids back into text
# Notice the decoded string has no capitals, because this "uncased" tokenizer lowercases its input

{'input_ids': [101, 4824, 19204, 17629, 1999, 3733, 2126, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
['understanding', 'token', '##izer', 'in', 'easy', 'way']
[4824, 19204, 17629, 1999, 3733, 2126]
understanding tokenizer in easy way
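
To make the output above less mysterious, here is a toy sketch of the greedy longest-match-first idea behind WordPiece tokenization, in plain Python. It is not the real tokenizer: the vocabulary holds only the ids printed above, the `[PAD]` and `[UNK]` ids are assumptions that do not appear in that output, and the helper names (`wordpiece`, `tokenize`, `encode`, `pad_batch`) are hypothetical. The real tokenizer also handles punctuation and a vocabulary of tens of thousands of entries.

```python
# Toy vocabulary: ids copied from the output above.
# [PAD] = 0 and [UNK] = 100 are assumptions; they do not appear in that output.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "understanding": 4824, "token": 19204, "##izer": 17629,
    "in": 1999, "easy": 3733, "way": 2126,
}


def wordpiece(word):
    """Split one lowercase word into the longest pieces found in VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a continuation piece
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Lowercase (the model is 'uncased'), split on spaces, then apply wordpiece."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


def encode(text):
    """Map tokens to ids and wrap them in [CLS] ... [SEP]."""
    ids = [VOCAB[t] for t in tokenize(text)]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def pad_batch(batch_ids):
    """Pad every sequence to the longest one; the mask is 1 for real tokens, 0 for padding."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [VOCAB["[PAD]"]] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


sequence = "Understanding Tokenizer in easy way"
print(tokenize(sequence))
print(encode(sequence))
print(pad_batch([encode("easy way"), encode("Understanding Tokenizer")]))
```

The last call pads the shorter sequence, which is where zeros in `attention_mask` come from; the single-sentence example above never needs padding, so its mask is all ones.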


# How to Save a Tokenizer and Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

save_directory = "saved"   # any directory name or path
tokenizer.save_pretrained(save_directory)   # tokenizer and model are the objects loaded in the cells above
model.save_pretrained(save_directory)

# Load both back from the saved directory
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)
# Saving is optional; it usually goes at the end of your code

# Using PyTorch/Tensorflow

In [9]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
x_train = ["I am very happy", "I am very sad"]
result = classifier(x_train)
print(result)
# This is what we usually do
# Now let's do the same steps by hand in PyTorch

# Tokenize the batch: pad to the longest sentence, truncate at 512 tokens, return PyTorch tensors
batch = tokenizer(x_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

# Inference only, so gradient tracking is switched off
with torch.no_grad():
    outputs = model(**batch)   # raw scores (logits), one row per sentence
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)   # turn logits into probabilities
    print(predictions)
    labels = torch.argmax(predictions, dim=1)   # index of the most likely class
    print(labels)


[{'label': 'POSITIVE', 'score': 0.9998795986175537}, {'label': 'NEGATIVE', 'score': 0.9994852542877197}]
{'input_ids': tensor([[ 101, 1045, 2572, 2200, 3407,  102],
        [ 101, 1045, 2572, 2200, 6517,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[-4.3359,  4.6890],
        [ 4.1598, -3.4115]]), hidden_states=None, attentions=None)
tensor([[1.2036e-04, 9.9988e-01],
        [9.9949e-01, 5.1472e-04]])
tensor([1, 0])
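
The link between the logits and the pipeline's `label`/`score` output can be checked by hand. A minimal sketch in plain Python, using the logits printed above; the `id2label` mapping is inferred from comparing the two outputs in this cell (index 1 lines up with POSITIVE, index 0 with NEGATIVE) and is hardcoded here as an assumption, not read from the model:

```python
import math

# Logits copied from the SequenceClassifierOutput printed above.
logits = [[-4.3359, 4.6890], [4.1598, -3.4115]]

# Assumption: inferred by matching tensor([1, 0]) with POSITIVE, NEGATIVE above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    predictions.append({"label": id2label[best], "score": probs[best]})

print(predictions)
```

Up to rounding of the printed logits, the scores agree with what the pipeline returned, which suggests the pipeline is doing roughly tokenizer, then model, then softmax and argmax.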


# Example of how to use this with LLMs from HuggingFace
These models are already trained, so using them is much faster than training our own. Other LLMs from HuggingFace can be used in the same fashion.
Model names often include a size such as 3b or 7b, meaning 3 or 7 billion parameters. On a typical laptop, stay at or below 7 billion (7b); larger models may run very slowly (exceptions exist).
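
A rough way to see why size matters is to estimate how much memory the weights alone need: number of parameters times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement; the `weight_memory_gb` helper is a hypothetical name, and the estimate ignores activations and other runtime overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="float16"):
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in [("3b", 3e9), ("7b", 7e9), ("13b", 13e9)]:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

By this estimate a 7b model needs about 14 GB for its weights in float16, which is already close to or beyond the memory of many laptops.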