# Transformers

These notes cover models available through the Hugging Face Hub and the `transformers` package.
Transformers are neural network models that take inputs and transform them into numerical representations.

Core components:
- Encoders
- Decoders
- Self-attention mechanism
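
To make the self-attention item concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification under stated assumptions: the 2-d embeddings are made up, and queries, keys, and values are all the raw input vectors, whereas real transformers apply learned projection matrices and several attention heads.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Toy self-attention: queries, keys, and values are the input vectors themselves."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product of this token with every token (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d embeddings
attended = self_attention(tokens)
for row in attended:
    print([round(x, 3) for x in row])
```

Each output row mixes information from every token, which is how a token's representation comes to depend on its context.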

Use cases span text, images (vision), and audio.
Classification is available across all of these.

A model pretrained on one dataset can be reused for other tasks (transfer learning). So if I want to train for a similar new task, I don't need to supply a large amount of new data. (This is probably also why very narrow, highly specific tasks are hard to cover with off-the-shelf Hugging Face models.)

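In [None]:
# Use the Hugging Face Hub API
from huggingface_hub import HfApi

api = HfApi()

# View all available models (slow: the Hub hosts a very large number of models)
# list(api.list_models())

# Search for specific models: the 5 most-downloaded text-classification models.
# ModelFilter was never imported here, has no sort/order/limit fields, and is
# deprecated in recent huggingface_hub releases, so the arguments go to
# list_models directly.
models = api.list_models(
    filter="text-classification",  # task tag
    sort="downloads",
    direction=-1,  # -1 sorts in descending order (may be deprecated in newer releases)
    limit=5,
)
list(models)  # list_models returns a lazy iterable, so materialize it to view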

In [None]:
# Saving a model locally
from transformers import AutoModel

model_id = 'bert-base-uncased'
# Download the pretrained model
model = AutoModel.from_pretrained(model_id)
# Save to a local directory
model.save_pretrained(save_directory=f'models/{model_id}')

Datasets on Hugging Face use the Apache Arrow format, which is column-based.
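
A rough illustration of what "column-based" means, using plain Python lists rather than Arrow itself. The three sample rows are invented, and this only mimics the layout idea; it says nothing about Arrow's actual memory format.

```python
# Row-oriented layout: one record per row.
rows = [
    {"text": "great film", "label": 1},
    {"text": "boring plot", "label": 0},
    {"text": "awful acting", "label": 0},
]

# Column-oriented layout: one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading a whole column touches a single list instead of every record.
labels = columns["label"]

# Filtering produces row indices, which are then applied to every column.
keep = [i for i, label in enumerate(labels) if label == 0]
filtered = {key: [values[i] for i in keep] for key, values in columns.items()}
print(filtered)
```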

In [2]:
from datasets import load_dataset_builder, load_dataset

data_builder = load_dataset_builder('imdb')
print(data_builder.info.description)  # info
print(data_builder.info.features)  # list of columns

data = load_dataset('imdb')  # to load dataset

# To manipulate the dataset
imdb = load_dataset('imdb', split='train')
# Filter
filtered = imdb.filter(lambda row: row['label'] == 0)  # keep only rows with label == 0
# Slicing
sliced = filtered.select(range(2))  # select rows by index; here rows 0 and 1

## Pipelines with Hugging Face
When you click "Use in Transformers" while viewing a model on the web...


Auto classes - general-purpose classes for loading models, tokenizers, configs, processors, and feature extractors; flexible and direct.
- use them to download a model directly
Use cases:
- Fine-tuning a pre-trained model on a specific task (e.g., classification, translation).
- Building custom workflows by combining a model, tokenizer, and other components.
- Advanced tasks requiring direct control over the model architecture and inputs/outputs.

Pipeline - higher-level and ready to use; great for getting started
A pipeline simplifies the process of loading a pre-trained model and performing inference. It bundles the model, tokenizer, and preprocessing/postprocessing into a single interface.
- task-specific pipelines exist for each task
- leverages Auto classes behind the scenes
Use cases:
- Quickly prototyping or running inference without needing detailed knowledge of the model’s internals.
- Non-developers or beginners who need easy access to NLP models for common tasks.
- Tasks where you don’t require much customization of the underlying model.
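
As a sketch of the postprocessing step a classification pipeline bundles, the snippet below turns raw logits into a label and score with a softmax. The logits and the `id2label` mapping are made-up values for illustration, and `postprocess` is a hypothetical helper, not a function from `transformers`.

```python
import math


def postprocess(logits, id2label):
    """Convert raw class logits into a pipeline-style result dict."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hypothetical label mapping
result = postprocess([-1.0, 2.0], id2label)  # hypothetical model output
print(result["label"], round(result["score"], 4))
```

With Auto classes you would write this step (and the tokenization before it) yourself; a pipeline does both for you.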

ChatGPT's explanation of the differences: https://chatgpt.com/share/677cc85e-2004-8013-8101-ed7cc3b3f58e

In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Creating a pipeline
# Replace '...' with a model ID; if no model is specified, the task's default model is used
my_pipeline = pipeline(task='text-classification', model='...')
input = "text classification sample text"  # note: this name shadows Python's built-in input()
my_pipeline(input)


# With Auto classes
# Download the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Create the pipeline from the explicitly loaded model and tokenizer
sentiment_analysis = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)

# Predict the sentiment
output = sentiment_analysis(input)

print(f"Sentiment using AutoClasses: {output[0]['label']}")



#### Tokenization
- converts natural-language text into tokens, which are then mapped to numerical IDs

#### Normalization
- cleaning text and removing extra whitespace
- removing accents
- lowercasing
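
The three normalization steps listed above can be sketched with the standard library alone. This is an approximation for illustration; a real tokenizer's normalizer may apply different or additional rules.

```python
import unicodedata


def normalize(text):
    """Collapse whitespace, strip accents, and lowercase."""
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.split())
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return no_accents.lower()


print(normalize("  Héllo   Wörld!  "))
```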

#### Pre-tokenization
- splits text into smaller pieces (e.g. words and punctuation) before the tokenizer's model is applied

Different tokenizers use different tokenization methods.
GOAL: build a vocabulary of tokens that captures the patterns in the text.
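
A toy sketch of pre-tokenization followed by vocabulary building and encoding. Assumptions: splitting is word-level with a simple regex, the two-sentence corpus is made up, and `[UNK]` stands in for unknown words. Real tokenizers learn subword vocabularies (e.g. BPE or WordPiece) instead.

```python
import re


def pre_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(corpus):
    """Assign an integer ID to each distinct piece, in order of first appearance."""
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for piece in pre_tokenize(sentence):
            vocab.setdefault(piece, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each piece to its ID, falling back to [UNK] for unseen pieces."""
    return [vocab.get(piece, vocab["[UNK]"]) for piece in pre_tokenize(text)]


corpus = ["the movie was great", "the plot was thin!"]  # made-up corpus
vocab = build_vocab(corpus)
print(pre_tokenize("the plot was thin!"))
print(encode("the movie was bad", vocab))
```

The unseen word "bad" maps to the `[UNK]` ID, which is one reason subword vocabularies are preferred in practice.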

In [None]:
# Use the normalizer
from transformers import AutoTokenizer, DistilBertTokenizer, GPT2Tokenizer

input_string = "HOWDY, how aré yoü?"  # placeholder example text
# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Normalize the input string
output = tokenizer.backend_tokenizer.normalizer.normalize_str(input_string)


# Use the tokenizer
# Q: why not call AutoTokenizer here?
# A: AutoTokenizer is a flexible, generic class that automatically selects the correct tokenizer for a model based on its identifier or configuration.
# With "gpt2" as the model name, AutoTokenizer resolves to the GPT-2 tokenizer class internally (the fast variant by default), so either approach works.

# Download the GPT-2 tokenizer
gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Tokenize the input
gpt_tokens = gpt_tokenizer.tokenize(input_string)
# Repeat for DistilBERT
distil_tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
distil_tokens = distil_tokenizer.tokenize(text=input_string)
# Compare the output
print(f"GPT tokenizer: {gpt_tokens}")
print(f"DistilBERT tokenizer: {distil_tokens}")


#### Text classification
- sentiment analysis
- grammatical correctness
- judging whether a sentence contains the answer to a question (entailment; Question Natural Language Inference, or QNLI)

Zero-shot classification: best with limited resources, since no task-specific training is needed.
Zero-shot classification is the ability of a transformer to predict a label from a new set of classes that it wasn't originally trained to identify.
It only requires inputs (text) and candidate labels.
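
A hedged sketch of how zero-shot classification can be run and why it ties back to entailment: the pipeline pairs the input with one hypothesis per candidate label and scores each pair with an NLI model. The helper names, sample sentence, and labels are invented for illustration, and `facebook/bart-large-mnli` is an assumed model choice, not one these notes prescribe.

```python
def build_nli_pairs(text, labels, template="This example is {}."):
    """Pair the input (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in labels]


def zero_shot(text, labels, model_name="facebook/bart-large-mnli"):
    """Run the zero-shot pipeline; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    classifier = pipeline(task="zero-shot-classification", model=model_name)
    return classifier(text, candidate_labels=labels)


pairs = build_nli_pairs("The battery dies within an hour.", ["hardware", "billing"])
for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)

# Example call (not run here because it downloads a large model):
# result = zero_shot("The battery dies within an hour.", ["hardware", "billing"])
# print(result["labels"][0])  # the highest-scoring label
```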


#### Summarization
- extractive summarization: representative sentences are selected using sentence scoring
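
To illustrate sentence scoring, here is a toy extractive summarizer that scores each sentence by the average frequency of its words. The scoring rule and sample text are assumptions chosen for simplicity. Note that sequence-to-sequence summarization models such as BART generate new text (abstractive) rather than selecting sentences this way.

```python
import re
from collections import Counter


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


sample = (
    "Transformers process text with attention. "
    "Attention lets transformers weigh every token. "
    "The weather was pleasant."
)
print(summarize(sample))
```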

In [None]:
from transformers import pipeline

model = 'sshleifer/distilbart-cnn-12-6'
summarizer = pipeline(task='summarization', model=model)
text = "xsxsssgqwegwq"  # placeholder; replace with the text to summarize
summary_text = summarizer(text)
print(summary_text[0]['summary_text'])

# Each model has its own set of parameters for adjusting the output,
# e.g. min_length and max_length
# max_length defaults to 142 for this model

#### Text Generation 

In [None]:
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM  # AutoModelForCausalLM is frequently used for text generation

# Set model name
model_name = "gpt2"
# Get the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Wear sunglasses when its sunny because"

# Tokenize the input
# return_tensors="pt" returns PyTorch tensors (similar to arrays), which the PyTorch model expects
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, num_return_sequences=1)  # return a single generated sequence
# Decode the output
generated_text = tokenizer.decode(output[0])
print(generated_text)
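
The `generate` call above produces text one token at a time. A minimal sketch of that loop, assuming greedy decoding (no sampling or beam search) and a made-up table of next-token scores in place of a real model, which would compute scores from all previous tokens:

```python
# Hypothetical next-token scores, keyed by the previous token only.
NEXT_TOKEN_SCORES = {
    "wear": {"sunglasses": 0.9, "hats": 0.1},
    "sunglasses": {"when": 0.7, "because": 0.3},
    "when": {"sunny": 0.8, "cold": 0.2},
    "sunny": {"<eos>": 1.0},
}


def greedy_generate(prompt, max_new_tokens=10):
    """Repeatedly append the single most likely next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # no known continuation
        best = max(candidates, key=candidates.get)
        if best == "<eos>":
            break  # end-of-sequence token stops generation
        tokens.append(best)
    return tokens


print(" ".join(greedy_generate(["wear"])))
```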

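In [None]:
# Text generation from images
## Generating a caption for an image
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

image = Image.open("path/to/image.jpg")  # hypothetical path; load your own image

# Get the processor and model
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")
# Process the image into pixel values
pixels = processor(images=image, return_tensors="pt").pixel_values
# Generate the token IDs
output = model.generate(pixel_values=pixels)
# Decode the output, dropping special tokens
caption = processor.batch_decode(output, skip_special_tokens=True)
print(caption[0])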

# tbd
