# Preparing for fine-tuning

**Pipelines and auto classes**

In [None]:
"""

So far, we have used the pipeline() interface. It streamlines language tasks by automatically selecting a model and tokenizer, but offers limited control.
Auto classes (such as AutoModelForSequenceClassification and AutoTokenizer) allow more customization: we pick the model and tokenizer explicitly, adjust settings by hand, and can fine-tune the model.

"""

**Tokenization**

In [None]:
"""

After loading the data and instantiating the model and tokenizer, we tokenize the data subset in one go by selecting the text column,
enabling padding and truncating sequences that exceed the specified maximum length. Equal-length sequences can be stacked into a single tensor and processed as a batch.
We set return_tensors to "pt" to return PyTorch tensors, since our model expects this format.

"""


from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset


train_data = load_dataset("imdb", split="train")
train_data = train_data.shard(num_shards=4, index=0)
test_data = load_dataset("imdb", split="test")
test_data = test_data.shard(num_shards=4, index=0)


model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the data
tokenized_training_data = tokenizer(train_data["text"], return_tensors="pt", padding=True, truncation=True,
                                    max_length=64)
tokenized_test_data = tokenizer(test_data["text"], return_tensors="pt", padding=True, truncation=True,
                                max_length=64)


**Tokenizing row by row**

In [None]:
"""

If more control is needed, we can tokenize a dataset in batches or row by row with a custom function and the .map() method,
setting batched to True or False, respectively. The result is a new dataset object with new columns for the tokenized data,
which is what the training loop requires. Note that .map() is a method of dataset objects, so it is not available on plain Python lists.

"""

def tokenize_function(text_data):
    return tokenizer(text_data["text"], return_tensors="pt", padding=True, truncation=True, max_length=64)

# Tokenize in batches
tokenized_in_batches = train_data.map(tokenize_function, batched=True)
# Tokenize row by row
tokenized_by_row = train_data.map(tokenize_function, batched=False)
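
To make the difference between the two modes concrete, the sketch below mimics the calling convention in plain Python: with batched=True the function receives a dict of lists (one list per column), and with batched=False it receives a dict describing a single row. The names `toy_map`, `toy_tokenize`, and `toy_tokenize_function` are hypothetical stand-ins for illustration only; this is not how the datasets library is implemented.

```python
# Toy stand-in for Dataset.map(); NOT the datasets library implementation.
def toy_tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace
    return text.lower().split()

def toy_tokenize_function(example):
    text = example["text"]
    if isinstance(text, list):             # batched=True: a list of texts
        return {"tokens": [toy_tokenize(t) for t in text]}
    return {"tokens": toy_tokenize(text)}  # batched=False: a single text

def toy_map(rows, function, batched=False, batch_size=2):
    """Apply `function` to a list of row dicts and return rows with the new columns added."""
    if not batched:
        return [{**row, **function(row)} for row in rows]
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Columnar view of the batch, e.g. {"text": [...], "label": [...]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = function(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out

rows = [
    {"text": "Great movie", "label": 1},
    {"text": "Too long and dull", "label": 0},
    {"text": "Loved it", "label": 1},
]
by_row = toy_map(rows, toy_tokenize_function, batched=False)
in_batches = toy_map(rows, toy_tokenize_function, batched=True, batch_size=2)
print(by_row == in_batches)  # both modes add the same "tokens" column
print(in_batches[1])
```

Both modes produce the same rows here; batching mainly reduces the number of function calls, which matters when the function wraps a fast tokenizer.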

**Subword tokenization**

In [None]:
"""

The tokenization we've performed is known as subword tokenization, which is used by most modern tokenizers. Here, words are split into smaller,
meaningful sub-parts, including prefixes and suffixes. For example, with subword tokenization, a word like "unbelievably"
might be split into the tokens "un", "believ", and "ably". The exact split depends on the tokenizer's vocabulary.

"""

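As a rough illustration of how such a split can be computed, here is a minimal sketch of a greedy longest-match-first procedure in the style of WordPiece (the scheme used by BERT tokenizers, where continuation pieces carry a "##" prefix). The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for this example; a real pre-trained vocabulary has tens of thousands of entries and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"un", "##believ", "##ably", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbelievably", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```
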
In [None]:
"""

You want to leverage a pre-trained model from Hugging Face and fine-tune it with data from your company's support team to classify interactions by their risk of churn.
This will help the team prioritize what to address first and how to address it, making them more proactive.

Prepare the training and test data for fine-tuning by tokenizing the text.

"""


"""
Load the pre-trained model and tokenizer in preparation for fine-tuning.
Tokenize both the train_data["text"] and test_data["text"], enabling padding and sequence truncation.

"""

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset


train_data = load_dataset("imdb", split="train")
# Split the training data into 4 equal shards and keep only the first one (index=0),
# so the dataset is smaller and faster to process.
train_data = train_data.shard(num_shards=4, index=0)
test_data = load_dataset("imdb", split="test")
test_data = test_data.shard(num_shards=4, index=0)

# Load the model and tokenizer
# distilbert-base-uncased is a distilled (smaller, faster) version of BERT.
# "Uncased" means it doesn't differentiate between uppercase and lowercase letters.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the data
tokenized_training_data = tokenizer(train_data["text"], return_tensors="pt", padding=True, truncation=True, max_length=20)

tokenized_test_data = tokenizer(test_data["text"], return_tensors="pt", padding=True, truncation=True, max_length=20)

"""
tokenizer(train_data["text"]):
Takes the text column of the training data and converts it into tokens.

return_tensors="pt":
Converts the tokens into PyTorch tensors (needed for training the model).

padding=True:
Ensures all sequences have the same length by adding padding if they are too short.

truncation=True:
Cuts off sequences that are too long (prevents memory issues).

max_length=20:
Limits the maximum length of each sequence to 20 tokens.
"""

print(tokenized_training_data)
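
The effect of the padding and truncation arguments can be sketched on plain lists of token IDs. The helper `pad_and_truncate` below is a hypothetical illustration, not the tokenizer's own code, and the token IDs are made up. It also simplifies one detail: a real tokenizer keeps special tokens such as the final separator when truncating, whereas this sketch simply slices the list.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all of them to the longest remaining length."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs: one short sequence and one longer than max_length
batch = pad_and_truncate([[5, 8, 3], [7, 2, 9, 4, 6, 1, 1]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

After this step every sequence has the same length, which is what allows the batch to be returned as a single rectangular tensor.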

In [None]:
"""


1. For Sequence Classification (like Sentiment Analysis)
----------------------------------------------------------

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

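--->  Use this when you want to classify sequences into categories (e.g., positive/negative sentiment).
--->  It loads BERT with an additional classification head (a fully connected layer for predictions). For a base checkpoint such as bert-base-uncased, this head is newly initialized and has to be fine-tuned before its predictions are meaningful.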






2. For Token Classification (like Named Entity Recognition)
-------------------------------------------------------------
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased")

--->  Use this for tasks that need predictions at the token level, such as labeling each word as a person, location, or organization.






3. For Question Answering
----------------------------
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")


--->  Use this for tasks where the input is a question and a context, and the model predicts the answer span within the context.






4. For Language Modeling (Masked or Causal LM)
Masked Language Modeling (MLM) (e.g., BERT):
--------------------------------------------------------

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

--->  Use this to predict masked words in a sentence, which is the main objective BERT was pre-trained with.




5. For Embedding Extraction (Without Task-Specific Heads)
---------------------------------------------------------------
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")


--->  This loads BERT without any task-specific layers (only the embeddings and transformer blocks).
--->  Useful when you want to extract embeddings for sentences or tokens.





6. For Custom Models
If you need to fine-tune BERT for a unique task, you can load the base BERT model and add your own custom head:
---------------------------------------------------------------------------------------------------------------
from transformers import AutoModel
import torch.nn as nn

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(768, 3)  # Example: 3 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding
        logits = self.classifier(cls_output)
        return logits

model = CustomBERTModel()


--->  This gives you full flexibility to design the model.

"""


In [None]:
### Mapping tokenization

"""
You now want more control over the tokenization and want to try tokenizing the data row by row or in batches.
This also returns a Dataset object, which you'll need for training.

The tokenizer has been loaded for you along with the data as train_data and test_data.

"""




"""

Complete tokenize_function so that it returns tokenized tensors with sequence truncation, then tokenize train_data in batches.

"""

# Complete the function
def tokenize_function(data):
    # The IMDB data loaded above stores its text in the "text" column
    return tokenizer(data["text"],
                     return_tensors="pt",
                     padding=True,
                     truncation=True,
                     max_length=64)

tokenized_in_batches = train_data.map(tokenize_function, batched=True)


"""
Apply tokenize_function to train_data and tokenize row by row.
"""

# Complete the function
def tokenize_function(data):
    return tokenizer(data["text"],
                     return_tensors="pt",
                     padding=True,
                     truncation=True,
                     max_length=64)

# Tokenize row by row
tokenized_by_row =  train_data.map(tokenize_function, batched=False)

print(tokenized_by_row)

# Fine-tuning through Training

In [None]:
### Setting up training arguments

"""

Set up an instance of TrainingArguments().
Set the evaluation strategy as "epoch".
Specify three training epochs.
Set the batch sizes for both training and evaluation to three.

"""

from transformers import Trainer, TrainingArguments


# Set up an instance of TrainingArguments
training_args = TrainingArguments(
  output_dir="./finetuned",

  # Set the evaluation strategy
  # (recent transformers releases rename this argument to eval_strategy)
  evaluation_strategy="epoch",

  # Specify the number of epochs
  num_train_epochs=3,
  learning_rate=2e-5,

  # Set the batch sizes
  per_device_train_batch_size=3,
  per_device_eval_batch_size=3,
  weight_decay=0.01
)
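
To get a feel for what these settings imply, the number of optimizer steps can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_training_steps` is a hypothetical helper, not part of transformers.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# One of four shards of the 25,000-example IMDB training split
num_examples = 25000 // 4
steps_per_epoch, total_steps = total_training_steps(num_examples, batch_size=3, num_epochs=3)
print(steps_per_epoch, total_steps)  # 2084 6252
```

With a batch size of three, each epoch takes a little over two thousand steps, and the evaluation strategy "epoch" triggers one evaluation pass after each of them.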

In [None]:
### Setting up the trainer

"""

Set up the Trainer() object.
Assign the previously defined training arguments and tokenizer.
Train the model.

"""

# Set up the trainer object
trainer = Trainer(
    model=model,

    # Assign the training arguments and tokenizer
    args = training_args,
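    # Trainer expects Dataset objects (including the label column), so pass the output
    # of .map() rather than the BatchEncoding returned by calling the tokenizer directly
    train_dataset=train_data.map(tokenize_function, batched=True),
    eval_dataset=test_data.map(tokenize_function, batched=True),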
    tokenizer = tokenizer
)

# Train the model
trainer.train()

In [None]:
### Using the fine-tuned model

"""

Tokenize the new data.
Pass the tokenized inputs into the fine-tuned model, disabling gradients.
Extract the new predictions.

"""



import torch

input_text = ["I'd just like to say, I love the product! Thank you!"]

# Tokenize the new data
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Pass the tokenized inputs through the model
with torch.no_grad():
    outputs = model(**inputs)

# Extract the new predictions
predicted_labels = torch.argmax(outputs.logits, dim=1).tolist()

label_map = {0: "Low risk", 1: "High risk"}
for i, predicted_label in enumerate(predicted_labels):
    # Translate the predicted class index into a readable churn label
    churn_label = label_map[predicted_label]
    print(f"\nInput Text {i + 1}: {input_text[i]}")
    print(f"Predicted Label: {churn_label}")
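
The prediction step boils down to picking the class with the largest logit; applying a softmax first additionally gives a probability for that class. The following is a minimal pure-Python sketch of that logic, with made-up logits and the hypothetical helpers `softmax` and `predict_label`; it does not call the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, label_map):
    """Return the label of the highest-scoring class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
    return label_map[best], probs[best]

label_map = {0: "Low risk", 1: "High risk"}
example_logits = [2.0, -1.0]  # hypothetical logits for a single input
label, confidence = predict_label(example_logits, label_map)
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits (as the cell above does) selects the same class.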

# Fine-tuning approaches

In [None]:
"""

Complete the one-shot learning example by showing the sample review is Positive.

"""

# Include an example in the input text
input_text = """
Text: "The dinner we had was great and the service too."
Classify the sentiment of this sentence as either positive or negative.
Example:
Text: "The food was delicious"
Sentiment: Positive
Text: "The dinner we had was great and the service too."
Sentiment:
"""

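# Apply the example to the model
# Assumption: `model` here is a sentiment-analysis pipeline (the result[0]["label"]
# lookup below relies on that), not the Auto class model loaded in earlier cells.
from transformers import pipeline
model = pipeline(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = model(input_text, max_length=100)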

print(result[0]["label"])
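
The one-shot prompt above is written by hand, but the same layout can be assembled programmatically for zero, one, or several examples. The helper `build_n_shot_prompt` below is a hypothetical sketch of that idea, and the second example review is made up for illustration.

```python
def build_n_shot_prompt(instruction, examples, query):
    """Assemble a prompt with len(examples) labeled demonstrations placed before the query."""
    lines = [instruction]
    for text, sentiment in examples:
        lines.append(f'Text: "{text}"')
        lines.append(f"Sentiment: {sentiment}")
    # The query goes last, with the label left open for the model to complete
    lines.append(f'Text: "{query}"')
    lines.append("Sentiment:")
    return "\n".join(lines)

instruction = "Classify the sentiment of this sentence as either positive or negative."
query = "The dinner we had was great and the service too."

zero_shot = build_n_shot_prompt(instruction, [], query)
one_shot = build_n_shot_prompt(instruction, [("The food was delicious", "Positive")], query)
few_shot = build_n_shot_prompt(
    instruction,
    [("The food was delicious", "Positive"), ("The waiter was rude", "Negative")],
    query,
)
print(one_shot)
```

The only thing that changes between the three prompts is the number of labeled examples that precede the query.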

In [None]:
"""

N-shot learning describes how many examples (N) of a new task a model is given before it has to perform that task.


Zero-shot learning: The model has never seen the new task before, but it still tries to perform well.

Example: A model trained to recognize animals (dogs, cats, etc.) is asked to identify a zebra, even though it has never seen a zebra before.
Instead, it uses its knowledge of other animals to make a guess.


One-shot learning: The model learns from just one example.

Example: If you show a child one picture of a koala, and then they see another koala in a different picture, they can recognize it immediately. One-shot learning works the same way.


Few-shot learning: The model learns from a few examples.

Example: If a child sees three different pictures of koalas, they can now recognize them better than if they had seen just one.




One-shot Learning in Practice
Imagine you have a text generation model (like ChatGPT) that usually writes essays. Now you want it to analyze the sentiment (positive/negative) of a sentence.

Without One-shot Learning:
You ask: "Is this sentence positive or negative?"
The model may not understand that you want sentiment analysis.

With One-shot Learning:
You give an example before asking:
"Example: 'I love this movie!' → Positive
Now analyze: 'I hate this weather!'"
The model now understands what you want and is more likely to answer: "Negative"

"""