<h1>Chapter 4 - Text Classification</h1>
<i>Classifying text with both representation and generative models</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb)

---

This notebook is for Chapter 4 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install transformers sentence-transformers openai
# !pip install -U datasets

# **Data**

In [None]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [None]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [None]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

  return self.fget.__get__(instance, owner)()
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:37<00:00, 28.25it/s]
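
The loop above depends on the order of the three scores returned by this Twitter sentiment model. Here is a minimal sketch of the same mapping, assuming (as the indexing in the loop implies) that position 0 is negative, position 1 is neutral, and position 2 is positive. The helper name `to_binary_label` and the sample scores are hypothetical:

```python
import numpy as np

def to_binary_label(scores):
    """Map 3-class sentiment scores to 0 (negative) or 1 (positive), ignoring neutral."""
    negative_score = scores[0]["score"]
    positive_score = scores[2]["score"]
    return int(np.argmax([negative_score, positive_score]))

# Hypothetical pipeline output for a single review
example_scores = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.30},
]
example_label = to_binary_label(example_scores)
print(example_label)
```

Neutral has the highest score in this example, but the review is still assigned to whichever of the two remaining classes scores higher, because the Rotten Tomatoes labels are binary.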


In [None]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

Batches:   0%|          | 0/267 [00:00<?, ?it/s]

Batches:   0%|          | 0/34 [00:00<?, ?it/s]

In [None]:
train_embeddings.shape

(8530, 768)

In [None]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [None]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



**Tip!**  

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and apply cosine similarity to predict which class matches each document best:

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
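
The `groupby(768)` call above works because the label was stacked onto the embeddings as column 768. An equivalent sketch with NumPy boolean masks may be easier to read. The toy arrays are hypothetical stand-ins for `train_embeddings` and `test_embeddings`, and the helper names are illustrative only:

```python
import numpy as np

def class_centroids(embeddings, labels):
    """Return one mean embedding per class, ordered by sorted label value."""
    labels = np.asarray(labels)
    return np.vstack([embeddings[labels == c].mean(axis=0) for c in np.unique(labels)])

def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy 2-dimensional "embeddings" (hypothetical values)
toy_train = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.8]])
toy_labels = [0, 0, 1, 1]
toy_test = np.array([[0.8, 0.2], [0.1, 0.9]])

centroids = class_centroids(toy_train, toy_labels)
toy_pred = np.argmax(cosine_sim(toy_test, centroids), axis=1)
print(toy_pred)
```

Each test vector is assigned to the class whose centroid it is most similar to, which is the same nearest-centroid idea as the pandas version.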



### Zero-shot Classification

In [None]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



**Tip!**  

What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!
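
To make it easy to try different descriptions, the zero-shot steps can be wrapped in one function. This is a sketch, not the book's implementation: `zero_shot_predict` and `toy_encode` are hypothetical names, and the toy encoder only exists so the example runs without downloading a model.

```python
import numpy as np

def zero_shot_predict(encode, texts, label_descriptions):
    """Assign each text the index of its most similar label description."""
    doc_emb = np.asarray(encode(texts), dtype=float)
    label_emb = np.asarray(encode(label_descriptions), dtype=float)
    # Normalize rows so the dot product equals cosine similarity
    doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return np.argmax(doc_emb @ label_emb.T, axis=1)

def toy_encode(sentences):
    """Hypothetical stand-in encoder: counts two groups of cue words."""
    return [
        [s.count("negative") + s.count("bad") + 0.1,
         s.count("positive") + s.count("good") + 0.1]
        for s in sentences
    ]

toy_preds = zero_shot_predict(
    toy_encode,
    ["a bad , bad film", "good fun"],
    ["A very negative movie review", "A very positive movie review"],
)
print(toy_preds)
```

In this notebook, `model.encode` could be passed as `encode` and `data["test"]["text"]` as `texts`, so each set of label descriptions needs only one call.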

## **Classification with Generative Models**

### Encoder-decoder Models

In [None]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

In [None]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [None]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)


100%|██████████| 1066/1066 [00:40<00:00, 26.07it/s]


In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### ChatGPT for Classification

In [None]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

In [None]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
            },
        {
            "role": "user",
            "content":   prompt.replace("[DOCUMENT]", document)
            }
    ]
    chat_completion = client.chat.completions.create(
      messages=messages,
      model=model,
      temperature=0
    )
    return chat_completion.choices[0].message.content

In [None]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

'1'

The next step would be to run one of OpenAI's models against the entire evaluation dataset. Only do this if you have sufficient API credits, because it calls the API once for every record in the test split (1066 records).

In [None]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

100%|██████████| 1066/1066 [13:34<00:00,  1.31it/s]


In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066
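
The `int(pred)` conversion above assumes every reply is exactly `0` or `1`. A chat model may occasionally add punctuation or answer with a word, in which case `int()` raises a `ValueError`. A more defensive parser could look like the sketch below. The name `parse_binary_label` and the `-1` fallback are assumptions for illustration, not part of the original notebook:

```python
def parse_binary_label(reply, default=-1):
    """Return 0 or 1 if the cleaned reply starts with that digit, else `default`."""
    cleaned = reply.strip().strip("'\"")
    if cleaned[:1] in ("0", "1"):
        return int(cleaned[0])
    return default

# Hypothetical replies, including two that int() alone would reject
sample_replies = ["1", " 0\n", "1.", "positive"]
parsed = [parse_binary_label(r) for r in sample_replies]
print(parsed)  # [1, 0, 1, -1]
```

Any `-1` entries would need to be inspected or filtered out before calling `evaluate_performance`, since that function expects only the two review classes.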



In [2]:
import numpy as np
import pandas as pd

from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, accuracy_score

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

from peft import LoraConfig, get_peft_model, TaskType

In [3]:
data=pd.read_excel(r"/content/train-data.xlsx")

In [4]:
data.head()

Unnamed: 0,ID,SalesDiagnosticBotId,IsChecked,InventLocationRef,Level4_ID,Description,SalesDiagnosticBotDate,Level4,Level3,Name,District,label
0,337,5035112314041001,True,OKS07671,1_1_2_3,خرید تمایل خرید مشتری,14041001,آبمیوه کوچک,آبمیوه,فروشگاه تهران جنت آباد گلستان غربی,تهران - آریاشهر,تقاضا و رفتار مشتری
1,276,2511761314041001,True,OKS05895,7_6_1_3,دوغ گازدار بطری صباح فعال شد,14041001,دوغ گازدار بطری,دوغ,فرانچایز تهران جنت آباد غربی,تهران - آریاشهر,سایر
2,347,6247759214041001,True,OKS08821,7_5_9_2,افزایش قیمت,14041001,ماست ست,ماست,فروشگاه تهران وفاآذر یاس,تهران - آریاشهر,قیمت‌گذاری، تخفیف و پروموشن
3,348,6247759414041001,True,OKS08821,7_5_9_4,افزایش قیمت,14041001,ماست همزده,ماست,فروشگاه تهران وفاآذر یاس,تهران - آریاشهر,قیمت‌گذاری، تخفیف و پروموشن
4,344,6247755114041001,True,OKS08821,7_5_5_1,افزایش قیمت,14041001,شیر بطری,شیر,فروشگاه تهران وفاآذر یاس,تهران - آریاشهر,قیمت‌گذاری، تخفیف و پروموشن


In [5]:
# Take an explicit copy so the column assignments below do not write to a slice of `data`
df = data[["Description", "label"]].copy()

In [6]:
df.shape

(15785, 2)

In [8]:
df["Description"] = df["Description"].astype(str).str.strip()
df["label"] = df["label"].astype(str).str.strip()

In [9]:
le = LabelEncoder()
df["label_id"] = le.fit_transform(df["label"])
num_labels = len(le.classes_)


In [10]:
train_df, val_df = train_test_split(
    df[["Description", "label_id"]],
    test_size=0.2,
    random_state=42,
    stratify=df["label_id"],
)

In [11]:
train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds = Dataset.from_pandas(val_df.reset_index(drop=True))


In [12]:
base_model_name = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

def tokenize_fn(batch):
    return tokenizer(
        batch["Description"],
        truncation=True,
        max_length=128,
    )


config.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

In [13]:
train_tok = train_ds.map(tokenize_fn, batched=True)
val_tok = val_ds.map(tokenize_fn, batched=True)

Map:   0%|          | 0/12628 [00:00<?, ? examples/s]

Map:   0%|          | 0/3157 [00:00<?, ? examples/s]

In [14]:
train_tok = train_tok.rename_column("label_id", "labels")
val_tok = val_tok.rename_column("label_id", "labels")


In [15]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=num_labels
)

pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/654M [00:00<?, ?B/s]

BertForSequenceClassification LOAD REPORT from: HooshvareLab/bert-base-parsbert-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.decoder.weight             | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.decoder.bias               | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params 

In [16]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 305,678 || all params: 163,157,788 || trainable%: 0.1874
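
The trainable count printed above can be reproduced by hand. The sketch below rests on assumptions the notebook does not print: a BERT-base layout (hidden size 768, 12 layers), a classification head trained alongside the adapters, and 14 labels. The label count is simply the value that makes the arithmetic match the printed figure.

```python
def lora_trainable_params(hidden_size, num_layers, r, num_target_modules, num_labels):
    """Count LoRA adapter weights plus a trainable linear classification head."""
    # Each adapted square projection gets A (r x hidden) and B (hidden x r)
    per_module = 2 * r * hidden_size
    lora_params = per_module * num_target_modules * num_layers
    # Classification head: weight (num_labels x hidden) plus bias (num_labels)
    head_params = hidden_size * num_labels + num_labels
    return lora_params + head_params

# r=8 and two target modules ("query", "value") come from the LoraConfig above;
# the remaining values are assumptions (see the text before this block).
total_trainable = lora_trainable_params(
    hidden_size=768, num_layers=12, r=8, num_target_modules=2, num_labels=14
)
print(f"{total_trainable:,}")  # 305,678
```

Doubling `r` would double the adapter part of the count (294,912 here) and leave the head unchanged.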


In [17]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }

In [19]:
args = TrainingArguments(
    output_dir="./parsbert-lora-cls",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,

    learning_rate=2e-4,            # LoRA can usually handle a bit higher LR than full fine-tune
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,

    weight_decay=0.01,
    logging_steps=50,

    fp16=False,  # set True if you have CUDA + fp16
    report_to="none"
)

In [21]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

# Save LoRA adapter + tokenizer + label encoder classes
trainer.model.save_pretrained("./parsbert-lora-adapter")
tokenizer.save_pretrained("./parsbert-lora-adapter")
np.save("./parsbert-lora-adapter/label_classes.npy", le.classes_)

Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,F1 Weighted
1,0.356763,0.210281,0.9414,0.803524,0.938261
2,0.158507,0.162474,0.955654,0.894242,0.955047
3,0.169471,0.150374,0.958188,0.906689,0.957985
4,0.072605,0.145613,0.960722,0.91242,0.960383
5,0.136866,0.142503,0.961672,0.911446,0.96149
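
In the log above, accuracy after the first epoch is 0.94 while macro F1 is only 0.80. A gap like this often means some classes are rare and poorly predicted, because macro F1 weighs every class equally while accuracy and weighted F1 are dominated by the frequent classes. The sketch below illustrates the effect on toy data. It reimplements the metrics in NumPy for self-containment and simplifies slightly by only considering classes present in `y_true`. The name `f1_summary` is hypothetical.

```python
import numpy as np

def f1_summary(y_true, y_pred):
    """Accuracy, macro F1, and support-weighted F1 over the classes in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1, support = [], []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1.append(2 * tp / denom if denom else 0.0)
        support.append(np.sum(y_true == c))
    f1, support = np.array(f1), np.array(support)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "f1_macro": float(f1.mean()),
        "f1_weighted": float(np.average(f1, weights=support)),
    }

# Toy data: nine examples of class 0, one of class 1, and a model that always predicts 0
toy_true = [0] * 9 + [1]
toy_always_zero = [0] * 10
toy_metrics = f1_summary(toy_true, toy_always_zero)
print(toy_metrics)
```

Here accuracy is 0.9, yet macro F1 falls below 0.5 because the rare class is never predicted. This is one reason to select the best checkpoint on `f1_macro`, as the training arguments above do.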


In [23]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_model_name = "HooshvareLab/bert-base-parsbert-uncased"
adapter_dir = "./parsbert-lora-adapter"

# Label names saved during training; loaded first so the head gets the right size
label_classes = np.load(f"{adapter_dir}/label_classes.npy", allow_pickle=True)

tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=len(label_classes),
)
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()

def predict(texts):
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=128)

    with torch.no_grad():
        out = model(**inputs)
        probs = torch.softmax(out.logits, dim=-1)
        pred_ids = probs.argmax(dim=-1).cpu().numpy()
        conf = probs.max(dim=-1).values.cpu().numpy()

    pred_labels = label_classes[pred_ids]
    return pred_labels.tolist(), conf.tolist()

# Inputs in English: "price increase", "out of stock", "refrigerator breakdown"
print(predict(["افزایش قیمت", "عدم موجودی", "خرابی یخچال"]))


(['قیمت\u200cگذاری، تخفیف و پروموشن', 'موجودی، تأمین و توزیع کالا', 'تجهیزات فروشگاهی'], [0.9997661709785461, 0.9993889331817627, 0.9458350539207458])


In [None]:
# In English: "The goods were not placed in a good location"
predict("کالا در جای خوبی قرار نگرفتند")
predict("")

In [25]:
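def predict_dataframe(df, batch_size=32):
    """Classify df["Description"] in batches; adds `predicted_label` and `confidence` to `df` in place and returns it."""
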
    texts = df["Description"].astype(str).tolist()

    all_preds = []
    all_conf = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        inputs = tokenizer(
            batch_texts,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=128
        )

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)

        pred_ids = probs.argmax(dim=-1).cpu().numpy()
        conf = probs.max(dim=-1).values.cpu().numpy()

        all_preds.extend(label_classes[pred_ids])
        all_conf.extend(conf)

    df["predicted_label"] = all_preds
    df["confidence"] = all_conf

    return df

In [35]:
df_new=pd.read_excel(r"/content/total_mahta.xlsx")

In [36]:
df_new.shape

(16277, 11)

In [37]:
df_predicted = predict_dataframe(df_new, batch_size=32)
print(df_predicted.head())

    ID  SalesDiagnosticBotId  IsChecked InventLocationRef Level4_ID  \
0  337      5035112314041000          1          OKS07671   1_1_2_3   
1  276      2511761314041000          1          OKS05895   7_6_1_3   
2  286      6235757114041000          1          OKS08809   7_5_7_1   
3  274      2511761214041000          1          OKS05895   7_6_1_2   
4  275      2511717714041000          1          OKS05895   7_1_7_7   

                    Description  SalesDiagnosticBotDate                Level4  \
0         خرید تمایل خرید مشتری                14041001           آبمیوه کوچک   
1  دوغ گازدار بطری صباح فعال شد                14041001       دوغ گازدار بطری   
2      افزایش قیمت و کاهش تخفیف                14041001            کره حیوانی   
3       دوغ نایلونی پاک فعال شد                14041001  دوغ بدون گاز نایلونی   
4                    عدم ارسال 

In [40]:
df_predicted.to_excel('/content/last_train-data.xlsx')

In [34]:
df_predicted.shape

(16277, 3)