<a href="https://colab.research.google.com/github/AbhimanyuAryan/ap-text-classification/blob/main/Jose/MistralSequenceClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pip Installs

In [None]:
!pip install -q -U torch=='2.0.0'

In [None]:
!pip install -q -U accelerate=='0.25.0' peft=='0.7.1' bitsandbytes=='0.41.3.post2' transformers=='4.36.1' trl=='0.7.4'

In [None]:
!pip install -q datasets



Imports

In [None]:
from huggingface_hub import notebook_login
import os
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)
from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from datasets import load_dataset



Log in to Hugging Face.

Make sure you have permission to access: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [None]:
# notebook_login()

Set the GPU and suppress warnings

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

Get Data

In [None]:
# IMDB stores labels as integers (0 = negative, 1 = positive); the prompt needs the words.
LABEL_NAMES = {0: "negative", 1: "positive"}

INSTRUCTION = (
    "[INST]Analyze the sentiment of the movie review enclosed in square brackets, "
    "determine if it is positive or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "negative"[/INST]'
)

def generate_prompt(data_point):
    # Training prompt: instruction, the bracketed review, and the gold label as a word.
    text = f'{INSTRUCTION}\n\n[{data_point["text"]}] = {LABEL_NAMES[data_point["label"]]}'
    return {'label': data_point["label"], 'text': text}

def generate_test_prompt(data_point):
    # Test prompt: same layout, but the label is left for the model to fill in.
    text = f'{INSTRUCTION}\n\n[{data_point["text"]}] ='
    return {'label': data_point["label"], 'text': text}
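The cells shown here define the prompt builders, but no later cell calls them, so the trainer and `predict` receive the raw review text. A minimal sketch of how such builders could be applied row by row follows. `build_test_prompt`, `build_train_prompt`, the shortened instruction, and the two sample reviews are hypothetical stand-ins; with a Hugging Face `Dataset`, the same functions could be passed to `Dataset.map`.

```python
# Hypothetical stand-ins for illustration; not the notebook's exact prompt.
LABEL_NAMES = {0: "negative", 1: "positive"}
INSTRUCTION = '[INST]Classify the review in square brackets as "positive" or "negative"[/INST]'

def build_test_prompt(row):
    """Return a copy of the row whose text is wrapped in the prompt template."""
    return {"label": row["label"], "text": f'{INSTRUCTION}\n\n[{row["text"]}] ='}

def build_train_prompt(row):
    """Same template, with the gold label written out as a word."""
    prompt = build_test_prompt(row)["text"]
    return {"label": row["label"], "text": f'{prompt} {LABEL_NAMES[row["label"]]}'}

# Made-up sample rows in the same shape as IMDB records.
rows = [
    {"text": "A dull, lifeless film.", "label": 0},
    {"text": "Great acting and a clever plot.", "label": 1},
]

train_prompts = [build_train_prompt(r) for r in rows]
test_prompts = [build_test_prompt(r) for r in rows]
print(train_prompts[0]["text"])
```

The integer label is kept in the `label` field for scoring, while the prompt text carries the word the model is expected to produce.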

In [None]:
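imdb = load_dataset('imdb')
# Keep a small subset so the notebook runs quickly.
imdb['train'] = imdb['train'].select(range(100))
imdb['test'] = imdb['test'].select(range(100))

train_data_imdb = imdb['train']
eval_data_imdb = imdb['test']

# Use the second half of the 100-row test subset for the final evaluation.
# Note: these 50 rows are also part of eval_data_imdb.
num = len(imdb['test']) // 2
test_data_imdb = imdb['test'][num:]  # slicing a Dataset returns a dict of columns
X_test_imdb = pd.DataFrame(test_data_imdb)
y_true_imdb = list(X_test_imdb['label'])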
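The classification report later in this notebook lists a support of 50 for label 0 and 0 for label 1, so the selected test rows appear to contain a single class. One way to avoid that is to draw a class-balanced random subset instead of the first rows. The sketch below works on a plain list of labels; `balanced_indices` is a hypothetical helper, and its result could be passed to `Dataset.select`.

```python
import random
from collections import defaultdict

def balanced_indices(labels, per_class, seed=42):
    """Pick `per_class` random row indices for each distinct label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(chosen)  # avoid returning the rows grouped by class
    return chosen

# Toy label column stored class by class, as a stand-in for the real dataset.
labels = [0] * 500 + [1] * 500
subset = balanced_indices(labels, per_class=50)
counts = {label: sum(labels[i] == label for i in subset) for label in (0, 1)}
print(counts)  # {0: 50, 1: 50}
```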

In [None]:
print(train_data_imdb[0]['text'])
print(train_data_imdb[0]['label'])
print(type(train_data_imdb[0]['label']))

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [None]:
print(train_data_imdb)
print(eval_data_imdb)
print(X_test_imdb.info())

Dataset({
    features: ['text', 'label'],
    num_rows: 100
})
Dataset({
    features: ['text', 'label'],
    num_rows: 100
})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    50 non-null     object
 1   label   50 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 928.0+ bytes
None


Auxiliary Functions

In [None]:
def evaluate(y_true, y_pred):
    # 0 = negative, 1 = positive, -1 = answer could not be parsed
    labels = [0, 1, -1]
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Overall accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Per-label accuracy, computed over the samples whose true label is `label`
    for label in np.unique(y_true):
        mask = y_true == label
        label_accuracy = accuracy_score(y_true[mask], y_pred[mask])
        print(f'Accuracy for label {label}: {label_accuracy:.3f}')

    # Classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)

    # Confusion matrix; pass the full label list so that -1 predictions are counted
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=labels)
    print('\nConfusion Matrix:')
    print(conf_matrix)
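`confusion_matrix` only counts pairs whose labels appear in its `labels` argument, so a label list that omits -1 silently drops unparseable predictions. A small pure-Python tally makes the effect visible. `tally` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
def tally(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:  # pairs with unlisted labels are skipped
            matrix[index[t]][index[p]] += 1
    return matrix

# Toy data: three of the six predictions could not be parsed (-1).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, -1, -1, 1, -1]

without_unparsed = tally(y_true, y_pred, labels=[0, 1, 2])
with_unparsed = tally(y_true, y_pred, labels=[0, 1, -1])
print(without_unparsed)  # [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
print(with_unparsed)     # [[1, 1, 2], [0, 1, 1], [0, 0, 0]]
```

In the first matrix the entries sum to 3 although there are 6 samples; the missing 3 are the -1 predictions.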

In [None]:
def predict(X_test, model, tokenizer):
    y_pred = []
    # Build the pipeline once instead of once per sample.
    pipe = pipeline(task="text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_new_tokens=1,
                    do_sample=False,  # greedy decoding
                   )
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt, pad_token_id=pipe.tokenizer.eos_token_id)
        # The generated text includes the prompt; keep what follows the last "=".
        answer = result[0]['generated_text'].split("=")[-1].lower()
        if "positive" in answer:
            y_pred.append(1)
        elif "negative" in answer:
            y_pred.append(0)
        else:
            y_pred.append(-1)  # answer could not be parsed
    return y_pred
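The label-parsing step inside `predict` can be isolated and checked without loading a model. A minimal sketch, where `parse_sentiment` is a hypothetical helper that mirrors the logic above and the sample strings are made up:

```python
def parse_sentiment(generated_text):
    """Map generated text to 1 (positive), 0 (negative) or -1 (unparseable)."""
    answer = generated_text.split("=")[-1].lower()
    if "positive" in answer:
        return 1
    if "negative" in answer:
        return 0
    return -1

print(parse_sentiment("[Great film] = positive"))  # 1
print(parse_sentiment("[Awful film] = Negative"))  # 0
print(parse_sentiment("[Some film] = the"))        # -1
print(parse_sentiment("A positive surprise"))      # 1
```

Only the text after the last "=" is inspected. When the prompt has no "=" suffix, as in the last call, the whole text is searched, so a word in the review itself can be mistaken for the answer.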

Get Model

In [None]:
def get_model():
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    compute_dtype = torch.float16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config,
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True,
                                              padding_side="left",
                                              add_bos_token=True,
                                              add_eos_token=True,
                                            )
    tokenizer.pad_token = tokenizer.eos_token
    return (model,tokenizer)

model,tokenizer = get_model()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
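As a rough sanity check on why the model is loaded in 4-bit, the weight footprint can be estimated by hand. The sketch below rests on several assumptions: a parameter count of about 7.24 billion, every weight being quantized (in practice some layers such as embeddings and norms are not), and one 32-bit scaling constant per block of 64 weights. `quantized_size_gib` is a hypothetical helper, and the result is an approximation rather than a measurement.

```python
def quantized_size_gib(n_params, bits_per_weight=4, block_size=64, absmax_bits=32):
    """Rough weight footprint of block-wise quantization, in GiB."""
    # Each block of `block_size` weights stores one extra scaling constant.
    bits = n_params * (bits_per_weight + absmax_bits / block_size)
    return bits / 8 / 2**30

n_params = 7.24e9  # assumed parameter count for a 7B-class model

fp16_gib = n_params * 16 / 8 / 2**30
nf4_gib = quantized_size_gib(n_params)
print(f"fp16 : {fp16_gib:.1f} GiB")
print(f"4-bit: {nf4_gib:.1f} GiB")
```

Activations, LoRA weights, gradients, and optimizer state come on top of this figure during training.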

Evaluate the base model

Retrain the model

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    evaluation_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data_imdb,
    eval_dataset=eval_data_imdb,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=512,
)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]
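The training arguments above imply a short schedule. The arithmetic can be sketched as follows, assuming a single GPU and the 100-row training subset; the variable names are chosen for illustration and the rounding mirrors what the `Trainer` is understood to do.

```python
import math

num_examples = 100      # rows in the training subset
per_device_batch = 1    # per_device_train_batch_size
grad_accum_steps = 4    # gradient_accumulation_steps
num_epochs = 4          # num_train_epochs
warmup_ratio = 0.03
num_devices = 1         # assumption: a single GPU

# Examples that contribute to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

# LoRA scales its update by lora_alpha / r.
lora_scaling = 16 / 64

print(effective_batch, steps_per_epoch, total_steps, warmup_steps, lora_scaling)
# 4 25 100 3 0.25
```

With `logging_steps=25`, the training loss would be logged once per epoch under these assumptions.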

In [None]:
trainer.train()
trainer.model.save_pretrained("JoseBambora/mistral_retrained")

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch   Training Loss   Validation Loss
1       2.5448          2.581332
2       2.365           2.583975
3       2.2284          2.614433
4       2.1499          2.620781
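Training loss falls every epoch while validation loss rises slightly, a pattern commonly read as overfitting. Converting the logged losses to perplexity (the exponential of the cross-entropy loss) gives a second view of the same trend. The values are copied from the log above; the variable names are chosen for illustration.

```python
import math

# (epoch, training loss, validation loss) copied from the training log
history = [
    (1, 2.5448, 2.581332),
    (2, 2.365, 2.583975),
    (3, 2.2284, 2.614433),
    (4, 2.1499, 2.620781),
]

for epoch, train_loss, val_loss in history:
    print(f"epoch {epoch}: "
          f"train ppl {math.exp(train_loss):.2f}, "
          f"val ppl {math.exp(val_loss):.2f}")

# Epoch with the lowest validation loss.
best_epoch = min(history, key=lambda row: row[2])[0]
print("lowest validation loss at epoch", best_epoch)  # 1
```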


Evaluate Retrained model

In [None]:
y_pred = predict(X_test_imdb, model, tokenizer)
evaluate(y_true_imdb, y_pred)

100%|██████████| 50/50 [01:47<00:00,  2.14s/it]

Accuracy: 0.020
Accuracy for label 0: 0.020

Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         0
           0       1.00      0.02      0.04        50
           1       0.00      0.00      0.00         0

    accuracy                           0.02        50
   macro avg       0.33      0.01      0.01        50
weighted avg       1.00      0.02      0.04        50


Confusion Matrix:
[[1 2 0]
 [0 0 0]
 [0 0 0]]





Push the model to the Hugging Face Hub.

In [None]:
trainer.push_to_hub()

events.out.tfevents.1713624702.22fe43e0e455.7504.0:   0%|          | 0.00/4.18k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/109M [00:00<?, ?B/s]

events.out.tfevents.1713625047.22fe43e0e455.16168.0:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

Upload 9 LFS files:   0%|          | 0/9 [00:00<?, ?it/s]

events.out.tfevents.1713626183.22fe43e0e455.21356.0:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

events.out.tfevents.1713626788.22fe43e0e455.21356.1:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

events.out.tfevents.1713627159.22fe43e0e455.21356.2:   0%|          | 0.00/4.18k [00:00<?, ?B/s]

events.out.tfevents.1713627448.22fe43e0e455.26774.0:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.22k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/JoseBambora/logs/commit/46afa22a25adb8c8fda354e7e36e72c0ca4c9120', commit_message='End of training', commit_description='', oid='46afa22a25adb8c8fda354e7e36e72c0ca4c9120', pr_url=None, pr_revision=None, pr_num=None)
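The upload log lists `adapter_model.safetensors` at about 109M. A back-of-the-envelope count of the LoRA parameters is consistent with that size under the following assumptions, none of which are stated in the notebook: the adapters attach to `q_proj` and `v_proj` only (since `LoraConfig` sets no `target_modules`), the model has 32 layers with hidden size 4096 and a 1024-wide value projection, and the adapter weights are stored in float32. `lora_params` is a hypothetical helper.

```python
def lora_params(d_in, d_out, r):
    """A LoRA pair adds an (r x d_in) matrix A and a (d_out x r) matrix B."""
    return r * (d_in + d_out)

r = 64              # rank from the LoraConfig in this notebook
num_layers = 32     # assumed number of transformer layers
hidden = 4096       # assumed hidden size
kv_dim = 1024       # assumed value-projection output width

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)
total_params = per_layer * num_layers
size_mb = total_params * 4 / 1e6  # 4 bytes per float32 weight

print(f"{total_params:,} LoRA parameters, about {size_mb:.0f} MB")
# 27,262,976 LoRA parameters, about 109 MB
```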