In [1]:
import warnings 
warnings.filterwarnings("ignore")

In [5]:
from datasets import load_dataset

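dataset = load_dataset('knowledgator/events_classification_biotech')

# Build label <-> index mappings, skipping empty class names.
# `idx` is used instead of `id` to avoid shadowing the Python builtin.
classes = [class_ for class_ in dataset['train'].features['label 1'].names if class_]
class2id = {class_: idx for idx, class_ in enumerate(classes)}
id2class = {idx: class_ for class_, idx in class2id.items()}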


In [8]:
dataset['train'][0]

{'title': "Sarah Polley's Book Recommendations",
 'content': 'Drive Your Plow Over the Bones of The Dead\nby Olga Tokarczuk. I am an incredibly slow reader, but the tone and specificity of the world she creates in this book was something I couldnt leave behind until it was done. Also: All We Sawby Anne Michaels, Fight Nightby Miriam Toews, and The Summer Before the Darkby Doris Lessing.\nId like turned into a Netflix show:\nby Amia Srinivasan. One of the most brain-shattering books Ive ever read. Her thinking is so electrically rigorous and fearless. (I double DARE them to make this into a Netflix show!)\n...I last bought:\n. I rediscovered her poetry lately, and I feel like I dont want to read anything else for a while. She owns desire and submerged things.\n...has the greatest ending:\nby J.D. Salinger. The last page always leaves me breathless. The intimacy and truth of that final page is so arresting and almost painful to read.\nshould be on every college syllabus:\nby Anton Piatig

In [10]:
from transformers import AutoTokenizer

model_path = 'microsoft/deberta-v3-small'

tokenizer = AutoTokenizer.from_pretrained(model_path)

In [11]:
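def preprocess_function(example):
    # Join title and body into a single input string.
    text = f"{example['title']}.\n{example['content']}"

    # Multi-hot target vector: 1.0 for every class attached to this example.
    labels = [0.0] * len(classes)
    for label in example['all_labels']:
        labels[class2id[label]] = 1.0

    # NOTE: truncation=True without max_length triggers the tokenizer warning
    # in this cell's output and leaves the inputs untruncated.
    example = tokenizer(text, truncation=True)
    example['labels'] = labels
    return example

tokenized_dataset = dataset.map(preprocess_function)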



Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
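The label encoding done in `preprocess_function` can be checked in isolation. The sketch below is a minimal illustration, assuming a hypothetical three-class label set (`toy_classes`) rather than the dataset's real class list; `multi_hot` is a helper name introduced here for illustration only.

```python
# Hypothetical label set standing in for the dataset's real class names.
toy_classes = ["executive statement", "company description", "other"]
toy_class2id = {name: idx for idx, name in enumerate(toy_classes)}

def multi_hot(label_names, class2id):
    """Return a float vector with 1.0 at the index of every listed label."""
    vector = [0.0] * len(class2id)
    for name in label_names:
        vector[class2id[name]] = 1.0
    return vector

encoded = multi_hot(["company description", "other"], toy_class2id)
print(encoded)  # [0.0, 1.0, 1.0]
```

Floats are used rather than integers because the multi-label loss expects floating-point targets.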


In [16]:
dataset['train'][1500]

{'title': 'ShiraTronics, Inc. Completes $33M Million Series A Financing',
 'content': 'ShiraTronics, Inc. Completes $33M Million Series A Financing\nSearch jobs\n24-Oct-2019\nShiraTronics, Inc. Completes $33M Million Series A Financing\nMINNEAPOLIS, Oct. 24, 2019 /PRNewswire/ --ShiraTronics, Inc., a private medical device company, today announced it has completed a Series A financing of $33 million.The financing was co-led by USVP, Amzak Health, and Strategic HealthCare Investment Partners (S.H.I.P.), with participation from Aperture Ventures, LivaNova PLC, and a leading Academic Institution. Concurrent with the financing, the company also announced the hiring of Lynn Elliott as the President and CEO of the Company. Lynn brings three decades of active implantable device experience in medical device innovation, R&D, manufacturing, clinical trials, and regulatory approvals to healthcare challenges. Lynn has a unique blend of experience in large medical device companies such as Guidant an

In [19]:
tokenized_dataset['train'].to_pandas()

Unnamed: 0,title,content,target organization,all_labels,all_labels_concat,label 1,label 2,label 3,label 4,label 5,input_ids,token_type_ids,attention_mask,labels
0,Sarah Polley's Book Recommendations,Drive Your Plow Over the Bones of The Dead\nby...,Franny's Farmacy,[other],other,23,,,,,"[1, 4537, 16649, 3342, 280, 268, 2538, 34800, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,Denel staff get millions from attached bank ac...,"In the recently tabled National Budget, Denel ...",Heat Relief,[other],other,23,,,,,"[1, 11289, 3212, 979, 350, 3543, 292, 3448, 18...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,How to master productive pausing and get more ...,Shares\nTake a break its good for you (Picture...,Reframe,[other],other,23,,,,,"[1, 577, 264, 2549, 5769, 45482, 263, 350, 310...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,Early Bird & Eight Certifications! RESO Weekly...,RESO is currently hiring for two positions:\nP...,CARE SOUTH,[other],other,23,,,,,"[1, 5268, 8687, 429, 12913, 69051, 300, 40337,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,The Lifetime Discount Club,Charter Buyer Club\nWhat is the Charter Buyer ...,Big Sky Botanicals,[other],other,23,,,,,"[1, 279, 19178, 12236, 2057, 260, 12689, 14861...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2754,Birmingham Mail backs 2025 Invictus Games bid,0\nA regional daily wants to bring an internat...,Invictus Games,"[support & philanthropy, company description]","support & philanthropy, company description",26,21.0,,,,"[1, 8668, 7624, 396, 268, 19291, 85515, 3819, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2755,R1 RCM Inc. (RCM) Reveals an Earnings Mystery,Share on whatsapp\nR1 RCM Inc. (NASDAQ:RCM)\nw...,ABCS RCM,"[investment in public company, company descrip...","investment in public company, company description",22,21.0,,,,"[1, 909, 435, 90421, 1326, 260, 287, 17944, 11...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2756,Verboso Launches Full-Stack Online Speech Ther...,Verboso Launches Full-Stack Online Speech Ther...,Verboso,[product launching & presentation],product launching & presentation,14,,,,,"[1, 36024, 27357, 67980, 3306, 271, 56865, 230...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2757,"Barnet, Enfield and Haringey Mental Health Tru...","Barnet, Enfield and Haringey Mental Health Tru...",Barnet Enfield and Haringey Mental Health Trust,[executive statement],executive statement,1,,,,,"[1, 56259, 261, 42149, 263, 110553, 11099, 151...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [20]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
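To make the collator's role concrete, here is a rough pure-Python sketch of dynamic padding: each batch is padded only to the length of its longest sequence. It is not the library's implementation; `pad_batch` is a hypothetical helper, and right-side padding with `pad_id=0` is an assumption made for this example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (assumed right-padding)."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (token ids are illustrative only).
batch = pad_batch([[1, 4537, 2], [1, 11289, 3212, 979, 2]])
print(batch["input_ids"])       # [[1, 4537, 2, 0, 0], [1, 11289, 3212, 979, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short batches cheap, which matters here because the inputs were not truncated.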

In [9]:
import evaluate
import numpy as np

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def sigmoid(x):
    return 1/(1 + np.exp(-x))

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Turn logits into probabilities, then threshold at 0.5.
    predictions = sigmoid(predictions)
    predictions = (predictions > 0.5).astype(int).reshape(-1)
    # Flatten so every (example, class) pair is scored as one binary decision.
    references = labels.astype(int).reshape(-1)
    return clf_metrics.compute(predictions=predictions, references=references)


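Because `compute_metrics` flattens predictions and labels, each (example, class) slot counts as one binary decision. With sparse multi-hot labels, most slots are 0, so accuracy can look high while recall stays low. The sketch below illustrates this with made-up logits and labels (2 examples, 5 classes) and plain NumPy instead of `evaluate`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up logits and multi-hot labels: 2 examples, 5 classes.
logits = np.array([[-3.0, -2.0, 2.0, -4.0, -1.0],
                   [-2.0, -3.0, -1.0, -2.0, -4.0]])
labels = np.array([[0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0]])

# Same thresholding and flattening as in compute_metrics.
preds = (sigmoid(logits) > 0.5).astype(int).reshape(-1)
refs = labels.reshape(-1)

tp = int(((preds == 1) & (refs == 1)).sum())
fp = int(((preds == 1) & (refs == 0)).sum())
fn = int(((preds == 0) & (refs == 1)).sum())

accuracy = float((preds == refs).mean())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall)  # 0.9 1.0 0.5
```

One of the two positive labels is missed, yet accuracy is still 0.9, since eight of the ten slots are easy negatives. F1, precision, and recall are therefore the more informative numbers for this task.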

In [24]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=len(classes),
    id2label=id2class,
    label2id=class2id,
    problem_type="multi_label_classification",
)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
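With `problem_type="multi_label_classification"`, the model is trained with a per-class binary cross-entropy on the logits instead of a softmax cross-entropy. The following is a minimal NumPy sketch of that loss, written in the numerically stable form; `bce_with_logits` is a name chosen for this example, not a library function.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (example, class) slots.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which avoids overflow for large-magnitude logits.
    """
    per_slot = (np.maximum(logits, 0)
                - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    return float(per_slot.mean())

# With all-zero logits every class gets probability 0.5,
# so the loss equals log(2) regardless of the targets.
logits = np.zeros((2, 3))
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 1.0]])
loss = bce_with_logits(logits, targets)
print(loss)
```

Each class is scored independently, which is why one example can carry several labels at once and why the targets are float multi-hot vectors.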


In [25]:
import os
os.environ["WANDB_PROJECT"]="Dilbazlar"

In [12]:
# Read the W&B API key from the environment instead of hardcoding a secret in the notebook.
wandb_api_key = os.environ.get("WANDB_API_KEY")

In [13]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",  # W&B integration: https://docs.wandb.ai/guides/integrations/huggingface
    run_name="Multi-label-model-baseline",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\halilibrahim.hatun\_netrc


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2317,0.146006,0.949588,0.367764,0.72,0.246951
2,0.1488,0.13435,0.955833,0.512,0.744186,0.390244
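Once a model like this is trained, its logits still have to be mapped back to label names. The sketch below shows one way to do that, using the same sigmoid and 0.5 threshold as `compute_metrics`. The three-class `toy_id2class` mapping and the logits are hypothetical, and `decode_labels` is a helper introduced here for illustration.

```python
import numpy as np

# Hypothetical mapping standing in for the notebook's id2class.
toy_id2class = {0: "executive statement", 1: "company description", 2: "other"}

def decode_labels(logits, id2class, threshold=0.5):
    """Return the names of all classes whose sigmoid probability exceeds threshold."""
    probs = 1 / (1 + np.exp(-np.asarray(logits, dtype=float)))
    return [id2class[int(i)] for i in np.flatnonzero(probs > threshold)]

predicted = decode_labels([2.0, -1.0, 0.5], toy_id2class)
print(predicted)  # ['executive statement', 'other']
```

Lowering `threshold` trades precision for recall, which may be worth exploring given how far recall trails precision in the log above.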


NameError: name 'wandb' is not defined