This notebook demonstrates how to build, train, and evaluate a topic (event) extraction model based on DistilBERT. The workflow consists of:

**1. Dataset Preparation:**

Dialogsum Dataset: Preprocessed 12,460 training examples (plus 500 validation and 1,500 test examples) covering 8,521 unique topics across all splits, keeping only the summary and topic columns for training.

Dynamic-Topic-RedPajama Dataset: Expanded the training data with a second dataset that, after filtering, contains 66,942 examples across 959 unique topics.

**2. Model Training:**

The first model (model1) was fine-tuned on the Dialogsum dataset.
The second phase further fine-tuned model1 on the larger, filtered RedPajama dataset to refine its ability to predict topics.

**3. Evaluation and Results:**

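Each model was evaluated on the validation split of its own dataset; a held-out test split was also created for each dataset.
The first model reached 9.8% validation accuracy over 8,521 topics, and the final model (event_extraction_model) reached about 39.5% validation accuracy over 959 topics.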

**4. Inference:**

Tested the final model on various sample inputs to verify its ability to identify key topics from textual data.


### Step 1: Setting up the environment

 Install necessary libraries

In [None]:
!pip install datasets
!pip install transformers
!pip install scikit-learn


Import libraries

In [None]:
import os
import torch
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

Download the first dataset, Dialogsum, from Kaggle: https://www.kaggle.com/datasets/marawanxmamdouh/dialogsum

In [3]:
!kaggle datasets download marawanxmamdouh/dialogsum --unzip -p dialogsum_dataset

Dataset URL: https://www.kaggle.com/datasets/marawanxmamdouh/dialogsum
License(s): other
Downloading dialogsum.zip to dialogsum_dataset
 87% 7.00M/8.05M [00:00<00:00, 13.4MB/s]
100% 8.05M/8.05M [00:00<00:00, 10.6MB/s]


### Step 2: Preprocessing of the first dataset Dialogsum

This dataset contains the following files: 'hiddentest_dialogue.csv', 'validation.csv', 'test.csv', 'train.csv', 'hiddentest_topic.csv'

We used the files 'train.csv', 'validation.csv', and 'test.csv', which contain 12,460, 500, and 1,500 entries respectively.

Each entry contains 4 columns: id, dialogue, summary, topic

The columns that concern us are summary (text) and topic, so we removed the other columns and used those only.

The topic column contains 8,521 unique topics.

We fine-tuned "distilbert-base-uncased" on this dataset to obtain a model called "model1", which was then fine-tuned further on a much larger dataset (Step 6).

In [None]:
# path to the extracted dataset
dataset_path = "dialogsum_dataset/CSV"

# list the files
print("Files in the dataset directory:")
print(os.listdir(dataset_path))

Files in the dataset directory:
['hiddentest_topic.csv', 'validation.csv', 'hiddentest_dialogue.csv', 'train.csv', 'test.csv']


set the train, validation, and test data

In [None]:
# load the CSV files
train_file = os.path.join(dataset_path, "train.csv")
validation_file = os.path.join(dataset_path, "validation.csv")
test_file = os.path.join(dataset_path, "test.csv")

# load the datasets
train_data = pd.read_csv(train_file)
validation_data = pd.read_csv(validation_file)
test_data = pd.read_csv(test_file)

# check the structure of the data
print("Train data sample:")
print(train_data.head())

Train data sample:
        id                                           dialogue  \
0  train_0  #Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. ...   
1  train_1  #Person1#: Hello Mrs. Parker, how have you bee...   
2  train_2  #Person1#: Excuse me, did you see a set of key...   
3  train_3  #Person1#: Why didn't you tell me you had a gi...   
4  train_4  #Person1#: Watsup, ladies! Y'll looking'fine t...   

                                             summary              topic  
0  Mr. Smith's getting a check-up, and Doctor Haw...     get a check-up  
1  Mrs Parker takes Ricky for his vaccines. Dr. P...           vaccines  
2  #Person1#'s looking for a set of keys and asks...          find keys  
3  #Person1#'s angry because #Person2# didn't tel...  have a girlfriend  
4  Malik invites Nikki to dance. Nikki agrees if ...              dance  


convert the pandas DataFrame to Hugging Face Dataset

In [None]:
def convert_to_hf_dataset(df):
    return Dataset.from_pandas(df)

train_dataset = convert_to_hf_dataset(train_data[['summary', 'topic']])
validation_dataset = convert_to_hf_dataset(validation_data[['summary', 'topic']])
test_dataset = convert_to_hf_dataset(test_data[['summary', 'topic']])

combine into a DatasetDict for easy access

In [None]:
dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset
})

extract unique topics from the training split and create preliminary mappings (these are rebuilt in Step 3 from all three splits)

In [None]:
unique_topics = list(train_data['topic'].unique())
label2id = {topic: idx for idx, topic in enumerate(unique_topics)}
id2label = {idx: topic for topic, idx in label2id.items()}

### Step 3: Training the "distilbert-base-uncased" model on the first preprocessed dataset Dialogsum

load the distilbert tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

check how many different topics this dataset includes

In [None]:
# combine and sort unique topics from all datasets
unique_topics = sorted(
    set(train_data['topic']).union(
        set(validation_data['topic']),
        set(test_data['topic'])
    )
)

# create mappings
label2id = {topic: idx for idx, topic in enumerate(unique_topics)}
id2label = {idx: topic for idx, topic in enumerate(unique_topics)}

num_labels = len(unique_topics)

print(f"Total unique topics: {num_labels}")

Total unique topics: 8521
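
A toy example shows why the mappings are built from the union of all three splits rather than from the training split alone. The sketch below uses made-up topic lists (not Dialogsum data) and a hypothetical helper `build_mappings`; it shows that a train-only mapping raises a `KeyError` as soon as a validation example carries an unseen topic.

```python
# Toy topic lists; these values are illustrative assumptions, not Dialogsum data.
train_topics = ["dance", "vaccines", "find keys"]
validation_topics = ["dance", "job interview"]  # "job interview" never occurs in train


def build_mappings(topics):
    """Return (label2id, id2label) built from the sorted unique topics."""
    unique = sorted(set(topics))
    label2id = {topic: idx for idx, topic in enumerate(unique)}
    id2label = {idx: topic for topic, idx in label2id.items()}
    return label2id, id2label


# Mapping built from train only: encoding the validation split fails.
train_only_label2id, _ = build_mappings(train_topics)
try:
    [train_only_label2id[t] for t in validation_topics]
    train_only_ok = True
except KeyError:
    train_only_ok = False

# Mapping built from the union of both splits: every topic gets an ID.
label2id, id2label = build_mappings(train_topics + validation_topics)
encoded = [label2id[t] for t in validation_topics]
decoded = [id2label[i] for i in encoded]
print(train_only_ok, encoded, decoded)
```

The trade-off is that a topic seen only in validation or test gets an output unit that never receives a training signal, so the model has no way to learn to predict it.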


tokenize the columns

In [None]:
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples['summary'],
        padding="max_length",
        truncation=True,
        max_length=128
    )
    # map topics to label IDs
    model_inputs["labels"] = [label2id[topic] for topic in examples["topic"]]
    return model_inputs
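
What `padding="max_length"` and `truncation=True` do to each example can be illustrated without the tokenizer. The following is a conceptual sketch on plain lists of integers; `pad_or_truncate` is a hypothetical helper, and the real tokenizer additionally takes care of special tokens such as `[CLS]` and `[SEP]` when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut off.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to 128 tokens keeps tensor shapes fixed, at the cost of spending compute on padding for short summaries.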

apply preprocessing to the dataset

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

set format for PyTorch compatibility

In [None]:
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

load the model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

define the training arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=['wandb'],  # enable W&B logging to be able to visualize the training process
)

define compute metrics function

In [None]:
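def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the true label IDs
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, preds)
    # zero_division=0 scores never-predicted classes as 0 without emitting a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}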
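
In the training logs of this notebook the Recall column always equals the Accuracy column. That follows from `average="weighted"`: each class's recall is weighted by its support, and the weighted sum collapses to overall accuracy. A minimal pure-Python sketch on made-up labels (not the notebook's data) illustrates this; `weighted_recall` is a hypothetical helper written for this illustration, not the scikit-learn implementation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Support-weighted mean of per-class recall."""
    support = Counter(y_true)  # number of true examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


# Toy labels chosen for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

Because the two numbers carry the same information, weighted precision and F1 are the columns that add something beyond accuracy here.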

initialize trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
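
How a patience of 3 interacts with this run can be sketched in a few lines. This is a simplified model of patience-based early stopping, not the library's implementation (the real callback also supports an improvement threshold); `early_stop_epoch` is a hypothetical helper, and the accuracy values are the validation accuracies logged in the training output of the next cell.

```python
def early_stop_epoch(metric_per_epoch, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None or value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation accuracy per epoch as logged for this run.
logged_accuracy = [0.034, 0.06, 0.072, 0.082, 0.086, 0.088, 0.092, 0.098, 0.088, 0.092]
stop_with_patience_3 = early_stop_epoch(logged_accuracy, patience=3)
stop_with_patience_2 = early_stop_epoch(logged_accuracy, patience=2)
print(stop_with_patience_3, stop_with_patience_2)
```

Under this simplified rule, the best accuracy falls in epoch 8 and only two non-improving epochs follow, so a patience of 3 never triggers within the 10 configured epochs.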

train the model !!

In [None]:

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 8.9284 | 8.683805 | 0.034 | 0.006769 | 0.034 | 0.010099 |
| 2 | 7.9518 | 8.794666 | 0.06 | 0.011067 | 0.06 | 0.017588 |
| 3 | 7.1305 | 8.950762 | 0.072 | 0.017816 | 0.072 | 0.025458 |
| 4 | 6.4979 | 9.138571 | 0.082 | 0.023987 | 0.082 | 0.034199 |
| 5 | 5.9629 | 9.702597 | 0.086 | 0.034686 | 0.086 | 0.045657 |
| 6 | 5.3434 | 9.89323 | 0.088 | 0.038108 | 0.088 | 0.048941 |
| 7 | 4.9415 | 10.386211 | 0.092 | 0.046809 | 0.092 | 0.056444 |
| 8 | 4.707 | 10.631849 | 0.098 | 0.056951 | 0.098 | 0.066649 |
| 9 | 4.699 | 10.999092 | 0.088 | 0.054131 | 0.088 | 0.062054 |
| 10 | 4.1736 | 11.151324 | 0.092 | 0.058293 | 0.092 | 0.065794 |



TrainOutput(global_step=7790, training_loss=6.0691849034007, metrics={'train_runtime': 1538.7394, 'train_samples_per_second': 80.975, 'train_steps_per_second': 5.063, 'total_flos': 4753253622835200.0, 'train_loss': 6.0691849034007, 'epoch': 10.0})

### Step 4: Evaluate the first model after training on the first dataset

evaluate the model

In [None]:
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")



Evaluation results: {'eval_loss': 10.63184928894043, 'eval_accuracy': 0.098, 'eval_precision': 0.0569508658008658, 'eval_recall': 0.098, 'eval_f1': 0.06664851215848658, 'eval_runtime': 2.0469, 'eval_samples_per_second': 244.274, 'eval_steps_per_second': 15.634, 'epoch': 10.0}
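
To put these numbers in context, it helps to know what a model that guesses uniformly at random would score. The sketch below computes the cross-entropy and accuracy of a uniform guess; `chance_level` is a hypothetical helper introduced only for this comparison.

```python
import math


def chance_level(num_labels):
    """Cross-entropy loss and accuracy of a uniform guess over num_labels classes."""
    return math.log(num_labels), 1.0 / num_labels


loss_8521, acc_8521 = chance_level(8521)
print(f"uniform-guess loss: {loss_8521:.2f}, accuracy: {acc_8521:.5f}")
```

With 8,521 labels a uniform guess has a loss of roughly 9.05, close to the first-epoch training loss of 8.93, and an accuracy of about 0.01%. The 9.8% validation accuracy is therefore far above chance, although low in absolute terms. The validation loss rising while the training loss falls suggests the model is overfitting the training split.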


save the model for later use

In [None]:
model.save_pretrained("./model1")
tokenizer.save_pretrained("./model1")

('./model1/tokenizer_config.json',
 './model1/special_tokens_map.json',
 './model1/vocab.txt',
 './model1/added_tokens.json',
 './model1/tokenizer.json')

let's test it on some examples now

In [None]:
# load the tokenizer and model from the saved directory
tokenizer = AutoTokenizer.from_pretrained("./model1")
model = AutoModelForSequenceClassification.from_pretrained("./model1")

# ensure the model is in evaluation mode
model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [21]:
# Access id2label and label2id from the model's configuration
id2label = model.config.id2label
label2id = model.config.label2id

In [22]:
# Sample text input
text = "I wanted to dance so bad!"

In [23]:
# Tokenize the input text
inputs = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"  # Return PyTorch tensors
)

In [24]:
# Get predictions from the model
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get the predicted class ID
predicted_class_id = torch.argmax(logits, dim=1).item()

In [25]:
# Get the predicted topic name
# id2label loaded from the model config is keyed by integer class ID
predicted_topic = id2label[predicted_class_id]


print(f"Text: {text} ; The predicted topic is: {predicted_topic}")

Text: I wanted to dance so bad! ; The predicted topic is: dance
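
`argmax` returns only the single most likely topic. When it is useful to see how confident the model is, the logits can be turned into probabilities with a softmax and ranked. The sketch below does this in pure Python on made-up logits; `softmax`, `top_k`, `toy_logits`, and `toy_id2label` are illustrative names, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
toy_id2label = {0: "dance", 1: "vaccines", 2: "find keys"}
toy_logits = [2.0, 0.5, 1.0]
ranking = top_k(toy_logits, toy_id2label, k=2)
print(ranking)
```

With the real model, `logits[0].tolist()` and `model.config.id2label` could be passed in place of the toy values.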


### Step 5: Process the second dataset "Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens"

This dataset contains 100,000 entries and 4 columns: text, topic_1, topic_2, topic_3. We created a separate entry for each topic, so the transformed dataset contains 300,000 entries with 2 columns: text and topic.

After dropping missing entries and removing topics with 20 or fewer examples, the filtered data contains 66,942 entries across 959 topics.

The second round of training fine-tunes "model1" from the previous step on these 66,942 entries, split into train, validation, and test sets.

load the dataset

In [None]:
dataset2 = load_dataset("AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens")

let's look at the structure of the dataset

In [33]:
print(dataset2)

DatasetDict({
    train: Dataset({
        features: ['text', 'topic_1', 'topic_2', 'topic_3'],
        num_rows: 100000
    })
})


convert the dataset to a pandas DataFrame

In [None]:
train_data = dataset2["train"].to_pandas()

# create a new DataFrame with the desired structure
transformed_data = pd.DataFrame({
    "text": pd.concat([train_data["text"], train_data["text"], train_data["text"]], ignore_index=True),
    "topic": pd.concat([train_data["topic_1"], train_data["topic_2"], train_data["topic_3"]], ignore_index=True),
})

# display the transformed dataset
print("Transformed Dataset Sample:")
print(transformed_data.head())

# save the new dataset if needed
transformed_data.to_csv("transformed_dataset.csv", index=False)
print("Transformed dataset saved as 'transformed_dataset.csv'.")


Transformed Dataset Sample:
                                                text                  topic
0  Colourful pictures of Lohri and Makar Sankrant...  Cultural Celebrations
1  Laurent Garnier\n04-12-2010Past EventAmsterdam...                  Music
2  Frame: 14 1/8" x 17 7/8"\nImage: 9 7/8" x 13 5...        Art and Framing
3  Advocates warn against sharing unconfirmed rum...     Immigration Issues
4  New project for Ideas2Action – Win on Waste on...  Community Initiatives
Transformed dataset saved as 'transformed_dataset.csv'.
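
The wide-to-long reshaping above can be traced on a couple of toy rows. The sketch below reproduces the same row order as the `pd.concat` call using plain Python lists; the rows are made up for illustration and the variable names are hypothetical.

```python
# Toy rows in the original wide layout; the values are made up for illustration.
wide_rows = [
    {"text": "doc A", "topic_1": "Music", "topic_2": "Festivals", "topic_3": "Travel"},
    {"text": "doc B", "topic_1": "Sports", "topic_2": None, "topic_3": "Health"},
]
topic_columns = ["topic_1", "topic_2", "topic_3"]

# Same order as the pd.concat call: all topic_1 rows, then topic_2, then topic_3.
long_rows = [
    {"text": row["text"], "topic": row[col]}
    for col in topic_columns
    for row in wide_rows
]

# Drop rows without a topic, as the later dropna step does.
clean_rows = [r for r in long_rows if r["topic"] is not None]
print(len(long_rows), len(clean_rows))
```

One consequence of this layout is that the same text appears up to three times with different labels. A single-label classifier then sees conflicting targets for identical inputs, which likely limits the top-1 accuracy it can reach, and copies of one text can land in different splits.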


In [35]:
print(transformed_data)

                                                     text  \
0       Colourful pictures of Lohri and Makar Sankrant...   
1       Laurent Garnier\n04-12-2010Past EventAmsterdam...   
2       Frame: 14 1/8" x 17 7/8"\nImage: 9 7/8" x 13 5...   
3       Advocates warn against sharing unconfirmed rum...   
4       New project for Ideas2Action – Win on Waste on...   
...                                                   ...   
299995  Save when you buy both the shirt and shirt tog...   
299996  Yadonia Group @ Facebook Yadonia Group @ Twitt...   
299997  GameCube accessories include first-party relea...   
299998  While I was in LA, Zach took me out to Anacapa...   
299999  TasCOSS MEDIA RELEASE: Funding to address comm...   

                              topic  
0             Cultural Celebrations  
1                             Music  
2                   Art and Framing  
3                Immigration Issues  
4             Community Initiatives  
...                             ...  

dropping all missing values and ensuring all topics are strings

In [None]:
transformed_data = transformed_data.dropna(subset=["text", "topic"]).reset_index(drop=True)
transformed_data["topic"] = transformed_data["topic"].astype(str)

print(transformed_data['topic'].value_counts())

topic
Sports                   2207
Software Development     1741
Literature               1593
Music                    1308
Education                1284
                         ... 
Roman Empire Tablets        1
Colorado Governance         1
Injectable Treatments       1
Forearm Balance             1
Theni District              1
Name: count, Length: 174856, dtype: int64


count how many examples each label has, and remove labels with 20 or fewer examples

In [38]:
class_counts = transformed_data['topic'].value_counts()
valid_classes = class_counts[class_counts > 20].index
filtered_data = transformed_data[transformed_data['topic'].isin(valid_classes)].reset_index(drop=True)

print(filtered_data['topic'].value_counts())

topic
Sports                  2207
Software Development    1741
Literature              1593
Music                   1308
Education               1284
                        ... 
Murder Case               21
Scholarship Programs      21
Impact of COVID-19        21
Creative Process          21
Superhero Movies          21
Name: count, Length: 959, dtype: int64
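
The filter uses a strict `>` comparison, so a topic with exactly 20 examples is dropped and the smallest surviving topics have 21. A minimal sketch with `collections.Counter` on toy data shows the boundary; `filter_by_min_count` is a hypothetical helper mirroring the pandas logic.

```python
from collections import Counter


def filter_by_min_count(rows, threshold):
    """Keep rows whose topic occurs strictly more than `threshold` times."""
    counts = Counter(topic for _, topic in rows)
    return [(text, topic) for text, topic in rows if counts[topic] > threshold]


# Toy data: "A" occurs 21 times, "B" exactly 20 times, "C" once.
rows = [("a", "A")] * 21 + [("b", "B")] * 20 + [("c", "C")]
kept = filter_by_min_count(rows, threshold=20)
kept_topics = sorted({topic for _, topic in kept})
print(kept_topics, len(kept))
```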


split the dataset into train (70%), validation (15%), and test (15%)

In [None]:
train2_data, temp_data = train_test_split(
    filtered_data, test_size=0.3, random_state=42, stratify=filtered_data['topic']
)

val2_data, test2_data = train_test_split(
    temp_data, test_size=0.5, random_state=42, stratify=temp_data['topic']
)

The filtered dataset contains 66,942 entries; after splitting, the sizes are:

In [40]:
print(f"Training set size: {len(train2_data)}")
print(f"Validation set size: {len(val2_data)}")
print(f"Test set size: {len(test2_data)}")

Training set size: 46859
Validation set size: 10041
Test set size: 10042
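
The three sizes can be reproduced by hand. The sketch below assumes that the test share is rounded up to a whole number of rows and the remainder goes to the other part, which is my understanding of how `train_test_split` handles a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_temp = split_sizes(66942, 0.3)  # first split: 70% / 30%
n_val, n_test = split_sizes(n_temp, 0.5)   # second split: halves of the 30%
print(n_train, n_val, n_test)
```

The uneven 10,041 / 10,042 split comes from halving an odd number of rows (20,083).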


let's look at the structure of the train data

In [41]:
print(train2_data.head())

                                                    text               topic
12765  Actress Shailene Woodley confirms engagement t...  Entertainment News
37273  The Appellate Court Says an In Pro Per Party's...   Legal Proceedings
27654  Alina Ivanchenko CG Artist - Did you know the ...     Art and Culture
36254  Tag: Practical Applications\nWhy Care About As...           Astronomy
27388  If you’ve eaten one too many crosushis (croiss...           Nutrition


convert the pandas DataFrame to Hugging Face Dataset

In [None]:
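def convert_to_hf_dataset(df):
    # the split shuffled the DataFrame index; do not carry it over as an extra column
    return Dataset.from_pandas(df, preserve_index=False)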

train_dataset2 = convert_to_hf_dataset(train2_data[['text', 'topic']])
validation_dataset2 = convert_to_hf_dataset(val2_data[['text', 'topic']])
test_dataset2 = convert_to_hf_dataset(test2_data[['text', 'topic']])

combine into a DatasetDict for easy access

In [None]:

dataset = DatasetDict({
    "train": train_dataset2,
    "validation": validation_dataset2,
    "test": test_dataset2,
})

let's find out how many unique topics we have in this dataset after filtering

In [None]:
# combine and sort unique topics from all datasets
unique_topics = sorted(
    set(train2_data['topic']).union(
        set(val2_data['topic']),
        set(test2_data['topic'])
    )
)

# Create mappings
label2id = {topic: idx for idx, topic in enumerate(unique_topics)}
id2label = {idx: topic for idx, topic in enumerate(unique_topics)}

num_labels = len(unique_topics)

print(f"Total unique topics: {num_labels}")

Total unique topics: 959


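rebuild the mappings from the training split only (this overrides the sorted mappings from the previous cell; because the split is stratified, all 959 topics still appear in the training data)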

In [None]:
unique_topics = list(train2_data['topic'].unique())
label2id = {topic: idx for idx, topic in enumerate(unique_topics)}
id2label = {idx: topic for topic, idx in label2id.items()}

### Step 6: Set up the training of the previous model "model1" on our new processed dataset "Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens"

load the tokenizer from the previously trained "model1"

In [57]:
tokenizer = AutoTokenizer.from_pretrained("./model1")

map topics to label IDs

In [None]:
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples['text'],
        padding="max_length",
        truncation=True,
        max_length=128
    )
    
    model_inputs["labels"] = [label2id[topic] for topic in examples["topic"]]
    return model_inputs

apply preprocessing to the dataset

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

set format for PyTorch compatibility

In [None]:
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

load the previously saved model "model1"

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "./model1",
    num_labels=num_labels,  # 959 labels now, versus 8,521 in model1
    # the old classifier layer no longer fits and is re-initialized;
    # the encoder weights from model1 are kept
    ignore_mismatched_sizes=True,
)
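
Because the label count changes from 8,521 to 959, only part of "model1" carries over. The sketch below is a conceptual illustration of that shape check, not the library's loading code: it lists the parameter shapes of the DistilBERT classification head (`pre_classifier` and `classifier`) for both label counts and reports which ones match. `head_shapes` is a hypothetical helper.

```python
HIDDEN = 768  # DistilBERT hidden size, as shown in the model printout


def head_shapes(num_labels):
    """Shapes of the classification-head parameters for a given label count."""
    return {
        "pre_classifier.weight": (HIDDEN, HIDDEN),
        "pre_classifier.bias": (HIDDEN,),
        "classifier.weight": (num_labels, HIDDEN),
        "classifier.bias": (num_labels,),
    }


checkpoint = head_shapes(8521)  # model1, trained on Dialogsum
target = head_shapes(959)       # new model for the filtered RedPajama topics

# Parameters whose shapes differ cannot be loaded and start from fresh weights.
reinitialized = sorted(k for k in target if checkpoint[k] != target[k])
reused = sorted(k for k in target if checkpoint[k] == target[k])
print("re-initialized:", reinitialized)
print("reused:", reused)
```

In other words, the second phase reuses the encoder and `pre_classifier` weights from "model1", while the final `classifier` layer is learned from scratch on the new topics.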

define training arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,  # the logged run below covers 4 epochs
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=['wandb'],  # enable W&B logging for visualization of the training process
)

define compute metrics function

In [None]:
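def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the true label IDs
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, preds)
    # zero_division=0 scores never-predicted classes as 0 without emitting a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}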

initialize the trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)


and now we train the second model !!

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 3.3445 | 3.143588 | 0.315108 | 0.195054 | 0.315108 | 0.218831 |
| 2 | 2.3854 | 2.640079 | 0.373768 | 0.275405 | 0.373768 | 0.296203 |
| 3 | 1.9839 | 2.505637 | 0.388109 | 0.307073 | 0.388109 | 0.3269 |
| 4 | 1.7295 | 2.496341 | 0.394881 | 0.321832 | 0.394881 | 0.34014 |




TrainOutput(global_step=11716, training_loss=2.586590626655727, metrics={'train_runtime': 2347.3388, 'train_samples_per_second': 79.85, 'train_steps_per_second': 4.991, 'total_flos': 6313228013042688.0, 'train_loss': 2.586590626655727, 'epoch': 4.0})
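
The `global_step` values reported in the two `TrainOutput` lines can be cross-checked against the dataset sizes. The sketch assumes a single device and no gradient accumulation, so that one optimizer step corresponds to one batch of 16; `expected_steps` is a hypothetical helper.

```python
import math


def expected_steps(n_train, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


# First run: 12,460 training examples, 10 epochs.
first_run = expected_steps(12460, 16, 10)

# Second run: how many epochs do 11,716 steps over 46,859 examples correspond to?
second_run_epochs = 11716 / math.ceil(46859 / 16)
print(first_run, second_run_epochs)
```

The first value matches the `global_step=7790` logged for "model1". The second comes out as 4, which agrees with the `'epoch': 4.0` field in the output above.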

### Step 7: Evaluate the final model: event_extraction_model

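evaluate the model (with no arguments, trainer.evaluate() scores the validation split passed as eval_dataset; pass tokenized_dataset["test"] to score the held-out test split)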

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

Evaluation results: {'eval_loss': 2.4963409900665283, 'eval_accuracy': 0.3948809879494074, 'eval_precision': 0.32183156390194795, 'eval_recall': 0.3948809879494074, 'eval_f1': 0.3401403588636155, 'eval_runtime': 45.4144, 'eval_samples_per_second': 221.097, 'eval_steps_per_second': 13.828, 'epoch': 4.0}




map the new labels to ids

In [None]:
model.config.id2label = id2label
model.config.label2id = label2id
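
These mappings are stored with the model in its JSON configuration file. JSON object keys are always strings, so an integer-keyed `id2label` changes key type on a plain round trip. As far as I know, `transformers` converts the keys back to integers when the config is loaded; the sketch below only demonstrates the JSON behaviour itself, using a toy mapping.

```python
import json

id2label = {0: "Sports", 1: "Music"}  # toy mapping for illustration

# JSON object keys are always strings, so integer keys change type on a round trip.
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
print(restored)

# Converting the keys back gives a mapping that works with integer class IDs.
restored_int = {int(k): v for k, v in restored.items()}
print(restored_int[0])
```

This matters when the mapping is read straight from the JSON file rather than through the model config: a lookup with the integer returned by `argmax` would then fail unless the keys are converted first.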

save the model for later use

In [None]:
model.save_pretrained("./event_extraction_model")
tokenizer.save_pretrained("./event_extraction_model")

('./event_extraction_model/tokenizer_config.json',
 './event_extraction_model/special_tokens_map.json',
 './event_extraction_model/vocab.txt',
 './event_extraction_model/added_tokens.json',
 './event_extraction_model/tokenizer.json')

load the tokenizer and model from the saved directory

In [None]:
tokenizer = AutoTokenizer.from_pretrained("./event_extraction_model")
model = AutoModelForSequenceClassification.from_pretrained("./event_extraction_model")

# ensure the model is in evaluation mode
model.eval()



Now let's test it on some sample text input to see if the model is able to extract the topic/context from a sentence

In [None]:
texts = [
    "I love playing football",
    "My mother made me food today.",
    "The weather is sunny today.",
    "I aced my exam!",
    "I spoke with my crush today.",
    "I hate my school"
]

tokenize the input text

In [None]:
inputs = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"  # Return PyTorch tensors
)

get predictions from the model

In [None]:
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get the predicted class IDs
predicted_class_ids = torch.argmax(logits, dim=1).tolist()  # Convert to a list

Moment of truth: get the predicted topics 👀

In [None]:
predicted_topics = [id2label[class_id] for class_id in predicted_class_ids]

# display the results
for text, topic in zip(texts, predicted_topics):
    print(f"Text: {text} ; The predicted topic is: {topic}")

Text: I love playing football ; The predicted topic is: Sports
Text: My mother made me food today. ; The predicted topic is: Food and Cooking
Text: The weather is sunny today. ; The predicted topic is: Weather
Text: I aced my exam! ; The predicted topic is: Education
Text: I spoke with my crush today. ; The predicted topic is: Personal Experiences
Text: I hate my school ; The predicted topic is: Education


Amazing! The model behaves as expected on these samples: given a sentence, it extracts the main context/topic. Combined with our extracted entities, we can now create an event instance for any given input 🎉