# Llama Model Training

The LLAMA method performs very differently across dataset configurations. On the Climate Change (CC) dataset it achieves its strongest results, with a weighted F1 score of 0.398 (precision: 0.321, recall: 0.528). One possible explanation is that climate change narratives are more structured and consistent: they often revolve around established scientific concepts and recurring themes, which may make them easier for the model to identify and classify.
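
As a complementary check on the CC result, the per-class confusion matrices in the training log can be used to see which classes the model actually predicts. The following is a minimal sketch, assuming each logged matrix is a one-vs-rest matrix laid out as `[[TN, FP], [FN, TP]]` (this layout is consistent with the per-class precision and recall printed in the log). The helper name `predicted_counts` is hypothetical and not part of the project code.

```python
# One-vs-rest confusion matrices copied from the epoch-3 row of the CC
# fine-tuning log; each entry is assumed to be [[TN, FP], [FN, TP]].
cc_epoch3 = {
    "Class_0": [[40, 24], [2, 38]],
    "Class_1": [[103, 0], [1, 0]],
    "Class_3": [[96, 0], [8, 0]],
    "Class_4": [[100, 0], [4, 0]],
    "Class_5": [[92, 0], [12, 0]],
    "Class_6": [[97, 0], [7, 0]],
    "Class_7": [[103, 0], [1, 0]],
    "Class_8": [[96, 0], [8, 0]],
    "Class_9": [[57, 25], [5, 17]],
    "Class_10": [[103, 0], [1, 0]],
}


def predicted_counts(matrices):
    """Number of validation samples assigned to each class (FP + TP)."""
    return {name: m[0][1] + m[1][1] for name, m in matrices.items()}


counts = predicted_counts(cc_epoch3)
predicted_classes = [name for name, n in counts.items() if n > 0]
print(counts)
print(predicted_classes)
```

On these logged values, only `Class_0` and `Class_9` receive any predictions, so the aggregate CC scores are driven by two classes.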

On the Ukraine (UA) dataset, LLAMA's F1 score falls to 0.221 (precision: 0.194, recall: 0.306), a relative drop of roughly 44% compared with CC. This decrease could be due to the more dynamic and evolving nature of conflict-related narratives, which may involve more varied vocabulary, rapidly changing context, and complex geopolitical elements that challenge the model's classification capabilities.

The model is least effective on the full combined dataset, where it achieves an F1 score of 0.098 (precision: 0.064, recall: 0.212). This substantial degradation suggests that LLAMA struggles to distinguish between narratives from multiple domains at once. A likely cause is that the combined label space forces the model to learn domain-specific narrative patterns for both domains simultaneously, which increases confusion and misclassification.
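
To put the three F1 scores on a common scale, the relative drop with respect to the CC result can be computed directly. This is a small arithmetic sketch using only the scores quoted in this section; the function name `relative_drop` is chosen for illustration.

```python
# F1 scores as reported in the text for each dataset configuration.
f1_scores = {"CC": 0.398, "UA": 0.221, "full": 0.098}


def relative_drop(reference, value):
    """Fraction by which `value` falls below `reference`."""
    return (reference - value) / reference


for name in ("UA", "full"):
    drop = relative_drop(f1_scores["CC"], f1_scores[name])
    print(f"{name} vs CC: {drop:.1%} lower F1")
```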

This pattern of declining performance from single-domain to multi-domain classification suggests that LLAMA, as configured here, may be better suited to specialized, domain-specific tasks than to broader, multi-domain applications. The drop in precision on the full dataset (0.064, compared with 0.321 on CC and 0.194 on UA) in particular suggests that the model's positive predictions become far less reliable when it has to handle multiple narrative domains.
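
The aggregate scores quoted above can be reproduced from the per-class confusion matrices in the training log. The sketch below assumes support-weighted averaging over classes and a `[[TN, FP], [FN, TP]]` layout per class, with undefined ratios treated as 0. The helper names `safe_div` and `weighted_scores` are hypothetical. Under this averaging, weighted recall equals accuracy, which is why the Recall and Accuracy columns in the logs coincide.

```python
# One-vs-rest matrices copied from the epoch-1 row of the UA fine-tuning log.
ua_epoch1 = {
    0: [[212, 4], [17, 2]],
    1: [[219, 0], [16, 0]],
    2: [[119, 75], [12, 29]],
    3: [[189, 17], [25, 4]],
    4: [[233, 0], [2, 0]],
    5: [[233, 0], [2, 0]],
    6: [[226, 0], [9, 0]],
    7: [[129, 52], [17, 37]],
    8: [[233, 0], [2, 0]],
    9: [[184, 15], [36, 0]],
    10: [[221, 0], [14, 0]],
    11: [[224, 0], [11, 0]],
}


def safe_div(num, den):
    """Return num / den, or 0.0 when the ratio is undefined."""
    return num / den if den else 0.0


def weighted_scores(matrices):
    """Support-weighted precision, recall and F1 over all classes."""
    total = sum(m[1][0] + m[1][1] for m in matrices.values())
    precision = recall = f1 = 0.0
    for (tn, fp), (fn, tp) in matrices.values():
        p = safe_div(tp, tp + fp)
        r = safe_div(tp, tp + fn)
        f = safe_div(2 * p * r, p + r)
        weight = (fn + tp) / total  # share of true samples in this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


precision, recall, f1 = weighted_scores(ua_epoch1)
print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

With the logged UA matrices this yields values that match the reported UA scores (F1 0.221, precision 0.194, recall 0.306) to three decimals.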

In [2]:
import os
import pandas as pd
import wandb
import torch
import logging
from datetime import datetime
from huggingface_hub import login

from model import initialize_model, setup_peft
from data_utils import (
    prepare_data,
    get_predictions_batch,
    prepare_data_for_model,
    ensure_model_on_device,
)
from trainer import train_model
from debug_utils import debug_misclassifications, get_narrative_key

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

In [3]:
def train_single_dataset(df, model_name, output_dir, current_date, dataset_name):
    """
    Train model on a single dataset

    Args:
        df: DataFrame containing the dataset
        model_name: Name of the model to use
        output_dir: Directory to save outputs
        current_date: Current date string for naming
        dataset_name: Name of the dataset for logging

    Returns:
        tuple: (results, model, tokenizer, label_mapping, df)
    """
    try:
        # Create dataset-specific output directory
        dataset_output_dir = os.path.join(output_dir, f"{dataset_name}_{current_date}")
        os.makedirs(dataset_output_dir, exist_ok=True)

        print(f"\nTraining on {dataset_name} dataset...")

        # Initialize wandb run for this dataset
        wandb.init(
            project="llama-classification",
            name=f"llama-classification-{dataset_name}-{current_date}",
            reinit=True,
        )

        # Prepare data
        train_dataset, val_dataset, tokenizer, label_mapping, num_labels = prepare_data(
            df, model_name, dataset_output_dir
        )

        # Initialize and setup model
        print("\nInitializing model...")
        model = initialize_model(model_name, num_labels)
        model = setup_peft(model)

        # Move model to GPU if available
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = model.to(device)

        # Create data collator that handles device placement
        from transformers import DataCollatorWithPadding

        data_collator = DataCollatorWithPadding(
            tokenizer=tokenizer, padding=True, max_length=512, return_tensors="pt"
        )

        def collate_fn(batch):
            # Collate the batch using the data collator
            batch = data_collator(batch)
            # Move to device
            return {
                k: v.to(device) if isinstance(v, torch.Tensor) else v
                for k, v in batch.items()
            }

        # Train model with custom collate_fn
        trainer = train_model(
            model,
            train_dataset,
            val_dataset,
            dataset_output_dir,
            current_date,
            dataset_name,
            collate_fn=collate_fn,  # Pass the custom collate function
        )

        # Evaluate model
        print("\nEvaluating model...")
        results = trainer.evaluate()

        print(f"\nEvaluation results for {dataset_name} dataset:")
        for metric, value in results.items():
            if isinstance(value, float):
                print(f"{metric}: {value:.4f}")
            else:
                print(f"{metric}: {value}")

        # Save model and tokenizer
        print(f"\nSaving {dataset_name} model...")
        trainer.save_model(dataset_output_dir)
        tokenizer.save_pretrained(dataset_output_dir)

        # End wandb run
        wandb.finish()

        # Return df along with other outputs
        return results, model, tokenizer, label_mapping, df

    except Exception as e:
        print(f"Error in training {dataset_name} dataset: {str(e)}")
        wandb.finish()
        raise
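
The collate step in `train_single_dataset` relies on `DataCollatorWithPadding` with `padding=True`, which pads every batch to the length of its longest sequence. The following plain-Python sketch illustrates that idea only; it is not the library's implementation. The function name `pad_batch`, the pad token id of 0, and right-side padding are assumptions made for illustration.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad every `input_ids` list to the longest sequence in the batch.

    Assumptions: pad_token_id=0 and right-side padding; a real tokenizer
    defines its own pad token and padding side.
    """
    longest = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        ids = list(f["input_ids"])
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(batch)
```

Padding per batch rather than to a fixed length keeps short batches small, which matters when only a few documents approach the 512-token limit.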

In [4]:
def setup_training():
    try:
        # Login to Hugging Face. The token is read from the HF_TOKEN
        # environment variable instead of being hardcoded in the notebook.
        login(token=os.environ["HF_TOKEN"])

        # Check CUDA availability
        print(f"CUDA Available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"GPU Device: {torch.cuda.get_device_name(0)}")
            print(
                f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB"
            )

        # Set paths
        def find_repo_root():
            current = os.getcwd()
            while current != os.path.dirname(current):
                if os.path.exists(os.path.join(current, ".git")):
                    return current
                current = os.path.dirname(current)
            raise Exception(
                "No .git directory found - repository root could not be determined"
            )

        # Set paths using repository root
        repo_root = find_repo_root()
        code_path = os.path.join(repo_root, "code")
        current_date = datetime.now().strftime("%Y%m%d")
        output_dir = os.path.join(code_path, "models", f"llama_{current_date}")
        os.makedirs(output_dir, exist_ok=True)

        # Load data from code directory
        print("\nLoading datasets...")
        print(f"Repository root: {repo_root}")
        print(f"Looking for data files in: {code_path}")
        input_file_full = os.path.join(code_path, "df_normalized.csv")
        input_file_ua = os.path.join(code_path, "df_normalized_ua.csv")
        input_file_cc = os.path.join(code_path, "df_normalized_cc.csv")

        df_normalized = pd.read_csv(input_file_full)
        df_normalized_ua = pd.read_csv(input_file_ua)
        df_normalized_cc = pd.read_csv(input_file_cc)

        # Model configuration
        model_name = "openlm-research/open_llama_3b"

        return {
            "output_dir": output_dir,
            "current_date": current_date,
            "model_name": model_name,
            "df_normalized": df_normalized,
            "df_normalized_ua": df_normalized_ua,
            "df_normalized_cc": df_normalized_cc,
        }

    except Exception as e:
        print(f"Error in setup: {str(e)}")
        import traceback

        traceback.print_exc()
        wandb.finish()
        raise


def train_ua():
    try:
        # Get setup configuration
        config = setup_training()

        print("\nStarting UA dataset training...")
        results, model, tokenizer, label_mapping, df = train_single_dataset(
            config["df_normalized_ua"],
            config["model_name"],
            config["output_dir"],
            config["current_date"],
            "ua",
        )

        return results, model, tokenizer, label_mapping, config["df_normalized_ua"]

    except Exception as e:
        print(f"Error in UA training: {str(e)}")
        import traceback

        traceback.print_exc()
        wandb.finish()
        raise


def train_cc():
    try:
        # Get setup configuration
        config = setup_training()

        print("\nStarting CC dataset training...")
        cc_results, cc_model, cc_tokenizer, cc_label_mapping, df = train_single_dataset(
            config["df_normalized_cc"],
            config["model_name"],
            config["output_dir"],
            config["current_date"],
            "cc",
        )

        return (
            cc_results,
            cc_model,
            cc_tokenizer,
            cc_label_mapping,
            config["df_normalized_cc"],
        )

    except Exception as e:
        print(f"Error in CC training: {str(e)}")
        import traceback

        traceback.print_exc()
        wandb.finish()
        raise


def train_full():
    try:
        # Get setup configuration
        config = setup_training()

        print("\nStarting full dataset training...")
        results, model, tokenizer, label_mapping, df_normalized = train_single_dataset(
            config["df_normalized"],
            config["model_name"],
            config["output_dir"],
            config["current_date"],
            "full",
        )

        return results, model, tokenizer, label_mapping, config["df_normalized"]

    except Exception as e:
        print(f"Error in full dataset training: {str(e)}")
        import traceback

        traceback.print_exc()
        wandb.finish()
        raise
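
The repository-root lookup inside `setup_training` walks up from the working directory until it finds a `.git` entry. A minimal alternative sketch with `pathlib` is shown below. It takes an optional `start` argument (an addition for testability, not present in the original helper) and, unlike the original loop, also checks the filesystem root itself.

```python
from pathlib import Path


def find_repo_root(start=None):
    """Return the nearest directory at or above `start` that contains `.git`."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(
        "No .git directory found - repository root could not be determined"
    )
```

Called without arguments it behaves like the original helper and starts from the current working directory.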

In [5]:
def debug_model(model, dataset, tokenizer, label_mapping, dataset_type="Training"):
    """Run debug analysis on model predictions"""
    try:
        # Set up model and device
        model, device = ensure_model_on_device(model)
        print(f"\nAnalyzing {dataset_type} dataset...")

        # Prepare texts
        texts = (
            dataset["tokens_normalized"]
            .apply(lambda x: " ".join(x) if isinstance(x, list) else x)
            .tolist()
        )

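        # Parse string-encoded label pairs with ast.literal_eval rather than
        # eval, so that no arbitrary code from the CSV is executed
        import ast

        true_labels = torch.tensor(
            [
                label_mapping[
                    get_narrative_key(
                        ast.literal_eval(n)[0] if isinstance(n, str) else n[0]
                    )
                ]
                for n in dataset["narrative_subnarrative_pairs"]
            ]
        ).to(device)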

        print(f"Total samples: {len(texts)}")

        # Get predictions in batches
        batch_size = 8
        predictions = []
        confidences = []

        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i : i + batch_size]
            batch_preds, batch_confs = get_predictions_batch(
                model, batch_texts, tokenizer, device
            )
            predictions.append(batch_preds)
            confidences.append(batch_confs)

        # Concatenate and move to CPU
        predictions = torch.cat(predictions).cpu().numpy()
        confidences = torch.cat(confidences).cpu().numpy()
        true_labels = true_labels.cpu().numpy()

        # Track misclassifications
        misclassifications = []
        for idx, (pred, true, conf) in enumerate(
            zip(predictions, true_labels, confidences)
        ):
            if pred != true:
                misclassifications.append(
                    {
                        "text": texts[idx][:200],
                        "predicted": pred,
                        "actual": true,
                        "confidence": conf,
                        "dataset_type": dataset_type,
                    }
                )

        # Create DataFrame and display results
        misclass_df = pd.DataFrame(misclassifications)
        print(f"\nTotal misclassifications: {len(misclass_df)}")
        print(f"Accuracy: {1 - len(misclass_df)/len(texts):.4f}")

        if len(misclass_df) > 0:
            print("\nMisclassification distribution:")
            print(
                misclass_df.groupby(["actual", "predicted"])
                .size()
                .unstack(fill_value=0)
            )

            print("\nSample misclassifications:")
            for i, row in misclass_df.head().iterrows():
                print(f"\nExample {i+1}:")
                print(f"Text: {row['text']}")
                print(f"Predicted: {row['predicted']}, Actual: {row['actual']}")
                print(f"Confidence: {row['confidence']:.4f}")

        return misclass_df

    except Exception as e:
        print(f"Error in debug analysis: {str(e)}")
        import traceback

        traceback.print_exc()
        raise

In [6]:
config = setup_training()

CUDA Available: True
GPU Device: NVIDIA L40S
GPU Memory: 47.81 GB

Loading datasets...
Repository root: /teamspace/studios/this_studio/nlp_Backpropagandists_2024
Looking for data files in: /teamspace/studios/this_studio/nlp_Backpropagandists_2024/code


In [7]:
# Train UA dataset
ua_results, ua_model, ua_tokenizer, ua_label_mapping, df_normalized_ua = train_ua()

CUDA Available: True
GPU Device: NVIDIA L40S
GPU Memory: 47.81 GB

Loading datasets...
Repository root: /teamspace/studios/this_studio/nlp_Backpropagandists_2024
Looking for data files in: /teamspace/studios/this_studio/nlp_Backpropagandists_2024/code

Starting UA dataset training...

Training on ua dataset...


[34m[1mwandb[0m: Currently logged in as: [33mjonaskruse[0m ([33mbackpropagandists[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message



Creating narrative mapping...
Number of unique narratives: 12

Sample narrative mappings:
0: Amplifying war-related fears
1: Blaming the war on others rather than the invader
2: Discrediting Ukraine
3: Discrediting the West, Diplomacy
4: Distrust towards Media

Training set size: 940
Validation set size: 235

Initializing tokenizer...

Tokenizing texts...

Initializing model...


2025-01-25 13:57:42,958 - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at openlm-research/open_llama_3b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 10,688,000 || all params: 3,334,800,000 || trainable%: 0.3205

Starting classification head pre-training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Confusion Matrix
1,2.1589,2.107835,0.238298,0.140942,0.114637,0.238298,"{'Class_0': [[216, 0], [19, 0]], 'Class_1': [[219, 0], [16, 0]], 'Class_2': [[67, 127], [4, 37]], 'Class_3': [[206, 0], [29, 0]], 'Class_4': [[233, 0], [2, 0]], 'Class_5': [[233, 0], [2, 0]], 'Class_6': [[226, 0], [9, 0]], 'Class_7': [[142, 39], [35, 19]], 'Class_8': [[233, 0], [2, 0]], 'Class_9': [[186, 13], [36, 0]], 'Class_10': [[221, 0], [14, 0]], 'Class_11': [[224, 0], [11, 0]]}"



Metrics for Class 0:
Confusion Matrix:
[[216   0]
 [ 19   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 1:
Confusion Matrix:
[[219   0]
 [ 16   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[ 67 127]
 [  4  37]]
Precision: 0.2256
Recall: 0.9024
F1 Score: 0.3610

Metrics for Class 3:
Confusion Matrix:
[[206   0]
 [ 29   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[226   0]
 [  9   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[142  39]
 [ 35  19]]
Precision: 0.3276
Recall: 0.3519
F1 Score: 0.3393

Metrics for Class 8:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))



Starting full model fine-tuning...




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Confusion Matrix
1,2.1668,2.062745,0.306383,0.221379,0.194635,0.306383,"{'Class_0': [[212, 4], [17, 2]], 'Class_1': [[219, 0], [16, 0]], 'Class_2': [[119, 75], [12, 29]], 'Class_3': [[189, 17], [25, 4]], 'Class_4': [[233, 0], [2, 0]], 'Class_5': [[233, 0], [2, 0]], 'Class_6': [[226, 0], [9, 0]], 'Class_7': [[129, 52], [17, 37]], 'Class_8': [[233, 0], [2, 0]], 'Class_9': [[184, 15], [36, 0]], 'Class_10': [[221, 0], [14, 0]], 'Class_11': [[224, 0], [11, 0]]}"
2,0.6192,2.702352,0.225532,0.219968,0.228969,0.225532,"{'Class_0': [[191, 25], [14, 5]], 'Class_1': [[216, 3], [14, 2]], 'Class_2': [[162, 32], [24, 17]], 'Class_3': [[171, 35], [23, 6]], 'Class_4': [[230, 3], [2, 0]], 'Class_5': [[232, 1], [2, 0]], 'Class_6': [[225, 1], [9, 0]], 'Class_7': [[153, 28], [34, 20]], 'Class_8': [[230, 3], [2, 0]], 'Class_9': [[169, 30], [33, 3]], 'Class_10': [[207, 14], [14, 0]], 'Class_11': [[217, 7], [11, 0]]}"



Metrics for Class 0:
Confusion Matrix:
[[212   4]
 [ 17   2]]
Precision: 0.3333
Recall: 0.1053
F1 Score: 0.1600

Metrics for Class 1:
Confusion Matrix:
[[219   0]
 [ 16   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[119  75]
 [ 12  29]]
Precision: 0.2788
Recall: 0.7073
F1 Score: 0.4000

Metrics for Class 3:
Confusion Matrix:
[[189  17]
 [ 25   4]]
Precision: 0.1905
Recall: 0.1379
F1 Score: 0.1600

Metrics for Class 4:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[226   0]
 [  9   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[129  52]
 [ 17  37]]
Precision: 0.4157
Recall: 0.6852
F1 Score: 0.5175

Metrics for Class 8:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))



Metrics for Class 0:
Confusion Matrix:
[[203  13]
 [ 15   4]]
Precision: 0.2353
Recall: 0.2105
F1 Score: 0.2222

Metrics for Class 1:
Confusion Matrix:
[[209  10]
 [ 15   1]]
Precision: 0.0909
Recall: 0.0625
F1 Score: 0.0741

Metrics for Class 2:
Confusion Matrix:
[[148  46]
 [ 21  20]]
Precision: 0.3030
Recall: 0.4878
F1 Score: 0.3738

Metrics for Class 3:
Confusion Matrix:
[[169  37]
 [ 22   7]]
Precision: 0.1591
Recall: 0.2414
F1 Score: 0.1918

Metrics for Class 4:
Confusion Matrix:
[[231   2]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[226   0]
 [  9   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[141  40]
 [ 29  25]]
Precision: 0.3846
Recall: 0.4630
F1 Score: 0.4202

Metrics for Class 8:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))



Metrics for Class 0:
Confusion Matrix:
[[191  25]
 [ 14   5]]
Precision: 0.1667
Recall: 0.2632
F1 Score: 0.2041

Metrics for Class 1:
Confusion Matrix:
[[216   3]
 [ 14   2]]
Precision: 0.4000
Recall: 0.1250
F1 Score: 0.1905

Metrics for Class 2:
Confusion Matrix:
[[162  32]
 [ 24  17]]
Precision: 0.3469
Recall: 0.4146
F1 Score: 0.3778

Metrics for Class 3:
Confusion Matrix:
[[171  35]
 [ 23   6]]
Precision: 0.1463
Recall: 0.2069
F1 Score: 0.1714

Metrics for Class 4:
Confusion Matrix:
[[230   3]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[232   1]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[225   1]
 [  9   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[153  28]
 [ 34  20]]
Precision: 0.4167
Recall: 0.3704
F1 Score: 0.3922

Metrics for Class 8:
Confusion Matrix:
[[230   3]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000





Metrics for Class 0:
Confusion Matrix:
[[212   4]
 [ 17   2]]
Precision: 0.3333
Recall: 0.1053
F1 Score: 0.1600

Metrics for Class 1:
Confusion Matrix:
[[219   0]
 [ 16   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[119  75]
 [ 12  29]]
Precision: 0.2788
Recall: 0.7073
F1 Score: 0.4000

Metrics for Class 3:
Confusion Matrix:
[[189  17]
 [ 25   4]]
Precision: 0.1905
Recall: 0.1379
F1 Score: 0.1600

Metrics for Class 4:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[226   0]
 [  9   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[129  52]
 [ 17  37]]
Precision: 0.4157
Recall: 0.6852
F1 Score: 0.5175

Metrics for Class 8:
Confusion Matrix:
[[233   0]
 [  2   0]]
Precision: 0.0000
Recall: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))


Run history:
eval/accuracy,▂█▂▁█
eval/f1,▁█▇██
eval/loss,▁▁▃█▁
eval/precision,▁▆▅█▆
eval/recall,▂█▂▁█
eval/runtime,▁▆▆█▁
eval/samples_per_second,█▃▃▁█
eval/steps_per_second,▁██▇█
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▁▁▂▂▂▃▃▃▃▃▄▄▄▄▅▅▆▆▆▆▆▇▇▇█████

Run summary:
eval/accuracy,0.30638
eval/f1,0.22138
eval/loss,2.06274
eval/precision,0.19464
eval/recall,0.30638
eval/runtime,30.6106
eval/samples_per_second,7.677
eval/steps_per_second,1.927
total_flos,2.780381184e+16
train/epoch,2.97872


In [8]:
ua_debug_df = debug_model(
    ua_model, df_normalized_ua, ua_tokenizer, ua_label_mapping, "UA"
)


Analyzing UA dataset...
Total samples: 1175



Total misclassifications: 714
Accuracy: 0.3923

Misclassification distribution:
predicted  0  1   2   3   7   9
actual                         
0          0  0  41   8  48   5
1          2  0  28  11  22   3
2          1  0   0   5  26   7
3          4  0  73   0  40   4
4          0  0   4   1  16   2
5          0  0   1   1   5   2
6          0  0  18   1  14   0
7          1  0  29   6   0  17
8          0  0   1   0  10   1
9          1  0  82   4  42   0
10         1  1  33   6  16   6
11         0  0  24   3  33   4

Sample misclassifications:

Example 1:
Text: ['putin', 'mass', 'hivpositive', 'prisoner', 'choose', 'go', 'meatgrinder', 'frontline', 'rather', 'rot', 'jail', 'med', 'putin', 'mass', 'hivpositive', 'prisoner', 'choose', 'go', 'meatgrinder', 'fro
Predicted: 7, Actual: 11
Confidence: 0.4568

Example 2:
Text: ['north', 'korea', 'kim', 'jong', 'un', 'putin', 'xi', 'meet', 'beijing', 'october', 'say', 'kremlin', 'russian', 'president', 'vladimir', 'putin', 'meet', 'china
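
The sample texts above begin with `['putin', 'mass', ...`, which suggests that `tokens_normalized` reaches `debug_model` as a stringified list (for example after a CSV round trip), so the `isinstance(x, list)` branch would not join the tokens. The sketch below shows one way to normalise both cases, assuming that interpretation is correct. The helper name `tokens_to_text` is hypothetical and not part of the project code.

```python
import ast


def tokens_to_text(value):
    """Join a token list into one string; accept real lists or their repr."""
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # ordinary text, leave it unchanged
        if isinstance(parsed, list):
            value = parsed
    if isinstance(value, list):
        return " ".join(str(token) for token in value)
    return str(value)


print(tokens_to_text("['putin', 'mass', 'hivpositive']"))
print(tokens_to_text(["north", "korea"]))
```

Applying such a helper before tokenization would feed the model space-separated words rather than Python list syntax with brackets and quotes.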

In [9]:
# Train CC dataset
cc_results, cc_model, cc_tokenizer, cc_label_mapping, df_normalized_cc = train_cc()

CUDA Available: True
GPU Device: NVIDIA L40S
GPU Memory: 47.81 GB

Loading datasets...
Repository root: /teamspace/studios/this_studio/nlp_Backpropagandists_2024
Looking for data files in: /teamspace/studios/this_studio/nlp_Backpropagandists_2024/code



Starting CC dataset training...

Training on cc dataset...



Creating narrative mapping...
Number of unique narratives: 11

Sample narrative mappings:
0: Amplifying Climate Fears
1: Climate change is beneficial
2: Controversy about green technologies
3: Criticism of climate movement
4: Criticism of climate policies

Training set size: 415
Validation set size: 104

Initializing tokenizer...

Tokenizing texts...


2025-01-25 14:28:41,167 - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).



Initializing model...


Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at openlm-research/open_llama_3b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 10,684,800 || all params: 3,334,793,600 || trainable%: 0.3204

Starting classification head pre-training...




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Confusion Matrix
1,1.8149,1.940315,0.451923,0.356348,0.316336,0.451923,"{'Class_0': [[49, 15], [12, 28]], 'Class_1': [[103, 0], [1, 0]], 'Class_3': [[96, 0], [8, 0]], 'Class_4': [[100, 0], [4, 0]], 'Class_5': [[92, 0], [12, 0]], 'Class_6': [[97, 0], [7, 0]], 'Class_7': [[103, 0], [1, 0]], 'Class_8': [[96, 0], [8, 0]], 'Class_9': [[40, 42], [3, 19]], 'Class_10': [[103, 0], [1, 0]]}"



Metrics for Class 0:
Confusion Matrix:
[[49 15]
 [12 28]]
Precision: 0.6512
Recall: 0.7000
F1 Score: 0.6747

Metrics for Class 1:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 3:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[100   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[92  0]
 [12  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[97  0]
 [ 7  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[40 42]
 [ 3 19]]
Precision: 0.3115
Recall: 0.8636
F1 Score: 0.4578

Metric

  _warn_prf(average, modifier, msg_start, len(result))



Starting full model fine-tuning...




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Confusion Matrix
1,1.6258,1.746097,0.528846,0.400064,0.323228,0.528846,"{'Class_0': [[41, 23], [2, 38]], 'Class_1': [[103, 0], [1, 0]], 'Class_3': [[96, 0], [8, 0]], 'Class_4': [[100, 0], [4, 0]], 'Class_5': [[92, 0], [12, 0]], 'Class_6': [[97, 0], [7, 0]], 'Class_7': [[103, 0], [1, 0]], 'Class_8': [[96, 0], [8, 0]], 'Class_9': [[56, 26], [5, 17]], 'Class_10': [[103, 0], [1, 0]]}"
2,1.6985,1.625987,0.519231,0.392975,0.316606,0.519231,"{'Class_0': [[39, 25], [2, 38]], 'Class_1': [[103, 0], [1, 0]], 'Class_3': [[96, 0], [8, 0]], 'Class_4': [[99, 1], [4, 0]], 'Class_5': [[92, 0], [12, 0]], 'Class_6': [[97, 0], [7, 0]], 'Class_7': [[103, 0], [1, 0]], 'Class_8': [[96, 0], [8, 0]], 'Class_9': [[58, 24], [6, 16]], 'Class_10': [[103, 0], [1, 0]]}"
3,1.5577,1.60472,0.528846,0.398956,0.321355,0.528846,"{'Class_0': [[40, 24], [2, 38]], 'Class_1': [[103, 0], [1, 0]], 'Class_3': [[96, 0], [8, 0]], 'Class_4': [[100, 0], [4, 0]], 'Class_5': [[92, 0], [12, 0]], 'Class_6': [[97, 0], [7, 0]], 'Class_7': [[103, 0], [1, 0]], 'Class_8': [[96, 0], [8, 0]], 'Class_9': [[57, 25], [5, 17]], 'Class_10': [[103, 0], [1, 0]]}"



Metrics for Class 0:
Confusion Matrix:
[[41 23]
 [ 2 38]]
Precision: 0.6230
Recall: 0.9500
F1 Score: 0.7525

Metrics for Class 1:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 3:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[100   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[92  0]
 [12  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[97  0]
 [ 7  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[56 26]
 [ 5 17]]
Precision: 0.3953
Recall: 0.7727
F1 Score: 0.5231
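
The per-class figures above follow directly from each 2x2 matrix. A minimal sketch, assuming the layout `[[TN, FP], [FN, TP]]` and assuming that undefined ratios are reported as 0.0 (scikit-learn's default behaviour, which it accompanies with a warning); `one_vs_rest_scores` is a hypothetical name, not a function from this notebook.

```python
def one_vs_rest_scores(matrix):
    """Precision, recall and F1 for one class from a [[TN, FP], [FN, TP]] matrix."""
    (_tn, fp), (fn, tp) = matrix
    # A class that is never predicted has tp + fp == 0, so precision is undefined;
    # it is reported as 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


examples = {
    "Class 0": [[41, 23], [2, 38]],   # predicted often, 23 false positives
    "Class 1": [[103, 0], [1, 0]],    # never predicted
}
for name, matrix in examples.items():
    p, r, f = one_vs_rest_scores(matrix)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Every class whose right-hand column is `[0, 0]` was never predicted on the validation set, which is why eight of the ten classes score 0.0000 across the board.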




Metrics for Class 0:
Confusion Matrix:
[[39 25]
 [ 2 38]]
Precision: 0.6032
Recall: 0.9500
F1 Score: 0.7379

Metrics for Class 1:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 3:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[99  1]
 [ 4  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[92  0]
 [12  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[97  0]
 [ 7  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[58 24]
 [ 6 16]]
Precision: 0.4000
Recall: 0.7273
F1 Score: 0.5161




Metrics for Class 0:
Confusion Matrix:
[[40 24]
 [ 2 38]]
Precision: 0.6129
Recall: 0.9500
F1 Score: 0.7451

Metrics for Class 1:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 3:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[100   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[92  0]
 [12  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[97  0]
 [ 7  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[57 25]
 [ 5 17]]
Precision: 0.4048
Recall: 0.7727
F1 Score: 0.5312




Evaluating model...





Metrics for Class 0:
Confusion Matrix:
[[40 24]
 [ 2 38]]
Precision: 0.6129
Recall: 0.9500
F1 Score: 0.7451

Metrics for Class 1:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 3:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[100   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[92  0]
 [12  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[97  0]
 [ 7  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[103   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[96  0]
 [ 8  0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[57 25]
 [ 5 17]]
Precision: 0.4048
Recall: 0.7727
F1 Score: 0.5312



Metric,History
eval/accuracy,▁█▇██
eval/f1,▁█▇██
eval/loss,█▄▁▁▁
eval/precision,▁█▁▆▆
eval/recall,▁█▇██
eval/runtime,▄▁▁▄█
eval/samples_per_second,▅██▅▁
eval/steps_per_second,▁████
train/epoch,▁▁▂▂▃▃▃▁▁▂▂▃▃▃▄▄▅▅▆▆▆▇▇████
train/global_step,▁▁▂▂▃▃▃▁▁▂▂▃▃▃▄▄▅▅▆▆▆▇▇████

Metric,Value
eval/accuracy,0.52885
eval/f1,0.39896
eval/loss,1.60472
eval/precision,0.32135
eval/recall,0.52885
eval/runtime,13.9213
eval/samples_per_second,7.471
eval/steps_per_second,1.868
total_flos,1.2362741858304e+16
train/epoch,3.0
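
The throughput entries in this summary can be cross-checked against the validation set size. The sketch below assumes 104 validation samples (the sum of the entries in any one of the per-class matrices above); the evaluation batch size it derives is an inference from the logged numbers, not a setting shown in the log.

```python
import math

val_samples = 104          # assumed: TN + FP + FN + TP of any one-vs-rest matrix
eval_runtime = 13.9213     # eval/runtime, seconds
steps_per_second = 1.868   # eval/steps_per_second

samples_per_second = val_samples / eval_runtime
eval_steps = round(steps_per_second * eval_runtime)
batch_size = math.ceil(val_samples / eval_steps)

print(f"samples/s: {samples_per_second:.3f}")
print(f"eval steps: {eval_steps}, implied batch size: {batch_size}")
```

The computed samples-per-second value agrees with the logged `eval/samples_per_second`, and the step count suggests an evaluation batch size of 4.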


In [10]:
cc_debug_df = debug_model(
    cc_model, df_normalized_cc, cc_tokenizer, cc_label_mapping, "CC"
)


Analyzing CC dataset...
Total samples: 519

Total misclassifications: 238
Accuracy: 0.5414

Misclassification distribution:
predicted   0  5   9
actual              
0           0  0  10
1           2  0   0
2           4  0   2
3          11  0  18
4          21  0  13
5          42  0  29
6          20  1   7
7           0  0   2
8           4  1  13
9          30  0   0
10          1  0   7
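
As a consistency check, the table above can be summed by row and by column. This sketch copies the counts by hand and is not the notebook's `debug_model` implementation; the variable names are illustrative.

```python
# Rows: actual class -> misclassification counts for predicted classes (0, 5, 9).
predicted_labels = (0, 5, 9)
errors = {
    0: (0, 0, 10),
    1: (2, 0, 0),
    2: (4, 0, 2),
    3: (11, 0, 18),
    4: (21, 0, 13),
    5: (42, 0, 29),
    6: (20, 1, 7),
    7: (0, 0, 2),
    8: (4, 1, 13),
    9: (30, 0, 0),
    10: (1, 0, 7),
}
total_samples = 519

total_errors = sum(sum(row) for row in errors.values())
errors_by_prediction = [sum(col) for col in zip(*errors.values())]
accuracy = 1 - total_errors / total_samples

print(f"total misclassifications: {total_errors}")
print(f"accuracy: {accuracy:.4f}")
for label, count in zip(predicted_labels, errors_by_prediction):
    print(f"wrongly predicted as {label}: {count}")
```

The totals agree with the reported 238 misclassifications and 0.5414 accuracy. The column sums show that almost every error is a wrong prediction of class 0 or class 9, with class 5 chosen only twice.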

Sample misclassifications:

Example 1:
Text: ['bill', 'gate', 'say', 'solution', 'climate', 'change', 'ok', 'four', 'private', 'jet', 'bill', 'gate', 'right', 'fly', 'around', 'world', 'private', 'jet', 'normal', 'person', 'force', 'live', 'minu
Predicted: 9, Actual: 3
Confidence: 0.3822

Example 2:
Text: ['new', 'paper', 'make', 'increase', 'tropical', 'cyclone', 'frequency', 'claim', 'contradicted', 'two', 'year', 'ago', 'noaa', 'noaa', 'july', 'headline', 'research', 'global', 'warming', 'contribute
Predicted: 9, Actual: 10
Confidence: 0.3197

Example 3:
Text: ['climate', 'crazy', 'fail', 'a

In [11]:
# Train full dataset

results, model, tokenizer, label_mapping, df_normalized = train_full()

CUDA Available: True
GPU Device: NVIDIA L40S
GPU Memory: 47.81 GB

Loading datasets...
Repository root: /teamspace/studios/this_studio/nlp_Backpropagandists_2024
Looking for data files in: /teamspace/studios/this_studio/nlp_Backpropagandists_2024/code

Starting full dataset training...

Training on full dataset...



Creating narrative mapping...
Number of unique narratives: 21

Sample narrative mappings:
0: Amplifying Climate Fears
1: Amplifying war-related fears
2: Blaming the war on others rather than the invader
3: Climate change is beneficial
4: Controversy about green technologies

Training set size: 1355
Validation set size: 339

Initializing tokenizer...

Tokenizing texts...

Initializing model...


2025-01-25 14:42:46,330 - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at openlm-research/open_llama_3b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 10,716,800 || all params: 3,334,857,600 || trainable%: 0.3214
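
The percentage on this line is the trainable parameter count divided by the total parameter count. The line format resembles what PEFT's `print_trainable_parameters()` prints, which would suggest an adapter-style setup, but that is an assumption since the training code is not shown here.

```python
trainable_params = 10_716_800
all_params = 3_334_857_600

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")
```

In other words, roughly one parameter in 311 is updated during training, and the rest of the 3B-parameter model stays frozen.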

Starting classification head pre-training...




Epoch,Training Loss,Validation Loss



Metrics for Class 0:
Confusion Matrix:
[[  0 303]
 [  0  36]]
Precision: 0.1062
Recall: 1.0000
F1 Score: 0.1920

Metrics for Class 1:
Confusion Matrix:
[[316   0]
 [ 23   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[322   0]
 [ 17   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[338   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[335   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[336   0]
 [  3   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[318   0]
 [ 21   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[289   0]
 [ 50   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[303   0]
 [ 36   0]]
Precision: 0.0000
Recall: 0.0000




Starting full model fine-tuning...




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Confusion Matrix
1,2.5249,2.52801,0.159292,0.085994,0.078832,0.159292,"{'Class_0': [[51, 252], [3, 33]], 'Class_1': [[316, 0], [23, 0]], 'Class_2': [[322, 0], [17, 0]], 'Class_4': [[338, 0], [1, 0]], 'Class_5': [[335, 0], [4, 0]], 'Class_6': [[336, 0], [3, 0]], 'Class_7': [[318, 0], [21, 0]], 'Class_8': [[289, 0], [50, 0]], 'Class_9': [[303, 0], [36, 0]], 'Class_10': [[334, 0], [5, 0]], 'Class_11': [[335, 0], [4, 0]], 'Class_13': [[332, 0], [7, 0]], 'Class_14': [[333, 0], [6, 0]], 'Class_15': [[248, 33], [37, 21]], 'Class_16': [[336, 0], [3, 0]], 'Class_17': [[304, 0], [35, 0]], 'Class_18': [[338, 0], [1, 0]], 'Class_19': [[325, 0], [14, 0]], 'Class_20': [[324, 0], [15, 0]]}"
2,2.3764,2.434927,0.212389,0.098226,0.064621,0.212389,"{'Class_0': [[153, 150], [13, 23]], 'Class_1': [[316, 0], [23, 0]], 'Class_2': [[322, 0], [17, 0]], 'Class_4': [[338, 0], [1, 0]], 'Class_5': [[335, 0], [4, 0]], 'Class_6': [[336, 0], [3, 0]], 'Class_7': [[318, 0], [21, 0]], 'Class_8': [[289, 0], [50, 0]], 'Class_9': [[303, 0], [36, 0]], 'Class_10': [[334, 0], [5, 0]], 'Class_11': [[335, 0], [4, 0]], 'Class_13': [[332, 0], [7, 0]], 'Class_14': [[333, 0], [6, 0]], 'Class_15': [[164, 117], [9, 49]], 'Class_16': [[336, 0], [3, 0]], 'Class_17': [[304, 0], [35, 0]], 'Class_18': [[338, 0], [1, 0]], 'Class_19': [[325, 0], [14, 0]], 'Class_20': [[324, 0], [15, 0]]}"
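
One way to put these accuracies in context is to compare them with a baseline that always predicts the largest validation class. The sketch below recovers class supports as FN + TP from the epoch 1 matrices, assuming the `[[TN, FP], [FN, TP]]` layout; the baseline itself is an illustration and was not part of the original run.

```python
# Validation support per class (FN + TP), read off the epoch 1 row above.
supports = {
    "Class_0": 36, "Class_1": 23, "Class_2": 17, "Class_4": 1, "Class_5": 4,
    "Class_6": 3, "Class_7": 21, "Class_8": 50, "Class_9": 36, "Class_10": 5,
    "Class_11": 4, "Class_13": 7, "Class_14": 6, "Class_15": 58, "Class_16": 3,
    "Class_17": 35, "Class_18": 1, "Class_19": 14, "Class_20": 15,
}

total = sum(supports.values())
majority_class = max(supports, key=supports.get)
baseline_accuracy = supports[majority_class] / total

print(f"validation samples: {total}")
print(f"majority class: {majority_class} ({supports[majority_class]} samples)")
print(f"always-predict-majority accuracy: {baseline_accuracy:.4f}")
```

Against this baseline of about 0.171, the logged accuracies of 0.159 and 0.212 indicate that the model is performing close to a constant predictor, which is consistent with it only ever predicting classes 0 and 15.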



Metrics for Class 0:
Confusion Matrix:
[[ 51 252]
 [  3  33]]
Precision: 0.1158
Recall: 0.9167
F1 Score: 0.2056

Metrics for Class 1:
Confusion Matrix:
[[316   0]
 [ 23   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[322   0]
 [ 17   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[338   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[335   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[336   0]
 [  3   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[318   0]
 [ 21   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[289   0]
 [ 50   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[303   0]
 [ 36   0]]
Precision: 0.0000
Recall: 0.0000




Metrics for Class 0:
Confusion Matrix:
[[142 161]
 [  9  27]]
Precision: 0.1436
Recall: 0.7500
F1 Score: 0.2411

Metrics for Class 1:
Confusion Matrix:
[[316   0]
 [ 23   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[322   0]
 [ 17   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[338   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[335   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[336   0]
 [  3   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[318   0]
 [ 21   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[289   0]
 [ 50   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[303   0]
 [ 36   0]]
Precision: 0.0000
Recall: 0.0000




Metrics for Class 0:
Confusion Matrix:
[[153 150]
 [ 13  23]]
Precision: 0.1329
Recall: 0.6389
F1 Score: 0.2201

Metrics for Class 1:
Confusion Matrix:
[[316   0]
 [ 23   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[322   0]
 [ 17   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[338   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[335   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[336   0]
 [  3   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[318   0]
 [ 21   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[289   0]
 [ 50   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[303   0]
 [ 36   0]]
Precision: 0.0000
Recall: 0.0000




Evaluating model...





Metrics for Class 0:
Confusion Matrix:
[[153 150]
 [ 13  23]]
Precision: 0.1329
Recall: 0.6389
F1 Score: 0.2201

Metrics for Class 1:
Confusion Matrix:
[[316   0]
 [ 23   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 2:
Confusion Matrix:
[[322   0]
 [ 17   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 4:
Confusion Matrix:
[[338   0]
 [  1   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 5:
Confusion Matrix:
[[335   0]
 [  4   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 6:
Confusion Matrix:
[[336   0]
 [  3   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 7:
Confusion Matrix:
[[318   0]
 [ 21   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 8:
Confusion Matrix:
[[289   0]
 [ 50   0]]
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Metrics for Class 9:
Confusion Matrix:
[[303   0]
 [ 36   0]]
Precision: 0.0000
Recall: 0.0000



Metric,History
eval/accuracy,▁▄█▇▇
eval/f1,▁▆███
eval/loss,█▅▂▁▁
eval/precision,▁█▇▇▇
eval/recall,▁▄█▇▇
eval/runtime,█▇▁▇▅
eval/samples_per_second,▁▂█▂▄
eval/steps_per_second,▁████
train/epoch,▁▁▂▂▂▂▃▃▃▃▁▁▁▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇██
train/global_step,▁▂▂▂▂▂▃▃▃▃▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇████

Metric,Value
eval/accuracy,0.21239
eval/f1,0.09823
eval/loss,2.43493
eval/precision,0.06462
eval/recall,0.21239
eval/runtime,43.693
eval/samples_per_second,7.759
eval/steps_per_second,1.945
total_flos,4.01772240371712e+16
train/epoch,2.98525


In [12]:
debug_df = debug_model(model, df_normalized, tokenizer, label_mapping, "Full")


Analyzing Full dataset...
Total samples: 1694

Total misclassifications: 1274
Accuracy: 0.2479

Misclassification distribution:
predicted   0   15
actual            
0            0  66
1           61  68
2           41  33
3            2   0
4            3   3
5            6  23
6           19  15
7           36  36
8          154  52
9          116  58
10           6  17
11          15  13
12           0   2
13           5  23
14          18  17
15          53   0
16           1  12
17         100  65
18           0   8
19          36  27
20          26  38

Sample misclassifications:

Example 1:
Text: ['bill', 'gate', 'say', 'solution', 'climate', 'change', 'ok', 'four', 'private', 'jet', 'bill', 'gate', 'right', 'fly', 'around', 'world', 'private', 'jet', 'normal', 'person', 'force', 'live', 'minu
Predicted: 15, Actual: 5
Confidence: 0.5494

Example 2:
Text: ['new', 'paper', 'make', 'increase', 'tropical', 'cyclone', 'frequency', 'claim', 'contradicted', 'two', 'year', 'ago', 'noaa