# SetFit for Multilabel Text Classification

In this notebook, we'll learn how to do few-shot text classification with SetFit. The task is automated essay scoring: each essay receives a single score from 1 to 6.

## Setup

To be able to share your model with the community, there are a few more steps to follow.

First, you need an [access token](https://huggingface.co/docs/hub/security-tokens) from the Hugging Face Hub (sign up [here](https://huggingface.co/join) if you haven't already!). The following cell reads it from the `HUGGINGFACE_TOKEN` environment variable, which `load_dotenv()` loads from a local `.env` file, and logs in to Weights & Biases with `WANDB_API_TOKEN` in the same way:

In [1]:
from dotenv import load_dotenv
load_dotenv()
import os
import huggingface_hub
from datasets import load_dataset, Dataset
from transformers import EarlyStoppingCallback
import wandb
from setfit import SetFitModel, Trainer, TrainingArguments
from sentence_transformers.losses import CosineSimilarityLoss
from sklearn.metrics import cohen_kappa_score
from collections import Counter
import kagglehub
import shutil
import torch
import numpy as np
import re

# Clear cached memory and reset peak-memory statistics before loading the model (GPU only)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()


# Set environment variables
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# Load dataset from Hugging Face hub
huggingface_username = 'HSLU-AICOMP-LearningAgencyLab'
dataset_name = 'learning-agency-lab-automated-essay-scoring-2_V3'
# dataset_name = 'learning-agency-lab-automated-essay-scoring-2'
our_model_name = 'automated-essay-scoring-setfit'

wandb_project = 'HSLU-AICOMP-LearningAgencyLab'
wandb_entity = 'Leo1212'

max_words=4096

huggingface_hub.login(token=os.getenv('HUGGINGFACE_TOKEN'))
wandb.login(key=os.getenv('WANDB_API_TOKEN'))

os.environ["WANDB_PROJECT"] = wandb_project


# Load the dataset from Hugging Face
dataset = load_dataset(f"{huggingface_username}/{dataset_name}")

# Inspect dataset before preprocessing
print("\nDataset before preprocessing:")
print(dataset)

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/leonkrug/.cache/huggingface/token
Login successful


wandb: Currently logged in as: leo1212 (hslu_nlp). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /home/leonkrug/.netrc



Dataset before preprocessing:
DatasetDict({
    train: Dataset({
        features: ['essay_id', 'full_text', 'score', 'unique_mistakes', 'repeated_mistakes_count', 'max_repeated_mistake', 'word_count', 'flesch_reading_ease', 'flesch_kincaid_grade', 'sentence_count', 'average_sentence_length', 'pos_noun_count', 'pos_verb_count', 'pos_adj_count', 'pos_adv_count', 'grammar_error_count', 'syntactic_complexity', 'spelling_mistake_count', 'error_density', 'tfidf_keywords_vector', 'lda_topic_vector', 'keyword_coverage', 'pronoun_usage', 'unique_word_proportion', 'long_word_proportion', 'imagery_word_proportion', 'positive_sentiment_score', 'negative_sentiment_score', 'visual_word_proportion', 'unique_visual_word_proportion', 'average_imagery_score', 'discourse_marker_count', 'neural_coherence_score', 'longformer_sentence_embedding', 'longformer_coherence_score', 'type_token_ratio', 'lexical_diversity', 'vocabulary_maturity', 'in_persuade_corpus'],
        num_rows: 13845
    })
    eval: Dat

Only use the Kaggle Dataset to train

In [2]:
# from datasets import DatasetDict, concatenate_datasets

# # Step 1: Filter out examples where in_persuade_corpus == False
# filtered_dataset = dataset.filter(lambda example: example['in_persuade_corpus'] == False)

# # Step 2: Count examples per score in the filtered dataset
# score_counts = Counter(filtered_dataset['train']['score'])

# # Step 3: Supplement data where needed
# for score, count in score_counts.items():
#     if count < 130:
#         # Get additional examples where in_persuade_corpus == True for this score
#         additional_examples = dataset['train'].filter(
#             lambda example: example['in_persuade_corpus'] == True and example['score'] == score
#         )

#         # Determine how many examples we need to add
#         num_to_add = min(130 - count, len(additional_examples))

#         # Select only the required number of additional examples
#         additional_examples = additional_examples.select(range(num_to_add))

#         # Concatenate the additional examples to the filtered dataset
#         filtered_dataset['train'] = concatenate_datasets([filtered_dataset['train'], additional_examples])


# # Step 5: Create a new DatasetDict with the splits
# dataset = DatasetDict({
#     'train': filtered_dataset['train'],
#     'eval': dataset['eval'],
#     'test': dataset['test']
# })

# print(dataset)

# # Print the first example in the training set
# print(dataset['train'][0])

In [3]:
# Assuming you have already loaded the dataset with 'load_dataset'
score_counts = Counter(dataset['train']['score'])
print(score_counts)

Counter({3: 4985, 2: 3775, 4: 3156, 1: 1012, 5: 787, 6: 130})
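The counts above are heavily skewed towards the middle scores. As a rough sketch of what a per-score cap does to this distribution, the snippet below applies a cap of 130 (the value of `num_per_score` set later in this notebook). `capped_counts` is a hypothetical helper introduced only for illustration; it is not part of the notebook.

```python
from collections import Counter

# Score distribution printed by the cell above.
score_counts = Counter({3: 4985, 2: 3775, 4: 3156, 1: 1012, 5: 787, 6: 130})


def capped_counts(counts, cap):
    """Return how many examples per score remain after a per-score cap."""
    return {score: min(n, cap) for score, n in sorted(counts.items())}


balanced = capped_counts(score_counts, cap=130)
total = sum(balanced.values())

print(balanced)  # every score keeps at most 130 examples
print(total)     # size of the balanced training subset
```

With this cap every score contributes the same number of essays, because even the rarest score (6) has exactly 130 examples.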


In [4]:
def count_words(text):
    words = re.findall(r'\b\w+\b', text.lower())
    return len(words)

def truncate_text(text, max_words=384):
    words = text.split()
    return ' '.join(words[:max_words])

def subsample_dataset(dataset, split='train', score_column='score', num_per_score=15, max_words=384):
    reduced_dataset_list = []

    for score in range(1, 7):
        filtered = dataset[split].filter(lambda x: x[score_column] == score)
        filtered = filtered.map(lambda x: {'text': truncate_text(x['full_text'], max_words) if count_words(x['full_text']) > max_words else x['full_text']})

        if len(filtered) > 0:
            sample_count = min(len(filtered), num_per_score)
            reduced_dataset_list.append(filtered.shuffle(seed=42).select(range(sample_count)))

    reduced_dataset = Dataset.from_dict({k: sum([d[k] for d in reduced_dataset_list], []) for k in reduced_dataset_list[0].column_names})
    return reduced_dataset

def preprocess_datasets(num_per_score, max_words, fullEvalSet=False):

    eval_num_per_score = num_per_score
    if fullEvalSet:
        eval_num_per_score = 10000

    reduced_dataset_train = subsample_dataset(dataset, split='train', num_per_score=num_per_score, max_words=max_words)
    reduced_dataset_eval = subsample_dataset(dataset, split='eval', num_per_score=eval_num_per_score, max_words=max_words)

    def convert_label(record):
        record['label'] = int(record['score'])
        return record

    train_dataset = reduced_dataset_train.map(convert_label)
    eval_dataset = reduced_dataset_eval.map(convert_label)

    columns_to_keep = ['text', 'label']
    train_dataset = train_dataset.remove_columns([col for col in train_dataset.column_names if col not in columns_to_keep])
    eval_dataset = eval_dataset.remove_columns([col for col in eval_dataset.column_names if col not in columns_to_keep])

    train_dataset = train_dataset.map(lambda x: {"label": int(x["label"])})
    eval_dataset = eval_dataset.map(lambda x: {"label": int(x["label"])})
    return train_dataset, eval_dataset
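To see how the two text helpers above behave, here is a small self-contained sketch that copies `count_words` and `truncate_text` and runs them on a made-up sentence (the example strings are assumptions, not taken from the dataset). It also shows that the two helpers count words slightly differently: the regular expression splits on apostrophes, while `str.split()` only splits on whitespace.

```python
import re


def count_words(text):
    # Counts runs of word characters, so "don't" counts as two words.
    return len(re.findall(r'\b\w+\b', text.lower()))


def truncate_text(text, max_words=384):
    # Keeps the first `max_words` whitespace-separated tokens.
    return ' '.join(text.split()[:max_words])


essay = "Cars pollute the air. We should drive less, walk more."
n_words = count_words(essay)
short = truncate_text(essay, max_words=4)

print(n_words)
print(short)

# The two helpers disagree on contractions.
regex_count = count_words("don't stop")
split_count = len("don't stop".split())
print(regex_count, split_count)
```

Because of this difference, a text that is just over `max_words` by the regex count can pass through `truncate_text` unchanged when its whitespace-token count is at or below the limit.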

In [5]:
# model_id = "allenai/longformer-base-4096"
base_model_id = 'Leo1212/longformer-base-4096-sentence-transformers-all-nli-stsb-quora-nq'
model = SetFitModel.from_pretrained(base_model_id) # revision="2104f5d5c622eff94ccb500ec8f722910e9c8f97"

# Set the device to GPU if available, otherwise fallback to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move model to the appropriate device
model.to(device)

print(f"Model classification head: {model.model_head}")

def compute_qwk(y_pred, y_true):
    y_pred = np.argmax(y_pred, axis=1) if y_pred.ndim > 1 else y_pred
    return {"qwk": cohen_kappa_score(y_true, y_pred, weights='quadratic')}

num_iterations=10
num_epochs=10
batch_size=2
num_per_score=130
use_amp=True
loss=CosineSimilarityLoss

train_dataset, eval_dataset = preprocess_datasets(num_per_score, max_words, fullEvalSet=False)

args = TrainingArguments(
    report_to="wandb",
    use_amp=use_amp,
    logging_strategy="epoch",       
    eval_strategy="epoch",  
    save_strategy="epoch",
    batch_size=batch_size,
    num_iterations=num_iterations,
    num_epochs=num_epochs,
    loss=loss,  
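    load_best_model_at_end = True,
    # Note: unless `metric_for_best_model` is set, checkpoints are compared on the
    # embedding loss, so `greater_is_better=True` favours the checkpoint with the highest loss.
    greater_is_better=True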
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  
    eval_dataset=eval_dataset,
    metric=compute_qwk,
    column_mapping={"text": "text", "label": "label"},  
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Using device: cuda
Model classification head: LogisticRegression()


Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/1012 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/3775 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/4985 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/3156 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/787 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13845 [00:00<?, ? examples/s]

Map:   0%|          | 0/130 [00:00<?, ? examples/s]

Map:   0%|          | 0/780 [00:00<?, ? examples/s]

Map:   0%|          | 0/676 [00:00<?, ? examples/s]

Map:   0%|          | 0/780 [00:00<?, ? examples/s]

Map:   0%|          | 0/676 [00:00<?, ? examples/s]

Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset
Currently using DataParallel (DP) for multi-gpu training, while DistributedDataParallel (DDP) is recommended for faster training. See https://sbert.net/docs/sentence_transformer/training/distributed.html for more information.


Map:   0%|          | 0/780 [00:00<?, ? examples/s]

The main arguments to notice in the code above are the following:

* `loss`: The loss function to use for contrastive learning with the Sentence Transformer body (here `CosineSimilarityLoss`), passed via `TrainingArguments`
* `num_iterations`: The number of positive and negative text pairs to generate per training example for contrastive learning
* `column_mapping`: The `Trainer` expects the inputs to be found in a `text` and `label` column. This mapping tells it which dataset columns to use for the training and evaluation datasets.
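The effect of `num_iterations` on training cost can be worked out with a little arithmetic. The sketch below assumes that each example yields `num_iterations` positive and `num_iterations` negative pairs, which is consistent with the "Num unique pairs" line in the training log of this notebook. The device count of 2 is an assumption inferred from the DataParallel message and the step counts in the logs; `num_pairs` and `total_steps` are hypothetical helpers.

```python
import math


def num_pairs(num_samples, num_iterations):
    # Assumed: num_iterations positive plus num_iterations negative pairs per sample.
    return 2 * num_iterations * num_samples


def total_steps(pairs, per_device_batch, n_devices, epochs):
    # With DataParallel, each step consumes one batch per device.
    effective_batch = per_device_batch * n_devices
    return math.ceil(pairs / effective_batch) * epochs


train_pairs = num_pairs(num_samples=780, num_iterations=10)
train_steps = total_steps(train_pairs, per_device_batch=2, n_devices=2, epochs=10)

eval_pairs = num_pairs(num_samples=676, num_iterations=10)
eval_steps = total_steps(eval_pairs, per_device_batch=2, n_devices=2, epochs=1)

print(train_pairs, train_steps)
print(eval_pairs, eval_steps)
```

Doubling `num_iterations` doubles the number of pairs and therefore the training time per epoch, which matters with a long-context body such as Longformer.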

Now that we've created a trainer, we can train it!

In [6]:
trainer.train()

wandb.config.update({
    "num_iterations": num_iterations,
    "num_epochs": num_epochs,
    "device": device.type,
    "loss_class": str(loss),
    "base_model_id": base_model_id,
    "batch_size": batch_size,
    "num_per_score": num_per_score,
    "max_words": max_words,
    "use_amp": use_amp,
})


***** Running training *****
  Num unique pairs = 15600
  Batch size = 2
  Num epochs = 10

  0%|          | 0/39000 [00:00<?, ?it/s]

Input ids are automatically padded to be a multiple of `config.attention_window`: 512


{'embedding_loss': 0.2925, 'grad_norm': 4.681674003601074, 'learning_rate': 5.128205128205129e-09, 'epoch': 0.0}
{'embedding_loss': 0.1808, 'grad_norm': 9.130257606506348, 'learning_rate': 2e-05, 'epoch': 1.0}


  0%|          | 0/3380 [00:00<?, ?it/s]

{'eval_embedding_loss': 0.19862136244773865, 'eval_embedding_runtime': 1229.2116, 'eval_embedding_samples_per_second': 10.999, 'eval_embedding_steps_per_second': 2.75, 'epoch': 1.0}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'embedding_loss': 0.0597, 'grad_norm': 0.4967077374458313, 'learning_rate': 1.7777777777777777e-05, 'epoch': 2.0}


  0%|          | 0/3380 [00:00<?, ?it/s]

{'eval_embedding_loss': 0.25149649381637573, 'eval_embedding_runtime': 1230.051, 'eval_embedding_samples_per_second': 10.991, 'eval_embedding_steps_per_second': 2.748, 'epoch': 2.0}
{'embedding_loss': 0.0181, 'grad_norm': 0.01842939667403698, 'learning_rate': 1.555555555555556e-05, 'epoch': 3.0}


  0%|          | 0/3380 [00:00<?, ?it/s]

{'eval_embedding_loss': 0.3038812577724457, 'eval_embedding_runtime': 1230.1214, 'eval_embedding_samples_per_second': 10.991, 'eval_embedding_steps_per_second': 2.748, 'epoch': 3.0}
{'embedding_loss': 0.0222, 'grad_norm': 0.005139842163771391, 'learning_rate': 1.3333333333333333e-05, 'epoch': 4.0}


  0%|          | 0/3380 [00:00<?, ?it/s]

{'eval_embedding_loss': 0.30127018690109253, 'eval_embedding_runtime': 1229.2187, 'eval_embedding_samples_per_second': 10.999, 'eval_embedding_steps_per_second': 2.75, 'epoch': 4.0}
{'embedding_loss': 0.0082, 'grad_norm': 0.18250790238380432, 'learning_rate': 1.1111111111111113e-05, 'epoch': 5.0}


  0%|          | 0/3380 [00:00<?, ?it/s]

{'eval_embedding_loss': 0.2740993797779083, 'eval_embedding_runtime': 1230.1654, 'eval_embedding_samples_per_second': 10.99, 'eval_embedding_steps_per_second': 2.748, 'epoch': 5.0}
{'train_runtime': 19446.6142, 'train_samples_per_second': 8.022, 'train_steps_per_second': 2.005, 'train_loss': 0.05778299667285039, 'epoch': 5.0}
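Training stops after epoch 5 of 10 because of the `EarlyStoppingCallback` with a patience of 2. The following is a simplified sketch of patience-based early stopping, not the library's implementation. It assumes the monitored metric is the eval embedding loss (rounded from the log above) and compares the configured `greater_is_better=True` with the opposite setting. `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(metric_per_epoch, patience, greater_is_better):
    """Return the epoch at which training would stop, or None if it never stops."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best

        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Eval embedding loss per epoch, rounded from the training log.
eval_loss = [0.1986, 0.2515, 0.3039, 0.3013, 0.2741]

stop_if_greater = stopping_epoch(eval_loss, patience=2, greater_is_better=True)
stop_if_lower = stopping_epoch(eval_loss, patience=2, greater_is_better=False)

print(stop_if_greater)
print(stop_if_lower)
```

Under these assumptions, treating a higher loss as better reproduces the stop at epoch 5, whereas treating a lower loss as better would have stopped at epoch 3 and kept the epoch 1 checkpoint.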


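The final step is to compute the model's performance using the `evaluate()` method. Because the trainer was created with `metric=compute_qwk`, it reports the quadratic weighted kappa (QWK) between predicted and true scores instead of the default accuracy. Here we evaluate on the full eval split rather than the subsampled one used during training.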

In [7]:
_, full_eval_dataset = preprocess_datasets(num_per_score, max_words, fullEvalSet=True)

metrics = trainer.evaluate(full_eval_dataset)
qwk_score = metrics.get("qwk", -1)
wandb.log({"eval_qwk": qwk_score})
metrics

Map:   0%|          | 0/780 [00:00<?, ? examples/s]

Map:   0%|          | 0/3462 [00:00<?, ? examples/s]

Map:   0%|          | 0/780 [00:00<?, ? examples/s]

Map:   0%|          | 0/3462 [00:00<?, ? examples/s]

Applying column mapping to the evaluation dataset
***** Running evaluation *****


{'qwk': 0.7462769974528534}
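To make the QWK value more concrete, here is a minimal pure-Python sketch of quadratic weighted kappa for integer scores from 1 to 6. The notebook itself uses `cohen_kappa_score(..., weights='quadratic')` from scikit-learn; the function name and the toy score lists below are assumptions for illustration only.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (K - 1)^2."""
    k = max_rating - min_rating + 1
    n = len(y_true)

    # Confusion matrix: rows are true scores, columns are predicted scores.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if true and predicted scores were independent.
            denominator += weight * hist_true[i] * hist_pred[j] / n

    # The denominator is zero only if all true and predicted scores share one value.
    return 1.0 - numerator / denominator


y_true = [1, 2, 3, 4, 5, 6]
perfect = quadratic_weighted_kappa(y_true, [1, 2, 3, 4, 5, 6])
near_miss = quadratic_weighted_kappa(y_true, [2, 2, 3, 4, 5, 6])
far_miss = quadratic_weighted_kappa(y_true, [6, 2, 3, 4, 5, 6])

print(perfect, near_miss, far_miss)
```

The quadratic weights penalise a prediction that is five points off far more than one that is a single point off, which suits ordinal targets such as essay scores.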

And once the model is trained, you can push it to the Hub:

In [8]:
trainer.push_to_hub(f"{huggingface_username}/{our_model_name}", private=True)

model.safetensors:   0%|          | 0.00/595M [00:00<?, ?B/s]

model_head.pkl:   0%|          | 0.00/37.8k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit/commit/19c9509ded90e4098505369211f4a491eff49017', commit_message='Add SetFit model', commit_description='', oid='19c9509ded90e4098505369211f4a491eff49017', pr_url=None, repo_url=RepoUrl('https://huggingface.co/HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit', endpoint='https://huggingface.co', repo_type='model', repo_id='HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit'), pr_revision=None, pr_num=None)

Upload to KaggleHub

In [9]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(f"{huggingface_username}/{our_model_name}")

In [10]:
VARIATION_SLUG = 'default'

LOCAL_MODEL_DIR = f"../src/models/{our_model_name}"
model.save_pretrained(LOCAL_MODEL_DIR)

# Compress the model directory (optional but helpful for large files)
shutil.make_archive(our_model_name, 'zip', LOCAL_MODEL_DIR)

kagglehub.model_upload(
  handle = f"leo1212abc/{our_model_name}/transformers/{VARIATION_SLUG}",
  local_model_dir = LOCAL_MODEL_DIR,
  version_notes = f"Metrics: {str(metrics)}")

Uploading Model https://www.kaggle.com/models/leo1212abc/automated-essay-scoring-setfit/transformers/default ...
Starting upload for file ../src/models/automated-essay-scoring-setfit/config_setfit.json
Upload successful: ../src/models/automated-essay-scoring-setfit/config_setfit.json (53B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/tokenizer_config.json
Upload successful: ../src/models/automated-essay-scoring-setfit/tokenizer_config.json (1KB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/tokenizer.json
Upload successful: ../src/models/automated-essay-scoring-setfit/tokenizer.json (3MB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/vocab.json
Upload successful: ../src/models/automated-essay-scoring-setfit/vocab.json (780KB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/README.md
Upload successful: ../src/models/automated-essay-scoring-setfit/README.md (4KB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/config_sentence_transformers.json
Upload successful: ../src/models/automated-essay-scoring-setfit/config_sentence_transformers.json (201B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/config.json
Upload successful: ../src/models/automated-essay-scoring-setfit/config.json (913B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/modules.json
Upload successful: ../src/models/automated-essay-scoring-setfit/modules.json (229B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/special_tokens_map.json
Upload successful: ../src/models/automated-essay-scoring-setfit/special_tokens_map.json (958B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/merges.txt
Upload successful: ../src/models/automated-essay-scoring-setfit/merges.txt (446KB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/model.safetensors
Upload successful: ../src/models/automated-essay-scoring-setfit/model.safetensors (567MB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/sentence_bert_config.json
Upload successful: ../src/models/automated-essay-scoring-setfit/sentence_bert_config.json (54B)
Starting upload for file ../src/models/automated-essay-scoring-setfit/model_head.pkl
Upload successful: ../src/models/automated-essay-scoring-setfit/model_head.pkl (37KB)
Starting upload for file ../src/models/automated-essay-scoring-setfit/1_Pooling/config.json
Upload successful: ../src/models/automated-essay-scoring-setfit/1_Pooling/config.json (296B)

Your model instance version has been created.
Files are being processed...
See at: https://www.kaggle.com/models/leo1212abc/automated-essay-scoring-setfit/transformers/default


Anyone with access to the repository can now load the model with the identifier `HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit` (the repository was pushed as private above).

Run inference on the first three essays of the test split. The model returns one predicted score per essay.

In [11]:
preds = model(
    [
        dataset['test'][0]['full_text'],
        dataset['test'][1]['full_text'],
        dataset['test'][2]['full_text'],
    ]
)
preds

tensor([4, 3, 5])

In [None]:
from huggingface_hub import HfApi, Repository

def cleanup_yaml_text(text):
    """
    Formats a text input as a YAML literal block scalar so that it can be
    embedded in the model card metadata as a multi-line string.

    Args:
        text (str): The text string to be formatted.

    Returns:
        str: The text as a '|-' block scalar with each line indented by two spaces.
    """
    # Inside a literal block scalar the content is taken verbatim, so quotes
    # must not be escaped; only the indentation matters.
    formatted_text = "|-\n  " + text.replace("\n", "\n  ")
    return formatted_text

# Generate cleaned text for each label
widget_texts = [
    cleanup_yaml_text(train_dataset.filter(lambda x: x['label'] == i)['text'][0])
    for i in range(1, 7)
]



best_qwk_score=qwk_score
best_hyperparameters = wandb.config
from types import SimpleNamespace

# Convert the dictionary to a SimpleNamespace object
best_hyperparameters = SimpleNamespace(**best_hyperparameters)

# One example essay per score (1-6), with newlines removed for the model card table
examples = [
    train_dataset.filter(lambda x, score=score: x['label'] == score)['text'][0].replace('\n', '')
    for score in range(1, 7)
]

# Define your variables
model_name = f"{huggingface_username}/{our_model_name}"

# Check if the directory already exists; if so, delete it to allow for updates (git fetch/pull does not work here)
if os.path.exists(model_name):
    shutil.rmtree(model_name)

repo = Repository(local_dir=model_name, clone_from=model_name)

# Create or update the model card with desired information
# model_card = ModelCard.load(f"{model_name}/README.md")

# Add QWK score and hyperparameters to the model card's content
model_card_content = f"""
---
base_model: {base_model_id}
library_name: setfit
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
inference: true
model-index:
- name: SetFit with {base_model_id}
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Unknown
      type: unknown
      split: test
    metrics:
    - type: qwk
      value: {best_qwk_score}
      name: QWK
---

# SetFit with {base_model_id}

This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [{base_model_id}](https://huggingface.co/{base_model_id}) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.

## Model Details

### Model Description
- **Model Type:** SetFit
- **Sentence Transformer body:** [{base_model_id}](https://huggingface.co/{base_model_id})
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 4098 tokens
- **Number of Classes:** 6 classes
<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

### Model Labels
| Label | Examples |
|:------|:---------|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1     | <ul><li>'{examples[0]}'</li></ul>|
| 2     | <ul><li>'{examples[1]}'</li></ul>|
| 3     | <ul><li>'{examples[2]}'</li></ul>|
| 4     | <ul><li>'{examples[3]}'</li></ul>|
| 5     | <ul><li>'{examples[4]}'</li></ul>|
| 6     | <ul><li>'{examples[5]}'</li></ul>|

## Evaluation

### Metrics
| Label   | QWK |
|:--------|:---------|
| **all** | {best_qwk_score}   |
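
QWK here is assumed to mean quadratic weighted kappa, i.e. Cohen's kappa with quadratic disagreement weights. The sketch below computes it from the definition in plain Python for integer scores; the function name and the default 1 to 6 score range are illustrative assumptions based on the label table above, not necessarily the code used to produce the reported number.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_label=1, max_label=6):
    # Illustrative helper (assumed name and defaults); expects integer scores
    # in the closed range [min_label, max_label].
    n = max_label - min_label + 1

    # Observed confusion matrix: rows are true scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_label][p - min_label] += 1

    total = len(y_true)
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty grows with the squared distance between scores.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected under chance agreement given the two marginals.
            expected = true_hist[i] * pred_hist[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # The denominator is zero when both inputs contain one identical score only.
    return 1.0 - numerator / denominator
```

scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` should agree with this sketch when the same score range is passed through its `labels` argument.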

## Uses

### Direct Use for Inference

First install the SetFit library:

```bash
pip install setfit
```

Then you can load this model and run inference.

```python
from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit")
# Run inference
preds = model('''In source 1, Elisabeth Rosenthal is inform us about the low car usage in Vauban,Germany. In Vauban, the residents no longer use cars. They use other means of transportation such as bicycles and walking etc. It is shown in paragraph 3 that 70 percent of Vauban's families do not own cars and that 57 percent sold a car to move there.

In Europe, passager cars are responsible for 12 percent of greenhouse gas and up to 50 percent in some car intense areas in the United States, paragraph 5. Efforts in past 20 ears have been made to make cities denser and better for walking. Populated with 5,500 , Vauban may be the most advanced experiment in low car suburban life, paragraph 6.

Scarsdale and Levittown, New York suburbs has strong apppeal, Many new suburbs may look more like Vauban. Cities around the world where emissions from cars are choking cities, their developing a little bit of the Vauban life style now, paragraph 8. The Environmental Protection Agency in the United States, is promoting the \"car reduced\" communities. Many experts expect public transport serving suburbs to play a much larger role in a six- year federal bill. 80 percent of apporpriation have by law gonre to highways and only 20 percent to other transport, paragraph 9.''')
```

<!--
### Downstream Use

*List how someone could finetune this model on their own dataset.*
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Hyperparameters
- batch_size: {best_hyperparameters.batch_size}
- num_epochs: {best_hyperparameters.num_epochs}
- max_steps: {best_hyperparameters.max_steps}
- sampling_strategy: {best_hyperparameters.sampling_strategy}
- num_iterations: {best_hyperparameters.num_iterations}
- num_per_score: {best_hyperparameters.num_per_score}
- body_learning_rate: {best_hyperparameters.body_learning_rate}
- head_learning_rate: {best_hyperparameters.head_learning_rate}
- loss: {best_hyperparameters.loss}
- distance_metric: {best_hyperparameters.distance_metric}
- margin: {best_hyperparameters.margin}
- end_to_end: {best_hyperparameters.end_to_end}
- use_amp: {best_hyperparameters.use_amp}
- warmup_proportion: {best_hyperparameters.warmup_proportion}
- l2_weight: {best_hyperparameters.l2_weight}
- seed: {best_hyperparameters.seed}
- eval_max_steps: {best_hyperparameters.eval_max_steps}
- load_best_model_at_end: {best_hyperparameters.load_best_model_at_end}
- metric_for_best_model: qwk

### Framework Versions
- Python: 3.11.9
- SetFit: 1.1.0
- Sentence Transformers: 3.1.1
- Transformers: 4.45.2
- PyTorch: 2.3.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.20.0

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
"""

# Write content to the model card (UTF-8, since the card contains non-ASCII characters)
with open(f"{repo.local_dir}/README.md", "w", encoding="utf-8") as file:
    file.write(model_card_content)

# Push the changes to the hub
repo.push_to_hub()


Filter:   0%|          | 0/780 [00:00<?, ? examples/s]

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit into local empty directory.


Download file model.safetensors:   0%|          | 8.00k/567M [00:00<?, ?B/s]

Download file model_head.pkl:  20%|##        | 7.46k/36.9k [00:00<?, ?B/s]

Clean file model_head.pkl:   3%|2         | 1.00k/36.9k [00:00<?, ?B/s]

Clean file model.safetensors:   0%|          | 1.00k/567M [00:00<?, ?B/s]

To https://huggingface.co/HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit
   19c9509..ffbc35c  main -> main



'https://huggingface.co/HSLU-AICOMP-LearningAgencyLab/automated-essay-scoring-setfit/commit/ffbc35cbfcc45f4a5855f013fd1d7b268c42a436'
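
The log above links to the `git_vs_http` guide, which describes HTTP-based uploads as an alternative to working through a local git clone. As a minimal sketch, assuming the card text is already available as a string, the README could be uploaded with `HfApi.upload_file` instead. The helper name `upload_model_card` and its `api` parameter are hypothetical and only make the example easy to try with a stand-in object.

```python
def upload_model_card(card_text, repo_id, api=None):
    # Hypothetical helper: uploads `card_text` as README.md to a model repo.
    # `api` is any object exposing `upload_file` with HfApi's keyword arguments;
    # when omitted, a real HfApi client is created (requires a prior login).
    if api is None:
        from huggingface_hub import HfApi
        api = HfApi()

    return api.upload_file(
        # Raw bytes are accepted, so no local clone or temp file is needed.
        path_or_fileobj=card_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="model",
        commit_message="Update model card",
    )
```

In this notebook the call would look like `upload_model_card(model_card_content, f"{huggingface_username}/{our_model_name}")`, assuming those variables still hold the values set in the first cell.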
