**Load Data File**

Downloads the file's contents and saves it to the notebook's files.

In [None]:
import requests

data_file = "reddit_comments.csv"

response = requests.get("https://drive.google.com/uc?export=download&id=1grbBKQ8SEcujIYSTiKaDbOXbFpuhTijv")
response.raise_for_status()  # stop here if the download failed, rather than saving an error page
with open(data_file, "wb") as file:
    file.write(response.content)

Examine file entries (sanity check)

In [None]:
import pandas as pd

dataframe = pd.read_csv(data_file)
print(dataframe.head())
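
Beyond `head()`, a quick look at the label balance and at missing values can catch a bad download early. The sketch below uses a tiny in-memory frame so it runs on its own. The `comment` and `label` column names are taken from the cells further down, and the rows are invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded CSV (these rows are made up).
sample = pd.DataFrame({
    "comment": ["great post", "this is awful", "love it", None],
    "label": [1, 0, 1, 0],
})

# How many rows carry each label value.
label_counts = sample["label"].value_counts().to_dict()
# How many rows have no comment text at all.
missing_comments = int(sample["comment"].isna().sum())

print("label counts:", label_counts)
print("missing comments:", missing_comments)
```

The same two lines should work on `dataframe` from the cell above, assuming the real file uses those column names.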

**Prepare Dataset and General Setup**

I primarily use Hugging Face in this project. To save my model's weights after every epoch, I needed a Hugging Face account so that I could push the model to [their model hub](https://huggingface.co/docs/hub/models-the-hub). If you want to replicate my training, you will need to create a Hugging Face account and then generate a token that has write access.

You only need to run the `!pip3 install` code blocks once per session.

In [None]:
!pip3 install huggingface_hub

Paste your token into the box and follow the instructions. You only need to authenticate once per session.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
!pip3 install datasets transformers==4.28.0

Set Seed

Run this before the Load and Partition Dataset section so that the train, valid, and test partitions come out the same on every run.

In [None]:
from transformers import set_seed

set_seed(42)
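
`set_seed` seeds Python's `random` module and NumPy (as well as PyTorch), which is why it has to run before anything is shuffled. As a rough illustration of the effect, and not of how `transformers` implements it, the sketch below re-seeds two of those generators by hand and shows that the draws repeat. The helper name `draw` is made up for this example.

```python
import random

import numpy as np

def draw(seed):
    # Re-seed both generators, then draw one number from each.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())

first = draw(42)
second = draw(42)

# Same seed, same draws.
print(first == second)
```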

Load and Partition Dataset

In [None]:
from datasets import load_dataset, DatasetDict, Features, ClassLabel, Value

dataset = load_dataset('csv', data_files=data_file, split="train", download_mode="reuse_cache_if_exists")

train_testvalid = dataset.train_test_split(test_size=0.2)

test_valid = train_testvalid['test'].train_test_split(test_size=0.5)

final_dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

print(final_dataset)
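
The two chained `train_test_split` calls above produce an 80/10/10 split: 20% is held out first, and that held-out part is then halved into valid and test. The library-free sketch below mimics the idea with a seeded shuffle followed by slicing. It is only an illustration under that assumption, not the way `datasets` does it internally, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, seed=42):
    # Shuffle row indices with a seeded generator, then slice 80/10/10.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = n_rows * 8 // 10
    n_valid = (n_rows - n_train) // 2
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }

parts = split_indices(1000)
print({name: len(rows) for name, rows in parts.items()})
```

Because the shuffle is seeded, calling `split_indices(1000)` again returns the same three partitions, which is the property the Set Seed step is meant to guarantee for the real dataset.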

In [None]:
def count_dataset_divide(dataset_partition, partition_name):
  count_pos = 0
  count_neg = 0
  for entry in dataset_partition:
    if entry["label"] == 0:
      count_neg += 1
    else: #entry["label"] == 1:
      count_pos += 1
  print("partition_name:", partition_name)
  print("positive_count: ", count_pos)
  print("negative_count: ", count_neg)

Examine Final Dataset

In [None]:
count_dataset_divide(final_dataset["train"], "Train")
print(final_dataset["train"][0], "\n")
count_dataset_divide(final_dataset["test"], "Test")
print(final_dataset["test"][0], "\n")
count_dataset_divide(final_dataset["valid"], "Valid")
print(final_dataset["valid"][0], "\n")
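
The same tally can also be written with `collections.Counter`, which avoids the hand-maintained counters and would also cope with more than two label values. This is a minimal sketch on invented rows; `label_counts` is a hypothetical helper, and the `label` key matches the column used above.

```python
from collections import Counter

def label_counts(partition):
    # Tally how often each label value occurs in an iterable of dict-like rows.
    return Counter(entry["label"] for entry in partition)

# Made-up rows standing in for one dataset partition.
rows = [
    {"comment": "great post", "label": 1},
    {"comment": "this is awful", "label": 0},
    {"comment": "love it", "label": 1},
]

counts = label_counts(rows)
print("positive_count:", counts[1])
print("negative_count:", counts[0])
```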

**Train**

GPU/CPU

In [None]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
from datasets import load_metric

task = "sst2"
num_labels = 2
model_checkpoint = "distilbert-base-uncased"
metric = load_metric('glue', 'sst2')

batch_size = 32
learn_rate = 5e-6
num_epochs = 6
w_decay = 0.01

Tokenizer

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Encode Dataset

In [None]:
def preprocess_function(entries):
    return tokenizer(entries["comment"], truncation=True)

encoded_dataset = final_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

Model

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Training Arguments

`model_name` is the output directory for checkpoints and, because `push_to_hub=True`, also the name of the model repository on the Hugging Face Hub.

In [None]:
model_name = "reddit-comment-sentiment-final"

train_args = TrainingArguments(
    model_name,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=learn_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=w_decay,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)
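
To get a feel for how long training will run, the number of optimizer steps can be estimated from the batch size and epoch count. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. The training-set size of 1000 is a placeholder, not the real size of this dataset.

```python
import math

def total_optimizer_steps(n_train_rows, batch_size=32, num_epochs=6):
    # One step per batch; the final, smaller batch still counts as a step.
    steps_per_epoch = math.ceil(n_train_rows / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical training-set size, for illustration only.
print(total_optimizer_steps(1000))
```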

Compute Metrics - Accuracy

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
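
For a two-class task, `compute_metrics` boils down to taking the arg-max over each row of logits and comparing the result with the labels. The sketch below reproduces that calculation with NumPy alone on made-up logits, as an illustration of what the accuracy figure means rather than a replacement for the GLUE metric object.

```python
import numpy as np

# Invented model outputs: one row of two logits per example.
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 0.5],
    [1.0, 0.0],
])
labels = np.array([0, 1, 0, 0])

# Predicted class = index of the larger logit in each row.
predictions = np.argmax(logits, axis=1)
# Accuracy = fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print({"accuracy": accuracy})
```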

Train

In [None]:
trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["valid"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Print the evaluation on the validation set. Because `load_best_model_at_end=True`, this should be the checkpoint with the best validation accuracy (sanity check).

In [None]:
trainer.evaluate()

Push Trainer to the Hugging Face Hub

In [None]:
trainer.push_to_hub()