# Questions
- What are git remotes, how do they work, how do I edit them, and how do I fix mine?
- How do I push to my main branch in git?
- Follow up on CAPS question on tokenization section

# Extract raw data
### Expected Format: 
<font color='red'>[</font>{<font color='green'>'text':</font> 'All services should have a dedicated channel for reporting fraud. We see and hear about a lot of fraud on small platforms, especially dating sights and job boards. ',
  <font color='green'>'label':</font> 0,
  <font color='green'>'summary':</font> 'Every service should have a dedicated reporting channel for fraud, because there is lots of fraud on a range of platforms. '}, <font color='red'>...]</font>
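
As a minimal sketch of this format, the snippet below builds one such record from a tab-delimited row using only the standard library. The column names match the ones read in the next cell; the in-memory sample row and the helper name `row_to_record` are assumptions made for illustration. The label is kept as the cleaned topic text here, and the numeric 0/1 label is assigned later in the notebook.

```python
import csv
import io

# Hypothetical two-line sample standing in for the real tab-delimited file.
sample = (
    "Topic\tFull summary of comment\tOne-line summary\n"
    " Governance and Accountability \tStaff should be trained.\tTrain staff.\n"
)

def row_to_record(row):
    """Map one CSV row to the {'label', 'text', 'summary'} record shape."""
    return {
        "label": row["Topic"].strip().lower().replace("\n", ""),
        "text": row["Full summary of comment"],
        "summary": row["One-line summary"],
    }

records = [row_to_record(r) for r in csv.DictReader(io.StringIO(sample), delimiter="\t")]
print(records)
```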

In [21]:
# MS: create training set, evaluation set and validation set
import sys
import pandas as pd
import csv

dataset_name = "/home/azureuser/cloudfiles/code/Users/Omololu.Makinde/Llama_tutorial/data/consultation2.csv"
dataset = []

with open(dataset_name, encoding='utf-8-sig') as FID:
    csvReader = csv.DictReader(FID, delimiter="\t")
    for row in csvReader:
        dataset.append({
            "label" : row['Topic'].strip().lower().replace('\n', ''),
            "text" : row['Full summary of comment'],
            "summary" : row['One-line summary']
        })

df_dataset = pd.DataFrame(dataset)
df_dataset.drop(columns =['summary'], inplace=True)  # don't need this for now
display(df_dataset)

Unnamed: 0,label,text
0,user reporting and complaints (u2u and search),All services should have a dedicated channel f...
1,governance and accountability,"This mitigation should apply to all services, ..."
2,governance and accountability,Evidence of new kinds of illegal content on a ...
3,governance and accountability,A Code of Conduct or principles provided to al...
4,governance and accountability,"Staff, in particular engineers, involved in th..."
...,...,...
1890,governance and accountability,Snap asked that Ofcom consider how this measur...
1891,ror,Supports Ofcoms wide use of litrature to regog...
1892,approach to codes,They support the use of the STIM but alongside...
1893,icjg,ASW should be expected to take more of a proac...


# Data Check

In [22]:
df_dataset_count = df_dataset['label'].value_counts()  # number of rows per label
df_dataset_count

label
approach to the codes                                                               468
automated content moderation (user to user)                                         270
governance and accountability                                                       228
user reporting and complaints (u2u and search)                                      140
content moderation (user to user)                                                   128
user access to services (u2u)                                                       111
enhanced user control (u2u)                                                         101
default settings and user support (u2u)                                              89
cumulative assessment                                                                76
terms of service and publicly available statements                                   60
recommender system testing (u2u)                                                     59
content moderation (search

In [23]:
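# keep only the labels that occur more than 10 times
keep_labels = list(df_dataset_count[df_dataset_count > 10].index)
df_dataset = df_dataset[df_dataset['label'].isin(keep_labels)].copy()  # copy so the later column assignment does not act on a slice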
display(df_dataset['label'].value_counts())

label
approach to the codes                                 468
automated content moderation (user to user)           270
governance and accountability                         228
user reporting and complaints (u2u and search)        140
content moderation (user to user)                     128
user access to services (u2u)                         111
enhanced user control (u2u)                           101
default settings and user support (u2u)                89
cumulative assessment                                  76
terms of service and publicly available statements     60
recommender system testing (u2u)                       59
content moderation (search)                            52
service design and user support (search)               30
automated content moderation (search)                  25
statutory tests                                        22
Name: count, dtype: int64
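
The filtering step above can be sketched without pandas, which makes the threshold behaviour easy to check. The toy labels are hypothetical; only the "more than 10" rule comes from the cell above. A label with exactly 10 rows is dropped because the comparison is strict.

```python
from collections import Counter

# Hypothetical toy labels; the real notebook reads them from df_dataset['label'].
labels = ["a"] * 12 + ["b"] * 11 + ["c"] * 10 + ["d"] * 3

counts = Counter(labels)
min_count = 10  # same threshold as the cell above: keep labels with count > 10
keep_labels = {label for label, n in counts.items() if n > min_count}
filtered = [label for label in labels if label in keep_labels]

print(sorted(keep_labels))
print(len(filtered))
```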

# Convert labels to binary classifier format
- 'governance and accountability' = 1, everything else = 0

Note: the dataset is heavily imbalanced (roughly 12% of rows are class 1), so accuracy alone would be a poor metric for evaluating our model in its current state.

In [24]:
import numpy as np

# add in our classifier labels
topic = 'governance and accountability'
df_dataset['label'] = np.where(np.array(df_dataset['label']) == topic, 1, 0)
df_dataset['label'].value_counts(normalize=False)

label
0    1631
1     228
Name: count, dtype: int64
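
To make the note about accuracy concrete, here is a small worked example using the class counts printed above. It scores a trivial classifier that always predicts 0. This is only an illustration of why accuracy is misleading on imbalanced data, not part of the training pipeline.

```python
# Class counts taken from the value_counts() output above.
n_negative = 1631
n_positive = 228
total = n_negative + n_positive

# A classifier that always predicts 0 gets every negative right and every positive wrong.
baseline_accuracy = n_negative / total
baseline_recall = 0 / n_positive

print(f"always-0 accuracy: {baseline_accuracy:.3f}")
print(f"always-0 recall:   {baseline_recall:.3f}")
```

The baseline reaches about 88% accuracy while finding none of the class 1 rows, which is why precision, recall and F1 are tracked as well.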

# Extract Evaluation Set

In [25]:
# train and test data
class_len = len(df_dataset[df_dataset['label'] == 1])  # find how many values we can take and still have a balanced class
class_0_data = df_dataset[df_dataset.label.eq(0)].sample(class_len) 
class_1_data = df_dataset[df_dataset.label.eq(1)].sample(class_len)
train_test_data = pd.concat([class_0_data, class_1_data])  # 50/50 class split
display(train_test_data)

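# evaluation data
# note: every class 1 row was sampled into train_test_data above, so this set contains only class 0
eval_data = df_dataset.drop(train_test_data.index)  # put the rest into an evaluation set we can play with
eval_data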

Unnamed: 0,label,text
1537,0,Spotify's response says that a large service s...
1431,0,Evidence suggests that those aged 15-17 dislik...
1705,0,5Rights argue that services should undertake r...
332,0,We are concerned that Ofcom’s definition of a ...
1171,0,"In 2022, eBay removed 773,000 items based on r..."
...,...,...
1531,1,Spotify says it wants Ofcom to be proportionat...
779,1,We also believe there could be significant cos...
920,1,MSPG urge Ofcom to not mandate use of external...
983,1,CCDH is aware of the societal impacts caused ...


Unnamed: 0,label,text
0,0,All services should have a dedicated channel f...
7,0,"When prioritising what content to review, rega..."
8,0,"(Disagree). As noted before, ""there should be ..."
12,0,The RSPCA agrees that the most onerous measure...
13,0,The RSPCA agrees that the most onerous measure...
...,...,...
1881,0,It is Snap’s practice to generally apply accou...
1882,0,Disadvantages of each signal ● Blocking by Dev...
1883,0,"We would stress that this issue is not easily,..."
1884,0,Notwithstanding an appeal being successful and...
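
The balancing step can be sketched in plain Python to show what ends up in each set. The toy rows and the fixed seed are assumptions for illustration; in pandas, passing `random_state` to `DataFrame.sample` would play the same role as the seed here.

```python
import random

# Hypothetical toy data: 3 positives and 9 negatives, as (row_id, label) pairs.
rows = [(i, 1 if i < 3 else 0) for i in range(12)]

rng = random.Random(0)  # fixed seed so the same rows are drawn on every run
positives = [r for r in rows if r[1] == 1]
negatives = [r for r in rows if r[1] == 0]

# Undersample the majority class down to the size of the minority class.
sampled_negatives = rng.sample(negatives, len(positives))
train_test = positives + sampled_negatives

# Everything not drawn goes to the evaluation set.
used_ids = {row_id for row_id, _ in train_test}
evaluation = [r for r in rows if r[0] not in used_ids]

print(len(train_test), len(evaluation))
print({label for _, label in evaluation})
```

Because all positives are used for the balanced set, the leftover evaluation set holds negatives only, which matches the all-zero label column in the output above.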


# Transform data to usable version for huggingface model

### Using distilbert base uncased model
- BERT = a transformer-based language model (Bidirectional Encoder Representations from Transformers)
- distil = smaller, faster model that keeps most of the power of a normal BERT model
- base = the standard model size (as opposed to "large"); the checkpoint is pretrained but not yet fine-tuned for our task
- uncased = capitalised letters make no difference, i.e. england = EnglaND (Note: we probably still want to consider this in case we change to a cased model)

### Tokenizing our data using our base model
- Tokenizing is breaking our data up into smaller parts for our model to read (note that tokens are not always whole words; they can also be special characters or subwords)
- Embedding (done inside the model, not by the tokenizer) takes an individual token and turns it into a more computer-friendly format by mapping it to a multi-dimensional numeric vector, e.g. cat -> (12, 45)
- Work from a huggingface dataset
- Only want to tokenize our text
- Pad the values with a special padding token if necessary (extends shorter sequences to the length of the longest sequence in the batch)
- Truncate the values if necessary (shortens longer sequences to the max length)
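
The padding and truncation ideas above can be sketched with a toy tokenizer. This is not the real DistilBERT tokenizer: the vocabulary, the ids and the function name `toy_tokenize` are hypothetical, it splits on whitespace rather than into subwords, and it leaves out the special start and end tokens that the real tokenizer adds.

```python
# Hypothetical toy vocabulary; real DistilBERT uses a WordPiece vocabulary of subwords.
PAD_ID, UNK_ID = 0, 1
vocab = {"fraud": 2, "is": 3, "common": 4, "on": 5, "small": 6, "platforms": 7}

def toy_tokenize(texts, max_length):
    """Lowercase, split on whitespace, truncate to max_length, pad to the longest row."""
    ids = [[vocab.get(w, UNK_ID) for w in t.lower().split()][:max_length] for t in texts]
    longest = max(len(row) for row in ids)
    input_ids = [row + [PAD_ID] * (longest - len(row)) for row in ids]
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_tokenize(["Fraud is COMMON", "Fraud is common on small platforms"], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second text is cut to four tokens, the first is padded up to that length, and the attention mask marks the padded position with 0 so the model can ignore it. Lowercasing before lookup mirrors the "uncased" behaviour.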

In [26]:
from datasets import Dataset
from transformers import AutoTokenizer

huggingface_data = Dataset.from_pandas(train_test_data, preserve_index=False)  # don't include pandas index

pretrained_model_name = "distilbert/distilbert-base-uncased"  # This is our base model

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
def proc_data(data):
    return tokenizer(data['text'], max_length=512, padding=True, truncation=True)

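tokenized_data = huggingface_data.map(proc_data, batched=True)  # with batched=True the function receives batches of rows, which the tokenizer can process together
# print(tokenized_data['text'] == huggingface_data['text'])  # SHOULDN'T THIS NOT BE TRUE??? ESPECIALLY IF I CHANGE MAX_LENGTH TO BE 1 OR SOMETHING
# (map only adds the 'input_ids' and 'attention_mask' columns; the 'text' column is left unchanged, so the comparison stays True whatever max_length is)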

split_tokenized_hugginface_data = tokenized_data.train_test_split(test_size=0.10)  # 90/10 train/test split
print(split_tokenized_hugginface_data)

Map: 100%|██████████| 456/456 [00:00<00:00, 3958.24 examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 410
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 46
    })
})
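
The row counts above follow from the split fraction. A short arithmetic check, assuming the test share is rounded up and the remainder goes to the training set (this rounding rule is an assumption that happens to reproduce the printed numbers):

```python
import math

n_rows = 456       # rows in the balanced set (228 per class)
test_size = 0.10   # value passed to train_test_split in the cell above

# Assumption: the test share is rounded up and the remainder goes to train.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```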





# Train model
- 1 = "Positive" -> the model is predicting that the input CAN be categorised by our input topic
- 0 = "Negative" -> the model is predicting that the input can NOT be categorised by our input topic
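
As a minimal sketch of how a prediction becomes one of these two labels: the model outputs one score (logit) per class, and the class with the larger score wins. The logits below are made-up numbers and `predict` is a hypothetical helper; the training cell does the same thing with `np.argmax`.

```python
# Label mapping used in this notebook.
id2label = {0: "Negative", 1: "Positive"}

# Hypothetical logits for three inputs: one [negative_score, positive_score] pair per input.
logits = [[2.1, -0.3], [-1.0, 0.8], [0.2, 0.1]]

def predict(row):
    """Return the index of the largest logit (the argmax)."""
    return max(range(len(row)), key=lambda i: row[i])

predicted_ids = [predict(row) for row in logits]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```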

Instead of doing evaluate.load(metric), can load several metrics at once using evaluate.combine([metric1, metric2, ...])

Note:
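- Just reducing the test set size from 15% to 10% gave about a 10% improvement in accuracy (with only 46 test rows, each row is worth about 2 percentage points, so some of this may be noise)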
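
For reference, the four metrics can be computed by hand from confusion counts. The counts below are an assumption: they are inferred so that the results match the epoch 3 row of the training log further down, and the notebook itself does not report them.

```python
# Assumed confusion counts for a 46-row test set. They are inferred so that the
# results match the epoch 3 row of the training log; the notebook does not report them.
tp, fp, fn, tn = 23, 2, 5, 16

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

With a test set this small, moving a single row from wrong to right changes accuracy by 1/46, a little over two percentage points.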

In [27]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback
from transformers import DataCollatorWithPadding
import evaluate


# Select evaluation metrics
evaluation_metrics = ["accuracy", "f1", "precision", "recall"]
accuracy = evaluate.combine(evaluation_metrics)  # combined metric object; replaces evaluate.load("accuracy")

# Use argmax to pick the most likely class for each input, then score the predictions
def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # get model outputs (logits) and true labels
    predictions = np.argmax(predictions, axis=1)  # pick the highest-scoring class (0 = negative, 1 = positive)
    return accuracy.compute(predictions=predictions, references=labels)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}


model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name,
                                                            id2label=id2label,
                                                            label2id=label2id)

# Training arguments (hyperparameters, evaluation and checkpoint settings)
model_output_path = "/home/azureuser/cloudfiles/code/Users/Michael.Sowter/Deep_Learning_Training/Text Classifier/Models"
training_args = TrainingArguments(
    output_dir=model_output_path,  # path model stored
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    logging_steps=100
    )


# Set up the trainer with the model, data and metrics
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_tokenized_hugginface_data["train"],
    eval_dataset=split_tokenized_hugginface_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]  # Stop training if no improvement after 5 consecutive evaluations
)

# Train model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.612856,0.76087,0.77551,0.904762,0.678571
2,No log,0.437802,0.826087,0.851852,0.884615,0.821429
3,No log,0.437469,0.847826,0.867925,0.92,0.821429
4,0.427500,0.475368,0.847826,0.867925,0.92,0.821429
5,0.427500,0.611315,0.847826,0.867925,0.92,0.821429
6,0.427500,0.699294,0.847826,0.867925,0.92,0.821429
7,0.427500,0.621979,0.847826,0.867925,0.92,0.821429
8,0.068000,0.765226,0.847826,0.867925,0.92,0.821429
9,0.068000,0.816931,0.847826,0.867925,0.92,0.821429
10,0.068000,0.849058,0.847826,0.867925,0.92,0.821429
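
The commented-out early stopping callback in the training cell would cut a run like this short. Below is a simplified sketch of the patience idea applied to the validation losses in the table above. It is not the library's implementation, the function name `stopping_epoch` is hypothetical, and it assumes validation loss is the quantity being monitored.

```python
# Validation losses for epochs 1-10, copied from the table above.
val_losses = [0.612856, 0.437802, 0.437469, 0.475368, 0.611315,
              0.699294, 0.621979, 0.765226, 0.816931, 0.849058]

def stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if patience never runs out."""
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch

print(stopping_epoch(val_losses, patience=5))
```

Under these assumptions, a patience of 5 stops after epoch 8 with the best validation loss at epoch 3, and a patience of 3 stops after epoch 6.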


Checkpoint destination directory /home/azureuser/cloudfiles/code/Users/Michael.Sowter/Deep_Learning_Training/Text Classifier/Models/checkpoint-26 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory /home/azureuser/cloudfiles/code/Users/Michael.Sowter/Deep_Learning_Training/Text Classifier/Models/checkpoint-52 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory /home/azureuser/cloudfiles/code/Users/Michael.Sowter/Deep_Learning_Training/Text Classifier/Models/checkpoint-78 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory /home/azureuser/cloudfiles/code/Users/Michael.Sowter/Deep_Learning_Training/Text Classifier/Models/checkpoint-104 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory /home/azureuser/cloudfiles/code/Us

TrainOutput(global_step=520, training_loss=0.11145450839629541, metrics={'train_runtime': 405.7465, 'train_samples_per_second': 20.21, 'train_steps_per_second': 1.282, 'total_flos': 1086232668979200.0, 'train_loss': 0.11145450839629541, 'epoch': 20.0})
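
The `global_step=520` figure and the checkpoint numbers (26, 52, 78, ...) can be checked with a little arithmetic from the settings used above. This sketch assumes a single device and no gradient accumulation.

```python
import math

n_train = 410        # training rows from the DatasetDict output
batch_size = 16      # per_device_train_batch_size
epochs = 20          # num_train_epochs

# Assumption: a single device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```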