If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cells.

In [None]:
!pip install sentencepiece==0.1.91

Collecting sentencepiece==0.1.91
  Downloading sentencepiece-0.1.91-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-1.15.1-py3-none-any.whl (290 kB)

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cells and input your username and password:

In [None]:
%load_ext tensorboard

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS. Run the following cell:

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 0s (19.3 MB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155222 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


Make sure your version of Transformers is at least 4.11.0, since this functionality (sharing a model to the Hub) was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.12.5
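Reading the version by eye works here, but a programmatic check avoids the pitfall that "4.9.0" sorts after "4.11.0" when compared as plain text. A minimal sketch, assuming plain dotted version strings; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of Transformers:

```python
import re

def version_tuple(version_string):
    """Return the first three numeric components of a version string as ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version_string)[:3])

def meets_minimum(installed, minimum="4.11.0"):
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("4.12.5"))  # True (the version printed above)
print(meets_minimum("4.9.2"))   # False, although "4.9.2" > "4.11.0" as strings
```

In the notebook itself, the call would be `meets_minimum(transformers.__version__)`.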


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/My\ Drive/Colab\ Notebooks/AAVE_SAE

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/AAVE_SAE


## Preprocess my dataset

In [None]:
sae = open("data/sae_samples.tsv").read().replace('\t', ' ')
sae = sae.split('\n')
sae = sae[0:-1]
print(len(sae))

aave = open("data/aave_samples.tsv").read().replace('\t', ' ')
aave = aave.split('\n')
print(len(aave))

2019
2019
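The two files are read slightly differently above: the last element is dropped for `sae` but not for `aave`, which presumably reflects whether each file ends with a newline. The sketch below handles both cases with one code path. `parse_samples` and `read_samples` are hypothetical helpers, and the one-sample-per-line layout is an assumption based on the code above:

```python
from pathlib import Path

def parse_samples(text):
    """Split raw TSV text into one sample per line, with tabs flattened to spaces."""
    lines = text.replace("\t", " ").split("\n")
    # Drop the empty string left behind by a final newline, if there is one.
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return lines

def read_samples(path):
    """Read a file and parse it with parse_samples."""
    return parse_samples(Path(path).read_text(encoding="utf-8"))

print(parse_samples("a\tb\nc d\n"))  # ['a b', 'c d']
print(parse_samples("a\tb\nc d"))    # ['a b', 'c d']
```

With these helpers, both corpora could be loaded with `read_samples("data/sae_samples.tsv")` and `read_samples("data/aave_samples.tsv")`.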


In [None]:
import pandas as pd

In [None]:
def merge_two_dicts(x, y):
    z = x.copy()   # start with keys and values of x
    z.update(y)    # modifies z with keys and values of y
    return z

In [None]:
def gen_data_with_label(sae, aave):
  # Label convention: 0 = SAE sentence, 1 = AAVE sentence.
  data_sae = pd.DataFrame({'sentence': sae, 'label': 0})
  data_aave = pd.DataFrame({'sentence': aave, 'label': 1})

  # Stack the SAE rows first, then the AAVE rows, and renumber the index.
  return pd.concat([data_sae, data_aave], ignore_index=True)

In [None]:
train = gen_data_with_label(sae[0:1615], aave[0:1615])
val = gen_data_with_label(sae[1615:1817], aave[1615:1817])
test = gen_data_with_label(sae[1817:], aave[1817:])


In [None]:
test

Unnamed: 0,sentence,label
0,"""Silver lining""? Really, jackass? People could...",0
1,I feel you! I only ball when there is a tourna...,0
2,"For those who are interested, I will be playin...",0
3,"I've gone bungee jumping, but I was drunk. I d...",0
4,You are not right for holding back a tweet. Be...,0
...,...,...
399,At least I'm happy with knowing I'm always coo...,1
400,I love when you ad lip. If that's what you cal...,1
401,ISO a nice place to do karaoke or DJ music tha...,1
402,"Must be an establishment with a ""Grown Folks"" ...",1


In [None]:
train.to_csv("data/training/train_labels.csv", index=False)
val.to_csv("data/training/val_labels.csv", index=False)
test.to_csv("data/training/test_labels.csv", index=False)
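The slice boundaries 1615 and 1817 used for the split are consistent with an 80/10/10 split of the 2019 samples per class. That reading is inferred from the numbers and is not stated in the notebook. A small sketch of the arithmetic, with `split_bounds` as a hypothetical helper:

```python
def split_bounds(n_samples, train_frac=0.8, val_frac=0.1):
    """Return (train_end, val_end) slice boundaries for a train/val/test split."""
    train_end = int(n_samples * train_frac)
    val_end = int(n_samples * (train_frac + val_frac))
    return train_end, val_end

train_end, val_end = split_bounds(2019)
print(train_end, val_end)  # 1615 1817

# Each split takes the same slice from both corpora, so rows = 2 * per-class size.
print(2 * train_end, 2 * (val_end - train_end), 2 * (2019 - val_end))  # 3230 404 404
```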

## Load Data

In [None]:
import torch
import transformers
import pandas as pd
from pprint import pprint
from tqdm.notebook import tqdm

In [None]:
# train = pd.read_csv("data/training/train_labels.csv")
# val = pd.read_csv("data/training/val_labels.csv")
# train

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files={'train':'data/training/train_labels.csv', 'validation':'data/training/val_labels.csv', 'test':'data/training/test_labels.csv'})

Using custom data configuration default-d0184123eea6dbd7


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d0184123eea6dbd7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d0184123eea6dbd7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a. Subsequent calls will reuse this data.



In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 3230
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 404
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 404
    })
})

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
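The rejection loop in `show_random_elements` keeps drawing indices until it has enough distinct ones. The standard library's `random.sample` gives the same guarantee in a single call. A sketch, where `pick_distinct` is a hypothetical helper and the `seed` parameter is added only to make the demo repeatable:

```python
import random

def pick_distinct(n_items, num_examples=10, seed=None):
    """Return num_examples distinct indices drawn from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index can repeat.
    return rng.sample(range(n_items), num_examples)

picks = pick_distinct(3230, 10, seed=0)
print(len(picks), len(set(picks)))  # 10 10
```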

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,sentence,label
0,Tears are steady pouring from my eyes. I hate that I have to go through this alone.,1
1,Josh wilson checks out of the game in his head but his team helped him to cheat. I got the guy tripped up on the court and the ball went pasr his head and I died laughing,0
2,My supervisor tole me I can telecommute from another location. Wish I knew I had This option three years ag,0
3,"""That rain is very inconvenient."" Yes, it is. ""I just wish it would stop."" You aren't the only one",0
4,"The amount of liquor I drank last night was ungodly. But I didn't have a lick of a hangover...thank you, Patron God",0
5,"I'm trying to take this shit over next year. I have a plan! Hey bro, are you still sunny side up",0
6,"French Montana and his music are a joke to me, I don't even take him seriously. Like how do you style Kanye and Ying Yang Twins' ad-libs",0
7,Seen Vicki tonight! I love her & Tyrina! & Toya & Jasmine & Krystal crazy asses!!,1
8,well i dont know about those streets where you are but my chipotle is full of yuppies and professionals on lunch,1
9,Sooo how the Sigmas feel about Morris Chestnut throwing up the hooks in Best Man 2 tho? I bet he felt great doing it,1


In [None]:
model_checkpoint = "distilbert-base-cased"
task = "cola"
batch_size = 16

In [None]:
from datasets import load_metric
# metric = load_metric('glue', 'cola') # accuracy & f1

In [None]:
# print(f"Sentence: {dataset['train'][0]['sentence']}")

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 28996
}

loading file https://huggingface.co/distilbert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/ba377304984dc63e3ede0e23a938bbbf04d5c3835b66d5bb48343aecca188429.437aa611e89f6fc6675a049d2b5545390adbc617e7d65

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True)

In [None]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2268, 3121, 12008, 112, 184, 1108, 1515, 170, 2398, 2258, 7314, 120, 7210, 1114, 170, 1299, 12714, 1106, 1129, 2130, 119, 102], [101, 1109, 1376, 1873, 1110, 1136, 1280, 1106, 1782, 1272, 178, 1821, 1103, 2226, 117, 1177, 1131, 2993, 1106, 1831, 1184, 1131, 1110, 1833, 102], [101, 1124, 1110, 8829, 2266, 1208, 1105, 146, 112, 182, 1205, 1303, 2033, 2407, 119, 1135, 787, 188, 1164, 1106, 1301, 1205, 119, 27453, 1566, 11437, 1566, 119, 102], [101, 12216, 117, 146, 1821, 1150, 146, 1821, 119, 146, 112, 182, 1694, 170, 23043, 8032, 4404, 119, 146, 112, 182, 1833, 1122, 1111, 156, 2328, 4426, 1105, 1139, 4067, 117, 1115, 112, 188, 1139, 7533, 1142, 1214, 1105, 1191, 1128, 1274, 112, 189, 1176, 1122, 117, 1243, 1149, 102], [101, 1135, 1110, 1304, 4054, 1115, 146, 1243, 1184, 146, 1328, 117, 1133, 1208, 1115, 1110, 1144, 2171, 117, 146, 1202, 1136, 2197, 1113, 3196, 1122, 117, 1185, 2187, 1103, 2616, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-d0184123eea6dbd7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-e400ca343a0c5f22.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-d0184123eea6dbd7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-ac3d3ebeab378714.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-d0184123eea6dbd7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-dbc5b16080a79d2a.arrow
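`preprocess_function` truncates but does not pad, so the encoded sequences keep different lengths. When a tokenizer is passed to `Trainer`, batches are by default padded at collation time with `DataCollatorWithPadding`. The pure-Python sketch below only illustrates that idea. `pad_batch` is a hypothetical helper, the token ids are toy values, and the pad id of 0 is taken from `pad_token_id` in the config printed above:

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a fixed global length keeps short batches short, which matters for tweet-length inputs like these.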


In [None]:
# dataset['train'][1610:1620]

In [None]:
# dataset['validation'][200:204]

## Fine-tuning

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 28996
}

loading weights file https://huggingface.co/distilbert-base-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c9f39769dba4c5fe379b4bc82973eb01297bd607954621434eb9f1bc85a23a0.06b428c87335c1bb22eae46fdab31

In [None]:
# metric_name = "matthews_correlation"
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "steps",
    save_strategy = "steps",
    eval_steps = 200,
    save_steps = 200,
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=1e-5,
    load_best_model_at_end=True, ####
    metric_for_best_model=metric_name,
    push_to_hub=False, ####
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert the logits to predicted class ids.
    predictions = np.argmax(predictions, axis=1)

    metric = load_metric("accuracy")
    computed = metric.compute(predictions=predictions, references=labels)

    metric = load_metric("precision")
    computed = dict(metric.compute(predictions=predictions, references=labels), **computed)

    metric = load_metric("recall")
    computed = dict(metric.compute(predictions=predictions, references=labels), **computed)

    metric = load_metric("f1")
    computed = dict(metric.compute(predictions=predictions, references=labels), **computed)

    # Fraction of examples predicted as class 1; 0.0 or 1.0 means the model predicts a single class.
    computed['pred_1_ratio'] = predictions.mean()

    return computed

In [None]:
# Freeze every parameter, then re-enable gradients only for the classification head below.
for para in model.parameters():
  para.requires_grad = False

model.pre_classifier.weight.requires_grad = True
model.pre_classifier.bias.requires_grad = True
model.classifier.weight.requires_grad = True
model.classifier.bias.requires_grad = True
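With the encoder frozen, only the two linear layers of the classification head are trained. Assuming `pre_classifier` maps `dim` to `dim` and `classifier` maps `dim` to `num_labels` (with `dim` = 768 from the config above and `num_labels` = 2), the number of trainable parameters can be worked out by hand. `linear_params` is a hypothetical helper:

```python
def linear_params(in_features, out_features):
    """Parameter count of a linear layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

dim, num_labels = 768, 2
pre_classifier_params = linear_params(dim, dim)     # 768*768 + 768
classifier_params = linear_params(dim, num_labels)  # 768*2 + 2
trainable = pre_classifier_params + classifier_params

print(pre_classifier_params, classifier_params, trainable)  # 590592 1538 592130
```

On the real model, `sum(p.numel() for p in model.parameters() if p.requires_grad)` should report the same figure if those layer shapes hold.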

In [None]:
print(model.pre_classifier.weight)
print(model.distilbert.embeddings.word_embeddings.weight)

Parameter containing:
tensor([[ 0.0052,  0.0015,  0.0189,  ...,  0.0072, -0.0124, -0.0127],
        [ 0.0228, -0.0137,  0.0274,  ..., -0.0083,  0.0069,  0.0273],
        [ 0.0028,  0.0062,  0.0108,  ..., -0.0026,  0.0613,  0.0181],
        ...,
        [ 0.0193,  0.0156,  0.0020,  ...,  0.0228, -0.0045, -0.0297],
        [ 0.0420, -0.0160, -0.0118,  ..., -0.0154, -0.0189, -0.0012],
        [-0.0431,  0.0115, -0.0298,  ...,  0.0177, -0.0194, -0.0027]],
       requires_grad=True)
Parameter containing:
tensor([[-2.5130e-02, -3.3044e-02, -2.4396e-03,  ..., -1.0848e-02,
         -4.6824e-02, -9.4855e-03],
        [-4.8244e-03, -2.1486e-02, -8.7145e-03,  ..., -2.6029e-02,
         -3.7862e-02, -2.4103e-02],
        [-1.6531e-02, -1.7862e-02,  1.0596e-03,  ..., -1.6371e-02,
         -3.5670e-02, -3.1419e-02],
        ...,
        [-9.6466e-03,  1.4814e-02, -2.9182e-02,  ..., -3.7873e-02,
         -4.6263e-02, -1.6803e-02],
        [-1.3170e-02,  6.5378e-05, -3.7222e-02,  ..., -4.3558e-02,
   

In [None]:
from datasets import concatenate_datasets
# validation_key = "validation"
trainer = Trainer(
    model,
    args,
    # optimizer = model.parameters(),
    train_dataset=concatenate_datasets([encoded_dataset["train"], encoded_dataset["validation"]]),
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence.
***** Running training *****
  Num examples = 3230
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1010


Step,Training Loss,Validation Loss,F1,Recall,Precision,Accuracy,Pred 1 Ratio
20,No log,0.694865,0.0,0.0,0.0,0.5,0.0
40,No log,0.693728,0.0,0.0,0.0,0.5,0.0
60,No log,0.743834,0.0,0.0,0.0,0.5,0.0
80,No log,0.695763,0.0,0.0,0.0,0.5,0.0
100,No log,0.693161,0.666667,1.0,0.5,0.5,1.0
120,No log,0.697302,0.0,0.0,0.0,0.5,0.0
140,No log,0.73879,0.666667,1.0,0.5,0.5,1.0
160,No log,0.705819,0.0,0.0,0.0,0.5,0.0
180,No log,0.697057,0.0,0.0,0.0,0.5,0.0
200,No log,0.702646,0.666667,1.0,0.5,0.5,1.0


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence.
***** Running Evaluation *****
  Num examples = 404
  Batch size = 64




TrainOutput(global_step=1010, training_loss=0.7003464240838986, metrics={'train_runtime': 6484.9753, 'train_samples_per_second': 2.49, 'train_steps_per_second': 0.156, 'total_flos': 197592200971032.0, 'train_loss': 0.7003464240838986, 'epoch': 5.0})
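The `Total optimization steps = 1010` line in the training log follows from the example count, batch size, and epoch count reported in the same log. A sketch of the arithmetic, assuming one optimizer step per batch (gradient accumulation of 1, as logged) and that the final partial batch of each epoch still counts as a step; `total_steps` is a hypothetical helper:

```python
import math

def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when every batch, including a final partial one, is one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Values as reported in the training log above.
print(math.ceil(3230 / 16))      # 202 steps per epoch
print(total_steps(3230, 16, 5))  # 1010
```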

In [None]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence.
***** Running Evaluation *****
  Num examples = 404
  Batch size = 64


{'epoch': 5.0,
 'eval_accuracy': 0.8688118811881188,
 'eval_f1': 0.8644501278772377,
 'eval_loss': 0.31895017623901367,
 'eval_precision': 0.8941798941798942,
 'eval_pred_1_ratio': 0.46782178217821785,
 'eval_recall': 0.8366336633663366,
 'eval_runtime': 2.7362,
 'eval_samples_per_second': 147.652,
 'eval_steps_per_second': 2.558}
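The reported evaluation metrics are mutually consistent with a confusion matrix of 169 true positives, 20 false positives, 33 false negatives, and 182 true negatives on the 404 test sentences. Those counts are inferred from the printed ratios and are not taken from the notebook. The sketch below recomputes the metrics from the counts; `metrics_from_counts` is a hypothetical helper:

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Binary classification metrics from confusion-matrix counts (class 1 is positive)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pred_1_ratio": (tp + fp) / total,
    }

final = metrics_from_counts(tp=169, fp=20, fn=33, tn=182)
print({name: round(value, 4) for name, value in final.items()})
# {'accuracy': 0.8688, 'precision': 0.8942, 'recall': 0.8366, 'f1': 0.8645, 'pred_1_ratio': 0.4678}

# A model that always predicts class 1 on the balanced test set (202 sentences per class).
always_one = metrics_from_counts(tp=202, fp=202, fn=0, tn=0)
print(round(always_one["f1"], 6), always_one["accuracy"])  # 0.666667 0.5
```

The second call reproduces the rows of the training table where `Pred 1 Ratio` is 1.0: an F1 of 0.666667 at 0.5 accuracy is what a constant prediction yields on balanced data, not a sign of learning.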

In [None]:
!ls

bart_output.csv  distilbert-base-cased-finetuned-cola  T5_output.csv
data		 gpt2_output.csv
