<a href="https://colab.research.google.com/github/Kira1108/huggingface-examples/blob/main/CustomDatasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### When you are working on ML projects, you are expected to write dirty code.
- Don't use design patterns at all.
- Don't use data structures and algorithms at all.
- Use vectorized operations.
- Use encapsulated packages such as scikit-learn and TensorFlow.
- Explore the data first, and explore it carefully.

**Install Transformer Packages & Download Raw Dataset**

In [9]:
from IPython.display import clear_output

!pip install transformers datasets
!wget -nc https://lazyprogrammer.me/course_files/AirlineTweets.csv
!mkdir -p data
!mv AirlineTweets.csv ./data

clear_output()

In [10]:
import logging
logger = logging.getLogger("artifacts")

import pandas as pd
import os
import json
from sklearn.preprocessing import LabelEncoder
from pathlib import Path
from joblib import load, dump

**Clean data for transformer training**    

Although this is not a standard way to transform labels and prepare a training dataset,
it is what transformers requires.

In [11]:
class ArtifactStore:
    """`ArtifactStore` stores the files you create while doing ML.

    :param artifacts_fp: root folder for artifacts (defaults to `./artifacts`)
    """

    def __init__(self, artifacts_fp="./artifacts"):
        self.artifacts_fp = Path(artifacts_fp)
        os.makedirs(self.artifacts_fp, exist_ok=True)

    def log_binary(self, obj, fname):
        fpath = self.artifacts_fp / f"{fname}.joblib"
        dump(obj, fpath)
        logger.info(f"Dumped binary to {fpath}")

    def load_binary(self, fname):
        fpath = self.artifacts_fp / f"{fname}.joblib"
        return load(fpath)

    def log_json(self, obj, fname):
        fpath = self.artifacts_fp / f"{fname}.json"
        # use a context manager so the file handle is closed and flushed
        with open(fpath, "w") as f:
            json.dump(obj, f)
        logger.info(f"Dumped json file to {fpath}")

    def load_json(self, fname):
        fpath = self.artifacts_fp / f"{fname}.json"
        with open(fpath, "r") as f:
            return json.load(f)

    def log_label_encoder(self, label_encoder):
        # keep both the fitted encoder and a human-readable id -> class map
        self.log_binary(label_encoder, "label_encoder")

        classmap = {i: c for i, c in enumerate(label_encoder.classes_)}
        self.log_json(classmap, "label_encoder_classmap")

alog = ArtifactStore()

**Don't try to write better code when doing ML. (Do that only if the code makes money for you.)**

In [12]:
# 0. do settings
DATA_PATH = Path('./data')
ARTIFACTS_PATH = Path("./artifacts")

# 1. read data
df = pd.read_csv(DATA_PATH / "AirlineTweets.csv")[['text','airline_sentiment']]

# 2. ml preprocessing
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['airline_sentiment'])

# 3. do dirty column operations
df.drop('airline_sentiment', axis = 1, inplace = True)
df.rename(columns = {'text':'sentence'}, inplace = True)

# 4. log whatever will be needed later
alog.log_label_encoder(label_encoder)

# 5. additional steps for your task
df.to_csv(DATA_PATH / "train_data.csv", index = False)

# 6. validate the steps above
print(df.head(5))

                                            sentence  label
0                @VirginAmerica What @dhepburn said.      1
1  @VirginAmerica plus you've added commercials t...      2
2  @VirginAmerica I didn't today... Must mean I n...      1
3  @VirginAmerica it's really aggressive to blast...      0
4  @VirginAmerica and it's a really big bad thing...      0


**Load artifacts for later use**

In [13]:
alog.load_binary('label_encoder')\
    .transform(['negative','positive','neutral','positive'])

array([0, 2, 1, 2])
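
The ids returned above follow from the way `LabelEncoder` assigns integers: it takes the distinct class names in sorted order. The sketch below mimics that behaviour in plain Python. `fit_label_ids` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
# Hypothetical stand-in for sklearn's LabelEncoder, for illustration only.
def fit_label_ids(labels):
    """Map each distinct label to an integer id, taking labels in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


# "Fit" on the sentiment column, then encode the same inputs as the cell above.
label_to_id = fit_label_ids(["neutral", "positive", "negative", "neutral"])
encoded = [label_to_id[name] for name in ["negative", "positive", "neutral", "positive"]]

print(label_to_id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)      # [0, 2, 1, 2]
```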

In [14]:
alog.load_json('label_encoder_classmap')

{'0': 'negative', '1': 'neutral', '2': 'positive'}
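
Note that the keys come back as strings (`'0'`, not `0`), because JSON object keys are always strings. A minimal sketch of the round trip, and of one way to restore integer keys if you need them (the variable names here are illustrative):

```python
import json

# The class map as it is built before saving: integer ids as keys.
classmap = {0: "negative", 1: "neutral", 2: "positive"}

# A JSON round trip turns the integer keys into strings.
restored = json.loads(json.dumps(classmap))
print(restored)  # {'0': 'negative', '1': 'neutral', '2': 'positive'}

# Convert the keys back to int before using the map on model predictions.
id_to_label = {int(key): value for key, value in restored.items()}
print(id_to_label[2])  # positive
```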

**Load a csv dataset**

In [15]:
from datasets import load_dataset

dataset = load_dataset(
    "csv", 
    data_files = str((DATA_PATH / "train_data.csv").resolve(strict = True))
)

dataset



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d6aed57e2ac76d82/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d6aed57e2ac76d82/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 14640
    })
})

**Indexed by position**

In [16]:
for i in range(10):
    print(dataset['train'][i])

{'sentence': '@VirginAmerica What @dhepburn said.', 'label': 1}
{'sentence': "@VirginAmerica plus you've added commercials to the experience... tacky.", 'label': 2}
{'sentence': "@VirginAmerica I didn't today... Must mean I need to take another trip!", 'label': 1}
{'sentence': '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse', 'label': 0}
{'sentence': "@VirginAmerica and it's a really big bad thing about it", 'label': 0}
{'sentence': "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA", 'label': 0}
{'sentence': '@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)', 'label': 2}
{'sentence': '@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP', 'label': 1}
{'sentence': "@virginamerica Well, I didn't…but NOW I DO! :-D", 'label': 2}
{'senten

**Indexed by column**

In [17]:
dataset['train']['sentence'][:3]

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky.",
 "@VirginAmerica I didn't today... Must mean I need to take another trip!"]

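**Dataset frameworks share the same three building blocks**
1. Dataset object. A container for the data and the central object of the framework.
2. Dataset properties. They describe the data (for example, `features` and `num_rows`).
3. Dataset transformations. They alter a dataset and return a new dataset object; the original is left unchanged.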

In [18]:
dataset.sort('label')['train'][:10]['label']

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [19]:
dataset['train'][:10]['label']

[1, 2, 1, 0, 0, 0, 2, 1, 2, 2]
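
The two cells above show that `sort` returned a new dataset and left the original order intact. The toy class below sketches the same three ideas (object, properties, transformations) in plain Python. `TinyDataset` is a hypothetical illustration and not how the `datasets` library is implemented.

```python
class TinyDataset:
    """Hypothetical column-oriented container, for illustration only."""

    def __init__(self, columns):
        # Copy the columns so that the dataset owns its data.
        self.columns = {name: list(values) for name, values in columns.items()}

    @property
    def features(self):
        """Property: the column names."""
        return list(self.columns)

    @property
    def num_rows(self):
        """Property: the number of rows (0 for an empty dataset)."""
        return len(next(iter(self.columns.values()), []))

    def sort(self, column):
        """Transformation: return a NEW dataset ordered by `column`."""
        order = sorted(range(self.num_rows), key=lambda i: self.columns[column][i])
        return TinyDataset(
            {name: [values[i] for i in order] for name, values in self.columns.items()}
        )


ds = TinyDataset({"sentence": ["a", "b", "c"], "label": [1, 2, 0]})
sorted_ds = ds.sort("label")

print(sorted_ds.columns["label"])  # [0, 1, 2]
print(ds.columns["label"])         # [1, 2, 0]  (original untouched)
```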

**Train test split**

In [20]:
split_dataset = dataset['train'].train_test_split(test_size = 0.3)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 10248
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 4392
    })
})
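
The row counts above are what a shuffled 70/30 cut of 14640 rows gives. The sketch below reproduces the arithmetic with the standard library only. `split_indices` is a hypothetical helper, and the rounding rule it uses is an assumption rather than the library's exact implementation. For a reproducible split in the real code, `train_test_split` also accepts a `seed` argument.

```python
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    num_test = round(num_rows * test_size)    # assumption: simple rounding
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(14640, 0.3)
print(len(train_idx), len(test_idx))  # 10248 4392
```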

In [21]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    return tokenizer(batch['sentence'], truncation = True)

tokenized_dataset = split_dataset.map(tokenize_fn, batched = True)
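
`tokenize_fn` truncates but does not pad, so the tokenized rows have different lengths. Padding is left to batch time: as far as I know, when a tokenizer is passed to `Trainer` it pads each batch to that batch's longest sequence. The sketch below shows that idea in plain Python. `pad_batch`, the token ids, and `pad_id=0` are made up for illustration.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_input_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
batch = pad_batch([[101, 2054, 102], [101, 2054, 2056, 1012, 102]])
print(batch["input_ids"])       # [[101, 2054, 102, 0, 0], [101, 2054, 2056, 1012, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short tweets from being padded out to the longest tweet in the whole dataset.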



In [22]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [23]:
training_args = TrainingArguments(
    output_dir = ARTIFACTS_PATH / "training_dir",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,    
)

In [24]:
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": np.mean(predictions == labels),
        "f1": f1_score(y_true=labels, y_pred=predictions, average="weighted"),
    }
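
To make the two metrics concrete, here is a small worked example in plain Python. It is a sketch of what accuracy and support-weighted F1 compute, not a replacement for `f1_score`. The helpers `argmax` and `weighted_f1` and the logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # (tp + fn) is the class support
    return score


# Four made-up examples with three class scores each.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.9, 0.2], [0.0, 0.2, 3.0]]
labels = [0, 0, 1, 2]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

print(predictions)                            # [0, 1, 1, 2]
print(accuracy)                               # 0.75
print(round(weighted_f1(labels, predictions), 4))  # 0.75
```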

In [25]:
trainer = Trainer(
    model = model, 
    args = training_args, 
    train_dataset = tokenized_dataset['train'], 
    eval_dataset = tokenized_dataset['test'],
    tokenizer = tokenizer,
    compute_metrics=compute_metrics
    )

In [26]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10248
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1923
  Number of trainable parameters = 66955779
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.5379 | 0.489832 | 0.805783 | 0.811782 |
| 2 | 0.3433 | 0.427317 | 0.844035 | 0.839002 |
| 3 | 0.2378 | 0.583586 | 0.845173 | 0.844736 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4392
  Batch size = 64
Saving model checkpoint to artifacts/training_dir/checkpoint-641
Configuration saved in artifacts/training_dir/checkpoint-641/config.json
Model weights saved in artifacts/training_dir/checkpoint-641/pytorch_model.bin
tokenizer config file saved in artifacts/training_dir/checkpoint-641/tokenizer_config.json
Special tokens file saved in artifacts/training_dir/checkpoint-641/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can sa

TrainOutput(global_step=1923, training_loss=0.32508383923003403, metrics={'train_runtime': 82.9951, 'train_samples_per_second': 370.431, 'train_steps_per_second': 23.17, 'total_flos': 356637574436832.0, 'train_loss': 0.32508383923003403, 'epoch': 3.0})
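
The step counts in the log can be checked by hand. The sketch below assumes a single device, no gradient accumulation (both as reported in the log), and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 10248   # "Num examples" in the training log
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# 10248 / 16 = 640.5, so the last batch of each epoch is only half full.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 641, matching the first checkpoint name (checkpoint-641)
print(total_steps)      # 1923, matching "Total optimization steps"
```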