<a href="https://colab.research.google.com/github/Kira1108/huggingface-examples/blob/main/CustomDatasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### When you are working on ML projects, you are expected to write dirty code.     
- Don't use design patterns at all.
- Don't use data structures and algorithms at all.
- Use vectorized operations.
- Use encapsulated packages such as scikit-learn, tensorflow, etc.
- Explore the data first, and do it carefully.

**Install Transformer Packages**

In [1]:
# from IPython.display import clear_output

# !pip install transformers datasets

# clear_output()

**Download Raw Dataset**

In [2]:
# !wget -nc https://lazyprogrammer.me/course_files/AirlineTweets.csv
# !mv AirlineTweets.csv /data

In [34]:
import logging
logging.basicConfig(level = logging.DEBUG)
logger = logging.getLogger("artifacts")

import pandas as pd
import os
import json
from sklearn.preprocessing import LabelEncoder
from pathlib import Path
from joblib import load, dump

**Clean data for transformer training**    

Although this is not a standard way of transforming labels and preparing a training dataset,   
it is the format that transformers requires.
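
As a rough illustration of what the label transformation does, the sketch below mimics the mapping in plain Python: the distinct class names are sorted, and each one is replaced by its position in that sorted list. The helper names `fit_classes` and `encode` are hypothetical and exist only for this example; the notebook itself relies on scikit-learn's `LabelEncoder`.

```python
def fit_classes(labels):
    """Return the sorted list of distinct class names."""
    return sorted(set(labels))


def encode(labels, classes):
    """Replace each class name with its index in `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]


# Hypothetical sample labels, using the three sentiment classes of the dataset.
classes = fit_classes(["neutral", "positive", "neutral", "negative"])
ids = encode(["negative", "positive", "neutral", "positive"], classes)

print(classes)  # ['negative', 'neutral', 'positive']
print(ids)      # [0, 2, 1, 2]
```

Because the ids depend on the sorted class list, the fitted encoder has to be saved alongside the data so the same mapping can be reused later.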

In [36]:
class ArtifactStore:
    """`ArtifactStore` stores files that you create when doing ML.
        :param artifacts_fp: root folder of the artifacts path (defaults to `./artifacts`)
    """
    
    def __init__(self, artifacts_fp = "./artifacts"):
        self.artifacts_fp = Path(artifacts_fp)
        os.makedirs(self.artifacts_fp,exist_ok=True)

    def log_binary(self, obj, fname):
        fpath = self.artifacts_fp / f"{fname}.joblib"
        dump(obj,fpath)
        logger.info(f"Dumped binary to {fpath}")
        
    def load_binary(self, fname):
        fpath = self.artifacts_fp / f"{fname}.joblib"
        return load(fpath)
        
    def log_json(self, obj, fname):
        fpath = self.artifacts_fp / f"{fname}.json"
        # use a context manager so the file is flushed and closed
        with open(fpath, 'w') as f:
            json.dump(obj, f)
        logger.info(f"Dumped json file to {fpath}")
        
    def load_json(self, fname):
        fpath = self.artifacts_fp / f"{fname}.json"
        # use a context manager so the file handle is closed after reading
        with open(fpath, 'r') as f:
            return json.load(f)
        
    def log_label_encoder(self, label_encoder):
        self.log_binary(label_encoder,"label_encoder")
        
        classmap = {i:c for i,c in enumerate(label_encoder.classes_)}
        self.log_json(classmap,'label_encoder_classmap')
            
alog = ArtifactStore()

**Don't try to write better code when doing ML. (Do that only if the code makes money for you.)**

In [33]:
# 0. do settings
DATA_PATH = Path('./data')
ARTIFACTS_PATH = Path("./artifacts")

# 1. read data
df = pd.read_csv(DATA_PATH / "AirlineTweets.csv")[['text','airline_sentiment']]

# 2. ml preprocessing
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['airline_sentiment'])

# 3. do dirty column operations
df.drop('airline_sentiment', axis = 1, inplace = True)
df.rename(columns = {'text':'sentence'}, inplace = True)

# 4. log whatever will be needed later
alog.log_label_encoder(label_encoder)

# 5. additional steps for your task
df.to_csv(DATA_PATH / "train_data.csv", index = False)

# 6. validate the steps above
print(df.head(5))

INFO:artifacts:Dumped binary to artifacts/label_encoder.joblib
INFO:artifacts:Dumped json file to artifacts/label_encoder_classmap.json


                                            sentence  label
0                @VirginAmerica What @dhepburn said.      1
1  @VirginAmerica plus you've added commercials t...      2
2  @VirginAmerica I didn't today... Must mean I n...      1
3  @VirginAmerica it's really aggressive to blast...      0
4  @VirginAmerica and it's a really big bad thing...      0
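
Printing the head only shows the first few rows. A slightly stricter check, sketched below with the standard library only, reads the whole CSV back and confirms that it has exactly the `sentence` and `label` columns and that every label is an integer. The function name `validate_training_csv` and the throwaway demo file are assumptions made for this example; they are not part of the notebook.

```python
import csv
import os
import tempfile


def validate_training_csv(path, expected_columns=("sentence", "label")):
    """Return the number of data rows; raise ValueError if the layout is wrong."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if tuple(reader.fieldnames or ()) != tuple(expected_columns):
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        n_rows = 0
        for row in reader:
            label = row["label"]
            # labels must be present and look like non-negative integers
            if label is None or not label.isdigit():
                raise ValueError(f"non-integer label in data row {n_rows}: {label!r}")
            n_rows += 1
    return n_rows


# Demo on a tiny temporary file with the same layout as train_data.csv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train_data.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "label"])
        writer.writerow(["@VirginAmerica What @dhepburn said.", 1])
        writer.writerow(["a tweet with\na line break", 0])
    n_rows = validate_training_csv(demo_path)

print(n_rows)  # 2
```

Opening the file with `newline=""` lets the `csv` module handle tweets that contain line breaks inside a quoted field.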


**Load artifacts for later use**

In [41]:
alog.load_binary('label_encoder')\
    .transform(['negative','positive','neutral','positive'])

array([0, 2, 1, 2])

In [42]:
alog.load_json('label_encoder_classmap')

{'0': 'negative', '1': 'neutral', '2': 'positive'}
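
Note that the keys come back as strings: `log_label_encoder` wrote integer keys, but JSON object keys are always strings. The sketch below shows one way to restore integer ids after loading, using only the standard library. `restore_int_keys` is a hypothetical helper, not a method of `ArtifactStore`.

```python
import json


def restore_int_keys(classmap):
    """Convert the string keys produced by a JSON round trip back to ints."""
    return {int(key): value for key, value in classmap.items()}


original = {0: "negative", 1: "neutral", 2: "positive"}

# Simulate writing the class map to JSON and reading it back.
loaded = json.loads(json.dumps(original))
print(loaded)  # {'0': 'negative', '1': 'neutral', '2': 'positive'}

id2label = restore_int_keys(loaded)
label2id = {label: idx for idx, label in id2label.items()}
print(id2label)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
print(label2id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
```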

**Load a csv dataset**

In [64]:
from datasets import load_dataset

dataset = load_dataset(
    "csv", 
    data_files = str((DATA_PATH / "train_data.csv").resolve(strict = True))
)

dataset

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /datasets.huggingface.co/datasets/datasets/csv/csv.py HTTP/1.1" 200 0



DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 14640
    })
})

**Indexed by position**

In [71]:
for i in range(10):
    print(dataset['train'][i])

{'sentence': '@VirginAmerica What @dhepburn said.', 'label': 1}
{'sentence': "@VirginAmerica plus you've added commercials to the experience... tacky.", 'label': 2}
{'sentence': "@VirginAmerica I didn't today... Must mean I need to take another trip!", 'label': 1}
{'sentence': '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse', 'label': 0}
{'sentence': "@VirginAmerica and it's a really big bad thing about it", 'label': 0}
{'sentence': "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA", 'label': 0}
{'sentence': '@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)', 'label': 2}
{'sentence': '@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP', 'label': 1}
{'sentence': "@virginamerica Well, I didn't…but NOW I DO! :-D", 'label': 2}
{'senten

**Indexed by column**

In [74]:
dataset['train']['sentence'][:3]

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky.",
 "@VirginAmerica I didn't today... Must mean I need to take another trip!"]

**All dataset frameworks share the same building blocks**     
1. Dataset object. A container for data; the central object of a dataset framework.
2. Dataset properties. Used to describe the data.
3. Dataset transformations. Alter a dataset and return a new dataset object.

In [88]:
dataset.sort('label')['train'][:10]['label']



[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [89]:
dataset['train'][:10]['label']

[1, 2, 1, 0, 0, 0, 2, 1, 2, 2]
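
The two cells above show that `sort` returns a new object and leaves the original order untouched. The toy class below sketches the same three ideas (a container, descriptive properties, and a transformation that returns a new object) in plain Python. `ToyDataset` is a made-up name for illustration and is not how the `datasets` library is implemented.

```python
class ToyDataset:
    """Toy column-oriented container (illustration only)."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # copy the lists so the dataset does not share state with the caller
        self._columns = {name: list(values) for name, values in columns.items()}

    # --- properties: describe the data ---
    @property
    def features(self):
        return list(self._columns)

    @property
    def num_rows(self):
        return len(next(iter(self._columns.values()), []))

    # --- access: by position (int) or by column (str) ---
    def __getitem__(self, key):
        if isinstance(key, str):
            return list(self._columns[key])
        if isinstance(key, int):
            return {name: values[key] for name, values in self._columns.items()}
        raise TypeError("key must be a column name or a row index")

    # --- transformation: returns a NEW dataset, self is unchanged ---
    def sort(self, column):
        order = sorted(range(self.num_rows), key=lambda i: self._columns[column][i])
        return ToyDataset(
            {name: [values[i] for i in order] for name, values in self._columns.items()}
        )


toy = ToyDataset({"sentence": ["a", "b", "c", "d"], "label": [1, 2, 1, 0]})
sorted_toy = toy.sort("label")

print(sorted_toy["label"])  # [0, 1, 1, 2]
print(toy["label"])         # [1, 2, 1, 0]
print(toy[0])               # {'sentence': 'a', 'label': 1}
```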