
# FARM: Use your own dataset

In [Tutorial 1](https://colab.research.google.com/drive/130_7dgVC3VdLBPhiEkGULHmqSlflhmVM#scrollTo=tPltDefXjSiJ) you already learned about the major building blocks.   
In this tutorial, you will see how to use FARM with your own dataset. 


In [None]:
# Install FARM
!pip install farm==0.5.0


# 1) How a Processor works

<h2>Architecture</h2>
The Processor converts <b>raw input (e.g. a file) into a PyTorch dataset</b>.   
To use your own dataset, you need to adapt this Processor.
<img src="https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/data_silo_no_bg.jpg" height="400">


## Main Conversion Stages 
1. Read from file / raw input 
2. Create samples
3. Featurize samples
4. Create PyTorch Dataset

## Functions to implement

1. file_to_dicts()
2. _dict_to_samples()
3. _sample_to_features()  
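
The first three conversion stages can be tried out without FARM. The following is a minimal sketch in plain Python, not FARM's implementation: the toy vocabulary `VOCAB`, the whitespace tokenizer, and the free functions are assumptions made for illustration. Stage 4 (wrapping the features in a PyTorch dataset) is left out.

```python
import csv
import io

# Hypothetical toy vocabulary; a real tokenizer ships its own.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "good": 2, "bad": 3, "movie": 4}
LABELS = ["positive", "negative"]


def file_to_dicts(handle):
    """Stage 1: raw tab-separated input -> list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def dict_to_samples(dictionary):
    """Stage 2: one dict -> list of samples (clear text + tokens)."""
    tokens = dictionary["text"].lower().split()
    return [{"clear_text": dictionary, "tokens": tokens}]


def sample_to_features(sample, max_seq_len=6):
    """Stage 3: one sample -> fixed-length numeric features."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in sample["tokens"]][:max_seq_len]
    padding = [VOCAB["[PAD]"]] * (max_seq_len - len(ids))
    return {
        "input_ids": ids + padding,
        "padding_mask": [1] * len(ids) + [0] * len(padding),
        "label_ids": [LABELS.index(sample["clear_text"]["label"])],
    }


# An in-memory "file" stands in for train.tsv.
raw = io.StringIO("text\tlabel\ngood movie\tpositive\nbad boring movie\tnegative\n")
dicts = file_to_dicts(raw)
samples = [s for d in dicts for s in dict_to_samples(d)]
features = [sample_to_features(s) for s in samples]
print(features[0])
```

Each stage only depends on the output of the previous one, which is why swapping the input format usually means changing the first function only.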



## Example: TextClassificationProcessor

In [None]:
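from farm.data_handler.processor import *
from farm.data_handler.samples import Sample
from farm.data_handler.utils import _download_extract_downstream_data  # used by read_tsv below
from farm.modeling.tokenization import Tokenizer, tokenize_with_metadata

import os

import pandas as pd  # used by read_tsv below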

# FARM has a built-in processor for text classification
# -> farm.data_handler.processor.TextClassificationProcessor
# This is what it looks like internally:
class TextClassificationProcessor(Processor):
    """
    Used to handle the text classification datasets that come in tabular format (CSV, TSV, etc.)
    """
    def __init__(
        self,
        tokenizer,
        max_seq_len,
        data_dir,
        label_list=None,
        metric=None,
        train_filename="train.tsv",
        dev_filename=None,
        test_filename="test.tsv",
        dev_split=0.1,
        delimiter="\t",
        quote_char="'",
        skiprows=None,
        label_column_name="label",
        multilabel=False,
        header=0,
        proxies=None,
        max_samples=None,
        **kwargs
    ):
        #TODO If an arg is misspelt, e.g. metrics, it will be swallowed silently by kwargs

        # Custom processor attributes
        self.delimiter = delimiter
        self.quote_char = quote_char
        self.skiprows = skiprows
        self.header = header
        self.max_samples = max_samples

        # Init the parent processor class
        super(TextClassificationProcessor, self).__init__(
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            train_filename=train_filename,
            dev_filename=dev_filename,
            test_filename=test_filename,
            dev_split=dev_split,
            data_dir=data_dir,
            tasks={},
            proxies=proxies,
        )
        # A task defines which labels to extract for a certain prediction head and which metric to use for eval.
        # This becomes important for multitask learning, where we might have multiple text classification tasks.
        if metric and label_list:
            if multilabel:
                task_type = "multilabel_classification"
            else:
                task_type = "classification"
            self.add_task(name="text_classification",
                          metric=metric,
                          label_list=label_list,
                          label_column_name=label_column_name,
                          task_type=task_type)
        else:
            logger.info("Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for "
                        "using the default task or add a custom task later via processor.add_task()")

    # 1) Read from file to dicts
    def file_to_dicts(self, file: str) -> [dict]:
        column_mapping = {task["label_column_name"]: task["label_name"] for task in self.tasks.values()}
        dicts = read_tsv(
            filename=file,
            delimiter=self.delimiter,
            skiprows=self.skiprows,
            quotechar=self.quote_char,
            rename_columns=column_mapping,
            header=self.header,
            proxies=self.proxies,
            max_samples=self.max_samples
            )

        return dicts

    # 2) Convert one dict to tokenized sample(s)
    def _dict_to_samples(self, dictionary: dict, **kwargs) -> [Sample]:
        # this tokenization also stores offsets and a start_of_word mask
        text = dictionary["text"]
        tokenized = tokenize_with_metadata(text, self.tokenizer)
        if len(tokenized["tokens"]) == 0:
            logger.warning(f"The following text could not be tokenized, likely because it contains a character that the tokenizer does not recognize: {text}")
            return []
        # truncate tokens, offsets and start_of_word to max_seq_len that can be handled by the model
        for seq_name in tokenized.keys():
            tokenized[seq_name], _, _ = truncate_sequences(seq_a=tokenized[seq_name], seq_b=None, tokenizer=self.tokenizer,
                                                max_seq_len=self.max_seq_len)
        return [Sample(id=None, clear_text=dictionary, tokenized=tokenized)]

    # 3) Convert one sample to features
    def _sample_to_features(self, sample) -> dict:
        features = sample_to_features_text(
            sample=sample,
            tasks=self.tasks,
            max_seq_len=self.max_seq_len,
            tokenizer=self.tokenizer,
        )
        return features
      
      
# Helper
def read_tsv(filename, rename_columns, quotechar='"', delimiter="\t", skiprows=None, header=0, proxies=None, max_samples=None):
    """Reads a tab separated value file. Tries to download the data if filename is not found"""
    
    # get remote dataset if needed
    if not (os.path.exists(filename)):
        logger.info(f" Couldn't find {filename} locally. Trying to download ...")
        _download_extract_downstream_data(filename)
    
    # read file into df
    df = pd.read_csv(
        filename,
        sep=delimiter,
        encoding="utf-8",
        quotechar=quotechar,
        dtype=str,
        skiprows=skiprows,
        header=header
    )

    # let's rename our target columns to the default names FARM expects: 
    # "text": contains the text
    # "text_classification_label": contains a label for text classification
    columns = ["text"] + list(rename_columns.keys())
    df = df[columns]
    for source_column, label_name in rename_columns.items():
        df[label_name] = df[source_column]
        df.drop(columns=[source_column], inplace=True)
    
    if "unused" in df.columns:
        df.drop(columns=["unused"], inplace=True)
    raw_dict = df.to_dict(orient="records")
    return raw_dict
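
The column renaming done by `read_tsv` can be illustrated without pandas. This is a simplified sketch using only the standard library; `read_rows` is a hypothetical helper, not part of FARM, and it skips the download, `skiprows`, and `header` handling.

```python
import csv
import io


def read_rows(handle, rename_columns, delimiter="\t"):
    """Keep the "text" column and rename each label column."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        record = {"text": row["text"]}
        for source_column, label_name in rename_columns.items():
            record[label_name] = row[source_column]
        rows.append(record)
    return rows


# The extra column "unused" is dropped because it is neither "text" nor a label column.
tsv = io.StringIO("text\tlabel\tunused\nGreat show!\tpositive\tx\n")
rows = read_rows(tsv, rename_columns={"label": "text_classification_label"})
print(rows)
```

The mapping passed as `rename_columns` plays the same role as `column_mapping` in `file_to_dicts`: it translates the column name in your file to the label name the task expects.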

### Create a sample file


In [None]:
# The default format is: 
# - tab separated
# - column "text"
# - column "label" 

import pandas as pd

df = pd.DataFrame({"text": ["The concerts supercaliphractisch was great!", "I hate people ignoring climate change."],
                  "label": ["positive","negative"]
                  })
print(df)
df.to_csv("train.tsv", sep="\t", index=False)  # index=False keeps only the "text" and "label" columns

                                          text     label
0  The concerts supercaliphractisch was great!  positive
1       I hate people ignoring climate change.  negative


### Investigate how the processor converts the file

In [None]:
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path="bert-base-uncased")

processor = TextClassificationProcessor(data_dir = "", 
                                        tokenizer=tokenizer,
                                        max_seq_len=64,
                                        label_list=["positive","negative"],
                                        label_column_name="label",
                                        metric="acc",
                                       )

11/23/2020 19:02:22 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'BertTokenizer'


In [None]:
#  1. One file -> dictionaries with "raw data"
dicts = processor.file_to_dicts(file="train.tsv")
print(dicts)

[{'text': 'The concerts supercaliphractisch was great!', 'text_classification_label': 'positive'}, {'text': 'I hate people ignoring climate change.', 'text_classification_label': 'negative'}]


In [None]:
#  2. One Dictionary -> Sample(s) 
#     (Sample = "clear text" model input + meta information) 
samples = processor._dict_to_samples(dictionary=dicts[0])

# print each attribute of sample
print(samples[0].clear_text)
print(samples[0].tokenized)
print(samples[0].features)  # still None: features are only created in step 3
print("----------------------------------\n\n\n")

# or in a nicer, formatted style
print(samples[0])

{'text': 'The concerts supercaliphractisch was great!', 'text_classification_label': 'positive'}
{'tokens': ['the', 'concerts', 'super', '##cal', '##ip', '##hra', '##ct', '##isch', 'was', 'great', '!'], 'offsets': [0, 4, 13, 18, 21, 23, 26, 28, 33, 37, 42], 'start_of_word': [True, True, True, False, False, False, False, False, True, True, False]}
None
----------------------------------





      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: None
Clear Text: 
 	text: The concerts supercaliphractisch was great!
 	text_classification_label: positive
Tokenized: 
 	tokens: ['the', 'concerts'
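
The `offsets` and `start_of_word` values in the tokenized output can be reproduced by hand. The sketch below is an assumption about how such metadata can be derived, not the code inside `tokenize_with_metadata`: `align_wordpieces` is a hypothetical helper that assumes a lowercasing WordPiece tokenizer and treats a token as a word start only if it sits at position 0 or directly after whitespace.

```python
def align_wordpieces(text, tokens):
    """Recompute character offsets and start-of-word flags for WordPiece tokens."""
    lowered = text.lower()
    offsets, start_of_word = [], []
    cursor = 0
    for token in tokens:
        # Continuation pieces carry a "##" prefix that is not part of the text.
        piece = token[2:] if token.startswith("##") else token
        position = lowered.index(piece, cursor)
        offsets.append(position)
        start_of_word.append(position == 0 or lowered[position - 1].isspace())
        cursor = position + len(piece)
    return offsets, start_of_word


text = "The concerts supercaliphractisch was great!"
tokens = ["the", "concerts", "super", "##cal", "##ip", "##hra", "##ct", "##isch",
          "was", "great", "!"]
offsets, start_of_word = align_wordpieces(text, tokens)
print(offsets)
print(start_of_word)
```

Note that the final `!` is not a word start: it is attached to "great" without whitespace in between.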

In [None]:
# 3. One Sample -> Features
#    (Features = "vectorized" model input)

features = processor._sample_to_features(samples[0])
print(features[0])


{'input_ids': [101, 1996, 6759, 3565, 9289, 11514, 13492, 6593, 19946, 2001, 2307, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'padding_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'segment_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'text_classification_label_ids': [0]}
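
The shape of these features follows a simple pattern: special tokens around the word pieces, then zero padding up to `max_seq_len`. The sketch below rebuilds that pattern in plain Python. `build_features` is a hypothetical helper, not FARM's `sample_to_features_text`, and it assumes a single text sequence and the special token ids 101 and 102 seen in the output above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids as they appear in the printed features


def build_features(token_ids, max_seq_len):
    # Reserve two positions for the special tokens, truncating if necessary.
    token_ids = token_ids[: max_seq_len - 2]
    input_ids = [CLS_ID] + token_ids + [SEP_ID]
    n_pad = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "padding_mask": [1] * len(input_ids) + [0] * n_pad,
        # A single text sequence uses segment 0 everywhere.
        "segment_ids": [0] * max_seq_len,
    }


word_piece_ids = [1996, 6759, 3565, 9289, 11514, 13492, 6593, 19946, 2001, 2307, 999]
features = build_features(word_piece_ids, max_seq_len=64)
print(len(features["input_ids"]), sum(features["padding_mask"]))
```

With 11 word pieces and `max_seq_len=64`, 13 positions are real tokens and the remaining 51 are padding.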




# 2) Hands-On: Adjust it to your dataset


## Task 1: Use an existing Processor

This works if you have:
- standard tasks
- common file formats 

**Example: Text classification on CSV with multiple columns**

Dataset: GermEval18 (hate speech detection)  
Format: TSV  
Columns: `text coarse_label fine_label`

In [None]:
# Download dataset
from farm.data_handler import utils
utils._download_extract_downstream_data("germeval18/train.tsv")
!head -n 10 germeval18/train.tsv

# Task: Initialize a processor for the above file by passing the right arguments

processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir= ???,
                                        train_filename=???,
                                        label_list=???,
                                        metric="acc",
                                        label_column_name=???
                                        )

In [None]:
# test it
dicts = processor.file_to_dicts(file="germeval18/train.tsv")
print(dicts[0])
assert dicts[0] == {'text': '@corinnamilborn Liebe Corinna, wir würden dich gerne als Moderatorin für uns gewinnen! Wärst du begeisterbar?', 'text_classification_label': 'OTHER'}
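
If you get stuck on Task 1, here is one possible set of arguments, written as a plain dict so it can be inspected without FARM. It assumes the file was downloaded to `germeval18/train.tsv` as in the cell above, that the coarse task is the target, and that its two labels are `OTHER` and `OFFENSE` (only `OTHER` is confirmed by the assertion above).

```python
# One possible solution for Task 1 (assumptions: coarse labels as the target,
# label names "OTHER" and "OFFENSE", data downloaded into ./germeval18).
germeval_kwargs = {
    "max_seq_len": 128,
    "data_dir": "germeval18",
    "train_filename": "train.tsv",
    "label_list": ["OTHER", "OFFENSE"],
    "metric": "acc",
    "label_column_name": "coarse_label",
}

# With FARM installed and a tokenizer loaded, this would be passed on as:
# processor = TextClassificationProcessor(tokenizer=tokenizer, **germeval_kwargs)
print(germeval_kwargs)
```

Choosing `fine_label` as `label_column_name` instead would require the fine-grained label names in `label_list`.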

## Task 2: Build your own Processor
This works best for:
- custom input files
- special preprocessing steps
- advanced multitask learning 

**Example: Text classification with JSON as input file** 

Dataset: [100k Yelp reviews](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-downstream/yelp_reviews_100k.json) ([full dataset](https://www.yelp.com/dataset/download), [documentation](https://www.yelp.com/dataset/documentation/main))

Format: 

``` 
{
...
    // integer, star rating
    "stars": 4,

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
...
}
```

In [None]:
# Download dataset
!wget https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-downstream/yelp_reviews_100k.json
!head -5 yelp_reviews_100k.json


In [None]:
# Task: Create a new TextClassificationProcessor that reads JSON instead of CSV/TSV.
# You can simply override the function that reads from the file.

class CustomTextClassificationProcessor(TextClassificationProcessor):

    # We need to override this function from the parent class
    # because we read a JSON file instead of a TSV/CSV file.
    def file_to_dicts(self, file: str) -> [dict]:

        # TODO: your turn :)

        # The returned list of dicts should look like this:
        # [{'text': 'Total bill for this horrible service? ...',
        #   'text_classification_label': '1'}, ...]
        return dicts
    

    


In [None]:
processor = CustomTextClassificationProcessor(tokenizer=tokenizer,
                                              max_seq_len=128,
                                              data_dir="",
                                              label_list=["1","2","3","4","5"],
                                              metric="acc",
                                              )

In [None]:
# test it

dicts = processor.file_to_dicts(file="yelp_reviews_100k.json")
print(dicts[0])

assert dicts[0] == {'text': 'Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', 'text_classification_label': '1'}
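
If you want to compare your solution for Task 2, the reading logic could look roughly like the sketch below. It assumes the file holds one JSON object per line (as the `head -5` call suggests) and that each object has the `text` and `stars` fields shown in the format description. `read_json_lines` is a hypothetical helper, and the two reviews in the temporary file are made up so the code runs without the download.

```python
import json
import os
import tempfile


def read_json_lines(path, label_name="text_classification_label", max_samples=None):
    """Read one JSON object per line and keep only the text and the star rating."""
    dicts = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if max_samples is not None and len(dicts) >= max_samples:
                break
            line = line.strip()
            if not line:
                continue
            review = json.loads(line)
            # Labels are strings so that they match label_list=["1", ..., "5"].
            dicts.append({"text": review["text"], label_name: str(int(review["stars"]))})
    return dicts


# Tiny made-up file to try the function without downloading anything.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"stars": 4, "text": "Great place to hang out."}\n')
    tmp.write('{"stars": 1.0, "text": "Avoid at all costs."}\n')

dicts = read_json_lines(tmp.name)
os.remove(tmp.name)
print(dicts)
```

Inside `CustomTextClassificationProcessor.file_to_dicts`, the body could then simply be `return read_json_lines(file, max_samples=self.max_samples)`.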