In [None]:
%pip install --upgrade pip
%pip install openicl
# Restart the kernel after the installation is completed

# 1. Getting Started with OpenICL: Introduction to Components

This tutorial gives a brief introduction to the main components of OpenICL.

---



## 1-1 DatasetReader

`DatasetReader` wraps your dataset and stores the names of its input and output columns. It also integrates the `load_dataset` method from the Hugging Face [datasets](https://github.com/huggingface/datasets) library, so you can pass a dataset name directly. Here is an example that defines a `DatasetReader` for the SST-2 dataset:

In [2]:
from openicl import DatasetReader

data = DatasetReader('gpt3mix/sst2', input_columns=['text'], output_column='label')

Found cached dataset sst2 (/home/zhangyudejia/.cache/huggingface/datasets/gpt3mix___sst2/default/0.0.0/90167692658fa4abca2ffa3ede1a43a71e2bf671078c5c275c64c4231d5a62fa)
100%|██████████| 3/3 [00:00<00:00, 1169.63it/s]


Alternatively, you can load the dataset first and then pass it to `DatasetReader`:

In [3]:
from datasets import load_dataset
from openicl import DatasetReader

# Loading dataset from huggingface 
dataset = load_dataset('gpt3mix/sst2')

data = DatasetReader(dataset, input_columns=['text'], output_column='label')

Found cached dataset sst2 (/home/zhangyudejia/.cache/huggingface/datasets/gpt3mix___sst2/default/0.0.0/90167692658fa4abca2ffa3ede1a43a71e2bf671078c5c275c64c4231d5a62fa)
100%|██████████| 3/3 [00:00<00:00, 1046.74it/s]


The `load_dataset` method can also load local dataset files. See the documentation [here](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/loading_methods#datasets.load_dataset) for more information. Here are some code snippets from the documentation:

In [None]:
from datasets import load_dataset

# Load a CSV file
ds = load_dataset('csv', data_files='path/to/local/my_dataset.csv')

# Load a JSON file
ds = load_dataset('json', data_files='path/to/local/my_dataset.json')

# Load from a local loading script
ds = load_dataset('path/to/local/loading_script/loading_script.py', split='train')



---

## 1-2 Retriever

OpenICL provides several `Retriever` classes to choose from and supports many state-of-the-art retrieval methods, such as Random, [BM25](http://dx.doi.org/10.1561/1500000019) and [TopK](https://arxiv.org/abs/2101.06804). You can define a `Retriever` simply by passing it a `DatasetReader`:

In [4]:
# Define a retriever using the `DatasetReader` created earlier (`data`).
# `ice_num` is the number of in-context examples to retrieve.

# Random Retriever
from openicl import RandomRetriever
retriever = RandomRetriever(data, ice_num=8)

# TopK Retriever
from openicl import TopkRetriever
retriever = TopkRetriever(data, ice_num=8, index_split='train', test_split='test')

[2023-03-10 12:57:04,717] [openicl.icl_retriever.icl_topk_retriever] [INFO] Creating index for index set...
  0%|          | 0/6920 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 6920/6920 [02:18<00:00, 49.81it/s]


In a `Retriever`, you can choose the index set and the test set through `index_split` and `test_split`. The defaults are `'train'` and `'test'` respectively.
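
To make the roles of the index set and the test set concrete, here is a minimal pure-Python sketch of top-k retrieval. It is not OpenICL's implementation: the log output above suggests `TopkRetriever` builds an index with a tokenizer-based model, whereas this sketch swaps in a simple word-overlap (Jaccard) score. The names `jaccard` and `retrieve_topk` and the toy sentences are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (0.0 to 1.0)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_topk(index_texts, test_texts, ice_num):
    """For each test sentence, return indices of the `ice_num` most similar index sentences."""
    results = []
    for query in test_texts:
        ranked = sorted(
            range(len(index_texts)),
            key=lambda i: jaccard(query, index_texts[i]),
            reverse=True,
        )
        results.append(ranked[:ice_num])
    return results


# Toy stand-ins for the 'train' (index) and 'test' splits
index_split = [
    "a lovely film with lovely performances",
    "the plot is dull and the acting is worse",
    "a dull film",
]
test_split = ["what a lovely film"]

ice_indices = retrieve_topk(index_split, test_split, ice_num=2)
print(ice_indices)  # [[0, 2]]
```

Each test sentence gets its own list of in-context example indices, drawn only from the index set.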



---

## 1-3 PromptTemplate

`PromptTemplate` is a module for generating prompts in a specified format. Typically, you can define a `PromptTemplate` to guide the generation of in-context examples or the final prompts fed into the model during the inference process. Additionally, you can embed the `PromptTemplate` within the `DatasetReader`, affecting the data used in the retrieval process.

### 1-3-1 Dictionary-Style Template

`PromptTemplate.template` can be either a `dict` or a `str`. We first discuss the `dict` case. Dictionary-style templates are well suited to classification tasks. Taking the SST-2 dataset as the example again, there are only two labels, `0` and `1`, which represent `positive` and `negative` respectively. The keys of `PromptTemplate.template` can therefore be the labels, and the values the corresponding instructions. The following is a `PromptTemplate` example for SST-2.

In [12]:
from datasets import load_dataset
from openicl import PromptTemplate

# Loading dataset from huggingface 
dataset = load_dataset('gpt3mix/sst2')

template = PromptTemplate(template={
                                        0: 'Positive Movie Review: </text>',
                                        1: 'Negative Movie Review: </text>' 
                                    },
                          column_token_map={'text' : '</text>'} 
           )


100%|██████████| 3/3 [00:00<00:00, 1290.82it/s]


The `column_token_map` parameter is used to build a mapping from columns to placeholders in the template. Additionally, the correctness of the `PromptTemplate` can be checked using the `generate_item` method:

In [13]:
# Select a piece of data from the dataset
entry = dataset['validation'][0]
print(f'entry:\n{entry}\n')

# Generate output
output = template.generate_item(entry, output_field='label')
print(f'output:\n{output}')

entry:
{'text': "It 's a lovely film with lovely performances by Buy and Accorsi .", 'label': 0}

output:
Positive Movie Review: It 's a lovely film with lovely performances by Buy and Accorsi .
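
The label-to-instruction lookup followed by placeholder substitution can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming simple string replacement; `fill_dict_template` is a hypothetical helper, not part of OpenICL.

```python
def fill_dict_template(template, column_token_map, entry, label_column):
    """Pick the template string for the entry's label, then fill column placeholders."""
    text = template[entry[label_column]]
    for column, token in column_token_map.items():
        text = text.replace(token, str(entry[column]))
    return text


sst2_template = {
    0: 'Positive Movie Review: </text>',
    1: 'Negative Movie Review: </text>',
}
entry = {
    'text': "It 's a lovely film with lovely performances by Buy and Accorsi .",
    'label': 0,
}

prompt = fill_dict_template(sst2_template, {'text': '</text>'}, entry, 'label')
print(prompt)
```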


### 1-3-2 String-Style Template

This section covers the string-style `PromptTemplate`.
In generative tasks the outputs are too diverse to design a dictionary that maps each label to an instruction. A single string-style template solves this problem. Taking the machine translation dataset `wmt16 (de-en)` as an example, you can use a string-style `PromptTemplate` like this:

In [5]:
from datasets import load_dataset
from openicl import PromptTemplate

# Loading dataset from huggingface 
dataset = load_dataset('wmt16', name='de-en', split='validation')

# Data Preprocessing
dataset = dataset.map(lambda example: example['translation']).remove_columns('translation')

# Template for en->de
template = PromptTemplate('</en> = </de>', {'en' : '</en>', 'de' : '</de>'})

Found cached dataset wmt16 (/home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227)
Loading cached processed dataset at /home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-9233edb7f71e770b.arrow


You can also check your `PromptTemplate` using the `generate_item` method:

In [9]:
# Select a piece of data from the dataset
entry = dataset[0]
print(f'entry:\n{entry}\n')

# Generate output
output = template.generate_item(entry)
print(f'output:\n{output}\n')

# Generate masked output
masked_output = template.generate_item(entry, output_field='de')
print(f'masked output:\n{masked_output}')

entry:
{'de': 'Die Premierminister Indiens und Japans trafen sich in Tokio.', 'en': 'India and Japan prime ministers meet in Tokyo'}

output:
India and Japan prime ministers meet in Tokyo = Die Premierminister Indiens und Japans trafen sich in Tokio.

masked output:
India and Japan prime ministers meet in Tokyo = 
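
The difference between the full and the masked output can be sketched as follows. This is a minimal illustration that assumes masking simply replaces the output column's placeholder with an empty string; `fill_str_template` is a hypothetical helper, not OpenICL's `generate_item`.

```python
def fill_str_template(template, column_token_map, entry, output_field=None):
    """Fill placeholders; the placeholder of `output_field` (if given) is left empty."""
    text = template
    for column, token in column_token_map.items():
        value = '' if column == output_field else str(entry[column])
        text = text.replace(token, value)
    return text


entry = {
    'de': 'Die Premierminister Indiens und Japans trafen sich in Tokio.',
    'en': 'India and Japan prime ministers meet in Tokyo',
}
token_map = {'en': '</en>', 'de': '</de>'}

full = fill_str_template('</en> = </de>', token_map, entry)
masked = fill_str_template('</en> = </de>', token_map, entry, output_field='de')
print(full)
print(repr(masked))
```

The masked form is what a generation model would be asked to continue, since the target side is unknown at inference time.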


### 1-3-3 Explanation of `ice_token`

As the previous sections showed, a `PromptTemplate` can guide the generation of in-context examples or of the final prompt fed into the model during inference (in the `inference` method of `Inferencer`, these are the `ice_template` and `prompt_template` parameters respectively). `PromptTemplate.ice_token` marks where the in-context examples go when the prompt is generated. In most cases the template format for in-context examples and for the final prompt is exactly the same; the only difference is that the in-context examples need to be spliced in before the final prompt.

Therefore, you usually only need to add the placeholder corresponding to `ice_token` at the beginning of the template. When `ice_token` is set, the `PromptTemplate` can be used as either `ice_template` or `prompt_template` in the `inference` method of `Inferencer` (`ice_token` has no effect when the template is used as `ice_template`). To use a `PromptTemplate` as `prompt_template`, `ice_token` must be set.

The templates with `ice_token` added are as follows (with `ice_token='</E>'`):

In [14]:
# SST-2 Template Example
template = PromptTemplate(template={
                                        0: '</E>Positive Movie Review: </text>',
                                        1: '</E>Negative Movie Review: </text>' 
                                    },
                          column_token_map={'text' : '</text>'},
                          ice_token='</E>'
            )


# WMT16 en->de Template Example
template = PromptTemplate('</E></en> = </de>', {'en' : '</en>', 'de' : '</de>'}, ice_token='</E>')
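
The following plain-Python sketch shows how one template might serve both roles. It is an illustration under stated assumptions, not OpenICL's code: the `render` helper is hypothetical, the example sentences are made up, and joining in-context examples with a newline is an assumption.

```python
ICE_TOKEN = '</E>'


def render(template, entry, ice=''):
    """Fill one template; `ice` is the text spliced in at the ice_token position."""
    return (
        template.replace(ICE_TOKEN, ice)
        .replace('</en>', entry.get('en', ''))
        .replace('</de>', entry.get('de', ''))
    )


template = '</E></en> = </de>'
examples = [
    {'en': 'Good morning', 'de': 'Guten Morgen'},
    {'en': 'Thank you', 'de': 'Danke'},
]
test_entry = {'en': 'Good night'}  # target side is unknown at inference time

# Role 1 (ice_template): the ice_token is dropped and each example is rendered.
ice = ''.join(render(template, ex) + '\n' for ex in examples)

# Role 2 (prompt_template): the rendered examples replace the ice_token.
prompt = render(template, test_entry, ice=ice)
print(prompt)
```

Because the same template string is used for both roles, the in-context examples and the final query end up in an identical format.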

In the next section, we will use these `PromptTemplate` objects in the inference process.



---

## 1-4 Inferencer

As with `Retriever`, basic use of `Inferencer` is straightforward. Here we demonstrate the perplexity-based (**PPL-based**) and direct-generation `Inferencer` classes.

When using the `inference` method of the `Inferencer`, we need to select a defined `Retriever`. If you want to get better experimental results, you can also add well-designed `PromptTemplate`(s) to guide the generation of in-context examples and the final prompt (in-context examples correspond to `ice_template`, and prompt corresponds to `prompt_template`).

### 1-4-1 PPL-based Inferencer Example 

In [7]:
from openicl import DatasetReader, PromptTemplate, TopkRetriever, PPLInferencer

# Define a DatasetReader, loading dataset from huggingface and selecting 5 pieces of data randomly.
data = DatasetReader('gpt3mix/sst2', input_columns=['text'], output_column='label', ds_size=5)

# SST-2 Template Example
template = PromptTemplate(template={
                                        0: '</E>Positive Movie Review: </text>',
                                        1: '</E>Negative Movie Review: </text>' 
                                   },
                          column_token_map={'text' : '</text>'},
                          ice_token='</E>'
           )

# TopK Retriever
retriever = TopkRetriever(data, ice_num=2, index_split='train', test_split='test')

# Define an Inferencer
inferencer = PPLInferencer(model_name='distilgpt2')

# Inference
predictions = inferencer.inference(retriever, ice_template=template, output_json_filename='sst2')
print(predictions)

Found cached dataset sst2 (/home/zhangyudejia/.cache/huggingface/datasets/gpt3mix___sst2/default/0.0.0/90167692658fa4abca2ffa3ede1a43a71e2bf671078c5c275c64c4231d5a62fa)
100%|██████████| 3/3 [00:00<00:00, 509.49it/s]
[2023-03-10 13:01:07,896] [openicl.icl_retriever.icl_topk_retriever] [INFO] Creating index for index set...
  0%|          | 0/5 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 5/5 [00:00<00:00, 59.04it/s]
[2023-03-10 13:01:16,734] [openicl.icl_retriever.icl_topk_retriever] [INFO] Embedding test set...
  0%|          | 0/5 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|█

[0, 0, 0, 0, 1]





Note that we only set `ice_template` in the `inference` method, but not `prompt_template`. However, since we set the `ice_token` in `ice_template`, this `ice_template` will be automatically used for `prompt_template`.

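To see detailed output, open the generated JSON file. The default path is `icl_inference_output/predictions.json`; because this example passes `output_json_filename='sst2'`, the file should instead be named `sst2.json` in that directory.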
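
As a rough illustration of what "perplexity-based" means here, the sketch below scores one filled-in prompt per candidate label and predicts the label whose prompt has the lowest perplexity. This is the usual reading of PPL-based inference rather than a copy of `PPLInferencer`; the helper names are hypothetical and the token log-probabilities are made-up numbers standing in for a language model's scores.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_label(logprobs_by_label):
    """Return the label whose filled-in prompt has the lowest perplexity."""
    return min(logprobs_by_label, key=lambda label: perplexity(logprobs_by_label[label]))


# Made-up per-token log-probabilities for the same review under each label's template
logprobs_by_label = {
    0: [-1.2, -0.8, -1.0],  # 'Positive Movie Review: ...'
    1: [-2.5, -1.9, -2.2],  # 'Negative Movie Review: ...'
}

ppl = {label: perplexity(lp) for label, lp in logprobs_by_label.items()}
prediction = pick_label(logprobs_by_label)
print(ppl, prediction)
```

With a real model, the log-probabilities would come from scoring each candidate prompt, in-context examples included.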

### 1-4-2 Direct Generation Inferencer Example

In [6]:
from openicl import DatasetReader, PromptTemplate, BM25Retriever, GenInferencer
from datasets import load_dataset

# Loading dataset from huggingface 
dataset = load_dataset('wmt16', name='de-en')

# Data Preprocessing
dataset = dataset.map(lambda example: example['translation']).remove_columns('translation')

# Define a DatasetReader, selecting 5 pieces of data randomly.
data = DatasetReader(dataset, input_columns='en', output_column='de', ds_size=5)

# WMT16 en->de Template Example
template = PromptTemplate('</E></en> = </de>', {'en' : '</en>', 'de' : '</de>'}, ice_token='</E>')

# BM25 Retriever
retriever = BM25Retriever(data, ice_num=1, index_split='validation', test_split='test')

# Define an Inferencer
inferencer = GenInferencer(model_name='distilgpt2') # we suggest using XGLM here, e.g. 'facebook/xglm-7.5B'

# Inference
predictions = inferencer.inference(retriever, ice_template=template, output_json_filename='wmt')
print(predictions)

Found cached dataset wmt16 (/home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227)
100%|██████████| 3/3 [00:09<00:00,  3.13s/it]
Loading cached processed dataset at /home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-03584df8d3376ce9.arrow
Loading cached processed dataset at /home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-9233edb7f71e770b.arrow
Loading cached processed dataset at /home/zhangyudejia/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-131fde3f5ed908a9.arrow
[2023-03-10 13:00:50,735] [openicl.icl_retriever.icl_bm25_retriever] [INFO] Retrieving data for test set...
100%|██████████| 5/5 [00:00<00:00, 1824.40it/s]
[2023-03-10 13:00:50,742] [openicl.icl_inferencer.icl_

['\xa0\nThe characters were not defined by their work, but rather their personalities, which shone through the interaction as friends. = \xa0\nThe characters were not defined by their work, but rather their personalities, which shone through the interaction as friends. = \xa0\nThe characters were not defined by their work, but rather their personalities, which shone through the interaction as friends. = \xa0\nThe characters were not defined by their work, but rather their personalities, which shone through the interaction as', 'ich auf die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die Weltung, die W', '\xa0\nThe court must now settle the amount of pecuniary compensation. = \xa0\nThe court must now settle the amount of pecuniary compensation. = \xa0\nThe court must now settle the amount of pecuniary co


