# Active Learning for Text Classification using Small-Text
*Notebook 1*  
This is an introductory example that shows you how to use [small-text](https://github.com/webis-de/small-text) to perform active learning for text classification using state-of-the-art transformer models.

----

## Overview

- [Part I: Installation](#nb1-part1-installation)
- [Part II: Data](#nb1-part2-data)
- [Part III: Setting up the Active Learner](#nb1-part3-active-learning)
- [Part IV: Plotting the Results](#nb1-part4-plotting)

----

<a id="nb1-part1-installation"></a>
## I. Installation

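First, we install small-text for its active learning functionality, [datasets](https://github.com/huggingface/datasets) to load an example dataset, and [matplotlib](https://matplotlib.org/) to plot the learning curves at the end. The cell below also sets up a temporary directory and stores its path in the `active_patcher_TEMP` environment variable, so that intermediate files are easy to clean up.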

In [1]:
# Set a temporary directory so that intermediate files are easy to clean up
import os
TMP_DIR_VARIABLE = 'active_patcher_TEMP'
tmp_dir = '/root/active-patcher/tmp'
os.environ[TMP_DIR_VARIABLE] = tmp_dir
# Create tmp_dir (including missing parent directories) if it does not exist yet
os.makedirs(tmp_dir, exist_ok=True)
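
The temporary directory configured above can be removed again once an experiment is finished. The following is a minimal sketch using only the standard library. The location under the system temp directory (`active_learning_sketch`) is a hypothetical stand-in for the fixed path used in this notebook.

```python
import os
import shutil
import tempfile

# Hypothetical location for this sketch; the notebook uses a fixed path instead.
base_dir = os.path.join(tempfile.gettempdir(), 'active_learning_sketch')
tmp_dir = os.path.join(base_dir, 'tmp')

# makedirs creates missing parent directories and tolerates an existing directory.
os.makedirs(tmp_dir, exist_ok=True)
os.environ['active_patcher_TEMP'] = tmp_dir
created = os.path.isdir(tmp_dir)

# ... run the experiment here ...

# Remove the directory and everything in it once it is no longer needed.
shutil.rmtree(base_dir, ignore_errors=True)
removed = not os.path.exists(base_dir)
print(created, removed)
```

Deleting the directory only at the very end keeps intermediate model files available while the active learning loop is still running.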

### Preparation

You can skip this part when reading for the first time if you are only interested in active learning. Here, we configure the logging behavior and the progress bar display of the `datasets` library to improve their appearance in the notebook.

In [2]:
import logging

import datasets
from sklearn.model_selection import train_test_split
datasets.logging.set_verbosity_error()

# disables the progress bar for notebooks: https://github.com/huggingface/datasets/issues/2651
datasets.logging.get_verbosity = lambda: logging.NOTSET

Moreover, we update the default matplotlib settings to receive a more visually appealing plot at the end of this tutorial.

In [3]:
from matplotlib import rcParams
rcParams.update({'xtick.labelsize': 14, 'ytick.labelsize': 14, 'axes.labelsize': 16})

Finally, we will fix the random seeds so that readers do not get confused when the results change upon repeated execution.

In [4]:
import torch
import numpy as np
import pandas as pd
seed = 2022
torch.manual_seed(seed)
np.random.seed(seed)
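
The cell above seeds PyTorch and NumPy only. If other code paths draw from Python's built-in `random` module, that generator needs a seed as well. The following is a minimal sketch of a combined helper. The name `set_all_seeds` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_all_seeds(seed):
    """Seed every available random number generator (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


# Seeding twice with the same value reproduces the same random draw.
set_all_seeds(2022)
first = random.random()
set_all_seeds(2022)
second = random.random()
print(first == second)
```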

<a id="nb1-part2-data"></a>
## II. Data

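First, we load the dataset from two CSV files, one with negative and one with positive examples. Each row is a GitHub commit, identified by its URL, with its commit message and diff, and is labeled as either negative (0) or positive (1). Only commits whose diff is shorter than 512 characters are kept.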

In [5]:
neg = pd.read_csv(r'./datasets/negative+CC-900repos.csv',encoding='utf_8_sig')
neg['label'] = 0
pos = pd.read_csv(r'./datasets/positive+CC-900repos.csv', encoding='utf_8_sig')
pos['label'] = 1
df = pd.concat([neg[['github','message','diff','label']],pos[['github','message','diff','label']]],axis=0,ignore_index=False).set_index('github')
df.fillna('', inplace=True)
# frac=1 means 100% of the rows
# shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)
df =  df[df['diff'].str.len()<512]
label2id={'negative':0,'positive':1}
df = df.replace({"label": label2id})
df = df.rename(columns={"github":'id'})
# X_labeled_full = df.sample(100)
# df['project_name'] = df['github'].str.extract(r'github\.com/([^/]+)')
df

| github | message | diff | label |
|---|---|---|---|
| https://github.com/omniauth/omniauth/commit/108f054837735a31430fca4b75a623f5d447375b | Update tested ruby versions | diff --git a/README.md b/README.md\nindex 4a2f... | 0 |
| https://github.com/omniauth/omniauth/commit/f654faf709bbb3871b7fc7228aa481a5f96952e5 | Remove 2_0-indev from CI tracking | diff --git a/.github/workflows/main.yml b/.git... | 0 |
| https://github.com/h2o/h2o/commit/fbe835bf031f0ffef6b9844bcc2bbb71dce56478 | [h2olog] Fix parameter position in example usa... | diff --git a/src/h2olog/h2olog.cc b/src/h2olog... | 0 |
| https://github.com/h2o/h2o/commit/b414dcf5bdc3477432d0c65abd3b8a1e736c94e1 | ci: fix a regression that step scripts should ... | diff --git a/.github/workflows/ci.yml b/.githu... | 0 |
| https://github.com/heimdal/heimdal/commit/936d8dd4ee944fce67f4645fc857937da4ea21df | asn1: Add SRVName to PKIX module\n\nThis is in... | diff --git a/lib/asn1/rfc2459.asn1 b/lib/asn1/... | 0 |
| ... | ... | ... | ... |
| https://github.com/josh/rails/commit/605aadb3cdba9f469e88c39c0cad7448d59a9f0c | protect new rails apps from csrf by default.\n... | diff --git a/railties/helpers/application.rb b... | 1 |
| https://github.com/cakephp/cakephp/commit/69e5226fd2b3a9f6386527ef00159a03c53bbf0c | Merge pull request #7021 from quickapps/master... | diff --git a/src/Network/Response.php b/src/Ne... | 1 |
| https://github.com/php/php-src/commit/80367584910885baa1a2a4476a4a31efdcf0c9c0 | Fix bug #69646\tOS command injection vulnerabi... | diff --git a/ext/standard/exec.c b/ext/standar... | 1 |
| https://github.com/cakephp/cakephp/commit/92e3e09fdc218ebf8eb50e896dcb1728d02eadfc | Fix directory traversal security checking\n\nf... | diff --git a/src/Network/Response.php b/src/Ne... | 1 |


In [6]:
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
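
Splitting off 30% of the data and then halving that part yields roughly a 70/15/15 partition into training, validation, and test data. The following is a minimal sketch of that arithmetic on plain indices. The helper `split_indices` is hypothetical and only mimics the proportions; it is not how scikit-learn implements `train_test_split`.

```python
import random


def split_indices(indices, test_fraction, seed):
    """Shuffle the indices and split off a held-out part (hypothetical helper)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled indices are held out, the rest are kept.
    return shuffled[n_test:], shuffled[:n_test]


all_indices = range(1000)
# First split: 70% training data, 30% held out.
train_idx, temp_idx = split_indices(all_indices, test_fraction=0.3, seed=42)
# Second split: the held-out 30% is divided evenly into validation and test.
val_idx, test_idx = split_indices(temp_idx, test_fraction=0.5, seed=42)

print(len(train_idx), len(val_idx), len(test_idx))
```

The three parts are disjoint and together cover every index exactly once.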

In [7]:
import logging
from datasets import Dataset, DatasetDict, ClassLabel
# Construct the DatasetDict
raw_dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(val_df)
})

# num_classes = raw_dataset['train'].features['label'].num_classes




print('First 10 training samples:\n')
for i in range(10):
    print(raw_dataset['train']['label'][i], ' ', raw_dataset['train']['diff'][i])

First 10 training samples:

0   diff --git a/index.php b/index.php
index 1d2db34..cb39173 100644
--- a/index.php
+++ b/index.php
@@ -464,6 +464,7 @@
         if (count === 0) {
             $(".list-group").prepend("<li class=\"list-group-item\" id=\"search-error\">No tasks found</li>");
         }
+        document.title = "Burden (" + count + ")";
     });
     /* End */
     /* Set Up Notifications */

0   diff --git a/.zuul.yaml b/.zuul.yaml
index 3a545311..d201b5c9 100644
--- a/.zuul.yaml
+++ b/.zuul.yaml
@@ -10,7 +10,6 @@
       - openstack/heat-templates
 
 - project:
-    name: openstack/heat-templates
     check:
       jobs:
         - heat-templates-check

0   diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..eda3292
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+# Ignore Saurus site specific files
+config.php
+error.log
+classes/smarty/templates/
+classes/smarty/templates_c/
+public/
+shared/
+.htaccess

0   diff --git a/engine/Monstra.php b/engine/M

### Preparing the Data

Next, we have to convert this raw text data into a format usable by small-text. Since the transformer-based classification in small-text uses Hugging Face transformers, this step is similar to the preprocessing you may know from transformers, with the addition that the end result must be a `TransformersDataset`. In this example, we load the tokenizer of a CodeBERT model from the local path `./models/codebert-base`.

In [8]:
from transformers import AutoTokenizer

transformer_model_name = './models/codebert-base'

tokenizer = AutoTokenizer.from_pretrained(
    transformer_model_name
)

We use the `TransformersDataset.from_arrays()` helper function which constructs a `TransformersDataset` instance using the tokenizer, text, and labels.

In [9]:
from active_patcher import TransformersDataset

num_classes = 2
target_labels = np.arange(num_classes)

train = TransformersDataset.from_arrays(raw_dataset['train']['diff'],
                                        raw_dataset['train']['label'],
                                        tokenizer,
                                        max_length=512,
                                        target_labels=target_labels)
test = TransformersDataset.from_arrays(raw_dataset['test']['diff'],
                                       raw_dataset['test']['label'],
                                       tokenizer,
                                       max_length=512,
                                       target_labels=target_labels)



---

<a name="nb1-part3-active-learning"></a>
## III. Setting up the Active Learner

Now we construct a `PoolBasedActiveLearner` instance, which requires a classifier factory, a query strategy, and the train dataset.

To obtain a first model, we initialize the active learner by providing the true labels for 20 samples, drawn so that both classes are equally represented. This corresponds to an initial labeling in the real-world setting.
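
To illustrate what a class-balanced initial sample looks like, here is a minimal sketch in plain Python. The helper `balanced_initial_indices` and the toy label vector are hypothetical. This shows the general idea only and is not the library's implementation of `random_initialization_balanced`.

```python
import random
from collections import defaultdict


def balanced_initial_indices(labels, n_samples, seed=0):
    """Pick n_samples indices with equal counts per class (hypothetical helper).

    Assumes every class has at least n_samples // num_classes members;
    otherwise random.sample raises a ValueError.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    per_class = n_samples // len(by_class)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], per_class))
    return chosen


# Toy, imbalanced label vector: 90 negatives and 30 positives.
toy_labels = [0] * 90 + [1] * 30
initial = balanced_initial_indices(toy_labels, n_samples=20)
counts = [sum(1 for i in initial if toy_labels[i] == c) for c in (0, 1)]
print(counts)
```

Drawing the same number of samples from each class ensures that the first model sees both classes, even when the pool is imbalanced.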

In [10]:
from active_patcher import (
    PoolBasedActiveLearner,
    PredictionEntropy,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
    random_initialization_balanced,
    RandomSampling
)


# simulates an initial labeling to warm-start the active learning process
def initialize_active_learner(active_learner, y_train):

    indices_initial = random_initialization_balanced(y_train, n_samples=20)
    active_learner.initialize_data(indices_initial, y_train[indices_initial])

    return indices_initial


transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(transformer_model,
                                                    num_classes,
                                                    kwargs=dict({'device': 'cuda',
                                                                 'mini_batch_size': 8,
                                                                 'class_weight': 'balanced'
                                                                }))
# query_strategy = RandomSampling()
query_strategy = PredictionEntropy()
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
indices_labeled = initialize_active_learner(active_learner, train.y)

Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_0-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_1-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_2-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_3-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_4-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_5-b0.pt.optimizer
Where did I go /root/active-patcher/tmp/tmpd8t055i9/model_6-b0.pt.optimizer


RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 263155712 vs 263155600

### Active Learning Loop

The main active learning loop queries the unlabeled pool and thereby decides which documents are labeled next.
We then provide the labels for those documents and the active learner retrains the model.
After each query, we evaluate the current model against the test set and save the result.


Note: This is active learning as it is done in a scientific simulation. In reality, the label feedback would be given by human annotators, and we would not be able to measure the test accuracy.
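
The `PredictionEntropy` query strategy configured earlier is a form of uncertainty sampling: it prefers pool samples for which the predicted class distribution is closest to uniform. The following is a minimal sketch of that idea in plain Python. The function names and the toy probabilities are hypothetical, and this is not small-text's actual implementation.

```python
import math


def prediction_entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def query_most_uncertain(pool_probabilities, num_samples):
    """Return the indices of the num_samples pool items with the highest entropy."""
    scores = [prediction_entropy(p) for p in pool_probabilities]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_samples]


# Toy predicted probabilities for four unlabeled samples (two classes).
pool_probabilities = [
    [0.90, 0.10],
    [0.50, 0.50],
    [0.99, 0.01],
    [0.60, 0.40],
]
queried = query_most_uncertain(pool_probabilities, num_samples=2)
print(queried)
```

In this toy pool, the samples at indices 1 and 3 are selected because their predictions are the least confident.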

In [None]:
from sklearn.metrics import accuracy_score


num_queries = 10


def evaluate(active_learner, train, test):
    y_pred = active_learner.classifier.predict(train)
    y_pred_test = active_learner.classifier.predict(test)

    test_acc = accuracy_score(y_pred_test, test.y)

    print('Train accuracy: {:.2f}'.format(accuracy_score(y_pred, train.y)))
    print('Test accuracy: {:.2f}'.format(test_acc))

    return test_acc


results = []
results.append(evaluate(active_learner, train[indices_labeled], test))


for i in range(num_queries):
    # ...where each iteration consists of labelling 20 samples
    indices_queried = active_learner.query(num_samples=20)

    # Simulate user interaction here. Replace this for real-world usage.
    y = train.y[indices_queried]

    # Return the labels for the current query to the active learner.
    active_learner.update(y)

    indices_labeled = np.concatenate([indices_queried, indices_labeled])

    print('---------------')
    print(f'Iteration #{i} ({len(indices_labeled)} samples)')
    results.append(evaluate(active_learner, train[indices_labeled], test))

----

<a id="nb1-part4-plotting"></a>
## IV. Plotting the Results

Using the previously saved results we can plot a [learning curve](https://en.wikipedia.org/wiki/Learning_curve_(machine_learning)) to visualize the resulting accuracy on the test set.

In [None]:
results

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 8))
# ax = plt.axes()

data = np.stack((np.arange(num_queries + 1), np.array(results)), axis=1)
# data = np.column_stack((np.arange(num_queries + 1), results))

In [None]:
data

In [None]:
# np.save('random_strategy.npy', data)  # if the file path does not end with .npy, the extension is appended automatically
# res = np.load('random_strategy.npy')
# np.save('prediction_entropy.npy', data)  # if the file path does not end with .npy, the extension is appended automatically
res = np.load('prediction_entropy.npy')

In [None]:
res

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

num_queries = 10
# results = [0.81, 0.83, 0.85, 0.86, 0.87, 0.88, 0.885, 0.89, 0.892, 0.895, 0.9]

# data = np.column_stack((np.arange(num_queries + 1), results))

sns.lineplot(x=res[:, 0], y=res[:, 1])
plt.xlabel('number of queries', labelpad=15)
plt.ylabel('test accuracy', labelpad=25)
sns.despine()
plt.show()
