# Transfer Learning with Hugging Face

```transformers``` is a Python package by Hugging Face (https://huggingface.co/transformers/).

With the package, we can load pre-trained models that work with both PyTorch and TensorFlow.

Import packages:

In [4]:
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModel
from tqdm.notebook import tqdm

Loading part of the ACL-ARC citation prediction dataset

In [None]:
training = pd.read_json('~/datasets/s4/ACL-ARC/training.jsonl', lines=True)

In [2]:
training = pd.read_json('../../training.jsonl', lines=True)

Loading pre-trained model and tokenizer.  

The SPECTER pre-trained model in the ```transformers``` package is based on a document-level representation learning method that uses citation-informed Transformers. The fine-tuned model generates high-quality document-level embeddings of scientific documents.

In [5]:
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')

Take one sentence as an example:

In [8]:
sentence = training['cur_sent'][0]

In [9]:
sentence

'the system consists of two linguistically significant parts a machine lexicon residing on a direct access device and a program package'

The tokenizer splits the sentence into tokens. It first splits the sentence into smaller units, such as words, and then looks up each unit in a dictionary (the vocabulary) to find its integer index, shown as ```"token (int)"``` in the output. When a word is not in the vocabulary, the tokenizer splits it further into sub-words until every piece is found in the vocabulary. For example, "linguistically" is split into ```linguistic``` and ```##ally```, where the ```##``` prefix marks a piece that continues the previous one.

In [19]:
tokens = tokenizer(sentence)

print("token (str): {}".format(tokenizer.convert_ids_to_tokens(tokens['input_ids'])))
print("token (int): {}".format(tokens['input_ids']))
print("type       : {}".format(tokens['token_type_ids']))

token (str): ['[CLS]', 'the', 'system', 'consists', 'of', 'two', 'linguistic', '##ally', 'significant', 'parts', 'a', 'machine', 'lexicon', 'residing', 'on', 'a', 'direct', 'access', 'device', 'and', 'a', 'program', 'package', '[SEP]']
token (int): [101, 111, 483, 3777, 125, 532, 12106, 397, 700, 3983, 105, 4623, 28374, 29295, 201, 105, 1464, 2131, 3476, 136, 105, 2457, 8553, 102]
type       : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
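The sub-word splitting described above can be sketched in plain Python. This is a simplified, greedy longest-match-first version of the WordPiece idea, not the tokenizer's actual implementation (the real one also normalizes text and handles punctuation). The function name ```wordpiece_tokenize``` and the tiny ```toy_vocab``` are made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary invented for this example; the real one is far larger.
toy_vocab = {"the", "system", "linguistic", "##ally", "significant"}

tokens = []
for word in "the system linguistically significant".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)  # ['the', 'system', 'linguistic', '##ally', 'significant']
```

Matching the longest piece first means common words stay whole, while rare words fall back to smaller pieces that the vocabulary does contain.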


After tokenizing, we pass the tokens to the model to generate a sentence representation.

In [20]:
tokens_pt = tokenizer(sentence, return_tensors="pt")

outputs = model(**tokens_pt)
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output

BERT-based models return two types of output. The first is called the "last hidden state", which is the sequence of hidden states at the output of the last layer of the model. Each token corresponds to one last-hidden-state vector.

In [23]:
last_hidden_state.shape

torch.Size([1, 24, 768])

In [21]:
last_hidden_state

tensor([[[-0.7792,  0.8333,  0.2410,  ...,  0.4887, -0.0467,  0.3051],
         [ 0.2098,  0.0838,  0.1247,  ...,  0.5135,  0.5461, -0.1118],
         [-0.5035,  0.1904,  0.6871,  ...,  0.8021,  0.1066,  0.3445],
         ...,
         [-1.4837,  0.8487,  0.6305,  ...,  0.5358,  0.8152,  0.3259],
         [-1.4575,  0.7227,  0.6408,  ..., -0.5666,  0.5255,  0.2348],
         [-0.9408,  0.8182,  0.2210,  ...,  0.5561,  0.0099,  0.3110]]],
       grad_fn=<NativeLayerNormBackward>)

The other type is called the "pooler output", which is the last-layer hidden state of the first token of the sequence (the classification token, ```[CLS]```), further processed by a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pre-training. Because the first token is tied to this pre-training classification task, the pooler output is often used for classification tasks.
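To make the relationship between the two outputs concrete, here is a minimal sketch of the pooler computation in plain Python: take the first token's hidden state, apply a linear layer, then tanh. The weights are random stand-ins for the learned parameters, the hidden size is shrunk from 768 to 4, and the helper name ```pooler``` is hypothetical.

```python
import math
import random


def pooler(cls_hidden, weight, bias):
    """Apply a linear layer followed by tanh to the first token's hidden state."""
    out = []
    for row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(row, cls_hidden)) + b
        out.append(math.tanh(z))
    return out


random.seed(0)
hidden_size = 4  # toy size; the model in this notebook uses 768

# Toy "last hidden state" for one sentence of 3 tokens: [tokens][hidden_size]
last_hidden = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(3)]

# Random stand-ins for the learned pooler parameters
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

cls_hidden = last_hidden[0]  # hidden state of the first token
pooled = pooler(cls_hidden, weight, bias)
print(len(pooled), pooled)
```

Because of the tanh, every entry of the pooled vector lies between -1 and 1, while the last hidden state is not bounded in this way.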

In [24]:
pooler_output.shape

torch.Size([1, 768])

In [22]:
pooler_output

tensor([[-0.2540,  0.2876, -0.2340,  0.1843,  0.9807, -0.0738,  0.5956,  0.1019,
         -0.4288, -0.2507,  0.7336, -0.0908, -0.3881,  0.3673,  0.2697, -0.4390,
          0.1150,  0.2368,  0.2760, -0.1142, -0.1232, -0.0835, -0.0471, -0.0462,
         -0.0721,  0.1732, -0.7120,  0.9663,  0.1583,  0.4891, -0.2640,  0.0396,
          0.2055,  0.1348, -0.3211,  0.4341,  0.1282,  0.2765,  0.3996, -0.0257,
          0.7475, -0.9900,  0.8704, -0.1534, -0.2722,  0.5214, -0.2670,  0.9717,
         -0.2234,  0.5857, -0.0092,  0.2714, -0.5649,  0.0629, -0.0214,  0.9859,
          0.9603,  0.9990, -0.0141,  0.9154,  0.9727,  0.1833, -0.3122,  0.4822,
          0.2448, -0.5942, -0.1967,  0.3873, -0.3230, -0.5539, -0.0785, -0.5370,
          0.0573,  0.9926,  0.1674, -0.0150, -0.5095,  0.9821, -0.5030, -0.0936,
         -0.0617,  0.0863,  0.4232, -0.2958,  0.0321,  0.9949,  0.1428, -0.4704,
         -0.6120,  0.2401, -0.1410, -0.0368, -0.5527,  0.2229, -0.1670,  0.4611,
         -0.8674,  0.4148,  

Computing embeddings for the entire dataset takes a long time. Here we sampled 40,000 sentences from the dataset and computed SPECTER embeddings for all of them. Let's use the embeddings to build a simple logistic regression model to predict citations!
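One common way to keep such a computation manageable is to encode the sentences in batches. The sketch below shows only the batching logic. ```embed_in_batches``` and ```fake_encode_batch``` are hypothetical names, and the stand-in encoder returns two numbers per sentence rather than a real embedding. In this notebook's setting, the encoder would presumably wrap the tokenizer and model calls shown earlier.

```python
def embed_in_batches(sentences, encode_batch, batch_size=32):
    """Encode sentences batch by batch and collect one embedding per sentence."""
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings.extend(encode_batch(batch))
    return embeddings


def fake_encode_batch(batch):
    # Stand-in encoder (an assumption for this sketch): returns
    # [character count, word count] instead of a real 768-dimensional embedding.
    return [[float(len(s)), float(len(s.split()))] for s in batch]


sentences = [
    "the system consists of two parts",
    "a machine lexicon",
    "a program package",
]
embeddings = embed_in_batches(sentences, fake_encode_batch, batch_size=2)
print(embeddings)
```

The result keeps the input order, so each row can later be paired with the label of the corresponding sentence.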

In [24]:
specter_emb = pd.read_csv('~/datasets/s4/ACL-ARC/specter_embeddings.csv')

In [25]:
label = specter_emb.pop('label')

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(specter_emb, label, test_size=0.33, random_state=42)

In [32]:
# With l1_ratio=0.0 the elastic-net penalty reduces to a pure L2 penalty.
lr = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.0, n_jobs=4, max_iter=1000)

In [33]:
lr.fit(X_train, y_train)

LogisticRegression(l1_ratio=0.0, max_iter=1000, n_jobs=4, penalty='elasticnet',
                   solver='saga')

In [34]:
pred = lr.predict(X_test)
f1_score(y_test, pred)

0.7328484666716651
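For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on a small made-up example, assuming binary labels with 1 as the positive class (the default behaviour of ```f1_score```). The helper name ```f1_from_labels``` and the toy labels are hypothetical.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_from_labels(y_true, y_pred)
print(score)  # 3 true positives, 1 false positive, 1 false negative
```

Unlike accuracy, F1 ignores true negatives, which makes it more informative when the positive class is rare.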

