# Transfer Learning
In the previous module, we learned how to use existing HuggingFace models through the `pipeline` API and the `AutoClass` features. In this module, we move towards tailoring existing models to our own use cases through *transfer learning*. We will first go over the concepts of pretraining and finetuning. You will then learn how to load your own dataset using PyTorch's `TensorDataset` and `DataLoader`. In the next notebook, we will use the loaded custom dataset to build models.

__What you will learn:__

By the end of this notebook, you will have the skills to load your own dataset, tokenize it, and divide it into train, validation, and test sets.

The notebook is divided into two sections. In the first section, we discuss pretraining and finetuning concepts and learn how to load some pretrained transformer models. In the second section, we cover data preparation: how to load custom datasets so that they can be processed by transformers.

Topics covered:
- Pretraining and Finetuning
- Datasets and Dataloader



### Pretraining, Finetuning and Transfer Learning

Let's say your company wants to build the following products over 2020:
- sentiment analyzer (Jan - Mar, 2020)
- paraphrase detector (Apr - Jun, 2020)
- named entity recognizer (Jul - Sep, 2020)
- question answering system (Oct - Dec, 2020)

The company is willing to provide about 5000 training examples for each of the tasks. You, as a Machine Learning practitioner, will build a neural network for each of these tasks over the year. Your individual classifiers might be randomly initialized or initialized with word embeddings. As you build more and more classifiers, you realize that what is learned on one task could potentially help another task. In some ways, the learning from one task can be "transferred" to another task. Even better, what if we could build one model that can be modified for different tasks?

The ML community has built several such pretrained models on large amounts of data. Since labeling is expensive, the tasks on which these large models are trained are self-supervised in nature. For example, during pretraining, the BERT model learns to predict the masked words in a given sentence.

```
Amazon is the longest [MASK] in the [MASK]
```

By learning to predict that the `[MASK]` tokens here are `{river, world}`, BERT learns *contextual word embeddings*. BERT was trained on millions of such examples. To be specific, the pretraining was performed on the BookCorpus data (800M words) and English Wikipedia (2500M words), and the models were made publicly available by Google.
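To make the pretraining objective more concrete, here is a minimal sketch of how masked training examples could be produced from raw text. The helper `mask_tokens` and its `mask_prob` parameter are hypothetical names used only for illustration; BERT's actual procedure operates on WordPiece tokens and uses a somewhat more involved replacement scheme.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; return the masked input and the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # the word the model must learn to recover
        else:
            masked.append(token)
    return masked, targets

sentence = "Amazon is the longest river in the world".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(" ".join(masked))
print(targets)
```

No human labels are needed here: the hidden words themselves serve as the training targets, which is what makes the task self-supervised.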

#### Glossary
- __downstream tasks__: 

Downstream tasks are end-tasks that are of interest to you and to your company. Tasks such as sentiment analysis and question answering, which typically have limited training examples, are downstream tasks.

- __transfer learning__:

While training downstream tasks, instead of randomly initializing the network, we would rather initialize it with the weights of a pretrained model. The pretrained models, having gone through millions of words in gigantic corpora, can produce better representations for the words in our input. By initializing a model with pretrained weights, we can work on a downstream task with fewer labels. This is called *transfer learning*, where knowledge from the pretraining task is said to be "transferred" to the downstream task.

- __fine tuning__:

Finetuning is simply training a downstream task by initializing the network with weights corresponding to a pretrained model.
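The glossary entries above can be illustrated with a toy PyTorch sketch. The `Encoder` and `Classifier` modules below are tiny hypothetical stand-ins (not BERT), and the learning rate is only an illustrative value; the point is the pattern of copying pretrained weights into the body of a downstream model and then training it on labeled data.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Tiny stand-in for a pretrained network body (hypothetical)."""
    def __init__(self, vocab_size=100, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.Linear(hidden, hidden)

    def forward(self, ids):
        # (batch, seq_len) -> (batch, hidden) by averaging over the sequence
        return torch.tanh(self.layer(self.embed(ids))).mean(dim=1)

class Classifier(nn.Module):
    """Downstream model: the same encoder plus a new, randomly initialized head."""
    def __init__(self, num_classes, vocab_size=100, hidden=8):
        super().__init__()
        self.encoder = Encoder(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, ids):
        return self.head(self.encoder(ids))

pretrained = Encoder()  # pretend this was trained on a huge unlabeled corpus
model = Classifier(num_classes=4)
model.encoder.load_state_dict(pretrained.state_dict())  # the "transfer" step

# Finetuning: one training step on a fake labeled batch.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
ids = torch.randint(0, 100, (2, 5))  # 2 sequences of 5 token ids
labels = torch.tensor([0, 3])
loss = nn.functional.cross_entropy(model(ids), labels)
loss.backward()
optimizer.step()
```

Only the classification head starts from random weights; everything else begins from the pretrained state, which is why fewer labeled examples can be enough.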

Okay, enough theory, let's load some pretrained models. 

In [None]:
# !pip install --quiet transformers
#or if you are using conda
# conda install -c huggingface transformers

From our previous notebook, you might remember that the `pipeline` API will run a query against a pretrained model. Therefore, it cannot be used for finetuning a downstream task. 

The two classes that will be handy are:
- `AutoModel`: loads pretrained models such as GPT-2, ProphetNet, and BART.
- `AutoTokenizer`: loads the tokenizer that was originally used to tokenize text for a given pretrained model.



In [1]:
from transformers import AutoModel, AutoTokenizer

To obtain a pretrained model, simply call the `from_pretrained()` function and pass the model name or path to the `pretrained_model_name_or_path` parameter. Go through the [model repository](https://huggingface.co/models) to try different pretrained models.


In [2]:
bert_model = AutoModel.from_pretrained(
    pretrained_model_name_or_path="bert-base-uncased"
)
bert_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-uncased"
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [3]:
bert_tokenizer.pad_token_id

0

Let's look at another example. How about [T5](https://arxiv.org/pdf/1910.10683.pdf), another transformer-based model? T5 is an encoder-decoder transformer that is pretrained on a large dataset called the [Colossal Clean Crawled Corpus (C4)](https://commoncrawl.org/). The larger T5 variants also have more parameters than BERT and outperform it on several downstream tasks.


In [4]:
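from transformers import AutoModelForSeq2SeqLM

# T5 is an encoder-decoder model, so we load it with the seq2seq auto class
# (`AutoModelWithLMHead` is deprecated in recent versions of transformers).
t5_model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path="t5-small")
t5_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="t5-small")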




Before an input text is passed to the model, it has to be converted into a numerical representation. The tokenizer's `encode` method helps us accomplish this.

In [5]:
query = "I think it is going to rain tonight"
tokenized_query = bert_tokenizer.encode(query)

Let's take a look at what the numerical representation of the input sentence looks like:

In [6]:
tokenized_query

[101, 1045, 2228, 2009, 2003, 2183, 2000, 4542, 3892, 102]

The numericalized tokens will then be passed to the model along with information such as the attention mask, as we shall soon see.
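To build some intuition for what `encode` returned, the sketch below reproduces the same ids with a plain dictionary lookup. The `toy_vocab` and `toy_encode` names are hypothetical, and the vocabulary contains only the ids visible in the output above; the real tokenizer has a much larger vocabulary and splits unknown words into WordPiece subwords, which this sketch does not attempt.

```python
# Token ids copied from the output above; only the words of this one query are included.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "think": 2228, "it": 2009, "is": 2003,
    "going": 2183, "to": 2000, "rain": 4542, "tonight": 3892,
}

def toy_encode(text):
    """Lowercase, split on whitespace, look up ids, and wrap in [CLS] ... [SEP]."""
    words = text.lower().split()
    return [toy_vocab["[CLS]"]] + [toy_vocab[w] for w in words] + [toy_vocab["[SEP]"]]

print(toy_encode("I think it is going to rain tonight"))
# [101, 1045, 2228, 2009, 2003, 2183, 2000, 4542, 3892, 102]
```

The first and last ids (`101` and `102`) are the special tokens that the tokenizer adds around every sequence.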

# Loading your own datasets

In [7]:
import torch

Now, you may not have enough data, even for finetuning. As a Machine Learning practitioner, you will often run into such scenarios. Trying to find more labeled data on the Internet might turn out to be a wild goose chase. An alternative solution is to annotate your own dataset! Yes, you heard that right!

There are some really cool tools such as [prodigy](https://prodi.gy/) that can help us annotate faster. We will learn more about prodigy and how we can use it to augment our dataset in our next notebook. For now, we will use the `Dataset Search` app from Google to obtain a dataset. 

Head over to [Dataset Search](https://datasetsearch.research.google.com/), search for `Twitter Climate Change Sentiment Dataset`, download the dataset from Kaggle, and unzip it on your local computer.


Let's upload the dataset here. We will be using the same dataset in notebooks 3 and 4 as well.

In [None]:
# !pip install google-colab
# # or
# !conda install -c conda-forge google-colab

In [12]:
from google.colab import files
climate_change_dataset = files.upload()


Let's load the dataset into a dataframe.

In [13]:
import pandas as pd
df = pd.read_csv("twitter_sentiment_data.csv")

It's always a good idea to take a look at what the dataset looks like.

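The dataset has 4 classes: `{-1, 0, 1, 2}`. According to the dataset's description on Kaggle, `-1` marks tweets that reject man-made climate change (anti), `0` is neutral, `1` marks tweets that support it (pro), and `2` marks tweets that link to factual news about climate change. Our task, therefore, is to identify the sentiment class for a given tweet.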
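A quick way to inspect a dataframe is sketched below. So that the snippet runs on its own, it uses a small made-up dataframe `toy_df` with the same two columns this notebook relies on (`sentiment` and `message`); the rows are hypothetical. The same calls can be applied to `df`.

```python
import pandas as pd

# Hypothetical rows that mimic the two columns used in this notebook.
toy_df = pd.DataFrame({
    "sentiment": [1, 2, 0, -1, 1, 1],
    "message": [
        "Climate change is real and we must act",
        "New report on rising sea levels released today",
        "What is the weather like tomorrow?",
        "Global warming is a hoax",
        "We need renewable energy now",
        "Protect the planet for our kids",
    ],
})

print(toy_df.head())                    # first few rows
print(toy_df.sentiment.value_counts())  # number of examples per class
print(toy_df.message.str.split().str.len().describe())  # rough tweet length in words
```

The per-class counts show whether the classes are imbalanced, and the length statistics give a first hint for choosing a maximum sequence length later on.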

# Tokenizing

As we discussed earlier in this notebook, our first step is to map the text to a consistent numerical representation using a tokenizer. For the rest of the learning package, we will work with BERT models, as they are among the most widely used NLP models in recent years.

There are a few special tokens in BERT:
1. __[CLS]__: a special token that is added at the beginning of every sequence. The final hidden vector corresponding to this token is used for classification tasks.
2. __[SEP]__: this token separates two sequences.
3. __[PAD]__: sequences are padded up to `MAX_LEN` with a pad token (id `0`), where `MAX_LEN` is the maximum length of the sequence.

Using the `encode` method, we will first convert input tokens to their corresponding numerical form.

### Encoding text

In [14]:
input_ids = []

for tweet in df.message:
    tweet_in_ids = bert_tokenizer.encode(tweet, add_special_tokens=True)
    input_ids.append(tweet_in_ids)

The `add_special_tokens` parameter automatically adds the `CLS` and `SEP` tokens at the beginning and at the end of the sentence, respectively. Let's take a look at what the input tokens look like after they have been numericalized.


In [15]:
input_ids[0]

[101, 1030, 9543, 2666, 4783, 19092, 4785, 2689, 2003, 2019,
 5875, 15876, 22516, 2004, 2009, 2001, 3795, 12959, 2021, 1996,
 4774, 3030, 12959, 2005, 2321, 2748, 2096, 1996, 15620, 8797,
 102]

The maximum sequence length supported by BERT is 512 tokens. But our task doesn't always need that many. If our `MAX_LEN` is 64, we want shorter sequences to be filled up to that length with the `PAD` token. We can do this with Keras's `pad_sequences` function.

Also, since the model eventually expects all its inputs to be tensors, let us go ahead and convert them to tensors.

### Padding

In [None]:
# !pip install keras
# # or 
# !conda install -c conda-forge keras 

In [17]:
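from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 64
PAD_TOKEN = bert_tokenizer.pad_token_id  # 0 for bert-base-uncased

# Pad (or truncate) every tweet to exactly MAX_LEN token ids.
input_ids = pad_sequences(
    input_ids,
    maxlen=MAX_LEN,
    dtype="long",
    value=PAD_TOKEN,
    padding="post",     # append pad tokens at the end
    truncating="post"   # drop extra tokens from the end
)
input_ids = torch.tensor(input_ids)  # shape: (number of tweets, MAX_LEN)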



Let's take a look at what the first sentence looks like now:

In [19]:
input_ids[0]

tensor([  101,  1030,  9543,  2666,  4783, 19092,  4785,  2689,  2003,  2019,
         5875, 15876, 22516,  2004,  2009,  2001,  3795, 12959,  2021,  1996,
         4774,  3030, 12959,  2005,  2321,  2748,  2096,  1996, 15620,  8797,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])

As you can see, the `input_ids` are now padded with the `PAD` token id `0`. The `post` option in `padding` means that the `PAD` tokens are appended to the list instead of being prepended. Similarly, `post` in `truncating` ensures that all extra tokens beyond `MAX_LEN` are removed from the end of the list rather than from the beginning.
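The "post" behaviour described above can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is a hypothetical name written for illustration; it is not part of Keras or transformers, and the token ids in the example calls are made up.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Post-truncate to max_len, then post-pad with pad_id up to max_len."""
    ids = ids[:max_len]                           # truncating="post": drop the tail
    return ids + [pad_id] * (max_len - len(ids))  # padding="post": append pads

print(pad_or_truncate([101, 2023, 102], max_len=5))
# [101, 2023, 102, 0, 0]
print(pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5))
# [101, 1, 2, 3, 4]
```

Note that post-truncation cuts off the closing `[SEP]` id (`102`) of an over-long sequence, as the second call shows. With `MAX_LEN = 64` this only affects tweets longer than 64 tokens, but it is worth keeping in mind when choosing `MAX_LEN`.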



### Attention mask
With the attention mask, we let BERT know which tokens should be attended to and which should not. All the actual tokens will have an attention mask of `1`, and the rest of the tokens (pad tokens) will have a mask of `0`.

In [20]:
attention_masks = torch.tensor([[int(tok > 0) for tok in tweet] for tweet in input_ids])

In [21]:
attention_masks[0]

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The same mask can be written more concisely as:

In [22]:
attention_masks = (input_ids > 0).int()

Here's what the attention mask corresponding to our first sentence looks like:

In [23]:
attention_masks[0]

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.int32)
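The `> 0` comparison works because the pad id of this tokenizer happens to be `0`. A slightly more explicit variant, sketched below on a made-up batch, compares against the pad id itself; `toy_input_ids` and `toy_masks` are hypothetical names, and `pad_token_id = 0` is assumed to match the `bert_tokenizer.pad_token_id` value shown earlier.

```python
import torch

pad_token_id = 0  # assumed to match bert_tokenizer.pad_token_id shown earlier
toy_input_ids = torch.tensor([
    [101, 2009, 2003, 102, 0, 0],
    [101, 4542, 102, 0, 0, 0],
])

toy_masks = (toy_input_ids != pad_token_id).long()
print(toy_masks)
# tensor([[1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0, 0]])
print(toy_masks.sum(dim=1))  # number of real tokens per sequence
# tensor([4, 3])
```

Comparing against the pad id keeps the code correct for tokenizers whose pad id is not `0`.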

# Constructing Train, Validation Dataset
The Twitter dataset on climate change does not come with a train/dev/test split. Let's manually split the dataset using `scikit-learn`'s `train_test_split` function.

In [24]:
from sklearn.model_selection import train_test_split
import torch

train_data, val_data, train_labels, val_labels = train_test_split(
    input_ids,
    list(df.sentiment), 
    random_state=1234,
    test_size=0.2
)

train_mask, val_mask, _, _ = train_test_split(
    attention_masks,
    list(df.sentiment),
    random_state=1234,
    test_size=0.2
)

#let's also convert labels to tensors
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
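Calling `train_test_split` twice with the same `random_state` keeps the ids and masks aligned, but the function also accepts several arrays at once and applies the same shuffle to all of them. The sketch below uses small made-up arrays (all `toy_*` names are hypothetical) to show a single-call split, and a second call that carves out a separate test set. The `stratify` argument, which keeps class proportions similar across splits, is an optional extra and not something the cell above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 examples; row i of every array belongs to example i.
toy_ids = np.repeat(np.arange(10).reshape(10, 1), 4, axis=1)  # row i is [i, i, i, i]
toy_masks = np.ones((10, 4), dtype=int)
toy_labels = np.array([0, 1] * 5)                             # label of example i is i % 2

# Step 1: hold out 20% of the data as a test set.
rest_ids, test_ids, rest_m, test_m, rest_y, test_y = train_test_split(
    toy_ids, toy_masks, toy_labels,
    test_size=0.2, random_state=1234, stratify=toy_labels,
)
# Step 2: split the remaining 80% into train and validation sets.
train_ids, val_ids, train_m, val_m, train_y, val_y = train_test_split(
    rest_ids, rest_m, rest_y,
    test_size=0.25, random_state=1234, stratify=rest_y,
)
print(len(train_ids), len(val_ids), len(test_ids))
# 6 2 2
```

Because all arrays go through the same call, the rows of the ids, masks, and labels stay aligned without having to repeat the `random_state`.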

Awesomesauce! We have now split the dataset into `train` and `validation` sets. We have also managed to map text to ids using the BERT tokenizer. The final step is to construct a `DataLoader` so that we can support sampling and batch training on our dataset.

Since the dataset is already in tensor format, we will use `TensorDataset` to wrap it into a `Dataset` before constructing a `DataLoader`.


In [25]:
from torch.utils.data import DataLoader, TensorDataset, RandomSampler

BATCH_SZ = 4
train_dataset = TensorDataset(train_data, train_mask, train_labels)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
    dataset=train_dataset,
    sampler=train_sampler,
    batch_size=BATCH_SZ
)

Let us take the first batch from the dataloader and see what the data looks like. You should be able to see the four samples retrieved from the training dataset.

In [26]:
next(iter(train_dataloader))

[tensor([[  101, 19387,  1030,  2197, 28075,  2669, 18743,  1024,  2065,  2017,
          24260, 29645,  2050,  1522,  1078,  2050,  1525,  1068,  2102,  2903,
           2158,  1011,  2081,  3795, 12959,  2003,  1037,  1037, 10021,  3277,
           1010,  2507,  2000,  1996,  3019,  4219,  3639,  2473,  1006, 16770,
           1024,  1037, 29645,  2050,  1522,  1078,  2050, 29649,   102,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0],
         [  101, 19044,  2708,  1024,  6351, 14384,  2025,  1005,  3078, 12130,
           1005,  2000,  4785,  2689, 16770,  1024,  1013,  1013,  1056,  1012,
           2522,  1013,  1043,  2549,  2140,  2575,  2480, 28311, 13668,  2497,
            102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
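Each batch is a list of three tensors, in the same order in which they were passed to `TensorDataset`: input ids, attention masks, and labels. The sketch below shows how a batch can be unpacked, using a made-up dataset (`toy_dataset` and `toy_loader` are hypothetical names with random contents) so that it runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(
    torch.randint(1, 1000, (10, 64)),      # fake input ids
    torch.ones(10, 64, dtype=torch.long),  # fake attention masks
    torch.randint(0, 4, (10,)),            # fake labels
)
# shuffle=True makes the DataLoader draw samples in random order,
# much like passing a RandomSampler explicitly.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_ids, batch_masks, batch_labels = next(iter(toy_loader))
print(batch_ids.shape, batch_masks.shape, batch_labels.shape)
# torch.Size([4, 64]) torch.Size([4, 64]) torch.Size([4])
```

The first dimension of every tensor is the batch size, so row `i` of each tensor describes the same example.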
 

# Homework
Congratulations on making it to the end of this lesson!

As homework, go ahead and build the validation dataloader. A word of caution, though: `RandomSampler` is not the right choice for the validation dataset. Do you know why? What else can we use? Think about it before you check out the next notebook! If you can't figure it out, don't worry, you will find out in the next notebook!