In [None]:
%conda install -c conda-forge ipywidgets
# %jupyter nbextension enable --py widgetsnbextension

# What is Transfer Learning?

In recent years, models have grown increasingly large (billions of parameters), and more and more labeled data has become available. With more powerful infrastructure for storing that data and training the models (mainly thanks to the cloud), we have become increasingly good at training deep neural networks to learn a very accurate mapping from inputs to outputs.


However, those models are usually trained on a specific dataset and for a specific task. In the real world, you deal with messy data and new scenarios, many of which your model has not encountered during training and for which it is therefore ill-prepared to make predictions.

The ability to transfer knowledge to new conditions is generally known as **transfer learning**.

----

Figure 1 shows a classic supervised learning scenario. You train a model A for some task and domain A, assuming that labeled data for that task and domain is available. When you are later given data for another task or domain B, you again need labeled data for that same task or domain to train a new model B before you can expect it to perform well on this data.

<img src="./img/lab_12_ml.png">
Figure 1: [The traditional supervised learning setup in ML](https://ruder.io/transfer-learning/)

On the other hand, transfer learning lets us handle these scenarios by leveraging existing labeled data from a related task or domain. We store the knowledge gained in solving the source task in the source domain and apply it to our problem of interest, as shown in Figure 2.


<img src="./img/lab_12_tf.png">

Figure 2: [Transfer learning setup](https://ruder.io/transfer-learning/)

------

### Steps of Transfer Learning

Transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. Transfer learning is usually done for tasks where your dataset has too little data to train a full-scale model from scratch.

For instance, features from a model that has learned to identify raccoons may be useful to kick-start a model meant to identify tanukis.

A common transfer learning workflow consists of:

* Take layers from a previously trained model.
* Freeze them, so as to avoid destroying any of the information they contain during future training rounds.
* Add some new, trainable layers on top of the frozen layers. They will learn to turn the old features into predictions on a new dataset.
* Train the new layers on your dataset.

A last, optional step is fine-tuning, which consists of unfreezing the entire model you obtained above (or part of it) and re-training it on the new data with a very low learning rate. This can yield meaningful improvements by incrementally adapting the pretrained features to the new data.
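The workflow above can be sketched in Keras as follows. This is a minimal illustration, not a full recipe: the "pretrained" base is a tiny randomly initialised stand-in (in practice you would load real pretrained weights), the data is random toy data, and all layer names and sizes are hypothetical.

```python
import numpy as np
from tensorflow import keras

keras.utils.set_random_seed(0)  # make the toy run reproducible

# Stand-in for a previously trained model (hypothetical; normally loaded from disk or a hub).
base = keras.Sequential(
    [keras.Input(shape=(8,)), keras.layers.Dense(16, activation="relu")],
    name="pretrained_base",
)

# Steps 1-2: take the pretrained layers and freeze them.
base.trainable = False
frozen_before = [w.copy() for w in base.get_weights()]

# Step 3: add a new, trainable head on top of the frozen base.
inputs = keras.Input(shape=(8,))
features = base(inputs, training=False)  # run the base in inference mode
outputs = keras.layers.Dense(1, activation="sigmoid", name="head")(features)
model = keras.Model(inputs, outputs)
n_trainable_frozen = len(model.trainable_weights)  # only the head's kernel and bias

# Step 4: train the new layers on the new (here: random toy) dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 0).astype("float32")
model.compile(optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy")
head_before = [w.copy() for w in model.get_layer("head").get_weights()]
model.fit(x, y, epochs=2, batch_size=16, verbose=0)
frozen_after = base.get_weights()
head_after = model.get_layer("head").get_weights()

# Optional step: fine-tune by unfreezing the base and using a very low learning rate.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```

Note that the model is recompiled after `trainable` is changed, so that the training step picks up the new set of trainable weights.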

## What is Sentiment Analysis?

# What is BERT?

<img src="./img/lab_12_bert.jpg">

**BERT** stands for Bidirectional Encoder Representations from Transformers. Jacob Devlin and his colleagues developed BERT at Google in 2018. They trained BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved the best reported results on several NLP tasks in 2018.


There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture.
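The parameter counts quoted above can be roughly reproduced from the layer sizes. The sketch below counts the weights of a standard BERT encoder (embeddings, attention, feed-forward, layer norms, and pooler). It assumes details that are not stated above: a feed-forward width of 4x the hidden size, 512 positions, 2 segment types, and a WordPiece vocabulary of 30,522 tokens (uncased checkpoints) or 28,996 tokens (cased checkpoints). The number of heads does not appear because the heads only split the hidden dimension.

```python
def bert_param_count(num_layers, hidden, vocab_size, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (assumed standard layout)."""
    intermediate = 4 * hidden  # assumed feed-forward width
    # Word, position, and segment embeddings plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two dense layers (weights + biases) plus LayerNorm.
    feed_forward = (
        hidden * intermediate + intermediate
        + intermediate * hidden + hidden
        + 2 * hidden
    )
    # Pooler dense layer applied to the first token.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base_uncased = bert_param_count(num_layers=12, hidden=768, vocab_size=30522)
large_uncased = bert_param_count(num_layers=24, hidden=1024, vocab_size=30522)
base_cased = bert_param_count(num_layers=12, hidden=768, vocab_size=28996)

print(f"base (uncased vocab):  {base_uncased / 1e6:.1f}M")
print(f"large (uncased vocab): {large_uncased / 1e6:.1f}M")
print(f"base (cased vocab):    {base_cased / 1e6:.1f}M")
```

Under these assumptions the totals land close to the rounded 110M and 340M figures; the small gap between the cased and uncased base models comes entirely from the vocabulary size.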

# What is Hugging Face 🤗

Hugging Face is an open-source provider of natural language processing (NLP) technologies. It has a large open-source community, in particular around the Transformers library.

🤗/Transformers is a Python-based library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks such as text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

`pip install transformers`


More here: https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html


# Sentiment Analysis using BERT

The advantage of using `Transformers` lies in its straightforward, model-agnostic API. Loading a pre-trained model along with its tokenizer takes only a few lines of code. The cells below load the IMDB review dataset, tokenize it with the BERT tokenizer, and then load the BERT TensorFlow model together with its tokenizer:

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /Users/amir/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a...

Dataset imdb downloaded and prepared to /Users/amir/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a. Subsequent calls will reuse this data.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [4]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

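To make the `padding="max_length"` and `truncation=True` arguments used above more concrete, here is a simplified pure-Python sketch of what they do to a single sequence of token IDs. It is an illustration, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, the padding ID of 0 is an assumption matching BERT's `[PAD]` token, and special-token handling during truncation is ignored.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True."""
    ids = token_ids[:max_length]           # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)        # 1 marks real tokens
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad           # padding: fill up to max_length
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding to be ignored
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_and_truncate([101, 7, 8, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short)  # padded with two pad tokens, mask ends in two zeros
print(long)   # cut down to the first six tokens, mask is all ones
```

Every example ends up with the same length, which is what allows the tokenized dataset to be batched into fixed-shape tensors. When no explicit `max_length` is passed, as in the cell above, the tokenizer falls back to the model's maximum input length.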

In [None]:
from transformers import TFBertModel, BertTokenizer

bert_model = TFBertModel.from_pretrained("bert-base-cased")  # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

## Fine-tuning a Transformer model


import tensorflow as tf  # needed for tf.data.experimental.cardinality below
from tensorflow import keras
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

train_ds, validation_ds, test_ds = tfds.load(
    "cats_vs_dogs",
    # Reserve 10% for validation and 10% for test
    split=["train[:40%]", "train[40%:50%]", "train[50%:60%]"],
    as_supervised=True,  # Include labels
)

print("Number of training samples: %d" % tf.data.experimental.cardinality(train_ds))
print(
    "Number of validation samples: %d" % tf.data.experimental.cardinality(validation_ds)
)
print("Number of test samples: %d" % tf.data.experimental.cardinality(test_ds))

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(train_ds.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(int(label))
    plt.axis("off")