<a href="https://colab.research.google.com/github/PyTorchLightning/lightning-flash/blob/master/flash_notebooks/text_classification.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In this notebook, we'll go over the basics of Lightning Flash by finetuning a TextClassifier on the [IMDB Dataset](https://www.imdb.com/interfaces/).

# Finetuning

Finetuning consists of four steps:
 
 - 1. Train a source neural network model on a source dataset. For text classification, it is traditionally a transformer model such as BERT ([Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805)) trained on Wikipedia. As those models are costly to train, the [Transformers](https://github.com/huggingface/transformers) and [FairSeq](https://github.com/pytorch/fairseq) libraries provide popular pre-trained model architectures for NLP. In this notebook, we will be using [tiny-bert](https://huggingface.co/prajjwal1/bert-tiny).

 - 2. Create a new neural network, the target model. Its architecture replicates the design and parameters of the source model, except for the last layer, which is removed. This model without its last layer is traditionally called a backbone.

 - 3. Add new layers after the backbone, where the final output size is the number of target dataset categories. Those new layers, traditionally called the head, are randomly initialized, while the backbone keeps its pre-trained weights.

 - 4. Train the target model on a target dataset, here the IMDB reviews. Freezing some layers, such as the backbone, at the start of training tends to be more stable than training everything at once. In Flash, this can easily be done with `trainer.finetune(..., strategy="freeze")`. It is also common to freeze the backbone first and unfreeze it later in training. In Flash, this can be done with `trainer.finetune(..., strategy="freeze_unfreeze")`. If you want more control over the unfreeze flow, Flash supports `trainer.finetune(..., strategy=MyFinetuningStrategy())`, where `MyFinetuningStrategy` subclasses `pytorch_lightning.callbacks.BaseFinetuning`.
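
The difference between the two strategy names in step 4 can be sketched in plain Python. This is a toy illustration, not Flash's implementation: `ParamGroup`, `apply_strategy` and the `unfreeze_epoch` argument are hypothetical names introduced here, and the parameter counts are made up.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    """Toy stand-in for a group of layers with a trainable flag."""
    name: str
    num_params: int
    trainable: bool = True


def apply_strategy(groups, strategy, epoch, unfreeze_epoch=1):
    """Set the trainable flags in place for the given epoch.

    "freeze":          the backbone never trains, only the head does.
    "freeze_unfreeze": the backbone is frozen until `unfreeze_epoch` (hypothetical
                       argument), then trains together with the head.
    """
    for group in groups:
        if group.name != "backbone":
            group.trainable = True  # the head always trains
        elif strategy == "freeze":
            group.trainable = False
        elif strategy == "freeze_unfreeze":
            group.trainable = epoch >= unfreeze_epoch
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return groups


def trainable_names(groups):
    """Names of the groups that would receive gradient updates."""
    return [group.name for group in groups if group.trainable]


groups = [ParamGroup("backbone", 1000), ParamGroup("head", 10)]
for epoch in range(3):
    apply_strategy(groups, "freeze_unfreeze", epoch, unfreeze_epoch=2)
    print(epoch, trainable_names(groups))
```

With `strategy="freeze"` the same loop would list only the head at every epoch.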

---
  - Give us a ⭐ [on GitHub](https://www.github.com/PytorchLightning/pytorch-lightning/)
  - Check out [Flash documentation](https://lightning-flash.readthedocs.io/en/latest/)
  - Check out [Lightning documentation](https://pytorch-lightning.readthedocs.io/en/latest/)
  - Join us [on Slack](https://join.slack.com/t/pytorch-lightning/shared_invite/zt-pw5v393p-qRaDgEk24~EjiZNBpSQFgQ)

### Setup  
Lightning Flash is easy to install. Simply ```pip install lightning-flash```

In [None]:
%%capture
! pip install 'git+https://github.com/PyTorchLightning/lightning-flash.git#egg=lightning-flash[text]'

In [1]:
import flash
from flash.core.data.utils import download_data
from flash.text import TextClassificationData, TextClassifier

###  1. Download the data
The data is downloaded from a URL and saved in a 'data' directory.

In [None]:
download_data("https://pl-flash-data.s3.amazonaws.com/imdb.zip", 'data/')

###  2. Load the data

Flash Tasks have built-in DataModules that you can use to organize your data. Pass in the train, validation and test CSV files and Flash will take care of the rest.
The cell below creates a `TextClassificationData` object from CSV files.

In [2]:
datamodule = TextClassificationData.from_csv(
    train_file="data/imdb/train.csv",
    val_file="data/imdb/valid.csv",
    test_file="data/imdb/test.csv",
    input_fields="review",
    target_fields="sentiment",
    backbone="prajjwal1/bert-tiny",
)

Using 'prajjwal1/bert-tiny' provided by Hugging Face/tokenizers (https://github.com/huggingface/tokenizers).
Using custom data configuration default-ba12b9e9680796cf
Reusing dataset csv (/Users/pietrolesci/.cache/huggingface/datasets/csv/default-ba12b9e9680796cf/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)
100%|██████████| 1/1 [00:00<00:00, 208.90it/s]
Loading cached processed dataset at /Users/pietrolesci/.cache/huggingface/datasets/csv/default-ba12b9e9680796cf/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-f4149a270e713310.arrow
100%|██████████| 23/23 [00:06<00:00,  3.73ba/s]
Using custom data configuration default-b01bb1f2486cafb6
Reusing dataset csv (/Users/pietrolesci/.cache/huggingface/datasets/csv/default-b01bb1f2486cafb6/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)
100%|██████████| 1/1 [00:00<00:00, 469.95it/s]
Loading cached processed dataset at /Users/pietrolesci/.cache/huggingface/datasets/cs
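
Before training, it can help to check how the labels are distributed in the target column. The sketch below uses only the standard library; the in-memory sample is made up and merely stands in for a file such as `data/imdb/train.csv`, and `label_counts` is a hypothetical helper, not part of Flash.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, target_field="sentiment"):
    """Count how often each label occurs in the target column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[target_field] for row in reader)


# Made-up rows using the same column names as the notebook's CSV files.
sample = io.StringIO(
    "review,sentiment\n"
    '"Great film, loved it",positive\n'
    '"Dull and far too long",negative\n'
    '"A wonderful surprise",positive\n'
)
counts = label_counts(sample)
print(counts)
```

For a real file, pass `open("data/imdb/train.csv", newline="")` instead of the in-memory sample.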

###  3. Build the model

Create the TextClassifier task. By default, the TextClassifier task uses a [tiny-bert](https://huggingface.co/prajjwal1/bert-tiny) backbone to train or finetune your model. You can use any model from [transformers - Text Classification](https://huggingface.co/models?filter=text-classification,pytorch).

The backbone can easily be changed, for example with `TextClassifier(backbone='bert-tiny-mnli')`.

In [3]:
model = TextClassifier(num_classes=datamodule.num_classes, backbone=datamodule.backbone)

Using 'prajjwal1/bert-tiny' provided by Hugging Face/transformers (https://github.com/huggingface/transformers).
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification

###  4. Create the trainer. Run once on data

In [5]:
# Cap every stage at 2 batches so this demo runs quickly.
trainer = flash.Trainer(
    max_epochs=1,
    limit_train_batches=2,
    limit_val_batches=2,
    limit_test_batches=2,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


###  5. Fine-tune the model

With `strategy="freeze"`, the backbone is frozen and only the newly added head is finetuned on the IMDB dataset.

In [6]:
trainer.finetune(model, datamodule=datamodule, strategy="freeze")


  | Name          | Type                          | Params
----------------------------------------------------------------
0 | train_metrics | ModuleDict                    | 0     
1 | val_metrics   | ModuleDict                    | 0     
2 | model         | BertForSequenceClassification | 4.4 M 
----------------------------------------------------------------
258       Trainable params
4.4 M     Non-trainable params
4.4 M     Total params
17.545    Total estimated model params size (MB)


                                                                      



Epoch 0: 100%|██████████| 4/4 [00:00<00:00, 14.78it/s, loss=1.12, v_num=8, train_accuracy_step=0.250, train_cross_entropy_step=1.720, val_accuracy=0.500, val_cross_entropy=0.959]
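
The 258 trainable parameters in the summary above are consistent with a single linear classification head on top of the frozen backbone. The arithmetic below is a sketch under two assumptions that the notebook does not state explicitly: that bert-tiny has a hidden size of 128 and that the head is one fully connected layer with two outputs (positive and negative). `linear_head_params` is a hypothetical helper.

```python
def linear_head_params(hidden_size, num_classes):
    """Parameters of one fully connected layer: a weight matrix plus a bias vector."""
    return hidden_size * num_classes + num_classes


hidden_size = 128  # assumption: hidden size of bert-tiny
num_classes = 2    # assumption: positive / negative

print(linear_head_params(hidden_size, num_classes))  # 258
```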


###  6. Test model

In [7]:
trainer.test(model, datamodule=datamodule)



Testing: 100%|██████████| 2/2 [00:00<00:00, 14.86it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_accuracy': 0.75, 'test_cross_entropy': 0.5825178027153015}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 2/2 [00:00<00:00, 14.52it/s]


[{'test_accuracy': 0.75, 'test_cross_entropy': 0.5825178027153015}]
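
To put the reported `test_cross_entropy` in context, a classifier that assigns probability 0.5 to each of two classes scores ln 2, roughly 0.693. The sketch below assumes the metric is the usual natural-log cross-entropy and that the task has two classes; neither is stated in the notebook.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy of one prediction: negative log of the true class probability."""
    return -math.log(probs[target_index])


# A model that cannot tell the two classes apart predicts 0.5 for each.
chance_level = cross_entropy([0.5, 0.5], target_index=0)
print(round(chance_level, 3))  # 0.693
```

Keep in mind that the trainer was limited to two test batches, so the reported values are noisy estimates.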

###  7. Save it!

In [8]:
trainer.save_checkpoint("text_classification_model.pt")

# Predicting

### 1. Load the model from a checkpoint

In [10]:
model = TextClassifier.load_from_checkpoint("text_classification_model.pt")

Using 'prajjwal1/bert-tiny' provided by Hugging Face/transformers (https://github.com/huggingface/transformers).
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification

### 2a. Classify a few sentences! How was the movie?

In [11]:
predictions = model.predict([
    "Turgid dialogue, feeble characterization - Harvey Keitel a judge?.",
    "The worst movie in the history of cinema.",
    "I come from Bulgaria where it 's almost impossible to have a tornado.",
    "Very, very afraid",
    "This guy has done a great job with this movie!",
])
print(predictions)

100%|██████████| 1/1 [00:00<00:00, 391.37ba/s]

['negative', 'negative', 'negative', 'negative', 'negative']





### 2b. Or generate predictions from a CSV file!

In [13]:
datamodule = TextClassificationData.from_csv(
    predict_file="data/imdb/predict.csv",
    input_fields="review",
    backbone="prajjwal1/bert-tiny",
)
predictions = flash.Trainer().predict(model, datamodule=datamodule)
print(predictions)

Using 'prajjwal1/bert-tiny' provided by Hugging Face/tokenizers (https://github.com/huggingface/tokenizers).
Using custom data configuration default-aec6292850745c66
Reusing dataset csv (/Users/pietrolesci/.cache/huggingface/datasets/csv/default-aec6292850745c66/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)
100%|██████████| 1/1 [00:00<00:00, 437.82it/s]
100%|██████████| 3/3 [00:00<00:00,  4.31ba/s]
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  rank_zero_warn(


Predicting: 0it [00:00, ?it/s]

ArrowInvalid: Needed to copy 1 chunks with 1 nulls, but zero_copy_only was True
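
The `ArrowInvalid` message above mentions a null value, which suggests that one row of the prediction file has an empty input field. One way to check for this before predicting is sketched below with the standard library; `rows_with_missing_input` is a hypothetical helper and the sample rows, including the `id` column, are made up.

```python
import csv
import io


def rows_with_missing_input(csv_file, input_field="review"):
    """Return the 1-based data-row numbers whose input field is missing or blank."""
    reader = csv.DictReader(csv_file)
    missing = []
    for row_number, row in enumerate(reader, start=1):
        value = row.get(input_field)
        if value is None or not value.strip():
            missing.append(row_number)
    return missing


# Made-up sample: the second data row has no review text.
sample = io.StringIO(
    "id,review\n"
    "1,A tense and well-acted thriller\n"
    "2,\n"
    "3,Not worth the ticket price\n"
)
print(rows_with_missing_input(sample))  # [2]
```

Dropping or filling the reported rows before building the datamodule may avoid the error, although the root cause could differ between library versions.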

<code style="color:#792ee5;">
    <h1> <strong> Congratulations - Time to Join the Community! </strong>  </h1>
</code>

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the Lightning movement, you can do so in the following ways!

### Help us build Flash by adding support for new data-types and new tasks.
Flash aims to become the first task hub, so anyone can get started building amazing applications using deep learning.
If you are interested, please open a PR with your contributions!


### Star [Lightning](https://github.com/PyTorchLightning/pytorch-lightning) on GitHub
The easiest way to help our community is just by starring the GitHub repos! This helps raise awareness of the cool tools we're building.

* Please, star [Lightning](https://github.com/PyTorchLightning/pytorch-lightning)

### Join our [Slack](https://join.slack.com/t/pytorch-lightning/shared_invite/zt-pw5v393p-qRaDgEk24~EjiZNBpSQFgQ)!
The best way to keep up to date on the latest advancements is to join our community! Make sure to introduce yourself and share your interests in the `#general` channel.

### Interested in SOTA AI models? Check out [Bolts](https://github.com/PyTorchLightning/lightning-bolts)
Bolts has a collection of state-of-the-art models, all implemented in [Lightning](https://github.com/PyTorchLightning/pytorch-lightning), that can be easily integrated into your own projects.

* Please, star [Bolts](https://github.com/PyTorchLightning/lightning-bolts)

### Contributions!
The best way to contribute to our community is to become a code contributor! At any time you can go to the [Lightning](https://github.com/PyTorchLightning/pytorch-lightning) or [Bolts](https://github.com/PyTorchLightning/lightning-bolts) GitHub Issues page and filter for "good first issue".

* [Lightning good first issue](https://github.com/PyTorchLightning/pytorch-lightning/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22)
* [Bolts good first issue](https://github.com/PyTorchLightning/lightning-bolts/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22)
* You can also contribute your own notebooks with useful examples!

### Great thanks from the entire PyTorch Lightning Team for your interest!

<img src="https://raw.githubusercontent.com/PyTorchLightning/lightning-flash/18c591747e40a0ad862d4f82943d209b8cc25358/docs/source/_static/images/logo.svg" width="800" height="200" />