## Lesson 4

### Introduction

* Top down "getting things done" approach (as before).  

* Details of the models (RNNs / Transformers) will come later (Chapter 12, Part 2).

### Kaggle Notebook / Video Lecture

*  Notebook: [Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)

    * Based on the US Patent Phrase to Phrase matching competition.  

    * The data consists of 'anchor', 'context', 'target' and 'score' columns. The anchor and target are the phrases we want to compare (within the context given by 'context'), and the score is the comparison; the possible values are 0, 0.25, 0.5, 0.75 and 1.0.

    * Context is a code like `A47`: the CPC classification code, i.e. the subject within which the similarity is to be scored.

    * The key trick is to turn this into a text classification problem: combine the three fields into a single string and classify (or rather regress) that text:

   `df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor`

    ```
    0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
    1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
    2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
    3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
    4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
    ```

    * NB:  I ran it with 'pin to original environment' on.

* Note: Jeremy emphasizes that classification using NLP is very accessible and has wide application.  

* Video and notebook have a discussion of some basics, which I skimmed:
    * train / val / test split and why it's needed; see [How (and why) to create a good validation set](https://www.fast.ai/posts/2017-11-13-validation-sets.html). To make sure you don't overfit your hyperparameters / model choice on the validation data, hold out the test data until the very end!

    * Pearson correlation coefficient, the metric used for the Kaggle competition. (N.B., I think his presentation would have been clearer if he had standardized the variables first: the slope of a regression of $y$ on $x$ is $\beta = r \frac{s_y}{s_x}$, so with standardized variables the slope is just $r$.)
    
    * There is an aside in the video discussing [The problem with metrics is a big problem for AI](https://www.fast.ai/posts/2019-09-24-metrics.html). For example, metrics become targets (Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"). Typical examples are KPPs and SPIFs having unintended consequences. AI can run amok chasing metrics.
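
To make the Pearson coefficient note above concrete, here is a minimal sketch in plain Python. The data values are made up for illustration and the helper names are my own. It computes $r$ from its definition and checks that, once both variables are standardized, the least-squares slope is $r$ itself.

```python
import math

def mean(v):
    return sum(v) / len(v)

def std(v):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def pearson_r(x, y):
    """Pearson r = cov(x, y) / (s_x * s_y)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (std(x) * std(y))

def slope(x, y):
    """Least-squares slope for regressing y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(v), std(v)
    return [(a - m) / s for a in v]

# Hypothetical data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.25, 0.5, 0.5, 1.0]

r = pearson_r(x, y)
beta = slope(x, y)                                   # equals r * s_y / s_x
beta_std = slope(standardize(x), standardize(y))     # equals r
```

On the raw data the slope depends on the units of $x$ and $y$; after standardizing, slope and correlation coincide, which is why standardizing first makes $r$ easier to interpret.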

 
* Language Model PreTraining: 
    *  Self-supervised learning: training a model using labels that are embedded in the independent variable, rather than requiring external labels. The labels are built into the data! Only in step three (the final fine-tuning step of ULMFiT, described below) do we need labeled data.

    * RNNs tend to use 'next word' prediction, while the Transformers used here (BERT-type models) are trained on 'masked word' prediction.

    * Next word prediction training (and masked training) can create language models with deep 'understanding' of language. 
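
As a rough illustration of how the labels are "built into the data", the sketch below turns one raw sentence into training examples for both objectives. It is a toy in plain Python: the function names, the `[MASK]` string and the fixed context length are my own assumptions, not what any particular library does.

```python
import random

def next_word_pairs(tokens, context_len=3):
    """(context, next token) pairs: the label is simply the following word."""
    return [(tokens[i - context_len:i], tokens[i])
            for i in range(context_len, len(tokens))]

def masked_word_example(tokens, mask_token="[MASK]", rng=None):
    """Hide one token; the label is the word that was hidden."""
    rng = rng or random.Random(0)
    pos = rng.randrange(len(tokens))
    masked = list(tokens)
    label = masked[pos]
    masked[pos] = mask_token
    return masked, pos, label

text = "the labels are built into the data"
tokens = text.split()

# Next-word objective: every position yields a training pair for free.
pairs = next_word_pairs(tokens)

# Masked-word objective: hide a word and ask the model to fill it in.
masked, pos, label = masked_word_example(tokens)
```

In both cases no human labelling was needed: the target comes from the text itself.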

    

* ULMFiT
    * Again the approach is to fine-tune a pretrained model.
    
    * Jeremy uses a CNN as an example: it is the later layers that are task specific. In the image case, we throw away the last layer, which was used for classification during pretraining, add a new classification head for the specific problem (a random matrix), and train that. More on how to do this in detail from scratch in Part 2.

    * [ULMFiT](https://arxiv.org/abs/1801.06146) used an RNN:

        - Step one - Use a language model pretrained on next-word prediction on a large corpus (e.g. Wikipedia)
        - Step two - Fine-tune on next-word prediction for IMDB movie reviews
        - Step three - Fine-tune the model on movie sentiment with a classifier head.


    * Note there are a lot of pretrained models on the [Hugging Face model library](https://huggingface.co/models), and some might be trained on something close to what you want to do. In this case he is using a general-purpose model (`microsoft/deberta-v3-small`), which was a good starting point at that time.



* Steps for a language model:

    * Tokenization  (break up into tokens) 
        * Word based (uncommon) - separate on spaces

        * Subword tokens (common) - better, especially for languages where spaces are not a useful separation of concepts in a sentence (e.g. Chinese). Note also that this can handle non-language sequences, like music or DNA

        * Character based - Fun to play with! [Shakespeare model](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

    * Numericalization: Create a vocab matching each unique token to a number, and convert tokens to numbers. (Neural net can only work on numbers!)

    * Language model data loader creation: Generates next token target, shuffles training data, etc.

    * Language model creation: Transformer or RNN (as in the chapter). For now, think of it as just another deep neural network, except that it can handle lists of numbers of arbitrary length.
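
The first two steps above (tokenization and numericalization) can be sketched in a few lines of plain Python. This is a toy version under my own assumptions: the function names and the `<unk>` special token are hypothetical, and real tokenizers handle punctuation, casing and subwords far more carefully.

```python
def word_tokenize(text):
    """Word-based tokenization: lowercase and split on whitespace."""
    return text.lower().split()

def char_tokenize(text):
    """Character-based tokenization: every character is a token."""
    return list(text)

def build_vocab(token_lists, specials=("<unk>",)):
    """Map each unique token to an integer id; special tokens come first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(tokens, vocab):
    """Replace tokens by their ids; unseen tokens map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]

corpus = ["abatement of pollution", "act of abating"]
tokenized = [word_tokenize(t) for t in corpus]
vocab = build_vocab(tokenized)
ids = [numericalize(t, vocab) for t in tokenized]
```

Note how the repeated word "of" gets the same id in both phrases; that shared id is what lets the model learn something about a token across examples.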


*  Note that they use `AutoModelForSequenceClassification` with `num_labels=1`, so they are doing a regression (using `MSELoss()`) on the label, not a classification. But it seems to work well here.

* Key takeaway: You can get pretty good results using existing pretrained models.
 


### Additional notes from Video

* Transformers were introduced around the same time as ULMFiT. Advantages:
   - Parallel training on GPUs
   - No vanishing gradients

* Jeremy says that transformers are not well suited to next word prediction (?? But they are! Gilbert 2018) and that they instead use masked word prediction (i.e. predict a missing word). This is the case for many models, but next word prediction is also done. Note that DeBERTa is a variant of a masked language model.


#### Key libraries:

* Numpy
* Pandas
* Matplotlib (Seaborn)
* PyTorch (and others like sklearn and statsmodels)

Recommends [Python for Data Analysis](https://wesmckinney.com/book/). The book covers mainly the first three, and touches on statsmodels and sklearn at the end.


### Use and Misuse of NLP

* NLP is moving fast; things are possible now that were not possible a year ago!! Huge opportunity area.

* FCC comments on the 2017 proposal to repeal net neutrality included a huge number (> 95%) of (likely) [fake pro-repeal](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6) comments. This may have affected the outcome! With current tech this would be much harder to detect.

* Fake identity Katie Jones on LinkedIn (fake images), GPT-2 conversations with itself on Reddit, etc.

* Since the book and video were made, we all know that these models have gotten even better (SONA!)! There will always be an arms race between the fake generators and the fake detectors.


### Final question about categorical vs regression

Jeremy mentions that yes, if you pass in `num_labels=1` you get a regression model.
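
To see what that choice means in practice, here is a small plain-Python sketch contrasting the two losses. My understanding is that the Hugging Face sequence-classification heads treat a single output as regression with mean squared error and several outputs as classification with cross-entropy; the code below only mimics that rule, it is not the library's code. `pick_loss` is a hypothetical helper and all numbers are made up.

```python
import math

def mse_loss(preds, targets):
    """Mean squared error, the loss used when the head has one output."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy_loss(logits, target_class):
    """Cross-entropy for one example with several output logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_class]

def pick_loss(num_labels):
    """Hypothetical helper: one output -> regression, else classification."""
    return "mse" if num_labels == 1 else "cross_entropy"

# Regression view: predict the similarity score directly.
scores = [0.0, 0.25, 0.5, 0.75, 1.0]     # the five possible labels
preds = [0.1, 0.25, 0.4, 0.75, 0.9]      # made-up model outputs
reg_loss = mse_loss(preds, scores)

# Classification view: treat the five scores as five unordered classes.
logits = [0.0, 0.0, 0.0, 0.0, 0.0]       # made-up, uniform logits
clf_loss = cross_entropy_loss(logits, target_class=2)
```

The regression view keeps the ordering of the scores (predicting 0.4 for a true 0.5 is a small error), whereas the classification view would treat 0.25 and 0.5 as unrelated classes.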

### Hugging Face tasks

[Hugging Face tasks](https://huggingface.co/tasks) contains helpful starting points for a variety of tasks, including:

* Text Classification (what we were doing here)

* Question Answering (given a question and a text containing the answer, give the answer)

* Text generation (ChatGPT)

* .... 

### Chapter 10 of the book

This seems similar to the video, with these differences:

 -  Uses the IMDB sentiment classification problem (as in the ULMFiT paper).
 - Uses the fastai library instead of the Hugging Face library
 -  Uses RNNs instead of transformers
 
Like the video, there is not much said about the structure of these models. (Ch 12 does it at a lower level, and part 2 of the course will also go deeper.)



### Chapter 10 extra notes 

 


* Embedding 
   * After numericalization, the first layer maps each index to an n-dimensional 'embedding' vector. This is mentioned in Chapter 10, but embeddings in general are discussed in Chapter 9 as well as Lesson 7.
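
A minimal sketch of what the embedding layer does, in plain Python with nested lists standing in for tensors (the function names and sizes are my own choices for illustration). The lookup just picks a row of a weight matrix, which gives the same result as multiplying a one-hot vector by that matrix.

```python
import random

def make_embedding(vocab_size, dim, seed=0):
    """Randomly initialised embedding matrix: one row per token id."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(ids, weights):
    """Embedding lookup: pick the row for each token id."""
    return [weights[i] for i in ids]

def one_hot_matmul(idx, weights):
    """Same result as the lookup, written as one-hot vector times matrix."""
    one_hot = [1.0 if j == idx else 0.0 for j in range(len(weights))]
    dim = len(weights[0])
    return [sum(one_hot[j] * weights[j][d] for j in range(len(weights)))
            for d in range(dim)]

weights = make_embedding(vocab_size=6, dim=4)
vectors = embed([1, 2, 3], weights)   # three token ids -> three 4-d vectors
```

During training the rows of `weights` are updated like any other parameters, so tokens used in similar ways end up with similar vectors.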





* fastai tools

   * fastai has general-purpose tokenizers for word and subword tokenization (`SpacyTokenizer` for words, `SentencePieceTokenizer` for subwords). The source `fastai.text.core` defines aliases for these, `WordTokenizer` and `SubwordTokenizer`, but they are *not* documented.
   
   * Note that `SentencePieceTokenizer` must be trained: it creates a vocab of a specified size by finding common sequences of characters.

   * Note also that when displaying tokenized output, an underscore is commonly used to represent spaces in the original text (so that spaces can be used to separate tokens).

   * Reminder that the transformer models all come packaged with the right tokenizer to use on those models.

   * Special tokens can be used for things like "Beginning of Stream" etc. The fastai `Tokenizer` takes care of adding these special tokens.

   * fastai also has a `Numericalize` transform that maps tokens to their indices in the vocab.

   * In fastai, `DataBlock` also creates the dependent variable by offsetting the streams by one token. For text processing, we need to pass `TextBlock` in the `blocks` argument. There are two ways to preprocess the text:
   
      - For pretraining, `is_lm=True` processes the text for next word prediction as follows:
         * All the reviews are shuffled
         * They are concatenated into one long stream
         * The stream is cut into as many contiguous mini-streams as the batch size. If the batch size is 64, we now have 64 streams.
         * The model then reads the mini-streams in order (slicing them up further into sequences of `seq_len` to create fixed-size tensors)
      
      - `is_lm=False`, by contrast, is for classification / regression heads. Since each review has a different length, and each tensor needs a single shape, we use padding to make them the same size. This uses a padding token that the model knows to ignore. Further, texts of roughly the same size are grouped together to minimize padding (each batch doesn't have to use the same padding!). This sorting and padding is done automatically in fastai when using `TextBlock` with `is_lm=False` (the default). Note this is not an issue for pretraining, since we concatenate and then split into equal-sized pieces!
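
The two preprocessing modes described above can be sketched in plain Python. This is a simplification under my own assumptions, not fastai's implementation: the function names are hypothetical, documents are sorted exactly rather than grouped roughly, padding goes on the right, and `pad_id=0` is an arbitrary choice.

```python
import random

def lm_batches(docs, bs, seq_len, seed=0):
    """Shuffle docs, concatenate, cut into `bs` streams, slice into (x, y)."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    stream = [tok for doc in docs for tok in doc]
    # Each stream keeps one extra token so the last target exists;
    # any remainder at the end of the long stream is dropped.
    stream_len = (len(stream) - 1) // bs
    streams = [stream[i * stream_len:(i + 1) * stream_len + 1]
               for i in range(bs)]
    batches = []
    for start in range(0, stream_len, seq_len):
        end = min(start + seq_len, stream_len)
        x = [s[start:end] for s in streams]
        y = [s[start + 1:end + 1] for s in streams]  # targets offset by one
        batches.append((x, y))
    return batches

def padded_batches(docs, bs, pad_id=0):
    """Sort by length, then pad each batch only to its own longest item."""
    ordered = sorted(docs, key=len)
    batches = []
    for i in range(0, len(ordered), bs):
        chunk = ordered[i:i + bs]
        width = max(len(d) for d in chunk)
        batches.append([d + [pad_id] * (width - len(d)) for d in chunk])
    return batches

# Four made-up "reviews", already numericalized (ids start at 1).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

lm = lm_batches(docs, bs=2, seq_len=2)   # language-model style batches
clf = padded_batches(docs, bs=2)         # classifier style batches
```

In the language-model batches, row `i` of one batch continues where row `i` of the previous batch stopped, which is what lets a stateful model read each mini-stream in order. In the classifier batches, sorting by length first means the short reviews are padded to width 2 and the long ones to width 4, instead of everything being padded to width 4.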
 
* Notes on fitting in fastai:

   * The pretrained model starts out frozen, so the first `fit_one_cycle` trains only the randomly initialized embeddings. My attempt at 'step 2' training on Paperspace took 45 minutes. The saved model was 353MB. Unfortunately I obliterated it when I reran the notebook and did not make another attempt at this. It is a bit frustrating! The next step would have been to unfreeze and (pre-)train some more.

   * Text generation: For fun he uses this model to generate random reviews. 

   * Final step: extract the pretrained 'encoder' (i.e. all but the last layer)  

   * Create new dataloaders using the labeled data, `TextBlock` and `CategoryBlock`. The classifier learner built from them adds a classification head.

   * Finally he uses this (pretrained) model to train on classification using a gradual unfreezing method.
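
The gradual unfreezing idea can be illustrated without any deep learning library. The sketch below only tracks which layer groups would be trainable at each stage; the class and helper names are my own, loosely modelled on fastai's `freeze_to` / `unfreeze`, and the four-group model is hypothetical.

```python
class LayerGroup:
    """Stand-in for a group of layers; only tracks whether it is trainable."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_to(groups, n):
    """Freeze groups[:n] and leave groups[n:] trainable (n may be negative)."""
    cut = n if n >= 0 else len(groups) + n
    for i, g in enumerate(groups):
        g.trainable = i >= cut

def trainable_names(groups):
    """Names of the groups that would receive gradient updates."""
    return [g.name for g in groups if g.trainable]

# Hypothetical model: three encoder groups plus a new classifier head.
groups = [LayerGroup(n) for n in ("enc_1", "enc_2", "enc_3", "head")]

schedule = []
for n in (-1, -2, -3, 0):   # head only, then one more group per stage
    freeze_to(groups, n)
    schedule.append(trainable_names(groups))
    # ...train for an epoch or so here before unfreezing further...
```

Training the new head first, and only then letting earlier groups move, avoids wrecking the pretrained weights with the large, noisy gradients that a randomly initialized head produces at the start.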