## Lesson 4

Like the rest of the course so far, we are taking a top-down 'getting things done' approach, using models without digging into how they work.  RNNs? Transformers?  Those will come later.

### Getting started with NLP for absolute beginners

Recommended first step: work through the Kaggle notebook [Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners).

  - I ran it with 'pin to original environment' on.

   - It really is for beginners, in that it goes through a lot of basics (like the train/val/test split and the Pearson correlation coefficient). I skimmed most of this and suspect most of the group will do so as well.  Note that he says something not quite right about correlation and slope: for a least-squares fit of $y$ on $x$, the slope is $\beta = r \frac{s_y}{s_x}$, so $r$ equals the slope only when the two standard deviations are equal.

   - Based on the US Patent Phrase to Phrase matching competition. Uses classification in an interesting way to classify the combination of two phrases as similar, different or identical.

   - The dataset consists of 'anchor', 'context', 'target' and 'score' columns.  The anchor and target are the things we want to compare (in the context 'context'), and the score is the comparison; the possible values are 0, 0.25, 0.5, 0.75 and 1.0.  The key trick is to turn this into a classification problem, that is, to classify (regress) the following text:

   `df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor`

   - Context is a code like `A47`, the CPC classification code: the subject within which the similarity is to be scored.

   - Tokenization splits the text into tokens (usually subwords), and numericalization then maps each unique token to a number (the vocab). Why? Because neural nets can only use numbers! Note that you must use the right tokenizer for the model you are using. The notebook has good examples of what 'subword' tokens look like.

   - A Huggingface dataset is different from a PyTorch dataset!  Huggingface models expect a Huggingface `Dataset`, which can be created from a pandas dataframe.

   - Note there are a lot of pretrained models in the [Huggingface model library](https://huggingface.co/models), and some might be trained on something close to what you want to do. In this case he is using a general purpose model (microsoft/deberta-v3-small), which was a good starting point at that time.

   - Note that they use `AutoModelForSequenceClassification` with `num_labels=1`, so they are doing a regression (using `MSELoss()`) on the label, not a classification. But it seems to work well here.

   - Key takeaways: You can get pretty good results using existing pretrained models 
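
To make the tokenization and numericalization step from the notes above concrete, here is a minimal sketch in plain Python. It is a toy whitespace tokenizer, not the Huggingface tokenizer the notebook uses (which splits into subwords), and the example row and the helper names `tokenize`, `build_vocab` and `numericalize` are made up for illustration.

```python
def tokenize(text):
    """Toy word-level tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(texts):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {"<unk>": 0}  # id 0 is reserved for tokens we have never seen
    for text in texts:
        for tok in tokenize(text):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def numericalize(text, vocab):
    """Convert a text into a list of token ids, using <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]


# Hypothetical row, formatted the same way as the notebook's `input` column.
row = {"context": "A47", "target": "forest region", "anchor": "abatement"}
text = "TEXT1: " + row["context"] + "; TEXT2: " + row["target"] + "; ANC1: " + row["anchor"]

vocab = build_vocab([text])
ids = numericalize(text, vocab)
print(tokenize(text))
print(ids)
```

A real subword tokenizer would also split rare words into smaller pieces, which keeps the vocab small while still covering unseen words.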
 


### Video

* This video doesn't use the fast.ai library at all!  Just Huggingface.  Huggingface is supposedly state of the art for NLP.

* The video mentions that these advanced methods will be folded into fast.ai, but this doesn't seem to have happened?

[Official summary](https://course.fast.ai/Lessons/Summaries/lesson4.html)

#### ULMFit 

* Again, the approach is to fine-tune a pretrained model. The first part of the lecture tries to make what this means clearer using the slider model from the previous chapter: imagine some sliders are already close to the right setting.

* [ULMFiT](https://arxiv.org/abs/1801.06146) - used an RNN

   - Step one - use a language model pretrained on next word prediction on Wikipedia.
   - Step two - fine-tune on next word prediction for IMDb movie reviews.
   - Step three - fine-tune the model on movie sentiment with a classifier head.

* Nice thing is that the pretraining requires no special labels, the labels are built into the data!  Only in step three do we need labeled data.

* Transformers were introduced around the time of ULMFiT. Advantages:
   - Parallel training on GPUs
   - No vanishing gradients

* Jeremy says that transformers are not well suited to next word prediction (?? But they are! Gilbert 2018) and that they instead used masked word prediction (i.e. predict a missing word). This is the case for many models, but next word prediction is also used (e.g. the GPT family).  Note that DeBERTa is a variant of a masked language model.

* Jeremy uses a CNN as an example: it is the later layers that are task specific. In the image case, we throw away the last layer that was used for classification during pretraining, add a new classification head for the specific problem (a random matrix), and train that. More on how to do this in detail from scratch in part two.
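
Going back to the point above that the pretraining labels are built into the data, here is a rough sketch of how both kinds of targets can be derived from raw tokens with no human labeling. The function names, the `[MASK]` string and the masking probability are assumptions chosen for illustration, not what any particular model uses.

```python
import random


def next_token_pairs(tokens):
    """Next-token prediction: the target is the input shifted by one position."""
    return tokens[:-1], tokens[1:]


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked prediction: hide some tokens and keep the originals as targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # this position is not scored
    return inputs, targets


tokens = "the movie was surprisingly good".split()

x, y = next_token_pairs(tokens)
print(x)  # model input
print(y)  # what the model should predict at each position

masked_x, masked_y = mask_tokens(tokens, mask_prob=0.5)
print(masked_x)
print(masked_y)
```

Either way the "labels" come for free from the text itself, which is why steps one and two of ULMFiT need no labeled data.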

#### Kaggle notebook

* Next video walks through the notebook, see my notes above.  He assumes the same level for the audience: Not familiar with pandas, etc. 

* He emphasizes that classification using NLP is very accessible and has wide application. 


#### Key libraries:

* Numpy
* Pandas
* Matplotlib (Seaborn)
* PyTorch (and others like sklearn and statsmodels)

Recommends [Python for Data Analysis](https://wesmckinney.com/book/). The book covers mainly the first three, and touches on statsmodels and sklearn at the end.

#### Aside on validation sets and metrics

[How (and why) to create a good validation set](https://www.fast.ai/posts/2017-11-13-validation-sets.html)

This is just about being careful and making sure your validation set validates the task you really want to do. The key example is time series, where the validation set should consist of future points (later than anything in the training set).

Also one must be careful not to overfit to the validation set (through hyperparameter choice or model choice). Test set should be held out until the very end!
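
As a minimal sketch of the time series case, the split below holds out the most recent rows instead of a random sample. The records and the `time_based_split` helper are hypothetical and only meant to illustrate the idea.

```python
from datetime import date


def time_based_split(rows, valid_frac=0.2, key="date"):
    """Sort by time and hold out the most recent rows for validation."""
    ordered = sorted(rows, key=lambda r: r[key])
    n_valid = max(1, int(len(ordered) * valid_frac))
    return ordered[:-n_valid], ordered[-n_valid:]


# Hypothetical daily sales records.
rows = [{"date": date(2023, 1, d), "sales": 100 + d} for d in range(1, 11)]

train, valid = time_based_split(rows, valid_frac=0.2)
print(len(train), len(valid))
print(train[-1]["date"], valid[0]["date"])
```

A random split would let the model "see the future" during training, so the validation score would look better than what we would get on genuinely new data.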

[The problem with metrics is a big problem for AI](https://www.fast.ai/posts/2019-09-24-metrics.html)

* Issues with metrics becoming targets (Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"), e.g. KPPs, SPIFs, etc. have unintended consequences!

* AI makes this worse (Leverage)


### Use and Misuse of NLP

* NLP is moving fast; things are possible now that were not possible a year ago!! Huge opportunity area.

* Fake comments and articles... could influence outcomes! 


### Final question about categorical vs regression

Jeremy mentions that yes, if you pass in `num_labels=1` you get a regression model.
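
To see why the choice matters, the sketch below compares the two losses in plain Python. It does not call the transformers library; `mse_loss` and `cross_entropy_loss` are hand-written stand-ins, and the predictions and logits are made-up numbers.

```python
import math


def mse_loss(preds, targets):
    """Mean squared error: used when the model has a single output (regression)."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def cross_entropy_loss(logits, target_index):
    """Cross-entropy for one example: used when there is one output per class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


# Regression view: scores are numbers, so predicting 0.7 for a true 0.75 is "close".
reg_loss = mse_loss([0.7, 0.1], [0.75, 0.0])

# Classification view: the five possible scores become unrelated class indices 0..4.
scores = [0.0, 0.25, 0.5, 0.75, 1.0]
cls_loss = cross_entropy_loss([0.1, 0.2, 0.3, 2.0, 0.4], scores.index(0.75))

print(reg_loss, cls_loss)
```

With a single output the model is rewarded for being numerically near the true score, which suits this competition because the scores are ordered; treating them as five separate classes would throw that ordering away.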

### Hugging face tasks

[Hugging face tasks](https://huggingface.co/tasks) contains helpful starting points for a variety of tasks, including:

* Text Classification (what we were doing here)

* Question Answering (given a question and a text containing the answer, give the answer)

* Text generation (ChatGPT)

* .... 

### Chapter 10 of the book

This seems similar to the "Getting started with NLP" notebook, except that it uses the fastai libraries and RNNs instead of Transformers.  And like the video, there is not much said about the structure of these models. (Ch 12 does it at a lower level, and part 2 of the course will also go deeper.)

Some key takeaways:
 
* Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text. 


* Next word prediction training (and masked training) can create language models with deep 'understanding' of language. 

* Universal Language Model Fine-tuning (ULMFiT) is a three step process that improves transfer learning with pretrained models:

    1.  Self-supervised training on a large general corpus (e.g. Wikipedia). This part is already done for you with many models! (next word prediction)

    2.  Self-supervised fine-tune on your data.  (next word prediction)

    3.  Supervised train on your *labeled* data. (Classification)

* Steps:

    * Tokenization  (break up into tokens) 
        * Word based (uncommon) - separate on spaces

        * Subword tokens (common) - better, especially for languages where spaces are not a useful separation of concepts in a sentence (e.g. Chinese). Note also that this can handle non-language sequences, like music or DNA.

        * Character based - Fun to play with! [Shakespeare model](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

    * Numericalization: Create a vocab matching each unique token to a number, and convert tokens to numbers
    * Language model data loader creation: Generates next token target, shuffles training data, etc.
    * Language model creation:  Transformer or RNN (as in the chapter).  For now, think of it as just another deep neural network, except that it can handle arbitrary lists of numbers.


* Fastai (and Huggingface) have tools to help you with this.  I only skimmed the fastai implementation in the book, where he uses the IMDb dataset with the goal of classifying review sentiment.  Some notes:

   * Fastai has general purpose tokenizers for word and subword tokenization (`SpacyTokenizer` for words, `SentencePieceTokenizer` for subwords). The source `fastai.text.core` defines aliases for these, `WordTokenizer` and `SubwordTokenizer`, but they are *not* documented.

   * Note that `SentencePieceTokenizer` must be trained: it creates a vocab of a specified size by finding common sequences of characters.

   * Note also that, in displaying tokenized output, an underscore-like character (`▁`) is commonly used to represent spaces in the original text (so that spaces can be used to separate tokens).

   * Remember that the transformer models all come packaged with the right tokenizer to use with those models.

   * Special tokens can be used for things like "Beginning of Stream" etc. The fastai `Tokenizer` takes care of adding these special tokens.

   * Fastai also has a `Numericalize` transform that builds the vocab and converts tokens to their numerical representation.

* Text preprocessing was discussed in some detail (but this is all handled automatically by the libraries):
   * All the reviews are shuffled.
   * They are concatenated together into one long stream.
   * The stream is cut into a number of equal-length ministreams, one per item in the batch. If the batch size were 64, we would now have 64 ministreams.
   * The model then reads the ministreams in order (slicing them up further into sequences of `seq_len` to create fixed size tensors).

* In fastai, the `DataLoader` also creates the dependent variable by offsetting the streams by one token.

* All of the above is handled when `TextBlock` is passed to `DataBlock` in fastai. I guess it does word tokens by default?


* Instead of a transformer, this chapter uses a pretrained RNN (`AWD-LSTM`).  He promises to show us how to do this from scratch in Chapter 12.

    * Note an important step that the model does: it embeds the word indices into a vector space. (Chapter 9 talks about this, and I presume we will later.)

    * What is NOT clear to me is how the pretrained vocab is supposed to match the one we created. The text claims that the embeddings in the pretrained model are merged with random embeddings for words not in the pretrained vocab. But we never used the pretrained vocab at all, as far as I can tell! However, looking at the docs, it looks like the text learners have a function, `match_embeddings` (called behind the scenes?), that matches the embeddings to the pretrained model.

    * The pretrained learner starts out frozen, so the first `fit_one_cycle` trains only the random embeddings. Training on Paperspace took 45 minutes, and the saved model was 353MB. Unfortunately I obliterated it when I reran the notebook and did not make another attempt at this.  It is a bit frustrating!

    * Then we have to unfreeze and train some more.

    * Then he saves the 'encoder' which is everything except the final (task specific) layer.

* Text generation: For fun he uses the (full) model to generate random reviews. 

* Final step: using the pretrained 'encoder' to do classification

   * Create a new dataloader using the labeled data, with `TextBlock` and `CategoryBlock`.

   * A wrinkle here: each review has a different length, and each batch must be a tensor of a single shape. This is done by padding with a padding token that the model knows to ignore.  Further, texts of roughly the same size are grouped together to minimize padding (each batch doesn't have to use the same padding!). This sorting and padding is done automatically in fastai when using `TextBlock` with `is_lm=False` (the default).  Note this is not an issue with the pretraining, since there we concatenate and then split into equal sized pieces!

   * Finally he uses the pretrained model to train on classification using a gradual unfreezing method.
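
The padding and length-grouping idea from the final step above can be sketched in a few lines of plain Python. This is a simplified illustration, not fastai's implementation (which, as far as I understand, sorts only roughly and keeps some randomness); the helper names and the choice of `0` as the padding id are assumptions.

```python
def pad_batch(seqs, pad_id=0):
    """Pad every sequence in one batch to that batch's longest length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def sorted_padded_batches(seqs, batch_size, pad_id=0):
    """Group sequences of similar length, then pad each batch separately."""
    ordered = sorted(seqs, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    return [pad_batch(b, pad_id) for b in batches]


def count_padding(batches, pad_id=0):
    """Count how many padding tokens were added across all batches."""
    return sum(row.count(pad_id) for batch in batches for row in batch)


# Hypothetical numericalized reviews (token ids start at 1; 0 is padding).
reviews = [[5, 3, 8, 2, 9, 4], [7, 1], [6, 2, 3], [4, 4, 4, 4, 4]]

grouped = sorted_padded_batches(reviews, batch_size=2)
naive = [pad_batch(reviews)]  # one big batch, everything padded to the longest review

print(grouped)
print(count_padding(grouped), count_padding(naive))
```

Each batch is rectangular on its own, and because similar lengths sit together far fewer padding tokens are wasted than when everything is padded to the longest review.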

#### Disinformation and Language Models

This section was similar to the videos.   Examples:

* FCC comments on the 2017 proposal to repeal net neutrality had a huge number (> 95%) of (likely) [fake pro-repeal](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6) comments. This may have affected the outcome!  With current tech this would be much harder to detect.

* Fake identity Katie Jones on LinkedIn (fake images), GPT-2 conversations with itself on Reddit, etc.

* Since the book and video were made, we all know that these models have gotten even better (SONA!)! There will always be an arms race between the fake generators and the fake detectors.