
## NLP Deep Dive: RNNs

#### **The Basic Overview:**
---

Chapter 1 of the book showed that deep learning achieves strong results on natural language datasets by fine-tuning pretrained language models for specific tasks. This chapter covers the concepts and steps behind training those language models for NLP tasks.


#### 1. **Language Modeling and Transfer Learning**
---

**Key Concepts:**
- **Self-Supervised Learning**: Training a model using labels embedded in the data itself, such as predicting the next word in a sentence.
  - This task forces the model to develop an understanding of the language.
- **Universal Language Model Fine-tuning (ULMFiT)**: Adds an intermediate step between pretraining and classification: the pretrained language model is first fine-tuned on the target corpus. This improves the classifier's predictions and works in two steps:
  1. Fine-tuning the language model on the specific corpus (e.g., IMDb movie reviews).
  2. Using the fine-tuned model as the base for classification.
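
To make the self-supervised idea concrete, here is a minimal sketch of how next-word labels can be read directly off raw text, with no manual annotation. The helper name `next_word_pairs` and the sample sentence are made up for illustration, and a real language model works on token IDs rather than plain words.

```python
def next_word_pairs(text):
    """Return (context, target) pairs; the target is the word that follows the context."""
    words = text.split()
    # Every prefix of the sentence is an input, and the word right after it is its label.
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("the movie was surprisingly good")
for context, target in pairs:
    print(" ".join(context), "->", target)
```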
---
**Transfer Learning Stages in NLP:**
1. Pretrain a language model on a large corpus (e.g., Wikipedia).
2. Fine-tune the pretrained model to the target corpus.
3. Fine-tune the model for the specific classification task.
---

#### 2. **Building a Language Model**
---

***Text Preprocessing Steps:***
1. **Tokenization**:
   - Splits text into smaller units (tokens).
   - Methods include:
     - **Word-based**: Splits based on spaces and rules (e.g., `don't` -> `do n't`).
     - **Subword-based**: Splits words into common substrings (e.g., `occasion` -> `o c ca sion`).
     - **Character-based**: Splits into individual characters.

   fastai's `Tokenizer` handles tokenization and adds special tokens such as `xxbos` (start of a text) and `xxmaj` (the next word begins with a capital letter); `xxunk` later stands in for words that are missing from the vocabulary.

2. **Numericalization**:
   - Converts tokens into numbers using a vocabulary.
   - Fastai provides utilities to:
     - Use existing pretrained vocabularies.
     - Initialize embeddings for new words with random vectors.

3. **Language Model Data Loader Creation**:
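   - `LMDataLoader` (exported by `fastai.text.all`) offsets the dependent variable from the independent variable by one token, so each target is the token that follows its input.
   - At each epoch it shuffles the order of the documents, not the tokens inside them, so the text within each document stays contiguous.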

4. **Language Model Creation**:
   - Requires a model that can handle sequences of variable lengths (e.g., RNNs).
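
The numericalization and offset-by-one steps above can be sketched in plain Python. This is an illustration under stated assumptions, not fastai's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the special-token list is a reduced subset, and `min_freq=2` is chosen only to suit the tiny sample.

```python
from collections import Counter


def build_vocab(tokens, min_freq=2, specials=("xxunk", "xxpad", "xxbos")):
    """Special tokens first, then tokens seen at least `min_freq` times, most frequent first."""
    counts = Counter(tokens)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent


def numericalize(tokens, vocab):
    """Map each token to its vocab index; tokens outside the vocab map to xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]
    return [index.get(tok, unk) for tok in tokens]


tokens = "xxbos the film was good . xxbos the plot was thin .".split()
vocab = build_vocab(tokens)
ids = numericalize(tokens, vocab)

# Language-model inputs and targets are the same stream shifted by one token.
x, y = ids[:-1], ids[1:]
print(vocab)
print(x)
print(y)
```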
---


#### 3. **Tokenization Using fastai**
---

- Fastai uses external libraries like spaCy for tokenization.
- Default rules include:
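  - `fix_html`: Replaces special HTML characters with a readable version.
  - `replace_rep`: Replaces a run of repeated characters with `xxrep`, the repeat count, and the character (e.g., `xxxx` -> `xxrep 4 x`).
  - `replace_maj`: Lowercases a capitalized word and puts `xxmaj` in front of it.
  - `lowercase`: Converts all text to lowercase and adds `xxbos` at the start and/or `xxeos` at the end.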
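
To show what rules of this kind do to a string, here is a simplified re-implementation of three of them using regular expressions. These are not fastai's functions: the `sketch_*` names are hypothetical, the threshold of three or more repeats is an assumption, and the real rules handle more edge cases.

```python
import re


def sketch_replace_rep(text):
    """Replace a character repeated 3+ times with ' xxrep <count> <char> '."""
    def _sub(match):
        return f" xxrep {len(match.group(0))} {match.group(1)} "
    return re.sub(r"(\S)\1{2,}", _sub, text)


def sketch_replace_maj(text):
    """Lowercase a capitalized word and put 'xxmaj' in front of it."""
    return re.sub(r"\b[A-Z][a-z]+\b",
                  lambda m: "xxmaj " + m.group(0).lower(), text)


def sketch_lowercase(text, add_bos=True):
    """Lowercase everything and optionally prepend the start-of-text token."""
    return ("xxbos " if add_bos else "") + text.lower()


def apply_rules(text):
    """Run the three sketches in order, then split on whitespace."""
    for rule in (sketch_replace_rep, sketch_replace_maj, sketch_lowercase):
        text = rule(text)
    return text.split()


result = apply_rules("This movie was sooooo good!!!")
print(result)
```

The order matters in this sketch: the capitalization rule must run before everything is lowercased, otherwise there is nothing left for it to detect.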


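```python
# Example: tokenize one IMDb review with fastai's word tokenizer
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # download and extract the IMDb dataset
files = get_text_files(path, folders=['train', 'test', 'unsup'])
txt = files[0].open().read()  # read the first review as a string
spacy = WordTokenizer()       # fastai's default word tokenizer (spaCy-based)
tkn = Tokenizer(spacy)        # applies the default rules and special tokens
toks = tkn(txt)
print(coll_repr(toks, 30))    # show the first 30 tokens
```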

(#303) ['xxbos','xxmaj','the','true','story','of','a','bunch','of','junkies','robbing','a','not','so','honest','businessman','of','drugs',',','jewelry',',','guns',',','and','money','.','xxmaj','some','would','say'...]


#### 4. **Subword Tokenization**
---

- Useful for languages without spaces (e.g., Chinese) or languages with long compound words (e.g., Turkish).
- Steps:
  1. Analyze a corpus to find common substrings.
  2. Tokenize using these substrings.
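
The two steps can be illustrated with a deliberately naive sketch: count short substrings in a corpus, keep the most frequent ones, then split words by greedy longest match. This is not the algorithm used by any particular subword library; `learn_subwords`, `tokenize_subwords`, and the toy corpus are hypothetical and chosen only to show the idea.

```python
from collections import Counter


def learn_subwords(corpus, vocab_size=20, max_len=4):
    """Step 1: collect single characters plus the most frequent substrings (length 2..max_len)."""
    words = corpus.split()
    counts = Counter()
    for word in words:
        for start in range(len(word)):
            for end in range(start + 2, min(start + max_len, len(word)) + 1):
                counts[word[start:end]] += 1
    chars = set("".join(words))
    n_multi = max(vocab_size - len(chars), 0)
    return chars | {sub for sub, _ in counts.most_common(n_multi)}


def tokenize_subwords(word, vocab, max_len=4):
    """Step 2: split a word greedily, always taking the longest piece found in the vocab."""
    pieces, i = [], 0
    while i < len(word):
        for end in range(min(i + max_len, len(word)), i, -1):
            piece = word[i:end]
            # A single character is always accepted so the loop cannot stall.
            if piece in vocab or end == i + 1:
                pieces.append(piece)
                i = end
                break
    return pieces


vocab = learn_subwords("low lower lowest slow slowly")
pieces = tokenize_subwords("slowest", vocab)
print(pieces)
```

A larger `vocab_size` keeps longer substrings and yields fewer tokens per word; a smaller one pushes the result toward character-level tokenization.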