# Getting Started with NLP
- [Getting Started with NLP for Absolute Beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)
- first tutorial in the Kaggle [Natural Language Processing Guide](https://www.kaggle.com/learn-guide/natural-language-processing)

## NLP For Classification
One of the more useful applications of NLP. Classification can be used for tasks like organizing documents by topic or Sentiment Analysis (finding out whether people are saying *positive* or *negative* things about your product).

## [U.S. Patent Phrase to Phrase Matching Competition](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching)
- compare two words or short phrases
    - original competition:
        - score them `0`-`1` based on whether they're similar or not
        - `0` = totally different meaning, `1` = identical meaning, `0.5` = somewhat similar meaning
    - classification version (what we'll do here)
        - classify the pairs of words or phrases into `Different`, `Similar`, or `Identical` categories
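
A minimal sketch of how the continuous scores could be bucketed into the three categories. The cut-offs (only exactly `0` counts as `Different`, only exactly `1` as `Identical`) and the helper name `score_to_category` are assumptions for illustration; the tutorial does not state where the boundaries sit.

```python
def score_to_category(score):
    """Map a 0-1 similarity score to a coarse label (hypothetical cut-offs)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score == 0.0:
        return "Different"
    if score == 1.0:
        return "Identical"
    return "Similar"  # everything strictly between the two extremes


for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, "->", score_to_category(s))
```

Moving the boundaries (for example treating `0.25` as `Different`) only means changing the two comparisons.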

In [1]:
from pathlib import Path
import pandas as pd

### Get the Dataset
- we'll be getting the dataset from Kaggle. 
    - One problem - when you go to download a data set from a Kaggle competition, you need to agree to the competition rules, including a rule to *not* make the data available to people who haven't agreed to the competition rules. So I can't just add it to my *publicly-available* repo.
    - I could just download it from the webpage manually and put it in the right place, but since I can't add it to tracked files, I'd have to repeat that manual step for every notebook that needs it each time I clone the repo.
- Instead, [install the Kaggle API](https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md) to download the dataset here so I can import it into this notebook, but don't track it in Git.
    - If you haven't already, go to the [Competition page](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching), go to the `Data` tab, and `Accept` the rules of the competition to be allowed to download the dataset.
    - If not already installed, install the API (usually with `pip install kaggle`, but since I'm using `uv` as a dependency manager, I used `uv add kaggle`. Running `uv sync` in this repo should install it with all the other dependencies).
    - On the [Kaggle website](https://www.kaggle.com/), create or log in to your account, then click the profile picture -> `Settings` -> `API` -> `Create new Token` to download `kaggle.json` to your computer.
        - Move that file to `~/.kaggle/kaggle.json` (`~` is the home directory)
        - note: I use Sphinx with `myst_nb` to turn these notebooks into documentation, and `myst_nb` runs the notebooks to check that they still work. I can't commit `kaggle.json` to the repo without making my private Kaggle API key publicly available, so there I specify the credentials with environment variables instead: `KAGGLE_USERNAME` and `KAGGLE_KEY`. Get those values out of `kaggle.json` and add them to [GitHub Secrets for the GitHub Actions pipeline to use](https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/use-secrets)
    - run the cell below to download and unzip the dataset if it doesn't already exist. 
    - initially this gave me a `"Forbidden URL" error` but later it worked. Possibly I hadn't accepted the rules for the competition yet.
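
Before running the download cell, it can help to confirm that credentials are actually visible to the process. The sketch below only checks the two places described above (the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables and `~/.kaggle/kaggle.json`). The helper name `kaggle_credentials_source` is hypothetical and not part of the Kaggle API, and checking the environment first is an assumption of this sketch.

```python
import os
from pathlib import Path


def kaggle_credentials_source(env=None, home=None):
    """Return where Kaggle credentials were found, or None if they are missing."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Environment variables (e.g. populated from GitHub Secrets in CI)
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"
    # Token file downloaded from the Kaggle settings page
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "kaggle.json"
    return None


print("credentials found in:", kaggle_credentials_source())
```

A result of `None` means the download will fail before it ever reaches the competition rules check.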

In [2]:
# download and unzip the dataset to this folder if not already downloaded
data_dir = Path("us-patent-phrase-to-phrase-matching")
if not data_dir.exists():
    import kaggle
    import zipfile

    # download the dataset from Kaggle as zip file
    kaggle.api.competition_download_cli(str(data_dir))  
    zip_path = data_dir.with_suffix(".zip")  # path to the downloaded zip file
    with zipfile.ZipFile(zip_path) as zf:  # unzip the file, closing the archive when done
        zf.extractall(data_dir)
    zip_path.unlink()  # delete the zip file after unzipping


### Examine the DataSet
- check the [Competition's Data tab on Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) for info on the dataset that you can't get from the CSVs
- `anchor` and `target` phrases are rated for similarity
- `context` is the subject of a patent according to the [Cooperative Patent Classification (CPC)](https://en.wikipedia.org/wiki/Cooperative_Patent_Classification)
    - `A47`: Section `A` (`Human Necessities`), Class `47` (`Furniture`). ([A47C](https://www.uspto.gov/web/patents/classification/cpc/html/cpc-A47C.html) would be `chairs; sofas; beds`)
    - the phrases `bird` and `Cape Cod` are much closer in the `context` of a `house` than in normal language
- `score` rates how similar the `anchor` and `target` phrases are (created by manual expert ratings)
    - `0` = not at all similar
    - `1` = identical meaning
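
A `context` value such as `A47` can be split into its CPC section letter and class number. The sketch below assumes every `context` has the shape "one letter followed by digits", as in the rows shown here; the helper name `split_context` is hypothetical and the dictionary lists only the section titles.

```python
# Top-level CPC sections (the letter at the start of each context code)
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


def split_context(code):
    """Split a code like 'A47' into (section letter, section title, class number)."""
    section, cpc_class = code[0], code[1:]
    return section, CPC_SECTIONS.get(section, "unknown"), cpc_class


print(split_context("A47"))
print(split_context("H01"))
```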

In [3]:
# import and check the dataset. Looks like it's already scoring similarity of word/phrase pairs.
df = pd.read_csv(data_dir / "train.csv")
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


- using `describe()` reveals 36,473 rows
- 733 unique `anchor` phrases
- a whopping 29,340 unique `target` phrases
- 106 unique `context`s (subject matter)
- some anchors appear a LOT: the `anchor` `component composite coating` appears 152 times
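
What `count`, `unique`, `top`, and `freq` mean in a `describe()` summary of string columns can be sketched by hand with the standard library. The list below holds only the `anchor` values from the nine rows visible in the preview above, so the numbers describe that tiny sample and not the full dataset; `describe_strings` is a hypothetical helper, not a Pandas function.

```python
from collections import Counter

# anchor values of the nine rows visible in the DataFrame preview
anchors = ["abatement"] * 5 + ["wood article"] * 4


def describe_strings(values):
    """Compute the same four statistics describe() reports for string columns."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]  # most frequent value and how often it occurs
    return {"count": len(values), "unique": len(counts), "top": top, "freq": freq}


print(describe_strings(anchors))
```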

In [4]:
# get descriptive statistics on the object (string) columns
df.describe(include="object")

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


### Concatenate the Input
- we'll be representing the input to the model like this
- `TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement`
- so use `+` to concatenate multiple columns into one "input" column 
- so we'll have one input string per row containing all the important data
- I'd forgotten that you can refer to Pandas columns (Series) with dots
- i.e. `df['context']` is the same as `df.context`

In [5]:
# create an 'input' column by concatenating the important columns with specifiers between
df["input"] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor
df.input.head()  # print out the first 5 entries of the new column

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

### Tokenize
- we're going to pass this to a deep learning model
    - a neural net expects numbers as inputs, not strings
    - must convert these strings to numbers in two steps
        - **Tokenization** - split the text into `tokens` (sometimes these are words)
        - **Numericalization** - convert each `token` into a number
- to connect the bits and bobs of the networks together we'll use a [Hugging Face Transformer](https://huggingface.co/docs/transformers/en/index)
    - `Transformers` works with datasets stored in ... `Dataset` ... objects ... (from the Hugging Face `datasets` library)
    - take a look at that object after converting to one from Pandas DataFrame
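
The two steps can be illustrated with a toy example. This is a deliberately simplified sketch: it splits on whitespace and numbers tokens in order of first appearance, which is not how a pretrained model's tokenizer works. All function names here are hypothetical.

```python
def tokenize(text):
    """Toy tokenization: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_vocab(texts):
    """Give every distinct token an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(text, vocab):
    """Toy numericalization: look up each token's id in the vocabulary."""
    return [vocab[token] for token in tokenize(text)]


texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)
print(vocab)
print(numericalize("act of pollution", vocab))
```

A real tokenizer ships with a fixed vocabulary, so a token it has never seen is broken into smaller known pieces instead of raising a `KeyError` as this sketch would.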

In [6]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

- pick an NLP model to start with (the `tokenization` and `numericalization` methods will depend on your model)
- the `microsoft/deberta-v3-small` is a decent starting place for most NLP problems
- use `microsoft/deberta-v3-large` for a slower but more accurate model (after initial exploration)
- these are pre-trained models, already adept at parsing natural language
- use `AutoTokenizer` to load the tokenizer that matches the chosen model

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_nm = "microsoft/deberta-v3-small"  # select a pretrained model from the Hugging Face model hub
# load the tokenizer that was used with the pretrained model, to use on our inputs
# (passing use_fast=False would select the slower Python tokenizer instead)
tokz = AutoTokenizer.from_pretrained(model_nm)


Show how the tokenizer splits text into tokens
- `▁` is added to the start of each new word (distinguishes a word start like `▁folks` from a continuation like the `day` in `G'day`)
- punctuation is treated as separate tokens (like `'`, `,`, `!`, `.`), which is why `G'day` becomes `▁G`, `'`, `day`
- uncommon words are split into subwords (like `ornithorhynchus` → `▁or`, `ni`, `tho`, `rhynch`, `us`)
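
To get a feel for subword splitting, here is a toy greedy longest-match splitter. It is a different and much simpler technique than the one the pretrained tokenizer uses, and the tiny vocabulary is made up from the pieces in the output shown in this notebook. It only illustrates how a fixed set of pieces plus the `▁` word-start marker can cover a word the vocabulary does not contain whole.

```python
def split_word(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # shrink the candidate until it is a known piece; a single
        # character is always accepted so the loop cannot get stuck
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Mark each word start with '▁', then split every word into pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(split_word("▁" + word, vocab))
    return tokens


# hypothetical vocabulary, chosen to match the pieces in the notebook output
vocab = {"▁a", "▁platypus", "▁is", "▁an", "▁or", "ni", "tho", "rhynch", "us"}
print(toy_tokenize("a platypus is an ornithorhynchus", vocab))
```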

In [23]:

print("[\"G'day folks, I'm Jeremy from fast.ai!\"] ->")
print(tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!"))
print("[\"A platypus is an ornithorhynchus anatinus.\"] ->")
print(tokz.tokenize("A platypus is an ornithorhynchus anatinus."))

["G'day folks, I'm Jeremy from fast.ai!"] ->
['▁G', "'", 'day', '▁folks', ',', '▁I', "'", 'm', '▁Jeremy', '▁from', '▁fast', '.', 'ai', '!']
["A platypus is an ornithorhynchus anatinus."] ->
['▁A', '▁platypus', '▁is', '▁an', '▁or', 'ni', 'tho', 'rhynch', 'us', '▁an', 'at', 'inus', '.']


In [None]:
# define a function to apply the tokenizer to the 'input' column of the dataset
def tok_func(x): return tokz(x["input"])
# use map to run the tokenizer function on the whole dataset;
# batched=True passes many rows per call, which is much faster than one row at a time
tok_ds = ds.map(tok_func, batched=True)
# that added the columns input_ids, token_type_ids, attention_mask
print("original  columns:", ds.column_names)
print("tokenized columns:", tok_ds.column_names)

Map: 100%|██████████| 36473/36473 [00:00<00:00, 59226.94 examples/s]

original  columns: ['id', 'anchor', 'target', 'context', 'score', 'input']
tokenized columns: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask']





Look at the columns that were added by the tokenization
- these columns have lists in each cell
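- `input_ids` are the numbers assigned to each token (its index in the tokenizer's vocabulary)
- `attention_mask` marks which positions hold real tokens (`1`) rather than padding (`0`)
- `token_type_ids` tells apart the first and second sequence when two are passed together (all `0` here, since each row is a single string)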
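
The `attention_mask` column matters once sequences of different lengths are batched together: shorter ones are padded, and the mask marks which positions hold real tokens. A minimal sketch, assuming a pad id of `0`; `pad_batch` is a hypothetical helper and not part of the `transformers` API, and the ids are a few values borrowed from the table in this notebook.

```python
def pad_batch(batch, pad_id=0):
    """Pad lists of token ids to equal length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[1, 54453, 435, 294], [1, 265, 6435]])
print(batch["input_ids"])
print(batch["attention_mask"])
```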

In [76]:
# look at the columns that the tokenization added for the first row
row = tok_ds[0]
input_row = row['input']
tk_df = pd.DataFrame({
    'token': [''] + tokz.tokenize(row['input']) + [''], # there are extra start and end id's
    'input_ids': row['input_ids'],
    'attention_mask': row['attention_mask'],
    'token_type_ids': row['token_type_ids'],
})
print("input:", f"\"{row['input']}\"")
print(f"vocab of token '1': {tokz.vocab['1']}\n")
print(tk_df.to_string(index=False))  # print the whole dataframe without truncating

input: "TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement"
vocab of token '1': 435

     token  input_ids  attention_mask  token_type_ids
                    1               1               0
     ▁TEXT      54453               1               0
         1        435               1               0
         :        294               1               0
        ▁A        336               1               0
        47       5753               1               0
         ;        346               1               0
     ▁TEXT      54453               1               0
         2        445               1               0
         :        294               1               0
▁abatement      47284               1               0
       ▁of        265               1               0
▁pollution       6435               1               0
         ;        346               1               0
      ▁ANC      23702               1               0
         1        435               1     

### Prepare Labels
- `Transformers` assume that the labels column is named `labels`
- rename the `score` column

In [77]:
tok_ds = tok_ds.rename_column("score", "labels")  # rename the score column to labels
tok_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

### Get Test and Validation Sets
- the `train` set is used to ... train ... the model
- the `validation` set is used to select the `architecture` and for `hyperparameter tuning`
- the `test` set is *completely unseen* and scores how well the model will generalize to real world data

#### Validation Set
- split off some of the training data to use for `validation`
    - use it for `architecture selection` / `hyperparameter tuning`
    - note that you can `overfit` to the `validation data` as well as the `training data`
    - that's where the `test` set comes in - it can help check for `overfitting`
- we're doing it randomly here, but apparently choosing a *good* validation set is one of *the* most important parts of training
    - a random split can look better than the model will really perform, because the data it meets in production may differ from the data used in development
    - see the article [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) by Dr. Rachel Thomas
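
One alternative to a purely random split is to hold out whole groups, so the validation set contains values the model never saw in training. The sketch below groups by `anchor`. Whether that is the right grouping for this competition is an assumption, not something the tutorial states, and `split_by_group` is a hypothetical helper.

```python
import random


def split_by_group(rows, group_key, valid_frac=0.25, seed=42):
    """Hold out whole groups so no group appears in both train and validation."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)  # reproducible shuffle of the group names
    n_valid = max(1, round(len(groups) * valid_frac))
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if r[group_key] not in valid_groups]
    valid = [r for r in rows if r[group_key] in valid_groups]
    return train, valid


rows = [
    {"anchor": "abatement", "target": "abatement of pollution"},
    {"anchor": "abatement", "target": "act of abating"},
    {"anchor": "wood article", "target": "wooden article"},
    {"anchor": "wood article", "target": "wooden box"},
]
train, valid = split_by_group(rows, "anchor", valid_frac=0.5)
print(len(train), "training rows,", len(valid), "validation rows")
```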

#### Test Set
- import the test dataset (36 entries)
    - often we split it off ourselves, but Kaggle gives you a separate one already
    - your accuracy at predicting with this set goes on the *public leaderboard*
- in addition, Kaggle keeps a *second* test dataset that they bring out at the *end* of a competition
    - the results on this one go on the *private leaderboard*
    - if you overfit to the public leaderboard test data, you could lose the ability to generalize to this holdout set
- they note that you can even `overfit` to the `test set` ... yikes, sounds unavoidable

In [None]:
# import the CSV with all the test data 
# (name it "eval" to avoid confusion with the "test" split of the training data)
eval_df = pd.read_csv(data_dir / 'test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,hybrid bearing,inorganic photoconductor drum,G02
freq,1,2,1,3


In [None]:
# split off a quarter of the training data to use as a validation set
dds = tok_ds.train_test_split(0.25, seed=42)
dds
# NOTE: it's automatically named "test" instead of "validation", don't mix it up

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

### Metrics and Correlation
- while `training`, we're generally minimizing or maximizing one or more `metrics`
- you can't apply them unthinkingly - see [The problem with metrics is a big problem for AI](https://www.fast.ai/2019/09/24/metrics/)
- Kaggle Competitions have specific metrics already defined so everyone is scored the same way
- they're listed on the Competition's [Evaluation Page](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation), in this case the [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
    - it measures correlation between two variables
        - it varies from `-1` (perfect inverse correlation) to `1` (perfect positive correlation)
        - it's the [covariance](https://en.wikipedia.org/wiki/Covariance) (joint variability) of the two variables divided by the product of their standard deviations
    - the equation is on the complicated side, and there are separate versions: one for populations
        - $\rho_{X,Y}=\Large\frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$
    - and one for samples
        - $r_{xy}=\Large\frac{\sum^n_{i=1}{(x_i - \bar x)(y_i - \bar y)}}{\sqrt{\sum^n_{i=1} (x_i - \bar x)^2}\sqrt{\sum^n_{i=1} (y_i - \bar y)^2}}$
- the example notebook doesn't list the formulas
    - but it does go over the correlation in more detail in the section [Metrics and Correlation](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners#metrics-and-correlation)
    - this mostly involves checking it on other datasets, so I'm not replicating that here
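
The sample formula above translates almost term by term into plain Python. This is a sketch for checking understanding on small lists; `pearson_r` is a hypothetical name and the input values are made up.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation, written out from the sum formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # numerator: sum of products of the deviations from each mean
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # denominator: product of the square roots of the summed squared deviations
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    if den == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return num / den


scores = [0.0, 0.25, 0.5, 0.75, 1.0]
print(pearson_r(scores, [2 * s + 1 for s in scores]))  # perfectly linear, increasing
print(pearson_r(scores, [1 - s for s in scores]))      # perfectly linear, decreasing
```

The first call should come out at (or within rounding error of) `1` and the second at `-1`, matching the range described above.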

In [None]:
# create a function to calculate Pearson correlation coefficient
# np.corrcoef returns a 2x2 array, just grab the [0][1] element
# which is the correlation between variables x and y
import numpy as np
def corr(x,y): return np.corrcoef(x,y)[0][1]

# Transformers expect metrics to be returned as a dictionary
# Create a function to return a dictionary for the metric
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

### Training the Model

In [None]:
from transformers import TrainingArguments, Trainer