# Getting Started with NLP
- [Getting Started with NLP for Absolute Beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)
- first tutorial in the Kaggle [Natural Language Processing Guide](https://www.kaggle.com/learn-guide/natural-language-processing)

## NLP For Classification
Classification is one of the more useful applications of NLP. It can be used for things like organizing documents by topic or Sentiment Analysis (finding out whether people are saying *positive* or *negative* things about your product).

## [U.S. Patent Phrase to Phrase Matching Competition](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching)
- compare two words or short phrases
    - original competition:
        - score them `0`-`1` based on whether they're similar or not
        - `0` = totally different meaning, `1` = identical meaning, `0.5` = somewhat similar meaning
    - classification version (what we'll do here)
        - classify the pairs of words or phrases into `Different`, `Similar`, or `Identical` categories
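
To make the classification version concrete, here is a minimal sketch of how a `0`-`1` score could be bucketed into the three categories. The cut-offs (`0.5` and `1.0`) and the function name `score_to_label` are assumptions for illustration only, not taken from the competition or the tutorial.

```python
def score_to_label(score: float, similar_from: float = 0.5, identical_from: float = 1.0) -> str:
    """Bucket a 0-1 similarity score into one of three categories.

    The thresholds are hypothetical defaults chosen for this sketch.
    """
    if score >= identical_from:
        return "Identical"
    if score >= similar_from:
        return "Similar"
    return "Different"


# Label the anchor points mentioned in the notes above
labels = {score: score_to_label(score) for score in (0.0, 0.5, 1.0)}
```

With these assumed cut-offs, anything below `0.5` lands in `Different`; moving `similar_from` changes where in-between scores end up.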

In [1]:
from pathlib import Path
import pandas as pd

### Get the Dataset
- we'll be getting the dataset from Kaggle. 
    - One problem: to download a dataset from a Kaggle competition, you need to agree to the competition rules, including a rule to *not* make the data available to people who haven't agreed to them. So I can't just add it to my *publicly-available* repo.
    - I could download it manually from the webpage and put it in the right place, but since I can't add it to tracked files, I'd have to repeat that for every affected notebook each time I cloned the repo.
- Instead, [install the Kaggle API](https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md) and use it to download the dataset here so I can import it into this notebook, without tracking it in Git.
    - If you haven't already, go to the [Competition page](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching), go to the `Data` tab, and `Accept` the rules of the competition to be allowed to download the dataset.
    - If not already installed, install the API. This is usually done with `pip install kaggle`, but since I'm using `uv` as a dependency manager, I used `uv add kaggle`. Running `uv sync` in this repo should install it along with all the other dependencies.
    - On the [Kaggle website](https://www.kaggle.com/), create or log in to your account, then click the profile picture -> `Settings` -> `API` -> `Create new Token` to download `kaggle.json` to your computer.
        - Move that file to `~/.kaggle/kaggle.json` (`~` is the home directory)
        - note: I use Sphinx with `myst_nb` to turn these notebooks into documentation, and `myst_nb` runs the notebooks to check that they still work. I can't commit `kaggle.json` to the repo without making my private Kaggle API key public, so in that pipeline I specify the credentials with environment variables instead: `KAGGLE_USERNAME` and `KAGGLE_KEY`. Get those values out of `kaggle.json` and add them to [GitHub Secrets for the GitHub Actions pipeline to use](https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/use-secrets)
    - Run the cell below to download and unzip the dataset if it doesn't already exist.
    - Initially this gave me a `"Forbidden URL"` error, but later it worked. Possibly I hadn't accepted the competition rules yet.
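
Before running the download, it can help to check which of the two credential sources described above is actually present. The sketch below uses only the standard library; the helper name `kaggle_credentials_source` and the order of the checks (environment variables first, then `~/.kaggle/kaggle.json`) are assumptions of this sketch, not part of the Kaggle API.

```python
import os
from pathlib import Path


def kaggle_credentials_source(env=None, home=None):
    """Return where Kaggle credentials were found, or None if nowhere.

    Hypothetical helper: `env` and `home` can be overridden for testing.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set (e.g. from GitHub Secrets)
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"

    # Option 2: the token file downloaded from the Kaggle website
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "kaggle.json"

    return None


source = kaggle_credentials_source()
print(f"Kaggle credentials found in: {source}")
```

If this prints `None`, the download cell will most likely fail to authenticate.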

In [None]:
# download and unzip the dataset to this folder if not already downloaded
data_dir = Path("us-patent-phrase-to-phrase-matching")
if not data_dir.exists():
    import kaggle
    import zipfile

    # download the competition data from Kaggle as a zip file (saved to the current directory)
    kaggle.api.competition_download_cli(str(data_dir))
    zip_path = data_dir.with_suffix(".zip")  # path to the downloaded zip file
    with zipfile.ZipFile(zip_path) as zf:  # close the archive before deleting it
        zf.extractall(data_dir)  # unzip into the data folder
    zip_path.unlink()  # delete the zip file after unzipping


### Examine the Dataset
- check the [Competition's Data tab on Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) for information about the dataset that you can't get from the CSVs themselves
- `anchor` and `target` phrases are rated for similarity
- `context` is the subject of a patent according to the [Cooperative Patent Classification (CPC)](https://en.wikipedia.org/wiki/Cooperative_Patent_Classification)
    - `A47`: Section `A` (`Human Necessities`), Class `47` (`Furniture`). ([A47C](https://www.uspto.gov/web/patents/classification/cpc/html/cpc-A47C.html) would be `chairs; sofas; beds`)
    - the context matters: the phrases `bird` and `Cape Cod` are much closer in the `context` of a `house` than in normal language
- `score` rates how similar the `anchor` and `target` phrases are (created by manual expert ratings)
    - `0` = not at all similar
    - `1` = identical meaning

In [3]:
# import and check the dataset. Looks like it's already scoring similarity of word/phrase pairs.
df = pd.read_csv(data_dir / "train.csv")
df

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


- using `describe()` reveals:
    - 36,473 rows
    - 733 unique `anchor` phrases
    - a whopping 29,340 unique `target` phrases
    - 106 unique `context`s (subject matter)
- some anchors appear a LOT: the `anchor` `component composite coating` appears 152 times

In [4]:
# get descriptive statistics on the object (string) columns
df.describe(include="object")

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |
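
The `unique`, `top`, and `freq` rows can also be cross-checked column by column with `nunique()` and `value_counts()`. The sketch below does this on a tiny stand-in DataFrame built from a few rows of the table shown earlier, so it runs without the Kaggle download; the variable names are just for illustration.

```python
import pandas as pd

# Tiny stand-in for train.csv: a few rows copied from the output above
toy = pd.DataFrame(
    {
        "anchor": ["abatement", "abatement", "abatement", "wood article"],
        "target": ["act of abating", "forest region", "eliminating process", "wooden box"],
        "context": ["A47", "A47", "A47", "B44"],
    }
)

n_rows = len(toy)  # corresponds to the "count" row
n_unique = toy.nunique()  # corresponds to the "unique" row, one value per column

# value_counts() sorts by frequency, most common first
anchor_counts = toy["anchor"].value_counts()
top_anchor = anchor_counts.index[0]  # corresponds to "top"
top_anchor_freq = int(anchor_counts.iloc[0])  # corresponds to "freq"

print(n_rows, dict(n_unique), top_anchor, top_anchor_freq)
```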


### Concatenate the Input
- we'll represent the input to the model as a single string like this:
    - `TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement`
- use `+` to concatenate multiple columns (with labels in between) into one "input" column, so each row has one input string containing all the important data
- I'd forgotten that you can refer to Pandas columns (`Series`) with dot notation
    - i.e. `df['context']` and `df.context` refer to the same column
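
Here is a small sketch of both points on a two-row stand-in DataFrame (rows copied from the table above; the name `toy` is just for illustration). One caveat worth remembering: dot notation only works for *reading* an existing column whose name is a valid Python identifier and doesn't clash with a DataFrame method; new columns still have to be created with bracket syntax.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "context": ["A47", "B44"],
        "target": ["act of abating", "wooden box"],
        "anchor": ["abatement", "wood article"],
    }
)

# Bracket access and dot access return the same column
same_column = toy["context"].equals(toy.context)

# Create the new column with brackets; read the existing columns with dots
toy["input"] = "TEXT1: " + toy.context + "; TEXT2: " + toy.target + "; ANC1: " + toy.anchor
first_input = toy.input.iloc[0]

print(same_column)
print(first_input)
```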

In [5]:
# create an 'input' column by concatenating the important columns with labels in between
df["input"] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor
df.input.head()  # print out the first 5 entries of the new column

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

### Tokenize
- we're going to pass this to a deep learning model - a `Transformer`
    - a neural net expects numbers as inputs, not strings
    - must convert these strings to numbers in two steps
        - **Tokenization** - split the text into `tokens` (sometimes these are words)
        - **Numericalization** - convert each `token` into a number
- Hugging Face `Transformers` works with datasets stored in `Dataset` objects (from the `datasets` library)
    - take a look at that object after converting the Pandas DataFrame into one
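
The two-step conversion from strings to numbers (tokenization, then numericalization) can be sketched with a deliberately naive toy version. This is *not* how the real tokenizer works: it simply splits on whitespace and assigns ids in order of first appearance, whereas a model's tokenizer uses a fixed, pre-built vocabulary and often splits words into sub-word pieces. All function names here are hypothetical.

```python
def tokenize(text: str) -> list[str]:
    """Toy tokenization: lowercase, then split on whitespace."""
    return text.lower().split()


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(text: str, vocab: dict[str, int]) -> list[int]:
    """Toy numericalization: look up each token's id in the vocabulary."""
    return [vocab[token] for token in tokenize(text)]


vocab = build_vocab(["act of abating", "abatement of pollution"])
ids = numericalize("abatement of pollution", vocab)
print(vocab)
print(ids)
```

Note that this toy version raises a `KeyError` for any word it hasn't seen; handling unseen words gracefully is one reason real tokenizers work with sub-word pieces.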

In [6]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds


Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

- pick an NLP model to start with (the tokenization and numericalization methods depend on the model)
    - `microsoft/deberta-v3-small` is a decent starting place for most NLP problems
    - use `microsoft/deberta-v3-large` for a slower but more accurate model (after initial exploration)
    - these are pre-trained models, already adept at parsing natural language

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# select a model
model_nm = "microsoft/deberta-v3-small"
# model_nm = "microsoft/deberta-v3-large"

# use AutoTokenizer to create a tokenizer appropriate to the model
tokz = AutoTokenizer.from_pretrained(model_nm)
# alternative: the slower tokenizer avoids a warning about the SentencePiece "byte fallback" feature in the "fast" tokenizer
# tokz = AutoTokenizer.from_pretrained(model_nm, use_fast=False)


