## Introduction

One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.

Perhaps the most widely practically useful application of NLP is *classification* -- that is, classifying a document automatically into some category. This can be used, for instance, for:

- Sentiment analysis (e.g., are people saying *positive* or *negative* things about your product)
- Author identification (what author most likely wrote some document)
- Legal discovery (which documents are in scope for a trial)
- Organizing documents by topic
- Triaging inbound emails
- ...and much more!

Classification models can also be used to solve problems that do not, at first, look like classification. For instance, consider the Kaggle [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition. In this, we are tasked with comparing two words or short phrases and scoring how similar they are, taking into account the patent class they were used in. A score of `1` means the two inputs have identical meaning, and `0` means they have totally different meanings. For instance, *abatement* and *eliminating process* have a score of `0.5`, meaning they're somewhat similar, but not identical.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

> For the following text...: "TEXT1: abatement; TEXT2: eliminating process" ...choose a category of meaning similarity: "Different; Similar; Identical".

In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
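To make the reformulation concrete, here is a minimal sketch that turns a phrase pair into a single prompt string and buckets a score into one of the three categories. The helper names `make_prompt` and `score_to_category`, and the bucket thresholds, are assumptions for illustration only; they are not defined by the competition or used later in this notebook.

```python
def make_prompt(text1: str, text2: str) -> str:
    """Combine two phrases into one document that a classifier can read."""
    return f"TEXT1: {text1}; TEXT2: {text2}"

def score_to_category(score: float) -> str:
    """Bucket a 0-1 similarity score into a coarse label (illustrative thresholds)."""
    if score >= 1.0:
        return "Identical"
    if score > 0.0:
        return "Similar"
    return "Different"

prompt = make_prompt("abatement", "eliminating process")
category = score_to_category(0.5)
print(prompt, "->", category)
```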

We've mainly been dealing with images so far; now let's learn how to apply deep learning to documents.

A `document` is any input to an NLP model that contains text.

`Classification` is a huge area for anyone interested in NLP.

We'll need slightly different code depending on whether we're running on Kaggle or not, so we'll use this variable to track where we are:

In [1]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
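`os.environ.get` returns the empty string here when the variable is unset, so `iskaggle` is falsy off Kaggle and truthy on it. A small sketch of that behaviour, using a hypothetical helper `running_on_kaggle` that accepts the environment as a parameter so it can be tried anywhere (the value `'Interactive'` is only an example):

```python
import os

def running_on_kaggle(env=os.environ) -> bool:
    """True when the Kaggle kernel variable is set to a non-empty value."""
    return bool(env.get('KAGGLE_KERNEL_RUN_TYPE', ''))

# Simulated environments: an empty one, and one with the variable set.
off_kaggle = running_on_kaggle({})
on_kaggle = running_on_kaggle({'KAGGLE_KERNEL_RUN_TYPE': 'Interactive'})
print(off_kaggle, on_kaggle)
```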

## Using Kaggle on your own machine

Kaggle limits your weekly time using a GPU machine. The limits are very generous, but you may well still find it's not enough! In that case, you'll want to use your own GPU server, or a cloud server such as Colab, Paperspace Gradient, or SageMaker Studio Lab (all of which have free options). To do so, you'll need to be able to download Kaggle datasets.

The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using `pip` by running this in a notebook cell:

    !pip install kaggle

You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called *kaggle.json* to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell (e.g., `creds = '{"username":"xxx","key":"xxx"}'`):


In [2]:
# paste the contents of your own kaggle.json here; never share or commit a real key
creds = '{"username":"xxx","key":"xxx"}'

Then execute this cell (this only needs to be run once):

In [3]:
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
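Before relying on the credentials file, it can be worth checking that the pasted string parses as JSON and contains both fields. A minimal sketch, assuming a hypothetical helper `check_creds` (it is not part of the Kaggle API):

```python
import json

def check_creds(creds: str) -> bool:
    """Return True if `creds` is a JSON object with non-empty 'username' and 'key'."""
    try:
        data = json.loads(creds)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get('username')) and bool(data.get('key'))

ok = check_creds('{"username":"xxx","key":"xxx"}')
missing_key = check_creds('{"username":"xxx"}')
print(ok, missing_key)
```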

Now you can download datasets from Kaggle.

In [4]:
path = Path('us-patent-phrase-to-phrase-matching')

And use the Kaggle API to download the dataset to that path, and extract it:

In [5]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

ValueError: Error: Missing username in configuration.

Note that you can easily download notebooks from Kaggle and upload them to other cloud services. So if you're low on Kaggle GPU credits, give this a try!

## Import and EDA

In [7]:
if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets

It looks like this competition uses CSV files. For opening, manipulating, and viewing CSV files, it's generally best to use the Pandas library, which is explained brilliantly in [this book](https://wesmckinney.com/book/) by the lead developer (it's also an excellent introduction to matplotlib and numpy, both of which I use in this notebook). Generally it's imported as the abbreviation `pd`.

In [8]:
import pandas as pd

There are four libraries we'll lean on for NLP work in Python:
- `numpy`: basic numerical programming
- `matplotlib`: plotting
- `pandas`: tables of data
- `pytorch`: deep learning

To learn pandas, see *Python for Data Analysis* by Wes McKinney.

With pandas, we can now read our CSV file:


In [9]:
path = 'C:/Users/BM/ML fastai/lesson 4/U.S. Patent Phrase to Phrase Matching/train.csv'

In [10]:
df = pd.read_csv(path)

This creates a [DataFrame](https://pandas.pydata.org/docs/user_guide/10min.html), which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:

In [11]:
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


It's important to carefully read the [dataset description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) to understand how each of these columns is used.

One of the most useful features of `DataFrame` is the `describe()` method:

In [12]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


The `describe()` method summarizes the columns of your dataset. With `include='object'` it reports, for each string column, the row count, the number of unique values, the most frequent value (`top`), and how often that value occurs (`freq`).

As we can see, there are 733 unique anchors, 29,340 unique targets, and 106 unique contexts.
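The `unique`, `top`, and `freq` figures can also be computed by hand with `nunique()` and `value_counts()`. A minimal sketch on a tiny hand-made table; the frame name `toy` and its three rows are illustrative, not the competition data:

```python
import pandas as pd

toy = pd.DataFrame({
    'anchor':  ['abatement', 'abatement', 'wood article'],
    'target':  ['act of abating', 'forest region', 'wooden box'],
    'context': ['A47', 'A47', 'B44'],
})

n_unique = toy['context'].nunique()      # number of distinct values ("unique")
counts = toy['context'].value_counts()   # occurrences per value, most frequent first
top, freq = counts.index[0], int(counts.iloc[0])
print(n_unique, top, freq)
```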

OK, let's create a single text column that combines the context, target, and anchor, with a field separator between them:

In [13]:
df['input'] = 'TEXT1: ' + df.context + ':TEXT2: '+ df.target +': ANC1: '+ df.anchor

In pandas, we can refer to a column either with square brackets or as an attribute. When setting a column, use square brackets with the column name in quotes; when reading it, attribute access works too.

`head()` returns the first few rows:

In [14]:
df.input.head()

0    TEXT1: A47:TEXT2: abatement of pollution: ANC1...
1    TEXT1: A47:TEXT2: act of abating: ANC1: abatement
2    TEXT1: A47:TEXT2: active catalyst: ANC1: abate...
3    TEXT1: A47:TEXT2: eliminating process: ANC1: a...
4     TEXT1: A47:TEXT2: forest region: ANC1: abatement
Name: input, dtype: object
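The column-setting and column-reading patterns used above can be tried on a toy frame. This is a minimal sketch; the frame name `toy` and its two rows are made up for illustration, while the separators mirror the ones used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'anchor':  ['abatement', 'wood article'],
    'target':  ['forest region', 'wooden box'],
    'context': ['A47', 'B44'],
})

# Setting a column: square brackets with the column name in quotes.
toy['input'] = 'TEXT1: ' + toy.context + ':TEXT2: ' + toy.target + ': ANC1: ' + toy.anchor

# Reading a column: attribute access and bracket access return the same data.
first = toy.input.head(1).iloc[0]
print(first)
```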

## Tokenization

Now we've got some documents to work with. But as you may remember, neural networks work with numbers, not text.

First, we have to split each document into tokens. To a first approximation, tokens are just words.

After that, we build a list of all the unique tokens that appear, i.e. the `vocabulary`, and give every one of them a number.

Remember that a big vocabulary takes more memory and time, so we don't want it to be too large.

That is why, in practice, `tokenization` splits words into smaller subword units, and it is these units that become the `tokens`.
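As a toy illustration of these steps, the sketch below splits text on whitespace, builds a vocabulary, and maps each token to its id. Real tokenizers split text into subword units rather than whole words; the whitespace split and the function names `build_vocab` and `numericalize` are simplifying assumptions.

```python
def build_vocab(docs):
    """Map each unique whitespace-separated token to an id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def numericalize(doc, vocab):
    """Turn a document into a list of token ids using an existing vocabulary."""
    return [vocab[tok] for tok in doc.lower().split()]

docs = ["abatement of pollution", "act of abating"]
vocab = build_vocab(docs)
ids = numericalize("act of pollution", vocab)
print(vocab)
print(ids)
```

Note that `numericalize` raises a `KeyError` for a word that is not in the vocabulary, which is one reason real tokenizers fall back to smaller pieces.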

We are now going to turn our pandas DataFrame into a Hugging Face `Dataset`, not to be confused with PyTorch's `Dataset`:

In [15]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

In [16]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

If we take a look, it has the same features the DataFrame had, including the `input` column we just added with the concatenated strings.

Now we have to split the text into tokens, i.e. `tokenization`, and turn each token into its unique id based on its position in the vocabulary, i.e. `numericalization`.

Before you start tokenizing, you have to decide which model to use, since tokenization, the vocabulary, and numericalization must all follow that model's format.

The good thing is that Hugging Face hosts many models.

For now we will use `deberta-v3-small` from Microsoft.


In [17]:
model_nm = 'microsoft/deberta-v3-small'

To use the model, we will use `AutoTokenizer` to create a tokenizer appropriate for it:

In [18]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
