# A brief introduction to the patent system
A patent is a registered record, typically a document, that describes a unique discovery, invention, or method and grants the patent holder exclusive rights over it.


To organize patents and structure their information, a commonly used method describes each patent with two characteristics:
1. **Task:** the action performed by the patented invention, for example compressing something or accelerating an effect.
2. **Object:** the "target" of the task. It can be a food, a construction material, or any other object that, combined with the task, defines the patent.

This method is defined by the Hallbach matrix, which provides a list of Tasks and Objects that can be extracted from the title or the abstract (the `resumo` column) of a patent.
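To make the idea concrete, here is a minimal sketch of a (Task, Object) lookup over a lemmatized title. The `TASKS` and `OBJECTS` sets, the `TaskObject` class, and the `find_task_object` helper are hypothetical names introduced only for illustration. The real Task and Object lists come from the Hallbach matrix and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies, for illustration only; the real lists
# would come from the Hallbach matrix.
TASKS = {"tratamento", "codificacao"}
OBJECTS = {"colisao", "video"}


@dataclass(frozen=True)
class TaskObject:
    """A patent described by its Task and its Object (None if not found)."""
    task: Optional[str]
    obj: Optional[str]


def find_task_object(lemmatized_title: str) -> TaskObject:
    """Return the first token found in each vocabulary, if any."""
    tokens = lemmatized_title.split()
    task = next((t for t in tokens if t in TASKS), None)
    obj = next((t for t in tokens if t in OBJECTS), None)
    return TaskObject(task=task, obj=obj)


print(find_task_object("tratamento colisao uplink"))
# TaskObject(task='tratamento', obj='colisao')
```

A plain vocabulary lookup like this only works when the title contains the exact listed words, which is one reason to look at grammatical information such as POS tags.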

# T,O Finder
The T,O Finder is the procedure that identifies the Task and the Object of a given patent. In this notebook we build such a procedure.

In [1]:
import pandas as pd
import string
import spacy
import unicodedata

In [None]:
# Loading the patents dataset
df = pd.read_csv('../../data/processed/patentes_inpi_lemmatized.csv')
df.head()

Unnamed: 0,id_pedido,data_deposito,titulo,ipc,url,resumo,classifica_ipc,titulo_lemmatized,resumo_lemmatized
0,BR 11 2021 018393 0,02/03/2020,TRATAMENTO DE COLISÕES EM UPLINK,H04L 1/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,"A presente invenção se refere a métodos, sis...",H04L 1/18,tratamento colisao uplink,presente invencao referir metodo sistema di...
1,BR 11 2021 018071 0,02/03/2020,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO ANT...,H01T 13/14,https://busca.inpi.gov.br/pePI/servlet/Patente...,ALOJAMENTO DE VELA DE IGNIÇÃO COM PROTEÇÃO A...,H01T 13/14 ; H01T 13/20 ; H01T 13/32 ; H0...,alojamento vela ignicao protecao anticorrosivo...,alojamento vela ignicao protecao anticorros...
2,BR 11 2021 016947 4,02/03/2020,ANTICORPOS QUE RECONHECEM TAU,C07K 16/18,https://busca.inpi.gov.br/pePI/servlet/Patente...,ANTICORPOS QUE RECONHECEM TAU. A invenção fo...,C07K 16/18 ; G01N 33/68,anticorpo reconhecer tau,anticorpo reconhecer tau invencao fornecer ...
3,BR 10 2020 004169 0,02/03/2020,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTÃO PAR...,F24H 3/00,https://busca.inpi.gov.br/pePI/servlet/Patente...,AQUECEDOR DE AR A LENHA COM DUPLA EXAUSTAO P...,F24H 3/008 ; F24H 4/06,aquecedor ar lenha dupla exaustao utilizar amb...,aquecedor ar lenha dupla exaustao utilizar ...
4,BR 11 2021 006234 3,02/03/2020,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNICOS...,C12N 15/10,https://busca.inpi.gov.br/pePI/servlet/Patente...,BIBLIOTECAS DE CÉLULAS ÚNICAS E NÚCLEOS ÚNIC...,C12N 15/10,biblioteca celula unico nucleo unico alto rend...,biblioteca celula unico nucleo unico alto r...


# POS tagging
**POS (Part-of-Speech) Tagging** is the process of labeling each word in a text with its corresponding part of speech, such as noun, verb, adjective, etc. This is a fundamental step in Natural Language Processing (NLP) as it helps in understanding the grammatical structure and meaning of a sentence.

For example:
- Sentence: "The cat is sleeping."
- POS Tags: `The (Determiner)`, `cat (Noun)`, `is (Auxiliary verb)`, `sleeping (Verb)`.

POS tagging is useful for tasks like:
- Text parsing and syntactic analysis.
- Named Entity Recognition (NER).
- Sentiment analysis and text classification.

## Example Code
```python
import spacy

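# Load a spaCy language model; the sample text is Portuguese, so a Portuguese model is used
nlp = spacy.load("pt_core_news_lg")  # Replace with your desired language model

# Input text (a lemmatized, accent-stripped patent title)
text = "tratamento colisao uplink"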

# Process the text
doc = nlp(text)

# Print each token and its POS tag
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")
```

It gives the following output:

```text
tratamento -> NOUN (NOUN)
colisao -> ADJ (ADJ)
uplink -> VERB (VERB)
```

## POS Tagging with spaCy
spaCy provides an efficient and easy-to-use [POS](https://universaldependencies.org/u/pos/) tagger. It exposes two tag attributes on each token:
- `token.pos_`: The coarse-grained part-of-speech tag (e.g., NOUN, VERB).
- `token.tag_`: The fine-grained part-of-speech tag (e.g., VBZ, NN).

More generally, spaCy assigns each token in a text a set of attributes:

1. Text: The original word text.
2. Lemma: The base form of the word.
3. **POS:** The simple UPOS part-of-speech tag.
4. **Tag:** The detailed part-of-speech tag.
5. Dep: Syntactic dependency, i.e. the relation between tokens.
6. Shape: The word shape – capitalization, punctuation, digits.
7. is alpha: Does the token consist only of alphabetic characters?
8. is stop: Is the token part of a stop list, i.e. the most common words of the language?


## Reference
Common `pos_` Tags
| Tag   | Description           |
|-------|-----------------------|
| `ADJ` | Adjective             |
| `ADP` | Adposition            |
| `ADV` | Adverb                |
| `AUX` | Auxiliary verb        |
| `CCONJ`| Coordinating conjunction |
| `DET` | Determiner            |
| `INTJ`| Interjection          |
| `NOUN`| Noun                  |
| `NUM` | Numeral               |
| `PART`| Particle              |
| `PRON`| Pronoun               |
| `PROPN`| Proper noun          |
| `PUNCT`| Punctuation          |
| `SCONJ`| Subordinating conjunction |
| `SYM` | Symbol                |
| `VERB`| Verb                  |
| `X`   | Other                 |

Common `tag_` Tags (English Example)
| Tag   | Description                          |
|-------|--------------------------------------|
| `NN`  | Noun, singular                      |
| `NNS` | Noun, plural                        |
| `VB`  | Verb, base form                     |
| `VBD` | Verb, past tense                    |
| `VBG` | Verb, gerund or present participle  |
| `VBN` | Verb, past participle               |
| `VBZ` | Verb, 3rd person singular present   |
| `JJ`  | Adjective                           |
| `RB`  | Adverb                              |
| `IN`  | Preposition or subordinating conjunction |
| `DT`  | Determiner                          |
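
The two tables are related: each fine-grained tag refines one coarse-grained category. The sketch below spells out that relationship for the English tags listed above. `FINE_TO_COARSE` and `coarse_tag` are illustrative names, not part of spaCy, and the mapping is an approximation: for instance, spaCy tags auxiliaries such as "is" as `AUX` even though their fine-grained tag is `VBZ`, and `IN` may correspond to either `ADP` or `SCONJ`.

```python
# Approximate mapping from the fine-grained English tags in the table above
# to their coarse-grained category (an illustrative assumption, not spaCy's API).
FINE_TO_COARSE = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBG": "VERB",
    "VBN": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "IN": "ADP",  # may also be SCONJ, depending on context
    "DT": "DET",
}


def coarse_tag(fine_tag: str) -> str:
    """Return the coarse-grained category, or 'X' (Other) if the tag is unknown."""
    return FINE_TO_COARSE.get(fine_tag, "X")


print(coarse_tag("NNS"))  # NOUN
print(coarse_tag("VBG"))  # VERB
```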

In [4]:
nlp = spacy.load("pt_core_news_lg")

In [18]:
# Tag the lemmatized title
doc = nlp(df.loc[25, "titulo_lemmatized"])

# Print each token and its POS tag
print("Lemmatized title POS")
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")

# Tag the original title
doc = nlp(df.loc[25, "titulo"])

# Print each token and its POS tag
print("\n\nOriginal title POS")
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")

Lemmatized title POS
metodo -> NOUN (NOUN)
codificacao -> ADJ (ADJ)
video -> PROPN (PROPN)
codificador -> ADJ (ADJ)
decodificador -> ADJ (ADJ)
produto -> NOUN (NOUN)
programa -> ADJ (ADJ)
computador -> ADJ (ADJ)


Original title POS
MÉTODO -> PROPN (PROPN)
DE -> ADP (ADP)
CODIFICAÇÃO -> PROPN (PROPN)
DE -> PROPN (PROPN)
VÍDEO -> PROPN (PROPN)
, -> PUNCT (PUNCT)
CODIFICADOR -> PROPN (PROPN)
, -> PUNCT (PUNCT)
DECODIFICADOR -> PROPN (PROPN)
E -> CCONJ (CCONJ)
PRODUTO -> PROPN (PROPN)
DE -> ADP (ADP)
PROGRAMA -> PROPN (PROPN)
DE -> ADP (ADP)
COMPUTADOR -> NOUN (NOUN)


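As seen, the Portuguese POS tagger handles these titles poorly: in the lowercase, accent-stripped lemmatized title most nouns are tagged `ADJ`, and in the uppercase original title almost every word is tagged `PROPN`. With tags this unreliable, POS tagging alone cannot properly identify the Task and the Object.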
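
As a rough way to quantify this, the sketch below counts the tags in the two outputs above. The (token, tag) pairs are copied by hand from that output rather than recomputed, and `tag_counts` is a helper name introduced here for illustration.

```python
from collections import Counter

# (token, pos_) pairs copied from the output above
lemmatized_title = [
    ("metodo", "NOUN"), ("codificacao", "ADJ"), ("video", "PROPN"),
    ("codificador", "ADJ"), ("decodificador", "ADJ"), ("produto", "NOUN"),
    ("programa", "ADJ"), ("computador", "ADJ"),
]
original_title = [
    ("MÉTODO", "PROPN"), ("DE", "ADP"), ("CODIFICAÇÃO", "PROPN"),
    ("DE", "PROPN"), ("VÍDEO", "PROPN"), (",", "PUNCT"),
    ("CODIFICADOR", "PROPN"), (",", "PUNCT"), ("DECODIFICADOR", "PROPN"),
    ("E", "CCONJ"), ("PRODUTO", "PROPN"), ("DE", "ADP"),
    ("PROGRAMA", "PROPN"), ("DE", "ADP"), ("COMPUTADOR", "NOUN"),
]


def tag_counts(tagged_tokens):
    """Count how often each coarse-grained POS tag occurs."""
    return Counter(pos for _, pos in tagged_tokens)


for name, tagged in [("lemmatized", lemmatized_title), ("original", original_title)]:
    tag, count = tag_counts(tagged).most_common(1)[0]
    print(f"{name}: {count} of {len(tagged)} tokens tagged {tag}")
# lemmatized: 5 of 8 tokens tagged ADJ
# original: 8 of 15 tokens tagged PROPN
```

Running the same count over the whole dataset, rather than a single title, would show whether this pattern is systematic.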