# Patent classification

To begin with, we generate a tiny artificial dataset with 4 instances.

In [None]:
texts = ["This is machine learning related patent", "This is a biology related patent", "This is machine learning related patent", "This is a biology related patent"]
labels = ["Machine learning", "Biology", "Machine learning", "Biology"]

All classifiers require integer labels. The `LabelsConverter` class handles the conversion: the `encode_labels` method converts string labels to integers, and the `decode_labels` method does the opposite. The latter is useful for converting predictions back to the more meaningful label names.

In [None]:
from dapc.labels_converter import LabelsConverter

lb = LabelsConverter(labels)
lb.classes

labels_int = lb.encode_labels(labels)
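
To make the conversion concrete, here is a minimal pure-Python sketch of a string-to-integer label mapping. `SimpleLabelsConverter` is a hypothetical stand-in written for illustration; it is not the `dapc` implementation, and the real `LabelsConverter` may order its classes differently.

```python
class SimpleLabelsConverter:
    """Hypothetical stand-in that illustrates string <-> integer label mapping."""

    def __init__(self, labels):
        # Sort the unique label names so the mapping is deterministic.
        self.classes = sorted(set(labels))
        self._to_int = {name: index for index, name in enumerate(self.classes)}

    def encode_labels(self, labels):
        # Map each label name to its integer index.
        return [self._to_int[name] for name in labels]

    def decode_labels(self, encoded):
        # Map each integer index back to its label name.
        return [self.classes[index] for index in encoded]


demo_labels = ["Machine learning", "Biology", "Machine learning", "Biology"]
converter = SimpleLabelsConverter(demo_labels)
encoded = converter.encode_labels(demo_labels)

print(converter.classes)                                # ['Biology', 'Machine learning']
print(encoded)                                          # [1, 0, 1, 0]
print(converter.decode_labels(encoded) == demo_labels)  # True
```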

The simplest way to train a patent classifier is to instantiate a `Classifier` by specifying the classifier type and its parameters.

Below we instantiate one classifier for each of the available categories.

In [None]:
from dapc.classifier import Classifier

transformer_classifier = Classifier("transformers", model_name="bert-base-uncased", num_of_labels=len(lb.classes))
adapter_classifier = Classifier("adapters", model_name="bert-base-uncased", num_of_labels=len(lb.classes))
trans_cnn_classifier = Classifier("transformers_cnn", model_name="bert-base-uncased", num_of_labels=len(lb.classes))


All the classifiers above use the default configuration of their respective models (check their implementation for further details). A custom model can also be used: initialize the custom model and then pass it directly to the classifier. The example below instantiates a custom transformer model with a different learning rate.

In [None]:
from dapc.models.transformer_classifier import TransformerClassifier
# Match the number of classes to the dataset instead of hard-coding it.
transformer_custom_model = TransformerClassifier(classes=len(lb.classes), learning_rate=5e-5)
transformer_classifier_custom = Classifier(transformer_custom_model)

Let's focus on `transformer_classifier` and train it on our tiny example dataset. To train a classifier, we need to provide a list of texts, the corresponding list of labels, and the number of epochs.

In [None]:
transformer_classifier.train(texts, labels_int, 1)

The accuracy is zero because this is a dummy dataset, but let's pretend that the model has been trained successfully and that we would like to evaluate it. This can be done by passing a set of texts and labels to the `evaluate` method.

In [None]:
transformer_classifier.evaluate(texts, labels_int)
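
As a reminder of what an accuracy figure measures, the following sketch computes it from integer labels in plain Python. The `accuracy` helper and the example values are illustrative assumptions; the exact set of metrics reported by `evaluate` is not shown here.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length.")
    if not true_labels:
        raise ValueError("At least one label is required.")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up values for illustration: three of the four predictions are correct.
true_example = [1, 0, 1, 0]
predicted_example = [1, 1, 1, 0]
print(accuracy(true_example, predicted_example))  # 0.75
```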

Similarly, for prediction we can use the `predict` method and provide the list of texts of interest. It returns the predicted labels together with the logits.

In [None]:
predictions, logits = transformer_classifier.predict(texts)
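
Logits are unnormalized per-class scores. The sketch below shows the usual relationship between logits, predicted labels, and probabilities, assuming one row of scores per text and one score per class. The helper names and the example values are made up for illustration, and the actual type and shape of the `logits` returned by `predict` may differ.

```python
import math


def softmax(scores):
    """Turn a row of logits into probabilities that sum to 1."""
    highest = max(scores)  # Subtract the maximum for numerical stability.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=lambda index: scores[index])


# Made-up logits for two texts and two classes.
example_logits = [[2.0, 0.5], [-1.0, 1.0]]
example_predictions = [argmax(row) for row in example_logits]
example_probabilities = [softmax(row) for row in example_logits]

print(example_predictions)  # [0, 1]
```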

Let's use the `decode_labels` method to convert the predicted labels to their names and inspect them.

In [None]:
lb.decode_labels(predictions)

To cross-validate the classifier, use the `cross_validation` method. In addition to the texts, labels, and epochs, the number of folds (`k`) must also be provided.

In [None]:
transformer_classifier.cross_validation(texts, labels_int, k=2, epochs=3)
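
For intuition, k-fold cross-validation splits the data into `k` parts and uses each part once for validation while training on the rest. The sketch below is a generic, unshuffled version of that split. `k_fold_indices` is a hypothetical helper, and the way `cross_validation` actually builds its folds (for example, whether it shuffles or stratifies) may differ.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (training_indices, validation_indices) pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples.")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        folds.append((training, validation))
        start += size
    return folds


# With 4 samples and k=2, each fold validates on half of the data.
for training, validation in k_fold_indices(4, 2):
    print(training, validation)
# [2, 3] [0, 1]
# [0, 1] [2, 3]
```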

Once the training is done, we can save the model using the `save` method.

In [None]:
transformer_classifier.save("my_beautiful_classifier")

Then you can reload it using the `load` method and use it again.

In [None]:
transformer_classifier.load("my_beautiful_classifier")
transformer_classifier.evaluate(texts, labels_int)

Note that because more than one model is trained during cross-validation, we suggest not using the `save` method after cross-validation. Instead, you can define a model-saving strategy during cross-validation with the `save_model_path` and `save_model_strategy` parameters of the `cross_validation` method. Specifically, `save_model_path` defines the path where the checkpoints will be saved, while `save_model_strategy` defines the strategy by which they are saved. For instance, to store all the models, set `save_model_strategy='all'`. Alternatively, to save only the best model according to a specific metric, provide the name of that metric, for example `save_model_strategy='micro_f1_score'`.
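
The two strategies can be pictured with a small selection function. This is only a sketch of the idea, not the package's code: `folds_to_save` is a hypothetical helper, the metric values are made up, and it assumes that a higher metric value is better.

```python
def folds_to_save(fold_metrics, strategy):
    """Return the indices of the folds whose models would be kept."""
    if not fold_metrics:
        raise ValueError("At least one fold is required.")
    if strategy == "all":
        # Keep the model from every fold.
        return list(range(len(fold_metrics)))
    # Otherwise, treat the strategy as the name of a metric to maximize.
    if any(strategy not in metrics for metrics in fold_metrics):
        raise KeyError(f"Metric '{strategy}' is missing from some folds.")
    best = max(range(len(fold_metrics)), key=lambda i: fold_metrics[i][strategy])
    return [best]


# Made-up metrics for two folds.
example_metrics = [
    {"accuracy": 0.50, "micro_f1_score": 0.50},
    {"accuracy": 0.75, "micro_f1_score": 0.75},
]
print(folds_to_save(example_metrics, "all"))             # [0, 1]
print(folds_to_save(example_metrics, "micro_f1_score"))  # [1]
```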

### Multilingual patent classification

The package supports multilingual training. Everything presented in the monolingual notebook also applies here. The only addition is that the multilingual classifier can evaluate the model's performance per language.

First, let's create an instance of the multilingual classifier.

In [None]:
from dapc.classifier_multilingual import ClassifierMultilingual

transformer_classifier_multi = ClassifierMultilingual("transformers", model_name="bert-base-uncased", num_of_labels=len(lb.classes))

Next, let's stick with the existing small dataset and treat it as a multilingual dataset. For a multilingual evaluation, we need a list with the language of each input text. If this information is not already available, the package provides the `LanguagesDetector` class, which relies on spaCy to infer it.

<u>Note: before running the language detector for the first time, download the required spaCy model with the following command:
`python -m spacy download en_core_web_sm`</u>

In [None]:
from dapc.languages_detector import LanguagesDetector

lan_detector = LanguagesDetector()
langs = lan_detector.infer_languages(texts)
langs

Once the training texts and their languages are known, we can train the model or use the multilingual cross-validation and evaluation.

In [None]:
transformer_classifier_multi.cross_validation_multilingual(texts, labels_int, langs, k=2, epochs=3)
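
Per-language evaluation amounts to grouping the examples by language and computing a metric inside each group. The sketch below does this for accuracy in plain Python. `accuracy_per_language`, the language codes, and the values are illustrative assumptions rather than the package's output format.

```python
from collections import defaultdict


def accuracy_per_language(true_labels, predicted_labels, languages):
    """Return a dict that maps each language to the accuracy on its texts."""
    if not (len(true_labels) == len(predicted_labels) == len(languages)):
        raise ValueError("All three lists must have the same length.")
    correct = defaultdict(int)
    total = defaultdict(int)
    for true, predicted, language in zip(true_labels, predicted_labels, languages):
        total[language] += 1
        correct[language] += int(true == predicted)
    return {language: correct[language] / total[language] for language in total}


# Made-up example: both English texts are correct, one German text is wrong.
example_true = [1, 0, 1, 0]
example_predicted = [1, 0, 0, 0]
example_languages = ["en", "en", "de", "de"]
print(accuracy_per_language(example_true, example_predicted, example_languages))
# {'en': 1.0, 'de': 0.5}
```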