### Headline Category Classifier Model Training

This notebook trains a text classifier with spaCy.

__Model Type: RoBERTa__

This model was trained in a Python 3.11.3 environment and requires:
- spacy
- spacy-transformers

Please consult the requirements.txt for more info.

#### 0. Check GPU Status

In [2]:
# check cuda and gpu status
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [3]:
!nvidia-smi

Fri Jun 23 15:08:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8     1W / 115W |      0MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
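The same check can be done from Python so that later cells can stop early when no GPU is visible. This is a minimal sketch that assumes `nvidia-smi` is on the `PATH`; the helper name `gpu_summary` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def gpu_summary() -> list[str]:
    """Return one 'name, total memory' line per GPU, or [] if none is visible."""
    # nvidia-smi is missing on machines without the NVIDIA driver
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver could not be reached
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
print(gpus if gpus else "No NVIDIA GPU visible")
```

An empty list is a signal to fix the driver or CUDA setup before starting a GPU training run.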

#### 1. Generate config file and modify as necessary to use the correct model

In [4]:
# setup the config file
!python -m spacy init config --pipeline textcat config_transformer.cfg --gpu

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
✔ Auto-filled config with all values
✔ Saved config
config_transformer.cfg
You can now add your data and train your pipeline:
python -m spacy train config_transformer.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
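Before training, it can help to confirm which transformer the generated file points at. The sketch below assumes the model name is stored under `[components.transformer.model]` as `name`, which is where spaCy's transformer template places it. The function `transformer_name` and the `sample` excerpt are illustrative and not part of the notebook.

```python
import configparser
import json
from pathlib import Path


def transformer_name(config_text: str) -> str:
    """Return the transformer model name set in a spaCy training config."""
    # interpolation=None leaves spaCy's ${...} variable references untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Values in spaCy configs are JSON-encoded, so strings carry quotes
    return json.loads(parser["components.transformer.model"]["name"])


# Hypothetical excerpt so the sketch runs without the real file
sample = """
[components.transformer.model]
name = "roberta-base"
"""
print(transformer_name(sample))

# Read the real file only if it has been generated
config_path = Path("config_transformer.cfg")
if config_path.exists():
    print(transformer_name(config_path.read_text(encoding="utf8")))
```

To use a different model, edit the `name` value in the config file and rerun the check.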


#### 2. Train the model and evaluate model performance

In [4]:
# train the model
!python -m spacy train config_transformer.cfg --paths.train ../data/train.spacy  --paths.dev ../data/dev.spacy --output textcat_model_transformer --gpu-id 0

✔ Created output directory: textcat_model_transformer
ℹ Saving to output directory: textcat_model_transformer
ℹ Using GPU: 0
[2023-06-23 15:09:58,739] [INFO] Set up nlp object from config
[2023-06-23 15:09:58,746] [INFO] Pipeline: ['transformer', 'textcat']
[2023-06-23 15:09:58,748] [INFO] Created vocabulary
[2023-06-23 15:09:58,749] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical 

In [5]:
# evaluate the model
!python -m spacy evaluate ./textcat_model_transformer/model-best/ --output ./metrics_transformer.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   50.21 
SPEED               4085  


                     P       R       F
POLITICS         79.07   81.26   80.15
WELLNESS         60.10   67.61   63.63
ENTERTAINMENT    69.20   76.56   72.70
TRAVEL           81.66   79.68   80.66
HEALTHY LIVING   33.61   26.32   29.52
BUSINESS         49.92   54.06   51.91
WEIRD NEWS       45.92   37.50   41.28
SPORTS           65.77   78.79   71.69
PARENTING        61.17   55.81   58.36
STYLE & BEAUTY   72.98   84.19   78.19
GREEN            40.25   51.82   45.31
FOOD & DRINK     68.67   70.87   69.76
QUEER VOICES     73.00   71.86   72.42
THE WORLDPOST    41.33   64.44   50.36
HOME & LIVING    82.29   71.64   76.60
WEDDINGS         81.84   76.50   79.08
PARENTS          47.04   35.66   40.57
ARTS & CULTURE   40.00   37.41   38.66
CRIME            66.21   54.19   59.60
CULTURE & ARTS   43.42   38.37   40.74
ENVIRONMENT      70.00   15.79   25.77
COMEDY           60.
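The F column is the harmonic mean of P and R, and the macro F reported above is, to my understanding of spaCy's scorer, the unweighted mean of the per-category F values. That is why weak categories pull the overall figure well below the strongest ones. The sketch below recomputes F for three rows of the table; `f_score` is an illustrative helper, and because the printed P and R are rounded, a recomputed value may differ in the last digit for other rows.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the evaluation table
rows = {
    "POLITICS": (79.07, 81.26),
    "TRAVEL": (81.66, 79.68),
    "ENVIRONMENT": (70.00, 15.79),
}

for label, (p, r) in rows.items():
    print(f"{label:<12} F = {f_score(p, r):.2f}")
```

ENVIRONMENT shows the effect of the harmonic mean: a precision of 70.00 cannot offset a recall of 15.79, so F lands close to the lower value.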

In [6]:
# spot-check the trained model on a single headline
import spacy
nlp = spacy.load("textcat_model_transformer/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)

{'POLITICS': 0.00022873641864862293, 'WELLNESS': 0.0008439306984655559, 'ENTERTAINMENT': 0.00024795858189463615, 'TRAVEL': 0.9911842942237854, 'HEALTHY LIVING': 0.00013403574121184647, 'BUSINESS': 0.0006679550861008465, 'WEIRD NEWS': 0.00028393309912644327, 'SPORTS': 0.00016008797683753073, 'PARENTING': 0.00030838174279779196, 'STYLE & BEAUTY': 0.00044757052091881633, 'GREEN': 0.00022121783695183694, 'FOOD & DRINK': 0.0008898158557713032, 'QUEER VOICES': 4.772236934513785e-05, 'THE WORLDPOST': 5.8274436014471576e-05, 'HOME & LIVING': 0.00030177002190612257, 'WEDDINGS': 0.00015639951743651181, 'PARENTS': 7.467636896762997e-05, 'ARTS & CULTURE': 4.363317566458136e-05, 'CRIME': 6.480376760009676e-05, 'CULTURE & ARTS': 0.00033347454154863954, 'ENVIRONMENT': 0.0003155650629196316, 'COMEDY': 0.00018637241737451404, 'RELIGION': 0.00012191912537673488, 'MONEY': 0.00019806383352261037, 'BLACK VOICES': 7.40106261218898e-05, 'COLLEGE': 5.0833878049161285e-05, 'DIVORCE': 6.233305612113327e-05, 'U.

In [7]:
max(doc.cats, key=doc.cats.get)

'TRAVEL'

In [8]:
doc.cats["TRAVEL"]

0.9911842942237854
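Looking only at the top label hides how close the runners-up are. Since `doc.cats` is a plain dict of label to score, a small helper can rank it. This is a sketch: `top_categories` is a hypothetical name, and the scores are a subset copied from the output above so the block runs without the trained model.

```python
def top_categories(cats: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# Subset of the doc.cats scores shown above
cats = {
    "POLITICS": 0.00022873641864862293,
    "WELLNESS": 0.0008439306984655559,
    "TRAVEL": 0.9911842942237854,
    "BUSINESS": 0.0006679550861008465,
    "FOOD & DRINK": 0.0008898158557713032,
}

for label, score in top_categories(cats, k=3):
    print(f"{label:<14} {score:.4f}")
```

With the loaded pipeline, `top_categories(doc.cats)` would give the same kind of ranking over all categories.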

#### 3. Package the model into a Zip file

In [None]:
# zip up the model-best folder

import shutil

model_best_path = "textcat_model_transformer/model-best"
# archive base name without extension; make_archive appends ".zip"
zipfile_name = "textcat_model_transformer/model-best"

shutil.make_archive(zipfile_name, "zip", model_best_path)

__Note:__ To preserve a model across training runs, rename the output folder, for example from "textcat_model_transformer" to "textcat_model_transformer_2023-07-17_12-24".
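The rename can be scripted so the timestamp format stays consistent. The sketch below builds a name in the same `YYYY-MM-DD_HH-MM` pattern as the example in the note; `timestamped_name` is a hypothetical helper, and the rename runs only if the output folder exists and the target name is free.

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(base: str, when: datetime) -> str:
    """Append a YYYY-MM-DD_HH-MM suffix to a folder name."""
    return f"{base}_{when:%Y-%m-%d_%H-%M}"


output_dir = Path("textcat_model_transformer")
target = output_dir.with_name(timestamped_name(output_dir.name, datetime.now()))
print(target)

# Rename only when there is something to preserve and nothing to clobber
if output_dir.is_dir() and not target.exists():
    output_dir.rename(target)
```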