### Headline Category Classifier Model Training

This notebook trains a text classifier with spaCy.

__Model Type: BERT__

This model was trained in a Python 3.11.3 environment and requires:
- spacy
- spacy-transformers

Please consult the requirements.txt for more info.

#### 0. Check GPU Status

In [1]:
# check cuda and gpu status
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [2]:
!nvidia-smi

Fri Jun 23 13:16:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   34C    P8     1W / 115W |      0MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
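
The commands above confirm that the CUDA toolkit and driver are installed. As a complementary check from Python, the sketch below asks spaCy whether it can actually use the GPU. It relies on `spacy.prefer_gpu()`, which returns `True` when a GPU was activated. The helper name `gpu_available` is hypothetical, and the fallback for a missing spaCy install is an assumption made so the snippet runs anywhere.

```python
def gpu_available() -> bool:
    """Return True if spaCy could activate a GPU, False otherwise."""
    try:
        import spacy
    except ImportError:
        # Assumption: treat a missing spaCy install as "no GPU available".
        return False
    # prefer_gpu() activates the GPU if possible and reports the outcome.
    return bool(spacy.prefer_gpu())


print("GPU available to spaCy:", gpu_available())
```

If this prints `False` on a machine where `nvidia-smi` lists a GPU, the GPU-enabled spaCy dependencies are a likely place to look first.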

#### 1. Generate config file and modify as necessary to use the correct model

In [3]:
# generate the generic config file
!python -m spacy init config --pipeline textcat config_transformer.cfg --gpu

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
✔ Auto-filled config with all values
✔ Saved config
config_transformer.cfg
You can now add your data and train your pipeline:
python -m spacy train config_transformer.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


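__Note:__ The generated config uses `roberta-base`. Edit the config file so that it uses the BERT model instead:

1. In the `[components.transformer.model]` section, set `name` to `"bert-base-uncased"`: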

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"
mixed_precision = false
```

2. Save the file.
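
The manual edit could also be scripted so that the notebook is reproducible end to end. The following is a minimal sketch, not part of the original workflow: `set_transformer_name` is a hypothetical helper that rewrites only the `name` line inside the `[components.transformer.model]` section and leaves every other line untouched. It uses plain string handling rather than spaCy's own config loader.

```python
def set_transformer_name(cfg_text: str, model_name: str,
                         section: str = "components.transformer.model") -> str:
    """Return cfg_text with the `name` key of `section` set to model_name."""
    out = []
    in_section = False
    replaced = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new section header: check whether it is the target section.
            in_section = stripped[1:-1] == section
        elif in_section and stripped.split("=")[0].strip() == "name":
            line = f'name = "{model_name}"'
            replaced = True
        out.append(line)
    if not replaced:
        raise KeyError(f"no 'name' key found in section [{section}]")
    return "\n".join(out) + "\n"


# Usage (assumes config_transformer.cfg exists in the working directory):
# from pathlib import Path
# path = Path("config_transformer.cfg")
# path.write_text(set_transformer_name(path.read_text(), "bert-base-uncased"))
```

Editing by line keeps the rest of the generated file byte-for-byte identical, which makes the change easy to review in a diff.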



#### 2. Train the model and evaluate model performance

In [4]:
# train the model
!python -m spacy train config_transformer.cfg --paths.train ../data/train.spacy --paths.dev ../data/dev.spacy --output textcat_model_transformer --gpu-id 0

✔ Created output directory: textcat_model_transformer
ℹ Saving to output directory: textcat_model_transformer
ℹ Using GPU: 0
[2023-06-23 13:17:40,196] [INFO] Set up nlp object from config
[2023-06-23 13:17:40,203] [INFO] Pipeline: ['transformer', 'textcat']
[2023-06-23 13:17:40,204] [INFO] Created vocabulary
[2023-06-23 13:17:40,205] [INFO] Finished initializing nlp object
Downloading (…)okenizer_config.json: 100%|████| 28.0/28.0 [00:00<00:00, 238kB/s]
Downloading (…)lve/main/config.json: 100%|█████| 570/570 [00:00<00:00, 5.42MB/s]
Downloading (…)solve/main/vocab.txt: 232kB [00:00, 7.66MB/s]
Downloading (…)/main/tokenizer.json: 466kB [00:00, 9.69MB/s]
Downloading model.safetensors: 100%|█████████| 440M/440M [00:41<00:00, 10.6MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.de

In [5]:
# evaluate the model
!python -m spacy evaluate ./textcat_model_transformer/model-best/ --output ./metrics_transformer.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   51.67 
SPEED               4386

                     P       R       F
POLITICS         78.44   80.73   79.57
WELLNESS         59.16   66.86   62.78
ENTERTAINMENT    72.26   70.91   71.58
TRAVEL           79.03   80.53   79.77
HEALTHY LIVING   34.06   35.86   34.94
BUSINESS         50.26   47.26   48.72
WEIRD NEWS       43.65   45.83   44.72
SPORTS           71.18   74.34   72.73
PARENTING        53.98   64.34   58.70
STYLE & BEAUTY   80.46   78.92   79.68
GREEN            42.86   43.72   43.29
FOOD & DRINK     68.77   77.44   72.85
QUEER VOICES     66.14   72.80   69.31
THE WORLDPOST    48.18   44.38   46.20
HOME & LIVING    80.31   78.11   79.19
WEDDINGS         79.95   78.07   79.00
PARENTS          42.17   26.18   32.31
ARTS & CULTURE   39.42   29.50   33.74
CRIME            60.52   52.23   56.07
CULTURE & ARTS   81.08   34.88   48.78
ENVIRONMENT      58.82   22.56   32.61
COMEDY           53.
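
To make the table easier to read: the F column is the harmonic mean of P (precision) and R (recall) for each label, and the macro F is the unweighted mean of the per-label F scores, so rare labels count as much as frequent ones. The sketch below recomputes F for two rows of the table. The function names are hypothetical, and because the printed P and R are already rounded, recomputed values can differ from the table in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(rows: dict) -> float:
    """Unweighted mean of the per-label F scores."""
    scores = [f_score(p, r) for p, r in rows.values()]
    return sum(scores) / len(scores)


# (P, R) pairs copied from the table above.
rows = {
    "POLITICS": (78.44, 80.73),
    "TRAVEL": (79.03, 80.53),
}

for label, (p, r) in rows.items():
    print(f"{label:<10} F = {f_score(p, r):.2f}")

# This averages only the two labels listed here, so it is not the
# reported macro F, which averages over every label.
print(f"macro F over these labels only: {macro_f(rows):.2f}")
```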

In [6]:
# check results
import spacy
nlp = spacy.load("textcat_model_transformer/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)

{'POLITICS': 6.087727160775103e-05, 'WELLNESS': 0.00016755890101194382, 'ENTERTAINMENT': 0.00012017521657980978, 'TRAVEL': 0.9958972930908203, 'HEALTHY LIVING': 4.173409251961857e-05, 'BUSINESS': 0.00016323034651577473, 'WEIRD NEWS': 0.0003394785162527114, 'SPORTS': 6.078213482396677e-05, 'PARENTING': 8.394128235522658e-05, 'STYLE & BEAUTY': 0.0003491932584438473, 'GREEN': 0.00011509830073919147, 'FOOD & DRINK': 0.0004563691036310047, 'QUEER VOICES': 4.171185355517082e-05, 'THE WORLDPOST': 4.907382026431151e-05, 'HOME & LIVING': 0.0001826801453717053, 'WEDDINGS': 7.545178959844634e-05, 'PARENTS': 2.223265983047895e-05, 'ARTS & CULTURE': 1.4923170965630561e-05, 'CRIME': 5.273676651995629e-05, 'CULTURE & ARTS': 9.494981350144371e-05, 'ENVIRONMENT': 0.0003040886949747801, 'COMEDY': 4.834379069507122e-05, 'RELIGION': 1.7640128135099076e-05, 'MONEY': 8.6573651060462e-05, 'BLACK VOICES': 2.124297861882951e-05, 'COLLEGE': 1.541004348837305e-05, 'DIVORCE': 2.8447317163227126e-05, 'U.S. NEWS': 

In [7]:
max(doc.cats, key=doc.cats.get)

'TRAVEL'

In [8]:
doc.cats["TRAVEL"]

0.9958972930908203
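
Looking only at the top label hides how close the runners-up are. As a small sketch, the hypothetical helper `top_k` below ranks a `doc.cats`-style dictionary by score. The sample dictionary holds a few of the scores printed above rather than the full output, so the snippet runs without loading the model.

```python
def top_k(cats: dict, k: int = 3) -> list:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# A subset of the scores printed above.
cats = {
    "POLITICS": 6.087727160775103e-05,
    "TRAVEL": 0.9958972930908203,
    "WEIRD NEWS": 0.0003394785162527114,
    "STYLE & BEAUTY": 0.0003491932584438473,
    "FOOD & DRINK": 0.0004563691036310047,
}

for label, score in top_k(cats):
    print(f"{label:<15} {score:.4f}")
```

With the loaded pipeline, `top_k(doc.cats)` would work the same way.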

#### 3. Package the model into a Zip file

In [None]:
# zip up the model-best

import shutil

model_best_path = "textcat_model_transformer/model-best"
zipfile_name = "textcat_model_transformer/model-best"  # base name; make_archive appends ".zip"

shutil.make_archive(zipfile_name, "zip", model_best_path)
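
`shutil.make_archive` returns the path of the archive it wrote, which makes a quick sanity check possible. The sketch below is an illustration only: `archive_and_list` is a hypothetical helper, and the demo uses a temporary directory with a stand-in file instead of the real `model-best` folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_and_list(src_dir, base_name):
    """Zip src_dir to base_name + '.zip' and return (archive path, member names)."""
    archive_path = shutil.make_archive(str(base_name), "zip", str(src_dir))
    with zipfile.ZipFile(archive_path) as zf:
        return archive_path, sorted(zf.namelist())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "model-best"
    src.mkdir()
    # Stand-in content; a real model directory holds many more files.
    (src / "meta.json").write_text("{}")
    demo_path, demo_names = archive_and_list(src, Path(tmp) / "model-best")
    print(Path(demo_path).name, demo_names)
```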

__Note:__ To preserve a trained model, rename its output folder before retraining, since a new run writes to the same output directory. For example, rename `textcat_model_transformer` to `textcat_model_transformer_2023-07-17_12-24`.
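
The rename could be automated with a timestamp suffix in the same format as the example. This is a minimal sketch under that assumption; `timestamped_name` and `preserve_model` are hypothetical helpers, not part of the original notebook.

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(folder: str, now=None) -> str:
    """Append a _YYYY-MM-DD_HH-MM suffix to a folder name."""
    now = now or datetime.now()
    return f"{folder}_{now:%Y-%m-%d_%H-%M}"


def preserve_model(folder, now=None) -> Path:
    """Rename `folder` in place with a timestamp suffix and return the new path."""
    src = Path(folder)
    dst = src.with_name(timestamped_name(src.name, now))
    src.rename(dst)
    return dst


# Usage (assumes the training output folder exists):
# preserve_model("textcat_model_transformer")
```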