### Headline Category Classifier Model Training

This notebook trains a headline category text classifier using spaCy.

__Model Type: spaCy TextCatCNN__

This model was trained in a Python 3.11.3 environment and requires:
- spacy

See requirements.txt for details.

#### 0. Check GPU Status

In [3]:
# check cuda version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [4]:
# check gpu status
!nvidia-smi

Tue Jul 11 22:21:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8     1W / 114W |      0MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
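
The same check can be scripted from Python, for example to stop early before a long training run. This is a minimal sketch that assumes `nvidia-smi` is on the PATH; `list_gpus` is a hypothetical helper name and is not part of spaCy.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible NVIDIA GPU.

    Returns an empty list if nvidia-smi is missing or reports an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

An empty list only means that `nvidia-smi` found nothing; it does not by itself confirm whether spaCy can use the GPU.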

#### 1. Generate config file and modify as necessary to use the correct model

In [5]:
# setup the config file
!python -m spacy init config --pipeline textcat config.cfg --gpu

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Modify the `[components.textcat.model]` and `[components.textcat.model.tok2vec]` sections of config.cfg so that they read:

```
[components.textcat.model]
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
```
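
A quick way to confirm that the edited sections contain the intended values is to parse them with the standard library. This is only a sanity-check sketch: spaCy reads config files with its own loader, and `parse_sections` is a hypothetical helper introduced here for illustration. It assumes the values are JSON-like, which holds for the fragment above.

```python
import configparser
import json

# The two sections shown above, copied from the notebook.
CONFIG_FRAGMENT = """
[components.textcat.model]
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
"""


def parse_sections(text):
    """Parse config text into {section: {key: value}} (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "nO"
    parser.read_string(text)
    parsed = {}
    for section in parser.sections():
        values = {}
        for key, raw in parser[section].items():
            try:
                values[key] = json.loads(raw)  # assumed JSON-like value
            except ValueError:
                values[key] = raw  # keep anything else as plain text
        parsed[section] = values
    return parsed


sections = parse_sections(CONFIG_FRAGMENT)
model = sections["components.textcat.model"]
tok2vec = sections["components.textcat.model.tok2vec"]
print(model["@architectures"], tok2vec["@architectures"], tok2vec["width"])
```

To check the real file instead of the fragment, the text of config.cfg could be passed to the same function.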



#### 2. Train the model and evaluate model performance

In [12]:
# train the model
!python -m spacy train config.cfg --paths.train ../data/train.spacy --paths.dev ../data/dev.spacy --output textcat_model --gpu-id 0

✔ Created output directory: textcat_model
ℹ Saving to output directory: textcat_model
ℹ Using GPU: 0

[2023-07-11 23:39:17,741] [INFO] Set up nlp object from config
[2023-07-11 23:39:17,748] [INFO] Pipeline: ['textcat']
[2023-07-11 23:39:17,749] [INFO] Created vocabulary
[2023-07-11 23:39:17,749] [INFO] Finished initializing nlp object
[2023-07-11 23:39:34,042] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.30        1.26    0.01
  0     200          5.71        1.46    0.01
  0     400          4.54        2.56    0.03
  0     600          4.44        2.95    0.03
  0     800          4.27        4.73    0.05
  0    1000          4.15        8.05    0.08
  0    1200          3.93        7.41    0.07
  0    1

In [13]:
# evaluate the model
!python -m spacy evaluate ./textcat_model/model-best/ --output ./textcat_model/metrics.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   39.01 
SPEED               6175  


                     P       R       F
POLITICS         67.39   82.80   74.30
WELLNESS         52.64   64.21   57.85
ENTERTAINMENT    59.58   71.57   65.03
TRAVEL           70.89   68.95   69.90
HEALTHY LIVING   24.84   32.24   28.06
BUSINESS         34.47   45.44   39.20
WEIRD NEWS       32.76   23.75   27.54
SPORTS           58.05   66.26   61.89
PARENTING        50.00   55.81   52.74
STYLE & BEAUTY   67.69   76.57   71.86
GREEN            38.41   21.46   27.53
FOOD & DRINK     56.45   68.38   61.85
QUEER VOICES     74.19   65.09   69.35
THE WORLDPOST    34.86   50.76   41.34
HOME & LIVING    64.64   68.66   66.59
WEDDINGS         75.00   72.85   73.91
PARENTS          32.01   26.18   28.81
ARTS & CULTURE   13.89    3.60    5.71
CRIME            49.52   43.02   46.04
CULTURE & ARTS   46.30   29.07   35.71
ENVIRONMENT      40.00   15.04   21.86
COMEDY           53.
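
Each value in the F column is the harmonic mean of that label's precision and recall, and the macro F is an unweighted mean of per-label F scores. The sketch below recomputes F for three rows copied from the output above; `f_score` and `macro_f` are illustrative helper names. Because the printed P and R are already rounded, the results may differ from the table in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on a 0-100 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(rows):
    """Unweighted mean of the per-label F scores."""
    return sum(f_score(p, r) for p, r in rows.values()) / len(rows)


# (P, R) pairs copied from the evaluation output above.
rows = {
    "POLITICS": (67.39, 82.80),
    "STYLE & BEAUTY": (67.69, 76.57),
    "WEDDINGS": (75.00, 72.85),
}

for label, (p, r) in rows.items():
    print(f"{label:<15} F = {f_score(p, r):.2f}")

# Mean over these three labels only, not the 39.01 reported for all labels.
print(f"macro F over subset = {macro_f(rows):.2f}")
```

Weak labels pull the macro F down as much as strong labels pull it up, which is why the overall 39.01 sits well below the scores of the best categories.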

In [14]:
# check results
import spacy

# load the best checkpoint saved by the training run
nlp = spacy.load("textcat_model/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)

{'POLITICS': 0.01840197481215, 'WELLNESS': 0.00032212372752837837, 'ENTERTAINMENT': 0.005269561428576708, 'TRAVEL': 0.00636184960603714, 'HEALTHY LIVING': 0.0034280235413461924, 'BUSINESS': 0.03493398800492287, 'WEIRD NEWS': 1.2614979823410977e-05, 'SPORTS': 0.00023266530479304492, 'PARENTING': 0.043444517999887466, 'STYLE & BEAUTY': 0.0008673993870615959, 'GREEN': 0.0037550267297774553, 'FOOD & DRINK': 0.0006035942933522165, 'QUEER VOICES': 0.06572605669498444, 'THE WORLDPOST': 0.0008242229232564569, 'HOME & LIVING': 0.0003070157254114747, 'WEDDINGS': 0.0016251156339421868, 'PARENTS': 0.00020135444356128573, 'ARTS & CULTURE': 0.004422228783369064, 'CRIME': 9.721733658807352e-05, 'CULTURE & ARTS': 0.05658094584941864, 'ENVIRONMENT': 0.01311678159981966, 'COMEDY': 0.0040053692646324635, 'RELIGION': 0.002186872297897935, 'MONEY': 0.01357474084943533, 'BLACK VOICES': 0.0017910278402268887, 'COLLEGE': 0.004576589912176132, 'DIVORCE': 0.008593172766268253, 'U.S. NEWS': 0.001206616754643619,

In [15]:
# label with the highest score
max(doc.cats, key=doc.cats.get)

'WORLDPOST'

In [16]:
doc.cats["PARENTING"]

0.043444517999887466
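
Since `doc.cats` is a plain dictionary of label scores, ranking the top few labels is often more informative than looking at the single maximum. The sketch below uses a hand-copied subset of the scores printed above, rounded to four decimals, so its ranking does not include every label the model knows; `top_k` is a hypothetical helper name.

```python
# Subset of the scores printed above, rounded to four decimals for brevity.
cats = {
    "POLITICS": 0.0184,
    "TRAVEL": 0.0064,
    "BUSINESS": 0.0349,
    "PARENTING": 0.0434,
    "QUEER VOICES": 0.0657,
    "CULTURE & ARTS": 0.0566,
}


def top_k(scores, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


for label, score in top_k(cats):
    print(f"{label:<15} {score:.4f}")
```

With a loaded pipeline, the same call would be `top_k(doc.cats)`.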

#### 3. Package the model into a Zip file

In [None]:
# zip up the model-best folder

import shutil

model_best_path = "textcat_model/model-best"
# base name of the archive; make_archive appends the ".zip" extension
zipfile_name = "textcat_model/model-best"

shutil.make_archive(zipfile_name, "zip", model_best_path)

__Note:__ To preserve a trained model, rename its output folder before training again, for example from "textcat_model" to "textcat_model_2023-07-17_12-24".
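
The renaming convention in the note can be scripted. This is a minimal sketch; `timestamped_name` and `preserve_model` are hypothetical helpers, and the timestamp format is inferred from the example folder name.

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(base, when):
    """Append a sortable timestamp, e.g. textcat_model_2023-07-17_12-24."""
    return f"{base}_{when.strftime('%Y-%m-%d_%H-%M')}"


def preserve_model(folder="textcat_model", when=None):
    """Rename the folder with a timestamp suffix; return the new path or None."""
    source = Path(folder)
    if not source.is_dir():
        return None  # nothing to preserve
    target = source.with_name(timestamped_name(source.name, when or datetime.now()))
    source.rename(target)
    return target


print(timestamped_name("textcat_model", datetime(2023, 7, 17, 12, 24)))
# prints: textcat_model_2023-07-17_12-24
```

Calling `preserve_model()` after a training run would rename `textcat_model` using the current time, leaving the original name free for the next run.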