### Headline Category Classifier Model Training

This notebook trains a headline category text classifier using spaCy.

__Model Type: spaCy Ensemble__

This model was trained in a Python 3.11.3 environment and requires:
- spacy

Consult requirements.txt for more information.

#### 0. Check GPU Status

In [3]:
# check cuda version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [4]:
# check gpu status
!nvidia-smi

Wed Jul 12 00:53:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   33C    P8     1W / 140W |      0MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
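
The two checks above can also be scripted so the notebook fails early when the CUDA tools are missing. The following is a minimal sketch using only the Python standard library. The helper name `cuda_tools_on_path` is hypothetical, and finding the executables on PATH does not by itself guarantee that spaCy can use the GPU.

```python
import shutil


def cuda_tools_on_path(tools=("nvcc", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in cuda_tools_on_path().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```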

#### 1. Generate config file and modify as necessary to use the correct model

In [5]:
# setup the config file
!python -m spacy init config --pipeline textcat config.cfg --gpu

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Modify the textcat model section of config.cfg and its linear_model and tok2vec sub-sections so that the pipeline uses the ensemble architecture:

```
[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2
```
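
Before training, it can help to confirm that the edited sections are actually present in config.cfg. The sketch below uses Python's standard `configparser` with interpolation disabled as a rough structural check. This is a shortcut under stated assumptions, not how spaCy itself loads configs: the `missing_sections` helper is hypothetical, and the section names assume the full `components.textcat` prefix that a generated textcat config uses.

```python
import configparser


def missing_sections(config_text, required):
    """Return the required section names that are absent from config_text."""
    # Interpolation is disabled so "${...}" references are kept as plain text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return [name for name in required if not parser.has_section(name)]


# Small excerpt for illustration; in practice read config.cfg from disk.
sample = """
[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
"""

required = [
    "components.textcat.model.linear_model",
    "components.textcat.model.tok2vec.embed",
    "components.textcat.model.tok2vec.encode",
]

# The excerpt omits the linear_model section, so it is reported as missing.
print(missing_sections(sample, required))
```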



#### 2. Train the model and evaluate model performance

In [7]:
# train the model
!python -m spacy train config.cfg --paths.train ../data/train.spacy --paths.dev ../data/dev.spacy --output textcat_model --gpu-id 0

ℹ Saving to output directory: textcat_model
ℹ Using GPU: 0

[2023-07-12 01:06:59,290] [INFO] Set up nlp object from config
[2023-07-12 01:06:59,297] [INFO] Pipeline: ['textcat']
[2023-07-12 01:06:59,298] [INFO] Created vocabulary
[2023-07-12 01:06:59,298] [INFO] Finished initializing nlp object
[2023-07-12 01:07:16,289] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.02        1.16    0.01
  0     200          4.45        2.75    0.03
  0     400          4.16        5.16    0.05
  0     600          4.09        5.85    0.06
  0     800          4.01        8.26    0.08
  0    1000          3.88       12.56    0.13
  0    1200          3.69       16.04    0.16
  0    1400          3.59       17.25    0.17
  0    1600      

In [8]:
# evaluate the model
!python -m spacy evaluate ./textcat_model/model-best/ --output ./textcat_model/metrics.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   43.81 
SPEED               6573  

                     P       R       F
POLITICS         71.58   80.32   75.70
WELLNESS         51.58   68.47   58.84
ENTERTAINMENT    64.25   68.57   66.34
TRAVEL           65.56   70.95   68.15
HEALTHY LIVING   29.41   22.20   25.30
BUSINESS         43.91   37.65   40.54
WEIRD NEWS       29.89   32.50   31.14
SPORTS           57.99   65.25   61.41
PARENTING        52.12   52.37   52.25
STYLE & BEAUTY   73.62   76.35   74.96
GREEN            34.42   38.46   36.33
FOOD & DRINK     67.30   62.52   64.83
QUEER VOICES     66.56   65.72   66.14
THE WORLDPOST    45.36   40.12   42.58
HOME & LIVING    72.55   66.42   69.35
WEDDINGS         76.65   72.85   74.70
PARENTS          33.66   34.66   34.15
ARTS & CULTURE   34.58   26.62   30.08
CRIME            49.41   46.93   48.14
CULTURE & ARTS   49.09   31.40   38.30
ENVIRONMENT      36.36   21.05   26.67
COMEDY           50.
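
Each F value in the table is the harmonic mean of that row's precision and recall, and the macro F reported at the top is the unweighted mean of the per-class F values. The sketch below recomputes F for three rows copied from the output above. The function names are illustrative, and because only a subset of rows is used, the macro average printed here will not match the reported 43.81.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(rows):
    """Unweighted mean of per-class F scores; rows maps label -> (P, R)."""
    return sum(f_score(p, r) for p, r in rows.values()) / len(rows)


# Precision and recall copied from three rows of the evaluation output.
rows = {
    "POLITICS": (71.58, 80.32),
    "WELLNESS": (51.58, 68.47),
    "HEALTHY LIVING": (29.41, 22.20),
}

for label, (p, r) in rows.items():
    print(f"{label:15s} F = {f_score(p, r):.2f}")
print(f"macro F over this subset = {macro_f(rows):.2f}")
```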

In [9]:
# check results
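import spacy

# load the best checkpoint written by the training run
nlp = spacy.load("textcat_model/model-best")

# doc.cats maps each category label to its predicted score
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)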

{'POLITICS': 0.07092093676328659, 'WELLNESS': 0.0028532857540994883, 'ENTERTAINMENT': 0.005385665223002434, 'TRAVEL': 0.045635074377059937, 'HEALTHY LIVING': 0.0006666149711236358, 'BUSINESS': 0.005303025245666504, 'WEIRD NEWS': 0.0005795331089757383, 'SPORTS': 0.0005500675761140883, 'PARENTING': 0.004485529847443104, 'STYLE & BEAUTY': 0.0014961593551561236, 'GREEN': 0.011978158727288246, 'FOOD & DRINK': 0.0002930395130533725, 'QUEER VOICES': 0.016649233177304268, 'THE WORLDPOST': 0.012198668904602528, 'HOME & LIVING': 0.0004385566571727395, 'WEDDINGS': 0.0005049843457527459, 'PARENTS': 0.0014251680113375187, 'ARTS & CULTURE': 0.0020301323384046555, 'CRIME': 0.0017437454080209136, 'CULTURE & ARTS': 0.012227863073348999, 'ENVIRONMENT': 0.005559844896197319, 'COMEDY': 0.0013016789453104138, 'RELIGION': 0.020096885040402412, 'MONEY': 0.0003741745313163847, 'BLACK VOICES': 0.009691660292446613, 'COLLEGE': 0.0013013698626309633, 'DIVORCE': 0.0010543664684519172, 'U.S. NEWS': 0.0001820547040

In [10]:
max(doc.cats, key=doc.cats.get)

'WORLDPOST'

In [11]:
doc.cats["PARENTING"]

0.004485529847443104
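
`doc.cats` is a plain dictionary mapping each label to a score, so ranking the predictions needs no spaCy-specific API. The following is a minimal sketch that uses four scores copied (and rounded) from the truncated output above rather than a live model. `top_k` is a hypothetical helper.

```python
def top_k(cats, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# Subset of the scores printed above; the full dictionary has more labels.
cats = {
    "POLITICS": 0.0709,
    "TRAVEL": 0.0456,
    "RELIGION": 0.0201,
    "PARENTING": 0.0045,
}

for label, score in top_k(cats, k=2):
    print(f"{label}: {score:.4f}")
```

With a loaded pipeline, the same call would be `top_k(doc.cats)`.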

#### 3. Package the model into a Zip file

In [None]:
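# zip up the model-best directory

import shutil

# directory whose contents go into the archive
model_best_path = "textcat_model/model-best"
# archive base name; make_archive appends ".zip" to it
zipfile_name = "textcat_model/model-best"

shutil.make_archive(zipfile_name, "zip", model_best_path)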

__Note:__ To preserve a trained model, rename its output folder before training again. For example, rename "textcat_model" to "textcat_model_2023-07-17_12-24".
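
The renaming convention in the note can be automated. The following is a minimal sketch, assuming the timestamp format shown in the example (`YYYY-MM-DD_HH-MM`). `timestamped_name` and `archive_model_dir` are hypothetical helpers, not part of spaCy.

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(base, when=None):
    """Append a YYYY-MM-DD_HH-MM timestamp to a folder name."""
    when = when or datetime.now()
    return f"{base}_{when.strftime('%Y-%m-%d_%H-%M')}"


def archive_model_dir(base="textcat_model"):
    """Rename the output folder so a later training run does not overwrite it."""
    source = Path(base)
    target = source.with_name(timestamped_name(source.name))
    source.rename(target)
    return target


# Reproduces the folder name used in the note for a fixed date and time.
print(timestamped_name("textcat_model", datetime(2023, 7, 17, 12, 24)))
# textcat_model_2023-07-17_12-24
```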