# **Text Classification with spaCy using a news headline classification dataset**

* https://catherinebreslin.medium.com/text-classification-with-spacy-3-0-d945e2e8fc44

* https://www.kaggle.com/datasets/rmisra/news-category-dataset

The **News Category Dataset** is used to identify the type of news based on headlines and short descriptions. The two fields we’re interested in are `category` and `headline`. Our task is to determine the category from the text of the headline. Note that each headline in this set has a single category, whereas some classification tasks allow multiple categories per example.

The headlines in this set fall into 42 categories, and the counts below show that the data is heavily imbalanced. For that reason we’ll only use data from the top 4 categories: POLITICS, WELLNESS, ENTERTAINMENT and TRAVEL. Together these account for 80,809 examples.


## Process the data


In [36]:
with open("resources/News_Category_Dataset_v3.json", encoding="utf-8") as f:
    lines = f.readlines()

In [37]:
import json

categories_count = {}

# Iterate over each line of the file (one JSON record per line)
for dic in lines:
    category = json.loads(dic)['category']

    # Update the count for this category
    if category in categories_count:
        categories_count[category] += 1
    else:
        categories_count[category] = 1

sorted(categories_count.items(), key=lambda x: x[1], reverse=True)



[('POLITICS', 35602),
 ('WELLNESS', 17945),
 ('ENTERTAINMENT', 17362),
 ('TRAVEL', 9900),
 ('STYLE & BEAUTY', 9814),
 ('PARENTING', 8791),
 ('HEALTHY LIVING', 6694),
 ('QUEER VOICES', 6347),
 ('FOOD & DRINK', 6340),
 ('BUSINESS', 5992),
 ('COMEDY', 5400),
 ('SPORTS', 5077),
 ('BLACK VOICES', 4583),
 ('HOME & LIVING', 4320),
 ('PARENTS', 3955),
 ('THE WORLDPOST', 3664),
 ('WEDDINGS', 3653),
 ('WOMEN', 3572),
 ('CRIME', 3562),
 ('IMPACT', 3484),
 ('DIVORCE', 3426),
 ('WORLD NEWS', 3299),
 ('MEDIA', 2944),
 ('WEIRD NEWS', 2777),
 ('GREEN', 2622),
 ('WORLDPOST', 2579),
 ('RELIGION', 2577),
 ('STYLE', 2254),
 ('SCIENCE', 2206),
 ('TECH', 2104),
 ('TASTE', 2096),
 ('MONEY', 1756),
 ('ARTS', 1509),
 ('ENVIRONMENT', 1444),
 ('FIFTY', 1401),
 ('GOOD NEWS', 1398),
 ('U.S. NEWS', 1377),
 ('ARTS & CULTURE', 1339),
 ('COLLEGE', 1144),
 ('LATINO VOICES', 1130),
 ('CULTURE & ARTS', 1074),
 ('EDUCATION', 1014)]
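
The same tally can be written more compactly with `collections.Counter`. The sketch below is an illustration rather than the notebook's own code: `sample_lines` is a made-up stand-in for `lines`, and the `top_counts` figures are copied from the output above to show how many examples the four largest categories contribute.

```python
import json
from collections import Counter

# Hypothetical stand-in for `lines`: one JSON record per line, as in the dataset file.
sample_lines = [
    '{"category": "POLITICS", "headline": "a"}',
    '{"category": "TRAVEL", "headline": "b"}',
    '{"category": "POLITICS", "headline": "c"}',
]

# Counter handles the "increment or initialise" bookkeeping.
sample_counts = Counter(json.loads(line)["category"] for line in sample_lines)
print(sample_counts.most_common())  # [('POLITICS', 2), ('TRAVEL', 1)]

# Counts of the four largest categories, copied from the output above.
top_counts = {
    "POLITICS": 35602,
    "WELLNESS": 17945,
    "ENTERTAINMENT": 17362,
    "TRAVEL": 9900,
}
total = sum(top_counts.values())
print(total)  # 80809

# Share of each category within the four-class subset.
shares = {label: round(n / total, 3) for label, n in top_counts.items()}
print(shares)
```

Even within the top four, POLITICS has well over three times as many headlines as TRAVEL, which is one reason a stratified train/test split is a sensible choice here.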

In [39]:
from sklearn.model_selection import train_test_split
from spacy.tokens import DocBin
import spacy
import json

categories = ["POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL"]

# Parse each JSON line once and keep only the records in the selected categories
records = [json.loads(line) for line in lines]
records = [rec for rec in records if rec["category"] in categories]
headline = [rec["headline"] for rec in records]
category = [rec["category"] for rec in records]

X_train, X_test, y_train, y_test = train_test_split(
    headline, category, test_size=0.2, stratify=category, random_state=42
)


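def convert(text_list: list, label_list: list, outfile: str):
    """Serialize texts and their gold labels to a spaCy DocBin file."""
    nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
    db = DocBin()
    for text, label in zip(text_list, label_list):
        doc = nlp.make_doc(text)  # tokenize without running any components
        # One score per category: 1 for the gold label, 0 for the rest
        doc.cats = {cat: 0 for cat in categories}
        doc.cats[label] = 1
        db.add(doc)
    db.to_disk(outfile)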


convert(X_train, y_train, "resources/4_news_train.spacy")
convert(X_test, y_test, "resources/4_news_dev.spacy")
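
To make the label format concrete: each `Doc` stored by `convert` carries a `cats` dictionary with one entry per category, set to 1 for the true label and 0 elsewhere. The helper below is a hypothetical, spaCy-free sketch of just that step (`make_cats` and `CATEGORIES` are illustrative names, not part of spaCy or the original notebook). It also rejects labels outside the category list, which plain dictionary assignment would silently add as an extra key.

```python
CATEGORIES = ["POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL"]


def make_cats(label: str, categories: list = CATEGORIES) -> dict:
    """Return a one-hot {category: 0 or 1} dict for a single gold label."""
    if label not in categories:
        raise ValueError(f"Unknown label: {label!r}")
    return {cat: int(cat == label) for cat in categories}


print(make_cats("TRAVEL"))
# {'POLITICS': 0, 'WELLNESS': 0, 'ENTERTAINMENT': 0, 'TRAVEL': 1}
```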


## Model training

In [41]:
!python -m spacy init config --pipeline textcat_multilabel 4news_config.cfg


⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat_multilabel
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
4news_config.cfg
You can now add your data and train your pipeline:
python -m spacy train 4news_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [45]:
! python -m spacy train 4news_config.cfg --paths.train resources/4_news_train.spacy --paths.dev resources/4_news_dev.spacy --output resources/4_news_model --verbose

✔ Created output directory: resources\4_news_model
ℹ Saving to output directory: resources\4_news_model
ℹ Using CPU

✔ Initialized pipeline

ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  -------------  ----------  ------
  0       0           0.25       66.90    0.67
  0     200          38.24       82.84    0.83
  0     400          28.12       87.87    0.88
  0     600          23.47       90.38    0.90
  0     800          20.95       92.21    0.92
  0    1000          18.55       93.69    0.94
  0    1200          17.09       94.71    0.95
  0    1400          15.14       95.38    0.95
  0    1600          14.44       96.02    0.96
  0    1800          13.59       96.59    0.97
  0    2000          12.61       96.93    0.97
  1    2200          11.69       97.28    0.97
  1    2400          10.33       97.51    0.98
  1  

In [46]:
# Load the best checkpoint saved during training and score a new headline
nlp = spacy.load("resources/4_news_model/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)

{'POLITICS': 0.14570043981075287, 'WELLNESS': 0.09250684827566147, 'ENTERTAINMENT': 0.18099035322666168, 'TRAVEL': 0.42413559556007385}
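
The `doc.cats` dictionary maps each label to a score, so a prediction still has to be read off it. The sketch below reuses the scores printed above. The `threshold` value of 0.5 is an illustrative assumption, not something the notebook sets. Because the pipeline is `textcat_multilabel`, each label is scored independently, so the scores need not sum to 1.

```python
# Scores copied from the output above.
cats = {
    "POLITICS": 0.14570043981075287,
    "WELLNESS": 0.09250684827566147,
    "ENTERTAINMENT": 0.18099035322666168,
    "TRAVEL": 0.42413559556007385,
}

# Single best label: the key with the highest score.
best_label = max(cats, key=cats.get)
print(best_label)  # TRAVEL

# Multilabel scores are independent, so their sum is not constrained to 1.
print(round(sum(cats.values()), 3))  # 0.843

# Alternative reading: keep every label at or above a cut-off (assumed value).
threshold = 0.5
accepted = [label for label, score in cats.items() if score >= threshold]
print(accepted)  # []
```

Here the top-scoring label is the expected one, but no label clears 0.5, so a thresholded reading would return nothing for this headline. For a task with exactly one category per headline, taking the maximum is the more natural choice.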


## Package for reuse

In [49]:
! python -m spacy package resources/4_news_model/model-best resources --name news_4cat --version 0.0

⚠ Generating packages without the 'build' package is deprecated and
will not be supported in the future. To install 'build': pip install build
ℹ Building package artifacts: sdist
✔ Loaded meta.json from file
resources\4_news_model\model-best\meta.json
✔ Generated README.md from meta.json
✔ Successfully created package directory 'en_news_4cat-0.0'
en_news_4cat-0.0
⚠ Creating sdist with 'python -m build' failed. Falling back to
deprecated use of 'python setup.py sdist'
running sdist
running egg_info
creating en_news_4cat.egg-info
writing en_news_4cat.egg-info\PKG-INFO
writing dependency_links to en_news_4cat.egg-info\dependency_links.txt
writing entry points to en_news_4cat.egg-info\entry_points.txt
writing requirements to en_news_4cat.egg-info\requires.txt
writing top-level names to en_news_4cat.egg-info\top_level.txt
writing manifest file 'en_news_4cat.egg-info\SOURCES.txt'
reading manifest file 'en_news_4cat

c:\Users\Manue!_PC\AppData\Local\Programs\Python\Python310\python.exe: No module named build


