In this project, we compare how accurately news article categories can be predicted from two candidate inputs, to see which one is the better feature for classification:<br><br>

1.   Headline - simple set of a few keywords
2.   Short description - meaningful sentence or two
<br>
<br>
**The motivation is to find out which data is worth scraping when we want a news feed restricted to certain categories. Scraping only the headline text is faster than scraping descriptions, but is it the best method?**



We first take a Kaggle dataset in JSON format from https://www.kaggle.com/datasets/rmisra/news-category-dataset.<br><br>We need to clean this dataset, explore the data, and try deep learning algorithms on it to find the best model and feature.<br>Afterwards, we try **two different types of deep learning models: the LSTM, which is a standard choice for language processing, and the convolutional neural network for text (a surprising approach that is effective in some cases)** to classify based on descriptions and headlines, and perform **hyperparameter tuning** on these networks to get the best **accuracy score.**

In [None]:
!pip install -q torchtext

In [None]:
import pandas as pd
import numpy as np
import re
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from torch.nn.utils.rnn import pad_sequence
from collections import Counter

import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


This project took a little longer to train as there was no GPU available.

In [None]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


Let us explore and visualize the data from the JSON dataset. Pandas can load JSON directly, so there is no need to convert the file to CSV first.

In [None]:
df = pd.read_json("drive/MyDrive/NewsData/News_Category_Dataset_v3.json", lines=True)
df


Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
...,...,...,...,...,...,...
209522,https://www.huffingtonpost.com/entry/rim-ceo-t...,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,TECH,Verizon Wireless and AT&T are already promotin...,"Reuters, Reuters",2012-01-28
209523,https://www.huffingtonpost.com/entry/maria-sha...,Maria Sharapova Stunned By Victoria Azarenka I...,SPORTS,"Afterward, Azarenka, more effusive with the pr...",,2012-01-28
209524,https://www.huffingtonpost.com/entry/super-bow...,"Giants Over Patriots, Jets Over Colts Among M...",SPORTS,"Leading up to Super Bowl XLVI, the most talked...",,2012-01-28
209525,https://www.huffingtonpost.com/entry/aldon-smi...,Aldon Smith Arrested: 49ers Linebacker Busted ...,SPORTS,CORRECTION: An earlier version of this story i...,,2012-01-28


Further data cleaning
 - Dropping NaN and NULL values
 - We noticed articles with empty authors - these may be unreliable, so it is best to exclude them
 - Cleaning both the headlines and descriptions
   - Lowercase the text
   - Remove all links (anything starting with http)
   - Remove all punctuation

In [None]:
df = df.dropna()
df

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
...,...,...,...,...,...,...
209522,https://www.huffingtonpost.com/entry/rim-ceo-t...,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,TECH,Verizon Wireless and AT&T are already promotin...,"Reuters, Reuters",2012-01-28
209523,https://www.huffingtonpost.com/entry/maria-sha...,Maria Sharapova Stunned By Victoria Azarenka I...,SPORTS,"Afterward, Azarenka, more effusive with the pr...",,2012-01-28
209524,https://www.huffingtonpost.com/entry/super-bow...,"Giants Over Patriots, Jets Over Colts Among M...",SPORTS,"Leading up to Super Bowl XLVI, the most talked...",,2012-01-28
209525,https://www.huffingtonpost.com/entry/aldon-smi...,Aldon Smith Arrested: 49ers Linebacker Busted ...,SPORTS,CORRECTION: An earlier version of this story i...,,2012-01-28


In [None]:
df = df[df['authors'] != '']
df = df.reset_index()
df

Unnamed: 0,index,link,headline,category,short_description,authors,date
0,0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
...,...,...,...,...,...,...,...
172104,209517,https://www.huffingtonpost.com/entry/games-for...,Good Games -- Is It possible?,TECH,I don't think people who play Zynga games are ...,"Mateo Gutierrez, Contributor\nArtist",2012-01-28
172105,209518,https://www.huffingtonpost.com/entry/google-pl...,Google+ Now Open for Teens With Some Safeguards,TECH,"For the most part, teens' experience on Google...","Larry Magid, Contributor\nTechnology journalist",2012-01-28
172106,209519,https://www.huffingtonpost.com/entry/congress-...,Web Wars,TECH,"These ""Web Wars"" threaten to rage on for some ...","John Giacobbi, Contributor\nTales from the Int...",2012-01-28
172107,209521,https://www.huffingtonpost.com/entry/watch-top...,Watch The Top 9 YouTube Videos Of The Week,TECH,If you're looking to see the most popular YouT...,Catharine Smith,2012-01-28


In [None]:
def clean(text):
    text = text.lower()                  # normalize case
    text = re.sub(r'http\S+', '', text)  # drop links
    text = re.sub(r'[^\w\s]', '', text)  # drop all punctuation
    return text.strip()

df['clean_headline'] = df['headline'].apply(clean)
df['clean_description'] = df['short_description'].apply(clean)
df

Unnamed: 0,index,link,headline,category,short_description,authors,date,clean_headline,clean_description
0,0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23,over 4 million americans roll up sleeves for o...,health experts said it is too early to predict...
1,1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23,american airlines flyer charged banned for lif...,he was subdued by passengers and crew when he ...
2,2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23,23 of the funniest tweets about cats and dogs ...,until you have a dog you dont understand what ...
3,3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23,the funniest tweets from parents this week sep...,accidentally put grownup toothpaste on my todd...
4,4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22,woman who called cops on black birdwatcher los...,amy cooper accused investment firm franklin te...
...,...,...,...,...,...,...,...,...,...
172104,209517,https://www.huffingtonpost.com/entry/games-for...,Good Games -- Is It possible?,TECH,I don't think people who play Zynga games are ...,"Mateo Gutierrez, Contributor\nArtist",2012-01-28,good games is it possible,i dont think people who play zynga games are b...
172105,209518,https://www.huffingtonpost.com/entry/google-pl...,Google+ Now Open for Teens With Some Safeguards,TECH,"For the most part, teens' experience on Google...","Larry Magid, Contributor\nTechnology journalist",2012-01-28,google now open for teens with some safeguards,for the most part teens experience on google w...
172106,209519,https://www.huffingtonpost.com/entry/congress-...,Web Wars,TECH,"These ""Web Wars"" threaten to rage on for some ...","John Giacobbi, Contributor\nTales from the Int...",2012-01-28,web wars,these web wars threaten to rage on for some co...
172107,209521,https://www.huffingtonpost.com/entry/watch-top...,Watch The Top 9 YouTube Videos Of The Week,TECH,If you're looking to see the most popular YouT...,Catharine Smith,2012-01-28,watch the top 9 youtube videos of the week,if youre looking to see the most popular youtu...
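
As a quick check of what `clean` does, the sketch below applies the same function to one headline from the table. Removing a standalone `--` leaves two consecutive spaces behind. The `collapse_spaces` helper is a hypothetical addition (not part of the pipeline above) that shows one way to normalize them, although a whitespace-splitting tokenizer treats both forms the same.

```python
import re

def clean(text):
    # Same steps as the cleaning function above.
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

def collapse_spaces(text):
    # Hypothetical extra step: squeeze runs of whitespace into one space.
    return re.sub(r'\s+', ' ', text)

raw = "Good Games -- Is It possible?"
cleaned = clean(raw)
normalized = collapse_spaces(cleaned)
print(repr(cleaned))
print(repr(normalized))
```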


Let us now look at the data with cleaned headlines and descriptions. We keep only these two columns and the category, since the other fields are not needed.

In [None]:
df = df[['clean_headline', 'clean_description', 'category']]
df

Unnamed: 0,clean_headline,clean_description,category
0,over 4 million americans roll up sleeves for o...,health experts said it is too early to predict...,U.S. NEWS
1,american airlines flyer charged banned for lif...,he was subdued by passengers and crew when he ...,U.S. NEWS
2,23 of the funniest tweets about cats and dogs ...,until you have a dog you dont understand what ...,COMEDY
3,the funniest tweets from parents this week sep...,accidentally put grownup toothpaste on my todd...,PARENTING
4,woman who called cops on black birdwatcher los...,amy cooper accused investment firm franklin te...,U.S. NEWS
...,...,...,...
172104,good games is it possible,i dont think people who play zynga games are b...,TECH
172105,google now open for teens with some safeguards,for the most part teens experience on google w...,TECH
172106,web wars,these web wars threaten to rage on for some co...,TECH
172107,watch the top 9 youtube videos of the week,if youre looking to see the most popular youtu...,TECH


Let us list all the different categories. There are 42 of them!

Note that usually we would reweight by category frequency to balance the classes, but is that intuitive here? After all, we DO know the realistic distribution of categories, since we have data from 2012 to 2022 (as seen in the date column above). Therefore, we want the model to give more attention to categories that are more frequent.

In [None]:
df['category'].unique()

array(['U.S. NEWS', 'COMEDY', 'PARENTING', 'WORLD NEWS', 'CULTURE & ARTS',
       'TECH', 'SPORTS', 'POLITICS', 'ENTERTAINMENT', 'WEIRD NEWS',
       'ENVIRONMENT', 'EDUCATION', 'SCIENCE', 'WELLNESS', 'BUSINESS',
       'CRIME', 'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA', 'QUEER VOICES',
       'HOME & LIVING', 'WOMEN', 'BLACK VOICES', 'TRAVEL', 'MONEY',
       'RELIGION', 'LATINO VOICES', 'IMPACT', 'WEDDINGS', 'COLLEGE',
       'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE',
       'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST',
       'FIFTY', 'ARTS', 'DIVORCE'], dtype=object)
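
To see how uneven the categories are, the frequencies can be inspected with `value_counts`. The sketch below uses a small toy series as a stand-in for `df['category']` (the toy counts are made up for illustration) and also computes two reference accuracies: always predicting the most frequent class, and guessing uniformly at random.

```python
import pandas as pd

# Toy stand-in for df['category']; in the notebook, use the real column.
categories = pd.Series(
    ["POLITICS"] * 5 + ["WELLNESS"] * 3 + ["TECH"] * 2,
    name="category",
)

# Share of articles per category, most frequent first.
shares = categories.value_counts(normalize=True)
print(shares)

# Accuracy of always predicting the most frequent category.
majority_baseline = shares.iloc[0]
# Accuracy of guessing uniformly at random among the classes.
uniform_baseline = 1 / categories.nunique()
print(majority_baseline, uniform_baseline)
```

With imbalanced classes the majority baseline is the stricter of the two, so it is a useful number to keep next to any reported accuracy.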

Since we are dealing with language, we use a tokenizer with an out-of-vocabulary (`<OOV>`) token and convert every headline and description into a padded sequence of integer token ids. The resulting matrices are X_headline and X_description, respectively.<br>These are the feature matrices we train on.

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

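MAX_LEN = 100
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")

# Headlines are encoded with a vocabulary built from headlines only.
tokenizer.fit_on_texts(df['clean_headline'])
X_headline = pad_sequences(tokenizer.texts_to_sequences(df['clean_headline']), maxlen=MAX_LEN)

# fit_on_texts is cumulative: after this call the vocabulary covers both
# headlines and descriptions, so the two inputs use different word indices.
tokenizer.fit_on_texts(df['clean_description'])
X_description = pad_sequences(tokenizer.texts_to_sequences(df['clean_description']), maxlen=MAX_LEN)

# One-hot encode the categories (one column per class).
y_labels = pd.get_dummies(df['category']).values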


In [None]:
X_description

array([[   0,    0,    0, ...,    9,    2,  413],
       [   0,    0,    0, ...,    7, 1286, 1405],
       [   0,    0,    0, ...,   97,   18, 6657],
       ...,
       [   0,    0,    0, ...,   44,    8,  373],
       [   0,    0,    0, ..., 1604,   12,    2],
       [   0,    0,    0, ..., 5836,    1,   80]], dtype=int32)
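
The leading zeros in the output above are padding, and the id 1 stands for out-of-vocabulary words. To make that encoding concrete, here is a minimal pure-Python sketch of a word-index tokenizer with an OOV token and left padding. It is an illustration of the idea under simplified assumptions, not the Keras implementation, and the helper names `build_vocab` and `encode` are made up for this example.

```python
from collections import Counter

PAD, OOV = 0, 1  # id 0 is reserved for padding, id 1 for unknown words

def build_vocab(texts, num_words):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def encode(text, vocab, max_len):
    """Turn a text into a fixed-length list of ids, left-padded with zeros."""
    ids = [vocab.get(word, OOV) for word in text.split()]
    ids = ids[-max_len:]  # keep at most the last max_len tokens
    return [PAD] * (max_len - len(ids)) + ids

headlines = ["web wars", "good games is it possible", "web games", "web"]
vocab = build_vocab(headlines, num_words=4)  # tiny vocabulary: 2 real words
encoded = [encode(h, vocab, max_len=5) for h in headlines]
print(vocab)
print(encoded)
```

Every word outside the small vocabulary collapses to the OOV id, which is what happens to rare words when `num_words` caps the vocabulary.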

Let's define which architectures we want to use and how we tune them.
- CNN - we apply a convolutional neural network to X_description and X_headline to see which input works better and whether a CNN is usable for this task at all. The model has an embedding layer followed by a 1D convolution, a layer type commonly used for time series and other sequential data (here we try it on text), then global max pooling and a dense ReLU layer
  - We tune the filter count and kernel size of the 1D convolution with a grid search
- LSTM - a standard model for text analysis; we try an architecture with an embedding layer, an LSTM layer, and a dense ReLU layer

Both models end with a softmax layer used for classification.
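
To make the CNN part concrete, the NumPy sketch below walks one encoded text through an embedding lookup, a 1D convolution with ReLU, and global max pooling. All sizes and the random weights are illustrative assumptions; a trained model learns the embedding and kernels instead.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 50, 8, 10
filters, kernel_size = 4, 3

token_ids = rng.integers(0, vocab_size, size=seq_len)   # one encoded text
embedding = rng.normal(size=(vocab_size, embed_dim))     # lookup table
kernels = rng.normal(size=(filters, kernel_size, embed_dim))

x = embedding[token_ids]                                 # (seq_len, embed_dim)

# Slide each kernel over every window of kernel_size consecutive tokens.
windows = np.stack(
    [x[i:i + kernel_size] for i in range(seq_len - kernel_size + 1)]
)                                                        # (8, 3, 8)
feature_maps = np.maximum(0, np.einsum("wke,fke->wf", windows, kernels))  # ReLU

# Global max pooling keeps the strongest response of each filter.
pooled = feature_maps.max(axis=0)                        # (filters,)
print(feature_maps.shape, pooled.shape)
```

Each filter acts like a detector for a short phrase pattern of `kernel_size` tokens, and the pooling step records whether that pattern appeared anywhere in the text, which is why the filter count and kernel size are natural parameters to tune.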

In [None]:
from itertools import product
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

def tune_cnn():
    best_model = None
    best_acc = 0
    for filters, kernel_size in product([64, 128], [3, 5]):
        model = Sequential([
            Embedding(input_dim=10000, output_dim=128, input_length=MAX_LEN),
            Conv1D(filters, kernel_size, activation='relu'),
            GlobalMaxPooling1D(),
            Dense(128, activation='relu'),
            Dense(y_labels.shape[1], activation='softmax')
        ])
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        history = model.fit(X_headline, y_labels, epochs=3, batch_size=32, validation_split=0.2, verbose=0)
        acc = history.history['val_accuracy'][-1]
        if acc > best_acc:
            best_acc = acc
            best_model = model
    return best_model, best_acc

best_cnn_model, best_accuracy = tune_cnn()
print("Best CNN accuracy on headline:", best_accuracy)



Best CNN accuracy on headline: 0.42507699131965637


In [None]:
from itertools import product
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

def tune_cnn():
    best_model = None
    best_acc = 0
    for filters, kernel_size in product([64, 128], [3, 5]):
        model = Sequential([
            Embedding(input_dim=10000, output_dim=128, input_length=MAX_LEN),
            Conv1D(filters, kernel_size, activation='relu'),
            GlobalMaxPooling1D(),
            Dense(128, activation='relu'),
            Dense(y_labels.shape[1], activation='softmax')
        ])
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        history = model.fit(X_description, y_labels, epochs=3, batch_size=32, validation_split=0.2, verbose=0)
        acc = history.history['val_accuracy'][-1]
        if acc > best_acc:
            best_acc = acc
            best_model = model
    return best_model, best_acc

best_cnn_model, best_accuracy = tune_cnn()
print("Best CNN accuracy on description:", best_accuracy)



Best CNN accuracy on description: 0.4246121644973755
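
One caveat about `validation_split=0.2` in the runs above: Keras takes the validation samples from the end of the arrays, before any shuffling. This dataset is ordered by date (the table ends with 2012 TECH rows), so the validation set consists of the oldest articles and may have a different category mix than the training set. A shuffled, stratified split is one alternative. The sketch below uses toy arrays as stand-ins for the features and for integer class ids, which could be obtained from the one-hot labels with `y_labels.argmax(axis=1)`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins, ordered so that the last rows all belong to one class.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 5 + [2] * 5)

# What a "last 20%" split would validate on: a single class.
tail_val = y[-4:]

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,       # keep class proportions equal in both parts
    shuffle=True,
    random_state=42,
)
print(np.bincount(tail_val, minlength=3), np.bincount(y_val, minlength=3))
```

The resulting arrays could then be passed to `model.fit` through `validation_data=(X_val, y_val)` instead of `validation_split`.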


In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_lstm_model():
    model = Sequential([
        Embedding(input_dim=10000, output_dim=128, input_length=MAX_LEN),
        LSTM(64),
        Dense(128, activation='relu'),
        Dense(y_labels.shape[1], activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

lstm_model_headline = build_lstm_model()
lstm_model_description = build_lstm_model()

history_headline = lstm_model_headline.fit(
    tf.convert_to_tensor(X_headline, dtype=tf.int32),
    tf.convert_to_tensor(y_labels, dtype=tf.float32),
    epochs=5,
    batch_size=32,
    validation_split=0.2
)

print("LSTM Headline Accuracy History:", history_headline.history['accuracy'])
print("LSTM Headline Validation Accuracy History:", history_headline.history['val_accuracy'])

history_description = lstm_model_description.fit(
    tf.convert_to_tensor(X_description, dtype=tf.int32),
    tf.convert_to_tensor(y_labels, dtype=tf.float32),
    epochs=5,
    batch_size=32,
    validation_split=0.2
)

print("LSTM Description Accuracy History:", history_description.history['accuracy'])
print("LSTM Description Validation Accuracy History:", history_description.history['val_accuracy'])


Epoch 1/5
4303/4303 ━━━━━━━━━━━━━━━━━━━━ 206s 47ms/step - accuracy: 0.3551 - loss: 2.5239 - val_accuracy: 0.3869 - val_loss: 2.2389
Epoch 2/5
4303/4303 ━━━━━━━━━━━━━━━━━━━━ 208s 48ms/step - accuracy: 0.5555 - loss: 1.5755 - val_accuracy: 0.4505 - val_loss: 2.0816
Epoch 3/5
4303/4303 ━━━━━━━━━━━━━━━━━━━━ 209s 49ms/step - accuracy: 0.6125 - loss: 1.3344 - val_accuracy: 0.3915 - val_loss: 2.1804
Epoch 4/5
4303/4303 ━━━━━━━━━━━━━━━━━━━━ 206s 48ms/step - accuracy: 0.6498 - loss: 1.1797 - val_accuracy: 0.4244 - val_loss: 2.2347
Epoch 5/5
4303/4303 ━━━━━━━━━━━━━━━━━━━━ 212s 49ms/step - accuracy: 0.6916 - loss: 1.0231 - val_accuracy: 0.4213 - val_loss: 2.3400
LSTM Headline Accuracy History: [0.4304908812046051, 0.5564795732498169, 0.6016690135002136, 0.6367849111557007, 0.6762657165527344]
LSTM Headline Validation Accuracy His

Let us analyze the results (looking at validation accuracy):
- CNN for description 42.5%
- CNN for headline 42.5%
- LSTM for headline 45% (the best epoch; accuracy fell back to 42.1% by epoch 5, so we should be implementing early stopping here, but this is the rough accuracy it produces)
- LSTM for descriptions 45.3%
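
The early stopping mentioned for the LSTM can be illustrated with the validation accuracies from the LSTM headline log above. Keras provides an `EarlyStopping` callback (with `monitor`, `patience` and `restore_best_weights` arguments) for this. The pure-Python sketch below only mimics a simple patience rule to show which epoch would have been kept; the helper name and the patience value of 2 are assumptions for illustration.

```python
def best_epoch_with_patience(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience rule."""
    best_score, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_scores), best_epoch

# Validation accuracies per epoch from the LSTM headline run.
val_acc = [0.3869, 0.4505, 0.3915, 0.4244, 0.4213]
stop_epoch, best_epoch = best_epoch_with_patience(val_acc, patience=2)
print(stop_epoch, best_epoch)
```

Under this rule training would stop after epoch 4 and keep the weights from epoch 2, the epoch with 45% validation accuracy.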

At first, this appears quite bad! After all, the model is right less than half the time.<br>However, let us analyze this result more closely. The baseline accuracy of a uniformly random model is only 1/42, given that there are 42 classes. This is around 2.4%, so 45% is roughly a 19x improvement.<br>Of course, being wrong more than half the time is still not good, but this may be a case where a classifier works better than its score suggests: if we look closely at the categories, we see that many overlap (for example ARTS, ARTS & CULTURE and CULTURE & ARTS), and since an LSTM captures some of the meaning of words, it likely predicts a reasonable, if not exact, category for many more than 45% of the articles. This is something to look into in future work.<br><br>
Finally, we conclude that descriptions are at best marginally better than headlines (about equal for the CNN, 45.3% versus 45% for the LSTM), so for now we may stick with scraping only headlines and speed up our scraping, which is the problem we set out to look at. For the model, the choice would most likely be the LSTM.
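
As a starting point for that future work, near-duplicate category labels could be merged before training. The mapping below is a hypothetical example built from label names that appear in the category list above. Which labels really describe the same section is an assumption that should be checked against the data.

```python
import pandas as pd

# Hypothetical mapping of near-duplicate labels onto one canonical name.
MERGE = {
    "ARTS": "ARTS & CULTURE",
    "CULTURE & ARTS": "ARTS & CULTURE",
    "THE WORLDPOST": "WORLDPOST",
    "PARENTS": "PARENTING",
    "STYLE": "STYLE & BEAUTY",
}

# Toy stand-in for df['category'].
categories = pd.Series(
    ["ARTS", "CULTURE & ARTS", "ARTS & CULTURE", "PARENTS", "TECH"]
)

# Labels without an entry in MERGE are kept unchanged.
merged = categories.map(lambda c: MERGE.get(c, c))
print(merged.tolist())
```

Fewer, cleaner classes would make the accuracy score a fairer reflection of how often the predicted category is actually sensible.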