# Text classification task

In this module, we will begin with a straightforward text classification task using the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset: the goal is to classify news headlines into one of four categories: World, Sports, Business, and Sci/Tech.

## The Dataset

To load the dataset, we will utilize the **[TensorFlow Datasets](https://www.tensorflow.org/datasets)** API.


In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

# This tutorial trains many models. To use GPU memory sparingly,
# we ask TensorFlow to grow its GPU memory allocation only as needed.
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices)>0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

dataset = tfds.load('ag_news_subset')

We can now access the training and test portions of the dataset by using `dataset['train']` and `dataset['test']` respectively:


In [3]:
ds_train = dataset['train']
ds_test = dataset['test']

print(f"Length of train dataset = {len(ds_train)}")
print(f"Length of test dataset = {len(ds_test)}")

Length of train dataset = 120000
Length of test dataset = 7600


Let's print out the first 5 news headlines from our dataset:


In [4]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

for i,x in zip(range(5),ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
1 (Sports) -> b"Wood's Suspension Upheld (Reuters)" b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.'
2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'
3 (Sci/Tech) -> b"'Halt science decline in schools'" b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'
1 (Sports) -> b'Gerrard leaves practice' b'London, England (Sports Network

## Text vectorization

Now we need to transform text into **numbers** that can be represented as tensors. If we want to represent text at the word level, we need to do two things:

* Use a **tokenizer** to break the text into **tokens**.
* Create a **vocabulary** from those tokens.

Both of these steps can be handled by the **TextVectorization** layer.

### Limiting vocabulary size

In the AG News dataset, the vocabulary is quite large, exceeding 100k words. Generally, we don't need words that appear infrequently in the text: only a few sentences will contain them, and the model won't learn much from them. Therefore, it makes sense to limit the vocabulary to a smaller size by passing the `max_tokens` argument to the vectorizer constructor. Let's create the vectorizer object and then use the `adapt` method to process the text and build a vocabulary:


In [5]:
vocab_size = 50000
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title']+' '+x['description']))

> **Note** that we are using only a subset of the entire dataset to build a vocabulary. This is done to speed up execution time and avoid keeping you waiting. However, this approach carries the risk that some words from the full dataset may not be included in the vocabulary and will be ignored during training. Therefore, using the full vocabulary size and processing the entire dataset during `adapt` could improve the final accuracy, but not by a significant margin.
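
One way to gauge this risk is to measure how many tokens outside the sample would fall outside the vocabulary. The following is a minimal pure-Python sketch on made-up sentences. The helpers `tokenize`, `build_vocab`, and `oov_rate` are hypothetical names, and the whitespace tokenizer is a simplification rather than what `TextVectorization` does internally.

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(texts, max_tokens=None):
    # Keep the most frequent tokens; max_tokens caps the vocabulary size.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {tok for tok, _ in counts.most_common(max_tokens)}

def oov_rate(texts, vocab):
    # Fraction of tokens that are missing from the vocabulary.
    tokens = [tok for t in texts for tok in tokenize(t)]
    return sum(tok not in vocab for tok in tokens) / len(tokens)

# Made-up data: `sample` plays the role of the subset used to build the vocabulary.
sample = ["stocks fall as oil prices rise", "team wins the final game"]
unseen = ["oil prices fall again", "the team signs a new striker"]

vocab = build_vocab(sample)
rate = oov_rate(unseen, vocab)
print(f"Out-of-vocabulary rate: {rate:.2f}")
```

A high rate on held-out text would suggest adapting the vectorizer on a larger portion of the data.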

Now we can access the actual vocabulary:


In [6]:
vocab = vectorizer.get_vocabulary()
vocab_size = len(vocab)
print(vocab[:10])
print(f"Length of vocabulary: {vocab_size}")

['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for']
Length of vocabulary: 5335


Using the vectorizer, we can easily encode any text into a sequence of numbers. Words that are missing from the vocabulary are mapped to index 1 (`[UNK]`):


In [7]:
vectorizer('I love to play with my words')

<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 112, 3695,    3,  304,   11, 1041,    1], dtype=int64)>
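
To make the tokenizer and vocabulary steps concrete, here is a minimal pure-Python sketch of the same idea. It assumes lowercasing, punctuation stripping, and whitespace splitting, and it reserves index 0 for padding and index 1 for unknown words, mirroring the `['', '[UNK]', ...]` prefix printed above. The function names are hypothetical and are not part of the Keras API.

```python
import string
from collections import Counter

def simple_tokenize(text):
    # Lowercase, strip ASCII punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def build_vocabulary(texts, max_tokens=None):
    # Two slots are reserved: '' (padding) and '[UNK]' (unknown word).
    counts = Counter(tok for t in texts for tok in simple_tokenize(t))
    limit = None if max_tokens is None else max_tokens - 2
    return ['', '[UNK]'] + [tok for tok, _ in counts.most_common(limit)]

def encode(text, vocabulary):
    # Map each token to its index; unknown tokens map to 1.
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in simple_tokenize(text)]

texts = ["The cat sat on the mat.", "The cat sat.", "The cat!"]
vocabulary = build_vocabulary(texts, max_tokens=5)
encoded = encode("The dog sat on words", vocabulary)
print(vocabulary)
print(encoded)
```

Because `max_tokens=5` leaves room for only three real words, the rarer words `on` and `mat` are dropped and encoded as unknown, which is the trade-off described under "Limiting vocabulary size".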

## Bag-of-words text representation

Since words carry meaning, sometimes we can infer the meaning of a text simply by analyzing the individual words, without considering their order in the sentence. For instance, when categorizing news articles, words like *weather* and *snow* are likely to suggest *weather forecast*, whereas words like *stocks* and *dollar* might point to *financial news*.

**Bag-of-words** (BoW) vector representation is the simplest and most intuitive traditional vector representation. Each word is assigned to a specific index in the vector, and the corresponding vector element indicates the frequency of that word in a given document.

![Image showing how a bag of words vector representation is represented in memory.](../../../../../translated_images/bag-of-words-example.606fc1738f1d7ba98a9d693e3bcd706c6e83fa7bf8221e6e90d1a206d82f2ea4.en.png) 

> **Note**: You can also think of BoW as the sum of all one-hot-encoded vectors for the individual words in the text.
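
The equivalence in the note can be checked with a few lines of plain Python. This is only an illustrative sketch: the four-word vocabulary and the helper names `one_hot` and `bag_of_words` are assumptions made for the example.

```python
def one_hot(index, size):
    # Vector of zeros with a single 1 at the given position.
    return [1 if i == index else 0 for i in range(size)]

def bag_of_words(token_ids, size):
    # Element-wise sum of the one-hot vectors of all tokens.
    bow = [0] * size
    for idx in token_ids:
        bow = [a + b for a, b in zip(bow, one_hot(idx, size))]
    return bow

vocab = ['dog', 'dogs', 'hot', 'like']  # toy vocabulary (an assumption)
token_ids = [vocab.index(w) for w in ['hot', 'dog', 'hot']]
bow = bag_of_words(token_ids, len(vocab))
print(bow)
```

Each position in the result holds the number of times the corresponding vocabulary word occurs, so word order is lost while frequencies are kept.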

Below is an example of how to create a bag-of-words representation using the Scikit Learn Python library:


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
sc_vectorizer = CountVectorizer()
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
sc_vectorizer.fit_transform(corpus)
sc_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)

We can also use the Keras vectorizer that we defined above, converting each word number into a one-hot encoding and adding all those vectors up:


In [9]:
def to_bow(text):
    return tf.reduce_sum(tf.one_hot(vectorizer(text),vocab_size),axis=0)

to_bow('My dog likes hot dogs on a hot day.').numpy()

array([0., 5., 0., ..., 0., 0., 0.], dtype=float32)

> **Note**: You might notice that the result is different from the previous example. This is because, in the Keras example, the vector length corresponds to the size of the vocabulary, which was built from a sample of the AG News dataset. In contrast, in the Scikit Learn example, we built the vocabulary on the fly from the sample corpus.


## Training the BoW classifier

Now that we've learned how to create the bag-of-words representation for our text, let's train a classifier that utilizes it. First, we need to transform our dataset into a bag-of-words format. This can be done using the `map` function as follows:


In [11]:
batch_size = 128

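# Join title and description with a space so that the last word of the title
# and the first word of the description remain separate tokens
ds_train_bow = ds_train.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)
ds_test_bow = ds_test.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)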

Now let's define a simple classifier neural network that contains one linear layer. The input size is `vocab_size`, and the output size corresponds to the number of classes (4). Because we're solving a classification task, the final activation function is **softmax**:


In [12]:
model = keras.models.Sequential([
    keras.layers.Dense(4,activation='softmax',input_shape=(vocab_size,))
])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train_bow,validation_data=ds_test_bow)



<keras.callbacks.History at 0x20c70a947f0>

Since we have 4 classes, random guessing would give roughly 25% accuracy (assuming the classes are about equally frequent), so achieving an accuracy above 80% is a good result.
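
For intuition about what the single softmax layer computes, here is a minimal pure-Python sketch of its forward pass: one weighted sum per class, followed by normalization into probabilities. The three-word vocabulary and all weights are made up for illustration, and `dense_softmax` is a hypothetical helper, not a Keras function.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(x, weights, biases):
    # weights[k][i] is the weight of vocabulary entry i for class k.
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Toy setup: 3-word vocabulary and 4 classes; all numbers are made up.
x = [2.0, 0.0, 1.0]  # BoW counts for one document
weights = [[0.5, -0.2, 0.1],
           [-0.3, 0.8, 0.0],
           [0.1, 0.1, 0.5],
           [0.0, -0.5, 0.2]]
biases = [0.0, 0.0, 0.0, 0.0]

probs = dense_softmax(x, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

With the 5,335-word vocabulary built earlier, a layer of this shape has 5335 × 4 weights plus 4 biases, that is, 21,344 trainable parameters.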

## Training a classifier as a single network

Because the vectorizer is also a Keras layer, we can define a network that incorporates it and train it end-to-end. This approach eliminates the need to vectorize the dataset using `map`; instead, we can directly feed the original dataset into the network's input.

> **Note**: We would still need to apply mapping to our dataset to transform fields from dictionaries (like `title`, `description`, and `label`) into tuples. However, when loading data from disk, we can structure the dataset correctly from the start.


In [13]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

inp = keras.Input(shape=(1,),dtype=tf.string)
x = vectorizer(inp)
x = tf.reduce_sum(tf.one_hot(x,vocab_size),axis=1)
out = keras.layers.Dense(4,activation='softmax')(x)
model = keras.models.Model(inp,out)
model.summary()

model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 5335)        0         
                                                                 
 tf.math.reduce_sum (TFOpLam  (None, 5335)             0         
 bda)                                                            
                                                                 
 dense_2 (Dense)             (None, 4)                 21344     
                                                                 
Total params: 21,344
Trainable params: 21,344
Non-trainable p

<keras.callbacks.History at 0x20c721521f0>

## Bigrams, trigrams, and n-grams

One drawback of the bag-of-words approach is that some words belong to multi-word expressions. For instance, the term 'hot dog' has a completely different meaning compared to the individual words 'hot' and 'dog' in other contexts. If we always represent 'hot' and 'dog' using the same vectors, it could lead to confusion in our model.

To solve this issue, **n-gram representations** are often employed in document classification methods, where the frequency of each single word, pair of words, or triplet of words becomes a valuable feature for training classifiers. In bigram representations, for example, we include all word pairs in the vocabulary, in addition to the original single words.

Here’s an example of how to create a bigram bag-of-words representation using Scikit Learn:


In [14]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()


Vocabulary:
 {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}


array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

The main disadvantage of the n-gram approach is that the size of the vocabulary grows extremely quickly. In practice, we need to combine the n-gram representation with a dimensionality reduction technique, such as *embeddings*, which we will cover in the next unit.
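
The growth can be observed even on the three-sentence toy corpus used above. The sketch below is plain Python with hypothetical helper names, and it assumes the sentences are already tokenized. It counts the distinct n-grams when the maximum n-gram length is 1, 2, and 3.

```python
def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined with a space.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vocabulary(documents, max_n):
    # Union of all 1..max_n-grams seen in the tokenized documents.
    vocab = set()
    for tokens in documents:
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return vocab

documents = [
    ['i', 'like', 'hot', 'dogs'],
    ['the', 'dog', 'ran', 'fast'],
    ['its', 'hot', 'outside'],
]
sizes = {n: len(ngram_vocabulary(documents, n)) for n in (1, 2, 3)}
print(sizes)
```

The size for `max_n=2` agrees with the 18 entries in the Scikit Learn vocabulary printed above. On a real corpus the number of distinct word combinations is far larger, so the increase is much steeper than in this toy case.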

To apply an n-gram representation to our **AG News** dataset, we need to pass the `ngrams` parameter to the `TextVectorization` constructor (for example, `ngrams=2` produces both single words and bigrams). The size of a bigram vocabulary is **significantly larger**; in our case, it exceeds 1.3 million tokens! Therefore, it makes sense to limit the number of bigram tokens to a reasonable amount with `max_tokens`.

We could use the same code as before to train the classifier, but this approach would be very memory-inefficient. In the next unit, we will train the bigram classifier using embeddings. In the meantime, you can experiment with training a bigram classifier in this notebook and see if you can achieve higher accuracy.


## Automatically calculating BoW Vectors

In the example above, we manually calculated BoW vectors by summing the one-hot encodings of individual words. However, recent versions of TensorFlow can calculate BoW vectors automatically if we set the `output_mode='count'` parameter in the vectorizer constructor. This significantly simplifies defining and training our model:


In [15]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='count'),
    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x20c725217c0>

## Term frequency - inverse document frequency (TF-IDF)

In the BoW representation, every word occurrence carries the same weight, regardless of which word it is. However, frequent words like *a* and *in* are far less significant for classification than specialized terms. In most NLP tasks, some words carry more importance than others.

**TF-IDF** stands for **term frequency - inverse document frequency**. It is a variation of the bag-of-words approach, where instead of a binary 0/1 value or a raw count for each word in a document, a floating-point value is used, which also reflects how common the word is across the corpus.

More formally, the weight $w_{ij}$ of a word $i$ in document $j$ is defined as:
$$
w_{ij} = tf_{ij}\times\log({N\over df_i})
$$
where
* $tf_{ij}$ is the number of times $i$ appears in $j$, i.e., the BoW value we discussed earlier
* $N$ is the total number of documents in the collection
* $df_i$ is the number of documents in the collection that contain the word $i$

The TF-IDF value $w_{ij}$ increases in proportion to how often a word appears in a document but is offset by the number of documents in the corpus that contain the word. This adjustment helps account for the fact that some words are more common than others. For instance, if a word appears in *every* document in the collection, $df_i=N$, and $w_{ij}=0$, meaning those terms would be completely ignored.
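
The formula can be checked by hand with a short pure-Python sketch. The three tokenized documents are made up and `tf_idf` is a hypothetical helper that applies the definition above literally, using the natural logarithm.

```python
import math

def tf_idf(term, document, collection):
    # tf: raw count of the term in this document.
    tf = document.count(term)
    # df: number of documents in the collection that contain the term.
    df = sum(term in doc for doc in collection)
    if df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)

collection = [
    ['the', 'stocks', 'fell', 'stocks'],
    ['the', 'team', 'won'],
    ['the', 'market', 'rose'],
]
doc = collection[0]
w_the = tf_idf('the', doc, collection)        # appears in every document
w_stocks = tf_idf('stocks', doc, collection)  # appears twice, in one document
print(w_the, w_stocks)
```

Here *the* gets weight 0 because it occurs in all three documents, while *stocks* gets 2 × log 3. Library implementations often use smoothed or normalized variants of this formula, so their numbers will generally differ from this literal computation.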

You can easily create a TF-IDF vectorization of text using Scikit Learn:


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Use a separate name so the Keras `vectorizer` defined earlier is not overwritten
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf_vectorizer.fit_transform(corpus)
tfidf_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,
        0.43381609, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])

In Keras, the `TextVectorization` layer can automatically compute TF-IDF frequencies by passing the `output_mode='tf-idf'` parameter. Let's repeat the code we used above to see if using TF-IDF increases accuracy:


In [17]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='tf-idf'),
    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x20c729dfd30>

## Conclusion

Although TF-IDF representations assign frequency weights to various words, they cannot capture meaning or order. As the renowned linguist J. R. Firth stated in 1935, "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously." Later in the course, we will explore how to extract contextual information from text using language modeling.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
