# Multiclass Classification

## Objectives

- Implement and evaluate neural network models for multiclass classification of text data.
- Examine the effect of varying network architecture on the ability to differentiate among 46 different topics.
- Explore techniques such as dropout to mitigate overfitting in dense neural network layers.

## Background

The notebook applies neural network techniques to classify Reuters newswires into 46 distinct topics. This setup demonstrates challenges specific to multiclass classification with many categories, emphasizing proper network architecture and data handling.

## Datasets Used

Reuters Dataset: It comprises short newswires and their corresponding topics, published by Reuters in 1986 and categorized into 46 different topics, with a predefined split for training and testing.

## Reuters dataset

In this notebook, we will solve non-binary classification problems with neural networks. 

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

In [2]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.datasets import reuters

We will build a network to classify Reuters newswires into 46 mutually exclusive topics. 

The dataset consists of short newswires and their topics, published by Reuters in 1986. It is a simple, widely used dataset for text classification. There are 46 different topics; some are more represented than others, but each has at least ten examples in the training set.

Because we have many classes, this problem is an instance of multiclass classification. Because each piece of news is classified into only one category, the problem is an instance of `single-label, multiclass classification`.

In [3]:
max_words = 10000
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2)
print('Train = %i cases \t Test = %i cases' %(len(X_train), len(X_test)))

Train = 8982 cases 	 Test = 2246 cases


The argument `num_words=max_words` means you will only keep the top `max_words` most frequently occurring words in the data. Rarer words are not kept as distinct entries; with the Keras defaults they are replaced by a reserved out-of-vocabulary index. This allows you to work with vector data of manageable size.
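As a rough illustration of what the vocabulary cap does, the sketch below mimics it on a toy sequence. It assumes the Keras defaults for this dataset, where index 2 is reserved as the out-of-vocabulary marker; `cap_vocabulary` and `toy_sequence` are hypothetical names used only for this example, not part of Keras.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [w if w < num_words else oov_char for w in sequence]

# Made-up encoded newswire containing two indices outside the top 10,000 words.
toy_sequence = [1, 45, 12000, 8, 9999, 10000]
capped = cap_vocabulary(toy_sequence, num_words=10000)
print(capped)  # [1, 45, 2, 8, 9999, 2]
```

The sequence keeps its length; only the identity of the rare words is lost. This is consistent with the repeated `2` values visible at the start of some encoded newswires.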

In [4]:
# Some data examples
print('The first 5 elements of case  0 are: ', X_train[0][:5], '\t\ty_label:', y_train[0])
print('The first 5 elements of case 12 are:', X_train[12][:5], '\t\ty_label:', y_train[12])
print('The first 5 elements of case 20 are:', X_train[20][:5], '\ty_label:',   y_train[20])

The first 5 elements of case  0 are:  [1, 2, 2, 8, 43] 		y_label: 3
The first 5 elements of case 12 are: [1, 2, 81, 8, 16] 		y_label: 4
The first 5 elements of case 20 are: [1, 779, 37, 38, 465] 	y_label: 11


You can quickly decode one of these newswires back to English words. Let's do it with the shortest one.

In [5]:
# Finding the smallest sequence 
seq_len = np.array([len(x) for x in X_train])

print('Minimum sequence length:', seq_len.min(), 'at the position', seq_len.argmin()) 
print('Smallest sequence:', X_train[seq_len.argmin()], '\ty_label:', y_train[seq_len.argmin()])

Minimum sequence length: 13 at the position 6519
Smallest sequence: [1, 486, 341, 151, 26, 219, 93, 124, 146, 93, 155, 17, 12] 	y_label: 3


What is this newswire about?

In [6]:
# index is a dictionary mapping words to an integer index.
index = reuters.get_word_index()
# Reverse it, mapping integer indices to words
reverse_index = {value: key for key, value in index.items()}
# Decode the newswire. Indices are offset by 3 because 0, 1, and 2 are
# reserved for "padding", "start of sequence", and "unknown".
print(" ".join([reverse_index.get(i - 3, "#") for i in X_train[seq_len.argmin()]]))

# qtly div nine cts pay april 30 record april six reuter 3
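To make the `i - 3` offset concrete, here is a minimal sketch with a made-up three-word vocabulary. It assumes the Keras defaults, where indices 0, 1, and 2 are reserved and actual words start at index 3; `toy_index` is hypothetical and is not the real Reuters word index.

```python
# Hypothetical word index in the style of reuters.get_word_index():
# word -> rank, starting at 1 for the most frequent word.
toy_index = {"the": 1, "of": 2, "dividend": 3}
toy_reverse = {rank: word for word, rank in toy_index.items()}

# Encoded sequence: 1 = start marker, 2 = unknown word, other values = rank + 3.
encoded = [1, 6, 5, 4, 2]
decoded = " ".join(toy_reverse.get(i - 3, "#") for i in encoded)
print(decoded)  # # dividend of the #
```

The reserved markers have no entry in the reversed dictionary, so they fall back to `#`, which is why the decoded newswire above starts with that character.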


In [7]:
# Get the unique target values and their counts
y = np.concatenate((y_train, y_test), axis=0)
unique_values, counts = np.unique(y, return_counts=True)

In [8]:
# Plot the distribution of the target variable
px.bar(x=unique_values, y=counts,  
       width=800, height=500, title='Class distribution')

There are 46 different classes; as you can see, some are more represented than others.

## Encoding the data

We cannot feed lists of integers into a neural network. We have to prepare the data. 

We will vectorize every newswire as a multi-hot vector of length `max_words`: position `j` is set to 1 if the word with index `j` occurs in the newswire and stays 0 otherwise. Word order and repeated occurrences are lost, but every input ends up with the same size, which is what the `Dense` layers of our neural network require.

In [9]:
print('Number of dimensions: ', X_train.ndim)
print('Dimensions (or shape):', X_train.shape)

Number of dimensions:  1
Dimensions (or shape): (8982,)


In [10]:
print('Length Review 0  =', len(X_train[0]), ' - Ten first elements:', X_train[0][:10])
print('Length Review 12 =', len(X_train[12]), ' - Ten first elements:', X_train[12][:10])
print('Length Review 20 =', len(X_train[20]), '- Ten first elements:', X_train[20][:10])

Length Review 0  = 87  - Ten first elements: [1, 2, 2, 8, 43, 10, 447, 5, 25, 207]
Length Review 12 = 65  - Ten first elements: [1, 2, 81, 8, 16, 625, 42, 120, 7, 1679]
Length Review 20 = 231 - Ten first elements: [1, 779, 37, 38, 465, 278, 6623, 55, 900, 6]


In [11]:
def vectorize(sequences, dimension=10000):
    '''
    Multi-hot encode a list of integer sequences (array of lists).

    Returns a NumPy array of shape (len(sequences), dimension) in which
    entry [i, j] is 1 if word index j occurs in sequence i, and 0 otherwise.
    '''
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # set every index present in the sequence to 1
    return results
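A toy run may help show what this encoding keeps and what it throws away. The sketch below repeats the same logic under the hypothetical name `multi_hot`, with a made-up vocabulary of 8 indices so the vectors can be read at a glance.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Return a (len(sequences), dimension) array with 1.0 at every index that occurs."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# Two made-up sequences with the same words in a different order and frequency.
toy = [[1, 3, 3, 5], [5, 3, 1]]
encoded = multi_hot(toy, dimension=8)
print(encoded[0])                              # [0. 1. 0. 1. 0. 1. 0. 0.]
print(np.array_equal(encoded[0], encoded[1]))  # True
```

Both sequences map to the same vector, so the model only sees which words are present, not how often or in what order.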

In [12]:
X_train_v = vectorize(X_train)
X_test_v  = vectorize(X_test)
print('Number of dimensions: ', X_train_v.ndim)
print('Dimensions (or shape):', X_train_v.shape)

Number of dimensions:  2
Dimensions (or shape): (8982, 10000)


In [13]:
print('Length Vectorized Review 0  =', len(X_train_v[0]), ' - Ten first elements:', X_train_v[0][:10])
print('Length Vectorized Review 12 =', len(X_train_v[12]), ' - Ten first elements:', X_train_v[12][:10])
print('Length Vectorized Review 20 =', len(X_train_v[20]), ' - Ten first elements:', X_train_v[20][:10])

Length Vectorized Review 0  = 10000  - Ten first elements: [0. 1. 1. 0. 1. 1. 1. 1. 1. 1.]
Length Vectorized Review 12 = 10000  - Ten first elements: [0. 1. 1. 0. 1. 1. 1. 1. 1. 1.]
Length Vectorized Review 20 = 10000  - Ten first elements: [0. 1. 1. 0. 1. 1. 1. 1. 1. 1.]


We must also convert the integer labels of the target variable (`y_train` and `y_test`) into one-hot vectors, so that they match the 46-dimensional output of the network.

In [14]:
# Some target values examples
print('Target value of case 0: ', y_train[0])
print('Target value of case 12:', y_train[12])
print('Target value of case 20:', y_train[20])

Target value of case 0:  3
Target value of case 12: 4
Target value of case 20: 11


In [15]:
# Vectorizing the labels with one-hot encoding
y_train_v = to_categorical(y_train)
y_test_v  = to_categorical(y_test)

In [16]:
# Some target values examples
print('Target value of case 0: ',   y_train[0],  '\n', y_train_v[0])
print('\nTarget value of case 12:', y_train[12], '\n', y_train_v[12])
print('\nTarget value of case 20:', y_train[20], '\n', y_train_v[20])

Target value of case 0:  3 
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Target value of case 12: 4 
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Target value of case 20: 11 
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


The target value of case 0 is 3. Notice that the associated vector is all zeros except for a 1 at position 3.
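For intuition, the same one-hot vectors can be built with plain NumPy. This is only a sketch of the idea, not how `to_categorical` is implemented internally; it assumes the labels are integers in the range 0 to 45.

```python
import numpy as np

num_classes = 46
labels = np.array([3, 4, 11])          # the three example labels shown above

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)                   # (3, 46)
print(one_hot.argmax(axis=1))          # [ 3  4 11]
```

Taking `argmax` along the class axis recovers the original integer labels, so no information is lost by the encoding.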

## Model A

This classification problem looks similar to the previous movie-review classification problem: we are trying to classify short text snippets in both cases. Now we have an additional constraint: the number of output classes has gone from 2 to 46. 

Each layer can only access information in the previous layer's output in a stack of `Dense` layers. If one layer drops some information relevant to the problem, it can never be recovered: each layer becomes an information bottleneck. 

In the previous example, we used a 25-dimensional intermediate layer, but it could be too limited to learn to separate 46 different classes. We will use a larger layer. Let's go with 160 units.

Remember, we will end the network with a `Dense` layer of size 46. The network will output a 46-dimensional vector (the total number of output classes) for each input sample. 

The last layer uses a `softmax` activation. The network will output a probability distribution over the 46 classes: for each input sample, `output[i]` is the probability that the sample belongs to class `i`. The 46 scores will sum to 1.
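A minimal NumPy sketch of the softmax function may make this concrete. The logits here are random made-up scores, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something specific to this notebook.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=46)               # made-up raw scores for one sample
probs = softmax(logits)

print(probs.shape)                         # (46,)
print(np.isclose(probs.sum(), 1.0))        # True
print(probs.argmax() == logits.argmax())   # True
```

Softmax preserves the ranking of the raw scores, so the predicted class is simply the index of the largest output.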

In [17]:
# Define the model architecture
modelA = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(160, activation='relu'),  # First dense layer with 160 neurons
    Dense(46, activation='softmax') # Output layer with 46 neurons, suitable for multi-class classification
])

# Display model summary
modelA.summary()
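The parameter counts reported by `summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is hypothetical and only serves this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(10000, 160)   # 10,000 inputs -> 160 units
output = dense_params(160, 46)      # 160 units -> 46 classes
print(hidden, output, hidden + output)  # 1600160 7406 1607566
```

Almost all of the parameters sit in the first layer, because the input vector has 10,000 dimensions.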

The best loss function to use is `categorical_crossentropy`. It measures the distance between two probability distributions: here, between the probability distribution output by the network and the true distribution of the labels. Minimizing the distance between these two distributions trains the network to output something as close as possible to the true labels.
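To get a feel for the scale of this loss, the sketch below computes it by hand with NumPy for a single sample. The function is a simplified stand-in written for this example (the clipping constant `eps` is an assumption), and the predicted distributions are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

num_classes = 46
y_true = np.eye(num_classes)[[3]]                    # one sample whose true class is 3

# A model that guesses uniformly over the 46 classes.
uniform = np.full((1, num_classes), 1 / num_classes)
print(round(categorical_crossentropy(y_true, uniform), 4))    # 3.8286

# A model that puts probability 0.9 on the correct class.
confident = np.full((1, num_classes), 0.1 / 45)
confident[0, 3] = 0.9
print(round(categorical_crossentropy(y_true, confident), 4))  # 0.1054
```

The uniform-guess value, ln(46) ≈ 3.83, is a handy reference point when reading training logs: a loss near that level suggests the network has not yet learned anything useful.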

In [18]:
# Compiling the model
modelA.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['accuracy'])

In [19]:
# Train the model
batch_size = 512
epochs = 10
historyA = modelA.fit(X_train_v, y_train_v,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test_v, y_test_v));

Epoch 1/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 4s 108ms/step - accuracy: 0.4364 - loss: 3.1207 - val_accuracy: 0.6674 - val_loss: 1.7688
Epoch 2/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 83ms/step - accuracy: 0.7062 - loss: 1.4669 - val_accuracy: 0.7409 - val_loss: 1.2621
Epoch 3/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 93ms/step - accuracy: 0.8050 - loss: 0.9464 - val_accuracy: 0.7680 - val_loss: 1.0606
Epoch 4/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 55ms/step - accuracy: 0.8704 - loss: 0.6533 - val_accuracy: 0.7907 - val_loss: 0.9554
Epoch 5/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 58ms/step - accuracy: 0.9069 - loss: 0.4804 - val_accuracy: 0.8005 - val_loss: 0.8981
Epoch 6/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 56ms/step - accuracy: 0.9255 - loss: 0.3782 - val_accuracy: 0.8023 - val_loss: 0.8602
Epoch 7/10
18/18 ━━━

Let's use `plot_history` for plotting the results.

In [20]:
def plot_history(history):
    '''
    Plotting the results of the neural network training process
    '''
    hist = history.history
    d = pd.DataFrame({'epochs': [epoch + 1 for epoch in history.epoch],
                      'accuracy': hist['accuracy'],
                      'val_accuracy': hist['val_accuracy'],
                      'loss': hist['loss'],
                      'val_loss': hist['val_loss']})
    
    fig = px.line(d, x='epochs', y=['loss', 'val_loss', 'accuracy', 'val_accuracy'],
                  color_discrete_sequence=['orange', 'peru', 'yellowgreen', 'darkolivegreen'],
                  labels={'epochs': 'Epochs', 'value': 'Loss/Accuracy', 'variable': 'Legend'},
                  title='Neural Network Training History', width=800, height=500)
    
    fig.update_traces(mode='lines+markers')
    
    return fig.show()

In [21]:
plot_history(historyA)

In [22]:
# Evaluate the model on train data
tr_lossA, tr_accA = modelA.evaluate(X_train_v, y_train_v, batch_size=batch_size, verbose=0)
print('Train loss     = %.4f' % tr_lossA)
print('Train accuracy = %.4f' % tr_accA)

Train loss     = 0.1469
Train accuracy = 0.9618


In [23]:
# Evaluate the model on test data
ts_lossA, ts_accA = modelA.evaluate(X_test_v, y_test_v, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % ts_lossA)
print('Test accuracy = %.4f' % ts_accA)

Test loss     = 0.8664
Test accuracy = 0.8063


A model is overfitted when it fits the training data very well (high accuracy and low loss) but performs noticeably worse on data it has not seen. That appears to be the case here: Model A reaches 0.9618 accuracy on the training set but only 0.8063 on the test set. Let's add a `Dropout` layer to try to improve the prediction quality on unseen data.
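As a sketch of what a dropout layer does during training, the NumPy code below zeroes a random fraction `rate` of the activations and rescales the survivors by `1 / (1 - rate)`, which is the behaviour documented for the Keras `Dropout` layer. The function name `dropout_train` and the all-ones activations are made up for illustration; at inference time dropout is simply switched off.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
activations = np.ones((1000, 160))             # made-up hidden-layer outputs
dropped = dropout_train(activations, rate=0.8, rng=rng)

print(round(float((dropped == 0).mean()), 2))  # fraction zeroed, close to 0.8
print(round(float(dropped.mean()), 2))         # mean stays close to 1.0
```

With a rate of 0.8 only about one unit in five is active on each training step, which is a fairly aggressive setting for a single 160-unit layer.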

## Model B

In [24]:
# Defining the model architecture
modelB = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(160, activation='relu'),  # First dense layer with 160 neurons
    Dropout(0.8),                   # Dropout layer with a rate of 0.8
    Dense(46, activation='softmax') # Output layer with 46 neurons, suitable for multi-class classification
])

# Display model summary
modelB.summary()

In [25]:
# Compiling the model
modelB.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['accuracy'])

In [26]:
# Train the model
historyB = modelB.fit(X_train_v, y_train_v,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test_v, y_test_v));

Epoch 1/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 63ms/step - accuracy: 0.2709 - loss: 3.3695 - val_accuracy: 0.5993 - val_loss: 1.9832
Epoch 2/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - accuracy: 0.5745 - loss: 2.0160 - val_accuracy: 0.6701 - val_loss: 1.4825
Epoch 3/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 48ms/step - accuracy: 0.6664 - loss: 1.5217 - val_accuracy: 0.7004 - val_loss: 1.2947
Epoch 4/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 56ms/step - accuracy: 0.7146 - loss: 1.2563 - val_accuracy: 0.7329 - val_loss: 1.1823
Epoch 5/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.7437 - loss: 1.1232 - val_accuracy: 0.7542 - val_loss: 1.1038
Epoch 6/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 50ms/step - accuracy: 0.7695 - loss: 1.0352 - val_accuracy: 0.7689 - val_loss: 1.0432
Epoch 7/10
18/18 ━━━━

In [27]:
plot_history(historyB)

In [28]:
# Evaluate the model on train data
tr_lossB, tr_accB = modelB.evaluate(X_train_v, y_train_v, batch_size=batch_size, verbose=0)
print('Train loss     = %.4f' % tr_lossB)
print('Train accuracy = %.4f' % tr_accB)

Train loss     = 0.4620
Train accuracy = 0.9018


In [29]:
# Evaluate the model on test data
ts_lossB, ts_accB = modelB.evaluate(X_test_v, y_test_v, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % ts_lossB)
print('Test accuracy = %.4f' % ts_accB)

Test loss     = 0.9052
Test accuracy = 0.7912


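Model B overfits less than Model A: the gap between train and test accuracy shrinks from about 0.16 to about 0.11. However, its test accuracy (0.7912) is slightly lower than Model A's (0.8063), so dropout has not improved predictions on unseen data here. With a dropout rate as high as 0.8 and only 10 epochs, the model may simply be under-trained; a lower rate or more epochs might help.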
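The comparison can be made explicit with a few lines of arithmetic on the accuracies printed by the evaluation cells above. The dictionary layout is just a convenience for this sketch.

```python
# Accuracies copied from the evaluation cells above.
results = {
    "Model A": {"train": 0.9618, "test": 0.8063},
    "Model B": {"train": 0.9018, "test": 0.7912},
}

gaps = {name: acc["train"] - acc["test"] for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: test accuracy = {results[name]['test']:.4f}, train-test gap = {gap:.4f}")
# Model A: test accuracy = 0.8063, train-test gap = 0.1555
# Model B: test accuracy = 0.7912, train-test gap = 0.1106
```

A smaller gap indicates less overfitting, but the number that matters for choosing a model is the accuracy on unseen data.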

## Model C

Because the final output is 46-dimensional, we should avoid intermediate layers with far fewer than 46 hidden units.

Let’s see what happens when you introduce an information bottleneck by having intermediate layers that are significantly less than 46-dimensional: for example, 4-dimensional.

In [30]:
# Defining the model architecture
modelC = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(160, activation='relu'),  # First dense layer with 160 neurons    
    Dense(4, activation='relu'),    # Second dense layer with 4 neurons
    Dense(46, activation='softmax') # Output layer with 46 neurons, suitable for multi-class classification
])

# Display model summary
modelC.summary()

In [31]:
# Compiling the model
modelC.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['accuracy'])

In [32]:
# Train the model
historyC = modelC.fit(X_train_v, y_train_v,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test_v, y_test_v));

Epoch 1/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 61ms/step - accuracy: 0.0240 - loss: 3.7851 - val_accuracy: 0.0401 - val_loss: 3.6549
Epoch 2/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 0.0553 - loss: 3.5834 - val_accuracy: 0.0623 - val_loss: 3.5047
Epoch 3/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.0728 - loss: 3.3708 - val_accuracy: 0.0770 - val_loss: 3.3176
Epoch 4/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 50ms/step - accuracy: 0.0841 - loss: 3.1194 - val_accuracy: 0.0779 - val_loss: 3.0772
Epoch 5/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step - accuracy: 0.1189 - loss: 2.7886 - val_accuracy: 0.2551 - val_loss: 2.7384
Epoch 6/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 44ms/step - accuracy: 0.3175 - loss: 2.3292 - val_accuracy: 0.2752 - val_loss: 2.3607
Epoch 7/10
18/18 ━━━━

In [33]:
plot_history(historyC)

In [34]:
# Evaluate the model on train data
tr_lossC, tr_accC = modelC.evaluate(X_train_v, y_train_v, batch_size=batch_size, verbose=0)
print('Train loss     = %.4f' % tr_lossC)
print('Train accuracy = %.4f' % tr_accC)

Train loss     = 1.0737
Train accuracy = 0.6813


In [35]:
# Evaluate the model on test data
ts_lossC, ts_accC = modelC.evaluate(X_test_v, y_test_v, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % ts_lossC)
print('Test accuracy = %.4f' % ts_accC)

Test loss     = 1.6793
Test accuracy = 0.6051


Model C is not a valid option: both its train and test accuracy are far below those of Models A and B.

In [36]:
# Plotting the test accuracy of models A, B, and C
px.bar(x=['Model A','Model B','Model C'], y=[ts_accA, ts_accB, ts_accC],
       labels={'x': 'Model', 'y': 'Accuracy'},
       width=800, height=500, title='Test Accuracy')

The test accuracy has dropped significantly, from about 80% for Models A and B to about 60% for Model C.

The drop is due mainly to the fact that we are trying to compress a lot of information (enough information to recover the separation hyperplanes of 46 classes) into an intermediate space that is too low-dimensional.  

**Conclusion**: If you need to classify data into a large number of categories, you should avoid creating information bottlenecks in your network due to intermediate layers that are too small.
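One way to see the bottleneck is through linear algebra: whatever the 4-unit layer outputs, the 46 pre-softmax scores are a linear function of only 4 numbers (plus a bias). The NumPy sketch below uses random made-up activations and weights, not the trained Model C, to show that the score matrix has rank at most 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, bottleneck, n_classes = 200, 4, 46

# Made-up ReLU outputs of a 4-unit layer and made-up output-layer weights.
hidden = np.maximum(rng.normal(size=(n_samples, bottleneck)), 0.0)
weights = rng.normal(size=(bottleneck, n_classes))
bias = rng.normal(size=n_classes)

scores_without_bias = hidden @ weights         # shape (200, 46)
logits = scores_without_bias + bias            # pre-softmax scores

rank = np.linalg.matrix_rank(scores_without_bias)
print(logits.shape, rank)                      # rank is at most 4
```

All 200 score vectors lie in a 4-dimensional affine subspace of the 46-dimensional output space, which limits how well 46 classes can be told apart.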

## A different way to handle labels

In [37]:
print('Initial target value for case 0:    ', y_train[0])
print('Transformed target value for case 0:', y_train_v[0])

Initial target value for case 0:     3
Transformed target value for case 0: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


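Another way to encode the labels is to keep them as an integer tensor, that is, without any transformation. The only thing that changes is the loss function, which becomes `sparse_categorical_crossentropy`. Let's do it!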

In [38]:
# Building the model
modelD = Sequential([
    Input(shape=(max_words,)),      # Explicitly define the input shape
    Dense(160, activation='relu'),  # First dense layer with 160 neurons
    Dropout(0.8),                   # Dropout layer with a rate of 0.8
    Dense(46, activation='softmax') # Output layer with 46 neurons, suitable for multi-class classification
])

# Display model summary
modelD.summary()

In [39]:
# Compiling the model
modelD.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])

In [40]:
# Train the model
historyD = modelD.fit(X_train_v, y_train,
                epochs=epochs,
                batch_size=batch_size,
                validation_data=(X_test_v, y_test));

Epoch 1/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 3s 110ms/step - accuracy: 0.2653 - loss: 3.3792 - val_accuracy: 0.5899 - val_loss: 2.0197
Epoch 2/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 52ms/step - accuracy: 0.5863 - loss: 2.0138 - val_accuracy: 0.6692 - val_loss: 1.4834
Epoch 3/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 55ms/step - accuracy: 0.6702 - loss: 1.4908 - val_accuracy: 0.7030 - val_loss: 1.2953
Epoch 4/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 63ms/step - accuracy: 0.7092 - loss: 1.2779 - val_accuracy: 0.7306 - val_loss: 1.1897
Epoch 5/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.7357 - loss: 1.1336 - val_accuracy: 0.7498 - val_loss: 1.1145
Epoch 6/10
18/18 ━━━━━━━━━━━━━━━━━━━━ 1s 62ms/step - accuracy: 0.7637 - loss: 1.0230 - val_accuracy: 0.7618 - val_loss: 1.0581
Epoch 7/10
18/18 ━━━

In [41]:
plot_history(historyD)

In [42]:
# Evaluate the model on train data
tr_lossD, tr_accD = modelD.evaluate(X_train_v, y_train, batch_size=batch_size, verbose=0)
print('Train loss     = %.4f' % tr_lossD)
print('Train accuracy = %.4f' % tr_accD)

Train loss     = 0.4784
Train accuracy = 0.8984


In [43]:
# Evaluate the model on test data
ts_lossD, ts_accD = modelD.evaluate(X_test_v, y_test, batch_size=batch_size, verbose=0)
print('Test loss     = %.4f' % ts_lossD)
print('Test accuracy = %.4f' % ts_accD)

Test loss     = 0.9216
Test accuracy = 0.7890


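Model D is a good option too: its results (test accuracy 0.7890) are essentially the same as Model B's (0.7912). This is expected, since the two loss functions compute the same quantity and differ only in the label format they accept.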
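The equivalence of the two losses can be checked numerically. The sketch below uses made-up predicted probabilities and hand-written NumPy versions of both losses (not the Keras implementations) to show that they give the same value.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes = 46
labels = np.array([3, 4, 11])                       # integer labels

# Made-up predicted probabilities: strictly positive, each row sums to 1.
raw = rng.random((len(labels), num_classes)) + 0.1
probs = raw / raw.sum(axis=1, keepdims=True)

# Categorical crossentropy with one-hot labels.
one_hot = np.eye(num_classes)[labels]
loss_categorical = -np.sum(one_hot * np.log(probs), axis=1).mean()

# Sparse categorical crossentropy: index the true-class probability directly.
loss_sparse = -np.log(probs[np.arange(len(labels)), labels]).mean()

print(np.isclose(loss_categorical, loss_sparse))    # True
```

The small differences between Models B and D therefore come from random initialization and dropout masks, not from the loss function.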

Key points:

- If you are trying to classify data points among `n` classes, your network should end with a Dense layer of size `n`.

- In a single-label, multiclass classification problem, your network should end with a `softmax` activation so that it will output a probability distribution over the `n` output classes.

- Categorical crossentropy is almost always the loss function you should use for such problems. 

- There are two ways to handle labels in multiclass classification:
    - Encoding the labels via categorical encoding (also known as `one-hot encoding`) and using `categorical_crossentropy` as a loss function
    - Encoding the labels as integers and using the `sparse_categorical_crossentropy` loss function
    
- If you need to classify data into a large number of categories, you should avoid creating information bottlenecks in your network due to intermediate layers that are too small.

## Conclusions

Key Takeaways:
- Neural networks can handle multiclass text classification by transforming and vectorizing text data into a format suitable for model training.
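- Sufficiently wide hidden layers matter when there are many classes: a hidden layer much narrower than the number of classes (4 units for 46 classes) acts as an information bottleneck, and here it cut test accuracy from about 80% to about 60%.
- Dropout layers can reduce overfitting. In this notebook, a rate of 0.8 narrowed the gap between train and test accuracy, although test accuracy itself did not improve.
- The choice between sparse categorical crossentropy and categorical crossentropy determines how you should format your label data (integers or one-hot vectors). The two losses are mathematically equivalent, so the choice should not materially affect results: Models B and D, which differ only in this respect, reach nearly identical test accuracy (0.7912 and 0.7890).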

## References

- Chollet, F. (2021) *Deep Learning with Python*, Second Edition, Manning Publications Co, chap 2