# **Lab 20: Natural Language Processing I**
---

### **Description**
In this week's lab, we will see how to use neural networks for one of the most popular NLP tasks: **text classification**. This will involve applying what you already know about neural nets together with two new NLP concepts: tokenization and vectorization.

For this project, we will be working with the AG News dataset, a collection of news articles gathered from many different sources. Each article belongs to one of four subjects: World, Sports, Business, or Sci/Tech.

The goal of this project is to build a machine learning model that can accurately classify news articles by subject based on their content.

<br>

### **Lab Structure**
**Part 1**: [News Subject Classification with a DNN](#p1)

**Part 2**: [News Subject Classification with a CNN](#p2)





<br>

### **Goals**
By the end of this lab, you will:
* Understand the concept of tokenization in NLP.
* Compare a fully connected network to a CNN for text classification.

<br>

### **Cheat Sheets**
[Natural Language Processing I](https://docs.google.com/document/d/1ZaLtMF7aQsG05myetJpoTJlr-sAIURP_a9sQr66pfqw/edit?usp=drive_link)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
!pip install --quiet torch==1.13.1
!pip install --quiet torchdata==0.5.1

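# Pin torchtext to the release that matches torch 1.13.1, so that pip does not
# replace the pinned torch version installed above.
!pip install --quiet torchtext==0.14.1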
import torchtext

import numpy as np
import pandas as pd

import tensorflow as tf
from keras.models import Sequential
from keras.layers import (Input, TextVectorization, Dense, Lambda, Reshape,
                          Conv1D, MaxPooling1D, Flatten)
from keras.optimizers import Adam
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 1.13.1 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 1.13.1 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 1.13.1 which is incompatible.
torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 1.13.1 which is incompatible.[0m[31m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 1.13.1 which is incompatible.
torchtext 0.16.0 requires torchdata==0.7.0, but you have torchdata 0.5.1 which is incompatible.[0m[31m
Collecting torch==2.1.0 (from torchtext)
  Using cached torc

---
## **Part 1: News Subject Classification with a DNN**
---

In this section, we will learn how to tokenize and vectorize news articles so that a neural network can classify them by subject. We will see how to use Keras's `TextVectorization` layer alongside the other layers we have seen to classify these articles.

We will be working with the [AG News dataset](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which is a corpus of over 1 million articles from more than 2000 sources and is commonly used in academic research (see its [papers with code page](https://paperswithcode.com/dataset/ag-news) for more information).

<br>

Each input will be text containing an article's title, source, and a snippet from the article itself. We will then classify each input as one of these subjects: `"World"`, `"Sports"`, `"Business"`, or `"Sci/Tech"`.

<br>


**Run the code provided below to import the dataset.**

In [None]:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

x_train, y_train = [], []
for Y, X in train_dataset:
    x_train.append(X)
    y_train.append(Y)

x_test, y_test = [], []
for Y, X in test_dataset:
    x_test.append(X)
    y_test.append(Y)

x_train = np.array(x_train)
x_test = np.array(x_test)

y_train, y_test = np.array(y_train) - 1, np.array(y_test) - 1
y_train = to_categorical(y_train, dtype = 'int32')
y_test = to_categorical(y_test, dtype = 'int32')
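
To make the label preparation above more concrete, here is a minimal sketch in plain Python of what subtracting 1 and one-hot encoding do. The helper `one_hot` and the sample labels are hypothetical and only for illustration; the lab itself relies on `to_categorical`.

```python
# Hypothetical sample of raw labels, assuming they run from 1 to 4.
raw_labels = [3, 1, 4, 2]

# Shift to 0-based class indices, as `np.array(y_train) - 1` does above.
class_ids = [label - 1 for label in raw_labels]

def one_hot(class_id, num_classes):
    """Return a list with a 1 at position class_id and 0 everywhere else."""
    row = [0] * num_classes
    row[class_id] = 1
    return row

encoded = [one_hot(c, num_classes=4) for c in class_ids]
for raw, row in zip(raw_labels, encoded):
    print(raw, "->", row)
```

Each row has exactly one 1, in the position of the article's class, which is the format that `categorical_crossentropy` expects.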

#### **Problem #1.1: Create the `TextVectorization` layer**

Let's create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

##### **1. Initialize the layer with the specified parameters.**

* `max_tokens = 5000`
* `output_mode = 'int'`
* `output_sequence_length = 50`

In [None]:
vectorize_layer = TextVectorization(
    # WRITE YOUR CODE HERE
  )

###### **Solution**


In [None]:
vectorize_layer = TextVectorization(
    max_tokens = 5000,
    output_mode = 'int',
    output_sequence_length = 50
  )

#### ***STOP!* Answer the following question: Why does every output sequence need to be the same length?**

##### **2. Adapt the layer to the training data.**

###### **Solution**

In [None]:
vectorize_layer.adapt(x_train)
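
Conceptually, adapting the layer means building a vocabulary from the training text. The sketch below is a rough plain-Python stand-in for that idea, not Keras's actual implementation: `build_vocabulary` and the toy corpus are hypothetical, and the assumption that two slots are reserved (one for padding, one for unknown words) is based on the `''` and `'[UNK]'` entries that appear at the start of the vocabulary in this lab.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Rank words by frequency and keep only as many as max_tokens allows."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding ('') and index 1 for unknown words
    # ('[UNK]'), so only max_tokens - 2 slots remain for real words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Hypothetical toy corpus, not taken from AG News.
toy_texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=4)
print(toy_vocab)  # ['', '[UNK]', 'the', 'sat']
```

With `max_tokens=4`, only the two most frequent words survive; every other word would later be mapped to `'[UNK]'`.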

#### **Problem #1.2: Look at the vocabulary**


**Run the code below to look at a portion of the vocabulary that was just built for the training data.**

In [None]:
vectorize_layer.get_vocabulary()[0:50]

['',
 '[UNK]',
 'the',
 'to',
 'a',
 'of',
 'in',
 'and',
 'on',
 'for',
 '39s',
 'that',
 'with',
 'as',
 'at',
 'its',
 'is',
 'new',
 'by',
 'said',
 'it',
 'us',
 'has',
 'from',
 'reuters',
 'an',
 'ap',
 'his',
 'will',
 'after',
 'was',
 'be',
 'over',
 'have',
 'their',
 'are',
 'up',
 'but',
 'first',
 'more',
 'two',
 'he',
 'this',
 'world',
 'monday',
 'wednesday',
 'tuesday',
 'oil',
 'out',
 'thursday']

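#### **Problem #1.3: Add the input and text vectorization layers to the model**

Create a `Sequential` model, add an `Input` layer that accepts one string per example, and then add the `vectorize_layer` you just adapted.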



In [None]:
model = Sequential()

model.add(Input(# COMPLETE THIS LINE
model.add(# COMPLETE THIS LINE

###### **Solution**


In [None]:
model = Sequential()

model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)

#### **Problem #1.4: Look at the vectorization of an example**


Add your own sentence below to see how it would be vectorized with our newly adapted layer.

<br>

**NOTE:** By default, `TextVectorization` ignores punctuation and treats upper and lower case as the same. This behavior can be changed with extra parameters such as `standardize`.
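
The sketch below approximates these steps in plain Python: lowercase the text, strip punctuation, look each word up in a vocabulary, and pad with zeros. It is an illustration under stated assumptions rather than the layer's real code: `standardize`, `vectorize`, and `toy_vocab` are hypothetical names, and `string.punctuation` is assumed to be a close enough stand-in for the characters the layer removes.

```python
import string

def standardize(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

def vectorize(text, vocab, sequence_length):
    """Map words to vocabulary indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    # Words missing from the vocabulary get index 1, the '[UNK]' slot.
    ids = [index.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy vocabulary, much smaller than the one built in this lab.
toy_vocab = ["", "[UNK]", "the", "world", "is"]

print(standardize("Hello, World! My name is Adam."))
# hello world my name is adam
print(vectorize("Hello world, my name is Adam!", toy_vocab, sequence_length=8))
# [1, 3, 1, 1, 4, 1, 0, 0]
```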

In [None]:
vector_0 = model.predict([# COMPLETE THIS LINE

print(vector_0)
print(vector_0.shape)

###### **Solution**


In [None]:
vector_0 = model.predict(['hello world my name is Adam'])

print(vector_0)
print(vector_0.shape)

[[   1   43 1293  924   16 4087    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]]
(1, 50)


#### ***STOP!* Answer the following question: What does each number in the output vector represent?**

#### **Problem #1.5: Add hidden layers and an output layer**


Add two dense layers with 512 neurons and ReLU activation.

Then, create the output layer so we can classify the data as `"World"`, `"Sports"`, `"Business"`, or `"Sci/Tech"`. You will use the softmax activation function.
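
As a reminder of what softmax does, here is a small plain-Python sketch that turns four raw scores into probabilities and picks the most likely subject. The scores are made up for illustration, and `softmax` here is a hand-written helper, not the Keras function.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["World", "Sports", "Business", "Sci/Tech"]
raw_scores = [1.0, 3.0, 0.5, 0.5]  # hypothetical outputs of the last layer

probabilities = softmax(raw_scores)
predicted = classes[probabilities.index(max(probabilities))]
print(predicted)  # Sports
```

Because the output layer has one neuron per subject, the largest probability tells us which subject the model predicts.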

In [None]:
model.add(Dense(# COMPLETE THIS LINE
model.add(Dense(# COMPLETE THIS LINE
model.add(Dense(# COMPLETE THIS LINE

###### **Solution**


In [None]:
model.add(Dense(512, activation = 'relu'))
model.add(Dense(512, activation = 'relu'))
model.add(Dense(4, activation = 'softmax'))

Let's take a look at our DNN.

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, 50)                0         
 ctorization)                                                    
                                                                 
 dense (Dense)               (None, 512)               26112     
                                                                 
 dense_1 (Dense)             (None, 512)               262656    
                                                                 
 dense_2 (Dense)             (None, 4)                 2052      
                                                                 
Total params: 290820 (1.11 MB)
Trainable params: 290820 (1.11 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
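
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-neuron pair plus one bias per neuron. The short sketch below redoes that arithmetic; `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# (inputs, units) for each dense layer: 50 token IDs feed the first layer.
layer_sizes = [(50, 512), (512, 512), (512, 4)]

counts = [dense_params(i, u) for i, u in layer_sizes]
print(counts)       # [26112, 262656, 2052]
print(sum(counts))  # 290820
```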


#### **Problem #1.6: Compile and fit the model**


Compile and fit the model with the following parameters:
* Adam learning rate of 0.001
* `categorical_crossentropy` for the loss function
* Accuracy as the metric
* For the fit, use `epochs=5` and `batch_size=256`

In [None]:
opt = Adam(learning_rate = # COMPLETE THIS LINE
model.compile(optimizer = opt, loss = # COMPLETE THIS LINE

model.fit(# COMPLETE THIS LINE

###### **Solution**


In [None]:
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])

model.fit(x_train, y_train, epochs = 5, batch_size = 256)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7ed767bb5ea0>

#### **Problem #1.7: Evaluate the model**


Now, evaluate the model for both the training and test sets.

<br>

**NOTE:** As a baseline, randomly guessing 1 out of 4 possible classes would achieve a roughly 0.25 accuracy.

In [None]:
# Evaluate the training set
model.evaluate(# COMPLETE THIS LINE

# Evaluate the test set
model.evaluate(# COMPLETE THIS LINE

###### **Solution**


In [None]:
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)



[1.4007633924484253, 0.25276315212249756]
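
The two numbers reported by `evaluate` are the loss and the accuracy. For comparison, the sketch below computes the categorical cross-entropy of a model that always assigns probability 0.25 to each of the 4 classes; this is a back-of-the-envelope baseline, not something the lab code computes.

```python
import math

num_classes = 4
uniform_probability = 1 / num_classes

# Cross-entropy for one example is -log(probability given to the true class).
baseline_loss = -math.log(uniform_probability)
print(round(baseline_loss, 4))  # 1.3863
```

A loss of about 1.40 with an accuracy of about 0.25 is therefore roughly what pure guessing would produce, which suggests the network has not learned much from the raw token IDs.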

---

<center>

### **Back to lecture**

</center>

---

---
## **Part 2: News Subject Classification with a CNN**
---

The model in Part 1 likely did not do much better than random guessing. Let's try with a CNN instead.

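Let's start by building a new CNN model. Remember, the syntax for CNNs for NLP is a little different from the syntax for images. Because each input is a one-dimensional sequence of tokens rather than a two-dimensional image, we will be using the 1D versions of the convolution and max pooling layers. Examples: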
* `Conv1D(filters=128, kernel_size=5, activation='relu')`
* `MaxPooling1D(pool_size=2)`
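
To see how these 1D layers change the length of a sequence, here is a simplified plain-Python sketch with a single filter, no bias, and no activation. It assumes the Keras defaults of no padding and a stride of 1 for the convolution, and non-overlapping windows for pooling; `conv1d_valid`, `max_pool1d`, and the input values are hypothetical.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence with no padding and a stride of 1."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping window; drop any leftover tail."""
    return [
        max(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]

token_ids = [float(i) for i in range(50)]  # hypothetical stand-in for 50 token IDs
kernel = [0.2] * 5                         # hypothetical averaging kernel of size 5

convolved = conv1d_valid(token_ids, kernel)
pooled = max_pool1d(convolved, pool_size=2)
print(len(token_ids), len(convolved), len(pooled))  # 50 46 23
```

A kernel of size 5 shortens a length-50 sequence to 46, and pooling with a size of 2 then halves it to 23.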


#### **Problem #2.1: Initialize the model with an input and vectorizer layer**


*Hint: This is the same as last time.*

In [None]:
cnn_model = # COMPLETE THIS LINE

cnn_model.add(Input(# COMPLETE THIS LINE
cnn_model.add(# COMPLETE THIS LINE

###### **Solution**


In [None]:
cnn_model = Sequential()

cnn_model.add(Input(shape=(1,), dtype=tf.string))
cnn_model.add(vectorize_layer)

#### **Problem #2.2: Finish building the CNN**


Build your CNN with the following layers:
* a convolutional layer with 64 filters, a kernel size of 5, and ReLU activation
* a max pooling layer with a pool size of 2
* a convolutional layer with 128 filters, a kernel size of 5, and ReLU activation
* a max pooling layer with a pool size of 2
* a flatten layer
* a dense layer with 256 neurons and ReLU activation
* the output layer

In [None]:
# The convolution layer requires us to cast the inputs to a different data type
# and reshape the input as well. We have done this for you.
cnn_model.add(Lambda(lambda x: tf.cast(x, 'float32')))
cnn_model.add(Reshape((50, 1)))

# Start building your CNN below.
# WRITE YOUR CODE HERE

###### **Solution**

In [None]:
# The convolution layer requires us to cast the inputs to a different data type
# and reshape the input as well. We have done this for you.
cnn_model.add(Lambda(lambda x: tf.cast(x, 'float32')))
cnn_model.add(Reshape((50, 1)))

# Start building your CNN below.
cnn_model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
cnn_model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
cnn_model.add(Flatten())
cnn_model.add(Dense(256, activation = 'relu'))
cnn_model.add(Dense(4, activation = 'softmax'))

Let's take a look at the completed model.

In [None]:
cnn_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, 50)                0         
 ctorization)                                                    
                                                                 
 lambda (Lambda)             (None, 50)                0         
                                                                 
 reshape (Reshape)           (None, 50, 1)             0         
                                                                 
 conv1d (Conv1D)             (None, 46, 64)            384       
                                                                 
 max_pooling1d (MaxPooling1  (None, 23, 64)            0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, 19, 128)          

#### **Problem #2.3: Compile and fit the model**


Compile and fit the model with the same parameters as Part 1.

In [None]:
opt = Adam(# COMPLETE THIS LINE)
cnn_model.compile(# COMPLETE THIS LINE

cnn_model.fit(# COMPLETE THIS LINE

###### **Solution**


In [None]:
opt = Adam(learning_rate = 0.001)
cnn_model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])

cnn_model.fit(x_train, y_train, epochs = 5, batch_size = 256)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7ed767885030>

#### **Problem #2.4: Evaluate the model**


Now, evaluate the model for both the training and test sets.

In [None]:
cnn_model.evaluate(# COMPLETE THIS LINE
cnn_model.evaluate(# COMPLETE THIS LINE

###### **Solution**

In [None]:
cnn_model.evaluate(x_train, y_train)
cnn_model.evaluate(x_test, y_test)



[1.373782753944397, 0.28842106461524963]

**Oh no!** It looks like the CNN didn't do much better! It turns out that tokenization and vectorization are not enough to prepare text data for deep learning. The integer IDs only reflect how frequent each word is, not what it means, yet the dense and convolutional layers treat them as meaningful quantities. There's an additional processing step we can take that will set our models up for success: **embedding.** We will see how embedding improves model performance in next week's lab.
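
As a small preview, and only as a rough sketch, an embedding can be thought of as a lookup table that replaces each token ID with a vector of numbers. Everything below is hypothetical: the table sizes are made up, and the values are random here, whereas in a real model they would be learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 3  # hypothetical sizes for illustration

# One row of embedding_dim numbers per token ID in the vocabulary.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each token ID with its row from the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([1, 4, 0])
print(len(vectors), len(vectors[0]))  # 3 3
```

The ID is used only as a row index, so its numeric size no longer matters to the layers that follow.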

# End of notebook
---
© 2024 The Coding School, All rights reserved