# **Lab 21: Natural Language Processing II**
---

### **Description**
In today's lab, we have another text classification task, but this time we will be using **embeddings.** For this project, we will be working with a dataset of BBC News articles classified by topic.

<br>

### **Lab Structure**

**Part 1**: [Text Classification of BBC Articles](#p1)

**Part 2**: [Convolutional Neural Networks](#p2)


**Part 3**: [[OPTIONAL] IMDB Sentiment Classification](#p3)

<br>

### **Goals**
By the end of this lab, you will:
* Understand how to apply embedding layers in models.
* Compare a fully connected network to a CNN for text classification with embeddings.

<br>

### **Cheat Sheets**
[Natural Language Processing II](https://docs.google.com/document/d/1p3xVUL1F6SEkusCI4klPLYqQwCkVN5s00ZvJjBpiSqM/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**

In [None]:
!pip install lime

from lime import lime_text
import numpy as np
import pandas as pd

import tensorflow as tf
import os

from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam, SGD
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator

from sklearn.model_selection import train_test_split

from random import choices

import warnings
warnings.filterwarnings('ignore')

Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lime
  Building wheel for lime (setup.py) ... done
  Created wheel for lime: filename=lime-0.2.0.1-py3-none-any.whl size=283835 sha256=33be6d59f1e398381ad1f94cd08b7129dc48daa49aaf98cff048ba9568a4082b
  Stored in directory: /root/.cache/pip/wheels/fd/a2/af/9ac0a1a85a27f314a06b39e1f492bee1547d52549a4606ed89
Successfully built lime
Installing collected packages: lime
Successfully installed lime-0.2.0.1


<a name="p1"></a>

---
## **Part 1: Text Classification of BBC Articles**
---

In this section, we'll apply our text classification knowledge to a corpus of BBC articles. Your task is to develop a model that categorizes articles based on snippets of text, assigning each to a specific category. Unlike previous labs where we focused on visual data, here we'll use neural networks with traditional Dense layers to process and classify text data.

<br>



**Run the code provided below to import the dataset.**

In [None]:
dataset = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vRRiQ1DUkUxk31YpaHA2i9QtwGq_VGXiy86z7l3aT9v5zoB6M7a-2M2qlYckr1C_ZG6StBELlU_hD3S/pub?output=csv')

### **Problem #1.1: Determine the number of categories**


Using any necessary pandas functions or attributes, determine the total number of unique categories that the texts are assigned to.

**In the code cell below, display the first few rows of the dataset.**


#### **Solution**


In [None]:
dataset.head()

Unnamed: 0,text,category,category_id
0,libya takes 1bn unfrozen funds libya withdrawn 1bn assets assets previously frozen libyan central bank came lifted trade ban reward tripoli giving weapons mass destruction vowing compensate lockerbie victims original size libya funds 400 central bank reuters withdrawal mean libya ties process opening accounts banks united states central bank vice president farhat omar ben gadaravice previously frozen assets invested countries believed included equity holdings banks ban trade economic activity tripoli imposed president ronald regan 1986 series deemed terrorist acts 1988 lockerbie air crash ...,business,0
1,saudi investor picks savoy famous savoy hotel sold group combining saudi billionaire investor prince alwaleed bin talal unit hbos bank financial details includes nearby simpson strand restaurant disclosed seller irish based property quinlan private bought savoy berkeley claridge connaught £ 750 m prince alwaleed hotel investments luxury george v paris substantial stakes fairmont hotels resorts manage savoy simpson strand seasons fairmont planned invest 48 m £ 26 m renovating parts savoy including river room suites views river thames completed summer 2006 fairmont,business,0
2,tate lyle boss bags award tate lyle chief executive named businessman leading magazine iain ferguson awarded title publication forbes returning venerable manufacturers 100 sugar group absent ftse 100 seven years mr ferguson helped return growth tate shares leapt 55 boosted firming sugar prices sales artificial sweeteners years sagging stock price seven hiatus ftse 100 venerable manufacturers returned vaunted index forbes mr ferguson took helm 2003 spending career consumer goods giant unilever tate lyle original member historic ft 30 index 1935 operates 41 factories 20 additional production...,business,0
3,uk economy facing major risks uk manufacturing sector continue face challenges years british chamber commerce bcc quarterly survey found exports picked months 2004 levels years rise came despite exchange rates cited major concern bcc found uk economy faced major risks warned growth set slow recently forecast economic growth slow 2004 little 5 2006 manufacturers domestic sales growth fell slightly quarter survey 5 196 firms found employment manufacturing fell job expectations lowest level despite positive export sector worrying signs manufacturing bcc results reinforce concern sector persis...,business,0
4,aids climate davos agenda climate change fight aids leading list concerns day world economic forum swiss resort davos 000 business political leaders globe listen uk prime tony blair opening speech wednesday mr blair focus africa development global warming earlier day came update efforts million people anti aids drugs 2005 world health organisation 700 000 people poor countries extending drugs 440 000 earlier amounting million needed 2bn funding gap stood hitting 2005 target themes stressed mr blair attendance announced minute wants dominate uk chairmanship g8 industrialised states issues d...,business,0


Let's now list the unique categories in our dataset and count them.

In [None]:
print(dataset["category"].unique())
print(len(dataset["category"].unique()))

['business' 'entertainment' 'politics' 'sport' 'tech']
5
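As a side note, pandas can also report the count directly. The sketch below uses a small made-up DataFrame (the rows are hypothetical; only the column names match the lab's dataset) to show `nunique()` and `value_counts()`, which is one possible way to check how balanced the categories are.

```python
import pandas as pd

# Toy stand-in for the lab's dataset (hypothetical rows, same column names).
toy = pd.DataFrame({
    "text": ["shares rise", "new film out", "team wins cup", "bank cuts rates"],
    "category": ["business", "entertainment", "sport", "business"],
})

num_categories = toy["category"].nunique()     # number of distinct labels
per_category = toy["category"].value_counts()  # number of rows per label

print(num_categories)
print(per_category)
```

Running the same two calls on `dataset["category"]` should give the category count and the number of articles per category.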


**Question:** How many unique categories do we have from the above output? What are the unique categories?

### **Problem #1.2: Split the data into training and test sets**


Determine the correct variables to use as the feature(s) and label here, making sure to provide numerical labels for the neural network to predict.

#### **Solution**


In [None]:
dataset['category'].unique()

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)

In [None]:
# Split the dataset into features and labels
x = dataset['text'].values
y = dataset['category_id'].values

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
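To see what `to_categorical` does to the labels, here is a minimal NumPy sketch of one-hot encoding. The helper name `one_hot` is ours for illustration and is not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]

# Four example category ids out of five possible categories.
encoded = one_hot([0, 3, 4, 1], num_classes=5)
print(encoded)
print(encoded.argmax(axis=1))  # recovers the original integer labels
```

When `num_classes` is not passed, `to_categorical` infers it from the largest label it sees, so passing `num_classes` explicitly is a safe habit whenever a split might be missing the highest category id.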

### **Problem #1.3: Create the `TextVectorization` layer**


To get started building the neural network, create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

3. Look at the newly built vocabulary.

#### **1. Initialize the layer with the specified parameters.**

* The vocabulary should be at most 10000 words.
* The layer's output should always be 20 integers.

##### **Solution**


In [None]:
vectorize_layer = TextVectorization(max_tokens = 10000, output_mode = 'int', output_sequence_length = 20)

#### **2. Adapt the layer to the training data.**

##### **Solution**


In [None]:
vectorize_layer.adapt(x_train)

#### **3. Look at the newly built vocabulary.**

##### **Solution**


In [None]:
vectorize_layer.get_vocabulary()[0:30]

['',
 '[UNK]',
 'mr',
 'm',
 'people',
 'new',
 '£',
 't',
 'government',
 'film',
 'year',
 'uk',
 'music',
 'game',
 'world',
 'best',
 'labour',
 'election',
 'time',
 'blair',
 '1',
 'party',
 'games',
 'mobile',
 'market',
 'england',
 '000',
 'tv',
 '3',
 '2']
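Notice that index 0 is the empty padding token and index 1 is `[UNK]`, the stand-in for out-of-vocabulary words. The pure-Python sketch below imitates this behavior on two toy sentences. It is a simplified illustration and not how `TextVectorization` is implemented; for instance, it skips the punctuation stripping that the real layer applies by default. The helper names `build_vocab` and `vectorize` are hypothetical.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; ids 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, seq_len):
    """Map tokens to ids, then truncate or zero-pad to exactly seq_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.lower().split()]  # 1 = [UNK]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))                    # 0 = padding

toy_texts = ["uk film wins award", "uk election film"]
vocab = build_vocab(toy_texts, max_tokens=6)
result = vectorize("uk film about football", vocab, seq_len=6)

print(vocab)
print(result)
```

Unknown words map to 1 and short inputs are padded with 0, which is why every vectorized snippet has the same length.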

### **Problem #1.4: Build the model**


Complete the code below to build a model with the following layers.

An `Embedding` layer such that:
* The vocabulary contains 10000 tokens.
* The input length corresponds to the output of the vectorization layer.
* The number of outputs per input is 200.

<br>

Two `Dense` layers with a number of neurons and activation function that you choose. We recommend you try a few options.

<br>

A `Dense` layer for outputting classification probabilities for each of the possible categories.


*Hint: If you're not sure which activation function to use, use `relu`.*
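For intuition about what "classification probabilities" means, here is a small NumPy sketch of the softmax function, which turns one raw score per category into probabilities that sum to 1. The scores below are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores for one article over the five categories.
scores = np.array([[2.0, 0.5, 0.1, -1.0, 0.3]])
probs = softmax(scores)

print(probs.round(3))
print(probs.argmax(axis=-1))  # index of the most likely category
```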

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(# COMPLETE THIS LINE

# Hidden Layers
model.add(# COMPLETE THIS LINE

# Output Layer
model.add(# COMPLETE THIS LINE


#### **Solution**


In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 10000, output_dim = 200, input_length = 20)) # input_length matches output_sequence_length
model.add(Flatten())

# Hidden Layers
model.add(Dense(64, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))

# Output Layer
model.add(Dense(len(dataset['category'].unique()), activation = 'softmax'))
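To make the shapes concrete, the NumPy sketch below mimics what the `Embedding` and `Flatten` layers do to a batch of vectorized snippets. The embedding matrix here is random purely for illustration; in the real model its rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, seq_len = 10000, 200, 20

# One row per token id (random here; learned during training in the real layer).
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

# A batch of 3 vectorized snippets, each exactly seq_len token ids.
token_ids = rng.integers(0, vocab_size, size=(3, seq_len))

embedded = embedding_matrix[token_ids]            # row lookup -> (3, 20, 200)
flattened = embedded.reshape(len(token_ids), -1)  # Flatten    -> (3, 4000)

print(embedded.shape, flattened.shape)
```

Each snippet therefore reaches the first `Dense` layer as a single vector of 20 × 200 = 4000 numbers.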

### **Problem #1.5: Compile and fit the model**

Using standard parameters for classification, compile and train this neural network using:
* A learning rate of 0.01.
* A batch size of 200.
* 5 epochs.
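As a rough guide to what these settings mean, the sketch below computes how many weight updates a training run performs for a given batch size and number of epochs. The sample count of 1000 is a made-up number, not the size of this lab's training set.

```python
import math

def gradient_updates(num_samples, batch_size, epochs):
    """Return (updates per epoch, total updates); the last batch may be partial."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical training set of 1000 samples.
steps, total = gradient_updates(num_samples=1000, batch_size=200, epochs=5)
print(steps, total)
```

A larger batch size means fewer, smoother updates per epoch, while the learning rate controls how large each of those updates is.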

In [None]:
opt = Adam(learning_rate = # COMPLETE THIS LINE
model.compile(optimizer = opt, loss = # COMPLETE THIS LINE

model.fit(# COMPLETE THIS LINE


#### **Solution**


In [None]:
opt = Adam(learning_rate = 0.01)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])

model.fit(x_train, y_train, epochs = 5, batch_size = 200)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7882a06f8b50>

### **Problem #1.6: Evaluate the model**


Now, evaluate the model for both the training and test sets.

#### **Solution**


In [None]:
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)



[1.917319416999817, 0.901123583316803]
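Because the model was compiled with `metrics = ['accuracy']`, `evaluate` returns the loss followed by the accuracy. A notebook cell displays only the value of its last expression, so the list above is presumably the test-set result. The NumPy sketch below shows how both numbers could be computed from predicted probabilities; the predictions are made up, and the helper functions are ours rather than the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the batch."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())

def accuracy(y_true, y_pred):
    """Fraction of rows where the most probable class is the true class."""
    return float((y_true.argmax(axis=1) == y_pred.argmax(axis=1)).mean())

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
y_pred = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
    [0.001, 0.998, 0.001],  # confidently wrong
])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

In this toy batch a single confidently wrong prediction dominates the loss, which is one possible reason a model can show a high loss alongside a high accuracy.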

<a name="p2"></a>

---
## **Part 2: Convolutional Neural Networks**
---

In this section, you will classify the articles using neural networks with `Conv1D` and `MaxPooling1D` hidden layers. Feel free to include `Dense` hidden layers too.
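Before building the model, it may help to see what these two layers compute. The NumPy sketch below slides a small set of filters over a sequence of embeddings and then max-pools the result. It is a simplified illustration with toy sizes, no bias term, and random weights; the helper names are ours, not Keras functions.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """x: (seq_len, channels); kernels: (kernel_size, channels, filters)."""
    k = kernels.shape[0]
    windows = [x[i:i + k] for i in range(x.shape[0] - k + 1)]
    # Each window is multiplied elementwise with every filter and summed.
    out = np.stack([np.tensordot(w, kernels, axes=([0, 1], [0, 1])) for w in windows])
    return np.maximum(out, 0.0)  # ReLU activation

def max_pool1d(x, pool_size):
    """Keep the largest value in each non-overlapping block of pool_size steps."""
    steps = x.shape[0] // pool_size
    return x[:steps * pool_size].reshape(steps, pool_size, -1).max(axis=1)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(20, 8))   # 20 tokens, 8-dim embeddings (toy size)
kernels = rng.normal(size=(3, 8, 4))  # kernel_size=3, 4 filters

features = conv1d_valid(sequence, kernels)
pooled = max_pool1d(features, pool_size=2)
print(features.shape, pooled.shape)
```

Each filter looks at a few neighboring tokens at a time, so unlike a `Dense` layer on flattened embeddings it can respond to a short phrase wherever it appears in the snippet.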

### **Problem #2.1: Create the highest performing model possible using CNNs**


Complete the code below to train a new model that matches the one above, except that its hidden layers use any or all of the CNN layers that Keras provides. The goal is to create a model that performs as well as possible on the *test set*.

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
# COMPLETE THIS CODE


# Hidden Layers
# COMPLETE THIS CODE


# Output Layer
# COMPLETE THIS CODE


# Fitting
# COMPLETE THIS CODE


# Evaluating
print("\n\n\n")
# COMPLETE THIS CODE

#### **Solution**


In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 10000, output_dim = 200, input_length = 20)) # input_length matches output_sequence_length

# Hidden Layers
model.add(Conv1D(filters=200, kernel_size=11, activation='relu'))
model.add(MaxPooling1D(pool_size=10))

model.add(Flatten()) # Add a Flatten layer before the Dense layer
model.add(Dense(100, activation='relu'))

# Output Layer
model.add(Dense(len(dataset['category'].unique()), activation = 'softmax'))

# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")

# Fitting
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 5, batch_size = 256)

# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)


(None, 1) -> (None, 20)
(None, 20) -> (None, 20, 200)
(None, 20, 200) -> (None, 10, 200)
(None, 10, 200) -> (None, 1, 200)
(None, 1, 200) -> (None, 200)
(None, 200) -> (None, 100)
(None, 100) -> (None, 5)




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[0.3363010585308075, 0.9168539047241211]
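The sequence lengths in the printed layer shapes can be checked by hand. The sketch below assumes the Keras defaults used in this solution: `'valid'` padding, a convolution stride of 1, and a pooling stride equal to `pool_size`. The helper names are ours.

```python
def conv1d_out_len(seq_len, kernel_size, stride=1):
    """Output length of a Conv1D layer with 'valid' (no) padding."""
    return (seq_len - kernel_size) // stride + 1

def pool1d_out_len(seq_len, pool_size):
    """Output length of MaxPooling1D when the stride equals pool_size."""
    return (seq_len - pool_size) // pool_size + 1

# 20 tokens -> Conv1D(kernel_size=11) -> MaxPooling1D(pool_size=10)
after_conv = conv1d_out_len(20, kernel_size=11)
after_pool = pool1d_out_len(after_conv, pool_size=10)
print(after_conv, after_pool)
```

With only 20 tokens per snippet, a kernel of size 11 followed by a pool of size 10 collapses the whole sequence to a single position, so smaller kernels or a longer `output_sequence_length` are worth trying when tuning the model.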

<a name="p3"></a>

---
## **Part 3: [OPTIONAL] IMDB Sentiment Classification**
---

In this part we will focus on building a CNN model using the IMDB sentiment classification dataset. This is a dataset of 25,000 movie reviews with sentiment labels: 0 for negative and 1 for positive.

<br>


**Run the code provided below to import the dataset.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTdgncgNHtppfS89LHOh1kGl5tYzoEUrUwmOPOQF7mQ0U5Rzba27H45imvZ06_J2x0-wCJySylP5V3_/pub?gid=1712575053&single=true&output=csv'

df = pd.read_csv(url)
df.head()

x_train, x_test, y_train, y_test = train_test_split(df["review"], df["sentiment"], test_size = 0.2, random_state = 42)

x_train = np.array(x_train)
x_test = np.array(x_test)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

### **Problem #3.1: Create the `TextVectorization` layer**


To get started, let's create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

3. Look at the newly built vocabulary.

#### **1. Initialize the layer with the specified parameters.**

* The vocabulary should be at most 5000 words.
* The layer's output should always be 64 integers.

##### **Solution**

In [None]:
vectorize_layer = TextVectorization(max_tokens = 5000, output_mode = 'int', output_sequence_length = 64)

#### **2. Adapt the layer to the training data.**

##### **Solution**

In [None]:
vectorize_layer.adapt(x_train)

#### **3. Look at the newly built vocabulary.**

##### **Solution**

In [None]:
vectorize_layer.get_vocabulary()

['',
 '[UNK]',
 'the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'in',
 'it',
 'i',
 'this',
 'that',
 'br',
 'was',
 'as',
 'for',
 'with',
 'movie',
 'but',
 'film',
 'on',
 'not',
 'you',
 'are',
 'his',
 'have',
 'be',
 'he',
 'one',
 'its',
 'at',
 'all',
 'by',
 'an',
 'they',
 'from',
 'who',
 'so',
 'like',
 'just',
 'or',
 'her',
 'about',
 'if',
 'has',
 'out',
 'some',
 'there',
 'what',
 'good',
 'very',
 'when',
 'more',
 'my',
 'even',
 'no',
 'up',
 'would',
 'she',
 'time',
 'only',
 'which',
 'really',
 'their',
 'see',
 'story',
 'were',
 'had',
 'can',
 'me',
 'we',
 'than',
 'much',
 'well',
 'been',
 'get',
 'do',
 'will',
 'also',
 'great',
 'into',
 'bad',
 'other',
 'people',
 'because',
 'how',
 'most',
 'first',
 'him',
 'dont',
 'then',
 'movies',
 'made',
 'make',
 'could',
 'them',
 'films',
 'way',
 'any',
 'too',
 'after',
 'characters',
 'think',
 'watch',
 'many',
 'seen',
 'being',
 'two',
 'character',
 'never',
 'where',
 'love',
 'acting',
 'plot',
 'did'

### **Problem #3.2: Build and Train a Dense model**

Complete the code below to build a model with the following layers.

An `Embedding` layer such that:
- The vocabulary contains 5000 tokens.
- The input length corresponds to the output of the vectorization layer.
- The number of outputs per input is 128.

<br>

Hidden layers such that:

- There's at least one `Dense` layer.

<br>

A `Dense` layer for outputting classification probabilities for "negative" or "positive" labels.

#### **Solution**

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 5000, output_dim = 128, input_length = 64))

# Hidden Layers
model.add(Flatten()) # Add a Flatten layer before the Dense layer
model.add(Dense(128, activation='relu'))

# Output Layer
model.add(Dense(2, activation = 'softmax'))

# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")

# Fitting
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 5, batch_size = 256)

# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)


(None, 1) -> (None, 64)
(None, 64) -> (None, 64, 128)
(None, 64, 128) -> (None, 8192)
(None, 8192) -> (None, 128)
(None, 128) -> (None, 2)




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[1.1797369718551636, 0.758899986743927]

As an alternative, the cell below builds the model with a CNN instead.
**Which architecture performs better?**

In [None]:
# [OPTIONAL] USING CNNs
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 5000, output_dim = 128, input_length = 64))

# Hidden Layers
model.add(Conv1D(filters = 16, kernel_size = 4, activation = 'relu'))
model.add(MaxPooling1D(pool_size = 3))
model.add(Flatten())

# Output Layer
model.add(Dense(2, activation = 'softmax'))



# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")



# Fitting
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 5, batch_size = 256)


# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)

(None, 1) -> (None, 64)
(None, 64) -> (None, 64, 128)
(None, 64, 128) -> (None, 61, 16)
(None, 61, 16) -> (None, 20, 16)
(None, 20, 16) -> (None, 320)
(None, 320) -> (None, 2)




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[0.5064528584480286, 0.7864999771118164]
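Besides accuracy, one way to compare the two IMDB architectures is by their number of trainable parameters. The sketch below recomputes the counts from the layer shapes printed above using the standard weights-plus-biases formulas; the helper names are ours, and calling `model.summary()` on each model is the way to confirm the totals.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def dense_params(inputs, units):
    return inputs * units + units  # weights + biases

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

embed = embedding_params(5000, 128)

# Dense model: Flatten (64 * 128 = 8192) -> Dense(128) -> Dense(2)
dense_model = embed + dense_params(64 * 128, 128) + dense_params(128, 2)

# CNN model: Conv1D(16 filters, kernel 4) -> pool -> Flatten (20 * 16 = 320) -> Dense(2)
cnn_model = embed + conv1d_params(4, 128, 16) + dense_params(20 * 16, 2)

print(dense_model, cnn_model)
```

Most of the Dense model's extra parameters sit in the first `Dense` layer after `Flatten`, whereas the convolution reuses the same small set of filters at every position.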

---
### © 2024 The Coding School, All rights reserved