# Load Libraries

In [None]:
import csv
import pandas as pd
import numpy as np
import tensorflow as tf

In [None]:
with open("./data/bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")

First line (header) looks like this:

category,text

Each data point looks like this:

tech,tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially

As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article.
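The two-column layout described above can be illustrated on a small in-memory sample. This is only a sketch: the two data rows below are hypothetical stand-ins written in the same `category,text` layout, not rows taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same "category,text" layout as the file
sample = io.StringIO(
    "category,text\n"
    "tech,new devices change the way people watch tv\n"
    "sport,home side wins the final match\n"
)

reader = csv.reader(sample, delimiter=",")
header = next(reader)   # the first row holds the column names
rows = list(reader)     # every remaining row is a [category, text] pair

print(header)
for category, text in rows:
    print(f"{category!r} -> {text!r}")
```

Because each row has the category in the first column and the text in the second, indexing a row with `0` and `1` is enough to separate labels from articles.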

## Parse Data from File

In [None]:
def parse_data_from_file(filename):
    """
    Extracts sentences and labels from a CSV file

    Args:
        filename (str): path to the CSV file

    Returns:
        (list[str], list[str]): tuple containing lists of sentences and labels
    """
    sentences = []
    labels = []

    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        # Skip the header row ("category,text")
        next(reader)
        for row in reader:
            # Column 0 holds the category, column 1 the article text
            labels.append(row[0])
            sentences.append(row[1])

    return sentences, labels

In [None]:
sentences, labels = parse_data_from_file("./data/bbc-text.csv")

print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words.\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

There are 2225 sentences in the dataset.

First sentence has 737 words.

There are 2225 labels in the dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']




## Standardize Function



One important step when working with text data is to standardize it so that it is easier to extract information from it. For instance, you will probably want to convert it all to lower case (so that the same word does not get different representations, such as "hello" and "Hello") and to remove the [stopwords](https://en.wikipedia.org/wiki/Stop_word). These are the most common words in the language, and they rarely provide useful information for the classification process. The next cell provides a list of common stopwords that you can use in this exercise.


In [None]:
# List of stopwords
STOPWORDS = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

In [None]:
def standardize_func(sentence):
    """Standardizes sentences by converting to lower-case and removing stopwords.

    Args:
        sentence (str): Original sentence.

    Returns:
        str: Standardized sentence in lower-case and without stopwords.
    """

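    # Convert to lower-case so "Hello" and "hello" are treated as the same word
    sentence = sentence.lower()
    # Split on whitespace and keep only the words that are not stopwords
    words = sentence.split()
    filtered_words = [word for word in words if word not in STOPWORDS]
    # Rebuild the sentence from the remaining words
    sentence = " ".join(filtered_words)

    return sentence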

In [None]:
test_sentence = "Hello! We're just about to see this function in action =)"
standardized_sentence = standardize_func(test_sentence)
print(f"Original sentence is:\n{test_sentence}\n\nAfter standardizing:\n{standardized_sentence}")

standard_sentences = [standardize_func(sentence) for sentence in sentences]

print("\n\n--- Apply the standardization to the dataset ---\n")
print(f"There are {len(standard_sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words originally.\n")
print(f"First sentence has {len(standard_sentences[0].split())} words (after removing stopwords).\n")

Original sentence is:
Hello! We're just about to see this function in action =)

After standardizing:
hello! just see function action =)


--- Apply the standardization to the dataset ---

There are 2225 sentences in the dataset.

First sentence has 737 words originally.

First sentence has 436 words (after removing stopwords).
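Note that the test output above keeps `hello!` and `=)` as they are. The function splits on whitespace only, so a stopword with punctuation attached (for example `the,`) would not match the list. The sketch below is one possible variant, not part of the original exercise. It strips punctuation from the edges of each token before the stopword check. The name `standardize_strip_punct` and the short `SAMPLE_STOPWORDS` set are assumptions made for illustration.

```python
import string

# Small illustrative subset of stopwords (the notebook's STOPWORDS list is much longer)
SAMPLE_STOPWORDS = {"we're", "about", "to", "this", "in", "the", "is"}

def standardize_strip_punct(sentence, stopwords=SAMPLE_STOPWORDS):
    """Lower-case, strip punctuation at token edges, and drop stopwords."""
    kept = []
    for token in sentence.lower().split():
        # Strip punctuation only at the edges so "we're" keeps its apostrophe
        word = token.strip(string.punctuation)
        # Skip tokens that were pure punctuation (now empty) and stopwords
        if word and word not in stopwords:
            kept.append(word)
    return " ".join(kept)

result = standardize_strip_punct("This, in the end, is about Speed!")
print(result)
```

Using a `set` for the stopwords also makes each membership check constant-time on average, which may matter when the same function is applied to every article in the dataset.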



## Fit Vectorizer

In [None]:
def fit_vectorizer(sentences):
    """
    Instantiates the TextVectorization layer and adapts it to the sentences.

    Args:
        sentences (list[str]): lower-cased sentences without stopwords

    Returns:
        tf.keras.layers.TextVectorization: an instance of the TextVectorization layer adapted to the texts.
    """


    # Instantiate the TextVectorization class
    vectorizer = tf.keras.layers.TextVectorization()

    # Adapt to the sentences
    vectorizer.adapt(sentences)

    return vectorizer

In [None]:
# Create the vectorizer adapted to the standardized sentences
vectorizer = fit_vectorizer(standard_sentences)

# Get the vocabulary
vocabulary = vectorizer.get_vocabulary()

print(f"Vocabulary contains {len(vocabulary)} words\n")
print("[UNK] token included in vocabulary" if "[UNK]" in vocabulary else "[UNK] token NOT included in vocabulary")

Vocabulary contains 33088 words

[UNK] token included in vocabulary


In [None]:
# Vectorize and pad sentences
padded_sequences = vectorizer(standard_sentences)

# Show the output
print(f"First padded sequence looks like this: \n\n{padded_sequences[0]}\n")
print(f"Tensor of all sequences has shape: {padded_sequences.shape}\n")
print(f"This means there are {padded_sequences.shape[0]} sequences in total and each one has a size of {padded_sequences.shape[1]}")

First padded sequence looks like this: 

[  93  155 1186 ...    0    0    0]

Tensor of all sequences has shape: (2225, 2418)

This means there are 2225 sequences in total and each one has a size of 2418


Notice that the variable name now refers to `sequences` rather than `sentences`. This is because each text is now encoded as a sequence of integers, padded with zeros so that every sequence has the length of the longest one (2418 here).
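To make the encoding concrete, here is a pure-Python sketch of the idea behind the vectorization step. It is a simplified stand-in, not the `TextVectorization` implementation. It assumes that index 0 is reserved for padding, index 1 for unknown words, and that the remaining words are numbered by descending frequency. The function name `toy_vectorize` is hypothetical.

```python
from collections import Counter

def toy_vectorize(sentences):
    """Map whitespace tokens to integer ids and right-pad with 0 to a common length."""
    counts = Counter(word for s in sentences for word in s.split())
    # Ids 0 and 1 are reserved for padding and unknown words;
    # the remaining words follow in order of descending frequency
    vocab = ["", "[UNK]"] + [word for word, _ in counts.most_common()]
    index = {word: i for i, word in enumerate(vocab)}
    # Replace every word with its id (1 for anything not in the vocabulary)
    sequences = [[index.get(w, 1) for w in s.split()] for s in sentences]
    # Pad every sequence with zeros up to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
    return vocab, padded

vocab, padded = toy_vectorize(["tv future tv", "future tv"])
print(vocab)
print(padded)
```

In this toy example the shorter sentence ends with a `0`, which mirrors the trailing zeros in the first padded sequence printed above.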

## Fit Label Encoder


In [None]:

def fit_label_encoder(labels):
    """
    Instantiates the StringLookup layer and adapts it to the labels.

    Args:
        labels (list[str]): labels to encode

    Returns:
        tf.keras.layers.StringLookup: adapted encoder for labels
    """

    # Instantiate the StringLookup layer without an OOV token
    label_encoder = tf.keras.layers.StringLookup(
        oov_token=None,
        num_oov_indices=0)

    # Adapt the layer to the labels
    label_encoder.adapt(labels)


    return label_encoder

In [None]:
# Create the encoder adapted to the labels
label_encoder = fit_label_encoder(labels)

# Get the vocabulary
vocabulary = label_encoder.get_vocabulary()

# Encode labels
label_sequences = label_encoder(labels)

print(f"Vocabulary of labels looks like this: {vocabulary}\n")
print(f"First ten labels: {labels[:10]}\n")
print(f"First ten label sequences: {label_sequences[:10]}\n")

Vocabulary of labels looks like this: ['sport', 'business', 'politics', 'tech', 'entertainment']

First ten labels: ['tech', 'business', 'sport', 'sport', 'entertainment', 'politics', 'politics', 'sport', 'sport', 'entertainment']

First ten label sequences: [3 1 0 0 4 2 2 0 0 4]



Each integer in the first ten label sequences **[3 1 0 0 4 2 2 0 0 4]** is a 0-based index into the label vocabulary **['sport', 'business', 'politics', 'tech', 'entertainment']**:
* 0: Corresponds to the 1st element in the vocabulary, which is 'sport'.
* 1: Corresponds to the 2nd element in the vocabulary, which is 'business'.
* 2: Corresponds to the 3rd element in the vocabulary, which is 'politics'.
* 3: Corresponds to the 4th element in the vocabulary, which is 'tech'.
* 4: Corresponds to the 5th element in the vocabulary, which is 'entertainment'.

Therefore, the first ten label sequences **[3 1 0 0 4 2 2 0 0 4]** decode to the following labels:

**['tech', 'business', 'sport', 'sport', 'entertainment', 'politics', 'politics', 'sport', 'sport', 'entertainment']**

This matches the first ten labels printed above, which confirms that the numerical sequence is simply an encoding of the categorical labels by their position in the vocabulary list.
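The mapping from indices back to category names can also be checked with a few lines of plain Python. This is a minimal sketch that copies the vocabulary and the encoded values from the printed output, rather than calling the `StringLookup` layer itself.

```python
# Vocabulary and encoded labels copied from the printed output
vocabulary = ["sport", "business", "politics", "tech", "entertainment"]
encoded = [3, 1, 0, 0, 4, 2, 2, 0, 0, 4]

# Each integer is a 0-based position in the vocabulary list
decoded = [vocabulary[i] for i in encoded]
print(decoded)
```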






