In [30]:
import csv
import pandas as pd
import numpy as np
import tensorflow as tf

In [31]:
with open("./data/bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")

First line (header) looks like this:

category,text

Each data point looks like this:

tech,tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially

In [32]:
def parse_data_from_file(filename):
    """Read the CSV file and return its texts and category labels as two lists."""
    data = pd.read_csv(filename)

    # The file has two columns, `category` and `text` (see the header above)
    sentences = data["text"].tolist()
    labels = data["category"].tolist()
    return sentences, labels

In [33]:
sentences, labels = parse_data_from_file("./data/bbc-text.csv")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words.\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

There are 2225 sentences in the dataset.

First sentence has 737 words.

There are 2225 labels in the dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']




## Standardize the sentences
This is a crucial step, because we want to keep only the informative parts of each sentence.
For instance, convert everything to lower case (so that the same word does not get different representations such as "hello" and "Hello") and remove the stopwords. Stopwords are the most common words in the language, and they rarely provide useful information for the classification process.
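
To make both effects concrete, here is a minimal sketch on a toy input. The three-word stopword set `TINY_STOPWORDS` and the helper `normalize` are made up for illustration; they are not the list or the function used in this notebook.

```python
from collections import Counter

# Hypothetical mini stopword set, for illustration only
TINY_STOPWORDS = {"the", "is", "a"}

def normalize(text):
    """Lowercase the text and drop words found in TINY_STOPWORDS."""
    return [word for word in text.lower().split() if word not in TINY_STOPWORDS]

raw = "Hello the world is a hello"
raw_counts = Counter(raw.split())        # case-sensitive word counts
clean_counts = Counter(normalize(raw))   # counts after standardizing

# Before: "Hello" and "hello" are counted as two different words
print(raw_counts["Hello"], raw_counts["hello"])
# After: both spellings are merged into a single entry
print(clean_counts["hello"])
print(normalize(raw))
```

Using a `set` for the stopwords keeps each membership check cheap, which may matter on a larger corpus.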

In [34]:
# List of stopwords
STOPWORDS = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

In [35]:
def standardize_func(sentence):
    """Lowercase a sentence and drop every word found in STOPWORDS."""
    # Split on whitespace only, so punctuation stays attached to its word
    words = sentence.lower().split()
    kept = [word for word in words if word not in STOPWORDS]
    return " ".join(kept)

In [36]:
test_sentence = "Hello! We're just about to see this function in action :)"
standardized_sentence = standardize_func(test_sentence)
print(f"Original sentence is:\n{test_sentence}\n\nAfter standardizing:\n{standardized_sentence}")

standard_sentences = [standardize_func(sentence) for sentence in sentences]

print("\n\n--- Apply the standardization to the dataset ---\n")
print(f"There are {len(standard_sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words originally.\n")
print(f"First sentence has {len(standard_sentences[0].split())} words (after removing stopwords).\n")

Original sentence is:
Hello! We're just about to see this function in action :)

After standardizing:
hello! just see function action :)


--- Apply the standardization to the dataset ---

There are 2225 sentences in the dataset.

First sentence has 737 words originally.

First sentence has 436 words (after removing stopwords).



## TextVectorization
Now that the data is standardized, it is time to vectorize the sentences of the dataset. This is the step where we build the vocabulary that maps each word in the sentences to an integer token.

In [37]:
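def fit_vectorizer(sentences):
    """Create a TextVectorization layer and fit its vocabulary to the sentences."""
    vectorizer = tf.keras.layers.TextVectorization()
    # adapt() scans the sentences and builds the vocabulary
    vectorizer.adapt(sentences)
    return vectorizer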

In [38]:
vectorizer = fit_vectorizer(standard_sentences)
vocabulary = vectorizer.get_vocabulary()

print(f"Vocabulary contains {len(vocabulary)} words\n")
print("[UNK] token included in vocabulary" if "[UNK]" in vocabulary else "[UNK] token NOT included in vocabulary")
print(f"first 100 words:\n{vocabulary[:100]}")

Vocabulary contains 33088 words

[UNK] token included in vocabulary
first 100 words:
['', '[UNK]', 's', 'said', 'will', 'not', 'mr', 'also', 'people', 'new', 'us', 'year', 'one', 'can', 'last', 't', 'first', 'world', 'two', 'government', 'time', 'now', 'uk', 'years', 'just', 'no', 'make', 'best', 'told', 'get', 'game', 'made', 'film', 'like', 'music', 'many', 'labour', '000', 'next', 'bbc', 'back', 'three', 'number', 'take', 'added', 'way', 'set', 'well', 'says', 'may', 'market', 'company', 'home', 'good', '2004', 'going', 'still', 'england', 'games', 'election', 'party', 'much', 'win', 'since', 'firm', 'work', 'go', 'blair', 'won', 'show', 'think', 'use', 'say', 'week', 'million', 'play', 'part', 'off', 'minister', 'want', 'public', 'top', 'technology', 'second', 'see', 'british', 'used', 'players', 'news', 'european', 'mobile', 'however', 'country', 'tv', 'group', 'even', 'sales', 'expected', 'end', 'plans']
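
The first two entries of the vocabulary above are reserved: index 0 is the padding token `''` and index 1 is `[UNK]`, which stands in for words that are not in the vocabulary. The remaining words are ordered from most to least frequent. A plain-Python sketch of that idea on a toy corpus follows. `build_vocab` is a hypothetical helper, not part of Keras, and the real layer may break frequency ties differently.

```python
from collections import Counter

def build_vocab(texts):
    """Return a vocabulary list: padding, [UNK], then words by descending count."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by count (descending), then alphabetically so the order is deterministic
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return ["", "[UNK]"] + ordered

toy_corpus = ["tv show tv", "film show tv"]
toy_vocab = build_vocab(toy_corpus)
print(toy_vocab)
```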


## Generate tokens
Now that we have a vocabulary, we can use it to generate tokens for the sentences. Calling the adapted vectorizer on text returns the corresponding sequence of tokens.

In [39]:
# Vectorize a single sample sentence
sample_sentence = "This is my sample sentence"
sample_sentence_padded_sequences = vectorizer(sample_sentence)

print(f"sample text:{sample_sentence}\nthis is sample text tokens:{sample_sentence_padded_sequences}\n")

# Vectorize and pad sentences
padded_sequences = vectorizer(standard_sentences)

# Show the output
print(f"Padded sequences look like this: \n{padded_sequences}\n")
print(f"Tensor of all sequences has shape: {padded_sequences.shape}\n")
print(f"This means there are {padded_sequences.shape[0]} sequences in total and each one has a size of {padded_sequences.shape[1]}")

sample text:This is my sample sentence
this is sample text tokens:[ 2052  2424 17561  6607  2095]

Padded sequences look like this: 
[[   93   155  1186 ...     0     0     0]
 [ 1560   611   277 ...     0     0     0]
 [ 4960  6975  3850 ...     0     0     0]
 ...
 [ 5860  2138     9 ...     0     0     0]
 [  358 10057 22930 ...     0     0     0]
 [ 2405  7987   886 ...     0     0     0]]

Tensor of all sequences has shape: (2225, 2418)

This means there are 2225 sequences in total and each one has a size of 2418
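
The second dimension, 2418, is the length of the longest tokenized article. Shorter articles are filled with zeros on the right, which is why the rows shown above end in 0. The sketch below reproduces the lookup and the padding in plain Python on a toy vocabulary. `encode` and `pad_to_longest` are hypothetical helpers written for this illustration, not Keras functions.

```python
def encode(text, vocab):
    """Map each word to its vocabulary index, using 1 ([UNK]) for unknown words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.lower().split()]

def pad_to_longest(sequences):
    """Right-pad every sequence with 0 up to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [0] * (longest - len(seq)) for seq in sequences]

# Toy vocabulary: index 0 is padding, index 1 is [UNK]
toy_vocab = ["", "[UNK]", "tv", "show", "film"]
encoded = [encode(text, toy_vocab) for text in ["tv show", "film about tv show"]]
padded = pad_to_longest(encoded)
print(padded)
```

Here "about" is not in the toy vocabulary, so it is encoded as 1, and the shorter sentence is padded with two zeros to match the longer one.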


With the sentences vectorized, it is time to encode the labels as integers.

In [40]:
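def fit_label_encoder(labels):
    """Create a StringLookup layer and fit it to the category labels."""
    # num_oov_indices=0: do not reserve an index for unknown labels
    label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0)
    label_encoder.adapt(labels)
    return label_encoder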

In [41]:
# Create the encoder adapted to the labels
label_encoder = fit_label_encoder(labels)

# Get the vocabulary
vocabulary = label_encoder.get_vocabulary()

# Encode labels
label_sequences = label_encoder(labels)

print(f"Vocabulary of labels looks like this: {vocabulary}\n")
print(f"First ten labels: {labels[:10]}\n")
print(f"First ten label sequences: {label_sequences[:10]}\n")

Vocabulary of labels looks like this: ['sport', 'business', 'politics', 'tech', 'entertainment']

First ten labels: ['tech', 'business', 'sport', 'sport', 'entertainment', 'politics', 'politics', 'sport', 'sport', 'entertainment']

First ten label sequences: [3 1 0 0 4 2 2 0 0 4]
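
Because `num_oov_indices=0` reserves no index for unknown labels, the five categories map directly to the integers 0 through 4, and the vocabulary above appears to be ordered from the most to the least frequent label. A plain-Python sketch of the same mapping on toy labels is shown below. `fit_labels` is a hypothetical helper, not the Keras implementation, and tie-breaking between equally frequent labels may differ.

```python
from collections import Counter

def fit_labels(labels):
    """Return a label -> index mapping, most frequent label first, no OOV slot."""
    counts = Counter(labels)
    ordered = [label for label, _ in counts.most_common()]
    return {label: i for i, label in enumerate(ordered)}

toy_labels = ["tech", "sport", "sport", "business", "sport", "business"]
label_to_index = fit_labels(toy_labels)
encoded_labels = [label_to_index[label] for label in toy_labels]
print(label_to_index)
print(encoded_labels)
```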

