<a href="https://colab.research.google.com/github/HelenLit/bbc-news_nlp/blob/main/bbc_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exploring the BBC News archive

This project uses a variation of the [BBC News Classification Dataset](https://www.kaggle.com/c/learn-ai-bbc/overview), which contains 2225 news articles, each labelled with its category.

In [13]:
!wget https://raw.githubusercontent.com/HelenLit/bbc-news_nlp/main/bbc-text.csv
!wget https://raw.githubusercontent.com/HelenLit/bbc-news_nlp/main/bbc-text-minimal.csv

--2023-08-31 18:15:24--  https://raw.githubusercontent.com/HelenLit/bbc-news_nlp/main/bbc-text.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5057493 (4.8M) [text/plain]
Saving to: ‘bbc-text.csv.2’


2023-08-31 18:15:24 (57.5 MB/s) - ‘bbc-text.csv.2’ saved [5057493/5057493]

--2023-08-31 18:15:24--  https://raw.githubusercontent.com/HelenLit/bbc-news_nlp/main/bbc-text-minimal.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11337 (11K) [text/plain]
Saving to: ‘bbc-text-minimal.csv’


2023-08-31 18:15:24 (39.2 MB/s) - ‘bbc-

In [6]:
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

The structure of the CSV file that contains the data looks like this:

In [15]:
with open("./bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")

First line (header) looks like this:

category,text

Each data point looks like this:

tech,tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially

Each data point is composed of the category of the news article followed by a comma and then the actual text of the article.
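
Because each row is `category,text`, Python's `csv` module can split it reliably, even if a text field were quoted and contained a comma. The snippet below is a minimal sketch on a made-up two-row sample (the rows are hypothetical and are not taken from the dataset):

```python
import csv
import io

# Hypothetical sample in the same "category,text" layout as the dataset
sample = (
    "category,text\n"
    "tech,tv future in the hands of viewers\n"
    'sport,"a quoted text field, with a comma"\n'
)

reader = csv.reader(io.StringIO(sample), delimiter=",")
header = next(reader)                          # first row holds the column names
rows = [(row[0], row[1]) for row in reader]    # (category, text) pairs

print(header)
print(rows)
```

Splitting each line on the first comma would also work for unquoted rows, but `csv.reader` handles quoting without extra code.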

## Removing Stopwords

One important step when working with text data is to remove the **stopwords** from it. These are the most common words in the language and they rarely provide useful information for the classification process.

The `remove_stopwords` function receives a string and returns a lowercase copy of it with all of the listed stopwords removed.

In [9]:
def remove_stopwords(sentence):
    """
    Removes stopwords from a sentence

    Args:
        sentence (string): sentence to remove the stopwords from

    Returns:
        sentence (string): lowercase sentence without the stopwords
    """
    # Set of stopwords for faster lookup
    stopwords = set(["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ])

    # Converting the sentence to lowercase and splitting it into words
    sentence = sentence.lower()
    words = sentence.split()
    # Removing stopwords using a list comprehension
    words_without_stopwords = [word for word in words if word not in stopwords]

    # Joining the words back into a sentence
    sentence = ' '.join(words_without_stopwords)
    return sentence
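
As a quick check of the approach, here is a minimal sketch of the same logic with a small, illustrative subset of the stopword list (`remove_stopwords_demo` and `demo_stopwords` are names made up for this example). It also shows one limitation: words are matched after splitting on whitespace, so a stopword with punctuation attached is kept.

```python
def remove_stopwords_demo(sentence, stopwords):
    """Lowercases the sentence and drops every whitespace-separated word found in stopwords."""
    words = sentence.lower().split()
    return " ".join(word for word in words if word not in stopwords)

# Illustrative subset of the full stopword list used above
demo_stopwords = {"i", "am", "about", "to", "the", "and", "any"}

result = remove_stopwords_demo("I am about to go to the store and get any snack", demo_stopwords)
print(result)  # go store get snack

# "and," is not equal to "and", so it is not removed
with_punctuation = remove_stopwords_demo("And, the end", demo_stopwords)
print(with_punctuation)  # and, end
```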

## Reading the raw data

The `parse_data_from_file` function reads the labels and the texts from the CSV file and removes the stopwords from each text.

In [10]:
def parse_data_from_file(filename):
    """
    Extracts sentences and labels from a CSV file

    Args:
        filename (string): path to the CSV file

    Returns:
        sentences, labels (list of string, list of string): tuple containing lists of sentences and labels
    """
    sentences = []
    labels = []
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)  # Skipping the header row
        for row in reader:
            labels.append(row[0])
            sentence = remove_stopwords(row[1])
            sentences.append(sentence)
    return sentences, labels

In [16]:
# Testing parse_data_from_file function

# With original dataset
sentences, labels = parse_data_from_file("./bbc-text.csv")

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only the first 5 rows
mini_sentences, mini_labels = parse_data_from_file("./bbc-text-minimal.csv")

print("MINIATURE DATASET:\n")
print(f"There are {len(mini_sentences)} sentences in the miniature dataset.\n")
print(f"First sentence has {len(mini_sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(mini_labels)} labels in the miniature dataset.\n")
print(f"The first 5 labels are {mini_labels[:5]}")

ORIGINAL DATASET:

There are 2225 sentences in the dataset.

First sentence has 436 words (after removing stopwords).

There are 2225 labels in the dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']


MINIATURE DATASET:

There are 5 sentences in the miniature dataset.

First sentence has 436 words (after removing stopwords).

There are 5 labels in the miniature dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']


## Using the Tokenizer

The `fit_tokenizer` function tokenizes the sentences of the dataset.

It receives the list of sentences as input and returns a [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) that has been fitted to those sentences.

In [27]:
max_length = 120
embedding_dim = 16
trunc_type = 'post'
pad_type = 'post'
oov_tok = "<OOV>"


In [19]:
def fit_tokenizer(sentences):
    """
    Instantiates the Tokenizer class and fits it on the sentences

    Args:
        sentences (list): lower-cased sentences without stopwords

    Returns:
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary
    """

    # Instantiating the Tokenizer class by passing in the oov_token argument
    tokenizer = Tokenizer(oov_token=oov_tok)

    # Fitting on the sentences
    tokenizer.fit_on_texts(sentences)

    return tokenizer
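
To make the idea of a fitted word index more concrete, the following is a simplified, pure-Python sketch of a frequency-ranked vocabulary with an out-of-vocabulary token. It is an illustration only, not the Keras implementation (which, among other things, also filters punctuation); `build_word_index` and `to_sequence` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Ranks words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Maps each word to its index, falling back to the OOV index for unseen words."""
    oov = word_index[oov_token]
    return [word_index.get(word, oov) for word in sentence.lower().split()]

word_index_demo = build_word_index(["tv future hands viewers", "tv viewers watch tv"])
print(word_index_demo)

sequence_demo = to_sequence("tv viewers love plasma", word_index_demo)
print(sequence_demo)  # [2, 3, 1, 1]
```

Words that were not seen during fitting ("love" and "plasma" here) all map to the OOV index instead of being dropped.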

In [18]:
tokenizer = fit_tokenizer(sentences)
word_index = tokenizer.word_index

print(f"Vocabulary contains {len(word_index)} words\n")
print("<OOV> token included in vocabulary" if "<OOV>" in word_index else "<OOV> token NOT included in vocabulary")

Vocabulary contains 29714 words

<OOV> token included in vocabulary


In [25]:
def get_padded_sequences(tokenizer, sentences):
    """
    Generates an array of token sequences and pads them to the same length

    Args:
        tokenizer (object): Tokenizer instance containing the word-index dictionary
        sentences (list of string): list of sentences to tokenize and pad

    Returns:
        padded_sequences (array of int): tokenized sentences padded to the same length
    """
    # Converting sentences to sequences
    sequences = tokenizer.texts_to_sequences(sentences)

    # Padding (and truncating) the sequences at the end, i.e. the 'post' strategy
    padded_sequences = pad_sequences(sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)

    return padded_sequences

In [28]:
padded_sequences = get_padded_sequences(tokenizer, sentences)
print(f"First padded sequence looks like this: \n\n{padded_sequences[0]}\n")
print(f"Numpy array of all sequences has shape: {padded_sequences.shape}\n")
print(f"This means there are {padded_sequences.shape[0]} sequences in total and each one has a size of {padded_sequences.shape[1]}")

First padded sequence looks like this: 

[   96   176  1157  1220    54  1122   742  5211    85  1074  4267   147
   184  4127  1344  1311  1595    47     9   949    96     4  6516   329
    92    23    17   140  3128  1330  2519   576   419  1277    72  2963
  3046  1755    10   894     4   755    12   954 19513    11   656  1578
  1053   414     4  1999  1220   778    54   502  1497  2114  1652   135
   333   123  2744   817  5212  1088   609    12  4413  4128   894  2580
   147   351   184  4127  8812  5798    44    73  3218    31    11     2
  5473    22     2  1397   145   454     9   138  1398    82  4598   488
  5213    96  1053    87  6517    83  2115    63  8813    96     8  1123
   634    85  1074    96  1970   148   159   420    11  2879    46    56]

Numpy array of all sequences has shape: (2225, 120)

This means there are 2225 sequences in total and each one has a size of 120
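
The first sequence above contains no zeros because that article has more than 120 words after stopword removal, so it was truncated rather than padded. The effect of the `'post'` setting on both short and long sequences can be sketched in plain Python as follows (`pad_post` is a hypothetical helper written for illustration, not part of Keras):

```python
def pad_post(sequences, maxlen, value=0):
    """Truncates each sequence at the end and pads it at the end up to maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # 'post' truncation keeps the first maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))   # 'post' padding appends the fill value
    return padded

padded_demo = pad_post([[5, 3], [7, 2, 9, 4, 8, 6]], maxlen=4)
print(padded_demo)  # [[5, 3, 0, 0], [7, 2, 9, 4]]
```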


In [21]:
def tokenize_labels(labels):
    """
    Tokenizes the labels

    Args:
        labels (list of string): labels to tokenize

    Returns:
        label_sequences, label_word_index (list of list of int, dictionary): tokenized labels and the word-index
    """
    # Instantiating the Tokenizer class
    label_tokenizer = Tokenizer()

    # Fitting the tokenizer to the labels
    label_tokenizer.fit_on_texts(labels)

    # Saving the word index
    label_word_index = label_tokenizer.word_index

    # Saving the sequences
    label_sequences = label_tokenizer.texts_to_sequences(labels)

    return label_sequences, label_word_index

In [22]:
label_sequences, label_word_index = tokenize_labels(labels)
print(f"Vocabulary of labels looks like this {label_word_index}\n")
print(f"First ten sequences {label_sequences[:10]}\n")

Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]
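
Each label comes back as a one-element list, and the indices start at 1 because the word index does not use 0. If a later step expects flat class ids starting at 0 (an assumption, since the notebook does not show that step), the conversion could look like this minimal sketch:

```python
# First five label sequences, copied from the output above
label_sequences_demo = [[4], [2], [1], [1], [5]]

# Flattening the one-element lists and shifting from 1-based to 0-based ids
zero_based = [seq[0] - 1 for seq in label_sequences_demo]
print(zero_based)  # [3, 1, 0, 0, 4]
```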

