## Coursework1 Part 2

Import the libraries required for data processing (pandas, numpy), natural language processing (nltk) and machine learning (sklearn), together with the evaluation metrics used later.

Three feature sets are extracted from the BBC News dataset to train the classification model: bag-of-words counts over the full text, bag-of-words counts over the headline only, and TF-IDF weights over the full text.

In [5]:
import operator
import os

import nltk
import numpy as np
import pandas as pd
import sklearn
import sklearn.svm  # load the submodule explicitly so that sklearn.svm.SVC is always available
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

nltk.download('stopwords') 
nltk.download('punkt') 
nltk.download('wordnet') 
nltk.download('omw-1.4') 

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vertigo/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/vertigo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/vertigo/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/vertigo/nltk_data...


True

### Data Preprocessing

`load_data` reads the dataset from the `bbc` directory, where each subdirectory is one category. For every file ending in `.txt` it splits the contents into a list of lines and stores that list in the `content` column of the DataFrame, with the category's integer id in the `category` column. It also returns the dictionary that maps category names to ids.

In [6]:
# Load the data
def load_data(directory):
    data = []
    category_dict = {}
    # Iterate through the subdirectories
    for entry in os.scandir(directory):
        if entry.is_dir():
            # Create a category dictionary
            category_dict[entry.name] = len(category_dict)
            # Iterate through the files in the subdirectory
            for sub_entry in os.scandir(entry):
                if sub_entry.is_file() and sub_entry.name.endswith('.txt'):
                    # Open the file; the with-block closes it automatically
                    with open(sub_entry, 'r', encoding='latin1') as file:
                        # Append the content (as a list of lines) and the category id
                        data.append([file.read().splitlines(), category_dict[entry.name]])
    return pd.DataFrame(data, columns=["content", "category"]), category_dict  # Return the data and category dictionary
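
To make the shape of the `content` column concrete, here is a small sketch of what `splitlines()` produces for one article. The sample text is invented for illustration and is not taken from the dataset; it assumes, as the later `title=True` option does, that the first line of each file is the headline.

```python
# Hypothetical article text: a headline, a blank line, then body paragraphs
sample_article = "Example headline about markets\n\nFirst paragraph of the body.\nSecond paragraph."

# load_data stores this list (not a single string) in the "content" column
lines = sample_article.splitlines()

headline = lines[0]                           # the line that title=True would tokenize
body = [line for line in lines[1:] if line]   # remaining non-empty lines

print(lines)
print(headline)
```

Blank lines in the file survive as empty strings in the list, which is harmless later on because tokenizing an empty string yields no tokens.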


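Load the dataset and split it into training, development and test sets with an 80%/10%/10% ratio for subsequent model training. The split is done in two steps: 20% of the data is held out first and then halved into the development and test sets.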

In [7]:
dataset_full, category_dict = load_data('bbc')
# split train/test/dev data into 80%/10%/10%
X_train, X_test, y_train, y_test = train_test_split(dataset_full['content'], dataset_full['category'], test_size=0.2,
                                                    random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
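
The two-step split can be checked with simple arithmetic. The sketch below assumes the dataset has 2,225 articles (the size of the standard BBC News release, which is not stated in this notebook) and assumes that a fractional `test_size` is rounded up, as scikit-learn's `train_test_split` is understood to do. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional hold-out split.

    Assumption: the held-out part is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 2225  # assumed size of the dataset
n_train, n_holdout = split_sizes(n_total, 0.2)   # first call: 80% / 20%
n_dev, n_test = split_sizes(n_holdout, 0.5)      # second call: halve the hold-out

print(n_train, n_dev, n_test)
```

Under these assumptions the development set is small, so a single misclassified article moves the reported development accuracy by roughly half a percentage point.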

### Feature Extraction

The bag-of-words feature model. The functions below tokenise and lemmatise the text, normalise it to lower case, remove stop words, build a vocabulary of the most frequent training-set words, and turn each article into a count vector over that vocabulary.

In [8]:
# Initialize stopwords: NLTK's English list plus punctuation tokens emitted by the tokenizer
stopwords = set(nltk.corpus.stopwords.words("english"))
stopwords.update([".", ",", "--", "-", "``", "''", "'", "(", ")", "%", "$", ":", ";", "'s"])


# Convert an article (a list of lines) to lemmatized, lower-cased tokens.
# With title=True only the first line (the headline) is tokenized.
def get_list_tokens(pd_array, title=False):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    list_tokens = []
    lines = [pd_array[0]] if title else pd_array
    for content in lines:
        for sentence in nltk.tokenize.sent_tokenize(content):
            for token in nltk.tokenize.word_tokenize(sentence):
                # Lemmatize, then convert to lowercase
                list_tokens.append(lemmatizer.lemmatize(token).lower())
    return list_tokens


# Building a vocabulary
def get_vocabulary(training_set, num_features, title=False):
    dict_word_frequency = {}
    for instance in training_set:
        # Get the tokens
        sentence_tokens = get_list_tokens(instance, title)
        for word in sentence_tokens:
            # Skip stopwords
            if word in stopwords:
                continue
            # Count the word
            dict_word_frequency[word] = dict_word_frequency.get(word, 0) + 1
    # Keep the num_features most frequent words as the vocabulary
    sorted_list = sorted(dict_word_frequency.items(), key=operator.itemgetter(1), reverse=True)[:num_features]
    return [word for word, frequency in sorted_list]


# Get the vector representation of the text
def get_vector_text(list_vocab, string, title=False):
    vector_text = np.zeros(len(list_vocab))
    # Get the tokens
    list_tokens_string = get_list_tokens(string, title)
    for i, word in enumerate(list_vocab):
        if word in list_tokens_string:
            # Count the frequency of the word
            vector_text[i] = list_tokens_string.count(word)
    return vector_text


# Get the vector representation of the data
def get_vector_data(x_input, y_input, vocabulary_input, title=False):
    x_vector = []
    y_vector = []
    for i in x_input.index:
        # Append the vector representation of the text
        x_vector.append(get_vector_text(vocabulary_input, x_input.loc[i], title))
        # Append the category
        y_vector.append(y_input.loc[i])
    return x_vector, y_vector
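
As a rough illustration of what these functions compute, here is a toy bag-of-words pipeline in plain Python. It swaps in whitespace tokenisation and a three-word stop list for the NLTK tokenizer, lemmatizer and stop-word list used above, and all `toy_*` names and example sentences are made up for this sketch.

```python
from collections import Counter

# Toy stand-in for the NLTK stop-word list (assumption for this sketch)
toy_stopwords = {"the", "a", "on"}

def toy_tokens(text):
    # Lowercase and split on whitespace instead of using NLTK
    return text.lower().split()

def toy_vocabulary(texts, num_features):
    # Count every non-stopword token across the training texts
    counts = Counter(t for text in texts for t in toy_tokens(text) if t not in toy_stopwords)
    # Keep the num_features most frequent words
    return [word for word, _ in counts.most_common(num_features)]

def toy_vector(vocab, text):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(toy_tokens(text))
    return [counts[word] for word in vocab]

train_texts = ["the match ended the match", "a goal on the pitch", "match report goal"]
vocab = toy_vocabulary(train_texts, 2)
vector = toy_vector(vocab, "goal goal match the")
print(vocab, vector)
```

Words outside the vocabulary ("the" in the last call) simply contribute nothing to the vector, which is why the vocabulary size acts as the tunable number of features.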

### Model Training and Validation

The model is then trained with a linear-kernel SVM classifier and validated on the development set. Several vocabulary sizes are tried and the development-set accuracy of each is printed, together with the best one. The model is finally evaluated on the test set using accuracy and macro-averaged precision, recall and F1 score.
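
Macro averaging means that each metric is computed per class and the per-class values are then averaged with equal weight, regardless of class size. The sketch below spells this out in plain Python as an illustration of the idea behind `average="macro"`; `macro_scores` is a hypothetical helper, and the labels are invented.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1: per-class scores, then an unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f_score = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

Note that the macro F1 is the mean of the per-class F1 values, not the harmonic mean of the macro precision and macro recall.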

In [9]:
# feature: Word Frequency Feature (Bag of Words)
# Set the SVM classifier
def set_svm_classifier(x_train_svm, y_train_svm):
    svm_clf = sklearn.svm.SVC(kernel="linear", gamma='auto')
    svm_clf.fit(np.asarray(x_train_svm), np.asarray(y_train_svm))
    return svm_clf

# Evaluate a trained model on a given data split
def validation(model, vocabulary_val, x_input, y_input, title=False):
    # Get the vector representation of the data
    x_val, y_val = get_vector_data(x_input, y_input, vocabulary_val, title)
    
    # Get the predictions
    predictions = model.predict(x_val)
    y_val = np.asarray(y_val)

    accuracy = accuracy_score(y_val, predictions)
    precision = precision_score(y_val, predictions, average="macro")
    recall = recall_score(y_val, predictions, average="macro")
    f_score = f1_score(y_val, predictions, average="macro")

    return accuracy, precision, recall, f_score

# Adjust the number of features
def adjust_feature_bow(x_train_input, y_train_input, x_dev_input, y_dev_input, list_num_input, title=False):
    best_accuracy_dev = 0.0
    for num_features in list_num_input:
        # Get the vocabulary
        vocabulary_feature = get_vocabulary(x_train_input, num_features, title)
        # Get the vector representation of the data
        x_train_feature, y_train_feature = get_vector_data(x_train_input, y_train_input, vocabulary_feature, title)
        # Set the SVM classifier
        svm_model = set_svm_classifier(x_train_feature, y_train_feature)
        accuracy, precision, recall, f_score = validation(svm_model, vocabulary_feature, x_dev_input, y_dev_input,
                                                          title)
        print("Accuracy with " + str(num_features) + ": " + str(round(accuracy, 3)))
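        # Track the best dev accuracy (on ties, the later entry in the list wins)
        if accuracy >= best_accuracy_dev:
            best_accuracy_dev = accuracy
            best_num_features = num_features
    print("\n Best accuracy overall in the dev set is " + str(round(best_accuracy_dev, 3)) + " with " + str(
        best_num_features) + " features.\n\n")

    # NOTE: the model and vocabulary returned are those of the last iteration
    # (the final entry of list_num_input), not the best-scoring configuration
    return svm_model, vocabulary_feature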


# Model Testing
def test_bow_performance(x_train_input, y_train_input, x_test_input, y_test_input, x_dev_input, y_dev_input,
                         list_num_input, title=False):
    # Sweep the feature counts; the model returned is the one trained last
    model_test, vocabulary_test = adjust_feature_bow(x_train_input, y_train_input, x_dev_input, y_dev_input,
                                                     list_num_input, title)
    # Testing with Test Sets
    accuracy, precision, recall, f_score = validation(model_test, vocabulary_test, x_test_input, y_test_input, title)
    print("Accuracy: " + str(round(accuracy, 3)))
    print("macro-averaged precision: " + str(round(precision, 3)))
    print("macro-averaged recall: " + str(round(recall, 3)))
    print("macro-averaged F_score: " + str(round(f_score, 3)))
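
The sweep above returns whichever model was trained last, so the configuration reported as best on the development set is not necessarily the one evaluated afterwards. One way to carry the best candidate forward is sketched below. `sweep_keep_best`, `train_fn` and `score_fn` are hypothetical names, the callables stand in for the vocabulary building, SVM training and validation steps, and the accuracies are made up purely to exercise the function.

```python
def sweep_keep_best(candidates, train_fn, score_fn):
    """Train one model per candidate and return (score, candidate, model) for the best one."""
    best = None
    for candidate in candidates:
        model = train_fn(candidate)
        score = score_fn(model)
        # Strict ">" keeps the earlier candidate when scores tie
        if best is None or score > best[0]:
            best = (score, candidate, model)
    return best

# Invented dev accuracies, not results from this notebook
fake_dev_accuracy = {300: 0.90, 500: 0.95, 700: 0.95, 900: 0.93}

score, num_features, model = sweep_keep_best(
    sorted(fake_dev_accuracy),
    train_fn=lambda n: {"num_features": n},                     # dummy "model"
    score_fn=lambda m: fake_dev_accuracy[m["num_features"]],    # dummy dev-set score
)
print(score, num_features)
```

Preferring the earlier candidate on ties favours the smaller vocabulary, which is a design choice of this sketch; the notebook's `>=` comparison makes the opposite choice.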

### Model BoW (Full text) Testing 

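Full-text feature extraction with the bag-of-words model. The feature counts listed below are tried in turn and the development-set accuracy of each is printed, followed by accuracy and macro-averaged precision, recall and F1 score on the test set. Because `adjust_feature_bow` returns the model trained last, the test metrics here come from the 1000-feature model, whereas the best development accuracy was reached with 700 features.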

In [10]:
# feature: Word Frequency Feature (Full Text)
list_num_features = [300, 400, 500, 600, 700, 800, 900, 1000]
test_bow_performance(X_train, y_train, X_test, y_test, X_dev, y_dev, list_num_features, title=False)

Accuracy with 300: 0.914
Accuracy with 400: 0.937
Accuracy with 500: 0.946
Accuracy with 600: 0.95
Accuracy with 700: 0.955
Accuracy with 800: 0.946
Accuracy with 900: 0.946
Accuracy with 1000: 0.941

 Best accuracy overall in the dev set is 0.955 with 700 features.

Accuracy: 0.942
macro-averaged precision: 0.938
macro-averaged recall: 0.946
macro-averaged F_score: 0.941


### Model BoW (Title) Testing

Headline-only feature extraction with the bag-of-words model (`title=True`). Development accuracy rises steadily up to the largest vocabulary tried (1400 features), which is also the model evaluated on the test set. The same metrics are reported as for the full-text model.

In [11]:
# feature: Word Frequency Feature (Title)
list_num_features = [700, 800, 900, 1000, 1100, 1200, 1300, 1400]
test_bow_performance(X_train, y_train, X_test, y_test, X_dev, y_dev, list_num_features, title=True) 
# The title parameter controls the feature extraction range

Accuracy with 700: 0.73
Accuracy with 800: 0.739
Accuracy with 900: 0.748
Accuracy with 1000: 0.757
Accuracy with 1100: 0.761
Accuracy with 1200: 0.784
Accuracy with 1300: 0.788
Accuracy with 1400: 0.793

 Best accuracy overall in the dev set is 0.793 with 1400 features.


Accuracy: 0.78
macro-averaged precision: 0.784
macro-averaged recall: 0.779
macro-averaged F_score: 0.78


### Feature Extraction from Full Text using TF-IDF Model

TF-IDF feature extraction from the full text. The functions below mirror the bag-of-words pipeline: `adjust_feature_tfidf` fits a `TfidfVectorizer` and a linear SVM for each candidate feature count and prints the development-set accuracy, and `test_tfidf_performance` reports accuracy and macro-averaged precision, recall and F1 score on the test set.

In [12]:
# feature: TF-IDF
def adjust_feature_tfidf(x_train_input, y_train_input, x_dev_input, y_dev_input, list_num_input):
    best_accuracy_dev = 0.0
    # Join each article's list of lines back into a single string
    x_train_joined = x_train_input.apply(' '.join)
    x_dev_joined = x_dev_input.apply(' '.join)
    # Iterate through the number of features
    for num_features in list_num_input:
        # Fit the TF-IDF vectorizer on the training set and transform it in one step
        tfidf = TfidfVectorizer(max_features=num_features, stop_words='english')
        tfidf_x_train = tfidf.fit_transform(x_train_joined)
        # Set the SVM classifier
        svm_model = sklearn.svm.SVC(kernel="linear", gamma="auto")
        svm_model.fit(tfidf_x_train, y_train_input.to_numpy())
        tfidf_x_dev = tfidf.transform(x_dev_joined)
        # Get the predictions
        predictions_idf = svm_model.predict(tfidf_x_dev)
        tfidf_y_dev = y_dev_input.to_numpy()
        # Get the accuracy
        accuracy = accuracy_score(tfidf_y_dev, predictions_idf)
        print("Accuracy with " + str(num_features) + ": " + str(round(accuracy, 3)))
        if accuracy >= best_accuracy_dev:
            best_accuracy_dev = accuracy
            best_num_features = num_features
    print("\n Best accuracy overall in the dev set is " + str(round(best_accuracy_dev, 3)) + " with " + str(
        best_num_features) + " features.\n\n")

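    # NOTE: the model and vectorizer returned are those of the last iteration
    # (the final entry of list_num_input), not the best-scoring configuration
    return svm_model, tfidf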


def test_tfidf_performance(x_train_input, y_train_input, x_test_input, y_test_input, x_dev_input, y_dev_input,
                           list_num_input):
    # Sweep the feature counts; the model returned is the one trained last
    model_test, tfidf_test = adjust_feature_tfidf(x_train_input, y_train_input, x_dev_input, y_dev_input,
                                                  list_num_input)
    x_test_joined = x_test_input.apply(' '.join)
    tfidf_x_test = tfidf_test.transform(x_test_joined)
    # Testing with Test Sets
    predictions_idf = model_test.predict(tfidf_x_test)
    tfidf_y_test = y_test_input.to_numpy()
    accuracy = accuracy_score(tfidf_y_test, predictions_idf)
    precision = precision_score(tfidf_y_test, predictions_idf, average="macro")
    recall = recall_score(tfidf_y_test, predictions_idf, average="macro")
    f_score = f1_score(tfidf_y_test, predictions_idf, average="macro")
    print("Accuracy: " + str(round(accuracy, 3)))
    print("macro-averaged precision: " + str(round(precision, 3)))
    print("macro-averaged recall: " + str(round(recall, 3)))
    print("macro-averaged F_score: " + str(round(f_score, 3)))
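
To show how TF-IDF weighting differs from raw counts, here is a toy version in plain Python. It uses raw term counts, `idf = ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which this sketch assumes to match the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). Tokenisation is plain whitespace splitting with no stop-word removal, and `toy_tfidf` and the example documents are invented.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenised for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(t for tokens in tokenised for t in set(tokens))
    # Smoothed inverse document frequency (assumed default formula)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

docs = ["market shares rise", "market shares fall", "team wins match"]
vocab, vectors = toy_tfidf(docs)
weights = dict(zip(vocab, vectors[0]))
print(weights["rise"] > weights["market"])
```

In the first document every word occurs once, yet "rise" receives a larger weight than "market" because it appears in fewer documents. This down-weighting of common terms is the main difference from the count vectors used in the bag-of-words runs.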

### Model TF-IDF Testing

Run the TF-IDF pipeline on the full text with the same candidate feature counts as the full-text bag-of-words model. Development accuracy peaks at 0.968 for both 700 and 800 features, and the `>=` comparison reports the later of the two. As before, the test metrics come from the model trained last (1000 features).

In [13]:
# feature: TF-IDF
list_num_features = [300, 400, 500, 600, 700, 800, 900, 1000]
test_tfidf_performance(X_train, y_train, X_test, y_test, X_dev, y_dev, list_num_features)

Accuracy with 300: 0.937
Accuracy with 400: 0.95
Accuracy with 500: 0.955
Accuracy with 600: 0.959
Accuracy with 700: 0.968
Accuracy with 800: 0.968
Accuracy with 900: 0.959
Accuracy with 1000: 0.959

 Best accuracy overall in the dev set is 0.968 with 800 features.


Accuracy: 0.96
macro-averaged precision: 0.958
macro-averaged recall: 0.963
macro-averaged F_score: 0.96
