# 1. Downloading the Dataset

In this section, the literary dataset used throughout the notebook is downloaded. It consists of nine classic novels from Project Gutenberg, three each by Jane Austen, Charles Dickens, and Mark Twain. These texts will serve as the foundation for building word embeddings and training classification models in later sections.

In [1]:
import requests
import os

In [None]:
"""
This dictionary maps short identifiers of classic English literature books 
to their corresponding raw text file URLs from Project Gutenberg. 

The keys follow the pattern: <author_lastname>_<short_title>.
The values are direct links to the plain text (.txt) versions of the books.
"""


books_urls = {
    "austen_pride": "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
    "austen_emma": "https://www.gutenberg.org/cache/epub/158/pg158.txt",
    "austen_sense": "https://www.gutenberg.org/cache/epub/161/pg161.txt",

    "dickens_two_cities": "https://www.gutenberg.org/cache/epub/98/pg98.txt",
    "dickens_expectations": "https://www.gutenberg.org/cache/epub/1400/pg1400.txt",
    "dickens_twist": "https://www.gutenberg.org/cache/epub/730/pg730.txt",

    "twain_tom": "https://www.gutenberg.org/cache/epub/74/pg74.txt",
    "twain_huck": "https://www.gutenberg.org/cache/epub/76/pg76.txt",
    "twain_prince": "https://www.gutenberg.org/cache/epub/1837/pg1837.txt",
}

In [None]:
"""
Downloads a collection of classic literature books from their URLs and 
saves them as plain text files in a local directory.

For each book:
1. It sends an HTTP GET request to the Project Gutenberg URL.
2. It writes the full text content to a UTF-8 encoded `.txt` file.
3. It prints a confirmation message with the file path.

The output files are stored in the `../data/` directory and are named 
based on the keys of the `books_urls` dictionary.
"""


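os.makedirs("../data", exist_ok = True)  # make sure the target directory exists

for name, url in books_urls.items():
	response = requests.get(url, timeout = 30)
	response.raise_for_status()  # fail early instead of saving an HTTP error page
	path = f"../data/{name}.txt"

	with open(path, "w", encoding = "utf-8") as f:
		f.write(response.text)
	print(f"Book downloaded: {path}")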

Book downloaded: ../data/austen_pride.txt
Book downloaded: ../data/austen_emma.txt
Book downloaded: ../data/austen_sense.txt
Book downloaded: ../data/dickens_two_cities.txt
Book downloaded: ../data/dickens_expectations.txt
Book downloaded: ../data/dickens_twist.txt
Book downloaded: ../data/twain_tom.txt
Book downloaded: ../data/twain_huck.txt
Book downloaded: ../data/twain_prince.txt
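
Re-running the download cell fetches all nine files again. A minimal sketch of a cache-aware variant is shown below. `save_if_missing` and its `fetch` callable are hypothetical names introduced here, not part of the notebook, and the demo uses a stand-in fetcher instead of a real HTTP request so that it runs offline.

```python
import os
import tempfile

def save_if_missing(name, fetch, data_dir):
    """Write fetch() to <data_dir>/<name>.txt unless that file already exists.

    `fetch` is a zero-argument callable returning the book text; in the
    notebook it could wrap the requests.get call (an assumption, not shown).
    Returns (path, downloaded_flag).
    """
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, f"{name}.txt")
    if os.path.exists(path):
        return path, False
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch())
    return path, True

# Offline demo with a stand-in fetcher and a temporary directory.
demo_dir = tempfile.mkdtemp()
first = save_if_missing("austen_pride", lambda: "sample text", demo_dir)
second = save_if_missing("austen_pride", lambda: "other text", demo_dir)
print(first[1], second[1])
```

The second call finds the file written by the first one and skips the fetch, so the original content is preserved.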


# 2. Data Preprocessing and Loading

In this section, the raw book texts are cleaned and preprocessed to make them suitable for training machine learning models. This involves removing metadata, tokenizing sentences and words, filtering out stopwords and non-alphabetic tokens, and splitting the texts into manageable chunks. Once processed, the data is loaded and stored so it can be easily accessed and reused throughout the notebook.

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import nltk
import re

In [None]:
"""
Downloads the required NLTK tokenization models and prepares resources 
for text preprocessing. Specifically, this section:

1. Downloads the 'punkt' and 'punkt_tab' models, which are used by NLTK 
   for sentence and word tokenization.
2. Loads the standard set of English stopwords to filter out common 
   non-informative words.
3. Compiles a regular expression pattern to match valid lowercase words 
   and simple contractions (e.g., "don't", "it's") for clean token filtering.
"""


nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")  # required by stopwords.words("english") below

stop_words = set(stopwords.words("english"))

token_pattern = re.compile(r"^[a-z]+(?:'[a-z]+)?$")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nbedo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\nbedo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
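
To make the behaviour of `token_pattern` concrete, the following sketch applies the same regular expression to a handful of hand-picked candidate tokens. The candidate list is illustrative only and does not come from the corpus.

```python
import re

# Same pattern as in the notebook: lowercase letters, optionally followed
# by a single apostrophe and more lowercase letters.
token_pattern = re.compile(r"^[a-z]+(?:'[a-z]+)?$")

candidates = ["house", "don't", "o'clock", "1842", "mr.", "'s", "well-known", "Emma"]
kept = [token for token in candidates if token_pattern.match(token)]
rejected = [token for token in candidates if not token_pattern.match(token)]

print("kept:", kept)
print("rejected:", rejected)
```

Tokens with digits, punctuation, uppercase letters, or a leading apostrophe are rejected. Note that the pattern is applied to the output of `word_tokenize`, which typically splits contractions such as "don't" into separate pieces, so whole contractions may rarely reach this filter in practice.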


In [None]:
def strip_gutenberg_headers(text):

	"""
    Removes the standard Project Gutenberg header and footer from a raw text.

    Project Gutenberg texts typically contain metadata at the beginning and 
    end of the file, such as licensing information, title, author notes, and 
    disclaimers. These sections are marked by standardized delimiters like:
    
    *** START OF THE PROJECT GUTENBERG EBOOK <title> ***
    *** END OF THE PROJECT GUTENBERG EBOOK <title> ***

    This function searches for those delimiters using regular expressions 
    and extracts only the book's main content between them.

    Args:
        text (str): The full raw text of a Project Gutenberg book.

    Returns:
        str: The cleaned text containing only the book's main content, 
             without headers or footers.
    """
	
	start_match = re.search(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", text, re.IGNORECASE)
	start_index = start_match.end() if start_match else 0

	end_match = re.search(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", text, re.IGNORECASE)
	end_index = end_match.start() if end_match else len(text)

	clean_text = text[start_index:end_index].strip()
	return clean_text
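
The delimiter logic can be checked on a small synthetic text. The sketch below repeats the two regular expressions from `strip_gutenberg_headers` in a compact helper (named `strip_markers` here only to keep the example self-contained) and runs it on an invented sample, not on a real Gutenberg file.

```python
import re

START_RE = re.compile(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*", re.IGNORECASE)

def strip_markers(text):
    """Return the text between the START and END marker lines."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    start_index = start.end() if start else 0
    end_index = end.start() if end else len(text)
    return text[start_index:end_index].strip()

# Invented sample that imitates the layout of a Gutenberg file.
sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows."
)
body = strip_markers(sample)
print(body)
```

If either marker is missing, the corresponding boundary falls back to the start or end of the text, so the function never raises on unmarked input.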

In [None]:
def text_to_sent_token_lists(text):

	"""
    Converts a raw text string into a list of tokenized sentences, 
    applying basic normalization and filtering.

    The process involves:
    1. Splitting the text into individual sentences.
    2. Converting each sentence to lowercase and replacing hyphens with spaces.
    3. Tokenizing each sentence into individual word tokens.
    4. Filtering out tokens that:
       - Do not match the defined `token_pattern` (e.g., non-alphabetic tokens).
       - Are purely numeric.
       - Are English stopwords.
    5. Keeping only sentences that contain at least two valid tokens.

    Args:
        text (str): The input text to tokenize and clean.

    Returns:
        list[list[str]]: A list of sentences, where each sentence is represented 
                         as a list of cleaned word tokens.
    """

	sentences = sent_tokenize(text)
	tokenized = []

	for sentence in sentences:
		sentence = sentence.lower()
		sentence = sentence.replace("-", " ")

		words = word_tokenize(sentence)

		cleaned = [word for word in words if token_pattern.match(word)]
		cleaned = [word for word in cleaned if word and not word.isdigit()]
		cleaned = [word for word in cleaned if word not in stop_words]

		if len(cleaned) >= 2:
			tokenized.append(cleaned)
    
	return tokenized

In [None]:
def build_corpus_from_books(book_dir):
    
	"""
    Builds a tokenized sentence corpus from a directory of plain text book files.

    This function iterates through all text files in the specified directory, 
    cleans each book by removing Project Gutenberg headers and footers, 
    tokenizes the text into sentences and words, and aggregates all processed 
    sentences into a single corpus list.

    Args:
        book_dir (str): Path to the directory containing the book `.txt` files.

    Returns:
        list[list[str]]: A corpus represented as a list of tokenized sentences, 
                         where each sentence is a list of cleaned word tokens.
    """
     
	all_sentences = []
	files = sorted(file for file in os.listdir(book_dir) if file.endswith(".txt"))

	for file_name in files:
		path = os.path.join(book_dir, file_name)
		print(f"Processing {path} ...")
		
		with open(path, "r", encoding = "utf-8") as f:
			raw = f.read()
			
		stripped = strip_gutenberg_headers(raw)
		sentences = text_to_sent_token_lists(stripped)
		all_sentences.extend(sentences)
		
	print(f"\nTotal sentences in corpus: {len(all_sentences)}")
	return all_sentences

In [None]:
"""
Build the full tokenized corpus from the downloaded book collection.

This step reads all `.txt` files stored in the `../data` directory, 
cleans and tokenizes them, and returns a unified list of tokenized sentences. 
The resulting `all_sentences` corpus is later used to train the Word2Vec embedding models.
"""


all_sentences = build_corpus_from_books("../data")

Processing ../data\austen_emma.txt ...
Processing ../data\austen_pride.txt ...
Processing ../data\austen_sense.txt ...
Processing ../data\dickens_expectations.txt ...
Processing ../data\dickens_twist.txt ...
Processing ../data\dickens_two_cities.txt ...
Processing ../data\twain_huck.txt ...
Processing ../data\twain_prince.txt ...
Processing ../data\twain_tom.txt ...

Total sentences in corpus: 42516


# 3. Training Embedding Models

In this section, custom word embeddings are trained using the Word2Vec architecture with negative sampling. The preprocessed sentences are used to learn vector representations of words at different dimensionalities. These trained embeddings capture semantic relationships between words and will later be used to initialize embedding layers in classification models.
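
For intuition about what negative sampling optimizes, here is a dependency-free sketch of the loss for a single (center, context) pair under the standard skip-gram negative-sampling formulation. The function names and the toy 2D vectors are illustrative assumptions and do not mirror Gensim's internal implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) training pair.

    The true context word is pushed towards the center vector, while each
    sampled negative word is pushed away from it.
    """
    loss = -math.log(sigmoid(dot(center, context)))
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(center, negative)))
    return loss

center = [1.0, 0.0]
aligned = sgns_loss(center, context=[1.0, 0.0], negatives=[[-1.0, 0.0]])
misaligned = sgns_loss(center, context=[-1.0, 0.0], negatives=[[1.0, 0.0]])
print(round(aligned, 4), round(misaligned, 4))
```

The loss is lower when the context vector points the same way as the center vector and the negative samples point away from it, which is the geometry training tries to reach.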

In [10]:
from gensim.models import Word2Vec
import os

In [None]:
"""
Define the configuration parameters for training word embedding models.

`dimensions` specifies the different vector sizes to be used when 
training multiple embedding models (e.g., Word2Vec). Training models 
with different dimensions allows for later comparison of their 
performance or quality.

`group_code` is a simple identifier that can be used to group, label, 
or version models trained under the same experiment settings.
"""


dimensions = [50, 100, 200]
group_code = "G01"

In [None]:
def train_and_save_models(sentences, dimensions, group_code, output_dir):
    
    """
    Trains multiple Word2Vec embedding models with different vector dimensions 
    and saves both the full model and its word vectors to disk.

    For each specified dimension:
    1. A Word2Vec model is trained on the provided tokenized sentences.
    2. The model is saved in `.model` format for later reloading in Python.
    3. The learned word vectors are exported in `.vec` (text) format for 
       compatibility with external tools (e.g., Gensim's KeyedVectors, 
       visualization tools, or embedding evaluators).

    Args:
        sentences (list[list[str]]): The training corpus represented as a list 
                                     of tokenized sentences.
        dimensions (list[int]): A list of vector sizes to train models with.
        group_code (str): An identifier used to group or label model files.
        output_dir (str): Path to the directory where model files will be saved.
    """
    
    os.makedirs(output_dir, exist_ok = True)  # create the output directory if needed

    for dimension in dimensions:
        print(f"\nTraining model with dimension {dimension} ...")
        
        model = Word2Vec(
            sentences = sentences,
            vector_size = dimension,
            window = 5,
            negative = 20, 
            min_count = 2,
            workers = 8,
            sg = 1,
            epochs = 40
        )
        
        model_path = os.path.join(output_dir, f"Books_{dimension}_{group_code}.model")
        vec_path = os.path.join(output_dir, f"Books_{dimension}_{group_code}.vec")
        
        model.save(model_path)
        model.wv.save_word2vec_format(vec_path, binary = False)
        
        print(f"Model saved in:\n  - {model_path}\n  - {vec_path}")

In [None]:
"""
Train and save Word2Vec models using the preprocessed book corpus.

This step trains multiple Word2Vec models with different vector dimensions 
(defined in `dimensions`) on the `all_sentences` corpus. Each model is 
saved both in `.model` format (for reloading in Gensim) and `.vec` format 
(for interoperability with other tools). The `group_code` is used to label 
the output files consistently.
"""


train_and_save_models(all_sentences, dimensions, group_code, "../models")


Training model with dimension 50 ...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Model saved in:
  - ../models\Books_50_G01.model
  - ../models\Books_50_G01.vec

Training model with dimension 100 ...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Model saved in:
  - ../models\Books_100_G01.model
  - ../models\Books_100_G01.vec

Training model with dimension 200 ...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Model saved in:
  - ../models\Books_200_G01.model
  - ../models\Books_200_G01.vec


# 4. Loading the Trained Models

In this section, the previously trained Word2Vec embedding models are loaded from disk. These models will be used in the next steps to explore semantic relationships between words and to build embedding matrices for neural network training.

In [14]:
from gensim.models import Word2Vec

In [None]:
"""
Load the previously trained Word2Vec models from disk for later use and evaluation.

This step rebuilds the `Word2Vec` objects from the saved `.model` files 
so they can be queried (e.g., for word similarities, analogies, or 
embedding visualizations) without retraining.

The models are stored in a dictionary keyed by their vector dimensions 
for easy access.
"""


model_dir = "../models"

model_paths = {
    50: f"{model_dir}/Books_50_G01.model",
    100: f"{model_dir}/Books_100_G01.model",
    200: f"{model_dir}/Books_200_G01.model"
}

models = {}
for dimension, path in model_paths.items():
    print(f"Loading model {dimension}D from {path} ...")
    models[dimension] = Word2Vec.load(path)
    print(models[dimension])


Loading model 50D from ../models/Books_50_G01.model ...
Word2Vec<vocab=16108, vector_size=50, alpha=0.025>
Loading model 100D from ../models/Books_100_G01.model ...
Word2Vec<vocab=16108, vector_size=100, alpha=0.025>
Loading model 200D from ../models/Books_200_G01.model ...
Word2Vec<vocab=16108, vector_size=200, alpha=0.025>


# 5. 2D Visualization of Embeddings

In this section, the trained word embeddings are projected into a 2D space using the t-SNE technique. This visualization allows us to explore and identify meaningful semantic relationships, focusing on the names of the main characters from each book. By plotting the most similar words to these character names, we can observe how the model groups related terms in the embedding space.

In [16]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

In [None]:
def plot_similar_words(model, target_word, topn = 5):
    
    """
    Visualizes the most similar words to a target word using t-SNE dimensionality reduction.

    This function:
    1. Retrieves the `topn` most similar words to the `target_word` based on cosine similarity.
    2. Extracts their corresponding word vectors from the trained Word2Vec model.
    3. Uses t-SNE to project the high-dimensional vectors into a 2D space.
    4. Plots the target word and its similar words in a scatter plot for intuitive inspection.

    Args:
        model (gensim.models.Word2Vec): The trained Word2Vec model to query.
        target_word (str): The word whose similar neighbors will be visualized.
        topn (int, optional): Number of similar words to retrieve and plot. Default is 5.
    """

    similar_words = [w for w, _ in model.wv.most_similar(target_word, topn = topn)]
    all_words = [target_word] + similar_words

    vectors = np.array([model.wv[w] for w in all_words])

    # t-SNE requires perplexity to be smaller than the number of samples
    tsne = TSNE(n_components = 2, random_state = 42, perplexity = min(5, len(all_words) - 1))
    reduced = tsne.fit_transform(vectors)

    plt.figure(figsize=(10, 8))
    plt.scatter(reduced[:, 0], reduced[:, 1])

    for i, word in enumerate(all_words):
        if i == 0:
            plt.annotate(word, xy = (reduced[i, 0], reduced[i, 1]), fontsize = 14, color = 'red', fontweight = 'bold')
        else:
            plt.annotate(word, xy = (reduced[i, 0], reduced[i, 1]), fontsize = 12)

    plt.title(f"Words most similar to '{target_word}'")
    plt.show()
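
The neighbor lookup in step 1 ranks words by cosine similarity. The sketch below reproduces that ranking in plain Python on invented 3D vectors; the helper names and the toy values are assumptions for illustration and are unrelated to the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, topn):
    """Rank every other word by cosine similarity to the target word."""
    scores = [
        (word, cosine(vectors[target], vector))
        for word, vector in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors, chosen so that three names cluster together.
toy_vectors = {
    "darcy": [0.9, 0.1, 0.0],
    "bingley": [0.8, 0.2, 0.1],
    "wickham": [0.7, 0.0, 0.3],
    "river": [0.0, 0.1, 0.9],
}
neighbors = most_similar("darcy", toy_vectors, topn=2)
print(neighbors)
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking.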

In [None]:
"""
Dictionary mapping each book (by its identifier) to a list of its main characters.

This structure is useful for querying trained Word2Vec models about 
relationships between characters, visualizing their embeddings, or 
analyzing co-occurrences within the corpus.

The keys correspond to the identifiers used in `books_urls`, while the 
values are lists containing the names of the primary characters from 
each book in lowercase (to match tokenized forms).
"""


main_characters = {
    "austen_pride": ["elizabeth", "darcy"],       
    "austen_emma": ["emma", "knightley"],         
    "austen_sense": ["elinor", "marianne"],       

    "dickens_two_cities": ["darnay", "carton"],   
    "dickens_expectations": ["pip", "havisham"],  
    "dickens_twist": ["oliver", "fagin"],         

    "twain_tom": ["tom", "becky"],               
    "twain_huck": ["huck", "jim"],               
    "twain_prince": ["prince", "pauper"],        
}

In [None]:
def plot_and_save_similar_words(model, target_word, book_id, dimension, topn, output_dir):
    
	"""
    Generates and saves a 2D visualization of the most similar words to a given target word
    using t-SNE for dimensionality reduction.

    This function:
    1. Retrieves the `topn` most similar words to the `target_word` from the model.
    2. Reduces the vectors of the target word and its similar words to 2D using t-SNE.
    3. Creates a scatter plot with the target word highlighted in red and the similar words
       annotated around it.
    4. Saves the resulting figure as a PNG file to the specified output directory.

    Args:
        model (gensim.models.Word2Vec): The trained Word2Vec model.
        target_word (str): The word to visualize and find similar neighbors for.
        book_id (str): Identifier of the book, used for labeling the plot.
        dimension (int): Dimensionality of the Word2Vec model used.
        topn (int): Number of most similar words to retrieve.
        output_dir (str): Directory where the generated plot image will be saved.
    """

	similar_words = model.wv.most_similar(target_word, topn = topn)
	words = [target_word] + [w for w, _ in similar_words]
	vectors = np.array([model.wv[w] for w in words])

	tsne = TSNE(n_components = 2, random_state = 42, perplexity = min(5, len(words)-1))
	reduced = tsne.fit_transform(vectors)

	plt.figure(figsize=(8, 6))
	plt.scatter(reduced[0, 0], reduced[0, 1], color = 'red')  
	plt.text(reduced[0, 0] + 1, reduced[0, 1] + 1, target_word, fontsize = 12, color = 'red', weight = 'bold')

	for i, word in enumerate(words[1:], start = 1):
		x, y = reduced[i, 0], reduced[i, 1]
		plt.scatter(x, y)
		plt.text(x + 1, y + 1, word, fontsize=10)

	plt.title(f"Most similar to '{target_word}' ({book_id}, dim = {dimension})")
	plt.tight_layout()

	filename = f"{target_word}_{dimension}d.png"
	filepath = os.path.join(output_dir, filename)
	plt.savefig(filepath)
	plt.close()  
	print(f"Saved: {filepath}")

In [None]:
"""
Generate and save similarity visualizations for each main character across all books 
using the 50-dimensional Word2Vec model.

For each book in `main_characters`, this loop:
1. Iterates through its main character names.
2. Calls `plot_and_save_similar_words` to generate a 2D t-SNE plot showing the character 
   and its top 4 most similar words in the embedding space.
3. Saves each figure as a PNG file in the `../figures` directory.

This provides an easy way to inspect how well the embedding model has captured 
the semantic context of key literary characters.
"""


for book_id, characters in main_characters.items():
	for name in characters:
		plot_and_save_similar_words(models[50], name, book_id, 50, 4, "../figures")

Saved: ../figures\elizabeth_50d.png
Saved: ../figures\darcy_50d.png
Saved: ../figures\emma_50d.png
Saved: ../figures\knightley_50d.png
Saved: ../figures\elinor_50d.png
Saved: ../figures\marianne_50d.png
Saved: ../figures\darnay_50d.png
Saved: ../figures\carton_50d.png
Saved: ../figures\pip_50d.png
Saved: ../figures\havisham_50d.png
Saved: ../figures\oliver_50d.png
Saved: ../figures\fagin_50d.png
Saved: ../figures\tom_50d.png
Saved: ../figures\becky_50d.png
Saved: ../figures\huck_50d.png
Saved: ../figures\jim_50d.png
Saved: ../figures\prince_50d.png
Saved: ../figures\pauper_50d.png


# 6. Dataset Preparation for Classification

In this section, the data is prepared for training classification models. The raw text is preprocessed, tokenized, and divided into fixed-size chunks to create consistent input samples. The dataset is then split into training, validation, and test sets, ensuring class balance across splits. This structured format allows the data to be used effectively by neural network architectures in the following stages.

In [21]:
from sklearn.model_selection import train_test_split
from collections import Counter
import numpy as np
import random
import os

In [None]:
"""
Mappings between book filenames, their corresponding authors, and numeric labels.

`books_to_author` maps each text file containing a book to the author's name as a string.
This is useful for tasks such as author attribution, where we need to associate each
book with its original author.

`labels_to_ids` provides a mapping from author names to numeric identifiers.
These numeric labels are typically required for machine learning models that expect
categorical values to be encoded as integers.
"""


books_to_author = {
    "austen_pride.txt": "austen",
    "austen_emma.txt": "austen",
    "austen_sense.txt": "austen",
    "dickens_two_cities.txt": "dickens",
    "dickens_expectations.txt": "dickens",
    "dickens_twist.txt": "dickens",
    "twain_tom.txt": "twain",
    "twain_huck.txt": "twain",
    "twain_prince.txt": "twain",
}

labels_to_ids = {
    "austen": 0,
    "dickens": 1,
    "twain": 2
}
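
Since the file names follow the `<author_lastname>_<short_title>` pattern, the author can also be derived from the name itself, and the label mapping can be inverted for readable output. The sketch below shows both; `author_from_filename` and `ids_to_labels` are hypothetical helpers that the notebook does not define.

```python
labels_to_ids = {"austen": 0, "dickens": 1, "twain": 2}

def author_from_filename(file_name):
    """Take everything before the first underscore as the author key."""
    return file_name.split("_", 1)[0]

# Inverse mapping: numeric ID back to the author name.
ids_to_labels = {idx: label for label, idx in labels_to_ids.items()}

label_id = labels_to_ids[author_from_filename("dickens_two_cities.txt")]
print(label_id, ids_to_labels[label_id])
```

An explicit dictionary such as `books_to_author` remains the safer choice when file names might not follow the convention.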

In [None]:
def read_text(path):
    
	"""
    Read the contents of a text file and return it as a string.

    Args:
        path (str): Path to the text file to be read.

    Returns:
        str: The full contents of the file as a single string.
    """
		
	with open(path, "r", encoding = "utf-8") as f:
		return f.read()

In [None]:
def preprocess_and_chunks(text, chunk_size, min_chunk_size):
    
    """
    Preprocess text into clean word tokens and split them into fixed-size chunks.

    This function performs sentence tokenization, lowercasing, hyphen replacement,
    word tokenization, filtering by regex pattern, removal of digits and stopwords,
    and finally splits the resulting token sequence into chunks of a specified size.
    Chunks shorter than `min_chunk_size` are discarded.

    Args:
        text (str): Raw text to preprocess.
        chunk_size (int): Number of tokens per chunk.
        min_chunk_size (int): Minimum number of tokens required for a chunk to be kept.

    Returns:
        list[str]: A list of preprocessed text chunks, each represented as a space-separated string.
    """

    sentences = sent_tokenize(text)
    all_tokens = []

    for sentence in sentences:
        sentence = sentence.lower()
        sentence = sentence.replace("-", " ")

        words = word_tokenize(sentence)

        cleaned = [word for word in words if token_pattern.match(word)]
        cleaned = [word for word in cleaned if not word.isdigit()]
        cleaned = [word for word in cleaned if word not in stop_words]

        if len(cleaned) >= 1:
            all_tokens.extend(cleaned)

    chunks = []
    for i in range(0, len(all_tokens), chunk_size):
        chunk_tokens = all_tokens[i:i+chunk_size]
        if len(chunk_tokens) >= min_chunk_size:
            chunk_text = " ".join(chunk_tokens)
            chunks.append(chunk_text)

    return chunks
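
The effect of `min_chunk_size` is easiest to see on a short token list. The sketch below isolates the chunking loop in a helper named `chunk_tokens` (a name introduced here for illustration) and uses placeholder tokens.

```python
def chunk_tokens(tokens, chunk_size, min_chunk_size):
    """Split tokens into consecutive chunks, dropping chunks that are too short."""
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        piece = tokens[i:i + chunk_size]
        if len(piece) >= min_chunk_size:
            chunks.append(" ".join(piece))
    return chunks

tokens = [f"w{i}" for i in range(10)]   # ten placeholder tokens
strict = chunk_tokens(tokens, 4, 4)     # trailing 2-token remainder is dropped
lenient = chunk_tokens(tokens, 4, 2)    # trailing 2-token remainder is kept
print(len(strict), len(lenient))
```

Setting `min_chunk_size` equal to `chunk_size` guarantees that every chunk has the same number of tokens, at the cost of discarding the final partial chunk of each book.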

In [None]:
def build_clasification_dataset(books_dir, test_size, validation_size, random_seed):

	"""
    Build a text classification dataset from preprocessed book chunks, splitting into train, validation, and test sets.

    This function reads all book files from a directory, preprocesses their content into token chunks,
    assigns each chunk a numeric label based on the book's author, and splits the resulting dataset
    into training, validation, and test subsets using stratified sampling to preserve class balance.

    Args:
        books_dir (str): Path to the directory containing the book text files.
        test_size (float): Proportion of the dataset held out from training; this held-out portion is then divided between validation and test.
        validation_size (float): Proportion of the held-out portion assigned to the test split; the remainder becomes the validation split.
        random_seed (int): Seed used for reproducible shuffling and splitting.

    Returns:
        tuple: A tuple containing three datasets:
            - (X_train, y_train): Training texts and labels.
            - (X_validation, y_validation): Validation texts and labels.
            - (X_test, y_test): Test texts and labels.
    """
	 
	random.seed(random_seed)
	texts, labels = [], []

	for file_name in sorted(os.listdir(books_dir)):
		author = books_to_author.get(file_name)
		if author is None:
			continue  # skip files that are not part of the labeled collection

		raw_text = read_text(os.path.join(books_dir, file_name))
		chunks = preprocess_and_chunks(raw_text, 200, 200)
		print(f"{len(chunks)} chunks generated for {file_name}")

		texts.extend(chunks)
		labels.extend([author] * len(chunks))

	y = np.array([labels_to_ids[label] for label in labels])

	X_train, X_temp, y_train, y_temp = train_test_split(texts, y, test_size = test_size, random_state = random_seed, stratify = y)
	X_validation, X_test, y_validation, y_test = train_test_split(X_temp, y_temp, test_size = validation_size, random_state = random_seed, stratify = y_temp)

	def class_counts(y_arr):
		c = Counter(y_arr)
		return {label: c[idx] for label, idx in labels_to_ids.items()}

	summary = {
		"train": class_counts(y_train),
		"validation": class_counts(y_validation),
		"test": class_counts(y_test),
	}

	print("\nSummary samples by split and class:\n")
	for split, counts in summary.items():
		print(f"  {split}: {counts}")

	return (X_train, y_train), (X_validation, y_validation), (X_test, y_test)


In [None]:
"""
Build the classification dataset and split it into train, validation, and test sets.

This line applies the `build_clasification_dataset` function to the book corpus stored in `../data`.
The dataset is split with 30% of the data held out from training; that held-out portion is then
divided evenly into validation and test sets, giving roughly a 70/15/15 split. A fixed random
seed (42) is used to ensure reproducibility.

Returns:
    (X_train, y_train): Training texts and labels.
    (X_validation, y_validation): Validation texts and labels.
    (X_test, y_test): Test texts and labels.
"""


(X_train, y_train), (X_validation, y_validation), (X_test, y_test) = build_clasification_dataset("../data", 0.3, 0.5, 42)

348 chunks generated for austen_emma.txt
287 chunks generated for austen_pride.txt
268 chunks generated for austen_sense.txt
411 chunks generated for dickens_expectations.txt
384 chunks generated for dickens_twist.txt
322 chunks generated for dickens_two_cities.txt
252 chunks generated for twain_huck.txt
185 chunks generated for twain_prince.txt
175 chunks generated for twain_tom.txt

Summary samples by split and class:

  train: {'austen': 632, 'dickens': 782, 'twain': 428}
  validation: {'austen': 135, 'dickens': 168, 'twain': 92}
  test: {'austen': 136, 'dickens': 167, 'twain': 92}
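
The two-stage split can be verified with a little arithmetic. The sketch below computes the effective fractions implied by the two arguments and compares them with the counts printed above; `split_fractions` is a helper introduced here for illustration.

```python
def split_fractions(test_size, validation_size):
    """Effective (train, validation, test) fractions of the two-stage split.

    Stage 1 holds out `test_size` of the data; stage 2 sends
    `validation_size` of that held-out part to the test set.
    """
    held_out = test_size
    train = 1.0 - held_out
    test = held_out * validation_size
    validation = held_out - test
    return train, validation, test

train_frac, validation_frac, test_frac = split_fractions(0.3, 0.5)

# Chunk counts and training counts as printed by the notebook.
chunk_counts = [348, 287, 268, 411, 384, 322, 252, 185, 175]
total = sum(chunk_counts)
observed_train = (632 + 782 + 428) / total
print(total, round(observed_train, 3))
```

The observed training share matches the expected 70% up to rounding, and the validation and test sets each receive about 15% of the chunks.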


In [None]:
"""
Inspect a few example samples from each dataset split (Train, Validation, Test).

This loop iterates through the three splits and prints two representative examples 
from each. For each example, it displays the human-readable class label (mapped from 
its numeric ID) and the first 50 characters of the corresponding text chunk.

This is a useful sanity check to ensure that:
    - The dataset has been split correctly.
    - The labels correspond to the right author.
    - The text chunks are correctly preprocessed and non-empty.
"""


for split_name, X_split, y_split in [("Train", X_train, y_train), ("Validation", X_validation, y_validation), ("Test", X_test, y_test),]:
    print(f"\n--- {split_name} examples ---")
    for i in range(2): 
        label = next(name for name, idx in labels_to_ids.items() if idx == y_split[i])
        print(f"Label: {label}")
        print("Text:", X_split[i][:50], "...\n")


--- Train examples ---
Label: austen
Text: rather dark darker narrower one could wish miss sm ...

Label: dickens
Text: stood drinking little counter conversation defarge ...


--- Validation examples ---
Label: twain
Text: office child play wherefore last ladies visit draw ...

Label: dickens
Text: condition declared peril sake altering way living  ...


--- Test examples ---
Label: austen
Text: well satisfied consider present campbells may knig ...

Label: dickens
Text: though would shake hands let go room said boy retr ...



# 7. Building Embedding Matrices

In this section, embedding matrices are constructed from the previously trained Word2Vec models. Each matrix maps the tokenizer's vocabulary to its corresponding vector representation in the embedding space. These matrices will later be used to initialize the embedding layers of different neural network architectures, enabling the models to leverage pretrained semantic knowledge during training.

In [28]:
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

In [None]:
"""
Tokenize the training texts and build the vocabulary for the classification model.

1. `sequence_length` is set to 200, matching the chunk size used during preprocessing.
2. A Keras `Tokenizer` is initialized with an out-of-vocabulary token `<OOV>` to handle 
   unseen words during inference.
3. The tokenizer is fitted on the training set to build the vocabulary.

This tokenizer will later be used to convert text chunks into integer sequences for 
training a neural network classifier.
"""


sequence_length = 200

tokenizer = Tokenizer(oov_token = "<OOV>")
tokenizer.fit_on_texts(X_train)

vocabulary_size = len(tokenizer.word_index) + 1
print("Training vocabulary size:", vocabulary_size)

Training vocabulary size: 21023


In [None]:
def build_embedding_matrix(embeddings_model, tokenizer, embedding_dimension):
    
	"""
    Build an embedding matrix aligned with the tokenizer's vocabulary using a pre-trained Word2Vec model.

    For each word in the tokenizer's vocabulary, this function looks up the corresponding
    vector in the Word2Vec model. If the word is found, its vector is placed at the
    appropriate index in the embedding matrix. Words not present in the Word2Vec model
    remain as zero vectors.

    Args:
        embeddings_model (gensim.models.Word2Vec): Pre-trained Word2Vec model containing word embeddings.
        tokenizer (keras.preprocessing.text.Tokenizer): Tokenizer fitted on the training texts.
        embedding_dimension (int): Dimensionality of the word vectors in the Word2Vec model.

    Returns:
        np.ndarray: A 2D NumPy array of shape (vocabulary_size, embedding_dimension) where each row
                    corresponds to a token index and contains its embedding vector.
    """
     
	# Size the matrix from the tokenizer itself (index 0 is reserved for padding)
	matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dimension))
	for word, i in tokenizer.word_index.items():
		if word in embeddings_model.wv:
			matrix[i] = embeddings_model.wv[word]
	return matrix
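
The alignment between tokenizer indices and embedding rows can be illustrated without NumPy or Gensim. The sketch below uses plain lists and an invented four-word vocabulary; `build_matrix` and the coverage ratio it returns are additions for illustration, not part of the notebook.

```python
def build_matrix(word_index, vectors, dim):
    """Build a row-per-index embedding table and report vocabulary coverage.

    Row 0 stays all zeros because tokenizer indices start at 1. Words that
    are missing from `vectors` also keep a zero row.
    """
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]
    hits = 0
    for word, i in word_index.items():
        if word in vectors:
            rows[i] = list(vectors[word])
            hits += 1
    return rows, hits / len(word_index)

# Invented vocabulary and vectors.
word_index = {"<OOV>": 1, "emma": 2, "said": 3, "zebra": 4}
vectors = {"emma": [0.1, 0.2], "said": [0.3, 0.4]}
matrix, coverage = build_matrix(word_index, vectors, 2)
print(len(matrix), coverage)
```

Tracking coverage is a useful check here, because the tokenizer vocabulary (21023 entries) is larger than the Word2Vec vocabulary (16108 words), so some rows necessarily remain zero.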

In [None]:
"""
Create embedding matrices for each pre-trained Word2Vec model dimension.

For each embedding model (50D, 100D, 200D), this loop builds an embedding matrix 
aligned with the tokenizer's vocabulary using `build_embedding_matrix`. 
The resulting matrices are stored in the `embedding_matrices` dictionary, 
keyed by their embedding dimension.

These matrices will later be used to initialize the embedding layers 
of different neural network models for text classification.
"""


embedding_matrices = {}
for dimension, embedding_model in models.items():
    embedding_matrices[dimension] = build_embedding_matrix(embedding_model, tokenizer, dimension)

In [None]:
"""
Convert text datasets into fixed-length integer sequences using the fitted tokenizer.

Each text chunk in the training, validation, and test sets is transformed 
into a sequence of integer token IDs based on the tokenizer's vocabulary. 
The sequences are then padded or truncated to `sequence_length` so that every 
input has the same shape when it is fed into the neural network models.
"""


from tensorflow.keras.utils import pad_sequences

X_train_sequences = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen = sequence_length)
X_validation_sequences = pad_sequences(tokenizer.texts_to_sequences(X_validation), maxlen = sequence_length)
X_test_sequences = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen = sequence_length)
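
The fixed-length padding described above can be sketched in plain Python. The helper below pads and truncates at the front of the sequence, which, to the best of my knowledge, matches the defaults of Keras `pad_sequences`. `pad_or_truncate` is a hypothetical name, and the toy ids are made up.

```python
def pad_or_truncate(sequence, length, pad_id=0):
    """Force a list of token ids to exactly `length` items (pad/truncate at the front)."""
    if len(sequence) >= length:
        return sequence[len(sequence) - length:]   # keep the last `length` ids
    return [pad_id] * (length - len(sequence)) + sequence

toy_sequences = [[5, 8, 2], [7], [4, 9, 3, 6, 1]]  # made-up ids of unequal length
toy_padded = [pad_or_truncate(seq, length=4) for seq in toy_sequences]
print(toy_padded)
```

A fixed length matters most for architectures that flatten the sequence, because the size of the first dense layer then depends directly on the number of tokens per input.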

# 8. Training Feed-Forward Model for Classification (Trained Embeddings)

In this section, three feed-forward neural network architectures are trained for author classification using the custom Word2Vec embeddings. Each model loads one of the pretrained embedding matrices into a non-trainable embedding layer, so training only updates the dense layers that learn the classification patterns, not the word representations. Validation accuracy is monitored during training, and accuracy, macro-averaged precision, and macro-averaged recall are then measured on the test set.

In [33]:
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Flatten, Dense
from sklearn.metrics import precision_score, recall_score, accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import pandas as pd
import numpy as np

In [34]:
num_authors = 3

In [None]:
def make_embedding_layer(embedding_matrix):

	"""
    Create a non-trainable Keras Embedding layer initialized with a pre-trained embedding matrix.

    This function constructs an `Embedding` layer using the provided matrix, 
    where each row corresponds to a token index and contains its pre-trained 
    embedding vector. The layer is set to `trainable=False` to keep the 
    embeddings fixed during model training, ensuring that the model relies 
    on the semantic structure learned by the Word2Vec model.

    Args:
        embedding_matrix (np.ndarray): A 2D NumPy array of shape 
            (vocabulary_size, embedding_dimension) containing pre-trained embeddings.

    Returns:
        keras.layers.Embedding: A configured embedding layer ready to be used 
        as the first layer in a neural network model.
    """	
	
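	# Read both dimensions from the matrix so the layer always matches it,
	# rather than depending on the global `vocabulary_size`.
	vocabulary_size, embedding_dimension = embedding_matrix.shape

	return Embedding(
		input_dim = vocabulary_size,
		output_dim = embedding_dimension,
		weights = [embedding_matrix],
		trainable = False
	)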

In [None]:
def build_model_a(embedding_layer):
    
    """
    Build a simple neural network for text classification using average pooling.

    This model architecture consists of:
        1. A pre-trained, non-trainable embedding layer.
        2. A `GlobalAveragePooling1D` layer that computes the average vector 
           representation of the input sequence.
        3. A dense hidden layer with ReLU activation for non-linear feature extraction.
        4. A softmax output layer with `num_authors` units for multi-class classification.

    Args:
        embedding_layer (keras.layers.Embedding): Pre-initialized embedding layer 
            built with `make_embedding_layer`.

    Returns:
        keras.models.Sequential: A Keras Sequential model.
    """

    model = Sequential([
        embedding_layer,
        GlobalAveragePooling1D(),
        Dense(32, activation = 'relu'),
        Dense(num_authors, activation = 'softmax')
    ])
    return model
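
To make the first two layers of this architecture concrete, the sketch below reproduces them in plain Python under simplifying assumptions: the embedding layer is treated as a row lookup into a fixed table, and average pooling is the component-wise mean over all positions. `embed` and `global_average_pool` are hypothetical names, and the table values are made up. The example also shows a consequence of averaging: word order does not affect the pooled vector.

```python
def embed(sequence, matrix):
    """Look up one embedding row per token id (the job of a frozen embedding layer)."""
    return [matrix[token_id] for token_id in sequence]

def global_average_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

toy_matrix = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]  # hypothetical 3-row, 2-D embedding table
forward = global_average_pool(embed([1, 2, 2], toy_matrix))
backward = global_average_pool(embed([2, 2, 1], toy_matrix))
print(forward, backward, forward == backward)
```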

In [None]:
def build_model_b(embedding_layer):
    
    """
    Build a deeper feed-forward neural network for text classification.

    This model extends `build_model_a` by adding an extra dense layer with 
    more units, allowing it to learn more complex feature interactions from 
    the averaged embeddings. The architecture consists of:

        1. A pre-trained, non-trainable embedding layer.
        2. A `GlobalAveragePooling1D` layer to aggregate token embeddings.
        3. A dense hidden layer with 128 units and ReLU activation.
        4. A second dense hidden layer with 64 units and ReLU activation.
        5. A softmax output layer with `num_authors` units for classification.

    Args:
        embedding_layer (keras.layers.Embedding): Pre-initialized embedding layer 
            built with `make_embedding_layer`.

    Returns:
        keras.models.Sequential: A Keras Sequential model.
    """

    model = Sequential([
        embedding_layer,
        GlobalAveragePooling1D(),
        Dense(128, activation = 'relu'),
        Dense(64, activation = 'relu'),
        Dense(num_authors, activation = 'softmax')
    ])
    return model

In [None]:
def build_model_c(embedding_layer):
    
	"""
    Build a neural network for text classification using flattened embeddings.

    Unlike the previous models that use average pooling, this architecture 
    flattens the entire sequence of embeddings into a single long vector 
    before passing it through dense layers. This allows the model to preserve 
    the positional structure of the sequence, at the cost of more parameters.

    The architecture consists of:
        1. A pre-trained, non-trainable embedding layer.
        2. A `Flatten` layer to convert the (sequence_length × embedding_dim) 
           tensor into a 1D vector.
        3. A dense hidden layer with 256 units and ReLU activation.
        4. A second dense hidden layer with 128 units and ReLU activation.
        5. A softmax output layer with `num_authors` units for classification.

    Args:
        embedding_layer (keras.layers.Embedding): Pre-initialized embedding layer 
            built with `make_embedding_layer`.

    Returns:
        keras.models.Sequential: A Keras Sequential model.
    """
      
	model = Sequential([
		embedding_layer,
		Flatten(),
		Dense(256, activation = 'relu'),
		Dense(128, activation = 'relu'),
		Dense(num_authors, activation = 'softmax')
	])
	return model
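
The parameter cost mentioned in the docstring can be estimated with simple arithmetic. The sketch below counts only the dense-layer weights and biases, since the frozen embedding layer contributes no trainable parameters. It assumes 50-dimensional embeddings and inputs of exactly `sequence_length = 200` tokens; `dense_params` and `trainable_params` are hypothetical helpers.

```python
def dense_params(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units

def trainable_params(first_input_size, hidden_units, classes=3):
    """Sum dense-layer parameters for a stack: input -> hidden layers -> softmax."""
    total, size = 0, first_input_size
    for units in list(hidden_units) + [classes]:
        total += dense_params(size, units)
        size = units
    return total

embedding_dim, seq_len = 50, 200  # assumed: 50-D vectors, 200 tokens per input
print("A:", trainable_params(embedding_dim, [32]))
print("B:", trainable_params(embedding_dim, [128, 64]))
print("C:", trainable_params(seq_len * embedding_dim, [256, 128]))
```

Under these assumptions, model C has roughly 2.6 million trainable parameters, compared with about 15 thousand for model B and under 2 thousand for model A, because flattening multiplies the input size of the first dense layer by the sequence length.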

In [None]:
"""
Dictionary of model architecture builders.

This dictionary maps architecture names ("A", "B", "C") to their corresponding 
model-building functions (`build_model_a`, `build_model_b`, `build_model_c`). 
It provides a clean way to dynamically select and build different neural 
network architectures for experimentation and comparison.
"""


architectures = {
    "A": build_model_a,
    "B": build_model_b,
    "C": build_model_c
}

In [None]:
"""
This code block trains and evaluates multiple text classification models using different 
embedding dimensions and neural network architectures. For each combination of embedding 
matrix and architecture, it builds a Keras model, trains it on the training data, validates 
it on a separate validation set, and evaluates it on the test set.

The workflow proceeds as follows:
1. Iterate over all available embedding matrices, each corresponding to a specific 
   embedding dimensionality (e.g., 50D, 100D, 200D).
2. For each embedding dimension, iterate over all predefined model architectures (A, B, C), 
   where each architecture defines a different neural network structure.
3. Build an embedding layer from the current embedding matrix and pass it to the model 
   construction function.
4. Compile the model using the Adam optimizer and sparse categorical cross-entropy loss.
5. Train the model for 10 epochs on the training data, using the validation set for 
   monitoring performance.
6. Generate predictions on the test set and compute evaluation metrics: accuracy, 
   precision (macro-averaged), and recall (macro-averaged).
7. Store the performance metrics for each (architecture, embedding dimension) pair 
   in the `results` dictionary for later comparison.

The `results` dictionary uses keys in the format "{ARCH}_{DIM}D" (e.g., "A_50D") and maps 
them to metric dictionaries with the following structure:
{
    "accuracy": float,
    "precision": float,
    "recall": float
}
"""


results = {}

for embedding_dimension, embedding_matrix in embedding_matrices.items():
    for arch_name, build_fn in architectures.items():
        print(f"Training model {arch_name} with embedding {embedding_dimension}D")

        embedding_layer = make_embedding_layer(embedding_matrix)
        model = build_fn(embedding_layer)

        model.compile(
            optimizer = Adam(),
            loss = 'sparse_categorical_crossentropy',
            metrics = ['accuracy']
        )

        history = model.fit(
            X_train_sequences, y_train,
            validation_data = (X_validation_sequences, y_validation),
            epochs = 10,
            batch_size = 32,
            verbose = 1
        )

        y_pred_probs = model.predict(X_test_sequences, verbose = 0)
        y_pred = np.argmax(y_pred_probs, axis = 1)

        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average = 'macro', zero_division = 0)
        recall = recall_score(y_test, y_pred, average = 'macro', zero_division = 0)

        key = f"{arch_name}_{embedding_dimension}D"
        results[key] = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall
        }

Training model A with embedding 50D
Epoch 1/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.6308 - loss: 0.9395 - val_accuracy: 0.7241 - val_loss: 0.8330
Epoch 2/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7953 - loss: 0.7301 - val_accuracy: 0.8228 - val_loss: 0.6366
Epoch 3/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8599 - loss: 0.5447 - val_accuracy: 0.8835 - val_loss: 0.4722
Epoch 4/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9110 - loss: 0.4099 - val_accuracy: 0.9266 - val_loss: 0.3630
Epoch 5/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9321 - loss: 0.3193 - val_accuracy: 0.9494 - val_loss: 0.2874
Epoch 6/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9511 - loss: 0.2567 - val_accuracy: 0.9443 - val_loss: 0.2341
Epo

In [None]:
"""
This code block converts the collected training and evaluation results into a 
readable tabular format using a pandas DataFrame.

The resulting table allows quick inspection of which architecture–embedding combination 
performed best across the various metrics.
"""


df = pd.DataFrame(results).T 

df_formatted = df.map(lambda x: f"{x:.3f}")

print(df_formatted)

       accuracy precision recall
A_50D     0.967     0.966  0.968
B_50D     0.972     0.967  0.976
C_50D     0.965     0.964  0.963
A_100D    0.957     0.959  0.954
B_100D    0.980     0.976  0.983
C_100D    0.954     0.959  0.948
A_200D    0.970     0.966  0.972
B_200D    0.980     0.979  0.980
C_200D    0.952     0.958  0.941
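
The macro-averaged precision and recall reported in the table give each author the same weight regardless of how many test chunks belong to that author. The sketch below computes them by hand on made-up labels, mirroring the `average = 'macro'` and `zero_division = 0` settings as I understand them; `macro_precision_recall` is a hypothetical helper, not the scikit-learn implementation.

```python
def macro_precision_recall(y_true, y_pred, labels):
    """Unweighted mean of per-class precision and recall (0 when a denominator is 0)."""
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

toy_true = [0, 0, 1, 1, 2, 2]   # made-up author labels
toy_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions
precision, recall = macro_precision_recall(toy_true, toy_pred, labels=[0, 1, 2])
print(round(precision, 3), round(recall, 3))
```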


# 9. Training Feed-Forward Model for Classification (GloVe Embeddings)

In this section, the same feed-forward neural network architectures are trained for author classification, but this time using pretrained GloVe embeddings. By leveraging these widely used embeddings, the models can benefit from rich semantic representations learned from large external corpora. The goal is to compare their performance against the custom-trained embeddings and evaluate how well external embeddings generalize to the literary classification task.

In [42]:
import numpy as np

In [None]:
"""
This code block defines file paths to pre-trained GloVe embedding files with different 
vector dimensions (50, 100, and 200).

Steps:
1. `glove_dir` specifies the base directory where the GloVe files are stored.
2. `model_paths` is a dictionary that maps each embedding dimension (key) to its 
   corresponding file path (value). For example, the 50-dimensional embeddings are 
   located at `../models/glove.6b/glove.6b.50d.txt`.

These paths will be used later to load the corresponding GloVe embeddings and integrate 
them into the classification models.
"""


glove_dir = "../models/glove.6b"

model_paths = {
    50: f"{glove_dir}/glove.6b.50d.txt",
    100: f"{glove_dir}/glove.6b.100d.txt",
    200: f"{glove_dir}/glove.6b.200d.txt"
}

In [None]:
def load_glove_embeddings(path):
    
	"""
	Loads pre-trained GloVe word embeddings from a text file into a Python dictionary.

	Args:
		path (str): Path to the GloVe `.txt` file containing word vectors.

	Returns:
		dict: A dictionary mapping each word (str) to its corresponding embedding vector (numpy.ndarray).
	"""

	embeddings_index = {}
	with open(path, encoding = "utf8") as f:
		for line in f:
			values = line.split()
			word = values[0]
			vector = np.asarray(values[1:], dtype = "float32")
			embeddings_index[word] = vector
	print(f"Loaded {len(embeddings_index)} embeddings from {path}")
	return embeddings_index
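
The loader relies on the layout of the GloVe text files: one entry per line, with the word first and its vector components after it, separated by spaces. A minimal sketch of parsing a single line is shown below; `parse_glove_line` is a hypothetical helper and the sample line is made up, not a real GloVe vector.

```python
def parse_glove_line(line):
    """Split 'word v1 v2 ... vn' into (word, [v1, ..., vn])."""
    parts = line.split()
    return parts[0], [float(value) for value in parts[1:]]

toy_line = "ship 0.25 -0.5 1.0\n"  # made-up 3-D entry for illustration
word, vector = parse_glove_line(toy_line)
print(word, vector, len(vector))
```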


In [None]:
def build_embedding_matrix_glove(embeddings_index, tokenizer, embedding_dimension):
    
	"""
	Builds an embedding matrix using pre-trained GloVe embeddings for a given tokenizer vocabulary.

	Args:
		embeddings_index (dict): A dictionary mapping words to their GloVe embedding vectors, 
			typically loaded using `load_glove_embeddings`.
		tokenizer (Tokenizer): A fitted Keras Tokenizer containing the vocabulary from the training dataset.
		embedding_dimension (int): The dimensionality of the GloVe vectors being used (e.g., 50, 100, 200).

	Returns:
		numpy.ndarray: A 2D NumPy array of shape `(vocabulary_size, embedding_dimension)` where each 
		row corresponds to a word in the tokenizer's vocabulary. Words not found in the GloVe 
		embeddings are represented by zero vectors.
	"""
		
	vocabulary_size = len(tokenizer.word_index) + 1
	matrix = np.zeros((vocabulary_size, embedding_dimension))
	for word, i in tokenizer.word_index.items():
		vector = embeddings_index.get(word)
		if vector is not None:
			matrix[i] = vector
	return matrix
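
Because words missing from GloVe end up as zero vectors, it can be useful to check how much of the tokenizer's vocabulary is actually covered. The sketch below shows one way to measure this, assuming a word index and an embeddings dictionary shaped like the ones above; `coverage` is a hypothetical helper and the toy data is made up.

```python
def coverage(word_index, embeddings_index):
    """Fraction of vocabulary words that have a pre-trained vector."""
    found = sum(1 for word in word_index if word in embeddings_index)
    return found / len(word_index)

toy_word_index = {"<OOV>": 1, "the": 2, "ship": 3, "pemberley": 4}
toy_embeddings = {"the": [0.1], "ship": [0.2]}  # hypothetical lookup

toy_coverage = coverage(toy_word_index, toy_embeddings)
print(f"{toy_coverage:.0%} of the toy vocabulary has a vector")
```

With the real objects, the call would be `coverage(tokenizer.word_index, glove_models[dim])` for each dimension.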


In [None]:
"""
Loads pre-trained GloVe embeddings for multiple dimensions and builds corresponding 
embedding matrices aligned with the tokenizer's vocabulary.

For each specified dimensionality (e.g., 50, 100, 200), this code performs the following steps:
1. Loads the GloVe embeddings file using `load_glove_embeddings`, creating a dictionary 
   mapping words to their pre-trained vectors.
2. Builds an embedding matrix using `build_embedding_matrix_glove`, aligning each tokenizer 
   vocabulary word with its corresponding GloVe vector.
3. Stores the loaded GloVe embeddings in `glove_models` and the resulting matrices 
   in `embedding_matrices_glove`.
4. Prints the shape of each constructed embedding matrix to confirm successful loading.
"""


glove_models = {}
embedding_matrices_glove = {}

for dim, path in model_paths.items():
    print(f"\nLoading GloVe {dim}D...")
    glove_models[dim] = load_glove_embeddings(path)
    embedding_matrices_glove[dim] = build_embedding_matrix_glove(glove_models[dim], tokenizer, dim)
    print(f"Matrix {dim}D shape: {embedding_matrices_glove[dim].shape}")


Loading GloVe 50D...
Loaded 400000 embeddings from ../models/glove.6b/glove.6b.50d.txt
Matrix 50D shape: (21023, 50)

Loading GloVe 100D...
Loaded 400000 embeddings from ../models/glove.6b/glove.6b.100d.txt
Matrix 100D shape: (21023, 100)

Loading GloVe 200D...
Loaded 400000 embeddings from ../models/glove.6b/glove.6b.200d.txt
Matrix 200D shape: (21023, 200)


In [None]:
"""
This code block repeats the training and evaluation loop from Section 8, this time using 
the GloVe embedding matrices in `embedding_matrices_glove`. For each combination of embedding 
matrix and architecture, it builds a Keras model, trains it on the training data, validates 
it on a separate validation set, and evaluates it on the test set. Note that `results` is 
reset here, so the Word2Vec metrics from Section 8 are no longer held in that variable.

The workflow proceeds as follows:
1. Iterate over all available embedding matrices, each corresponding to a specific 
   embedding dimensionality (e.g., 50D, 100D, 200D).
2. For each embedding dimension, iterate over all predefined model architectures (A, B, C), 
   where each architecture defines a different neural network structure.
3. Build an embedding layer from the current embedding matrix and pass it to the model 
   construction function.
4. Compile the model using the Adam optimizer and sparse categorical cross-entropy loss.
5. Train the model for 10 epochs on the training data, using the validation set for 
   monitoring performance.
6. Generate predictions on the test set and compute evaluation metrics: accuracy, 
   precision (macro-averaged), and recall (macro-averaged).
7. Store the performance metrics for each (architecture, embedding dimension) pair 
   in the `results` dictionary for later comparison.

The `results` dictionary uses keys in the format "{ARCH}_{DIM}D" (e.g., "A_50D") and maps 
them to metric dictionaries with the following structure:
{
    "accuracy": float,
    "precision": float,
    "recall": float
}
"""


results = {}

for embedding_dimension, embedding_matrix in embedding_matrices_glove.items():
    for arch_name, build_fn in architectures.items():
        print(f"Training model {arch_name} with embedding {embedding_dimension}D")

        embedding_layer = make_embedding_layer(embedding_matrix)
        model = build_fn(embedding_layer)

        model.compile(
            optimizer = Adam(),
            loss = 'sparse_categorical_crossentropy',
            metrics = ['accuracy']
        )

        history = model.fit(
            X_train_sequences, y_train,
            validation_data = (X_validation_sequences, y_validation),
            epochs = 10,
            batch_size = 32,
            verbose = 1
        )

        y_pred_probs = model.predict(X_test_sequences, verbose = 0)
        y_pred = np.argmax(y_pred_probs, axis = 1)

        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average = 'macro', zero_division = 0)
        recall = recall_score(y_test, y_pred, average = 'macro', zero_division = 0)

        key = f"{arch_name}_{embedding_dimension}D"
        results[key] = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall
        }

Training model A with embedding 50D
Epoch 1/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - accuracy: 0.6059 - loss: 0.9885 - val_accuracy: 0.6759 - val_loss: 0.9274
Epoch 2/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.6819 - loss: 0.8758 - val_accuracy: 0.7089 - val_loss: 0.8168
Epoch 3/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7193 - loss: 0.7695 - val_accuracy: 0.7696 - val_loss: 0.7227
Epoch 4/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7503 - loss: 0.6832 - val_accuracy: 0.7873 - val_loss: 0.6542
Epoch 5/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7828 - loss: 0.6205 - val_accuracy: 0.8076 - val_loss: 0.5973
Epoch 6/10
58/58 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8008 - loss: 0.5671 - val_accuracy: 0.8076 - val_loss: 0.5503
Epo

In [None]:
"""
This code block converts the collected training and evaluation results into a 
readable tabular format using a pandas DataFrame.

The resulting table allows quick inspection of which architecture–embedding combination 
performed best across the various metrics.
"""


df = pd.DataFrame(results).T 

df_formatted = df.map(lambda x: f"{x:.3f}")

print(df_formatted)

       accuracy precision recall
A_50D     0.830     0.827  0.820
B_50D     0.871     0.874  0.858
C_50D     0.800     0.807  0.772
A_100D    0.858     0.856  0.854
B_100D    0.891     0.924  0.860
C_100D    0.828     0.837  0.801
A_200D    0.896     0.897  0.889
B_200D    0.934     0.932  0.928
C_200D    0.853     0.861  0.829
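
Since the `results` variable only holds the most recent run, a side-by-side comparison needs both sets of numbers kept separately. The sketch below does this with test accuracies copied by hand from the two tables printed in Sections 8 and 9; the dictionary names are hypothetical and the values are rounded to three decimals as printed.

```python
# Test accuracies copied by hand from the two result tables in the notebook.
word2vec_accuracy = {
    "A_50D": 0.967, "B_50D": 0.972, "C_50D": 0.965,
    "A_100D": 0.957, "B_100D": 0.980, "C_100D": 0.954,
    "A_200D": 0.970, "B_200D": 0.980, "C_200D": 0.952,
}
glove_accuracy = {
    "A_50D": 0.830, "B_50D": 0.871, "C_50D": 0.800,
    "A_100D": 0.858, "B_100D": 0.891, "C_100D": 0.828,
    "A_200D": 0.896, "B_200D": 0.934, "C_200D": 0.853,
}

# A positive gap means the custom Word2Vec embeddings scored higher.
accuracy_gap = {
    key: round(word2vec_accuracy[key] - glove_accuracy[key], 3)
    for key in word2vec_accuracy
}

for key, gap in accuracy_gap.items():
    print(f"{key:<7} {gap:+.3f}")

print("Best GloVe configuration:", max(glove_accuracy, key=glove_accuracy.get))
```

On these numbers every gap is positive, and the smallest one belongs to the B architecture with 200-dimensional vectors.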
