# Exploration of the PTB Dataset and the Corpus Object

- This notebook explores the PTB dataset:

    - Previews the first ten lines of the training file.
    - Creates a Corpus object.

- The Corpus object represents the training data as a sequence (list) of integer token IDs. It builds a Dictionary that assigns each unique word a unique index in order of first appearance, and it appends an '\<eos\>' token to the end of each line.

- Corpus stores the data in:
    - corpus.data

- The vocabulary and mappings are available via:
    - corpus.dictionary.word2idx
    - corpus.dictionary.idx2word

- Vocabulary size is given by:
    - len(corpus.dictionary)
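
The behaviour described above can be illustrated with a minimal sketch. The actual `Corpus` and `Dictionary` classes live in `utils_mclm` and are not shown in this notebook, so `SketchDictionary` and `tokenize_lines` below are hypothetical stand-ins written for illustration, not the module's implementation. The sketch assumes words are separated by whitespace.

```python
class SketchDictionary:
    """Hypothetical stand-in for the Dictionary in utils_mclm."""

    def __init__(self):
        self.word2idx = {}   # word -> integer ID
        self.idx2word = []   # integer ID -> word (list position is the ID)

    def add_word(self, word):
        # Assign the next free ID the first time a word is seen.
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def tokenize_lines(lines, dictionary):
    """Map whitespace-separated words to IDs, adding '<eos>' after each line."""
    ids = []
    for line in lines:
        for word in line.split() + ['<eos>']:
            ids.append(dictionary.add_word(word))
    return ids


# Toy input (not PTB data) to show the ID assignment order.
toy_lines = ["the cat sat", "the dog sat"]
toy_dictionary = SketchDictionary()
toy_data = tokenize_lines(toy_lines, toy_dictionary)

print(toy_data)
print(toy_dictionary.idx2word)
```

Keeping `idx2word` as a list and `word2idx` as a dict makes both lookup directions constant time, and the length of the list doubles as the vocabulary size.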








In [1]:
# For Google Colab:
# upload the folder containing this notebook to Google Drive first.

import os
import sys

# If the notebook is running in Google Colab,
# mount Google Drive and change the working directory.
if 'google.colab' in sys.modules:
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Path to the uploaded folder
    path = '/content/drive/My Drive/xxxxx/xxxxx'
    print(path)
    # os.chdir changes the current working directory
    os.chdir(path)
    !pwd


In [2]:
from utils_mclm import *

# Preview the training dataset

In [3]:
data_folder = 'ptb'
# Preview the first ten lines of the dataset
preview_ptb_file(data_folder, "train.txt", num_lines=10)


Previewing ptb/train.txt:
 aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter
 pierre <unk> N years old will join the board as a nonexecutive director nov. N
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group
 rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate
 a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported
 the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said
 <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N
 although prel
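
`preview_ptb_file` is defined in `utils_mclm` and its source is not shown here. As a rough idea of what such a helper might do, the following sketch prints the first few lines of a text file. The function name `preview_file_sketch`, its return value, and the throwaway demo file are assumptions made for illustration only.

```python
import os
import tempfile
from itertools import islice


def preview_file_sketch(folder, filename, num_lines=10):
    """Print and return the first `num_lines` lines of folder/filename."""
    path = os.path.join(folder, filename)
    print(f"Previewing {path}:")
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after num_lines without reading the whole file.
        lines = [line.rstrip("\n") for line in islice(f, num_lines)]
    for line in lines:
        print(line)
    return lines


# Demonstration on a throwaway file so the sketch runs without the PTB data.
with tempfile.TemporaryDirectory() as tmp_folder:
    with open(os.path.join(tmp_folder, "toy.txt"), "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\nthird line\n")
    shown = preview_file_sketch(tmp_folder, "toy.txt", num_lines=2)
```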

# Building the corpus object

In [4]:
# Build the corpus (vocabulary and token IDs) from the training data
corpus = Corpus(data_folder, 'train.txt')

# Exploration of Corpus object

In [5]:
print("Basic Corpus Information:")
print(f"Vocabulary size: {len(corpus.dictionary)}")
print(f"Number of tokens in training data: {len(corpus.data)}")

# Look at the first few tokens and their indices
n = 5
print(f"\nFirst {n} tokens and their indices:")
first_n_indices = corpus.data[:n]
for idx in first_n_indices:
    word = corpus.dictionary.idx2word[idx]
    print(f"Index: {idx:4d}, Word: {word}")

# Show example entries from the vocabulary
print("\nVocabulary Examples:")
print("First 10 words in vocabulary:", corpus.dictionary.idx2word[:10])
print("Last 10 words in vocabulary:", corpus.dictionary.idx2word[-10:])

# Look up some specific words
sample_words = ['the', 'a', '<eos>', '<unk>']
print("\nIndices for common words:")
for word in sample_words:
    if word in corpus.dictionary.word2idx:
        print(f"'{word}': {corpus.dictionary.word2idx[word]}")
    else:
        print(f"'{word}' not in vocabulary")

# Convert a small sequence back to words
print("\nSample sequence converted back to words:")
sample_sequence = corpus.data[100:110]  # Get 10 tokens
reconstructed_text = ' '.join([corpus.dictionary.idx2word[idx] for idx in sample_sequence])
print(reconstructed_text)

# Get some basic statistics
unique_indices = len(set(corpus.data))
print(f"\nNumber of unique tokens used in training data: {unique_indices}")
print(f"Total vocabulary size: {len(corpus.dictionary)}")

Basic Corpus Information:
Vocabulary size: 10000
Number of tokens in training data: 929589

First 5 tokens and their indices:
Index:    0, Word: aer
Index:    1, Word: banknote
Index:    2, Word: berlitz
Index:    3, Word: calloway
Index:    4, Word: centrust

Vocabulary Examples:
First 10 words in vocabulary: ['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec']
Last 10 words in vocabulary: ['lung-cancer', 'bikers', 'bofors', 'parsow', 'caci', 'isi', 'chestman', 'tci', 'trecker', 'unilab']

Indices for common words:
'the': 32
'a': 35
'<eos>': 24
'<unk>': 26

Sample sequence converted back to words:
workers exposed to it more than N years ago researchers

Number of unique tokens used in training data: 10000
Total vocabulary size: 10000
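
The indices printed for `'the'`, `'a'`, `'<eos>'` and `'<unk>'` can be cross-checked by hand. The sketch below applies first-appearance indexing to the first two lines of the preview, assuming whitespace tokenization and an `'<eos>'` token appended to each line. It uses a plain dict rather than the `Corpus` class, so it is an independent check and not the module's code.

```python
# First two lines of ptb/train.txt, copied from the preview output.
line_1 = ("aer banknote berlitz calloway centrust cluett fromstein gitano "
          "guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta "
          "rubens sim snack-food ssangyong swapo wachter")
line_2 = ("pierre <unk> N years old will join the board as a "
          "nonexecutive director nov. N")

word2idx = {}
for line in (line_1, line_2):
    for word in line.split() + ['<eos>']:
        # setdefault only assigns an ID the first time a word appears.
        word2idx.setdefault(word, len(word2idx))

for word in ['the', 'a', '<eos>', '<unk>']:
    print(f"'{word}': {word2idx[word]}")
```

The first line contains 24 distinct words (indices 0 to 23), so `'<eos>'` receives index 24, and the words of the second line follow from 25 onward. This reproduces the values 32, 35, 24 and 26 shown in the output.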
