# WikiText Dataset Analysis and Preprocessing
This notebook analyzes and preprocesses the WikiText dataset, specifically the first train parquet file of the [WikiText-103-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1) version. The goal is to prepare the data for comparing RNN, LSTM, and Transformer models on text prediction tasks.

## 1. Load the WikiText Dataset
We first load the train data of the WikiText-103-raw-v1 dataset from a local parquet file with Pandas, which gives us a DataFrame for easy analysis.

In [1]:
import pandas as pd

df = pd.read_parquet('../../datasets/wiki-103-text/raw/wiki-103-train.parquet')

df.head()


| | text |
|---|---|
| 0 | ( * ) Denotes co @-@ producer . \n |
| 1 | |
| 2 | = = Credits and personnel = = \n |
| 3 | |
| 4 | Credits adapted from Allmusic . \n |
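If the parquet file is not available locally, the same dataset can in principle be fetched from the Hugging Face Hub. The following is a minimal sketch, assuming the `datasets` package is installed and network access is available; the helper name `load_wikitext_train` is hypothetical and not part of this notebook.

```python
def load_wikitext_train(config="wikitext-103-raw-v1"):
    """Load the WikiText train split from the Hugging Face Hub as a DataFrame.

    Assumes the `datasets` package is installed and the Hub is reachable
    (or the dataset is already in the local cache).
    """
    # Imported lazily so that defining the helper does not require `datasets`
    from datasets import load_dataset

    dataset = load_dataset("Salesforce/wikitext", config, split="train")
    return dataset.to_pandas()
```

Calling `df = load_wikitext_train()` returns the complete train split, which may contain more rows than the single local parquet file read above.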


## 2. Dataset Overview
We inspect the dataset's structure, in particular the distribution of text lengths measured in whitespace-separated words. This guides the preprocessing for the RNN, LSTM, and Transformer models. Special characters are checked in the next section.

In [2]:
# Check the structure of the dataset
print(f"Number of samples: {len(df)}")

# Calculate text lengths
df['text_length'] = df['text'].apply(lambda x: len(x.split()))

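# General statistics about text length (wrap in print() to display them,
# since only the last expression of a cell is shown automatically)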
df['text_length'].describe()

# Count short and long texts
short_texts = df[df['text_length'] < 5]
long_texts = df[df['text_length'] > 100]
total_texts = len(df)

print(f"Short texts (less than 5 words): {len(short_texts)} ({len(short_texts)/total_texts*100:.1f}%)")
print(f"Long texts (more than 100 words): {len(long_texts)} ({len(long_texts)/total_texts*100:.1f}%)")


Number of samples: 900675
Short texts (less than 5 words): 336903 (37.4%)
Long texts (more than 100 words): 232770 (25.8%)
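The short-text count above mixes completely empty rows (such as rows 1 and 3 in the preview) with rows that merely have few words. A minimal sketch of separating the two, using a small hypothetical list in place of `df['text']`; the function name `length_bucket` is illustrative, and the thresholds 5 and 100 are the ones used in this notebook.

```python
from collections import Counter

def length_bucket(text, short=5, long=100):
    """Classify a text by its number of whitespace-separated words."""
    n_words = len(text.split())
    if n_words == 0:
        return "empty"
    if n_words < short:
        return "short"
    if n_words > long:
        return "long"
    return "kept"

# Hypothetical rows standing in for df['text']
sample_texts = [
    "",
    " = = Credits and personnel = = \n",
    "Credits adapted from Allmusic . \n",
    "A & R \n",
    "word " * 150,
]

bucket_counts = Counter(length_bucket(t) for t in sample_texts)
print(bucket_counts)
```

On the real data, the same function could be applied with `df['text'].apply(length_bucket).value_counts()`.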


## 3. Check for Special Characters
We'll check if the text contains special characters like punctuation, numbers, and symbols that might need to be cleaned for model training.

In [3]:
import re
from collections import Counter

# Function to check for special characters
def has_special_chars(text):
    return bool(re.search(r'[^a-zA-Z\s]', text))

# Percentage of texts with special characters
df['has_special_chars'] = df['text'].apply(has_special_chars)
special_char_percentage = df['has_special_chars'].mean() * 100
print(f"Percentage of texts with special characters: {special_char_percentage:.2f}%")

# Display a few samples with special characters
df[df['has_special_chars']]['text'].head()


Percentage of texts with special characters: 63.66%


0      ( * ) Denotes co @-@ producer . \n
2        = = Credits and personnel = = \n
4      Credits adapted from Allmusic . \n
9               Joshua Berkman – A & R \n
10      Safaree " SB " Samuels – A & R \n
Name: text, dtype: object
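Knowing which special characters occur most often helps decide what to strip. The sketch below counts them with the same character class as `has_special_chars`; the helper `count_special_chars` and the three sample rows (copied from the output above) are for illustration only.

```python
import re
from collections import Counter

# Same class as has_special_chars: anything that is not an ASCII letter or whitespace
SPECIAL = re.compile(r"[^a-zA-Z\s]")

def count_special_chars(texts):
    """Count every character matched by SPECIAL across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(SPECIAL.findall(text))
    return counts

sample_texts = [
    "( * ) Denotes co @-@ producer . \n",
    "= = Credits and personnel = = \n",
    "Joshua Berkman – A & R \n",
]

special_counts = count_special_chars(sample_texts)
print(special_counts.most_common(3))
```

Because the class only accepts ASCII letters, it also flags digits and accented letters as special, which matters for names and non-English words.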

## 4. Clean and Preprocess the Text
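We'll now clean the text by removing every character that is not an ASCII letter or whitespace (such as =, " and digits) and filter out texts that are too short (fewer than 5 words) or too long (more than 100 words). Word counts come from the raw text (`text_length`), so a kept row can have fewer than 5 words after cleaning. This creates a clean dataset suitable for model training.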

In [4]:
# Function to clean the text: remove special characters
def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.strip()

# Apply text cleaning
df['clean_text'] = df['text'].apply(clean_text)

# Filter out short and long texts
df_clean = df[(df['text_length'] >= 5) & (df['text_length'] <= 100)].reset_index(drop=True)

print(f"Remaining samples after cleaning and filtering: {len(df_clean)}")
df_clean.head()


Remaining samples after cleaning and filtering: 331002


| | text | text_length | has_special_chars | clean_text |
|---|---|---|---|---|
| 0 | ( * ) Denotes co @-@ producer . \n | 8 | True | Denotes co producer |
| 1 | = = Credits and personnel = = \n | 7 | True | Credits and personnel |
| 2 | Credits adapted from Allmusic . \n | 5 | True | Credits adapted from Allmusic |
| 3 | Joshua Berkman – A & R \n | 6 | True | Joshua Berkman A R |
| 4 | Safaree " SB " Samuels – A & R \n | 9 | True | Safaree SB Samuels A R |
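The first row above has 8 raw tokens but only 3 words after cleaning, so it passes the length filter even though little text remains. One possible variation, sketched here as an assumption rather than as the notebook's method, is to collapse the whitespace left behind by removed symbols and to filter on the cleaned word count; the helper `keep` is hypothetical.

```python
import re

def clean_text(text):
    # Same rule as the notebook: keep ASCII letters and whitespace only
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Extra step (not in the notebook): collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

def keep(text, min_words=5, max_words=100):
    """Decide whether to keep a row based on its word count after cleaning."""
    n_words = len(clean_text(text).split())
    return min_words <= n_words <= max_words

raw = "( * ) Denotes co @-@ producer . \n"
cleaned = clean_text(raw)

print(len(raw.split()), "raw tokens ->", len(cleaned.split()), "cleaned words")
print("kept:", keep(raw))
```

Filtering this way would drop more rows than the 331002 reported above, so the counts in this notebook would change.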


## 5. Vocabulary and Tokenization
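Next, we build a vocabulary from the cleaned text and tokenize the dataset. Tokenization here means splitting on whitespace and mapping each word to an integer index, which is the form the models take as input. Indices are assigned in order of descending word frequency, so index 0 is the most frequent word.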

In [5]:
from collections import Counter

# Tokenize the cleaned text
def tokenize(text, word_to_index):
    return [word_to_index[word] for word in text.split() if word in word_to_index]

# Build the vocabulary
vocab = Counter(word for sentence in df_clean['clean_text'] for word in sentence.split())
word_to_index = {word: i for i, (word, _) in enumerate(vocab.most_common())}
index_to_word = {i: word for word, i in word_to_index.items()}

# Tokenize the dataset
df_clean['tokenized'] = df_clean['clean_text'].apply(lambda x: tokenize(x, word_to_index))

# Show tokenized sample
df_clean[['clean_text', 'tokenized']].head()


| | clean_text | tokenized |
|---|---|---|
| 0 | Denotes co producer | [39756, 625, 713] |
| 1 | Credits and personnel | [878, 2, 1065] |
| 2 | Credits adapted from Allmusic | [878, 1036, 16, 6954] |
| 3 | Joshua Berkman A R | [8725, 9903, 43, 640] |
| 4 | Safaree SB Samuels A R | [71158, 17729, 42428, 43, 640] |
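A quick way to sanity-check the mapping is to encode a sentence and decode it again. The sketch below rebuilds a frequency-ordered vocabulary on two sample sentences; `encode` and `decode` are illustrative names, with `encode` following the same logic as `tokenize` above.

```python
from collections import Counter

sentences = ["Credits and personnel", "Credits adapted from Allmusic"]

# Frequency-ordered vocabulary, built the same way as in the cell above
vocab = Counter(word for s in sentences for word in s.split())
word_to_index = {word: i for i, (word, _) in enumerate(vocab.most_common())}
index_to_word = {i: word for word, i in word_to_index.items()}

def encode(text):
    # Words missing from the vocabulary are silently dropped
    return [word_to_index[w] for w in text.split() if w in word_to_index]

def decode(indices):
    return " ".join(index_to_word[i] for i in indices)

for s in sentences:
    print(encode(s), "->", decode(encode(s)))
```

Note that a word absent from the vocabulary simply disappears from the encoded sequence, so unseen text is shortened rather than marked with an unknown token.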


## 6. Save the Preprocessed Dataset
We will save the preprocessed dataset as a CSV file for easy access during model training. This file contains both the cleaned text and tokenized versions.

In [6]:
path = '../../datasets/wiki-103-text/preprocessed/wiki-103.csv'

# Save the preprocessed dataset as a CSV file
df_clean[['clean_text', 'tokenized']].to_csv(path, index=False)

print(f"Preprocessed dataset saved at {path}")

Preprocessed dataset saved at ../../datasets/wiki-103-text/preprocessed/wiki-103.csv
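When a list column is written to CSV, each list is stored as its string representation, so it has to be parsed when the file is read back. A minimal sketch using only the standard library, with an in-memory buffer standing in for the saved file:

```python
import ast
import csv
import io

# In-memory CSV standing in for the saved file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["clean_text", "tokenized"])
writer.writerow(["Credits and personnel", str([878, 2, 1065])])
buffer.seek(0)

rows = list(csv.DictReader(buffer))
raw_value = rows[0]["tokenized"]          # the string "[878, 2, 1065]"
token_ids = ast.literal_eval(raw_value)   # parsed back into a list of ints

print(type(raw_value).__name__, "->", type(token_ids).__name__, token_ids)
```

With Pandas, the same parsing can be applied at load time via `pd.read_csv(path, converters={'tokenized': ast.literal_eval})`.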


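## 7. Save the Vocabulary Mappings
We also save word-to-index and index-to-word mappings as JSON files. Note that these mappings are rebuilt here from all rows of `df` in alphabetical order, so their indices differ from the frequency-ordered indices used for the `tokenized` column in section 5.

In [7]: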
import json

# Create the word-to-index and index-to-word mappings
def create_vocab_mapping(tokenized_texts):
    vocab = sorted(set(word for sentence in tokenized_texts for word in sentence.split()))
    word_to_index = {word: i for i, word in enumerate(vocab)}
    index_to_word = {i: word for i, word in enumerate(vocab)}
    return word_to_index, index_to_word

# Cleaned (not yet tokenized) texts from the full DataFrame, including rows filtered out of df_clean
tokenized_texts = df['clean_text'].tolist()

# Create the vocab mappings
word_to_index, index_to_word = create_vocab_mapping(tokenized_texts)

# Save the mappings to files
with open('../../datasets/wiki-103-text/preprocessed/word_to_index.json', 'w') as f:
    json.dump(word_to_index, f)

with open('../../datasets/wiki-103-text/preprocessed/index_to_word.json', 'w') as f:
    json.dump(index_to_word, f)

print(f"Vocabulary size: {len(word_to_index)}")


Vocabulary size: 387291
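JSON object keys are always strings, so the integer keys of `index_to_word` come back as strings after loading. A small sketch of restoring them, using a toy mapping rather than the saved files:

```python
import json

# Toy mapping standing in for the saved index_to_word dictionary
index_to_word = {0: "and", 1: "Credits"}

# Round trip through JSON: the integer keys become strings
restored = json.loads(json.dumps(index_to_word))

# Convert the keys back to integers before using the mapping for decoding
restored_int = {int(i): word for i, word in restored.items()}

print(list(restored.keys()), "->", list(restored_int.keys()))
```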


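## 8. Build a Reduced Dataset with a Limited Vocabulary
Finally, we draw a random sample of 20,000 rows from a smaller preprocessed CSV, limit the vocabulary to the 5,000 most frequent lowercased words, and map every other word to an `<unk>` token at index 0.

In [8]: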
import pandas as pd
import json
from collections import Counter

# Path to the input CSV (must contain a 'clean_text' column)
path = '../../datasets/wiki-103-text/preprocessed/wiki-103_small.csv'

# Load the dataframe
df = pd.read_csv(path)

# Select 20,000 examples from the dataset
df_reduced = df.sample(n=20000, random_state=42)  # Randomly select 20,000 rows (fixed seed for reproducibility)

# Tokenize the text and count word frequencies
def tokenize(text):
    if isinstance(text, str):
        return text.lower().split()
    return []  # Return an empty list for non-string values

print("Tokenizing dataset...")
word_counter = Counter()

# Count the words in the reduced dataset
for text in df_reduced['clean_text']:
    words = tokenize(text)
    word_counter.update(words)

# Select the top N most frequent words to limit vocabulary size
max_vocab_size = 5000
print(f"Limiting vocabulary to {max_vocab_size} most frequent words...")
most_common_words = [word for word, _ in word_counter.most_common(max_vocab_size)]

# Create word-to-index and index-to-word dictionaries
word_to_index_reduced = {word: idx + 1 for idx, word in enumerate(most_common_words)}  # Start indexing from 1
word_to_index_reduced["<unk>"] = 0  # Unknown token for words outside the vocabulary
index_to_word_reduced = {idx: word for word, idx in word_to_index_reduced.items()}

# Function to convert text to word indices
def text_to_indices(text, word_to_index):
    words = tokenize(text)
    return [word_to_index.get(word, word_to_index["<unk>"]) for word in words]

# Convert 'clean_text' to indexed sequences
print("Converting text to indexed sequences...")
df_reduced['indexed_text'] = df_reduced['clean_text'].apply(lambda x: text_to_indices(x, word_to_index_reduced))

# Save the reduced dataset to CSV
reduced_csv_path = path.replace('.csv', '_reduced.csv')
df_reduced[['clean_text', 'indexed_text']].to_csv(reduced_csv_path, index=False)
print(f"Reduced dataset saved to: {reduced_csv_path}")

# Save word_to_index and index_to_word as JSON files
with open('../../datasets/wiki-103-text/preprocessed/word_to_index_reduced.json', 'w') as f:
    json.dump(word_to_index_reduced, f)
with open('../../datasets/wiki-103-text/preprocessed/index_to_word_reduced.json', 'w') as f:
    json.dump(index_to_word_reduced, f)

print("Word-to-index and index-to-word mappings saved.")


Tokenizing dataset...
Limiting vocabulary to 5000 most frequent words...
Converting text to indexed sequences...


Reduced dataset saved to: ../../datasets/wiki-103-text/preprocessed/wiki-103_small_reduced.csv
Word-to-index and index-to-word mappings saved.
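Capping the vocabulary means some tokens are replaced by `<unk>`, and the share of such tokens indicates how much information is lost. The sketch below measures it on a toy corpus; `build_vocab` and `unk_rate` are hypothetical helpers that mirror the indexing scheme above (`<unk>` at 0, frequent words from 1).

```python
from collections import Counter

def build_vocab(texts, max_vocab_size):
    """Map the most frequent lowercased words to 1..N and reserve 0 for <unk>."""
    counter = Counter(w for t in texts for w in t.lower().split())
    word_to_index = {"<unk>": 0}
    for idx, (word, _) in enumerate(counter.most_common(max_vocab_size), start=1):
        word_to_index[word] = idx
    return word_to_index

def unk_rate(texts, word_to_index):
    """Fraction of tokens that fall outside the vocabulary."""
    total = unknown = 0
    for t in texts:
        for w in t.lower().split():
            total += 1
            unknown += w not in word_to_index
    return unknown / total if total else 0.0

texts = ["the cat sat", "the dog sat", "the bird flew"]
w2i = build_vocab(texts, max_vocab_size=2)
rate = unk_rate(texts, w2i)

print(w2i, f"unk rate: {rate:.2f}")
```

On the reduced dataset, the same check could be run with `unk_rate(df_reduced['clean_text'].dropna(), word_to_index_reduced)` to judge whether 5,000 words is a reasonable limit.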
