## Stage 2: Data Preparation Pipeline & Feature Extraction

Milestone: Implementation of the oracle, feature extraction, and obtaining training samples for a Keras model.

In this section, we run the complete data-processing pipeline that transforms the raw CoNLL-U training data into numerical vectors that can be fed to the neural network. The process has four main steps:


1. Data Loading & Filtering: We load the training dataset (en_partut-ud-train.conllu) and filter out non-projective trees, as the Arc-Eager algorithm is restricted to projective dependency structures.
2. Oracle Execution (Obtaining Samples): We run the Oracle on every valid sentence. The Oracle simulates the parsing process using the "gold standard" tree to generate the correct sequence of States (Input) and Transitions (Output/Target).
3. Feature Extraction: We convert the complex State objects into fixed-length lists of features using the state_to_feats function. This extracts the words and UPOS tags of the topmost Stack elements and the first Buffer elements.
4. Numerical Conversion (Vectorization): Neural networks require numerical input. We build vocabularies (dictionaries mapping strings to unique integer IDs) for words, tags, actions, and dependency labels. Finally, we convert all text features into NumPy arrays (X_train, y_act, y_dep) ready for Keras.
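
To make the filtering in the first step concrete, here is a minimal sketch of a projectivity test. It assumes a simplified encoding in which a sentence is a list of head indices; the function name `is_projective` and this encoding are illustrative assumptions, and `ConlluReader.remove_non_projective_trees` may be implemented differently.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head index of token i + 1; tokens are numbered from 1
    and 0 denotes the artificial ROOT (an assumed, simplified encoding).
    """
    # Store each arc as an ordered (left, right) span, ignoring direction
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # The arcs cross when the second starts strictly inside the first
            # and ends strictly outside it
            if left1 < left2 < right1 < right2:
                return False
    return True


projective = [2, 3, 0]         # 1 -> head 2, 2 -> head 3, 3 -> ROOT
non_projective = [3, 4, 0, 3]  # arcs 1-3 and 2-4 cross
print(is_projective(projective), is_projective(non_projective))  # True False
```

The pairwise comparison is quadratic in sentence length, which is usually acceptable for treebank sentences; a sentence for which the check fails cannot be reproduced by Arc-Eager and is therefore dropped.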

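The feature extraction in the third step can be sketched as follows. This is a hypothetical stand-in for state_to_feats, not its actual implementation: it assumes tokens are `(id, form, upos)` tuples, that the stack side is left-padded and the buffer side right-padded with `<PAD>`, and that all words come before all UPOS tags, which is the layout visible in the cell output further down.

```python
PAD = "<PAD>"


def state_to_feats_sketch(stack, buffer, nstack_feats=2, nbuffer_feats=2):
    """Build a fixed-length feature list from a parser state (illustrative only)."""
    # Topmost stack items (bottom-to-top order) and first buffer items
    top_stack = stack[max(len(stack) - nstack_feats, 0):]
    front_buffer = buffer[:nbuffer_feats]

    # Padding keeps the vector length constant when either structure is short
    stack_pad = [PAD] * (nstack_feats - len(top_stack))
    buffer_pad = [PAD] * (nbuffer_feats - len(front_buffer))

    words = (stack_pad + [form for _, form, _ in top_stack]
             + [form for _, form, _ in front_buffer] + buffer_pad)
    upos = (stack_pad + [tag for _, _, tag in top_stack]
            + [tag for _, _, tag in front_buffer] + buffer_pad)
    return words + upos


stack = [(0, "ROOT", "ROOT_UPOS")]
buffer = [(1, "Distribution", "NOUN"), (2, "of", "ADP"), (3, "this", "DET")]
print(state_to_feats_sketch(stack, buffer))
# ['<PAD>', 'ROOT', 'Distribution', 'of', '<PAD>', 'ROOT_UPOS', 'NOUN', 'ADP']
```

Keeping the length fixed at `2 * (nstack_feats + nbuffer_feats)` is what allows the samples to be stacked into a single rectangular array later on.
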
In [12]:
import numpy as np
from conllu_reader import ConlluReader
from algorithm import ArcEager
import pickle

# --- 1. LOAD DATA (Use the TRAIN file, not test) ---
print("--- STEP 1: Data Loading ---")
reader = ConlluReader()
# Ensure the filename matches your specific training file path
train_sentences = reader.read_conllu_file("en_partut-ud-train_clean.conllu") 

# Filter out non-projective trees, as Arc-Eager cannot handle them
train_sentences = reader.remove_non_projective_trees(train_sentences)
print(f" Loaded {len(train_sentences)} valid projective sentences for training.\n")

# --- 2. OBTAIN RAW SAMPLES (Oracle Execution) ---
print("--- STEP 2: Generating Samples with the Oracle ---")
arc_eager = ArcEager()
raw_samples = []

for sent in train_sentences:
    try:
        # The oracle returns a list of Sample objects (State + Transition) for this sentence
        samples = arc_eager.oracle(sent)
        raw_samples.extend(samples)
    except AssertionError:
        # If the oracle fails to reconstruct the exact gold tree, skip the sentence
        continue

print(f"Total samples (game states) generated: {len(raw_samples)}")

# VISUALIZATION: Let's see what a raw sample looks like
if raw_samples:
    print("Example of Raw Sample (Index 0):")
    print(f"   State: {raw_samples[0].state}")
    print(f"   Correct Action: {raw_samples[0].transition}\n")

# --- 3. FEATURE EXTRACTION (From State to List of Strings) ---
# We need to extract features from the stack and buffer
print("--- STEP 3: Feature Extraction (Translation to Text) ---")
X_raw = [] # Stores lists of words/tags (Input features)
Y_raw = [] # Stores actions and dependencies (Outputs)

for sample in raw_samples:
    # Extract features (words and UPOS tags) using the implemented function
    # nbuffer_feats=2 and nstack_feats=2 is the suggested configuration
    features = sample.state_to_feats(nbuffer_feats=2, nstack_feats=2)
    X_raw.append(features)
    
    # Save the action (transition) and the dependency label
    action_name = sample.transition.action
    dep_label = sample.transition.dependency
    Y_raw.append((action_name, dep_label))

# VISUALIZATION: What do the lists contain now?
print(f" Example of Input (X_raw[0]): {X_raw[0]}")
print("   (This is what the network 'sees': words and tags)")
print(f"Example of Output (Y_raw[0]): {Y_raw[0]}")
print("   (This is what the network must predict: Action and Label)\n")

# --- 4. PREPARATION FOR KERAS (Vocabularies and Numerical Conversion) ---
# Neural networks require numerical input
print("--- STEP 4: Numerical Conversion (For Keras) ---")

# 4.1 Create Dictionaries (Text -> Number Maps)
words_vocab = {'<PAD>': 0, '<UNK>': 1}
upos_vocab = {'<PAD>': 0, '<UNK>': 1}
actions_vocab = {}  # E.g., 'SHIFT': 0, 'LEFT-ARC': 1...
deprels_vocab = {None: 0} # E.g., 'nsubj': 1, 'det': 2...

# Fill vocabularies by iterating through all collected data
for features in X_raw:
    # Assuming features structure: [W_s2, W_s1, W_b1, W_b2, P_s2, P_s1, P_b1, P_b2]
    # The first half are words, the second half are UPOS tags
    num_words = len(features) // 2 
    
    words = features[:num_words]
    upos = features[num_words:]
    
    for w in words:
        if w not in words_vocab:
            words_vocab[w] = len(words_vocab)
    for u in upos:
        if u not in upos_vocab:
            upos_vocab[u] = len(upos_vocab)

for act, dep in Y_raw:
    if act not in actions_vocab:
        actions_vocab[act] = len(actions_vocab)
    if dep not in deprels_vocab:
        deprels_vocab[dep] = len(deprels_vocab)

print("Vocabulary Sizes:")
print(f"   Unique words: {len(words_vocab)}")
print(f"   Unique UPOS tags: {len(upos_vocab)}")
print(f"   Possible actions: {len(actions_vocab)} {actions_vocab}")
print(f"   Dependency relations: {len(deprels_vocab)}\n")

# 4.2 Convert everything to Numbers (Matrices for Keras)
# X_train will have shape (Num_Samples, Num_Features)
X_train_numerical = []
Y_train_actions = []
Y_train_deprels = []

for features, (act, dep) in zip(X_raw, Y_raw):
    # Convert INPUT (features): the first half are words, the second half are UPOS tags
    num_words = len(features) // 2
    word_ids = [words_vocab.get(w, words_vocab['<UNK>']) for w in features[:num_words]]
    upos_ids = [upos_vocab.get(u, upos_vocab['<UNK>']) for u in features[num_words:]]
    X_train_numerical.append(word_ids + upos_ids)

    # Convert OUTPUT (targets)
    Y_train_actions.append(actions_vocab[act])
    # Use 0 if the dependency is None (e.g., for SHIFT or REDUCE)
    Y_train_deprels.append(deprels_vocab.get(dep, 0))

# Convert to Numpy arrays (The actual input format Keras expects)
X_train = np.array(X_train_numerical)
y_act = np.array(Y_train_actions)
y_dep = np.array(Y_train_deprels)

print("DATA")
print(f"Final numerical example (X_train[0]): {X_train[0]}")
print("   (Notice how words are now IDs)")
# Find the action name corresponding to the ID for display purposes
id_to_action = {idx: name for name, idx in actions_vocab.items()}
act_name = id_to_action[int(y_act[0])]
print(f"Target Action (y_act[0]): {y_act[0]} -> Corresponds to '{act_name}'")

# Persist the arrays and vocabularies for the training stage
np.savez("training_data.npz", X=X_train, y_act=y_act, y_dep=y_dep)
with open("vocabs.pkl", "wb") as f:
    pickle.dump((words_vocab, upos_vocab, actions_vocab, deprels_vocab), f)
print("Data saved to 'training_data.npz' and 'vocabs.pkl'")

--- STEP 1: Data Loading ---
 Loaded 1748 valid projective sentences for training.

--- STEP 2: Generating Samples with the Oracle ---
Total samples (game states) generated: 81182
Example of Raw Sample (Index 0):
   State: Stack (size=1): (0, ROOT, ROOT_UPOS)
Buffer (size=13): (1, Distribution, NOUN) | (2, of, ADP) | (3, this, DET) | (4, license, NOUN) | (5, does, AUX) | (6, not, PART) | (7, create, VERB) | (8, an, DET) | (9, attorney, NOUN) | (10, -, PUNCT) | (11, client, NOUN) | (12, relationship, NOUN) | (13, ., PUNCT)
Arcs (size=0): set()

   Correct Action: SHIFT

--- STEP 3: Feature Extraction (Translation to Text) ---
 Example of Input (X_raw[0]): ['<PAD>', 'ROOT', 'Distribution', 'of', '<PAD>', 'ROOT_UPOS', 'NOUN', 'ADP']
   (This is what the network 'sees': words and tags)
Example of Output (Y_raw[0]): ('SHIFT', None)
   (This is what the network must predict: Action and Label)

--- STEP 4: Numerical Conversion (For Keras) ---
Vocabulary Sizes:
   Unique words: 6872
   Unique 