#### About

> Dependency Parsing

Dependency parsing is a natural language processing (NLP) technique that analyzes the grammatical structure of a sentence by identifying the dependency relations between its words. The syntactic structure is represented as a directed graph, usually a rooted tree, in which the words are the nodes and each labeled edge links a head word to one of its dependents.



Example:

Consider the sentence "The cat chased the mouse." In its dependency parse, "chased" is the governing node (the head), and "cat" and "mouse" are its dependents. The edge between "chased" and "cat" is labeled as the subject relation (nsubj), indicating that "cat" is the subject of the verb "chased." Similarly, the edge between "chased" and "mouse" is labeled as the object relation (dobj in the spaCy output shown later, obj in Universal Dependencies), indicating that "mouse" is the object of the verb "chased."
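
A parse like this one can be stored compactly as a list of head indices plus a list of labels. The sketch below is a minimal illustration in plain Python; the head indices and labels follow the analysis described above, and the helper names `is_tree` and `dependents` are hypothetical, not part of any library.

```python
# Dependency parse of "The cat chased the mouse." stored as parallel lists.
# heads[i] is the 1-based position of the head of word i+1; 0 marks the root.
words = ["The", "cat", "chased", "the", "mouse", "."]
heads = [2, 3, 0, 5, 3, 3]
labels = ["det", "nsubj", "ROOT", "det", "dobj", "punct"]


def is_tree(heads):
    """Return True if the head list has exactly one root and no cycles."""
    if heads.count(0) != 1:
        return False
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        # Follow head links upward until the root marker (0) is reached
        while node != 0:
            if node in seen:
                return False  # revisited a node, so there is a cycle
            seen.add(node)
            node = heads[node - 1]
    return True


def dependents(heads, head_index):
    """Return the 1-based positions of the words whose head is head_index."""
    return [i + 1 for i, h in enumerate(heads) if h == head_index]


print(is_tree(heads))
for i in dependents(heads, 3):
    print(words[i - 1], "--", labels[i - 1], "-->", words[2])
```

Storing one head per word guarantees that every word has exactly one incoming edge, which is why the single-root and no-cycle checks are enough to confirm that the structure is a tree.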



Dataset - Universal Dependencies (https://universaldependencies.org/)

Using spaCy for dependency parsing

In [None]:
import spacy

# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

# Input sentence
sentence = "The cat chased the mouse."

# Run the full pipeline (tokenizer, tagger, parser, ...) on the sentence
doc = nlp(sentence)

In [None]:
# Print each token with its lemma, POS tag, dependency label, and head
for token in doc:
  print("Word >>", token.text," >> Lemma >>", token.lemma_,">>  POS tag:>>", token.pos_, "  Dependency:>>", token.dep_, "  Head:>>", token.head.text)

Word >> The  >> Lemma >> the >>  POS tag:>> DET   Dependency:>> det   Head:>> cat
Word >> cat  >> Lemma >> cat >>  POS tag:>> NOUN   Dependency:>> nsubj   Head:>> chased
Word >> chased  >> Lemma >> chase >>  POS tag:>> VERB   Dependency:>> ROOT   Head:>> chased
Word >> the  >> Lemma >> the >>  POS tag:>> DET   Dependency:>> det   Head:>> mouse
Word >> mouse  >> Lemma >> mouse >>  POS tag:>> NOUN   Dependency:>> dobj   Head:>> chased
Word >> .  >> Lemma >> . >>  POS tag:>> PUNCT   Dependency:>> punct   Head:>> chased


####  Training a custom dependency parsing model using Keras (TensorFlow)


In [1]:
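# Note: the training cells further down use TensorFlow/Keras and conllu,
# so this pinned PyTorch install is not required by them.
!pip install -U torch==1.8.0 torchtext==0.9.0

# Restart the runtime so the newly installed versions are picked up
exit()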


Dataset - https://github.com/UniversalDependencies/UD_English-EWT.git

UAS (Unlabeled Attachment Score) and LAS (Labeled Attachment Score) are two common metrics used to evaluate the performance of a dependency parser.

UAS measures the percentage of correct predictions for the head of each word, regardless of the type of dependency label. It is calculated by dividing the number of correctly predicted head words by the total number of words in the dataset.

LAS, on the other hand, measures the percentage of correct predictions for both the head and the dependency label of each word. It is calculated by dividing the number of correctly predicted heads and labels by the total number of words in the dataset.
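
The two definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used later in this notebook; the function name `attachment_scores` and the toy gold and predicted values are made up for the example.

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Return (UAS, LAS) for parallel per-token lists of heads and labels."""
    total = len(gold_heads)
    if total == 0:
        raise ValueError("cannot score an empty token list")
    head_correct = 0
    head_and_label_correct = 0
    for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels):
        if gh == ph:
            head_correct += 1  # counts toward UAS
            if gl == pl:
                head_and_label_correct += 1  # counts toward LAS as well
    return head_correct / total, head_and_label_correct / total


# Toy example for "The cat chased the mouse." (0 marks the root)
gold_heads = [2, 3, 0, 5, 3, 3]
gold_labels = ["det", "nsubj", "root", "det", "obj", "punct"]
pred_heads = [2, 3, 0, 3, 3, 3]  # wrong head for the second "the"
pred_labels = ["det", "nsubj", "root", "det", "nmod", "punct"]  # wrong label for "mouse"

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
print("UAS:", round(uas, 3))
print("LAS:", round(las, 3))
```

Here five of six heads are correct, so UAS is 5/6. Only four tokens have both the correct head and the correct label, so LAS is 4/6. LAS can never exceed UAS, because a token only counts toward LAS if its head is already correct.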

In [273]:
import os

import numpy as np
import tensorflow as tf
from conllu import parse_incr
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer




In [274]:
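def load_data(file_path):
    """Read a CoNLL-U file and return parallel lists of word forms and dependency labels."""
    sentences = []
    labels = []

    with open(file_path, "r", encoding="utf-8") as file:
        for token_list in parse_incr(file):
            sentence = []
            label = []
            for token in token_list:
                # Skip multiword-token ranges and empty nodes: conllu gives them
                # tuple ids, and they carry no basic dependency relation
                if not isinstance(token["id"], int):
                    continue
                sentence.append(token["form"])
                label.append(token["deprel"])
            sentences.append(sentence)
            labels.append(label)

    return sentences, labels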
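
For reference, the fields read by `load_data` come from the tab-separated columns of the CoNLL-U format. The sketch below parses one sentence by hand in plain Python to show where `form`, `head`, and `deprel` sit. It is an illustration only: the sample sentence is hand-written rather than taken from the EWT treebank, and `parse_conllu_sentence` is a hypothetical helper, not part of the `conllu` package.

```python
# The ten tab-separated CoNLL-U columns, in order
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        token = dict(zip(FIELDS, line.split("\t")))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        if not token["id"].isdigit():
            continue
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 means the token is the root
        tokens.append(token)
    return tokens


# Hand-written sample; unused columns are left as "_"
sample = "\n".join([
    "# text = The cat chased the mouse.",
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tchased\tchase\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tthe\tthe\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tmouse\tmouse\tNOUN\t_\t_\t3\tobj\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

tokens = parse_conllu_sentence(sample)
forms = [t["form"] for t in tokens]
sample_heads = [t["head"] for t in tokens]
deprels = [t["deprel"] for t in tokens]
print(forms)
print(sample_heads)
print(deprels)
```

Note that `load_data` keeps only the `form` and `deprel` columns. A model that should be scored with UAS and LAS would also need the `head` column as a training target.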



In [275]:
# Load the dataset
file_path = "/content/en_ewt-ud-train.conllu"
sentences, labels = load_data(file_path)




In [276]:
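# Build the word vocabulary from the token lists.
# Note: the Tokenizer lowercases by default, and without an oov_token
# any word not seen here is silently dropped at prediction time.
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(sentences)
word_index = word_tokenizer.word_index

# Build the label vocabulary from the dependency relations
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_index = label_tokenizer.word_index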



In [278]:
# Convert sentences and labels to integer sequences
sequences = word_tokenizer.texts_to_sequences(sentences)
label_sequences = label_tokenizer.texts_to_sequences(labels)



In [279]:
# Pad every sequence with zeros (index 0) to the length of the longest sentence
max_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding="post")
padded_labels = pad_sequences(label_sequences, maxlen=max_length, padding="post")



In [280]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, padded_labels, test_size=0.2, random_state=42)



In [281]:
# Create the Keras model
model = Sequential([
    Embedding(input_dim=len(word_index) + 1, output_dim=128, input_length=max_length),
    Bidirectional(LSTM(256, return_sequences=True)),
    Dense(len(label_index) + 1, activation="softmax")
])

# Compile the model
model.compile(optimizer=Adam(), loss="sparse_categorical_crossentropy", metrics=["accuracy"])



In [282]:
# Train the model
model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f7a061751f0>

In [283]:
from tensorflow.keras.models import load_model

# Save the model
model.save("dependency_parsing_model.h5")


In [284]:
# Load the saved model
loaded_model = load_model("dependency_parsing_model.h5")


In [285]:
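# Preprocess the sample text.
# Note: the raw string is split on whitespace with punctuation stripped, and
# words missing from the training vocabulary are dropped, so the predicted
# labels may not line up one-to-one with the words of the sentence.
sample_text = "This is a demo of custom dependency parsing model trained in keras on the universal dependency dataset."
tokenized_sample = word_tokenizer.texts_to_sequences([sample_text])
padded_sample = pad_sequences(tokenized_sample, maxlen=max_length, padding="post")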


In [286]:
# Predict the dependency relations
predictions = loaded_model.predict(padded_sample)




In [287]:
# Convert the predictions to labels
predicted_labels = np.argmax(predictions, axis=-1)
label_sequences = label_tokenizer.sequences_to_texts(predicted_labels)


In [288]:
# Print the results
print("Sample text:", sample_text)
print("Predicted dependency relations:", label_sequences[0])

Sample text: This is a demo of custom dependency parsing model trained in keras on the universal dependency dataset.
Predicted dependency relations: nsubj cop det case amod compound nmod acl case case det amod obl


Calculating UAS and LAS scores (approximated here by label accuracy, since the model predicts labels but not heads)

In [289]:
loss, accuracy = model.evaluate(X_val, y_val)




In [290]:
print("Accuracy:", accuracy)


Accuracy: 0.9846726655960083
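
The accuracy reported by `model.evaluate` is averaged over every position of the padded sequences, padding included, because the model above does not mask index 0. Since most sentences are much shorter than `max_length`, the easy padding positions dominate the average. The toy sketch below, in plain Python with made-up label ids, shows how large the gap can be; `token_accuracy` is a hypothetical helper written for this illustration.

```python
PAD = 0  # index reserved for padding, as with pad_sequences


def token_accuracy(gold, pred, ignore_padding):
    """Fraction of positions where pred equals gold, optionally skipping padding."""
    correct = 0
    total = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if ignore_padding and g == PAD:
                continue
            total += 1
            correct += int(g == p)
    return correct / total


# Two short "sentences" padded to length 8 (label ids are invented)
gold = [[3, 1, 2, 0, 0, 0, 0, 0],
        [1, 2, 0, 0, 0, 0, 0, 0]]
pred = [[3, 2, 2, 0, 0, 0, 0, 0],
        [2, 2, 0, 0, 0, 0, 0, 0]]

with_padding = token_accuracy(gold, pred, ignore_padding=False)
without_padding = token_accuracy(gold, pred, ignore_padding=True)
print("Accuracy including padding:", with_padding)
print("Accuracy on real tokens only:", without_padding)
```

In this toy case the score drops from 14/16 to 3/5 once padding is excluded, so the figure printed above is best read as an upper bound on the per-token label accuracy.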


In [291]:
predictions = model.predict(X_val)




In [292]:
# Convert the predictions to labels
predicted_labels = np.argmax(predictions, axis=-1)


In [293]:
# Convert the integer labels back to text labels
predicted_labels = label_tokenizer.sequences_to_texts(predicted_labels)
y_val_labels = label_tokenizer.sequences_to_texts(y_val)


In [295]:
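# Score the predictions token by token, ignoring padding positions.
# Note: this model predicts only the dependency label of each token, not its
# head, so true UAS and LAS cannot be computed from its output. The two values
# below are label accuracies, reported with and without punctuation.
pred_ids = np.argmax(predictions, axis=-1)  # shape: (sentences, max_length)
mask = y_val != 0                           # index 0 is reserved for padding
matches = (pred_ids == y_val) & mask

label_accuracy = matches.sum() / mask.sum()

# Some evaluation setups exclude punctuation tokens from the score
punct_id = label_index.get("punct")
non_punct = mask & (y_val != punct_id)
label_accuracy_no_punct = (matches & non_punct).sum() / non_punct.sum()

In [None]:
print("Label accuracy:", label_accuracy)
print("Label accuracy (excluding punctuation):", label_accuracy_no_punct)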