
Formulae for dependency distance calculation on Doc level #77

Closed
bma-vandijk opened this issue Dec 1, 2022 · 3 comments


bma-vandijk commented Dec 1, 2022

Hi,

First of all, thanks for this very helpful library. I have a question about how dependency distance (DD) is calculated for Doc objects.

Your function for calculating DD for a Doc returns the DD value here: `"dependency_distance_mean": np.mean(dep_dists)`. As far as I can see, the mean returned is the mean over the mean DD of every sentence (contained in `dep_dists`) that constitutes the Doc object.

The two sources you cite on dependency distance in your documentation (Liu, 2008 and Oya, 2008), however, take a different approach.

For calculating the DD of a text, Liu seems to take the sum of the absolute DDs found in the whole text and multiply it by 1 / (number of words - number of sentences). Oya seems to take a mean of means like you do, but for a sentence averages the sum of absolute DDs over the number of dependency links in the utterance. Neither in your documentation nor in your code can I find how exactly you calculate DD for a text.
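To make sure I am reading the two sources correctly, this is how I understand the two formulas (with $n$ the number of words, $s$ the number of sentences, and $DD_i$ the absolute distance of the $i$-th dependency link):

$$\mathrm{MDD}_{\mathrm{Liu}}(\text{text}) = \frac{1}{n - s}\sum_{i=1}^{n-s}\lvert DD_i\rvert$$

$$\mathrm{MDD}_{\mathrm{Oya}}(\text{sentence}) = \frac{\sum_i \lvert DD_i\rvert}{\text{number of dependency links in the sentence}}, \qquad \mathrm{MDD}_{\mathrm{Oya}}(\text{text}) = \frac{1}{s}\sum_{\text{sentences}}\mathrm{MDD}_{\mathrm{Oya}}(\text{sentence})$$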

Would you please be so kind as to explain which approach you use to calculate DD for Doc objects, and provide some pointers on how we might adapt the code to implement, e.g., the approaches of Liu and Oya? Thanks!

Which page or section is this issue related to?

https://hlasse.github.io/TextDescriptives/dependencydistance.html

bma-vandijk added the documentation label on Dec 1, 2022
HLasse (Owner) commented Dec 5, 2022

Hi @bma-vandijk, thanks for the question and for using the library!
The code for calculating DD is contained in textdescriptives/components/dependency_distance.py, with the main logic in token_dependency.

The implementation in textdescriptives follows Oya: we calculate the distance from each token to its dependent and take the mean of these distances to get the mean dependency distance for spans. In our implementation, the Doc-level DD is calculated by averaging over the sentence-level mean DDs.
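For a quick end-to-end illustration, here is a minimal sketch (the example sentences are arbitrary, and depending on your textdescriptives version the component may be registered as `dependency_distance` or `textdescriptives/dependency_distance`):

```python
import spacy
import textdescriptives  # noqa: F401  # importing registers the components with spaCy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("dependency_distance")  # component name may differ between versions

doc = nlp("The cat sat on the mat. The dog barked at the cat.")

# Doc-level DD: the mean over the sentence-level mean dependency distances
print(doc._.dependency_distance["dependency_distance_mean"])
```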

To get the sentence-level means as in Oya, you could simply do:

dep_dists, adj_deps = zip(
    *[sent._.dependency_distance.values() for sent in doc.sents]
)
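
and then, for instance, aggregate the sentence-level means yourself; a small sketch of what the Doc-level value corresponds to:

```python
import numpy as np

# mean over the sentence-level mean dependency distances (Oya's text-level measure)
doc_level_dd = np.mean(dep_dists)
```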

To calculate the metric you cite from Liu, sum(DD) * (1 / (number of words - number of sentences)), you could do something like the following (assuming you have added the dependency distance pipeline):

from spacy.tokens import Doc


def liu_doc_dependency(doc: Doc) -> float:
    """Calculate mean dependency distance from Liu, 2008."""
    # sum of the token-level dependency distances over the whole Doc
    dd = sum(token._.dependency_distance["dependency_distance"] for token in doc)
    return dd * (1 / (len(doc) - len(list(doc.sents))))
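
For example, called on a Doc processed with the pipeline above (hypothetical sentences):

```python
doc = nlp("The cat sat on the mat. The dog barked at the cat.")
print(liu_doc_dependency(doc))  # sum of token DDs / (n_words - n_sentences)
```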

Let me know if you have any other questions!

bma-vandijk changed the title from "Formulae for dependency distance calulcation on Doc level" to "Formulae for dependency distance calculation on Doc level" on Dec 5, 2022
KennethEnevoldsen (Collaborator) commented

@HLasse, let's add this to the documentation as well.

bma-vandijk (Author) commented

This is super late, but I would still like to thank you for your swift and helpful response :)
