Add pos_proportions #6

Merged
HLasse merged 4 commits into HLasse:master from MartinBernstorff:master
Aug 31, 2021

@MartinBernstorff
Contributor

Here goes!

The function runs fine separate from the package:

import spacy
from collections import Counter
from spacy.tokens import Doc

# Load the small English pipeline (tokenizer, tagger, parser and NER)
nlp = spacy.load("en_core_web_sm")

# Process the whole document
text = "Here is the first sentence. It was pretty short, yes. Let's make another one that's slightly longer and more complex."

doc = nlp(text)

def pos_proportions(doc: Doc) -> dict:
    """
    Returns:
        Dict with proportions of part-of-speech tag in doc.
    """
    # Count fine-grained tags (token.tag_), e.g. "NN" or "VBZ"
    pos_counts = Counter()

    for token in doc:
        pos_counts[token.tag_] += 1

    # Compute the total once rather than on every iteration
    total = sum(pos_counts.values())

    proportions = {}

    for tag in pos_counts:
        proportions[tag] = pos_counts[tag] / total

    return proportions

print(pos_proportions(doc))

However, the test fails with:

textdescriptives/tests/test_descriptive_stats.py F                       [100%]

=================================== FAILURES ===================================
_____________________________ test_pos_proportions _____________________________

nlp = <spacy.lang.en.English object at 0x7ffc82162550>

    def test_pos_proportions(nlp):
        doc = nlp(
            "Here is the first sentence. It was pretty short. Let's make another one that's slightly longer and more complex."
        )
    
>       assert doc._.pos_proportions == {'RB': 0.125, 'VBZ': 0.08333333333333333, 'DT': 0.08333333333333333, 'JJ': 0.125, 'NN': 0.08333333333333333, '.': 0.125, 'PRP': 0.08333333333333333, 'VBD': 0.041666666666666664, 'VB': 0.08333333333333333, 'WDT': 0.041666666666666664, 'JJR': 0.041666666666666664, 'CC': 0.041666666666666664, 'RBR': 0.041666666666666664}
E       AssertionError: assert {'': 1.0} == {'.': 0.125, ...': 0.125, ...}
E         Left contains 1 more item:
E         {'': 1.0}
E         Right contains 13 more items:
E         {'.': 0.125,
E          'CC': 0.041666666666666664,
E          'DT': 0.08333333333333333,
E          'JJ': 0.125,...

I wager that's because I've not implemented the function correctly in the package somewhere, and would love a hand with that :-)

Owner

@HLasse HLasse left a comment

Good job! You solved that easily :) I've made a couple of suggestions on how to take better advantage of Counter and make it a bit more concise.
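The suggestions themselves are not reproduced in this thread. As a rough sketch of what leaning on Counter could look like, the helper below counts tags directly from an iterable and builds the proportions in one comprehension. The name `tag_proportions` and the plain list of tag strings are assumptions made for illustration; they are not the package's actual code.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_proportions(tags: Iterable[str]) -> Dict[str, float]:
    """Return each tag's share of the total number of tags."""
    # Counter accepts any iterable, so no explicit counting loop is needed.
    counts = Counter(tags)
    total = sum(counts.values())
    # An empty input gives an empty dict, so no division by zero occurs.
    return {tag: count / total for tag, count in counts.items()}


# Hypothetical tags standing in for (token.tag_ for token in doc).
print(tag_proportions(["DT", "NN", "VBZ", "DT", "NN", "."]))
```

With a spaCy Doc, the same helper could presumably be called as `tag_proportions(token.tag_ for token in doc)`.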

Before I merge I'd like you to change a few things:

Make a new pipeline component for pos_proportions. None of the other functions in descriptive_stats need a POS tagger, which is nice if you just want to calculate descriptive statistics fast (without running the text through a model first). This is probably also why the test fails: only an empty spaCy pipeline is loaded for the descriptive_stats tests. So, to keep it this way, make a new script in the components folder where you implement the pos_proportions method and make it play with the rest of the package.
This requires a bit of fiddling with __init__.py files and some other stuff. Reach out if you have issues!
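The `{'': 1.0}` in the assertion error fits this explanation. Assuming a pipeline without a tagger leaves every token's tag as an empty string, all tokens land in a single bucket whose proportion is 1.0. The sketch below imitates that with plain strings rather than a real spaCy pipeline; the token count of 24 is only an illustrative value.

```python
from collections import Counter

# Stand-in for the token.tag_ values of an untagged doc
# (assumption: each tag is the empty string when no tagger has run).
untagged_tags = [""] * 24

counts = Counter(untagged_tags)
total = sum(counts.values())
proportions = {tag: count / total for tag, count in counts.items()}

print(proportions)  # {'': 1.0}
```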

MartinBernstorff and others added 2 commits August 31, 2021 13:34
@MartinBernstorff
Contributor Author

MartinBernstorff commented Aug 31, 2021

Ah, makes a ton of sense! I've moved pos_proportions into the pos_stats component and added basic tests to test_pos_stats.py. The tests are passing now :-)

CodeFactor is flagging an issue with dataframe_extract. It probably isn't due to my commit, so I'll leave it be for now.

Let me know if you'd like more changes!

@HLasse
Owner

HLasse commented Aug 31, 2021

Looks good! Let's keep the methods simple for now; we can always add complexity when required. If you update the docs, I'll merge.

@HLasse HLasse merged commit b6f872d into HLasse:master Aug 31, 2021
@HLasse
Owner

HLasse commented Aug 31, 2021

I tried to be smart and do the review in GitHub's online editor, which somehow made me prematurely merge into main. I'll clean up some of the code if you go ahead with the docs.
