Add pos_proportions by MartinBernstorff · Pull Request #6 · HLasse/TextDescriptives

MartinBernstorff · 2021-08-25T13:11:55Z

Here goes!

The function runs fine separate from the package:

import spacy
from collections import Counter
from spacy.tokens import Doc

# Load the small English pipeline (tokenizer, tagger, parser and NER)
nlp = spacy.load("en_core_web_sm")

# Process a short example document
text = ("Here is the first sentence. It was pretty short, yes. Let's make another one that's slightly longer and more complex.")

doc = nlp(text)

def pos_proportions(doc: Doc) -> dict:
    """
    Returns:
        Dict with proportions of part-of-speech tag in doc.
    """
    pos_counts = Counter()

    # Count how often each fine-grained tag (token.tag_) occurs
    for token in doc:
        pos_counts[token.tag_] += 1

    # Compute the total once instead of once per tag
    total = sum(pos_counts.values())

    pos_proportions = {}

    for tag in pos_counts:
        pos_proportions[tag] = pos_counts[tag] / total

    return pos_proportions

print(pos_proportions(doc))

However, the test fails with:

textdescriptives/tests/test_descriptive_stats.py F                       [100%]

=================================== FAILURES ===================================
_____________________________ test_pos_proportions _____________________________

nlp = <spacy.lang.en.English object at 0x7ffc82162550>

    def test_pos_proportions(nlp):
        doc = nlp(
            "Here is the first sentence. It was pretty short. Let's make another one that's slightly longer and more complex."
        )
    
>       assert doc._.pos_proportions == {'RB': 0.125, 'VBZ': 0.08333333333333333, 'DT': 0.08333333333333333, 'JJ': 0.125, 'NN': 0.08333333333333333, '.': 0.125, 'PRP': 0.08333333333333333, 'VBD': 0.041666666666666664, 'VB': 0.08333333333333333, 'WDT': 0.041666666666666664, 'JJR': 0.041666666666666664, 'CC': 0.041666666666666664, 'RBR': 0.041666666666666664}
E       AssertionError: assert {'': 1.0} == {'.': 0.125, ...': 0.125, ...}
E         Left contains 1 more item:
E         {'': 1.0}
E         Right contains 13 more items:
E         {'.': 0.125,
E          'CC': 0.041666666666666664,
E          'DT': 0.08333333333333333,
E          'JJ': 0.125,...

I wager that's because I've not implemented the function correctly in the package somewhere, and would love a hand with that :-)

HLasse

Good job! You solved that easily :) I've made a couple of suggestions on how to take better advantage of Counter and make it a bit more concise.
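The suggested changes themselves are not reproduced in this thread. As a rough sketch of what leaning on Counter could look like, assuming the tags have already been pulled out into a plain list (the name `tag_proportions` is hypothetical and not part of the package):

```python
from collections import Counter
from typing import Dict, Iterable


def tag_proportions(tags: Iterable[str]) -> Dict[str, float]:
    """Return the share of each tag among all tags (hypothetical helper)."""
    # Counter accepts an iterable directly, so no manual counting loop is needed.
    counts = Counter(tags)
    total = sum(counts.values())
    # With empty input the comprehension has nothing to iterate over,
    # so no division by zero can happen and the result is an empty dict.
    return {tag: count / total for tag, count in counts.items()}


print(tag_proportions(["DT", "NN", "VBZ", "NN"]))
```

With a spaCy Doc, the same idea would presumably be called as `tag_proportions(token.tag_ for token in doc)`.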

Before I merge I'd like you to change a few things:

Make a new pipeline component for pos_proportions. None of the other functions in descriptive_stats need a POS tagger, which is nice if you just want to calculate descriptive statistics fast (without running the text through a model first). This is probably also the reason why the test fails: only an empty spaCy pipeline is loaded for the descriptive_stats tests. So, to keep it this way, make a new script in the components folder where you implement the pos_proportions method and make it play nicely with the rest of the package.
This requires a bit of fiddling with the __init__.py files and some other stuff. Reach out if you have issues!
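The `{'': 1.0}` in the failing assertion fits this explanation: if every token carries an empty tag, all counts land on the single key `''`. A small stand-alone sketch of that effect, using a hypothetical `FakeToken` class in place of real spaCy tokens (assumption: an untagged token exposes `tag_` as an empty string, as the test output suggests):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token; not a spaCy class."""
    text: str
    tag_: str = ""  # assumed value when no tagger has run


def pos_proportions(tokens: Iterable[FakeToken]) -> Dict[str, float]:
    # Same counting logic as in the PR, applied to the stand-in tokens.
    counts = Counter(token.tag_ for token in tokens)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# No tagger: every token keeps the empty default tag.
untagged = [FakeToken(word) for word in "Here is the first sentence .".split()]
# With a tagger: tokens carry real tags.
tagged = [FakeToken("Here", "RB"), FakeToken("is", "VBZ")]

print(pos_proportions(untagged))
print(pos_proportions(tagged))
```

The untagged case collapses into one bucket no matter how many tokens the text has, which is why the proportion is exactly 1.0.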


MartinBernstorff · 2021-08-31T12:34:43Z

Ah, makes a ton of sense! I've moved pos_proportions into the pos_stats component and added basic tests to test_pos_stats.py. The tests are passing now :-)

CodeFactor is flagging an issue with dataframe_extract. It probably isn't due to my commit, so I'll leave it be for now.

Let me know if you'd like more changes!

HLasse · 2021-08-31T12:46:42Z

looks good! let's keep the methods simple for now - we can always add complexity when required. If you update the docs then I'll merge

HLasse · 2021-08-31T12:49:26Z

I tried to be smart and do the review in GitHub's online editor, which somehow made me prematurely merge into main. I'll clean up some of the code if you go ahead with the docs.

MartinBernstorff added 2 commits August 25, 2021 14:21

Add docstring to DescriptiveStatistics.counts

ab40f1c

Add pos_proportions and attempt at test

6da3dd1

HLasse requested changes Aug 30, 2021


MartinBernstorff and others added 2 commits August 31, 2021 13:34

Apply suggestions from code review

efc7fbf

Make PR more pythonic Co-authored-by: HLasse <lasseh0310@gmail.com>

Move pos_proportions into new pos_stats component and add tests

b6f872d

HLasse merged commit b6f872d into HLasse:master Aug 31, 2021

HLasse merged 4 commits into HLasse:master from MartinBernstorff:master
