# Finding (Problem Statement) Signal in Sentence Phrasing

Here we compare the [n-grams](https://en.wikipedia.org/wiki/N-gram) of positively and negatively labeled sentences: n-grams that occur in both classes (low signal) versus n-grams that occur in only one class (high signal).

If some n-grams appear *almost* exclusively in one class (0 or 1), they could make great matching phrases: either for classifying a sentence directly, without a machine learning model, or for last-mile quality assurance that flags suspicious model decisions.
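To make the quality-assurance idea concrete, here is a minimal sketch. The phrase list, the function names, and the model predictions are all placeholders for illustration; a real phrase list would have to come out of the analysis below. It flags only sentences where a phrase matches but the model predicted 0, since a short phrase list cannot be expected to catch every positive.

```python
# Placeholder phrases for illustration only; not a vetted rule set.
POSITIVE_PHRASES = ("drawback is that", "limitation of", "problem of")


def matches_phrase(sentence, phrases=POSITIVE_PHRASES):
    """Return True if any phrase occurs in the lower-cased sentence."""
    lowered = sentence.lower()
    return any(phrase in lowered for phrase in phrases)


def flag_suspicious(sentences, model_labels, phrases=POSITIVE_PHRASES):
    """Return sentences that match a phrase but were predicted 0 by the model."""
    return [
        sentence
        for sentence, label in zip(sentences, model_labels)
        if label == 0 and matches_phrase(sentence, phrases)
    ]


sentences = [
    "The main drawback is that it is slow.",
    "We use a set of features.",
    "One limitation of the model is its size.",
]
model_labels = [1, 0, 0]  # hypothetical model predictions
print(flag_suspicious(sentences, model_labels))
```

Plain substring matching keeps the rule easy to audit; whether lower-casing is appropriate depends on how the phrases were extracted.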

In [2]:
from collections import Counter  # used below to count n-grams

import pandas as pd

In [8]:
df = pd.read_csv("datasets/problem_statements.csv")

In [5]:
def take_while(fn, coll):
    """Yield values from coll until fn is False"""
    for e in coll:
        if fn(e):
            yield e
        else:
            return

def partition(n, coll, step=None):
    return take_while(lambda e: len(e) == n,
        (coll[i:i+n] for i in range(0, len(coll), step or n)))

def partition_all(n, coll, step=None):
    return (coll[i:i+n] for i in range(0, len(coll), step or n))
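With `step=1`, `partition` yields every overlapping window of length `n` and stops at the first window that comes up short, which is exactly what an n-gram extractor needs. The sketch below shows the same sliding-window idea in a self-contained form using `zip`; `sliding_windows` is a hypothetical name and is not the helper used in this notebook.

```python
def sliding_windows(tokens, n):
    """Return all overlapping length-n windows of tokens as lists."""
    # zip stops at the shortest shifted slice, so short tails are dropped.
    return [list(window) for window in zip(*(tokens[i:] for i in range(n)))]


tokens = "The difficulty with this task".split(" ")
print(sliding_windows(tokens, 2))
print(sliding_windows(tokens, 3))
```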


In [13]:
#df = df.drop(columns=["Unnamed: 0", "index", "Unnamed: 0.1", "Unnamed: 9"])

In [183]:
positives = df[df["labels"] == 1]
negatives = df[df["labels"] == 0]

In [218]:
positives[:3]

| Unnamed: 0 | PMID | source | Title | text | labels | DOI |
|---|---|---|---|---|---|---|
| 0 | | acl_cambridge | | The difficulty with this task lies in the fact... | 1 | |
| 1 | | acl_cambridge | | The problem with rich annotations is that they... | 1 | |
| 2 | | acl_cambridge | | As a consequence , when adapting existing meth... | 1 | |


In [184]:
def n_grams(texts, n_gram=2): return [" ".join(n) for t in texts for n in partition(n_gram, t.split(" "), 1)]

In [185]:
bi_grams_pos = n_grams(positives["text"], 2)
tri_grams_pos = n_grams(positives["text"], 3)
bi_grams_neg = n_grams(negatives["text"], 2)
tri_grams_neg = n_grams(negatives["text"], 3)

In [219]:
bi_grams_pos[:3]

['The difficulty', 'difficulty with', 'with this']

In [223]:
d1 = dict(Counter(bi_grams_pos))
d2 = dict(Counter(bi_grams_neg))

d3 = dict(Counter(tri_grams_pos))
d4 = dict(Counter(tri_grams_neg))

bi_grams_both = {x:(d1[x], d2[x]) for x in d1 if x in d2}

tri_grams_both = {x:(d3[x], d4[x]) for x in d3 if d4.get(x)}

In [224]:
bi_grams_pos_only = {x:d1[x] for x in d1 if not d2.get(x)} # and d2.get(x) < 5
tri_grams_pos_only = {x:d3[x] for x in d3 if not d4.get(x)}

In [229]:
[(k,v) for k,v in Counter(tri_grams_both).items() if (v[0] + v[1]) > 20 ]
#n-gram: (pos, neg)

[('the fact that', (12, 21)),
 ('et al. ,', (9, 66)),
 ('that they are', (12, 12)),
 ('one of the', (23, 29)),
 ('is that the', (65, 6)),
 ('the quality of', (7, 15)),
 ('is that it', (50, 3)),
 ('that it is', (11, 18)),
 ('in the training', (6, 18)),
 ('the training data', (7, 33)),
 ('it does not', (10, 24)),
 ('to the same', (4, 18)),
 ('in terms of', (6, 18)),
 ('the use of', (5, 22)),
 ('as well as', (5, 21)),
 ('the most common', (18, 11)),
 ('can be used', (3, 27)),
 ('the number of', (9, 38)),
 ('the treatment of', (2, 19)),
 ('a set of', (1, 21)),
 ('is one of', (14, 11)),
 ('of the most', (17, 8)),
 ('is a common', (19, 9)),
 ('in patients with', (2, 22))]

In [232]:
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0] + v[1]) > 100]

[('in the', (80, 311)),
 (', and', (22, 165)),
 ('is that', (272, 9)),
 ('of the', (154, 515)),
 ('is the', (74, 51)),
 (', the', (38, 108)),
 ('is not', (21, 89)),
 (', which', (21, 84)),
 ('can be', (18, 96)),
 ('number of', (18, 105)),
 ('that the', (82, 75)),
 ('does not', (26, 78)),
 ('to be', (28, 120)),
 ('to the', (48, 229)),
 ('it is', (24, 86)),
 ('on the', (23, 137)),
 ('is a', (88, 151)),
 ('of a', (14, 103)),
 ('the same', (19, 89)),
 ('for the', (19, 93)),
 (', we', (6, 120)),
 ('as a', (14, 90)),
 ('has been', (13, 93))]

Let's look at n-grams that appear in both classes but are very rare in one class relative to the other.

In [252]:
factor = 20
# show bi-grams that appear more than 20x (tri-grams: more than 10x, i.e. factor/2) as often in positive as in negative samples
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0]/v[1]) > factor], [(k,v) for k,v in Counter(tri_grams_both).items() if (v[0]/v[1]) > factor/2]

([('is that', (272, 9)), ('problem of', (54, 1)), ('shortcoming of', (24, 1))],
 [('is that the', (65, 6)), ('is that it', (50, 3))])
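The plain ratio `v[0]/v[1]` only works for n-grams that occur in both classes, which is why the positive-only n-grams are kept in separate dictionaries. One possible way to rank both groups on a single scale is additive smoothing. The sketch below is an assumption on my part, not what this notebook does: `smoothed_ratios` and `alpha` are hypothetical names, and the counts are copied from the outputs in this notebook.

```python
from collections import Counter


def smoothed_ratios(pos_counts, neg_counts, alpha=1.0):
    """Map each n-gram to (pos + alpha) / (neg + alpha)."""
    vocab = set(pos_counts) | set(neg_counts)
    return {
        gram: (pos_counts.get(gram, 0) + alpha) / (neg_counts.get(gram, 0) + alpha)
        for gram in vocab
    }


# Counts taken from the bi-gram outputs in this notebook.
pos = Counter({"is that": 272, "drawback is": 66, "in the": 80})
neg = Counter({"is that": 9, "in the": 311})

ratios = smoothed_ratios(pos, neg)
for gram, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gram!r}: {ratio:.2f}")
```

With `alpha=1`, an n-gram seen 66 times in positives and never in negatives scores higher than one seen 272 times in positives and 9 times in negatives, so the choice of `alpha` directly shapes the ranking.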

Apparently "problem of" appears 54 times in "problem statements" and only once in a "non-problem" statement. Let's look at that one sentence!

In [250]:
negatives[negatives["text"].str.contains('problem of')].reset_index()["text"][0]

'peripheral neuropathy is the most common problem of diabetes.'

Interesting ... through the bi-gram analysis we found a sentence that actually reads like a problem statement and arguably should be labeled 1 (problem statement), but is labeled 0 in the dataset.
Given that, I'd feel OK using "problem of" as a hard-coded pattern-matching rule for finding positive samples.
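Before hard-coding a phrase as a rule, it may be worth checking how often it is right on the labeled data. The following is a minimal sketch with made-up sentences and a hypothetical helper, `phrase_precision`; it counts matching sentences, not n-gram occurrences, so its numbers can differ slightly from the n-gram counts above.

```python
def phrase_precision(texts, labels, phrase):
    """Return (hits, precision) for a phrase on labeled texts.

    hits is the number of texts containing the phrase; precision is the
    fraction of those labeled 1, or None when nothing matches.
    """
    matched = [label for text, label in zip(texts, labels) if phrase in text]
    if not matched:
        return 0, None
    return len(matched), sum(matched) / len(matched)


# Toy data for illustration only; not taken from the dataset.
texts = [
    "the problem of data sparsity remains unsolved",
    "this is a common problem of the approach",
    "we use a set of hand-written rules",
]
labels = [1, 0, 0]

print(phrase_precision(texts, labels, "problem of"))
print(phrase_precision(texts, labels, "drawback is"))
```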

In [233]:
[(k,v) for k,v in Counter(bi_grams_pos_only).items() if v > 25]

[('the problem', 45),
 ('drawback of', 57),
 ('limitation is', 41),
 ('limitation of', 46),
 ('drawback is', 66)]

In [234]:
[(k,v) for k,v in Counter(tri_grams_pos_only).items() if v > 15]

[('is that they', 31),
 ('the problem of', 41),
 ('limitation is that', 32),
 ('drawback is that', 56),
 ('approach is that', 16),
 ('limitation of the', 17)]

In [216]:
texts = df[df["text"].str.contains("drawback is that")].reset_index()["text"]

In [217]:
texts

0     Its most obvious drawback is that the method c...
1     The most significant drawback is that ontologi...
2     The main drawback is that it needs almost 20,0...
3     One drawback is that it cannot deal with depen...
4     A potential drawback is that it might not work...
5     The drawback is that , since extracted events ...
6     The main drawback is that the entries produced...
7     An obvious drawback is that it is necessary to...
8     The only drawback is that it willperform slowe...
9     The drawback is that the estimates of paramete...
10    One possible drawback is that senses which one...
11    Their major drawback is that they require a gr...
12    The first drawback is that it requires more kn...
13    Another drawback is that it is impossible to a...
14    The drawback is that the solution may be only ...
15    The main drawback is that structures may not b...
16    Another major drawback is that it requires con...
17    The major drawback is that we have to gene