# Finding (Problem Statement) Signal in Sentence Phrasing

Here we compare the [n-grams](https://en.wikipedia.org/wiki/N-gram) of positively and negatively labeled sentences: those that appear in both classes (low signal) and those that appear in only one class (high signal).

If an n-gram appears *almost* exclusively in one class (0 or 1), it could make a great matching phrase, either for classifying a sentence directly without a machine learning model, or for last-mile quality assurance that flags suspicious model decisions.

In [None]:
# default_exp core

In [None]:
import pandas as pd
df = pd.read_csv("datasets/problem_statements.csv")

In [None]:
#export

def take_while(fn, coll):
    """Yield values from coll until fn is False"""
    for e in coll:
        if fn(e):
            yield e
        else:
            return

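def partition(n, coll, step=None):
    """Yield length-n slices of coll, advancing by step (default n); stop at the first short slice"""
    return take_while(lambda e: len(e) == n,
        (coll[i:i+n] for i in range(0, len(coll), step or n)))

def partition_all(n, coll, step=None):
    """Like partition, but also yield the shorter trailing slices"""
    return (coll[i:i+n] for i in range(0, len(coll), step or n))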

def n_grams(texts, n_gram=2): return [" ".join(n) for t in texts for n in partition(n_gram, t.split(" "), 1)]
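
To make the sliding window concrete, here is a minimal standalone sketch of what `n_grams` computes when `partition` is called with `step=1`. The names `sliding_windows` and `n_grams_sketch` are made up for this illustration and are not part of the exported module; the example sentence is a shortened version of one from the dataset.

```python
def sliding_windows(tokens, n):
    """Return every run of n consecutive tokens (step 1), dropping short tails."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def n_grams_sketch(texts, n=2):
    """Join each window back into a space-separated n-gram string."""
    return [" ".join(w) for t in texts for w in sliding_windows(t.split(" "), n)]

example = ["the problem with rich annotation"]
bigrams = n_grams_sketch(example, 2)
trigrams = n_grams_sketch(example, 3)
print(bigrams)   # ['the problem', 'problem with', 'with rich', 'rich annotation']
print(trigrams)  # ['the problem with', 'problem with rich', 'with rich annotation']
```

A text with fewer than `n` tokens contributes no n-grams at all, which matches the `take_while` cut-off in `partition`.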

In [None]:
#export

import spacy
nlp = spacy.load("en_core_web_md")

def lemmatize(text, nlp=nlp):
    return " ".join([tok.lemma_ for tok in nlp(text)])

In [None]:
df["lemmatized"] = df["text"].map(lemmatize)

In [None]:
df["text"] = df["lemmatized"]

In [None]:
positives = df[df["labels"] == 1]
negatives = df[df["labels"] == 0]

In [None]:
positives[:3]

Unnamed: 0.1,Unnamed: 0,PMID,source,Title,text,labels,DOI,lemmatized
0,0,,acl_cambridge,,the difficulty with this task lie in the fact ...,1,,the difficulty with this task lie in the fact ...
1,1,,acl_cambridge,,the problem with rich annotation be that they ...,1,,the problem with rich annotation be that they ...
2,2,,acl_cambridge,,"as a consequence , when adapt exist method and...",1,,"as a consequence , when adapt exist method and..."


In [None]:
bi_grams_pos = n_grams(positives["text"], 2)
tri_grams_pos = n_grams(positives["text"], 3)
bi_grams_neg = n_grams(negatives["text"], 2)
tri_grams_neg = n_grams(negatives["text"], 3)

In [None]:
bi_grams_pos[:3]

['the difficulty', 'difficulty with', 'with this']

In [None]:
from collections import Counter

d1 = dict(Counter(bi_grams_pos))
d2 = dict(Counter(bi_grams_neg))

d3 = dict(Counter(tri_grams_pos))
d4 = dict(Counter(tri_grams_neg))

bi_grams_both = {x:(d1[x], d2[x]) for x in d1 if x in d2}

tri_grams_both = {x:(d3[x], d4[x]) for x in d3 if d4.get(x)}

In [None]:
bi_grams_pos_only = {x:d1[x] for x in d1 if not d2.get(x)} # and d2.get(x) < 5
tri_grams_pos_only = {x:d3[x] for x in d3 if not d4.get(x)}
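
The shared / positive-only split above can also be written with set operations on the `Counter` keys. The following is a sketch on toy data, not the notebook's code: `split_by_class` is a hypothetical helper, and the negative-only dictionary it returns is an addition that the cells above do not compute.

```python
from collections import Counter

def split_by_class(pos_ngrams, neg_ngrams):
    """Split n-grams into shared, positive-only and negative-only groups."""
    pos, neg = Counter(pos_ngrams), Counter(neg_ngrams)
    # dict key views support set algebra, so & and - work directly
    shared = {g: (pos[g], neg[g]) for g in pos.keys() & neg.keys()}
    pos_only = {g: pos[g] for g in pos.keys() - neg.keys()}
    neg_only = {g: neg[g] for g in neg.keys() - pos.keys()}
    return shared, pos_only, neg_only

# toy n-gram lists, made up for illustration
toy_pos = ["the problem", "problem of", "of the", "the problem"]
toy_neg = ["of the", "in the", "of the"]

shared, pos_only, neg_only = split_by_class(toy_pos, toy_neg)
print(shared)    # {'of the': (1, 2)}
print(neg_only)  # {'in the': 1}
```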

In [None]:
[(k,v) for k,v in Counter(tri_grams_both).items() if (v[0] + v[1]) > 20 ]
#n-gram: (pos, neg)

[('the fact that', (12, 21)),
 ('be use for', (3, 24)),
 ('et al .', (9, 79)),
 ('al . ,', (9, 66)),
 ('that they be', (12, 12)),
 ('one of the', (23, 34)),
 ('be that the', (66, 9)),
 ('be that it', (52, 3)),
 ('- of -', (6, 23)),
 ('the quality of', (7, 15)),
 ('that it be', (12, 18)),
 ('it be not', (5, 18)),
 ('n - gram', (8, 27)),
 ('in the training', (6, 18)),
 ('the training datum', (6, 30)),
 ('have not be', (5, 23)),
 ('it do not', (10, 24)),
 ('to the same', (4, 18)),
 ('in term of', (6, 18)),
 ('the use of', (5, 25)),
 (', which be', (8, 22)),
 ('be able to', (6, 18)),
 ('as well as', (5, 21)),
 ('they do not', (7, 15)),
 ('the most common', (18, 13)),
 ('take into account', (5, 18)),
 ('can not be', (5, 19)),
 ('can be use', (3, 27)),
 ('we do not', (3, 18)),
 ('the number of', (9, 39)),
 ('in order to', (3, 18)),
 ('be use to', (3, 25)),
 ('the treatment of', (2, 19)),
 ('- to -', (3, 24)),
 ('the effect of', (2, 28)),
 ('there be a', (4, 27)),
 ('a set of', (1, 21)),
 ('b

In [None]:
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0] + v[1]) > 100]

[('in the', (81, 331)),
 (', and', (47, 221)),
 ('be that', (279, 12)),
 ('of the', (154, 519)),
 ('as a', (16, 96)),
 ('be use', (13, 106)),
 ('be the', (87, 70)),
 ('be not', (37, 159)),
 ('do not', (45, 144)),
 (', the', (44, 126)),
 (', which', (36, 93)),
 ('can be', (18, 96)),
 ('number of', (20, 108)),
 ('that the', (82, 75)),
 ('there be', (19, 85)),
 ('to be', (28, 120)),
 ('to the', (48, 229)),
 ('be a', (104, 181)),
 (') be', (64, 88)),
 ('it be', (28, 106)),
 ('on the', (24, 143)),
 ('of a', (14, 103)),
 ('the same', (19, 98)),
 ('have be', (26, 133)),
 ('for the', (19, 102)),
 (', we', (6, 122))]

Let's look at patterns that appear far more often in one class than in the other.

In [None]:
factor = 15
#show n-grams that appear more than `factor` times as often in positive samples (bigrams: 15x, trigrams: factor/2 = 7.5x)
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0]/v[1]) > factor], [(k,v) for k,v in Counter(tri_grams_both).items() if (v[0]/v[1]) > factor/2]

([('be that', (279, 12)),
  ('problem of', (57, 1)),
  ('problem be', (22, 1)),
  ('limitation of', (48, 1)),
  ('shortcoming of', (24, 1))],
 [('be that it', (52, 3)), ('lead cause of', (9, 1))])
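
The plain ratio `v[0]/v[1]` is only defined for n-grams that occur in both classes. One possible way to rank shared and positive-only n-grams on a single scale is add-one (Laplace) smoothing. This is a sketch of that alternative, not what the cell above does: `smoothed_ratio` and `alpha` are hypothetical names, and the counts are copied from the outputs in this notebook.

```python
from collections import Counter

def smoothed_ratio(pos_counts, neg_counts, alpha=1.0):
    """Return (pos + alpha) / (neg + alpha) for every n-gram seen in either class."""
    grams = set(pos_counts) | set(neg_counts)
    return {g: (pos_counts.get(g, 0) + alpha) / (neg_counts.get(g, 0) + alpha)
            for g in grams}

# counts taken from the bigram outputs in this notebook
pos_counts = Counter({"be that": 279, "problem of": 57, "drawback be": 67, "in the": 81})
neg_counts = Counter({"be that": 12, "problem of": 1, "in the": 331})

ratios = smoothed_ratio(pos_counts, neg_counts)
ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
for gram, ratio in ranked:
    print(f"{gram!r}: {ratio:.2f}")
```

With `alpha=1`, the positive-only bigram `'drawback be'` ranks above `'problem of'`, which in turn ranks above `'be that'`, while `'in the'` falls well below 1.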

Apparently "problem of" appears 57 times in problem statements and only once in a non-problem statement. Let's look at that one sentence!

In [None]:
negatives[negatives["text"].str.contains('problem of')].reset_index()["text"][0]

'peripheral neuropathy be the most common problem of diabetes .'

Interesting: through the bi-gram analysis we found a sentence that actually sounds like a problem statement and arguably should be labeled 1 (problem statement), but is labeled 0 in the dataset.
Given that, I'd feel OK using "problem of" as a hard-coded pattern-matching rule for finding positive samples.
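
A hard-coded rule of this kind could look roughly like the sketch below. It assumes the input is lemmatized and space-tokenized the way `lemmatize` produces it. The phrase list is an illustrative pick from the outputs in this notebook, and `compile_rules`, `rule_label`, `flag_disagreements` and the `model_labels` values are hypothetical.

```python
import re

# illustrative picks from the n-gram outputs, not a vetted rule set
PATTERNS = ["problem of", "drawback be that", "limitation of"]

def compile_rules(phrases):
    """Build one regex that matches any phrase on whitespace token boundaries."""
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"(?<!\S)(?:" + alternation + r")(?!\S)")

def rule_label(lemmatized_text, rules):
    """Return 1 if any phrase matches, else 0."""
    return 1 if rules.search(lemmatized_text) else 0

def flag_disagreements(texts, model_labels, rules):
    """Indices where a rule fires but the model (or dataset) label is 0."""
    return [i for i, (t, y) in enumerate(zip(texts, model_labels))
            if rule_label(t, rules) == 1 and y == 0]

rules = compile_rules(PATTERNS)
texts = [
    "peripheral neuropathy be the most common problem of diabetes .",
    "we use a set of feature .",
    "the subproblem of parse be hard .",
]
model_labels = [0, 0, 0]  # made-up labels for illustration
flagged = flag_disagreements(texts, model_labels, rules)
print(flagged)  # [0]
```

The whitespace lookarounds keep "subproblem of" from matching "problem of", which a plain substring test such as `str.contains` would not.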

In [None]:
[(k,v) for k,v in Counter(bi_grams_pos_only).items() if v > 10]

[('with this', 11),
 ('the problem', 57),
 ('problem with', 20),
 ('main drawback', 16),
 ('drawback of', 57),
 ('limitation be', 41),
 ('the disadvantage', 15),
 ('disadvantage of', 16),
 ('method be', 14),
 ('shortcoming be', 18),
 ('disadvantage be', 12),
 ('a serious', 13),
 ('issue of', 14),
 ('one drawback', 11),
 ('main limitation', 11),
 ('major drawback', 13),
 ('drawback be', 67),
 ('another limitation', 14),
 ('a drawback', 17),
 ('the drawback', 38),
 ('of use', 14),
 ('the major', 14)]

In [None]:
[(k,v) for k,v in Counter(tri_grams_pos_only).items() if v > 6]

[('lie in the', 7),
 ('be that they', 31),
 ('the problem of', 44),
 ('approach be the', 9),
 ('be the fact', 7),
 ('the main drawback', 12),
 ('limitation be that', 32),
 ('the disadvantage of', 7),
 (', the problem', 8),
 ('the problem be', 8),
 ('problem of the', 7),
 ('method be that', 10),
 ('limitation be the', 7),
 ('disadvantage be that', 10),
 ('be that there', 9),
 ('the issue of', 10),
 ('drawback of this', 7),
 ('the main limitation', 11),
 ('main limitation of', 9),
 ('approach be that', 19),
 ('major drawback of', 7),
 ('drawback be that', 57),
 ('drawback be the', 7),
 ('of this approach', 8),
 ('this approach be', 9),
 ('main drawback be', 8),
 ('limitation of the', 17),
 ('the drawback of', 20),
 ('the drawback be', 16),
 ('shortcoming be that', 12),
 ('that it do', 7),
 ('be that we', 10),
 ('be that a', 9),
 ('have the drawback', 9),
 ('another limitation be', 8),
 ('a drawback of', 8),
 ('drawback of the', 8),
 ('problem with this', 7),
 ('model be that', 7),
 ('pro

In [None]:
"hi you" not in ["hi you", "hey"]

False

In [None]:
#positive-only bigrams that are not contained in any positive-only trigram
[bigram for bigram in bi_grams_pos_only.keys() if not any(bigram in tg for tg in tri_grams_pos_only.keys())]
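
The check above uses substring matching, so a bigram can also count as covered when it only matches part of a longer token inside a trigram. A token-aware variant might look like this sketch; `uncovered_bigrams` is a hypothetical helper and the inputs are toy data built from n-grams seen in the outputs.

```python
def uncovered_bigrams(bigrams, trigrams):
    """Return bigrams that are neither the first two nor the last two tokens of any trigram."""
    covered = set()
    for tg in trigrams:
        toks = tg.split(" ")
        covered.add(" ".join(toks[:2]))  # leading bigram of the trigram
        covered.add(" ".join(toks[1:]))  # trailing bigram of the trigram
    return [bg for bg in bigrams if bg not in covered]

toy_bigrams = ["the problem", "problem of", "a serious"]
toy_trigrams = ["the problem of", "drawback be that"]
leftover = uncovered_bigrams(toy_bigrams, toy_trigrams)
print(leftover)  # ['a serious']
```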


In [None]:
texts = df[df["text"].str.contains("drawback be that")].reset_index()["text"]

In [None]:
texts

0     its most obvious drawback be that the method c...
1     the most significant drawback be that ontology...
2     the main drawback be that it need almost 20,00...
3     one drawback be that it can not deal with depe...
4     a potential drawback be that it might not work...
5     the drawback be that , since extract event in ...
6     the main drawback be that the entry produce au...
7     an obvious drawback be that it be necessary to...
8     the only drawback be that it willperform slow ...
9     the drawback be that the estimate of parameter...
10    one possible drawback be that sense which one ...
11    their major drawback be that they require a gr...
12    the first drawback be that it require more kno...
13    another drawback be that it be impossible to a...
14    the drawback be that the solution may be only ...
15    the main drawback be that structure may not be...
16    another major drawback be that it require cons...
17    the major drawback be that we have to gene