# Finding (Problem Statement) Signal in Sentence Phrasing

Here we look at intersecting (low-signal) and exclusive (appearing in only one class, hence high-signal) [n-grams](https://en.wikipedia.org/wiki/N-gram) of positively and negatively labeled sentences.

If there are n-grams that appear *almost* exclusively in one class (0 or 1), they could make great matching phrases, either for classifying a sentence directly without a machine learning model or for last-mile quality assurance that flags suspicious model decisions.

In [1]:
# default_exp core

In [2]:
import pandas as pd
df = pd.read_csv("datasets/problem_statements.csv")

In [3]:
#export

def take_while(fn, coll):
    """Yield values from coll until fn is False"""
    for e in coll:
        if fn(e):
            yield e
        else:
            return

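def partition(n, coll, step=None):
    """Yield consecutive length-n slices of coll, stopping at the first shorter slice"""
    return take_while(lambda e: len(e) == n,
        (coll[i:i+n] for i in range(0, len(coll), step or n)))

def partition_all(n, coll, step=None):
    """Like partition, but also yield trailing slices shorter than n"""
    return (coll[i:i+n] for i in range(0, len(coll), step or n))

def n_grams(texts, n_gram=2):
    """Return all space-joined n-grams (sliding window with step 1) over all texts"""
    return [" ".join(n) for t in texts for n in partition(n_gram, t.split(" "), 1)]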
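To make the sliding-window behaviour of `n_grams` concrete, here is a minimal standalone sketch. `sliding_n_grams` is a hypothetical helper written only for this illustration; it should produce the same windows as `partition(n, tokens, 1)` for a single token list, without the generator machinery.

```python
def sliding_n_grams(tokens, n):
    """Return every contiguous n-token window of a token list, joined by spaces."""
    # range() is empty when the list has fewer than n tokens, so short inputs yield []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the difficulty with this task".split(" ")
print(sliding_n_grams(tokens, 2))
# ['the difficulty', 'difficulty with', 'with this', 'this task']
print(sliding_n_grams(tokens, 3))
# ['the difficulty with', 'difficulty with this', 'with this task']
```

A sentence with k tokens contributes k - n + 1 n-grams, so long sentences weigh more heavily in the counts than short ones.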

In [4]:
#export

import spacy
nlp = spacy.load("en_core_web_md")

def lemmatize(text, nlp=nlp):
    return " ".join([tok.lemma_ for tok in nlp(text)])

In [5]:
df["lemmatized"] = df["text"].map(lemmatize)

In [6]:
df["text"] = df["lemmatized"]

In [7]:
positives = df[df["labels"] == 1]
negatives = df[df["labels"] == 0]

In [8]:
positives[:3]

Unnamed: 0.1,Unnamed: 0,Title,PMID,text,DOI,labels,source,lemmatized
0,0,,,the difficulty with this task lie in the fact ...,,1,acl_cambridge,the difficulty with this task lie in the fact ...
1,1,,,the problem with rich annotation be that they ...,,1,acl_cambridge,the problem with rich annotation be that they ...
2,2,,,"as a consequence , when adapt exist method and...",,1,acl_cambridge,"as a consequence , when adapt exist method and..."


In [9]:
bi_grams_pos = n_grams(positives["text"], 2)
tri_grams_pos = n_grams(positives["text"], 3)
bi_grams_neg = n_grams(negatives["text"], 2)
tri_grams_neg = n_grams(negatives["text"], 3)

In [10]:
bi_grams_pos[:3]

['the difficulty', 'difficulty with', 'with this']

In [11]:
from collections import Counter

d1 = dict(Counter(bi_grams_pos))
d2 = dict(Counter(bi_grams_neg))

d3 = dict(Counter(tri_grams_pos))
d4 = dict(Counter(tri_grams_neg))

# n-grams seen in both classes, mapped to (positive count, negative count)
bi_grams_both = {x:(d1[x], d2[x]) for x in d1 if x in d2}

tri_grams_both = {x:(d3[x], d4[x]) for x in d3 if x in d4}

In [12]:
bi_grams_pos_only = {x:d1[x] for x in d1 if not d2.get(x)} # and d2.get(x) < 5
tri_grams_pos_only = {x:d3[x] for x in d3 if not d4.get(x)}

In [13]:
# shared trigrams with more than 20 occurrences in total; values are (pos, neg) counts
[(k,v) for k,v in tri_grams_both.items() if (v[0] + v[1]) > 20]

[('the fact that', (13, 21)),
 ('be use for', (3, 24)),
 ('et al .', (9, 79)),
 ('al . ,', (9, 66)),
 ('that they be', (12, 12)),
 ('one of the', (37, 46)),
 ('be that the', (66, 9)),
 (', there be', (5, 16)),
 ('large number of', (6, 15)),
 ('be that it', (52, 3)),
 ('- of -', (6, 25)),
 ('the quality of', (7, 17)),
 (', such as', (6, 16)),
 ('that it be', (12, 18)),
 ('it be not', (5, 18)),
 ('n - gram', (8, 27)),
 ('in the training', (6, 18)),
 ('the training datum', (6, 30)),
 ('have not be', (5, 24)),
 ('it do not', (10, 24)),
 ('to the same', (4, 18)),
 ('in term of', (6, 19)),
 ('the use of', (5, 29)),
 (', which be', (10, 23)),
 (', it be', (6, 16)),
 ('be able to', (6, 18)),
 ('as well as', (6, 21)),
 ('they do not', (7, 15)),
 ('the most common', (29, 15)),
 ('take into account', (5, 18)),
 ('can not be', (5, 20)),
 ('can be use', (3, 27)),
 ('we do not', (3, 18)),
 (', and the', (6, 15)),
 ('the number of', (9, 42)),
 ('in order to', (4, 18)),
 ('depend on the', (3, 18)),
 (

In [14]:
# shared bigrams with more than 100 occurrences in total; values are (pos, neg) counts
[(k,v) for k,v in bi_grams_both.items() if (v[0] + v[1]) > 100]

[('in the', (99, 360)),
 (', and', (63, 246)),
 ('be that', (280, 12)),
 ('of the', (182, 576)),
 ('as a', (19, 106)),
 ('with the', (28, 74)),
 ('be use', (13, 106)),
 ('be the', (97, 84)),
 ('be not', (42, 162)),
 ('do not', (46, 145)),
 (', the', (46, 137)),
 (', which', (43, 99)),
 ('of this', (35, 70)),
 ('can be', (20, 100)),
 ('number of', (24, 114)),
 ('that the', (82, 82)),
 ('there be', (24, 95)),
 ('be an', (25, 79)),
 ('to be', (35, 122)),
 ('to the', (53, 239)),
 ('by the', (16, 86)),
 ('be a', (165, 222)),
 (') be', (107, 111)),
 (', but', (35, 72)),
 ('the most', (62, 56)),
 ('it be', (32, 109)),
 ('on the', (25, 151)),
 ('and the', (24, 82)),
 ('of a', (14, 111)),
 ('the same', (19, 100)),
 ('have be', (29, 143)),
 ('for the', (21, 115)),
 (', we', (6, 122))]

Let's look at patterns that appear far more often in one class than in the other.

In [15]:
factor = 15
# show shared bigrams that appear more than 15x as often in positive samples
# as in negative ones (for trigrams the threshold is factor/2 = 7.5x)
[(k,v) for k,v in bi_grams_both.items() if (v[0]/v[1]) > factor], [(k,v) for k,v in tri_grams_both.items() if (v[0]/v[1]) > factor/2]

([('the problem', (57, 1)),
  ('be that', (280, 12)),
  ('problem of', (58, 1)),
  ('limitation of', (48, 2)),
  ('shortcoming of', (24, 1))],
 [('be that it', (52, 3))])
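The ratio filter above only covers n-grams seen in both classes, because a positive-only n-gram would divide by zero. One possible way to rank shared and exclusive n-grams on a single scale is add-one (Laplace) smoothing. The sketch below is an assumption, not part of the original analysis: `smoothed_ratio` and `alpha` are hypothetical names, and the example counts are copied from the outputs in this notebook. Like the cell above, it compares raw counts and does not correct for the two classes having different sizes.

```python
from collections import Counter

def smoothed_ratio(pos_counts, neg_counts, alpha=1.0):
    """Map each n-gram to (pos + alpha) / (neg + alpha); alpha avoids division by zero."""
    vocab = set(pos_counts) | set(neg_counts)
    return {
        gram: (pos_counts.get(gram, 0) + alpha) / (neg_counts.get(gram, 0) + alpha)
        for gram in vocab
    }

# Counts taken from the notebook outputs; 'drawback be' occurs in positives only.
pos = Counter({"the problem": 57, "be that": 280, "drawback be": 67, "in the": 99})
neg = Counter({"the problem": 1, "be that": 12, "in the": 360})

ratios = smoothed_ratio(pos, neg)
for gram, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gram!r}: {ratio:.2f}")
```

With smoothing, a phrase seen 67 times in positives and never in negatives ranks above one seen 57 times against a single negative hit, while common function-word pairs such as 'in the' fall below 1.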

Apparently "problem of" appears 58 times in problem statements and only once in a non-problem statement. Let's look at that one sentence!

In [16]:
negatives[negatives["text"].str.contains('problem of')].reset_index()["text"][0]

'peripheral neuropathy be the most common problem of diabetes .'

Interesting: through the bi-gram analysis we found a statement that actually sounds like a problem statement and arguably should be labeled 1 (problem statement), but is labeled 0 in the dataset.
Given that, I'd feel OK using "problem of" as a hard-coded pattern-matching rule for finding positive samples.
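A hard-coded rule of this kind could look roughly like the sketch below. It is only an illustration under assumptions: `PATTERNS`, `matches_pattern`, and `flag_suspicious` are hypothetical names, the phrase list is taken from the high-ratio bigrams shown above, the second example sentence and the predicted labels are made up, and the input is assumed to be lemmatized with tokens separated by single spaces, as produced by `lemmatize`.

```python
# Phrases taken from the high-ratio bigram output above (lemmatized form).
PATTERNS = ["the problem", "problem of", "limitation of", "shortcoming of"]

def matches_pattern(lemmatized_text, patterns=PATTERNS):
    """Return True if any pattern occurs as a whole-token phrase in the text."""
    # Padding with spaces keeps 'problem of' from matching inside 'problem offer'.
    padded = f" {lemmatized_text} "
    return any(f" {pattern} " in padded for pattern in patterns)

def flag_suspicious(lemmatized_texts, predicted_labels, patterns=PATTERNS):
    """Return indices of texts predicted as 0 although a positive pattern fires."""
    return [
        i
        for i, (text, label) in enumerate(zip(lemmatized_texts, predicted_labels))
        if label == 0 and matches_pattern(text, patterns)
    ]

texts = [
    "peripheral neuropathy be the most common problem of diabetes .",
    "we evaluate the model on the training datum .",  # made-up negative example
]
predicted = [0, 0]  # hypothetical model predictions
print(flag_suspicious(texts, predicted))  # [0]
```

Used this way, the rules do not replace the model; they only surface predictions worth a second look, such as the diabetes sentence above.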

In [28]:
bgs = [(k,v) for k,v in Counter(bi_grams_pos_only).items() if v > 10]
bgs

[('with this', 11),
 ('problem with', 20),
 ('main drawback', 16),
 ('drawback of', 57),
 ('limitation be', 41),
 ('the disadvantage', 15),
 ('disadvantage of', 16),
 ('method be', 14),
 ('shortcoming be', 18),
 ('disadvantage be', 12),
 ('a serious', 16),
 ('issue of', 14),
 ('one drawback', 11),
 ('main limitation', 11),
 ('major drawback', 13),
 ('drawback be', 67),
 ('another limitation', 14),
 ('a drawback', 17),
 ('the drawback', 38),
 ('of use', 14)]

In [25]:
"\n".join([t[0] for t in bgs])

'with this\nproblem with\nmain drawback\ndrawback of\nlimitation be\nthe disadvantage\ndisadvantage of\nmethod be\nshortcoming be\ndisadvantage be\na serious\nissue of\none drawback\nmain limitation\nmajor drawback\ndrawback be\nanother limitation\na drawback\nthe drawback\nof use'

In [26]:
with open("bigrams_pos_only_lemmatized.txt", "w") as f:
    f.write("\n".join([t[0] for t in bgs]))

In [27]:
tgs = [(k,v) for k,v in Counter(tri_grams_pos_only).items() if v > 6]
tgs

[('lie in the', 7),
 ('be that they', 32),
 ('the problem of', 44),
 ('approach be the', 9),
 ('be the fact', 7),
 ('the main drawback', 12),
 ('limitation be that', 32),
 ('the disadvantage of', 7),
 (', the problem', 8),
 ('the problem be', 8),
 ('problem of the', 7),
 ('method be that', 10),
 ('limitation be the', 7),
 ('disadvantage be that', 10),
 ('be that there', 9),
 ('the issue of', 10),
 ('drawback of this', 7),
 ('the main limitation', 11),
 ('main limitation of', 9),
 ('approach be that', 19),
 ('major drawback of', 7),
 ('drawback be that', 57),
 ('drawback be the', 7),
 ('of this approach', 8),
 ('this approach be', 9),
 ('main drawback be', 8),
 ('limitation of the', 17),
 ('the drawback of', 20),
 ('the drawback be', 16),
 ('shortcoming be that', 12),
 ('that it do', 7),
 ('be that we', 10),
 ('be that a', 9),
 ('have the drawback', 9),
 ('another limitation be', 8),
 ('a drawback of', 8),
 ('drawback of the', 8),
 ('problem with this', 7),
 ('model be that', 7),
 ('pro

In [18]:
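# positive-only bigrams that are not contained (as a substring) in any positive-only trigram
# expected to be empty: any trigram around a positive-only bigram is itself positive-only,
# so only bigrams from two-token sentences could show up here
[bigram for bigram in bi_grams_pos_only.keys() if not any(bigram in tg for tg in tri_grams_pos_only.keys())]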


[]
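Note that `bigram in tg` is a plain substring test, so it can also match across token boundaries. If a token-aligned check is wanted, a minimal sketch could look like this. `contains_bigram` is a hypothetical helper, and the trigram 'of useful feature' is a made-up example chosen to show the difference.

```python
def contains_bigram(trigram, bigram):
    """Return True if the bigram equals the first two or the last two tokens of the trigram."""
    tokens = trigram.split(" ")
    return bigram in (" ".join(tokens[:2]), " ".join(tokens[1:]))

# Substring matching fires even though 'use' is only a prefix of 'useful'.
print("of use" in "of useful feature")                    # True
print(contains_bigram("of useful feature", "of use"))     # False
print(contains_bigram("the drawback of", "drawback of"))  # True
```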

In [19]:
texts = df[df["text"].str.contains("drawback be that")].reset_index()["text"]

In [20]:
texts

Series([], Name: text, dtype: object)