# Finding (Problem Statement) Signal in Sentence Phrasing

Here we compare the [n-grams](https://en.wikipedia.org/wiki/N-gram) of positively and negatively labeled sentences: n-grams that occur in both classes (low signal) and n-grams that occur in only one class (high signal).

If an n-gram appears *almost* exclusively in one class (0 or 1), it could make a great matching phrase, either for classifying a sentence directly without a machine learning model or for last-mile quality assurance that flags suspicious model decisions.

In [1]:
# default_exp core

In [2]:
import pandas as pd
df = pd.read_csv("downloads/40k_balanced_pm_acl.csv")


In [3]:
#export

def take_while(fn, coll):
    """Yield values from coll until fn is False"""
    for e in coll:
        if fn(e):
            yield e
        else:
            return

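def partition(n, coll, step=None):
    """Yield slices of length n from coll, stopping at the first shorter slice"""
    return take_while(lambda e: len(e) == n,
        (coll[i:i+n] for i in range(0, len(coll), step or n)))

def partition_all(n, coll, step=None):
    """Yield slices of up to n items from coll, including shorter trailing slices"""
    return (coll[i:i+n] for i in range(0, len(coll), step or n))

def n_grams(texts, n_gram=2):
    """Return the space-joined n-grams of all texts (sliding window, step 1)"""
    return [" ".join(n) for t in texts for n in partition(n_gram, t.split(" "), 1)]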
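
To make the sliding-window behaviour concrete, here is a small self-contained sketch. The helper `sliding_ngrams` is hypothetical (it is not part of the exported module); it mirrors what `n_grams` does for a single text and shows that windows shorter than `n` at the end of a sentence are dropped.

```python
def sliding_ngrams(text, n=2):
    """Return all n-token windows of a whitespace-tokenized text (step 1).

    Hypothetical helper for illustration; mirrors n_grams() for one text.
    """
    tokens = text.split(" ")
    # The range stops early so that every window has exactly n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example = "the difficulty with this task"
bigrams = sliding_ngrams(example, 2)
trigrams = sliding_ngrams(example, 3)
too_short = sliding_ngrams("difficulty", 2)  # fewer tokens than n: no n-grams
print(bigrams)
print(trigrams)
print(too_short)
```

Because tokens come from `split(" ")`, punctuation that spaCy separated into its own token (such as `,`) counts as a token in its own right, which is why n-grams like `however ,` show up in the counts.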

In [4]:
#export

import spacy
nlp = spacy.load("en_core_web_md")

def lemmatize(text, nlp=nlp):
    return " ".join([tok.lemma_ for tok in nlp(text)])

In [5]:
df["text_orig"] = df["text"]
df["text"] = df["text"].map(lemmatize)

In [9]:
df["labels"] = df["labels"].astype(str)
positives = df[df["labels"] == "1"]
negatives = df[df["labels"] == "0"]

In [11]:
positives[:3]

| Unnamed: 0.1 | Unnamed: 0 | PMID | Title | text | DOI | labels | source | lemmatized | text_orig |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | | | the difficulty with this task lie in the fact ... | | 1 | acl_cambridge | the difficulty with this task lie in the fact ... | The difficulty with this task lies in the fact... |
| 1 | 1 | | | the problem with rich annotation be that they ... | | 1 | acl_cambridge | the problem with rich annotation be that they ... | The problem with rich annotations is that they... |
| 2 | 2 | | | as a consequence , when adapt exist method and... | | 1 | acl_cambridge | as a consequence , when adapt exist method and... | As a consequence , when adapting existing meth... |


In [12]:
bi_grams_pos = n_grams(positives["text"], 2)
tri_grams_pos = n_grams(positives["text"], 3)
bi_grams_neg = n_grams(negatives["text"], 2)
tri_grams_neg = n_grams(negatives["text"], 3)

In [13]:
bi_grams_pos[:3]

['the difficulty', 'difficulty with', 'with this']

In [14]:
from collections import Counter

d1 = dict(Counter(bi_grams_pos))
d2 = dict(Counter(bi_grams_neg))

d3 = dict(Counter(tri_grams_pos))
d4 = dict(Counter(tri_grams_neg))

# n-grams seen in both classes, mapped to (positive count, negative count)
bi_grams_both = {x:(d1[x], d2[x]) for x in d1 if x in d2}

tri_grams_both = {x:(d3[x], d4[x]) for x in d3 if x in d4}

In [15]:
# n-grams that never occur in a negative sentence, mapped to their positive count
bi_grams_pos_only = {x:d1[x] for x in d1 if x not in d2}
tri_grams_pos_only = {x:d3[x] for x in d3 if x not in d4}

In [16]:
# trigrams seen more than 20 times in total; output format is n-gram: (pos, neg)
[(k,v) for k,v in tri_grams_both.items() if (v[0] + v[1]) > 20]

[('the fact that', (68, 29)),
 ('the problem of', (75, 1)),
 ('be use for', (13, 40)),
 ('which be not', (14, 10)),
 ('include in the', (4, 124)),
 ('so - call', (16, 6)),
 ('et al .', (21, 81)),
 ('al . ,', (15, 66)),
 ('that they be', (60, 13)),
 ('large - scale', (15, 7)),
 ('to overcome the', (21, 1)),
 ('however , the', (169, 19)),
 (', which have', (65, 6)),
 ('the limitation of', (13, 8)),
 ('problem of the', (32, 2)),
 ('of this method', (48, 3)),
 ('one of the', (238, 83)),
 ('small number of', (12, 19)),
 ('be that the', (351, 10)),
 ('have the potential', (19, 12)),
 ('be that there', (38, 1)),
 ('that there be', (59, 19)),
 ('there be no', (64, 213)),
 ('as compare to', (4, 33)),
 (', there be', (65, 88)),
 ('there be an', (13, 18)),
 ('be an important', (46, 29)),
 ('a common problem', (25, 4)),
 ('of non -', (29, 21)),
 (') , and', (46, 148)),
 ('the performance of', (12, 21)),
 (', which can', (66, 12)),
 (', the main', (20, 2)),
 ('large number of', (57, 23)),
 ('be tha

In [17]:
# bigrams seen more than 100 times in total; output format is n-gram: (pos, neg)
[(k,v) for k,v in bi_grams_both.items() if (v[0] + v[1]) > 100]

[('with this', (234, 6)),
 ('in the', (2772, 2553)),
 ('the fact', (73, 35)),
 ('they be', (170, 105)),
 (', and', (1955, 1907)),
 ('the problem', (395, 4)),
 ('problem with', (1182, 1)),
 ('be that', (1561, 21)),
 ('that they', (242, 19)),
 ('increase the', (80, 85)),
 ('of the', (2771, 2244)),
 ('as a', (1039, 321)),
 ('to a', (245, 249)),
 ('a new', (37, 82)),
 ('with the', (550, 438)),
 ('problem of', (366, 6)),
 ('could be', (83, 112)),
 ('be use', (123, 305)),
 ('use for', (44, 79)),
 ('approach be', (188, 25)),
 ('be the', (792, 249)),
 ('base on', (91, 246)),
 ('the main', (191, 52)),
 ('of these', (234, 38)),
 ('system be', (72, 39)),
 ('which be', (448, 142)),
 ('be not', (448, 395)),
 ('include in', (11, 236)),
 ('do not', (313, 489)),
 ('suffer from', (102, 46)),
 ('et al', (23, 82)),
 ('al .', (23, 81)),
 (') ,', (584, 717)),
 (', especially', (425, 66)),
 ('in a', (273, 746)),
 ('however ,', (912, 100)),
 (', the', (939, 600)),
 ('problem be', (127, 4)),
 ('be still', (38

Let's look at patterns that one class shows very rarely relative to the other.

**TODO: only show bigrams that weren't also in the SPIKE SEARCH PATTERN**

In [18]:
factor = 15
# show bigrams that appear more than 15x (factor) and trigrams more than 7.5x (factor/2) as often in positive as in negative samples
[(k,v) for k,v in bi_grams_both.items() if (v[0]/v[1]) > factor], [(k,v) for k,v in tri_grams_both.items() if (v[0]/v[1]) > factor/2]

([('with this', (234, 6)),
  ('the problem', (395, 4)),
  ('problem with', (1182, 1)),
  ('be that', (1561, 21)),
  ('be face', (24, 1)),
  ('problem of', (366, 6)),
  ('problem (', (44, 2)),
  ('to overcome', (48, 2)),
  ('overcome the', (42, 1)),
  ('problem be', (127, 4)),
  ('be still', (382, 20)),
  ('weakness of', (17, 1)),
  ('this method', (82, 4)),
  ('a serious', (15899, 5)),
  ('problem in', (3836, 12)),
  ('these method', (19, 1)),
  ('shortcoming of', (37, 1)),
  ('main limitation', (16, 1)),
  ('a major', (851, 27)),
  ('this approach', (147, 8)),
  ('that some', (20, 1)),
  ('be its', (83, 3)),
  ('a severe', (1408, 3)),
  ('in its', (43, 2)),
  ('that many', (22, 1)),
  ('the major', (108, 7)),
  ('be their', (63, 1)),
  ('major problem', (141, 2)),
  ('to deal', (20, 1)),
  ('a problem', (169, 6)),
  ('this issue', (23, 1)),
  ('the recent', (42, 2)),
  ('in develop', (254, 4)),
  ('represent a', (977, 28)),
  ('challenge for', (48, 3)),
  ('rural area', (46, 3)),
  ('
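
The ratio `v[0]/v[1]` in the cell above is only defined for n-grams seen in both classes, which is why positive-only n-grams need a separate dictionary. One possible alternative, sketched here as an assumption rather than something this notebook does, is add-one smoothing: adding 1 to both counts makes the ratio defined for every n-gram, so shared and exclusive n-grams can be ranked on one scale. The name `smoothed_ratio`, the `min_total` parameter, and the toy counts are all hypothetical.

```python
from collections import Counter

def smoothed_ratio(pos_counts, neg_counts, min_total=1):
    """Map each n-gram to (pos + 1) / (neg + 1).

    Hypothetical helper: add-one smoothing avoids division by zero for
    n-grams that occur in only one class.
    """
    ratios = {}
    for gram in set(pos_counts) | set(neg_counts):
        pos, neg = pos_counts.get(gram, 0), neg_counts.get(gram, 0)
        # Skip n-grams that are too rare overall to be trusted
        if pos + neg >= min_total:
            ratios[gram] = (pos + 1) / (neg + 1)
    return ratios

# Toy counts, made up for illustration only (not taken from the dataset)
pos = Counter({"problem with": 30, "in the": 50, "main drawback": 9})
neg = Counter({"problem with": 1, "in the": 49})

ratios = smoothed_ratio(pos, neg)
# Highest ratio first: the most positive-leaning n-grams come out on top
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```

A ratio well above 1 leans positive, a ratio well below 1 leans negative, and values near 1 (like `in the` in the toy counts) carry little signal.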

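According to the counts above, "problem of" appears 366 times in "problem statements" and only 6 times in "non-problem" statements. "limitation of" is less clear-cut: the trigram "the limitation of" appears 13 times in positives and 8 times in negatives. Let's look at one negative sentence that contains "limitation of"!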

In [19]:
negatives[negatives["text"].str.contains('limitation of')].reset_index()["text"][4]

'the translucency and opacity of ceramic play a significant role in emulate the natural color of tooth , but study of the masking property and limitation of dental ceramic when use as monolayer restoration be lack'

In [20]:
# positive-only n-grams that are frequent enough to be useful as matching phrases
bgs = [(k,v) for k,v in bi_grams_pos_only.items() if v > 10]
tgs = [(k,v) for k,v in tri_grams_pos_only.items() if v > 6]
bgs, tgs

([('main drawback', 59),
  ('drawback of', 1510),
  ('these approach', 11),
  ('one limitation', 20),
  ('the disadvantage', 15),
  ('avoid the', 13),
  ('shortcoming be', 18),
  ('disadvantage be', 12),
  ('serious problem', 5966),
  ('the issue', 32),
  ('issue of', 44),
  ('one drawback', 15),
  ('this problem', 65),
  ('problem arise', 12),
  ('problem for', 1329),
  ('main problem', 78),
  ('it remain', 48),
  ('major drawback', 531),
  ('possible drawback', 20),
  ('drawback be', 69),
  ('significant drawback', 43),
  ('this strategy', 14),
  ('another limitation', 14),
  ('a drawback', 471),
  ('potential drawback', 60),
  ('the drawback', 136),
  ('major limitation', 18),
  ('a fundamental', 23),
  ('fundamental problem', 15),
  ('solve the', 15),
  ('one problem', 118),
  ('nevertheless ,', 30),
  ('know that', 14),
  ('problem not', 39),
  ('to solve', 14),
  ('of highly', 13),
  ('that its', 12),
  ('issue be', 19),
  ('troubling problem', 13),
  ('also for', 13),
  ('common

In [21]:
with open("ngrams_pos_only_lemmatized.txt", "w") as f:
    f.write("\n".join([t[0] for t in bgs+tgs]))
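
One way the saved phrase list could be used for the last-mile quality assurance mentioned at the top is sketched below. This is an assumption about downstream use, not code from this project: `flag_suspicious`, its arguments, and the toy sentences and predictions are hypothetical. It flags sentences that a model labeled 0 although the lemmatized text contains a positive-only phrase.

```python
def flag_suspicious(lemmatized_texts, predictions, phrases):
    """Return (index, matched_phrase) for negative predictions that contain
    a phrase seen only in positive training sentences.

    Hypothetical helper for illustration.
    """
    flagged = []
    for i, (text, pred) in enumerate(zip(lemmatized_texts, predictions)):
        if pred != 0:
            continue  # only negative predictions can contradict a positive-only phrase
        # Pad with spaces so phrases match only on whole-token boundaries
        padded = f" {text} "
        for phrase in phrases:
            if f" {phrase} " in padded:
                flagged.append((i, phrase))
                break  # one matching phrase is enough to flag the sentence
    return flagged

# Toy inputs, made up for illustration; texts must be lemmatized like the phrases
phrases = ["main drawback", "drawback be that"]
texts = [
    "the main drawback be that it need more time",
    "we train the model on two corpus",
    "one drawback be that it be slow",
]
predictions = [0, 0, 1]

flags = flag_suspicious(texts, predictions, phrases)
print(flags)
```

The phrases in the file are lemmatized, so any text checked against them would need to go through the same `lemmatize` step first.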

In [None]:
# bigrams that do not occur (as a substring) inside any positive-only trigram
[bigram for bigram in bi_grams_pos_only.keys() if not any(bigram in tg for tg in tri_grams_pos_only.keys())]
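
The check `bigram in tg` is a substring test, so a bigram such as `he issue` would count as contained in the trigram `the issue be`. A token-level variant is sketched below under that reading of the cell's intent. The helper `bigrams_not_in_trigrams` and the toy inputs are hypothetical. It also avoids comparing every bigram against every trigram by first collecting the two bigrams each trigram contains.

```python
def bigrams_not_in_trigrams(bigrams, trigrams):
    """Return the bigrams that are not a token-level part of any trigram.

    Hypothetical helper for illustration; n-grams are space-joined strings.
    """
    covered = set()
    for tg in trigrams:
        tokens = tg.split(" ")
        # A trigram "a b c" contains exactly the bigrams "a b" and "b c"
        covered.add(" ".join(tokens[:2]))
        covered.add(" ".join(tokens[1:]))
    return [bg for bg in bigrams if bg not in covered]

# Toy inputs, made up for illustration
bigrams = ["main drawback", "he issue", "to solve"]
trigrams = ["the main drawback", "the issue be"]

standalone = bigrams_not_in_trigrams(bigrams, trigrams)
# Substring version from the cell above, for comparison
substring_based = [bg for bg in bigrams if not any(bg in tg for tg in trigrams)]
print(standalone)
print(substring_based)
```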


In [None]:
texts = df[df["text"].str.contains("drawback be that")].reset_index()["text"]

In [None]:
texts

0     its most obvious drawback be that the method c...
1     the most significant drawback be that ontology...
2     the main drawback be that it need almost 20,00...
3     one drawback be that it can not deal with depe...
4     a potential drawback be that it might not work...
5     the drawback be that , since extract event in ...
6     the main drawback be that the entry produce au...
7     an obvious drawback be that it be necessary to...
8     the only drawback be that it willperform slow ...
9     the drawback be that the estimate of parameter...
10    one possible drawback be that sense which one ...
11    their major drawback be that they require a gr...
12    the first drawback be that it require more kno...
13    another drawback be that it be impossible to a...
14    the drawback be that the solution may be only ...
15    the main drawback be that structure may not be...
16    another major drawback be that it require cons...
17    the major drawback be that we have to gene

In [None]:
#!find downloads -maxdepth 1 -type f -exec du -h {} + | sort --human-numeric-sort --reverse

128M	downloads/data.json
108M	downloads/sample_4k.json
28M	downloads/sample_1k.json
132K	downloads/oct7_test300.csv
