# Finding (Problem Statement) Signal in Sentence Phrasing

Here we compare the [n-grams](https://en.wikipedia.org/wiki/N-gram) of positively and negatively labeled sentences: n-grams shared by both classes carry little signal, while n-grams that occur in only one class carry high signal.

If some n-grams appear *almost* exclusively in one class (0 or 1), they could make great matching phrases, either for classifying a sentence directly without a machine learning model or for last-mile quality assurance that flags suspicious model decisions.
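
As a rough sketch of the last-mile check described above, the snippet below flags sentences where a model predicts "no problem" although a known phrase matches. The phrase list, the `find_phrases` and `flag_suspicious` helpers, and the example predictions are illustrative assumptions, not part of this notebook's pipeline. The phrases are written in lemmatized form, like the n-grams derived further down.

```python
# Hypothetical phrase list; in practice it would come from class-exclusive n-grams.
PROBLEM_PHRASES = ["drawback of", "limitation be that", "the problem of"]

def find_phrases(sentence, phrases=PROBLEM_PHRASES):
    """Return the phrases that occur as whole-token sequences in the sentence."""
    padded = f" {sentence.lower()} "
    return [p for p in phrases if f" {p} " in padded]

def flag_suspicious(sentence, predicted_label, phrases=PROBLEM_PHRASES):
    """Flag a prediction when the model says 0 (no problem) but a phrase matches."""
    return predicted_label == 0 and bool(find_phrases(sentence, phrases))

# Made-up (sentence, predicted label) pairs for illustration only.
examples = [
    ("the main drawback of this approach be the cost", 0),
    ("we evaluate the model on two dataset", 0),
    ("the problem of alignment be hard", 1),
]
flags = [flag_suspicious(text, label) for text, label in examples]
print(flags)  # [True, False, False]
```

Padding the sentence with spaces keeps a phrase from matching inside a longer token, at the cost of assuming whitespace-tokenized input.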

In [2]:
# default_exp core

In [49]:
import pandas as pd
df = pd.read_csv("datasets/problem_statements.csv")

In [12]:
df

Unnamed: 0.1,Unnamed: 0,Title,PMID,text,DOI,labels,source
0,0,,,The difficulty with this task lies in the fact...,,1,acl_cambridge
1,1,,,The problem with rich annotations is that they...,,1,acl_cambridge
2,2,,,"As a consequence , when adapting existing meth...",,1,acl_cambridge
3,3,,,The second problem of traditional word alignme...,,1,acl_cambridge
4,4,,,The main drawback of these systems is that the...,,1,acl_cambridge
...,...,...,...,...,...,...,...
3574,3574,Ten-Year Trends and Clinical Relevance of the ...,28472788.0,Antimicrobial resistance of Streptococcus pneu...,10.1159/000470828,1,labels_oct7
3575,3575,"Program FACTOR at 10: Origins, development and...",28438248.0,We aim to provide a conceptual view of the ori...,10.7334/psicothema2016.304,0,labels_oct7
3576,3576,How low an effect of a preventive measure agai...,27777090.0,Traveller's diarrhoea (TD) is the most common ...,10.1016/j.tmaid.2016.10.005,1,labels_oct7
3577,3577,"Clinical study on the efficacy, acceptance, an...",29663792.0,The primary objective of this trial was to dem...,10.23736/S0031-0808.18.03447-X,0,labels_oct7


### Fetching dataset from Google Drive


In [60]:
path = 'https://drive.google.com/uc?export=download&id='
df1 = pd.read_csv(path + "1t2m6IxieiE0hZ8TcLJe9foc7upngrA_0")
df1["source"] = " limitation|problem ... with ... l=be that"

In [61]:
df2 = pd.read_csv(path + "1t-2rU83KdppSrwDnqdQUi6Pvtw13D0bQ")
df2["source"] = "a ... drawback of"
df3 = pd.read_csv(path + "1iGdfLJ4MkTO8oOomFJ84JCc7nf2zYBul")
df3["source"] = "a {negative_adjectives} ...  {problem_nouns}"

In [62]:
len(df1), len(df2), len(df3)

(947, 1455, 18573)

In [63]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row indices as well
dff = pd.concat([df1, df2, df3])
dff.head()

Unnamed: 0,sentence_id,title,article_link,sentence_text,year,mesh_list,journal_title,volume,issue,abstract,source,paragraph_text
0,1263595,"Shoe-Integrated, Force Sensor Design for Conti...",https://pubmed.ncbi.nlm.nih.gov/32545528/,One problem with such methods is that they tap...,2020,"['Biomechanical Phenomena', 'Body Weight', 'Eq...","Sensors (Basel, Switzerland)",20,12,Traditional pedobarography methods use direct ...,limitation|problem ... with ... l=be that,
1,1821146,The Problem with Gout Is That It's Still Such ...,https://pubmed.ncbi.nlm.nih.gov/27481987/,The Problem with Gout Is That It 's Still Such...,2016,"['Allopurinol', 'therapeutic use', 'Arthritis,...",The Journal of rheumatology,43,8,,limitation|problem ... with ... l=be that,
2,1840641,Generation of noble-gas binding sites for crys...,https://pubmed.ncbi.nlm.nih.gov/11752783/,One problem with this approach is that not all...,2002,"['Bacteriophage T4', 'enzymology', 'Binding Si...","Acta crystallographica. Section D, Biological ...",58,Pt 1,"In recent years, the use of noble-gas complexe...",limitation|problem ... with ... l=be that,
3,2187391,"Segmentation of Crohn, Lymphangiectasia, Xanth...",https://pubmed.ncbi.nlm.nih.gov/22874169/,"However , a tough problem associated with this...",2012,"['Algorithms', 'Artificial Intelligence', 'Cap...",Studies in health technology and informatics,180,,Wireless capsule endoscopy (WCE) is a great br...,limitation|problem ... with ... l=be that,
4,3202138,Vection in virtual reality modulates vestibula...,https://pubmed.ncbi.nlm.nih.gov/31233640/,While significant technological advancements a...,2019,"['Female', 'Humans', 'Male', 'Muscle, Skeletal...",REFERENCES,50,10,The popularity of virtual reality (VR) has inc...,limitation|problem ... with ... l=be that,


In [64]:
dff["PMID"] = dff["article_link"].map(lambda t: t.split("/")[-2])
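
The `split("/")[-2]` lookup relies on every link ending with a trailing slash; without one it would return the host name instead of the ID. A slightly more defensive variant might look like the sketch below. The `extract_pmid` helper and its pattern are assumptions for illustration and presume links of the form `https://pubmed.ncbi.nlm.nih.gov/<digits>/`.

```python
import re

# Assumed link shape: the PMID is the last path segment and consists of digits only.
PMID_PATTERN = re.compile(r"/(\d+)/?$")

def extract_pmid(article_link):
    """Return the trailing numeric ID of a PubMed link, or None if there is none."""
    match = PMID_PATTERN.search(article_link.strip())
    return match.group(1) if match else None

with_slash = extract_pmid("https://pubmed.ncbi.nlm.nih.gov/32545528/")
without_slash = extract_pmid("https://pubmed.ncbi.nlm.nih.gov/32545528")
print(with_slash, without_slash)
```

It could be applied with `dff["article_link"].map(extract_pmid)`; links that do not fit the pattern would then show up as missing values rather than wrong IDs.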

In [65]:
len(dff.drop_duplicates(subset=["PMID"])), len(dff)

(20822, 20975)

In [24]:
dff.columns, df.columns

(Index(['sentence_id', 'title', 'article_link', 'sentence_text', 'year',
        'mesh_list', 'journal_title', 'volume', 'issue', 'abstract', 'source',
        'paragraph_text', 'PMID'],
       dtype='object'),
 Index(['Unnamed: 0', 'Title', 'PMID', 'text', 'DOI', 'labels', 'source'], dtype='object'))

In [66]:
dff = dff.drop_duplicates(subset=["PMID"])

In [67]:
dff = dff.rename(columns={'sentence_text': "text", "title": "Title"}, errors="raise")

In [68]:
dff

Unnamed: 0,sentence_id,Title,article_link,text,year,mesh_list,journal_title,volume,issue,abstract,source,paragraph_text,PMID
0,1263595,"Shoe-Integrated, Force Sensor Design for Conti...",https://pubmed.ncbi.nlm.nih.gov/32545528/,One problem with such methods is that they tap...,2020,"['Biomechanical Phenomena', 'Body Weight', 'Eq...","Sensors (Basel, Switzerland)",20,12,Traditional pedobarography methods use direct ...,limitation|problem ... with ... l=be that,,32545528
1,1821146,The Problem with Gout Is That It's Still Such ...,https://pubmed.ncbi.nlm.nih.gov/27481987/,The Problem with Gout Is That It 's Still Such...,2016,"['Allopurinol', 'therapeutic use', 'Arthritis,...",The Journal of rheumatology,43,8,,limitation|problem ... with ... l=be that,,27481987
2,1840641,Generation of noble-gas binding sites for crys...,https://pubmed.ncbi.nlm.nih.gov/11752783/,One problem with this approach is that not all...,2002,"['Bacteriophage T4', 'enzymology', 'Binding Si...","Acta crystallographica. Section D, Biological ...",58,Pt 1,"In recent years, the use of noble-gas complexe...",limitation|problem ... with ... l=be that,,11752783
3,2187391,"Segmentation of Crohn, Lymphangiectasia, Xanth...",https://pubmed.ncbi.nlm.nih.gov/22874169/,"However , a tough problem associated with this...",2012,"['Algorithms', 'Artificial Intelligence', 'Cap...",Studies in health technology and informatics,180,,Wireless capsule endoscopy (WCE) is a great br...,limitation|problem ... with ... l=be that,,22874169
4,3202138,Vection in virtual reality modulates vestibula...,https://pubmed.ncbi.nlm.nih.gov/31233640/,While significant technological advancements a...,2019,"['Female', 'Humans', 'Male', 'Muscle, Skeletal...",REFERENCES,50,10,The popularity of virtual reality (VR) has inc...,limitation|problem ... with ... l=be that,,31233640
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18568,362702792,Falls and injuries in frail and vigorous commu...,https://pubmed.ncbi.nlm.nih.gov/1987256/,These findings suggest that fall-related injur...,1991,"['Accidental Falls', 'prevention & control', '...",Journal of the American Geriatrics Society,39,1,Identification of different types of falls and...,a {negative_adjectives} ... {problem_nouns},,1987256
18569,362714838,Intermittent catheterization: sterile or clean?,https://pubmed.ncbi.nlm.nih.gov/1989043/,Bacteriuria -- asymptomatic and symptomatic --...,1991,"['Bacteriuria', 'etiology', 'nursing', 'preven...",Rehabilitation nursing : the official journal ...,16,1,Bacteriuria--asymptomatic and symptomatic--alw...,a {negative_adjectives} ... {problem_nouns},,1989043
18570,362715256,Variceal rebleeding after portosystemic shunti...,https://pubmed.ncbi.nlm.nih.gov/1989102/,Strategies and solutions to a vexing problem .,1991,"['Combined Modality Therapy', 'Esophageal and ...",The Surgical clinics of North America,71,1,The purpose of this review was to discuss an a...,a {negative_adjectives} ... {problem_nouns},,1989102
18571,362724620,Neuroprotective Effect of Natural Products on ...,https://pubmed.ncbi.nlm.nih.gov/26645998/,Peripheral nerve injury ( PNI ) is a serious p...,2016,"['Animals', 'Biological Products', 'therapeuti...",Neurochemical research,41,4,Peripheral nerve injury (PNI) is a serious pub...,a {negative_adjectives} ... {problem_nouns},,26645998


In [58]:
dff.columns

Index(['sentence_id', 'Title', 'article_link', 'text', 'year', 'mesh_list',
       'journal_title', 'volume', 'issue', 'abstract', 'source',
       'paragraph_text', 'PMID', 'text'],
      dtype='object')

In [69]:
# .copy() makes dff an independent frame, so the assignment below no longer triggers SettingWithCopyWarning
dff = dff[['Title', "text", 'mesh_list', 'journal_title', 'source', 'PMID']].copy()

In [70]:
dff["labels"] = 1


In [90]:
# labels is an integer column at this point, so compare against the integer 0
len(dff[dff["labels"] == 0])

0

In [101]:
df  = pd.read_csv("datasets/problem_statements.csv")


In [104]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, dff], ignore_index=True).reset_index()

In [109]:
df["PMID"] = df["PMID"].astype(str)
df["labels"] = df["labels"].astype(str)


In [106]:
df[df.duplicated(subset=["PMID"])]

Unnamed: 0.1,index,Unnamed: 0,Title,PMID,text,DOI,labels,source,mesh_list,journal_title
1,1,1.0,,,The problem with rich annotations is that they...,,1,acl_cambridge,,
2,2,2.0,,,"As a consequence , when adapting existing meth...",,1,acl_cambridge,,
3,3,3.0,,,The second problem of traditional word alignme...,,1,acl_cambridge,,
4,4,4.0,,,The main drawback of these systems is that the...,,1,acl_cambridge,,
5,5,5.0,,,Although these approaches do not suffer from s...,,1,acl_cambridge,,
...,...,...,...,...,...,...,...,...,...,...
1995,1995,1995.0,,,FNTBL is a transformation-based learner that i...,,0,acl_cambridge,,
1996,1996,1996.0,,,"Moreover , when the dependency parser is non-p...",,0,acl_cambridge,,
2795,2795,2795.0,Dermoscopy and the diagnosis of primary cutane...,2.88462e+07,Primary cutaneous B-cell lymphomas (PCBCLs) ar...,10.1111/jdv.14549,1,oct3_labels,,
2846,2846,2846.0,The Primary Result of Prospective Randomized M...,2.87447e+07,Postoperative adhesions are the major cause of...,10.1007/s11605-017-3503-1,1,oct3_labels,,


In [111]:
df[df["labels"] =="0"]

Unnamed: 0.1,index,Unnamed: 0,Title,PMID,text,DOI,labels,source,mesh_list,journal_title
497,497,497.0,,,Obtained lexical entries are guaranteed to con...,,0,acl_cambridge,,
498,498,498.0,,,Sections 3 and 4 describe the features induced...,,0,acl_cambridge,,
499,499,499.0,,,"Until the system finds a multiple entry , it b...",,0,acl_cambridge,,
500,500,500.0,,,-LRB- dictionary -RRB- No price for the new sh...,,0,acl_cambridge,,
501,501,501.0,,,Variation across languages can to a large exte...,,0,acl_cambridge,,
...,...,...,...,...,...,...,...,...,...,...
3567,3567,3567.0,The iTRAQ-based chloroplast proteomic analysis...,32131734.0,The perturbance of chloroplast proteins is a m...,10.1186/s12870-020-2297-6,0,labels_oct7,,
3569,3569,3569.0,Granulomatosis after autologous stem cell tran...,27904442.0,Sarcoidosis before and after treatment of mali...,10.1515/raon-2015-0033,0,labels_oct7,,
3571,3571,3571.0,Blackfullas in ivory towers: referenced reflec...,29169296.0,Indigenous representation is essential to ensu...,10.1080/10376178.2017.1409645,0,labels_oct7,,
3575,3575,3575.0,"Program FACTOR at 10: Origins, development and...",28438248.0,We aim to provide a conceptual view of the ori...,10.7334/psicothema2016.304,0,labels_oct7,,


In [96]:
df["source"].unique()

array(['acl_cambridge', 'predicts_acl_pm2500_vocabed',
       'Oct1_clinical_studies_pm', 'oct3_labels', 'labels_oct7',
       ' limitation|problem ... with ... l=be that', 'a ... drawback of',
       'a {negative_adjectives} ...  {problem_nouns}'], dtype=object)

In [112]:
df = df.dropna(subset=["text"])
df.to_csv("downloads/fat.csv")

In [100]:
dfo  = pd.read_csv("datasets/problem_statements.csv")
dfo[dfo["labels"] == 0]

Unnamed: 0.1,Unnamed: 0,Title,PMID,text,DOI,labels,source
497,497,,,Obtained lexical entries are guaranteed to con...,,0,acl_cambridge
498,498,,,Sections 3 and 4 describe the features induced...,,0,acl_cambridge
499,499,,,"Until the system finds a multiple entry , it b...",,0,acl_cambridge
500,500,,,-LRB- dictionary -RRB- No price for the new sh...,,0,acl_cambridge
501,501,,,Variation across languages can to a large exte...,,0,acl_cambridge
...,...,...,...,...,...,...,...
3567,3567,The iTRAQ-based chloroplast proteomic analysis...,32131734.0,The perturbance of chloroplast proteins is a m...,10.1186/s12870-020-2297-6,0,labels_oct7
3569,3569,Granulomatosis after autologous stem cell tran...,27904442.0,Sarcoidosis before and after treatment of mali...,10.1515/raon-2015-0033,0,labels_oct7
3571,3571,Blackfullas in ivory towers: referenced reflec...,29169296.0,Indigenous representation is essential to ensu...,10.1080/10376178.2017.1409645,0,labels_oct7
3575,3575,"Program FACTOR at 10: Origins, development and...",28438248.0,We aim to provide a conceptual view of the ori...,10.7334/psicothema2016.304,0,labels_oct7


In [84]:
df4 = pd.read_csv("downloads/first_sent_journal9500.csv")

In [85]:
df4.head()

Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI
0,32961261,Does background color influence visual thresho...,"Pérez MM, Della Bona A, Carrillo-Pérez F, Dude...",J Dent. 2020 Nov;102:103475. doi: 10.1016/j.jd...,Pérez MM,J Dent,2020,2020/09/22,,,10.1016/j.jdent.2020.103475
1,30520368,3D Printing Technology in Drug Delivery: Recen...,"Kotta S, Nair A, Alsabeelah N.",Curr Pharm Des. 2018;24(42):5039-5048. doi: 10...,Kotta S,Curr Pharm Des,2018,2018/12/07,,,10.2174/1381612825666181206123828
2,29500865,Disparities in eating disorder diagnosis and t...,"Sonneville KR, Lipson SK.",Int J Eat Disord. 2018 Jun;51(6):518-526. doi:...,Sonneville KR,Int J Eat Disord,2018,2018/03/04,,,10.1002/eat.22846
3,33343146,A Systematic Review of Mindfulness Practices f...,"Smith SL, Langen WH.",Int J Yoga. 2020 Sep-Dec;13(3):177-182. doi: 1...,Smith SL,Int J Yoga,2020,2020/12/21,PMC7735497,,10.4103/ijoy.IJOY_4_20
4,30537093,The Effect of Ceramic Type and Background Colo...,"Al Hamad KQ, Obaidat II, Baba NZ.",J Prosthodont. 2020 Jul;29(6):511-517. doi: 10...,Al Hamad KQ,J Prosthodont,2020,2018/12/12,,,10.1111/jopr.13005


In [None]:
#export

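def take_while(fn, coll):
    """Yield values from coll until fn returns False for the first time."""
    for e in coll:
        if fn(e):
            yield e
        else:
            return

def partition(n, coll, step=None):
    """Yield windows of exactly n items, advancing by step (default n); stop at the first shorter window."""
    return take_while(lambda e: len(e) == n,
        (coll[i:i+n] for i in range(0, len(coll), step or n)))

def partition_all(n, coll, step=None):
    """Like partition, but also yield trailing windows shorter than n."""
    return (coll[i:i+n] for i in range(0, len(coll), step or n))

def n_grams(texts, n_gram=2):
    """Return the space-joined n-grams of all whitespace-tokenized texts."""
    return [" ".join(n) for t in texts for n in partition(n_gram, t.split(" "), 1)]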

In [None]:
#export

import spacy
nlp = spacy.load("en_core_web_md")

def lemmatize(text, nlp=nlp):
    return " ".join([tok.lemma_ for tok in nlp(text)])

In [None]:
df["lemmatized"] = df["text"].map(lemmatize)

In [None]:
df["text"] = df["lemmatized"]

In [None]:
positives = df[df["labels"] == 1]
negatives = df[df["labels"] == 0]

In [None]:
positives[:3]

Unnamed: 0.1,Unnamed: 0,Title,PMID,text,DOI,labels,source,lemmatized
0,0,,,the difficulty with this task lie in the fact ...,,1,acl_cambridge,the difficulty with this task lie in the fact ...
1,1,,,the problem with rich annotation be that they ...,,1,acl_cambridge,the problem with rich annotation be that they ...
2,2,,,"as a consequence , when adapt exist method and...",,1,acl_cambridge,"as a consequence , when adapt exist method and..."


In [None]:
bi_grams_pos = n_grams(positives["text"], 2)
tri_grams_pos = n_grams(positives["text"], 3)
bi_grams_neg = n_grams(negatives["text"], 2)
tri_grams_neg = n_grams(negatives["text"], 3)

In [None]:
bi_grams_pos[:3]

['the difficulty', 'difficulty with', 'with this']
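
To make the windowing explicit, here is a self-contained sketch of the same sliding-window idea on a single sentence. `sliding_n_grams` is a hypothetical stand-in written for this example; it should produce the same windows as `partition(n, tokens, 1)` in the exported code, but it is not the exported function.

```python
def sliding_n_grams(text, n=2):
    """Return all n-grams of a whitespace-tokenized sentence, moving one token at a time."""
    tokens = text.split(" ")
    # range is empty when the sentence has fewer than n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the difficulty with this task"
bigrams = sliding_n_grams(sentence, 2)
trigrams = sliding_n_grams(sentence, 3)
print(bigrams)   # ['the difficulty', 'difficulty with', 'with this', 'this task']
print(trigrams)  # ['the difficulty with', 'difficulty with this', 'with this task']
```

A sentence of k tokens yields k - n + 1 n-grams, so sentences shorter than n contribute nothing.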

In [None]:
from collections import Counter

d1 = dict(Counter(bi_grams_pos))
d2 = dict(Counter(bi_grams_neg))

d3 = dict(Counter(tri_grams_pos))
d4 = dict(Counter(tri_grams_neg))

bi_grams_both = {x:(d1[x], d2[x]) for x in d1 if x in d2}

tri_grams_both = {x:(d3[x], d4[x]) for x in d3 if d4.get(x)}

In [None]:
bi_grams_pos_only = {x:d1[x] for x in d1 if not d2.get(x)} # and d2.get(x) < 5
tri_grams_pos_only = {x:d3[x] for x in d3 if not d4.get(x)}
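
The split into shared and positive-only n-grams can also be expressed with set operations on the counter keys. The toy lists below are made up for illustration and merely stand in for `bi_grams_pos` and `bi_grams_neg`.

```python
from collections import Counter

# Toy n-gram lists (assumed data, not taken from the dataset).
pos = ["drawback of", "of the", "drawback of", "be that"]
neg = ["of the", "of the", "be that", "we use"]

pos_counts, neg_counts = Counter(pos), Counter(neg)

# Key views support set operations: & gives shared n-grams, - gives exclusive ones.
both = {g: (pos_counts[g], neg_counts[g]) for g in pos_counts.keys() & neg_counts.keys()}
pos_only = {g: pos_counts[g] for g in pos_counts.keys() - neg_counts.keys()}

print(sorted(both.items()))  # [('be that', (1, 1)), ('of the', (1, 2))]
print(pos_only)              # {'drawback of': 2}
```

The result matches the dictionary comprehensions above; only the iteration order of the shared keys may differ.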

In [None]:
[(k,v) for k,v in Counter(tri_grams_both).items() if (v[0] + v[1]) > 20 ]
#n-gram: (pos, neg)

[('the fact that', (13, 21)),
 ('be use for', (3, 24)),
 ('et al .', (9, 79)),
 ('al . ,', (9, 66)),
 ('that they be', (12, 12)),
 ('one of the', (37, 46)),
 ('be that the', (66, 9)),
 (', there be', (5, 16)),
 ('large number of', (6, 15)),
 ('be that it', (52, 3)),
 ('- of -', (6, 25)),
 ('the quality of', (7, 17)),
 (', such as', (6, 16)),
 ('that it be', (12, 18)),
 ('it be not', (5, 18)),
 ('n - gram', (8, 27)),
 ('in the training', (6, 18)),
 ('the training datum', (6, 30)),
 ('have not be', (5, 24)),
 ('it do not', (10, 24)),
 ('to the same', (4, 18)),
 ('in term of', (6, 19)),
 ('the use of', (5, 29)),
 (', which be', (10, 23)),
 (', it be', (6, 16)),
 ('be able to', (6, 18)),
 ('as well as', (6, 21)),
 ('they do not', (7, 15)),
 ('the most common', (29, 15)),
 ('take into account', (5, 18)),
 ('can not be', (5, 20)),
 ('can be use', (3, 27)),
 ('we do not', (3, 18)),
 (', and the', (6, 15)),
 ('the number of', (9, 42)),
 ('in order to', (4, 18)),
 ('depend on the', (3, 18)),
 (

In [None]:
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0] + v[1]) > 100]

[('in the', (99, 360)),
 (', and', (63, 246)),
 ('be that', (280, 12)),
 ('of the', (182, 576)),
 ('as a', (19, 106)),
 ('with the', (28, 74)),
 ('be use', (13, 106)),
 ('be the', (97, 84)),
 ('be not', (42, 162)),
 ('do not', (46, 145)),
 (', the', (46, 137)),
 (', which', (43, 99)),
 ('of this', (35, 70)),
 ('can be', (20, 100)),
 ('number of', (24, 114)),
 ('that the', (82, 82)),
 ('there be', (24, 95)),
 ('be an', (25, 79)),
 ('to be', (35, 122)),
 ('to the', (53, 239)),
 ('by the', (16, 86)),
 ('be a', (165, 222)),
 (') be', (107, 111)),
 (', but', (35, 72)),
 ('the most', (62, 56)),
 ('it be', (32, 109)),
 ('on the', (25, 151)),
 ('and the', (24, 82)),
 ('of a', (14, 111)),
 ('the same', (19, 100)),
 ('have be', (29, 143)),
 ('for the', (21, 115)),
 (', we', (6, 122))]

Let's look at n-grams that one class shows very rarely relative to the other.

In [None]:
factor = 15
# show bigrams that appear more than 15x (factor) and trigrams more than 7.5x (factor/2)
# as often in positive samples as in negative ones
[(k,v) for k,v in Counter(bi_grams_both).items() if (v[0]/v[1]) > factor], [(k,v) for k,v in Counter(tri_grams_both).items() if (v[0]/v[1]) > factor/2]

([('the problem', (57, 1)),
  ('be that', (280, 12)),
  ('problem of', (58, 1)),
  ('limitation of', (48, 2)),
  ('shortcoming of', (24, 1))],
 [('be that it', (52, 3))])
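
The plain ratio `v[0]/v[1]` is only defined for n-grams seen in both classes, which is why positive-only n-grams are handled separately in this notebook. One common alternative, not used here, is additive smoothing, which lets both groups be ranked on one scale. The sketch below assumes `alpha = 1`; the function name is hypothetical, and the counts are copied from this notebook's outputs.

```python
def smoothed_ratio(pos_count, neg_count, alpha=1.0):
    """Positive-to-negative count ratio with additive smoothing (alpha is an assumed choice)."""
    return (pos_count + alpha) / (neg_count + alpha)

# (positive count, negative count); 'drawback of' never occurs in negative sentences.
counts = {
    "problem of": (58, 1),
    "be that": (280, 12),
    "drawback of": (57, 0),
    "that the": (82, 82),
}
ranked = sorted(counts, key=lambda g: smoothed_ratio(*counts[g]), reverse=True)
print(ranked)  # ['drawback of', 'problem of', 'be that', 'that the']
```

With smoothing, a zero negative count no longer causes a division error, and rare n-grams are pulled toward a ratio of 1.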

Apparently "problem of" appears 58 times in problem statements and only once in a non-problem statement. Let's look at a negative sentence that contains one of these phrases, here "limitation of"!

In [None]:
negatives[negatives["text"].str.contains('limitation of')].reset_index()["text"][4]

'the translucency and opacity of ceramic play a significant role in emulate the natural color of tooth , but study of the masking property and limitation of dental ceramic when use as monolayer restoration be lack'
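
Note that `str.contains` matches substrings, so a lookup for "limitation of" would also hit a token such as "delimitation of". If whole-token matches are wanted, a regular expression with word boundaries is one option. The sketch below uses plain `re` on two sentences: the first is a shortened version of the output above, the second is made up. The helper name is an assumption.

```python
import re

def contains_phrase(text, phrase):
    """True if phrase occurs in text delimited by word boundaries."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text) is not None

sentences = [
    "study of the masking property and limitation of dental ceramic be lack",
    "the delimitation of the region be unclear",  # illustrative counter-example
]
hits = [contains_phrase(s, "limitation of") for s in sentences]
print(hits)  # [True, False]
```

The same pattern string could be passed to `Series.str.contains`, which interprets its argument as a regular expression by default.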

In [None]:
bgs = [(k,v) for k,v in Counter(bi_grams_pos_only).items() if v > 10]
tgs = [(k,v) for k,v in Counter(tri_grams_pos_only).items() if v > 6]
bgs, tgs

([('with this', 11),
  ('problem with', 20),
  ('main drawback', 16),
  ('drawback of', 57),
  ('limitation be', 41),
  ('the disadvantage', 15),
  ('disadvantage of', 16),
  ('method be', 14),
  ('shortcoming be', 18),
  ('disadvantage be', 12),
  ('a serious', 16),
  ('issue of', 14),
  ('one drawback', 11),
  ('main limitation', 11),
  ('major drawback', 13),
  ('drawback be', 67),
  ('another limitation', 14),
  ('a drawback', 17),
  ('the drawback', 38),
  ('of use', 14)],
 [('lie in the', 7),
  ('be that they', 32),
  ('the problem of', 44),
  ('approach be the', 9),
  ('be the fact', 7),
  ('the main drawback', 12),
  ('limitation be that', 32),
  ('the disadvantage of', 7),
  (', the problem', 8),
  ('the problem be', 8),
  ('problem of the', 7),
  ('method be that', 10),
  ('limitation be the', 7),
  ('disadvantage be that', 10),
  ('be that there', 9),
  ('the issue of', 10),
  ('drawback of this', 7),
  ('the main limitation', 11),
  ('main limitation of', 9),
  ('approach b

In [None]:
with open("ngrams_pos_only_lemmatized.txt", "w") as f:
    f.write("\n".join([t[0] for t in bgs+tgs]))

In [None]:
# positive-only bigrams that are not contained in any positive-only trigram
[bigram for bigram in bi_grams_pos_only.keys() if not any(bigram in tg for tg in tri_grams_pos_only.keys())]


[]

In [None]:
texts = df[df["text"].str.contains("drawback be that")].reset_index()["text"]

In [None]:
texts

0     its most obvious drawback be that the method c...
1     the most significant drawback be that ontology...
2     the main drawback be that it need almost 20,00...
3     one drawback be that it can not deal with depe...
4     a potential drawback be that it might not work...
5     the drawback be that , since extract event in ...
6     the main drawback be that the entry produce au...
7     an obvious drawback be that it be necessary to...
8     the only drawback be that it willperform slow ...
9     the drawback be that the estimate of parameter...
10    one possible drawback be that sense which one ...
11    their major drawback be that they require a gr...
12    the first drawback be that it require more kno...
13    another drawback be that it be impossible to a...
14    the drawback be that the solution may be only ...
15    the main drawback be that structure may not be...
16    another major drawback be that it require cons...
17    the major drawback be that we have to gene

In [None]:
#!find downloads -maxdepth 1 -type f -exec du -h {} + | sort --human-numeric-sort --reverse

128M	downloads/data.json
108M	downloads/sample_4k.json
28M	downloads/sample_1k.json
132K	downloads/oct7_test300.csv
