A spreadsheet comparing the results of the different combinations of heuristics is available here:
https://docs.google.com/spreadsheets/d/1hL7YA6b26UpogHj4n259yNG7VktV5GEFvx3Pz5uu-vU/edit?usp=sharing

In [1]:
import re
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
import en_core_web_sm
nlp = en_core_web_sm.load()
from tqdm import tqdm
tqdm.pandas()


## Tweets Loading

In [18]:
tweets = pd.read_csv('tweets.tsv', sep='\t')

## External Resources
- Predicates: Source?
- Scientific terms: two word lists merged into a single file, for a total of 78,128 words
    - 1) Scientific terms: https://github.com/liantze/domain_wordlists (domains include biology, medicine, engineering, physics, neuroscience, geo, math)
    - 2) Covid-related terms: TweetsCOV19 list of 269 seed words https://data.gesis.org/tweetscov19/

A frequency analysis shows that the following sciterms from the list occur in more than 20 tweets (term, number of tweets):
 ('lol', 1272),
 ('omg', 233),
 ('ppl', 184),
 ('pls', 163),
 ('coronavirus', 94),
 ('isn', 84),
 ('stan', 63),
 ('dat', 57),
 ('soo', 48),
 ('pandemic', 41),
 ('behaviour', 40),
 ('abt', 39),
 ('dawg', 38),
 ('mask', 37),
 ('centre', 36),
 ('healthcare', 31),
 ('lockdown', 23)
 
We remove the text-speak words among these (e.g. 'lol', 'omg', 'ppl', 'pls') from the sciterms list.
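
The frequency analysis above could be reproduced along these lines. This is a minimal sketch, not the notebook's original code: the function name `sciterm_tweet_frequencies`, the whitespace tokenization, and the lowercasing are assumptions made for illustration.

```python
from collections import Counter

def sciterm_tweet_frequencies(tweets, terms, min_tweets=20):
    """Return (term, n_tweets) pairs for terms found in more than min_tweets tweets."""
    term_set = set(terms)
    counts = Counter()
    for text in tweets:
        # Assumption: lowercase + whitespace split; the notebook may tokenize differently.
        # Using a set means a term repeated inside one tweet is counted once.
        tokens = set(text.lower().split())
        counts.update(tokens & term_set)
    return [(term, n) for term, n in counts.most_common() if n > min_tweets]
```

Calling it with an iterable of tweet texts and the loaded term list would give a list in the same `(term, count)` shape as the one shown above, which makes the text-speak entries easy to spot.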



In [3]:
# TODO: match on token lemmas to avoid listing every conjugation of the same verb
predicates = ['affect', 'affects',
              'cause', 'causes',
              'inhibit', 'inhibits',
              'prevent', 'prevents',
              'treat', 'treats',
              'lead to', 'leads to',
              'increase', 'increases',
              'decrease', 'decreases',
              'facilitate', 'facilitates',
              'hinder', 'hinders',
              'stop', 'stops',
              'associated with',
              'correlated with',
              'enable', 'enables',
              'are a',
              'need more', 'needs more',
              'need less', 'needs less',
              'support', 'supports',
              'lower', 'lowers',
              'promote', 'promotes',
              'process of', 'reason for', 'reason why', 'higher than', 'lower than']
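
Until lemma-based matching is in place, the base/third-person pairs in the list above could be generated instead of typed by hand. The sketch below is not lemmatization: it is a naive rule that appends "s" to the first word of each base form, which happens to cover the verbs listed here but would be wrong for verbs such as "teach". The helper name `expand_predicates` is hypothetical, and fixed phrases such as 'associated with' or 'are a' would still be listed separately.

```python
def expand_predicates(base_forms):
    """Return each base form followed by a naive third-person singular variant."""
    expanded = []
    for base in base_forms:
        head, _, tail = base.partition(" ")
        # Only the first word is inflected, e.g. 'lead to' -> 'leads to'
        inflected = head + "s" + (" " + tail if tail else "")
        expanded.extend([base, inflected])
    return expanded
```

For example, `expand_predicates(['cause', 'lead to'])` yields the same four entries that appear for those verbs in the hand-written list.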

In [90]:
with open('/Users/mac/Downloads/scientific_claims/scientific_terms.txt', 'r') as f:
    scientific_terms = [line.strip() for line in f]
    scientific_terms.sort()

# Normalize every term to lowercase
scientific_terms = [term.lower() for term in scientific_terms]
    
text_speak_wordlist = ['lol','lmao','lmfao', 'omg', 'ppl', 'pls','thx','sry','dah', 'dat', 'soo', 'isn', 'stan','abt', 'dawg','cha','jk']
scientific_terms = [term for term in scientific_terms if term not in text_speak_wordlist]

In [5]:
from spacy_arguing_lexicon import ArguingLexiconParser
from spacy.language import Language

@Language.factory("custom_lexicon_parser")
def my_component(nlp,name="custom_lexicon_parser"):
    return ArguingLexiconParser()

nlp.add_pipe("custom_lexicon_parser")
#nlp.remove_pipe("custom_lexicon_parser")

<spacy_arguing_lexicon.parsers.ArguingLexiconParser at 0x15a26de10>

## Option 1: Applying Sci-claim Heuristics directly on the TweetsKB 100K sample

Results:
- contains_arg_relation => 289 tweets
- contains_sciterm => 1392 tweets
- contains_arg_relation + contains_sciterm => 10 tweets

### Sci-Heuristics

In [6]:
def contains_arg_relation(tweet_sentence):
    for pred in predicates:
        if re.match(r'.*\s(' + pred + r')\s.{2,}', tweet_sentence) is not None:
            return pred
    return ""

def contains_argument(tweet_sentence):
    doc = nlp(tweet_sentence)
    argument_span_generator = doc._.arguments.get_argument_spans()
    for argument in argument_span_generator:
        #print("Argument lexicon:", argument.text)
        #print("Label of lexicon:", argument.label_)
        #print("Sentence where lexicon occurs:", argument.sent.text.strip())
        #print("\n")
        return True, {'argument_lexicon':argument.text,'argument_label':argument.label_}
    return False, {}

def contains_scientific_term(tweet_sentence):
    sciterms = []
    tweet_tokens = word_tokenize(tweet_sentence)
    for sciterm in scientific_terms:
        if sciterm in tweet_tokens:
            sciterms.append(sciterm)
    if len(sciterms) > 0:
        return True, sciterms
    return False, []

def is_claim(tweet):
    tweet = tweet.lower()
    pred = contains_arg_relation(tweet)
    if pred != "":
        sentences = sent_tokenize(tweet)
        
        for sent in sentences:
            doc = nlp(sent)

            if " "+pred+" " in sent:
                tags = [token.tag_ for token in doc]
                poss = [token.pos_ for token in doc]
                ents = [token.ent_type_ for token in doc]
                texts = [token.lower_ for token in doc]

                if len(pred.split(" ")) > 1:
                    pred_index = texts.index(pred.split(" ")[0])
                else:
                    pred_index = texts.index(pred)

                #if (pred == "support" and poss[pred_index] != 'NOUN') or pred != "support":
                tags_before = tags[:pred_index]
                poss_before = poss[:pred_index]
                ents_before = ents[:pred_index]

                tags_after = tags[pred_index+1:]
                poss_after = poss[pred_index+1:]
                ents_after = ents[pred_index+1:]


                # Condition: what precedes the predicate contains a noun AND contains none of: personal pronoun, possessive pronoun, PERSON entity (including fictional)
                if 'PRP' not in tags_before and 'PRP$' not in tags_before and 'PERSON' not in ents_before and 'NOUN' in poss_before:
                    # Same condition for what's after the predicate
                    if 'PRP' not in tags_after and 'PRP$' not in tags_after and 'PERSON' not in ents_after and 'NOUN' in poss_after:
                        if "?" in sent:
                            if " how " in sent or "when " in sent or "why " in sent:
                                return True, 'claim_question', sent
                            else:
                                return True, 'question', sent
                        else:
                            return True, pred, sent

    return False, "", ""
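
The predicate test in `contains_arg_relation` above interpolates each predicate directly into a pattern, so a predicate containing a regex metacharacter would be read as pattern syntax. The sketch below keeps the same pattern shape and the same list-order semantics, but precompiles the patterns once and escapes each predicate with `re.escape`. The names `compile_predicate_patterns` and `first_matching_predicate` are hypothetical.

```python
import re

def compile_predicate_patterns(predicates):
    """Precompile one pattern per predicate, preserving the list order."""
    return [
        (pred, re.compile(r'.*\s(' + re.escape(pred) + r')\s.{2,}'))
        for pred in predicates
    ]

def first_matching_predicate(sentence, compiled):
    """Return the first predicate (in list order) found in the sentence, else ''."""
    for pred, pattern in compiled:
        if pattern.match(sentence) is not None:
            return pred
    return ""
```

As in the original function, a predicate only matches when it is surrounded by whitespace and followed by at least two more characters, so a predicate at the very start or end of a tweet is not detected.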

### Compound heuristics

In [7]:
def is_claim_with_sciterm(tweet):
    return is_claim(tweet)[0] and contains_scientific_term(tweet)[0]

### Testing

In [8]:
sample_1 = 'girls kissing girls cause is hot right?'
print(is_claim(sample_1), contains_scientific_term(sample_1), contains_argument(sample_1))

sample_2 = 'the vaccine causes cancer'
print(is_claim(sample_2), contains_scientific_term(sample_2), contains_argument(sample_2))

(False, '', '') (False, []) (False, {})
(True, 'causes', 'the vaccine causes cancer') (True, ['vaccine']) (False, {})


### Results: contains_arg_relation (289 tweets)

In [99]:
res = tweets['text'].progress_apply(is_claim)
res = list(map(list, zip(*res.values)))

tweets['is_claim'] = res[0]
tweets['claim_pred'] = res[1]
tweets['claim_sentence'] = res[2]

100%|██████████| 100000/100000 [00:48<00:00, 2045.21it/s]


In [100]:
tweets[tweets['is_claim']]['claim_sentence']

2173                          boredom leads to desperation
2583     so very proud of the canberra community that c...
3150     studies have found that marijuana use does not...
3520     #happy30thbdaybradley don't stop tweeting people!
4809                  @chihibaby u are a gang leader 4ever
                               ...                        
98458      yes amen a superior thread berets are a godsend
99146    this 🍊🤡 continues to set the bar lower & lower...
99428    @namelessarab @zac_887 @georges84034435 @marti...
99655    delusional to think four seasons would come to...
99900    touchdown‼️‼️ @maxgilliam11 finds @steve4six f...
Name: claim_sentence, Length: 289, dtype: object

In [115]:
list(tweets[tweets['is_claim']]['claim_sentence'])

['boredom leads to desperation',
 'so very proud of the canberra community that came out in force to support #marriageequality #auspol on a topic... http://t.co/7ikk6igdof',
 'studies have found that marijuana use does not actually lower the iq of teens.',
 "#happy30thbdaybradley don't stop tweeting people!",
 '@chihibaby u are a gang leader 4ever',
 '@bdayspring would hate for u to stop political blabber for even a few min.',
 '"@shaado_9: toothaches are a bitch ☹ .."',
 "@reemsarah95 don't worry about a thing cause every little ting is gonna be alright #bobmarley",
 '@gettingfitttttt to stop hoarding so much😂 and to have a healthier relationship with food!',
 'lower risk of death is associated with vegetarian diets http://t.co/cq3kg4meqy',
 '#beliebersneedsheartbreaker stop say "soon" @justinbieber',
 "try to find some fanfiction, once find one, can't stop reading the story & terrorising the author to make another continuation #facepalm",
 'tweet 2000 goes out to @hannah_tut for bein

### Results: contains_sciterm (1392 tweets)

- The initial run returned 3516 tweets; after removing the most frequent text-speak words (cf. the sciterms section), the result drops to 1392 tweets.

In [92]:
res = tweets['text'].progress_apply(contains_scientific_term)

tweets['contains_sciterm'] = [contains_sciterm[0] for contains_sciterm in res]

100%|██████████| 100000/100000 [48:36<00:00, 34.28it/s] 
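
The progress bar above reports about 48 minutes for this pass, against under a minute for `is_claim`. A likely contributor is that `contains_scientific_term` tests every term of the full list against each tweet's token list. A possible alternative, sketched under assumptions, is a set intersection. The name `contains_scientific_term_fast` is hypothetical, and the function takes tokens that were already produced (for instance by `word_tokenize`) so that the sketch stays dependency-free.

```python
def contains_scientific_term_fast(tweet_tokens, term_set):
    """Set-intersection variant of contains_scientific_term (sketch)."""
    # sorted() gives a deterministic, alphabetical result order
    found = sorted(term_set.intersection(tweet_tokens))
    return bool(found), found

# Usage: build the set once, outside the per-tweet loop
term_set = {"vaccine", "mask", "lockdown"}  # stand-in for set(scientific_terms)
print(contains_scientific_term_fast(["the", "vaccine", "causes", "cancer"], term_set))
```

The returned booleans should agree with the original function, but the list of matched terms can differ in order and will not repeat a term that appears twice in the term file.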


In [93]:
tweets[tweets['contains_sciterm']]['text']

21       paris top 120 ws95 #fororder #bbm #236cdb67 #e...
195      @swinnie80 he's been quality mate #englandno1 ...
228      @Giantbaby1994 I wanted to but how can? I'm ha...
290      @SweetGastochi Ki onda con el 1000Tom?? I LOVE...
300      Ay dios mio. It's been a long day but it was a...
                               ...                        
99850    GP If NY’s coronavirus peak occurs this week, ...
99927    @burbancharlie @ida_praestgaard @ADDiaz1977 @C...
99937                    Another 30 days of lockdown??? 😭😭
99962    a funky hermes I drew for @Bashoded✨ I had a t...
99998    #DecadeForMassiveRagada Aa swag, Rayalaseema d...
Name: text, Length: 1392, dtype: object

In [94]:
list(tweets[tweets['contains_sciterm']]['text'])

['paris top 120 ws95 #fororder #bbm #236cdb67 #email #officialcerise@hotmail.com http://t.co/3dMG3xFaex',
 "@swinnie80 he's been quality mate #englandno1 no one els comes close",
 "@Giantbaby1994 I wanted to but how can? I'm having my sched now bb, next time kay? ask manager oppa to cook for you then ehe",
 '@SweetGastochi Ki onda con el 1000Tom?? I LOVE YOU MORE ROSA <3',
 "Ay dios mio. It's been a long day but it was at least productive.",
 'foget it . nite',
 'What am I going to do with the rest of my day? Watch pll obviously.',
 "@SnaggleSprout @LauraBaileyVO that's exactly why I'm getting it. It's why I got SRTT, preordered SR4, and watch shin chan :P",
 'morning "@Jirayu_28: Morning gan"',
 "A preclinical study led by researchers at Children's National Medical Center has found that a new oral drug shows early promise for the t...",
 "@GaernKyu is it because you're an evil? not cool at all oppa, you have black wings and i prefer the white one",
 'my friend was yelling at me to go 

### Results: contains_arg_relation + contains_sciterm (10 tweets)

In [97]:
res = tweets['text'].progress_apply(is_claim_with_sciterm)

tweets['is_claim_with_sciterm'] = res

100%|██████████| 100000/100000 [01:02<00:00, 1588.58it/s]


In [101]:
tweets[tweets['is_claim_with_sciterm']]['claim_sentence']

33207    manufacturing organisation @sme_mfg partners w...
33641    job authority may increase depression symptoms...
33676    #germany makes in #india – yet again: marquard...
33939    hypothyroidism and hyperthyroidism: causes and...
39072              question: can nuts cause cold symptoms?
44744    not a word from worthless @cnnbrk on white hou...
59330    consuming food rich in probiotics may promote ...
81351    is the s.m.a.r.t stent the smart option to tre...
90617    @rfdubai100 @kelly66617 @sbsnews influenza vac...
93799    more on the role of iminosugars to inhibit cor...
Name: claim_sentence, dtype: object

In [102]:
len(tweets[tweets['is_claim_with_sciterm']]['claim_sentence'])

10

In [103]:
list(tweets[tweets['is_claim_with_sciterm']]['claim_sentence'])

['manufacturing organisation @sme_mfg partners with trade fair organiser demat to support euromold trade show',
 'job authority may increase depression symptoms in women http://t.co/xxmiakkeg3',
 '#germany makes in #india – yet again: marquardt group increases footprint in india & sets up r&d centre in #pune http://t.co/tiy2ur3kf5',
 'hypothyroidism and hyperthyroidism: causes and natural remedies https://t.co/3sezjfijjq #ayurvedic',
 'question: can nuts cause cold symptoms?',
 'not a word from worthless @cnnbrk on white house announcement healthcare premiums will increase by 25% in 2017.',
 'consuming food rich in probiotics may promote lower #bloodpressure levels & increase effectiveness of medications.',
 'is the s.m.a.r.t stent the smart option to treat femoropopliteal disease: the femoropopliteal fp artery is a common site for endovascular interventions.',
 '@rfdubai100 @kelly66617 @sbsnews influenza vaccination increases risk of corona virus by 36%.jpeg https://t.co/wrie26bssy',


## Option 2: Filtering claim tweets with a supervised classifier, then applying Sci-claim Heuristics on the filtered dataset

The model is SciBERT fine-tuned on the labeled tweet dataset (~3.5k tweets) used for CheckThat! 2022 Subtask 1B, in which a tweet is labeled "Yes" if it contains a verifiable factual claim.

The classifier labeled **11,773 tweets** (~11k) as claims out of the initial **100k** sample.

The classifier code is in a separate notebook.

In [104]:
claim_tweets = pd.read_csv('positive_outputs.csv', sep='\t')

In [105]:
claim_tweets

Unnamed: 0,text
0,Eurozone industrial production dips 0.4 PCT | ...
1,Looking forward to the new Leaf Cafe & booksho...
2,#mp3 #ThreeWordsSheWantsToHear $0.1 Il Cammino...
3,#2: Agadir Argan Oil Daily Shampoo + Condition...
4,Philadelphia 76ers now own a D-League team bas...
...,...
11768,@ASageInglis ggggiiiirrrrrllllllllllllllllll i...
11769,@TPointUK Yet still no evidence of ‘cheating’ ...
11770,"November 14th 1999, Survivor Series. 21 years ..."
11771,Nicky is a fawn colored tom with yellow colore...


### Results: contains_arg_relation (102 tweets)

In [106]:
res = claim_tweets['text'].progress_apply(is_claim)
res = list(map(list, zip(*res.values)))

claim_tweets['is_claim'] = res[0]
claim_tweets['claim_pred'] = res[1]
claim_tweets['claim_sentence'] = res[2]

100%|██████████| 11773/11773 [00:11<00:00, 1044.59it/s]


In [107]:
claim_tweets[claim_tweets['is_claim']]['claim_sentence']

218      studies have found that marijuana use does not...
504      @bdayspring would hate for u to stop political...
611      lower risk of death is associated with vegetar...
876      uk storms cause rush-hour disruption for road,...
970      putin urges kiev to stop fighting, ensure dial...
                               ...                        
11104    "ahead of the missouri primary on august 4, si...
11451    wages of migrants sent home could drop $142bn ...
11669    this 🍊🤡 continues to set the bar lower & lower...
11732    delusional to think four seasons would come to...
11758    touchdown‼️‼️ @maxgilliam11 finds @steve4six f...
Name: claim_sentence, Length: 102, dtype: object

In [108]:
list(claim_tweets[claim_tweets['is_claim']]['claim_sentence'])

['studies have found that marijuana use does not actually lower the iq of teens.',
 '@bdayspring would hate for u to stop political blabber for even a few min.',
 'lower risk of death is associated with vegetarian diets http://t.co/cq3kg4meqy',
 'uk storms cause rush-hour disruption for road, rail and air travellers http://t.co/sx0blu5gam',
 'putin urges kiev to stop fighting, ensure dialogue: tv http://t.co/c9q6stvpl4',
 'investments will increase access to affordable housing http://t.co/fqtyhsiwqb',
 'agra forced religious conversion row causes uproar in both house of parliament | indilens !',
 '@jordan___evans @doughboyy____ @_ignorethehype creatine affects the kidneys if not properly taken with a proper dosage of h2o😎',
 'no man da one deh ago cause a wharf load a problems lmao #tns @zjchrome',
 'on avg, insurance premium growth in the first year of aca looks to be lower than pre-aca growth http://t.co/mfghec9ngd',
 '#genome sequencing reveals mutation responsible for lower fat lev

### Results: contains_sciterm (302 tweets)

In [113]:
res = claim_tweets['text'].progress_apply(contains_scientific_term)

claim_tweets['contains_sciterm'] = [contains_sciterm[0] for contains_sciterm in res]

100%|██████████| 11773/11773 [08:50<00:00, 22.18it/s]


In [114]:
claim_tweets[claim_tweets['contains_sciterm']]['text']

6        paris top 120 ws95 #fororder #bbm #236cdb67 #e...
49       A preclinical study led by researchers at Chil...
129      I fckin dove to the ground..michah yelling at ...
133      “@shanesmith30: @VICE next week we ramp up. Th...
155      i had like 4 hours of sleep so im in a weird m...
                               ...                        
11719    @greengal66 @lauriejwolfe @meganwbaygirl Thank...
11745    Corona is more important than the child you wa...
11748    I never seen such heartless people like Donald...
11755    GP If NY’s coronavirus peak occurs this week, ...
11765                    Another 30 days of lockdown??? 😭😭
Name: text, Length: 302, dtype: object

In [116]:
list(claim_tweets[claim_tweets['contains_sciterm']]['text'])

['paris top 120 ws95 #fororder #bbm #236cdb67 #email #officialcerise@hotmail.com http://t.co/3dMG3xFaex',
 "A preclinical study led by researchers at Children's National Medical Center has found that a new oral drug shows early promise for the t...",
 'I fckin dove to the ground..michah yelling at me like lexa btch get up...i rolled over & got up in one movement. I swear ,',
 '“@shanesmith30: @VICE next week we ramp up. Three strong eps back to back.” I can’t fucking wait. Your show is changing my life.',
 'i had like 4 hours of sleep so im in a weird mood atm',
 'http://t.co/25Mzm6OVAa Amber alert. I urge those with braz, peruv and mong release it and put it in a draw. You might lose £200 + tomorrow',
 'He killin that poor lil white girl smh https://t.co/VXctEA9WMe',
 'LASIK is most similar to another surgical corrective procedure, photorefractive keratectomy (PRK)',
 '@CoolNEGuy DealBook: Detroit Wins $55 Million in Concessions From 2 Banks: The agreement by the two banks was rare be

### Results: contains_arg_relation + contains_sciterm (6 tweets)

In [109]:
res = claim_tweets['text'].progress_apply(is_claim_with_sciterm)

claim_tweets['is_claim_with_sciterm'] = res

100%|██████████| 11773/11773 [00:15<00:00, 784.27it/s]


In [110]:
claim_tweets[claim_tweets['is_claim_with_sciterm']]['claim_sentence']

2788     job authority may increase depression symptoms...
2793     #germany makes in #india – yet again: marquard...
4028     not a word from worthless @cnnbrk on white hou...
5811     consuming food rich in probiotics may promote ...
9020     is the s.m.a.r.t stent the smart option to tre...
10400    @rfdubai100 @kelly66617 @sbsnews influenza vac...
Name: claim_sentence, dtype: object

In [112]:
list(claim_tweets[claim_tweets['is_claim_with_sciterm']]['claim_sentence'])

['job authority may increase depression symptoms in women http://t.co/xxmiakkeg3',
 '#germany makes in #india – yet again: marquardt group increases footprint in india & sets up r&d centre in #pune http://t.co/tiy2ur3kf5',
 'not a word from worthless @cnnbrk on white house announcement healthcare premiums will increase by 25% in 2017.',
 'consuming food rich in probiotics may promote lower #bloodpressure levels & increase effectiveness of medications.',
 'is the s.m.a.r.t stent the smart option to treat femoropopliteal disease: the femoropopliteal fp artery is a common site for endovascular interventions.',
 '@rfdubai100 @kelly66617 @sbsnews influenza vaccination increases risk of corona virus by 36%.jpeg https://t.co/wrie26bssy']
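
To compare the two options on equal footing, the counts reported in the result headings above can be expressed as a percentage of the tweets each option started from. The sketch below only restates those reported counts; the dictionary layout and the helper name `yield_rates` are illustrative.

```python
# Counts copied from the result headings of Option 1 and Option 2
results = {
    "option_1": {"total": 100_000, "arg_relation": 289, "sciterm": 1392, "both": 10},
    "option_2": {"total": 11_773, "arg_relation": 102, "sciterm": 302, "both": 6},
}

def yield_rates(counts):
    """Percentage of input tweets retained by each heuristic."""
    total = counts["total"]
    return {
        name: round(100 * n / total, 3)
        for name, n in counts.items()
        if name != "total"
    }

for option, counts in results.items():
    print(option, yield_rates(counts))
```

Read this way, each heuristic retains a larger share of the classifier-filtered set than of the raw sample (roughly 0.87% against 0.29% for `contains_arg_relation`), while 6 of the 10 tweets found by the compound heuristic in Option 1 are still found in Option 2.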