**From Real News vs Fake News Article**

Stylistic Features

- NLTK Part of Speech (POS) tagger
- Stop words, punctuation, quotes, negations, informal/swear words, interrogatives (how, when, what, why)
- All caps words
- LIWC
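
As a rough illustration of the simpler stylistic features listed above, the sketch below counts all-caps words, the four interrogatives, and two punctuation marks with plain regular expressions. The function name `stylistic_counts` and the token pattern are assumptions made for this example; POS tagging and LIWC categories are not covered here.

```python
import re

INTERROGATIVES = {"how", "when", "what", "why"}  # the four words named in the list

def stylistic_counts(text):
    """Count a few surface-level stylistic markers in a string."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return {
        # tokens of two or more characters written entirely in upper case
        "all_caps": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        "interrogatives": sum(1 for t in tokens if t.lower() in INTERROGATIVES),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
    }

sample = "Why wait? Get educated so YOU can understand this topic!"
print(stylistic_counts(sample))
```

Single-letter tokens such as "I" or "A" are excluded from the all-caps count so that ordinary capitalization does not inflate it.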

Complexity Features

- Look at sentence and word level
- Number of words per sentence
- Syntax Tree depth
- Verb Phrase syntax tree depth using Stanford Parser
- "We expect that more words per sentence and deeper syntax trees mean the average sentence structure complexity is high."
- Readability using grade level index: Gunning Fog, SMOG Grade, Flesch-Kincaid (higher score, higher reading level)
- Type-Token Ratio (TTR). Unique words divided by total number of words
- Fluency: "how frequently a term in a document is found in a large English corpus" (Corpus of Contemporary American English)
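
Two of the complexity measures above are simple enough to sketch directly: the Type-Token Ratio and the mean number of words per sentence. The helper names and the tokenization rules (letters only, sentences split on `.`, `!`, `?`) are assumptions for this example, not necessarily what the article used. Tree depth, readability indices, and fluency need a parser, a syllable counter, or a reference corpus, and are left out.

```python
import re

def type_token_ratio(text):
    """Unique words divided by total words, case-insensitive."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_words_per_sentence(text):
    """Average word count over sentences split on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[!?.]+", text) if s.strip()]
    counts = [len(re.findall(r"[A-Za-z]+", s)) for s in sentences]
    return sum(counts) / len(counts) if counts else 0.0

sample = "Get educated. Read several books and get educated on this topic."
print(type_token_ratio(sample))
print(mean_words_per_sentence(sample))
```

TTR tends to fall as documents get longer, so comparing it across texts of very different lengths may be misleading.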

Psychological Features

- Use LIWC to measure cognitive processes, drives, and personal concerns
- Basic bag-of-words sentiment (SentiStrength)
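
To make "bag-of-words sentiment" concrete, here is a minimal sketch that sums per-word scores from a lexicon and ignores word order. The lexicon `TOY_LEXICON` and its scores are invented for illustration only; they are not taken from the tool named above, and a real lexicon would be far larger.

```python
import re

# Hypothetical toy lexicon for illustration only; scores are made up
TOY_LEXICON = {"excellent": 2, "good": 1, "harass": -2, "bad": -1}

def bow_sentiment(text, lexicon=TOY_LEXICON):
    """Sum lexicon scores over tokens, ignoring word order and context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)

print(bow_sentiment("These excellent resources are good."))
print(bow_sentiment("They harass you."))
```

Because negation and context are ignored, a phrase such as "not good" would still score as positive under this approach.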

**From Text Mining Article**

- Word Pairings
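
The note above does not define "word pairings". One plausible reading, assumed here, is counts of adjacent word pairs (bigrams). The function name `word_pairs` is hypothetical.

```python
import re
from collections import Counter

def word_pairs(text):
    """Count adjacent word pairs (bigrams) in lower-cased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # zip pairs each token with its successor; the last token has none
    return Counter(zip(tokens, tokens[1:]))

pairs = word_pairs("Get educated and then get educated again.")
print(pairs.most_common(1))
```

This version pairs words across sentence boundaries; splitting into sentences first would avoid that if it matters.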

**Other**

- Use of punctuation

In [18]:
import pandas as pd
import numpy as np
import re
import LiwcTrie as lt

In [19]:
data = pd.DataFrame([{
    'article_host': None,
    'article_name': None,
    'article_subtitle': 'This is where you start, right here, with these excellent resources that medical '
        'professionals, scientists and activists have created for us so YOU can understand '
        'this vaccine topic more clearly.',
    'img-label': 'Amazon Censorship Of Vaccine Books',
    'img_src': './Stop Mandatory Vaccination - '
        'Posts_files/53730872_6120443278387_5582746568928264192_n.png.jpg',
    'linked_profiles': [],
    'links': [],
    'text': ' Get educated. Seriously, read several books and learn and understand this topic inside and out '
        'so you can rebuke the pediatrician, the politician, your friends, your family, and everyone who '
        'may harass you for not vaccinating. And then, get into this fight with us and educate others. '
        'This is where you start, right here, with these excellent resources that medical professionals, '
        'scientists and activists have created for us so YOU can understand this vaccine topic more '
        'clearly. I suggest you purchase as many as you can afford: there is no such thing as being '
        '“overeducated” on this topic that affects us all.',
    'timestamp': 'March 3 at 4:27 PM'
}])

data

Unnamed: 0,article_host,article_name,article_subtitle,img-label,img_src,linked_profiles,links,text,timestamp
0,,,"This is where you start, right here, with thes...",Amazon Censorship Of Vaccine Books,./Stop Mandatory Vaccination - Posts_files/537...,[],[],"Get educated. Seriously, read several books a...",March 3 at 4:27 PM


In [20]:
liwc_df = pd.read_excel('LIWC2007dictionary poster.xls')

dict_words = [term for term in liwc_df.values.flatten() if not pd.isnull(term)]
liwc_dict = lt.LiwcTrieNode('*')
for word in dict_words:
    lt.add(liwc_dict, word)

liwc_cat_map = {}
for category in liwc_df:
    cat_words = liwc_df[category].dropna().values
    cat_dict = lt.LiwcTrieNode('*')
    for word in cat_words:
        lt.add(cat_dict, word)
    liwc_cat_map[category] = cat_dict

liwc_df

Unnamed: 0,Funct,Pronoun,Ppron,I,We,You,SheHe,They,Ipron,Article,...,Work,Achiev,Leisure,Home,Money,Relig,Death,Assent,Nonflu,Filler
0,a,anybod*,hed,i,lets,thee,he,their*,anybod*,a,...,absent*,abilit*,actor*,address,account*,afterlife*,autops*,absolutely,er,blah
1,about,anyone*,he'd,Id,let's,thine,hed,them,anyone*,alot,...,academ*,able*,actress*,apartment*,atm,agnost*,alive,agree,hm*,idontknow
2,above,anything,her,I'd,our,thou,he'd,themselves,anything,an,...,accomplish*,accomplish*,aerobic*,backyard,atms,alla,bereave*,ah,sigh,imean
3,absolutely,everybod*,hers,I'll,ours,thoust,her,they,everybod*,the,...,achiev*,ace,amus*,bake*,auction*,allah*,burial*,alright*,uh,like
4,across,everyone*,herself,Im,ourselves,thy,hers,theyd,everyone*,,...,administrat*,achiev*,apartment*,baking,audit,altar*,buried,aok,um,ohwell
5,actually,everything*,hes,I'm,us,ya,herself,they'd,everything*,,...,advertising,acquir*,art,balcon*,audited,amen,bury,aw,umm*,rr*
6,after,hed,he's,ive,we,yall,hes,theyll,it,,...,advis*,acquisition*,artist*,bath*,auditing,amish,casket*,awesome,well,yakno*
7,again,he'd,him,I've,we'd,y'all,he's,they'll,itd,,...,agent,adequa*,arts,bed,auditor,angel,casualt*,cool,zz*,ykn*
8,against,her,himself,me,we'll,ye,him,theyve,it'd,,...,agents,advanc*,athletic*,bedding,auditors,angelic*,cemet*,duh,,youknow
9,ahead,hers,his,mine,we're,you,himself,they've,itll,,...,ambiti*,advantag*,ball,bedroom*,audits,angels,coffin*,ha,,
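
The `LiwcTrie` module imported above is not shown in this notebook. Many dictionary entries end in `*` (for example `achiev*`), which suggests prefix matching, so the sketch below shows one way a trie with wildcard stems could work. The class and function names here are hypothetical stand-ins, and the actual `LiwcTrieNode`, `add`, and `find` may behave differently.

```python
class TrieNode:
    """One node of a character trie (hypothetical stand-in for LiwcTrieNode)."""
    def __init__(self):
        self.children = {}
        self.is_word = False     # an exact entry ends at this node
        self.is_prefix = False   # a wildcard entry such as "achiev*" ends here

def add(root, entry):
    """Insert a dictionary entry; a trailing '*' marks a prefix wildcard."""
    wildcard = entry.endswith("*")
    node = root
    for ch in entry.rstrip("*").lower():
        node = node.children.setdefault(ch, TrieNode())
    if wildcard:
        node.is_prefix = True
    else:
        node.is_word = True

def find(root, word):
    """Return True if word equals an exact entry or starts with a wildcard stem."""
    node = root
    for ch in word.lower():
        if node.is_prefix:       # a wildcard stem has already been matched
            return True
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word or node.is_prefix

root = TrieNode()
for entry in ["achiev*", "able*", "ace"]:
    add(root, entry)
print(find(root, "achievement"), find(root, "ace"), find(root, "aces"))
```

A trie keeps each lookup proportional to the word length, regardless of how many entries the dictionary holds.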


In [21]:
# List of Words, no punctuation
def words(text):
    return re.sub("'", " ", re.sub("[^a-zA-Z'\\s]", ' ', text)).split()

# List of sentences, punctuation included except end marks (!, ?, .)
def sentences(text):
    sentences = re.compile("[!?.]+").split(text)
    return [sentence for sentence in sentences if sentence] # filter empty sentences

w = data.text.str.lower().apply(words) # Split text into lower case words
s = data.text.apply(sentences) # Split text into sentences

wc = w.apply(len) # word count
sc = s.apply(len) # sentence count
wps = s.apply(lambda sentences: [len(words(s)) for s in sentences]).apply(np.mean) # words per sentence
dic = w.apply(lambda words: np.count_nonzero([lt.find(liwc_dict, word) for word in words])) / wc # fraction of words found in the dictionary (0 to 1, not a percentage)
sixltr = w.apply(lambda words: [w for w in words if len(w) > 6]).apply(len) # count of words longer than 6 letters

liwc_cols = ['wc', 'sc', 'wps', 'dic', 'sixltr']
liwc_data = [wc, sc, wps, dic, sixltr]

for category in liwc_df:
    c = w.map(lambda words: [word for word in words if lt.find(liwc_cat_map[category], word)]).apply(len) / wc
    liwc_data.append(c)
    liwc_cols.append(category.lower())

In [22]:
pd.concat(liwc_data, axis=1, keys=liwc_cols)

Unnamed: 0,wc,sc,wps,dic,sixltr,funct,pronoun,ppron,i,we,...,work,achiev,leisure,home,money,relig,death,assent,nonflu,filler
0,102,5,20.4,0.901961,24,0.656863,0.294118,0.147059,0.009804,0.029412,...,0.196078,0.058824,0.088235,0.029412,0.058824,0.078431,0.009804,0.0,0.0,0.068627
