Functions for feature extraction, based on **Table 4** of **Zhou and Zafarani, 2020**. The semantic-level features are divided into the following broad categories:


---


1.   Quantity
2.   Complexity
3.   Uncertainty
4.   Subjectivity
5.   Non-immediacy
6.   Sentiment
7.   Diversity
8.   Informality
9.   Specificity
10.  Readability


---




Pipeline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Installing and Importing Libraries**

In [None]:
# Following are the required libraries for this notebook

! pip install textstat -q
! pip install lexical-diversity -q
! pip install spacy -q
! python -m spacy download en_core_web_sm -q
! pip install vaderSentiment -q
! pip install transformers -q

!python -m spacy download en_core_web_lg -q

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
# importing the necessary libraries

import textstat
import re

import pandas as pd
import numpy as np
import multiprocessing as mp

import nltk
nltk.download("punkt")

import string

from lexical_diversity import lex_div as ld

import spacy
from spacy.matcher import Matcher
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp_tagger = spacy.load('en_core_web_sm')
nlp_tagger.disable_pipes('parser', 'ner')

from spacy.lang.en import English
nlp_stop = English()

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

from tqdm.auto import tqdm
tqdm.pandas()

import en_core_web_lg 
nlp=en_core_web_lg.load()
matcher = Matcher(nlp.vocab)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# **Function Definitions**

In [None]:
# mapping of the labels to 0,1 
def label_map(x): 
  if x in ['true', 'mostly-true', 'half-true', 'real', 'Real', 0, 'REAL']:
    return 0
  elif x in ['false', 'pants-fire', 'barely-true', 'fake', 'Fake', 1, 'FAKE']:
    return 1
  else:
    return x
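
As a quick check of the mapping, the sketch below repeats `label_map` in compact form so that it runs on its own. The sample labels are hypothetical inputs, not rows from any of the datasets. Labels that appear in neither list are returned unchanged.

```python
# Compact restatement of label_map so this example is self-contained.
REAL_LABELS = ['true', 'mostly-true', 'half-true', 'real', 'Real', 0, 'REAL']
FAKE_LABELS = ['false', 'pants-fire', 'barely-true', 'fake', 'Fake', 1, 'FAKE']

def label_map(x):
    if x in REAL_LABELS:
        return 0
    if x in FAKE_LABELS:
        return 1
    return x  # unrecognized labels pass through unchanged

# Hypothetical sample labels
samples = ['mostly-true', 'pants-fire', 'REAL', 1, 'unknown']
mapped = [label_map(s) for s in samples]
print(mapped)
```

Because unknown labels pass through silently, it may be worth checking `value_counts()` on the mapped column before training.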

## **1) Quantity**

Quantity includes the following features:


*   Number of characters
*   Number of words
*   Number of Noun Phrases
*   Number of sentences
*   Number of paragraphs

Out of the above features, we are only interested in **characters, words and sentences**.

### **1.1) Number of characters**

Following is the function to calculate the number of characters in a text.

In [None]:
def num_chars(text):
  return len(text)

In [None]:
def url_count(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    urls=re.findall(url_pattern,text)
    return len(urls)

In [None]:
# remove urls
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
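
The two URL helpers share one regular expression. The sketch below applies that same pattern to a hypothetical sentence to show what is counted and what remains after removal; the sample text and the module-level `URL_PATTERN` name are assumptions made for illustration.

```python
import re

# Same pattern as in url_count / remove_urls, compiled once.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Hypothetical sample text
sample = "Read more at https://example.com/a?b=1 or www.example.org today."

urls = URL_PATTERN.findall(sample)      # every match counts as one URL
cleaned = URL_PATTERN.sub('', sample)   # matches are replaced by the empty string

print(len(urls))
print(repr(cleaned))
```

Removing a URL leaves the surrounding spaces in place, so the cleaned text contains double spaces; `remove_nonascii` collapses them later through its `split()` and `join()` step.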

In [None]:
def remove_nonascii(sent):
  return " ".join(("".join([i for i in sent if i.isascii()])).split())

In [None]:
def get_no_of_qn_marks(text,percentage=False):
  # percentage=True returns the share of "?" characters (0 for empty text)
  count = text.count("?")
  return (count/len(text) if text else 0) if percentage else count

### **1.2) Number of words**

This function calculates the number of words in the text, **excluding punctuation**.

In [None]:
def num_words(text):
  return textstat.lexicon_count(text, removepunct=True)

### **1.3) Number of sentences**

This function calculates the number of sentences in the text.

In [None]:
def num_sentences(text):
  return textstat.sentence_count(text)

## **2) Complexity**

Complexity includes the following features:

*   Average number of characters per word
*   Average number of words per sentence
*   Average number of clauses per sentence
*   Average number of punctuation marks per sentence

### **2.1) Average number of words per sentence**

The following function calculates the average number of words per sentence, i.e. (number of words/number of sentences)



In [None]:
# This function uses functions defined in the previous section.
def words_per_sentence(text):
  return float(num_words(text))/num_sentences(text)

### **2.2) Average number of characters per word**

This function calculates the average number of characters per word.

In [None]:
def characters_per_word(text):
  tokens = nltk.word_tokenize(text)
  nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
  filtered = [w for w in tokens if nonPunct.match(w)]
  return float(sum(map(len, filtered))) / len(filtered) if len(filtered)>0 else 0

### **2.3) Average number of punctuation marks per sentence**

This function calculates the average number of punctuation marks per sentence.

In [None]:
def punctuations_per_sentence(text):
  punc_count = sum([1 if char in string.punctuation else 0 for char in text])
  return punc_count / float(num_sentences(text))
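
To make the three complexity ratios concrete, here is a worked example on a hypothetical text using only the standard library. It is a sketch under stated assumptions: sentences are split naively at runs of `.`, `!` or `?`, and words are matched with a simple regular expression, whereas the functions above rely on `textstat` and `nltk`, which apply their own rules and may give different counts.

```python
import re
import string

# Hypothetical sample text
sample = "Prices rose again. Why? Nobody knows, yet!"

# Assumption: a sentence ends at a run of '.', '!' or '?'
sentences = [s for s in re.split(r'[.!?]+', sample) if s.strip()]
# Assumption: a word is a run of letters, digits or apostrophes
words = re.findall(r"[A-Za-z0-9']+", sample)
# Count characters that belong to string.punctuation
punct = sum(ch in string.punctuation for ch in sample)

words_per_sentence = len(words) / len(sentences)
chars_per_word = sum(len(w) for w in words) / len(words)
punct_per_sentence = punct / len(sentences)

print(len(sentences), len(words), punct)
print(words_per_sentence, chars_per_word, punct_per_sentence)
```

Here the text has 3 sentences, 7 words and 4 punctuation marks, so the ratios are 7/3, 32/7 and 4/3 respectively.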

## **3) Sentiment**

Sentiment includes the following features:

*  Percentage of Positive words
*  Percentage of Negative words
*  Number of Exclamation marks
*  Content Sentiment Polarity
*  Percentage of Anxiety/angry/sadness words

### **3.1) Percentage of Positive words**

This function calculates the percentage of positive words in the text.

The function compares each token against a word list taken from **Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews.", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.** 

In [None]:
# First we read the word list from drive
with open("/content/drive/Shareddrives/FYP - knk/word_lists/positive_words.txt") as f:
    positive_words = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
positive_words = [x.strip() for x in positive_words]

def positive(text):
  tokens = nltk.word_tokenize(text)
  nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
  filtered = [w for w in tokens if nonPunct.match(w)]

  count = 0
  for word in filtered:
    if word in positive_words:
      count+=1

  return (float(count)/len(filtered))*100  if len(filtered)>0 else 0

### **3.2) Percentage of Negative words**

This function calculates the percentage of negative words in the text.

The function compares each token against a word list taken from **Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews.", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.** 

In [None]:
# First we read the word list from drive 
with open("/content/drive/Shareddrives/FYP - knk/word_lists/negative_words.txt" ,encoding="utf-8" ,  errors="ignore") as f:
    negative_words = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
negative_words = [x.strip() for x in negative_words]

def negative(text):
  tokens = nltk.word_tokenize(text)
  nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
  filtered = [w for w in tokens if nonPunct.match(w)]

  count = 0
  for word in filtered:
    if word in negative_words:
      count+=1

  return (float(count)/len(filtered))*100  if len(filtered)>0 else 0
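
The lexicon lookup used by `positive` and `negative` can be sketched with the standard library alone. The version below is an illustration, not the notebook's implementation: the miniature word sets are hypothetical stand-ins for the Hu and Liu lists, a regular expression replaces `nltk.word_tokenize`, tokens are lowercased before lookup, and the lexicon is stored as a `set` so that each membership test is constant time rather than a scan of a list.

```python
import re

# Hypothetical miniature lexicons (the notebook loads the full lists from Drive)
positive_set = {"good", "great", "happy"}
negative_set = {"bad", "terrible", "sad"}

def lexicon_percentage(text, lexicon):
    # Assumption: simple regex tokenization with lowercasing
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return 100 * hits / len(tokens)

# Hypothetical sample text
sample = "Great news: the good weather made everyone happy, not sad."
pos_pct = lexicon_percentage(sample, positive_set)
neg_pct = lexicon_percentage(sample, negative_set)
print(pos_pct, neg_pct)
```

The sample has 10 tokens, 3 of which are in the positive set and 1 in the negative set. Note that a lookup of this kind ignores negation ("not sad" still counts as negative).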

### **3.3) Number of exclamation marks**

This function calculates the number of exclamation marks in the text.

In [None]:
def num_exclamation(text):
  tokens = nltk.word_tokenize(text)
  return len([w for w in tokens if w == "!"])

### **3.4) Content Sentiment Polarity**

This is calculated using [VaderSentiment](https://github.com/cjhutto/vaderSentiment).

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.

In [None]:
def get_sentiment_polarity(text):
  vader_scores = sentiment_analyzer.polarity_scores(text)
  return vader_scores['compound'] # Outputs something like this {'pos': 0.094, 'compound': -0.7042, 'neu': 0.579, 'neg': 0.327}

## **4) Diversity**

Diversity includes the following features:



*   Lexical diversity
*   Content word diversity
*   Redundancy
*   Unique Nouns/ Verbs/ Adjectives/ Adverbs



### **4.1) Lexical Diversity**

This function calculates the type-token ratio (**TTR**): the number of types (distinct words) in a text or utterance divided by its number of tokens (total words), returned here as a percentage.

This function uses the **lexical-diversity** package from PyPI.

In [None]:
'''
Lemmatize is set to False by default; set it to True to lemmatize the text 
before computing the TTR.

The lemmatizer is not part-of-speech specific ('run' as a noun and 'run' 
as a verb are treated as the same word). A part-of-speech-sensitive 
lemmatizer (e.g., spaCy's) is likely the better choice.
'''

def lexical_diversity(text, lemmatize = False):
  tokens = ld.flemmatize(text) if lemmatize else ld.tokenize(text)
  return ld.ttr(tokens) * 100
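
The TTR itself is simple arithmetic, shown here on a hypothetical sentence without the `lexical-diversity` package. Whitespace tokenization is an assumption made for this sketch; `ld.tokenize` applies its own preprocessing, so its counts can differ.

```python
# Hypothetical sample text
sample = "the cat sat on the mat and the dog sat too"

tokens = sample.split()   # assumption: whitespace tokenization
types = set(tokens)       # distinct words

ttr_percent = 100 * len(types) / len(tokens)
print(len(types), len(tokens), ttr_percent)
```

With 8 types among 11 tokens, the TTR is 800/11, about 72.7. Very short texts in which no word repeats score 100, and TTR tends to fall as texts get longer, so values are most comparable between texts of similar length.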

### **4.2) Content word diversity and Redundancy (Function words)**

In [None]:
def content_word_diversity_and_redundancy(text):

  text = re.sub(r'[^\w\s]', '', text)
  doc = nlp_stop(text)

  # Create list of word tokens
  token_list = []
  for token in doc:
      token_list.append(token.text)

  # Create list of word tokens after removing stopwords
  content_words, function_words =[], []
  for word in token_list:
      lexeme = nlp_stop.vocab[word]
      if lexeme.is_stop == False:
          content_words.append(word) 
      else:
          function_words.append(word)
  
  output = {
      'content_word_diversity': (float(len(list(set(content_words)))) / num_words(text)) * 100  if num_words(text)>0 else 0 ,
      'redundancy': (float(len(list(set(function_words)))) / num_words(text)) * 100 if num_words(text)>0 else 0 ,
  }
  
  return output
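
The same two ratios can be illustrated without spaCy. In the sketch below, the tiny `STOP` set is a hypothetical stand-in for spaCy's English stop-word list, and the text is lowercased and split on whitespace, which are assumptions made for this example. As in the function above, both ratios divide a count of unique words by the total word count.

```python
import re

# Hypothetical, tiny stop-word set (the notebook uses spaCy's list)
STOP = {"the", "of", "to", "and", "a", "in", "will"}

def content_and_function_ratio(text):
    # Strip punctuation as the notebook does, then lowercase and split
    words = re.sub(r'[^\w\s]', '', text).lower().split()
    if not words:
        return {'content_word_diversity': 0, 'redundancy': 0}
    content = {w for w in words if w not in STOP}   # unique content words
    function = {w for w in words if w in STOP}      # unique function words
    return {
        'content_word_diversity': 100 * len(content) / len(words),
        'redundancy': 100 * len(function) / len(words),
    }

result = content_and_function_ratio(
    "The growth of sales will lead to the closing of stores."
)
print(result)
```

The sample has 11 words, 5 unique content words and 4 unique function words. Because repeated words are counted once in the numerators but every time in the denominator, the two values need not sum to 100.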

In [None]:
text="The pandemic sped up the shift to online shopping, and the continued growth of e-commerce sales will lead to more stores shutting down after the pandemic ends, UBS retail analysts predicted in a recent report. The report estimates that around 80,000 stores will close over the next five years. They also believe the number of US malls will also decline over the same period."

### **4.3) Percentage of unique nouns/verbs/adjectives/adverbs**

This function calculates the percentage of unique nouns, verbs, adjectives and adverbs and returns the results as a dictionary keyed by part-of-speech tag.

In [None]:
def nvaa(text):
  # Regular expression to take out the punctuations
  text = re.sub(r'[^\w\s]', '', text)

  doc = nlp_tagger(text)

  pos_tags = {
      'NOUN': [],
      'VERB': [],
      'ADJ': [],
      'ADV': [],
  }

  keys = pos_tags.keys()

  # Iterate over the tokens
  for token in doc:
      pos = token.pos_
      if pos in keys:
        pos_tags[pos].append(token.text)
  # print(pos_tags)

  output = {
      'NOUN': 0,
      'VERB': 0,
      'ADJ': 0,
      'ADV': 0
  }

  for key in output.keys():
    output[key] = (len(list(set(pos_tags[key]))) / float(num_words(text))) * 100 if num_words(text)>0 else 0

  return output

## **5) Subjectivity**

This attribute includes the following features:


*  Percentage of biased lexicons
*  Percentage of subjective verbs
*  Percentage of report verbs
*  Percentage of factive verbs



## **6) Uncertainty**

#### Modal verbs

In [None]:
def to_nlp_tags(text):
  # Parse the text once and return the spaCy doc together with its sentences
  doc=nlp(text)
  sents=list(doc.sents)
  return pd.Series([doc,sents], index=['doc', 'sents'])

In [None]:
# for spacy need to install below
# !python -m spacy download en_core_web_lg
# import en_core_web_lg 
# nlp=en_core_web_lg.load()
# matcher = Matcher(nlp.vocab)

def get_MD_verb_count(doc ,sents,percentage=True,lib="spacy"):
  # percentage=True returns modal verbs (tag "MD") per sentence for spaCy,
  # or per token for nltk; otherwise the raw count is returned.
  if lib=="spacy":
    count=0
    for sent in sents:
      for token in sent:
        if token.tag_ =="MD":
          count=count+1

    if percentage:
      return count/len(sents) if len(sents)>0 else 0
    return count

  else: ## nltk
    # Recover the raw text from the spaCy doc before tagging with nltk
    tok=nltk.word_tokenize(doc.text.lower())
    postags=nltk.pos_tag(tok)
    count=len([i[0] for i in postags if i[1]=="MD"])
    if percentage:
      return count/len(tok) if len(tok)>0 else 0
    return count

# text="You must obey the rules, or you will get punished"
# get_MD_verb_count(*to_nlp_tags(text))

#### Question marks

In [None]:
# def get_no_of_qn_marks(text,percentage=False):
#   return text.count("?")/len(text) if percentage else text.count("?")

#### Quantifiers

In [None]:
def count_quantifiers(doc,sents ,nlp=nlp ,matcher=matcher,percentage=True):
  # Counts tokens tagged "CD" (cardinal numbers); nlp and matcher are unused here.
  # percentage=True returns the count per sentence rather than the raw count.
  count=0
  for sent in sents:
    for token in sent:
      if token.tag_ =="CD":
        count=count+1

  if percentage:
    return count/len(sents) if len(sents)>0 else 0
  return count

In [None]:
# count_quantifiers(*to_nlp_tags("thirteen fifteen sixty three few many year old "))

## **7) Non-immediacy**

In [None]:
# Run the commented section below to download the spaCy model
# !python -m spacy download en_core_web_lg
# https://gist.github.com/armsp/30c2c1e19a0f1660944303cf079f831a
# import en_core_web_lg 
# nlp=en_core_web_lg.load()
# matcher = Matcher(nlp.vocab)
def count_passive(doc ,nlp=nlp ,matcher=matcher):
  # Pattern: passive subject, optional auxiliaries, passive auxiliary, past participle
  passive_rule = [{'DEP':'nsubjpass'},{'DEP':'aux','OP':'*'},{'DEP':'auxpass'},{'TAG':'VBN'}]
  if 'Passive' not in matcher:  # register the rule only once
    # spaCy v2 signature; spaCy v3 expects matcher.add('Passive', [passive_rule])
    matcher.add('Passive',None,passive_rule)
  matches = matcher(doc)
  return len(matches)

In [None]:
def non_immediacy(doc,sents):
  FPS = ["i", "me", "my", "mine", "myself"]  # first person singular
  FPP = ["we", "us", "our", "ours", "ourselves"]  # first person plural
  SPS = ["you", "your", "yours", "yourself"]  # second person singular
  # SPP = ["you", "your", "yours", "yourself"]  # second person plural
  TPS = ["he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself"]  # third person singular
  TPP = ["they", "them", "their", "theirs", "themselves"]  # third person plural

  dic={"fps":0,"fpp":0,"sps":0,"spp":0,"tps":0 ,"tpp":0}
  qt=False
  qtCount=0
  for sent in sents:
    for token in sent :
      # token.text excludes trailing whitespace, so the list lookups below can match
      strVal=token.text.lower()
      if (token.tag_=="``"):
        if qt:
          qtCount+=1
        qt=not qt
      if  strVal in FPS :
        dic["fps"]+=1
      elif strVal in FPP:
        dic["fpp"]+=1
      elif strVal in SPS:
        dic["sps"]+=1
      elif strVal in TPS:
        dic["tps"]+=1
      elif strVal in TPP:
        dic["tpp"]+=1
  dic["passive_count"]=count_passive(doc)
  dic["sentence_count"]=len(sents)
  dic["quated_text"]=qtCount
  return dic
  # return pd.Series([dic["passive_count"],dic["sentence_count"],dic["quated_text"],dic["fps"],dic["fpp"],dic["sps"],dic["tps"],dic["tpp"]], index=['passive_count', 'sentence_count', 'quated_text','fps','fpp','sps','tps','tpp'])
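
The pronoun tally in `non_immediacy` can be tried without loading a spaCy model. The sketch below is an illustration under assumptions: a regular expression stands in for spaCy tokenization, and the `GROUPS` and `PRONOUN_TO_GROUP` names are introduced here for the example. It uses a single dictionary lookup per token instead of a chain of list membership tests.

```python
import re
from collections import Counter

# Pronoun groups with the same keys as in non_immediacy
GROUPS = {
    "fps": ["i", "me", "my", "mine", "myself"],
    "fpp": ["we", "us", "our", "ours", "ourselves"],
    "sps": ["you", "your", "yours", "yourself"],
    "tps": ["he", "him", "his", "himself", "she", "her", "hers",
            "herself", "it", "its", "itself"],
    "tpp": ["they", "them", "their", "theirs", "themselves"],
}
# Reverse map: pronoun -> group key
PRONOUN_TO_GROUP = {p: g for g, plist in GROUPS.items() for p in plist}

def pronoun_counts(text):
    # Assumption: lowercase regex tokens replace spaCy tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(PRONOUN_TO_GROUP[t] for t in tokens if t in PRONOUN_TO_GROUP)
    return {g: counts.get(g, 0) for g in GROUPS}

# Hypothetical sample text
counts = pronoun_counts("We told them that I saw her car, and they thanked us.")
print(counts)
```

In the sample, "we" and "us" fall under `fpp`, "them" and "they" under `tpp`, "I" under `fps`, and "her" under `tps`.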

In [None]:
def uncertinity_and_non_immediacy(text):
  doc =nlp(text)
  sents=list(doc.sents)

  model_verbs=get_MD_verb_count(doc,sents)
  quantifiers=count_quantifiers(doc,sents)

  output=non_immediacy(doc,sents)
  output["model_verbs"]=model_verbs
  output["quantifiers"]=quantifiers

  return pd.Series(data = output)


# **Read DataFrame**



In [None]:
# data= {
#   'text':["“Durham’s documents show that Hillary Clinton hired people who hacked into Trump’s home and office computers” and “planted evidence, fabricated evidence connecting Trump to Russia.”" ,
#   "Says that President Joe Biden said Americans will start seeing “direct deposits in their bank accounts this weekend,” and that Medicare recipients will get back $2,880.",
#   "Study proves children’s hearts destroyed by COVID vaccine." ] ,"label":[1,1,1]
# }
# df=pd.DataFrame(data)

## Old (don't run)

In [None]:
# dfGossipR=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet/gossipcop_real.csv")
# dfGossipR["label"] = "Real"
# dfGossipR["source"] ="gossipcop"

# dfGossipF = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet/gossipcop_fake.csv")
# dfGossipF["label"] = "Fake"
# dfGossipF["source"]= "gossipcop"
# dfGossip=dfGossipF.append(dfGossipR,ignore_index = True)

# dfpoliR=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet/politifact_real.csv")
# dfpoliR["label"] = "Real"
# dfpoliR["source"] ="politifact"

# dfpoliF = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet/politifact_fake.csv")
# dfpoliF["label"] = "Fake"
# dfpoliF["source"]= "politifact"
# dfpoli=dfpoliF.append(dfpoliR,ignore_index = True)

# df = dfGossip.append(dfpoli,ignore_index = True)

In [None]:
# dfGossipR=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet Tweets/gossipcop_real_tweets.csv")
# dfGossipR["label"] = "Real"
# dfGossipR["source"] ="gossipcop"

# dfGossipF = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet Tweets/gossipcop_fake_tweets.csv")
# dfGossipF["label"] = "Fake"
# dfGossipF["source"]= "gossipcop"
# dfGossip=dfGossipF.append(dfGossipR,ignore_index = True)

# dfpoliR=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet Tweets/politifact_real_tweets.csv")
# dfpoliR["label"] = "Real"
# dfpoliR["source"] ="politifact"

# dfpoliF = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet Tweets/politifact_fake_tweets.csv")
# dfpoliF["label"] = "Fake"
# dfpoliF["source"]= "politifact"
# dfpoli=dfpoliF.append(dfpoliR,ignore_index = True)

# df = dfGossip.append(dfpoli,ignore_index = True)

In [None]:
# df.head()

In [None]:
# dffake = pd.read_csv("/content/drive/Shareddrives/FYP - knk/Datasets/ISOT/Fake.csv")
# dftrue = pd.read_csv("/content/drive/Shareddrives/FYP - knk/Datasets/ISOT/True.csv" )
# # dffake.head()
# dftrue.head()
# dftrue["label"]=0
# dffake["label"]=1
# df=dftrue.append(dffake,ignore_index = True)
# df.drop(["subject","date"],axis=1 , inplace=True)
# df = df.sample(frac=1).reset_index(drop=True)

In [None]:
# columns=["ID","label","statement","subject(s)","speaker","speaker_job_title","state","party","credit_history_count_barely_true","credit_history_count_false","credit_history_count_half_true","credit_history_count_mostly_true","credit_history_count_pants_on_fire","context"] 
# dftest = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/test.tsv", names=columns,sep="\t" )
# dftest["split"]="test"

# dftrain = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/train.tsv", names=columns,sep="\t" )
# dftrain["split"]="train"
# df=dftest.append(dftrain,ignore_index = True)

# dfvalid = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/valid.tsv", names=columns,sep="\t" )
# dfvalid["split"]="valid"
# df=df.append(dfvalid,ignore_index = True)

In [None]:
# CodaLab Covid
# dftrain = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/CodaLab Covid/Constraint_English_Train.csv")
# dftrain["split"]="train"

# dfvalid = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/CodaLab Covid/Constraint_English_Val .csv")
# dfvalid["split"]="val"
# dftest = pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/CodaLab Covid/english_test_with_labels.csv")
# dftest["split"]="test"
# df=dftest.append(dftrain,ignore_index = True)
# df=df.append(dfvalid,ignore_index = True)


In [None]:
# df.to_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/CodaLab Covid/Constraint_English_All.csv",index=False)

In [None]:
# dfFake=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Kaggle/Fake.csv")
# dfFake["label"]="Fake"
# dfTrue=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Kaggle/True.csv")
# dfTrue["label"]="True"
# df=dfFake.append(dfTrue,ignore_index = True)


In [None]:
# df["id"]=df.index
# df.to_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/Liar_all.csv",index=False)
# dff=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/Liar_all.csv")
# dff.head()

## New

In [None]:
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/CodaLab Covid/Constraint_English_All.csv")
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FakeNewsNet/FakeNewsNet_All.csv")
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/ISOT/ISOT.csv")
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Kaggle_real_fake/fake_or_real_news.csv")
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/LIAR/Liar_all.csv")
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/WELFake/WELFake.csv")
df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/FA-KES/FA-KES.csv")  # 1 - fake ,
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Politifact_test/Politifact_testset.csv")   
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/COVID19_test/COVID19_test.csv")

# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Kaggle/Kaggle.csv")
# columns= ['cred_label', 'claim_id', 'claim_text', 'claim_source', 'evidence', 'evidence_source' ]
# df=pd.read_csv("/content/drive/Shareddrives/[FYP] Fake News Detection/Datasets/Politifact/Politifact.tsv",names=columns,sep="\t")

In [None]:
unnamed=df.columns[df.columns.str.contains('unnamed',case = False)]
unnamed

Index([], dtype='object')

In [None]:
columns=df.columns
columns

Index(['unit_id', 'article_title', 'article_content', 'source', 'date',
       'location', 'label'],
      dtype='object')

In [None]:
df.drop(columns=unnamed,inplace=True)

In [None]:
# df["id"]=df.index

In [None]:
# df=df[ ["id", 'claim_id','cred_label', 'claim_text', 'claim_source', 'evidence', 'evidence_source' ]]

In [None]:
# df["split"].value_counts()

In [None]:
df.shape

(789, 7)

In [None]:
df.head()

Unnamed: 0,unit_id,article_title,article_content,source,date,location,label
0,1914947530,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,nna,4/5/2017,idlib,1
1,1914947532,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,nna,4/7/2017,homs,1
2,1914947533,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,nna,4/16/2017,aleppo,1
3,1914947534,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,nna,4/19/2017,aleppo,1
4,1914947535,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,nna,7/10/2016,aleppo,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789 entries, 0 to 788
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   unit_id          789 non-null    int64 
 1   article_title    789 non-null    object
 2   article_content  789 non-null    object
 3   source           789 non-null    object
 4   date             789 non-null    object
 5   location         789 non-null    object
 6   label            789 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 43.3+ KB



In [None]:
# df["text"]=df["claim_text"]
# df["text"]=df["statement"]
# df["text"]=df["title"]
# df["text"]=df["tweet"]
# df["text"]=df["article_content"]
df["text"]=df["article_title"]  # FA-KES: the outputs below show text equal to article_title

In [None]:
df.duplicated(subset='text', keep='first').sum()

15

In [None]:
len(df['text'])-len(df.dropna(subset=['text'], how='all'))

0

In [None]:
df = df.dropna(subset=['text'], how='all')
df = df.reset_index(drop=True)
df['text'] = df['text'].replace(np.nan, '', regex=True)
df = df.dropna(subset=['text'], how='all')
# df= df.drop_duplicates(subset=["text"])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 789 entries, 0 to 788
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   unit_id          789 non-null    int64 
 1   article_title    789 non-null    object
 2   article_content  789 non-null    object
 3   source           789 non-null    object
 4   date             789 non-null    object
 5   location         789 non-null    object
 6   label            789 non-null    int64 
 7   text             789 non-null    object
dtypes: int64(2), object(6)
memory usage: 55.5+ KB


In [None]:
df.label.value_counts()

0    418
1    371
Name: label, dtype: int64

# **Apply Functions**

In [None]:
# dgg=df["text"].progress_apply(uncertinity_and_non_immediacy)
# df=pd.concat([df,dgg],axis=1)

In [None]:
df["url_count"] = df["text"].progress_apply(url_count)
df["text_cleaned"] = df["text"].progress_apply(remove_urls)
df["text_cleaned"] = df["text_cleaned"].progress_apply(remove_nonascii)


In [None]:
df["qn_symbol"]=df["text"].progress_apply(get_no_of_qn_marks)

In [None]:
df["num_chars"]=df["text"].progress_apply(num_chars)
df["num_words"]=df["text"].progress_apply(num_words)
df["num_sentences"]=df["text"].progress_apply(num_sentences)
df["words_per_sentence"]=df["text"].progress_apply(words_per_sentence)
df["characters_per_word"]=df["text"].progress_apply(characters_per_word)
df["punctuations_per_sentence"]=df["text"].progress_apply(punctuations_per_sentence)


In [None]:
# count=0
# for i in df["text"]:
#   characters_per_word(i)
#   count+=1

In [None]:
# df["positive"]=df["text"].progress_apply(positive)
# df["negative"]=df["text"].progress_apply(negative)
df["num_exclamation"]=df["text"].progress_apply(num_exclamation)
df["get_sentiment_polarity"]=df["text"].progress_apply(get_sentiment_polarity)
df["lexical_diversity"]=df["text"].progress_apply(lexical_diversity)
df["content_word_diversity_and_redundancy"]=df["text"].progress_apply(content_word_diversity_and_redundancy)
df["nvaa"]=df["text"].progress_apply(nvaa)


In [None]:
df["content_word_diversity_and_redundancy"][0]

{'content_word_diversity': 77.77777777777779, 'redundancy': 22.22222222222222}

In [None]:
# df["content_word_diversity"]=df["content_word_diversity_and_redundancy"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][0])
df["content_word_diversity"]=df["content_word_diversity_and_redundancy"].progress_apply(lambda x: x["content_word_diversity"])

# df["redundancy"]=df["content_word_diversity_and_redundancy"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][1])
df["redundancy"]=df["content_word_diversity_and_redundancy"].progress_apply(lambda x: x["redundancy"])


In [None]:
df["nvaa"][0]
# df["noun"]=df["nvaa"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][0])
df["noun"]=df["nvaa"].progress_apply(lambda x: x["NOUN"])

# df["verb"]=df["nvaa"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][1])
# df["adj"]=df["nvaa"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][2])
# df["adv"]=df["nvaa"].progress_apply(lambda x: [i.split(":")[1] for i in x.strip("{").strip("}").split(",")][3])

df["verb"]=df["nvaa"].progress_apply(lambda x: x["VERB"])
df["adj"]=df["nvaa"].progress_apply(lambda x: x["ADJ"])
df["adv"]=df["nvaa"].progress_apply(lambda x: x["ADV"])


In [None]:
df.head()

Unnamed: 0,unit_id,article_title,article_content,source,date,location,label,text,url_count,text_cleaned,...,get_sentiment_polarity,lexical_diversity,content_word_diversity_and_redundancy,nvaa,content_word_diversity,redundancy,noun,verb,adj,adv
0,1914947530,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,nna,4/5/2017,idlib,1,Syria attack symptoms consistent with nerve ag...,0,Syria attack symptoms consistent with nerve ag...,...,-0.4767,100.0,"{'content_word_diversity': 77.77777777777779, ...","{'NOUN': 44.44444444444444, 'VERB': 0.0, 'ADJ'...",77.777778,22.222222,44.444444,0.0,22.222222,0.0
1,1914947532,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,nna,4/7/2017,homs,1,Homs governor says U.S. attack caused deaths b...,0,Homs governor says U.S. attack caused deaths b...,...,-0.6808,100.0,"{'content_word_diversity': 76.92307692307693, ...","{'NOUN': 30.76923076923077, 'VERB': 23.0769230...",76.923077,30.769231,30.769231,23.076923,15.384615,0.0
2,1914947533,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,nna,4/16/2017,aleppo,1,Death toll from Aleppo bomb attack at least 112,0,Death toll from Aleppo bomb attack at least 112,...,-0.8807,100.0,"{'content_word_diversity': 66.66666666666666, ...","{'NOUN': 44.44444444444444, 'VERB': 0.0, 'ADJ'...",66.666667,33.333333,44.444444,0.0,11.111111,11.111111
3,1914947534,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,nna,4/19/2017,aleppo,1,Aleppo bomb blast kills six Syrian state TV,0,Aleppo bomb blast kills six Syrian state TV,...,-0.7717,100.0,"{'content_word_diversity': 87.5, 'redundancy':...","{'NOUN': 50.0, 'VERB': 12.5, 'ADJ': 25.0, 'ADV...",87.5,12.5,50.0,12.5,25.0,0.0
4,1914947535,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,nna,7/10/2016,aleppo,1,29 Syria Rebels Dead in Fighting for Key Alepp...,0,29 Syria Rebels Dead in Fighting for Key Alepp...,...,-0.8225,100.0,"{'content_word_diversity': 80.0, 'redundancy':...","{'NOUN': 20.0, 'VERB': 0.0, 'ADJ': 10.0, 'ADV'...",80.0,20.0,20.0,0.0,10.0,0.0


In [None]:
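! pip install langdetect -q
from langdetect import detect
# Uncomment the next two lines to make langdetect's results reproducible across runs.
# from langdetect import DetectorFactory
# DetectorFactory.seed = 0

def detect_lang(text):
    """Return the detected language code, or 'error' if detection fails (e.g. on empty text)."""
    try:
        return detect(text)
    except Exception:
        return 'error'

df["lang"] = df["text"].progress_apply(detect_lang)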

  0%|          | 0/789 [00:00<?, ?it/s]

In [None]:
df.head()

Unnamed: 0,unit_id,article_title,article_content,source,date,location,label,text,url_count,text_cleaned,...,lexical_diversity,content_word_diversity_and_redundancy,nvaa,content_word_diversity,redundancy,noun,verb,adj,adv,lang
0,1914947530,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,nna,4/5/2017,idlib,1,Syria attack symptoms consistent with nerve ag...,0,Syria attack symptoms consistent with nerve ag...,...,100.0,"{'content_word_diversity': 77.77777777777779, ...","{'NOUN': 44.44444444444444, 'VERB': 0.0, 'ADJ'...",77.777778,22.222222,44.444444,0.0,22.222222,0.0,en
1,1914947532,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,nna,4/7/2017,homs,1,Homs governor says U.S. attack caused deaths b...,0,Homs governor says U.S. attack caused deaths b...,...,100.0,"{'content_word_diversity': 76.92307692307693, ...","{'NOUN': 30.76923076923077, 'VERB': 23.0769230...",76.923077,30.769231,30.769231,23.076923,15.384615,0.0,en
2,1914947533,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,nna,4/16/2017,aleppo,1,Death toll from Aleppo bomb attack at least 112,0,Death toll from Aleppo bomb attack at least 112,...,100.0,"{'content_word_diversity': 66.66666666666666, ...","{'NOUN': 44.44444444444444, 'VERB': 0.0, 'ADJ'...",66.666667,33.333333,44.444444,0.0,11.111111,11.111111,en
3,1914947534,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,nna,4/19/2017,aleppo,1,Aleppo bomb blast kills six Syrian state TV,0,Aleppo bomb blast kills six Syrian state TV,...,100.0,"{'content_word_diversity': 87.5, 'redundancy':...","{'NOUN': 50.0, 'VERB': 12.5, 'ADJ': 25.0, 'ADV...",87.5,12.5,50.0,12.5,25.0,0.0,en
4,1914947535,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,nna,7/10/2016,aleppo,1,29 Syria Rebels Dead in Fighting for Key Alepp...,0,29 Syria Rebels Dead in Fighting for Key Alepp...,...,100.0,"{'content_word_diversity': 80.0, 'redundancy':...","{'NOUN': 20.0, 'VERB': 0.0, 'ADJ': 10.0, 'ADV'...",80.0,20.0,20.0,0.0,10.0,0.0,en


In [None]:
df[df["lang"]!='en'].shape

(57, 29)
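
The count above flags 57 of the 789 headlines as non-English. langdetect's documentation warns that results on short or ambiguous text can be unstable, so some of these labels may be noise rather than real non-English rows. A minimal sketch, assuming a confidence threshold is acceptable here: `detect_lang_confident`, `min_prob`, and the injectable `detect_langs_fn` argument are hypothetical names, while `detect_langs` is langdetect's function that returns candidate languages with probabilities.

```python
def detect_lang_confident(text, min_prob=0.9, detect_langs_fn=None):
    """Return the most probable language code, 'unknown' when the
    detector is not confident enough, or 'error' when detection fails."""
    if detect_langs_fn is None:
        # Imported lazily so a stand-in detector can be passed in instead.
        from langdetect import detect_langs as detect_langs_fn
    try:
        candidates = detect_langs_fn(text)  # objects exposing .lang and .prob
    except Exception:
        return "error"
    if not candidates:
        return "unknown"
    best = max(candidates, key=lambda c: c.prob)
    # min_prob is an assumed cut-off, not a value taken from the notebook.
    return best.lang if best.prob >= min_prob else "unknown"
```

It could replace `detect_lang` in the `progress_apply` call, for example `df["lang"] = df["text"].progress_apply(detect_lang_confident)`; rows labelled `unknown` can then be inspected by hand instead of being treated as non-English.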

In [None]:
df.describe()

Unnamed: 0,unit_id,label,url_count,qn_symbol,num_chars,num_words,num_sentences,words_per_sentence,characters_per_word,punctuations_per_sentence,num_exclamation,get_sentiment_polarity,lexical_diversity,content_word_diversity,redundancy,noun,verb,adj,adv
count,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0,789.0
mean,1936169000.0,0.470215,0.0,0.002535,62.806084,10.268695,1.011407,10.205957,5.208366,0.411914,0.0,-0.665384,98.468246,81.334121,18.100557,23.781631,12.471205,7.652969,1.181199
std,18875390.0,0.499429,0.0,0.050315,14.816446,2.318278,0.106259,2.36267,0.603655,0.740959,0.0,0.263419,3.446035,11.087469,9.768798,15.347228,6.977814,8.004723,4.165951
min,1914948000.0,0.0,0.0,0.0,28.0,4.0,1.0,4.0,3.555556,0.0,0.0,-0.9601,83.333333,50.0,0.0,0.0,0.0,0.0,0.0
25%,1923848000.0,0.0,0.0,0.0,52.0,9.0,1.0,9.0,4.777778,0.0,0.0,-0.8316,100.0,75.0,11.111111,11.111111,9.090909,0.0,0.0
50%,1924058000.0,0.0,0.0,0.0,60.0,10.0,1.0,10.0,5.181818,0.0,0.0,-0.7096,100.0,81.818182,18.181818,25.0,11.111111,8.333333,0.0
75%,1962496000.0,1.0,0.0,0.0,72.0,12.0,1.0,12.0,5.571429,1.0,0.0,-0.6124,100.0,88.888889,25.0,33.333333,16.666667,11.111111,0.0
max,1965511000.0,1.0,0.0,1.0,136.0,21.0,2.0,21.0,7.5,4.0,0.0,0.4767,100.0,114.285714,44.444444,66.666667,50.0,44.444444,22.222222


In [None]:
df["qn_symbol_per_sentence"]=df["qn_symbol"]/df["num_sentences"]
df["num_exclamation_per_sentence"]=df["num_exclamation"]/df["num_sentences"]
df["url_count_per_sentence"]=df["url_count"]/df["num_sentences"]
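
In this run `num_sentences` is never below 1 (see the `min` row of `df.describe()`), so the divisions above are safe. For a dataset with empty texts the sentence count could be 0. A minimal sketch of a guarded version follows; `per_sentence` and the toy `demo` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def per_sentence(df, count_col, sentence_col="num_sentences"):
    """Divide a count column by the sentence count.

    Rows with zero sentences yield NaN instead of inf.
    """
    sentences = df[sentence_col].replace(0, np.nan)
    return df[count_col] / sentences

# Toy data: the last row has no sentences at all.
demo = pd.DataFrame({"qn_symbol": [1, 0, 2], "num_sentences": [1, 2, 0]})
demo["qn_symbol_per_sentence"] = per_sentence(demo, "qn_symbol")
```

The NaN rows can then be filled or dropped explicitly, which keeps the choice visible instead of hiding it inside the division.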

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 789 entries, 0 to 788
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   unit_id                                789 non-null    int64  
 1   article_title                          789 non-null    object 
 2   article_content                        789 non-null    object 
 3   source                                 789 non-null    object 
 4   date                                   789 non-null    object 
 5   location                               789 non-null    object 
 6   label                                  789 non-null    int64  
 7   text                                   789 non-null    object 
 8   url_count                              789 non-null    int64  
 9   text_cleaned                           789 non-null    object 
 10  qn_symbol                              789 non-null    int64  
 11  num_ch

In [None]:
df.head()

Unnamed: 0,unit_id,article_title,article_content,source,date,location,label,text,url_count,text_cleaned,...,content_word_diversity,redundancy,noun,verb,adj,adv,lang,qn_symbol_per_sentence,num_exclamation_per_sentence,url_count_per_sentence
0,1914947530,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,nna,4/5/2017,idlib,1,Syria attack symptoms consistent with nerve ag...,0,Syria attack symptoms consistent with nerve ag...,...,77.777778,22.222222,44.444444,0.0,22.222222,0.0,en,0.0,0.0,0.0
1,1914947532,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,nna,4/7/2017,homs,1,Homs governor says U.S. attack caused deaths b...,0,Homs governor says U.S. attack caused deaths b...,...,76.923077,30.769231,30.769231,23.076923,15.384615,0.0,en,0.0,0.0,0.0
2,1914947533,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,nna,4/16/2017,aleppo,1,Death toll from Aleppo bomb attack at least 112,0,Death toll from Aleppo bomb attack at least 112,...,66.666667,33.333333,44.444444,0.0,11.111111,11.111111,en,0.0,0.0,0.0
3,1914947534,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,nna,4/19/2017,aleppo,1,Aleppo bomb blast kills six Syrian state TV,0,Aleppo bomb blast kills six Syrian state TV,...,87.5,12.5,50.0,12.5,25.0,0.0,en,0.0,0.0,0.0
4,1914947535,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,nna,7/10/2016,aleppo,1,29 Syria Rebels Dead in Fighting for Key Alepp...,0,29 Syria Rebels Dead in Fighting for Key Alepp...,...,80.0,20.0,20.0,0.0,10.0,0.0,en,0.0,0.0,0.0


In [None]:
# df.describe()

# Save

In [None]:
df.columns

Index(['unit_id', 'article_title', 'article_content', 'source', 'date',
       'location', 'label', 'text', 'url_count', 'text_cleaned', 'qn_symbol',
       'num_chars', 'num_words', 'num_sentences', 'words_per_sentence',
       'characters_per_word', 'punctuations_per_sentence', 'num_exclamation',
       'get_sentiment_polarity', 'lexical_diversity',
       'content_word_diversity_and_redundancy', 'nvaa',
       'content_word_diversity', 'redundancy', 'noun', 'verb', 'adj', 'adv',
       'lang', 'qn_symbol_per_sentence', 'num_exclamation_per_sentence',
       'url_count_per_sentence'],
      dtype='object')

In [None]:
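# Uncomment the line that matches the dataset being processed.
# `to_drop` must be defined before the drop cell below runs; FA-KES (the dataset in this run) has no entry here.
# to_drop=["tweet","text_cleaned"] # CodaLab Covid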
# to_drop=["news_url","tweet_ids","text_cleaned"] # FakeNewsNet
# to_drop=["title","text_cleaned"]  #kaggle_real_fake
# to_drop=["subject","title","date","text_cleaned"] #isot
# to_drop=[] #welfake
# to_drop=['subject(s)', 'speaker',"speaker_job_title","state","party","text","credit_history_count_pants_on_fire","context"]  #LIAR
# to_drop = ["author","source","date","text_cleaned","text"] #politifact_test
# to_drop=['title', 'text', 'subcategory'] # covid_test
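
Switching datasets by commenting lines in and out is easy to get wrong. One possible alternative, sketched under assumptions, is a lookup table keyed by dataset. The column lists are copied from the commented lines above; the dictionary keys, `TO_DROP`, and `drop_dataset_columns` are hypothetical names.

```python
import pandas as pd

# Column lists copied from the commented-out lines; keys are hypothetical labels.
TO_DROP = {
    "codalab_covid": ["tweet", "text_cleaned"],
    "fakenewsnet": ["news_url", "tweet_ids", "text_cleaned"],
    "kaggle_real_fake": ["title", "text_cleaned"],
    "isot": ["subject", "title", "date", "text_cleaned"],
    "welfake": [],
    "liar": ["subject(s)", "speaker", "speaker_job_title", "state", "party",
             "text", "credit_history_count_pants_on_fire", "context"],
    "politifact_test": ["author", "source", "date", "text_cleaned", "text"],
    "covid_test": ["title", "text", "subcategory"],
}

def drop_dataset_columns(df, dataset):
    """Return a copy of df without the columns registered for `dataset`."""
    if dataset not in TO_DROP:
        raise KeyError(f"no drop list registered for dataset {dataset!r}")
    # Default errors="raise" surfaces a misspelled or missing column name.
    return df.drop(columns=TO_DROP[dataset])
```

Returning a copy rather than using `inplace=True` leaves the original frame untouched, so the cell can be rerun without failing on columns that are already gone.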

In [None]:
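# FA-KES identifies rows by `unit_id` (see df.columns above); other datasets in this notebook use `id`.
df["unit_id"].is_unique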

In [None]:
df.head()

In [None]:
df.drop(columns=to_drop , inplace=True)

In [None]:
path="/content/drive/Shareddrives/[FYP] Fake News Detection/Results/FA-KES/FA-KES_title_sementic.csv"

In [None]:
df.to_csv(path,index=False)

In [None]:
dff = pd.read_csv(path)
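
Two of the saved columns, `content_word_diversity_and_redundancy` and `nvaa`, hold Python dictionaries. CSV stores them as their string representation, so after `pd.read_csv` they come back as plain strings. A minimal sketch of restoring them follows, using an in-memory buffer and a toy frame (both hypothetical) in place of the Drive path.

```python
import ast
import io

import pandas as pd

# Toy frame mimicking the dict-valued `nvaa` column.
demo = pd.DataFrame({
    "unit_id": [1, 2],
    "nvaa": [{"NOUN": 50.0, "VERB": 12.5}, {"NOUN": 20.0, "VERB": 0.0}],
})

# Round-trip through CSV, as df.to_csv / pd.read_csv do with the real file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

type_after_read = type(restored["nvaa"].iloc[0])  # the dict is now a str

# ast.literal_eval safely parses the string back into a dict.
restored["nvaa"] = restored["nvaa"].apply(ast.literal_eval)
```

Since the notebook already expands these dictionaries into separate numeric columns (`noun`, `verb`, `adj`, `adv`, and so on), dropping the dictionary columns before saving may be the simpler option.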

In [None]:
dff.head()

Unnamed: 0,label,id,url_count,text_cleaned,qn_symbol,num_chars,num_words,num_sentences,words_per_sentence,characters_per_word,...,content_word_diversity,redundancy,noun,verb,adj,adv,lang,qn_symbol_per_sentence,num_exclamation_per_sentence,url_count_per_sentence
0,1,0,0,FACEBOOK DELETES MICHIGAN ANTI-LOCKDOWN GROUP ...,0,66,8,1,8.0,7.375,...,87.5,12.5,25.0,0.0,0.0,0.0,de,0.0,0.0,0.0
1,0,1,0,Other Viewpoints: COVID-19 is worse than the flu,0,48,8,1,8.0,5.0,...,50.0,50.0,12.5,0.0,25.0,0.0,en,0.0,0.0,0.0
2,0,2,0,Bermuda's COVID-19 cases surpass 100,0,36,5,1,5.0,5.333333,...,100.0,0.0,40.0,40.0,0.0,0.0,ca,0.0,0.0,0.0
3,1,3,0,Purdue University says students face 'close to...,0,143,24,1,24.0,4.958333,...,66.666667,20.833333,25.0,16.666667,4.166667,4.166667,en,0.0,0.0,0.0
4,1,4,0,THE HIGH COST OF LOCKING DOWN AMERICA: WEVE SE...,0,109,20,1,20.0,3.863636,...,55.0,35.0,25.0,10.0,15.0,0.0,en,0.0,0.0,0.0
