<a href="https://colab.research.google.com/github/Anushree-B/Lie-detector/blob/main/lie_detector_sentence_analysis_(NLP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lie detector using neural network

## Step 2: Sentence Analysis

Building on the analysis by Victor Palacios (https://github.com/Victor-Palacios/Deception_Detection_Capstone), this notebook analyzes sentences for our project on detecting deception in political speech. We'll dissect 14,400 labeled quotes (true, false, etc.) to identify linguistic patterns that might differentiate truth from lies.

Here's how we'll dissect these quotes:

- **Word Breakdown**: We'll calculate the character count, word count, and average word length of each quote to gauge how concise or complex a statement is.
- **Part-of-Speech Distribution**: We'll analyze the frequency of verbs, adverbs, adjectives, and nouns. This can reveal patterns in how politicians express information (e.g., high noun count for factual claims, frequent adverbs for softening statements).
- **Sentiment Analysis**: We'll score the emotional tone of each quote, from negative through neutral to positive. This can highlight potential biases or manipulative language.
- **Vocabulary Exploration**: We'll explore the specific words used, identifying keywords or phrases that might correlate with truthfulness.

The aim is to build a dataset containing all of the above features for each quote, using natural language processing (NLP).
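The target dataset can be pictured as one record per quote. The sketch below is only an illustration: the class name `QuoteFeatures` is hypothetical (the notebook itself keeps everything in a pandas DataFrame), the field names follow the feature columns this notebook produces, and the sample values are copied from the row with index 3 in the final `df.head()` output.

```python
from dataclasses import dataclass, asdict


@dataclass
class QuoteFeatures:
    """Hypothetical record type mirroring the feature columns built in this notebook."""
    char_count: int
    word_count: int
    word_length: float          # average word length
    adv_count: float
    adj_count: float
    noun_count: float
    verb_count: float
    det_count: float
    sentiment: float            # VADER compound score
    named_entities_count: int


# Values copied from the row with index 3 in the notebook's final df.head() output.
example = QuoteFeatures(
    char_count=53, word_count=9, word_length=4.777778,
    adv_count=0.0, adj_count=0.0, noun_count=5.0, verb_count=1.0, det_count=1.0,
    sentiment=0.0, named_entities_count=1,
)

# List the feature names in column order
print(list(asdict(example)))
```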

## Importing libraries

In [11]:
import numpy as np
import pandas as pd
import re
import nltk
from tqdm import tqdm

In [7]:
df = pd.read_csv("Data/politifact.csv")
df.head()

Unnamed: 0,Politician,Quote,Image URL
0,Kamala Harris,black women in the u s are three to four tim...,True
1,Byron Donalds,kamala harris co sponsored fully sponsored ...,True
2,David Crowley,under the biden administration we have witne...,True
3,Tony Evers,wisconsin had a record breaking year for tou...,True
4,Social Media,under federal law former president donald tru...,True


## Adding character count, word count and average word length of Quote

In [8]:
# Number of characters in each quote (spaces included)
df['char_count'] = df['Quote'].apply(len)
# Number of whitespace-separated words in each quote
df['word_count'] = df['Quote'].apply(lambda x: len(x.split()))
# Average length of the words in each quote
df['word_length'] = df['Quote'].apply(lambda x: np.mean([len(word) for word in x.split()]))
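To make the three length features concrete, here is a minimal pure-Python sketch for a single quote. The helper name `length_features` and the sample quote are hypothetical and not taken from the dataset. Unlike the `np.mean` call above, which yields `nan` for a quote with no words, this sketch returns `0.0` in that case; that choice is an assumption, not the notebook's behavior.

```python
def length_features(quote: str) -> dict:
    """Return character count, word count and mean word length for one quote."""
    words = quote.split()
    return {
        "char_count": len(quote),   # spaces are included
        "word_count": len(words),
        # Guard against quotes with no words, where a mean is undefined
        "word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


# Hypothetical quote, not taken from the dataset
print(length_features("the city cut taxes twice last year"))
# {'char_count': 34, 'word_count': 7, 'word_length': 4.0}
```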

In [9]:
df.head()

Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length
0,Kamala Harris,black women in the u s are three to four tim...,True,114,22,4.136364
1,Byron Donalds,kamala harris co sponsored fully sponsored ...,True,64,10,5.2
2,David Crowley,under the biden administration we have witne...,True,123,19,5.263158
3,Tony Evers,wisconsin had a record breaking year for tou...,True,53,9,4.777778
4,Social Media,under federal law former president donald tru...,True,105,17,5.117647


## Finding the number of adverbs, adjectives, nouns, verbs and determiners in Quote

In [12]:
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag to tag parts of speech (POS)
nltk.download('wordnet')

n_quotes = len(df)  # 14400 quotes in this dataset
adv_count = np.zeros(n_quotes)
adj_count = np.zeros(n_quotes)
noun_count = np.zeros(n_quotes)
verb_count = np.zeros(n_quotes)
det_count = np.zeros(n_quotes)

for i in tqdm(range(n_quotes)):
  words = df['Quote'].iloc[i].split()
  tagged_words = nltk.pos_tag(words)
  # Penn Treebank tag prefixes: RB = adverb, JJ = adjective, NN = noun, VB = verb, DT = determiner
  for word, tag in tagged_words:
    if tag.startswith('RB'):
      adv_count[i] += 1
    elif tag.startswith('JJ'):
      adj_count[i] += 1
    elif tag.startswith('NN'):
      noun_count[i] += 1
    elif tag.startswith('VB'):
      verb_count[i] += 1
    elif tag.startswith('DT'):
      det_count[i] += 1

df['adv_count'] = adv_count
df['adj_count'] = adj_count
df['noun_count'] = noun_count
df['verb_count'] = verb_count
df['det_count'] = det_count
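The tag-prefix logic in the loop above can be isolated into a small helper that works on already tagged words. This is a sketch under stated assumptions: `PREFIX_TO_COLUMN` and `count_pos` are hypothetical names, and the tags in the example are hand-written for illustration rather than produced by `nltk.pos_tag`.

```python
from collections import Counter

# Tag prefix -> feature column, checked in the same order as the if/elif chain above
PREFIX_TO_COLUMN = [
    ("RB", "adv_count"),
    ("JJ", "adj_count"),
    ("NN", "noun_count"),
    ("VB", "verb_count"),
    ("DT", "det_count"),
]


def count_pos(tagged_words):
    """Count adverbs, adjectives, nouns, verbs and determiners in (word, tag) pairs."""
    counts = Counter({column: 0 for _, column in PREFIX_TO_COLUMN})
    for _, tag in tagged_words:
        for prefix, column in PREFIX_TO_COLUMN:
            if tag.startswith(prefix):
                counts[column] += 1
                break  # each word is counted at most once
    return dict(counts)


# Hand-written tags for illustration only; real tags would come from nltk.pos_tag
tagged = [("the", "DT"), ("city", "NN"), ("quickly", "RB"), ("cut", "VBD"),
          ("high", "JJ"), ("taxes", "NNS"), ("in", "IN")]
print(count_pos(tagged))
```

Words whose tag matches none of the prefixes (the preposition tagged `IN` here) are simply not counted, which matches the loop above.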

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
100%|██████████| 14400/14400 [00:20<00:00, 718.08it/s]


In [13]:
df.head()

Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length,adv_count,adj_count,noun_count,verb_count,det_count
0,Kamala Harris,black women in the u s are three to four tim...,True,114,22,4.136364,1.0,4.0,6.0,2.0,1.0
1,Byron Donalds,kamala harris co sponsored fully sponsored ...,True,64,10,5.2,1.0,2.0,1.0,4.0,1.0
2,David Crowley,under the biden administration we have witne...,True,123,19,5.263158,0.0,5.0,4.0,3.0,2.0
3,Tony Evers,wisconsin had a record breaking year for tou...,True,53,9,4.777778,0.0,0.0,5.0,1.0,1.0
4,Social Media,under federal law former president donald tru...,True,105,17,5.117647,2.0,2.0,8.0,1.0,0.0

## Adding sentiment score and named entity count of Quote


In [14]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon (if not already downloaded)
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [15]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [16]:
# Create the VADER analyzer once and reuse it for every quote
analyzer = SentimentIntensityAnalyzer()

def sentiment_score(text):
  # Compound score: overall sentiment, from -1 (most negative) to +1 (most positive)
  sentiment = analyzer.polarity_scores(text)
  return sentiment["compound"]

def named_entities(text):
  # Create a spaCy document
  doc = nlp(text)
  # Extract named entities and their labels (PERSON, ORG, etc.)
  entities = [(entity.text, entity.label_) for entity in doc.ents]
  return entities

In [17]:
df["sentiment"] = df['Quote'].apply(sentiment_score)
df["named_entities"] = df['Quote'].apply(lambda text: ", ".join([str(entity) for entity in named_entities(text)]))
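The `sentiment` column holds VADER's compound score rather than a positive/negative/neutral label. If a label is wanted, one possible mapping is sketched below. The function name `sentiment_label` is hypothetical, and the ±0.05 cut-offs are the commonly used VADER convention, assumed here rather than taken from this notebook. The sample scores are the ones that appear in the `df.head()` output.

```python
def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse tone label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores taken from the notebook's df.head() output
for score in (-0.6326, 0.0, 0.3818):
    print(score, sentiment_label(score))
```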

In [18]:
# Count the named entities found in each quote
df['named_entities_count'] = df['Quote'].apply(lambda x: len(named_entities(x)))

In [19]:
# Keep only the entity count; drop the intermediate string column
df = df.drop(['named_entities'], axis = 1)

In [20]:
df.head()

Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length,adv_count,adj_count,noun_count,verb_count,det_count,sentiment,named_entities_count
0,Kamala Harris,black women in the u s are three to four tim...,True,114,22,4.136364,1.0,4.0,6.0,2.0,1.0,-0.6326,1
1,Byron Donalds,kamala harris co sponsored fully sponsored ...,True,64,10,5.2,1.0,2.0,1.0,4.0,1.0,0.0,1
2,David Crowley,under the biden administration we have witne...,True,123,19,5.263158,0.0,5.0,4.0,3.0,2.0,0.3818,1
3,Tony Evers,wisconsin had a record breaking year for tou...,True,53,9,4.777778,0.0,0.0,5.0,1.0,1.0,0.0,1
4,Social Media,under federal law former president donald tru...,True,105,17,5.117647,2.0,2.0,8.0,1.0,0.0,-0.6908,1


In [21]:
df.to_csv("Data/politifact_updated.csv", index=False)