
# Lie detector using neural network

## Step 2: Sentence Analysis

Building on the analysis by Victor Palacios (https://github.com/Victor-Palacios/Deception_Detection_Capstone), this notebook analyzes sentences for our project on detecting deception in political speech. We'll examine 14,400 labeled quotes (true, false, etc.) to identify linguistic patterns that might differentiate truth from lies.

Here's how we'll dissect these quotes:

- **Word Breakdown**: We'll calculate the character count, word count, and average word length of each quote to gauge how concise or complex a statement is.
- **Part-of-Speech Distribution**: We'll count the verbs, adverbs, adjectives, nouns, and determiners in each quote. This can reveal patterns in how politicians express information (e.g., a high noun count for factual claims, frequent adverbs for softening statements).
- **Sentiment Analysis**: We'll score the emotional tone (positive, negative, neutral) of each quote. This can highlight potential biases or manipulative language.
- **Vocabulary Exploration**: We'll explore the specific words used, identifying keywords or phrases that might correlate with truthfulness.

The aim is to build a dataset containing all of the above features for each quote, using natural language processing (NLP).

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import re
import nltk

In [None]:
df = pd.read_csv("politifact.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Politician,Quote,Image URL
0,0,Tyler August,Nearly of all UW graduates stay in Wiscon...,True
1,1,Mark Pocan,We passed bills last year which is the fe...,True
2,2,Lisa Subeck,The United States is an outlier one of only ...,True
3,3,Brian Schimming,We ve had elections in years in Wiscons...,True
4,4,Tammy Baldwin,We re facing situations these days where you ...,True


## Adding character count, word count and average word length of Quote

In [None]:
# Vectorized string operations replace the row-by-row chained assignment
# (df['col'][i] = ...), which pandas flags with a SettingWithCopyWarning.
# Casting to float matches the values displayed in the tables below (e.g. 79.0).
df['char_count'] = df['Quote'].str.len().astype(float)
df['word_count'] = df['Quote'].str.split().str.len().astype(float)
df['word_length'] = df['Quote'].apply(lambda quote: np.mean([len(word) for word in quote.split()]))

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length
0,0,Tyler August,Nearly of all UW graduates stay in Wiscon...,True,79.0,12.0,4.916667
1,1,Mark Pocan,We passed bills last year which is the fe...,True,73.0,12.0,4.583333
2,2,Lisa Subeck,The United States is an outlier one of only ...,True,160.0,29.0,4.37931
3,3,Brian Schimming,We ve had elections in years in Wiscons...,True,99.0,16.0,4.25
4,4,Tammy Baldwin,We re facing situations these days where you ...,True,144.0,25.0,4.6
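
As a sanity check on the three definitions, here is a minimal pure-Python sketch applied to a made-up quote. The helper name `surface_features` and the example sentence are illustrative assumptions, not part of the notebook. Note that `char_count` includes spaces, so it is not simply `word_count` times `word_length`.

```python
def surface_features(quote):
    """Return (char_count, word_count, mean word length) for one quote."""
    words = quote.split()        # split on any run of whitespace
    char_count = len(quote)      # counts every character, spaces included
    word_count = len(words)
    # Guard against empty quotes, where a mean is undefined
    word_length = sum(len(w) for w in words) / word_count if words else 0.0
    return char_count, word_count, word_length


# Hypothetical example quote (not taken from politifact.csv)
print(surface_features("We passed ten bills last year"))  # (29, 6, 4.0)
```

The 24 letters spread over 6 words give a mean word length of 4.0, while the 5 separating spaces bring the character count to 29.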


## Finding the number of adverbs, adjectives, nouns, verbs and determiners in Quote

In [None]:
nltk.download('averaged_perceptron_tagger') #used for tagging words with their parts of speech (POS)
nltk.download('wordnet')

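n_rows = len(df)  # derive the size from the data instead of hard-coding 14400
adv_count = np.zeros(n_rows)
adj_count = np.zeros(n_rows)
noun_count = np.zeros(n_rows)
verb_count = np.zeros(n_rows)
det_count = np.zeros(n_rows)

for i, quote in enumerate(df['Quote']):
  # Split on whitespace, then tag each token with its Penn Treebank POS tag
  tagged_words = nltk.pos_tag(quote.split())
  for word, tag in tagged_words:
    if tag.startswith('RB'):    # adverbs: RB, RBR, RBS
      adv_count[i] += 1
    elif tag.startswith('JJ'):  # adjectives: JJ, JJR, JJS
      adj_count[i] += 1
    elif tag.startswith('NN'):  # nouns: NN, NNS, NNP, NNPS
      noun_count[i] += 1
    elif tag.startswith('VB'):  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
      verb_count[i] += 1
    elif tag.startswith('DT'):  # determiners: DT
      det_count[i] += 1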

df['adv_count'] = adv_count
df['adj_count'] = adj_count
df['noun_count'] = noun_count
df['verb_count'] = verb_count
df['det_count'] = det_count

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length,adv_count,adj_count,noun_count,verb_count,det_count
0,0,Tyler August,Nearly of all UW graduates stay in Wiscon...,True,79.0,12.0,4.916667,1.0,0.0,4.0,2.0,1.0
1,1,Mark Pocan,We passed bills last year which is the fe...,True,73.0,12.0,4.583333,0.0,2.0,3.0,2.0,2.0
2,2,Lisa Subeck,The United States is an outlier one of only ...,True,160.0,29.0,4.37931,1.0,3.0,9.0,3.0,4.0
3,3,Brian Schimming,We ve had elections in years in Wiscons...,True,99.0,16.0,4.25,0.0,1.0,4.0,5.0,0.0
4,4,Tammy Baldwin,We re facing situations these days where you ...,True,144.0,25.0,4.6,0.0,1.0,8.0,5.0,3.0
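
The prefix matching in the tagging loop groups several Penn Treebank tags under one part of speech (for example, `NN`, `NNS`, `NNP` and `NNPS` all count as nouns). The sketch below shows that grouping on a hand-tagged sentence, so it runs without downloading the NLTK tagger. The helper `count_pos`, the `PREFIX_TO_POS` mapping, and the tagged tokens are illustrative assumptions; the notebook itself obtains its tags from `nltk.pos_tag`.

```python
from collections import Counter

# Tag prefix -> feature name, mirroring the if/elif chain in the notebook
PREFIX_TO_POS = {"RB": "adv", "JJ": "adj", "NN": "noun", "VB": "verb", "DT": "det"}


def count_pos(tagged_words):
    """Count tokens per part of speech from (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_words:
        for prefix, pos in PREFIX_TO_POS.items():
            if tag.startswith(prefix):
                counts[pos] += 1
                break  # each token is counted at most once
    return counts


# Hand-written tags for illustration, not produced by the tagger
tagged = [("The", "DT"), ("senator", "NN"), ("quickly", "RB"),
          ("passed", "VBD"), ("new", "JJ"), ("bills", "NNS")]
counts = count_pos(tagged)
print(dict(counts))
```

Tags such as `WRB` (wh-adverbs like "when") or `PRP` (pronouns) match none of the prefixes and are left uncounted, just as in the notebook's loop.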


In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon (if not already downloaded)
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
# Create the VADER analyzer once and reuse it, rather than rebuilding it for every quote
analyzer = SentimentIntensityAnalyzer()

def sentiment_score(text):
  # Compound score: overall sentiment, normalized to the range [-1, 1]
  return analyzer.polarity_scores(text)["compound"]

def named_entities(text):
  # Create a spaCy document
  doc = nlp(text)
  # Extract named entities and their labels (PERSON, ORG, etc.)
  entities = [(entity.text, entity.label_) for entity in doc.ents]
  return entities

In [None]:
df["sentiment"] = df['Quote'].apply(sentiment_score)
# Count the named entities in each quote; the entity text itself is not kept as a column
df['named_entities_count'] = df['Quote'].apply(lambda x: len(named_entities(x)))

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Politician,Quote,Image URL,char_count,word_count,word_length,adv_count,adj_count,noun_count,verb_count,det_count,sentiment,named_entities_count
0,0,Tyler August,Nearly of all UW graduates stay in Wiscon...,True,79.0,12.0,4.916667,1.0,0.0,4.0,2.0,1.0,0.0,2
1,1,Mark Pocan,We passed bills last year which is the fe...,True,73.0,12.0,4.583333,0.0,2.0,3.0,2.0,2.0,-0.5719,2
2,2,Lisa Subeck,The United States is an outlier one of only ...,True,160.0,29.0,4.37931,1.0,3.0,9.0,3.0,4.0,0.6199,3
3,3,Brian Schimming,We ve had elections in years in Wiscons...,True,99.0,16.0,4.25,0.0,1.0,4.0,5.0,0.0,0.0,1
4,4,Tammy Baldwin,We re facing situations these days where you ...,True,144.0,25.0,4.6,0.0,1.0,8.0,5.0,3.0,-0.0772,2
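
The `sentiment` column stores VADER's compound score, a single number between -1 and 1. To recover the positive/negative/neutral labels mentioned in the introduction, a common convention (suggested in VADER's documentation) is to treat scores of at least 0.05 as positive and at most -0.05 as negative. The helper `sentiment_label` and its `threshold` parameter below are assumptions for illustration; the notebook itself keeps only the raw score.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores taken from the first three rows of the table above
for score in (0.0, -0.5719, 0.6199):
    print(score, sentiment_label(score))
```

Keeping the continuous score as the feature and deriving labels only when needed preserves more information for a downstream model.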


In [None]:
df.to_csv("politifact_updated.csv", index=False)  # index=False avoids writing the row index as an extra "Unnamed: 0" column
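
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables above are most likely leftovers from earlier saves: when a DataFrame is written with its index and read back, pandas gives the nameless index column the label `Unnamed: 0`. A minimal sketch of the effect, using a hypothetical two-row frame and an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({"Quote": ["Taxes went up", "Crime went down"],
                    "sentiment": [0.0, 0.0]})

# Default to_csv() writes the row index as a first, unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'Quote', 'sentiment']
print(list(without_index.columns))  # ['Quote', 'sentiment']
```

Passing `index=False` when saving, or `index_col=0` when reading, keeps these extra columns from accumulating across notebooks.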