<a href="https://colab.research.google.com/github/Anushree-B/Lie-detector/blob/main/lie_detector_webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lie detector using neural network

## Step 1: Web scraping
We will scrape the data from the politifact.com website.

The site publishes fact-checked statements by US politicians and other speakers. The statements are spread across many listing pages, so we scrape them page by page.

## Importing libraries

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

## Checking the response of the URL

In [2]:
url = r"https://www.politifact.com/factchecks/list/?ruling=mostly-true"
response = get(url)
print(response)

<Response [200]>


The status code 200 means the request succeeded and the page is reachable, so we can go ahead and scrape it.
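A status code only reports how the request went; it says nothing about the page content. As a minimal sketch (the helper name `describe_status` is hypothetical and not part of the notebook), the standard library can turn a code into a readable summary:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not registered in the standard library's table
        phrase = "Unknown status"

    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational or non-standard"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```

In the scraping code itself, calling `response.raise_for_status()` from `requests` is a shorter way to stop on any 4xx or 5xx response.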

## Scraping the data according to truth meter categories

In total, there are 6 different ruling categories:

- true
- mostly-true
- half-true
- barely-true
- false
- pants-fire

For each category, 80 pages are scraped, i.e. 2400 statements (30 per page), so in total the dataset consists of 14400 statements.
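The arithmetic above can be checked with a short sketch. The URL template and ruling names are taken from the notebook; the value of 30 statements per page is an assumption inferred from 2400 / 80, not something the site guarantees.

```python
BASE_URL = "https://www.politifact.com/factchecks/list/?page={}&ruling={}"
RULINGS = ["true", "mostly-true", "half-true", "barely-true", "false", "pants-fire"]
PAGES_PER_RULING = 80
STATEMENTS_PER_PAGE = 30  # assumption: inferred from 2400 statements / 80 pages

# One URL per (ruling, page) pair; nothing is downloaded here
urls = [
    BASE_URL.format(page, ruling)
    for ruling in RULINGS
    for page in range(1, PAGES_PER_RULING + 1)
]

per_ruling = PAGES_PER_RULING * STATEMENTS_PER_PAGE
total = per_ruling * len(RULINGS)

print(len(urls))    # number of pages to request
print(per_ruling)   # statements per ruling
print(total)        # statements overall
```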

In [3]:
base_url = r"https://www.politifact.com/factchecks/list/?page={}&ruling={}"

politicians = []
quotes = []
image_urls = []

rulings = ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire']

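def fetch_data(ruling, page):
    """Scrape one listing page and return its speakers, quotes, and ruling labels."""
    url = base_url.format(page, ruling)
    response = get(url, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")
    local_politicians = []
    local_quotes = []
    local_image_urls = []  # holds the ruling label of each statement, not an image URL

    # Each fact-check on the page is an <article class="m-statement"> element
    for article in soup.find_all("article", class_="m-statement"):
        politician = article.find("a").text.strip()
        quote = article.find("div", class_="m-statement__quote").text.strip()

        local_politicians.append(politician)
        local_quotes.append(quote)
        local_image_urls.append(ruling)

    return local_politicians, local_quotes, local_image_urls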

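# Download pages concurrently: 6 rulings x 80 pages = 480 tasks on 4 worker threads
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = []
    for ruling in rulings:
        for page in range(1, 81):
            futures.append(executor.submit(fetch_data, ruling, page))

    # Collect results as they finish (completion order, not submission order)
    for future in tqdm(as_completed(futures)):
        try:
            local_politicians, local_quotes, local_image_urls = future.result()
            politicians.extend(local_politicians)
            quotes.extend(local_quotes)
            image_urls.extend(local_image_urls)
        except Exception as e:
            # A failed page is reported and skipped; the remaining pages still load
            print(f"An error occurred: {e}")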

# Create a DataFrame; note that the "Image URL" column actually stores the ruling label
df = pd.DataFrame({
    "Politician": politicians,
    "Quote": quotes,
    "Image URL": image_urls
})

df.head()

480it [02:07,  3.75it/s]


|   | Politician | Quote | Image URL |
|---|------------|-------|-----------|
| 0 | Melissa Agard | Potato chips, KitKat bars and Viagra are not t... | True |
| 1 | Melissa Agard | “Since 1981, the state Senate has only rejecte... | True |
| 2 | Instagram posts | “Lego donates model MRI kits to hospitals to h... | True |
| 3 | Brian Krassenstein | It has been U.S. policy “for at least 79 years... | True |
| 4 | Mike Oliverio | “A few years back,” Mitchell Stadium in Bluefi... | True |
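The thread-pool pattern used in the scraping cell can be tried without touching the network. The sketch below is illustrative only: `fake_fetch` is a hypothetical stand-in for a page download, and the sleep times are chosen just to make tasks finish out of order. It shows that `as_completed` yields results in completion order, which is why every row carries its own ruling label.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fake_fetch(ruling, page):
    """Hypothetical stand-in for a page download; returns labelled rows."""
    time.sleep(0.01 * (3 - page))  # page 2 finishes before page 1
    return [(f"speaker-{ruling}-{page}", f"quote {i}", ruling) for i in range(2)]


rows = []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its arguments so failures can be reported precisely
    futures = {
        executor.submit(fake_fetch, ruling, page): (ruling, page)
        for ruling in ["true", "false"]
        for page in range(1, 3)
    }
    for future in as_completed(futures):
        ruling, page = futures[future]
        try:
            rows.extend(future.result())
        except Exception as exc:
            print(f"{ruling} page {page} failed: {exc}")

print(len(rows))
```

Keeping a dictionary from future to `(ruling, page)` is one possible design choice; it makes it easy to log or retry exactly the page that failed.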


In [4]:
df.shape

(14400, 3)

In [5]:
df['Image URL'].value_counts()

Image URL
true           2400
mostly-true    2400
half-true      2400
barely-true    2400
false          2400
pants-fire     2400
Name: count, dtype: int64

## Removing punctuation from the quotes

In [6]:
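# Replace every character that is not an ASCII letter (punctuation, digits, curly quotes) with a space
df["Quote"] = df["Quote"].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
# Trim leading/trailing whitespace and convert to lowercase
df["Quote"] = df["Quote"].apply(lambda x: x.strip().lower())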
df.head()

|   | Politician | Quote | Image URL |
|---|------------|-------|-----------|
| 0 | Melissa Agard | potato chips kitkat bars and viagra are not t... | True |
| 1 | Melissa Agard | since the state senate has only rejected... | True |
| 2 | Instagram posts | lego donates model mri kits to hospitals to he... | True |
| 3 | Brian Krassenstein | it has been u s policy for at least years... | True |
| 4 | Mike Oliverio | a few years back mitchell stadium in bluefie... | True |
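The regular expression `[^a-zA-Z]` replaces every non-letter with a space, so digits disappear along with punctuation and runs of spaces are left behind. The sketch below reproduces that cleaning step on a made-up sample string; the `clean_quote` helper and its `collapse_spaces` option are hypothetical additions, not part of the notebook.

```python
import re


def clean_quote(text: str, collapse_spaces: bool = False) -> str:
    """Keep only ASCII letters, trim, and lowercase; optionally squeeze spaces."""
    cleaned = re.sub(r"[^a-zA-Z]", " ", text).strip().lower()
    if collapse_spaces:
        # Optional extra step: turn runs of whitespace into a single space
        cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned


sample = 'In 2020, "taxes" rose 5%!'  # made-up example text

print(repr(clean_quote(sample)))
print(repr(clean_quote(sample, collapse_spaces=True)))
```

If numbers matter for the downstream model, the character class could be widened to `[^a-zA-Z0-9]`; whether that helps is a modelling decision this notebook does not settle.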


## Saving the data to a CSV file

In [7]:
df.to_csv('Data/politifact.csv',index=False)
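Writing to `Data/politifact.csv` presumes the `Data` folder already exists. A minimal sketch of creating it first is shown below; `ensure_parent_dir` is a hypothetical helper, and the temporary directory is used only so the example leaves no files behind.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(csv_path: str) -> Path:
    """Create the folder that will contain csv_path (if missing) and return the path."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(f"{tmp}/Data/politifact.csv")
    parent_exists = target.parent.is_dir()

print(parent_exists)
```

In the notebook this would be used as `df.to_csv(ensure_parent_dir('Data/politifact.csv'), index=False)`.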