<a href="https://colab.research.google.com/github/Anushree-B/Lie-detector/blob/main/lie_detector_webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lie detector using a neural network

## Step 1: Web scraping
We will scrape the data from the politifact.com website.

The website contains fact-checked statements by many US politicians. A politician's statements are spread across many different pages, so we scrape the listing pages one by one.

## Importing libraries

In [None]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re

## Checking the response of the URL

In [None]:
url = r"https://www.politifact.com/factchecks/list/?ruling=mostly-true"
response = get(url)
print(response)

<Response [200]>


The response status code is 200, which means the request succeeded and the page can be fetched for scraping.
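The check above can be made explicit in code rather than read off the printed response. The following is a minimal sketch using only the standard library; the helper names `request_succeeded` and `describe_status` are hypothetical, and the integer passed in is assumed to come from `response.status_code`.

```python
from http import HTTPStatus


def request_succeeded(status_code: int) -> bool:
    """Return True for any 2xx status code (hypothetical helper)."""
    return 200 <= status_code < 300


def describe_status(status_code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


# In the notebook this would be called with response.status_code
for code in (200, 404):
    print(describe_status(code), "->", request_succeeded(code))
```

A check like this is useful inside a long scraping loop, where a failed page would otherwise be parsed silently as if it contained no statements.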

## Scraping the data according to truth meter categories

In total, there are 6 different ruling categories:

- true
- mostly-true
- half-true
- barely-true
- false
- pants-fire

For each category, 80 pages are scraped, i.e. 2400 statements (30 per page), so the dataset consists of 14400 statements in total.
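The arithmetic can be checked with a short sketch. It assumes 30 statements per listing page (inferred from 2400 / 80) and uses the same URL pattern as the scraping cell; the constant names are illustrative only.

```python
# Assumption: each listing page shows 30 statements (2400 / 80)
RULINGS = ["true", "mostly-true", "half-true", "barely-true", "false", "pants-fire"]
PAGES_PER_RULING = 80
STATEMENTS_PER_PAGE = 30

BASE_URL = "https://www.politifact.com/factchecks/list/?page={}&ruling={}"

# One URL per (ruling, page) pair
urls = [
    BASE_URL.format(page, ruling)
    for ruling in RULINGS
    for page in range(1, PAGES_PER_RULING + 1)
]

per_category = PAGES_PER_RULING * STATEMENTS_PER_PAGE
total = per_category * len(RULINGS)

print(len(urls), per_category, total)  # 480 2400 14400
```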

In [None]:
base_url = r"https://www.politifact.com/factchecks/list/?page={}&ruling={}"

politicians = []
quotes = []
ruling_labels = []  # ruling category of each statement

rulings = ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire']

for ruling in rulings:
    # Pages 1 to 80 of the listing for this ruling
    for page in range(1, 81):
        url = base_url.format(page, ruling)
        response = get(url)
        soup = BeautifulSoup(response.content, "html.parser")

        # Each fact-check on the page is an <article class="m-statement">
        for article in soup.find_all("article", class_="m-statement"):
            # The text of the first link is taken as the speaker's name
            politician = article.find("a").text.strip()
            quote = article.find("div", class_="m-statement__quote").text.strip()

            politicians.append(politician)
            quotes.append(quote)
            ruling_labels.append(ruling)

# Create a DataFrame
# Note: the "Image URL" column holds the ruling label, not an image URL
df = pd.DataFrame({
    "Politician": politicians,
    "Quote": quotes,
    "Image URL": ruling_labels
})

df.head()

| | Politician | Quote | Image URL |
|---|---|---|---|
| 0 | Tyler August | “Nearly 90% of all UW graduates stay in Wiscon... | True |
| 1 | Mark Pocan | “We passed 27 bills last year, which is the fe... | True |
| 2 | Lisa Subeck | “The United States is an outlier, one of only ... | True |
| 3 | Brian Schimming | “We’ve had 12 elections in 24 years in Wiscons... | True |
| 4 | Tammy Baldwin | “We’re facing situations these days where you ... | True |
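The scraping loop calls `.text` directly on the result of `find`, which raises an `AttributeError` if an article lacks a link or a quote block. The sketch below shows one possible way to make the per-page parsing tolerant of that. The function `extract_statements`, the `"Ruling"` key, and the sample HTML are all hypothetical and only imitate the structure the loop relies on (`article.m-statement` and `div.m-statement__quote`); the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating one listing page (not real site markup)
SAMPLE_HTML = """
<article class="m-statement">
  <a href="/personalities/jane-doe/">Jane Doe</a>
  <div class="m-statement__quote"><a href="/factchecks/1/">"Example claim one."</a></div>
</article>
<article class="m-statement">
  <div class="m-statement__quote">"Claim with no speaker link."</div>
</article>
"""


def extract_statements(html, ruling):
    """Return one dict per complete statement found in the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article", class_="m-statement"):
        name_tag = article.find("a")
        quote_tag = article.find("div", class_="m-statement__quote")
        # Skip articles that are missing either piece instead of crashing
        if name_tag is None or quote_tag is None:
            continue
        rows.append({
            "Politician": name_tag.get_text(strip=True),
            "Quote": quote_tag.get_text(strip=True),
            "Ruling": ruling,
        })
    return rows


rows = extract_statements(SAMPLE_HTML, "true")
print(rows)
```

Keeping the parsing in a function that takes HTML as a string also makes it possible to test the extraction without sending any requests.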


In [None]:
df.shape

(14400, 3)

In [None]:
df['Image URL'].value_counts()

Image URL
true           2400
mostly-true    2400
half-true      2400
barely-true    2400
false          2400
pants-fire     2400
Name: count, dtype: int64

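## Removing punctuation from the quotes

Every character that is not an ASCII letter is replaced with a space, so digits (such as the "90" in the first quote) are removed along with punctuation.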

In [None]:
df["Quote"] = df["Quote"].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

In [None]:
df.head()

| | Politician | Quote | Image URL |
|---|---|---|---|
| 0 | Tyler August | Nearly of all UW graduates stay in Wiscon... | True |
| 1 | Mark Pocan | We passed bills last year which is the fe... | True |
| 2 | Lisa Subeck | The United States is an outlier one of only ... | True |
| 3 | Brian Schimming | We ve had elections in years in Wiscons... | True |
| 4 | Tammy Baldwin | We re facing situations these days where you ... | True |
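The effect of the substitution is easy to see on a single string. The sketch below applies the same pattern and then, as an optional extra step that is not part of the original cell, collapses the resulting runs of spaces. The helper name `clean_quote` is hypothetical.

```python
import re


def clean_quote(text: str) -> str:
    """Replace non-letters with spaces, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()


print(clean_quote("“We’ve had 12 elections in 24 years”"))
# We ve had elections in years
```

Note that apostrophes are treated like any other punctuation, so contractions such as "We’ve" are split into "We ve", and numbers disappear entirely.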


## Saving the data to a CSV file

In [None]:
df.to_csv('politifact.csv')
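By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below illustrates the difference on a small made-up frame (the rows are placeholders, not scraped data) using in-memory buffers instead of a file; passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

# Placeholder rows for illustration only
df_demo = pd.DataFrame({
    "Politician": ["Jane Doe", "John Roe"],
    "Quote": ["Example claim one", "Example claim two"],
    "Image URL": ["half-true", "pants-fire"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```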
