# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focus on acquiring raw data, processing data, cleaning and exploratory data analysis, and structured representation and storage of data.

# Submission Requirements

### Jupyter Notebook (.ipynb file) implementing the assignment. 

### PDF printout of the executed Jupyter Notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from the last assignment

I use pandas to import the data.

In [3]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

### Determine which article types should be omitted, if any.

First, I look at which categories we have.

In [4]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

Unnamed: 0,types
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political
9,


I keep every type except NaN and unknown.

In [5]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
df['type'] = df['type'].apply(lambda x: x if x in wanted_types else None)
df = df.dropna(subset=['type'])
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

Unnamed: 0,0
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political


### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

I chose to group the categories that do not contain outright misinformation as reliable.

In [6]:
reliable_types = ['clickbait', 'reliable', 'political', 'bias', 'hate']
fake_types = ['unreliable', 'fake', 'conspiracy', 'junksci']

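# Collapse each original label into one of the two target classes.
# The loop variable is named `label` to avoid shadowing the built-in `type`.
for label in reliable_types:
    df['type'] = df['type'].replace(label, 'reliable')

for label in fake_types:
    df['type'] = df['type'].replace(label, 'fake')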
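
The same grouping can also be written as a single lookup table, which keeps the decision for every type in one place. The following is a minimal sketch in plain Python; `LABEL_MAP`, `to_binary_label`, and `sample_types` are names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical lookup table mirroring the grouping chosen above.
LABEL_MAP = {
    "clickbait": "reliable",
    "reliable": "reliable",
    "political": "reliable",
    "bias": "reliable",
    "hate": "reliable",
    "unreliable": "fake",
    "fake": "fake",
    "conspiracy": "fake",
    "junksci": "fake",
}


def to_binary_label(article_type):
    """Return 'reliable' or 'fake', or None for types that were omitted."""
    return LABEL_MAP.get(article_type)


# Illustrative input, not taken from the dataset.
sample_types = ["hate", "junksci", "reliable", "unknown"]
binary = [to_binary_label(t) for t in sample_types]
print(binary)
```

With pandas, the equivalent call would be `df['type'].map(LABEL_MAP)`, which leaves unmapped types as NaN so that they can be dropped afterwards.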

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

We count the number of articles in each category.

In [7]:
reliable_amount = 0
fake_amount = 0
for label in df['type']:
    if label == 'reliable':
        reliable_amount += 1
    else:
        fake_amount += 1

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
# Double quotes inside the f-string: reusing single quotes is a syntax error before Python 3.12.
print(f'Reliable percentage: {reliable_amount / len(df["type"]) * 100:.2f}%')

Reliable amount: 34
Fake amount: 198
Reliable percentage: 14.66%


We see that our sample contains only 14.66% reliable articles, so the dataset is not balanced.

I would expect a balanced distribution to be important: with enough examples of each class, the model has sufficient data to train on, and we avoid misclassifications caused by a lack of data for the minority class.
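
To make the imbalance concrete, the sketch below recomputes the shares from the counts printed above (34 reliable, 198 fake) and adds two illustrative numbers: the accuracy of a model that always predicts the majority class, and inverse-frequency class weights. The weight formula `total / (n_classes * count)` is one common heuristic and is an assumption here, not something the assignment prescribes.

```python
from collections import Counter

# Rebuild the label column from the counts reported above.
labels = ["reliable"] * 34 + ["fake"] * 198

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the sample.
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())

# Inverse-frequency weights: rarer classes get a larger weight.
weights = {label: total / (len(counts) * count) for label, count in counts.items()}

for label in counts:
    print(f"{label}: {shares[label] * 100:.2f}% of articles, weight {weights[label]:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline * 100:.2f}%")
```

The baseline shows why accuracy alone is misleading on this sample: a model that labels everything as fake is already correct for most articles without having learned anything about reliable ones.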

# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install selenium (pip install selenium). Create a new Jupyter Notebook and import this module:

In [8]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By 
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import pandas as pd
import time

I had trouble getting Selenium to run on large amounts of data, so I also chose to use BeautifulSoup to scrape the data.

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [9]:
browser = webdriver.Firefox()
browser.get('https://www.bbc.com/news/world/europe')

## 3. Extract Articles:

### Selenium allows us to extract information easily. You can read the documentation here. Write a function to extract all articles from the page using the find_elements method. For each article, retrieve the headline, the summary, and the link to the article (href). (Hint: You might have to inspect the page for classes, ids, etc. that might be useful for extracting the information)

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: See buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: Click on button using .click(). You may also have to add a delay for the page to load using time.sleep(n_seconds))

I use CLASS_NAME to locate elements in the HTML.

In [10]:
def url_finder():
    urls = [url.get_attribute('href') for url in browser.find_elements(By.CLASS_NAME, 'sc-2e6baa30-0')  ]
    urls = [url for url in urls if url.find('news/articles') != -1]
    return urls

print(f'URL amount: {len(url_finder())}')
url_finder()

URL amount: 26


['https://www.bbc.com/news/articles/cx2gg8le1kpo',
 'https://www.bbc.com/news/articles/c3e44qev1dvo',
 'https://www.bbc.com/news/articles/cn527pz54neo',
 'https://www.bbc.com/news/articles/c2kggzqy0x7o',
 'https://www.bbc.com/news/articles/c0eggy1104po',
 'https://www.bbc.com/news/articles/c3e44qev1dvo',
 'https://www.bbc.com/news/articles/czxnnzz558eo',
 'https://www.bbc.com/news/articles/cd655917g6qo',
 'https://www.bbc.com/news/articles/cn527pz54neo',
 'https://www.bbc.com/news/articles/c0q1188p1n2o',
 'https://www.bbc.com/news/articles/cq5zzn81229o',
 'https://www.bbc.com/news/articles/cx2rreg04dpo',
 'https://www.bbc.com/news/articles/cpq222rqv4po',
 'https://www.bbc.com/news/articles/cn7vg0nvzkko',
 'https://www.bbc.com/news/articles/clyj3j2r25wo',
 'https://www.bbc.com/news/articles/cg4kkv3e1v9o',
 'https://www.bbc.com/news/articles/cgr2n8xx5gyo',
 'https://www.bbc.com/news/articles/c17qe11wy52o',
 'https://www.bbc.com/news/articles/c8j0e7ye4m4o',
 'https://www.bbc.com/news/arti

In [11]:
def headline_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('h1').text
    return headline

headlines = [headline_finder(url) for url in url_finder()]

print(f'Headlines amount: {len(headlines)}')
headlines

Headlines amount: 26


['Greeks hold mass protests demanding justice after train tragedy',
 'Minister questions why Tate brothers were allowed to leave Romania',
 'What we know about US-Ukraine minerals deal',
 'Jailed Kurdish leader issues call to lay down arms',
 'Free trade deal with India could come this year - EU Commission chief',
 'Minister questions why Tate brothers were allowed to leave Romania',
 'Dozens arrested in global hit against AI-generated child abuse',
 "Pope's health improving as he remains in hospital",
 'What we know about US-Ukraine minerals deal',
 'Austrian centrists agree government deal sidelining far right',
 'Thirty-two people deported to Georgia on Irish chartered flight',
 "North Korea has sent more troops to Russia, South's spy agency says",
 'Tate brothers arrive in US after Romania prosecutors lift travel ban',
 'Zelensky to meet Trump in Washington to sign minerals deal',
 'Congratulations to insults... Trump and Zelensky, in their own words',
 'Why are the Tate brothers i

In [12]:
def summary_finder(link):
    response = requests.get(link)
    # The first <p> element on the article page is used as the summary.
    summary = BeautifulSoup(response.content, 'html.parser').find('p').text
    return summary

summaries = [summary_finder(url) for url in url_finder()]

print(f'Summary amount: {len(summaries)}')
summaries

Summary amount: 26


['Greeks are holding their biggest protests for years and taking part in a general strike to mark the second anniversary of a rail disaster that left 57 dead and dozens more injured.',
 "Romania's Justice Minister Radu Marinescu has called for a public explanation into why controversial social media influencers Andrew and Tristan Tate were allowed to leave the country on Thursday.",
 "Ukraine's President Volodymyr Zelensky will meet US President Trump in Washington on Friday to sign an agreement that would give the US access to its deposits of rare earth minerals.",
 'Abdullah Ocalan, the jailed leader of the outlawed Kurdish group PKK, has called on his movement to lay down its arms and dissolve itself.',
 'The head of the European Commission, Ursula von der Leyen said EU and India were pushing to get a free trade agreement during this year.',
 "Romania's Justice Minister Radu Marinescu has called for a public explanation into why controversial social media influencers Andrew and Tris

## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: US & Canada, UK, Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 800 article links.

We click through the consent dialogs to accept cookies. We then visit each section page and collect all the links found in the HTML.

In [13]:
WebDriverWait(browser, 25).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_1192447")))
WebDriverWait(browser, 25).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'I agree')]"))).click()

browser.switch_to.default_content()
# click() returns None, so there is nothing useful to assign here.
browser.find_element(By.ID, "bbccookies-continue-button").click()

In [14]:
subsites = ["https://www.bbc.com/news/us-canada", 
    "https://www.bbc.com/news/world/africa", 
    "https://www.bbc.com/news/world/asia", 
    "https://www.bbc.com/news/world/europe", 
    "https://www.bbc.com/news/world/latin_america", 
    "https://www.bbc.com/news/world/middle_east", 
    "https://www.bbc.com/news/world/australia",
    "https://www.bbc.com/news/uk"]

In [15]:
def url_finder_extended(driver):
    # Collect every href on the current page, then keep only article links.
    urls = [link.get_attribute('href') for link in driver.find_elements(By.XPATH, "//a[@href]")]
    wanted_urls = [url for url in urls if url.find("/news/articles") != -1]
    print(wanted_urls)
    return wanted_urls
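
The filtering step can be tested without a browser by separating it from the Selenium call. The sketch below is one possible variant, not the notebook's method: it checks the URL path with `urllib.parse` instead of searching the whole string, skips missing hrefs, and removes duplicates while keeping the original order. The helper names `is_article_link` and `unique_article_links` are hypothetical.

```python
from urllib.parse import urlparse


def is_article_link(href):
    """True if the href points at an article page (path starts with /news/articles/)."""
    if not href:
        return False  # skip None or empty href values
    return urlparse(href).path.startswith("/news/articles/")


def unique_article_links(hrefs):
    """Keep article links only, dropping duplicates but preserving first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if is_article_link(h)))


# Example input: two article links (one repeated), a section page, and a missing href.
hrefs = [
    "https://www.bbc.com/news/articles/cx2gg8le1kpo",
    "https://www.bbc.com/news/world/europe",
    None,
    "https://www.bbc.com/news/articles/c3e44qev1dvo",
    "https://www.bbc.com/news/articles/cx2gg8le1kpo",
]
links = unique_article_links(hrefs)
print(links)
```

Deduplicating at this point matters because the same article is often linked from several places on a section page, as the repeated entries in the earlier output show.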

In [16]:
output_urls = []
page_amount = 17
for url in subsites:
    browser.get(url)
    for i in range(1, page_amount):
        try:
            output_urls.extend(url_finder_extended(browser))
            nextPage = browser.find_element(By.XPATH, f"//button[contains(text(),'{i}')]")
            nextPage.click()
        except Exception as e:
            print(f"Error for: {url}. {e}")
            break

['https://www.bbc.com/news/articles/cqlyy1rld0ko', 'https://www.bbc.com/news/articles/cdell8n14x2o', 'https://www.bbc.com/news/articles/clydd7zeye7o', 'https://www.bbc.com/news/articles/cqlyy1rld0ko', 'https://www.bbc.com/news/articles/cn9v1l80350o', 'https://www.bbc.com/news/articles/c62zzd3zp50o', 'https://www.bbc.com/news/articles/cdell8n14x2o', 'https://www.bbc.com/news/articles/cn7vxlrvxyeo', 'https://www.bbc.com/news/articles/cvgee7rl24ro', 'https://www.bbc.com/news/articles/c1kjj032d8do', 'https://www.bbc.com/news/articles/cly22wdedqeo', 'https://www.bbc.com/news/articles/cyvee9rpdq6o', 'https://www.bbc.com/news/articles/cedll3282qzo', 'https://www.bbc.com/news/articles/c170l0n8j54o', 'https://www.bbc.com/news/articles/c4gm9851559o', 'https://www.bbc.com/news/articles/cd7ev3wygl4o', 'https://www.bbc.com/news/articles/ce9882yv2nyo', 'https://www.bbc.com/news/articles/cn9v1l80350o', 'https://www.bbc.com/news/articles/clydd7zeye7o', 'https://www.bbc.com/news/articles/cy7xxlr2pggo',

In [17]:
unique_urls = list(dict.fromkeys(output_urls))
print(f'Urls found: {len(unique_urls)}')

Urls found: 671


## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).

In [18]:
pd.DataFrame(unique_urls, columns=['urls']).to_csv('bbc_urls.csv', index=False)

# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

In [20]:
def content_finder(url):
    response = requests.get(url)
    paragraphs = BeautifulSoup(response.content, 'html.parser').find_all('p')
    main_text = ' '.join([p.text for p in paragraphs])
    main_text = main_text.replace('\n', ' ')
    return main_text

content = [content_finder(url) for url in url_finder()]

print(f'Content amount: {len(content)}')
content

Content amount: 26


['Greeks are holding their biggest protests for years and taking part in a general strike to mark the second anniversary of a rail disaster that left 57 dead and dozens more injured. "I am here in memory of the people who were killed in the train crash. We demand justice," said 13-year-old Dimitris who had come with his father Petros Polyzos to the largest rally in Greece, in Syntagma Square in downtown Athens. It was during the night of 28 February 2023 that a passenger train packed with students collided head-on with a goods train near the Tempi gorge in central Greece. An inquiry concluded on Thursday that the accident was caused by human error, poor maintenance, and inadequate staffing. The report by Greece\'s Air and Rail Accident Investigation Authority warned that the safety failings exposed by the crash had not yet been addressed. "Those children were killed because the train was not safe," said the authority\'s chief Christos Papadimitriou. The Tempi disaster shocked Greeks wi

## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.

I use the same method as before.

In [21]:
def publish_date_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    date = soup.find('time')['datetime']
    return date

publish_dates = [publish_date_finder(url) for url in url_finder()]

print(f'Publish date amount: {len(publish_dates)}')
publish_dates

Publish date amount: 26


['2025-02-28T11:49:16.825Z',
 '2025-02-28T15:57:24.153Z',
 '2025-02-28T11:53:08.423Z',
 '2025-02-27T17:06:59.458Z',
 '2025-02-28T08:38:02.812Z',
 '2025-02-28T15:57:24.153Z',
 '2025-02-28T11:59:58.404Z',
 '2025-02-28T09:14:14.982Z',
 '2025-02-28T11:53:08.423Z',
 '2025-02-27T14:07:08.796Z',
 '2025-02-28T10:49:17.280Z',
 '2025-02-27T12:54:52.742Z',
 '2025-02-27T21:19:11.619Z',
 '2025-02-27T02:03:10.229Z',
 '2025-02-28T10:54:01.631Z',
 '2025-02-27T22:08:14.612Z',
 '2025-02-27T16:02:47.092Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-28T17:19:56.316Z',
 '2025-02-28T16:51:26.425Z',
 '2025-02-28T15:57:24.153Z',
 '2025-02-28T11:59:58.404Z',
 '2025-02-28T11:49:16.825Z',
 '2025-02-28T10:49:17.280Z',
 '2025-02-28T09:54:22.250Z',
 '2025-02-28T09:14:14.982Z']
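
The `datetime` attribute is returned as a string. For later analysis, such as sorting or filtering by date, it can be converted to a timezone-aware `datetime`. This is a small sketch using only the standard library; `parse_timestamp` is a hypothetical helper name, and the two sample values are copied from the output above.

```python
from datetime import datetime


def parse_timestamp(value):
    """Convert an ISO 8601 string with a trailing 'Z' into an aware datetime (UTC)."""
    # Python versions before 3.11 do not accept 'Z' in fromisoformat,
    # so it is replaced with the equivalent explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


raw_dates = ["2025-02-28T11:49:16.825Z", "2025-02-27T17:06:59.458Z"]
parsed = sorted(parse_timestamp(d) for d in raw_dates)

print(parsed[0].date(), "to", parsed[-1].date())
```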

In [22]:
def author_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    author_tag = soup.find('span', 'kItaYD')
    if author_tag:
        return author_tag.text.strip()
    return 'Unknown'

authors = [author_finder(url) for url in url_finder()]

print(f'Author amount: {len(authors)}')
authors

Author amount: 26


['Kostas Koukoumakas',
 'Nick Thorpe &',
 'Ian Aikman & James Gregory',
 'Paul Kirby',
 'Nikhil Inamdar',
 'Nick Thorpe &',
 'Jack Burgess',
 'Thomas Mackintosh',
 'Ian Aikman & James Gregory',
 'Bethany Bell',
 'Unknown',
 'Kathryn Armstrong',
 'Nick Thorpe, Mircea Barbu & Paul Kirby',
 'James Gregory',
 'Aleks Phillips',
 'Ian Aikman',
 'Georgina Rannard',
 'Nick Beake',
 'Kevin Sharkey',
 'Tom McArthur',
 'Nick Thorpe &',
 'Jack Burgess',
 'Kostas Koukoumakas',
 'Unknown',
 'Unknown',
 'Thomas Mackintosh']
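
The helper functions above each download and parse the page separately. A single function that parses the HTML once and returns all fields as a dictionary would need only one request per article. The sketch below illustrates that idea on an inline HTML string. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages, and the `byline` class name, the sample HTML, and the names `ArticleParser` and `parse_article` are assumptions made for illustration, not the real BBC markup.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects headline, publish date, author, and paragraphs in one pass."""

    def __init__(self, author_class="byline"):  # "byline" is a hypothetical class name
        super().__init__()
        self.author_class = author_class
        self.fields = {"headline": None, "publish_date": None, "author": "Unknown"}
        self.paragraphs = []
        self._capture = None  # field currently being captured, if any
        self._tag = None      # tag that opened the capture
        self._buffer = []

    def _start(self, field, tag):
        self._capture, self._tag, self._buffer = field, tag, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and self.fields["publish_date"] is None:
            self.fields["publish_date"] = attrs.get("datetime")
        if self._capture is not None:
            return  # already inside a captured element
        if tag == "h1":
            self._start("headline", tag)
        elif tag == "p":
            self._start("paragraph", tag)
        elif tag == "span" and self.author_class in (attrs.get("class") or "").split():
            self._start("author", tag)

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture is None or tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())  # normalise whitespace
        if self._capture == "paragraph":
            if text:
                self.paragraphs.append(text)
        elif text and self.fields[self._capture] in (None, "Unknown"):
            self.fields[self._capture] = text  # keep the first match only
        self._capture = None


def parse_article(html):
    """Return a dictionary with headline, publish_date, author, and content."""
    parser = ArticleParser()
    parser.feed(html)
    return {**parser.fields, "content": " ".join(parser.paragraphs)}


# Made-up sample page used only to exercise the parser.
sample_html = """
<html><body>
<h1>Example headline</h1>
<time datetime="2025-02-28T11:49:16.825Z">28 February 2025</time>
<span class="byline">Jane Doe</span>
<p>First paragraph.</p>
<p>Second <b>bold</b> paragraph.</p>
</body></html>
"""
article = parse_article(sample_html)
print(article)
```

In the notebook itself, the same structure could be kept with BeautifulSoup by creating one `soup` object per URL and reading all four fields from it. The sketch does not handle nested elements of the same tag inside a captured element.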

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents (May take a long time. So try on a smaller subset to start with). Remember to implement error handling and possibly introduce delays to avoid being blocked.
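
One way to add the error handling and delays mentioned above is a small wrapper that retries a failed download after a pause and returns `None` if every attempt fails, so a single bad URL does not stop the whole loop. This is a sketch under assumptions: `fetch_with_retries` and `flaky_fetch` are hypothetical names, the fetch function is passed in as a parameter, and the demonstration uses a fake fetcher instead of real network requests.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay_seconds=1.0):
    """Call fetch(url); on failure wait and try again. Returns None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            print(f"Attempt {attempt} failed for {url}: {error}")
            if attempt < retries:
                # Wait longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    return None


# Fake fetcher that fails twice before succeeding, to demonstrate the retry logic.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"url": url, "headline": "Example headline"}


result = fetch_with_retries(flaky_fetch, "https://example.com/article", delay_seconds=0)
print(result)
```

In the scraping loop, the function that builds the article dictionary would take the place of `flaky_fetch`, and rows that come back as `None` can be filtered out before building the DataFrame.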

In [23]:
def function_combiner(url):
    # Combine the individual finder functions into one record per article.
    return {
        'url': url,
        'publish_date': publish_date_finder(url),
        'author': author_finder(url),
        'headline': headline_finder(url),
        'summary': summary_finder(url),
        'content': content_finder(url),
    }

In [24]:
df = pd.DataFrame([function_combiner(url) for url in unique_urls])

browser.quit()

display(df)

KeyboardInterrupt: 

## 4. Data Storage:

### Save the scraped article data to a file.

In [None]:
df.to_csv('scraped_news.csv', index=False)

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!