# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focus on acquiring raw data, processing data, cleaning and exploratory data analysis, and structured representation and storage of data.

# Submission Requirements

### Jupyter Notebook (.ipynb file) implementing the assignment. 

### PDF printout of the executed Jupyter Notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from the last assignment.

In [225]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

### Determine which article types should be omitted, if any.

In [226]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

| | types |
|---|---|
| 0 | unreliable |
| 1 | fake |
| 2 | clickbait |
| 3 | conspiracy |
| 4 | reliable |
| 5 | bias |
| 6 | hate |
| 7 | junksci |
| 8 | political |
| 9 | |


I keep all types except NaN and unknown.

In [227]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
df['type'] = df['type'].apply(lambda x: x if x in wanted_types else None)
df = df.dropna(subset=['type'])
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

| | 0 |
|---|---|
| 0 | unreliable |
| 1 | fake |
| 2 | clickbait |
| 3 | conspiracy |
| 4 | reliable |
| 5 | bias |
| 6 | hate |
| 7 | junksci |
| 8 | political |


### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

Arguments...

In [228]:
reliable_types = ['clickbait', 'reliable', 'political']
fake_types = ['unreliable', 'fake', 'conspiracy', 'bias', 'hate', 'junksci']

# Replace each group in one call (this also avoids shadowing the built-in `type`)
df['type'] = df['type'].replace(reliable_types, 'reliable')
df['type'] = df['type'].replace(fake_types, 'fake')
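
The same grouping can also be written as a single lookup table, which keeps the decision for every type visible in one place. This is a minimal sketch, assuming the type names listed above. `LABEL_MAP` and `group_label` are hypothetical names used only for illustration, and each assignment simply mirrors the two lists in the cell above.

```python
# Hypothetical lookup table mirroring the reliable_types / fake_types lists above.
LABEL_MAP = {
    "clickbait": "reliable",
    "reliable": "reliable",
    "political": "reliable",
    "unreliable": "fake",
    "fake": "fake",
    "conspiracy": "fake",
    "bias": "fake",
    "hate": "fake",
    "junksci": "fake",
}


def group_label(article_type):
    """Return 'reliable' or 'fake' for a known type, or None for anything else."""
    return LABEL_MAP.get(article_type)


labels = [group_label(t) for t in ["political", "junksci", "unknown"]]
print(labels)
```

With pandas, the same table could be applied in one step with `df['type'].map(LABEL_MAP)`, which leaves unmapped values as NaN.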

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

In [229]:
reliable_amount = 0
fake_amount = 0
# `label` avoids shadowing the built-in `type`
for label in df['type']:
    if label == 'reliable':
        reliable_amount += 1
    else:
        fake_amount += 1

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
# Double quotes outside: reusing the same quote type inside an f-string needs Python 3.12+
print(f"Reliable percentage: {reliable_amount / len(df['type']) * 100:.2f}%")

Reliable amount: 27
Fake amount: 205
Reliable percentage: 11.64%
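
To support the balance discussion, the full percentage distribution can be computed for every class at once. This is a minimal sketch using only the standard library. `class_distribution` is a hypothetical helper, and the label list is rebuilt from the counts printed above rather than from the real DataFrame.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: percentage of all labels}, rounded to two decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts taken from the output above: 27 reliable and 205 fake articles.
labels = ["reliable"] * 27 + ["fake"] * 205
distribution = class_distribution(labels)
print(distribution)
```

With this split, a model that always predicts 'fake' would already be right for 205 of 232 articles, which is why plain accuracy can be misleading on an unbalanced dataset.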


# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install selenium (pip install selenium). Create a new Jupyter Notebook and import this module:

In [230]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By 
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import pandas as pd
import time

### Verify that contents holds the HTML source of the webpage:

In [231]:
firefox_options = Options()
firefox_options.add_argument('--headless')
firefox_options.add_argument('--disable-gpu')
firefox_options.add_argument('--no-sandbox')
firefox_options.add_argument('--disable-dev-shm-usage')

browser = webdriver.Firefox(options=firefox_options)
browser.get('https://www.bbc.com/news/world/africa')

# Wait for the iframe to load and switch to it
WebDriverWait(browser, 10).until(
    EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_1192447"))
)

# Wait for the "I agree" button to appear and click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'I agree')]"))
).click()

# Switch back to the main page
browser.switch_to.default_content()
bannerCookie = browser.find_element(By.ID, "bbccookies-continue-button")
bannerCookie.click()

In [None]:
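def link_finder(driver):
    # Collect the href of every anchor on the page currently loaded in the driver
    link_elements = driver.find_elements(By.XPATH, "//a[@href]")
    links = [element.get_attribute('href') for element in link_elements]
    # Keep only links that point to individual articles
    usable_links = [link for link in links if "/news/articles" in link]
    print(usable_links)
    return usable_links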

all_sites = ["https://www.bbc.com/news/us-canada", 
             "https://www.bbc.com/news/world/africa", 
             "https://www.bbc.com/news/world/asia", 
             "https://www.bbc.com/news/world/europe", 
             "https://www.bbc.com/news/world/latin_america", 
             "https://www.bbc.com/news/world/middle_east", 
             "https://www.bbc.com/news/world/australia",
             "https://www.bbc.com/news/uk"]

all_links = []
section_length = 16
for url in all_sites:
    browser.get(url)
    for page in range(1, section_length):
        try:
            if page > 1:
                # Page 1 is already loaded; click the numbered button for later pages
                next_page = browser.find_element(By.XPATH, f"//button[contains(text(),'{page}')]")
                next_page.click()
                time.sleep(1)  # let the new page render before collecting links
            all_links.extend(link_finder(browser))
        except Exception as e:
            print(f"Error at: {url}\n{e}")
            break
    print(all_links)
    print(len(all_links))

browser.quit()

['https://www.bbc.com/news/articles/cewkkkvkzn9o', 'https://www.bbc.com/news/articles/c62x7p4465no', 'https://www.bbc.com/news/articles/c778rp2je47o', 'https://www.bbc.com/news/articles/c4gm9851559o', 'https://www.bbc.com/news/articles/cvgwwp2gd3jo', 'https://www.bbc.com/news/articles/c62x7p4465no', 'https://www.bbc.com/news/articles/clyderx4v8go', 'https://www.bbc.com/news/articles/c5y44gw5gpro', 'https://www.bbc.com/news/articles/cn48z5q28vyo', 'https://www.bbc.com/news/articles/cn4ymkl7294o', 'https://www.bbc.com/news/articles/c170l0n8j54o', 'https://www.bbc.com/news/articles/c2lj0vrkv9yo', 'https://www.bbc.com/news/articles/cvg584grxwwo', 'https://www.bbc.com/news/articles/cde98k6e2dno', 'https://www.bbc.com/news/articles/crlx8w5wdl0o', 'https://www.bbc.com/news/articles/ce8yy3wpn6eo', 'https://www.bbc.com/news/articles/cq6yy9e368vo', 'https://www.bbc.com/news/articles/cvgwwp2gd3jo', 'https://www.bbc.com/news/articles/c778rp2je47o', 'https://www.bbc.com/news/articles/c170l0n8j54o',

In [233]:
unique_urls = list(dict.fromkeys(all_links))
print(unique_urls)
print(len(unique_urls))

['https://www.bbc.com/news/articles/cewkkkvkzn9o', 'https://www.bbc.com/news/articles/c62x7p4465no', 'https://www.bbc.com/news/articles/c778rp2je47o', 'https://www.bbc.com/news/articles/c4gm9851559o', 'https://www.bbc.com/news/articles/cvgwwp2gd3jo', 'https://www.bbc.com/news/articles/clyderx4v8go', 'https://www.bbc.com/news/articles/c5y44gw5gpro', 'https://www.bbc.com/news/articles/cn48z5q28vyo', 'https://www.bbc.com/news/articles/cn4ymkl7294o', 'https://www.bbc.com/news/articles/c170l0n8j54o', 'https://www.bbc.com/news/articles/c2lj0vrkv9yo', 'https://www.bbc.com/news/articles/cvg584grxwwo', 'https://www.bbc.com/news/articles/cde98k6e2dno', 'https://www.bbc.com/news/articles/crlx8w5wdl0o', 'https://www.bbc.com/news/articles/ce8yy3wpn6eo', 'https://www.bbc.com/news/articles/cq6yy9e368vo', 'https://www.bbc.com/news/articles/c5yxvywr015o', 'https://www.bbc.com/news/articles/cq5zgvdz2z0o', 'https://www.bbc.com/news/articles/cx2rmxr90eyo', 'https://www.bbc.com/news/articles/cwydeppzggno',
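
The filtering and de-duplication steps above can be made slightly more robust by normalizing each URL first, so that the same article reached with a different query string or fragment is only counted once. This is a sketch under assumptions: `normalize_article_url` and `unique_article_urls` are hypothetical helpers, and the `#comments` and `?at_medium=example` suffixes in the sample are invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_article_url(url):
    """Drop the query string and fragment so variants of one article compare equal."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_article_urls(urls, marker="/news/articles"):
    """Keep article URLs only, normalized and de-duplicated in first-seen order."""
    normalized = (normalize_article_url(u) for u in urls if marker in urlsplit(u).path)
    return list(dict.fromkeys(normalized))


sample = [
    "https://www.bbc.com/news/articles/cewkkkvkzn9o",
    "https://www.bbc.com/news/articles/cewkkkvkzn9o#comments",
    "https://www.bbc.com/news/world/africa",
    "https://www.bbc.com/news/articles/c62x7p4465no?at_medium=example",
]
print(unique_article_urls(sample))
```

`dict.fromkeys` is used for the same reason as in the cell above: it removes duplicates while preserving the order in which links were first seen.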

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [234]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text

## 3. Extract Articles:

### Selenium allows us to extract information easily. You can read the documentation here. Write a function to extract all articles from the page using the find_elements method. For each article, retrieve the headline, the summary, and the link to the article (href). (Hint: You might have to inspect the page for classes, ids, etc. that might be useful for extracting the information)

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: See buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: Click on a button using .click(). You may also have to add a delay for the page to load using time.sleep(n_seconds))
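
One way to structure the click-wait-scrape loop is to separate the pagination logic from the browser calls, which makes it testable without a live browser. This is a minimal sketch, not the required solution. `collect_from_pages` and its two callback parameters are hypothetical names, and the fake pages below stand in for a real Selenium session.

```python
import time


def collect_from_pages(scrape_page, go_to_page, n_pages, delay=1.0):
    """Scrape page 1, then visit pages 2..n_pages, pausing after each navigation.

    scrape_page() returns the links on the current page.
    go_to_page(n) navigates to page n (for example by clicking a button).
    """
    links = list(scrape_page())
    for page in range(2, n_pages + 1):
        go_to_page(page)
        time.sleep(delay)  # give the page time to render before scraping
        links.extend(scrape_page())
    return links


# Stand-in for a browser: three fake pages with two links each.
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e", "f"]}
state = {"page": 1}

result = collect_from_pages(
    scrape_page=lambda: fake_pages[state["page"]],
    go_to_page=lambda n: state.update(page=n),
    n_pages=3,
    delay=0,
)
print(result)
```

With Selenium, `scrape_page` would wrap the `find_elements` call and `go_to_page` would locate and click the pagination button for the given page number.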

In [236]:
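def url_finder():
    # Needs a live Selenium session: calling this after browser.quit() raises MaxRetryError
    elements = browser.find_elements(By.CLASS_NAME, 'sc-2e6baa30-0')
    urls = [element.get_attribute('href') for element in elements]
    # Skip elements without an href and keep only article links
    return [url for url in urls if url and 'news/articles' in url]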

print(f'URL amount: {len(url_finder())}')
url_finder()

MaxRetryError: HTTPConnectionPool(host='localhost', port=63835): Max retries exceeded with url: /session/cd207468-63dd-42e6-8485-70fedeabfb76/elements (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10aade3f0>: Failed to establish a new connection: [Errno 61] Connection refused'))

In [None]:
def headline_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('h1').text
    return headline

headlines = [headline_finder(url) for url in url_finder()]

print(f'Headlines amount: {len(headlines)}')
headlines

Headlines amount: 24


['Jailed Kurdish separatist leader issues call to lay down arms',
 'Tate brothers allowed to leave Romania for US',
 'Leaked recordings challenge Greek account of deadly shipwreck',
 "North Korea has sent more troops to Russia, South's spy agency says",
 'Zelensky to meet Trump in Washington to sign minerals deal',
 'Tate brothers allowed to leave Romania for US',
 'Why are the Tate brothers heading to the US?',
 'Austrian centrists agree government deal sidelining far right',
 'Leaked recordings challenge Greek account of deadly shipwreck',
 'Bosnian-Serb leader sentenced to jail in landmark trial',
 'Romanian far-right presidential hopeful detained on street and indicted',
 'Starmer cuts aid to fund increase in defence spending',
 'Tesla shares slump after European sales fall',
 "Macron walks tightrope with Trump as he makes Europe's case on Ukraine",
 'Can Europe still count on the US coming to its defence?',
 'German politics froze out the far right for years – is this about to cha

In [None]:
def summary_finder(link):
    response = requests.get(link)
    # The first <p> on the page is used as the article summary
    summary = BeautifulSoup(response.content, 'html.parser').find('p').text
    return summary

summaries = [summary_finder(url) for url in url_finder()]

print(f'Summary amount: {len(summaries)}')
summaries

Summary amount: 24


['Abdullah Ocalan, the jailed leader of the Kurdish separatist PKK, has called on his movement to lay down its arms and dissolve itself.',
 'British-American influencers Andrew and Tristan Tate - who are facing trial in Romania on charges of rape, trafficking minors and money laundering - have left the country after prosecutors lifted a two-year travel ban.',
 "Leaked audio instructions by Greek rescue co-ordinators have cast further doubt on Greece's official version of events in the hours before a migrant boat sank along with up to 650 people onboard.",
 "North Korea has sent more soldiers to Russia and re-deployed others to the frontline in the western Kursk region, according to South Korea's intelligence agency. ",
 "Ukrainian President Volodymyr Zelensky will meet US President Donald Trump in Washington on Friday to sign an agreement on sharing his country's mineral resources, Trump has said. ",
 'British-American influencers Andrew and Tristan Tate - who are facing trial in Roman

## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: US & Canada, UK, Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 800 article links.

## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).

In [None]:
def combiner(link):
    return {'url': link, 'headline': headline_finder(link), 'summary': summary_finder(link)}

df = pd.DataFrame([combiner(url) for url in url_finder()])
df.to_csv('scraped_news.csv', index=False)
display(df)

| | url | headline | summary |
|---|---|---|---|
| 0 | https://www.bbc.com/news/articles/c2kggzqy0x7o | Jailed Kurdish separatist leader issues call t... | Abdullah Ocalan, the jailed leader of the Kurd... |
| 1 | https://www.bbc.com/news/articles/cpq222rqv4po | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... |
| 2 | https://www.bbc.com/news/articles/c17qe11wy52o | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... |
| 3 | https://www.bbc.com/news/articles/cx2rreg04dpo | North Korea has sent more troops to Russia, So... | North Korea has sent more soldiers to Russia a... |
| 4 | https://www.bbc.com/news/articles/cn7vg0nvzkko | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... |
| 5 | https://www.bbc.com/news/articles/cpq222rqv4po | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... |
| 6 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... |
| 7 | https://www.bbc.com/news/articles/c0q1188p1n2o | Austrian centrists agree government deal sidel... | Five months after the far-right Freedom Party ... |
| 8 | https://www.bbc.com/news/articles/c17qe11wy52o | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... |
| 9 | https://www.bbc.com/news/articles/cdrxy1zp8mxo | Bosnian-Serb leader sentenced to jail in landm... | A one-year prison sentence and a six-year ban ... |
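
If only the links themselves need to be stored, JSON and TXT are simple alternatives to CSV. This is a minimal sketch using the standard library. `save_links`, `load_links` and the file names `article_links.json` and `article_links.txt` are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


def save_links(links, json_path, txt_path):
    """Write the same list of links as a JSON array and as one URL per line."""
    Path(json_path).write_text(json.dumps(links, indent=2), encoding="utf-8")
    Path(txt_path).write_text("\n".join(links) + "\n", encoding="utf-8")


def load_links(json_path):
    """Read the links back from the JSON file."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


links = [
    "https://www.bbc.com/news/articles/cewkkkvkzn9o",
    "https://www.bbc.com/news/articles/c62x7p4465no",
]
save_links(links, "article_links.json", "article_links.txt")
print(load_links("article_links.json") == links)
```

Reading the file back right after writing it is a cheap check that nothing was lost before the browser session and the in-memory list are discarded.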


# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

In [None]:
def content_finder(link):
    response = requests.get(link)
    paragraphs = BeautifulSoup(response.content, 'html.parser').find_all('p')
    main_text = ' '.join([p.text for p in paragraphs])
    return main_text

content = [content_finder(url) for url in url_finder()]

print(f'Content amount: {len(content)}')
content

Content amount: 24


['Abdullah Ocalan, the jailed leader of the Kurdish separatist PKK, has called on his movement to lay down its arms and dissolve itself. His statement, read out in a letter by MPs from a pro-Kurdish party, was aimed at ending four decades of armed struggle in south-eastern Turkey in which tens of thousands of people have been killed. Ocalan, 75, had earlier met the MPs for several hours on Imrali, an island in the Sea of Marmara south-west of Istanbul where he has been imprisoned in solitary confinement since 1999. His announcement came months after ultra-nationalist leader Devlet Bahceli, who is part of Turkey\'s government, launched an initiative to bring an end to the conflict. "There is no alternative to democracy in the pursuit and realisation of a political system," Ocalan\'s letter read. "Democratic consensus is the fundamental way." Appealing to members of the PKK - the Kurdistan Workers\' Movement - Ocalan said "all groups must lay their arms and the PKK must dissolve itself".

## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.

In [None]:
def publish_date_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    date = soup.find('time')['datetime']
    return date

publish_dates = [publish_date_finder(url) for url in url_finder()]

print(f'Publish date amount: {len(publish_dates)}')
publish_dates

Publish date amount: 24


['2025-02-27T15:20:23.722Z',
 '2025-02-27T13:09:39.367Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-27T12:54:52.742Z',
 '2025-02-27T02:03:10.229Z',
 '2025-02-27T13:09:39.367Z',
 '2025-02-27T15:47:43.622Z',
 '2025-02-27T14:07:08.796Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-26T21:03:11.477Z',
 '2025-02-26T17:11:57.047Z',
 '2025-02-26T02:23:47.711Z',
 '2025-02-26T07:53:15.992Z',
 '2025-02-25T01:47:56.921Z',
 '2025-02-26T00:00:32.454Z',
 '2025-02-25T13:08:08.773Z',
 '2025-02-27T14:07:08.796Z',
 '2025-02-27T14:05:01.764Z',
 '2025-02-27T13:09:39.367Z',
 '2025-02-27T08:55:24.894Z',
 '2025-02-27T05:12:03.716Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-26T21:10:02.468Z',
 '2025-02-26T21:03:11.477Z']

In [None]:
def author_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    author_tag = soup.find('span', 'kItaYD')
    if author_tag:
        return author_tag.text.strip()
    return 'Unknown'

authors = [author_finder(url) for url in url_finder()]

print(f'Author amount: {len(authors)}')
authors

Author amount: 24


['Paul Kirby',
 'Nick Thorpe, Mircea Barbu & Paul Kirby',
 'Nick Beake',
 'Kathryn Armstrong',
 'James Gregory',
 'Nick Thorpe, Mircea Barbu & Paul Kirby',
 'Ian Aikman',
 'Bethany Bell',
 'Nick Beake',
 'Guy Delauney',
 'Paul Kirby & Mircea Barbu',
 'Joshua Nevett and Sam Francis',
 'Tom Espiner',
 "Gary O'Donoghue",
 'Frances Mao',
 'Paul Kirby & Kristina Volk',
 'Bethany Bell',
 'Unknown',
 'Nick Thorpe, Mircea Barbu & Paul Kirby',
 'Unknown',
 'Sarah Rainsford',
 'Nick Beake',
 'Unknown',
 'Guy Delauney']

In [None]:
def combiner(link):
    # One row per article; note that every helper downloads the page again
    return {
        'url': link,
        'publish_date': publish_date_finder(link),
        'author': author_finder(link),
        'headline': headline_finder(link),
        'summary': summary_finder(link),
        'content': content_finder(link),
    }
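
Because each helper above requests the page separately, one article costs several downloads. A possible alternative is to download the HTML once and extract all fields from that single string. The sketch below illustrates the idea with the standard library's `html.parser` instead of BeautifulSoup, so it runs without extra dependencies. `ArticleParser`, `parse_article` and the sample HTML are hypothetical, and the author field is left out because its selector is specific to the site's markup.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the first <h1>, the first <time datetime=...>, and all <p> text."""

    def __init__(self):
        super().__init__()
        self.headline = None
        self.published = None
        self.paragraphs = []
        self._tag = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "time" and self.published is None:
            self.published = dict(attrs).get("datetime")
        if tag in ("h1", "p") and self._tag is None:
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h1" and self.headline is None:
            self.headline = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._tag = None


def parse_article(html):
    """Return headline, publish date, summary and full text from one HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    return {
        "headline": parser.headline,
        "publish_date": parser.published,
        "summary": parser.paragraphs[0] if parser.paragraphs else None,
        "content": " ".join(parser.paragraphs),
    }


# Invented sample document, not real BBC markup.
sample_html = """
<html><body>
  <h1>Example headline</h1>
  <time datetime="2025-02-27T15:20:23.722Z">27 February 2025</time>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""
article = parse_article(sample_html)
print(article)
```

In the notebook, the string passed to `parse_article` would be `requests.get(link).text`, fetched once per article.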

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents (May take a long time. So try on a smaller subset to start with). Remember to implement error handling and possibly introduce delays to avoid being blocked.

In [None]:
df = pd.DataFrame([combiner(url) for url in unique_urls])
display(df)

| | url | publish_date | author | headline | summary | content |
|---|---|---|---|---|---|---|
| 0 | https://www.bbc.com/news/articles/c2kggzqy0x7o | 2025-02-27T15:20:23.722Z | Paul Kirby | Jailed Kurdish separatist leader issues call t... | Abdullah Ocalan, the jailed leader of the Kurd... | Abdullah Ocalan, the jailed leader of the Kurd... |
| 1 | https://www.bbc.com/news/articles/cpq222rqv4po | 2025-02-27T13:09:39.367Z | Nick Thorpe, Mircea Barbu & Paul Kirby | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... | British-American influencers Andrew and Trista... |
| 2 | https://www.bbc.com/news/articles/c17qe11wy52o | 2025-02-27T00:23:25.454Z | Nick Beake | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... | Leaked audio instructions by Greek rescue co-o... |
| 3 | https://www.bbc.com/news/articles/cx2rreg04dpo | 2025-02-27T12:54:52.742Z | Kathryn Armstrong | North Korea has sent more troops to Russia, So... | North Korea has sent more soldiers to Russia a... | North Korea has sent more soldiers to Russia a... |
| 4 | https://www.bbc.com/news/articles/cn7vg0nvzkko | 2025-02-27T02:03:10.229Z | James Gregory | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... | Ukrainian President Volodymyr Zelensky will me... |
| 5 | https://www.bbc.com/news/articles/cpq222rqv4po | 2025-02-27T13:09:39.367Z | Nick Thorpe, Mircea Barbu & Paul Kirby | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... | British-American influencers Andrew and Trista... |
| 6 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | 2025-02-27T15:47:43.622Z | Ian Aikman | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... | Controversial influencer Andrew Tate and his b... |
| 7 | https://www.bbc.com/news/articles/c0q1188p1n2o | 2025-02-27T14:07:08.796Z | Bethany Bell | Austrian centrists agree government deal sidel... | Five months after the far-right Freedom Party ... | Five months after the far-right Freedom Party ... |
| 8 | https://www.bbc.com/news/articles/c17qe11wy52o | 2025-02-27T00:23:25.454Z | Nick Beake | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... | Leaked audio instructions by Greek rescue co-o... |
| 9 | https://www.bbc.com/news/articles/cdrxy1zp8mxo | 2025-02-26T21:03:11.477Z | Guy Delauney | Bosnian-Serb leader sentenced to jail in landm... | A one-year prison sentence and a six-year ban ... | A one-year prison sentence and a six-year ban ... |
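
The task asks for error handling and delays, which the list comprehension above does not provide: a single failing article would abort the whole run. One possible shape for a more forgiving loop is sketched below. `scrape_all` and `fake_scraper` are hypothetical names, the `example.org` URLs are placeholders, and `fake_scraper` stands in for `combiner` so the sketch runs without network access.

```python
import time


def scrape_all(urls, scrape_one, delay=1.0):
    """Scrape every URL, recording failures instead of aborting the whole run."""
    rows, failures = [], []
    for url in urls:
        try:
            rows.append(scrape_one(url))
        except Exception as error:  # broad on purpose: one bad page must not stop the loop
            failures.append((url, repr(error)))
        time.sleep(delay)  # pause between requests to reduce the risk of being blocked
    return rows, failures


def fake_scraper(url):
    """Stand-in for combiner(url); fails for one URL to exercise the error path."""
    if url.endswith("broken"):
        raise ValueError("no <h1> found")
    return {"url": url, "headline": "Example"}


rows, failures = scrape_all(
    ["https://example.org/ok", "https://example.org/broken"],
    fake_scraper,
    delay=0,
)
print(len(rows), len(failures))
```

Keeping the failed URLs in a separate list makes it possible to retry only those articles later instead of repeating the full scrape.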


## 4. Data Storage:

### Save the scraped article data to a file.

In [None]:
df.to_csv('scraped_news.csv', index=False)

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.
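
One statistic that could support the discussion is how the class balance would shift if the scraped articles were added. The sketch below is a back-of-the-envelope calculation under a strong assumption: that every scraped BBC article would be labelled 'reliable'. `reliable_share` is a hypothetical helper. The counts 27 and 205 come from Part 1, and 800 is the approximate number of links mentioned in Part 2, not a measured value.

```python
def reliable_share(reliable, fake, added_reliable=0):
    """Percentage of reliable articles after adding extra reliable ones."""
    total = reliable + fake + added_reliable
    return round(100 * (reliable + added_reliable) / total, 2)


# 27 and 205 come from Part 1; 800 is the approximate link count mentioned in Part 2.
before = reliable_share(27, 205)
after = reliable_share(27, 205, added_reliable=800)
print(before, after)
```

Under these assumptions the imbalance would not disappear but flip towards 'reliable', so adding only part of the scraped data may be worth considering in the discussion.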

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!