# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focusing on acquiring raw data, processing data, cleaning and explorative data analysis, and structured representation and storage of data.

# Submission Requirements

### Jupyter Notebook (.ipynb file) implementing the assignment. 

### PDF printout of the executed Jupyter Notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from last assignment

In [296]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

### Determine which article types should be omitted, if any.

In [297]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

Unnamed: 0,types
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political
9,


Jeg vælger alle bort set fra NaN og unknown

In [298]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
df['type'] = df['type'].apply(lambda x: x if x in wanted_types else None)
df = df.dropna(subset=['type'])
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

Unnamed: 0,0
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political


### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

Arugmenter...

In [299]:
reliable_types = ['clickbait', 'reliable', 'political']
fake_types = ['unreliable', 'fake', 'conspiracy', 'bias', 'hate', 'junksci']

for type in reliable_types:
    df['type'] = df['type'].replace(type, 'reliable')

for type in fake_types:
    df['type'] = df['type'].replace(type, 'fake')

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

In [300]:
reliable_amount = 0
fake_amount = 0
for type in df['type']:
    if type == 'reliable':
        reliable_amount += 1
    else:
        fake_amount += 1

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
print(f'Reliable percentage: {reliable_amount / len(df['type']) * 100:.2f}%')

Reliable amount: 27
Fake amount: 205
Reliable percentage: 11.64%


# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install selenium (pip install selenium). Create a new Jupyter Notebook and import this module:

In [301]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By 
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import pandas as pd
import time

### verify that contents holds the HTML source of the webpage:

In [302]:
firefox_options = Options()
firefox_options.add_argument('--headless')
firefox_options.add_argument('--disable-gpu')
firefox_options.add_argument('--no-sandbox')
firefox_options.add_argument('--disable-dev-shm-usage')

browser = webdriver.Firefox(options=firefox_options)
browser.get('https://www.bbc.com/news/world/africa')

# Wait for the iframe to load and switch to it
WebDriverWait(browser, 10).until(
    EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_1192447"))
)

# Wait for the "I agree" button to appear and click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'I agree')]"))
).click()

# Switch back to the main page
browser.switch_to.default_content()
bannerCookie = browser.find_element(By.ID, "bbccookies-continue-button")
bannerCookie.click()

In [303]:
def link_finder(link): 
    link_elements = link.find_elements(By.XPATH, "//a[@href]")
    links = [link.get_attribute('href') for link in link_elements]
    usable_links = [link for link in links if link.find("/news/articles") != -1]
    print(usable_links)
    return usable_links

all_sites = ["https://www.bbc.com/news/us-canada", 
             "https://www.bbc.com/news/world/africa", 
             "https://www.bbc.com/news/world/asia", 
             "https://www.bbc.com/news/world/europe", 
             "https://www.bbc.com/news/world/latin_america", 
             "https://www.bbc.com/news/world/middle_east", 
             "https://www.bbc.com/news/world/australia",
             "https://www.bbc.com/news/uk"]

all_links = []
section_length = 16
for link in all_sites:
    browser.get(url)
    for i in range(1, section_length):
        try:
            all_links.extend(link_finder(browser))
            nextPage = browser.find_element(By.XPATH, f"//button[contains(text(),'{i}')]")
            nextPage.click()
        except Exception as e:
            print(f"Error at : {link} \n{e}")
            break
    print(all_links)
    print(len(all_links))

browser.quit()

['https://www.bbc.com/news/articles/c5ymvygpj7go', 'https://www.bbc.com/news/articles/ce8yy0116y8o', 'https://www.bbc.com/news/articles/c86pp14j0dxo', 'https://www.bbc.com/news/articles/cx2009jgj58o', 'https://www.bbc.com/news/articles/cqjdz9q170yo', 'https://www.bbc.com/news/articles/ce8yy0116y8o', 'https://www.bbc.com/news/articles/c4g77e57q17o', 'https://www.bbc.com/news/articles/c86pp14j0dxo', 'https://www.bbc.com/news/articles/c5yxxe2zl1qo', 'https://www.bbc.com/news/articles/cqjdd7zx4plo', 'https://www.bbc.com/news/articles/cdryre7y4n0o', 'https://www.bbc.com/news/articles/c62zzxeeveeo', 'https://www.bbc.com/news/articles/cqlyy2e5w4vo', 'https://www.bbc.com/news/articles/ce30nrnpwygo', 'https://www.bbc.com/news/articles/c62k6qxwz1ko', 'https://www.bbc.com/news/articles/c241pn09qqjo', 'https://www.bbc.com/news/articles/c8rkj3rg1j2o', 'https://www.bbc.com/news/articles/cjevxep9jnpo', 'https://www.bbc.com/news/articles/cpdeezw4z92o', 'https://www.bbc.com/news/articles/c15qqlv79gqo',

In [304]:
unique_urls = list(dict.fromkeys(all_links))
print(unique_urls)
print(len(unique_urls))

['https://www.bbc.com/news/articles/c5ymvygpj7go', 'https://www.bbc.com/news/articles/ce8yy0116y8o', 'https://www.bbc.com/news/articles/c86pp14j0dxo', 'https://www.bbc.com/news/articles/cx2009jgj58o', 'https://www.bbc.com/news/articles/cqjdz9q170yo', 'https://www.bbc.com/news/articles/c4g77e57q17o', 'https://www.bbc.com/news/articles/c5yxxe2zl1qo', 'https://www.bbc.com/news/articles/cqjdd7zx4plo', 'https://www.bbc.com/news/articles/cdryre7y4n0o', 'https://www.bbc.com/news/articles/c62zzxeeveeo', 'https://www.bbc.com/news/articles/cqlyy2e5w4vo', 'https://www.bbc.com/news/articles/ce30nrnpwygo', 'https://www.bbc.com/news/articles/c62k6qxwz1ko', 'https://www.bbc.com/news/articles/c241pn09qqjo', 'https://www.bbc.com/news/articles/c8rkj3rg1j2o', 'https://www.bbc.com/news/articles/cjevxep9jnpo', 'https://www.bbc.com/news/articles/cpdeezw4z92o', 'https://www.bbc.com/news/articles/c15qqlv79gqo', 'https://www.bbc.com/news/articles/cpwddq1gnzeo', 'https://www.bbc.com/news/articles/c9844j17ldyo',

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [305]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text

## 3. Extract Articles:

### Selenium allows us to extract information easily. You can read the documentation here. Write a function to extract all articles from the page using the find_elements method. For each article, retrieve the headline, the summary, and the link to the article (href). (Hint: You might have to inspect the page for classes, ids, etc. that might be useful for extracting the information)

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: See buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: Click on button using .click(). You may also have to add a delay for the page to load using time.wait(n_seconds))

In [306]:
browser = webdriver.Firefox(options=firefox_options)
browser.get('https://www.bbc.com/news/world/africa')

In [307]:
def url_finder():
    urls = [url.get_attribute('href') for url in browser.find_elements(By.CLASS_NAME, 'sc-2e6baa30-0')  ]
    urls = [url for url in urls if url.find('news/articles') != -1]
    return urls

print(f'URL amount: {len(url_finder())}')
url_finder()

URL amount: 27


['https://www.bbc.com/news/articles/c8d4467y5dno',
 'https://www.bbc.com/news/articles/cx2eyqz49v5o',
 'https://www.bbc.com/news/articles/c798d074wp2o',
 'https://www.bbc.com/news/articles/c5yr7j18edjo',
 'https://www.bbc.com/news/articles/clyn914g249o',
 'https://www.bbc.com/news/articles/cx2eyqz49v5o',
 'https://www.bbc.com/news/articles/cevxxxmdmxeo',
 'https://www.bbc.com/news/articles/cdryre7y4n0o',
 'https://www.bbc.com/news/articles/c798d074wp2o',
 'https://www.bbc.com/news/articles/cx2gn417300o',
 'https://www.bbc.com/news/articles/c0kg58xk4dvo',
 'https://www.bbc.com/news/articles/cy7x87ev5jyo',
 'https://www.bbc.com/news/articles/c70q5wjjl4yo',
 'https://www.bbc.com/news/articles/cn04lpe38rwo',
 'https://www.bbc.com/news/articles/c9d5zqg3228o',
 'https://www.bbc.com/news/articles/cwyew21yyjzo',
 'https://www.bbc.com/news/articles/czepewl780eo',
 'https://www.bbc.com/news/articles/cn57l3xvk4po',
 'https://www.bbc.com/news/articles/c8d4467y5dno',
 'https://www.bbc.com/news/arti

In [308]:
def headline_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('h1').text
    return headline

headlines = [headline_finder(url) for url in url_finder()]

print(f'Headlines amount: {len(headlines)}')
headlines

Headlines amount: 27


['Gunfire and explosions hit rebel rally in DR Congo',
 "The 'hero' ship fixing Africa's internet blackouts - the BBC goes aboard",
 'BBC Komla Dumor Award 2025 launched',
 "'I need help': Freed from Myanmar's scam centres, thousands are now stranded",
 'Sudan military plane crashes in residential area',
 "The 'hero' ship fixing Africa's internet blackouts - the BBC goes aboard",
 'No sheep for Eid, king tells Moroccans',
 'Son loses case against parents over move to Africa',
 'BBC Komla Dumor Award 2025 launched',
 "'They took all the women here': Rape survivors recall horror of DR Congo jailbreak",
 "Regrets, executions and coups: Four takeaways from former Nigerian military ruler's book",
 "'People will starve' because of US aid cut to Sudan",
 'The Kenyans saying no to motherhood and yes to sterilisation',
 "Race policies or Israel - what's really driving Trump's fury with South Africa?",
 "How DR Congo's Tutsis become foreigners in their own country",
 'BBC undercover filming expo

In [309]:
def summary_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('p').text
    return headline

summaries = [summary_finder(url) for url in url_finder()]

print(f'Summary amount: {len(summaries)}')
summaries

Summary amount: 27


['Gunfire and explosions have ripped through a rally held by rebel leaders in a city they recently captured in eastern Democratic Republic of Congo. ',
 'A ship the size of a football field, crewed by more than 50 engineers and technicians, cruises the oceans around Africa to keep the continent online.',
 'The BBC is seeking a rising star of African journalism for the BBC News Komla Dumor Award, which is now in its 10th year.',
 '"I swear to God I need help," said the man quietly on the other end of the line.',
 'At least 46 people have been killed and 10 others injured after a Sudanese military plane crashed in a residential neighbourhood in Omdurman, the twin city of the capital Khartoum, state media says. ',
 'A ship the size of a football field, crewed by more than 50 engineers and technicians, cruises the oceans around Africa to keep the continent online.',
 "King Mohammed VI has asked Moroccans to abstain from performing the Muslim rite of slaughtering sheep during Eid al-Adha th

## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: US & Canada, UK, Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 800 article links.

## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).

In [310]:
def combiner(link):
    return {'url': link, 'headline': headline_finder(link), 'summary': summary_finder(link)}

df = pd.DataFrame([combiner(url) for url in url_finder()])
df.to_csv('scraped_news.csv', index=False)
display(df)

Unnamed: 0,url,headline,summary
0,https://www.bbc.com/news/articles/c8d4467y5dno,Gunfire and explosions hit rebel rally in DR C...,Gunfire and explosions have ripped through a r...
1,https://www.bbc.com/news/articles/cx2eyqz49v5o,The 'hero' ship fixing Africa's internet black...,"A ship the size of a football field, crewed by..."
2,https://www.bbc.com/news/articles/c798d074wp2o,BBC Komla Dumor Award 2025 launched,The BBC is seeking a rising star of African jo...
3,https://www.bbc.com/news/articles/c5yr7j18edjo,'I need help': Freed from Myanmar's scam centr...,"""I swear to God I need help,"" said the man qui..."
4,https://www.bbc.com/news/articles/clyn914g249o,Sudan military plane crashes in residential area,At least 46 people have been killed and 10 oth...
5,https://www.bbc.com/news/articles/cx2eyqz49v5o,The 'hero' ship fixing Africa's internet black...,"A ship the size of a football field, crewed by..."
6,https://www.bbc.com/news/articles/cevxxxmdmxeo,"No sheep for Eid, king tells Moroccans",King Mohammed VI has asked Moroccans to abstai...
7,https://www.bbc.com/news/articles/cdryre7y4n0o,Son loses case against parents over move to Af...,A 14-year-old boy has lost a court case he bro...
8,https://www.bbc.com/news/articles/c798d074wp2o,BBC Komla Dumor Award 2025 launched,The BBC is seeking a rising star of African jo...
9,https://www.bbc.com/news/articles/cx2gn417300o,'They took all the women here': Rape survivors...,Warning: This article contains distressing con...


# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

In [311]:
def content_finder(link):
    response = requests.get(link)
    paragraphs = BeautifulSoup(response.content, 'html.parser').find_all('p')
    main_text = ' '.join([p.text for p in paragraphs])
    return main_text

content = [content_finder(url) for url in url_finder()]

print(f'Summary amount: {len(content)}')
content

Summary amount: 27


['Gunfire and explosions have ripped through a rally held by rebel leaders in a city they recently captured in eastern Democratic Republic of Congo.  Videos show chaotic scenes with bodies on the streets after the crowd fled the rally in Bukavu, the second-biggest city in the east, in panic.  Casualty figures are unclear, but AFP news agency has quoted a hospital source as saying that at least 11 people have been killed and 60 others are wounded. This was the first rally that the Rwanda-backed rebels were holding in Bukavu since taking the city from government forces earlier this month following a rapid advance through the region.   The rebels accused President Felix Tshisekedi\'s government of orchestrating the attack. However, Tshisekedi blamed it on "a foreign army"  that he said was operating in the east. The rally had earlier been addressed by Corneille Nangaa, the head of the alliance of rebel groups that includes the Rwanda-backed M23. He promised the crowd that the rebels would

## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.

In [312]:
def publish_date_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    date = soup.find('time')['datetime']
    return date

publish_dates = [publish_date_finder(url) for url in url_finder()]

print(f'Publish date amount: {len(publish_dates)}')
publish_dates

Publish date amount: 27


['2025-02-27T16:38:34.443Z',
 '2025-02-27T00:39:31.213Z',
 '2025-02-27T03:45:39.550Z',
 '2025-02-26T22:34:29.906Z',
 '2025-02-26T11:54:00.146Z',
 '2025-02-27T00:39:31.213Z',
 '2025-02-27T09:15:38.438Z',
 '2025-02-27T15:35:00.536Z',
 '2025-02-27T03:45:39.550Z',
 '2025-02-25T17:38:48.519Z',
 '2025-02-25T15:41:09.260Z',
 '2025-02-25T00:16:28.781Z',
 '2025-02-24T01:31:47.325Z',
 '2025-02-23T07:30:18.478Z',
 '2025-02-22T01:16:14.763Z',
 '2025-02-21T00:53:54.048Z',
 '2025-02-20T01:28:35.942Z',
 '2025-01-31T15:12:37.052Z',
 '2025-02-27T16:38:34.443Z',
 '2025-02-27T03:45:39.550Z',
 '2025-02-27T00:39:31.213Z',
 '2025-02-26T12:51:01.462Z',
 '2025-02-26T11:54:00.146Z',
 '2025-02-25T17:38:48.519Z',
 '2025-02-25T17:19:28.020Z',
 '2025-02-25T16:59:07.600Z',
 '2025-02-25T16:23:51.476Z']

In [313]:
def author_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    author_tag = soup.find('span', 'kItaYD')
    if author_tag:
        return author_tag.text.strip()
    return 'Unknown'

authors = [author_finder(url) for url in url_finder()]

print(f'Author amount: {len(authors)}')
authors

Author amount: 27


['Danai Nesta Kupemba in London and Emery Makumeno in Kinshasa',
 'Daniel Dadzie',
 'Unknown',
 'Jonathan Head, Lulu Luo and Thanyarat Doksone',
 'Basillioh Rukanga',
 'Daniel Dadzie',
 'David Bamford & Natasha Booty',
 'Sanchia Berg and Levi Jouavel',
 'Unknown',
 'Orla Guerin',
 'Mansur Abubakar',
 'Barbara Plett Usher & Anne Soy',
 'Danai Nesta Kupemba',
 'Mayeni Jones',
 'Wedaeli Chibelushi',
 'BBC Eye Investigations',
 'Alfred Lasteck',
 'Unknown',
 'Danai Nesta Kupemba in London and Emery Makumeno in Kinshasa',
 'Unknown',
 'Daniel Dadzie',
 'Wedaeli Chibelushi',
 'Basillioh Rukanga',
 'Orla Guerin',
 'Mansur Abubakar',
 'Khanyisile Ngcobo',
 'Basillioh Rukanga']

In [314]:
def combiner(link):
    return {'url': link, 'publish_date': publish_date_finder(link), 'author': author_finder(link), 'headline': headline_finder(link), 'summary': summary_finder(link), 'content': content_finder(link)}

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents (May take a long time. So try on a smaller subset to start with). Remember to implement error handling and possibly introduce delays to avoid being blocked.

In [315]:
df = pd.DataFrame([combiner(url) for url in unique_urls])
display(df)

Unnamed: 0,url,publish_date,author,headline,summary,content
0,https://www.bbc.com/news/articles/c5ymvygpj7go,2025-02-27T08:48:28.389Z,Chris Mason,"US needs role in Ukraine to deter Putin, says PM",Sir Keir Starmer has reiterated his call for a...,Sir Keir Starmer has reiterated his call for a...
1,https://www.bbc.com/news/articles/ce8yy0116y8o,2025-02-27T16:06:20.591Z,Paul Burnell,MP Mike Amesbury's jail term suspended on appeal,"MP Mike Amesbury, who admitted repeatedly punc...","MP Mike Amesbury, who admitted repeatedly punc..."
2,https://www.bbc.com/news/articles/c86pp14j0dxo,2025-02-27T15:08:42.171Z,"Chloe Harcombe, Leigh Boobyer and Lee Madan","Woman, 19, dies in suspected XL bully dog attack",A 19-year-old woman has died after she was att...,A 19-year-old woman has died after she was att...
3,https://www.bbc.com/news/articles/cx2009jgj58o,2025-02-27T15:55:29.543Z,Paul Glynn and Steven McIntosh,Eggheads star Chris Hughes dies aged 77,"Quizzer Chris Hughes, star of the popular BBC ...","Quizzer Chris Hughes, star of the popular BBC ..."
4,https://www.bbc.com/news/articles/cqjdz9q170yo,2025-02-27T11:59:24.643Z,"Sean Dilley, Molly Stazicker, Esme Stallard & ...",Gatwick second runway backed by government,A second runway at Gatwick Airport has been ba...,A second runway at Gatwick Airport has been ba...
...,...,...,...,...,...,...
101,https://www.bbc.com/news/articles/cyve4m79e6lo,2025-02-25T18:45:36.148Z,Josh Parry,American loses UK appeal to become legally non...,An American who wanted to be formally recognis...,An American who wanted to be formally recognis...
102,https://www.bbc.com/news/articles/c05mlp43gnqo,2025-02-25T12:33:33.560Z,Unknown,Funds in place for Borders Railway extension s...,A funding package has been confirmed to allow ...,A funding package has been confirmed to allow ...
103,https://www.bbc.com/news/articles/c36we3pnxy6o,2025-02-24T19:16:29.986Z,Jack Burgess & Tom McArthur,Stonewall jobs at risk after Trump's foreign a...,"Stonewall has said it is trying to ""meet the c...","Stonewall has said it is trying to ""meet the c..."
104,https://www.bbc.com/news/articles/cq5zlq394dyo,2025-02-24T17:07:21.064Z,Chloe Parkman,'Windy boat proposal was challenging but worth...,"A man who experienced a ""challenging"" marriage...","A man who experienced a ""challenging"" marriage..."


## 4. Data Storage:

### Save the scraped article data to a file.

In [316]:
df.to_csv('scraped_news.csv', index=False)

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!