# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focus on acquiring raw data, processing and cleaning it, exploratory data analysis, and structured representation and storage of data.

# Submission Requirements

### Jupyter Notebook (.ipynb file) implementing the assignment. 

### PDF printout of the executed Jupyter Notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from last assignment

In [87]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

### Determine which article types should be omitted, if any.

In [88]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

|   | types |
|---|---|
| 0 | unreliable |
| 1 | fake |
| 2 | clickbait |
| 3 | conspiracy |
| 4 | reliable |
| 5 | bias |
| 6 | hate |
| 7 | junksci |
| 8 | political |
| 9 | NaN |


I keep all types except NaN and unknown.

In [89]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
df['type'] = df['type'].apply(lambda x: x if x in wanted_types else None)
df = df.dropna(subset=['type'])
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

|   | 0 |
|---|---|
| 0 | unreliable |
| 1 | fake |
| 2 | clickbait |
| 3 | conspiracy |
| 4 | reliable |
| 5 | bias |
| 6 | hate |
| 7 | junksci |
| 8 | political |


### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

Arguments...

In [90]:
reliable_types = ['clickbait', 'reliable', 'political']
fake_types = ['unreliable', 'fake', 'conspiracy', 'bias', 'hate', 'junksci']

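# Collapse the fine-grained labels into two classes.
# `label` is used as the loop variable to avoid shadowing the built-in `type`.
for label in reliable_types:
    df['type'] = df['type'].replace(label, 'reliable')

for label in fake_types:
    df['type'] = df['type'].replace(label, 'fake')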
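
The grouping can also be expressed as a single lookup table, which keeps the decision for every type in one place. This is a minimal sketch that assumes the same two lists as the cell above; `TYPE_TO_GROUP` and `group_type` are illustrative names that do not appear in the notebook.

```python
# Hypothetical helper names; the two type lists mirror the cell above.
reliable_types = ['clickbait', 'reliable', 'political']
fake_types = ['unreliable', 'fake', 'conspiracy', 'bias', 'hate', 'junksci']

# Build one lookup table from fine-grained type to binary label.
TYPE_TO_GROUP = {t: 'reliable' for t in reliable_types}
TYPE_TO_GROUP.update({t: 'fake' for t in fake_types})

def group_type(article_type):
    """Return 'reliable' or 'fake', or None for a type that was not assigned."""
    return TYPE_TO_GROUP.get(article_type)

sample = ['clickbait', 'junksci', 'political', 'hate']
grouped = [group_type(t) for t in sample]
print(grouped)
```

With pandas, the same table could be applied in one step with `df['type'].map(TYPE_TO_GROUP)`.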

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

In [91]:
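# Count how many articles fall into each class.
reliable_amount = 0
fake_amount = 0
for label in df['type']:
    if label == 'reliable':
        reliable_amount += 1
    else:
        fake_amount += 1

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
# len(df) avoids nesting single quotes inside the f-string, which is a syntax error before Python 3.12.
print(f'Reliable percentage: {reliable_amount / len(df) * 100:.2f}%')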

Reliable amount: 27
Fake amount: 205
Reliable percentage: 11.64%
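
A compact way to get the full distribution is to count the labels and divide by the total. The sketch below uses `collections.Counter` on a plain list rebuilt from the counts printed above; `class_distribution` is an illustrative name. In pandas, `df['type'].value_counts(normalize=True)` gives the same fractions.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: percentage of all labels}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Rebuild the label column from the counts printed above (27 reliable, 205 fake).
labels = ['reliable'] * 27 + ['fake'] * 205
distribution = class_distribution(labels)
for label, pct in distribution.items():
    print(f'{label}: {pct:.2f}%')
```

With roughly 12% reliable articles the dataset is clearly imbalanced: a classifier that always predicts 'fake' would already reach about 88% accuracy, so accuracy alone would be a misleading metric here.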


# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install selenium (pip install selenium). Create a new Jupyter Notebook and import this module:

In [106]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests


In [93]:
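firefox_options = Options()
# Run Firefox without opening a window.
firefox_options.add_argument('--headless')
# The remaining flags are Chromium-style switches, kept from the original setup.
firefox_options.add_argument('--disable-gpu')
firefox_options.add_argument('--no-sandbox')
firefox_options.add_argument('--disable-dev-shm-usage')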

In [94]:
browser = webdriver.Firefox(options=firefox_options)
browser.get('https://www.bbc.com/news/world/europe')

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [95]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text
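
Before parsing `contents`, it can help to confirm that the response really is an HTML document and not, for example, an error payload. The check below is only a heuristic sketch; `looks_like_html` is a hypothetical helper, and it is demonstrated on inline strings so that it runs without network access.

```python
def looks_like_html(text):
    """Heuristic check: does the text start like an HTML document?"""
    head = text.lstrip()[:200].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')

print(looks_like_html('<!DOCTYPE html><html><head></head><body></body></html>'))
print(looks_like_html('{"error": "not found"}'))
```

In the notebook this would be called as `looks_like_html(contents)`, ideally after `response.raise_for_status()` so that HTTP errors surface early.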

## 3. Extract Articles:

### Selenium allows us to extract information easily. You can read the documentation here. Write a function to extract all articles from the page using the find_elements method. For each article, retrieve the headline, the summary, and the link to the article (href). (Hint: You might have to inspect the page for classes, ids, etc. that might be useful for extracting the information)

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: See buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: Click on the button using .click(). You may also have to add a delay for the page to load using time.sleep(n_seconds))
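
One possible shape for the multi-page loop is sketched below. It is written against two callables rather than a real browser so that it runs offline: `collect_from_pages`, `get_links`, and `go_to_next_page` are hypothetical names, and the fake pages stand in for the site. In the notebook, `get_links` could be `url_finder` and `go_to_next_page` a function that locates the next-page button and calls `.click()`; the button's selector is left out because it depends on the page markup.

```python
import time

def collect_from_pages(get_links, go_to_next_page, max_pages, delay_seconds=2.0):
    """Collect links page by page.

    get_links()        -> list of links on the current page
    go_to_next_page()  -> True if a next page was opened, False if there is none
    """
    all_links = []
    for _ in range(max_pages):
        all_links.extend(get_links())
        if not go_to_next_page():
            break
        # Give the next page time to load before reading it.
        time.sleep(delay_seconds)
    return all_links

# Stand-in for a browser: three fake pages of links.
pages = [['a', 'b'], ['c'], ['d', 'e']]
state = {'page': 0}

def fake_get_links():
    return pages[state['page']]

def fake_next():
    if state['page'] + 1 >= len(pages):
        return False
    state['page'] += 1
    return True

links = collect_from_pages(fake_get_links, fake_next, max_pages=10, delay_seconds=0)
print(links)
```

The `max_pages` cap is a safety net in case the next-page check never returns `False`.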

In [96]:
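def url_finder():
    # Collect the href of every element carrying the link class on the current page.
    urls = [element.get_attribute('href') for element in browser.find_elements(By.CLASS_NAME, 'sc-2e6baa30-0')]
    # Keep only article links; get_attribute returns None for elements without an href.
    urls = [url for url in urls if url and 'news/articles' in url]
    return urls

print(f'URL amount: {len(url_finder())}')
url_finder()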

URL amount: 24


['https://www.bbc.com/news/articles/cpq222rqv4po',
 'https://www.bbc.com/news/articles/cg4kkv3e1v9o',
 'https://www.bbc.com/news/articles/cn7vg0nvzkko',
 'https://www.bbc.com/news/articles/c70eky7l6pxo',
 'https://www.bbc.com/news/articles/cdrxy1zp8mxo',
 'https://www.bbc.com/news/articles/cg4kkv3e1v9o',
 'https://www.bbc.com/news/articles/c17qe11wy52o',
 'https://www.bbc.com/news/articles/cx2rreg04dpo',
 'https://www.bbc.com/news/articles/cn7vg0nvzkko',
 'https://www.bbc.com/news/articles/clyrkkv4gd7o',
 'https://www.bbc.com/news/articles/cvgd9v3r69qo',
 'https://www.bbc.com/news/articles/cvg592557vgo',
 'https://www.bbc.com/news/articles/c7435pnle0go',
 'https://www.bbc.com/news/articles/cx29pnzy1l7o',
 'https://www.bbc.com/news/articles/c0l1w1w41xzo',
 'https://www.bbc.com/news/articles/cy9dl4drr8lo',
 'https://www.bbc.com/news/articles/cpq222rqv4po',
 'https://www.bbc.com/news/articles/c23444ddzjyo',
 'https://www.bbc.com/news/articles/c0l11gr35gwo',
 'https://www.bbc.com/news/arti
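
The list above contains repeats (for example, the `cg4kkv3e1v9o` link appears twice among the first six entries), so the 24 links are not 24 distinct articles. A minimal sketch for dropping repeats while keeping the original order follows; `unique_in_order` is an illustrative name.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))

# First six links from the output above; 'cg4kkv3e1v9o' appears twice.
sample = [
    'https://www.bbc.com/news/articles/cpq222rqv4po',
    'https://www.bbc.com/news/articles/cg4kkv3e1v9o',
    'https://www.bbc.com/news/articles/cn7vg0nvzkko',
    'https://www.bbc.com/news/articles/c70eky7l6pxo',
    'https://www.bbc.com/news/articles/cdrxy1zp8mxo',
    'https://www.bbc.com/news/articles/cg4kkv3e1v9o',
]
unique_urls = unique_in_order(sample)
print(len(sample), len(unique_urls))
```

Deduplicating before scraping also avoids downloading the same article more than once.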

In [97]:
def headline_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('h1').text
    return headline

headlines = [headline_finder(url) for url in url_finder()]

print(f'Headlines amount: {len(headlines)}')
headlines

Headlines amount: 24


['Tate brothers allowed to leave Romania for US',
 'Why are the Tate brothers heading to the US?',
 'Zelensky to meet Trump in Washington to sign minerals deal',
 'Romanian far-right presidential hopeful detained on street and indicted',
 'Bosnian-Serb leader sentenced to jail in landmark trial',
 'Why are the Tate brothers heading to the US?',
 'Leaked recordings challenge Greek account of deadly shipwreck',
 "North Korea has sent more troops to Russia, South's spy agency says",
 'Zelensky to meet Trump in Washington to sign minerals deal',
 'Starmer cuts aid to fund increase in defence spending',
 'Tesla shares slump after European sales fall',
 "Macron walks tightrope with Trump as he makes Europe's case on Ukraine",
 'US sides with Russia in UN resolutions on Ukraine ',
 "Spanish city 'adopts' migrants who intervened in homophobic attack",
 'Can Europe still count on the US coming to its defence?',
 'German politics froze out the far right for years – is this about to change?',
 'T

In [98]:
def summary_finder(link):
    response = requests.get(link)
    # The first <p> of the page is treated as the article summary.
    summary = BeautifulSoup(response.content, 'html.parser').find('p').text
    return summary

summaries = [summary_finder(url) for url in url_finder()]

print(f'Summary amount: {len(summaries)}')
summaries

Summary amount: 24


['British-American influencers Andrew and Tristan Tate - who are facing trial in Romania on charges of rape, trafficking minors and money laundering - have left the country after prosecutors lifted a two-year travel ban.',
 'Controversial influencer Andrew Tate and his brother Tristan are travelling to the US after leaving Romania, where they face charges of rape, trafficking minors and money laundering, all of which they deny.',
 "Ukrainian President Volodymyr Zelensky will meet US President Donald Trump in Washington on Friday to sign an agreement on sharing his country's mineral resources, Trump has said. ",
 "Far-right populist Calin Georgescu, who came from nowhere to win the first round of last year's presidential election, has been detained by police and is facing criminal proceedings on a series of charges.",
 'A one-year prison sentence and a six-year ban on holding public office might seem like a heavy penalty for a politician. ',
 'Controversial influencer Andrew Tate and hi

## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: US & Canada, UK, Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 800 article links.
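
One way to organise the extra regions is a table of section paths that is turned into full URLs. In the sketch below only the Europe path comes from this notebook; every other path is an assumption about the site's URL scheme and should be checked in a browser first. `SECTION_PATHS` and `section_urls` are illustrative names.

```python
BASE_URL = 'https://www.bbc.com/news'

# Only the Europe path is taken from this notebook; every other path is an
# assumption and should be verified in the browser before scraping.
SECTION_PATHS = {
    'Europe': 'world/europe',
    'US & Canada': 'us-canada',
    'UK': 'uk',
    'Australia': 'world/australia',
    'Asia': 'world/asia',
    'Africa': 'world/africa',
    'Latin America': 'world/latin_america',
    'Middle East': 'world/middle_east',
}

def section_urls(base_url=BASE_URL, paths=SECTION_PATHS):
    """Return {region name: full section URL}."""
    return {region: f'{base_url}/{path}' for region, path in paths.items()}

for region, url in section_urls().items():
    print(f'{region}: {url}')
```

Each URL could then be passed to `browser.get(...)` before calling `url_finder()`.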

## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).

In [99]:
def combiner(link):
    return {'url': link, 'headline': headline_finder(link), 'summary': summary_finder(link)}

df = pd.DataFrame([combiner(url) for url in url_finder()])
df.to_csv('scraped_news.csv', index=False)
display(df)

|   | url | headline | summary |
|---|---|---|---|
| 0 | https://www.bbc.com/news/articles/cpq222rqv4po | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... |
| 1 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... |
| 2 | https://www.bbc.com/news/articles/cn7vg0nvzkko | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... |
| 3 | https://www.bbc.com/news/articles/c70eky7l6pxo | Romanian far-right presidential hopeful detain... | Far-right populist Calin Georgescu, who came f... |
| 4 | https://www.bbc.com/news/articles/cdrxy1zp8mxo | Bosnian-Serb leader sentenced to jail in landm... | A one-year prison sentence and a six-year ban ... |
| 5 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... |
| 6 | https://www.bbc.com/news/articles/c17qe11wy52o | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... |
| 7 | https://www.bbc.com/news/articles/cx2rreg04dpo | North Korea has sent more troops to Russia, So... | North Korea has sent more soldiers to Russia a... |
| 8 | https://www.bbc.com/news/articles/cn7vg0nvzkko | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... |
| 9 | https://www.bbc.com/news/articles/clyrkkv4gd7o | Starmer cuts aid to fund increase in defence s... | Sir Keir Starmer has set out plans to increase... |


# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

In [100]:
def content_finder(link):
    response = requests.get(link)
    # Join the text of every <p> element into one string.
    paragraphs = BeautifulSoup(response.content, 'html.parser').find_all('p')
    main_text = ' '.join([p.text for p in paragraphs])
    return main_text

content = [content_finder(url) for url in url_finder()]

print(f'Content amount: {len(content)}')
content

Content amount: 24


['British-American influencers Andrew and Tristan Tate - who are facing trial in Romania on charges of rape, trafficking minors and money laundering - have left the country after prosecutors lifted a two-year travel ban. Andrew Tate, 38, and his brother Tristan, 36, left Bucharest on a private jet early on Thursday. Both deny the allegations against them. Romanian prosecutors stressed the case against them had not been dropped and they would be expected to return, understood to be at the end of next month. However, their decision has raised concern that they came under pressure from leading figures in the Trump administration, with one leading Romanian politician saying she was outraged. The brothers are facing separate, unrelated charges in the UK over allegations of rape and human trafficking, which they also deny. A separate, civil case has been opened in the US. Romanian reports said they were heading to Fort Lauderdale in Florida on a non-stop flight that would take 12 hours, howe

## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.

In [101]:
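def publish_date_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    # The first <time> element carries the ISO 8601 timestamp; return None if it is missing.
    time_tag = soup.find('time')
    date = time_tag.get('datetime') if time_tag else None
    return date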

publish_dates = [publish_date_finder(url) for url in url_finder()]

print(f'Publish date amount: {len(publish_dates)}')
publish_dates

Publish date amount: 24


['2025-02-27T13:09:39.367Z',
 '2025-02-27T12:19:05.853Z',
 '2025-02-27T02:03:10.229Z',
 '2025-02-26T17:11:57.047Z',
 '2025-02-26T21:03:11.477Z',
 '2025-02-27T12:19:05.853Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-27T12:54:52.742Z',
 '2025-02-27T02:03:10.229Z',
 '2025-02-26T02:23:47.711Z',
 '2025-02-26T07:53:15.992Z',
 '2025-02-25T01:47:56.921Z',
 '2025-02-25T12:39:17.669Z',
 '2025-02-25T11:16:13.166Z',
 '2025-02-26T00:00:32.454Z',
 '2025-02-25T13:08:08.773Z',
 '2025-02-27T13:09:39.367Z',
 '2025-02-27T08:55:24.894Z',
 '2025-02-27T05:12:03.716Z',
 '2025-02-27T00:23:25.454Z',
 '2025-02-26T21:10:02.468Z',
 '2025-02-26T21:03:11.477Z',
 '2025-02-26T17:24:17.617Z',
 '2025-02-26T17:11:57.047Z']
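
The timestamps above are strings. For sorting or computing time differences they can be converted to timezone-aware `datetime` objects. The sketch below assumes every value follows the pattern shown in the output (millisecond precision, trailing `Z` for UTC); `parse_publish_date` is an illustrative name.

```python
from datetime import datetime, timezone

def parse_publish_date(raw):
    """Parse a timestamp such as '2025-02-27T13:09:39.367Z' into an aware UTC datetime."""
    parsed = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsed.replace(tzinfo=timezone.utc)

# Two values copied from the output above.
first = parse_publish_date('2025-02-27T13:09:39.367Z')
second = parse_publish_date('2025-02-26T17:11:57.047Z')
print(first.isoformat())
print(first - second)
```

A value that does not match the pattern raises `ValueError`, which is a useful signal that an article uses a different date format.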

In [102]:
def author_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    # 'kItaYD' was found by inspecting the page; it looks auto-generated, so it may change.
    author_tag = soup.find('span', class_='kItaYD')
    if author_tag:
        return author_tag.text.strip()
    return 'Unknown'

authors = [author_finder(url) for url in url_finder()]

print(f'Author amount: {len(authors)}')
authors

Author amount: 24


['Nick Thorpe, Mircea Barbu & Paul Kirby',
 'Unknown',
 'James Gregory',
 'Paul Kirby & Mircea Barbu',
 'Guy Delauney',
 'Unknown',
 'Nick Beake',
 'Kathryn Armstrong',
 'James Gregory',
 'Joshua Nevett and Sam Francis',
 'Tom Espiner',
 "Gary O'Donoghue",
 'James Landale',
 'Frances Mao',
 'Frances Mao',
 'Paul Kirby & Kristina Volk',
 'Nick Thorpe, Mircea Barbu & Paul Kirby',
 'Unknown',
 'Sarah Rainsford',
 'Nick Beake',
 'Unknown',
 'Guy Delauney',
 'Maia Davies',
 'Paul Kirby & Mircea Barbu']
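
Several bylines above hold more than one name, joined by commas, `&`, or `and`. If individual authors are needed later, the byline can be split into a list. This is a sketch based only on the separators visible in the output; `split_authors` is an illustrative name.

```python
import re

def split_authors(byline):
    """Split a byline such as 'A, B & C' or 'A and B' into a list of names."""
    if not byline or byline == 'Unknown':
        return []
    parts = re.split(r'\s*(?:,|&|\band\b)\s*', byline)
    return [part for part in parts if part]

# Bylines copied from the output above.
for byline in ['Nick Thorpe, Mircea Barbu & Paul Kirby',
               'Joshua Nevett and Sam Francis',
               "Gary O'Donoghue",
               'Unknown']:
    print(split_authors(byline))
```

The `\band\b` pattern only matches the standalone word, so a name that merely contains the letters "and" is left intact.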

In [103]:
def combiner(link):
    return {
        'url': link,
        'publish_date': publish_date_finder(link),
        'author': author_finder(link),
        'headline': headline_finder(link),
        'summary': summary_finder(link),
        'content': content_finder(link),
    }
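
`combiner` calls five finder functions, and each of them issues its own `requests.get` for the same URL, so every article is downloaded five times. One way to avoid this is to cache the download. The sketch below wraps an arbitrary fetch function with `functools.lru_cache`; `make_cached_fetcher` and the fake fetch function are hypothetical and only serve to make the example run offline.

```python
from functools import lru_cache

def make_cached_fetcher(fetch):
    """Wrap a fetch(url) -> html function so each URL is downloaded only once."""
    @lru_cache(maxsize=None)
    def cached_fetch(url):
        return fetch(url)
    return cached_fetch

# Stand-in for a real download so the sketch runs offline.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f'<html><h1>Page for {url}</h1></html>'

fetch_once = make_cached_fetcher(fake_fetch)
for _ in range(5):  # five finders asking for the same page
    html = fetch_once('https://www.bbc.com/news/articles/cpq222rqv4po')
print(len(calls))
```

In the notebook the wrapped function could be `lambda url: requests.get(url).content`, with each finder calling the cached fetcher instead of `requests.get` directly.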

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents (May take a long time. So try on a smaller subset to start with). Remember to implement error handling and possibly introduce delays to avoid being blocked.

In [104]:
df = pd.DataFrame([combiner(url) for url in url_finder()])
display(df)

|   | url | publish_date | author | headline | summary | content |
|---|---|---|---|---|---|---|
| 0 | https://www.bbc.com/news/articles/cpq222rqv4po | 2025-02-27T13:09:39.367Z | Nick Thorpe, Mircea Barbu & Paul Kirby | Tate brothers allowed to leave Romania for US | British-American influencers Andrew and Trista... | British-American influencers Andrew and Trista... |
| 1 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | 2025-02-27T12:19:05.853Z | Unknown | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... | Controversial influencer Andrew Tate and his b... |
| 2 | https://www.bbc.com/news/articles/cn7vg0nvzkko | 2025-02-27T02:03:10.229Z | James Gregory | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... | Ukrainian President Volodymyr Zelensky will me... |
| 3 | https://www.bbc.com/news/articles/c70eky7l6pxo | 2025-02-26T17:11:57.047Z | Paul Kirby & Mircea Barbu | Romanian far-right presidential hopeful detain... | Far-right populist Calin Georgescu, who came f... | Far-right populist Calin Georgescu, who came f... |
| 4 | https://www.bbc.com/news/articles/cdrxy1zp8mxo | 2025-02-26T21:03:11.477Z | Guy Delauney | Bosnian-Serb leader sentenced to jail in landm... | A one-year prison sentence and a six-year ban ... | A one-year prison sentence and a six-year ban ... |
| 5 | https://www.bbc.com/news/articles/cg4kkv3e1v9o | 2025-02-27T12:19:05.853Z | Unknown | Why are the Tate brothers heading to the US? | Controversial influencer Andrew Tate and his b... | Controversial influencer Andrew Tate and his b... |
| 6 | https://www.bbc.com/news/articles/c17qe11wy52o | 2025-02-27T00:23:25.454Z | Nick Beake | Leaked recordings challenge Greek account of d... | Leaked audio instructions by Greek rescue co-o... | Leaked audio instructions by Greek rescue co-o... |
| 7 | https://www.bbc.com/news/articles/cx2rreg04dpo | 2025-02-27T12:54:52.742Z | Kathryn Armstrong | North Korea has sent more troops to Russia, So... | North Korea has sent more soldiers to Russia a... | North Korea has sent more soldiers to Russia a... |
| 8 | https://www.bbc.com/news/articles/cn7vg0nvzkko | 2025-02-27T02:03:10.229Z | James Gregory | Zelensky to meet Trump in Washington to sign m... | Ukrainian President Volodymyr Zelensky will me... | Ukrainian President Volodymyr Zelensky will me... |
| 9 | https://www.bbc.com/news/articles/clyrkkv4gd7o | 2025-02-26T02:23:47.711Z | Joshua Nevett and Sam Francis | Starmer cuts aid to fund increase in defence s... | Sir Keir Starmer has set out plans to increase... | Sir Keir Starmer has set out plans to increase... |
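
The scraping cell above stops at the first article that fails to download or parse. A loop with error handling, retries, and a pause could look like the sketch below. `scrape_all` and its parameters are hypothetical names, and a fake scraper replaces `combiner` so that the example runs offline.

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=1.0, max_retries=2):
    """Scrape every URL, retrying failures and pausing after each URL.

    Returns (results, failed) where failed maps URL -> last error message.
    """
    results, failed = [], {}
    for url in urls:
        for attempt in range(max_retries + 1):
            try:
                results.append(scrape_one(url))
                failed.pop(url, None)
                break
            except Exception as error:  # broad on purpose: one bad page must not stop the run
                failed[url] = str(error)
        time.sleep(delay_seconds)
    return results, failed

# Offline stand-in: 'bad' always fails, 'flaky' succeeds on the second try.
attempts = {}

def fake_scrape(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == 'bad' or (url == 'flaky' and attempts[url] == 1):
        raise ValueError(f'could not parse {url}')
    return {'url': url}

results, failed = scrape_all(['ok', 'flaky', 'bad'], fake_scrape, delay_seconds=0)
print(results)
print(failed)
```

In the notebook, `scrape_one` would be `combiner` and `results` could be passed straight to `pd.DataFrame(...)`; the `failed` dictionary records which links to revisit.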


## 4. Data Storage:

### Save the scraped article data to a file.

In [105]:
df.to_csv('scraped_news.csv', index=False)

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.
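
One statistic that could support the discussion is how the class balance would shift if the scraped articles were added. The sketch below reuses the Part 1 counts and assumes that every scraped BBC article would be labelled 'reliable'; that labelling, and the function name `reliable_share`, are assumptions made for illustration.

```python
def reliable_share(reliable, fake, added_reliable=0):
    """Percentage of reliable articles after adding `added_reliable` new ones."""
    total = reliable + fake + added_reliable
    return 100 * (reliable + added_reliable) / total

# Counts from Part 1: 27 reliable, 205 fake.
before = reliable_share(27, 205)
# Assumption: all 24 scraped rows are kept and labelled 'reliable'.
after_24 = reliable_share(27, 205, added_reliable=24)
# Assumption: the ~800 links mentioned in Part 2 are all scraped and labelled 'reliable'.
after_800 = reliable_share(27, 205, added_reliable=800)
print(f'{before:.2f}% -> {after_24:.2f}% (24 added), {after_800:.2f}% (800 added)')
```

Under these assumptions, 24 extra articles would only partly close the gap, while around 800 would tip the imbalance the other way. All added reliable articles would also come from a single source, so a model might learn to recognise that source's style instead of reliability.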

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!