# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focus on acquiring raw data, processing data, cleaning and exploratory data analysis, and structured representation and storage of data.

# Submission Requirements

### Jupyter Notebook (.ipynb file) implementing the assignment. 

### PDF printout of the executed Jupyter Notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from the last assignment

I use pandas to import the data.

In [12]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

### Determine which article types should be omitted, if any.

First, I look at which categories we have.

In [13]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

Unnamed: 0,types
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political
9,


I keep all types except NaN and unknown.

In [14]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
df['type'] = df['type'].apply(lambda x: x if x in wanted_types else None)
df = df.dropna(subset=['type'])
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

Unnamed: 0,0
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political


### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

I have chosen to include the categories that do not contain direct misinformation as part of the reliable category.

In [15]:
reliable_types = ['clickbait', 'reliable', 'political', 'bias', 'hate']
fake_types = ['unreliable', 'fake', 'conspiracy', 'junksci']

# Use `label` as the loop variable to avoid shadowing the built-in `type`
for label in reliable_types:
    df['type'] = df['type'].replace(label, 'reliable')

for label in fake_types:
    df['type'] = df['type'].replace(label, 'fake')
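
The same grouping can also be written as a single lookup table, which keeps the decision for each type in one place. This is a minimal sketch rather than the code used above; `TYPE_TO_GROUP` and `group_label` are hypothetical names, and a plain Python list stands in for the DataFrame column.

```python
# Hypothetical lookup table: one entry per original type, mirroring the
# reliable_types / fake_types lists above.
TYPE_TO_GROUP = {
    'clickbait': 'reliable', 'reliable': 'reliable', 'political': 'reliable',
    'bias': 'reliable', 'hate': 'reliable',
    'unreliable': 'fake', 'fake': 'fake', 'conspiracy': 'fake', 'junksci': 'fake',
}

def group_label(article_type):
    """Return 'reliable' or 'fake', or None for a type that is not in the table."""
    return TYPE_TO_GROUP.get(article_type)

# 'satire' is an illustrative type that is not part of the table.
sample_types = ['hate', 'junksci', 'reliable', 'satire']
grouped = [group_label(t) for t in sample_types]
print(grouped)
```

On the DataFrame, the same table could be applied with `df['type'].map(TYPE_TO_GROUP)`, which leaves unmapped types as NaN.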

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

We count the number of articles in each category.

In [16]:
reliable_amount = 0
fake_amount = 0
for label in df['type']:
    if label == 'reliable':
        reliable_amount += 1
    else:
        fake_amount += 1

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
# Double quotes outside, so the nested df['type'] also parses on Python versions before 3.12
print(f"Reliable percentage: {reliable_amount / len(df['type']) * 100:.2f}%")

Reliable amount: 34
Fake amount: 198
Reliable percentage: 14.66%


We see that our sample contains only 14.66% reliable articles.

I believe it is important to balance the distribution. With a balanced dataset the model has enough examples of each class to train on, and we avoid misclassifications caused by a lack of data for the minority class.
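
To make the imbalance concrete, the class shares and the accuracy of a trivial majority-class predictor can be computed from the counts printed above. This is a small sketch using only the standard library; `class_shares` is a hypothetical helper name.

```python
from collections import Counter

def class_shares(labels):
    """Return the share of each class label as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Rebuild the label column from the counts reported above (34 reliable, 198 fake).
labels = ['reliable'] * 34 + ['fake'] * 198
shares = class_shares(labels)

# A model that always predicts the largest class is right this often.
majority_baseline = max(shares.values())

print(f"Reliable share: {shares['reliable']:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

With these counts, a classifier that always answers 'fake' already reaches about 85% accuracy, so accuracy alone would be a misleading metric on this sample.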

# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install selenium (pip install selenium). Create a new Jupyter Notebook and import this module:

In [17]:
from selenium import webdriver
from selenium.webdriver.common.by import By 
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

I had problems getting selenium to run on large amounts of data, so I have also chosen to use BeautifulSoup to scrape the articles themselves, while I use selenium to navigate between the pages.

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [35]:
browser = webdriver.Firefox()
browser.get('https://www.bbc.com/news/world/europe')

## 3. Extract Articles:

### Selenium allows us to extract information easily. You can read the documentation here. Write a function to extract all articles from the page using the find_elements method. For each article, retrieve the headline, the summary, and the link to the article (href). (Hint: You might have to inspect the page for classes, ids, etc. that might be useful for extracting the information)

I use an XPath expression to find the link elements in the HTML.

In [19]:
def url_finder(site):
    urls = [url.get_attribute('href') for url in site.find_elements(By.XPATH, "//a[@href]")]
    urls = [url for url in urls if url.find('news/articles') != -1]
    return urls

urls = url_finder(browser)

print(f'URL amount: {len(url_finder(browser))}')
urls[0:10]

URL amount: 26


['https://www.bbc.com/news/articles/cx2gg8le1kpo',
 'https://www.bbc.com/news/articles/c3e44qev1dvo',
 'https://www.bbc.com/news/articles/ce982zpz1k3o',
 'https://www.bbc.com/news/articles/cn527pz54neo',
 'https://www.bbc.com/news/articles/c0eggy1104po',
 'https://www.bbc.com/news/articles/c3e44qev1dvo',
 'https://www.bbc.com/news/articles/czxnnzz558eo',
 'https://www.bbc.com/news/articles/cd655917g6qo',
 'https://www.bbc.com/news/articles/ce982zpz1k3o',
 'https://www.bbc.com/news/articles/c0q1188p1n2o']
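
The list above contains the same article more than once, because a page can link to an article from several places. A possible refinement, sketched here under the assumption that the hrefs have already been collected as plain strings, is to filter and deduplicate in one step while preserving order. `filter_article_urls` is a hypothetical helper, not part of the notebook.

```python
def filter_article_urls(hrefs, marker='news/articles'):
    """Keep hrefs that contain `marker`, dropping None values and repeats.

    The order of first appearance is preserved.
    """
    seen = set()
    result = []
    for href in hrefs:
        if href is None or marker not in href:
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result

hrefs = [
    'https://www.bbc.com/news/articles/cx2gg8le1kpo',
    'https://www.bbc.com/news/world/europe',
    'https://www.bbc.com/news/articles/c3e44qev1dvo',
    None,
    'https://www.bbc.com/news/articles/c3e44qev1dvo',
]
article_urls = filter_article_urls(hrefs)
print(article_urls)
```

Deduplicating this early would also avoid downloading the same article twice in the headline and summary steps.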

In [20]:
def headline_finder(link):
    response = requests.get(link)
    headline = BeautifulSoup(response.content, 'html.parser').find('h1').text
    return headline

headlines = [headline_finder(url) for url in url_finder(browser)]

print(f'Headlines amount: {len(headlines)}')
headlines[0:10]

Headlines amount: 26


['Greeks hold mass protests demanding justice after train tragedy',
 'Minister questions why Tate brothers were allowed to leave Romania',
 'Czech firefighters tackle large toxic train fire',
 'What we know about US-Ukraine minerals deal',
 'Free trade deal with India could come this year - EU Commission chief',
 'Minister questions why Tate brothers were allowed to leave Romania',
 'Dozens arrested in global hit against AI-generated child abuse',
 "Pope has 'isolated' breathing crisis in hospital, Vatican says",
 'Czech firefighters tackle large toxic train fire',
 'Austrian centrists agree government deal sidelining far right']

In [21]:
def summary_finder(link):
    response = requests.get(link)
    # The first <p> element of the article page is used as the summary
    summary = BeautifulSoup(response.content, 'html.parser').find('p').text
    return summary

summaries = [summary_finder(url) for url in url_finder(browser)]

print(f'Summary amount: {len(summaries)}')
summaries[0:10]

Summary amount: 26


['Greeks have held their largest protests in years and took part in a general strike to mark the second anniversary of a rail disaster that left 57 dead and dozens more injured.',
 "Romania's Justice Minister Radu Marinescu has called for a public explanation into why controversial social media influencers Andrew and Tristan Tate were allowed to leave the country on Thursday.",
 'A freight train carrying the highly toxic chemical benzene has derailed in the Czech Republic, sparking a huge fire.',
 "Ukraine's President Volodymyr Zelensky met US President Trump in Washington on Friday to sign an agreement that would give the US access to its deposits of rare earth minerals.",
 'The head of the European Commission, Ursula von der Leyen said EU and India were pushing to get a free trade agreement during this year.',
 "Romania's Justice Minister Radu Marinescu has called for a public explanation into why controversial social media influencers Andrew and Tristan Tate were allowed to leave th

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: See buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: Click on button using .click(). You may also have to add a delay for the page to load using time.sleep(n_seconds))

We click on the page to accept cookies. After that, we click through to each page and collect all the links we find in the HTML.

In [22]:
WebDriverWait(browser, 25).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_1192447")))
WebDriverWait(browser, 25).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'I agree')]"))).click()

browser.switch_to.default_content()
# click() returns None, so its result is not stored
browser.find_element(By.ID, "bbccookies-continue-button").click()

In [23]:
output_urls = []
page_amount = 20
for i in range(1, page_amount):
    try:
        output_urls.extend(url_finder(browser))
        nextPage = browser.find_element(By.XPATH, f"//button[contains(text(),'{i}')]")
        nextPage.click()
    except Exception:
        break

unique_urls = list(dict.fromkeys(output_urls))
print(f'Unique URLs in Europe: {len(unique_urls)}')

Unique URLs in Europe: 78
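
The loop above stops silently on the first exception and does not wait for a page to load after clicking. One way to make that behavior explicit is sketched below. The browser interaction is passed in as two callables so the control flow can be read and tried without Selenium; `collect_paginated`, `get_links`, and `go_to_page` are hypothetical names. In the notebook, `get_links` would wrap `url_finder(browser)` and `go_to_page` would wrap the button click.

```python
import time

def collect_paginated(get_links, go_to_page, max_pages, delay_seconds=1.0):
    """Collect links from the current page, then from pages 2..max_pages.

    get_links()    -> list of links on the page currently shown
    go_to_page(n)  -> navigates to page n, raises if that page does not exist
    Returns (unique_links, pages_visited).
    """
    links = list(get_links())
    pages_visited = 1
    for page_number in range(2, max_pages + 1):
        try:
            go_to_page(page_number)
        except Exception:
            # No such page button: assume we reached the last page.
            break
        time.sleep(delay_seconds)  # give the new page time to load
        links.extend(get_links())
        pages_visited += 1
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(links)), pages_visited

# Stand-in for a site with three pages, used only to exercise the function.
fake_pages = {1: ['a', 'b'], 2: ['b', 'c'], 3: ['d']}
state = {'page': 1}

def fake_get_links():
    return fake_pages[state['page']]

def fake_go_to_page(n):
    if n not in fake_pages:
        raise LookupError(f'no page {n}')
    state['page'] = n

links, visited = collect_paginated(fake_get_links, fake_go_to_page,
                                   max_pages=20, delay_seconds=0)
print(links, visited)
```

Returning the number of pages visited makes it easy to notice when the loop ended earlier than expected.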


## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: US & Canada, UK, Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 800 article links.

In [24]:
subsites = ["https://www.bbc.com/news/us-canada", 
    "https://www.bbc.com/news/world/africa", 
    "https://www.bbc.com/news/world/asia", 
    "https://www.bbc.com/news/world/europe", 
    "https://www.bbc.com/news/world/latin_america", 
    "https://www.bbc.com/news/world/middle_east", 
    "https://www.bbc.com/news/world/australia",
    "https://www.bbc.com/news/uk"]

In [25]:
output_urls = []
page_amount = 20
for subsite in subsites:
    browser.get(subsite)
    for i in range(1, page_amount):
        try:
            output_urls.extend(url_finder(browser))
            nextPage = browser.find_element(By.XPATH, f"//button[contains(text(),'{i}')]")
            nextPage.click()
        except Exception:
            break

In [26]:
unique_urls = list(dict.fromkeys(output_urls))
print(f'Urls found: {len(unique_urls)}')
unique_urls[0:10]

Urls found: 707


['https://www.bbc.com/news/articles/cn9v1l80350o',
 'https://www.bbc.com/news/articles/c62zzd3zp50o',
 'https://www.bbc.com/news/articles/cdell8n14x2o',
 'https://www.bbc.com/news/articles/c9dejydynngo',
 'https://www.bbc.com/news/articles/cqlyy1rld0ko',
 'https://www.bbc.com/news/articles/cn7vxlrvxyeo',
 'https://www.bbc.com/news/articles/clydd7zeye7o',
 'https://www.bbc.com/news/articles/cvgee7rl24ro',
 'https://www.bbc.com/news/articles/c1kjj032d8do',
 'https://www.bbc.com/news/articles/cedll3282qzo']

The 707 links found are roughly 12% short of the expected 800 articles, which I consider good enough.

## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).

In [27]:
pd.DataFrame(unique_urls, columns=['urls']).to_csv('bbc_urls.csv', index=False)

# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

In [37]:
def content_finder(url):
    response = requests.get(url)
    paragraphs = BeautifulSoup(response.content, 'html.parser').find_all('p')
    main_text = ' '.join([p.text for p in paragraphs])
    main_text = main_text.replace('\n', ' ')
    return main_text

content = [content_finder(url) for url in url_finder(browser)]

print(f'Content amount: {len(content)}')

Content amount: 26


## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

The hardest part, I found, was getting hold of the authors, but I found a solution, which can be seen further down where I define the function.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.

I use the same method as before.

In [29]:
def publish_date_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    date = soup.find('time')['datetime']
    return date

publish_dates = [publish_date_finder(url) for url in url_finder(browser)]

print(f'Publish date amount: {len(publish_dates)}')
publish_dates[0:10]

Publish date amount: 23


['2025-02-28T17:06:10.619Z',
 '2025-02-28T16:37:33.805Z',
 '2025-02-28T20:39:11.634Z',
 '2025-02-28T18:51:58.465Z',
 '2025-02-28T15:37:02.753Z',
 '2025-02-28T16:37:33.805Z',
 '2025-02-28T16:21:43.835Z',
 '2025-02-28T13:04:28.913Z',
 '2025-02-28T20:39:11.634Z',
 '2025-02-28T17:46:36.227Z']

In [30]:
def author_finder(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    author_tag = soup.find('span', 'kItaYD')
    if author_tag:
        return author_tag.text.strip()
    return 'Unknown'

authors = [author_finder(url) for url in url_finder(browser)]

print(f'Author amount: {len(authors)}')
authors[0:10]

Author amount: 23


['Kate Whannel',
 'Iain Watson',
 'Tom Symonds',
 'Unknown',
 'Martin Eastaugh',
 'Iain Watson',
 'Clara Bullock',
 'Emma Stanley',
 'Tom Symonds',
 'Sam Francis']

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents (May take a long time. So try on a smaller subset to start with). Remember to implement error handling and possibly introduce delays to avoid being blocked.

I have written a function that runs all the previous functions on the given URLs.

In [31]:
def function_combiner(url):
    return {'url': url, 'publish_date': publish_date_finder(url), 'author': author_finder(url), 'headline': headline_finder(url), 'summary': summary_finder(url), 'content': content_finder(url)}

We run all unique URLs through our function combiner and store the results in a DataFrame. Afterwards, Firefox is closed.

In [32]:
df = pd.DataFrame([function_combiner(url) for url in unique_urls])

browser.quit()

display(df)

Unnamed: 0,url,publish_date,author,headline,summary,content
0,https://www.bbc.com/news/articles/cn9v1l80350o,2025-02-28T16:40:24.021Z,Caitlin Wilson,Trump to order English as official US language,Donald Trump will sign an executive order on F...,Donald Trump will sign an executive order on F...
1,https://www.bbc.com/news/articles/c62zzd3zp50o,2025-02-28T11:54:38.680Z,Nadine Yousif,'Trump thinks he can break us' - Ontario's Dou...,"The leader of Ontario, Doug Ford, has vowed to...","The leader of Ontario, Doug Ford, has vowed to..."
2,https://www.bbc.com/news/articles/cdell8n14x2o,2025-02-28T11:27:09.108Z,Thomas Mackintosh,Hundreds in US climate agency fired in latest ...,Hundreds of National Oceanic and Atmospheric A...,Hundreds of National Oceanic and Atmospheric A...
3,https://www.bbc.com/news/articles/c9dejydynngo,2025-02-28T19:19:27.428Z,Tom McArthur,Furious Trump accuses Zelensky of 'gambling wi...,A furious Donald Trump has clashed with Volody...,A furious Donald Trump has clashed with Volody...
4,https://www.bbc.com/news/articles/cqlyy1rld0ko,2025-02-28T09:29:17.748Z,Patrick Jackson,What we know about the deaths of Gene Hackman ...,US investigators are trying to establish how O...,US investigators are trying to establish how O...
...,...,...,...,...,...,...
702,https://www.bbc.com/news/articles/c7vdjv31gmmo,2025-02-04T15:51:34.308Z,Ruth Comerford,"I will not stop working, Anna Wintour tells King","Vogue editor Dame Anna Wintour has ""firmly"" to...","Vogue editor Dame Anna Wintour has ""firmly"" to..."
703,https://www.bbc.com/news/articles/c14nm47y65go,2025-02-04T15:23:06.485Z,Rebecca Brahde,Ferries cancelled or changed due to spring tides,Extensive changes have been made to this week'...,Extensive changes have been made to this week'...
704,https://www.bbc.com/news/articles/clyk521rrxyo,2025-02-04T06:03:06.818Z,Shona Elliott and Ruth Clegg,Celebrity butt-lift injector who left women wi...,"A self-styled ""beauty consultant"", whose celeb...","A self-styled ""beauty consultant"", whose celeb..."
705,https://www.bbc.com/news/articles/c805jgk3vj1o,2025-02-03T18:08:56.044Z,Sean Coughlan,Kate photographed by Prince Louis to mark Worl...,A photograph of the Princess of Wales taken by...,A photograph of the Princess of Wales taken by...
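
The task asks for error handling and delays, which the list comprehension above does not provide: a single article without a `time` tag would abort the whole run. A hedged sketch of one way to isolate failures follows. `scrape_all` is a hypothetical wrapper, and `scrape_one` stands for any function that takes a URL and returns a dictionary, such as `function_combiner`.

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=0.5):
    """Apply scrape_one to each URL, keeping successes and failures apart.

    Returns (records, failures), where failures is a list of
    (url, error message) pairs.
    """
    records, failures = [], []
    for url in urls:
        try:
            records.append(scrape_one(url))
        except Exception as error:
            # Remember the failure and continue with the next URL.
            failures.append((url, repr(error)))
        time.sleep(delay_seconds)  # pause between requests to reduce load
    return records, failures

# Stand-in scraper used only to exercise the wrapper.
def fake_scrape_one(url):
    if url.endswith('broken'):
        raise ValueError('missing <time> tag')
    return {'url': url, 'headline': 'placeholder'}

records, failures = scrape_all(
    ['https://example.com/ok', 'https://example.com/broken'],
    fake_scrape_one,
    delay_seconds=0,
)
print(len(records), len(failures))
```

The records could then be passed to `pd.DataFrame(records)` as before, and `failures` inspected or retried later.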


## 4. Data Storage:

### Save the scraped article data to a file.

We save the results to CSV.

In [33]:
df.to_csv('scraped_news.csv', index=False)

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.

In one of the lectures we discussed that temporally newer data could cause data leakage by giving the model information about the future that it would not otherwise have.

However, we also discussed that this can be prevented by splitting the dataset by publication time. In that case the model would probably not be affected by data leakage, but I am not yet sure about this. If it is possible and does not cause problems, having more data would be an advantage.
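
The time-based split mentioned above can be sketched as follows: everything published before a cutoff goes into training and everything from the cutoff onward into testing, so the model is never trained on articles newer than those it is evaluated on. The cutoff date, the helper name `split_by_date`, and the tiny sample are assumptions for illustration; the timestamp format matches the `publish_date` values scraped above.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

def parse_timestamp(text):
    """Parse a timestamp such as '2025-02-28T16:40:24.021Z' as UTC."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)

def split_by_date(records, cutoff):
    """Split records into (train, test) by comparing publish_date to cutoff."""
    train = [r for r in records if parse_timestamp(r['publish_date']) < cutoff]
    test = [r for r in records if parse_timestamp(r['publish_date']) >= cutoff]
    return train, test

# Illustrative records; only the timestamps are taken from the output above.
records = [
    {'url': 'a', 'publish_date': '2025-02-03T18:08:56.044Z'},
    {'url': 'b', 'publish_date': '2025-02-04T15:51:34.308Z'},
    {'url': 'c', 'publish_date': '2025-02-28T16:40:24.021Z'},
]
cutoff = datetime(2025, 2, 15, tzinfo=timezone.utc)  # hypothetical cutoff
train, test = split_by_date(records, cutoff)
print([r['url'] for r in train], [r['url'] for r in test])
```

Whether such a split is sufficient depends on the dates in the original dataset, which are not examined here.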

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!