# Web Scraping for Fake News Detection

### In this assignment, you'll gain hands-on experience in web scraping, a crucial skill in data science, especially when structured datasets are not readily available. Specifically, you'll focus on extracting information from news websites, a vital step in creating a dataset for training a fake news detection model. In terms of the Data Science Pipeline, you will mainly focus on acquiring raw data, processing and cleaning data, exploratory data analysis, and structured representation and storage of data.

# Submission Requirements

## A Jupyter Notebook (.ipynb file) implementing the assignment, and a PDF printout of the executed notebook displaying the results.

# Part 1: Analyze the Fake News Dataset

## 1. Import Dataset: 

### Import the cleaned dataset from the last assignment.

In [118]:
import pandas as pd

df = pd.read_csv('cleaned_news_sample.csv')

## 2. Dataset Analysis:

## A

### Determine which article types should be omitted, if any.

In [119]:
unique_types = df['type'].unique()
pd.DataFrame(unique_types, columns=['types']).head(11)

Unnamed: 0,types
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political
9,NaN


I keep all types except NaN and unknown.

In [120]:
wanted_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable', 'bias', 'hate', 'junksci', 'political']
# Keep only rows whose type is one of the wanted labels (this drops NaN and 'unknown')
df = df[df['type'].isin(wanted_types)].copy()
unique_types = df['type'].unique()
pd.DataFrame(unique_types).head(9)

Unnamed: 0,0
0,unreliable
1,fake
2,clickbait
3,conspiracy
4,reliable
5,bias
6,hate
7,junksci
8,political


## B

### Group the remaining types into 'fake' and 'reliable'. Argue for your choice.

Arguments...

In [121]:
reliable_types = ['clickbait', 'reliable', 'political']
fake_types = ['unreliable', 'fake', 'conspiracy', 'bias', 'hate', 'junksci']

# Map each original label onto one of the two target classes
df['type'] = df['type'].replace(reliable_types, 'reliable')
df['type'] = df['type'].replace(fake_types, 'fake')

## C

### Examine the percentage distribution of 'reliable' vs. 'fake' articles. Is the dataset balanced? Discuss the importance of a balanced distribution.

In [129]:
# Count the articles per class without shadowing the built-in `type`
reliable_amount = int((df['type'] == 'reliable').sum())
fake_amount = len(df) - reliable_amount

print(f'Reliable amount: {reliable_amount}')
print(f'Fake amount: {fake_amount}')
# Double quotes on the outside keep this f-string valid before Python 3.12
print(f"Reliable percentage: {reliable_amount / len(df) * 100:.2f}%")

Reliable amount: 27
Fake amount: 205
Reliable percentage: 11.64%


# Part 2: Gathering Links

### In this part of the exercise you will write code to extract a collection of article links.

## 1. Library Installation:

### Install beautifulsoup4 and requests. Create a new Jupyter Notebook and import these modules:

In [126]:
import requests
from bs4 import BeautifulSoup

## 2. Retrieve HTML Content: 

### Use the following example code to fetch the HTML content of a webpage and verify that contents holds the HTML source of the webpage:

In [None]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text

## 3. Extract Articles:

### Parse the HTML content with soup = BeautifulSoup(contents, 'html.parser'). BeautifulSoup allows us to easily extract information after parsing; see the BeautifulSoup documentation for details. Write a function to extract all articles (elements whose type attribute is 'article') from the page using the find_all method. For each article, retrieve the corresponding link to the article (href).
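A minimal sketch of this step, under two assumptions that should be checked against the live page: each article is an element carrying `type="article"`, and its link is the first `<a href=...>` inside it. The function name `extract_article_links`, the `base_url` default, and the sample HTML are illustrative, not part of the assignment.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_links(contents, base_url="https://www.bbc.com"):
    """Return the unique article URLs found in one page of HTML."""
    soup = BeautifulSoup(contents, "html.parser")
    links = []
    # Assumption: articles are marked with the attribute type="article".
    for article in soup.find_all(attrs={"type": "article"}):
        anchor = article.find("a", href=True)
        if anchor is None:
            continue  # an article block without a link is skipped
        # Relative hrefs such as /news/... are resolved against base_url.
        url = urljoin(base_url, anchor["href"])
        if url not in links:
            links.append(url)
    return links


# Offline check on a hand-written snippet instead of the live site.
sample_html = """
<div type="article"><a href="/news/world-europe-1">One</a></div>
<div type="article"><a href="https://www.bbc.com/news/world-europe-2">Two</a></div>
<div type="article"><span>No link here</span></div>
<div type="video"><a href="/news/av/3">Video</a></div>
<div type="article"><a href="/news/world-europe-1">Duplicate</a></div>
"""
sample_links = extract_article_links(sample_html)
print(sample_links)
```

Taking the HTML as a string, rather than fetching inside the function, keeps the parsing logic testable without network access.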

## 4. Scrape Multiple Pages:

### Identify the number of pages available for the 'Europe' section (Hint: see the buttons at the bottom of the page). Write a function that extracts all article links from all these pages (Hint: append ?page=x to the URL to access a specific page number).
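One possible shape for the pagination loop is sketched below. It assumes page numbering starts at 1 and that the page count has been read off the site by hand. The helpers `page_urls` and `collect_links` are hypothetical names; `fetch` stands for any callable that returns HTML for a URL (for example `lambda url: requests.get(url, timeout=10).text`), and `extract` for the link extractor from step 3.

```python
import time


def page_urls(section_url, n_pages):
    """Return the URLs for pages 1..n_pages of one section."""
    return [f"{section_url}?page={page}" for page in range(1, n_pages + 1)]


def collect_links(section_url, n_pages, fetch, extract, delay=1.0):
    """Fetch every page of a section and return its unique links in order."""
    seen = set()
    links = []
    for url in page_urls(section_url, n_pages):
        for link in extract(fetch(url)):
            if link not in seen:
                seen.add(link)
                links.append(link)
        time.sleep(delay)  # pause between requests to be polite to the server
    return links


# Offline demonstration: the fake fetch returns the URL itself and the fake
# extract looks up canned links, so no network access is needed.
fake_pages = {
    "https://example.org/news?page=1": ["/a", "/b"],
    "https://example.org/news?page=2": ["/b", "/c"],
}
demo_links = collect_links(
    "https://example.org/news",
    2,
    fetch=lambda url: url,
    extract=lambda html: fake_pages[html],
    delay=0,
)
print(demo_links)
```

Passing `fetch` and `extract` in as arguments is a design choice that lets the loop be tested offline; calling `requests.get` directly inside the loop would work just as well for the notebook.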

## 5. Expand the Scope:

### Extend your scraping to include articles from other regions: Australia, Asia, Africa, Latin America, and the Middle East. If done correctly, you should get around 2,000 article links.
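The region sections can be handled by building one section URL per region and merging the per-region results. In the sketch below only the Europe URL comes from the assignment; the other slugs (`australia`, `asia`, `africa`, `latin_america`, `middle_east`) are assumptions that should be confirmed in the browser before use, and `region_urls` and `merge_region_links` are hypothetical helper names.

```python
BASE_URL = "https://www.bbc.com/news/world"

# Assumption: every region follows the same URL pattern as Europe.
REGION_SLUGS = {
    "Europe": "europe",
    "Australia": "australia",
    "Asia": "asia",
    "Africa": "africa",
    "Latin America": "latin_america",
    "Middle East": "middle_east",
}


def region_urls(base_url=BASE_URL, slugs=REGION_SLUGS):
    """Map each region name to its section URL."""
    return {region: f"{base_url}/{slug}" for region, slug in slugs.items()}


def merge_region_links(links_by_region):
    """Merge per-region link lists into {link: first region it was seen in}."""
    merged = {}
    for region, links in links_by_region.items():
        for link in links:
            merged.setdefault(link, region)  # keep the first region only
    return merged


demo_merged = merge_region_links(
    {"Europe": ["/news/1", "/news/2"], "Asia": ["/news/2", "/news/3"]}
)
print(region_urls()["Europe"])
print(demo_merged)
```

Keeping the region alongside each link is optional, but it makes it easy to check later whether each section contributed a plausible share of the roughly 2,000 links.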

## 6. Save Your Results: 

### Store the collected links in a file (CSV, JSON, or TXT format).
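As one option among the three formats, the links can be written to JSON with the standard library. The function names and the file name `article_links.json` are illustrative; the demonstration writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def save_links(links, path):
    """Write the list of links to a JSON file."""
    Path(path).write_text(json.dumps(list(links), indent=2), encoding="utf-8")


def load_links(path):
    """Read the list of links back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "article_links.json"
    save_links(["https://www.bbc.com/news/a", "https://www.bbc.com/news/b"], target)
    restored = load_links(target)

print(restored)
```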

# Part 3: Scraping Article Text

### In this final part of the exercise, you will scrape the article text and store it on disk.

## 1. Article Inspection: 

### Manually inspect a few articles to find unique attributes to identify the text, the headline, the published date, and the author.

## 2. Text Scraping Function:

### Implement a function that takes a URL and returns a dictionary with the article's text, headline, published date, and author.
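A rough sketch of such a function follows. The selectors are placeholders built on generic HTML conventions (`<h1>` for the headline, `<time datetime=...>` for the date, `<meta name="author">` for the author, `<p>` tags inside `<article>` for the text); they are assumptions, and the attributes found during the manual inspection in step 1 should replace them. `parse_article` and `scrape_article` are hypothetical names.

```python
import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract headline, published date, author, and text from article HTML."""
    soup = BeautifulSoup(html, "html.parser")
    headline_tag = soup.find("h1")
    time_tag = soup.find("time", attrs={"datetime": True})
    author_tag = soup.find("meta", attrs={"name": "author"})
    body = soup.find("article")
    if body is None:
        body = soup  # fall back to the whole page if there is no <article>
    # Collapse internal whitespace so each paragraph becomes one clean line.
    paragraphs = [" ".join(p.get_text().split()) for p in body.find_all("p")]
    return {
        "headline": headline_tag.get_text(strip=True) if headline_tag else None,
        "published": time_tag["datetime"] if time_tag else None,
        "author": author_tag.get("content") if author_tag else None,
        "text": "\n".join(p for p in paragraphs if p),
    }


def scrape_article(url, timeout=10):
    """Download one article and return its parsed fields as a dictionary."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP errors into exceptions
    return parse_article(response.text)


# Offline check on a hand-written page.
sample_page = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article><h1> Sample headline </h1>
<time datetime="2024-01-15T09:30:00.000Z">15 January</time>
<p>First paragraph.</p><p>Second <b>paragraph</b>.</p></article>
<p>Footer text outside the article.</p></body></html>
"""
sample_article = parse_article(sample_page)
print(sample_article)
```

Missing fields come back as `None` rather than raising, which keeps a single odd page from aborting a long scraping run.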

## 3. Scrape All Articles:

### Loop through all the collected article links to scrape their contents. This may take a long time, so try a smaller subset first. Remember to implement error handling and possibly introduce delays to avoid being blocked.
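The loop with error handling and delays could look roughly like this. `scrape_all` is a hypothetical helper; `scrape` is whatever function was written in step 2, and the retry count and delay values are arbitrary starting points rather than recommendations from the assignment.

```python
import time


def scrape_all(urls, scrape, delay=1.0, max_retries=2):
    """Scrape every URL, retrying failures, and return (results, failures)."""
    results = []
    failures = []
    for url in urls:
        for attempt in range(max_retries + 1):
            try:
                record = scrape(url)
            except Exception as exc:  # in practice, narrow this to requests.RequestException
                if attempt == max_retries:
                    failures.append((url, repr(exc)))
                else:
                    time.sleep(delay * (attempt + 1))  # wait longer after each failure
            else:
                record["url"] = url
                results.append(record)
                break
        time.sleep(delay)  # pause between articles to avoid being blocked
    return results, failures


# Offline demonstration with a fake scraper that fails for one URL.
demo_calls = []


def fake_scrape(url):
    demo_calls.append(url)
    if "bad" in url:
        raise ValueError("simulated failure")
    return {"headline": f"Headline for {url}"}


demo_results, demo_failures = scrape_all(
    ["a", "bad", "b"], fake_scrape, delay=0, max_retries=1
)
print(len(demo_results), "scraped,", len(demo_failures), "failed")
```

Calling it first on a slice such as `links[:20]` gives a quick sanity check before committing to the full run.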

## 4. Data Storage:

### Save the scraped article data to a file.
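CSV is one reasonable choice here because it can be loaded straight back into pandas. The sketch assumes each record is a dictionary with the keys listed in `FIELDS` (a hypothetical column order); the `csv` module quotes multi-line article text automatically.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["url", "headline", "published", "author", "text"]


def save_articles(records, path):
    """Write article dictionaries to a CSV file with a fixed column order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)


def load_articles(path):
    """Read the CSV file back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


demo_records = [
    {
        "url": "https://www.bbc.com/news/a",
        "headline": "Example",
        "published": "2024-01-15",
        "author": None,  # missing values are written as empty cells
        "text": "First paragraph.\nSecond paragraph.",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "articles.csv"
    save_articles(demo_records, target)
    reloaded = load_articles(target)

print(reloaded[0]["headline"])
```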

## 5. Discussion:

### Discuss whether it would make sense to include this newly acquired data in the dataset. Argue why or why not and if possible include statistics to support your claim.
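One statistic that may help the discussion is how the class balance from Part 1 (27 reliable, 205 fake) would shift if the scraped articles were added. The sketch assumes, purely for illustration, that every scraped article would be labelled 'reliable' and that about 2,000 articles are collected; both are assumptions to argue for or against, not results.

```python
def reliable_share(reliable, fake, added_reliable=0):
    """Fraction of articles labelled reliable after adding new reliable ones."""
    total = reliable + fake + added_reliable
    return (reliable + added_reliable) / total


def needed_for_balance(reliable, fake):
    """Number of extra reliable articles needed for a 50/50 split."""
    return max(fake - reliable, 0)


before = reliable_share(27, 205)
after = reliable_share(27, 205, added_reliable=2000)
balance_gap = needed_for_balance(27, 205)

print(f"Reliable share before: {before:.2%}")
print(f"Reliable share after adding 2000: {after:.2%}")
print(f"Reliable articles needed for balance: {balance_gap}")
```

Under these assumptions the imbalance would flip rather than disappear, which suggests that adding only a subset of the scraped articles may be worth considering.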

# Part 4: Preservation

### Keep the data that you have scraped so you can use it for your Group Project!