## Web Scraping Approach

We use web scraping to collect articles from an online platform, in this case English Wikipedia. Three libraries were used:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Using the `requests` and `BeautifulSoup` libraries (Python Software Foundation, n.d.; Mitchell & Richardson, n.d.), text is scraped from randomly selected English Wikipedia articles. Each request to Wikipedia's `Special:Random` URL redirects to a random article, from which we extract the title and the paragraph text. The scraped data is kept in a pandas DataFrame for analysis. The procedure is repeated until the desired number of articles is reached, and the DataFrame is then stored in a pickle file for later use.
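The extraction step can be tried offline before any request is sent. The following is a minimal sketch, not the notebook's scraper itself: `SAMPLE_HTML` is a hand-written, hypothetical page that only imitates the `firstHeading` and `mw-parser-output` elements the scraper in this notebook looks for, and `extract_title_and_text` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that imitates the rough layout of a Wikipedia page
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Example Article</h1>
<div class="mw-parser-output">
  <table class="infobox"><tr><td><p>Infobox text to skip.</p></td></tr></table>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def extract_title_and_text(html):
    """Return the heading text and the joined text of paragraphs outside tables."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", id="firstHeading")
    title = heading.get_text(strip=True) if heading else None
    content = soup.find("div", class_="mw-parser-output")
    if content is None:
        # No recognisable content container: report an empty text
        return {"title": title, "text": ""}
    paragraphs = [
        p.get_text(strip=True)
        for p in content.find_all("p")
        if p.find_parent("table") is None  # skip paragraphs nested in tables
    ]
    return {"title": title, "text": " ".join(paragraphs)}


result = extract_title_and_text(SAMPLE_HTML)
print(result)
```

Testing the parsing logic on a fixed string like this makes it easier to see which elements are kept or skipped without depending on the live site.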


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
def scrape_wikipedia_article(url):
    # Return None on network errors or non-200 responses so the caller can retry
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None  # Return None if the page request failed

    soup = BeautifulSoup(response.content, 'html.parser')

    # Finding the title of the article
    title = soup.find('h1', id='firstHeading')  # Wikipedia uses this ID for article titles
    article_title = title.get_text().strip() if title else 'Wikipedia Article'  # Fallback title if not found

    # Finding the main content of the article
    content_div = soup.find('div', class_='mw-parser-output')
    if content_div is None:
        return None  # Page layout not recognised

    # Extract text from all paragraphs within the content div, excluding certain elements
    paragraphs = []
    for p in content_div.find_all('p'):
        # Excluding paragraphs within tables or infoboxes (infobox is a CSS class, not a tag name)
        if not p.find_parent('table') and not p.find_parent(class_='infobox'):
            # get_text(strip=True) would glue words together at inline tag boundaries,
            # so strip only the outer whitespace
            text = p.get_text().strip()
            if text:
                paragraphs.append(text)
    article_text = ' '.join(paragraphs)

    return {
        'url': response.url,  # Final URL after the Special:Random redirect
        'title': article_title,
        'text': article_text.strip()
    }

# Initializing an empty list to hold the articles
articles = []

# base URL for random Wikipedia articles
random_article_url = 'https://en.wikipedia.org/wiki/Special:Random'

# number of articles we want to scrape
target_article_count = 150

while len(articles) < target_article_count:
    # Scrape the article
    article_data = scrape_wikipedia_article(random_article_url)

    # Check if the article was scraped successfully
    if article_data and article_data['text']:
        # Add the article to our list if it has content. Record the resolved article URL
        # when the scraper provides one; otherwise fall back to the request URL.
        articles.append({
            'url': article_data.get('url', random_article_url),
            'title': article_data['title'],
            'text': article_data['text'],
            'label': 'Human-written',
        })
        print(f"Collected articles: {len(articles)}/{target_article_count}")  # Progress output

wiki_df = pd.DataFrame(articles)

Collected articles: 1/150
Collected articles: 2/150
Collected articles: 3/150
[... remaining progress output truncated ...]
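Because a random-article URL can return the same page more than once, and a loop that retries until a target count is reached has no upper bound on requests, it may be worth capping attempts and skipping duplicates. The sketch below is one possible variant and not the loop used in this notebook: `collect_articles`, `fetch`, `max_attempts`, and `delay_seconds` are hypothetical names, and `fetch` is assumed to return either `None` or a dict with `url`, `title`, and `text` keys. A stand-in iterator replaces the network so the behaviour can be checked offline.

```python
import time


def collect_articles(fetch, target_count, max_attempts, delay_seconds=0.0):
    """Call fetch() until target_count unique articles are collected or attempts run out."""
    articles = []
    seen_urls = set()
    attempts = 0
    while len(articles) < target_count and attempts < max_attempts:
        if attempts and delay_seconds:
            time.sleep(delay_seconds)  # pause between consecutive requests
        attempts += 1
        article = fetch()
        # Skip failed requests, empty articles, and pages already collected
        if not article or not article.get("text") or article["url"] in seen_urls:
            continue
        seen_urls.add(article["url"])
        articles.append(article)
    return articles


# Stand-in for a real scraper: includes a failure, a duplicate, and an empty page
_fake_pages = iter([
    {"url": "https://example.org/a", "title": "A", "text": "alpha"},
    None,
    {"url": "https://example.org/a", "title": "A", "text": "alpha"},
    {"url": "https://example.org/b", "title": "B", "text": ""},
    {"url": "https://example.org/c", "title": "C", "text": "gamma"},
])

collected = collect_articles(
    lambda: next(_fake_pages, None), target_count=3, max_attempts=10
)
print([a["title"] for a in collected])
```

Passing the fetch function in as an argument keeps the retry and de-duplication logic separate from the HTTP code, so either part can be changed or tested on its own.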

In [3]:
wiki_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     150 non-null    object
 1   title   150 non-null    object
 2   text    150 non-null    object
 3   label   150 non-null    object
dtypes: object(4)
memory usage: 4.8+ KB


In [4]:
(wiki_df['text'] == '').sum()

0
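Beyond the empty-string check, a few more sanity checks can be run on the scraped text. The following sketch uses a small hypothetical DataFrame (`quality_df`) rather than the scraped data, and the choice of checks (exact duplicates, whitespace-only text, word counts) is an assumption about what is useful here.

```python
import pandas as pd

# Hypothetical rows standing in for scraped articles
quality_df = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "text": ["alpha beta", "gamma", "alpha beta", "   "],
})

# Rows that repeat an earlier (title, text) pair
duplicate_rows = int(quality_df.duplicated(subset=["title", "text"]).sum())

# Rows whose text is empty once surrounding whitespace is removed
blank_rows = int(quality_df["text"].str.strip().eq("").sum())

# Number of whitespace-separated tokens per article
word_counts = quality_df["text"].str.split().str.len()

print(duplicate_rows, blank_rows, word_counts.tolist())
```

The same three expressions could be applied to `wiki_df` by swapping in its name.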

## Data storage for further analysis

After the data is scraped and organized, it is stored on disk. This gives us a stable, easily accessible dataset for further analysis and removes the need to repeat the scraping process. A pickle file was chosen as the storage format because it serializes Python objects directly, so the structure and content of the DataFrame are preserved when it is reloaded.


In [5]:
wiki_df.to_pickle("wiki_data_300.pkl")
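To confirm that a pickled DataFrame comes back unchanged, it can be reloaded with `pd.read_pickle` and compared with the original. The sketch below uses a one-row hypothetical DataFrame (`roundtrip_df`) and a temporary directory so that it does not touch the file written above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical single-row frame with the same columns as the scraped data
roundtrip_df = pd.DataFrame({
    "url": ["https://example.org/a"],
    "title": ["A"],
    "text": ["alpha"],
    "label": ["Human-written"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")
    roundtrip_df.to_pickle(path)        # write the DataFrame to disk
    restored_df = pd.read_pickle(path)  # load it back into memory

# True when values, dtypes, and index all match
print(restored_df.equals(roundtrip_df))
```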

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)
