## Web Scraping Approach

This section describes the web scraping process used to collect text from Project Gutenberg ebooks. Three libraries were used:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Using the `requests` and `BeautifulSoup` libraries (Python Software Foundation 2022; Crummy 2021), we scrape text from Project Gutenberg ebooks. For each randomly generated ebook ID, we request the ebook's HTML page and extract its title and an excerpt limited to the first 2,000 words. The procedure repeats until the desired number of ebooks (150) has been collected. The scraped records are then placed in a pandas DataFrame for analysis, and the DataFrame is saved to a pickle file for later use.
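The title and excerpt extraction described above can be tried on an inline HTML string, without any network access. The following is a minimal sketch, not the notebook's own code: the helper name `extract_title_and_excerpt`, the sample HTML, and the five-word limit are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for a Project Gutenberg HTML file
SAMPLE_HTML = """<html>
<head><title> Sample Book </title></head>
<body>
<h1>Chapter 1</h1>
<p>It was a dark and stormy night.</p>
</body>
</html>"""


def extract_title_and_excerpt(html, max_words):
    """Return the page title and the first max_words words of the body text."""
    soup = BeautifulSoup(html, 'html.parser')

    # Fall back to a placeholder when the page has no <title> tag
    title_tag = soup.find('title')
    title = title_tag.get_text(strip=True) if title_tag else 'No Title Found'

    # Split on whitespace, keep the leading words, and rejoin with single spaces
    body = soup.find('body')
    words = body.get_text().split() if body else []
    excerpt = ' '.join(words[:max_words])

    return title, excerpt


title, excerpt = extract_title_and_excerpt(SAMPLE_HTML, max_words=5)
print(title)
print(excerpt)
```

Splitting on whitespace and rejoining also collapses line breaks and repeated spaces, so the excerpt is a single normalized line of text.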


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

def scrape_gutenberg_ebook(ebook_id):
    url = f'https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-h/{ebook_id}-h.htm'
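    try:
        # A timeout keeps a stalled connection from hanging the whole run
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return None  # Network error or timeout: treat as a failed fetch
    if response.status_code != 200:
        return None  # If the fetch fails, return None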
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    title_tag = soup.find('title')
    title = title_tag.get_text(strip=True) if title_tag else 'No Title Found'
    
    body = soup.find('body')
    if body:
        text = ' '.join(body.get_text().split()[:2000])  # Limit to first 2000 words
    else:
        text = ''
    
    return {
        'url': url,
        'title': title,
        'text': text,
        'label': 'Human-written'
    }

# Initializing an empty list to store the ebook data
ebooks = []

# Defining the target number of ebooks
target_ebook_count = 150

# Tracking the IDs already tried so that no ebook is collected twice
seen_ids = set()

while len(ebooks) < target_ebook_count:
    # Generating a random eBook ID
    ebook_id = random.randint(1, 60000)
    if ebook_id in seen_ids:
        continue
    seen_ids.add(ebook_id)
    
    result = scrape_gutenberg_ebook(ebook_id)
    
    # Checking if the eBook was successfully scraped and has content
    if result and result['text']:
        ebooks.append(result)
        print(f"Collected eBooks: {len(ebooks)}/{target_ebook_count}")  # Progress output

ebooks_df = pd.DataFrame(ebooks)

print(f"Total rows: {len(ebooks_df)}")

Collected eBooks: 1/150
Collected eBooks: 2/150
Collected eBooks: 3/150
Collected eBooks: 4/150
Collected eBooks: 5/150
Collected eBooks: 6/150
Collected eBooks: 7/150
Collected eBooks: 8/150
Collected eBooks: 9/150
Collected eBooks: 10/150
Collected eBooks: 11/150
Collected eBooks: 12/150
Collected eBooks: 13/150
Collected eBooks: 14/150
Collected eBooks: 15/150
Collected eBooks: 16/150
Collected eBooks: 17/150
Collected eBooks: 18/150
Collected eBooks: 19/150
Collected eBooks: 20/150
Collected eBooks: 21/150
Collected eBooks: 22/150
Collected eBooks: 23/150
Collected eBooks: 24/150
Collected eBooks: 25/150
Collected eBooks: 26/150
Collected eBooks: 27/150
Collected eBooks: 28/150
Collected eBooks: 29/150
Collected eBooks: 30/150
Collected eBooks: 31/150
Collected eBooks: 32/150
Collected eBooks: 33/150
Collected eBooks: 34/150
Collected eBooks: 35/150
Collected eBooks: 36/150
Collected eBooks: 37/150
Collected eBooks: 38/150
Collected eBooks: 39/150
Collected eBooks: 40/150
...
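Because the collection loop runs until the target count is reached, it would not terminate if the site were unreachable. One possible safeguard is to cap the number of attempts and draw the candidate IDs without replacement. The sketch below illustrates that idea under stated assumptions: `collect_ebooks`, `fake_fetch`, and the small ID range are hypothetical names and values, and the fetch function is passed in as a parameter so the logic can be exercised offline.

```python
import random


def collect_ebooks(fetch, target_count, max_attempts, id_range=(1, 60000), seed=None):
    """Collect up to target_count results, trying each candidate ID at most once."""
    rng = random.Random(seed)
    low, high = id_range

    # Sampling without replacement guarantees that no ID is tried twice,
    # and the sample size bounds the total number of requests
    n_candidates = min(max_attempts, high - low + 1)
    candidates = rng.sample(range(low, high + 1), k=n_candidates)

    collected = []
    for ebook_id in candidates:
        if len(collected) >= target_count:
            break
        result = fetch(ebook_id)
        # Keep only successful fetches that contain text
        if result and result['text']:
            collected.append(result)
    return collected


def fake_fetch(ebook_id):
    """Hypothetical stand-in for the scraper: pretends only even IDs exist."""
    if ebook_id % 2:
        return None
    return {
        'url': f'https://example.org/{ebook_id}',
        'title': f'Book {ebook_id}',
        'text': 'some text',
        'label': 'Human-written',
    }


sample = collect_ebooks(fake_fetch, target_count=5, max_attempts=20, id_range=(1, 20), seed=0)
print(len(sample))
```

In the notebook, `scrape_gutenberg_ebook` could be passed as `fetch`. The function may return fewer than `target_count` records when the attempt budget runs out, so the caller should check the length of the result.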

In [2]:
ebooks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     150 non-null    object
 1   title   150 non-null    object
 2   text    150 non-null    object
 3   label   150 non-null    object
dtypes: object(4)
memory usage: 4.8+ KB


In [3]:
(ebooks_df['text'] == '').sum()

0
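Since the ebook IDs are drawn at random, the same ebook can in principle be drawn more than once, so a duplicate check is a natural companion to the empty-text check. The following sketch uses a small hypothetical DataFrame (`sample_df`) in place of `ebooks_df`, and it assumes that the `url` column identifies an ebook uniquely.

```python
import pandas as pd

# Hypothetical stand-in for ebooks_df, with one URL repeated
sample_df = pd.DataFrame({
    'url': ['https://example.org/a', 'https://example.org/b', 'https://example.org/a'],
    'title': ['A', 'B', 'A'],
    'text': ['alpha', 'beta', 'alpha'],
    'label': ['Human-written'] * 3,
})

# Count the rows whose URL already appeared in an earlier row
n_duplicates = sample_df.duplicated(subset='url').sum()

# Keep the first occurrence of each URL and renumber the index
deduplicated = sample_df.drop_duplicates(subset='url').reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```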

## Data storage for further analysis

After the data has been scraped and organized, it is stored in a pickle file named `gutenberg_data_300.pkl`. This gives us a stable, easily accessible dataset for further analysis without having to repeat the scraping process. A pickle file suits this purpose because it serializes Python objects directly, so the structure and content of the DataFrame are preserved when it is loaded again.


In [4]:
ebooks_df.to_pickle("gutenberg_data_300.pkl")
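To illustrate how the stored data can be reloaded, here is a minimal round-trip sketch. It writes a small hypothetical DataFrame (`sample_df`) to a temporary directory instead of touching the real file, then reads it back with `pd.read_pickle` and compares the result with the original.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for ebooks_df
sample_df = pd.DataFrame({
    'url': ['https://example.org/a', 'https://example.org/b'],
    'title': ['A', 'B'],
    'text': ['alpha', 'beta'],
    'label': ['Human-written', 'Human-written'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.pkl')
    sample_df.to_pickle(path)        # Serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # Load it back into memory

# True when the values, column names, and dtypes all match
print(restored.equals(sample_df))
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files from a trusted source.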

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)
