# Data Scraping PostaMate
In this notebook we scrape news headlines, dates, contents, and URLs from the [PostaMate](https://postamate.com/page/) website. PostaMate is a Kenyan news source that publishes satirical and sarcastic news.

Data from this source is important for our model because, roughly 8 times out of 10, humans find it difficult to recognize satire in a text. A model trained on these articles can help people identify satirical or sarcastic news.

## Importing dependencies/libraries
Before we begin, we import the libraries needed for scraping:
- `requests` - fetches web pages over HTTP so we can work with them in the notebook

- `BeautifulSoup` (from the `bs4` package) - parses HTML so that we can extract data

- `pandas` - stores our data as a DataFrame and exports it to a CSV file

- `time` - provides `sleep()`, which we call between requests to avoid overloading the website
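
To illustrate how the parsing and storage libraries fit together before touching the live site, here is a minimal sketch that parses a small, made-up HTML snippet with `BeautifulSoup` and collects the result in a `pandas` DataFrame. The markup in `sample_html` is hypothetical: it only assumes that listing pages use a `single-post-wrapper` div, a `post-title` heading with a link, and a `time` tag. The real page structure may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup, written only for this illustration.
sample_html = """
<div class="single-post-wrapper">
  <h3 class="post-title"><a href="https://example.com/story-1/">First headline</a></h3>
  <time class="entry-date published">May 1, 2024</time>
</div>
<div class="single-post-wrapper">
  <h3 class="post-title"><a href="https://example.com/story-2/">Second headline</a></h3>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for article in soup.find_all("div", class_="single-post-wrapper"):
    title_tag = article.find("h3", class_="post-title")
    link = title_tag.find("a") if title_tag else None
    time_tag = article.find("time")
    rows.append({
        "title": link.get_text(strip=True) if link else "No Title",
        "url": link["href"] if link else None,
        # Fall back to a placeholder when the tag is missing
        "date": time_tag.get_text(strip=True) if time_tag else "No Date",
    })

sample_df = pd.DataFrame(rows)
print(sample_df)
```

The second sample article has no `time` tag on purpose, so the fallback value can be seen in the resulting DataFrame.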

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Setup
headers = {"User-Agent": "Mozilla/5.0"}
base_url = "https://postamate.com/page/"
pages_to_scrape = 10  

# Store scraped data
titles, dates, urls, contents, sources, labels = [], [], [], [], [], []

# STEP 1: Loop through multiple pages
for page_num in range(1, pages_to_scrape + 1):
    print(f"🔍 Crawling Page {page_num}")
    url = base_url + str(page_num) + "/"
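    try:
        # A timeout keeps the crawl from hanging on an unresponsive server
        res = requests.get(url, headers=headers, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Failed to load page {page_num}: {e}")
        continue

    if res.status_code != 200:
        print(f"⚠️ Failed to load page {page_num} (status {res.status_code})")
        continue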

    soup = BeautifulSoup(res.text, "html.parser")
    articles = soup.find_all("div", class_="single-post-wrapper")

    for article in articles:
        try:
            # Get title and article link
            title_tag = article.find("h3", class_="post-title")
            a_tag = title_tag.find("a") if title_tag else None

            title = a_tag.get_text(strip=True) if a_tag else "No Title"
            article_url = a_tag["href"] if a_tag else None

            # Get publish date
            # A CSS selector matches both classes regardless of their order or any extra classes
            time_tag = article.select_one("time.entry-date.published")
            pub_date = time_tag.get_text(strip=True) if time_tag else "No Date"

            # Get full content from article page
            if article_url:
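                try:
                    article_res = requests.get(article_url, headers=headers, timeout=10)
                    article_res.raise_for_status()  # treat 4xx/5xx responses as failures
                    article_soup = BeautifulSoup(article_res.text, "html.parser")
                except requests.exceptions.RequestException:
                    # Skip this article entirely; nothing is stored for it
                    print("⚠️ Failed to retrieve:", article_url)
                    time.sleep(1)  # still pause before the next request
                    continue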

                content_block = article_soup.find("div", class_="entry-content")
                if content_block:
                    paragraphs = content_block.find_all("p")
                    content = " ".join(p.get_text(strip=True) for p in paragraphs)
                else:
                    content = "No content"
            else:
                content = "No URL"

            # Store
            titles.append(title)
            dates.append(pub_date)
            urls.append(article_url)
            contents.append(content)
            sources.append("PostaMate")
            labels.append("satire")

            time.sleep(1)

        except Exception as e:
            print(f"❌ Error parsing article: {e}")
            continue

# STEP 2: Save to CSV
df = pd.DataFrame({
    "title": titles,
    "date": dates,
    "url": urls,
    "content": contents,
    "source": sources,
    "label": labels
})

df.to_csv("postamate_satire_articles.csv", index=False)
print(f"\n✅ Done! Saved {len(df)} articles to 'postamate_satire_articles.csv'")


🔍 Crawling Page 1
🔍 Crawling Page 2

✅ Done! Saved 54 articles to 'postamate_satire_articles.csv'


In [3]:
df.to_csv("../Data/RawData/postamate_satire_articles.csv", index=False)
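
The relative path above only works if `../Data/RawData/` already exists, because `DataFrame.to_csv` does not create missing folders. A minimal sketch of a safer save step follows. `save_raw_csv` is a hypothetical helper name, and dropping duplicate URLs is an assumption about what is wanted here, not something this notebook does.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_raw_csv(df: pd.DataFrame, path: str) -> Path:
    """Write df to path, creating parent folders first (hypothetical helper)."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: an article listed on two pages should be kept only once
    deduped = df.drop_duplicates(subset="url")
    deduped.to_csv(out_path, index=False)
    return out_path


# Tiny stand-in for the scraped DataFrame, with one repeated URL
demo_df = pd.DataFrame({
    "title": ["A", "A", "B"],
    "url": [
        "https://example.com/a/",
        "https://example.com/a/",
        "https://example.com/b/",
    ],
})

# Write into a temporary folder so the demo leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw_csv(demo_df, f"{tmp}/Data/RawData/demo.csv")
    rows_saved = len(pd.read_csv(saved))

print(rows_saved)
```

In the notebook itself, the call would be `save_raw_csv(df, "../Data/RawData/postamate_satire_articles.csv")`.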
