# Setting Up The Environment

This notebook was [written in Google Colab](https://colab.research.google.com/drive/11_X7N26-ZN7lyLvzKPVgRVfoCJmtRMmb?usp=sharing) and should be run in Colab.

In [2]:
!pip install google-colab-selenium

Collecting google-colab-selenium
  Downloading google_colab_selenium-1.0.14-py3-none-any.whl.metadata (2.7 kB)
Collecting selenium (from google-colab-selenium)
  Downloading selenium-4.32.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio~=0.17 (from selenium->google-colab-selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium->google-colab-selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting outcome (from trio~=0.17->selenium->google-colab-selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium->google-colab-selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading google_colab_selenium-1.0.14-py3-none-any.whl (8.2 kB)
Downloading selenium-4.32.0-py3-none-any.whl (9.4 MB)

In [33]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains  # used by scroll_to_element
import google_colab_selenium as gs
import pandas as pd
from IPython import get_ipython
from IPython.display import Image, display, IFrame
import time
import datetime

# Scraping BBC Swahili Afya

In [25]:
driver = gs.Chrome()
url = "https://www.bbc.com/swahili/topics/cvjp2jj60v3t"  # the Afya (health) category
driver.get(url)

The functions below were useful for navigating the driver and for validating my code against the website as it appears on my computer.

In [5]:
# Scroll to the bottom of the page, repeating until the page height stops growing
# (adapted from Stack Overflow)
def scroll_to_bottom(driver, pause=2):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded, so this is the real bottom
        last_height = new_height

# Zoom out (typically used before a screenshot)
def zoom_out(driver):
    driver.execute_script("document.body.style.zoom='80%'")

# Save a screenshot and show it inline, much like the take-and-show-screenshot helper of Colab 5
# (uses the global `driver`)
def take_screenshot(screenshot_path):
    driver.save_screenshot(screenshot_path)
    display(Image(screenshot_path))

# Scroll to the next element in a list
def scroll_to_element(driver, element):
    ActionChains(driver)\
        .scroll_to_element(element)\
        .perform()

# Scroll down by `distance` pixels relative to the current position
def scroll_down(driver, distance):
    driver.execute_script(f"window.scrollBy(0, {distance});")

# Scroll up by `distance` pixels relative to the current position
def scroll_up(driver, distance):
    driver.execute_script(f"window.scrollBy(0, {-distance});")

### ONLY USE IF THERE'S A COOKIES BANNER ERROR
While I was passing options to the Selenium driver, the website would occasionally show a cookie banner. After hours of trying to figure out why, I removed the options and have not seen a cookie banner since.

Use the cell below if the cookie banner shows up again.

In [None]:
try:
    # Locate the "Accept Cookies" button by its data attribute, because its visible text is in Swahili
    cookie_button = driver.find_element(By.XPATH, "//button[@data-cookie-banner='accept']")
    cookie_button.click()
    print("Cookies accepted!")
except Exception:
    # No banner on the page, or the button could not be clicked
    print("Could not find or click the cookie button:")

Could not find or click the cookie button:


### Retrieving and Reading Articles

In [21]:
def retrieve_articles(base_url, class_name):
    """Collect article links from each paginated listing page (uses the global `driver`)."""
    all_articles = []
    max_pages = 40  # number of listing pages to visit
    for page_number in range(1, max_pages + 1):
        driver.get(f"{base_url}?page={page_number}")
        time.sleep(5)  # wait for the page to finish rendering
        # Each matching element wraps one article; skip any that has no link
        for article in driver.find_elements(By.CLASS_NAME, class_name):
            links = article.find_elements(By.TAG_NAME, "a")
            href = links[0].get_attribute("href") if links else None
            if href:
                all_articles.append(href)
                print(f"Page number is {page_number} and the link is: {href}")
    return all_articles
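
If the same link turns up on more than one listing page, the list returned by `retrieve_articles` will contain it more than once. A minimal sketch, assuming duplicates are unwanted, that removes them while keeping first-seen order. `dedupe_links` is a hypothetical helper, not part of the original notebook:

```python
def dedupe_links(links):
    """Return links with duplicates removed, keeping first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so the first occurrence wins
    return list(dict.fromkeys(links))


sample = [
    "https://www.bbc.com/swahili/articles/cwyv13plgz9o",
    "https://www.bbc.com/swahili/articles/c4grnvrp8yvo",
    "https://www.bbc.com/swahili/articles/cwyv13plgz9o",  # repeated on purpose
]
unique_links = dedupe_links(sample)
print(len(unique_links))
```

Applied to the notebook, this would be called once on `article_links` before the articles are read, so that no page is fetched twice.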

In [16]:
article_links = retrieve_articles(url, "bbc-t44f9r")
print(article_links)

Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cwyv13plgz9o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/c4grnvrp8yvo
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cx282gkkdd4o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cqj4jveqp9qo
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/ce3vjpllx12o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/crm3g37d9v4o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cr5d9gmzd28o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cdjlk0gm8x7o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cvgnyxe3wrqo
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/c4g3wnkzw42o
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/c75dgxzd4pgo
Page number is 1 and the link is: https://www.bbc.com/swahili/articles/cgkg4

In [18]:
def read_short_article(link, article_data):
    driver.get(link)
    try:
        # Absolute XPath to the article body; brittle if the page layout changes
        article_body = driver.find_element(By.XPATH, "/html/body/div/div/div/main/div[2]/div[2]/div/ol/li[11]/article/div[1]")
        body_elements = article_body.find_elements(By.TAG_NAME, "p")
        article_data["Text"] = "\n".join(elem.text for elem in body_elements)
    except Exception:
        # Fall back to a placeholder when the body cannot be located
        article_data["Text"] = "N/A"
    return article_data

In [19]:
def read_long_article(link, article_data):
    driver.get(link)
    try:
        paragraphs = driver.find_elements(By.TAG_NAME, "p")
        # Skip the first three <p> elements on the page
        paragraphs = paragraphs[3:]
        # Join the remaining paragraphs, one per line
        article_data["Text"] = "".join(p.text + "\n" for p in paragraphs)
    except Exception:
        # Fall back to a placeholder when the paragraphs cannot be read
        article_data["Text"] = "N/A"

    return article_data

In [20]:
def read_article(link):
    article_data = {}
    if "bbc.in" in link:
        # Shortened bbc.in links
        return read_short_article(link, article_data)
    elif "bbc.com" in link:
        # Full bbc.com article links
        return read_long_article(link, article_data)
    else:
        # Unrecognized link: return an empty dict so the caller can skip it
        return article_data
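
The substring checks in `read_article` also match any URL that merely contains `bbc.com` or `bbc.in` somewhere in its path or query string. A minimal sketch of a stricter check that compares the hostname instead. `classify_link` and its three labels are hypothetical names introduced for illustration:

```python
from urllib.parse import urlparse


def classify_link(link):
    """Return 'short', 'long', or 'other' based on the link's hostname."""
    # hostname is None for strings that are not absolute URLs
    host = (urlparse(link).hostname or "").lower()
    if host == "bbc.in" or host.endswith(".bbc.in"):
        return "short"
    if host == "bbc.com" or host.endswith(".bbc.com"):
        return "long"
    return "other"


print(classify_link("https://www.bbc.com/swahili/articles/cwyv13plgz9o"))  # long
```

`read_article` could then branch on the returned label rather than on `in` checks; whether the extra strictness matters depends on how clean the collected links are.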

In [26]:
articles_data = []
for link in article_links:
    article_info = read_article(link)
    print(f"At this link: {link}")
    if article_info:
        articles_data.append(article_info)

At this link: https://www.bbc.com/swahili/articles/cwyv13plgz9o
At this link: https://www.bbc.com/swahili/articles/c4grnvrp8yvo
At this link: https://www.bbc.com/swahili/articles/cx282gkkdd4o
At this link: https://www.bbc.com/swahili/articles/cqj4jveqp9qo
At this link: https://www.bbc.com/swahili/articles/ce3vjpllx12o
At this link: https://www.bbc.com/swahili/articles/crm3g37d9v4o
At this link: https://www.bbc.com/swahili/articles/cr5d9gmzd28o
At this link: https://www.bbc.com/swahili/articles/cdjlk0gm8x7o
At this link: https://www.bbc.com/swahili/articles/cvgnyxe3wrqo
At this link: https://www.bbc.com/swahili/articles/c4g3wnkzw42o
At this link: https://www.bbc.com/swahili/articles/c75dgxzd4pgo
At this link: https://www.bbc.com/swahili/articles/cgkg4nnlgrko
At this link: https://www.bbc.com/swahili/articles/cj9eg9znn0vo
At this link: https://www.bbc.com/swahili/articles/c0jzy0098l1o
At this link: https://www.bbc.com/swahili/articles/c8dg29r2mglo
At this link: https://www.bbc.com/swahil

In [None]:
articles_data

[{'Type': 'Long Article',
  'Title': 'Tundu Lissu asalia mikononi mwa polisi Tanzania',
  'Date': '10 Aprili 2025',
  'Text': 'Awali katika matangazo ya Amka na BBC yaliyoruka asubuhi hii, mwandishi mwandamizi wa Idhaa ya Kiswahili Florian Kaijage alieleza kwa undani tumachokifahamu kuhusu tukio hilo. Sikiliza.\n© 2025 BBC. BBC haihusiki na taarifa za kutoka mitandao ya nje. Soma kuhusu mtazamo wetu wa viambatanishi vya nje.\n'},
 {'Type': 'Long Article',
  'Title': 'Tetesi za soka Ulaya Alhamisi: Man City jicho kwa Guimaraes',
  'Date': '10 Aprili 2025',
  'Text': "Manchester City wanafikiria kutoa kati ya euro 50-60m (£42-51m) kwa kiungo wa AC Milan na Uholanzi Tijjani Reijnders, 26, licha ya hivi karibuni kusaini mkataba mpya hadi 2030. (Calciomercato)\nCHANZO CHA PICHA,\nGETTY IMAGES\nEverton wanatarajia kumpoteza mlinzi wa kati wa England Jarrad Branthwaite, 22, msimu huu huku Manchester United na Tottenham zikimtaka. (Sun)\nChelsea na Newcastle wanavutiwa na mshambuliaji wa Benfi
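
The scraped text above includes lines that are not article content: the "© 2025 BBC. ..." footer and the "CHANZO CHA PICHA," photo-credit label followed by its source ("GETTY IMAGES"). A minimal sketch of a clean-up step, assuming the footer always starts with "©" and the credit always sits on the line right after the label. Both patterns are inferred from this one output, and `strip_boilerplate` is a hypothetical helper:

```python
def strip_boilerplate(text):
    """Drop footer and photo-credit lines from scraped article text."""
    kept = []
    skip_next = False
    for line in text.split("\n"):
        stripped = line.strip()
        if skip_next:
            # This line is the credit that follows a "CHANZO CHA PICHA," label
            skip_next = False
            continue
        if stripped == "CHANZO CHA PICHA,":
            skip_next = True
            continue
        if stripped.startswith("©"):
            continue  # copyright footer
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)


# Made-up sample text that mimics the structure of the output above
raw = (
    "First paragraph.\n"
    "CHANZO CHA PICHA,\n"
    "GETTY IMAGES\n"
    "Second paragraph.\n"
    "© 2025 BBC. BBC haihusiki na taarifa za kutoka mitandao ya nje.\n"
)
clean_text = strip_boilerplate(raw)
print(clean_text)
```

Other credit sources or footer wordings would need their own rules, so it is worth spot-checking a few cleaned articles before relying on this.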

In [34]:
print(len(articles_data))

939


In [27]:
df_bbc = pd.DataFrame(articles_data)
df_bbc

|  | Text |
|---|---|
| 0 | Kila tarehe Mei 12, dunia huadhimisha Siku ya ... |
| 1 | Matibabu ya sasa yanapaswa kuendana na aina ma... |
| 2 | Fauka ya hayo, kutokana na tafiti kadhaa za ki... |
| 3 | ''Niliamua kutembea miguu chuma miaka 6 iliyop... |
| 4 | Maji safi yana virutubishi ambavyo hukata kiu ... |
| ... | ... |
| 934 | Elkabbas, amekanusha mashitaka hayo na kuambia... |
| 935 | "Sielewi: nyinyi ni watu wa nchi zilizoendelea... |
| 936 | Wengine zaidi wanaandikishwa. Hii inakuja chan... |
| 937 |  |
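
Some rows can end up with empty text, or with the "N/A" placeholder that the reader functions store when extraction fails. A minimal sketch, assuming such rows should be dropped before analysis, that filters the list of dicts before it becomes a DataFrame. `drop_empty_records` is a hypothetical helper:

```python
def drop_empty_records(records, placeholder="N/A"):
    """Keep only records whose 'Text' field holds real content."""
    kept = []
    for record in records:
        # Treat a missing key, None, and whitespace-only text as empty
        text = (record.get("Text") or "").strip()
        if text and text != placeholder:
            kept.append(record)
    return kept


sample_records = [
    {"Text": "Some article text"},
    {"Text": "N/A"},
    {"Text": ""},
    {"Text": "   \n"},
]
usable = drop_empty_records(sample_records)
print(len(usable))
```

The filtered list can be passed to `pd.DataFrame` exactly as `articles_data` is above.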


# Scraping Habari Leo

In [43]:
driver = gs.Chrome()
url = "https://habarileo.co.tz/category/afya/"  # the Afya (health) category
# driver.get(url)

In [42]:
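def retrieve_articles(base_url, class_name):
    """Collect article links from each paginated Habari Leo listing page (uses the global `driver`)."""
    all_articles = []
    max_pages = 104  # number of listing pages to visit
    base_url = base_url.rstrip("/")  # avoid a double slash before "page"
    for page_number in range(1, max_pages + 1):
        driver.get(f"{base_url}/page/{page_number}")
        time.sleep(2)  # wait for the page to finish rendering
        # Each matching element wraps one article; skip any that has no link
        for article in driver.find_elements(By.CLASS_NAME, class_name):
            links = article.find_elements(By.TAG_NAME, "a")
            href = links[0].get_attribute("href") if links else None
            if href:
                all_articles.append(href)
                print(f"Page number is {page_number} and the link is: {href}")
    return all_articles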
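
`max_pages` is hard-coded (40 for BBC, 104 here) and has to be updated by hand whenever the number of listing pages changes. A sketch of an alternative that stops at the first page with no links, assuming a page past the end returns no matching elements (not verified against either site). `collect_links` and `fetch_page` are hypothetical names; in the notebook, `fetch_page` would wrap the `driver.get` and `find_elements` calls for one page:

```python
def collect_links(fetch_page, max_pages):
    """Collect links page by page, stopping early at the first empty page."""
    all_links = []
    for page_number in range(1, max_pages + 1):
        page_links = fetch_page(page_number)
        if not page_links:
            # An empty page is taken to mean we have run past the last page
            break
        all_links.extend(page_links)
    return all_links


# Stand-in for a real page fetch: three pages of links, then nothing
fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d"]}
links = collect_links(lambda n: fake_site.get(n, []), max_pages=104)
print(links)
```

`max_pages` stays as an upper bound so that a site which never returns an empty page cannot keep the loop running indefinitely.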

In [44]:
habari_article_links = retrieve_articles(url, "container-wrapper.post-element.tie-standard.masonry-brick")
print(habari_article_links)

Page number is 1 and the link is: https://habarileo.co.tz/zaidi-ya-milioni-450-kutekeleza-miradi-ya-afya-longido/
Page number is 1 and the link is: https://habarileo.co.tz/mariam-mwinyi-azindua-zanzibar-afya-week/
Page number is 12 and the link is: https://habarileo.co.tz/sh-milioni-78-kuchangia-masuala-ya-lishe-msalala/
Page number is 12 and the link is: https://habarileo.co.tz/wabobezi-muhimbili-warejesha-100-sauti-ya-mtoto-maliki/
Page number is 12 and the link is: https://habarileo.co.tz/bima-ya-afya-kwa-wote-kuanza-haraka/
Page number is 12 and the link is: https://habarileo.co.tz/zijue-sheria-zitakazowatia-hatiani-wasambazaji-picha-chafu/
Page number is 13 and the link is: https://habarileo.co.tz/daktari-aeleza-hatari-maumivu-ya-kifua/
Page number is 13 and the link is: https://habarileo.co.tz/kambi-maalum-kuchunguza-maumivu-ya-viungo/
Page number is 13 and the link is: https://habarileo.co.tz/mhagama-atoa-maagizo-kwa-halmashauri-zote/
Page number is 13 and the link is: https://h

In [45]:
def read_habari_article(link):
    habari_article_data = {}
    driver.get(link)
    try:
        # By.CLASS_NAME accepts a single class, so match all three classes with a CSS selector
        habari_article_wrapper = driver.find_element(By.CSS_SELECTOR, ".entry-content.entry.clearfix")
        combined_article = ""
        habari_paragraphs = habari_article_wrapper.find_elements(By.TAG_NAME, "p")
        for paragraph in habari_paragraphs:
            combined_article += paragraph.text + "\n"
        habari_article_data["Text"] = combined_article
    except Exception:
        # Fall back to a placeholder when the article body cannot be read
        habari_article_data["Text"] = "N/A"

    return habari_article_data

In [None]:
habari_articles_data = []
for link in habari_article_links:
    habari_article_info = read_habari_article(link)
    print(f"At this link: {link}")
    if habari_article_info:
        habari_articles_data.append(habari_article_info)

At this link: https://habarileo.co.tz/zaidi-ya-milioni-450-kutekeleza-miradi-ya-afya-longido/
At this link: https://habarileo.co.tz/mariam-mwinyi-azindua-zanzibar-afya-week/
At this link: https://habarileo.co.tz/sh-milioni-78-kuchangia-masuala-ya-lishe-msalala/
At this link: https://habarileo.co.tz/wabobezi-muhimbili-warejesha-100-sauti-ya-mtoto-maliki/
At this link: https://habarileo.co.tz/bima-ya-afya-kwa-wote-kuanza-haraka/
At this link: https://habarileo.co.tz/zijue-sheria-zitakazowatia-hatiani-wasambazaji-picha-chafu/
At this link: https://habarileo.co.tz/daktari-aeleza-hatari-maumivu-ya-kifua/
At this link: https://habarileo.co.tz/kambi-maalum-kuchunguza-maumivu-ya-viungo/
At this link: https://habarileo.co.tz/mhagama-atoa-maagizo-kwa-halmashauri-zote/
At this link: https://habarileo.co.tz/zmbf-kuendelea-kuboresha-maisha-ya-jamii/
At this link: https://habarileo.co.tz/tanzania-nchi-jirani-kutokomeza-polio/
At this link: https://habarileo.co.tz/wengi-wakutwa-na-shikizo-la-damu-aru

In [None]:
df_habari = pd.DataFrame(habari_articles_data)
df_habari

# Exporting CSV & Quitting The Driver

In [29]:
curr_date = datetime.date.today()

In [30]:
print(curr_date)

2025-05-12


In [31]:
df_bbc.to_csv(f"bbc_swahili_articles_{curr_date}.csv", index=False)

In [None]:
df_habari.to_csv(f"habari_leo_articles_{curr_date}.csv", index=False)

In [40]:
driver.quit()
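
The notebook calls `gs.Chrome()` twice and rebinds `driver`, but quits only once, and if a scraping cell raises, the `driver.quit()` cell may never run. A minimal sketch of a context manager that always quits the driver it creates. `managed_driver` is a hypothetical helper, and `FakeDriver` is a stand-in so the example runs without a browser:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raised


class FakeDriver:
    """Stand-in with the one method this sketch relies on."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as d:
        raise RuntimeError("scrape failed")
except RuntimeError:
    pass
print(fake.closed)
```

In the notebook this would look like `with managed_driver(gs.Chrome) as driver:` around each site's scraping code, at the cost of keeping that code in a single cell.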

