# SCRAPING DATA:
* We scraped data from `https://yo.globalvoices.org/page/`. The site is a dynamic site heavily loaded with javascript enabled contents, this is why `selenium` was chosen instead of the regular beautifulsoup(bs4). Selenium was chosen instead of other javascript-compatible python scrapers such as `playwright` for convienience sake.
* We scraped from 10 webpages, each containing at least 10 article headlines which hyperlinks to the article body.
* The article bodies were extracted and further processed to prune duplicates before writing to a .txt file per sentence.
* A total of `33193` sentences are accounted for in the .txt file.

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

In [None]:
# Setting selenium to go headless and start with a maximized window.
options = Options()
options.add_argument("--start-maximized")
options.add_argument("--headless")

# Setting the chromium driver since we have not installed the chromium browser on our system.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
big_article = []
articles = []

# While loop to go through two pages
x = 1
while x < 10:
    # For every page:
    url = f'https://yo.globalvoices.org/page/{x}/'

    driver.get(url)
    time.sleep(4)

    # Get all the article titles in the page
    article_titles = driver.find_elements(By.CLASS_NAME, 'post-title')
    print(f'There are {len(article_titles)} article titles on the current page.')
    texts = [title.text.strip() for title in article_titles]

    # For each articles in the page
    for i in texts:
        try:
            # Click the link text of the article
            new_page = driver.find_element(By.LINK_TEXT, i)
            new_page.click()

            # While web elements is not avaliable on the DOM, wait for 20 minutes.
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'entry')))

            # Get the items under entry
            body = driver.find_element(By.CLASS_NAME, 'entry')
            texts = [t.text.strip() for t in body.find_elements(By.TAG_NAME, 'p')]

            # Append the list of string in <p> to the `articles` variable
            articles.append(' '.join(texts).strip())
            print(articles[-1][:50])
            driver.back()
            WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'post-title')))
        except:
            big_article.append(articles)

    big_article.append(articles)
    x+=1

driver.close()

There are 14 article titles on the current page.
Àwọn elétò ìlera-bìnrin agbègbè níbi ìdánilẹ́kọ̀ọ́
Àwọn ọmọ ẹgbẹ́ Ohùn Àgbáyé (Global Voices) níbi àp
Àjọ gbogboogbò tí ó ń rí sí ìfopin sí ìfipá-munis
Arábìnrin aráa Togo tó ń wa kùsà. A mú àwòrán nàá
Àwọn ọmọdé ní orílẹ̀-èdè Tanzania ń lo ẹ̀rọ-ayárab
Àwọn afẹ̀hónúhàn dúró pẹ̀lú àsia orílẹ̀-èdè
Òṣèré Soca, Machel Montano ní Ayẹyẹ OVO ní Toronto
Àwòrán àfihàn láti orí Canva Pro. Emma  Lewis, òka
Ọjà ẹja bèbè òkun kan ní Dakar, Senegal, pẹ̀lú àwọ
France jẹ́ orílẹ̀ èdè tí ó ń sọ èdè tó pọ̀, tí àw
Àwọn afẹ̀hónúhàn lásìkò ìwọ́de #EndSars. Àwòrán lá
Damilola Olawoyin. Toheeb Babalola ni ó pèsè àwòrá
Ilẹ̀ agbègbè Rumuwhara. The Colonist Report Africa
Àwòrán-atọ́ka láti ọwọ́ Táíwò Tèmilolúwa John fún 
There are 10 article titles on the current page.
Àwọn afẹ̀hónúhàn lásìkò ìwọ́de #EndSars. Àwòrán lá
Damilola Olawoyin. Toheeb Babalola ni ó pèsè àwòrá
Ilẹ̀ agbègbè Rumuwhara. The Colonist Report Africa
Àwòrán-atọ́ka láti ọwọ́ Táíwò Tèmil

In [16]:
len(big_article)
data = []

In [17]:
for i in range(9):
    data.extend(big_article[i])

In [19]:
for i in data:
    with open('yo_corpus.txt', 'a', encoding='utf-8') as f:
        f.write(i + '\n')

# READING THE SCRAPED DATA FOR FUTHER PREPROCESSING

In [None]:
with open('yo_corpus.txt', 'r', encoding='utf-8') as f:
    lines = f.read()

lines = [text+'. ' for text in lines.split('. ')]
print(f'There are {len(lines)} sentences in your data.', f'## Printing the first 30 sentences', sep='\n\n', end='\n')
for i, j in enumerate(lines[:100], 1):
    print(f'{i}: {j}')

There are 33193 sentences in your data.

## Printing the first 30 sentences
1: Àwọn elétò ìlera-bìnrin agbègbè níbi ìdánilẹ́kọ̀ọ́ kan ní Dankpen Prefecture, tí ó wà ní ẹkùn Kara ní apá Àríwá Togo. 
2:  Àwòrán láti ọwọ́ Émile Bobozi. 
3: A gba àṣẹ láti lò ó. 
4: Láti ọwọ́ Emile Bobozi Ní àwọn ìgbèríko orílẹ̀-èdè Togo, àwọn obìnrin ló ń gbé èyí tó jù nínú bùkátà ẹbí. 
5: Síbẹ̀, wọ́n tẹramọ́ ìdàgbàsókè ìlú wọn bíótilẹ̀jẹ́pé àwọn àṣà àjogúnbá ń fínná mọ́ wọn. 
6: Ọ̀pọ̀lọpọ̀ àwọn ìgbèríko ní ẹkùn Kara ní apá àríwá Togo ni àwọn ohun amáyédẹrùn ti di ẹgẹrẹmìtì, tí ó sì ń tàbùkù ìgbé-ayé ìrọ̀rùn fún àwọn ará agbègbè Bassar, Kabye, Lamba àti Konkomba. 
7: Lọ́pọ̀ ìgbà ni àwọn ojú ọ̀nà wọn kìí ṣe é gbà, àwọn ọ̀nà ojú omií mú kí ìrìnàjò sí ilé-ìwòsàn ó nira, tí èyí sì ń fà àwọn ìdènà sí ìrọ́wọ́tó ètò ìlera. 
8: Àwọn òṣìṣẹ́ ìlera-bìnrin ní láti pa iṣẹ́ oòjọ́ àti ojúṣe wọn nínú ẹbí pọ̀ láti lè pèsè ètò ìlera tó péye fún àdúgbò wọn. 
9: Nínú èèyàn bíi ẹgbẹ̀lẹ́gbẹ̀ 9 tó wà ní Togo, ìdá 51.3 nínú ọ

# REMOVE DIACRITICS FROM TEXT:
* Due to the persistence of the tone marks and diacritics, we need to specially rewrite the text back to the plain ASCII representation.
* All tone markers and diacritics falls within unicode range `u0300 and u036f`
* We use regex to substitute/remove any occurence of the symbols that fall within the unicode range for the diacritics or/and tone marks.
We checked the ord() of the text and saw that the accented texts in the scraped data, has a different `"ord()"` from the rest.

In [None]:
import unicodedata, re

def remove_dia(text):
    norm = unicodedata.normalize('NFD', text)
    stripped = re.sub(r'[\u0300-\u036f]', '', norm) # Any char that falls within this range is a diacritic
    return stripped

with open('yo_corpus.txt', 'r', encoding='utf-8') as f:
    lines = f.read()

lines = [text+'. ' for text in lines.split('. ')]

In [8]:
import pandas as pd

diac = []
undiac = []

for i in lines:
    diac.append(i)
    undiac.append(remove_dia(i))

data = {
    'label':diac,
    'feature':undiac
    }

df = pd.DataFrame(data)

df.to_parquet('data.parquet')
print('Saved data as parquet...')

Saved data as parquet...


In [11]:
read_df = pd.read_parquet('data.parquet')
read_df.head()

Unnamed: 0,label,feature
0,Àwọn elétò ìlera-bìnrin agbègbè níbi ìdánilẹ́k...,Awon eleto ilera-binrin agbegbe nibi idanileko...
1,Àwòrán láti ọwọ́ Émile Bobozi.,Aworan lati owo Emile Bobozi.
2,A gba àṣẹ láti lò ó.,A gba ase lati lo o.
3,Láti ọwọ́ Emile Bobozi Ní àwọn ìgbèríko orílẹ̀...,Lati owo Emile Bobozi Ni awon igberiko orile-e...
4,"Síbẹ̀, wọ́n tẹramọ́ ìdàgbàsókè ìlú wọn bíótilẹ...","Sibe, won teramo idagbasoke ilu won biotilejep..."
