#Overview

This script scrapes, extracts, and structures Yoruba proverbs and idioms from a public webpage. The goal is to build a structured dataset containing each proverb, its literal English translation, and its actual meaning in English. The dataset will later be used to build a translation model in another script.

The input is the URL of a [Steemit](https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms) web page that contains Yoruba proverbs, their translations, and their cultural meanings.

The output is a CSV file, "yoruba_phrases.csv", that consists of three columns:

Phrase: Yoruba proverb.

Literal Translation: Direct word-for-word English translation.

Actual Meaning: Idiomatic or interpretive English meaning.

In [None]:
# Importing the necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Web Scraping

The HTML content is fetched using requests.get(), and BeautifulSoup is then used to parse the page and extract <li> tags containing Yoruba proverbs and <p> tags containing their literal translations and meanings.

In [None]:
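# Send a GET request to the webpage and get its HTML content
# (the 30-second timeout is an arbitrary choice so the cell cannot hang forever)
response = requests.get("https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms", timeout=30)
response.raise_for_status()  # Stop early if the server returned an error status
html_text = response.text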

# Parse the HTML content using BeautifulSoup and the lxml parser
soup = BeautifulSoup(html_text, "lxml")

In [None]:
# Find the main <div> tag that contains the proverbs and text content
div_tag = soup.find("div", class_="MarkdownViewer Markdown")

# From the <div>, find all the <li> tags (contains the Yoruba proverbs)
li_tag = div_tag.find_all("li")

# Create an empty list to store the Yoruba phrases
phrase = []

# Loop through each <li> tag and extract the text content (Yoruba proverb), then add to the list
for tag in li_tag:
    phrase.append(tag.get_text(strip=True))  # strip=True removes surrounding spaces

# One of the proverbs (index 32) mistakenly includes its translation in the same <li> tag, so we remove that part
phrase[32] = phrase[32].replace("Translation: A good beginning is of no value unless one perseveres to the end.", "")

In [None]:
# Find all the <p> tags inside the <div> (contains translations and meanings)
p_tags = div_tag.find_all("p")

# Skip the first two <p> tags (they are not part of the data)
p_tag = p_tags[2:]

#Manual Corrections

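Remove three out-of-place items manually by index so that the translations and meanings line up. The deletions run from the highest index to the lowest, so each one leaves the positions of the remaining targets unchanged.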
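
As a minimal sketch of index-based removal, the made-up list below drops its stray entries from the highest index down, which keeps the lower indices valid. The items and indices are illustrative only and are not the scraped data.

```python
# Hypothetical stand-in for the scraped <p> list; STRAY_* marks unwanted items.
items = ["t0", "m0", "STRAY_A", "t1", "m1", "STRAY_B", "t2", "m2"]
stray_indices = [2, 5]  # positions found by inspecting the original list

# Delete from the highest index down so earlier positions do not shift.
for index in sorted(stray_indices, reverse=True):
    del items[index]

print(items)
```

Deleting in ascending order instead would require adjusting every later index by the number of items already removed.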


In [None]:
del p_tag[101]

In [None]:
del p_tag[49]

In [None]:
del p_tag[20]

In [None]:
# Create an empty list to store the translations and meanings
text = []

# Extract and clean text from each remaining <p> tag
for tag in p_tag:
    text.append(tag.get_text(strip=True))

# Add back the translation we removed earlier (to keep all data in sync)
text.insert(66, "Translation: A good beginning is of no value unless one perseveres to the end.")

#Text Cleanup

At this point, the text list contains alternating translations and meanings.
Now we clean up by removing the "Meaning: " and "Translation: " labels from each item.
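
The cell below uses str.replace, which removes the label wherever it occurs in an item. A slightly stricter variant, sketched here under the assumption that each label only ever needs removing at the start of an item, strips leading labels and leaves the rest of the text alone. The helper name strip_label and the sample strings are hypothetical.

```python
# Labels to strip; matched only at the very start of an item.
LABELS = ("Translation:", "Meaning:")


def strip_label(item: str) -> str:
    """Return the item without a leading label and surrounding whitespace."""
    for label in LABELS:
        if item.startswith(label):
            return item[len(label):].strip()
    return item.strip()


# Illustrative inputs, not taken from the scraped page.
samples = [
    "Translation: A literal rendering.",
    "Meaning: An interpretive rendering.",
]
print([strip_label(sample) for sample in samples])
```

Unlike the replace-based cell, this version also handles a label that is not followed by a space.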

In [None]:
cleaned = []
for item in text:
    cleaned.append(item.replace("Meaning: ", "").replace("Translation: ", ""))

#Alignment

Split the cleaned list into two lists:

list1 will contain the literal translations

list2 will contain the actual meanings
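
The same even/odd split can also be written with slice steps. This is an equivalent sketch on a small illustrative list, not a replacement for the loop in the next cell; the names literal and meaning are placeholders.

```python
# Illustrative alternating list: literal translation, then meaning, repeated.
cleaned = ["literal 1", "meaning 1", "literal 2", "meaning 2"]

literal = cleaned[0::2]  # items at even indices
meaning = cleaned[1::2]  # items at odd indices

# An odd-length input would leave one translation without a meaning.
assert len(literal) == len(meaning), "cleaned list has an unpaired item"
print(literal, meaning)
```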

In [None]:
list1 = []
list2 = []

for i, tag in enumerate(cleaned):
    if i % 2 == 0:  # Even indices → literal translation
        list1.append(tag)
    else:  # Odd indices → actual meaning
        list2.append(tag)

#Expansion

In [None]:
# Add additional proverbs directly to the phrase list
phrase.extend((["Tí ọmọdé bá rí ẹgbẹ́ rẹ̀ tó ńgun ẹṣin lọ níwájú, ó yẹ kó wo ẹgbẹ́ bàbá a rẹ̀ tó ńfi ẹsẹ̀ rìn bọ̀ lẹ́hìn.", "Ibi gbogbo là ńdá iná alẹ́, ọbẹ̀ ló kàn dùn ju ara wọn lọ.", "Ìgbín ò lè sáré bí Ajá; ìyẹn ò ní kó máà de ibi tó ńlọ.", "Àmójúkúrò ni í mú ẹ̀mí ìfẹ́ gùn.", "Bí a kò bá dẹ́kun ìgbìyànjú, bó pẹ́ bó yá akitiyan wa á dópin lọ́jọ́ kan.", "Ọmọ aráyé ńp'àtẹ́wọ́ fún yànmùyánmú, inú ẹ̀ ńdùn; kò mọ̀ pé òun ńfi ikú ṣeré ni.", "Bí a kò bá dẹ́kun ìgbìyànjú, bó pẹ́ bó yá akitiyan á dópin lọ́jọ́ kan.", "Bàtà orí àkìtàn náà re òde ìyàwó rí.", "Ọ̀rẹ́ ẹ mí só, èmi náà yóò só; olúwarẹ̀ yóò kàn ṣu sára ni.", "Òòrùn tó kù lókè tó aṣọ ọ́ gbẹ."]))

# Add their corresponding literal translations to list1
list1.extend((["A youngster who's anxious at his (or her) peers riding horses ahead, will do well to look back to see his father's mates trudging behind.", "Supper is prepared in every home; some stews are simply tastier than others.", "The snail is truly not as fast as the dog; yet, this won't stop the snail from getting to its destination.", "Willingness to overlook is what makes for an enduring loving relationship.", "If we won't give up, our hustling will one day end.","People clap for mosquito and it rejoices, oblivious that it is toying with death.", "If one won't give up trying, one's hustling will one day come to an end.", "The pair of shoes now on the refuse dump was once worn to a wedding.", "My friend farts, I must fart; you will simply end up defecating on yourself.", "The remaining sunlight is still good enough to dry the clothes."]))

# Add their actual meanings to list2
list2.extend((["Contentment is it", "We are all blessed, albeit in different ways and to different extents; keep hope alive and remain grateful", "Compare yourself with no one; we are all differently endowed; stay in your lane and envy no one", "If it is wrong, love will not see it; tolerance is crucial.", "Finish whatever you start: never give up, never quit; if we won't quit, we will win.", "Be perceptive: not all positive gestures are noble.", "Persistence is it; if you won't quit, you will win, ultimately.", "Life is turn by turn; a season goes, another comes; humility is it.", "Stay in your lane and embrace your uniqueness; be secure in who you are: don't compare yourself with anyone.", "Keep hope alive: don't despair; what is left can still go far: work with what you have, right where you are."]))
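
Before the three lists are combined into a table, it can help to confirm that they have the same length, since pandas raises a ValueError when the columns of a dictionary differ in length. The helper below is a hypothetical sketch with illustrative data; check_equal_lengths is not part of the original script.

```python
def check_equal_lengths(columns: dict) -> int:
    """Return the shared column length, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Illustrative columns standing in for phrase, list1, and list2.
example = {
    "Phrase": ["p1", "p2"],
    "Literal Translation": ["t1", "t2"],
    "Actual Meaning": ["m1", "m2"],
}
print(check_equal_lengths(example))
```

The error message lists every column length, which makes it easier to see which list is misaligned.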

In [None]:
# Create a dictionary to organize the data into columns for the CSV file
data = {
    "Phrase": phrase,
    "Literal Translation": list1,
    "Actual Meaning": list2
}

# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(data)

In [None]:
# Save the DataFrame into a CSV file named 'yoruba_phrases.csv' without the row numbers
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/yoruba_phrases.csv', index=False)
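
Several meanings contain commas and semicolons, so the CSV format has to quote those fields for them to survive a round trip. The sketch below checks this with the standard-library csv module and an in-memory buffer rather than the Drive path; the single row is copied from the lists above, and the variable names are placeholders.

```python
import csv
import io

header = ["Phrase", "Literal Translation", "Actual Meaning"]
row = [
    "Àmójúkúrò ni í mú ẹ̀mí ìfẹ́ gùn.",
    "Willingness to overlook is what makes for an enduring loving relationship.",
    "If it is wrong, love will not see it; tolerance is crucial.",
]

# Write the header and one row to an in-memory CSV "file".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

# Read it back and compare field by field.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["Actual Meaning"] == row[2])
```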