<a href="https://colab.research.google.com/github/AbiemwenseMaureenOshobugie/Text-Classification/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping
The code in this notebook scrapes news articles from the BBC website in four languages: Igbo, Yoruba, Hausa, and Pidgin. The following steps are carried out:

First, the required packages, BeautifulSoup and Requests, are installed with the pip package manager.

The code below scrapes news headlines and articles from the four language sections of the BBC website (https://www.bbc.com) and organizes the extracted data into pandas dataframes, one per language.

The text from each headline and article is extracted using the BeautifulSoup library and stored as tuples in a list. The list is then used to create a pandas dataframe. 

The headline and body of each article are combined into a single text and stored in a new column in the dataframe called 'Text'. 

The original 'Headline' and 'Body' columns are dropped and a new column called 'Language' is added to each dataframe, with the value set to igbo, yoruba, hausa, or pidgin for every row of that dataframe.

The four dataframes are concatenated into a single dataframe covering four major languages in Nigeria, and the rows are shuffled.

The shuffled dataframe is split 70/30 into train and test data, after which the 'Language' label is dropped from the test data. The train and test data, together with a labeled copy of the test data (test_label.csv), are saved as CSV files and stored in the 'nigerian_4lang.zip' archive.

In [None]:
!pip -q install beautifulsoup4
!pip -q install requests

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sklearn.model_selection import train_test_split
import zipfile


Scraping the **Igbo** Language Text

In [None]:
# list of URLs to scrape
urls_igbo = ['https://www.bbc.com/igbo', 'https://www.bbc.com/igbo/news', 'https://www.bbc.com/igbo/sport']

# list to store the extracted text
data_igbo = []

# loop through the URLs, parse each page, and extract the text
for url in urls_igbo:
    page_igbo = requests.get(url)
    # create a BeautifulSoup object for the page that was just downloaded
    igbo_data = BeautifulSoup(page_igbo.content, 'html.parser')
    for news in igbo_data.find_all('h3'):
        paragraph = news.find_next('p')
        # skip headlines that are not followed by a paragraph
        if paragraph is None:
            continue
        headline = news.text
        body = paragraph.text
        data_igbo.append((headline, body))

# create a dataframe from the extracted text
df_igbo = pd.DataFrame(data_igbo, columns=["Headline", "Body"])

# drop rows where either the headline or the body is missing
df_igbo.dropna(how='any', inplace=True, axis=0)

# reset the index of the dataframe
df_igbo.reset_index(drop=True, inplace=True)

# create a function to combine the headline and body into a single text
def combine_text(row):
    return row['Headline'] + ' ' + row['Body']

# apply the function to the dataframe to create a new column containing the combined text
df_igbo['Text'] = df_igbo.apply(combine_text, axis=1)

# drop the headline and the body columns
df_igbo = df_igbo.drop(['Headline', 'Body'], axis = 1)

# add a column 'Language' with the value 'igbo'
df_igbo = df_igbo.assign(Language='igbo')

# print the shape of the dataframe
print(df_igbo.shape)

df_igbo.head()


(165, 2)


Unnamed: 0,Text,Language
0,Old Naira Deadline: Ego ochie ọ tọrọ gị n'aka?...,igbo
1,NA EME UGBU A Nicola Sturgeon ga-agba arụkwagh...,igbo
2,"Ọ ga-atụ ha n'anya, aga m abụ Gọvanọ Legọs - G...",igbo
3,Ụzọ ndị ọzọ i nwereike iji gosi obi ụtọ gị ịhụ...,igbo
4,Imo insecurity: Gịnị mere ndị omekome ji awakp...,igbo


In [None]:
# print just the first row full text
print(df_igbo.iloc[0, :].Text)


Old Naira Deadline: Ego ochie ọ tọrọ gị n'aka? Lee etu ị ga-esi kwụnye ya CBN na-ekwusi ike na ego naịra ochie abụghịzi ihe a ga-eji na-atụ mgbere ahịa na Naịjirịa.


Scraping the **Yoruba** Language Text

In [None]:
# list of URLs to scrape
urls_yoruba = ['https://www.bbc.com/yoruba', 'https://www.bbc.com/yoruba/news', 'https://www.bbc.com/yoruba/sport']

# list to store the extracted text
data_yoruba = []

# loop through the URLs, parse each page, and extract the text
for url in urls_yoruba:
    page_yoruba = requests.get(url)
    # create a BeautifulSoup object for the page that was just downloaded
    yoruba_data = BeautifulSoup(page_yoruba.content, 'html.parser')
    for news in yoruba_data.find_all('h3'):
        paragraph = news.find_next('p')
        # skip headlines that are not followed by a paragraph
        if paragraph is None:
            continue
        headline = news.text
        body = paragraph.text
        data_yoruba.append((headline, body))

# create a dataframe from the extracted text
df_yoruba = pd.DataFrame(data_yoruba, columns=["Headline", "Body"])

# drop rows where either the headline or the body is missing
df_yoruba.dropna(how='any', inplace=True, axis=0)

# reset the index of the dataframe
df_yoruba.reset_index(drop=True, inplace=True)

# create a function to combine the headline and body into a single text
def combine_text(row):
    return row['Headline'] + ' ' + row['Body']

# apply the function to the dataframe to create a new column containing the combined text
df_yoruba['Text'] = df_yoruba.apply(combine_text, axis=1)

# drop the headline and the body columns
df_yoruba = df_yoruba.drop(['Headline', 'Body'], axis = 1)

# add a column 'Language' with the value 'yoruba'
df_yoruba = df_yoruba.assign(Language='yoruba')

# print the shape of the dataframe
print(df_yoruba.shape)

df_yoruba.head()


(249, 2)


Unnamed: 0,Text,Language
0,"A wá gbogbo inú yàrá Christian Atsu, ẹsẹ̀ bàtà...",yoruba
1,Ìléẹjọ́ gíga Naijiria sún ìgbẹ́jọ́ lórí owó tu...,yoruba
2,Ìgboro ti dàrú ní Ibadan àti Ilorin torí ọ̀wọ́...,yoruba
3,Ọkùnrin méjì tó rà mọ́tò Fatinoye bọ́ sí gbaga...,yoruba
4,"'Torí mo wà lókè òkun, ẹ ní mi ò lè dìbò, èyí ...",yoruba


In [None]:
# print just the first row full text
print(df_yoruba.iloc[0, :].Text)


A wá gbogbo inú yàrá Christian Atsu, ẹsẹ̀ bàtà méjì la rí, a kò fojú gáání Christian rárá - Agbẹnusọ O kere tan ẹniyan ẹgbẹrun lọna mọkanlelogun lo ti ba iṣẹlẹ naa lọ, ti ọpọ si jẹ ara ilẹ Turkey.


Scraping the **Hausa** Language Text

In [None]:
# list of URLs to scrape
urls_hausa = ['https://www.bbc.com/hausa', 'https://www.bbc.com/hausa/news', 'https://www.bbc.com/hausa/sport']

# list to store the extracted text
data_hausa = []

# loop through the URLs, parse each page, and extract the text
for url in urls_hausa:
    page_hausa = requests.get(url)
    # create a BeautifulSoup object for the page that was just downloaded
    hausa_data = BeautifulSoup(page_hausa.content, 'html.parser')
    for news in hausa_data.find_all('h3'):
        paragraph = news.find_next('p')
        # skip headlines that are not followed by a paragraph
        if paragraph is None:
            continue
        headline = news.text
        body = paragraph.text
        data_hausa.append((headline, body))

# create a dataframe from the extracted text
df_hausa = pd.DataFrame(data_hausa, columns=["Headline", "Body"])

# drop rows where either the headline or the body is missing
df_hausa.dropna(how='any', inplace=True, axis=0)

# reset the index of the dataframe
df_hausa.reset_index(drop=True, inplace=True)

# create a function to combine the headline and body into a single text
def combine_text(row):
    return row['Headline'] + ' ' + row['Body']

# apply the function to the dataframe to create a new column containing the combined text
df_hausa['Text'] = df_hausa.apply(combine_text, axis=1)

# drop the headline and the body columns
df_hausa = df_hausa.drop(['Headline', 'Body'], axis = 1)

# add a column 'Language' with the value 'hausa'
df_hausa = df_hausa.assign(Language='hausa')

# print the shape of the dataframe
print(df_hausa.shape)

df_hausa.head()


(159, 2)


Unnamed: 0,Text,Language
0,KAI TSAYE Kotun ƙolin Najeriya ta ɗage shari'a...,hausa
1,An saki wani mutum bayan kuskuren ɗaure shi ts...,hausa
2,Ana tuhumar mabiya coci da laifin karan-tsaye ...,hausa
3,'Dukkanin gwamnoni na goyon bayan soke wa’adin...,hausa
4,Amnesty ta yi gargaɗi kan razana masu kaɗa ƙur...,hausa


In [None]:
# print just the first row full text
print(df_hausa.iloc[0, :].Text)


KAI TSAYE Kotun ƙolin Najeriya ta ɗage shari'a kan soke wa'adin tsofaffin kuɗi Wannan shafi ne da yake kawo muku abubuwan da ke faruwa a Najeriya da sauran sassan duniya.


Scraping the **Pidgin** Language Text

In [None]:
# list of URLs to scrape
urls_pidgin = ['https://www.bbc.com/pidgin', 'https://www.bbc.com/pidgin/news', 'https://www.bbc.com/pidgin/sport']

# list to store the extracted text
data_pidgin = []

# loop through the URLs, parse each page, and extract the text
for url in urls_pidgin:
    page_pidgin = requests.get(url)
    # create a BeautifulSoup object for the page that was just downloaded
    pidgin_data = BeautifulSoup(page_pidgin.content, 'html.parser')
    for news in pidgin_data.find_all('h3'):
        paragraph = news.find_next('p')
        # skip headlines that are not followed by a paragraph
        if paragraph is None:
            continue
        headline = news.text
        body = paragraph.text
        data_pidgin.append((headline, body))

# create a dataframe from the extracted text
df_pidgin = pd.DataFrame(data_pidgin, columns=["Headline", "Body"])

# drop rows where either the headline or the body is missing
df_pidgin.dropna(how='any', inplace=True, axis=0)

# reset the index of the dataframe
df_pidgin.reset_index(drop=True, inplace=True)

# create a function to combine the headline and body into a single text
def combine_text(row):
    return row['Headline'] + ' ' + row['Body']

# apply the function to the dataframe to create a new column containing the combined text
df_pidgin['Text'] = df_pidgin.apply(combine_text, axis=1)

# drop the headline and the body columns
df_pidgin = df_pidgin.drop(['Headline', 'Body'], axis = 1)

# add a column 'Language' with the value 'pidgin'
df_pidgin = df_pidgin.assign(Language='pidgin')

# print the shape of the dataframe
print(df_pidgin.shape)

df_pidgin.head()


(117, 2)


Unnamed: 0,Text,Language
0,Supreme Court adjourn case on cbn old and new ...,pidgin
1,Scotland First Minister Nicola Sturgeon resign...,pidgin
2,"Di fighting don finish, but di rapes still dey...",pidgin
3,How you fit deposit your old naira notes to CB...,pidgin
4,US say three unidentified objects dem shoot do...,pidgin


In [None]:
# print just the first row full text
print(df_pidgin.iloc[0, :].Text)


Supreme Court adjourn case on cbn old and new naira notes policy till February 22 Wetin dis mean according to di counsel to di Federal Goment, Kanu Agabi SAN be say both currencies still remain legal tender for di kontri and If CBN do other wise, e go mean say dem dey go against di Supreme Court
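
The four scraping cells above repeat the same steps with different URLs. One way to factor them out is sketched below. The function names `parse_articles`, `build_language_df`, and `scrape_language` are hypothetical, as are the `sections` and `timeout` parameters; the network call is kept separate from the parsing so that the parsing can be tried on any HTML string.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_articles(html):
    """Pair every <h3> headline with the <p> element that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h3 in soup.find_all("h3"):
        p = h3.find_next("p")
        if p is not None:  # skip headlines with no following paragraph
            rows.append((h3.get_text(strip=True), p.get_text(strip=True)))
    return rows


def build_language_df(rows, language):
    """Turn (headline, body) tuples into a Text/Language dataframe."""
    df = pd.DataFrame(rows, columns=["Headline", "Body"])
    df["Text"] = df["Headline"] + " " + df["Body"]
    return df[["Text"]].assign(Language=language)


def scrape_language(language, sections=("", "/news", "/sport"), timeout=10):
    """Download each section page for one language and collect its rows."""
    rows = []
    for section in sections:
        url = f"https://www.bbc.com/{language}{section}"
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # fail loudly on HTTP errors
        rows.extend(parse_articles(response.content))
    return build_language_df(rows, language)
```

With network access, `frames = [scrape_language(lang) for lang in ("igbo", "yoruba", "hausa", "pidgin")]` would then build the four dataframes in one line.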


Combining and Shuffling the Languages in a DataFrame

In [None]:
# Concatenate the four language dataframes and shuffle the rows
nigeria_major_languages = pd.concat([df_igbo, df_yoruba, df_hausa, df_pidgin], axis=0)
# A fixed random_state makes the shuffle repeatable across runs
nigeria_major_languages = nigeria_major_languages.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the data into training and test sets (Test = 30%)
train, test_df = train_test_split(nigeria_major_languages, test_size=0.3, random_state=42)
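
Because the same headline can appear on more than one section page, the combined dataframe may contain repeated rows, and a repeated row that lands in both splits leaks test data into training. A possible variant, shown here on made-up data, removes duplicates before splitting and passes `stratify` so that the language proportions stay similar in both parts. The toy dataframe and the names `toy_train` and `toy_test` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 distinct texts per language, plus 4 exact duplicate rows
languages = ["igbo", "yoruba", "hausa", "pidgin"]
rows = [(f"{lang} text {i}", lang) for lang in languages for i in range(10)]
rows += rows[:4]  # simulate the same article being scraped twice
toy = pd.DataFrame(rows, columns=["Text", "Language"])

# Remove repeated texts so no article can land in both splits
deduped = toy.drop_duplicates(subset="Text").reset_index(drop=True)

# Stratified 70/30 split; the fixed seed makes the split repeatable
toy_train, toy_test = train_test_split(
    deduped, test_size=0.3, random_state=42, stratify=deduped["Language"]
)

print(len(toy), len(deduped), len(toy_train), len(toy_test))
print(toy_test["Language"].value_counts())
```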


In [None]:
# Write the dataframe to a CSV file. 
# The index=False argument prevents the index from being written to the file.
train.to_csv('Train.csv', index = False)

# Print the shape of the training set and display the first few rows
print(train.shape)
train.head()


(483, 2)


Unnamed: 0,Text,Language
178,Ina da ƙwarin gwiwar cewa APC za ta lashe zaɓe...,hausa
265,"Video, Tins you go do during election wey go l...",pidgin
352,Ọgbambọ: Ka m si jiri naanị dọla ise bido ahịa...,igbo
529,Sedentary Lifestyle: Etu ịnọdụ ala ogologo oge...,igbo
409,Scotland First Minister Nicola Sturgeon resign...,pidgin


In [None]:
# Write the dataframe to a CSV file. 
# The index=False argument prevents the index from being written to the file.
test_df.to_csv('test_label.csv', index = False)

# Print the shape of the labeled test set and display the first few rows
print(test_df.shape)
test_df.head()

(207, 2)


Unnamed: 0,Text,Language
286,Bikin al'ada da hawan dawakai na cikin hotunan...,hausa
511,"""Ìmúra ìgbeyàwó là ń ṣé, ọjọ́ tó yẹ ká kéde ìy...",yoruba
257,KAI TSAYE Kotun ƙolin Najeriya ta ɗage shari'a...,hausa
336,Ọwọ́ ọlọ́pàá ti tẹ àwọn tó ń kó ìbọn wọ ìpínlẹ...,yoruba
318,How your favourite celebs dey jolly dis valent...,pidgin


Dropping the Label of the Test Data

In [None]:
# Drop the 'Language' column from the test set
test = test_df.drop(['Language'], axis = 1)

# Write the dataframe to a CSV file. 
# The index=False argument prevents the index from being written to the file.
test.to_csv('Test.csv', index = False)

# Print the shape of the test set and display the first few rows
print(test.shape)
test.head()


(207, 1)


Unnamed: 0,Text
286,Bikin al'ada da hawan dawakai na cikin hotunan...
511,"""Ìmúra ìgbeyàwó là ń ṣé, ọjọ́ tó yẹ ká kéde ìy..."
257,KAI TSAYE Kotun ƙolin Najeriya ta ɗage shari'a...
336,Ọwọ́ ọlọ́pàá ti tẹ àwọn tó ń kó ìbọn wọ ìpínlẹ...
318,How your favourite celebs dey jolly dis valent...


Zip the Train and Test Data into the 'nigerian_4lang.zip' Archive

In [None]:
# Specify the file paths of the CSV files you want to zip
Train = '/content/Train.csv'
Test = '/content/Test.csv'
test_label = '/content/test_label.csv'

# Specify the name of the output zip file
nigerian_4lang_zip = '/content/nigerian_4lang.zip'

# Open the output zip file in write mode
with zipfile.ZipFile(nigerian_4lang_zip, 'w') as myzip:
  
    # Add the CSV files to the zip file
    myzip.write(Train)
    myzip.write(Test)
    myzip.write(test_label)
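
To confirm what ended up in an archive, the member names can be listed with `ZipFile.namelist()`. The sketch below builds a throwaway archive in a temporary directory so that it runs anywhere; the file contents are placeholders, and `arcname` is an optional variation that stores each member under its bare file name rather than its full path.

```python
import os
import tempfile
import zipfile

file_names = ["Train.csv", "Test.csv", "test_label.csv"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create small placeholder CSV files
    for name in file_names:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write("Text\nplaceholder\n")

    zip_path = os.path.join(tmp_dir, "nigerian_4lang.zip")

    # arcname stores each file under its bare name instead of its full path
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in file_names:
            archive.write(os.path.join(tmp_dir, name), arcname=name)

    # Read the archive back and list its members
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()

print(members)
```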


The zipped CSV files will be used for text classification, a natural language processing (NLP) task.

# *Thank you!*