# Web Scrapping with Beautiful Soup and Wptool API
## Introduction
This project is about extracting data from different websites. The aim is to list the best 100 universities in Nigeria. Python package, [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), will be used to get the list from the [Webometrics Ranking table of 2020](https://www.theabusites.com/webometrics-ranking-2019/), found on the THEABUSITE website.

The Webometrics Ranking of World Universities, also known as Ranking Web of Universities, is a ranking system for the world’s universities based on a composite indicator that takes into account both the volume of the Web content (number of web pages and files) and the visibility and impact of these web publications according to the number of external links (site citations) they received.

Additional data about each university, such as a motto, year of establishment, chancellor, vice-chancellor, students, e.t.c, will be extracted from Wikipedia using the API, [wptools](https://pypi.org/project/wptools/).

## Importing Required Libraries/Package

In [1]:
import requests as r
import pandas as pd
from bs4 import BeautifulSoup
import os
import wptools as wp
import time
import json
import numpy as np

## Accessing the website using request library and saving the content to a file

In [2]:
# url of the website where the table of top 100 universities are.
url = 'https://www.theabusites.com/webometrics-ranking-2019/'
html = r.get(url)

# Creatiing a folder on the file directory to save the content of the url
folder = 'top_100_universities_in_ng'
if not os.path.exists(folder):
    os.makedirs(folder)
with open(os.path.join(folder, "./webometrics_ranking_2019.html"), mode='wb') as file:
    file.write(html.content)

## Accessing the saved content for extraction

In [2]:
with open("top_100_universities_in_ng/webometrics_ranking_2019.html", 'r') as file:
    soup = BeautifulSoup(file, 'lxml')
    table = soup.find('table')

In [3]:
# Filter all row in the table
rows = table.find_all('tr')
sub = rows[1:] # remove first item in the list, it contains table headings.

# Empty list to save extracted information
top = []
for i in range(len(sub)):
    rank = sub[i].find_all('td')[0].text
    world_rank = sub[i].find_all('td')[1].text
    universities = sub[i].find_all('td')[2].text
    url = sub[i].find_all('a')[0]['href']
    presence_rank = sub[i].find_all('td')[4].text
    impact_rank = sub[i].find_all('td')[5].text
    openness_rank = sub[i].find_all('td')[6].text
    excellence_rank = sub[i].find_all('td')[7].text
    
    top.append({
        'ranking':int(rank),
        'world_rank':int(world_rank),
        'universities':universities,
        'website':url,
        'presence_rank':presence_rank,
        'impact_rank':impact_rank,
        'openness_rank':openness_rank,
        'excellence_rank':excellence_rank
    })

# Creating column names as seen on web
col=['ranking', 'world_rank', 'universities', 'website', 'presence_rank',
    'impact_rank', 'openness_rank', 'excellence_rank']

# Converting extracted information to pandas dataframe
df_1 = pd.DataFrame(top, columns=col)

## First extraction result in a data frame

In [4]:
# Preview of extracted information
df_1

Unnamed: 0,ranking,world_rank,universities,website,presence_rank,impact_rank,openness_rank,excellence_rank
0,1,1322,University of Ibadan,https://www.ui.edu.ng/,2113,2088,1057,1561
1,2,1742,Covenant University Ota,http://covenantuniversity.edu.ng/,1169,3884,1356,1797
2,3,1805,University of Nigeria,http://www.unn.edu.ng/,1311,3279,1038,2243
3,4,1984,University of Lagos,https://unilag.edu.ng/,161,4143,1521,2312
4,5,2053,Obafemi Awolowo University,http://oauife.edu.ng/,2916,4560,1616,2025
...,...,...,...,...,...,...,...,...
95,96,14054,Yobe State University (Bukar Abba Ibrahim Univ...,https://www.ysu.edu.ng/,17040,15459,6490,6084
96,97,14091,Taraba State University Jalingo,http://www.tsuniversity.edu.ng/,18418,14883,6690,6084
97,98,14110,Ondo State University of Science & Technology ...,http://www.osustech.edu.ng/,23528,12868,7168,6084
98,99,14347,Anchor University Lagos,https://aul.edu.ng/,9304,17841,5779,6084


## List of the school wikipedia urls

In [17]:
# List of wikipedia urls for each school
top_100_urls = [
    'https://en.wikipedia.org/wiki/University_of_Ibadan',
    'https://en.wikipedia.org/wiki/Covenant_University',
    'https://en.wikipedia.org/wiki/University_of_Nigeria',
    'https://en.wikipedia.org/wiki/University_of_Lagos',
    'https://en.wikipedia.org/wiki/Obafemi_Awolowo_University',
    'https://en.wikipedia.org/wiki/Ahmadu_Bello_University',
    'https://en.wikipedia.org/wiki/University_of_Ilorin',
    'https://en.wikipedia.org/wiki/Federal_University_of_Technology_Akure',
    'https://en.wikipedia.org/wiki/University_of_Port_Harcourt',
    'https://en.wikipedia.org/wiki/Adekunle_Ajasin_University',
    'https://en.wikipedia.org/wiki/University_of_Benin_(Nigeria)',
    'https://en.wikipedia.org/wiki/Federal_University_of_Technology,_Minna',
    'https://en.wikipedia.org/wiki/Ladoke_Akintola_University_of_Technology',
    'https://en.wikipedia.org/wiki/Rivers_State_University',
    'https://en.wikipedia.org/wiki/University_of_Calabar',
    'https://en.wikipedia.org/wiki/Bayero_University_Kano',
    'https://en.wikipedia.org/wiki/Lagos_State_University',
    'https://en.wikipedia.org/wiki/University_of_Jos',
    'https://en.wikipedia.org/wiki/Federal_University_of_Technology_Owerri',
    'https://en.wikipedia.org/wiki/University_of_Uyo',
    'https://en.wikipedia.org/wiki/Nnamdi_Azikiwe_University',
    'https://en.wikipedia.org/wiki/Olabisi_Onabanjo_University',
    'https://en.wikipedia.org/wiki/Federal_University_of_Agriculture,_Abeokuta',
    'https://en.wikipedia.org/wiki/University_of_Abuja',
    'https://en.wikipedia.org/wiki/University_of_Maiduguri',
    'https://en.wikipedia.org/wiki/Usmanu_Danfodiyo_University',
    'https://en.wikipedia.org/wiki/Abubakar_Tafawa_Balewa_University',
    'https://en.wikipedia.org/wiki/Ebonyi_State_University',
    'https://en.wikipedia.org/wiki/Federal_University_of_Petroleum_Resources_Effurun',
    'https://en.wikipedia.org/wiki/Benue_State_University',
    'https://en.wikipedia.org/wiki/American_University_of_Nigeria',
    'https://en.wikipedia.org/wiki/Federal_University_Oye_Ekiti',
    'https://en.wikipedia.org/wiki/University_of_Agriculture,_Makurdihttps://en.wikipedia.org/wiki/Niger_Delta_University',
    'https://en.wikipedia.org/wiki/African_University_of_Science_and_Technology',
    'https://en.wikipedia.org/wiki/Skyline_University',
    'https://en.wikipedia.org/wiki/Landmark_University',
    'https://en.wikipedia.org/wiki/Delta_State_University,_Abraka',
    'https://en.wikipedia.org/wiki/Ekiti_State_University',
    'https://en.wikipedia.org/wiki/Babcock_University',
    'https://en.wikipedia.org/wiki/Michael_Okpara_University_of_Agriculture',
    'https://en.wikipedia.org/wiki/Alex_Ekwueme_Federal_University_Ndufu_Alike_Ikwo',
    'https://en.wikipedia.org/wiki/Osun_State_University',
    'https://en.wikipedia.org/wiki/Cross_River_University_of_Technology',
    'https://en.wikipedia.org/wiki/Redeemer\'s_University_Nigeria',
    'https://en.wikipedia.org/wiki/Kwara_State_University',
    'https://en.wikipedia.org/wiki/Michael_Okpara_University_of_Agriculture',
    'https://en.wikipedia.org/wiki/Abia_State_University_Uturu',
    'https://en.wikipedia.org/wiki/Federal_University,_Dutsin-Ma',
    'https://en.wikipedia.org/wiki/Edo_University,_Iyamho',
    'https://en.wikipedia.org/wiki/Umaru_Musa_Yar\'adua_University',
    'https://en.wikipedia.org/wiki/Nigerian_Defence_Academy',
    'https://en.wikipedia.org/wiki/Imo_State_University',
    'https://en.wikipedia.org/wiki/Enugu_State_University_of_Science_and_Technology',
    '',
    'https://en.wikipedia.org/wiki/Joseph_Ayo_Babalola_University',
    'https://en.wikipedia.org/wiki/Federal_University_Dutse',
    'https://en.wikipedia.org/wiki/Akwa_Ibom_State_University',
    'https://en.wikipedia.org/wiki/Kaduna_State_University',
    'https://en.wikipedia.org/wiki/Federal_University,_Otuoke',
    'https://en.wikipedia.org/wiki/Lagos_Business_School',
    'https://en.wikipedia.org/wiki/Modibbo_Adama_Federal_University_of_Technology,_Yola',
    'https://en.wikipedia.org/wiki/Godfrey_Okoye_University',
    'https://en.wikipedia.org/wiki/Tai_Solarin_University_of_Education',
    'https://en.wikipedia.org/wiki/Chukwuemeka_Odumegwu_Ojukwu_University',
    '',
    'https://en.wikipedia.org/wiki/Igbinedion_University',
    'https://en.wikipedia.org/wiki/Auchi_Polytechnic',
    'https://en.wikipedia.org/wiki/Federal_University,_Lokoja',
    'https://en.wikipedia.org/wiki/Ibrahim_Badamasi_Babangida_University',
    'https://en.wikipedia.org/wiki/Ambrose_Alli_University',
    'https://en.wikipedia.org/wiki/Elizade_University',
    'https://en.wikipedia.org/wiki/Kogi_State_University',
    'https://en.wikipedia.org/wiki/National_Open_University_of_Nigeria',
    'https://en.wikipedia.org/wiki/Yaba_College_of_Technology',
    'https://en.wikipedia.org/wiki/Baze_University',
    'https://en.wikipedia.org/wiki/Nile_University_of_Nigeria',
    'https://en.wikipedia.org/wiki/University_of_Medical_Sciences,_Ondo',
    'https://en.wikipedia.org/wiki/Nasarawa_State_University',
    'https://en.wikipedia.org/wiki/Federal_Polytechnic,_Ilaro',
    'https://en.wikipedia.org/wiki/Pan-Atlantic_University',
    'https://en.wikipedia.org/wiki/Ajayi_Crowther_University',
    'https://en.wikipedia.org/wiki/Adeleke_University',
    'https://en.wikipedia.org/wiki/Federal_University,_Wukari',
    'https://en.wikipedia.org/wiki/Lead_City_University',
    'https://en.wikipedia.org/wiki/Federal_University_of_Lafia',
    'https://en.wikipedia.org/wiki/Benson_Idahosa_University',
    'https://en.wikipedia.org/wiki/Al-Hikmah_University',
    'https://en.wikipedia.org/wiki/Bauchi_State_University',
    'https://en.wikipedia.org/wiki/Kebbi_State_University_of_Science_and_Technology',
    'https://en.wikipedia.org/wiki/Bells_University_of_Technology',
    'https://en.wikipedia.org/wiki/Kano_State_University_of_Technology',
    'https://en.wikipedia.org/wiki/Bingham_University',
    'https://en.wikipedia.org/wiki/Lagos_State_University_of_Science_and_Technology',
    'https://en.wikipedia.org/wiki/Yobe_State_University',
    'https://en.wikipedia.org/wiki/Taraba_State_University',
    'https://en.wikipedia.org/wiki/Ondo_State_University_of_Science_and_Technology',
    'https://en.wikipedia.org/wiki/Anchor_University',
    'https://en.wikipedia.org/wiki/Federal_University,_Birnin_Kebbi'
]
checkpoint = ['https://en.wikipedia.org/wiki/Ebonyi_State_University',
             'https://en.wikipedia.org/wiki/Kaduna_State_University',
             'https://en.wikipedia.org/wiki/Bauchi_State_University',]

## Accessing the wikipedia urls and extracting their infobox data, then save it to a file

In [19]:
"""A timed function to iterate through the above urls and extract the infobox data from each website. 
This data is saved to a text file.

This code will take about 10 minutes to run.
"""
start = time.time()
with open('top_100_universities_in_ng/wikipedia_info.txt', 'w') as opened_file:
    for url in top_100_urls:
        page = wp.page(url.split('/')[-1], silent=True)
        pg = page.get()
        all_info = pg.data['infobox']
        
#       Dumping the JSON file to the text file      
        json.dump(all_info, opened_file)
        opened_file.write('\n')
#       Checkpoint to let the user know the code is active and running
        if url in checkpoint:
            print("The code is still running, please exercise patience")

total_time = "%s"%(time.time() - start)
minit, sec = divmod(float(total_time), 60)
hr, minit = divmod(minit, 60)
print(f'Total time taken is {hr} hr {minit} minutes {total_time} seconds')

The code is still running, exercise patience
The code is still running, exercise patience
The code is still running, exercise patience
Total time taken is 0.0hr 10.0minutes 602.9921143054962 seconds


## Accessing the saved infobox data for extraction

In [5]:
""" The function below iterate through each infobox data found on the wikipedia.txt file.
It extracts the required information, if the information does not exist in a particular 
infobox data, it replace it with NaN.

If the page has no infobox, it prints out the school by its rank
"""
# Empty list to save the extracted informtion
rank = 0
info = []
with open('top_100_universities_in_ng/wikipedia_info.txt', 'r') as f:
    for line in f:
        try:
            rank += 1
            each_uni = json.loads(line)
            name = each_uni.setdefault('name', np.NaN)
            motto = each_uni.setdefault('motto', np.NaN)
            estab = each_uni.setdefault('established', np.NaN)
            typ1 = each_uni.setdefault('type', np.NaN)
            typ = each_uni.setdefault('type', np.NaN)
            chanc = each_uni.setdefault('chancellor', np.NaN)
            vice_chanc = each_uni.setdefault('vice_chancellor', np.NaN)
            stu = each_uni.setdefault('students', np.NaN)
            undrgrd = each_uni.setdefault('undergrad', np.NaN)
            pstgrd = each_uni.setdefault('postgrad', np.NaN)
            acad_staff = each_uni.setdefault('academic_staff', np.NaN)
            adminstratv_staff = each_uni.setdefault('administrative_staff', np.NaN)
            city = each_uni.setdefault('city', np.NaN)
            state = each_uni.setdefault('state', np.NaN)
            camp = each_uni.setdefault('campus', np.NaN)

            info.append({
                'ranking': rank,
                'name': name,
                'motto': motto,
                'established': estab,
                'type': typ,
                'chancellor': chanc,
                'vice_chancellor': vice_chanc,
                'students': stu,
                'undergraduates': undrgrd,
                'postgraduates': pstgrd,
                'academic_staff': acad_staff,
                'administrative_staff': adminstratv_staff,
                'city': city,
                'state': state,
                'campus': camp
            })
        except AttributeError as e:
            print(f'School of rank {rank} has no infobox on wikipedia')

# Creating columns name based on extracted information
cols = ['ranking', 'name', 'motto', 'established', 'type', 'chancellor',
       'vice_chancellor', 'students', 'undergraduates', 'postgraduates',
       'academic_staff', 'administrative_staff', 'city', 'state', 'campus']

# Converting extracted information to pandas dataframe
df_2 = pd.DataFrame(info, columns = cols)


School of rank 42 has no infobox on wikipedia
School of rank 57 has no infobox on wikipedia
School of rank 61 has no infobox on wikipedia
School of rank 63 has no infobox on wikipedia
School of rank 87 has no infobox on wikipedia
School of rank 89 has no infobox on wikipedia
School of rank 96 has no infobox on wikipedia


## Second extraction result in a data frame

In [6]:
# Prviewing extracted informatiion from wikipedia
df_2

Unnamed: 0,ranking,name,motto,established,type,chancellor,vice_chancellor,students,undergraduates,postgraduates,academic_staff,administrative_staff,city,state,campus
0,1,University of Ibadan,"""''Recte Sapere Fons''"" (To think straight is ...",{{start date and age|1948}},[[public university|Public]],"[[Sa'adu Abubakar|Saad Abubakar]], [[Sultan of...",Professor [[Kayode Adebowale ]],41743,,,,,[[Ibadan]],[[Oyo State|Oyo]],
1,2,Covenant University,''Raising a New Generation of Leaders'',21 October 2002,Private,[[David Oyedepo]],[[Abiodun H. Adebayo]],,,,,,"[[Ota, Ogun State]]",,Urban
2,3,University of Nigeria,''To Restore the Dignity of Man'',1955,[[public university|Public]],,[[Charles Igwe Arizechukwu]],36000,,,,,[[Nsukka]],[[Enugu state|Enugu]],Rural<br /> {{convert|871|ha|acre}} (Nsukka ca...
3,4,University of Lagos,In deed and in truth,1962,[[Public university|Public]] [[research univer...,Alhaji (Dr.) Abubakar IBN Umar Garbai El-Kanem...,[[Oluwatoyin Ogundipe|Prof. Oluwatoyin Ogundipe]],"55,000 (2017)","43,784 (2017)","9,070 (2017)","1,736 (2017)",552 (2017),[[Lagos]],,Urban
4,5,Obafemi Awolowo University,For Learning and Culture,1961,[[public university|Public]],Etsu [[Yahaya Abubakar]],[[Adebayo Simeon Bamire]],"about 35,000",13000,7500,,,[[Ile-Ife]],[[Osun State|Osun]],Urban {{convert|2020|ha|acre}}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,93,Lagos State University of Science and Technology,Mission and Professionalism,{{start date|1977}},[[Public university|Public]],,Oluremi Nurudeen Olaleye,50000,,,Above 808,,"[[Ikorodu, Lagos State]]",,"[[Ikorodu]], [[Isolo]], [[Surulere]]"
87,94,Yobe State University,Knowledge Is Light,2006,Public,Ahmed Tijjani Ibn Saleh (Emir of Ngazargamu),Professor Mala Mohammed Daura,,10000,2400,,1300,[[Damaturu]],[[Yobe State]],
88,95,Taraba State University,,2008,,,,,,,,,[[Jalingo]],[[Taraba State]],
89,97,Anchor University,"Character, Competence, Courage",September 2014,Private,[[William Kumuyi]],[[Joseph Afolayan]],,,,,,"[[Ayobo, Lagos|Ayobo-Ipaja]]",[[Lagos State]],


In [7]:
# dropping the name column on second datafram, then merge with the first on ranking.
sub_df_2 = df_2.drop(columns ='name')
master = df_1.merge(sub_df_2, on = 'ranking', how = 'left')
master

Unnamed: 0,ranking,world_rank,universities,website,presence_rank,impact_rank,openness_rank,excellence_rank,motto,established,...,chancellor,vice_chancellor,students,undergraduates,postgraduates,academic_staff,administrative_staff,city,state,campus
0,1,1322,University of Ibadan,https://www.ui.edu.ng/,2113,2088,1057,1561,"""''Recte Sapere Fons''"" (To think straight is ...",{{start date and age|1948}},...,"[[Sa'adu Abubakar|Saad Abubakar]], [[Sultan of...",Professor [[Kayode Adebowale ]],41743,,,,,[[Ibadan]],[[Oyo State|Oyo]],
1,2,1742,Covenant University Ota,http://covenantuniversity.edu.ng/,1169,3884,1356,1797,''Raising a New Generation of Leaders'',21 October 2002,...,[[David Oyedepo]],[[Abiodun H. Adebayo]],,,,,,"[[Ota, Ogun State]]",,Urban
2,3,1805,University of Nigeria,http://www.unn.edu.ng/,1311,3279,1038,2243,''To Restore the Dignity of Man'',1955,...,,[[Charles Igwe Arizechukwu]],36000,,,,,[[Nsukka]],[[Enugu state|Enugu]],Rural<br /> {{convert|871|ha|acre}} (Nsukka ca...
3,4,1984,University of Lagos,https://unilag.edu.ng/,161,4143,1521,2312,In deed and in truth,1962,...,Alhaji (Dr.) Abubakar IBN Umar Garbai El-Kanem...,[[Oluwatoyin Ogundipe|Prof. Oluwatoyin Ogundipe]],"55,000 (2017)","43,784 (2017)","9,070 (2017)","1,736 (2017)",552 (2017),[[Lagos]],,Urban
4,5,2053,Obafemi Awolowo University,http://oauife.edu.ng/,2916,4560,1616,2025,For Learning and Culture,1961,...,Etsu [[Yahaya Abubakar]],[[Adebayo Simeon Bamire]],"about 35,000",13000,7500,,,[[Ile-Ife]],[[Osun State|Osun]],Urban {{convert|2020|ha|acre}}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,14054,Yobe State University (Bukar Abba Ibrahim Univ...,https://www.ysu.edu.ng/,17040,15459,6490,6084,,,...,,,,,,,,,,
96,97,14091,Taraba State University Jalingo,http://www.tsuniversity.edu.ng/,18418,14883,6690,6084,"Character, Competence, Courage",September 2014,...,[[William Kumuyi]],[[Joseph Afolayan]],,,,,,"[[Ayobo, Lagos|Ayobo-Ipaja]]",[[Lagos State]],
97,98,14110,Ondo State University of Science & Technology ...,http://www.osustech.edu.ng/,23528,12868,7168,6084,,{{start date and age|2013}},...,,Professor [[Auwal H Yadudu]],,,,,,"[[Birnin Kebbi]], [[Kebbi State]]",,Urban
98,99,14347,Anchor University Lagos,https://aul.edu.ng/,9304,17841,5779,6084,,,...,,,,,,,,,,


In [8]:
# Saving merged dataframe to a csv file.
master.to_csv('top_100_universities_in_nigeria.csv', index = False)

## Summary
This project aims to get the top 100 universities in Nigeria and gather more information from Wikipedia. The extraction from webometrics ranking using BeautifulSoup was successful.

While manually getting the URLs of each school from Wikipedia, some schools, such as Federal University Kashere Gombe State, do not have a Wikipedia page, while some school has a Wikipedia page but no infobox. Due to some schools not having the required information, the extracted JSON data has NaN as the default value.

Both data from Beautilfulsoup extraction and wptools extraction were merged and saved to a file `top_100_universities_in_nigeria.csv`.