# <span style="color:blue">Scraping information on MBA programmes from MBAstudies.com</span>
## About MBAstudies.com
MBAstudies.com allows students to browse thousands of graduate degrees from around the world. The website provides information on several types of programmes, such as the Master of Business Administration (MBA), Executive MBAs (EMBAs), Executive Courses, and Online Degrees.

## Scope of the project
For the Online Data Collection & Management course at Tilburg University, our team developed a scraper that collects general information on the Master of Business Administration (MBA) programmes (listed at www.mbastudies.com/MBA/) for each specified country. The list of countries can be modified according to the researcher's interest. In this project, we include the Group of Seven (G7) countries: Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States.

The following information is retrieved for each programme: All Locations, Duration, Earliest Start Date, Application Deadline, Languages, Study Type, Pace, and Tuition Fees.

## Workflow
The following workflow was applied in this project:
1. Installing and importing required packages.
2. Seed generation: 
    - generating base URLs for each country
    - generating all pages URLs for each country
    - scraping links to programmes from each page
3. Scraping programme information from each programme URL.
4. Saving data: 
    - raw data in a json format
    - dataframe in a csv file for an analysis in R

Afterwards, the data from the csv file is cleaned and prepared for the statistical analysis in R.



## 1. Importing packages
The following packages are needed to run this scraper, so make sure you install and/or load them first. The following commands in the Command Prompt (on Windows) install the essential libraries/tools. The scraper also imports `requests` and `pandas`; install them the same way if they are missing.
- pip install bs4
- pip install webdriver-manager
- pip install selenium


In [1]:
from bs4 import BeautifulSoup # for getting data out of HTML, XML, and other markup languages
import requests # for sending HTTP/1.1 requests
from requests import models
import re # to check if a string contains the specified search pattern
import pandas as pd # data analysis library
from time import sleep # to pause the execution of future commands for a given number of seconds
import json # to convert the python dictionary into a JSON string that can be written into a file
import csv # for creating a csv file
import itertools # provides various functions that work on iterators to produce complex iterators
import selenium # for controlling web browsers through programs
from selenium.webdriver.common.by import By # for finding elements on the website by xpath, class name, etc.
from selenium import webdriver # for performing browser automation
from webdriver_manager.chrome import ChromeDriverManager # for getting the webdriver
from selenium.webdriver.chrome.service import Service #used for the webdriver to work properly
import os # for identifying the current directory (used when downloading files to specific directories)
from datetime import datetime # for creating a timestamp of current time and date when downloading csv/json file

## 2. Seed generation
The seed generation step consists of several substeps. The ultimate goal is to get a list of URLs of all MBA programmes in selected countries.
Firstly, we define the 'base url' which is the part of the url that is the same for different countries. 

In [2]:
base_url = 'https://www.mbastudies.com/MBA/'

Secondly, we define a list of countries that we want to scrape data from. It is important that they are written in exactly the same way as they are presented in the url. The list of countries can be adjusted manually.

In [3]:
countries_list = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United-Kingdom', 'USA']

Thirdly, a function is defined where the input is the base url and the countries list. The function consists of a for loop which connects the base URL with each country name from the list of countries. The result of the function is a list with URLs for all countries.

In [4]:
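def generate_country_urls(base_url, countries):
    '''
    Function generating links for each country from the list.
    base_url: part of the URL shared by all countries
    countries: list of country names as written in the URL
    '''
    page_urls = []
    for country in countries: # looping over the argument instead of the global list
        country_url = base_url + country
        page_urls.append(country_url)
    return page_urls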

The list of URLs for all countries is stored in the "country_urls" variable.

In [5]:
country_urls = generate_country_urls(base_url, countries_list)

Next, we need the total number of pages for each country URL so that the programme URLs can be scraped from every page.

To this end, we create a function that returns the total number of pages for a country URL. Each country page displays the page count in an element with the class "pagination mx-auto", in the form "Page 1 of [max number of pages]", so the total can be read from that text. When a country has only one page, this text is not displayed. Therefore, if the element is not found on the website, we assume that the country has a single page.

In [6]:
def total_pages(country_url):
    '''
    Function defining the number of pages per country.
    country_url: base URL for a country 
    '''
    result = requests.get(country_url)
    sleep(2)
    src = result.text
    soup = BeautifulSoup(src, "html.parser")
    
    # searching for the "page number" element on the website:
    pagination = soup.find(attrs={"class":"pagination mx-auto"})
    # guarding against a missing element, which would otherwise raise an AttributeError
    check_for_pages = pagination.find("span") if pagination else None
    
    if check_for_pages:
        text = check_for_pages.next_element # getting the text from that element, if found on the website
    else:
        text = "Page 1 of 1" # if the element is not found, there's only one page in that country URL
    
    clean = text.replace("Page 1 of ","") # removing unnecessary text
    total_pages = int(clean) # transforming variable to an integer
    return total_pages
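
The page count can also be extracted from the pagination text with a regular expression, which does not depend on the text starting with exactly "Page 1 of ". The sketch below is only an illustration and not part of the scraper itself: the helper name `parse_total_pages` is hypothetical, and it assumes the text follows the "Page X of Y" pattern described above.

```python
import re

def parse_total_pages(pagination_text):
    """Return the total number of pages from text such as 'Page 1 of 12'.

    Falls back to 1 when the text is missing or does not match,
    mirroring the single-page case described above.
    """
    if not pagination_text:
        return 1
    match = re.search(r"Page\s+\d+\s+of\s+(\d+)", pagination_text)
    return int(match.group(1)) if match else 1

print(parse_total_pages("Page 1 of 12"))  # 12
print(parse_total_pages(None))            # 1
```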

Furthermore, a function is defined where the input is the country URL for which we want to generate page URLs and the total number of pages in that country. The function consists of a for loop which connects the country URL with the page number until the last page is reached. As a result, we get a list of all page URLs for that country.

In [7]:
def generate_page_urls(country_url, num_pages):
    '''
    Function generating URLs for all pages in the country URL.
    country_url: base URL for a country
    num_pages: the total number of pages in that country URL
    '''
    all_country_urls = []  
    for counter in range(1, num_pages+1):
        full_url = country_url + "/?page=" + str(counter)
        all_country_urls.append(full_url)
    return all_country_urls

The page URLs for each country are generated with a for loop and stored in a single list:

In [8]:
one_list = []

for country_url in country_urls: 
    all_links = generate_page_urls(country_url, total_pages(country_url))
    one_list.append(all_links)

# flattening the per-country lists once, so that all page URLs are stored in a single list
one_list2 = list(itertools.chain.from_iterable(one_list))
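
To see what the resulting seed list looks like without sending any requests, the sketch below rebuilds the page URLs from made-up page counts. The values in `pages_per_country` are invented for illustration only; in the scraper they come from `total_pages`.

```python
import itertools

base_url = "https://www.mbastudies.com/MBA/"
# hypothetical page counts, for illustration only
pages_per_country = {"Canada": 2, "Japan": 1}

# one list of page URLs per country
nested = [
    [f"{base_url}{country}/?page={page}" for page in range(1, n_pages + 1)]
    for country, n_pages in pages_per_country.items()
]
# flattening the nested lists into a single list of page URLs
flat = list(itertools.chain.from_iterable(nested))

for url in flat:
    print(url)
# https://www.mbastudies.com/MBA/Canada/?page=1
# https://www.mbastudies.com/MBA/Canada/?page=2
# https://www.mbastudies.com/MBA/Japan/?page=1
```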

Finally, from all pages in each country, the URLs to study programmes are scraped. These links will be used to scrape the information about these MBA programmes. The result of the function is stored in a list "programs".

In [9]:
def scrape_final_urls(list_links):
    '''
    Function scraping the URLs to MBA programmes.
    list_links: list of page URLs from which the programme URLs should be scraped
    '''
    programs = []
    for url in list_links: 
        res = requests.get(url)
        sleep(2)
        soup2 = BeautifulSoup(res.text, "html.parser")

        # each programme title holds the link to the programme page
        for title in soup2.find_all(class_="program_title"):
            prog = title.find("a").attrs["href"]
            programs.append(prog)
  
    return programs

The list of all programme URLs is stored in the "program_urls" variable. Some of the listed programmes do not have a link, and therefore, empty values are removed from the list.

In [10]:
program_urls = scrape_final_urls(one_list2)
program_urls = [url for url in program_urls if url != ""] # removing empty links

## 3. Scraping programme information
After we have collected all programme URLs for each country and each page, we can proceed with retrieving the programme information from the website. A function which scrapes general information from the programme URLs is created. The elements on the website are found by XPath.

In [11]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

def get_and_parse(urls, sys_null=None):
    """
    Function scraping general information on MBA programmes.
    urls: list of programme URLs that should be scraped
    sys_null: unused, kept so that existing calls keep working
    """
    program_information = []
    for url in urls:
        driver.get(url)
        sleep(2)
        # find_elements(By.XPATH, ...) replaces the deprecated find_elements_by_xpath
        cells = driver.find_elements(By.XPATH, "//div[@class='cell auto']")
        cell_texts = [cell.text for cell in cells]

        # each cell holds the field name on its first line and the value on its second
        fields = {}
        for text in cell_texts:
            lines = text.split("\n")
            if len(lines) > 1: # skipping cells without a name/value pair
                fields[lines[0]] = lines[1]
        fields["url"] = url
        program_information.append(fields)

    driver.quit() # closing the browser once all programme pages are scraped
    return program_information



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
Driver [C:\Users\ambro\.wdm\drivers\chromedriver\win32\99.0.4844.51\chromedriver.exe] found in cache
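
The parsing step inside `get_and_parse` can be tried without opening a browser. The sketch below applies the same name/value split to plain strings. The helper name `cells_to_record` and the sample texts are hypothetical; the only assumption, shared with the function above, is that each cell holds the field name on its first line and the value on its second.

```python
def cells_to_record(cell_texts, url):
    """Turn the text of each information cell into one dictionary per programme."""
    record = {}
    for text in cell_texts:
        lines = text.split("\n")
        if len(lines) > 1:  # ignore cells that lack a name/value pair
            record[lines[0]] = lines[1]
    record["url"] = url
    return record

# made-up cell texts, for illustration only
sample = ["Duration\n1 year", "Pace\nFull time", "no value here"]
print(cells_to_record(sample, "https://example.com/programme"))
# {'Duration': '1 year', 'Pace': 'Full time', 'url': 'https://example.com/programme'}
```

A list of such dictionaries is the shape that `pd.DataFrame` turns into one row per programme.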


The general information about all MBA programmes is stored in the variable "final_data". This variable is then transformed into a dataframe and stored as "final_dataframe".

In [12]:
final_data = get_and_parse(program_urls) # saving the output as variable 'final_data'
final_dataframe = pd.DataFrame(final_data) # dataframe with all data for programmes in selected countries



## 4. Saving data
After the data is gathered, it is saved as a json and a csv file. The name of each file consists of "MBAStudies" followed by the current time and date. If you want to specify the directory where the files should be saved, uncomment the first line of the code and provide the correct path on your device.

In [15]:
#os.chdir("C:/Users/ambro/Desktop") # specify, where you want the output to be saved
filename = datetime.now().strftime('MBAStudies_%H%M_%m%d%Y.csv')
final_dataframe.to_csv(filename, encoding='utf-8')

filename2 = datetime.now().strftime('MBAStudies_%H%M_%m%d%Y.json')
with open(filename2, 'w', encoding='utf-8') as file:
    json.dump(final_data, file, ensure_ascii=False, indent=5)
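
If changing the working directory is not desirable, the output folder can be passed explicitly and one timestamp shared by both files. The following is a sketch under stated assumptions, using only the standard library (`csv.DictWriter` instead of `DataFrame.to_csv`); the function name `save_records` and its parameters are hypothetical, and it allows for records that do not all share the same keys.

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_records(records, out_dir="."):
    """Write a list of dictionaries to a csv and a json file sharing one timestamp."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%H%M_%m%d%Y")
    csv_path = out_dir / f"MBAStudies_{stamp}.csv"
    json_path = out_dir / f"MBAStudies_{stamp}.json"

    # records may have different keys, so collect every key first (order preserved)
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty strings

    with open(json_path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=4)
    return csv_path, json_path
```

Under these assumptions, a call such as `save_records(final_data, "output")` would write both files into an `output` folder next to the notebook.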

In [14]:
print("Scraping finished and data saved")

Scraping finished and data saved
